Handbook of Image Engineering
Yu-Jin Zhang
Department of Electronic Engineering, Tsinghua University, Beijing, China
ISBN 978-981-15-5872-6    ISBN 978-981-15-5873-3 (eBook)
https://doi.org/10.1007/978-981-15-5873-3
© Springer Nature Singapore Pte Ltd. 2021

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
Image engineering is a new inter-discipline that systematically studies various image theories and methods, elaborates the principles of image operation, promotes the applications of image technology, and summarizes production experience in the image industry. As an overall framework and system for the comprehensive research and integrated application of various image technologies, image engineering has received extensive attention and made considerable progress in recent years.

This handbook takes the chapter as the unit for representing the key knowledge points of image engineering. These units combine the definitions of common concepts, related principles, practical techniques, and specific methods of image engineering. The handbook can serve readers in the field of image technology in diverse ways: it offers a general picture to newcomers looking to begin, a summary for students who are currently learning, a consultation source for engineers with practical requirements, and a comprehensive reference for senior researchers with relevant knowledge.

This handbook is organized into 5 parts with 52 chapters corresponding to the different branches and directions of image engineering. For each chapter, the related entries are grouped into sections (216 in total) and subsections (651 in total). The number of entries in each chapter, section, and subsection is marked in braces after its title in the contents; these numbers give a general idea of the scale of the chapters, sections, and subsections and thus an overall view of the book. In total, nearly 10,000 entries are collected, mainly from dozens of books and journal articles published globally in recent years. The subject index contains more than 10,000 guiding items.

This handbook provides comprehensive coverage of image engineering. For each chapter, in addition to concise definitions, extra explanations, examples, analyses, and discussions are also provided. The entries are supported by 750 figures and 28 tables in total, as well as more than a thousand formulas. Entries can be cross-referenced through the bold words in the text and are also interconnected by the specifiers "See", "Same as", and "Compare" at the end of entry texts. In addition, several appropriate references are selected for each chapter and section to indicate sources of further and more detailed information.

This handbook integrates the common concepts, related principles, practical technologies, and specific methods of image engineering into one volume, featuring:

• Comprehensive coverage, refined interpretation
• Flexible reference and easy access
• Well-organized and completely structured

Special thanks go to Springer Nature Singapore Pte Ltd. and its staff members; their kind and professional assistance is truly appreciated. Last but not least, I am deeply indebted to my wife and my daughter for their encouragement, patience, support, tolerance, and understanding during the writing of this book.

Beijing, China
Yu-Jin Zhang
General Directory
Contents
Part I: Image Fundamentals

1 Image Basics {324}
  1.1 Basic Concepts of Image {31}
    1.1.1 Image and Image Space {16}
    1.1.2 Digital Image and Computer-Generated Image {15}
  1.2 Image Decomposition {46}
    1.2.1 Image Decomposition {11}
    1.2.2 Pixel and Voxel {17}
    1.2.3 Various Elements {18}
  1.3 All Kinds of Image {74}
    1.3.1 Images with Different Wavelengths {19}
    1.3.2 Different Dimensional Images {16}
    1.3.3 Color Image {20}
    1.3.4 Images for Different Applications {19}
  1.4 Special Attribute Images {109}
    1.4.1 Images with Various Properties {16}
    1.4.2 Image with Specific Attribute {20}
    1.4.3 Depth Images {14}
    1.4.4 Image with Variant Sources {19}
    1.4.5 Processing Result Image {20}
    1.4.6 Others {20}
  1.5 Image Representation {47}
    1.5.1 Representation {9}
    1.5.2 Image Property {19}
    1.5.3 Image Resolution {19}
  1.6 Image Quality {17}

2 Image Engineering {160}
  2.1 Image Engineering Technology {40}
    2.1.1 Image Engineering {11}
    2.1.2 Image Processing {16}
    2.1.3 Image Analysis {6}
    2.1.4 Image Understanding {7}
  2.2 Similar Disciplines {64}
    2.2.1 Computer Vision {16}
    2.2.2 Machine Vision {11}
    2.2.3 Computer Graphics {20}
    2.2.4 Light Field {17}
  2.3 Related Subjects {56}
    2.3.1 Fractals {14}
    2.3.2 Topology {14}
    2.3.3 Virtual Reality {10}
    2.3.4 Others {18}

3 Image Acquisition Devices {436}
  3.1 Device Parameters {49}
    3.1.1 Camera Parameters {18}
    3.1.2 Camera Motion Description {16}
    3.1.3 Camera Operation {15}
  3.2 Sensors {72}
    3.2.1 Sensor Models {16}
    3.2.2 Sensor Characteristics {17}
    3.2.3 Image Sensors {14}
    3.2.4 Specific Sensors {12}
    3.2.5 Commonly Used Sensors {13}
  3.3 Cameras and Camcorders {88}
    3.3.1 Conventional Cameras {18}
    3.3.2 Camera Models {15}
    3.3.3 Special Structure Cameras {20}
    3.3.4 Special Purpose Cameras {21}
    3.3.5 Camera Systems {14}
  3.4 Camera Calibration {49}
    3.4.1 Calibration Basics {17}
    3.4.2 Various Calibration Techniques {18}
    3.4.3 Internal and External Camera Calibration {14}
  3.5 Lens {85}
    3.5.1 Lens Model {16}
    3.5.2 Lens Types {20}
    3.5.3 Lens Characteristics {17}
    3.5.4 Focal Length of Lens {16}
    3.5.5 Lens Aperture and Diaphragm {16}
  3.6 Lens Aberration {31}
    3.6.1 Lens Distortions {15}
    3.6.2 Chromatic Aberration {16}
  3.7 Other Equipment and Devices {62}
    3.7.1 Input Devices {17}
    3.7.2 Filters {14}
    3.7.3 Microscopes {11}
    3.7.4 RADAR {10}
    3.7.5 Other Devices {10}

4 Image Acquisition Modes {381}
  4.1 Imaging and Acquisition {157}
    4.1.1 Image Capture {20}
    4.1.2 Field of View {18}
    4.1.3 Camera Models {16}
    4.1.4 Imaging Methods {18}
    4.1.5 Spectral Imaging {13}
    4.1.6 Coordinate Systems {12}
    4.1.7 Imaging Coordinate Systems {16}
    4.1.8 Focal Length and Depth {14}
    4.1.9 Exposure {15}
    4.1.10 Holography and View {15}
  4.2 Stereo Imaging {57}
    4.2.1 General Methods {13}
    4.2.2 Binocular Stereo Imaging {12}
    4.2.3 Special Methods {17}
    4.2.4 Structured Light {15}
  4.3 Light Source and Lighting {81}
    4.3.1 Light and Lamps {16}
    4.3.2 Light Source {15}
    4.3.3 Lighting {19}
    4.3.4 Illumination {17}
    4.3.5 Illumination Field {14}
  4.4 Perspective and Projection {62}
    4.4.1 Perspective {14}
    4.4.2 Perspective Projection {17}
    4.4.3 Projective Imaging {18}
    4.4.4 Various Projections {13}
  4.5 Photography and Photogrammetry {24}
    4.5.1 Photography {13}
    4.5.2 Photogrammetry {11}

5 Image Digitization {83}
  5.1 Sampling and Quantization {44}
    5.1.1 Sampling Theorem {21}
    5.1.2 Sampling Techniques {17}
    5.1.3 Quantization {6}
  5.2 Digitization Scheme {39}
    5.2.1 Digitization {20}
    5.2.2 Digitizing Grid {19}

6 Image Display and Printing {71}
  6.1 Display {35}
    6.1.1 Image Display {16}
    6.1.2 Display Devices {19}
  6.2 Printing {36}
    6.2.1 Printing Devices {10}
    6.2.2 Printing Techniques {12}
    6.2.3 Halftoning Techniques {14}

7 Image Storage and Communication {50}
  7.1 Storage and Communication {22}
    7.1.1 Image Storage {12}
    7.1.2 Image Communication {10}
  7.2 Image File Format {28}
    7.2.1 Bitmap Images {14}
    7.2.2 Various Formats {14}

8 Related Knowledge {370}
  8.1 Basic Mathematics {169}
    8.1.1 Analytic and Differential Geometry {13}
    8.1.2 Functions {18}
    8.1.3 Matrix Decomposition {16}
    8.1.4 Set Theory {14}
    8.1.5 Least Squares {16}
    8.1.6 Regression {19}
    8.1.7 Linear Operations {15}
    8.1.8 Complex Plane and Half-Space {19}
    8.1.9 Norms and Variations {20}
    8.1.10 Miscellaneous {19}
  8.2 Statistics and Probability {118}
    8.2.1 Statistics {18}
    8.2.2 Probability {17}
    8.2.3 Probability Density {19}
    8.2.4 Probability Distributions {18}
    8.2.5 Distribution Functions {14}
    8.2.6 Gaussian Distribution {17}
    8.2.7 More Distributions {15}
  8.3 Signal Processing {50}
    8.3.1 Basic Concepts {16}
    8.3.2 Signal Responses {18}
    8.3.3 Convolution and Frequency {16}
  8.4 Tools and Means {33}
    8.4.1 Hardware {10}
    8.4.2 Software {11}
    8.4.3 Diverse Terms {12}

Part II: Image Processing

9 Pixel Spatial Relationship {175}
  9.1 Adjacency and Neighborhood {49}
    9.1.1 Spatial Relationship Between Pixels {12}
    9.1.2 Neighborhood {19}
    9.1.3 Adjacency {18}
  9.2 Connectivity and Connected {42}
    9.2.1 Pixel Connectivity {13}
    9.2.2 Pixel-Connected {20}
    9.2.3 Path {9}
  9.3 Connected Components and Regions {29}
    9.3.1 Image Connectedness {18}
    9.3.2 Connected Region in Image {11}
  9.4 Distance {55}
    9.4.1 Discrete Distance {20}
    9.4.2 Distance Metric {11}
    9.4.3 Geodesic Distance {13}
    9.4.4 Distance Transform {11}

10 Image Transforms {231}
  10.1 Transformation and Characteristics {30}
    10.1.1 Transform and Transformation {18}
    10.1.2 Transform Properties {12}
  10.2 Walsh-Hadamard Transform {26}
    10.2.1 Walsh Transform {17}
    10.2.2 Hadamard Transform {9}
  10.3 Fourier Transform {66}
    10.3.1 Variety of Fourier Transform {15}
    10.3.2 Frequency Domain {16}
    10.3.3 Theorem and Property of Fourier Transform {18}
    10.3.4 Fourier Space {17}
  10.4 Discrete Cosine Transform {8}
  10.5 Wavelet Transform {43}
    10.5.1 Wavelet Transform and Property {13}
    10.5.2 Expansion and Decomposition {11}
    10.5.3 Various Wavelets {19}
  10.6 Karhunen-Loève Transform {40}
    10.6.1 Hotelling Transform {20}
    10.6.2 Principal Component Analysis {20}
  10.7 Other Transforms {18}

11 Point Operations for Spatial Domain Enhancement {249}
  11.1 Fundamentals of Image Enhancement {42}
    11.1.1 Image Enhancement {8}
    11.1.2 Intensity Enhancement {10}
    11.1.3 Contrast Enhancement {12}
    11.1.4 Operator {12}
  11.2 Coordinate Transformation {87}
    11.2.1 Spatial Coordinate Transformation {13}
    11.2.2 Image Transformation {8}
    11.2.3 Homogeneous Coordinates {8}
    11.2.4 Hierarchy of Transformation {13}
    11.2.5 Affine Transformation {13}
    11.2.6 Rotation Transformation {17}
    11.2.7 Scaling Transformation {7}
    11.2.8 Other Transformation {8}
  11.3 Inter-image Operations {34}
    11.3.1 Image Operation {6}
    11.3.2 Arithmetic Operations {18}
    11.3.3 Logic Operations {10}
  11.4 Image Gray-Level Mapping {38}
    11.4.1 Mapping {9}
    11.4.2 Contrast Manipulation {7}
    11.4.3 Logarithmic and Exponential Functions {15}
    11.4.4 Other Functions {7}
  11.5 Histogram Transformation {48}
    11.5.1 Histogram {13}
    11.5.2 Histogram Transformation {12}
    11.5.3 Histogram Modification {14}
    11.5.4 Histogram Analysis {9}

12 Mask Operations for Spatial Domain Enhancement {175}
  12.1 Spatial Domain Enhancement Filtering {37}
    12.1.1 Spatial Domain Filtering {19}
    12.1.2 Spatial Domain Filters {18}
  12.2 Mask Operation {35}
    12.2.1 Mask {20}
    12.2.2 Operator {15}
  12.3 Linear Filtering {39}
    12.3.1 Linear Smoothing {15}
    12.3.2 Averaging and Mean {14}
    12.3.3 Linear Sharpening {10}
  12.4 Nonlinear Filtering {42}
    12.4.1 Nonlinear Smoothing {17}
    12.4.2 Mid-point, Mode, and Median {15}
    12.4.3 Nonlinear Sharpening {10}
  12.5 Gaussian Filter {22}
    12.5.1 Gaussian {17}
    12.5.2 Laplacian of Gaussian {5}

13 Frequency Domain Filtering {76}
  13.1 Filter and Filtering {26}
    13.1.1 Basic of Filters {11}
    13.1.2 Various Filters {15}
  13.2 Frequency Domain Filters {50}
    13.2.1 Filtering Techniques {10}
    13.2.2 Low-Pass Filters {10}
    13.2.3 High-Pass Filters {9}
    13.2.4 Band-Pass Filters {9}
    13.2.5 Band-Reject Filters {6}
    13.2.6 Homomorphic Filters {6}

14 Image Restoration {215}
  14.1 Fundamentals of Image Restoration {56}
    14.1.1 Basic Concepts {18}
    14.1.2 Basic Techniques {13}
    14.1.3 Simulated Annealing {10}
    14.1.4 Regularization {15}
  14.2 Degradation and Distortion {46}
    14.2.1 Image Degradation {19}
    14.2.2 Image Geometric Distortion {7}
    14.2.3 Image Radiometric Distortion {20}
  14.3 Noise and Denoising {91}
    14.3.1 Noise Models {15}
    14.3.2 Noise Sources {15}
    14.3.3 Distribution {17}
    14.3.4 Impulse Noise {10}
    14.3.5 Some Typical Noises {20}
    14.3.6 Image Denoising {14}
  14.4 Filtering Restoration {22}
    14.4.1 Unconstrained and Constrained {10}
    14.4.2 Harmonic and Anisotropic {12}

15 Image Repair and Recovery {83}
  15.1 Image Inpainting {8}
  15.2 Image Completion {10}
  15.3 Smog and Haze Elimination {25}
    15.3.1 Defogging and Effect {14}
    15.3.2 Atmospheric Scattering Model {11}
  15.4 Geometric Distortion Correction {40}
    15.4.1 Geometric Transformation {17}
    15.4.2 Grayscale Interpolation {14}
    15.4.3 Linear Interpolation {9}

16 Image Reconstruction from Projection {101}
  16.1 Principle of Tomography {57}
    16.1.1 Tomography {15}
    16.1.2 Computational Tomography {25}
    16.1.3 Historical Development {17}
  16.2 Reconstruction Methods {14}
  16.3 Back-Projection Reconstruction {9}
  16.4 Reconstruction Based on Series Expansion {21}
    16.4.1 Algebraic Reconstruction Technique {11}
    16.4.2 Iterative Back-Projection {10}

17 Image Coding {213}
  17.1 Coding and Decoding {83}
    17.1.1 Coding and Decoding {17}
    17.1.2 Coder and Decoder {14}
    17.1.3 Source coding {13}
    17.1.4 Data Redundancy and Compression {19}
    17.1.5 Coding Types {20}
  17.2 Coding Theorem and Property {31}
    17.2.1 Coding Theorem {12}
    17.2.2 Coding Property {19}
  17.3 Entropy Coding {18}
    17.3.1 Entropy of Image {5}
    17.3.2 Variable-Length Coding {13}
  17.4 Predictive Coding {20}
    17.4.1 Lossless and Lossy {12}
    17.4.2 Predictor and Quantizer {8}
  17.5 Transform Coding {10}
  17.6 Bit Plane Coding {19}
  17.7 Hierarchical Coding {13}
  17.8 Other Coding Methods {19}

18 Image Watermarking {156}
  18.1 Watermarking {74}
    18.1.1 Watermarking Overview {18}
    18.1.2 Watermarking Embedding {16}
    18.1.3 Watermarking Property {20}
    18.1.4 Auxiliary Information {9}
    18.1.5 Cover and Works {11}
  18.2 Watermarking Techniques {38}
    18.2.1 Technique Classification {13}
    18.2.2 Various Watermarking Techniques {20}
    18.2.3 Transform Domain Watermarking {5}
  18.3 Watermarking Security {44}
    18.3.1 Security {17}
    18.3.2 Watermarking Attacks {17}
    18.3.3 Unauthorized Attacks {10}

19 Image Information Security {45}
  19.1 Image Authentication and Forensics {13}
    19.1.1 Image Authentication {9}
    19.1.2 Image Forensics {4}
  19.2 Image Hiding {32}
    19.2.1 Information Hiding {6}
    19.2.2 Image Blending {7}
    19.2.3 Cryptography {10}
    19.2.4 Other Techniques {9}

20 Color Image Processing {253}
  20.1 Colorimetry and Chromaticity Diagram {86}
    20.1.1 Colorimetry {19}
    20.1.2 Color Chart {15}
    20.1.3 Primary and Secondary Color {10}
    20.1.4 Color Mixing {16}
    20.1.5 Chromaticity Diagram {13}
    20.1.6 Diagram Parts {13}
  20.2 Color Spaces and Models {76}
    20.2.1 Color Models {16}
    20.2.2 RGB-Based Models {18}
    20.2.3 Visual Perception Models {14}
    20.2.4 CIE Color Models {10}
    20.2.5 Other Color Models {18}
  20.3 Pseudo-color Processing {19}
    20.3.1 Pseudo-color Enhancement {8}
    20.3.2 Pseudo-Color Transform {11}
  20.4 True Color Processing {72}
    20.4.1 True Color Enhancement {15}
    20.4.2 Saturation and Hue Enhancement {18}
    20.4.3 False Color Enhancement {6}
    20.4.4 Color Image Processing {14}
    20.4.5 Color Ordering and Edges {10}
    20.4.6 Color Image Histogram {9}

21 Video Image Processing {191}
  21.1 Video {70}
    21.1.1 Analog and Digital Video {16}
    21.1.2 Various Video {15}
    21.1.3 Video Frame {15}
    21.1.4 Video Scan and Display {10}
    21.1.5 Video Display {14}
  21.2 Video Terminology {35}
    21.2.1 Video Terms {16}
    21.2.2 Video Processing and Techniques {19}
  21.3 Video Enhancement {31}
    21.3.1 Video Enhancement {12}
    21.3.2 Motion-Based Filtering {11}
    21.3.3 Block Matching {8}
  21.4 Video Coding {40}
    21.4.1 Video Codec {16}
    21.4.2 Intra-frame Coding {7}
    21.4.3 Inter-frame Coding {17}
  21.5 Video Computation {15}
    21.5.1 Image Sequence {6}
    21.5.2 Video Analysis {9}

22 Multi-resolution Image {75}
  22.1 Multi-resolution and Super-Resolution {24}
    22.1.1 Multi-resolution {16}
    22.1.2 Super-Resolution {8}
  22.2 Multi-scale Images {26}
    22.2.1 Multi-scales {13}
    22.2.2 Multi-scale Space {7}
    22.2.3 Multi-scale Transform {6}
  22.3 Image Pyramid {25}
    22.3.1 Pyramid Structure {18}
    22.3.2 Gaussian and Laplacian Pyramids {7}

Part III: Image Analysis

23 Segmentation Introduction {195}
  23.1 Segmentation Overview {61}
    23.1.1 Segmentation Definition {16}
    23.1.2 Object and Background {12}
    23.1.3 Method Classification {14}
    23.1.4 Various Strategies {19}
  23.2 Primitive Unit Detection {60}
    23.2.1 Point Detection {20}
    23.2.2 Corner Detection {20}
    23.2.3 Line Detection {13}
    23.2.4 Curve Detection {7}
  23.3 Geometric Unit Detection {50}
    23.3.1 Bar Detection {8}
    23.3.2 Circle and Ellipse Detection {10}
    23.3.3 Object Contour {13}
    23.3.4 Hough Transform {19}
  23.4 Image Matting {24}
    23.4.1 Matting Basics {9}
    23.4.2 Matting Techniques {15}

24 Edge Detection {157}
  24.1 Principle {25}
    24.1.1 Edge Detection {17}
    24.1.2 Sub-pixel Edge {8}
  24.2 Various Edges {28}
    24.2.1 Type of Edge {13}
    24.2.2 Edge Description {15}
  24.3 Gradients and Gradient Operators {57}
    24.3.1 Gradient Computation {9}
    24.3.2 Differential Edge Detector {13}
    24.3.3 Gradient Operators {12}
    24.3.4 Particle Gradient Operators {12}
    24.3.5 Orientation Detection {11}
  24.4 High-Order Detectors {47}
    24.4.1 Second-Derivative Detectors {20}
    24.4.2 Gaussian-Laplacian Detectors {11}
    24.4.3 Other Detectors {16}

25 Object Segmentation Methods {245}
  25.1 Parallel-Boundary Techniques {40}
    25.1.1 Boundary Segmentation {15}
    25.1.2 Boundary Points {13}
    25.1.3 Boundary Thinning Techniques {12}
  25.2 Sequential-Boundary Techniques {90}
    25.2.1 Basic Techniques {7}
    25.2.2 Graph Search {19}
    25.2.3 Active Contour {13}
    25.2.4 Snake {18}
    25.2.5 General Active Contour {13}
    25.2.6 Graph Cut {20}
  25.3 Parallel-Region Techniques {58}
    25.3.1 Thresholding {17}
    25.3.2 Global Thresholding Techniques {12}
    25.3.3 Local Thresholding Techniques {13}
    25.3.4 Clustering and Mean Shift {16}
  25.4 Sequential-Region Techniques {40}
    25.4.1 Region Growing {18}
    25.4.2 Watershed {12}
    25.4.3 Level Set {10}
  25.5 More Segmentation Techniques {17}

26 Segmentation Evaluation {64}
  26.1 Evaluation Scheme and Framework {13}
  26.2 Evaluation Methods and Criteria {46}
    26.2.1 Analytical Methods and Criteria {11}
    26.2.2 Empirical Goodness Methods and Criteria {12}
    26.2.3 Empirical Discrepancy Methods and Criteria {14}
    26.2.4 Empirical Discrepancy of Pixel Numbers {9}
  26.3 Systematic Comparison and Characterization {5}

27 Object Representation {188}
  27.1 Object Representation Methods {28}
    27.1.1 Object Representation {20}
    27.1.2 Spline {8}
  27.2 Boundary-Based Representation {85}
    27.2.1 Boundary Representation {16}
    27.2.2 Boundary Signature {11}
    27.2.3 Curve Representation {13}
    27.2.4 Parametric Curve {16}
    27.2.5 Curve Fitting {14}
    27.2.6 Chain Codes {15}
  27.3 Region-Based Representation {75}
    27.3.1 Polygon {14}
    27.3.2 Surrounding Region {20}
    27.3.3 Medial Axis Transform {18}
    27.3.4 Skeleton {9}
    27.3.5 Region Decomposition {14}

28 Object Description {159}
  28.1 Object Description Methods {32}
    28.1.1 Object Description {16}
    28.1.2 Feature Description {16}
  28.2 Boundary-Based Description {32}
    28.2.1 Boundary {17}
    28.2.2 Curvature {15}
  28.3 Region-Based Description {40}
    28.3.1 Region Description {20}
    28.3.2 Moment Description {20}
  28.4 Descriptions of Object Relationship {22}
    28.4.1 Object Relationship {16}
    28.4.2 Image Topology {16}
  28.5 Attributes {16}
  28.6 Object Saliency {17}

29 Feature Measurement and Error Analysis {110}
  29.1 Feature Measurement {59}
    29.1.1 Metric {10}
    29.1.2 Object Measurement {19}
    29.1.3 Local Invariance {15}
    29.1.4 More Invariance {15}
  29.2 Accuracy and Precision {18}
  29.3 Error Analysis {33}
    29.3.1 Measurement Error {17}
    29.3.2 Residual and Error {16}

30 Texture Analysis {174}
  30.1 Texture Overview {42}
    30.1.1 Texture {10}
    30.1.2 Texture Elements {9}
    30.1.3 Texture Analysis {14}
    30.1.4 Texture Models {9}
  30.2 Texture Feature and Description {29}
    30.2.1 Texture Features {14}
    30.2.2 Texture Description {15}
  30.3 Statistical Approach {28}
    30.3.1 Texture Statistics {15}
    30.3.2 Co-occurrence Matrix {13}
  30.4 Structural Approach {19}
    30.4.1 Structural Texture {13}
    30.4.2 Local Binary Pattern {6}
  30.5 Spectrum Approach {12}
  30.6 Texture Segmentation {12}
  30.7 Texture Composition {32}
    30.7.1 Texture Categorization {15}
    30.7.2 Texture Generation {17}

31 Shape Analysis {175}
  31.1 Shape Overview {26}
    31.1.1 Shape {6}
    31.1.2 Shape Analysis {20}
  31.2 Shape Representation and Description {43}
    31.2.1 Shape Representation {13}
    31.2.2 Shape Model {10}
    31.2.3 Shape Description {8}
    31.2.4 Shape Descriptors {12}
  31.3 Shape Classification {17}
  31.4 Shape Compactness {21}
    31.4.1 Compactness and Elongation {6}
    31.4.2 Specific Descriptors {15}
  31.5 Shape Complexity {11}
  31.6 Delaunay and Voronoï Meshes {57}
    31.6.1 Mesh Model {15}
    31.6.2 Delaunay Meshes {8}
    31.6.3 Voronoï Meshes {17}
    31.6.4 Maximal Nucleus Cluster {17}

32 Motion Analysis {229}
  32.1 Motion and Analysis {63}
    32.1.1 Motion {11}
    32.1.2 Motion Classification {12}
    32.1.3 Motion Estimation {12}
    32.1.4 Various Motion Estimations {14}
    32.1.5 Motion Analysis and Understanding {14}
  32.2 Motion Detection and Representation {27}
    32.2.1 Motion Detection {13}
    32.2.2 Motion Representation {14}
  32.3 Moving Object Detection {21}
    32.3.1 Object Detection {11}
    32.3.2 Object Trajectory {10}
  32.4 Moving Object Tracking {74}
    32.4.1 Feature Tracking {13}
    32.4.2 Object Tracking {14}
    32.4.3 Object Tracking Techniques {14}
    32.4.4 Kalman Filter {20}
    32.4.5 Particle Filtering {13}
  32.5 Motion and Optical Flows {44}
    32.5.1 Motion Field {12}
    32.5.2 Optical Flow {9}
    32.5.3 Optical Flow Field {10}
    32.5.4 Optical Flow Equation {13}

33 Image Pattern Recognition {346}
  33.1 Pattern {18}
  33.2 Pattern Recognition {55}
    33.2.1 Recognition {12}
    33.2.2 Recognition Categories {11}
    33.2.3 Image Recognition {13}
    33.2.4 Various Recognition Methods {19}
  33.3 Pattern Classification {45}
    33.3.1 Category {11}
    33.3.2 Classification {21}
    33.3.3 Test and Verification {13}
  33.4 Feature and Detection {30}
    33.4.1 Feature {15}
    33.4.2 Feature Analysis {15}
  33.5 Feature Dimension Reduction {30}
    33.5.1 Dimension Reduction {14}
    33.5.2 Manifold and Independent Component {16}
  33.6 Classifier and Perceptron {56}
    33.6.1 Classifier {16}
    33.6.2 Optimal Classifier {11}
    33.6.3 Support Vector Machine {18}
    33.6.4 Perceptron {11}
  33.7 Clustering {19}
    33.7.1 Cluster {9}
    33.7.2 Cluster Analysis {10}
  33.8 Discriminant and Decision Function {46}
    33.8.1 Discriminant Function {16}
    33.8.2 Kernel Discriminant {11}
    33.8.3 Decision Function {19}
  33.9 Syntactic Recognition {20}
    33.9.1 Grammar and Syntactic {13}
    33.9.2 Automaton {7}
  33.10 Test and Error {27}
    33.10.1 Test {7}
    33.10.2 True {7}
    33.10.3 Error {13}

34 Biometric Recognition {152}
  34.1 Human Biometrics {14}
  34.2 Subspace Techniques {13}
  34.3 Face Recognition and Analysis {58}
    34.3.1 Face Detection {14}
    34.3.2 Face Tracking {12}
    34.3.3 Face Recognition {17}
    34.3.4 Face Image Analysis {15}
  34.4 Expression Analysis {25}
    34.4.1 Facial Expression {11}
    34.4.2 Facial Expression Analysis {14}
  34.5 Human Body Recognition {17}
    34.5.1 Human Motion {12}
    34.5.2 Other Analysis {5}
  34.6 Other Biometrics {25}
    34.6.1 Fingerprint and Gesture {14}
    34.6.2 More Biometrics {11}

Part IV: Image Understanding

35 Theory of Image Understanding {57}
  35.1 Understanding Models {32}
    35.1.1 Computational Structures {17}
    35.1.2 Active, Qualitative, and Purposive Vision {15}
  35.2 Marr's Visual Computational Theory {25}
    35.2.1 Theory Framework {11}
    35.2.2 Three-Layer Representations {14}

36 3-D Representation and Description {224}
  36.1 3-D Point and Curve {42}
    36.1.1 3-D Point {12}
    36.1.2 Curve and Conic {17}
    36.1.3 3-D Curve {13}
  36.2 3-D Surface Representation {105}
    36.2.1 Surface {14}
    36.2.2 Surface Model {10}
    36.2.3 Surface Representation {18}
    36.2.4 Surface Description {14}
    36.2.5 Surface Classification {19}
    36.2.6 Curvature and Classification {10}
    36.2.7 Various Surfaces {20}
  36.3 3-D Surface Construction {26}
    36.3.1 Surface Construction {11}
    36.3.2 Construction Techniques {15}
  36.4 Volumetric Representation {51}
    36.4.1 Volumetric Models {21}
    36.4.2 Volumetric Representation Methods {17}
    36.4.3 Generalized Cylinder Representation {13}

37 Stereo Vision {164}
  37.1 Stereo Vision Overview {78}
    37.1.1 Stereo {10}
    37.1.2 Stereo Vision {20}
    37.1.3 Disparity {12}
    37.1.4 Constraint {10}
    37.1.5 Epipolar {13}
    37.1.6 Rectification {13}
  37.2 Binocular Stereo Vision {44}
    37.2.1 Binocular Vision {18}
    37.2.2 Correspondence {18}
    37.2.3 SIFT and SURF {8}
  37.3 Multiple-Ocular Stereo Vision {42}
    37.3.1 Multibaselines {12}
    37.3.2 Trinocular {11}
    37.3.3 Multiple-Nocular {11}
    37.3.4 Post-processing {8}

38 Multi-image 3-D Scene Reconstruction {94}
  38.1 Scene Recovery {35}
    38.1.1 3-D Reconstruction {12}
    38.1.2 Depth Estimation {14}
    38.1.3 Occlusion {9}
  38.2 Photometric Stereo Analysis {19}
    38.2.1 Photometric Stereo {7}
    38.2.2 Illumination Models {12}
  38.3 Shape from X {40}
    38.3.1 Various reconstructions {14}
    38.3.2 Structure from Motion {15}
    38.3.3 Shape from Optical Flow {11}

39 Single-Image 3-D Scene Reconstruction {66}
  39.1 Single-Image Reconstruction {13}
  39.2 Various Reconstruction Cues {53}
    39.2.1 Focus {8}
    39.2.2 Texture {15}
    39.2.3 Shading {11}
    39.2.4 Shadow {13}
    39.2.5 Other Cues {6}

40 Knowledge and Learning {198}
  40.1 Knowledge and Model {68}
    40.1.1 Knowledge Classification {16}
    40.1.2 Procedure Knowledge {19}
    40.1.3 Models {13}
    40.1.4 Model Functions {20}
  40.2 Knowledge Representation Schemes {41}
    40.2.1 Knowledge Representation Models {14}
    40.2.2 Knowledge Base {11}
    40.2.3 Logic System {16}
  40.3 Learning {61}
    40.3.1 Statistical Learning {17}
    40.3.2 Machine Learning {17}
    40.3.3 Zero-Shot and Ensemble Learning {12}
    40.3.4 Various Learning Methods {15}
  40.4 Inference {28}
    40.4.1 Inference Classification {15}
    40.4.2 Propagation {13}

41 General Image Matching {196}
  41.1 General Matching {45}
    41.1.1 Matching {17}
    41.1.2 Matching Function {13}
    41.1.3 Matching Techniques {15}
  41.2 Image Matching {68}
    41.2.1 Image Matching Techniques {9}
    41.2.2 Feature Matching Techniques {12}
    41.2.3 Correlation and Cross-Correlation {15}
    41.2.4 Mask Matching Techniques {13}
    41.2.5 Diverse Matching Techniques {19}
  41.3 Image Registration {48}
. . . . . . . . . . . .
1429 1429 1429 1433 1436 1441 1441 1442 1444 1447 1450 1454
38.3
xxvi
Contents
41.3.1 Registration {17} . . . . . . . . . . . . . . . . . . . . . . . . 41.3.2 Image Registration Methods {13} . . . . . . . . . . . . 41.3.3 Image Alignment {8} . . . . . . . . . . . . . . . . . . . . . 41.3.4 Image Warping {10} . . . . . . . . . . . . . . . . . . . . . . Graph Isomorphism and Line Drawing {35} . . . . . . . . . . . . 41.4.1 Graph Matching {9} . . . . . . . . . . . . . . . . . . . . . . 41.4.2 Line Drawing {18} . . . . . . . . . . . . . . . . . . . . . . . 41.4.3 Contour Labeling {8} . . . . . . . . . . . . . . . . . . . . .
. . . . . . . .
1454 1457 1459 1460 1462 1462 1464 1467
42
Scene Analysis and Interpretation {123} . . . . . . . . . . . . . . . . . . . 42.1 Scene Interpretation {56} . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1.1 Image Scene {13} . . . . . . . . . . . . . . . . . . . . . . . . 42.1.2 Scene Analysis {14} . . . . . . . . . . . . . . . . . . . . . . 42.1.3 Scene Understanding {14} . . . . . . . . . . . . . . . . . . 42.1.4 Scene Knowledge {15} . . . . . . . . . . . . . . . . . . . . 42.2 Interpretation Techniques {67} . . . . . . . . . . . . . . . . . . . . . . 42.2.1 Soft Computing {13} . . . . . . . . . . . . . . . . . . . . . 42.2.2 Labeling {8} . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2.3 Fuzzy Set {10} . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2.4 Fuzzy Calculation {16} . . . . . . . . . . . . . . . . . . . . 42.2.5 Classification Models {20} . . . . . . . . . . . . . . . . .
. . . . . . . . . . . .
1471 1471 1471 1474 1475 1477 1479 1480 1482 1483 1485 1488
43
Image Information Fusion {88} . . . . . . . . . . . . . . . . . . . . . . . . . . 43.1 Information Fusion {33} . . . . . . . . . . . . . . . . . . . . . . . . . . 43.1.1 Multi-sensor Fusion {19} . . . . . . . . . . . . . . . . . . 43.1.2 Mosaic Fusion Techniques {14} . . . . . . . . . . . . . 43.2 Evaluation of Fusion Result {19} . . . . . . . . . . . . . . . . . . . . 43.3 Layered Fusion Techniques {36} . . . . . . . . . . . . . . . . . . . . 43.3.1 Three Layers {8} . . . . . . . . . . . . . . . . . . . . . . . . 43.3.2 Method for Pixel Layer Fusion {10} . . . . . . . . . . . 43.3.3 Method for Feature Layer Fusion {8} . . . . . . . . . . 43.3.4 Method for Decision Layer Fusion {10} . . . . . . . .
. . . . . . . . . .
1493 1493 1493 1496 1499 1504 1505 1507 1509 1509
44
Content-Based Retrieval {194} . . . . . . . . . . . . . . . . . . . . . . . . . . . 44.1 Visual Information Retrieval {66} . . . . . . . . . . . . . . . . . . . 44.1.1 Information Content Retrieval {11} . . . . . . . . . . . 44.1.2 Image Retrieval {14} . . . . . . . . . . . . . . . . . . . . . 44.1.3 Image Querying {12} . . . . . . . . . . . . . . . . . . . . . 44.1.4 Database Indexing {14} . . . . . . . . . . . . . . . . . . . . 44.1.5 Image Indexing {15} . . . . . . . . . . . . . . . . . . . . . . 44.2 Feature-Based Retrieval {28} . . . . . . . . . . . . . . . . . . . . . . . 44.2.1 Features and Retrieval {17} . . . . . . . . . . . . . . . . . 44.2.2 Color-Based Retrieval {11} . . . . . . . . . . . . . . . . . 44.3 Video Organization and Retrieval {51} . . . . . . . . . . . . . . . . 44.3.1 Video Organization {15} . . . . . . . . . . . . . . . . . . . 44.3.2 Abrupt and Gradual Changes {17} . . . . . . . . . . . . 44.3.3 Video Structuring {9} . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . .
1513 1513 1513 1515 1518 1520 1522 1525 1526 1529 1532 1532 1535 1538
41.4
Contents
44.3.4 News Program Organization {10} . . . . . . . . . . . . Semantic Retrieval {49} . . . . . . . . . . . . . . . . . . . . . . . . . . . 44.4.1 Semantic-Based Retrieval {10} . . . . . . . . . . . . . . 44.4.2 Multilayer Image Description {12} . . . . . . . . . . . . 44.4.3 Higher Level Semantics {12} . . . . . . . . . . . . . . . . 44.4.4 Video Understanding {15} . . . . . . . . . . . . . . . . . .
. . . . . .
1539 1541 1541 1543 1545 1546
Spatial-Temporal Behavior Understanding {177} . . . . . . . . . . . . . 45.1 Spatial-Temporal Techniques {32} . . . . . . . . . . . . . . . . . . . 45.1.1 Techniques and Layers {13} . . . . . . . . . . . . . . . . 45.1.2 Spatio-Temporal Analysis {12} . . . . . . . . . . . . . . 45.1.3 Action Behavior Understanding {7} . . . . . . . . . . . 45.2 Action and Pose {45} . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45.2.1 Action Models {14} . . . . . . . . . . . . . . . . . . . . . . 45.2.2 Action Recognition {5} . . . . . . . . . . . . . . . . . . . . 45.2.3 Pose Estimation {13} . . . . . . . . . . . . . . . . . . . . . 45.2.4 Posture Analysis {13} . . . . . . . . . . . . . . . . . . . . . 45.3 Activity and Analysis {28} . . . . . . . . . . . . . . . . . . . . . . . . . 45.3.1 Activity {15} . . . . . . . . . . . . . . . . . . . . . . . . . . . 45.3.2 Activity Analysis {13} . . . . . . . . . . . . . . . . . . . . 45.4 Events {23} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45.4.1 Event Detection {13} . . . . . . . . . . . . . . . . . . . . . 45.4.2 Event Understanding {10} . . . . . . . . . . . . . . . . . . 45.5 Behavior and Understanding {49} . . . . . . . . . . . . . . . . . . . 45.5.1 Behavior {10} . . . . . . . . . . . . . . . . . . . . . . . . . . 45.5.2 Behavior Analysis {14} . . . . . . . . . . . . . . . . . . . . 45.5.3 Behavior Interpretation {15} . . . . . . . . . . . . . . . . 45.5.4 Petri Net {10} . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . .
1549 1549 1549 1552 1554 1556 1556 1558 1559 1560 1562 1562 1564 1566 1566 1567 1568 1569 1570 1572 1574
. . . . . . . . . . . . . .
1579 1579 1579 1583 1585 1590 1592 1597 1601 1601 1605 1609 1611 1611
44.4
45
Part V 46
xxvii
Related References
Related Theories and Techniques {440} . . . . . . . . . . . . . . . . . . . . 46.1 Random Field {103} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46.1.1 Random Variables {17} . . . . . . . . . . . . . . . . . . . . 46.1.2 Random Process {18} . . . . . . . . . . . . . . . . . . . . . 46.1.3 Random Fields {18} . . . . . . . . . . . . . . . . . . . . . . 46.1.4 Markov Random Field {10} . . . . . . . . . . . . . . . . 46.1.5 Markov Models {20} . . . . . . . . . . . . . . . . . . . . . 46.1.6 Markov Process {20} . . . . . . . . . . . . . . . . . . . . . 46.2 Bayesian Statistics {38} . . . . . . . . . . . . . . . . . . . . . . . . . . . 46.2.1 Bayesian Model {13} . . . . . . . . . . . . . . . . . . . . . 46.2.2 Bayesian Laws and Rules {15} . . . . . . . . . . . . . . 46.2.3 Belief Networks {10} . . . . . . . . . . . . . . . . . . . . . 46.3 Graph Theory {109} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46.3.1 Tree {19} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
xxviii
Contents
. . . . . . . . .
1614 1617 1620 1622 1625 1631 1635 1635 1636
. . . . . . . . . . . . . .
1638 1642 1642 1644 1647 1649 1651 1654 1654 1656 1660 1662 1664 1667
Optics {280} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47.1 Optics and Instruments {33} . . . . . . . . . . . . . . . . . . . . . . . . . 47.1.1 Classifications {15} . . . . . . . . . . . . . . . . . . . . . . . . 47.1.2 Instruments {18} . . . . . . . . . . . . . . . . . . . . . . . . . . 47.2 Photometry {41} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47.2.1 Intensity {11} . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47.2.2 Emission and Transmission {14} . . . . . . . . . . . . . . 47.2.3 Optical Properties of the Surface {16} . . . . . . . . . . 47.3 Ray Radiation {55} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47.3.1 Radiation {12} . . . . . . . . . . . . . . . . . . . . . . . . . . . 47.3.2 Radiometry {20} . . . . . . . . . . . . . . . . . . . . . . . . . . 47.3.3 Radiometry Standards {12} . . . . . . . . . . . . . . . . . . 47.3.4 Special Lights {11} . . . . . . . . . . . . . . . . . . . . . . . . 47.4 Spectroscopy {63} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47.4.1 Spectrum {14} . . . . . . . . . . . . . . . . . . . . . . . . . . . 47.4.2 Spectroscopy {14} . . . . . . . . . . . . . . . . . . . . . . . . 47.4.3 Spectral Analysis {16} . . . . . . . . . . . . . . . . . . . . . 47.4.4 Interaction of Light and Matter {19} . . . . . . . . . . . 47.5 Geometric Optics {58} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47.5.1 Ray {16} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1671 1671 1671 1674 1677 1677 1680 1682 1684 1684 1686 1689 1693 1694 1694 1697 1699 1701 1705 1705
46.4
46.5
46.6
47
46.3.2 Graph {20} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46.3.3 Graph Representation {17} . . . . . . . . . . . . . . . . . 46.3.4 Graph Geometric Representation {10} . . . . . . . . . 46.3.5 Directed Graph {11} . . . . . . . . . . . . . . . . . . . . . . 46.3.6 Graph Model {14} . . . . . . . . . . . . . . . . . . . . . . . 46.3.7 Graph Classification {18} . . . . . . . . . . . . . . . . . . Compressive Sensing {41} . . . . . . . . . . . . . . . . . . . . . . . . . 46.4.1 Introduction {7} . . . . . . . . . . . . . . . . . . . . . . . . . 46.4.2 Sparse Representation {16} . . . . . . . . . . . . . . . . . 46.4.3 Measurement Coding and Decoding Reconstruction {18} . . . . . . . . . . . . . . . . . . . . . . Neural Networks {64} . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46.5.1 Neural Networks {13} . . . . . . . . . . . . . . . . . . . . . 46.5.2 Special Neural Networks {16} . . . . . . . . . . . . . . . 46.5.3 Training and Fitting {12} . . . . . . . . . . . . . . . . . . 46.5.4 Network Operations {12} . . . . . . . . . . . . . . . . . . 46.5.5 Activation Functions {11} . . . . . . . . . . . . . . . . . . Various Theories and Techniques {85} . . . . . . . . . . . . . . . . 46.6.1 Optimalization {15} . . . . . . . . . . . . . . . . . . . . . . 46.6.2 Kernels {19} . . . . . . . . . . . . . . . . . . . . . . . . . . . 46.6.3 Stereology {9} . . . . . . . . . . . . . . . . . . . . . . . . . . 46.6.4 Relaxation and Expectation Maximization {14} . . 46.6.5 Context and RANSAC {16} . . . . . . . . . . . . . . . . 46.6.6 Miscellaneous {12} . . . . . . . . . . . . . . . . . . . . . . .
Contents
47.5.2 Reflection {18} . . . . . . . . . . . . . . . . . . . . . . . . . . 47.5.3 Various Reflections {11} . . . . . . . . . . . . . . . . . . . 47.5.4 Refraction {13} . . . . . . . . . . . . . . . . . . . . . . . . . Wave Optics {30} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47.6.1 Light Wave {12} . . . . . . . . . . . . . . . . . . . . . . . . 47.6.2 Scattering and Diffraction {18} . . . . . . . . . . . . . .
. . . . . .
1709 1713 1714 1717 1717 1719
Mathematical Morphology for Binary Images {81} . . . . . . . . . . . 48.1 Image Morphology {43} . . . . . . . . . . . . . . . . . . . . . . . . . . 48.1.1 Morphology Fundamentals {14} . . . . . . . . . . . . . 48.1.2 Morphological Operations {16} . . . . . . . . . . . . . . 48.1.3 Morphological Image Processing {13} . . . . . . . . . 48.2 Binary Morphology {38} . . . . . . . . . . . . . . . . . . . . . . . . . . 48.2.1 Basic Operations {19} . . . . . . . . . . . . . . . . . . . . . 48.2.2 Combined Operations and Practical Algorithms {19} . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . .
1723 1723 1723 1726 1729 1732 1732
Mathematical Morphology for Gray-Level Images {53} . . . . . . . . 49.1 Gray-Level Morphology {43} . . . . . . . . . . . . . . . . . . . . . . . 49.1.1 Ordering Relations {13} . . . . . . . . . . . . . . . . . . . 49.1.2 Basic Operations {14} . . . . . . . . . . . . . . . . . . . . . 49.1.3 Combined Operations and Practical Algorithms {16} . . . . . . . . . . . . . . . . . . . . . . . . . 49.2 Soft Morphology {10} . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . .
Visual Sensation and Perception {308} . . . . . . . . . . . . . . . . . . . . . 50.1 Human Visual System {40} . . . . . . . . . . . . . . . . . . . . . . . . 50.1.1 Human Vision {11} . . . . . . . . . . . . . . . . . . . . . . 50.1.2 Organ of Vision {11} . . . . . . . . . . . . . . . . . . . . . 50.1.3 Visual Process {18} . . . . . . . . . . . . . . . . . . . . . . 50.2 Eye Structure and Function {37} . . . . . . . . . . . . . . . . . . . . 50.2.1 Eye Structure {9} . . . . . . . . . . . . . . . . . . . . . . . . 50.2.2 Retina {16} . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50.2.3 Photoreceptor {12} . . . . . . . . . . . . . . . . . . . . . . . 50.3 Visual Sensation {88} . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50.3.1 Sensation {18} . . . . . . . . . . . . . . . . . . . . . . . . . . 50.3.2 Brightness {14} . . . . . . . . . . . . . . . . . . . . . . . . . 50.3.3 Photopic and Scotopia Vision {11} . . . . . . . . . . . 50.3.4 Subjective Brightness {15} . . . . . . . . . . . . . . . . . 50.3.5 Vision Characteristics {20} . . . . . . . . . . . . . . . . . 50.3.6 Virtual Vision {10} . . . . . . . . . . . . . . . . . . . . . . . 50.4 Visual Perception {102} . . . . . . . . . . . . . . . . . . . . . . . . . . . 50.4.1 Perceptions {20} . . . . . . . . . . . . . . . . . . . . . . . . . 50.4.2 Perceptual Constancy {12} . . . . . . . . . . . . . . . . . 50.4.3 Theory of Color Vision {14} . . . . . . . . . . . . . . . . 50.4.4 Color Vision Effect {17} . . . . . . . . . . . . . . . . . . . 50.4.5 Color Science {20} . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . .
47.6
48
49
50
xxix
. 1737 1743 1743 1743 1745
. 1749 . 1753 1755 1755 1755 1757 1759 1761 1761 1763 1766 1768 1768 1771 1774 1775 1780 1787 1789 1789 1794 1796 1798 1801
xxx
Contents
50.4.6 Visual Attention {19} . . . . . . . . . . . . . . . . . . . . . Visual Psychology {41} . . . . . . . . . . . . . . . . . . . . . . . . . . . 50.5.1 Laws of Visual Psychology {17} . . . . . . . . . . . . . 50.5.2 Illusion {13} . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50.5.3 Illusion of Geometric Figure and Reason Theory {11} . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . .
51
Application of Image Technology {118} . . . . . . . . . . . . . . . . . . . . 51.1 Television {22} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.1.1 Digital Television {13} . . . . . . . . . . . . . . . . . . . . 51.1.2 Color Television {9} . . . . . . . . . . . . . . . . . . . . . . 51.2 Visual Surveillance {40} . . . . . . . . . . . . . . . . . . . . . . . . . . 51.2.1 Surveillance {8} . . . . . . . . . . . . . . . . . . . . . . . . . 51.2.2 Visual Inspection {9} . . . . . . . . . . . . . . . . . . . . . 51.2.3 Visual Navigation {14} . . . . . . . . . . . . . . . . . . . . 51.2.4 Traffic {9} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.3 Other Applications {56} . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.3.1 Document and OCR {20} . . . . . . . . . . . . . . . . . . 51.3.2 Medical Images {13} . . . . . . . . . . . . . . . . . . . . . 51.3.3 Remote Sensing {12} . . . . . . . . . . . . . . . . . . . . . 51.3.4 Various Applications {11} . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . .
1819 1819 1819 1821 1823 1823 1825 1826 1828 1829 1829 1831 1832 1834
52
International Organizations and Standards {172} . . . . . . . . . . . . 52.1 Organizations {22} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1.1 International Organizations {15} . . . . . . . . . . . . . 52.1.2 National Organizations {7} . . . . . . . . . . . . . . . . . 52.2 Image and Video Coding Standards {62} . . . . . . . . . . . . . . 52.2.1 Binary Image Coding Standards {6} . . . . . . . . . . . 52.2.2 Grayscale Image Coding Standards {12} . . . . . . . 52.2.3 Video Coding Standards: MPEG {17} . . . . . . . . . 52.2.4 Video Coding Standards: H.26x {19} . . . . . . . . . . 52.2.5 Other Standards {8} . . . . . . . . . . . . . . . . . . . . . . 52.3 Public Systems and Databases {40} . . . . . . . . . . . . . . . . . . 52.3.1 Public Systems {16} . . . . . . . . . . . . . . . . . . . . . . 52.3.2 Public Databases {12} . . . . . . . . . . . . . . . . . . . . . 52.3.3 Face Databases {12} . . . . . . . . . . . . . . . . . . . . . . 52.4 Other Standards {48} . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.4.1 International System of Units {9} . . . . . . . . . . . . . 52.4.2 CIE Standards {13} . . . . . . . . . . . . . . . . . . . . . . . 52.4.3 MPEG Standards {12} . . . . . . . . . . . . . . . . . . . . 52.4.4 Various Standards {14} . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
1837 1837 1837 1839 1840 1841 1842 1843 1846 1848 1849 1849 1851 1853 1856 1856 1857 1860 1862
50.5
1804 1807 1807 1811
. 1814
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1865 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1869
About the Author
Yu-Jin Zhang received his Ph.D. degree in Applied Science from the State University of Liège, Liège, Belgium, in 1989. From 1989 to 1993, he was a Postdoctoral Fellow and Research Fellow at the Delft University of Technology, Delft, the Netherlands. In 1993, Dr. Zhang joined the Department of Electronic Engineering at Tsinghua University, Beijing, China, where he has been Professor (since 1997) and later Tenured Professor (since 2014) of Image Engineering. He is active in the education and research of image engineering, with more than 550 research papers and more than 50 books published, including the series of textbooks: Image Engineering (I): Image Processing, Image Engineering (II): Image Analysis, Image Engineering (III): Image Understanding, and Image Engineering (bound volume) (1st, 2nd, 3rd, and 4th editions); three monographs: Image Segmentation, Content-based Visual Information Retrieval, and Subspace-based Face Recognition; one dictionary: An English-Chinese Dictionary of Image Engineering (1st, 2nd, and 3rd editions); as well as three edited books: Advances in Image and Video Segmentation, Semantic-Based Visual Information Retrieval, and Advances in Face Image Analysis: Techniques and Technologies.
He is the Vice President of the China Society of Image and Graphics (2002–2011, 2016 to date) and one of the first Fellow members of the China Society of Image and Graphics (since 2019). He is/was the Deputy Editor-in-Chief of the Journal of Image and Graphics and the Associate Editor of Acta Automatica Sinica, International Journal of Image and Graphics, Journal of CAD and Computer Graphics, Journal of Electronics and Information, Optical Engineering, Pattern Recognition Letters, Signal Processing, etc. He was the Program Chair of the First, Second, Fourth, Fifth, Sixth, Seventh, and Eighth International Conference on Image and Graphics (ICIG'2000, ICIG'2002, ICIG'2007, ICIG'2009, ICIG'2011, ICIG'2013, and ICIG'2015). He was the Program Chair of "The Twenty-Fourth International Conference on Image Processing (ICIP'2017)". He is a senior member of IEEE (since 1999) and a Fellow of SPIE (since 2011).
E-mail: [email protected]
Homepage: http://oa.ee.tsinghua.edu.cn/~zhangyujin/
Part I  Image Fundamentals
Chapter 1  Image Basics
Images can be obtained from the physical world, in various forms and manners, by using different observation and capture systems. Images can act, directly and/or indirectly, on the human eye and induce visual perception in human beings (Zhang 2009).
1.1  Basic Concepts of Image
An image can be represented by a 2-D array f(x, y), where x and y represent the position of a coordinate point in 2-D space XY and f represents, at a point (x, y) in an image, the value of the property F (Bow 2002; Bracewell 1995; Castleman 1996; Gonzalez and Wintz 1987; Jähne 2004; Pratt 2007; Russ and Neal 2016; Shih 2010; Sonka et al. 2014; Umbaugh 2005; Zhang 2009).
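As a concrete illustration of the f(x, y) notation above, the following minimal Python/NumPy sketch (not taken from the handbook; the array size and values are arbitrary) stores a small grayscale image as a 2-D array and reads one property value back. Whether the first array index is treated as x or y is only a convention.

    # A minimal sketch of the 2-D array view of an image f(x, y): one array
    # index per coordinate, and the stored value is the property F (here a
    # gray level in [0, 255]).
    import numpy as np

    f = np.zeros((4, 6), dtype=np.uint8)   # a tiny 4x6 grayscale "image"
    f[1, 2] = 200                          # set the property value at one point
    print(f[1, 2])                         # read back f at that point -> 200
    print(f.shape)                         # (number of rows, number of columns)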
1.1.1  Image and Image Space
Image
An entity that can be observed by the human eye, directly and/or indirectly, and that thereby produces visual perception when observed. It can be obtained by capturing the physical world in different forms and by different means with various imaging systems (a general image is the projection of an objective scene). The human visual system is a typical observation system, and the image obtained through it is just the image formed by the real scene in the mind. The image is a representation of a scene and contains a description of the scene. It is a function that describes a certain physical property (such as brightness) with a spatial layout (see registration). Scientific research and statistics show that about 75% of the information that humans receive comes from the outside world via the visual system, in the format of images. The concept of image here is quite wide, including photos, drawings, animations, videos, and even documents. There is an old saying in China that "Seeing is believing." People often say that "A picture is worth a thousand words," which means that the information content in an image is very rich. To some extent, the image is the most important information source for human beings.
An image can be represented by a 2-D array f(x, y), where x and y denote the position of a coordinate point in the 2-D space XY and f denotes the value of some property of the image at the point (x, y). For example, the commonly used image is usually a grayscale image, where f denotes the grayscale value and the set of values is F, which often corresponds to the observed brightness of the objective scene. The image can also have many properties at the point (x, y), which can be represented by a vector F. Please refer to image representation functions for more general image representation methods.

Picture
A picture of an object is an image formed by a lens or mirror. It also refers to a flat image formed on a photosensitive material, which is substantially the same as the object reproduced by exposure, development, fixing, etc. The image has the distinction of "real" and "virtual": the real image is gathered on the shadow screen (frosted glass) or the photosensitive sheet. Although the virtual image can be seen by the human eye, it cannot be directly displayed on the shadow screen or the photosensitive film. Generally, the images seen in an optical direct viewfinder or in a mirror are virtual images (the virtual image in the mirror can be converted into a real image). The real image of a photograph is an inverted image that is reversed up-down and left-right with respect to the original object on the shadow screen (after being printed, it is reversed again from the negative film to become a positive image). According to geometric optics, a beam of light emitted from an object point can converge at a point after passing through the lens group, and the surface can also be regarded as light emitted from another object point. The former is called a real image, and the latter is called a virtual image. If one considers each point on a plane perpendicular to the optical axis, imaging will give the image field and image plane.

General image representation function
Represents, in general, a function of the spatial distribution of radiant energy reflected by an image. It is a function of five variables and can be written as T(x, y, z, t, λ), where x, y, and z are spatial variables, t is a time variable, and λ is the wavelength of the radiation (corresponding to the spectral variable). Since the actual image is finite in time and space, T(x, y, z, t, λ) is a 5-D finite function.

Image conversion
1. Transform between different types of images. For example, converting a plane image into a vector image, or vice versa.
2. Convert an image file format (such as BMP) to another image file format (such as TIFF).
3. Convert an image using one color model (such as RGB) to an image using another color model (such as HSI).
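The format and color-model conversions listed under image conversion can be illustrated with a short sketch. The snippet below is only an assumed example (the file names are hypothetical); it uses the Pillow library for the file-format case and a plain weighted sum for an RGB-to-grayscale case. The 0.299/0.587/0.114 weights are one common luminance convention, not the only possible choice.

    # Hypothetical illustration of "image conversion" (file names are placeholders).
    import numpy as np
    from PIL import Image

    # (2) File-format conversion: read a BMP file and save it as TIFF.
    Image.open("example.bmp").save("example.tiff")

    # (3) Color-model conversion: RGB image (H x W x 3) to a single-channel
    #     grayscale image using one common set of luminance weights.
    rgb = np.asarray(Image.open("example.bmp").convert("RGB"), dtype=np.float64)
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    gray = gray.round().astype(np.uint8)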
Media
An entity that expresses, records, and transmits information content, such as images, videos, and so on.

Cross section
1. Cross section: a measurement surface for the absorption or scattering of radiation. It can be seen as the effective area of the medium over which radiation is completely absorbed, scattered, or emitted.
2. Moving section: a type of primitive in the expression of a generalized cylinder. The boundary of the section may be a straight line or a curve, which may be rotation and reflection symmetrical or asymmetrical, may be only rotation symmetrical or only reflection symmetrical, and its shape may or may not change when moving. The area of the cross section can also vary.

Image space
The space that represents the original image. For example, for a 2-D still grayscale image, although its positional distribution is two-dimensional, it is generally considered that the image space is a 3-D space, and the three coordinate axes are X (horizontal coordinates), Y (vertical coordinates), and F (grayscale coordinates), respectively. See space domain.

Support region
1. The domain of an image in image engineering, the most typical being the XY space.
2. The area covered by the template or window used in the local aggregation method for stereo matching.

Space domain
The original image space consisting of pixels in image engineering. Also known as the image field.

1-D space
A space made up of all positions in one direction in the objective world. A straight line without width and height (only length) is a typical case.

2-D space
A space made up of all the locations described by a pair of orthogonal basis vectors in the objective world. All positions in this space can be indicated by a 2-D coordinate system. It is a plane space consisting of only the two elements of height and width (geometrically corresponding to the X and Y axes) and extends only in the plane. The basic image is defined in 2-D space.

2-D point
A point in 2-D space, also a pixel in a 2-D image, whose position is characterized by two coordinates.

3-D space
A space composed of all the positions in the objective world. Each object in space has three dimensions: length, width, and height.

3-D point
A point in 3-D space that represents an infinitesimal volume of 3-D space.

Ambient space
Broadly refers to the 3-D space around a given mathematical target. For example, one can study a straight line in isolation or study it in 2-D space; in the latter case, the ambient space is a 2-D space. Similarly, a ball can be studied in a 3-D ambient space. This is more important if the ambient space is nonlinear or distorted (such as a magnetic field).

Visual space
A collection of digital images in a structured form. It has a prominent position in the types of space described by Poincaré. See digital visual space.
1.1.2  Digital Image and Computer-Generated Image
Digital image
An image obtained by discretizing both the coordinate space XY of the (2-D, 3-D, or higher) analog image and the property space F. For example, when a 2-D array f(x, y) is used to represent a 2-D digital image, its f, x, and y values are all integers. Digital images are mainly involved in image engineering. Also known as discrete images, abbreviated as images.

Digital
In contrast to analog; discrete.

Digital video image
Images of digital video in image engineering. Compare analog video.

Image digitization
A technique and process for sampling and quantizing an analog image function to obtain a digital image. It connects the process from the scene, that is, from the analog data of the natural world, to the digital form required by the computer algorithms responsible for image processing, analysis, and understanding. Digitization involves two processes: sampling (in time and space) and quantization (in amplitude). It can be implemented with various digitizers.

Isophote
A set of points in the image, namely all pixels that satisfy f(x, y) = constant.

Isophote curvature
The curvature at a pixel on an isophote (iso-illuminance line). At a given pixel, the isophote curvature expression is −L_vv/L_w, where L_w is the magnitude of the gradient orthogonal to the isophote and L_vv is the curvature of the luminance surface at that point along the isophote.
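In terms of image derivatives, −L_vv/L_w can be written as −(L_y²L_xx − 2L_xL_yL_xy + L_x²L_yy)/(L_x² + L_y²)^(3/2). The following sketch (an assumed illustration, not code from the handbook) estimates it with finite differences; the handling of near-zero gradients and the test pattern are arbitrary choices.

    # Assumed illustration: isophote curvature -Lvv/Lw from finite differences.
    import numpy as np

    def isophote_curvature(L, eps=1e-8):
        """Estimate -Lvv/Lw at every pixel of a 2-D float array L."""
        Ly, Lx = np.gradient(L)              # first derivatives (rows ~ y, cols ~ x)
        Lyy, _ = np.gradient(Ly)
        Lxy, Lxx = np.gradient(Lx)
        num = Ly**2 * Lxx - 2.0 * Lx * Ly * Lxy + Lx**2 * Lyy
        den = (Lx**2 + Ly**2) ** 1.5 + eps   # avoid division by zero in flat areas
        return -num / den

    # Tiny smoke test on a synthetic radial pattern (isophotes are circles).
    y, x = np.mgrid[-20:21, -20:21].astype(float)
    kappa = isophote_curvature(x**2 + y**2)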
Digital visual space
A non-empty set consisting of points in a digital image. A space formed by a non-empty set of structures. See visual space.

Computer-generated images [CGI]
A computer-generated, very realistic (nearly natural or objective) graphic.

Synthetic image
Same as computer-generated map.

Computer-generated map
Visual form generated by a computer program based on a design.

Animation
The technique and result of shooting a picture frame by frame and continuously playing the frames to form a moving sequence. It is also an artistic expression. The result of animation (animated cartoon) can also be seen as a moving image or video in a broad sense.

Performance-driven animation
An animated series obtained by interactively changing a 3-D graphical model while tracking the movement of the user. For example, the user's expression and head movement can be tracked, and a series of hand-drawn sketches can then be deformed based on the information obtained. The cartoonist first extracts the eye and mouth areas of each sketch and draws control lines on each picture. At runtime, the face tracking system determines the current location of these features. The animation system decides which input image to deform based on the nearest feature match and triangle center-of-gravity interpolation. It is also possible to calculate the overall position and orientation of the head from the tracked features. The deformed eye and mouth regions are recombined into the overall head model to produce a hand-drawn animation frame.

Rotoscoping
A technique for generating animations by means of a sequence of live events, in which features in the sequence of live events are copied frame by frame as a basis for the picture to be drawn. This is a whole process of detecting the target contour, then segmenting the target, and finally replacing the target. A typical example is the replacement of a person's clothing. For video, it can be based on key frames. It can also be used for hand-drawn animations.

Drawing
Drawing pictures, drawing operations, and their results. Drawing results can also be seen as images in a broad sense.

Chart
A simplified representation of various statistical information. It provides a means of visualizing statistical data. A chart can also be seen as an image in a broad sense.

Graphics
Visual expression form with the help of lines. It can also be expressed as a vector map. Although graphics are designed by humans and mostly drawn by computers, they can also be regarded as images in a broad sense.
1.2  Image Decomposition
Each basic unit in an image is called an image element. For 2-D images, they are called pixels; for 3-D images, they are called voxels. Image elements can be combined to form various small to large units of different scales (Castleman 1996; Gonzalez and Wintz 1987; Jähne 2004).
1.2.1  Image Decomposition
Image decomposition
A generic term that refers to the separation of the various components of an image. It is often implemented by a set of basis functions (such as the Fourier transform, discrete cosine transform, wavelet transform, etc.), a given set of color channels (such as RGB, HSV, etc.), or segmentation (including scene semantic segmentation). Intuitively, the image is split into different components or different parts. For example, a sub-image of an image block is decomposed in a transform coding system. In addition, there is an image decomposition method that treats an image as consisting of two parts: one is a structural part, corresponding to larger-scale objects in the image, and the other is a texture part, corresponding to smaller-scale details in the image, which usually have periodicity and oscillation.

Image block
Generally, sub-images obtained by decomposing an image according to a certain rule. They are often consistent in size or shape.

Block
The result obtained by dividing the image; it can be regarded as a set of pixels processed as a whole. Often square, but can be of any size and shape.

Image fragmentation
Same as fragmentation of image.

Image patch
Generally speaking, it refers to a connected component that contains a certain number of pixels, often referred to as a slice. Image slices can be obtained by decomposing images according to certain rules, but generally they are not consistent in size or shape with each other (they are often referred to as image blocks if the size or shape is more regular). Compare superpixel and parts.

Blob
Represents any collection of associated pixels in an image. It looks like an area that has a certain scale and is relatively uniform in itself but different in nature from the surrounding environment. In image engineering, it often refers to a grayscale or a color that is different from the background, but there is no limitation on size and shape. The sign of the Laplacian value can generally be used to distinguish between a bright blob on a dark background and a dark blob on a bright background. Compared to a template or window, a blob often represents a less regular area. Compared with connected components, the blob has no strict requirements on the connectivity of pixels.

Active blob
A type of blob based on regional features; it uses an active shape model to track non-rigid blobs. First, the initial region is divided by the Delaunay triangulation, then each deformable patch is tracked frame by frame along the frame sequence, and finally the collection of the patches constitutes a blob.

Nonlocal patches
Image patches or image regions that are not neighbors. This concept is widely used for image restoration, image segmentation, etc.

Discrete object
A collection of discrete points in a digital image that have a certain association.

Discrete geometry
A branch of mathematics that describes the overall (global) geometric properties of discrete targets.

Shifting property of delta function
Any image value f(a, b) can be represented as a superposition of a series of point sources (one point source corresponds to one pixel). This can be written as

    ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f(x, y) δ(x − a, y − b) dx dy = f(a, b)

The value of the image f(x, y) at x = a, y = b is given by the right side of the above equation, corresponding to a point source in the continuous space.
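In the discrete case the integral becomes a sum over pixels, and multiplying an image by a unit impulse placed at (a, b) and summing picks out the single value f(a, b). A small assumed NumPy check of this discrete form:

    # Assumed illustration of the (discrete) sifting property of the delta function.
    import numpy as np

    rng = np.random.default_rng(0)
    f = rng.integers(0, 256, size=(5, 7))      # arbitrary small test image

    a, b = 2, 4                                # the point to be "sifted out"
    delta = np.zeros_like(f)
    delta[a, b] = 1                            # discrete unit impulse at (a, b)

    assert (f * delta).sum() == f[a, b]        # sum of f(x, y)*delta(x-a, y-b) = f(a, b)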
1.2.2  Pixel and Voxel
Pixel
The basic unit of a 2-D image. In addition to its position, represented by two coordinates, one or more numerical values are used to indicate its properties: for a monochrome image (grayscale image), a single numerical value is sufficient to indicate the brightness of the pixel (generally in the range [0, 255]); for color images, it is often necessary to use three values (representing the amounts of red (R), green (G), and blue (B), respectively); for multi-spectral or hyperspectral images, one may need to use hundreds or thousands of values. The word pixel is derived from picture element and is also abbreviated as pel. It indicates the location of each particular luminance value on the discrete rectangular grid of the digital image. A pixel is given by its coordinates (position in the image) and the values at these coordinates (reflecting the physical properties of the image at that location). The most common value is the brightness value (brightness image), but it can also be the value of another physical quantity for different types of images, such as the distance between the camera and the scene (depth image). Physically, one pixel corresponds to a photoelectric sensitive unit of a CCD or another solid-state sensor in a digital camera. The CCD pixels have the exact dimensions specified by the manufacturer, which also determine the aspect ratio of the CCD. See intensity sensor and photosensor spectral response.

Picture element
Pixel. A single independent measurement in the image. It is the smallest directly measured feature of an image.

Pixel coordinates
The position value of the pixel in the image. For the pixel at the point (x, y) in the image f(x, y), the coordinates are (x, y). In general, the position of the row and column of the pixel in the image is indicated.

Pixel value
The property value of the pixel in the image. For grayscale images, it is the grayscale value of the pixel. For a color image, it is the three component intensity values of each pixel (corresponding to three channels and combined into one vector). For the image f(x, y), f represents the property value of the pixel at the point (x, y).

Pixel intensity
The pixel intensity value or the amount of light emitted by the pixel. In a grayscale image, a white pixel emits a maximum amount of light, and a black pixel emits a zero amount of light. In a color image, the intensity of a color channel pixel is its color brightness.

Pixel intensity distributions
The number and proportion of pixels of different intensities in the image. A histogram is a typical expression.

Extremal pixel
For a region R, a pixel having one of the following characteristics among the pixels belonging to R:
1. For (x, y) ∈ R, the horizontal coordinate value x is an extreme value, and its vertical coordinate value is the extreme value of all column positions.
2. For (x, y) ∈ R, the vertical coordinate value y is an extreme value, and its horizontal coordinate value is the extreme value of all row positions.
A region can have up to eight different extremal pixels, each falling on an external box of the region. Extremal pixels can be used to represent the extent of a region's spread and can also be used to infer the axial length and orientation of a region (a small sketch follows the entries below).

Pixel-based representation
An expression of data in a 2-D array equivalent to an image. See pixel.

Sub-pixel
A subdivided unit (element) obtained by continuing to decompose the basic unit (i.e., pixel) of a 2-D image. Also commonly used to represent units smaller than the pixel scale.

Pixel subsampling
For a given image, a process of generating a smaller image by selecting only one pixel from every N pixels. Considering the aliasing effect, subsampling in practice is not as simple as the above, and the effect of the simple scheme is also poor. What is really widely used is scale space filtering.

Pixel-change history
A series of pixel values over a period of time in the past.

Superpixel
1. A segmented set of similar pixels in the image. Generally, a connected set of a certain number of pixels, corresponding to a local sub-area having a certain attribute consistency in the image. Compare parts and image patch.
2. One pixel in a high-resolution image. Using the weighted sum of these pixels to form a low-resolution image is an anti-aliasing computer graphics technique.

Grain
The smallest unit that can be observed in an image, also the smallest detail that is perceived. If the image is continuously enlarged, it is possible to convert the original refined image into a mosaic particle image (i.e., granulated) to a certain extent.

Volume element
Same as voxel.

Voxel
The basic unit of a 3-D image. The spelling of voxel stems from volume element, which is named after the pixel. In general, a voxel is a rectangular or cubic entity parallel to the coordinate axes. It is the component of a voxmap that expresses a 3-D volume. For each voxel (similarly to each pixel), attributes such as color, position occupancy, or measured density can be associated.

Voxmap
A volumetric representation on a regular grid (arranged into a 3-D array v(i, j, k)) that decomposes space into voxels. For a Boolean voxel map, the element (i, j, k) intersects the volume if and only if v(i, j, k) = 1. The advantage of this expression is that it can express arbitrarily complex topologies and enables fast queries. The main disadvantage is the large amount of memory required (addressable with an octree representation).

Super-voxel
1. In a 3-D image, a set of cells corresponding to superpixels in a 2-D image.
2. The result of generalizing the basic unit voxels in a 3-D image to higher dimensions. In space-time technology, a typical super-voxel is the basic unit in a hypercube composed of 3-D space and 1-D time (a total of four dimensions).
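As a small assumed illustration of extremal pixels, the sketch below takes a binary mask of a region R and reports pixels with extreme row and column coordinates, from which the external (bounding) box follows directly; the mask used here is arbitrary.

    # Assumed sketch: extremal pixels and the external box of a binary region R.
    import numpy as np

    R = np.zeros((8, 8), dtype=bool)
    R[2:6, 3:7] = True                     # an arbitrary rectangular region
    ys, xs = np.nonzero(R)                 # coordinates of all pixels in R

    # Extremal coordinates of the region.
    top, bottom = ys.min(), ys.max()
    left, right = xs.min(), xs.max()

    # One extremal pixel per side (e.g., leftmost pixel of the top row).
    topmost = (top, xs[ys == top].min())
    bottommost = (bottom, xs[ys == bottom].max())
    leftmost = (ys[xs == left].min(), left)
    rightmost = (ys[xs == right].max(), right)

    external_box = (top, left, bottom, right)   # the box all extremal pixels lie on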
1.2.3  Various Elements
Image interleaving
Describes the way that pixels are organized in an image. This includes pixel interleaving (sorting image data according to pixel position) and band interleaving (sorting image data first according to the band and then, within a band, according to pixel position).

Range element [rangel]
A unit of depth data generated by a sensor that can obtain depth information. In image engineering, a two-tuple is often used to represent depth primitives: the first part is the spatial position (row and column), and the second part is the depth value or a vector (the first component of the vector is the depth value, and the second component is the gray value).

Depth
The distance from the camera (along the optical axis) to the scene objects, that is, the distance between a point in the scene and the image center or image plane in the camera. In a depth image, the brightness value of a pixel is a measure of depth.

Bit plane
In an image whose gray values are represented by multiple bits, the binary plane corresponding to each bit. The bit planes are binary images.

Binary image
An image whose pixels have only two values (represented by the integers 0 and 1).

1-bit color
The property of a binary image: only black and white pixels, no other colors, no highlights or shaded regions. See bitmap.

Bit-plane decomposition
The process of decomposing a grayscale image into multiple bit planes. For an image whose gray value is represented by multiple bits, each bit can be regarded as representing a binary plane, so the polynomial

    a_{m−1} 2^{m−1} + a_{m−2} 2^{m−2} + ... + a_1 2^1 + a_0 2^0

can be used to represent the gray value of a pixel in an image having m-bit gray levels. According to this representation, a simple method of decomposing a grayscale image into a series of binary images is to assign the m coefficients of the above polynomial to m 1-bit bit planes (a short code sketch is given after the entries in this group).

Bit resolution
Same as bit depth.

Bit depth
The number of binary digits (i.e., the number of planes) contained in the representation of an image. Also called bit resolution. For example, a 1-bit depth can only represent black and white, an 8-bit depth can represent 256 grayscale or color combinations, and a 24-bit depth can represent true color. It corresponds to how many unique bits are available for the image palette. For example, if there are 256 colors in the palette, an 8-bit depth is needed. Full-color images typically have a bit depth of 24 bits, with 8 bits for each color.

Bitmap format
Same as BMP format.

BMP
See BMP format.

BMP format
A common image file format with the suffix bmp. It is a standard in the Windows environment; the full name is the device-independent bitmap. BMP image files, also known as bitmap files, consist of three parts: (1) the bitmap file header (also called the header), (2) the bitmap information (often called the palette), and (3) the bitmap array (i.e., the image data). A bitmap file can only hold one image.

Facet model
In image processing, a model that views the pixel values of a digital image as noisy samples of a correlated but unknown discretized gray (intensity) surface. The operations on the image are defined by this surface. Therefore, in order to perform any operation, it is necessary to estimate the relevant grayscale surface. This first requires a generic model to describe the shape of the corresponding surface of any pixel neighborhood (regardless of noise). To estimate the surface from a neighborhood around a pixel, it is necessary to estimate the free parameters of the general shape. Image operations that can be performed with the facet model include gradient edge detection, zero-crossing point detection, image segmentation, line detection, corner detection, reconstruction from shading, optical flow calculation, and so on.

Facet model extraction
A model based on facets (small and simple surfaces) is extracted from the depth data. See planar patch extraction and planar facet model.

End-member
A primitive of a remote sensing image that contains only one kind of surface-feature information. Generally, the resolution of remote sensing images is low, and one pixel contains several features, which exist in the form of mixed spectra and are displayed in one pixel. To analyze remote sensing images, each end-member in each pixel is separated for quantitative description, and the area percentage of the different end-members in one pixel is calculated.

End-member extraction
A type of dictionary learning method. Iterative constrained end-membering (ICE) is a typical end-member extraction optimization method.

Iterative constrained end-membering [ICE]
A typical end-member extraction optimization method.

Sparse iterative constrained end-membering [SPICE]
An improvement to the iterative constrained end-membering method, in which a sparsity prior on the weight matrix is introduced.
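Returning to bit-plane decomposition above, the following assumed sketch splits an 8-bit grayscale array into its eight binary bit planes and verifies that weighting them by 2^k reconstructs the original gray values.

    # Assumed sketch: bit-plane decomposition of an 8-bit grayscale image.
    import numpy as np

    rng = np.random.default_rng(1)
    f = rng.integers(0, 256, size=(4, 5), dtype=np.uint8)   # arbitrary test image

    m = 8
    planes = [((f >> k) & 1).astype(np.uint8) for k in range(m)]   # a_k for k = 0..m-1

    # Reconstruction: sum over k of a_k * 2^k gives back the original gray values.
    recon = sum((planes[k].astype(np.uint16) << k) for k in range(m))
    assert np.array_equal(recon.astype(np.uint8), f)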
1.3  All Kinds of Image
The image can be continuous (analog) or digital (discrete). Images can be obtained with the help of different electromagnetic radiation, or they can have different space-time dimensions (Koschan and Abidi 2008; MacDonald and Luo 1999; Pratt 2007; Russ and Neal 2016).
1.3.1  Images with Different Wavelengths
γ-ray image
An image obtained using gamma (γ) rays. Also written as gamma-ray image. Gamma rays have wavelengths between about 0.001 and 1 nm.

γ-ray
Special electromagnetic waves emitted by certain radioactive materials. There are many mechanisms for generating γ-rays, including nuclear energy-level transitions, the meeting and annihilation of positive and negative particles in the radiation of charged particles, and nuclear decay processes. The difference from the accompanying β-ray is that the γ-ray has a strong penetrating ability and is not deflected in electric and magnetic fields. The energy levels in the nucleus are widely spaced, so the emitted photon energy is large, generally greater than 10^3 MeV, even more than 10^4 MeV. In nuclear reactions or other particle reactions, γ-photons are also emitted, and the energy of the γ-photons is often greater in this case. The wavelength of the γ-ray is one of the characteristics of the emitting material. Before the emergence of electron acceleration technology, γ-rays had the shortest wavelength in the electromagnetic spectrum. Also written as gamma-ray.

X-ray image
An image obtained by using X-rays as the radiation source.

X-ray
Electromagnetic radiation generated by high-speed electrons striking a substance. The wavelength is about 10^−12–10^−9 m (about 4–40 nm), which is shorter than ultraviolet wavelengths and longer than the wavelength of the γ-ray. Quantum theory holds that X-rays are composed of quanta. The relationship between the quantum energy E and its frequency ν is E = hν, where h is the Planck constant. The ability of X-rays to penetrate substances is related to wavelength. Short wavelengths (5×10^−11–2×10^−10 m) have strong penetrating power and are called hard X-rays (< 5×10^−11 m are called superhard X-rays); longer wavelengths (10^−10–6×10^−10 m) have weaker penetrating power and are called soft X-rays (6×10^−10–2×10^−9 m are often called ultrasoft X-rays). X-rays are widely used in medical imaging because they can penetrate most materials. They are also useful in other fields such as photolithography.

Ultraviolet image
An image obtained using ultraviolet light.

Visible light image
An image produced or acquired by a radiant energy source using optical radiation (wavelength approximately 380 to 780 nm) that is perceptible by human vision.

Infrared image
An image obtained using infrared light. The wavelength of infrared light is about 780–1500 nm.

Moderate Resolution Imaging Spectroradiometer image
An image obtained with the Moderate Resolution Imaging Spectroradiometer (MODIS) mounted on the two satellites Terra and Aqua. The data obtained by MODIS cover 36 spectral bands from 0.4 μm to 14.4 μm with a radiometric resolution of up to 12 bits. When viewing at the nadir (sub-satellite point), the spatial resolution of 2 spectral bands is 250 m, the spatial resolution of 5 spectral bands is 500 m, and the spatial resolution of the remaining 29 spectral bands is 1000 m.

Short-wave infrared [SWIR]
Infrared light with a wavelength of approximately 0.78 μm to 1.1 μm.
Mid-wave infrared [MWIR]
Infrared light with a wavelength of approximately 1.1 μm to 2.5 μm.

Microwave image
An image obtained by microwave (electromagnetic wave) radiation. The wavelength of microwaves is between about 1 mm and 1 m.

Millimeter-wave radiometric images
A type of image obtained using submillimeter-wavelength terahertz radiation, which can penetrate clothing. Used by the full-body scanners deployed at security checkpoints in recent years.

Millimeter-wave images
Same as millimeter-wave radiometric images.

Radar image
An echo image produced by the electromagnetic radiation emitted by a radar after it reaches an object. The frequency of the electromagnetic wave used is mainly in the microwave band (wavelength between 1 mm and 1 m).

Radio band image
An image obtained or produced by using radio waves. A radio wave in the narrow sense has a wavelength of more than 1 m.

Multiband images
Same as multi-spectral images.

Channel
A means or concept used to separate, merge, and organize image data. It is generally considered that a color image contains 3 channels.

Multichannel images
A color image or remote sensing image containing multiple channels. For example, a typical color image has at least three channels (R, G, B), but there can be other channels (such as masks).

Multichannel kernel
A kernel or weight matrix corresponding to pixels. It needs to be convolved with multiple channels, which differs from a single-channel convolution operator.

Multi-spectral images
A set of images containing multiple spectral segments/bands. Generally, it refers to a collection of several monochrome images of the same scene, each of which is obtained by means of radiation of a different frequency, or each image corresponds to a spectral segment. Each such image corresponds to a band or a channel. A set of multi-spectral images can be represented as (n is the number of bands)

    C(x, y) = [f_1(x, y), f_2(x, y), ..., f_n(x, y)]^T = [f_1, f_2, ..., f_n]^T
Fig. 1.1 A color image and its three spectral images
In practice, the number of wavelengths of light can be as few as two (as in some medical scanners) or three (as in color RGB images); at most there is no limit (up to hundreds or thousands). A typical registration method uses a vector to record the different spectral measurements at each pixel in the image array. Figure 1.1 takes a color image (the left one) as an example and gives three different spectral images of this image (corresponding to red, green, and blue). Remote sensing images are generally multi-spectral images.

Hyperspectral image
An image acquired with a hyperspectral sensor. Generally, it refers to multi-spectral images with more than 100 bands, which distinguishes it from multi-spectral images having a small number of bands.
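The per-pixel vector C(x, y) above maps naturally onto an H × W × n array; the assumed sketch below stacks n single-band images and reads out the spectral vector at one pixel (the band count and values are arbitrary).

    # Assumed sketch: a multi-spectral image as an H x W x n array of band values.
    import numpy as np

    H, W, n = 4, 6, 5                           # arbitrary size and number of bands
    bands = [np.full((H, W), k, dtype=np.uint8) for k in range(n)]   # f_1 ... f_n
    C = np.stack(bands, axis=-1)                # C[i, j] is the n-vector at a pixel

    spectral_vector = C[2, 3, :]                # [f_1(2,3), ..., f_n(2,3)]
    print(spectral_vector)                      # -> [0 1 2 3 4]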
1.3.2  Different Dimensional Images
2-D image
An image with a 2-D definition domain, that is, the coordinate space is a two-dimensional space. It can be represented by the bivariate function f(x, y) and is the most basic and common image.

2.5-D image
A depth image obtained from a single-viewpoint scan. The 3-D information is represented in a single image array, where each pixel value records distance information from the viewpoint to the observed scene.

2.5-D model
A geometric representation model corresponding to the 2.5-D image. Commonly used in model-based recognition.

3-D image
An image whose domain is a 3-D coordinate space; it can be represented by a ternary function f(x, y, z). Also refers to images that contain 3-D information. For example, a 3-D image can be obtained by using a CT scan, with a 3-D domain defined by length, width, and height. The planar image obtained by the stereoscopic method contains depth information (given by the image grayscale) and is also called a depth image or distance map. A video image can also be viewed as a 3-D image, represented by f(x, y, t), where t represents time (a small sketch of this view is given after the entries below).

Anisometry
1. The case where the image (especially a 3-D image) is not proportional to the actual scene in each direction, or the resolution differs between directions.
2. The ratio of the long semi-axis to the short semi-axis of an ellipse. If the concept of the inertia equivalent ellipse is used, it can also be used to construct an arbitrary region shape descriptor, which describes the slenderness of the region and does not change with region scaling.

Stereo image
An image containing stereoscopic information obtained by stereoscopic imaging. In many cases, it refers to a pair of images; it can also refer to a plurality of associated images that provide multidirectional viewing angles of the scene.

Volumetric image
A voxel map or an array of 3-D points where each cell represents a measure of material density or other properties in 3-D space. Common examples include computed tomography (CT) data and magnetic resonance imaging (MRI) data.

Multidimensional image
Refers to high-dimensional images, such as remote sensing images. These images can vary in space, time, and wavelength, and their corresponding function values can have many channels (color images can also be viewed as "three-dimensional" images in this sense).

Sub-image
Part of an image. If the image is treated as a collection of pixels, a sub-image is a subset of the collection. A part of a large image, or each part after decomposition. Often represents a region of specific meaning separated from a large image.

Layer
An expression in the computer vision world that corresponds to a sub-picture in the computer game industry.

Sprite
From the computer game industry, where it is used to indicate the flat animated characters in a game; it is essentially part of the picture.

Empty sub-image
A 2-D or 3-D sub-image that, apart from its boundary pixels, has no pixels inside. The hole sub-image can be thought of as completely black surrounded by white areas.

Bi-level image
A special grayscale image with only two grayscale values (0 and 1) for its pixels. The main advantage is the small amount of data; it is suitable for images that contain only simple graphics, text, or line drawings. See bit plane.
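A video viewed as a 3-D image f(x, y, t) is simply a stack of frames; the assumed sketch below builds such a stack and slices out one frame and one pixel's time series (the sizes and values are arbitrary).

    # Assumed sketch: a video as a 3-D array f(x, y, t) of gray values.
    import numpy as np

    H, W, T = 4, 6, 10                        # arbitrary frame size and length
    rng = np.random.default_rng(2)
    video = rng.integers(0, 256, size=(T, H, W), dtype=np.uint8)  # t on the first axis here

    frame_3 = video[3]                        # the image f(x, y, t = 3)
    history = video[:, 2, 5]                  # pixel-change history at one position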
Binary moment The moment of a binary image. For the binary image B(i, j), there are an infinite number of moments, indexed by the integer values p and q, where the pq-th moment can be expressed as

m_pq = Σ_i Σ_j i^p j^q B(i, j)
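A minimal numpy sketch of the pq-th binary moment defined above; treating (i, j) as row and column indices is an assumption about the coordinate convention:

```python
import numpy as np

def binary_moment(B, p, q):
    """pq-th moment m_pq = sum_i sum_j i^p * j^q * B(i, j) of a binary image B."""
    i, j = np.indices(B.shape)          # row and column index grids
    return np.sum((i ** p) * (j ** q) * B)

# Tiny example: a 5x5 image with a 2x2 foreground block.
B = np.zeros((5, 5), dtype=np.int64)
B[1:3, 2:4] = 1
m00 = binary_moment(B, 0, 0)            # area of the foreground (4)
m10 = binary_moment(B, 1, 0)
m01 = binary_moment(B, 0, 1)
print(m00, m10 / m00, m01 / m00)        # area and centroid (row, column)
```

The low-order moments recover the area (m_00) and the centroid (m_10/m_00, m_01/m_00) of the binary region.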
Binary image island Dark region surrounded by white pixels in a binary image. Concentric mosaic A set of images constituting a panorama, which are captured by a camera rotating around a center and having an optical axis direction in a radial direction, that is, a set of panoramic views having parallax. The name derives from a special structure for such acquisition that associates each column of pixels with the radius of a tangent concentric circle.
1.3.3
Color Image
Color image An image expressed by the values of three properties (such as R, G, B), containing three bands and providing a color perception.
Indexed color images Images that use a 2-D array of the same size as the image, each element of which contains an index (pointer) into a fixed color palette (or color map) of limited size (typically 256 colors); the color image is obtained by indexing the palette with this array. The palette provides a list of the colors used in the image. This solves the problem of backward compatibility with older hardware that cannot display 16M colors at the same time.
Color bar When displaying an image in which pixels of different colors have different meanings, the band added at the side of the figure (generally on the right side) that indicates those meanings.
Pseudo-color image A resulting image obtained by encoding or colorizing the pixels of a selected monochrome image. It can be the result of pseudo-color enhancement. For each selected pixel, its gray value is replaced by the red, green, and blue components of a given color vector. The choice of color vector can be arbitrary; the goal is only to make different image regions look better or more distinct (a minimal code sketch is given after this entry). Compare true color image.
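To make the pseudo-color idea above concrete, here is a minimal numpy sketch, assuming a hypothetical 256-entry palette (the blue-to-red ramp is purely illustrative, not a palette from the text):

```python
import numpy as np

# Hypothetical 256-entry palette: map gray value g to an RGB color vector.
# A simple blue-to-red ramp is used purely for illustration.
palette = np.zeros((256, 3), dtype=np.uint8)
palette[:, 0] = np.arange(256)          # red grows with gray value
palette[:, 2] = 255 - np.arange(256)    # blue decreases with gray value

def pseudo_color(gray):
    """Replace each gray value by the RGB vector assigned to it in the palette."""
    return palette[gray]                 # fancy indexing: (H, W) -> (H, W, 3)

gray = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
rgb = pseudo_color(gray)
print(rgb.shape)                         # (4, 4, 3)
```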
Indexed color Color placed in a palette to help color quantization and representation of color images on a display. The graphical data formats GIF and TIFF include the associated palette and indexed color images. In general, such palettes include RGB values that are suitable for nonlinear displays and can display color images directly on the display (no need to recalibrate). Using indexed color for a true color image reduces the color information of the image and thus the quality of the color image.
8-bit color Up to 256 different colors can be displayed or output. Same as indexed color.
256 colors See 8-bit color.
Palette The range of colors that can be used for an image; by analogy, a board on which colors can be mixed.
Color palette A color table whose entry address is an index value, from which the corresponding R, G, and B intensity values (or the component values of another color model) can be found.
Image palette Same as color palette.
True color image A digitized color image whose vector components represent the spectral transmission of visible light. True color images are available from color cameras. Commercial color charge-coupled device cameras are typically quantized to 8 bits per color channel or vector component, and the three color channels in combination can represent more than 16 million colors. Compare false color image. Compare pseudo-color image.
24-bit color images A color image expressed by three 2-D arrays of the same size, each array corresponding to a primary color channel: red (R), green (G), and blue (B). Each channel uses an 8-bit value in [0, 255] to indicate the red, green, or blue value of a pixel. Combining three 8-bit values into a single 24-bit number gives 2^24 (16 777 216, often referred to as 16 million or 16M) color combinations (a small packing sketch is given after this group of entries).
24-bit color True color, which represents the vast majority of colors perceived by the human visual system.
Color tables A list of parameters for a range of colors, or an index table. The color table determines how the original colors are mapped to a smaller set of colors. Often used when converting 24-bit color to 8-bit color.
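As a small illustration of the 24-bit packing mentioned in the 24-bit color images entry, the following sketch packs and unpacks three 8-bit channel values; the function names are illustrative:

```python
def pack_rgb24(r, g, b):
    """Combine three 8-bit channel values into a single 24-bit number."""
    return (r << 16) | (g << 8) | b

def unpack_rgb24(c):
    """Recover the three 8-bit channel values from a 24-bit number."""
    return (c >> 16) & 0xFF, (c >> 8) & 0xFF, c & 0xFF

c = pack_rgb24(200, 100, 50)
print(hex(c), unpack_rgb24(c))   # 0xc86432 (200, 100, 50)
print(2 ** 24)                   # 16777216 distinct 24-bit combinations
```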
Panchromatic image A grayscale image obtained by means of all wavelengths of visible light. It can be seen as an image of the 2-D optical density function f(x, y) whose value at the plane coordinate point (x, y) is proportional to the brightness of the corresponding point in the scene. The prefix pan- here means "all" and does not specify any single color.
Panchromatic Sensitive to all visible wavelengths of light.
False color image A color image obtained by converting a spectrum other than visible light (invisible electromagnetic radiation) into the red, green, and blue components of a color vector. The resulting colors are often inconsistent with general visual experience. For example, the information content of an infrared image does not come from visible light; for expression and display, information in the infrared spectrum can be converted into the range of visible light. The most common type of false color image displays the near infrared as red, red as green, and green as blue. Compare pseudo-color image and true color image.
32-bit color image An image with one more channel than a 24-bit color image. The extra channel is called the α channel, also known as the color layer channel. It is also 8 bits and is often the first 8 bits of data in a 32-bit color image file. It provides a measure of the transparency of each pixel and is widely used in image editing effects. Alternatively, a 32-bit color image may be an image using the CMYK model.
32-bit color One more channel than 24-bit color. See 32-bit color image.
Continuous tone color Refers to a very smooth, continuous, natural transition that occurs in a series of colors as one color gradually changes to another. There should be no sudden jump from one color to another in the middle, and no obvious banding should appear.
1.3.4
Images for Different Applications
Medical image Acquired images of medical scenes and environments. Content is often complex and relevant to specific areas. Face image The result of imaging a face; or an image that primarily contains a human face. Normalized results are often used in research.
Photomap A new type of map that combines aerial photographs or satellite remote sensing images with conventional cartographic symbols. Also known as a photographic map. On the photograph or image, places and objects that are not easy to interpret directly, or that are abstract, are symbolized, such as contour lines, boundary lines, and alternate roads, and the rivers, administrative areas, transportation networks, and proprietary elevation points are marked; the other original image features are kept.
Probe image 1. An image acquired with a probing (detection) device, such as an image acquired using an endoscope. These images generally have a high degree of geometric distortion. 2. An image used to query (challenge) an image library.
Echocardiography A noninvasive technique for imaging the heart and surrounding structures. Also called cardiac ultrasound mapping. Generally used to assess ventricular size, wall thickness, wall motion, valve structure and motion, and the proximal large vessels.
High key photo A light-toned photo. Most of the picture is composed of a few light shades from gray to white, giving a clear, pure, delicate feeling, suitable for white-based subjects.
Retinal image A (sensed or perceived) image constructed on the retina of the eye.
Fovea image An image with a sampling pattern that mimics the sensor configuration in the human fovea. Sampling is very dense at the center of the image and becomes more and more sparse toward the periphery of the image. One example is shown in Fig. 1.2.
Foveation 1. The process of generating a fovea image. 2. The process of guiding the camera optical axis to a specific direction.
Fig. 1.2 Schematic diagram of fovea
Cyclopean image Also called central eye image. The central eye image is a single mental image generated by the brain by combining the images obtained from the two eyes. The central eye image also refers to a stereoscopic viewer that fuses the fields of view of the two physical eyes and combines them into a central field of view between the two eyes, as if they were seen by a single eye (the central eye). The psychological process of constructing a central eye image is the basis of stereo vision. Using this mental process of the brain, stereoscopic perception can be automatically obtained by constructing a central eye image from a random pattern (random dot stereogram).
Integral image A matrix representation method that maintains global image information. A data structure and associated construction algorithm for efficiently generating the sum of pixel values in a region of an image. Can be used to quickly sum the pixel values of block regions in an image. One only needs to scan the image once to actually build the integral image. In the integral image, the value I(x, y) at position (x, y) represents the sum of all pixel values above and to the left of that position in the original image f(x, y):

I(x, y) = Σ_{p≤x, q≤y} f(p, q)
The construction of the integral image can be performed by scanning the image only once by means of a loop:
1. Let s(x, y) represent the cumulative sum of a row of pixels, with s(x, −1) = 0.
2. Let I(x, y) be the integral image, with I(−1, y) = 0.
3. Scan the entire image line by line; compute the cumulative sum s(x, y) and the integral image I(x, y) for each pixel (x, y) by means of the recurrences:

s(x, y) = s(x, y − 1) + f(x, y)
I(x, y) = I(x − 1, y) + s(x, y)

4. When the pixel in the lower right corner is reached by the progressive scan of the entire image, the integral image I(x, y) has been constructed. (A minimal code sketch of this construction is given at the end of this group of entries.)
As shown in Fig. 1.3, the sum over any rectangular area in the image can then be calculated with four array references. For rectangle D, there is

D_sum = I(δ) + I(α) − [I(β) + I(γ)]

where I(α) is the value of the integral image at the point α, i.e., the sum of the pixel values in rectangle A; I(β) is the sum of the pixel values in rectangles A and B; I(γ) is the sum of the pixel values in rectangles A and C; and I(δ) is the sum of the pixel values in rectangles A, B, C, and D. Therefore, computing the difference between the sums of two rectangles requires eight array references. In practice, a lookup table can be created, and the calculation is completed by means of the lookup table.
Fig. 1.3 Computation of integral image (rectangles A, B, C, and D, with corner points α, β, γ, and δ)
Cosine integral image An extension of the concept of integral images to non-uniform filters such as Gaussian smoothing, Gabor filtering, and bilateral filtering.
Summed area table Same as integral image.
Gallery image An example image (or image set) used to test the image search, image retrieval, or target recognition capabilities of a matching algorithm.
Textual image Originally referred to images converted from text (books, newspapers, etc.); it now includes various images or video frames containing text (which can be printed, handwritten, or even photographed in various scenes).
Template image A template-based representation of a template used for template matching, detection, or tracking.
Watermark image An image with a digital watermark embedded.
Iconic image An image expressed in pixels (array form).
Internet photos Refers to the large number of pictures that exist on the network; generally the camera that acquired them is not known, and for many the shooting conditions and the specific date and time are also unknown.
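A minimal numpy sketch of the integral image construction and the four-reference rectangle sum described above (the cumulative-sum implementation is an equivalent shortcut for the row-by-row loop in the text):

```python
import numpy as np

def integral_image(f):
    """I(x, y) = sum of f over all pixels with row <= x and column <= y."""
    return f.cumsum(axis=0).cumsum(axis=1)

def rect_sum(I, r0, c0, r1, c1):
    """Sum of f over the rectangle with corners (r0, c0) and (r1, c1), inclusive.
    Uses the four-reference rule D = I(delta) + I(alpha) - I(beta) - I(gamma)."""
    total = I[r1, c1]
    if r0 > 0:
        total -= I[r0 - 1, c1]
    if c0 > 0:
        total -= I[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += I[r0 - 1, c0 - 1]
    return total

f = np.arange(16).reshape(4, 4)
I = integral_image(f)
print(rect_sum(I, 1, 1, 2, 2), f[1:3, 1:3].sum())   # both print 30
```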
1.4
Special Attribute Images
Images can reflect the different attributes of the scene, and they must be adapted to the human visual system. So there can be grayscale and color images, as well as textured images, depth images, etc. (Russ and Neal 2016; Shih 2010; Sonka et al. 2014; Umbaugh 2005).
1.4.1
Images with Various Properties
CT image An image obtained by means of projection reconstruction techniques in computed tomography. The main acquisition methods include (1) transmission tomography and (2) emission tomography; the latter can be divided into two types: (1) positron emission tomography and (2) single-photon emission computed tomography.
Magnetic resonance image A special image obtained using magnetic resonance imaging techniques.
Anamorphic map A biological model that can be used to map the registration of regions between image data. For example, a model of the brain's functional regions can be used to determine the structure of the brain in a magnetic resonance image.
Remote sensing image An image obtained using remote sensing techniques or remote sensing systems. The electromagnetic bands used in remote sensing imaging include the ultraviolet, visible, infrared, and microwave bands. Such a wide frequency range is often divided into hundreds or even thousands of subbands, so hundreds or even thousands of different remote sensing images can be acquired for the same scene.
Mixed pixel The phenomenon in which the measurement recorded at a pixel originates from more than one scene component. For example, a pixel corresponding to a boundary position between two regions has a gray value associated with both regions. In remote sensing images, each pixel corresponds to a larger region on the ground.
Satellite image An image acquired of a section of the earth by placing the camera on an orbiting satellite.
Synthetic aperture radar [SAR] image A special radar image. Synthetic aperture radar (SAR) is a device that utilizes coherent microwave imaging. In synthetic aperture radar imaging, the radar is moving and the target is stationary (the relative motion between them is used to produce a large effective aperture and thus increase the lateral resolution). The synthetic
aperture radar system on a satellite operates in the microwave spectrum and has its own source of radiation, so it is almost immune to meteorological conditions and solar radiation. The pixel values of a synthetic aperture radar image reflect the physical properties of the ground, including the topography, shape, convexity and concavity, and so on. It has a wide range of applications in the fields of remote sensing, hydrology, geology, mining, surveying and mapping, and the military.
Acoustic image An image obtained with the aid of acoustic principles and sound sources. It reflects the characteristics of the sound sources and the sound-transmitting media.
Acoustic SONAR A device that detects and locates targets underwater. SONAR is derived from SOund Navigation And Ranging; it works by emitting sound waves and receiving their reflections. This is very similar to radar (the time-of-flight method and the Doppler effect can be used to give the radial components of relative position and velocity). It can be mounted on mobile robots or used on humans (as in medical ultrasound).
Landscape image An image showing natural phenomena or landscapes such as flowers, rivers, and mountains.
Ultrasound image An image obtained using ultrasound imaging techniques.
Mondrian image An image made in the style of the Dutch painter Mondrian. It includes different overlapping monochrome rectangles (later generalized to arbitrary shapes; several examples are shown in Fig. 1.4). Often used to study the color constancy of an observer when these images are illuminated with different spectral combinations.
Mondrian A famous Dutch visual artist. His later oil paintings are composed of adjacent areas of uniform color (with no change in tone). This style of image (called a Mondrian image) has been used in many studies of color vision, especially in the study of color constancy, because it has a simple image structure with no occlusions, highlights, shadows, or visible light sources.
Fig. 1.4 Examples of Mondrian images
Grayscale image A monochrome image whose pixel values represent luminance values (typically from 0 to 255). See grayscale. Same as gray-level image.
Gray-level image An image in which the value of the image property space F is represented by more than two (commonly 256) gray levels. In contrast to a color image, it is also called a monochrome image.
Monochrome image Same as gray-level image.
1.4.2
Image with Specific Attribute
Video image A widely used type of sequence image with special specifications (acquired at equal time intervals, at a rate that can produce the impression of continuous motion). Generally referred to simply as video. It can be written as f(x, y, t), where x and y are spatial variables and t is a time variable. See digital image.
Motion image A collection of images containing motion information. Typical examples are continuous video images (30 frames per second for the NTSC format, 25 frames per second for the PAL format), but it can also be an image sequence that varies at another rate (even at unequal intervals).
Moving image Same as motion image.
Frame image Each single image that makes up a video sequence; it corresponds to the smallest unit in time. For example, a PAL system contains 25 frames per second; an NTSC system contains 30 frames per second.
Still image A relatively independent single image. The emphasis is on the moment of image acquisition, at which the target is effectively stationary. Opposite to moving images.
Continuous tone image Generally, a grayscale image in which the number of gray levels reaches at least several tens or more, with no significant jumps between the gray levels.
Orientation map An image in which the property value of each pixel indicates the surface normal orientation of the 3-D scene surface point to which it corresponds.
Fig. 1.5 Texture images
Fig. 1.6 Gaussian map of a 2-D curve
Intensity image An image in which the measured brightness data is recorded.
Gradient image See gradient filter and edge image.
Texture image An image with obvious texture characteristics. Figure 1.5 shows several different texture images. In texture images, people are more concerned with the texture properties than with other properties of the image, such as grayscale.
Disparity space image [DSI] A 3-D image consisting of a series of images with different disparity values. In binocular stereo vision, letting the disparity be d and the corresponding pixel positions in the two images be (x, y) and (x', y') = (x + d, y), the disparity space image can be expressed as D(x, y, d).
Generalized disparity space image [GDSI] The result of stacking a set of disparity space images obtained by different cameras into a 4-D space. With the pixel position represented by (x, y), the disparity by d, and the camera number by k, the generalized disparity space image can be written as G(x, y, d, k).
Gaussian map A model or tool that represents the change in direction along a 2-D curve. As shown in Fig. 1.6, for a given curve C, a traversal direction is selected, so that a point P traverses the curve C in this direction, and the points P, P', and P'' on the curve C are sequentially mapped to points Q on the unit circle (i.e., Q, Q', Q'',
Fig. 1.7 Gaussian sphere
Fig. 1.8 Sampling expression of a unit ball
. . .). Here, the unit normal vector at each point P corresponds to the vector from the center of the unit circle to the respective point Q. The result of this mapping from curve C to the unit circle is the Gaussian map corresponding to curve C.
Gaussian sphere A unit sphere that can be used to calculate the extended Gaussian image. As shown in Fig. 1.7a, given a point on the 3-D target surface, the corresponding point on the sphere with the same surface normal produces the Gaussian sphere, as shown in Fig. 1.7b. In other words, the tail of the orientation vector of the target surface point is placed at the center of the sphere, and the tip of the vector intersects the sphere at a specific point, which can be used to mark the orientation of the original target surface point. The position of the intersection point on the sphere can be represented by two variables (two degrees of freedom), such as polar and azimuth angles or longitude and latitude. Also refers to the sampling representation of a unit sphere, in which the surface of the sphere is approximated by a certain number of triangles (often obtained by subdividing a dodecahedron), as shown in Fig. 1.8.
Extended Gaussian image [EGI] A model or tool that represents the orientation of a 3-D target surface; it is the 3-D extension of the Gaussian map of a 2-D curve. The extended Gaussian image can be calculated by means of a Gaussian sphere, via the calculation of the surface normals. Taking a unit cube as an example, if one places at each point on the Gaussian sphere a mass equal to the corresponding surface area, an extended Gaussian image is obtained, as shown in Fig. 1.9.
Fig. 1.9 Extended Gaussian image
An extended Gaussian image of a target gives the distribution of the target surface normals and provides the orientation of the points on the surface. If the target is a convex polyhedron, the target and its extended Gaussian image are in one-to-one correspondence, but an extended Gaussian image can correspond to an infinite number of nonconvex (concave) targets. (A minimal code sketch of accumulating such an orientation histogram is given at the end of this group of entries.)
Stereographic projection A method of projecting a Gaussian sphere onto a plane, by which surface orientation can be converted to a gradient space representation. In stereographic projection, the destination of the projection is a plane tangent to the North Pole, and the center of projection is the South Pole. In this way, all points on the Gaussian sphere except the South Pole can be uniquely mapped onto the plane.
Azimuth One of the measures for quantifying the angular difference between targets on a plane. Also known as the azimuth (Az) angle. It is the horizontal angle measured clockwise from the north direction of a point to the direction line toward the target.
Symbolic image An image in which the attribute values corresponding to each pixel are symbols (labels) rather than gray levels.
Visual attention maps A result produced by a visual attention model for the salient regions in an image. The more often the pattern of a pixel and its surrounding area (such as its shape, color, etc.) appears in other, morphologically identical regions of the image, the lower the visual attention value of that pixel; conversely, the less often such patterns appear, the higher the visual attention value of the pixel. This model can distinguish well between salient and nonsalient features.
Scatterplot A data display technique in which each data item is drawn as a single point in an appropriate coordinate system to help people better understand the data. For example, if a set of estimated surface normals is plotted in a 3-D scatterplot, a plane will appear as a compact point cluster.
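A minimal sketch of accumulating an extended-Gaussian-image style orientation histogram from surface normals and facet areas; the longitude/latitude binning and the cube example are illustrative choices, not the book's construction:

```python
import numpy as np

def extended_gaussian_image(normals, areas, n_bins=(18, 36)):
    """Accumulate facet areas into an orientation histogram on the unit sphere.
    normals: (N, 3) unit surface normals; areas: (N,) facet areas."""
    nx, ny, nz = normals[:, 0], normals[:, 1], normals[:, 2]
    lat = np.arccos(np.clip(nz, -1.0, 1.0))          # polar angle in [0, pi]
    lon = np.arctan2(ny, nx)                         # azimuth in (-pi, pi]
    hist, _, _ = np.histogram2d(
        lat, lon, bins=n_bins,
        range=[[0, np.pi], [-np.pi, np.pi]], weights=areas)
    return hist

# Example: normals of a unit cube's six faces, each face of area 1.
normals = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                    [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
areas = np.ones(6)
egi = extended_gaussian_image(normals, areas)
print(egi.sum())   # 6.0: the total surface area is preserved
```

For the cube, all the surface area concentrates in six histogram bins, one per face orientation, matching the point-mass picture described above.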
Fig. 1.10 The difference between depth image and gray image
1.4.3
Depth Images
Depth image Same as depth map. Refer to range image.
Depth map An image that reflects the distance between the scene and the camera. Although it can be displayed as a grayscale image, its meaning is different from that of an ordinary grayscale image recording the radiation intensity of the scene. There are two main differences (see Fig. 1.10):
1. In a depth image, the pixel values on a given plane of an object change at a constant rate when the plane is inclined with respect to the image plane; this rate varies with the shape and orientation of the object but is independent of external illumination conditions. In a grayscale image, by contrast, the pixel values on the same plane of the object may be constant.
2. There are two kinds of boundary lines in a depth image: the step edges between the object and the background, and the ridge edges at the intersections of the internal regions of the object (unless the intersection itself has a step). The boundary lines in a grayscale image, in contrast, correspond to grayscale step edges.
Depth gradient image The result g(di(i, j), dj(i, j)) obtained by calculating the gradient of the depth image d(i, j). Here di(i, j) and dj(i, j) correspond to the gradient values in the vertical and horizontal directions, respectively (a minimal code sketch follows this group of entries).
Layered depth image [LDI] A data structure that stores, for each pixel in a reference image (or at least for the pixels near the foreground-background transition), its depth and color values, and that can be used to render new views. Using a layered depth image avoids the otherwise inevitable problem of holes and cracks appearing behind the foreground target when the reference image is warped to a new view.
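A minimal sketch of the depth gradient image defined above, using numpy's central-difference gradient; the synthetic tilted plane is just an illustrative input:

```python
import numpy as np

def depth_gradient(d):
    """Gradient of a depth image d(i, j): returns (di, dj), the vertical and
    horizontal rates of change of depth, using central differences."""
    di, dj = np.gradient(d.astype(float))   # di: along rows, dj: along columns
    return di, dj

# A synthetic tilted plane: depth increases linearly along the columns.
rows, cols = np.indices((5, 5))
d = 10.0 + 0.5 * cols
di, dj = depth_gradient(d)
print(di.max(), dj.mean())   # 0.0 and 0.5: constant slope along j only
```

The constant slope illustrates the first property of depth maps noted above: on a planar surface the depth values change at a fixed rate, regardless of illumination.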
Fig. 1.11 Point cloud illustration of a car
Layered depth panorama Same as concentric mosaic. Range image The result obtained by expressing the distance data in the form of an image. The coordinates of the pixel are related to the spatial position of each point on the distance surface, and the pixel value represents the distance from the sensor (or from any fixed background) to the surface point. Same as depth map. Range data A representation of the spatial distribution of a set of 3-D points. Data is often obtained with stereo vision or distance sensors. In computer vision, range data is often represented as a point cloud (as shown in Fig. 1.11), that is, the X, Y, and Z coordinates of each point are represented as a triple or as a distance map (also known as Moiré patches). Range data segmentation A class of techniques that breaks range data into a set of regions. For example, curvature segmentation gives a collection of different surface patches. Range data integration Same as range data fusion. Range data registration See registration. Range data fusion The combination of range data from multiple sets is particularly used to expand the object surface portion described by range data, or to increase measurement accuracy by using redundant information obtained by measuring multiple points at each point of the surface region. See information fusion, fusion, and sensor fusion. Range image edge detector Edge detector working on range image. The edge is where the depth or surface normal direction (folded edge) changes rapidly. See edge detection. Moiré patches See range data. Height image See range image.
1.4.4
Image with Variant Sources
Monocular image Images (one or more) obtained with a single camera. See 3-D reconstruction using one camera.
Binocular image A pair of (related) images obtained by a single camera at two locations or by two cameras each at one location. See stereo vision and binocular imaging.
Trinocular image Three images obtained with a single camera at three locations or with three cameras each at one location. See orthogonal trinocular stereo imaging.
Multiple-nocular image Images captured by a single camera at multiple locations and/or orientations, or by multiple cameras at one (or several) locations and/or orientations, for the same scene. See multiple-nocular imaging and orthogonal multiple-nocular stereo imaging.
Electronic image An image that is proportional to electrical energy. For example, in a television camera tube (a storage tube), a large number of mutually insulated micro-element cathodes (collectively referred to as the mosaic cathode) are formed on a thin mica sheet, and the back of the mica carries insulated anodes corresponding to the micro-element cathodes. An auxiliary anode is placed in front of the mosaic cathode to attract the electrons emitted by the mosaic cathode when it is exposed to light. The amount of charge on each mosaic anode is proportional to the amount of light from the object; thus the optical image becomes a stored electronic image.
Scale space image A set of images defined by a scale space. Each pixel value of each layer is a function of the standard deviation of the Gaussian kernel used to convolve that pixel location.
Geometry image An image that represents a surface with a completely regular structure. It can be obtained by re-meshing an arbitrary surface and storing the result in a regular grid. A target surface can be modeled with a series of irregular triangular meshes; general re-meshing methods can only obtain semi-regular meshes. Geometry images use a 2-D array of points to represent the geometric information, and surface normals and colors are represented by similar 2-D arrays. Geometry images can be compressed using ordinary image coding methods.
Epipolar plane image [EPI] In binocular stereo vision, an image that displays how a particular line from one camera (a line lying in a fixed epipolar plane) changes as the camera position changes. Each line in the EPI corresponds to the relevant image line of the camera at a different time. Features that are far from the camera will remain in similar
positions in each line, while features that are close to the camera will move from one line to another (the closer the feature, the more it moves). In multi-view stereo vision, an image obtained by stacking the corresponding scan lines of all the images. When the camera moves horizontally (in the standard horizontal rectified geometry), scene points at different depths move laterally at a rate that is inversely proportional to their depth. A closer foreground target will occlude a farther background target, which can be seen as one strip of the epipolar plane image occluding the strips of other epipolar plane images. If there are a sufficient number of images, one can see the strips and infer their connections in order to construct 3-D scenes and to reason about translucent objects and specular reflections. On the other hand, these images can also be viewed as a series of observations and combined using Kalman filtering or maximum likelihood inference techniques.
EPI strip See epipolar plane image.
Orthoimage An image obtained by a specific correction means in photography; for example, an arbitrary aerial photograph is transformed into the image that would have been obtained with the camera pointing vertically downward.
Vector image An image whose property values are vectors. Color images are the most common vector images. Multi-spectral images are also vector images.
Vector field A multivalued function f: ℝ^n → ℝ^m. An RGB image I(x, y) = [r(x, y), g(x, y), b(x, y)] is a vector field from 2-D to 3-D. Figure 1.12 shows a vector field f(x, y) = (y, sin πx) from 2-D to 2-D.
Logical image Same as binary image.
Fig. 1.12 A vector field from 2-D to 2-D
High dynamic range [HDR] image An image with a high dynamic range; such images often show clearer detail than normal images.
Omnidirectional image An image obtained by a panoramic camera that captures the surrounding scene with a very large angle of view (up to 360°). The effect of a panoramic camera can also be achieved by projecting an image obtained with a normal camera onto a hyperboloid with a vertical axis of symmetry. In addition, panoramic images can be obtained with multiple cameras (such as webcams) and image stitching technology.
Space-time image An image with both spatial and temporal coordinates. It can be a 2-D, 3-D, or 4-D image.
Source image The input image of an image processing, image analysis, or image understanding operation.
Degraded image The result of degradation of an original image.
Restored image The output of image restoration. In general, it is an estimate of the original image.
1.4.5
Processing Result Image
Feature image An image containing features computed using a low-level image operator. It can be used to further identify and classify targets in an image.
Segmented image The resulting image after segmentation in image segmentation.
Target image The resulting image of an image processing operation (which can be smoothing, sharpening, low-pass or high-pass filtering, etc.).
Edge image An image in which each pixel represents an edge point, an edge amplitude, or an edge gradient.
Decoding image The result obtained by decoding the output of image encoding.
Tessellated image An image that is covered with a set of basic geometry units. The Voronoï mesh extracted from the mosaic digital image tends to reveal the geometric properties of the image and the existence of the image target. Convert images 1. The process and result of converting an image file format (such as BMP) to another image file format (such as TIFF) 2. The process and result of converting an image with one color model (such as RGB) into another color model (such as HSI) Classified image See label image. Label image After dividing an image into different areas, each area is marked with a different symbol or number to obtain a divided image. Often referred to as a classified image, because the class to which each pixel belongs is indicated. Such an image cannot be processed in the same manner as a grayscale image. Absorption image One of three types of images divided according to the type of interaction between the radiation sources, the nature of the target, and the relative position of the image sensor (the other two are the emission image and the reflection image). The absorption image is the result of radiation that penetrates the target, providing information about the internal structure of the target. The most common example is an X-ray image. Emission image One of three types of images divided according to the type of interaction between the radiation sources, the nature of the target, and the relative position of the image sensor (the other two are absorption images and reflection images). The emission image is the result of imaging a self-illuminating target, such as stars and bulb images in the visible light range, as well as thermal and infrared images outside the visible range. Reflection image One of three types of images that are divided according to the type of interaction between the radiation sources, the nature of the target, and the relative position of the image sensor (the other two are the absorption image and the emission image). The reflected image is the result of radiation reflected from the target surface, sometimes considering the effects of the environment. Most of the images that come into contact in people’s daily lives are reflected images. The information that can be extracted from the reflected image is primarily about the target surface, such as their shape, color, and texture, as well as the distance from the observer. Environment map Same as reflection image.
Blending image In image hiding, a linear combination of the carrier image and the hiding image. Letting the carrier image be f(x, y) and the image to be hidden be s(x, y), the image

b(x, y) = αf(x, y) + (1 − α)s(x, y)

is the parameter-α blended image of f(x, y) and s(x, y), where α is any real number satisfying 0 ≤ α ≤ 1; the blend is called ordinary blending when α is 0 or 1. (A minimal code sketch follows this group of entries.)
Carrier image In image hiding, the image that is used to carry the hiding image. It is generally a common image that can be communicated publicly without arousing suspicion.
Hiding image An image that needs to be protected by image hiding. A common method is to embed the image to be hidden (as invisibly as possible) into the carrier image (generally a common image that can be communicated publicly without arousing suspicion).
Spin image A representation obtained by a rotational projection of 3-D surface bins around the local normal axis; it has the global property of being invariant to 3-D rigid body transformations. Johnson and Hebert proposed it as a local surface representation. If the surface normal at a selected point p is n, then every other surface point x can be represented with the help of a 2-D basis:

(α, β) = (√(‖x − p‖² − (n · (x − p))²), n · (x − p))

A spin image is a histogram of all (α, β) values over the surface. Each selected point p results in a different spin image. Points can be matched by comparing the corresponding spin images by means of correlation calculations. The main advantage of this representation is that it is independent of pose and avoids possible ambiguities between adjacent planes.
Spherical spin image A generalization of spin images for the free-form surfaces of 3-D targets. It gives a local surface shape description that can be used for image database indexing and target recognition.
Fused image The direct output of image information fusion.
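A minimal sketch of the parameter-α blending defined above; the carrier and hidden images here are tiny constant arrays used only for illustration:

```python
import numpy as np

def blend(f, s, alpha):
    """Parameter-alpha blending b = alpha * f + (1 - alpha) * s, 0 <= alpha <= 1.
    f is the carrier image, s the image to be hidden (same shape)."""
    assert 0.0 <= alpha <= 1.0
    return alpha * f.astype(float) + (1.0 - alpha) * s.astype(float)

carrier = np.full((2, 2), 200, dtype=np.uint8)
hidden = np.full((2, 2), 40, dtype=np.uint8)
b = blend(carrier, hidden, 0.9)      # the hidden image contributes only weakly
print(b)                             # every element is 184.0
```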
Fig. 1.13 Multifocus image obtained from different depth focused images
Multifocus images The result of merging images of the same scene obtained at different focus settings. In such an image, targets at very different depths are all in focus. Figure 1.13 gives an example (fusion is indicated by ⊕): the left picture is focused on the small clock at the left front, while the big clock at the right rear is blurred; the middle picture is focused on the big clock at the right rear, while the small clock is rather fuzzy; the picture on the right is the result of merging these two images, in which both clocks are in focus at their respective depths and are quite clear.
1.4.6
Others
Eigenimage The base image used when singular value decomposition is performed on an image. The singular value decomposition of an image f expands it in terms of vector outer products, which can be written as

f = Σ_{i=1}^{N} Σ_{j=1}^{N} g_ij u_i v_j^T
The outer product u_i v_j^T can be interpreted as an "image," so f is the original image represented as the sum of all outer product combinations weighted by the coefficients g_ij. The base images u_i v_j^T are the eigenimages of the image f.
Intrinsic image 1. An image that represents some intrinsic property of the scene and is not affected by other characteristics. Obtaining an intrinsic image is very useful for correctly interpreting the scene represented by the image. Compare non-intrinsic image. 2. One of a set of images that correspond spatially to the input luminance image and reflect the intrinsic characteristics of the scene. Typical examples include an
image showing the distance to each scene point, a scene surface orientation image, a surface reflectance image, and the like. 3. A set of images (or one of such images), aligned with the input grayscale image, that describe intrinsic properties of the scene rather than characteristics of the input image itself. Typical intrinsic images include an image of the distance to each scene point (depth image), an image of the surface orientation in the scene, an image of the surface reflection characteristics, and the like.
Intrinsic property A characteristic of an object that exists objectively in the scene. Such characteristics are independent of the nature of the observer and of the image capturing equipment itself. Typical intrinsic properties include the relative distances between objects in the scene, the relative orientation of each object in space, the speed of motion, and the surface reflectivity, transparency, orientation, etc.
Intrinsic dimensionality The number of intrinsic dimensions (degrees of freedom) of a data set, independent of the dimension of the space in which it is expressed. For example, the natural dimension of a curve in 3-D space is 1, although each of its points has three coordinates.
Non-intrinsic image An image in which the physical quantity represented is related not only to the scene but also to the nature of the observer or image acquisition device, or to the conditions of image acquisition and the surrounding environment. In other words, the nature of a non-intrinsic image is often determined by a combination of factors outside the scene itself (such as the intensity of the radiation source, the manner or orientation of the radiation, the reflective properties of the surface materials in the scene, and the position, performance, etc. of the image acquisition device). Compare intrinsic image.
Continuous image Same as analog image.
Analog image An image that has not been discretized in the coordinate space XY or the property space F. The objective scene is continuous, and all results of direct projection are analog images, which are therefore also called continuous images. When a 2-D function f(x, y) is used to represent this type of image, the values of f, x, and y can be any real numbers.
Analog A general term for everything in the real world. Compare digital.
Optical image An analog image obtained by means of optical principles and light sources. In addition to visible light sources, infrared sources are also commonly used.
Fig. 1.14 Optical image processing layout (source, transparency, focus-plane filter, imaging sensor)
Optical image processing Processing of an analog image performed directly with optical devices, i.e., image processing with lenses and coherent light instead of a computer. The key principle is that a coherent beam passing through the target image and brought to a focus produces the Fourier transform of the image at the focal plane, where frequency domain filtering is also possible. A typical processing layout is shown in Fig. 1.14. Its characteristic is that the processing speed is fast, but special equipment is required, so the flexibility is poor.
Pre-image In the digitization model, the original continuous point set that is digitized into a discrete point set. Given a set of discrete points P, if a set of continuous points S is digitized to P, then S is a pre-image of P. Note that because digitization (quantization) is a many-to-one mapping, different pre-images may be mapped to the same set of discrete points. In other words, the set of discrete points resulting from digitization does not uniquely determine its pre-image.
Raster image An image whose smallest units are pixels. Also known as a pixmap, dot matrix, or bitmap (when the pixel values take only two values). It is suitable for storing images of natural scenes with irregular target shapes and rich colors, where each pixel has its own grayscale or color. Since the resolution is always limited, distortion may occur during scaling. The file formats used for storage include BMP, GIF, JPEG, TIF, etc. Compare vector maps.
Raster image processing Pixel-based processing of images. Most image processing techniques use this approach.
Raster 1. Raster: The region scanned and illuminated in a video display. It can be obtained by sweeping a phosphorescent screen from top to bottom with a modulated ray at a regular repetition rate. 2. Grid: One of the most basic units for digitally encoding 2-D image content. It uses one or more pixel arrays to represent the image, with good quality and fast display
speed, but it requires a lot of storage space and is size dependent (i.e., magnifying a raster representation of an image may produce obvious artifacts).
Rasterizing The process of converting a mathematically defined vector graphic into an image with corresponding pixel points. When outputting to a desktop printer, the vector data is reinterpreted into a raster file, so that the printer knows where to print the dots line by line.
Raster scanning In an image display system, a technique of horizontally scanning the screen display region line by line to generate or record display elements and display an image. "Raster" here means the region on a cathode-ray tube (CRT) or liquid crystal display (LCD) that can display an image. In a CRT, raster scanning refers to the process of rapidly sweeping the electron beam from left to right and from top to bottom along horizontal lines, similar to the way a TV tube scans. In an LCD, a raster (generally referred to as a "grid") covers the entire device area and can be scanned in different ways. In raster scanning, image units are recorded or displayed one by one.
Raster-to-vector Converting pixel-based image information into mathematical information to construct a vector map (from a raster image).
Log-polar image An image representation in which pixels do not follow a standard Cartesian layout, i.e., pixels are not arranged in a Cartesian coordinate system but have a spatially varying layout. In log-polar space, the image is parameterized by a polar coordinate θ and a radial coordinate r. Unlike the polar coordinate system, however, the radial distance increases exponentially with increasing r. The mapping from this parameterization to the Cartesian coordinate system is [β^r cos(θ), β^r sin(θ)], where β is a design parameter. Further, although the exact pixel size depends on a number of factors such as pixel coverage, the area represented by each pixel on the image plane increases exponentially with r (a minimal sampling sketch is given at the end of this group of entries). See fovea image.
Log-polar stereo A special form of stereo vision in which the input images come from sensors with a log-polar layout rather than a standard Cartesian layout.
Orderless images An image form in which a probability distribution, rather than the usual grayscale or color, is associated with each point. The probability distribution is calculated from the neighborhood around each point, without considering the target structure in space.
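A rough sketch of sampling an image on a log-polar grid as described above; the base β, the grid sizes, and the nearest-neighbor sampling are illustrative assumptions:

```python
import numpy as np

def log_polar_sample(img, n_theta=64, n_r=32, beta=1.1):
    """Sample a grayscale image on a log-polar grid centered on the image.
    The radius grows as beta**r, so outer rings cover exponentially larger areas."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = np.zeros((n_r, n_theta), dtype=img.dtype)
    for r in range(n_r):
        rad = beta ** r
        for t in range(n_theta):
            theta = 2.0 * np.pi * t / n_theta
            y = int(round(cy + rad * np.sin(theta)))
            x = int(round(cx + rad * np.cos(theta)))
            if 0 <= y < h and 0 <= x < w:       # samples outside the image stay 0
                out[r, t] = img[y, x]
    return out

img = np.random.randint(0, 256, size=(128, 128), dtype=np.uint8)
lp = log_polar_sample(img)
print(lp.shape)   # (32, 64)
```

As in a fovea image, the sampling is dense near the center and sparse toward the periphery.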
1.5
Image Representation
The spatial field of view corresponding to the image is determined by the geometric imaging model, and the amplitude range of the image is determined by the brightness imaging model (Bracewell 1995; MacDonald and Luo 1999; Umbaugh 2005). The accuracy in the spatial field of view corresponds to the spatial resolution of the image, while the accuracy in the amplitude range corresponds to the amplitude resolution of the image (Castleman 1996; Gonzalez and Woods 2008; Sonka et al. 2014; Zhang 2009).
1.5.1
Representation
Image representation A generic term for how image data is represented. It is a combination of data structure, data volume, display effect, and so on. Typical representations include the matrix representation of images and the vector representation of images. The image data can be 2-D, 3-D, or higher dimensional. Image data is typically stored in an array, and the spatial layout of the array reflects the spatial layout of the image data.
Image representation function A function that represents the spatial distribution of the radiant energy reflected in the image. In the simplest case, f(x, y) can be used, where the brightness f represents the radiant intensity at point (x, y). A common spatial 3-D image can be represented by f(x, y, z). See general image representation function.
Image size The total number of pixels in an image, often expressed as the product of the number of pixel rows and the number of pixel columns. For an image of M rows and N columns, the size is M × N.
Image representation with matrix The use of a matrix to represent an image. A 2-D image of M × N (where M and N are the total number of rows and the total number of columns in the image, respectively) can be represented by a 2-D M × N matrix:

F = [ f_11  f_12  …  f_1N
      f_21  f_22  …  f_2N
      ⋮     ⋮     ⋱   ⋮
      f_M1  f_M2  …  f_MN ]
Image representation with vector The use of vectors to represent images. A 2-D image of M × N (where M and N are the total number of rows and the total number of columns of the image, respectively) can be represented by an N-dimensional vector:

F = [ f_1  f_2  …  f_N ]

Each of these components is in turn an M-dimensional vector:

f_i = [ f_1i  f_2i  …  f_Mi ]^T,   i = 1, 2, …, N
Each of these components corresponds to one pixel.
Outer product of vectors An operation between two vectors that produces a basic image; an image can then be represented as a linear superposition of such basic images. Given two N × 1 vectors

u_i^T = [ u_i1, u_i2, …, u_iN ]
v_j^T = [ v_j1, v_j2, …, v_jN ]

their outer product can be expressed as

u_i v_j^T = [ u_i1 v_j1  u_i1 v_j2  …  u_i1 v_jN
              u_i2 v_j1  u_i2 v_j2  …  u_i2 v_jN
              ⋮          ⋮          ⋱   ⋮
              u_iN v_j1  u_iN v_j2  …  u_iN v_jN ]
Therefore, the outer product of the two vectors is an N × N matrix, which can be regarded as an image. Conversely, an image can also be expanded into a sum of vector outer products (a numerical sketch is given at the end of this subsection).
Unit vector A vector of length 1.
Orthonormal vectors A set of vectors in which each vector has an amplitude of 1 and has a dot product of zero with the complex conjugate of every other vector.
Tomasi view of a raster image An image representation model. It regards an image as a mapping f from a 2-D image position to a 1-D Euclidean space (for binary or grayscale images) or to a 3-D Euclidean space (for color images), i.e., f: ℝ^m → ℝ^n, with m = 2 and n = 1 (binary or grayscale) or n = 3 (color).
For example, if p is the position of one pixel in a 2-D RGB color image, f( p) is a vector having three components, that is, red, green, and blue intensity values of pixel p.
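A small numerical check of the outer-product expansion and eigenimage idea above. The singular value decomposition provides the special case in which only the diagonal coefficients (the singular values) are nonzero:

```python
import numpy as np

# Any image can be written as a weighted sum of vector outer products.
# The SVD f = U S V^T provides one such expansion; u_i v_i^T are base images.
f = np.random.rand(8, 8)
U, s, Vt = np.linalg.svd(f)

# Reconstruct f from the outer products u_i v_i^T weighted by the singular values.
recon = np.zeros_like(f)
for i in range(len(s)):
    recon += s[i] * np.outer(U[:, i], Vt[i, :])   # rank-1 "base image"

print(np.allclose(recon, f))   # True: the expansion reproduces the image
```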
1.5.2
Image Property
Image brightness The luminous flux that illuminates the surface of the scene. It is the power per unit area of the target surface, in units of W·m⁻². It is related to irradiance.
Brightness image forming model An image model that reflects the imaged brightness, or a model that relates the brightness of the obtained image to the imaging factors. In the simple case, the image brightness is primarily related (proportionally) to the luminous flux incident on the visible scene being imaged and to the ratio at which objects in the scene reflect the incident light.
Density The property of an object surface in the original scene that corresponds to the gray level of the corresponding region in the image. Density is an intrinsic property of the scene; the relationship between it and the image gray level is affected by many factors, and although it is generally positive, it does not necessarily follow a strict and simple proportional relationship.
Density estimation Modeling the probability distribution p(x) of a random variable x, given a finite observation set {x_i}. The density estimation problem is essentially an ill-posed problem, because an infinite number of probability distributions could have produced the finite set of observed data; for example, any distribution p(x) that is nonzero at each data point x_1, …, x_N is a candidate.
Amplitude The difference between the maximum and minimum values of a signal, i.e., the dynamic range of the signal. Since the minimum value of the image property is 0, the amplitude of the image signal is the maximum value that can be attained.
Gray level The digitized luminance value in a digital image. Same as grayscale.
Grayscale A monochrome representation of a pixel value. It generally indicates the brightness of an image. When expressed in 8 bits, the value ranges from 0 (black) to 255 (white). Medical images are also commonly represented by 12 bits, ranging from 0 (black) to 4095 (white).
Gray value Same as grayscale.
Gray-level range The range of values of the image grayscale. For a grayscale image f(x, y), the gray values satisfy

F_min ≤ f ≤ F_max

that is, the gray-level range is [F_min, F_max]. The only theoretical limit on F_min is that it cannot be negative (it is generally taken as 0), and the only restriction on F_max is that it should be finite.
Gray-level contrast [GC] See inter-region contrast.
Grayscale distribution model A model that describes how grayscale is distributed across an entire image or across certain image regions. See intensity histogram.
Gray-level resolution The smallest change in brightness level that the human visual system can discern. When discussing the gray-level resolution of an image, it refers to the number of gray levels of the grayscale image. For grayscale images, 8 bits per pixel is typically a good balance between subjective quality and practical implementation (each pixel value corresponds to one byte).
Magnitude resolution The numerical progression corresponding to the image property. For a grayscale image, it is the number of gray levels per pixel, which is also called gray-level resolution.
Pixel grayscale resolution The number of different gray levels that a pixel can express; it depends on the number of bits associated with each pixel. For example, an 8-bit pixel can express 2^8 = 256 different luminance values. See intensity, intensity image, and intensity sensor.
Depth resolution 1. The resolution of an image over the magnitude of its attribute; for a grayscale image this is the grayscale resolution. See bit depth. 2. The attribute (i.e., depth) resolution of a depth image, that is, the resolution of the distance measure.
Relative brightness The ratio of the brightness value A of an object to the brightness value A_0 of the optimal color of the corresponding chromaticity, that is, A/A_0.
Relative brightness contrast A measure that describes the relationship between the brightness values of the various parts of an image or of regions in an image. To measure the contrast, one can use the Michelson contrast, (I_max − I_min)/(I_max + I_min), where I_max represents the maximum brightness value and I_min the minimum brightness value.
Magnitude function A function that represents the magnitude of a complex function (such as of Fourier transform coefficients). An image in the general sense can be represented as a complex function whose size is indicated by its magnitude. An actual grayscale image can often be represented as a real function whose gray values indicate its amplitude.
Gaussian-Hermite moment A moment describing the amplitude function. Let f(x, y) denote the image intensity function defined on Q. Then the (p, q) order Gaussian-Hermite moment of the image is

M_pq = ∬_Q f(x, y) H_p(x; σ) H_q(y; σ) dx dy

where H_p(x; σ) and H_q(y; σ) are Gaussian-Hermite polynomial functions with corresponding scale parameter σ.
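A rough discrete sketch of the Gaussian-Hermite moment above. The use of physicists' Hermite polynomials, the lack of normalization, and the centering of coordinates are assumptions; conventions vary in the literature:

```python
import numpy as np
from numpy.polynomial.hermite import hermval

def gh_basis(x, p, sigma):
    """Gaussian-Hermite function H_p(x; sigma): a Hermite polynomial modulated
    by a Gaussian envelope (unnormalized, for illustration only)."""
    coeffs = np.zeros(p + 1)
    coeffs[p] = 1.0                           # select the p-th Hermite polynomial
    return np.exp(-x**2 / (2 * sigma**2)) * hermval(x / sigma, coeffs)

def gh_moment(f, p, q, sigma=1.0):
    """Discrete approximation of M_pq = sum_y sum_x f(x, y) H_p(x) H_q(y),
    with coordinates centered on the image."""
    h, w = f.shape
    y = np.arange(h) - (h - 1) / 2.0
    x = np.arange(w) - (w - 1) / 2.0
    Hy = gh_basis(y, q, sigma)                # along rows (y direction)
    Hx = gh_basis(x, p, sigma)                # along columns (x direction)
    return float(Hy @ f @ Hx)

f = np.random.rand(9, 9)
print(gh_moment(f, 2, 3, sigma=2.0))
```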
1.5.3
Image Resolution
Image resolution Generally recorded as the number of pixels in the horizontal and vertical directions of an image (including other directions for high-dimensional images); it can also refer to the distance between pixels or the angle between the lines of sight of adjacent pixels.
Resolution 1. A parameter in image engineering that reflects the fineness of an image and corresponds to the amount of image data. It indicates the number of pixels per unit area, scale, angle of view, etc. A 2-D image can be represented by a 2-D array f(x, y), so spatial resolution is considered in the X and Y directions, and amplitude resolution is considered for the property F. 2. The ability to distinguish or identify in different situations. For example: (1) The resolution of the human eye. The ability of the human eye to recognize details. It is related to the shape of the object and the lighting situation. Usually the human eye cannot distinguish the two ends of an object that subtend an angle of less than about 1'.
(2) The resolution of a telescope. The smallest angle, relative to the entrance pupil of the instrument, between two points that can be resolved. The image of a point source in a telescope is a diffraction spot with a bright center surrounded by rings of alternating light and dark. The diffraction spots of two adjacent point images partially overlap; they are distinguishable when the depth of the valley where the two spots overlap is 19% of the peak, which is the Rayleigh resolution criterion (Rayleigh criterion). If D is the entrance pupil diameter (objective diameter), the resolution limit is 1.22λ/D, where λ is the wavelength of light. At this limit, the ratio of the depth of the valley to the peak is 0.19. (3) The resolution of a microscope. The smallest distance between two points in the object that can be distinguished. According to the Rayleigh criterion, the minimum resolvable distance is 0.61λ/(n sin ϕ), where λ is the wavelength, n is the refractive index of the medium in object space, and ϕ is half the angle subtended by the lens edges at the object (the aperture angle). (4) Color resolution. The power of an instrument to separate two adjacent spectral lines. It is represented by the ratio of the wavelength λ to the smallest wavelength difference Δλ that can just be resolved, that is, λ/Δλ. For a prism, according to the Rayleigh criterion, the resolution is λ/Δλ = b dδ/dλ, where b is the width of the beam that illuminates the prism and dδ/dλ is the angular dispersion (δ is the deflection angle). When the prism is at the position of minimum deflection, the criterion becomes t dδ/dλ, where t is the length of the thickest part of the prism.
Limiting resolution Same as resolution limit.
Resolution limit The size of the smallest target that an imaging or display system can distinguish or observe. It is a finite value.
Rayleigh resolving limit Same as Rayleigh criterion of resolution.
Rayleigh criterion of resolution A criterion for determining whether the images of two point light sources can be distinguished. Also known as the Rayleigh criterion. Consider two image points of equal brightness obtained with an aberration-free optical lens. Their cross-sectional profiles are shown in Fig. 1.15 by solid lines, and the dotted line represents the superposition of the brightness of the two points.
[Fig. 1.15 The cross-sectional view of two point images]
The luminance distribution of each point image can be expressed in terms of the first-order Bessel function J1 as

$$L(r) = \left[ \frac{2\, J_1\!\left(\pi D r / (d\lambda)\right)}{\pi D r / (d\lambda)} \right]^2$$

where r is the distance between the point on the brightness distribution curve and the center of the point image, λ is the wavelength of light (usually λ = 0.55 μm for natural light), d is the distance from the lens to the imaging plane, and D is the diameter of the lens. In Fig. 1.15, the left point image center is fixed at the origin, and the right point image is moved so that the first zero of the Bessel function of the right point image coincides with the center of the left point image. Similarly, the first zero of the Bessel function of the left point image coincides with the center of the right point image (point P in the figure), and the distance between the two image centers is the distance δ at which the two points can just be distinguished:

$$\delta = \frac{1.22\, d\lambda}{D}$$
The minimum brightness between the two point images is approximately 73.5% of the center brightness of the point image. Spatial resolution 1. The scale of the smallest unit that can be expressed and processed separately in the image. It is usually indicated by the size of the image (length × width). 2. A way to describe the density of pixels in an image. The higher the spatial resolution, the more pixels will be used to display an image with a fixed physical size. It is commonly represented by dots per inch (dpi). 3. The minimum interval (difference) between the characteristics of two signals that can be measured by the sensor. For a CCD camera, this interval can be indicated by the distance between the centers of two adjacent pixels. It is also commonly indicated by the angle between the 3-D rays of adjacent pixels. It is the reciprocal of the highest spatial frequency (which the sensor can express without aliasing). Data density The ratio of the image area on a page to the blank (non-image) area. Based on this parameter, you can estimate the length of the file produced after the image is scanned. Data density can also be used to indicate the number of pixels or the amount of information contained in an image. The denser the data, the more detail it will show. Dot per inch A unit that can be used to measure data density or resolution. It is mainly used for printers, etc. Compare pixel per inch.
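As a quick numeric illustration of the dots-per-inch unit just defined, the relation between resolution, pixel count, and physical size is a simple conversion based on 1 inch = 25.4 mm; the following sketch is illustrative and not part of the handbook.

```python
def pixel_pitch_mm(dpi):
    """Distance between adjacent dot/pixel centers, in millimeters (1 inch = 25.4 mm)."""
    return 25.4 / dpi


def print_size_mm(width_px, height_px, dpi):
    """Physical size of an image printed or displayed at the given resolution."""
    return width_px * 25.4 / dpi, height_px * 25.4 / dpi


if __name__ == "__main__":
    print(pixel_pitch_mm(300))              # ~0.0847 mm per dot at 300 dpi
    print(print_size_mm(3000, 2000, 300))   # 3000 x 2000 pixels -> 254.0 mm x ~169.3 mm
```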
Pixel per inch A unit that can be used to measure data density or resolution. It is mainly used in computers. Compare dot per inch. Bitmap An image in which each pixel is represented by one bit. Bit 1. A unit of data volume. It is the smallest unit of measure for image data in a computer. 2. A unit of information quantity. When one of two equally likely events occurs, the amount of information is 1 bit. Nat A unit of information based on the natural logarithm; its conversion relationship with the bit is 1 nat = log2e bit ≈ 1.44 bit. Bit depth The number of binary digits (i.e., the number of bit planes) contained in the representation of an image. Also called bit resolution. For example, a 1-bit depth can only represent black and white, an 8-bit depth can represent 256 grayscales or color combinations, and a 24-bit depth can represent true color. It corresponds to how many unique numbers of bits are available in the image palette. For example, if there are 256 colors in the palette, a depth of 8 bits is needed. Full-color images typically have a bit depth of 24 bits, with 8 bits for each color. Image grid A geometric map depicting image samples, where each image point is represented by a vertex (or hole) in the graph or in the grid. Image lattice A spatial pattern obtained by decomposing an image plane into a collection of small units. It is also a regular grid of sampled pixel central points in the image space. Complementing the sampling mode, the image grid points sampled by the triangle pattern are hexagonal (hexagon grid points), and the image grid points sampled by the hexagon pattern are triangular (triangle grid points). Image points sampled in square mode are still square (square grid points). Supergrid A representation that is larger than the original image and that explicitly shows the gap edges between pixels, as shown in Fig. 1.16. 2-D image exterior region A region other than any 2-D image on the plane. Image epitome A useful data mining expression for images. It includes statistical features that are more concise than the original data.
[Fig. 1.16 Supergrid: pixels and the gap edges between them]
1.6
Image Quality
Image quality is related to both subjective and objective factors (Borji et al. 2013; Danaci and Ikizler-Cinbis 2016; Marr 1982). In image processing, the judgment of image quality often depends on human observation, but there are some objective indicators (Gonzalez and Woods 2018; Zhang 2017a). Golden image In image engineering, an ideal image or a desired image used as a comparison reference, also called a true value image. Standard image Publicly available test images that can be used by researchers as a basis for various studies. Reference image An image of a known scene or of the scene at a specific time, for comparison with a current image. Image quality Evaluation of the visual quality of images in image processing. Different image applications often require different image quality and different evaluation indicators. For image acquisition, the sampling rate and the number of quantization levels are important factors affecting image quality. As another example, in (lossy) image compression, different subjective and objective indicators are used, and the resulting image quality evaluations also differ. A generic term used to indicate the extent to which the image data acquired from the scene truly reflect the scene. Factors affecting image quality are often related to image applications, but good image quality generally corresponds to lower levels of image noise, higher image contrast, more precise focus, less motion blur, and so on. Image quality is often considered when discussing image processing effects. See SSIM and UQI. Quality An absolute measure of image quality.
Image quality measurement A certain measure used to determine the quality of the image displayed to the viewer. The need for such judgment comes from many sources, for example: 1. There is no reason to design a system whose image quality exceeds the human eye's perception. 2. Because various image processing techniques can cause observable changes in results, it is necessary to evaluate the amount and impact of loss in visual quality. According to the different perspectives of quality evaluation, quality measurement can be divided into subjective quality measurement and objective quality measurement. Quality of a digital image See image quality. Photorealistic Having the same image quality as the real world in terms of color, light, gradient, etc. Universal image quality index [UIQI] A model for quantifying image distortion based on loss of correlation, luminance distortion, and contrast distortion; from the point of view of correlation with human perception of distortion, it is better than the traditional sum of squared differences. Universal quality index [UQI] A measure of image quality. Let μx, μy be the average pixel intensity along the x and y directions on the image, respectively. Let σx, σy be the contrast of the image signal in the x and y directions, respectively, defined as

$$\sigma_x = \left[ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)^2 \right]^{1/2}, \qquad \sigma_y = \left[ \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \mu_y)^2 \right]^{1/2}$$

Let σxy (which can be used to calculate the structural similarity index measurement SSIM(x, y)) be defined by

$$\sigma_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)$$

Then the universal quality index UQI(x, y) is defined as

$$\mathrm{UQI}(x, y) = \frac{4\, \sigma_{xy}\, \mu_x \mu_y}{\left(\mu_x^2 + \mu_y^2\right)\left(\sigma_x^2 + \sigma_y^2\right)}$$
where x is the original image signal and y is the test image signal. In grayscale images, x is a row of pixel intensities, and y is a column of pixel intensities. When C1 = C2 = 0 in SSIM(x, y), UQI(x, y) is the same as SSIM(x, y). Objective evaluation In image information fusion, the image fusion results are evaluated based on some computable indicators. The results are often related to subjective visual effects, but not completely consistent with them. Objective quality measurement Excluding subjective factors, the image quality is calculated directly and objectively. Also refers to the basis or criteria for such objective calculation. The advantage is that it is easy to compute and provides an alternative to subjective measurements and their limitations. Disadvantages include that it is often not (completely) consistent with human subjective quality evaluation and that it requires a "raw" reference image that is not always available and whose quality is assumed to be perfect. See objective fidelity criteria. Compare subjective quality measurement. Subjective evaluation In image information fusion, the evaluation of image fusion results based on the subjective feeling of the observer. Such a process varies not only from observer to observer but also with the interests of the viewer and the requirements of the fusion application field and occasion. Subjective quality measurement Measuring quality by asking observers to judge the quality of an image or a piece of video. This allows them to report, select, or sort the visual experience under controlled conditions. Subjective quality measurement techniques can be absolute or relative. Absolute techniques classify images on an isolated basis without considering other images. Relative techniques require the observer to compare one image to another and determine which one is better. See subjective fidelity criteria. Compare objective quality measurement. Structural similarity index A similarity measure based on structure perception that can be used to measure/evaluate image quality. Structural similarity index measurement [SSIM] An indicator used to objectively evaluate image quality. It uses the covariance information of the gray level and the gradient in the image to describe the structure in the image. Since the human eye is sensitive to structural information, structural similarity also reflects a certain subjective visual experience. It measures the (structural) similarity between two images, and its value can be normalized to the interval [0, 1]. When the two images are identical, SSIM = 1. For color images, the image is often converted to Lab space, and the SSIM of the L channel is calculated because the SSIM of the L channel is more representative than the SSIM of the a and b channels.
SSIM calculates structural similarity by comparing two image signals x and y, for example a row and a column of an image with n columns and m rows: x = (x1, ..., xn), y = (y1, ..., ym). Let μx, μy be the average pixel intensities of x and y, respectively. Let σx, σy be the contrasts of the two image signals, respectively, defined as

$$\sigma_x = \left[ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)^2 \right]^{1/2}, \qquad \sigma_y = \left[ \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \mu_y)^2 \right]^{1/2}$$

Let σxy be defined by

$$\sigma_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)$$

Let C1 and C2 be constants introduced to avoid instability when the average intensity values of x and y are close to each other. The similarity measure SSIM between signals x and y is defined as

$$\mathrm{SSIM}(x, y) = \frac{\left(2 \mu_x \mu_y + C_1\right)\left(2 \sigma_{xy} + C_2\right)}{\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)}$$

Mean structural similarity index measurement [MSSIM] The average of the SSIM, corresponding to a measure of the overall quality of the image.
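The SSIM definition above (and the UQI definition earlier, which is the special case C1 = C2 = 0) can be computed directly from the sample statistics. The sketch below is a minimal illustration, not the handbook's implementation; the particular choice of C1 and C2 shown in the example, tied to the dynamic range L, is a common convention and an assumption here.

```python
import numpy as np


def ssim(x, y, c1=0.0, c2=0.0):
    """SSIM between two equal-length 1-D signal windows; c1 = c2 = 0 gives the UQI."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    mu_x, mu_y = x.mean(), y.mean()
    # Sample statistics with the 1/(n - 1) factor used in the definitions above.
    var_x = np.sum((x - mu_x) ** 2) / (n - 1)      # sigma_x squared
    var_y = np.sum((y - mu_y) ** 2) / (n - 1)      # sigma_y squared
    cov_xy = np.sum((x - mu_x) * (y - mu_y)) / (n - 1)
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.uniform(0, 255, 121)
    b = a + rng.normal(0, 5, a.size)   # a slightly distorted copy of a
    print(ssim(a, a))                  # 1.0 for identical signals
    print(ssim(a, b))                  # slightly below 1 (this is also UQI(a, b))
    # A common convention (assumption): C1 = (0.01 L)^2, C2 = (0.03 L)^2 with L = 255.
    print(ssim(a, b, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2))
```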
Chapter 2
Image Engineering
Image engineering is a relatively new and developing interdisciplinary subject that systematically studies various image theories, technologies, and applications (Zhang 2009). It combines the principles of basic sciences such as mathematics, computer science, and optics with the accumulated experience in image applications, takes advantage of developments in electronics, integrates various image technologies, and addresses research and applications for the entire image field (Zhang 2017a, b, c, 2018a).
2.1
Image Engineering Technology
Image technology is a general term for various image-related technologies in a broad sense. Image engineering is the overall framework for comprehensive research and integrated application of various image technologies (Zhang 2017a, b, c). The research content of image engineering can be divided into three levels: image processing, image analysis, and image understanding (Zhang 2009).
2.1.1
Image Engineering
Image engineering [IE] A relatively new and currently growing discipline consisting of the study of various problems and developing efficient technology in the image field. It is an overall framework of combining the principles of basic sciences such as mathematics and optics with image technologies and applications. It contains a very rich content and a wide coverage. It can be divided into three layers according to the degree of abstraction and research methods: (1) image processing, (2) image analysis, and (3) image understanding. In other words, image engineering is an organic
combination of image processing, image analysis, and image understanding that are both connected and different and also includes engineering applications. Figure 2.1 shows the relationship and main features of the three layers.

[Fig. 2.1 Image engineering overall framework: three layers, from image processing (operand: pixel; data volume: bigger; abstraction level: lower), through image analysis (operand: object; middle level), to image understanding (operand: symbol/semantic; data volume: smaller; abstraction level: higher)]

Image technique Any of a wide range of techniques related to images, that is, techniques involving the use of computers and other electronic devices to perform various tasks on images. Such as image acquisition, gathering, encoding, storage and transmission; image synthesis and production; image display and output; image transform, enhancement, restoration (recovery), and reconstruction; image watermark embedding and extraction; image segmentation; target detection, tracking, expression, and description; target feature extraction and measurement; image and target characteristics analysis; sequence image correction and registration; 3-D scene reconstruction; image database creation, indexing, and retrieval; image classification, representation, and recognition; image model building and matching; image and scene interpretation and understanding; and decision-making and behavior planning based on the above results. In addition, techniques for hardware design and fabrication to accomplish the above functions may also be included. Classification scheme of literatures for image engineering A classification scheme based on the content related to image engineering literature. Such schemes can not only cover the theoretical background and technical content of relevant literature but also reflect the hotspots and focus of current research applications. Table 2.1 shows a scheme used in the classification of image engineering literature in a 25-year review series. It is divided into 5 categories and further into 23 sub-categories. Among them, the classes A5, B5, and C4 were added to the literature classification starting in 2000, the classes A6 and C5 were added starting in 2005, and the rest are derived from the initial classification of the series starting in 1995.
Table 2.1 Classification scheme of literatures in image engineering

Group IP: Image Processing
  P1: Image capturing (including camera models and calibration) and storage
  P2: Image reconstruction from projections or indirect sensing
  P3: Filtering, transformation, enhancement, restoration, inpainting, quality assessing, etc.
  P4: Image and/or video coding/decoding and international coding standards
  P5: Image digital watermarking, forensic, image information hiding, etc.
  P6: Image processing with multiple resolutions (super-resolution, decomposition and interpolation, resolution conversion, etc.)

Group IA: Image Analysis
  A1: Image segmentation, detection of edge, corner, interest points
  A2: Representation, description, measurement of objects (bi-level image processing)
  A3: Analysis and feature measurement of color, shape, texture, position, structure, motion, etc.
  A4: (2-D) object extraction, tracking, discrimination, classification, and recognition
  A5: Human face and organ (biometrics) detection, location, identification, categorization, etc.

Group IU: Image Understanding
  U1: (Sequential, volumetric) image registration, matching, and fusion
  U2: 3-D modeling, representation, and real-world/scene recovery
  U3: Image perception, interpretation, and reasoning (semantic, expert system, machine learning)
  U4: Content-based image and video retrieval (in various levels, related annotation)
  U5: Spatial-temporal technology (high-dimensional motion analysis, 3-D gesture detection, tracking, manners judgment and behavior understanding, etc.)

Group TA: Technique Applications
  T1: System and hardware, fast algorithm implementation
  T2: Telecommunication, television, web transmission, etc.
  T3: Documents (texts, digits, symbols)
  T4: Bio-medical imaging and applications
  T5: Remote sensing, radar, surveying, and mapping
  T6: Other application areas

Group SS: Summary and Survey
  S1: Cross-category summary (combination of image processing/analysis/understanding)
Image geometry Discrete geometry in digital images. The prototypical geometric structures in an image are various image regions, such as polygons in the Voronoi mosaic (Voronoi regions) and triangles in the Delaunay triangulation (Delaunay triangle regions). Image geometric nerve A collection of polygons (a geometric nerve) derived from a digital image and surrounding a central polygon (nucleus), where each non-nuclear polygon has an edge shared with the image nucleus. The image geometric nerve approximates the shape of the image target covered by the nerve.
Preprocessing Operations performed on images before they are formally processed, analyzed, and understood. This is a relative concept. Generally, in the multiple steps of a series of operations, the operations at the front end can be regarded as preprocessing relative to the subsequent operations. Typical preprocessing includes noise cancellation, feature extraction, image registration, extraction, correction, and normalization of the target of interest, and the like. The purpose of preprocessing is to improve the quality of the image and to complete the preparation that makes the subsequent operations more effective or more feasible. In image processing, an operation that eliminates some distortion or enhances certain features before a formal operation. Examples include geometric transformations, edge detection, image restoration, and more. In many cases, there is no clear boundary between image preprocessing and image processing. Mimicking The nature of the bionics approach used in the study of computer vision. Mimicking refers to following the structural principles of the human visual system, establishing corresponding processing modules, or producing visually capable devices to perform similar functions and work. Simulating The nature of the engineering methods used in the study of computer vision. Simulation refers to analyzing the function of the human visual process without deliberately simulating the internal structure of the human visual system, considering only the input and output of the system, and using any existing feasible technical means to achieve the required system functions. Parallel technique Image technology using parallel computing. In parallel techniques, all judgments and decisions need to be, or can be, made simultaneously and independently of each other. Compare the sequential technique. Task parallelism Parallel processing achieved by executing relatively large subsets of a computer program concurrently. Large subsets can be defined as having running times on the order of milliseconds. Parallel tasks do not need to be the same; for example, for a binary image, one task can calculate the area of the target while another task calculates the perimeter of the target contour. Image analogy 1. A technique or process that operates on an image, such as filtering. Given a pair of images A and A′, they can be viewed as the source image before the operation and the target image after the operation. Now, given another source image B before the operation, it is desirable to synthesize a new target image B′. The relationship between the above four images can be written as ("::" here represents "is similar to" or "is analogous to")

A : A′ :: B : B′
One way to achieve image analogy is to construct a Gaussian pyramid for each image, then search for a similar neighborhood in B′ from the pyramid constructed from A′, and at the same time establish a multi-resolution correspondence between the similar locations of A and A′. These similarities can be reflected in color and brightness values, as well as in other attributes such as local orientation. 2. A non-photorealistic rendering method. A typical method is to first calculate the Gaussian pyramid for all source and reference images, and then find the neighborhood similar to the neighborhood constructed in B′ in the filtered source pyramid generated from A′, while requiring that A and B have a corresponding multi-scale similar appearance or other characteristics.
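A heavily simplified, single-scale sketch of the image analogy idea is given below: for every pixel of B, the most similar gray-level neighborhood in A is found by brute-force search, and the corresponding pixel of A′ is copied into B′. This is only an illustration under stated assumptions; the entry's actual method works over Gaussian pyramids with multi-resolution correspondences and richer features, and the function and parameter names here are hypothetical.

```python
import numpy as np


def image_analogy_single_scale(A, A_prime, B, radius=2):
    """Given the pair A : A', synthesize B' for a new source B, pixel by pixel,
    by brute-force nearest-neighbor search over gray-level neighborhoods
    (single scale only; slow, purely illustrative)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    A_prime = np.asarray(A_prime)
    k = 2 * radius + 1
    Ap = np.pad(A, radius, mode="edge")
    Bp = np.pad(B, radius, mode="edge")

    # Every k x k neighborhood of A becomes one feature vector.
    ah, aw = A.shape
    feats = np.empty((ah * aw, k * k))
    for i in range(ah):
        for j in range(aw):
            feats[i * aw + j] = Ap[i:i + k, j:j + k].ravel()

    # For each pixel of B, find the most similar neighborhood in A and
    # copy the corresponding pixel of A' into B'.
    bh, bw = B.shape
    B_prime = np.empty((bh, bw), dtype=A_prime.dtype)
    for i in range(bh):
        for j in range(bw):
            patch = Bp[i:i + k, j:j + k].ravel()
            idx = int(np.argmin(np.sum((feats - patch) ** 2, axis=1)))
            B_prime[i, j] = A_prime[idx // aw, idx % aw]
    return B_prime
```

For instance, if A′ is a blurred or stylized version of A, the same transformation is transferred to B to produce B′.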
2.1.2
Image Processing
Image processing [IP] One of the three branches of image engineering. The emphasis is on the transformation between images. Although the term is often used to refer to various image technologies in general, image processing technology in the relatively narrow sense mainly performs various kinds of processing on images to improve the visual quality and effect of images and to lay a foundation for the automatic recognition of targets, or compresses images to reduce the image storage space or image transmission time (thus reducing the requirements on the transmission path). It is a generic term that covers all forms of processing of acquired image data. It also refers to a process that starts with one image and ends with another image. Digital image processing [DIP] The process of processing digital images by using a digital computer. It is generally automated and relies on carefully designed algorithms. This is a significant difference from the manual processing of images (also known as image manipulation), such as the use of brushes in photo-editing software to process photos. See image processing. Interactive image processing Image processing performed through human-computer interaction by means of human knowledge and ability. The operator can control the access to the original image and/or processing results, preprocessing, feature extraction, image segmentation, target classification and recognition, etc., to subjectively evaluate the processing effect and determine the next interaction. Analog/mixed analog-digital image processing Refers to the traditional case in which the image is processed, wholly or partly, in the form of an analog signal. It has now been largely replaced by digital image processing. Image processing operations The specific processing of the image. According to the space in which the processing is performed, spatial domain operations and transform domain operations
can be distinguished. The former can also be divided into global operations (full-image operations), local operations (neighborhood operations), and multiple-image operations. The latter can also be divided into frequency domain operations (via the Fourier transform), wavelet domain operations, and so on. Categorization of image processing operations A method of dividing and grouping operations in image processing. One classification names operations into three categories based on their dependence on the number and location of input pixels: type 0 operations, type 1 operations, and type 2 operations. Type 0 operations Image processing operations in which the output for one pixel depends only on a single corresponding input pixel. Also called point operations. Type 1 operations Image processing operations in which the output for one pixel depends not only on a single corresponding input pixel but also on its surrounding (multiple) pixels. Also called local operations. Type 2 operations Image processing operations in which the output for one pixel depends on all pixels of the entire image (such as spatial coordinate transformations). Also known as global operations. Image processing operator A function that can be applied to an image to transform it in a certain way. See image processing. Image manipulation See digital image processing. Image processing software Software for building image processing systems. Often composed of modules that perform specific tasks and depends on the software development language and development environment (e.g., the MATLAB editor is a common tool). Image processing toolbox [IPT] A collection of functions dedicated to image processing in MATLAB software. It extends the basic capabilities of the MATLAB environment for special image processing operations. Language of image processing A method of knowledge representation that embeds the knowledge to be expressed. Essentially, it is a program system that automatically calls functions and selects functions.
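The three operation types just defined can be illustrated with short NumPy/SciPy examples (a sketch, not from the handbook): a point operation (type 0), a neighborhood operation (type 1), and a global operation (type 2).

```python
import numpy as np
from scipy.ndimage import uniform_filter


def type0_negate(img, max_val=255):
    """Type 0 (point) operation: each output pixel depends only on the
    corresponding input pixel, e.g., gray-level negation."""
    return max_val - img


def type1_mean_filter(img, size=3):
    """Type 1 (local) operation: each output pixel depends on a neighborhood
    of the corresponding input pixel, e.g., mean filtering."""
    return uniform_filter(np.asarray(img, dtype=float), size=size)


def type2_spectrum(img):
    """Type 2 (global) operation: each output value depends on all pixels of
    the whole image, e.g., the magnitude of the 2-D Fourier transform."""
    return np.abs(np.fft.fft2(img))
```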
[Fig. 2.2 Schematic diagram of the image processing system]
Image processing hardware The hardware for building the image processing system. It mainly includes acquisition devices, processing devices, display devices, and storage devices. Communication devices are sometimes included. Image processing system Built around a computer (where most image processing work is done), it also includes the hardware and software for image acquisition, storage, display, output, and communication. The basic composition can be represented by Fig. 2.2. The seven modules (boxes) in the figure have specific functions, namely, image acquisition, image synthesis, (narrow-sense) image processing, image display, image printing, image communication, and image storage. Among them, acquisition and synthesis (which may not exist, hence the dotted frame) constitute the input of the system, and the output of the system includes display and printing (there can be only one of them). It should be noted that not all of these modules are included in every image processing system. For an image processing system that works independently, there may be no communication module (hence the dashed box). On the other hand, some special image processing systems may also include other modules.
2.1.3
Image Analysis
Image analysis [IA] One of the three branches of image engineering. It is the technique and process of detecting and measuring the object of interest in the image to obtain its objective information, thereby establishing a description of the image and the target. The mathematical model is combined with the image processing technology to analyze the underlying features and the higher-level structure, so as to extract information with certain visual perception and semantics. Compared with image processing that pays more attention to image visual effects, image analysis focuses more on studying the objective content of images. If image processing is a process from
image to image, then image analysis is a process from image to data. Here the data can be the result of a feature measurement of the object, or a certain symbolic representation based on the measurement, describing the characteristics and properties of the object in the image. Image analysis is also a generic term that includes analysis of all forms of image data. The result of a general image analysis operation is a quantitative description of the image content. Grouping transform An image analysis technique that combines image features together (e.g., based on collinearity, etc.). Image analysis system A system that uses equipment and technology to perform image analysis tasks. Since various analyses of images can generally be described and performed in the form of algorithms, and most of the algorithms can be implemented in software, many image analysis systems now require only a common general-purpose computer. Special hardware can also be used to increase the speed of the operation or to overcome the limitations of a general-purpose computer. The basic composition of an image analysis system can be represented by Fig. 2.3. The seven modules (boxes) in the figure have specific functions, namely, image acquisition, image preprocessing, image analysis, analysis data, content understanding, image communication, and image storage. The dotted box represents the periphery of the system. The input to the system comes from the results of image acquisition or image preprocessing (which can use image processing techniques), while the output of the system is analytical data or can be used for further image understanding. It should be noted that not all of these modules are included in every image analysis system. On the other hand, some special image analysis systems may also include other modules.

[Fig. 2.3 Schematic diagram of the image analysis system]

Microscopic image analysis In medical microscopic image studies, image analysis of small organisms composed of cells. Compare macroscopic image analysis.
Macroscopic image analysis In medical image research, an analysis of images of human organs such as the brain and heart is performed. Compare microscopic image analysis. Model-based vision A generic term that uses the target model that is expected to be seen in image data to aid in image analysis. This model allows for the prediction of more feature locations in other images, verifies a set of features that may be part of the model, and understands why the model appears in the image data.
2.1.4
Image Understanding
Image understanding [IU] One of the three branches of image engineering. The focus is, on the basis of image analysis, to further study the nature of each target in the image and the interrelationships between them, and, through the understanding of the image content, to obtain an explanation of the original objective scene, thereby guiding and planning actions. If image analysis is mainly observer-centered in studying the objective world (mostly studying observable things), then image understanding is, to a certain extent, centered on the objective world, grasping the whole objective world (including things that are not directly observed) by means of knowledge and experience. A generic term that refers to the high-level (abstract) semantic information of a scene obtained from an image or series of images. It is often used to mimic human visual ability. Theory of image understanding The theoretical framework for acquiring, expressing, processing, and analyzing, and finally understanding, image information. It is still under deep study. Agent of image understanding A method of knowledge representation. An image understanding agent can be naturally introduced based on the introduction of the expression of the target into the knowledge expression system. In image understanding, the agent can correspond to the target model or to image processing operations. Image understanding environment [IUE] A collection of C++-based data classes and standard computer vision algorithms. The motivation behind the development of IUE is to reduce the independent repetitive development of government-funded basic computer vision code. Image semantics A generic term that refers to the meaning or cognition of high-level content in an image. See semantic scene understanding.
[Fig. 2.4 Image understanding system model: the objective world, image acquisition, internal expression, image understanding, the computer, and the knowledge base]
Low-level scene modeling A class of methods for obtaining semantic interpretation of a scene. It directly expresses and describes the low-level attributes (colors, textures, etc.) of the scene, performs recognition on this basis by means of classification, and then infers the high-level information of the scene. Model for image understanding system A model of the systems, system modules, system functions, and system workflows for image understanding. A typical example model is shown in Fig. 2.4. The task of understanding visual information here begins with acquiring images from the objective world. The system works according to the acquired images and operates on the internal expressions of the system. The internal representation of the system is an abstract representation of the objective world, and image understanding is to associate the input and its implicit structure with the explicit structure already existing in the internal expression. The processing of visual information needs to be carried out under the guidance of knowledge. The knowledge base should be able to represent analogical, propositional, and procedural structures, allow quick access to information, support queries to analogical structures, facilitate transformation between different structures, and support belief maintenance, reasoning, and planning.
2.2
Similar Disciplines
As an interdisciplinary subject, image engineering has some closely related subjects (Davies 2005; Forsyth and Ponce 2012; Hartley and Zisserman 2004; Jähne and Hauβecker 2000; Peters 2017; Prince 2012; Ritter and Wilson 2001; Shapiro and Stockman 2001; Sonka et al. 2014; Szeliski 2010).
2.2.1
Computer Vision
Computer vision [CV] A discipline that uses computers to implement the functions of the human visual system. Many of the techniques used in the three levels of image engineering are used, and the current research content mainly corresponds to image understanding. It is also a general term for the computer community to represent the processing of image data. Some people further divide it into three levels: image processing, pattern recognition, and machine recognition. However, their boundaries are not very clear, and each profession has different definitions for them. However, it generally emphasizes: (1) basic theory of optics and surface; (2) statistical, characteristic, and shape models; (3) theory-based algorithms (contrast to commercial development algorithms); and (4) contents related to human “understanding” (as opposed to automation). Low-level vision In computer vision, the first stage of visual processing. The processing is local and independent of the content or purpose. The output is a set of local features that can be used to detect, characterize, and distinguish objects. It is also a somewhat less restrictive (still controversial) but general term that indicates the initial image processing stage in the vision system and can also refer to the initial image processing stage of the biological vision system. Roughly speaking, low-level vision refers to the first few steps of processing an intensity image. Some researchers use this term only to refer to operations from images to images. For most researchers, edge detection is the end of low-level vision and the beginning of mid-level vision. Early vision The first stage in the processing of the human visual system. A generic term for the initial steps of computer vision, such as image acquisition and image processing. The inverse problem of optical imaging problems. It includes a series of processes that restore the properties of 3-D visible surface physics from a 2-D light intensity array, such as edge detection and stereo matching. Since 3-D information has a lot of loss in projecting a 3-D world into a 2-D image, primary vision is an ill-posed problem. See low-level vision. Middle-level vision In computer vision, the second phase of visual processing. It is also the vision (data processing phase) between the lower and upper levels. There are a number of definitions that are generally considered to begin with a description of the image content until a description of the scene features is obtained. Thus, binocular stereo vision is a mid-level visual process because it acts on image edge segments to produce 3-D scene segments. High-level vision The third stage of visual processing in computer vision. The main emphasis is on using the features previously extracted from the image to obtain a description and
explanation of the scene. A generic term that stands for image analysis and image understanding. That is, these efforts focus on studying the causes and meanings of the scenes seen, rather than on basic image processing. Cognitive vision In computer vision, a branch focusing on the identification and classification of targets, structures, and events; the expression of knowledge for learning and reasoning; and the consideration of control and visual attention. 3-D vision A branch of computer vision. It characterizes the data obtained from a 3-D scene, for example, splitting the data into different surfaces and assigning the data to different models. Model-based computer vision A type of computer vision techniques that models the subject to be studied. A top-down control strategy is generally used to perform a generalized image matching of the target data structure obtained from the image with the model data structure. Bipartite matching A graph matching technique commonly used in model-based computer vision. It can be used to match model-based or stereo vision-based observations to solve correspondence problems. Assume that a node set V is divided into two disjoint subsets V1 and V2, namely, V = V1 ∪ V2 and V1 ∩ V2 = ∅. The arc set between the two subsets is E, with E ⊂ (V1 × V2) ∪ (V2 × V1). This is the bipartite graph. Bipartite matching is to find the largest matching in the bipartite graph, that is, the largest arc set without common vertices. Figure 2.5 gives an example where V1 = {A, B, C} and V2 = {X, Y}. The set of arcs found in the figure is indicated by solid lines, and the other arcs are indicated by dashed lines.

[Fig. 2.5 Two matching examples: a bipartite graph with V1 = {A, B, C} and V2 = {X, Y}]

Bipartite graph A special graph model in graph theory, also called an even graph. The set of vertices can be divided into two disjoint subsets, and the two ends of each edge belong to the two subsets, respectively. It can be used to describe a pyramidal vertical structure or in bipartite matching.
Knowledge-based computer vision A computer vision technology and process that utilizes a knowledge base to store information about a scene environment. Decision-making is achieved by testing hypotheses with a reasoning mechanism consisting of rules. Knowledge-based vision 1. A visual strategy that includes a knowledge database and a reasoning component. Typical knowledge-based vision is mainly rule-based, combining the information obtained by processing the image with the information in the knowledge database, inferring the hypotheses that should be generated and verified, deriving new information from the existing information, and deciding which new primitives need to be further extracted. 2. A style of image interpretation based on multiple processing components that can perform different image analysis processes, some of which can solve the same problem in different ways. The components are linked by an inference algorithm that understands the capabilities of the various components, such as when they are available and when they are not available. Another common component is a task-dependent knowledge expression component that is used to guide the inference algorithm. Some mechanisms for expressing uncertainty are also common, and they record the confidence of the processing results of the system output. For example, a knowledge-based vision system for a road network can be used for aerial analysis of a road network, including detection modules for straight road sections, intersections, and forest roads, as well as map surveying, road classification, curve connections, etc. Physics-based vision A branch of computer vision research. The visual process is studied based on accurate radiation transmission and color image imaging physical models, combined with detailed color and density measurements. It can also be said that it is an attempt to analyze images and video using physical laws or methods (such as optics, illumination, etc.). For example, in a polarization-based approach, the physical properties of the scene surface can be estimated by means of the polarization properties of the incident light and a detailed radiation model of the imaging. Interactive computer vision A branch of computer vision that focuses on integrating people and their knowledge and abilities into various computer vision technologies to improve the performance of various technologies through user interaction with computers. Axiomatic computer vision A method related to the core principles (axioms) of computer vision. Often related to the interpretation of the basic modules of ideal visual scene understanding in Marr's theory. Recently, it is often associated with low-level feature extraction (such as edge detection and corner detection). Computational vision See computer vision.
2.2.2
Machine Vision
Machine vision When a computer processes image data, an electronic device and optical sensing technology are used to automatically acquire and interpret images of the scene to control the process of the machine. In many cases, it is also synonymous with computer vision. However, computer vision focuses more on the theory and algorithms of scene analysis and image interpretation, while machine vision or robot vision pays more attention to image acquisition, system construction, and algorithm implementation. So there is a tendency to use “machine vision” to express actual visual systems, such as industrial vision, while “computer vision” to express more exploratory visual systems, or to express systems with human visual system capabilities. Robot vision Machine vision of the robot. The research goal is to build a system that enables the robot to have a visual perception function. The system acquires an image of the environment through a visual sensor and analyzes and interprets it through a visual processor, thereby enabling the robot to detect and recognize the object and perform a specific task. All automated vision systems that provide visual information or feedback to robotic devices. Robot vision is somewhat different from traditional computer vision. It can be part of a closed-loop system in which the vision system directs the motion of the robot so that its motion can be adjusted or corrected. Machine vision system [MVS] A system with machine vision that automatically captures a single or multiple images of a target; processes, analyzes, and measures the characteristics of the image; and interprets the measurement to make a decision on the target. Functions include positioning, inspection, measurement, identification, recognition, counting, estimation of motion, and more. Completeness checking One of the functions of a machine vision system. It is often carried out when the product is assembled to a certain stage to ensure that the parts are positioned in place and assembled correctly. Position detection One of the functions of a machine vision system. Can be used to control the robot to place product components in the correct position on the assembly line. More generally, it also represents the use of image technology to locate objects in the scene. Task-directed It belongs to a working mode that machine vision system can use. For example, as the task changes, the system should be able to adjust the parameters of each work module and the coordination between them to better accomplish the tasks assigned to the system.
Selective vision The machine vision system determines the camera’s attention point and field of view to obtain the corresponding image based on the existing analysis results and the current requirements of the visual task. Originally referred to the human visual system as the process of viewing the eye’s point of interest that can be determined based on the observation and subjective will. Active volume Volume of interest in machine vision applications. Active stereo A variant of traditional binocular vision. One of the cameras is replaced with a structured light projector that projects the light onto the target of interest. If the camera is calibrated, the triangulation of the 3-D coordinate point of target simply determines the intersection of a ray and a known structure in the light field. Industrial vision Refers to the general terminology of various processes and techniques for applying machine vision technology to industrial production. Practical applications include product inspection, process feedback, workpiece or tool positioning, etc. Various lighting and sensing technologies are used. A common feature of industrial vision systems is the fast speed of processing. Visual industrial inspection Use image technology for product quality control or control of industrial environments.
2.2.3
Computer Graphics
Computer graphics [CG] The use of computers to generate data in the form of graphics, diagrams, drawings, animations, and other forms of data information. Computer graphics attempts to generate (realistic) images from non-image-form data representations to visually display the data. The input object and output result of the computer graphics are exactly opposite to the input object and output result of the image analysis. Figure-ground separation The area representing the target of interest (graphics, foreground) is separated from the rest of the image (background). One example is shown in Fig. 2.6. It is often used when discussing visual perception. In image engineering, the term is generally replaced by image segmentation. Plane figures Flat area, mostly used for vector graphics. The plane figure is a closed set.
[Fig. 2.6 Figure-background separation: original image = figure/target + background]
Plane tiling The plane is filled with basic planar graphics, covering the entire plane without gaps or overlap. Vector map It consists of straight lines or curves and is suitable for representing graphics. In the vector map, each target corresponds to a mathematical equation. A vector map can be resized and geometrically manipulated without artifacts. Such a representation requires less memory for representing the image, but it needs to be rasterized for most display devices. Graphic processing unit [GPU] A device used to process graphics. Due to its powerful parallel processing capability, it has been widely used in all levels of image engineering in recent years, especially in various occasions where neural network computing is used. GPU algorithms An algorithm implemented with a programmable GPU that provides pixel rendering capabilities. Can be used to quickly implement a variety of image engineering techniques, including object segmentation and object tracking, motion estimation, and more. Wrapping mode A corresponding designation in computer graphics for the filling (technique) used in the template calculation in image engineering to overcome the boundary effect. Modal deformable model A deformation model based on modal analysis (i.e., the study of the shapes that a target may have). Graphic technique A general term for technologies related to graphics in a broad sense. It belongs to the content of computer graphics research.
Graphical user interface [GUI] A typical, currently widely used user interface. The motion control unit of the different purpose functions is represented by a window, an icon, a button, and the like. The user performs an option through a pointing device such as a mouse. Computer graphics metafile [CGM] A file format that is independent of the device and operating system. Used for picture information with many graphics primitives. It is a standard of the American National Standards Institute (ANSI) and is mainly used in the field of computer-aided design. Bidirectional texture function [BTF] A general method of modeling textures in computer graphics (especially in texture synthesis). It is particularly suitable for changing the apparent texture with illumination and viewing angles. Modeling is performed similarly to the bidirectional reflectance distribution function, which indexes the texture output by means of incident illumination and viewing angle. The difference from the bidirectional reflectance distribution function is that the response value is only a simple reflectance at one point, and the response value of the bidirectional texture function corresponds to the texture characteristics of the entire image. Texture addressing mode A corresponding designation in computer graphics for the filling (technique) used in the template calculation in image engineering to overcome the boundary effect. Image composition A technique and process for creating new images by superimposing, combining, blending, stitching, and assembling multiple images. Sometimes it refers to the techniques and processes of generating images using computer graphics. Image-based rendering [IBR] A technique that uses a set of real images to produce scenes that are viewable from any viewpoint. In other words, this technique requires storing a large number of images obtained from different viewpoints, so that a combined image of arbitrary viewpoints can be obtained by means of resampling or interpolation. Image technology and graphics technology are used here to draw a new perspective field of view directly from the input image without knowing the complete 3-D geometry model. See view interpolation. A technique that combines 3-D reconstruction techniques in computer vision with multiple visualizations of a scene in computer graphics to create interactive immersive realistic experiences. Rendering A process of applying a series of editing elements to an image or scene to achieve a visual representation of the data. Light transport In image rendering technology, a mathematical model of energy transfer at an interface where the visible surface of the target changes.
Volume rendering A technique for displaying 3-D discrete sampled data with 2-D projection. Non-photorealistic rendering [NPR] Drawing with sample-based texture synthesis. Typical methods include texture transfer and image analogy.
2.2.4
Light Field
Light field Originally, the amount of light passing through each point in each direction. Now it refers to a 4-D optical radiation field that contains both position and orientation information in space. Light field imaging data have two more dimensions than traditional imaging data, so in the image reconstruction process, richer image information can be obtained. Light field technology refers to image-based rendering technologies that can be used to draw complex 3-D scenes from a new perspective. It allows drawing new (unobstructed) scene views based on the image from any of these locations. It can also be regarded as a field obtained by encoding the radiance at points in space as a function of point position and illumination direction. It also represents the entire light environment, including light traveling in all directions at each point in space. Ray space Same as light field. Surface light field A function defined for each point x on the surface, which assigns an RGB color C(x, d) to each ray d exiting from the point x. This concept is mainly used in computer graphics for realistic rendering of surfaces with highlights. It is also a way of representing objects. First, establish an accurate 3-D model of the object to be represented, and then estimate or collect the lumispheres of all the rays exiting from each surface point. Neighboring lumispheres will be highly correlated and thus suitable for compression and processing. In fact, if the object surface is completely diffuse, all rays passing through a given point will have the same color (ignoring occlusion). Therefore, the light field will collapse into a common 2-D texture map defined on the object surface. Conversely, if the surface is fully specular (such as a mirror), each surface point reflects only very little ambient light. In the case of no mutual reflection (such as a convex object in open space), each surface point only reflects the far-field environment map, which is still 2-D. It can be seen that it is advantageous to re-parameterize the 4-D light field onto the object surface.
Lumisphere The result obtained when a surface light field stores the light emitted from each surface point in all visible directions. Similar lumispheres have a high correlation and are relatively easy to compress and manipulate. Lumigraph 1. A device that generates colored light beams when a person presses a screen, invented by Fischinger in the 1940s. In computer vision in recent years, it refers to the representation used to store a subset of plenoptic functions (i.e., optical flow in all positions and directions). 2. An image-based rendering technology. When the 3-D geometry in the scene is known, using lumigraph technology can achieve better rendering effects (less ghosting) than light field technology. The lumigraph is also a 4-D space, corresponding to the volume of the space through which all light passes. Splatting A technique that can be used for stereo rendering/drawing, which is characterized by trading quality for speed. When performing interpolation calculations, it can also be used for forward mapping to solve the problem of mapping integer coordinates to non-integer coordinates. Specifically, the mapped values are weighted onto the nearest four neighbor pixels, and the weights of the respective pixels are recorded and normalized after the full mapping. This method achieves some balance between blurring and aliasing. Plenoptic function A 5-D function. Suppose you observe a still scene (i.e., both the target and the light source are fixed, and only the observer moves); each image can then be described by the position and orientation of a virtual camera with six degrees of freedom and its intrinsic characteristics (such as focal length). However, if a 2-D ball image is acquired around each possible camera position, any view can be drawn based on this information. Therefore, by taking the cross product of the 3-D space of camera positions and the 2-D space of ball images, a 5-D function called the plenoptic function can be obtained. If there is no light scattering in the scene (e.g., no smoke or fog), all coincident rays along a free-space portion (between the rigid bodies or refractors) will have the same color value; in this case, the 5-D plenoptic function reduces to a 4-D light field of all possible rays. See plenoptic function representation. Plenoptic function representation A parameterized function that describes all the scenes that can be seen from a given point in space. It is a basic expression in image-based rendering. Plenoptic sampling Sampling of the light field function (i.e., the plenoptic function). This term also often refers to the sampling of surface and texture information in a scene.
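The splatting entry above describes weighting each forward-mapped value onto the four nearest integer pixels and normalizing the accumulated weights after the full mapping. The following is a minimal sketch of that idea (an illustration with assumed function and parameter names, not the handbook's implementation).

```python
import numpy as np


def splat(values, xs, ys, out_shape):
    """Forward-map scattered samples with non-integer target coordinates (xs, ys)
    onto a regular grid: each value is weighted onto its four nearest pixels with
    bilinear weights, and the accumulated weights are normalized at the end."""
    acc = np.zeros(out_shape, dtype=float)
    wsum = np.zeros(out_shape, dtype=float)
    h, w = out_shape
    for v, x, y in zip(values, xs, ys):
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        fx, fy = x - x0, y - y0
        for dx, dy, wgt in ((0, 0, (1 - fx) * (1 - fy)),
                            (1, 0, fx * (1 - fy)),
                            (0, 1, (1 - fx) * fy),
                            (1, 1, fx * fy)):
            xi, yi = x0 + dx, y0 + dy
            if 0 <= xi < w and 0 <= yi < h:
                acc[yi, xi] += wgt * v
                wsum[yi, xi] += wgt
    # Pixels that received no weight stay at zero.
    return np.divide(acc, wsum, out=np.zeros_like(acc), where=wsum > 0)
```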
Plenoptic modeling A method of interpolating new perspectives with the plenoptic function. A columnar (or spherical, or cube) image is used to store a pre-computed rendered image or a real-world image. Image-based visual hull The visual shell is calculated with the principles in image-based rendering. For each new viewpoint, the image-based visual shell is recalculated by sequentially intersecting the line-of-sight segments with the binary silhouette in each image (rather than pre-calculating a global 3-D model). This not only allows for quick calculations but also allows the texture of the model to be quickly restored using the color values in the input image. Point-based rendering A method of drawing an image based on a point primitive in which the primitive is not a polygon face but a 3-D point. Unstructured lumigraph A method of rendering when the acquired images are obtained in an irregular manner. The drawing is performed directly from the acquired image by determining the closest pixel in the original image in each ray of the virtual camera. Unstructured lumigraph rendering [ULR] system A system for implementing a method of drawing an unstructured light map. Among them, multiple fidelity criteria are used in combination with the closest pixel selection, including pole consistency (considering the distance between the light and the center of the source camera), angular deviation (considering similar incident directions on the surface), resolution (considering similar sampling densities along the surface), continuity (with the next pixel), consistency along the ray, etc. Gouraud shading A method of successively coloring planes in 3-D space. It is slower than flat shading, but the effect is much better. It uses the illumination model for each vertex of a polygon and then spreads over the entire surface to give a smooth gradient of the rendered surface. Flat shading A simple way to color a plane in 3-D space. It renders only one color for a 3-D plane, so it is faster than the Gouraud shading, but the effect is much worse. Video-based rendering Generate a new video sequence from the captured video clips. It also refers to generating virtual camera motion (sometimes called free-viewpoint video) from a set of synchronized video cameras placed in the studio.
2.3 Related Subjects
Image engineering is also related to many disciplines, and its development is supported by new theories, new tools, and new technologies (Barnsley 1988; Chui 1992; Goodfellow et al. 2016; Marr 1982).
2.3.1 Fractals
Fractals A discipline that studies the geometric features, quantitative representations, and universality of fractals. It is a branch of mathematics in nonlinear science. It has been applied in both texture analysis and shape analysis. Fractal dimension A measure of the roughness of the target shape. Consider the lengths L1 and L2 measured for a curve on two scales (S1 and S2). If the curve is rough, its length increases as the scale increases. The fractal dimension of the curve is D = log(L1/L2)/log(S1/S2). Fractal measure Same as fractal dimension. Hausdorff dimension The dimension used to define fractal sets. The full name is the Hausdorff-Besicovitch dimension. Also known as fractal dimension. The Hausdorff dimension is a real number, which is commonly interpreted and calculated using box-counting methods. The value of the Hausdorff dimension is a single value that brings together various details of the target. It can describe local roughness and the surface texture. It can also be used as a measure of the roughness of a shape. Consider two length values (L1 and L2) measured on two scales (S1 and S2) for a curve. If the curve is rough, the length value will increase as the scale increases (the scale becomes finer). Fractal dimension D = log(L1/L2)/log(S1/S2). Box-counting dimension Fractal dimension estimated by the box-counting approach. Box-counting approach A common method of estimating the fractal dimension, a measure connecting a certain measurement (such as length or area) and the value of the unit cell (such as unit length, unit area, etc.) used for performing that measurement. Let S be a set in 2-D space, and let N(r) be the number of open circles (discs that do not contain their circumference) of radius r required to cover S. The open circle centered at (x0, y0) can be expressed as the set {(x, y) ∈ R^2 | [(x – x0)^2 + (y – y0)^2]^(1/2) < r}, where R is the set of real numbers. The box-counting method counts the number of open circles required to cover S, and the expression for the fractal dimension d thus obtained is N(r) ~ r^(–d), where the "~" denotes proportionality. If a proportionality coefficient is introduced, the above formula can be written with an equal sign. In spaces of different dimensions, line segments, squares, spheres, etc. can also be used as unit primitives. Area of spatial coverage Same as area of influence. Area of influence A collection of points within a maximum distance (a given value) from a point or a target. Also called the space coverage area. Let this maximum distance be D. To calculate the influence area of a target, it is necessary to dilate the target with a disc of diameter D. The fractal dimension of the target can be calculated by means of the area of influence. To this end, a method similar to the box-counting method is needed to analyze how the affected area increases as the diameter D increases. For a 2-D target, the sum of the slope of the log-log curve of the influence region as a function of diameter D and the fractal dimension d is equal to two. See distance transform. Fractal A statistically self-similar object; objects that are statistically self-similar at any resolution satisfy the fractal model. Fractal representation Expression based on self-similarity. For example, a fractal representation of an image can be made based on the similarity of pixel blocks. Fractal surface A surface model that is progressively defined using fractals, that is, a surface showing self-similarity at different scales. Fractal set A set whose Hausdorff dimension is strictly larger than its topological dimension. In Euclidean space, the Hausdorff dimension of a set is always greater than or equal to the topological dimension of the set. Fractal model A model of geometric primitives (fractals) with self-similarity and irregularity in nature. A fragment of a fractal target is an exact or statistical copy of the entire target and can be matched to the whole by stretching and moving. The fractal dimension, as a measure of complexity or irregularity, is one of the most important features of the fractal model. It can be estimated from the density of the Fourier power spectrum, or by the box-counting method, and can also be estimated in a texture image by means of Gabor filters. Lacunarity is another important measure of the fractal model used
Fig. 2.7 The initial steps (a)–(d) in constructing the Koch triadic curve
to measure structural changes or nonuniformities, which can be calculated by the sliding box algorithm. Koch's triadic curve A typical curve with a fractal dimension. Referring to Fig. 2.7, a straight line (Fig. 2.7a) is first divided into three equal parts, and the middle part is then replaced by two segments of the same length, giving a connected chain of four equal-length segments (Fig. 2.7b). By repeating the above process for each of the four segments, Fig. 2.7c and d are obtained in sequence. This process can be repeated indefinitely, and the result is called Koch's triadic curve; its fractal dimension is log4/log3 ≈ 1.26. If one starts from an equilateral triangle and applies the construction to each side, one gets the Koch snowflake curve, whose length is infinite but whose enclosed area is finite.
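As a concrete illustration of the box-counting approach described above, the following is a minimal sketch (not from the handbook) that estimates the box-counting dimension of a binary image by counting occupied boxes at several box sizes and fitting the slope of the log-log relationship.

```python
import numpy as np

def box_counting_dimension(binary_img, box_sizes=(2, 4, 8, 16, 32)):
    """Estimate the box-counting (fractal) dimension of a 2-D binary image."""
    h, w = binary_img.shape
    counts = []
    for s in box_sizes:
        # Count the s-by-s boxes that contain at least one foreground pixel.
        n = 0
        for i in range(0, h, s):
            for j in range(0, w, s):
                if binary_img[i:i + s, j:j + s].any():
                    n += 1
        counts.append(n)
    # N(s) ~ s^(-d)  =>  log N = -d log s + const; fit a line to estimate d.
    slope, _ = np.polyfit(np.log(box_sizes), np.log(counts), 1)
    return -slope
```

For a filled square the estimate approaches 2, for a smooth curve it approaches 1, and for a Koch-like curve it approaches roughly 1.26.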
2.3.2 Topology
Topology A mathematical branch that studies the topological properties of a graph that are unaffected by distortion (not including tearing or pasting). Its properties include: 1. A set of point-set (e.g., surface) properties that do not change under re-parameterization (homeomorphic mapping) of continuous spaces. 2. The connectivity of the target in discrete geometry (see topological representation). The topology of a network refers to the set of connections in the network and is also equivalent to the set of neighborhood relationships that describe the network. Topology space A non-empty set X together with a topology τ (a collection of open subsets of X) defined on it. Topological dimension The number of degrees of freedom of a point in certain sets of points. Also refers to the number of degrees of freedom of the corresponding set. It always takes an integer value. For example, the topological dimension of a point is 0, the topological dimension of a curve is 1, the topological dimension of a plane is 2, and so on. Topological representation A representation that records how the elements are connected. For example, in a surface boundary expression containing faces, edges, and vertices, the topological representation is a face-edge and edge-vertex join table, independent of the geometry
(or spatial position and size) of the expression. In this case, the basic connection is "defined by," that is, a face is defined by one or more edges, and one edge is defined by zero or more vertices. Topological descriptor A descriptor that describes the topological properties of a region in an image. A typical example is the Euler number. Topological property A property of a class of 2-D region or 3-D scene structures that is independent of distance and has no connection with the nature of the measure in space. In other words, properties that do not change under general geometric transformations (such as rotation, translation, and scaling). This type of property does not rely on other properties based on distance measurements. In image engineering, obtaining the topological properties of a region facilitates a global description of the region. It mainly reflects the structural information of the region. Common topological properties include adjacency, connectivity, connectedness, number of regions, and so on. The Euler number of the region, the number of regions in the scene, and the connectivity of the scene are also typical topological properties. Rubber sheet distortion Deformation studied in topology that does not affect topological properties. Homotopic transformation A set of transformations that do not change the topological properties of the continuous association between regions and holes in the image. This continuous association can be expressed by a homotopic tree structure, where the root corresponds to the background in the image, the first-layer branches correspond to the target regions, and the second-layer branches correspond to the holes in the targets. The homotopic transformation can also be seen as a transformation that does not change the homotopic tree. It also refers to a continuous deformation process or method that maintains the connectivity of a target feature, such as a skeleton. If two different targets can be made the same through a series of homotopic transformations, then the two are said to be homotopic. Homotopic skeleton A skeleton that maintains connectivity. Homotopic tree See homotopic transformation. Digital topology A mathematical branch that studies the local relationships and properties between points within a discrete target. In other words, it is topological research in the digital domain (such as digital images): how objects are connected and arranged. See connectivity.
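As a small illustration of the topological descriptor entry above, the Euler number of a binary image (number of connected components minus number of holes) can be estimated with standard labeling tools. This is a hedged sketch using SciPy, not a procedure prescribed by the handbook; the connectivity choices are assumptions.

```python
import numpy as np
from scipy import ndimage

def euler_number(binary_img):
    """Euler number = (connected components) - (holes) of a 2-D binary image."""
    # Foreground components (4-connectivity here; 8-connectivity is also common).
    _, n_components = ndimage.label(binary_img)
    # Holes: background components that do not touch the image border.
    padded_bg = np.pad(~binary_img.astype(bool), 1, constant_values=True)
    bg_labels, n_bg = ndimage.label(padded_bg)
    border_labels = set(bg_labels[0, :]) | set(bg_labels[-1, :]) | \
                    set(bg_labels[:, 0]) | set(bg_labels[:, -1])
    n_holes = n_bg - len(border_labels - {0})
    return n_components - n_holes

# Example: a ring-shaped region has one component and one hole, so Euler number 0.
```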
Digital Voronoï topology A collection of open Voronoï regions embedded in a digital image that satisfy the properties of the digital topology. Computational topology The extraction from digital images of information useful for the detection and classification of image objects and their shapes, carried out primarily with three tools: geometry, topology, and algorithms. Among them, geometry provides concrete realizations of topological structure, and algorithms provide constructive methods whose complexity meets the requirements of practical applications. Topology inference 1. Building a relationship graph between image structures, such as the overlap of the same part of the scene in different images. 2. Constructing a relationship graph between the nodes in a probabilistic graphical model. 3. Inference of the connectivity of a communication network. Common to these definitions is the inference of a graph or network from the data.
2.3.3 Virtual Reality
Virtual reality [VR] The portrayal, in a virtual world, of a human's 3-D view of reality. Computer graphics and other interactive tools are used to give the user a sense of being in a particular environment and interacting with that environment. This includes simulations of sight, hearing, touch, etc. The visual environment can be achieved by drawing a 3-D model of the world into a head-mounted display device (helmet) in which the viewpoint is tracked in 3-D space so that the movement of the user's head generates an image of the corresponding viewpoint. In addition, the user can be placed in a computer-enhanced virtual environment (such as the Cave Automatic Virtual Environment, CAVE) so that the user's field of view can be manipulated as much as possible by the controlling computer. Virtual reality markup language [VRML] Same as virtual reality modeling language. Virtual reality modeling language [VRML] A scene modeling language used to build real-world scene models or fictional three-dimensional worlds. It can also be seen as a way to define a 3-D geometry model for network delivery. Fish tank virtual reality Head tracking is used to control the user's virtual viewpoint to view a 3-D target or environment on a computer monitor. The user then observes the 3-D objective world as if looking into a fish tank.
Virtual Something that is nearly, or in effect, what is described, though not strictly conforming to the definition of what is described. Virtual view Visualization of the model from a special viewpoint. Virtual endoscopy A technique for simulating a conventional endoscopic process using a virtual reality representation of physiological data (available with CT and MRI). The effect is similar to that of a real endoscopic examination. Reality Everything we experience as human beings. Mixed reality The image data includes both raw image data and computer graphics superimposed thereon. See augmented reality. Augmented reality [AR] Originally refers to the projection method of adding a graphic as an overlay to the original image. For example, a firefighter's helmet display can show the exits of the building being viewed. Further, a virtual 3-D image or annotation is superimposed onto the live video, which can be achieved by using see-through glasses (a head-mounted display) or a rectangular computer or mobile device screen. In some applications, enhancements are achieved by tracking special patterns printed on cards or books. For computer applications, a camera embedded in an enhanced mouse can be used to track the grid points printed on the mouse pad to give the user complete control of position and orientation with six degrees of freedom in 3-D space.
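Augmented reality overlays of the kind described above typically require estimating the camera pose from tracked 2-D points of a known planar pattern. The following is a hedged sketch using OpenCV's solvePnP; the marker size, camera matrix values, and variable names are illustrative assumptions, not from the handbook.

```python
import numpy as np
import cv2

# 3-D corners of a known square marker of side 5 cm, in the marker's own frame.
marker_side = 0.05
object_points = np.array([[0, 0, 0],
                          [marker_side, 0, 0],
                          [marker_side, marker_side, 0],
                          [0, marker_side, 0]], dtype=np.float32)

# 2-D corner positions tracked in the current video frame (illustrative values).
image_points = np.array([[320, 240], [400, 245], [395, 320], [315, 315]],
                        dtype=np.float32)

# Assumed intrinsic parameters (focal lengths and principal point, in pixels).
K = np.array([[800, 0, 320],
              [0, 800, 240],
              [0, 0, 1]], dtype=np.float32)
dist_coeffs = np.zeros(4)   # assume negligible lens distortion

# Exterior parameters: rotation (as a Rodrigues vector) and translation.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist_coeffs)
if ok:
    # Project a virtual 3-D point (floating above the marker) into the image
    # so that graphics can be drawn at the correct pixel location.
    virtual_point = np.array([[marker_side / 2, marker_side / 2, -0.05]],
                             dtype=np.float32)
    pixel, _ = cv2.projectPoints(virtual_point, rvec, tvec, K, dist_coeffs)
    print("Overlay the virtual object at pixel:", pixel.ravel())
```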
2.3.4 Others
Topography A general term for various forms of the Earth's surface. It is one of the physical properties of the ground that can be obtained directly from the pixel values of synthetic aperture radar images. Typical topography includes plateaus, plains, basins, mountains, and the like. Geomorphology A discipline that studies the morphological characteristics of the Earth's surface. In image engineering, the image f(x, y) is regarded as a height above the XY plane, so that the image can be described and processed using concepts and methods from topography.
Cartography The discipline that studies maps and map surveying. Automated mapping requires the development of algorithms that reduce manual operations and labor in map construction. Reverse engineering In vision systems, the process and model of generating a 3-D object model from a set of views (images), such as a VRML model or a triangulated mesh model. The model can be purely geometric, that is, describing only the shape of the target; the shape and texture properties can also be combined to describe the appearance of the target. There are many reverse engineering techniques for both depth and grayscale images. Inverse kinematics [IK] In human body modeling and tracking, the science of inferring the joint angles between limbs of the skeleton from the positions of the visible surface points of the human body. See forward kinematics. Inverse problem The problem of recovering, from generated data, the description of interest that produced it. For example, computer graphics generates images from a target scene description by modeling an image forming process; this is a typical forward (non-inverse) problem. Conversely, computer vision addresses the corresponding inverse problem, that is, obtaining the description of interest from the data (in this case, the image). The inverse problem is often an ill-posed problem, sometimes solved by regularization theory. Many problems in image engineering are inverse problems (the inverse of a forward problem), that is, given one or more images (projections of the scene), a scene or a description of the scene that gives rise to the images is reconstructed or restored. From a functional perspective, this can be seen as estimating the unknown model parameters from the data rather than generating data from the model (as in computer graphics). The input and output of a forward problem are exchanged in its inverse problem; the forward and inverse problems form a pair. Here are some examples: 1. The forward and inverse versions of various transforms (such as the Fourier transform) form a transform pair. 2. Recovering the physical properties of visible 3-D surfaces from a 2-D array of light intensities is the inverse of optical imaging; some such situations are ill-posed. 3. Computer graphics and computer vision are reciprocal: computer vision attempts to extract 3-D information from 2-D images, whereas computer graphics uses 3-D models to generate 2-D images of scenes. 4. Computer graphics and image analysis are reciprocal. Many graphics can be generated from the results of image analysis. See forward problem.
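The entry above notes that inverse problems are often ill-posed and may be solved with regularization theory. Below is a minimal, hedged sketch (not from the handbook) of Tikhonov-regularized deblurring, a classic image-engineering inverse problem, solved in the Fourier domain; the blur kernel and regularization weight are illustrative assumptions.

```python
import numpy as np

def tikhonov_deblur(blurred, kernel, lam=1e-2):
    """Recover an image from 'blurred = kernel * image + noise' (circular blur).

    Minimizes ||K x - y||^2 + lam ||x||^2; in the Fourier domain this gives
    X = conj(H) Y / (|H|^2 + lam), a simple regularized inverse filter.
    """
    H = np.fft.fft2(kernel, s=blurred.shape)   # transfer function of the blur
    Y = np.fft.fft2(blurred)
    X = np.conj(H) * Y / (np.abs(H) ** 2 + lam)
    return np.real(np.fft.ifft2(X))

# Illustrative use: a 5x5 box-blur kernel placed at the origin of a 64x64 grid.
kernel = np.zeros((64, 64))
kernel[:5, :5] = 1.0 / 25.0
```

Without the regularization term (lam = 0), the division by |H|^2 amplifies noise wherever the blur suppresses frequencies, which is exactly the ill-posedness mentioned in the entry.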
Inverse imaging problem The process of adjusting image model parameters to restore the image data. Examples include deconvolution, inverse halftoning, and decompression. See inverse problem. Inverse rendering A reverse-engineering technique for obtaining the geometry of a real object in a scene from its image. These techniques include backlit transmission and retroreflection. Inverse rendering is often associated with reconstructing shapes from shadows. Forward problem The problem of inferring the result from the cause. Forward problems often correspond to the causality of physical systems and generally have a unique solution. However, in pattern recognition, it is often necessary to solve the inverse problem, that is, to determine the cause after knowing the result. If the forward problem contains a many-to-one mapping, then the inverse problem will have multiple solutions. Backward problem See forward problem. Computational geometry [CG] An algorithmic approach to the study of geometric structures. In computational geometry, algorithms (stepwise methods) are introduced to construct and analyze the lines and surfaces of targets, especially real-world targets. Its focus is on how to use computers to build, detect, and analyze points, lines, polygons, (2-D) smooth curves, (3-D) polyhedra, and smooth surfaces. Visualization computation A technique that combines computation and visualization. It is an important part of visualization in scientific computing. Visualization in scientific computing The introduction of image technology and graphics technology into the field of scientific computing. It includes the related theories, methods, techniques, and processes, which involve transforming the data and calculation results generated in the scientific computing process into images or graphics for display and interactive processing. The visualization of scientific computing can be used not only to help analyze the final data calculated by the computer but also to visually understand the changes in the data during the calculation process and help people make corresponding control and decision-making choices. Computer-aided design [CAD] 1. A general term for design activities such as designing, drawing, modeling, analyzing, and writing technical documents for a product or project. Specifically, it involves the use of computers for calculations, information storage, and drawing, storing the results in the memory or external storage of the computer,
enabling rapid retrieval, or displaying the results in graphical form, so that designers can make judgments and modifications in a timely manner. The computer performs data processing such as editing, enlarging, reducing, panning, and rotating graphics. 2. A general term for computer assistance in the design process of a target, especially the hierarchical and explanatory aspects of the unit of work. For example, most mechanical components are currently designed with computer aid. 3. Used to distinguish products designed with the help of a computer. Aerial photography Generally includes aerial reconnaissance (or surveillance) and aerial surveying. Aerial reconnaissance or surveillance identifies certain targets on the ground, mainly for military purposes. Aerial surveying generally uses a dedicated aerial surveying instrument for military and civilian use, such as resource exploration, prospecting, and urban planning. Architectural reconstruction A type of reconstruction that mainly uses aerial photography to automatically reconstruct buildings based on geometric models and 3-D lines and planes. Architectural modeling Modeling ground buildings from images, especially from aerial photographs. Typical architecture is characterized by large planar regions and other parametric forms (such as surfaces of revolution), which are oriented perpendicular to gravity and perpendicular to each other. Architectural model reconstruction A general term for reverse-engineering reconstruction based on collected 3-D data and a library of building constraints.
Chapter 3
Image Acquisition Devices
Image acquisition is achieved with the help of certain equipment, such as a camera. The equipment converts objective scenes into images that can be processed by computers, so it is also called imaging equipment (Davies 2005; Jähne 2004; Goshtasby 2005).
3.1 Device Parameters
The input of an image acquisition device when it is working is an objective scene, and the output is an image that reflects the nature of the scene. The camera is the most common image acquisition device, and its performance can be described by multiple parameters (Goshtasby 2005; Tsai 1987; Zhang 2017c).
3.1.1 Camera Parameters
Image capturing equipment Equipment that converts an objective scene into an image that can be processed by a computer. Also called imaging equipment. The main components currently used include charge-coupled devices (CCDs), complementary metal oxide semiconductor (CMOS) devices, charge injection devices (CIDs), and so on. Two types of devices are required for various image acquisition equipment: sensors and digitizers. Acquisition device Any device used to acquire images in image engineering. Interior camera parameter 1. In the pinhole camera model, one of the following six parameters: the focal length, the radial distortion parameter, the scaling factor in the X and Y directions,
Fig. 3.1 Interior parameters of camera
and the x and y coordinates of the image principal point (the vertical projection of the projection center on the image plane and also the center of the radial distortion). 2. In the telecentric camera model, one of the following five parameters: the radial distortion parameter, the scaling factor in the X and Y directions, and the x and y coordinates of the image principal point (only the center of the radial distortion). 3. In line scan cameras, one of the following nine parameters: the focal length, the radial distortion parameters, the scaling factors in the X and Y directions, the x and y coordinates of the image principal point (the vertical projection of the projection center on the image plane, and also the center of the radial distortion), and the three components along the X, Y, and Z directions of the 3-D motion vector of the camera relative to the scene. Interior parameters of camera Parameters that reflect the characteristics of the camera itself. They are closely related to the relationship between the pixel coordinate system and the image plane coordinate system. As shown in Fig. 3.1, (o, u, v) is the pixel coordinate system, (c, x, y) is the image plane coordinate system, (u0, v0) are the coordinates of the point c in the pixel coordinate system, and θ is the angle between the u and v axes. It can be seen from the figure that the origin of the pixel coordinate system does not coincide with the origin of the image plane coordinate system and that the two coordinate axes of the pixel coordinate system are not necessarily at right angles. In addition, the actual pixels are not necessarily square, so it can be assumed that the metric values of the units on the u and v axes in the image plane coordinate system are ku and kv, respectively. The above five parameters (u0, v0, θ, ku, kv) are the interior parameters of the camera. Intrinsic parameter Parameters such as focal length, radial lens distortion, and principal point position, which describe the mapping from a light ray in the world coordinate system to the image plane coordinate system in the camera. Determining the parameters of this mapping is the job of camera calibration. For pinhole cameras, the world ray r is mapped to the homogeneous image coordinates x as x = Kr, where K is the following 3 × 3 upper triangular matrix:
K = | αu f    s     u0 |
    |  0     αv f   v0 |
    |  0      0      1 |
where f is the focal length, s reflects the angle of inclination (skew) between the image coordinate axes, (u0, v0) are the coordinates of the principal point, and αu and αv are the aspect ratio factors in the u and v directions. Internal coefficient A parameter indicating the characteristics of the camera itself (inside the camera), such as the camera focal length, lens radial distortion, and uncertainty image scale factor. It also refers to the self-characteristic parameters of various optical devices. See interior camera parameter. Internal parameter Same as internal coefficient. Uncertainty image scale factor An internal parameter of the camera. It describes the inconsistency between the distance between adjacent pixels of the camera in the horizontal direction and the distance between adjacent photosensitive points; this is due to the time difference between the image acquisition hardware and the camera scanning hardware or due to the inaccuracy of the camera's own scanning time. Resolving power The resolution of the camera. Corresponds to the spatial resolution of the image. If the camera imaging unit has a pitch of Δ in mm, the resolution of the camera is 0.5/Δ, in line/mm. For example, a charge-coupled device camera whose imaging unit array has a side length of 8 mm and a total of 512 × 512 imaging units has a resolution of 0.5 × 512/8 = 32 line/mm. Camera geometry The physical geometry of a camera system. See camera model. Geometrical light-gathering power 1. For a telescope, the square of the entrance pupil diameter divided by the square of the magnification, i.e., the area of the exit pupil. 2. For a camera, the square of the aperture ratio, that is, D^2/f^2, where D is the entrance pupil diameter (generally the objective diameter) and f is the focal length. Dynamic range of camera The ratio of the saturation intensity of a camera to its absolute sensitivity threshold. This ratio is the input dynamic range of the camera. If the response of the sensor is linear, the input dynamic range is also the output dynamic range. However, if the response of the sensor is nonlinear, the input dynamic range will differ from the output dynamic range.
Accuracy of the camera parameter A measure of camera calibration. It mainly depends on the number N of calibration images. Experiments show that the standard deviation of the focal length of the camera decreases approximately as N^(-1.5), the standard deviation of the radial distortion coefficient decreases approximately as N^(-1.1), and the standard deviation of the position of the image sensor relative to the projection center decreases approximately as N^(-1.2). Camera jitter 1. For an analog camera, two adjacent horizontal scan lines are not aligned but float statistically relative to each other. 2. When shooting with a video camera, the motion of a camera that is supposed to be fixed, generated by external forces. It causes motion blur and sharpness reduction in the results. Exterior parameters of camera Parameters that indicate the position of the camera in the world coordinate system. If the position of a point in the world coordinate system is P = [X, Y, Z]^T and its position in the camera coordinate system is p = [x, y, z]^T, then their relationship is P = Rp + T, where R is the rotation matrix (indicating the orientation of the camera) and T is the translation vector (indicating the position of the camera). R and T are called the exterior (out-of-camera) parameters. For rigid bodies, R has only one degree of freedom in 2-D, and T has two degrees of freedom; in 3-D, R has three degrees of freedom, as does T. Exterior camera parameter Referred to as the camera external parameters. Same as external coefficient. External coefficient A parameter indicating the pose of the camera (not determined by the camera itself). Typical examples include the position and orientation of the camera, the amount of translation indicating the camera motion, the sweep angle, and the tilt angle. External parameter Same as external coefficient. Extrinsic parameters See exterior orientation.
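A small NumPy sketch (an illustration, not a procedure from the handbook) showing how the interior parameter matrix K and the exterior parameters R, T introduced above combine to project a 3-D world point into pixel coordinates. All numeric values are assumed, and here R, T are taken to map world coordinates into camera coordinates.

```python
import numpy as np

# Interior parameters (illustrative): focal scale factors, no skew, principal point.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Exterior parameters (illustrative): p_cam = R @ P_world + T.
theta = np.deg2rad(10.0)                       # small rotation about the Y axis
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
T = np.array([0.1, -0.05, 2.0])                # camera 2 m away from the point

def project(P_world):
    """Project a 3-D world point to pixel coordinates using K, R, T."""
    p_cam = R @ P_world + T                    # world -> camera coordinates
    x_hom = K @ p_cam                          # camera -> homogeneous image coords
    return x_hom[:2] / x_hom[2]                # de-homogenize to (u, v) pixels

print(project(np.array([0.0, 0.0, 0.0])))      # pixel location of the world origin
```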
3.1.2 Camera Motion Description
Camera motion In motion analysis, the same as background motion.
Camera roll The form of motion of the camera around each coordinate axis. The rotation around the X axis is called tilting, the rotation around the Y axis is called panning, and the rotation around the Z axis is called rolling around the optical axis. Rolling See camera roll. Camera motion compensation See sensor motion compensation. Camera motion estimation See sensor motion compensation. Camera motion vector A vector that describes the motion of a camera, consisting of motion velocities along the three spatial axes, V = [vx, vy, vz]^T. Here, the camera is required to perform linear motion (the camera moves at a uniform speed with respect to the target in a straight line), the direction of the camera relative to the target is fixed, and the camera motion state is consistent while all images are acquired. The definition of V here assumes that the camera is moving and the target is stationary. If the camera is stationary and the target is moving (as on a production line), –V can be used as the motion vector. Camera line of sight The optical axis of the camera, passing through the center of the lens. See model of image capturing. Camera position estimation An estimate of the optical position of the camera relative to the scene or the structure being observed. This typically consists of six degrees of freedom (three rotational degrees of freedom, three translational degrees of freedom). It is often an element of camera calibration. The camera position is sometimes referred to as the camera's external parameters. In structure and motion algorithms, multiple camera positions can be estimated simultaneously in the reconstruction of a 3-D scene structure. Magnification The process of expanding (e.g., an image). Also refers to the amount of expansion used. Magnification factors See generative topographic mapping. Euler angles A set of angles describing the rotation or orientation of a 3-D rigid body in space. There are three angles, namely, the deflection angle (yaw), the pitch angle, and the roll. See Fig. 3.2, where the intersection line AB of the XY plane and the xy plane is called the pitch line. The deflection angle is the angle Ψ between line AB and the X axis, which is the angle of rotation about the vertical axis Z. The pitch angle is the
Fig. 3.2 Euler angles
angle ϕ between the Z and z axes, and the roll angle θ is the angle between line AB and the x-axis, which is the angle of rotation about the z-axis (the line of sight). Yaw Describes one of the three Euler angles of 3-D rigid body rotation (the other two are the pitch angle and the roll angle). Often used to describe the motion of the camera or the movement of the observer. See Fig. 3.5. Pitch A 3-D rotation expression commonly used to indicate a camera or a moving observer (used in conjunction with deflection and roll). The pitch component indicates rotation about the horizontal axis, giving an up and down change in camera orientation; see Fig. 3.3. Pitch angle Describes one of the three Euler angles of a rigid body: the angle between the Z axis of the world coordinate system and the z-axis of the camera coordinate system (see Fig. 3.3). Also called the nutation angle, this is the angle of rotation around the pitch line.
Fig. 3.3 Pitch
Fig. 3.4 Roll
Roll One of the types of motion of the camera about the Z axis (around the optical axis). Rolling indicates rotation about the optical axis or line of sight, as shown in Fig. 3.4. It is also one of the 3-D rotation expression components (the other two are the yaw and the pitch angle). It is often used to describe the motion of the camera and observer. Gimbal lock In mechanics, a situation that arises when Euler angles are used to describe the rotation of a target: when two of the three rotation axes come to coincide (at certain values of one of the Euler angles), one degree of freedom is lost. At this point, a small fluctuation of the rotating target may cause a very large change in the Euler angles. Also known as universal joint deadlock.
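To make the Euler-angle entries above concrete, here is a minimal sketch that composes pan (yaw), tilt (pitch), and roll rotations into a single 3 × 3 rotation matrix. The axis assignments follow the camera-motion convention used in this chapter (pan about Y, tilt about X, roll about Z), but the composition order is an assumed choice; it is one of several possible Euler-angle conventions, not a formula from the handbook.

```python
import numpy as np

def rotation_from_euler(pan, tilt, roll):
    """Compose a rotation matrix from pan (about Y), tilt (about X), roll (about Z).

    Angles are in radians; the Y-X-Z composition order is one assumed convention.
    """
    cp, sp = np.cos(pan), np.sin(pan)
    ct, st = np.cos(tilt), np.sin(tilt)
    cr, sr = np.cos(roll), np.sin(roll)

    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])   # pan / yaw about Y
    Rx = np.array([[1, 0, 0], [0, ct, -st], [0, st, ct]])   # tilt / pitch about X
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # roll about Z (optical axis)

    return Ry @ Rx @ Rz

# Near tilt = ±90° two rotation axes align and a degree of freedom is lost,
# which is the gimbal lock situation described above.
R = rotation_from_euler(np.deg2rad(30), np.deg2rad(89.9), np.deg2rad(10))
```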
3.1.3 Camera Operation
Operations of camera The control operations applied to the camera during shooting. They can be divided into three categories: (1) translation operations, (2) rotation operations, and (3) zooming operations. These three types of operations are combined in six situations: (1) panning, (2) tilting, (3) zooming, (4) tracking, (5) lifting, and (6) dollying. See types of camera motion. Types of camera motion Camera motion as considered relative to the world coordinate system. As shown in Fig. 3.5, it is assumed that the camera is placed at the origin of the 3-D space coordinate system, the optical axis of the lens is along the Z axis, and the spatial point P(X, Y, Z) is imaged at the image plane point p(x, y). The camera's motion includes translation or tracking, lifting, advancing or retracting, tilting, panning, and rolling. In addition, the change in the focal length of the camera lens is called zooming, which can be divided into two types: zooming in (magnifying) and zooming out (reducing). See operations of camera. Pan A typical form of camera operation. The rotation of the camera about an axis through its own center that is (approximately) parallel to the vertical direction of the image.
Fig. 3.5 Types of camera motion
Tracking One of the types of camera motion. Corresponds to the camera's movement along the X axis of the world coordinate system, i.e., horizontal (lateral) movement. It is also a typical form of camera operation. See types of camera motion. Booming The type of camera motion in which the camera moves along the Y axis, i.e., vertical (longitudinal) movement. It is also a typical form of camera operation. See types of camera motion. Dollying One of the types of camera motion. Corresponds to the translational motion of the camera (whose optical axis is in the Z-axis direction) along the Z axis, i.e., forward and backward (horizontal) movement. It is also a typical form of camera operation. See types of camera motion. Tilting One of the types of camera motion. It is also a typical form of camera operation or movement. Corresponds to rotation of the camera about the X axis of the world coordinate system, that is, vertical rotation in the YZ plane. If the XY plane is the equatorial plane of the earth and the Z axis points to the earth's north pole, a vertical tilt will cause a change in latitude. See types of camera motion. Slant The angle between the normal of the surface of the object in the scene and the observer's line of sight, as shown in Fig. 3.6. See tilting and shape from texture.
Fig. 3.6 Slant angle
Tilt direction The direction in the 2-D image of the projection, parallel to the image plane, of the 3-D surface normal of a scene surface element. If the 3-D surface is represented as a depth map z(x, y), the tilt direction at (x, y) is a unit vector parallel to (∂z/∂x, ∂z/∂y). The tilt angle can be defined as arctan[(∂z/∂y)/(∂z/∂x)]. Nod In camera movement, the same as tilting. Tilt angle 1. In a general camera model, the angle at which the camera is tilted relative to the vertical axis of the world coordinate system, that is, the angle between the z-axis of the camera coordinate system and the Z axis of the world coordinate system. If the XY plane is the equatorial plane of the earth and the Z axis points to the earth's North Pole, the tilt angle corresponds to the latitude. 2. Describes one of the three Euler angles of 3-D rigid body rotation. See Fig. 3.5; it refers to the angle between AB and the X axis, which is the angle of rotation about the Z axis. Panning The type of camera motion in which the camera rotates around the Y axis of the world coordinate system (i.e., rotating in a horizontal plane). It is also a typical form of camera operation. If the XY plane is the equatorial plane of the earth and the Z axis points to the earth's North Pole, a horizontal panning will cause a change in longitude. See types of camera motion. Pan angle In the general camera model, the angle at which the camera is horizontally deflected in the world coordinate system. It is the angle between the x-axis of the camera coordinate system and the X axis of the world coordinate system. Zoom in One of the types of camera motion. It corresponds to controlling the focal length of the lens to aim at or focus on the target of interest. In practice, by changing the focal length of the camera lens, the object of interest can be magnified in the field of view so as to be highlighted or extracted. Forward zooming Same as zoom in.
3.2 Sensors
The sensor is an important device in image acquisition equipment. It is a physical device that is sensitive to a certain band of the electromagnetic radiation energy spectrum (such as X-ray, ultraviolet, visible light, infrared, etc.) and produces an electrical signal in response (Jähne 2004; Zhang 2017a).
3.2.1 Sensor Models
Sensor Any kind of measuring device used in a variety of disciplines (including mathematics, physics, chemistry, astronomy, geosciences, biology, physiology, etc.). There are more than a dozen words in English with the meaning of "sensor"; here, sensor refers to the sensitive component. Sensors are devices or systems that detect, transmit, record, and display (including alarms for) physical quantities, chemical quantities, and geometric quantities. In general, a sensor mainly plays the roles of "sensing" and "transmitting." As a general term, it denotes a device that records information from the objective world; the recorded information is often further processed by a computer. The sensor can deliver raw measurements or partially processed information. For example, two images (original measurements) are taken using two cameras (sensors), and stereoscopic vision triangulation is performed by a computer to obtain scene depth information. In image engineering, it mainly refers to a physical device sensitive to the energy spectrum of a specific electromagnetic radiation, which can generate an (analog) electrical signal proportional to the received electromagnetic energy that is then converted into an image. Sensor element The constituent unit of the sensor array. Photosite Same as photosensitive unit. Sensor model An abstract representation of the physical sensor and its information processing. It should be able to describe the characteristics of the sensor itself, the effects of various external conditions on the sensor, and the interactions between sensors. Sensor Component Model A specific sensor model. A common model is the three-component model. Let the observed value of the sensor be yi, and let the decision function based on the observed value be Ti. Considering the multi-sensor fusion system as a collection of sensors, each sensor is expressed by the information structure Si that describes the relationship between the sensor's observation value yi, the physical state of the sensor xi, the
sensor’s prior probability distribution function pi, and other sensor behaviors ai. If the effect of each parameter on yi is considered separately, that is, considering the conditional probability density function, the three component models of the sensor, namely, the state model Six, the observation model Sip, and the related model SiT, can be obtained, thus Si ¼ f ðyi jxi , pi , T i Þ ¼ f ðyi jxi Þf ðyi jpi Þf ðyi jT i Þ ¼ Sxi Spi STi Correlation model In the sensor component model, the model of the influence of other sensors on the current sensor is described by the degree of correlation between the various sensors. State model One component of the sensor component model. The dependence of sensor observations on sensor position status can be described. Observation model One of the components in the sensor component model. Describe the model of the sensor when the position and state of one sensor is known and the decisions of other sensors are also known. Sensor modeling Techniques and processes that characterize sensor capabilities such as optical capabilities, spatial capabilities, and time-frequency response. Sensor fusion A large class of techniques designed to combine various information obtained from different sensors to obtain a richer or more accurate description of a scene or motion. Typical techniques include Kalman filter, Bayesian statistical model, fuzzy logic, Dempster-Shafer theory (Dempster-Shafer evidence reasoning), generative systems, and neural networks. See information fusion. Data fusion See sensor fusion. Data integration See sensor fusion. Multimodal fusion See sensor fusion. Sensor network A collection of sensors connected by a communication channel. The data transmitted in the channel can be raw data or processed results. The data can be low-level, such as brightness or original video; it can also be high-level, such as identified targets (such as people and cars) or analyzed information (such as number of people and number of cars). The sensor can communicate with the base station independently,
or the result data can be transmitted to an adjacent sensor first and then relayed to the base station in turn. Heterogeneous sensor network A collection of sensors, often of different kinds, with narrow bandwidth, and mobile. They can aggregate data in one environment. Chip sensor A light-sensitive imaging device based on a CCD or other semiconductor chip.
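The sensor fusion entry above lists the Kalman filter among typical techniques. As a hedged, minimal illustration (not the handbook's formulation), the following sketch fuses two noisy readings of the same scalar quantity by inverse-variance weighting, which is the static special case of a Kalman update; all numeric values are assumed.

```python
import numpy as np

def fuse_two_sensors(z1, var1, z2, var2):
    """Fuse two noisy measurements of the same scalar quantity.

    Each measurement z_i has noise variance var_i; the fused estimate is the
    inverse-variance weighted mean, with a correspondingly reduced variance.
    """
    w1, w2 = 1.0 / var1, 1.0 / var2
    fused = (w1 * z1 + w2 * z2) / (w1 + w2)
    fused_var = 1.0 / (w1 + w2)
    return fused, fused_var

# Example: a range sensor (10 cm std) and stereo triangulation (25 cm std)
# both measure the same distance; the fused value leans toward the better sensor.
estimate, variance = fuse_two_sensors(2.05, 0.10 ** 2, 2.30, 0.25 ** 2)
print(estimate, np.sqrt(variance))
```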
3.2.2 Sensor Characteristics
Sensor size The product size of image sensors such as charge-coupled devices and complementary metal oxide semiconductors. Some typical sensor sizes are shown in Table 3.1, and the last column also shows the pixel spacing at a resolution of 640 × 480 pixels. The size designations in the table follow the convention used for television camera tubes; the nominal size corresponds to the diagonal (the diameter of the circumscribed circle) of the camera tube. The effective image plane of the camera tube is approximately 2/3 of this size, so the diagonal of the sensor is approximately 2/3 of the nominal size of the sensor. When selecting a lens for the sensor, the lens must be larger than the sensor size; otherwise the outer circumference of the sensor will not receive light. In addition, the table shows the pixel spacing at a resolution of 640 × 480 pixels. When the resolution of the sensor is increased, the pixel spacing is reduced accordingly. For example, when the image size is 1280 × 960 pixels, the pixel spacing is reduced by half. Fill factor One of the important factors affecting the performance of digital cameras and digital image sensors. It is the ratio of the actual sensing area to the theoretical sensing area. It is generally desirable to have a high fill factor so that more light can be captured and aliasing can be reduced. Of course, this requires more electronics to be placed on the actual sensing area, so a balance is needed. The fill factor of a camera can be determined experimentally using a photometric calibration method.
Table 3.1 Sizes of some typical sensors

Size (in)   Width (mm)   Height (mm)   Diagonal (mm)   Spacing (μm)
1           12.8         9.6           16              20
2/3         8.8          6.6           11              13.8
1/2         6.4          4.8           8               10
1/3         4.8          3.6           6               7.5
1/4         3.2          2.4           4               5
Missing pixel A pixel whose pixel value is not available. For example, when a sensor unit of an image sensor fails, a missing pixel is generated. Sensor planning A technique for determining the optimal sensing structure of a reconfigurable sensor system, given a task and an object geometry model. For example, given the geometric characteristics of an object (whose CAD model is known) and the task of verifying the size of its features, a sensor planning system determines the best position and orientation of the camera and its associated illumination to estimate each feature size. The two basic methods are (1) the generation-and-testing method, in which candidate sensor configurations are evaluated relative to the constraints of the task, and (2) the synthesis method, in which the task constraints are first analyzed and the resulting equations are solved to obtain the optimal sensor configuration. See active vision and purposive vision. Sensor path planning See sensor planning. Sensor response The output of a sensor, or the characteristics of certain critical outputs given a set of inputs. Often expressed in the frequency domain as a function of the amplitude and phase of the Fourier transform that relates the known input frequency to the output signal. See phase spectrum, power spectrum, and spectrum. Responsivity The ratio of the electrical signal produced by a photodetector to the incident light power. Symbol: R. Unit: A/W or V/W. In the linear range of the sensor, the responsivity is constant. Sensor sensitivity Generally, it refers to the weakest input signal that a sensor can detect. It can be obtained from the sensor response curve. For CCD sensors commonly used in cameras, the sensitivity depends on several parameters, mainly the fill factor (the fraction of the sensor area that is truly sensitive to light) and the saturation capacity (the amount of charge that a single photosensitive unit can receive). The larger the value of these two parameters, the more sensitive the camera is. See sensor spectral sensitivity. Sensor spectral sensitivity The characteristic (curve) of a sensor response in the frequency domain. Figure 3.7 shows the spectral response curve of a typical CCD sensor (together with the spectrum of solar radiation and the spectral response curve of humans), from which its spectral sensitivity can be further derived. Note that the high sensitivity of silicon in the infrared range means that an IR-blocking filter is required for fine measurements that depend on the camera brightness. In addition, the CCD sensor is a good sensor for the near infrared range (750–3000 nm).
Fig. 3.7 Spectral response curve of CCD sensor (silicon sensitivity, solar radiation, and human sensitivity plotted against wavelength, 200–1200 nm)
Metamers Colors that appear equal to a sensor (produce the same record) although their reflection functions differ. This phenomenon occurs because a sensor integrates all the light it receives. Different combinations of photon energies can result in the same recorded values. Under the same lighting, materials characterized by different reflection functions therefore have the potential to produce the same record for one sensor. Sensor position estimation See pose estimation. Sensor placement determination See camera calibration and sensor planning. Sensor motion estimation See egomotion. Sensor motion compensation A technique that aims to eliminate the motion effects of a sensor in video (data). Basic operations include motion tracking and motion compensation. A typical example is image sequence stabilization, where a target moves through the field of view in the original video and is still in the output video. Another typical example is using only visual data to keep a robot stationary above a moving target (called position hold). Eliminating jitter in handheld cameras has been commercialized. See egomotion. Thematic mapper [TM] An earth observation sensor installed on the LANDSAT satellite. There are seven image bands (three in the visible range and four in the infrared range), and most of the bands have a spatial resolution of 30 m. Asynchronous reset An operation mode in which a sensor resets (and immediately acquires an image and reads it out) after receiving an external trigger signal. After adding an overflow channel to the sensor, the sensor has the basis for working in this mode.
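The metamers entry above notes that a sensor integrates all the light it receives, so different reflectance functions can yield identical records. The following is a small, hedged numeric sketch (with made-up spectra and sensitivity values, coarsely sampled) showing two different reflectance curves integrating to the same record for one sensor band.

```python
import numpy as np

wavelengths = np.arange(400, 701, 50)                    # nm, coarse sampling
illumination = np.ones_like(wavelengths, dtype=float)    # flat illuminant (assumed)
sensitivity = np.array([0.1, 0.4, 0.9, 1.0, 0.7, 0.3, 0.1])  # one sensor band

# Two different reflectance functions (illustrative values, chosen so that
# their integrals against this sensitivity coincide).
reflectance_a = np.array([0.2, 0.5, 0.6, 0.4, 0.5, 0.6, 0.2])
reflectance_b = np.array([0.4, 0.3, 0.5, 0.6, 0.5, 0.45, 0.3])

def sensor_record(reflectance):
    """Sensor record = integral of illumination * reflectance * sensitivity."""
    return np.trapz(illumination * reflectance * sensitivity, wavelengths)

# Both calls return the same value, so the two surfaces are metamers for this
# sensor even though their reflectance functions differ.
print(sensor_record(reflectance_a), sensor_record(reflectance_b))
```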
Anti-blooming drain A charge path added to the sensor to prevent charge overflow. It allows excess charge to flow to the circuit substrate, avoiding spillover.
3.2.3 Image Sensors
Image sensor A device that converts optical information into its equivalent electronic information. The main purpose is to convert the energy of electromagnetic radiation into an electrical signal that can be processed, displayed, and interpreted as an image. The way this task is done varies significantly with the process used. Currently, it is mainly based on solid-state devices, such as the charge-coupled device (CCD) and the complementary metal oxide semiconductor (CMOS) device. Active sensor A sensor that actively emits electromagnetic waves toward the target to be detected and collects the electromagnetic wave information reflected from the target, such as synthetic aperture radar. Passive sensor Sensors that only collect reflected sunlight or the electromagnetic energy radiated by the target itself. Compared with active sensors, passive sensors have the advantages of strong anti-interference ability and good concealment. Typical passive sensors include optical photography sensors, optoelectronic imaging sensors, imaging spectrometers, and the like. Optical photography sensors include various cameras such as panoramic cameras, multispectral cameras, push-broom cameras, and the like. A typical application of optoelectronic imaging sensors is the television camera. Passive sensing A perceptual process in which no excitation is emitted and the sensor does not move. A typical still camera is a passive sensor. Compare active sensing. Photosensitive sensor Sensors that are sensitive to incident light, present in a variety of cameras and camcorders. They convert light energy into electrical energy and ultimately into images. They are often divided into two types: sensors based on photovoltaic principles and sensors based on photoemission principles. Sensors based on photoemission principles A sensor that utilizes the photoelectric effect. Incident photons carry enough energy to excite free electrons, thereby converting light energy into electrical energy. The photoelectric effect is most pronounced in metals. This type of sensor is used in television cameras.
Sensors based on photovoltaic principles A sensor that uses photovoltaics (photons excite electrons to generate a voltage). Photodiodes are a typical example. Such sensors have been widely used with the development of semiconductors, and CCD and CMOS are currently the most used. Digital pixel sensors [DPS] A sensor that integrates an analog-to-digital conversion circuit on each pixel to increase the readout speed. CMOS sensors can be used to achieve parallel analog-to-digital conversion. Line sensor A 1-D array of light-sensitive photodetectors with which one image line can be acquired at a time. In order to obtain a 2-D image, there must be relative motion between the sensor and the observed object. This can be achieved by mounting the line sensor above a moving scene (such as a conveyor belt) or by moving the sensor under a static scene (as in a scanner). Line readout speeds are typically between 14 and 140 kHz, which limits the exposure time per line and therefore requires strong illumination. In addition, the aperture of the lens also corresponds to a small f-number, so the depth of field is limited. Area sensors A 2-D array of photodetectors that are sensitive to light. It can be used to acquire a 2-D image directly. There are two commonly used forms: full-frame sensors and interline transfer sensors. Interline transfer sensor 1. A type of area array sensor. In addition to the photodetectors, it also includes vertical transfer registers; the transfer between them is very fast, so smear is eliminated. The disadvantage is that the vertical transfer registers take up space on the sensor, so the fill factor of the sensor may be as low as 20%, resulting in image distortion. To increase the fill factor, micro lenses are often added to the sensor to focus the light on the diodes, but this still does not allow the fill factor to reach 100%. 2. Light-sensitive sensor elements arranged in columns. Each column is connected to a column of charge-coupled device shift registers via a pass gate. In each clock cycle, a "half image" is selected, whereby two sensor elements belonging to two different half images, respectively, are assigned to the cells of the transfer register. Full-frame sensors An area array sensor obtained by directly extending line sensors (from 1-D to 2-D). The main advantage is that the fill factor can reach 100%, so the sensitivity is high and the distortion is small. The main drawback is that smear occurs. Compare the interline transfer sensor. Full-frame transfer sensor A sensor whose light-sensitive elements cover the entire sensor area. It differs from the frame transfer sensor and the interline transfer sensor in that there is no storage portion. It always needs to be used in a camera with a shutter to
determine the integration time with an external shutter. Full-frame transfer sensors are almost always used in time-critical applications and in high-resolution cameras. Frame transfer sensor A sensor whose light-sensitive elements are exposed frame by frame. The charge-coupled device registers are virtually placed adjacent to each other. It is possible to distinguish between the upper photosensitive sensor area and a lower storage area of the same size (shielded from light). In one read cycle, the entire charge is removed from the image area and the memory area. Unlike interlaced sensors, a frame transfer sensor requires only half the number of photosensor elements required for a complete television image. For this reason, the frame transfer sensor only requires the number of lines contained in half an image. Image integration can thus be up to twice as fast as with other readout devices, i.e., at the half-image frequency. It is constructed by adding a storage section to the full-frame sensor. In operation, the incident light is first converted to charge in the photosensor, then quickly transferred to the shielded memory array, and finally read into the serial read register. This kind of sensor can overcome the smear problem caused by the accumulation of charge in the full-frame sensor, because the transfer of charge from the photoelectric sensor to the shielded storage array is very fast, generally less than 500 μs, so the smear is greatly reduced.
3.2.4 Specific Sensors
Intensity sensor A sensor that can measure brightness data. Depth sensor Same as range sensor. Distance sensor Same as range sensor. Range sensor A sensor that collects distance data. The most widely used distance sensors in computer vision are based on optical and acoustic techniques. Laser distance sensors often use structural light triangulation. A time-of-flight distance sensor measures the round-trip transmission time of an acoustic pulse or optical pulse. See depth estimation. Space-variant sensor A sensor that nonuniformly samples projected image data. For example, a log-polar coordinate image sensor has pixels whose size increases exponentially along the radial dimension from the center, as shown in Fig. 3.8.
Fig. 3.8 Sensor with pixel size increasing along the radial dimension
Laser range sensor A distance sensor that records the distance from the sensor to a target or to a target scene, where the distance is detected from an image of a laser spot or stripe projected into the scene. These sensors are typically based on structural light triangulation, flying-spot, or phase difference techniques. Time-of-flight range sensor A sensor that calculates the distance to a target by emitting electromagnetic radiation or the like and measuring the time difference between the transmitted pulse and the received return pulse. Infrared sensor A sensor that can observe or measure infrared light. Tube sensor A vacuum tube with a light guide window that converts light into a video signal. It used to be the only light sensor; it has now been largely replaced by the CCD but is still used in some high dynamic range imaging applications. The Emmy Awards of the American Academy of Television Arts and Sciences also recorded the name of the image dissector tube. Solid-state sensor See solid-state imager. Solid-state imager A charge-coupled device that performs circuit functions by using physical processes such as the injection, transfer, and collection of minority carriers on a semiconductor surface. It integrates photoelectric conversion, signal storage, and readout on a single wafer. Its features are a simple structure, high integration, low power consumption, and many functions.
Hyperspectral sensor A sensor capable of receiving signals from many (possibly hundreds or thousands of) spectral bands at the same time. It can be used to capture hyperspectral images.
3.2.5 Commonly Used Sensors
Charge-coupled device [CCD] An image pickup device that operates in a charge storage, transfer, and readout mode. A solid-state device that can record the number of photons falling on it. Cameras based on charge-coupled device sensors have been among the most widely used image acquisition devices for many years. Their advantages include (1) precise and stable geometry; (2) small size, high strength, and strong vibration resistance; (3) high sensitivity (especially when cooled to a lower temperature); (4) meeting various resolution and frame rate requirements; and (5) imaging of invisible radiation. In a digital camera, a 2-D matrix of CCD units is used; in practice each pixel value in the final image often corresponds to the output of one or more units.
CCD sensor A sensing device made of a charge-coupled device. The 2-D (area) charge-coupled device sensors commonly used in cameras or camcorders capture 2-D images directly, while scanners use 1-D (line) charge-coupled device sensors to scan images progressively. A charge-coupled device sensor consists of a group of photosensitive cells (also known as photo-receptacle cells) made of silicon, each producing a voltage proportional to the luminous flux falling on it. A photosensitive unit has a limited capacity of about 10^6 carriers, which sets an upper limit on the brightness of the imaged object. A saturated photosensitive unit can overflow, affecting its neighbors and causing a defect called blooming. The nominal resolution of a charge-coupled device sensor is the size of the scene element that is imaged as one pixel on the image plane. For example, if a 20 cm × 20 cm square piece of paper is imaged as a 500 × 500 pixel digital image, the sensor's nominal resolution is 0.04 cm.
Quantum efficiency The ratio of the number of electrons generated to the number of photons entering the sensor. For CCD sensors, photons create electron-hole pairs, while for CMOS sensors, photons cause electron-hole pairs to be destroyed. The change in charge can be converted into a voltage and finally into a change of the gray value of the image.
Charge transfer efficiency The percentage of charge transferred from one register unit of a charge-coupled device to the adjacent register unit during a shift.
Blooming 1. The phenomenon in which a photodetector unit in the sensor receives too much charge and the generated charge diffuses into adjacent units. For example, in a
CCD sensor array, if the signal is too strong, it causes the signal to overflow from one sensor unit into adjacent sensor units. It can happen in both the temporal domain and the spatial domain. In a charge-coupled device, when a pixel is oversaturated, or when the electronic shutter uses too short an exposure, an overflow may occur and cause the image to bloom. Newer CCD products generally use anti-blooming technology, which provides specially prepared drain channels into which the excess charge can flow. 2. In color image processing, a problem caused by the signal strength exceeding the dynamic range of the processing unit during the analog processing or analog-to-digital conversion phase. At this point, because the brightness measured at the sensor element is greater than the acceptable level and the charge-coupled device element is saturated, the unit can no longer store more charge in the time unit. The excess charge then overflows into the surrounding elements, i.e., diffuses into adjacent charge-coupled device units and produces a blooming effect. Blooming is particularly noticeable in the analysis of specular highlights in color images, where it can result in horizontal or vertical fringes in the image (depending on the orientation of the charge-coupled device).
Blooming effect A problem caused by the mutual influence of charges in neighboring pixels in CCD imaging. See blooming.
Charge-injection device [CID] A class of solid-state semiconductor imaging devices having a matrix of light-sensitive elements. It uses a specific method of fabricating a solid-state image sensor that senses the charge generated by photons by injecting it from the sensor into the photosensitive substrate. It has an electrode matrix corresponding to the image matrix; at each pixel position there are two mutually isolated electrodes that can generate a potential well. One of the electrodes is connected to the corresponding electrode of all pixels in the same row, and the other is connected to the corresponding electrode of all pixels in the same column. In other words, a pixel is accessed by selecting its row and column. A CID is much less sensitive to light than a charge-coupled device sensor, but the pixels in a CID matrix can be accessed independently by electronically indexing the row and column electrodes, i.e., it offers random access to individual pixels in any order, freedom from blooming, and other advantages. Unlike in a CCD, the collected charge in a CID is not transferred away as it is read, so the image is not erased by readout.
Complementary metal oxide semiconductor [CMOS] A widely used solid-state imaging device. It mainly includes the sensor core, analog-to-digital converter, output register, control register, gain amplifier, and so on. An imaging device based on such a semiconductor can integrate the entire system on one chip, reducing power consumption and space and achieving lower overall cost.
Complementary metal oxide semiconductor [CMOS] transistor In CMOS, photons incident on the sensor directly affect the conductivity (or gain) of the photodetecting unit; the unit can be selectively gated to control the exposure interval and is locally amplified prior to readout by a multiplexing scheme.
CMOS sensor A sensing device made of complementary metal oxide semiconductor. It consumes less power but is more sensitive to noise. Some complementary metal oxide semiconductor sensors have a layered photo-sensor that is sensitive to all three primary colors at each location of the grid, exploiting the physical property of silicon that light of different wavelengths penetrates to different depths.
CMOS sensors with a logarithmic response A sensor with a nonlinear grayscale response that can be used to acquire images with a wide dynamic range of brightness. Generally, the photocurrent output from the photo-sensor is fed to a resistive element having a logarithmic current-voltage characteristic. This type of sensor always uses progressive exposure.
Active pixel sensor [APS] A sensor in which each pixel has its own independent amplifier and can be directly selected and read out. The CMOS sensor is an active pixel sensor that uses a photodiode as the photodetector. The charge does not need to be transferred sequentially; it can be read directly by the row and column selection circuits, so the sensor can be used like a random read/write memory.
Photosensitive unit A unit that is sensitive to incident light and converts an optical signal into another type of signal (such as an electrical signal). It corresponds to a pixel of a camera that uses a CCD or CMOS sensor.
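As an illustration of the logarithmic-response idea described above, the following hedged Python sketch compares a linear sensor response, which clips for bright scenes, with a logarithmic response that compresses several decades of photocurrent into the same output range; all constants are made-up assumptions rather than parameters of a real device.

```python
import numpy as np

def linear_response(photocurrent, full_scale=1000.0, levels=256):
    """Linear sensor: output saturates once the photocurrent exceeds full scale."""
    return np.clip(photocurrent / full_scale * (levels - 1), 0, levels - 1)

def logarithmic_response(photocurrent, i_dark=1.0, i_max=1e6, levels=256):
    """Logarithmic sensor: several decades of photocurrent are mapped into the
    same output range, giving a wide dynamic range."""
    p = np.clip(photocurrent, i_dark, i_max)
    return np.log(p / i_dark) / np.log(i_max / i_dark) * (levels - 1)

currents = np.array([10.0, 1e3, 1e5])     # three scene brightnesses (arbitrary units)
print(linear_response(currents))          # the two bright values clip at 255
print(logarithmic_response(currents))     # distinct values across the whole range
```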
3.3 Cameras and Camcorders
There are many types of cameras and camcorders, which are constructed according to different models and have different structures to complete specific image acquisition work (Davies 2005; Hartley and Zisserman 2004; Jähne 2004; Tsai 1987; Zhang 2017a).
3.3.1 Conventional Cameras
Rolling shutter camera A camera in which different pixels collect photons during different time intervals. This can be due to mechanical or electronic reasons. It allows most other pixels to
continue acquiring photons while one pixel is being sampled. This is in contrast to cameras that capture photons simultaneously at all pixels and then read them out.
Rolling shutter Originally, a shutter used in conventional cameras to control progressive exposure, allowing the photographic film to respond sequentially. It now refers to the progressive (line-by-line) exposure mode commonly used by cameras with CMOS sensors. The sensor is exposed and read out line by line, so the exposure time and the readout time can overlap. As a result, there may be a large difference in acquisition time between the first line and the last line of the image. When there is fast relative motion between the camera and the scene, the subject may show obvious deformation or partial exposure problems.
Shutter A component used in an optical device to control the exposure time, the most typical example being the shutter of a camera. A shutter positioned at the objective lens is called a central (between-the-lens) shutter, and one at the image plane is called an image plane (focal-plane, curtain, rolling) shutter. The shutter can be viewed as a device that lets light enter the camera just long enough to form an image on a light-sensitive film or wafer. It can be mechanical, as in a conventional camera, or electronic, as in a digital camera. In the former case, a window-like mechanism is opened to allow light to be recorded by the light-sensitive film. In the latter case, a CCD or other type of sensor is electronically triggered to record the light incident at each pixel location.
Shutter control A device that controls the length of time the shutter is open.
Electronic shutter A device for reducing the exposure time in charge-coupled device imaging by limiting the charge accumulation time and draining the charge during most of the frame time. The exposure time is generally between 1/15000 s and 1/60 s.
Global shutter Originally, a shutter controlling the overall exposure time, allowing the entire film to be illuminated simultaneously. It now refers to the acquisition mode in which the whole image is exposed at once, that is, all pixels collect light at the same time. This mode is used by cameras with CCD sensors. Its disadvantages are that the noise can be larger and that blurring can occur when there is very fast motion between the camera and the scene. In addition, since a storage area is required for each pixel, the fill factor is reduced. Compare rolling shutter.
Film camera A traditional camera that uses roll film or sheet film. The images obtained are analog and continuous.
Color negative The film used in traditional cameras; it records the three colors magenta, yellow, and cyan. Projecting light through the negative onto photographic paper reverses the colors and returns them to normal.
ISO setting Originally an international indicator of the speed of the film used in film cameras (ISO stands for International Organization for Standardization); commonly used values are 100, 200, etc., corresponding to the speed at which the silver in the film reacts with light. Doubling the sensitivity value doubles the speed. In modern digital cameras, it indicates the gain applied to the analog signal before it is converted to a digital signal. In theory, a high sensitivity represents a high gain, allowing the camera to achieve better exposure even in low-illumination situations. In practice, however, high sensitivity also amplifies the sensor noise.
Dodge A photographic term derived from darkroom technique: light falling on part of the photographic paper is held back, causing the corresponding areas to become lighter or even white. Compare burn.
Burn A photographic term derived from darkroom technique: light falling on part of the photographic paper is not held back, causing the corresponding area to become relatively dark. Compare dodge.
Filmless camera A camera that does not use film. It includes two major categories: digital cameras and still video cameras.
Pinhole camera An ideal imaging device that satisfies the pinhole camera model.
Pinhole model Same as pinhole camera model.
Pinhole camera model The simplest camera imaging model. The pinhole corresponds to the center of projection, and the image formed is an inverted image of the object. In this model, the wave characteristics of light are ignored, and light is treated as rays traveling in straight lines within the same medium. It is an ideal perspective camera model consisting of an image plane and a point-like aperture through which all incident light must pass, as shown in Fig. 3.9. Its equations can be found under perspective projection. For a simple convex-lens camera, where all the light passes through a virtual pinhole at the focus, this is a good model.
Fig. 3.9 Pinhole camera model (scene, pinhole at the center of projection, optical axis, center point, and image plane)
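As a companion to the pinhole camera model entry above, here is a minimal Python sketch of ideal perspective projection (x = fX/Z, y = fY/Z); the focal length and scene points are arbitrary illustrative values.

```python
import numpy as np

def pinhole_project(points_3d, focal_length):
    """Project 3-D points (X, Y, Z), given in camera coordinates, onto the
    image plane of an ideal pinhole camera at distance `focal_length`."""
    pts = np.asarray(points_3d, dtype=float)
    X, Y, Z = pts[:, 0], pts[:, 1], pts[:, 2]
    x = focal_length * X / Z          # perspective division by depth
    y = focal_length * Y / Z
    return np.stack([x, y], axis=1)

# Two points at different depths: the farther one projects closer to the center.
scene = [(0.2, 0.1, 1.0), (0.2, 0.1, 2.0)]
print(pinhole_project(scene, focal_length=0.035))   # 35 mm focal length, units in meters
```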
Optical center The center of an optical system such as a camera (often taken as the focus or the center of the lens), whose position is expressed in pixel coordinates. See focal point.
Still video camera A filmless camera in which the captured photos are saved in analog form (so that they can be sent directly to a TV display) instead of in digital form. To transfer the photos to a computer, an analog-to-digital conversion device is needed.
Stationary camera A camera whose optical center does not move. The camera can pan, tilt, or rotate about its optical center, but has no translational motion. Such cameras are also called "rotating" or "non-translating" cameras. The term is also often used for a camera that does not move at all. Images obtained with a stationary camera are always related by a planar homography.
3.3.2 Camera Models
Camera 1. A physical device for image acquisition. Originally, a (still) camera captured only one image at a time while a video camera captured a series of images, but advances in technology have made this distinction largely disappear; most commonly used cameras can both freeze a single image and capture a series of images. 2. A mathematical representation of the physical device and its characteristics, such as its position and calibration. 3. A set of mathematical models and related devices projecting from 3-D to 2-D, such as affine cameras, orthographic cameras, or pinhole cameras.
Digital camera A camera whose image-perceiving surface is composed of independent semiconductor sampling units (typically one unit per pixel) and which records a quantized version of the signal value when the image is acquired. The former guarantees the digitization of the image in space, and the latter guarantees the digitization of the image in amplitude.
Fig. 3.10 Flow of capturing images with a camera (camera body: lens, aperture, shutter, sensor, gain (ISO), A/D; sensor chip/DSP: demosaic, sharpen, white balance, gamma correction, compression; camera picture: RAW, JPEG)
Digital camera model A model describing the process by which a digital camera converts external photons incident on a photosensor, such as a CCD unit, into digital values. Exposure (gain and shutter speed), nonlinear mapping, sampling, aliasing, noise, etc., all affect this process. The simple model in Fig. 3.10 basically considers these elements, including various noise sources and typical post-processing steps.
High-speed camera A camera that can shoot quickly. Generally, exposing a stationary scene for less than 1/2000 s counts as high-speed photography. Movies with a frame rate higher than 128 f/s are high-speed movies, while those above 10^4 f/s are super-fast movies.
Smart camera A hardware device that combines a camera and an onboard computer in a single small enclosure: a programmable vision system implemented in the size of a common video camera.
Intelligent camera control Combining the capabilities of the camera device with the sensor information available in the camera design to capture the best picture.
Tube camera See tube sensor.
Forward-looking infrared [FLIR] camera An infrared imaging system or device mounted on a vehicle and looking in the forward direction of motion.
Multi-tap camera A camera with multiple outputs.
Perspective camera A camera whose image is obtained according to perspective projection. Its light path is shown in Fig. 3.11. The corresponding mathematical model is generally referred to as the pinhole camera model.
Fig. 3.11 Perspective camera light path (scene, lens, projection center, optical axis, and image plane)
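Relating to the digital camera model entry above (Fig. 3.10), the following hedged Python sketch models a single pixel of a toy imaging pipeline with exposure, quantum efficiency, shot and dark noise, ISO gain, and A/D quantization; the constants are illustrative assumptions only, not values from the handbook or from a real sensor.

```python
import numpy as np

def simulate_pixel(photon_flux, exposure_s, qe=0.5, dark_sigma=5.0,
                   iso_gain=1.0, full_well=20000, bits=10, rng=None):
    """Toy single-pixel digital camera model: photons -> electrons -> digital number."""
    rng = np.random.default_rng() if rng is None else rng
    mean_photons = photon_flux * exposure_s
    electrons = rng.poisson(mean_photons * qe)            # photon (shot) noise
    electrons = electrons + rng.normal(0.0, dark_sigma)   # dark / readout noise
    electrons = np.clip(electrons, 0, full_well)          # saturation limit
    dn = electrons * iso_gain * (2**bits - 1) / full_well  # gain + A/D conversion
    dn = np.clip(np.round(dn), 0, 2**bits - 1)             # quantization
    return int(dn)

rng = np.random.default_rng(0)
print(simulate_pixel(photon_flux=1e6, exposure_s=0.01, rng=rng))                 # moderate value
print(simulate_pixel(photon_flux=1e6, exposure_s=0.01, iso_gain=4.0, rng=rng))   # higher gain, near saturation
```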
Affine camera A special case of a camera that implements a projective transformation. It is a linear mathematical model that approximates the perspective projection of an ideal pinhole camera. It can be obtained by setting T31 = T32 = T33 = 0 in the 3 × 4 camera parameter matrix T, which reduces the degrees of freedom of the camera parameter vector from 11 to 8 (implementing an affine transformation).
Affine trifocal tensor The trifocal tensor under the observation conditions of the affine camera model.
Affine quadrifocal tensor The quadrifocal tensor under the observation conditions of the affine camera model.
Random access camera A special camera that can directly access any image location. This differs from sequential-scanning cameras, in which image values are transmitted in a standard order.
Orthographic camera A camera whose image is acquired based on orthogonal projection.
3.3.3 Special Structure Cameras
Instrumentation and industrial digital camera [IIDC] See instrumentation and industrial digital camera [IIDC] standard.
Instrumentation and industrial digital camera [IIDC] standard A revised standard, based on the IEEE 1394 standard, for industrial cameras. The IIDC defines a variety of video output formats, including their resolution, frame rate, and transmitted pixel data format. The resolution ranges from 160 × 120 pixels to 1600 × 1200 pixels; the frame rate ranges from 3.5 frames per second to 240 frames per second; the pixel data formats include black and white (8 bits and 16 bits per pixel) and RGB (24 bits and 48 bits per pixel).
USB camera A camera that complies with the universal serial bus standard.
Universal serial bus [USB] An external bus standard that regulates the connection and communication between computers and external devices. The USB interface supports devices with plug-andplay and hot-swap capabilities. Telecentric camera Using a telecentric lens, the world coordinate system can be projected in parallel to the camera in the image coordinate system. Since there is no focal length parameter, the distance between the target and the camera does not affect its coordinates in the image coordinate system. Single-chip camera A color camera consisting of only a single chip sensor. Since the charge-coupled device or the complementary metal oxide semiconductor sensor responds to the entire visible light band, it is necessary to add a color filter array in front of the sensor to enable light energy of a specific wavelength range to reach each photodetecting unit. The single-chip camera subsamples each color (1/2 for green and 1/4 for red and blue), so it produces some image distortion. Compare three-chip cameras. Three-chip camera A color camera using a sensor with three chips. As shown in Fig. 3.12, the light passing through the lens is split into three beams by a beam splitter or a prism, and they are respectively led to three sensors having a filter in front of them. This structure can overcome the image distortion problem of a single-chip camera. Three-CCD camera An imaging camera that uses three separate CCD sensors to capture the red, green, and blue components of an image. This is different from a single CCD sensor camera that uses three different filters (such as Bayer pattern) on a single CCD chip. The advantage of using three separate CCD sensors is that better color quality and spatial resolution can be obtained. CCD camera A charge-coupled device is used as the camera of the sensor. It often includes a computer board called a frame buffer and interacts with the computer through a fast
Fig. 3.12 Optical path of a three-chip camera (a prism or beam splitter directs the incoming light to three sensors with red, green, and blue filters)
standardized interface such as FireWire (IEEE 1394), Camera Link, or Fast Ethernet.
Absolute sensitivity threshold An important camera performance indicator: the minimum light intensity that the sensor can detect, or the minimum amount of light that the camera can detect when imaging. Assume that the average number of photons reaching a pixel on the imaging plane during exposure is μp, that the probability of generating an electron per photon (the quantum efficiency) is η, and that the variance of the camera's dark noise is σd². When imaging with very weak light, ημp approaches 0, and detectability is then limited by the dark noise.
With fewer than n + 2 point correspondences, it is not possible to uniquely determine H. The configuration is nondegenerate if no n + 1 of the points p lie on a hyperplane and no n + 1 of the points q lie on a hyperplane.
Principal distance The distance between the perspective center and the image projection plane in a perspective projection. In a camera, the distance from the projection center to the imaging plane is related to the object distance (and image distance) and is a camera constant. Note that the principal distance is not the focal length of the lens; in an optical system, the two may differ.
Harmonic homology A perspective or projective collineation between two pencils.
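The determination of a homography H from point correspondences mentioned above can be illustrated, for the planar (2-D) case, by the standard direct linear transformation (DLT); this is a generic hedged sketch in Python, not an algorithm quoted from the handbook, and it assumes at least four correspondences in general position.

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate the 3x3 homography H with dst ~ H * src (homogeneous coordinates),
    from n >= 4 point correspondences, via the DLT and SVD."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows)
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)        # the null vector of A gives H up to scale
    return H / H[2, 2]

# Four corners of a unit square mapped to a translated, scaled square.
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(2, 3), (4, 3), (4, 5), (2, 5)]
print(np.round(estimate_homography(src, dst), 3))   # ~[[2,0,2],[0,2,3],[0,0,1]]
```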
4.4.3 Projective Imaging
Projective imaging mode A way to capture an image of an objective scene by means of projection, corresponding to the transformation from the world coordinate system XYZ, through the camera coordinate system xyz, to the image plane coordinate system x'y'. These transformations can be described exactly by mathematical formulas, but in practice some approximations are often made to simplify the calculations. See model of image capturing.
Projection imaging 1. A method of projecting a negative image directly into a positive image for viewing. 2. Capturing images by projection (e.g., from 3-D to 2-D); there are many different modes.
Collineation In the projective imaging mode, the same as homography. See projective transformation.
Collinearity The property of lying on the same straight line.
Projection 1. Transforming a geometric structure (target) from one space to another, such as projecting a 3-D point to the nearest point on a given plane. The projection can be represented by a linear function, i.e., for every point p on the original structure, the corresponding point on the projected structure is q = Mp, where M is the matrix describing the projection. The projection does not have to be linear, in which case it can be written q = f(p). 2. Using a perspective camera to generate an image from a scene according to the rules of perspective. 3. The shadow obtained on a projection surface when light illuminates an object; it can also refer to the process of obtaining the shadow. Compare projection of a vector in subspace. 4. A compact shape descriptor. For a binary object O(x, y), the horizontal projection h(x) for x ∈ [0, M−1] and the vertical projection v(y) for y ∈ [0, N−1] can be obtained using the following two equations:

h(x) = ∑_{y=0}^{N−1} O(x, y)

v(y) = ∑_{x=0}^{M−1} O(x, y)
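The two projection equations above reduce to row and column sums of the binary array; a minimal NumPy illustration (the array values are arbitrary):

```python
import numpy as np

# Binary object O(x, y): rows indexed by x (0..M-1), columns by y (0..N-1).
O = np.array([[0, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 0]])

h = O.sum(axis=1)   # horizontal projection h(x) = sum over y of O(x, y)
v = O.sum(axis=0)   # vertical projection   v(y) = sum over x of O(x, y)

print(h)  # [2 3 1]
print(v)  # [0 2 3 1]
```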
Integral projection function The sums of pixel values along the rows or columns of an image. See projection.
Projective invariant A property that is unaffected by projective transformations. More specifically, assume that I(P) is an invariant property of a geometric structure (target) described by the parameter vector P. When a projective transformation with matrix M is applied to the structure, giving a structure described by the parameter vector Q, then I(Q) = I(P). The most basic projective invariant is the cross ratio. In some applications, a weight w is attached to the invariant, in which case I(Q) = I(P)(det M)^w.
Plane projective transfer An algorithm based on projective invariants that, given two images I1 and I2 of a planar object and the correspondences of four features, determines the positions of all other points on I1 and I2. Interestingly, this does not require any knowledge of the scene or of the imaging system parameters.
Projective stratum A layer in the stratification of 3-D geometry. Here the stratification runs from the most complex to the simplest: the projective, affine, metric, and Euclidean layers. See projective transformation.
Projective geometry A field of geometry that studies projective space and its properties. In projective geometry, only the properties preserved by projective transformations are defined. Projective geometry provides a convenient and accurate theory for the geometric modeling of commonly used projective cameras. More importantly, the perspective projection equations become linear.
View-dependent reconstruction A process of generating a new image (or a new video) from an arbitrary viewpoint, given two or more original images (or videos). Techniques from stereoscopic vision and projective geometry are often combined. A typical application is to allow an observer to change the viewpoint while watching or replaying a sporting event.
Projection matrix The matrix that converts the homogeneous coordinates (x, y, z, 1) of a 3-D scene point into the homogeneous coordinates (u, v, 1) of the corresponding point on the image plane of a pinhole camera. It can be decomposed into two matrices containing the intrinsic parameters and the extrinsic parameters. See camera coordinates, image coordinates, and scene coordinates.
Projective space The space formed by all points except the origin in 3-D space; it is also the space of all 2-D points represented by homogeneous coordinates. Also known as 2-D projective space.
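Since the entry above names the cross ratio as the most basic projective invariant, the following small Python sketch checks numerically that the cross ratio of four collinear points is unchanged by a 1-D projective transformation; the point values and matrix are arbitrary illustrations.

```python
import numpy as np

def cross_ratio(a, b, c, d):
    """Cross ratio (a, b; c, d) of four collinear points given by 1-D coordinates."""
    return ((a - c) * (b - d)) / ((a - d) * (b - c))

def projective_map(x, m):
    """1-D projective transformation x -> (m00*x + m01) / (m10*x + m11)."""
    return (m[0, 0] * x + m[0, 1]) / (m[1, 0] * x + m[1, 1])

pts = np.array([0.0, 1.0, 3.0, 7.0])
M = np.array([[2.0, 1.0], [0.5, 3.0]])     # an arbitrary nondegenerate 2x2 matrix
mapped = projective_map(pts, M)

print(cross_ratio(*pts))      # 1.2857...
print(cross_ratio(*mapped))   # same value: the cross ratio is projectively invariant
```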
The projective space of (N + 1)-D vectors on which projective geometry is defined is often referred to as ℙ^N.
2-D projective space Same as projective space.
Projection plane The imaginary surface onto which objects in the scene are projected. It is usually a flat surface, but a curved surface such as a cylinder, a cone, or a sphere can also be used as a projection surface, as in map (earth) projections.
Projective plane The plane defined by projective geometry, often referred to as ℙ^2.
Projective factorization A generalization of factorization for use with perspective cameras. A preliminary affine reconstruction is performed first, and the perspective effect is then corrected in an iterative manner.
Projection center In a pinhole camera, the point at which all imaging rays meet.
4.4.4 Various Projections
1-D projection The result of projecting a higher-dimensional space onto a 1-D space. For example, each histogram of a high-dimensional image taken along a particular direction is a 1-D projection of that image. As another example, the signature is the result of projecting a 2-D target onto 1-D.
2-D projection A transformation that maps a higher-dimensional space onto a 2-D space. The easiest way is to drop the higher-dimensional coordinates, but in general a viewpoint is needed for an actual projection.
Orthogonal projection A simplified projective imaging mode. The Z coordinate of a scene point in the world coordinate system is not considered; that is, the world point is projected onto the image plane directly along the direction of the camera axis, so the projection of the scene gives the true dimensions of the corresponding scene cross section. Orthogonal projection can be regarded as perspective projection with the camera focal length λ taken to infinity.
Orthography See orthogonal projection.
Orthographic projection Drawing a 3-D scene as a 2-D image with a set of rays orthogonal to the image plane. The size of the imaged target does not depend on its distance Z from the observer, and parallel lines in the scene remain parallel in the image. The equations for orthographic projection are

x = X
y = Y

where x and y are the image coordinates of the image point in the camera reference frame (in millimeters instead of pixels) and X and Y are the coordinates of the corresponding scene point (Z does not have to be considered). The effect is the same as orthogonal projection. It can be described as the 2-D result obtained by simply dropping the z component of the 3-D coordinates.
Matrix of orthogonal projection The matrix used to carry out an orthogonal projection. If homogeneous coordinates are used, the orthogonal projection matrix can be written as

P = [1 0 0 0]
    [0 1 0 0]
    [0 0 0 0]
    [0 0 0 1]
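To illustrate the orthographic projection matrix above, a minimal Python sketch applies it to homogeneous scene points at different depths; the points are arbitrary examples.

```python
import numpy as np

# Orthographic projection in homogeneous coordinates: (X, Y, Z, 1) -> (X, Y, 0, 1).
P = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 1]], dtype=float)

points = np.array([[2.0, 1.0, 5.0, 1.0],     # two scene points at different depths
                   [2.0, 1.0, 50.0, 1.0]])

projected = (P @ points.T).T
print(projected)   # both project to (2, 1, 0, 1): the image does not depend on depth Z
```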
Scaled orthographic projection Same as weak perspective projection.
Parallel projection A generalization of orthogonal projection in which a scene is projected onto the image plane with a set of parallel rays that are not necessarily perpendicular to the image plane. When the extent of the scene is relatively small compared to its distance from the center of projection, this is a good approximation to perspective projection; the difference is just a normalizing scale factor. Parallel projection is a subset of weak perspective projection. Note that the weak perspective projection matrix is constrained not only by the orthogonality of the two rows of its left 2 × 3 submatrix but also by the requirement that the two rows have the same norm; in orthogonal projection, both rows have unit norm.
Sinusoidal projection A general term for a set of linear image transformations in which the rows of the transformation matrix are the eigenvectors of a symmetric tridiagonal matrix. The discrete cosine transform is an example.
Center projection The image on the surface of a sphere is projected onto a plane tangent to the sphere at the intersection of a line emerging from the center of the sphere with the spherical surface. The intersection of a plane through the center with the sphere is a great circle, and under center projection the image of this great circle is a straight line. Also known as tangent plane projection.
Axonometric projection A type of parallel projection. The three coordinate planes of the object are placed so that none is parallel to the projection rays, so that all three coordinate planes are reflected on the same projection surface, giving a three-dimensional effect. The length of the object parallel to any of the axes can be measured proportionally in the axonometric drawing. When the scale ratios of the three axial directions are all the same, the projection is called isometric; when the ratios of two axial directions are the same, it is called dimetric; when the ratios of the three axial directions are all different, it is called trimetric.
Equidifference weft multicone projection A multi-conical projection in which the parallels are coaxial arcs whose centers lie on the central meridian, the central meridian is a straight line, and the other meridians are curves symmetric about the central meridian. The meridian spacing on the same parallel decreases in equal differences with increasing distance from the central meridian. The graticule (network of meridians and parallels) is constructed by graphical analysis, and its shape differs depending on the projection parameters.
Equidistance projection A projection in which the meridians and parallels (or vertical circles and contour circles) are orthogonal and the scale along the meridians (or vertical circles) is equal to one. The amount of deformation of this projection lies between that of the conformal projection and the equal-area projection, and it can be used for ordinary maps and traffic maps.
4.5 Photography and Photogrammetry
Photography represents the entire process of image acquisition (Tsai 1987), while photogrammetry uses captured images to quantitatively describe the scene being captured (Hartley and Zisserman 2004).
4.5.1 Photography
Photography Taking a photo or video using a camera or camcorder (or another device that can record images).
3-D photography The technique and process of constructing a complete 3-D model from multiple (2-D) images and restoring the surface topography of the scene. It can also include a range of applications based on the reconstructed surfaces. The main processing steps include feature matching, recovering structure from motion, estimating dense depth maps, constructing 3-D models, and recovering texture maps. ARC3D is a complete web-based system that automates these tasks.
Panography A special kind of image stitching, and also a simple but interesting application of image registration. To build a panograph, the images are first translated (and possibly rotated and scaled) and then simply averaged. This process mimics the way stitched photographs are produced from ordinary prints, except that physical photo stitching uses an opaque-overlay model rather than averaging.
Projection background process A way to obtain an effect similar to shooting on location: a pre-shot background image is projected by a special projector with a strong light source onto a translucent or directionally reflective screen, the actors perform in front of the screen, and the camera captures both the actors' movements and the scene on the screen.
Photomicroscopy Photography with a reproduction ratio much greater than 1:1, usually done with a stereo zoom digital microscope.
Variable-speed photography In film photography, shooting at a frequency different from the normal rate (24 f/s). Shooting below 24 f/s is called low-speed photography, shooting at a very low frequency is called time-lapse photography, shooting above 24 f/s is called fast photography, and shooting above 128 f/s is called high-speed photography.
Flash photography A form of active illumination imaging. In images obtained with flash, colors are often less natural (because little ambient light is captured), there may be strong shadows or highlights, and the brightness falls off radially away from the camera axis. Images obtained without flash in darker lighting conditions often have stronger noise and blur (due to longer exposures). An image obtained with flash can be combined with an image obtained just after the flash is extinguished to obtain a color-correct, sharp, and low-noise result. First, the image A without flash is
filtered with a variant of the bilateral filter, the joint bilateral filter, whose kernel is determined with the help of the flash image F (because F has weaker noise, the edges in F are more reliable). Since the content of F is less reliable inside and on the outline of shadows or highlights, the result Abase obtained by filtering the no-flash image A with a standard bilateral filter is used in those places. Next, a detail image Fdetail is computed from the flash image:

Fdetail = (F + e) / (Fbase + e)

where Fbase is the result of filtering F with a bilateral filter and e = 0.02. This detail image contains details that would be filtered out by noise reduction (which produces ANR), as well as the additional detail created by the flash (increased sharpness). The detail image is used to modulate ANR to obtain the final result:

Afinal = (1 − k) ANR Fdetail + k Abase

Flash and non-flash merging Combining flash and non-flash images to obtain a better exposed and color-balanced image and to reduce noise. One problem with flash images is that the colors are often unnatural (no ambient illumination), there may be strong shadows or highlights, and the brightness falls off radially away from the camera line of sight. Non-flash images acquired under low illumination are often affected by strong noise (because of high ISO gain and low photon counts) and by blurring (from long exposures). See flash photography.
Digital radiography X-ray photography with a digital sensor instead of film or a photographic plate.
Digital panoramic radiography A digital X-ray imaging system that travels around the teeth and images the upper and lower jaws, sinuses, and nasal cavity. See digital radiography.
High-speed cinematography/motion picture A photographic technology for very fast filming, including research on high-speed movies, equipment manufacturing, filming, film processing, etc.
Reproduction ratios The ratio of the size of the captured image to the size of the photographed object. The reproduction ratio in general photography is less than 1, the ratio in photomicrography is greater than 1, and the ratio in copying is equal to 1.
Soft focus In photography, techniques that intentionally make the image less sharp to achieve an artistic effect. The image thus obtained is called a soft focus image or a weak-contrast image. Experience has shown that a pleasing soft focus is best to
superimpose a first-order blurred image on a sharp image, that is, add a circle of halo around the image point.
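Returning to the flash/no-flash combination described under flash photography above, the following heavily simplified, hedged Python sketch uses Gaussian smoothing as a stand-in for the bilateral and joint bilateral filters; the smoothing parameters, the synthetic images, and the blending mask k are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def merge_flash_noflash(A, F, e=0.02, k=None, sigma=2.0):
    """Simplified flash/no-flash merge.

    A: no-flash image (noisy, natural colors); F: flash image (sharp, low noise).
    In the full method A_NR uses a joint bilateral filter guided by F; here plain
    Gaussian smoothing stands in for both bilateral filters.
    """
    A_nr = gaussian_filter(A, sigma)      # denoised no-flash image (A_NR stand-in)
    A_base = gaussian_filter(A, sigma)    # bilateral-filtered A (A_base stand-in)
    F_base = gaussian_filter(F, sigma)    # smoothed flash image
    F_detail = (F + e) / (F_base + e)     # flash detail layer
    if k is None:
        k = np.zeros_like(A)              # k would mark flash shadow/highlight regions
    return (1.0 - k) * A_nr * F_detail + k * A_base

rng = np.random.default_rng(1)
F = np.tile(np.linspace(0.1, 0.9, 64), (64, 1))       # synthetic flash image
A = 0.5 * F + 0.05 * rng.standard_normal((64, 64))    # darker, noisier no-flash image
print(merge_flash_noflash(A, F).shape)                # (64, 64)
```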
4.5.2 Photogrammetry
Photogrammetry A field concerned with obtaining reliable and accurate measurements from noncontact imaging, for example digital elevation maps obtained from a pair of overlapping satellite images. Accurate camera calibration is the first concern here, and many typical image processing and pattern recognition techniques have been introduced into the field.
Photogrammetric modeling An early name for image-based modeling.
Digital photogrammetry An analytical photogrammetry method that uses image engineering techniques to process 2-D perspective projection images for automatic interpretation of image and scene content.
Stereophotogrammetry An important area of photogrammetry. Two photos of the object to be measured are taken from two points at the ends of a baseline. Two cases are considered:
1. Ground photogrammetry: (1) normal case, the two cameras point toward the object from the two points in parallel directions perpendicular to the baseline; (2) swing case (left swing, right swing), the orientations of the two cameras are parallel but offset from the baseline normal to the left or right by an equal angle; and (3) convergent (or divergent) case, the orientations of the two cameras remain in the same plane but can make any angle with the baseline normal.
2. Aerial photogrammetry: the optical axis of the camera is kept as close to perpendicular to the ground as possible.
Stereophotography One of the ways to photograph the 3-D space of a scene. With the photographic images, the stereoscopic effect of the scene as seen by two eyes can be reproduced.
Hyperstereoscopy In stereophotography or stereo observation, the case in which the observation baseline is smaller than the shooting baseline. Observation and shooting should be performed with the same baseline length in stereophotography; otherwise, distortion of shape and scale occurs. In the best case the result is only similar in shape, i.e., the shape is preserved but enlarged or reduced. In hyperstereoscopy the sense of depth is strengthened. Compare hypostereoscopy.
Hypostereoscopy In stereophotography or stereo observation, the case in which the observation baseline is larger than the shooting baseline. In addition to distortion of shape and scale, the sense of depth is weakened. Compare hyperstereoscopy.
Stereoscopic photography Same as stereophotography.
Analytic photogrammetry The use of mathematical analysis to establish the geometrical connection between 2-D perspective projection images and 3-D scene space.
Absolute orientation 1. In photogrammetry, the problem of registering two 3-D point sets, used to register a photogrammetric reconstruction with an absolute coordinate system. It is often described as determining the rotation transformation R, the translation transformation t, and the scaling transformation s that match the model point set {m1, m2, ..., mn} to the corresponding data points {d1, d2, ..., dn} by minimizing the mean square error

e(R, t, s) = ∑_{i=1}^{n} ‖di − s(R mi + t)‖²

To solve this minimization, a method based on singular value decomposition can be used. 2. In analytic photogrammetry, bringing one or more camera reference frames (camera coordinate systems) into coincidence with a world reference frame (world coordinate system) to determine orientation and scale. This often requires a rotation transformation and/or a translation transformation, whose calculation is based on correspondences between 3-D points. 3. A representation of a 3-D alignment as a coordinate transformation given by a rigid body transformation (representing the motion of a rigid body).
Absolute orientation algorithm An algorithm that solves the absolute orientation problem; it estimates the unit quaternion corresponding to the rotation between the two surfaces to be matched. A 4 × 4 matrix is constructed from the coefficients of the correlation matrix C, and the eigenvector associated with its largest positive eigenvalue is then determined.
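Besides the quaternion-based method mentioned in the entry above, the absolute orientation problem is also commonly solved with an SVD-based approach; the following hedged Python sketch recovers R, s, and t from corresponding 3-D point sets using made-up example data.

```python
import numpy as np

def absolute_orientation(m, d):
    """Find rotation R, scale s, translation t minimizing sum ||d_i - s(R m_i + t)||^2.
    m, d: (n, 3) arrays of corresponding model and data points (n >= 3, not collinear)."""
    m, d = np.asarray(m, float), np.asarray(d, float)
    mc, dc = m - m.mean(0), d - d.mean(0)            # center both point sets
    U, S, Vt = np.linalg.svd(dc.T @ mc)              # cross-covariance decomposition
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])   # avoid reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (mc ** 2).sum()   # least-squares scale
    t = d.mean(0) / s - R @ m.mean(0)                # translation (applied before scaling)
    return R, s, t

# Synthetic test: rotate, translate, and scale a point set, then recover the motion.
rng = np.random.default_rng(2)
m = rng.random((6, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
d = 2.0 * (m @ R_true.T + np.array([1.0, -2.0, 0.5]))
R, s, t = absolute_orientation(m, d)
print(np.allclose(R, R_true), round(s, 3), np.round(t, 3))   # True 2.0 [ 1. -2.  0.5]
```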
Chapter 5 Image Digitization
An image must be discretized both in space and in gray level in order to be processed by a computer (Bracewell 1995; Kak and Slaney 2001; MacDonald and Luo 1999; Umbaugh 2005).
5.1 Sampling and Quantization
The discretization of spatial coordinates is called spatial sampling (referred to as sampling), which determines the spatial resolution of the image; the discretization of gray values is called gray quantization (referred to as quantization), which determines the amplitude resolution of the image (Castleman 1996; Gonzalez and Woods 2008; Tekalp 1995).
5.1.1 Sampling Theorem
Sampling A transformation of a (time/space) continuous signal into a discrete signal by recording the value of the signal only at discrete times or locations. Most digital images are sampled spatially and temporally, with the result that their intensity values are defined only on regular spatial and temporal grids, and only integer values (quantization) are available. Abbreviation for spatial sampling in image acquisition.
Shannon sampling theorem Referred to as sampling theorem. Also known as the Nyquist sampling theorem.
Spatial domain sampling theorem See sampling theorem.
Sampling theorem An important theorem in information theory. It can be stated as follows: if an image is sampled at a rate higher than the Nyquist frequency, then the continuous image can be reconstructed from the sampled image, and the mean square error with respect to the original image converges to zero as the number of samples tends to infinity. It is a fundamental theorem in signal and information processing, also known as the Shannon sampling theorem. It can also be described as follows: a band-limited signal sampled at intervals corresponding to at least twice its highest frequency can be completely recovered from its samples. This establishes the relationship between the sampling frequency and the signal spectrum and is the basic justification for discretizing continuous signals. It can further be expressed as a time domain sampling theorem and a frequency domain sampling theorem. In image engineering, there is a 2-D spatial domain sampling theorem corresponding to the 1-D time domain sampling theorem. It should be noted that this is only suitable for image processing operations; if objective data are to be obtained from the image through image analysis, a sampling rate that just satisfies the sampling theorem is often not enough (in fact, the conditions of the sampling theorem are often not satisfied in this case).
Time domain sampling theorem The statement of the sampling theorem in the time domain. If the highest frequency component of a time signal f(t) is fM, then f(t) can be determined from a series of sample values taken with a sampling interval less than or equal to 1/(2fM), i.e., with a sampling rate fS ≥ 2fM.
Frequency domain sampling theorem The statement of the sampling theorem in the frequency domain, which is dual to the sampling theorem in the spatial or time domain. If the spectrum of a time-limited continuous signal f(t) is F(w), the original signal can be determined from a series of discrete sample values spaced in the frequency domain by no more than π/TM, where TM is the time duration of the signal.
Nyquist rate The reciprocal of the minimum sampling frequency, rs = 1/fs. See Nyquist frequency.
Nyquist frequency The minimum sampling frequency at which the considered real image can be completely reconstructed from its samples. If sampling is performed at a lower frequency, aliasing occurs, that is, apparent structures that do not exist in the original image are generated. It can also refer to the maximum frequency in an image, denoted fmax; to satisfy the sampling theorem, a sampling frequency fs > 2fmax is required. See Nyquist rate.
Nyquist sampling rate See Nyquist frequency.
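As a small worked illustration of the time domain sampling theorem above (fS ≥ 2fM), the following Python sketch shows the apparent (aliased) frequency of a 30 Hz sinusoid sampled adequately and inadequately; the numbers are illustrative.

```python
import numpy as np

def alias_frequency(f_signal, f_sample):
    """Apparent frequency of a sinusoid of frequency f_signal when sampled at f_sample."""
    f = f_signal % f_sample
    return min(f, f_sample - f)           # folded back into [0, f_sample / 2]

f_m = 30.0                                # highest frequency in the signal (Hz)
print("required sampling rate:", 2 * f_m)  # sampling theorem: fS >= 2 fM = 60 Hz
print(alias_frequency(f_m, 80.0))         # 30.0 -> adequately sampled, no aliasing
print(alias_frequency(f_m, 25.0))         # 5.0  -> undersampled, appears as 5 Hz
```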
Fig. 5.1 Three different sampling formats
Sampling scheme A discretization format used in sampling continuous images. Such a format should have a fixed specification and cover the 2-D plane seamlessly. It can be proved that only three regular polygonal forms satisfy these conditions when the sampling is uniform: polygons with three, four, and six sides, i.e., triangles, squares, and hexagons, as shown in Fig. 5.1.
Sampling pitch The distance between adjacent samples. In a 2-D image, the horizontal sampling interval and the vertical sampling interval are distinguished. Their product gives the theoretically usable perceptual area; the ratio of the actually perceived area to the theoretical perceived area is called the fill factor.
Time delay index An integer index that represents the delay between two functions: if g(t) = f(t − nΔt), then n is the delay index, Δt is the sampling time, and nΔt is the total delay, or the time delay by which g lags f.
Sampling mode The geometric or physical arrangement of the samples. The most common sampling mode is the rectangular mode, in which the pixels are aligned horizontally in rows and vertically in columns. Other possible sampling modes include the hexagonal sampling mode and the like.
Sample mean A statistic that characterizes a set of samples. Representing a d-D data set as a set of n column vectors xi, i = 1, 2, . . ., n, their sample mean is

m = (1/n) ∑_{i=1}^{n} xi
See mean. Sampling density The number of samples per unit time and/or unit space. In practice, it is necessary to comprehensively consider the size, details, and representativeness of the sample to be observed. See influence of sampling density. The density of the sampling grid, which is the number of samples collected per unit interval. See sampling.
Sampling bias The deviation between the samples and the true distribution. If samples of a random variable are collected according to the true distribution, the statistics calculated from them will not systematically deviate from the population expectation. However, the actual sampling may not fully express the true distribution, and a sampling bias is then produced.
Sampling efficiency The ratio of the area of the unit circle to the area of the grid cells covering the unit circle when sampling according to a given sampling grid.
Sample covariance A matrix that characterizes a set of samples. Represent a d-D data set as a set of n column vectors xi, i = 1, 2, . . ., n, with sample mean m. The sample covariance matrix is the d × d matrix

S = (1/n) ∑_{i=1}^{n} (xi − m)(xi − m)^T

See covariance matrix.
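The sample mean and sample covariance defined above are one-line computations in NumPy; a minimal illustrative check against NumPy's own estimator:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((5, 100))           # d = 5 dimensions, n = 100 column-vector samples

m = X.mean(axis=1, keepdims=True)                    # sample mean m = (1/n) sum x_i
S = (X - m) @ (X - m).T / X.shape[1]                 # sample covariance (d x d), 1/n form

print(m.ravel().shape, S.shape)                      # (5,) (5, 5)
print(np.allclose(S, np.cov(X, bias=True)))          # matches NumPy's biased estimator
```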
5.1.2 Sampling Techniques
Point sampling The technique or process of taking the values of a continuous signal at discrete points to obtain a discrete signal. For example, a digital camera samples a continuous image function to form a digital image.
Irregular sampling Sampling at unequal intervals, and therefore nonperiodic sampling.
Sampling-importance resampling [SIR] A method that uses a sampling distribution q(z) while avoiding the estimation of the normalizing constant k required by rejection sampling. It consists of two steps. In the first step, L samples z(1), z(2), . . ., z(L) are drawn from q(z). In the second step, the corresponding weights w1, w2, . . ., wL are computed. Finally, a second set of L samples is drawn from the discrete distribution over (z(1), z(2), . . ., z(L)) using the probabilities given by the weights w1, w2, . . ., wL.
Stochastic sampling technique A sampling technique that satisfies two conditions at the same time: (1) the samples are selected randomly, and (2) probability theory is used to evaluate the sampling results. A sampling method that does not have both of these features is called a non-statistical sampling technique.
Spatial sampling The process of spatially discretizing an analog image. The spatial resolution of the image is determined by measuring the value of the 2-D function at discrete intervals along the horizontal and vertical directions. The Shannon sampling theorem provides a reference for selecting the spatial sampling rate in actual image processing.
Spatial sampling rate The number of discrete sample points within a unit distance in the horizontal and vertical directions of the image. The spatial sampling rate of a video refers to the spatial sampling rate of the luminance component Y. Generally, the spatial sampling rate of the chrominance components (also called color difference components) CB and CR is only one-half that of the luminance component. This halves the number of pixels per line, while the number of lines per frame does not change. This format is called 4:2:2, that is, every four Y sampling points correspond to two CB sampling points and two CR sampling points. A format with a lower amount of data is the 4:1:1 format, that is, one CB sample point and one CR sample point for every four Y sample points; however, the resolutions in the horizontal and vertical directions are then asymmetric. Another format with the same amount of data is the 4:2:0 format, which still has one CB sample point and one CR sample point for every four Y sample points, but both CB and CR are sampled at one-half the rate in both the horizontal and vertical directions. Finally, for applications requiring high resolution, a 4:4:4 format is also defined, in which the sampling rate of the luminance component Y is the same as that of the chrominance components CB and CR. The correspondence between the luminance and chrominance sampling points in these four formats is shown in Fig. 5.2 (two adjacent lines belong to two different fields).
Sampling rate The number of samples per unit measure along each dimension. In periodic sampling, it is the number of samples extracted per second from a continuous signal to form the discrete
Fig. 5.2 Example of four sampling formats (positions of Y pixels and CB/CR pixels for 4:4:4, 4:2:2, 4:1:1, and 4:2:0)
signal, in Hertz (Hz); this is also known as the sampling speed. Its reciprocal is called the sampling time and represents the time interval between samples. For 2-D images, the numbers of samples per unit length (interval) in the horizontal and vertical directions are considered. For video images, the numbers of samples along the line (horizontal axis), column (vertical axis), and frame (time axis) are considered. Research on the human visual system (HVS) has shown that it cannot distinguish spatial and temporal changes beyond certain frequencies. Therefore, the visual cutoff frequencies (i.e., the highest spatial and temporal frequencies that can be perceived by the HVS) should be the driving factor in determining the video sampling rate.
Decimation 1. In digital signal processing, a filter that keeps one out of every N samples, where N is a pre-fixed number. In contrast to interpolation, which increases resolution, decimation is used to reduce resolution. For decimation, the image is conceptually convolved with a low-pass filter (to avoid aliasing), and then one of every M samples is kept. In practice, the convolution is only evaluated at every M-th sample:

g(i, j) = ∑_{k,l} f(k, l) h(Mi − k, Mj − l)

The smoothing kernel h(k, l) used here is often a stretched and rescaled interpolation kernel. This can also be written as

g(i, j) = (1/r) ∑_{k,l} f(k, l) h(i − k/r, j − l/r)

so that the same kernel h(k, l) can be used for both interpolation and decimation. See subsampling. 2. Grid decimation: combining adjacent and similar surface patches or lattice vertices to reduce the size of a model. It is often used as a processing step when deriving a surface model from a range image.
Resampling When performing non-integer-factor interpolation on an image, the process of re-selecting the positions of the interpolation points, which is required because the newly added points are not in the original pixel set.
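A hedged Python sketch of the decimation operation described above, low-pass filtering followed by keeping every M-th sample, using a simple box filter as the anti-aliasing kernel; the kernel choice and test image are illustrative.

```python
import numpy as np

def decimate(image, M=2):
    """Reduce resolution by a factor M: smooth with an M x M box filter
    (a crude anti-aliasing low-pass), then keep every M-th sample."""
    img = np.asarray(image, dtype=float)
    H, W = img.shape
    Hc, Wc = (H // M) * M, (W // M) * M             # crop to a multiple of M
    blocks = img[:Hc, :Wc].reshape(Hc // M, M, Wc // M, M)
    # Averaging each M x M block equals box filtering followed by evaluating
    # the result only at every M-th position.
    return blocks.mean(axis=(1, 3))

img = np.arange(36, dtype=float).reshape(6, 6)
print(decimate(img, M=2))         # 3 x 3 result: each value is the mean of a 2 x 2 block
print(decimate(img, M=3).shape)   # (2, 2)
```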
Sampling rate conversion Converting a video sequence with a certain spatiotemporal resolution into another sequence in which one or several of these parameters are changed. Also called standard conversion. Sampling rate conversion generally involves one of two cases:
1. When the original sequence contains fewer sample points than the desired result, the problem is called up-conversion (or interpolation);
2. When the original sequence contains more sample points than the desired result, the problem is called down-conversion (or decimation).
Standard conversion In video image processing, conversion between various video formats; sometimes referred to as sampling rate conversion. A standard can be defined as a video format specified by a combination of four main parameters: color coding (composite/component), number of lines per field/frame, frame/field rate, and scanning method (interlaced/progressive). Typical examples of standard conversion include (1) frame and line rate conversion, (2) color standard conversion, (3) line and field doubling, and (4) movie-to-video conversion.
Up-conversion One form of sampling rate conversion: first, all points that are in the desired video sequence but not in the original sequence are filled with zeros (this process is called padding with zeros), and then an interpolation filter is used to estimate the values of these points.
Up-sampling A process and technique of inserting additional samples at equal intervals in each direction of the original image, so that the original image can be regarded as a subset of the interpolated result. It increases the sample rate by generating more samples of a signal; for example, doubling the size of an image requires a 2:1 up-sampling. Up-sampling may require interpolation of the existing samples.
Over-sampling A technique in which a continuous signal is sampled at a density greater than the Nyquist sampling rate, or sampling at such a rate.
Down-conversion In sampling rate conversion, sampling the original sequence to obtain a video sequence with a reduced sampling rate. A pre-filter is used on the signal to limit the bandwidth and avoid aliasing.
Subsampling Reducing the size of the original image by producing a new image whose pixel spacing is wider than that of the original (usually several times wider). Interpolation can be used to obtain more accurate pixel values. To avoid aliasing effects, the spatial frequencies above the Nyquist frequency of the coarse grid (wide sampling interval) need to be removed with a low-pass filter. Also called down-sampling.
Down-sampling Taking only part of the samples of the original image, at equal intervals, in each direction or in a certain direction. The subsampled image is a subset of the original image. It reduces the sampling rate of the image, so that fewer pixels are used to express the same data, which often reduces the amount of data to be transmitted. Also known as subsampling.
5.1.3
Quantization
Quantization 1. Abbreviation for magnitude quantization in image acquisition. The process of expressing the original continuous image gray tone as a limited number of gray values. 2. The step of mapping the prediction error into a finite number of output values in lossy predictive coding. 3. The step of selectively eliminating the transform coefficients carrying the least information in transform coding. See spatial quantization. Magnitude quantization The process of discretizing the amplitude (often grayscale) of the acquired analog image. This process determines the amplitude resolution of the image. Spatial quantization Converting an image defined in an infinite domain into a finite set of finite-precision samples. For example, the image function f(x, y): ℝ² → ℝ can be quantized into an image g(i, j) of width w and height h, {1…w} × {1…h} → ℝ. A particular sample g(i, j) is determined by the point spread function p(x, y) and can be written as
g(i, j) = ∫∫ p(x − i, y − j) f(x, y) dx dy
Linear quantizing Same as equal-interval quantizing. Equal-interval quantizing A commonly used quantization strategy or technique. First, the range to be quantized in the image is divided from the maximum value to the minimum value into consecutive intervals of the same length, and each image value is assigned the quantized value corresponding to the interval into which it falls. Compare equal-probability quantizing. Equal-probability quantizing A commonly used quantization strategy or technique. The range to be quantized in the image is divided from the maximum value to the minimum value into a number of consecutive intervals, so that the image values assigned to the respective quantized
values have the same probability. This idea is essentially consistent with histogram equalization. Compare equal-interval quantizing.
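The two quantization strategies can be contrasted with a short sketch; the number of levels and the test image below are illustrative assumptions:

```python
import numpy as np

def equal_interval_quantize(img, levels):
    """Split [min, max] into equal-length intervals and map each pixel
    to the index of the interval it falls in."""
    lo, hi = img.min(), img.max()
    edges = np.linspace(lo, hi, levels + 1)
    return np.digitize(img, edges[1:-1])          # values 0 .. levels-1

def equal_probability_quantize(img, levels):
    """Choose interval boundaries at quantiles so each quantized value
    receives roughly the same number of pixels (histogram-equalization-like)."""
    qs = np.linspace(0, 1, levels + 1)[1:-1]
    edges = np.quantile(img, qs)
    return np.digitize(img, edges)

img = np.random.default_rng(0).normal(128, 30, (64, 64))
print(np.bincount(equal_interval_quantize(img, 4).ravel()))      # uneven counts
print(np.bincount(equal_probability_quantize(img, 4).ravel()))   # roughly equal counts
```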
5.2
Digitization Scheme
Sampling requires dividing the image plane into a regular grid, where the position of each grid point is determined by a pair of Cartesian coordinates (Sonka et al. 2014). The process of assigning gray values to the grid points is a quantization process (Pratt 2007).
5.2.1
Digitization
Digital geometry The geometry of the sampling and quantization domain (points, lines, faces, etc.). Digitization scheme A method of digitizing a continuous analog image in each space to obtain a digital image. Also refers to the method of digitizing a continuous analog video in space to obtain a digital video. Digitization The process of converting an analog image to a sampled digital version. Digitization box A polygon in a partition of the image plane, centered at a certain pixel location. It is a unit that can implement a digitization scheme. Given a digitization box Bi and a pre-image S, the pixel pi is in the digitized set P of the pre-image S if and only if Bi ∩ S ≠ ∅. Digitization model In image acquisition, a quantization model for transforming spatially continuous scenes into discrete digital images. Also known as the digitization scheme. Commonly used models are square-box quantization, grid-intersection quantization, and object-boundary quantization. Domain In a digitization model, the union of all possible pre-images for a discrete set of points. Since digital images are studied to obtain information from the original analog world, and targets of different sizes or shapes may give the same discretization results when imaged, it is necessary to consider all possible continuous sets of a digital image. Digitizer The device that converts the (analog) electrical signal obtained by the sensor into a digital (discrete) form.
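A small sketch of the digitization-box criterion Bi ∩ S ≠ ∅, using a disk as the pre-image S; the disk, the grid size, and the closest-point intersection test are illustrative assumptions (the half-open box is treated as closed here for simplicity):

```python
import numpy as np

def digitize_disk(cx, cy, r, width, height):
    """Mark pixel p_i = (x_i, y_i) if its digitization box
    B_i = [x_i - 1/2, x_i + 1/2) x [y_i - 1/2, y_i + 1/2) intersects the disk S."""
    P = np.zeros((height, width), dtype=bool)
    for yi in range(height):
        for xi in range(width):
            # The box intersects the disk iff the box point closest to the
            # disk centre lies within distance r of the centre.
            nx = np.clip(cx, xi - 0.5, xi + 0.5)
            ny = np.clip(cy, yi - 0.5, yi + 0.5)
            P[yi, xi] = (nx - cx) ** 2 + (ny - cy) ** 2 <= r ** 2
    return P

print(digitize_disk(4.3, 3.7, 2.0, 9, 8).astype(int))
```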
Analog-to-digital conversion [ADC] In image acquisition, the final step of the analog processing in the imaging sensor. The analog image data is converted into digital image data by analog-to-digital conversion so that it can be input to a computer and processed by it. For this process, there are two quantities of interest: the resolution of the output (corresponding to the number of bits produced) and its noise level (how many of the bits have practical use). For most cameras, the number of bits used (e.g., 8 bits for JPEG images and 16 bits for RAW format) exceeds the actual number of usable bits. For a given sensor, it is best to perform a calibration by repeatedly imaging the same scene and estimating the noise intensity as a function of image brightness. Value quantization A continuous number is encoded as a finite number of integer values. A typical example is to encode a voltage or a current amount as an integer in the range 0 to 255. Hybrid Monte Carlo The combination of Hamiltonian dynamics and the Metropolis algorithm. It can eliminate the bias introduced when leapfrog integration of the Hamiltonian equations with a non-zero step size produces integration errors. Discrete The coordinates or intensity of an object in a digital image are discontinuous and have separate values. Discrete straightness The property that a set of points in a discrete space can form a line. For example, the theorem and properties of linearity can be used to determine whether a digital arc is a digital straight line segment (chord). Ideal line A line drawn in the continuous field, different from a line in a digital image, which is subject to rasterization and becomes jagged. Digital chord A segment of a digital curve (a chord is a straight line segment connecting any two points on a conic curve). It can be proved that Ppq satisfies the chord property if and only if, for any two discrete points pi and pj in an 8-digital arc Ppq = {pi, i = 0, 1, . . ., n} and any real point ρ on the continuous line segment [pi, pj], there is a point pk ∈ Ppq such that d8(ρ, pk) < 1. The chord property is based on the following principle: Given a digital arc Ppq = {pi, i = 0, 1, . . ., n} from p = p0 to q = pn, the distance between the continuous line segment [pi, pj] and the union of the segments ∪i [pi, pi+1] can be measured using a discrete distance function and should not exceed a certain threshold. Two examples are given in Fig. 5.3, where the chord property is satisfied in the left drawing and not satisfied in the right drawing.
Fig. 5.3 Digital chord property example
The shaded area in Fig. 5.3 represents the distance between Ppq and the continuous line segment [pi, pj]. Digital straight line segment A straight line segment in a digital image that satisfies certain conditions. Given two pixels in an image, the digital straight line segment is a set of pixels, where the intersection of a portion of the straight line segment with the neighborhood of the pixels is not empty. Digital curve A discrete curve in which each pixel except the two endpoint pixels has exactly two neighboring pixels, and each of the two endpoint pixels has one neighboring pixel. Semi-open tile quantization A digitization scheme. The digitization box used is Bi = [xi − 1/2, xi + 1/2) × [yi − 1/2, yi + 1/2), which is half-open in both the X and Y directions. The digitized set P obtained from a continuous point set S is then the pixel set {pi | Bi ∩ S ≠ ∅}. Each point t = (xt, yt) ∈ S is mapped to pi = (xi, yi), where xi = round(xt) and yi = round(yt), and round(•) represents the normal rounding function. Grid-intersection quantization [GIQ] A digitization scheme. Given a thin target C consisting of consecutive points, its intersection with a grid line is a real point t = (xt, yt) that, depending on C, either intersects a vertical grid line and satisfies xt ∈ ℤ or intersects a horizontal grid line and satisfies yt ∈ ℤ (where ℤ represents the set of 1-D integers). This point t ∈ C is mapped to the grid point pi = (xi, yi) for which t ∈ (xi − 1/2, xi + 1/2) × (yi − 1/2, yi + 1/2). In special cases (such as xt = xi + 1/2 or yt = yi + 1/2), the point pi that falls to the left of or above the point t belongs to the discrete set P. Square-box quantization [SBQ] A digitization scheme. Let the digitization box be Bi = [xi − 1/2, xi + 1/2) × [yi − 1/2, yi + 1/2), with closure [Bi] = [xi − 1/2, xi + 1/2] × [yi − 1/2, yi + 1/2]. Given a pre-image S, the pixel pi is in the digitized set P of S if and only if Bi ∩ S ≠ ∅. Further, given a continuous curve C, for each real point t ∈ C on a horizontal or vertical grid line, if there are two digitization boxes Bi and Bj satisfying t ∈ [Bi] ∩ [Bj], the one of the pixels pi and pj that falls in the left digitization box is considered to be in the digitized result of C.
Object-boundary quantization [OBQ] A digitization scheme. Given a continuous straight line segment [α, β] contained in the line L: y = σx + μ, 0 ≤ σ ≤ 1, the object-boundary quantization result consists of the pixels pi = (xi, yi), where yi = ⌊σxi + μ⌋. The cases where the slope of the straight line segment takes other values can be obtained by symmetry or by a rotation through π/2. In object-boundary quantization, the intersection g = (xg, yg) of each vertical grid line (x = xi) with the line segment [α, β] is mapped to the nearest pixel with a smaller coordinate value, pi = (xg, ⌊σxg + μ⌋), where ⌊•⌋ represents the down-rounding function.
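A one-line sketch of object-boundary quantization for a line with 0 ≤ σ ≤ 1; the particular slope, offset, and x range below are illustrative assumptions:

```python
import math

def obq_line(sigma, mu, x_min, x_max):
    """Object-boundary quantization of the line y = sigma*x + mu (0 <= sigma <= 1):
    each vertical grid line x = x_i maps its intersection with the line to the
    pixel (x_i, floor(sigma*x_i + mu))."""
    return [(x, math.floor(sigma * x + mu)) for x in range(x_min, x_max + 1)]

print(obq_line(0.4, 0.8, 0, 6))
# [(0, 0), (1, 1), (2, 1), (3, 2), (4, 2), (5, 2), (6, 3)]
```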
5.2.2
Digitizing Grid
Triangular lattice A special image grid. The six directly adjacent pixels of a pixel p constitute the neighborhood of that pixel. By duality, this is equivalent to considering the hexagonal region represented by each pixel: the pixels corresponding to the six regions that share a side with the hexagonal region of p are the adjacent pixels of p. The neighborhood thus constructed is called the 6-neighborhood of the triangular mesh and is denoted N6(p). As shown in Fig. 5.4, the pixel p is connected to its neighboring pixels by the thick solid lines, the triangular mesh is represented by the thin lines, and the broken lines represent a pixel area that is dual to the triangular mesh (corresponding to the hexagonal sampling mode). Triangular sampling grid See triangular lattice. Triangular grid See triangular lattice.
Fig. 5.4 Neighborhood of triangle grid points (centered on pixel p)
Triangular sampling mode A sampling mode that has not found many applications. See triangular sampling efficiency. Triangular sampling efficiency Sampling efficiency when using the triangular sampling mode. It is π/(3√3), which is lower than the sampling efficiency in the square sampling mode or the hexagonal sampling mode, so the triangular image representation is rarely used. Triangular image representation An image representation with triangular (rather than square) pixels. It is rarely used in practice. Square sampling grid See square sampling mode. Square grid See square lattice. Square lattice A currently used image grid. Obtained from the square sampling mode. The corresponding pixels are also square. Square sampling mode A sampling mode in which squares are used to cover the image plane. Square sampling efficiency Sampling efficiency when using the square sampling mode. It is π/4, which is lower than the sampling efficiency in the hexagonal sampling mode. Square image representation An image representation using square pixels. It is the most widely used image representation. Hexagonal sampling grid See hexagonal sampling mode. Hexagonal grid See hexagonal lattices. Hexagonal lattices A special image grid. There are two kinds of neighborhoods for any one of its pixels: the 3-neighborhood and the 12-neighborhood. Gradient operator in hexagonal lattices An operator in which the spatial difference (corresponding to the first derivative) is calculated by convolution in a hexagonal grid. There are two templates, which compute the gradients in the two directions (U and V), as shown in Fig. 5.5 (see Sobel operator).
Fig. 5.5 Templates for calculating gradients in a hexagonal grid (U and V directions)
Fig. 5.6 Hexagonal sampling mode (axes U and V; spacings L/2 and 3L/2)
Hexagonal sampling mode An efficient sampling mode. Referring to Fig. 5.6, the angle between the two axes of the coordinate system is 60°. If the side length of the hexagon is L, the horizontal distance between two adjacent sampling points is L, and the vertical distance is 3L/2. The hexagonal sampling coordinates (U, V) and the square sampling coordinates (X, Y) have the following conversion relationships:
x = u + v/2,  y = √3 v/2
u = x − y/√3,  v = 2y/√3
And the distance between two points in hexagonal space is
d12 = √[(u1 − u2)² + (v1 − v2)² + (u1 − u2)(v1 − v2)]
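The conversion and distance formulas above can be checked with a short sketch; the sample points used are arbitrary. The hexagonal distance agrees with the Euclidean distance of the converted Cartesian points, which provides a quick consistency check:

```python
import math

def hex_to_cart(u, v):
    """(u, v) in the 60-degree hexagonal system -> Cartesian (x, y)."""
    return u + v / 2.0, math.sqrt(3.0) * v / 2.0

def cart_to_hex(x, y):
    return x - y / math.sqrt(3.0), 2.0 * y / math.sqrt(3.0)

def hex_distance(u1, v1, u2, v2):
    du, dv = u1 - u2, v1 - v2
    return math.sqrt(du * du + dv * dv + du * dv)

p, q = (2.0, 1.0), (-1.0, 3.0)
print(hex_distance(*p, *q))                       # distance in hexagonal coordinates
print(math.dist(hex_to_cart(*p), hex_to_cart(*q)))  # same value in Cartesian coordinates
```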
Hexagonal sampling efficiency Sampling efficiency when using the hexagonal sampling mode. It is π/(2√3), which is higher than the sampling efficiency π/4 obtained when using the square sampling mode. Hexagonal image representation An image representation with hexagonal (rather than square) pixels. Reasons for using this representation include the following: (1) it is similar to the structure of the human retina; (2) the distances from one pixel to all its adjacent pixels are equal. Note that when using square
pixels, the distance between two diagonally adjacent pixels is greater than the distance between two adjacent pixels in the horizontal or vertical direction. In the hexagonal image representation, the pixels are hexagonal (or the hexagonal sampling mode is used) instead of the usual square or rectangle. This representation is similar to the structure of the human retina and is characterized by the same distance from one pixel to all adjacent pixels.
Chapter 6
Image Display and Printing
Image display presents an image on a display screen and is an important step in the communication between the computer image system and the user (Gonzalez and Woods 2008). Various printing devices can also be regarded as image display devices (Koschan and Abidi 2008).
6.1
Display
Commonly used image display devices include cathode-ray tubes (CRT), television displays, and liquid crystal displays (LCD), which can be accessed randomly (Zhang 2017a).
6.1.1
Image Display
Image display An image (or video) presented in a specific form for visualization. Different display modes can be adopted for different images. For example, a binary image has only two values at each spatial position, which can be distinguished by black and white, or by 0 and 1. The basic idea of 2-D grayscale image display is to treat the 2-D image as a grayscale distribution at different locations in 2-D space. Figure 6.1 shows two typical grayscale images (two standard images). The coordinate system used in Fig. 6.1a is often used in screen display (the screen scan is from left to right and from top to bottom); the origin O is in the upper left corner of the image, the vertical axis marks the rows of the image, and the horizontal axis marks the columns of the image. I(r, c) can represent either this image or the grayscale (attribute) value of the image at the intersection of row r and column c.
Fig. 6.1 Image display examples: (a) image I(r, c) in screen coordinates (origin O at the upper left, row axis R, column axis C); (b) image f(x, y) in Cartesian coordinates (origin O at the lower left, axes X and Y)
The
coordinate system used in Fig. 6.1b is often used in image calculations. The origin is at the lower left corner of the image, the horizontal axis is the X axis, and the vertical axis is the Y axis (the same as the usual Cartesian coordinate system). f(x, y) can represent both this image and the value of the pixel at coordinates (x, y). Display A display device that displays a given electronic document to a screen through a particular transmission device and then reflects it to the human eye. According to the different materials, it can be divided into cathode-ray tube display, plasma display, liquid crystal display, and so on. Landscape mode The way the horizontal side of the screen or paper is longer than the vertical side when the image is displayed or printed. Letterbox A format for playing widescreen movies on a standard TV set. In the letterbox format, the image size is adjusted to fit the 4:3 TV screen, with a black background on the top and bottom of the TV screen. Compare panscan. Panscan A format for playing widescreen movies on a standard TV set. In the full scan format, the image can fill the entire 4:3 TV screen, but the left and right sides of the image are cropped accordingly. Compare letterbox. Video graphics array [VGA] An early computer screen display standard and also the minimum resolution required for early digital cameras: 640 × 480 pixels. Extended graphics array [XGA] A standard for computer screen display, screen resolution (number of colors): 800 × 640 pixels (true color) or 1024 × 768 pixels (65,536 colors). Super extended graphics array [SXGA] A computer screen display standard, screen resolution (color number): 1280 × 1024 pixels (true color).
Super extended graphics array+ [SXGA+] A computer screen display standard, screen resolution (color number): 1400 × 1050 pixels (true color). Super video graphics adapter [SVGA] An earlier computer screen display standard, screen resolution (number of colors): 800 × 640 pixels (32,000 colors) or 1024 × 768 pixels (256 colors). White level A signal in television that corresponds to the maximum image brightness. Corresponding to this, there is a black level corresponding to the minimum image brightness and a gray level between the two. The white level to black level range accounts for about 75–15% of the amplitude of the wave, and the rest is reserved for the blanking signal. The white level can correspond to 75%, which is positive, or to 15%, which is negative. China adopts the latter, so that interference with radio waves appears as a black dot, which is less conspicuous than a white one. Transparent layer A term in image display technology: a first image is made partially transparent (becomes a transparent layer) before overlaying it onto another image, so that the data in both images can be seen simultaneously after overlaying. Similar effects can sometimes be seen in actual scenes, such as seeing the scene behind a window together with a reflection in the window. Color casts An undesired distortion of hue in computer displays or halftone printing. The entire picture appears to be covered with a layer of light red, green, or another color. Color shift Represents unsatisfactory and irregular changes in hue. Compare color casts. Back-face culling A method of discarding unnecessary occluded parts to speed up processing. Kinescope recording A technique for recording or replaying images on a television screen using photographic film or magnetic tape, magnetic disk, or optical disc. The method of recording with photographic film is called screen recording or video recording, and the latter three are called magnetic tape, magnetic disk, and optical disc recording. Generally speaking, screen recording often refers to the latter three types, also called screen capture video.
6.1.2
Display Devices
Image display device A device that displays image (and video) data as a spatial brightness distribution. The main display devices commonly used in image processing systems include randomly accessible cathode-ray tubes, liquid crystal displays, and light-emitting diode displays, plus printers. Video display terminal [VDT] A more professional title for a computer monitor or display screen. Liquid crystal display [LCD] A computer display device made using liquid crystal technology. Compared with the conventional cathode-ray tube, the liquid crystal display has the characteristics of thin profile, small volume, light weight, low energy consumption, and no radiation. Light-emitting diode [LED] A semiconductor diode that utilizes electroluminescence to convert electrical energy into light energy. It produces narrow-spectrum light, like monochromatic light. It is made of gallium arsenide (red light), gallium phosphide (green light), or silicon carbide (yellow light). The light it emits is controlled by the DC current, and the response is fast, so it can be used as a flash. The brightness of the light-emitting diode is related to the current passing through it, and the color of the emitted light is related to the semiconductor material, ranging from infrared to near ultraviolet. Its advantages are long life, often over 100,000 hours, and almost no aging; brightness is easily controlled thanks to DC power. In addition, a matrix of LEDs can be used to form an image display. Its disadvantage is that performance is sensitive to ambient temperature: the higher the temperature, the worse the performance. Light-emitting semiconductor diode Same as light-emitting diode. It is often used as a marker or as controllable illumination of a detectable point source. Cathode-ray tube [CRT] A class of electron beam tubes that convert electrical signals into optical images. It can record rapidly changing currents and voltages. A typical example is the kinescope of a television set, mainly composed of an electron gun, a deflection yoke, a tube casing, and a fluorescent screen. Another example is a conventional computer display in which the horizontal and vertical position of the electron beam can be controlled by a computer. At each deflection position, the intensity of the electron beam is modulated with a voltage. The voltage at each point is proportional to the gray value corresponding to that point, so that the grayscale image is converted into a pattern of brightness variation and displayed on the screen of the cathode-ray tube. Images from the display can be converted to slides, photos, or transparencies by hard copy.
Cathode-ray oscillograph An electronic instrument that can display changes in voltage waveforms, referred to as an oscilloscope. Screen A computer monitor; the device for outputting an image. Kinescope A cathode-ray tube that displays an image on a television set. The signal from the electronic circuit, following the scanning process, causes the cathode to emit an electron beam whose intensity corresponds to the intensity of the image point in the camera tube. Further, the electrodes or magnetic poles in the cathode-ray tube are controlled by the synchronizing devices so that the scanning sequence is the same as that of the imaging process, and an image played by the transmitting station appears on the fluorescent screen. Fluorescent screen A screen excited by cathode rays, X-rays, or γ-rays. Its effect is to convert an electronic image (X-ray or γ-ray) into a visible light image. Screen resolution The parameter indicating the display accuracy of the display (screen); it depends on the number of lines displayed on the screen and can also be described by the number of horizontal and vertical points, such as 800 × 600, 2560 × 1440, and so on. Screen capture The technique and process of copying the display content directly and saving it as an image file. Video board Same as video adapter. Video card Same as video adapter. Video adapter The device responsible for displaying video in a computer, also called a video card or a display card. Video accelerator An expansion card used to speed up the display of images and graphics on a computer screen. By using a fast coprocessor, large-capacity display memory, and a wider data channel, the CPU load can be drastically reduced, and display speed and overall performance can be improved. Camera Lucida A (passive) optical projection device that projects a scene onto a piece of paper based on a perspective projection so that it can be traced. Figure 6.2 shows a schematic diagram of the operation of the projection renderer.
Fig. 6.2 The operation of the projection renderer (reflector, scene, projection)
Calibrator A device used to measure color and light parameters of printers, displays, and the like. The resulting information can be passed back to the computer for appropriate correction and adjustment. Most of the calibrators used in practical applications are densitometers, which are used to measure the wavelength and intensity of the light source. Syndicat des constructeurs d’appareils radio récepteurs et téléviseur [SCART] A standardized 21-pin multifunction connector used in televisions and video equipment.
6.2
Printing
Printing and printing equipment convert images to specific media (most commonly paper). In order to output a grayscale image on a binary image output device while maintaining its original gray tones, halftone output technology is often used (Lau and Arce 2001).
6.2.1
Printing Devices
Printer A type of computer output device that prints results on a specific medium, such as paper or film. It can also be seen as an image output and display device. Near-photo-quality printer An output device capable of printing continuous-tone images. Visual inspection is of good quality, but if you look at the details with a magnifying glass or other tools, you will find identifiable pigment spots. Compare the dye-sublimation printer. Near-photorealistic Same as near-photo-quality.
Near-photo-quality See near-photo-quality printer. Inkjet printer A printer that ejects tiny black or colored ink drops onto the paper. It can output higher-quality images, but the results are not durable enough, because the pigments used deteriorate easily, resulting in color migration and fading. Laser printer A printer that projects image data onto a drum with a photosensitive surface using a laser beam, attracting toner at the corresponding positions (black for a monochrome printer and four colors for a color printer) and transferring the toner to the paper surface, where it is fixed at high temperature. Laser Light amplified by stimulated emission of radiation. A very bright light source commonly used in machine vision; its characteristics include the following: most of the light has a single spectral frequency, and the light is coherent, so various interference effects can be fully exploited; the beam can be processed to make the divergence very small. Two common applications are structured light triangulation and depth perception. Wax thermal printers A printer that produces a dithered image and simulates the output of a lithographic press. It uses heat to sublimate the wax layer on the magenta, yellow, cyan, and black ribbons to produce color on paper, much like a dye-sublimation printer, but it does not produce continuous-tone images or photo-quality prints. Dye-sublimation printer A special color image output device. Also known as a gas diffusion printer or thermal diffusion printer. The primary color pigments such as cyan, magenta, yellow, and black are heated until they become a gas, and the gas is then sprayed onto the smooth surface of specially coated paper. With this printer, the colors are really mixed together, so a continuous-tone image of true photo quality can be produced. Compare near-photo-quality printer. Phase-change printers A printer that combines the technologies of the wax thermal printer and the inkjet printer. It prints on any paper and is of high quality.
6.2.2
Printing Techniques
Image printing The technique and process of printing an image out to output image processing results. It can be implemented using a variety of (grayscale or color) printers, such as dye-sublimation printers, near-photo-quality printers, and the like.
Printer resolution A measure of how fine the print is. The commonly used unit is dots per inch (dpi), such as 300 dpi. The numbers of points in the horizontal and vertical directions may also be given, such as 600 × 1200. Full bleed Images printed onto paper without any edges or borders. Portrait mode When the image is displayed or printed, the vertical side is longer than the horizontal side. Electrostatic process The process and technique of transferring images onto paper using magnetic and thermal forces. Most laser printers (including color laser printers) and copiers use electrostatic printing technology. Dithering technique A technique for improving the display quality of a too-sparsely quantized image by adjusting or changing the amplitude values of the image. Images quantized with too few levels are prone to false contouring when displayed. A typical dithering technique is implemented by adding a small random noise d(x, y) to the original image f(x, y). Since the value of d(x, y) has no regular relationship with f(x, y), it does not produce a visible regular pattern after superposition, so it can help eliminate the false contouring caused in the image by insufficient quantization. Dithering Abbreviation for dithering technique. Dithered quantization A method and process of superimposing a dithered signal prior to quantizing the original input signal. The dithered signal is generally pseudorandom. If the dithered signal superimposed on the input signal before quantization is properly designed, the quantization error can be made independent of the input signal. Subtractive dither A method of reconstructing a signal after dithered quantization that eliminates the dithered signal prior to reconstruction. Compare non-subtractive dither. Non-subtractive dither A method of reconstructing a signal after dithered quantization, i.e., without eliminating the dithered signal prior to reconstruction. Compare subtractive dither. Blue-noise dithering A binary nonperiodic dithering pattern designed by a statistical model describing the spatial and spectral characteristics of the FM halftoning technique. It has an uncorrelated structure and does not contain low frequency components (blue noise is the high frequency part of white noise). Compare green-noise dithering.
Green-noise dithering A binary nonperiodic dithering pattern designed by a statistical model describing the spatial and spectral characteristics of the FM halftoning technique. It has an uncorrelated structure and does not contain low frequency components and is completely generated by the intermediate frequency component (green noise is the intermediate frequency portion of white noise). Unlike blue-noise dithering, it does not contain high frequency components.
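A small sketch of the dithered-quantization idea described above: adding a small pseudorandom d(x, y) before coarse re-quantization decorrelates the error from the signal and breaks up false contours. The level count, dither amplitude, and test image below are illustrative assumptions:

```python
import numpy as np

def coarse_quantize(img, levels=4):
    """Re-quantize an 8-bit image to 'levels' gray values (prone to false contours)."""
    step = 256 / levels
    return (np.floor(img / step) * step + step / 2).astype(np.uint8)

def dithered_quantize(img, levels=4, amplitude=None, seed=0):
    """Superimpose a small pseudorandom dither d(x, y) before quantization."""
    step = 256 / levels
    if amplitude is None:
        amplitude = step / 2
    rng = np.random.default_rng(seed)
    d = rng.uniform(-amplitude, amplitude, img.shape)
    noisy = np.clip(img.astype(float) + d, 0, 255)
    return coarse_quantize(noisy, levels)

# A smooth ramp shows banding when quantized directly; viewed as images,
# 'banded' shows false contours while 'dithered' shows a fine random texture.
ramp = np.tile(np.linspace(0, 255, 256), (64, 1))
banded = coarse_quantize(ramp)
dithered = dithered_quantize(ramp)
```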
6.2.3
Halftoning Techniques
Halftoning technique A technique for converting a grayscale image into a binary image for output. The various gray tones in the image to be output are converted into a pattern of binary dots, so that the image can be output by a printing device that can only produce binary dots directly. By utilizing the integration characteristics of the human eye and by controlling the form of the output binary dot pattern (including the number, size, shape, etc. of the dots), it is possible to obtain the visual impression of multiple gray levels. In other words, the image output by the halftone technique is a binary image on a very fine scale, but due to the spatial local averaging effect of the eye, a grayscale image is perceived on a coarser scale. For example, in a binary image, although each pixel only takes the value white or black, when viewed from a certain distance, the unit perceived by the human eye is composed of a plurality of pixels, so the gray level perceived by the human eye is the average gray level of this unit (proportional to the number of black pixels in it). Halftone technology is mainly divided into two types: the amplitude-modulated (AM) halftoning technique and the frequency modulation (FM) halftoning technique. Halftone A method commonly used in the printing industry to represent luminosity. An area of the image is represented by a number of white or black dots, and light and dark are represented by the blackness and whiteness (or density) of the dots. The dots themselves should be small enough to be invisible to the naked eye, so that the result appears as a continuous range of luminosity. The pictures in general newspapers use about 20–30 dots per centimeter (cm), and exquisite color photos or art images can reach 80–100 dots per cm. The preparation process is to place a grid plate of the corresponding mesh between the objective lens and the film, positioned so that the objective lens images on the film using the grid as a pinhole; the emulsion used has a strong contrast, so weak light forms small dots, while glare forms a cluster of white spots. This is transferred to the printing plate and etched, and is called the screen. Halftoning See halftoning technique.
Inverse halftoning A filtering operation or smoothing operation that reconstructs a grayscale image from a given error-diffused image. See halftoning. Halftone pattern A pattern of spatial distribution of binary dots produced by a halftone technique. See blue-noise halftone pattern and green-noise halftone pattern. Frequency modulation halftoning technique A commonly used halftone output technique. Different gray levels are displayed by adjusting the spatial distribution of the output black dots (the black dot size is fixed). Here, the density of black dots is inversely proportional to the gray level of the corresponding pixels (a denser distribution corresponds to a darker gray level, and a sparser distribution corresponds to a brighter gray level), which is equivalent to modulating the frequency of appearance of the output black dots. Frequency modulation [FM] A modulation scheme in which information is represented by a change in the instantaneous frequency of a carrier. Different frequencies of the carrier are used to express different information. In image engineering, FM halftoning techniques are commonly used. FM halftoning technique Same as frequency modulation halftoning technique. Frequency modulation [FM] mask An operation template constructed to implement the FM halftoning technique. Each template corresponds to one output unit. Each template is divided into regular grids, each grid corresponding to a basic binary dot. By setting each basic binary dot to black or white, each template can output different gray levels, achieving the goal of outputting grayscale images. According to the principle of the frequency modulation halftone technique, many black dots are required to output a darker gray level, and fewer black dots are needed to output a brighter gray level. Figure 6.3 shows an example of dividing a template into 2 × 2 grids so that each template can output five different gray levels. Amplitude-modulated [AM] halftoning technique A commonly used halftone output technique. Different gray levels are displayed by adjusting the size of the output black dots (with the black dot pitch given). It utilizes the integration feature of the human eye, that is, when viewed at a certain distance, a set of small black dots produces a visual effect of bright gray and a set of large black dots produces a visual effect of dark gray.
Fig. 6.3 Dividing a template into 2 × 2 grids to output 5 shades of gray (levels 0–4)
Therefore, the size of the black dot selected here is inversely proportional to the gray level of the corresponding pixel, which is equivalent to amplitude modulation of the pixel gray level. AM-FM halftoning technique Same as amplitude-frequency hybrid modulated halftoning technique. Amplitude-frequency hybrid modulated halftoning technique A type of halftone output technology that combines amplitude modulation and frequency modulation. The size and spacing of the output black dots can be adjusted at the same time, and a high spatial resolution mode comparable to the amplitude modulation technique can be produced when printing grayscale images. Blue-noise halftone pattern A halftone mode produced by the amplitude-frequency hybrid modulated halftoning technique. There is no low-frequency grain in the output. Green-noise halftone pattern The best halftone mode that can be produced by the amplitude-frequency hybrid modulated halftoning technique. Compare blue-noise halftone pattern.
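A sketch of the FM-mask idea of Fig. 6.3: each pixel is replaced by a 2 × 2 binary cell whose number of black dots reflects the local gray level. The particular dot-filling order used below is an illustrative assumption:

```python
import numpy as np

# Order in which dots of a 2 x 2 cell are switched on, giving the five
# output levels 0..4 black dots per cell (cf. Fig. 6.3).
ORDER = np.array([[0, 2],
                  [3, 1]])

def fm_halftone(gray):
    """Map each pixel of an 8-bit grayscale image to a 2 x 2 binary cell;
    darker pixels switch on more black dots (1 = black)."""
    level = np.rint((255 - gray.astype(float)) / 255 * 4).astype(int)  # 0..4 black dots
    H, W = gray.shape
    out = np.zeros((2 * H, 2 * W), dtype=np.uint8)
    for i in range(H):
        for j in range(W):
            out[2*i:2*i+2, 2*j:2*j+2] = (ORDER < level[i, j]).astype(np.uint8)
    return out

g = np.array([[0, 64, 128, 192, 255]], dtype=np.uint8)
print(fm_halftone(g))
```

Larger masks (e.g., 3 × 3) would give more output levels at the cost of spatial resolution.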
Chapter 7
Image Storage and Communication
Image storage and image transmission are closely related, which respectively involve the storage of image data locally or remotely, and their efficiencies are both related to the amount of data (Gonzalez and Woods 2018).
7.1
Storage and Communication
Image storage requires image data to be stored on the image memory in a certain image file format (Zhang 2017a). Image transmission exchanges image data between different image memories (Zhang 2009).
7.1.1
Image Storage
Image storage Techniques and procedures for storing images using a certain image file format in image memory. Image storage devices See frame store. Image memory A device that stores image data. In a computer, image data is stored in a bit stream, so the image memory is the same as a general data memory. However, due to the large amount of image data, a lot of space is often required. Commonly used image memories include magnetic tapes, magnetic discs, flash memory, optical discs, and magneto-optical discs. See frame store.
Scattering model image storage device A storage device that uses a spatially modulated method of preparing light scattering centers in polycrystalline lanthanum-doped lead zirconate titanate (PLZT) to improve image contrast and store images. The resolution is equivalent to that of a ferroelectric photoconductive (FEPC) image memory device fabricated using the birefringence of PLZT, also known as a ceramic photo image memory (CERAMPIC). Another type of scattering image storage method uses electric field modulation of a certain phase of a PLZT crystal and is called a ferroelectric wafer scattering image memory (FEWSIC). PLZT is a ceramic electro-optical material; its fine-grained sheets are transparent and become birefringent under an applied voltage, so it can be used as an optical switch, display, and memory component. Flash memory An image memory for recording various signals, which has a long life and can retain the stored information even in the event of a power failure. Compared with traditional hard discs, flash memory has high read/write speed, low power consumption, light weight, and strong shock resistance. Super video compact disc [SVCD] An improved form of video disc. The video is MPEG-2 compression encoded, with a resolution of 480 × 480 pixels for NTSC and 480 × 560 pixels for PAL. Digital versatile disc [DVD] The later name of the digital video disc. Memory See image memory. Magnetic tape A strip magnetic recording device for recording various signals, having a magnetic layer. Mainly used as computer external memory. The storage capacity is large. Compare magnetic disc. Magnetic disc A magnetic recording device for recording various signals, which mounts a circular magnetic disc in a sealed case. Mainly used as external storage for computers. The storage capacity is large, and it is easy to use. Compare magnetic tape. Optical disc A recording device for recording various signals using optical information as the storage medium. There are many types; commonly used ones include CD, VCD, DVD, BD, etc. Magneto-optical disc A recording device for recording various signals, combining the characteristics of a magnetic disc and using optical information as the storage medium. The storage capacity is very large, and the storage life is long (up to 50 years). Compare magnetic tape.
7.1.2
Image Communication
Image transfer 1. A generic term describing the transfer of an image from one device to another or from one form of expression to another. 2. See novel view synthesis. Progressive image transmission A method of transmitting images, first transmitting a low-resolution version, and then transmitting a version with a gradually increasing resolution (the details of images are gradually enriched) until the final receiver obtains the highest-resolution image, as shown in the respective figures in Fig. 7.1. Image communication Transmission and reception of various images. Can be seen as communication of visual information. Fiber optics A medium for transmitting light that contains very fine glass or plastic fibers. A very large bandwidth can be provided for signals encoded in optical pulse mode. It also transmits images directly through a fixedly connected fiber bundle to bypass the corners and pass through obstacles. Code division multiplexing Multiple information or symbols are transmitted simultaneously in the same channel, and the information or symbols are represented by orthogonal signals that overlap in the time domain, the spatial domain, and/or the frequency domain. Time division multiplexing Multiplexing of multiple signals or multiple symbols in a single channel is achieved by splitting the time domain. Frequency hopping A method used in spread spectrum communication. The transmitter broadcasts the information, where the first portion of the information is transmitted at a certain frequency, then the second portion of the information is transmitted at another frequency, and so on.
Fig. 7.1 Progressive image transfer example
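Progressive transmission can be sketched with a simple resolution pyramid: the sender transmits the coarsest version first, and the receiver refines the display as finer versions arrive. The pyramid depth and 2 × 2 averaging used below are illustrative assumptions:

```python
import numpy as np

def build_pyramid(img, levels=4):
    """Coarse-to-fine versions for progressive transmission:
    each level halves the resolution by 2 x 2 averaging."""
    pyramid = [img.astype(float)]
    for _ in range(levels - 1):
        f = pyramid[-1]
        h, w = (f.shape[0] // 2) * 2, (f.shape[1] // 2) * 2
        f = f[:h, :w]
        coarse = (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2]) / 4
        pyramid.append(coarse)
    return pyramid[::-1]          # transmit the lowest resolution first

img = np.random.default_rng(1).integers(0, 256, (64, 64))
for level in build_pyramid(img):
    print("transmit", level.shape)   # (8, 8) -> (16, 16) -> (32, 32) -> (64, 64)
```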
Messaging layer A layer in the network system for editing and description of messages. Compare the transport layer. Transport layer A layer in the network model that transmits a message but does not edit the message or interpret the message. Compare the messaging layer. Picture archiving and communication system [PACS] An integrated system for unified management of image acquisition, display, storage, and transmission. Mainly used in the field of medical imaging.
7.2
Image File Format
In addition to the image data itself, the image file generally contains description information of the image to facilitate reading and displaying the image (Zhang 2017a).
7.2.1
Bitmap Images
Image file A computer file containing image data and description information about the image. Image file format A computer file format that contains image data and description information for the image. The image description information in the image file gives information such as the type, size, expression, compression or not, data storage order, data start position, byte offset of the file directory, and the like for reading and displaying the image. Typical image file formats include BMP format, GIF format, TIFF format, and JPEG format. Image file system [IFS] A set of programs that support the development of image processing software and a collection of applications based on these programs. Backup The process of making a complete and accurate mirror copy of an image. BMP format A common image file format with a suffix of “.bmp”. It is a standard in the Windows environment; the full name is the device independent bitmap. BMP image files, also known as bitmap files, consist of three parts: (1) bitmap file header (also called header), (2) bitmap information (often called palette), and (3) bitmap array (i.e., image data). A bitmap file can only hold one image.
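As a concrete illustration of the three-part BMP layout just described, the sketch below parses a few fixed-offset fields of the bitmap file header and the common 40-byte bitmap information header. The field selection and the file name in the usage comment are assumptions for illustration only:

```python
import struct

def read_bmp_header(path):
    """Read the bitmap file header (14 bytes) and the common 40-byte bitmap
    information header of a .bmp file and return a few descriptive fields."""
    with open(path, "rb") as fp:
        file_header = fp.read(14)
        info_header = fp.read(40)
    magic, file_size, _, _, data_offset = struct.unpack("<2sIHHI", file_header)
    (info_size, width, height, planes,
     bit_count, compression) = struct.unpack("<IiiHHI", info_header[:20])
    assert magic == b"BM", "not a BMP bitmap file"
    return {"file_size": file_size, "pixel_data_offset": data_offset,
            "width": width, "height": height, "bits_per_pixel": bit_count,
            "compressed": compression != 0}

# Example usage (hypothetical file name):
# print(read_bmp_header("lena.bmp"))
```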
Bitmapped file format A file format in which the information of each pixel position in the image is recorded, such as the BMP format, GIF format, TIFF format, etc. Device-independent bitmap [DIB] An image file format that guarantees that bitmaps created with one application can be loaded or displayed by other applications. The full name of the BMP format. Device independence is mainly reflected in two aspects: (1) the color mode of the DIB is device independent, and (2) the DIB has its own color table, and the colors of the pixels are independent of the system palette. Since the DIB does not depend on a specific device, it can be used to permanently save images. DIBs are usually saved to disc as .BMP files and sometimes as .DIB files. Applications running on different output devices can exchange images via DIB. Portable network graphics [PNG] format An image file format based on bitmaps, with the suffix ".png". It provides a lossless compressed file format for raster images. It supports grayscale images, color palettes, and color images. It supports all functions of the GIF format, JPEG format, and tagged image file format (except for the animation functions of the GIF format). It always uses a lossless compression scheme. Portable image file format A set of common image file formats with portable properties. It includes the portable bitmap, portable graymap, portable pixmap, portable floatmap, and portable networkmap. Portable pixmap A portable image file format. A color image (24 bits/pixel) can be stored with the file suffix ".ppm". The encoding format is shown in Fig. 7.2, which requires 96 bits per pixel. The format supports uncompressed ASCII and raw binary encoding and can be extended to a portable floating-point format by modifying the header. Portable bitmap A portable image file format. Also known as a portable binary map. Only monochrome bitmaps (1 bit/pixel) are supported, and the file suffix is ".pbm". Portable graymap A portable image file format. It can store 256-level grayscale images (8 bits/pixel) with the file suffix ".pgm".
Fig. 7.2 Portable pixmap encoding format (sign, exponent, and mantissa fields)
Portable floatmap A portable image file format. One can store floating point images (32 bits/pixel) with a file suffix of “.pfm”. Portable networkmap See portable network graphics format.
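The portable formats are simple enough to write by hand; the sketch below writes an 8-bit grayscale image as a binary portable graymap (P5 variant), with the output file name chosen arbitrarily:

```python
import numpy as np

def write_pgm(path, img):
    """Write an 8-bit grayscale image as a binary portable graymap (P5):
    a short ASCII header (magic, width, height, maxval) followed by raw bytes."""
    img = np.asarray(img, dtype=np.uint8)
    h, w = img.shape
    with open(path, "wb") as fp:
        fp.write(f"P5\n{w} {h}\n255\n".encode("ascii"))
        fp.write(img.tobytes())

write_pgm("gradient.pgm", np.tile(np.arange(256, dtype=np.uint8), (64, 1)))
```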
7.2.2
Various Formats
RAW image format Image data obtained directly from the sensor output. Some digital SLR (single-lens reflex) cameras use 16 bits per pixel. Graphics interchange format [GIF] A commonly used compressed image file format based on the LZW encoding algorithm, with the suffix ".gif". It is an 8-bit file format (one byte per pixel), so it can represent at most 256 colors. The image data in the file is compressed. Multiple images can be stored in one GIF file, which facilitates animation on web pages. Tagged image file format [TIFF] A commonly used image file format with the suffix ".tif". It is a self-contained format that is independent of the operating system and file system (e.g., it can be used both in Windows and on Macintosh). It makes it very easy to exchange image data between software packages. It has compressed and uncompressed variants. Scalable vector graphics [SVG] format A common image file format with the suffix ".svg". Unlike raster image formats, which describe the characteristics of each pixel, a vector image format gives a geometric description and can be drawn at any display size. This image file format provides versatile, scriptable vector graphics for web and other publishing applications. OpenEXR [EXR] A film format that stores high dynamic range images in a compact manner, often used in the visual effects industry (such as film production). Its file suffix is ".exr". It can support 16-bit and 32-bit images and can also support a variety of lossless or lossy compression methods. A common OpenEXR file is an FP16 data image file (i.e., 16-bit floating point). Each red, green, and blue channel has 16 bits (as shown in Fig. 7.3), where 1 bit marks the sign, 5 bits store the value of the exponent, and 10 bits store the mantissa of the chromaticity coordinates. Its dynamic range spans about nine orders of magnitude, from a darkest value of about 6.14 × 10⁻⁵ to a brightest of 6.41 × 10⁴. EXR Abbreviation of OpenEXR.
Fig. 7.3 Encoding format of an OpenEXR channel (sign, exponent, and mantissa fields)
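The 1-5-10 bit layout described above is the standard half-precision float encoding; the sketch below decodes one 16-bit pattern and cross-checks it against Python's built-in half-float support. The sample bit pattern is arbitrary:

```python
import struct

def decode_half(bits):
    """Decode a 16-bit half-float pattern: 1 sign bit, 5 exponent bits, 10 mantissa bits."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    if exponent == 0:                       # subnormal numbers
        value = (mantissa / 1024.0) * 2.0 ** (-14)
    elif exponent == 0x1F:                  # infinities / NaN
        value = float("inf") if mantissa == 0 else float("nan")
    else:                                   # normalized numbers
        value = (1.0 + mantissa / 1024.0) * 2.0 ** (exponent - 15)
    return -value if sign else value

# 0x3C00 encodes 1.0; cross-check with the struct module's native half-float format.
print(decode_half(0x3C00), struct.unpack("<e", (0x3C00).to_bytes(2, "little"))[0])
```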
Common intermediate format [CIF] The video format specified in International Standards H.261/H.263. The resolution is 352 × 288 pixels (Y component size is 352 × 288 pixels), the code rate is 128–384 kbps, the sampling format is 4:2:0, the frame rate is 30 P, and the original code rate is 37 Mbps. It uses noninterlaced scanning and encodes the luminance and the two color difference signal components separately. Quarter common intermediate format [QCIF] A standardized video format. First used for the International Standard H.263, with a resolution of 176 × 144 pixels, which is 1/4 of the CIF format (hence the name). The code rate is 20–64 kbps, the Y component size is 176 × 144 pixels, the sampling format is 4:2:0, the frame rate is 30 P, and the original code rate is 9.1 Mbps. Source intermediate format [SIF] An ISO-MPEG format that is basically the same as CIF. Suitable for medium-quality video applications such as video CDs (VCDs) and video games. There are two variants of SIF: one with a frame rate of 30 fps and 240 lines and the other with a frame rate of 25 fps and 288 lines. Both have 352 pixels per line. There is also a corresponding SIF-I (2:1 interlaced) format collection. Source input format [SIF] A video format specified in the International Standard MPEG-1. The code rate is 1.5 Mbps, the Y component size is 352 × 240/288 pixels, the sampling format is 4:2:0, the frame rate is 30 P/25 P, and the original code rate is 30 Mbps. Picture data format [PICT] A data format for 2-D graphics and 2-D raster images on Apple computers. Hierarchical data format [HDF] A universal multi-objective graphic and scientific data file format. Mainly used in the fields of remote sensing and medicine. Déjà vu [DjVu] {F} A file format for saving book documents. Also refers to the method of efficient compression and rapid decompression of such documents. The main technique is to divide the image into a background layer (paper texture and pictures) and a foreground layer (text and lines). A higher resolution of 300 dpi is used to ensure the sharpness of text and lines, and a lower resolution of 100 dpi is used for continuous color pictures and the paper background. In this way, DjVu uses high resolution to restore text (because it is binary, the compression efficiency is still high), allowing sharp
edges to be preserved and maximizing legibility; at the same time, the background image is compressed at a lower resolution to minimize the amount of data while maintaining image quality. Drawing exchange file [DXF] The most widely used interchange format for 2-D and 3-D geometry data. Originating from Autodesk, it is supported by almost all computer-aided design programs.
Chapter 8
Related Knowledge
The research and application of image engineering are closely related to other disciplines, such as mathematics, physics, physiology, psychology, electronics, computer science, and so on (Zhang 2009, 2018a).
8.1
Basic Mathematics
Image technology is closely related to mathematics; first of all is linear algebra, set and matrix theory, and also geometry, regression analysis, etc. (Chan and Shen 2005; Duda et al. 2001; Ritter and Wilson 2001; Serra 1982; Zhang 2017a).
8.1.1
Analytic and Differential Geometry
Cartesian coordinates A position description system in which an n-D point P is described exactly by n coordinates relative to n linearly independent and orthogonal vectors (called axes). Euclidean space The real space in which vectors can be used for the inner product operation, denoted Rn. Special cases are the common Euclidean 2-D space R2 and the Euclidean 3-D space R3. n-tuples are used to represent all n-D spaces. For example, 3-D Euclidean space (X, Y, Z) can be used to describe the objective world, and is also known as Cartesian space (see Cartesian coordinates). Euclidean geometry A geometric system based on Euclid's five axioms. If the parallel axiom were negated, non-Euclidean geometry would be obtained.
Dimension Space X is an n-D space (designated Rn), provided that each vector in X has n components. Dimensionality The number of dimensions to consider. For example, identifying a 3-D target is often seen as a 7-D problem (position 3-D, orientation 3-D, target scale 1-D). Curse of dimensionality Refers to the phenomenon in which the amount of computation increases exponentially with the number of dimensions in problems involving vector calculation. It also commonly refers to the fact that the difficulty of a computational problem increases exponentially as the number of dimensions increases, making traditional numerical calculation methods difficult to apply. As the dimensionality increases, the curse of dimensionality can have different manifestations and consequences: (1) the required amount of calculation increases; (2) the amount of data needed to fill the data space for training to proceed normally increases exponentially; and (3) all data points become nearly equidistant from one another, which causes problems for clustering and machine learning algorithms. Position A place in 2-D space or 3-D space. Differential geometry The branch of mathematics that studies curves and surfaces based on local differential properties, using common concepts such as curvature and tangent. Differential invariant An image description that does not change with geometric transformations and lighting changes. Invariant descriptors can generally be divided into global invariants (corresponding to object primitives) and local invariants (typically based on differentiation of the image function). Here the image function is always considered to be continuous and differentiable. Integral invariant The integral of some function that is characterized by being constant under a set of transformations. For example, the local integral of curvature or arc length along a curve remains constant under rotation and translation. Integral invariants are more stable to noise than differential invariants such as curvature. Geometric algebra A mathematical language whose role is to unify various mathematical forms to describe physical ideas more easily. Geometric algebra in 3-D space is a powerful tool for solving traditional mechanical problems because it provides a very clear and compact way to encode rotations. Riemannian manifold A differentiable manifold equipped with a metric tensor that changes smoothly from one point to another.
Information geometry A branch of mathematics. It applies the technique of differential geometry to the field of probability theory. It uses the probability distribution of a statistical model as points in the Riemannian manifold space to form a statistical manifold. For example, when using a production model to define the Fisher kernel function, the differential geometry of the model parameter space needs to be considered.
8.1.2
Functions
Domain of a function The definition domain of the function, that is, the space of the arguments of the function. For a 2-D image f(x, y), the function domain is the (x, y) space. Range of a function The dynamic range of the function values. It can be a vector or a scalar and is function dependent. Unimodal A term describing a function or distribution with only a single peak. For example, the sine function has only one peak in one period, and the Gaussian distribution has a single peak over its whole domain. Monotonicity A series of continuously increasing or decreasing values, or such a function. The former is called monotonically increasing, and the latter monotonically decreasing. Rounding function Any function that converts real values to integer values. Commonly used in image engineering are (1) the up-rounding function, (2) the down-rounding function, and (3) the normal-rounding function. Normal-rounding function A rounding function for taking the integer value of a number by rounding to the nearest integer. This function is generally written round(•). If x is a real number, round(x) is an integer, and x − 1/2 < round(x) ≤ x + 1/2. Down-rounding function A rounding function that gives the integer part of a number. Also called the floor function. It can be written ⌊•⌋. If x is a real number, ⌊x⌋ is an integer, and x − 1 < ⌊x⌋ ≤ x. Floor function Same as down-rounding function. Floor operator See down-rounding function.
Fig. 8.1 Convex function diagram: the chord joining (a, f(a)) and (b, f(b)) lies above the curve f(x)
Up-rounding function
A rounding function, also called the roof function. It can be written as ⌈•⌉. If x is a real number, ⌈x⌉ is an integer, and x ≤ ⌈x⌉ < x + 1.

Roof function
Same as up-rounding function.

Ceiling function
Same as up-rounding function.

Convex function
A class of mathematical functions with convexity: the chord is always higher than (or equal to) the function's own curve, as shown in Fig. 8.1. Any x in [a, b] can be written as ka + (1−k)b, where 0 ≤ k ≤ 1. The value of the corresponding point on the chord is kf(a) + (1−k)f(b), and the corresponding function value is f(ka + (1−k)b). Convexity here means f(ka + (1−k)b) ≤ kf(a) + (1−k)f(b). This is equivalent to the second derivative of the function f(x) being positive everywhere. A function is strictly convex if its chord and function values are equal only at k = 0 and k = 1. In addition, if f(x) is a convex function, then −f(x) is a concave function.

Jensen's inequality
An inequality satisfied by a convex function. Letting f(x) be a convex function and using E[•] to represent the expected value, Jensen's inequality can be written as

    f(E[x]) ≤ E[f(x)]

If x is a continuous variable and p(x) is its probability distribution, the above formula corresponds to

    f(∫ p(x) x dx) ≤ ∫ p(x) f(x) dx

If x is a discrete variable with values {x_i}, i = 1, 2, . . ., N, then

    f(Σ_{i=1}^{N} p_i x_i) ≤ Σ_{i=1}^{N} p_i f(x_i)
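A minimal numerical check of Jensen's inequality for the convex function f(x) = x², sketched in Python; the sampled distribution and variable names are illustrative assumptions, not from the original text:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)            # values x_i of the discrete variable
    p = rng.random(5)
    p = p / p.sum()                   # probabilities p_i, summing to 1

    f = lambda t: t ** 2              # a convex function

    lhs = f(np.dot(p, x))             # f(sum_i p_i x_i)
    rhs = np.dot(p, f(x))             # sum_i p_i f(x_i)
    print(lhs <= rhs)                 # True, as Jensen's inequality requires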
Fig. 8.2 Concave function diagram: the chord joining (a, f(a)) and (b, f(b)) lies below the curve f(x)
Concave function
A type of mathematical function whose chord is always lower than (or equal to) the function's own curve, as shown in Fig. 8.2. Any x in [a, b] can be written as ka + (1−k)b, where 0 ≤ k ≤ 1. The value of the corresponding point on the chord is kf(a) + (1−k)f(b), and the corresponding function value is f(ka + (1−k)b). Concavity here means f(ka + (1−k)b) ≥ kf(a) + (1−k)f(b). This is equivalent to the second derivative of the function f(x) being negative everywhere. A function is strictly concave if its chord and function values are equal only at k = 0 and k = 1. In addition, if f(x) is a concave function, then −f(x) is a convex function.

Even function
A function that satisfies f(−x) = f(x) for all x.

Spherical harmonic
A function defined on the unit sphere, of the form

    Y_l^m(θ, φ) = N_lm P_l^m(cos θ) e^{imφ}
Among them, N_lm is a normalization factor. An arbitrary real function f(θ, φ) defined on the sphere has a spherical harmonic expansion of the form

    f(θ, φ) = Σ_{l=0}^{∞} Σ_{m=−l}^{l} k_l^m Y_l^m(θ, φ)

This is similar to the Fourier expansion of a function defined on a plane; k_l^m is analogous to a Fourier coefficient.

Bessel function
One of the special mathematical functions, named after the nineteenth-century German astronomer Bessel. It was put forward when using
cylindrical coordinates to solve physical problems such as potential fields in circles, cylinders, and spheres.
8.1.3 Matrix Decomposition
Matrix decomposition
Splitting a matrix into a product of several matrices. Commonly used methods include QR factorization, triangular factorization, eigenvalue decomposition, singular value decomposition, and so on.

Decomposition methods
Specifically refers to a method that solves a series of small quadratic programming problems, each of fixed size, for training support vector machines. This method can be used for large data sets of any size. Compare chunking.

QR factorization
A commonly used matrix decomposition method, in which a real-valued M × N matrix A is represented as

    A = QR

where Q is an orthonormal matrix (or unitary matrix), QQᵀ = I, and R is an upper triangular matrix (R is derived from "rechts" in German, that is, "right"). It is a technique that can solve ill-conditioned least squares problems and can also be used as the basis for more complex algorithms. In image engineering, QR factorization can be used to convert a camera matrix into a rotation matrix and an upper triangular calibration matrix, or in various self-calibration algorithms.

Triangular factorization
A commonly used matrix decomposition method in which the original square matrix is decomposed into the product of an upper triangular matrix (or a permuted upper triangular matrix) and a lower triangular matrix. Also called the diagonal method, LU decomposition method, maximum decomposition method, analytical factor analysis method, etc. It is mainly used to simplify the computation of the determinant of a large matrix or to accelerate the computation of inverse matrices.

Singular value decomposition [SVD]
A commonly used matrix decomposition method. The arithmetic square roots of the eigenvalues of A*A, where A* is the conjugate transpose of A, are the singular values of A. It can also be used to introduce the principle of principal component analysis. Any m × n matrix A can be decomposed as A = UDVᵀ. The columns of the m × m matrix U are unit vectors orthogonal to each other, and the rows of the n × n matrix V are also unit vectors orthogonal to each other. The m × n matrix D is a diagonal matrix.
Fig. 8.3 Geometric meaning of matrix singular values: A maps the orthogonal unit vectors v0, v1 to the orthogonal directions u0, u1
Its non-zero elements are called singular values σ_i, which satisfy σ_1 ≥ σ_2 ≥ . . . ≥ σ_ν ≥ 0; that is, the singular values are all non-negative. SVD is commonly used in image compression and in solving least square error problems. It has very useful properties, such as:
1. A is non-singular if and only if all singular values of A are non-zero. The number of non-zero singular values gives the rank of A.
2. The columns of U corresponding to the non-zero singular values span the column space of A; the columns of V corresponding to the zero singular values span the null space of A.
3. The squares of the non-zero singular values are the non-zero eigenvalues of AAᵀ and AᵀA; the columns of U are the eigenvectors of AAᵀ, and the columns of V are the eigenvectors of AᵀA.
4. In the solution of rectangular linear systems, the pseudo-inverse of a matrix can be easily calculated from its SVD.
When the decomposition is written out explicitly, a real-valued M × N matrix A can be expressed as

    A_{M×N} = U_{M×P} Σ_{P×P} V^T_{P×N} = [u_0 · · · u_{P−1}] diag(σ_0, . . ., σ_{P−1}) [v_0 · · · v_{P−1}]ᵀ
where P = min(M, N) and Σ is a diagonal matrix. The matrices U and V are orthogonal, that is, UᵀU = I and VᵀV = I. Their column vectors are also orthonormal, that is, u_i • u_j = v_i • v_j = δ_ij. To see the geometric meaning of the singular values of a matrix intuitively, A = UΣVᵀ can be written as AV = UΣ, or Av_j = σ_j u_j. The latter formula shows that the matrix A maps the basis vector v_j to the u_j direction with length σ_j, as shown in Fig. 8.3. Each orthogonal vector v_j is converted into an orthogonal vector u_j of length σ_j. If A is regarded as a covariance matrix, after eigenvalue decomposition, each axis u_j indicates a principal direction, and σ_j indicates the standard deviation along that direction. Singular value decomposition has an important property: if the following sum is truncated, the best minimum mean square approximation to the original matrix A is obtained:
    A = Σ_{j=0}^{t} σ_j u_j v_jᵀ
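A minimal NumPy sketch of the singular value decomposition and its truncated approximation; the test matrix and truncation rank below are arbitrary illustrative choices, not from the original text:

    import numpy as np

    A = np.random.default_rng(1).normal(size=(6, 4))   # arbitrary M x N matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt

    t = 2                                              # truncation rank
    A_t = (U[:, :t] * s[:t]) @ Vt[:t, :]               # sum of the first t terms s_j u_j v_j^T

    # The truncated sum gives the best approximation of A, in the least squares
    # (Frobenius norm) sense, among all matrices of rank t.
    print(np.linalg.norm(A - A_t))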
Truncated SVD [tSVD]
A scheme for dictionary learning using singular value decomposition. It keeps the most important singular values and sets the unimportant singular values to zero to reduce the effect of errors.

Skinny-singular value decomposition [skinny-SVD]
A simplified form of singular value decomposition (SVD). It truncates the redundant columns of the matrices in the SVD and obtains a more compact (slim) SVD decomposition.

Kernel-singular value decomposition [K-SVD]
A technique for finding dictionary representations of data in sparse coding. Its goal is to represent each data item as a linear combination of a few atomic signals. Its basic idea is to examine the contribution of the atomic signals to the data fit one by one and to update the atomic signals one by one. This technique eliminates mutual interference between atomic signals.

Higher-order singular value decomposition [HOSVD]
A feature decomposition method that can be used to analyze multilinear structures. For example, when decomposing facial expressions, the images of different faces and expressions are represented by a third-order tensor (subspace), and then the tensor is decomposed using higher-order singular value decomposition to obtain an individual subspace, an expression subspace, and a feature subspace.

Generalized singular value decomposition [GSVD]
The generalization of singular value decomposition that considers multiple matrices that can be multiplied together.

Multilinear method
A multilinear extension of linear algebra that can handle general tensor problems (vectors are first-order tensors; matrices are second-order tensors). Similar to the singular value decomposition of a matrix, there is a d-mode singular value decomposition of a tensor. "Tensor faces" are one of its applications, in which the analysis of the face image involves multiple varying factors: the person's identity, facial expression, viewpoint, and lighting.

Cholesky decomposition
A method for generating a multivariate Gaussian distribution with vector-valued mean μ and covariance matrix Σ from a Gaussian distribution with zero mean and unit variance. The covariance matrix is decomposed as Σ = LLᵀ. If x is a vector-valued random variable whose components are independent Gaussian variables with zero mean and unit variance, then y = Lx + μ has mean μ and covariance matrix Σ.
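A short NumPy sketch of sampling a multivariate Gaussian via the Cholesky factor as just described; the particular μ and Σ chosen here are illustrative assumptions:

    import numpy as np

    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])           # must be symmetric positive definite

    L = np.linalg.cholesky(Sigma)            # Sigma = L @ L.T

    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 10000))          # zero-mean, unit-variance components
    y = L @ x + mu[:, None]                  # samples with mean mu, covariance Sigma

    print(y.mean(axis=1))                    # approximately mu
    print(np.cov(y))                         # approximately Sigma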
Toeplitz matrix
A matrix in which all values along each diagonal are the same. Such a matrix arises, for example, as the autocorrelation matrix derived from a single image when the image is traversed using its 1-D representation (assuming it repeats infinitely).

Power method
A method for estimating the maximum eigenvalue of a matrix. It can compute only the most significant eigenvalue of a matrix, and this is only possible if the matrix (e.g., a covariance matrix C) has a single dominant eigenvalue (as opposed to a case with two or more eigenvalues of the same absolute value). If matrix A is diagonalizable and has a single dominant eigenvalue, that is, a single eigenvalue with the largest absolute value, then this eigenvalue and the corresponding eigenvector can be estimated using the following algorithm:
1. Choose a unit-length vector that is not orthogonal to the dominant eigenvector; call it x′_0. A randomly selected vector will almost certainly satisfy this. Set k = 0.
2. Calculate x_{k+1} = A x′_k.
3. Normalize x_{k+1} so that it has unit length:

    x′_{k+1} = x_{k+1} / sqrt(x²_{k+1,1} + x²_{k+1,2} + · · · + x²_{k+1,N})

where x_{k+1,i} is the i-th element of x_{k+1}, which has N elements.
4. If x′_{k+1} and x′_k differ by more than a certain allowable amount, set k = k + 1 and go to step (2).
The more dissimilar the eigenvalues of the matrix, the faster the algorithm converges. Once convergence is reached, the dominant eigenvalue can be calculated using the Rayleigh quotient of the estimated eigenvector x:

    λ_dominant = (xᵀ A x) / (xᵀ x)

This method can also be used to calculate the minimum non-zero eigenvalue of a matrix.
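A compact Python sketch of the power method just described; the test matrix, tolerance, and iteration cap are illustrative assumptions:

    import numpy as np

    def power_method(A, tol=1e-10, max_iter=1000):
        # Estimate the dominant eigenvalue/eigenvector of a diagonalizable matrix A.
        x = np.random.default_rng(0).normal(size=A.shape[0])
        x /= np.linalg.norm(x)                  # step 1: random unit-length start
        for _ in range(max_iter):
            y = A @ x                           # step 2
            y /= np.linalg.norm(y)              # step 3: renormalize
            if np.linalg.norm(y - x) < tol:     # step 4: stop when estimates agree
                x = y
                break
            x = y
        lam = x @ A @ x / (x @ x)               # Rayleigh quotient
        return lam, x

    A = np.array([[4.0, 1.0], [2.0, 3.0]])
    print(power_method(A))                      # dominant eigenvalue is 5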
Woodbury identity
An identity that relates a matrix and its inverse. It can be written as

    (A + BD⁻¹C)⁻¹ = A⁻¹ − A⁻¹B(D + CA⁻¹B)⁻¹CA⁻¹
When A is a large diagonal matrix (easy to invert) and the number of rows in B is much greater than the number of columns while the number of columns in C is much
greater than the number of rows, calculating the right side of the equation is much faster than calculating the left side.

Kronecker product
A mathematical operation between two matrices of any size; it is also a special form of tensor product. If A is an m × n matrix and B is a p × q matrix, their Kronecker product is an mp × nq block matrix. For a separable 2-D transform, the 2-D transform kernel can be written as the product of two 1-D transform kernels; accordingly, the 2-D transformation matrix can be written as the Kronecker product of two 1-D transformation matrices.
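A brief NumPy illustration of the Kronecker product and its use for a separable 2-D transform; the matrices here, including the stand-in 1-D transform, are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    # Kronecker product of an m x n and a p x q matrix is an mp x nq block matrix.
    A = rng.normal(size=(2, 3))
    B = rng.normal(size=(4, 5))
    print(np.kron(A, B).shape)          # (8, 15)

    # Separable 2-D transform: applying a 1-D transform T along the rows and the
    # columns of an N x N image f equals multiplying the (row-major) flattened
    # image by kron(T, T).
    N = 4
    T = rng.normal(size=(N, N))         # stand-in for a 1-D transform matrix
    f = rng.normal(size=(N, N))
    F_2d = T @ f @ T.T
    F_kron = (np.kron(T, T) @ f.reshape(-1)).reshape(N, N)
    print(np.allclose(F_2d, F_kron))    # True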
8.1.4 Set Theory
Set
A collection, taken as a whole, of distinct and well-determined objects of a certain kind.

Cover of a set
The cover of a non-empty set A in a space X is a collection of non-empty subsets of X whose union is A. In other words, letting C(A) be a cover of A,

    A = ∪_{B ∈ C(A)} B

Cover
1. See cover of a set.
2. A work that has informational content and carries protected content. Generally, a certain data type is used to indicate the medium of the work, such as "image" or "video."

Compact set
A set for which every open cover has a finite subcover. For example, the interior of every non-empty picture point set A in the Euclidean plane is compact.

Set interior
The part of a set that does not include its boundary. See open set.

Boundary region of a set
The set of all points belonging to a boundary region. For example, the atmosphere of the earth gives the boundary region of the space region above the surface of the earth.

Complement of a set
Let A be a subset of X; the complement of A (denoted Aᶜ) is the set of all points of X that are not in A.

Topology of a set
A collection τ of open sets on a non-empty open set X is a topology on X if:
1. The empty set ∅ is open, and ∅ is in τ.
2. The set X is open, and X is in τ.
3. If A is a collection of open sets from τ, then

    ∪_{B ∈ A} B

is an open set in τ. In other words, the union of open sets in τ is another open set in τ.
4. If A is a collection of open sets from τ, then

    ∩_{B ∈ A} B

is an open set in τ.
In other words, the intersection of open sets in τ is another open set in τ.

Binary relation
For a set S, a binary relation on S is a set of ordered pairs of elements of S. Equivalently, a binary relation on S is a subset of the Cartesian product S × S. If two sets S and T are considered, a binary relation between them is a subset of the Cartesian product S × T.

Equivalence relation
A relation R between elements x, y, and z of a set S satisfying the following three properties:
1. Reflexive (self-inverting): (x, x) ∈ R.
2. Symmetric: (x, y) ∈ R implies (y, x) ∈ R.
3. Transitive: (x, y) ∈ R and (y, z) ∈ R imply (x, z) ∈ R.

Self-inverting
The property that two operations are interchangeable, for example, the wavelet transform when the basis functions for analysis and synthesis are the same.

Transitive
See partial order and strict order.

Subset
A part of a set. A subset of an image is still a set of pixels; it is a part of the image and can also be called a sub-image.

Infimum
The infimum of a set is its greatest lower bound.
8.1.5 Least Squares
Least squares
A class of least mean square estimation methods. A mathematical optimization technique that finds the best function match for a set of data by minimizing the sum of squared errors; that is, it finds the parameter estimates that make the model best fit the data. It can be used for various fitting problems, and also for many other optimization problems by expressing a minimized energy or maximized entropy in least squares form. It can be further divided into regular least squares, robust least squares, iteratively reweighted least squares, and so on. Compare the maximum likelihood method.

Regular least squares
See least squares. Suitable for the case where the noise is Gaussian noise.

Robust least squares
See least squares. It can reduce the influence of outliers, so it is suitable for situations with outliers.

Weighted least squares [WLS]
An extension of least squares in which the weights can be designed according to the application. It is a special least mean square estimation process in which each data element is associated with a weight. Weights can specify the credibility or quality of a data item; using weights can help make the estimation more robust.

Iteratively reweighted least squares [IRLS]
A robust least squares method obtained by introducing a weighting function into the M-estimator. It is computed iteratively, by alternately solving a weighted least squares problem and evaluating the influence function, and belongs to the family of gradient descent techniques.

Iterative reweighted least squares [IRLS]
A weighted least squares technique in which the weighting matrix is not constant but depends on the parameter vector. It therefore proceeds iteratively, using the new parameter vector each time to compute an updated weighting matrix.

Regularized least squares
A least squares formulation in which the penalty term used for regularization is the sum of the squares of the weight vector elements and the error function is the sum of the squares of the errors. In machine learning's sequential learning algorithms, it is also called weight decay because it encourages weight values to decay toward zero. In statistics, it is a coefficient shrinkage method that pushes coefficient values toward zero. The advantage of this method is that the error function remains a quadratic function of the weights, so it can be minimized exactly in closed form. It is an extension of the traditional least squares method.
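A small NumPy sketch contrasting ordinary and weighted least squares for a line fit; the data, weights, and noise model are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 50)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)
    y[-1] += 3.0                              # one unreliable measurement

    A = np.column_stack([x, np.ones_like(x)]) # design matrix for y = a*x + b

    # Ordinary least squares: minimize ||A p - y||^2
    p_ols, *_ = np.linalg.lstsq(A, y, rcond=None)

    # Weighted least squares: minimize sum_i w_i (A_i p - y_i)^2,
    # solved here by scaling each row by sqrt(w_i).
    w = np.ones_like(x)
    w[-1] = 1e-3                              # down-weight the unreliable point
    sw = np.sqrt(w)
    p_wls, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)

    print(p_ols, p_wls)                       # the WLS estimate is closer to (2, 1)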
Total least squares [TLS]
A least squares formulation minimizing a set of homogeneous squared errors, which can generally be written as

    E_TLS = Σ_i (a_i · x)² = ‖Ax‖²
The measurement error in this case is not in a specific direction but can be in any direction. See orthogonal regression.

Orthogonal least squares
A systematic method for selecting basis functions. It is a sequential selection process; at each step, the data point selected as the next basis function center is the one that gives the largest reduction in the squared-sum error.

Least squares estimation
Same as least mean square estimation.

Least mean square estimation
A statistical estimate, also known as least squares estimation or mean square estimation. Let v be the parameter vector sought and e_i(v) be the error measure associated with the i-th data item (N items in total). Common error metrics include the Euclidean distance, algebraic distance, and Mahalanobis distance, measured from the i-th data item to the fitted curve or surface parameterized by v. The mean square error can be written as

    (1/N) Σ_{i=1}^{N} e_i(v)²
The sought parameter vector v is the one that minimizes this sum.

Least median of squares estimation
A statistical estimate related to least mean square estimation. Let v be the parameter vector sought and e_i(v) be the error metric associated with the i-th data item (N items in total). Common error metrics include the Euclidean distance, algebraic distance, and Mahalanobis distance, measured from the i-th data item to the fitted curve or surface parameterized by v. The median of squares error is the middle value of the ordered set {e_i(v)²}, and the sought parameter vector v is the one that minimizes this value. Computing this value often requires more iteration and sorting, but it is more robust to outliers than least mean square estimation.

Least square surface fitting
A least mean square estimation process in which a set of data points (usually depth data) is fitted with a parametric surface model. To evaluate the fitting effect, the Euclidean
distance, algebraic distance, and Mahalanobis distance are commonly used.

Constrained least squares
A regression technique and optimization strategy that attempts to minimize ||Ax − b||² over a previously determined subset of possible solutions x, for example, when the function values of certain points on a parametric curve are known. This gives a constrained version of the least squares problem: minimize ||Ax − b||² subject to Bx = c. There are many ways to solve this problem, such as QR factorization and singular value decomposition. This technique is useful in least squares surface fitting, for example, when the fitted plane is required to be perpendicular to some other plane.

Direct least square fitting
Fitting a model directly to the data using methods with closed-form solutions or globally convergent solutions.

Cramer-Rao lower bound
In a nonlinear least squares problem, the inverse of the Hessian matrix provides the minimum covariance attainable for a given solution. If the optimal solution corresponds only to a local minimum, the distribution has a long tail.
8.1.6 Regression
Regression
1. In statistics, the relationship between two variables, as in linear regression. It is a special case of curve fitting and coding.
2. Regression testing: after implementing changes or adjustments to a system, verifying that the changes have not caused its functionality to degrade or regress to a state where some functionality no longer exists.

Regression testing
Tests that verify that changes to a system implementation did not result in a loss of system functionality or a return to a state in which some functionality did not exist.

Regression function
See Robbins-Monro algorithm.

Robbins-Monro algorithm
A general algorithm for constructing sequential learning procedures. Consider a pair of random variables x and y defined by a joint distribution; the conditional expectation of y given x defines a deterministic function (the regression function):

    f(x) = E[y | x]
Fig. 8.4 Regression function
Figure 8.4 shows a schematic of a regression function, where f(x) is given by the conditional expectation E[y|x], and the Robbins-Monro algorithm provides a general sequential process for determining the solution x* of the regression function. The goal here is to determine the solution x* such that f(x*) = 0. If a large set of observations of x and y has been obtained, the regression function can be modeled directly and its solution estimated. Now consider instead obtaining only one value of y at a time and seeking a sequential method for estimating x*. Assume that the conditional variance of y is finite, that is, E[(y − f(x))² | x] < ∞.
Without loss of generality, consider f(x) > 0 for x > x* and f(x) < 0 for x < x*, as shown in Fig. 8.4. The Robbins-Monro algorithm defines a series of successive estimates of the solution x*:

    x^(N) = x^(N−1) + w_{N−1} y(x^(N−1))

where y(x^(N−1)) is the observed value of y when x takes the value x^(N−1), and the coefficients {w_N} form a sequence of positive numbers satisfying

    lim_{N→∞} w_N = 0,    Σ_{N=1}^{∞} w_N = ∞,    Σ_{N=1}^{∞} w_N² < ∞
The first condition above guarantees that the correction values decrease continuously so that the iterative process converges to a finite value.
Fig. 8.5 Outlier (veto) point and inliers in robust regression
The second condition guarantees that the algorithm does not converge to a point other than the solution, while the third condition guarantees that the accumulated noise has finite variance and so does not prevent convergence.

Ridge regression
A method for regularizing linear regression and related methods with a penalty term. The penalty term includes the sum of the squares of the regression coefficients, so as to penalize large coefficients. Same as Tikhonov regularization. See shrinkage. It is generally used when the error function is quadratic, in which case an analytical solution exists.

Robust regression
A form of regression that does not use outlier values when computing the fit parameters. For example, general regression methods use all data points to perform a minimum mean square line fit; if even one point lies far from the "true" line, it can distort the result. Robust regression either eliminates such outliers or reduces their contribution to the result. Figure 8.5 illustrates rejecting an outlier point.

Kernel regression
A nonparametric method for constructing a regression. Given a data set of input-output pairs, containing n pairs (x_i, y_i), i = 1, 2, . . ., n, kernel regression estimates the function value at an input point x as

    Σ_{i=1}^{n} y_i k(x, x_i) / Σ_{i=1}^{n} k(x, x_i)

where k(•, •) denotes the kernel function.

Kernel ridge regression
A generalization of linear regression that is equivalent to ridge regression in a high-dimensional feature space. Here the kernel trick means that the calculation is not performed in the feature space but with an n × n matrix (if there are n data points). The resulting predictor has the form
    f(x) = Σ_{i=1}^{n} a_i k(x, x_i)
where k(•, •) denotes the kernel function and the coefficients {a_i} can be determined by solving a linear system. See Gaussian process regression.

Gaussian process regression
A Bayesian regression method in which a Gaussian process prior is placed on the function. Assuming that the observations are disturbed by Gaussian noise, the posterior over the function is also a Gaussian process. For any given test input, the predictive mean and variance can be obtained.

Minkowski loss
In regression, an improvement on the common quadratic loss that improves the results when the conditional distribution is multimodal.

Logistic regression
A widely used classification model for two-class problems. Letting x denote the input feature vector, the logistic regression model predicts p(C1|x) = σ(wᵀx + w0), where σ(z) = (1 + e⁻ᶻ)⁻¹ is the logistic sigmoid function, w is a parameter vector, and w0 is the offset term. Correspondingly, p(C2|x) = 1 − p(C1|x). The parameters can be fitted using maximum likelihood estimation, and it can be shown that the optimization problem is convex (i.e., there is no local extremum). See logistic regression model.

Logistic regression model
A linear statistical model. Consider a two-class problem. Given a linear function V of the feature vector v, the posterior probability of class C1 can be written as a logistic sigmoid function acting on V:

    y(C1|v) = y(v) = S(wᵀv)

Similarly, the posterior probability of class C2 can also be written in terms of the logistic sigmoid function acting on V:

    y(C2|v) = 1 − y(C1|v) = 1 − S(wᵀv)

The above model is called a logistic regression model, although it is actually a model for classification.

Mixture of logistic regression model
A combination of logistic regression models. Because the logistic regression model defines the conditional distribution of the target variable given the input variable, it can be used as the component distribution in a mixture model, yielding conditional distributions far richer than those of a single logistic regression model.
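A minimal Python sketch of the two-class logistic regression prediction just described; the weights and input below are arbitrary illustrative values, not fitted parameters:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))      # logistic sigmoid

    w = np.array([1.5, -0.7])                # parameter vector (assumed)
    w0 = 0.2                                 # offset term (assumed)
    x = np.array([0.8, 0.3])                 # input feature vector

    p_c1 = sigmoid(w @ x + w0)               # p(C1 | x)
    p_c2 = 1.0 - p_c1                        # p(C2 | x)
    print(p_c1, p_c2)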
Multi-class logistic regression
The result of extending the basic logistic regression method to problems with multiple classes.

Bayesian logistic regression
Bayesian processing of logistic regression. Exact Bayesian inference for logistic regression is intractable; applying the Laplacian approximation is one way to address this problem.

Linear regression
The simplest regression model. Given a set of samples x_i and y_i, it estimates the connection between two random variables X and Y. The goal is to estimate the matrix A and the vector a that minimize the residual r(A, a) = Σ_i ||y_i − Ax_i − a||². In this formulation, x_i is assumed to be noise-free; orthogonal regression works better when all variables are affected by error. In N-D space, linear regression is a linear combination of the input variables x = (x1, x2, . . ., xN)ᵀ and a linear function of the parameters w0, w1, . . ., wN:

    y(x, w) = w0 + w1 x1 + · · · + wN xN

Mixture of linear regression model
The result of directly extending the Gaussian mixture model to conditional Gaussian distributions.

Bias parameter
A special parameter in linear regression models. Linear regression can be expressed as

    y(x, w) = w0 + w1 x1 + · · · + wN xN

where x = (x1, x2, . . ., xN)ᵀ. A key feature of this model is that it is a linear function of the parameters w1, w2, . . ., wN. Of course, it is also a linear function of the input variables x_i, which is a significant limitation of the model. This type of model can be extended by considering a linear combination of fixed nonlinear functions of the input variables, that is,

    y(x, w) = w0 + Σ_{j=1}^{M−1} w_j B_j(x)
where B_j(x) is called a basis function. Letting the maximum value of the index j be M−1, the total number of parameters in the model is M. The parameter w0 allows the data to have an arbitrary fixed offset, so it can be called an offset (bias) parameter. For convenience, an additional dummy basis function B0(x) = 1 is often defined, and then
    y(x, w) = Σ_{j=0}^{M−1} w_j B_j(x) = wᵀ B(x)
where w = (w0, w1, . . ., wM−1)ᵀ and B = (B0, B1, . . ., BM−1)ᵀ.

Orthogonal regression
The general form of linear regression when both {x} and {y} are measured data affected by noise. Also called total least squares (total minimum mean square). Given samples x_i′ and y_i′, the goal is to estimate the true points (x_i, y_i) and line parameters (a, b, c) such that ax_i + by_i + c = 0 for all i while minimizing the error Σ_i (x_i − x_i′)² + (y_i − y_i′)². The estimate is easy to obtain: the line or surface (in a high-dimensional space) passes through the center of gravity of the data along the direction of the eigenvector of the data scatter matrix with the smallest eigenvalue.
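A short NumPy sketch of orthogonal (total least squares) line fitting via the scatter matrix, under the assumptions above; the data are synthetic and illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    t = rng.uniform(-1, 1, 100)
    pts = np.stack([t, 0.5 * t + 0.2], axis=1)        # points near a line
    pts += rng.normal(scale=0.05, size=pts.shape)     # noise in both x and y

    c = pts.mean(axis=0)                              # center of gravity
    S = (pts - c).T @ (pts - c)                       # scatter matrix
    evals, evecs = np.linalg.eigh(S)                  # eigenvalues in ascending order

    n = evecs[:, 0]                  # line normal = smallest-eigenvalue eigenvector
    a, b = n
    cc = -n @ c                      # line a*x + b*y + cc = 0 through the centroid
    print(a, b, cc)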
8.1.7 Linear Operations
Linear
1. Having a linear form.
2. A mathematical description of a process in which the relationship between some input variables x and some output variables y can be written as y = Ax, where A is a matrix.

Linear transformation
A numerical transformation of a set of values in which values are only added to or multiplied by constants. If the set of values is a vector x, then a general linear transformation gives another vector y = Ax, where y need not have the same dimension as x and A is a constant matrix (i.e., not a function of x).

Linear operator
A class of operators that share an important common property, independent of the particular task to be done. Consider O as an operator that operates on or converts an image, let f and g be images, and let O(f) be the result obtained by applying O to f. If

    O[af + bg] = aO[f] + bO[g]

then O is a linear operator. It can be fully defined by its point spread function. Compare nonlinear operators.

Superposition principle
A property of linear operators. Letting the two images be f1(x, y) and f2(x, y), the superposition principle can be expressed by
    h[f1(x, y) + f2(x, y)] = h[f1(x, y)] + h[f2(x, y)]

where h is a linear operation. That is, the operation applied to the sum of two images equals the sum of the operations applied to the two images separately.

Linear independence
A property that a set of matrices may satisfy. Consider a set of matrices {v1, v2, . . ., vN}; if for a set of coefficients {k1, k2, . . ., kN}, Σ_n k_n v_n = 0 only when all k_n = 0, then the set is linearly independent. This means that no member of the set can be represented as a linear combination of the remaining members. The rank of a matrix is the maximum number of linearly independent rows or columns in the matrix.

Linearly separable
A binary classification problem is linearly separable if the two classes can be separated by a hyperplane. More generally, a problem is linearly separable if the decision boundary can be expressed by a linear classifier, that is, a classifier that assigns the input x to the class i = 1, 2, . . ., C maximizing w_iᵀx + w_0i. In image classification, for each region (image or feature) corresponding to a class, a hyperplane can then be found that separates this class from the other classes.

Linearly non-separable
For a classification problem, if the categories in the data set are not linearly separable, the categories are said to be linearly non-separable.

Linear system of equations
A system of equations in which each equation is of first degree in the unknowns. Many image processing algorithms can be reduced to the problem of solving a system of multivariate linear equations.

Solution by Jacobi's method
An algorithm for the approximate iterative solution of a large linear system of equations. Compare the solution by Gauss-Seidel method.

Solution by Gauss-Seidel method
An algorithm for the approximate iterative solution of a large linear system of equations. A more elaborate version of the solution by Jacobi's method.

Non-singular linear transform
A linear transform whose transformation matrix has a non-zero determinant.

Shift-invariant principle
A property of linear operations. It can be expressed as (∘ denotes an operation satisfying the shift-invariance principle)
    g(i, j) = f(i + k, j + l)  ⟺  (h∘g)(i, j) = (h∘f)(i + k, j + l)

In terms of images, this says that the order of translating an image and performing the operation is interchangeable; in other words, the behavior of the operation is the same everywhere.

Linear blend
A method or process that combines two or more inputs linearly and outputs the result.

Linear blend operator
An operator that implements a linear blend. With two inputs f1 and f2, its function and effect can be expressed as

    g(x, y) = (1 − w) f1(x, y) + w f2(x, y)

As the weight w gradually increases from 0 to 1, the two images gradually cross-dissolve.

Uniquenesses
In factor analysis, the matrix elements used to express the independent variance associated with each coordinate; they also represent the independent noise variance of each variable.
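A minimal NumPy sketch of the linear blend (cross-dissolve) operator above; the two arrays stand in for images and are illustrative:

    import numpy as np

    def linear_blend(f1, f2, w):
        # g = (1 - w) * f1 + w * f2, with w in [0, 1]
        return (1.0 - w) * f1 + w * f2

    f1 = np.zeros((4, 4))
    f2 = np.full((4, 4), 255.0)

    for w in (0.0, 0.25, 0.5, 1.0):        # gradually cross-dissolve f1 into f2
        print(w, linear_blend(f1, f2, w)[0, 0])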
8.1.8 Complex Plane and Half-Space
Complex plane
The plane of complex numbers. A point z on the complex plane has the form z = a + bi, with a, b ∈ R and i = √(−1).

Riemann surface
A union of infinitely many complex planes (sheets).

Conformal mapping
A function mapping the complex plane to itself while preserving local angles; it can be written as f: C → C. For example, the complex function y = sin(x) = (e^{ix} − e^{−ix})/(2i) is conformal.

Conjugate symmetry
A property of complex functions. If the real part of a function f(x) is an even function and the imaginary part is an odd function, then f(x) is a conjugate symmetric function. In other words, the conjugate of f(x) is f(−x). The amplitude component is evenly symmetric, while the phase component is oddly symmetric. Compare conjugate antisymmetry.
Conjugate antisymmetry
A property of complex functions. If the real part of a function f(x) is an odd function and the imaginary part is an even function, then f(x) is a conjugate antisymmetric function. In other words, the conjugate of f(x) is −f(−x). The amplitude component is oddly symmetric, while the phase component is evenly symmetric. Compare conjugate symmetry.

Hermitian symmetry
Same as conjugate symmetry.

Anti-Hermitian symmetry
Same as conjugate antisymmetry.

Conjugate transpose
An operation on a matrix. A matrix U is called a unitary matrix if its inverse is the complex conjugate of its transpose, that is,

    U U^{T*} = I

where I is the identity matrix. Sometimes the superscript "H" is used instead of "T*"; U^{T*} ≡ U^H is called the Hermitian transpose or conjugate transpose of the matrix U.

Hermitian transpose
Same as conjugate transpose.

Half-space
Given a straight line L in an n-D vector space Rn, a half-space is a set of points that includes all points on one side of the boundary line L. If the points belonging to the straight line L are also included in the half-space, the half-space is closed; otherwise, it is open.

Upper half-space
A closed half-space including all points belonging to the line L and all points above L.

Lower half-space
A closed half-space including all points belonging to the line L and all points below L.

Open half-space
Given a straight line L in n-D space, an open half-space is an open set of points. It does not include the points belonging to the straight line L, which forms the boundary of the half-space, but includes all points above the straight line L or all points below it. These two half-spaces are called the open upper half-space and the open lower half-space, respectively, and 2-D examples are shown in Fig. 8.6.
Fig. 8.6 Open upper half-space and open lower half-space
Fig. 8.7 Closed upper half-space and closed lower half-space
The straight line is drawn as a dotted line to indicate that the open half-space does not include the points on the line; the open half-space is unbounded on one side of the dotted line (as shown by the arrow in the figure).

Open upper half-space
The set of all points that lie above the boundary line but do not belong to it. See open half-space.

Open lower half-space
The set of all points that lie below the boundary line but do not belong to it. See open half-space.

Closed half-space
Given a straight line L in n-D space, a closed half-space is a closed set of points, including the points on the straight line L together with all points above L or all points below L. The half-spaces in these two cases are referred to as the closed upper half-space and the closed lower half-space, respectively; 2-D examples are shown in Fig. 8.7. The straight line forms the boundary of the half-space, and the half-space is unbounded on one side of the line (as shown by the arrow).

Closed upper half-space
The set of all points above and on the line that forms the boundary of the half-space. See closed half-space.

Closed lower half-space
The set of all points below and on the line that forms the boundary of the half-space. See closed half-space.

Region outside a closed half-space
The set of points in the plane outside the edges of a collection of closed half-spaces (the edges constitute the boundaries inside the collection).
8.1.9 Norms and Variations
Norm
A basic concept of metric spaces. The norm of a function f(x) can be expressed as

    ||f||_w = ( ∫ |f(x)|^w dx )^{1/w}

where w is called the index or order; taking w = 1, 2, and ∞ gives several typical cases. See Minkowski distance.

L_p-norm
A difference distortion metric used to evaluate image watermarking. If f(x, y) represents the original image and g(x, y) represents the image with the embedded watermark, both of size N × N, then the L_p norm can be expressed as

    D_Lp = { (1/N²) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} |g(x, y) − f(x, y)|^p }^{1/p}
When p = 1, the average absolute difference is obtained, while when p = 2, the root mean square error is obtained.

1-norm
The case p = 1 of the Lp norm.

2-norm
The case p = 2 of the Lp norm; the norm of Euclidean space.

L1 norm
For a d-D vector x, ||x||_1 = Σ_{i=1}^{d} |x_i|.

L2 norm
For a d-D vector x, ||x||_2 = (Σ_{i=1}^{d} x_i²)^{1/2}.

Chebyshev norm
The norm with exponent ∞.

Quasinorm
An Lp norm with p < 1, which does not satisfy the triangle inequality.

Hyper-Laplacian norm
An Lp norm whose p value, corresponding to the hyper-Laplacian distribution, lies between 0.5 and 0.8. It responds strongly to larger discontinuities.

Frobenius norm
A matrix norm, obtained as the square root of the sum of the squares of the absolute values of all the elements of the matrix.
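A brief NumPy illustration of the vector and matrix norms listed above; the vector and matrix are arbitrary examples:

    import numpy as np

    x = np.array([3.0, -4.0, 1.0])
    print(np.abs(x).sum())                 # L1 norm
    print(np.sqrt((x ** 2).sum()))         # L2 norm
    print(np.abs(x).max())                 # Chebyshev (L-infinity) norm

    A = np.array([[1.0, -2.0], [3.0, 4.0]])
    print(np.sqrt((np.abs(A) ** 2).sum())) # Frobenius norm
    print(np.linalg.norm(A, 'fro'))        # the same value via NumPy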
Variational method
A method that represents a signal processing problem as the solution of a problem in the calculus of variations. The input signal is a function I(t) over the interval t ∈ [−1, 1]. The processed signal is a function P defined on the same interval and is obtained by minimizing an energy functional E(P) of the form

    E(P) = ∫_{−1}^{1} f(P(t), Ṗ(t), I(t)) dt
The calculus of variations shows that the minimizing P is a solution of the Euler-Lagrange equation:

    ∂f/∂P = (d/dt) ∂f/∂Ṗ

In image engineering, the functional can often be expressed in the form

    E = ∫ truth(P, I) + λ beauty(P)

The "truth" term measures fidelity to the data, and the "beauty" term is a regularization term. These can be seen in a particular example, image smoothing. In traditional methods, smoothing is viewed as the result of an algorithm, namely convolution of the image with a Gaussian kernel. In the variational method, the smoothed signal P is the best compromise between smoothness, expressed as the square of the second derivative, and data fidelity, expressed as the square of the difference between input and output. The balance between the two is determined by the parameter λ:

    E(P) = ∫ (P(t) − I(t))² + λ P̈(t)² dt
It is also a method of measuring the internal variation of a function (i.e., non-smoothness for image functions). Variational approach Same as variational method. Variational problem See variational method. Variational analysis An extension of the variational method to optimization theory problems.
Variational model
A family of problems solved by first quantifying the problem as a set of constraints on the solution and then integrating the solution over the entire problem space, where numerical minimization methods can be used to obtain the solution. Typical examples include estimating optical flow, estimating fundamental matrices, and performing image segmentation.

Variational mean field method
A form of variational model in which the probability distribution of an object is decomposed into a product of terms assumed to be independent. This allows one term to be optimized while the other marginal terms are kept fixed at their mean field values.

Variational calculus
A branch of mathematics dealing with the optimization (maximization or minimization) of functionals. Synonym of calculus of variation.

Calculus of variation
Same as variational calculus.

Euler-Lagrange
Refers to the Euler-Lagrange equations, the basic equations of the variational method (the branch of calculus concerned with the maxima and minima of definite integrals).

Euler-Lagrange equations
The basic equations of variational calculation, the branch of calculus concerned with the maximum and minimum values of a definite integral. In image engineering, they have been used for various optimization problems, such as surface interpolation. See variational approach and variational problem.
8.1.10 Miscellaneous

Linear programming [LP]
A branch of operations research that studies the mathematical theory and methods of finding extreme values of linear objective functions under linear constraints.

Linear programming [LP] relaxation
The process of weakening the constraints of a linear programming problem. For example, the relaxation of 0-1 integer programming replaces the constraint that the original variable is 0 or 1 with the constraint that the variable belongs to [0, 1]; that is, the original x ∈ {0, 1} becomes 0 ≤ x ≤ 1. This transforms an optimization problem that is NP-hard (integer programming) into a polynomial-time solvable problem (linear programming). Many Markov random field optimization problems are based on linear programming relaxation of the energy function. For some practical Markov random field problems, linear programming-based techniques can give solutions with minimal global energy.
Quadratic programming [QP]
A class of nonlinear programming methods in which the objective function is quadratic but the constraints are linear.

Chunking
A practical technique to speed up the solution of quadratic programming problems. It can be shown that the value of the Lagrangian is unchanged if the rows and columns of the kernel matrix corresponding to zero Lagrange multipliers are removed. The chunking method decomposes the complete quadratic programming problem into a series of smaller problems, identifies the non-zero Lagrange multipliers, and removes the other multipliers. Chunking can reduce the size of the matrix in the quadratic programming function from the order of the square of the number of data points to the order of the square of the number of non-zero Lagrange multipliers.

Protected conjugate gradients
A method for implementing chunking.

Quadratic form
A second-order homogeneous polynomial in a set of variables. Given a vector x = (x1, x2, . . ., xn)ᵀ of n real variables, a real n × n matrix A defines a quadratic form of x, namely Q(x) = xᵀAx.

Normalization
A technique and process of transforming the numerical range of variables in order to standardize calculations and facilitate comparison. In normalization, the values are usually transformed into [0, 1]; in practice, the values are often transformed into the range allowed by the variable. For example, when adding two images, to avoid brightness overflow, the following normalization can be performed (saving the intermediate result in a temporary variable W):

    g = Lmax (f − fmin) / (fmax − fmin)
Among them, f is the brightness value of the current pixel, Lmax is the maximum possible brightness value (255 for an 8-bit image), g is the brightness value of the corresponding pixel in the output image, fmax is the largest pixel value in W, and fmin is the smallest pixel value in W.

Absolute sum of normalized dot products
A measure of the similarity between two m × n matrices. If a and b are two m × n matrices, the absolute sum of their normalized dot products is

    S = (1/n) Σ_{i=1}^{m} |a_iᵀ b_i| / (‖a_i‖ ‖b_i‖)
Normalized correlation
See correlation.

Normalized cross-correlation [NCC]
1. The same as normalization applied to general correlation. When matching a grayscale image f(x, y) with a template w(x, y), simply computing the correlation is sensitive to changes in the amplitude values of f(x, y) and w(x, y). To overcome this problem, the correlation can be normalized, which is also known as computing the correlation coefficient.
2. A typical correlation distortion metric for evaluating watermarks. If f(x, y) represents the original image and g(x, y) represents the image with the embedded watermark, both of size N × N, then their normalized cross-correlation can be expressed as

    C_ncc = [ Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} g(x, y) f(x, y) ] / [ Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} f²(x, y) ]
8.2 Statistics and Probability
305
Trilinearity An equation containing three variables, where two variables are fixed to give a linear equation for the remaining variables. For example, xyz ¼ 0 is a trilinear equation for x, y, and z. And y ¼ x2 is not a trilinear equation, because if the y is fixed, the result is a quadratic equation for x. Iverson bracket A mathematical expression. It converts the logical proposition P into the number 1 (if the proposition is true) or 0 (if the proposition is not true). Propositions are generally placed in square brackets, that is, ½P ¼
1
if P is true
0
otherwise
Iverson brackets are a generalization of the Kronecker δ function, which is a special case when the symbols in Iverson brackets indicate equal conditions, that is, δij ¼ ½i ¼ j Among them, δij is a binary function, and its argument (input value) is generally two integers. If they are equal, the output value is 1; otherwise it is 0. Invariant theory A theoretical framework in algebra that covers invariant polynomials under various transformations in closed linear systems Dual problem Two problems with equivalent solutions. For example, for the problem max[cTx], Ax b, x 0, the dual problem is min[yTb], yA c, y 0.
8.2
Statistics and Probability
Statistics and probability theory are important mathematical tools for studying image engineering (Bishop 2006; Theodoridis and Koutroumbas 2009).
8.2.1
Statistics
Statistical model For a data set, it is a set of probability distributions. Parametric models have a limited number of parameters; nonparametric models cannot be parameterized with a limited number of parameters. Examples of parametric models include Gaussian distributions with linear error models and linear regression.
306
8 Related Knowledge
Statistical moment Various description moments that provide information about the distribution and shape of statistical data, such as a probability distribution. Given a positive number n > 0 and a probability distribution p(x), the common statistical moment is the R infinite family p(x)xn. When n ¼ 1, the mean of the distribution is obtained. Moments are special cases of expectations. The other family is histogram moments. Maximum likelihood method A statistical estimation method: after randomly selecting several sets of sample observations from the model population, the most reasonable parameter estimation should be to maximize the probability of drawing these groups of sample observations from the model. Compare least squares. Robust statistics A general term referring to statistical methods whose characteristics are not significantly affected by outliers. It can also refer to some relatively stable statistics, such as the negative logarithm of likelihood. When there is not only Gaussian noise in the image but also a lot of outlier points (which can be derived from occlusion, discontinuities, etc.), it is often to analyze the noise with the help of models with long tails such as Laplacian. This can make the penalty function increase more slowly than using the quadratic analysis to increase the likelihood caused by residual for large errors. Robust estimator A statistical estimator that will not be disturbed even if there are many outliers in the data. This is different from using general least mean square estimation. Widely used robust estimators include Random Sample Consensus (RANSAC) and M-estimation. See outlier rejection. M-estimation A robust generalization of least squares estimation and maximum likelihood estimation. M-estimator A typical robust least square (method), commonly used in robust statistical literature. Different from the square of the residual calculation, a robust penalty function s(r) is used for the residual r. The derivative of s(r) is called the influence function. Influence function A function that describes the effects of a single observation on a statistical model. This allows us to evaluate whether observations have an excessive impact on the model. See M-estimator. Robust penalty function See M-estimator.
8.2 Statistics and Probability
307
Robust technique See robust estimator. Spatial statistics Statistical analysis of patterns that appear in space. In image analysis, statistics can refer to, for example, the distribution of characteristics or features of an image region or object. Natural image statistics Statistical information or statistical structure in images obtained from nature. Commonly used in computational modeling of biological vision systems. Statistical classifier A function that maps from the input data space to a collection of labels. The input data is points x 2 Rn, and the labels are scalars. Classifier C(x) ¼ l assigns label l to point x. A classifier is generally a parameterized function, such as a neural network (the parameters are weights) or a support vector machine. Classifier parameters can be set by optimizing performance on a training set of known (x, l) pairs or by selforganizing learning methods. Statistical fusion A general term describing the estimation of a value A (and possibly its distribution), but based on a set of other values {Bi} and their distribution {Di}. A simple example is as follows: given two estimates V1 and V2 for a value, and the corresponding probabilities p1 and p2, the fusion estimate is ( p1V1 + p2V2)/( p1 + p2). There is a more complex statistical fusion method in the Kalman filter. Statistical distribution A description of the relative number of occurrences of each possible output of a variable over a number of experiments. For example, for a fair dice game, every numerical occurrence should be equal, that is, one occurrence every six times. See probability distribution. Probability distribution A normalized statistical distribution is described using the likelihood of occurrence of each possible output of a given variable covering a certain number of trajectories. Probabilistic distribution See probability distribution. Central limit theorem One of a set of theorems that is the theoretical basis of probability theory, mathematical statistics, and error analysis. According to this theorem, if a random variable is the sum of n independent random variables, its probability density function will tend to Gaussian when n tends to infinity, regardless of the probability density function of each independent variable. In other words, the mean distribution of a large number of independent random variables is limited to the normal distribution. The description of many phenomena and laws in image engineering is based on Gaussian distribution, which is all derived from this.
308
8.2.2
8 Related Knowledge
Probability
Probability A measure of confidence in the occurrence of an event, with values ranging from 0 (impossible) to 1 (affirmative). The interpretation of probability is the subject of a heated debate. There are two well-known theories. Frequency theory defines probability as a finite proportion in an infinite set of experiments. For example, the probability of getting any number by rolling a dice is 1/6. The subjective view regards probability as the subjective degree of belief, which does not need to introduce random variables, that is, the degree to which a person can have his own belief in an event. This degree can be mapped to probabilistic rules, as long as these rules satisfy certain consistency laws called Cox’s axioms. Uncertainty Limited knowledge makes it impossible to accurately indicate the current state or future output. Probability theory adopts a framework for reasoning under uncertainty. Uncertainty representation A representation strategy for the probability density of a variable used in the visual system. An interval can represent possible values in a range in a similar way. Frequentist probability An explanatory model of probability. Think of probability as the frequency of random, repeatable events. Also called classical probability. Classical probability Same as frequentist probability. Expectation In probability theory or random theory, the average value of a function f(x) with a probability distribution p(x) is often recorded as E[f]. For discrete distributions, there are E½f ¼
X pðxÞf ðxÞ x
That is, the relative probability of different x values is used for weighted average. In the case of continuous variables, it is expected to be represented as an integral relative to the probability density: Z E½f ¼
pðxÞf ðxÞdx
In both cases, if a finite number of points (N points) are taken from the probability distribution or probability density, the finite sum of these points can be used to approximate the expectation:
8.2 Statistics and Probability
309
E½f
N 1 X f ð xn Þ N n¼1
Expectation value The mean of a function, the average expected value. If p(x) is a probability density function of a random variable x, then the expectation of x is Z x¼
pðxÞxdx
Expected value The mean of a random variable. Bias-variance trade-off A (Bayesian) frequentist probability view of model complexity. When discussing decision theory for regression problems, various loss functions need to be considered. Given a conditional distribution p(t|x), each loss function results in a corresponding optimal prediction. The commonly used is the squared loss function. The optimal prediction at this time is given by the conditional expectation, which is recorded as h(x) Z hðxÞ ¼ E½tjx ¼
t pðtjxÞdt
It can be proven that the expected squared loss can be written as Z E½L ¼
Z fyðxÞ hðxÞg2 pðxÞdx þ
fhðxÞ t g2 pðx, t Þdt
The second term is derived from the noise inherent in the data, which is independent of y(x) and represents the minimum achievable value of the expected loss. The first term depends on the choice of the function y(x), so the solution of y(x) can be searched to minimize this term. Because this term is non-negative, the minimum value that can be obtained is 0. If there is infinite data and infinite computing power, in principle a regression function h(x) will be found for any desired accuracy, and this will represent the optimal choice for y(x). However, in practice, the data set D contains only a limited number of N data points, so the regression function h(x) cannot be accurately obtained. If h(x) is modeled with a parameter function y(x, w) controlled by a parameter vector w, then from the Bayesian point of view, the uncertainty in the model can be expressed by the posterior distribution of w. However, the frequentist probability processing method is based on a point estimate of w on the data set D, and the uncertainty of this estimate is explained through the experiments envisaged below.
310
8 Related Knowledge
Suppose that there are many data sets, each of which comes independently from the distribution p(t, x) and has a size of N. For any given data set D, run the learning program and get a prediction function y(x; D). Considering the first term in the above formula, a specific data set D will have the following form: fyðx; DÞ hðxÞg2 Because this quantity will depend on the particular data set D, the average of the data set ensemble is taken. By adding and subtracting ED[y(x; D)] in curly braces and then expanding again, it will get fyðx; DÞ E D ½yðx; DÞ þ ED ½yðx; DÞ hðxÞg2 ¼ fyðx; DÞ ED ½yðx; DÞg2 þ fED ½yðx; DÞ hðxÞg2 þ2fyðx; DÞ ED ½yðx; DÞg þ fED ½yðx; DÞ hðxÞg Take the expectation of D from the above formula. Observe that the last term is zero and get h i ED fyðx; DÞ hðxÞg2 ¼ fE D ½yðx; DÞ hðxÞg2 þ E D fyðx; DÞ ED ½yðx; DÞg2 |fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} |fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl} Biased square
Variance
It can be seen that the expected squared difference between y(x; D) and the regression function h(x) can be expressed as the sum of two terms. The first term is called biased square and represents how far the average prediction for all data sets differs from the desired regression function. The second term is called variance, which measures how much the solution to a single data set changes around its average, that is, how sensitive y(x; D) is to a particular choice of the data set. The above considers the case of a single input value x. If the above formula is substituted into the previous formula of expected squared loss, the following decomposition of expected squared loss will obtain Expected loss ¼ Biased square þ Variance þ Noise in which Z Biased square ¼ Z Variance ¼
fE D ½yðx; DÞ hðxÞg2 pðxÞdx
h i E D fyðx; DÞ ED ½yðx; DÞg2 pðxÞd x
8.2 Statistics and Probability
311
Fig. 8.8 Examples of probability density functions and cumulative density functions
1
p(x)
c(y)
0.5
0
0
0.5
1
ZZ fhðxÞ t g2 pðx, t Þdx d t
Noise ¼
There should be a balance between bias and variance. Very flexible models have small biases and large variances, while relatively rigid models have large biases and small variances. A model with optimal predictive power can achieve the best balance between bias and variance. Probit function The quantile function associated with a standard normal distribution. Consider a two-class case in a linear classification model: pðt ¼ 1jyÞ ¼ C ðyÞ where y ¼ wTx and C(•) is the activation function. Now consider a noisy thresholding model as follows. For any input xn, calculate yn ¼ wTxn and set the target value according to the following formula: tn ¼
1
yn z
0
Otherwise
If the value of x is derived from a probability density p(x), then the corresponding activation function will be given by its cumulative distribution function: Z
y
C ð yÞ ¼
pðzÞdz 1
Figure 8.8 shows a schematic diagram. The blue line in the figure represents a probability density function (a mixture of two Gaussians) p(x), and the red line represents its cumulative density function C( y). The value at any position on the blue line (as shown by the vertical green line) corresponds to the slope of the red line at that position. In turn, the value of the red line at that location corresponds to the area under the blue line (as shown by the green shade). If the value of y ¼ wTx exceeds a
312
8 Related Knowledge
threshold, the class tag takes the value t ¼ 1, otherwise t ¼ 0. This is equivalent to the activation function given by the cumulative density function C( y). Considering a special case when the density function p(x) is a Gaussian function with zero mean and unit variance, the corresponding cumulative distribution function is Z
y
N ðzj0, 1Þdz
C ð yÞ ¼ 1
This is the probability (unit) function. It has the shape of the letter S and is very close to a logical function. For a general Gaussian function, the above form is unchanged, but the linear coefficient needs to be scaled down. The closely related erf function is defined as follows: 2 erf ðyÞ ¼ pffiffiffi π
Z
exp z2 =2 dz
y
1
The relationship between the two is 1 1 1 þ pffiffiffi erf ðyÞ C ð yÞ ¼ 2 2 A generalized linear model based on a probability activation function is called probit regression. Probit regression See probit function. Erf function A nonbasic function (i.e., non-elementary function) that is related to the error but does not necessarily represent the error itself. See probit function. Error function Functions defined by 2 erf ðxÞ ¼ pffiffiffi π
Z 0
x
s2 exp ds 2
It is related to another probability function as follows:
1 1 pf ðxÞ ¼ 1 þ pffiffiffi erf ðxÞ 2 2
8.2 Statistics and Probability Fig. 8.9 A diagram of likelihood function
313
p(x) N(xm| ,
xm
2
)
x
Likelihood function A function that can be used when using observations to determine the parameters of a probability distribution. Suppose there is a set of M observations on the scalar x, written as x ¼ (x1, x2, . . ., xM)T. Assume that all these data come independently from a Gaussian distribution N(μ, σ 2) whose mean μ and variance σ are unknown. Now it needs to determine the parameters of the Gaussian distribution based on these independent identically distributed (i.i.d.) data. Because the joint probability of two independent events is the product of the respective marginal probabilities of a single event, the data probability for a given mean μ and variance σ is M Y p xjμ, σ 2 ¼ N xm jμ, σ 2 m¼1
This is the likelihood function for μ and σ of the Gaussian distribution. A diagram is shown in Fig. 8.9, where the hollow dots correspond to the observation data and the product of the solid dots is the likelihood function. The maximum likelihood function is to adjust the μ and σ of the Gaussian distribution to maximize the above product. In a statistical model M, the link between the parameter θ and the data D is the likelihood function L(θ|M) ¼ p(D|θ, M). Note that the likelihood can be seen as a function of θ and M given D. Likelihood ratio 1. The ratio of the likelihood function of two different models to the same data. For example, in a Bayes classifier for binary classification, the posterior probability ratio of the two classes depends on the product of the likelihood ratio and the prior class probability. 2. In statistics, the likelihood ratio test can be used to compare two statistical models on the same data, and at this time, the ratio of their likelihood score is used to perform the comparisons. Likelihood score For a statistical model M with data D, the likelihood score refers to the maximum value of the log-likelihood function L(θ|M) relative to θ, that is, maxθlogL(θ|M).
314
8 Related Knowledge
Log-likelihood The logarithm of the likelihood function. See likelihood score.
8.2.3
Probability Density
Probability density See probability density function. Laplace approximation A framework for Gaussian approximation of probability density defined on a set of continuous variables. Taking a single Rcontinuous variable z as an example, let its distribution p(z) ¼ f(z)/Z, where Z ¼ f(z)dz is an unknown normalization coefficient. The goal of Laplace approximation is to find an approximate Gaussian distribution q(z) centered on the maximum value of the distribution p(z). First, find the most frequent value of p(z). To this end, calculate z0 that satisfies p0 (z0) ¼ 0. A characteristic of the Gaussian distribution is that its logarithm is a quadratic function of a variable. Taylor expansion is performed on the centered lnf(z) at z0, because z0 is a local extreme point, so the second derivative of f(z) at z0 is negative: 1 ln f ðzÞ ln f ðz0 Þ Aðz z0 Þ2 2 where A > 0 is called accuracy: d2 A ¼ 2 ln f ðzÞ dz z¼z0 Taking the logarithmic n o A f ðzÞ f ðz0 Þ exp ðz z0 Þ2 2 In addition, a normalized distribution q(z) can be obtained by using the normalization method to the Gaussian distribution standard: qð z Þ ¼
A 2π
1=2
n o A exp ðz z0 Þ2 2
Figure 8.10 shows an example of a Laplace approximation. Among them, p(z) / exp(z2/2)σ(20z + 4), and σ(z) ¼ (1 + e–z)1 is a logistic sigmoid function. In Fig. 8.10a, the normalized p(z) is blue, and the Laplace approximation with center at the most frequent value of p(z) is red. Figure 8.10b shows their corresponding negative logarithmic curves.
8.2 Statistics and Probability
315
0.8
40
0.6
30
0.4
20
p(z)
p(z) 0.2 0 –2
10
–1
0
1
2
3
(a)
4
0 –2
–1
0
1
2
3
4
(b)
Fig. 8.10 Laplace approximation examples
Probability density estimation A set of techniques for estimating a density function or its parameters after giving sample data. A related problem is to check whether a particular sample is produced by a process characterized by a particular probability distribution. Two commonly used tests are the goodness of fit test and the Kolmogorov-Smirnov test. The former is a parametric test, which is suitable for a large number of samples; the latter is a nonparametric test, which can give good results for small samples, but cannot estimate the population parameters. See nonparametric method. Nonparametric method In probability density estimation, a method that determines the shape of a distribution based primarily on the size of the data set. The models used contain parameters, but they mainly control the complexity of the model rather than the shape of the distribution. Typical methods include histogram, nearest neighbor, and kernelbased methods. It can also be regarded as a statistical model that allows the number of parameters to increase as the amount of data increases. The advantage is that fewer hypotheses can be established for the distribution under consideration. Typical nonparametric methods include the Parzen window method and the K-nearest neighbor algorithm. Probability density function [PDF] Differentiation of a random variable distribution function. It defines continuous random variables. It is not itself a probability; probability is obtained by integrating the probability density function of continuous random variables within a certain interval. A way to characterize noise in image engineering. The gray level of the noise itself can be regarded as a random variable, so its distribution can be characterized by a probability density function. For a continuous random variable, the probability density function p(x) is the derivative of the cumulative distribution function F(x), that is, p(x) ¼ dF(x)/dx, which in turn has
316
8 Related Knowledge
Z
x
F ð xÞ ¼
pðxÞdx 1
If Δx is small, then Pr(x X x + Δx) p(x)Δx. This definition can be generalized to random variables with vector values. Super-Gaussian probability density function A probability density function that has a higher peak than the Gaussian distribu tion function and has a positive kurtosis. Also called leptokurtic probability density function. Compare the sub-Gaussian probability density function. Leptokurtic probability density function Same as super-Gaussian probability density function. Sub-Gaussian probability density function A probability density function that is flatter than a Gaussian distribution function and has a negative kurtosis. Also known as platykurtic probability density func tion. Compare the super-Gaussian probability density function. Platykurtic probability density function Same as sub-Gaussian probability density function. Parzen window 1. A nonparametric method for density estimation. Given a sample {xi} derived from some underlying density, where i ¼ 1, 2, . . ., n, the Parzen window estimate can be obtained as follows: pð xÞ ¼
1X K ðx, xi Þ i n
Among them, K is a kernel function. Gaussian distribution is a commonly used kernel. See kernel density estimator. 2. A triangular weighted window is often used to limit leakage of pseudo-frequency when calculating the power spectrum of a signal. 3. The probability density function in mean shift, that is, the kernel function. See kernel density estimator. Skewness of probability density function Third moment of probability density function. Probability mass function [PMF] The probability that a discrete random variable takes a specific value. Compare the probability density function. Joint probability distribution A high-dimensional probability distribution. Consider a vector-valued random variable x ¼ (x1, x2, . . ., xn). For discrete random variables, the joint probability
8.2 Statistics and Probability
317
distribution p(x) is the probability that the component variables jointly take a certain structure/configuration. For continuous random variables, there is also a corresponding continuous probability density function p(x). Typical examples are multivariate normal distributions. See marginal distribution and conditional probability. Joint distribution function A joint function describing the probability of multiple random variables each being less than the corresponding function variable. For example, for n random variables, the joint distribution function is P
f 1 f 2 ... f n ðz1 , z2 ,
. . . , zn Þ P ð f 1 < z1 , f 2 < z2 , . . . , f n < zn Þ
Joint probability density function Differentiation of a joint distribution function of multiple random variables. For example, for n random variables, the joint probability density function is n
p
f 1 f 2 ... f n ðz1 , z2 ,
. . . , zn Þ
∂ P
. . . , zn Þ ∂z1 ∂z2 . . . ∂zn
f 1 f 2 ... f n ðz1 , z2 ,
Probabilistic graphical model A graphical model that defines a joint probability distribution for a set of random variables. The structure of the graph records conditional independence relationships. There are two types of probabilistic graphical models commonly used: directed graphical models (also known as Bayesian networks) and undirected graphical models (also known as Markov random fields). Graph model representing probability distribution, where nodes represent random variables (or groups of random variables) and edges represent probability relationships between variables. Bringing probability into a graph model brings many useful features: 1. Provides a simple way to visualize the structure of a probabilistic model and help design new models 2. Provides an effective way to gain insight into the attributes of the model (including conditional independence attributes) by examining the graph 3. Provides a calculation form that uses graph operations to perform complex reasoning and learn complex models Most probable explanation A term in probabilistic graphical model theory. Given a joint probability distri bution over a set of random variables defined by a probabilistic graphical model, knowledge of the value of the variable e (evidence) set will provide the conditional probability p(x|e) of the remaining variable x. The most likely explanation x* is a configuration that maximizes p(x|e). See maximum a posteriori probability.
318
8 Related Knowledge
Probabilistic inference Given the joint probability distribution of a set of random variables and knowledge of some set of variables e (evidence), probabilistic reasoning focuses on the conditional distribution of the remaining variable x (or a subset of it), that is, p(x|e). If x has a large cardinality, it is worth focusing on the summary of p(x|e). The usual choice is the posterior marginal distribution of each variable in the x set, or the maximum posterior probability configuration of p(x|e). In the probabilistic graphical model, these summaries can generally be calculated with the help of messagepassing processes such as belief propagation. Message-passing process A computational process that relies on passing messages between different targets or units. For example, in certain probabilistic graphical models, the probability information carried is inferred through belief propagation.
8.2.4
Probability Distributions
Mixture distribution The linear superposition of multiple basic distributions is also called a mixture distribution model. Mixture model A probability expression in which multiple distributions are combined to model situations where the data may come from different sources or have different behaviors (different probability distributions). The overall distribution will have the form, p(x) ¼ ∑iPipi(x) ¼ 1, where pi(x) represents the density of source i and Pi is its mixture ratio. The mixture model can be fitted to the data by using an expectation maximization algorithm. The Gaussian mixture model is a typical mixture model. Mixture distribution model Same as mixture model. Mixing proportion In a mixture model, the probability Pi of the source i is selected, where ∑iPi ¼ 1. Mixture density network [MDN] A general framework for modeling conditional probability distributions. It uses a mixture model for a distribution with a conditional probability density function p(t| x), where the mixture coefficient and the density of the mixture components are both flexible functions of the input vector x. For any given value of x, this mixture model provides a general form of modeling the conditional probability density function p(t|x). As long as the network is flexible enough, a framework that approximates any conditional distribution can be obtained. For example, using Gaussian components, we can get
8.2 Statistics and Probability Fig. 8.11 Distribution overlap
319
Distribution 1
Distribution 2
Distribution overlap
pðtjxÞ ¼
K X
gk ðxÞN tjμk ðxÞ, σ 2k ðxÞ
k¼1
Among them, the noise variance in the data is a function of the input vector x, so it is also called a heteroscedastic model. Heteroscedastic model See mixture density network. Heteroscedasticity A set of random variables has different variance properties. When using ordinary least squares for regression estimation, it is often assumed that the variance of the error term is unchanged. Those who satisfy this assumption are said to have homoscedasticity, while those who satisfy this assumption are said to have heteroscedasticity. If the ordinary least square method is used for the heteroscedastic model, the estimated variance value will be a biased estimator of the true variance value, but the estimated value itself is unbiased. Mixture component Each basic distribution in a mixture distribution. See mixture of Gaussians. Mixture coefficient The weight of each basic distribution in a mixture distribution. See mixture of Gaussians. Distribution overlap An ambiguous region in a classification between two data sets, where the points are reasonably part of either distribution. A schematic diagram is shown in Fig. 8.11. Posterior distribution A description of a conditional probability distribution after some data or evidence is observed. Contrary to the prior distribution. In the Bayesian statistical model, the posterior distribution p(θ|D, M) describes the posterior uncertainty about the parameter θ in the model M after the data D is observed. See a posteriori probability and Bayesian statistical model. Posterior probability See posterior distribution.
320
8 Related Knowledge
Mixture of experts A mixture model in which the mixture coefficient is itself a function of the input variables. The mixture of expert model can be expressed as pðtjxÞ ¼
K X
gk ðxÞpk ðtjxÞ
k¼1
Among them, gk(x) is a gating function, and each component pk(t|x) is called an expert. Different components can independently model the distribution of different regions in the input space (the components can give expert-level predictions for the region to which they belong). The gating function determines the component that plays a dominant role in a particular region. Gating function The mixture coefficient in the mixed expert model. The gating function gk(x) needs to satisfy the general constraints on the mixing coefficient, that is, 0 gk(x) 1 and ∑kgk(x) ¼ 1. Gate functions are often called in deep learning, such as “input gate,” “forget gate,” and “output gate” in long short-term memory. Gaussian mixture model [GMM] A representation of a distribution based on a combination of Gaussian distributions. A background modeling method for modeling each pixel with a mixed Gaussian distribution. Often, multiple states of the background are modeled separately, and the model parameters of the state are updated according to which state the data belongs to in order to solve the problem of background modeling in a moving background. According to local properties, some Gaussian distributions represent background, and some Gaussian distributions represent foreground. In the case where there is motion in the training sequence, the Gaussian mixture model is more robust and effective than the single Gaussian model. It can also be used to represent a color histogram with multiple peaks. See expectation maximization. Gaussian mixture See Gaussian mixture model. Elliptical K-means A hard-allocated version of the Gaussian mixture model with a universal covari ance matrix. Mixture of Gaussians A special form of mixture distribution model, where each basic distribution is a Gaussian distribution. These Gaussian distributions are combined in a linear form. Considering K different Gaussian distributions, the mixture Gaussian can be expressed as
8.2 Statistics and Probability Fig. 8.12 Mixture of Gaussians
321
p(x)
x
pð xÞ ¼
K X
gk N ðxjμk , σ k Þ
k¼1
Among them, gk represents the mixture coefficient, and each Gaussian density N (x|μk, σ k) is called a mixture component, whose average value is μk and the variance is σ k. Figure 8.12 shows an example of 1-D. Two Gaussian distributions (red, yellow) are superimposed to form a mixture Gaussian distribution (blue). By using a sufficient number of Gaussian distributions, and adjusting their mean and variance and corresponding mixture coefficients, in theory, any complex distribution can be approximated to arbitrary accuracy. Also refers to a pattern discovery technique. The feature vector related to each pixel is modeled as a sample obtained from an unknown probability density function, and an attempt is made to find a clustering pattern in this distribution. See Gaussian mixture model.
8.2.5
Distribution Functions
Distribution function A function of a random variable describing the probability of an event. The event is the result of a randomized trial. The distribution function of the random variable f is expressed as a function where f is less than the probability of the function variable: 8
e pðzÞfor any value of z. The function kq(z) is called a comparison function. A schematic diagram of the univariate distribution is shown in Fig. 8.13. Each step of rejection sampling generates two random numbers. First, a number z0 is generated from the distribution q(z). Then, a number u0 is generated from the uniform distribution among [0, kq (z0)]. The pair of random numbers has a uniform distribution under a curve with the function kq(z). Finally, if u0 >e pðz0 Þ, sampling is rejected; otherwise, u0 is retained. It
8.2 Statistics and Probability
323
Fig. 8.14 Constructing an envelope function for adaptive rejection sampling
lnp(z)
z2
z1
z3
z
can be seen that when this pair of random numbers is in the shaded area in Fig. 8.13, it will be rejected. The remaining pairs are uniformly distributed under the unnormalized distribution curve, so the corresponding z-values are distributed according to the normalized p(z) as desired. The original value of z is generated from the distribution q(z). The probability of accepting these samples is e pðzÞ/kq(z), so the probability of a sample being accepted is Z pðacceptÞ ¼
fe pðzÞ=kqðzÞgqðzÞdz ¼
1 K
Z e pðzÞdz
It can be seen that the proportion of points rejected by this method depends on the ratio of the area under the unnormalized distribution to the area under the curve kq (z). Therefore, the constant k should be as small as possible under the condition that the constraint kq(z) is not less than e pðzÞ. Adaptive rejection sampling An improved method when rejection sampling is prepared but it is difficult to determine a suitable analytical form for the envelope distribution q(z). That is, in the process of constructing the envelope function, it is performed based on the measured value of the distribution p(z). When p(z) is log-recessed, that is, when the differential of lnp(z) is a nonincreasing function of z, constructing the envelope function is straightforward. As shown in Fig. 8.14, the envelope function for rejection sampling can be constructed by means of tangents calculated on a set of grid points. If a sample point is rejected, it is added to the grid point set and used to refine the envelope function. Conditional distribution For a variable, the distribution after given the value of one or more other variables. Conditional mixture model Starting from the standard probabilistic mixture of unconditional density models, the result obtained that component density is replaced by a conditional distribution. Conditional independence A relationship between variables. Given three disjoint sets of variables X, Y, and Z, if p(X, Y/Z ) ¼ p(X/Z)p(Y/Z ), then X and Y are conditional independent for a given Z.
324
8 Related Knowledge
A special case is that Z is an empty set, and then if p(X, Y ) ¼ p(X)p(Y ), then X and Y are independent. Conditional probability The occurrence of an event depends on the probability that another event will occur first (conditionally). In image technology, when several events or data are related to each other, the conditional probability between them must be considered. Given two sets of unrelated random variables X and Y, the conditional probability p(X/Y) is defined as p(X/Y) ¼ p(X, Y )/p(Y ). If p(Y ) > 0, it can be read as the distribution of X given a value for Y. Latent class analysis In mixture modeling, an analytical model when discrete binary variables are described by a Bernoulli distribution. See mixture of Bernoulli distribution. Exponential family A large class of commonly used random distributions. Typical examples include Bernoulli distributions, polynomial distributions, and so on. Natural parameters Parameters in the exponential family distribution. The exponential family distribution for x can be written as a set of random distributions as follows:
pðxjηÞ ¼ hðxÞgðηÞ exp ηT uðxÞ Among them, x can be a vector or a scalar and can be continuous or discrete. h(x) is an entropy function, and u(x) is a function of x. η is the natural parameter of the distribution, and the function g(η) can be interpreted as a coefficient that guarantees the normalization of the distribution and satisfies Z gð η Þ
hðxÞ exp ηT uðxÞ dx ¼ 1
If x is a discrete variable, the integral in the above equation is replaced by a sum. Independent identically distributed [i.i.d.] A distributed form of data. These data are collected independently from the same distribution.
8.2.6
Gaussian Distribution
Gaussian distribution A widely used continuous variable distribution model is also called a normal distribution. For a scalar variable x, its Gaussian distribution (probability density function) can be written as
8.2 Statistics and Probability Fig. 8.15 Chi-squared distribution
325 0.06 Theoretical value
0.04 0.02
Actual value
0
5
10
15 |x|2 , x
20
25
30
5
n o 1 1 N xjμ, σ 2 ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffi exp 2 ðx μÞ2 2σ ð2πσ 2 Þ where μ is the mean and σ is the standard deviation. For a vector variable x, its Gaussian distribution can be written as n o 1 1 1 N ðxjμ, ΣÞ ¼ qffiffiffiffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffi exp ðx μÞT Σ1 ðx μÞ 2 ð2πÞM j Σ j where μ is the M-D mean vector and Σ is the covariance matrix of M M and |Σ| represents the determinant of Σ. Normal distribution Same as Gaussian distribution. Multivariate normal distribution A Gaussian distribution in which the variables are vectors instead of scalars. Let x be a vector variable whose dimension is N. Assume that the mean of this variable is mx and the covariance matrix is C. The probability of observing a specific value x is 1 ð2π ÞN=2 jCj1=2
h i 1 exp ðx mx ÞT C1 ðx mx Þ 2
Chi-squared distribution Describes the probability distribution of the squared vector length from a normal distribution. If the cumulative distribution of a chi-squared (χ 2) distribution with d degrees of freedom is denoted as χ 2(d, q), the probability that a point x from the dD Gaussian distribution has a square norm |x|2 is less than the value T is given by χ 2(d, T ) is given. Figure 8.15 shows the theoretical and actual value curves of a probability density function with 5 degrees of freedom. Chi-squared test A statistical test of a hypothesis, assuming a set of sampled values from a given distribution. See chi-squared distribution.
326
8 Related Knowledge
Marginal Gaussian distribution Gaussian distribution of a subset of a random variable set that is determined only by the variables contained in the subset. For a multivariate Gaussian distribution, if the joint distribution of two sets is Gaussian, the marginal distribution of any set is also Gaussian. Let x be a d-D vector, satisfying a Gaussian distribution N(x|μ, Σ). Divide it into two nonoverlapping subsets xa and xb, xa takes the first k components, and xb takes the last d–k components, that is, x ¼ [xaT xbT]T. In this way, the corresponding mean vector and covariance matrix are
T μ ¼ μTa μTb
Σaa Σab Σ¼ Σba Σbb According to the symmetry of the covariance matrix, it can be seen that Σaa and Σbb are symmetrical and Σba ¼ ΣaTb. Further, consider the inverse of the covariance matrix Τ ¼ Σ1, which is also called the precision matrix:
Τ aa Τ¼ Τ ba
Τ ab Τ bb
Because the inverse of the symmetric matrix is still symmetric, both Taa and Tbb are symmetric, and Tba ¼ T aTb. Note, however, that Σaa is not theR inverse of Taa. With the above definition, for the marginal distribution p(xa) ¼ p(xa, xb)dxb, the corresponding marginal Gaussian distribution has a simple form: pðxa Þ ¼ N ðxa jμa , Σaa Þ Compare conditional Gaussian distribution. Conditional Gaussian distribution For two variable sets, the distribution of another set under the condition of one set. For Gaussian distribution of multiple variables, if the joint distribution of two sets is Gaussian, then the conditional distribution of either set is also Gaussian. Let x be a d-D vector, satisfying a Gaussian distribution N(x|μ, Σ). It will be divided into two nonoverlapping subsets xa and xb, xa takes the first k components, and xb takes the d–k components, that is x ¼ [xaT xbT]T. In this way, the corresponding mean vector and covariance matrix are μ ¼ μTa
μTb
T
8.2 Statistics and Probability
327
Σaa Σ¼ Σba
Σab Σbb
According to the symmetry of the covariance matrix, we can see that Σaa and Σbb are symmetrical and Σba ¼ Σ aTb. Further, consider the inverse of the covariance matrix Τ ¼ Σ1, which is also called the precision matrix:
Τ¼
Τ aa Τ ba
Τ ab Τ bb
Because the inverse of the symmetric matrix is still symmetric, both Taa and Tbb are symmetric, and Tba ¼ T aTb. Note, however, that Σaa is not directly the inverse of Taa. With the above definition, the conditional mean vector and conditional covariance matrix corresponding to the conditional distribution p(xa|xb) are 1 μajb ¼ μa þ Σab Σ1 bb ðxb μb Þ ¼ μa Taa Tab ðxb μb Þ
Σajb ¼ Σaa Σab Σ1 bb Σba Here, the conditional mean vector is a linear function where μa|b is xb, and the conditional covariance matrix is independent of xa. This representation is an example of a linear Gaussian model. Finally, the conditional distribution p(xa|xb) of the joint Gaussian distribution N (x|μ, Σ) can be expressed as pðxa jxb Þ ¼ N xjμajb , T 1 aa Compare marginal Gaussian distribution. Width parameter Gaussian distribution parameter σ. Because as σ increases or decreases, the width of the Gaussian curve increases or decreases accordingly. Therefore, σ is also called the width parameter. Gaussian-Gamma distribution The form of the prior distribution π(μ, λ), which has the same dependency relationship (as the likelihood function) with μ and λ is
328
8 Related Knowledge
Fig. 8.16 Example of contours for Gauss-Gamma distribution
2
λ1
0 –2
0 μ
2
β λμ2 pðμ, λÞ / λ exp exp ðcλμ dλÞ 2 βλ c2 ¼ exp ðμ c=βÞ λβ=2 exp d λ 2 2β 1=2
Among them, c, d, and β are all constants. Since there is always p(μ,λ) ¼ p(μ|λ)p(λ), p(μ|λ) and p(λ) can be determined by observation. p(μ|λ) is Gaussian, its precision is a linear function of λ, and p(λ) is a γ-distribution, so the normalized prior distribution form is h i pðμ, λÞ ¼ N μjμ0 , ðβλÞ1 Gamðλja, bÞ This formula is the Gauss-Gamma distribution, and its diagram is shown in Fig. 8.16 (where μ0 ¼ 0, β ¼ 2, a ¼ 5, and b ¼ 6). Note that this is not a simple product of the independent Gaussian prior to μ and the gamma prior to λ, because the precision of μ is a linear function of λ. Even when independent priors of μ and λ are selected, the posterior distribution shows a coupling between the precision of μ and the value of λ. Normal-gamma distribution Same as Gaussian-gamma distribution. Log-normal distribution If X is a random variable that follows a normal distribution (Gaussian distribu tion), then Y ¼ eX is said to follow a log-normal distribution with a support region of [0, 1]. Binomial distribution A probability distribution function that discretely corresponds to a normal distribu tion. Its coefficient can be obtained by convolving the [1/2 1/2] template with itself. Von Mises distribution A periodic generalization of the Gaussian distribution (univariate or multivariate) is a unimodal distribution.
8.2 Statistics and Probability
329
Wishart distribution A conjugate prior distribution. For a variable x of m-D, its multivariate Gaussian distribution is N(x|μ, Λ1). Assuming the accuracy is known, the conjugate prior distribution is still Gaussian. If the mean is known and the accuracy is unknown, then for a matrix Λ, the conjugate prior distribution is called the Wishart distribution: W ðΛjW, nÞ ¼ GjΛjðnm1Þ=2 exp
h
i 1 trace W 1 Λ 2
where n is the number of distributed degrees of freedom and W is a scaling matrix of m m. The normalization constant G is " n=2
GðW, nÞ ¼ jW j
2
π
nd=2 dðd1Þ=4
d Y nþ1i i¼1
#1
2
In addition to considering precision matrices, conjugate priors can also be defined for co-contrast matrices. If both the mean and precision are unknown, a Gaussian-Wishart distribution is obtained, whose conjugate prior is h i pðμ, Λjμ0 , β, W, nÞ ¼ N μjμ0 , ðβ ΛÞ1 W ðΛjW, nÞ Inverse Wishart distribution See Wishart distribution. Gaussian-Wishart distribution See Wishart distribution. Normal-Wishart distribution Same as Gaussian-Wishart distribution.
8.2.7
More Distributions
Exponential distribution A typical distribution of random variables. For a random variable y with a value in [0, 1), its exponential distribution can be expressed as pðyÞ ¼ k exp ðkyÞ If x is a random variable uniformly distributed in [0, 1], use a function f() to transform the value of x such that y ¼ f(x). Then the distribution of y satisfies
330
8 Related Knowledge
Fig. 8.17 Geometric interpretation of generating nonuniformly distributed random numbers
1 h(y) p(y)
0
y
dx pðyÞ ¼ pðxÞ dy Integrate the above formula, as shown in Fig. 8.17: Z
y
x ¼ hð y Þ
pðzÞdz 1
Substituting the previous exponential distribution (take the lower limit of integration to 0), it gets hðyÞ ¼ 1‐ exp ðkyÞ Therefore, if y ¼ k1ln(1–x) is used to transform the uniformly distributed variable x, the exponential distribution of y can be obtained. Cauchy distribution A random distribution with a probability density function of pð x Þ ¼
1 π ð 1 þ x2 Þ
Polya distribution Another name for the negative binomial distribution. Bernoulli distribution A probability distribution of a binary discrete random variable. The probability of taking a value of 1 is p, and the probability of taking a value of 0 is 1–p. Consider a binary random variable x 2 {0, 1}. The parameter μ represents the probability of x ¼ 1, that is, pðx ¼ 1jμÞ ¼ μ Among them, 0 μ 1. From this
8.2 Statistics and Probability
331
pðx ¼ 0jμÞ ¼ 1 μ According to the above two formulas, the probability distribution of x can be written as BernoulliðxjμÞ ¼ μx ð1 μÞ1x This is the Bernoulli distribution. The distribution has been normalized, and its mean and variance are E ½x ¼ μ var½x ¼ μð1 μÞ Mixture of Bernoulli distribution A distribution describing a discrete binary variable model represented by a Bernoulli distribution. Also called latent class analysis. Consider a set of N binary random variables xi 2 {0, 1}, where i ¼ 1, 2, . . ., N. Each xi satisfies the Bernoulli distribution with the parameter μi, so pðxjμÞ ¼
N Y
μxi i ð1 μi Þð1xi Þ
i¼1
where x ¼ (x1, x2, . . ., xN)T and μ ¼ (μ1, μ2, . . ., μ N)T. Given μ, each variable xi is independent. The mean and covariance of this distribution are E ½x ¼ μ cov½x ¼ diagfμi ð1 μi Þg β-distribution A special probability distribution. Consider a binary random variable x 2 {0, 1} like the Bernoulli distribution. Using the parameter μ to represent the probability of x ¼ 1, then the β-distribution is βðμja, bÞ ¼
Γða þ bÞ a1 μ ð1 μÞb1 ΓðaÞΓðbÞ
where Γ(x) is a gamma function. The coefficients in the above formula have guaranteed the normalization of the unitary function, so
332
8 Related Knowledge
3
3
3 a= 1
a =0.1 b =0.1
b= 1
a= 8
2
2
2
1
1
1
0 0
0.5
μ
1
0 0
0.5
μ
1
0 0
b= 4
0.5
μ
1
Fig. 8.18 Some examples of β-distribution curves
Z1 βðμja, bÞdμ ¼ 1 0
The mean and variance of the distribution are E½μ ¼ var½μ ¼
a aþb
ab ð a þ bÞ 2 ð a þ b þ 1Þ
The coefficients a and b are often called hyperparameters because they control the distribution of the parameter μ. Figure 8.18 gives examples of several radon distribution curves. γ-distribution A probability distribution defined on a positive random variable γ > 0. There are two positive parameters a and b, which ensure that the distribution is normalizable. The distribution is defined as Gðγja, bÞ ¼
1 a a1 bγ bγ e Γ ð aÞ
The expectation, variance, mode, and entropy of the variable γ are a b a var ½γ ¼ 2 b a1 mode ½γ ¼ b E ½γ ¼
a1
H ½γ ¼ ln ΓðaÞ ða 1ÞϕðaÞ ln b þ a where φ() is a digamma function.
8.2 Statistics and Probability
333
The gamma distribution is a conjugate prior to the accuracy (inverse variance) of the univariate Gaussian. For a a 1, the density of the gamma distribution at any location is limited. In the special case of a ¼ 1, the gamma distribution is exponential. The probability density function of a random variable X (defined in [0 1)) that satisfies the gamma distribution Γ(α, β) is pð x Þ ¼
βα α1 βx x e ΓðαÞ
where α is the shape parameter and β is the inverse of the scale parameter. A squared distribution with υ degrees of freedom is a special case of the gamma distribution, where α ¼ υ/2, β ¼ 1/2. Inverse gamma distribution A continuous probability distribution with two parameters, defined only on the positive axis. It is the distribution of the inverse of the variable that satisfies the gamma distribution. If α represents the shape parameter, β represents the scale parameter, and the gamma function is represented by Γ(•), then for x > 0, the probability density function of its inverse gamma distribution is βα α1 β f ðx; α, βÞ ¼ exp x x ΓðαÞ Poisson distribution A discrete random variable with values of 0, 1, ..., and the probability density function is p(x) ¼ (1/x!)λxe–λ, where λ is the rate parameter and E[x] ¼ λ. A probability distribution in which a random variable takes a non-negative integer value. In the study of the object set in the image, if the positions of each object are independent (random), when a histogram of the distance between the nearest neighbors of each object is made, the histogram will show a Poisson distribution. Hyper-Laplacian distribution Log-likelihood distribution of image derivatives. Dirichlet distribution Let x represent a probability vector of d-D, that is, a vector where all elements are non-negative and the sum is 1. Dirichlet distribution is a probability distribution defined by the following probability vectors: pð x Þ ¼
d Γða1 þ þ ad Þ Y ai 1 x Γða1 Þ Γðad Þ i¼1 i
334
8 Related Knowledge
where a ¼ (a1, a2, . . ., ad) is a parameter vector whose elements are all positive values. Dirichlet process mixture model A nonparametric method based on a mixture model that can have an infinite number of elements. The Dirichlet process is a generalization of the Dirichlet distribution. Dirichlet prior In Bayesian statistics, the Dirichlet distribution is a conjugate prior distribution of a polynomial distribution. Gibbs Distribution The prior probability distribution of a Markov random field. The negative value of its log-likelihood can be written as the sum of the interaction potentials (prior energy) according to Hammersley-Clifford theorem: E p ð xÞ ¼
X
V i,j,k,l ½ f ði, jÞ, f ðk, lÞ
fði, jÞ, ðk, lÞg2N
Among them, x represents the pixel set {f(i, j)}, and N ¼ N(i, j) represents the neighborhood of the pixel (i, j). Boltzmann distribution Same as Gibbs distribution.
8.3
Signal Processing
Signal processing is the direct foundation of image processing. Many image technologies are high-dimensional extensions of signal technology (Bracewell 1995; Sonka et al. 2014; Zhang 2017a).
8.3.1
Basic Concepts
Signal processing Generally, it refers to a collection of mathematical and computational tools used to analyze 1-D (also multidimensional) signals, such as audio recordings or other intensities that change over time or location. Digital signal processing is a subset of signal processing for digital streams of signals that have been represented as binary.
8.3 Signal Processing
335
Digital signal processor [DSP] A set of coprocessors designed to perform efficient processing operations on digital signals. A common feature is the ability to provide a fast multiplication and accumulation function. Digital signal processor [DSP] chip Specially designed, fast coprocessor devices. Space division multiplexing By dividing the spatial domain, multiple signals or multiple symbols in a single channel are multiplexed. Temporal process A signal or process that changes over time. See time series and time-varying random process. Blind source separation In the case where the theoretical model of the signal and the source signal cannot be accurately known, the process of separating each source signal from the aliased observation signal. “Blind” here means that only the aliased signal is observed, but the original signal is not observed, and the aliasing coefficient is not known. A typical example is the recording of two simultaneous speakers with two microphones. The resulting mixed signal is always a linear combination of the amplitudes of the two sounds. The coefficients of this linear combination should be constant. If their values can be inferred from the sampled data, then the mixing process can be reversed to obtain two clear signals each containing only one speaker. In image engineering, a special image processing operation. Generally, it refers to the method and process of estimating or separating the original image components in the overlapping observation image when the theoretical model of the image and the original image cannot be accurately known. Often, independent component analysis methods are used. Blind signal separation [BSS] Same as blind source separation. Cocktail party problem In signal processing, when multiple signals are mixed and obtained by multiple receivers, it is necessary to distinguish the original signal from them. For example, in a room, several people are talking. Imagine several microphones recording a conversation. There are several mixed recordings of the same speech signal at any time. If considering that two people are talking, the signals s1(t) and s2(t) are generated; there are two microphones for recording, and the recorded signals x1(t) and x2(t) are x1 ðt Þ ¼ a11 s1 ðt Þ þ a12 s2 ðt Þ x2 ðt Þ ¼ a21 s1 ðt Þ þ a22 s2 ðt Þ
336
8 Related Knowledge
Among them, a11, a12, a21, and a22 are unknown mixing coefficients. There is only one system of equations with two linear equations, but there are six unknowns (four mixing coefficients and two raw signals), which is a cocktail party problem. Obviously, it is impossible to solve the system of equations in a definitive way to recover unknown signals. It needs to be solved by considering the statistical characteristics of the independent signals and using the central limit theorem. Signal quality Comparison of the expected value of the signal with the actual signal. It can be measured by means of mean square error (MSE). Online processing The processing of data when it is collected or when the situation requires it. Different from real-time processing or batch processing (processing flow or processing interval has been determined in advance). Online likelihood ratio test A likelihood ratio test in which data arrives sequentially. Phase space Joint space of position variables and momentum variables. Position variable State variables. See state space. State variable Variables in state space. Phase unwrapping technique The method and process of estimating the reconstruction of the true phase offset based on the phase from the transform to the interval [π, π]. The true phase offset value may not fall into this interval, but it can be mapped to this interval by adding or subtracting several 2π. This technique maximizes the smoothness of the phase image by adding and subtracting several 2π frames at different positions. Phase retrieval problem The problem of reconstructing a signal based only on the amplitude (not phase) of the Fourier transform.
8.3.2
Signal Responses
Point spread function [PSF] Response of a 2-D system or filter to a Dirac pulse input. The response is usually spread over a region around the point where the pulse is located, hence the name. See filter and linear filter.
8.3 Signal Processing
337
Line spread function [LSF] A function describing how an ideal infinitely thin straight line is distorted after passing through an optical system. This function can be calculated by integrating the point spread function of an infinite number of points along a straight line. 1-D projection of a 2-D point spread function. It can use relatively vertical edges to reconstruct the horizontal line spread function. Converting the line spread function to the Fourier domain, the resulting amplitude curve is a 1-D modulation transfer function, which indicates which image frequency components are lost (blurred) or aliased during image acquisition. Separable point spread function A special point spread function. If the effect of the point spread function on the image column and the image row are independent of each other, then the separable point spread function can be expressed as hðx, α, y, βÞ hc ðx, αÞhr ðy, βÞ Shift-invariant point spread function A special point spread function. The point spread function h(x, p, y, q) describes how the input value at position (x, y) affects the output value at position ( p, q). If the effect described by the point spread function is independent of the actual position and only related to the relative position between the affecting pixels and affected pixels, then this point spread function is a shift-invariant point spread function: hðx, p, y, qÞ ¼ hðp x, q yÞ Slant edge calibration A technique for recovering 2-D point spread functions from 1-D projection. Since horizontal or vertical edges are likely to be affected by aliasing during image acquisition, a nearly vertical edge is used to calculate the horizontal line spread function (LSF). The LSF is often converted to Fourier domain, and its amplitude is plotted as a 1-D modulation transfer function (MTF), which indicates the frequency of the image that is lost (fuzzy) and aliased during acquisition. For most computational photography applications, it is best to estimate the complete 2-D point spread function directly, as it may be more difficult to recover from its projection. Transfer function Fourier transform of the point spread function. Let image g(x, y) be the convo lution result of image f(x, y) and a linear, position-invariant operator h(x, y), that is, gðx, yÞ ¼ f ðx, yÞ hðx, yÞ According to the convolution theorem, in the frequency domain,
338
8 Related Knowledge
Gðu, vÞ ¼ F ðu, vÞH ðu, vÞ Among them, G, H, and F are Fourier transforms of g, h, and f, respectively, and H is a transfer function. In a frequency domain filter, the transfer function is the ratio of the filter’s output to its input. Can also refer to its filter function. Modulation transfer function [MTF] The magnitude function of the transfer function. Informally, a pattern of spatial change is a measure of how well an optical system can be observed. More formally, in a 2-D image, let F(u, v) and G(u, v) be the Fourier transforms of the input image f(x, y) and the output image g(x, y), respectively. Then the modulation transfer function of a horizontal and vertical frequency pair (u, v) is |H(u, v)|/|H(0, 0)|, where H(u, v) ¼ G(u, v)/F(u, v). This is also the magnitude of the optical transfer function. Optical transfer function [OTF] Informally, the optical transfer function is a measure of how well a spatially varying pattern is observed by an optical system. More formally, in a 2-D image, let I( fx, fy) and O( fx, fy) be the Fourier transforms of the input image I(x, y) and the output image O(x, y), respectively. The optical transfer function of a horizontal and vertical spatial frequency pair ( fx, fy) is H( fx, fy)/H(0, 0), where H( fx, fy) ¼ O( fx, fy)/I( fx, fy). The optical transfer function is generally a complex number, recording the decrease in signal strength and phase shift at each spatial frequency. Optical transfer function is also a collective term for modulation transfer function and phase transfer function. It represents the ability of the surface of different objects to transfer different frequency spectra. High frequencies correspond to details, and low frequencies correspond to contours. This function gives a way to judge the ability of an optical system to measure spatially varying modes, that is, a way to assess the degree of degradation. Modulator An algorithm or device that converts a sequence signal into a physical signal suitable for channel transmission. For example, in radio transmission, the physical carrier signal can be amplitude-modulated, frequency-modulated, or phase-modulated at the input. Unit sample response See pulse response function. Pulse response function Output response of the signal system when the input is a unit impulse function (unit sample response). That is, the output of the imaging system when it receives illumination from a point light source. It is the Fourier transform of the system transfer function.
8.3 Signal Processing
339
Impulse response See pulse response function. Infinite impulse response [IIR] One of infinite responses to a pulse signal in space and time. The design and implementation of infinite impulse response filters are easier than finite impulse response filters, but they are not easy to optimize and are not necessarily stable. Infinite impulse response [IIR] filter A filter that convolves an image with a finite array. Filter that can give output value (yn) based on current and past input values (xi) and past output values (yj), can be written as yn ¼
I X
ai xni þ
i¼0
J X
b j ynj
j¼0
where, ai and bj are weights. Various ideal filters (low-pass, band-pass, or highpass) do not meet this condition. An iterative filter whose output value depends on the output of a previous filter. If it gets a non-zero pulse, its response will last forever, hence the name. Finite impulse response [FIR] The phenomenon that there are only a finite number of intensity responses to a pulse signal in space and time. The finite impulse response filter is more complex in design and implementation than the infinite impulse response filter. Finite impulse response [FIR] filter A filter that P generates an output value yo based on the current and past input values xi. yo ¼ ni¼0 ai xni , where ai is the weight. There are only a finite number of responses to a pulse signal in time or space. It can also be said that the kernel has a limited range (the actual images are limited). Sifting A function or characteristic of a Dirac function. For an image f(x, y), it provides its value at ( p, q): f ðp, qÞ ¼
XX f ðx, yÞδðx p, y qÞ x
y
Heaviside step function A special discontinuous function, the Dirac function can be considered as its “differential” function. Its discrete form is (i is an integer)
340
8 Related Knowledge
H ðiÞ ¼
0 i
:
x0
It has the following relationship with sign functions: 1 H ðxÞ ¼ ½1 þ sgn ðxÞ 2
8.3.3
Convolution and Frequency
Convolution A basic mathematical operation in a linear system. The convolution of the two functions f(x, y) and g(x, y) can be expressed as Z1 Z1 f ðx, yÞ gðx, yÞ ¼
f ðx u, y vÞgðx, yÞdxdy 1
1
Z1 Z1 ¼
f ðx, yÞgðx u, y vÞdxdy 1
1
Convolution has the basic properties of multiplication, that is, for the three functions f(x, y), g(x, y), and h(x, y), there are f ðx, yÞ gðx, yÞ ¼ gðx, yÞ f ðx, yÞ
½ f ðx, yÞ gðx, yÞ hðx, yÞ ¼ f ðx, yÞ ½gðx, yÞ hðx, yÞ Convolution is widely used in image engineering. When processing an image, a weighted sum of the pixel value and its neighboring pixel values can be calculated for each pixel. Depending on the choice of weights, many different space domain image processing operations can be implemented.
8.3 Signal Processing
341
Convolution theorem Abbreviation for Fourier convolution theorem. Time convolution theorem Same as convolution theorem. Frequency convolution theorem A theorem reflecting the relationship between the convolution of two frequency domain functions and their inverse Fourier transforms. The duality of the convo lution theorem. The product of two frequency domain functions F(u, v) and G(u, v) in the frequency domain and their inverse Fourier transform f(x, y) and g(x, y) in space form a pair of transforms, and the product of the two frequency domain functions in the frequency domain and the convolution of the two inverse Fourier transforms in space form a pair of transforms. Let f(x, y) and F(u, v) constitute a pair of transformations, denoted by f(x, y) $ F(u, v). The frequency convolution theorem can be written as F ðu, vÞ Gðu, vÞ $ f ðx, yÞgðx, yÞ F ðu, vÞGðu, vÞ $ f ðx, yÞ gðx, yÞ Inverse convolution See convolution. Deconvolution Inverse process of convolution. Can be used to remove certain signals (such as removing blur from an image using inverse filtering; see deblur). Consider the image generated by convolution h ¼ f g + n, where f is the image, g is the convolution template, and n is the noise. Deconvolution is to estimate f from h. This is often an ill-posed problem, and the solution is often not unique. See image restoration. It can also be regarded as an operation to eliminate the effect of the convolution operation. Belongs to an important reconstruction work. Because the blur caused by the optical system can be described by convolution, deconvolution requires the use of a deconvolution operator, which is only valid when its transfer function has no zero. Pseudo-convolution The operation of projecting one signal onto another. Projecting a signal on two unit signals (base signal or filter) includes taking each sample of the signal and its local neighbors (the same size as the unit signal), multiplying them, and adding the products together. If ignoring the flip of the convolution signal (which is necessary to perform a real convolution), this is actually a convolution. Therefore, such a convolution without a flip operation can be called a pseudo-convolution. Pseudoconvolution is often used in image processing in the following cases: if the unit signal is symmetric, it does not matter whether it is flipped or not; and if the unit signal is antisymmetric, it needs to be flipped. However, since the square of the
342
8 Related Knowledge
resulting value is calculated at the end, there is no need to change the sign whether it is flipped or not. Continuous convolution Convolution of two consecutive signals. The continuous convolution operation of two images uses the integral corresponding to the sum in the discrete convolution operation. Convolution operator A widely used image processing operator. Take 1-D as an example; calculate the weighted sum y( j) ¼ Σiw(i)x( j – i), where w(i) is the weight, x(i) is the input, and y ( j) is the output. Take 2-D as an example; calculate the weighted sum y(r, c) ¼ ∑i,jw (i, j)x(r–i, c–j), where w(i, j) is the weight, x(r, c) is the input, and y(r, c) is the convolution result. By properly selecting the weights, many linear operations such as low-pass/smoothing, high-pass/differential filtering, or template matching/ matching filtering can be calculated by convolution. Mexican hat operator A convolution operator either implements a Laplacian-of-Gaussian operator or implements a difference of Gaussian operator. Both can give very similar results. The operator template used for this convolution is shaped like a Mexican straw hat (wide-brimmed hat), hence its name. Frequency The number of times that a periodic change is completed in a unit time, which is the reciprocal of the cycle. The basic unit is hertz (Hz). Various types of waves in the electromagnetic spectrum can be described by both their wavelength and their frequency. See light. Low frequency It is often referred to as low spatial frequency in image technology. The low-frequency component of an image corresponds to the slowly changing intensity component in the image, such as a large region of bright or dark pixels. If indicating a low temporal frequency is required, low frequency refers to a slowly changing pattern of light or dark at the same pixel position in a video sequence. Angular frequency A measure that determines the user’s perception of periodic signals. It is a characteristic of the signal, taking into account the viewing distance and the related observation angle. It can be expressed by the cycles per degree (cpd) of the observation angleθ and written as fθ: θ ¼ 2 arctan
(h/(2d)) ≈ h/d (radian) = 180h/(πd) (degree)
fθ = fs/θ = (πd/(180h)) fs (cpd)
Among them, h is the height (or width) of the field of view, and d is the viewing distance. As can be seen from the above formula, for the same picture (such as raster mode) and fixed picture height or width, the angular frequency fθ increases with the increase of the viewing distance; conversely, for a fixed viewing distance, a larger display size leads to a lower angular frequency. This is consistent with human experience: the same periodic test pattern changes more frequently when viewed from a greater distance and changes more slowly on larger screens. Local frequency The instantaneous frequency of the periodic signal. Can be calculated based on the phase of the signal. Local wave number Same as local frequency. Frequency window A window that defines a window function in the frequency domain.
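To make the convolution operator entry above concrete, the following Python sketch evaluates the 1-D weighted sum y(j) = Σi w(i) x(j − i) directly; the function name, the zero padding at the borders, and the example arrays are assumptions of this illustration rather than part of the handbook.

```python
import numpy as np

def convolve_1d(x, w):
    """Direct 1-D convolution: y(j) = sum_i w(i) * x(j - i), with zero padding."""
    n, m = len(x), len(w)
    y = np.zeros(n)
    for j in range(n):
        for i in range(m):
            if 0 <= j - i < n:          # samples outside the signal count as zero
                y[j] += w[i] * x[j - i]
    return y

# Example: smoothing (low-pass filtering) with a 3-tap averaging kernel
signal = np.array([0, 0, 1, 4, 1, 0, 0], dtype=float)
kernel = np.array([1/3, 1/3, 1/3])
print(convolve_1d(signal, kernel))
# np.convolve(signal, kernel, mode='full') computes the same weighted sums
# over a longer output range.
```

By choosing the weights w appropriately, the same loop realizes smoothing, differencing, or matched filtering, as noted in the convolution operator entry.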
8.4
Tools and Means
In image engineering, the processing of images cannot do without the support of many software and hardware (Zhang 2017a, b, c).
8.4.1
Hardware
Single instruction multiple data [SIMD] A computer structure that allows the same instruction to be executed simultaneously on multiple processors for different parts of a data set, such as all pixels in a pixel neighborhood. Useful for many low-level image processing operations. Contrast with MIMD. See pipeline parallelism, data parallelism, and parallel processing. Multiple instruction multiple data [MIMD] A form of parallelism in which each processor can execute a different instruction or execute different programs on different data sets or pixels at any given time. This is in contrast to the parallel form of single instruction multiple data, where all processors execute the same instruction at the same time, although acting on different pixels.
Fig. 8.19 Array processor topology
Cellular array A massively parallel computing architecture. Contains a very large number of processing units. In machine vision applications, it is particularly useful when there is a simple 1: N mapping relationship between image pixels and processing units. See systolic array and SIMD. Array processor A set of time-synchronized processing units that calculates the assigned data. Some array processor units only communicate with directly connected neighboring units, as shown in the topology diagram in Fig. 8.19. See single instruction multiple data. Pipeline parallelism Parallelism achieved by two or more (may be dissimilar) computing devices. The nonparallel processing process includes steps A and B, which operate on a series of terms xi (i > 0) and produce an output yi. The result of step B depends on the result of step A, so the serial computer calculates ai ¼ A(xi); yi ¼ B(ai) for each i. Parallel computers cannot calculate ai and yi at the same time, because they are dependent, so they need to be calculated as follows:
a1 = A(x1),  y1 = B(a1),  a2 = A(x2),  y2 = B(a2),  a3 = A(x3),  ...,  ai = A(xi),  yi = B(ai)

Since yi is calculated immediately after calculating yi−1, the above calculations can be rearranged as

a1 = A(x1)
a2 = A(x2)        y1 = B(a1)
a3 = A(x3)        y2 = B(a2)
...
ai+1 = A(xi+1)    yi = B(ai)

Fig. 8.20 Pipeline process
Calculation steps on the same line can be performed simultaneously because they are independent of each other. In this way, the output value yi can be obtained by pipeline calculation once per cycle, rather than once every two cycles. This pipeline process is shown in Fig. 8.20. Parallel processing Execute an algorithm in parallel. This requires that the algorithm can be broken down into a series of calculations that can be performed simultaneously on separate hardware. See single instruction multiple data, multiple instruction multiple data, pipeline parallelism, and task parallelism. Systolic array A type of parallel computer in which the processors are arranged into a directed graph. The processor synchronously receives data from a group of neighbors (such as the upper and left processors of a rectangular array), performs calculations, and passes the calculated results to another group of neighbors (such as the lower and right processors of a rectangular array). System on chip [SOC] A complete system (including central processing unit, memory, peripheral circuits, etc.) integrated on a single chip. System on chip is the main solution to replace
integrated circuits as integration becomes more efficient. It can reduce the size and power consumption of image processing hardware and also promotes the development of dedicated image processing software.
Floating-point operations [FLOPS]
Operations on floating-point numbers, which are approximations of real numbers in a computer. Compared with integer operations, floating-point operations are slower and may introduce errors, but they offer higher accuracy.
Millions of instructions per second [MIPS]
An indicator unit that represents the speed or performance of a computer or its processor.
8.4.2
Software
MATrix LABoratory [MATLAB]
A tool for data analysis, prototyping, and visualization. It builds in matrix representation and calculation, strong image and graphics processing capabilities, a high-level programming language, and development environment support toolkits. Because an image can be represented by a matrix, it can be easily processed with MATLAB.
MATLAB operators
Operators that perform MATLAB operations. They can be divided into three categories:
1. Arithmetic operators: perform numerical calculations on matrices.
2. Relational operators: compare operands and determine their mutual relations.
3. Logical operators: perform standard logical functions, such as the AND operator, NOT operator, and OR operator.
M-files
Files generated and edited with MATLAB's editor. An M-file can be a script that interprets and executes a series of MATLAB commands (or statements), or a function that receives variables (parameters) and produces one or more output values; MATLAB includes a series of functions for storing, viewing, and debugging M-files.
Stem plot
A form of displaying a grayscale image histogram in MATLAB. The stem plot is a 2-D function graph showing each function value as a lollipop (a stem with a rounded end). The stem plot provides a visual alternative to the histogram, in which the vertical lines of the histogram are replaced with stems.
Data structure
A basic concept in programming: organizing computer data in a precise structure. Typical structures include trees (see, e.g., quadtree), queues, stacks, etc. Data structures are always accompanied by procedures or libraries that implement various types of data operations (such as storage and indexing).
Fig. 8.21 Hierarchical model example
Hierarchical model A type of tree structure model, that is, the model is composed of sub-models, and each sub-model can be composed of smaller sub-models. A model can contain multiple examples of sub-models. Coordinate system transformations can be used to place sub-models relative to the model, or just list the sub-models in a structure collection. Figure 8.21 shows a schematic diagram with a three-layer model (the three layers are placed horizontally in the figure), in which the sub-model is used multiple times. Structured model See hierarchical model. Subcomponent Target parts (sub-models) used in the hierarchical model Subcomponent decomposition In the hierarchical model, the set of smaller targets is used to express the complete target part. Data parallelism Refers to the organization of the input or the program itself or the parallel structure of the programming language used. Data parallelism is a useful model for many image processing, because the same operation can be performed independently and in parallel on all pixels in the image.
8.4.3
Diverse Terms
Solid angle Starting from one point, the space portion surrounded by rays passing through all points on a closed curve, that is, the space portion surrounded by a cone. The starting point is called the apex of the solid angle, and the solid angle represents the viewing angle when the closed curve is viewed from the apex. In practice, the ratio of the area of a solid angle cut on a spherical surface with its apex as the center of the sphere, and the square of the spherical radius can be taken as a measure of the solid angle. The unit of solid angle is steradian, and it is written as sr. The solid angle is measured by projecting this area onto a sphere of unit radius, or dividing the area S on the sphere by the area and the sphere with r as the radius divided by r2. A unit
Fig. 8.22 Solid angle schematic
Fig. 8.23 Spatial angle schematic
solid angle is called a steradian or square degree and is dimensionless. A steradian corresponds to a solid angle when the area intercepted on the spherical surface is equal to the square area with the spherical radius as the side length. For a point, the entire solid angle around it is 4π. Solid angles are used when calculating features such as bidirectional reflectance distribution function, brightness, luminous intensity, radiant intensity, and scene brightness. It is also a feature of 3-D objects: the portion of the object projection on the surface of the unit sphere, as shown in Fig. 8.22. The surface area of a unit sphere is 4π, so the maximum value of a solid angle is 4π. Steradian Solid angle unit Square degree A unit of solid angle equal to (π/180)2 steradian. Origin is a square; the angle between the length of each side and the center of the sphere is 1 Spatial angle The region bounded by a cone on a unit sphere. The apex of the cone is at the center of the ball, as shown in Fig. 8.23. Its unit is steradian. Commonly used when analyzing brightness. Pseudo-random With the randomness of a computer-generated “random variable.” Because these “random variables” are generated from a series of formulas designed to produce different series of values, they are not truly random.
Fig. 8.24 Unit circle and unit square
Pseudo-random numbers
A sequence of values generated by a computer's deterministic algorithm but which can pass various tests of randomness.
Box-Muller method
A method for generating random numbers that obey a normal distribution. The basic idea is to first generate random numbers that obey a uniform distribution and then transform them so that they obey a normal distribution. Suppose a pair of uniformly distributed random numbers z1, z2 ∈ (−1, 1) is generated, which corresponds to a point in a square; such a pair can be obtained from values uniform in (0, 1) by the mapping z → 2z − 1. Only the pairs satisfying z1² + z2² ≤ 1 are then retained. This gives points uniformly distributed in the unit circle (satisfying p(z1, z2) = 1/π), as shown in Fig. 8.24. For each retained (z1, z2) pair, calculate

y1 = z1 [−2 ln(z1² + z2²) / (z1² + z2²)]^(1/2)
y2 = z2 [−2 ln(z1² + z2²) / (z1² + z2²)]^(1/2)

The joint distribution of y1 and y2 is then

p(y1, y2) = p(z1, z2) |∂(z1, z2)/∂(y1, y2)| = [(1/√(2π)) exp(−y1²/2)] [(1/√(2π)) exp(−y2²/2)]

So y1 and y2 are independent, and each has a Gaussian distribution with zero mean and unit variance. If y has a Gaussian distribution with zero mean and unit variance, then σy + μ will have a Gaussian distribution with mean μ and variance σ². In order to generate a multivariate Gaussian distribution with mean vector μ and covariance matrix Σ, a Cholesky decomposition can be used.
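A minimal sketch of the rejection-and-transform procedure described in the Box-Muller entry, assuming NumPy; the function name and sample size are illustrative.

```python
import numpy as np

def box_muller_polar(n, rng=np.random.default_rng()):
    """Generate n standard normal samples with the polar (Box-Muller) method."""
    out = []
    while len(out) < n:
        # Uniform pair in (-1, 1) via z -> 2z - 1
        z1, z2 = 2.0 * rng.random(2) - 1.0
        r2 = z1 * z1 + z2 * z2
        if 0.0 < r2 <= 1.0:              # keep only points inside the unit circle
            factor = np.sqrt(-2.0 * np.log(r2) / r2)
            out.extend([z1 * factor, z2 * factor])   # two independent N(0, 1) values
    return np.array(out[:n])

samples = box_muller_polar(10000)
print(samples.mean(), samples.std())     # close to 0 and 1
# For mean mu and variance sigma**2, use sigma * samples + mu, as noted above.
```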
Efficiency of an algorithm The degree of computational load required to implement an algorithm. It can be used to characterize the time complexity or calculation time of a given computing device (usually a computer) implementing algorithms on different computing scales. Prototype An object or model or device used as a representative example of a class that has the characteristics defined by that class. Bitshift operator An operator that shifts the binary representation of each pixel to the left or right by several bits. For example, shifting 10101010 to the right by 3 bits gives 00010101. When it is needed to multiply or divide an image by a power of two, using a shift operation is less computationally expensive than using a multiplication or division operation. Moving n bits is equivalent to multiplying or dividing by 2n. Renormalization group transform A method of coarsening the configuration space that can simultaneously coarsen data and models (because the configuration space depends on the data and the model). The use of renormalized group transformation can ensure that the desired global solution is in the rough solution derived from the coarsened data and the expression of the model. It is also allowed to define a model for the coarse data that is compatible with the model originally adopted for the entire image. Constraint satisfaction A strategy for solving problems, including three modules: (1) a list of values that are required (preferable) for “variables,” (2) a set of allowed values for each “variable,” and (3) the collection in which values between each “variable” must satisfy the relationship (constraint). Examples of applications in image engineering include marking various structures (such as line marking, region marking, etc.) and recovering geometric models (such as reverse engineering) from depth data.
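The bitshift operator entry above can be illustrated with a few lines of Python; the 8-bit example image is an assumption of this sketch.

```python
import numpy as np

pixel = 0b10101010                 # 170
print(bin(pixel >> 3))             # 0b10101 = 21, i.e. integer division by 2**3

# Halving the brightness of an 8-bit image by a right shift instead of a division
image = np.array([[16, 64], [128, 255]], dtype=np.uint8)
print(image >> 1)                  # same result as image // 2, usually cheaper
```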
Part II
Image Processing
Chapter 9
Pixel Spatial Relationship
An image is composed of its basic units, pixels (Bracewell 1995; Castleman 1996; Gonzalez and Wintz 1987; Jähne 2004; Marchand-Maillet and Sharaiha 2000; MacDonald and Luo 1999; Pratt 2007; Russ and Neal 2016; Tekalp 1995). Pixels are arranged in a certain pattern in the image space and have certain relationships with each other. In image processing, images can be processed according to the connections between pixels (Sonka et al. 2014; Zhang 2017a).
9.1
Adjacency and Neighborhood
For a pixel, it is most closely related to its neighboring pixels, which constitute the neighborhood of the pixel (Castleman 1996; Gonzalez and Wintz 1987; Sonka et al. 2014).
9.1.1
Spatial Relationship Between Pixels
Spatial relationship between pixels The relationship between the orientation, position, etc. of the pixels in the image. These relationships can be adjusted by coordinate transformation. Adjacency 1. A relationship between pixels in a 2-D image. Adjacent pixels have spatial contact. Commonly used adjacencies can be divided into three categories according to the contact methods: 4-adjacent, diagonal-adjacent, and 8-adjacent.
2. A relationship between voxels in a 3-D image. Adjacent voxels have spatial contact. Commonly used adjacencies can be divided into three categories according to the contact methods: 6-adjacent, 18-adjacent, and 26-adjacent. Adjacent 1. Represents a way of contact between pixels or voxels. It refers to the connected pixels in the image and the regions (pixel sets) having a common boundary. In graph, the adjacency of the nodes is indicated by edge (arc). In geometric model, it refers to a component that has a common enclosing component. Defining adjacency to be formalized requires some kind of intuition, because it needs to indicate the approach or how many boundary points one needs to consider between two structures. 2. A form of contact between two nodes (vertices) or between two edges in graph theory. If two vertices are associated with the same edge, the two nodes are adjacent. Similarly, if two edges have a common node, the two edges are adjacent. Adjacent pixels Pixels that satisfy adjacent conditions with each other. For 2-D images, the adjacency conditions can be 4-adjacent, diagonal-adjacent, and 8-adjacent. For 3-D images, the adjacency conditions can be 6-adjacent, 18-adjacent, and 26-adjacent. Hard-adjacency Give the adjacency of the binary result. In a 2-D image, two pixels with the same side are considered to be completely adjacent, i.e., the adjacent value is one. All other pairs of pixels are considered to be non-adjacent, i.e., the adjacent value is zero. Using 6-adjacent in a 3-D image, the binary adjacent values of the two voxels p and q are μðp, qÞ ¼
1,  if p and q are the same, or p and q differ by 1 in only one coordinate
0,  otherwise
Compare fuzzy adjacency. Adjacency relation A collection of unordered or ordered pairs of nodes (vertices) in a graph. They are the set of edges of the undirected graph and the directed graph, respectively. Adjacency matrix A binary matrix that expresses a region adjacency map. When two regions R1 and R2 are adjacent, the element at the intersection of R1 and R2 in the matrix is 1, otherwise 0. Further, if the matrix is represented by A, the element aij therein represents the number of edges from the node (vertex) i to the node j. Adjacency graph A graph representing the adjacent relationship between image structures (such as segmented image regions or some corner points). The nodes in the figure correspond
Fig. 9.1 Adjacency graph
Fig. 9.2 Basic movement in the N16 space
to structures, and the arcs indicate the adjacencies between the two structures connected by arcs. In Fig. 9.1a, shows the original image, Fig. 9.1b shows the three regions obtained after the division, and Fig. 9.1c shows their corresponding adjacency graph. Linear symmetry A neighborhood property in which the target gray level changes only in a certain direction (thus indicating a local orientation). It can be described by a function similar to g(xk), where x is the spatial coordinate and k is the unit vector in the direction of the gray-level change. Local orientation See linear symmetry. N16 space A space consisting of one (center) pixel and its 16-neighbor pixels. The basic movement from the center pixel to its neighboring pixels can be divided into three types (see Fig. 9.2): 1. Horizontal movement and vertical movement: the length of movement is a, called a-move, and the corresponding chain codes are 0, 4, 8, and 12. 2. Diagonal movement: the length of the movement is b, called b-move, and the corresponding chain codes are 2, 6, 10, and 14.
3. Knight step movement: the length of the movement is c, called c-move, and the corresponding chain codes are 1, 3, 5, 7, 9, 11, 13, and 15. Segment adjacency graph The graph in which the nodes represent the edge segments, and the edges between the nodes are connected to the adjacent edge segments. For each edge segment, it can be described by local geometric features such as its length and direction, midpoint position, etc. Image matching can be performed using the detected edge segments in the image by means of the segment adjacency graph.
9.1.2
Neighborhood
Neighborhood An area consisting of contiguous pixels of one pixel. It can be interpreted as a small matrix containing one reference pixel (often at the center). According to the adjacency relationship, 4-neighbor, diagonal-neighbor, and 8-neighbor are commonly used in 2-D images of square grids, and 6-neighborhoods, 18-neighbor, and 26-neighbor are commonly used in 3-D images of square grids. In the 2-D image of the hexagonal grid, 3-neighbor and 12-neighbor can be used. For example: 1. The neighborhood of a vertex v in the graph is all vertices connected to v by an arc. 2. The neighborhood of a point (or pixel) p is a collection of points that are “close” to p. A common definition is a collection of points within a certain distance from p, where the measure of distance can be Manhattan distance or Euclidean distance. 3. A 4-connected neighborhood of a 2-D position (x, y) consists of a set of pixel positions: {(x + 1, y), (x 1, y), (x, y + 1), (x, y 1)}; and the 8-connected neighborhood includes a set of pixel positions: {(x + i, y + i)| –1 i, j 1}. Open neighborhood Corresponding to the point neighborhood of the open set. That is, the neighborhood contains only internal points of the set. Close neighborhood Corresponding to the point neighborhood of the closed set. Neighborhood of a point A collection of points around a point and satisfying certain conditions. If you make ε a positive real number and a lower bound, the point neighborhood of the point p in the plane (denoted as N( p, ε) is defined by N ðp, εÞ ¼ x 2 ℝ2 : kx pk < ε ðfor the open neighborhood of pÞ
Fig. 9.3 3-neighborhood in a hexagonal grid
Fig. 9.4 4-neighborhood of a pixel
Here N( p, ε) is an open neighborhood because it excludes the boundary point x in the plane. The closed neighborhood of N( p, ε) includes its internal points and boundary points, which are defined as N ðp, εÞ ¼ x 2 ℝ2 : kx pk ε ðfor the close neighborhood of pÞ 3-neighborhood A neighborhood in a hexagonal grid. It consists of pixels that have a common edge with the center pixel (called a direct adjacent pixel). For example, in Fig. 9.3, three neighboring pixels (represented by black circles) that are co-edge with the three sides of the central pixel p of the triangle constitute a 3-neighborhood of the pixel p, which is denoted as N3( p). 4-neighborhood A region in a 2-D image, consisting of four 4-adjacent pixels of one pixel. In Fig. 9.4, the 4-neighborhood of pixel p includes four pixels labeled r, denoted N4( p). 6-neighborhood In 3-D images, the volume consisting of six voxels. They are on the same plane (6-contiguous) as the central voxel p. This case of the 6-neighborhood can be marked as N6( p), referring to the definition in the 3-D neighborhood, with N 6 ðpÞ ¼ V 11 ðpÞ Figure 9.5 gives a schematic representation of N6( p) in which the black dot voxels constitute the 6-neighborhood of the central white point voxel p.
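As a small illustration of the neighborhood definitions in this subsection, the following sketch lists the coordinate offsets of N4(p), ND(p), and N8(p) for a pixel p = (x, y); the helper names are illustrative, not from the handbook.

```python
def n4(x, y):
    """4-neighborhood: pixels sharing an edge with (x, y)."""
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

def nd(x, y):
    """Diagonal-neighborhood: pixels sharing only a vertex with (x, y)."""
    return [(x + 1, y + 1), (x + 1, y - 1), (x - 1, y + 1), (x - 1, y - 1)]

def n8(x, y):
    """8-neighborhood: union of the 4- and diagonal-neighborhoods."""
    return n4(x, y) + nd(x, y)

print(len(n4(5, 5)), len(n8(5, 5)))   # 4 and 8
```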
Fig. 9.5 6-neighborhood in the 3-D image
Fig. 9.6 8-neighborhood of pixel p
Fig. 9.7 12-neighborhood in a hexagonal grid
8-neighborhood A region consisting of eight 8-adjacent pixels (including four 4-adjacent pixels and four diagonal-adjacent pixels) of one center pixel. As shown in Fig. 9.6, the 8neighborhood of pixel p includes 8 pixels labeled r (4-adjacent) and s (diagonaladjacent), denoted as N8( p). 12-neighborhood A neighborhood in a hexagonal grid. It consists of pixels that have a common vertex with the center pixel (called an indirect neighboring pixel). For example, in Fig. 9.7, three vertices of a central pixel p of a triangle and 12 neighboring pixels (represented by black dots) have a common vertex, and these 12 pixels constitute a 12-neighborhood of the pixel p, which is denoted as N12( p). 16-neighborhood In a 2-D image, a type of neighborhoods in a square grid. It consists of an 8neighborhood and a knight-neighborhood of one pixel. In Fig. 9.8, the 16-neighborhood of pixel p includes 16 pixels labeled r, denoted as N16( p).
Fig. 9.8 16-neighborhood in a square grid
Fig. 9.9 18-neighborhood in a 3-D image
Fig. 9.10 24-neighborhood of a pixel
18-neighborhood In 3-D images, the volume consisting of 18 voxels. They have a common edge (18-contiguous) to the central voxel p. This case of the 18-neighborhood can be marked as N18( p), referring to the definition in the 3-D neighborhood, with N 18 ðpÞ ¼ V 21 ðpÞ \ V 11 ðpÞ Figure 9.9 gives a schematic representation of N18( p), where the black dot voxels make up the 18-neighborhood of the central white point voxel p. 24-neighborhood The area consisting of 24 pixels surrounding a pixel in a 2-D image is an extension of 8-neighborhood. As shown in Fig. 9.10, these pixels are contained in the 5 5 area around the center pixel. 26-neighborhood In the 3-D image, a solid consisting of 26 voxels having the same vertex (26-contiguous) to one central voxel p. This case of the 26-neighborhood can be marked as N26( p), referring to the definition in the 3-D neighborhood, with
Fig. 9.11 26-neighborhood in a 3-D image
Fig. 9.12 Diagonal-neighborhood of a pixel
N26(p) = V∞^1(p)

Figure 9.11 gives an illustration of N26(p), in which the black dot voxels constitute the 26-neighborhood of the central white point voxel p.
3-D neighborhood
In 3-D images, the volume consisting of several adjacent voxels of one voxel. Letting p = (x, y, z) be a voxel in the image and q = (x', y', z') be another voxel adjacent to p in the image, the 3-D neighborhood V1^d(p) of the voxel p can be expressed as

V1^d(p) = {q | D1(p, q) ≤ d}

where D1(p, q) represents a distance function with norm index 1 and d is a positive integer. The neighborhood V∞^d(p) of p can be expressed as

V∞^d(p) = {q | D∞(p, q) ≤ d}

where D∞(p, q) represents a distance function whose norm index is ∞. There are three types of neighborhoods commonly used in 3-D images: 6-neighborhood, 18-neighborhood, and 26-neighborhood.
Diagonal-neighborhood
An area consisting of the four diagonal-adjacent pixels of a pixel. In Fig. 9.12, the diagonal-neighborhood of pixel p includes the four pixels labeled s, denoted ND(p).
Knight-neighborhood
The eight pixels whose distance from a pixel equals one knight's move on the chess board form a region. In Fig. 9.13, the knight-neighborhood of pixel p includes the eight pixels labeled r, denoted as Nk(p).
Fig. 9.13 Knight-neighborhood of a pixel p
Simple neighborhood In a multidimensional image, a neighborhood whose grayscale has changes only along one direction. See linear symmetry. Neighborhood pixels A set of pixels in the neighborhood of a pixel. The number of pixels in the set is different depending on the definition of the neighborhood. Open neighborhood of a pixel A group of pixels within a given distance around a pixel, excluding pixels along the boundary of the neighborhood.
9.1.3
Adjacency
12-adjacency In a 2-D image, an adjacency between pixels when using a hexagonal grid. See 12neighborhood. 12-adjacent See 12-adjacency. 16-adjacency An adjacency in the 16-neighborhood. 16-adjacent See 16-adjacency. 18-adjacency In a 3-D image, an adjacency between a voxel in a 18-neighborhood of a central voxel and this central voxel. Two 18-adjacent voxels have a common edge. Compare 6-adjacency and 26-adjacency. 18-adjacent See 18-adjacency.
26-adjacency In a 3-D image, an adjacency between a voxel in a 26-neighborhood of a central voxel and this central voxel. Two 26-contiguous voxels have a common apex. Compare 6-adjacency and 18-adjacency. 26-adjacent See 26-adjacency. 3-adjacency In a 2-D image, an adjacency between pixels when using a hexagonal grid. See 3neighborhood. 3-adjacent See 3-adjacency. 4-adjacency In a 2-D image, an adjacency between pixels. Two 4-adjacent pixels have a common (pixel) edge. See 4-neighborhood. 4-adjacent See 4-adjacency. 6-adjacency In a 3-D image, an adjacency between voxels, provided that the two 6-adjacent voxels have a common (voxel) face. See 6-neighborhood. Compare 18-adjacency and 26-adjacency. 6-adjacent See 6-adjacency. 8-adjacency In a 2-D image, an adjacency between pixels. Two 8-adjacent pixels either have a common edge (4-adjacent) or have a common vertex (diagonal-adjacent). See 8neighborhood. 8-adjacent See 8-adjacency. Diagonal-adjacency A form of adjacency between pixels in two diagonal directions. Two diagonaladjacent pixels have common vertices but no common edges. See diagonalneighborhood. Diagonal-adjacent See diagonal-adjacency.
9.2
Connectivity and Connected
The adjacency of two pixels is only related to their spatial location, and the connection and connectivity between pixels must also consider the relationship between pixel attribute values (such as grayscale values) (Gonzalez and Woods 2018; Sonka et al. 2014; Zhang 2017a).
9.2.1
Pixel Connectivity
Pixel connectivity A pattern defined for calculation purposes that specifies which pixels are nearby neighbors (both in space and in grayscale) of a given pixel. Commonly used modes include 4-connectivity and 8-connectivity. Connectivity 1. A relationship between pixels. The connection between two pixels in a grayscale image satisfies two conditions: (1) the two pixels are spatially adjacent and (2) the gray values of the two pixels are in the same grayscale set and satisfy a certain similarity criterion. 2. See pixel connectivity. 12-connectivity In a 2-D image, a connection between pixels when using a hexagonal grid. See 12neighborhood. 16-connectivity A type of pixel connection when using 16-neighborhood definition. 18-connectivity A connection between voxels in a 3-D image. The two voxels p and r take values in V (where V represents a set of attribute values defining the connection), and r is connected to p in the 18-neighborhood of p. 26-connectivity A connection between voxels in a 3-D image, provided that two voxels p and r take values in V (where V represents a set of attribute values defining the connection) and r is in the 26-neighborhood of p and is connected with p. 3-connectivity A connection between pixels when using a hexagonal grid. See 3-neighborhood. 4-connectivity A connection between pixels, provided that two pixels p and r take values in V (where V represents the set of gray values that define the connection) and r is connected to p in the 4-neighborhood of p.
6-connectivity A connection between voxels in a 3-D image, provided that two voxels p and r take values in V (where V represents a set of attribute values defining the connection) and r is in the 6-neighborhood of p and is connected with p. 8-connectivity A connection between pixels in a 2-D image, provided that two pixels p and r take values in V (where V represents the set of gray values that define the connection) and r is in the 8-neighborhood of p and is connected with p. m-connectivity A connection between pixels in 2-D image. If two pixels p and r take values in V (V denotes the set of gray values used to define the connection) and one of the following conditions is satisfied: (1) r is in the 4-neighborhood of p and (2) r is at a diagonal position of p, and the intersection of the 4-neighborhood of p and the 4-neighborhood of r does not contain pixels in V, then the two pixels are m-connected. Compare M-connectivity. Mixed connectivity Same as m-connectivity. M-connectivity A connection between pixels in N16 space. If two pixels p and r take values in V (V represents a set of gray values that define the connection) and one of the following conditions is met, (1) r is in the 4-neighborhood of p; (2) r is in the diagonal-neighborhood of p, and the intersection of the 4-neighborhood of p and the 4-neighborhood of r does not contain pixels in V; and (3) r is in the knightneighborhood of p and the intersection of 8-neighborhoods of p and r does not contain pixels that take values in V, then the two pixels are M-connected, or they have M-connectivity. Compare m-connectivity.
9.2.2
Pixel-Connected
Connected A form of relationship between two pixels. For example, between two pixels p and q in the sub-image S, there is a case of a path from p to q composed entirely of pixels in S. Two connected pixels belong to the same connected component. Depending on the connection definition used, 4-connected, 8-connected, and m-connected can also be distinguished. Path-connected If a path has p and q as endpoints, then pixels p and q are path-connected. If there are n sequences of adjacent shapes S1, ..., Si, Si + 1, ..., Sn, the image shapes A and B (any polygon) are connected by the path, provided that A ¼ S1, B ¼ Sn.
Fig. 9.14 4-connection of pixels
3-connected Connected by 3-connectivity. 3-connectedness A situation in which pixels in an image are connected when a hexagonal grid is used. See 3-neighborhood. 4-connected Connected by 4-connectivity. 4-connectedness A situation in which images are connected. Each pixel is connected to its four adjacent pixels that share an edge. Compare 8-connectedness. In Fig. 9.14, there are three sets of pixel groups (connected components) combined by 4-connectivity. 6-connected Connected by 6-connectivity. 6-connectedness A case of pixel connectivity in a 3-D image. See 6-neighborhood. 8-connected Connected with an 8-connectivity. 8-connectedness A situation in which images are connected. Each pixel is connected not only to its four adjacent pixels that share a boundary but also to its four adjacent pixels that share a vertex. Compare 4-connectedness. In Fig. 9.15, there are three sets of pixel groups (connected components) combined by 8-connection. 12-connected Connected with a 12-connectivity. 12-connectedness In a 2-D image, a case where pixels are connected when a hexagonal grid is used. See 12-neighborhood.
Fig. 9.15 8-connection of pixels
16-connected Connected with a 16-connectivity. 16-connectedness A situation in which pixels in an image are connected. See 16-neighborhood. 18-connected Connected by an 18-connectivity. 18-connectedness A case of pixel connectivity in a 3-D image. See 18-neighborhood. 26-connectedness A case of pixel connectivity in a 3-D image. See 26-neighborhood. 26-connected Connected with a 26-connectivity. m-connected Connected by m-connectivity. Mixed-connected Same as m-connected.
9.2.3
Path
Path 1. A simple graph. The vertices can be ordered such that the two vertices are contiguous if and only if they are consecutive in the sequence of vertices. 2. The path obtained by sequentially connecting two pixels in the image. That is, a sequence consisting of n pixels or voxels p1, . . ., pi, pi + 1, . . ., pn, where pi and pi + 1 are adjacent (no pixel between pi and pi + 1). Path from a pixel to another pixel A path that is gradually adjacent to one pixel by one pixel. Referred to as a path. Specifically, a path from a pixel p having coordinates (x, y) to a pixel q having coordinates (s, t) is composed of a series of independent pixels with coordinates (x0,
Fig. 9.16 Nine equivalence classes
y0), (x1, y1), ..., (xn, yn). Here, (x0, y0) ¼ (x, y), (xn, yn) ¼ (s, t), and (xi, yi) are adjacent to (xi–1, yi–1) for i ¼ 1, . . ., n, where n is the path length. Different paths, such as 4path, 8-path, m-path, can be defined or derived according to the definition of adjacency between pixels used. Digital arc A portion of a digital curve between two points on a path. Given a set of discrete points and their adjacent relationship, the digital arc Ppq ¼ {pi, i ¼ 0, 1, ..., n} from point p to point q satisfies the following conditions: 1. p0 ¼ p, pn ¼ q. 2. i ¼ 1, ..., n–1, point pi has exactly two adjacent points in the digital arc Ppq, namely, pi–1 and pi + 1. 3. The endpoint p0 (or pn) has an adjacent point in the digital arc Ppq, p1 (or pn–1). Walkthrough An infinite number of paths between two points are divided into nine equivalence classes. The first eight classes correspond to eight relative directions between the points, and the last one indicates no motion. In Fig. 9.16, for example, point B is in the equivalence class 2 relative to point A. 4-path An inter-pixel path satisfying a 4-connected condition. 8-path An inter-pixel path satisfying an 8-connected condition. m-path An inter-pixel path satisfying an m-connected condition. Mixed path Same as m-path. Connectivity paradox The connection ambiguity problem arises from the fact that there are two neighborhoods: 4-neighborhood and 8-neighborhood (they are defined according to different adjacency) in the square sampling grid. It can cause confusion between the object boundary point and the internal point, which affects the geometric measurement accuracy of the object. In order to solve this problem, when expressing and measuring the object, the boundary pixels of the internal pixels and the region of the object area are respectively defined by 4-connectedness and 8-connectedness.
9.3
Connected Components and Regions
A connected component is a subset of the image in which all pixels are connected to each other inside. Such a subset of images constitutes a region of the image (Gonzalez and Wintz 1987; Zhang 2009).
9.3.1
Image Connectedness
Connectedness An attribute of the shared point collection. If there are disjoint open sets A and B such that X = A[B, that is, X is the union of disjoint sets A and B, then set X is disconnected. The condition for the set X to be connected (having connectedness) is that X is not the union of disjoint sets. Image connectedness See pixel connectivity. Connected component A sub-image defined according to the connectedness between pixels. It is usually a set of connected pixels. Let p be a pixel in the sub-image S, and all the pixel sets (including p) that are connected to p and in S are collectively referred to as a connected component in S. In other words, there is a path between any two pixels in the connected component that is completely inside the connected component, that is, any two adjacent pixels are also connected. According to the connection definition adopted, it can be divided into 4-connected component, 8-connected compo nent, and m-connected component in 2-D image; 6-connected component, 18connected component, and 26-connected component can be distinguished in 3-D image. 4-connected component Connected components obtained by using the definition of 4-connectivity. 4-connected region The connected region obtained according to the definition of 4-connectivity. 4-connected boundary The boundary of the region obtained by the 4-connectivity of the boundary pixels of the region. At this time, the internal pixels of the region should be 8-connected; otherwise, the problem of connectivity paradox will arise. 4-connected contour Same as 4-connected boundary. 6-connected component Connected components obtained by using the definition of 6-connectivity. 8-connected component Connected components obtained by using the definition of 8-connectivity.
8-connected region The connected region obtained by using the definition of 8-connectivity. 8-connected boundary The boundary of the region obtained by using the definition of 8-connectivity to the boundary pixels of the region. At this time, the internal pixels of the region should be 4-connected; otherwise, the problem of connectivity paradox will arise. 8-connected contour Same as 8-connected boundary. 18-connected component Connected components obtained by using the definition of 18-connectivity. 26-connected component Connected components obtained by using the definition of 26-connectivity. m-connected component Connected components obtained by using the definition of m-connectivity. m-connected region The connected region obtained according to the definition of m-connectivity. Mixed-connected component Same as m-connected component. Mixed-connected region Same as m-connected region.
9.3.2
Connected Region in Image
Connected region
The general term for a class of 2-D sub-images; a connected region is essentially a connected component (every pixel in the region is connected to the others). It can be divided into two types: simply connected regions and non-simply connected regions.
Connected sets
A collection of points that share one or more points (all points in the collection are connected). The points gathered here can be combined into any form, such as line segments (connected line segments), polygons (connected polygons), and the like.

Fig. 9.17 Connected and disconnected sets

Considering the set of line segments in Fig. 9.17, the union of points on line segment rs
and line segment st is the set of points on line segment rt, and these points are interconnected. However, the set of points on line segment pq is not connected to this union. In Fig. 9.17, X = Y ∪ Z, so Y and Z are both connected sets, and X is a disconnected set (not satisfying connectedness).
Connected polygons
Polygons that share one or more points. See connected sets.
Connected line segments
Line segments that share one or more points, including line segments with a common endpoint. See connected sets.
Simply connected region
A connected region, essentially a connected component, with no holes inside.
Non-simply connected region
Similar to a simply connected region, but the connected region can have holes inside.
Connected polyhedral object
A polyhedron in which internal elements are connected to each other. It can be seen as a 3-D extension of connected regions. There are also two types: simply connected polyhedral objects and non-simply connected polyhedral objects.
Simply connected polyhedral object
A connected polyhedral object with no holes inside.
Non-simply connected polyhedral object
A connected polyhedral object having holes inside.
Extraction of connected component
A method for obtaining all elements of a given connected component, using the operations of mathematical morphology on binary images. Letting Y be a connected component in set A, with one point of Y known, the following iteration gives all the elements of Y:

Xk = (Xk−1 ⊕ B) ∩ A,   k = 1, 2, 3, . . .

where ⊕ represents the dilation (expansion) operator and B is a structuring element. When Xk = Xk−1, the iteration stops, and Y = Xk can be taken.
Connected component labeling
1. A standard graph problem. Given a graph containing nodes and arcs, it is required to identify the nodes that make up a connected set. If a node has an arc that connects it to another node in a collection, the node is also in the collection.
2. A technique for combining adjacent pixels to form a region in (binary and grayscale) image processing.
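A minimal sketch of connected component labeling in the sense of definition 2 above, assuming a binary NumPy image and 4-connectivity; the breadth-first flood fill used here is just one of several standard implementations.

```python
import numpy as np
from collections import deque

def label_components(binary, offsets=((1, 0), (-1, 0), (0, 1), (0, -1))):
    """Assign a positive label to each 4-connected component of non-zero pixels."""
    labels = np.zeros(binary.shape, dtype=int)
    current = 0
    for seed in zip(*np.nonzero(binary)):
        if labels[seed]:
            continue                       # already belongs to a component
        current += 1
        labels[seed] = current
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            for dr, dc in offsets:         # visit the 4-neighborhood
                rr, cc = r + dr, c + dc
                if (0 <= rr < binary.shape[0] and 0 <= cc < binary.shape[1]
                        and binary[rr, cc] and not labels[rr, cc]):
                    labels[rr, cc] = current
                    queue.append((rr, cc))
    return labels, current

img = np.array([[1, 1, 0, 0],
                [0, 1, 0, 1],
                [0, 0, 0, 1]])
lab, n = label_components(img)
print(n)          # 2 components under 4-connectivity
print(lab)
```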
9.4
Distance
The connection between pixels is often related to how close the pixels are in space. The proximity of pixels in space can be measured by the distance between pixels (Castleman 1996; Gonzalez and Wintz 1987; Sonka et al. 2014).
9.4.1
Discrete Distance
Discrete distance The distance in the discrete space, used to describe the spatial relationship between discrete pixels in a digital image. There are many types. In the image engineering, in addition to the Euclidean distance of the continuous space (after discretization), there are commonly used city-block distance, chessboard distance, chamfer distance, and knight distance. Distance function A function used to measure the proximity of image units in space. Taking a 2-D image as an example, given three pixels p, q, and r, their coordinates are (x, y), (s, t), and (u, v), respectively. When the following conditions are satisfied, the function D is a distance function: 1. D( p, q) 0 (D( p, q) ¼ 0 if and only if p ¼ q). 2. D( p, q) ¼ D(q, p). 3. D( p, r) D( p, q) + D(q, r). Of the above three conditions, the first condition indicates that the distance between two pixels is always positive (the distance between two pixels is zero when the positions of the two pixels are same); the second condition indicates that the distance between the two pixels is irrelevant to the choice of start and endpoints, or the distance is a relative quantity here; the third condition indicates that the shortest distance between two pixels is along a straight line. In image engineering, the commonly used discrete distance functions include city-block distance, chessboard distance, chamfer distance, and knight distance. 4-connected distance The distance defined by the 4-connected nature of the pixel. Same as city-block distance. Block distance Same as city-block distance. City-block distance Minkowski distance with norm index of 1. The city-block distance between the pixels p(x, y) and q(s, t) can be expressed as
D1(p, q) = |x − s| + |y − t|

City-block metric
See city-block distance.
Manhattan distance
Also known as the Manhattan metric. In a dense urban environment it is only possible to walk along the blocks, so the distance between (x1, y1) and (x2, y2) is |x1 − x2| + |y1 − y2|.
8-connected distance
The distance defined by the 8-connected nature of the pixel. Same as chessboard distance.
Chessboard distance
Minkowski distance with norm index of ∞. The chessboard distance between the pixels p(x, y) and q(s, t) can be expressed as

D∞(p, q) = max[|x − s|, |y − t|]

Chessboard distance metric
See chessboard distance.
Euclidean distance
Minkowski distance with norm index of 2. The Euclidean distance between the pixels p(x, y) and q(s, t) can be expressed as

DE(p, q) = [(x − s)² + (y − t)²]^(1/2)

For the n-D vectors x1 and x2, the Euclidean distance between them is

[Σ_{i=1}^{n} (x1i − x2i)²]^(1/2)
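The three point-to-point distances defined above translate directly into code; the function names and the sample points are illustrative.

```python
import math

def city_block(p, q):      # D1, Minkowski norm index 1
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def chessboard(p, q):      # D_inf, Minkowski norm index infinity
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def euclidean(p, q):       # DE, Minkowski norm index 2
    return math.hypot(p[0] - q[0], p[1] - q[1])

p, q = (0, 0), (3, 4)
print(city_block(p, q), chessboard(p, q), euclidean(p, q))   # 7 4 5.0
```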
Quasi-Euclidean distance (DQE) An approximation of the Euclidean distance. Given two points (i, j) and (k, l), the quasi-Euclidean distance between them is defined as ( DQE ½ði, jÞ, ðk, lÞ ¼
|i − k| + (√2 − 1) |j − l|,   if |i − k| > |j − l|
(√2 − 1) |i − k| + |j − l|,   otherwise
Chamfer distance
An integer approximation of the Euclidean distance. Given a neighborhood and the corresponding length of movement from one pixel to another, the chamfer distance between two pixels p and q relative to this neighborhood is the length of the shortest digital arc from p to q (i.e., the cumulative move length from p to q). In the 4-neighborhood, there are only horizontal movement (a-move) and vertical movement (a-move), where a = 1. In the 8-neighborhood, there is also a diagonal move (b-move). In this case, the chamfer distance is recorded as da,b. The most common set of values is a = 3 and b = 4, that is, the distance of one a-move is 3, and the distance of one b-move is 4. In the case of the 16-neighborhood, a c-move is added. The chamfer distances in the 16-neighborhood are denoted as da,b,c. The most common set of values is a = 5, b = 7, and c = 11.
Knight distance
A special distance function. The distance between the pixel points p(x, y) and q(s, t) in the first quadrant is

Dk(p, q) = max(⌈u/2⌉, ⌈(u + v)/3⌉) + {(u + v) − max(⌈u/2⌉, ⌈(u + v)/3⌉)} mod 2,   if (u, v) ≠ (1, 0) and (u, v) ≠ (2, 2)
Dk(p, q) = 3,   if (u, v) = (1, 0)
Dk(p, q) = 4,   if (u, v) = (2, 2)
where ⌈•⌉ is the ceiling (upper rounding) function, u = max[|x − s|, |y − t|], and v = min[|x − s|, |y − t|].
3-D distance
The distance between two voxels in a 3-D space. For two voxels p and q, the Euclidean distance, the city-block distance, and the chessboard distance are, respectively,

DE(p, q) = [(xp − xq)² + (yp − yq)² + (zp − zq)²]^(1/2)
D1(p, q) = |xp − xq| + |yp − yq| + |zp − zq|
D∞(p, q) = max[|xp − xq|, |yp − yq|, |zp − zq|]
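The 3-4 chamfer distance of the chamfer distance entry above can be computed for every pixel of an image with a classic two-pass distance transform; the sketch below is one such implementation (it assumes that non-zero pixels are the features to which distances are measured), not an algorithm taken from the handbook.

```python
import numpy as np

def chamfer_34(feature):
    """Two-pass 3-4 chamfer distance transform to the nearest non-zero pixel."""
    big = 10 ** 9
    h, w = feature.shape
    d = np.where(feature > 0, 0, big).astype(np.int64)
    # Forward pass: neighbors above and to the left (a-move costs 3, b-move costs 4)
    for r in range(h):
        for c in range(w):
            if r > 0:
                d[r, c] = min(d[r, c], d[r - 1, c] + 3)
                if c > 0:
                    d[r, c] = min(d[r, c], d[r - 1, c - 1] + 4)
                if c < w - 1:
                    d[r, c] = min(d[r, c], d[r - 1, c + 1] + 4)
            if c > 0:
                d[r, c] = min(d[r, c], d[r, c - 1] + 3)
    # Backward pass: neighbors below and to the right
    for r in range(h - 1, -1, -1):
        for c in range(w - 1, -1, -1):
            if r < h - 1:
                d[r, c] = min(d[r, c], d[r + 1, c] + 3)
                if c > 0:
                    d[r, c] = min(d[r, c], d[r + 1, c - 1] + 4)
                if c < w - 1:
                    d[r, c] = min(d[r, c], d[r + 1, c + 1] + 4)
            if c < w - 1:
                d[r, c] = min(d[r, c], d[r, c + 1] + 3)
    return d            # divide by 3 for an approximate Euclidean distance

img = np.zeros((5, 5), dtype=int)
img[2, 2] = 1
print(chamfer_34(img))
```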
Weighted 3-D distance metric A measure of the weighting of the 3-D distance based on different adjacencies between voxels. The measure can be written as {a, b, c}, where the distance between 6-adjacent voxels (with face adjacency) is a, the distance between 18-adjacent voxels (except 6-adjacent voxels) is b (there are 12 voxels with edge adjacency), and the distance between 26-adjacent voxels that are not 18-adjacent voxels (there
Fig. 9.18 Three types of distances in 3-D space (corresponding to three types of adjacent voxels)
are 8 voxels with vertex adjacency) is c. The above three distances are shown in Fig. 9.18.
Chernoff distance
A distance measure that measures the proximity of two clusters (sets of points), which can be calculated by means of the divergence of the densities.
Minkowski distance
A distance measure between pixels. The Minkowski distance between the pixel points p(x, y) and q(s, t) can be expressed as

Dw(p, q) = [|x − s|^w + |y − t|^w]^(1/w)

The index w depends on the norm used. In image engineering, some special cases are often used: when w = 1, the city-block distance is obtained; when w = 2, the Euclidean distance is obtained; and when w = ∞, the chessboard distance is obtained.
Bhattacharyya distance
A measure of the degree of separation between probability distributions (between categories). It can be used to measure whether two probability distributions are similar. Given two arbitrary distributions pi(x), i = 1, 2, the Bhattacharyya distance between them is

d2 = −log ∫ [p1(x) p2(x)]^(1/2) dx
Take the two types of normal distributions with the same prior probability (the mean vectors are μ1 and μ2, respectively, and the mean square differences are σ 1 and σ 2, respectively). The upper bound of Bayesian minimum error is PE ¼
[p(x1) p(x2)]^(1/2) exp(−DB)
where p(∙) represents the probability density distribution function and Bhattacharyya distance DB is
DB = (μ1 − μ2)ᵀ(μ1 − μ2) / [4(σ1 + σ2)] + (1/2) log{(σ1 + σ2) / (2 |σ1|^(1/2) |σ2|^(1/2))}

This distance can also be generalized to multiple categories and to different distributions. The Chernoff distance is a special case.
Bhattacharyya coefficient
A quantity that helps to build the Bhattacharyya distance and describes the amount of similarity between two distributions.
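For discrete distributions such as normalized histograms, the Bhattacharyya coefficient and distance can be computed as below; the histograms and the small epsilon guard are assumptions of this example.

```python
import numpy as np

def bhattacharyya(p1, p2, eps=1e-12):
    """Bhattacharyya coefficient and distance between two normalized histograms."""
    p1 = np.asarray(p1, dtype=float); p1 /= p1.sum()
    p2 = np.asarray(p2, dtype=float); p2 /= p2.sum()
    bc = np.sum(np.sqrt(p1 * p2))          # coefficient, 1 for identical distributions
    return bc, -np.log(bc + eps)           # distance, 0 for identical distributions

h1 = [10, 20, 40, 20, 10]
h2 = [12, 18, 38, 22, 10]
print(bhattacharyya(h1, h2))
```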
9.4.2
Distance Metric
Distance metric See distance function. Distance measure A measure of the distance between pixels in an image. Commonly used: Euclidean distance, D4 (also known as Manhattan or city-block) distance, and D8 (also known as chessboard) distance. Algebraic distance A linear distance measure commonly found in computer vision applications. Its form is simple, based on the matrix standard of the least mean square estimation operation. If the implicit expression of a curve or a surface is f(x, a) ¼ 0 (x • a ¼ 0 represents a hyperplane), then the algebraic distance from a point xj to the curve or surface is f(xj, a). Distance measuring function Same as distance function. Normalized compression distance [NCD] A distance measure. The normalized compression distance between the two objects p and q is defined as NCDðp, qÞ ¼
[C(pq) − min(C(p), C(q))] / max(C(p), C(q))
where C( p) is the compression length of p (corresponding to the diameter of set p), C (q) is the compression length of q, and pq represents the cascade of p and q. This distance can be used as the basis for clustering. Distance measurement based on chain code A typical distance measurement method in digital images. Let there be a digital line between the two points in the image represented by an 8-directional chain code. Letting Ne be the number of even chain codes, No be the number of odd chain
Table 9.1 Various chain code-based distance measurement formulas

L    A        B        C        E       Remarks
L1   1        1        0        16%     Biased estimation, always shorter
L2   1.1107   1.1107   0        11%     Unbiased estimation
L3   1        1.414    0        6.6%    Biased estimation, always longer
L4   0.948    1.343    0        2.6%    The longer the line segment, the smaller the error
L5   0.980    1.406    −0.091   0.8%    Unbiased estimation when N = 1000
codes, Nc be the number of corner points, and the total number of chain codes be N (N ¼ Ne + No), then the length L of the entire chain code can be calculated by the following formula: L ¼ A Ne þ B No þ C Nc where A, B, and C are weighting coefficients. Some of the values of the weighting coefficients and the resulting length calculation formula are shown in Table 9.1 (E represents the expected error). Mahalanobis distance A measure of the similarity between two vectors. Letting a and b be two vectors in the image and C be a skew variance matrix, then the Mahalanobis distance M(a, b) between a and b is h i1=2 M ða, bÞ ¼ ða bÞT C1 ða bÞ It is also a distance function between two N-D points measured by statistical changes in each dimension. For example, when x and y are two points originating from the same distribution, and their covariance matrix is C, the Mahalanobis distance between them is [(x – y)TC1(x – y)]1/2. When the covariance matrix is a unit matrix, the Mahalanobis distance is the same as the Euclidean distance. The Mahalanobis distance is commonly used to compare feature vectors whose elements have different variations and ranges of variation (such as double vectors of recording area and perimeter properties). The density function of the Gaussian distribution of high-dimensional multivariable can be written as pð xÞ ¼
[1 / ((2π)^(n/2) |C|^(1/2))] exp{−(1/2)(x − m)ᵀ C⁻¹ (x − m)}
where x represents an n-D column vector, m is an n-D mean vector, C is an n × n covariance matrix, |C| is the determinant of C, C⁻¹ is the inverse matrix of C, and the superscript T represents transposition. The following formula is the Mahalanobis distance from x to m:
r = [(1/2)(x − m)ᵀ C⁻¹ (x − m)]^(1/2)

Mahalanobis distance between clusters
An extension of the Mahalanobis distance. Given two sets of points (clusters) A and B, first map the points of the two sets onto the line connecting the two means μA and μB, and calculate their respective standard deviations σA and σB; then the Mahalanobis distance between the two sets of points is

M(A, B) = |μA − μB| / (σA² + σB²)
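A sketch of the basic Mahalanobis distance [(x − m)ᵀC⁻¹(x − m)]^(1/2) of a vector from the mean of a sample set (i.e., without the 1/2 factor that appears in the r formula above), assuming NumPy; the synthetic data are illustrative.

```python
import numpy as np

def mahalanobis(x, data):
    """Mahalanobis distance of vector x from the mean of the rows of data."""
    m = data.mean(axis=0)
    c = np.cov(data, rowvar=False)          # covariance matrix of the sample
    d = x - m
    return float(np.sqrt(d @ np.linalg.inv(c) @ d))

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3)) * np.array([1.0, 5.0, 0.5])   # unequal variances
print(mahalanobis(np.array([1.0, 5.0, 0.5]), data))
```

With an identity covariance matrix the result reduces to the Euclidean distance, as noted in the Mahalanobis distance entry.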
Nearest neighbor distance [NND]
A measure of the proximity between two sets of points (clusters). Given two sets of points A and B, one point a is selected from A and one point b from B; these two points constitute a point pair. The nearest neighbor distance is the smallest distance over all point pairs:

N(A, B) = min {d(a, b) : a ∈ A, b ∈ B}
where d(•, •) can be any distance measure. Nearest neighbor distance ratio [NNDR] An image matching indicator. When matching image features, consider comparing the nearest neighbor distance to the second nearest neighbor distance, in other words, calculating the ratio of the two distances. Referring to Fig. 9.19, the DL should match the DA and the DR should match the DC. If a fixed threshold is used (indicated by a dashed circle), the DL cannot match the DA, and the DR will match both DB and DC. If the nearest neighbor distance is used to match, the DL will match the DA, but the DR will incorrectly match the DB. If the nearest neighbor distance ratio is used to match, the DL will match the DA due to the smaller dL1/dL2, and the DR will not erroneously match the DB but match the DC due to the larger dR1/dR2.
Fig. 9.19 Comparison of nearest neighbor distance to nearest neighbor distance ratio at a fixed threshold
Furthest neighbor distance [FND]
A measure of the proximity between two sets of points (clusters). Given two sets of points A and B, one point a is selected from A and one point b from B; these two points constitute a point pair. The furthest neighbor distance is the largest distance over all point pairs:

N(A, B) = max {d(a, b) : a ∈ A, b ∈ B}
where d(•, •) can be any distance measure.
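The nearest and furthest neighbor distances between two point sets can be computed by brute force over all point pairs; Euclidean d(·, ·) and the sample sets are assumptions of this sketch.

```python
import numpy as np

def neighbor_distances(A, B):
    """Return (nearest, furthest) neighbor distance between point sets A and B."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    # All pairwise Euclidean distances, shape (len(A), len(B))
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min(), d.max()

A = [(0, 0), (1, 1)]
B = [(2, 2), (5, 5)]
print(neighbor_distances(A, B))   # (sqrt(2), sqrt(50))
```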
9.4.3
Geodesic Distance
Geodesic distance
A distance measure: the length of the shortest path between two points along a certain (curved) surface. It differs from the Euclidean distance, which does not take the surface curvature into account. The typical situation is going from one point to another on the earth: because the curvature of the earth is taken into account, the length of the path actually traveled is longer than the straight-line distance between the two points (which would correspond to a path partly below the ground). Let A be a set of pixels, and let a and b be two pixels in A. The geodesic distance dA(a, b) between pixels a and b can be expressed as the lower bound of the lengths of all paths from a to b in A. If B ⊆ A is divided into k connected components Bi, then

B = B1 ∪ B2 ∪ ... ∪ Bk
For example, consider that one of the images f(x, y) contains a 4-path of L + 1 pixels, which corresponds to a sequence of pixels G ¼ {g(0), g(1), ..., g(L )}, where g (i) and g(i + 1) are 4-adjacent. The geodesic distance of this path can be calculated as follows: DGD ðGÞ ¼
Σ_{i=1}^{L} |g(i) − g(i − 1)|
That is, the geodesic distance of a path is the sum of the absolute grayscale differences of adjacent pixels on the path. There may be N paths from one pixel p to one pixel set S, and the set of N paths may be represented as T ¼ {G1, G2, ..., GN}, where each G represents a sequence of pixels. The geodesic distance between the pixel p and the pixel set S is the minimum of the geodesic distances of all the paths and can be expressed as
DGD(p, S) = min_{Gi ∈ T} [DGD(Gi)]
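The grayscale geodesic distance of a single 4-path, defined above as the sum of absolute gray-level differences along the path, can be coded directly; the image and path are illustrative.

```python
import numpy as np

def path_geodesic_distance(image, path):
    """Sum of absolute gray-level differences along a pixel path (list of (r, c))."""
    values = [float(image[r, c]) for r, c in path]
    return sum(abs(values[i] - values[i - 1]) for i in range(1, len(values)))

img = np.array([[10, 12, 50],
                [11, 30, 52],
                [11, 31, 90]])
path = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]   # a 4-path
print(path_geodesic_distance(img, path))          # |12-10|+|30-12|+|31-30|+|90-31| = 80
```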
Geodesic On a mathematically defined surface, the shortest line between two points. Geodesic transform An operation that assigns a geodetic distance relative to a certain set of features to each point in the image. Earth mover’s distance [EMD] A measure for comparing the similarity of two distributions by calculating the cost of converting one distribution to another. The essence is to solve the minimum cost in the weighted point set conversion process and belongs to the constraint optimization problem. It can be described as follows: assume that there are M mounds and N soil pits in the space. The mass of m-th (m ¼ 1, 2, ..., M) mounds is xm, and the capacity of n-th (n ¼ 1, 2, ..., N ) soil pit is yn, and the work required to transport the unit mass of soil from the mound m to the pit n is Cmn, which is equal to the distance between the mound m and the pit n. The mass of the soil transported from the mound m to the pit n is Qmn. The total work required to transport M mounds to N soil pits is W, and then the constraint optimization problem is to ask for the minimum value of the work, which can be expressed as min W ¼ min
N X M X
Cmn Qmn
n¼1 m1
The constraint here is M X
Qmn ¼ yn ,
m¼1
N X
Qmn xm ,
n¼1
M X
xm
N X
m¼1
yn
n¼1
At this time min EMD ¼
N P M P
C mn Qmn
n¼1 m1 N P M P
Qmn
n¼1 m1
min ¼
N P M P
C mn Qmn
n¼1 m1 N P
yn
n¼1
In practice, the minimum cost of converting one distribution to another (as in color histogram matching) needs to be calculated (see Fig. 9.20). The left figure is the first distribution, the middle figure is the second distribution, and the right figure indicates the conversion of the first distribution to the second distribution. It can be seen from the above that the moving distance is visual and intuitive, and the data and the length of the feature can be processed. It is a distribution-based
380
9 Pixel Spatial Relationship
Fig. 9.20 Example of calculation of moving distance
distance measurement method with good anti-noise performance and insensitivity to small offset between probability distributions. It can be used as a criterion for measuring the similarity of probability data. It has been widely used in contentbased image retrieval application. Weighted walkthrough A discrete measure based on the relative position of two regions. It is a histogram that traverses the relative positions of each pair of points selected in these two regions. Basic move A generic type of movement from one pixel to its neighbors in a given neighborhood space. In the N4 space, there are only the horizontal movement and the vertical movement (both a-move), the diagonal movement (b-move) is added in the N8 space, and the c-move is added in the N16 space. a-move The movement between two 4-adjacent pixels in the horizontal or vertical direction along the chamfer distance, also refers to the length of the shortest digital arc corresponding to the movement. b-move The movement between two 8-adjacent diagonal pixels along the chamfer dis tance, also refers to the length of the shortest digital arc corresponding to the movement. c-move The movement between two 16-adjacent pixels in the chamfer distance, also refers to the length of the shortest digital arc corresponding to the movement. The c-move can be decomposed into movements in two mutually perpendicular directions, wherein the amount of movement in one direction is twice the amount of movement in the other direction. Random walk In a 2-D image, a random unit movement in four given directions. For a random walker, at a given pixel, the probability of movement to a 4-connected neighborhood
9.4 Distance
381
pixel can be defined as a function of that pixel. For example, in image segmentation based on random walks, the image is treated as a graph with a fixed number of vertices and edges. Each edge is assigned a weight corresponding to the likelihood of the random walker crossing the edge. The user needs to select a certain number of seeds depending on the number of regions to be divided. Assign a random walker to each pixel that is not selected as a seed. The probability of random walkers to seed points can be used for pixel clustering and image segmentation. Random walker An image segmentation method. The user first interactively assigns a number of pixels (called seeds) a marker (such as a target or background). Suppose that a random walker is sent from the remaining unmarked pixels, and the probability that a random walker of each pixel reaches a seed is calculated. That is, if a user marks K seeds (corresponding to K markers), then each pixel needs to calculate the probability that a random walker will leave the pixel to reach each seed. This calculation can be analytically determined by solving a set of linear equations. After such a calculation is performed for each pixel, the pixel is assigned a flag that is most likely to send a random walker. An image is modeled as a graph in which each pixel corresponds to a node that is connected to adjacent pixels by edges, and the weighting of each edge reflects the similarity between the pixels. Therefore, random walks occur in the weighted graph. Improvements to the random walker algorithm have been fully automated. Discrete disk A collection of points in an image determined by distance. The center pixel is p, and the disk with radius d (d 0) ΔD( p, d ) ¼ {q |, dD( p, q) d}, where q is the neighborhood pixel of p, dD is a given discrete distance function (which can be Euclidean distance, city-block distance, chessboard distance, knight distance, chamfer distance, etc.). Regardless of the position of the center pixel p, the disk of radius d can also be abbreviated as ΔD(d ). Equal-distance disk A circular area composed of pixels whose distance from one pixel is less than or equal to a specific value (radius). In the discrete domain, different distance metrics produce equidistant discrete disks of different shapes. Let Δi(R), i ¼ 4, 8 denote an equidistant disk with a di distance from the center pixel that is less than or equal to R, and use #[Δi(R)] to denote the number of pixels included in Δi(R) except the center pixel. Then the number of pixels increases proportionally with distance. For cityblock distance d4, the equidistant disk has # ½Δ4 ðRÞ ¼ 4
R X
j ¼ 4ð1 þ 2 þ 3 þ . . . þ RÞ ¼ 2RðR þ 1Þ
j¼1
Similarly, for chessboard distance d8, the equidistant disk has
382
9 Pixel Spatial Relationship
# ½Δ8 ðRÞ ¼ 8
R X
j ¼ 8ð1 þ 2 þ 3 þ . . . þ RÞ ¼ 4RðR þ 1Þ
j¼1
9.4.4
Distance Transform
Distance transform [DT] A special transformation that converts a binary image into a grayscale image. Given a set of pixel points P, a subset B (the boundary of P), and a distance function D(•, •), the value assigned to point p (2 P) in the distance transformation of P is given by DTðpÞ ¼ min fDðp, qÞg q2B
That is, the result of the distance transformation to a point is the distance from the point to the nearest boundary point of the region in which it is located. Also known as chamfering. Figure 9.21 shows a binary image and the result of its distance transformation. Distance map The result of distance transform on the image. It has the same size as the original image, and the value at each point p (Î P) is corresponding to the value of DT(p), which is also called the distance map of point set P. Distance image Same as distance map. Euclidean distance transform [DT] In the case of distance transformation, the distance function is based on Euclidean distance. In practice, it is often difficult to calculate the Euclidean distance transform
Fig. 9.21 Example of distance transformation
1
0
0
0
0
0
1
1
2
1
0
1
1
1
1
0
1
2
1
0
1
2
2
1
0
1
2
1
0
1
2
2
1
0
1
2
1
0
1
2
2
1
0
1
2
1
0
1
1
1
1
0
1
2
1
0
0
0
0
0
1
1
2
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
3
9.4 Distance
383
effectively (using local distance calculation). It is often not enough to maintain the minimum scalar distance from the target contour in two traversals. It is necessary to use the vector distance. It is also often necessary to carry out a larger range search (i.e., use a larger template). Signed distance transform An extension of the distance transform. The distance to the image boundary is calculated for all pixels. The easiest way to do this is to first calculate the distance transform for both the original binary image and its complement image and then ignore one of them before combining. Because the distance fields thus calculated are relatively smooth, they can be stored more compactly with splines defined on quadtree or octree data structures (with minimal loss in relative accuracy). Weighted distance transform [WDT] A generalization of distance transform. Taking the weighting of grayscale as an example, the result of the operation is to convert one grayscale image into another grayscale image. The weighted distance transform assigns the value of one pixel in the transformation result graph to be not the minimum distance value of the pixel to the background in the distance transformation, but the smallest value of the sum of the pixel gray values in each path of the pixel to the background. Chamfering See distance transform. Forward pass In the serial implementation algorithm of distance transform, the first scan is performed from the upper left corner to the lower right corner of the image. Compare backward pass. Backward pass In the distance transform, the second scan is performed from the lower right corner to the upper left corner of the image when the algorithm is implemented serially. Compare forward pass. Local distance computation In the distance transform, in order to reduce the amount of calculation, only the calculation method of the local neighborhood information is used. It calculates the distance image by using a local distance template. Figure 9.22 shows some templates for local distance calculations. The central pixel p is represented by a shaded area and represents the center of the template, the size of which is determined by the kind of neighborhood considered. The pixel in the neighborhood of p is taken to the distance value of p. The center pixel has a value of 0, and infinity represents a large number. In Fig. 9.22a, the template is based on a 4-neighborhood definition and is used to calculate the 4-connected distance. Similarly, the template in Fig. 9.22b is based on an 8-neighborhood definition and is used to calculate the 8-connected distance or the chamfer distance of a ¼ 1 and b ¼ 1. The template in Fig. 9.22c is based on a
384
9 Pixel Spatial Relationship
c a a
0
a
a
b
a
b
a
0
a
b
a
b
c
c
c
b
a
b
a
0
a
b
a
b
c (a)
(b)
c
c
c (c)
Fig. 9.22 Template for calculating the distance transform
b
a
b
a
0
a
b
a
b
Fig. 9.23 Local distance map
16-neighborhood definition and is used to calculate the chamfer distance and the knight distance. Distance measurement based on local mask The measurement of the length of a line in an image by means of a local distance template. The length of a long straight line can be calculated step-by-step (partially) by means of a template. Let a be the local distance in the horizontal or vertical direction and b be the local distance in the diagonal direction. These values form a partial distance map, as shown in Fig. 9.23: The distance is D ¼ ama + bmb between two points (x1, y1) and (x2, y2), where ma represents the number of steps in the horizontal or vertical direction, mb represents the number of steps in the diagonal direction. To closely measure the path length to the “true value” obtained with Euclidean distance, a ¼ 0.95509 and b ¼ 1.36930 should be used. It has been shown that the maximum difference between the measured value thus obtained and the value obtained by Euclidean distance is less than 0.04491 M, where M ¼ max{|x1 - x2|, |y1 - y2|}.
Chapter 10
Image Transforms
Image transformation can transform images in image space to some other spaces in different forms. Different transforms change images into different spaces (Gonzalez and Wintz 1987; Pratt 2007; Sonka et al. 2014).
10.1
Transformation and Characteristics
Different image transformations have different characteristics. In image processing, the transformation selection should be made according to different characteristics of the transformation (Bracewell 1986; Gonzalez and Woods 2008; Russ and Neal 2016; Zhang 2017a).
10.1.1 Transform and Transformation Image transform The process of converting an image from one expression space to another in some form. It is a means of efficiently and quickly processing images. Specifically, after the image was converted into a new space, the image will be processed by the unique property of the new space, and the processing result is then converted back to the original space to obtain the desired effect. One operation on the image that can produce another image. The new image can be changed in geometry as compared to the original image, or it can contain newly derived information. A common purpose of using image transform is to express the image more efficiently to compress the image, as well as to enhance or visualize certain information of interest.
© Springer Nature Singapore Pte Ltd. 2021 Y.-J. Zhang, Handbook of Image Engineering, https://doi.org/10.1007/978-981-15-5873-3_10
385
386
10 Image Transforms
Transform See image transform. Forward image transform Same as image transform. Image domain In image engineering, the space domain. Transform domain In image engineering, the non-original image space obtained by transformation. The transformations used are different, and the resulting transform domains are also different. For example, the Fourier transform is used to obtain the commonly used frequency domain, and the wavelet transform is used to obtain the wavelet domain. Transformation function A mathematical function that implements transformations. It can be linear, piecewise linear or nonlinear. Transformation kernel In image conversion, a transform function that is independent of the transformed image and is determined only by the transformation itself. Transform kernel Same as transformtion kernel. Transformation matrix In linear space, the matrix M used to transform a vector x. The transformation can be achieved by means of multiplication, such as Mx. The transformation matrix can be used for geometric transformation of position vectors, color transformation of color vectors, projection from 3-D space to 2-D space, and so on. A matrix used to implement linear transformation in image transform. The transformation kernel of the image transform at this time can be a separable and symmetric function (see separable transform and symmetric transform). If F is the N N image matrix and T is the output N N transform result, then T ¼ AFA where A is an N N symmetric transformation matrix. In order to obtain the inverse transform, the two sides of the above equation are respectively multiplied by an inverse transform matrix B to obtain BTB ¼ BAFAB If B ¼ A1, then there is F ¼ BTB That is, the image matrix before the transformation is obtained.
10.1
Transformation and Characteristics
387
Transformation Map data from one space (such as original image space) to another. All image processing operations are in general some kinds of transformations. Symmetric transform A type of image transform. If the transformation kernel is separable and the following conditions are met: hðx, y, u, vÞ ¼ h1 ðx, uÞh1 ðy, vÞ That is, in the form of the two functions after separation, the transformation kernels are mutually symmetrical, and the transform having such a transformation kernel is a symmetric transform. Global transform A generic term describing an operator that transforms an entire image (in the image space) into some other space. Common global transforms include discrete cosine transform, Fourier transform, Haar transform, Hadamard transform, Hartley transform, histogram, Hough transform, Caro transform, Radon transform, wavelet transform, etc. Coffee transform You can drink a cup of coffee and wait to see the results of transformation, describing a computationally complex transform. For example, it is a timeconsuming transformation to consider the possible Hough transform of rotation. Transform pair A transform pair consisting of a forward transform and an inverse transform. Forward transform In image engineering, the transformation from image space to other spaces. Generally, you don’t need to write “forward” specifically. Compare the inverse transform. Forward transformation kernel A transformation kernel corresponding to the forward transform. Inverse transform In image engineering, the transformation from other spaces back to the image space. Compare the forward transform. Inverse transformation kernel A transformation kernel corresponding to the inverse transform.
10.1.2 Transform Properties Basis A collection of expanded functions in a sequence expansion.
388
10 Image Transforms
Basis function The expansion function used to expand a function with a sequence. Consider linear regression first, a function can be expressed as a linear combination of input variables yðx, wÞ ¼ w0 þ w1 x1 þ þ wN xN where x ¼ (x1, x2, ..., xN)T. An important property of this model is that it is a linear function of the parameters w1, w2, ..., wN. It is also a linear function of the input variable xi, but this has significant limitations on the model. To solve this problem, a fixed nonlinear function of the input variable can be considered by linear combination, i.e., (the model has a total of M parameters): yðx, wÞ ¼ w0 þ
M 1 X
w j ϕ j ð xÞ
j¼1
where ϕj(x) is the basis function. Basis function representation A method of expressing a function as the sum of some simple (often orthogonal) functions called a basis function. Expressing a function by means of a Fourier transform is a typical example in which a function is represented as a weighted sum of many sine and cosine functions. Separable transform Image transform with separability. The transformation kernel satisfies the following conditions (i.e., the transformation kernel is separable in both X and Y directions, and h1 and h2 are the results obtained by decomposing h): hðx, y, u, vÞ ¼ h1 ðx, uÞh2 ðy, vÞ Separability 1. An important commonality in many image transforms. In image engineering, if a 2-D transformation kernel of image transform can be decomposed into the product of two 1-D factors, and the two factors respectively correspond to 1-D transformation kernels along different coordinate axes, then the 2-D transformation kernel of the image transform is called separable. A 2-D transform with a separable transformation kernel is generally referred to as a separable transform, and its calculation can be performed in two steps, one by one, for each 1-D transform. For example, the Fourier transform is a transform that is separable. 2. In the classification problem, it refers to whether the data can be separated into different sub-categories by some automatic decision-making methods. If the property values of two classes overlap each other, they are not separable. In
10.1
Transformation and Characteristics
Fig. 10.1 Separable and inseparable classes
389
Feature 2
Feature 1
Fig. 10.1, the class labeled “ ” and the class labeled “□” are inseparable, and the class labeled “” is separable from them. Unitary transform A reversible transform (such as a discrete Fourier transform). The separable linear transformation of the matrix f of an image can be written as g ¼ hTc f hr where g is the output image and hc and hr are the transformation matrices. The matrices hc and hr are chosen differently as needed. For example, they may be selected such that the transformed image uses less bits to represent the original image, or they may be selected such that the truncation of the original image expansion can remove some high-frequency components to smooth the image, or they may be selected such that the original image is optimally approximated according to certain predetermined criteria. The matrices hc and hr are often chosen to be a unitary matrix to make the transform reversible. If the matrix hc and hr are selected as the unitary matrix, the above equation represents a unitary transform to f. Unitary transform domain The domain of the output image of a unitary transform g ¼ hcTfhr. Unitary matrix The matrix U whose inverse is the transpose of the complex conjugate of U. Satisfy UU T ¼ I where U* is also called the adjoint matrix and I is the identity matrix. Sometimes the superscript “H” is used instead of “T*”, and UT* UH is called the Hermitian transpose or conjugate transpose of the matrix U. If the elements of such a matrix are real numbers, these matrices are called orthogonal matrices.
390
10 Image Transforms
Orthogonal transform A special image transform that satisfies the following three conditions simultaneously: 1. The transformation matrix is equal to the inverse of its inverse transformation matrix. 2. The conjugate of the transformation matrix is equal to the inverse of its inverse transformation matrix. 3. The transpose of the transformation matrix is equal to the inverse of its inverse transformation matrix. The forward transform and the inverse transform satisfying the above conditions constitute an orthogonal transform pair. The Fourier transform is a typical orthogonal transform. Orthogonal image transform A class of well-known image compression techniques. The key step is to project the image data onto a set of orthogonal basis functions. Orthogonal image transform is a special case of linear integral transform. See discrete cosine transform, Fourier transform, and Haar transform. Orthographic The unique properties of orthogonal (or vertical) projections (to the image plane). See orthographic projection. Complete set For a set of functions, you cannot find any other functions that are not part of the set that are orthogonal to the set.
10.2
Walsh-Hadamard Transform
Both Walsh and Hadamard transforms are separable and orthogonal transforms, and both are integer transforms (Gonzalez and Wintz 1987; Pratt 2007; Russ and Neal 2016).
10.2.1 Walsh Transform Walsh transform The expression of the vector v of 2n elements formed by the Walsh function of order n corresponds to the multiplication result of the Hadamard matrix. A Walsh transform is a separable transform and an orthogonal transform in which the trans form kernel has a value of +1 or 1 (after normalization). In 2-D, it has the same
10.2
Walsh-Hadamard Transform
391
form as the inverse Walsh transform, and both are symmetric transforms, which can form a transform pair, which is expressed as W ðu, vÞ ¼
f ðx, yÞ ¼
N 1 N 1 n1 Y 1 X X f ðx, yÞ ð1Þ½bi ðxÞ bn1i ðuÞþbi ðyÞ bn1i ðvÞ N x¼0 y¼0 i¼0
N 1 N 1 n1 Y 1 X X W ðu, vÞ ð1Þ½bi ðxÞ bn1i ðuÞþbi ðyÞ bn1i ðvÞ N u¼0 v¼0 i¼0
When N ¼ 2n, the 2-D Walsh positive transform kernel h(x, y, u, v) is identical to the inverse Walsh transform kernel k(x, y, u, v) and is symmetric hðx, y, u, vÞ ¼ k ðx, y, u, vÞ ¼
n1 1 Y ð1Þ½bi ðxÞ bn1i ðuÞþbi ðyÞ bn1i ðvÞ N i¼0
where bk(z) is the k-th bit in the binary representation of z. Walsh transform has been applied in the research of image coding, logic design, and genetic algorithm. Forward Walsh transform Same as Walsh transform. Forward Walsh transform kernel Same as Walsh transform kernel. Inverse Walsh transform The inverse transform of the Walsh transform. Inverse Walsh transform kernel The transformation kernel used by the inverse Walsh transform. Walsh transform kernel The transformation kernel used in the Walsh transform. Walsh functions An orthogonally normalized discrete value function. The set of values is {+1, 1}, which is a complete set. It can be defined in many different ways, and these definitions are all proven to be equivalent. A function consisting of square waves. The Walsh function of order n is a specific set of square waves W(n, k): [0, 2n) ! {1, 1}, where k 2 [1, 2n]. Walsh functions are orthogonal and their product is still a Walsh function. The square wave transition only occurs at integer grid points, so each function can be specified by its vector of values in the point set {½, 1½, . . ., 2n–½}. Combining these values for a given sequence n is a Hadamard transform matrix H2n of order 2n. The two functions with order 1 are the rows of the following matrix:
392
10 Image Transforms
1 0 –1
1 0 –1 0
2
4
1 0 –1 0
2
4
1 0 –1 0
2
4
0
2
4
Fig. 10.2 Walsh functions with order 2
1 1 H2 ¼ 1 1
The four functions with order 2 are the rows of the following matrix (as depictured in Fig. 10.2): 2
H4 ¼
H2 H2
1 6 H2 61 ¼6 41 H 2 1
1 1
1 1
1 1
1 1
3 1 1 7 7 7 1 5 1
In general, a function of order n + 1 can be generated by the following relationship:
H2nþ1
H 2n ¼ H 2n
H 2n H 2n
This loop is the basis of the fast Walsh transform. Dyadic order of Walsh functions An order of Walsh functions defined using the Rademacher function. Rademacher function The n-th (n 6¼ 0) order function defined by the following expression: Rn ðt Þ sign½ sin ð2n πt Þ, 0 t 1 For n ¼ 0, there is Rn ðt Þ 1,
0t1
If Walsh function is defined by using the Rademacher function, the Walsh function of binary order, Perry order and natural order can be obtained, respectively. Binary order of Walsh functions Same as dyadic order of Walsh functions.
10.2
Walsh-Hadamard Transform
393
Paley order of Walsh functions An order of Walsh functions defined using the Rademacher function. Sequency order of Walsh functions An order of Walsh functions defined by recursive equations. Normal order of Walsh functions An order of Walsh functions defined using the Rademacher function. Natural order of Walsh functions An order of Walsh functions defined using the Rademacher function. Natural order See natural order of Walsh functions. Walsh order An order of Walsh functions defined using recursive equations. Walsh-Kaczmarz order An order of Walsh functions defined using recursive equations.
10.2.2 Hadamard Transform Hadamard transform A separable transform and orthogonal transform with a transform kernel value of +1 or 1 (normalized). Also known as forward Hadamard transform. It converts an image into a transform that makes up its Hadamard components. It has a fast algorithm similar to a fast Fourier transform; the difference is that the value of the basis function is only either positive 1 or negative 1. The amount of calculation is greatly reduced, and it is also often used for image compression. At 2-D, the Hadamard transform is a symmetric transform. The Hadamard transform H(u, v) of an image f(x, y) of size N N can be expressed as n1 P N 1 N 1 1 X X H ðu, vÞ ¼ f ðx, yÞð1Þ i¼0 N x¼0 y¼0
Forward Hadamard transform Same as Hadamard transform. Forward Hadamard transform kernel Same as Hadamard transform kernel.
½bi ðxÞ bi ðuÞþbi ðyÞ bi ðvÞ
394
10 Image Transforms
Inverse Hadamard transform The inverse transformation of the Hadamard transform. In 2-D, the inverse of the Hadamard transform is a symmetric transform: n1 P N 1 N 1 1 X X f ðx, yÞ ¼ H ðu, vÞð1Þ i¼0 N u¼0 v¼0
½bi ðxÞ bi ðuÞþbi ðyÞ bi ðvÞ
Among them, the summation on the exponent is modulo 2 (value is 0 or 1), and bi(z) is the i-th bit in the binary representation of i. Inverse Hadamard transform kernel A transformation kernel that implements the inverse of Hadamard transform. It is symmetrical at 2-D and can be expressed as n1 P ½bi ðxÞbi ðuÞþbi ðyÞbi ðvÞ 1 k ðx, y, u, vÞ ¼ ð1Þ i¼0 N
Among them, the summation on the exponent is modulo 2 (value is 0 or 1), and bi(z) is the i-th bit in the binary representation of i. This is exactly the same as the forward Hadamard transform kernel. Hadamard transform kernel A transformation kernel that implements the Hadamard transform. The 2-D Hadamard transform kernel is symmetrical and can be expressed as n1 P ½bi ðxÞbi ðuÞþbi ðyÞbi ðvÞ 1 hðx, y, u, vÞ ¼ ð1Þ i¼0 N
wherein the summation on the exponent is modulo 2 (i.e., the summation result is euther 0 or 1) and bi(z) is the i-th bit in the binary representation of z. It is exactly the same as the inverse Hadamard transform kernel. Lexicographic order Same as Kronecker order. Kronecker order An order of Walsh functions generated by means of a matrix of Hadamard transforms. Hadamard-Haar transform The combination of Hadamard transform and Haar transform is still an orthog onal transform.
10.3
10.3
Fourier Transform
395
Fourier Transform
The Fourier transform is a separable and orthogonal transform (Bracewell 1986, Pratt 2007). The Fourier transform of an image transforms the image from image space to frequency space, so that the Fourier transform can be used for image processing (Gonzalez and Woods 2018).
10.3.1 Variety of Fourier Transform Fourier transform [FT] A widely applicated separable transform and orthogonal transform. It transforms the image from the image space domain to the frequency domain. Letting x and y be the positional variables in the image, u and v are the transformed frequency variables, then the Fourier transform and inverse transform of the 2-D image can be expressed as F ðu, vÞ ¼
f ðx, yÞ ¼
N 1 N 1 1 X X f ðx, yÞ exp ½j2πðux þ vyÞ=N , N x¼0 y¼0 N1 N 1 1 X X F ðu, vÞ exp ½j2πðux þ vyÞ=N , N u¼0 v¼0
u, v ¼ 0, 1, , N 1
x, y ¼ 0, 1, , N 1
The last two equations forms a Fourier transform pair. Fourier transform pair See Fourier transform. Fourier transform representation A transform-based representation method that performs a Fourier transform on a sequence of boundary points and then expresses the boundary using Fourier transform coefficients. Think of the image XY plane as a complex plane that transforms the curve segments in the XY plane into sequences on the complex plane. Thus each point (x, y) on a given boundary can be represented in the form of a complex number u + jv to obtain a Fourier transform representation; see Fig. 10.3. Fig. 10.3 Fourier transform representation of boundary point sequence
Y (uk+jvk)
k=0
O
k = N–1 X
396
10 Image Transforms
For a closed boundary consisting of N points, a complex sequence is obtained around the boundary by starting from any point on the boundary. sðk Þ ¼ uðkÞ þ jvðk Þ,
k ¼ 0, 1, , N 1
The discrete Fourier transform of s(k) is SðwÞ ¼
N 1 1 X sðk Þ exp ½j2πwk=N , N k¼0
w ¼ 0, 1, , N 1
S(w) is the sequence of Fourier transform coefficients of the boundary. A Fourier boundary descriptor can be constructed with some or all of S(w). Hankel transform Simplification of the Fourier transform of a radial symmetric function. Fourier-Bessel transform See Hankel transform. Inverse Fourier transform [IFT] The image is inverse transformed from the frequency domain back to the image domain. Form a transform pair with the Fourier transform. That is, the transformation of the signal is regained from the Fourier coefficient of a signal. 2-D Fourier transform Fourier transform of a 2-D image. Short-time Fourier transform [STFT] A special Fourier transform. Short-time here represents a finite time, and the definition of time is realized by adding a window function to the Fourier transform, so it is also called a windowed Fourier transform. The short-time Fourier transform of the function f(t) relative to the position (b, v) of the window function r(t) on the time-space plane is Z Gr ½ f ðb, vÞ ¼
1
1
f ðt Þr b,v ðt Þd t
among them v ¼ 2πf r b,v ðt Þ ¼ r ðt bÞejvt The window function r(t) can be a complex function and satisfies Z R ð 0Þ ¼
1
1
r ðt Þd t 6¼ 0
10.3
Fourier Transform
397
In other words, the Fourier transform R(w) of r(t) is like a low-pass filter, i.e., the spectrum is not zero at w ¼ 0. The general Fourier transform requires that f(t) over the entire time axis be used to calculate the spectral components at a single frequency, but the short-time Fourier transform only needs to know the interval where r(t – b) is not zero to calculate the spectral components on a single frequency. In other words, Gr[f(b, v)] gives the approximate spectrum of f(t) near t ¼ b. Wigner distribution [WD] One distribution that gives a joint expressions of spatial and spatial frequency. It is sometimes described as a distribution of local spatial frequency expressions. Consider the case of 1-D, let f(x) represent a continuous integrable complex function, and the Wigner distribution can be expressed as x0 x0 iwx0 0 WDðx, wÞ ¼ f x e f xþ dx 2 2 1 Z
1
where w is the spatial frequency and f*(•) is the complex conjugate of f(•). The Wigner distribution directly encodes the phase information and, unlike the shorttime Fourier transform, is a real-valued function. Cracks can be detected in random textures based on Wigner distribution. Windowed Fourier transform See short-time Fourier transform. Fast Fourier transform [FFT] A fast calculation obtained by decomposing the calculation process of the Fourier transform. It can be seen as a Fourier transform version for discrete samples, which is much more efficient than standard discrete Fourier transforms. It reduces the number of complex multiplications required for the 1-D Fourier transform of N points from N2 to (Nlog2N )/2, while the number of complex additions is reduced from N2 to (Nlog2N). That is, the overall computational complexity is reduced from O(N2) to O(Nlog2N ). Successive doubling A practical fast Fourier transform algorithm. Write the 1-D discrete Fourier transform as F ð uÞ ¼
N 1 1 X f ðxÞwux N N x¼0
where wN exp.(j2π/N ). Now suppose N ¼ 2n. You can write N as 2 M and substitute it into the above formula.
398
10 Image Transforms
F ð uÞ ¼
2M1 1 X f ðxÞwux 2M 2M x¼0
Separate the odd and even terms in the f variable, which can be written as x
2y 2y þ 1
x is even x is odd
So there is ( ) M 1 M 1 1 1 X 1 X uð2yÞ uð2yþ1Þ F ð uÞ ¼ f ð2yÞw2M þ f ð2y þ 1Þw2M 2 M y¼0 M y¼0 uy 2uyþu u ¼ wuy It can be proved: w2uy M w2M , and so, 2M ¼ wM and w2M
( ) M 1 M 1 1 1 X 1 X uy uy u F ð uÞ ¼ f ð2yÞwM þ f ð2y þ 1ÞwM w2M 2 M y¼0 M y¼0
ð10:1Þ
This can be written as F ð uÞ
1 F even ðuÞ þ F odd ðuÞwu2M 2
ð10:2Þ
where Feven(u) is defined as the DFT for the even-numbered function f and Fodd(u) is the DFT for the odd-numbered function f. F even ðuÞ
M 1 1 X f ð2yÞwuy M M y¼0
M 1 1 X F odd ðuÞ f ð2y þ 1Þwuy M M y¼0
ð10:3Þ
Equation (10.2) defines F(u) only for u < M as a function of M sample lengths, since Eq. (10.3) only holds for 0 u < M. It is necessary to define F(u) for u ¼ 0, 1, ..., N-1, that is, u until 2 M-1. To do this, use u + M as the variable of F in Eq. (10.1) ( ) M 1 M 1 1 1 X 1 X uyþMy uyþMy uþM F ðu þ M Þ ¼ f ð2yÞwM þ f ð2y þ 1ÞwM w2M 2 M y¼0 M y¼0 u It can be proved: wuþM ¼ wuM and wuþM M 2M ¼ w2M , and so,
10.3
Fourier Transform
399
F ðu þ M Þ
1 F even ðuÞ‐F odd ðuÞwu2M 2
ð10:4Þ
Note that Eqs. (10.2) and (10.4) with Eq. (10.3) can completely define F(u). Thus, an N-point transform can be calculated using two N/2-point transforms, as in Eq. (10.3). The complete transformation can then be calculated using Eqs. (10.2) and (10.4). Discrete Fourier transform [DFT] The discrete form of the Fourier transform. It can also be seen as a Fourier transform version for sampling data. Continuous Fourier transform See Fourier transform. Spatio-temporal Fourier spectrum Fourier transform results for spatio-temporal functions (images or videos that vary with time and space).
10.3.2 Frequency Domain Frequency domain The transform domain obtained by Fourier transform. That is, the frequency space. Frequency response function Fourier transform of the 2-D filter. Fourier domain inspection The features in the Fourier transform of an image are used to identify defects therein. Fourier domain convolution The convolution of the image in the Fourier transform domain actually multiplies the Fourier transformed original image by the Fourier transformed filter. When the filter size is large, calculations in the Fourier domain are much more efficient than calculations in the original image space domain. Frequency spectrum The range or distribution of electromagnetic frequencies. Same as Fourier spectrum. Fourier spectrum A set of individual frequency components obtained by Fourier transform (in recent years, the spectrum is also used for representing other transform results). The actual Fourier spectrum also often represents the amplitude function of the Fourier transform. If the Fourier transform of the image f(x, y) is F(u, v), let R(u, v) and I(u, v) be the real and imaginary parts of F(u, v), respectively, then the spectrum of F(u, v) is
400
10 Image Transforms
h j F ðu, vÞ j ¼ R2 ðu, vÞ þ I 2 ðu, vÞ 1=2 Bessel-Fourier spectrum A set of frequency components obtained by performing a Bessel-Fourier transform on an image. Bessel-Fourier coefficient A texture descriptor derived from the Bessel-Fourier spectrum. Bessel-Fourier boundary function A commonly used Bessel-Fourier spectral function. It has the form GðR, θÞ ¼
R ðAm,n cos mθ þ Bm,n sin mθÞJ m Z m,n Rv n¼0
1 X 1 X m¼0
where G(R, θ) is the polar coordinate representation of the gradation function (R is the radius vector and θ is the polar angle); Am, n, Bm, n are the Bessel-Fourier coefficients; Jm is the first m-th order Bessel function; Zm, n is the zero root of the Bessel function; and Rv is the field of view radius. Fourier power spectrum The square of the spectrum obtained by the Fourier transform. If the Fourier transform of the image f(x, y) is F(u, v), let R(u, v) and I(u, v) be the real and imaginary parts of F(u, v), respectively, then the power spectrum of F(u, v) is Pðu, vÞ ¼ j F ðu, vÞ j2 ¼ R2 ðu, vÞ þ I 2 ðu, vÞ Power spectrum Describe the physical quantity of energy distribution in the frequency domain. Considering an image f(x, y) of W H, it can generally be obtained by computing a complex modulus of the following Fourier transform discrete form: F ðu, vÞ ¼
h
i ux vy f ðx, yÞ exp 2πi þ w h y¼0
W 1 H 1 X X x¼0
That is, P(u, v) ¼ |F(u, v)|2. In image technology, it generally refers to the energy at each spatial frequency. It can also be used to refer to energy at each optical frequency. Also known as “power spectral density function” or “spectral density function.” The radial energy distribution in the power spectrum reflects the roughness of the texture, and the angular distribution is related to the directionality of the texture. For example, in Fig. 10.4, the horizontal orientation of the texture features is reflected in the vertical energy distribution of the spectral image. In this way, these energy distributions can be used
10.3
Fourier Transform
401
Fig. 10.4 The original image on the left and the Fourier spectrum on the right
to characterize the texture. Common techniques include the use of ring filters, wedge filters, and the use of peak detection algorithms. See partition Fourier space into bins. Cepstrum The inverse of the logarithmic power spectrum; it forms a Fourier transform pair with the logarithmic power spectrum. Also known as time spectrum because of its time factor. Parseval’s Theorem The energy of the signal (the sum of the squares of the amplitude values of the points) is equal to the energy of the coefficients obtained by Fourier transforming it. Phase spectrum A term in the Fourier transform that corresponds to the power spectrum. The Fourier transform of an image can be decomposed into a phase spectrum and a power spectrum. The phase spectrum is the relative phase offset of a given spatial frequency. Phase congruency The Fourier transform component of an image has the characteristics of maximum phase at feature points (such as step edges or straight lines). It is also the case when all harmonics of the Fourier sequence of a periodic signal have the same phase. Phase consistency does not vary with the brightness and contrast of the image and can therefore be used as an absolute measure of feature point saliency. Consider a 1-D continuous signal f(t), defined in the interval –T t T, expressed as a Fourier sequence
402
10 Image Transforms
a0 X πnt X πnt þ an cos bn sin þ T T 2 n¼1 n¼1 2 3 q ffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi þ1 X a an πnt bn πnt 7 6 ffi cos ffi sin ¼ 0þ a2n þ b2n 4qffiffiffiffiffiffiffiffiffiffiffiffiffiffi þ qffiffiffiffiffiffiffiffiffiffiffiffiffiffi 5 T T 2 n¼1 a2n þ b2n a2n þ b2n þ1
þ1
f ðt Þ ¼
¼
þ1
a0 X πnt An cos ϕn þ T 2 n¼1
The coefficient of the sequence expansion can be expressed as 1 an ¼ T bn ¼
1 T
Z
T
T
Z
T T
f ðt Þ cos
πnt dt T
f ðt Þ sin
πnt dt T
The expanded amplitude An can be expressed as An ¼
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi a2n þ b2n
And the phase ϕn can be expressed as cos ϕn ¼
an qffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi and a2n þ b2n
bn ffi sin ϕn ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffi a2n þ b2n
Here the relationship cosαcosβ + sinαsinβ ¼ cos(α – β) is used. When the (πnt)/T – ϕn has the same value for all indicators n (modulus is 2π), there is phase consistency at point t. Magnitude-retrieval problem Reconstruction of the signal based only on the phase (rather than amplitude) of the Fourier transform.
10.3.3 Theorem and Property of Fourier Transform Projection theorem for Fourier transform One of the theorems of the Fourier transform. It is the basis of the inverse Fourier transform reconstruction method. If g(s, θ) is used to represent the integral of f(x, y) along a straight line (s, θ), let G(R, θ) be the (1- D) Fourier transform of g(s, θ) corresponding to the first variable s, i.e.,
10.3
Fourier Transform
403
Z GðR, θÞ ¼
gðs, θÞ exp ½j2πRs ds ðs, θÞ
F(X, Y) is a 2-D Fourier transform of f(x, y) (Q is the projection area of f(x, y)): ZZ F ðX, Y Þ ¼
f ðx, yÞ exp ½j2πðxX þ yY Þ dxdy Q
Then the projection theorem can be expressed as GðR, θÞ ¼ F ðR cos θ, R sin θÞ That is, the Fourier transform of f(x, y) projected at a θ angle is equal to the value of the Fourier transform of f(x, y) at the Fourier space (R, θ). In other words, the Fourier transform of the projection result of f(x, y) on a line at an angle θ to the X-axis is a section of the Fourier transform of f(x, y) on the orientation angle θ. Fourier slice theorem A theorem associated with Fourier transform used in various tomography. It states that a layer of a given orientation angle in a 2-D Fourier transform of a target is equal to a 1-D Fourier transform that projects parallel projections at this angle. That is, each parallel projection of a target gives a slice of the Fourier transform of the target on a plane that passes through the origin of the Fourier space and is parallel to the projection plane. An undegraded image containing a bright line can be represented as f ðx, yÞ ¼ δðyÞ It is assumed that the straight line coincides with the X-axis. The image of the line should be Z hl ðx, yÞ ¼
þ1 Z þ1
1
1
0
0
0
0
0
Z
hðx x , y y Þδðy Þdy dx ¼
þ1 1
hðx x0 , yÞdx0
Replace the variable ex x – x0 ) dx0 ¼ dex. The limit of the integration is from +1 to –1. So get Z hl ðx, yÞ ¼
1
þ1
Z hðex, yÞdex ¼
þ1
1
hðex, yÞdex
The right side of the above formula does not depend on x, so the left side does not depend on x, too. This means that the image of the line will be parallel to the X-axis (even coincident with it), and its section will be constant along the X-axis:
404
10 Image Transforms
Z hl ðx, yÞ ¼ hl ðyÞ ¼
þ1 1
hl ðex, yÞdex
Calculate the Fourier transform of hl( y): Z H l ð vÞ
þ1
1
hl ðyÞe2πjvy dy
The Fourier transform of the point spread function is a frequency response function and is given by: Z H ðu, vÞ ¼
þ1 Z þ1
1
1
hðx, yÞe2πjðuxþvyÞ dxdy
Take u ¼ 0 in this expression and get Z H ð0, vÞ ¼
þ1 Z þ1
1
1
hðx, yÞdx e2πjvy dy
After comparison H ð0, vÞ ¼ H l ðvÞ This equation is the Fourier layer theorem. The theorem states that taking a layer of the Fourier transform of a function h(x, y) (i.e., let u ¼ 0 in H(u, v)), the function can be projected in the corresponding direction (here the Y-axis) Fourier transform (i.e., Hl(v)). Next, an inverse Fourier transform is performed to obtain a projection of the function (i.e., hl( y)) in that direction. The above derivation shows that an ideal straight line image provides a section of the point spread function along a single direction (i.e., a direction orthogonal to the line). This is easy to understand because the section orthogonal to the length of one line is not different from the section of a point. By definition, the section of a point image is the point spread function of the blur process. If there are many ideal straight lines in different directions in the image, then the information of the frequency response function in the direction orthogonal to these straight lines in the frequency domain can be obtained. H(u, v) can be calculated at any point in the frequency domain by interpolation. See slice-based reconstruction. Fourier scaling theorem A common relation of the Fourier transform reflects a property of the Fourier transform. Also known as the Fourier similarity theorem. Let f(x, y) and F(u, v) form a pair of transforms, denoted as f(x, y) $ F(u, v). The Fourier scale theorem can be written as
10.3
Fourier Transform
(a)
405
(b)
(c)
(d)
Fig. 10.5 Examples of Fourier scaling theorem
af ðx, yÞ $ aF ðu, vÞ
1 u v f ðax, byÞ $ F , jabj a b where a and b are both scalars. The above two equations show that the scale variation of the amplitude of f(x, y) results in the corresponding scale change of the Fourier transform F(u, v) in terms of amplitude, while the scaling in spatial domain for f(x, y) causes the inverse scaling of its Fourier transform F(u, v) in terms of frequency domain. The second formula also shows that the contraction of f(x, y) (corresponding to a > 1, b > 1) not only causes the expansion of F(u, v) but also reduces the amplitude of F(u, v). Figure 10.5 gives a set of examples. Figure 10.5a, b give a 2-D image and its Fourier spectrum amplitude map, respectively. Figure 10.5c reduces the size of bright square in the center compared to Fig. 10.5a and d shows the Fourier spectral amplitude map of Fig. 10.5c. By comparing Fig. 10.5b and c, it can be seen that the contraction of the square in the image leads to an increase in the spectral space of the Fourier spectrum grid, and the amplitude of the Fourier spectrum is also reduced (the picture is darkened). Domain scaling law Same as Fourier scaling theorem. Fourier affine theorem A generalized relation of the Fourier transform. It can be considered that the Fourier shift theorem, the Fourier rotation theorem, and the Fourier scaling theorem are all special cases of this theorem. Let f(x, y) and F(u, v) form a pair of transforms, denoted as f(x, y) $ F(u, v). The Fourier affine theorem can be written as f ðax þ by þ c, dx þ ey þ f Þ $ 1 j2π eu dv bu þ av ½ðec bf Þu þ ðaf cd Þv F , exp ae bd ae bd ae bd jae bd j where a, b, c, d, e, and f are scalars.
406
10 Image Transforms
Fig. 10.6 Example of Fourier shearing theorem
Fourier shearing theorem A theorem describing the properties of the Fourier transform when a shearing transformation occurs in an image. The pure shearing of the image f(x, y) will result in the pure shearing of its corresponding Fourier transform F(u, v) in the orthogonal direction, so there are, respectively, the horizontal shearing transformation and the vertical shearing transformation: f ðx þ by, yÞ $ F ðu, v buÞ f ðx, y þ dxÞ $ F ðu dv, vÞ Figure 10.6 gives an example of the effect of two shearing transformations on the Fourier spectrum. Figure 10.6a, b give a 2-D image and its Fourier spectrum amplitude map, respectively. Figure 10.6c, d give the horizontally cropped image and its Fourier spectral amplitude map of Fig. 10.6a, respectively. Figure 10.6e, f give the vertically cropped image and its Fourier spectral amplitude map of Fig. 10.6a, respectively Shearing theorem Same as Fourier shearing theorem. Fourier convolution theorem A theorem that reflects the relationship between the convolution of two functions and the Fourier transform of the two. Referred to as the convolution theorem. The convolution of two functions f(x, y) and g(x, y) in space and the product of its Fourier transforms F(u, v) and G(u, v) in the frequency domain form a pair of transformations, and the products of the two functions in space and the convolution of the Fourier transform in the frequency domain form also a pair of transforms. Let f(x, y) and F(u, v) form a pair of transforms, denoted as f(x, y) $ F(u, v). The Fourier convolution theorem can be written as ( denotes convolution) f ðx, yÞ gðx, yÞ $ F ðu, vÞGðu, vÞ f ðx, yÞgðx, yÞ $ F ðu, vÞ Gðu, vÞ Convolution theorem Abbreviation for Fourier convolution theorem.
10.3
Fourier Transform
407
Fourier shift theorem A common relation of the Fourier transform reflects a property of the Fourier transform. Let f(x, y) and F(u, v) form a pair of transforms, denoted as f(x, y) $ F (u, v). The Fourier shift theorem can be written as f ðx a, y bÞ $ exp ½j2πðau þ bvÞF ðu, vÞ F ðu c, v dÞ $ exp ½j2πðcx þ dyÞf ðx, yÞ where a, b, c, and d are both scalars. The first equation shows that shifting f(x, y) in space is equivalent to multiplying its Fourier transform F(u, v) by an exponential term in the frequency domain. The second formula shows that the multiplication of f (x, y) in space with an exponential term is equivalent to shifting its Fourier transform F(u, v) in the frequency domain. Translation theorem A fundamental theorem that reflects the translational properties of the Fourier transform. If f(x, y) and F(u, v) form a pair of transforms, f ðx, yÞ $ F ðu, vÞ Then the translation theorem can be expressed by the following two formulas (a, b, c, and d are scalars): f ðx a, y bÞ $ exp ½j2πðau þ bvÞF ðu, vÞ F ðu c, v dÞ $ exp ½j2πðcx þ dyÞf ðx, yÞ The first equation shows that shifting f(x, y) in space is equivalent to multiplying its transform in the frequency domain by an exponential term, while the second equation shows that multiplying f(x, y) in space by an exponential term is equivalent to shifting its transform in the frequency domain. Fourier correlation theorem The theorem that reflects the correlation between the two functions and the Fourier transform of the two. Referred to as the correlation theorem. The correlation of two functions f(x, y) and g(x, y) in the space domain and the product of their Fourier transforms F(u, v) and G(u, v) (one of which is its complex conjugate, i.e., F*(u, v) and G(u, v) or F(u, v) and G*(u, v)) in the frequency domain constitutes a pair of transformations, while the the product of two functions f(x, y) and g(x, y) (one of which is its complex conjugate, i.e., f*(x, y) and g(x, y) or f(x, y) and g*(x, y)) in the space domain and the correlation of their Fourier transforms constitutes also a pair of transformations. Let f(x, y) and F(u, v) form a pair of transforms, denoted as f(x, y) $ F(u, v). The Fourier correlation theorem can be written as ( denotes correlation)
408
10 Image Transforms
Fig. 10.7 Example of Fourier rotation theorem
f ðx, yÞ gðx, yÞ $ F ðu, vÞGðu, vÞ f ðx, yÞgðx, yÞ $ F ðu, vÞ Gðu, vÞ Correlation theorem Abbreviation for the Fourier correlation theorem. Fourier similarity theorem Same as Fourier scaling theorem. Similarity theorem Same as scale theorem. Scale theorem The theorem that gives the properties of the Fourier transform as it changes in scale (scaling). Also called the similarity theorem. Fourier rotation theorem A common relation of the Fourier transform reflects a property of the Fourier transform. Let f(x, y) and F(u, v) form a pair of transforms, denoted as f(x, y) $ F (u, v). By means of the polar transformation, x ¼ rcosθ, y ¼ rsinθ, u ¼ wcosθ, v ¼ wsinθ, f(x, y), and F(u, v) are converted to f(r, θ) and F(w, ϕ), respectively. The Fourier rotation theorem can be written as f ðr, θ þ θ0 Þ $ F ðw, φ þ θ0 Þ where θ0 is the angle of rotation. The above equation shows that rotating θ0 for f(x, y) corresponds to rotating its Fourier transform F(u, v) also θ0. Similarly, rotating θ0 for F(u, v) corresponds to rotating its inverse Fourier transform f(x, y) also θ0. Figure 10.7 gives a set of examples. Among them, Fig. 10.7a is a 2-D image, and its grayscale image of Fourier spectrum amplitude is shown in Fig. 10.7b, c shows the result of rotating the image of Fig. 10.7a by 45 . The grayscale image of its Fourier spectral amplitude is shown in Fig. 10.7d. As can be seen from these figures, when the image is rotated by a certain angle in the image space, then its Fourier transform is rotated by a corresponding degree in the spectral space.
10.3
Fourier Transform
409
Rotation theorem See Fourier rotation theorem.
10.3.4 Fourier Space Fourier space The frequency domain space in which an image is Fourier transformed. This is also called a Fourier domain. Partition Fourier space into bins The spatial decomposition used to determine the spatial periodicity or directionality of the texture. After the space of the Fourier transform domain is partitioned, the energy of each block is calculated separately, and the spatial periodicity or directionality of the texture can be determined. Commonly used are two types of block, namely, angle type and radial type. The former corresponds to a wedge filter, and the latter corresponds to a ring filter, as respectively shown in Fig. 10.8. Radial feature The texture space periodicity obtained by the radiant Fourier space partitioning. Radiation characteristics can be expressed as Rðr 1 , r 2 Þ ¼
XX
jF j2 ðu, vÞ
where |F|2 is the Fourier power spectrum, and the sum is for r 21
u2 þ v2 < r 22 ,
0 u, v < N 1
where N is the period and the radiation characteristics are related to the roughness of the texture. A smooth texture has a large R(r1, r2) value at small radii, while a coarse grain texture has a large R(r1, r2) value at large radii.
Fig. 10.8 Blocking the Fourier space
V
V
θ
r U
U
410
10 Image Transforms
Direct component The value of the Fourier transform of an image at frequency (0, 0). It is also the grayscale mean of the image. Fourier space smoothing Apply a smoothing filter (such as eliminating high frequency noise) to the Fourier transformed image (in the frequency domain). Distributivity of Fourier transform A property of the Fourier transform. Letting F be the Fourier transform operator, then the Fourier transform is distributive to the addition, but generally has no distributivity for the multiplication, which is F f f 1 ðx, yÞ þ f 2 ðx, yÞg ¼ F f f 1 ðx, yÞg þ F f f 2 ðx, yÞg F f f 1 ðx, yÞ f 2 ðx, yÞg 6¼ F f f 1 ðx, yÞg F f f 2 ðx, yÞg Conjugate symmetry of Fourier transform The Fourier transform F(u, v) of the real image f(x, y) is conjugate symmetrical to the origin, which is F ðu, vÞ ¼ F ðu, vÞ where F*(u, v) is the conjugate of F(u, v). That is, if F(u, v) ¼ R(u, v) + jI(u, v), then F*(u, v) ¼ R(u, v) jI(u, v). Further available jF ðu, vÞj ¼ jF ðu, vÞj Translation of Fourier transform A property of the Fourier transform. Let F be the Fourier transform operator, and the Fourier transform result of f(x, y) be represented by F(u, v). After an image is moved (translated), the obtained frequency domain spectrum will have a phase shift but the amplitude remains the same. That is F ½ f ðx x0 , y y0 Þ ¼ F ðu, vÞ exp ½j2πðux0 =M þ vy0 =N Þ f ðx x0 , y y0 Þ ¼ F 1 F ðu, vÞ exp ½j2πðux0 =M þ vy0 =N Þ Linearity of Fourier transform A property of the Fourier transform. Letting F be the Fourier transform operator, then the Fourier transform is a linear operation, which is F fa • f 1 ðx, yÞ þ b • f 2 ðx, yÞg ¼ a • F 1 ðu, vÞ þ b • F 2 ðu, vÞ
10.3
Fourier Transform
411
a • f 1 ðu, vÞ þ b • f 2 ðu, vÞ ¼ F 1 ½a • F 1 ðx, yÞ þ b • F 2 ðx, yÞ where a and b are constants. Periodicity of Fourier transform A property of the Fourier transform. For the image f(x, y) of N N, the Fourier transform F(u, v) has a period of N in both X and Y directions, which is F ðu, vÞ ¼ F ðu þ N, vÞ ¼ F ðu, v þ N Þ ¼ F ðu þ N, v þ N Þ Fourier image processing Image processing is performed in the Fourier transform domain, that is, the Fourier-transformed image (transformation coefficient image) is processed. Fourier-based computation Fast calculations by means of Fourier transform. The basic steps are to first convert the image to the Fourier space, then perform the corresponding calculation operation in the Fourier space, and finally transform the result back to the image space with the inverse Fourier transform. For an image of N N, the amount of operation required for the Fourier transform is O(N2logN2). This is less than the amount of operations required for many of the processing that is performed on each pixel in the image domain. Fourier matched filter object recognition The correlation is determined by a matched filter (which is the conjugate of the Fourier transform of the target under consideration) to achieve target recognition. Fourier descriptor See Fourier boundary descriptor. Fourier shape descriptor A descriptor for boundary for a region in which the Fourier transform coefficients of the boundary are used. Harmonic analysis Same as Fourier descriptor. Gibbs phenomenon A phenomenon in which a Fourier series is used to first expand a function and then recombine it. For example, Fourier series expansion is performed on a periodic function with discontinuous points, and finite terms are selected for synthesis. As the number of selected items increases, the peak points appearing in the synthesized waveform become closer to the discontinuous points in the original signal, and the peak tends to be a constant, which is the Gibbs phenomenon.
412
10 Image Transforms
10.4
Discrete Cosine Transform
The discrete cosine transform (DCT) is a separable and orthogonal transform and is symmetric. The calculation of the discrete cosine transform can be performed through the calculation of the real part of the discrete Fourier transform (Gonzalez and Woods 2008; Russ and Neal 2016).

Discrete cosine transform [DCT]
A typical separable transform, orthogonal transform, and symmetric transform. A transform that converts a digital image f(x, y) into discrete cosine function coefficients C(u, v) in the frequency domain. It has been used for image compression, e.g., in JPEG. The 2-D discrete cosine transform and its inverse transform are given by the following two equations (N is the transform period):

C(u, v) = a(u) a(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N},   u, v = 0, 1, \ldots, N-1

f(x, y) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} a(u) a(v) C(u, v) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2N},   x, y = 0, 1, \ldots, N-1

where a(u) is the normalized weighting coefficient for the discrete cosine transform, given by

a(u) = \begin{cases} \sqrt{1/N} & u = 0 \\ \sqrt{2/N} & u = 1, 2, \ldots, N-1 \end{cases}
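A minimal sketch of this definition using SciPy is given below; it assumes SciPy's scipy.fft module is available, whose orthonormalized type-II DCT (norm='ortho') matches the normalization a(u) given above.

import numpy as np
from scipy.fft import dctn, idctn

f = np.random.rand(8, 8)                    # an N x N image block, N = 8
C = dctn(f, norm='ortho')                   # 2-D DCT coefficients C(u, v)
f_rec = idctn(C, norm='ortho')              # inverse 2-D DCT

print(np.allclose(f, f_rec))                # True: the transform is orthogonal
print(C[0, 0], 8 * f.mean())                # C(0, 0) = (1/N) * sum of pixels = N * mean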
Cosine transform
The expansion of a signal on a basis of cosine functions. For a 1-D even function f(x), the cosine transform is

F(u) = 2 \int_0^{\infty} f(x) \cos(2\pi u x)\, dx

For a 1-D sampled signal f_0, \ldots, f_{n-1}, its discrete cosine transform is the vector v_0, \ldots, v_{n-1}, where

v_0 = \sqrt{\frac{1}{n}} \sum_{i=0}^{n-1} f_i

and, for k ≥ 1,

v_k = \sqrt{\frac{2}{n}} \sum_{i=0}^{n-1} f_i \cos\left[\frac{\pi (2i+1) k}{2n}\right]

For a 2-D image f(x, y), its cosine transform is

F(u, v) = 4 \int_0^{\infty} \int_0^{\infty} f(x, y) \cos(2\pi u x) \cos(2\pi v y)\, dx\, dy
Discrete sine transform [DST]
A separable transform, an orthogonal transform, and a symmetric transform. The kernel of the 2-D discrete sine transform and of its inverse transform are the same and can be expressed as (N is the transform period)

S(x, y, u, v) = \frac{2}{N+1} \sin\frac{(x+1)(u+1)\pi}{N+1} \sin\frac{(y+1)(v+1)\pi}{N+1}

The resulting transformation matrices are real, orthogonal, and symmetric. The discrete sine transform differs from the discrete cosine transform in that it has no DC component. In addition, it is oddly symmetric.

Odd-symmetric discrete cosine transform [ODCT]
A special discrete cosine transform. Suppose there is an M × N image f, which is flipped about its leftmost column and its uppermost row to obtain an image of size (2M − 1) × (2N − 1). The discrete Fourier transform of this (2M − 1) × (2N − 1) image is real and is

F_{oc}(m, n) = \frac{1}{(2M-1)(2N-1)} \left[ f(0, 0) + 4 \sum_{k=1}^{M-1} \sum_{l=1}^{N-1} f(k, l) \cos\frac{2\pi m k}{2M-1} \cos\frac{2\pi n l}{2N-1} + 2 \sum_{k=1}^{M-1} f(k, 0) \cos\frac{2\pi m k}{2M-1} + 2 \sum_{l=1}^{N-1} f(0, l) \cos\frac{2\pi n l}{2N-1} \right]

The above form can be written more compactly as

F_{oc}(m, n) = \frac{1}{(2M-1)(2N-1)} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} C(k) C(l) f(k, l) \cos\frac{2\pi m k}{2M-1} \cos\frac{2\pi n l}{2N-1}

where C(0) = 1 and C(k) = C(l) = 2 for k, l ≠ 0. This is the odd-symmetric discrete cosine transform of the original image. Compare the even-symmetric discrete cosine transform and the odd-antisymmetric discrete sine transform.

Even-symmetric discrete cosine transform [EDCT]
A special discrete cosine transform. Suppose there is an M × N image f, which is flipped about its leftmost column and its uppermost row to obtain a 2M × 2N image. The discrete Fourier transform (DFT) of this image is real and is
F_{ec}(m, n) = \frac{1}{MN} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} f(k, l) \cos\left[\frac{\pi m}{M}\left(k + \frac{1}{2}\right)\right] \cos\left[\frac{\pi n}{N}\left(l + \frac{1}{2}\right)\right]

This is the even-symmetric discrete cosine transform of the original image. Compare the even-antisymmetric discrete sine transform and the odd-symmetric discrete cosine transform.

Odd-antisymmetric discrete sine transform [ODST]
A special sine transform. Suppose there is an M × N image f; change its sign, flip it about its leftmost column and uppermost row, and insert a row of 0s and a column of 0s along the flip axes to obtain a (2M + 1) × (2N + 1) image. The discrete Fourier transform of this (2M + 1) × (2N + 1) image is real and is

\frac{4}{(2M+1)(2N+1)} \sum_{k'=1}^{M} \sum_{l'=1}^{N} f(k', l') \sin\frac{2\pi m k'}{2M+1} \sin\frac{2\pi n l'}{2N+1}

Note that the indices k' and l' here are not the indices of the original image; the original indices run from 0 to M − 1 and from 0 to N − 1, respectively. The index is offset by 1 because a row and a column of 0s were inserted. In terms of the original indices, the odd-antisymmetric discrete sine transform of the original image can be defined as

F_{os}(m, n) = \frac{4}{(2M+1)(2N+1)} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} f(k, l) \sin\frac{2\pi m (k+1)}{2M+1} \sin\frac{2\pi n (l+1)}{2N+1}

Compare the even-antisymmetric discrete sine transform and the odd-symmetric discrete cosine transform.

Even-antisymmetric discrete sine transform [EDST]
A special sine transform. Suppose there is an M × N image f; flipping it about its leftmost column and uppermost row gives a 2M × 2N image. The discrete Fourier transform (DFT) of this image is real, and

F_{es}(m, n) = \frac{1}{MN} \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} f(k, l) \sin\frac{\pi m (2k+1)}{2M} \sin\frac{\pi n (2l+1)}{2N}

The above equation gives the even-antisymmetric discrete sine transform of the original image. Compare the even-symmetric discrete cosine transform and the odd-antisymmetric discrete sine transform.

2N-point periodicity
A property of discrete cosine transforms. The cosine function is an even function, and the discrete cosine transform of N points has a period of 2N. In image processing, the 2N-point periodicity of the discrete cosine transform makes the
transformation result not interrupted at the boundary of image blocks, which reduces the possibility of Gibbs phenomenon.
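To illustrate the whole-sample mirror extension behind the odd-symmetric DCT described above, the following NumPy sketch (written for this entry, not code from the cited references) reflects an M × N image about its first row and first column and checks that the DFT of the resulting (2M − 1) × (2N − 1) image is real to numerical precision; dividing that real DFT by (2M − 1)(2N − 1) gives F_oc(m, n).

import numpy as np

def odd_symmetric_extension(f):
    """Reflect f about its first row and first column: result is (2M-1) x (2N-1)."""
    g = np.concatenate([f, f[-1:0:-1, :]], axis=0)   # mirror rows (row 0 not duplicated)
    g = np.concatenate([g, g[:, -1:0:-1]], axis=1)   # mirror columns
    return g

f = np.random.rand(6, 5)                             # M = 6, N = 5
g = odd_symmetric_extension(f)                       # shape (11, 9)
G = np.fft.fft2(g)
print(g.shape, np.max(np.abs(G.imag)) < 1e-9)        # (11, 9) True: the DFT is real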
10.5 Wavelet Transform
Wavelets are attenuating and undulating waves (Chui 1992). Wavelet transform is a local transform in space-time and frequency. Wavelet transform has a fast algorithm in implementation (Chan and Shen 2005).
10.5.1 Wavelet Transform and Property

Wavelet transform [WT]
A local transformation in space-time and in frequency. It expresses a signal in terms of wavelet basis functions. It is similar to the Fourier transform, but since the wavelet basis is a family of functions {ϕ_{jk}(x)} with two parameters, the wavelet transform of an m-D signal is an (m + 1)-D function. However, the number of different values required to express a discrete signal of length n is still O(n). Wavelet transforms have applications similar to those of Fourier transforms, but wavelet bases have advantages in expressing natural signals such as images. The term covers wavelet decomposition, wavelet series expansion, the discrete wavelet transform, and so on.

Wavelet transform representation
A transform-based representation method. When it is used to represent a target boundary, the boundary point sequence is first wavelet transformed, and the wavelet transform coefficients are then used to express the boundary.

Discrete wavelet transform [DWT]
A transform that maps a series of discrete data into a series of (two sets of) expansion coefficients, the wavelet coefficients. Taking 1-D as an example, if f(x) denotes a discrete sequence, the coefficients obtained from the expansion of f(x) form the discrete wavelet transform of f(x). Let the starting scale j of the transform be 0, and let the two sets of expansion coefficients be denoted W_u and W_v, respectively; then the discrete wavelet transform can be expressed as

f(x) = \frac{1}{\sqrt{M}} \sum_k W_u(0, k) u_{0,k}(x) + \frac{1}{\sqrt{M}} \sum_{j=0}^{\infty} \sum_k W_v(j, k) v_{j,k}(x)

W_u(0, k) = \frac{1}{\sqrt{M}} \sum_x f(x) u_{0,k}(x)

W_v(j, k) = \frac{1}{\sqrt{M}} \sum_x f(x) v_{j,k}(x)

where u(x) denotes a scaling function and v(x) denotes a wavelet function. Generally M is chosen to be an integer power of 2, so the above summations are over x = 0, 1, 2, ..., M−1, j = 0, 1, 2, ..., J−1, and k = 0, 1, 2, ..., 2^j−1. The coefficients W_u(0, k) and W_v(j, k) correspond to a_0(k) and d_j(k) in the wavelet series expansion and are called approximation coefficients and detail coefficients, respectively. Note that if the expansion functions only constitute a biorthogonal basis of the scaling function space U and the wavelet function space V, then u(x) and v(x) are to be replaced by the corresponding dual functions u'(x) and v'(x). Here u(x) and its dual function u'(x) satisfy

\langle u_j(x), u'_k(x) \rangle = \delta_{jk} = \begin{cases} 0 & j \neq k \\ 1 & j = k \end{cases}
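A minimal sketch of a 1-D discrete wavelet transform using the optional PyWavelets package (an assumption about available software; it is not among the references cited in this chapter) is shown below; pywt.wavedec returns the approximation coefficients followed by the detail coefficients at successively finer scales.

import numpy as np
import pywt

f = np.sin(2 * np.pi * np.arange(256) / 32) + 0.1 * np.random.randn(256)

# Multilevel DWT: [approximation cA3, detail cD3, detail cD2, detail cD1]
coeffs = pywt.wavedec(f, wavelet='haar', level=3)
print([c.shape for c in coeffs])            # coefficient lengths per scale

f_rec = pywt.waverec(coeffs, wavelet='haar')
print(np.allclose(f, f_rec))                # True: perfect reconstruction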
Fast wavelet transform [FWT] A method for fast calculation of wavelet transform by using Mallat wavelet decomposition and multi-resolution refinement equation. The calculations are performed iteratively from high scale to low scale. Inverse fast wavelet transform The inverse transform of wavelet transform. There are also fast algorithms that perform from low- to high-scale iterations. Wavelet modulus maxima The result of first performing wavelet transform on the image, and then extracting the maximum value of the obtained coefficient. The wavelet transform modulus maxima is the local maxima of the gradient mode along the gradient direction, which can describe the singularity of the signal. When calculating the image specifically, the image is first multi-scale transformed, and the gradient of the transformation result is then calculated separately on each scale, and finally the maximum value of the modulus is taken. The wavelet modulus maxima describes the multi-scale (multilayer) boundary (contour) information of the target in the image and describes the shape characteristics of the target to some extent. Wavelet calibration Same as natural scale detection transform. Natural scale detection transform A method of extracting information from a signal after multi-scale transformation. It is performed by adjusting the frequency or selecting the scale, also known as wavelet calibration. Its basic idea is not to analyze the entire transformation and only analyze it on a certain fixed scale or set of scales, which is equivalent to the linear filtering result of the analysis signal. Here the linear filter is defined by a transform kernel that is adjusted to a fixed scale or frequency. The most important
problem to be solved by this method is to correctly adjust the transformation parameters, i.e., the analysis scale.

Wavelet function
A function that represents the detail component in the wavelet transform result. It is closely related to the scaling function obtained from the expansion function in the series expansion. Considering the 1-D case, if the wavelet function is denoted v(x), then translating it and scaling it by powers of two gives the wavelet function set {v_{j,k}(x)}:

v_{j,k}(x) = 2^{j/2} v(2^j x - k)

It can be seen that k determines the position of v_{j,k}(x) along the X-axis, j determines the width of v_{j,k}(x) along the X-axis, and the coefficient 2^{j/2} controls its magnitude. The space spanned by the wavelet functions v_{j,k}(x) is denoted V_j. If f(x) ∈ V_j, then, following the way a series is expanded, f(x) can be expressed as

f(x) = \sum_k a_k v_{j,k}(x)

The scaling function spaces U_j, U_{j+1} and the wavelet function space V_j satisfy the following relationship (Fig. 10.9, which depicts U_2 = U_1 ⊕ V_1 = U_0 ⊕ V_0 ⊕ V_1, gives an example for j = 0, 1):

U_{j+1} = U_j ⊕ V_j

where ⊕ denotes the sum of the spaces (similar to the union of sets). It can be seen that, within U_{j+1}, the complement of U_j is V_j. All scaling functions u_{j,k}(x) in the scaling function space U_j are orthogonal to all wavelet functions v_{j,k}(x) in the wavelet function space V_j, which can be expressed as

\langle u_{j,k}(x), v_{j,k}(x) \rangle = 0
Scale function A function that represents the approximate component in the wavelet transform result. The expansion function in the series expansion can be taken as a scaling function, and the set obtained by translation and binary scaling the scaling function
constitutes a set of scaling functions. Considering the 1-D case, if the scaling function is denoted u(x), the scaling function set is {u_{j,k}(x)}, where

u_{j,k}(x) = 2^{j/2} u(2^j x - k)

It can be seen that k determines the position of u_{j,k}(x) along the X-axis, j determines the width of u_{j,k}(x) along the X-axis, and the coefficient 2^{j/2} controls its magnitude. Given an initial j (often taken as 0), a scaling function space U_j can be determined, and the size of U_j increases or decreases as j increases or decreases. In addition, the scaling function spaces U_j (j = ..., −1, 0, 1, ...) are nested, that is, U_j ⊂ U_{j+1}.

Scaling function
The average image obtained in the wavelet transform.

Curvelet transform
Same as curvelet.

Curvelet
An extension of the wavelet concept, generally referred to as the curvelet transform. It constitutes a nonadaptive multi-scale representation technique whose goal is to overcome the shortcomings of wavelet methods in expressing continuous curves and contours.
10.5.2 Expansion and Decomposition

Wavelet series expansion
The process of mapping a continuous-variable function into a series of expansion coefficients, the wavelet transform coefficients. Let U be the scaling function space with scaling function u(x), and V be the wavelet function space with wavelet function v(x). If the starting scale j is taken as 0 and the initial scaling function space is U_0, then the wavelet series expansion of f(x) can be expressed as

f(x) = \sum_k a_0(k) u_{0,k}(x) + \sum_{j=0}^{\infty} \sum_k d_j(k) v_{j,k}(x)

where a_0(k) is the scaling coefficient (also called the approximation coefficient) and d_j(k) is the wavelet coefficient (also called the detail coefficient); the first summation is the approximation of f(x) (if f(x) ∈ U_0, the expression is exact), and the second summation expresses the details of f(x). a_0(k) and d_j(k) can be calculated by the following two equations:

a_0(k) = \langle f(x), u_{0,k}(x) \rangle = \int f(x) u_{0,k}(x)\, dx

d_j(k) = \langle f(x), v_{j,k}(x) \rangle = \int f(x) v_{j,k}(x)\, dx

Note that if the expansion functions only constitute a biorthogonal basis of U and V, then u(x) and v(x) are replaced by their dual functions u'(x) and v'(x). Here u(x) and its dual function u'(x) satisfy

\langle u_j(x), u'_k(x) \rangle = \delta_{jk} = \begin{cases} 0 & j \neq k \\ 1 & j = k \end{cases}

Series expansion
A technique and process that represents a function as a linear combination of another set of functions. Taking the 1-D function f(x) as an example, suppose it can be expressed as

f(x) = \sum_k a_k u_k(x)
The right side of the equal sign is the series expansion of f(x). In the above formula, k is an integer, the summation can be a finite or infinite term, ak is a real number, and uk(x) is a real function. Generally, ak is the expansion coefficient, and uk(x) is the expansion function. If for any f(x), there is a set of ak such that the above formula holds, then uk(x) is the basic function, and the set {uk(x)} of uk(x) is called the basis. All functions f(x) that can be expressed by the above formula form a function space U, which is closely related to {uk(x)}. If f(x) 2 U, then f(x) can be expressed by the above formula. Expansion function A function used in a series expansion to represent an expanded item. Expansion coefficient The weighting factor of the expansion function in the series expansion of the function. Zooming property of wavelet transform A wavelet transform has a property that is characterized by time-frequency localization. In wavelet transform, the product of the width of the time window function and the width of the frequency (transform domain) window function is a constant. Therefore, when using the wavelet transform to analyze the low-frequency components of the image, the time window can be widened to reduce the frequency window; while the wavelet transform is used to analyze the high-frequency components of the image, the frequency window can be widened and the time window can be reduced. This ability to change the width of the window according to the high and low image frequencies is similar to the ability to adjust the focal length of camera according to the required scene size at the time of photographing, so it is called a
zoom characteristic. This feature is the basis for wavelet transform to provide multiresolution analysis. Wavelet pyramid A pyramid consisting of layers broken down by wavelets. As shown in Fig. 10.10, each wavelet layer retains 3/4 of the original pixels (typically horizontal, vertical, and mixed gradients), so in the wavelet pyramid, the total number of wavelet coefficients is the same as the total number of original pixels. Comparing the halfoctave pyramids that are complete (i.e., using more pixels than the original image to express the decomposition results), the wavelet pyramid provides a tighter frame, keeping the size of the decomposition the same as the original image. However, some wavelets are also overcomplete wavelets to provide better steering or orientation control. Therefore, a better way to distinguish between the two pyramids is to say that the wavelet pyramid has better orientation selectivity than the general halfoctave pyramids (very close to the band-pass pyramid). Wavelet tree A data structure constructed based on wavelet transform. In the text index, it is a structure that can be used to compress data and text. It converts a string into a balanced bit vector binary tree, with 0 representing half symbols and 1 representing the other half. The tree is filtered and re-encoded to reduce ambiguity, and this process can be repeated until there is no ambiguity. Wavelet decomposition The process of wavelet transforming an image. This decomposition is carried out from a high scale to a low scale. Taking the second-order wavelet decomposition of the 2-D image as an example, let superscript H represent the horizontal direction, V for the vertical direction, and D for the diagonal direction. The first step in the wavelet decomposition decomposes the image into four sub-images, as shown in Fig. 10.11a, where the upper left component corresponds to the low-frequency
(L) components in the horizontal and vertical directions (i.e., LL); the upper right component corresponds to the high frequency (H) in the horizontal direction and the low frequency in the vertical direction (i.e., the intermediate-frequency component HL); the lower left component corresponds to the low frequency in the horizontal direction and high frequency in the vertical direction (i.e., the intermediate frequency component LH); and the lower right component corresponds to the high-frequency component in the horizontal and vertical directions (i.e., HH)). The next second step decomposition can be done in the following three ways: 1. Continue to decompose only the low-frequency component (i.e., LL), as shown in Fig. 10.11b, which is also called pyramid-structured wavelet decomposition. 2. In addition to performing (1), the intermediate frequency components (i.e., HL and LH) continue to be decomposed, as shown in Fig. 10.11c, which is also called non-complete tree-structured wavelet decomposition. 3. In addition to performing (1) and (2), the high-frequency component (i.e., HH) continues to decompose, as shown in Fig. 10.11d, which is also called complete tree-structured wavelet decomposition or wavelet packet decomposition. Wavelet packet decomposition See wavelet decomposition. Non-complete tree-structured wavelet decomposition See wavelet decomposition. Complete tree-structured wavelet decomposition See wavelet decomposition.
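A minimal sketch of a two-level pyramid-structured wavelet decomposition, again assuming the optional PyWavelets package, is shown below; wavedec2 returns the final low-frequency band followed by the detail bands of each level, which correspond to the HL, LH, and HH components described above (PyWavelets labels them horizontal, vertical, and diagonal details).

import numpy as np
import pywt

image = np.random.rand(128, 128)

# Two-level pyramid-structured decomposition: [LL2, (H2, V2, D2), (H1, V1, D1)]
LL2, (H2, V2, D2), (H1, V1, D1) = pywt.wavedec2(image, wavelet='haar', level=2)

print(LL2.shape, H2.shape, H1.shape)        # (32, 32) (32, 32) (64, 64)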
10.5.3 Various Wavelets

Wavelet
A wave with attenuation and volatility. Intuitively, it refers to a small, localized waveform with limited length and a mean of zero.
Fig. 10.12 Haar transform wavelet example: a mother wavelet ϕ and several derived wavelets ϕ_{1,k} and ϕ_{2,k} (k = −2, ..., 2)
Wavelet-based texture analysis uses a set of functions that are localized in both the spatial domain and the frequency domain to decompose the texture image. Wavelet functions belonging to the same family can be constructed by dilation and translation of a basis function called the "mother wavelet" or "basis wavelet." The input image can be thought of as a weighted sum of overlapping wavelet functions obtained through scaling and translation. Considering simply the 1-D case, let g(x) be a wavelet; then the wavelet transform of the signal f(x) can be expressed as

W_f(a, t) = \int_{-\infty}^{\infty} f(x)\, g[a(x - t)]\, dx

where g[a(x − t)] is calculated from the mother wavelet g(x), and a and t represent the scale and translation parameters, respectively. Their discrete forms are obtained by sampling the parameters a and t.

A wavelet is a function ϕ(x) with certain characteristics that can be used to derive a set of basis function expressions and to approximate other functions. In comparison with the Fourier transform, whose basis functions can be thought of as a set of scalings and translations of f(x) = sin(πx) (for example, cos(3πx) = sin(3πx + π/2) = f[(6x + 1)/2]), a wavelet basis can similarly be obtained by scaling and translating a mother wavelet ϕ(x): each basis function ϕ_{jk}(x) has the form ϕ_{jk}(x) = const · ϕ(2^{−j} x − k). The requirement on ϕ(x) is that different basis functions (with different j and k) be orthogonal. There are some commonly used forms of ϕ(x), such as the Haar wavelet and the Daubechies wavelet. They have a good overall performance in terms of some desirable characteristics, such as compactness in space and time and the ability to approximate certain types of functions. Figure 10.12 shows a mother Haar transform wavelet and several derived wavelets ϕ_{jk}(x).

A wavelet can be thought of as a filter that localizes the signal in both the spatial and frequency domains and is defined over a range of scales. Wavelets provide a stationary method of decomposing a signal into frequency components, with no blockiness, and are very close to the pyramid. As wavelets are widely used, they are often viewed as mathematical functions, and wavelet transforms of an image can be utilized to obtain information about different frequencies and orientations.
Fig. 10.13 The Haar wavelet base images for an 8 × 8 image (boxes mark base-image sets grouped by resolution: LL, LH, H1, H2, H3)
Monogenic wavelets The wavelet function that uses its complex-valued function instead of its negative frequency component. Gabor wavelet A wavelet consisting of a sinusoidal function defined by a Gaussian envelope function. Logarithmic Gabor wavelet A variant of Gabor wavelet. Provides a response that is unaffected by any DC component. Overcomplete wavelets See wavelet pyramid. Haar wavelet All scaled and translated versions of the Haar function. For an 8 8 image, it includes the respective base images as shown in Fig. 10.13 (here only the positions are indicated by boxes). Thick lines separate them into a base image set of the same resolution. The letters L and H indicate low resolution and high resolution, respectively, and the numbers to the right of H indicate high resolution layers, and letter pairs are used to indicate resolution along vertical and horizontal axes. In Fig. 10.13, the box in the upper left corner is the average image, also called the scaling function; the 16 base image boxes in the lower right corner correspond to the finest resolution scale; the other boxes correspond to the intermediate scale. Together, they build a complete set of bases that can be used to extend any 8 8 image.
Haar transform
A separable transform and a symmetric transform. It can be implemented by matrix operations, and the transformation matrix used is the Haar matrix. The Haar matrix is based on the Haar functions defined on the continuous closed interval [0, 1]. It is a wavelet transform that can be used for image compression. The basis functions are similar to those used to detect edges with the first derivative, and the transform decomposes an image into horizontal, vertical, and diagonal edges at different scales.

Haar function
A special set of functions with integer indices, generally written h_k(z), where k = 0, 1, 2, ..., N − 1 and N = 2^n. They can be used to define a Haar matrix. The integer k used as a subscript can be decomposed uniquely as

k = 2^p + q - 1

where 0 ≤ p ≤ n − 1. When p = 0, q = 0 or q = 1; when p ≠ 0, 1 ≤ q ≤ 2^p. With this decomposition, the Haar function can be expressed by the following two equations:

h_0(z) = h_{00}(z) = \frac{1}{\sqrt{N}},   z \in [0, 1]

h_k(z) = h_{pq}(z) = \frac{1}{\sqrt{N}} \begin{cases} 2^{p/2} & \dfrac{q-1}{2^p} \le z < \dfrac{q-1/2}{2^p} \\ -2^{p/2} & \dfrac{q-1/2}{2^p} \le z < \dfrac{q}{2^p} \\ 0 & \text{otherwise} \end{cases}

Haar matrix
A matrix that implements the Haar transform. For an N × N matrix, the k-th row consists of the values of the Haar function h_k(z) at z = 0/N, 1/N, ..., (N − 1)/N.

Noiselets
A set of noise-like functions that are complementary to wavelets. If a signal is compact in the wavelet domain, it spreads out in the noiselet domain; conversely, if the signal is compact in the noiselet domain, it spreads out in the wavelet domain. For functions that compress well in an orthogonal wavelet basis, noiselets give very poor or no compression at all.

Tensor representation
An expression implemented or performed with a multidimensional matrix or array. For example, the Gabor wavelet response gives the results obtained from filters of different frequencies and orientations, and a tensor expression is often used to represent the results obtained by all filters in all cases.
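The Haar matrix defined in the entries above can be generated directly from h_k(z); the following NumPy sketch (an illustration written for this handbook, with function names chosen here) builds the N × N Haar matrix by sampling h_k at z = 0/N, ..., (N − 1)/N and checks that its rows are orthonormal.

import numpy as np

def haar_function(k, z, N):
    """Haar function h_k(z) sampled at points z, using k = 2**p + q - 1."""
    if k == 0:
        return np.full_like(z, 1.0 / np.sqrt(N))
    p = int(np.floor(np.log2(k)))            # decompose k = 2**p + q - 1
    q = k - 2 ** p + 1
    h = np.zeros_like(z)
    h[((q - 1) / 2 ** p <= z) & (z < (q - 0.5) / 2 ** p)] = 2 ** (p / 2)
    h[((q - 0.5) / 2 ** p <= z) & (z < q / 2 ** p)] = -(2 ** (p / 2))
    return h / np.sqrt(N)

N = 8
z = np.arange(N) / N                          # z = 0/N, 1/N, ..., (N-1)/N
H = np.array([haar_function(k, z, N) for k in range(N)])   # Haar matrix
print(np.allclose(H @ H.T, np.eye(N)))        # True: rows are orthonormal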
Mother wavelet The prototype of the (wavelet) function used in the wavelet transform, from which all other wavelet functions can be derived. Dual-tree wavelet Also known as the Dual Tree Complex Wavelet Transform (DTCWT). An almost shift invariant wavelet transform that can still be used for image reconstruction. It computes dual wavelet transforms in two decompositions (called tree A and tree B), one for real numbers and the other for imaginary numbers. Lifted wavelets A second-generation wavelet because it can be easily adapted to an irregular sampling topology. Can be generated by means of a lifting scheme. Lifting See lifting scheme. Lifting scheme In wavelet transform, a wavelet construction method that does not depend on the Fourier transform. The wavelet transform based on the lifting scheme can implement an integer-to-integer transformation at the current position, thus eliminating the need for additional memory. If the transformed coefficients are directly symbolencoded, the effect of lossless coding can be obtained. The lifting scheme can take advantage of the fast algorithm of wavelet, so the operation speed is fast and the memory is saved. It is also a simple way to design a wavelet filter. The signal is split into its odd and even parts, and each sequence is subjected to a common reversible filtering operation to produce a lifting wavelet. Integer lifting A method for constructing integer wavelet expression. See lifting scheme. Integer wavelet transform An integer version of a discrete wavelet transform. Boxlet Contains positive and negative features of adjacent rectangles. These features can be calculated by means of a summation region table.
10.6 Karhunen-Loève Transform
The Karhunen-Loève transformation (KLT) is a continuous transformation, and its corresponding transformation in the discrete domain is the Hotelling transformation (Gonzalez and Wintz 1987). Hotelling transformation is a transformation based on the statistical characteristics of images, which can be directly used to transform digital images (Sonka et al. 2014).
10.6.1 Hotelling Transform

Karhunen-Loève [KL] transform
The transform corresponding to the Hotelling transform in the continuous domain. A transformation that converts an image onto a basis of primitive images, defined by diagonalizing the autocovariance matrix of an image set. This set of images is considered to consist of samples of the same random field, and the transformed image is also assumed to belong to the set. The KL transform corresponds to a projection from a vector (or an image treated as a vector) onto an orthogonal space (having uncorrelated components, constructed from the autocorrelation matrix of the set of example vectors). One of its advantages is that the orthogonal components have a natural ordering (based on the largest eigenvalues of the covariance of the original vector space), so the most significant variations can be selected in the data set. The transform can be used as a basis for image compression, for estimating a linear model in a high-dimensional data set, for estimating the dominant modes of variation in a data set, and the like. Also known as the principal component transform (see principal component analysis).

For 2-D images, the effect of the KL transform is equivalent to a rotation transformation. As shown in Fig. 10.14, if the object of Fig. 10.14a is treated as a 2-D distribution, each point on it is represented by a 2-D vector x = [a, b]^T, where a and b are the coordinate values of the point along the x_1 and x_2 axes. Mapping x to y with a KL transform actually creates a new coordinate system, as shown in Fig. 10.14b (the new and old coordinate systems can be merged by a translation that makes the two origins coincide), and the result of the final rotation transformation is shown in Fig. 10.14c.

Fig. 10.14 Effect of KL transform: (a) object in the original x_1-x_2 coordinates; (b) new axes y_1, y_2 along the eigenvectors e_1, e_2; (c) the result after rotation

Discrete Karhunen-Loève [KL] transform
Same as Hotelling transform.

Discrete KL transform
Abbreviation for discrete Karhunen-Loève transform.

Hotelling transform
An image transformation based on image statistical features that can be performed directly on digital images. It is the basis of principal component analysis.
The corresponding transformation in the continuous domain is the Karhunen-Loève transform. Given a set of M random vectors of the form

x_k = [x_{k1}, x_{k2}, \ldots, x_{kN}]^T,   k = 1, 2, \ldots, M

the Hotelling transform proceeds as follows:

1. Calculate the mean vector and the covariance matrix of the random vectors:

m_x = \frac{1}{M} \sum_{k=1}^{M} x_k

C_x = \frac{1}{M} \sum_{k=1}^{M} x_k x_k^T - m_x m_x^T

2. Calculate the eigenvectors and eigenvalues of the covariance matrix: let e and λ be an eigenvector of C_x and the corresponding eigenvalue, respectively; then

C_x e = \lambda e

3. Calculate the Hotelling transform: the eigenvalues obtained above are arranged monotonically from large to small. Let A be the matrix whose rows are the eigenvectors of C_x, and compute

y = A (x - m_x)

The mean of the vector y resulting from this transformation is 0, i.e., m_y = 0, and the covariance matrix of y can be obtained from A and C_x as C_y = A C_x A^T. C_y is a diagonal matrix whose elements on the main diagonal are the eigenvalues of C_x, i.e.,

C_y = \begin{bmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_N \end{bmatrix}
Here, the elements off the main diagonal are 0; that is, the elements of the vector y are uncorrelated. C_x and C_y have the same eigenvalues and the same eigenvectors. Mapping x to y with the Hotelling transform actually creates a new coordinate system whose axes are in the directions of the eigenvectors of C_x.

Principal component transform
Same as Hotelling transform or KL transform.

Eigenvalue transform
Same as Hotelling transform.

Eigenspace representation
See principal component representation.

Eigen-decomposition
A special case of singular value decomposition. Let A be a d × d square matrix with eigenvalues λ_i and eigenvectors x_i, i = 1, 2, ..., d. Assuming that the eigenvectors are linearly independent, A can be expressed as A = XΛX^{−1}, where X is a d × d matrix whose i-th column is x_i, and Λ is the d × d diagonal matrix whose diagonal terms are the corresponding eigenvalues, Λ_{ii} = λ_i. Since A = XΛX^T for symmetric matrices, the eigenvectors can be chosen to be mutually orthogonal and normalized, so X is an orthonormal matrix.

Updating eigenspace
An algorithm that implements progressive updating of an eigenspace expression. Such algorithms can be easily implemented with the help of many methods, such as active learning.

Eigenvector
A non-zero vector x that satisfies a particular condition: for a given matrix A and some scalar (eigenvalue) k, x satisfies Ax = kx.

Eigenvector projection
Projection onto the PCA basis vectors.

Eigenvalue
A scalar k that satisfies a particular condition: for a given matrix A and some non-zero vector (eigenvector) x, k can be solved from Ax = kx.

Eigenvalue analysis
Same as principal component analysis.

Eigenvalue decomposition
A commonly used matrix decomposition method. The decomposable matrix C is symmetric (M = N), and the decomposition is expressed as
C = U \Lambda U^T = [u_0, \ldots, u_{n-1}] \begin{bmatrix} \lambda_0 & & \\ & \ddots & \\ & & \lambda_{n-1} \end{bmatrix} \begin{bmatrix} u_0^T \\ \vdots \\ u_{n-1}^T \end{bmatrix} = \sum_{i=0}^{n-1} \lambda_i u_i u_i^T

The eigenvalues can be positive or negative and can be arranged in descending order: λ_0 ≥ λ_1 ≥ ... ≥ λ_{n−1}. The decomposition is commonly used for image compression and for solving least-squares problems. In solving the least-squares problem, a special case is encountered in which the symmetric matrix C is composed of a sum of outer products:

C = \sum_i a_i a_i^T = A A^T

where the matrix A consists of all the column vectors a_i placed side by side. In this case all eigenvalues λ_i are guaranteed to be non-negative, and the resulting matrix C is semidefinite: x^T C x ≥ 0 for all x. If the matrix C is in addition of full rank, then all eigenvalues are positive and the matrix is said to be symmetric positive definite.

Symmetric positive definite [SPD]
See eigenvalue decomposition.

Positive definite matrix
A special matrix whose eigenvalues are all greater than 0. An arbitrary vector can be expanded into a linear combination of the eigenvectors of a positive definite matrix. Compare positive semidefinite matrix.

Positive semi-definite
See eigenvalue decomposition.

Positive semidefinite matrix
A special matrix whose eigenvalues are all greater than or equal to 0. Compare positive definite matrix.

Generalized eigenvalue problem
An extension of the eigenvalue problem. Let A = [a_{j,k}] be an n × n matrix, and consider the vector equation Ax = λx, where λ is a number. A non-trivial value λ is called an eigenvalue of the matrix A, and the corresponding solution x (obtained by substituting λ) is called an eigenvector of A. The set of all eigenvalues is called the spectrum of A, and the largest among them is called the spectral radius of A. The problem of finding the eigenvalues and eigenvectors of A is the generalized eigenvalue problem.

Generalized eigenvalue
See generalized eigenvalue problem.
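A minimal NumPy sketch of the Hotelling transform steps given above — compute the mean and covariance of a set of random vectors, take the eigendecomposition, and project — is shown below; the synthetic data and the mixing matrix are arbitrary choices made only for this illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.5, 0.0],
                                           [0.0, 1.0, 0.3],
                                           [0.0, 0.0, 0.2]])   # M = 1000 correlated 3-D vectors

m_x = X.mean(axis=0)                          # mean vector m_x
C_x = np.cov(X, rowvar=False, bias=True)      # covariance matrix (1/M normalization)

eigvals, eigvecs = np.linalg.eigh(C_x)        # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]             # sort from large to small
A = eigvecs[:, order].T                       # rows of A are eigenvectors of C_x

Y = (X - m_x) @ A.T                           # Hotelling transform y = A (x - m_x)
C_y = np.cov(Y, rowvar=False, bias=True)
print(np.allclose(C_y, np.diag(eigvals[order]), atol=1e-10))   # True: C_y is diagonal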
10.6.2 Principal Component Analysis

Principal component
1. In a multi-channel image, an orientation or axis obtained by approximating the entire image, or a selected region of the image, by a multidimensional ellipsoid. The first principal component is parallel to the longest axis of the ellipsoid, and all other principal components are perpendicular to it. The principal components can be calculated from the principal-axis transformation based on the covariance matrix of the data. The principal component representation is an optimal expression of multi-channel data because the components are uncorrelated, and principal components with low variance can be ignored without affecting the quality of the data, thereby reducing the number of channels required to express a multi-channel image. See Hotelling transform.
2. See principal component analysis.

Principal component analysis [PCA]
An analysis method for determining the uncorrelated components of a data ensemble, based on the Hotelling transform or the KL transform. Each image in the ensemble is treated as one realization of a random field, and each pixel in the image is treated as one realization of a random vector. Principal component analysis decomposes the space expressing the features of the scene into a high-dimensional linear subspace that optimally expresses the data under the minimum-variance criterion. For principal component analysis, the covariance matrix of the data to be diagonalized needs to be calculated. Consider the self-covariance function over the outputs of the set of random experiments

C(i, j) = E\{[x_i(m, n) - x_{i0}][x_j(m, n) - x_{j0}]\}

where x_i(m, n) is the value of pixel (m, n) in the i-th band (channel), x_{i0} is the mean of the i-th band, x_j(m, n) is the value of the same pixel (m, n) in the j-th band, x_{j0} is the mean of the j-th band, and the expected value is calculated over all outputs of the random experiment, i.e., over all pixels in the image. For an M × N image,

C(i, j) = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} [x_i(m, n) - x_{i0}][x_j(m, n) - x_{j0}]

For example, for an image with 3 bands, the variables i and j take only 3 values, so the covariance matrix is a 3 × 3 matrix. If the data are uncorrelated, then C is a diagonal matrix, i.e., C(i, j) = 0 if i ≠ j. To this end, the data need to be transformed with a transformation matrix A, where A consists of the eigenvectors of the covariance matrix of the untransformed data.

This is a statistical method for reducing the dimensionality of data and is widely used in many image technologies (such as point distribution models and
eigenspace-based recognition). Fundamentally, the difference between a random vector x and the sample population mean μ can be expressed as the product of the eigenvector matrix A of the sample covariance matrix and the projection weight vector y:

y = A (x - \mu)

from which

x = A^{-1} y + \mu

In general, only a subset of the components of y is sufficient to approximate x. The elements of this subset correspond to the largest eigenvalues of the covariance matrix. See Karhunen-Loève transform.

Axis of elongation
1. A straight line that minimizes the second moment of the data points. If {x_i} is the set of data points and d(x, L) is the distance from point x to the straight line L, then the elongation axis A minimizes Σ_i d(x_i, A)². Let μ be the mean of {x_i} and define the dispersion matrix S = Σ_i (x_i − μ)(x_i − μ)^T. Then the elongation axis is the eigenvector of S with the largest eigenvalue. See principal component analysis.
2. The longest centerline of the minimum enclosing rectangle, with the largest aspect ratio, that surrounds the object.

Principal component representation
See principal component analysis.

Principal component basis space
In principal component analysis, the space generated by the basis formed by the eigenvectors, i.e., the eigen-directions of the covariance matrix.

2-D PCA
A 2-D extension of principal component analysis. Basic principal component analysis is performed on 1-D signals: when performing a principal component analysis operation on a 2-D image, the image first needs to be straightened into a vector. 2-D PCA does not require this step and uses the image directly as the operation object.

Bidirectional 2D-PCA [B2D-PCA]
A generalization of 2-D PCA. It considers both the operation along the image row direction and the operation along the image column direction; the 2-D PCA values in both directions can be obtained, and the analysis is then performed on this basis. Also written (2D)²PCA.

PCA algorithm
An algorithm for modeling targets with a small number of parameters based on principal component analysis techniques. There are m targets in the training set,
represented by x_i, where i = 1, 2, ..., m. The algorithm consists of the following four steps:

1. Calculate the mean of the m sample targets in the training set:

\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i

2. Calculate the covariance matrix of the training set:

S = \frac{1}{m} \sum_{i=1}^{m} (x_i - \bar{x})(x_i - \bar{x})^T

3. Construct the matrix

\Phi = [\phi_1 \,|\, \phi_2 \,|\, \cdots \,|\, \phi_q]

where ϕ_j, j = 1, ..., q, are the eigenvectors of S corresponding to the q largest eigenvalues.

4. Given Φ and \bar{x}, each target can be approximated as

x_i \approx \bar{x} + \Phi b_i,   where   b_i = \Phi^T (x_i - \bar{x})

In the third step of the principal component analysis algorithm, q is determined by requiring that the sum of the q largest eigenvalues be greater than 98% of the sum of all eigenvalues.

Kernel PCA
A kernel technique applied to principal component analysis (replacing the scalar product with a nonlinear kernel) to obtain a nonlinear variant.

Kernel 2D-PCA
Kernel PCA in the 2-D case.

Kernel-based PCA [KPCA]
The extension of principal component analysis obtained by introducing kernel techniques.
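A minimal NumPy sketch of the four-step PCA algorithm above, including the 98% eigenvalue-energy rule for choosing q, is given below; the training vectors are synthetic stand-ins used only for illustration.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))                  # m = 200 training targets x_i of dimension 50

x_bar = X.mean(axis=0)                          # step 1: mean of the training set
S = (X - x_bar).T @ (X - x_bar) / X.shape[0]    # step 2: covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]           # largest eigenvalues first
q = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), 0.98) + 1
Phi = eigvecs[:, :q]                            # step 3: q leading eigenvectors

b = (X - x_bar) @ Phi                           # step 4: b_i = Phi^T (x_i - x_bar)
X_approx = x_bar + b @ Phi.T                    # x_i ~ x_bar + Phi b_i
print(q, np.mean((X - X_approx) ** 2))          # chosen q and mean reconstruction error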
Nonlinear PCA [NLPCA] A class of improvements in principal component analysis methods using nonlinear methods. It can improve the performance of various analysis tasks (such as feature extraction) in the case where the component correlation is not too strong or the number of components is large. Unified PCA [UPCA] A generalization of principal component analysis. The traditional covariance matrix is first generalized to the generalized covariance matrix, where each element of the generalized covariance matrix is the generalized covariance of two random vectors (rather than the scalar in the traditional covariance matrix). In the framework of unified principal component analysis, traditional PCA and 2-D PCA are just a special case. Topological PCA [TPCA] An improvement in principal component analysis. When estimating the covariance of samples, the topological relationship between them is also considered, so that their extension ability is improved in recognition and reconstruction. One of its main problems is the high computational complexity. Probabilistic principal component analysis [PPCA] For the improvement of principal component analysis by means of probability theory, principal component analysis is described as the maximum likelihood solution of a probability hidden variable model. It is a typical case in the framework of a linear Gaussian model (where all marginal distributions and conditional distributions are Gaussian). The advantages of probabilistic principal component analysis over traditional principal component analysis include: 1. Probabilistic PCA can be used to model class-conditional densities for use in classification applications. 2. Probabilistic PCA represents a restricted form of Gaussian distribution where the number of free parameters is finite, but still allows for the primary correlation of the model data set. 3. Probabilistic PCA can generate runs to provide samples. 4. Probabilistic PCA forms the basis of Bayesian analysis of principal component analysis, in which the dimension of the main subspace can be automatically obtained from the data. 5. The mixture of probability models can in principle be implemented using an expectation maximization algorithm. 6. Combine the probability model with the expectation maximization can treat the missing values in the data set. 7. For PCA, a computationally efficient expectation maximization algorithm can be derived, in which only a few major feature vectors are used and intermediate steps in the evaluation of the data covariance matrix are avoided. 8. The existence of a likelihood function allows direct comparison with other probability density models. In contrast, traditional PCA always assigns a low
reconstruction cost to data points close to the main subspace regardless of how much they deviate from the training data. Probabilistic principal component analysis [PPCA] model A special subspace shape model. It is closely related to factor analysis. The only difference is that the noise term is spherical in the probabilistic PCA model and has a diagonal form in the factor analysis. Principal component analysis-scale-invariant feature transform [PCA-SIFT] Extension of scale-invariant feature transform. There can be different implementations. A method converts a 272-D descriptor obtained from a gradient position-to-orientation histogram (taking the largest 128 eigenvalues) using a PCA trained on a large image library into a 128-D descriptor. There is also a method of calculating the gradients in the X and Y directions point by point in a 39 39 window, and then thus obtained 3042-D vector is reduced by a PCA to a 36-D descriptor. Local feature analysis [LFA] A method based on principal component analysis (PCA) that uses local features to express the local topological properties of a target. Principal curve A form that is required to extend the principle of principal component analysis to a 1-D nonlinear form. The curve in the N-D data space is described using a vector value function f(s), where each component in f(s) is a function of the scalar s (the most natural choice for s is the arc length along the curve). For a given point in the data space, the closest (with Euclidean distance) point on the curve to the given point can always be found. This point is written as s ¼ gf(x), which depends on the particular curve f(s). For continuous data density p(x), the point on the principal curve is the mean point where the points in the data space are projected onto the curve. There can be many principal curves for a given continuous density. Considering a finite set of points and a smooth curve, a two-step iterative process can be used to determine the principal curve, similar to determining the principal component by means of the expectation maximization algorithm. The first principal component is first used to initialize the principal curve and then alternated between the data projection step and the curve re-estimation step. In the projection step, each data point is assigned an s value corresponding to the closest point on the curve. In the next re-estimation step, a point on each curve is assigned a weighted average that is projected to the nearest point of the curve, giving the point closest to the curve the greatest weight. If the subspace is defined to be linear, this process will converge to the first principal component and is equivalent to the power method of finding the largest eigenvector in the covariance matrix. Principal surface The result of extending the principal curve to a multidimensional manifold pattern.
10.7 Other Transforms
There are many other useful transforms (Barnsley 1988; Bracewell 1986; Gonzalez and Woods 2018; Russ and Neal 2016).

Hartley transform
A transform similar to the Fourier transform, but whose coefficients are all real numbers rather than the complex coefficients of the Fourier transform. It is an orthogonal transform and a symmetric transform that can be calculated by means of a fast Fourier transform. In 1-D, the Hartley transform kernel can be expressed as

h(x, u) = \frac{1}{\sqrt{N}} \left[ \cos\frac{2\pi x u}{N} + \sin\frac{2\pi x u}{N} \right]

The separable 2-D Hartley transform kernel can be expressed as

h(x, y, u, v) = \frac{1}{\sqrt{N}} \left[ \cos\frac{2\pi x u}{N} + \sin\frac{2\pi x u}{N} \right] \cdot \frac{1}{\sqrt{N}} \left[ \cos\frac{2\pi y v}{N} + \sin\frac{2\pi y v}{N} \right]

Hartley
A unit of information. Its conversion relationship with the bit is 1 hartley = log_2 10 bits.
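The relation between the Hartley transform and the Fourier transform can be checked numerically: with the cos + sin ("cas") kernel above, the 1-D discrete Hartley transform equals the real part of the DFT minus its imaginary part. The following NumPy sketch is an illustration written for this entry, not code from the cited references.

import numpy as np

N = 64
f = np.random.rand(N)

# Discrete Hartley transform from the cas kernel cos + sin
x = np.arange(N)
cas = np.cos(2 * np.pi * np.outer(x, x) / N) + np.sin(2 * np.pi * np.outer(x, x) / N)
H = cas @ f / np.sqrt(N)                       # kernel normalized by 1/sqrt(N)

# Same result from the FFT: H = (Re F - Im F) / sqrt(N)
F = np.fft.fft(f)
print(np.allclose(H, (F.real - F.imag) / np.sqrt(N)))   # True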
Discrete Hartley transform
See Hartley transform.

γ-transformation
The transform function used in gamma correction. It is an exponential transformation.

Gabor transform
A short-time Fourier transform obtained by using a Gaussian function as the window function. Both 1-D signals and 2-D images can be represented as weighted sums of Gabor functions. A typical Gaussian window function is

g(t) = \frac{1}{2\sqrt{\pi a}} e^{-t^2/(4a)},   a > 0

The Fourier transform of the above function is

G(w) = e^{-a w^2},   a > 0
Both window functions, in the time domain and in the frequency domain, are Gaussian functions.

Symmetric component
The result obtained with a symmetric Gabor filter. See Gabor spectrum. Compare anti-symmetric component.

Anti-symmetric component
The result obtained by applying an antisymmetric Gabor filter to the image. See Gabor spectrum. Compare symmetric component.

Gabor filter
A linear filter that performs frequency-domain filtering of an image using a Gabor transform. The filter is constructed by multiplying a complex oscillation by an elliptical Gaussian envelope (determined by two standard deviations and one orientation). The resulting filter is local, selective for orientation, exists at different scales, and, depending on the frequency chosen for the complex oscillation, can adapt to particular brightness patterns (e.g., the edges, bars, and other patterns that trigger simple-cell responses in the mammalian visual cortex). The Gabor filter can be used to model the spatial summation properties of simple cells in the visual cortex. Because it converts images to the Gabor spectrum, it is a powerful tool for texture analysis. It can be used in both the frequency domain and the spatial domain. In practical applications, a pair of real Gabor filters is often used simultaneously to obtain a strong response at a specific frequency and orientation. One of the two is symmetric,

G_s(x, y) = \cos(k_x x + k_y y) \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)

The other is antisymmetric,

G_a(x, y) = \sin(k_x x + k_y y) \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)

where (k_x, k_y) gives the frequency at which the filter responds most strongly. The Gabor filter can also be decomposed into two components: a real part that is symmetric and an imaginary part that is antisymmetric. The mathematical expression for the 2-D Gabor function is

G(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)\right] \exp[2\pi j u_0 x]
Fig. 10.15 Maximum frequency response of a dual-valued Gabor filter bank

Fig. 10.16 The coordinate system used to define the Radon transform (a line l at orientation θ and distance p from the origin O, passing through the point (x, y))
where σ_x and σ_y determine the Gaussian envelope in the x and y directions, respectively, and u_0 represents the frequency of the Gabor function along the axial direction. Figure 10.15 gives an example of the maximum frequency response plot for a dual-valued Gabor filter bank. In the figure, each filter is represented by a centrally symmetric pair of lobes, and the two coordinate axes are normalized spatial frequencies. The center frequencies of the filter bank are {2^{−11/2}, 2^{−9/2}, 2^{−7/2}, 2^{−5/2}, 2^{−3/2}}, and the orientations are {0°, 45°, 90°, 135°}. The Gabor filter provides one of the most efficient filtering techniques for extracting useful texture features at different orientations and scales.

Gabor spectrum
A set of frequency components obtained by performing a Gabor transform on an image. A set of 2-D Gabor filters (having different scales and orientations) is typically used to decompose the image into a series of frequency bands.

Radon transform
An image transform defined by line integration. A planar image can be transformed into projections along straight lines. See Fig. 10.16: given the image function f(x, y)
whose Radon transform R_f(p, θ) is the line integral of f along the line l defined by p and θ (the point (x, y) lies on the line):

R_f(p, \theta) = \int_l f(x, y)\, dl = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\, \delta(p - x\cos\theta - y\sin\theta)\, dx\, dy

The Radon transform R_f(p, θ) is defined on the surface of a semi-cylinder. The Radon inverse transform gives the value of f(x, y) at each point on the line. The Radon transform can be thought of as the integral of a (2-D) function along a set of straight lines and is closely related to the Fourier transform, Hough transform, and trace transform. It can be used to reveal linear trends in an image; a directional texture, for example, shows up as "hot spots" in Radon transform space. The Radon transform can also be thought of as an extension of the Hough transform, a transform that maps an image to a parameter space and highlights the lines in the image. In the definition, p − x cos θ − y sin θ = 0 represents a parameterized line in the image. Peaks can be detected in the PΘ space to identify lines. See Hough transform line finder.

Radon inverse transform
The inverse transform of the Radon transform. Projections along straight lines can be transformed back into a planar image.

Trace transform
A 2-D representation of an image in polar coordinates (with the origin at the center of the image). Like the Radon transform, it traces straight lines in all possible directions from the origin. However, unlike the Radon transform, which computes an integral along each line, other functionals can also be computed along each trace, so it can be seen as a generalization of the Radon transform. In practice, different functionals can be applied to the same image to obtain different trace transforms. From the transformed image, features can be extracted using functionals along the diameter or along rings.

Analytic signal
The complex signal consisting of a 1-D signal as the real part and its Hilbert transform as the imaginary part. That is, if the Hilbert transform of the signal f(x) is denoted f_H(x), the analytic signal is f(x) + j f_H(x). If one knows the real and imaginary parts of the signal, the local energy and phase of the original signal can be calculated.

Riesz transform
The name given to the Hilbert transform when it is extended to 2-D.
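A simple discrete approximation of the Radon transform can be obtained by rotating the image and summing along columns, one projection per angle; the sketch below uses scipy.ndimage.rotate for the rotation (dedicated implementations such as skimage.transform.radon also exist — an assumption about available libraries rather than part of this handbook).

import numpy as np
from scipy.ndimage import rotate

def radon_transform(image, angles):
    """Approximate Radon transform: one column-sum projection per angle (degrees)."""
    projections = []
    for theta in angles:
        rotated = rotate(image, angle=theta, reshape=False, order=1)
        projections.append(rotated.sum(axis=0))      # integrate along one direction
    return np.array(projections).T                   # rows: offset p, columns: angle

image = np.zeros((101, 101))
image[40:60, 48:52] = 1.0                            # a short bright bar
sinogram = radon_transform(image, angles=np.arange(0, 180, 5))
print(sinogram.shape)                                # (101, 36)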
Hilbert transform
A special signal transformation. It is the output s'(t) obtained by convolving the input s(t) with a linear shift-invariant system whose impulse response is 1/(πt). It can be written as

s'(t) = \frac{1}{\pi} \int_{-\infty}^{+\infty} \frac{s(x)}{t - x}\, dx
This transformation shifts the phase of every decomposable sinusoidal component of the signal by 90°, but does not change the amplitudes. It is useful in calculating the phase and local frequency of a signal.

Monogenic signal
The signal produced when the Riesz transform is applied to a 2-D signal. It corresponds to the analytic signal in 1-D.

Slant transform
An orthogonal transform. Let N be an integer power of 2; the N × N Slant matrix is defined by the following recursion:

S_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}

S_N = \frac{1}{\sqrt{2}}
\begin{bmatrix}
1 & 0 & \mathbf{0} & 1 & 0 & \mathbf{0} \\
a_N & b_N & \mathbf{0} & -a_N & b_N & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & I_{(N/2)-2} & \mathbf{0} & \mathbf{0} & I_{(N/2)-2} \\
0 & 1 & \mathbf{0} & 0 & -1 & \mathbf{0} \\
-b_N & a_N & \mathbf{0} & b_N & a_N & \mathbf{0} \\
\mathbf{0} & \mathbf{0} & I_{(N/2)-2} & \mathbf{0} & \mathbf{0} & -I_{(N/2)-2}
\end{bmatrix}
\begin{bmatrix}
S_{N/2} & \mathbf{0} \\
\mathbf{0} & S_{N/2}
\end{bmatrix}

where I_M is the M × M identity matrix, and the coefficients a_N and b_N are (N > 1)

a_N = \left[ \frac{3N^2}{4(N^2 - 1)} \right]^{1/2}

b_N = \left[ \frac{N^2 - 4}{4(N^2 - 1)} \right]^{1/2}
Slant-Haar transform An improvement to the Slant transform, including a jagged waveform or an oblique transform basis vector, with a fast algorithm.
Chapter 11
Point Operations for Spatial Domain Enhancement
In image processing, the spatial domain refers to the space where the pixel positions are located, which is also called image space, and is generally regarded as the original space of the image (Gonzalez and Woods 2008). Spatial image enhancement refers to the enhancement performed directly on pixels in image space (Castleman 1996; Jähne 2004; Nikolaidis and Pitas 2001).
11.1 Fundamentals of Image Enhancement
In spatial image enhancement, the original image is transformed into an enhanced image through an enhancement operation (Gonzalez and Woods 2008; Pratt 2007; Russ and Neal 2016). If the above enhancement operation is only defined at the position of each pixel, or only the information of the single pixel is used in the enhancement operation of each pixel (the information between multiple pixels is not used), it is regarded as a kind of point operation (Zhang 2017a).
11.1.1 Image Enhancement

Image enhancement
A large class of basic image processing techniques. By processing the image, an image that is more "good" and more "useful" to the application is obtained. In this sense, it is primarily the subjective quality of the image that is enhanced. The commonly used technologies can be classified into two types according to their processing space: those based on the spatial domain and those based on the transform domain. More broadly, it is a generic term that covers many image processing operations whose goal is to transform an image to make it more perceptible. Typical operations include contrast enhancement and histogram equalization.
Image enhancement in space domain
An enhancement method that works directly on pixels, i.e., in the spatial domain. Expressed as a formula, g(x, y) = EH[f(x, y)], where f(x, y) and g(x, y) are the images before and after enhancement, respectively, and EH represents an image enhancement operation. If EH is defined only at each (x, y), EH is a point operation; if EH is also defined on a neighborhood of (x, y), EH is often referred to as a template operation. EH can be applied either to a single image f(x, y) or to a series of images f_1(x, y), f_2(x, y), ..., f_n(x, y).

Spatial domain
In image engineering, the space in which the pixels are located. Also called image space, it is generally regarded as the original space of the image.

Image enhancement in frequency domain
A type of image enhancement method implemented by changing the different frequency components of an image, usually in Fourier transform space by means of frequency-domain filters. Different filters remove different frequencies and retain others, so different enhancement effects can be obtained.

Photometric normalization technique
A normalization technique that solves the problem of illumination invariance in the preprocessing stage. Typical techniques include the single-scale retinex algorithm, the homomorphic filtering algorithm, the anisotropic diffusion algorithm, and so on.

Single-scale retinex
A simple grayscale mapping algorithm based on the theory of the retinal cortex (retinex). Also called logarithmic transform. The algorithm consists of two basic steps: (1) local gray-level normalization by dividing by the local mean of a region and (2) converting the gray level to a logarithmic scale, which greatly expands the dark grayscale values and slightly expands the bright grayscale values.

Selectively replacing gray level by gradient
An edge sharpening algorithm. For the image f(x, y), the absolute gradient value |G[f(x, y)]| is first calculated. For a given threshold T, if |G[f(x, y)]| is greater than T at a pixel, the gray value of the pixel is replaced by the gradient value (or by the maximum gray value in the image); if |G[f(x, y)]| is less than T, the gray value of the pixel is retained (or replaced with the smallest gray value in the image). For example, the edge-sharpened image can be calculated by

g(x, y) = \begin{cases} |G[f(x, y)]| & |G[f(x, y)]| \ge T \\ f(x, y) & \text{otherwise} \end{cases}
Linear appearance variation The appearance of the image is relatively smooth and changes slowly. Typical examples are changes caused by lighting fluctuations and small non-rigid target deformations. Performing a translation (Δx, Δy) on the position of the image f(x, y), the linear apparent change in grayscale can often be expressed as follows:
11.1
Fundamentals of Image Enhancement
f ðx þ Δx, y þ ΔyÞ ¼ f ðx, yÞ þ
443
X k i Bi ðx, yÞ i
Among them, Bi(x, y) is a set of basis functions, and ki is a set of weighting coefficients.
11.1.2 Intensity Enhancement Intensity enhancement A method of enhancing brightness only in true color enhancement. The image can be first transformed into the HSI space, and its I component is enhanced by the method of enhancing the grayscale image. Although there is no change in hue and saturation here, the enhancement of the luminance component causes the perception of hue or saturation to be different, so the enhancement effect obtained is not the same as the enhancement of the grayscale image. Intensity gradient Applying the mathematical gradient operator ∇ to a luminance image I gives the luminance gradient ∇I at each pixel. The direction of the brightness gradient indicates the direction of the image in which the maximum brightness change occurs in the local range. The magnitude of the luminance gradient gives a local ratio of changes in image brightness. Figure 11.1 shows an image with a gradient of brightness, with arrows indicating the direction of the brightness gradient. Intensity gradient direction The local direction in which the brightness has the greatest change in the image. See intensity gradient. Intensity gradient magnitude The magnitude of the local rate of change in brightness in the image. See intensity gradient. Position-dependent brightness correction A technique that seeks to obtain common brightness variations in real imaging systems. A typical brightness change is a phenomenon in which the brightness of Fig. 11.1 Luminance gradient diagram
444
11
Point Operations for Spatial Domain Enhancement
a lens system with a limited aperture is reduced due to an increase in distance from the optical axis. This effect is generally noticeable only at the edges of the image. Edge enhancement An image enhancement operation, a process or method of enhancing edges by sharpening an image. By sharpening the high-frequency components in the image or reducing the low-frequency components in the image, the edges in the image are sharpened, and the gradient of the edges is steeper (i.e., the edges are sharpened), and the position and outline of the target or the grayscale changes in the image can be more easily observed. For example, to achieve this, the image convolution with a Laplacian L(x, y) can be added to the original image f(x, y), i.e., the enhanced image g (x, y) ¼ f(x, y) + kL(x, y), where k is the weighting constant. Edge editing An image enhancement technique. Calculate the amplitude and blur of each edge segment in the image, selectively eliminate the edges corresponding to undesired features (such as highlights, shadows, interference visual elements, etc.), and reconstruct the image with the remaining edges. Eliminate unwanted visual features to achieve visually similar but enhanced effects. Unsharp operator An image enhancement operator. The edges in image are sharpened by adding a high-pass filtering result of an original image. The high-pass filter is obtained by subtracting a smoothing result from the image (α is a weighting factor): I unsharp ¼ I þ αðI I smooth Þ Unsharp masking A method of using nonlinear filtering to enhance tiny details in an image, an image enhancement technique. This is done by subtracting a blurred version of the image (corresponding to low frequencies) from the original image. Because only highfrequency detail is left, it can be used to construct an enhanced image (so it is actually a sharpening process or technique). The previous blur version is generally obtained by convolving the original image with a Gaussian smoothing template. It is also an image sharpening method commonly used in the printing industry. By subtracting a blurred version fb(x, y) of the image from an image f(x, y) the contrast difference between the images can be reduced. Image intensifier A device for amplifying the brightness of an image, which can significantly increase the perceived luminous flux.
11.1
Fundamentals of Image Enhancement
445
11.1.3 Contrast Enhancement Contrast enhancement A typical image enhancement technique. With the help of the characteristics of the human visual system, the effect of improving the visual quality of the image is achieved by increasing the contrast between the various parts of the image. One common method is to increase the dynamic range of a gray value interval in an image by means of image gray value mapping, thereby stretching the gray-level contrast between adjacent areas of the space. It extends the distribution of intensity values in an image to use a larger sensitivity range of the output device. This is slightly different from the effect of directly increasing the contrast between the brightness levels of the individual images. Histogram equalization is a contrast enhancement method. An example of contrast enhancement is shown in Fig. 11.2. The left image is the original image, and the right image shows the effect of contrast enhancement. Appearance enhancement transform A generic term for an operation applied to an image to alter or enhance certain aspects of the image. Typical examples include brightness adjustment, contrast adjustment, edge sharpening, histogram equalization, and saturation adjustment or enhancement. Local enhancement Enhancements to the details of certain local areas (sub-images) in the image. In general, the number of pixels in these partial regions tends to be smaller than the number of pixels in the entire image, and their characteristics are often ignored in calculating the characteristics of the entire image, so it is not possible to enhance these local areas of interest when the entire image is enhanced. To this end, these local regions of interest are selected first in the image, and targeted local operations are performed according to the pixel characteristics in these local regions to obtain the desired enhancement effect.
Fig. 11.2 Contrast enhancement
446
11
Point Operations for Spatial Domain Enhancement
Local contrast adjustment A contrast-enhanced form that adjusts the brightness of a pixel not based on the entire image but only on the value of the (neighborhood) pixel that is close to the pixel. Local variance contrast It is the variance of the pixel values in the neighborhood of each pixel; a contrast is the difference between the maximum and minimum of this variance. This feature takes a relatively large value in areas with high texture or brightness variations. Local operator An image processing operator that calculates the output value at each pixel using the value of the neighboring pixels instead of the value of all or most of the pixels in the image. Local preprocessing In image processing, it refers to various filtering processes using neighborhood templates. Diffusion An effect of image processing. By migrating the pixels, the display effect of the image can be softened, and a somewhat embarrassing feeling is obtained, which is similar to the use of a softening lens. Diffusion function An equation describing the diffusion process. It often models the flow phenomena of a physical quantity (such as heat) in a material with a specific conductivity (such as thermal conductivity), such as in variable conductance diffusion. Radial blur Gradually blur the image from the center to get some special effects. Feathering A weighted averaging operation by means of a distance map (giving a larger weight to the pixels in the center of the image and a smaller weight to the pixels near the edge of the image). It can be used to properly stitch together images with large differences in exposure. Adaptive contrast enhancement An image processing technique for performing, e.g., a histogram equalization operation on each local/partial region of an image.
11.1.4 Operator Operator A generic term for a function that represents some kind of transformation applied to image data to transform the data. In image engineering, the operator used to perform
11.1
Fundamentals of Image Enhancement
447
image transformation. Can be used to transform one image into another by performing a certain calculation. An operator takes an image (or region) as input and produces another image (or region). According to the operations it performs, it can be divided into linear operators and nonlinear operators. Operators refer to operations that implement image local operations and their templates (still image transform), such as orthogonal gradient operators, compass gradient operators, and so on. The three basic operator types are point operators, neighborhood operators, and global transforms (such as Fourier transform). See image processing operator. Type 0 operations The processing output for one pixel depends only on the image processing operations of a single corresponding input pixel. Also called point operation. Point operation An image spatial domain operation. Also known as global operations. Each pixel of the entire image is processed in the same way, and the gray value after one pixel processing is only a function of its original value and is independent of the position of the pixel in the image. See type 0 operations. Point transformation A variety of grayscale transformations that implement point operations. The transform function can be linear (such as image negative), piecewise linear (such as gray-level slicing), or nonlinear (such as γ-correction). Point operator An image operator assigned to a gray value of an output pixel that depends only on the gray value of the corresponding input pixel. If the assignment also depends on the position of the pixel in the image, then the point operator is nonuniform; otherwise it is uniform. Compare the neighborhood operator. Type 1 operations The processing output of the image processing operations for one pixel depends not only on a single corresponding input pixel but also on its surrounding (multiple) pixels. Also called local operation. Local operation Same as type 1 operations. Neighborhood-oriented operations A space domain operation in image engineering. Also known as local operations. The input image is processed pixel by pixel, and the value processed by one pixel is a function of its original value and also its neighboring pixel value. Type 2 operations The processing output for one pixel depends on the image processing operations (such as spatial coordinate transformation) of all pixels of the entire image. Also known as global operations.
448
11
Point Operations for Spatial Domain Enhancement
Global operation Same as type 2 operations. Global A concept that opposes the local. The global nature of a (mathematically) target is dependent on all components of the target. For example, the average brightness of an image is a global characteristic because it depends on the brightness values of all pixels. Inverse operator 1. An operator that can recover an image after the action of an operator. Many image processing operators reduce the information content, so there is no exact inverse operator. 2. Performs a low-level image processing operation in which a new image is obtained by replacing (inverting) the value in an image with its “complement.” For binary images, the image value is 1 if the input is 0, and the image value is 0 if the input is 1. For grayscale images, the value depends on the maximum range of grayscale values. If the range of gray values is [0, 255], then an inverse with a value of f is 256 – f. The effect is similar to an image negative.
11.2
Coordinate Transformation
Image coordinate transformation is also called spatial coordinate transformation or geometric coordinate transformation. It is a kind of position mapping operation, which involves the transformation and method of each coordinate position in image space (Russ and Neal 2016). Each pixel in the image has a certain spatial position, and its position can be changed by means of coordinate transformation to obtain the effect of enhancing the image. The coordinate transformation of the entire image is achieved by performing a coordinate transformation on each pixel (Gonzalez and Woods 2018; Sonka et al. 2014; Zhang 2017a).
11.2.1 Spatial Coordinate Transformation Spatial coordinate transformation Conversion method and conversion manner between coordinate positions in the image space. A connection between pixel coordinates can be established. Pixel coordinate transformation A mathematical transformation that links two images (or two frames in a video) in space, indicating how the coordinates of one pixel in one image are obtained from the coordinates of that pixel in another image. A linear transformation can be
11.2
Coordinate Transformation
449
expressed as i1 ¼ ai2 + bj2 + e and j1 ¼ ci2 + dj2 + f, where the coordinates of p2 ¼ (i2, j2) are transformed into p1 ¼ (i1, j1). Write in matrix form: p1 ¼ Ap2 + t, where A ¼ (ab dc ) is the rotation matrix and t ¼ [e f]T is the translation vector. See Euclidean transformation, affine transformation, and homography transformation. Coordinate transformation Abbreviation for spatial coordinate transformation. Image coordinate transformation Same as spatial coordinate transformation. Image coordinates See image plane coordinates and pixel coordinates. Inverse warping Same as backward mapping. It is also commonly used to represent a mapping strategy from a target image to a source image in a general coordinate transforma tion (including projective transformation, affine transformation, etc.). Warping Transforming the image by reparameterizing the 2-D plane. Sometimes referred to as a rubber sheet change, as much as attaching an image to a piece of rubber and stretching according to predetermined rules. Given an image I(x) and a 2-D to 2-D mapping w: x ! x0 , the warped image W(x) can be represented as I[w(x)]. The warping function is designed to map certain control points p1. . .n in the source image to the special positions p0 1. . .n of the target image. A typical second-order warping is a special case of polynomial warping. For a pixel’s original coordinates (x, y), its transformation coordinates (x', y') are given by the following equation: x0 ¼ a0 x2 þ a1 y2 þ a2 xy þ a3 x þ a4 y þ a5 y0 ¼ b0 x2 þ b1 y2 þ b2 xy þ b3 x þ b4 y þ b5 Among them, the coefficients a0, a1, . . ., a5, b0, b1, . . ., b5 are used to introduce more complex distortion into the image, such as turning a straight line into a curve. A special application of the second-order warping method is to compensate for lens distortion, especially barrel distortion and pincushion distortion. There are also cubic distortions, using a third-order polynomial and 20 coefficients. Cascade of coordinate transformations A method in which a plurality of coordinate transformations are successively connected in series. For example, in the image coordinate transformation, one pixel point can be sequentially translated, scaled, rotated around the origin, and the like, thereby obtaining a desired coordinate transformation effect. See cascade of transformations.
450
11
Point Operations for Spatial Domain Enhancement
Cascade of transformations The cascading of coordinate transformations is a typical example of how multiple different transformations are performed. For linear transformation, each transformation can be realized by one transformation matrix; and successive transformations can also be represented by matrix multiplication and finally by a single transformation matrix. That is, the cascade matrix can be obtained by multiplication of the matrix, and the result is still a linear transformation. However, it is necessary to note that the order of operations of the various transformation matrices participating in the transformation is generally not interchangeable. See cascade of coordinate transformations. Chained transformations Same as cascade of transformations. Compound transforms A series (two or more) transforms that act sequentially on an image. That is, the transformation of the transformation, the transformation of the transformed transformation, etc. Multi-order transforms Same as compound transforms. Multi-pass transforms A cascaded of (simple) transform is used to complete a (complex) transformation of the image. For example, the parameter transformation, 2-D filtering, and resampling operations can be approximated by a series of 1-D resampling and shearing transformations. The benefit is that 1-D transforms are much more efficient than 2-D transforms.
11.2.2 Image Transformation Image flipping Techniques and processes for exchanaging image pixels around a line in a plane to obtain a mirror image of the original image. The line is generally horizontal or vertical, and correspondingly, the image will be flipped upside down or leftside right. Image scaling An operation that increases or decreases the size of an image based on some scale factor. Some types of interpolation methods are needed. See image magnification. Image shrinking An operation to reduce the size of an image for easy viewing. Generally, each part is scaled down with the same factor. Compare image zooming.
11.2
Coordinate Transformation
451
Image zooming The process of increasing the image size by changing the focal length of the lens for easy viewing or obtaining detailed information. It is generally considered that the proportions of the parts are enlarged proportionally (the actual changement varies with the distance to principal axis). Compare image shrinking. Image magnification Enlarging an image to the extent of observation. It is generally relative to the original image size (e.g., 2). Image twirling A nonlinear image transform. It can cause an image to rotate about a spatially varying angle around an anchor point of coordinates (xc, yc): the angle at the anchor point is C and then decreases linearly as the radial distance from the center increases. This effect is limited to areas with a maximum radius of rmax. All pixels outside this area remain unchanged. Because this transformation uses a backward mapping, the inverse mapping function equation is T 1 x :x ¼ T 1 y where r
:y¼
xc þ r cos ðθÞ x0
r r max r > r max
yc þ r sin ðθÞ y0
r r max r > r max
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi d 2x þ d2y , dx ¼ x' – xc, dy ¼ y' – yc, θ ¼ arctan(dx, dy) + C[(rmax – r)/
rmax]. Image rippling A nonlinear image transform that results in an image that produces a localized wavy motion in both the x and y directions. The parameters of the mapping function used are (non-zero) period length Lx, Ly (pixels) and associated amplitude values Ax, Ay. The inverse transform function is given by the following two equations: 0 T 1 x : x ¼ x þ Ax • sin 0 T 1 y : y ¼ y þ Ay • sin
2π • y0 Lx 2π • x0 Ly
Image warping Any processing operations that transform the position of the pixel in the image, thereby changing the pixel distribution. It can also represent a general term for transforming the position of a pixel in an image. It is generally considered that the
452
11
Point Operations for Spatial Domain Enhancement
transformation is on the domain of the image function rather than the value of the image function. The image topology at this time is generally maintained, that is, the original adjacent pixels are still adjacent in the warped image. Image warping produces an image with a new shape. This can be used to correct geometric distortion, align two images (see image rectification), or transform the target shape into a more manageable form (such as turning a circle into a straight line). The image warping problem is often considered in practice as a problem of resampling the source image f(x) given a mapping function x ¼ h(xo) from the target pixel xo to the source pixel x. For example, in optical flow calculations, the optical flow field is often estimated by estimating the position of the source pixel that produces the current pixel, rather than calculating the target pixel that the source pixel is ready to remove. Similarly, when correcting radial distortion, the lens is often corrected by calculating the corresponding pixel position in the original (distorted) image for each pixel in the final (no distortion) image.
11.2.3 Homogeneous Coordinates Homogeneous coordinates A coordinate introduced to facilitate the expression of some image techniques. It is often used to describe points, lines, etc. in the projective space. For example, points (x, y, z) in Euclidean space can be represented by arbitrary λ in homogeneous coordinates (λx, λy, λz, λ). If multiple points are in a multiple relationship with each other, such homogeneous quantities are equal. For example, a 2-D point is represented as (x, y) in Cartesian coordinates and as (x, y, 1) and any multiples in homogeneous coordinates. It can also be seen as a variant of Cartesian coordinates. If the point of the space is represented as a vector, w ¼ ½X
Y
Z T
then the corresponding homogeneous coordinates can be expressed as. wh ¼ ½ kX
kY
kZ
k T
where k is a non-zero constant. If you want to change from homogeneous coordinates back to Cartesian coordinates, you can use the first three coordinate values to divide the fourth coordinate value. Homogeneous representation An expression defined in the projective space.
11.2
Coordinate Transformation
453
Homogeneous vector A vector that represents the same straight line and satisfies the equivalence relationship by using homogeneous coordinates. Augmented vector A vector that can be normalized to the last element of the homogeneous coordinates. Let the homogeneous coordinate of the 2-D point be xh ¼ [xh, yh, wh]T. If the last element wh is used to divide other elements, it can be transformed back to the nonhomogeneous coordinate x. That is, xh ¼ [xh, yh, wh]T ¼ wh[x, y, 1]T ¼ whxn. Here xn ¼ [x, y, 1]T is the augmented vector. Point at infinity A special 2-D point, also known as an ideal point. When the 2-D point is represented by homogeneous coordinates, it is the point where the last element of the homogeneous coordinate is 0. Let the homogeneous coordinate of the 2-D point be xh ¼ [xh, yh, wh]T. If wh ¼ 0, then xh ¼ [xh, yh, 0]T represents the infinity point. The infinity point has no corresponding expression in the nonhomogeneous coordinate system. Ideal point Describe in the continuous domain the point opposite the point affected by the grating in the discrete domain. Sometimes it can also indicate vanishing point. Same as point at infinity. Line at infinity In 2-D space, all points on it are at infinity lines. If you use homogeneous coordi nates, you can write l ¼ [0 0 1]T. Plane at infinity In 3-D space, all points on it are at infinity planes. Corresponds to the line at infinity in the 2-D space. If you use homogeneous coordinates, you can write P ¼ [0 0 0 1]T.
11.2.4 Hierarchy of Transformation Hierarchy of transformations A hierarchical system consisting of projective transformation, affine transforma tion, similarity transformation, and isometric transformation (including rigid body transformation and Euclidean transformation). Projective transformation is the most general transformation, and the other three transformations are in turn a special case of its previous transformation. Projective transformation Coordinate transformation in 3-D projection. The projective transformation matrix has a total of eight degrees of freedom, which can be determined by four pairs of reference points, so it is also called a four-point mapping. By gradually
454
11
Point Operations for Spatial Domain Enhancement
Fig. 11.3 Results of similarity transformation on a polygonal target
2
1
3
reducing the degree of freedom in the projective transformation, affine transforma tion, similarity transformation, and isometric transformation on the plane can be obtained in sequence. Also known as “projectivity” from one projection plane to another. It can be described by a non-singular 3 3 matrix represented by homoge neous coordinates. This transformation matrix has nine elements, but only eight degrees of freedom, because only the ratio of the projected coordinates is meaningful. Four-point mapping Same as projective transformation. Three-point mapping A coordinate transformation that maps a triangle to another triangle. It mainly includes translation transformation, scaling transformation, rotation transfor mation, stretch transformation, and shearing transformation. Similarity transformation A special affine transformation. A matrix that transforms a point p ¼ ( px, py) similarly to another point q ¼ (qx, qy) can be expressed as 2
3 2 qx s cos θ 6 7 6 4 qy 5 ¼ 4 s sin θ 1
s sin θ s cos θ
0
0
32 3 tx px 76 7 t y 5 4 py 5 1
1
or written in block matrix form: q¼
sR
t
0T
1
p
where s (>0) is a coefficient representing isotropic scaling; R is a special 2 2 orthogonal matrix corresponding to the rotation operation (RTR ¼ RRT ¼ I, and det (R) ¼ 1); t ¼ [tx, ty]T is a 1 2 translation vector; and 0 is a 2 1 vector. The similarity transformation matrix has a total of four degrees of freedom. Figure 11.3 shows the three results of the similarity transformation of the leftmost polygon object with s ¼ 1.5, θ ¼ 90 and t ¼ [1, 0]T; s ¼ 1, θ ¼ 180 and t ¼ [4, 8]T; and s ¼ 0.5, θ ¼ 0 and t ¼ [5, 7]T, respectively.
11.2
Coordinate Transformation
455
The transformation of changing one target to another that looks similar. Formally speaking, a comformal map preserves the ratio of distances (magnification). The transformation matrix T can be written as T ¼ B1AB, where A and B are similarity matrices, that is, the same transform after the basis alternation. Typical examples include rotation, translation, expansion, and contraction. Metric stratum A set of similarity transformations (i.e., a rigid body transformation plus a scaling transformation). A target undergone the similarity transformation can be recovered without additional information such as the length of a given (edge). Scaled rotation Same as similarity transformation. The transformation from point p to point q can be expressed as q ¼ sRp þ t where s is an arbitrary scale factor. Isometric transformation A special simplified affine transformation. A matrix that transforms a point p ¼ ( px, py) isometrically to another point q ¼ (qx, qy) is expressed as 2
3 2 qx e cos θ 6 7 6 4 qy 5 ¼ 4 e sin θ 1
sin θ cos θ
0
0
32 3 tx px 76 7 t y 5 4 py 5 1
1
or written in block matrix form: q¼
R
t
0T
1
p
where e ¼ 1, R is generally a 2 2 orthogonal matrix (det(R) ¼ 1), t ¼ [tx ty]T is a 1 2 translation vector, and 0 is a 2 1 vector. The isometric transformation matrix has a total of three degrees of freedom. The isometric transformation can be further divided into a rigid body transfor mation and a Euclidean transformation. Isometry A transformation that maintains distance in image transform. If for all coordinate pairs (x, y), there is |x – y| ¼ |T(x) – T( y)|, and then transform T: x ! y is an isometric transformation. Symmetry group For a given target, all sets that make the target and the original state satisfy the isometric transformation. For example, a square whose center of mass is placed at
456
11
Point Operations for Spatial Domain Enhancement
Fig. 11.4 Results of Euclidean transformation on a polygonal target
2 1
3
the origin has both rotational symmetry (four isometric transformations) and reflection symmetry (five isometric transformations), where the symmetric group contains 4 5 member transformations. Euclidean transformation The transformation in Euclidean space (i.e., maintaining the configuration in Euclidean space). Common rotation transformation and translation transforma tion are typical examples. Euclidean transformation is often representated by using homogeneous coordinate. Also refers to an isometric transformation that expresses the motion of a rigid body with a rotation before translation. A matrix HI that Euclidean transforms a point p ¼ ( px, py) to another point q (qx, qy) can be expressed in the form of a block matrix:
R q ¼ HI p ¼ T 0
t p 1
The Euclidean transformation can be seen as a special case of rigid body transformation. Figure 11.4 shows the results of Euclidean transformation on the leftmost polygon target with three sets of parameters θ ¼ 90 and t ¼ [2, 0]T; θ ¼ 90 and t ¼ [2, 4]T; and θ ¼ 0 and t ¼ [4, 6]T, respectively. Rigid body transformation An isometric transformation that preserves all distances among points in a region. A matrix expression that transforms a point p ¼ ( px, py) through a rigid body transformation to another point q ¼ (qx, qy) can be expressed as a block matrix form: q ¼ HI p ¼
R
t
0T
1
p
where R is a general orthogonal matrix, det(R) ¼ 1, t is a 2 1 translation vector, and 0 is a 2 1 vector. Figure 11.5 is given in turn for the result of the rigid body transformation to the leftmost polygon target with e ¼ 1, θ ¼ 90 , and t ¼ [2, 0]T; e ¼ 1, θ ¼ 180 , and t ¼ [4, 8]T; and e ¼ 1, θ ¼ 180 , and t ¼ [5, 6]T, respectively.
11.2
Coordinate Transformation
457
Fig. 11.5 Results of rigid body transformation on a polygonal target
2
1
3
Rigid body An object whose shape and size are always constant under the influence of any external force. A particle system (set) in which the distance between the mass points remains constant can also be described. In practice, many objects in an image can be considered as rigid bodies.
11.2.5 Affine Transformation Affine transformation A special projective transformation. A non-singular linear transform followed by a translation transformation to form a combined transformation. On a plane, a matrix that affine transforms a point p ¼ ( px, py) to another point q ¼ (qx, qy) can be expressed as 2
qx
3
2
a11
6 7 6 4 qy 5 ¼ 4 a21 1 0
a12 a22 0
tx
32
px
3
76 7 t y 5 4 py 5 1 1
or written in block matrix form: q¼
A 0T
t p 1
where A is a 2 2 nonsingular matrix, t ¼ [tx ty]T is a 1 2 translation vector, and 0 is a 2 1 vector. The transformation matrix has a total of six degrees of freedom. An affine transformation is a special set of transformations in Euclidean geometry. It transforms a straight line into a straight line, a triangle into a triangle, and a rectangle into a parallelogram. More generally, it maintains the following properties of the structure to be transformed: 1. Collinearity of points: If the three points are on the same line, they are still in a straight line after the image affine transformation, and the middle point is still in the middle of the other two points.
458
11
Point Operations for Spatial Domain Enhancement
Table 11.1 Summary of various special affine transformation coefficients Transformation Translating with Δx, Δy Scaling with[Sx, Sy] Rotating counterclockwise with angle θ Shearing with [Jx, Jy]
a11 1 Sx cosθ 1
a12 0 0 sinθ Jy
a21 0 0 sinθ Jx
a22 1 Sy cosθ 1
tx Δx 0 0 0
ty Δy 0 0 0
Fig. 11.6 Results of affine transformation of a polygonal target
1
2 3
2. The parallel lines are kept parallel, and the intersecting lines still intersect after the affine transformation. 3. The length ratio of segments on a given line remains constant. 4. The area ratio of the two triangles remains constant. 5. The ellipse remains elliptical, the hyperbola remains hyperbolic, and the parabola remains parabolic. 6. The centroid of a triangle (also including other geometry) is mapped to the corresponding centroid. Common translation transformations, scaling transformations, rotation trans formations, and shearing transformations are special cases of affine transformations. The corresponding affine transformation coefficients are shown in Table 11.1. The affine transformation can be analytically represented as a matrix form f (x) ¼ Ax + b, where the determinant det(A) 6¼ 0 for the square matrix A. In 2-D, the matrix is 2 2; in 3-D, the matrix is 3 3. Figure 11.6 shows the three results of the transformation "of the leftmost # polygon 1 1=2 object using the affine transformations expressed by A1 ¼ and t1 ¼ 1=2 1 " # " # 1=2 1 3=2 1=2 2 0 4 and t2 ¼ , and A3 ¼ and t3 ¼ , , A2 ¼ 1 3 2 1 1 1=2 1 respectively. Affine fundamental matrix A fundamental matrix obtained from a pair of cameras having affine observation conditions. It is a matrix of 3 3, where the elements in the 2 2 submatrix in the upper left corner are all 0.
11.2
Coordinate Transformation
459
Affine warp The shape change of target in the image satisfies or according to the affine transformation. Affine One of the first terms used by Euler. The invariant properties of geometric objects under affine transformation (mapping) in affine geometry studies. This includes parallelism, cross ratio, adjacency, etc. Affine arc length Considering the parameter equation f(u) ¼ [x(u), y(u)] of the curve, the arc length does not remain constant under affine transformation. Affine length is Z lðuÞ ¼
u
1
ðx0 y00 x00 y0 Þ3
0
It remains unchanged under affine transformation. Affine length Same as affine arc length. Affine moment Derived from the second and third moments of the object, it can be used to describe the object shape, and the four moments are invariant under the affine transforma tion. If Mpq is used to represent the p + q-order central moment of the region, the affine invariant moment can be written as I 1 ¼ M 20 M 02 M 211 =M 400
I 2 ¼ M 230 M 203 6M 30 M 21 M 12 M 03 þ 4M 30 M 212 þ 4M 221 M 03 3M 221 M 212 =M 10 00
I 3 ¼ M 20 M 21 M 03 M 212 M 11 ðM 30 M 03 M 21 M 12 ÞþM 02 M 30 M 12 M 221 =M 700 I 4 ¼ M 320 M 203 6M 220 M 11 M 12 M 03 6M 220 M 02 M 21 M 03 þ 9M 220 M 02 M 212 þ12M 20 M 211 M 21 M 00 þ 6M 20 M 11 M 02 M 30 M 03 18M 20 M 11 M 02 M 21 M 12 8M 311 M 30 M 03 6M 20 M 202 M 30 M 12 þ 9M 20 M 202 M 221 þ12M 211 M 02 M 30 M 12 6M 11 M 202 M 30 M 21 þ M 302 M 230 =M 11 00 Affine flow In motion analysis, a method of determining the motion of a patch on a scene. To do this, the affine transformation parameters need to be estimated to transform a slice from its position in the field of view to a position in another field of view. In other words, the motion of the surface patch is determined by estimating the affine transformation parameters that transform the position of the surface patch from one viewpoint to another.
460
11
Point Operations for Spatial Domain Enhancement
Affine curvature Curvature measure based on affine arc length. Consider the curve of the parameter equation f(u) ¼ [x(u), y(u)], whose affine curvature is (l is the affine arc length) 000
000
cðlÞ ¼ x00 ðlÞy ðlÞ x ðlÞy00 ðlÞ Affine motion model A model for expressing motion by means of affine transformation. Here, the change caused by the motion is regarded as an affine transformation of the coordinates of the moving object. The general affine transformation has six parameters, so the affine motion model also has six parameters. 6-coefficient affine model A linear polynomial parameter model describing affine transformation. The mathematical expression of the model is
u ¼ k1 x þ k2 y þ k3 v ¼ k4 x þ k5 y þ k6
where (x, y) is the coordinate of a point in the image, and (u, v) is the image coordinate of the point corresponding to the affine transformation. k1 to k6 are six model parameters. Compared to the Euclidean transformation, k1, k2, k4 and k5 are rotation parameters, and k3 and k6 are translational parameters. 8-coefficient bilinear model A bilinear polynomial parameter model describing affine transformation. Also known as bi-linear transform. The mathematical expression of the model is
u ¼ k 1 xy þ k2 x þ k3 y þ k 4 v ¼ k5 xy þ k6 x þ k7 y þ k8
where (x, y) is the coordinate of a point in the image, and (u, v) is the image coordinate of the point corresponding to the affine transformation. k1 to k8 are eight model parameters. Compared to the 6-coefficient affine model, a quadratic cross term is added here. Bi-linear transform A spatial non-affine transformation in image geometry. A pixel-by-pixel mapping between the original image and the transformed image can be established. The transformed pixel coordinates are the product of two linear functions of the original pixel coordinates. See 8-coefficient bilinear model.
11.2
Coordinate Transformation
461
11.2.6 Rotation Transformation Rotation transformation A common spatial coordinate transformation. The relative orientation of the point relative to a reference point or line can be changed, and the point with the coordinates (X, Y, Z ) is rotated to the new position (X0 , Y0 , Z0 ) by means of the rotation matrix with a rotation angle (θx, θy, θz) by means of the rotation matrix. It is also a typical three-point mapping transformation. Rotation matrix A matrix for rotation transformation. Consider the rotation of a point around the coordinate axis in the 3-D space, and set the rotation angle to be defined clockwise from the direction of the rotation axis. Turning a point around the X coordinate axis with an angle θx can be achieved with the following rotation matrix: 2
1
60 6 Rθx ¼ 6 40
0
0
cos θx sin θx
sin θx cos θx
0
0
0
0
3
07 7 7 05
1
Turning a point around the Y coordinate axis with an angle θy can be achieved with the following rotation matrix: 2
cos θy
6 0 6 Rθy ¼ 6 4 sin θy 0
0 sin θy 1
0
0 0
cos θy 0
0
3
07 7 7 05 1
Turning a point around the Z coordinate axis with an angle θz can be achieved with the following rotation matrix: 2
cos θz 6 sin θ z 6 Rθ z ¼ 6 4 0 0
sin θz cos θz
0 0
0 0
1 0
3 0 07 7 7 05 1
The rotation matrix can be thought of as a linear operator that selects a vector in a given space. A rotation matrix has only three degrees of freedom in 3-D, and only one degree of freedom in 2-D. In the 3-D space, there are three eigenvalues: 1, cosθ + jsinθ, and cosθ – jsinθ, where j is an imaginary unit. The rotation matrix in 3-D space has nine elements, but only three degrees of freedom, because it needs to satisfy six orthogonal constraints. It can be parameterized in a variety of ways,
462
11
Point Operations for Spatial Domain Enhancement
Fig. 11.7 Axis/angle representation of rotations
n
v
u
v
v||
u
v
v
such as Euler angles, yaw-pitch-roll, rotation angles around the coordinate axes, axis-angle curve representation, etc. See orientation, rotation representation, and quaternion. Rotation operator A linear operator described by a rotation matrix. Rotation representation A formal method of describing rotation and their algebra. The most commonly used one is definitely the rotation matrix, but quaternions, Euler angles, yaw-pitch-roll, rotation angles around the coordinate axes, and axis-angle curve representation can also be used. Axis/angle representation of rotations A 3-D rotation representation. Using a rotation axis n and a rotation angle θ is also equivalent to using a 3-D vector w ¼ θn. To calculate the equivalent motion, see Fig. 11.7. First project the vector v onto the axis n to get
vk ¼ nðn • vÞ ¼ nnT v This is the component of v that is unaffected by rotation. Next, calculate the orthogonal difference between v and n:
v⊥ ¼ v vk ¼ I nnT v Rotate the above vector by 90 degrees with vector outer product: v ¼ n v ¼ ½n v where [n] is the matrix form of the outer product operator of the vector n ¼ (nx, ny, nz):
11.2
Coordinate Transformation
463
2
0
6 ½n ¼ 4 nz ny
nz 0 nx
ny
3
7 nx 5 0
Note that turning this vector to another 90 degrees is equivalent to recalculating the following: v ¼ n v ¼ ½n2 v ¼ v⊥ and so
vk ¼ v v⊥ ¼ v þ v ¼ I þ ½n2 v It is now possible to calculate the components of the rotated vector u in the plane: u⊥ ¼ cos θ v⊥ þ sin θ v⊥ ¼
sin θ½n cos θ½n2 v
Put all these items together to get the final rotation vector: n o u ¼ u⊥ þ vk ¼ I þ sin θ½n þ ð1 cos θÞ½n2 v The rotation matrix with the rotation angle θ around the axis n can be written (i.e., Rodriguez0 s formula) Rðn, θÞ ¼ I þ sin θ½n þ ð1 cos θÞ½n2 Rodriguez0 s formula See axis/angle representation of rotations and exponential twist. Rodrigues’ rotation formula An efficient algorithm for rotating a vector around a particular axis in 3-D space by an angle. Exponential twist A 3-D rotation representation. Can be introduced from a limited angle. The basic idea is that the rotation of an angle θ is equivalent to the rotation of k times an angle θ/k. In the extreme case of k ! 1, it can get
464
11
Point Operations for Spatial Domain Enhancement
Fig. 11.8 Schematic diagram of quaternion
k
j –i
1
–1
i
–k
–j
k 1 Rðn, θÞ ¼ lim I þ ½θn ¼ exp ½w k k!1 where w ¼ θn. If the matrix index is expanded by Taylor series (using the equation [n]k + 2 ¼ [n]k, k > 0; suppose θ is radians), then it can get (i.e., Rodriguez0 s formula) exp ½w
θ2 θ3 ¼ I þ θ½n þ ½n2 þ ½n3 þ 2 3! 2 θ3 θ θ4 þ ½n2 ¼ I þ θ þ ½n þ 3! 2 4! ¼ I þ sin θ½n þ ð1 cos θÞ½n2
Quaternion The predecessor of the modern vector concept, proposed by Hamilton, is used to express rotation in vision. Any 3-D rotation matrix R can be parameterized with a vector q ¼ (q0, q1, q2, q3) having four components, where the sum of the squares of the four components is 1. R uniquely determines the rotation. A rotation can have two expressions, q and –q. See the rotation matrix. A number such as ai + bj + ck + d (which is a linear combination of i, j, k, 1), where a, b, c, and d are real numbers, and the imaginary number i2 ¼ j2 ¼ k2 ¼ 1. And there are ij ¼ k, ji ¼ k, jk ¼ i, kj ¼ i, ki ¼ j, and ik ¼ j. The relationship between them can be represented by Fig. 11.8 (where the red arrow indicates multiplication by i and the green arrow indicates multiplication by j). Quaternions represent a 4-D space, so quaternions are often used to refer to quads, such as string grammars. Axis-angle curve representation The rotation expression based on the amount of warping of the relative rotation axis is a unit vector. The quaternion rotation expression is similar.
11.2
Coordinate Transformation
465
z
Fig. 11.9 Unit quaternion and anti-podal quaternion
q2
q1 w
q0 –q2
x
y
Unit quaternion A 4-D vector q 2 R4. The quaternion of unit length can be used to parameterize the 3-D rotation matrix. Given a quaternion with four components (q0, q1, q2, q3), the corresponding rotation matrix R is (let S ¼ q02 – q12 – q22 – q32) 2
S þ 2q21
6 4 2q1 q2 2q0 q3 2q3 q1 þ 2q0 q2
2q1 q2 þ 2q0 q3 S þ 2q22 2q2 q3 þ 2q0 q1
2q3 q1 2q0 q2
3
7 2q2 q3 þ 2q0 q1 5 S þ 2q23
The quaternion (1, 0, 0, 0) gives an identity rotation. The rotation axis is a unit vector parallel to (q1, q2, q3). An expression method for 3-D rotation. A unit quaternion is a vector having unit length with four components and can be written as q ¼ (x; y; z; w), ||q|| ¼ 1. The unit quaternion can be visually expressed by means of the unit ball, as shown in Fig. 11.9, from q0 through q1 to q2 is a smooth path, and the anti-podal quaternion –q2 is the opposite pole of q2, representing the same rotation as q2. Anti-podal quaternion See unit quaternion. Dual quaternion A sorted pair of quaternions that can be used to represent the positional movement of a rigid body in a 3-D space. Rotation A circular motion of a set of points or a target around a (2-D) point or (3-D) line in an image. The given line around by a 3-D rotation is called the axis of rotation. Axis of rotation The line around which the rotation is made. Equivalently, the points on it are fixed during the rotation. Given a 3-D rotation matrix R, the axis of rotation is the eigenvector of R corresponding to the eigenvalue 1.
466
11
Point Operations for Spatial Domain Enhancement
Rotational symmetry A characteristic of a set of points or a target that remains unchanged after a given rotation. For example, a cube has several rotational symmetries that are all relative to a 90 rotation about an axis that passes through the center point of the two opposing surfaces. See rotation and rotation matrix. Rotation estimation Estimating rotation from raw or processed images, video, depth data, and more. A typical example utilizes two sets of corresponding points (or lines, or planes) derived from two rotated versions of one pattern. This problem often has three manifestations: 1. Estimating 3-D rotation from 3-D data, in this case, at least three corresponding points are required. 2. Estimating 3-D rotation from 2-D data, in this case, at least three corresponding points are needed, but there may be multiple solutions. 3. Estimating 2-D rotation from 2-D data, in this case, at least two corresponding points are required. Another issue to consider is related to the effects of noise, where more than the minimum number of points are often required and a minimum mean square estimate is made.
11.2.7 Scaling Transformation Scaling transformation A common spatial coordinate transformation. The distance between the points can be changed, and the size of the object is changed. The scaling transformation is generally performed along the coordinate axis direction, or can be decomposed into a transformation along the coordinate axis direction. The scaling transformation transforms the point having the coordinates (X, Y, Z) to the new position (X0 , Y0 , Z0 ) by means of the scaling matrix with the scaling factor (Sx, Sy, Sz). It is a typical threepoint mapping transformation. Scaling matrix A matrix for scaling transformations. Letting the scaling factor along the X, Y, and Z axes be (Sx, Sy, Sz), then the scaling matrix can be written as
11.2
Coordinate Transformation
467
2
Sx 60 6 S¼6 40
0 Sy
0 0
0
Sz
3 0 07 7 7 05
0
0
0
1
Scaling 1. A method or process of using a zoom technique or a scaling technique on an image. 2. Zoom in or out on a model to fit a set of data. 3. Transform a set of values so that they are in a standard range (such as [0, 1]), often for the purpose of improving the stability of the values. Scaling factor A number that is commonly used to consistently change a set of values. For example, a normalized value between [0, 1] can be obtained by dividing the data set by the difference between its maximum and minimum values (this may also be useful if the mean is subtracted first). There are two main purposes for doing this: one is to ensure that the data from different characteristics have similar amplitudes, and the other is to ensure that all values are not too large or too small. Both of these goals can improve the numerical performance of the algorithm. Isotropic scaling A result of a uniform transformation that multiplies all values of a vector or matrix by the same isotropic scale factor. Screw motion One 3-D motion that includes a rotation around a coordinate axis and a translation along that axis. In the Euclidean transformation x ! Rx + t, when Rt ¼ t, it is a spiral transformation, which can be used to describe the screw motion. Stretch transformation A coordinate transformation that magnifies a target in one direction and shrinks it in the orthogonal direction. It is also a typical three-point mapping transformation. The stretching transformation can be regarded as a special case of the scaling transformation. Let the stretching coefficient be L, L is larger than 1 when the horizontal direction is enlarged and the vertical direction is reduced, and L is smaller than 1 when the horizontal direction is reduced and the vertical direction is enlarged. The stretch transformation matrix can be expressed as 2
L 6 L ¼ 40 0
0 1=L 0
3 0 7 05 1
468
11
Point Operations for Spatial Domain Enhancement
Fig. 11.10 Horizontal stretch transformation diagram
Figure 11.10 is an example of a horizontal stretch transformation, the solid square is the original target before stretching, and the dotted rectangle is the result after stretching. The arrow indicates the direction in which the corner points move.
11.2.8 Other Transformation Translation transformation A common spatial coordinate transformation. It is also a typical three-point mapping transformation. The point with the coordinates (X, Y, Z) can be translated to the new position (X', Y', Z') by means of the translation matrix with the translation amount (X0, Y0, Z0). Translation A transformation of Euclidean space, expressed as follows: x ! T(x) x ! x + t. In the projective space, transforming the plane at infinity point by point without change. Translation matrix A matrix used for translation transformation. If the amount of translation along the X, Y, and Z axes is (X0, Y0, Z0), the translation matrix can be written as 2
1 60 6 T¼6 40 0
0 0 1 0 0 1 0 0
3 X0 Y0 7 7 7 Z0 5 1
Morphic transformation A coordinate transformation that maps a planar area to another planar area, maps one combined area to another combined area, maps a single area to a combined area, or maps a combined area to a single area. Oriented smoothness A constraint that limits the translation vector between two images changing along a direction that has a slight gray value variation. Shearing transformation A basic image coordinate transformation. It is also a typical three-point mapping transformation. It corresponds to a transformation in which only one of the
11.2
Coordinate Transformation
469
Fig. 11.11 Horizontal shearing transformation diagram
horizontal or vertical coordinates of the pixel changes in translation, wherein the corresponding transformation matrix is only one difference from the unit matrix. Shearing transformation can be divided into horizontal shearing transformation and vertical shearing transformation. After the horizontal shearing transformation, the horizontal coordinate of the pixel changes (associated with the vertical coordinate value of the pixel), but its vertical coordinate itself does not change. The horizontal shearing transformation matrix can be written as 2
1
6 Jh ¼ 4 0 0
Jx
0
3
1
7 05
0
1
Among them, Jx represents the horizontal shearing coefficient. After the vertical shearing transformation, the vertical coordinate of the pixel changes (associated with the horizontal coordinate value of the pixel), but its horizontal coordinate itself does not change. The vertical shearing transformation matrix can be written as 2
1
6 Jv ¼ 4 J y 0
0
0
3
1
7 05
0
1
where Jy represents the vertical shearing coefficient. Figure 11.11 is an example of a horizontal shearing transformation, the solid square is the original target before the shearing, and the dotted parallelogram is the result of the shearing, and the horizontal arrow indicates the direction of the translation change. The horizontal shearing transformation here takes the center line of the square height as the reference line, the upper half is translated to the right, and the lower half is translated to the left. It can also be seen that the amount of translation of the pixel is proportional to the distance of the pixel from the reference line. Parametric transformation A transformation that is controlled by a set of (often a small number of) parameters that causes the image or target to be globally deformed. Various affine transformations are belonging to parameter transformations. Extrude A means of passing a target (especially text) out of a paper in an image to transform the 2-D object into a 3-D shape. Figure 11.12 gives an example of a prominent text.
470
11
Point Operations for Spatial Domain Enhancement
Fig. 11.12 An example of text highlighting
11.3
Inter-image Operations
The calculation between images refers to the operation performed in units of images (performed as an array). The result of the operation is a new image. Basic operations include arithmetic and logic operations (Gonzalez and Woods 2008; Zhang 2009).
11.3.1 Image Operation Image operation An operation performed by taking image as unit. It mainly includes arithmetic operations and logic operations. The operation is performed pixel by pixel on one or more images, and the result is a new image. Arithmetic and logic operations only involve the position of one pixel in each image at a time, so it can be done “in situ,” that is, the result of an arithmetic operation or logic operation at the (x, y) position can be stored in the corresponding position of one of the images, because that position will not change after being updated to the output. Binary operation Processing operations on binary images. Commonly used are various logic operations. Triangular norms A family of binary operations with deterministic values [0, 1] combined with the AND operator or logical conjunctions. Also known as the T-norm. For different combining functions, there are several possible trigonometric norms, including traditional logic AND. Elementwise operation One type of operation on an image at the pixel level is typically done pixel by pixel. Typical examples are logic operations and arithmetic operations between images (in accordance with array arithmetic rules, rather than in accordance with matrix arithmetic rules). Pixel by pixel Sort the pixels in the image and do the same operation in sequence.
11.3
Inter-image Operations
=
471
+
(1 – )
G
B
F
Fig. 11.13 Operation example of over operator
Over operator An image combination operator. Given a foreground image F and a background image B, the combined result G of the overlay operator is G ¼ ð1 αÞB þ αF This operation is equivalent to first subtracting the background image B with a factor (1–α) and then adding the gray value corresponding to the foreground F weighted by α. Figure 11.13 gives an example where α ¼ 0.25.
11.3.2 Arithmetic Operations Arithmetic operation A type of operation performed on corresponding pixels of two grayscale images. The arithmetic operations between two pixels p and q include: 1. 2. 3. 4.
Addition, denoted as p + q Subtraction, denoted as p – q Multiplication, denoted as p * q (also denoted as pq and p q) Division, denoted as p q
The meaning of each of the above operations is to calculate a gray value based on the gray value of two pixels, and this gray value will be the pixel at the same position of the corresponding output image. Image arithmetic Contains general terminology (and sometimes also logic operations) for all forms of image processing operations based on arithmetic operations on two images. See pixel addition operator, pixel subtraction operator, pixel multiplication opera tor, pixel division operator, and image blending. Image addition An arithmetic operation performed on an image. Used to combine the content of the corresponding pixels of the two images or add a positive or negative constant to
472
11
+
Point Operations for Spatial Domain Enhancement
=
Fig. 11.14 Example of pixel addition
the pixel value of an image (to make the image brighter or darker). See pixel addition operator. Pixel addition operator A low-level image processing operator takes two grayscale images I1 and I2 as inputs and returns an image I3 ¼ I1 + I2 as an output (the calculation here is performed pixel by pixel). An example is shown in Fig. 11.14, with dark colors indicating low gray levels and light colors indicating high gray levels. Binning A local addition to a pixel. The goal is to reduce the resolution but increase the sensitivity. Additive image offset A process of adding a constant value (scalar) to an image to increase the overall brightness of the image by image addition. When the value is greater than 0, the brightness is increased, and when it is less than 0, the brightness is decreased. However, it is possible to generate an overflow problem that exceeds the maximum pixel value allowed by the type of data used. There are two solutions: normalization and truncation. Compare the subtractive image offset. Truncation In an additive image offset, when an overflow problem occurs, it is the operation that limits the result to not exceed the maximum positive number that can be represented by the type of data used (the excess values are also set to the largest positive number). Image subtraction An arithmetic operation performed on an image. Often used to detect the difference between two images. This difference can be due to different factors, such as artificially adding or removing related content to the image, relative motion of the target occurring between two frames in a video sequence, and the like. Image subtraction operator See pixel subtraction operator. Image difference See pixel subtraction operator.
11.3
Inter-image Operations
473
Fig. 11.15 Example of pixel subtraction
Fig. 11.16 The difference between the original image and the image after embedding the watermark
Pixel subtraction operator A low-level image processing operator takes two grayscale images I1 and I2 as inputs and returns an image I3 ¼ I1 – I2 as an output (the calculation here is performed pixel by pixel). This operator implements the simplest change detection algorithm. The effect can be seen in Fig. 11.15, where the left and middle images are images acquired before and after one frame apart, and there is almost no difference. But subtracting them to get the right picture, it is easy to find that the position of the person in the left and middle pictures is different, which is the result of the movement. Compare the difference image. In the example given in Fig. 11.16, the left image is the original image, and the middle image is the image after the watermark is embedded. It is difficult to see the difference directly by observing the two images, but if you calculate the difference image (right image), it is easy to see the intensity and distribution differences that caused by watermark. Figure 11.17 shows another set of examples. The left picture shows the original image. The middle picture shows the coded and then decoded image obtained by using the first-order predictor in DPCM coding, and the number of quantizer stages is five, and the right picture is their difference image. The difference is mainly concentrated at each edge.
Fig. 11.17 The difference between the original image and the predictive coded image
Fig. 11.18 Motion detection using difference images
Difference image
An image obtained by comparing pixel values at corresponding positions in two frames (preceding and succeeding) of a video or other image sequence. That is, each pixel value in the image is the difference between the pixel values at the corresponding positions of the two other images. In practice it is the image computed by image subtraction. Assuming that the illumination conditions do not substantially change during the acquisition of the frames, the nonzero portion of the difference image indicates that the pixel at that location has moved (note that a pixel at which the difference image is zero may also have moved). In other words, the difference in the position and shape of a moving object can be highlighted by differencing two temporally successive images. Figure 11.18 shows two frames (two frames apart) of a video sequence and their difference image. The two images are quite similar, but it is easy to detect the motion of the person from the difference image. See pixel subtraction operator.
Subtractive image offset
The process of subtracting a constant (scalar) from an image to reduce its overall brightness using image subtraction. Here one needs to be aware of the possibility of obtaining a negative result. There are two ways to solve this underflow problem: treat the subtraction as an absolute difference (this always gives a positive value proportional to the difference between the two original images, but does not
indicate which pixel is brighter or darker); or truncate the result, causing negative intermediate values to become zero. Compare additive image offset.
Image multiplication
An arithmetic operation performed on an image. Multiply the original value of an image by a scalar coefficient to make the pixel values larger or smaller. If the scalar coefficient is greater than 1, the result is a brighter image with larger gray values. If the scalar coefficient is greater than 0 but less than 1, the result is a darker image with smaller gray values. See pixel multiplication operator.
Pixel multiplication operator
A low-level image processing operator that takes two grayscale images I1 and I2 as inputs and returns an image I3 = I1 • I2 as output (the calculation is done pixel by pixel).
Multiplicative image scaling
Brightness adjustment of an image using a scalar by image multiplication or image division. It can often produce subjective visual effects that are better than those of an additive image offset.
Image division
An arithmetic operation performed on an image. Divide the original value of an image by a scalar coefficient to make the pixel values smaller (or larger). When the scalar coefficient is greater than 1, the gray values become smaller and the image is darker; when the scalar coefficient is greater than 0 but less than 1, the gray values become larger and the image is brighter.
Pixel division operator
A low-level image processing operator that takes two grayscale images I1 and I2 as inputs and returns the image I3 = I1 / I2 as output (the calculation is done pixel by pixel).
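A hedged sketch (not from the handbook) of the two ways of handling underflow in a subtractive image offset, plus multiplicative image scaling; the helper names are illustrative only.

```python
import numpy as np

def subtract_absolute(img1, img2):
    """Pixel subtraction treated as an absolute difference: always positive,
    but it does not tell which image was brighter at a pixel."""
    return np.abs(img1.astype(np.int32) - img2.astype(np.int32)).astype(np.uint8)

def subtract_truncate(img1, img2):
    """Pixel subtraction with negative intermediate values truncated to zero."""
    diff = img1.astype(np.int32) - img2.astype(np.int32)
    return np.clip(diff, 0, 255).astype(np.uint8)

def multiplicative_scaling(image, factor):
    """Multiplicative image scaling: factor > 1 brightens, 0 < factor < 1 darkens."""
    return np.clip(image.astype(np.float64) * factor, 0, 255).astype(np.uint8)
```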
11.3.3 Logic Operations

Logic operator
An operator that performs logic operations, such as the AND operator, NAND operator, OR operator, XOR operator, XNOR operator, NOT operator, etc.
Logic operation
A global operation on binary images. Both the input and the output are binary images, and the whole process is done pixel by pixel. Logic operations can be divided into basic logic operations and composite logic operations.
Fig. 11.19 Example of basic logic operations: NOT(A), NOT(B), (A) AND (B) = A·B, (A) OR (B) = A+B, and (A) XOR (B) = A⊕B

Table 11.2 AND logic operation
f  g  f AND g
0  0  0
0  1  0
1  0  0
1  1  1
Basic logic operation
A logic operation according to basic logic. The basic logic operations between two pixels p and q include:
1. COMPLEMENT, recorded as NOT q (also recorded as q̄)
2. AND, denoted as p AND q (also denoted as p • q)
3. OR, denoted as p OR q (also denoted as p + q)
4. XOR, denoted as p XOR q (also denoted as p ⊕ q)
Figure 11.19 gives some examples of basic logic operations. In the figure, black represents 1 and white represents 0.
AND operator
A Boolean logic operator. If the input is two binary images, applying AND logic at each pair of pixels yields the results shown in Table 11.2. The AND operator can help to select a specific image area.
NAND operator
A logic operation consisting of a "logic AND" followed by a "logic complement." The operation is performed between corresponding bits of each pixel of the two images. It is most suitable for binary images and can also be used for grayscale images.
Fig. 11.20 Effect of the NAND operator
Fig. 11.21 Effect of the OR operator
Figure 11.20 gives two binary images and the result obtained from them using the NAND operator.
OR operator
A pixel-by-pixel logic operator defined on binary variables. It takes two binary images I1 and I2 as inputs; the value of a pixel in the output image I3 is 0 if it is 0 in both I1 and I2 and is 1 in all other cases. Figure 11.21 gives an illustration of the OR operator (the value of the white pixels is 1), where the right image is the result of applying the OR operator to the left and middle images.
XOR operator
A logic operator used to combine two binary images A and B. In the result of A XOR B, if exactly one of A(i, j) and B(i, j) is 1, the pixel value at position (i, j) is 1 (in all other cases it is 0). The output is the complement of that of the XNOR operator.
XNOR operator
A logic operator used to combine two binary images A and B. In the result of A XNOR B, if exactly one of A(i, j) and B(i, j) is 1, the pixel value at position (i, j) is 0 (in all other cases it is 1). The output is the complement of that of the XOR operator.
NOT operator
See invert operator.
Composite logic operation
A logic operation resulting from a combination of basic logic operations. Some more complicated operations can be realized this way, and some theorems of logic operations can also be exploited.
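As an illustrative sketch under the convention that binary images are stored as 0/1 arrays (not code from the handbook), the basic and composite logic operators can be written as follows.

```python
import numpy as np

def logic_not(a):        return 1 - a          # COMPLEMENT, NOT (A)
def logic_and(a, b):     return a & b          # (A) AND (B)
def logic_or(a, b):      return a | b          # (A) OR (B)
def logic_xor(a, b):     return a ^ b          # (A) XOR (B)
def logic_nand(a, b):    return 1 - (a & b)    # logic AND followed by complement
def logic_xnor(a, b):    return 1 - (a ^ b)    # complement of XOR

# Two small binary test images (1 = object, 0 = background).
A = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=np.uint8)
B = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]], dtype=np.uint8)

# A composite operation such as (A) AND (NOT (B)) is built from the basics.
a_and_not_b = logic_and(A, logic_not(B))
assert np.array_equal(logic_xnor(A, B), logic_not(logic_xor(A, B)))
```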
Fig. 11.22 Examples of composite logic operations
Some of the results obtained by combining the basic logic operations with the sets A and B in the basic logic operation diagram are shown in Fig. 11.22. The first line is (A) AND (NOT (B)), (NOT (A)) AND (B), (NOT (A)) AND (NOT (B)), and NOT ((A) AND (B)); the second line is (A) OR (NOT (B)), (NOT (A)) OR (B), (NOT (A)) OR (NOT (B)), and NOT ((A) OR (B)); and the third line is (A) XOR (NOT (B)), (NOT (A)) XOR (B), (NOT (A)) XOR (NOT (B)), and NOT ((A) XOR (B)).
11.4 Image Gray-Level Mapping
An image is composed of pixels arranged in space, and its visual effect is related to the gray level of each pixel. If the gray level of all or part of the pixels can be changed, the visual effect of the image will be changed, too. This is the basic idea of gray-level mapping (Jähne 2004; Pratt 2007; Russ and Neal 2016).
11.4.1 Mapping

Image gray-level mapping
A point operation based on the pixels in an image. The goal is to enhance the image by assigning a new gray value to each pixel of the original image. According to the gray value of each pixel in the original image, and using some mapping rule (represented by a mapping function), each pixel gray value is converted into another gray value. The principle can be illustrated with the help of Fig. 11.23. Consider an image with four gray values (represented by R, Y, G, B in order from low to high) on the left; the mapping rule used is shown by the curve EH(s) in the middle drawing. According to this mapping rule, the original gray value G is mapped to the gray value R, and the original gray value B is mapped to the gray value G. Grayscale transformation is performed according to this mapping rule, and the image on the left is converted into the image on the right. The key to this type of method is to design the mapping rule or mapping curve; the desired enhancement effect can be obtained by appropriately designing the shape of the curve.
Gray-level mapping
See image gray-level mapping.
Mapping function
In image engineering, a function that maps pixel gray levels for image enhancement purposes. See image gray-level mapping.
Arbitrary mapping
Provides free grayscale conversion between the original image and the output image.
Gray-level transformations
See point transformations.
Definition
Another way of saying "quality" or visual quality. High-definition images are usually bright and sharp and have a lot of detail, while low-definition images are often dull, blurry, and smudged.
Fig. 11.23 Principle of gray-level mapping
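A minimal sketch (not from the handbook) of how such a mapping rule EH(s) might be applied in practice: for 8-bit images the rule can be stored as a 256-entry lookup table and applied to every pixel; the function name is hypothetical.

```python
import numpy as np

def apply_gray_level_mapping(image, mapping):
    """Apply a gray-level mapping rule, given as a 256-entry lookup table,
    to every pixel of an 8-bit image (a pure point operation)."""
    lut = np.asarray(mapping, dtype=np.uint8)
    return lut[image]            # fancy indexing applies the LUT pixel by pixel

# Example mapping rule: compress all gray values to three quarters of their value.
lut = np.clip(np.arange(256) * 0.75, 0, 255).astype(np.uint8)
img = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
out = apply_gray_level_mapping(img, lut)
```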
Acutance
A measure of the degree of sharpness perceived by the human visual system in a photograph. For an edge, acutance can be defined as the average squared rate of change of the gray value across the edge divided by the total gray-level difference between the two sides of the edge. In image engineering, sharpness is often used instead.
Natural media emulation
Processing an image to make it appear to have been drawn with real-world media (such as watercolors, paints, etc.).
Pixel transform
The simplest type of image transform, in which the operation on each pixel is independent of its neighboring pixels. Also known as point operator, point processing, etc.
11.4.2 Contrast Manipulation

Contrast manipulation
One of the most commonly used point transformation operations in image enhancement. Also called contrast adjustment and amplitude scaling. The transform function often has an S-shaped curve (Fig. 11.24a): pixel values smaller than m are compressed toward darker values in the output image, and pixel values larger than m are mapped to brighter pixel values in the output image. The slope of the curve indicates how strong the contrast change is. In the most extreme case, the contrast manipulation function degenerates into a binary thresholding function (Fig. 11.24b), in which pixels of the input image that are less than m become black and pixels that are larger than m become white.
Auto-contrast
Same as automatic contrast adjustment.
Fig. 11.24 Contrast manipulation function diagram: (a) S-shaped transform function T = EH(s); (b) binary thresholding function
Automatic contrast adjustment
A special contrast manipulation, referred to as auto-contrast. For an 8-bit image, the darkest pixel in the input image is mapped to 0, the brightest pixel is mapped to 255, and the intermediate values are linearly redistributed.
Contrast stretching
A direct gray-level mapping technique for image enhancement. A special case of contrast manipulation. Also called grayscale stretching. A monotonically increasing point operation is used to extend the original range of gray levels in an image to a larger range, to increase or enhance the visibility of image detail that occupies a small range. The effect is to enhance the image contrast, that is, the contrast between the various parts of the image. In practice, it is often achieved by increasing the dynamic range between two gray values in the enhanced image.
Contrast
1. The difference in luminance values between two structures (regions, blocks, etc.). The difference in gray level between adjacent pixels or sets of pixels plays an important role in the definition of the image and in whether different pixels or pixel regions can be distinguished. There are many related definitions and types, such as relative brightness contrast, simultaneous brightness contrast, relative saturation contrast, and simultaneous color contrast.
2. A texture measure. In grayscale images, the contrast C used to describe texture can be defined as

C = Σ_i Σ_j (i − j)² P[i, j]
where P is a gray-level co-occurrence matrix.
Contrast ratio
The ratio of the brightness between the brightest white and the darkest black for a scene (given device and environment). The larger the ratio, the more gradations there are from black to white and the richer the color. The effect of contrast on visual appearance is critical. In general, the greater the contrast, the more conspicuous the image, the sharper the detail, and the more vivid the color; the smaller the contrast, the more the picture is grayed out. High contrast is very helpful for image sharpness, detail rendition, gray-level rendition, and so on.
Contrast reversal
Refers to the case where different areas of the image undergo brightness shifts at different times. For example, in infrared (thermal) imaging, since different materials vary differently in relative temperature with environmental temperature, it may occur that zone A is hotter than zone B in the daytime but cooler than zone B at night, so that the brightness contrast of the two areas is reversed.
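A small hedged sketch (not from the handbook) of automatic contrast adjustment / contrast stretching for an 8-bit image; the function name is illustrative.

```python
import numpy as np

def auto_contrast(image):
    """Automatic contrast adjustment: map the darkest pixel to 0, the
    brightest to 255, and redistribute the intermediate values linearly."""
    f = image.astype(np.float64)
    fmin, fmax = f.min(), f.max()
    if fmax == fmin:                         # flat image: nothing to stretch
        return image.copy()
    stretched = (f - fmin) * 255.0 / (fmax - fmin)
    return np.round(stretched).astype(np.uint8)
```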
11.4.3 Logarithmic and Exponential Functions

Logarithmic transform
1. Same as single-scale retinex.
2. A transformation commonly used in dynamic range compression.
Log-based modification
A method of modifying image pixel values based on a function that converts each pixel value to its logarithmic value.
Pixel logarithm operator
A low-level image processing operator. The input is a grayscale image f, the output is another grayscale image g, and the operator calculates g = c·log_b(| f + 1|). This operator is used to change the dynamic range of an image, for example, to enhance the amplitude of a Fourier transform. Here c is a scale factor, and the base b of the logarithm is often e (the natural base), although the choice is not really important because the logarithms of any two bases differ only by a scale factor. Compare pixel exponential operator.
Dynamic range compression
A direct gray-level mapping technique for image enhancement. It is characterized by the range of gray values of the original image being larger than the range of gray values of the enhanced image. If a logarithmic mapping function is used, letting s be the gray value of the image to be enhanced and t be the gray value of the enhanced image, then

t = C log(1 + |s|)

where C is a scale proportionality constant.
Dynamic range
The ratio of the brightest value to the darkest value in an image. Most digital images have a dynamic range of around 100:1, but even when the dynamic range reaches 10,000:1, humans can still perceive the details of dark areas. To meet this requirement, it is necessary to create a high dynamic range image.
Range compression
Enhancing the appearance of an image by reducing its dynamic range. The amplitude of a Fourier transform image may contain very large and very small values, in which case range compression is required. As shown in Fig. 11.25, the left image is the amplitude map obtained directly from the Fourier transform of an image; very little information can be seen. The right image is the result of dynamic range compression obtained by logarithmic transformation; much useful information becomes visible.
Fig. 11.25 Effect of range compression
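As a hedged sketch of the kind of range compression shown in Fig. 11.25 (not the handbook's code), the logarithmic mapping t = C log(1 + |s|) can be applied to a Fourier amplitude map as follows; the function name and the final rescaling step are illustrative assumptions.

```python
import numpy as np

def log_range_compression(amplitude, c=1.0):
    """Dynamic range compression t = c * log(1 + |s|), followed by rescaling
    of the compressed values to the displayable range [0, 255]."""
    t = c * np.log1p(np.abs(amplitude))
    t = 255.0 * (t - t.min()) / (t.max() - t.min() + 1e-12)
    return t.astype(np.uint8)

# Example: compress the amplitude spectrum of a random test image for display.
img = np.random.rand(64, 64)
spectrum = np.abs(np.fft.fftshift(np.fft.fft2(img)))
display = log_range_compression(spectrum)
```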
Exponential transformation
A typical image enhancement technique for improving image visual quality, achieved by grayscale mapping of image pixel gray values by means of an exponential (power-law) function. The general form can be expressed as

t = C s^γ

where s is the input gray level, t is the output gray level, C is a constant, and γ is a real number that mainly controls the effect of the transformation. When γ > 1 (as in gamma correction), the wider low-grayscale range of the input is mapped to a narrower grayscale range in the output; when γ < 1, the narrower low-grayscale range of the input is mapped to a wider grayscale range in the output, and the wider high-grayscale range of the input is mapped to a narrower grayscale range in the output. See pixel exponential operator.
Pixel exponential operator
A low-level image processing operator. The input is a grayscale image f, the output is another grayscale image g, and the operator calculates g = c·b^f. Here the value of the base b depends on the desired degree of compression of the image's dynamic range, and c is a scale factor. This operator can be used to change the dynamic range of the image's gray values. Compare pixel logarithm operator.
Power-law transformation
Same as exponential transformation.
γ-Transformation
The transform function used in gamma correction.
γ (gamma)
The pronunciation of the Greek letter γ. When a device such as a camera or a display performs a conversion between an analog (A) image and a digital (D) image, the relationship between A and D is often nonlinear. A common model describing this nonlinearity associates A and D through a gamma curve, i.e., A = k D^γ, where k is a constant. For CRT displays, the common value of γ lies in [1.0, 2.5].
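A minimal sketch (not from the handbook) of the power-law transformation t = C s^γ applied to an 8-bit image normalized to [0, 1]; the function name and the normalization choice are assumptions.

```python
import numpy as np

def gamma_transform(image, gamma, c=1.0):
    """Power-law (exponential) transformation t = c * s**gamma on an image
    normalized to [0, 1]; gamma < 1 expands dark gray levels, gamma > 1
    compresses them."""
    s = image.astype(np.float64) / 255.0
    t = c * np.power(s, gamma)
    return np.clip(t * 255.0, 0, 255).astype(np.uint8)

# A display gamma of about 2.2 is often compensated with the exponent 1/2.2.
img = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
corrected = gamma_transform(img, 1.0 / 2.2)
```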
Fig. 11.26 Several correction curves with different γ values (γ = 0.1, 0.4, 1, 2.5, 10)
γ-Correction (gamma correction)
A typical point operation technique for image gray-level mapping. By means of an exponential transformation, it can be used to correct the output response t of an image acquisition, display, or printing device so that it satisfies a linear relationship with the input excitation s. The correction function can be written as

t = c • s^γ

where c is a scaling constant. The function curves for several different γ values are shown in Fig. 11.26.
The intensity of light received at the camera input or generated at the display output is a nonlinear function of the voltage level. The process of compensating for this nonlinear characteristic is gamma correction. To do this, the corresponding relationship must be written as a function. For cameras, this relationship is often described as

vc = Bc^γc

where Bc represents the actual brightness (light intensity), vc is the voltage obtained at the camera output, and γc is typically a value between 1.0 and 1.9. Similarly, for a display device, the relationship between the input voltage vd and the displayed color intensity Bd can be described as

Bd = vd^γd

where γd is generally a value between 2.2 and 3.0. When displaying an image on a display, it is often necessary to gamma correct the color and brightness to obtain the appropriate dynamic range.
γ-Function (gamma function)
A mathematical function of a variable x; the gamma function can be written as
Γ(x) = ∫₀^∞ y^(x−1) e^(−y) dy
γ-Curve (gamma curve)
See gamma correction.
Digamma function
The derivative of the logarithm of the gamma function of a variable. Letting x be the variable, the digamma function D can be written as

D(x) = d ln Γ(x) / dx
11.4.4 Other Functions

Image negative
A direct gray-level mapping technique for image enhancement, which flips the gray values of the original image. It calculates the negative of an image (actually the complement). Intuitively, it turns black to white, white to black, light to dark, and dark to light. The mapping function EH(•) in this case can be represented by the curve of Fig. 11.27a, where an original gray level s′ is changed to t′. It can be seen that a smaller s′ becomes a larger t′, and a larger s′ becomes a smaller t′. Figure 11.27b, c give a pair of example images that are mutual negatives. In fact, the relationship between the black-and-white negatives commonly used in the past and their (developed) photos is just like this.
Invert operator
A low-level image processing operation that constructs a new image by inverting pixel values. The effect is similar to the relationship between photos and
Fig. 11.27 Image negative example
negatives. For a binary image, an input 0 is changed to 1, and 1 becomes 0. For grayscale images, the result depends on the maximum range of luminance values. If the luminance value range is [0, 255], the value of a pixel f after inversion is 255 – f. See image negative.
Negate operator
See invert operator.
Intensity level slicing
An image processing operation in which pixels having a value (or range of values) different from the selected value are set to zero. If the image is viewed as a landscape whose height is proportional to brightness, then intensity level slicing takes a section through the height plane.
Gray-level slicing
See intensity level slicing.
Linear gray-level scaling
A simple gray value transformation method for image enhancement. The gray value of each pixel in the image is transformed by a linear formula, f(g) = ag + b, where g is the original gray level and f(g) is the transformed gray level. The contrast increases when |a| > 1, the contrast decreases when |a| < 1, and the gray values are reversed when a < 0; the gray values increase when b > 0 and decrease when b < 0. In order for f(g) to cover the maximum gray range of the image, one can take a = (G – 1)/(gmax – gmin) and b = –a·gmin, where G – 1 is the maximum gray level allowed by the image type, and gmin and gmax are the minimum and maximum gray values of the current image, respectively.
Clip transform
An operation performed on the pixel gray levels of an image. The gray values of the portion of the image to be clipped are generally set to the minimum or maximum value of the dynamic range. In threshold selection based on the transition region, a special clip transform is defined to reduce the effects of various disturbances. The difference between it and the general clip transform is that the clipped portion is set to the gray value of the clip level, which avoids the adverse effect of the general clip transform causing excessive contrast at the clip edges.
11.5 Histogram Transformation

The histogram transformation method for image enhancement is based on probability theory. The gray level of each pixel in the image is changed by changing the histogram of the image to achieve the goal of image enhancement. Therefore, histogram transformation is often called histogram correction. Commonly used specific methods are histogram equalization and histogram specification (Gonzalez and Wintz 1987; Russ and Neal 2016; Zhang 2009).
11.5.1 Histogram

Histogram
A representation of the distribution frequency of certain values, obtained by computing statistics over images or figures. The intensity histogram is a typical example. For grayscale images, the grayscale statistical histogram reflects the frequency of occurrence of the different gray levels in the image, that is, it gives an overall description of all grayscale values of the image. For example, for the image in Fig. 11.28a, the grayscale statistical histogram can be represented as in Fig. 11.28b, where the horizontal axis represents the different gray levels and the vertical axis represents the number of pixels of each gray level in the image. Strictly speaking, the grayscale statistical histogram of an image f(x, y) is a 1-D discrete function that can be written as

h(k) = nk,   k = 0, 1, ..., L − 1

where nk is the number of pixels having gray value k in f(x, y). In Fig. 11.28b, the height of each column or bar of the histogram corresponds to nk.
Image histogram
Same as histogram.
Gray-level histogram
A histogram obtained by computing statistics over a grayscale image.
α-Quantile (alpha quantile)
A feature of a histogram. If Ch is used to represent the corresponding cumulative histogram, the α-quantile of a histogram is defined as

hα = min {h : Ch ≥ α}

where h0 represents the minimum value, h1 represents the maximum value, and h0.5 represents the median value.
Fig. 11.28 Example of a grayscale image and its histogram
Fig. 11.29 Example of a grayscale image and its cumulative histogram
Intensity histogram
A data structure that records the distribution of the number of pixels for each luminance value. A typical 8-bit grayscale image has pixels with values in the range [0, 255], so the histogram has 256 units recording pixel counts.
Image intensity histogram
See image histogram.
Image intensity bucket
Same as image intensity bin.
Cumulative histogram
An expression of image statistics. For a grayscale image, the grayscale statistical cumulative histogram is a 1-D discrete function. According to the definition of the histogram, the cumulative histogram can be written as

H(k) = Σ_{i=0}^{k} ni,   k = 0, 1, ..., L − 1

where ni is the number of pixels in the i-th column (bin) of the original histogram. The height of bin k in the cumulative histogram gives the total number of pixels whose gray value is less than or equal to k in the image. For example, for the image of Fig. 11.29a, the grayscale statistical cumulative histogram is shown in Fig. 11.29b.
Cumulative distribution function [CDF]
The sum of the probabilities that a random variable takes values less than a certain value. Letting the random variable be X, the cumulative distribution function of X is defined as F(x) = Pr(X ≤ x), that is, the probability that X does not exceed the value x. The range of F is [0, 1]. This definition can also be used for vector-valued random variables.
A single-valued nondecreasing function obtained by performing cumulative calculations on a distribution function. For example, the cumulative grayscale histogram (cumulative distribution function) of an image is obtained by accumulating the grayscale statistical histogram of the image (a typical grayscale distribution function). If there are
N pixels in an image with K gray levels, the grayscale statistical histogram of the image can be expressed as

H(k) = nk / N,   k = 1, 2, ..., K

where nk is the number of pixels having the k-th gray value in the image. And the cumulative grayscale histogram of the image can be expressed as

A(k) = Σ_{i=1}^{k} ni / N,   k = 1, 2, ..., K
Multidimensional histogram
A histogram with more than one dimension. For example, the gray-level versus gradient-value scatter is a 2-D histogram. As another example, a measurement obtained from a multispectral image needs to be represented by an N-D vector, and the resulting histogram will be represented by an array of dimension N. The N components of each vector are used as indices into the array, and cumulative counting over the array yields a multidimensional histogram.
Histogram bin
In a histogram, the function value corresponding to each variable value is represented by a vertical bar. In other words, the histogram is often represented by a bar graph with one bar per gray level, the height of the bar being proportional to the number (or percentage) of pixels with that gray value. Such a bar is often abbreviated as a bin.
Image binning
A method of assigning each pixel intensity to the histogram bin containing the corresponding intensity. The image histogram is constructed in this way.
Image intensity bin
The unit of the image intensity histogram; the set (number) of pixels within a specific intensity range. Also known as image intensity bucket.
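As an illustrative sketch (not from the handbook), the grayscale statistical histogram h(k) = nk and the cumulative histogram H(k) can be computed directly from their definitions; the function names are hypothetical.

```python
import numpy as np

def gray_histogram(image, levels=256):
    """Grayscale statistical histogram: h[k] is the number of pixels nk
    having gray value k."""
    h = np.zeros(levels, dtype=np.int64)
    for k in image.ravel():
        h[k] += 1
    return h

def cumulative_histogram(image, levels=256):
    """Cumulative histogram: H[k] is the number of pixels with gray value <= k."""
    return np.cumsum(gray_histogram(image, levels))

img = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
assert cumulative_histogram(img)[-1] == img.size   # every pixel is counted once
```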
11.5.2 Histogram Transformation

Histogram transformation
The adjustment of the distribution (shape) of the gray values in the histogram. The goal is to change the visual effect of the image by changing the shape of its histogram. Image enhancement using histogram transformation is based on probability theory. The commonly used methods are histogram equalization and histogram specification.
Fig. 11.30 Histogram equalization instance
Histogram equalization
A method of image enhancement using histogram transformation. A single image is processed to obtain an image with a uniform gray-level distribution, i.e., an image with a flat grayscale histogram. It transforms the histogram of the original image into a uniformly distributed form and, by increasing the dynamic range of the pixel gray values, achieves the effect of enhancing the overall contrast of the image. It is done with the help of the cumulative histogram, which acts here as the transformation function. It can also be implemented with the help of the rank transform. When this technique is applied to a digital image, the resulting histogram often has certain bins that are taller than average, with zero-valued bins on their left and right sides.
Figure 11.30 gives an example of histogram equalization. Figure 11.30a, b are an original 8-bit grayscale image and its histogram, respectively. Here, the original image is darker and its dynamic range is small. Reflected in the histogram, the histogram occupies a narrow range of gray values and is concentrated on the low-gray-value side. Figure 11.30c, d show the result obtained by histogram equalization of the original image and the corresponding histogram. The histogram now occupies the entire allowable range of gray values. Since histogram equalization increases the dynamic range of the image grayscale, the contrast of the image is also increased; this is reflected in a higher-contrast image in which many details can be seen relatively clearly. However, it should be noted that histogram equalization also increases the visual granularity of the image while enhancing the contrast.
Equalization
Same as histogram equalization.
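A hedged sketch (not the handbook's code) of histogram equalization using the normalized cumulative histogram as the gray-level transformation function; the function name is illustrative.

```python
import numpy as np

def histogram_equalization(image, levels=256):
    """Histogram equalization of an 8-bit grayscale image."""
    hist = np.bincount(image.ravel(), minlength=levels)   # h(k) = n_k
    cdf = np.cumsum(hist).astype(np.float64)              # cumulative histogram
    cdf /= cdf[-1]                                        # normalize to [0, 1]
    lut = np.round(cdf * (levels - 1)).astype(np.uint8)   # transformation function
    return lut[image]

img = np.random.randint(40, 90, (64, 64), dtype=np.uint8)  # low-contrast test image
equalized = histogram_equalization(img)
```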
Rank transform
The transformation from gray value to rank in a sorted sequence. Each pixel value in an image (which may be N-D) is replaced by its sequence number (or pointer) after all pixel values have been sorted in ascending order. For example, consider a 2-D grayscale image of M × M pixels. Sorting assigns the first pixel with the smallest gray value the sequence number 1; the other pixels with the same smallest gray value (if there are any) are then assigned the sequence numbers 2, 3, 4, ...; the pixels having the next smallest gray value are sorted in the same way in turn; and finally the last pixel, having the largest gray value, is given the sequence number M². Such a transformation process is equivalent to constructing the cumulative histogram of the image, so the histogram of the original image will be equalized, that is, histogram equalization is realized.
Histogram equalization with random additions
A histogram transformation method that can obtain a completely flat histogram. Here, one of the constraints that must otherwise be strictly adhered to when ordering pixels according to gray value is discarded, namely that the pixels corresponding to a given bin remain gathered in one bin after the histogram transformation. This method allows a portion of the pixels to be moved into adjacent bins of the histogram so that all bins have exactly the same number of pixels. Letting the equalized histogram of an N × M image be represented by the 1-D array H(g), where g ∈ [0, G − 1], the steps of performing histogram equalization with random additions are (⌊•⌋ represents the down-rounding (floor) function):
1. Add a random number drawn from the uniform distribution [−0.5, 0.5] to the gray value of each pixel.
2. Sort the pixel gray values, and remember which pixel corresponds to which gray value.
3. Change the gray values of the first group of ⌊NM/G⌋ pixels to 0, change the gray values of the next group of ⌊NM/G⌋ pixels to 1, and so on, until the last group of ⌊NM/G⌋ pixels is changed to G − 1.
Adaptive histogram equalization [AHE]
A method of locally improving image contrast. First construct a histogram based on the gray levels in the image, and then remap the gray levels to make the histogram distribution close to flat. If the dithering technique is further used, the histogram can be made completely flat. See locally adaptive histogram equalization.
Locally adaptive histogram equalization
In order to better adapt to the characteristics of different parts of the image, the image is divided into blocks before the histogram is equalized, and then histogram equalization is performed for each block. The result of such processing alone may show a block effect (i.e., luminance discontinuities at the edges of the blocks). One way to overcome the blockiness problem is to use a moving window, computing for each pixel a histogram of the image block centered on it. The problem with this approach is that it is computationally intensive, even though only the pixels that enter and leave the block need to be updated at each step. A more efficient method is to calculate the equalization functions of nonoverlapping blocks and then smoothly interpolate the transfer functions between the blocks. This technique is called adaptive histogram equalization.
Histogram specification
An image enhancement method using histogram transformation. Also known as histogram matching. The contrast within a range of gray values in the original
image is selectively enhanced according to a predetermined specified histogram, or the distribution of the gray values of the image is made to meet a specific requirement. Histogram equalization can be seen as a special histogram specification method (one in which the specified histogram is uniformly distributed).
Specified histogram
In histogram specification, the histogram that the user expects to obtain. It needs to be determined before the specification operation.
Histogram mapping law
In the histogram transformation for image enhancement, the law by which the original image histogram is mapped to the enhanced image histogram. The gray level of the pixels corresponding to each bin in the original image histogram is converted in turn into the gray level of a bin in the enhanced image histogram. Typical examples include the single mapping law and the group mapping law.
Single mapping law [SML]
A rule in histogram specification that maps the original histogram to a specified histogram. Let ps(si) represent any bin of the original histogram (a total of M bins), and let pu(uj) represent any bin of the specified histogram (a total of N bins). Searching the gray levels from small values to large values in turn, find the k and l that minimize

| Σ_{i=0}^{k} ps(si) − Σ_{j=0}^{l} pu(uj) |,   k = 0, 1, ..., M − 1;  l = 0, 1, ..., N − 1

Then map ps(si) to pu(uj). Compare group mapping law.
Group mapping law [GML]
A rule in histogram specification that maps the original histogram to the specified histogram. Let ps(si) represent any bin of the original histogram (a total of M bins), and let pu(uj) represent any bin of the specified histogram (a total of N bins), and let there be an integer function I(l), l = 0, 1, ..., N − 1, satisfying 0 ≤ I(0) ≤ I(1) ≤ ... ≤ I(l) ≤ ... ≤ I(N − 1) ≤ M − 1. The group mapping law determines the I(l) that minimizes

| Σ_{i=0}^{I(l)} ps(si) − Σ_{j=0}^{l} pu(uj) |,   l = 0, 1, ..., N − 1

If l = 0, map those ps(si) with i from 0 to I(0) to pu(u0); if l ≥ 1, then map those ps(si) with i from I(l − 1) + 1 to I(l) to pu(ul). Compare single mapping law.
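As a simplified sketch of histogram specification (not the handbook's method and not an exact implementation of SML or GML), the mapping below pairs each original gray level with the specified gray level whose cumulative histogram value is closest; all names are illustrative.

```python
import numpy as np

def histogram_specification(image, specified_hist, levels=256):
    """Map the gray levels of `image` so that its histogram approximates
    `specified_hist` (an array of length `levels`)."""
    src_hist = np.bincount(image.ravel(), minlength=levels).astype(np.float64)
    src_cdf = np.cumsum(src_hist) / src_hist.sum()
    spec_cdf = np.cumsum(np.asarray(specified_hist, dtype=np.float64))
    spec_cdf /= spec_cdf[-1]
    # For each source level, pick the specified level with the closest CDF value.
    lut = np.array([np.argmin(np.abs(spec_cdf - c)) for c in src_cdf],
                   dtype=np.uint8)
    return lut[image]

# Example: specify a histogram that rises linearly toward bright gray levels.
img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
specified = np.arange(1, 257, dtype=np.float64)
matched = histogram_specification(img, specified)
```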
11.5.3 Histogram Modification

Histogram modification
1. A technique for adjusting the distribution of pixel values in an image to produce an enhanced image. Also known as histogram transformation. A range of gray values is highlighted after the transformation. Some typical histogram transformation functions are shown in Table 11.3. In the table, H(z) represents the histogram before modification, gz is the value corresponding to z in the modified histogram, and gmin and gmax are the minimum and maximum values of the histogram before modification, respectively.
2. Using other information in the image to modify the grayscale histogram of the image. For example, in thresholding-based segmentation, the image gray histogram can be modified using gradient information in the image to obtain a new histogram. The new histogram has a deeper valley between peaks than the original histogram, or the valley is transformed into a peak that is easier to detect. See histogram modeling.
Histogram stretching
A simple technique for changing the histogram of a low-contrast image. Also known as input cropping. The contrast of the image can be increased without changing the shape of the original histogram. More complex methods are often referred to collectively as histogram manipulation. Let the histogram range of the original image be from fmin to fmax. It is desired to extend these values to the range [0, G − 1], where G − 1 > fmax − fmin. When mapping the gray values to the new range, the original pixel gray value f is replaced with the pixel gray value g, where

g = ⌊( f − fmin) G / ( fmax − fmin) + 0.5⌋

Here, adding the constant 0.5 and applying the down-rounding (floor) function ⌊•⌋ makes g take the integer nearest to ( f − fmin)G/( fmax − fmin) (similar to rounding).

Table 11.3 Transformation functions for histogram modification
Name of modification                 Transformation function T(z)
Uniform modification                 gz = (gmax − gmin) H(z) + gmin
Exponential modification             gz = gmin − ln[1 − H(z)]/α
Rayleigh modification                gz = gmin + {2α² ln[1/(1 − H(z))]}^(1/2)
Hyperbolic cube root modification    gz = [(gmax^(1/3) − gmin^(1/3)) H(z) + gmin^(1/3)]³
Hyperbolic logarithm modification    gz = gmin (gmax/gmin)^(H(z))
Input cropping
Same as histogram stretching.
Histogram shrinking
A simple technique for changing the histogram of a high-contrast image. Also called output cropping. By modifying the original histogram, the gray-level range [fmin, fmax] can be compressed to a narrower gray-level range [gmin, gmax], which reduces the contrast of the image but does not change the relative shape of the original histogram. When mapping the gray values to the new range, the original pixel gray value f is replaced with the pixel gray value g, where

g = ⌊( f − fmin)(gmax − gmin)/( fmax − fmin) + gmin + 0.5⌋

Here, adding the constant 0.5 and applying the down-rounding (floor) function ⌊•⌋ makes g take the nearest integer (similar to rounding).
Output cropping
Same as histogram shrinking.
Histogram thinning
A method of performing image segmentation using histogram transformation, in which the peaks and valleys of the original histogram are made more prominent by a mapping process, so that thresholding based on the histogram becomes more convenient. The idea is to gradually thin each peak in the histogram so that it approaches a pulse form (the peak becomes more prominent). After obtaining the histogram of the original image, the specific thinning steps are as follows:
1. Starting from the lowest value of the histogram, gradually search for a local peak bin of the histogram in the direction of higher values.
2. After a peak is found, move some of the pixels of the bins on both sides into the bin where the peak is located, to thin the peak.
3. When a peak has been thinned, continue to search for the next local peak, and thin it, until the last local peak has been thinned.
When searching for local peaks, suppose h( f ) is the number of pixels whose gray value is f, h+( f + m) is the average of h over the gray values f + k (k = 1, 2, . . ., m), and h−( f − m) is the average of h over the gray values f − k (k = 1, 2, . . ., m). If h( f ) is greater than both h+( f + m) and h−( f − m), then h( f ) is taken as a local peak.
When thinning the peak, first calculate N = [2h( f ) − h+( f + m) − h−( f − m)]/[2h( f )], and then sequentially move N·h( f + 1) pixels from histogram bin f + 1 to histogram bin f, and N·h( f + 2) pixels from bin f + 2 to bin f + 1, until N·h( f + m) pixels are moved from bin f + m to bin f + m − 1; then sequentially move N·h( f − 1) pixels from bin f − 1 to bin f, and N·h( f − 2) pixels from
histogram bin f − 2 to histogram bin f − 1, until N·h( f − m) pixels are moved from bin f − m to bin f − m + 1.
Direct gray-level mapping
Same as image gray-level mapping.
Piecewise linear transformation
A transformation used in point transformation techniques. Several linear equations are used to describe the transformation function, each for an interval of gray values in the input image. The main advantage of piecewise linear transformations is that they can have arbitrary complexity; the main disadvantage is that they require more user input. Gray-level slicing is a typical case.
Histogram manipulation
In histogram processing, an operation that changes the gray values of an image without affecting its semantic information content.
Histogram sliding
A simple histogram adjustment technique that only adds or subtracts a constant luminance value for all pixels in the image. The overall effect is an image with substantially the same contrast but a higher or lower average brightness.
Histogram modeling
A technique for changing the dynamic range and contrast of an image by changing the brightness histogram of the image so that it has the desired characteristics. Typical examples are histogram equalization, histogram specification, and the like. See histogram modification.
Histogram smoothing
A preprocessing technique implemented by applying a smoothing filter (such as a Gaussian smoothing filter) to the histogram. This technique is often used before performing histogram analysis operations.
Histogram hyperbolization
A variant of histogram equalization. According to Weber's law, the brightness perceived by the human eye is logarithmically related to the brightness of the scene. Histogram hyperbolization takes this into account and sets the goal of equalization as transforming the histogram of the original image into a uniform distribution of the brightness perceived by the human eye.
Histogram hyperbolization with random additions
A histogram transformation method that can obtain a perfectly smooth hyperbolic histogram (giving lower gray values more weight and higher gray values less weight). Here, one of the constraints that must otherwise be strictly adhered to when ordering pixels according to gray value is discarded, namely that the pixels corresponding to a given bin remain gathered in a single bin after the histogram transformation. Pixels are allowed to move into adjacent bins of the histogram so that each bin receives a hyperbolic number of pixels. If t is used to represent the discrete gray value of the histogram, the
expected histogram bin H(t) will contain V(t) pixels, t ∈ [0, G − 1]. So, first calculate the number of pixels needed for each bin. This can be obtained by multiplying the total number of pixels by the desired probability density integrated over the width of the bin, i.e., from t to t + 1. For an image of N × M, the total number of pixels is NM, and

V(t) = NMA ∫_t^{t+1} e^(−αt) dt  ⇒  V(t) = NMA [−e^(−αt)/α]_t^{t+1}  ⇒  V(t) = NM (A/α) [e^(−αt) − e^(−α(t+1))]

The steps of histogram hyperbolization with random additions are (⌊•⌋ represents the down-rounding (floor) function):
1. Add a random number drawn from the uniform distribution [−0.5, 0.5] to the gray value of each pixel.
2. Sort the pixels of the image according to their gray values, and remember which gray value corresponds to which pixel.
3. Change the gray values of the first group of ⌊V(0)⌋ pixels to 0.
4. For t from 1 to G − 1, change the gray values of the next group of ⌊V(t) + {V(t − 1) − ⌊V(t − 1)⌋}⌋ pixels to t. Note that the correction term V(t − 1) − ⌊V(t − 1)⌋ accounts for the remainder of V(t − 1); carrying it over to V(t) before taking the floor occasionally yields a group larger by one pixel.
11.5.4 Histogram Analysis

Histogram analysis
A generic term for techniques that extract quantitative information from a histogram, such as determining the valley/minimum value in a bimodal histogram for thresholding.
Histogram feature
An image feature that can be extracted directly from the histogram. In texture analysis, commonly used histogram features include range, mean, geometric mean, harmonic mean, standard deviation, variance, and median. Although simple, histogram features are low cost and low level. They do not change with rotation and translation of the image and are not sensitive to the exact spatial distribution of the color pixels. Histogram similarity can be measured and judged by means of histogram features. Table 11.4 gives some simple measures of the difference between two histograms, where ri and si are the counts of the first and second data sets in histogram bin i, respectively, r̄ and s̄ are their corresponding means, n is the total
Table 11.4 Some histogram similarity measures
Measures                              Formulas
L1 norm                               L1 = Σ_{i=1}^{n} |ri − si|
L2 norm                               L2 = [Σ_{i=1}^{n} (ri − si)²]^(1/2)
Mallows or EMD distance               Mp = [(1/n) Σ_{i=1}^{n} |r(i) − s(i)|^p]^(1/p)
Bhattacharyya distance                B = −ln Σ_{i=1}^{n} (ri si)^(1/2)
Matusita distance                     M = [Σ_{i=1}^{n} (√ri − √si)²]^(1/2)
Divergence                            D = Σ_{i=1}^{n} (ri − si) ln(ri/si)
Histogram intersection                H = Σ_{i=1}^{n} min(ri, si) / Σ_{i=1}^{n} ri
Chi-square                            χ² = Σ_{i=1}^{n} (ri − si)²/(ri + si)
Normalized correlation coefficient    r = Σ_{i=1}^{n} (ri − r̄)(si − s̄) / {[Σ_{i=1}^{n} (ri − r̄)²]^(1/2) [Σ_{i=1}^{n} (si − s̄)²]^(1/2)}
number of histogram bins, and r(i) and s(i) represent ascending-order (sorted) indices. Note that EMD is the earth mover's distance.
Histogram moment
A moment derived from the histogram of the image.
Histogram distance
The gap or difference between two histograms. It can be calculated according to various measures. See method of histogram distance.
Method of histogram distance
A similarity calculation method based on the means of histograms to roughly express color information. The feature vector consisting of the three components R, G, and B of the image is

f = [μR  μG  μB]^T

The value of the histogram distance between a query image Q and a database image D is then

P(Q, D) = |fQ − fD| = [Σ_{R,G,B} (μQ − μD)²]^(1/2)

Similarity computation
Computing the similarity of two entities to determine their resemblance.
Histogram dissimilarity measure
A measure of the difference between two histogram distributions. It is the opposite of the degree of similarity between the two distributions.
Histogram density estimation
A nonparametric method for estimating the density of an image region by means of a histogram. It differs from parametric methods, which estimate the density of an image region using a probability function with only a few parameters. It can avoid the problem that the selected probability function model does not match the actual data, making the estimation and prediction inaccurate. The histogram divides a single continuous variable into bins of width Δi (generally all histogram bins are taken to have the same width) and counts the number ni of density values that fall into each bin. Dividing ni by the total number N and by Δi gives the probability value of each histogram bin:

pi = ni / (N Δi)
The density function thus obtained is a constant in the range of each of the bins. The choice of the width of the bin is critical. If it is too narrow, the resulting density estimates will fluctuate and may have many more structures than the original data; if it is too wide, the resulting density estimate is too flat to reflect the distribution characteristics of the original data. Tree annealing [TA] A method of finding the global minimum of a multivariate function, which may have multiple local minima. This method can be used in histogram analysis and thresholding problems.
Chapter 12
Mask Operations for Spatial Domain Enhancement
In real images, adjacent and nearby pixels are closely related and can often be considered together (Gonzalez and Woods 2008). In image processing, masks/templates are often used to combine adjacent or nearby pixels, and operations are performed based on the statistical characteristics of these pixels or on local operations; these are called mask operations or template operations (Davies 2005; Lohmann 1998; Pratt 2007; Russ and Neal 2016).
12.1 Spatial Domain Enhancement Filtering
The use of mask/template operations for image enhancement is often called filtering and can be either linear or nonlinear (Bovik 2005; Mitra and Sicuranza 2001). Since mask/template operations involve local regions in the image, local enhancements can also be easily performed (Zhang 2017a).
12.1.1 Spatial Domain Filtering

Spatial filtering
An image enhancement technique performed in the image domain by means of mask/template convolution. It is also the most widely used noise cancellation technology. Typical filters that implement spatial filtering include averaging filters, ordering filters, adaptive filters, and the like.
Spatial domain filtering
Same as spatial filtering.
Smooth filtering
A spatial filtering technique. Similar to low-pass filtering in the frequency domain, it reduces local grayscale fluctuations, making the image smoother. Compare sharp filtering.
Image smoothing
A type of image enhancement technology that improves the smoothness of the image by reducing the effects of tiny details and noise in the image. See noise reduction.
Mean-shift smoothing
A method of avoiding blurring when smoothing an image. A pixel is represented by a triple in a 3-D space: the first two dimensions represent the position of the pixel in the image, and the third dimension measures its gray level/luminance. A pixel (i, j) is represented in this space by a point (xij, yij, gij), where at the beginning xij = i, yij = j, and gij is its gray level. The pixels in this 3-D space are allowed to move and form groups. The motion of the pixels is iterative: at each iteration step a new vector (xij, yij, gij) is computed for each pixel (i, j). At iteration m + 1, define for each neighbor (k, l) ∈ Nij the weight

w_kl^m = g[(x_ij^m − x_kl^m)²/h_x²] · g[(y_ij^m − y_kl^m)²/h_y²] · g[(g_ij^m − g_kl^m)²/h_g²]

Then the individual components of the new vector are

x_ij^(m+1) = Σ_{(k,l)∈Nij} x_kl^m w_kl^m / Σ_{(k,l)∈Nij} w_kl^m
y_ij^(m+1) = Σ_{(k,l)∈Nij} y_kl^m w_kl^m / Σ_{(k,l)∈Nij} w_kl^m
g_ij^(m+1) = Σ_{(k,l)∈Nij} g_kl^m w_kl^m / Σ_{(k,l)∈Nij} w_kl^m

where hx, hy, and hg are properly selected scaling constants, Nij is a 3-D spherical neighborhood of pixel (i, j) defined by Euclidean distance, and

g(x) = e^(−x)   or   g(x) = 1 if |x| ≤ w, 0 otherwise

where w is a coefficient indicating the size of the flat kernel. The iterations can be repeated a predetermined number of times. At the end of the last iteration, the final gray value of pixel (i, j) is g_ij^m, which can be rounded to the nearest integer.
Sharp filtering
A class of spatial filtering techniques. Similar to high-pass filtering in the frequency domain, it increases the contrast of the image, makes the edges more obvious, and makes the details clearer. Compare smooth filtering.
High-boost filtering
An image sharpening technique that produces a sharpened image and adds it to the original image. This is achieved by multiplying the Fourier transform of the original image by an amplification factor A and subtracting the Fourier transform of the low-pass filtered original image. Letting the Fourier transform of the original image be F(u, v), the Fourier transform of the low-pass filtered original image be FL(u, v), and the Fourier transform of the high-pass filtered original image be FH(u, v), the high-boost filtering result GHB(u, v) is

GHB(u, v) = A·F(u, v) − FL(u, v) = (A − 1)F(u, v) + FH(u, v)

In the above formula, when A = 1 it is a normal high-pass filter. When A > 1, a portion of the original image is added to the high-pass filtered result, recovering part of the low-frequency components lost during high-pass filtering, so that the final result is closer to the original image.
Logarithmic filtering
When the image signal is multiplicative with the noise, the image representation can be logarithmically transformed, converting the multiplicative relationship into an additive one, after which linear filtering can be used to filter out the unwanted noise. After filtering, the result must be exponentially transformed to restore the original representation of the image. See homomorphic filtering.
Nonlinear filtering
A spatial filtering technique. This type of filtering uses a set of observations to produce an estimate of an unobserved quantity and logically combines the observations. Compare linear filtering.
Nonlinear smoothing filtering
A nonlinear filtering method for reducing local grayscale fluctuations and making the image smoother.
Nonlinear sharpening filtering
A nonlinear filtering method for increasing image contrast and making edges more visible.
Rank order filtering
A type of filtering whose output depends on the order of the pixels within the support area. The traditional example is the median filter, which selects the middle value of the set of input values. In the more general case, the filter selects the k-th largest value in the input set. If there are N values in the input set, then 1 ≤ k ≤ N.
Order-statistics filtering
See order-statistics filter.
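A hedged sketch (not from the handbook) of rank order filtering on a 2-D grayscale image; with the default rank it reduces to the median filter, while rank 0 and rank size·size − 1 give the min and max filters. The function name and border handling are assumptions.

```python
import numpy as np

def rank_order_filter(image, size=3, rank=None):
    """Rank order filtering: sort the pixels in each size x size window and
    output the value at position `rank` (default: the median)."""
    if rank is None:
        rank = (size * size) // 2                 # middle of the sorted window
    pad = size // 2
    padded = np.pad(image, pad, mode='edge')      # replicate borders
    out = np.empty_like(image)
    rows, cols = image.shape
    for i in range(rows):
        for j in range(cols):
            window = padded[i:i + size, j:j + size].ravel()
            out[i, j] = np.sort(window)[rank]
    return out

img = np.random.randint(0, 256, (32, 32), dtype=np.uint8)
median_filtered = rank_order_filter(img, size=3)          # median filter
max_filtered = rank_order_filter(img, size=3, rank=8)     # max filter for 3 x 3
```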
Ordering
In an order-statistics filter, the sorting and queueing of a group of targets according to a given characteristic (such as pixel gray value).
Ordering filter
A nonlinear spatial noise filter. Typical filters include median filters, max filters, min filters, and mid-point filters, as well as α-trimmed mean filters, adaptive median filters, and so on.
Bilateral filtering
An image filtering method; a non-iterative alternative to anisotropic filtering. The image is smoothed but the edges are preserved. Consider a pixel p whose gray level is represented by f(p). The neighborhood of the pixel p is denoted by N(p), in which the gray level of a pixel q is represented by f(q). The new gray level of the filtered pixel p is the weighted average of its neighborhood pixel values:

g(p) = (1/M) Σ_{q∈N(p)} f(q) s(p, q) t[f(p), f(q)]

with the normalization factor

M = Σ_{q∈N(p)} s(p, q) t[f(p), f(q)]

The function s(p, q) is determined by the spatial distance between the pixels p and q, and the function t[f(p), f(q)] is determined by the difference between the values of the pixels p and q. In practice, these two functions are often chosen as Gaussian functions, such as

s(p, q) = exp[−(1/2) (‖p − q‖/σ)²]
t(p, q) = exp[−(1/2) (|f(p) − f(q)|/σ)²]

Bilateral filtering can have the effect of removing noise while maintaining edges.
Bilateral smoothing
See bilateral filtering.
Zero-crossing filtering
A very efficient second-order filtering technique that can be used as an image enhancement method; see zero-crossing operator.
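A minimal sketch of bilateral filtering along the lines of the formulas above (not the handbook's code); the parameter names sigma_s (spatial) and sigma_t (gray-value) and the border handling are assumptions.

```python
import numpy as np

def bilateral_filter(image, radius=2, sigma_s=2.0, sigma_t=20.0):
    """Bilateral filtering: each output pixel is a normalized weighted average
    of its neighbors, with weights s(p, q) from spatial closeness and
    t[f(p), f(q)] from gray-value similarity."""
    f = image.astype(np.float64)
    rows, cols = f.shape
    padded = np.pad(f, radius, mode='edge')
    # Spatial (domain) weights are the same for every window: precompute once.
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(x**2 + y**2) / (2.0 * sigma_s**2))
    out = np.empty_like(f)
    for i in range(rows):
        for j in range(cols):
            window = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            range_w = np.exp(-(window - f[i, j])**2 / (2.0 * sigma_t**2))
            weights = spatial * range_w
            out[i, j] = np.sum(weights * window) / np.sum(weights)
    return out
```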
Guided filtering
A linearly variable filtering technique with edge-preserving characteristics. To guide the filtering of an input image I, a guide image G is required, and the result is sent to the output image O. Guided filtering has two goals. The first goal is for the output image O to approximate the input image I, which can be achieved by minimizing the 2-norm of the difference between the two images, expressed as min |O − I|². The second goal is to make the texture of the output image O close to that of the guide image G. This can be achieved by relating the gradient values of the two images: the larger k = ∇O/∇G is, the smoother O will be. Integrating both sides of ∇O = k∇G gives O = aG + b. In practice, the image is divided into blocks. If the two target expressions are satisfied in each block, then a and b can be solved for in each block to implement the filtering. It can be seen from the above that the guide image is used to guide or constrain the filtering operation. A commonly used feature is the texture feature. A common application is smoothing an image or extracting a partial area (matting).
12.1.2 Spatial Domain Filters Spatial filter Convolution templates and operation rules for spatial domain filtering. The sizes of the template can be 3×3, 5×5, 7×7, 9×9 pixels, and so on. Spatial domain filter Same as spatial filter. Classification of spatial filters Classification of spatial domain filters based on their characteristics. The spatial domain filtering techniques can be divided into two types according to their functions: smooth filtering and sharpening filtering. According to the computational characteristics, these filters can be divided into linear filtering and nonlinear filtering. Combined with the above two classifications, the spatial domain filter can be subdivided into four categories. See Table 12.1.

Table 12.1 Classification of spatial filters
Function      Linear operation           Nonlinear operation
Smoothing     Linear smoothing filter    Nonlinear smoothing filter
Sharpening    Linear sharpening filter   Nonlinear sharpening filter
Note: The terms in bold in the table indicate that they are the terms referenced in the text

Smoothing filter A spatial domain filter for image enhancement. The (weighted) averaging of the pixel neighborhood is implemented by template convolution to achieve a smooth effect on the image, especially around region edges (the gray value of these portions has a relatively large change), thereby reducing the local grayscale fluctuation of the image. In other words, a convolution operator is used to reduce noise or high spatial frequency details to achieve a smoothed filter. The averaging filter is a
special case. Such filters include discrete approximations of probability densities with symmetric forms, including Gaussian distributions, binomial distributions, uniform distributions, and the like. Taking a 1-D example, convolving the discrete signal x1, x2, . . ., xn with the kernel [1/6 4/6 1/6] produces a smooth signal y1, y2, . . ., yn+2, where yi = xi−1/6 + 4xi/6 + xi+1/6. Spatial noise filter A filter that removes noise in the spatial domain. Sharpening filter A spatial domain filter for image enhancement. The calculation of the (weighted) difference between adjacent pixels is realized by template convolution to achieve the effect of increasing the image contrast and enhancing the edges of the image. Percentile filter A commonly used order-statistics filter. The output is the value of the pixel at the corresponding percentage position in the sequence obtained after ordering the input data. If the pixels at 50%, 100%, and 0% are selected, the percentile filters are the median filter, the max filter, and the min filter, respectively. Binomial filter A separable smoothing filter in which the individual element values in the filter template are coefficients based on a binomial distribution. As the size of the template increases, it quickly converges to the Gaussian filter. It is a commonly used subsampling filter with a sampling rate of 2. It features high-frequency and low-frequency separation but retains a significant amount of high-frequency detail (so it causes aliasing after subsampling). However, for applications such as image blending, this aliasing has no effect. Nevertheless, if the subsampled image is to be displayed directly or mixed with images of other resolutions, a high-quality filter is needed. If the subsampling rate is high, a windowed SINC pre-filter can be used. The commonly used subsampling filters with a sampling rate of 2 include (1) the linear filter [1, 2, 1], with relatively poor performance; (2) the binomial filter [1, 4, 6, 4, 1], which filters out many frequency components but is more suitable for pyramid analysis; (3) the third-order (cubic) piecewise polynomial filter; (4) the cosine-windowed SINC filter; and (5) the JPEG-2000 9/7 analysis filter, suitable for compression but with large aliasing. Their response curves are shown in Fig. 12.1. Nonrecursive filter A filter that extends finitely in real space. Nonlinear filter A filter whose output is a nonlinear function of the input. It is the physical and functional implementation of nonlinear filtering. There are many types, such as the various gradient operators. The nonlinearity is reflected in different aspects, such as: 1. Doubling the values of all input data does not double the output results (such as recording the position where a given value appears).
Fig. 12.1 Response curves of some commonly used subsampling filters with a sampling rate of 2
2. The effect of applying an operator to the sum of two images is different from the effect of applying the operator to each of the two images and adding the two results (such as in thresholding). Nonlinear mean filter A spatial domain filter for image restoration. Given N numbers xi (i = 1, 2, ..., N), their nonlinear mean can be expressed as

g = f(x1, x2, ..., xN) = h⁻¹[ (Σi=1..N wi h(xi)) / (Σi=1..N wi) ]

where h(x) is generally a nonlinear single-valued analytic function and wi is a weight. The nature of the nonlinear mean filter depends on the function h(x) and the weights wi. If h(x) = x, an arithmetic averaging filter is obtained; if h(x) = 1/x, a harmonic averaging filter is obtained; if h(x) = ln(x), a geometric mean filter is obtained. Nonlinear smoothing filter A spatial domain filter. It is nonlinear in terms of computational characteristics, and from a functional perspective it smooths the image. See classification of spatial filters. Compare linear smoothing filter.
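The following sketch evaluates the nonlinear mean g = h⁻¹[Σ wi h(xi) / Σ wi] for the three choices of h(x) mentioned above; the equal default weights are an assumption for the example.

import numpy as np

def nonlinear_mean(x, w=None, kind='geometric'):
    # g = h^{-1}( sum(w_i h(x_i)) / sum(w_i) )
    x = np.asarray(x, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    if kind == 'arithmetic':          # h(x) = x  -> arithmetic averaging filter
        h, h_inv = (lambda t: t), (lambda t: t)
    elif kind == 'harmonic':          # h(x) = 1/x -> harmonic averaging filter
        h, h_inv = (lambda t: 1.0 / t), (lambda t: 1.0 / t)
    elif kind == 'geometric':         # h(x) = ln(x) -> geometric mean filter
        h, h_inv = np.log, np.exp
    else:
        raise ValueError(kind)
    return h_inv(np.sum(w * h(x)) / np.sum(w))

# Example: geometric mean of a 3x3 neighborhood flattened into a vector.
print(nonlinear_mean([1, 2, 4, 8, 16, 8, 4, 2, 1], kind='geometric'))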
Nonlinear sharpening filter A spatial domain filter. It is nonlinear in terms of computational characteristics, and from a functional perspective it sharpens the image. See classification of spatial filters. Compare linear sharpening filter. Emboss filter See directional difference filter. Rank filter A filter whose output value depends on the pixels in the filter window, which are sorted by their gray values. The most commonly used rank filter is the median filter; there are also the max filter and min filter. See order-statistics filter. Order-statistics filter A spatial domain filter that implements nonlinear filtering. A filter based on order statistics, often referred to as a rank filter. Order statistics here refers to a technique of ordering the pixels in a neighborhood according to gray values and assigning their order (position in the sorted sequence) to the corresponding pixels. That is, it needs to sort the pixels covered by the template on the basis of their gray values. This filter replaces the value of the center pixel of the filtered neighborhood with the value given in the ordering table; it is also known as a percentile filter, the most common of which is the median filter. Since the median filter is not sensitive to wild values, it is often used in robust statistical processes. See rank order filtering. Ordinal transformation A one-to-one mapping from one set of values to another, in which the original sorting relationship is preserved. The order statistics in the order-statistics filter are a case of ordinal transformation. Another type of ordinal transformation converts a set of values into the ordinal values of the individual values in the sorted list, for example, transforming {11, 28, 52, 18, 50} into {1, 3, 5, 2, 4}. Generalized order statistics filter One class of filters that sorts the values in the filter template and selects or combines some of the sorted values as the filtering output. The most typical of these is the median filter.
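As a rough illustration of order-statistics (rank/percentile) filtering, the sketch below sorts the pixels under a square template and keeps the value at a chosen percentile; the edge-padding mode is an assumption of this example.

import numpy as np

def rank_filter(image, size=3, percentile=50):
    # sort the values under the template, keep the one at the given percentile
    # (50 = median filter, 100 = max filter, 0 = min filter)
    f = np.asarray(image, dtype=float)
    r = size // 2
    pad = np.pad(f, r, mode='edge')
    out = np.empty_like(f)
    k = int(round(percentile / 100.0 * (size * size - 1)))   # rank index in the sorted list
    for y in range(f.shape[0]):
        for x in range(f.shape[1]):
            window = pad[y:y + size, x:x + size].ravel()
            out[y, x] = np.sort(window)[k]
    return out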
12.2
Mask Operation
A mask is also called a template or window. It can normally be regarded as a small image of size n × n (n is generally an odd number, much smaller than common image sizes). The values of its coefficients are determined by the function to be implemented (Zhang 2009). According to the connection between pixels, various mask operations can be defined and various functions can be implemented (Davies 2005; Gonzalez and Wintz 1987; Russ and Neal 2016). The basic idea of a mask operation is to assign to a pixel a value that is a function of its own gray value and the gray values of its neighboring pixels.
12.2.1 Mask Mask A sub-image used to implement local operations or to define a range of operations. Sub-images that implement the former function are often called operators, and sub-images that implement the latter function are often called windows. If the image is 2-D, the mask/template used is also 2-D; if the image is 3-D, the commonly used mask is 3-D. The sub-image of a mask (template) often corresponds to a regular area, expressed as an m × n matrix of values or symbols. Typical applications include smoothing masks in convolutional operators, targets in template matching, kernels in mathematical morphology operations, and so on. Specifically, there are orthogonal templates, rotation templates, composite Laplacian templates, convolution templates, gradient templates, frequency modulation templates (FM templates), and so on. Template Same as mask. Window A local template for a filter (to limit the range of operation). Spatial window A window that defines a window function in the space domain. Time window A window that defines a window function in the time domain. Window function A function that defines the range of values of another function by being multiplied with it. According to the space of the value interval, commonly used window functions include the time window, spatial window, frequency window, and time-frequency window. Windowing Using a window to observe or find a small part of an image. For example, given a vector x = {x0, x1, . . ., x255}, it may be needed to observe {x32, x33, . . ., x63} (the third and fourth lines of a 16 × 16 image). Often used to confine a particular calculation (such as a Fourier transform) to a small portion of an image. In general, windowing is described by a windowing function, which is multiplied with the image to give a windowed image. For example, given an image f(x): ℝⁿ → ℝ² and a windowing function w(σ; x), where σ controls the width or scale of w, the windowed image is

fw(x) = f(x) w(σ; x − c)

where c indicates the center of the window. Figure 12.2 shows an illustration of a Gaussian window.
Fig. 12.2 Schematic of the Gaussian window
Window scanning An image is decomposed into regular, adjacent, and possibly (partially) overlapping regions of interest in order to perform certain exhaustive forms of operation. Often used to search for specific content, such as building local histograms, target detection, localization, and positioning. Hann window A one-lobe raised cosine window. One-lobe raised cosine A simple windowed sine function with only one side lobe that can be written as

rcos(x) = (1/2)(1 + cos πx) box(x)

where box is a window function. If the number of side lobes is increased, the result gets closer and closer to the ideal low-pass filter. Mask operation Image operations in units of templates instead of pixels. The most typical ones are template convolution and template ordering, all involving neighborhood operations (local operations) in the image. Sliding window An element commonly used in image processing algorithms to help perform point-by-point neighborhood calculations on images. A predetermined calculation is performed on each pixel covered by the window, and the result is assigned to the pixel in the center of the corresponding window. By processing the entire image through moving the window from left to right and from top to bottom, the processing result of the whole picture can be obtained, as shown in Fig. 12.3. See mask operation. Mask convolution A spatial filtering technique with a specially designed template. It uses a convolution operation to assign a value to a pixel; this value is a function of the gray value of the pixel and the gray values of its neighboring pixels. This kind of operation
Fig. 12.3 Sliding window calculation
cannot be done in place (extra memory is required). The main steps as implemented in the spatial domain are: 1. Roam the template in the image and sequentially overlap the center of the template with a pixel location in the image. 2. Multiply each coefficient on the template by the value of the corresponding pixel under the template. 3. Add all the multiplication products. To maintain the dynamic range, the result is often divided by the sum of the coefficients of the template. 4. Assign the above operation result (called the output response of the template) to the pixel in the output image corresponding to the center of the template. Convolution mask A local operator or sub-image that implements spatial domain filtering techniques. Commonly used convolution template sizes are generally odd, such as 3×3, 5×5, 7×7, 9×9. Template convolution Same as mask convolution. Mask ordering A process of extracting, from the image to be processed, a subset of pixels of the same size as the mask and sorting these pixels by amplitude value. The main steps are: 1. Roam the mask in the input image and coincide the center of the mask with a pixel location in the image. 2. Read the gray value of each corresponding pixel in the input image under the mask. 3. Sort these gray values, generally arranging them from small to large (monotonically increasing). 4. Select an entry from the sorting result according to the purpose of the operation and extract the gray value of that ordered pixel.
5. Assign the extracted gray value to the pixel in the output image corresponding to the center of the mask. Similar to template convolution, the mask ordering process cannot be done in place. However, unlike template convolution, the mask ordering only plays the role of delineating the range of pixels involved in the processing. The coefficients can be regarded as 1 when reading pixel gray values and do not affect the assignment. How one of the gray values is taken after ordering (such as taking the maximum value, the median value, or the minimum value) is the key factor that distinguishes the overall function. Rotating mask A set of templates considered in multiple directions relative to a given pixel. For example, a template used in an edge detector such as a compass gradient operator. It is also often used for average smoothing, where the most uniform template is used to calculate the smoothed value of each pixel. Composite Laplacian mask A sharpening filter obtained by combining the original image f(x, y) with the Laplacian ∇²f(x, y). The expression of the filtered result is

g(x, y) = f(x, y) + c ∇²f(x, y)

where c is a parameter used to satisfy the sign convention of the Laplacian template implementation. If the center coefficient of the template is positive, c = 1; if the center coefficient is negative, then c = −1. The purpose of adding the original image to the Laplacian result is to recover the grayscale tones that are lost in the Laplacian calculation, since using only the Laplacian template tends to produce values around zero. Separable template A template in a filter, or a structural element in a morphological filter, that can be decomposed into a series of smaller templates or structural elements, similar to a separable kernel in a linear filter. The main advantage of separability is the reduction in computational complexity. See separable filter. Edge template A template used to calculate the edge values in different directions in the directional difference operator.
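A hedged sketch of sharpening with a composite Laplacian mask, implemented by plain template convolution as described above; the particular 3×3 Laplacian template (negative center, hence c = −1) and the edge padding are assumptions of the example.

import numpy as np

def convolve2d_same(f, mask):
    # plain template convolution: roam, multiply, sum, assign (edge padded)
    mh, mw = mask.shape
    ry, rx = mh // 2, mw // 2
    pad = np.pad(f, ((ry, ry), (rx, rx)), mode='edge')
    kern = np.flipud(np.fliplr(mask))          # true convolution flips the template
    out = np.empty_like(f, dtype=float)
    for y in range(f.shape[0]):
        for x in range(f.shape[1]):
            out[y, x] = np.sum(pad[y:y + mh, x:x + mw] * kern)
    return out

def composite_laplacian_sharpen(f):
    # g(x, y) = f(x, y) + c * Laplacian; c = -1 for a negative-center template
    f = np.asarray(f, dtype=float)
    laplacian = np.array([[0,  1, 0],
                          [1, -4, 1],
                          [0,  1, 0]], dtype=float)
    return f - convolve2d_same(f, laplacian)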
12.2.2 Operator Operator A generic term for a function that represents some kind of transformation applied to image data. In image engineering, operators are used to perform
Fig. 12.4 Three 3-D templates that make up the 3-D operator
image transforms. They can be used to transform one image into another by performing a certain calculation. An operator takes an image (or region) as input and produces another image (or region). According to the operations it performs, it can be divided into linear operators and nonlinear operators. Operators refer to operations that implement image local operations and their templates (still an image transform), such as orthogonal gradient operators, compass gradient operators, and so on. The three basic operator types are point operators, neighborhood operators, and global transforms (such as the Fourier transform). See image-processing operator. 2-D operator A generic term for a set of (possibly a single) 2-D templates used, together with 2-D image convolution operations, to implement specific image processing and analysis functions (filtering, edge detection, etc.). Commonly used templates are squares with odd side lengths (so the entire template has a central pixel), expressed as 3×3, 5×5, and so on. Typical 2-D operators include neighborhood averaging operators (including a single template), orthogonal gradient operators (including two templates), and so on. 3-D operator Extension of the 2-D operator. A generic term for a set of (possibly a single) 3-D templates used to interact with images and implement specific image processing and image analysis functions. Taking the neighborhood averaging operator (including a single template) as an example, the 2-D operator uses a 3×3 template, while the 3-D operator uses a 3×3×3 template. Taking the orthogonal gradient operator as an example, the 2-D operator includes two 2-D templates; when extending to 3-D, three 3-D templates are used. The easiest way to obtain a 3-D template is to add another dimension to the 2-D template in the direction perpendicular to its plane. For example, the 2-D template of size 3×3 becomes a 3-D template of size 3×3×1. Figure 12.4 shows three 3-D templates acting in the three directions X, Y, and Z, respectively.
Fig. 12.5 A total of 16 zero crossing modes
Neighborhood operator An image local operator in which the gray level assigned to each output pixel depends not only on the gray level of the corresponding input pixel but also on the gray levels of the pixels in the neighborhood of the corresponding input pixel. If the assignment also depends on the position of the input pixel in the image, the neighborhood operator is (spatially) nonuniform; otherwise it is uniform. Compare point operators. Neighborhood processing Neighborhood-oriented operations. Refers to a kind of image processing technology in which the same operation is performed on each pixel of the image, that is, for any pixel (called a reference pixel), the processed value is a function of the original values at that pixel and at its neighboring pixels. The form of the function, that is, the way in which the neighboring pixel values are combined with the reference pixel value to obtain a result, varies greatly between algorithms. Many algorithms work in a linear fashion (such as neighborhood averaging) and use 2-D convolution, while others process the input values in a nonlinear way, such as median filtering. Zero-crossing operator A feature detector that, instead of detecting the maximum value of the first derivative, detects the zero crossings of the second derivative. The advantage of determining zero crossings rather than maximum values is that the detected edge points always form closed curves, so that different regions can be clearly distinguished. The disadvantage is that noise is amplified by the derivative calculation, so the image needs to be carefully smoothed before calculating the second derivative. A commonly used kernel that combines the smoothing and second-derivative calculations is the Laplacian-of-Gaussian kernel. Zero-crossing point A point at which the value obtained by the Laplacian operator is zero. The Laplacian operator is a second-order difference operator, and a point where the second-order difference is zero corresponds to a point where the gray-level gradient is extremal, that is, to an edge point. Edge points can thus be detected by detecting zero crossings. Zero-cross pattern A spatial pattern consisting of zero crossings. The commonly used 16 zero-crossing modes are shown shaded in Fig. 12.5. They have different connectivity and
Fig. 12.6 Image sharpening example
can be divided into four groups. Patterns in the same group can be transformed into each other by rotation and mirroring. Using them as a basis can, for example, help with stereo matching. Image sharpening operator An image enhancement operator that increases the high spatial frequency components of an image to make the edges of objects appear sharper or to reduce blurring. Figure 12.6 gives an example, in which the right picture shows the result of (linearly) sharpening the left one. See edge enhancement. Image sharpening Techniques and processes that make images sharper and more detailed by enhancing the edges, outlines, and grayscale transitions of the image. Image sharpening can be achieved with different image enhancement techniques. Isotropic operator An operator that produces the same output regardless of the local orientation of the neighborhood of the pixel at the location of the operator. For example, a mean smoothing operator gives the same output value when the image data are rotated around the center of the neighborhood. On the other hand, a directional derivative operator gives different values after the image is rotated. This concept is closely related to feature detection: some detectors are sensitive to the local orientation of image pixel values, while others are not. Isotropic gradient operator A gradient operator that calculates the magnitude of the gradient (i.e., a scalar value independent of the edge direction). Treating of border pixels The processing strategy for pixels located at the boundaries of the image. According to the definition of the neighborhood, when the pixels at the image boundary are processed using a template operation, some of the template positions fall outside the image, where no pixels are originally defined. There are two ways to solve this problem. One is to ignore the pixels on these boundaries, considering only pixels inside the image whose distance to the boundary is greater than or equal to the template radius. That is, one can keep the pixel values that are not covered by the template or replace those pixel values with a fixed constant value (usually 0). The effect of this method is
often acceptable when the image size is large and the object of interest is inside the image. The other is to extend the input image: if the template operation is performed with a template of radius r, then r rows or r columns are added/expanded outside the four boundaries of the image. That is, r rows are added before the first row and after the last row of the image, and r columns are added to the left of the first column and to the right of the last column of the image. The operation can be iterated by row or column to increase the number of added rows and columns, so that the operation on the pixels of the original boundary can be performed normally. The amplitude values of the pixels in these new rows or columns can be determined in different ways, for example: 1. The simplest is to take the amplitude values of the newly added pixels as 0. The disadvantage is that this may cause an obvious inconsistency at the image boundary. 2. Taking the amplitude values of these newly added pixels as the values of their 4-adjacent pixels in the original image (the amplitude values of the newly added pixels at the four corners are taken as the values of their 8-adjacent pixels in the original image). 3. The image is regarded as periodic in both the horizontal and vertical directions, that is, the first row of the image is considered to follow the last row, and the first column to follow the last column, so that the corresponding row or column can be moved over. 4. Using extrapolation techniques: extrapolation is performed with a certain rule according to the amplitude values of one or more rows (or one or more columns) of pixels near the boundary, to obtain the amplitude values of pixels outside the boundary of the image. Padding A process used with template operations in order to avoid the problem that the template neighborhood extends beyond the boundary of the image. The extension of the image is supplemented in advance for when the template center corresponds to a boundary pixel of the image. If a template of radius r is used, it is required to add/expand a band of r rows or r columns outside the four boundaries of the image. The amplitude values of the pixels in these new rows or columns can be determined in different ways, for example: 1. Zero: the values of all the padded pixels are taken as 0. The disadvantage is that there may be an obvious inconsistency at the boundary of the image. 2. Constant: all the filled values are taken as a special boundary value. Because this value is given to all pixels outside the image boundary, it is also called the boundary color. The previous method is a special case of this method (the constant is zero). 3. Copy (or clamp to border): repeatedly copy the boundary pixel values outward without restriction; that is, the amplitude values of these newly added pixels are taken as the values of their 4-adjacent pixels in the original image, and the amplitude values of the newly added pixels at the four corners are taken as the values of their 8-adjacent pixels in the original image.
Fig. 12.7 Comparison example of some padding methods
4. Extrapolation: extrapolating the amplitude values of pixels outside the boundary of the image according to the amplitude values of one or more rows (or one or more columns) near the boundary. The previous method is a special case of this method (the amplitude value remains the same). 5. Loop (cyclic repetition or tiling): the image is regarded as periodic in both the horizontal and vertical directions, that is, the last row of the image is followed by the first row, and the last column is followed by the first column, so that a row or column is moved to the corresponding row or column. 6. Mirroring: pixels inside and outside the boundary are mirror images of each other. 7. Extend: extends the mirrored value by subtracting it from the boundary pixel value. A 1-D example of each of the last four cases is shown in Fig. 12.7. The first figure shows a line of the input image, with the dotted lines indicating the boundary; the second figure shows the result of extrapolation padding (where the amplitude value is copied); the third figure shows the result of loop padding; the fourth figure shows the result of mirroring; the fifth figure shows the result of extending. Border effects In order to perform template calculations on an image, the image needs to be padded. Different padding methods have certain effects on the mask operation at the image boundary, resulting in some special effects at the boundary. For example, when a value of 0 is used for filling, the boundary generally becomes darkened, especially at the four vertices.
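Several of the padding schemes listed above have direct counterparts in NumPy's pad function, as the short sketch below suggests; the sample row (similar to the 1-D example of Fig. 12.7) is chosen only for illustration.

import numpy as np

row = np.array([1, 3, 5, 7, 9])

print(np.pad(row, 2, mode='constant', constant_values=0))  # zero padding
print(np.pad(row, 2, mode='edge'))      # copy / clamp the boundary value
print(np.pad(row, 2, mode='wrap'))      # loop (cyclic repetition)
print(np.pad(row, 2, mode='reflect'))   # mirroring about the boundary pixel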
12.3
Linear Filtering
Linear filtering can obtain both the smoothing effect (decreased image contrast) and the sharpening effect (increased image contrast), depending on the coefficient value of the mask/template used (Castleman 1996; Gonzalez and Wintz 1987; Shih 2010; Sonka et al. 2014).
12.3.1 Linear Smoothing Linear filter A filter whose output is a linear function of the input (such as a weighted sum), that is, all items in the filter are either constants or linear in the variables. If the sequence {xi} is an input of values (which can be pixel values from a local neighborhood, or pixel values from the same position in the images of a sequence of the same scene), then the output of the linear filter is Σi ai xi + a0, where the ai are constants. Many low-pass filters for image smoothing and high-pass filters for image sharpening are linear filters whose output is a weighted sum of the inputs. Linear smoothing filter A spatial domain filter in image enhancement. It is linear in its operational characteristics and plays a smoothing role from a functional point of view. See classification of spatial filters. Compare nonlinear smoothing filter. Linear sharpening filter A spatial domain filter in image enhancement. It is linear in its operational characteristics and can be used for sharpening the image from a functional perspective. See classification of spatial filters. Compare nonlinear sharpening filter. Linear shift invariant [LSI] filter A linear filter that satisfies both the superposition principle and the shift-invariance principle. Linear shift invariant [LSI] operator An operator that satisfies the superposition principle and the shift-invariance principle at the same time. Convolution operators and correlation operators are typical linear shift invariant operators. Linear-median hybrid filter A filter obtained by combining a linear filter and a median filter. The typical method is to mix the linear filter and the median filter in series, in which the fast linear filtering operation is applied over the larger template, and the median filtering, with better filtering performance but more complicated calculation, is applied to the few results of the linear filtering. This mixture can reduce the computational complexity and still substantially meet the filtering performance requirements. Consider a structure for linear-median hybrid filtering of a 1-D signal f(i). As shown in Fig. 12.8, the filters HL, HC, and HR (the subscripts L, C, and R represent left, center, and right) are linear filters, and the final filtered result can be expressed as (MED stands for median)

g(i) = MED[HL(f(i)), HC(f(i)), HR(f(i))]

If the input has N data points, averaging reduces them to three values, and the subsequent median filtering is then fast.
Fig. 12.8 Hybrid filtering using basic linear and median filters
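A minimal 1-D sketch of the linear-median hybrid filtering just described follows; taking HC as the current sample and HL, HR as one-sided means of length m is one common choice assumed here, since the entry does not fix the sub-filters.

import numpy as np

def linear_median_hybrid(f, m=3):
    # median of three linear sub-filters: left mean (H_L), current sample (H_C),
    # right mean (H_R); samples near the ends are left unchanged here
    f = np.asarray(f, dtype=float)
    g = f.copy()
    for i in range(m, len(f) - m):
        left = f[i - m:i].mean()            # H_L(f(i))
        centre = f[i]                       # H_C(f(i)), assumed identity filter
        right = f[i + 1:i + m + 1].mean()   # H_R(f(i))
        g[i] = np.median([left, centre, right])
    return g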
Linear filtering A class of spatial filtering techniques. A set of observations is used to generate an estimate of the unobserved quantities, and these observations are combined linearly. Compare nonlinear filtering. Linear smoothing filtering A linear filtering method that reduces local grayscale fluctuations in an image and makes the image look smoother. Spatial domain smoothing An implementation of image smoothing in which each pixel value is replaced with a value calculated directly from other pixels (normally in the neighborhood) in the image. In contrast, smoothing using a frequency domain filter first processes all pixels to obtain a linear transformation of the image (such as the Fourier transform) and then performs the smoothing operation in the transformed image. Smoothing A spatial domain operation for image enhancement. The effect of using a smoothing filter corresponds to low-pass filtering in the frequency domain. It is effective for eliminating Gaussian noise. In general terms, it can denote all of the operations performed on an image to eliminate the effects of noise; it often refers to weakening the high spatial frequency components of the image. Since many noise models have a flat power spectral density (PSD), while natural images have a spectrum that tends to zero at high spatial frequencies, eliminating high spatial frequency components increases the overall signal-to-noise ratio (SNR) of the image. See discontinuous preserving regularization, anisotropic diffusion, power spectrum, and adaptive smoothing. Savitzky-Golay filtering A class of filtering techniques that performs a least-mean-square polynomial fit over a moving window of a signal. The filter used averages the window data and can be thought of as an operator with a smoothing function. See linear filter and curve fitting. Laplacian smoothing A mesh-based smoothing algorithm in which the current vertex values are replaced with the mean of the adjacent vertices in the mesh.
Adaptive weighting A framework for weighting elements in summation, voting, or other forms of combination. The relative influence of each element can be made representative (i.e., adapted to the structure under consideration). For example, the weight can be a similarity between pixels in a neighborhood (as in an adaptive bilateral filter) or a property that changes over time. See adaptive. Adaptive weighted average [AWA] In video processing, a weighted averaging method for calculating image values along a time-space motion trajectory. The method based on this weighted average is particularly suitable for situations in which the same image area contains different scene contents, such as a fast zoom or a change in the camera angle of view. Adaptive smoothing An iterative smoothing method that avoids smoothing across edges. Given an image f(x, y), the steps in a single iteration of adaptive smoothing are as follows: 1. Calculate the gradient amplitude image G(x, y) = |∇f(x, y)|. 2. Further obtain a gradient-weighted image W(x, y) = exp[−λG(x, y)]. 3. Smooth the image with

S(x, y) = [ Σi=−1..1 Σj=−1..1 f(x+i, y+j) W(x+i, y+j) ] / [ Σi=−1..1 Σj=−1..1 W(x+i, y+j) ]
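One iteration of the adaptive smoothing step above might be sketched as follows; the value of λ and the use of numpy.gradient for the gradient magnitude are assumptions of this example.

import numpy as np

def adaptive_smoothing_step(f, lam=0.1):
    # one iteration: weight each 3x3 neighbour by W = exp(-lam * G) and average
    f = np.asarray(f, dtype=float)
    gy, gx = np.gradient(f)
    G = np.hypot(gx, gy)                  # gradient amplitude image G(x, y)
    W = np.exp(-lam * G)                  # edge-stopping weights
    fp = np.pad(f, 1, mode='edge')
    Wp = np.pad(W, 1, mode='edge')
    num = np.zeros_like(f)
    den = np.zeros_like(f)
    H, Wd = f.shape
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            w_shift = Wp[1 + dy:1 + dy + H, 1 + dx:1 + dx + Wd]
            f_shift = fp[1 + dy:1 + dy + H, 1 + dx:1 + dx + Wd]
            num += f_shift * w_shift
            den += w_shift
    return num / den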
12.3.2 Averaging and Mean Averaging filter This is the smoothing filter without weighting. It implements pixel neighborhood averaging by means of template convolution, with the same template coefficients. See box filter. Average smoothing Same as mean smoothing. Neighborhood averaging A linear smoothing filtering using the average value from one pixel’s neighborhood for assignment. In the template used for this filtering, all coefficients can be taken as 1. In the application, in order to ensure that the output result is still in the gray value range of the original image, the filter result is divided by the total number of coefficients and then assigned.
Weighted averaging A method of linear smoothing filtering based on template convolution. Compared with general neighborhood averaging, weighted averaging uses different values for the template coefficients as weights; it is thus a generalized form of neighborhood averaging. Spatial averaging A type of image processing operation in which the pixel values in the output image are weighted averages of their neighborhood pixels in the input image. Mean smoothing and Gaussian smoothing are typical examples of spatial averaging. Mean smoothing operator A noise reduction operator that can be used on a grayscale image or on each component of a multispectral image. The output value at each pixel is the average of all pixels in the neighborhood of that pixel. The size of the neighborhood determines the degree of smoothing (i.e., the degree of noise reduction) and also the degree of blurring of the details. Mean smoothing See mean smoothing operator. Mean A statistical value of a random variable, also known as the statistical average. The mean μ of a random variable X is its expected value, i.e., μ = E[X]. See sample mean. Mean filter A spatial noise filter whose output depends on a pixel mean calculation. Since the mean has many different definitions, there are many different types of mean filters. See mean smoothing operator. α-trimmed mean filter A special ordering filter, also written as alpha-trimmed mean filter. Let the neighborhood N(x, y) of a pixel f(x, y), as defined by the filter template, have a total of M pixels. After sorting these pixels, the α/2 smallest gray values and the α/2 largest gray values are trimmed off, and the average of the remaining M − α pixel values (representing the remaining pixels by gr(s, t)) is computed as the output of this filter:

fo(x, y) = [1/(M − α)] Σ(s,t)∈N(x,y) gr(s, t)
Here, the value of α can be selected from 0 to M − 1. When α = 0, no trimming is performed and the filter reduces to an arithmetic mean filter. When α = M − 1, all values larger or smaller than the median are trimmed off and the filter becomes the median filter. If α takes other values, the effect lies between these two filters. This filter can be used where it is necessary to eliminate several kinds of noise at once, such as salt-and-pepper noise combined with Gaussian noise.
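A compact sketch of the α-trimmed mean filter described above (square template and edge padding are assumptions of the example):

import numpy as np

def alpha_trimmed_mean(image, size=3, alpha=2):
    # trim alpha/2 smallest and alpha/2 largest values in each window,
    # then average the remaining M - alpha values
    f = np.asarray(image, dtype=float)
    r = size // 2
    pad = np.pad(f, r, mode='edge')
    out = np.empty_like(f)
    d = alpha // 2
    M = size * size
    for y in range(f.shape[0]):
        for x in range(f.shape[1]):
            w = np.sort(pad[y:y + size, x:x + size].ravel())
            out[y, x] = w[d:M - d].mean()
    return out

# alpha = 0 gives the arithmetic mean filter; alpha = M - 1 reduces to the median filter.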
Mean shift filtering An adaptive gradient ascent technique performed by iteratively moving the center of a search window to the mean of certain points in the window. Box filter A linear smoothing filter operating in the spatial domain. It is a linear operator that smooths grayscale images in the spatial domain and is the simplest spatial averaging filter; it is also called the arithmetic averaging filter. All the coefficients of its rectangular template are 1, that is, the same weight is given to each pixel in the template. The effect is to replace the value of a pixel with the neighborhood average around that pixel. Its name stems from the shape of the operator, which is like a box. In practice, to improve computational efficiency, only the columns that move out of the template and the columns that move into the template need to be updated as the template is swept across the image. To ensure that the values of the output image are still within the original grayscale range, the filtered result is divided by the total number of coefficients of the template. Arithmetic mean filter Same as box filter. Moving average filter Same as box filter.
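The constant-cost behaviour mentioned above (updating sums as the template slides) can also be obtained with an integral image, as in this hedged sketch; the edge padding is an assumption of the example.

import numpy as np

def box_filter(f, size=3):
    # arithmetic mean over a size x size window using an integral image,
    # so the cost per pixel is constant regardless of the window size
    f = np.asarray(f, dtype=float)
    r = size // 2
    pad = np.pad(f, r, mode='edge')
    S = np.zeros((pad.shape[0] + 1, pad.shape[1] + 1))
    S[1:, 1:] = pad.cumsum(0).cumsum(1)        # integral image with a zero border
    H, W = f.shape
    win = (S[size:size + H, size:size + W] - S[:H, size:size + W]
           - S[size:size + H, :W] + S[:H, :W])
    return win / (size * size)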
12.3.3 Linear Sharpening Linear sharpening filtering A linear filtering method that increases the contrast between adjacent pixels in an image to make edges and details more visible. Edge sharpening An image enhancement operation that highlights details in an image. It enhances the high-frequency portion of the image in order to make the edges of the regions in the image visually more distinct. There are many practical algorithms, such as selectively replacing gray level by gradient and statistical differencing. See edge enhancement. Sharpening 1. Enhance the edges and details of an image to help people better observe the image. This can be done in the spatial domain by means of differences in different directions or in the frequency domain by means of a high-pass filter. See edge sharpening. 2. Improve the sharpness of the image by adjusting the focus level. Compare to blur and 4.
Fig. 12.9 Two 2-D templates for the calculation of a three-point gradient
Statistical differencing An edge sharpening algorithm. For the image f(x, y), first determine a window W of size n × n centered at (x, y), and then calculate the variance of the pixels in the window:

σ²(x, y) = [1/(n × n)] Σx∈W Σy∈W [f(x, y) − f̄(x, y)]²

where f̄(x, y) denotes the mean gray value in the window.
Finally, the pixel value f(x, y) is replaced by f(x, y)/σ(x, y); the pixel value of the sharpened result g(x, y) thus obtained will be increased at edge positions and reduced elsewhere. Spatial derivatives The rate of grayscale variation calculated in the spatial domain. In practice, it is approximated by differences. Average gradient In image information fusion, an objective evaluation index based on statistical characteristics. The average (grayscale) gradient of an image reflects its contrast. Since the gradient calculation is performed around local regions, the average gradient mainly reflects the small detail changes and texture characteristics of the image. Let g(x, y) denote the N × N fused image; then the average gradient is

A = [1/(N × N)] Σx=0..N−1 Σy=0..N−1 [GX²(x, y) + GY²(x, y)]^(1/2)
where GX(x, y) and GY(x, y) are the differences (gradients) of g(x, y) along the X and Y directions, respectively. When the average gradient is small, it means that the image has few levels; when the average gradient is large, the image will be clearer in general. Three-point gradient A gradient calculated from the gradation of three pixel points in the image. The calculation can be achieved by template convolution using two 2-D templates as shown in Fig. 12.9 (the upper left pixel is used in both the horizontal and vertical gradients). This is actually the original Robert detection operator. Four-point gradient A gradient calculated using four image points. The calculation can be done using the template convolution of the two 2-D templates of the Robert cross operator.
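Referring back to the average gradient defined above, a small sketch using simple forward differences for GX and GY is given below; cropping both difference images to a common (N−1)×(N−1) grid is a boundary-handling assumption of this example, and the exact normalization varies between references.

import numpy as np

def average_gradient(g):
    # mean magnitude of the horizontal and vertical differences of the fused image
    g = np.asarray(g, dtype=float)
    gx = np.diff(g, axis=1)[:-1, :]     # horizontal differences, cropped to a
    gy = np.diff(g, axis=0)[:, :-1]     # common (N-1) x (N-1) grid
    return np.mean(np.sqrt(gx**2 + gy**2))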
Derivative estimation A method that does not directly calculate the luminance derivative of a point in an image, but uses some other means to estimate. Derivative estimation by function fitting A derivative estimation method. It takes advantage of the continuity of the image and uses function fitting to estimate the derivative. The image brightness is considered as a bivariate surface, and the tangent plane of the surface gives the brightness derivative in both directions. The problem is transformed into a problem of determining the surface parameters.
12.4
Nonlinear Filtering
Nonlinear filtering can distinguish useful content and unwanted noise in the image, so as to obtain better filtering results (Mitra and Sicuranza 2001; Russ and Neal 2016).
12.4.1 Nonlinear Smoothing Nonlinear operator Often an operator for specific problems and specific applications. There are many different types, each with different characteristics; they cannot be described collectively, and special tasks should be treated as separate processes. Compare linear operators. Sharp-unsharp masking An image enhancement technique that can make edges in an image structure more visible. By means of the mask, the operator can add a certain amount of the gradient or high-pass filtering result to the image, and can also subtract a certain amount of the smoothing or low-pass filtering result from the image. Unsharp masking A method of using nonlinear filtering to enhance tiny details in an image; an image enhancement technique. This is done by subtracting a blurred version of the image (corresponding to the low frequencies) from the original image. Because only the high-frequency detail is left, it can be used to construct an enhanced image (so it is actually a sharpening process or technique). The blurred version is generally obtained by convolving the original image with a Gaussian smoothing template. It is also an image sharpening method commonly used in the printing industry. A blurred version fb(x, y) of the image is subtracted from the image f(x, y) to reduce the contrast difference between them:
fum(x, y) = f(x, y) − fb(x, y)

Toboggan contrast enhancement An adaptive image enhancement method. The main steps are as follows: 1. Calculate the amplitude of the gradient vector of each pixel. 2. Determine, in a local window surrounding each pixel, the pixel having the minimum value of the local gradient amplitude. 3. Assign the gray value of the pixel having the minimum local gradient amplitude to the center pixel. Weighted-order statistic A statistical value obtained by weighting the values of the elements participating in the sort. Ordering filter A nonlinear spatial noise filter. Typical examples are the median filter, max filter, min filter, and mid-point filter, in addition to the α-trimmed mean filter, adaptive median filter, and so on. Rank-based gradient estimation Estimation of gradient values in noisy situations by means of a sorted list of local data points. Kuwahara filter An edge-preserving noise reduction filter. Four 3×3 masks are placed in the four directions around the pixel to be smoothed; the smoothed value for the pixel is the mean of the mask region with the smallest variance. In this regard, it is an adaptive averaging filter. Bilateral filter A nonlinear filter that incorporates a weighting idea and the idea of eliminating wild pixels (here, neighborhood pixels whose values differ too much from the template center value). The output pixel value of the bilateral filter depends on a weighted combination of neighboring pixels:

g(x, y) = [ Σp,q f(p, q) w(x, y, p, q) ] / [ Σp,q w(x, y, p, q) ]

The weighting factor depends on the domain kernel (often a Gaussian function)

d(x, y, p, q) = exp{−[(x − p)² + (y − q)²] / (2σd²)}

and the range kernel (which measures the degree of closeness to the gray value of the center pixel; for a color image a vector distance should be used)
Fig. 12.10 Comparison of edge-preserving median filtering with neighborhood averaging

r(x, y, p, q) = exp{−[f(x, y) − f(p, q)]² / (2σr²)}

The product of these two kernels is the data-dependent bilateral weighting function

w(x, y, p, q) = exp{−[(x − p)² + (y − q)²] / (2σd²) − [f(x, y) − f(p, q)]² / (2σr²)}

The bilateral filter is equivalent to the combination of a Gaussian filter (which only considers differences in the spatial domain) and the α-trimmed mean filter (which only considers differences in the range domain), because it considers both the differences in the spatial domain and those in the range domain. Joint bilateral filter A modification of the bilateral filter in which the range kernel is calculated for the flash image rather than for the ambient image, because the flash image is less noisy and has more believable edges. Edge preserving smoothing An operation performed using a smoothing filter designed to preserve edges in the image while eliminating noise. For example, a median filter can be used for smoothing with edge retention. Figure 12.10 compares the results of processing the same image, corrupted with uniformly distributed random noise, by neighborhood averaging and by median filtering. Figure 12.10a, c show the results of neighborhood averaging using 3×3 and 5×5 templates, respectively, while Fig. 12.10b, d show the results of median filtering with 3×3 and 5×5 templates, respectively. Comparing these results (for the two sets of results using the same size of template), the median filtering not only smooths the image but also eliminates the noise, so the contours in the filtered image are clearer than in the neighborhood averaging results.
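Putting the domain kernel and range kernel together, a direct (unoptimized) bilateral filter might be sketched as follows; the window radius and the two sigmas are illustrative values.

import numpy as np

def bilateral_filter(f, radius=3, sigma_d=2.0, sigma_r=25.0):
    # g(x, y) = sum f(p, q) w / sum w, with w = domain kernel * range kernel
    f = np.asarray(f, dtype=float)
    pad = np.pad(f, radius, mode='edge')
    ax = np.arange(-radius, radius + 1)
    dx, dy = np.meshgrid(ax, ax)
    domain = np.exp(-(dx**2 + dy**2) / (2.0 * sigma_d**2))   # d(x, y, p, q)
    out = np.empty_like(f)
    for y in range(f.shape[0]):
        for x in range(f.shape[1]):
            window = pad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            rng = np.exp(-(window - f[y, x])**2 / (2.0 * sigma_r**2))  # r(x, y, p, q)
            w = domain * rng
            out[y, x] = np.sum(window * w) / np.sum(w)
    return out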
Fig. 12.11 Exponential smoothing example
Edge adaptive smoothing A method of avoiding image blurring when smoothing images. When smoothing an image, a window is placed around a pixel, the average in the window is calculated, and the result is assigned to the center pixel. If the placed window just spans two areas, the boundary between the two areas will be blurred. In edge adaptive smoothing, several windows are placed around a pixel such that the pixel takes various possible relative positions with respect to the center of the window. The variance of the pixel values is calculated in each window, and the window with the smallest variance is selected (to minimize the blurring of edges). The average of the weighted or unweighted window is then calculated, and the result is assigned to the pixel under consideration. Exponential smoothing A method of predicting a current data value Pt based on a previous observation Dt−1 and a previous prediction Pt−1. It can be written as

Pt = w Dt−1 + (1 − w) Pt−1

where w is a weighting parameter with a value between 0 and 1. An example is shown in Fig. 12.11. Crimmins smoothing operator An iterative algorithm to reduce salt-and-pepper noise. It uses a nonlinear noise reduction technique that compares the gray level of each pixel with the gray levels of the eight neighborhood pixels around it and then increases or decreases the gray level of the pixel to make it more representative of the surrounding environment. If the gray value of the pixel is lower than those of the surrounding pixels, the gray level is increased; if it is higher than the surrounding pixels, the gray level is reduced. The more iterations, the more the noise is reduced, but the more the details are blurred. Bartlett filter A spatial filter that uses a piecewise linear "tent" function. It has a special weighted average function, i.e., a piecewise linear continuous smoothing function ("tent" function). The simplest 3×3 template for this filter is shown in Fig. 12.12
Fig. 12.12 A 3×3 template of the Bartlett filter
(1/16) [1 2 1; 2 4 2; 1 2 1] = (1/4)[1 2 1]ᵀ · (1/4)[1 2 1]
and is often referred to as a bilinear kernel because it is an outer product of two linear (first-order) splines. Piecewise linear "tent" function Used in the Bartlett filter; its shape is flat on both sides, with a pointed roof-like (tent-like) projection in the middle. The 3×3 template that implements its function in a 2-D image is as follows and can be decomposed into two 1-D templates (tent functions):
(1/16) [1 2 1; 2 4 2; 1 2 1] = (1/4)[1; 2; 1] · (1/4)[1 2 1]
Bilinear kernel A piecewise linear “tent” function for smoothing images. The core of the Bartlett filter. It is the outer product of two linear (first-order) splines.
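The separability noted above can be exploited directly: filtering the rows and then the columns with the 1-D tent kernel [1 2 1]/4 is equivalent to convolving with the 3×3 Bartlett template, as in this sketch (numpy.convolve zero-pads at the borders, which is an assumption of the example).

import numpy as np

tent_1d = np.array([1.0, 2.0, 1.0]) / 4.0
bartlett_2d = np.outer(tent_1d, tent_1d)   # equals (1/16)[[1,2,1],[2,4,2],[1,2,1]]

def bartlett_smooth(f):
    # separable smoothing: filter rows, then columns, with the 1-D tent kernel
    f = np.asarray(f, dtype=float)
    rows = np.apply_along_axis(lambda v: np.convolve(v, tent_1d, mode='same'), 1, f)
    return np.apply_along_axis(lambda v: np.convolve(v, tent_1d, mode='same'), 0, rows)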
12.4.2 Mid-point, Mode, and Median Mode filter A filter that assigns the most frequently occurring value in a local window centered on a pixel (the mode of the histogram of local values) to the center pixel, thereby eliminating shot noise. That is, the filtered output for each pixel is the mode of the local neighborhood of that pixel. Mid-point filter A nonlinear filter whose output is the average (mid-point) of the two outputs of the max filter and the min filter. For the image f(x, y), the filtering result can be expressed as
gmid(x, y) = (1/2) { max(s,t)∈N(x,y) [f(s, t)] + min(s,t)∈N(x,y) [f(s, t)] } = (1/2) {gmax(x, y) + gmin(x, y)}

where N(x, y) represents the neighborhood of (x, y), as defined by the filter template. This filter combines an order-statistics filter with an averaging filter. Mid-point filtering See mid-point filter. Geometric mean filter A spatial domain filter for image restoration. Let the filter use a template of m × n (window W); the restored image fr(x, y) is obtained by filtering the degraded image g(x, y) as

fr(x, y) = [ Π(p,q)∈W g(p, q) ]^(1/(mn))
The geometric mean filter can be seen as a variant of the arithmetic mean filter and is mainly used for images with Gaussian noise. This filter preserves image detail better than an arithmetic mean filter. Geometric mean See geometric mean filter. Median filter A special order-statistics filter. It is also the most commonly used ordering filter. The median is the value that divides a distribution into two sets with the same number of samples, i.e., the value in the middle of a sequence of values. For the image f(x, y), the output of the median filter can be written as

gmedian(x, y) = median(s,t)∈N(x,y) [f(s, t)]
where N(x, y) represents the neighborhood of (x, y), as defined by the filter template. The median filter removes noise while maintaining image detail. The main steps of median filtering are: 1. Roam the template in the image and coincide the center of the template with a pixel location in the image. 2. Read the gray value of each corresponding pixel under the template. 3. Arrange these gray values from small to large. 4. Find the one in the middle of these values (median). 5. Assign this median value to the pixel corresponding to the center of the template.
The effect of median filtering is to force pixels whose gray values differ significantly from those of their neighboring pixels to become closer to their neighbors, thus eliminating the isolated noise points corresponding to abrupt gray-level changes. See median smoothing. Weighted median filter A median filter whose filtering is weighted with pixel spatial position or pixel structure information. The output of a 2-D weighted median filter can be written as

gweight-median(x, y) = median(s,t)∈N(x,y) [w(s, t) f(s, t)]
where N(x, y) is the neighborhood of (x, y), corresponding to the template size, and w(x, y) is the weight function, here the template function. Different weights can be taken at different locations of the template function. Truncated median filter An approximation of the mode filter when the image neighborhood is small. This filter sharpens blurred image edges and reduces noise. An algorithm that implements the truncated median filter needs to truncate the local distribution of the median on the mean side and recalculate the median of the new distribution. The algorithm can be iterated and generally converges to the mode (even if the observed distribution contains only a few samples and no significant peaks). Adaptive median filter A special filter that extends the standard median filter. The adaptation is reflected in the fact that the filter template size can be adjusted according to the image characteristics (whether or not a region is affected by impulse noise), which can achieve three purposes: (1) filtering impulse noise, (2) smoothing non-impulse noise, and (3) reducing the distortion caused by excessive refinement or coarsening of the target boundary. It preserves image detail better than the standard median filter when filtering out impulse noise. Vector median filter A generalization of the median filter from the scalar median to the vector median. See vector ordering, conditional ordering, marginal ordering, and reduced ordering. Vector Alpha-trimmed median filter A special or extended median filter. For example, it can be used to eliminate the effects of Gaussian noise and impulse noise on multispectral images. Its feature is that after sorting the (pixel) vectors in the smoothing window, only the N(1 − α) vectors with the smallest distances from the other vectors are retained, where N is the total number of vectors in the window and α is a real number in [0, 1]. Next, the mean spectrum can be calculated with only the retained vectors, and the result is assigned to the center pixel of the window. For example, take α = 0.2 and calculate the α-trimmed median vector for the following vector set: x1 = (1, 2, 3), x2 = (0, 1, 3), x3 = (2, 2, 1),
x4 = (3, 1, 2), x5 = (2, 3, 3), x6 = (1, 1, 0), x7 = (3, 3, 1), x8 = (1, 0, 0), x9 = (2, 2, 2). Here, since there are N = 9 vectors, only the first 9 × (1 − 0.2) = 7.2 ≈ 7 vectors in the ordered vector sequence are used. This means that the vectors x2 and x8, which have the largest distances, are ignored. The mean of the remaining vectors is x̄ = (2.0000, 2.0000, 1.7143). Further, after determining the mean of the remaining vectors, the median vector of the remaining vectors can also be calculated by the reduced ordering method and used as the output of the filter. See vector ordering. Median smoothing An image noise reduction operator that replaces the value of a pixel with the median of the ordered pixel values in its neighborhood. Median filtering See median filter. Median flow filtering A noise reduction operation on vector data, generalizing the median filter for (scalar) image data. It is assumed here that the vectors in a spatial neighborhood have similarities; dissimilar vectors are discarded. As a noise cancellation operation for video data, it is a generalization of the median smoothing technique for image data: the assumption is that the vectors in the spatial neighborhood of the current vector are similar to the current vector, and the dissimilar vectors are to be eliminated. Here the word "flow" derives from the filter being used in the context of image motion. Median absolute deviation [MAD] A technique for estimating the scale parameter of a robust penalty function. Scale parameters are often set to the variance or standard deviation of the normal data (inliers) that are not contaminated by noise. However, estimating the scale parameter from the residuals r is often affected by outliers. The median absolute deviation is calculated as follows:

MAD = mediani |ri|

Multiplying it by 1.4 gives a robust estimate of the standard deviation of the inlier noise.
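A one-function sketch of the MAD-based scale estimate described above; the entry's factor 1.4 is used (for Gaussian inliers the usual constant is about 1.4826).

import numpy as np

def robust_sigma(residuals):
    # MAD = median_i |r_i|, scaled to estimate the standard deviation of inlier noise
    r = np.asarray(residuals, dtype=float)
    mad = np.median(np.abs(r))
    return 1.4 * mad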
12.4.3 Nonlinear Sharpening Max-min sharpening transform An image enhancement technique that combines a max filter with a min filter. You can sharpen the blurred edges in the image and make the blurred target clear. The basic idea is to compare the central pixel value f(x, y) in a template coverage area with the maximum value gmax(x, y) and the minimum value gmin(x, y) in the region
and then replace the central pixel value with its nearest extreme (maximum or minimum) value. The max-min sharpening transform S for the image f(x, y) can be expressed as

S[f(x, y)] = gmax(x, y)   if gmax(x, y) − f(x, y) ≤ f(x, y) − gmin(x, y)
S[f(x, y)] = gmin(x, y)   otherwise
This process can be iterated: S^(n+1)[f(x, y)] = S{S^n[f(x, y)]}.
Max filter A special order-statistics filter. For the image f(x, y), the output of the max filter can be written as

gmax(x, y) = max_{(s, t) ∈ N(x, y)} f(s, t)
where N(x, y) represents the neighborhood of (x, y), as defined by the filter template. The max filter can be used to detect the brightest points in the image and to reduce the low-value pepper noise of salt-and-pepper noise.
Min filter A special order-statistics filter. For the image f(x, y), the output of the min filter can be written as

gmin(x, y) = min_{(s, t) ∈ N(x, y)} f(s, t)
where N(x, y) represents the neighborhood of (x, y), as defined by the filter template. The min filter can be used to detect the darkest points in the image and to attenuate the high-value salt noise in salt-and-pepper noise.
Directional derivative filter A filter that calculates the derivative along a particular direction. It has been shown that a Gaussian higher-order directional derivative filter can be composed of a smaller number of basis functions. For example, for a first-order directional derivative, only two basis functions are needed (the direction is u = (u, v)):

Gu = uGx + vGy

For the second-order directional derivative, only three basis functions are needed (the direction is u)
Guu = u²Gxx + 2uvGxy + v²Gyy

Based on the above results, it is possible to construct directional derivative filters with gradually increasing direction selectivity, that is, filters that respond only to edges having strong local uniformity in a specific orientation. Further, at a given location, higher-order filters are likely to respond to more than one oriented edge, including step edges and straight segments.
Direction Corresponds to a vector that describes the position of one point relative to another.
Directional derivative In a 2-D grayscale image, the rate of change of the gray level in the horizontal or vertical direction. For example, the two components of the gradient on the plane are along the two coordinate axes. Figure 12.13 gives a pair of examples, where the left image is the original image, the middle image is the gradient component map calculated in the horizontal direction (strong response to vertical edges), and the right image is the gradient component map calculated in the vertical direction (strong response to horizontal edges). Also refers to the derivative calculated in a particular direction. Given the image f(x, y), the Gaussian filter G(x, y, σ), the gradient operator ∇, and the unit direction vector u = (cos θ, sin θ) = (u, v), the derivative in the direction u is

u · ∇(G ∗ f) = ∇u(G ∗ f) = (∇u G) ∗ f

The Gaussian-smoothed directional derivative filter is

Gu = uGx + vGy = u ∂G/∂x + v ∂G/∂y
The above filter is a steerable filter, because the result of convolving an image with Gu can be obtained by first convolving the image with the filter pair (Gx, Gy) and then steering the resulting gradient field with the unit direction vector u. The benefit of this approach is that the entire family of directional filters can be evaluated at low computational cost.
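A minimal sketch of this idea, assuming SciPy's ndimage.convolve and illustrative function names (the sampled Gaussian-derivative kernels and the three-sigma support are assumptions, not the book's specific implementation):

import numpy as np
from scipy import ndimage

def gaussian_derivative_kernels(sigma, radius=None):
    # Sampled first derivatives of a 2-D Gaussian: Gx and Gy.
    r = radius if radius is not None else max(1, int(3 * sigma))
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    g /= g.sum()
    return -x / sigma**2 * g, -y / sigma**2 * g

def directional_derivative(image, theta, sigma=1.0):
    # Steered response u . grad(G * f): convolve once with each basis
    # kernel, then combine with (cos(theta), sin(theta)).
    gx, gy = gaussian_derivative_kernels(sigma)
    ix = ndimage.convolve(image.astype(float), gx)
    iy = ndimage.convolve(image.astype(float), gy)
    return np.cos(theta) * ix + np.sin(theta) * iy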
Directional selectivity A characteristic of a directional derivative filter; it refers to the ability of the filter to respond strongly only to structures having strong local consistency in a specific orientation. For high-order steerable filters, it also refers to the ability of a filter at a given location to respond to more than one edge orientation, and to respond both to thin lines and to step edges.
Directional differential operator An operator that detects edges based on differences in a particular direction. It first identifies a pixel as a possible edge element and then assigns it one of several predefined directions. In the spatial domain, the directional difference filter convolves the image with a set of templates (edge templates) to compute difference values in different directions, taking the largest value as the edge intensity and the corresponding direction as the edge direction. In fact, each template corresponds to two opposite directions, so which of the two is chosen is determined by the sign of the convolution value. The compass gradient operator is a typical directional difference operator.
Directional difference filter A high-pass filter that emphasizes the detection of edges (direction differences) in a particular direction. Also often referred to as an emboss filter. To achieve the emboss effect, four representative templates (corresponding to four directions) can be used, such as the following:

[ 0  1  0 ]   [ 1  0  0 ]   [ 0  0  0 ]   [  0  0  1 ]
[ 0  0  0 ]   [ 0  0  0 ]   [ 1  0 -1 ]   [  0  0  0 ]
[ 0 -1  0 ]   [ 0  0 -1 ]   [ 0  0  0 ]   [ -1  0  0 ]
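A small sketch of how such templates might be applied (an illustration only; it assumes the four 3 × 3 templates reconstructed above and SciPy's ndimage.convolve):

import numpy as np
from scipy import ndimage

EMBOSS = [
    np.array([[0, 1, 0], [0, 0, 0], [0, -1, 0]]),
    np.array([[1, 0, 0], [0, 0, 0], [0, 0, -1]]),
    np.array([[0, 0, 0], [1, 0, -1], [0, 0, 0]]),
    np.array([[0, 0, 1], [0, 0, 0], [-1, 0, 0]]),
]

def directional_difference(image):
    # Convolve with each template; the strongest absolute response gives
    # the edge intensity and its template index the edge direction
    # (the sign of the winning response distinguishes the two opposite
    # directions associated with that template).
    responses = np.stack([ndimage.convolve(image.astype(float), k)
                          for k in EMBOSS])
    intensity = np.abs(responses).max(axis=0)
    direction = np.abs(responses).argmax(axis=0)
    return intensity, direction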
Adaptive bilateral filter A variation of the bilateral filter that can be used as an image sharpening operator while eliminating noise. It can increase the overall dynamic range of edge properties (such as gradients) when sharpening, without the overshoot and undershoot associated with the unsharp masking operator.
12.5
Gaussian Filter
A Gaussian filter is a typical and commonly used type of filter (Castleman 1996; Davies 2005; Jähne 2004). Combining it with a Laplacian operator can form a Marr operator consistent with human vision (Forsyth and Ponce 2012; Marr 1982; Pratt 2007; Prince 2012).
12.5.1 Gaussian
Gaussian Same as Gaussian distribution.
Cascading Gaussians When a Gaussian filter is convolved with itself, the result is still a Gaussian filter. Also used as a term to denote that the convolution of a Gaussian function with itself is another Gaussian function.
Gaussian convolution See Gaussian smoothing.
Gaussian filter A special low-pass filter. Each value in the filter template is a coefficient sampled from a Gaussian distribution.
Gaussian derivative The combined calculation result of the Gaussian smoothing filter and the Gaussian gradient filter. The resulting gradient filter is less sensitive to noise.
Gaussian blur filter A low-pass filter implemented with a non-uniform kernel. The template coefficients are sampled from the 2-D Gaussian function

h(x, y) = exp[−(x² + y²)/(2σ²)]
The parameter σ controls the overall shape of the function curve: the larger the value of σ, the flatter the resulting curve. The Gaussian blur filter produces a blur effect that is more natural than that of the averaging filter. In addition, the effect of increasing the template size of a Gaussian blur filter is not as pronounced as for the averaging filter. Gaussian blur filters also have some notable characteristics: (1) the kernel is rotationally symmetric, so there is no directional bias in the result; (2) the kernel is separable, enabling fast computation; (3) the kernel coefficients at the kernel edge fall off to (almost) zero; and (4) the convolution of two Gaussian kernels is another Gaussian kernel.
Gaussian blur An operation that softens the image and reduces the sharpness of edges.
Gaussian smoothing An image processing operation for eliminating image noise, realized by convolving the image with a template sampled from a Gaussian distribution.
Diffusion smoothing A technique that achieves Gaussian smoothing. Mathematically, it can be described as a solution to the diffusion equation, where the image to be smoothed is used as the
initial boundary condition. The advantage is that, unlike repeated averaging, a continuous scale space can be constructed.
Beltrami flow A noise removal technique in which an image is viewed as a surface and the task is to minimize the surface area under conditions that preserve edges. See diffusion smoothing.
Diffusion filtering An image denoising method based on nonlinear evolutionary partial differential equations that attempts to qualitatively improve image quality by eliminating noise while maintaining detail (and even enhancing edges). See anisotropic filtering.
Gaussian averaging A way of locally averaging an image. Each coefficient in the averaging template is determined according to the Gaussian distribution, that is, a template having a large center value and smaller surrounding values is convolved with the image.
Derivative of Gaussian filter A concept or parameter used when reinforcing textures in a particular direction. Edge orientation or texture directionality is one of the most important clues to understanding texture. Derivative filters, especially derivatives of Gaussian filters, are often used to enhance texture features in different directions. By varying the width of the Gaussian kernel, these filters can selectively enhance different texture features at different scales. Given a Gaussian function Gσ(x, y), its first derivatives in the x and y directions are

∂Gσ(x, y)/∂x = −(x/σ²) Gσ(x, y),   ∂Gσ(x, y)/∂y = −(y/σ²) Gσ(x, y)
Convolving an image with a Gaussian derivative kernel is equivalent to first smoothing the image with a Gaussian kernel and then calculating the derivative. Such filters along a certain orientation have been widely used for texture analysis.
Difference of Gaussian [DoG] filter The output of the most commonly used filter for extracting texture features. First, different Gaussian kernels are used to smooth the image, and then the difference between the results is calculated to enhance image features, such as edges of different scales. Gaussian smoothing is low-pass filtering, and the difference of Gaussian filters corresponds to effective band-pass filtering. The Gaussian difference here can be expressed as

DoG = Gσ1 − Gσ2

where Gσ1 and Gσ2 are two different Gaussian kernels. The difference of Gaussian filter is often used as an approximation of the Laplacian-of-Gaussian filter. By adjusting σ1 and σ2, texture features can be extracted at specific spatial
frequencies. Note that this filter cannot select the direction. Example applications include the primal sketch of scale space and SIFT feature selection.
Difference of Gaussian [DoG] The difference of Gaussian functions. Two images are filtered by Gaussian filters with different variance parameters, and the results are subtracted to obtain the Gaussian difference.
Difference of Gaussian [DoG] operator A linear convolution operator, which can be used for image enhancement and corner detection based on the difference of Gaussian.
Difference of offset Gaussian [DooG] filter A simple filtering technique that provides useful texture features such as edge orientation and edge intensity. Similar to the difference of Gaussian filter, the filter kernel is obtained by subtracting two Gaussian functions. However, there is an offset vector d = (dx, dy) between the centers of the two Gaussian functions:

DooGσ(x, y) = Gσ(x, y) − Gσ(x + dx, y + dy)
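A brief NumPy sketch of building DoG and DooG kernels as defined above (illustrative only; the sampling radius of roughly three standard deviations is an assumption):

import numpy as np

def gaussian_kernel(sigma, radius):
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return g / g.sum()

def dog_kernel(sigma1, sigma2):
    # DoG = G_sigma1 - G_sigma2, sampled on a common support.
    r = max(1, int(3 * max(sigma1, sigma2)))
    return gaussian_kernel(sigma1, r) - gaussian_kernel(sigma2, r)

def doog_kernel(sigma, dx, dy):
    # G_sigma(x, y) - G_sigma(x + dx, y + dy): the second Gaussian is
    # built on a grid offset by the vector d = (dx, dy).
    r = max(1, int(3 * sigma)) + max(abs(dx), abs(dy))
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    g1 = np.exp(-(x**2 + y**2) / (2 * sigma**2)); g1 /= g1.sum()
    g2 = np.exp(-((x + dx)**2 + (y + dy)**2) / (2 * sigma**2)); g2 /= g2.sum()
    return g1 - g2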
12.5.2 Laplacian of Gaussian
Laplacian-of-Gaussian [LoG] filter A filter for smoothing an image. The filter function can be written as

∇²G(x, y; σ) = [(x² + y²)/σ⁴ − 2/σ²] G(x, y; σ)
Laplacian-of-Gaussian [LoG] kernel The kernel of the Laplacian-of-Gaussian filter. The commonly used 5 × 5 template that implements its function in a 2-D image is as follows and can be decomposed into two 1-D templates:

          [ 1  4  6  4  1 ]
          [ 4 16 24 16  4 ]
  (1/256) [ 6 24 36 24  6 ]  =  (1/16)[1 4 6 4 1]ᵀ × (1/16)[1 4 6 4 1]
          [ 4 16 24 16  4 ]
          [ 1  4  6  4  1 ]
It can also be implemented as the sum of two separable kernels, i.e., starting from

∇²Gσ(x, y) = (1/σ³) [(x² + y²)/(2σ²) − 2] exp[−(x² + y²)/(2σ²)]

it can be broken down into two separable parts:

∇²Gσ(x, y) = (1/σ³) [x²/(2σ²) − 1] Gσ(x) Gσ(y) + (1/σ³) [y²/(2σ²) − 1] Gσ(y) Gσ(x)

Fig. 12.14 A Laplacian-of-Gaussian template (σ = 1.4):

  0    1    1    2    2    2    1    1    0
  1    2    4    5    5    5    4    2    1
  1    4    5    3    0    3    5    4    1
  2    5    3  -12  -24  -12    3    5    2
  2    5    0  -24  -40  -24    0    5    2
  2    5    3  -12  -24  -12    3    5    2
  1    4    5    3    0    3    5    4    1
  1    2    4    5    5    5    4    2    1
  0    1    1    2    2    2    1    1    0
Laplacian-of-Gaussian [LoG] operator An underlying image operator, also known as the Marr operator. A Gaussian smoothing operation is first performed at each position in the image, and then the second-derivative Laplacian operator (∇²) is applied. Edge detection is often performed using this isotropic operator as part of a zero-crossing operator, because the positions where the sign of the value changes (from positive to negative or from negative to positive) lie near edges in the input image, and the amount of detected edge detail can be controlled using the scale parameter of the Gaussian smoothing. Figure 12.14 shows a template that implements a Laplacian-of-Gaussian operator with a smoothing parameter σ = 1.4.
Laplacian of Gaussian [LoG] A combination of a Gaussian function and the Laplacian operator.
Laplacian-of-Gaussian [LoG] transform A simple but useful multi-scale image transform. The transformed data contains basic but useful texture features. The Laplacian of a 2-D Gaussian with zero mean and standard deviation σ can be expressed as
LoGσ(x, y) = −(1/(πσ⁴)) [1 − (x² + y²)/(2σ²)] exp[−(x² + y²)/(2σ²)]

The Laplacian-of-Gaussian transform calculates the second-order spatial differentiation of the image and is closely related to the difference of Gaussian function. It is commonly used for feature extraction at the bottom layer.
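A small NumPy sketch of sampling this LoG expression into a discrete template (an illustration under the formula above; the four-sigma support radius and the zero-sum correction are common practical assumptions, not taken from the book):

import numpy as np

def log_kernel(sigma, radius=None):
    # Sample LoG_sigma(x, y) on a (2r+1) x (2r+1) grid.
    r = radius if radius is not None else max(1, int(4 * sigma))
    y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
    s = (x**2 + y**2) / (2 * sigma**2)
    k = -(1.0 / (np.pi * sigma**4)) * (1 - s) * np.exp(-s)
    return k - k.mean()   # remove residual DC so the weights sum to zero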
Chapter 13
Frequency Domain Filtering
Fourier transform can be used to transform the image into frequency domain space (Bracewell 1986; Russ and Neal 2016). Frequency domain filtering is an important method for image enhancement (Gonzalez and Woods 2008; Zhang 2017a).
13.1
Filter and Filtering
Enhancement in the frequency domain space is achieved by means of frequency domain filters, which have many types (Russ and Neal 2016; Szeliski 2010; Tekalp 1995; Umbaugh 2005).
13.1.1 Basics of Filters
Image filtering An image processing technique that uses a filter acting on an image to obtain a desired result.
Filtering An image enhancement concept or technique realized by applying filters. It can be used to describe some image enhancement methods based on the spatial domain and also some image enhancement methods based on a transform domain. The transform domain actually used is mainly the frequency domain, so filtering techniques are mainly divided into spatial domain filtering techniques and frequency domain filtering techniques.
Filtering the image The process of eliminating some unwanted components in an image or changing the ratio between different components in an image. Often done in the frequency
domain. For example, this can be achieved by multiplying the Fourier transform of the image by some function that "eliminates" or changes certain frequency components and then taking the inverse Fourier transform.
Filter method 1. A method of implementing image processing with filters. For example, a high-pass filter can be used to sharpen the image, and a low-pass filter can be used to eliminate noise. 2. A simple method of feature selection. Feature selection is used as a preprocessing step of a learning method, computing feature statistics per variable. This type of method is simple to compute, but since neither the relationships between variables nor the nature of the feature-based training in the learning method is considered, it may yield only suboptimal, not very accurate results.
Filtering threshold A threshold used to determine or select a filter feature. The filtering system typically generates a numerical score to indicate how well a feature matches a given profile. Features having a score above a particular filtering threshold may be selected in practice. See cross-correlation matching.
Filter selection One of the functional blocks used in the selective filter to effectively eliminate different types of noise. Its function is to select the corresponding filter for noise removal based on detecting and distinguishing different noises. For example, when both salt-and-pepper noise and Gaussian noise are present in an image, two types of filters can be selected: a smoothing filter or an adaptive Wiener filter for those pixels affected only by Gaussian noise, and a median filter or grayscale interpolation filter for those pixels affected by the salt-and-pepper noise.
Digital filter A device consisting of digital multipliers, adders, delay units, and the like, which performs a filtering function. With the development of electronic technology, computer technology, and large-scale integrated circuits, digital filters can be implemented by computer software or in real time by large-scale integrated digital hardware.
Filter coefficients The coefficients of the filter kernel (template). For example, linear filtering using neighborhood operations (the first form is correlation and the second is convolution) can be written as

g(x, y) = Σ_{i,j} f(x + i, y + j) h(i, j)
g(x, y) = Σ_{i,j} f(x − i, y − j) h(i, j)
Here, the cell values of the kernel h(i, j) are the filter coefficients.
Cutoff frequency 1. The frequency of the boundary between the passed frequencies and the blocked frequencies in a frequency domain filter. It is always a non-negative integer. For an ideal filter, the components on one side of the cutoff frequency pass through the filter completely unaffected, while those on the other side do not pass at all. For non-ideal filters, the frequency at which the filter output amplitude drops to a certain percentage of the maximum value is the cutoff frequency; the commonly used percentages are 50% and 70.7%. 2. The highest spatial frequency and temporal frequency that the human visual system can perceive. Also known as the visual cutoff frequency, it should be the driving factor in determining the video sampling rate (in practice, components beyond these frequencies cannot be perceived and therefore need not be considered).
Fidelity factor An indicator that measures the characteristics of a filter, expressed as the center frequency of the filter divided by its bandwidth, i.e., the reciprocal of the relative bandwidth. For example, the relative bandwidth 1/Q of the wavelet transform filter is a constant, because

1/Q = (Δw)s / (1/s) = Δw

is independent of the scale s, where (Δw)s is the bandwidth of the wavelet transform filter at scale s. In other words, the frequency bandwidth varies with the center frequency 1/s of the wavelet transform filter, and the product still satisfies the uncertainty principle. As can be seen from the above, the wavelet transform can be regarded as a constant-Q analysis. The wavelet transform filter has a small bandwidth at low frequencies, corresponding to a large scale factor, and a large bandwidth at high frequencies, corresponding to a small scale factor.
Filter ringing A distortion that can be produced by a steep recursive filter. The term originates from analog electronic filters, in which certain components (such as capacitors and inductors) can first store energy and then release it. However, there are also digital filters that exhibit the corresponding effect.
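A compact NumPy sketch of the two neighborhood operations defined in the filter coefficients entry above (a minimal illustration; edge padding by replication and an odd, square template are assumptions):

import numpy as np

def correlate2d(f, h):
    # g(x, y) = sum_{i,j} f(x + i, y + j) h(i, j)
    r = h.shape[0] // 2
    fp = np.pad(f.astype(float), r, mode='edge')
    out = np.zeros(f.shape, dtype=float)
    for i in range(-r, r + 1):
        for j in range(-r, r + 1):
            out += h[i + r, j + r] * fp[r + i:r + i + f.shape[0],
                                        r + j:r + j + f.shape[1]]
    return out

def convolve2d(f, h):
    # g(x, y) = sum_{i,j} f(x - i, y - j) h(i, j): correlation with the
    # template rotated by 180 degrees.
    return correlate2d(f, h[::-1, ::-1])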
13.1.2 Various Filters
Separable filter In image processing, a 2-D filter that can be represented as the product of two 1-D filters, each of which acts independently on the rows and columns of the image. The kernel/template of a 2-D separable filter can be decomposed into two 1-D kernels/templates. Every convolution kernel/template of rank 1 is separable. More generally, the term refers to a multidimensional filter whose convolution operation can be broken down into successive 1-D convolutions along all coordinate directions. Thus, a high-dimensional convolution can be implemented with multiple 1-D convolutions, so a separable filter can be computed much faster than a non-separable one. A traditional example is the linear Gaussian filter. Separability implies an important simplification in terms of computational complexity: typically, the processing cost is reduced from O(N²) to O(2N), where N is the size of the filter. Further, for higher dimensions (dimension M), the processing cost is reduced from O(N^M) to O(MN). For example, a binomial filter is a typical separable filter. Its elements are binomial coefficients, each being the sum of the two corresponding numbers in the previous row of the Yang Hui (Pascal) triangle. For example, the following 5 × 5 template can be broken down into the product of two 1-D (5 × 1 and 1 × 5) templates:

[ 1  4  6  4  1 ]
[ 4 16 24 16  4 ]
[ 6 24 36 24  6 ]  =  [1 4 6 4 1]ᵀ × [1 4 6 4 1]
[ 4 16 24 16  4 ]
[ 1  4  6  4  1 ]
More generally, if the template size is (2N + 1) × (2N + 1), the convolution result g(x, y) of the template h(x, y) with the image f(x, y) is (h1 and h2 are the pair of 1-D templates resulting from the decomposition of h)

g(x, y) = Σ_{i=−N}^{N} Σ_{j=−N}^{N} h(i, j) f(x − i, y − j) = Σ_{i=−N}^{N} h1(i) Σ_{j=−N}^{N} h2(j) f(x − i, y − j)
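A short check of this equivalence for the binomial template above, assuming SciPy's ndimage.convolve (the random test image and the default boundary handling are illustrative assumptions):

import numpy as np
from scipy import ndimage

h1 = np.array([1, 4, 6, 4, 1], dtype=float)   # 1-D binomial template
h2d = np.outer(h1, h1)                        # the 5 x 5 template above

img = np.random.rand(64, 64)
full = ndimage.convolve(img, h2d)
# Two 1-D passes: first with h1 as a column vector, then as a row vector.
sep = ndimage.convolve(ndimage.convolve(img, h1[:, None]), h1[None, :])
print(np.allclose(full, sep))                 # True (up to rounding)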
Separable filtering Filtering using a separable filter.
Selective filter A filter that eliminates different types of noise at the same time. If the image is affected by several kinds of noise simultaneously, the selective filtering method selects the appropriate filter at the positions affected by each kind of noise,
so that the respective strengths of the different filters can be exploited to obtain a better overall noise filtering result.
Selective filtering Filtering using a selective filter.
Adaptive filtering A filtering technique in which the filter used depends in some way on the characteristics of a pixel and its neighborhood. The parameters of the filter are a function of time, or a function of position in the image. It is often a nonlinear operation. Traditionally used for edge-preserving smoothing: a low-pass filter is applied to the image in a selective manner to minimize the edge blurring that can occur when using a standard low-pass filter.
Adaptive filter A filter that adjusts its function and characteristics according to the current image (pixel) properties. Generally, it is a filter that adjusts its template or function according to the statistical characteristics of the pixel set covered by the filter template. Because different processing is applied to different pixels in the image, it can achieve a better filtering effect than a fixed filter, but the computational complexity of filtering often increases.
Recursive filter A filter that is implemented recursively and extends infinitely in real space. Its output depends on its own previous outputs. It is the infinite impulse response (IIR) filter of the signal processing literature, because the response of the filter to an impulse can continue indefinitely.
Directional filter A spatial filter that enhances selected directional features in the image.
Composite filter Software or hardware that combines several image processing methods (noise removal, feature extraction, operator combination, etc.).
Steerable filters A set of filters with arbitrary orientations, each generated by linearly combining a set of basis functions (also called basis filters). For 2-D images, the characteristics can be continuously manipulated by appropriate interpolation between the basis filters. A simple example is a smoothing filter that can smooth the image in any direction by such steering. The response depends on a scalar "direction" parameter θ, but the response in any direction can be calculated from a small set of basis responses, saving computational effort. For example, the derivative in the direction θ can be calculated using the derivatives Ix and Iy in the x and y directions:
dI/dnθ = Ix cos θ + Iy sin θ

As another example, Gaussian derivative filters can be used to generate a steerable filter. Let Gx and Gy represent the first-order derivatives of the Gaussian function along x and y (Gy is essentially a rotated version of Gx). The first-order derivative filter for any direction θ can be obtained by a linear combination of Gx and Gy:

Dθ = Gx cos θ + Gy sin θ

Here, cos θ and sin θ are the interpolation functions of the basis functions Gx and Gy. These oriented filters have a selective response to edges, which is important for texture analysis. For filters that are not steerable, such as the Gabor filter, the response needs to be recomputed for each orientation, so the computational complexity is higher. The steerable filter is a combination of derivatives of the Gaussian filter, so even (symmetric) and odd (antisymmetric) edge and corner features can be computed quickly in all possible directions. Since a wider Gaussian function can be used, it is less sensitive to errors in position and orientation.
Decomposable filter A class of complex filters that can be broken down into a series of simple filters applied in sequence. For example, a 2-D Laplacian-of-Gaussian filter can be broken down into four simple filters.
Balanced filter A general term for a type of filter in which the samples are given the same weight in the spatial, temporal, or frequency direction. Typical examples are Gaussian smoothing, the mean filter, the low-pass filter, and the like. The corresponding unbalanced filters (such as bilateral filters) are considered to be adaptive.
Grid filter A noise-suppressing filter in which a nonlinear feature (of a pixel or some pixels) derived from a local neighborhood is used. The grid filter requires a training step in which noisy data and the corresponding ideal data are considered.
Optimum filter A filter that produces an optimal (close) approximation of the desired signal.
Fast elliptical filtering Filtering achieved by constructing a radially uniform box spline through repeated convolution of a fixed number of box distributions.
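The steerability property claimed above (any orientation from a fixed pair of basis responses) can be checked with a small NumPy/SciPy sketch; the sampled Gaussian-derivative kernels, the random test image, and the boundary handling are illustrative assumptions:

import numpy as np
from scipy import ndimage

sigma, theta = 1.5, np.deg2rad(30)
r = max(1, int(3 * sigma))
y, x = np.mgrid[-r:r + 1, -r:r + 1].astype(float)
g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
g /= g.sum()
gx, gy = -x / sigma**2 * g, -y / sigma**2 * g   # basis kernels Gx, Gy

img = np.random.rand(64, 64)
# (a) steer the kernel first, then convolve once
d_kernel = ndimage.convolve(img, np.cos(theta) * gx + np.sin(theta) * gy)
# (b) convolve with the two basis kernels, then steer the responses
d_basis = (np.cos(theta) * ndimage.convolve(img, gx)
           + np.sin(theta) * ndimage.convolve(img, gy))
print(np.allclose(d_kernel, d_basis))           # True: convolution is linear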
13.2
Frequency Domain Filters
Frequency domain filtering requires designing suitable filters according to the purpose of image enhancement. The main task is to determine which frequencies the filter should remove and which it should retain (Gonzalez and Woods 2008). According to this, frequency domain filters are divided into many types (Russ and Neal 2016; Umbaugh 2005).
13.2.1 Filtering Techniques
Frequency domain filtering A processing operation performed on an image in the frequency domain. There are three main steps (a code sketch of these steps is given at the end of this group of entries):
1. Transform the input image into the frequency domain by using the Fourier transform (FT).
2. Specify a filter of a particular kind (such as ideal, Butterworth, Gaussian) and characteristic (such as low pass, high pass), and apply it to the representation of the image in the frequency domain.
3. Inversely transform the resulting values back to the spatial domain by using the inverse Fourier transform to give the (filtered) output image.
Frequency domain filter A filter defined according to its function and role in Fourier space. See high-pass filter and low-pass filter.
Frequency domain filtering technique Image enhancement techniques performed in the frequency domain with the aid of a frequency domain filter. It is also a common technique for eliminating periodic noise.
Frequency filtering technique Same as frequency domain filtering technique.
Frequency filter A filter implementing the functions and operating rules of the frequency domain filtering technique.
Band-pass filtering A frequency domain filtering technique that allows a range of frequency components in an image to pass while blocking frequency components in other ranges. If the lower limit of this frequency range is 0 and the upper limit is not ∞, it becomes low-pass filtering. If the upper limit of this frequency range is ∞ and the lower limit is not 0, it becomes high-pass filtering.
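A minimal NumPy sketch of the three-step procedure above (illustrative only; the centered spectrum, the helper names, and the use of a Gaussian low-pass transfer function as the example H(u, v) are assumptions):

import numpy as np

def distance_grid(shape):
    # D(u, v): distance of each frequency sample from the centre.
    M, N = shape
    u, v = np.mgrid[0:M, 0:N]
    return np.sqrt((u - M / 2) ** 2 + (v - N / 2) ** 2)

def frequency_domain_filter(image, transfer):
    F = np.fft.fftshift(np.fft.fft2(image))             # step 1 (centered)
    G = F * transfer                                    # step 2
    return np.real(np.fft.ifft2(np.fft.ifftshift(G)))   # step 3

img = np.random.rand(128, 128)
D = distance_grid(img.shape)
H = np.exp(-(D / 30.0) ** 2)        # example: Gaussian low-pass, D0 = 30
smoothed = frequency_domain_filter(img, H)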
Band-reject filtering A frequency domain filtering technique that blocks the passage of frequency components within a certain range in an image while allowing frequency components in other ranges to pass. If the lower limit of this frequency range is 0 and the upper limit is not ∞, it becomes high-pass filtering. If the upper limit of this frequency range is ∞ and the lower limit is not 0, it becomes low-pass filtering.
High-pass filtering A frequency domain filtering technique that preserves the high-frequency components in an image and removes or attenuates the low-frequency components. Since the low-frequency components correspond to regions where the gray value changes slowly, they are related to the overall characteristics of the image (such as overall contrast and average gray value). Filtering out these components increases the contrast of the image and makes the edges noticeable.
Low-pass filtering A frequency domain filtering technique that preserves the low-frequency components in the image and removes or attenuates the high-frequency components. Since the high-frequency components correspond to portions, such as region edges, where the gray value changes greatly, this smoothing filtering reduces local grayscale fluctuations and makes the image smoother.
Ideal filter A frequency domain filter that can be defined mathematically but cannot be implemented with electronic devices. A filter that produces the desired signal without any distortion or delay. The concept helps explain the principles of filters, and its characteristics can be simulated in a computer.
13.2.2 Low-Pass Filters Low-pass filter [LPF] A frequency domain filter that implements low-pass filtering. The filtering characteristics are determined by the transfer function. The effect on the output image is equivalent to a filter that attenuates high-frequency components (corresponding to minute details in the image) and preserves low-frequency components (corresponding to slow variations and uniform regions in the image). This is a term introduced from 1-D signal processing theory to 2-D or 3-D image processing. Among them, “low” is shorthand for low frequency. In the case of only one image, the low refers to a low spatial frequency, that is, an intensity pattern that does not change over many pixels. The result of using a low-pass filter for an image is to preserve low spatial frequency patterns, or large and slowly varying modes, and eliminate high spatial frequency components (sharp edges and noise). A low-pass filter is a smoothing filter or a noise canceling filter. On the other hand, it is also possible to filter the change value of the same pixel position on a multi-frame image
in one image sequence. In this case, the pixel values can be seen as a time series of samples, and the definition of the low-pass filter returns to its original meaning in signal processing: the effect of filtering is to eliminate fast temporal changes. Compare the high-pass filter.
Ideal low-pass filter [ILPF] An ideal filter for low-pass filtering. Its transfer function satisfies the condition

H(u, v) = 1   if D(u, v) ≤ D0
H(u, v) = 0   if D(u, v) > D0

where D0 is the cutoff frequency and D(u, v) is the distance from the point (u, v) to the origin of the frequency plane, D(u, v) = (u² + v²)^(1/2). Figure 13.1a, b give a cross-sectional view (with D(u, v) symmetric about the origin) and a perspective view of H(u, v), respectively; the perspective view is equivalent to the result of rotating the profile around the H(u, v) axis. Since the ideal low-pass filter has a sharp transition between the passband and the stopband, there are significant ringing artifacts in the filtered output.
Sinc filter An ideal low-pass filter based on the sinc function, sinc(x) = sin(x)/x.
Windowed sinc filter A special sinc filter whose extent is limited by multiplying the sinc filter with a window function. The highest-quality interpolation can be achieved because detail in the low-resolution image can be retained while aliasing is avoided.
Anti-aliasing filter An optional low-pass filter with a cutoff frequency below the Nyquist limit (i.e., less than half of the sample rate). Its main function is to eliminate the signal frequency components that may cause aliasing.
Fig. 13.2 Schematic diagram of the trapezoidal low-pass filter transfer function
Anti-alias The technique and process of making the angular edges of targets in an image smoother and eliminating jagged edges. It also includes homogenizing the color of lines to make the transition from target to background smoother.
Trapezoid low-pass filter A frequency domain filter that can be physically implemented to perform low-pass filtering. Its transfer function can be expressed as

H(u, v) = 1                             if D(u, v) ≤ D′
H(u, v) = [D(u, v) − D0] / (D′ − D0)    if D′ < D(u, v) < D0
H(u, v) = 0                             if D(u, v) ≥ D0
where D0 is a non-negative integer, which can be taken as the cutoff frequency; D′ is the segmentation point of the piecewise linear function; and D(u, v) is the distance from the point (u, v) to the origin of the frequency plane. The transfer function profile of the trapezoidal low-pass filter is shown in Fig. 13.2. Compared to the transfer function of an ideal low-pass filter, the transfer function of the trapezoidal low-pass filter has a transition zone between high and low frequencies, which can attenuate some ringing. However, because the transition is not smooth enough, the ringing is generally stronger than that produced by the transfer function of the Butterworth low-pass filter.
Butterworth low-pass filter A frequency domain filter that performs low-pass filtering and can be physically implemented. The transfer function H(u, v) of the n-th order Butterworth low-pass filter with a cutoff frequency of D0 is

H(u, v) = 1 / (1 + [D(u, v)/D0]^(2n))
where D(u, v) is the distance from the point (u, v) to the origin of the frequency plane. The shape of the filter’s frequency response, especially the steepness between the passband and the stopband, is controlled by the value of n: a large n value corresponds to a steeper transition, closer to the shape of the ideal low-pass filter.
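A short sketch of this transfer function on a centered frequency grid (illustrative; the grid construction mirrors the distance_grid helper assumed earlier, and the resulting H can be passed to the frequency_domain_filter sketch above):

import numpy as np

def butterworth_lowpass(shape, d0, n):
    # H(u, v) = 1 / (1 + (D(u, v)/D0)^(2n)) on a centered frequency grid.
    M, N = shape
    u, v = np.mgrid[0:M, 0:N]
    D = np.sqrt((u - M / 2) ** 2 + (v - N / 2) ** 2)
    return 1.0 / (1.0 + (D / d0) ** (2 * n))

# Example: a second-order Butterworth low-pass with cutoff 30.
H = butterworth_lowpass((128, 128), d0=30, n=2)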
Fig. 13.3 Schematic diagram of the exponential low-pass filter transfer function
Exponential low-pass filter A frequency domain filter that performs low-pass filtering and can be physically implemented. Its transfer function is an exponential function. The transfer function of an exponential low-pass filter of order n with a cutoff frequency of D0 is

H(u, v) = exp{−[D(u, v)/D0]^n}

where D0 is a non-negative integer; D(u, v) is the distance from the point (u, v) to the origin of the frequency plane, D(u, v) = (u² + v²)^(1/2); and n is the order of the filter (when n is 2, it becomes a Gaussian low-pass filter). The transfer function of the exponential low-pass filter of order 1 is shown in Fig. 13.3. There is a smooth transition between high and low frequencies, so the ringing phenomenon is weak (there is no ringing for the Gaussian low-pass filter). Compared with the transfer function of the Butterworth low-pass filter, the transfer function of the exponential low-pass filter generally decays faster with increasing frequency at the beginning; its ability to attenuate high-frequency components is stronger, and the blur it causes in the image is greater, but the ringing it produces is generally less pronounced than that produced by the Butterworth low-pass filter transfer function. In addition, its tail extends further, so this filter attenuates noise more strongly than the Butterworth low-pass filter, but its smoothing effect is generally not as good.
Gaussian low-pass filter A special exponential low-pass filter. Its transfer function takes the form

H(u, v) = exp{−[D(u, v)/D0]²}

where D(u, v) is the distance from the point (u, v) to the origin of the frequency plane and D0 is the cutoff frequency. D0 is proportional to the variance of the Gaussian function and controls the width of the bell curve. Smaller D0 values correspond to more severe filtering, resulting in more strongly blurred results. Since the inverse Fourier transform of a Gaussian function is also a Gaussian function, the Gaussian low-pass filter does not produce ringing.
Fig. 13.4 Sectional and perspective view of the ideal high-pass filter transfer function
13.2.3 High-Pass Filters
High-pass filter [HPF] A frequency domain filter that implements high-pass filtering, i.e., a frequency domain filter that eliminates or attenuates low-frequency components. Its filtering characteristics are determined by its transfer function. When applied to an image, it retains or enhances the high-frequency components (such as fine details, points, lines, and edges) and thus highlights brightness transitions and abrupt changes in the image.
Ideal high-pass filter [IHPF] An ideal filter for high-pass filtering. Its transfer function satisfies the condition

H(u, v) = 0   if D(u, v) ≤ D0
H(u, v) = 1   if D(u, v) > D0
where D0 is the cutoff frequency and D(u, v) is the distance from the point (u, v) to the origin of the frequency plane, D(u, v) = (u² + v²)^(1/2). Figure 13.4a, b give a 1-D cross-section of H(u, v) (with D(u, v) symmetric about the origin) and a perspective view (equivalent to the result of rotating the section of Fig. 13.4a around the H(u, v) axis), respectively. Since the ideal high-pass filter has a sharp transition between the passband and the stopband, there is a significant ringing artifact in the filtered output.
High-boost filter An improved version of a frequency domain filter that performs high-pass filtering. Compare the high-frequency emphasis filter.
High-frequency emphasis filter An improved version of a frequency domain filter that performs high-pass filtering. Its transfer function adds a constant c to the transfer function of the high-pass filter. Letting the transfer function used by high-pass filtering be H(u, v), the high-frequency emphasis transfer function is
He(u, v) = H(u, v) + c

where c is a constant in [0, 1]. The high-frequency emphasis output thus obtained retains a certain amount of low-frequency components on the basis of high-pass filtering, which is equivalent to superimposing some high-frequency components onto the original image, thereby improving the contrast of edge regions while maintaining the gray level of smooth regions. The high-frequency emphasis filter becomes a high-boost filter when c = (A − 1), where A is the amplification factor.
High-frequency emphasis A technique that preserves the low-frequency information content of the input image (and enhances its spectral components) by multiplying the high-pass filter function by a constant and adding an offset to the result. It can be expressed as

Hhfe(u, v) = a + bH(u, v)

where H(u, v) is the transfer function of the high-pass filter, a ≥ 0 and b > 1.
Gaussian high-pass filter A special exponential high-pass filter. Its transfer function can be expressed as

H(u, v) = exp{−[D0/D(u, v)]²}

It can also be expressed as

H(u, v) = 1 − exp{−[D(u, v)/D0]²}

where D(u, v) is the distance from the origin of the frequency plane to the point (u, v) and D0 is the cutoff frequency. Since the inverse Fourier transform of a Gaussian function is also a Gaussian function, the Gaussian high-pass filter does not generate ringing.
Trapezoid high-pass filter A frequency domain filter that can be physically implemented to perform high-pass filtering. Its transfer function can be expressed as

H(u, v) = 0                             if D(u, v) ≤ D0
H(u, v) = [D(u, v) − D0] / (D′ − D0)    if D0 < D(u, v) < D′
H(u, v) = 1                             if D(u, v) ≥ D′
where D0 is a non-negative integer, which can be taken as the cutoff frequency; D′ is the segmentation point of the piecewise linear function; and D(u, v) is the distance from the point (u, v) to the origin of the frequency plane.
Fig. 13.5 Schematic diagram of the transfer function of the trapezoidal high-pass filter
Fig. 13.6 Schematic diagram of the exponential high-pass filter transfer function
The transfer function profile of the trapezoidal high-pass filter is shown in Fig. 13.5. Compared to the transfer function of an ideal high-pass filter, the transfer function of the trapezoidal high-pass filter has a transition between high and low frequencies, which can attenuate some ringing. However, because the transition is not smooth enough, the ringing phenomenon is generally stronger than that generated by the transfer function of the Butterworth high-pass filter.
Butterworth high-pass filter A frequency domain filter that can be physically implemented to perform high-pass filtering. The transfer function H(u, v) of the n-th order Butterworth high-pass filter with a cutoff frequency of D0 is

H(u, v) = 1 / (1 + [D0/D(u, v)]^(2n))
where D(u, v) is the distance from the point (u, v) to the origin of the frequency plane. The shape of the filter's frequency response, especially the steepness between the passband and the stopband, is controlled by the value of n: a large value of n corresponds to a steeper transition, closer to the shape of the ideal high-pass filter.
Exponential high-pass filter A frequency domain filter that can be physically implemented to perform high-pass filtering. The transfer function of an exponential high-pass filter of order n with a cutoff frequency of D0 is

H(u, v) = 1 − exp{−[D(u, v)/D0]^n}
where D0 is a non-negative integer; D(u, v) is the distance from the point (u, v) to the origin of the frequency plane, D(u, v) = (u² + v²)^(1/2); and n is the order of the filter (when n is 2, it becomes a Gaussian high-pass filter). The transfer function of the exponential high-pass filter of order 1 is shown in Fig. 13.6. There is a smooth transition between the high and low frequencies, so the ringing phenomenon is weak (there is no ringing for the Gaussian high-pass filter). Compared to the transfer function of the Butterworth high-pass filter, the transfer function of the exponential high-pass filter increases faster with frequency at the beginning, which allows some low-frequency components to pass and is advantageous for preserving the gray level of the image.
13.2.4 Band-Pass Filters
Band-pass filter A frequency domain filter that implements band-pass filtering. It allows signals between two specific frequencies to pass but prevents other frequencies from passing. The filtering characteristics are determined by its transfer function. Band-pass filtering is complementary to band-reject filtering, so the transfer function Hbp(u, v) of the band-pass filter can be calculated from the transfer function Hbr(u, v) of the band-reject filter as follows:

Hbp(u, v) = 1 − Hbr(u, v)

Ideal band-pass filter [IBPF] An ideal filter for band-pass filtering. The transfer function of an ideal band-pass filter passing all frequencies in the region centered on (u0, v0) with a radius of D0 in the frequency domain space UV is

H(u, v) = 1   if D(u, v) ≤ D0
H(u, v) = 0   if D(u, v) > D0
where

D(u, v) = [(u − u0)² + (v − v0)²]^(1/2)

Considering the symmetry of the Fourier transform, in order to pass frequencies in a given region that is not centered at the origin, the band-pass filter must work symmetrically two by two; that is, the above two equations need to be changed to
H(u, v) = 1   if D1(u, v) ≤ D0 or D2(u, v) ≤ D0
H(u, v) = 0   otherwise

where

D1(u, v) = [(u − u0)² + (v − v0)²]^(1/2)
D2(u, v) = [(u + u0)² + (v + v0)²]^(1/2)

Figure 13.7 is a perspective schematic view of a typical ideal band-pass filter. Compare the ideal band-reject filter.
Gaussian band-pass filter A physically realizable frequency domain filter that performs band-pass filtering. The transfer function H(u, v) of a Gaussian band-pass filter is

H(u, v) = exp{−(1/2) [(D²(u, v) − Dd²) / (D(u, v)W)]²}

where D(u, v) is the distance from the origin of the frequency plane to the point (u, v), W is the width of the pass band, and Dd is the radius of the ring band.
Butterworth band-pass filter A physically realizable frequency domain filter that performs band-pass filtering. The transfer function H(u, v) of an n-th order Butterworth band-pass filter is

H(u, v) = [D(u, v)W / (D²(u, v) − Dd²)]^(2n) / (1 + [D(u, v)W / (D²(u, v) − Dd²)]^(2n))
where D(u, v) is the distance from the point (u, v) to the origin of the frequency plane, W is the width of the band, and Dd is the radius of the ring band. The shape of the filter frequency response, especially the steepness between the passband and the
stopband, is controlled by the value of n: a large n value corresponds to a steeper transition, closer to the shape of the ideal band-pass filter.
Radially symmetric ideal band-pass filter An ideal band-pass filter that allows a range of frequency components centered around the origin of the frequency space to pass. Its transfer function is

H(u, v) = 0   if D(u, v) < Dd − W/2
H(u, v) = 1   if Dd − W/2 ≤ D(u, v) ≤ Dd + W/2
H(u, v) = 0   if D(u, v) > Dd + W/2
where W is the width of the band, Dd is the radius of the ring band, and D(u, v) is the distance from the point (u, v) to the origin of the frequency plane. A perspective view of the radially symmetric band-pass filter is shown in Fig. 13.8.
Radially symmetric Gaussian band-pass filter A physically realizable frequency domain filter that provides band-pass filtering. The transfer function of a radially symmetric Gaussian band-pass filter is

H(u, v) = 1 − exp{−(1/2) [D(u, v)W / (D²(u, v) − Dd²)]²}

where W is the width of the band, Dd is the radius of the ring band, and D(u, v) is the distance from the point (u, v) to the origin of the frequency plane.
Radially symmetric Butterworth band-pass filter A physically realizable frequency domain filter that provides band-pass filtering. The transfer function of a radially symmetric Butterworth band-pass filter of order n is

H(u, v) = 1 / (1 + [(D²(u, v) − Dd²) / (D(u, v)W)]^(2n))
where W is the width of the band, Dd is the radius of the ring band, and D(u, v) is the distance from the point (u, v) to the origin of the frequency plane.
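A brief sketch of this radially symmetric Butterworth band-pass transfer function (an illustration on a centered grid; the small epsilon guarding the division at the centre is an added assumption):

import numpy as np

def butterworth_bandpass(shape, dd, w, n):
    # H = 1 / (1 + ((D^2 - Dd^2) / (D W))^(2n)), centered frequency grid.
    M, N = shape
    u, v = np.mgrid[0:M, 0:N]
    D = np.sqrt((u - M / 2) ** 2 + (v - N / 2) ** 2)
    D = np.where(D == 0, 1e-9, D)   # avoid division by zero at the centre
    return 1.0 / (1.0 + ((D ** 2 - dd ** 2) / (D * w)) ** (2 * n))

# Example: pass a ring of radius 40 with band width 10, order 2.
H = butterworth_bandpass((128, 128), dd=40, w=10, n=2)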
Band-pass notch filter A notch filter that allows a range of frequency components to pass.
Notch filter A filter that filters frequencies in a predetermined neighborhood around a location in the frequency domain. If the frequencies in the neighborhood are allowed to pass, it is a band-pass notch filter; if the frequencies in the neighborhood are blocked, it is a band-reject notch filter. The zero-phase-shift filter is symmetric with respect to the origin, and a notch centered at (u0, v0) has a corresponding notch at (−u0, −v0), so the notch filter must work symmetrically two by two.
13.2.5 Band-Reject Filters
Band-reject filter A frequency domain filter that implements band-reject filtering. The filtering characteristics are determined by its transfer function. The band-reject filter is complementary to the band-pass filter, so the transfer function Hbr(u, v) of the band-reject filter can be calculated from the transfer function Hbp(u, v) of the band-pass filter as follows: Hbr(u, v) = 1 − Hbp(u, v)
Ideal band-reject filter [IBRF] An ideal filter for band-reject filtering. The transfer function of an ideal band-reject filter for eliminating all frequencies in the region centered on (u0, v0) and having a radius of D0 is

H(u, v) = 0   if D(u, v) ≤ D0
H(u, v) = 1   if D(u, v) > D0
where

D(u, v) = [(u − u0)² + (v − v0)²]^(1/2)

Considering the symmetry of the Fourier transform, in order to eliminate frequencies in a given region that is not centered at the origin, the band-reject filter must work symmetrically two by two; that is, the above two equations need to be changed to

H(u, v) = 0   if D1(u, v) ≤ D0 or D2(u, v) ≤ D0
H(u, v) = 1   otherwise
where

D1(u, v) = [(u − u0)² + (v − v0)²]^(1/2)
D2(u, v) = [(u + u0)² + (v + v0)²]^(1/2)

Figure 13.9 is a perspective schematic view of a typical ideal band-reject filter. Compare the ideal band-pass filter.
Band-reject notch filter A notch filter that blocks the passage of frequency components within a certain range.
Radially symmetric ideal band-reject filter An ideal band-reject filter capable of removing frequency components within a certain range centered on the origin in the frequency domain. Its transfer function is

H(u, v) = 1   if D(u, v) < Dd − W/2
H(u, v) = 0   if Dd − W/2 ≤ D(u, v) ≤ Dd + W/2
H(u, v) = 1   if D(u, v) > Dd + W/2
where W is the width of the band, Dd is the radius of the ring band, and D(u, v) is the distance from the point (u, v) to the origin of the frequency plane. A perspective view of the radially symmetric ideal band-reject filter is shown in Fig. 13.10.
Radially symmetric Butterworth band-reject filter A physically realizable frequency domain filter that provides band-reject filtering. The transfer function of a radially symmetric Butterworth band-reject filter of order n is
H(u, v) = 1 / (1 + [D(u, v)W / (D²(u, v) − Dd²)]^(2n))
where W is the width of the band, Dd is the radius of the ring band, and D(u, v) is the distance from the point (u, v) to the origin of the frequency plane.
Radially symmetric Gaussian band-reject filter A physically realizable frequency domain filter that provides band-reject filtering. The transfer function of a radially symmetric Gaussian band-reject filter is

H(u, v) = 1 − exp{−(1/2) [(D²(u, v) − Dd²) / (D(u, v)W)]²}

where W is the width of the band, Dd is the radius of the ring band, and D(u, v) is the distance from the point (u, v) to the origin of the frequency plane.
13.2.6 Homomorphic Filters
Homomorphic filter A filter that implements homomorphic filtering. It can simultaneously amplify high frequencies and attenuate low frequencies, which reduces illumination variations and sharpens edges (and details).
Homomorphic filtering function A frequency domain filter function that implements homomorphic filtering. It takes values greater than 1 in the high-frequency portion and less than 1 in the low-frequency portion, so the high-frequency and low-frequency components obtained after the Fourier transform of the image are affected differently.
Homomorphic filtering A method of compressing the gray-level range of an image in the frequency domain while enhancing its contrast. It filters the image with a filter having high-pass-like characteristics in the frequency domain, thereby reducing luminance variations (slowly varying components) and enhancing reflection details (fast-changing components). The technique is based on an imaging model in which an image f(x, y) can be expressed as the product of its illumination function i(x, y) and its reflection function r(x, y):

f(x, y) = i(x, y) r(x, y)

The specific filtering steps are as follows (a code sketch follows this group of entries):
1. First take the logarithm of both sides of the above formula:
ln f(x, y) = ln i(x, y) + ln r(x, y)

2. Take the Fourier transform of both sides of the above formula:

F(u, v) = I(u, v) + R(u, v)

3. Set up a frequency domain filter H(u, v) to process F(u, v):

H(u, v)F(u, v) = H(u, v)I(u, v) + H(u, v)R(u, v)

Here, H(u, v) is called a homomorphic filtering function, which acts on the illumination component and the reflection component, respectively.

4. Inverse transform back to the spatial domain:

hf(x, y) = hi(x, y) + hr(x, y)

It can be seen that the enhanced image is formed by superimposing two components corresponding to the illumination component and the reflection component, respectively.

5. Take the exponential of both sides of the above formula to obtain the homomorphic filtering result:

g(x, y) = exp[hf(x, y)] = exp[hi(x, y)] · exp[hr(x, y)]

Quadrature filters A pair of filters whose transfer functions have the same magnitude. One filter is even symmetric and has a real transfer function; the other is odd symmetric and has a purely imaginary transfer function. The responses of the two filters have a 90° phase difference.
Quadrature mirror filter [QMF] An odd-length, symmetric wavelet filter, often used in groups. A class of filters commonly used in wavelet theory and image compression filtering. The filter divides a signal into a high-pass component and a low-pass component, and the transfer function of the low-pass component is a mirror of the transfer function of the high-pass component.
Multifilter bank Short for "multiple wavelet filter bank". A set of wavelet filters with more than one scaling function that splits the input signal into multiple components.
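A minimal NumPy sketch of the five homomorphic filtering steps listed above (illustrative only; the small offset added before the logarithm and the centered-spectrum convention are assumptions, and the transfer function H(u, v) must be supplied separately, e.g., a high-emphasis function):

import numpy as np

def homomorphic_filter(image, transfer, eps=1e-6):
    z = np.log(image.astype(float) + eps)               # step 1 (offset avoids log 0)
    Z = np.fft.fftshift(np.fft.fft2(z))                 # step 2
    S = transfer * Z                                    # step 3
    s = np.real(np.fft.ifft2(np.fft.ifftshift(S)))      # step 4
    return np.exp(s)                                    # step 5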
Chapter 14
Image Restoration
Image restoration is a large class of techniques in image processing. Image restoration technology models the process of image degradation and restores the image according to the image degradation model (Gonzalez and Woods 2018; Sonka et al. 2014; Zhang 2017a).
14.1
Fundamentals of Image Restoration
Image restoration technology can be divided into two categories: unconstrained restoration and constrained restoration (Gonzalez and Woods 2008). Image restoration technology can also be divided into two categories, automatic and interactive, depending on whether external intervention is required (Zhang 2009). There are many corresponding technologies for various categories (Bovik 2005; Pratt 2007; Zhang 2017a).
14.1.1 Basic Concepts
Image restoration In image processing, the process of restoring a degraded image as far as possible. Perfect image recovery is only possible if the degradation function can be mathematically inverted. Typical image restoration techniques include inverse filtering, Wiener filtering, and constrained least squares restoration. The similarity between image enhancement and image restoration is that both obtain an image that is improved in a certain sense, or aim to improve the visual quality of the input image. The difference between image enhancement and image restoration is that the former generally relies on the characteristics of the human visual system to obtain better visual results, while the latter considers that the image
degrades or deteriorates under certain conditions (image quality is degraded and distorted), and that it is necessary to reconstruct or restore the original image based on the corresponding degradation model and knowledge. In other words, image restoration techniques model the process of image degradation and take the opposite process accordingly to obtain the original image. It can be seen that image restoration is performed according to a certain image degradation model. Image restoration is the process of eliminating some known (and modeled) distortion in an image. This process may not output a perfect image, but it can eliminate an undesirable distortion at the cost of introducing a negligible one.
Restoration Given a noisy sample derived from some real data, obtaining the best possible estimate of the original data.
Restoration transfer function A function that acts on the degraded image to obtain an estimate fe of the original image f, i.e., a function used to determine fe so as to approximate f.
Interactive restoration A method in which human and computer work together to perform image restoration. Here a person controls the recovery process to obtain some special effects.
Estimating the degradation function A preprocessing step in image restoration. There are three main estimation approaches: (1) observation, (2) experimentation, and (3) mathematical modeling. However, the true degradation function is difficult to estimate completely and accurately, so blind deconvolution methods are often used.
Blind deconvolution In image restoration, methods and processes that do not rely on estimating the degradation function. Image deconvolution is performed without prior knowledge of the convolution operator applied to the image (or of the optical transfer function or the point spread function at the time of image acquisition). Often used for image deblurring.
Well-posed problem A problem whose solution satisfies three conditions at the same time: (1) the solution of the problem exists; (2) the solution of the problem is unique; and (3) the solution of the problem depends continuously on the initial data. Compare ill-posed problem.
Ill-posed problem A problem whose solution does not satisfy all three conditions of a well-posed problem (i.e., that the solution (1) exists, (2) is unique, and (3) depends continuously on the data). In other words, it is a mathematical problem that violates or does not satisfy at least one of the conditions of a well-posed problem. The problem of
14.1
Fundamentals of Image Restoration
563
Fig. 14.1 Schematic diagram of membrane model
visual processing as an inverse problem in the optical imaging process is an ill-posed problem. The ill-posed problem in computer vision can be solved according to the regularization theory. Compare well-posed problem.
Ill-conditioned problem See ill-posed problem. When fitting data with a model, it emphasizes the situation in which small changes in the input can cause large changes in the fit.
Rubber sheet model See membrane model.
Membrane model A surface fitting model that trades off the smoothness of the fitted surface against its closeness to the original data. The corresponding first-order smoothing function can be written as

S = ∫∫ [ fx²(x, y) + fy²(x, y) ] dx dy = ∫∫ ‖∇f(x, y)‖² dx dy

This smoothing function is often used for the surface of a 3-D target, and also for images or optical flow fields. The fitted surface only needs C0 continuity, so it differs from the smoother thin-plate model, which has C1 continuity. In image restoration, this model can be used as the regularization term to be minimized in the cost function; specifically, it minimizes the first-order differences between adjacent pixels. As shown in Fig. 14.1, for any two fixed points, the membrane model acts like an elastic band (corresponding to the membrane) stretched between two nails. Compare thin-plate model.
Membrane spline Same as membrane model.
Thin-plate model A surface smoothness model for use in variational techniques. If a thin plate is represented as the parametric surface [x, y, f(x, y)], its internal energy (or bending energy) can be expressed as fxx² + 2fxy² + fyy². The corresponding second-order smoothing function can be written as

S = ∫∫ [ fxx²(x, y) + 2fxy²(x, y) + fyy²(x, y) ] dx dy

The blending (cross) term 2fxy² makes it rotation invariant.
Fig. 14.2 Schematic diagram of thin-plate model
It serves as a prior model in the minimization of the regularization term of the cost function in image restoration; specifically, it minimizes the second-order differences between adjacent pixels. As shown in Fig. 14.2, for any three fixed points, the thin-plate model acts like bending a stiff metal bar (corresponding to the thin plate) so that it passes through the three points while bending as little as possible. Compare membrane model.
Thin-plate spline Same as thin-plate model.
Thin-plate spline under tension A combination of the thin-plate and the membrane. See norm of smooth solution.
Maximum entropy restoration Image restoration technology based on maximum entropy.
Maximum entropy The probability density p(x) that maximizes the distribution entropy subject to the constraints ∫ ri(x) p(x) dx = ci, where i = 1, 2, ..., m. It can be shown that the solution has the form

p(x) = exp[λ0 + Σ_{i=1}^{m} λi ri(x)]

where each λi is chosen to satisfy the corresponding constraint and λ0 normalizes p(x) to 1.
Maximum entropy method An estimation method based on the principle of maximum entropy. The principle of maximum entropy states that, given partial knowledge, the most reasonable inference about an unknown distribution is the one that is consistent with the known knowledge while being maximally uncertain or random (making the random variable as random as possible). In this way, no constraints or assumptions are introduced that cannot be justified by the information already available.
14.1.2 Basic Techniques Blur A measure of the degree of image sharpness; or the result of a reduction in the sharpness of an image. Blur can be caused by a variety of factors, such as camera movement during image acquisition, not using a fast enough shutter for fast-moving objects, or the lens not focusing. Defocusing of the sensor, noise of the environment or image acquisition process, motion of the target or sensor, and side effects of some image processing operations can also cause image blur.
Deblur To eliminate the effect of a known blur function on an image. If the observed image I is the result of convolving an unknown image J with a known blur kernel B, i.e., I = J * B, then deblurring is the process of computing J given I and B. See deconvolution, image restoration, and Wiener filtering.
Blur estimation An estimate of the point spread function or optical transfer function that caused the image to be blurred. In practice, it is often obtained through a process of deconvolution. See image restoration.
Deblurring An operation that processes a blurred image to eliminate or reduce the blurring, thereby improving image quality. The techniques used include image restoration techniques such as inverse filtering.
Descattering An algorithm or method that eliminates the image attenuation caused by backscattering from media such as fog or water. Such algorithms often use the depolarization of light as a first step and then perform scene reconstruction.
Partition function The normalization constant in the maximum a posteriori probability estimation solution of the image restoration problem. It is generally considered that the a posteriori probability density function of the image f under a given degradation model has the following form:

p(f | g, model) = (1/C) exp[−H(f; g)]
Among them, H(f; g) is a non-negative function of the combination f, often called the combined cost function, also called the energy function, and C is the partition function. It is also the normalization constant when a random distribution is written as the product of the potential functions of the maximal cliques in a graph.
Potential function A real function with continuous second-order partial derivatives that satisfies the Laplace equation ∇²f = 0. Also known as a harmonic function. See Laplacian and partition function.
Potential field A mathematical function that assigns values (usually scalar values) to each point in some space. In image engineering, this often refers to the measurement of certain scalar properties at each point of a 2-D or 3-D image, such as the distance from a target. It is also commonly used in the path planning of robots, where the potential at each spatial point indicates how easy it is to reach the end point from that point.
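As a concrete illustration of the deblur and deconvolution entries above, the following minimal sketch performs frequency-domain restoration with a Wiener-type restoration transfer function. It is not the book's implementation: the 5x5 box point spread function, the use of circular (FFT-based) convolution, and the constant noise-to-signal ratio K are all illustrative assumptions.

    import numpy as np

    def wiener_deblur(g, h, K=0.01):
        # g: degraded image, h: known blur kernel (PSF), K: assumed noise-to-signal ratio
        H = np.fft.fft2(h, s=g.shape)            # transfer function of the degradation
        G = np.fft.fft2(g)
        W = np.conj(H) / (np.abs(H) ** 2 + K)    # Wiener-type restoration transfer function
        return np.real(np.fft.ifft2(W * G))

    # synthetic test: blur a random image with a 5x5 box PSF and add weak Gaussian noise
    rng = np.random.default_rng(0)
    f = rng.random((64, 64))
    h = np.ones((5, 5)) / 25.0
    g = np.real(np.fft.ifft2(np.fft.fft2(f) * np.fft.fft2(h, s=f.shape)))
    g += 0.01 * rng.standard_normal(f.shape)
    f_hat = wiener_deblur(g, h)
    print("RMSE before:", np.sqrt(np.mean((g - f) ** 2)),
          "after:", np.sqrt(np.mean((f_hat - f) ** 2)))

With K set to 0 the filter reduces to plain inverse filtering, which amplifies noise wherever the degradation transfer function is small.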
Maximum a posteriori [MAP] probability Given the data and the model, the probability of the most probable solution. It is also the configuration of random variables that maximizes the a posteriori probability in a statistical model. For example, in image restoration, any combination of gray values assigned to the image pixels is a possible solution; given the data and the degradation model, it is necessary to determine the most probable such combination. This is a global optimization problem, because it attempts to find the best combination of gray values rather than trying to recover the image by filtering based on local operations. The term is often used in situations such as parameter estimation, pose estimation, or target recognition, where it is desirable to estimate, respectively, the parameter, the location, or the identity with the greatest a posteriori probability given the observed image data. It should be pointed out that this method is, at best, only an approximation of the complete Bayesian statistical model (the combination of a posteriori distributions). The MAP solution to the image restoration problem is

f̂ = arg max_x {p(x | g, model)}
where f̂ is the restored image, x is a combination of pixel values, g is the degraded image, and model summarizes all knowledge of the degradation process, including any point spread functions of the blurring operators involved, any nonlinear degradation, the intrinsic and statistical properties of the noise, and so on. Here p(x | g, model) is the a posteriori probability density function of the image x given the observed image g and the degradation model.
Maximum posterior probability Same as maximum a posteriori probability.
Partition function The normalization constant in the maximum posterior probability estimation solution in the image restoration problem. It is generally considered that the posterior probability density function of the image f, given the observed image g and the degradation model, has the following form:

p(f | g, model) = (1/C) exp[−H(f; g)]
Among them, H(f; g) is a non-negative function of the combination f, often called the cost function of the combination, also called the energy function, and C is the partition function. It is also the normalization constant when a random distribution is written as the product of the potential functions of the maximal cliques in a graph.
Stochastic approach In image restoration, a method of obtaining the optimal restoration effect according to some stochastic criterion (such as least squares).
Nonlinear diffusion Diffusion in which the diffusion function depends on the position in the image. Nonlinear diffusion is based on partial differential equations and can be used in many image processing applications, such as eliminating noise in images and performing image restoration.
Curvature-driven diffusion An image restoration model. It was proposed to overcome the problem that the total variation model used for image restoration can break connectivity. Here the curvature is taken into account in the diffusion along the isophotes (lines of equal illuminance) so that broken connections can be restored.
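The nonlinear diffusion described above can be sketched with a Perona-Malik-type conduction function, as in the following minimal example; the exponential conduction coefficient, the step size, the iteration count, and the circular boundary handling via np.roll are assumptions of this illustration, not the book's formulation.

    import numpy as np

    def nonlinear_diffusion(img, n_iter=20, kappa=0.1, step=0.2):
        # Perona-Malik style diffusion: smooth within regions, preserve strong edges
        u = img.astype(float).copy()
        for _ in range(n_iter):
            # finite differences toward the four neighbors (circular boundary for simplicity)
            dn = np.roll(u, 1, axis=0) - u
            ds = np.roll(u, -1, axis=0) - u
            de = np.roll(u, -1, axis=1) - u
            dw = np.roll(u, 1, axis=1) - u
            # position-dependent conduction coefficients (small across strong edges)
            cn, cs = np.exp(-(dn / kappa) ** 2), np.exp(-(ds / kappa) ** 2)
            ce, cw = np.exp(-(de / kappa) ** 2), np.exp(-(dw / kappa) ** 2)
            u += step * (cn * dn + cs * ds + ce * de + cw * dw)
        return u

    rng = np.random.default_rng(1)
    noisy = np.zeros((64, 64)); noisy[:, 32:] = 1.0       # step edge
    noisy += 0.1 * rng.standard_normal(noisy.shape)       # additive Gaussian noise
    print(np.std(nonlinear_diffusion(noisy)[:, :16]))     # flat region becomes smoother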
14.1.3 Simulated Annealing
Simulated annealing A typical method of minimizing a cost function, belonging to the class of Markov chain Monte Carlo methods. A temperature parameter is used in the probability density function to gradually sharpen the solution space as the chain of configurations grows. The gradual reduction of the temperature parameter mimics physical phenomena such as the formation of crystals or the ordering of ferromagnetic materials, in which the temperature of the physical system is gradually reduced to allow it to reach its minimum energy state. This is also an algorithm that is optimized iteratively from coarse to fine. In each iteration, a smoothed version of the energy is searched, and a global minimum is located by statistical means; then the search is done at a finer and smoother level, and so on. The basic idea is to determine the basin of the absolute minimum on the coarse scale, so that the search on the finer scale can finally proceed from an approximate solution close enough to the absolute minimum to avoid falling into a local minimum. The name comes from the process of tempering metal, in which the temperature is gradually reduced, allowing the metal to reach thermal equilibrium each time.
Iterative conditional mode A simulated annealing method using a special strategy. When constructing a possible new solution from the previous solution, a local conditional probability density function is first calculated from the regularization term used, and then a pixel is selected and assigned its most likely new value. This is the iterated conditional modes method, which sometimes tends to fall into local minima. It is an application of coordinate-wise gradient ascent.
Temperature parameter In image restoration, the scale parameter used in the prior probability p(x), and in the a posteriori probability p(x|g, model) derived from it and the degradation process, to control how sharply peaked these distributions are.
Cooling schedule for simulated annealing A formula used to reduce the temperature in simulated annealing. It has been proven that if the schedule

T(k) = C / ln(1 + k),  k = 1, 2, ...

is used, where k is the number of iterations and C is a positive constant, the simulated annealing algorithm will converge to the global minimum of the cost function. However, this cooling schedule is very slow. Therefore, an alternative suboptimal cooling schedule is often used:

T(k) = a T(k − 1),  k = 1, 2, ...

Here, a is chosen to be a number close to 1 but less than 1; typical values are 0.99 or 0.999. The starting value T(0) is chosen to be high enough so that all configurations have essentially the same possibilities. In practice, the typical values of the cost function need to be considered to select a suitable T(0), typically on the order of 10.
Mean field annealing [MFA] A method that combines simulated annealing and mean field approximation to find the minimum of complex functions. It can be used for image restoration.
Mean field approximation A method of approximate probabilistic inference in which a complex distribution is approximated by a simpler, factorized one.
Simulated annealing with Gibbs sampler A simulated annealing method using a special strategy. When constructing a possible new solution from the previous solution, a local conditional probability density function is first calculated from the regularization term used (here the conditional probability means that, for a pixel, when the values of its neighboring pixels are known, the probability that its value is s: p(xij = s | neighboring pixel values)). A temperature parameter T is used in the exponent of this function to control the sharpness of the configuration space. A new value for the pixel (i, j) is then selected according to this probability.
Fixed temperature simulated annealing An annealing process and method in which the temperature parameter T is kept at a fixed value in simulated annealing with a Gibbs sampler.
Swendsen-Wang algorithm A variant of simulated annealing. An algorithm for random simulation using the Markov chain Monte Carlo method in a discrete Markov random field (MRF). Its feature is to use Markov connectivity strengths based on random walks and to select a connected subset of a set of variables as the unit to update at each step, rather than just flipping a single variable.
Gibbs sampler A sampling method or strategy that avoids the problem of gradient descent methods falling into local extrema; it can be used in simulated annealing.
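The following sketch illustrates the two cooling schedules given in the cooling schedule for simulated annealing entry, inside a generic annealing loop. The one-dimensional toy cost function, the proposal step, and the parameter values (T(0), a, C) are illustrative assumptions and not values from the book.

    import math
    import random

    def simulated_annealing(cost, x0, n_iter=2000, T0=10.0, a=0.99,
                            use_log_schedule=False, C=10.0):
        # generic simulated annealing: accept worse solutions with probability exp(-dE/T)
        x, e = x0, cost(x0)
        best_x, best_e = x, e
        T = T0
        for k in range(1, n_iter + 1):
            # logarithmic schedule T(k) = C/ln(1+k): provably convergent but very slow;
            # geometric schedule T(k) = a*T(k-1): the commonly used suboptimal alternative
            T = C / math.log(1.0 + k) if use_log_schedule else a * T
            x_new = x + random.uniform(-0.5, 0.5)        # propose a neighboring solution
            e_new = cost(x_new)
            if e_new < e or random.random() < math.exp(-(e_new - e) / T):
                x, e = x_new, e_new
                if e < best_e:
                    best_x, best_e = x, e
        return best_x, best_e

    # toy multimodal cost with global minimum near x = 2
    cost = lambda x: (x - 2.0) ** 2 + 2.0 * math.sin(5.0 * x)
    print(simulated_annealing(cost, x0=-5.0))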
14.1.4 Regularization
Regularization 1. A mathematical method for solving ill-posed problems; a class of mathematical techniques for solving ill-conditioned problems. The idea is to approximate the solution of the original problem by adding constraint conditions and solving a well-posed problem that is close to the original ill-conditioned problem. When fitting data with a model, it is suitable for the case where the constraints on the solution space are weak. Essentially, to determine a unique solution to a problem, a constraint that the solution must be smooth can be introduced, because the intuitive idea is that similar inputs should correspond to similar outputs. Thus, the problem is translated into a variational problem in which the variational integral depends on both the data and the smoothing constraint. For example, to estimate the function y = f(x) using the values y1, y2, ..., yN at a set of points x1, x2, ..., xN, the regularization method minimizes the functional

H(f) = Σ_{i=1}^{N} [f(xi) − yi]² + λΦ(f)

where Φ(f) is a smoothing function and λ is a positive parameter called the regularization parameter.
2. A technique for controlling the overfitting problem. It adds a penalty to the error function to keep the weight coefficients from taking excessive values. The penalty often takes the form of a power of the weights with different exponents q; Figure 14.3a–d, respectively, show the boundaries of the regularization term for exponent values q of 0.5, 1, 2, and 4.
Fig. 14.3 Regular term boundary with different exponents
Regularization parameter A parameter that adjusts the weights of the terms in the regularization formula (including the original constraint and the added constraints).
Penalty term A term used in the regularization process. When fitting a model to data, it is often necessary to add a smoothing or penalty term to the data-fitting term. If the data set D is fitted using the model M with parameter θ, the fit can be expressed by the negative log-likelihood term −log p(D|θ, M), and the penalized form of the negative log-likelihood is J(θ) = −log p(D|θ, M) + λΦ(θ), where Φ(θ) is the penalty on θ and λ > 0 is a regularization constant. Φ(θ) can take many forms, for
example, the form of a sum of squares is often used in ridge regression, i.e., Φ(θ) = Σi θi². For Bayesian statistical models, the penalty term can be interpreted as the negative logarithm of the prior distribution, while the minimization of J(θ) is equivalent to finding the maximum a posteriori probability estimate of θ.
Regularization term A weighted term introduced into the objective function so that the solution has certain desired characteristics. In many optimization problems, it is necessary to minimize a cost function or energy function. The cost function often includes a term expressing faithfulness to the data and a term expressing all the prior knowledge about the solution; the second term is the regularization term, which often makes the solution smoother.
Robust regularization A regularization method that uses a non-quadratic robust penalty function.
Tikhonov regularization The regularized form obtained when the transformation of the input consists only of superimposed random noise. Same as ridge regression.
Discontinuity preserving regularization A method of keeping edges (discontinuities) from being blurred in the result of a regularization operation. A typical regularization operation is, for example, recovering a dense disparity map from a sparse set of disparities computed only at feature matching points.
Total variation A regularization model in a class of variational methods. The total variation regularization of a function f(x): R^n → R has the form R(f) = ∫_Ω |∇f(x)| dx, where Ω is the domain or a subdomain of f. It can be used as a basic and typical image repair (inpainting) model, in which the purpose of image restoration is achieved by diffusing into the target area pixel by pixel, extending and diffusing from the source area to the target area along lines of equal light intensity (lines of equal gray value). The total variation algorithm is a non-isotropic diffusion algorithm that can be used to denoise while maintaining the continuity and sharpness of edges. It corresponds to the situation when p = 1 in the Lp norm.
It is also a smoothness constraint measure in optical flow computation, using the L1 norm. The resulting energy function is convex, so the global minimum can be obtained.
Total variation regularization Regularization in which the penalty term takes the form of the total variation.
Soft weight sharing A method of effectively reducing the complexity of a network with a large number of weights. It attempts to use regularization methods to make groups of weights take similar values. The division of the weights into groups, the mean weight value of each group, and the spread of the weight values within each group are determined by the learning process.
Smoothness penalty In regularization, the smoothing term added, besides the data penalty term, to the global energy to be minimized; it is generally related to the first and second derivatives of the regularized function.
Norm of smooth solution In regularization, to quantify the smoothness of the solution, a norm needs to be defined in the solution space. For example, for a 2-D image f(x, y), one can define (subscripts indicate differentiation):

e1 = ∫∫ [ fx²(x, y) + fy²(x, y) ] dx dy = ∫∫ ‖∇f(x, y)‖² dx dy

e2 = ∫∫ [ fxx²(x, y) + 2fxy²(x, y) + fyy²(x, y) ] dx dy

The fxy²(x, y) term in the latter equation guarantees rotation invariance. The first-derivative norm is often referred to as a membrane (see membrane model), because the result of interpolating a set of points has a tent-like structure. The second-derivative norm is called a thin-plate spline, because it approximates a thin plate (such as flexible steel) subjected to small deformations. Combining the two is called a thin-plate spline under tension, and combining the two with a local weighting function is called a controlled-continuity spline. Such combinations are all functionals, because they map functions to scalar values. This type of method is often referred to as a variational method, because they measure the variation (non-smoothness) of a function.
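The first- and second-order smoothness norms e1 and e2 above can be approximated on a discrete image with finite differences, as in the following minimal sketch; the use of np.gradient as the finite-difference approximation and the test ramp image are assumptions.

    import numpy as np

    def membrane_norm(f):
        # e1: integral of fx^2 + fy^2, approximated by finite differences
        fy, fx = np.gradient(f.astype(float))
        return np.sum(fx ** 2 + fy ** 2)

    def thin_plate_norm(f):
        # e2: integral of fxx^2 + 2*fxy^2 + fyy^2 (the cross term gives rotation invariance)
        fy, fx = np.gradient(f.astype(float))
        fyy, fyx = np.gradient(fy)
        fxy, fxx = np.gradient(fx)
        return np.sum(fxx ** 2 + 2.0 * fxy ** 2 + fyy ** 2)

    x, y = np.meshgrid(np.linspace(0, 1, 32), np.linspace(0, 1, 32))
    plane = x + y      # linear ramp: zero thin-plate energy, nonzero membrane energy
    print(membrane_norm(plane), thin_plate_norm(plane))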
Fig. 14.4 Tangential propagation
Tangent propagation A regularization technique that keeps the model invariant to coordinate transformations of the input data. Consider the effect of transforming an input vector xn. As long as the transformation is continuous (such as translation or rotation, rather than a mirror reflection), the transformed pattern will sweep out a manifold M in the d-D input space. For example, in Fig. 14.4 the dimension of the input space is simply taken as d = 2. Assuming that the transformation is controlled by a single 1-D parameter s (which can be an arc length or a rotation angle), the subspace M swept out by the input vector xn will be 1-D and parameterized by s. Let T(xn, s) denote the vector obtained by transforming xn, so that T(xn, 0) = xn. The tangent to the curve M is given by the directional derivative g = ∂T/∂s, and the tangent vector at the point xn (which approximates the local transformation effect) is

gn = ∂T(xn, s)/∂s |_{s=0}

A coordinate transformation of the input data results in a change of the output vector. The derivative of output k with respect to s is

∂yk/∂s |_{s=0} = Σ_{i=1}^{d} (∂yk/∂xi)(∂xi/∂s) |_{s=0} = Σ_{i=1}^{d} Jki gi
Among them, Jki is the (k, i) element of the Jacobian matrix J.
Tangent distance A regularization technique associated with tangent propagation that can be used to make distance-based methods (such as K-nearest-neighbor classifiers) invariant.
Least absolute shrinkage and selection operator [LASSO] A special form of operator that implements a generalized regularized error function. Let a training set contain N observations of x, written as x ≡ [x1, x2, ..., xN]^T, with corresponding target values t ≡ [t1, t2, ..., tN]^T. A polynomial function of the following form is used to fit the data:

y(x, w) = w0 + w1 x + w2 x² + ⋯ + wM x^M = Σ_{j=0}^{M} wj x^j
where M is the order of the polynomial function. The polynomial coefficients w0, w1, ..., wM can be combined into a vector w. Here, although the polynomial function y (x, w) is a nonlinear function of x, it is a linear function of the coefficient w. The value of the coefficient can be calculated by means of a polynomial fit to the training data.
This can be done by minimizing an error function that measures the mismatch between the function y(x, w), for any given value of w, and the data points in the training set. A commonly used error function is the sum of the squares of the differences between the estimate y(xn, w) at each data point xn and the corresponding target value tn. Therefore, it is necessary to minimize the error

E(w) = (1/2) Σ_{n=1}^{N} {y(xn, w) − tn}²

This error is always non-negative and is zero only if the function y(x, w) passes exactly through each data point of the training set. It should be noted that a polynomial that minimizes the error on the training set is not necessarily the polynomial that minimizes the error on the test set. Here the test-set error measures how well new observations of x allow the value of t to be predicted. The main influencing factor here is the order M of the polynomial function. This is essentially a Bayesian model comparison or model selection problem. Small orders give larger test-set errors because the corresponding polynomial functions are relatively inflexible (fewer degrees of freedom) and are less able to follow the changes in the test data. If a sufficiently large order is chosen, then because of the large number of degrees of freedom it is possible to fit the training data very well, but the polynomial will oscillate, making the test-set error very large. This is the overfitting problem. Regularization is a way to overcome overfitting by adding a penalty term to the error function to prevent the coefficients from becoming too large. The simplest penalty is the sum of the squares of all the coefficients, so that the error function becomes

Ep(w) = (1/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (λ/2) ‖w‖²
A more generalized regularized error function is (the case k = 1 is called LASSO, or the lasso)

Eg(w) = (1/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (λ/2) ‖w‖^k
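The following sketch illustrates the quadratic-penalty (ridge) case Ep(w) of the regularized error function, for which a closed-form solution of the normal equations exists; the synthetic data, the polynomial order M, and the values of λ are illustrative assumptions. The general ‖w‖^k penalty (k = 1, the lasso) has no closed form and is usually minimized iteratively.

    import numpy as np

    def ridge_polyfit(x, t, M=9, lam=1e-3):
        # minimize 0.5*sum(y(x_n,w)-t_n)^2 + 0.5*lam*||w||^2 via the normal equations
        A = np.vander(x, M + 1, increasing=True)          # design matrix with columns x^j
        w = np.linalg.solve(A.T @ A + lam * np.eye(M + 1), A.T @ t)
        return w

    rng = np.random.default_rng(2)
    x = np.linspace(0.0, 1.0, 10)
    t = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # noisy targets
    for lam in (0.0, 1e-3):                               # lam = 0 tends to overfit
        w = ridge_polyfit(x, t, M=9, lam=lam)
        print(lam, np.abs(w).max())                       # the penalty keeps coefficients small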
14.2 Degradation and Distortion
Image restoration technology assumes that the input image originally had good quality, but that this quality was degraded or distorted by some influence during acquisition or processing (Jähne 2004; Pratt 2007). It is then necessary to reconstruct or restore the original quality or image according to the corresponding degradation model and prior knowledge (Gonzalez and Woods 2018).
14.2.1 Image Degradation
Image degradation The process and factors of image quality degradation in image engineering. There are many examples, such as image distortion, effects caused by noise superposition, chromatic aberration or aberration of the lens, atmospheric disturbance or geometric distortion due to lens defects, and blur caused by inaccurate focus or camera motion.
Image degradation model A model that represents the process and the main factors of image degradation. Figure 14.5 shows a simple but commonly used image degradation model in which the image degradation process is modeled as a system H acting on the input image f(x, y), with an additive noise n(x, y). The combined action results in a degraded image g(x, y). According to this model, image restoration is the process of obtaining an approximation of f(x, y) given g(x, y) and the H representing degradation. It is assumed here that the statistical properties of n(x, y) are known.
General image degradation model Same as image degradation model.
Circulant matrix A special matrix in which the last item in each row is equal to the first item in the next row and the last item in the bottom row is equal to the first item in the first row. In other words, each column can be obtained by moving all the elements of the previous column down one place; the element that is moved out at the bottom is placed in the vacated position at the top. According to the image degradation model, the 1-D degradation system function constitutes a circulant matrix, which has the following structure:

L = [ l(0)      l(M−1)   l(M−2)   ...   l(1)
      l(1)      l(0)     l(M−1)   ...   l(2)
      l(2)      l(1)     l(0)     ...   l(3)
       ⋮         ⋮         ⋮             ⋮
      l(M−1)    l(M−2)   l(M−3)   ...   l(0) ]
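The following sketch builds a small circulant matrix of the form shown above and checks numerically that its eigenvalues are the DFT of its first column and its eigenvectors are the DFT basis vectors; this property underlies the fast diagonalization mentioned in the diagonalization entry. The example kernel l is an assumption.

    import numpy as np

    def circulant(l):
        # C[i, j] = l[(i - j) mod M]: the first column is l(0)...l(M-1);
        # each later column is the previous one circularly shifted down by one place
        M = len(l)
        return np.array([[l[(i - j) % M] for j in range(M)] for i in range(M)])

    l = np.array([3.0, 1.0, 0.0, 0.0, 2.0])    # 1-D degradation kernel l(0)...l(M-1)
    L = circulant(l)
    M = len(l)

    # eigenvalues of a circulant matrix are the DFT of its first column,
    # and its eigenvectors are the complex-exponential DFT basis vectors
    eigvals = np.fft.fft(l)
    n = np.arange(M)
    F = np.exp(2j * np.pi * np.outer(n, n) / M)   # columns are the basis vectors
    print(np.allclose(L @ F, F * eigvals))        # L F = F diag(eigvals), column by column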
Block-circulant matrix A special matrix in which each element (block) is itself a circulant matrix. The 2-D degradation system function in the image degradation model constitutes a block-circulant matrix.
Fig. 14.5 A simple and commonly used image degradation model: the input image f(x, y) passes through the degradation system H and is combined with the additive noise n(x, y) to give the degraded image g(x, y)
Fig. 14.6 Blur caused by out of focus
Diagonalization A method for quickly restoring an image by calculating the eigenvectors and eigenvalues of a circulant matrix or a block-circulant matrix.
Diagonalization of matrix For a matrix A, the process by which two matrices Au and Av are determined such that AuAAv = J (i.e., they form a diagonal matrix).
Blurring A common image degradation process. It is often produced in the image acquisition process and has a limiting effect on the spectrum width. For example, a camera that is out of focus (defocused) produces blurred images (a point source is imaged as a spot or a ring). In the language of frequency analysis, blurring is the process by which high-frequency components are suppressed or eliminated. Blur is generally a deterministic process; in most cases, a sufficiently accurate mathematical model is available to describe it.
Defocus blur A distortion produced in an image by predictable behavior, such as incorrectly adjusted parameters of the optics. Blurring occurs when light entering the optical system does not converge properly onto the image plane. If the camera parameters are known in advance, the defocus blur can be partially corrected.
Out-of-focus blur The target is not clear in the image because it is observed outside the depth of field of the camera. In the two images given in Fig. 14.6, the small clock at the left front of the left picture is sharp because it lies within the depth of field, while the big clock at the right rear lies outside the depth of field and looks blurry; the right picture is the opposite, with the small clock at the left front blurry because it lies outside the depth of field and the big clock at the right rear within the depth of field and looking sharper. The blur here is caused by being out of focus.
Wrong focus blur The image is blurred because the camera's thin lens is not properly focused. The Fourier transform of this degradation is

H(u, v) = J1(Tr) / (Tr)
where J1 is a first-order Bessel function, T is the amount of translation (this degradation is spatially varying), and r² = u² + v².
Motion blur Blurring of an image that results from camera changes or scene changes (or both) during image acquisition.
Dominant plane A degenerate condition encountered in uncalibrated structure and motion recovery, in which most or all of the tracked image features are coplanar in the scene.
Color imbalance Color errors in the acquisition results due to imperfect acquisition devices. For example, different red, green, and blue pixel values are obtained for the white portion of the scene, resulting in a color error in the acquired color image.
Degradation It refers to the fact that some organs of an organism decrease in function, gradually become smaller, or even disappear completely. In image engineering, it refers to the process and result of reduced image quality; an abbreviation for image degradation. The quality loss suffered by an image often refers to the destruction of image content by undesired causes and processes. For example, MPEG compression-decompression can cause changes in video brightness that degrade images. See JPEG and image noise.
Degradation system A system module that causes image degradation. For example, in the general image degradation model, the image degradation process is modeled as a linear system acting on the input image.
Linear degradation Image degradation in which the degradation result or effect is linearly related to the cause of the degradation, for example, the spot due to lens aperture diffraction, or the blur due to (constant-velocity) motion of a target in the scene.
Nonlinear degradation Image degradation that does not satisfy a linear relationship between the degradation effect and the cause of degradation, for example, quality degradation caused by random noise superposition, or luminance distortion due to the nonlinear photosensitive characteristics of photographic film.
Sinusoidal interference pattern A correlated noise that causes image degradation. A sinusoidal interference pattern η(x, y) with amplitude A and frequency components (u0, v0) is
η(x, y) = A sin(u0 x + v0 y)

It can be seen that the amplitude values exhibit a sinusoidal distribution. The corresponding Fourier transform is

N(u, v) = −j (A/2) [ δ(u − u0/2π, v − v0/2π) − δ(u + u0/2π, v + v0/2π) ]

The above formula has only imaginary components, representing a pair of impulses on the frequency plane, located at coordinates (u0/2π, v0/2π) and (−u0/2π, −v0/2π) and with intensities −A/2 and A/2, respectively.
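A discrete version of the sinusoidal interference pattern can be generated and its impulse-pair spectrum verified with the DFT, as in the following sketch; the image size, amplitude, and integer frequencies are assumptions, and the discrete frequencies (u0, v0) here play the role of u0/2π and v0/2π above.

    import numpy as np

    N, A = 128, 2.0
    u0, v0 = 8, 12                                   # integer cycles across the image
    x, y = np.meshgrid(np.arange(N), np.arange(N), indexing='ij')
    eta = A * np.sin(2 * np.pi * (u0 * x + v0 * y) / N)   # sinusoidal interference pattern

    spectrum = np.fft.fft2(eta)
    mag = np.abs(spectrum)
    peaks = np.argwhere(mag > 0.5 * mag.max())
    print(peaks)                                      # two impulses: (u0, v0) and (N-u0, N-v0)
    print(np.isclose(mag.max(), A * N * N / 2))       # each impulse has magnitude A*N^2/2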
14.2.2 Image Geometric Distortion Image distortion Distortion in the image. All the effects of changing an image from its ideal state. Although the term can refer to various types of distortion (such as the effects of image noise, sampling or quantization results), in most cases it refers to geometric distortion. See distortion. Distortion 1. In image acquisition, the phenomenon that the scene is inconsistent with the image acquired. For example, in perspective projection imaging, the farther the scene is from the observation point or the collector, the smaller the image is formed, and vice versa, which can be regarded as a dimensional distortion. The distortion of the image contains spatial and structural information about the 3-D scene. As another example, when imaging a surface having texture characteristics and the optical axis is not perpendicular to the surface, then the shape of the texture primitive on the surface changes in the image, and the degree of change indicates the angle between the optical axis and the surface. 2. In image compression, the original image (the image to be compressed) is inconsistent with the decompressed image in pixel gradation, which is caused by lossy compression. 3. In image watermarking, the original image (the image to be embedded) is inconsistent with the image after the watermark is embedded in pixel gradation, which is caused by the watermark embedding. 4. A type of Seidel aberration. Also called bending astigmatism. It is only proportional to the cube of the object height. In fact, the image obtained is sharp only with the distortion, but the shape of the image is inconsistent with the shape of the object. Barrel distortion and pincushion distortion are typical distortions.
Radial distortion The main component of lens distortion, radiating outward from the center. The change from coordinates [x, y]^T to coordinates [x′, y′]^T caused by radial distortion on the image plane can be expressed as

[x′, y′]^T = { 2 / [ 1 + √(1 − 4k(u² + v²)) ] } [x, y]^T

where the parameter k represents the magnitude of the radial distortion, and u and v, respectively, represent the offsets of the distorted point with respect to the optical axis. If the value of k is negative, the distortion becomes barrel distortion; if the value of k is positive, the distortion becomes pincushion distortion. Correction of the distortion can be performed by the following formula:

[x, y]^T = { 1 / [ 1 + k(u′² + v′²) ] } [x′, y′]^T

A type of lens distortion. It refers to the observed image coordinates being displaced toward (barrel distortion) or away from (pincushion distortion) the center of the image by an amount that grows with the radial distance. Simple radial distortion can be modeled with low-order polynomials:

x̂c = xc (1 + k1 rc² + k2 rc⁴)
ŷc = yc (1 + k1 rc² + k2 rc⁴)

where rc² = xc² + yc², and k1 and k2 are called radial distortion parameters. It is not difficult to compensate for radial distortion in practice; for most lenses, good results can be obtained by using a simple fourth-order distortion model. Here (xc, yc) are the pixel coordinates obtained after perspective division, before scaling by the focal length f and shifting by the optical center pixel coordinates (cx, cy), i.e.,

xc = (rx · p + tx) / (rz · p + tz)
yc = (ry · p + ty) / (rz · p + tz)

Radial distortion parameters Parameters in the radial distortion model. Consider the simplest radial distortion model using low-order polynomials (r stands for the radial distance):

xr = x (1 + k1 r² + k2 r⁴)
yr = y (1 + k1 r² + k2 r⁴)

where k1 and k2 are the radial distortion parameters.
Pincushion distortion A radial lens distortion. Using a telephoto lens can sometimes result in pincushion distortion, in which an image point moves outward from the center by an amount that increases with its distance from the center of the image. Opposite to barrel distortion.
Barrel distortion A geometric lens distortion in an optical system. Some optical systems give up some geometric fidelity to reduce other undesired effects. Barrel distortion causes the outer contour of the target to bulge outward, taking on a barrel shape. Compare pincushion distortion.
Geometric alignment Same as geometric correction. Aligning several feature points on two targets to facilitate the matching of these targets.
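The two-parameter polynomial radial distortion model above can be applied to normalized, optical-center-relative coordinates as in the following sketch; the parameter values and the coordinate grid are illustrative assumptions (negative k1 pulls points inward, as in barrel distortion; positive k1 pushes them outward, as in pincushion distortion).

    import numpy as np

    def radial_distort(xc, yc, k1, k2):
        # polynomial radial distortion applied to normalized, center-relative coordinates
        r2 = xc ** 2 + yc ** 2
        factor = 1.0 + k1 * r2 + k2 * r2 ** 2
        return xc * factor, yc * factor

    xc, yc = np.meshgrid(np.linspace(-1, 1, 5), np.linspace(-1, 1, 5))
    bx, by = radial_distort(xc, yc, k1=-0.2, k2=0.0)   # k1 < 0: points pulled inward
    px, py = radial_distort(xc, yc, k1=+0.2, k2=0.0)   # k1 > 0: points pushed outward
    print(bx[0, 0], px[0, 0])                          # corner (-1,-1) maps to -0.6 and -1.4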
14.2.3 Image Radiometric Distortion
Radiometric distortion In remote sensing imaging, a distortion of the image in amplitude. It is due to the strong absorption of certain wavelengths of radiation by oxygen, carbon dioxide, ozone, and water molecules in the atmosphere. Correction methods for radiometric distortion include duplication correction, destriping correction, and geometric correction.
Radiometric correction Correction to eliminate radiation errors or distortions in remote sensing images (e.g., systematic and random problems due to external factors, data acquisition, and transmission systems). For example, spectra obtained using Earth-observation satellites may need to be corrected for the atmosphere and other effects.
Destriping correction A method of correcting radiometric distortion. When a group of detectors in a certain band is not properly adjusted, some lines in the image may repeatedly appear with gray levels that are too high or too low, causing streaks in the image. Destriping correction then improves the visual quality of the image by means of image enhancement techniques such as smoothing.
Destriping The process or means of eliminating stripe noise.
Legitimate distortion Distortion that does not diminish the value of the processed image or work for a given application. For example, in most applications, lossy-compressed images do not exhibit appreciable unnatural changes in appearance. Compare illegitimate distortion.
Illegitimate distortion Distortion that causes a product or processing result to be ineffective for some applications. For example, illegitimate distortion of medical images may lead doctors to make erroneous diagnoses. Compare legitimate distortion.
False contour See false contouring.
False contouring A phenomenon caused by the small number of gray levels imparted to an image. It appears in images quantized to only about 16 or fewer uniform gray levels. This phenomenon is most likely to occur in areas of slowly varying grayscale in the image, usually as thin ridge-like structures. On the other hand, the more detail an image has, the smaller the improvement provided by increasing the number of gray levels. Therefore, for images with many details, such as images of crowds, the number of gray levels used matters less; even if false contours appear, the image is not much affected.
Contouring False contour phenomena appearing in images due to quantization errors.
Clutter Elements of the image that are not of interest or have not been modeled. For example, face detection generally uses a face model but has no model of other objects; that is, they are all regarded as clutter. The background of an image is often considered to contain "clutter." It is generally believed that clutter contains some unwanted structures and is more structured than noise.
Checkboard effect The appearance of uniform gray squares (similar to a checkerboard) when the spatial resolution of the image is too small (i.e., the number of pixels is small).
Artifact A phenomenon that is clearly apparent in an image, caused by a scene object or by the acquisition device. Further, it refers to various geometric distortion phenomena or other non-ideal conditions. Typical examples are blotches or speckle, judder, intensity flicker, blockiness, jagged lines, ringing, etc.
Ghost artifacts Undesired, secondary (often weaker) signal artifacts superimposed on top of the main signal. In imaging, many conditions can cause ghosting, including unexpected optical, electrical, exposure, or sensitization effects.
Ghost See ghost artifacts.
Artifact noise Noise in the image that interferes with the target image of interest and forms artifacts.
Strip artifacts Strip or band interference in the image that does not belong to the target of interest.
Banding A distortion caused by representing an image with an insufficient number of bits, so that some transitional gray levels (gradations) cannot be displayed.
Ring artifact An annular (circular-mode) artifact, often found in computerized tomography.
Judder A phenomenon often seen on screens in telecine (film-to-tape) transfers. It refers to a motion artifact that arises particularly easily for slow and steady camera motion, and results in motion that was smooth in the original film sequence appearing somewhat unstable after the transfer.
Intensity flicker An artifact in a video sequence (especially an old movie). It appears as unnatural frame intensities that fluctuate over time, where these fluctuations do not derive from the original scene.
14.3 Noise and Denoising
Noise is one of the most common degradation factors, and it is also the focus of research in image restoration (Gonzalez and Woods 2008). The causes of noise should be considered in image restoration, and filters with different characteristics should be used according to the characteristics of different noise models (Mitra and Sicuranza 2001; Zhang 2009).
14.3.1 Noise Models
Image noise A degenerate form of an image in which the pixel values differ from their ideal values. Noise is often modeled as having a zero-mean Gaussian distribution. The type of noise depends on its cause (such as electronic interference, environmental influences, etc.). The noise intensity is often measured by the signal-to-noise ratio.
Fig. 14.7 Two Lena images with different noises superimposed
Noise A common image degradation process. It can also be defined as an artifact that is undesirable in a contaminated image. The noise in the image has different sources and thus different types. Noise is a statistical process, so the effect of noise on a particular image is uncertain. In many cases, people can only have a certain understanding of the statistical characteristics of this process. Noise can affect the quality and observation of the image; it can also affect the performance of image technology. For example, noise may cause the histogram to lose its bimodal shape and could have a significant impact on thresholding. Noise is often generated during image recording. It can be derived from thermal disturbance of conductive carriers, current motion, nonuniform current flow, measurement error, counting error, various interference factors, etc. For some noise, it can be described by the characteristics of its statistical distribution. Figure 14.7 shows the results of superimposing Gaussian noise and salt-and-pepper noise on the Lena image, respectively. A generic term that indicates that the signal deviates from its “true” value. When discussing an image, the noise causes the pixel value (or other measure) to be different from the expected value. The source of noise can be random factors such as thermal noise in the sensor or secondary scene events such as dust or smoke. Noise can also represent events that are regular but not modeled, such as short-term lighting changes or quantification. Noise can be reduced or eliminated by using noise reduction methods. Noise model A model built on noise. Noise is generally considered as a random variable whose probability density function (PDF) or histogram describes its distribution over the entire gray-level range. In addition to expression in the spatial domain, they can sometimes be expressed in the frequency domain (through their spectrum). A way to model the statistical properties of noise without regard to the cause of the noise. A common assumption for noise is that the noise has some potential but possibly unknown distribution. Gaussian noise models are often used for random factors, while uniform distribution is often used for unmodeled scene effects. Noise can also be modeled using a hybrid model. A typical noise model has one or more parameters that control the noise intensity. The noise model can also indicate how noise affects the signal, such as the additive noise offset true value and the
multiplicative noise scaling true value. The type of noise model limits the available methods of noise reduction. Additive noise Noise superimposed on the input signal but independent of the strength, phase, etc. of the input signal. It can also be an independent noise added to the original image by some external process. Here the random value of the noise field is superimposed on the true value of the pixel. Let the original image be f(x, y), and the noise is also regarded as an image n(x, y), that is, the noise of each point can be different, and then the image after adding noise is expressed as gðx, yÞ ¼ f ðx, yÞ þ nðx, yÞ Multiplicative noise Noise (proportionally) related to the strength, phase, etc. of the input signal. For a typical multiplicative noise, the random number of the noise field is multiplied by the true value of the pixel. A model of image degradation in which the noise is proportional to the intensity of the signal f(x, y) ¼ g(x, y) + g(x, y)n(x, y), where f(x, y) is the observed image, g (x, y) is the ideal (original undamaged) image, and n(x, y) is the noise. Homogeneous noise The noise whose parameter is the same for all pixels. For example, in the case of Gaussian noise, if the mean μ(i, j) and the variance σ(i, j) are the same for all pixels and are equal to the mean μ and variance σ of the image, respectively, then the noise is homogeneous. Correlation noise The noise whose effect on the signal varies with the signal. Compare additive noise. Uncorrelated noise The irrelevant case that for the noise value at the combination of any n pixels, the average of the products (the average of the n-tuples in the image) is equal to that of the product of the average noise values at the corresponding positions. That is, the {product}of the average ¼ the {average} of the product. Periodic noise A type of noise generated by electrical or electromechanical interference during image acquisition. Because the period corresponds to the frequency, this type of noise can often be eliminated using frequency domain filtering techniques such as band-reject filters. Independent noise The noise value at each pixel location is not affected by noise at other pixel locations. Since the noise value affecting each pixel is random, the noise process can be regarded as a random field, which is the same as the image size and is added to (or multiplied by) the field of the image. It can be said that the value of noise at each
pixel location is the output of a random experiment. If the result of the random experiment (assumed to act at one pixel position) is not affected by the outputs of the random experiments at other pixel locations, the noise is independent.
Independent identically distributed [i.i.d.] noise Noise that satisfies both the "independent" and the "same distribution" conditions at the same time. "Independent" means that the joint probability density function of the combination of noise values can be written as the product of the probability density functions of the single noise components at the different pixels. "Same distribution" means that the noise components at all pixel locations are derived from the same probability density function. If the noise component nij at pixel (i, j) is derived from a Gaussian probability density function with mean μ and standard deviation σ, the joint probability density function p(n11, n12, ..., nNM) of all noise components can be written as

p(n11, n12, ..., nNM) = [1/(√(2π) σ)] exp[−(n11 − μ)²/(2σ²)] · [1/(√(2π) σ)] exp[−(n12 − μ)²/(2σ²)] ⋯ [1/(√(2π) σ)] exp[−(nNM − μ)²/(2σ²)]

where the size of the image is assumed to be N × M.
Noise characteristics See noise model.
Noise level function [NLF] The noise level expressed as a function of the pixel value.
Standard deviation of noise The standard deviation computed from the noise amplitude when the noise is treated as a signal.
Noise equivalent exposure [NEE] The radiant energy per unit area needed to generate an output signal comparable to the sensor's output noise level. It describes the lower limit of detectable brightness.
14.3.2 Noise Sources Noise source A generic term that indicates the source of interference with image data. This could be a systematic unmodeled process (such as 50 Hz electromagnetic noise) or a random process (such as shot noise). Noise sources can be in the scene (such as foil strips that affect radar detection), in the media (such as dust), in the lens (such as manufacturing defects), or in sensors (such as changes in sensitivity).
Mosquito noise Temporal noise in the smooth regions of video frames. It appears in the same spatial region of each frame in the same video sequence. Since adjacent blocks use different prediction modes (inter-frame or intra-frame), even the same quantization coefficients can cause visually perceptible differences and thus noise.
Add noise It refers to random noise deliberately added to an image; the result is often gray or colored speckles.
Thermal noise In a CCD camera, electrons released by incident photons are combined with electrons released by thermal vibration. It appears in the image as gray values affected by an additive Poisson noise process.
System noise For various image and video acquisition, processing, and application systems, the noise of the system itself; it can also be caused by external interference (such as electromagnetic radiation).
Homoscedastic noise Noise with a normal distribution and constant variance that is independent of the variables, of the sampling, and of the signal.
Random noise Noise that is irregular over time and whose amplitude at a specific time is unpredictable. It is also often referred to as background noise. Natural noise accounts for a large proportion of it, but it can also be artificial. Common random noise falls into three categories:
1. Single-frequency noise: Generally continuous, but its amplitude, frequency, or phase cannot be predicted in advance. It is characterized by a very narrow frequency band.
2. Impulse noise: Generally a discrete noise that occurs suddenly. It is characterized by large burst amplitudes but short durations, and there is often a long quiet interval between adjacent bursts. Its spectrum is wider than that of single-frequency noise and has low intensity at high frequencies.
3. Fluctuating noise: Ubiquitous in both the spatial domain and the time domain. Typical examples are thermal noise, cosmic noise, shot noise, etc. Since thermal noise and shot noise have a uniform power spectral density over a wide frequency range, they are also considered to be white noise.
Quantum noise Random noise caused by the discreteness of electromagnetic radiation. Typical examples are shot noise, photon noise, and composite noise. Shot noise is the part of quantum noise derived from the particle nature of electrons; it is related to the DC component of the photocurrent, the electron charge, and the bandwidth of the filter. Shot noise increases linearly with increasing light frequency. Photon noise originates
from photon fluctuations in the source and modulator as well as improper reflection, dispersion, and interference of the medium. The composite noise comes from the recombination of electrons and holes called carrier recombination in the semiconductor material constituting the light source and the detector device. Photon noise Noise generated by statistical fluctuations associated with photon counting. This count is performed on a charge coupled device or other solid-state sensor in a digital camera over a finite time interval. Since photons do not arrive at the sensor at equal intervals, but arrive randomly according to the Poisson distribution model, this causes fluctuations in photons over time. Photon noise is not independent of the signal, nor is it additive. Photon noise can be characterized by the variance of the number of photons. For the Poisson distribution, the variance is the same as the mean, which means that the signal-to-noise ratio of the light is the square root of the mean. That is, strong light has a larger signal-to-noise ratio and can help achieve better quality images. Shot noise A type of noise caused by the randomness of electronic motion. Typically, electrons will change according to their random motion as they are emitted from the hot cathode of a vacuum tube or from the emitter of a semiconductor transistor. In addition, in synthetic aperture radar images, shot noise is also generated due to special imaging conditions. See impulse noise and salt-and-pepper noise. Electronic shot noise Same as shot noise. Dark noise The sum of dark current noise, amplifier noise, reset noise, quantization noise, etc. in the sensor. Also known as the floor noise, it is the bottom of the noise. It can be characterized by a Gaussian probability density function, which is generally assumed to be the same for all pixels. Dark current Current caused by electron-hole pairs generated by heat produced by the device. It is the source of dark current noise. Dark current noise A sensor noise associated with dark current in a sensor’s signal conversion circuit. The noise generated by the generation of electron-hole pairs due to the heating of the device is a kind of thermal noise. For cameras and sensors, this noise is generated even in the absence of light. Generally, it has a great relationship with the length of the acquisition time. As time increases, this part of the noise also increases. But unless exposed for a long time, this noise generally does not pose a serious problem. Amplifier noise The noise generated by the amplifier circuit in the image capture device is caused by the movement of the electrons and is superimposed on the image signal.
Specifically, the noise caused by the amplifier in the sensor that amplifies the received signal before analog-to-digital conversion to obtain an analog gain. Its standard model is Gaussian noise and is independent of the signal. In color cameras, the blue channel requires more amplification than the green channel or red channel and produces more noise. In well-designed amplifiers, this type of noise is often negligible.
14.3.3 Distribution
Gaussian noise Noise whose amplitude values at the respective pixel positions conform to a Gaussian distribution, or whose distribution is essentially Gaussian. It is determined by its standard deviation about a zero mean and is often modeled as additive noise. Also known as white noise, because its frequencies can cover the entire spectrum. Gaussian noise affects the true value of a pixel; the random noise value is derived from a Gaussian probability density function, which can be written as

p(z) = [1/(√(2π) σ)] exp[−(z − μ)²/(2σ²)]

where z is the amplitude, μ is the mean of z, and σ is the standard deviation of z. Unlike impulse noise, which affects only a subset of the pixels in an image, Gaussian noise can affect all pixels, though with varying degrees of influence.
White noise A noise with the same noise power at all frequencies. Pink noise is a counterexample. When considering spatially distributed noise, white noise refers to distortion at every spatial frequency (both large and small distortions). Same as Gaussian noise.
Poisson noise Noise whose amplitude value at each pixel location follows a Poisson distribution; that is, noise in the form of a Poisson distribution, in which the variance scales directly with the mean intensity. Noise generated by low photon counts in low-light conditions is often described as Poisson noise. This noise is generated by CCD sensors.
Poisson noise removal Eliminating or attenuating Poisson noise.
Mixed Poisson-Gaussian noise An image noise that includes both Poisson noise, which depends on the signal, and Gaussian noise, which is independent of the signal.
Exponential noise Noise whose spatial amplitude follows the exponential distribution. Its probability density function can be written as

p(z) = a exp(−az) for z ≥ 0, and p(z) = 0 for z < 0

where a > 0.

is the learning rate parameter. Since the weight vector moves in the direction in which the error function decreases at the greatest rate in each step, the iterative method is called a gradient descent method or a steepest descent method.
Highest confidence first [HCF] In gradient descent optimization, a strategy for updating variables that selects the variable to update according to the magnitude of the resulting reduction in energy.
Conjugate gradient descent An iterative technique for solving large linear least-squares problems using a series of descent steps that are conjugate to each other.
Stochastic gradient descent An optimization algorithm that minimizes a convex cost function. See sequential gradient descent and Markov chain Monte Carlo.
Sequential gradient descent The online version of the gradient descent method, also known as the stochastic gradient descent method. It updates the weights each time on the basis of a single data point:

w(t+1) = w(t) − k ∇En(w(t))

Among them, the parameter k > 0 is the learning rate parameter, and En represents the n-th error function. During the updates, the individual data points can be selected either in a fixed order or randomly. One advantage of this type of method is that it handles redundancy in the data more efficiently than the conventional gradient descent method. Another advantage is that it is less likely to fall into local minima.
Least mean squares algorithm [LMS algorithm] The sequential gradient descent method when the error function is a sum-of-squares function; it is a statistically based gradient descent method.
646
16
Image Reconstruction from Projection
Steepest descent Same as gradient descent. Batch training A training technique optimized by the gradient descent method. The gradient descent method determines the solution for ∇E(w) ¼ 0. Most techniques include first selecting some initial value w(0) for the weight vector and then iterating over the weight space: wðkþ1Þ ¼ wðkÞ þ ΔwðkÞ To update the weight vector, the gradient information can be used to determine Δw(k). The easiest way is to choose a small step in the negative gradient direction: h i wðkþ1Þ ¼ wðkÞ r∇E wðkÞ The parameter r > 0 is called the learning rate. After each update, the gradient is recalculated for the new weight vector, and the process is repeated. Note that the error function is defined relative to the training set, so the entire data set needs to be processed at each step to evaluate ∇E. The technique of using the entire data set at a time is called batch training technology. At each step, the weight vector moves in the direction in which the rate of decline of the error function is the largest, so it is called the gradient descent method or the steepest descent method. Iterated conditional modes [ICM] Same as gradient descent.
Chapter 17
Image Coding
Image coding is a large class of techniques in image processing (Castleman 1996; Gonzalez and Wintz 1987; Pratt 2007; Salomon 2000). The purpose is to reduce the amount of data (and thus reduce the space required for image storage and the time required for transmission) while ensuring a certain visual quality. This can also be seen as using a smaller amount of data to obtain better visual effects (Zhang 2017a).
17.1
Coding and Decoding
The purpose of image coding is to reduce the amount of data in the image. Information preservation coding or information loss coding can both be used (Gonzalez and Wintz 1987; Zhang 2009). The difference lies in the fidelity of the decoded image to the original image (Pratt 2007).
17.1.1 Coding and Decoding Image coding and decoding A combination of image coding and image decoding. Image coding A large class of basic image processing techniques. The purpose is to reduce the amount of data required to represent the original image by adopting a new representation method on the image (on the basis of retaining the original information as much as possible), so it is also often called image compression. It is used to map or decode a presentation of an image (for a compressed image) or an algorithm. The result of the image coding changes the form of the image into the compressed domain. The direct display of the results of image coding are often meaningless or © Springer Nature Singapore Pte Ltd. 2021 Y.-J. Zhang, Handbook of Image Engineering, https://doi.org/10.1007/978-981-15-5873-3_17
647
648
17
Image Coding
do not visually reflect the full nature and content of the image. Therefore, when it is necessary to use an image, it is also necessary to convert the compression result back to the original image or its approximate image (decoded image), which is called image decoding or image decompression. Broadly speaking, image coding includes both image coding and image decoding. Coding See image coding. Image encoding The process of converting one image to another representation. See image compression. Image compression A method of expressing an image in order to reduce the amount of storage space occupied by the image. The technique used can be lossless (i.e., all image data is completely recorded) or lossy (at this time image quality reduction is allowed, especially when it is necessary to greatly increase the compression ratio). See image coding. Compression noise Noise artifacts due to the use of lossy compression techniques during image storage or transmission. See blocking artifact. Textual image compression A method for compressing textual images containing print (and possibly partial handwriting) text. Text can be used in a variety of fonts, including symbols, hieroglyphics, and more. In practice, characters are often identified first, compressed using one encoding method, and additional compression methods are used for the rest (if the rest can be ignored, the option of lossy compression can be used). Image decoding See image decoding. Decoding 1. The opposite of the encoding process performed on the image encoding result. Restore the encoded result of the image to the original image. Convert the previously encoded image back to its original form (in the case of lossless encoding) or to a near-native form (in the case of lossy encoding). The result of image coding is not necessarily an image form, and it is often necessary to restore the coded result to an image form in an application. 2. One of the three key issues of the hidden Markov model. Image decoding model A model for decoding the image coding results. See source decoder. Image decompression See image coding.
17.1
Coding and Decoding
649
Table 17.1 Evaluation scale of subjective fidelity Damage test 5-not perceived 4- can be perceived but not annoying 3-somewhat annoying 2- seriously annoying 1- not available
Quality inspection A-excellent B-good C-available D-poor E-worry
Comparison test +2 very good +1 good 0 equal 1 poor 2 very poor
Decompression Same as image decompression. See image coding. Image fidelity A measure of how much the decoded image is offset from the original image. The fidelity of the image needs to be considered only for the results of lossy coding. The main criteria for measuring fidelity can be divided into two categories: (1) subjective fidelity criteria and (2) objective fidelity criteria. Objective fidelity See objective fidelity criterion. Objective fidelity criterion A measure of the amount of codec loss information is represented by a correlation function that encodes the input image and the decoded output image. Commonly used are the mean square signal to noise ratio, peak signal to noise ratio, etc., which can be quantitatively calculated Subjective fidelity See subjective fidelity criteria. Subjective fidelity criteria The evaluation criteria or specifications used to measure image quality subjectively. Subjective fidelity can be divided into three types. The first type is called an impairment test in which an observer scores an image based on its degree of damage (degradation). The second type is called a quality inspection, in which the observer sorts the images according to their quality. The third type is called a comparison test, in which the images are compared in pairs. The first two criteria are absolute and often difficult to determine completely unbiased (without bias). The third criterion provides a relative result that is not only easier but also more useful for most people. Table 17.1 shows an evaluation scale for each of the above three subjective fidelities.
17.1.2 Coder and Decoder Image coder and decoder A device that implements image encoding and image decoding.
650
17 Encoder
Input image
Image Coding
Decoder Output image
Source
Channel
Channel
Channel
Source
Fig. 17.1 Image coding system model
Codec A combination of an image encoder and an image decoder. Signal coding system A system that encodes one signal into another, primarily for compression or security purposes. See image compression and digital watermarking. Image coder A device (hardware) or an algorithm/program (software) that implements image encoding. Image encoder Same as image coder. Image decoder A device (hardware) or an algorithm/program (software) that implements image decoding. System model for image coding A model describing the structure of an image coding system (often plus an image decoding system). A general image coding system model is shown in Fig. 17.1. This model mainly consists of two structural modules that are cascaded through the channel: an encoder and a decoder. Error correction encoder An encoder that performs an error correction code. Error-correcting code The purpose is to reverse the encoding of the data compression. By generating some redundancy in the data, it is convenient to find and correct errors in the data that may be caused by transmissions and the like. In practice, some use parity bits and sometimes use generating polynomials to design. Hamming code An error correcting code for 1-bit errors. It is convenient to generate the required parity pattern. Trellis code An error correcting code in which a message is determined by a trellis diagram. Each arc in the grid map is identified as a sequence of symbols that join together the symbol sequences of all the arcs on a given path, and the corresponding information for that path can be encoded.
17.1
Coding and Decoding
651
Trellis-coded modulation A modulation technique that utilizes a trellis code (determining the path of a sequence of symbols through a trellis diagram). Each arc in the grid is represented by a vector, and the vectors of all the arcs on a given path are superimposed to modulate the sequence corresponding to this path. Symmetric application The applications that images need to be compressed and decompressed the same number of times, such as video conferencing. At this point, both the encoding end and the decoding end need a simple and fast method. Compare asymmetric application. Asymmetric application The applications that images need to be compressed once but decompressed multiple times, such as video distribution. At this time, a complicated and time-consuming coding method and a simple and fast decoding method can be adopted. Compare symmetric application.
17.1.3 Source coding Source 1. Same as information source. 2. Radiation source: an energy radiator that illuminates the vision system sensor. Information source The place where information is generated. It can output a discrete random variable representing a sequence of random symbols produced from a finite or infinitely countable set of symbols. The symbol sequence can be expressed as A ¼ {a1, a2, . . ., aj, . . ., aJ}, where the element aj is called an information source symbol. If the probability that the source generates the symbol aj is P(aj), the probability of occurrence corresponding to each source symbol can be written as a vector u ¼ [P (a1) P(a2) . . . P(aJ)] T. Use (A, u) to fully describe the source. Information-source symbol The element in the output collection of the source. Alphabet In image coding, a collection of all possible symbols of a source. It is generally all the values that a pixel can take (such as a pixel value of 256 gray values when represented by 8 bits). More generally, it refers to a set or sequence of symbols used to represent information. Source encoder A device that encodes a source, as is shown in Fig. 17.2. The purpose is to reduce or eliminate coding redundancy, inter-pixel redundancy, and psycho-visual
652
17
Image Coding
Source coding Input image
Mapping
Quantization
Symbol coding
Channel
Fig. 17.2 Source encoder
Source decoding Channel
Symbol decoding
Inverse mapping
Output image
Fig. 17.3 Source decoder
redundancy in the input image. In general, three independent operations in order are included: mapping, quantization, and symbol coding. Compare source decoder. Source decoder A device that decodes the output of a source encoder. As shown in Fig. 17.3, the effect is to restore the encoded result to the original image for normal use. In general, two independent operations are performed in reverse order with the corresponding source encoder: symbol decoding and inverse mapping. Since the quantization operation in the source encoder is not invertible, there is no inverse operation in the source decoder for quantization. Compare source encoder. Zero-memory source A source whose symbols are statistically independent. Code book 1. A series of symbols (such as letters and numbers) used to express a certain amount of information or a set of events. Encoding an image requires the creation of a code book to express the image data. See code word, length of code word. 2. A set of determined code words obtained from clustering of feature descriptors when using the bag-of-words model for classification. Code word A sequence of code symbols assigned to each particular message or event in the code book. See code book and bag of features. Length of code word The number of symbols contained in the code word in the image encoding. Average length of code words The average number of symbols for all code words that represent a set of events. Encoding Converting a digital image represented by a set of values (or symbols) from one form to another for some specific purpose (often to compress an image). In lossy coding,
17.1
Coding and Decoding
653
there will be some loss of information, and the original image will not be fully recovered in subsequent decoding process. See JPEG and MPEG. Coding block [CB] A block of pixels corresponding to the coding unit.
17.1.4 Data Redundancy and Compression Data redundancy In image compression, the use of too much data to represent information causes some data to represent useless information, or the phenomenon in which some data repeatedly represents information that other data has represented. Three basic data redundancy are considered in image compression: (1) coding redundancy, (2) inter-pixel redundancy, and (3) psycho-visual redundancy. If one or more of these redundancy can be reduced or eliminated, the compression effect can be achieved. Redundant data When data is used to represent information, it represents useless information or repeatedly indicates data that other data has represented. In other words, the information expressed by the redundant data is either useless or has been expressed by other data. It is because of the existence of redundancy that data compression is possible. Relative data redundancy A measure that mathematically describes the redundancy of data. If n1 and n2, respectively, represent the number of information carrier units in the two data sets used to express the same information, the relative data redundancy RD of the first data set relative to the second data set can be expressed as RD ¼ 1 1=CR where CR is called compression ratio (relative data rate): C R ¼ n1 =n2 CR and RD take values in the open interval (0, 1) and (1, 1), respectively. Coding redundancy A basic data redundancy in the image. The sum of the average number of bits required to represent one pixel in an image is the sum of the product of the number of pixels of the same gray level and the number of bits required to represent these gray values, divided by the total number of pixels. Encoding an image, if the minimum value of this sum is not reached, indicates that there is coding redundancy. In order to eliminate and reduce coding redundancy, a smaller number of bits can be used to
654
17
Image Coding
represent a gray level with a higher probability of occurrence, and a larger number of bits can be used to represent a gray level with a lower probability of occurrence, thereby reducing coding redundancy and the number of bits required to represent the same information will be less than the original number of bits. This type of encoding method is always reversible. Inter-pixel redundancy A basic image data redundancy. There is a close relationship with the correlation between pixels. There is a certain correlation between adjacent pixels in a natural image (there is a certain regularity in the change of pixel gradation), and the stronger the correlation, the greater the redundancy between pixels. Redundancy between pixels is also often referred to as spatial redundancy or geometric redundancy. Interframe redundancy in video images is also a special case of it. Psycho-visual redundancy A basic type of (image) data redundancy. It has certain subjectivity and varies from person to person. The eye (and further the human visual system) is not the same sensitivity to all visual information, and some information is less important and necessary in the usual visual perception process than other information, which can be considered as psycho-visual redundancy. However, these psychological visual redundancy information exists objectively. Removing such information to eliminate or reduce psycho-visual redundancy leads to real information loss and cannot be compensated. Data compression The number of bits required to store data (text, audio, speech, images, and video) is reduced by using an appropriate and effective representation of the data (as much as possible while preserving the original information). Data reduction A generic term that indicates both processes: 1. Reduce the number of data points, such as by subsampling or by using the mass center of the object as a representative point, or by sampling. 2. Reduce the dimensions of individual data points, such as using projection or principal component analysis. Compression Same as image compression. Compression ratio A commonly used measure of the efficiency of compression methods in image compression or video compression. It is the reciprocal of the compression factor, which can be written as compression ratio ¼ amount of output data / amount of input data. A value less than 1 means that the input data is compressed, and a value greater than 1 means negative compression (data expansion). See relative data redundancy.
17.1
Coding and Decoding
655
Compression factor The inverse of the compression ratio in image compression or video compression. Can be written as compression factor ¼ input data volume / output data volume. A value greater than 1 indicates data compression, and less than 1 indicates data expansion. Compression gain An indicator used to measure the relative performance of different compression methods. It can be written as compression gain ¼ 100 ln(R/C), where R represents the amount of reference data and C represents the amount of data after compression. The reference data amount herein does not refer to the input data amount but refers to the amount of compressed data generated by a standard lossless compression method used as a reference. Resolution-independent compression An image compression method that is independent of the resolution of the compressed image. The image can be decoded at any resolution. Geometric compression Compression of geometric features and targets (such as triangles, polygons, etc.). Signal-to-noise ratio [SNR] A physical quantity that reflects the relative strength of a signal and noise. It is a measure of the relative intensity of the portion of interest and non-interest (noise) in a signal. Signal-to-noise ratios often have different expressions when used in different applications. In signal processing, the signal-to-noise ratio is the number of decibels of the signal power to noise power ratio, which is 10lg(Ps/Pn). When considering statistical noise, the signal-to-noise ratio can be defined as the logarithm of the ratio of ten times the standard deviation of the signal and noise. In image engineering, the following three are commonly used: 1. In image display (such as TV applications), the signal-to-noise ratio can be represented by the energy ratio (or voltage squared ratio): 2 V SNR ¼ 10lg s2 Vn Among them, Vs is the signal voltage, taking the peak-to-peak value; Vn is the noise voltage, taking its root mean square (RMS). 2. In image synthesis, the signal-to-noise ratio can be expressed as SNR ¼
Cob σ
2
656
17
Image Coding
Among them, Cob is the grayscale contrast between the target and the background, and the σ is the mean square error of noise. 3. In image compression, the signal-to-noise ratio is often normalized and expressed in decibels (dB). Let the size of the image f(x, y) be M N, and the mean value of the image gradation is M 1 N 1 1 X X f ðx, yÞ MN x¼0 y¼0
f ¼
Then the signal to noise ratio can be expressed as 2
M1 P N1 P
2 ½f ðx; yÞ f
3
6 x¼0 y¼0 7 6 7 SNR ¼ 10lg6M1 N1 7 2 5 4P P ^f ðx; yÞ f x; y x¼0
y¼0
wherebf (x, y) represents an approximation of f(x, y) obtained by compressing f(x, y) first and then decompressing the result. If fmax ¼ max{f(x, y), x ¼ 0, 1, . . ., M – 1; y ¼ 0, 1, . . ., N – 1}, the peak signal-tonoise ratio (PSNR) is obtained: 2 6 6 PSNR ¼ 10lg6 4
3
1 MN
M1 P N1 P
f 2max
^f ðx; yÞ f x; y
7 7 7 2 5
x¼0 y¼0
In addition, in the watermarking system, if one of the watermark template and the cover works is used as a signal and the other as noise, two different signal to noise ratios can be obtained: a signal-to-watermark ratio and a watermark-to-noise ratio. Peak SNR [PSNR] See signal-to-noise ratio [SNR]. Peak signal-to-noise ratio [PSNR] Expamsion of peak SNR. Mean-square signal-to-noise ratio An objective fidelity criterion used to represent the difference between compressed and decompressed images. If the decompressed image is treated as the sum of the original image f(x, y) and the noise signal e(x, y), i.e., bf ðx, yÞ¼ f(x, y) + e(x, y), then for the output image (size M N ) the mean square signal to noise ratio SNRms can be expressed as
17.1
Coding and Decoding
SNRms ¼
M 1 X N 1 X x¼0 y¼0
657
bf ðx, yÞ2 =
M 1 X N 1 X
h bf ðx, yÞ f ðx, yÞ2
x¼0 y¼0
If the square root is obtained for the above equation, the root mean square signal-to-noise ratio SNRrms is obtained. See signal-to-noise ratio. Root-mean-square signal-to-noise ratio The square root of the mean square signal-to-noise ratio.
17.1.5 Coding Types Waveform coding When encoding an image, the gradation value or color value of each pixel is mainly considered, and the encoding performed as one waveform is sequentially connected. In the waveform coding, a high-level concept of a physical entity in a scene corresponding to a set of pixels is not utilized, so the efficiency is relatively low. In the simplest case of waveform coding, it is assumed that each pixel is statistically independent, and the commonly used coding technique is called pulse-coding modulation (PCM). Pulse-coding modulation [PCM] The simplest waveform coding technique. It is assumed that each pixel is statistically independent. Description coding A coding technique that decomposes a medium into a number of simultaneously played streams and reconstructs them in parallel at the receiving end. Hybrid encoding A type of image coding method that combines 1-D transform coding and differential pulse code modulation. The performance is close to 2-D transform coding, but the amount of calculation is small. Many international standards (starting with H.261) use a combination of such transformations and predictions. Frequency domain coding It refers to coding in transform space such as DCT domain and wavelet domain. Adaptive coding A scheme for transmitting signals over an unstable channel, such as a wireless channel. Adaptation of channel variations (such as “fading” with reduced signalto-noise ratio) by changing the parameters of the encoding. In image coding, a type of compression method that modifies its operations and/or parameters based on newly arrived data from the input stream. This compression method is often used in dictionary-based coding methods.
658
17
Image Coding
Progressive coding Same as progressive image coding. Progressive image coding A method of layering images (each layer with different image details) and encoding the images separately by layer. At the decoding end, the decoder can quickly display the entire image with low resolution and continuously increase the image detail as the decompressing more and more layers to achieve a gradual improvement in image quality. Progressive image compression An image compression method for organizing compressed streams by hierarchy. Each layer contains more image detail than the previous layer. The decoder can display the entire image very quickly in a low-quality format and then improve the display quality as more layers are read in and decompressed. If the user views the decompressed image on the screen, most of the image features are often recognized after the image is decompressed by 5% to 10%. There are different ways to improve image quality over time, such as (1) for image sharpening, (2) for adding color, and (3) for increasing resolution. Progressive compression A sequential progressive technique and process in image coding. Also known as progressive encoding, scalable encoding. In practice, there are several different types of scalability: (1) quality progressive, use a gradual increase in code stream to update the reconstructed image; (2) resolution progressive, when coding, first consider the low resolution image, and then encode the difference between the low and high resolution; (3) component progressive, first consider encoding grayscale data, and then encode color data. A progressive approach can also be used in lossless coding, that is, using a coarse to fine pixel scan format. For example, it is often used when downloading preview images or providing access to different image qualities. Block coding The image coding technique is first to decompose the input image into blocks of fixed size and then transform each block into a smaller block (for compression) or a larger block (for error correction) and then transmit it. Block truncation coding [BTC] A lossy coding method. The image is first decomposed into sub-images (blocks), and then (by quantization) the number of gray values is reduced in each block. The quantizer used here is adapted to local image statistics, the level of which is chosen to minimize the specific error criteria so that the pixel values in each block can be mapped to the quantization level. Different block truncation coding algorithms can be obtained by using different quantization methods and different error criteria. Recursive block coding An encoding method that attempts to reduce blockiness and redundancy between blocks.
17.1
Coding and Decoding
659
Blocking artifact Artifact that is due to loss of information in using block coding schemes for compression, which results either from the use of lossy compression or from errors in data transmission. For stream media, data transmission errors in video compression are more common. See compression noise. Semi-adaptive compression A method of performing image compression by scanning an image twice. The first scan here is to read in the data stream to collect its statistics, and the second scan is the actual compression. This includes the statistic (model) in the compressed stream. Embedded coding A special way of image coding. If an encoder is used twice for the same image, and each time the distortion is different, then two compressed files are available. Set the big file have M bits and the small file have m bits. If the encoder uses embedded encoding, the smaller file is equal to the first m bits of the larger file. For example, there are three users who need to receive a compressed image, but the quality required varies. The required image quality is provided in 10Kb, 20Kb, and 50Kb files, respectively. If you use the lossy compression method in the normal way, you need to compress an image three times with different quality to generate three files of the required size. If you use embedded coding, you only need to make one file and then send three blocks of 10Kb, 20Kb, and 50Kb starting from the same point to the corresponding three users. Symbol-based coding A method suitable for encoding textual images. Each text bitmap in the textual image is treated as a basic symbol or sub-image, and the textual image is treated as a collection of these sub-images. Before encoding, create a symbol dictionary to store all possible text symbols. If a code is assigned to each symbol, the encoding of the image becomes the determination of the code word for each symbol and the spatial position of the symbol in the image. Thus, an image can be represented by a series of triples, namely, {(x1, y1, l1), (x2, y2, l2), ...}, where (xi, yi) represents the coordinates of the symbol in the image and li represents the position label (integer number) of the symbol in the symbol dictionary. Symbol coding Encoding the source symbol. Integral image compression A 3-D integral image takes up a lot of data. To express it with the appropriate resolution, the integral image compression method takes advantage of the characteristics unique to the image mode.
660
17.2
17
Image Coding
Coding Theorem and Property
Information theory is the basis of image coding. Both the noiseless coding theorem and the rate-distortion coding theorem are basic theorems in image coding (Gonzalez and Woods 2008).
17.2.1 Coding Theorem Information theory An applied mathematics branches that uses the method of probability theory and mathematical statistics to study the information, information entropy, communication systems, data transmission, cryptography, data compression, etc. The two major fields of information theory research are information transmission and information compression. These two fields are related to each other through information transmission theorem and information coding theorem. Estimation theory A branch of information theory. Using statistical methods, the received noisy observations are used to estimate the theory of actual or random variables, stochastic processes, or certain characteristics of the system. If further extended, the use of an explicit loss function is called statistical decision theory. Statistical decision theory See estimation theory. Self-information A measure of information about a single event itself. Self-information amount The amount of information that an event (or a symbol) carries when it appears. The smaller the amount of information, the greater the uncertainty of the event system and the greater the entropy. The greater the amount of information, the more regular the structure of the event system and the more complete the functionality, the smaller the entropy. The mathematical properties of entropy are continuity, symmetry, additive, and having a maximum. Self-information quantity Same as self-information amount. Noiseless coding theorem One of the basic theorems of image coding. In the absence of errors in both the channel and the transmission system, the minimum average length of code words achievable for encoding each source symbol is determined. In other words, the limit for compressing a lossless source is given. Also called (for zero-memory sources) Shannon’s first theorem.
17.2
Coding Theorem and Property
661
Source coding theorem The basic theorem of image coding. The theorem determines when there is no error in the channel but the transmission process is distorted; if the average error due to compression is limited to a certain maximum allowable level, then given the objective fidelity criteria, the data rate R for coding will be determined. Since it connects the distortion (reconstruction error) D of the fixed word length coding scheme to the data rate (such as the number of bits per pixel) R used for encoding, it is also called the rate distortion theorem. Rate-distortion coding theorem Same as rate-distortion theorem. Rate-distortion theorem See source coding theorem. Rate distortion A statistical method that is useful in analog-to-digital conversion. It determines the minimum number of bits needed to encode data when a given level of distortion is allowed or determines the amount of distortion produced when a given number of bits are used. Rate-distortion function The number of bits per sample (rate Rd) required to encode an analog image for an allowed distortion D (or error mean square value). Assuming the input is a Gaussian random variable, the variance of the input value is σ 2, and Rd ¼ max{0, [log2(σ 2/D)]/ 2}.
17.2.2 Coding Property Lossless coding An image encoding method in which an image is encoded without loss of information after the encoding and decoding process. That is, the image is encoded by the entire content, and the original image can be completely recovered at the decoding end, so it is also called information-preserving coding. Information-preserving coding See lossless coding. Lossless compression A type of image compression method in which an original image can be accurately reconstructed from a compressed image. This is in contrast to lossy compression. Same as lossless coding.
662
17
Image Coding
Lossless compression technique A technology that implements lossless coding. Commonly used are variable length coding, run length coding, differential coding, predictive coding, dictionarybased coding, and other techniques. Lossy coding An encoding method in which an image is encoded with information loss. So it is also called information lossy coding. After the image is encoded, it cannot be restored to its original state by decoding. Information-lossy coding Same as lossy coding. Lossy compression A type of image compression method in which an original image cannot be accurately reconstructed from a compressed image. The goal is to lose unimportant image detail (such as noise) and limit the perceived change in the appearance of the image. A lossy compression method generally gives a higher compression ratio than lossless compression. See lossy coding. Lossy compression technique A technique for implementing lossy coding. Commonly used are quantization technology, orthogonal transform coding technology, and wavelet transform coding technology. Lossy image coding Same as lossy coding. Lossy image compression Same as lossy compression. Near-lossless coding A coding scheme for compromising lossless coding and lossy coding. A certain balance is achieved between increasing compression ratio and maintaining image quality. The general expectation is that it can achieve higher compression performance than lossless coding when the information loss is relatively small. The main goal in practice is to have certain restrictions on information loss (often using L1 norm) while increase the compression rate as much as possible (often several times of the compression ratio of the information-preserving coding). Near-lossless compression Same as near-lossless coding. L1 image coding The L1 norm (maximum deviation norm) that minimizes the reconstruction error is directly considered, rather than the image encoding or compression technique that utilizes the more general L2 norm as in JPEG.
17.3
Entropy Coding
663
Instantaneous code In image coding, a code that satisfies the immediacy decoding characteristics. The immediacy of decoding refers to any finite length code symbol string, and each code word can be decoded successively, that is, after reading one code word, the corresponding source symbol can be determined, and the subsequent code does not need to be considered. The immediacy of decoding is also called non-continuation, that is, any code word in the symbol set cannot be constructed by adding symbols to other code words. Uniquely decodable code In image coding, a code that satisfies the decoding uniqueness requirement. That is, for any finite length code symbol string, there is only one method of decomposing it into its constituent code symbols or only one way of decomposing it. Scalability A general feature of computer algorithms. It refers to the performance of the algorithm will not be degraded for the model when the input increases. For example, if the image size increases, but the computation time of an image processing algorithm can remain substantially constant or though increase but less than the square of the image size, then the algorithm is scalable. This feature can also mean that when operating on an image database, the operation is extensible if the operating speed does not substantially change with the size of the library. Scalability coding The encoding method can be extended in some feature methods, which can include quality, resolution, spatial domain, etc. Blockiness When the image is subjected to blocking processing, a discontinuous defect occurs at the boundary between different blocks. Deblocking The technique and process of removing the blockiness effect, for example, produced when using JPEG with high compression ratio. A simple method involves searching for an unexpected edge along the edge of the block, treating the quantization step as a projection of a convex region of a transform coefficient space to a corresponding quantized value, and the like.
17.3
Entropy Coding
Entropy coding, also known as variable-length coding, is a statistical coding compression method that reduces coding redundancy (Salomon 2000).
664
17
Image Coding
17.3.1 Entropy of Image Entropy of image A measure of the amount of information in an image. Consider an image that is a collection of symbols (such as pixel values). Each pixel provides some information about the scene. To quantify the information of an image, the expected value of the information content of its symbol, i.e., the average of log(1/p), where p is the probability of a particular gray value appearing in the image, may be used.
G X 1 H E p log pk log pk ¼ p k¼1 where G is the number of different values that the pixel can take (the number of gray values in the image) and pk is the frequency of a particular gray value in the image (the value of the k-th bin of the image normalized histogram, where the normalized histogram consists of G bins). The information measure H is the image entropy. The larger the value, the more information the image contains, or the greater the amount of information. Binary entropy function In information theory, an entropy function of a Bernoulli process with only two values is expressed. Consider a binary source with the symbol set S ¼ {s1, s2} ¼ {0, 1}. Letting the probability that the source produces two symbols are P(s1) ¼ p and P (s2) ¼ 1 – p, the binary entropy function of the source is H ðuÞ ¼ p log 2 p ð1 pÞ log 2 ð1 pÞ One curve of H(u) is shown in Fig. 17.4, which takes the maximum value (1 bit) at p ¼ 1/2. Fig. 17.4 Binary entropy function
H 1.0 0.8 0.6 0.4 0.2 0
0 0.2 0.4 0.6 0.8 1.0
p
17.3
Entropy Coding
665
Entropy coding The lossless coding method that the average code length of the compressed data is tended to be the information entropy of the input data to reduce the coding redundancy. Fast and efficient lossless image compression system [FELICS] A fast and efficient lossless coding algorithm based on entropy coding. Designed for grayscale images, each pixel is variable length encoded according to the values (de-correlated) of two adjacent pixels that have appeared, using binary code and Golomb code. This algorithm is up to five times faster than the international standard JPEG in lossless mode and achieves the same compression ratio. Progressive fast and efficient lossless image compression system [PFELICS] A progressive version of FELICS that can be layered to encode images. The number of coded pixels per layer is doubled. To determine which pixels are included in a layer, it is conceptually possible to rotate the previous layer by 45 and multiply each dimension by √2.
17.3.2 Variable-Length Coding Variable-length coding A type of entropy coding. The statistical coding compression method is adopted, that is, the gray level with a large probability of occurrence is represented by a small number of bits, and the gray level with a small probability of occurrence is represented by a larger number of bits. Variable-length code A code book or code word obtained by a variable length coding method. A typical Huffman code as obtained by the Huffman coding method. Unary code An encoding method that generates a variable length code in one step. For a non-negative integer N, the unary code is N – 1 number 1 followed by a 0 or N – 1 number 0 followed by a 1. Therefore, the length of the unary code of the integer N is N. Prefix compression A variable length code that satisfies the nature of the prefix and is suitable for compressing square images. For a 2n 2n image, use a quadtree structure to assign a 2n-bit number to each pixel. Then select a prefix value P to find out the pixels with the same leftmost P bits (i.e., the same prefix) among all 2n digits. Compression of these pixels simply writes the prefix into the compressed stream, followed by all suffixes.
666
17
Image Coding
Prefix property One of the main features of variable length codes. Provision: Once a bitmap pattern is assigned to a symbol as a code word, the other code words can no longer be used with the bitmap pattern (i.e., the bitmap pattern cannot be used as a prefix for any other code word). For example, once string 1 is assigned to symbol a1 as a code word, the other code words cannot begin with 1 (i.e., they can all start with 0). Similarly, once 01 is assigned to a2 as a code word, the other code words cannot start with 01 (they can only start with 00) Huffman coding A lossless coding method for generating a code tree from the bottom up. It is one of the most commonly used techniques to eliminate coding redundancy. A set of variable length codes (the length of which is inversely proportional to the probability) is assigned to each source symbol according to the probability of occurrence of the input symbol. Huffman coding can give the shortest code word when encoding the source symbols one by one. According to the noiseless coding theorem, the Huffman coding method is optimal for a fixed order source. If the symbol probability is a negative power of 2, Huffman coding gives the best code word. A method of performing optimal variable length coding based on the relative probability of each value of a set of values (such as pixel values). If the relative probability of the data source changes, the code length can change dynamically. This is a technique commonly used in image compression. Suboptimal variable-length coding A modification of variable length coding. Such coding methods are exchanged for the simplicity of coding calculations by sacrificing some coding efficiency. Truncated Huffman coding A modification of Huffman coding. Belongs to suboptimal variable-length cod ing. Such coding reduces the amount of coding computation by sacrificing some coding efficiency. Specifically, Huffman coding is performed only on the symbols that are most likely to occur, and other symbols are represented by a prefix code in front of a suitable fixed length code. Shift Huffman coding A modification of Huffman coding. Belongs to suboptimal variable length cod ing. Such coding reduces the amount of coding computation by sacrificing some coding efficiency. Such encoding divides the total number of source symbols into equal-sized symbol blocks, Huffman encoding the first block, and using the same Huffman codes obtained from the first block for the other blocks, but adding special translation symbols in front to distinguish them. Shannon-Fano coding A lossless coding method based on block (group) codes. Each source symbol is first mapped into a set of fixed-order code symbols so that one symbol can be coded at a time. The code word is then constructed top-down, with 0 and 1 in the code word
17.3
Entropy Coding
667
being independent and appearing substantially equally. This method needs to know the probability of each source symbol generation first. The specific steps are as follows: 1. Arranging source symbols according to their probability of occurrence in descending. 2. Dividing the source symbol whose code word is not currently determined into two parts, so that the sum of the probabilities of the two parts of the source symbols is as close as possible. 3. Assign values to the two parts of the source symbol combination (can be assigned 0 and 1, respectively, or 1 and 0, respectively). 4. If both parts have only one source symbol, the encoding ends; otherwise, return (2) to continue. Arithmetic coding A lossless coding method that continuously encodes in a recursive form starting from the entire source symbol sequence. In arithmetic coding, each code word is assigned to the entire source symbol sequence (i.e., not one symbol at a time), and the code word itself determines a real interval between 0 and 1. As the number of symbols in the symbol sequence increases, the interval used to represent the sequence of symbols decreases, and the number of information units (e.g., bits) required to express the interval becomes larger. The symbol in the symbol sequence determines the length of the corresponding interval according to the probability of occurrence, the reserved interval with large probability is large, and the reserved interval with small probability is small. Arithmetic decoding The decoding process of the arithmetic coding result. This can be done by means of an encoding process. The input to the decoder is the probability of each symbol of the source and a real number (encoded code word) representing the entire sequence of symbols. When decoding, the source symbols are first sorted according to the probability of occurrence, and then the source symbols are selected according to the given code words for arithmetic coding, until all symbols are coded (the interval where the code words are located are determined), and finally the symbols used for the encoding give out the decoded sequence. It should be noted that in order to determine the end of decoding, a special end symbol (also called a terminating symbol) needs to be added after each original symbol sequence, and decoding can be stopped when the symbol appears. In addition to this method, if the length of the original symbol sequence is known, it can be stopped when decoding to obtain a symbol sequence of equal length; or the number of decimal places of the result obtained by encoding the symbol sequence can be considered, and decoding is stopped when the decoding obtains the same precision. Golomb code A simple variable length coding method. Considering the correlation between pixels, the difference between the gray values of adjacent pixels in the image will
668
17
Table 17.2 Several Golomb code examples
n 0 1 2 3 4 5 6 7 8 9
G1(n) 0 10 110 1110 11110 111110 1111110 11111110 111111110 1111111110
Image Coding
G2(n) 00 01 100 101 1100 1101 11100 11101 111100 111101
G4(n) 000 001 010 011 1000 1001 1010 1011 11000 11001
show a characteristics there are many small values and few large values, which is more suitable for Golomb coding. If the probability distribution of each symbol in the non-negative integer input is exponentially decreasing, then according to the noiseless coding theorem, the Golomb coding method can achieve the optimal coding. Let the up-rounding function dxe represent the smallest integer greater than or equal to x, and the down-rounding function bxc represents the largest integer less than or equal to x. Given a non-negative integer n and a positive integer divisor m, the Golomb code of n relative to m is denoted as Gm(n), which is the combination of expressions of a binary code for the quotient bn/mc and a binary value for the remainder n mod m. Gm(n) can be calculated according to the following three steps: 1. Construct a unary code for the quotient bn/mc (the unary code of the integer I is represented by I times 1 followed by one 0). 2. Let k ¼ dlog2me, c ¼ 2k – m, r ¼ n mod m, calculate the truncated r0 :
0
r ¼
Truncat r to ðk 1Þbits 0 r < c Truncat ðr þ cÞ to k bits Otherwise
3. Splicing the results of the last two steps (before and after) to get Gm(n). Table 17.2 shows the first ten non-negative integers G1, G2, and G4.
17.4
Predictive Coding
Predictive coding uses the correlation between pixels to directly encode in the image space, which can eliminate or reduce the redundancy between pixels in the image (Gonzalez and Woods 2008; Zhang 2009).
17.4
Predictive Coding
669
17.4.1 Lossless and Lossy Predictive coding An image coding method in the spatial domain. It can be lossless coding or lossy coding. The basic idea is to eliminate inter-pixel redundancy by extracting only new information (relative to previous pixels) in each pixel and encoding them when sequentially scanning pixels. Here new information for a pixel is defined as the difference between the current or actual value of the pixel and the predicted value. Note that this is because of the correlation between the pixels, so the prediction is made possible. Predictive compression method A class of algorithms that use redundant information for image compression. In most cases, the estimation (prediction) of one pixel is constructed from the neighborhood of this pixel by means of correlation. Lossless predictive coding A predictive coding method applied to a lossless predictive coding system that only predicts the input. All the input information is retained at the output, so the original image (encoded image) can be completely recovered from the encoded result. Lossless predictive coding system Based on the lossless predictive coding method, the encoding system can produce the decompressed output image that is identical to the input image. The system is mainly composed of an coder and a decoder. The flow diagram is shown in Fig. 17.5. In Fig. 17.5, the upper half is the encoder and the lower half is the decoder, each with an identical predictor. When the pixel sequence fn (n ¼ 1, 2, . . .) of the input image enters the encoder one by one, the predictor generates a predicted (estimated) value of the current input pixel based on a number of past inputs. The output of the predictor is rounded to an integer f*n and used to calculate the prediction error: en ¼ f n f n
Input image
fn
+
∑
Compressed image
Nearest integer Symbol decoder
en
+
∑ fn* +
Fig. 17.5 Lossless predictive coding system
Symbol encoder
Compressed image
fn*
– Predictor
en
fn Decompressed image Predictor
670
17
Image Coding
The symbol encoder encodes this error by means of a variable length code to produce the next element of the compressed data stream. Then, the decoder reconstructs en based on the received variable length code word and performs the following operations to give a decompressed image: f n ¼ en þ f n In most cases, the first m pixels can be linearly combined to obtain a predicted value. f
n
¼ round
" m X
# ai f ni
i¼1
where m is the order of the linear predictor, round is the normal rounding function, and ai is the prediction coefficient. Lossy predictive coding A predictive coding method corresponding to a lossy predictive coding system. Not only the input is predicted, but also the quantization operation is introduced, so the entire information of the input cannot be retained at the output. After the image is encoded, the original image cannot be restored by decoding. Δ-modulation [DM] A simple lossy predictive coding method. Assuming that f_ n(n ¼ 1, 2, . . .) represents the pixel sequence of the input image, bf n represents the prediction sequence, en represents the prediction error sequence, and e_ n represents the quantized prediction error sequence, then the incrementally modulated predictor and quantizer are represented as bf n ¼ a f_ n1
þc for en > 0 e_ n ¼ c Otherwise where a is the prediction coefficient (generally less than or equal to 1) and c is a positive constant. Since the output of the quantizer can be represented by a single bit character (the output has only two values), the symbol encoder in the encoder can only use a code whose length is fixed to 1 bit. The bit rate obtained by the incremental modulation method is 1 bit/pixel. Granular noise One of the distortions in Δ-modulations, which is due to the quantized value being much larger than the smallest change in the input.
17.4
Predictive Coding
671
Slope overload A distortion produced in Δ-modulation due to the fact that the grayscale quantization value is much smaller than the maximum variation value in the input. At this time, the increment or deceleration of the quantized value (corresponding to slope) cannot keep up with the change of the input value, resulting in a large gap between the two values. Differential pulse-code modulation [DPCM] A commonly used lossy predictive coding method. Optimal criteria that minimize the mean squared prediction error are employed. It is assumed that fn (n ¼ 1, 2, . . .) represents a pixel sequence of the input image, bf n represents a prediction sequence, and en represents a prediction error sequence, and e_ nrepresents a quantized prediction error sequence. At some point n, if the quantization error can be neglected (_en en) and predicted with a linear combination of m previous pixels, the design problem of the optimal predictor is simplified to more intuitively select m prediction coefficients in order to minimize the following formula:
2
E en ¼ E
8" < :
fn
m X i¼1
#2 9 = ai f ni
;
The sum of the coefficients in the formula is generally less than or equal to 1, i.e., m X
ai 1
i¼1
This limitation is to make the output of the predictor fall within the allowable gray value range and reduce the effects of transmission noise. It can be seen as a technique for converting an analog signal into a binary signal by sampling. The value modulated by the sampled data is expressed as a binary value, so that the successive values do not change much, and the bit rate can be reduced. Lossy predictive coding system An encoding system based on a lossy predictive coding method. The decompressed image is different from the input image. Compared with the lossless predictive coding system, a quantizer is added to the encoder, as shown in Fig. 17.6, in which the upper half is the encoder and the lower half is the decoder. As can be seen from Fig. 17.6, the quantizer is inserted between the symbol encoder and the prediction error generator, and the integer rounding module in the original lossless prediction encoder is absorbed. When the pixel sequence fn (n ¼ 1, 2, . . .) of the input image enters the encoder one by one, the prediction error is obtained by calculating the difference between the pixel value and the predicted value f’n and is mapped into a finite number of outputs e’n, and the amount of compression and the amount of distortion in lossy predictive coding is determined by
672
17
Input image + fn
en
∑
–
Predictor
f 'n
fn* Compressed image
e'n
Quantizer
Symbol encoder
Image Coding
Compressed image
+
∑ +
e'n +
Symbol decoder
∑ f n* +
f 'n
Decompressed image
Predictor
Fig. 17.6 Lossy predictive coding system
e’n. In order to accommodate the quantization step, the predictions produced by the encoder and the decoder need to be equal, in which case the predictor of the lossy encoder is placed in a feedback loop. The input to this loop is a function of the past prediction and the quantization error corresponding to the prediction: f 0n ¼ e0n þ f n Such a closed loop structure prevents errors at the output of the decoder. Predictive error In the predictive coding, the difference between the predicted next input value and the real input value. Residual 1. The deviation between the expected value and the actual measured or calculated value. Such as predicting the deviation in the coding 2. In the regression problem, the difference between the observed response and the corresponding prediction
17.4.2 Predictor and Quantizer Predictor One of the important building blocks of predictive coding systems. Used to estimate the coded input to reduce the bit rate. Optimal predictor In lossy predictive coding, a predictor that is expected to be implemented without a quantization error. The optimal criterion used by the optimal predictor in most cases is to minimize the mean squared prediction error. If fn (n ¼ 1, 2, . . .) is used to represent the pixel sequence of the input image, and the predicted sequence is
17.4
Predictive Coding
673
Fig. 17.7 Quantization error
5 4 3 2 1
0
represented bybf n , and the prediction error sequence is represented by en, then the mean squared prediction error that the optimal predictor can minimize is E
e2n
h ¼E
f n bf n
i2
Quantizer One of the important building blocks of lossy predictive coding systems. The prediction error is mapped to a finite number of outputs to reduce the bit rate. Quantization error The amount of change in the value before and after discretization that occurs when the analog quantity is discretized into similar digital quantities. The quantization error is related to the number of quantization steps. The more the number of quantization steps, the smaller the amount of change in quantization, and the smaller the absolute error and relative error of quantization. Approximation errors resulting from quantifying a continuous variable, especially when using regular interval scale values. Figure 17.7 gives a continuous function (dashed line) and a quantized version (solid line) obtained using only few discrete values. The vertical distance between the two curves is the quantization error (also known as quantization noise). See sampling theorem and Nyquist sampling rate. Quantization function in predictive coding A mapping function used to quantize the prediction error. A typical quantization function is a step-like odd function, as shown by the function t ¼ q(s) in Fig. 17.8. This function can be completely described by L/2 si and ti in the first quadrant. The turning points of the curves in the figure determine the discontinuity of the function, and thus these turning points are treated as the discrimination and reconstruction levels of the corresponding quantizer. By convention, the s in the half-open interval (si, si + 1] is mapped to ti + 1. Minimum-variance quantizing A commonly used quantization strategy or technique. The range from the maximum value to the minimum value in the image is divided into a number of consecutive intervals such that the weighted sum of the variances of the quantization
Fig. 17.8 A typical quantization function t = q(s)
intervals is minimized. The weight typically chosen is the probability of each quantization level, which is estimated from the proportion of image values falling in each quantization interval.

Optimal quantizer
In a lossy predictive coding system, the quantizer designed by considering the quantizer alone, without taking the predictor into account. The key to quantizer design is determining the quantization function. If the input of the quantizer is s (the decision or discrimination level) and the output is t (the reconstruction level), with t = q(s), then the optimal quantizer is designed, given the optimization criterion and the input probability density function p(s), by selecting an optimal set of si and ti. If the least mean squared quantization error, i.e., E{(s − ti)²}, is used as the criterion and p(s) is an even function, then the minimum-error conditions are

∫ from s(i−1) to s(i) of (s − ti) p(s) ds = 0,  i = 1, 2, ..., L/2

si = 0 for i = 0;  si = (ti + ti+1)/2 for i = 1, 2, ..., L/2 − 1;  si = ∞ for i = L/2

s−i = −si,  t−i = −ti

The first equation indicates that each reconstruction level is the centroid (center of gravity) of the area under the p(s) curve over its decision interval; the second indicates that each decision value lies exactly midway between the two neighboring reconstruction values; and the third follows from q(s) being an odd function. For any L, the si and ti satisfying these three conditions are optimal in the
sense of mean squared error. The corresponding quantizer is called an L-level Lloyd-Max quantizer. See quantization function in predictive coding.

Optimal uniform quantizer
In a lossy predictive coding system, a quantizer with a fixed step size that is used together with variable-length codes in the symbol encoder. It can provide a lower code rate than a fixed-length-coded Lloyd-Max quantizer at the same output fidelity.
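The two optimality conditions can also be applied iteratively to empirical data instead of an analytic p(s). The following sketch (Python with NumPy) is a sample-based Lloyd-Max design; the Laplacian-distributed training samples, the number of levels L, and the iteration count are illustrative assumptions:

import numpy as np

def lloyd_max(samples, L=8, iters=100):
    # Alternate the two conditions: reconstruction level = centroid of its decision
    # interval; decision level = midpoint of neighboring reconstruction levels.
    s = np.sort(np.asarray(samples, dtype=float))
    t = np.linspace(s[0], s[-1], L + 2)[1:-1]      # initial reconstruction levels
    for _ in range(iters):
        d = (t[:-1] + t[1:]) / 2.0                 # decision levels (midpoints)
        idx = np.searchsorted(d, s)                # assign each sample to an interval
        for i in range(L):
            cell = s[idx == i]
            if cell.size:
                t[i] = cell.mean()                 # centroid (center-of-gravity) condition
    return t, (t[:-1] + t[1:]) / 2.0

rng = np.random.default_rng(0)
samples = rng.laplace(scale=10.0, size=20000)      # prediction errors are often Laplacian-like
recon, decision = lloyd_max(samples, L=8)
print(np.round(recon, 2))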
17.5 Transform Coding
Image transform coding first transforms the image and then encodes the transform result. Its main purpose is to eliminate inter-pixel redundancy and psycho-visual redundancy, and it is a lossy coding approach (Gonzalez and Woods 2008; Salomon 2000; Zhang 2009).

Transform coding
A lossy coding method that operates in a transform domain. A reversible linear transformation is generally used (an orthogonal transform is the typical case). Transform coding maps an image into a set of transform coefficients, which are then quantized and encoded.

Sub-image decomposition
The technique and process of decomposing an input image into a set of sub-images in transform coding. Sub-image size is an important factor affecting transform coding error and computational complexity. In a broader sense, it can also refer to techniques and processes for dividing an image into multiple constituent regions.

Bit allocation
The overall process of coefficient truncation, quantization, and encoding of sub-images in transform coding.

Zonal coding
A method of selecting the transform coefficients to be retained in a transform coding system using the maximum-variance criterion. A partition template (mask) is constructed over the sub-image: positions corresponding to the maximum-variance coefficients are set to 1, and all other positions are set to 0. According to information theory, the transform coefficients with the largest variance carry the most image information, so they should be retained. Multiplying the partition template with the sub-image transform preserves the coefficients with the largest variance.

Threshold coding
Selecting and retaining transform coefficients in transform coding by thresholding, using the maximum-magnitude criterion. For any sub-image, the transform coefficient with the largest magnitude contributes the most to the quality of the
Fig. 17.9 Typical transform coding system block diagram (encoder: sub-image decomposition, image transform, quantization, symbol coding; decoder: symbol decoding, image inverse transform, merge sub-images)
reconstructed sub-image. Since the position of the largest coefficient differs from sub-image to sub-image, the positions of the retained coefficients also differ from sub-image to sub-image; threshold coding is therefore adaptive in nature. Depending on the degree of adaptation, thresholding can be divided into three types:
1. Use a global threshold for all sub-images.
2. Use a different threshold (sub-image threshold) for each sub-image.
3. Select the threshold (coefficient threshold) according to the position of each coefficient within the sub-image.

Transform coding system
A system that implements transform coding and transform decoding. A typical block diagram is shown in Fig. 17.9, where the upper half is the encoder and the lower half is the decoder. The encoder is usually composed of four operational modules: (1) sub-image decomposition, (2) forward image transform, (3) quantization, and (4) symbol coding. First, an image of N × N pixels is decomposed into (N/n)² sub-images of size n × n. These sub-images are then transformed, giving (N/n)² transform arrays of size n × n. The purpose of the transform is to remove the correlation between the pixels inside each sub-image, i.e., to concentrate as much information as possible into as few transform coefficients as possible. The subsequent quantization step selectively eliminates or coarsely quantizes the coefficients that carry the least information, because they have the least impact on the quality of the reconstructed sub-image. The final step is symbol coding of the quantized coefficients, often using a variable-length code. The decoder is composed of the inverse operation modules arranged in the reverse order of the encoder; since quantization is irreversible, the decoder has no module corresponding to it.

Orthogonal transform codec system
A system that encodes and decodes an image by means of an orthogonal transform (Fourier transform, discrete cosine transform, etc.). Image data compression is obtained by transforming the image to reduce the correlation between pixels. See transform coding system for the specific steps and modules.

Wavelet transform codec system
A system for image coding and decoding by means of the wavelet transform. The basic idea of wavelet transform coding is to reduce the correlation between pixels by
Fig. 17.10 Block diagram of a typical wavelet transform coding system (encoder: wavelet transform, quantization, symbol coding; decoder: symbol decoding, wavelet inverse transform)
transform to obtain the effect of compressing the data. The flow of the wavelet transform codec system is therefore similar to that of the orthogonal transform codec system, except that the orthogonal transform and inverse transform modules are replaced by the wavelet transform and wavelet inverse transform modules. In addition, since the wavelet transform is computationally efficient and inherently local (its basis functions have finite support in the time domain), no image partitioning is required in the wavelet transform codec system. A block diagram of a typical wavelet transform coding system is shown in Fig. 17.10. Compared with a general transform coding system, the encoder has no "sub-image decomposition" module and the decoder has no "merge sub-image" module.

Wavelet transform coding system
1. A system for transform coding using the wavelet transform.
2. Often used for the wavelet transform codec system.

Wavelet scale quantization [WSQ]
An efficient lossy compression method developed specifically for compressing fingerprint images, used by the FBI as its fingerprint compression standard. Its typical compression ratio is about 20. It is implemented in three main steps:
1. Perform a discrete wavelet transform on the image.
2. Perform adaptive scalar quantization of the wavelet transform coefficients.
3. Perform a two-step Huffman coding of the quantization indices.
The decoding of wavelet scalar quantization is the inverse of the coding, so it is a symmetric coding method.
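As a concrete illustration of the block-transform pipeline and of threshold coding with a global threshold described above, here is a minimal sketch (Python with NumPy); the 8 × 8 sub-image size, the orthonormal DCT-II construction, the threshold value, and the smooth test image are illustrative assumptions:

import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II matrix C, so B = C @ block @ C.T is the 2-D DCT of a block.
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n)) * np.sqrt(2.0 / n)
    C[0, :] = np.sqrt(1.0 / n)
    return C

def threshold_transform_code(img, n=8, thresh=20.0):
    # Toy transform codec: per-sub-image DCT, zero out coefficients whose magnitude
    # falls below a global threshold, then inverse transform.
    C = dct_matrix(n)
    H, W = img.shape
    out = np.zeros_like(img, dtype=float)
    kept = 0
    for y in range(0, H - H % n, n):
        for x in range(0, W - W % n, n):
            block = img[y:y + n, x:x + n].astype(float)
            coeff = C @ block @ C.T                 # forward image transform
            coeff[np.abs(coeff) < thresh] = 0.0     # threshold coding (global threshold)
            kept += np.count_nonzero(coeff)
            out[y:y + n, x:x + n] = C.T @ coeff @ C # image inverse transform
    return out, kept

img = np.add.outer(np.arange(64.0), np.arange(64.0))   # smooth test image
rec, kept = threshold_transform_code(img)
print(kept, "of", img.size, "coefficients kept; RMSE:",
      np.sqrt(np.mean((img - rec) ** 2)))

On such a smooth image most of the energy concentrates in a few low-order coefficients, so few coefficients survive the threshold while the reconstruction error stays small.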
17.6 Bit Plane Coding
Bit plane coding first decomposes a multi-gray-value image into a series of binary images and then compresses each binary image (Gonzalez and Woods 2008). It can not only eliminate or reduce coding redundancy but also eliminate or reduce inter-pixel redundancy in the image (Zhang 2009).
Bit plane encoding
An image compression technique in which the image is decomposed into bit planes and each bit plane is then compressed, for example with run-length coding. The grayscale image is first decomposed into bit planes, and each bit plane is then compressed by a binary-image compression method. To obtain an individual bit plane of an 8-bit grayscale image, a Boolean AND operation is performed on the image with a mask whose bits select the considered plane. For example, ANDing the image with 00010000 extracts the fifth plane of the image. Statistical knowledge of the bit planes is often exploited when coding them. Bit plane coding eliminates or reduces both coding redundancy and inter-pixel redundancy.

Bit plane coding
Same as bit plane encoding.

Binary decomposition
A simple way to decompose a grayscale image into a series of binary images. The gray value of a pixel in an image with m-bit gray levels is represented by the polynomial

a(m−1) 2^(m−1) + a(m−2) 2^(m−2) + ... + a1 2^1 + a0 2^0

Binary decomposition assigns the m coefficients of this polynomial to m 1-bit planes, respectively. An inherent disadvantage of this decomposition is that small changes in a pixel's gray value may have a significant effect on the complexity of the bit planes. For example, when the gray values of two adjacent pixels are 127 (01111111 in binary) and 128 (10000000 in binary), every bit plane of the image will have a transition from 0 to 1 (or from 1 to 0) at this position.

Gray-code decomposition
A method of decomposing a grayscale image into a series of binary images. Again representing the gray value of a pixel in an image with m-bit gray levels by the polynomial a(m−1) 2^(m−1) + a(m−2) 2^(m−2) + ... + a1 2^1 + a0 2^0, the image is first represented by an m-bit Gray code. The m-bit Gray code corresponding to the polynomial above can be calculated by the following formula:
g(i) = a(i) ⊕ a(i+1),  0 ≤ i ≤ m − 2
g(m−1) = a(m−1)

where ⊕ represents the XOR operator. The result obtained by decomposing in this way is still a set of binary planes. However, the unique property of
this code is that adjacent code words differ by only one bit, so that small changes in the gray value of a pixel do not easily affect all the bit planes.

Gray code
1. The result of a bit plane decomposition method, and also a binary code word of an integer. Adjacent code words (consecutive integers) in a Gray-code decomposition differ in at most one bit.
2. A method used to determine the correspondence between functions of different orders. For example, letting n be the ordinal order of the Walsh function W, write n as a binary code and take its Gray code to determine the natural order of the Walsh function. Here, the Gray code of a binary code can be obtained by adding (modulo 2) each bit i to bit i + 1.

Cell coding
An image coding method that first divides the entire bitmap into regular units (such as 8 × 8 blocks) and then scans the units one by one. The first unit is placed in entry 0 of a table and encoded (i.e., written into the compressed file) as pointer 0; each succeeding unit is then searched for in the table. If a match is found, its index in the table is written to the compressed file as its code; otherwise the unit is added to the table.

Constant area coding [CAC]
A bit plane encoding method. The bitmap is decomposed into m × n blocks of three types: all black, all white, or mixed. The class with the highest frequency of occurrence is given the 1-bit code word 0, and the other two classes are given the 2-bit code words 10 and 11, respectively. Since a constant block originally represented by mn bits is now represented by only a 1-bit or 2-bit code word, compression is achieved.

White block skipping [WBS]
In image compression, a special constant area coding method. When the image to be compressed consists mainly of white areas (such as a document), each white block is coded as 0, and every other block (including solid black blocks) is coded as 1 followed by the bit pattern of the region. In other words, encoding is required only for non-white regions (white blocks are skipped).

Growth geometry coding
A progressive lossless compression method for binary images. Some seed pixels are selected, and each seed pixel is grown into a pixel pattern using geometric rules.

Run-length coding
A bit plane coding method. There are many ways to encode runs; commonly used are 1-D run-length coding and 2-D run-length coding. For each row of the bitmap, the former encodes consecutive runs of 0s or 1s by their lengths. The latter is a generalization of the former that can encode 2-D image blocks. More generally, this is a lossless compression technique used to reduce the size of the representation of repeated symbol strings (called runs); it can also be used to compress images. The algorithm encodes the symbols of a run into two bytes, one for the count and one for the
symbol. For example, the 8-byte string "xxxxxxxx" can be represented as "8x," requiring only two bytes. This technique can compress any type of content, but the content itself clearly affects the compression ratio. Compared with other methods, the compression ratio of run-length coding is not very high, but it is easy to implement and runs very fast. Run-length coding is used in many bitmap file formats, such as BMP, PCX (PC Paintbrush Exchange), and TIFF. See image compression, video compression, and JPEG.

Run-length compression
See run-length coding.

1-D run-length coding [1-D RLC]
A bit plane encoding method in which, for each row of the bitmap, consecutive runs of 0s or 1s are used to simplify the representation.

2-D run-length coding [2-D RLC]
A bit plane coding method, often derived from 1-D run-length coding, that can directly encode 2-D image blocks.

Gray-level run-length coding
An encoding method obtained by extending run-length coding from binary images to grayscale images. In practice, the number of gray levels in the image is first reduced, and traditional run-length coding is then applied. It can be used for lossy image compression.

Gray-level run-length
A feature that describes the spatial relationship of primitives. The largest collinear, connected set of pixels with the same gray level in the image is taken as a run. The length or orientation of this run can be used as a feature to describe it.

Differential coding
Same as relative encoding.

Differential image compression
A lossless image compression method. Each pixel is compared with a reference pixel chosen from among its neighbors, and two parts are then encoded: the prefix, which is the number of most significant bits that the coded pixel and the reference pixel have in common, and the suffix, which is the remaining least significant bits of the coded pixel.

Relative encoding
A variant of run-length coding, also called differential coding. It is applicable when the data to be compressed consist of a series of close numbers or similar symbols. The principle is to send the first data item a1 and then send the differences ai+1 − ai.
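The following minimal sketch (Python with NumPy) ties together Gray-code decomposition and 1-D run-length coding as described above; the tiny test image and the (value, run-length) output format are illustrative assumptions:

import numpy as np

def gray_code_bit_planes(img, m=8):
    # Decompose an m-bit grayscale image into Gray-code bit planes:
    # g_i = a_i XOR a_(i+1) for 0 <= i <= m-2, and g_(m-1) = a_(m-1).
    a = img.astype(np.uint8)
    g = a ^ (a >> 1)                          # per-pixel binary-to-Gray conversion
    return [((g >> i) & 1).astype(np.uint8) for i in range(m)]

def rle_1d(row):
    # 1-D run-length coding of one binary row as a list of (value, run length) pairs.
    runs, count, prev = [], 1, row[0]
    for v in row[1:]:
        if v == prev:
            count += 1
        else:
            runs.append((int(prev), count))
            prev, count = v, 1
    runs.append((int(prev), count))
    return runs

img = np.tile(np.arange(0, 256, 32, dtype=np.uint8), (4, 1))  # small ramp image
planes = gray_code_bit_planes(img)
print(rle_1d(planes[7][0]))     # most significant Gray-code plane: few, long runs

Because adjacent Gray code words differ in only one bit, the higher bit planes contain long uniform runs that run-length coding represents compactly.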
Relative address coding [RAC]
A 2-D bit plane coding method that is a generalization of 1-D run-length coding. The start and end transition points of each black run and white run in two adjacent rows are first tracked, and the shorter run distance is determined; this run distance is then encoded with an appropriate variable-length code.
17.7 Hierarchical Coding
Hierarchical coding is not so much a coding method as a coding framework. A hierarchical coding system allows different source models and different levels of understanding of the scene to be used for coding, in order to obtain the best performance (Zhang 2017a).

Hierarchical coding
An image coding framework that encodes images at multiple layers. The basic idea is to divide the encoding of the image into multiple layers, allowing different source models and different levels of scene understanding to be used at different layers so as to obtain the best performance. Coding generally starts with a higher layer containing less detail and gradually increases the resolution. A typical block diagram is shown in Fig. 17.11. Layer 1 transmits only color parameters that are statistically pixel-dependent, while layer 2 also allows the transmission of block motion parameters with fixed size and position. Thus, the first layer corresponds to the I-frame encoder of an image encoder or a hybrid encoder, and the second layer corresponds to the P-frame encoder of a hybrid encoder. Layer 3 is an analytical synthesis encoder that allows the transfer of target shape parameters. Layer 4 transmits knowledge of the target model and scene content. Layer 5 transmits abstract high-level symbols that describe the behavior of the target. See hierarchical image compression.

Hierarchical image compression
Image compression using hierarchical coding techniques. It makes progressive image transmission possible.
Fig. 17.11 Hierarchical coding system block diagram (layers 1–5: transform encoder, hybrid encoder, analytical synthesis encoder, knowledge-based encoder, and semantic encoder, feeding a layer selector that outputs the coded image)
Hierarchical progressive image compression
In image compression, the encoder stratifies the image encoding by resolution. When decoding, the layer with the lowest resolution is decoded first, giving a rough image; the layers with higher resolution are then decoded in turn, up to the layer with the highest resolution. In the compressed stream thus obtained, each layer makes use of the data in the previous layer.

Model-based coding
A method of encoding the content of an image (or video sequence) using a pre-defined or learned set of models. It can produce a more compact representation (see model-based compression) or a symbolic description of the image data. For example, a Mondrian image can be encoded using the position, size, and color of its colored rectangular regions. It also refers to methods that establish a model of the corresponding target and perform image coding according to that model; the imaging scene must first be analyzed to obtain a priori knowledge about it.

Model-based compression
An application of model-based coding that reduces the amount of memory required to describe an image while still allowing reconstruction of the original image.

Object-based coding
A higher (semantic) layer of image coding methods. The objects in the image are first extracted, each object is then represented by feature parameters, and finally the representation parameters are encoded to achieve image compression.

Object-based
In object-based coding, describing coding that is performed on the objects extracted from the image (i.e., on the objects in the scene).

Object-oriented
Same as object-based.

Semantic-based coding
A method of extracting image semantics (including the motives, intentions, etc. of the scenes described by the image), obtaining a semantic interpretation model, and then encoding according to the model. See model-based coding.

Semantic-based
The highest level of model-based coding. In practice, it is necessary to acquire knowledge of the spatial distribution and behavior of the objects by means of complex learning and reasoning, then to acquire the semantics of the scene and encode according to those semantics.

Knowledge-based coding
In a broad sense, often equivalent to model-based coding. In a narrow sense, one class of model-based coding.
Knowledge-based
The middle level of model-based coding. It uses prior knowledge of the scene to identify the type of object and to construct a rigorous 3-D model of the known scene.

Smart coding
Generally refers to an image coding method that takes into account the perceptual characteristics of the human visual system. The spatial resolution of the human eye decreases rapidly with distance from the line of sight, so the eye has high spatial resolution only in a small area around the point of interest. Accordingly, it is not necessary to display a high-resolution image in areas on which the eye is not focused. The difficulty is identifying the areas of the image that the observer may be interested in; one solution is to track the focus of the eye and feed back its location. In progressive image transmission, if the focus of the eye can be determined, high-resolution decoding can be performed around the point of interest to improve the transmission speed as perceived by the viewer.
17.8 Other Coding Methods
There are many encoding methods: some rely on new mathematical tools, some originate from generalizations of signal compression technology, and so on (Salomon 2000).

Fractal coding
A method of encoding an image by means of fractals. The basic idea is to divide the image into subsets (fractals) and exploit their self-similarity. Fractal coding is a lossy coding method that often provides high compression ratios and can reconstruct images with high quality. Because fractals can be magnified indefinitely, fractal coding is independent of resolution, and a single compressed image can be displayed on a display device of any resolution (even on devices with higher resolution than the original image). Fractal coding requires a large amount of computation when compressing an image, but decompression is simple and fast, so it is a typical compression–decompression asymmetric method.

Fractal image compression
A method and process for image compression using self-similarity at different scales.

Iterated function system [IFS]
A system consisting of functions that are repeatedly composed with themselves. It consists of a complete metric space and a set of contractive maps defined on it, and it can effectively describe irregular natural objects. It is usually constructed by means of affine transformations; the basic idea is to use some self-similarity of
geometric objects in the sense of affine transformations. It can be used for fractal simulation of natural landscapes and for fractal coding of images.

Partitioned iterated function system [PIFS]
A fractal-based image coding method in which image compression is achieved using local similarities within the image. On the one hand, the image is divided into a set of smaller, non-overlapping square areas, called range blocks; on the other hand, the image is divided into a group of larger, overlapping square areas, called domain blocks. The encoding process finds, for each range block, a domain block that can be matched to it and makes the domain block approximate the range block via a contractive affine transformation. The set of contractive affine transformations found in this way constitutes the partitioned iterated function system. As long as the space required to store the partitioned iterated function system is less than the space required to store the image directly, compression of the image is achieved.

Vector quantization [VQ]
The process of mapping a vector with many components to a vector with fewer components. In image coding, vector quantization can be used as an independent lossy image coding method or as the quantization module in an encoding system. Its theoretical basis is the rate-distortion theorem. The basic idea is to group the source symbol sequence and encode each group as a vector; when there are many symbols per group, or the vector dimension is high, the rate-distortion function of the grouped source approaches the rate-distortion function of the source. First, N data points are clustered into K clusters, and the center Ck of each cluster is calculated (each center is a code-book vector, and together they form the code book). For each data point, only the label k of the cluster to which it belongs is stored, so that each data point is approximated by the center of its cluster. As long as K << N, compression is achieved. Vector quantization can thus be seen as a method of compactly representing a set of vectors by associating each possible vector with one of a small set of "code book" vectors. For example, each pixel in an RGB color image has 256³ possible values, but typically only a small subset of these values is used in a particular image. If a color map with 256 elements is computed and each RGB value is represented by the nearest RGB vector in the color map, the RGB space is quantized to 256 elements. Clustering of the data is often used to identify the best subset of representative elements.

Code-book vector
A vector used as a cluster center in vector quantization.

Linde-Buzo-Gray [LBG] algorithm
A code book design method in image coding, commonly used in vector quantization-based coding methods. The algorithm first constructs an initial code book and then assigns M typical image training vectors to N different subspaces (M >> N; generally M is several
Fig. 17.12 Vector quantization codec block diagram (encoder: coding against a code book, transmitting/storing labels; decoder: decoding the labels with the same code book)
tens of times N), with one subspace corresponding to each code vector. Next, a new set of code vectors is determined from the training vectors assigned to each subspace: the newly estimated code vector is the centroid of the subspace in which it lies. This is iterated until the rate of reduction of the average distortion over the training set falls below a preset threshold, at which point the optimally designed code book is obtained.

Coding-decoding flowchart of vector quantization
The process of encoding and decoding an image using a vector quantization method, as shown in Fig. 17.12. At the encoding end, the image encoder first divides the original image into small blocks (e.g., blocks of K = n × n pixels, similar to the decomposition used in coding with orthogonal transforms), and these small blocks are represented as vectors and quantized. Next, a code book (i.e., a vector list, which can be used as a lookup table) is constructed, and each quantized image block vector is encoded by the unique code word found by searching the code book. After encoding, the label of the code word vector (together with the code book) is transmitted or stored. At the decoding end, the image decoder obtains the code vector from the code book based on the label of the code word, reconstructs the image block using that code vector, and assembles the decoded image.

Pyramid vector quantization
A fast way to implement vector quantization, based on a set of lattice points that fall on a hyper-pyramid.

Subband coding
A way of encoding discrete signals for transmission. The signal is passed through a set of band-pass filters, and each channel is quantized separately. The sampling rates of the channels must be such that the sum of the numbers of samples over all channels before quantization equals the number of samples in the original system. By varying the quantization of the different bands, the number of samples can be reduced with only a small degradation in signal quality. As an image coding method it is performed in the frequency domain: each subband of the image is encoded separately, and the results are recombined to reconstruct the original image without distortion. Because each subband has a smaller bandwidth than the original image, the subbands can be subsampled without losing information. To reconstruct the original image, each subband is interpolated, filtered, and then summed with the others. The main benefits of encoding an image into subbands are:
1. The image energy and statistical characteristics differ from subband to subband, so the subbands can be coded separately with different variable-length codes, or even with different coding methods, to improve coding efficiency.
2. The frequency decomposition reduces or eliminates the correlation between different frequencies, which helps reduce inter-pixel redundancy.
3. After the image is decomposed into subbands, quantization and other operations can be performed separately in each subband, avoiding mutual interference and the spread of noise.

Subband
In frequency-domain coding, one of a series of bandwidth-limited components (or a set of them) obtained by decomposing an image with band-pass filtering.

Low- and high-frequency coding
A method of decomposing an image into low-frequency and high-frequency portions for encoding. There are many ways to represent the low-frequency part with a small amount of data; the high-frequency part consists mainly of edges, which can be represented effectively by a corresponding method.

Dictionary-based compression
Same as dictionary-based coding.

Dictionary-based coding
A class of lossless coding techniques that store image data in a "dictionary" data structure. LZW coding is a typical method.

LZW compression
Compression technology using LZW coding and LZW decoding.

LZW
Abbreviation for Lempel-Ziv-Welch.

LZW coding
An information-preserving coding method; it is also a dictionary-based compression method, named after the initials of its three inventors (Lempel-Ziv-Welch). A fixed-length code word is assigned to each variable-length sequence of symbols output by the source, and no knowledge of the symbol occurrence probabilities is required. It is the standard file compression for UNIX operating systems and is also used in the Graphics Interchange Format (GIF), Tagged Image File Format (TIFF), and Portable Document Format (PDF).

LZW decoding
The decoding technique corresponding to LZW coding, one of the information-preserving coding methods. As in the LZW coding process, a code book (dictionary) is created during decoding; the encoder and decoder are thus kept synchronized, i.e., there is no need to transmit the dictionary created by the encoder to the decoder.
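As a concrete illustration of dictionary-based (LZW) coding and of the encoder-decoder dictionary synchronization mentioned in the LZW decoding entry, here is a minimal sketch in Python; the byte-oriented dictionary and the unbounded code width are simplifying assumptions (real formats such as GIF and TIFF limit and reset the code width):

def lzw_encode(data):
    # Dictionary initially holds all single bytes; longer strings are added as seen.
    table = {bytes([i]): i for i in range(256)}
    next_code, w, out = 256, b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            out.append(table[w])          # emit code for the longest known prefix
            table[wc] = next_code         # add the new string to the dictionary
            next_code += 1
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

def lzw_decode(codes):
    # The decoder rebuilds the same dictionary on the fly, so it never has to be sent.
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    w = table[codes[0]]
    out = [w]
    for c in codes[1:]:
        if c in table:
            entry = table[c]
        elif c == next_code:              # the only code that can be one step ahead
            entry = w + w[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[next_code] = w + entry[:1]
        next_code += 1
        w = entry
    return b"".join(out)

text = b"ABABABABABABAB"
codes = lzw_encode(text)
print(codes, lzw_decode(codes) == text)   # decoded output matches the input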
Fig. 17.13 CALIC basic algorithm flow diagram (context extraction and quantization, prediction, error correction, uniform quantization of the residual, and entropy coding to the output stream)
Context-based adaptive lossless image coding [CALIC]
A typical lossless/near-lossless compression method, also known as context-based adaptive predictive entropy coding. The block diagram of the basic algorithm is shown in Fig. 17.13. Each pixel in the image is processed sequentially in raster-scan order. When the current pixel I(x, y) is processed, a priori knowledge, i.e., various contexts, is acquired from the previously processed and stored pixels. These contexts are divided into the prediction context (dh in the horizontal direction and dv in the vertical direction), the error-correction context w, and the entropy-coding context s. The prediction algorithm for the current pixel is determined according to the prediction context, giving a preliminary prediction value İ(x, y); a correction value c(ê|w) is then obtained from the error-correction context and used to correct the preliminary prediction İ(x, y), giving the new prediction Î(x, y) = İ(x, y) + c(ê|w). The residual e = I(x, y) − Î(x, y) is then quantized by the quantizer and, driven by the entropy-coding context s, assigned to the corresponding entropy encoder for output. At the same time, the various contexts are updated via feedback, based on the quantized residual ê and the pixel reconstruction value Ĩ(x, y) = Î(x, y) + ê. Since the decoder can form the same prediction Î, the reconstruction error is completely determined by the quantization error, i.e., I − Ĩ = e − ê. Therefore, for a given maximum per-pixel absolute error δ, if the quantizer is designed to satisfy ‖e − ê‖∞ ≤ δ, then ‖I − Ĩ‖∞ ≤ δ is guaranteed.
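The near-lossless guarantee at the end of the CALIC entry can be illustrated with a generic uniform residual quantizer; this is a sketch of the general principle only (Python with NumPy), not of CALIC's actual quantizer, and the synthetic integer residuals and bound delta are assumptions:

import numpy as np

def quantize_residual(e, delta):
    # Uniform quantizer with step 2*delta + 1: for integer residuals it guarantees
    # |e - e_hat| <= delta, and hence a per-pixel reconstruction error of at most delta.
    step = 2 * delta + 1
    q = np.floor((e + delta) / step).astype(int)   # quantizer index (entropy coded)
    e_hat = q * step                               # de-quantized residual
    return q, e_hat

rng = np.random.default_rng(0)
e = rng.integers(-50, 51, size=10000)              # synthetic prediction residuals
q, e_hat = quantize_residual(e, delta=2)
print(np.max(np.abs(e - e_hat)))                   # never exceeds delta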
Chapter 18
Image Watermarking
Watermarking an image is an image processing technique. Image watermarking technology can protect image content from being tampered with or forged (Barni et al. 2003a, b). For an original input image, the processed result is an output image with a watermark embedded (Cox et al. 2002; Shih 2013).
18.1 Watermarking
A digital watermark is a digital mark that is secretly embedded in digital products (such as digital video, digital audio, digital photos, electronic publications, etc.) to help identify the owner, usage rights, content integrity, etc. (Cox et al. 2002; Shih 2013; Zhang 2017a).
18.1.1 Watermarking Overview

Watermark
A generic term that refers to embedded information, a reference pattern, a message pattern, or an added pattern. Historically, it referred to a specific pattern embedded in paper to prove its owner's identity. In image engineering, a digital mark or image watermark embedded in an image. See digital watermarking.

Watermark detection
The process of detecting a digital watermark embedded in an image and verifying its authenticity. See watermark embedding.
Bit error rate [BER]
The frequency at which errors occur when detecting multi-bit information. In image watermarking, the ratio of erroneous bits among the detected watermark bits.

Watermark pattern
Can refer to a reference pattern, a message pattern, or an added pattern, depending on the context.

Reference pattern
A pattern in the media space against which works are compared during watermark detection.

Added pattern
In image watermarking, the difference between the watermarked version of a work and the original work (i.e., the pattern added to the work to embed the watermark).

Message pattern
In image watermarking, a watermark pattern used to encode information; it is a vector in the media space (i.e., a data object of the same type and dimension as the carrier).

Message
The information content of the watermark transmitted or embedded in the channel.

Direct message codes
A method of encoding a watermark message. A separate reference mark is predefined for each message, and these reference marks are often generated pseudo-randomly. Compare multi-symbol message coding.

Multi-symbol message coding
A method of representing information data as a sequence of symbols and then modulating that sequence of symbols. Compare direct message codes.

Watermark vector
Can refer to a reference vector, a message vector, or an added vector, depending on the context.

Reference vector
Same as reference mark.

Reference mark
During watermark detection, the vector in marking space against which works are compared; also referred to as the reference vector.

Message vector
Same as message mark.

Message mark
A vector in marking space that encodes specific information.
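A minimal sketch of direct message codes and bit error rate follows (Python with NumPy): one pseudo-random reference pattern is predefined per message, and detection picks the pattern that correlates best. The pattern values, embedding strength, informed subtraction of the cover, and 2-bit interpretation of the message index are illustrative assumptions:

import numpy as np

def make_reference_patterns(n_messages, shape, seed=42):
    # Direct message coding: one pseudo-random reference pattern per message.
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(n_messages,) + shape)

def detect_message(extracted, patterns):
    # Pick the message whose reference pattern correlates best with the extracted pattern.
    scores = [float(np.mean(extracted * p)) for p in patterns]
    return int(np.argmax(scores))

def bit_error_rate(sent_bits, detected_bits):
    sent, det = np.asarray(sent_bits), np.asarray(detected_bits)
    return float(np.mean(sent != det))

patterns = make_reference_patterns(4, (64, 64))
cover = np.random.default_rng(0).normal(128, 30, size=(64, 64))
marked = cover + 3.0 * patterns[2]                 # embed message index 2 additively
msg = detect_message(marked - cover, patterns)     # informed detection: subtract the cover
print("decoded message:", msg)
print("BER:", bit_error_rate([1, 0], [(msg >> 1) & 1, msg & 1]))  # index 2 = bits 10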
Added vector
In image watermarking technology, the difference between the marking-space vector extracted from the watermarked work and the marking-space vector extracted from the original work. This is analogous to an added pattern, which is the corresponding vector in the media space.

Added mark
Same as added vector.

Watermark key
A single key or key pair used to embed and detect a watermark. The watermark key can be used in conjunction with a cryptographic key.
18.1.2 Watermarking Embedding

Marking space
The space in which watermark embedding and watermark detection can be performed. It can be a subspace of the media space or a transformed version of the media space.

Embedder
The algorithm or mechanism for inserting a watermark into a work. The embedder has at least two inputs: the carrier work into which the information is to be embedded as a watermark, and the mark to be embedded; its output is the work containing the watermark. An embedding key can also be used by the embedder.

Watermark detector
A hardware device or application that detects and decodes a watermark.

Extraction function
A function that maps a (content) vector of the media space to a marking-space vector.

Extracted mark
The vector obtained by applying the extraction function to a work.

Watermark decoder
The module in the watermark detector that maps the extracted mark to information. In most cases this is also the main operation of the watermark detector.

Random-watermarking false-positive probability
The probability of detecting a randomly selected watermark in a given work.

Random-work false-positive probability
The probability of detecting a given watermark in a randomly selected work.
Fig. 18.1 Watermark embedding and detection flow chart (the digital watermark is embedded into the original image to produce the watermark image; a correlation test then yields the watermark and a confidence value)
Watermark embedding
The process of adding a digital watermark to an original image by embedding. Figure 18.1 shows a flow chart of the watermark embedding and detection process. The watermarked image is obtained by embedding the watermark into the original image. There are many ways to embed a watermark into an image; currently the commonly used methods operate mainly in a transform domain, such as watermarking in the DCT domain and watermarking in the DWT domain. A correlation check is performed on an image suspected of containing the watermark to determine whether the watermark is embedded in it, together with the confidence of that judgment.

Watermark embedding strength
Corresponds to the amount of data used to represent the watermark information embedded in the image. A large watermark embedding strength can improve the watermark robustness.

Intensity of embedded watermark
An indicator that measures the amount of data carried by an embedded watermark. It is related to, but not equivalent to, the amount of embedded information.

Signal-to-watermark ratio
The power or amplitude of a given cover work divided by the power or amplitude of the watermark pattern superimposed on it.

Watermark-to-noise ratio
The ratio of the power or amplitude of a given watermark pattern to the power or amplitude of the carrier work on which it is superimposed. The denominator of this ratio may also include noise introduced by subsequent processing.

Information of embedded watermark
An indicator that measures the amount of information carried by an embedded image watermark. It directly affects the watermark robustness.

Payload
The new data added on top of the original data. For example, in image watermarking technology, the watermarked image includes both original image
data and embedded watermark data; the embedded watermark data is the payload.

Data payload
The amount of protected content carried by the cover work, for example, the number of bits encoded by a watermark per time unit or per work.
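The embedding-plus-correlation-test flow of Fig. 18.1 can be sketched as follows (Python with NumPy). For brevity this additive spread-spectrum example works directly on pixel values rather than on DCT or DWT coefficients; the key-seeded ±1 pattern, the strength alpha, and the detection threshold are illustrative assumptions:

import numpy as np

def embed_watermark(image, key, alpha=2.0):
    # Additive embedding: g = f + alpha * w, with w a key-seeded pseudo-random +/-1 pattern.
    w = np.random.default_rng(key).choice([-1.0, 1.0], size=image.shape)
    return image + alpha * w

def detect_watermark(image, key, threshold=3.0):
    # Blind correlation test: correlate the image with the key's pattern and report
    # a detection decision plus a confidence value.
    w = np.random.default_rng(key).choice([-1.0, 1.0], size=image.shape)
    r = image - image.mean()                        # crude removal of the cover's mean
    z = np.sum(r * w) / np.sqrt(np.sum(r * r))      # roughly N(0,1) when no watermark is present
    return z > threshold, z

cover = np.random.default_rng(7).normal(128, 25, size=(128, 128))
marked = embed_watermark(cover, key=1234, alpha=2.0)
print(detect_watermark(marked, key=1234))           # (True, large confidence value)
print(detect_watermark(cover, key=1234))            # (False, small confidence value)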
18.1.3 Watermarking Property

Property of watermark
The characteristics of an image watermark itself and of its embedding and detection. Image watermarks should have certain characteristics depending on the purpose of use.

Security of watermark
An important watermarking property. It mainly represents the ability of a digital watermark to resist being copied, falsified, or forged, and to resist being illegally detected, decoded, or easily eliminated.

Digital fingerprints
A digital feature of a digital product that distinguishes it from other similar products. In a digital fingerprint, the information embedded in the image is not (or not only) the copyright owner's information, but also contains information about the user who has the right to use the product. A digital watermark is a typical digital fingerprint, which generally contains basic information such as a copyright holder's mark or a user code confirming that the user legally owns the related product; this information has a correspondence with the owner or user. Digital fingerprints are unique in that the fingerprint embedded in each user's copy of the work is different. The purpose of digital fingerprinting is not only to prove who the copyright owner of the work is, but also to prove who the user of the work is, thus preventing the image product from being illegally copied, or tracking authorized users who illegally distribute the data.

Complexity of watermark
An important watermarking property. It mainly represents the computational complexity of digital watermark embedding and extraction.

Robustness of watermark
An important watermarking property, also known as reliability. It reflects the ability of an image watermark, in the presence of external interference and distortion, to preserve its own integrity and still be accurately detected. There are many kinds of external disturbance to an image; from the perspective of watermark robustness they are often divided into two categories. The first category is conventional image processing operations (not specifically aimed at watermarks), and the second category is malicious operations (specifically aimed at attacking watermarks). The ability of image watermarking to withstand these
external disturbances is also commonly measured in terms of its resistance to attack. The robustness of the watermark is related to the amount of embedded information and the embedding strength (corresponding to the amount of embedded data).

Robustness
1. One of the commonly used technical evaluation criteria. It refers to the stability of the accuracy, or the reliability and consistency of the performance, of an algorithm under different parameters. Robustness can be measured relative to noise, density, geometric differences, percentages of dissimilar regions in the image, and the like. The robustness of an algorithm can be assessed by determining the stability of its accuracy or reliability with respect to the input parameters (e.g., using their variance: the smaller the variance, the more robust). If there are many input parameters, each affecting the accuracy or reliability of the algorithm, the robustness can be defined separately with respect to each parameter. For example, an algorithm may be robust to noise but not robust to geometric distortion. To say that an algorithm is robust generally means that its performance does not change significantly with changes in its environment or parameters. It is one of the commonly used image matching evaluation criteria.
2. The ability of a watermark to survive various image processing operations (generally referring to good-faith processing of the work). Compare security.
3. An indicator that reflects the stability and reliability of the operation of imaging equipment. The goal of many machine vision systems is to mimic (to the extent possible) the ability of the human visual system to recognize objects and scenes in a variety of situations. To achieve this, the feature extraction and representation steps of a machine vision system should be robust to various operating conditions and environmental factors such as poor spatial resolution, uneven illumination, geometric distortion, and noise caused by different viewing angles.
4. Another name for convexity.

Robust
A generic term indicating that a technique is insensitive to noise or other interference.

Uniqueness of watermark
An important watermark property. It mainly indicates whether a unique judgment can be made on the ownership of the digital watermark.

Notability of watermark
An important watermarking property that measures the perceptibility or imperceptibility of a watermark. For image watermarking, lower perceptibility, i.e., better invisibility, is desirable. This has two aspects: first, the watermark should not be easily perceived by the receiver or user; second, the addition of the watermark should not affect the visual quality of the original product. In view of invisibility, the image watermark should be a hidden watermark, i.e., the
presence of the watermark is determined by the copyright holder and is hidden from (not visible to) the average user.

Distortion metric
A measure used to objectively quantify image distortion. For example, when judging the invisibility of an image watermark, distortion metrics are often used in order to avoid being affected by factors such as the experience and state of the observer or the experimental environment and conditions. Commonly used distortion metrics are difference distortion metrics and correlation distortion metrics. The former are divided into metrics based on the Lp norm and metrics based on the Laplacian mean squared error; the latter are divided into normalized cross-correlation metrics and correlation quality metrics.

Perceptibility
1. The psychophysiological ability to sense external stimuli.
2. In image watermarking, the ability to perceive the embedded watermark when observing the work. See notability of watermark.

Distortion metric for watermarks
A measure of the change of a watermarked image relative to the original image. It can be divided into subjective and objective indicators. The notability (saliency) of the image watermark is an example of a subjective indicator; typical objective indicators include the difference distortion metric and the correlation distortion metric.

Difference distortion metric
A class of objective indicators for assessing watermark distortion based on the difference between the original image and the watermarked image. Suppose f(x, y) represents the original image, g(x, y) represents the image with the embedded watermark, and the image size is N × N. Typical difference distortion measures include the mean absolute deviation, the root mean squared error (whose square is the mean squared error), the Lp norm, and the Laplacian mean squared error. In addition, the signal-to-noise ratio and the peak signal-to-noise ratio can also be used as difference distortion metrics.

Correlation distortion metric
An objective indicator for assessing watermark distortion based on the correlation between the original image and the watermarked image. Typical metrics include normalized cross-correlation and correlation quality.

Correlation quality
A typical correlation distortion metric. If f(x, y) represents the original image, g(x, y) represents the image with the embedded watermark, and the image size is N × N, then the correlation quality can be expressed as
Ccq = Σx Σy f(x, y) g(x, y) / Σx Σy f(x, y), with both sums running over x, y = 0, 1, ..., N − 1.
Laplacian mean squared error
A difference distortion metric for objectively measuring image watermarking. If f(x, y) represents the original image, g(x, y) represents the image with the embedded watermark, and the image size is N × N, the Laplacian mean squared error can be expressed as

Dlmse = Σx Σy [∇²g(x, y) − ∇²f(x, y)]² / Σx Σy [∇²f(x, y)]², with both sums running over x, y = 0, 1, ..., N − 1.
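A minimal sketch (Python with NumPy) computing several of the objective distortion metrics defined above; the 4-neighbor Laplacian operator, the particular normalized cross-correlation convention, and the synthetic test images are assumptions (other conventions appear in the literature):

import numpy as np

def laplacian(img):
    # 4-neighbor discrete Laplacian (one common choice of the operator).
    p = np.pad(img.astype(float), 1, mode='edge')
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * p[1:-1, 1:-1]

def distortion_metrics(f, g):
    f, g = f.astype(float), g.astype(float)
    mse = np.mean((f - g) ** 2)
    psnr = 10 * np.log10(255.0 ** 2 / mse) if mse > 0 else np.inf
    lf, lg = laplacian(f), laplacian(g)
    lmse = np.sum((lg - lf) ** 2) / np.sum(lf ** 2)   # Laplacian mean squared error
    ncc = np.sum(f * g) / np.sum(f ** 2)              # normalized cross-correlation
    cq = np.sum(f * g) / np.sum(f)                    # correlation quality
    return {"MSE": mse, "PSNR": psnr, "LMSE": lmse, "NCC": ncc, "CQ": cq}

rng = np.random.default_rng(3)
f = rng.integers(0, 256, size=(64, 64)).astype(float)   # stand-in original image
g = f + rng.normal(0, 2, size=f.shape)                  # stand-in watermarked image
print(distortion_metrics(f, g))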
Region of acceptable distortion
See region of acceptable fidelity.

Region of acceptable fidelity
In the media space or marking space, the collection of all works that are essentially identical to a given cover work (the difference corresponds to the distortion). Each cover work has its own region of acceptable fidelity.

Perceptual slack
In a perceptual model, a measure of how much a given component of a work can be changed without seriously affecting the perceived result.

Perceptual significant components
Components of a work (such as brightness, features, frequencies, etc.) for which even minor changes cause a loss of fidelity. Although it might seem that manipulation of these components should be avoided as much as possible when embedding a watermark, in reality these components are often the most robust ones.
18.1.4 Auxiliary Information Dirty paper codes A class of codes that uses many different code vectors to represent each message. Dirty paper codes help with the encoding of auxiliary information.
Lattice code
A type of dirty paper code in which all code vectors lie on a lattice (grid). Both dithered index modulation and least-significant-bit watermarking use a lattice code.

Lattice coding
See lattice code.

Dithered index modulation
A method of implementing lattice coding. The work is first correlated with a predefined set of patterns, and the correlation values are quantized to integer multiples of given quantization steps.

Quantization index modulation [QIM]
A method of implementing image watermarking. Each piece of information (message) is associated with a distinct vector quantizer. The embedder quantizes the cover work (or a vector extracted from the cover work) using the quantizer associated with the desired information; the detector quantizes the work with all the quantizers and identifies the information from the quantizer that fits best. This is also a dirty paper code method and can be realized by means of dithered index modulation.

Least-significant-bit watermarking
Watermarking in which the information is embedded by adding it to the least significant bits of the cover work.

Informed embedding
A watermark embedding method that transforms the message mark into an added mark on the basis of analyzing the cover work. Compare blind embedding.

Informed coding
A method of mapping information into symbols (message marks) that analyzes the cover work during encoding to determine the best code to use. In coding with auxiliary (side) information, the same information may be mapped to different message marks depending on the cover work. Compare blind coding.

Informed detection
Watermark detection performed with some knowledge (a priori knowledge) of the original unwatermarked work. This knowledge may be the original work itself or a function of the original work. Compare blind detection.
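A minimal sketch of binary quantization index modulation (Python with NumPy): two lattices offset by half a quantization step, one per bit value, as described in the QIM entry. The step size, the synthetic coefficients, and the noise level are illustrative assumptions:

import numpy as np

def qim_embed(coeffs, bits, step=16.0):
    # Binary QIM: quantize each coefficient with the lattice selected by its bit.
    offsets = np.where(np.asarray(bits) == 0, 0.0, step / 2.0)
    return np.round((coeffs - offsets) / step) * step + offsets

def qim_detect(received, step=16.0):
    # Detect by re-quantizing with both lattices and keeping the closer one.
    bits = []
    for c in received:
        d0 = abs(c - np.round(c / step) * step)
        d1 = abs(c - (np.round((c - step / 2) / step) * step + step / 2))
        bits.append(0 if d0 <= d1 else 1)
    return np.array(bits)

rng = np.random.default_rng(5)
coeffs = rng.normal(0, 50, size=8)           # e.g., mid-frequency transform coefficients
bits = rng.integers(0, 2, size=8)
marked = qim_embed(coeffs, bits)
noisy = marked + rng.normal(0, 1.0, size=8)  # mild distortion, well below step/4
print(bits, qim_detect(noisy))               # the embedded bits are recovered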
18.1.5 Cover and Works Cover work A work that has been embedded or carried with protected content. For example, an image with a watermark embedded.
Cover 1. See cover of a set. 2. Works with information content and carrying protected content. Generally, the medium of the works, such as “image,” “video,” etc., is indicated by a certain data type. Media space High-dimensional space where each point represents a work. Embedding region An area in the media space or mark space that represents a collection of all the works that the watermark embedder is likely to output. Detection region In image watermarking, it refers to a region in the media space or mark space that represents a collection of works that are effectively watermarked. Each watermark message has a unique detection region. Side information 1. In an image watermarking system, the cover work information (such as its metadata) that the embedder can use. 2. In an image coding system, information that the encoder can use in addition to the encoded image itself (e.g., illumination, collector). Work Any medium that can be embedded in a watermark, such as an image or video, is also the object to be protected by the watermark. Distribution of unwatermarked works Describe the possibilities of each entry into a watermark embedder or watermark detector. This distribution is dependent on the specific application. Inseparability A feature in which a watermark cannot be separated from its cover work when it is converted between different formats or when other forms of processing that maintain the perceived quality are performed. Perceptual shaping Adaptive modification of the message mark in accordance with the masking capacity of the cover work. The usual form is to weaken the mark (reduce the amount of embedding) in areas where the cover works cannot hide too much noise and enhance the mark (increase the amount of embedding) in areas where more noise can be hidden. Shaped pattern A message pattern that has been modified by perception shaping.
18.2 Watermarking Techniques
According to different processing domains, watermarking technology can be divided into the spatial domain method performed in the image space domain and the transform domain method performed in other transform spaces (Cox et al. 2002; Zhang 2009). Since many international standards for image expression and coding use discrete cosine transform (DCT) and discrete wavelet transform (DWT), many image watermarks also work in the DCT domain and the DWT domain (Cox et al. 2002; Shih 2013).
18.2.1 Technique Classification Digital image watermarking Embed a code (watermark) into the image data. A watermark is like a digital signature that identifies the owner or authenticity of the image. It can be visible or invisible to the viewer (hidden watermark). Figure 18.2 shows an example of each, the left picture is the visible watermark, and the right picture shows that the invisible watermark can be extracted and displayed from the image. See steganography. Digital watermarking The process and method of embedding a mark (watermark) into digital media. In the field of digital imaging, its role is mainly to protect copyright. Digital watermarks can be invisible (not perceptible or imperceptible) or visible (such as the station’s logo). See image watermarking. Image watermarking The process of embedding and detecting digital watermarks on images. Digital watermark is a digital mark that is embedded in a digital product to help identify the owner, content, usage rights, integrity, etc. of the product. Image watermark embedding can be regarded as a special way of image information hiding in a broader sense and is a means to realize image information security. See digital image watermarking.
Fig. 18.2 Visible and invisible watermarks (the visible watermark reads "This is an image of Lena")
Watermarking The ability of unobtrusively changing a work by embedding information related to the work. Fragile watermark A watermark with very limited robustness. It is sensitive to external disturbances, and even minor changes to the works can cause the original embedded watermark to become undetectable. It is mainly used for authentication purposes. It can detect whether data with watermark protection has been changed and can ensure the integrity of media information. Semi-fragile watermark A type of watermark between a robust watermark and a fragile watermark. A watermark that is fragile for some distortions and robust to some other distortions. Although robust to some operations, modifications to important data features are fragile. It can be used to protect the integrity of certain specific raw data. It has practical implications for selective authentication. Second-generation watermarking Used to refer to a watermarking system that utilizes features. More generally refers to a system that does not simply embed a watermark by an added pattern, or a system that does not simply detect a watermark by linear correlation or normalization correlation. Asymmetric watermarking See asymmetric key watermarking. Asymmetric key watermarking Techniques, methods, or watermarking systems that utilize different watermarking keys when embedding and detecting watermarks. Such a watermarking system allows the user to detect and decode the watermark, but does not allow embedding a new watermark or eliminating the original existed watermark. Public-key cryptography See asymmetric key watermarking. Asymmetric fingerprinting In image watermarking technology, it refers to a special watermark used for operation tracking. In an asymmetric fingerprint system, the owner of the works sends the works to the buyer, who can obtain the watermarked works, but the buyer does not know what the watermark is. Public watermarking A type of watermark that can publicly authorize a particular person to perform watermark detection in an application, but does not authorize it to perform other watermark operations. Compare blind detection.
Private watermarking Watermarks that various watermark operations including watermark detection are not authorized to others in any application. Compare informed detection.
18.2.2 Various Watermarking Techniques Tell-tale watermarking Use watermarks to try to identify the operations that have been done on the cover work. Erasable watermarking The watermark that can be accurately removed from the work, which can be removed to obtain a copy that is exactly the same as the unwatermarked image. This watermark often refers to a reversible watermark. Perceptible watermark A watermark that can be perceived after embedding into digital media. For image watermarking, it is just a visible watermark. Typical examples are the identification of the television station and the marking of the website picture. Compare imperceptible watermark. Imperceptible watermark A watermark (invisible watermark) that cannot be perceived after embedding into digital media. Mainly used to prevent illegal copying and to help identify the authenticity of products. For images, the imperceptibility of the watermark can guarantee the ornamental value and use value of the original image. Compare perceptible watermark. Perceptual adaptive watermarking A technique that can adjust an added pattern to embed a watermark based on a particular perceptual model. Visible watermark A visible mark placed on an image or video. It is often considered for such marks to have properties that are difficult to remove and partially transparent. Invertible watermark A watermark that is susceptible to ambiguous attacks. If a fake original can be produced from the distributed work, causing the distributed work to appear to be created by embedding a watermark into the counterfeit work (i.e., the “reverse” embedding process is reversible), then there is a reversible watermark. Sometimes it also means erasable watermarking. Quantization watermarking The watermark is embedded by quantizing the content in the transform domain. See dithered index modulation and quantization index modulation.
Robust watermark A watermark having strong robustness. The robust watermark tries to ensure the integrity of the watermark information and can guarantee its own integrity and accuracy of detection when the image has a certain degree of external interference and distortion. Blind watermarking A watermarking system that uses blind detection. Blind embedding A method of embedding a watermark in an image. In this method, the added pattern is not constrained by the cover work. Compare informed coding, informed detection, and informed embedding. Blind coding A method of mapping information into corresponding symbols, which is opposed to encoding of containing auxiliary information. As in the image watermarking technique, the watermark is embedded regardless of the constraints of the cover work. Blind detection The content embedded in the image is detected without knowing the original image. As in the image watermarking technique, the original unwatermarked content is not known before the watermark is detected. Oblivious In image watermarking technology, it means blindness, such as blind detection. Transparent watermark Undetectable image watermark. Compare visible watermark. Meaningless watermark There is no specific meaning in itself, only the digital watermark that reflects the existence or not. The requirements for its embedding and detection are relatively low. A typical example is the use of a pseudo-random sequence as a watermark. Compare the meaningful watermark. Meaningful watermark A digital watermark that the embedded data itself has definite meaning. The meaningful watermark of the image can be visible. Meaningful watermarks provide much more information than meaningless watermarks, but the requirements for embedding and detection are also much higher. Hidden watermark A digital watermark that hides the watermark owner information from the user. Unlike the anonymity that mainly considers that the hidden information is not detected, the watermark is mainly considered to prevent potential pirates from eliminating the watermark, so the watermarking method needs to consider the robustness of the possible attack, but generally it only needs to embed much less
information in the cover works than the anonymity method. In addition, the information hidden by the watermarking system is always combined with the protected product, while the anonymity system only considers hidden information and does not care about the carrier. From a communication perspective, watermarking techniques are often one-to-many, and anonymity is often one-to-one (between the sender and the receiver). Watermarks are often used instead of anonymity to defend against attacks where the attacker knows that there is hidden information in the carrier and attempts to remove it. In addition, watermarking technology differs from the cryptography techniques generally used for security. Cryptography cannot protect the data after the data is received and decrypted, while the watermark can protect the data from loss or leakage during the data transmission process. Moreover, the legitimacy of data usage can be guaranteed throughout the use of the data. Non-hidden watermark In image watermarking, a digital watermark that does not hide the watermark owner information from the user. An example of a typical non-hidden image watermark is the station logo of a television station. From the perspective of information existence, watermarks and cryptography have some similarities. Cryptography can be used to hide information, so it can also be used as a copyright protection technology. Both watermarking and cryptography emphasize the protection of the content itself, while anonymity techniques emphasize the protection of the existence of information. Least-significant-bit watermarking A watermark embedded by adding information to the least significant bits of the cover work.
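A minimal sketch of the least-significant-bit embedding just described, assuming an 8-bit grayscale cover image and a binary watermark of the same size stored as NumPy arrays (both inputs are hypothetical); extraction here is blind, since the original cover is not needed.

```python
import numpy as np

def embed_lsb(cover: np.ndarray, watermark_bits: np.ndarray) -> np.ndarray:
    """Embed a binary watermark into the least significant bit of an 8-bit cover image."""
    cover = cover.astype(np.uint8)
    bits = (watermark_bits > 0).astype(np.uint8)   # force the watermark to {0, 1}
    return (cover & 0xFE) | bits                   # clear each LSB, then set it to the watermark bit

def extract_lsb(watermarked: np.ndarray) -> np.ndarray:
    """Recover the embedded bit plane (blind detection: the original cover is not needed)."""
    return watermarked & 0x01

# Toy usage with random data standing in for a real cover image and watermark.
cover = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
mark = np.random.randint(0, 2, (64, 64), dtype=np.uint8)
stego = embed_lsb(cover, mark)
assert np.array_equal(extract_lsb(stego), mark)
```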
18.2.3 Transform Domain Watermarking Watermarking in transform domain Digital watermarking for embedding and detection within the transform domain. It also refers to techniques for embedding and detecting. Since many image international standards use discrete cosine transform (DCT) and discrete wavelet transform (DWT) to represent and compress images, many image watermarking techniques work in both the DCT domain and the DWT domain. Frequency domain watermarking A watermarking system that embeds a watermark in the frequency domain and modifies a specific frequency coefficient of the work. Note that the operation of such a system does not have to be done in the frequency domain, or the operation of the time domain or the spatial domain can be designed by means of Fourier transform to achieve the same effect.
Watermarking in DCT domain One commonly used type of watermarking in transform domain technique; it performs in the DCT domain for both watermark embedding and detection. Generally, the watermark is embedded into both the AC component and the DC component. The AC component can enhance the secrecy of the embedding, and the DC component can improve the embedding strength. In addition, the embedding rules in the DCT domain are often robust when using DCT-based JPEG and MPEG compression methods, so that the watermark of the DCT domain can be better protected from JPEG or MPEG compression. Watermarking in DWT domain One commonly used type of watermarking in transform domain technique; it performs in the DWT domain for both watermark embedding and detection. The advantage of embedding the watermark directly in the DWT domain is that the information obtained in the compression can be reused so that the watermark can be directly embedded in the compressed domain without decoding, which can solve the problem of large computational complexity when embedding the watermark in the DWT domain to some extent. The advantages of watermarking in DWT domain image technology come from a series of features of wavelet transform: 1. Wavelet transform has spatial-frequency multi-scale characteristics, and the decomposition of images can be continuously performed from low resolution to high resolution. This helps to determine the location and distribution of the watermark to improve the robustness of the watermark and to ensure invisibility. 2. Wavelet transform has a fast algorithm, which can transform the whole image, and has good resilience to external interference such as image compression and filtering. 3. The multi-resolution characteristics of wavelet transform can be well matched with human visual system, so it is easy to adjust the watermark embedding strength to adapt to human visual properties, thus better balancing the watermark’s robustness and invisibility. Therefore, the DWT domain algorithm often combines the use of some features of the human visual system and visual masking effects. Double domain watermarking [DDW] A watermarking technique that combines spatial domain and transform domain information. It is more suitable for textual images, which can improve the robustness of the watermark to general image processing operations, and avoid the adverse effects of the operation on binary images in the transform domain.
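To make the block-DCT embedding idea above concrete, the following is a minimal, hedged sketch of an additive watermark placed in one mid-frequency coefficient of each 8 × 8 block and detected by linear correlation with the keyed pattern. The block size, the coefficient position (3, 4), the strength alpha, and the integer-seed key are illustrative assumptions, not the choices of any particular standard or system described above.

```python
import numpy as np
from scipy.fft import dctn, idctn

def embed_dct_watermark(image, key, alpha=5.0, block=8):
    """Add a keyed pseudorandom +/-1 pattern to one mid-frequency DCT coefficient per block."""
    rng = np.random.default_rng(key)
    img = image.astype(np.float64)
    out = img.copy()
    h, w = img.shape
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            coeffs = dctn(img[r:r+block, c:c+block], norm='ortho')
            coeffs[3, 4] += alpha * rng.choice((-1.0, 1.0))   # mid-frequency AC coefficient
            out[r:r+block, c:c+block] = idctn(coeffs, norm='ortho')
    return np.clip(out, 0, 255)

def detect_dct_watermark(image, key, block=8):
    """Blind detection: correlate the chosen coefficients with the keyed pattern."""
    rng = np.random.default_rng(key)
    img = image.astype(np.float64)
    h, w = img.shape
    vals, patt = [], []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            coeffs = dctn(img[r:r+block, c:c+block], norm='ortho')
            vals.append(coeffs[3, 4])
            patt.append(rng.choice((-1.0, 1.0)))
    return float(np.corrcoef(vals, patt)[0, 1])   # high correlation suggests the watermark is present
```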
18.3 Watermarking Security
The watermark is additional information embedded in the image, and its existence may be subject to external attacks. Attacks on watermarks are unauthorized operations (Zhang 2017a).
18.3.1 Security Image information security Concerns and considerations for secure acquisition, storage, transmission, processing, and use of image and image information. The current work mainly considers how to control and grasp the use of specific images, how to protect the image content from being falsified and forged, and how to protect the specific information in the image from being discovered and stolen by unauthorized persons. The main technologies used include image watermarking technology, image authentication technology, image forensics technology, and image information hiding technology. Copyright protection National law and international law provide for the legal protection of works. See copy control. Digital rights management [DRM] Manage and control the copyright of digital works to solve their legal and commercial problems. Transaction tracking A watermark application that uses watermarks to characterize what has historically occurred in a work. In a typical transaction tracking process, the watermark identifies the first legal recipient of the work. If the work is later found to have been illegally distributed, the watermark can help identify and determine the responsible person. Broadcast monitoring Watermarks are used to monitor radio or television broadcasts and to search for watermarks in the program. The application of the watermarking technique can be used to ensure the correct playback of the advertisement and the complete payment of the copyright tax. Proof of ownership An application of image watermarking in which a watermark is used to determine the owner of a cover work. Security The ability of watermarks in resisting against vandalism and continually surviving. Compare robustness.
Tamper resistance The ability of a watermark to resist tampering or attack. See security. Attack In image watermarking, it refers to all attempts to destroy the watermark. Active attack Any attempt to destroy the image watermarking system by modifying the content may include unauthorized content removal (removal attacks) and unauthorized detection and embedding (forgery attacks). Compare passive attack. Passive attack The user of the image watermarking product detects the type of attack of the watermark that should be detectable by its owner without authorization. That is, illegal detection of the watermark. See detection attack and unauthorized detec tion. Compare active attack. Adversary In image watermarking, it refers to anyone who attempts to destroy the watermarking system. There are also hackers, pirates, attackers, etc. Depending on the application, the adversary may try to use various attacks, including unauthorized content removal (removal attacks), unauthorized detection, and embedding (forgery attacks). Attacker See adversary. Pirate From the copyright of the work, the pirates are consistent with the goals of the adversary. However, in the tracking of operations, pirates specifically refer to those who attempt to obtain a copy of an unauthorized work. Implicit synchronization A method of combating geometric transformations with the robustness of water marks. In this approach, existing features of the work are used to maintain synchronization. StirMark A specific computer program that is also a test platform. It applies a variety of distortions to works that have already been watermarked, helping to assess the robustness and security of watermarks. Spatial masking model The model used to measure visual quality in a benchmark measurement method that measures the robustness of watermark performance. Based on the characteristics of the human visual system, the model accurately describes the occurrence of visual distortion/degeneration/artifacts in the edge region and the smooth region.
18.3.2 Watermarking Attacks Attack to watermark Malicious detection, deletion, modification, replacement, and other means and operations of image watermarking. The watermark embedding of image is equivalent to making the image to have identity information. The attack on the watermark is an attempt to eliminate or change the identity information, which is an unauthorized operation. Attacks are often divided into three types: (1) detection attack, (2) embedding attack, and (3) delete attack. These three types of attacks can also be combined and lead to new types. For example, deleting the existing watermark in the product and then embedding another watermark is called a substitution attack. Copy attack The attacker’s means of copying a legal watermark from one work to another. Copy detection A digital image forensics technology that automatically detects the copying (copy and paste tampering) of image content and cross-media image copying. In the process of copying, it is often accompanied by changes in scale, compression, and coding characteristics. Copy control A precaution against copyright infringement, including preventing the recording of copyrighted content or replaying pirated content. Compliant device Equipment that complies with copy control technology standards. In image watermarking, a media recorder or media player that is capable of detecting a watermark and responding in some suitable manner. Record control Preventing the recording of copyrighted content is an integral part of copy control. Playback control Measures to prevent duplication of copyrighted works from being illegally recorded, which is an integral part of copy control. Generational copy control Allow a copy but prohibit subsequent copies. This is used for copy protection of broadcast content such as in public channels. Because a consumer who buys a work can get a copy, he can’t continue to copy it. Substitution attack An attack to watermark that combines a watermark delete attack with a forgery attack. By deleting the watermark in the watermark product that should have been deleted by the owner, and then embedding the watermark desired by the attacker, the original watermark is changed to serve to attacker.
Collusion attack In image watermarking technology, the adversary has obtained several versions of the same work, each version contains different watermarks, and the adversary removes these versions of the works without authorization and then combines these works together to generate a work almost without watermarks. The term can also describe another method of attack in which the adversary gets a number of different works containing the same watermark and uses them to determine the pattern associated with the mark. Collusion-secure code Code that can be safely opposed to certain collusion attacks, more often able to safely combat attack methods that use several different copies of the same work with different watermarks. Detection attack A type of attack on a watermark. For example, a user of a watermark product detects a watermark that should be detected by the owner. Also known as passive attack. Elimination attack An attack on an embedded watermark by deleting a watermark that should be deleted by the watermark owner. In other words, the adversary eliminates the watermark in the watermarked image and causes the subsequent processing to fail to repair the watermark. Compare the masking attack. Remove attack Same as elimination attack. Delete attack The type of attack that a user of a watermark product deletes the watermark that should be deleted by its owner. It makes the attacked watermark no longer exist. Can be further divided into elimination attacks and masking attacks. Masking attack A type of attack with delete-ability on a watermark. It is a method of attacking a watermark by making it impossible for an existing detector to detect a watermark. This method attempts to prevent the image owner from detecting the image watermark or making the watermark undetectable, thereby implementing an attack on the embedded watermark. However, it does not prevent the detection of a more sophisticated detector. In fact, it does not really remove the watermark. For example, if an existing detector has no correction for the rotation transformation, then rotating the image is a masking attack. Compare the elimination attack. System attack The attack to watermark that uses the flaws in the watermark application instead of using the weakness of the watermark itself.
18.3.3 Unauthorized Attacks Unauthorized detection A form of attack on a watermark, which means that the adversary who originally had no right to detect and/or decode the watermark has detected and/or decoded the watermark. Unauthorized reading See unauthorized detection. Unauthorized embedding An attack form of a watermark, it refers to an embedding of a watermark by an adversary who does not have the right to embed a watermark. Unauthorized writing See unauthorized embedding. Unauthorized removal A form of attack on a watermark, which means that the watermark has been removed by an adversary who has no right to remove the watermark. There are two forms here: masking attacks and elimination attacks. Unauthorized deletion See unauthorized removal. Removal attack See unauthorized removal. Sensitivity analysis attack A means of attacking watermarks. When the adversary owns the watermark detector without knowing the detection key, an unauthorized removal attack can be performed. Specifically, the adversary estimates the watermark reference pattern by superimposing the random pattern and the destroyed work and testing the detection process. Once the reference pattern is estimated, it is subtracted from the work to remove the watermark. Forgery attack The user of the watermark product embeds a watermark that should be embedded by the owner, resulting in a change of copyright information. That is, an attack on an image watermark using an unauthorized watermark embedding. Embedding attack A type of attack on the watermark that the user of the watermark product embeds the watermark should be embedded by the owner. Also known as forgery attacks.
Chapter 19
Image Information Security
Image information security has multiple meanings, such as how to control and grasp the use of specific images, how to protect image content from tampering and forgery, and how to protect specific information in images from being discovered and stolen by unauthorized people (Shih 2013; Zhang 2017a).
19.1
Image Authentication and Forensics
Image authentication and forensics technologies are more concerned about the authenticity and integrity of the image (Shih 2013).
19.1.1 Image Authentication Image authentication The process of determining the identity of an image. On the one hand, it pays attention to the authenticity of the image and determines whether it has been tampered with. On the other hand, it pays attention to the source of the image and determines which device, or which type of device, generated the image. Image tampering Modification of image content without permission of the owner. Commonly used tampering methods include six categories: 1. Compositing: Use different areas from the same image or specific areas on different images, and use the copy-paste operation to form a new image. 2. Enhancement: By changing the color, contrast, etc. of a specific part of the image, focus on strengthening a certain part of the content, or weakening some details.
3. Trimming: Essentially, it uses image repair technology to change the appearance of the image. 4. Distortion: Evolving one image (source image) into another image (target image). 5. Computer generation: New images are generated by computer software, which can be further divided into photorealistic images and non-photorealistic images. 6. Painting: Image drawing by professionals or artists using image processing software. Authentication Verification of the integrity of the subject and its process. For example, the process of verifying the integrity of a watermark or a work embedded in a watermark and the process of verifying the integrity of a captured image. See selective authentication and exact authentication. Selective authentication Make sure that the work is not subject to any particular distortion. Generally, selective authentication mainly considers unreasonable distortion and ignores reasonable distortion. Compare exact authentication. Exact authentication A technique or process that determines if each bit of a given digital work has not changed. Compare selective authentication. Robust authentication A type of image authentication defined according to the purpose of authentication. Its authentication goal is the expression result of image content. It is also known as “soft” authentication and is only sensitive to changes in image content. It has to distinguish between general image processing operations (mainly changing the visual effects of the image) and malicious attacks (which may change the content of the image). Compare complete authentication. Complete authentication A type of image authentication defined according to the purpose of authentication. Its authentication goal is to determine the expression form of image content, such as the color value and format of image pixels, etc. It is also called “hard” authentication. As long as the image has changed, it is considered incomplete and untrustworthy regardless of whether the image is subjected to general image processing operations (legal compression, filtering, etc.) or malicious attacks. Compare robust authentication. Reversible authentication A special type of image authentication technology. It not only requires the accurate authentication of the image, including identification of malicious tampering with the image, but also requires that the original image be nondestructively restored/ completely restored after authentication, that is, the authentication process needs to be reversible.
Authentication mark An identified logo or mark used to identify the integrity of a work, such as a watermark used to identify the integrity of a watermark carrier.
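To make the contrast between exact (complete) authentication and the content-oriented robust/selective forms above concrete, the sketch below checks bit-exact integrity with a cryptographic hash. The file names are placeholders, and a practical system would also have to protect the stored reference digest, for example with a digital signature or a fragile watermark.

```python
import hashlib
from pathlib import Path

def image_digest(path: str) -> str:
    """SHA-256 over the raw file bytes: any single-bit change yields a different digest."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Hypothetical usage: compare a received copy against a stored reference digest.
reference = image_digest("original.png")   # computed and stored when the work is created
received = image_digest("received.png")    # recomputed on the copy to be authenticated
print("exact authentication passed" if received == reference else "image has been modified")
```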
19.1.2 Image Forensics Digital image forensics 1. A method of detecting the specific form of tampering or unauthorized alteration of a digital image to verify the content of the image. See digital image watermarking. 2. Image analysis methods used for evidence collection for crime investigations. Image forensics The process of identifying, collecting, discerning, analyzing, and presenting court evidence. Image forensics technology detects the tampering of the image by analyzing the statistical characteristics of the image and judges the authenticity, credibility, integrity, and originality of the image content. Image anti-forensics Techniques and processes to remove or obscure traces left by image manipulation. It competes with image forensics and attempts to degrade or invalidate forensics technology. Blind image forensics Perform blind forensics on digital images regardless of whether the image was previously processed (without performing a security check of the underlying file or accessing the original image).
19.2
Image Hiding
Image information hiding is a generalization of watermarking technology from the perspective of embedding information in a carrier. The existence of confidential information, and especially the content of confidential information, needs to be protected (Zhang 2017a).
19.2.1 Information Hiding Image information hiding In order to achieve a certain purpose of confidentiality, the process or technology of intentionally hiding some specific information in an image. In a wide sense, it is a
Table 19.1 Classification of image information hiding technology

                                        Relevant to the image     Not relevant to the image
Hide the existence of information       Hidden watermark          Secret communication
Know the existence of information       Non-hidden watermark      Secret embedded communication

Note: The terms in bold type in the table indicate that they are included as headings in the main text
relatively broad concept. Depending on whether the existence of specific information is confidential or not, image information hiding can be secret or non-secret. In addition, according to whether the specific information is related to the image or not, it can be classified into watermark type or non-watermark type image information hiding. According to the above discussion, from a technical perspective, the image information hiding technology can be divided into four categories, as listed in Table 19.1. Image hiding Abbreviation for image information hiding. Information hiding Technology and science that hide information. The process of consciously embedding certain specific information into a selected carrier to achieve a specific confidentiality purpose. It is a relatively broad concept that can include many different ways. Some protect the existence of information so that it cannot be detected; some only protect the content of the information, that is, the information does exist but cannot be recognized; others protect both. It mainly includes steganography and watermarking techniques, but also includes anonymous communication and prevention of unauthorized basic data reasoning. Arnold transform A transformation proposed in the study of ergodic theory, also known as cat mapping. Suppose a cat face image is drawn in a unit square on a plane. To change it from clear to fuzzy, you can use the following Arnold transform to map:
$$
\begin{bmatrix} x' \\ y' \end{bmatrix} =
\begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix} \pmod{1}
$$
Among them, (x, y) are the pixel coordinates of the original image, and (x', y') are the pixel coordinates after the mapping. Specifically for digital images, the Arnold transformation in the above formula needs to be rewritten to become
$$
\begin{bmatrix} x' \\ y' \end{bmatrix} =
\begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix} \pmod{N}, \quad x, y \in \{1, 2, \ldots, N\}
$$
Among them, N is the order of the image matrix, and generally a square image of N × N is considered. Applying the Arnold transform to an image moves its pixels around and scrambles it. Repeating this scrambling operation multiple times can turn a meaningful image into a meaningless, white-noise-like image and thus achieve a certain information hiding effect. It should be noted, however, that the Arnold transform is periodic and a digital image is a finite point set, so after the transform has been applied a certain number of times, the original image is restored. Cat mapping Same as Arnold transform. Steganography A way to hide information. Hiding information in the “carrier” data that is not suspected, or disguising the information to be hidden as information unrelated to the public content, thereby protecting the existence of the information to be hidden from being noticed and detected, to achieve the purpose of hiding the information content. It mainly hides the existence of secret information and thus keeps secret information secret. A simple method is to encode the information into the lower bits of a digital image.
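As a small illustration of the digital Arnold transform defined above, the sketch below scrambles a square N × N image by repeatedly applying the 2 × 2 integer map modulo N (0-based indices are used instead of {1, 2, . . ., N}, which does not change the behavior), and then finds the period empirically by iterating until the original image returns.

```python
import numpy as np

def arnold_scramble(img: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Scramble a square N x N image with the Arnold (cat) map, repeated `iterations` times."""
    assert img.shape[0] == img.shape[1], "the Arnold transform is applied here to square images"
    n = img.shape[0]
    out = img.copy()
    for _ in range(iterations):
        scrambled = np.empty_like(out)
        for x in range(n):
            for y in range(n):
                x_new = (x + y) % n          # [x']   [1 1][x]
                y_new = (x + 2 * y) % n      # [y'] = [1 2][y]  (mod N)
                scrambled[x_new, y_new] = out[x, y]
        out = scrambled
    return out

# Periodicity check: keep applying the map until the image returns to its original state.
img = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
period, current = 1, arnold_scramble(img, 1)
while not np.array_equal(current, img):
    current = arnold_scramble(current, 1)
    period += 1
print("empirical period for N = 64:", period)
```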
19.2.2 Image Blending Image blending 1. Arithmetic operations that extend image addition. A similar image arithmetic operation using the pixel addition operator, where the new image is obtained by mixing the corresponding pixel values of the two input images. In image addition, the corresponding pixel values of two images are directly added to form the pixel values of the new image. When in image blending, each input image is weighted differently in the mixture, but the sum of the weights is 1. The image blending can be used to hide one image in another or more. The previous image is called the intending hiding image s(x, y), and the latter one (or more images) is called the carrier image f(x, y). If α is any real number that satisfies 0 ≤ α ≤ 1, then the image

b(x, y) = αf(x, y) + (1 − α)s(x, y)

is the α blending image of the images f(x, y) and s(x, y). When α is 0 or 1, this blending can be called ordinary blending.
2. Combine two images into a combined image. The key here is that the low-frequency color changes in the two images need to be blended smoothly, and their high-frequency textures need to be blended together quickly to avoid ghost effects when textures are superimposed. Blending operator The image processing operator for the third image C is obtained by weighting the two images A and B. Letting α and β be two scalar weights (generally α + β = 1), then C(x, y) = αA(x, y) + βB(x, y). Iterative blending A way for image information hiding. Image blending is performed iteratively and successively nested to better hide image information. Related technologies can be divided into single-image iterative blending and multiple-image iterative blending. Single-image iterative blending An image blending method. First, an intending hiding image is blended with a carrier image, and then the blended image is again blended with the same carrier image. This iteration is performed n times to obtain an n-fold iterative blended image of the intending hiding image and a single carrier image. Multiple-image iterative blending A generalization of single-image iterative blending to multiple images. Letting fi(x, y) (i = 1, 2, . . ., N) be a set of carrier images and s(x, y) be an intending hiding image, {αi | 0 ≤ αi ≤ 1, i = 1, 2, . . ., N} are the given N real numbers. For images f1(x, y) and s(x, y), performing α1 for blending to get b1(x, y) = α1f1(x, y) + (1 – α1)s(x, y); for images f2(x, y) and b1(x, y), performing α2 for blending to get b2(x, y) = α2f2(x, y) + (1 – α2)b1(x, y); . . ., blending N times in sequence to get bN(x, y) = αNfN(x, y) + (1 – αN)bN–1(x, y). The image bN(x, y) is called an N-iterative blending image of image s(x, y) with respect to αi and multiple carrier images fi(x, y) (i = 1, 2, . . ., N). Gradient domain blending A class of image blending/splicing methods. Perform a reconstruction operation in the gradient domain. Gradient domain image stitching [GIST] A set of methods for image blending in the gradient domain. It mainly includes feathering the gradient of the source image and image reconstruction in the gradient domain using the L1 norm. Among them, the method using the L1 norm to optimize the cost function of feathering on the original image gradient (called GIST1-L1) works well.
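A minimal sketch of the α-blending and multiple-image iterative blending formulas above, assuming the carrier images and the intending hiding image are arrays of the same size; the particular α values in the usage example are illustrative only.

```python
import numpy as np

def alpha_blend(f: np.ndarray, s: np.ndarray, alpha: float) -> np.ndarray:
    """b(x, y) = alpha * f(x, y) + (1 - alpha) * s(x, y), with 0 <= alpha <= 1."""
    return alpha * f.astype(np.float64) + (1.0 - alpha) * s.astype(np.float64)

def iterative_blend(carriers, s, alphas):
    """Multiple-image iterative blending: b_i = alpha_i * f_i + (1 - alpha_i) * b_{i-1}, with b_0 = s."""
    b = s.astype(np.float64)
    for f_i, a_i in zip(carriers, alphas):
        b = alpha_blend(f_i, b, a_i)
    return b

# Illustrative usage: hide s in two carriers with fairly large alphas,
# so the result is dominated by the carriers and s contributes only weakly.
s = np.random.randint(0, 256, (128, 128)).astype(np.float64)
carriers = [np.random.randint(0, 256, (128, 128)).astype(np.float64) for _ in range(2)]
hidden = iterative_blend(carriers, s, alphas=[0.9, 0.9])
```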
19.2.3 Cryptography Cryptography The technical science that studies the compiling and deciphering of ciphers. Among them, the study of the objective laws governing ciphers, when applied to constructing ciphers to keep communications secret, is called coding; when applied to cracking ciphers to obtain the communicated information, it is called deciphering. There are some similarities between image watermarking technology and cryptography in image engineering; they both emphasize the protection of the information content itself. Cryptography itself can be used to hide information, so it can also be used as a copyright protection technology. Cipher text Encrypted information. Compare clear text. Clear text Unencrypted information, which can be viewed and understood by anyone. Encryption Conversion from clear text to cipher text. Decryption Conversion from cipher text to clear text. Cipher key A single key or key pair used for information encryption and decryption. For example, key pairs are used in watermarking systems for watermark embedding and watermark detection. Key management Operations performed in a cryptographic system to ensure the integrity of a key, including key generation, key distribution, and key authentication. Key space The space from which the key is picked. A large key space means a large number of keys, so adversaries need to search a larger space, which indicates that a large key space can bring greater security. Symmetric key cryptography Encryption techniques and methods that require the same key to be used for encryption and decryption. Compare asymmetric key cryptography. Asymmetric key cryptography Refers to encryption techniques and methods that use different keys for encryption and decryption. Compare symmetric key cryptography.
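The relationship among clear text, cipher text, and a symmetric cipher key can be illustrated with a short sketch using the third-party Python cryptography package (an assumed dependency; any symmetric cipher would serve equally well). The message content is a made-up placeholder.

```python
# Requires the third-party "cryptography" package (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # a single symmetric cipher key: the same key encrypts and decrypts
cipher = Fernet(key)

clear_text = b"image licence: owner = Alice, serial = 0001"   # hypothetical clear text
cipher_text = cipher.encrypt(clear_text)                       # encryption: clear text -> cipher text
recovered = cipher.decrypt(cipher_text)                        # decryption: cipher text -> clear text
assert recovered == clear_text
```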
19.2.4 Other Techniques Secret communication Transmission of confidential information. It often uses a covert channel, so the confidentiality of information is closely related to the confidentiality of the channel. Compared with digital watermarking, the information embedded here is independent of the carrier. The information embedded here is required by the receiver, and the carrier is not required by the receiver. The carrier is only used to help transmit the embedded information. Secret embedded communication The confidential information is transmitted through the public channel, which is embedded in the publicly transmitted signal but has nothing to do with the signal. Embedded communication is closely related to steganography. Unlike cryptography that protects the content of information, steganography protects the existence of information. Traditional secret communication is mainly based on encryption technology. Its disadvantages include the following: (1) the channels used for confidential communication are often open or easily intruded, and confidential information is easy to be intercepted by the other party; (2) encrypted information transmitted separately is easy to attract the other party’s attention and analysis, and the other party may prevent communication by destroying the channel or the transmission file on the channel; and (3) as the encryption algorithm becomes more and more open and transparent, and the computing power is greatly improved, even the sophisticated encryption system may be cracked. In recent years, the mode of covert communication using information hiding has attracted attention. It uses the steganography technology to hide the secret information in the carrier and then transmits it through open media to achieve the effect of secret communication without attracting the attention of the other party. Covert channel Channels often used in secret communication. The embedded information (required by the receiver) is not related to the carrier (not required by the receiver). The carrier is only used to help transmit the embedded information. Steganalysis A technical discipline that detects and decodes hidden information using steganography. Steganography A way to hide information. Hiding information in the “carrier” data that is not suspected, or camouflaging the information to be hidden as information unrelated to the public content, thereby protecting the existence of the information to be hidden from being noticed and detected, to achieve the purpose to protect hiding information content. It mainly hides the existence of secret information and thus keeps secret information secret. A simple method is to program the information into the lower bits of a digital image.
Anonymity An important method in information hiding. Generally, the (confidential) information is hidden in another set of (disclosed) information/data. The main consideration here is to protect the hidden information from being detected, thereby protecting the hidden information. It can be used to hide the sender, receiver, or both of the information. In image engineering, image information hiding is mainly considered. Steganographic message Information that hides itself on a seemingly insignificant (public) object to keep itself secret. Stego-key The key required to detect secret information. Cipher A method of encrypting and decrypting information.
Chapter 20
Color Image Processing
Color images contain more information than black-and-white images (Koschan and Abidi 2008). Color images are gradually becoming the object of image processing (MacDonald and Luo 1999; Plataniotis and Venetsanopoulos 2000).
20.1
Colorimetry and Chromaticity Diagram
The physical basis of color vision is that there are three kinds of cone cells that sense color in the human retina. The three primary colors constitute the basic unit of color. The physiological basis of color vision is related to the chemical processes of visual perception and to the neural processing of the nervous system in the brain. To express color, people use the concept of chromaticity. In order to express chromaticity, a method of expressing chromaticity diagrams has been introduced (Plataniotis and Venetsanopoulos 2000; Zhang 2009).
20.1.1 Colorimetry Colorimetry 1. The science that involves the quantitative study of color perception. The expression of three stimulus values is used to derive the perception of color. The simplest way to encode color in cameras and displays is to use the red (R), green (G), and blue (B) values of each pixel. 2. The measurement of color intensity relative to a standard. Color purity Subjective perception corresponding to the spectral width of the spectrum. An experimental Munsell color system can be used to calculate or plot the purity of a
color. Referred to purely in colorimetry. If the brightness of the main wavelength λ in a color light is Lλ and the total brightness is L, then the color purity pc ¼ Lλ/L (commonly used in Europe). Monochromatic light has a purity of 1 and white light has a purity of 0 (with no dominant wavelength light). However, the color purity as defined cannot be read directly from the chromaticity diagram. Another color purity can be read from the chromaticity diagram, which is defined as the following: O is the selected white light position in the chromaticity diagram, and M is the position of the considered color. Extending the line connecting these two points to intersect with the spectral locus at S, the spectral color S is the dominant wavelength, and then the color purity pe ¼ OM/OS (commonly used in the United States). The relationship between the two color purity is pc ¼ (yλ/y)pe, where yλ is the chromaticity coordinate of the main wavelength and y is the chromaticity coordinate of the considered color in the standard CIE chromaticity system. Purity A mixture of achromatic and monochromatic light stimuli. The greater the distance between a point expressed as purity and the achromatic stimulus (sunlight) point, the higher is the purity, and conversely, the lower is the purity. Color saturation In colorimetry, color purity indicates how much white a color contains. The color purity of all spectral colors is 1 (strictly speaking close to 1). Psychological researchers focus on subjective perceptions. The amount of white in the spectral color indicates the amount of saturation. Perceptually, not all spectral colors have equal saturation. In fact, purple is the most saturated; yellow is very small, close to white; and red is between purple and yellow. Dominant wavelength When mixed with the specified achromatic hue according to the additive color mixing rule at an appropriate ratio, a monochromatic wavelength that matches the target hue can be generated. It is one of the important data that determines the color properties. It is a complete data of a color together with color purity and brightness. The dominant wavelength is an objective quantity corresponding to the subjective perceived hue, but an objective “white” color must be clearly defined. When the monochromatic light stimulus and the specific white stimulus are added at an appropriate ratio, and the test color stimulus can be matched, the monochromatic light stimulus wavelength is the main wavelength of the test color. The dominant wavelength is the color wavelength corresponding to the perceived hue. The wavelength of pure color light is the dominant wavelength of the color light. Optimal color The color of the surface of an ideal object. The reflection factor of the object to each band is either 0 or 1, and there are only two discontinuous jumps in the full band. According to its distribution, it is divided into short wave color (360 ~ 470 nm), medium wave color (470 ~ 620 nm), and long wave color (620 ~ 770 nm). The
optimal color is the most significant among the colors of the same brightness and the brightest of the colors of the same color. Color representation An expression of color with multiple (usually three) values. The meaning of each value depends on the color model used. In the study of color representation, three stimulus values are used in colorimetry to derive the perception of color. Value in color representation The basic quantity (i.e., “value”) that represents brightness in the HCV model. Corresponds to the other basic quantity that expresses chromaticity in some other perception-oriented color models. Color encoding Color-coded representation. Color can be coded using three numerical components and an appropriate spectral weighting function. Color The human visual system has different perception results of electromagnetic waves of different frequencies. It is a physical phenomenon as well as a psychological phenomenon. Physically, color refers to the textured nature of an object, which allows it to reflect or absorb a specific portion of the incident light. Psychologically, it is characterized by visual perception caused by light of a particular frequency or wavelength incident on the retina. The key paradox here is why light with slightly different wavelengths is perceived as very different (e.g., red vs. blue). Chromatic color Various colors other than achromatic. The colored cubes of the RGB model correspond to points other than the line from the origin to the furthest vertex. Compare achromatic color. Achromatic color A series of neutral gray colors from white to black. Colors include colored and achromatic. Achromatic color corresponds to the line from the origin to the furthest vertex in the colored cube of the RGB model. Compare chromatic color. Neutral color A color with only brightness and no hue. Also called achromatic, which is often referred to as gray. The output obtained by using a neutral filter or filter that reduces the intensity of each color is the neutral color. Color range Corresponds to all colors in a certain wavelength range. Color gamut The range of colors or the progression of colors. Different devices have different color gamuts or levels. For example, the color gradation of a color printer is relatively less than that of a color display (device), so not all colors seen on the
display can be printed on paper. To print true color images on the screen, you need to use a reduced color gamut for the display. A particular display device (such as a CRT, LCD, printer, etc.) cannot display all possible color subsets. Because various devices have different physical mechanisms for producing color, each scanner, monitor, and printer will have a different color gamut (or color range) that it can represent. The RGB color gamut displays only about 70% of the perceptible colors. The CMYK color gamut is much smaller, regenerating only about 20% of the perceptible colors. The color gamut obtained from premixed inks (similar to the Pantone Matching System) is also smaller than the RGB color gamut. Color balance The process or result of using two or more colors to achieve an achromatic effect by adding or subtracting colors. Corresponds to the operation of moving the white point of a given image closer to the pure white point (with the same R, G, B values). A simple method is to multiply each RGB value by a different factor (such as a diagonal matrix transformation of the RGB color space). More complex transformations perform a color twist, using a common 3 3 color transformation matrix, mapping to XYZ space and back. In practice, take a piece of white paper, and adjust the RGB value of the image to make it a neutral color. Most cameras perform some kind of color balancing operation. If the color system is the same as the lighting (e.g., the BT.709 system uses the daylight source D65 as its reference white), the change is small. But if the light source has a strong color, such as indoor incandescent lighting (generally produces yellow or orange hue), the compensation is still obvious. Transmitted color The color of the light emitted by the light source when it penetrates the object. Compare reflective color. Reflective color Because the color of light observed from a specific surface (such as paper, plastic, etc.) is reflected to the human eye, it may be different from the original color of the light. Compare transmitted color. Surface color 1. The color appeared when a strongly absorbing object does not have time to absorb a certain color of light, but chooses to reflect. The color of an object with a strong absorption ability has a lot to do with its surface condition. If the surface is polished well, it will be white. However, such objects sometimes show another situation, that is, the color that should be absorbed is reflected because it is too late to absorb, and the color presented at this time is called surface color. At this time, the transmitted light and the reflected light are complementary colors. Compare body color. 2. The color of diffuse reflection on the surface of opaque objects.
20.1.2 Color Chart Color circle A representation or result of various colors. The color spectrum of visible light is spread on the ring, and the two ends of the spectrum are connected on the ring, as shown in Fig. 20.1, which is also called the color wheel. In this way, a pattern representing a change in hue composed of a circular arrangement of various color cards is obtained, as shown in the left picture. On the color circle, the complementary color of each color can be determined intuitively, as shown in the right picture. Color wheel Same as color circle. Saturated color Spectral colors or saturated colors. Sometimes called color brightness (because it is related to light and dark). The colors on the color circle are all bright colors, such as red, yellow, green, and blue. Non-brightness is a mixture of a certain color on the color circle and “white.” The mixture of white and spectral colors (such as red,
Fig. 20.1 Color circle diagram
yellow, green, and blue) does not change its hue, but it shows the relative amount of this color and white, which is called color brightness or saturation. Higher brightness indicates more spectral colors. In terms of white, it indicates the lower whiteness. Brightness and whiteness are both subjective feelings, and objective colorimetry has the word “purity.” Clockwise spectrum A series of changing colors obtained by rotating the color wheel clockwise. Such as red, yellow, green, cyan, etc. Compare counterclockwise spectrum. Counterclockwise spectrum A series of changing colors obtained by rotating the color wheel counterclockwise. Such as yellow, red, magenta, blue, etc. Compare clockwise spectrum. Complementary colors Colors that are exactly opposite each other (180 difference) on the color circle. For example, red and green and yellow and blue, when the two are mixed in a certain ratio, different shades of gray can be produced. Color complement The result of substituting each color hue with its complement (sometimes called an opposite color). Similar to the process of image negative (negative value transformation), the operation of obtaining color complements can be performed by using a trivial transfer function for individual color channels. If the input image is represented by the HSV color model, a trivial transfer function is needed for the value V, a nontrivial transfer function is used for the hue H (considering the discontinuity of H near 0 ), and no change of saturation S is made. Color chart A pattern composed of various color cards arranged on a picture in a certain order. Figure 20.2 shows a schematic diagram. Monochrome Contains only a single color tone, often a grayscale from pure black to pure white.
Fig. 20.2 Color chart diagram
Fig. 20.3 Macbeth color checker diagram
Table 20.1 Color or grayscale names corresponding to the Macbeth color checker

Dark complexion   Bright complexion      Blue sky     Leaf color   Blue flower    Bluish green
Orange            Slightly purple-blue   Medium red   Purple       Yellow-green   Orange-yellow
Blue              Green                  Red          Yellow       Magenta        Blue-green
White             Gray                   Gray         Gray         Gray           Black
Color checker chart A reference pattern composed of multiple square planes of different single colors. Typically includes Macbeth color chart and Munsell color chart. Macbeth color chart See Macbeth color checker. Macbeth color checker A reference map for correcting a system of color images. A total of 24 matte-colored surfaces (consisting of Macbeth color charts) can be used to judge color. These colored surfaces include the primary colors of additive color mixing and subtrac tive color mixing, as well as six achromatic colors (gray levels), as shown in Fig. 20.3. The names corresponding to the colors in four rows and six columns are shown in Table 20.1. Munsell color chart See Munsell color system. Munsell color model In the field of art and printing, a commonly used color model for visual perception. In which, different colors are represented by multiple color patches with unique hue, color purity, and value. These three together also form a 3-D entity or a 3-D space. However, the Munsell space is not uniform in perception and cannot be directly combined based on the principle of color addition. Munsell color notation system A system based on hue, brightness (value), and saturation (color purity) to precisely specify various colors and their connections. The printed Munsell color
book contains color slides indexed with these three attributes. The color of any unknown surface can be identified by comparing it with the color in the book under given lighting and viewing conditions.
20.1.3 Primary and Secondary Color Primary color 1. For light, one of three colors: red (R), green (G), and blue (B). Collectively called the three primary colors of light. 2. For pigment, one of the three colors of magenta (M), blue-green (C), and yellow (Y ). Collectively referred to as the three primary colors of the pigment. 3. Primitives of a color coding system. Perceived colors within a certain range can be composed of weighted combinations of primary colors. For example, color television and computer screens use photo-radiation chemistry to produce three primary colors (red, green, and blue). The ability to generate all other colors using just three colors stems from the cones of the human eye that are sensitive and responsive to the three different color spectrum ranges. See additive color and subtractive color. RGB Three primary colors of light. Red [R] 700 nm light specified by CIE. One of the three primary colors. Green [G] One of the three primary colors of light. The green wavelength specified by CIE is 546.1 nm. Blue [B] One of the three primary colors of light. The blue wavelength specified by CIE is 435.8 nm. Yellow [Y] One of the three primary colors of pigments. It is also one of the three complemen tary colors of light (i.e., one of the secondary colors of light). It is the result of red light plus green light. Cyan [C] One of the three primary colors of pigments. It is also one of the three complemen tary colors of light (i.e., one of the secondary colors of light). It is the result of green light plus blue light. Magenta [M] One of the three primary colors of pigments. It is also one of the three complemen tary colors of light (i.e., one of the secondary colors of light). It is the result of red light plus blue light.
Secondary colors Three colors obtained by adding (additive color mixing) the three primary colors of light: red, green, and blue. That is, (1) cyan (green plus blue), (2) magenta (red plus blue), and (3) yellow (red plus green). Complementary wavelength After mixing the test color stimulus and a single-color light stimulus at an appropriate ratio, it is the wavelength of the single-color light stimulus when the color match with the specific white light stimulus is achieved. Mainly used in the fuchsia area.
20.1.4 Color Mixing Color mixing The process and technique of mixing (adding or subtracting) certain color stimuli to produce another color vision. Typical techniques include additive color mixing and subtractive color mixing. Compare color mixture. Color mixture The process and phenomenon of mixing different colors of light or pigments to give new colors. Mixing the light will cause the color to become brighter, and the more mixed it will be brighter (toward white); while mixing the pigment will cause the color to be darker, and the more mixed it will be darker (toward black). Compare color mixing. Dispersion 1. Dispersion: One of the characteristics of propositional expressions in a propositional expression model. An element expressed by a proposition may appear in many different propositions, and these propositions can be expressed in a consistent manner using the semantic network. 2. Dispersion: The phenomenon that white light is broken down into various colors. That is, white light will be dispersed into various colors when it is refracted. To be precise, this corresponds to the quantitative relationship of the refractive index n of a medium with the wavelength λ. 3. Diffusion: The scattering of light as it passes through a medium. Color mixture model Hybrid model based on color distribution in some color expression systems. These color expression systems indicate not only the individual colors in the model but also the connections between them. The conditional probability that an observed pixel belongs to a target can also be modeled as a mixture of multiple connected components. Color-line model A color mixture model. It is assumed that the color distribution of the foreground and background can be approximated by a local blend of the two colors.
Chromatic coefficient Represents the scale factor of the three primary colors that make up a color. Considering that human visual system and display have different sensitivities to different colors, the values of the various scale factors should be different when matching the perceived color and display color. In practice, the green coefficient is the largest, the red coefficient is the middle, and the blue coefficient is the smallest. White balance Describes an accuracy index for the use of three primary colors, red (R), green (G), and blue (B), to generate white. Also means (with the help of correction) that white is displayed correctly. When a black-and-white camera combined with a color filter is used to collect color images, the absolute brightness value of each color channel is largely dependent on the spectral sensitivity of the charge-coupled device unit with the filter coating. On the one hand, the sensitivity of the charge-coupled device chip is very different in the spectral interval of interest; on the other hand, the maximum transmission factor of each filter is also different. This can lead to color imbalances. The basic concept of white balance is “can capture white objects in the scene as a white image regardless of any light source and environment.” On this basis, other colors can be correctly reproduced with “white” as the base color. Also refers to a color correction system to ensure that white objects always look white under different lighting conditions. Reference white The color value of a sample image corresponding to a known white target. Knowledge of this value facilitates the correction of white balance. Imagine the white seen by a particular observer. To avoid subjectivity, colors are often described with respect to white, which requires that colors be modified so that they can all be measured by the corresponding white that a particular observer sees. Next, in order to compare colors in an objective way, it needs to be corrected relative to the reference white. Additive color A method of combining multiple different wavelengths (or frequencies) of light to feel other colors. Red, green, and blue add up to white, but lack black. Compare the subtractive color. Additive color mixing A process or technique that adds two or more colors to get a new color. The more colors mixed, the closer the result is to white. This technique is used to display color images on a monitor. Subtractive color The color remaining in white light after eliminating a certain color. Such as the color after filtering the color chip. The manner in which colors appear due to the attenuation or absorption of light by certain materials. For example, the perception of something as red is because it
weakens or absorbs all wavelengths except those corresponding to red. Compare additive color.

Absorptive color
It means that there is no white in the coloring process, and black is the combination of all colors. Also called subtractive color. Compare additive color.

Subtractive color mixing
1. The counterpart of additive color mixing. Subtractive color subtracts a certain color from white light, and what remains is the complementary color of the color that was subtracted. Subtractive color mixing can subtract several colors in succession. In that case the remaining color is no longer the complementary color of a single monochromatic color, but the additive mixture of the complementary colors of all the subtracted colors. Subtractive color mixing is more difficult to control than additive color mixing, because color filters cannot have sharply defined spectral distributions (cutoff filters), so the transmitted light is determined by the product of the transmission factors of the individual color filters, which leads to complex relationships across the entire spectrum. For this reason it is also called multiplication color mixing.
2. The color change regularity of pigments and dyes. Pigments and dyes show color because they reflect or transmit a certain amount of colored light. Mixing them in different proportions can produce a variety of colors, but at the same time the brightness decreases, so it is called subtractive color mixing.
3. The technology used on printers to print color images. The more times it is printed, the darker it becomes.

Subtractive primary color
Primary colors obtained by subtractive color mixing. For white-light filters, three subtractive primary colors can be obtained: magenta (removes green), yellow (removes blue), and cyan or blue-green (removes red).

Multiplication color mixing
An explanation of the actual effect of subtractive color mixing. In subtractive color mixing, color filters are used to obtain the subtractive primary colors: magenta (removes green), yellow (removes blue), and cyan (removes red). Because an actual color filter cannot have a sharply defined spectral distribution, for many wavelengths the transmitted light is determined by the product of the transmission factors of the color filters, which gives a complex relationship. For this reason it is also called multiplicative color mixing.

Maxwell color triangle
In 3-D color space, the triangle with the three positive semi-axis unit points as vertices. This triangle lies in the first octant. Every point representing a color in 3-D color space can be projected radially onto this triangle. A coordinate system can be established on this plane, and any point on it can be defined by two numbers (plane coordinates). Figure 20.4 gives an example.
Fig. 20.4 Maxwell colored triangle and coordinate system
Fig. 20.5 Maxwell equilateral colored triangle
Intersections used in RGB models (cubes) to construct chromaticity diagrams. This is the unit plane R + G + B = 1, which is an equilateral triangle (see Fig. 20.5 for an example). A color C is the intersection of the observed color position vector and the unit plane:

rC = R / (R + G + B),  gC = G / (R + G + B),  bC = B / (R + G + B)

where rC + gC + bC = 1.
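As a quick numerical illustration of the relations above, the following sketch projects an (R, G, B) triple onto the Maxwell triangle; this is plain Python, and the function name chromaticity is our own choice rather than anything defined in this handbook.

    def chromaticity(R, G, B):
        """Project an (R, G, B) triple onto the unit plane R + G + B = 1."""
        s = R + G + B
        if s == 0:
            raise ValueError("black (0, 0, 0) has no defined chromaticity")
        return R / s, G / s, B / s   # sums to 1 up to rounding

    # Example: a desaturated orange
    print(chromaticity(200, 120, 40))   # approximately (0.556, 0.333, 0.111)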
20.1.5 Chromaticity Diagram

Chromaticity diagram
A diagram that uses a 2-D graphical representation to represent a section of a 3-D color space. It represents the various chromaticities in a 2-D plane. Color values that lie on the same position vector (i.e., are linearly related) differ only in brightness; only the brightness information is lost in this projection, and all chrominance information is retained. Generally, it refers to the chromaticity diagram developed by CIE in 1931. A schematic diagram is shown in Fig. 20.6. Its shape is like a tongue (also called a shark-fin shape). Chromaticity is only defined inside the tongue.
Fig. 20.6 Chromaticity diagram
With the help of the chromaticity diagram, a color can easily be specified by the proportions of the three primary colors that make it up. In Fig. 20.6, the horizontal axis corresponds to the red color coefficient, the vertical axis corresponds to the green color coefficient, and the blue color coefficient corresponds to the direction pointing out of the paper, which can be obtained from z = 1 − (x + y). Each point on the tongue outline in the figure gives the chromaticity coordinates of a color in the spectrum (the wavelength unit is nm). The part corresponding to blue-violet is in the lower left of the figure, the green part is in the upper left, and the red part is in the lower right. The tongue profile can be considered as the trace left by a narrowband spectrum containing energy at a single wavelength as it sweeps across the range of 380–780 nm. It should be pointed out that the straight line connecting 380 nm and 780 nm corresponds to the purple series from blue to red, which is not in the spectrum. From the perspective of human vision, the perception of purple cannot be produced by a single wavelength; it requires mixing light of a shorter wavelength with light of a longer wavelength. So on the chromaticity diagram, the line corresponding to purple connects the extreme blue (containing only short-wavelength energy) and the extreme red (containing only long-wavelength energy).

CIE chromaticity diagram
Same as chromaticity diagram.

CIE 1976 uniform chromaticity scale diagram
See CIE 1976 L*u*v* color space.

CIE XYZ chromaticity diagram
Same as chromaticity diagram.
Uniform chromaticity chart
A graph showing uniform chromaticity scales with uniform chromaticity coordinates.

Uniform chromaticity scale
A color system in which the errors produced when matching a certain color are balanced everywhere. This goal has not yet been fully achieved. In 1960, CIE determined a uniform chromaticity scale for a 2° field of view in a dark peripheral field. Its abscissa u, ordinate v, and third coordinate w are related to the CIE tristimulus values X, Y, and Z (valid for 1°–4° fields of view) and the CIE chromaticity coordinates x, y, and z by

u = 4X / (X + 15Y + 3Z) = 4x / (−2x + 12y + 3)
v = 6Y / (X + 15Y + 3Z) = 6y / (−2x + 12y + 3)
w = 1 − (u + v)

The uniform chromaticity coordinates u, v, and w correspond to a set of uniform chromaticity scale tristimulus values U, V, and W, whose relationship to the X, Y, and Z tristimulus values is U = 2X/3, V = Y, and W = (3Y − X + Z)/2.

Uniform chromaticity coordinates
The three coordinates in the uniform chromaticity scale.

CIE colorimetric system
The standard CIE chromaticity system or standard XYZ colorimetric system. An international system based on the three primary color theory, which uses tristimulus values to represent a color. In early calculations of tristimulus values, the RGB (red, green, and blue) system was used; the choice of reference stimuli (primary colors) was somewhat arbitrary, and a negative stimulus value often appeared for pure spectral colors, which was inconvenient for calculation. To avoid this disadvantage, the reference stimuli selected in the standard stimulus system are virtual stimuli (colors that do not actually exist but have a definite position on the chromaticity diagram), eliminating negative stimulus values.

XYZ colorimetric system
Same as CIE colorimetric system.

Standard CIE colorimetric system
An internationally recognized system based on the three primary color theory that uses tristimulus values to represent any color. Also called the CIE system, that is, the standard XYZ system. Originally, the RGB (red, green, and blue) system was used to find the three stimulus values. The choice of reference stimuli (primary colors) was somewhat arbitrary, and there was often a negative stimulus value for pure spectral colors, which was inconvenient in calculation. To avoid this shortcoming, in the standard CIE chromaticity system the selected reference stimuli are all virtual stimuli (colors that do not actually exist but have a definite position on the chromaticity diagram) and are
represented by regular or bold letters XYZ; X and Z have only color and no light, so Y in the standard tristimulus value XYZ represents the brightness value. Chromaticity A collective term for hue and saturation (no brightness). It is the color information retained by eliminating the brightness information in the corresponding color. Chrominance 1. The part of the image and video that carries color information is also the color content in color stimulation. Luminance and chroma are combined to form a complete color signal; 2. One or two color axes in 3-D color space that distinguish between brightness and color. See chroma. Chroma keying Use only information from chrominance values to separate colored targets or backgrounds. Commonly used in media production to separate a moving silhouette from a controllable studio background to allow the insertion of another artificial background. Equivalent to background replacement.
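The following sketch illustrates the CIE 1960 uniform chromaticity scale relations given in the uniform chromaticity scale entry above; it is plain Python, and the function name and the example tristimulus values are our own illustrative choices.

    def ucs_1960(X, Y, Z):
        """CIE 1960 UCS coordinates (u, v, w) and tristimulus values (U, V, W)."""
        d = X + 15 * Y + 3 * Z
        u = 4 * X / d
        v = 6 * Y / d
        w = 1 - (u + v)
        U = 2 * X / 3
        V = Y
        W = (3 * Y - X + Z) / 2
        return (u, v, w), (U, V, W)

    # A white with Y normalized to 100 (illustrative numbers only)
    print(ucs_1960(95.0, 100.0, 108.9))

Note that u = U / (U + V + W) and v = V / (U + V + W), so the two sets of values returned above are mutually consistent.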
20.1.6 Diagram Parts

White point
The point where the saturation of each color in the chromaticity diagram is zero (point C in the figure of the chromaticity diagram).

Spectrum locus
On the chromaticity diagram, an arc parameterized by wavelength (each wavelength corresponds to a spectral color). It is the locus formed by connecting the points of the color coordinates of monochromatic light stimuli.

Pure spectrum
The entire series consisting of monochromatic light (single-frequency light). Generally, it refers to the part of the tongue outline on the chromaticity diagram that does not include pure purple.

Pure purple
A color obtained by mixing the two extreme monochromatic (red and blue) lights of the spectrum. "Pure" here refers to the most vivid colors. Pure purple forms a series (from red to blue), owing to the different red and blue components of its composition, and together with the pure spectral colors it forms a closed circle of hues (i.e., the tongue-shaped outline on the chromaticity diagram).

Pure light
A collective term for pure spectrum and pure purple.
Purple
A mixture of red and blue.

Purple line
In the chromaticity diagram, the line connecting the two extreme wavelengths (400 nm and 700 nm).

Purple boundary
The line between the red and blue endpoints on the chromaticity diagram. The trajectory of the pure spectrum colors forms a curve in the chromaticity diagram (or color space). The line (surface) joining the ends of this curve is called the purple boundary.

Virtual color
In the chromaticity diagram, the colors outside the pure spectrum and pure purple range (i.e., the part outside the tongue shape). Also called virtual stimulus. It exists only in the chromaticity diagram or in digital simulation and cannot be perceived directly by the eyes.

MacAdam ellipses
MacAdam used them to describe the equivalent stimulus regions (close to ellipses, within which the chromaticity appears very similar) that a normal human eye perceives on a chromaticity diagram. Figure 20.7 shows the distribution of 25 stimulus regions obtained by experiments. The area ratio between the smallest ellipse and the largest ellipse is about 0.94:69.4. As can be seen from the figure, people can distinguish more color nuances in the blue range than in the green range.

Fig. 20.7 Diagram of MacAdam ellipses

Excitation purity
The ratio of two lengths on a chromaticity diagram. The first length is the distance between the specified achromatic chromaticity point and the chromaticity point of the color under study, and the second length is measured along the same direction from the
previous point to the boundary point of the real-color region of the chromaticity diagram. If the extension of the line connecting the position M of a color in the chromaticity diagram and the position O of the white point intersects the edge of the chromaticity diagram at point S, then point S is the complementary color of the color at M, and the excitation purity is pe = OM/OS, which is the definition commonly used in the United States. Also called stimulus purity.

Imaginary primaries
Basic colors that do not correspond to any real color. They are the product of mathematical calculation. Using the tristimulus values of the XYZ color system, chromaticity values can be calculated to draw the spectrum locus and purple line of this color system. Note that the spectrum locus and purple line then lie completely within a right-angled triangle of allowed values in the chromaticity diagram (see the figure in chromaticity diagram). However, the tristimulus values of such a color system are not obtained through physical experiments but simply by transforming the tristimulus values of a color system whose primary lights are not virtual.

Complementary color wavelength
The wavelength of the monochromatic light that, mixed in a proper ratio according to the additive color mixing rule, matches a specified achromatic hue. In the chromaticity diagram, the wavelength of a color between a white point and a point on the purple line. In other words, if a line is drawn from a white point to any point on the purple line, the colors on that line are complementary colors. Each color generally has either a dominant wavelength or a complementary wavelength, but some colors have both.
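A minimal sketch of the excitation purity ratio pe = OM/OS defined above follows; the chromaticity coordinates used in the example are illustrative assumptions, not measured values.

    import math

    def excitation_purity(white, color, boundary):
        """pe = OM / OS, where O is the achromatic point, M the color under
        study, and S the boundary point in the same direction (all (x, y))."""
        return math.dist(white, color) / math.dist(white, boundary)

    # Illustrative chromaticity points only
    print(excitation_purity((0.333, 0.333), (0.45, 0.40), (0.64, 0.46)))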
20.2 Color Spaces and Models
In order to express color information correctly and effectively, a suitable color representation model needs to be established and selected (Gonzalez and Woods 2018; Koschan and Abidi 2008). Color models are built in color space, and some are called color space directly (Zhang 2017a).
20.2.1 Color Models Color model A coordinate system for the canonical representation of different colors. Among them, each color corresponds to a specific position in a coordinate system. It is also a model system composed of the spectrum of three light sources that can produce any desired color sensation based on the three primary colors theory. For example, three different light beams with a specific spectrum are projected to the same spatial point, and by changing the relative intensity of these three light beams, the observer
can see any color that he wants to see, that is, any desired color sensation. For different purposes, many different color models have been suggested and proposed, including RGB, CMY, CMYK, HSI, HSV, HSB, etc. See color representation system.

Color space
The space made up of the basic colors of a color model. Since a color can be described by three basic quantities, a color space is always a 3-D coordinate space, where the three axes correspond to the three basic quantities and each point represents a specific color. See color representation system.

Color similarity
The relative gap between two color values in a given color space representation. In perceptual psychology, the difference in color perceived under natural or controlled conditions. See brightness constancy.

Theorem of the three perpendiculars
A theorem in 3-D geometry that can be used for derivations in 3-D color space. Referring to Fig. 20.8, consider a straight line l that lies in the plane Π and a point P that does not lie in the plane Π. The foot of the perpendicular from P to the plane Π is P′. The line from P perpendicular to the line l intersects l at point P″. The theorem of the three perpendiculars states that the line segment P′P″ is perpendicular to the line l.

Fig. 20.8 Diagram of theorem of the three perpendiculars

Color system
1. Same as color model.
2. All non-luminous object colors, selected and arranged according to a certain rule, as distinct from a color group arbitrarily selected for practical purposes.

Color representation system
A 2-D or 3-D space used to express a set of absolute color coordinates. Typical examples are RGB and CIE.

Color conversion
In printing, it refers to converting the color information stored in the computer (mostly based on the RGB color model) into the colors of the inks or pigments used by the printer (based on the CMYK model).
Color space conversion
The process of transforming an input color space into an output color space. Color space conversion can be performed by applying a linear transformation to the γ-corrected components, which produces results that are acceptable for most applications. High-end format converters perform an additional inverse γ-correction before the linear transformation step and restore the γ-correction after the colors have been linearly mapped to the desired color space. This process is obviously more complicated and more expensive to implement.

Color normalization
A technique or process that normalizes the distribution of color values in a color image. The result is that the description of the image does not change with changes in lighting. A simple way to obtain illumination invariance is to represent each colored point in a color representation system by a unit-length vector instead of by its coordinates.

Color tolerance
When a test color is adjusted to match a specified color, the allowable difference range between the two colors.

Color management
The process of adjusting color values between different devices, such as RGB cameras, RGB displays, and RGB printers. Different devices have different characteristics; when the color values of one device are passed directly to another device, the displayed colors change, so corresponding adjustments and transformations are required.

Colorimetric model
A color model based on physical measurements of spectral reflectance.

Physiological model
A color model defined on the basis of the existence of three basic color-receptive cone cells in the human retina. The commonly used RGB model is a typical example.

Hardware-oriented model
A color model suited to image acquisition equipment and image display equipment. Typical examples include the RGB model, the CMY model, the I1, I2, I3 model, and the color models used for TV.

Perception-oriented color model
A type of color model that matches the characteristics of human visual perception. It is independent of image display equipment and is suitable for color image processing. Typical models include the HSI model, HSV model, L*a*b* model, etc.

L*a*b* model
A uniform color space model for color processing. It is suitable for applications close to natural lighting. Its characteristic is that it covers the entire visible chromaticity range and can accurately express the colors of various display, printing, and input devices.
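A minimal sketch of the unit-length normalization mentioned in the color normalization entry above follows; the use of numpy and the (H, W, 3) array layout are our own assumptions.

    import numpy as np

    def normalize_colors(img, eps=1e-12):
        """Replace every RGB pixel by a unit-length vector in the same direction,
        which is invariant to a per-pixel scaling of the illumination.
        img: float array of shape (H, W, 3)."""
        norm = np.linalg.norm(img, axis=2, keepdims=True)
        return img / (norm + eps)

    # Doubling the illumination leaves the normalized image (almost) unchanged
    img = np.random.rand(4, 4, 3)
    print(np.allclose(normalize_colors(img), normalize_colors(2 * img)))  # True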
Fig. 20.9 Cube of RGB color model
Fig. 20.10 A color image and its red, green, and blue component images
20.2.2 RGB-Based Models RGB model A most typical and commonly used color model for hard devices. It is a model closely related to the structure of the human visual system. According to the structure of the human eye, all colors can be regarded as different combinations of the three basic colors: red, green, and blue. The RGB model can be established in a rectangular coordinate system, where the three axes are R, G, and B, see Fig. 20.9. The space of the RGB model is a cube, the origin point corresponds to black, and the vertex farthest from the origin point corresponds to white. In this model, the gray values from black to white are distributed on the line from the origin to the vertex farthest from the origin, while the remaining points in the cube correspond to different colors, which can be represented by the vector from the origin to the point. Generally, for convenience, the cube is always normalized into a unit cube, so that all R, G, and B values are in the interval [0, 1]. A color image and its red, green, and blue component images (all grayscale images) are shown in Fig. 20.10.
sRGB color space
A standard color space for the Internet, proposed by the International Color Consortium (ICC) in 1996. It was developed for cathode-ray tube (CRT) displays under daylight viewing conditions, with D65 daylight as the reference. The tristimulus values of sRGB are a simple linear combination of the CIE XYZ values. They are also called Rec. 709 RGB values and can be calculated using the following formula:

R_sRGB = 3.2410 X − 1.5374 Y − 0.4986 Z
G_sRGB = −0.9692 X + 1.8760 Y + 0.0416 Z
B_sRGB = 0.0556 X − 0.2040 Y + 1.0570 Z
Among them, X, Y, and Z are taken from the XYZ colorimetric system.

Standard color values
A set of normalized color values. The defining equations (λ represents the wavelength, and the overbar denotes the CIE color matching functions) are

X = k ∫ φ(λ) x̄(λ) dλ
Y = k ∫ φ(λ) ȳ(λ) dλ
Z = k ∫ φ(λ) z̄(λ) dλ

where

k = 100 / ∫ S(λ) ȳ(λ) dλ

φ(λ) is the color stimulus function that causes the color stimulus in the eye, and S(λ) is the spectral power. The normalization factor k is chosen for body colors so that the standard color value Y of a pure matte white body is 100 under any type of light. In practice, the integration is replaced by a sum over a finite number of measurements. For non-luminous objects, it must be taken into account that the color measurement depends on the measurement geometry. Therefore, according to the standard CIE colorimetric system, three positive standard color values X, Y, and Z are assigned to each spectral signal.

Alychne
A plane through the origin of the axes of the RGB color space defined by CIE. Perceptual brightness measurements can be made in the direction orthogonal to this plane. To produce the same perceived brightness, different intensities are needed for different colors,
so the axis along which the overall brightness change is perceived is different from the main diagonal of the CIE-RGB space.

Safe RGB colors
A subset of true color that can be reliably displayed on different display systems, so it is also called the set of colors reliable for all systems. This subset has a total of 216 colors, which are obtained by combining values of R, G, and B each taken from six levels (i.e., 0, 51, 102, 153, 204, 255), as shown in Fig. 20.11.

Fig. 20.11 Safe RGB colors

CMY model
A color model for hard devices. The three variables C, M, and Y represent, respectively, the three complementary colors of light generated by superimposing pairs of the three primary colors of light: cyan (C), magenta (M), and yellow (Y). The three complementary colors can also be obtained by subtracting the corresponding primary colors from white light. A simple and approximate conversion from CMY to RGB is

R = 1 − C
G = 1 − M
B = 1 − Y

CMY
Three complementary colors: cyan (C), magenta (M), and yellow (Y).

CMYK model
A color model consisting of cyan, magenta, yellow, and black, used in actual printing (industry). It consists of the CMY model plus a fourth variable (i.e., black, K), as shown in Fig. 20.12. When color is absorbed by a medium (such as paint pigments), it is a subtractive model. Unlike the RGB model, which adds tones to black to produce particular colors, the CMYK model subtracts tones from white. In this
Fig. 20.12 Diagram of CMYK model
model, R, G, and B are all secondary colors. There are three main reasons for using black inks other than the three complementary colors in printing: (1) printing three colors of cyan, magenta, and yellow on top of each other will cause more fluid output and require longer drying time; (2) due to mechanical tolerances, the three color inks will not be printed in exactly the same position, so colored edges will appear on the black border; and (3) black ink is cheaper than color ink. CMYK Color space used in color printers. Consists of a subtractive primary color (the complementary color of the three primary colors) cyan C, magenta M, and yellow Y and a possible additional black K. CMYK composition Three subtractive primary colors (cyan C, magenta M, and yellow Y ) and black K are used in different combinations to produce a complete color spectrum. Process colors A color obtained by combining different proportions of cyan, magenta, yellow, and black. Compare spot colors. Spot colors The color of a pigment or color ink used in the printing industry as defined by the manufacturer with its brand name and index number. Spot color printing refers to a printing process that uses colors other than yellow, magenta, cyan, and black to reproduce the colors of the original. Compare process colors. Pantone matching system [PMS] A standardized color matching system, widely used in the printing industry. In this system, colors are indicated by Pantone names or numbers. This system can be used to help print colored dots. The color of these dots can generally be described by the
CMYK model, but the system can also be used for colors that cannot be mixed in the CMYK model. Four-color printing Color printing with CMYK models. It is a supplementary improvement to common CMY models. In theory, CMY is the complementary color of RGB, and they should be superimposed to output black. But in practice, superposition can only output turbid dark colors. Therefore, the publishing industry always adds a single black to form the so-called CMYK model for color printing. Full-color process Same as four-color process. Four-color process Also called full-color process. See four-color printing. Color printing Based on the principle of the three primary colors, the technique of using the subtractive color mixing method to reproduce the color of the scene is also called three-color printing. For example, printing colors on a white background (paper, plastic, metal, etc.) to absorb colors in some bands. The specific process is to increase the use of a set of subtractive primary color filters (magenta, yellow, cyan) when photographing the scene, obtain three negatives, make plates, and overlay the pigment with a color complementary to the color of the filter. Color sequence Refers to a specific order of cyan, magenta, yellow, and black when the four-color printer is output. In practice, you can adjust this sequence to obtain some special effects or obtain higher printing quality.
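Drawing on the XYZ-to-sRGB matrix given in the sRGB color space entry and the CMY relations given in the CMY model entry earlier in this section, the following hedged sketch converts a color between these representations. The undercolor-removal rule used to derive K is a common textbook choice and an assumption of ours, not something specified by the entries; numpy and the function names are also our own.

    import numpy as np

    # Linear XYZ -> sRGB matrix from the 'sRGB color space' entry above
    M_XYZ2SRGB = np.array([[ 3.2410, -1.5374, -0.4986],
                           [-0.9692,  1.8760,  0.0416],
                           [ 0.0556, -0.2040,  1.0570]])

    def xyz_to_srgb_linear(xyz):
        """Linear (no gamma) sRGB from CIE XYZ, clipped to [0, 1]."""
        return np.clip(M_XYZ2SRGB @ np.asarray(xyz, dtype=float), 0.0, 1.0)

    def rgb_to_cmy(rgb):
        """C = 1 - R, M = 1 - G, Y = 1 - B (values normalized to [0, 1])."""
        r, g, b = rgb
        return 1.0 - r, 1.0 - g, 1.0 - b

    def cmy_to_cmyk(c, m, y):
        """Pull the shared gray component out as black K
        (a common rule, assumed here rather than taken from the entries)."""
        k = min(c, m, y)
        if k >= 1.0:                       # pure black
            return 0.0, 0.0, 0.0, 1.0
        return (c - k) / (1 - k), (m - k) / (1 - k), (y - k) / (1 - k), k

    rgb = xyz_to_srgb_linear([0.4124, 0.2126, 0.0193])  # XYZ of an sRGB red
    print(rgb, cmy_to_cmyk(*rgb_to_cmy(rgb)))           # red -> magenta + yellow inks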
20.2.3 Visual Perception Models Hue, saturation, intensity [HSI] model A perception-oriented color model in color processing. Where H is hue, S is saturation, and I is intensity (corresponding to density or brightness). The human visual system usually uses three basic characteristic quantities to distinguish or describe color: brightness, hue, and saturation. These three quantities correspond to the three components of the HSI model. The color components in the HSI model can often be represented by triangles as shown in Fig. 20.13a. For any of the color point P, the value of H corresponds to the angle between the vector pointing from the center of the triangle to this point P and the R axis. The value of S at this point is proportional to the length of the vector pointing to that point, and the longer it gets, the more saturated it will be (maximum on the triangle edge). In this model, the value of I is measured along a straight line that passes through the center of the triangle and is perpendicular to the triangle plane
Fig. 20.13 HSI color triangles and HSI color entity
Fig. 20.14 A color image and its three component images of hue, saturation, and intensity
(the points on this line have only different gray levels and no color). The more it comes out of the paper, the whiter it gets, and the more it comes into the paper, the blacker it gets. When only plane coordinates are used, the HSI color triangle shown in Fig. 20.13a only represents chromaticity. If all three components of the HSI model are considered to form a 3-D color space, a double-pyramid structure as shown in Fig. 20.13b is obtained. The color point on the outer surface of the structure has a pure saturated color. Any cross section in the double pyramid is a chromaticity plane, similar to Fig. 20.13a. In Fig. 20.13b, the I of any color point can be represented by the height difference from the lowest black point (I at the black point is usually 0, and I at the white point is 1). It should be noted that if the color point is on the I axis, its S value is 0 and H is not defined. These points are also called singular points. The existence of singular points is a disadvantage of HSI model, and near the singular points, small changes in R, G, and B values may cause significant changes in H, S, and I values. A color image and its three component images of hue, saturation, and intensity are shown in Fig. 20.14.
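The HSI entry above describes the model geometrically but does not list conversion formulas; the following sketch uses one common formulation (an assumption on our part) to convert a normalized RGB triple.

    import math

    def rgb_to_hsi(r, g, b):
        """One common RGB -> HSI formulation (assumed; not from the entry).
        r, g, b in [0, 1]; returns hue in degrees, saturation, intensity."""
        i = (r + g + b) / 3.0
        s = 0.0 if i == 0 else 1.0 - min(r, g, b) / i
        num = 0.5 * ((r - g) + (r - b))
        den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
        if den == 0:                       # gray: hue undefined (singular point)
            h = 0.0
        else:
            theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
            h = theta if b <= g else 360.0 - theta
        return h, s, i

    print(rgb_to_hsi(0.8, 0.4, 0.2))   # an orange-ish pixel, hue near 20 degrees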
HSI space HSI model space. Intensity, hue, saturation [IHS] Same as hue, saturation, intensity. Hue [H] 1. Color: One of the important factors that constitute painting and arts and crafts. Various objects exhibit complex colors due to different amounts of absorption and reflection of light of different wavelengths. Colors often give people different feelings in paintings and arts and crafts. For example, red, orange, and yellow have warm feelings (warm colors), while cyan, blue, and purple have cold and quiet feelings (cold colors). 2. Hue: A name that describes the kind of hue that a color has. Hue is a common component of many color image models. One component common to HCV, HSB, HSI, and HSV models. It is an attribute related to the wavelength of the main light in the mixed spectrum. Represents color characteristics such as red, yellow, green, blue, and purple. It can be represented by a closed color circle sequence from red to yellow, to green, to blue, to purple, to magenta, and to red. Hue describes the difference between colored and achromatic colors with the same brightness (i.e., the same luminous flux or gray tone). It is also an attribute of visual perception, corresponding to whether the viewing field is similar to a perceived color, red, yellow, green, and blue, or a combination of two of them. From a spectral perspective, the hue can be associated with the dominant wavelength of the spectral power distribution. 3. Hue: The name of the color. A picture or photo is often composed of various colors, and the overall relationship between colors forms the color style. The main hue of a picture is its main tone. Hue, saturation, value [HSV] model A perception-oriented color model in color image processing. Where H is hue, S is saturation, and V is density or brightness value. HSV space HSV model space. Hue, chroma, value [HCV] model A perception-oriented color model in color image processing. Where H component represents hue, C component represents color purity, and V component represents brightness value. The HCV model and the HSI model are closer in concept but not identical in definition, but both are more suitable for image processing algorithms that use the human visual system to perceive color characteristics. Chroma A component of the HCV model. It is also the color part of the image and video signal, including hue and saturation, but it requires brightness to be visible at the same time.
Hue, saturation, brightness [HSB] model
A perception-oriented color model, where H is hue, S is saturation, and B is lightness (corresponding to brightness).

Hue, saturation, lightness [HSL] model
Same as hue, saturation, brightness model.

YCbCr color model
The most popular color model in digital video. One component represents brightness (Y), and the other two components are color difference signals: Cb (the difference between the blue component and a reference value) and Cr (the difference between the red component and a reference value).

Color difference signal
The result obtained by combining the three color signals and computing differences. For example, the R′, G′, and B′ signals are combined into the color difference signals (B′ − Y′) and (R′ − Y′); this process is also called matrixing.

YCbCr color space
In the field of digital video, an internationally standardized color space used to represent color vectors. This color space is based on the YCbCr color model and is different from the color spaces used in analog video recording. The YCbCr color space was originally developed for ordinary TV, and there is no format for HDTV.

YC1C2 color space
A color space developed by Kodak for the Photo CD system. The color gradation of this space is intended to be as close as possible to the gradation of film.
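The entries above define the luma and color difference signals only conceptually; the following sketch forms them from gamma-corrected R′G′B′ using BT.601 weights, which are a widely used choice but an assumption here rather than something fixed by the entries.

    def rgb_to_ycbcr_601(r, g, b):
        """Luma Y' and scaled color difference signals from R'G'B' in [0, 1]."""
        y  = 0.299 * r + 0.587 * g + 0.114 * b
        cb = 0.5 * (b - y) / (1.0 - 0.114)   # scaled (B' - Y') difference
        cr = 0.5 * (r - y) / (1.0 - 0.299)   # scaled (R' - Y') difference
        return y, cb, cr

    print(rgb_to_ycbcr_601(1.0, 0.0, 0.0))   # pure red: large positive Cr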
20.2.4 CIE Color Models CIE L*A*B* model A color representation model based on the international standard of color measurement recommended by CIE. It is designed to be device-independent and uniformly perceived (i.e., the distance between two points in this space corresponds to the perceived difference between the corresponding two colors). The L*A*B* model includes one luma component L* and two chroma components (A* from green to red and B* from blue to yellow). Compare CIE L*U*V* models. Lab color space See CIE 1976 L*a*b* color space. It can be expressed as follows:
L* = 116 (Y/Yn)^(1/3) − 16    if Y/Yn > 0.008856
L* = 903.3 (Y/Yn)             if Y/Yn ≤ 0.008856

a* = 500 [ f(X/Xn) − f(Y/Yn) ]
b* = 200 [ f(Y/Yn) − f(Z/Zn) ]

Among them, the function f is

f(x) = x^(1/3)            if x > 0.008856
f(x) = 7.787x + 4/29      if x ≤ 0.008856
Among them, (Xn, Yn, Zn) is the value of the reference white, which depends on the standard illuminant. If the standard illuminant D65 is used, Xn = 95.0155, Yn = 100.0000, and Zn = 108.8259.

CIE L*U*V* model
A color representation model based on the international standard of color measurement recommended by CIE. It includes one luma component L* and two chroma components (U* and V*). A given change in any component is roughly proportional to the corresponding perceived color difference. Compare the CIE L*A*B* model.

Luv color space
See CIE 1976 L*u*v* color space. It can be expressed as
L* = 116 (Y/Yn)^(1/3) − 16    if Y/Yn > 0.008856
L* = 903.3 (Y/Yn)             if Y/Yn ≤ 0.008856

u* = 13 L* (u′ − u′n)
v* = 13 L* (v′ − v′n)

Among them, the auxiliary quantities are

u′ = 4X / (X + 15Y + 3Z)
v′ = 9Y / (X + 15Y + 3Z)
u′n = 4Xn / (Xn + 15Yn + 3Zn)
v′n = 9Yn / (Xn + 15Yn + 3Zn)
Among them, (Xn, Yn, Zn) is the value of the reference white, which depends on the standard illuminant. If the standard illuminant D65 is used, Xn = 95.0155, Yn = 100.0000, and Zn = 108.8259.

CIE chromaticity coordinates
The coordinates in CIE's color space related to the three ideal standard colors X, Y, and Z. Any visible color can be represented by a weighted sum of these three ideal colors. For example, if a color C = w1X + w2Y + w3Z, the normalized values are

x = w1 / (w1 + w2 + w3)
y = w2 / (w1 + w2 + w3)
z = w3 / (w1 + w2 + w3)
Because x + y + z = 1, only two of the values, such as (x, y), are needed to calculate the third. These are the chromaticity coordinates.

Chromaticity coordinates
The 2-D coordinates of each colored point in the chromaticity diagram. Can also refer to a 2-D coordinate system that expresses color characteristics.

CIE 1976 L*a*b* color space [CIELAB]
A uniform color space. It was proposed by CIE in 1976 in order to obtain convenient color measurement values corresponding to the Munsell color system.

CIE 1976 L*u*v* color space [CIELUV]
A uniform color space. In it, straight lines are still mapped to straight lines (unlike in the CIE 1976 L*a*b* color space), so it is very suitable for additive mixing operations. Its chromaticity diagram is referred to as the CIE 1976 uniform chromaticity scale diagram.

Profile connection space [PCS]
A device-independent virtual color space. The PCS is defined by the XYZ colorimetric system or the Lab color space.

ICC profiles
A file that defines the mapping between a device source or target color space and the profile connection space (PCS).
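A minimal sketch of the CIELAB conversion following the relations in the Lab color space entry above is shown below; the function name is our own, and the default white point uses the D65 values quoted in that entry.

    def xyz_to_lab(X, Y, Z, white=(95.0155, 100.0, 108.8259)):
        """CIE 1976 L*a*b* from XYZ, per the 'Lab color space' entry above."""
        Xn, Yn, Zn = white

        def f(t):
            return t ** (1.0 / 3.0) if t > 0.008856 else 7.787 * t + 4.0 / 29.0

        yr = Y / Yn
        L = 116.0 * yr ** (1.0 / 3.0) - 16.0 if yr > 0.008856 else 903.3 * yr
        a = 500.0 * (f(X / Xn) - f(Y / Yn))
        b = 200.0 * (f(Y / Yn) - f(Z / Zn))
        return L, a, b

    print(xyz_to_lab(95.0155, 100.0, 108.8259))   # reference white -> (100, 0, 0)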
Fig. 20.15 A representation of opponent color space
20.2.5 Other Color Models

Opponent model
A color model based on perception experiments. A typical example is the HSB model.

Opponent colors
Two pairs of colors in the opponent color space: the red-green pair and the blue-yellow pair.

Opponent color system
A color representation system originally proposed by Hering, where each image is represented by three channels with contrasting colors: red-green, yellow-blue, and black-white. These three pairs of colors are three pairs of opponent colors.

Opponent color space
In the opponent color theory, the space in which the opponent neural processes (including the opponent color pairs red-green and blue-yellow, and the opponently organized light-dark system) operate. It is the foundation of human color vision. Figure 20.15 gives a representation of the opponent color space. The opponent color space has been used in color stereo analysis, color image segmentation, and approximation of color constancy.

Opponent color theory
A color vision theory. It regards the opposing neural processes, including the opponent colors red-green and yellow-blue and a similar light-dark system, as the cause of color vision. Three pairs of opposing colors are considered, namely, red and green, yellow and blue, and black and white. Opposition includes decomposition and facilitation. Decomposition distinguishes between light sensations (white, red, yellow) and dark sensations (black, green, blue). Facilitating effects cause changes in the substance on the retina. Black and white can only be decomposed (one way), while the other colors have both decomposition and facilitation (two way). The disadvantage of this theory is that it cannot explain color blindness.

Four-color theory
Same as opponent color theory.

Opponent color model
See opponent model.
Fig. 20.16 Diagram of Munsell color system
Munsell color model
In the field of art and printing, a commonly used color model for visual perception. In this model, different colors are represented by multiple color patches with unique hue, color purity, and value. These three attributes together also constitute a 3-D entity or 3-D space. However, the Munsell space is not perceptually uniform and cannot be combined directly according to the principle of color addition.

Munsell color system
1. A color system that uses the hue, lightness, and color density (also called chroma) specified by the Munsell color solid to represent body colors.
2. A color classification method, also called the color solid method. Figure 20.16 shows a schematic diagram of the solid. The vertical axis indicates lightness, with black at the bottom and white at the top; there are 11 levels from 0 to 10, of which only 1 to 9 are actually used. On a plane through the vertical axis, starting from the black-and-white axis, one hue is shown on each side, and the two are complementary colors. Each horizontal plane represents one lightness level, but it includes 40 hues, and each hue has a different number of colors from the black-and-white axis to the most saturated position (farthest from the axis). This three-dimensional shape is very irregular and contains a total of 960 colors. Observation of the various colors should be performed under daylight illuminant C (6770 K).

Munsell renotation system
In 1943, the Color Measurement Committee of the Optical Society of America modified the Munsell color system and developed this new color system.

Munsell value
The Munsell lightness value (corresponding to brightness) in the Munsell color system.

Munsell chroma
The value representing the saturation of colors in the Munsell color system.
Munsell color notation system
A system based on hue, lightness (value), and saturation (color density) for precisely specifying various colors and their relationships. The printed Munsell color book contains color chips indexed by these three attributes. The color of any unknown surface can be identified by comparing it with the colors in the book under given lighting and viewing conditions.

I1, I2, I3 model
A hardware-oriented color model determined empirically. Its three orthogonal color features/components can be expressed by combinations of R, G, and B:

I1 = (R + G + B)/3
I2 = (R − B)/2
I3 = (2G − R − B)/4

Uniform color model
A special color space model in which the Euclidean distance between two points is directly proportional to the perceived difference between the two colors corresponding to those points. In other words, in this space, the metric distance between colors and the perceptual distance between them are equal or proportional.

Uniform color space
A color space in which equal distances express equal perceived color differences. It is also a space with a uniform chromaticity scale.

Perceptually uniform color space
A color space in which Euclidean distance can be used to measure the difference between perceived colors. That is, in such a color space, the difference between two colors is proportional to the Euclidean distance between the corresponding colored points. CIELUV and CIELAB are commonly used perceptually uniform color spaces. They can be defined using the XYZ colorimetric system and the coordinates of the reference white.

Psychophysical model
A color model based on human perception of color, such as the HSI model.
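A direct sketch of the I1, I2, I3 features defined above follows; it is plain Python, and the function name is our own.

    def rgb_to_i1i2i3(r, g, b):
        """The I1, I2, I3 components from the entry above."""
        i1 = (r + g + b) / 3.0
        i2 = (r - b) / 2.0
        i3 = (2.0 * g - r - b) / 4.0
        return i1, i2, i3

    print(rgb_to_i1i2i3(0.9, 0.5, 0.1))   # (0.5, 0.4, 0.0)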
20.3 Pseudo-color Processing
The input of pseudo-color enhancement is a grayscale image, and the output is a color image. Pseudo-color enhancement artificially assigns different colors to regions with different gray values in the original grayscale image so as to distinguish them more clearly (Gonzalez and Woods 2008; Zhang 2009).
20.3.1 Pseudo-color Enhancement

Pseudo-color image processing
The process of enhancing a grayscale image for better viewing. The principle here is that small changes in gray level often obscure and hide areas of interest or content in an image. Because the human visual system can distinguish thousands of color tones and intensities (but generally fewer than 100 gray levels), replacing gray with color gives better visualization and enhances the ability to detect relevant details in the image. Typical techniques include pseudo-color transform, gray-level slicing, and pseudo-coloring in the frequency domain.

Pseudo-color enhancement
A color enhancement method that assigns different colors to areas of different gray values in the original grayscale image in order to distinguish these areas more clearly. From an image processing perspective, the input is a grayscale image and the output is a color image.

Pseudo-color
The color artificially assigned to the gray pixels in a grayscale image. Compare true color.

Gray-level slicing
A direct gray-level mapping enhancement technique. The purpose is to make a certain range of gray values more prominent in order to enhance the corresponding image area. It is a special case of piecewise linear transformation in which a specific range of brightness values is highlighted in the output image, while the other values either remain the same or are all mapped to a fixed (generally lower) gray level.

Intensity slicing
A pseudo-color enhancement method. Consider a grayscale image as a 2-D brightness function, cut the image brightness function with a plane parallel to the image coordinate plane, and divide the brightness function into two grayscale intervals (similar to thresholding). By assigning two different colors to the pixels belonging to these two intervals, an image enhancement effect is obtained (the image is divided into two parts represented by two colors).

Fig. 20.17 Intensity slicing diagram

Figure 20.17 shows a schematic cross section of intensity slicing (the horizontal axis is the coordinate axis, and the vertical axis is the gray-value axis). According to Fig. 20.17, for each input gray value, if it is below the slicing gray value lm (the section line corresponding to the slicing plane), a certain color Cm is
assigned, and another color Cm+1 is assigned above lm. This is equivalent to defining a transform from grayscale to color. Through this transform, the gray levels in a specific grayscale range are converted into a given color, and the original multi-gray-value image becomes a two-color image in which pixels with gray values greater than lm and pixels with gray values less than lm are easily distinguished. If the slicing plane is translated up and down, different discrimination results are obtained. The above method can also be extended to the case where multiple colors are used for enhancement.

Demosaicing
The process of converting an image with a single color value per pixel (collected with a common black-and-white camera) into an image with three color values per pixel (as collected with a color camera). See color image demosaicing.

Color image demosaicing
The super-resolution technology/process used in most still digital cameras; it removes the mosaic from the color filter array (CFA) samples and converts them to a full-color RGB image. Also refers to the interpolation of missing color values in the Bayer pattern to obtain valid RGB values for all pixels. Transforming known CFA pixel values into a full-color RGB image is a challenging process. Unlike common super-resolution processes (where small errors in the estimation of unknown values are often displayed as blur or aliasing), errors in the demosaicing process often lead to zipper-like (closely paired) artifacts that are easy for the eye to see as color or high-frequency patterns.

Bayer pattern demosaicing
See color image demosaicing.
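A hedged sketch of the intensity slicing scheme described above follows; numpy, the slice levels, and the colors chosen are all illustrative assumptions.

    import numpy as np

    def intensity_slice(gray, levels, colors):
        """Pseudo-color a grayscale image by intensity slicing: pixels below
        levels[0] get colors[0], pixels between levels[i-1] and levels[i]
        get colors[i], and so on. gray: 2-D uint8 array."""
        out = np.zeros(gray.shape + (3,), dtype=np.uint8)
        idx = np.digitize(gray, levels)      # which slice each pixel falls in
        for i, c in enumerate(colors):
            out[idx == i] = c
        return out

    gray = np.arange(256, dtype=np.uint8).reshape(16, 16)
    # Two slicing planes -> three colors (illustrative choices only)
    rgb = intensity_slice(gray, levels=[85, 170],
                          colors=[(0, 0, 255), (0, 255, 0), (255, 0, 0)])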
20.3.2 Pseudo-Color Transform Pseudo-color transform A pseudo-color enhancement method. The gray value of each pixel in the original image is processed with three independent transforms (using different pseudo-color transform functions, respectively), and the result is treated as a color component, so that different gray levels are mapped to different colors. Pseudo-color transform function The functions used to map grayscale to color in pseudo-color transform can take many forms. For example, using the three transform functions in Fig. 20.18, the pixels with a small gray value will mainly appear blue after transform (see Fig. 20.18c), and the pixels with a large gray value will mainly appear red (see Fig. 20.18a), and pixels with intermediate gray values will mainly appear green (see Fig. 20.18b).
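In the spirit of the three transform functions of Fig. 20.18 (blue dominating for small gray values, green for intermediate ones, red for large ones), the following sketch applies one possible set of ramps; the exact ramp shapes are our own assumptions.

    import numpy as np

    def pseudo_color_transform(gray):
        """Map a grayscale image to RGB with three independent transform
        functions (ramp shapes assumed, not prescribed by the entry)."""
        f = gray.astype(float) / 255.0
        r = np.clip(2.0 * f - 1.0, 0.0, 1.0)     # rises over the upper half
        g = 1.0 - np.abs(2.0 * f - 1.0)          # peaks at mid-gray
        b = np.clip(1.0 - 2.0 * f, 0.0, 1.0)     # falls over the lower half
        return (np.stack([r, g, b], axis=-1) * 255).astype(np.uint8)

    gray = np.arange(256, dtype=np.uint8).reshape(16, 16)
    rgb = pseudo_color_transform(gray)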
Fig. 20.18 Example of pseudo-color transform function
Pseudo-color transform mapping
See pseudo-color transform.

Pseudo-coloring
Techniques and processes for achieving pseudo-color. A way to give pixels colors that are not just the original scene colors but an interpretation of the data. A common purpose of pseudo-colorization is to label image pixels in a useful way. For example, a common pseudo-colorization assigns different colors according to the local surface-shape class. When used to pseudo-colorize aerial or satellite images of the earth, colors are assigned according to the type of ground cover (such as water or vegetation).

Colorization
The process of adding color to grayscale images or video content. It can be performed manually or with the aid of automated processing. For example, color can be added manually to a black-and-white (grayscale) image (in most cases, people use scribbles to indicate the desired color of each area, and the system fills the entire area with the corresponding color). This is an example of sparse interpolation.

Gray level to color transformation
A pseudo-coloring method. Each pixel of the input image is processed by three independent transformation functions, and the result of each function is fed to one color channel, as shown in Fig. 20.19. This actually builds a combined color image whose content can be modulated separately by the three transformation functions. These transformation functions are point transformation functions; that is, the result obtained for each pixel does not depend on its spatial position or on the gray levels of its neighboring pixels. Intensity slicing is a special case of gray-to-color transformation in which all the transformation functions are the same and have a step-like, monotonically increasing shape.

Pseudo-coloring in the frequency domain
A pseudo-coloring method implemented in the frequency domain with the help of filters. The basic idea is to assign different colors to regions according to the different frequency content of each region in the image. A basic block diagram is shown in Fig. 20.20. After the Fourier transform, the input image is divided into different frequency components by three different filters (low-pass, band-pass,
Fig. 20.19 Gray to color conversion process
Fig. 20.20 Framework of pseudo-coloring in the frequency domain
and high-pass filters can be used). The inverse Fourier transform is performed on the frequency components in each range, and then the results are processed by the enhancement (post-processing) method of the grayscale image. The images of each channel are input to the red, green, and blue input ports of the color display to show enhanced color images. Color mapping Transformation from grayscale to color in pseudo-coloring. HSI-based color map Color mapping in HSI color space. RGB-based color map Color mapping in RGB color space. It can be further divided into two categories: perceptually based color map and mathematical-based color map. Perceptually based color map Color mapping in a visually designed sequence based on the user’s preference for color.
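A hedged sketch of the framework of Fig. 20.20 follows: the spectrum is split with low-, band-, and high-pass filters, each part is inverse transformed and lightly post-processed, and the three results drive the R, G, and B channels. The Gaussian cutoffs, the normalization used as post-processing, and the assignment of filters to channels are all our own assumptions.

    import numpy as np

    def pseudo_color_frequency(gray, sigma_low=10.0, sigma_high=40.0):
        """Frequency-domain pseudo-coloring after Fig. 20.20 (sketch only)."""
        h, w = gray.shape
        F = np.fft.fftshift(np.fft.fft2(gray.astype(float)))
        v, u = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
        d2 = u ** 2 + v ** 2
        low = np.exp(-d2 / (2 * sigma_low ** 2))          # low-pass
        high = 1.0 - np.exp(-d2 / (2 * sigma_high ** 2))  # high-pass
        band = np.clip(1.0 - low - high, 0.0, 1.0)        # band-pass

        def back(Ffilt):
            img = np.abs(np.fft.ifft2(np.fft.ifftshift(Ffilt)))
            return img / (img.max() + 1e-12)              # simple post-processing

        # Channel assignment is arbitrary here
        r, g, b = back(F * high), back(F * band), back(F * low)
        return (np.stack([r, g, b], axis=-1) * 255).astype(np.uint8)

    gray = (np.random.rand(64, 64) * 255).astype(np.uint8)
    rgb = pseudo_color_frequency(gray)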
20.4 True Color Processing
A true color image is a vector image (Gonzalez and Woods 2018; Koschan and Abidi 2008). In processing, the image to be processed is originally in color, and the output result usually also needs to be in color (MacDonald and Luo 1999; Zhang 2017a).
20.4.1 True Color Enhancement

Full-color enhancement
Color enhancement applied to an image that is originally a color image. From an image processing perspective, the input is a color image and the output is also a color image.

True color
See true color images. Compare false color.

True color enhancement
Same as full-color enhancement.

Color enhancement
Abbreviation for color image enhancement.

Color image enhancement
Image processing technology that enhances color images. Both input and output are color images.

Color filtering for enhancement
Color image enhancement using template operations in the image domain. To ensure that the results are not discolored, the components need to be processed in the same way and the results combined.

Color filter array [CFA]
A device that subsamples color images to save space and cost. In the original image obtained using a color filter array, the color of each pixel contains only one of the three primary colors; that is, each pixel is monochrome. To construct a color filter array, the surface of the camera's electronic sensor array can be coated with an optical material that acts as a band-pass filter, passing only the light corresponding to a certain color component. The Bayer pattern is one of the most commonly used color filter arrays.

Sony RGBE pattern
A color filter array proposed by Sony Corporation. Figure 20.21 shows an example of an 8 × 8 array, where each pixel has four colors to choose from: three are the three primary colors of RGB, and the fourth, E, stands for emerald (which is in fact closer to cyan). They are combined into 2 × 2 squares to give full color. Compare with Bayer pattern.

Fig. 20.21 An 8 × 8 Sony RGBE pattern (rows alternate G R G R … and B E B E …)

Color slicing for enhancement
A true color enhancement technique. Specifically, the color space in which the image resides is divided into blocks, or the colors of objects are clustered, and different blocks or clusters are given different color changes to obtain different enhancement effects. Take the RGB color space as an example (other color spaces are handled similarly). Let the three color components corresponding to an image region W be
RW(x, y), GW(x, y), and BW(x, y). First calculate their respective averages (i.e., the coordinates of the cluster center in color space):

mR = (1/#W) Σ_{(x, y)∈W} RW(x, y)
mG = (1/#W) Σ_{(x, y)∈W} GW(x, y)
mB = (1/#W) Σ_{(x, y)∈W} BW(x, y)
In the above three formulas, #W represents the number of pixels in the region W. After calculating the average value, the distribution widths dR, dG, and dB of each color component can be determined. According to the average value and the distribution width, the color enclosing rectangle {mR - dR/2: mR + dR/2; mG - dG/2: mG + dG/2; mB - dB/2: mB + dB/2} can be determined. In practice, the average and distribution width are often obtained through interaction. Color slicing Map all colors outside the range of interest to a “neutral” color (such as gray) while maintaining a mapping of the primary colors for all colors in the range of interest. The color range of interest can be defined by a cube or sphere centered on a typical reference color. See color slicing for enhancement. Color noise reduction Reduction of noise in color images. Note that the effect of noise on color images depends on the color model used. Consider the original image using the RGB model. If only one component of the R, G, and B channels is affected by noise, converting it to another color model such as the HSI model or the YIQ model will expand the noise to all components. Linear noise reduction techniques (such as the mean filter) can be applied to each of the R, G, and B components to achieve good results.
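A hedged sketch of the color slicing computation above follows; numpy is our own choice, and the distribution widths are taken as four standard deviations, an assumption on our part since the entry notes they are often chosen interactively.

    import numpy as np

    def color_slice_box(img, mask):
        """Cluster center (mR, mG, mB) and color enclosing rectangle for the
        region W marked by `mask`. img: float (H, W, 3); mask: bool (H, W)."""
        region = img[mask]                        # (#W, 3) pixels of region W
        m = region.mean(axis=0)                   # (mR, mG, mB)
        d = 4.0 * region.std(axis=0)              # widths (dR, dG, dB), assumed
        lo, hi = m - d / 2.0, m + d / 2.0         # the enclosing rectangle
        inside = np.all((img >= lo) & (img <= hi), axis=-1)
        return m, (lo, hi), inside                # `inside` can drive the enhancement

    img = np.random.rand(32, 32, 3)
    mask = np.zeros((32, 32), dtype=bool)
    mask[8:16, 8:16] = True                       # region W chosen interactively
    m, box, inside = color_slice_box(img, mask)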
Color fringe
Irregular short lines composed of multiple colors, usually generated at the edges of monochrome targets by filtering or interpolating color images.

Bleeding
When impulse noise points appear in only one color component of a color image and lie close to an edge position in the image, filtering the image with a median filter will move the edge toward the noise points, so that the filtering result contains new colors at the edge that did not exist originally; a phenomenon caused by impulse noise interference.

Color fusion
The combination of different colors. The original color sensing characteristics of the constituent colors disappear, and a new color appears that is different from any of the participating color sensations.

Color grading
The process of changing or enhancing color information in an image or video content. Commonly used for old video content, it is a popular color quality improvement technology for older material. Recordings are generally converted to analog media before replay and then reissued for viewing with modern hardware. See color correction and image enhancement.
20.4.2 Saturation and Hue Enhancement Saturation Enhancement In true color enhancement, a method that enhances only saturation. Specifically, the image can be transformed into the HSI space first, and the S component image can be enhanced by the method of enhancing the grayscale image, so as to finally obtain the effect of color enhancement. Saturation [S] 1. The ratio of reaching the upper limit of a dynamic range. For example, for an 8-bit monochrome image, when intensity values greater than 255 levels are recorded, intensity saturation will occur. At this time, all intensity values greater than 255 levels will be recorded as 255 (the maximum possible value). 2. One component of HSI model and HSB model. Corresponding to the purity of a certain hue, the pure color is completely saturated, and the saturation gradually decreases with the addition of white light. Saturation describes the purity of a color, or the degree to which a pure color is diluted by white light. As the saturation decreases, the colors appear slightly faded. 3. The basic characteristics of color vision that are the same as hue and lightness/ brightness. It is a characteristic of people’s perception of color, that is, the concentration of various color vision. It is also the degree of difference between a certain color and gray with the same brightness. A certain brightness color, the
farther away it is from gray with the same brightness, the more saturated it is, that is, the more vivid the hue that can be seen in a certain color. For example, citrus and sand grains may both be the same orange color, but citrus has a greater brightness, that is, it has more orange. Brightness can be used exclusively for chromaticity. The word “saturated” is easily confused with other issues. The saturation of color vision mainly depends on the purity of light, the saturation of monochromatic light is the highest, and the saturation of polychromatic light is small. 4. The formal name used in everyday language to describe words such as “darkness/ lightness” often used in color. For example, red tints from light pink to deep red are considered red with different saturation: pink has lower saturation, and deep red has higher saturation. 5. In the description of object shape, a shape descriptor that can describe both the compactness (compactness) of the object and the complexity of the object. What is considered here is how full the target is in its minimum enclosing rectangle. Specifically, it can be calculated by the ratio of the number of pixels belonging to the target to the number of pixels contained in the entire rectangle. Because relatively compactly distributed targets often have simpler shapes, saturation reflects the complexity of the target to a certain extent. 6. “Color of a region judged in proportion to its brightness.” It is often translated into a description of the whiteness of a light source. From a spectrum perspective, the more concentrated the power distribution of the radiation spectrum of a light source is at a wavelength, the more saturated its associated color is. Adding white light (i.e., light with all wavelengths of energy) results in color desaturation. Saturability Same as saturation. Relative saturation contrast An index describing the differences between adjacent pixels or sets of pixels in a color image. It is a measure that describes the relationship between saturation values in various regions. In color images with low brightness contrast, certain details can be discerned based on the difference in color saturation. The relationship between the saturation values in a color image is called relative color contrast. Hue enhancement A method of true color enhancement that enhances only the hue. The image can be transformed into HSI space first, and then the H component can be enhanced by the method of enhancing the grayscale image, and finally the color enhancement effect can be obtained. Tonal adjustment 1. A method or process of performing point operations on the contrast or gray scale in an image to make it more attractive or easier to interpret. 2. Use image enhancement technology to change the brightness of the image. Here the hue generally refers to different shades of gray.
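A minimal sketch of the saturation enhancement and hue enhancement entries above (illustrative only: it uses the HSV model via matplotlib's color utilities in place of the HSI model mentioned in the text, and the gain and shift values are arbitrary):

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def enhance_saturation_hue(rgb, s_gain=1.4, h_shift=0.0):
    """Enhance saturation (and optionally shift hue) of an RGB image.

    rgb     : float array of shape (H, W, 3), values in [0, 1]
    s_gain  : multiplicative gain applied to the saturation channel
    h_shift : additive hue shift in [0, 1), wraps around
    """
    hsv = rgb_to_hsv(rgb)
    hsv[..., 0] = (hsv[..., 0] + h_shift) % 1.0         # hue enhancement
    hsv[..., 1] = np.clip(hsv[..., 1] * s_gain, 0, 1)   # saturation enhancement
    return hsv_to_rgb(hsv)

img = np.random.rand(32, 32, 3)
enhanced = enhance_saturation_hue(img, s_gain=1.5, h_shift=0.05)
```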
Color calibration The process and technique of correcting the color displayed or output by the device according to certain standards (such as the color obtained using a density meter). Compare color correction. Color correction 1. Adjustments to obtain color constancy. 2. Any change made on the color of the image, it can be no standard. Compare color calibration. See gamma correction. Tone mapping The grayscale mapping required to display the obtained (with high dynamic range) radiation intensity map on a screen or printer with a lower color gamut (smaller dynamic range). Tone mapping techniques require computational or spatially varying transfer functions, or reducing the gradient of an image to meet the dynamic range of a display or printing device. Gradient domain tone mapping Tone mapping in the gradient domain of a color image Scale selection in tone mapping The operator’s scale is selected to locally deepen or lighten the tone of the image. Interactive local tone mapping Tone mapping with human-computer interaction. The image is tone mapped with a set of sparse, user-drawn adjustment instructions, in which a locally weighted least squares variation problem is minimized. Color remapping An image transform in which each original color is replaced with another color from a color map. If the image has indexed colors, this operation is very fast and can also provide special graphics effects. Lookup Table Given a limited set of input values {xi}, the lookup table records the values {(xi, f (xi))}, so that the value of the function f(•) can be obtained by looking directly instead of each calculation. Lookup tables can be used for standard functions of color remapping or integer pixel values (i.e., the logarithm of a pixel value) Gamut mapping The mapping from the color space coordinates of a source image to the destination/ output color space coordinates, the function and purpose of this operation are to compensate for the difference in performance between the source and output media in the color gamut. Map the source signal to a display that meets the display requirements. This requires maintaining color saturation within the boundaries of the target color gamut to maintain relative color intensity and compressing colors outside the color gamut
from the source into the target color gamut. Gamma correction is often used in practice to achieve this. Gamut The range of colors. In image processing, the color gamut generally refers to those colors allowed for a particular display device. Posterize Converts colors in an image. A process or technique that can make the transition between different colors more intense and the differences between adjacent tones more visible, ultimately reducing the number of colors used in the image. For example, this method can be used to transform a naturally smooth image into an effect similar to a poster or advertisement, so it is also called posterization. Hue angle Used to define the amount of difference in hue perception. First find the angles corresponding to the Luv color space and Lab color space, respectively:
Then, the hue angle h_uv or h_ab can be obtained as follows:

h_{ij} = \begin{cases} \phi_{ij} & \text{numerator} > 0,\ \text{denominator} > 0 \\ 360^{\circ} - \phi_{ij} & \text{numerator} < 0,\ \text{denominator} > 0 \\ 180^{\circ} - \phi_{ij} & \text{numerator} > 0,\ \text{denominator} < 0 \\ 180^{\circ} + \phi_{ij} & \text{numerator} < 0,\ \text{denominator} < 0 \end{cases}
Among them, ij corresponds to uv or ab, the numerator refers to v or b, and the denominator refers to u or a.
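In code, the four quadrant cases above collapse into a single call to a four-quadrant arctangent. The sketch below (illustrative; the sample a*, b* values are made up) computes the hue angle in degrees in the range [0°, 360°):

```python
import numpy as np

def hue_angle(numerator, denominator):
    """Hue angle in degrees in [0, 360): numerator = b* (or v*), denominator = a* (or u*)."""
    angle = np.degrees(np.arctan2(numerator, denominator))  # in (-180, 180]
    return np.mod(angle, 360.0)                             # fold into [0, 360)

# One (a*, b*) pair per quadrant
a = np.array([10.0, -10.0, -10.0, 10.0])
b = np.array([10.0, 10.0, -10.0, -10.0])
print(hue_angle(b, a))   # approximately [45., 135., 225., 315.]
```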
20.4.3 False Color Enhancement False color enhancement A special class of full-color enhancement methods. Its input and output are both color images. It takes advantage of the human eye’s different sensitivity to different wavelengths of light. In false color enhancement, the color value of each pixel in the original color image is linearly or nonlinearly mapped to different positions in the color space one by one. According to requirements, some interesting parts in the original color image can be made to be completely different from the original and also very different from people’s expectations (or unnatural inconsistent) false colors, which can make it more obvious and easy to get attention. For example,
converting red-brown wall tiles to green (the human eye is most sensitive to green light) can improve the recognition of its details. False color The color caused by the physical optics effects of minerals. It mainly includes ochre, halo, color change, etc. Choropleths A variant of false color that is used to visualize information about targets in a single grayscale channel. Pointillist painting Originally refers to a form of painting pioneered by pointillisme. Now refers to the arrangement of visual patterns with solid color dots to form a picture. This method can be used to false color tint selected pixels of a hidden pattern in a digital image. The basic method is to replace selected pixels with false colors to highlight hidden image modes. Usually this image pattern is not visible without false coloring. Stippling Use colored points to draw/render artistic images. Pointillisme A painting form/method in which solid color dots are applied to the canvas (filling the area with dots) so that they blend in the viewer’s eyes. It was a neo-impressionist art that became popular in the 1880s. Also refers to groups that use this method of painting.
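A minimal sketch of one way to produce a false color effect (illustrative only; the handbook does not prescribe this particular mapping): permuting the color channels so that, for example, originally red content is displayed in the green channel, to which the eye is most sensitive.

```python
import numpy as np

def false_color(rgb, permutation=(2, 0, 1)):
    """Simple false color enhancement by permuting the color channels.

    With permutation (2, 0, 1) the input R channel is shown as output G,
    input G as output B, and input B as output R, so originally red
    regions appear green in the result.
    """
    return rgb[..., list(permutation)]

img = np.random.rand(64, 64, 3)
fc = false_color(img)   # red content now shown in the green channel
```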
20.4.4 Color Image Processing Color image processing Image processing techniques and processes that the input and/or output are color images. Color pixel value The color intensity or brightness of a pixel. Actual color The exact color values that a pixel has when using any color model. Separating color image channels The three color channels in a color image are separated to obtain three monochrome (grayscale) images representing the values of the three channels. Color channels A special but commonly used channel.
Chroma subsampling A process or technique that reduces the amount of data required to represent a chroma component without significantly affecting the visual quality of the results. This is based on the fact that the human visual system is less sensitive to changes in color than sensitive to changes in brightness. See spatial sampling rate. Color halftone filter Filter that simulates halftone printing. Specifically, each color channel is divided into a large number of small rectangles, and then these rectangles are replaced with circles. To mimic the effect of the halftoning technique, the size of each circle is proportional to the brightness of each rectangle. Color halftoning See dithering. Color image restoration See image restoration. Color transformations Let f(x, y) and g(x, y) be color input and output images, where a single pixel value is a three-element value (i.e., the R, G, and B values corresponding to the pixel) and T represents the transformation function, Then the color transformation can be written as gðx, yÞ ¼ T ½ f ðx, yÞ Color lookup table [CLUT] A special color index table. It is mostly used to convert a 24-bit color image into an image represented by a smaller number of colors. Color quantization The number of colors in an image is reduced by selecting a subset of the colors in the original image so that it can be displayed by means of a simple display device. The side effect of this method is to allow the image to be represented with fewer bits, thereby obtaining the effect of image compression. Figure 20.22 shows a set of examples, where Fig. 20.22a is a true color image (24 bits); Fig. 20.22b is a
Fig. 20.22 Effects of different color quantization
256-color image (8 bits); Fig. 20.22c is a 16-color image (4 bits); and Fig. 20.22d is an 8-color image (3 bits). Color feature Features that incorporate color information into a description of an image or a region of interest therein. Color descriptor Descriptor used to describe the nature of the target color.
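The following sketch (a minimal, illustrative uniform quantizer; practical systems more often select a palette, for example by median cut) reduces an 8-bit-per-channel image to a small number of levels per channel, reproducing the kind of effect shown in Fig. 20.22:

```python
import numpy as np

def uniform_color_quantize(rgb8, bits_per_channel=1):
    """Quantize an 8-bit RGB image to `bits_per_channel` bits per channel.

    bits_per_channel = 1 keeps 2 levels per channel, i.e. an 8-color image
    (3 bits per pixel in total), similar to Fig. 20.22d.
    """
    levels = 2 ** bits_per_channel
    step = 256 // levels
    # Map each value to the center of its quantization bin
    return (rgb8 // step) * step + step // 2

img8 = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
img_8colors = uniform_color_quantize(img8, bits_per_channel=1)   # 8 colors
img_coarse = uniform_color_quantize(img8, bits_per_channel=4)    # 16 levels per channel
```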
20.4.5 Color Ordering and Edges Vector ordering 1. Broadly speaking, the method and process of sorting vectors. Currently, the commonly used methods include conditional ordering method, marginal order ing method, and reduced ordering method. They are all subordering methods. 2. In a narrow sense, on the basis of subordering, a method of sorting vectors using similarity measures is obtained. First calculate the ordinal value of the vector:
Rðf i Þ ¼
N X s fi, f j j¼1
Among them, N is the number of vectors, and s is a similarity function (also can be defined by distance). The corresponding vectors can be sorted according to the value of Ri ¼ R(fi). This method mainly considers the internal connection between the vectors. The output of the vector median filter obtained according to the vector ordering is the vector with the smallest distance sum from the other vectors in the set of vectors, that is, the vector that is ranked first in the ascending order. Subordering A practical sorting method for pixel vector values, such as in color images. Because vectors have both size and direction, there is no unambiguous, general, and widely accepted method of vector pixel ordering. Commonly used subordering methods include conditional ordering, marginal ordering, and reduced ordering. Marginal ordering A method for sorting vector data. Vector pixel values can be sorted when median filtering a color image. In marginal ordering, each component needs to be sorted independently, and the final sorting is performed based on the pixel values formed by the same sequence values of all components. For example, given a set of five color pixels: f1 ¼ [5, 4, 1]T, f2 ¼ [4, 2, 3]T, f3 ¼ [2, 4, 2]T, f4 ¼ [4, 2, 5]T, and f5 ¼ [5, 3, 4]T. Sorting them by marginal ordering, get () indicates sorting result)
f5, 4, 2, 4, 5g ) f2, 4, 4, 5, 5g f4, 2, 4, 2, 3g ) f2, 2, 3, 4, 4g f1, 3, 2, 5, 4g ) f1, 2, 3, 4, 5g That is, the obtained sorting vector (take the same column, i.e., the same order value) is f1 ¼ [2, 2, 1]T, f2 ¼ [4, 2, 2]T, f3 ¼ [4, 3, 3]T, f4 ¼ [5, 4, 4]T, and f5 ¼ [5, 4, 5]T. This results in a median vector of [4, 3, 3]T. Note that this is not an original vector, that is, the median value obtained by the marginal ordering method may not be one of the original data. Conditional ordering A method for sorting vector data. Vector pixel values can be sorted when median filtering a color image. In conditional ordering, one of the components is selected for scalar sorting (the second component can also be considered when the values of the component are the same), and the pixel values (including the values of other components) are ordered according to this order. For example, given a set of five color pixels: f1 ¼ [5, 4, 1]T, f2 ¼ [4, 2, 3]T, f3 ¼ [2, 4, 2]T, f4 ¼ [4, 2, 5]T, and f5 ¼ [5, 3, 4]T. According to the conditional ordering, if the third component is selected for ordering (the second component is selected next), the resulting sorting vectors are f1 ¼ [5, 4, 1]T, f2 ¼ [2, 4, 2]T, f3 ¼ [4, 2, 3]T, f4 ¼ [5, 3, 4]T, and f5 ¼ [4, 2, 5]T. So the median vector is [4, 2, 3]T, which is an original vector. Thus, there is f5, 4, 2, 4, 5g ) f2, 4, 4, 5, 5g f4, 2, 4, 2, 3g ) f2, 2, 3, 4, 4g f1, 3, 2, 5, 4g ) f1, 2, 3, 4, 5g Conditional ordering does not treat each component equally (only one of the three components is considered, or one of the three components is given priority), so it will bring the problem of bias. Reduced ordering A method for sorting vector data. Can be used to sort vector pixel values when performing median filtering on color images. In reduced ordering, all pixel values (vector values) are converted into scalars using a given simplified function for combination and then ordering. Note, however, that the reduced ordering result cannot be interpreted the same as the scalar sorting result, because vectors cannot be strictly compared in size, or there is no absolute maximum or minimum vector. If the similarity function or distance function is used as a simplified function and the results are sorted in ascending order, the vector that is ranked first in the sorted result is considered to be closest to the “center” of the vector data set participating in the sorting, and the last one is outfield point. For example, given a set of five color pixels: f1 ¼ [5, 4, 1]T, f2 ¼ [4, 2, 3]T, f3 ¼ [2, 4, 2]T, f4 ¼ [4, 2, 5]T, and f5 ¼ [5, 3, 4]T. According to reduced ordering, if the distance
function is used as the simplification function, that is, ri ¼ {[f – fm]T[f – fm]}1/2, where fm ¼ {f1 + f2 + f3 + f4 + f5}/5 ¼ [4, 3, 3]T, we can get r1 ¼ 2.45, r2 ¼ 1, r3 ¼ 2.24, r4 ¼ 2.24, and r5 ¼ 2.45, and the resulting sorted vectors are f1 ¼ [4, 2, 3]T, f2 ¼ [5, 3, 4]T, f3 ¼ [4, 2, 5]T, f4 ¼ [5, 4, 1]T, and f5 ¼ [2, 4, 2]T. The resulting median vector is [4, 2, 3]T. Note that this is not an original vector, i.e., the median is not one of the original vector data. Vector rank operator A simple edge detection operator for color edges. A color image is regarded as a vector, which can be represented by the distance vector value function C: Z2 ! Zm, where m ¼ 3 for a three-channel color image. Let W be the window containing n pixels (color vectors) on the image function. If reduced ordering is used for all color vectors in the window, then xi indicates the i-th vector in this sequence. Vector ordering can be expressed as VR ¼ kxn xl k Vector sorting describes the situation in which the “vector wild points” deviate from the highest order of the median in the window in a quantitative manner. Therefore, a small value is taken in a uniform image area because the gap between the vectors is small. In contrast, the operator has a larger response value to the edge because xn will be selected from the vector along the edge (smaller field) and xl will be selected from the vector on the other side (larger field). By giving a threshold value to the vector ranking result, the edge in the color image can be determined in this way. Color edge Edges in color images. There are several different definitions of color edges, such as: 1. The simplest definition is to use the edge of the brightness image corresponding to the color image as the edge of the color image. This definition ignores discontinuities in hue or saturation. For example, when two targets with the same brightness but different colors are placed side by side, the geometric boundary between the two targets cannot be determined in this way. Because a color image contains more information than a grayscale image, people want more color information from color edge detection. However, this definition does not provide more new information than gray edge detection. 2. If at least one color component has edges, the color image has edges. According to this monochrome-based definition, no new edge detection method is required when detecting the edges of a color image. This definition leads to accuracy issues in determining edges in a single color channel. If the edge detected in the color channel is shifted by one pixel, then combining the three channels may produce a wide edge. In this case, it is difficult to determine which edge position is accurate. 3. When the edge detection operator is applied to a single color channel and the sum of the results is greater than a threshold, an edge is considered to exist. This
definition is convenient, straightforward, and commonly used when accuracy is not emphasized. For example, color edges can be calculated by summing the absolute values of the gradients of the three color components. If the sum of the absolute values of the gradient is greater than a certain threshold, it is judged that a color edge exists. This definition is very dependent on the basic color space used. Pixels determined as edge points in one color space are not guaranteed to be determined as edge points in another color space (and vice versa). 4. Consider the relationship between the components in the vector, and use the differentiation of the three-channel color image to define the edges. Because a color image represents a function of vector values, the discontinuity of color information can and must be defined using vector values. For a color pixel or color vector C(x, y) ¼ [u1, u2, . . ., un]T, if the color image function changes at position (x, y), use the equation ΔC(x, y) ¼ JΔ(x, y), and then the direction with the greatest change or discontinuity in the color image is represented by the eigenvector JTJ corresponding to the eigenvalue. If the change exceeds a certain value, this indicates that there are pixels corresponding to the color edges. Color edge detection Edge detection in color images or edge detection processes in color images. A simple method is to combine (such as add) the edge intensities in each individual RGB color plane. A simple method is to treat each color channel as a gray image, detect separately the gray edges in the three images, and then combine the results. It should be noted that if the results of the three detectors are simply added up, they may cancel each other out; if they are ANDed together, they may produce thicker edges. A better method is to calculate the “direction energy” in each channel, that is, use a secondorder steerable filter to combine them and determine the best combined direction. Another method is to estimate the color statistics in the neighborhood around each pixel and use more sophisticated techniques to compare the statistics between regions to determine the edges. In the latter method, other image features (such as texture) can also be combined. Derivative of a color image Vector derivative of each pixel position in a color image. Consider a color image C (x, y), the differential of which can be given by the Jacobian matrix J, where J includes the first-order partial differentiation of each vector component. For C (x, y) ¼ [u1, u2, u3]T, the derivative at position (x, y) is given by the equation ΔC (x, y) ¼ JΔ(x, y). Can be written as 2
J = \begin{bmatrix} \dfrac{\partial u_1}{\partial x} & \dfrac{\partial u_1}{\partial y} \\ \dfrac{\partial u_2}{\partial x} & \dfrac{\partial u_2}{\partial y} \\ \dfrac{\partial u_3}{\partial x} & \dfrac{\partial u_3}{\partial y} \end{bmatrix} = \begin{bmatrix} \mathrm{grad}(u_1) \\ \mathrm{grad}(u_2) \\ \mathrm{grad}(u_3) \end{bmatrix} = \begin{bmatrix} u_{1x} & u_{1y} \\ u_{2x} & u_{2y} \\ u_{3x} & u_{3y} \end{bmatrix} = \left[ \mathbf{C}_x, \mathbf{C}_y \right]

where C_x = [u_{1x}, u_{2x}, u_{3x}]^T and C_y = [u_{1y}, u_{2y}, u_{3y}]^T.
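To make the reduced ordering and vector median filter ideas of this subsection concrete, the sketch below (illustrative only) ranks a set of color vectors by the sum of their distances to all other vectors and returns the first-ranked vector, using the five example pixels given in the entries above:

```python
import numpy as np

def vector_median(vectors):
    """Vector median by reduced ordering.

    vectors : array of shape (N, 3); the vector whose summed distance to
              all other vectors is smallest (rank 1 in the ascending
              reduced ordering) is returned.
    """
    v = np.asarray(vectors, dtype=float)
    # Pairwise Euclidean distances, then the distance sum for each vector
    diff = v[:, None, :] - v[None, :, :]
    dist_sum = np.sqrt((diff ** 2).sum(axis=-1)).sum(axis=1)
    return v[np.argmin(dist_sum)]

pixels = np.array([[5, 4, 1], [4, 2, 3], [2, 4, 2], [4, 2, 5], [5, 3, 4]])
print(vector_median(pixels))   # always one of the original vectors
```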
Color Harris operator The result of generalizing the Harris operator to color images. When detecting corner positions in a color image, step (1) of Harris corner detection for a grayscale image is extended to the following step (1′), which computes the autocorrelation matrix of the color image (the other steps remain unchanged):

(1′) For each pixel position in the color image C(x, y), calculate the autocorrelation matrix M′:

M' = \begin{bmatrix} M'_{11} & M'_{12} \\ M'_{21} & M'_{22} \end{bmatrix}

where

M'_{11} = S_\sigma\left(R_x^2 + G_x^2 + B_x^2\right), \qquad M'_{22} = S_\sigma\left(R_y^2 + G_y^2 + B_y^2\right)

and

M'_{12} = M'_{21} = S_\sigma\left(R_x R_y + G_x G_y + B_x B_y\right)
Among them, Sσ (•) represents Gaussian smoothing that can be achieved by convolving with a Gaussian window (such as standard deviation σ ¼ 0.7, the radius of the window is generally 3 times σ).
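A sketch of step (1′) above (illustrative only; SciPy's Gaussian filter stands in for S_σ(·), simple finite differences are used for the channel derivatives, and the corner-response constant k is the usual empirical value, which the entry above does not specify):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def color_autocorrelation_matrix(rgb, sigma=0.7):
    """Per-pixel entries M'11, M'12 (= M'21), M'22 of the color
    autocorrelation matrix used by the color Harris operator."""
    # x- and y-derivatives of each channel (R, G, B)
    dy, dx = np.gradient(rgb, axis=(0, 1))
    Rx, Gx, Bx = dx[..., 0], dx[..., 1], dx[..., 2]
    Ry, Gy, By = dy[..., 0], dy[..., 1], dy[..., 2]
    S = lambda a: gaussian_filter(a, sigma)       # Gaussian smoothing S_sigma
    M11 = S(Rx**2 + Gx**2 + Bx**2)
    M22 = S(Ry**2 + Gy**2 + By**2)
    M12 = S(Rx*Ry + Gx*Gy + Bx*By)
    return M11, M12, M22

img = np.random.rand(64, 64, 3)
M11, M12, M22 = color_autocorrelation_matrix(img)
k = 0.04                                          # empirical Harris constant
response = (M11 * M22 - M12**2) - k * (M11 + M22)**2
```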
20.4.6 Color Image Histogram Histogram processing for color image Processing that extends the concept of histograms to color images. For each color image, the three component histograms each include N (typically 4 N 256) bins. For example, suppose a color model is used to express an image, and the model (such as the HSI model) allows the luminance component and the chrominance component to be separated. Histogram techniques (such as histogram equalization) can be used for the image: the luminance (intensity I) component is processed, while the chrominance components (H and S) remain unchanged. The resulting image is a modified version of the original image, where background details have become easier to observe. Note that although the colors are somewhat dimmed, the original hue is maintained. Color histogram An expression that represents the presence or absence of a certain color in an image in an RGB cube. Its terms are binary. When there is a certain color in the image, the position corresponding to the color in the color histogram is set to 1, and the other
positions are set to 0. A color image has three bands corresponding to a 3-D space in which the value of a pixel is measured along each axis to obtain a color histogram.
Color moment Moments (possibly including various other statistics) of the per-channel histograms that can be used to describe a color image. Commonly used are the mean, variance, and skewness of the histogram.
Color co-occurrence matrix A matrix (actually a histogram) whose elements count how often a color value at some position occurs together with another color value at a position that satisfies a given relative relationship with the first. See co-occurrence matrix.
Correlogram Graphical result for a function that reflects the relationship between two variables. The English word comes from "correlation diagram".
1. In general statistical analysis, the curve of the autocorrelation value against distance or time.
2. In image processing, an extension of the color image histogram. It records the co-occurrence statistics of two colors i and j at a given distance d, where d lies in some distance range [0, n]. The element of the correlogram is a triple (i, j, d), that is, the co-occurrence statistics of the color pair (i, j) at distance d. See color co-occurrence matrix.
Color HOG Color extension of the histogram of oriented gradients descriptor, where the computation is performed in a given color space.
Color gradient The gradient of pixels in a color image. Let r, g, and b be unit vectors along the R, G, and B axes of the RGB color space, and combine them into two vectors as follows:

u = \frac{\partial R}{\partial x}\,\mathbf{r} + \frac{\partial G}{\partial x}\,\mathbf{g} + \frac{\partial B}{\partial x}\,\mathbf{b}, \qquad v = \frac{\partial R}{\partial y}\,\mathbf{r} + \frac{\partial G}{\partial y}\,\mathbf{g} + \frac{\partial B}{\partial y}\,\mathbf{b}

Further define g_xx, g_yy, and g_xy as

g_{xx} = u \cdot u = u^T u = \left(\frac{\partial R}{\partial x}\right)^2 + \left(\frac{\partial G}{\partial x}\right)^2 + \left(\frac{\partial B}{\partial x}\right)^2

g_{yy} = v \cdot v = v^T v = \left(\frac{\partial R}{\partial y}\right)^2 + \left(\frac{\partial G}{\partial y}\right)^2 + \left(\frac{\partial B}{\partial y}\right)^2

g_{xy} = u \cdot v = u^T v = \frac{\partial R}{\partial x}\frac{\partial R}{\partial y} + \frac{\partial G}{\partial x}\frac{\partial G}{\partial y} + \frac{\partial B}{\partial x}\frac{\partial B}{\partial y}

Then the direction of the gradient (maximum rate of change) at pixel (x, y) in the color image is (with period π/2)

\theta(x, y) = \frac{1}{2} \tan^{-1}\left[\frac{2 g_{xy}}{g_{xx} - g_{yy}}\right]

and the magnitude of the gradient along this direction (the maximum rate of change) is (with period π)

g_\theta(x, y) = \left\{ \frac{1}{2}\left[ (g_{xx} + g_{yy}) + (g_{xx} - g_{yy}) \cos 2\theta(x, y) + 2 g_{xy} \sin 2\theta(x, y) \right] \right\}^{1/2}
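The sketch below (illustrative; finite differences approximate the partial derivatives, and a four-quadrant arctangent is used for numerical convenience) evaluates g_xx, g_yy, g_xy and the resulting gradient direction and magnitude at every pixel, which can then be thresholded to mark color edges:

```python
import numpy as np

def color_gradient(rgb):
    """Per-pixel color gradient direction and magnitude, following the
    g_xx, g_yy, g_xy formulation above."""
    dy, dx = np.gradient(rgb, axis=(0, 1))
    gxx = (dx ** 2).sum(axis=-1)
    gyy = (dy ** 2).sum(axis=-1)
    gxy = (dx * dy).sum(axis=-1)
    theta = 0.5 * np.arctan2(2 * gxy, gxx - gyy)   # direction of maximum change
    mag = np.sqrt(np.maximum(
        0.5 * ((gxx + gyy) + (gxx - gyy) * np.cos(2 * theta)
               + 2 * gxy * np.sin(2 * theta)), 0.0))
    return theta, mag

img = np.random.rand(64, 64, 3)
theta, mag = color_gradient(img)
edges = mag > mag.mean() + 2 * mag.std()   # simple threshold to mark color edges
```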
Functional matrix Same as Jacobian matrix.
Jacobian matrix
1. A matrix consisting of the first-order partial derivatives of each component of a color vector. If the function f(x) is written in component form

f(x) = f(x_1, x_2, \dots, x_n) = \begin{bmatrix} f_1(x_1, x_2, \dots, x_n) \\ f_2(x_1, x_2, \dots, x_n) \\ \vdots \\ f_m(x_1, x_2, \dots, x_n) \end{bmatrix}

then the Jacobian matrix J is an m × n matrix:

J = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{bmatrix}
2. In a neural network system, the elements of the Jacobian matrix represent the derivative of the network output with respect to the input. Therefore, the Jacobian matrix can be used to backward propagate the error signal from the output to earlier modules in the system.
Chapter 21
Video Image Processing
Video can be seen as an extension of (still) images. One of the most obvious differences between video and image is that it contains motion information in the scene, which is also a main purpose of using video (Forsyth and Ponce 2012; Prince 2012; Shapiro and Stockman 2001; Szeliski 2010). Aiming at the characteristics of video containing motion information, the original image processing technology also needs to be promoted accordingly (Hartley and Zisserman 2004; Poynton 1996; Tekalp 1995; Zhang 2017a).
21.1 Video
Video generally represents a special type of sequence image, which describes the radiation intensity of a scene projected by a 3-D scene onto a 2-D image plane over a period of time and obtained by three separate sensors (Poynton 1996; Zhang 2017a).
21.1.1 Analog and Digital Video Video A special color sequence image. Commonly used as a short name for video images. From the perspective of learning image technique, video can be viewed as an extension of (still) images. It describes that 3-D scenes are projected onto a 2-D image plane over a period of time, and it records the radiation intensity of the scene obtained by three separate sensors (commonly used are RGB sensors). It also represents:
Table 21.1 Four typical analog video standards

Standard   Frame rate   Line number   Line period/μs   Lines per second/s⁻¹   Pixel period/ns   Image size/pixel
EIA-170    30.00        525           63.49            15,750                 82.2              640 × 480
CCIR       25.00        625           64.00            15,625                 67.7              768 × 576
NTSC       29.97        525           63.56            15,774                 82.2              640 × 480
PAL        25.00        625           64.00            15,625                 67.7              768 × 576
1. A generic term for a group of images acquired at successive times (with short time intervals). 2. An analog signal collected by a video camera. Each frame of the video corresponds to an electronic signal of about 40 milliseconds, which records the start of each scan line, the image of each video scan line, and synchronization information. 3. A video record. Analog video A video consisting of analog signals that are continuous in time and amplitude. These analog signals simulate physical quantities that represent sound and image information. Analog video signals (mainly analog TV signals) generally have three formats: NTSC, PAL, and SECAM. Compare digital video. Analog video standards Standards formulated in the 1940s to regulate the indicators of analog video. In machine vision, the four standards in Table 21.1 are more important. The first two are black and white video standards, and the last two are color video standards. Analog video raster The result obtained by converting an optical image into an electronic signal according to a grid pattern. It can be defined by two parameters (frame rate and line number). The frame rate defines the time sampling rate, and the unit is fps (frame/second) or Hz; the line number corresponds to the vertical sampling rate, and the unit is line/frame or line/image height. Analog video signal A 1-D electronic signal f(t) obtained by sampling f(x, y, t) in the vertical direction and the time dimension. Compare digital video signal. Digital video A sequence of images captured by a digital camera over time. Digital video standards See video standards. Digital video formats Express a specific type of video in a format specified by some parameters. Some typical digital video formats are shown in Table 21.2.
Table 21.2 Some typical digital video formats

Format (application)             Y′ size (H × W)   Color sampling   Frame rate    Raw data/Mbps
QCIF (video phone)               176 × 144         4:2:0            30p           9.1
CIF (video conference)           352 × 288         4:2:0            30p           37
SIF (VCD, MPEG-1)                352 × 240 (288)   4:2:0            30p/25p       30
Rec. 601 (SDTV distribution)     720 × 480 (576)   4:2:0            60i/50i       124
Rec. 601 (video production)      720 × 480 (576)   4:2:2            60i/50i       166
SMPTE 296M (HDTV distribution)   1280 × 720        4:2:0            24p/30p/60p   265/332/664
SMPTE 274M (video production)    1920 × 1080       4:2:0            24p/30p/60i   597/746/746
Digital video disc [DVD] Upgraded form of the video disc. The video is encoded using MPEG-2 compression, with a resolution of 720 × 480 pixels for NTSC and 720 × 576 pixels for PAL.
Digital video express [DIVX] An MPEG-4-based video compression technology; the goal is to obtain a compression rate high enough to transmit digital video content over the Internet while maintaining high visual quality.
Digital video signal One-dimensional electronic signal f(t) obtained by sampling f(x, y, t) in the vertical and horizontal directions and in the time dimension. Compare analog video signal.
Digital video processing See video processing.
Parameter of a digital video sequence One (or more) of the following parameters used to characterize a digital video sequence:
1. Frame rate (fs,t), which corresponds to the time sampling interval or frame interval.
2. Line number (fs,y), which is equal to the frame image height divided by the vertical sampling interval.
3. Number of samples per line (fs,x), which is equal to the frame image width divided by the horizontal sampling interval.
Advantages of digital video The advantages of digital representation of video over its analog representation. These include:
1. Robustness to signal degradation. Digital signals are inherently more robust than analog signals to degradation factors such as attenuation, distortion, and noise.
In addition, error correction techniques can be used so that distortion does not accumulate in subsequent steps of a digital video system.
2. The hardware implementation has small size, high reliability, low price, and easy design.
3. Some processing, such as signal delay or video effects, is easier to complete in the digital domain.
4. Multiple videos with different spatial and temporal resolutions can be encapsulated in a single scalable code stream.
5. Software conversion from one format to another is relatively easy.
Bit rate The amount of data required to represent a certain number of image units. Often used for video transmission and compression, so it is also called video bit rate, in bps. At a code rate of 1 Mbps, the encoder converts 1 s of video into 1 Mb of compressed data; for the decoder, it converts 1 Mb of compressed data into 1 s of video. bps The basic unit of the code rate.
21.1.2 Various Video Video frequency The range of frequency bands used by the TV or radar (including low frequencies in the scan). It is generally considered that the range is from 0 Hz/s to several MHz/s (usually tens of Hz/s). This signal is called a video signal. RS-170 The standard black and white video format used in the United States. Color video Video with color, is the expansion of the color image along the time axis. Because color images have been widely used in the early days of video development, it can be said that video has been color from the beginning. Generally, video refers to color video. 3-D video A video obtained by using multiple synchronized video cameras to shoot the scene from different directions. Sometimes called free-viewpoint video or virtual view point video. The original video frames can be recombined using image-based rendering techniques such as viewpoint interpolation to generate virtual camera paths between cameras as part of the real-time viewing experience. Current 3-D movies are shown in cinemas to viewers wearing polarized lenses. In the future, when watching 3-D video at home, it will be possible to use multi-zone auto-stereoscopic display. Everyone can get a personalized video stream and can move around the scene to observe from different viewpoints.
Free-viewpoint video A stereo video that is being developed can use a multi-zone auto-stereoscopic display device to enable everyone watching the video stream to be personally customized. Also called virtual viewpoint video. See video based rendering. Virtual viewpoint video Same as free-viewpoint video. Compressive video 1. Use of standard video compression for transmission or storage of video 2. Use of compressed sensing technology for transmitting or storing video. Freeze-frame video The effect of slowing down video playback from 25 or 30 frames per second to one or half frame per second. Broadcast video Refers to a video that is a form of content whose image quality and encoding format are designed for (analog or digital) transmission to a large audience. Technologies such as video de-interlacing, lossy compression, and MPEG encoding are often used. Probe video 1. Video captured using remote sensing technology. See probe image. 2. A video used to retrieve certain objects or videos in a database. Multi-temporal images Multiple images that are temporally connected to each other by continuous imaging of the same scene. During the subsequent acquisition interval, the target in the scene can be moved or changed, and the collector can also move. Time-varying image Same as multi-temporal image. Image time sequence Same as multi-temporal image. Bivariate time series A collection of data obtained by sampling two variables that change over time. Dynamic imagery Same as multi-temporal image.
21.1.3 Video Frame Image frame Every image in the video image. Referred to as the frame.
Video volume 3-D stereo consisting of a series of video image frames. If a video image frame (equivalent to a still image) at a certain time is represented by f(x, y), then a video volume can be represented by f(x, y, t), where x and y are spatial variables and t is a time variable. Video stream A sequence of single-frame images acquired by a video camera. The video stream can be raw or compressed. Video data rate The number of data bits b required to store a one-second video image. The unit is b/s (bps). Determined by the temporal, spatial, and amplitude resolutions of the video. Let the video frame rate be L, that is, the time sampling interval is 1/L, the spatial resolution is M N, and the amplitude resolution is G (G ¼ 2k, k ¼ 8 for black and white video and k ¼ 24 for color video), the video bit rate is b¼LMNk The video bit rate can also be defined by the number of lines fy, the number of samples per line fx, and the number of frame rates ft. Set the horizontal sampling interval to Δx ¼ pixel width/fx, the vertical sampling interval to Δy ¼ pixel height/fy, and the time sampling interval to Δt ¼ 1/ft. If K is used to represent the number of bits of a pixel value in a video, it is eight for monochrome video and 24 for color video, so the video bit rate can also be expressed as b ¼ f x f y f tK Temporal resolution The observation frequency with respect to time (e.g., once per second) corresponds to the spatial resolution and depends on the time sampling rate. Time series One or more variables obtained sequentially over time. Frame 1. Framework: A way of expressing knowledge. It can be regarded as a complex structured semantic network. Suitable for recording a related set of facts, rules of reasoning, preconditions, etc. 2. Frame: Short for frame image. A complete standard television picture with even and odd fields. Frame store Electronics for recording frame images from imaging systems. The typical application of this device is as an interface between a CCIR camera and a computer.
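A minimal calculation following the video data rate formula b = L · M · N · k above (the numbers are just example values):

```python
def video_data_rate(frame_rate, width, height, bits_per_pixel):
    """Raw video data rate b = L * M * N * k in bits per second."""
    return frame_rate * width * height * bits_per_pixel

# Example: 25 fps, 768 x 576 color video with 24 bits per pixel
b = video_data_rate(25, 768, 576, 24)
print(b / 1e6, "Mbps")   # about 265 Mbps
```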
Frame buffer A device capable of storing a video frame can be used by a computer for access, display, and processing. For example, it can be used to store video frames to refresh the video display. See frame store. Frame rate Time sampling rate for video capture and display. The number of images captured, propagated, or digitized by a system in one second. A common unit is frames per second (fps). For a given application, the frame rate is also a performance indicator related to computing efficiency. 3:2 pull down A (playback) video conversion technology used to convert raw material at 24 frames per second (fps) to 30 fps or 60 fps. For 30 fps interlaced video, this means converting every four frames of a movie to ten fields or five frames of video. For 60 fps progressive video, every movie with four frames is converted to ten frames of video. Field rate See refresh rate. Refresh rate The image update rate when displaying motion images. Most of the display of motion images includes a period in which the generated image is not displayed, that is, displayed in black for a part of a frame time. To avoid this offensive flicker, the image needs to be refreshed at a higher rate than the rendering motion. Typical refresh rates are 48 Hz (movie), 60 Hz (traditional TV), and 75 Hz (computer monitor). The refresh rate is very dependent on the surrounding lighting: the brighter the environment, the higher the refresh rate should be to avoid flicker. The refresh rate also depends on the display process. In a traditional film projection system, a 48 Hz refresh rate is obtained by displaying a negative film at 24 frames per second twice. For progressive systems, the refresh rate is the same as the frame rate, while in interlaced systems, the refresh rate is equivalent to the field rate (twice the frame rate). Streaming video A series of frame image data from video is converted into continuous code stream line by line and frame by frame. When processing such video, the algorithm cannot easily select a specific frame. Time code In video (movie), a unique representative number given to each image frame for accurate video editing.
21.1.4 Video Scan and Display Scanning A technique used in video systems to convert optical images into electronic signals. During scanning, an electronic sensing point moves across the image in a raster pattern. The sensing point converts the difference in brightness into the difference in instantaneous voltage. This point starts at the upper left corner of the image and passes through the frame horizontally to produce a scan line. It then quickly returns to the left edge of the frame (this process is called horizontal retrace) and starts scanning another line. The retrace line is tilted slightly downward, so after the retrace, the sensing point is just below the previous scan line and ready to scan a new line. After the last line is scanned (that is, when the sensing point reaches the lower right corner of the image), a horizontal and vertical retrace is performed simultaneously to bring the sensing point back to the upper left corner of the image. A complete scan of an image is called a frame. Scanline A single horizontal line of an image. In video display and processing, a single horizontal line corresponding to an image frame. This term was originally used for cameras, where the sensory unit generally scans each pixel on a line in turn, and then moves to the next line to capture the complete image line by line. Scanline slice The intersection of an image scanline with a target structure. Figure 21.1 shows a schematic diagram in which the circular target intersects with 5 scan lines (dashed lines), and 5 thick horizontal straight line segments are scan line segments. Progressive scanning A method of raster scanning in image display. Taking the frame as a unit, the display progresses from the upper left corner to the lower right corner. It is characterized by high definition but large amount of data. Compare interlaced scanning. Interlaced scanning A method of raster scanning in image display. A technique derived from television engineering in which alternating lines (rather than consecutive lines) in an image are scanned or transmitted. Also called interlaced scanning. It takes the field (one frame is divided into two fields: the top field contains all the odd lines and the bottom field contains all the even lines) as the scan unit. Each frame is subject to two consecutive vertical scans. During transmission, an odd line of a television frame is transmitted to Fig. 21.1 Scan line fragment illustration
form an odd field, and then an even line is transmitted to form an even field. The top field and the bottom field are alternated during display, making the human perceive a picture by virtue of the visual persistence characteristics of the human visual system. The vertical resolution of interlaced scanning is half that of a frame image, and its data volume is only half that of progressive scanning. Because each field contains only half the number of lines in a frame, it appears only half the time in a frame. At this time, the flicker rate of the field will be twice the flicker rate of the entire frame, which can give a better sense of movement. Various standard television systems, such as NTSC, PAL, SECAM, and some high-definition television systems, use interlaced scanning. Compare progressive scanning. Line number One of the parameters for designing a video. Corresponds to the vertical sampling rate. The unit is line/frame or line/image height. Odd field A term in interlaced scanning. Interlaced scanning transmits odd and even lines of an image, respectively. The set of all odd rows is called an odd field. Even field The first of two fields of interlaced video. Blanking Eliminate or hide signals that should appear. See blanking interval. Clear the CRT or video device. In TV transmission, vertical blanking interval (VBI) is used to carry data other than audio and video. Blanking interval The time interval at the end of each line (horizontal retrace) or each field (vertical retrace) in the video. During this time, the video signal needs to be hidden until the scanning of a new line or field starts.
21.1.5 Video Display Video de-interlacing Converts video with alternating (even and odd) fields into a non-interlaced signal that contains all fields in each frame. Two common and simple de-interlacing techniques are bob, that is, the missing line in a field is copied and replaced with the upper or lower line in the same field; weave, that is, the missing line in a field copied and replaced with the same line in its previous or next field. Both technologies are named after the visual artifacts or visual artifacts they produce. The upand-down swing technique causes a lifting movement along a strong horizontal line. Weaving technology results in a “zipper” effect on edges that move horizontally. If the substitution operation is replaced by the average operation, a certain improvement effect can be achieved, but the artifacts cannot be completely eliminated.
It can also represent a processing of video. Traditional broadcast video divides each frame into two fields, that is, fields of odd lines and fields of even lines for transmission. De-interlacing is the process of combining two occasions into a single image, and the resulting single image can be progressively transmitted (progressive image transmission). A problem that may arise is that if these two fields were acquired at slightly different times, pixel jitter would be generated on a moving object. De-interlacing Method and process for filling skipped lines to achieve sampling rate up-conversion in each field of a video sequence. Common techniques are mainly based on temporal and spatial filtering, such as line averaging, field merging, and line and field averaging. A method of pixel-by-pixel motion estimation, which can convert videos that are alternately captured by parity fields into non-interlaced videos (i.e., there are two fields in each frame). Common simple de-interlacing techniques include bob (copying the previous or next line of missing lines from the same field) and weave (copying the same line of the previous or next frame). These two names are derived from the visual distortion caused by these two simple techniques. Bob (jumping up and down) introduces a motion that shakes up and down along the horizontal line, while weaving results in a “zippering” effect that transitions along the horizontal edge. Replacing these copy operations with averaging operations (such as line averaging, field averaging) reduces distortion but does not completely eliminate it. Bob A de-interlacing (technology). Weave A de-interlacing (technology). Field merging A method for de-interlacing in adjacent fields of a video by time interpolation. Progressive frames are obtained by combining (merging) the lines that follow two consecutive fields. The equivalent filter frequency response is equivalent to full pass along the vertical axis and low pass along the time axis. Compare field averaging. Field averaging A method for de-interlacing in adjacent fields of a video by time interpolation. It is an improvement on the field merging and takes the average value of the same line of the previous field and the next field corresponding to the missing line. The result is a filter with a good (symmetric) frequency response along the vertical axis. The main disadvantage of this technique is that three fields are needed to interpolate any field, which increases storage and latency requirements. Temporal and vertical interpolation When de-interlacing the video, a combination of up-down interpolation and frontback interpolation is used.
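A minimal sketch of the line averaging technique described above (illustrative only; field layout conventions vary between systems): the missing lines of one field are filled with the average of the lines above and below in the same field.

```python
import numpy as np

def deinterlace_line_average(frame, top_field=True):
    """Rebuild a progressive frame from one field by line averaging.

    frame     : 2-D array holding a full frame in which only one field
                (every other line) is valid.
    top_field : True if the valid lines are the even-numbered ones.
    """
    out = frame.astype(float).copy()
    start = 1 if top_field else 0              # indices of the missing lines
    for y in range(start, frame.shape[0], 2):
        above = out[y - 1] if y - 1 >= 0 else out[y + 1]
        below = out[y + 1] if y + 1 < frame.shape[0] else out[y - 1]
        out[y] = 0.5 * (above + below)         # average of the neighboring valid lines
    return out

field_frame = np.random.rand(8, 10)
progressive = deinterlace_line_average(field_frame, top_field=True)
```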
Fig. 21.2 Front porch and back porch (the sketch labels the front porch, color pulse, horizontal synchronization, actual video, and back porch segments of the signal)
Temporal interpolation The technique of de-interlacing a video uses front and back field interpolation. Vertical interpolation The technique of de-interlacing the video uses up-down interpolation. Line and field averaging In the adjacent fields of video, a method of de-interlacing is achieved by combining temporal interpolation and vertical interpolation. The missing lines are estimated using the average of the previous and next lines in the same field and the corresponding lines in the preceding and following fields. The equivalent filter frequency response is close to the ideal response, but this method requires two frames of memory. Line averaging A simple filtering method when de-interlacing in a video. In order to fill the skipped lines in each field, vertical interpolation is performed in the same field, and the missing lines are estimated using the average of above line and below line. Frame interpolation A wide application of motion estimation. Often implemented in the same circuit as de-interlacing hardware to match the input video to the actual refresh rate of the display. As with de-interlacing, the information from the previous and subsequent frames needs to be interpolated. If accurate motion estimation can be calculated at each pixel position, the best results may be obtained. Front porch In the analog video signal, a section of the signal before the horizontal synchronization signal, as is shown in Fig. 21.2. Back porch The falling edge of the signal. For example, CCIR transmits a picture with two fields interlaced, first with odd lines and then with even lines. Each line consists of a horizontal blanking interval containing horizontal synchronization information and a line effective interval containing actual picture information. The horizontal blanking interval includes three parts, the blanking leading edge, the horizontal synchronization pulse, and the blanking trailing edge. Compare front porch.
21.2 Video Terminology
In fact, a still image is a video of a given (constant) time. In addition to some concepts and definitions of the original image, some new concepts and definitions are needed to represent the video (Tekalp 1995; Zhang 2017a).
21.2.1 Video Terms Video terminology There are strictly defined terms used in the discussion of the video. For example, scanning, refresh rate, aspect ratio, component video, and composite video. Video mail [V-mail] The result of attaching a video file to an e-mail. Video conferencing Meetings between geographically diverse people using digital devices and network connections. The required equipment mainly includes cameras, video capture devices, and network connection devices. Generally, the video frame rate is often required to reach 10 to 15 frames/second. Video container A video file format that packages encoded audio and video data, as well as additional information (such as subtitles), and associated metadata. Currently popular video containers include Microsoft’s AVI, DivX media formats, Adobe Video’s Flash Video, Apple’s Quicktime, etc. Video mouse (Enhanced) mouse with video camera embedded. If you print a grid of dots on a mouse pad, you can use the video mouse to track on it for full control of six degrees of freedom in position and orientation in 3-D space. Video quality assessment Evaluate the quality of the processed video to ensure its use and evaluation of processing technology. When video data is converted from one encoding format to another (including the use of video compression technology), the quality of the video may be reduced. Quality assessment is to measure the degree of degradation and can consider subjective and objective factors. Video rate system A real-time processing system that works at video standard frame rates. Typically, 25 or 30 frames per second (50 or 60 fields per second).
Video standards A collective name for digital video format standards (such as ITU-R’s BT.601–5 recommendation) and video compression standards (such as MPEG and ITU-T standards). Video technique Various technologies related to video in a broad sense. Video can be regarded as a special image, so video technology can be obtained by promotion or extension of image technique. Video signal 1-D analog or digital signal over time. Its space-time content represents a series of images (or frames) obtained according to a predetermined scanning convention. Mathematically, a continuous (analog) video signal can be written as f(x, y, t), where x, y are spatial variables and t is a time variable. After digitizing, you can still use the same token, except that the variables are all integers. See digital image. Video signal processing Processing of video using image technique. Betacam An SMPTE video standard for recording professional-quality video images using a 1/200 tape recorder. The video signal has a luminance signal bandwidth of 4.1 MHz and a chrominance signal bandwidth of 1.5 MHz. So Betacam signals do not include all colors included in 3-chip color charge-coupled device camera. In English, Betacam is also commonly used for referring to Betacam cameras and Betacam recorders; both are at broadcasting levels. Aspect ratio 1. A shape descriptor is the length ratio of the adjacent two sides of the target bounding box or enclosure. It is a scalar measure that does not change with scale. It can be used as the area parameter of the compactness descriptor of the target shape to describe the slenderness of the target after plastic deformation, which can be expressed as
R = \frac{L}{W}
Among them, L and W are the length and width of the minimum enclosing rectangle, respectively. The aspect ratio is also defined by the length and width of the Feret box. For square or round targets, the value of R is taken to be the smallest (1). For relatively slender targets, the value of R is greater than 1 and increases with the degree of slenderness. It should be noted that the aspect ratio of a target may change with the rotation of the template, so it cannot be used to compare targets that rotate with each other, unless they are first normalized somehow. For example, calculate the aspect ratio
after aligning the axis of least second moment of the target with the horizontal direction. 2. In a camera, the ratio of the vertical pixel size to the horizontal pixel size; it is also the appearance ratio of an image or video frame. It is generally written as height/ width (such as 4:3). It is necessary to distinguish between the image aspect ratio and the sensor unit aspect ratio. For the former, if the image scale is transformed so that the pixel coordinates of the longer dimension of the image are in [1, 1] and the pixel coordinates of the shorter dimension are in [1/k, 1/k], then k > 1 is the image aspect ratio. For the latter, the ratio of the focal length of the sensor unit in two orthogonal directions. Video format Represent the format of video data. There is a method to divide the video format into component video format and composite video format based on the representation or combination of the three color signals. There is also a method to divide the video into formats with different resolutions such as SIF, CIF, QCIF, etc. according to the different application. Component video A representation format for color video. It represents color video as three independent, non-interfering functions, each describing a color component. Video in this format has higher quality but larger data volume and is generally only used in professional video equipment. Considering the problem of backward compatibility, it is not suitable for ordinary color TV broadcast systems. Compare composite video. Composite video A representation format for color video. The three color signals of the color video are multiplexed into a single signal (the synchronization signal is actually included) to reduce the amount of data, but at the cost of introducing a certain amount of mutual interference. The composite signal is constructed taking into account that the chrominance signal has a much smaller bandwidth than the luminance component. By modulating each chrominance component to a frequency at the high end of the luminance component and adding the chrominance component to the original luminance signal, a composite video containing luminance and chrominance information is produced. It is a backward compatible solution for transmission from black and white TV to color TV. Black and white TVs ignore color components, while color TVs separate color information and display it with black and white intensity. Compare component video.
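A small sketch of the shape-descriptor sense of aspect ratio in definition 1 above (illustrative only; an axis-aligned bounding box is assumed rather than a minimum enclosing rectangle):

```python
import numpy as np

def aspect_ratio(mask):
    """Aspect ratio R = L / W of a binary object mask, using the
    axis-aligned bounding box (L = longer side, W = shorter side)."""
    ys, xs = np.nonzero(mask)
    height = ys.max() - ys.min() + 1
    width = xs.max() - xs.min() + 1
    long_side, short_side = max(height, width), min(height, width)
    return long_side / short_side

mask = np.zeros((20, 20), dtype=bool)
mask[5:8, 2:18] = True        # an elongated 3 x 16 rectangle
print(aspect_ratio(mask))     # 16 / 3, about 5.33
```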
21.2.2 Video Processing and Techniques Video processing Any image processing technology that treats video as a special image. The input is a video image; the narrow output is still a video image, and the broad output can be other forms (see image engineering). Since video is a sequence of images, techniques in image processing can still be used for each of them. However, it is also possible to directly perform the overall processing using the correlation characteristics between sequence images. Video image acquisition Techniques and processes for obtaining video images from objective scenes. Video sampling Sampling of a video signal. A digital video can be obtained by sampling a raster scan. In analog video systems, the time axis is sampled as frames and the vertical axis is sampled as lines. Digital video simply adds a third sampling process along the line (horizontal axis). Video imaging 1. Same as video image acquisition. 2. Modes of stereo imaging method of video camera. In this way, to obtain stereo information, it is generally required that there should be relative movement between the camera and the scene. Video quantization A step in digitizing analog video, which is represented by a limited number of bits per sample. Most video quantization systems are based on bytes of 8-bit, which means that each sample can have 256 possible discrete values. In cameras, 1024 levels are sometimes used for signals before gamma correction. Not all stages are used in the video range here, because data or control signals also require some values. In a composite video system, the entire composite signal, including sync pulses, is transmitted in 4 to 200 levels in the quantization range. In a component video system, the brightness occupies 16 to 235 levels; because the color difference signal can be positive or negative, level 0 is placed in the center of the quantization range. In both cases, the values 0 and 255 are reserved by the synchronization code. Video capture The process of capturing images from video equipment (such as video cameras and televisions) and converting them into digital computer images. This is often done with the help of a frame grabber. The converted images are easy to view, edit, process, save, and output. Video sequence A series of video images. The main emphasis is that there are many continuous frame images. See video.
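A minimal frame-grabbing loop illustrating video capture and video sequence handling (illustrative only; it uses OpenCV's VideoCapture, and the file name is hypothetical):

```python
import cv2

cap = cv2.VideoCapture("input.mp4")   # or 0 for the first attached camera
frames = []
while True:
    ok, frame = cap.read()            # grab the next frame of the video stream
    if not ok:
        break                         # end of stream
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    frames.append(gray)
cap.release()
print(len(frames), "frames captured")
```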
Video sequence synchronization Frame-by-frame registration of two or more video sequences, or matching of content or objects observed from these video sequences. Applications include video surveillance and two-viewpoint wide-baseline stereo. If these video sequences have different sampling rates, dynamic time warping techniques can also be used. Video transmission Transmission of video (including sound, data, etc.) via various networks. Video transmission format A description of the precise form of analog video signal transcoding, often using duration terms such as line count, pixel count, leading edge, synchronization, and blanking. Video screening 1. Video screening: show the video to the public. 2. Video playback: re-view the content of a video, for example, to detect a specific scene or event. Video restoration Image restoration techniques applied to video, often using temporal correlation in the video. It also refers to the correction of video-specific degradation. Video stippling Stippling of video, that is, collage (tessellation) of scene images with basic shapes (such as triangles derived from Delaunay triangulation and polygons derived from the Voronoï diagram) to more easily reconstruct and interpret natural scenes. Its core technology is the Voronoï tessellation of video frames. Video-based animation Use video frame images to form animation sequences, that is, video-based rendering. It can be used for facial animation and human animation. An early example is Video Rewrite, in which frames in the original video footage are rearranged to match new spoken words (such as movie dubbing). The basic steps are to first extract the elements from the source video and the new audio stream, group them into units of three sounds (triphones) and match them, and then cut out the mouth image corresponding to the selected video frame and insert it into the video shot that needs animation or dubbing. Here the head motion needs to undergo some geometric transformations. In the analysis phase, it is necessary to correspond to the feature points of the lips, chin, and head according to the image tracking results. During the compositing phase, image deformation techniques are used to blend/combine adjacent mouth shapes into a consistent whole. Video texture 1. A potentially and infinitely changing video can be obtained by randomly splicing the frames of an actual video (there is no obvious relationship between them). In
computer graphics, this method is used to generate dynamic non-repeating image motion. 2. A way to implement video-based animation. Think of a short piece of video as a series of frame images that can be rearranged in time, and expand them at will while retaining their visual continuity. The basic requirement here is how to generate video textures without introducing visual artifacts. The easiest way is to match frame images based on visual similarities (such as Euclidean distance) and skip similar frame images. 3. Interesting effects obtained by adjusting the original order of the video with a relatively stable background and a small range of moving targets in the foreground. It can also refer to the technology that produces these effects. Here, the order is adjusted only at positions that are difficult for the observer to find, so that a limited-length input segment can be used to generate a video sequence that can be played smoothly but has an infinite length. If more positions are selected and the order selected is more random, the length of the resulting video sequence will be longer. This method/concept can be generalized to the display of individual video sprites/ghosts. For example, extracting an image of a target from a video and stitching each sub-segment containing the target into a new animation, rather than simply playing the original complete sequence. Video-based walkthroughs With the help of camera groups, 3-D dynamic scene images (often stitched into panoramas) are collected from multiple viewpoints, allowing users to explore the scene from viewpoints close to the collection location and determine its movement in the scene. Online video processing There are four basic steps for the processing of the video being played: 1. Begin capturing video of changing scenes. 2. Select the current frame from the video as the processing input. 3. Extract information from the current video frame, for example, find the geometric information by covering the video frame with a Voronoï grid. 4. Repeat step (2) until the entire video processing is completed. Online video screening A model or technique for detecting irregularities in a video stream, such as a vehicle found to be driving illegally. Offline video processing There are three basic steps for processing offline video: 1. Capture a complete video of a changing scene. 2. Extract frames from the video. 3. Extract information from video frames. For example, motion information is found by comparing the previous and subsequent frames.
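The four online-processing steps listed above can be sketched as a simple capture-process-repeat loop. This is only a schematic example assuming an OpenCV-readable camera or file source; frame differencing stands in here for the information-extraction step (the entry's Voronoï-grid analysis is not implemented), and the change threshold is an arbitrary illustrative choice.

```python
import cv2

def online_processing(source=0):
    """Schematic online video processing: capture, take the current frame,
    extract some information from it, and repeat until the stream ends."""
    cap = cv2.VideoCapture(source)             # step 1: begin capturing video
    ok, previous = cap.read()
    while ok:
        ok, frame = cap.read()                 # step 2: current frame as input
        if not ok:
            break
        change = cv2.absdiff(frame, previous)  # step 3: extract information
        print("changed pixels:", int((change > 25).sum()))
        previous = frame                       # step 4: repeat with next frame
    cap.release()
```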
21.3 Video Enhancement
Filtering here represents a variety of processes and methods (which can be used for enhancement, restoration, noise removal, etc.). Relative to still-image filtering, video filtering can also be designed with the help of motion information (Bovik 2005; Gonzalez and Woods 2018; Tekalp 1995; Zhang 2017a).
21.3.1 Video Enhancement Video enhancement Improve the quality of video images. In addition to using intra-frame filtering (using spatial or frequency domain image processing algorithms) for each individual frame, inter-frame filtering (performed on a space-time support area) can also be used across the previous and subsequent frames. Video deshearing An operation that eliminates the effects of shearing (skew) in a video. When capturing video progressively, if there is motion in the camera or scene, a shearing phenomenon occurs in the video image. The result is oblique structures in the image, and vertical structures in the scene no longer look vertical. Deshearing eliminates this kind of shear. Bit reversal The operation of reversing the bit order. The most significant bits become the least significant bits or vice versa. It is generally used for video processing, where it is equivalent to the image negative in image processing. Video stabilization A set of methods to compensate for the blurring observed in a video due to small movements of the camera, which easily happen when holding the camera by hand. The basic approach is to move the lens, imaging chip, or digitized image slightly to reduce the motion. It is an application based on parametric motion estimation and includes three steps: motion estimation, motion smoothing, and image warping. Motion estimation algorithms often use similarity transformations to cope with camera translation, rotation, and zoom. The trick here is to have these algorithms lock onto the background motion (which is the result of camera motion) without being affected by independently moving foreground targets. The motion smoothing algorithm recovers the low-frequency (slowly changing) part of the motion and estimates the high-frequency wobble part to be eliminated. Finally, the image warping algorithm uses the high-frequency correction to obtain the camera-acquired original frames with only smooth motion.
Fig. 21.3 Reference framework example
Image stabilization See image sequence stabilization. Image sequence stabilization When using a handheld camera, the recorded video contains some of the resulting image motion due to the shaking of the operator’s hand. Image stabilization attempts to estimate a random portion of camera shake motion and adjust the images in a series to reduce or eliminate shake. A similar application is to eliminate systematic camera motion to output images without motion. See feature stabiliza tion and translation. Reference frame transformation See coordinate system transformation. Frame of reference 1. Reference framework: A coordinate system defined relative to a target, camera, or relative to the real world. Figure 21.3 shows three coordinate systems (identified by the subscripts of the coordinate axis variables), including a cube coordinate system (l) and a cylindrical coordinate system (z), all of which are defined relative to the world coordinate system (w). 2. Reference frame: In some video technologies, such as target motion positioning, gradation detection, etc., it is used as a reference frame image. Video denoising Processes and methods for removing noise and other artifacts from videos and films. Rather than denoising a single image (where all the information available is on the current image), you can average or borrow information from adjacent frames. However, in order not to introduce blur or jitter (irregular motion), accurate pixel-bypixel motion estimation is required. Frame average A direct filtering technique that averages multiple samples at the same position in different frames of a video to eliminate noise without affecting the spatial resolution of the frame image. Effective for fixed parts of the scene.
Temporal averaging A method or step for noise reduction, in which an image that is known to be constant over time is sampled at different times and the results are averaged to reduce the effect of noise. Frame differencing A technique for detecting changes in a video sequence by subtracting successive frames from each other. Sometimes it is necessary to take the absolute value of the obtained difference image. Areas with large values in a difference image are generally considered to be significant areas.
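A minimal NumPy sketch of the two operations just described, assuming the video is available as an array of shape (num_frames, height, width); the threshold value used to mark significant areas is an illustrative choice.

```python
import numpy as np

def temporal_average(frames):
    """Average co-located samples over all frames (frame average / temporal
    averaging); effective for the static parts of the scene."""
    return frames.mean(axis=0)

def frame_difference(frames, t, threshold=20):
    """Absolute difference between frame t and frame t-1; areas with large
    values are generally considered significant (changed) areas."""
    d = np.abs(frames[t].astype(np.int16) - frames[t - 1].astype(np.int16))
    return d, d > threshold

# Example with synthetic 8-bit frames
frames = np.random.randint(0, 256, size=(10, 120, 160), dtype=np.uint8)
mean_image = temporal_average(frames)
diff, changed_mask = frame_difference(frames, t=5)
```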
21.3.2 Motion-Based Filtering Motion-detection-based filtering A technique for filtering video. On the basis of filtering still images, the problem brought by motion is considered, that is, the motion adaptive technology is used to filter the video on the basis of motion detection. One method is to imply motion adaptation in the filter design, which is called direct filtering. The simplest direct filtering method is to use the frame average technique to remove noise without affecting the spatial resolution of the frame image by averaging multiple samples at the same position in different frames. In addition, the parameters in the filter can also be determined by detecting the motion, so that the designed filter is adapted to the specific situation of the motion. Motion-adaptive A method for filtering video by extending the filtering technology of still images. Among them, the technique that the motion adaptability is implicit in the design of the filter is called direct filtering; and the motion adaptation method that determines the parameters in the filter by detecting the motion information can be regarded as an indirect filtering technique. Direct filtering A technology and process that imposes motion adaptability in the design of a filter and uses the extension of still image filtering technology to filter video. Indirect filtering A technology for determining parameters in a filter by detecting motion information and filtering a video by extension of a still image filtering technology. Filtering along motion trajectory In video, filtering is performed on each point along the target motion trajectory in each frame of image. Intra-frame filtering (Spatial domain or frequency domain) image filtering techniques used in video for individual frames (think of them as still images).
Intra-frame 1. Inside each frame of the video image, that is, within the frame image. 2. Represents that each pixel in a frame of video is related. Compare inter-frame. Inter-frame filtering Filtering (in the spatial domain and/or frequency domain) multiple consecutive frames in the video. In this case, the size of the filtered space-time support area can be written as m × n × k, where m and n correspond to the size of a neighborhood in a video frame and k represents the number of frames considered. In the special case of m = n = 1, the resulting filter is called a time filter, while the case of k = 1 corresponds to intra-frame filtering. Inter-frame filtering techniques can have motion compensation, motion adaptation, or neither. Inter-frame Represents the connection between images in different time frames in a video. Compare intra-frame. Motion deblurring A technique that uses inverse filtering technology to eliminate image blur caused by motion, that is, to eliminate the blurring effect caused by camera movement or scene movement during image acquisition. A special case of inverse filtering. See image restoration. Uniform linear motion blurring The image is blurred due to relatively uniform linear motion between the image acquisition device and the observed object during imaging. This is a simple image degradation situation. For this kind of blur, since the motion is uniform in the X-axis and Y-axis directions, a filter function that can eliminate the blur can be obtained analytically, or the integral equation expressing the blur can be calculated analytically, so an analytical solution to this degradation can be obtained and the blur caused by such motion can be eliminated.
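As a sketch of inter-frame filtering over an m × n × k space-time support, the following applies a uniform mean filter to a frame stack; setting m = n = 1 gives a purely temporal filter and k = 1 gives intra-frame filtering, as in the entry above. The use of a plain mean filter (with no motion compensation or motion adaptation) and of scipy's uniform_filter are illustrative choices.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def spatio_temporal_mean(frames, m=3, n=3, k=3):
    """Mean filtering of a (T, H, W) frame stack over an m x n x k
    space-time support; k is the number of frames considered."""
    return uniform_filter(frames.astype(np.float32), size=(k, m, n))

frames = np.random.rand(20, 64, 64).astype(np.float32)
temporal_only = spatio_temporal_mean(frames, m=1, n=1, k=5)  # pure time filter
intra_frame = spatio_temporal_mean(frames, m=3, n=3, k=1)    # intra-frame filtering
```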
21.3.3 Block Matching Block-matching algorithm [BMA] An image region matching algorithm commonly used in motion estimation. Suppose a frame image is divided into M non-overlapping blocks that together cover the entire frame. In addition, assuming that the motion in each block is constant, that is, assuming that the entire block experiences the same movement, the movement can be recorded in the associated motion vector. Motion estimation can then be realized by finding the optimal motion vector of each block.
Fig. 21.4 Example of 2-D logarithmic search method
Fast algorithm for block matching A fast search algorithm proposed to overcome the problem that the exhaustive search block-matching algorithm is computationally intensive. Typical examples are the 2-D log search method and the three-step search method. 2-D-log search method In video coding, a fast search algorithm for block matching when motion compensation is used. It is so named because the number of search steps grows with the logarithm of the height or width of the search area. Referring to Fig. 21.4, the number n in the figure indicates the n-th step search point, the dotted arrow indicates the search direction of each step, and the solid arrow indicates the final search result. Start from the position (0) corresponding to zero displacement and test the four search points (1) around the start point; these four points are arranged in a diamond shape. Select the point that gives the least error (MAD is often chosen as the quality factor) and use it as the center to construct a new search area around it. If the best matching point is the center point, the offset is halved before entering a new search area to continue searching; otherwise the same offset is maintained. This process continues over a decreasing range until the offset equals one. Three-step search method In motion compensation of video coding, a fast search algorithm for block matching. Referring to Fig. 21.5, the number n in the figure indicates the n-th step search point, the dotted arrow indicates the direction of the search at each step, and the solid arrow indicates the result of the final search. Start from the position (0) corresponding to zero displacement and test the eight search points (points marked with 1) in the up-down, left-right, and diagonal directions; these 8 points are arranged in a square. Select the point that gives the smallest error (MAD is often selected as the quality factor) and build a new search area around it. If the optimal matching point is the center point, move half of the last offset and enter a new search area to continue searching; otherwise keep the same offset. This process continues over a decreasing range until the offset equals one.
Fig. 21.5 Example of three-step search method
Hierarchical block matching algorithm [HBMA] A multi-resolution motion estimation algorithm. First use a low-pass filtered, down-sampled frame pair at coarse resolution to estimate the motion vector. Next, use the fine resolution and a proportionally reduced search range to modify and refine the initial solution. The most popular method uses a pyramid structure in which the resolution is halved in both directions between successive layers. The hierarchical block matching algorithm achieves a balance between execution time and result quality: it has lower computational cost than the exhaustive search block-matching algorithm and can give better-quality motion vectors than the 2-D log search method and the three-step search method. Exhaustive search block-matching algorithm [EBMA] In image block matching, an algorithm that adopts an exhaustive search strategy. The block to be matched is compared with all candidate blocks to determine and select the block with the highest similarity. The final estimation accuracy is often determined by the step size of the block movement. Full-search [FS] method Same as exhaustive search block-matching algorithm. Weaknesses of exhaustive search block matching algorithm Problems in exhaustive search block-matching algorithms include: 1. Block artifacts are generated in predicted frames (discontinuities are visible across block boundaries). 2. The resulting motion field may be a bit cluttered because motion vectors in different blocks are independently estimated. 3. It is more likely that erroneous motion vectors appear in flat areas of the image, because motion is difficult to define when the spatial gradient is close to 0 (this is also a common problem for all optical flow based methods).
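A minimal NumPy sketch of the exhaustive search block-matching algorithm described above, using MAD as the matching criterion; the block size and search range are illustrative parameters, and candidate blocks falling outside the reference frame are simply skipped.

```python
import numpy as np

def ebma(ref, cur, block=16, search=8):
    """Exhaustive-search block matching: for each block of the current frame,
    test every candidate displacement within +/- search pixels in the
    reference frame and keep the one with the minimum MAD."""
    H, W = cur.shape
    mv = np.zeros((H // block, W // block, 2), dtype=int)
    for by in range(0, H - block + 1, block):
        for bx in range(0, W - block + 1, block):
            blk = cur[by:by + block, bx:bx + block].astype(np.float32)
            best, best_mad = (0, 0), np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > H or x + block > W:
                        continue
                    cand = ref[y:y + block, x:x + block].astype(np.float32)
                    mad = float(np.mean(np.abs(blk - cand)))
                    if mad < best_mad:
                        best_mad, best = mad, (dy, dx)
            mv[by // block, bx // block] = best
    return mv
```

The fast and hierarchical variants above trade some of this exhaustive accuracy for far fewer MAD evaluations.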
21.4 Video Coding
The purpose of video coding is to reduce the bit rate of the video to reduce the amount of storage and to speed up the transmission in the channel, which is similar/ parallel to the still image coding (Gonzalez and Woods 2018; Poynton 1996; Tekalp 1995; Zhang 2017a).
21.4.1 Video Codec Video codec A collective name for video coders and video decoders. It can be either a device (hardware) or one or more official standard software implementations, including open source codecs and proprietary codecs. Video coder The device (hardware) or algorithm program (software) that implements video encoding. Video decoders are often included. Video coding An extension of image coding to video. Also called video compression. This is a large class of basic video processing techniques. The purpose is to reduce the code rate of the video itself by using a new representation method for the video image to reduce the storage capacity and speed up its transmission in the channel, which is similar/parallel to the encoding of the image. Many image coding methods can be directly used for video frame coding (frame image coding) or generalized to 3-D and used for video sequences (at this time, superimposing video frames to form a volumetric image). Also refers to converting the source video into a digital code stream. The source can be analog or digital. In general, encoding can also compress or reduce the bit rate of video data. Video compression Video coding with the goal of reducing the number of bits required to express a video sequence. Examples include MPEG, H.265, DivX, etc. Displaced frame difference [DFD] The title of residuals in video coding literature. Video decoder The device (hardware) or algorithm program (software) that implements video decoding. Video decoding An extension of image decoding to video. Also often called video decompression.
Video decompression See video decoding. Video compression technique Encoding technology to complete various video compression tasks. Video compression technology must not only eliminate the coding redundancy, inter-pixel redundancy, and psycho-visual redundancy that image compression technology needs to eliminate but also eliminate inter-frame (or temporal) redundancy. Temporal redundancy is intuitive for video sequences, meaning that successive frames are generally similar in time. Wyner-Ziv video coding A lossy coding method for video that is characterized by low computational complexity at the encoding end (such as a low-power wireless mobile device) and high computational complexity at the decoding end. This method uses side information, such as estimates of camera motion. Mosaic-based video compression With the help of image stitching, the video frames obtained by scanning cameras are combined into a mosaic to present a segment of video in a single picture, resulting in a more compact expression. Early mosaic-based video compression mainly used the affine motion model, which is only suitable for pictures taken with long focal length lenses. It was later extended to the full eight-parameter homography and incorporated into the international standard MPEG-4. The stitched background layer is called video sprites and can be used to generate the background of each frame after transmission. Temporal correlation A technique or process that considers the similarity between two images over time and evaluates the similarity. For example, the pattern of leg movements of a runner in one gait cycle is similar to that of other cycles, and it is also similar to the movement patterns of other runners. The noise at one pixel position in an image frame is often similar to that in the next, that is, the noise measured in consecutive image frames has related characteristics. The similarity of successive video frames can be used to improve video compression. Weighted prediction The term used when bias and gain methods are applied in video coding. Bias and gain A linear (affine) model that corrects exposure differences. When the exposures used for two images are different, their brightness can be related by I2(x) = (1 + α)I1(x) + β, where β is the bias and α is the gain, which respectively control the brightness and contrast of the image.
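The bias-and-gain model above, I2(x) = (1 + α)I1(x) + β, can be fitted between two differently exposed images by ordinary linear least squares; this short NumPy sketch is illustrative and assumes the two images are already registered.

```python
import numpy as np

def fit_bias_gain(i1, i2):
    """Least-squares fit of i2 ≈ (1 + alpha) * i1 + beta; returns (alpha, beta)."""
    x = i1.ravel().astype(np.float64)
    y = i2.ravel().astype(np.float64)
    A = np.stack([x, np.ones_like(x)], axis=1)
    slope, beta = np.linalg.lstsq(A, y, rcond=None)[0]
    return slope - 1.0, beta

# Example: a synthetic exposure change with gain alpha = 0.2 and bias beta = 10
i1 = np.random.rand(64, 64) * 200
i2 = 1.2 * i1 + 10
alpha, beta = fit_bias_gain(i1, i2)   # approximately 0.2 and 10
```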
Conditional replenishment A video image encoding method in which only the parts of the video image that have changed since the last frame are transmitted. It is more effective for sequences with large stable backgrounds, but motion compensation is needed for more complex sequences. Video error concealment Compressed video is often transmitted in blocks, so it will be affected by block loss during transmission. The loss of data will affect the reconstruction of the video, which in turn will cause missing segments in the video frames. Error concealment attempts to reconstruct lost data by using data from previous or subsequent frames. Since this repair is only estimated, it may lead to blockiness or interference effects. Error concealment attempts to both replace missing data and correct all introduced artifacts.
21.4.2 Intra-frame Coding Intra-frame coding Encoding that considers only one frame of the video by itself (without considering its associations with the previous and next frames). This can be done using a coding method for still images. The resulting encoded frame is called an independent frame. Macro-block Specific sub-images in video coding. For example, in the international standard H.264/AVC, when motion prediction is performed on a video image, the image is divided into a set of 16 × 16 luma macro-blocks and two sets of 8 × 8 chroma macro-blocks. For macro-blocks of 16 × 16, they can be further decomposed into four sub-blocks, which is called macro-block partition; see Fig. 21.6a. For macro-blocks of 8 × 8, they can be decomposed into four sub-blocks, which is called macro-block sub-partition, as shown in Fig. 21.6b. Macro-block partition See macro-block. Macro-block sub-partition See macro-block.
Fig. 21.6 Macro-block partition and macro-block sub-partition: (a) a 16 × 16 macro-block partitioned into 16 × 16, 16 × 8, 8 × 16, or 8 × 8 blocks; (b) an 8 × 8 block sub-partitioned into 8 × 8, 8 × 4, 4 × 8, or 4 × 4 blocks
Deblocking filter A filter used in compressed video decoding when a block coding method is used. An appearance-enhancing transform is applied to the encoding result to eliminate blocky artifacts at the macro-block scale. It can be used in a variety of compressed video encoding schemes, including the H.263 and H.264 video stream formats. I-frame One of three frame images encoded in a specific way in the MPEG family of standards. Because the independent frame is only intra-coded, and no other frame image is referenced, the independent frame is also called an intra-frame. Independent frames do not depend on other frames in encoding and decoding and are also suitable as the starting frame for predictive encoding. Independent frames are used in all international coding standards. Independent-frame Same as I-frame.
21.4.3 Inter-frame Coding Inter-frame coding In video coding, the process or technology of encoding by means of inter-frame connections. See inter-frame predictive coding. Inter-frame predictive coding Inter-coding using prediction. Both unidirectional predictive coding and bidirectional predictive coding are typical methods. The encoded frame obtained by the former is called a prediction frame (P-frame), and the encoded frame obtained by the latter is called a bidirectional prediction frame (B-frame). For example, the P frame is encoded in the international standard for moving image compression, where the correlation between the current frame and the next frame is calculated, and the motion of the target within the frame is estimated to determine how to compress the next frame with motion compensation to reduce inter-frame redundancy. Motion prediction In image coding, pixel values of pixels with motion information are predicted. The simplest method is to predict a pixel value in the current frame from the pixels of the previous frame that it corresponds to. If the gray/color of pixels with the same spatial position in adjacent frames before and after changes, it is possible that one or both of the scene and the camera have moved. At this time, motion compensation is required and how to compress the next frame to reduce inter-frame redundancy is determined. This encoding method is also called inter-frame coding. Motion coding 1. A component of video sequence compression, in which effective methods are used to represent the motion of the image area between video frames.
2. Refers to adjusting nerve cells to respond to the direction and speed of image movement. Motion compensation A technique used in video coding, a means to effectively perform predictive coding. The description of one frame image is relative to several other (previous or future) frame images. This allows for a higher level of video compression, because frame-to-frame variation is relatively small in most videos. Here the results of the motion estimation algorithm (often encoded in the form of motion vectors) need to be used to improve the performance of the video encoding algorithm. In the simplest case (in a linear camera movement such as panning), motion compensation calculates where an object of interest will appear in an intermediate field or frame and moves the object to the corresponding position in each source field or frame. Motion compensated prediction See motion compensation. Motion-compensation filter Filter used for motion compensation. Assuming that the change in the gray level of the pixel along the motion trajectory is mainly due to noise, a low-pass filter can be used along the motion trajectory of each pixel. In other words, the effect of noise can be reduced by low-pass filtering the motion trajectory corresponding to each pixel. Motion compensated video compression See motion compensation. It has been used in MPEG-1, MPEG-2, and MPEG-4 as well as international standards such as H.263, H.264, and H.265. Prediction-frame In the MPEG family of standards, one of three frame images that is encoded in a specific way. The prediction frame is encoded with reference to the previous initial frame or the previous prediction frame, and the motion estimation is used to perform inter-frame coding. Predicted-frame Same as prediction-frame. P-frame Abbreviation of prediction-frame. Abbreviation of predicted-frame. Unidirectional prediction frame In video coding, frames that are coded using unidirectional prediction. P-frames in the international standard MPEG-1 and H.261 are unidirectionally predicted frames. Unidirectional prediction In video coding, a method of predicting a pixel value in the current frame image by using the corresponding pixel of the previous frame (or multiple frames). Here the prediction is made from the previous frame to the current frame. The predicted frame can be expressed as
Fig. 21.7 Schematic diagram of unidirectional prediction sequence (frame order: I P P P P I P P P P)
fp(x, y, t) = f(x, y, t − 1) This is a linear prediction and only holds if there is no motion change between adjacent frames. In the real world, both the scenery and the camera may move, which will cause the gray level/color of pixels at the same spatial position in adjacent frames to change. In this case, motion compensated prediction (MCP) is required: fp(x, y, t) = f(x + dx, y + dy, t − 1) Among them, (dx, dy) represents the motion vector from t − 1 to t, f(x, y, t − 1) is called the reference frame, f(x, y, t) is called the coding frame, and fp(x, y, t) is called the prediction frame. The reference frame must be encoded and reconstructed before the coding frame, and the coding frame is encoded with the help of the prediction frame. For example, in the international standard H.261, a certain number of consecutive frames form a group, and the following two types of frame images in the group are encoded in two different ways: 1. I-frame: the first frame of each group. The initial frame is encoded as an independent frame to reduce intra-frame redundancy. This encoding method is called intra-frame coding. 2. P-frame: the remaining frames in the same group. They are predictively coded, that is, by calculating the correlation between the current frame and the next frame, and predicting the motion of the target within the frame to determine how to compress the next frame with motion compensation to reduce inter-frame redundancy. This coding method is called inter-frame coding. According to the above coding method, the structure of the coding (decoding) sequence is shown in Fig. 21.7. Each I-frame is followed by several P-frames, and the I-frame is independently encoded, while the P-frame is encoded with reference to the previous frame. Bidirectional prediction frame One of three frame images encoded in a specific way in the MPEG series of international image standards. It is referred to as a bidirectional frame and is abbreviated as a B-frame. When encoding a B-frame, bidirectional prediction is performed by combining the initial frame and/or the predicted frame before and after it as references.
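Given per-block motion vectors (for instance from the block-matching sketch earlier), motion-compensated prediction fp(x, y, t) = f(x + dx, y + dy, t − 1) can be sketched as below; the block size and the clamping of displaced blocks to the image border are illustrative choices, and the commented line shows the displaced frame difference (the residual that would actually be encoded).

```python
import numpy as np

def motion_compensated_prediction(ref, mv, block=16):
    """Build the prediction frame f_p from the reference frame and the
    per-block motion vectors mv[by, bx] = (dy, dx)."""
    H, W = ref.shape
    pred = np.zeros_like(ref)
    for by in range(mv.shape[0]):
        for bx in range(mv.shape[1]):
            dy, dx = mv[by, bx]
            y = int(np.clip(by * block + dy, 0, H - block))  # clamp to border
            x = int(np.clip(bx * block + dx, 0, W - block))
            pred[by * block:(by + 1) * block,
                 bx * block:(bx + 1) * block] = ref[y:y + block, x:x + block]
    return pred

# Displaced frame difference (residual):
# dfd = cur.astype(np.int16) - motion_compensated_prediction(ref, mv).astype(np.int16)
```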
Fig. 21.8 Schematic diagram of bidirectional prediction sequence (frame order: I B B P B B P I B B)
Bidirectional prediction In video coding, a method of predicting a pixel value in the current frame image by using corresponding pixels in the frames before and after it. The predicted frame can be expressed as fp(x, y, t) = a·f(x + dax, y + day, t − 1) + b·f(x − dbx, y − dby, t + 1) Among them, (dax, day) represents the motion vector from time t − 1 to t (for forward motion compensation), and (dbx, dby) represents the motion vector from time t to t + 1 (for backward motion compensation). Using f(x, y, t − 1) to predict f(x, y, t) is called forward prediction, and using f(x, y, t + 1) to predict f(x, y, t) is called backward prediction. The two predictions combined give the bidirectional prediction. The weighting factors a and b satisfy a + b = 1. For example, in the international standard MPEG-1, a certain number of consecutive frames are formed into a group, and the following three types of frame images in the group are encoded in three different ways (see Fig. 21.8): 1. I-frame: Compressed using the discrete cosine transform (DCT) algorithm with only its own information, that is, intra-frame coding is performed without reference to other frame images; each input video signal sequence contains at least two I-frames. 2. P-frame: Needs to refer to the previous I-frame or P-frame and performs inter-frame coding with motion estimation; the compression ratio of P-frames is about 60:1. 3. B-frame: Also known as bidirectional prediction frame, it refers to one I-frame or P-frame before and one after for two-way motion compensation; the B-frame has the highest compression rate. It should be noted that when bidirectional prediction is used, the encoding order of the frames is different from the original video sequence. For example, unidirectional prediction is first used to encode some frames from independent frames, and then bidirectional prediction is used to encode the remaining frames, that is, P-frames are coded first and then B-frames are coded. According to the above coding method, the structure of the coding (decoding) sequence is shown in Fig. 21.8. Bidirectional frame Abbreviation for bidirectional prediction frame. B-frame Abbreviation for bidirectional prediction frame.
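Assuming forward- and backward-compensated predictions have already been formed from the previous and next reference frames (for example with the motion-compensation sketch above), the bidirectional prediction is just their weighted combination with a + b = 1; the equal weights below are an illustrative choice.

```python
import numpy as np

def bidirectional_prediction(pred_forward, pred_backward, a=0.5, b=0.5):
    """Weighted combination of forward and backward motion-compensated
    predictions (a + b = 1), as used for B-frames."""
    assert abs(a + b - 1.0) < 1e-9, "weights must sum to 1"
    return a * pred_forward.astype(np.float32) + b * pred_backward.astype(np.float32)
```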
21.5 Video Computation
Various processing can be performed on the sequence images in the video (Forsyth and Ponce 2012; Hartley and Zisserman 2004; Sonka et al. 2014; Zhang 2017a, b).
21.5.1 Image Sequence Video computing Mainly refers to image analysis and image understanding in the video field. Its research focuses on how to use motion information to analyze the movement of scenes and events in the scene to solve practical problems (such as judging human behavior, identifying facial expressions, etc.). To this end, more qualitative information than quantitative information is often required, as well as knowledge and context (environmental information). Image sequence A group of images, or a series of images related in content, with a certain order and interval in time. Typically, the camera and/or a target in the scene is moving. Figure 21.9 shows an image sequence acquired when the camera and the person are moving. A video image is a special image sequence in which frames are arranged at equal intervals on the time axis. Image sequence analysis Techniques that can be used on a series of images to segment, combine, measure, or identify them, using timing information of the image content (change detection can be used). Typical areas of application range from image compression (as in MPEG-4) to behavior analysis. Image sequence understanding Extending image understanding with image sequences instead of single images as input. See video understanding.
Fig. 21.9 An image sequence captured with camera and person movements.
Image sequence matching Calculate the correspondence between pixels or image features in an image sequence. According to the correspondence, an image mosaic can be constructed to realize motion detection or scene structure reconstruction. Image sequence fusion A combination of information from multiple images in an image sequence. Typical kinds of fusion applications include 3-D structure restoration, generating mosaic representations of scenes, and improving the imaging of scenes with average smoothing.
21.5.2 Video Analysis Video analysis An extension of image analysis when the input is video (or a sequence of images). A general term describing information extraction from a video. The tasks include extracting measurement data related to the observed target in the video: video segment classification, cut detection, video key frame detection, etc. Video segmentation Segmenting a video so that it not only preserves the temporal consistency of the original shots but also divides the video sequence into different groups consisting of consecutive frames, for example at scene changes between groups. Temporal segmentation Split the time signal into discrete subsets. For example, the start and end of a given action can be determined from a video sequence (such as a car moving from driving to parking), or a change in condition (such as from an indoor scene to an outdoor scene), or repeated actions can be broken into complete cycles one by one (such as breaking a 100-meter run into starting, accelerating, halfway, and crossing the finish line). Video sprites 1. Each moving target in the video texture. 2. In image stitching, it refers to the stitching of the background layer, which is a single large image obtained by stitching the (usually partially overlapping) background in multiple frames (from the camera movement). This large image can be easily transmitted and used to generate the background in each frame. Video descriptor 1. A set of features that summarize a video's content, such as a color histogram, statistics on the number of scenes, or the amount of motion observed. 2. A professional encoding of the video format; for example, 720p represents a 1280 × 720, 60 Hz progressively scanned video stream.
Feature motion Changes in the detected target features in an image sequence or video. Video clip categorization Determine which predefined category the content in a video (such as action video, sports video, outdoor scenery) belongs to. Video clip A video or set of frame images. Video clips can be any part of a video sequence, but in practice the collection of images in the same video clip often has a content association. Compare video shots. Video mosaic 1. A collection of video key frames, which correspond to the video summarization of various actions in the video. 2. Collage of images or videos to generate a new video. The collage here is similar to the mosaic of multiple images into a still image mosaic, also to improve the effect of art or entertainment.
Chapter 22
Multi-resolution Image
In practical applications, some contents or features in an image need to be observed at a specific resolution to be easily seen or detected. Therefore, using multiresolution images can often extract image features more effectively and obtain image content more actually (Kropatsch and Bischof 2004; Zhang 2017a).
22.1
Multi-resolution and Super-Resolution
Multi-resolution images use multiple resolutions to express and process images (Costa and Cesar 2001; Kropatsch and Bischof 2004). Super-resolution generally represents a class of methods that enlarges (smaller) size images or videos and increases their resolving power (Zhang 2017a).
22.1.1 Multi-resolution Hierarchical A generic term for a method that first considers high-level data with less detail and then gradually increases the level of detail for processing. This method relies on hierarchical representations and often achieves better results. Coarse-to-fine processing A multi-scale strategy or algorithm that starts processing at a larger or coarser level and then iteratively advances to a smaller or finer level. It is important here that the results of each layer are propagated to other layers to ensure a good final result. Coarse/fine A general image processing strategy that calculates the "rough" version of the image first, and then calculates on increasingly finer versions. Previous calculations often
have constraints and guidance on subsequent calculations. See multi-resolution theory. Multi-resolution A representation that uses multiple resolutions for an image. Multi-resolution method See multi-scale method. Multi-resolution theory Related theories of representing, processing, and analysis images at more than one resolution. Combining a variety of technologies and methods, including the theory of wavelet transform. Because multiple resolutions correspond to multiple scales, it is also called multi-scale theory. Multi-resolution image analysis Analysis of images simultaneously at multiple resolutions. This image technology has a wide range of uses, such as considering targets on different scales and identifying features on specific scales. See Gaussian pyramid, Laplacian pyra mid, and wavelet transform. Multi-resolution refinement equation An equation that establishes a connection between adjacent resolution hierarchies and spaces in an image. In wavelet transform, multi-resolution refinement equations can be established for both scaling functions and wavelet functions. It can be further proved that the expansion function of any subspace can be constructed with the expansion function of the subspace of the next resolution (1/2 resolution). Multi-resolution wavelet transform Same as multi-scale wavelet transform. Pyramidal radial frequency implementation of shiftable multi-scale transforms A wavelet construction method that decomposes the image into horizontal, vertical, and cross subbands at each layer. The representations are more symmetrical and directional selective, and because they are overcomplete, they can avoid aliasing that occurs when the signal is sampled below the Nyquist frequency. It has the same analysis and synthesis basis functions and is well suited for structural analysis and matching work. Multi-resolution threshold Using multi-resolution theory, the thresholds used for image segmentation are selected on multiple scales. Level of detail [LOD] In multi-resolution surface representation, corresponding to higher-resolution layers. Composite and difference values Pixel values in a progressive method that uses a binary tree to layer an image for representation. In the multi-layer results obtained by this method, the first few layers
are composed of large image blocks with low resolution, and the latter layers are composed of small image blocks with high resolution. The main principle of this method is to transform a pair of pixel values into two values: a composite value (the sum of the two) and a difference value (the difference between the two). Multigrid A class of iterative techniques. The basic idea is to build a rough (low-resolution) version of the problem first and then use this version to calculate the low-frequency components of the solution. Different from coarse-to-fine processing that generally use coarse solutions to initially thin solutions, the multigrid technique only corrects the low-frequency components of the current solution and uses multiple rounds of coarsening and thinning (called “V” and “W” motion modes in the pyramid) to get fast convergence. For simple homogeneous problems (such as solving Poisson’s equation), the best performance can be achieved, that is, the calculation time is linear with the number of variables. Algebraic multigrid [AMG] A variation of multigrid technology, which is more useful for non-homogeneous or irregular grid problems than for multigrid. Multi-grid method An efficient algorithm for solving discrete differential (or other) systems of equations. The qualifier “multi-grid” is used because the system of equations is solved at a coarse sampling level, and the results are then used to initiate a higher resolution solution.
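A minimal sketch of the coarse-to-fine idea discussed in this section, building a Gaussian pyramid by repeated smoothing and 2:1 down-sampling; the use of scipy's gaussian_filter, the number of levels, and the per-level sigma are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, levels=4, sigma=1.0):
    """Return a list of images from fine (the original) to coarse; each level
    is Gaussian-smoothed and then down-sampled by a factor of two."""
    pyramid = [image.astype(np.float32)]
    for _ in range(levels - 1):
        smoothed = gaussian_filter(pyramid[-1], sigma)
        pyramid.append(smoothed[::2, ::2])
    return pyramid

# Coarse-to-fine processing starts at pyramid[-1] (coarsest level) and
# propagates its result back down to pyramid[0] (the original resolution).
```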
22.1.2 Super-Resolution Super-resolution [SR] 1. A method or technique for improving the resolution of an optical system. The resolution of the general optical system is limited by the diffraction limit, and the diffraction spot has a natural limit. The main way to reduce the diffraction spot is to reduce the wavelength. Super-resolution now generally represents a class of methods that (smaller) enlarge image or video scales and increase their resolving power. See super-resolution technique. 2. The phenomenon that the camera objective lens cannot resolve a certain spatial frequency, but can resolve at a higher frequency. When the objective lens is inspected with a common resolution inspection chart, the contrast of the image is actually seen. When the objective lens has asymmetrical aberrations, the image of the bright point moves away from the original position, and the focus must be shifted to obtain the best focus. This shift is different for different spatial frequencies. 3. Techniques and processes for generating a high-resolution image from multiple low-resolution images of the same target obtained from different viewpoints. For
example, use a moving camera to continuously capture the same scene. The key to successfully obtaining high-resolution images is to accurately estimate the registration relationship between viewpoints. Super-resolution technique A technique for recovering the original scene image from imaging data with limited bandwidth. Physically achievable imaging systems all have an upper frequency limit, and high-frequency parts above the upper limit cannot be imaged. A typical example is that the images actually acquired have a certain spatial resolution. To further improve the spatial resolution, given the imaging system, super-resolution technology can be used to build a high-resolution image by collecting a series of lower-resolution images. The currently used super-resolution technologies mainly include super-resolution reconstruction technology and super-resolution restora tion technology. Super-resolution restoration A super-resolution technology. Using the degraded point spread function of the image and prior knowledge of the target in the image, the information lost in a single image due to exceeding the limit of the transfer function of the optical system is restored beyond the diffraction limit of the image system. Super-resolution reconstruction A super-resolution technology based on multiple images. By combining non-overlapping information in multiple low-resolution images, a higher-resolution (large-size) image is constructed. If multiple low-resolution images are obtained from an image sequence, super-resolution reconstruction results can be achieved with the help of motion detection. Reconstruction-based super-resolution A super-resolution technology implemented using image registration technology and image reconstruction technology. During registration, multiple frames of low-resolution images are used as constraints on data consistency to obtain subpixel accurate relative motion between other low-resolution images and reference low-resolution images. During the reconstruction, the prior knowledge of the image is used to optimize the target image to achieve super-resolution. Example-based super-resolution A typical learning-based super-resolution method. First, learn the relationship between the low-resolution image and the high-resolution image by studying examples, and then use this relationship to guide the super-resolution reconstruction of the low-resolution image. Learning-based super-resolution One of many super-resolution technologies. The basis is that a low-resolution image is fully equipped with information for inference to predict its corresponding high-resolution portion. Therefore, a method of training a set of low-resolution images can be used to generate a learning model, and high-frequency detail information of the image can be derived from this model to achieve super-resolution.
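As a toy illustration of super-resolution reconstruction from multiple low-resolution images, the following shift-and-add sketch assumes the integer offsets of the low-resolution frames on the high-resolution grid (dy, dx in {0, ..., factor − 1}) are already known from registration; real reconstruction methods add interpolation, regularization, and prior knowledge, all omitted here.

```python
import numpy as np

def shift_and_add_sr(low_res_frames, shifts, factor=2):
    """Place each low-resolution frame onto a factor-times denser grid at its
    known (dy, dx) offset and average the samples that land in each cell."""
    h, w = low_res_frames[0].shape
    acc = np.zeros((h * factor, w * factor), dtype=np.float64)
    cnt = np.zeros_like(acc)
    for frame, (dy, dx) in zip(low_res_frames, shifts):
        acc[dy::factor, dx::factor] += frame
        cnt[dy::factor, dx::factor] += 1
    cnt[cnt == 0] = 1                  # leave never-observed cells at zero
    return acc / cnt

# Example: four frames shifted by (0,0), (0,1), (1,0), and (1,1) fill a 2x grid
```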
Blind image deconvolution In the algorithm to achieve super-resolution, the problem of inferring fuzzy kernels and sharpening images is performed at the same time.
22.2
Multi-scale Images
Multi-resolution images are often called multi-scale images, which can be obtained by using multi-scale transformations in multi-scale space (Costa and Cesar 2001; Kropatsch and Bischof 2004).
22.2.1 Multi-scales Multi-scale An image is represented on multiple scales, that is, images are represented on different scales. Multi-scale theory See multi-resolution theory. Multi-scale representation Use different scales for the same image. Using different scale representations for the same image is equivalent to adding a new coordinate to the representation of the image data. That is, in addition to the general spatial resolution (number of pixels), there is now a new parameter that describes the current level of expression. If this new scale parameter is labeled with scale s, the data structure containing a series of images with different resolutions can be called scale space. f(x, y, s) can be used to represent the scale space of the image f(x, y). Multi-scale representation of an image actually adds a new dimension to the image, from which the scale information in the image can be detected and extracted, but at the same time, it also results in an increase in image data storage requirements and the amount of calculation required for image processing. Detecting zero-crossing points in a set of Gaussian smooth and gradually increasing images is an application example of multi-scale expression. The generalized cylinder model can also be combined with multi-scale representations to express complex structures. For example, a multi-scale model can be used to represent an arm. If a single generalized cylinder is required on the coarse scale, then two generalized cylinders (the big and small arms) are required on the mesoscale, and on the fine scale, surface triangulation can also be used. The representation result can be several discrete scales, or scales in a more continuous range, similar to scale space.
Fig. 22.1 Multi-scale edge detection
Multi-scale method A general term that indicates the operation on an image using more than one scale. Different scales can be obtained by reducing the image size or by Gaussian smoothing the image. Both methods can reduce the spatial frequency of information. The main motivations for using the multi-scale approach are: 1. Some structures have different natural dimensions (e.g., a wide strip can be considered as two back-to-back edges). 2. The coarse-scale information is generally more reliable when there is noise in the image, but its spatial accuracy is not as good as the fine-scale information. For example, in edge detection, a coarse scale can be used first to reliably detect a rough edge, and then a fine scale can be used to accurately determine the position of the edge. Figure 22.1 shows a set of examples. The left image is the original image, the middle image is the result of the coarse-scale detection, and the right image is the result of the subsequent fine-scale refinement. Multilevel method See multi-scale method. Multi-scale description See multi-scale method. Multi-scale integration 1. Use operators with different scales to combine the extracted information. 2. Combine the information extracted from the registration of images with different scales. If the difference in operator scales is only considered as the difference in the amount of smoothing, the above two definitions respectively correspond to two ways of considering the same process. For example, when trying to combine edges extracted from images with different amounts of smoothing to obtain more reliable edges, multi-scale integration is needed. Multi-scale entropy A shape complexity descriptor. First convolve the original image with a set of multi-scale Gaussian kernels, where the Gaussian kernel is g(x, y, s) = exp[−(x² + y²)/(2s²)],
where s represents the scale. Letting f(x, y, s) = f(x, y) ⊗ g(x, y, s) represent a multi-scale (fuzzy) image, the multi-scale entropy can be expressed as
E(s) = −∑_i p_i(s) ln p_i(s)
Among them, pi(s) is the relative frequency of the i-th gray level corresponding to the scale s in the blurred image f(x, y, s). Multi-scale wavelet transform Wavelet transform at different scales. Also called multi-resolution wavelet trans form. The scale change of multi-scale wavelets enables wavelet analysis of images to focus on discontinuities, singular point, and edges. For example, by irregular sampling the results of the multi-scale wavelet transform, the maximum value of the wavelet mode can be obtained, which can represent the edge position in the image. Scale 1. The ratio of the size of an object, image, or frame to its reference or model. Originally refers to size. 2. The nature of certain image features that appear only when viewed at a certain size. For example, a line segment must be enlarged to a certain degree before it can be regarded as a pair of parallel edge features. 3. A measure that represents the degree to which subtle features are removed or weakened in an image. Images can be analyzed at different spatial scales, and at a given scale, only characteristics of a certain size range can be revealed. 4. In image engineering, it is used to describe the level of current image resolution. See multi-scale representation. Scale operator A class of operators that remove details (corresponding to high frequency content) in an image, such as Gaussian smoothing operators. It can discard small-scale details, so that the content in the result can be expressed with smaller images. See scale space, image pyramid, Gaussian pyramid, Laplacian pyramid, and pyramid transform. Scale reduction Results obtained using scale operators. Scale selection 1. Determine the scale required for some image manipulation. With some measurement operations on the image (such as determining edge strength), the results will vary with the image scale or smooth scale. At this time, it is important to determine the appropriate scale. For example, for some scales, the measurement results may obtain local extreme values; for a wide range of scales, the measurement results may no longer be stable. The purpose of scale selection is to determine these scales. In the SIFT operator, the selected feature points are partially dependent on the local minimum of Laplacian.
2. Select the size of the operator (template) to fit a specific size target. For example, in eye location, the possible sizes of the eyes appearing in the image need to be considered.
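A sketch of the multi-scale entropy E(s) defined above: blur the image with a Gaussian of scale s, form a gray-level histogram, and accumulate −∑ p_i ln p_i; the 256-bin quantization and the use of scipy's gaussian_filter are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def multiscale_entropy(image, scales):
    """Return E(s) = -sum_i p_i(s) ln p_i(s) for each scale s, where p_i(s)
    are the relative gray-level frequencies of the Gaussian-blurred image."""
    values = []
    for s in scales:
        blurred = gaussian_filter(image.astype(np.float64), s)
        hist, _ = np.histogram(blurred, bins=256)
        p = hist / hist.sum()
        p = p[p > 0]                     # skip empty bins (0 * ln 0 = 0)
        values.append(float(-(p * np.log(p)).sum()))
    return values
```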
22.2.2 Multi-scale Space Scale space 1. In low-level vision, a theory that considers the multi-scale nature of images. The basic idea is that in the absence of prior knowledge of the optimal spatial scale required for a particular problem/operation (such as edge detection), the image should be analyzed at all possible scales (the scale space), where a coarse scale represents a simplification of a finer scale. The finest scale is the input image itself. See image pyramid. 2. A data structure containing a series of images of different resolutions. After representing an image with different scales, it is equivalent to adding a new dimension to the representation of the image data. That is, in addition to the general spatial resolution, there is now a new parameter describing the current resolution level. If s is used to label this new scale parameter, g(x, s) can be used to represent the scale space of the image g(x) (x represents the space coordinates of the image pixels). In the limit case of s → ∞, the scale space converges to a constant image with its average gray scale. For example, smooth an image with a Gaussian low-pass filter of increasing standard deviation σ. This results in a set of images with decreasing detail. This set of images can be thought of as lying in a 3-D space, where two axes are the (x, y) axes of the image and the third axis is the standard deviation σ (here referred to as the "scale"). This 3-D space is called the scale space. Scale space representation A special multi-scale representation of an image. The information is displayed on multiple spatial scales, and causal links are established between adjacent scale levels. It includes a continuous scale parameter and uses the same spatial sampling on all scales. The scale hierarchy is identified by a scalar parameter called a "scale parameter." A key requirement is that each coarser level obtained by applying the scale operator in turn should be a simplified version of the finer level before it, that is, it should not produce parasitic details. A widely used scale space representation is the Gaussian scale space, where the next coarser image is obtained by convolving the current image with a Gaussian kernel. The variance of this kernel is the scale parameter. See image pyramid and Gaussian smoothing. Linear scale space representation A special scale space representation. If f: R^N → R represents a given N-D image, then for the scale space representation L: R^N × R+ → R, define its scale s = 0 (s ∈ R+) as the original image, L(·; 0) = f; when for scale s > 0 it satisfies
22.2
Multi-scale Images
815
L(•; s) = g(•; s) ∗ f, a linear scale space representation is obtained, where g: R^N × (R+\{0}) → R is a Gaussian kernel. Curvature scale space Multi-scale representation space for the zero crossings of curvature on a progressively smoothed planar contour. The arc length can be used to parameterize the contour, which is convolved with a Gaussian filter of gradually increasing standard deviation. Then the curvature zero-crossing points are recovered and mapped into the scale space image. The horizontal axis represents the arc length parameter along the original contour, and the vertical axis represents the standard deviation of the Gaussian filter. Scale space primal sketch A primitive for scale space analysis, located in each layer of the space. A Gaussian kernel is commonly used to continuously smooth the image so as to represent the original image at multiple scales. The image elements are then analyzed layer by layer at different scales. Studies have shown that scale space primitives can be used to explicitly extract salient structures (such as block-like features) in an image and to characterize their spatial movement rules. Scale space filtering A filtering operation that transforms one resolution level into another resolution level in scale space. Filtering with Gaussian filters is a typical technique. Scale space matching A class of techniques that compare shapes for matching on different scales. See scale space and image matching.
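To make the Gaussian scale space representation above concrete, the following minimal sketch (illustrative function name; it assumes NumPy and SciPy are available) stacks progressively Gaussian-smoothed copies of an image along a new scale axis, with the original image as the finest scale.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_scale_space(image, sigmas):
    """Return a 3-D array L(x, y, s): the image smoothed at each scale in `sigmas`.

    The level at s = 0 is the input image itself; larger sigma gives coarser scales.
    """
    image = image.astype(float)
    levels = [image]                      # finest scale: the original image
    for s in sigmas:
        levels.append(gaussian_filter(image, sigma=s))
    return np.stack(levels, axis=-1)      # shape: (rows, cols, len(sigmas) + 1)

# Usage sketch (scales increasing geometrically):
# L = gaussian_scale_space(f, sigmas=[1, 2, 4, 8, 16])
```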
22.2.3 Multi-scale Transform Multi-scale transform In signal processing, it refers to an operation that expands a 1-D signal u(t) by a 2-D transform U(b, s). U(b, s) has two parameters: the position parameter b is related to the time variable t of u(t), and the scale parameter s is related to the scale of the signal. In image processing, the 3-D transform F(p, q, s) is used to expand the 2-D signal f(x, y). F(p, q, s) has three parameters: the position parameters p and q are related to the coordinate variables x and y of the image f(x, y), and the scale parameter s is related to the scale representation of the image. Multi-scale transform technology can be divided into three categories: scale space technology, time scale technology (wavelet transform is a typical representative), and time-frequency technology (Gabor transform is a typical representative). Time scale A multi-scale transform technique or corresponding space. Wavelet transform is a typical representative of this type of technology.
Fig. 22.2 Comparison of three time-frequency grids (a) grid corresponding to the Fourier transform (b) grid corresponding to the Gabor transform (c) grid corresponding to the wavelet transform
Time frequency A multi-scale transform technique or corresponding space. The Gabor transform is a typical representative of this type of technology. Time-frequency window A window that defines a window function in a multi-scale analysis. The two directions of this window correspond to time and frequency, respectively; see time-frequency grid. Scale space technique A class of multi-scale transform techniques. There are many types. For example, the local maxima of a signal u(t) correspond to the zero crossings of its derivative u'(t). If u'(t) is convolved with a Gaussian function g(t), we get

u'(t) ∗ g(t) = [u(t) ∗ g(t)]' = u(t) ∗ g'(t)

That is, the convolution of the signal's derivative u'(t) with the Gaussian function g(t) is equal to the convolution of the signal u(t) with the derivative of the Gaussian function g'(t), so that the detection of the extreme points of u(t) becomes the detection of the zero crossings of the convolution result. The width of the Gaussian function is controlled by the standard deviation parameter. If it is defined as a scale parameter, then for each scale a set of extreme points of the smoothed u(t) can be determined. In this way, the scale space technique for u(t) is to detect such a set of extreme points that change with the scale parameter. Time-frequency grid A graph (pattern) that reflects the domain size and sampling distribution of time and frequency in an image transform. Generally, the abscissa of the graph corresponds to time, and the ordinate corresponds to frequency. Figure 22.2a shows the grid corresponding to the Fourier transform. The horizontal parallel solid lines indicate that there is only a difference in frequency after the transform. Figure 22.2b shows the grid corresponding to the Gabor transform. It has localization characteristics in time and frequency, but the size of the localized grid does not change with time or frequency. Figure 22.2c shows the grid corresponding to the wavelet transform, which is characterized by a large time window when the frequency is relatively low,
and a small time window when the frequency is relatively high. The product of the length of the time window and the length of the frequency window is constant.
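The scale space technique entry above relies on the identity u'(t) ∗ g(t) = u(t) ∗ g'(t). The following sketch (hypothetical helper names, assuming NumPy) finds the extreme points of a smoothed 1-D signal at several scales as zero crossings of its convolution with the Gaussian derivative.

```python
import numpy as np

def gaussian_derivative(sigma, radius=None):
    """First derivative of a 1-D Gaussian, g'(t), sampled on an integer grid."""
    if radius is None:
        radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-t**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return -t / sigma**2 * g                       # g'(t)

def extrema_across_scales(u, sigmas):
    """For each scale sigma, return the indices where u * g' changes sign,
    i.e., the extreme points of the Gaussian-smoothed signal."""
    out = {}
    for s in sigmas:
        d = np.convolve(u, gaussian_derivative(s), mode='same')
        out[s] = np.where(np.diff(np.sign(d)) != 0)[0]
    return out
```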
22.3
Image Pyramid
Combining the Gaussian pyramid and the Laplacian pyramid can perform multi-resolution decomposition and reconstruction of the image (Kropatsch and Bischof 2004; Zhang 2017a).
22.3.1 Pyramid Structure Pyramid A multi-resolution method for object representation. It is essentially a method of region decomposition. The image is decomposed into multiple levels, with a “parent-child” relationship between the upper and lower layers and a “brother” relationship between the same layers. There are two main concepts used to describe the pyramid: reduction factor and reduction window. See image pyramid. Reduction factor 1. In a multi-resolution image, a parameter representing the rate at which the image region decreases between layers. In the case where the reduction windows do not overlap each other, it is equivalent to reducing the number of pixels in the window. 2. A parameter describing the structure of the pyramid. The reduction factor determines decreasing speed of the number of units from the next level to the previous level in the pyramid structure. Reduction window A parameter describing the structure of a pyramid. The relationship between the cells in one layer in the pyramid structure and a group of related cells in the next layer is determined. The reduced window corresponds to this group of related units. Pyramid transform Starting from an image to perform the operation of building a pyramid, it can also refer to the operator or technology that implements this operation. See pyramid, image pyramid, Gaussian pyramid, and Laplacian pyramid. Pyramid architecture A computer architecture that supports pyramid-based processing and is generally used in multi-scale processing environments. See scale space, pyramid, image pyramid, Gaussian pyramid, and Laplacian pyramid.
Pyramid-structured wavelet decomposition See wavelet decomposition. Matrix-pyramid [M-pyramid] A hierarchical structure that represents image data. A matrix pyramid is an image sequence {ML, ML–1, . . ., M0}, where ML has the same dimensions and elements as the original image, ML–1 is obtained by reducing the resolution of ML, and M0 corresponds to a single pixel. The resolution is halved at each step, so the size of each matrix in the sequence expressing the image data is an integer power of two. If the size of ML is N × N, the number of cells required to store all matrices is 1.33N². Matrix pyramids are more suitable for situations where images need to be used at different resolutions at the same time. Compare tree-pyramid. Tree-pyramid [T-pyramid] A hierarchical structure that represents image data. Let the size of the original image (highest resolution) be 2^L × 2^L; then the tree pyramid can be defined by: 1. A set of nodes P = {P = (i, j, k) | layer k ∈ [0, L]; i, j ∈ [0, 2^L − 1]}. 2. The mapping between adjacent nodes Pk–1 and Pk of the pyramid, M(i, j, k) = (i div 2, j div 2, k − 1), where “div” represents integer division. 3. A function F that maps a node P of the pyramid to a subset Z (such as {0, 1, . . ., 255}) of the corresponding number of brightness levels. Due to the regular structure of the tree pyramid, the arcs need not be considered when storing the tree pyramid; only the nodes need to be considered. Tree pyramids are more suitable for situations where several different resolution versions of an image need to be used simultaneously. Compare matrix-pyramid. Steerable pyramid A representation that combines multi-scale decomposition and discriminative measurement. It provides a way to analyze textures at multiple scales and different orientations. Identification is often based on directional measurements of steerable basis functions. The basis functions are rotated copies of each other, and a basis function in any direction can be generated by a linear combination of these basis functions. Pyramids can contain any number of orientation bands. As a result, they are not affected by the aliasing effect, but the pyramids are overcomplete, which reduces their computational efficiency. Concise name of the pyramidal radial frequency implementation of shiftable multi-scale transforms. Curve pyramid A sequence of symbols representing curves on multiple scales. The main operation is bottom-up, constructing partial short segments of the curve into growing segments. If a segment is described by a binary “curve connection” with the aid of the labeled edges of a square unit, the cascade of short segments can be strictly implemented by selecting the transitive closure of the curve connection. If the obtained pyramids are to be “length-reduced” (at higher levels, long curves with
Fig. 22.3 Schematic diagram of half-octave pyramids (levels from l = 0 at the fine end through l = 4 at the coarse end, with medium scales in between)
many fragments are retained, and short fragments are eliminated after several reduction steps), overlapping pyramids are required. Oriented pyramid An image is decomposed into directed pyramids of different scales and different orientations. Unlike Laplacian pyramids, which do not have orientation information at each scale, each scale in an oriented pyramid represents texture energy in a particular direction. One way to generate an oriented pyramid is to use a derivative filter for the Gaussian pyramid or a directional filter for the Laplacian pyramid, which means that they are further decomposed to various scales. Adaptive pyramid A multi-scale processing method. The smaller areas in the image that have some of the same features (such as color) are first extracted and put into a graph representation. Then the image is processed (such as pruning or merging) to the appropriate scale layer. Octave pyramids Pyramid obtained by continuously smoothing and 1/2 subsampling the original image. The sampling rate between adjacent layers is two. Half-octave pyramids A modification or variation of the basic pyramid (very close to the band-pass pyramid). A schematic diagram is shown in Fig. 22.3. The even layers correspond to the original pyramid, while the odd layers correspond to the frequency octave layer. Down-sampling is performed only at each octave layer. In this way, the scale change between the layers of the half-octave pyramid is relatively small or slow, so those algorithms based on ascending from coarse to fine can often achieve better results. Band-pass pyramids See half-octave pyramids.
Difference of low-pass [DOLP] transforms Same as half-octave pyramids. Quincunx sampling Combination of a half-octave pyramid and a checkerboard sampling grid. Commonly used for multi-scale feature detection. Quarter-octave pyramids A pyramid with half the sampling rate of a half-octave pyramid.
22.3.2 Gaussian and Laplacian Pyramids Image pyramid A multi-scale representation structure for images. Consider an N × N image (where N is an integer power of 2, that is, N = 2^n). If you take one pixel out of every two in the horizontal and vertical directions (e.g., take the second of every two consecutive pixels per line and the second of every two consecutive lines) for sampling, the extracted pixels constitute an N/2 × N/2 image. In other words, by performing a 2:1 subsampling in both directions, you can get a (rougher) thumbnail of the original image. This process can be repeated until the original N × N image becomes an N/N × N/N, that is, a 1 × 1 image, as shown in Fig. 22.4. The obtained series of images forms a pyramid structure: the original image corresponds to layer 0, the image of size N/2 × N/2 corresponds to layer 1, and so on until the image of size N/2^n × N/2^n corresponds to layer n.

Fig. 22.4 The composition of the image pyramid (layers 0 to 3)

A data structure that hierarchically represents an image, where each layer contains a smaller version of the image in the previous (lower) layer. Image pixel values are often obtained through a smoothing process. Generally reduced by a power of two (such as 2, 4, etc.). Figure 22.5 shows a pyramid composed of five layers of images of sizes 8 × 8, 16 × 16, 32 × 32, 64 × 64, and 128 × 128, in order from high to low. In order to see the pixel size more clearly, the images are not smoothed.

Fig. 22.5 Image pyramid display example

A structure that represents image information on several spatial scales. An image pyramid can be constructed with the help of the original image (maximum
resolution) and a scale operator (such as a Gaussian filter) that reduces image information by discarding details at a coarser scale. The next layer (lower resolution) of the pyramid can be obtained by applying the operator and subsampling the resulting image in turn. See scale space, image pyramid, Gaussian pyramid, Laplacian pyramid, and pyramid transform. Gaussian pyramid Abbreviation for Gaussian image pyramid in image engineering. It is a multi-resolution representation of an image, consisting of multiple images. Each image is a subsampled and Gaussian-smoothed version of the original image, and the standard deviation of the Gaussian distribution increases gradually. Gaussian pyramid feature One of the simplest multi-scale transform results is a Gaussian pyramid. If I(n) is used to represent the n-th layer of the pyramid, L is the total number of layers, and S↓ is the down-sampling operator, then

I(n + 1) = S↓{Gσ[I(n)]}, ∀n, n = 1, 2, . . ., L − 1

Among them, Gσ stands for Gaussian convolution. The finest scale layer is the original image, I(1) = I. Because each layer in the Gaussian pyramid is the result of low-pass filtering of the previous layer, the low-frequency information is repeatedly expressed in it. Gaussian image pyramid A multilayer structure composed of a series of Gaussian low-pass filtered images. To construct a Gaussian image pyramid, a series of low-pass filtering operations is performed on the original image. The sampling rates of two adjacent layers of images generally differ by a factor of two, that is, the bandwidth is reduced by half. Laplacian pyramid 1. The result of a compressed representation of an image. The Laplacian is used on the current grayscale image at each layer of the pyramid. The image of the next layer is constructed by Gaussian smoothing and subsampling. At the last layer, a smoothed and subsampled image is maintained. The original image can be approximately reconstructed by expanding and smoothing the current layer image, layer by layer, and adding the Laplacian values. Can be used for image compression to eliminate redundancy.
2. In texture analysis, a way of decomposing images that minimizes redundant information while retaining and enhancing characteristics. Compared to the Gaussian pyramid, the Laplacian pyramid is a more compact expression. Each layer contains the difference between the low-pass filtered result of the rough layer and the “prediction” obtained by up-sampling, i.e.,

IL(n) = IG(n) − S↑[IG(n + 1)]
Among them, IL(n) represents the n-th layer of the Laplacian pyramid, IG(n) represents the n-th layer of the Gaussian pyramid of the same image, and S↑ represents the up-sampling realized by the nearest neighbor. Laplacian image pyramid A multilayer structure composed of a series of band-pass filtered images. To construct the Laplacian image pyramid, the subtraction of two adjacent layers of images in the Gaussian image pyramid can be used to obtain the image of the corresponding layer. Morphological image pyramid A special kind of pyramid. Each layer of the pyramid is obtained by opening and closing the previous layer and then sampling it.
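The relations I(n + 1) = S↓{Gσ[I(n)]} and IL(n) = IG(n) − S↑[IG(n + 1)] quoted above can be sketched as follows (assuming NumPy and SciPy; the smoothing σ and the nearest-neighbor up-sampling via zoom are illustrative choices, not the only possible ones).

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def gaussian_pyramid(image, levels, sigma=1.0):
    """I(n+1) = subsample(Gaussian-smooth(I(n))); I(1) is the original image."""
    pyr = [image.astype(float)]
    for _ in range(levels - 1):
        smoothed = gaussian_filter(pyr[-1], sigma)
        pyr.append(smoothed[::2, ::2])            # 2:1 subsampling in both directions
    return pyr

def laplacian_pyramid(gauss_pyr):
    """I_L(n) = I_G(n) - upsample(I_G(n+1)); the last level keeps the coarsest image."""
    lap = []
    for fine, coarse in zip(gauss_pyr[:-1], gauss_pyr[1:]):
        up = zoom(coarse, 2, order=0)             # nearest-neighbor up-sampling
        up = up[:fine.shape[0], :fine.shape[1]]   # crop in case of odd sizes
        lap.append(fine - up)
    lap.append(gauss_pyr[-1])
    return lap
```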
Part III
Image Analysis
Chapter 23
Segmentation Introduction
Image segmentation is a key step from image processing to image analysis (Zhang 2017b). It is also a basic computer vision technology (Forsyth and Ponce 2012; Gonzalez and Woods 2018; Prince 2012; Shapiro and Stockman 2001; Szeliski 2010). Image segmentation refers to the technology and process of dividing an image into regions with characteristics and extracting object regions of interest (Zhang 2006).
23.1
Segmentation Overview
Image segmentation can be given a more formal definition with the help of the concept of sets (Zhang 2017b). In the research and application of images, many times the focus is only on the object or foreground in the image, the other parts are called the background (Zhang 2006). Image segmentation has been highly valued for many years, and thousands of various types of segmentation algorithms have been proposed (Zhang 2015e). Many new mathematical tools are used for image segmentation, and various distinctive segmentation algorithms have been proposed (Bovik 2005; Kropatsch and Bischof 2004; Russ and Neal 2016; Sonka et al. 2014; Zhang 2006, 2017b).
23.1.1 Segmentation Definition Image segmentation The process of dividing the image into regions with characteristics and extracting the target region of interest. The characteristics here can be grayscale, color, texture, etc.,
or other properties or attributes. Targets can correspond to a single area or multiple areas. Image segmentation combines image pixels into meaningful (generally connected) structural elements, such as curves and regions. Image segmentation can be used with various image modalities, such as intensity data or depth data. Image segmentation can be performed with the help of various image characteristics, such as similar feature orientation, feature movement, surface shape, or texture. Image segmentation is a key step in moving from image processing to image analysis. It is also a basic computer vision technology. This is because image segmentation, target separation, feature extraction, and parameter measurement convert the original image into a more abstract and compact form, making higher-level analysis and understanding possible. Segmentation Divides a data set into multiple parts according to a given set of rules. It is assumed here that different parts correspond to different structures in the original input domain. See image segmentation, color image segmentation, curve segmentation, motion segmentation, part segmentation, range data segmentation, and texture segmentation. Formal definition of image segmentation A definition of image segmentation using the concept of sets. Let the set R represent the entire image region. The segmentation of R can be regarded as dividing R into several non-empty subsets (sub-regions) {Ri, i = 1, 2, . . ., n} that satisfy the following five conditions:

1. R1 ∪ R2 ∪ . . . ∪ Rn = R.
2. For all i and j, i ≠ j, there is Ri ∩ Rj = ∅.
3. For i = 1, 2, . . ., n, there is P(Ri) = TRUE.
4. For i ≠ j, there is P(Ri ∪ Rj) = FALSE.
5. For i = 1, 2, . . ., n, Ri is a connected region.
Among them, P(Ri) is a uniform predicate for all elements in the set Ri, and ∅ represents the empty set. Uniform predicate A special one-variable logic function. Suppose the uniform predicate is represented by P(Y), where Y is given a true or false value only depending on the properties of the points belonging to Y. P(Y) also has the following property: if Z is a non-empty subset of Y, then P(Y) = true means P(Z) = true. In the definition of image segmentation, the uniform predicate function P(Y) has the following characteristics: 1. Y is assigned a true or false value depending only on the property values of the pixels belonging to Y.
Fig. 23.1 Region of interest in image
2. If the set Z is included in the set Y and the set Z is not empty, then P(Y) = true means P(Z) = true. 3. If Y contains only one element, then P(Y) = true. Region of interest [ROI] Some regions that need to be considered and cared about in a specific image analysis application. The scope can be determined by defining a (often binary) template. It is a sub-region in an image within which the real processing should take place. The region of interest can be used to reduce the amount of calculation required or to focus attention on an area so that image data outside the area does not affect the processing or processing results. For example, in the task of tracking targets in an image sequence, most algorithms consider only the region of interest around the current predicted target position when determining the target in the next frame. Figure 23.1 gives an example of the region of interest in an image. Interest operator A neighborhood operator designed to locate pixel or sub-pixel locations with high spatial accuracy. Generally, the center neighborhood has a special gray value pattern. For such neighborhoods, the autocorrelation function often decreases rapidly with increasing radius. It is generally used to mark landmark points in a pair of images obtained from the same scene (moving the camera position or the target position slightly between the two). These pixels are then used as input to an algorithm for establishing correspondence between the two images. Space-variant processing Nonuniform (geometric) allocation of image processing effort over the image. For example, using a log-polar transformed image, focusing processing on the fovea, or focusing processing on a region of interest.
Screening Action to select photos of potential regions of interest/content from a set of photos (or images). Most places in the photo may not include the region of interest. Pixel classification Assign the pixels of an image to certain categories. Either a supervised classification method (using a model of a known category) or an unsupervised classification method (when a category model is unknown). See image segmentation. Class of a pixel The category to which each pixel in the image belongs. Trimap The result of segmenting the image (often manually) into three regions (foreground, background, unknown). Trimaps can be used as shape priors to guide image segmentation or image labeling. A preliminary division of the image to be segmented, often used as the basis for further (adaptive) processing. In matting, the trimap includes three parts: foreground, background, and region of uncertainty. Under-segmented The phenomenon or effect that segmentation does not reach the desired level of separation. Given an image in which the desired segmentation result is known, if the output region of a segmentation algorithm is often a combination of multiple desired segmentation regions, it is called under segmentation (further segmentation is needed). Figure 23.2 shows a schematic, the left figure shows that the original five targets are on the background, but the five targets in the right part of the undersegmentation result are connected together. Over-segmentation In image segmentation, the image is divided too finely, resulting in an excessive number of divided regions. Over-segmented The phenomenon that the desired region in the output of a segmentation algorithm consists of too many regions. Figure 23.3 shows a schematic; the left figure shows that there is only one target on the background and the target is divided into five blocks in the over-segmentation result on the right.
Fig. 23.2 Schematic for under segmentation
Fig. 23.3 Schematic for over-segmentation
Fig. 23.4 Figure-ground separation (original image = foreground/object + background)
Soft segmentation The requirements on image segmentation results are somewhat relaxed. It is no longer strictly required that the sub-regions in the segmentation result do not overlap each other; that is, it is no longer stipulated that a pixel cannot belong to two regions at the same time, and a pixel can belong to two regions with different likelihoods or probabilities. Compare hard segmentation. Hard segmentation Segmentation that strictly meets the requirements for image segmentation results. Compare soft segmentation.
23.1.2 Object and Background Figure-ground separation Separating the region representing the object of interest (figure, foreground) from the rest of the image (background), as shown in Fig. 23.4. Often used when discussing visual perception. In image engineering, the term image segmentation is generally used.
Foreground Objects of interest in image engineering. Against the background. Or in object recognition, the area where the region of interest is located. See figure-ground separation. Foreground-background A division of the surrounding landscape. When people observe a scene, they often refer to the object they want to observe or focus on as the foreground (also called the object) and put the other parts into the background. Distinguishing foreground and background is fundamental to understanding shape perception. The differences and connections between foreground and background include: 1. The foreground has a certain shape, and the background has relatively no shape; the foreground has the characteristics of objects, and the background seems to be unshaped raw materials; the foreground looks contoured, and the background does not. 2. Although the foreground and background are on the same physical plane, the foreground looks closer to the observer. 3. Although the foreground generally occupies a smaller area than the background, the foreground is often more attractive than the background, and it tends to have a certain semantic/meaning. 4. The foreground and background cannot be seen at the same time, but they can be seen sequentially. Binarization The process of reducing a grayscale image to only two grayscale levels (black and white). It is the basic problem of image thresholding. Background In image engineering, an area in the scene that is behind (a longer distance) from the object of interest, or a collection of pixels in the image that originates outside the object of interest in the scene. The background is opposite to the foreground and is often used in the context of object segmentation and recognition. See figure-ground separation. Background labeling The process and method to distinguish the object in the foreground of the image or the part of interest from the background. Object 1. The region of interest in the image. The determination of objects varies with different applications and has a certain subjective color. The objects in natural images are connected components. 2. A general term used to indicate a set of features in a scene that, when combined, form a larger structure. In vision, it usually refers to what the attention focuses on. 3. A general system theory term, the object here refers to the part of interest, as opposed to the background, also known as the foreground. Determining which part is targeted is also related to resolution or scale.
Object region The portion of space in the image the object occupies. It is basically the same as the object, here it is emphasized that it has a certain size on the 2-D plane. Object grouping A general term for bringing together all image data associated with an observed significant object. For example, when looking at a person, the target group clusters all pixels of the person’s image. Object segmentation Another name for image segmentation. This title emphasizes the meaning of object extraction. The main point is to separate the object of interest in a scene or an image. The methods typically used are region-based segmentation or edge-based segmentation. Compare image segmentation. Object detection In image analysis, the work of judging whether an object of interest exists in a scene or an image, and determining its position. Compare object extraction. Object extraction One of the important work of image analysis. Based on the object detection, it is emphasized to distinguish the object to be extracted from other objects and backgrounds in the image, so that it can be analyzed separately next. See image segmentation.
23.1.3 Method Classification Classification of image segmentation techniques Classification scheme for various image segmentation techniques. A typical scheme is shown in Table 23.1, which uses two classification criteria: 1. The nature of the gray value of the pixels used for segmentation: gray-level discontinuity and gray-level similarity 2. Computation strategies in the segmentation process: parallel technique and sequential technique
Table 23.1 Image segmentation technique classification table

Classification        | Gray-level discontinuity              | Gray-level similarity
Parallel technique    | Boundary-based parallel algorithm     | Region-based parallel algorithm
Sequential technique  | Boundary-based sequential algorithm   | Region-based sequential algorithm
Note: Terms indicated in bold in the table are included as headings in the text and can be consulted for reference
Boundary-based method In image segmentation, gray-level discontinuities are used to segment from the detection of object boundaries. Compare region-based methods. Boundary-based parallel algorithm An image segmentation algorithm that detects pixels on the boundary of a region based on the gray-level discontinuity of pixels between different regions in the image. The judgments and decisions of different pixels are made independently and in parallel. Boundary-based sequential algorithm A class of image segmentation algorithms that process different pixels sequentially. It detects pixels on the boundary of a region based on the gray-level discontinuity of pixels between different regions in the image, where the results of early processing of some pixels can be used by subsequent processing of other pixels. Because this type of method takes into account the global information of the boundaries in the image, it can often achieve more robust results when the image is affected by noise. Typical edge detection and boundary connection methods are performed sequentially in combination with each other, including graph search methods, dynamic programming methods, and graph cut methods based on graph theory. Region-based method In image segmentation, a class of methods that use gray-level similarity to start with detecting the object region and perform segmentation. Compare boundarybased methods. Gray-level similarity In image segmentation, especially in the region-based method, when the image is divided into different regions, the pixels within each region have similarities in grayscale. Region-based parallel algorithm A class of image segmentation algorithms. This type of algorithm is based on the gray-level similarity of pixels in the image region to detect pixels in the region, and the judgment and decision of different pixels in this process are made independently and in parallel. Compare region-based sequential algorithms. Thresholding technology is the most widely used region-based parallel segmentation method (others include pixel classification, template matching, etc.). Region-based sequential algorithm A class of image segmentation algorithms. Detect pixels in the region based on the gray-level similarity of the pixels in the image region, and the judgment and decision of different pixels in this process are performed in sequence. Compare region-based parallel algorithms. The judgment made in the sequential segmentation method should be made according to certain criteria. Generally speaking, if the criterion is based on the grayscale characteristics of the image, this method can be used to segment gray-level images; if the criterion is based on other characteristics of the image (such as
texture), the method can also be used to segment the corresponding characteristic images. The two basic techniques commonly used are region growing and split and merge. Sequential technique Image technology with serial strategy. In sequential technique, each processing operation is performed sequentially in steps, and the results obtained in the early stage can be used by subsequent processes. The process of the general serial technology is often more complicated, and the calculation time is usually longer than that of the parallel technology, but the anti-noise and overall decision-making ability are usually strong. Compare the parallel technique. Parallel Technique Image technology using parallel computing. In parallel technique, all judgments and decisions need or can be made simultaneously and independently of each other. Compare sequential techniques. Image class segmentation A special image segmentation that emphasizes the classification of object classes, and the objects of the same type are no longer distinguished. Object classification is generally based on the characteristics, attributes, or higher-level concepts of the object. For example, in the image of a square, to distinguish fountains and pedestrians (not specific to each individual). Color image segmentation For image segmentation of color images, color images are divided into regions with uniform properties according to certain similarity criteria. Describe the process of extracting one or more connected regions from an image that meet the uniformity (homogeneity) criterion. Here, the uniformity criterion is based on the features extracted from the spectral components of the image. These components are defined in a given color space. The segmentation process can rely on knowledge about the objects in the scene, such as geometric and optical properties. To segment a color image, you must first choose a suitable color space or color model; second, you must use a segmentation strategy and method suitable for this space/model. When color image segmentation is performed in HSI space, since the three components H, S, and I are independent of each other, it is possible to transform this problem of one 3-D segmentation into three 1-D segmentations. Color clustering See color image segmentation. Rigid body segmentation Automatically divide images containing articulated targets or deformed models into a set of rigid connected components. See part segmentation and recognition by components.
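As a minimal illustration of the region growing technique mentioned above, the following sketch (illustrative names and similarity criterion, assuming NumPy) grows a 4-connected region from a seed pixel using a simple gray-level similarity predicate.

```python
import numpy as np
from collections import deque

def region_grow(image, seed, threshold):
    """Grow a 4-connected region from `seed` (row, col), adding neighbors whose
    gray level differs from the running region mean by no more than `threshold`."""
    rows, cols = image.shape
    region = np.zeros((rows, cols), dtype=bool)
    region[seed] = True
    total, count = float(image[seed]), 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and not region[nr, nc]:
                if abs(float(image[nr, nc]) - total / count) <= threshold:
                    region[nr, nc] = True
                    total += float(image[nr, nc])
                    count += 1
                    queue.append((nr, nc))
    return region
```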
23.1.4 Various Strategies Segmentation algorithm Algorithm for segmenting images. Thousands of algorithms have been developed for images in a variety of different applications. Bottom-up segmentation A method of segmenting objects by means of clustering. See bottom-up. Edge-based segmentation Image segmentation that proceeds by extending the edges detected in the image. Specific algorithms can be further divided into boundary-based parallel algorithms and boundary-based sequential algorithms. Model-based segmentation An image segmentation method or process that uses a geometric model to decompose an image into different regions. For example, an aerial image may contain roads that are visible using a GIS model of the road network. Energy-based segmentation Image segmentation by minimizing an energy. Here the energy E includes both the region energy and the boundary energy of image f, which can be written as

E(f) = Σ(x, y) [Er(x, y) + Eb(x, y)]

The region term can be written as

Er(x, y) = Es{I(x, y); R[f(x, y)]}

It is a negative log-likelihood value measuring how consistent the pixel brightness I(x, y) is with the region statistics R[f(x, y)]. The boundary term can be written as

Eb(x, y) = sx(x, y) δ[f(x, y) − f(x + 1, y)] + sy(x, y) δ[f(x, y) − f(x, y + 1)]

It measures the inconsistency between N4 nearest neighbors, modulated by the horizontal and vertical smoothing terms sx(x, y) and sy(x, y). Intensity-based image segmentation An image segmentation method based on the pixel value distribution, such as a histogram. Also called a non-context method. Region-based segmentation One type of segmentation technique that generates a certain number of image regions, mainly based on a given uniformity criterion. For example, a luminance image region may be uniform in color (see color image segmentation), or it may be uniform in texture properties (see texture field segmentation); a distance image
region may be uniform in shape or curvature properties (see curvature segmentation). Region-based image segmentation A method for image segmentation based on the adjacency and connection criteria between a pixel and its neighboring pixels. Region segmentation See region-based segmentation. Region A set of pixels/points or connected components in an image. Generally, the pixels in the image are consistent or uniform or homogeneous under a given criterion. Meet the following two conditions: 1. Consists of inner points (non-boundary points of a region) and boundary points (there are outer points in the neighborhood that do not belong to the region). 2. Connectivity, that is, any two points in a point set can be connected via internal points. See connected component. The size and shape of the region is not limited, and sometimes small regions are also indicated by windows, masks, patches, blocks, blobs, etc. Region boundary extraction Calculation of a region boundary, such as calculating the contour of a region in an intensity image after color image segmentation. Region detection A large class of algorithms whose goal is to divide an image into different regions and extract the region with specific properties. See region identification, region labeling, region matching, and region-based segmentation. Semantic image segmentation A form of image segmentation, where the segmented area is often extracted and labeled (with its category or identity) at the same time. This method uses both the visual context information and the visual appearance information of specific objects and the relation information between objects, in contrast to the segmentation method using only image characteristics. For example, segmenting and identifying road signs is much easier in the context of outdoor scenes than in a messy indoor environment. In a broad sense, all the objects of interest for image segmentation are a higherlevel concept with semantic meaning, so image segmentation can always be called semantic image segmentation. Semantic segmentation It can be defined as dividing the image into corresponding regions based on semantic information and optimizing with the help of high-level information such as visual context, but it is essentially image segmentation.
Visual context Knowledge of the visual environment involved in visual processing and processes can help improve performance when performing certain tasks. For example, when looking for street signs in a new city, experience with a particular context (situation) tells people to look at the corners of a building, corner signs, or signs hanging on a line at an intersection. This is in contrast to context-independent visual search algorithms, in which algorithms are considered to look all around as the environment is the same everywhere. Semantic primitive Corresponding to something meaningful in the image (such as an object in the scene), it can also refer to an action in the video (such as waving a hand). It should be noted that a set of only one pixel (or pixel slice) is not necessarily considered as a semantic primitive, unless the set constitutes a repetitive pattern. Clusters of similar feature vectors may constitute more abstract semantic primitives, where clusters correspond to primitives but not necessarily to entities that can be recognized by humans. Semantic region 1. Regions in the image that correspond to certain semantic primitives. 2. Image regions that participate in multiple behaviors. Semantic region growing A region growing scheme that combines a priori knowledge about neighboring regions. For example, in aerial photographs of rural areas, the fact that roads are often surrounded by farmland is a priori knowledge of adjacent areas. Constraint propagation can be further used to obtain globally optimized region segmentation. See constraint satisfaction, relaxation labeling, region-based segmentation, and recursive region growing. Interpretation-guided segmentation [IGS] Image segmentation technology that combines image elements into regions based on not only the original image (the underlying data) but also the semantic interpretation (high-level knowledge) of the image
23.2
Primitive Unit Detection
The most basic task is the detection of various basic geometric units, such as the detection of points (Davies 2005), corners (Hartley and Zisserman 2004), straight lines (Davies 2012; Szeliski 2010), and curves (Davies 2012).
23.2.1 Point Detection Image feature A general term that represents the structure of an image of interest, which may be a corresponding structure of the scene of interest. Features can be single points such as points of interest, curve vertices, edge points, etc. or lines, curves, surfaces, and so on. Image feature extraction A set of image techniques for determining special image characteristics. For example, edge detection and corner detection. Primitive detection Detection of the basic units that make up an image object. Common object primitives include edges, corner points, inflection points (in contour), etc. In a more general sense, they can also include points of interest, control points, etc. Point detection Detection of isolated points (isolated pixels) in an image. If the gray level of an isolated point is significantly different from the background or in a region where the gray level is different and relatively consistent, the gray level of that point is significantly different from the gray level of its neighboring points. At this time, Laplacian detection can be used to detect isolated points. Point object An abstract term for an object. At this time, there are many similar objects in the image, and the focus of research is the spatial relationship between them. For point object sets in an image, the relationship between each object is often more important than the position of a single object in the image or the nature of the single object itself. At this time, the distribution of point objects is often used to describe the point object set. Interest point 1. Generic term for pixels with characteristics of interest. Any particular point of interest in the image. It can be a corner point, a feature point, a salient point, a landmark point, etc. Points of interest are often used to mark the correspondence of feature points between images, and these points often have certain identifiable characteristics. The pixel value of the interest points often varies greatly. 2. In the speeded-up robust feature algorithm, the calculated Hessian matrix determinant has maximum points in scale space and image space, that is, SIFT points, which can be used for matching. In order to limit the combinatorial explosion problem that the matching method may cause, it is often desirable that there are not many (sparse) interest points in the image. Interest point detector A class of techniques or algorithms that detect special points in an image. Here special points can be corner points, inflection points, or any point with specific
meaning (the determination of these points often depends on the nature of their neighborhoods). Interest point feature detector An operator acting on an image to locate an interest point. Typical examples include the Moravec interest point operator and the Plessey corner finder. Moravec interest point operator An operator for locating an interest point at a pixel whose neighborhood brightness has a large change in at least one direction. These points can be used for stereo vision matching or feature point tracking. This operator calculates the sum of the squares of the pixel value differences in the vertical, horizontal, and two diagonal directions in a 5 × 5 window centered on the current pixel. The minimum of the above four values is selected to remove values that are not local extremes and values that are less than a given threshold. Key point A generic term for pixels in various images that have specific properties (such as extreme curvature, which is stable with changes in viewing angle). Commonly used: corner points, curve inflection points, center points, etc. Tee junction The intersection of a particular form of straight line segments (which may represent edges), where one straight line segment touches and ends somewhere (not at an endpoint) on another straight line segment. T-shaped intersections can give useful depth-ordering clues. For example, in Fig. 23.5, based on the T-shaped intersection at point P, it can be determined that surface C is in front of surfaces A and B (closer to the observer). Line intersection The intersection of two or more straight lines. That is, the points where lines intersect or meet, as shown in Fig. 23.6. Line junction The point where two or more straight lines meet. See junction label.
Fig. 23.5 Example of T-shaped intersection
Fig. 23.6 Intersecting points
Fig. 23.7 Line cotermination
Line cotermination When two straight lines have end points at exactly the same, or very nearly the same, position. Figure 23.7 shows a few typical examples. The endpoints of the two straight lines lie within the circle, but they may not overlap (Fig. 23.7). Degenerate vertex The middle point of three points defining an angle (two straight lines) of 180° (i.e., the two straight line segments are concatenated in the same direction). It is neither a convex vertex nor a concave vertex. Convex vertex The middle point of the three points defining an angle (the two lines connecting this point with the other two points form the two sides of the angle) whose value lies in [0°, 180°]. Compare concave vertex. Concave vertex The middle point of the three points defining an angle (the two lines connecting this point with the other two points form the two sides of the angle) whose value lies in [180°, 360°]. Compare convex vertex. Soft vertex Intersection of multiple almost overlapping lines. Soft vertices may occur when segmenting smooth curves into a set of line segments. They are called “soft” because they can be removed by replacing straight line segments with curved ones. Center-surround operator Operators that are very sensitive to point-like image features whose pixel values in the center are higher or lower than those in the surrounding area. For example, a convolution template that can be used to detect isolated points is as follows:

−1/8  −1/8  −1/8
−1/8    1   −1/8
−1/8  −1/8  −1/8
Center-surround See center-surround operator.
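Applying the center-surround template given above is a direct way to flag isolated points; the sketch below (assuming SciPy is available; the threshold is an illustrative parameter) thresholds the magnitude of the template response.

```python
import numpy as np
from scipy.ndimage import convolve

# Center-surround template from the entry above: center 1, surround -1/8.
CENTER_SURROUND = np.array([[-1/8, -1/8, -1/8],
                            [-1/8,  1.0, -1/8],
                            [-1/8, -1/8, -1/8]])

def detect_isolated_points(image, threshold):
    """Mark pixels whose center-surround response magnitude exceeds `threshold`."""
    response = convolve(image.astype(float), CENTER_SURROUND, mode='nearest')
    return np.abs(response) > threshold
```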
23.2.2 Corner Detection Inflection point The point of a curve where the second derivative changes sign, corresponding to a change in the convexity of the curve. One of the singular points of the curve (see Figure in regular point). Corner detection For the detection of corner pixels in an image, various corner detectors can be used. See curve segmentation. Corner point 1. In object detection, the intersection of two perpendicularly intersecting edges. 2. The point in the chain code representation where the direction of the chain code changes. Trihedral corner Corner points on a 3-D object where the surfaces in the scene are all planar and three surfaces intersect. Corner detector The corner detection operator, which can be a differential operator (such as various gradient operators) or an integration operator (such as the SUSAN operator). Corner feature detector See interest point feature detector and curve segmentation. Förstner operator A feature detector for detecting corner points and other edge features. SUSAN corner finder A widely used interest point detector. The derivative-based smoothness and center-difference calculation steps are combined into a center-to-perimeter comparison. See smallest univalue segment assimilating nucleus operator. SUSAN operator Abbreviation for smallest univalue segment assimilating nucleus operator. Smallest univalue segment assimilating nuclear [SUSAN] operator An edge detection and corner detection operator. A circular template is often used, the center of which is called the “nucleus”. The template is placed at different positions in the image, the grayscale of each pixel covered by the template is compared with the grayscale of the pixel covered by the nucleus of the template, and the number of pixels whose grayscale is the same as or similar to that of the nucleus is counted (they make up the univalue segment assimilating nucleus); based on this count, it is judged whether the pixel covered by the nucleus is a boundary point or a corner point. It can be seen that it essentially uses statistical summation or integral operations to work.
Univalue segment assimilating nuclear [USAN] In the SUSAN operator, the region occupied by those pixels under the mask whose gray level is the same as that of the central core (nucleus) pixel of the mask. It contains a lot of information about the structure of the image. Harris corner detector A typical corner detector. It is also known as the Plessey corner finder. It detects corner points by computing the corner strength with the help of the Harris interest point operator. For image f(x, y), consider the matrix M at (x, y):

M = [ (∂f/∂x)(∂f/∂x)   (∂f/∂x)(∂f/∂y) ]
    [ (∂f/∂y)(∂f/∂x)   (∂f/∂y)(∂f/∂y) ]

If the eigenvalues of M are large and form a local extremum, then the point (x, y) is a corner point. To avoid direct calculation of the eigenvalues, the local extremum of det(M) − 0.04[trace(M)]² can be used instead. Harris-Stephens corner detector Same as Harris corner detector. Harris interest point operator An interest point detection operator, also known as the Harris interest point detector. Its expression matrix can be defined by the gradients Ix and Iy in two directions within a local template in the image. A commonly used Harris matrix can be written as

H = [ Σ Ix²     Σ IxIy ]
    [ Σ IxIy    Σ Iy²  ]
Harris operator An operator for detecting corner points in grayscale images (can be generalized as the color Harris operator). If acting on a grayscale image f(x, y), the corner positions can be detected. The specific steps are as follows:

1. Calculate the autocorrelation matrix M at each pixel position:

M = [ M11   M12 ]
    [ M21   M22 ]

where

M11 = Sσ(fx²),  M22 = Sσ(fy²),  M12 = M21 = Sσ(fx fy)

Among them, Sσ(•) represents Gaussian smoothing, which can be achieved by convolving with a Gaussian window (e.g., standard deviation σ = 0.7; the radius of the window is generally 3 times σ).

2. Calculate the corner point measure for each pixel position (x, y) and construct a “corner point map” CM(x, y):

CM(x, y) = Det(M) − k[Trace(M)]²

Among them, Det(M) = M11M22 − M21M12, Trace(M) = M11 + M22, and k is a constant (such as k = 0.04).

3. Threshold the corner point map and set all CM(x, y) less than the threshold T to zero.

4. Perform non-maximum suppression to find local maxima.

5. The non-zero points retained in the corner point map are the corner points.

Harris detector Same as Harris operator. Structure tensor Typically a 2-D or 3-D matrix that characterizes the dominant gradient direction at a point location, using a weighted combination of gradient values in a given window. Larger window sizes can increase stability but reduce sensitivity to small structures. The eigenvalues of the structure tensor record the range of different gradient directions in the window. The structure tensor is important for interest point feature detectors, the Harris corner detector, and characterizing texture, as well as other applications. Moravec corner detector A very simple corner detector. Its output achieves a larger value at pixels with high contrast. For image f(x, y), the output value of the Moravec corner detector at (x, y) is

M(x, y) = (1/8) Σ(p = x−1..x+1) Σ(q = y−1..y+1) |f(x, y) − f(p, q)|
Compare Zuniga-Haralick corner detector. Plessey corner finder A commonly used, first-order differential-based, local autocorrelation corner detector, also known as Harris corner detector. See feature extraction.
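The Harris operator steps listed above can be sketched compactly as follows (assuming NumPy and SciPy; the gradient estimate, the non-maximum suppression window size, and the relative threshold are illustrative choices).

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_corners(image, sigma=0.7, k=0.04, threshold=1e-2):
    """Corner coordinates following the Harris operator steps: autocorrelation matrix,
    corner measure Det(M) - k*[Trace(M)]^2, thresholding, non-maximum suppression."""
    f = image.astype(float)
    fy, fx = np.gradient(f)                       # simple derivative estimates
    M11 = gaussian_filter(fx * fx, sigma)         # S_sigma(fx^2)
    M22 = gaussian_filter(fy * fy, sigma)         # S_sigma(fy^2)
    M12 = gaussian_filter(fx * fy, sigma)         # S_sigma(fx*fy)
    cm = (M11 * M22 - M12 * M12) - k * (M11 + M22) ** 2
    cm[cm < threshold * cm.max()] = 0             # step 3: threshold the corner map
    peaks = (cm == maximum_filter(cm, size=5)) & (cm > 0)   # step 4: local maxima
    return np.argwhere(peaks)                     # step 5: (row, col) corner points
```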
Zuniga-Haralick operator A corner detection operator whose elements are coefficients that approximate a local neighborhood by a third-degree polynomial. Zuniga-Haralick corner detector A corner detector based on the facet model. For image f(x, y), its neighborhood at (x, y) is approximated by a cubic polynomial with coefficients ck:

f(x, y) = c1 + c2x + c3y + c4x² + c5xy + c6y² + c7x³ + c8x²y + c9xy² + c10y³

The output value of the Zuniga-Haralick corner detector is

Z(x, y) = −2(c2²c6 − c2c3c5 + c3²c4) / (c2² + c3²)^(3/2)
Compare Moravec corner detector.
23.2.3 Line Detection Line detection Detection of (single-pixel-wide) line segments in an image. The set of templates in Fig. 23.8 can be used to detect four types of line segments: in the horizontal direction, in the vertical direction, and at +45° and −45° relative to the horizontal direction. Line segment A single-sided convex polygon. Free end A pixel of a digital curve that has only one neighbor; a digital curve has two such pixels (the other pixels on the curve have two neighbors).
Fig. 23.8 Four templates for line detection (left to right: horizontal, vertical, +45°, −45°)

  Horizontal      Vertical        +45°            −45°
  −1 −1 −1        −1  2 −1        −1 −1  2         2 −1 −1
   2  2  2        −1  2 −1        −1  2 −1        −1  2 −1
  −1 −1 −1        −1  2 −1         2 −1 −1        −1 −1  2
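A minimal sketch of applying the four line-detection templates of Fig. 23.8 (assuming SciPy; the threshold is an illustrative parameter): a pixel is assigned to the direction whose template response has the largest magnitude, provided that response exceeds the threshold.

```python
import numpy as np
from scipy.ndimage import convolve

LINE_TEMPLATES = {
    'horizontal': np.array([[-1, -1, -1], [ 2,  2,  2], [-1, -1, -1]]),
    'vertical':   np.array([[-1,  2, -1], [-1,  2, -1], [-1,  2, -1]]),
    '+45':        np.array([[-1, -1,  2], [-1,  2, -1], [ 2, -1, -1]]),
    '-45':        np.array([[ 2, -1, -1], [-1,  2, -1], [-1, -1,  2]]),
}

def detect_lines(image, threshold):
    """Return, per direction, a mask of pixels whose template response is strongest
    among the four directions and exceeds `threshold`."""
    f = image.astype(float)
    responses = {name: convolve(f, t.astype(float), mode='nearest')
                 for name, t in LINE_TEMPLATES.items()}
    stacked = np.stack(list(responses.values()))
    strongest = np.abs(stacked).argmax(axis=0)
    masks = {}
    for i, name in enumerate(responses):
        masks[name] = (strongest == i) & (np.abs(responses[name]) > threshold)
    return masks
```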
Line detection operator Feature detector for line detection. Depending on the principle, local linear segments can be detected or straight lines can be detected globally. Note that the line here is different from an edge (a line consists of two parallel step edges). RANSAC-based line detection Detection of straight lines using the RANSAC technique. First randomly select pairs of small edge segments (edge primitives) to construct a straight-line hypothesis, and then test how many other small edge segments fall on this line (if the edge orientation is accurate, a single small edge segment can verify the hypothesis). A straight line with a sufficiently large number of interior points (matching small edge segments) is selected as the desired straight line or straight segment. Line equation Equations representing straight lines. They can be divided into 2-D line equations and 3-D line equations. A rectangular coordinate system or a polar coordinate system (as in the Hough transform) can be used. 2-D line equation An equation representing a 2-D straight line on a plane. For example, using homogeneous coordinates l = [a b c]^T, the corresponding 2-D line equation is

[x y 1] • [a b c]^T = ax + by + c = 0

3-D line equation An equation representing a 3-D straight line in space. For example, using homogeneous coordinates l = [a b c d]^T, the corresponding 3-D line equation is

[x y z 1] • [a b c d]^T = ax + by + cz + d = 0
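The RANSAC-based line detection entry above can be sketched as follows for a set of edge points (illustrative parameter values; the line is represented as ax + by + c = 0, as in the 2-D line equation entry).

```python
import numpy as np

def ransac_line(points, n_iters=500, inlier_tol=1.5, rng=None):
    """Fit a 2-D line ax + by + c = 0 to `points` (N x 2 array of (x, y)) by
    sampling point pairs and keeping the hypothesis with the most inliers."""
    rng = np.random.default_rng(rng)
    best_line, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        a, b = y1 - y2, x2 - x1                   # normal of the line through the pair
        norm = np.hypot(a, b)
        if norm == 0:
            continue
        c = -(a * x1 + b * y1)
        dist = np.abs(points @ np.array([a, b]) + c) / norm
        inliers = dist < inlier_tol
        if inliers.sum() > best_inliers.sum():
            best_line, best_inliers = (a / norm, b / norm, c / norm), inliers
    return best_line, best_inliers
```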
Line fitting A curve fitting problem: estimate the parameters of the straight-line equation that optimally interpolates the given data points. Least squares line fitting In straight-line fitting, a common method that minimizes the sum of the squared distances from all edge points to the straight line. Although robust to small outliers, it is not robust to large outliers. This is because a
squared distance is used, so large outliers can have a large effect on optimization. The solution is to try to suppress the outliers with a small weight. Least squares fitting A general term indicating the least-mean-square estimation process that fits a data set with a parameter shape such as a curve or surface. For evaluating the fitting effect, Euclidean distance, algebraic distance, and Mahalanobis distance are commonly used. Huber function A function for calculating weights in a least squares straight line fitting method. If the squared distance is used in the least squares straight line fitting method, the points (wild points/outliers) that are far away from the straight line will have an excessive effect. The Huber function is a function that weights points in order to reduce their impact. Set the distance from the point to the straight line as d, define the shear threshold (clipping factor) as T, and then use the Huber function to calculate the weight as
w(d) = 1 for d ≤ T;  w(d) = T/d for d > T
It can be seen from the above formula that points with a distance less than or equal to T have a weight of 1, and points with a distance greater than T have a weight of less than 1. That is, for small-distance points the squared distance is still used, while for large-distance points a linear (first-order) distance is effectively used instead. Compare Tukey function.
Tukey function A function for calculating weights in least squares straight-line fitting. If the squared distance is used in the least squares straight-line fitting method, points far away from the straight line (wild points/outliers) will have an excessive effect. The Tukey function weights points in order to reduce their impact. Let the distance from a point to the straight line be d and define the shear threshold (clipping factor) as T; then the Tukey function computes the weight as

w(d) = [1 − (d/T)²]² for d ≤ T;  w(d) = 0 for d > T
It can be seen from the above formula that points with a distance greater than T are completely ignored (the weight is 0, which strongly suppresses wild points/outliers), and points with a distance less than or equal to T have weights between 0 and 1. Compare Huber function.
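A sketch of how the Huber and Tukey weights above can drive an iteratively reweighted least-squares line fit; the line is parameterized by a unit normal n and offset ρ (n · p = ρ), and the clipping factor T is a user-chosen assumption.

```python
import numpy as np

def huber_weight(d, T):
    d = np.abs(d)
    return np.where(d <= T, 1.0, T / np.maximum(d, 1e-12))

def tukey_weight(d, T):
    d = np.abs(d)
    return np.where(d <= T, (1.0 - (d / T) ** 2) ** 2, 0.0)

def irls_line(points, T=2.0, weight_fn=huber_weight, n_iters=20):
    """Robust line fit n . p = rho by iteratively reweighted least squares."""
    w = np.ones(len(points))
    for _ in range(n_iters):
        # weighted centroid and scatter matrix of the points
        c = (w[:, None] * points).sum(0) / w.sum()
        d = points - c
        S = (w[:, None] * d).T @ d
        _, vecs = np.linalg.eigh(S)
        normal = vecs[:, 0]                      # eigenvector of the smallest eigenvalue = line normal
        rho = normal @ c
        r = points @ normal - rho                # signed point-line distances
        w = weight_fn(r, T)                      # Huber or Tukey reweighting
    return normal, rho
```

Starting from unit weights, each pass downweights points far from the current line, so a few gross outliers stop dominating the fit.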
23.2.4 Curve Detection
Analytic curve finding A method for detecting a parametric curve by first transforming the data into a feature space and then searching for the assumed curve parameters in that space. A typical example is the detection of straight lines using a Hough transform.
Generalized curve finding A general term for methods that locate arbitrary curves. See generalized Hough transform.
Curve segmentation Method for splitting a curve into different primitive types. The positions of the changes from one primitive type to another are important. For example, a good curve segmentation method should be able to detect the four lines that make up a rectangle. Common methods include corner detection, iterative splitting, and Lowe's method.
Breakpoint detection See curve segmentation.
Line segmentation See curve segmentation.
Lowe's curve segmentation method An algorithm that divides a curve into a series of straight line segments. The algorithm has three main steps: (1) iteratively split each line segment into two shorter segments, which builds a tree of line segments; (2) merge the line segments in the tree from the bottom up according to a linearity measure; (3) extract the remaining unmerged line segments from the tree as the segmentation result.
Lowe's method Same as Lowe's curve segmentation method.
23.3
Geometric Unit Detection
The detection of various basic geometric targets is common, such as line segments (Davies 2012), circles and ellipses (Davies 2012), and object contours (Prince 2012; Shapiro and Stockman 2001). Hough transform is a more general method (Forsyth and Ponce 2012; Zhang 2017b).
Fig. 23.9 Bar in perceptual field
23.3.1 Bar Detection Bar A primitive graph representing a dark line segment on a light background (or vice versa). Bar is also a primitive in Marr’s theory of visual computing. Figure 23.9 shows the small dark bar seen in the perceptual field. Bar detection Image analysis operation with the help of a bar detector. Bar edge Two edges that are close in parallel, such as the edges of a thin line. Bar detector The method or algorithm that produces the maximum response when a bar is in its perceptual field. Valley detection An image analysis operation that enhances linear features instead of bright edges on one side and dark edges on the other, so that linear features can be detected. See bar detection. Valley detector Operators that implement valley detection Ridge detection A class of algorithms used to detect ridges in an image, mainly including edge detectors and line detectors. Anti-parallel edges A pair of edges that have no other edge between them and have opposite contrast.
23.3.2 Circle and Ellipse Detection
Circle detection A technique or algorithm for determining the center and radius of a circle in an image; the Hough transform is a typical example. Note that the projections of circular objects in a 3-D scene onto a 2-D image plane often become ellipses.
Circle A curve on a plane all of whose points are at a fixed distance, the radius r, from the center point C. The full curve (the circumference) has length 2πr, and the area enclosed by the curve is A = πr². The equation for a circle centered at (a, b) is (x − a)² + (y − b)² = r². A circle is a special case of an ellipse.
Circle fitting Technique for obtaining circle parameters from 2-D or 3-D observations. As with all fitting problems, one can either search the parameter space with a suitable voting scheme (such as the Hough transform) or solve a well-posed least squares problem.
Hole An empty set inside a region; a geometric structure that cannot be continuously shrunk to a point. Also refers to an area of dark pixels enclosed by the image foreground (the white pixels adjacent to the hole are boundary pixels).
Punctured annulus A circular region with a hole at its center.
Lacunarity A scale-dependent, translation-invariant measure based on the size distribution of holes in a set. Low lacunarity indicates that the set is uniform, and high lacunarity indicates that the set is mixed.
Ellipse detection Method for detecting geometrically distorted circles or true ellipses. Ellipse detection is one of the key issues in image analysis because it is particularly useful in extracting the parameters of projected geometry. An ellipse has five main parameters: the center coordinates (x0, y0), the orientation angle θ, and the major and minor axes (a, b). There are many methods (such as the Hough transform or RANSAC) that use the edge pixels in an image for a spatial search.
Ellipse fitting Fitting an ellipse model to an object contour or to data points.
Diameter bisection A conceptually simple method for determining the centers of ellipses of various sizes. First, build a list of all edge points in the image based on their edge direction. Then, arrange the points to obtain pairs of points whose edge directions are anti-parallel; such pairs may lie at the two ends of an ellipse diameter. Finally, the midpoint positions of these ellipse diameters are voted into a parameter space, and the image-space point corresponding to the peak position is the candidate position of the ellipse center. Referring to Fig. 23.10, the midpoint of two edge points with anti-parallel directions should be a candidate point for the center of the ellipse. However, when there are several ellipses with the same orientation in the image, diameter bisection may also find a false center at the midpoint between them. To eliminate these false center points, voting for a center should be allowed only when the edge points point inward toward it.
Fig. 23.10 The basic idea of the diameter bisection
The above basic method is useful for locating many symmetrical shapes (such as circles, squares, rectangles, etc.), so it can be used to detect multiple types of objects.
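As a sketch of the least-squares branch of circle fitting mentioned in the entries above, the algebraic fit below solves a linear system for the circle (x − a)² + (y − b)² = r²; it is one common formulation, not the handbook's prescribed algorithm.

```python
import numpy as np

def fit_circle(points):
    """Algebraic least-squares circle fit.

    Writes the circle as x^2 + y^2 + D x + E y + F = 0 and solves for
    (D, E, F) linearly; then a = -D/2, b = -E/2, r = sqrt(a^2 + b^2 - F).
    """
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([x, y, np.ones_like(x)])
    rhs = -(x ** 2 + y ** 2)
    (D, E, F), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    a, b = -D / 2.0, -E / 2.0
    r = np.sqrt(a * a + b * b - F)
    return a, b, r
```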
23.3.3 Object Contour Object localization The technique or process of determining the location of an object in a scene or an image. Compare object detection. Visual localization For an object in space, its position is estimated and determined based on one or more images of it. The solution varies due to several factors, including the number of input images (one in model-based pose estimation, two or more in stereo vision, and video sequences in motion analysis), and a priori knowledge (for camera calibra tion, you can consider a simple projection model or a complete perspective model, and see if there is a geometric model of the object). Contour Short name of the object outline. The border of a region. Generally speaking, the contour is more focused on its closedness, while the border is more emphasized between the two areas. Compare boundary. Contour tracking Same as boundary following. Contour linking The process of edge detection or contour pixel detection for the region contour. The purpose is to connect the pixels on these contours to form a closed curve. Image border A square or rectangular outline of the entire image. Compare boundary, contour. Contour grouping See contour linking. Contour following See contour linking. Contour relaxation A relaxation method for contour connection in scene segmentation.
Fig. 23.11 Masks for integrated orthogonal operator: nine 3 × 3 orthogonal masks comprising the edge subspace base (two symmetrical gradient masks and two ripple masks), the line subspace base (two line detection masks and two Laplacian masks), and one average mask forming the average subspace (in the figure, d = √2)
Rectangle detection A special type of object detection in which the detected object lies on a plane and has a rectangular shape. In this case, detection can be performed by detecting orthogonal vanishing points.
Integrated orthogonal operator An operator consisting of nine orthogonal masks that can detect both edges and lines. Also known as the Frei-Chen orthogonal basis. Referring to Fig. 23.11, the first group of four masks constitutes the edge subspace base (in the figure, d = √2), of which two masks are isotropic symmetrical gradient masks and the other two are ripple masks. The second group of four masks forms the line subspace base, two of which are line detection masks and the other two are discrete Laplacian masks. The masks in the same group differ in direction by only 45°. The last mask is the average mask, which constitutes the average subspace and is added for spatial completeness.
Orthogonal masks A set of masks whose pairwise inner products are zero.
Frei-Chen orthogonal basis Same as integrated orthogonal operator.
23.3.4 Hough Transform Hough transform [HT] A special transform between different spaces. It uses the duality between variables and parameters in the analytical expression of spatial curves or surfaces. It is
assumed that there is an object in the image space whose contour can be represented by an algebraic equation. In the algebraic equation, there are both image-space coordinate variables and parameters belonging to the parameter space. The Hough transform is a transform between image space and parameter space, which can transform a problem in one space into a solution in the other space. When performing image segmentation based on the Hough transform, the global characteristics of the image can be used to connect the edge pixels of an object into a closed boundary of the object region, or to directly detect an object of known shape in the image, and it is possible to determine the boundary to sub-pixel accuracy. The main advantage of the Hough transform is that it uses the global characteristics of the image, so it is less affected by noise and boundary discontinuities and is more robust. It is also a technique that directly transforms image features into the likelihood of the appearance of certain object shapes. See Hough transform line finder and generalized Hough transform.
Accumulation method A method of accumulating evidence in the form of a histogram and then searching it for the peaks of the corresponding hypotheses. See Hough transform and generalized Hough transform.
Accumulator array An array used in the Hough transform to determine the statistical peaks in the parameter space.
Accumulative cells In the calculation of the Hough transform, the units of the accumulation array established in the parameter space. The accumulated results of these units give the parameter values corresponding to the points in the image space.
Butterfly filter A linear filter that responds to a “butterfly” pattern in an image. A 3 × 3 butterfly filter convolution kernel is
[0 −2 0; 1 2 1; 0 −2 0]
It is often used in combination with the Hough transform, especially when detecting straight lines, to determine peaks in the Hough feature space: near the correct value, the line parameters (p, θ) often give the peak a butterfly shape.
Hough transform line finder Hough transform based on the parametric equation of a line (s = i cosθ + j sinθ), where a set of edge points {(i, j)} is converted to the likelihood of a line in (s, θ)
space. In practice, the likelihood is quantified by means of a histogram of (sinθ, cosθ) values obtained from the image. Hough forest A method of extending the Hough transform, which uses probabilistic voting to detect the parts of a single target to locate the center of gravity of the target. The detection hypothesis (i.e., the potential target) corresponds to the maximum value of the Hough image obtained by accumulating the probabilities from all parts. A particular type of random forest maps the appearance of local image patches to a probability vote on the center of gravity of the target. Hierarchical Hough transform Techniques that improve the efficiency of the standard Hough transform. It is often used to describe a series of problems and technologies that solve the problem from low resolution Hough space to high-resolution space. It usually starts with low-resolution images or sub-images of the input image and then combines the results. Iterative Hough transform [IHT] An improvement on Hough transform. In order to improve the efficiency of the Hough transform, the parameters of the Hough transform are separated, and the optimal parameters are successively obtained by using iteration, thereby reducing the computational complexity Probabilistic Hough transform A variation of the Hough transform, in which only a portion of the data is used to calculate an approximation of the Hough transform. The purpose is to reduce the computational cost of the standard Hough transform. You can observe a thresholding effect: if the percentage of samples is greater than the threshold level, then only a few false alarms will be detected. Generalized Hough transform [GHT] A generalization of Hough transform that can detect targets of arbitrary shape. The basic Hough transform requires analytical expressions for curves or target contours when detecting curves or target contours. The generalized Hough transform uses tables to establish the relationship between curve points or contour points with reference points. When there are no analytical expressions for curves or target contours, the Hough transform principle can be used for detection. R-table In the generalized Hough transform, a table connecting curve points or contour points to reference points is constructed to detect curves or target contours that are not or cannot be easily expressed analytically. With the help of the R-table, the principle of Hough transform can be used to solve the problem of detecting arbitrary shape curves.
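A compact sketch of the Hough transform line finder described above, voting edge points into an (s, θ) accumulator array; the quantization of s and θ is an arbitrary illustrative choice.

```python
import numpy as np

def hough_lines(edge_points, img_diag, n_theta=180, n_s=200):
    """Accumulate votes for s = i*cos(theta) + j*sin(theta).

    edge_points : (N, 2) array of (i, j) coordinates of edge pixels.
    img_diag    : length of the image diagonal (bounds |s|).
    Returns the accumulator array and the (s, theta) of its strongest peak.
    """
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    s_bins = np.linspace(-img_diag, img_diag, n_s)
    acc = np.zeros((n_s, n_theta), dtype=np.int32)
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    for i, j in edge_points:
        s = i * cos_t + j * sin_t                      # one s value per theta bin
        idx = np.clip(np.digitize(s, s_bins) - 1, 0, n_s - 1)
        acc[idx, np.arange(n_theta)] += 1              # vote in the accumulator cells
    k = np.argmax(acc)                                 # strongest peak = most likely line
    return acc, s_bins[k // n_theta], thetas[k % n_theta]
```

Peaks of the accumulator (optionally sharpened with a butterfly filter) give the (s, θ) parameters of the detected lines.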
Fig. 23.12 Hough transform based on foot-of-normal
Cascaded Hough transform When several Hough transforms are used successively, the previous output is used as the next input. Hough transform based on foot-of-normal Hough transform method using the coordinates of the foot-of normal to represent the parameter space. As shown in Fig. 23.12, (ρ, θ) represents a straight line in the parameter space. The straight lines to be detected in the image space are represented by thick lines. Their intersection is the foot-of-normal, and its coordinates are (xf, yf). (ρ, θ) has a one-to-one correspondence with (xf, yf), so (xf, yf) can also be used to represent the parameter space for Hough transform. Specifically consider the need to detect a point (x, y) on a straight line, and set the gray gradient at that point to (gx, gy). According to Fig. 23.12, it can get
gy/gx = yf/xf
(x − xf)xf + (y − yf)yf = 0

Combining them gives the solution

xf = gx(x gx + y gy)/(gx² + gy²)
yf = gy(x gx + y gy)/(gx² + gy²)

It can be seen that every point on the same straight line votes for the same (xf, yf) in the parameter space. This Hough transform based on foot-of-normal has the same robustness as the basic Hough transform, but it is faster to compute in practice, because it needs neither the arc tangent function to obtain θ nor the square root to obtain ρ.
Foot-of-normal When using a Hough transform based on polar coordinates to detect a straight line, the intersection of the straight line represented by the parameter space and the straight line (in image space) to be detected. If the coordinate space is used to
represent the parameter space, a Hough transform based on foot-of-normal can be obtained. Discrimination-generating Hough transform A hierarchical algorithm that distinguishes objects and uses projections and hierarchies of the constructed Hough space to detect the rotation and translation of objects. See generalized Hough transform. Randomized Hough transform A variant of the standard Hough transform, expecting higher accuracy with less computation. The algorithm for straight line detection randomly selects paired edge points and adds accumulator units corresponding to straight lines passing through the two points. This selection process is repeated a fixed number of times. Adaptive Hough transform A Hough transform method that iteratively increases the number of quantization levels in parameter space. This is especially useful when faced with highdimensional parameter spaces. The disadvantage is that it is possible to miss sharp peaks in the histogram. Maximum margin Hough transform An extension of the probabilistic Hough transform, in which various features of the target are weighted to increase the likelihood of obtaining a correct match.
23.4
Image Matting
In practice, it is often necessary to extract each part of interest from the image, that is, to separate it from the background (Davies 2012; Peters 2017).
23.4.1 Matting Basics Image matting The process of extracting (or removing) required parts of an image from this image. Image matting is essentially an image segmentation technique. In practice, it is often supplemented by manual operation or human-computer interaction. After extracting the required part (or removing the background), you can also combine the extracted part with other materials. See matting. Background replacement While maintaining the foreground, change the background in different images or video frames. Z-keying is a commonly used technique.
Image matting and compositing The process and technique of cutting a foreground target out of an image and pasting it onto a new background. In order to successfully copy a foreground target from one image to another without producing visible artifacts, it is necessary to extract a matte from the input image C, that is, to estimate a soft opacity channel α and the undisturbed foreground color F. This can be expressed as

C = (1 − α)B + αF

This operator uses the factor (1 − α) to attenuate the influence of the background image B and then adds the color values corresponding to the foreground element F.
Matting 1. A common term for the process of extracting a target from an original image (in order to insert it over another background). Compare compositing. 2. The technical process of combining multiple images, using one or more mattes (binary templates) to indicate the usable parts of each image.
α-Matting The process of extracting objects from an original image according to the proportion (opacity) ratio α.
α-Matted color image A color image with four channels. That is, in addition to the usual three RGB color channels, a fourth alpha (α) channel is added, which describes the relative amount of opacity or coverage at each pixel. It can be used as an intermediate representation of the foreground target between matting and compositing.
α-Channel See α-matted color image.
Matte extraction For an image or video frame, extracting a binary template or alpha channel (i.e., transparency information). This can often distinguish the target of interest in a scene.
Impostor The 2-D counterpart of a 3-D sprite, named for the way a 3-D target far from the camera is replaced by a simplified flat α-matted geometry.
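The compositing equation C = (1 − α)B + αF above is a one-line operation in code; the sketch assumes floating-point images in [0, 1], with F and B of shape (H, W, 3) and α of shape (H, W).

```python
import numpy as np

def composite(foreground, background, alpha):
    """Matting/compositing equation C = (1 - alpha) * B + alpha * F."""
    a = alpha[..., None]               # broadcast the opacity channel over the RGB channels
    return (1.0 - a) * background + a * foreground
```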
23.4.2 Matting Techniques Laplacian matting A matting technique based on optimization with closed expression.
Blue screen matting A special matting method that assumes the background has a constant, known color. In a typical blue-screen matting application, people or objects are shot in front of a background with a uniform color (previously light blue, now more often light green, but still called a blue screen). The opacity α then needs to be estimated; an early method used a linear combination of the target color channels with user-adjusted parameters. There are also several improvements on the basic method: when the known background colors are arbitrary, it is called difference matting, and when multiple backgrounds are used, it can be called two-screen matting.
Difference matting A variant of the blue screen matting method. The known background color can be arbitrary, or the background can be irregular but known. It is mainly used to shoot people or objects against a static background (such as a video conference in an office). It can also be used to detect visual continuity errors across breaks.
Two-screen matting A variant of the blue screen matting technique that uses multiple (at least two) background colors to over-constrain the estimation of the opacity and the foreground color. Let the opacity be α and the foreground color be F. Two-screen matting uses more than one known background color B to composite the image C:

C = (1 − α)B + αF

For example, consider setting the background color to black in the compositing formula given under image matting and compositing, that is, B = 0. The composite image C thus obtained will be equal to αF. Substituting a different non-zero background value B gives

C − αF = (1 − α)B

This is an over-constrained color equation for estimating α. In practice, B needs to be selected so that C is not saturated; to obtain high accuracy, several B values should be used. It is important to linearize the colors before processing, which is required for all matting algorithms.
Environment matte A generalization of various object matting techniques that considers not only cutting an object out of one image and inserting it into another but also the interaction between the object and the new background, such as subtle refraction and reflection effects (for example, inserting a magnifying glass into the image will magnify the scene behind the glass and reduce the field of view).
Smoke matting Matting in video obtained by imaging a smoke-filled scene. Consider each pixel value as a linear combination of the (unknown) background color and a constant
foreground (smoke, which is the same for all pixels) color. Vote in color space to estimate the foreground color, and use the distance along each color line to estimate the α of each pixel over time (see α-matting).
Soft matting A matting method developed on the basis of image matting. In image dehazing, the optimization of the medium transmission map is performed by solving a sparse linear system that uses very weak constraints, so it is called “soft” matting.
Natural image matting A matting technique for general natural images (more irregular than images of the man-made world).
Flash matting Matting using a flash (which requires a more complicated acquisition system). One image is collected with flash and one without; this image pair can help extract the foreground more reliably (corresponding to the regions with large lighting changes between the two images).
Optimization-based matting A class of matting techniques in which a global optimization is used to estimate the opacity and foreground color of each pixel, taking into account the correlation of neighboring α values. Typical examples are the border matting and Poisson matting techniques in the interactive grab-cut segmentation system.
Border matting A matting technique based on optimization. First, the binary segmentation result obtained by grab cut is expanded to obtain a region surrounding it. Next, the sub-pixel boundary position Δ and blur width σ are calculated for each point along the boundary.
Poisson matting An optimization-based matting technique. For each pixel in the trimap, the foreground and background colors are assumed to be known. The gradient of the (soft) opacity channel α and the gradient of the color image C are assumed to be related as follows:

∇α = [(F − B) / ||F − B||²] · ∇C
Among them, F is the color of the foreground that is not polluted, B is the color of the background, and C is the input image. The estimation of the gradient of each pixel can be collected into a continuous α field.
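Returning to the two-screen matting entry above: shooting the same foreground over two known backgrounds B1 and B2 gives C1 − C2 = (1 − α)(B1 − B2), from which α and then F follow per pixel. The sketch below solves this in the least-squares sense over the three color channels; the array shapes and the small constant eps are assumptions for illustration.

```python
import numpy as np

def two_screen_alpha(C1, C2, B1, B2, eps=1e-6):
    """Estimate opacity from two composites over two known backgrounds.

    Uses C1 - C2 = (1 - alpha) * (B1 - B2), solved per pixel in the
    least-squares sense over the three color channels.
    """
    dC = C1 - C2
    dB = B1 - B2
    one_minus_alpha = (dC * dB).sum(-1) / np.maximum((dB * dB).sum(-1), eps)
    alpha = np.clip(1.0 - one_minus_alpha, 0.0, 1.0)
    # recover the foreground color where the pixel is not fully transparent
    F = np.where(alpha[..., None] > eps,
                 (C1 - (1.0 - alpha)[..., None] * B1) / np.maximum(alpha, eps)[..., None],
                 0.0)
    return alpha, F
```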
Video matting Matting applied to a video image sequence. Moving objects in a video sequence may make the matting process easier, because part of the background may be revealed in previous or subsequent frames.
Background plate In video matting, the frames that remain after the foreground object has been removed. Aligning them produces a high-quality estimate of the background.
Garbage matte A simple and special video matting method. In video matting, if there are moving objects in the video, the matting work may be simplified because part of the background can be obtained from the frames before or after the current frame. Specifically, a conservative “garbage matte” can be used to remove the foreground target first, and the remaining background plates are then aligned and combined to obtain a high-quality background estimate.
Chapter 24
Edge Detection
Edge detection is the first step of all boundary-based image segmentation methods and the key to parallel boundary technology (Bovik 2005; Forsyth and Ponce 2012; Gonzalez and Woods 2018; Hartley and Zisserman 2004; Prince 2012; Shapiro and Stockman 2001; Szeliski 2010; Zhang 2015e).
24.1
Principle
In a grayscale image, there may be discontinuities or local abrupt changes in grayscale values between two different adjacent regions, which may cause the appearance of edges (Gonzalez and Woods 2018). Detection of edges often also takes advantage of these discontinuous or local changes (Gonzalez and Woods 2018; Szeliski 2010; Zhang 2017b).
24.1.1 Edge Detection
Edge detection A key step in boundary-based parallel algorithms in image segmentation. Edges are the result of discontinuous gray values (with accelerated changes). Generally, first- and second-order differences can be used to detect edge elements, and the edge elements are then combined or connected according to certain criteria to form a closed object contour. Although in theory both the maximum of the first-order difference and the zero-crossing point of the second-order difference correspond to edge points, in a 2-D image an edge is generally a curve. Since there are three second-order derivatives in 2-D, the results given by the second-order difference and the first-order difference differ in the general curved-edge case. Figure 24.1
Fig. 24.1 Different results for first- and second-order differences in edge detection
Fig. 24.2 Edge detection
shows the edge positions detected by the maximum of the first-order difference and the zero-crossing point of the second-order difference when the edge curve is an idealized right-angle intersection. The results obtained with the maximum of the first-order difference are within the idealized corner, and the results obtained with the zero-crossing point of the second-order difference are outside the idealized corner. Note that the former does not pass through the corner points, but the latter always passes through the corner points, so the second-order difference zero-crossing point can be used to accurately detect the target edges with sharp edges. But for edges that are not very smooth, the edges obtained by using the zero-crossing point of the second-order difference will be more tortuous than those obtained by using the maximum value of the first-order difference. In practice, a corresponding edge vector (gradient, i.e., size and direction) needs to be calculated for each point in the image. Figure 24.2 shows the results of the original image and the edge detection, respectively (here, pixels with gradient amplitudes greater than a given threshold are shown as white). Edge 1. In graph theory, the connection (also called an arc) between nodes of graph (or vertices). 2. A boundary or part of a boundary between two image regions, each of which has significant features, based on certain features (such as grayscale, color, or texture). Eigenvalues on the boundary are discontinuous (with accelerated changes). For example, when segmenting an image by considering the dissimilarity between regions, the differences between adjacent pixels need to be checked, and pixels with different attribute values are considered to belong to different regions, while a boundary is assumed to separate them. Such a boundary or boundary segment is an edge. Common is sharp changes in the image brightness
function. It is described by its position, the magnitude of the brightness gradient, and the direction of the maximum brightness change. Edge element A triple representing the basic edge units in the image. The first part gives the position (row and column) of a pixel, the second part gives the orientation of the edge passing through the pixel position, and the third part gives the strength (or amplitude) of the edge. Discontinuity detection See edge detection. Gray-level discontinuity In image segmentation, especially in the boundary-based method, when the image is divided into different regions, the grayscale of the pixels on the boundary between the regions jumps. Segmentation actually takes advantage of this property, with the jump as the boundary. Edge finding Same as edge detection. Edge extraction The process of determining and extracting the edge pixels of an image. This is often a basic preprocessing step before distinguishing between targets and determining their identity. Edge element extraction The first step in image segmentation with edge detection. Here we first need to extract each edge element on the object contour one by one; these edge elements are the basis of the object contour. Edgel 1. Shorthand for edge element. It consists of three components: the first is the row and column position of the pixel passed by the edge; the second is the orientation of the edge at the pixel position; and the third is the intensity of the edge at the pixel position. Generally, it is a small area or a collection of pixel areas with edge characteristics. Pixels near the edges may also be included. 2. Edge point calculated using the smallest selectable window that includes only two adjacent pixels. The smallest selectable window can be horizontal or vertical. In the absence of noise, this method can well extract discontinuities in image brightness. Edgelet A collection of edge pixels. It generally refers to a collection of local, connected edge pixels. See contour edgelet.
Edge map The result map obtained after edge detection is performed on the image, and the pixel value represents the edge intensity value. Generally, it is further processed, and only pixels with edge intensity values greater than a given threshold are retained. Color image edge detection According to the definition of color edges, corresponding methods are used to detect edges in color images. The typical method is to generalize the method of detecting edges based on gradients in monochrome images to color images, and simply calculate the gradient of each image and combine the results (such as using a logical OR operator). The results obtained by this method have some errors but are acceptable for most situations. Adaptive edge detection Edge detection is performed by adaptively thresholding the gradient amplitude image. Edge detector An operator that detects edges in an image. For example, Canny edge detector, compass edge detector, Hueckel edge detector, Laplacian-of-Gaussian edge detector, minimum vector dispersion edge detector, O’Gorman edge detector, etc. Edge detection operator Operator for edge detection (edge detector) using a local template (with derivative calculations). Edge detection steps The process of edge detection mainly includes three steps: 1. Noise reduction: Because the first and second derivatives are sensitive to noise, image smoothing techniques are often used to reduce noise before using edge detection operators. 2. Detecting edge points: Use a local operator that has a strong response to the edge position and a weak response to other places to obtain an output image whose bright pixels are candidates for the edge pixels. 3. Combination of edge localization: Post-process the edge detection results, remove parasitic pixels, and connect the intermittent edges into meaningful lines and boundaries. Region of support The (sub)-region in the image that is used for a particular operation. Can also be seen as a domain of some kind of operations. For example, an edge detector often uses only sub-regions of neighboring pixels that need to consider whether they are edges.
24.1.2 Sub-pixel Edge
Sub-pixel edge detection A special edge detection process in which the edge detection accuracy is required to reach the sub-pixel level, that is, the position of the edge is determined to within a pixel. In sub-pixel edge detection, it is generally necessary to determine the sub-pixel-level edge from the combined information of multiple pixels near the actual edge pixel. Typical methods include sub-pixel edge detection based on moment preserving, sub-pixel edge detection based on expectation of first-order derivatives, and sub-pixel edge detection based on tangent information. The position of an edge can also be estimated by sub-pixel interpolation of the response of a gradient operator, to give a more accurate position than the integer pixel coordinate value.
Sub-pixel edge detection based on moment preserving A sub-pixel edge detection method. Considering the 1-D case, an ideal edge is considered to be composed of a series of pixels with gray level b and a series of pixels with gray level o (see the figure under moment preserving). For an actual edge f(x), the first three moments are (with n the number of pixels belonging to the edge part)

m_p = (1/n) Σ_{i=1}^{n} [f_i(x)]^p,  p = 1, 2, 3

If t is used to represent the number of pixels with gray level b in the ideal edge, the moment-preserving method determines the edge position by making the first three moments of the actual edge and the ideal edge equal. Solving these three equations gives

t = (n/2) [1 + s √(1/(4 + s²))]

where

s = (m3 + 2m1³ − 3m1m2) / σ³,  σ² = m2 − m1²
If the first pixel of the edge is considered to be at i ¼ 1/2, and the distance between adjacent pixels is set to 1, then the non-integer t calculated above gives the sub-pixel edge position.
Fig. 24.3 Actual edge and ideal edge
Moment preserving A method that can determine the detected edge position to the sub-pixel level. Considering the 1-D case, suppose the actual edge data are shown by the black dots in Fig. 24.3, and the ideal edge that is expected to be detected is shown by the dashed line in the figure. The moment-preserving method requires that the first three orders of moments of the two (edge) pixel sequences are equal. If an ideal edge is considered as being composed of a series of pixels with gray-level b and a series of pixels with gray-level o, the moment of order p of f(x) (p = 1, 2, 3) can be expressed as

m_p = (1/n) Σ_{i=1}^{n} [f_i(x)]^p

If t is used to represent the number of pixels with gray-level b in the ideal edge, keeping the first three moments of the two edges equal is equivalent to solving the following system of equations:

m_p = (t/n) b^p + ((n − t)/n) o^p,  p = 1, 2, 3

The result of the solution is (a non-integer t gives the sub-pixel position obtained by detecting the edge)

t = (n/2) [1 + s √(1/(4 + s²))]

where

s = (m3 + 2m1³ − 3m1m2) / σ³,  σ² = m2 − m1²
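A sketch of the moment-preserving computation above for a 1-D edge profile f1, ..., fn; it returns the (generally non-integer) t that keeps the first three moments equal, which is the sub-pixel edge position when the first pixel is taken to be at i = 1/2.

```python
import numpy as np

def moment_preserving_t(f):
    """Sub-pixel edge location by moment preserving (1-D profile f)."""
    f = np.asarray(f, dtype=float)
    n = len(f)
    m1 = f.mean()                      # first three sample moments
    m2 = (f ** 2).mean()
    m3 = (f ** 3).mean()
    sigma = np.sqrt(m2 - m1 ** 2)      # assumes a genuine step, so sigma > 0
    s = (m3 + 2.0 * m1 ** 3 - 3.0 * m1 * m2) / sigma ** 3
    t = 0.5 * n * (1.0 + s * np.sqrt(1.0 / (4.0 + s ** 2)))
    return t                           # with the first pixel at i = 1/2, t is the edge position
```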
Sub-pixel edge detection based on expectation of first-order derivatives A sub-pixel edge detection method. Considering the 1-D case, the main steps are:
1. For the image function f(x), calculate its first-order differential g(x) = |f′(x)|.
2. For a given threshold T, determine the range of values of x that satisfy g(x) > T: [xi, xj] (1 ≤ i, j ≤ n), where n is the number of discrete samples of f(x).
Fig. 24.4 1-D grayscale and gradient maps
3. Calculate the expectation value E of the probability function p(x) of g(x) as the edge position (its precision can be to a decimal):
E = Σ_{k=1}^{n} k p_k = Σ_{k=1}^{n} k g_k / (Σ_{i=1}^{n} g_i)
Consider the 1-D signal f(x) at the upper part of Fig. 24.4, with discrete values f1, f2, . . ., fn1, fn. The desired ideal edge is shown by the dashed line. If the first-order derivative of f(x): g(x) ¼ |f0 (x)| is calculated, then the discrete values are as in the bottom part of Fig. 24.4. The values of x satisfying g(x) > T are those enclosed by a dashed box. Calculating the expectation values for these values can obtain edge positions with sub-pixel accuracy. Sub-pixel edge detection based on tangent information A sub-pixel edge detection method. The main steps are the following: (1) detect the object to the pixel-level boundary and (2) use the information of the pixel-level boundary along the tangent direction to modify the boundary to the sub-pixel level. In step (1), any edge detection method with pixel-level accuracy can be used. The principle of step (2) will be described below by taking the detected object as a circle as an example. Let the equation of a circle be (x – xc)2 + (y – yc)2 ¼ r2, where (xc, yc) are the coordinates of the center of the circle and r is the radius. As long as the outermost boundary points of the circle in the X and Y directions can be measured, (xc, yc) and r can be determined, thereby the circle can be detected. Figure 24.5 shows a schematic diagram of detecting the leftmost point in the X axis direction, in which a pixel point in the circle is marked with a shadow. Considering the symmetry of the circle, only a part of the left border of the upper semicircle is drawn. Let xl be the coordinates of the leftmost boundary point in the X axis direction, and h represents the number of border pixels on a column with the horizontal axis being xl. The
Fig. 24.5 Schematic diagram of the tangent direction of the circle boundary
difference T (as a function of h) of the abscissa of the intersection between two adjacent pixels on the column and the actual circle boundary can be expressed as

T(h) = √(r² − (h − 1)²) − √(r² − h²)

Let S be the difference between xl and the exact sub-pixel edge, that is, the correction value required to correct from the pixel boundary to the sub-pixel boundary. From Fig. 24.5, this difference can be obtained by summing all T(i) (where i = 1, 2, ..., h) and adding e:

S = r − √(r² − h²) + e < 1

Among them, e ∈ [0, T(h+1)], and its average value is T(h+1)/2. Using the formula for T(h) to calculate T(h+1) and substituting it into the above formula gives

S = r − (1/2)[√(r² − h²) + √(r² − (h + 1)²)]

This is the correction value required to obtain a sub-pixel-level boundary after detecting the pixel-level boundary. Further consider the situation when the object is an ellipse (a round object becomes an ellipse when obliquely projected). First set the object as a regular ellipse, with equation (x − xc)²/a² + (y − yc)²/b² = 1, where a and b are the lengths of the major and minor semi-axes of the ellipse, respectively. The sub-pixel correction value S can be obtained similarly to the derivation for the circle above.
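The correction derived above reduces to a one-line function of the radius r and the pixel count h; a minimal sketch (assuming h + 1 ≤ r so the square roots are real):

```python
import numpy as np

def subpixel_correction_circle(r, h):
    """Correction S from the pixel-level to the sub-pixel boundary of a circle.

    S = r - 0.5 * (sqrt(r^2 - h^2) + sqrt(r^2 - (h + 1)^2)),
    where h is the number of boundary pixels on the column at the extreme point.
    """
    return r - 0.5 * (np.sqrt(r**2 - h**2) + np.sqrt(r**2 - (h + 1)**2))
```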
T: [xi, xj], 1 ≤ i, j ≤ n.
3. Calculate the probability density function p(x) of g(x), which can be written as

p_k = g_k / Σ_{i=1}^{n} g_i,  k = 1, 2, ..., n
4. Calculate the expected value E of p(x) and set the edge at E. In discrete images, E is calculated as follows:
E = Σ_{k=1}^{n} k p_k = Σ_{k=1}^{n} k g_k / (Σ_{i=1}^{n} g_i)
Because this method uses an expected value based on statistical characteristics, it can better eliminate the multi-response problem caused by noise in the image (i.e., multiple falsely detected edges).
First-order derivative edge detection Detection of edges with the gradient of the pixel gray values (approximated by the digital equivalent of the first derivative). The common Prewitt detector and Sobel detector are typical first-order derivative edge detection operators.
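A sketch of the expectation-based sub-pixel edge location summarized above (see also Sect. 24.1.2), applied to a 1-D profile; the threshold T is user-chosen, and simple differencing is used as a stand-in for |f′(x)|.

```python
import numpy as np

def expectation_edge_position(f, T):
    """Sub-pixel edge position as the expectation of the gradient magnitude."""
    f = np.asarray(f, dtype=float)
    g = np.abs(np.diff(f))                  # |f'(x)| approximated by first differences
    k = np.arange(1, len(g) + 1, dtype=float)
    mask = g > T                            # restrict to the range where g(x) > T
    if not mask.any():
        return None                         # no edge above the threshold
    p = g * mask / (g * mask).sum()         # probability function of g(x)
    return float((k * p).sum())             # expected position E (sub-pixel)
```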
24.3.3 Gradient Operators Gradient operator An operator that calculates the spatial difference (corresponding to the first derivative) by convolution. For 2-D images, the gradient operator requires two orthogonal masks. For 3-D images, the gradient operator requires three orthogonal masks. Commonly used gradient operators include (2-D) Roberts cross operator, Prewitt detector, and Sobel detector. All three are easily extended to 3-D case. They are all image operators that can generate a gradient image from the grayscale input image I. Depending on the terminology used, the output can be: 1. At each point, there is a gradient vector ∇I consisting of derivatives of x and y. 2. The magnitude of the gradient vector. A common use of gradient operators is to locate regions of strong gradients that can indicate edge locations. Orthogonal gradient operator In a 2-D image, a gradient-computing operator composed of two orthogonal masks. It includes a variety of commonly used first-order differential edge detection operators, such as Roberts cross operator, Prewitt detector, Sobel detector, and isotropic detector. For 3-D images, three orthogonal masks are required. Gradient filter A filter used to convolve the image to generate a new image, and each pixel in the new image represents the gradient value of the position in the original image. The direction of the gradient is determined by the filter. Two orthogonal filters are often used to combine the results into a gradient vector. See Roberts cross detector, Roberts cross gradient operator, Prewitt detector, Prewitt gradient operator, Sobel detector, and Sobel gradient operator. Gradient mask A mask for calculating local gradients in various gradient operators or gradient classification work. Differentiation filtering See gradient filter. Magnitude response One of the two types of responses for gradient operators. It can be calculated based on the response of each mask in the operator. The amplitude response of the general operator is to take the L2 norm of the composed mask response. Considering the amount of calculation, the L1 norm or L1 norm of these mask responses can also be taken. Directional response One of the two types of responses for gradient operators. Refers to the maximum direction value of the gradient operator response, which can also be decomposed into
directions relative to each coordinate axis. For example, in the 3-D case, the directional responses relative to the X, Y, and Z axes are

α̂ = arccos[ Ix / √(Ix² + Iy² + Iz²) ]
β̂ = arccos[ Iy / √(Ix² + Iy² + Iz²) ]
γ̂ = arccos[ Iz / √(Ix² + Iy² + Iz²) ]

Among them, Ix, Iy, and Iz are the responses of the three masks in the operator for the X, Y, and Z directions, respectively.
Direction detection error In 3-D edge detection using a gradient operator, the angular difference between the true edge surface direction and the gradient direction given by the operator. Its value ε and the directional responses α̂, β̂, and γ̂ of the gradient operator with respect to the coordinate axes satisfy

cos ε = cos α cos α̂ + cos β cos β̂ + cos γ cos γ̂

Histogram of oriented gradient [HOG] Statistical histogram of the gradient directions in a gradient image. First, divide an image into several image blocks, and then divide each block into several smaller units (cells). Next, the distribution of gradient values over direction is counted in each block and cell to obtain a histogram of gradient directions.
Histogram of oriented gradients [HOG] descriptor A descriptor based on the histogram of oriented gradient across a set of blocks in space, each of which can be further subdivided into cells in the local image neighborhood. Note that this descriptor is not invariant to rotation.
Grayscale gradient The rate of change of gray levels in a grayscale image. See edge, gradient image, and first-derivative filter.
Gray-level gradient value scatter A scatter plot whose two coordinate axes represent gray value and gradient value, respectively. Also called a 2-D histogram.
24.3.4 Particle Gradient Operators
Sobel gradient operator See Sobel kernel.
Sobel detector A typical first-order differential edge detector. The two 2-D orthogonal masks used are shown in Fig. 24.22, in which larger weights are given to the pixels near the symmetry axis of the mask.
Sobel operator Same as Sobel detector.
Sobel kernel A gradient estimation kernel for edge detection. It contains two parts: a smoothing filter s = [1, 2, 1] in the horizontal direction and a gradient operator g = [−1, 0, 1] in the vertical direction. The kernel that enhances horizontal edges is (* stands for convolution)

Ky = s * g^T = [1 2 1; 0 0 0; −1 −2 −1]

The kernel Kx that enhances vertical edges is the transpose of Ky:

Kx = s^T * g = [1 0 −1; 2 0 −2; 1 0 −1]
Sobel edge detector An edge detection method based on the Sobel kernel. For an image I, the Sobel edge intensity E is the square root of the sum of the squares of the convolutions of the image with the horizontal and vertical Sobel kernels, that is (* stands for convolution),
Fig. 24.22 Masks for Sobel detector
Horizontal-edge mask: [1 2 1; 0 0 0; −1 −2 −1].  Vertical-edge mask: [1 0 −1; 2 0 −2; 1 0 −1].
Fig. 24.23 Results of the Sobel edge detector
Fig. 24.24 Masks for Prewitt detector: [1 1 1; 0 0 0; −1 −1 −1] and [1 0 −1; 1 0 −1; 1 0 −1]
E = √[(Kx * I)² + (Ky * I)²]
Figure 24.23 shows an example of the detection result of the Sobel edge detector. Figure (a) is an original image that contains edges with various orientations. Figures (b) and (c) are the horizontal and vertical convolution results obtained using the horizontal kernel and vertical kernel of the Sobel operator, which have strong responses to the vertical and horizontal edges, respectively. The gray part in these two figures corresponds to the area with smaller gradient, the dark or black corresponds to the area with larger negative gradient, and the light or white corresponds to the area with larger positive gradient. Comparing the tripod in the two figures, since the tripod is mainly inclined to vertical lines, the positive and negative gradient values in Figure (b) are larger than those in Figure (c). Figure (d) is the edge detection result of the entire Sobel operator obtained according to the above formula. Prewitt gradient operator An edge detection operator based on mask matching. It uses a set of convolutional masks or kernels (Prewitt kernels) to implement matched filters that face the edges in multiple (typically eight) orientations. The edge intensity at a given pixel is taken to be the largest of the responses to all masks. Some implementations use the sum of the absolute values of the response of the horizontal and vertical masks. Prewitt detector A first-order differential edge detector. The two 2-D orthogonal masks used are shown in Fig. 24.24. Prewitt kernel Horizontal and vertical masks used by the Prewitt gradient operator (see Fig. 24.24).
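A sketch of the Sobel edge detector entry above: convolve with the two kernels and take the root of the sum of squares. It uses scipy.ndimage.convolve purely for brevity; the border mode is an arbitrary choice.

```python
import numpy as np
from scipy.ndimage import convolve

KY = np.array([[ 1,  2,  1],
               [ 0,  0,  0],
               [-1, -2, -1]], dtype=float)   # responds to horizontal edges
KX = KY.T                                     # responds to vertical edges

def sobel_edge_strength(image):
    """Sobel edge intensity E = sqrt((Kx * I)^2 + (Ky * I)^2)."""
    gx = convolve(image.astype(float), KX, mode='nearest')
    gy = convolve(image.astype(float), KY, mode='nearest')
    return np.hypot(gx, gy)
```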
Fig. 24.25 Masks for Roberts cross detector: [1 0; 0 −1] and [0 1; −1 0]
Fig. 24.26 Masks for Ando filter: [1 k 1; 0 0 0; −1 −k −1] and [1 0 −1; k 0 −k; 1 0 −1]
Roberts cross gradient operator An operator for detecting edges that computes the orthogonal components of the image gradient at each pixel. Convolving the image with the two Roberts kernels gives two components, Gx and Gy, for each pixel. The preferred gradient amplitude is √(Gx² + Gy²), and the gradient orientation is arctan(Gy/Gx). See edge detection.
Roberts cross detector A first-order differential edge detector proposed more than half a century ago. Two 2-D orthogonal masks are used to estimate the gray-level gradient at a pixel position (the gradient can be approximated by the digital equivalent of the first derivative) (see Fig. 24.25).
Roberts cross operator See Roberts cross detector.
Roberts kernel A pair of kernels or masks used to estimate the orthogonal components of the image gradient in the Roberts cross operator. Each mask has its largest response to edges oriented at a 45° angle from the vertical axis of the image. See Roberts cross detector.
24.3.5 Orientation Detection
Ando filter An edge detector that attempts to minimize the artifacts that occur when using small filter masks. Like the Sobel detector, it is a first-order differential edge detector. The two 2-D orthogonal 3 × 3 masks used are shown in Fig. 24.26. The difference from the Sobel operator, where k = 2, is that here k = 2.435101.
Fig. 24.27 Masks for isotropic detector: [1 √2 1; 0 0 0; −1 −√2 −1] and [1 0 −1; √2 0 −√2; 1 0 −1]
Fig. 24.28 Masks for compass gradient operator: eight 3 × 3 masks built from the coefficients 1 and −2, one for each of the directions North, South, East, West, North-West, North-East, South-West, and South-East (in the original figure, a double dashed arrow marks each mask's direction of maximum response and separates its positive and negative coefficients)
Consistent gradient operator An edge detection operator that minimizes distortion when using small mask (such as 3 3) operators. Among them, the coefficient 2 in the Sobel detection operator mask is replaced by 2.435101, so floating-point arithmetic is required. Isotropic detector A first-order differential edge detector. The two 2-D orthogonal masks used are shown in Fig. 24.27. The value of the coefficient of each position in the mask is inversely proportional to the distance between the position and the center of the mask. Compass edge detector An edge detector obtained by combining multiple edge detectors separated in several directions. The response at the edge of a pixel generally takes the maximum value of the response in each direction. Compass gradient operator A directional differential operator using the maximum response direction of each mask. The eight masks used are shown in Fig. 24.28. The direction of the maximum response of each mask is shown by the double dashed arrow in the figure (the arrows also separate the positive and negative mask coefficients).
Fig. 24.29 The first four masks of 3 × 3 for Kirsch operator (each mask contains three coefficients of −1, three of 0, and three of 1; the four masks are oriented 45° apart)
Fig. 24.30 The first four masks of 3 × 3 for the variant of Kirsch operator (each mask contains three coefficients of 5, five of −3, and one 0, again with the four masks oriented 45° apart)
Kirsch compass edge detector A first-order derivative edge detector that calculates gradients in corresponding directions by using different calculation masks. Because edges have high gradient values, edges can be detected by thresholding the gradient magnitude. See Kirsch operator. Kirsch operator A direction difference operator that uses the maximum value of the edge template response (and its sign) to determine the corresponding edge direction. The first four of the eight edge masks of 3 3 used are shown in Fig. 24.29, the angle between the directions is 45 , and the other four masks can be obtained by symmetry. A variant of the above masks is shown in Fig. 24.30. Cumani operator An operator that detects edges in a color image or multi-spectral image based on calculating the second-order partial differential or second-order directional derivative of an image representation function. 3-D differential edge detector Difference operator for detecting edges in 3-D images. It mainly includes first-order difference operators and second-order difference operators, which can be generalized by corresponding 2-D differential edge detectors. First-order difference operators in 3-D require three orthogonal 3-D masks. Due to the expansion of the 3-D neighborhood, 3 3 3 masks centered on one voxel can cover a variety of neighborhood voxels, the most common being 6-, 18-, or 26-neighborhood voxels. Zucker-Hummel operator Convolution kernel for detecting the surface of a scene in a stereo image. There is a kernel of 3 3 3 for each of the derivatives in the three directions. For example, use v(x, y, z) to represent a stereo image, and the convolution of the kernel c ¼ [–S, 0, S] is represented by ∂v/∂z, where S is a 2-D smooth kernel:
S = [a b a; b 1 b; a b a]
where a = 1/√3 and b = 1/√2. Then the kernel is Dz(i, j, k) = S(i, j)c(k). The kernels for ∂v/∂x and ∂v/∂y can be obtained by permuting Dz: Dx(i, j, k) = Dz(j, k, i) and Dy(i, j, k) = Dz(k, i, j).
Multidimensional edge detection A variant of standard edge detection for grayscale images in which the input is a multispectral image (such as an RGB color image). The edge detection operator can detect edges independently in each dimension and then combine them, or it can use all the information at each pixel.
24.4
High-Order Detectors
The calculation of the second derivative can also determine the edge position (Gonzalez and Woods 2018; Szeliski 2010; Zhang 2017b). Many useful secondderivative operators have been proposed (Marr 1982; Zhang 2006).
24.4.1 Second-Derivative Detectors
Second-order-derivative edge detection Techniques and processes for detecting edges with the second derivative of the pixel gray values. Although it has the potential to be used as an isotropic detection operator, it is rarely used alone in practice. The main reasons are the following: (1) it produces “double edges,” that is, there are two responses, one positive and one negative, for each edge; and (2) it is very sensitive to noise. The Laplacian operator is a typical second-derivative edge detection operator.
Second-derivative operator A linear filter that estimates the second derivative at a given point and in a given direction in the image. A simple approximation to the second derivative of a 1-D function f is the finite central difference, obtained from Taylor's approximation:

f″_i = (f_{i+1} − 2 f_i + f_{i−1}) / b² + O(b)

Among them, b is the sampling step (assumed constant), and O(b) indicates that the truncation error vanishes as b decreases. For 2-D images, the method of
estimating the second derivative in a given direction is similar but more complicated. Compare with first-derivative filter.
Canny operator An optimized edge detection operator. It converts the edge detection problem into the problem of detecting the maximum of a function. The optimization criteria are the following: (1) low error probability, that is, not only should few real edges be missed, but non-edges should not be judged as edges (the detection criterion); (2) high positional accuracy, that is, the detected edges should lie on the real boundaries (the localization criterion); and (3) a unique response to each edge, that is, the boundary obtained is a single pixel wide (the one response criterion). The idea of the Canny operator is to first determine the partial derivatives of the smoothed image function in the x and y directions and then search for the “best” edge magnitude and direction. This idea can be extended to color images. For color pixels and/or the color vector in RGB space C(x, y) = (R, G, B), the change of the image function at position (x, y) can be expressed as ΔC = JΔ(x, y), where J is the Jacobian matrix, which contains the first-order partial derivatives of each component of the color vector. In RGB space, J is given by

J = [Rx Ry; Gx Gy; Bx By] = (Cx, Cy)

Among them, the subscripts x and y denote partial derivatives of the corresponding function, for example Rx = ∂R(x, y)/∂x.
The direction with the greatest change in the image and/or the direction with the greatest discontinuity in the chrominance image function is represented by the eigenvector of JTJ corresponding to the largest eigenvalue. The direction θ of the color edges defined by this method can be determined by the following formula (the norm used can be arbitrary): tan ð2θÞ ¼
2Cx Cy 2 kCx k2 Cy
Among them, Cx and Cy are partial differentials of color components. For example, in RGB space, Cx ¼ ðRx , Gx , Bx Þ The amplitude M of the edge can be calculated by the following formula:
890
24
Edge Detection
2 M 2 ¼ kCx k2 cos 2 ðθÞ þ 2Cx Cx sin ðθÞ cos ðθÞ þ Cy sin 2 ðθÞ Finally, after each edge has its direction and amplitude determined, thresholdbased non-maximum suppression can be used to refine the “wide” edges. Canny edge detector An edge detector that considers the balance between the sensitivity of edge detection and the accuracy of edge positioning. Its detection of edges includes four steps: (1) Gaussian smooth the image to reduce noise and eliminate small details; (2) calculate the gradient intensity and direction of each pixel; (3) use non-maximum suppression methods to eliminate small gradients to focus on edge positioning; and (4) take the threshold of the gradient and connect the edge points. Here, hysteresis thresholding is used: first connect the strong edge pixels, and then consider tracking the weak edge pixels. Canny filter Same as Canny edge detector. Canny’s criteria Three optimization criteria/indexes in Canny operator: detection criterion, local ization criterion, and one response criterion. Detection criterion One of the three Canny’s criteria for detecting edges in the Canny operator. Localization criterion One of the three Canny’s criteria for detecting edges in the Canny operator. One response criterion One of the three Canny’s criteria for detecting edges in the Canny operator. Hysteresis thresholding A technique that uses two different thresholds to achieve the final threshold. There can be different applications. 1. Apply to grayscale images. If there are no obvious peaks in the histogram of an image, this means that some pixels in the background have the same grayscale as the object pixel and vice versa. Many of these pixels are near the boundary of the object. At this time, the boundary may be blurred and there is no clear definition. Two thresholds can now be selected on either side of the valley between the two peaks. The larger of the two thresholds is used to define the “hard core” of the object (let here the object gray value be greater than the background gray value). The smaller threshold is used in conjunction with the spatial proximity of the pixel: when a pixel has a gray value greater than the smaller threshold but less than the large threshold, it is marked as an object pixel only when it is adjacent to the hard core object pixel. Sometimes a different statistical rule is used: for a pixel, if most of its neighboring pixels have been marked as object pixels, it can also be marked as object pixels.
24.4
High-Order Detectors
891
2. Apply to gradient images. When thresholding the gradient image to determine the object edge point, first mark the edge pixels whose gradient value is greater than the high threshold value (think that they must all be edge pixels), and then use the low threshold value to judge the pixels connected to these pixels, that is, consider the gradient pixels whose values are greater than the low threshold and which are adjacent to pixels that have been determined as edges are also edge pixels. This method can reduce the influence of noise in the final edge image and can avoid false edges caused by too low thresholds or edge loss caused by too high thresholds when using a single threshold. This process can be performed recursively or iteratively. Thresholding with hysteresis Thresholding a scalar signal over time when the output is a function of the previous signal and the threshold. For example, a temperature-controlled thermostat receives a signal q(t) and generates an output signal p(t): pð t Þ ¼
sðt Þ > T cold
qð t 1Þ ¼ 0
sðt Þ < T hot
qð t 1Þ ¼ 1
Among them, the value at time t depends on the previous decision q(t–1), and two thresholds Tcold and Thot divide s(t) into three parts. In image segmentation, the thresholding with hysteresis is often associated with the edge tracking connection step of the Canny edge detector. Strong edge pixels Edge pixels detected by the Canny edge detector that fall within the interval between the two thresholds of hysteresis thresholding at the orientation angle used. Compare weak edge pixels. Weak edge pixels Edge pixels detected by the Canny edge detector that fall outside the interval between the two thresholds of hysteresis thresholding at the orientation angle used. Compare strong edge pixels. Hysteresis tracking See thresholding with hysteresis. Deriche edge detector A convolution filter similar to the Canny edge detector for detecting edges. The optimal operator used assumes that the filter has an infinitely extended range. The convolution filter thus obtained has a sharper shape than the Gaussian differential used by the Canny edge detector. Can be written as jx j
f ðxÞ ¼ Axe σ
892
24
Edge Detection
Deriche filter Deriche uses the Canny operator to derive an ideal filter that can be implemented recursively. The two edge filters are D01 ðxÞ ¼ α2 xeαjxj D02 ðxÞ ¼ 2α sin ðαxÞeαjxj The corresponding smoothing filters are 1 D1 ðxÞ ¼ αðαjxjþ1Þeαjxj 4 h 1 D2 ðxÞ ¼ α sin ðαjxjÞ þ cos ðαjxjÞeαjxj 2 The smaller the value of α in the Deriche filter, the greater the corresponding pffiffiffi smoothness, which is the opposite of the Gaussian filter. When σ ¼ π=α, the effect pffiffiffi of the Gaussian filter is similar to that of the first Deriche filter. When σ ¼ π=ð2αÞ, the effect of the Gaussian filter is similar to that of the second Deriche filter. Deriche filters are significantly different from Canny filters and more accurate than Canny filters. Lanser filter An improvement to the Deriche filter. The anisotropy problem can be eliminated (i.e., the calculated edge response amplitude depends on the orientation of the edge), and an isotropic filter can be obtained. In addition, it is more accurate than the Canny filter. Robinson edge detector An edge detection operator that can be used to calculate the estimated first-order derivatives of an image in eight directions. Its eight masks are shown in Fig. 24.31. Robinson operators A set of operators that calculate partial derivatives of images along different directions (0 , 45 , 90 , and 135 ). The masks for these operators are shown in Fig. 24.32. By flipping the symbols, operator masks can be obtained in the directions of 180 , 225 , 270 , and 315 . Second-order difference Calculation of the second-order differential/derivative of a function in a discrete domain. It can be obtained by combining the forward difference and the backward
1
1
1
1
1
1
‒1
1
1
‒1 ‒1
1
‒1 ‒1 ‒1
1
‒1 ‒1
1
1
‒1
1
1
1
‒2
1
‒1 ‒2
1
‒1 ‒2
1
‒1 ‒2
1
1
‒2
1
1
‒2 ‒1
1
‒2 ‒1
1
‒2 ‒1
‒1 ‒1 ‒1
‒1 ‒1
1
‒1
1
1
1
1
1
1
1
1
1
1
1
‒1 ‒1
1
1
Fig. 24.31 The eight masks of Robinson edge detector
1
‒1
1
24.4
High-Order Detectors
893
1
2
1
2
1
0
1
0
–1
0
–1
–2
0
0
0
1
0
–1
2
0
–2
1
0
–1
–1
–2
–1
0
–1
–2
1
0
–1
2
1
0
M0
M1
M2
M3
Fig. 24.32 The masks of Robinson operator
difference. If only the horizontal direction is considered, the second-order difference can be expressed as [f(x + 1) – f(x)] – [f(x) – f(x – 1)] ¼ f(x + 1) – 2f(x) + f(x – 1). It can be seen that it makes use of three pixels in the image (in the same direction) that are adjacent next to each other.
24.4.2 Gaussian-Laplacian Detectors Laplacian-of-Gaussian [LoG] edge detector First use a Gaussian low-pass filter to smooth the image, and then use a Laplacian operator to detect the edge of the result. Its working principle can well explain the underlying behavior of the human visual system. Laplacian-of-Gaussian [LoG] kernel Kernel of a Laplacian-of-Gaussian filter. The 5 5 masks that implement their functions in 2-D images are as follows, which can be decomposed into two 1-D masks: 2
1
6 64 1 6 66 256 6 6 44 1
4
6
4
16 24
24 16 36 24
16 4
24 16 6 4
2 3 1 7 6 7 47 647 7 71 1 6 7 7 67 ¼ 6 6674½1 16 7 6 7 45 445 1 1 1
3
4
6 4
1
Can also be implemented as the sum of two separable kernels, i.e., 2 1 x 2 þ y2 x þ y2 exp ∇ Gσ ðx, yÞ ¼ 3 2 σ 2σ 2 2σ 2 2
Can be broken down into two separable parts
894
24
Fig. 24.33 A mask of a Laplacian-of-Gaussian operator
∇2 Gσ ðx, yÞ ¼
Edge Detection
0
1
1
2
2
2
1
1
2
4
5
5
5
1
4
5
2
5
3
2
5
0
2
5
3
1
4
5
3
0
1
2
4
5
0
1
1
2
1
0
4
2
1
5
4
1
-12 -24 -12
3
5
2
-24 -40 -24
0
5
2
-12 -24 -12
3
5
2
3
5
4
1
5
5
4
2
1
2
2
1
1
0
3
0
3
1 x2 1 y2 1 ð x ÞG ð y Þ þ 1 G Gσ ðyÞGσ ðxÞ σ σ σ3 σ3 2σ 2 2σ 2
Zero crossing of the Laplacian of a Gaussian See zero-crossing operator. Laplacian-of-Gaussian [LoG] operator A low-level image operator, also called a Marr operator. First, perform Gaussian smoothing at each position in the image and then use the second-derivative Laplacian (∇2). Isotropic operators are often used as part of the zero-crossing operator for edge detection, because the position of the value change sign (from positive to negative or from negative to positive) is near the edge in the input image, and the edge details detected can be controlled using Gaussian smoothing scale parameters. Figure 24.33 shows a mask for implementing a Laplacian-of-Gaussian operator with a smoothing parameter σ ¼ 1.4. Laplacian operator Operator to calculate the Laplacian of a pixel. It is also a second-order differential edge detector. In the image, the calculation of the Laplacian value can be realized by various masks and mask convolution. The basic requirement for the mask here is that the coefficients corresponding to the central pixel should be positive, while the coefficients corresponding to the neighboring pixels of the central pixel should be negative, and the sum of all coefficients should be equal to zero. Figure 24.34 shows three typical Laplacian masks (all satisfying the above conditions). In grayscale images, Laplacian operators provide a direct digital approximation to the second derivative of brightness. Due to the use of second-order difference, it is quite sensitive to noise in the image. In addition, edges with a width of two pixels are
24.4
High-Order Detectors
Fig. 24.34 Three mask of Laplacian operator
895
0
−1
0
−1
−1
−1
−2
−3
−2
−1
4
−1
−1
8
−1
−3
20
−3
0
−1
0
−1
−1
−1
−2
−3
−2
0
−1
0
−1
4
−1
0
−1
0
Fig. 24.35 A Laplacian kernel
often generated, and information on the direction of the edges cannot be provided. Because of the above reasons, Laplacian operators are rarely used directly to detect edges but are mainly used to determine whether the edge pixel is in the dark or bright side of the image after the edge pixel is known. Laplacian Divergence of the gradient vector of a function f(x, y). Can be expressed as Δf ðx, yÞ
¼ divfgrad½ f ðx, yÞg ¼ ∇ • ∇f ðx, yÞ ¼ ∇2 f ðx, yÞ T ∂f ðx, yÞ ∂f ðx, yÞ , ¼ ∇• ∂x ∂y 2
¼
2
∂ f ðx, yÞ ∂ f ðx, yÞ þ ∂x2 ∂y2
As can be seen from the above formula, the Laplacian of a function is equal to the sum of its second-order partial derivatives. Therefore, the Laplacian at each pixel is equal to the sum of the second-order partial derivatives of the image along each axis. For example, the Laplacian of f(x, y, z): R3 ! R is 2
∇2 f ðx, y, zÞ ¼
2
2
∂ f ∂ f ∂ f þ þ ∂x2 ∂y2 ∂z2
In computer vision, a Laplacian operator can be applied to an image by convolving it with a Laplacian kernel. A definition of the Laplacian kernel is the sum of the second-order differential kernel [–1, 2, –1] and [–1, 2, –1]T, and a kernel of 3 3 can be formed by filling 0, as shown in Fig. 24.35.
896
24
Fig. 24.36 Pixel f(x, y) and its neighborhood
Edge Detection
f3
f4 f5
f2
(x, y) f 1
f6
f7
f8
Laplace-Beltrami operator An operator f derived from differential geometry, where f describes the divergence of the gradient, Δf ¼ div grad f. Wallis edge detector A logarithmic Laplacian operator. The basic assumption is if the difference between the logarithm of the gray value of a pixel and the logarithm of the average gray value of a 4-neighboring pixel of the pixel is greater than a threshold, then the pixel should be an edge pixel. Let a pixel f(x, y) and its neighborhood be as shown in Fig. 24.36, the above difference can be expressed as Dðx, yÞ ¼ log ½ f ðx, yÞ
½ f ðx, yÞ4 1 1 log ½ f 1 f 3 f 5 f 7 ¼ log 4 4 f1f3f5f7
In actual application, in order to compare with the threshold value, it is not necessary to perform logarithmic calculation for each pixel, and only the threshold value needs to take exponent once, so the calculation amount is not large. The Wallis edge detector is less sensitive to multiplicative noise because f(x, y) changes in the same proportion as f1, f3, f5, and f7. Marr edge detector Same as Marr operator. Marr operator An edge detector based on Laplacian operator. Also called Laplacian-of-Gaussian operator. It first (using a Gaussian filter) smooths the image under test and then (using Laplacian operator) detects edges to reduce the effect of noise on edge detection. When used, images with different resolutions need to be processed separately. The calculations at each resolution layer include: 1. Convolve the original image with a 2-D Gaussian smoothing mask. 2. Calculate the Laplacian value of the image after convolution. 3. Detect the zero-crossing point in the Laplacian image and use it as the edge point. Marr-Hildreth edge detector A filter for edge detection based on multi-scale analysis of zero-crossing values of Laplacian-of-Gaussian operators. Same as Marr edge detector.
24.4
High-Order Detectors
897
24.4.3 Other Detectors Vector dispersion edge detector [VDED] In order to eliminate the noise-sensitive problem in vector ordering calculation, a set of general-purpose detectors defined by the combination of ordering vectors and dispersion measures are used. Can be expressed as # " X X X n n n VDED ¼ OSO ai1 xi , ai2 xi , , aik xi i¼1 i¼1 i¼1 # " X n ¼ OSO j aij xi , j ¼ 1, 2, , k i¼1 Among them, OSO represents a calculation function based on ordinal statistics. In principle, the operator for edge detection can be derived from the above formula by appropriately selecting an OSO and a set of coefficients aij. To limit this difficult work, there are some requirements for edge operators. First, the edge operator should be insensitive to impulse noise and Gaussian noise; second, the edge detection operator should provide a reliable response to ramp edges. Minimum vector dispersion [MVD] See minimum vector dispersion edge detector. Minimum vector dispersion edge detector In order to make the vector dispersion edge detector insensitive to Gaussian noise, the vector-valued “α-arranged” average is used to replace the vector median to obtain an edge detector based on minimum vector dispersion (MVD). Among them # " l X xl MVD ¼ min xnjþ1 , l j i¼1
j ¼ 1, 2, , k;
l 0
0
gðx, yÞ ¼ 0
Because only pixels with non-zero gradient values are used in the calculation of EAG, the effect of pixels with zero gradient values on the average gradient is removed, so the calculation result is called “effective” gradient. EAG is the average gradient of non-zero gradient pixels in the image, represents a selective statistic in the image, and is the basis for calculating the transition region. Multi-spectral thresholding An image segmentation technique for multi-spectral image data. Thresholding for each spectral channel can be independently performed, and then the resulting images are combined using a logical AND operation. It is also possible to cluster pixels in multi-spectral space first and then use thresholds to select the required clusters. Color image thresholding Segmentation of color images using thresholding techniques. Here, the idea of thresholding in a monochrome image can be generalized to a color image. Similar to the thresholding of grayscale images, the basic idea is to use a properly selected threshold to decompose the color space into several regions, hoping that they correspond to meaningful target regions in the image. A simple method is to define one (or more) threshold (such as TR, TG, and TB) for each color component (such as R, G, and B) to obtain a result of RGB cube
25.3
Parallel-Region Techniques
933 B
Fig. 25.9 Thresholding of color images
TB G R
TG
TR
decomposition. According to this result, isolate the colored range of interest (a smaller cube) as shown in Fig. 25.9. The selection of the color component threshold can also be generalized by the method of selecting the gray threshold. Thresholds can still be divided into pixeldependent thresholds, region-dependent thresholds, and coordinate-dependent thresholds. Reversing grayscale pixel separation process In order to separate the foreground from the background, the thresholding process of the grayscale image is partially reversed. Wherever there are white pixels in the thresholded result image, change the pixels in the corresponding grayscale image to white. This inversion process produces a grayscale image where the foreground consists of pixels with different intensities and the background is completely white. Compare image negative. Reversing color pixel separation process In order to separate the foreground from the background in the color image, the process performed by reversing grayscale pixel separation process can be referred to. Here all three color channels can be performed identically, or each color channel can be processed separately to fine-tune it.
25.3.4 Clustering and Mean Shift K-means clustering A spatial clustering method based on iterative square error. The input is a point set {xi}ni ¼ 1 and the initial estimated positions c1, c2, . . ., cn of K cluster centers. The algorithm performs two steps alternately: (1) assign a point to the class closest to it and (2) recalculate the mean after adding the point. Iterations produce estimates of K cluster centers, hoping that they minimize ∑xminc|x c|2.
934
25
Object Segmentation Methods
Let x ¼ (x1, x2) represent the coordinates of a location in a feature space, and g(x) represent the feature value at this location. The K-means clustering method is to minimize the following indicators: E¼
2 K X X ðiþ1Þ gðxÞ μ j i¼1 x2QðiÞ j
Among them, Qj(i) represents the set of feature points assigned to class j after the i-th iteration; μj represents the mean of class j. The above formula gives the sum of the distances of each feature point and its corresponding class mean. The specific K-means clustering method steps are as follows: 1. Arbitrarily select K initial class means, μ1(1), μ2(1), . . ., μK(1); 2. At the i-th iteration, each feature point is assigned to one of the K categories according to the following criteria ( j ¼ 1, 2, . . ., K; l ¼ 1, 2, . . ., K, j 6¼ l ), i.e., ðiÞ
x 2 Ql
ðiÞ ðiÞ , if gðxÞ μl < gðxÞ μ j
That is, each feature point is assigned to the class whose mean is closest to it. 3. For j ¼ 1, 2, . . ., K, update class mean μj(i + 1) ðiþ1Þ
μj
¼
1 X gð xÞ Nj ði Þ x2Q j
where Nj is the number of feature points in Qj(i). 4. For all j ¼ 1, 2, . . ., K, if μj(i + 1) ¼ μj(i), the algorithm converges and ends; otherwise return to step (2) to continue the next iteration. K-means See K-means clustering. K-means cluster analysis A simple and widely used non-hierarchical cluster analysis method. K-medians In K-means clustering, a variant of calculating multidimensional medians instead of the mean. The definition of multidimensional median can be changed, but for a point set {xi}ni ¼ 1, that is, the median calculation of {xi1, . . ., xid}ni ¼ 1 includes the definition of component by component with m ¼ (median{xi1}ni ¼ 1, . . ., {xid}ni ¼ 1) P and the definition of one dimension m ¼ argminm2Rd ni¼1 j m xi j.
25.3
Parallel-Region Techniques
935
Hierarchical K-means A splitting technique to achieve hierarchical clustering. First perform K-means clustering on all data (use smaller K, such as K ¼ 2), and then continue to use Kmeans splitting on the class thus obtained. This process is repeated continuously until a stopping criterion is met. This technique is also called tree-structured vector quantization. See divisive clustering. ISODATA clustering A typical spatial clustering method. ISODATA is an abbreviation for Iterative SelfOrganizing Data Analysis Technique A (the last A is added to make the abbreviation easier to pronounce). It is a non-hierarchical clustering method developed on the basis of K-means clustering algorithm. The main steps are as follows: 1. Set the initial values of N cluster center positions. 2. Find the nearest cluster center position for each feature point, and divide the feature space into N regions by assigning values. 3. Calculate the averages belonging to each clustering mode separately. 4. Compare the initial cluster center position with the new average value. If they are the same, stop the calculation. If they are different, use the new average value as the new cluster center position and return to step (2) to continue. Spatial clustering A region-based parallel algorithm in image segmentation. It can be seen as a generalization of the thresholding concept. The elements in the image space are represented by corresponding feature space points according to their feature values. The feature space points are first clustered into clusters corresponding to different regions, and then they are divided into clusters and finally mapped back to the original image space to obtain the result of the segmentation. A general grayscale thresholding method is a special case when the feature space is a grayscale space. 1-of-K coding scheme A method for assigning data points to clusters. For each data point xi, a corresponding set of binary indicator variables is introduced, Vik 2 {{0, 1}, where k ¼ 1, 2, . . ., K represents one of the K clusters to which the data point xi is assigned.. If the data point xi is assigned to cluster k, then Vik ¼ 1 and Vij ¼ 0 (for all j 6¼ k). Further reference may be made to the expression method of probability models in classification. First consider a two-class problem (K ¼ 2). The simplest way to use binary expressions is to use a single object variable t ¼ {0, 1}, where t ¼ 1 represents the first category C1 and t ¼ 0 represents the second Class 2 C2. The value of t can be interpreted as the probability that the class is C1 (here the probability values only take extreme values 0 and 1). Now considering multiple types of problems (K > 2), one of the K coding schemes can be used. Among them, the length of the vector t is K. If we consider class Cj, then all elements of t are tk ¼ 0, except for element tj ¼ 1. For example, consider the case of K ¼ 6, the object vector for a pattern derived from Class 3 is t ¼ [0, 0, 1, 0, 0, 0]T. Similarly, the value of tk can be interpreted as the probability that the class is Ck (here the probability values only take extreme values 0 and 1).
936
25
Object Segmentation Methods
Mean shift image segmentation A method of image segmentation using mean shift technology. It consists of two steps: 1. Discrete hold filtering. 2. Mean shift clustering. Mean shift clustering See mean shift image segmentation. Mean shift A nonparametric technique based on density gradient ascent. Can also refer to the offset mean vector. The algorithm based on mean shift is mainly an iterative step, that is, first calculate the average offset of the current point, move the point to its average offset, and then use this as a new starting point, and continue to move according to the probability density gradient until satisfying certain conditions. Since the probability density gradient always points to the direction where the probability density increases the most, the mean shift algorithm can converge to the nearest steady-state point of the probability density function under certain conditions, so it can be used to detect the modal existence in the probability density function. The mean shift method can be used to analyze complex multimode feature spaces and determine feature clusters. It is assumed here that the distribution of clusters in its central part is more dense, and the purpose is achieved by iteratively calculating the mean value of the density kernel (corresponding to the center of gravity of the cluster, which is also the mode at a given window). The mean shift implicitly models the probability density function with the help of a smooth continuous nonparametric model. Mean shift is a technique to efficiently determine peaks in high-dimensional data distributions without the need to explicitly calculate the complete distribution function. In addition to being used for image segmentation, the moving average algorithm has been widely used in clustering, real-time object tracking, and so on. Normal kernel A kernel commonly used in the mean shift method. If let x be the distance between the feature point and the mean, then a normal kernel (a kernel that is often truncated symmetrically to obtain limited support) is defined as 1 K N ðxÞ ¼ c exp kxk2 2 Among them, c is a constant greater than 0, which is used to make the sum of KN(x) to 1. The cross section of this core is 1 k N ðxÞ ¼ exp x2 2
x 0
25.4
Sequential-Region Techniques
937
Discontinuity preserving filtering See mean shift image segmentation. Bandwidth selection in mean shift In the mean shift method, a process and a technique for determining a bandwidth parameter of a filter kernel used to control a pixel coordinate domain and a pixel value domain. Multivariate density kernel estimator A multidimensional density kernel function needs to be determined in the mean shift method. Here, the function of the kernel function is to make the contribution to the mean shift different with the distance between the feature point and the mean, so that the mean of the cluster can be determined correctly. A symmetric core is often used in practice. Epanechnikov kernel A kernel (function) often used in the mean shift method. If we let x denote the distance between the feature point and the mean, then Epanechnikov kernel is defined as k E ð xÞ ¼
( c 1 kx k2 0
kx k 1 otherwise
Among them, c is a constant greater than 0, which is used to make the sum of KE(x) to 1. The section of this kernel is not differentiable at the boundary k E ðxÞ ¼
25.4
1x 0
0x1 x>1
Sequential-Region Techniques
Sequential-region techniques use a serial method to determine the object region of images (Prince 2012; Szeliski 2010; Theodoridis and Koutroumbas 2003; Zhang 2017b).
25.4.1 Region Growing Sequential-region technique See region-based sequential algorithms.
938
25
Object Segmentation Methods
Fig. 25.10 Schematic diagram of seed region
Seed region
Seed point
Region growing A region-based sequential algorithm for image segmentation. It is a bottom-up method. The basic idea is to gradually combine pixels with similar properties to form the required region. Specifically, first find a seed pixel for each region to be segmented as a starting point for growth, and then use pixels in the neighborhood around the seed pixel that have the same or similar properties as the seed pixel (based on some predetermined similarity rule or growth rule for decision) is merged into the region where the seed pixel is located. These new pixels are treated as new seed pixels, and the above growth process is continued until that no longer having pixels meet the conditions can be included, so that a region is grown. There are three key factors for regional growth: (1) the selection of seed pixels, (2) the selection of similar criteria, and (3) the definition of the end rule. In practice, it can be regarded as a type of algorithm that gradually expands the boundary of a region to build a larger connected region. During the growth process, data that matches the current region is merged into that region. After the new data set is added, the current region can be re-described. Many region growing algorithms include the following steps: 1. For a region (in the beginning, it can contain only a single pixel), describe it according to the existing pixels (such as fitting a linear model to the intensity distribution). 2. For the current region, find all surrounding pixels adjacent to it. 3. For each adjacent pixel, if the description of the current region is also suitable for the pixel (e.g., if it has similar intensity), add the pixel to the current region. 4. Start from step (1) until all pixels in the image are considered. Similar algorithms exist for 3-D images. The difference is that pixels are replaced by voxels and (planar) regions are replaced by surfaces or solids. In a broad sense, it can also refer to various segmentation methods that consider the proximity of pixel space, such as split-merging algorithms and watershed segmentation methods. Region aggregation See region growing. Seed See seed pixel.
25.4
Sequential-Region Techniques
939
Seed pixel In the region growing method of image segmentation, pixels that are the starting point of growth in the region to be segmented. Seed region The initial region in the region growing process that detects the luminance region in the luminance image or the region used for surface fitting in the distance data. It can contain multiple pixels or just one pixel, which is also called the seed point. In Fig. 25.10, there is a seed region on the background region and a seed point on the object region. Starting from them, the object region and the background region can be obtained separately. Seed and grow A class of image manipulation strategies. First, determine the starting point that has a high degree of certainty to meet a specific condition—the seed—and then iterate successively to find and combine other elements that also meet the specific conditions, that is, iterative expansion from the seed. For example, in image segmenta tion, this technique is used by region growing technology. As another example, in reconstructing a 3-D scene and object from a video sequence or a collection of related images, the method of using this strategy is to first select a pair of images with a large number of matching elements to establish the sparse correspondence between the image pairs; then it is estimated that a sufficient number of 3-D point camera poses can be seen, and the complete camera and feature correspondence can be obtained by iteratively stepwise with the help of bundle adjustment. In addition, in stereo matching based on feature, this strategy can also be adopted, that is, extracting highly reliable features first and then using these features as seeds to obtain further matching. Growing criteria In image segmentation, the similarity criterion that controls the growth process in the region growing method. The growing process merges pixels in the neighborhood around the seed pixel that have the same or similar properties as the seed pixel into the region where the seed pixel is located. Judgment is based on some similarity criteria. The criteria can be determined in advance or adjusted gradually during the growth process. Recursive region growing A class of recursive algorithms for region growing. First, select an initial pixel, and then examine its neighboring pixels (i.e., neighboring pixels determined by adjacency). If a pixel meets the criteria for adding it to the region, the pixel is grown. This process continues until all adjacent pixels (including newly grown pixels) have been detected. See image connectedness, neighborhood, and recursive splitting. Boundary-region fusion When the characteristics of two adjacent regions are close and can pass some similarity tests, a region growing image segmentation method combining these
940
25
Object Segmentation Methods
two regions is used. A set of pixels on the common boundary of two regions can be used as a candidate neighborhood for testing their similarity. It also refers to an image segmentation method based on the principle of region growing. When the characteristics of two adjacent regions are close enough, they pass some similarity tests and merge them. The candidate neighborhood for the similarity test here includes pixels close to the boundary of the shared region. Split, merge, and group [SMG] A typical split and merge algorithm for image segmentation. Includes five steps: 1. Initialization: The image is decomposed into sub-images according to the quadtree structure and continues to be decomposed to a certain level. 2. Merging: Consider the four sub-nodes of the nodes in a layer. If they all meet a consistency condition (such as the same gray level), the four sub-nodes are merged from the quadtree expression. The parent node of each child node becomes a leaf node. 3. Split: For a layer of inconsistent nodes, it is decomposed into four sub-images, that is, four sub-nodes are added below. Continue to check the consistency of each child node thus generated, and if it is inconsistent, continue to split until all newly obtained nodes in the quad tree are leaf nodes (non-divisible). 4. Transformation from a quadtree to a region adjacency graph: The child nodes (or leaf nodes) that belong to different parent nodes are bisected. These sub-nodes are converted to a region adjacency graph by converting the quadtree structure to a region adjacency graph. Establish adjacencies based on spatial relationships. 5. Group: Combine the sub-nodes that are adjacent in space and meet the consistency conditions to form a new consistency region, so as to obtain the final segmentation result. Split and merge A two-step process for splitting and clustering (split and merge). The data is first divided into subsets, and the initial partition is a single collection containing all the data. In the splitting step, the subset is decomposed iteratively, provided that the subset fails to satisfy a certain coherence criterion (such as the similarity of pixel colors). In the merging step, pairwise adjacent subsets are found, and if the coherence criterion is satisfied after merging, they are merged together. Although the coherence criterion used by the two steps is the same, the merge step may still find a subset that needs to be merged. Split and merge algorithm A region-based sequential algorithm for image segmentation. The basic idea of splitting is the opposite of region growing, that is, starting from the entire image, each region is obtained by continuous splitting. This is a top-down image segmentation method. In practice, the image is often divided into regions of arbitrary size and nonoverlapping, and then these regions are merged or split to meet the requirements of segmentation. The most commonly used data structure is a quadtree. Each leaf node in the quadtree corresponds to a pixel in the segmented image.
25.4
Sequential-Region Techniques
941
The split and merge algorithm can be written in the following segmentation algorithm form: 1. Define a uniform logical predicate P(Ri). 2. Calculate P(Ri) for each initial region. 3. Split each region P(Ri) ¼ FALSE into disjoint sub-regions corresponding to four quadrants. 4. Repeat steps (2) and (3) until all the result regions satisfy the uniformity criterion, that is, P(Ri) ¼ TRUE. 5. If any two regions Rj and Rk satisfy P(Rj[Rk) ¼ TRUE, merge them. 6. Repeat step (5) until no merge can be performed. Grouping 1. In human perception, the tendency to perceive certain stimulus patterns or stimulus clusters as related wholes rather than independent units. 2. The idea behind a class of image segmentation algorithms. Much of this work was inspired by the work of Gestalt psychology school. See segmentation, image segmentation, supervised classification, and clustering. Divisive clustering A method of clustering or cluster analysis. Initially, all samples are considered to belong to the same set (i.e., a class), and then they are gradually decomposed into subclasses until a predetermined number of subclasses are reached, or the samples within the subclasses meet a certain similarity criterion. See region splitting. Region splitting An image segmentation strategy or framework that implements image segmentation by sequentially decomposing the image into smaller and smaller components. A simple method is to first calculate the histogram of the full image, select a threshold value to separate the large peaks, and repeat the above steps for the region corresponding to the peaks until the segmented region in the image is sufficiently uniform or small. A type of algorithm that divides an image into regions (or subsequently divides a region into sub-regions). The condition to be judged here is whether a given uniformity criterion is satisfied in the region. See region and region-based segmen tation. Compare with region merging. Region merging An image segmentation strategy or framework that implements image segmentation by sequentially merging the basic components of an image into larger and larger areas. A simple method is to consider the boundaries between pixels and merge based on the relative length of the boundary and the magnitude of the boundary gradient. Repeat the above steps until each region in the image is large enough and uniform enough. Another method considers the clusters in the image and connects them based on the distance between points belonging to different clusters. If the nearest point is used, it is called single-connected clustering; if the furthest point is used, it is called full-connected clustering. See agglomerative clustering.
942
25
Object Segmentation Methods
A class of algorithms that fuse two image regions under the conditions that a given consistency criterion is satisfied. See region and region-based segmentation. Compare with region splitting. Recursive splitting A class of recursive methods for region segmentation (decomposing an image into a collection of regions). First, the entire image is initialized as a collection of several regions. Here, a consistency/uniformity criterion is used. If it is not satisfied, the image is split according to a given scheme (if a quadtree is used, it is divided into four sub-images) to obtain a new region set. This process iterates over all regions in the new region set until the remaining regions meet the consistency criteria. See region-based segmentation and recursive region growing.
25.4.2 Watershed Watershed segmentation Image segmentation with watershed transform. The typical implementation process is: 1. Detect edges in the image. 2. Calculate the distance transform D of the edge. 3. Calculate the watershed region in the distance transform. See watershed algorithm. Watershed The two regions in the image can be compared to two mountain peaks in topography. The watershed is the gathering place where the rainwater flows from the two mountain tops when it rains. This position is also the boundary that separates the two regions. An image can be regarded as a representation of 3-D terrain, that is, the 2-D ground (corresponding to the image coordinate space) plus the third dimension height (corresponding to the image gray level). In actual analysis, it is often considered to inject water from the bottom. When the water level reaches a certain height, the water in the two regions will converge. In order to prevent the water in a region from expanding to its adjacent region, an obstacle must be “erected” where the water will converge; this is the watershed. Waterline and watershed are equivalent. The watershed algorithm is to determine the watershed and implement image segmentation. Watershed transform A tool for image segmentation using mathematical morphology. The watershed transform treats the image as an elevation map and assigns a unique integer label to each local minimum on the map. The watershed transform of the image assigns a mark to each non-minimum pixel p to indicate that if a drop of water is placed at p, the water will continue to fall. The points at the watershed (ridges) on the elevation
25.4
Sequential-Region Techniques
943
Fig. 25.11 Example of watershed transform
map are called watershed points (because the water at these locations will fall to one of the two minima on either side), and the set of pixels surrounding each minima (with the same mark) constitutes a watershed region. There are effective algorithms to calculate the watershed transform. Figure 25.11 gives a set of schematic diagrams: the left image covers the minimum value on the original image; the middle image is the result of viewing the original image as an elevation map; the right image is the result of watershed transformation of the original image, where the different minimum values have different colors, watershed pixels are displayed in white, and the arrow points to a specific watershed. Particle segmentation A class of technologies that detect the presence of individuals with small objects (particles). In the image or video, the particles can be cells, water droplets, rice grains, stones, etc. and small in size but large in number. A typical problem is the severe occlusion caused by overlapping particles. This problem has been successfully solved with the help of waterline transformation. Watershed algorithm A region-based sequential algorithm for image segmentation. This algorithm uses the concept of watershed in topography for image segmentation. The calculation process of this algorithm is serial, and the boundary of the object is obtained. However, the consistency in the region is considered in the calculation process, and the boundary is determined by the region, so it is a sequential algorithm based on the region. Morphologic image reconstruction In image segmentation, a method of determining seeds in a watershed algorithm. This can be done by means of dilation or erosion operations in the mathematical morphology of grayscale images, respectively. Let the image under consideration be f(x, y), which will be described separately below. Consider another image g(x, y). For any pixel (x, y), there is g(x, y) f(x, y). Define the grayscale dilation of the image g(x, y) with the structural element b(x, y) as follows: consider that each pixel surrounding g(x, y) and b(x, y) has the same shape
944
25
Object Segmentation Methods
and size of neighboring domain, selects the maximum value in this neighborhood, and assigns L it to the center pixel of the output image. This operation is written as g (x, y) b(x, y). Reconstructing the image f(x, y) with g(x, y) can be performed iteratively by the following formula: gkþ1 ðx, yÞ ¼ max
n
M f ðx, yÞ, gk x, yÞ bðx, yÞg
until the image gk(x, y) no longer changes. Waterhole for the watershed algorithm In the watershed algorithm, a set of all pixels corresponding to a pixel string having a non-rising path and passing up to the minimum value (not rising compared to neighboring pixels) is used. Waterline for the watershed algorithm Starting from all pixels in the image, all non-descent paths are constructed and all pixels on this path and end the path at more than one local minimum. Tobogganing A simple form of waterline segmentation. In the segmentation based on the active contour model, it can be used to pre-segment the image and use the contour of the segmented region as a candidate for the optimization curve/path. The region boundary thus obtained can be transformed into a smaller graph (a search based on this graph will be fast), where the nodes correspond to the intersection of three or four regions. Marker In the watershed algorithm for image segmentation, a symbol representing a connected component in the image and having a one-to-one correspondence with the final segmentation region. Used for controlled segmentation methods; see marker-controlled segmentation. Marker-controlled segmentation In the watershed algorithm for image segmentation, a technique that uses labels to overcome the problem of over-segmentation. In addition to enhancing the contours in the input image to obtain the image segmentation function, it also uses the minimal imposition technique to first filter the input image and then uses feature extraction to determine a labeling function to mark the object and the corresponding background (can be considered as labeling images). These labels are forcibly added to the segmentation function as minimum values, and the watershed is calculated to obtain the final segmentation result. Minimal imposition A technique based on a geodesic operator to determine an object and a label corresponding to a background in a method using label-controlled segmentation.
25.4
Sequential-Region Techniques
945
25.4.3 Level Set Level set methods A numerical technique for interface tracking. For example, in 2-D image segmenta tion, the object contour is regarded as the zero level set of the level set function in 3-D space, and the object contour is obtained by solving its evolution equation. One advantage of this technique is that it can easily track changes in the topology of objects, so it is often used in medical image segmentation of deformed objects. Level set A mathematically defined set of variables describing a curve or surface. A level set of a real-valued function f with n variables is defined as (c is constant) fðx1 , x2 , , xn Þj f ðx1 , x2 , , xn Þ ¼ cg So, a level set is a set of variables that gives a function value with a given constant. The number of variables corresponds to the dimension of the set. A horizontal set is called a horizontal curve when n ¼ 2, a horizontal surface when n ¼ 3, and a horizontal hypersurface when n > 3. From a geometric perspective, a curve or surface to be described can be regarded as the zero level set of a function (level set function) in a one-dimensional space higher than it. The evolution of a curve or surface can be extended to a higher one-dimensional space to let the level set function evolve iteratively. When the level set function evolution stabilizes, a stable zero level set is obtained. A level set is a set of data points x that satisfies a given form of the equation f (x) ¼ c. Changing the value of c gives a set of different points, but they are generally closely related. A visual analogy is the geographical surface and rising ocean. If the function f(x) is sea level, then the level set is the coastline corresponding to different sea levels. In image segmentation, it refers to a set of points on a 2-D curve formed by the intersection of a 3-D surface and a plane. Level set tree A tree structure consisting of disconnected parts of a function’s level set. This is useful for visualizing mixed multivariate density functions, including their models and their shape characteristics. Fast marching method [FMM] A method used to reduce the amount of calculation when expressing closed contours using level sets. The search is only fast in a single direction. This is an update method when using a level set to continuously represent (track) closed contours. The level set uses the zero-crossing points of the feature function (embedding function) to define the contour curve. It adjusts the feature function instead of the contour curve to evolve to approximate and track the object of interest. In order to reduce the amount of calculation, only one small strip (front edge) of the current zero-crossing
946
25
(a)
(b)
(c)
Object Segmentation Methods
(d)
Fig. 25.12 Different segmentation results
attachment is updated at a time in the evolution. This method is called a fastmarching method. See geodesic active contour. Chan-Vese energy functional In image segmentation based on geometric deformable models using level sets, the piecewise constant minimum variance criterion based on Mumford-Shah func tional based on the region properties in the image is used. Letting the 2-D image to be segmented be f(x, y), and the evolved 0-level closed curve be L, then the ChenVese energy functional can be written as C ðL, mi , me Þ ¼ Ci ðL, mi , me Þ þ C e ðL, mi , me Þ R R ¼ insideðLÞ ½ f ðx, yÞ mi 2 dxdy þ outsideðLÞ ½ f ðx, yÞ me 2 dxdy Among them, the constants mi and me represent the internal average intensity and the external average intensity of the segmentation object, respectively. When the 0-level closed curve coincides with the object contour, C(L, mi, me) takes the minimum value (the object and the background are optimally separated in the sense of average intensity). See the segmentation results in Fig. 25.12. If the curve L is outside the object, then Ci(L, mi, me) > 0 and Ce(L, mi, me) 0, as shown in Fig. 25.12a; if the curve L is at the inside of object, then Ci(L, mi, me) 0 and Ce(L, mi, me) > 0, as shown in Fig. 25.12b; if the curve L crosses the object (some parts are inside the object and some parts are outside the object), then Ci(L, mi, me) > 0 and Ce(L, mi, me) > 0, as shown in Fig. 25.12c; if the curve L approaches the object contour, then Ci(L, mi, me) 0 and Ce(L, mi, me) 0, as shown in Fig. 25.12d. Chan-Vese functional Same as Chan-Vese energy functional. Mumford-Shah functional See Chan-Vese energy functional. Graph-based merging A method of image segmentation based on graph representation. Use the relative differences between regions to determine which regions need to be merged. First,
25.5
More Segmentation Techniques
Fig. 25.13 Example of minimal spanning tree
947
6
1
1
1
1 5
7
2
3
3
2
4
based on the dissimilarity measure d(e) of a pixel, the brightness difference in the N8 neighborhood is measured, and the image is initially segmented. For any region R, its internal difference is defined as the maximum edge weight in the region’s minimal spanning tree (MST): INTðRÞ ¼
min dðeÞ
e2MSTðRÞ
For any two adjacent regions Ri and Rj with at least one edge connecting their vertices, the difference between the regions is defined as the minimum edge weight connecting the two regions: DIF Ri , R j ¼
min
e¼ðvi , v j Þjvi 2Ri , v j 2R j
d ð eÞ
For any two adjacent regions, if the difference between the regions is less than the minimum of their respective internal differences, they are merged. Minimal spanning tree A term in graph theory. Consider a graph G and a subset T of arcs in G. All nodes in G are still connected in T, and there is only one connection path between any two nodes in T. At this time, T is a spanning tree. If each arc has a weight (which can be constant), then the smallest spanning tree is the tree with the smallest total weight. Figure 25.13 shows an example. The left figure is G and the right figure is the minimal spanning tree T. The numbers next to each arc represent weight values. Prim’s algorithm An algorithm to find the minimal spanning tree, in which the weights of edges are minimized, but each vertex is still included.
25.5
More Segmentation Techniques
There are also many special image segmentation techniques that use different ideas (Ritter and Wilson 2001; Shapiro and Stockman 2001; Szeliski 2010; Zhang 2007).
948
25
Object Segmentation Methods
Probabilistic aggregation An image segmentation method based on probability. First, calculate the grayscale similarity and texture similarity of the region. The grayscale similarity between the two regions Ri and Rj is based on the minimal external difference from other adjacent regions as þ þ Dþ local ¼ min d i , d j where di+ ¼ mink|dik|, dik is the difference between the respective average brightness of the regions Ri and Rj, and the average brightness difference between the areas Ri and Rj is defined as D local ¼
d i þdj 2
where di ¼ ∑k(likdik)/∑k(lik), where lik is the length of the boundary between the regions Ri and Rk. The pairwise statistics D+local and Dlocal are used to calculate the likelihood pij that the two regions Ri and Rj should merge. The texture similarity between the two regions Ri and Rj is defined using the relative difference between the histograms of the oriented Sobel filter response. Subsequent merging is performed in a hierarchical manner (e.g., an algorithm of segmentation by weighted aggregation can be adopted). Set the number of nodes in the thinnest layer as |F|, and then use |C| nodes in the thicker layer (C is a subset of nodes that satisfy strong coupling C ⊂ F), where strong coupling is defined as P
pij
j2C
P
pij
>T
j2F
Among them, T is the threshold, which is often taken as 0.2. At the coarser nodes, the weighted average is used to repeatedly calculate the brightness similarity and texture similarity, and the relative strength (coupling) between the coarser and finer nodes is based on their merge probability pij. After the coarser layer gets segmented, the membership of each pixel is obtained by spreading the assignment of the coarser layer to the thinner child layer. Minimal external difference See probabilistic aggregation. Winner-takes-all [WTA] A selection strategy or optimization strategy. Originally meant that the last winner of the competition obtained all the competitors and the loser lost all shares. But when there is only one competitor, it often means that the winner got it and the loser did not gain. For example, in the probability aggregation method of image segmentation,
25.5
More Segmentation Techniques
949
although each pixel can belong to different regions with different probabilities at the beginning, as a result of the final decision, each pixel can only belong to a single region. It can be said that the region gets all of the pixels’ probabilities. A strategy that selects only the best candidates (which can be solutions or algorithms) and absorbs all other candidates. It is commonly found in the neural network and learning literature. Particle counting Particle segmentation is used for statistical applications of the occurrence of small objects (particles). In an image or video, particles can be cells, water droplets, stones, etc. Medical image segmentation An image segmentation area has broad application prospects. Image segmentation can be used to obtain anatomical tissue or obtain information on physiological changes to facilitate subsequent quantitative analysis. The purpose of medical image segmentation mainly includes (1) identifying the target of interest, (2) studying the anatomical structure of the organ, and (3) measuring various parameters of the tissue. Medical images are difficult to segment. In addition to the different shapes of medical objects, complex structures, and many uncertainties, there are also many interference effects in imaging. For example, artifacts in CT imaging include motion artifacts, bar artifacts, ring artifacts, metal artifacts, etc.; MRI imaging also includes Gibbs artifacts, folding artifacts, gradient artifacts, and magnetically sensitive artifacts. Blob analysis A set of algorithms for analyzing medical images, where the analyzed area can be various blobs. The work is divided into four steps: (1) obtaining an optimized foreground/background threshold that can segment the object from the background; (2) use this threshold to make the binarization of the image; (3) region growing the object of interest and each connected pixel, and the formed blob (discrete object) is assigned a label; and (4) physical measurement is performed on the blob to obtain a measurement value. Blob extraction A step in blob analysis. See connected component labeling. 3-D motion segmentation An extension of motion segmentation, where the segmented motion information is derived from data in 3-D space. Superpixel segmentation Segment the image into a collection of superpixels. The segmentation results retain the image structure information and image local features on a coarser scale.
950
25
(a)
(b)
(c)
Object Segmentation Methods
(d)
Fig. 25.14 Segmentation using weighted aggregation
Clump splitting Split multiple objects from the cluster. In some categories of images, the object cluster can be detected or isolated. A typical example is the cluster of cells in a biological image. Relaxation segmentation A class of image segmentation methods based on relaxation technology. Atlas-based segmentation An image segmentation technique commonly used in medical images, especially for brain images. The automatic segmentation of the tissue is achieved by means of models of the brain structure and images rendered by human experts. Arterial tree segmentation A collective term for methods used to discover internal tubular structures in medical images. Example images include X-ray images, magnetic resonance images, angiographic images, and so on. Example trees include venous and bronchial systems. Segmentation by weighted aggregation [SWA] An aggregation method from coarse nodes to fine nodes in image segmentation. See Fig. 25.14. Figure (a) shows the original gray pixel grid. Figure (b) shows the coupling between pixels. Thicker lines represent stronger coupling. Figure (c) shows a roughened layer. In this case, the original gray pixels have a stronger coupling with the nodes of the coarser layer. Figure (d) shows the situation after two layers of coarsening (the thickest layer in this example). Objectness An attribute of the image region. Can help determine if a region is likely to be an object of interest. The object-likeness measurement can be used for object-likeness detection, and the purpose of object-likeness detection is to determine whether there is an object of interest in the image and where the object may appear. The goals here can and can be of any category. Object-like detection is different from object detection. Object detection is to determine where the object may appear under the condition that the object category is known. Object-like detection is also different from object recognition. Object recognition pays more attention to what kind of
object of interest is. Objectness detection can be used as a preprocessing step for object detection and object recognition, which can improve the computational efficiency of those tasks. Objectness detection can also be called region proposal: it focuses on using as few proposed regions (candidate windows) as possible to cover all possible locations of objects of interest or candidate recognition regions.
Region proposal See objectness.
Geometric heat flow An evolution-based image segmentation method in which the contour moves along its normal direction, in a manner analogous to heat flow. Since this method is similar to anisotropic diffusion, it can also be used for image smoothing and image enhancement.
Chapter 26 Segmentation Evaluation
There is no universal theory for image segmentation, so research is needed on how to evaluate image segmentation technology and its performance. The research on image segmentation has three levels (Zhang 1996). The first level is the study of segmentation techniques, the second level is the evaluation of image segmentation (it helps to grasp the performance of different segmentation algorithms), and the third level is the systematic comparison and characterization of evaluation methods and evaluation criteria.
26.1 Evaluation Scheme and Framework
The evaluation mechanism for image segmentation is mainly reflected in the evaluation framework (Zhang 1996, 2017b). Image segmentation evaluation Research on the performance of image segmentation algorithms. The purpose is to enhance and improve the performance of existing algorithms, optimize segmentation results, improve segmentation quality, and guide the research on new algorithms. Image segmentation evaluation can be divided into two cases: (1) performance characterization and (2) performance comparison. The contents of the two are related to each other. The performance characterization determines the direction for improving the algorithm by characterizing the performance of an algorithm, and the performance comparison improves the algorithm by comparing different algorithms and taking advantage of each other. The methods of image segmentation evaluation can be divided into three categories: (1) analytical method, (2) empirical goodness method, and (3) empirical discrepancy method. The latter two categories belong to the empirical method. See general scheme for segmentation and its evaluation.
Performance characterization An image segmentation evaluation work. The purpose is to grasp the performance of the evaluated algorithm in different segmentation situations, so that the parameters of the algorithm can be selected to meet the needs of segmenting images with different content and segmenting images collected under different conditions. This concept and idea can also be extended to the evaluation of other image techniques. More broadly, a set of techniques for evaluating the performance of computer vision systems. Common criteria include accuracy, precision, robustness to noise, repeatability, and reliability. Performance comparison An image segmentation evaluation work. The purpose is to compare the performance of different algorithms when segmenting a given image, to help select appropriate algorithms or improve existing algorithms in specific segmentation applications. This concept and idea can also be extended to the evaluation of other image techniques. Evaluation framework Framework for (experimentally) evaluating image segmentation algorithms. As shown in Fig. 26.1, it mainly includes three modules: (1) image generation, (2) algorithm test, and (3) performance judgment. On the one hand, the analysis purpose, evaluation requirements, image acquisition, and processing conditions and factors can be selectively incorporated into this framework, so it can be applied to various application fields. On the other hand, because only the result of image segmentation is needed when studying the segmentation algorithm and the internal structural characteristics of the algorithm are not required, it can be applied to all segmentation algorithms. Image generation 1. Techniques for generating images based on design data and requirements for specific applications. 2. One of the three modules in the general evaluation framework for experimental evaluation of image segmentation algorithms. In order to ensure the objectivity and versatility of the evaluation study, a synthetic image can be used to test the segmentation algorithm as a reference segmentation map. This not only has good objectivity and strong repeatability, but also results are stable. The key to
Fig. 26.1 Evaluation framework for segmentation algorithms
Fig. 26.2 Algorithm test flow diagram
generating synthetic images is that the generated images should reflect the objective world as much as possible, which requires knowledge of the application domain to be incorporated. The image generation process should be adjustable to suit the actual situation, such as changes in image content and different image acquisition conditions.
Algorithm test One of the three modules in the general evaluation framework for experimental evaluation of image segmentation algorithms; it is responsible for testing the segmentation algorithm to obtain the actual segmentation effect. It is a typical image analysis module, and its flowchart is shown in Fig. 26.2. It includes two successive steps: segmentation and measurement. In the segmentation phase, the algorithm under test is regarded as a "black box": the input to it is a test image (experimental image), and the output is a segmented image. In the measurement phase, the predetermined features are measured on the segmented objects to obtain the actual object feature values, and then these feature values are input into the "performance judgment" module for difference/discrepancy calculation.
Test procedure One of the three modules in the general evaluation framework for experimental evaluation of image segmentation algorithms. It is a typical image analysis module, including two successive steps: segmenting the image and measuring feature values.
Performance judgment One of the three modules in the general evaluation framework for experimental evaluation of image segmentation algorithms. The object features are selected according to the segmentation purpose and can guide the image generation, and the difference between the original feature values obtained from the original image and the measured feature values obtained from the segmented image is calculated to give an evaluation result.
General evaluation framework See evaluation framework.
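To make the three-module framework above more concrete, here is a minimal Python sketch (not from the handbook) in which `generate_image`, `segment`, and `measure` are hypothetical user-supplied callables and the performance judgment is reduced to an average absolute feature difference.

```python
import numpy as np

def evaluate_segmentation_algorithm(generate_image, segment, measure, n_trials=10):
    """Toy version of the image generation / algorithm test / performance judgment loop."""
    differences = []
    for _ in range(n_trials):
        image, original_feature = generate_image()   # image generation module
        segmented = segment(image)                   # algorithm test: segmentation step
        measured_feature = measure(segmented)        # algorithm test: measurement step
        # performance judgment: discrepancy between original and measured feature values
        differences.append(abs(original_feature - measured_feature))
    return float(np.mean(differences))
```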
Fig. 26.3 General scheme for segmentation and its evaluation
General scheme for segmentation and its evaluation A general framework model describing image segmentation and image segmentation evaluation. See Fig. 26.3, in which a narrow-sense basic image segmentation flowchart is given in the dashed-dotted box. Here, image segmentation is considered as the process of segmenting the image to be segmented with the segmentation algorithm to obtain the segmented image. A generalized image segmentation flowchart is shown in the dashed box in Fig. 26.3. Here, image segmentation is considered as a series of three steps: (1) preprocessing, (2) narrow-sense image segmentation, and (3) post-processing. In the flowchart of the general scheme, the image to be segmented is obtained by performing a certain preprocessing on the general input image, and the segmented image also needs a certain post-processing to become the final output image. The different action points and working modes of the three types of image segmentation evaluation methods (analytical method, empirical goodness method, and empirical discrepancy method) are also shown outside the dotted box in Fig. 26.3.
Expert system for segmentation An expert system for image segmentation that combines artificial intelligence technology, uses the evaluation results for inductive reasoning to achieve algorithm optimization, and links segmentation and evaluation. It can effectively use the evaluation results for inductive reasoning, so as to raise image segmentation from the current level of blind experimentation and improvement to a level of systematic selection and realization.
Hypothesis testing 1. Testing of an uncertain hypothesis. This can be done by inspecting all objects, but that is often not achievable in practical applications, so samples must be selected for testing. If the sample matches the hypothesis, accept it; otherwise, reject it. See RANSAC. 2. A knowledge-driven optimization process, used, for example, in an image segmentation expert system. First, hypothesize that an a priori optimal algorithm can be selected on the basis of prior image characteristic estimation or measurement; then, based on the results obtained by segmenting with the selected algorithm, calculate the posterior image characteristic estimates and obtain the corresponding posterior estimate of the optimal algorithm. If the prior hypothesis is correct, the posterior estimate should be consistent with the prior hypothesis. Otherwise, the posterior estimate can be used to update the prior estimate and trigger a new round of hypothesis testing, until the two meet certain consistency conditions. This feedback process gradually extracts and approximates information: algorithm selection is progressively optimized until it tends to the optimum and the optimal algorithm is selected. Because this feedback method is mainly driven by data to adjust the segmentation step itself, it is a bottom-up processing procedure and is therefore relatively fast and convenient. Compare try-feedback.
Try-feedback The entire process of making a judgment by experimenting, getting the results, experimenting again, getting new results, and so on, until the experimentation is completed. Compare hypothesis testing.
26.2 Evaluation Methods and Criteria
The evaluation of image segmentation can be carried out analytically or experimentally (Zhang 1996). The experimental (empirical) methods can be further divided into goodness-based methods and discrepancy-based methods (Zhang 2015d).
26.2.1 Analytical Methods and Criteria
Evaluation criteria In image segmentation evaluation, the principles and standards used to evaluate the performance of the algorithm. Commonly used are symmetric divergence, busyness measure, probability-weighted figure of merit, higher-order local entropy, normalized squared error, detection probability ratio, percentage area misclassified, fragmentation of image, pixel-class-partition error, pixel spatial
distribution, shape measure, modified figure of merit, noise-to-signal ratio, fraction of correctly segmented pixels, ultimate measurement accuracy, etc.
Objective criteria In image segmentation evaluation, evaluation criteria that can be obtained through analysis and calculation, whose results are not influenced by human-specific subjective cognition or knowledge.
Evaluation measure Same as evaluation criteria.
Evaluation metric Same as evaluation criteria.
Analytical method A set of methods for image segmentation evaluation. It acts only on the image segmentation algorithm itself and does not directly involve the segmentation application process or the segmented image. See general scheme for segmentation and its evaluation.
Analytical evaluation By analyzing the principle of the algorithm itself, the characteristics and performance of the algorithm are compared and characterized. Compare empirical evaluation.
Criteria for analytical method In image segmentation evaluation, evaluation criteria suitable for analyzing the segmentation algorithm itself. They can be qualitative or quantitative. Typical ones are a priori information incorporated, processing strategy, computation cost, detection probability ratio, resolution, etc.
A priori information incorporated In image segmentation, combining some of the characteristic information of the image to be segmented with prior knowledge of the practical application may improve the stability, reliability, and efficiency of the segmentation. The types and amounts of a priori information incorporated by different segmentation algorithms are different, so the pros and cons of these algorithms can be compared to a certain extent.
Processing strategy A criterion for analytical method in image segmentation evaluation. The image segmentation process can be implemented in serial, parallel, or iterative form, or a combination of the three, according to different strategies. The performance of the image segmentation algorithm is closely related to these processing strategies, so the characteristics of the algorithm can be grasped to a certain extent from its processing strategy.
Resolution of segmented image A criterion for analytical method in image segmentation evaluation. The segmented images obtained by different image segmentation algorithms can have
multiple resolutions, such as pixel level, several-pixel level, or a fraction of a pixel (i.e., sub-pixel) level. Different resolutions reflect the accuracy with which the algorithm can deliver segmentation results, so resolution can also be used as an effective indicator of the performance of the algorithm.
Detection probability ratio A criterion for analytical method in image segmentation evaluation. It is the ratio of the probability of correct detection to the probability of false detection. It was originally used to compare and study various edge detection operators. Given an edge mode, the correct detection probability Pc and the false detection probability Pf of an operator when detecting such edges can be calculated by the following two formulas:

P_c = \int_T^{\infty} P(t \mid \text{edge}) \, dt

P_f = \int_T^{\infty} P(t \mid \text{no edge}) \, dt
Among them, T is a given threshold, P(t|edge) represents the probability that a true edge is judged to be an edge, and P(t|no-edge) represents the probability that a non-edge is nevertheless judged to be an edge. For simple edge detection operators, the ratio of Pc to Pf can be obtained through analysis. A larger value indicates higher reliability of the operator when detecting the corresponding edge. Since many segmentation techniques use edge detection operators to help segment images, this criterion can be used as a quantitative criterion for evaluating the performance of these techniques.
Computation cost A criterion for analytical method in image segmentation evaluation. Each image segmentation algorithm is implemented by a series of arithmetic operations. The computational cost required to complete these operations is related to the complexity of the operations and the efficiency (and thus the speed) of the algorithm. Computational cost is also a quantitative criterion for measuring algorithm performance.
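A numerical sketch of the detection probability ratio defined above is given below. The Gaussian response distributions for edge and non-edge points are an assumption made only for illustration and are not part of the criterion itself; all names are hypothetical.

```python
from scipy.stats import norm

def detection_probability_ratio(T, mu_edge, sigma_edge, mu_noedge, sigma_noedge):
    """P_c / P_f for a threshold T, assuming Gaussian operator responses."""
    p_c = norm.sf(T, loc=mu_edge, scale=sigma_edge)      # P(t > T | edge)
    p_f = norm.sf(T, loc=mu_noedge, scale=sigma_noedge)  # P(t > T | no edge)
    return p_c / p_f

# Example: ratio = detection_probability_ratio(T=1.0, mu_edge=2.0, sigma_edge=0.5,
#                                              mu_noedge=0.0, sigma_noedge=0.5)
```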
26.2.2 Empirical Goodness Methods and Criteria
Empirical method In image segmentation evaluation, a method of indirectly judging the performance of a segmentation algorithm based on the quality of the segmented image. Specifically, the algorithm to be evaluated is used to segment the image, and then the
quality of the segmentation result is judged by certain quality measures, and the performance of the segmentation algorithm used is inferred accordingly. It can be further divided into two groups: one is the empirical goodness method, and the other is the empirical discrepancy method.
Empirical evaluation Actually comparing and characterizing the performance of different algorithms by using standardized public data and testing procedures. Compare analytical evaluation.
Empirical goodness method An image segmentation evaluation method in which evaluation is performed only by assessing the quality of the segmented image or output image. There is no need to consider the input of the segmentation process, so it can be regarded as a "blind" evaluation method. See general scheme for segmentation and its evaluation. Compare empirical discrepancy method.
Goodness A measure of the degree of fit between actual observations and theoretical expectations (often based on human intuition). In image segmentation evaluation, some goodness parameters are often used to describe the characteristics of the segmented image, and the performance of the segmentation algorithm is judged according to the goodness value.
Goodness criteria Judgment criteria used when evaluating the effect of image segmentation with the empirical goodness method. They often represent measures that are subjectively expected of an ideal segmentation result. Typical ones are inter-region contrast, intra-region uniformity, and shape measure.
Higher-order local entropy In the empirical goodness method of image segmentation evaluation, a goodness evaluation criterion. It measures the uniformity of the gray-value distribution in a region. For example, for an assumed threshold T, the second-order local entropy H2 of the object and background regions, which is to be maximized, can be calculated as follows:

H_2(T) = -\sum_{i=0}^{T} \sum_{j=0}^{T} p_{ij} \ln p_{ij}
Among them, pij is the co-occurrence probability of gray levels i and j in the object/background regions.
Inter-region contrast A criterion of empirical goodness method in image segmentation evaluation. Also called gray-level contrast. Given two adjacent regions in a grayscale image, if their respective average grayscales are f1 and f2, the grayscale contrast between them can be calculated as follows:
GC = \frac{|f_1 - f_2|}{f_1 + f_2}
Contrast across region Same as inter-region contrast.
Intra-region uniformity A criterion of empirical goodness method. Also called uniformity within region. Based on this criterion, a uniformity measure within the region can be defined.
Uniformity within region Same as intra-region uniformity.
Uniformity measure [UM] In image segmentation evaluation, a metric based on the intra-region uniformity. If Ri represents the i-th region of the segmented image f(x, y) and Ai represents its area, the uniformity measure inside the region can be expressed as

UM = 1 - \frac{1}{C} \sum_i \left\{ \sum_{(x,y) \in R_i} \left[ f(x,y) - \frac{1}{A_i} \sum_{(x,y) \in R_i} f(x,y) \right]^2 \right\}
where C is the normalization coefficient.
Shape measure [SM] In image segmentation evaluation, an empirical goodness method criterion. If fN(x, y) represents the average gray level in the neighborhood N(x, y) of the pixel (x, y), and g(x, y) represents the gradient at the pixel (x, y), then the shape measure obtained by thresholding the image with a predetermined threshold T can be calculated using the following formula:

SM = \frac{1}{C} \left\{ \sum_{x,y} u[f(x,y) - f_N(x,y)] \, g(x,y) \, u[f(x,y) - T] \right\}
Among them, C is a normalization coefficient, and u() represents a unit step function. This criterion is only related to the roughness of the object boundary and can only be used to evaluate the thresholding method.
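The goodness criteria defined above (inter-region contrast and the uniformity measure) can be computed directly from a gray-level image and a label map. The sketch below is illustrative only and is not from the handbook; the normalization coefficient C is assumed to be supplied by the caller.

```python
import numpy as np

def inter_region_contrast(f, labels, r1, r2):
    """GC = |f1 - f2| / (f1 + f2) for two adjacent regions labeled r1 and r2."""
    f1 = f[labels == r1].mean()
    f2 = f[labels == r2].mean()
    return abs(f1 - f2) / (f1 + f2)

def uniformity_measure(f, labels, C):
    """UM = 1 - (1/C) * sum over regions of the squared deviation from the region mean."""
    total = 0.0
    for r in np.unique(labels):
        region = f[labels == r]
        total += ((region - region.mean()) ** 2).sum()
    return 1.0 - total / C
```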
26.2.3 Empirical Discrepancy Methods and Criteria
Empirical discrepancy method An image segmentation evaluation method that evaluates a segmentation algorithm by comparing a reference image, obtained from the input image or the image
to be segmented, with the segmented image or the output image. Here both the input and the output of the segmentation process should be considered. See general scheme for segmentation and its evaluation. Compare empirical goodness method.
Discrepancy criteria In the empirical discrepancy method of image segmentation evaluation, evaluation criteria used to compare the difference between a segmented image and a reference image. They can quantitatively measure the segmentation results, and this measure is objective. Common criteria include pixel distance error, pixel number error, object count agreement, ultimate measurement accuracy, etc.
Pixel distance error A criterion for empirical discrepancy method in image segmentation evaluation. The distance between each erroneously segmented pixel and the correct region to which it belongs is considered. Based on this criterion, multiple distance measures can be defined; commonly used are (1) the figure of merit, (2) the mean absolute value of the deviation, and (3) the normalized distance measure.
Figure of merit [FOM] Any scalar used to characterize the performance of an algorithm. 1. Figure of merit: a distance measure based on the pixel distance error criterion. Letting N be the number of pixels that are incorrectly classified, and d(i) be the distance between the i-th incorrectly classified pixel and its correct position, then the figure of merit can be expressed as

FOM = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{1 + p \, d^2(i)}
where p is a proportionality factor. More generally, the figure of merit also often refers to any scalar used to characterize the performance of an algorithm. 2. Goodness coefficient: any function or coefficient based on goodness parameters. In image segmentation, it refers to a function used to select the segmentation threshold, defined by combining the balance measure and the busyness measure with the concavity measure. The value of the function is proportional to the balance measure and the concavity measure and is inversely proportional to the busyness measure.
Modified figure of merit [MFOM] In image segmentation evaluation, an evaluation criterion belonging to the category of pixel distance error. An improvement on the figure of merit so that it can be used to evaluate near-perfect segmentation results. Letting N be the number of pixels that are incorrectly segmented, and d(i) be the distance between the i-th incorrectly classified pixel and its correct position, then the modified figure of merit can be expressed as
FOM_m = \begin{cases} \dfrac{1}{N} \sum\limits_{i=1}^{N} \dfrac{1}{1 + p \, d^2(i)} & \text{if } N > 0 \\ 1 & \text{if } N = 0 \end{cases}

where p is a proportionality factor.
Probability-weighted figure of merit A discrepancy evaluation criterion used in the empirical discrepancy method for image segmentation evaluation. When the segmentation result is not perfect, it measures the distances between the pixels that are incorrectly classified into wrong regions and the correct regions to which they should belong. Specifically, the figure of merit is reweighted with the probabilities of the pixel classes.
Mean absolute value of the deviation [MAVD] A distance measure based on the pixel distance error criterion. Letting N be the number of misclassified pixels, and d(i) be the distance between the i-th misclassified pixel and its correct position, then the mean absolute value of the deviation can be expressed as

MAVD = \frac{1}{N} \sum_{i=1}^{N} |d(i)|

Normalized distance measure [NDM] In image segmentation evaluation, a distance measure based on the pixel distance error criterion. Letting N be the number of misclassified pixels, d(i) be the distance between the i-th misclassified pixel and its correct position, and A be the area of the image or object, the distance measure normalized by the object area can be expressed as

NDM = \frac{\sqrt{\sum_{i=1}^{N} d^2(i)}}{A} \times 100\%
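A minimal sketch of the pixel distance error measures defined above follows; it is not from the handbook. The array d of distances, the area A, and the scale factor p are assumed inputs, and returning 1 when N = 0 follows the modified figure of merit.

```python
import numpy as np

def pixel_distance_error_measures(d, area, p=1.0):
    """FOM, MAVD and NDM from the distances d(i) of misclassified pixels."""
    d = np.asarray(d, dtype=float)
    N = d.size
    fom = 1.0 if N == 0 else float(np.mean(1.0 / (1.0 + p * d ** 2)))  # (modified) figure of merit
    mavd = 0.0 if N == 0 else float(np.mean(np.abs(d)))                # mean absolute deviation
    ndm = float(np.sqrt(np.sum(d ** 2)) / area * 100.0)                # normalized distance measure
    return fom, mavd, ndm
```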
Pixel spatial distribution In image segmentation evaluation, an evaluation criterion belonging to the category of pixel distance error. The distance of each misclassified pixel to the nearest pixel of its correct category is considered.
Object count agreement [OCA] In image segmentation evaluation, a criterion of empirical discrepancy method. Let S_i be the number of i-th class objects obtained by segmenting an image, T_i be the number of i-th class objects that actually exist in the image, and N be the total number of object categories:

OCA = \int_{L}^{\infty} \frac{2^{-M/2}}{\Gamma(M/2)} \, t^{(M-2)/2} \exp(-t/2) \, dt

where M = N - 1 and Γ(·) is the gamma function; L is calculated by

L = \sum_{i=1}^{N} \frac{(S_i - T_i)^2}{c \, T_i}
where c is the correction coefficient.
Ultimate measurement accuracy [UMA] An empirical discrepancy method criterion in image segmentation evaluation. It considers the final goal of the image segmentation operation and also depends on the characteristics of the object measured by image analysis. If a set of object features is used, a set of ultimate measurement accuracy values can be defined. According to different calculation methods, absolute ultimate measurement accuracy and relative ultimate measurement accuracy can be calculated separately.
Absolute ultimate measurement accuracy [absolute UMA, AUMA] A criterion for calculating the ultimate measurement accuracy in image segmentation evaluation. If Rf is the original feature value obtained from the reference image and Sf is the actual feature value obtained from the segmented image, the absolute difference between the two gives the absolute ultimate measurement accuracy:

AUMA_f = | R_f - S_f |

Relative ultimate measurement accuracy [relative UMA, RUMA] A criterion for the calculation of the ultimate measurement accuracy. If Rf is the original feature value obtained from the reference image and Sf is the actual feature value obtained from the segmented image, their relative difference gives the relative ultimate measurement accuracy:

RUMA_f = \frac{| R_f - S_f |}{R_f} \times 100\%
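The two UMA criteria above reduce to one line of arithmetic each; the following sketch (names hypothetical, not from the handbook) simply mirrors the formulas.

```python
def ultimate_measurement_accuracy(R_f, S_f):
    """AUMA = |R_f - S_f|;  RUMA = |R_f - S_f| / R_f * 100%."""
    auma = abs(R_f - S_f)
    ruma = auma / R_f * 100.0
    return auma, ruma

# Example: a reference object area of 100 pixels measured as 92 pixels
# after segmentation gives AUMA = 8 and RUMA = 8%.
```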
Fragmentation of image [FI] A difference evaluation criterion used in the empirical discrepancy method of image segmentation evaluation. Letting Sn be the number of objects obtained by segmenting an image, and Tn be the number of objects that actually exist in the image, then the fragmentation of images can be expressed as
FI = \frac{1}{1 + p \, | T_n - S_n |^q}
Among them, p and q are scale parameters. The evaluation criterion measures the difference between the actual number of objects in the image to be segmented and the number of objects obtained due to incomplete segmentation results.
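As a small illustration of the fragmentation-of-image criterion just defined (not from the handbook), the values chosen for the scale parameters p and q below are purely illustrative.

```python
def fragmentation_of_image(T_n, S_n, p=1.0, q=2.0):
    """FI = 1 / (1 + p * |T_n - S_n|**q); p and q are (illustrative) scale parameters."""
    return 1.0 / (1.0 + p * abs(T_n - S_n) ** q)
```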
26.2.4 Empirical Discrepancy of Pixel Numbers
Pixel number error In image segmentation evaluation, a criterion used in the empirical discrepancy method. It considers the number of pixels that were incorrectly segmented. If image segmentation is considered as a classification, this criterion can also be called the number of pixels misclassified. Based on this criterion, multiple quantitative measures can be defined; commonly used ones include (1) classification error and (2) probability of error.
Percentage area misclassified In image segmentation evaluation, an evaluation criterion that belongs to the category of pixel number error. It refers to the proportion of the area occupied by the incorrectly segmented pixels in the image region. Same as number of pixels misclassified.
Number of pixels misclassified See pixel number error.
Fraction of correctly segmented pixels In image segmentation evaluation, an evaluation criterion that belongs to the category of pixel number error. It is the maximum likelihood estimate of the percentage of object pixels that are correctly segmented, that is, the ratio of the number of object pixels actually segmented correctly to the number of real object pixels.
Pixel-class-partition error In image segmentation evaluation, an evaluation criterion that belongs to the category of pixel number error. The numbers of different types of mis-segmented pixels are considered.
Classification error A quantitative measure based on the number of pixels in image segmentation evaluation or object classification. Assuming that there are N types of pixels in the image, an N × N matrix C can be constructed, where the element Cij represents the number of pixels of type j assigned to type i. There are two classification error expressions:
M_I^{(k)} = 100 \, \frac{\sum_{i=1}^{N} C_{ik} - C_{kk}}{\sum_{i=1}^{N} C_{ik}}

M_{II}^{(k)} = 100 \, \frac{\sum_{i=1}^{N} C_{ki} - C_{kk}}{\sum_{i=1}^{N} \sum_{j=1}^{N} C_{ij} - \sum_{i=1}^{N} C_{ik}}
Among them, M_I^(k) gives the ratio of pixels belonging to the k-th category that are not classified into the k-th category, and M_II^(k) gives the ratio of pixels of other categories classified into the k-th category.
Noise-to-signal ratio [NSR] In image segmentation evaluation, an evaluation criterion that belongs to the category of pixel number error. In edge detection, it is defined as the ratio of the number of detected boundary points that do not match the ideal boundary to the number of detected boundary points that match the ideal boundary.
Symmetric divergence [SD] In image segmentation evaluation, an evaluation criterion that belongs to the category of pixel number error. Assuming that the segmented image has N regions, use p_i^s to represent the posterior probability of a pixel being in the i-th region of the segmented image and p_i^r to represent the posterior probability of a pixel being in the i-th region of the reference image. It can be expressed as

SD = \sum_{i=1}^{N} \left( p_i^s - p_i^r \right) \ln \frac{p_i^s}{p_i^r}
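The classification error expressions and the symmetric divergence above can be evaluated directly from a confusion matrix and region probabilities. The following sketch is illustrative only (not from the handbook) and follows the matrix convention stated above.

```python
import numpy as np

def classification_errors(C, k):
    """M_I and M_II for class k; C[i, j] = number of pixels of type j assigned to type i."""
    C = np.asarray(C, dtype=float)
    col_k = C[:, k].sum()          # all pixels that truly belong to class k
    row_k = C[k, :].sum()          # all pixels assigned to class k
    m_1 = 100.0 * (col_k - C[k, k]) / col_k
    m_2 = 100.0 * (row_k - C[k, k]) / (C.sum() - col_k)
    return m_1, m_2

def symmetric_divergence(p_s, p_r):
    """SD = sum_i (p_i^s - p_i^r) * ln(p_i^s / p_i^r)."""
    p_s = np.asarray(p_s, dtype=float)
    p_r = np.asarray(p_r, dtype=float)
    return float(np.sum((p_s - p_r) * np.log(p_s / p_r)))
```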
Normalized squared error A measure used when comparing edge-preserving smoothing techniques with corresponding filters. First, assume that the noise-free image (which can be obtained by synthesis) is I1; convolve it with the eight 3 × 3 compass gradient operators and threshold the convolution result to obtain a binary image, in which low-value pixels correspond to the uniform regions of the original image and high-value pixels correspond to the boundary regions between object and background. Let the image obtained by adding noise to I1 be I2, and the image obtained by smoothing filtering (using an averaging filter, a median filter, or the like to be compared) be I3. The normalized squared error is calculated by

e_i = \frac{\sum_{S_i} [I_3 - I_1]^2}{\sum_{S_i} [I_2 - I_1]^2}, \quad i \in \{l, h\}
Among them, Sl represents a low-value pixel region, and Sh represents a high-value pixel region. One of the advantages of this criterion is that it is independent of the specific grayscale of the image. If the noise is completely eliminated by filtering, the value of ei is 0; if the filtering is completely ineffective, the value of ei is 1; if the filtering gives a worse result, the value of ei will be greater than 1. If it is to be used for image segmentation evaluation, different segmentation algorithms can be used to segment image I2, and the result obtained is image I3, and the normalized squared error ei reflects the quality of the segmentation result. It can be seen that this is an evaluation criterion that belongs to the category of pixel number error.
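A minimal sketch of the normalized squared error (not from the handbook); I1, I2, I3 are assumed to be NumPy arrays of identical shape and the boolean mask selects either the low-value region S_l or the high-value region S_h.

```python
import numpy as np

def normalized_squared_error(I1, I2, I3, mask):
    """e_i = sum((I3 - I1)^2) / sum((I2 - I1)^2) over the pixels selected by mask."""
    num = ((I3[mask] - I1[mask]) ** 2).sum()
    den = ((I2[mask] - I1[mask]) ** 2).sum()
    return float(num / den)
```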
26.3 Systematic Comparison and Characterization
A higher level of work is to compare the performance of segmentation evaluation methods and evaluation criteria, which is also critical for evaluating segmentation algorithms to improve segmentation quality (Zhang 1996; Zhang 2015d).
Systematic comparison and characterization Research on the various methods of image segmentation evaluation, that is, the evaluation of segmentation evaluation criteria and methods.
Requirement for reference image In image segmentation evaluation, a criterion for systematic comparison and characterization. For some evaluation methods, the evaluation conclusion depends on the comparison between the segmented image and a reference image. However, the acquisition of reference images increases the workload of segmentation evaluation and brings some specific problems to the applicability of the evaluation method. For example, in medical image segmentation work, when an actual image is used as the reference image, it needs to be manually segmented by a domain expert, which is subjective.
Complexity of evaluation In image segmentation evaluation, a class of criteria applicable to systematic comparison and characterization. Whether an evaluation method is practical has a lot to do with the complexity of its implementation, or with the processing methods and workload required for evaluation. This can also be used to evaluate other image technologies.
Generality of evaluation In image segmentation evaluation, a criterion applicable to systematic comparison and characterization. Considering that there are many types of segmentation algorithms, the methods and criteria used to evaluate these algorithms should be universal; that is, the methods of segmentation evaluation should be applicable to the study of segmentation algorithms of various principles and types.
Subjectivity and objectivity of evaluation In image segmentation evaluation, a criterion applicable to systematic comparison and characterization. Behind each evaluation method there are often some subjective and/or objective considerations, or the evaluation criteria used are determined based on specific subjective and/or objective factors.
Chapter 27 Object Representation
Object representation builds on image segmentation and object extraction and adopts a suitable data structure to represent the object (Gonzalez and Woods 2018; Samet 2001; Zhang 2017b).
27.1 Object Representation Methods
There are two ways to express objects in image space, either based on boundaries or on regions (Zhang 2009).
27.1.1 Object Representation
Object representation The choice of a certain data structure to represent the objects in the image; coding an object into a form suitable for computer operation. Commonly used models include geometric models, graph models, and appearance models. Generally, a suitable representation form different from the original image is used to represent the object. A good representation method should have the advantages of saving storage space and allowing easy feature calculation. According to the representation strategy, object representation can be divided into internal representation and external representation. According to the technology used, object representation can be divided into three categories: (1) boundary-based representation, (2) region-based representation, and (3) transform-based representation.
Representation A description of object characteristics. Shape representation is a classic example, where the representation comprises a set of techniques that describe the geometry of a 2-D or 3-D object. Representation techniques can be classified as symbolic (symbolic object representation) or non-symbolic (non-symbolic object representation). More generally, it refers to a formal system (such as the Arabic number system or the binary number system) that can clearly express certain entities or certain types of information, together with several rules that explain how the system works. According to Marr's visual computational theory, representation is the key to visual information processing. A basic theoretical framework for computer vision information understanding research is mainly composed of the three-level representation structure of the visible world established, maintained, and interpreted by visual processing. See Koenderink's surface shape classification.
Boundary-based representation An object representation method that uses a series of pixels belonging to the object boundary to express the boundary, also known as contour-based representation. Currently, the methods can be divided into three groups: 1. Use the set of boundary points. The boundary line of the object is represented as a set of boundary points, and there may be no order between the points. A typical example is using a set of landmark points. 2. Construct parametric boundaries. The boundary line of the object is represented as a parametric curve, and the points on it have a certain order. Typical examples include boundary segments, polygonal approximation, and boundary signatures. 3. Approximation by curves. A combination of geometric primitives (such as straight line segments or splines) is used to approximate the boundary line of the object. Commonly used geometric primitives are polygons composed of straight line segments.
External representation An object representation strategy using the set of pixels that make up the boundary of the object region. It can reflect the shape of the object region.
Contour-based representation Same as boundary-based representation.
Region-based representation An object representation method that directly expresses regions. The methods used can be divided into three categories: (1) region decomposition, (2) surrounding region, and (3) internal features.
Area-based Image manipulation that can be applied to areas in an image.
Region-based signature The result obtained by performing a certain projection on all pixels in the object region. As with boundary signatures, the problem of describing 2-D shapes is also transformed into a 1-D function. Internal representation An object representation strategy using the set of pixels that make up the object region. It can reflect the reflective properties of the object region. Region representation An object representation that expresses the object region based on all pixels (including internal points and boundary points) that belong to the object region. It needs to be able to express the characteristics defined by the image region. For shape encoding, see axial representation, convex hull, graph model, quadtree, runlength coding, and skeletonization. To record the characteristics of the region, see moments, curvature scale space, Fourier shape descriptor, wavelet descriptor, and shape representation. Transform-based representation A method of object representation. Use a certain transform (such as Fourier transform or wavelet transform) to transform the object from the image domain to the transform domain, and use the transformation parameters or coefficients to represent the object. Fourier boundary descriptor An object shape descriptor based on the Fourier transform coefficients of the object contour. See Fourier transform representation. Shape unrolling Same as Fourier descriptor. Part-based representation A model that treats an object as a combination of a set of components. For example, a person can be thought of as consisting of a head, a body, two arms, and two legs. Template-based representation A model form or object representation form, where the representation includes some image-based modeling forms, such as an image of an object, surrounding the edges of the object of interest, and a process of generating a template, such as principal component representation. Symbolic object representation Use symbolic nouns for image entities, such as lists of edges, corners, planes, faces, etc., instead of pixels on the object itself to represent the object. The representation can include the shape and position of the object. Symbolic Reasoning or calculation described by a set of symbols rather than a signal. Digital images are discrete representations of continuous functions, while symbols are
discrete in nature. For example, you can turn an image (signal) into a list (symbol) of the names of the people who appear in it.
Non-symbolic representation A model representation in which the appearance is described with a numerical or image-based description, rather than a symbolic or mathematical description. For example, a non-symbolic representation of a line may be a set of coordinates of points on the line or an image of a straight line, rather than the symbol or equation of the line. Compare symbolic object representation.
Logical object representation A method of object representation based on logical forms such as predicate calculus. For example, a square can be defined as

square(s) ⇔ polygon(s) ∧ number_of_sides(s, 4) ∧ ∀e1 ∀e2 (e1 ≠ e2 ∧ side_of(s, e1) ∧ side_of(s, e2)) ∧ length(e1) = length(e2) ∧ (parallel(e1, e2) ∨ perpendicular(e1, e2))

Temporal representation A model representation that records dynamic information about how the shape or position of an object changes over time.
27.1.2 Spline {8}
Spline 1. A curve defined as the weighted sum of control points, c(t) = \sum_{i=0}^{n} W_i(t) p_i, where p_0, ..., p_n are control points and a weighting function W_i is defined for each control point. The curve can interpolate the control points or be fitted to them. The spline construction guarantees continuity and smoothness. For a uniform spline, the weight functions of the points are translated copies of each other, so W_i(t) = W_0(t - i). The form of W_0 determines the type of spline: for B-splines and Bezier curves, W_0(t) is a polynomial (typically cubic) in t. A non-uniform spline reparameterizes the t-axis, c(t) = c(u(t)), where u(t) maps the integers k = 0, ..., n to the knots t_0, ..., t_n, and non-integer values of t are handled by linear interpolation. Rational splines with N-D control points are perspective projections of ordinary splines with (N + 1)-D control points. 2. Unitary tensor product splines define a 3-D surface x(u, v) as the product of splines in u and v.
B-spline A spline approximation that can be expressed as a combination of basis functions, \sum_{i=0}^{m} a_i B_i(x), where B_i is a basis function and a_i is a control point. B-splines do not need to pass through any control point, but if B-splines are calculated for
adjacent control points, the curve segments will combine to produce a continuous curve.
B-spline fitting Fitting a B-spline to a set of data points. It is useful for producing a compact model of observed curves or for noise reduction.
Non-uniform rational B-splines [NURBS] A shape modeling primitive based on ratios of B-splines. It can accurately express a wide range of object geometries, including surfaces of any form.
Spline smoothing The smoothing of a discretely sampled signal x(t); specifically, the value at a certain t is replaced by the value of a spline fitted to the neighborhood values there.
Cubic spline A spline whose weighting functions are cubic polynomials.
Bicubic spline interpolation A special case of surface interpolation, using bicubic spline functions in 2-D. This is similar to bilinear surface interpolation, except that the interpolation surface is now curved rather than flat.
Controlled-continuity splines In a thin plate spline under tension, if the differential terms in the two smoothing functions are multiplied by their respective weight functions, a new spline is obtained. See norm of smooth solution.
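As a hedged illustration of B-spline fitting (not from the handbook), the sketch below fits a cubic parametric B-spline to a few hypothetical 2-D points with SciPy; the data values and the smoothing factor s are assumptions chosen only for the example.

```python
import numpy as np
from scipy import interpolate

# Hypothetical noisy 2-D data points sampled along a curve
points = np.array([[0, 0], [1, 2], [3, 3], [5, 2], [6, 0]], dtype=float)

# B-spline fitting: fit a cubic parametric B-spline; s controls the smoothing trade-off
tck, u = interpolate.splprep([points[:, 0], points[:, 1]], k=3, s=0.5)

# Evaluate the fitted spline densely, i.e., c(t) as a weighted sum of basis functions
t = np.linspace(0.0, 1.0, 100)
x, y = interpolate.splev(t, tck)
```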
27.2 Boundary-Based Representation
Segmenting the image with a boundary-based segmentation method yields a series of pixel points along the boundary of the object, which constitute the outline of the object (Russ and Neal 2016). Boundary-based representation expresses the object based on these boundary points (Costa and Cesar 2001).
27.2.1 Boundary Representation
Boundary representation An object representation method that expresses the object region by means of the boundary points/contour points of the object region. See boundary description.
Object boundary The set of pixels between an object region and its neighboring regions in the image. The term generally emphasizes the separation between regions. Compare object contour.
Simple boundary A closed curve surrounding an object region that is neither self-intersecting nor self-tangent. The pixels within a simple boundary constitute a connected region (connected component) without holes.
Object contour The set of boundary pixels around the object region; the term generally emphasizes its closedness. See occluding contour.
Empty interior set A set that contains only points on the boundary and no internal points.
Contour representation See boundary representation.
Polyline A piecewise linear outline. If closed, it is a polygon. See polycurve, contour analysis, and contour representation.
Contourlet A contour expression method using multi-resolution and multi-directional multi-filter bank techniques.
Tangent angle function The function θ(t) = arctan[y(t)/x(t)] for a given parametric curve [x(t), y(t)].
Implicit curve A curve defined by a functional form f(x) = 0, that is, the set of points S = {x | f(x) = 0}.
Contour partitioning Segmentation along the contour for effective representation and description of the object contour. Generally, points with larger curvature on the contour curve are selected as partition points, so that the resulting curve segments are easier to express and describe with straight line segments or low-order curve segments. See curve segmentation.
Landmark point A method to approximate the contour of the object region with a set of boundary points. These points often have distinct characteristics, or they control the shape of the contour. These points (coordinates) are usually represented in the form of vectors, arrays, sets, matrices, and so on. Also refers to feature points used for image registration.
Planar-by-vector One of the ways of arranging or combining landmark points. A set of 2-D vectors is used to represent the planar object, where each vector represents the coordinates of one landmark point. Further, the n vectors can be sequentially put into an n × 2 matrix.
Planar-by-complex One of the ways of arranging or combining landmark points. A set of complex values is used to represent a planar object, where each complex value represents the coordinates of one landmark point, so that the set of all n landmark points is sequentially placed into an n × 1 vector.
Free landmark One of the ways of arranging or combining landmark points, in which the order of the landmark points need not be considered. For an object S with n coordinate points, its free landmark points are represented by the 2n-element set

S_f = \{ S_{x,1}, S_{y,1}, S_{x,2}, S_{y,2}, \ldots, S_{x,n}, S_{y,n} \}

Compare ordered landmark.
Ordered landmark A way of arranging or combining landmark points. The coordinates of each landmark point are selected, and the landmark points are grouped into a vector starting from a fixed reference landmark point. For an object S with n coordinate points, the ordered landmark points are represented by a 2n-vector:

S_o = [ S_{x,1}, S_{y,1}, S_{x,2}, S_{y,2}, \ldots, S_{x,n}, S_{y,n} ]^T

Compare free landmark.
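The different landmark arrangements above are simple rearrangements of the same coordinates. The sketch below (not from the handbook) uses a few hypothetical landmark points to show the planar-by-vector, planar-by-complex, ordered, and free forms.

```python
import numpy as np

# Hypothetical landmark points of a planar object, one (x, y) pair per point
landmarks = np.array([[3, 1], [5, 4], [2, 6], [0, 3]], dtype=float)

planar_by_vector = landmarks                                 # n x 2 matrix form
planar_by_complex = landmarks[:, 0] + 1j * landmarks[:, 1]   # one complex value per point
ordered_landmarks = landmarks.reshape(-1)                    # 2n-vector (x1, y1, x2, y2, ...)
free_landmarks = set(map(tuple, landmarks))                  # unordered set of coordinate pairs
```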
27.2.2 Boundary Signature
Signature Identification characteristics of people or things. In the context of boundary representation, it refers to a method of expressing a boundary with a 1-D functional, also called a boundary signature. In a broader sense, a signature can be produced by a generalized projection. Here the projection can be horizontal, vertical, or diagonal, or even radial, rotating, etc.
Boundary signature A boundary-based representation of a 2-D boundary as a 1-D function. Also called contour signature. In a broader sense, signatures can be produced by a generalized projection. Projections can be horizontal, vertical, diagonal, or even radial and rotating. One thing to note is that projection is not an information-preserving transformation: transforming the region boundary in the 2-D plane to a 1-D curve may lose information. There are many ways to generate boundary signatures, such as the function of distance-versus-angle, the function of distance-versus-arc-length, the slope density function, and the ψ-s curve (function of contingence-angle-versus-arc-length). But
Fig. 27.1 Two signatures of the tangents as a function of arc length
no matter what method is used to generate the signature, the basic idea is to represent the 2-D boundary in the form of a more easily described 1-D function. Compare region signatures.
Contour signature Same as boundary signature.
Signature curve Consider a smooth planar curve (such as the boundary of a region obtained by region segmentation) and parameterize it with the arc length s. Letting K(s) be the curvature function of the curve and K′(s) be its derivative with respect to the arc length, then [K(s), K′(s)] gives the Euclidean signature curve. Using the affine curvature and affine arc length, this can be generalized to the affine signature curve. For object recognition of a planar shape, a signature curve can be a useful shape descriptor.
ψ-s curve A boundary signature. It is obtained by first following the boundary around the object, making a tangent at each position, and then taking the angle ψ between the tangent and a reference direction (such as the horizontal coordinate axis) as a function of the boundary length traversed from a given reference point (the arc length s). Figure 27.1a, b shows the signatures obtained for circular and square objects, respectively, where A is the radius of the circle or the half-side length of the square. It can be seen from Fig. 27.1 that a slanted straight line segment in the ψ-s curve corresponds to a circular arc segment on the boundary (ψ changes at a constant rate), and a horizontal straight line segment in the ψ-s curve corresponds to a straight line segment on the boundary (ψ is unchanged). In Figure 27.1b, the four horizontal straight segments of ψ correspond to the four sides of the square object. It is also a technique for representing planar contours. As shown in Fig. 27.2, each point P on the outline is represented by an angle ψ and a distance s, where the angle ψ is determined by the straight line passing through the point P and the object center point O and a fixed reference direction (commonly the X axis), and the distance s is the distance from P to O. See shape representation.
Function of distance-versus-arc-length A kind of boundary signature that first finds the center of gravity of a given object and then uses the distance between each boundary point and the center of gravity as a function of the position of that point along the boundary point sequence. This signature can be
Fig. 27.2 ψ-s curve
Fig. 27.3 Two signatures of the function of distance-versus-arc-length
Fig. 27.4 Two signatures of the function of distance-versus-angle
made by starting from a point and gradually going around the object along the boundary. Figure 27.3a, b shows the signatures obtained for circular and square objects, respectively. For the circular object in Figure 27.3a, the distance r from the center of gravity is constant; for the square object in Figure 27.3b, the distance r varies periodically with s. This boundary signature is equivalent to the function of distance-versus-angle.
R-S curve Same as function of distance-versus-arc-length. Can be used to compare contours invariantly to rotation. See contour and shape representation.
Function of distance-versus-angle A kind of boundary signature that first calculates the center of gravity of a given object and then uses the distance between each boundary point and the center of gravity as a function of the angle between the line connecting the boundary point to the center and the X axis (reference direction). Figure 27.4a, b shows the signatures obtained for circular and square objects, respectively. Among them, in
Fig. 27.5 Two signatures of the radius vector function
Figure (a), the length r of the line to the center is constant, and in Figure (b), r = A sec θ. This signature is not affected by object translation but changes as the object rotates or scales. The effect of scaling is a change in the amplitude of the signature; this problem can be solved by normalizing the maximum amplitude to a unit value. There are several ways to address the effect of rotation. If a starting point for generating the signature can be specified that does not change as the object's orientation changes, the effect of rotation can be eliminated. For example, the point farthest from the center of gravity may be selected as the starting point for generating the signature; if there is only one such point, the obtained signature is independent of the object orientation. A more robust method is to first obtain the equivalent ellipse of inertia of the region and then take the farthest point on its long axis as the starting point for generating the signature. Since the equivalent ellipse of inertia is determined with the help of all the pixels in the region, this is computationally more expensive but reliable. This boundary signature is equivalent to the function of distance-versus-arc-length.
Radius vector function A contour representation or boundary representation, similar to the function of distance-versus-angle. First calculate a center point c of the object (which can be the center of gravity or another point of physical significance), and then record the distance r(θ) from point c to the contour point (a vector) as a function of θ (the angle between the connecting line and some reference direction). This representation will be ambiguous if the vector has more than one intersection with the boundary. Figure 27.5a, b shows the results obtained with the radius vector function for circular and square objects, respectively.
Function of contingence-angle-versus-arc-length Same as ψ-s curve.
Slope density function A boundary signature that represents an object: a histogram of the tangent (slope) angles of a curve or region boundary. It can be used to represent a curve with invariance to translation and rotation (the density function distribution is not affected by translation and is only shifted by rotation).
Fig. 27.6 Two signatures of slope density function
In actual calculation, first obtain the angle between the tangent of each point along the object boundary and a reference direction (such as the horizontal axis of the coordinate), and then make a histogram of the tangent angle to obtain the slope density function. Among them, the tangent angle corresponds to the slope, and the histogram gives the density distribution. Because the histogram is a measure of the concentration of values, the slope density function will give a stronger response to the boundary segment with a constant tangent angle, while the boundary segment that has a faster change in the tangent angle corresponds to a deeper valley. Figure 27.6a, b shows the obtained signatures for circular and square objects, respectively. For circular object, the signatures of the slope density function and the signatures of the function of distance-versus-angle have the same form. However, for square object, the signatures of the slope density function and the signature of the function of distance-versus-angle have very different forms. The four histogram bars (bins, i.e., vertical line segments) of the histogram h(θ) correspond to the four sides of the square, respectively. This signature gives a histogram of the tangent angle of the object, which can be seen as the result of projecting the ψ-s curve onto the ψ axis.
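The signatures described in this subsection can be computed from an ordered list of contour points. The sketch below (not from the handbook) computes a centroid-distance (distance-versus-angle) signature and a tangent-angle histogram approximating the slope density function; the array layout and bin count are assumptions.

```python
import numpy as np

def centroid_distance_signature(boundary):
    """Distance and angle of each boundary point relative to the center of gravity.
    boundary: (n, 2) array of ordered contour points."""
    c = boundary.mean(axis=0)
    d = boundary - c
    r = np.hypot(d[:, 0], d[:, 1])
    theta = np.arctan2(d[:, 1], d[:, 0])
    return theta, r

def slope_density(boundary, bins=36):
    """Histogram of tangent (slope) angles along a closed boundary."""
    diffs = np.diff(boundary, axis=0, append=boundary[:1])  # wrap around to close the contour
    angles = np.arctan2(diffs[:, 1], diffs[:, 0])
    hist, edges = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    return hist, edges
```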
27.2.3 Curve Representation
Curve representation system A method of representing or modeling curves with parameters. Some typical examples are B-splines, crack codes, cross-sectional functions, Fourier descriptors, eigen-equations, polycurves, polygonal approximation, radius vector functions, snake models, splines, etc.
Regular point A point of a 2-D curve whose local features are not those of a singular point. Referring to Fig. 27.7, consider a point Q in the first quadrant that moves along curve C to point P (the origin of the coordinate system). When it continues to move after reaching point P, its next position can fall in one of four cases, respectively in quadrants one, two, three, and four, as shown in Fig. 27.7a–d. If point Q on the curve reaches the second quadrant after passing through point P, as shown in Figure (b), then point P is a regular point; in the other three cases, point P is a singular point.
Fig. 27.7 Four curve cases
Fig. 27.8 Example of polycurve
Polycurve A simple curve C that is smooth except at a limited set of points. Given any point P on C, the tangents to C converge to a limit as P is approached from either direction. The shape models used in image analysis often adopt a polycurve model consisting of a series of curved or straight segments to describe the contour shape. The polycurve shown in Fig. 27.8 contains four arcs. Compare polyline.
Discrete curve evolution An automatic, step-by-step method of simplifying polygonal curves that ignores minor disturbances and preserves the perceived appearance. See curve evolution.
Curve saliency A voting method for detecting curves in 2-D or 3-D images. Each pixel is convolved with a curve template to build a saliency map. At locations where a curve is more likely to exist, the saliency map has a higher value.
Curve smoothing A method of rounding the results obtained by polygon-based approximation or by vertex-based approximation of surface boundaries. Typical methods are the Bézier curve in 2-D (an example is shown in Fig. 27.9) and NURBS in 3-D. See curve evolution.
Curve evolution A process or method of simplifying and abstracting a (contour) curve by stepwise iteration. For example, a relevance measure can be assigned to each vertex on a curve; in each iteration, the least important vertex is eliminated and its nearest neighbors are connected. This process is repeated until the required level of abstraction is reached. Another method is to smooth the curve with a Gaussian filter whose standard deviation gradually increases. An example is shown in Fig. 27.10.
Fig. 27.9 Bézier curve smoothing (initial curve and smoothed Bézier curve)
Fig. 27.10 Curve evolution
Fig. 27.11 Curve bi-tangent (showing double tangent points and inflection points on a curve)
Curve tangent vector A vector that is parallel to the tangent of a curve at a given point. See local geometry of a space curve.
Curve bi-tangent Tangency of a straight line with a curve (also extendable to a surface) at two different locations. Figure 27.11 shows an example.
Bi-tangent See curve bi-tangent.
Arc length The distance along a curve between two points on the curve. If f(x) is a function whose derivative f′(x) is continuous in the closed interval [a, b], then the arc length from x = a to x = b is

∫_a^b √(1 + [f′(x)]²) dx
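As a hedged illustration, the arc-length integral above can be evaluated numerically; the finite-difference derivative and trapezoidal rule used here are one simple choice among many.

```python
import numpy as np

def arc_length(f, a, b, n=10_000):
    """Numerical arc length of y = f(x) over [a, b]: integral of sqrt(1 + f'(x)^2) dx."""
    x = np.linspace(a, b, n)
    y = f(x)
    dy_dx = np.gradient(y, x)                          # central-difference derivative
    integrand = np.sqrt(1.0 + dy_dx ** 2)
    # Trapezoidal rule for the integral
    return np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(x))

# Example: y = x on [0, 1]; the result should be close to sqrt(2).
length = arc_length(lambda x: x, 0.0, 1.0)
```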
Fig. 27.12 Arc length parameterization example
Arc length parameterization An efficient way to represent contours (or curves), similar to the boundary signature method. Letting s be the arc length along the contour (or curve), the contour can be expressed as [x(s), y(s)] after parameterization. Consider the contour composed of edge segments shown in Figure 27.12a. Starting from a certain point (such as (1, 1)), the corresponding x(s) and y(s) are drawn as s increases in Figure 27.12b and connected into curves (for simplicity, s is quantized to integer values). Such curves can be Fourier transformed and processed further.
Contour smoothing with arc length parameterization A commonly used contour smoothing method, which can eliminate the influence of digitization noise on the contour (the cause of the zigzag phenomenon).
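A minimal sketch of arc length parameterization for an ordered contour, assuming the contour is given as a list of (x, y) points; resampling and the Fourier step mentioned above are left out.

```python
import numpy as np

def arc_length_parameterization(contour):
    """Return s, x(s), y(s) for an ordered (closed or open) contour.

    contour: (N, 2) array of ordered (x, y) points.
    s is the cumulative arc length along the contour, starting at 0.
    """
    pts = np.asarray(contour, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # length of each edge segment
    s = np.concatenate(([0.0], np.cumsum(seg)))          # cumulative arc length
    return s, pts[:, 0], pts[:, 1]

# The curves x(s) and y(s) can then be resampled uniformly in s,
# e.g., before applying a Fourier transform.
s, xs, ys = arc_length_parameterization([(1, 1), (4, 1), (4, 3), (1, 3), (1, 1)])
```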
27.2.4 Parametric Curve
Parametric curve A collective term for single-parametric curves and multi-parametric curves. It is also often short for single-parametric curve.
Parametric boundary An object representation technique that expresses the boundary line of an object as a parametric curve. The points on the parametric curve have a definite order.
Single-parametric curve A curve represented with a single parameter. The curve can be seen as the trajectory of a point moving in 2-D space. Mathematically, the position of the point in 2-D space can be represented by the position vector P(t) = [x(t), y(t)], where t is the parameter of the curve and can represent the normalized length from a point along the curve. The collection of position vectors represents the curve. Any point on the curve is described by two functions with t as the parameter. The curve starts at t = 0 and ends at t = 1. In order to represent a general curve, the first
Fig. 27.13 Algebraic curve of the unit circle (x² + y² − 1 = 0)
and second derivatives of the parametric curve are made continuous, and the order of P(t) is at least 3. The cubic polynomial curve can be written as

P(t) = at³ + bt² + ct + d

where a, b, c, and d are all coefficient vectors.
Multi-parametric curve A curve represented with multiple parameters. Compare single-parametric curve.
3-D parametric curve Parametric representation for a special 3-D object, namely a 3-D curve. The outer contour of a 3-D object is generally a 3-D curve after imaging, which can be represented by a parametric spline, written in matrix form as

P(t) = [x(t), y(t), z(t)],  0 ≤ t ≤ 1

where t represents the normalized length from a certain point along the curve. That is, any point on the 3-D curve can be described by three functions of the parameter t, and the curve starts at t = 0 and ends at t = 1. In order to represent a general curve, the first and second derivatives of the parametric spline are made continuous, and the order of P(t) must be at least 3. The cubic polynomial curve can be written as

P(t) = at³ + bt² + ct + d

where a, b, c, and d are all coefficient vectors.
Algebraic curve A parametric representation of a curve. Euclidean distance is used if the object cannot be represented by a linear property. An algebraic representation of a circle and its outline is given in Fig. 27.13. The general result of parameterization in Euclidean space is {x: f(x) = 0}.
Composed parametric curve A curve composed of multiple parametric curves. Compare single-parametric curve.
Regular parametric curve A parametric curve whose speed (derivative) is never zero. It is a kind of regular shape. An important property of the speed of a regular parametric curve is that the
velocity vectors of points on the curve are tangent to the curve at those points. Compare non-regular parametric curve.
Regular shape A curve whose speed is never zero. An important property of the speed of a regular curve is that the velocity vector of each point on the curve is tangent to the curve at that point. The set of all possible tangential vectors of a curve is called the tangential field. As long as the curve is regular, it is possible to normalize so that the tangential vector along the curve has unit size. In shape representation and description, the object contour is often represented by a parametric curve, and the above properties of regular curves are helpful for the analysis of shape.
Non-regular parametric curve A parametric curve whose velocity (derivative) can be zero. Compare regular parametric curve.
Smooth parametric curve A parametric curve whose derivatives of all orders exist. Compare non-smooth parametric curve.
Non-smooth parametric curve A parametric curve whose derivatives are not guaranteed to exist for all orders. Compare smooth parametric curve.
Open parametric curve A non-closed parametric curve. Every point on it except the two endpoints has exactly two adjacent points, and each endpoint has only one adjacent point. Compare closed parametric curve.
Closed parametric curve A parametric curve that is closed. Each point on it has exactly two adjacent points. Compare open parametric curve.
Jordan parametric curve A parametric curve with no self-intersections. Compare non-Jordan parametric curve.
Non-Jordan parametric curve A parametric curve that may intersect itself. Compare Jordan parametric curve.
27.2.5 Curve Fitting Curve fitting A method of determining the best-fit curve parameters through a set of 2-D (or 3-D) data points. This often translates into a problem that minimizes the least square error
between some hypothetical curve and the actual data points. If the curve y(x) can be seen as the sum of a set of N arbitrary basis functions B_i,

y(x) = Σ_{i=1}^{N} a_i B_i(x)

then the unknown parameters are the weights a_i. The curve fitting process can be seen as the problem of minimizing a certain log-likelihood function, given N points with standard deviations σ_i for the Gaussian error. This function can be defined as

χ² = Σ_{i=1}^{N} [(y_i − y(x_i)) / σ_i]²

The weights that minimize the above formula can be calculated with the help of the matrix D:

D_ij = B_j(x_i) / σ_i

The weight vector a is then obtained by solving the linear equation

Da = v

where the vector v satisfies v_i = y_i / σ_i.
Bias Refers to the phenomenon that the maximum likelihood method systematically underestimates the variance of a Gaussian distribution. It is related to the problem of overfitting in polynomial curve fitting.
Iterative endpoint curve fitting An iterative process that divides a curve into a set of linear segments approximating the original curve. First, a straight line connecting the endpoints of the curve is constructed. If the maximum distance between the straight line and the curve is less than a certain allowable value, approximating the curve with the straight line is accurate enough and the process ends. If, however, the maximum distance between the straight line and the curve exceeds the allowable value, approximating the curve with the current straight line is not accurate enough; the curve is then divided into two segments at the point of maximum distance, and the above approximation process continues for each segment until every piece of the curve is approximated accurately enough by a straight line. See splitting technique.
Curve approximation A class of boundary-based object representation techniques. Some geometric primitives (line segments, splines, etc.) are used to approximate and express the outline of the object, in order to simplify the complexity of the representation and reduce the data volume.
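A small sketch of the basis-function fitting just described, assuming numpy is available; the matrix D and vector v follow the definitions above, and the quadratic example at the end is only illustrative.

```python
import numpy as np

def fit_basis_curve(x, y, sigma, basis_funcs):
    """Weighted least squares fit y(x) = sum_i a_i * B_i(x).

    x, y, sigma : arrays of sample positions, values, and standard deviations.
    basis_funcs : list of callables B_i(x).
    Returns the coefficient vector a minimizing chi^2.
    """
    x, y, sigma = map(np.asarray, (x, y, sigma))
    # D[i, j] = B_j(x_i) / sigma_i,  v[i] = y_i / sigma_i
    D = np.column_stack([B(x) / sigma for B in basis_funcs])
    v = y / sigma
    a, *_ = np.linalg.lstsq(D, v, rcond=None)   # solves D a ≈ v in the least squares sense
    return a

# Example: fit a quadratic with equal weights.
xs = np.linspace(0, 1, 20)
ys = 1.0 + 2.0 * xs - 3.0 * xs**2
coeffs = fit_basis_curve(xs, ys, np.ones_like(xs),
                         [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2])
```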
Least squares curve fitting A least mean square estimation process in which a set of 2-D or 3-D data points is fitted with a straight or curved parametric model. The fitting quality is commonly evaluated using the Euclidean distance, the algebraic distance, or the Mahalanobis distance.
Sub-pixel-precision of fitting Fitting of point sets, straight lines, curves, etc. can achieve sub-pixel precision.
Sampson approximation An approximation of the geometric distance when fitting a parametric curve or surface. The parametric curve or surface S(a), defined by the parameter vector a, consists of the points x satisfying f(a, x) = 0. Fitting the curve or surface to a set of points {x_1, x_2, . . ., x_n} requires minimizing the function e(a) = Σ_{i=1}^{n} d[x_i, S(a)]. If the distance function d[x, S(a)] is the algebraic distance d[x, S(a)] = f(a, x)², the minimization has a simple solution. However, under certain common assumptions, an optimized solution can also be obtained when d is the more complicated geometric distance, that is, d[x, S(a)] = min_{y∈S(a)} ||x − y||². The Sampson approximation can be expressed as

d[x, S(a)] = f(a, x)² / ||∇f(a, x)||²

This is a first-order approximation to the geometric distance. If there is an effective algorithm for minimizing the weighted algebraic distance, then the k-th iterate a_k of the Sampson approximation is

a_k = argmin_a Σ_{i=1}^{n} w_i f²(a, x_i)

where the weights are calculated with the help of the previous estimate, i.e., w_i = 1/||∇f(a_{k−1}, x_i)||².
Morphable models A model that can be adjusted to fit the data. For example, a deformable 3-D face model can be changed to fit a previously unseen picture of an unknown face.
3-D morphable model A technique for obtaining a 3-D object model from a single 2-D image. For example, given a face photo, the 3-D shape of the face can be estimated by approximating its orientation in space and the lighting conditions in the scene. Starting from a rough estimate of size, orientation, and lighting, these parameters can also be optimized based on the shape and color of the face to best match the input image. The 3-D model obtained from the image can be rotated and manipulated in 3-D space.
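As a hedged example of the iteration above, the following sketch fits a circle written implicitly as f(a, x) = x² + y² + Dx + Ey + F (an assumed parameterization, chosen because the weighted algebraic minimization is then linear); the Sampson weights are recomputed from the previous estimate at each pass.

```python
import numpy as np

def sampson_circle_fit(points, n_iter=10):
    """Circle fit by iteratively reweighted algebraic distance (Sampson-style weights).

    With f(a, x) = x^2 + y^2 + D*x + E*y + F = 0, minimizing the weighted algebraic
    distance is a linear problem, and the weights w_i = 1/||grad_x f(a_{k-1}, x_i)||^2
    make it approximate the geometric distance.
    """
    pts = np.asarray(points, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    M = np.column_stack([x, y, np.ones_like(x)])
    b = -(x**2 + y**2)
    w = np.ones_like(x)                          # first pass: plain algebraic fit
    for _ in range(n_iter):
        sw = np.sqrt(w)
        a, *_ = np.linalg.lstsq(sw[:, None] * M, sw * b, rcond=None)
        D, E, F = a
        grad_sq = (2 * x + D) ** 2 + (2 * y + E) ** 2
        w = 1.0 / np.maximum(grad_sq, 1e-12)     # Sampson weights for the next pass
    cx, cy = -D / 2, -E / 2
    r = np.sqrt(max(cx**2 + cy**2 - F, 0.0))
    return cx, cy, r

# Noisy points near a circle centered at (1, 2) with radius 3.
t = np.linspace(0, 2 * np.pi, 50)
pts = np.column_stack([1 + 3 * np.cos(t), 2 + 3 * np.sin(t)]) + 0.05 * np.random.randn(50, 2)
center_x, center_y, radius = sampson_circle_fit(pts)
```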
Fig. 27.14 An improved deformable template for the eye
Viscous model A deformable model based on the concept of viscous fluids, that is, fluids that have a relatively high resistance to flow.
Deformable model A class of evolvable representation models constructed for object boundaries or object contours. There are two main types of deformable models: parametric models and geometric deformable models. For example, one can model a deformable object type (such as a human body or an eye), where the object shape changes based on parameter values. If the general characteristics of an object type are considered in the model, the deformable model can be constructed from new data and used as a matching template, and the degree of deformation can be used as a matching score. See modal deformable model.
Deformable template A template that can deform according to some model. A typical example is a geometric model of the eye, in which a circle and two parabolas are used to represent the contour of the iris and the contours of the upper and lower eyelids, respectively. An improved deformable template is shown in Fig. 27.14. It can be represented by a seven-tuple (O, Q, a, b, c, r, θ), where O is the center of the eye (located at the origin of the coordinates), Q is the center of the iris, a and c are the heights of the upper and lower parabolas, respectively, b is the length of the parabola, r is the radius of the iris, and θ is the angle between the long axis of the parabola and the X axis. In this model, the intersections of the upper and lower parabolas form two corner points (P1 and P2). In addition, the iris circle has two intersections with each of the upper and lower parabolas, forming the other four corner points (P3 and P4, P5 and P6).
Deformable template model See deformable model.
3-D deformable model A model that can change its shape under external influence. As a result of the deformation, the relative position between any points on the model can change. In the tracking and recognition of objects, this model can be used to represent non-rigid bodies such as human bodies.
27.2.6 Chain Codes
Chain code A boundary-based representation of the object region. It is a coded representation of the boundary points, characterized by a series of connected straight line segments with specific lengths and directions that represent the object boundary. Because the length of each line segment is fixed and the number of directions is limited, only the (absolute) coordinates of the starting point of the boundary need to be stored; the remaining points can be represented by direction numbers giving the offset to the next point. Since the number of bits required to represent a direction is smaller than the number of bits required to represent a coordinate value, and only one direction number is needed in place of two coordinate values for a point, the chain code representation can greatly reduce the amount of data required to express the boundary.
Slope chain code [SCC] A chain code different from the Freeman code. For a 2-D curve, the slope chain code is obtained by sequentially measuring the curve with straight line segments of equal length (the end of each line segment falls on the curve). In the slope chain code, the length of each chain element is fixed, the slope change between adjacent elements is normalized to [0, 1], and the chain code itself records only the slope values. This chain code is independent of the translation and rotation of the curve (its description of rotation is finer than that of the Freeman code) and can also be made independent of scale changes through normalization.
Freeman code A chain code expressing the outline of an object. The object contour is represented by the coordinates of its first point followed by a series of direction codes (typical values are 0 to 7), each direction code following the contour from one point to the next.
Even chain code The horizontal chain codes in 4-directional chain codes, and the horizontal and vertical chain codes in 8-directional chain codes.
Odd chain code The vertical chain codes in 4-directional chain codes, and the diagonal chain codes in 8-directional chain codes.
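A minimal sketch of generating an 8-directional Freeman chain code from an ordered list of boundary pixels; the direction numbering (0 = +x, increasing counterclockwise with the y axis pointing up) is one common convention and should be adapted to the coordinate system in use.

```python
# Direction codes for the 8-directional Freeman chain code
# (0 = +x, numbered counterclockwise, y axis pointing up).
OFFSET_TO_CODE = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
                  (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def freeman_chain_code(boundary):
    """8-directional chain code of an ordered list of 8-connected boundary pixels.

    boundary: list of (x, y) coordinates; consecutive pixels must be 8-neighbors.
    Returns the starting point and the list of direction codes.
    """
    codes = []
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:]):
        codes.append(OFFSET_TO_CODE[(x1 - x0, y1 - y0)])
    return boundary[0], codes

start, code = freeman_chain_code([(0, 0), (1, 0), (2, 1), (2, 2), (1, 2), (0, 1)])
# start == (0, 0), code == [0, 1, 2, 4, 5]
```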
Fig. 27.15 4-directional chain code
Fig. 27.16 8-directional chain code
4-directional chain code A chain code using 4-connectivity. Its direction definition and a representation example are shown in Figure 27.15a, b, respectively.
8-directional chain code A chain code using 8-connectivity. See Figure 27.16a, b for the direction definition and a representation example, respectively.
Length of chain code The length of the chain code after the curve or boundary in the image is represented with a chain code. In addition to the number of codes in the chain, it is also related to the relationship between the label of each code and the adjacent code labels (i.e., the changes of direction between the chain codes). See distance measurement based on chain code.
Chain code histogram A histogram whose abscissa is the direction in the chain code representation of the outline of the object region. Object scaling corresponds to a vertical expansion or contraction of the chain code histogram, and rotation of the object corresponds to a horizontal cyclic shift of the chain code histogram. However, two different object regions may have the same chain code histogram.
Fig. 27.17 Chain code normalization with respect to starting point (the original chain code 1 0 1 0 3 3 2 2 becomes 0 1 0 3 3 2 2 1)
Fig. 27.18 Chain code normalization for rotation (the chain code 1 0 1 0 3 3 2 2 becomes 2 1 2 1 0 0 3 3 after the object is rotated 90° counterclockwise, while the difference code stays the same)
Chain code normalization The chain code representation of an object boundary in the image is normalized so that it does not change due to changes in the position and/or orientation of the object in the image. Commonly used are chain code normalization with respect to starting point and chain code normalization for rotation.
Chain code normalization with respect to starting point Normalizing the representation of the chain code so that the same chain code is obtained when the boundary represented by the chain code is traversed starting from different points. A specific method is to treat a given chain code, starting from an arbitrary point, as a natural number composed of the sequence of direction numbers. These direction numbers are cyclically shifted in one direction so as to minimize the value of the natural number formed. The starting point of the chain code after this conversion is then taken as the starting point of the normalized chain code of this boundary (see Fig. 27.17).
Chain code normalization for rotation Normalizing the representation of the chain code so that the object it represents has the same representation form when the object takes different orientations in space. A specific method is to use the first-order difference of the chain code to construct a new sequence representing the direction changes between the segments of the original chain code. This difference can be obtained by subtracting adjacent pairs of direction numbers (working in the opposite direction). Referring to Fig. 27.18, the upper line is the original chain code (the rightmost direction number is cycled to the front and shown in parentheses), and the lower line is the difference code obtained by this pairwise subtraction. As can be seen from the figure, the object on the left becomes the shape on the right after rotating 90° counterclockwise. At this time,
Fig. 27.19 Crack code: direction definition (left) and an example (right)
Fig. 27.20 Contour chain code and midpoint crack code
the original chain code has changed (compared to the new chain code), but the original difference code and the new difference code are the same. Crack code A contour representation method that does not encode the pixels themselves, but encodes the gaps between pixels. It can be regarded as a 4-directional chain code instead of an 8-directional chain code. Compare chain code. Figure 27.19 shows an example. The left figure shows the direction definition. The right figure shows the position of the crack code (the boundary between the left and right regions). The crack code from top to bottom is {3 2 3 0 3 3}. Mid-crack code A chain code obtained by considering the gaps (edges) between two adjacent contour pixels and connecting the midpoints of these gaps. The straight line segment corresponding to the standard chain code connects the centers of two adjacent pixels. When the object is small or slender, the ratio of the number of object contour pixels to the total number of pixels in the region is relatively large. At this time, the contour obtained by the chain code and the real contour will be significantly different. The closed contour enclosed by the midpoint crack code is more consistent with the actual object contour but often requires more codes than the general chain code representation. For example, there is a “T”-shaped object (shaded) in Fig. 27.20. The closed contours enclosed by the two dashed arrows correspond to the (open set) outer chain
code and (closed set) inner chain code, respectively. Both differ to some extent from the actual object contour. If the area and perimeter of the object region are calculated according to the chain code outline, the outer chain code outline gives an overestimate, while the inner chain code outline gives an underestimate. The closed outline enclosed by the solid arrows (the midpoint crack code) is more consistent with the actual object outline.
Crack following Edge tracking on the dual grid, or along the gaps between pixels, which can be performed by following the continuous line segments obtained from the crack code.
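A small sketch of the two chain code normalizations described above, using the 4-directional example of Figs. 27.17 and 27.18; the modulo-4 difference assumes the direction numbers increase counterclockwise.

```python
def normalize_start(code):
    """Chain code normalization with respect to the starting point:
    choose the cyclic rotation that forms the smallest number sequence."""
    rotations = [code[i:] + code[:i] for i in range(len(code))]
    return min(rotations)

def first_difference(code, n_dirs=4):
    """First difference of a chain code (rotation-normalized form):
    each element is the change between consecutive direction numbers,
    computed circularly (the first element is differenced against the last)."""
    return [(code[i] - code[i - 1]) % n_dirs for i in range(len(code))]

# The 4-directional example used with Figs. 27.17 and 27.18:
code = [1, 0, 1, 0, 3, 3, 2, 2]
print(normalize_start(code))                 # [0, 1, 0, 3, 3, 2, 2, 1]
rotated = [(c + 1) % 4 for c in code]        # object rotated 90° counterclockwise
print(first_difference(code) == first_difference(rotated))   # True
```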
27.3
Region-Based Representation
The segmentation of the image extracts the region corresponding to the object, that is, all the pixels (including interior and boundary points) that belong to the object. The region-based representation expresses the object region based on the information of these pixels (Nikolaidis and Pitas 2001; Russ and Neal 2016; Zhang 2017b).
27.3.1 Polygon
Polygon A closed figure consisting of a set of line segments, which can be used to approximate most practical curves (such as object contours) to arbitrary accuracy. It can also be considered as a closed, piecewise linear 2-D contour. In a digital image, if the number of line segments of the polygon equals the number of points on the object boundary, the polygon can express the boundary completely and accurately. All sides of a regular polygon have the same length, and the angles formed by adjacent sides are equal; the square is a typical example. A polygon that does not meet these two conditions is an irregular polygon.
Regular polygon A polygon with n sides in which all sides have the same length and are arranged symmetrically about a common center, which means that regular polygons are equiangular and equilateral.
Patch In image engineering, a small region (often of irregular shape) in an image.
Polygonal approximation A method of approximating the object region with a polygon. That is, the object contour is approximated by a polygon, and the edges of the object region are then represented by expressing the sides of the polygon. In theory, because the line segments form a closed set, most practical curves can be approximated to arbitrary accuracy. The
Fig. 27.21 Two polygonal approximations
Fig. 27.22 Polygon approximation via splitting
advantage is better anti-interference performance and a reduced data volume. Based on the polygon representation of the contour, many different shape descriptors can also be obtained. Common methods for obtaining the polygon are:
1. The minimum-perimeter polygon method based on contraction.
2. The minimum mean square error line segment approximation method based on the merging technique.
3. The minimum mean square error line segment approximation method based on the splitting technique.
Polygonal approximation can also be used for the polyline approximation of a curve. Figure 27.21 shows two polyline approximations of a segment of an arc, where the dashed-line approximation is more accurate than the solid-line approximation.
Minimum-perimeter polygon [MPP] For a (closed) digital curve, the digitized polygon with the shortest total side length. Its shape is often similar to the perceived shape of the digital curve. It is an approximation of the object boundary; the approximation becomes exact when the number of polygon line segments (which can be considered normalized to unit length) equals the number of pixels on the boundary. See polygonal approximation.
Splitting technique A polygonal approximation technique. First, the two pixels that are farthest apart on the boundary of the object region are connected (i.e., the boundary is divided into two parts), and then the boundary is further decomposed according to certain criteria to form a polygon approximating the boundary, until the fitting error meets a certain limit. Figure 27.22 shows an example of splitting the boundary based on the maximum distance between the boundary points and the existing polygon. The original boundary is represented by points a, b, c, d, e, f, g, h, and so on. The first step is to start from
Fig. 27.23 Polygon approximation via merging
point a: first draw the line ag, and then calculate the distances di and hj (points d and h are on the two sides of line ag and farthest from it). In the figure, these two distances exceed the error limit, so the boundary is decomposed into four segments: ad, dg, gh, and ha. The distances between boundary points such as b, c, e, and f and the corresponding straight lines of the current polygon are then calculated. In the figure, it is assumed that they do not exceed the error limit (e.g., fk), so the required polygon is adgh. Compare merging technique. See iterative endpoint curve fitting.
Merging technique A polygonal approximation technique that successively connects pixels along the boundary of an object region. First, a boundary point is selected as the starting point, and this point is connected to the subsequent boundary points in order by straight lines. The (approximation) fitting error of each straight line to the boundary is calculated; the line segment just before the error exceeds a certain limit is taken as an edge of the polygon, and the error is reset to zero. The process then starts again from the other end of that line segment and continues to connect boundary points until it has gone all the way around the boundary. This gives a polygon approximating the boundary. An example of polygon approximation based on the merging technique is shown in Fig. 27.23. The original boundary is represented by points a, b, c, d, e, f, g, h, and so on. Starting from point a, the straight line segments ab, ac, ad, ae, etc. are made in that order. For each line segment starting from ac, the distance between the previous boundary points and the line segment is calculated as the fitting error. In the figure, it is assumed that bi and cj do not exceed the predetermined error limit, while dk exceeds it, so d is selected as the next polygon vertex immediately after point a. The procedure then continues from point d as above. The vertices of the final approximating polygon are a, d, g, and h. Compare splitting technique.
Divide and conquer The technique of breaking down the problem to be solved into several smaller subproblems (which are expected to be easier to solve) and then solving these subproblems in turn. For example, when deriving a polygon approximation from an object contour, the line can be iteratively split at the midpoint (i.e., divided into two segments at the midpoint on the contour) until the distance between the polygon representation and the actual contour is less than a predetermined threshold or allowable limit. See Fig. 27.24.
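A compact sketch of the splitting technique (iterative endpoint curve fitting) for an open run of boundary points; the tolerance value and the sample arc are arbitrary.

```python
def point_line_distance(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = (dx * dx + dy * dy) ** 0.5
    if norm == 0.0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dx * (py - ay) - dy * (px - ax)) / norm

def split_approximation(points, tol):
    """Polygonal approximation by splitting: recursively split the run of points
    at the point farthest from the current chord until all distances are <= tol.
    Returns the retained vertices, including both endpoints."""
    pts = list(points)
    if len(pts) < 3:
        return pts
    dists = [point_line_distance(p, pts[0], pts[-1]) for p in pts[1:-1]]
    d_max = max(dists)
    if d_max <= tol:
        return [pts[0], pts[-1]]              # the chord is an adequate approximation
    split = dists.index(d_max) + 1            # index of the farthest point
    left = split_approximation(pts[:split + 1], tol)
    right = split_approximation(pts[split:], tol)
    return left[:-1] + right                  # drop the duplicated split vertex

arc = [(x, 0.05 * x * (10 - x)) for x in range(11)]   # a shallow arc of sample points
vertices = split_approximation(arc, tol=0.5)
```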
Fig. 27.24 From object contour to approximate polygon
Ramer algorithm An algorithm for recursively dividing planar curves and approximating them with line segments. It has proven to be the best of all polygonal approximation methods. If the curve is not closed, the first line segment is drawn between its start point and end point. For closed curves, see splitting technique.
Cluster of polygons A core polygon at the center together with the collection of all polygons that share a boundary with the core.
Polygonal network A collection of regions all surrounded by straight line segments. For a polygon mesh, if V is the number of vertices, B is the number of edges, and F is the number of faces, the Euler number E_n is

E_n = V − B + F

Grid polygon A polygon whose vertices are discrete points of the sampling grid. Its area can be calculated using the grid theorem.
Grid theorem A theorem for calculating the area of a grid polygon. Let R be the set of points contained in the polygon Q. If N_B is the number of points exactly on the contour of Q and N_I is the number of internal points of Q, then |R| = N_B + N_I, that is, the number of points in R is the sum of N_B and N_I. The grid theorem states that the area A(Q) of Q can be calculated from these counts as

A(Q) = N_I + N_B/2 − 1
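As an illustration, the following sketch counts N_B from the polygon edges (using the gcd of the coordinate differences, an assumption not stated above) and checks the grid theorem against the shoelace area.

```python
from math import gcd

def grid_polygon_area(vertices):
    """Area of a grid polygon via the grid (Pick's) theorem: A = N_I + N_B/2 - 1.

    vertices: lattice points (integer coordinates) of the polygon in order.
    N_B is counted from the edges, and N_I is recovered by comparison with the
    shoelace area, so the two formulas can be checked against each other.
    """
    n = len(vertices)
    # Shoelace formula for twice the (signed) area
    twice_area = sum(vertices[i][0] * vertices[(i + 1) % n][1]
                     - vertices[(i + 1) % n][0] * vertices[i][1] for i in range(n))
    area = abs(twice_area) / 2.0
    # Number of lattice points exactly on the contour
    n_b = sum(gcd(abs(vertices[(i + 1) % n][0] - vertices[i][0]),
                  abs(vertices[(i + 1) % n][1] - vertices[i][1])) for i in range(n))
    n_i = int(area - n_b / 2 + 1)          # interior points, from the same theorem
    assert area == n_i + n_b / 2 - 1       # the grid theorem
    return area, n_b, n_i

# A 3x2 rectangle: area 6, 10 boundary points, 2 interior points.
print(grid_polygon_area([(0, 0), (3, 0), (3, 2), (0, 2)]))
```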
Wachspress coordinates Generalization/expansion of the center of gravity coordinates to represent the points inside the polygon with a weighted function of the position of each vertex on the polygon. This generalization makes it possible to fit polygons with any number of vertices instead of the triangles required when using the original center of gravity coordinates. The Wachspress coordinate system is useful in representing and manipulating articulated objects and deformable shapes.
27.3.2 Surrounding Region Surrounding region One of the region-based representation methods used when representing objects in an image. Use a (larger) region that contains the object to approximate the object. The common main methods used are Feret box, minimum bounding rectangle, convex hull, etc. Bounding region Region-based representation of the object region with approximate geometric primitives (often polygons). Also refers to the technology that enables this expression. Commonly used geometric primitives include Feret box, minimum enclosing rectangle, and convex hull. Figure 27.25 shows an example of each of the three representations mentioned above in turn. It can be seen that the representation of the convex hull for the region may be more accurate than that of the minimum enclosing rectangle, and the representation of the minimum enclosing rectangle for the region may be more accurate than that of the Feret box. The ratio of the sides of the minimum enclosing rectangle is often used as a classification metric in model-based recognition.
Fig. 27.25 Three bounding region representations for the same object: Feret box, minimum enclosing rectangle, and convex hull
Bounding box A type of surrounding region that generally corresponds to a Feret box.
Feret box A geometric primitive used in surrounding region representation techniques. It gives the smallest rectangle containing the object that is oriented toward a specific reference direction (generally along the coordinate axes). Compare minimum enclosing rectangle.
Minimum bounding rectangle A rectangle that surrounds a set of image data (often constituting an object region) and has the smallest area.
Width function A function that represents the width of a 2-D object (a closed subset of the plane) in a particular orientation θ. Letting the object shape be S ⊂ ℝ² and the width function be w(θ), the projection is P(θ) := {x cos θ + y sin θ | (x, y) ∈ S}, and w(θ) := max P(θ) − min P(θ).
Minimum enclosing rectangle [MER] In surrounding region representation techniques, a geometric primitive given by the smallest rectangle (which can be oriented in any direction) containing the object region. Compare Feret box.
Convex hull A geometric primitive that is the smallest convex shape containing the object region in surrounding region representation techniques. See convex-hull construction.
Convex-hull construction A method of obtaining the convex hull of a region using mathematical morphology for binary images. Given a set A, letting B_i (i = 1, 2, 3, 4) represent four structuring elements, first construct
X_i^k = (X_i^{k−1} ⊛ B_i) ∪ A

where i = 1, 2, 3, 4; k = 1, 2, 3, . . .; ⊛ denotes the hit-or-miss transform; and X_i^0 = A. Letting D_i = X_i^conv, where the superscript "conv" indicates that X_i has converged in the sense X_i^k = X_i^{k−1}, the convex hull of A can be expressed as

C(A) = ∪_{i=1}^{4} D_i

In other words, the process of constructing the convex hull is as follows: first iteratively apply the hit-or-miss transform to A with B1 and take the union of the result with A; when there is no further change, record the result as D1. Repeat the same iterate-and-union process with B2 and record the result as D2; then repeat the above operations with B3 and B4 to obtain D3 and D4. Finally, the union of the four results D1, D2, D3, and D4 yields the convex hull of A.
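The morphological construction above works directly on binary images. As a hedged alternative sketch, the convex hull of the set of region points can also be computed geometrically with Andrew's monotone chain algorithm (a swapped-in technique, not the method described above).

```python
def convex_hull(points):
    """Convex hull of a 2-D point set by Andrew's monotone chain algorithm.

    Returns the hull vertices in counterclockwise order.
    """
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):   # z-component of (a - o) x (b - o)
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

hull = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)])
# hull == [(0, 0), (2, 0), (2, 2), (0, 2)]
```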
Fig. 27.26 Convex combination step
Fig. 27.27 Concave residue example
Edelsbrunner-Harer Convex Combination Method A method of constructing larger convex sets by constantly adding new points to existing convex combinations. The basic idea is to start from the convex hull of two points, add a third point, and construct a convex set of all points on the triangular surface formed by the three points. Referring to Fig. 27.26, first consider two points as the original set {p0, p1}, and add the filling points x = (1 − t)p0 + t p1 for some 0 ≤ t ≤ 1. Here, for t = 0, we get x = p0; for t = 1, we get x = p1; and for 0 < t < 1, we get a point between p0 and p1. In this way, a minimum convex set containing the two points of the original set is formed (see Figure 27.26a). Now consider a third point p2, adding the filling points y = (1 − t)p0 + t p2 and z = (1 − t)p1 + t p2. In this way, a triangle with the above three points as vertices is obtained, and further filling yields the convex hull of the triangular surface (see Figure 27.26b). This process can be continued iteratively. If a set of four points is used, a quadrilateral convex hull is obtained as shown in Figure 27.26c. In this way, a convex set with more points can be obtained.
Convex deficiency The difference between an object region and its convex hull. Both the object region and its convex hull can be considered as sets, so the convex deficiency is also a set. If the object is convex, its convex deficiency is the empty set.
Convex residue Same as convex deficiency.
Deficit of convexity Same as convex deficiency.
Concave residue The set difference between an object region in the image and its convex hull. Also called convex residue. For convex regions, the concave residue is an empty set. Fig. 27.27 shows an example, where the black hammer region is the object under
Fig. 27.28 Schematic diagram of concavity tree
Fig. 27.29 Region concavity tree
consideration and the surrounding light-colored regions correspond to the concave residue set.
Concavity tree A tree (structure) that describes an object hierarchically. The concavity tree of a region uses its convex hull as the parent node and the concavity trees of its concavities as child nodes; subtracting the concavities from the parent node region gives back the original region. The concavity tree of a convex region is the region itself. The left part of Fig. 27.28 shows the process of constructing a concavity tree based on the concave residues, and the right part shows the constructed concavity tree.
Region concavity tree A structure or method for representing an object shape. Referring to Fig. 27.29, Figure 27.29a is a non-convex object (point set S), Figure 27.29b is the convex hull H of the object, Figure 27.29c is the convex residue, and Figure 27.29d gives the region concavity tree.
Extremal point In object representation, a point on the boundary of the smallest enclosing convex region (convex hull) of a point set.
Visual hull A 3-D volume model reconstructed from the intersection of binary silhouettes projected into 3-D space. Sometimes also called a line hull, similar to the convex hull of a point set, because the volume is maximized relative to the
Fig. 27.30 Visual hull
silhouette, and the surface elements are tangent to the line of sight along the silhouette contour. See silhouette-based reconstruction. A space carving method that uses multiple images to approximate the object shape. This method searches the silhouette contour of a given object in each image. The spatial region defined by each camera and the associated image contour constitute the shape constraint on the object. The visual hull is the intersection of all these constraints. As the angle of view increases (the number of images increases), the approximation to the object gets better and better. Fig. 27.30 shows the construction of a visual hull for a car. Line hull See visual hull. Exponential hull A variant of convex hull. First, take the logarithm of the histogram function, and then compute the convex hull on the histogram. With the help of the convex hull thus obtained, the histogram unevenness can be analyzed by thresholding based on concavity-convexity of histogram, and an appropriate threshold can be determined from such a histogram to segment the image.
27.3.3 Medial Axis Transform Medial axis transform [MAT] An operation on a binary image transforms a region into the center pixels of a set of circles. These pixels are on the midpoint of the region outline, and the circles fill the interior of the region. The value of each pixel on the axis is the radius of the midline circle. The medial axis transform can be used to represent regions in a simple axis form structure, which is effective for representing slender regions. A method for obtaining a regional skeleton. The medial axis transform of the region R with the boundary B can be determined as follows. Referring to Fig. 27.31, for each point p in R, the point with the smallest distance from it can be searched in B. If more than one such point can be found for p (i.e., there are two or more points in
Fig. 27.31 Region R, boundary B, and skeleton point p
Fig. 27.32 Interpreting the medial axis transform with wave propagation
B that are at the same minimum distance from p), then p can be considered to belong to the centerline or skeleton of R, that is, p is a skeleton point.
Blum's medial axis See medial axis transform.
Medial axis skeletonization See medial axis transform.
Wave propagation A concept used to describe vividly the principle and process of the medial axis transform: if waves with the same speed are emitted simultaneously from every point on the object boundary, the fronts of the waves meet at the object's skeleton set. As shown in Fig. 27.32, the outer thin solid line represents the original boundary of the object, each arrow indicates that the wave propagates in a direction perpendicular to the boundary, the dashed lines indicate the wavefront at different times, and the thick solid straight line in the center represents the resulting skeleton. It can be seen from the figure that the medial axis obtained by the medial axis transform has the maximum distance from the boundaries on all sides.
Skeleton by influence zones [SKIZ] The skeleton obtained from the zones of influence, often called the Voronoï diagram. Letting X consist of N connected components X_i, i = 1, 2, . . ., N, the points in the influence zone Z(X_i) are closer to X_i than to any other connected component of X:
Z(X_i) = {p ∈ ℤ², ∀j ≠ i: d(p, X_i) ≤ d(p, X_j)}

The skeleton S(X) generated from the zones of influence is the set of boundaries of the influence zones {Z(X_i)}.
Shock tree A 2-D shape representation technique based on singularities (see singularity event) along the medial axis radius function. The medial axis is represented by a tree with the same structure, and it is decomposed into continuous fragments with consistent
Fig. 27.33 Shock graph
Fig. 27.34 The process of grassfire algorithm
characteristics (constant, monotonic, local minimum, local maximum). See medial axis skeletonization and distance transform.
Shock graph A graphic representation of the medial axis of a 2-D planar shape. Four types of nodes are used, all based on the radius function along the axis (1 represents a constant radius, 2 a monotonic radius, 3 a local minimum radius, and 4 a local maximum radius). The graph can be organized into a tree for effective object recognition with the help of graph matching. Fig. 27.33 shows a simple object shape (with the medial axis superimposed on it) and the corresponding shock graph.
Grassfire algorithm A technique for determining a region skeleton based on wave propagation. A virtual fire is lit at all boundaries of the region at the same time, and the skeleton is determined by the intersection of the wavefronts. The set of diagrams in Fig. 27.34 shows the process of computing the grassfire skeleton. Figure 27.34a is the region under consideration; Figure 27.34b represents the position where ignition starts (along the region outline); Figure 27.34c shows the fire burning into the region from all directions at the same time (the dark part indicates the burned part, the white part the unburned part); Figure 27.34d shows the fire continuing to burn inward: the vertical section of the region has been burned out and its skeleton determined, while the horizontal section still has unburned parts; Figure 27.34e shows that the horizontal section has also been burned and the entire skeleton has been determined.
Grassfire transform Another name for the distance transform; it describes the time course of a fire starting in a grass field and lighting all the grass.
Symmetric axis transform [SAT] A transformation that locates the positions of points on the skeleton of a region. These points are the trajectories of the centers of double tangent circles. See medial
Fig. 27.35 The result of symmetric axis transform
Fig. 27.36 Two symmetry lines of a rectangular object
Fig. 27.37 An elongated structure and its medial line
axis skeletonization. Figure 27.35 shows the symmetric axis calculated from the binary segmentation result of an object.
Symmetry line An axis of bilateral symmetry. In Fig. 27.36, the rectangular object represented by a solid line has two symmetry lines drawn with dotted lines.
Axis of symmetry In 2-D space, if a region rotated by 180° around a straight line coincides with the original region, the straight line is called the axis of symmetry.
Axial representation An object representation that uses curves to describe image regions. An axis can be a skeleton derived from a region through a thinning process.
Medial line A curve passing through the central axis of an elongated structure. Figure 27.37 shows a schematic diagram. See medial axis transform.
Center line See medial line.
Core line See medial line.
Medial Related to the center line. For example, the medial line refers to the central axis of an object shape.
Medial surface The 3-D extension of the center line of a planar region. It is the trajectory formed by the centers of a series of spheres that touch the surface of a solid object at three or more points.
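A minimal sketch of obtaining a medial axis and the associated radius values in practice, assuming scikit-image and numpy are available (the library call is an assumption, not part of the handbook's description).

```python
import numpy as np
from skimage.morphology import medial_axis

region = np.zeros((40, 60), dtype=bool)
region[5:35, 5:55] = True                      # a simple rectangular object

axis, distance = medial_axis(region, return_distance=True)
mat = axis * distance                          # medial axis transform: radius at each axis pixel
```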
27.3.4 Skeleton
Skeleton A set of points made up of special points, called skeleton points, in the object region. The skeleton of a region compactly reflects some basic structural characteristics of the region. The skeleton is often a curve, or a collection of curves similar to a tree, and it captures the basic structure of an object. The curves that make up the skeleton are usually in the center of the object region. There are several methods for calculating the skeleton, such as the medial axis transform (see medial axis skeletonization) and the distance transform (with the help of the grassfire algorithm).
Skeleton point A primitive of the skeleton. Given a region R with boundary B, let p be a point in R and define its distance to the boundary as

d_s(p, B) = inf{d(p, z) | z ∈ B}

where d is a distance measure, which can be the Euclidean distance, city-block distance, or chessboard distance. If this minimum distance is attained at more than one boundary point, p is a skeleton point.
Skeleton transform Same as skeletonization.
Skeletonization The process of obtaining the skeleton of an object region. It can be regarded as a region-based representation method using internal features. This representation can be obtained by simplifying a planar region, and the simplification is reversible.
A class of techniques that attempts to reduce the object region in a 2-D or 3-D binary image to a skeleton representation. Most of the pixels in the region are removed, leaving only the skeleton pixels, but the basic shape of the region can still be grasped. Definitions of the skeleton include the set of centers of circles that are bi-tangent to the region boundary, and smoothed local symmetries.
Skeleton model A linked model including rigidly connected joints. It is commonly used to model people and animals, where limbs are represented by rigid connections.
Extraction of skeleton The process of calculating the skeleton of an object region in an image. The method for obtaining the region skeleton using mathematical morphology operations on a binary image is as follows: letting S(A) represent the skeleton of connected component A, the skeleton can be expressed as

S(A) = ∪_{k=0}^{K} S_k(A)

S_k(A) in the above formula is generally called a skeleton subset and can be written as

S_k(A) = (A ⊖ kB) − [(A ⊖ kB) ○ B]

where B is a structuring element, ⊖ represents the erosion operator, and ○ represents the opening operator; (A ⊖ kB) represents k successive erosions of A by B, that is,

(A ⊖ kB) = ((. . . (A ⊖ B) ⊖ B) ⊖ . . .) ⊖ B

K in the skeleton expression represents the number of the last iteration before A is eroded into the empty set, that is,

K = max{k | (A ⊖ kB) ≠ ∅}

The skeleton expression indicates that the skeleton of A can be obtained from the union of the skeleton subsets S_k(A). A can also be reconstructed from the S_k(A):

A = ∪_{k=0}^{K} (S_k(A) ⊕ kB)

where ⊕ represents the dilation operator and (S_k(A) ⊕ kB) represents k successive dilations of S_k(A) by B:

(S_k(A) ⊕ kB) = ((. . . (S_k(A) ⊕ B) ⊕ B) ⊕ . . .) ⊕ B

Smoothed local symmetries A class of skeletonization algorithms proposed by Asada and Brady. Given a 2-D curve that encloses a closed region in the plane, the skeleton calculated with smoothed local symmetries is the locus of the midpoints of the chords of bi-tangent circles.
Fig. 27.38 Compute skeleton point by smoothed local symmetries
Figure 27.38 shows two skeleton points (represented by diamonds) defined by smoothed local symmetry, and the center of the circle in which they are located is represented by dots. Note that the two do not overlap. Compare symmetric axis transform. Binary region skeleton See skeleton. 3-D skeleton Describes the skeleton of a 3-D object in 3-D space.
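A short sketch of the morphological skeleton formulas given above under extraction of skeleton, assuming SciPy is available; B is taken here as a 3 × 3 cross-shaped structuring element, which is only one possible choice.

```python
import numpy as np
from scipy import ndimage

def morphological_skeleton(A):
    """Union of the skeleton subsets S_k(A) = (A erode kB) - opening(A erode kB, B)."""
    A = A.astype(bool)
    B = ndimage.generate_binary_structure(2, 1)        # 4-connected structuring element
    skeleton = np.zeros_like(A)
    eroded = A.copy()
    while eroded.any():                                 # stop once A erodes to the empty set
        opened = ndimage.binary_opening(eroded, structure=B)
        skeleton |= eroded & ~opened                    # add the k-th skeleton subset
        eroded = ndimage.binary_erosion(eroded, structure=B)
    return skeleton

obj = np.zeros((30, 30), dtype=bool)
obj[5:25, 10:20] = True
S = morphological_skeleton(obj)
```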
27.3.5 Region Decomposition
Region decomposition A region-based representation technique for objects. First, the object region is decomposed into some simple unit forms (such as polygons), and then the object region is expressed by means of these simple forms. Also refers to a class of algorithms that divide an image or a region into a set of regions. See region-based segmentation.
Quadtree [QT] A method of region decomposition; it is also a data structure. First, divide the binary image as a whole into four parts, then divide each part into four sub-parts, and continue to subdivide until the pixel attributes of each part are consistent (all object pixels or all background pixels). When represented graphically, each non-leaf node can have up to four child nodes. This data structure can also be used for grayscale images. A schematic of a quadtree is shown in Fig. 27.39.
A hierarchical structure that represents 2-D image regions, where each node represents a region and the entire image is the root of the tree. Each non-leaf node (representing a region) has four children, each representing a sub-region obtained by
Fig. 27.39 Quadtree schematic (a set of voxels and the corresponding tree representation)
Fig. 27.40 Quadtree representation diagram
dividing the region into four parts. The remaining regions continue to be divided hierarchically until the sub-regions have constant properties. A quadtree can be used to generate a compressed image structure. The 3-D extension of a quadtree is an octree. Figure 27.40 shows the entire image divided into four parts, with one part further divided into four parts.
Irregular quadtree A quadtree decomposition that does not require strict splitting into equal halves.
Binary tree A rooted planar tree with up to two children at any non-leaf node. It is also a data structure. The two children of each node are recorded as the left child and the right child, respectively, and the two subtrees rooted at them are called the left and right subtrees. With a binary tree, the image can be divided into multiple nonoverlapping parts. The specific steps are to first divide the image (along the horizontal direction) into two halves, and then divide each half (along the vertical direction) into smaller halves. The homogeneous parts of the image all end up as leaf nodes. See classification and regression trees.
Region neighborhood graph A graph in units of regions. Each region in the image is represented as a node, and the nodes corresponding to the regions in its neighborhood are connected by edges. Here a node contains the attributes of the region, and the edges indicate the relationships between the regions (such as one being above the other). See region adjacency graph.
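A minimal sketch of the quadtree decomposition described above for a square binary image whose side length is a power of two; the nested-list output format is an arbitrary choice.

```python
def quadtree(img, x0=0, y0=0, size=None):
    """Recursive quadtree decomposition of a square binary image.

    Returns a nested structure: a leaf is (x0, y0, size, value) for a uniform
    block; an internal node is a list of the four child decompositions.
    The image side length is assumed to be a power of two.
    """
    if size is None:
        size = len(img)
    block = [row[x0:x0 + size] for row in img[y0:y0 + size]]
    values = {v for row in block for v in row}
    if len(values) == 1 or size == 1:                  # uniform block: a leaf node
        return (x0, y0, size, block[0][0])
    half = size // 2
    return [quadtree(img, x0, y0, half),               # four children, one per quadrant
            quadtree(img, x0 + half, y0, half),
            quadtree(img, x0, y0 + half, half),
            quadtree(img, x0 + half, y0 + half, half)]

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 1, 1, 1],
         [1, 1, 1, 1]]
tree = quadtree(image)
```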
Fig. 27.41 Region adjacency graph and adjacency matrix examples
Region adjacency graph [RAG] A representation of the adjacency relations of the regions in an image. It is a special case of the region neighborhood graph, namely the case where regions are in contact with each other. In the region adjacency graph, a set of regions {R1, R2, . . .} is represented by a set of vertices {V1, V2, . . .}, and the adjacency relation between two regions such as R1 and R2 is represented by the edge E12. The adjacency can be 4-adjacency or 8-adjacency. The adjacency relation has reflexivity and symmetry, but it does not necessarily have transitivity. Reflexive relations include equality, equivalence, and so on. The adjacency relation can be represented by an adjacency matrix. Figure 27.41 shows a set of examples: from left to right, an image with five regions, its region adjacency graph, and its adjacency matrix. A region adjacency graph can often be generated by a segmentation algorithm. See region segmentation and region-based segmentation.
Internal feature A feature used for region-based representation of an object. The object region can be expressed by a special point set or by a feature set obtained from the pixels inside the object region.
Region signature A method of region representation. The basic idea is to project the region in different directions and so convert the 2-D problem into a 1-D problem. This requires the use of all pixels in the region, that is, the shape information of the whole region is used. The technique used in computed tomography (CT) applications is a typical example. Compare boundary signature.
Lateral histogram A histogram obtained by projecting an image onto one of several coordinate axes and summing the pixel gray values. Lateral histogram techniques transform the positioning of an object from 2-D space to 1-D space. They are more suitable for detecting and positioning objects that are insensitive to orientation (such as circular disks or holes).
Projection histogram See lateral histogram.
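As an illustration, a small sketch that builds the adjacency matrix of a region adjacency graph from a label image using 4-adjacency; the label image at the end is made up for the example and does not reproduce Fig. 27.41.

```python
import numpy as np

def region_adjacency_matrix(labels):
    """Adjacency matrix of a region adjacency graph from a label image.

    labels: 2-D integer array in which each region has a distinct label 1..N.
    Regions are adjacent (4-adjacency) if they share at least one pixel edge;
    each region is also marked adjacent to itself (reflexivity).
    """
    labels = np.asarray(labels)
    n = labels.max()
    adj = np.eye(n, dtype=int)
    # Compare each pixel with its right and lower 4-neighbors.
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        pairs = np.stack([a.ravel(), b.ravel()], axis=1)
        for i, j in pairs[pairs[:, 0] != pairs[:, 1]]:
            adj[i - 1, j - 1] = adj[j - 1, i - 1] = 1
    return adj

labels = np.array([[1, 1, 2, 2],
                   [1, 3, 3, 2],
                   [4, 4, 3, 5],
                   [4, 4, 5, 5]])
A = region_adjacency_matrix(labels)
```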
Inertia equivalent ellipse Same as equivalent ellipse of inertia.
Equivalent ellipse of inertia An ellipse equal to the inertia of an object region and normalized by the region area. Also called inertia equivalent ellipse, or simply equivalent ellipse. According to the calculation of the object's moments of inertia, the directions and lengths of the two principal axes of the inertia ellipse are completely determined. The slopes of the two principal axes (relative to the horizontal axis) are k and l, respectively:

k = [(A − B) − √((A − B)² + 4H²)] / (2H)

l = [(A − B) + √((A − B)² + 4H²)] / (2H)

The lengths of the two semi-major axes (p and q) are

p = √{2 / [(A + B) − √((A − B)² + 4H²)]}

q = √{2 / [(A + B) + √((A − B)² + 4H²)]}

Among them, A and B are the moments of inertia of the object around the X and Y coordinate axes, respectively, and H is the product of inertia. The object in a 2-D image can be regarded as a planar uniform rigid body. The equivalent ellipse of inertia established on it reflects the distribution of points in the object. With the equivalent ellipse of inertia, two irregular regions can be registered.
Ellipse of inertia An ellipse calculated from the inertia of the object region. The major and minor axes of the ellipse are determined by the inertia in two orthogonal directions. See equivalent ellipse of inertia.
Axis of least second moment The axis of least inertia of the object in the image, that is, the line around which the object rotates with minimum energy. It can provide the orientation information of the object relative to the image plane coordinates. In practice, moving the origin of the coordinate system to the center of the object region O(x, y) (x ∈ [0, M−1], y ∈ [0, N−1]), and using the angle θ to represent the angle between the vertical axis and the axis of least second moment measured counterclockwise, then

tan(2θ) = 2 Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} x y O(x, y) / [Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} x² O(x, y) − Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} y² O(x, y)]
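A small sketch of evaluating the formula above for a binary region, assuming numpy is available; arctan2 is used so the division is safe when the denominator vanishes, and the exact angle convention (axis and sign) depends on how the image coordinates are defined.

```python
import numpy as np

def region_orientation(mask):
    """Orientation angle from the second-moment formula
    tan(2*theta) = 2*sum(x*y*O) / (sum(x^2*O) - sum(y^2*O)),
    with coordinates taken relative to the region centroid of a binary region."""
    O = np.asarray(mask, dtype=float)
    ys, xs = np.nonzero(O)
    x = xs - xs.mean()                    # move the origin to the region centroid
    y = ys - ys.mean()
    sum_xy = np.sum(x * y)
    sum_xx = np.sum(x * x)
    sum_yy = np.sum(y * y)
    return 0.5 * np.arctan2(2.0 * sum_xy, sum_xx - sum_yy)

mask = np.zeros((20, 20), dtype=int)
mask[4:16, 8:11] = 1                      # an elongated vertical bar
theta = region_orientation(mask)
```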
Chapter 28
Object Description
One of the main tasks of image analysis is to obtain the description values of objects from the image (Zhang 2017b). A key issue is to select suitable features to describe the object (Russ and Neal 2016; Sonka et al. 2014).
28.1
Object Description Methods
In order to achieve the description of the object, on the one hand, the characteristics of the object must be described, and on the other hand, the features of object region must be described (Gonzalez and Woods 2018; Sonka et al. 2014; Zhang 2009).
28.1.1 Object Description Object description On the basis of object representation, the object characteristics are expressed more abstractly. It can be divided into: (1) description based on boundary, (2) description based on region, and (3) description of object relationship. Object characteristics Attributes that reflect the nature of a certain aspect of an object. Common object characteristics include color, texture, shape, spatial relationship, structural characteristics, motion, saliency, etc. Visual appearance The observed appearance of an object depends on several factors, such as the orientation and reflectivity of the object, surface structure, lighting, the spectral sensitivity of the sensor, the light reflected from the scene, and the depth order of other objects in the scene (they may block each other).
Dynamic appearance model A model that describes an object or scene that changes its appearance over time.
Object feature Parameters that express and describe the characteristics of the object region in the image. It can be obtained with the help of various descriptors for boundary and descriptors for region.
Description based on region The object description uses all the object region pixels (including the internal pixels of the region and the boundary pixels of the region).
Description based on boundary A method of describing an object by describing its boundaries. Also called description based on contour.
Boundary description A functional, geometric, or set-theoretic description of the boundaries of a region.
Descriptor An operator, feature, measure, or parameter that describes the object characteristics. A good descriptor should be insensitive to the scale, translation, rotation, etc. of the object while distinguishing between different objects as much as possible.
Descriptor for boundary Descriptor used to describe the characteristics of the object boundary. Also called descriptor for contour.
Descriptor for contour Same as descriptor for boundary.
Descriptor for region Descriptor used to describe the characteristics of the object region.
Region consistency A characteristic that describes the nature of the object region. Refers to the same or similar characteristics throughout the object region. For example, regions with relatively consistent texture characteristics in the original image should appear as regions with consistent texture in the segmentation results.
Triangle similarity An object region description feature. Letting P1, P2, and P3 be three points on the object outline, and d(Pi, Pj) be the Euclidean distance between any two of them, then L = d(P1, P2) + d(P2, P3) + d(P3, P1) is the perimeter of the triangle. The triangle similarity is defined by the following vector:
(d(P1, P2)/L, d(P2, P3)/L)
This corresponds to the ratios of two of the triangle's side lengths to its perimeter. This feature is invariant to translation, rotation, and scaling.
Position descriptor Descriptor used to describe the spatial position of the object in the image.
Image descriptor It usually corresponds to a set of short vectors derived from the image and has invariance to commonly used transformations. It provides an overall description comparable to other descriptors in the database, allowing matching based on certain distance measures. Typical examples are SIFT and SURF.
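A minimal sketch of the triangle similarity vector defined above (plain Python; the three points are arbitrary contour samples):

```python
import math

def triangle_similarity(p1, p2, p3):
    """Triangle similarity vector (d(P1,P2)/L, d(P2,P3)/L) of three contour
    points; invariant to translation, rotation, and scaling."""
    d = math.dist                              # Euclidean distance
    L = d(p1, p2) + d(p2, p3) + d(p3, p1)      # perimeter of the triangle
    return (d(p1, p2) / L, d(p2, p3) / L)

print(triangle_similarity((0, 0), (4, 0), (0, 3)))   # (0.333..., 0.416...)
```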
28.1.2 Feature Description
Point feature A type of image feature that occupies a small part of the image, ideally only one pixel, so it is local in nature. Typical examples are corner points (see corner detection), edge pixels, etc. Note that although a point feature only occupies a single pixel, it is often necessary to define a neighborhood. For example, a pixel is characterized by sharp changes in the image values of a small neighborhood around the pixel.
Ultimate point A center point describing an irregularly shaped object. See Fig. 28.1, where the ultimate point is the center of the inscribed circle of the object, the geometric center point is the center of the circumscribed circle of the object, the centroid point is the center of mass of the object, and the density-weighted center point is the center obtained after weighting the object mass according to the image brightness. Regularly shaped objects are well described by their geometric centers and centroids. For irregularly shaped objects, it is often necessary to consider the ultimate point.
Fig. 28.1 Descriptive points of irregularly shaped objects
(Labels in Fig. 28.1: circumscribed circle, geometric center point, centroid, inscribed circle, density-weighted center, ultimate point.)
With the language of mathematical morphology, the ultimate point corresponds to the result of ultimate erosion. In fact, when eroding the object in Fig. 28.1, the ultimate point will be the last eroded point.
Centroid Same as center of mass.
Center of mass The point in an object where its equivalent gravity acts. If an object can be described as a multidimensional point set {x_i} containing N points, the coordinates of its center of mass are [Σ_{i=1}^{N} x_i f(x_i)] / [Σ_{i=1}^{N} f(x_i)], where f(x_i) is the value (pixel value) of the image at x_i.
Geometric center See ultimate point.
Density-weighted center The center obtained by weighting the geometric distribution of the object with the image brightness. See the ultimate point drawing (Fig. 28.1). Generally, when calculating the center of an object in an image, the mass of each pixel is considered to be the same. When calculating the density-weighted center of an object in the image, the mass of a pixel is proportional to its brightness. See centroid of region.
Global point signature A special vector. For a fixed point x, its global point signature G(x) is a vector whose components are the eigenfunctions of the Laplace–Beltrami operator evaluated at x.
Function-based model The object representation and object description based on the functions/features of an object (such as the existence of the object, the movement of the object, and the interaction between objects) rather than its geometric characteristics.
Functional representation See function-based model.
Linear features A general term describing a class of characteristics. These features are either globally or locally straight, such as straight lines or straight edges.
Global property The characteristics of a mathematical object that depend on all the components of the object. For example, the average density of an image is a global characteristic because it depends on all the pixels in the image.
String A form of data representation. It is a finite sequence of characters, letters, numbers, etc.
String descriptor A relational descriptor. Suitable for describing the recurring relationship structure of basic elements. Basic elements are first represented by characters, and then strings are constructed with the help of formal grammar and rewriting rules. In other words, first use strings (a 1-D structure) to represent structural units, and then use a formal method (such as a formal grammar) to describe the connections between these units. See structural approach.
String description See string descriptor.
String structure A structural pattern descriptor in image pattern recognition. It is more suitable to represent patterns with recurring structures of basic elements.
String grammar A set of syntax rules for structural pattern recognition of string structures. It can control/regulate the process of generating sentences from symbols in the character set. A set of sentences produced by a grammar G is called a language and is written as L(G). Sentences are strings of symbols; these strings represent a certain pattern, and the language corresponds to a pattern class.
28.2 Boundary-Based Description
Boundary-based description focuses on the characteristics of the object boundary (Davies 2012; Gonzalez and Wintz 1987; Pratt 2007). Curvature is a commonly used descriptor (Zhang 2009).
28.2.1 Boundary
Boundary property Boundary characteristics such as arc length, curvature, etc.
Diameter of boundary A simple boundary parameter giving the distance between the two furthest points on the boundary. Can be used as a descriptor for boundary. If the boundary of a region is denoted as B, the diameter Dia_d(B) can be calculated by the following formula:
Dia_d(B) = max_{i,j} [D_d(b_i, b_j)],  b_i ∈ B, b_j ∈ B
(Fig. 28.2 Feret's diameter indication: two parallel tangent lines and the maximum diameter of the region.)
where D_d() can be any distance measure. There are three commonly used distance measures: Euclidean distance D_E(), city-block distance D_4(), and chessboard distance D_8(). If D_d() is measured using different distances, the resulting Dia_d(B) will be different.
Diameter of contour Same as diameter of boundary.
Boundary diameter The length of the straight line between the two farthest points on the boundary of the region (different distance metrics can be used). Sometimes this line is also called the major axis of the boundary (the longest line segment that is perpendicular to the major axis and whose endpoints lie on the boundary is called the minor axis of the boundary). It is a simple boundary parameter that can be used as a descriptor for boundary.
Extremes of a region The farthest pair of points in the object region, which are considered to determine the "diameter" of the object.
Feret's diameter The distance between two parallel lines tangent to opposite sides of the boundary of the object region. The maximum, minimum, and mean values of Feret's diameter are often used to describe statistical characteristics such as the size of multiple regions. An example is shown in Fig. 28.2.
Length of boundary A simple boundary parameter that can be used as a descriptor for boundary. It gives the perimeter of the boundary B of the object region, which reflects a global characteristic of the boundary. If the boundary has been represented by a unit-length chain code, the length of the boundary can be calculated by adding the number of horizontal and vertical code elements to √2 times the number of diagonal code elements. Sorting all pixels on the boundary from 0 to K − 1 (assuming there are K boundary points), the boundary length ||B|| is then
||B|| = #{k | (x_{k+1}, y_{k+1}) ∈ N_4(x_k, y_k)} + √2 #{k | (x_{k+1}, y_{k+1}) ∈ N_D(x_k, y_k)}
Among them, # denotes the number of elements in a set, and k + 1 is taken modulo K. The first term on the right side of the above formula corresponds to line segments between two 4-adjacent (edge-sharing) pixels, and the second term corresponds to line segments between two diagonally adjacent pixels.
Boundary length The length of the boundary of an object. See perimeter.
Border length Same as length of boundary.
Length of contour Same as length of boundary.
Perimeter of region A simple region parameter that can be used as a global descriptor for region. Compare length of boundary.
Perimeter 1. In a binary image, the collection of foreground pixels that touch the background. 2. The length of the path through the above pixels.
Shape number A descriptor for boundary based on chain code expression. Depending on the starting position of the chain code, a boundary expressed by a chain code can have multiple first-order difference sequences (the result of subtracting the chain code elements in pairs). The shape number of a boundary is the sequence that has the smallest value (treating the sequence as a single number) among these difference sequences. In other words, the shape number is the (chain-coded) difference code with the smallest value. According to the above definition, the shape number is rotation-invariant and insensitive to the starting point used to calculate the original sequence.
Boundary moment Consider the boundary of the object region as a series of curve segments, take all points on the curve segments, calculate their mean, and then calculate the moments of each order relative to the mean. Can be used as a typical descriptor for boundary.
Boundary-invariant moment Result of scale normalization of boundary moments. Different from the calculation of region-invariant moments, the relationship γ = p + q + 1 should be used when calculating the normalized boundary central moment.
Line moment Similar to traditional region moments, but calculated only at the points [x(s), y(s)] along the object contour. The pq-th line moment is ∫ x(s)^p y(s)^q ds. An infinite set of line moments uniquely determines the contour.
Line moment invariant A set of constant values calculated from line moments. Their values are constant for coordinate transformations such as translation, rotation, and scaling.
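The chain-code-based quantities above are straightforward to compute. The following is a minimal, illustrative sketch (plain Python, assuming 8-direction Freeman chain codes; not taken from any particular library) of the length of boundary and the shape number:

```python
import math

def boundary_length(chain):
    """Length of a boundary from an 8-direction Freeman chain code: even
    codes (0, 2, 4, 6) are unit horizontal/vertical moves, odd codes are
    diagonal moves of length sqrt(2), matching the ||B|| formula above."""
    n_even = sum(1 for c in chain if c % 2 == 0)
    return n_even + math.sqrt(2) * (len(chain) - n_even)

def shape_number(chain):
    """Shape number: the circular first difference of the chain code,
    rotated so that the resulting sequence has the smallest value."""
    k = len(chain)
    diff = [(chain[(i + 1) % k] - chain[i]) % 8 for i in range(k)]
    return min(tuple(diff[i:] + diff[:i]) for i in range(k))

code = [0, 0, 6, 6, 4, 4, 2, 2]       # a small square boundary
print(boundary_length(code))          # 8.0
print(shape_number(code))             # (0, 6, 0, 6, 0, 6, 0, 6)
```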
28.2.2 Curvature
Curvature A parameter describing the degree of deviation of an object (curve, surface, etc.) from another object (line, plane, etc.) that is considered flat. It represents the rate of change of slope and is applicable to various quantities such as numbers, vectors, and tensors. The curvature of the object contour or of points on the object surface describes the local geometric properties of the contour or surface. Mathematically, given a curve x(s), its curvature k is the length of its second-order derivative |∂²x(s)/∂s²| as a function of the arc length s. A related extended definition can be used for surfaces, the difference being that for a sufficiently smooth surface there are two different principal curvatures at each point.
Curvature of boundary A simple boundary parameter that can be used as a descriptor for boundary. When describing the object contour, it gives the change in direction of each point on the contour as the contour is traversed. In practical applications, the sign of the curvature is mainly used to reflect the concavity or convexity at each point of the contour. If the curvature is greater than zero, the curve is concave toward the positive direction of the point's normal. If the curvature is less than zero, the curve is concave toward the negative direction of the point's normal. If the contour is followed clockwise, a point with curvature greater than zero indicates that the point is part of a convex segment; otherwise, it is part of a concave segment.
Discrete curvature In discrete space, the curvature calculated from the discrete contour point sequence of a discrete object. It reflects the local direction change of the contour at each discrete contour point. See curvature of boundary.
Curvature estimates The calculation of curvature is generally not affected by changes in viewpoint, but it is very sensitive to quantization noise.
Curvature primal sketch Multi-scale representation of the curvature of a plane curve.
Metric determinant 1. A measure of curvature. 2. For a surface, it is the square root of the determinant of the first fundamental form matrix of the surface.
Local curvature estimation The part of surface or curve shape estimation that estimates the curvature at a given point in the image based on the values of the surface or curve near that point. For example, the curve y = sin(x) has zero local curvature at the point x = 0 (i.e., the curve is locally straight), although the curve has non-zero local curvature at other points (such as x = π/4). See differential geometry.
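As an illustration of local curvature estimation on a sampled contour, here is a minimal sketch using central finite differences (one of several possible estimators; the step size and finite-difference scheme are assumptions of this example):

```python
import numpy as np

def discrete_curvature(contour, step=2):
    """Estimate the curvature at each point of a closed contour given as an
    (N, 2) array of (x, y) points, using central differences over +/- step
    points: k = (x'y'' - y'x'') / (x'^2 + y'^2)^(3/2).  The sign depends on
    the traversal direction of the contour."""
    c = np.asarray(contour, dtype=float)
    dx  = (np.roll(c[:, 0], -step) - np.roll(c[:, 0], step)) / (2.0 * step)
    dy  = (np.roll(c[:, 1], -step) - np.roll(c[:, 1], step)) / (2.0 * step)
    ddx = (np.roll(c[:, 0], -step) - 2 * c[:, 0] + np.roll(c[:, 0], step)) / step**2
    ddy = (np.roll(c[:, 1], -step) - 2 * c[:, 1] + np.roll(c[:, 1], step)) / step**2
    denom = (dx * dx + dy * dy) ** 1.5 + 1e-12      # guard against zero speed
    return (dx * ddy - dy * ddx) / denom

# Points on a circle of radius 10: the estimates should be close to 1/10
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle = np.stack([10 * np.cos(t), 10 * np.sin(t)], axis=1)
print(discrete_curvature(circle).mean())            # ~0.1
```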
(Fig. 28.3 Curvature center diagram: the osculating circle with center O, tangent to the curve at point P.)
(Fig. 28.4 Maximum and minimum curvature points marked on a contour.)
Center of curvature The center of the osculating circle. As shown in Fig. 28.3, the osculating circle at a point P of a plane curve is tangent to the curve at P, has the same curvature as the curve at P, and lies toward the concave (inner) side of the curve; its center O is the center of curvature at P.
Point of extreme curvature The point at which the curvature takes an extreme value (such as a maximum or minimum). An example is shown in Fig. 28.4.
Convexity The nature of the object shape associated with being convex, that is, the nature of the contour line bending outward (such as the interior angles of a circle or sphere being less than 180°). Mathematically, it can be defined in terms of the convex hull perimeter and the contour perimeter (e.g., their ratio). See convex hull. Compare non-convexity.
Twisting 3-D deformation of an object caused by irregular forces (in different directions). Compare bending. See skew and warping.
Tortuosity A descriptor for boundary that reflects the degree of distortion of the curve. It can be represented and calculated with the help of slope chain codes. Its value is the sum of the absolute values of all the slope chain codes. Given a curve, let t denote its tortuosity, N the number of elements in the slope chain code of the curve, and s_i the slope of the i-th chain code element; then
t = Σ_{i=1}^{N} |s_i|
Twisted cubic The curve (1, t, t², t³) in 3-D projective space, or any projective transformation of it. Its general form is
(Fig. 28.5: a skew symmetry axis, the line angle ϕ, and the equal distances d to the axis; see the skew symmetry entry below.)
[x1]   [a11 a12 a13 a14] [1 ]
[x2] = [a21 a22 a23 a24] [t ]
[x3]   [a31 a32 a33 a34] [t²]
[x4]   [a41 a42 a43 a44] [t³]
The projection of a twisted cubic onto a 2-D image is a rational cubic (third-order) spline.
Skew symmetry A characteristic of an object contour on a plane. A skew symmetric contour is a planar contour where each straight line oriented at an angle ϕ relative to a specific axis intersects the contour at two points whose distances to the axis are equal. The abovementioned specific axis is called a skew symmetry axis of the contour, as shown in Fig. 28.5.
Shape descriptor based on polygonal representation Shape descriptors describing different properties of the object region that are obtained through the polygonal representation of the object. The polygonal representation of the object can be obtained by polygonal approximation.
28.3 Region-Based Description
Region-based description focuses on the characteristics of the object region (Davies 2005; Marchand-Maillet and Sharaiha 2000; Shapiro and Stockman 2001). Moments are commonly used global features for describing region properties (Zhang 2017b).
28.3.1 Region Description
Descriptor for region Descriptor used to describe the characteristics of the object region.
Area statistics Statistics on the object region in the image, such as region area, region center of gravity, region perimeter, and region second-order moment.
Region locator descriptor A position descriptor recommended by the international standard MPEG-7. A data structure similar to an R-tree is used, and the position of the object is described by a shrinkable rectangle or polygon that surrounds or contains the object.
Centroid of region A simple region parameter that can be used as a global region descriptor. It gives the position of the center of mass of the object region, referred to simply as the centroid. The centroid coordinates of region R are calculated from all points (x, y) that belong to the region:
x̄ = (1/A) Σ_{(x, y)∈R} x
ȳ = (1/A) Σ_{(x, y)∈R} y
Among them, A is the area of the object region. It can be seen that although the centroid is calculated from pixel-accuracy (integer) data, sub-pixel accuracy (real numbers) can often be obtained, so the centroid of region is a feature at the sub-pixel level. This concept and calculation method can be easily extended to 3-D space to represent the three-dimensional centroid.
Region centroid Same as centroid of region.
Coordinates of centroid The coordinates of the center of gravity of the region. Equal to the first moments of the region's coordinates along the coordinate axes.
Barycentric coordinates A scheme for position-invariant geometry where the position of a point is represented as a weighted sum of a set of control points. For example, in a triangulated mesh, the points on the surface of the mesh can be described as a weighted sum of the vertices of a triangle. As in Fig. 28.6, the barycentric coordinates of point X are (a, b, c), that is, X = aA + bB + cC. Here A, B, and C represent the triangle vertices.
(Fig. 28.6 Barycentric coordinates of a triangle: point X inside triangle ABC.)
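A minimal sketch of the centroid of region defined above, together with the density-weighted center mentioned earlier (NumPy; the function names are illustrative):

```python
import numpy as np

def region_centroid(mask):
    """Centroid of a binary region: x_bar = (1/A) * sum of x and
    y_bar = (1/A) * sum of y over all region pixels, as in the formula above.
    Returns sub-pixel (float) coordinates."""
    ys, xs = np.nonzero(mask)
    A = xs.size                        # area = number of region pixels
    return xs.sum() / A, ys.sum() / A

def density_weighted_center(image, mask):
    """Density-weighted center: pixel coordinates weighted by brightness."""
    ys, xs = np.nonzero(mask)
    w = image[ys, xs].astype(float)
    return (xs * w).sum() / w.sum(), (ys * w).sum() / w.sum()

m = np.zeros((8, 8), bool)
m[2:5, 3:7] = True
print(region_centroid(m))              # (4.5, 3.0)
```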
Fig. 28.7 Object net area measurement
Fig. 28.8 Object filled area measurement
Fig. 28.9 Object taut-string area measurement
Area of region A simple region parameter that can be used as a global descriptor for region. It gives the coverage of the object region and reflects the size of the region. The best way to calculate the area of a region is to count the number of pixels belonging to the region.
Pixel counting A simple algorithm for determining the area of a single region in an image. It counts the number of pixels that make up the region. See region.
Net area A way of measuring the object region that is consistent with the common definition of the object. As shown in Fig. 28.7, the net area measures only the dark part (the letter R). Compare filled area and taut-string area.
Filled area Another way of measuring the object region. As shown in Fig. 28.8, the filled area measures not only the dark part (the letter R) but also the light part (the part enclosed above in the letter R) surrounded by the dark part. For example, in remote sensing images, it is sometimes necessary to consider not only land but also lakes. Compare net area and taut-string area.
Taut-string area Another result of measuring the object region. As shown in Fig. 28.9, the taut-string area, corresponding to the convex hull of the object, includes not only the dark part (the letter R) and the light part (the enclosed upper part of the letter R) but also the textured parts (the parts of the letter R that are not enclosed, on its right and lower sides). For example,
in remote sensing images, sometimes not only land and lakes but also bays (or rivers and their basins) are considered. Compare net area and filled area.
Rubber band area Same as taut-string area.
Area-perimeter ratio A region parameter that can be used as a shape complexity descriptor. It is the ratio of the area of region to the border length. The circular object has the largest area-perimeter ratio, that is, it can surround the largest area at a given perimeter.
Density of a region A physical quantity that reflects the reflection characteristics of the region corresponding to the scene. Common descriptors based on density of a region include (1) transmissivity, (2) optical density, and (3) integrated optical density.
Optical density [OD] A logarithmic measure of the luminous flux passing through an optical medium along a path. The optical density is equal to the base-10 logarithm of the inverse of the transmissivity. In a homogeneous medium, the optical density is directly proportional to the path length and the transmittance coefficient. Can also be used as a descriptor based on region density. The expression is as follows:
OD = lg(1/T) = −lg T
where T represents transmissivity.
Optical depth Same as optical density.
Integrated optical density [IOD] A descriptor that gives the sum of the optical density values of individual pixels in the measured image region based on the region density. For an image f(x, y) of size M × N, its integrated optical density is
IOD = Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} f(x, y)
Densitometry A set of techniques for estimating material density from an image. For example, the estimation of bone density (bone density measurement) in the medical field. See integrated optical density. Region density feature Features that describe the gray or color characteristics of the object region. If the object region is obtained by means of image segmentation, the density characteristics of the object region need to be obtained by combining the original image and
the segmented image. Commonly used region density features include the maximum, minimum, median, average, variance, and higher-order moments of the gray level (or of the various color components) of the object. Most of them can be obtained from the histogram of the image.
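A minimal sketch of the integrated optical density and some common region density features (NumPy; summing the image values themselves as f(x, y) and selecting the object with a boolean `mask` are assumptions of this example):

```python
import numpy as np

def integrated_optical_density(image, mask=None):
    """IOD: sum of the values of the pixels in the measured region, as in the
    formula IOD = sum_x sum_y f(x, y); `mask` restricts the sum to a region."""
    f = np.asarray(image, dtype=float)
    return f[mask].sum() if mask is not None else f.sum()

def region_density_features(image, mask):
    """Common region density features of the gray levels inside a region."""
    vals = np.asarray(image, dtype=float)[mask]
    return {"max": vals.max(), "min": vals.min(), "median": np.median(vals),
            "mean": vals.mean(), "variance": vals.var()}
```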
28.3.2 Moment Description
Moment A way of summarizing the distribution of pixel positions or pixel values. Moments are parameterized families of values. For example, if I(x, y) is a binary image, then Σ_{x,y} I(x, y) x^p y^q gives its pq-th moment m_pq. See grayscale moment and moment of intensity.
Moment invariant A function of the values of an image that retains the same value even when the image undergoes certain transformations. For example, when m_pq is the central moment of a binary image region and A is the area of the region, the value of [(m20)² + (m02)²]/A² does not change with the coordinate transformation. That is, when the image undergoes translation, rotation, or scaling, this value is still the original value.
Moment characteristic See moment invariant.
Grayscale moment Moments based on the gray values of pixels in an image or a region in an image. Compare binary moment.
Gray value moment An extension of the region moment that considers the pixel gray values. It can also be seen as the region moment obtained by weighting pixel coordinates with pixel gray values. For example, the gray value area of a region (the area is a zero-order moment) is the "volume" of the gray value function within the region. In thresholding of image segmentation, the gray value moment can be used to achieve "fuzzy" segmentation, so that smaller-sized objects can obtain feature measurement values with higher accuracy than using the region moment.
Moment of intensity An image moment that takes into account the gray value and position of image pixels. For example, if g(x, y) is a grayscale image, then Σ_{x,y} g(x, y) x^p y^q gives the grayscale pq-th moment g_pq. See grayscale moment.
Region moment In image engineering, a region-based object descriptor. Calculated by using all pixels belonging to the region, reflecting the nature of the entire region. See invariant moments.
Krawtchouk moments Discrete orthogonal moments based on Krawtchouk polynomials (series of binomial form) K_n(x; p, N) = Σ_{k=0}^{N} a_{k,n,p} x^k.
Zernike moment The dot product of an image with one of the Zernike polynomials. The Zernike polynomial U_nm(ρ, ϕ) = R_nm(ρ) exp(imϕ) is defined on the 2-D polar coordinates (ρ, ϕ) in the unit circle. When an image is projected, data outside the unit circle is generally not considered. The real and imaginary parts are called even polynomials and odd polynomials, respectively. The radial function R_n^m(t) is
R_n^m(t) = Σ_{k=0}^{(n−m)/2} (−1)^k [(n − k)! / (k! ((n + m)/2 − k)! ((n − m)/2 − k)!)] t^{n−2k}
It can also be used to describe a kind of continuous orthogonal region moments, based on Zernike polynomials, that can be used as descriptors for region. Zernike moments are not affected by region rotation, are robust to noise, are sensitive to changes in shape, and have little information redundancy when describing regions.
Legendre moment A special moment. For a piecewise continuous function f(x, y), the Legendre moment of order (m, n) is
[(2m + 1)(2n + 1)/4] ∫_{−1}^{1} ∫_{−1}^{1} P_m(x) P_n(y) f(x, y) dx dy
Among them, P_m(x) is a Legendre polynomial of order m, and P_n(y) is a Legendre polynomial of order n. Legendre moments can be used to characterize image data, and images can also be reconstructed from infinite moments.
Second-moment matrix For a region R, a matrix composed of its second-order moments. It can be written as
M_sm = [ m_rr  m_rc ]
       [ m_rc  m_cc ]
Among them, each second-order moment element can be calculated as follows:
m_rr = (S_r²/A) Σ_{(r, c)∈R} (r − r̄)²
m_rc = (S_r S_c/A) Σ_{(r, c)∈R} (r − r̄)(c − c̄)
m_cc = (S_c²/A) Σ_{(r, c)∈R} (c − c̄)²
where A is the area of the region; S_r and S_c are the scaling factors of the row and column, respectively; and (r̄, c̄) are the coordinates of the centroid of the region.
Second moment of region See second-moment matrix.
Shape moment Moments for a 2-D region or a 3-D volume.
3-D moments Moments calculated from data in 3-D space.
Invariant moments Seven region moments that can be used as descriptors for region; they are invariant to translation, rotation, and scale transformations. They are combinations of the second- and third-order normalized central moments, where N_pq denotes the normalized central moment of order p + q of the region:
T1 = N20 + N02
T2 = (N20 − N02)² + 4 N11²
T3 = (N30 − 3N12)² + (3N21 − N03)²
T4 = (N30 + N12)² + (N21 + N03)²
T5 = (N30 − 3N12)(N30 + N12)[(N30 + N12)² − 3(N21 + N03)²] + (3N21 − N03)(N21 + N03)[3(N30 + N12)² − (N21 + N03)²]
T6 = (N20 − N02)[(N30 + N12)² − (N21 + N03)²] + 4 N11 (N30 + N12)(N21 + N03)
T7 = (3N21 − N03)(N30 + N12)[(N30 + N12)² − 3(N21 + N03)²] + (3N12 − N30)(N21 + N03)[3(N30 + N12)² − (N21 + N03)²]
Region-invariant moment See invariant moments.
Hu moment A set of seven shape moments invariant to translation, rotation, and scaling, composed of combinations of central moments. See invariant moments.
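A minimal sketch that computes the seven invariant (Hu) moments from the normalized central moments N_pq defined above (NumPy; OpenCV users can obtain the same quantities via cv2.moments and cv2.HuMoments):

```python
import numpy as np

def hu_invariant_moments(image):
    """The seven invariant (Hu) moments T1..T7 of a grayscale or binary image,
    built from normalized central moments N_pq = mu_pq / m00^(1 + (p+q)/2)."""
    f = np.asarray(image, dtype=float)
    y, x = np.mgrid[:f.shape[0], :f.shape[1]]
    m00 = f.sum()
    xc, yc = (x * f).sum() / m00, (y * f).sum() / m00

    def N(p, q):                      # normalized central moment of order p + q
        mu = ((x - xc) ** p * (y - yc) ** q * f).sum()
        return mu / m00 ** (1 + (p + q) / 2.0)

    N20, N02, N11 = N(2, 0), N(0, 2), N(1, 1)
    N30, N03, N21, N12 = N(3, 0), N(0, 3), N(2, 1), N(1, 2)
    T1 = N20 + N02
    T2 = (N20 - N02) ** 2 + 4 * N11 ** 2
    T3 = (N30 - 3 * N12) ** 2 + (3 * N21 - N03) ** 2
    T4 = (N30 + N12) ** 2 + (N21 + N03) ** 2
    T5 = ((N30 - 3 * N12) * (N30 + N12)
          * ((N30 + N12) ** 2 - 3 * (N21 + N03) ** 2)
          + (3 * N21 - N03) * (N21 + N03)
          * (3 * (N30 + N12) ** 2 - (N21 + N03) ** 2))
    T6 = ((N20 - N02) * ((N30 + N12) ** 2 - (N21 + N03) ** 2)
          + 4 * N11 * (N30 + N12) * (N21 + N03))
    T7 = ((3 * N21 - N03) * (N30 + N12)
          * ((N30 + N12) ** 2 - 3 * (N21 + N03) ** 2)
          + (3 * N12 - N30) * (N21 + N03)
          * (3 * (N30 + N12) ** 2 - (N21 + N03) ** 2))
    return T1, T2, T3, T4, T5, T6, T7
```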
Orthogonal Fourier-Mellin moment invariants Invariant moments based on Fourier-Mellin moments or rotation moments.
Invariant Legendre moments An extension of Legendre moments. They are not affected by object rotation, translation, and scaling when calculated for an object.
Hahn moment Moments based on Hahn polynomials. Given a digital image f(x, y) of size N × N, the Hahn moment of order (m + n) is
H_{m,n} = Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} f(x, y) H_m(x, N) H_n(y, N)
Among them, Hm(x, N) and Hn(y, N ) are Hahn polynomials (orthogonal polynomials defined by generalized hypergeometric functions) of m and n order, respectively.
28.4 Descriptions of Object Relationship
There are often multiple independent or related objects (or parts/sub-objects of the same object) in the actual image, and there are various relative spatial relationships between them that also need to be described (Sonka et al. 2014; Zhang 2017b).
28.4.1 Object Relationship
Descriptions of object relationship An object description method that describes various relative spatial relationships between multiple objects in an image, including relationships between boundaries, between regions, or between boundaries and regions.
Relation-based method A class of shape description methods that decompose complex shape objects into simple primitives. Object shape characteristics are described by describing primitive properties and the relationships between primitives. It is similar to the structural method in texture analysis.
Distribution of point objects A model describing the distribution of point object sets. Commonly used distribution models include Poisson distribution (a random distribution), cluster distribution, and uniform distribution (a regular distribution). To calculate the distribution, the image region is often divided into sub-regions (such as squares) for statistics. Letting
μ be the mean of the number of objects in the sub-regions and σ² be the corresponding variance; then:
1. If σ² = μ, the point objects have a Poisson distribution.
2. If σ² > μ, the point objects have a cluster distribution.
3. If σ² < μ, the point objects have a uniform distribution.
Relational descriptor Descriptor used to describe the relative spatial relationship between multiple objects in an image.
Wavelet descriptor Descriptors based on wavelet decomposition coefficients of the original signal; similar to Fourier descriptors, they can be used to describe the object and its shape. See wavelet transform.
Wavelet boundary descriptor An object shape descriptor based on coefficients obtained by wavelet transforming the object contour. Compared with the Fourier boundary descriptor, the wavelet boundary descriptor has a series of advantages.
28.4.2 Image Topology
Image topology A domain of topological properties in image technology. It focuses on the use of concepts such as neighborhoods and connections to study basic image characteristics (such as the number of connected components or the number of holes in the object), often based on binary images and using morphological operations.
Euler number A quantitative topological descriptor in image engineering. For example, the number of connected components in a region minus the number of holes. Also called genus. Commonly used are the Euler number of region, Euler numbers of polygonal network, Euler number of 3-D object, and so on.
Euler-Poincaré characteristic Same as Euler number.
Euler number of region A topological descriptor describing region connectivity. It is a global characteristic parameter. For a planar region, the Euler number E can be represented by the number of connected components C in the region and the number of holes H in the region (this is called Euler's formula):
E = C − H
Considering the connectivity used in the calculation, two types of Euler numbers of region can also be distinguished: 4-connected Euler numbers and 8-connected Euler numbers.
Euler's formula See Euler number of region.
4-connected Euler number The Euler number obtained by subtracting the number of 8-connected holes from the number of 4-connected components.
8-connected Euler number The Euler number obtained by subtracting the number of 4-connected holes from the number of 8-connected components.
Euler vector A generalization of the Euler number of region. Region Euler numbers are generally calculated with the help of binary images. A binary image is regarded as a bitmap, and a grayscale image is bit-plane-decomposed to obtain several bitmaps. A vector composed of the Euler numbers of region obtained from these bitmaps is an Euler vector. Euler vectors can describe grayscale images or regions within them. If a Gray code is used in the bit-plane decomposition, the resulting Euler vector is less sensitive to noise than when no Gray code is used.
Euler number of 3-D object A topological descriptor describing 3-D object connectivity. It is a global characteristic parameter. The Euler number E of a 3-D object can be represented by the number of (volume) connected components C, the number of cavities A, and the genus G:
E = C + A − G
Cavity A concept used when calculating the Euler numbers for 3-D objects. It represents the number of connected regions (2-D objects) or connected bodies (3-D objects) completely surrounded by the background. See non-separating cut.
Handle A concept used when calculating the Euler numbers for 3-D objects. Often related to holes (i.e., channels or tunnels) with two exits on the object surface, the number of which is also called the number of channels or the genus.
Connectivity number A topological descriptor for the object region. It reflects the structural information of the region. Consider the 8-neighborhood pixels q_i (i = 0, 1, . . ., 7) of a pixel p, starting from any one of the 4-neighborhood pixels and arranged clockwise around p. Assign each q_i the value 0 or 1 according to whether the pixel is white or black. The connectivity number C8(p) is the number of 8-connected components in the 8-neighborhood of p, which can be written as
C8(p) = q0 q2 q4 q6 + Σ_{i=0}^{3} (q̄_{2i} − q̄_{2i} q̄_{2i+1} q̄_{2i+2})
where q̄_i = 1 − q_i and subscripts are taken modulo 8.
Crossing number 1. A topological descriptor that reflects the structural information of the object region. Consider the 8-neighborhood pixels q_i (i = 0, 1, . . ., 7) of a pixel p, starting from the position of any one of the 4-neighborhood pixels and numbered clockwise around p. Assign each q_i the value 0 or 1 according to whether the pixel is white or black. The crossing number S4(p) is the number of 4-connected components in the 8-neighborhood of p, which can be written as
S4(p) = Π_{i=0}^{7} q_i + (1/2) Σ_{i=0}^{7} |q_{i+1} − q_i|
where the subscript i + 1 is taken modulo 8.
2. For a graph that can be drawn in multiple ways, the minimum number of crossings between the arcs of the graph over all possible drawings. The crossing number of a planar graph is zero. The crossing number of the graph shown in Fig. 28.10 is one.
(Fig. 28.10 Graph with crossing number one: vertices A, B, C, D, E and a single crossing.)
(Fig. 28.11 Non-separating cut and separating cut: an object with a cavity, one separating cut, and two non-separating cuts.)
Non-separating cut When the object is cut, a cut that completely passes through the object but does not generate new connected components. As shown in Fig. 28.11, the vertical dotted line represents a separating cut (it divides the object into two connected
components), while the other two dotted lines are non-separating cuts (neither of them divides the object into two connected components). However, once a non-separating cut is made, it is not possible to make another non-separating cut in the figure, because any new cut would split the object into two pieces. The maximum number of non-separating cuts can be used to define the genus of an object. The genus of the object in Fig. 28.11 is 1.
Genus In the study of topology, the number of "holes" on the surface of an object. Equivalently, when cutting the object, the maximum number of non-separating cuts through the connected region (2-D) or connected body (3-D). In image techniques, it sometimes refers to a simple identification feature used for object recognition.
Tunnel A hole in 3-D space with two exit openings on the surface of the object. See non-separating cut. The number of tunnels (channels) needs special consideration in the calculation of connected components and Euler numbers.
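To make the Euler number of region concrete, here is a minimal sketch of E = C − H for a binary image (SciPy's ndimage.label provides the connected components; counting holes as background components that do not touch the image border is an assumption of this example):

```python
import numpy as np
from scipy import ndimage

FOUR = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])   # 4-connectivity
EIGHT = np.ones((3, 3), int)                          # 8-connectivity

def euler_number(mask, connectivity=4):
    """Euler number E = C - H of a binary region: the 4-connected Euler number
    uses 8-connected holes and vice versa; holes are background components
    that do not touch the image border."""
    mask = np.asarray(mask, bool)
    fg_struct, bg_struct = (FOUR, EIGHT) if connectivity == 4 else (EIGHT, FOUR)
    _, C = ndimage.label(mask, structure=fg_struct)
    bg_labels, n_bg = ndimage.label(~mask, structure=bg_struct)
    border = np.unique(np.concatenate([bg_labels[0], bg_labels[-1],
                                       bg_labels[:, 0], bg_labels[:, -1]]))
    H = n_bg - np.count_nonzero(border)   # background pieces not on the border
    return C - H

ring = np.zeros((9, 9), bool)
ring[2:7, 2:7] = True
ring[4, 4] = False                        # one hole
print(euler_number(ring))                 # 1 component - 1 hole = 0
```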
28.5 Attributes
Attributes refer to characteristics that can be specified by people and can be observed in images, and objects can be described in many ways with the help of their attributes (Babaee et al. 2016; Danaci and Ikizler-Cinbis 2016; Zhang 2017b).
Attribute A certain nature of an object that can make some description of an aspect of the object. The concept of visual attributes was proposed to address the "semantic gap" in visual problems. Image visual attributes are characteristics that can be specified by a person and can be observed in the image. They provide valuable new semantic clues. Different object categories often have common attributes. Modularizing them explicitly allows the associated categories to be shared between some learning tasks, or allows previously learned knowledge about attributes to be transferred to a new category. From the perspective of describing objects, attributes can be divided into three types: binary attributes, relative attributes, and language attributes. Attributes can also be divided into multiple types according to their description functions; three common types are semantic attributes, discriminative attributes, and nameable attributes. In addition, attributes can also denote the quantities and values of pixels or objects that are characterized in image segmentation. Each pixel is generally not (or not only) characterized by its gray value, but often also quantitatively characterized by other comprehensive states (such as gray-level changes) in a small neighborhood around the pixel. The quantities or values that characterize these pixels are features
or attributes. Pixels that belong to the same region usually have similar or identical attribute values, so they are clustered together. Binary attribute An attribute describing an object that indicates the presence or absence of certain characteristics in an image region. Relative attribute An attribute that describes an object. It indicates the relative existence of a specific attribute. It can be compared with each other. Its value can change continuously, not just the presence or absence of a binary attribute. Compare binary attributes. Language attribute An attribute describing an object that is a combination of binary attributes and relative attributes. It learns to choose between binary attributes or relative attributes, which makes it more natural to model attribute usage in descriptions. Semantic attribute An attribute that can be used to label the semantic information of a certain type of object from the description function. It can be further divided into shape attributes, part attributes, material attributes, and so on. Discriminative attribute An attribute that can clearly indicate the peculiar properties of a certain type of object from the description function. Its semantic meaning is relatively weak, but it emphasizes the distinctive characteristics of different types of objects. Compare semantic attributes. Nameable attribute An attribute that can be explicitly expressed in terms of its description function. Attribute theory The theory of studying attributes. Attributed graph A graph representing the different properties of an image. The nodes are attribute pairs of the image region (such as color and shape), and the connections between the nodes (such as relative texture or brightness) are represented as arcs or edges. For example, an attributed graph can be used to express the attributes of a single 3-D object, where objects are expressed in the form of attribute pairs. Attribute pairs are ordered pairs and can be written as (Ai, ai), where Ai is the attribute name and ai is the attribute value. An attribute set can be expressed as {(A1, a1), (A2, a2), . . ., (An, an)}. An attribute graph can be expressed as G ¼ [V, E], where V is the set of nodes (such as the image region and its color or shape), and E is the set of arcs/edges (such as the relative texture or relative brightness between the image regions). For each node and arc/edge, there is an attribute pair associated with it, that is, the node set and arc/edge set can be represented by the attribute set.
(Fig. 28.12 Attributed hypergraph: attributed graphs A, B, and C form the first attributed hypergraph, attributed graphs D and E form the second, and together they form the total attributed hypergraph.)
Attributed hypergraph A method for expressing the attributes of multiple 3-D scenes. It is an extension of the attributed graph, including a hyper-node set and a hyper-arc set. Each hyper-node in the graph corresponds to a basic attributed graph, and each hyper-arc connects the basic attributed graphs of two hyper-nodes. For a scene with multiple objects, an attributed graph can be constructed for each object first, and then these attributed graphs are used as the hyper-nodes of the upper layer to further construct the attributed hypergraph. Iterating in this way, an attributed hypergraph of a complex scene can be constructed. Figure 28.12 shows an example of an attributed hypergraph. There are five 3-D scenes A, B, C, D, and E. An attributed graph is constructed for each scene, namely, attributed graph A, attributed graph B, attributed graph C, attributed graph D, and attributed graph E. Among them, the first attributed hypergraph (with three hyper-nodes) is constructed from the first three attributed graphs, and the second attributed hypergraph (with two hyper-nodes) is constructed from the last two attributed graphs. On the basis of these two attributed hypergraphs, the total attributed hypergraph of the scene can also be constructed.
Attribute adjacency graph A graph describing the connections between different attribute regions in an image. The node set in the graph includes regions, and the edge set includes connections between regions. Each region is represented by its attributes, and each node corresponds to a set of attributes (which can be regarded as features) and is represented by a feature vector. Similarly, edges representing different connections can also be described by attributes. If attributes (including node attributes and edge attributes) are represented by features, a corresponding feature adjacency graph is obtained.
Attribute-based morphological analysis A mathematical morphological analysis technique that adaptively processes an image from morphological perspectives, based on statistical properties of the image regions such as height, area, and shape.
Abstract attribute Very subjective, representing the attributes of abstract concepts (commonly used to describe the atmosphere, emotions, etc. reflected in the image).
(Fig. 28.13 Direct attribute prediction model.)
Abstract attribute semantic A collective term for behavioral-level semantics and emotional-level semantics.
Direct attribute prediction [DAP] A model that introduces the concept of attributes into a classifier to form an attribute classifier. Let x be an object category that exists in both the training set and the test set, and z a category that may not exist in the training set but may exist in the test set. Now consider introducing a small set of high-level semantic attributes for each category to establish the connection between x and z, so that the ability to classify z can be obtained by training on x. See Fig. 28.13, where node a represents the attribute set. If an attribute expression a ∈ A can be obtained for each category x ∈ X and z ∈ ℤ, an effective classifier f: Y → ℤ can be learned, that is, one that passes information between X and ℤ through A. The direct attribute prediction model uses a layer of attribute variables to couple the image layer with the label layer. During the training process, the output class label of each sample induces a deterministic label of the attribute layer. Therefore, any supervised learning method can be used to learn the parameters t_k of each attribute. When testing, they allow prediction of the attribute values of each test sample, from which the label of the test class is inferred. Note that the categories tested here can be different from the categories used for training, as long as the coupling attribute layer does not require a training process. For the direct attribute prediction model, first learn a probabilistic classifier for each attribute a_k. All images of all training categories are selected as training samples, and the labels of these training samples are determined by the attribute vector values of the corresponding sample labels, that is, a sample belonging to category x is assigned the binary label a_k^x. The trained classifier provides an estimate of p(a_k|y), from which a complete image-attribute layer can be constructed, that is, p(a|y) = Π_{k=1}^{K} p(a_k|y). In testing, suppose that each category z induces its attribute vector a^z in a deterministic manner, that is, p(a|z) = [a = a^z], where the brackets [ ] are Iverson brackets. According to Bayes' rule, we can get p(z|a) = [a = a^z] p(z)/p(a^z) as the expression of the attribute-category layer. Combining the two layers, we can calculate the posterior probability of a test category given an image:
p(z|y) = Σ_{a∈{0,1}^K} p(z|a) p(a|y) = [p(z)/p(a^z)] Π_{k=1}^{K} p(a_k^z | y)
(Fig. 28.14 Indirect attribute prediction model: image y, training classes x1, ..., xM, attributes a1, ..., aK, and test classes z1, ..., zN.)
Without more specific knowledge, the priors of all categories can be assumed to be equal, so that the factor p(z) can be ignored. For the factor p(a), we can assume a factorial distribution p(a) = Π_{k=1}^{K} p(a_k) and use the empirical mean over the training categories, p(a_k) = (1/M) Σ_{m=1}^{M} a_k^{x_m}, as the attribute prior. Using the decision rule f: Y → ℤ, which assigns the best output category from all the test categories z1, z2, . . ., zN to the test sample y, MAP prediction gives
f(y) = argmax_{n=1,...,N} Π_{k=1}^{K} [p(a_k^{z_n} | y) / p(a_k^{z_n})]
Compare indirect attribute prediction.
Indirect attribute prediction [IAP] A model that introduces the concept of attributes into a classifier to form an attribute classifier. Let x be an object category that exists in both the training set and the test set, and z a category that may not exist in the training set but may exist in the test set. Now consider introducing a small set of high-level semantic attributes for each category to establish the connection between x and z, so that the ability to classify z can be obtained by training on x. See Fig. 28.14, where node a represents the attribute set. If an attribute expression a ∈ A can be obtained for each category x ∈ X and z ∈ ℤ, an effective classifier f: Y → ℤ can be learned, that is, one that passes information between X and ℤ through A. The indirect attribute prediction model uses attributes to transfer knowledge between categories, but the attributes constitute a connection layer between two layers of labels, one of which is known at training time and the other unknown at training time. The training phase of IAP is a general multi-class classification. At test time, the predictions for all training classes induce the labeling of the attribute layer, from which the labeling of the test class can be inferred. In order to implement IAP, the image-attribute stage needs to be adjusted on the basis of the previous one: first learn an estimate p(x_m|y) of a probabilistic multi-class classifier for all training classes x1, x2, . . ., xM. Here again assume a deterministic relationship between attributes and categories, letting p(a_k|x) = [a_k = a_k^x]. Combining the two steps gives
p(a_k | y) = Σ_{m=1}^{M} p(a_k | x_m) p(x_m | y)
It can be seen that to obtain the posterior probability p(a_k|y) of an attribute, only one matrix-vector multiplication is needed. Finally, the following formula can be used to classify the test samples:
f(y) = argmax_{n=1,...,N} Π_{k=1}^{K} [p(a_k^{z_n} | y) / p(a_k^{z_n})]
Compare direct attribute prediction.
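A minimal sketch of the attribute-based decision rule shared by DAP and IAP above (NumPy; the toy attribute signatures and probabilities are hypothetical, and in practice p(a_k|y) would come from trained per-attribute or per-class classifiers):

```python
import numpy as np

def attribute_predict(attr_prob, class_attr, attr_prior):
    """Score test classes by prod_k p(a_k^{z_n}|y) / p(a_k^{z_n}).

    attr_prob:  (K,) posterior p(a_k|y) for one test image
    class_attr: (N, K) binary attribute signatures a^{z_n} of the test classes
    attr_prior: (K,) attribute priors p(a_k)
    Returns the index of the best class and the log-scores."""
    attr_prob = np.clip(attr_prob, 1e-6, 1 - 1e-6)
    attr_prior = np.clip(attr_prior, 1e-6, 1 - 1e-6)
    # probability that the attribute takes the class's value (1 or 0)
    p_num = np.where(class_attr == 1, attr_prob, 1 - attr_prob)
    p_den = np.where(class_attr == 1, attr_prior, 1 - attr_prior)
    scores = np.sum(np.log(p_num) - np.log(p_den), axis=1)   # log of the product
    return int(np.argmax(scores)), scores

# Toy example: 3 test classes described by 4 attributes
signatures = np.array([[1, 0, 1, 0], [0, 1, 1, 1], [1, 1, 0, 0]])
prior = signatures.mean(axis=0)
best, s = attribute_predict(np.array([0.9, 0.2, 0.8, 0.1]), signatures, prior)
print(best)                                                  # 0
```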
28.6 Object Saliency
Saliency is a concept associated with subjective perception. Visual saliency is often attributed to the comprehensive results of changes or contrasts in the underlying regions of the scene such as brightness, color, gradient, etc. (Borji and Itti 2013; Borji et al. 2013; Borji et al. 2015; Zhang 2017b). Saliency A concept associated with subjective perception. Visual saliency is often attributed to the overall result of changes or contrasts in the underlying characteristics such as brightness, color, gradient, etc., or it can be an eye-catching result due to the characteristics of the middle layer characteristics such as texture, shape, and motion. Saliency is also a concept derived from Gestalt theory. Notability Visibility of certain parts of the image (edges, corner points, blobs, etc.) and the differences from the surrounding environment. It is a relative concept related to the specific part itself and the contrast between the specific part and the original image. Salience The degree to which certain things or features (such as visual characteristics) stand out more than others (spatially or temporally). Visual salience The ability to judge the degree of attention of the pixels or regions in an image corresponding to the scene. Generally qualitative and relative, but can also be measured quantitatively. It is also the basic content of visual organization in Gestalt theory. Saliency map Output of detection of salient regions in an image. Records the salient expression of a given image element (usually a feature or group of features). It reflects the degree to
which each part of the image is attractive, which is reflected in the intensity value of each pixel in the salient map. See salient feature, Gestalt, perceptual grouping, and perceptual organization. Salient point A feature that is very prominent relative to the surrounding features. The feature is generally defined at a point, but it can also be defined at a patch (pixel set). Abbreviation for salient feature point. Salient feature point The center point of a spatial point or region with distinctive features. Salient patch Local regions in the image that have distinctive characteristics or appearance. A typical salient patch that is commonly used today is the region covered by the transformation template around the feature points obtained by using the scaleinvariant feature transform (SIFT). Salient pixel group A group of pixels that exhibit a very prominent pattern relative to neighboring pixels. Salient regions Compared with the local range of the image, the image regions whose content are more prominent and interest. They should be stable to global transformations (including scale, lighting, and perspective distortions) and noise. They can also be used for object representation, correspondence matching, tracking, etc. Bayesian saliency A method of generating a saliency map in an image or in another space. Significance here is defined by the likelihood of a given observation (such as a value) relative to all other observations in the set. Saliency smoothing Saliency map operations are performed to eliminate the effects of noise and improve the robustness of subsequent extraction of saliency regions. Salient feature The features in the image that have special characteristic and are easily identifiable. Features associated with high values of saliency measurements (salient is derived from Latin salir, i.e., jumping) suggest a characteristic of perceived attention. For example, an inflection point can be used as a distinctive feature when expressing a contour. In recent years, scale-invariant feature transform (SIFT) or speeded-up robust feature (SURF) methods are commonly used to obtain salient features. Maximally stable extremal region [MSER] The output of an affine-invariance detector that is only available in grayscale images. To determine the maximum stable extremal region, the image is thresholded with all possible gray values to obtain a binary region. An effective method to achieve this operation is to first sort all pixels according to the gray level and then
(Fig. 28.15 Directional contrast calculation diagram: the image is partitioned, relative to pixel i, into top-left (TL), top-right (TR), bottom-left (BL), and bottom-right (BR) regions, with j indexing the pixels of a region.)
gradually increase the pixels of each connected region as the threshold changes. By monitoring how the area of each region changes with the threshold, the regions with the greatest stability can be determined, that is, the regions whose area changes least as the threshold changes. These regions are invariant to affine geometric transformations and photometric (linear offset-gain or smooth monotonic) transformations. If necessary, each detected region can be approximated by an affine coordinate frame according to the matrix of its moments.
A region defined to establish correspondence between images that reflects the characteristics of the entire image. Consider the image f(x, y) as a topographic map with spatial coordinates (x, y) and elevation f. For an 8-bit image, a threshold T that changes from 0 to 255 is used to gradually threshold the image. Each time, pixels with values greater than or equal to T are assigned 1 (white), and the remaining pixels are assigned 0 (black). When T is equal to 0, the entire image is white. As T increases, larger black connected regions will gradually be seen in the thresholded image, and they correspond to local minima. When T is 255, the whole image is black. Except for the last case, there are always white connected regions in the result of each thresholding, and they are extremal regions. When the threshold varies over a certain range, those extremal regions that show no significant size change are called maximally stable extremal regions.
Minimum directional contrast [MDC] The smallest value among the directional contrasts computed in all directions.
Directional contrast [DC] In the saliency calculation, a contrast index that considers different spatial orientations. Referring to Fig. 28.15, with pixel i as the center of the field of view, the positions of the entire image relative to pixel i can be divided into several regions H, such as top left (TL), top right (TR), bottom left (BL), and bottom right (BR). The directional contrast from each region can be calculated as follows (H1 = TL, H2 = TR, H3 = BL, H4 = BR; i and j are pixel indices, and f_i and f_j are the corresponding pixel values):
DC_{i,Hk} = √( Σ_{j∈Hk} (f_i − f_j)² ),  k = 1, 2, 3, 4
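A minimal sketch of the directional contrast and the minimum directional contrast (NumPy). Integral images of f and f² give each quadrant sum in constant time; assigning the pixel's own row and column to the TR/BL/BR quadrants is one possible convention and an assumption of this example:

```python
import numpy as np

def minimum_directional_contrast(image):
    """MDC map: for each pixel i, DC against the TL, TR, BL, and BR regions is
    sqrt(sum_j (f_i - f_j)^2); MDC is the minimum of the four values."""
    f = np.asarray(image, dtype=float)
    H, W = f.shape
    # integral images with a zero top row / left column for rectangle sums
    S1 = np.zeros((H + 1, W + 1)); S1[1:, 1:] = f.cumsum(0).cumsum(1)
    S2 = np.zeros((H + 1, W + 1)); S2[1:, 1:] = (f * f).cumsum(0).cumsum(1)

    def rect(S, r0, r1, c0, c1):      # sum over rows r0..r1-1, cols c0..c1-1
        return S[r1, c1] - S[r0, c1] - S[r1, c0] + S[r0, c0]

    mdc = np.empty_like(f)
    for r in range(H):
        for c in range(W):
            dcs = []
            for (r0, r1, c0, c1) in ((0, r, 0, c), (0, r, c, W),
                                     (r, H, 0, c), (r, H, c, W)):
                n = (r1 - r0) * (c1 - c0)
                if n == 0:
                    continue          # empty quadrant at the image border
                s1 = rect(S1, r0, r1, c0, c1)
                s2 = rect(S2, r0, r1, c0, c1)
                # sum_j (f_i - f_j)^2 = n*f_i^2 - 2*f_i*sum(f_j) + sum(f_j^2)
                dcs.append(np.sqrt(max(n * f[r, c] ** 2 - 2 * f[r, c] * s1 + s2, 0.0)))
            mdc[r, c] = min(dcs) if dcs else 0.0
    return mdc
```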
Minimum barrier distance [MBD] A special distance measure or distance transform. Consider a 4-connected path in an image f(x, y) containing L + 1 pixels, which corresponds to a pixel sequence G = {g(0), g(1), . . ., g(L)}, where g(i) is 4-adjacent to g(i + 1). The barrier distance of this path can be calculated as follows:
D_BD(G) = max_i g(i) − min_i g(i)
That is, the barrier distance of a path is the difference between the maximum and minimum gray levels on the path (the actual grayscale dynamic range along the path). There can be N paths from a pixel p to a pixel set S. The set of N paths can be expressed as T = {G1, G2, . . ., GN}, where each G represents a pixel sequence. The minimum value of the barrier distance over all these paths is the minimum barrier distance between the pixel p and the pixel set S, which can be expressed as
D_MBD(p, S) = min_{Gi∈T} [D_BD(Gi)]
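A minimal sketch of the two definitions above (NumPy). Computing the true minimum over all possible paths requires a dedicated MBD distance-transform algorithm (e.g., the raster-scan approximations used in MBD-based saliency); here the candidate path set T is assumed to be given:

```python
import numpy as np

def barrier_distance(path_values):
    """Barrier distance of one path: max - min of the gray levels along it."""
    v = np.asarray(path_values, dtype=float)
    return v.max() - v.min()

def minimum_barrier_distance(paths_values):
    """Minimum barrier distance over a given set T of candidate paths from a
    pixel p to a pixel set S; each element is the sequence of gray levels
    g(0), ..., g(L) along one 4-connected path."""
    return min(barrier_distance(v) for v in paths_values)

# Two candidate paths through different gray levels
print(minimum_barrier_distance([[10, 80, 30], [10, 25, 30]]))   # 20.0
```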
Chapter 29 Feature Measurement and Error Analysis
Based on the representation and description of the object, the properties of the object can be measured according to certain metrics (Sonka et al. 2014; Zhang 2017b).
29.1 Feature Measurement
The measurement of object features is fundamentally the accurate estimation, from digitized (sampled and quantized) data, of the properties of the original analog quantities that produced those data (Russ and Neal 2016).
29.1.1 Metric
Metric A non-negative real-valued function f defined on a set S that, for any x, y, z ∈ S, satisfies the following conditions:
1. f(x, y) = 0 if and only if x = y; this is called the identity axiom.
2. f(x, y) + f(y, z) ≥ f(x, z), which is called the triangle axiom.
3. f(x, y) = f(y, x), which is called the symmetry axiom.
Identity axiom One of the conditions that a metric should meet.
Triangle axiom One of the conditions that a metric should meet.
Symmetry axiom One of the conditions that a metric should meet.
1041
1042
29
Feature Measurement and Error Analysis
Metric property In a 3-D scene structure, a property that is related to distance and can be measured numerically, such as distance or area. Opposite to logical features (such as image connectivity). The measurement of metric properties is often performed on only a part of the whole and is expressed as a measure of unit volume. Metric combination The method and process of combining multiple metrics to form a new metric. There are many specific methods of combination, such as: 1. Addition: the sum of two metrics d1 and d2. d ¼ d1 þ d2 is also a metric. 2. Multiplication of positive real numbers: the product of a metric d1 and a constant β 2 ℝ+. d ¼ β d1 is also a metric. 3. The combination of a metric d1 and a real number γ 2 ℝ+. d¼
d1 γ þ d1
is also a metric. The above three methods can be expressed by a general formula. Letting {dn: n ¼ 1, . . ., N} be a set of metrics. For 8βn, γ n 2 ℝ+, d¼
N X n¼1
βn d n γ n þ dn
is also a metric. Gold standard The term used to refer to the ideal truth value or the result obtained using the latest technology (i.e., the best performing technology) can be used to compare a given processing result with the work effect. Golden template An image obtained from an object or scene without defects and used to determine the difference from the ideal object or scene in template matching.
29.1
Feature Measurement
1043
Ground truth 1. In performance analysis, it refers to the true value that can be output by the equipment or algorithm to be studied or the most accurate value that can be obtained. For example, in a system that measures the diameter of a circular hole, the true value can be known theoretically (such as according to a formula), or it can actually be obtained by selecting a device that is more accurate than the device currently being evaluated. 2. Derived from surveying and mapping science, the feature information used to help in the interpretation of remote sensing image classification. In image classification, it refers to the standard classification of the training set used for supervised training, which can help verify the classification results. More generally, it refers to the actual or ideal reference value or quantity, which is used to compare and judge the effect of image processing. Benchmarking Techniques that give reference values to the measured objects (performance of algorithms, techniques, etc.) and analyze them. It is commonly used, for example, when measuring watermark performance. Generally speaking, the robustness of watermark is related to the visibility and payload of the watermark. In order to fairly evaluate different watermarking methods, a certain amount of image data can be determined first, and as many watermarks as possible can be embedded in them without causing a significant impact on visual quality. Then, the watermarked data is processed or attacked, and the performance of the watermarking method is estimated by measuring the proportion of the errors generated thereby. It can be seen that the benchmark method for measuring watermark performance is related to the selected payload, visual quality measurement, and processing or attack method. Two typical benchmarking methods are robustness benchmarking and perceptibility benchmarking.
29.1.2 Object Measurement Object measurement Measurement of object characteristics in the image to obtain a quantitative description of the object. Feature measurement The process or technique of obtaining quantitative data describing an object. Fundamentally, it is necessary to accurately estimate the nature of the analogs describing the scene that produced these data from the digitized image data. This can be done directly or indirectly. The former gives direct metrics and the latter gives derived metrics.
1044
29
Feature Measurement and Error Analysis
Repeatability In feature detection, it refers to the frequency at which the detector can detect feature points (within a given radius) after transforming the image (such as rotation, scaling, changing lighting, changing viewpoints, adding noise, etc.). Binary object feature Descriptor values describing the characteristics of connected regions in a binary image f(x, y). Includes various descriptor for boundary and descriptor for region. γ -densitometry Same as dual-energy gamma densitometry. Dual-energy gamma densitometry A device for measuring the density of materials using two beams of gamma rays with different energies. Its basic principle is to let a narrow gamma ray beam pass through the channel, and the attenuation of the beam intensity provides the density information of the material along the channel. Metric space In mathematics, a collection in which the distance between its elements, called a metric, is defined. For example, 3-D Euclidean space is a metric space that is very close to human intuition. The distance between scenes can be measured by Euclid ean distance. Metric tree A type of tree structure suitable for indexing data in a metric space. A special multidimensional search tree requires feature descriptors to be compared with several templates at each level. It takes advantage of some properties of metric space (such as triangular inequality) to make access to data more efficiently. The obtained quantified visual words can be used in combination with traditional (textrelated) retrieval techniques to extract possible candidates from a large image database. Direct metric Results obtained by directly measuring the required features. Can be divided into two categories of field metrics and object-specific metrics. Field metric The overall metric for a given field of view. It is a direct metric. Commonly used basic field metrics include the number of fields, the area of the field, the number of objects, the number of unconsidered objects (such as objects tangent to the field boundary), and the position of the object. Object-specific metric A direct metric of an object in a given field of view. The basic metrics include the area of the object, the boundary diameter (including the maximum and minimum diameters), the border length, the position of the incision, the number of tangents, the number of intersections, and the number of holes.
29.1
Feature Measurement
1045
Derived metric In feature measurement, a new metric calculated or combined with other metrics obtained or measured. It can be further divided into field metrics and object-specific metrics. At this time, the characteristic measurement results are obtained by derived measurement, and the number is not limited. Derived measurement Same as derived metric. Quantitative metric The metric can quantitatively describe the characteristics of the object to be measured. Quantitative criteria In image segmentation evaluation, evaluation criteria can be quantitatively calculated and compared. Qualitative metric The metric can only qualitatively describe the characteristics of the object under test. Measurement equation In triangulation, the equations that minimize the residuals are needed. Obviously better estimates are obtained when some cameras are closer to the 3-D point than others. The measurement equation is xi ¼ yi ¼
ðiÞ
ðiÞ
ðiÞ
ðiÞ
ðiÞ
ðiÞ
ðiÞ
ðiÞ
ðiÞ
ðiÞ
ðiÞ
ðiÞ
ðiÞ
ðiÞ
ðiÞ
ðiÞ
p00 X þ p01 Y þ p02 Z þ p03 W p20 X þ p21 Y þ p22 Z þ p23 W p10 X þ p11 Y þ p12 Z þ p13 W p20 X þ p21 Y þ p22 Z þ p23 W
Among them, (xi, yi) is the 2-D feature position measured in the i-th measurement, and {p00(i) p23(i)} is a known element in the camera matrix. Measurement matrix A matrix containing the feature coordinates of an object, an image, or a video. When using the decomposition method for 3-D reconstruction, a matrix composed of the positions of 2-D points after subtracting the center of gravity of the image from the image. Can be expressed as 2
M1
3
7 6 6 ⋮7 7 6 7 X ¼ MS ¼ 6 6 M i 7 p1 p j pN 7 6 4 ⋮5 MM
1046
29
Feature Measurement and Error Analysis
Among them, M and S are both structural matrices. Mi is the upper 2 3 part of the projection matrix Pi, pj ¼ (Xj, Yj, Zj). Measurement resolution The degree to which two different quantities can be distinguished by measurement. For example, the spatial resolution may give the smallest spatial distance between two adjacent pixels, and the temporal resolution may give the smallest time difference between two visual observations.
29.1.3 Local Invariance Invariance An important property that distinguishes different image processing operators and object features. If the performance of an operator or the representation of a feature does not change under a given (time and space) change, it is considered invariant, such as time invariance, shift invariance, and rotation invariance. A general requirement for feature extraction and representation technology is that the features used to represent an image are unchanged for rotation, (scale) scaling, and translation, combined to become RST (R stands for rotation, S stands for scaling, and T stands for translation). The invariance of RST ensures that a machine vision system can be recognized when the object is presented in different sizes, at different positions and angles (relative to the horizontal reference line) in the image. Invariant Refers to a characteristic or quantity that does not change under a particular operation. For example, Hu moments are invariant under translation. Local A concept that is in contrast with global. The local characteristics of a mathematical object are defined by a small neighborhood of the object. In image processing, a local operator considers only a few close pixels at a time. A typical example is calculating the curvature of a point on a curve. Local invariant See local point invariant. Local point invariant A characteristic of local shape or brightness that is insensitive to changes in translation, rotation, scaling, contrast, or brightness. For example, the Gaussian curvature of a surface does not change with position. Point invariant A property that can be measured at a single point in an image and is invariant to certain transformations. For example, the ratio of the brightness of a pixel to the pixels in its brightest neighborhood does not change with illumination, and the
29.1
Feature Measurement
1047
brightness amplitude at a point does not change with translation or rotation. Of course, it is assumed here that images and observations are ideal. Position invariant Features that do not change with position. For example, the length of a 3-D line segment changes when projected onto the image plane, but it does not change if placed in different positions in 3-D space. See invariant. Position invariance See position invariant. Shift invariance For an operator, it refers to the property that its operation and result do not change with space. Common convolution and correlation operations have translation invariance. But sometimes the convolution and correlation of translational changes are also used. For example, the convolution of the translational change of f(i, j) can be expressed as gði, jÞ ¼
X
f ði k, j lÞhði, j; k, lÞ
k, l
where h(i, j; k, l ) is the convolution kernel at position (i, j). Translation invariant Features that maintain the same value when panning over data, scenes, or images. The distance between the two points is constant under translation. Translation invariance A technique or process or description that does not change with the translation of an image (or object). The object has invariant properties at different positions in the image. Compare scale invariance and rotation invariance. A property of mathematical morphology. It means that the result of the displacement operation does not vary depending on the order of the displacement operation and other operations, or that the result of the operation has nothing to do with the displacement of the operation object. Scale invariant It maintains the same value when zooming in or out of the image or scene providing the data. For example, when the image is scaled, the ratio (perimeter perimeter/ area) is scale-invariant. Scale invariance A characteristic that describes a technology, process, or measure that does not change with the scale of an image. For example, scale-invariant feature trans forms are scale-invariant (the transformation results are stable), while surface textures are not scale-invariant (textures have different appearances at different
1048
29
Feature Measurement and Error Analysis
scales). The use of multi-scale representation can overcome the impact of scale changes on a certain scale. Rotation invariant A property that can maintain the same value when the data value, camera, image, scene, etc. are rotated. It can also refer to a technique or process that describes characteristics that do not change with the rotation of an image (or object). It is necessary to distinguish between 2-D rotation invariance and 3-D rotation invariance, the former in the image and the latter in the scene. For example, the angle between straight lines in two images is invariant to image rotation, but the angle between straight lines in two scenes is not invariant to image rotation. Rotation invariance See rotation invariant.
29.1.4 More Invariance Affine invariant The amount whose value (corresponding to the object or shape properties) does not change when using an affine transformation. See invariant. Affine invariance Refers to the characteristics of technology, process, description, etc. that do not change as the image (or object) is affected by the affine transformation. Transla tion invariance, rotation invariance, and scale invariance are typical examples. Affine-invariant detector Detector with affine invariance. Not only does it maintain a stable detection position after changes in scale and orientation, but it still has a consistent response after affine deformation (such as local perspective shortening). Geometric invariant A feature that describes how certain geometric structures remain constant under a certain transformation (such as the cross ratio under a projective transformation). Polar harmonic transforms Can be used to generate transformations with rotation invariant features. Region Invariant 1. A property of the region, that is, the property that remains unchanged after being subjected to some transformation (such as translation, rotation, perspective projection, etc.). 2. Properties or functions that are present throughout the entire region.
29.1
Feature Measurement
1049
Photometric invariant The nature of an image’s features or characteristics that are not sensitive to changes in lighting. See invariant. Color differential invariant A class of differential invariants based on color information. It is also based on the difference invariance of color information. Insensitive to changes in translation, rotation, and uniform illumination. For example, the representation (∇R∇G) / (|| ∇R||•|| ∇G||) has a value that does not change with translation, rotation, and uniform illumination. Viewpoint invariance A process that is independent of the viewpoint from which the data was obtained or has the same numerical characteristics. Typical examples include face recognition using 3-D models. See projective invariants. Stability of views The property of changing the viewpoint while keeping the scene image from changing. This depends on the shape and structure of the scene, but also on the position and orientation of the viewpoint. Affine-invariant region Patches of images that automatically deform as the viewpoint changes. These patches cover the same physical part of the scene. Because such an area can be described by a constant set of features, it is easier to match between viewpoints with varying lighting. Quasi-invariant Fitting or approximation to invariance. For example, the quasi-invariance parameterization of the image curve can be constructed by approximating the constant arc length with a low spatial derivative. Joint invariant An invariant function J: X1 ... Xm ! R, which satisfies the Cartesian product of a group G. As long as 8g 2 G, xi 2 Xi, there is J(g•x1, . . ., g•xm) ¼ J(x1, . . ., xm). Coplanarity invariant A projective invariant used to determine whether five corresponding points in at least two views are coplanar in 3-D space. These five points allow the construction of a set of four collinear points and the value of the cross ratio can be calculated. If these five points are coplanar, the value of the cross ratio must be the same in both views. In Fig. 29.1. point A and lines AB, AC, AD, and AE can be used to define a constant cross ratio of any line L that intersects it. Coplanarity Property on the same plane. For example, for three vectors a, b, and c, if their triple product has (a b) • c ¼ 0 (zero), they are coplanar.
1050
29
Feature Measurement and Error Analysis
Fig. 29.1 Example of coplanarity invariant
A
C B
29.2
E
L D
Accuracy and Precision
Statistical errors cause uncertainty in the results of any measurement of an object in an image. The measurement of uncertainty in measurement often relies on the concepts of accuracy and precision (Zhang 2009). Measurement accuracy The closeness of the actual measured value to the objective standard value as the (reference) true value. Also called measurement unbiasedness. Is a description of the maximum possible error for a given confidence level. Compare measurement precision. Accuracy [ACC] 1. One of the commonly used evaluation criteria for image matching. Refers to the difference between the true value (ideal match result) and the estimated value (actual match result). The smaller the difference, the more accurate the match. With matching correspondence determined, accuracy can be measured with the help of synthetic or simulated images. 2. In image registration, some kind of statistics (such as mean, median, maximum, or root mean square) on the distance between the reference image point and the registered image point (after resampling to the reference image space). You can also place fiducial marks in the scene and use the position of fiducial marks to evaluate the accuracy of registration. The units used can be pixels or voxels, or sub-pixels or subvoxels. 3. In the image measurement, the closeness between the actual measured value and the objective standard value as the (reference) true value, that is, the error between a measured value and the true value. Also called unbiasedness. Compare precision. 4. The ratio that the true (positive) conclusions are made. Can be calculated by. ACC ¼
TP þ TN PþN
29.2
Accuracy and Precision
1051
Among them, TP is the true-correct value, TN is the true-false value, P is the total correct value, and N is the total error value. See measurement accuracy. Measurement unbiasedness Same as measurement accuracy. Unbiasedness The degree of closeness between the actual measured value and the objective standard value (as a reference true value) in the image characteristic measurement. Also called accuracy. Accurate measurement Measurements based on measurement accuracy correspond to unbiased estimates. An unbiased estimate of a parameter ã (a measurement with unbiasedness) should satisfy E{ã} ¼ a, that is, the expected value of the estimation should be a true value. Measurement precision The ability to repeat the measurement process and obtain the same measurement results. Compare measurement accuracy. Efficiency 1. The ability of the measurement process to be repeated and the same measurement results obtained. There are two cases: efficacy/effective and inefficiency/ineffective. Also called measurement accuracy in image engineering. 2. The properties of the estimator whose measurement results can quickly converge to a stable value with a small standard deviation. Compare inefficiency. Inefficiency Properties of estimators whose measurement results converge slowly and fluctuate. This does not affect whether the estimates are biased (measured with accuracy). Compare efficiency. Average precision [AP] See mean average precision. Mean average precision [MAP] The average of positive predictive values. An index that measures the performance of an (matches, recognizes, etc.) algorithm. It can be calculated according to the receiver operating characteristic curve, such as changing the threshold in order to determine the optimal result, determining the optimal two results, and so on to obtain the positive predictive values. The result of calculating the mean for average precisions obtained by experiments on multiple samples. Assuming that the results obtained under a given condition/parameter for a sample can be calculated to obtain an accuracy, and the (weighted) average of multiple accuracy obtained by repeated experiments under changing conditions/parameters is the average accuracy, then the mean average precision is the arithmetic mean of the aforementioned average precisions.
1052
29
Feature Measurement and Error Analysis
An orderly evaluating criterion for judgment. Taking image retrieval as an example, and setting the query set to Q, the average precision of Q retrievals is (| Q| is the number of retrievals) MAPðQÞ ¼
jQj ni 1 X 1 X P Rij j Q j i¼1 j ni j¼1
Among them, ni represents the number of related images corresponding to the i-th query, Rij represents the image ranked by j in the related results obtained by the i-th query, and P(•) represents the precision function, so P(Rij) ¼ related image j sorting in related results/sorting of this image in all results. Suppose there are ten results retrieved from a database with three related images in a query, and the order of these ten results is (R represents relevant; N represents not relevant): R N R N N N N N R N. Here three relevant images were retrieved, sorted 1, 3, and 9, respectively, and then. MAPðQÞ ¼
j1j 3 3 1 X1X 1X 1 1 2 3 2 þ þ ¼ P Rij ¼ P Rij ¼ 3 j¼1 3 1 3 9 3 j 1 j i¼1 3 j¼1
If only two related images sorted by 1 and 3 were retrieved, then MAPðQÞ ¼
j1j 3 3 1 X1X 1X 1 1 2 5 þ ¼ P Rij ¼ P Rij ¼ 3 j¼1 3 1 3 9 j 1 j i¼1 3 j¼1
If we set the first query to retrieve three of them from a library with five related images, the ranking is 1, 3, and 5; if the second query retrieves all four from a library with four related images, the ranking is 1, 2, 4, and 7, respectively: " # j1j 5 4 1 X 1X 1X MAPðQÞ ¼ P Rij þ P Rij 4 j¼1 j 2 j i¼1 5 j¼1 h i 1 1 1 2 3 1 1 2 3 4 þ þ þ þ þ þ ¼ 0:64 ¼ 2 5 1 3 5 4 1 2 4 7 The last example above can be simply expressed as follows: Consider the case of querying with two images, query image A has four related images, and query image B has five related images. Suppose an algorithm uses query image A to retrieve four related images, which are sorted as 1, 2, 4, and 7, respectively, and uses query image B to retrieve three related images, which are sorted as 1, 3, and 5, respectively. For query image A, the average precision is (1/1 + 2/2 + 3/4 + 4/7)/4 ¼ 0.83; for query image B, the average precision is (1/1 + 2/3 + 3/5)/3 ¼ 0.45. In this way, the mean average precision is (0.83 + 0.45)/2 ¼ 0.64.
29.2
Accuracy and Precision
1053
Precise measurement Measurements with high measurement precision. Corresponds to consistent estimate. Reproducibility The ability of an operation or operation result to be repeatedly implemented. Similar to precision in a sense. Consistent estimate An estimate used in precise measurements. A consistent estimate of the parameter a, i.e., â is based on N samples and when N ! 1, â will converge to a (provided the estimate itself is unbiased). Receiver operating curve [ROC] Same as receiver operating characteristic curve. A graph showing the performance of the classifier. It plots the number or percentage of true positives versus the number or percentage of false positives as certain parameters of the classifier change. See performance characterization, test set, and classification. Area under curve [AUC] It refers to the area under the receiver operating characteristic curve, which combines accuracy and precision to give a single measurement value. It mainly reflects the situation where the ROC curve is close to the upper left corner. The larger the area, the closer it is to the upper left corner, and the better is the reflected performance. It is also a scalar measure representing the performance of the algorithm. Its value is 1 in the ideal case and 0 in the worst case. Relative operating characteristic [ROC] curve Same as receiver operating characteristic curve. The name comes from the fact that it is a ratio of two operating characteristics (true positive and false positive) as the criterion changes. ROC curve 1. Receiver operating characteristic curve: Abbreviation for receiver operating characteristic curve, referred to as ROC curve. 2. Relative operating characteristic curve: Abbreviation for relative operating characteristic curve, referred to also as ROC curve. Uncertainty and imprecision data Incomplete and noisy data. The information obtained from these data is uncertain and inaccurate. The typical case of uncertain data is data that contains only partial information, while noisy data is typically not precise.
1054
29.3
29
Feature Measurement and Error Analysis
Error Analysis
In image analysis, the measurement of target features is based on digitized data to accurately estimate the nature of the original analog quantities that produced the data (ASM 2000). Because this is an estimation process, errors are inevitable (Marchand and Sharaiha 2000).
29.3.1 Measurement Error Visual gauging To determine the dimensions or specific locations of artificial objects for quality inspection and grading decisions, measurements are made using non-contact lightsensitive sensors, and these measurements are compared to a predetermined allowable range of activities. Gauging Implement measurement or test. It is a standard requirement for industrial machine vision system. Gauge coordinates A coordinate system that is local to the image surface itself. It provides a convenient reference frame for neighborhood operators such as gradient operators. Fixed-inspection system A visual gauging system that fixes the workpiece to be measured at a precise test location and uses at least one camera to make the required measurements. Compare flexible-inspection system. Flexible-inspection system A visual gauging system using sensors that move around a workpiece. The movement follows a previously planned path. Compare fixed-inspection system. Tolerance interval The interval in which a certain percentage of the number of samples falls within. Uncertainty of measurements The degree of instability of the measurement results due to various errors during feature measurement. Corresponding to the measurement error that can be divided into systematic error and statistical error, the measurement uncertainty can also be divided into system uncertainty and statistical uncertainty. Statistical uncertainty A type of uncertainty of measurement. It is an estimate of the limit of the random error. The general value range is the mean plus or minus the mean square error.
29.3
Error Analysis
1055
System uncertainty A manifestation of uncertainty of measurement. It is an estimate of the system error limit. Generally, for a given random error distribution, the confidence level is usually 95%. Measurement error Difference between measured data and real data during feature measurement. It can be written as measurement error ¼ |measured data – real data |. Because feature measurement is an estimation process, errors are inevitable. Errors can be divided into systematic errors and statistical errors. Statistical error A kind of measurement error. Also called random error. It is an error that causes the experimental results to diverge and describes the degree of scattering of the measurement data (relative to the center of gravity or expected value) obtained from repeated measurements (i.e., the relative distribution of each measurement value). Random error Same as statistical error. Systematic error A specific measurement error. It is a constant error during the experiment or measurement. It reflects the difference between the real data and the average of the measured data. Factors influencing measurement errors Factors that affect measurement accuracy during data extraction from the scene. Common reasons for discrepancies or errors in measurement data are: 1. The natural change of the parameters or characteristics of the objective object 2. The influence of various parameters and factors (spatial sampling, magnitude quantization, resolution of optical lens, etc.) in the image acquisition process 3. Changes to the image by different image processing and image analysis methods (image compression, object segmentation, etc.) 4. Use of different calculation methods and formulas for the measurement data 5. The influence of noise, blur, etc. during image processing The dashed box in Fig. 29.2 shows the three key steps in the process of converting from scene to (measurement) data, namely, image acquisition, object segmentation, and feature measurement. The action points of the above five reasons are also marked in the figure. Influence of lens resolution The impact on object measurement in image analysis due to insufficient resolution of the optical lens. At this time, only increasing the spatial resolution of the digital collector connected behind the lens cannot obtain high-resolution images and highprecision measurement results.
1056
29
Feature Measurement and Error Analysis
(1)
(2)
(3)
(4)
(5)
Scenes
Image acquisition
Object segmentation
Feature measurement
Data
Lens resolution
Sampling density
Segmentation algorithm
Calculation formulas
Fig. 29.2 Several factors affecting measurement accuracy
Influence of sampling density The impact on the object measurement due to the different sampling density of the scene. Because in the image analysis, a finite-scale object must be measured by a computer with limited accuracy in a limited time, the sampling density cannot be selected based on the sampling theorem. In fact, the conditions under which the sampling theorem holds are not satisfied at this time. Influence of segmentation algorithm The impact on object measurement in image analysis due to the use of different segmentation algorithms for the object or the selection of different parameters in the same algorithm. Changes in the segmentation results will have an effect on the quantitative results of feature measurement. In addition, when different feature quantities need to be measured, different segmentation results will also have different effects.
29.3.2 Residual and Error Residual error 1. The difference between the actual detected value and the ideal value, or the difference between the observed value and the predicted value. 2. Same as residual. If the offset between two images f1(x, y) and f2(x, y) is (u, v), then the residual between them is |f1(x + u, y + v) – f2(x, y)|. Mean variance In image information fusion, an objective evaluation index based on statistical characteristics. When the ideal fusion image (i.e., the best result of fusion) is known, the mean squared error between the ideal image and the fused image can be used to evaluate the fusion result. Letting g(x, y) represent the N N image after fusion. If i(x, y) represents the ideal image of N N, the mean squared error between them can be expressed as
29.3
Error Analysis
1057
( E rms ¼
N 1 N 1 h o1=2 1 XX gðx, yÞ iðx, yÞ2 N N x¼0 y¼0
Mean squared error [MSE] A measure of error. It is the average of the sum of squares of the error values from multiple measurements. A commonly used metric to measure the cost of matching. In video understanding, it corresponds to the sum of squared differences in image understanding. A difference distortion metric used to measure image watermarking. If f(x, y) is used to represent the original image and g(x, y) is used to represent the embedded watermark, the image size is N N, and the mean square error can be expressed as Dmse ¼
N 1 N 1 1 XX ½gðx, yÞ f ðx, yÞ2 N 2 x¼0 y¼0
Root mean square [RMS] error The square root of the mean squared error. A difference distortion metric of an image watermarking. It is the result calculated with the parameter p ¼ 2 in the Lp norm. Mean squared reconstruction error [MSRE] A performance index that measures the reconstruction effect by mean squared error. Mean square difference [MSD] A metric used by a classical stereo matching based on region gray-level correla tion. By calculating the gray difference between the two sets of pixels to be matched, the correspondence between the two sets of pixels that satisfy the least square error is established. Mean absolute deviation [MAD] A criterion or metric for the error about similarity measures. For a group of data, it is the absolute distance from its average, also called the average absolute difference. Can be used to measure the degree of discreteness of the data. For example, in motion estimation, consider a window of m n in the current frame. In the next frame, the window’s movement distance in the x-direction and ydirection is dx and dy, respectively. The average absolute deviation can be written as MADðx, yÞ ¼
m1 n1 1 XX j f ðx þ i, y þ jÞ pðx þ i þ dx, y þ j þ dyÞj mn i¼0 j¼0
where p is an array of predicted macro-block pixel values. If you subtract 1/mn from the above formula, you get the sum of absolute distortion.
1058
29
Table 29.1 Combination of objective situation and judgment result
Objective situation
Feature Measurement and Error Analysis
Positive Negative
Judgment result Positive Negative TP FN FP TN
Sum of absolute distortion [SAD] The result obtained by removing the average factor 1/mn from the average absolute deviation calculation formula. Average absolute difference A difference distortion metric used to objectively measure image watermarking. If f(x, y) is used to represent the original image, and g(x, y) is used to represent the embedded watermark, the image size is N N, and then the average absolute difference can be expressed as Dadd ¼
N 1 N 1 1 XX jgðx, yÞ f ðx, yÞj 2 N x¼0 y¼0
Error analysis In feature measurement, various factors such as the cause, category, and value of the error are considered, and the work, technology, and processes to reduce the influence of these factors are studied. Error In image engineering, the difference between the real value of the scene and the measured value obtained from the image measurement. Errors are different from mistakes. Mistakes should and can be avoided, and errors cannot be absolutely avoided. Errors can be divided into systematic errors and random errors. Error rates In many tasks that require judgment (such as feature matching, object recognition, object classification, image retrieval, etc.), the ratio that judgment results differ from actual situations or objective situations. In the case of binary classification, there are two possibilities for objective situations and two possibilities for judgment results, and then there are four possibilities for their combination, as shown in Table 29.1. Among them, TP stands for correct affirmation (true and correct judgment), FN stands for false negative (true but false judgment, missing alarm), FP stands for false positive (false but false judgment, false alarm), and TN stands for correct negation (false and correct judgment). Here, the ratio of the sum of FN and FP in all combinations is the error rate. Probability of error [PE] A metric based on pixel number error in image segmentation evaluation. Suppose P(b|o) represents the probability of misclassifying the object into the background,
29.3
Error Analysis
1059
P(o|b) represents the probability of misclassifying the background into the object, and P(o) and P(b) represent the proportion of prior probabilities of the object and the background in the image, respectively. The total error probability is PE ¼ PðoÞ PðbjoÞ þ PðbÞ PðojbÞ Error propagation 1. Propagate error results from one calculation step to the next. 2. A method for estimating the error (such as variance) of a process based on the estimation of input data and intermediate calculation errors. Error correction Correction of calculation errors in disparity maps obtained from stereo matching. Stereo matching is performed between images affected by geometric distortion and noise interference. Due to the existence of periodic patterns, smooth regions, occlusion effects, and the strictness of the constraint principle, errors in the disparity map will be produced. In order to obtain high-quality stereo matching results, errors need to be detected and corrected, which is also an important stereo vision post-processing content. Error-correction training procedure Iterative training process to achieve error correction. In each iteration step, the decision rules are adjusted based on each misclassification of the training data.
Chapter 30
Texture Analysis
Texture is a characteristic inherent to the surface of an object, so regions in an image often reflect the nature of the texture. Texture is a commonly used concept in image analysis (Zhang 2009).
30.1
Texture Overview
The analysis of texture includes texture representation and description, texture segmentation, texture classification and synthesis, etc. (Zhang 2017b). The three commonly used texture expression and description methods are statistical method, structural method, and spectral method (Forsyth and Ponce 2012).
30.1.1 Texture Texture A commonly used description of the spatial distribution properties (such as smoothness, sparseness, and regularity) of image gray. Texture is one of the inherent characteristics of the surface of an object. Any object surface, if you keep zooming in and observe it, will show a certain pattern, usually called a texture image. Texture is an important feature to describe the image content. It is often regarded as a local property of an image or a measure of the relationship between pixels in a local region. From a psychological perspective, texture is a phenomenon in which consistency is felt in the pattern of primitive weaving. In computer vision, texture generally refers to the pattern of shapes or reflections on a surface. The texture can be regular, that is, satisfy some texture syntax; it can be statistical texture, that is, the distribution of pixel values throughout the image can vary. Texture can also refer to © Springer Nature Singapore Pte Ltd. 2021 Y.-J. Zhang, Handbook of Image Engineering, https://doi.org/10.1007/978-981-15-5873-3_30
1061
1062
30
Texture Analysis
changes in local shapes on the surface, such as the degree of roughness. See shape texture. Micro-texture The result of classifying textures according to scale. When the scale of a texture’s texture elements is small relative to the space between the texture elements, it is considered a micro-texture. Compare macro-texture. See statistical texture. Macro-texture A result of classifying textures based on scale. When the texture elements of a texture have obvious shapes (greater than the range of space between elements) and regular organization, this texture is considered to be a macro-texture. It also refers to the brightness pattern obtained by organizing the texture elements on the surface in space. Compare micro-texture. Ordered texture See macro-texture. Gray-level distribution [GLD] An operational definition of a texture. The texture can be regarded as a certain attribute distribution pattern in the field of view, so that the work to be done to analyze the surface texture and the method to be used can be determined. Color texture The apparent change in a color image. The apparent change (texture) of a region or surface results from a spatial change in color, reflection, or illumination. Hierarchical texture A way to consider texture units in multiple layers (e.g., basic texture units can be combined to form another scale texture unit and so on). Fractal texture Texture representation based on self-similarity between different scales. Homogeneous texture A 2-D (or higher-dimensional) pattern, defined in space S ⊂ R2. Statistical calculations of patterns in a window in S (such as mean and variance) are independent of the position of the window. Shape texture Surface texture viewed from the perspective of shape change. Contrary to the change of the surface reflection mode.
30.1.2 Texture Elements Image texton The basic unit for texture perception is somewhat similar to the phonemes in speech recognition.
30.1
Texture Overview
1063
Texton Introduced by Julesz in 1981 as the basic unit of texture analysis and considered it as a pre-attentive human visual perception unit (i.e., a unit that people pay attention to before they perceive the existence of texture). In a texton-based field of view, textures can be viewed as a regular combination of textons. To describe the texton, a discriminative model can be used. For example, a set of 48 Gaussian filters with different orientations, scales, and phases can be used to analyze the texture image. In this way, a high-dimensional feature vector is extracted at each pixel position. The K-means is used to cluster the output responses of these filters into a smaller number of mean vectors as texture elements. In addition, texture elements can also be defined in the context of the image generation model. In a three-layer image generation model, an image I is regarded as a superposition of a certain number of basis functions, and these basis functions are selected from an overcomplete dictionary Ψ . These image bases, such as Gabor functions or Gauss-Laplace functions with different scales, orientations, and positions, can be generated with fewer texture elements, and these texture elements can be selected out from a texture element dictionary. That is, an image I is generated by a base image B, and the base image B is generated by a texture element image T: Π
Ψ
T!B!I where Π ¼ {πi, i ¼ 1, 2,. . .}, Ψ ¼ {ψ i, i ¼ 1, 2,. . .} Each texture element (which is a sample in the texture element graph T) is considered as a combination of a certain number of basis functions with deformable geometric structures (such as stars, birds, snowflakes). By using this generative model to fit the observed image, the texture element dictionary can be learned as a parameter of the generative model. Texture element In texture analysis, the basic unit of texture in structural approach is one of the two key factors. It represents basic and simple texture units (generally small geometric primitives). Repeats on some surfaces and causes texture characteristics. It forms the basis of complex texture regions. See primitive texture element. Texel Abbreviation for texture element. Texture element difference moment A texture descriptor based on gray-level co-occurrence matrix. Letting pij be an element in the gray-level co-occurrence matrix, then the k-th order texel difference moment can be expressed as W M ðk Þ ¼
XX ði jÞk pij i
j
The first-order texture element difference moment is also called texture contrast.
1064
30
Texture Analysis
Grid of texel Same as regular grid of texel. Regular grid of texel Regularly arranged texels, or a grid of regular arrangements of texture primitives. Primitive texture element The basic unit of texture in structural approach, such as texel. Texem Abbreviation for texture example.
30.1.3 Texture Analysis Texture analysis One of the important areas of image analysis. It is a comprehensive study of texture segmentation, texture representation, texture description, texture classification, and so on. Method for texture analysis A way to represent and describe textures. The three commonly used methods are (1) statistical approach, (2) structural approach, and (3) spectrum approach. Texture representation See texture model. Texture enhancement A process similar to edge preserving smoothing, with the difference that texture boundaries are preserved instead of edges. Anisotropic texture filtering An anisotropic filtering method for filtering directional textures. Sometimes, it needs to be implemented with the help of graphics hardware (such as a GPU). Bilinear filtering A simple way to make blended texture images smooth and delicate. The basic idea is to read the gray or color values of four original pixels and combine the similar values. This method can be used when the low-resolution texture needs to be enlarged at a higher ratio to avoid affecting the quality of the image. Texture description A description of the nature of textures in an image area using texture descriptors. Texture descriptor A descriptor describing the nature of the texture. It can be a vector-valued function calculated from the image sub-windows, designed to produce similar output for different sub-windows with the same texture. The size of the image sub-window controls the scale of the detector. If the response at a pixel position (x, y) is the
30.1
Texture Overview
1065
maximum on multiple scales, an additional scale output s(x, y) can also be obtained. See texture primitive. Gray-level-difference histogram In the texture description, a statistical graph that takes into account the observation distance. Supposing an image I is viewed at a distance d, then the value of the graylevel-difference histogram can be expressed as |I(x, y) – I(s, t)|: (x – s)2 + (y – t)2 ¼ d2, where I(x, y) and I(s, t) are two pixels in the image. Syntactic texture description The texture is described by the syntax and transformation rules of local shapes or image patches. It is more suitable for modeling artificial textures. Texture categorization The process and technology of dividing various textures according to their characteristics and according to certain criteria. A typical classification method is to divide textures into globally ordered textures, locally ordered textures, and disordered textures. Texture classification An image is assigned to a set of texture categories, which are often defined by the training set images that express each category. Using texture classification criteria can help with texture segmentation. Texture recognition See texture classification. Texture matching Region matching based on texture description.
30.1.4 Texture Models Texture modeling A family of techniques for generating color textures and shape textures of a surface, such as texture mapping or bump mapping. Special classes such as hair or skin can also be modeled, and textures may change over time. Models can be used in computer graphics to generate textures, to describe textures in image analysis, or as a feature of region segmentation. Texture pattern A pattern consisting of texture elements. Texture model The theoretical basis of a class of texture descriptors. For example, autocorrela tion of a linear filter response, statistical texture description, or syntactic texture description.
1066
30
Texture Analysis
Texture statistical model See statistical approach. Fractal model for texture A texture description model based on observation of self-similarity. Useful in modeling natural textures. However, it is generally considered that this model is not suitable for representing local image structure. Generative model for texture A local neighborhood-based model for texture description, also called a miniature (a reduced version of the original image). Random field model for texture A model for texture description. It is assumed here that using only local information is sufficient to obtain a good global image representation. The key is to effectively estimate model parameters. A typical model is a Markov random field model, which is easy to handle when performing statistical analysis. Brodatz texture A well-known set of texture images, often used to test texture-related image technologies. See Brodatz’s album. Brodatz’s album An album called Textures. Published in 1966, the album contains 112 textured images collected by Brodac. These pictures were intended for artistic and design use, but later they were often used as standard images in texture analysis.
30.2
Texture Feature and Description
There are different description methods for different texture features (Russ and Neal 2016).
30.2.1 Texture Features Texture feature Features that reflect texture characteristics on the surface of the scene and the imaged image region. Categorization of texture features Classification of texture features in images. There is a method to classify texture features into four categories: statistical texture features, structural texture
30.2
Texture Feature and Description
1067
Fig. 30.1 Wedge filter schematic
V r
q U
features, signal processing-based texture features, and model-based texture features. Statistical texture feature Texture feature that measures the spatial distribution of pixel values. Structural texture feature Texture features described by means of structural information. Signal processing-based texture feature Use a filter bank on the image and calculate the characteristics of the filter response energy. In the spatial domain, gradient filters are often used to filter images to extract edges, lines, points, and so on. In the frequency domain, the most commonly used filters include ring filters and wedge filters. Model-based texture features After using random models and generative models to represent images, the estimated model parameters are used as alternative texture features. Texture Analysis can be performed using texture features. Typical methods include autoregressive models, random field models, epitome models, and fractal models. Wedge filter A filter that can be used to analyze the energy distribution in the frequency domain. If the image is converted into a power spectrum, wedge filters can be used to check the directionality of the texture. The wedge filter can be expressed in polar coordinate system as PðθÞ ¼
1 X
Pðr, θÞ
r¼0
where r is the radius diameter and θ is the radius angle. Figure 30.1 shows an example of a wedge-shaped filter in a polar coordinate system. Compare ring filter. Ring filter The filter has scope of action (range of action) as a ring. It can be used to analyze the texture energy distribution in the image power spectrum (Fourier transform squared). Can be expressed in polar coordinates as
1068
30
Fig. 30.2 Ring filter schematic
Texture Analysis V r
q
U
Pðr Þ ¼ 2
π X
Pðr, θÞ
θ¼0
where r is the radius and θ is the angle. Figure 30.2 shows a schematic diagram of a ring filter. The distribution of P(r) indicates how rough the texture is. Compare wedge filters. Texture gradient 1. In human vision, the observed level changes in the size and density of the projection of the scene on the retina. 2. In image engineering, changes in the scale and density of specific patterns appearing on the image after the surface pattern of the scene is imaged. 3. The gradient ∇s(x, y) of a single scalar output of a texture descriptor. For a homogeneous texture, a common example is scalar output, whose texture gradient can be used to calculate the direction of foreshortening. Texture orientation See texture gradient. Texture direction The direction of the texture gradient, or the direction in which the texture gradient is rotated 90 . Principal texture direction The direction corresponding to the peak in the Fourier transform of the texture image. To determine this direction, the Fourier amplitude curve can be regarded as the distribution of physical mass, and then the minimum inertia axis is calculated, and the direction of the axis is the principal direction of the texture. Directionality A texture feature observed from a psychological point of view. It mainly reflects the orientation of obvious structures in the image. Oriented texture A texture whose priority direction can be detected. For example, the orientation (horizontal direction) of a brick in a regular brick wall. See texture direction and texture orientation.
30.2
Texture Feature and Description
1069
30.2.2 Texture Description Homogeneous texture descriptor [HTD] A texture descriptor recommended in the international standard MPEG-7. Each texture image or each texture region is represented by 62 numbers (quantized to 8 bits). The calculation steps are as follows: 1. Filter the image with a Gabor filter bank with adjustable orientations (in six directions) and scales (in five levels). 2. Calculate the first and second moments of the frequency domain energy in the corresponding channel, and use them as the components of the texture descriptor. Homogeneous See homogeneous coordinates and homogeneous texture. Homogeneity Same as homogeneous. Homogeneity assumption A typical assumption for the texture pattern in the process of recovering the surface orientation from the texture. That is, the texture of a window selected at any position in the image is consistent with the texture of the window selected at other positions. More strictly speaking, the probability distribution of a pixel value depends only on the nature of the neighborhood of the pixel and has nothing to do with the spatial coordinates of the pixel itself. Texture browsing descriptor [TBD] A texture descriptor recommended in the international standard MPEG-7. Each of these textures requires up to 12 bits to be represented. The calculation steps of the texture browsing descriptor are as follows: 1. Filter images at different scales and orientations with Gabor filter banks. 2. Determine two texture directions from the filtered image (one is called the primary texture direction and the other is called the secondary texture direction), and each texture direction is represented by 3 bits. 3. Project the filtered image along its main texture direction, determine its regularity (quantized to 2 bits) and thickness (quantized to 2 bits) for the main texture direction, and determine the thickness only for the secondary texture direction (quantized to 2 bits). Texture spectrum A method similar to texture elements. Think of the texture image as a combination of texture elements, and use the overall distribution of these elements to characterize the texture. Each texture element contains a small local neighborhood, such as 3 3, where the pixels are thresholded based on the gray level of the center pixel (much like the LBP method). Pixels that are brighter or darker than the center pixel are assigned a value of 0 or 2, respectively, and other pixels are assigned a value of
1070
30
Texture Analysis
1. Then use these values to form a feature vector corresponding to the central pixel, and calculate the frequency of various values for the entire image to form the texture element spectrum. Different characteristics, such as symmetry and orientation, can be extracted from this spectrum for texture analysis. Granularity A texture feature observed by humans from a psychological point of view. Mainly related to the scale of texture elements. Texture energy See texture energy map. Texture energy measure A single-valued texture descriptor with a strong response in the texture region. Combine the results of several texture energy measurements to get a vector-valued texture descriptor. Texture energy map A map representing the local texture energy (the energy possessed by the texture mode) obtained by convolving the image with a specific template (also called a kernel). If f(x, y) is used to represent an image, and M1, M2, . . ., MN represent a set of templates, then convolution gn ¼ f * Mn(n ¼ 1, 2, . . . N ) gives each pixel neighborhood a texture energy component that expresses texture characteristics in. If a template with size k k is used, the elements of the texture image corresponding to the n-th template are T n ðx, yÞ ¼
1 kk
ðk1 XÞ=2
ðk1 XÞ=2
jgn ðx þ i, y þ jÞj
i¼ðk1Þ=2 j¼ðk1Þ=2
In this way, for each pixel position (x, y), a texture feature vector [T1(x, y) T2(x, y) . . . TN(x, y)]T can be obtained. Texture motion Blur that occurs when a textured surface moves relative to the camera. This is useful in computer graphics when generating motion observers looking at realistic blurry textures. Appearance prediction A branch of exterior design science, in which the texture of an object changes to make the observer’s experience predictable. Autoregressive model [AR model] 1. A model of continuous variables describing the conditional distribution of a linear Gaussian model, where each node has a Gaussian distribution and its mean is a linear function of the parent node.
30.3
Statistical Approach
1071
2. A model that uses the statistical characteristics of the past behavior of a variable to predict its future behavior. If the signal x_t at time t satisfies the autoregressive model, then

x_t = \sum_{n=1}^{p} a_n x_{t-n} + w_t

where w_t represents noise.
3. A model for texture description, often seen as an example of a Markov random field model. Similar to autocorrelation, this model also makes use of the linear dependency between pixels. For texture analysis, this model can be written as

g(s) = \mu + \sum_{d \in \Omega} \theta(d)\, g(s+d) + \varepsilon(s)
Among them, g(s) is the gray value of the pixel at position s in image I, d is a displacement vector, θ is a set of model parameters, μ is an offset that depends on the average gray level of the image, and ε(s) is the error term of the model, which depends on the neighborhood pixel set at position s. A commonly used second-order neighborhood is the 8-neighborhood of a pixel. These model parameters correspond to the characteristics of the texture, so they can be used as texture features.

Autoregressive moving average [ARMA] model
A model that combines an autoregressive model with a moving average technique and is equivalent to a Gaussian process model. It can be applied to statistical analysis and future prediction of time-series information.

Autoregressive moving average [ARMA]
See autoregressive moving average [ARMA] model.
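As a rough illustration of how the texture form of the autoregressive model above can be fitted in practice, the following sketch (assuming NumPy; the function name and the particular half-neighborhood chosen for Ω are illustrative, not taken from the source) estimates μ and θ(d) by least squares:

```python
import numpy as np

def fit_ar_texture(image, offsets=((0, 1), (1, 0), (1, 1), (1, -1))):
    """Estimate (mu, theta) of the AR texture model by least squares.

    image   : 2-D grayscale array.
    offsets : displacement vectors d in the neighborhood set Omega.
    Minimizes sum_s [g(s) - mu - sum_d theta(d) g(s+d)]^2.
    """
    h, w = image.shape
    pad = max(max(abs(dy), abs(dx)) for dy, dx in offsets)
    g = image.astype(float)
    core = g[pad:h - pad, pad:w - pad].ravel()            # g(s)
    cols = [np.ones_like(core)]                           # column for the offset mu
    for dy, dx in offsets:
        shifted = g[pad + dy:h - pad + dy, pad + dx:w - pad + dx]
        cols.append(shifted.ravel())                      # g(s + d)
    A = np.stack(cols, axis=1)
    params, *_ = np.linalg.lstsq(A, core, rcond=None)
    return params[0], dict(zip(offsets, params[1:]))      # mu, theta(d)
```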
30.3 Statistical Approach
Statistical descriptions of textures have a long history (Forsyth and Ponce 2012). The simplest statistical method uses the moments of gray histograms to describe textures (Russ and Neal 2016).
30.3.1 Texture Statistics

Statistical approach
A method of texture analysis. In the statistical approach, texture is regarded as a quantitative measurement of the density distribution in a region, so it is also called the texture statistical model. The statistical approach describes textures using statistical rules that characterize the distribution of, and relationships among, image gray levels.
Table 30.1 Some run-length features
Short run trend:            \sum_i \sum_j [P_\theta(i, j)/j^2] \,/\, \sum_i \sum_j P_\theta(i, j)
Long run trend:             \sum_i \sum_j j^2 P_\theta(i, j) \,/\, \sum_i \sum_j P_\theta(i, j)
Gray unevenness:            \sum_i [\sum_j P_\theta(i, j)]^2 \,/\, \sum_i \sum_j P_\theta(i, j)
Run-length non-uniformity:  \sum_j [\sum_i P_\theta(i, j)]^2 \,/\, \sum_i \sum_j P_\theta(i, j)
Run percentage:             \sum_i \sum_j P_\theta(i, j) \,/\, (wh)
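As a concrete illustration of the quantities behind Table 30.1, the following sketch (assuming NumPy and an image already quantized to small integer gray levels; the function names are illustrative, not from the source) builds the horizontal (θ = 0°) run-length matrix P_θ(i, j) and evaluates the short and long run trends:

```python
import numpy as np

def run_length_matrix(img, levels, max_len):
    """Horizontal (theta = 0 deg) run-length matrix:
    P[i, j-1] counts runs of length j with gray level i."""
    P = np.zeros((levels, max_len), dtype=int)
    for row in img:
        run_val, run_len = row[0], 1
        for v in row[1:]:
            if v == run_val:
                run_len += 1
            else:
                P[run_val, min(run_len, max_len) - 1] += 1
                run_val, run_len = v, 1
        P[run_val, min(run_len, max_len) - 1] += 1
    return P

def short_and_long_run_trend(P):
    """Short run trend and long run trend from Table 30.1."""
    j = np.arange(1, P.shape[1] + 1)          # run lengths
    total = P.sum()
    short = (P / j**2).sum() / total
    long_ = (P * j**2).sum() / total
    return short, long_
```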
Statistical texture
Texture described using the statistics of pixel neighborhoods in the image. Examples include co-occurrence statistics of neighboring pixel pairs, Fourier texture descriptors, autocorrelation, and autoregressive models. A specific example is the distribution statistics of the elements in a 5 × 5 neighborhood. These statistics can be learned from a set of training images or automatically discovered through clustering.

Texture contrast
See texture element difference moment.

Texture homogeneity
See texture element difference moment.

Run-length feature
A feature calculated from grayscale run lengths. A run is defined as a set of connected pixels with the same gray level that are collinear in the same direction. The number of pixels in a run is called the run length, and the frequency of such runs is called the run-length frequency. Let P_θ(i, j) be a matrix representing the run lengths, where each element records the frequency with which a run of j pixels with the same gray level i appears in the θ direction. Some commonly used statistics extracted from the run-length matrix for texture analysis are shown in Table 30.1 (w and h represent the width and height of the image, respectively).

Running total
The sum of the output values of each run within a certain spatial range. For example, when corner detection is performed using the SUSAN operator, for each pixel position in the image the gray value of each pixel covered by the template is compared with the gray value of the core pixel:

C(x_0, y_0; x, y) = \begin{cases} 1 & \text{if } |f(x_0, y_0) - f(x, y)| \le T \\ 0 & \text{if } |f(x_0, y_0) - f(x, y)| > T \end{cases}
Among them, (x_0, y_0) are the position coordinates of the core pixel in the image; (x, y) are the other positions in the template N(x_0, y_0); f(x_0, y_0) and f(x, y) are the gray levels of the pixels at (x_0, y_0) and (x, y); T is a threshold on the gray-level difference; and the function C(•; •) is the output of the comparison. Adding up these results gives the running total:

S(x_0, y_0) = \sum_{(x, y) \in N(x_0, y_0)} C(x_0, y_0; x, y)
This sum is actually the number of pixels in the USAN area, that is, the area of the USAN region. This area is minimized at corners, so corners can be detected from it. The threshold T can be used to help detect the minimum area of the USAN region and to determine the maximum amount of noise that can be eliminated.

Recursive-running sums
A method for quickly calculating the running total. When multiple runs of the same length are calculated continuously from left to right, each time the leftmost element is subtracted from the previous run sum and the new element on the right is added. In this order, each new running total costs only two additions, regardless of the run length. When calculating the running total in an image using a square 2-D template, the run height (element height) can be taken equal to the template width.

Laws operator
An operator (kernel) that uses a template to calculate various types of grayscale change and from which local texture energy can be computed. The measures obtained by these texture energy operators are regarded as one of the earliest filtering methods for texture analysis. If the image is f(x, y) and a set of templates is M_1, M_2, ..., M_N, then the convolutions g_n = f * M_n, n = 1, 2, ..., N, give the texture energy component expressing the texture property in the neighborhood of each pixel. If the template size is k × k, the texture image corresponding to the n-th template is

T_n(x, y) = \frac{1}{k \times k} \sum_{i=-(k-1)/2}^{(k-1)/2} \sum_{j=-(k-1)/2}^{(k-1)/2} |g_n(x+i, y+j)|
In this way, for each pixel position (x, y), a texture feature vector [T_1(x, y)  T_2(x, y)  ...  T_N(x, y)]^T can be obtained. The calculation of the Laws' texture energy measure involves first using a set of separable filters and then applying a nonlinear local window-based transformation. The template sizes commonly used by Laws operators are 3 × 3, 5 × 5, and 7 × 7. Letting L denote level, E edge, S shape (spot), W wave, R ripple, and O oscillation, various 1-D templates can be obtained. For example,
the 1-D row-vector forms corresponding to the 5 × 5 templates (the most commonly used set of five unit operators) are

L5 = [ 1   4   6   4   1 ]
E5 = [ -1  -2   0   2   1 ]
S5 = [ -1   0   2   0  -1 ]
W5 = [ -1   2   0  -2   1 ]
R5 = [ 1  -4   6  -4   1 ]
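A minimal sketch (assuming NumPy and SciPy are available; the function and variable names are illustrative, not from the source) of turning these five 1-D vectors into 5 × 5 masks by outer products and computing the local texture energy T_n over a k × k window:

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

VECTORS = {                      # the five 1-D Laws vectors listed above
    "L5": [1, 4, 6, 4, 1],
    "E5": [-1, -2, 0, 2, 1],
    "S5": [-1, 0, 2, 0, -1],
    "W5": [-1, 2, 0, -2, 1],
    "R5": [1, -4, 6, -4, 1],
}

def laws_energy(image, window=15):
    """Return a dict of texture-energy images T_n, one per 2-D Laws mask."""
    img = image.astype(float)
    energy = {}
    for vname, v in VECTORS.items():
        for hname, h in VECTORS.items():
            mask = np.outer(v, h)                       # 5 x 5 2-D Laws mask
            g = convolve(img, mask, mode="reflect")     # g_n = f * M_n
            # average |g_n| over the k x k window to get T_n
            energy[vname + hname] = uniform_filter(np.abs(g), size=window)
    return energy
```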
Among them, L5 gives a center-weighted local average, E5 detects edges, S5 detects points (spots), and R5 detects ripple. Starting from these 1-D operators, 25 2-D Laws operators can be generated by convolving a vertical 1-D operator with a horizontal 1-D operator.

Laws' texture energy measure
A measure of the amount of brightness variation at a pixel in an image. The measure is based on the five 1-D finite difference templates of the Laws operators, from which 25 2-D templates are obtained through orthogonal convolution. Convolving these 25 templates with the image and then nonlinearly smoothing and combining the outputs gives 14 contrast- and rotation-invariant measures.

Ade's eigenfilter
A texture description method developed from the Laws operator for energy-based texture description. It considers all possible pixel pairs in a 3 × 3 window and uses a 9 × 9 covariance matrix to characterize the gray-level data of the image. To diagonalize the covariance matrix, its eigenvectors need to be determined. These eigenvectors are similar to the templates of the Laws operator and are called filter templates or eigenfilter templates; they can produce a principal component image of a given texture. Each eigenvalue gives the variance in the original image that can be extracted by the corresponding filter, and these variances give a comprehensive description of the image texture from which the covariance matrix is derived. Filters that extract only small variances have relatively little effect on texture recognition.

Eigenfilter
A class of adaptive filters. Most filters used for texture analysis are non-adaptive, that is, the filters are predefined and not directly associated with the texture. The eigenfilter is an exception: it is data-dependent and can enhance the dominant characteristics of the texture, so it is considered adaptive. Such filters are often designed with the help of the Karhunen-Loève transform and can also be extracted from autocorrelation functions. Letting f(x, y) be the original image without displacement and f(x + n, y) be the image shifted by n pixels in the x direction, if the maximum value of n is 2, the eigenvectors and eigenvalues can be calculated from the autocorrelation matrix

\begin{bmatrix}
E[f(x,y)f(x,y)] & \cdots & E[f(x,y+2)f(x,y)] & \cdots & E[f(x+2,y+2)f(x,y)] \\
\vdots & & \vdots & & \vdots \\
E[f(x,y+2)f(x,y)] & \cdots & E[f(x,y+2)f(x,y+2)] & \cdots & E[f(x,y+2)f(x+2,y+2)] \\
\vdots & & \vdots & & \vdots \\
E[f(x+2,y+2)f(x,y)] & \cdots & E[f(x+2,y+2)f(x,y+2)] & \cdots & E[f(x+2,y+2)f(x+2,y+2)]
\end{bmatrix}

Among them, E[·] represents expectation. Each 9 × 1 eigenvector is arranged spatially into a 3 × 3 eigenfilter. The number of eigenfilters retained is determined by thresholding the sum of the eigenvalues. The filtered images are often called base images and can be used to reconstruct the original image; because of their orthogonality, they are regarded as an optimal representation of the image.

Statistical feature
Features used in the statistical approach of texture, based on statistical rules describing the gray-level distribution and relationships in images. The simplest set of statistical features includes the following histogram-based image (or region) descriptors: mean, variance (or its square root, i.e., standard deviation), skewness, energy (as a measure of uniformity), entropy, and so on.

Variance
A statistical characteristic of a random variable X, denoted σ², equal to the expected value of the squared difference between the variable and its mean μ, that is, σ² = E[(X − μ)²]. For a random variable f with probability density function p_f and mean μ_f, it is calculated by the following formula:
\sigma_f^2 = E\{(f - \mu_f)^2\} = \int_{-\infty}^{+\infty} (z - \mu_f)^2 \, p_f(z) \, dz
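A brief sketch (assuming NumPy and an 8-bit grayscale image; the skewness here is normalized by σ³, which is one common convention and an assumption, not from the source) of the histogram-based statistical features listed above, including the variance just defined:

```python
import numpy as np

def histogram_statistics(image, levels=256):
    """Histogram-based texture statistics: mean, variance, skewness, energy, entropy."""
    hist = np.bincount(image.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()                      # normalized gray-level histogram p(z)
    z = np.arange(levels, dtype=float)
    mean = (z * p).sum()
    var = ((z - mean) ** 2 * p).sum()
    skew = ((z - mean) ** 3 * p).sum() / (var ** 1.5 + 1e-12)
    energy = (p ** 2).sum()
    entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()
    return mean, var, skew, energy, entropy
```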
Precision parameter
The reciprocal of the variance.

Skewness
A measure of the direction and degree of skew of the distribution of statistical data. It reflects the degree of asymmetry of the distribution density relative to the mean, that is, the relative length of the tail of the density function curve. In the statistical approach of texture description, when the texture is described by the moments of the gray-level histogram, the third-order moment indicates the skewness of the histogram.
30.3.2 Co-occurrence Matrix

Co-occurrence matrix
A well-known and widely used form of texture feature representation. Consider an image f(x, y) of size W × H, and accumulate its second-order statistics into a set of 2-D matrices P(r, s|d), each computed for a given positioning (displacement) vector d = (d, θ) = (dx, dy) and recording the spatial dependence of the two measured gray values r and s (hence the name gray co-occurrence matrix). The number of occurrences (frequency) of r and s separated by the displacement d is used as the value of the (r, s) entry of the co-occurrence matrix P(r, s|d). The co-occurrence matrix can be written as

P(r, s|d) = \| \{ [(x_1, y_1), (x_2, y_2)] : f(x_1, y_1) = r, \ f(x_2, y_2) = s \} \|

where (x_1, y_1), (x_2, y_2) ∈ W × H, (x_2, y_2) = (x_1 + dx, y_1 + dy), and ‖·‖ is the cardinality of the set. From the co-occurrence matrix, texture features such as energy, entropy, contrast, uniformity, and correlation can be derived. But texture features based on the co-occurrence matrix also have some disadvantages: there is no unified method for optimizing d; the dimension of the co-occurrence matrix needs to be reduced to make it manageable; the counts in each co-occurrence matrix need to be statistically reliable; and a large number of features can be calculated over the displacement vectors, which means that a careful feature selection process is required.

Concurrence matrix
Same as co-occurrence matrix.

Grayscale co-occurrence
Two specific gray values appearing simultaneously at two specific locations (at a certain distance and orientation from each other) in the image. See co-occurrence matrix.

Generalized co-occurrence matrix
In texture analysis, the matrix obtained by generalizing the gray-level co-occurrence matrix. The generalization considers both the type of texture elements and the properties of texture elements. Let U be the set of texture elements in the image (pixel pairs are a special case), let the properties of these elements (grayscale is a special case) constitute the set V, let f be a function that assigns properties in V to the elements in U, let S ⊆ U × U be a set of specific spatial relationships, and let # denote the number of elements in a set; then each element of the generalized co-occurrence matrix Q can be represented as
q(v_1, v_2) = \frac{\#\{(u_1, u_2) \in S \mid f(u_1) = v_1 \ \& \ f(u_2) = v_2\}}{\#S}
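For the plain gray-level case described under co-occurrence matrix above, a minimal sketch (assuming NumPy, integer gray values below `levels`, and illustrative names not taken from the source) of accumulating the matrix for one displacement vector:

```python
import numpy as np

def glcm(img, d=(0, 1), levels=256, normalize=True):
    """Gray-level co-occurrence matrix P(r, s | d) for one displacement d = (dy, dx)."""
    dy, dx = d
    P = np.zeros((levels, levels), dtype=float)
    h, w = img.shape
    # pair each pixel (x1, y1) with (x2, y2) = (x1 + dx, y1 + dy)
    r = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    s = img[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
    np.add.at(P, (r.ravel(), s.ravel()), 1)   # count occurrences of (r, s)
    return P / P.sum() if normalize else P
```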
Gray-level co-occurrence matrix
A matrix that represents the statistics of the number of pixel pairs in an image that have a specific spatial relationship. Also called gray-level dependence matrix. Suppose S is a set of pixel pairs; then each element of the gray-level co-occurrence matrix P can be expressed as
# f½ðx1 , y1 Þ, ðx2 , y2 Þ 2 Sj f ðx1 , y1 Þ ¼ g1 &f ðx2 , y2 Þ ¼ g2 g #S
where the numerator on the right side is the number of pixel pairs with the given spatial relationship and gray values g_1 and g_2, and the denominator is the total number of pixel pairs (# represents the number of elements). The P obtained in this way is normalized.

Gray-level-difference matrix
The gray-level-difference statistics can be considered a subset of the co-occurrence matrix. Based on the distribution of pixel pairs separated by d = (dx, dy) whose gray-level difference is k, it can be expressed as

P(k|d) = \| \{ [(x_1, y_1), (x_2, y_2)] : f(x_1, y_1) - f(x_2, y_2) = k \} \|

where (x_2, y_2) = (x_1 + dx, y_1 + dy). Many features can be extracted from this matrix for texture analysis, such as the second-order angular moment, contrast, and mean.

Gray-level dependence matrix
Same as gray-level co-occurrence matrix.

Autocorrelation
1. Correlation between two functions of the same type, for example, the similarity between a signal and its translated version. For an infinite 1-D signal f(t): ℝ → ℝ, the autocorrelation after a translation Δt is

R_f(\Delta t) = \int_{-\infty}^{\infty} f(t)\, f(t + \Delta t) \, dt
The autocorrelation function R_f attains its maximum value at Δt = 0; a sharply peaked autocorrelation function decreases rapidly away from Δt = 0. The sample autocorrelation function of a finite set {f_1, ..., f_n} is {R_f(d) | d = 1, 2, ..., n−1}, where

R_f(d) = \frac{\sum_{i=1}^{n-d} (f_i - \bar{f})(f_{i+d} - \bar{f})}{\sum_{i=1}^{n} (f_i - \bar{f})^2}

and \bar{f} = \frac{1}{n}\sum_{i=1}^{n} f_i is the sample mean.
2. A texture description feature based on the statistical approach. Many natural textures are repetitive. Autocorrelation can describe the correlation between the original image matrix f(x, y) and the matrix obtained by translating the image. Letting the translation be represented by the vector d = (dx, dy), the autocorrelation can be expressed as

\rho(\mathbf{d}) = \frac{\sum_{x=0}^{w} \sum_{y=0}^{h} f(x, y)\, f(x + dx, y + dy)}{\sum_{x=0}^{w} \sum_{y=0}^{h} f^2(x, y)}
Autocorrelation is a second-order statistic and is rather sensitive to noise interference. Textures with strong regularity (repetition) show obvious peaks and valleys in their autocorrelation.

Texture entropy
A texture descriptor based on the gray-level co-occurrence matrix that gives a measure of the randomness of the texture. Letting p_ij be an element of the gray-level co-occurrence matrix, the texture entropy can be expressed as

W_E = -\sum_i \sum_j p_{ij} \log_2 p_{ij}
Texture uniformity
A texture descriptor based on the gray-level co-occurrence matrix. Also called texture second-order moment. Letting p_ij be an element of the gray-level co-occurrence matrix, the texture uniformity can be expressed as

W_U = \sum_i \sum_j p_{ij}^2
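A short sketch (assuming NumPy and a normalized co-occurrence matrix such as the one produced by the earlier `glcm` sketch) of the two descriptors just defined:

```python
import numpy as np

def texture_entropy_uniformity(P):
    """Texture entropy W_E and uniformity W_U from a normalized co-occurrence matrix P."""
    nz = P[P > 0]
    W_E = -(nz * np.log2(nz)).sum()     # entropy: randomness of the texture
    W_U = (P ** 2).sum()                # uniformity (texture second-order moment)
    return W_E, W_U
```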
Texture second-order moment
Same as texture uniformity.

Grayscale texture moment
Moments describing texture properties in gray-level images.

Inverse texture element difference moment
A texture descriptor based on the gray-level co-occurrence matrix. Letting p_ij be an element of the gray-level co-occurrence matrix, the inverse texture element difference moment of order k is

W_{IM}(k) = \sum_i \sum_j \frac{p_{ij}}{(i - j)^k}, \quad i \ne j
To avoid the problem of a zero denominator caused by i = j, the first-order inverse texture difference moment can be taken with a positive constant c introduced, giving the texture homogeneity:

W_H(c) = \sum_i \sum_j \frac{p_{ij}}{c + |i - j|}

30.4 Structural Approach
The basic idea of the structural approach is that complex textures can be formed by repeatedly arranging and combining some simple texture primitives (basic texture elements) according to certain regular rules (Zhang 2017b).
30.4.1 Structural Texture

Structural texture
Texture constructed by the regular repetition of a basic structure. Typical examples are brick walls or the glass curtain walls of office buildings.

Structural approach
A method of texture analysis. In the structural approach, texture is regarded as the result of a set of texture elements combined according to some regular or repetitive relationship, so it is also called the texture structural model. Similar to the use of string descriptors to describe the recurring relational structure of basic elements, there are two keys to analyzing textures with the structural approach: one is to determine the texture elements (sub-images), and the other is to determine the arrangement rules. Letting the texture primitive be h(x, y) and the arrangement rule be r(x, y), the texture t(x, y) can be expressed as

t(x, y) = h(x, y) \otimes r(x, y)

Texture structural model
See structural approach.

Coarseness
In texture analysis using the structural approach, a texture feature related to visual perception. It is mainly related to the density of texture elements.
Roughness
1. Roughness: a texture feature related to visual perception in texture analysis based on the structural approach. It is mainly related to the gray values of pixels and texture elements and to their spatial scales.
2. Concavity and convexity: the undulations of the earth's surface. It is also one of the ground physical properties that can be obtained from the pixel values of synthetic aperture radar images.

Local roughness
The smoothness of the surface texture or contour of an object; a characteristic that can be described by means of the Hausdorff dimension.

Line-likeness
In the structural approach to texture analysis, a texture feature that is related to visual perception and mainly considers the orientation of different texture elements.

Texture grammar
A syntax for describing textures, in which textures are considered an example of a simple pattern with a given spatial connection. A sentence from this grammar is a syntactic texture description.

Texture primitive
A basic unit of texture (such as a small repeating pattern) that can be used in syntactic texture description.

Texton boost
A method for RGB pixel classification that labels images using texture primitives (textons), spatial layout, and image context; the textons come from clustering multi-scale image derivatives. The layout is based on rectangular regions with consistent textons. (Texton, layout region) pairs become weak classifiers of pixels, and a strong classifier can be obtained by boosting. The context exploits the relative spatial position of the layout regions. Texton boosting can be used both for region segmentation and for region labeling. A system for segmenting and labeling different object regions in an image with the help of conditional random fields. Features used include pixel-based color distributions, location information (for example, the foreground is often in the center of the image), pixel-pair-based color gradients, and texture layout information (the classifier is trained by shared boosting). The texture layout feature first filters the image with a bank of 17 filters and then clusters the responses into 30 texton classes. Further, offset rectangular-region filtering with jointly boosted training is used to generate pixel-based texture layout features.

Arrangement rule
In the structural approach to texture analysis, the rules according to which texture primitives are arranged in order to form textures; usually expressed in a formal syntax.
Spatial relationship of primitives
A structural approach to texture analysis that describes the spatial distances between primitives, the locations and orientations of primitives, and so on. These are global features. Typical features include edge per unit area, gray-level run length, maximal component run length, relative extrema density, and so on.

Maximal component run length
A feature that describes the spatial relationship of primitives. First, the largest connected component in the image is computed and regarded as a 2-D run; then the size, orientation, maximum or minimum diameter, etc., measured on that component are used to describe the primitive.
30.4.2 Local Binary Pattern

Local binary pattern [LBP]
A texture analysis operator; a texture measure defined with the help of local neighborhoods. Given a local neighborhood of a point, the value of the central pixel is used to threshold the neighborhood. This generates a grayscale local descriptor that is insensitive to brightness and contrast transformations, so it can be used to generate local texture primitives. The basic local binary pattern uses the gray level of the center pixel of a sliding window as the threshold for its surrounding pixels. Its value is the weighted sum of the thresholded neighborhood pixels:

L_{P,R} = \sum_{p=0}^{P-1} \mathrm{sign}(g_p - g_e) \, 2^p
Among them, g_p and g_e are the gray levels of the neighboring pixels and the center pixel, respectively, P is the total number of neighboring pixels, R is the radius, and sign(•) is the sign function:

\mathrm{sign}(x) = \begin{cases} 1 & x \ge 0 \\ 0 & \text{otherwise} \end{cases}
Specifically, the pixels in the 3 × 3 neighborhood of a pixel are thresholded in sequence (a neighborhood pixel value greater than the center pixel value gives 1, and a value less than the center pixel value gives 0). The result is treated as a binary number and used as the label of the center pixel. LBP belongs to the point sample estimation methods and has the advantages of scale invariance, rotation invariance, and low computational complexity. It can be further divided into the spatial local binary pattern and the spatial-temporal local binary pattern and further extended to the volume local binary pattern.
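A minimal sketch of the basic 8-neighbor LBP computation just described (assuming NumPy; the neighbor ordering, which fixes the bit weights 2^p, is a free choice and is illustrative here):

```python
import numpy as np

def lbp8(image):
    """Basic 8-neighbor LBP code (P = 8, R = 1) for each interior pixel."""
    g = image.astype(int)
    center = g[1:-1, 1:-1]
    # neighbor offsets (dy, dx); position p gets bit weight 2**p
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(center)
    for p, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code += (neighbor >= center).astype(int) << p   # sign(g_p - g_e) * 2^p
    return code
```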
Fig. 30.3 Calculation of the LBP code and contrast measure (for the 3 × 3 neighborhood shown in the figure, LBP = 1 + 2 + 4 + 8 + 128 = 143 and C = (5 + 4 + 3 + 4 + 3)/5 − (1 + 2 + 0)/3 = 2.8)
Figure 30.3 shows an example of the LBP calculation for 8-neighborhood pixels. Based on the difference between the average gray value of the pixels brighter than the center pixel and the average gray value of the pixels darker than the center pixel, a simple local contrast measure C_{P,R} can also be calculated:

C_{P,R} = \sum_{p=0}^{P-1} \left[ \frac{\mathrm{sign}(g_p - g_e)\, g_p}{M} - \frac{\mathrm{sign}(g_e - g_p)\, g_p}{P - M} \right]

where M represents the number of bright pixels whose gray value is higher than that of the center pixel. Because it is computed as a complement to the LBP value in order to describe the local spatial relationship, the pair is collectively called LBP/C. The 2-D distribution of LBP and the local contrast measure can be used as a texture feature. An LBP operator with a circularly symmetric neighborhood is invariant to changes in illumination and image rotation (for example, compared with a co-occurrence matrix) and is simple to calculate.

Uniform pattern
An extension of the basic local binary pattern; more precisely, a special case of it. For a pixel, consider the circle of pixels in its neighborhood in order. If the pattern contains at most two transitions from 0 to 1 or from 1 to 0, the binary pattern is considered uniform. In this case the image generally has a relatively obvious (prominently periodic) texture structure.

Volume local binary pattern [VLBP]
Spatiotemporal extension of the local binary pattern, which can be used for texture description in 3-D, (X, Y, Z) or (X, Y, T), space (such as video images) and for analyzing its dynamics, including both motion and appearance.

Volume local binary pattern [VLBP] operator
An operator that computes volume local binary patterns. Generally three groups of planes are considered: XY, XT, and YT. The three types of LBP labels obtained are XY-LBP, XT-LBP, and YT-LBP. The first type contains spatial information, and the latter two contain spatial-temporal information. Three LBP histograms can be obtained from the three types of LBP labels, and they can also be combined into a single histogram. Figure 30.4 shows a schematic diagram.
Fig. 30.4 Histogram representation of the volume local binary pattern
In the figure, (a) shows the three planes of a dynamic texture, (b) shows the LBP histograms of each plane, and (c) is the feature histogram after merging.

Local ternary pattern [LTP]
A texture analysis operator obtained by extending the local binary pattern in order to overcome its sensitivity to noise and improve robustness. It is a local texture descriptor and an extension of the local binary pattern. Compared with the local binary pattern, an offset (essentially a threshold) is introduced, and the result of thresholding takes three values: if the neighborhood pixel value is greater than the center pixel value plus the offset, the result is set to 1; if the neighborhood pixel value is less than the center pixel value minus the offset, the result is set to −1; and if the absolute difference between the neighborhood pixel value and the center pixel value is less than the offset, the result is set to 0. In this way, the points in a region can be divided into three groups:
1. Points that are the same as or close to the center point (the absolute difference is less than a certain threshold).
2. Points that are less than the center point by more than a certain threshold.
3. Points that are greater than the center point by more than a certain threshold.

Local shape pattern [LSP]
A local descriptor obtained by supplementing the basic local binary pattern. It first calculates, in the neighborhood of each pixel in the image, an 8-bit binary string (corresponding to 256 patterns) as in the local binary pattern and then calculates a 4-bit binary string (corresponding to 16 patterns). Finally, the two strings are concatenated. The former string reflects the difference structure of the neighborhood, and the latter string reflects the orientation information of the neighborhood (somewhat like the scale-invariant feature transform). Taking a 3 × 3 pixel neighborhood as an example, the 8-neighborhood pixels of the central pixel are ordered as shown in Fig. 30.5a, that is, in the order indicated by the numbers in parentheses. Consider first the calculation of the local binary pattern. For the pixel neighborhood shown in Fig. 30.5b, the gray value of the central pixel is taken as the threshold, and the remaining eight pixels are thresholded: a value greater than the threshold is assigned 1 and a value less than the threshold is assigned 0. Figure 30.5c gives the assignment results. Writing out the binary number
Fig. 30.5 Local shape pattern calculation example
sequence in the order shown in Fig. 30.5a gives 10110001. Next consider calculating the 4-bit binary string. The gray levels of the last four pixels in the ordering are subtracted from the gray levels of the first four pixels, with the result shown in Fig. 30.5d. For this result, 0 is used as the threshold: a value greater than 0 is assigned 1 and a value less than 0 is assigned 0, giving the assignment shown in Fig. 30.5e. Writing this assignment as a binary string in the order of Fig. 30.5a gives 1010. Concatenating the two strings then gives the binary string of the local shape pattern: 101100011010.
30.5 Spectrum Approach
Texture is closely related to the high-frequency components of the image spectrum (Forsyth and Ponce 2012). Smooth images (containing mainly low-frequency components) are generally not treated as texture images. The spectrum approach works in the transform domain and focuses on the periodicity of the texture (Gonzalez and Woods 2008).

Spectrum approach
A method for texture analysis. The spectrum approach often uses the spectral distribution obtained after the Fourier transform, especially the peaks in the spectrum (narrow pulses of high energy), to describe the global periodicity of the texture. In recent years, other spectra (such as the Gabor spectrum, wavelet spectrum, etc.) have also been used for texture analysis.

Spectrum function
In the spectrum approach of texture analysis, a function obtained by transferring the spectrum to a polar coordinate system. The spectrum as a function of the polar coordinate variables (r, θ) is denoted S(r, θ). For each fixed direction θ, S(r, θ) is a 1-D function S_θ(r); for each fixed frequency r, S(r, θ) is a 1-D function S_r(θ). For a given θ, analyzing S_θ(r) gives the behavior of the spectrum along a ray from the origin; for a given r, analyzing S_r(θ) gives the behavior of the spectrum on a circle centered at the origin. If these functions are summed over their subscripts, a more global description is obtained:
S(r) = \sum_{\theta=0}^{\pi} S_\theta(r)

S(\theta) = \sum_{r=1}^{R} S_r(\theta)
Here, R is the radius of a circle centered at the origin. S(r) and S(θ) constitute a description of the spectral energy of the entire texture image or image region. S(r) is also called a ring feature (the summation route over θ is ring-shaped), and S(θ) is also called a wedge feature (the summation range over r is wedge-shaped).

Partial translational symmetry index
A texture descriptor derived from the Bessel-Fourier spectrum. Its expression is
C_T = \frac{\sum_{m=0}^{\infty}\sum_{n=0}^{\infty}\left\{ H_{m,n}^2 J_m^2(Z_{m,n}) - \left[A_{m,n}A_{m-1,n} + B_{m,n}B_{m-1,n}\right] J_m^2(Z_{m-1,n}) \right\}\dfrac{\Delta R}{2R_v}}{\sum_{m=0}^{\infty}\sum_{n=0}^{\infty} H_{m,n}^2 J_m^2(Z_{m,n})}
where R = 1, 2, ...; ΔR is the translation along the radius; H_{m,n} = A_{m,n} + B_{m,n}, with A_{m,n} and B_{m,n} the Bessel-Fourier coefficients; J_m is the m-th order Bessel function of the first kind; Z_{m,n} is the zero (root) of the Bessel function; and R_v corresponds to the size of the image, so that C_T ∈ (0, 1).

Partial rotational symmetry index
A texture descriptor derived from the Bessel-Fourier spectrum. Its expression is
C_R = \frac{\sum_{m=0}^{\infty}\sum_{n=0}^{\infty} H_{m,n}^2 \cos\!\left[m(2\pi/R)\right] J_m^2(Z_{m,n})}{\sum_{m=0}^{\infty}\sum_{n=0}^{\infty} H_{m,n}^2 J_m^2(Z_{m,n})}
where R = 1, 2, ...; H_{m,n} = A_{m,n} + B_{m,n}, with A_{m,n} and B_{m,n} the Bessel-Fourier coefficients; J_m is the m-th order Bessel function of the first kind; and Z_{m,n} is the zero (root) of the Bessel function.

Moment of the gray-level distribution
A texture descriptor derived from the gray-level histogram. Also called gray histogram moment. See invariant moments.

Gray-level histogram moment
Same as moment of the gray-level distribution.

Angular orientation feature
The directionality of texture space obtained by partitioning Fourier space into bins. It can be expressed as
A(\theta_1, \theta_2) = \sum\sum |F|^2(u, v)
Among them, |F|² is the Fourier power spectrum, u and v are variables in Fourier space (with values in [0, N−1]), and the summation limits are

\theta_1 \le \tan^{-1}(v/u) < \theta_2, \qquad 0 \le u, v \le N-1
The angular orientation feature expresses the sensitivity of the energy spectrum to the texture direction. If the texture contains many lines or edges in a given direction θ, the values of |F|² will gather in frequency space in the directions around θ + π/2.

Relative extrema
The minimum and maximum values measured in a local neighborhood, called the relative minimum and relative maximum, respectively. The relative frequency of local gray-level extrema can be used for texture analysis. For example, the texture is characterized by the number of extreme values extracted along each scan line together with an associated threshold. This simple method works well in real-time applications.

Relative maximum
The maximum of a pixel neighborhood attribute value in image engineering. If a pixel f(x, y) satisfies f(x, y) ≥ f(x + 1, y), f(x, y) ≥ f(x − 1, y), f(x, y) ≥ f(x, y + 1), and f(x, y) ≥ f(x, y − 1), then the pixel attains a relative maximum. Compare relative minimum.

Relative minimum
The minimum of a pixel neighborhood attribute value in image engineering. If a pixel f(x, y) satisfies f(x, y) ≤ f(x + 1, y), f(x, y) ≤ f(x − 1, y), f(x, y) ≤ f(x, y + 1), and f(x, y) ≤ f(x, y − 1), then the pixel attains a relative minimum. Compare relative maximum.

Relative extrema density
A feature that describes the spatial relationship of primitives. Extreme values can be obtained by scanning the image horizontally; they include relative minima and relative maxima. The relative extrema density is obtained by counting the extrema found and averaging.

Repetitiveness
A texture feature observed from a psychological perspective. Periods (cycles) can be used to quantify repetitiveness.
30.6 Texture Segmentation
Texture is an important clue for people to distinguish different regions of the object surface. The human visual system can easily recognize patterns that are close to the background mean but different in orientation or scale (Zhang 2006).
Fig. 30.6 Segmentation of five types of Brodatz textures
Texture segmentation
Segmentation of texture images, or image segmentation using texture characteristics. Texture segmentation can be supervised (supervised texture segmentation) or unsupervised (unsupervised texture segmentation). Texture segmentation divides an image into regions each having a consistent texture. In Fig. 30.6, the left image is a texture image composed of five textures from the Brodatz album, and the right image is a result of texture segmentation.

Texture field segmentation
See texture segmentation.

Texture field grouping
See texture segmentation.

Texture boundary
In texture segmentation, the boundary between adjacent regions. It also refers to the boundary perceived between two regions with different textures. In Fig. 30.7, the left half gives the boundaries of the eight texture regions obtained by segmentation. Although the right half is not segmented, it is easy to perceive the boundaries of its eight texture regions.

Boundary accuracy
Generally refers to the degree of consistency between the boundary obtained by segmenting a texture image and the texture boundary of the original image. It reflects the accuracy of the segmentation result at region edges. It can also be generalized to the accuracy of boundary detection of object regions in images of other attributes.

Texture region
A region in an image with significant texture characteristics. Also refers to a part of a texture image in which the gray values of pixels are uneven but appear somewhat regular.

Texture region extraction
See texture field segmentation.
Fig. 30.7 Texture boundary
Unsupervised texture image segmentation
Segmentation of a texture image without knowing in advance the number of texture categories in the original image. The number of unsupervised segmentation methods is relatively small, one reason being that it is difficult to determine the number of texture categories. Compare supervised texture image segmentation.

Unsupervised texture segmentation
In texture segmentation, a method used when the number of texture categories in the original image is not known in advance.

Supervised texture image segmentation
Segmentation of an image for which the number of texture categories in the original texture image is known. Some of these methods also obtain the segmentation result by clustering features extracted from the texture image. Compare unsupervised texture image segmentation.
Supervised texture segmentation
A texture segmentation method in which the number of texture categories in the image is considered known, and the segmentation result is obtained by classifying the features extracted from the texture image.

Semantic texton forest
A forest is an ensemble of decision trees in which the leaf nodes contain information about the distribution of the potential labels of the input structure under consideration. The decision result is obtained by averaging the decision results of the trees in the ensemble. A variation of the forest is the random forest, which uses different tests randomly generated at each split node of each tree. The texton extension uses a function of the values of a pair of pixels in the image patch. One common application of semantic texton forests is semantic image segmentation.
30.7 Texture Composition
Texture synthesis is based on classifying textures and constructing a model to generate the desired texture structure (Forsyth and Ponce 2012; Zhang 2017b).
30.7.1 Texture Categorization

Types of texture
The result of classifying textures according to texture characteristics. For example, one classification includes globally ordered texture, locally ordered texture, and disordered texture.

Globally ordered texture
A basic type of texture that contains a specific arrangement of certain texture primitives, or a specific distribution of texture primitives of the same type. Also called strong texture, since the arrangement regularity of the texture is very strong. The structural approach is often used for its analysis.

Strong texture
Same as globally ordered texture.

Locally ordered texture
A basic type of texture. Also called weak texture. Its main feature is that it has local directionality but is random or anisotropic over the whole image. It is not easy to model using only the statistical approach or the structural approach.

Weak texture
Same as locally ordered texture.
Disordered texture
A basic type of texture that has neither (periodic) repetitiveness nor (orientation) directionality. This kind of texture is suitable for description based on its unevenness and is analyzed with the statistical approach.

Temporal texture
A method for implementing video-based animation using video sequences. A video sequence is treated as a 3-D spatiotemporal volume with texture features, where the texture features can be described using an autoregressive temporal model.

Dynamic texture
Moving image sequences whose temporal characteristics exhibit certain statistically similar or stable texture patterns. Example sequences include fire, waves, smoke, and moving leaves. The simulation of dynamic texture is usually obtained by approximating a maximum likelihood model and using it for extrapolation. See temporal texture.

Dynamic texture analysis
Analysis of the dynamic changes of texture in 3-D space, including both motion changes and appearance changes. This requires extending the original 2-D local binary pattern (LBP) operator to the 3-D volume local binary pattern (VLBP) operator.

Texture transfer
A non-photorealistic rendering method. A source texture image and a reference (destination) image are used, and the method tries to match certain characteristics of the reference image with certain characteristics of the newly generated image. For example, the newly generated image should have an appearance similar to the source texture image while at the same time matching the brightness characteristics of the reference image. A more general definition: given a pair of images A and A′ (the unfiltered and filtered source images, respectively) and an unfiltered target image B, a new filtered image B′ is synthesized so that A : A′ :: B : B′.

Texture mapping
A technique for pasting or projecting a 2-D texture patch onto the surface of a 3-D object to make its appearance more realistic. Often used to produce special surface effects. In computer graphics, a step in drawing a polygonal surface, in which the surface color at each output screen pixel is obtained by interpolating values derived from an image. The source image pixel position can be calculated using the correspondence between polygon vertex coordinates and texture coordinates in the texture image.
Trilinear mapping
See MIP mapping.

MIP mapping
Originally refers to fast pre-filtering of images used for texture mapping. MIP stands for multum in parvo (much in little). It refers to a standard image pyramid in which each layer is filtered by a high-quality filter instead of being approximated by a low-quality one. To resample from a MIP map, first compute a scalar estimate of the resampling rate r. For example, r can be the maximum of the absolute values in the affine transformation matrix A (which eliminates aliasing) or the minimum of the absolute values in A (which reduces blur). Once the resampling rate r is determined, a fractional pyramid level can be calculated using the base-2 logarithm: l = log2(r). A simple method is to resample from the texture of the pyramid layer immediately above or below (depending on whether aliasing is to be eliminated or blur reduced). A better approach is to resample both images and blend them linearly using the fractional part of l. Since most implementations of MIP mapping use bilinear resampling within each layer, this method is often referred to as trilinear MIP mapping.

Trilinear MIP mapping
A method of implementing MIP mapping. After the resampling rate is estimated, the fractional pyramid level corresponding to its base-2 logarithm is selected. The layers immediately above and below are resampled separately and combined linearly. Because most implementations of MIP mapping use bilinear resampling within each layer and a linear combination is added across layers, it is often called trilinear MIP mapping.

Multi-pass texture mapping
An optimization method that achieves texture mapping while avoiding excessive aliasing. First, an ideal low-pass filter is used to adaptively pre-filter each pixel of the source image so that the image frequencies are below the Nyquist rate, and the image is then resampled to the desired resolution. In practice, interpolation and sampling are cascaded into a single polyphase digital filtering operation.
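A rough sketch (assuming NumPy, a grayscale image pyramid with at least two levels, and illustrative names not taken from the source) of the trilinear MIP-style lookup described above, blending the two pyramid levels around l = log2(r):

```python
import numpy as np

def mip_sample(pyramid, x, y, r):
    """Trilinear MIP-style lookup: blend the two pyramid levels around l = log2(r).

    pyramid : list of 2-D images; pyramid[0] is full resolution, each level half size.
    (x, y)  : sample position in level-0 coordinates.
    r       : estimated resampling rate (> 1 means minification).
    """
    l = np.log2(max(r, 1.0))
    lo = int(np.clip(np.floor(l), 0, len(pyramid) - 2))
    frac = float(np.clip(l - lo, 0.0, 1.0))

    def bilinear(img, u, v):
        # bilinear resampling within a single pyramid layer
        u = np.clip(u, 0, img.shape[1] - 1.001)
        v = np.clip(v, 0, img.shape[0] - 1.001)
        u0, v0 = int(u), int(v)
        du, dv = u - u0, v - v0
        return ((1 - du) * (1 - dv) * img[v0, u0] + du * (1 - dv) * img[v0, u0 + 1]
                + (1 - du) * dv * img[v0 + 1, u0] + du * dv * img[v0 + 1, u0 + 1])

    a = bilinear(pyramid[lo], x / 2 ** lo, y / 2 ** lo)
    b = bilinear(pyramid[lo + 1], x / 2 ** (lo + 1), y / 2 ** (lo + 1))
    return (1 - frac) * a + frac * b            # linear blend between the two levels
```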
30.7.2 Texture Generation

Texture synthesis
Given several texture samples, generating a larger, similar-looking image. By extension, it also refers to generating a texture image of an object surface from a texture description, or generating a composite image of a textured scene. More specifically, a texture is generated that looks as if it shares the characteristics of a set of training examples.

Linear composition
In texture synthesis, the same as transparent overlap.
Nonparametric texture synthesis
In texture synthesis, a method in which the content of the component textures is not changed. Example-based texture synthesis is a typical example.

Example-based texture synthesis
Directly copying and combining texture samples to synthesize texture images. It belongs to nonparametric texture synthesis.

Texture by numbers [TBN]
A framework for drawing or rendering based on the principle of image analogy. Given a natural image, regions of water, trees, and grass can be drawn by hand using pseudo-colors carrying the corresponding pixel semantics (equivalent to a sketch without textures). The newly drawn sketch can then be converted into a fully rendered composite image that simulates another natural image. It differs from general image analogy: if the natural image and the sketch are regarded as the source image A before operation and the target image A′ after operation, then a new natural image B is constructed from the new target image B′ (the newly drawn sketch).

Hole filling
An application of texture synthesis: filling the holes left after removing an object or a defect from a photo. This technique not only removes unwanted people or other intruders from a photo but also eliminates small flaws in old photos or films, as well as the wire ropes that suspend actors or props in a movie. In image restoration, it refers specifically to the techniques and steps for completing the holes left after removing a specific object or defect. It is the same as region filling.

Image quilting
A texture synthesis method that selects small image blocks or patches and then stitches them together. One image block at a time is chosen and copied from the source texture, instead of compositing a single pixel at a time. The transitions in the overlapping regions between the selected image blocks can be optimized by means of dynamic programming. An improvement that uses Markov random fields and belief propagation is called "Priority-BP."

Texture composition
A complicated texture type or structure obtained by combining various basic types of texture.

Functional combination
In texture composition, a method of combining different textures. The basic idea is to embed the features of one type of texture into the framework of another type of texture.

Transparent overlap
A texture synthesis method, also called linear composition. Let T1 and T2 be any two texture images; their linear superposition (the two transparent bodies covering each other) results in a third texture image T3, which can be expressed as
T_3 = c_1 T_1 + c_2 T_2

where the weights c_1 and c_2 are real numbers.

Opaque overlap
A texture synthesis method that combines different textures. The basic method for the opaque overlap of two types of texture is to superimpose one texture on another, with the later superimposed texture covering the original texture. If the texture image is divided into multiple regions and different opaque overlaps are performed in different regions, a combined image of multiple textures can be constructed.

Texture tessellation
A structural approach among texture analysis methods. The more regular textures are regarded as an orderly tessellation of texture primitives in space. The most typical methods include regular tessellation and semi-regular tessellation.

Regular tessellation
In texture tessellation, the method of tessellating with copies of the same regular polygon. Figure 30.8 shows the patterns obtained by tessellating three regular polygons, where (a) is a pattern consisting of regular triangles, (b) a pattern consisting of squares, and (c) a pattern consisting of regular hexagons.

Regularity
In the structural approach of texture analysis, a texture feature related to visual perception. It is mainly related to the regular arrangement and distribution of texture primitives.

Semi-regular tessellation
In texture tessellation, tessellation using two regular polygons with different numbers of sides. What matters here is the arrangement of the regular polygons, not the regular polygons themselves. Figure 30.9 shows several examples of semi-regular tessellation. To describe a particular tessellation pattern, the numbers of edges of the polygons surrounding a vertex can be listed in turn (labeled below each panel of the
Fig. 30.8 Three regular polygon tessellations
Fig. 30.9 Examples of semi-regular tessellations: (a) (4, 8, 8); (b) (3, 6, 3, 6); (c) (3, 3, 3, 3, 6); (d) (3, 3, 3, 4, 4)
figure). For example, for the pattern in Fig. 30.9c, there are four triangles (three sides each) and one hexagon (six sides) around each vertex, so it is denoted (3, 3, 3, 3, 6).

Mosaic
1. Tiling a larger-scale image region with similar geometric units. Texture tessellation is a typical example.
2. Building a larger image from a set of partially overlapping images collected from different viewpoints. The constructed image can have different geometries, such as appearing to be taken from a single perspective viewpoint or from an orthographic viewpoint. See image mosaic.

Generalized mosaic
A simple and effective way to extract additional information at each scene point. As the camera moves, it senses each scene point multiple times. Fusing the data from multiple images can give an image mosaic containing additional information about the scene, such as extended dynamic range, higher-order spectral quality, polarization, focus, and distance perception.
Chapter 31
Shape Analysis
Shape analysis mainly includes shape representation, shape description, and shape classification (Costa and Cesar 2001; Gonzalez and Woods 2018; Peters 2017).
31.1 Shape Overview
It is difficult to define shape in words; only some adjectives can approximate the characteristics of a shape (Costa and Cesar 2001; Peters 2017; Zhang 2017b).
31.1.1 Shape

Shape
Short for object shape; also often refers to the object itself. Informally, it refers to the (primitive combination) form of an image object or a scene object. It is usually described by a geometric expression (such as modeling the object contour with a polynomial or B-spline) or by a depth data patch with a quadric surface. More formal definitions are:
1. A family of point sets, any two of which are related by a coordinate transformation.
2. A special set of N-D points, such as the set of squares. As another example, a curve in ℝ² can be defined parametrically as C(t) = [x(t), y(t)], which contains a set of points, or the object shape {C(t) | −1 < t < 1}. In 3-D space, the solid inside the unit sphere is the shape {x | ‖x‖ < 1, x ∈ ℝ³}.
3. The characteristics of an object that do not change with the coordinate system used to describe it. If the coordinate system is Euclidean, this corresponds to
the concept of shape in the traditional sense. In an affine coordinate system, the change of coordinates can be affine; in that case an ellipse and a circle have the same shape.

Object shape
A pattern consisting of all points on the boundary of an object in an image. Abbreviated shape. In fact, because shapes are difficult to express with a mathematical formula or to define with precise mathematics, many discussions of shape rely on comparison, such as saying that an object is (in shape) like (similar to) a known object, and rarely use quantitative descriptors directly. In other words, discussions of shape often use relative concepts rather than absolute measures.

Geometric shape
An object with a relatively simple geometric form (such as the geometric primitives: circles, squares, spheres, cubes, generalized cylinders, etc.); also refers to objects that can be described by combinations of these geometric primitives.

Geometric feature
A general term for features describing the shape characteristics of certain data, including edges, corners, and other geometric primitives.

Geometric feature proximity
A measure of the distance between several geometric features, for example, as in hypothesis testing, the distance between the data and the features of the overlaid model.

Geometric feature learning
Learning geometric features from feature examples.
31.1.2 Shape Analysis

Shape analysis
An important branch of image analysis. The combination of object shape extraction, representation, description, classification, comparison, etc., focusing on describing the various shape characteristics of objects in an image.

Contour analysis
Analysis of the shape of an image region.

Shape decomposition
See segmentation and hierarchical model.

Relational shape description
A type of shape description technique based on the relationships between image regions or feature properties. Commonly used properties for regions include connection, adjacency, inclusion, relative area size, etc. See relational graph and region adjacency graph.
Shape grammar
A grammar that specifies a class of shape characteristics. Its rules determine how more basic shapes are incorporated. A rule consists of two components: (1) a description of a particular shape and (2) how to place or transform that shape. It has been used in industrial design, computer-aided design, and architecture.

Pattern grammar
See shape grammar.

Grammatical representation
A representation form in which several primitives are used to describe the object shape and a specific set of rules (a grammar) governs how the primitives are combined.

Shape recovery
Reconstructing the 3-D shape of an object from image data. Many such methods are grouped under shape from X.

Shape classification
The process of classifying shapes. It includes two aspects: on the one hand, determining whether an object of a given shape belongs to a predefined category; on the other hand, defining the classification system and identifying shape categories that are not specified in advance. The former can be regarded as a shape recognition problem, commonly solved by supervised classification; the latter is generally more difficult, often requires acquiring and using specialized knowledge, and can be solved by unsupervised classification or clustering.

Shape modeling
Constructing a compact expression of some form for a shape (object). For example, geometric modeling can be used for accurate models, and point distribution modeling can be used for a family of shapes. In general, a model is a somewhat simplified or abstract representation of an actual object.

Flexible template
A model describing a shape in which the relative positions of the object points are not fixed (they are defined in probabilistic form). This approach allows the appearance of the shape to change.

Shape recognition
Identifying an object shape class (such as squares and circles) or a specific shape (such as a trademark).

Shape matching
Matching two objects based on their shapes. Matching can be performed at a low level, where a point-to-point correspondence between the shapes is established, such as establishing correspondences between two faces. Matching can also be performed at a high level, where different parts of the shapes are compared through semantic descriptions, such as matching two different brands of cars.
Comparison of shape number
A method for determining the similarity of object shapes by means of the shape number; used to compare object shapes. Given the contour of an object, the shape number of each order is unique. Suppose the object contours A and B are both closed and both represented by chain codes with the same direction convention. If S4(A) = S4(B), S6(A) = S6(B), ..., Sk(A) = Sk(B), and Sk+2(A) ≠ Sk+2(B), ..., where S(·) represents the shape number and the subscript represents the order, then A and B have degree of similarity k (the order of the maximum common shape number of the two shapes). The inverse of the similarity between two shapes is defined as the distance between them:

D(A, B) = 1/k

Shape template
A geometric pattern used to match image data (for example, using correlation matching). The (object) shape should be rigid or parametric. The template generally scans a region of the image (or satisfies the conditions for Fourier matched filter object recognition). In Fig. 31.1, the template on the left is the Chinese character for "image," and it is to be shape-matched with the image on the right. There is no need to perform character recognition here: each character is regarded as a sub-image, and template matching is used for detection and localization.

Shape template models
Models that describe the shape or contour of an object. These models use a great deal of geometric information; assuming the shape of the object is fully known, only the position, scale, and orientation of the object need to be determined.

Shape transformation-based method
A shape description method that describes shapes with parametric models that transform one shape into another.

Liouville's theorem
A statement of an important property of Hamiltonian dynamic systems: if a region evolves according to the Hamiltonian dynamic equations, then although the shape of the region may change, its size remains unchanged.
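As a small illustration of the comparison of shape number entry above, the following sketch (the input format, a mapping from order to shape number, is an assumption made for illustration) finds the degree of similarity k and the distance D(A, B) = 1/k:

```python
def shape_similarity(shape_numbers_a, shape_numbers_b):
    """Degree of similarity k and distance D(A, B) = 1/k from per-order shape numbers.

    shape_numbers_a/b : dict mapping order (4, 6, 8, ...) to the shape number of that order.
    """
    k = 0
    for order in sorted(set(shape_numbers_a) & set(shape_numbers_b)):
        if shape_numbers_a[order] == shape_numbers_b[order]:
            k = order          # shape numbers still agree at this order
        else:
            break              # first order at which they differ
    return k, (1.0 / k if k else float("inf"))
```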
Fig. 31.1 Shape template matching example (left: template; right: image)
Ricci flow
A nonlinear equation for Riemannian manifolds used in shape analysis.

Kendall's shape space
A description system for semi-invariant geometry, in which the set of point coordinates (the configuration) is centered at the origin and scaled so that the sum of the squared Euclidean distances from the points to the centroid of the configuration equals the size of the configuration.
31.2
Shape Representation and Description
Descriptors basically correspond to the geometric parameters of the object, so they are all related to the scale (Barnsley 1988; Costa and Cesar 2001; Gonzalez and Woods 2008; Russ and Neal 2016).
31.2.1 Shape Representation Shape representation A large class of techniques that attempt to obtain the significant characteristics of (2-D and 3-D) object shapes. It is often aimed at analyzing and comparing object shapes. A large number of shape representation methods have been proposed in the literature, including calculating skeletons for 2-D and 3-D object shapes (see medial axis skeletonization, distance transform, etc.), and curvature-based representations (such as curvature primal sketch, curvature scale space, extended Gauss ian image, etc.), generalized cylinders used for articulated object, and many flexible object models (such as snake model, deformable superquadric, deform able template model, etc.). 2-D shape representation The representation of the object region in the 2-D image. 2-D shape An object consisting of points distributed in 2-D space (on a plane) or the shape of that object. Deformable shape See deformable model. Shape matrix A representation of the object shape using polar quantization. As shown in Fig. 31.2a, the coordinate system origin is placed at the center of gravity of the object, and the object is resampled in the radial and circumferential directions. These sampling data are (relatively) independent of the position and orientation of the
Fig. 31.2 Shape matrix construction: (a) polar resampling of the object about its center of gravity; (b) the resulting binary shape matrix
object. Let the radial increment be a function of the maximum radius, that is, the maximum radius is always quantized to the same number of intervals, and the resulting representation will be independent of the scale. The resulting array is called a shape matrix, as shown in Fig. 31.2b. The number of rows of the shape matrix is the same as the number of circumferential samples, and the number of columns of the shape matrix is the same as the number of radial samples. The shape matrix contains both the object boundary and the internal region information, so it can also express objects with holes. The shape matrix's projection, orientation, and scale can be standardized. Given two shape matrices M1 and M2 of size m × n, the similarity between them is (note that the matrices are binary matrices)

S = (1/mn) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} { [M1(i, j) ∧ M2(i, j)] ∨ [M̄1(i, j) ∧ M̄2(i, j)] }
Among them, the upper horizontal line represents a logical NOT operation. When S ¼ 1, it means that the two objects are exactly the same. As S gradually decreases and approaches 0, the two objects become increasingly dissimilar. If the samples are dense enough when constructing the shape matrix, the original object region can be reconstructed from the shape matrix. Picture tree An iterative shape representation for image and 2-D objects in which a tree data structure is used. Each node in the tree represents a region, and each region can be decomposed into sub-regions, which are represented by sub-nodes. In Fig. 31.3, the left image has five regions, and the right image is the corresponding picture tree. Silhouette The region enclosed by the outline of the object. Shadow puppet is a typical application. Generally, it is the result obtained by projecting a 3-D object onto a 2-D plane, which mainly corresponds to a thick shape (but in practice, the contour is often concerned, and it can also be regarded as a thin shape). In many cases, the shape of a 2-D object alone contains enough information to identify the original object, so a 2-D shape analysis method can be used to analyze the 3-D shape. See object contour.
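As a numerical illustration of the shape matrix similarity S defined above, the following sketch (an assumption of this edition, using NumPy; the example matrices are arbitrary) compares two binary shape matrices by counting the positions where they agree.

```python
import numpy as np

def shape_matrix_similarity(m1: np.ndarray, m2: np.ndarray) -> float:
    """Similarity S between two binary shape matrices of equal size m x n.

    S = (1/mn) * sum over (i, j) of [m1 AND m2] OR [NOT m1 AND NOT m2],
    i.e. the fraction of positions where the two matrices agree.
    S = 1 means identical; values near 0 mean very dissimilar.
    """
    if m1.shape != m2.shape:
        raise ValueError("shape matrices must have the same size")
    a, b = m1.astype(bool), m2.astype(bool)
    agree = (a & b) | (~a & ~b)
    return float(agree.mean())

# Two small 6 x 5 binary shape matrices (values are illustrative only)
M1 = np.array([[1, 1, 1, 1, 1],
               [1, 1, 0, 1, 1],
               [0, 1, 1, 1, 1],
               [0, 1, 1, 1, 1],
               [0, 1, 1, 1, 1],
               [0, 1, 1, 1, 0]])
M2 = np.array([[1, 1, 1, 1, 1],
               [1, 1, 1, 1, 1],
               [0, 1, 1, 1, 1],
               [0, 1, 1, 1, 1],
               [0, 1, 1, 1, 0],
               [0, 1, 1, 0, 0]])
print(shape_matrix_similarity(M1, M2))  # 0.9 -> fairly similar
```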
Fig. 31.3 Picture tree: an image containing five regions A–E (left) and the corresponding picture tree (right)
Fig. 31.4 Some archetype examples
Symmetry The property that a shape remains unchanged after undergoing at least one non-identical transformation from some particular transformation group. For example, the set of points that make up an ellipse is the same after the ellipse is rotated 180 degrees around its center. The contour image of a rotating surface is invariant to some homography under perspective projection, so the silhouette has projection symmetry. Affine symmetry is sometimes called skew symmetry. The symmetry caused by reflection on a line is called bilateral symmetry. Symmetry detection A method for searching symmetry in a collection of curves, surfaces, and points in an image. Global structure extraction Identify higher-level structures or connections in an image (such as symmetry detection). Anti-symmetric See partial order. Profiles A shape mark for an image region, specified by the number of pixels in each column (vertical section) or the number of pixels in each row (horizontal section). Can be used for pattern recognition. See shape and shape representation. Archetype Abstract shape representation of the same shape pattern. It is the original shape configuration, and the other shape configurations have evolved from it. For many objects, there are no clues such as color, texture, or depth and motion in the archetype, but they can still be correctly identified. Figure 31.4 shows three archetype examples.
31.2.2 Shape Model Deformable superquadric A hyperquadratic model that can be deformed by bending, skew, etc. to fit the data that needs to be modeled. Active shape model [ASM] A model that represents the shape of an object, which can be deformed to fit a new example of the object. The object shape is constrained by a statistical shape model that changes only as seen from the training set. Models are often constructed by using principal component analysis to determine the dominant pattern of shape changes in the observed shape examples. The shape of the model is a linear combination of these dominant modes. See active contour model. Deformation energy Minimal metrics are needed when constructing an active shape model. Usually it includes two types: (1) internal energy, deformation from the shape of the model, and (2) external energy, difference between the shape of the model and the actual data. Active appearance model [AAM] A generalized model of the active shape model, which contains all the information of the region covered by the object of interest (instead of mainly considering the information near the edge like the active shape model). It is a statistical model of the shape and gray appearance of the object of interest. This conditional model can be generalized to most practical examples. Matching with an image requires finding parameters that minimize the difference between the image and the synthetic model paradigm projected onto the image. A model for obtaining original mixed features of facial expressions. First, the shape information and texture information in the face image are combined to establish a parameterized description of the face, and then the principal component analysis is used to reduce the dimension. It can be thought of as a template-based approach. Generally, it requires a good initialization. The changes in shape (coded by the position of key feature points) and changes in texture (normalized to a typical shape before analysis) are modeled. It has been widely used in face recognition and face tracking. Appearance model A representation used to interpret an image, based on the appearance of the object. These models are often learned by using multiple fields of view on the object. See active appearance model and appearance-based recognition. Point distribution model [PDM] A shape description technique that can be used to detect and match objects (especially irregular and deformable non-rigid objects). A shape representation of a flexible 2-D contour. It is a deformable template model whose parameters can be learned through supervised learning techniques. It is suitable for expressing and
describing 2-D shapes that are undergoing related deformations or changes, such as component motion or shape changes. Typical examples include human hands, transistors on circuit boards, pedestrians in monitoring, etc. The shape change of the contour in a series of examples can be obtained by means of principal compo nent analysis. See active shape model. Statistical shape model A parametric shape model in which the parameters are assumed to be random variables derived from a known probability distribution. The distribution is learned from the training samples. The point distribution model is an example. See active shape model. A type of model that can be used to describe the shape change in a group of objects, which can be adjusted to fit any particular object in the group. Also called active shape model and point distribution model. These models use more geometric information than the active contour model, but less than the shape template models. Statistical shape prior Shape priors with statistical properties. The shape prior can constrain the estimation of the shape in noisy or restricted image data (such as stable estimation or guarantee that the reconstructed shape originates from a given shape class). Adding a shape that tends to reconstruct it and has a higher probability offset under the prior distribution is called a statistical shape prior. Examples include active shape models and point distribution models. Active net An active shape model with parameterized triangular meshes. Active surface 1. Use a depth sensor to determine the surface of the object. 2. Use the active shape model to fit the surface of the scene through deformation.
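The point distribution model and statistical shape model entries above both rest on applying principal component analysis to aligned landmark coordinates. The sketch below is an illustration only, not the handbook's own procedure: the landmark array, the number of retained modes, and the random training data are all assumptions.

```python
import numpy as np

def point_distribution_model(shapes: np.ndarray, n_modes: int = 2):
    """Build a simple point distribution model.

    shapes: array of shape (n_samples, 2 * n_landmarks); each row is an
            already aligned shape written as (x1, y1, x2, y2, ...).
    Returns the mean shape, the first n_modes modes of variation
    (eigenvectors of the covariance), and their eigenvalues.
    """
    mean_shape = shapes.mean(axis=0)
    centered = shapes - mean_shape
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_modes]   # keep the largest modes
    return mean_shape, eigvecs[:, order], eigvals[order]

def synthesize(mean_shape, modes, b):
    """New shape x = mean + P b, where b holds the mode weights."""
    return mean_shape + modes @ b

# Ten random training shapes with 4 landmarks each (illustrative data only)
rng = np.random.default_rng(0)
train = rng.normal(size=(10, 8))
mean_shape, modes, lam = point_distribution_model(train, n_modes=2)
new_shape = synthesize(mean_shape, modes, b=np.array([0.5, -0.2]))
print(new_shape.shape)  # (8,)
```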
31.2.3 Shape Description Shape description The process of describing a shape. The description of shapes is the basis of shape classification. Three types of methods are often used to describe shapes: (1) featurebased approach, (2) shape transformation-based methods, and (3) relation-based methods. Description-based on boundary A method of describing an object by describing its boundaries. Also called descrip tion-based on contour.
Description-based on contour Same as description-based on boundary. Invariant contour function A function that describes the shape of a flat graphic based on an external contour. Constant values relative to position, scale, or orientation can be calculated from the contour function. These calculated invariants can be used to identify examples of planar graphics. Shape context A descriptor that describes the shape of a 2-D image based on the distribution of vectors between point pairs. The distribution here is described by means of a 2-D histogram whose bins have log-pole spatial quantization. Specifically, each point p on the shape is sequentially taken, and each bin of the histogram records the number of corresponding values of the relative position vectors (p – q) of other points p and q on the shape. The histogram is a 2-D shape descriptor, and the two histograms can be compared using the dot product. It also represents a class of descriptors that describe the shape of a curve using local properties, which can be used for recognition work. It is more robust to the recognition of missing parts due to occlusion. In the recognition with segmenta tion, object recognition based on contours often uses the context information of contour shapes. Histogram of shape context [HOSC] A method to describe and compare the shape of object regions, in which a relative coordinate system is used to guarantee a certain degree of invariance. A commonly used coordinate system is a logarithmic polar coordinate system. Dimensionlessness One of the common characteristics of the shape parameters used to describe the object microstructure. If the shape parameter has no dimensions, its value does not change with the size of the object. This also shows that the shape cannot be described in an absolute way, and the shape must be described in a relative way. Chord distribution A 2-D shape description technique based on all chords (line segments between any two points on the outline) on the object. The object is described by calculating the length histogram and the orientation histogram of the chord, respectively. The value in the length histogram does not change with the object rotation but changes linearly with the scale. The values in the orientation histogram do not change with the object translation and scale. For example, given the contour of an object, the chord distribution is the distribution of the length and orientation of a line segment (chord) between any two points above it. This distribution can be used to describe the shape of the object. Referring to Fig. 31.5a, let c(x, y) ¼ 1 represent contour points and c(x, y) ¼ 0 represent other points. The chord distribution of an object can be written as
C(Δx, Δy) = Σ_i Σ_j c(i, j) c(i + Δx, j + Δy)

Fig. 31.5 Chord distribution computation ((a) a chord with components Δx and Δy, length r, and orientation θ)

Convert to polar coordinates, and let r = (Δx² + Δy²)^(1/2) and θ = tan⁻¹(Δy/Δx). In order to obtain the rotation-independent radial distribution Cr(r), all angles need to be summed (Fig. 31.5b):

Cr(r) = Σ_θ C(Δx, Δy)

The radial distribution Cr(r) changes linearly with the scale. The angular distribution Cθ(θ) does not change with scale, and the object rotation will cause a proportional shift:

Cθ(θ) = Σ_{r=0}^{max(r)} C(Δx, Δy)
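A compact way to compute the two chord histograms directly from a list of contour points is sketched below. This is an illustration only; the binning choices and the elliptical test contour are assumptions.

```python
import numpy as np

def chord_distribution(contour: np.ndarray, n_r_bins=16, n_theta_bins=16):
    """Length and orientation histograms of all chords of a contour.

    contour: array of shape (N, 2) holding (x, y) contour points.
    Every unordered pair of points defines one chord.
    Returns (radial_hist, angular_hist).
    """
    idx_i, idx_j = np.triu_indices(len(contour), k=1)  # each chord once
    d = contour[idx_j] - contour[idx_i]
    r = np.hypot(d[:, 0], d[:, 1])                 # chord lengths
    theta = np.arctan2(d[:, 1], d[:, 0]) % np.pi   # orientation in [0, pi)

    radial_hist, _ = np.histogram(r, bins=n_r_bins, range=(0, r.max()))
    angular_hist, _ = np.histogram(theta, bins=n_theta_bins, range=(0, np.pi))
    return radial_hist, angular_hist

# Illustrative contour: points sampled on an ellipse
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
contour = np.stack([20 * np.cos(t), 10 * np.sin(t)], axis=1)
cr, ctheta = chord_distribution(contour)
print(cr.sum(), ctheta.sum())  # both equal the total number of chords
```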
31.2.4 Shape Descriptors Shape descriptor Descriptors that describe the object shape, or descriptors that describe the shape characteristics of an object. The description of the shape should be based on the nature of the shape and the available theory or technology. On the one hand, shapes can be described by different descriptors based on different theoretical techniques; on the other hand, with the same theoretical technique, different descriptors can be obtained to characterize different properties of shapes. For example, when describing a closed region, the descriptors that can be used include area, invariant moment, shape context, Fourier shape descriptor, and so on. They all characterize certain properties of the object shape. There are many corresponding shape descriptors for describing curves, solids, and even symbols and trademarks. Contour-based shape descriptor A shape descriptor recommended by the international standard MPEG-7: the use of curve curvature of scale space images to express contours at multiple resolutions.
The contour of each curvature scale space expresses the convexity or concaveness of the original contour. Each object is expressed by the position of the extreme value in the curvature scale space image, and the matching of the curvature scale space image is realized by moving the contour in the curvature scale space so as to coincide with the contour to be matched as much as possible. This description is very compact. It can describe a contour in less than 14 bytes on average, but it can only be used to describe a single connected region. Shape descriptor based on curvature Various shape descriptors depicting the properties of the object shape obtained by calculating the curvature of each point on the object contour. The curvature value itself can be used to describe the shape, and some functions of the curvature value can also be used to describe the shape. See curvature of boundary. Region-based shape descriptor A shape descriptor recommended in the international standard MPEG-7. A set of angular radiation transformation coefficients is defined, which can be used to describe a single connected region or multiple disconnected regions. 2-D shape descriptor Descriptor describing the shape of the object region in a 2-D image. Histogram of local appearance context [HLAC] A context-based shape descriptor using a unique anti-noise local apparent descriptor. A histogram with these characteristics is used to construct the final descriptor. Shapeme histogram Distribution statistics of shape signatures. A shape signature is a unique cluster of shape features on a 2-D shape boundary or on a 3-D shape surface (given a 2-D or 3-D shape descriptor). The histogram records the number of occurrences of examples of different shape signatures on the object. Object recognition and image retrieval can be performed with the help of shapeme histograms. Figure 31.6 shows an L-shaped object on the left, which contains three shape signatures: convex corner points (red solid circles), concave corner points (green textured circles), and straight line segments (blue hollow circles). The figure on the right shows the shapeme histogram of the object. Fig. 31.6 Shapeme histogram
Chebyshev moments A set of moments based on orthogonal Chebyshev polynomials, defined on a set of discrete points, is suitable for digital images. Chebyshev moments can be used to define shape characteristics suitable for image compression, noise removal, and object recognition. Shape descriptor based on the statistics of curvature A variety of shape descriptors describing the properties of the object shape obtained by the curvature statistics of the points on the object contour. The most commonly used curvature statistics include mean, median, variance, entropy, moment, histogram, and so on. Bending energy [BE] 1. Terms borrowed from the mechanics of thin metal plates. If a set of landmark points is distributed in two thin metal plates, and the difference between Cartesian coordinates in the two sets is the vertical displacement of the plate, then bending energy is the energy required to bend the metal plate to make the landmark points coincide. When considering images, the landmark point set can also be a feature set. 2. The amount of energy stored due to the object shape. 3. A shape descriptor based on the statistics of curvature. Describes the energy required to bend a given curve into a predetermined shape. Its value is the sum of the squares of curvature of each point along the curve. Letting the length of the curve be L and the curvature of the point t above it be k(t), then the bending energy BE is
BE = Σ_{t=1}^{L} k²(t)
Bending Deformation of rod-shaped (extensible, incompact) objects that are not subject to axial force. In image technology, it usually means that the shape of a straight object changes to a curve.

Symmetry measure A shape descriptor based on the statistics of curvature. For a curved line segment, supposing its length is L, and the curvature of point t on the segment is k(t), then the symmetry measure can be expressed as

S = ∫₀^L [ ∫₀^t k(l) dl − A/2 ] dt
Among them, the internal integral is the amount of change in the angle along the curve from the starting point to the current position t, and A is the amount of change in the angle of the entire curve.
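For a digitized contour whose curvature has already been estimated at each point, both the bending energy and the symmetry measure can be approximated by simple sums, as in the sketch below. The discrete approximation and the sample curvature profile are assumptions of this illustration, not code from the handbook.

```python
import numpy as np

def bending_energy(curvature: np.ndarray) -> float:
    """Discrete bending energy: sum of squared curvature along the curve."""
    return float(np.sum(curvature ** 2))

def symmetry_measure(curvature: np.ndarray) -> float:
    """Discrete symmetry measure.

    The inner integral (cumulative turning angle up to point t) is replaced
    by a cumulative sum, A is the total turning angle of the curve, and the
    outer integral becomes a sum over all points.
    """
    turning = np.cumsum(curvature)   # angle change from the start to point t
    A = turning[-1]                  # angle change of the entire curve
    return float(np.sum(turning - A / 2.0))

# Illustrative curvature samples along a curve of 100 points
k = np.linspace(0.0, 0.2, 100)
print(bending_energy(k))    # larger for more strongly bent curves
print(symmetry_measure(k))  # close to 0 when the curvature profile is symmetric
```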
31.3
Shape Classification
Objects are grouped into a predefined set of categories based on their shape characteristics (Sonka et al. 2014; Zhang 2009). Class of shapes A set of object shapes with matching features. When studying the object shape in a specific image, if each object shape is considered as a member of a type of shape, the detection of the object shape of the image can be simplified to check whether the features of the object shape in the specific image are matching to the representative features of the known shape. Shape class In a given shape classification framework, one of a set of classes expressing different shape categories. For example, in the classification of the mean curvature and Gaussian curvature of a depth image, a local convex type or a hyperbolic type can be used. Shape membership When using the shape class to detect the object shape in the image, the index parameter used to determine the degree of matching. If the feature value of the shape A matches the feature value of a representative shape in the shape class C, it can be determined that the specific shape A is a member of the known class C. Classification of planar shapes A classification of the shape of a 2-D object on a plane. A classification method is shown in Fig. 31.7. Thin shape A representation type of 2-D shape. It is a region that does not include the interior of the shape and is also a set of connected points equivalent to the boundary. Corresponding to the boundary-based representation, its width is infinitesimal. Compare thick shapes. Thick shape A representation type of 2-D shape. Refers to the region that includes the interior of the shape, so it is not equivalent to a boundary or boundary representation. Corresponds to region-based representations. Compare thin shapes. Convex set The condition for a non-empty set in n-D Euclidean space to be convex is that each straight line segment between any two points in the set is also included in the set. For
Fig. 31.7 Classification of planar shapes: a 2-D shape is divided into thick shapes and thin shapes; thin shapes are further described by parametric curves that can be open or closed, single or composed, smooth or non-smooth, Jordan or non-Jordan, and regular or non-regular. Note: Terms shown in bold in the figure are included as headings in the text and can be consulted for reference
example, let A be a non-empty set of points in the Euclidean plane. Then the set A is convex if

(1 − λ)x + λy ∈ A,  for all x, y ∈ A,  0 ≤ λ ≤ 1  (convexity)

A non-empty set A is strictly convex if A is a closed set and

(1 − λ)x + λy ∈ A,  for all x, y ∈ A,  x ≠ y,  0 < λ < 1  (strict convexity)

0, the structure under consideration is more like a square; otherwise, it is more like a circle.

2-D linear discriminant analysis [2-D-LDA] Linear discriminant analysis in the 2-D case.

Direct LDA [DLDA] An improved feature extraction method for linear discriminant analysis (LDA). When the number of samples is less than the dimension of the samples, the intra-class and inter-class scattering matrices are singular. By improving the process of simultaneous diagonalization, the zero space of the intra-class scatter matrix that does not carry the identification information can be eliminated and the zero space of the inter-class scatter matrix that carries the identification information is retained, so there is no need to use other dimensional reduction technologies before the linear discriminant analysis.

Probabilistic linear discriminant analysis [PLDA] Probabilistic version of linear discriminant analysis. In face recognition, an image is decomposed into the sum of two parts: a deterministic component (depending on the representation of the basic identity) and a random component (indicating that the two face images obtained from the same person are different).

Fisher linear discriminant analysis [Fisher LDA, FLDA] One of the most primitive special linear discriminant analyses. Mainly aiming at the two-class classification problems, the goal is to find the projection direction that can distinguish the two types of data points to the greatest extent by selecting an appropriate projection straight line. In face recognition, the Fisher objective function (also known as the Fisher criterion function or the Fisher discriminant function) is to maximize the distance between the face images belonging to different people and to minimize the distance between the face images belonging to the same person.

Fisher discriminant function See Fisher linear discriminant analysis.

Fisher linear discriminant function Same as linear discriminant function.

Optimal LDA [OLDA] A generalization of the linear discriminant analysis method to the singular value case. First, use principal component analysis to reduce the dimensionality of the feature space. Then, the transform space is divided into two subspaces: the zero space of the intra-class scatter matrix and its orthogonal complement, and the optimal discrimination vector is selected from them. Because this method can extract the optimal Fisher discriminant features in the case of singular values, it is also called optimal Fisher linear discriminant.

Optimal Fisher linear discriminant [OFLD] See optimal LDA.
Null space LDA [NSLDA] An improved feature extraction method for linear discriminant analysis (LDA). First, calculate the projection vector in the zero space of the intra-class scatter matrix. If the subspace does not exist, that is, the intra-class scatter matrix is non-singular, then ordinary linear discriminant analysis methods can be used to extract features. Otherwise, small sample (size) problems will occur. At this time, the transformation sample that can maximize the inter-class scattering needs to be selected as the projection axis. Because the scatter of all samples is zero in the zero space of the intra-class scatter matrix, the projection vector that can satisfy the linear discriminant analysis target is the projection vector that maximizes intra-class scattering.
33.8.2 Kernel Discriminant Kernel discriminant analysis A classification method based on the following three key observations: (1) some problems require curve classification boundaries; (2) classification boundaries need to be defined locally rather than globally according to the category; and (3) highdimensional classification space can be avoided by using kernel techniques. This type of method provides a transformation by means of a kernel function, so that linear discriminant analysis can be performed in the input space instead of the transformation space. Kernel LDA [KLDA] A nonlinear feature extraction method based on kernel criterion. It projects the target into a high-dimensional feature space through nonlinear mapping and uses Fisher criterion functions and kernel techniques to extract the effective features of the target. Kernel Fisher discriminant analysis [KFDA] The results obtained by extending Fisher’s discriminant analysis method to nonlinear cases by introducing kernel trick. Using kernel trick can avoid highdimensional feature spaces. It first nonlinearly maps the data of the original input space to a certain feature space, and then performs a Fisher linear discriminant analysis in this feature space, so that the nonlinear identification of the original input space is implicitly realized. As a classification method for feature space, one of its characteristics is that it can be used to generate curved classification boundaries (here, the class is used to define locally instead of globally). Enhanced FLD model [EFM] An improved model for Fisher linear discriminant analysis. With the help of regularization (constraint optimization) technology, it achieves zero data reduction and enhances the discrimination capability.
Fisher kernel A method for defining kernel functions using a generative model. Consider a parameter generation model p(x|θ), where θ represents a parameter vector. The purpose here is to determine a kernel that measures the similarity between the two input vectors x and x' caused by the generated model. A gradient relative to θ can be used to define a vector in the feature space with the same dimensions. Specifically, consider the Fisher score:

g(θ, x) = ∇θ ln p(x|θ)

The Fisher kernel:

k(x, x') = g(θ, x)ᵀ F⁻¹ g(θ, x')

where F is the Fisher information matrix and is given by

F = E_x[ g(θ, x) g(θ, x)ᵀ ]

Among them, the expectation is relative to x under the distribution p(x|θ).

Fisher linear discriminant [FLD] A classification method that maps high-dimensional data to a single dimension and maximizes the ability to distinguish categories. Projecting high-dimensional data x to a single dimension y can be represented as

y = wᵀx

where w is the weight vector. In general, projecting high-dimensional data into a single dimension results in a significant loss of information. Categories that were originally well separated in high dimensions may have significant overlap in single dimensions after projection. However, you can choose a projection that maximizes the distinction between categories by adjusting the weight of the projection. Take a two-class problem as an example. Let class C1 have N1 points and class C2 have N2 points. The mean vectors of the two classes are

m1 = (1/N1) Σ_{n∈C1} xn,   m2 = (1/N2) Σ_{n∈C2} xn
The simplest measure of class discrimination is the discrimination of the class mean after projection. So we should choose w to maximize the formula:
m̃2 − m̃1 = wᵀ(m2 − m1)

where m̃k = wᵀmk is the projected mean of class Ck. Fisher linear identification is to maximize a function, on the one hand, to obtain a large discrimination of the class mean after projection and, on the other hand, to maintain the variance within the categories, thereby minimizing the overlap between categories. For category k, the variance within the category is

s_k² = Σ_{n∈Ck} (yn − m̃k)²

where yn = wᵀxn. For the entire data set, the total intra-class variance is s1² + s2². The Fisher criterion is defined as the ratio of the variance between classes to the variance within classes:

J(w) = (m̃2 − m̃1)² / (s1² + s2²)

It can also be written as

J(w) = (wᵀ SB w) / (wᵀ SW w)

where SB is the between-class covariance matrix:

SB = (m2 − m1)(m2 − m1)ᵀ

and SW is the within-class covariance matrix:

SW = Σ_{n∈C1} (xn − m1)(xn − m1)ᵀ + Σ_{n∈C2} (xn − m2)(xn − m2)ᵀ

As can be seen from the above, J(w) maximization holds when the following formula is satisfied:

(wᵀ SB w) SW w = (wᵀ SW w) SB w

The terms in parentheses in the above formula are equivalent to scale factors; removing them gives

w ∝ SW⁻¹ (m2 − m1)
The above formula gives the optimal projection direction and can be used to further construct the discriminant. The above formula is often called the Fisher linear discriminant directly. See linear discriminant analysis. Fisher linear discriminant function Same as linear discriminant function. Fisher information matrix See Fisher kernel. Information matrix Another term for inverse covariance, and it gives the amount of information contained in a given measurement. Also called Fisher information matrix. Fisher criterion function See Fisher linear discriminant analysis. Fisher criterion See Fisher criterion function.
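A compact numerical sketch of the two-class Fisher linear discriminant described above is given below. It is NumPy-based; the toy Gaussian data, the small ridge term added to SW, and the midpoint threshold are assumptions of this illustration.

```python
import numpy as np

def fisher_lda_direction(x1: np.ndarray, x2: np.ndarray, eps: float = 1e-6):
    """Optimal 1-D projection direction w proportional to SW^{-1} (m2 - m1).

    x1, x2: arrays of shape (n_samples, n_features) holding the two classes.
    eps: small ridge added to SW so the inverse exists for degenerate data.
    """
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    sw = (x1 - m1).T @ (x1 - m1) + (x2 - m2).T @ (x2 - m2)  # within-class scatter
    w = np.linalg.solve(sw + eps * np.eye(sw.shape[0]), m2 - m1)
    return w / np.linalg.norm(w)

# Two Gaussian blobs in 3-D (illustrative data)
rng = np.random.default_rng(1)
c1 = rng.normal(loc=[0, 0, 0], scale=0.5, size=(50, 3))
c2 = rng.normal(loc=[2, 1, 0], scale=0.5, size=(50, 3))
w = fisher_lda_direction(c1, c2)
y1, y2 = c1 @ w, c2 @ w                      # 1-D projections of the two classes
threshold = 0.5 * (y1.mean() + y2.mean())    # a simple midpoint discriminant
print(w, threshold)
```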
33.8.3 Decision Function Decision The second step for general classification problems (the first step is inference). It uses the posterior probabilities obtained by the inference stage to optimize the classification. Decision function In pattern recognition, a numerical function used to determine a class for classifying patterns. Loss Matrix A matrix representing different losses from different decisions. Consider a binary classification problem and determine if you are sick through a physical examination. There should be no loss if the disease is judged to be diseased or no disease is judged to be disease-free; but it is a mistake to judge disease-free to be sick, and it is a big mistake to judge sick to be disease-free. In this way, the loss matrix can be determined as
                 Disease-free    Sick
No disease             0           1
Disease              100           0
Among them, the row corresponds to the real situation, and the column corresponds to the judgment result.
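Given class posteriors, such a loss matrix leads directly to a minimum expected loss decision, as in the sketch below; the posterior values are made up for illustration, and the matrix is the one shown above.

```python
import numpy as np

# Loss matrix: rows = real situation (no disease, disease),
#              columns = judgment (disease-free, sick)
LOSS = np.array([[0.0,   1.0],
                 [100.0, 0.0]])

def decide(posterior: np.ndarray) -> int:
    """Return the decision (column index) with minimum expected loss.

    posterior: probabilities of the real states, e.g. [P(no disease), P(disease)].
    Expected loss of decision j is sum_k posterior[k] * LOSS[k, j].
    """
    expected_loss = posterior @ LOSS
    return int(np.argmin(expected_loss))

# Even a small disease probability triggers the "sick" decision (column 1),
# because missing a disease costs 100 times more than a false alarm.
print(decide(np.array([0.97, 0.03])))    # -> 1 (judge sick)
print(decide(np.array([0.999, 0.001])))  # -> 0 (judge disease-free)
```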
Linear decision function A decision function used to separate two training sets that are linearly separable. Decision theory The theory of making optimal decisions in situations of uncertainty. Decision rule In pattern classification, the purpose is to assign a pixel or a target to the corresponding pattern class based on the features extracted from the image. A decision rule generally assigns each observation unit and only one (category) label according to the measurement mode or the corresponding feature mode in the data sequence. Nonparametric decision rule Same as distribution-free decision rule. Simple decision rule Decision rules to assign a (category) label to each observation unit based only on the measurement pattern or corresponding feature pattern in the data sequence. Here, each observation unit can be treated separately or independently. At this time, a decision rule can be regarded as a function, and only one (category) label is assigned to each pattern in the measurement space or each feature in the feature space. Distribution-free decision rule Decision rules with no assumptions for the form of the conditional probability distribution function for a given category pattern. Hierarchical decision rule Decision rules for tree structures. For example, in a binary tree, each non-leaf node contains a simple decision rule, which divides the pattern into the left child node or the right child node; and each leaf node contains the category assigned to the observation unit. Compound decision rule In image classification, the decision rule that the observation unit is assigned to a category according to some subsequences of common measurement patterns or corresponding sequences of feature patterns in the data sequence. Minimum description length [MDL] A decision principle that is applied in many aspects of image engineering. For example, in image coding, if the same quality image can be given, the method with the smallest amount of data is the best; in image description, the simpler it is assumed, the more likely it is true; in energy-based segmentation, the segmentation corresponding to the smallest energy is often the best. A criterion for comparing descriptions. The implicit assumption often based on is that the best description is the shortest description (i.e., it can be encoded with the fewest bits). The minimum description often requires several components:
1. Observed models, such as lines or arcs. 2. Parameters of the model, such as the length of the line. 3. The image data is different from the model, such as obvious gaps or noise model parameters. 4. The rest of the image that cannot be interpreted by the model. Decision boundary After the space is divided according to the classification criteria, the boundaries between the obtained subspaces. Consider a classifier that takes a vector in ℝd space as input, divides the space into a set of regions, and assigns all points in a region (determined by a decision) to a specific category. At this time, the boundary between the regions is the decision boundary. See decision region. Decision surface See decision region. Decision region In the classification problem, the input space is divided into different categories of regions (which can be 2-D, 3-D, or higher dimensional). The boundary of a region is called the decision boundary or decision surface. Decision tree One type of tree structure is used for model combination. The choice of models depends on input variables, so different models need to be used in different regions of the input space. In the decision tree, the model selection process can be described as a series of binary choices corresponding to the traversal tree structure. Decision trees can be used for both classification and regression problems. See binary tree. It is a tool to help choose between several modes of operation or routes. An agent can effectively select the appropriate option based on its structure and examine the possible outcomes. It also balances the risks and rewards associated with every possible way of doing things. Figure 33.17 shows an example of a decision tree. Tree classifier A classifier that uses a series of tests on input data point x to determine the category C to which each point belongs. See decision tree.
Fig. 33.17 An example of a decision tree (from a start point, rule nodes lead to operations and intermediate results, ending in a final decision)
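A decision tree of the kind sketched in Fig. 33.17 can be written as nested binary tests. The toy tree below is entirely hypothetical: the features (area, elongation), thresholds, and class names are invented for this sketch.

```python
# A hand-built decision tree for a 2-D feature vector (area, elongation).
# Each internal node is a binary test; each leaf assigns a category label.

def classify(area: float, elongation: float) -> str:
    if area < 50.0:                  # root test
        if elongation < 2.0:
            return "small blob"
        return "thin fragment"
    if elongation < 1.5:             # large and roughly isotropic
        return "compact object"
    return "elongated object"

print(classify(area=30.0, elongation=3.2))   # thin fragment
print(classify(area=120.0, elongation=1.1))  # compact object
```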
Hierarchical mixture of experts [HME] An improvement on the standard decision tree is a fully probabilistic tree-based model. It divides the input space by probability and does not require the use of a dividing line parallel to the coordinate axis, and the dividing function depends on all input variables. In addition, there are probabilistic explanations for leaf nodes. Subsequence-wise decision A strategy to divide the entire video sequence into several subsequences and make a global optimal decision based on the information provided by the subsequences. In target tracking, the input video can be first divided into several subsequences, followed by tracking in each subsequence (using multiple frames of information), and finally combining the results. Subsequence decision-making can also be regarded as a generalization of frame-by-frame decision-making. If each subsequence divided is one frame, the subsequence decision is degraded into frame-byframe decision-making.
33.9
Syntactic Recognition
Realizing structural pattern recognition requires defining a set of pattern primitives, a set of rules that determine the interaction of these primitives, and a recognizer, called an automaton (Bishop 2006; Theodoridis and Koutroumbas 2003; Zhang 2017b).
33.9.1 Grammar and Syntactic Grammar A rule system that restricts the way in which the basic elements are combined. In a natural language environment, grammar includes syntax and grammar that specify words and word combinations. In image engineering, consider primitives as simple textures, shapes, features, etc. and combine them to represent objects. The rules describing the interaction of pattern primitives in structural pattern recognition determine the structure of the recognizer. It is also a mathematical model of a language generator, which can be described by a quadruple: G ¼ fF n , F t , R, Sg Among them, the elements of Fn are called unending symbols, and the elements of Ft are called ending symbols. They are non-overlapping symbol sets; the set R is a non-empty finite set of sets Fn* Ft*, where Fn* and Ft* are the set of all unending symbols and ending symbols with empty sets, whose elements are called rewriting rules; S is a grammatical atom or a starting symbol. Given a grammar, the set of all words that can be generated from it is called the language L(G).
According to the different substitution rules, the syntax can be divided into four categories: 1. Class 0—General grammar: There are no restrictions on rewriting rules. 2. Class 1—Context-sensitive grammar: Rewriting rules can be of the form W 1f W 2 ! W 1V W 2 The grammar contains the rule S ! e, where e is an empty word; W1, V, and W2 are composed of the elements of Fn* and Ft*, V 6¼ e, f 2 Fn. Under the condition that the context is W1 and W2 (i.e., context-sensitive), the unending symbol f can be replaced by V. 3. Class 2—Context-free grammar: The rewriting rule can be of the form f !V Among them, V 2 Fn* Ft*. The grammar contains the rule S ! e, where e is an empty word. Without considering the context of f (i.e., free of context), the unending symbol f can be replaced by V. 4. Class 3—Regular grammar: The rewriting rules can be of the form f ! zg
or
f !z
Among them, f, g 2 Fn, z 2 Ft. The grammar can also include the rewriting rule S ! e. General grammar See grammar. Context-sensitive grammar See grammar. Context-free grammar [CFG] 1. A string grammar commonly used in structural pattern recognition. If we let the nonterminal A in the nonterminal set N, we will only include production rules of the form A ! α; α is in the set (N[T )* – λ. In other words, α can be any string consisting of a terminal symbol and a nonterminal symbol except the empty set. Compare regular grammar. 2. In synthesis-based activity modeling and recognition, a syntax for identifying visual activities. It uses a layered process. At the lower level is a combination of hidden Markov models and belief networks. At the higher levels, interactions are modeled with context-free grammar. Because it has a strong theoretical foundation for modeling structured processes, it is often used to model and identify human movements and multi-person interactions.
Stochastic context-free grammar [SCFG] Probabilistic expansion of context-free grammar. Can be used to model the semantics of an activity whose structural assumptions are known and is more suitable for combining actual visual models. It is also used to model multitasking activities (activities that involve multiple independent execution threads and intermittent related interactions). Regular grammar A grammar often used in structural pattern recognition. If you let nonterminals A and B be in nonterminal set N and terminal a is in terminal set T, the regular grammar only contains production rules A ! aB or A ! a. Compare context-free grammar. Formal grammar A set of production rules that rewrite strings. Describes how to use the character set of a formal language to construct a string that satisfies the grammar of a language, starting with a starting symbol. Rewriting rule A rule in formal grammar. Can be used to build string descriptors. Formal language In image engineering, a collection of strings qualified by formal grammar rules. Formal language theory Study the theory of formal language syntax (internal structural patterns). See structural pattern recognition. Syntactic classifier Grammar-based classifier. The classification object is regarded as a language produced by grammar, and words belonging to the same language are classified into the same category. Syntactic pattern recognition Object recognition by converting the image of the object into a series of symbols or a set of symbols and using a parsing technique to match the sequence of symbols with grammatical rules in the database. See structural pattern recognition. Structural pattern descriptor A descriptor describing the pattern structure relationship. In the process of struc tural pattern recognition, a pattern is regarded as being composed (or arranged) of one or more pattern symbols (also called features). In addition to the quantitative measurement of some features, the spatial relationship between the features is also important to determine the mode, which depends on the structural mode descriptor to describe these spatial relationships.
33.9.2 Automaton

Automaton Recognizer in structural pattern recognition.

Automatic Performed by a machine (computer or other device) without human intervention.

Tree automaton In structural pattern recognition, an automaton with a sampling tree structure.

Finite automaton Language recognizer generated by regular grammar. Can be represented as a five-tuple:

Af = (Q, T, δ, q0, F)

Among them, Q is a finite set of non-empty states; T is a limited set of input characters; δ is a set of mappings from Q × T (i.e., a set of ordering pairs composed of elements of Q and T) to all Q subsets; q0 is the initial state; and F (a subset of Q) is a set of final or receivable states.

Generalized finite automaton [GFA] An image compression method originally developed for binary images can also be used for grayscale images and color images. It takes advantage of the fact that natural images have certain self-similarity (a part of the image may be similar in size, direction, brightness, etc. to other parts or the whole). The whole image is divided into small squares with the help of a quadtree, and the relationship between the parts in the image is represented by the structure of the graph. The graph here is similar to that of a finite automaton describing a finite number of states, and the name of the method comes from this. This method can provide a constant output bit rate for video image compression. This method is lossy from the perspective of compression fidelity, because some parts of an actual image may be similar but different from others.

Weighted finite automaton [WFA] A modification of generalized finite automata. Effective for both compressed grayscale and color images.

Adaptive linear element [Adaline] A system similar to a perceptron but using different training methods.
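A minimal sketch of a finite automaton Af = (Q, T, δ, q0, F) used as a recognizer is given below. It is a deterministic special case (the general definition maps Q × T to subsets of Q), and the alphabet, state names, and transition table are invented for this illustration: the automaton accepts strings over {a, b} containing an even number of b.

```python
# Deterministic finite automaton sketch (states, alphabet, and transitions
# are hypothetical; they are not taken from the handbook).
Q = {"even", "odd"}                 # states
T = {"a", "b"}                      # input alphabet
delta = {                           # transition mapping on Q x T
    ("even", "a"): "even", ("even", "b"): "odd",
    ("odd", "a"): "odd",   ("odd", "b"): "even",
}
q0 = "even"                         # initial state
F = {"even"}                        # final (accepting) states

def accepts(word: str) -> bool:
    state = q0
    for symbol in word:
        if symbol not in T:
            return False            # symbol outside the alphabet
        state = delta[(state, symbol)]
    return state in F

print(accepts("abba"))   # True  (two occurrences of 'b')
print(accepts("ab"))     # False (one occurrence of 'b')
```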
33.10
Test and Error
There are many judgment criteria for the effect of recognition or classification, and error functions are often used to measure imperfect results (Bow 2002; Duda et al. 2001).
33.10.1 Test
Leave-one-out test In pattern classification, an experimental method for evaluating the performance of a classifier, leaving one sample in the training set for testing (without training). Every sample can be left in principle. Miss-one-out test Same as leave-one-out test. Leave-one-out See cross-validation. Leave-K-out In pattern classification, an experimental method to evaluate the performance of a classifier. First, divide the training set into L non-overlapping subsets, and each subset contains K patterns. Train the classifier with the first L – 1 subset and test with the L-th subset. Next, rank the first subset to the last, and then use the first L – 1 subset (at this time) to train the classifier, and use the L-th subset to test. Repeat the L trials in this way, leaving K patterns for training and testing. Use the accumulated test results as the final performance to evaluate the results. This method has no bias effect on performance, but when K is small, the variance of the evaluation will be larger. Cross-validation 1. A method to make full use of samples when the number of samples in the training and test sets is limited. This can be explained with the help of Fig. 33.18. Suppose that N-fold cross-validation is performed (N ¼ 5 here), and all samples are divided into five groups each time (the simple case where the groups are equal here). Each time the N – 1 group is used for training, and the remaining 1 group is used as the test group (also called the hold-out set) for verification. In Fig. 33.18, the test group is represented by the shaded portion, and the unshaded portion represents the training group. Finally, the test results are averaged to give the overall performance.
Fig. 33.18 Cross-validation method diagram: the samples are divided into five groups, and in each of the five runs (Test 1 to Test 5) a different group serves as the held-out test set (shaded) while the others are used for training
It can be seen that (N – 1)/N samples are used for training, and all samples are used to evaluate performance. When the number of samples is small, this becomes the typical leave-one-out. 2. A scheme to effectively use the data to test the performance of the model. When comparing models, you cannot simply compare their performance on the training set, as there may be problems with overfitting. To prevent or resolve this issue, a validation set can be used. But if the data available for training and validation is limited, a more effective method is to use cross-validation. In cross-validation, the data is divided into K parts, and verification is performed on each part i ¼ 1, 2, . . ., K in turn, and the other parts are used for training. The verification results obtained from each copy are averaged to obtain an overall performance estimate. Leave-one-out test is a special case of it, where K is equal to the number of training samples. N-fold cross-validation See cross-validation. Hold-out set Same as validation set.
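The N-fold procedure described in the cross-validation entry above reduces to a short loop. The sketch below is an illustration only: the scoring callback, the toy "majority" classifier, and the random data are assumptions.

```python
import numpy as np

def n_fold_cross_validation(samples, labels, n_folds, train_and_score):
    """Average held-out score over n_folds splits of the data.

    train_and_score(train_x, train_y, test_x, test_y) is assumed to train a
    classifier on the first pair and return its score on the held-out pair.
    """
    rng = np.random.default_rng(0)
    indices = rng.permutation(len(samples))
    folds = np.array_split(indices, n_folds)
    scores = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        scores.append(train_and_score(samples[train_idx], labels[train_idx],
                                      samples[test_idx], labels[test_idx]))
    return float(np.mean(scores))

# Toy "classifier": always predict the majority label of the training part
def majority_score(train_x, train_y, test_x, test_y):
    majority = np.bincount(train_y).argmax()
    return float(np.mean(test_y == majority))

x = np.random.default_rng(1).normal(size=(20, 3))
y = np.array([0] * 12 + [1] * 8)
print(n_fold_cross_validation(x, y, n_folds=5, train_and_score=majority_score))
```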
33.10.2
True
Detection rate 1. Detection speed, measured in Hertz. 2. The success rate in target detection and feature detection, for a given instance, is related to the true positive of the detection.

True positive A performance indicator of a binary classifier. A binary classifier returns + or – for a sample. When a sample is +, a true positive occurs if the classifier also returns +. Compare false positive.

True positive rate [TPR] The proportion of actually positive samples that are correctly judged positive. Can be calculated by

TPR = TP / (TP + FN) = TP / P

where TP is the number of true positives, FN is the number of false negatives, and P is the total number of positive samples. In image retrieval, this is the recall. In medicine, it refers to the proportion of positive cases that are judged positive.
True negative A performance indicator of a binary classifier. A binary classifier returns + or – for a sample. When a sample is –, if the classifier also returns –, a true negative occurs. Compare false negative.

Positive predictive value [PPV] The proportion of samples judged positive that are actually positive. Can be calculated by

PPV = TP / (TP + FP) = TP / P'

where TP is the number of true positives, FP is the number of false positives, and P' is the total number of samples judged positive. In image retrieval, it is the precision.

Confusion table Same as confusion matrix.

Confusion matrix A representation matrix often used in image classification. It is actually a 2-D array of size K × K, where K is the total number of categories, and it is used to report the original results of the classification experiment. The value of the i-th row and j-th column in the matrix indicates the number of times a real category i was marked as belonging to category j. The main diagonal of the confusion matrix indicates how many times the classifier succeeded; a perfect classifier should have all off-diagonal elements zero.
33.10.3
Error
Misclassification rate In pattern classification, the proportion of classification in which samples that should belong to one category are incorrectly classified into other categories. Apparent error rate The error rate is calculated based on the number of misclassified samples in the training set. Mis-identification The error of determining a pattern other than the real class to the real class. Compare false alarm. False positive A result output by the classifier. For sample x, the two-class classifier c(x) can give a result of + or –. When a + judgment is given on a sample that is actually –, it is considered a false positive.
False positive rate [FPR] The proportion of actually negative samples that are incorrectly judged positive. Can be calculated by

FPR = FP / (FP + TN) = FP / N

where FP is the number of false positives (negative samples judged positive), TN is the number of true negatives (negative samples judged negative), and N is the total number of negative samples. The false positive rate in medicine refers to the rate at which cases that are actually negative are judged positive. It is also the frequency of false alarm errors during normal use of the detector. Compare false negative probability.
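The rates defined in this subsection all come from the four counts of a binary confusion matrix. The sketch below computes TPR, FPR, and PPV together; the ground-truth and prediction labels are illustrative only.

```python
import numpy as np

def binary_rates(y_true, y_pred):
    """TPR (recall), FPR, and PPV (precision) from binary labels (1 = positive)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    tpr = tp / (tp + fn)          # true positive rate = recall
    fpr = fp / (fp + tn)          # false positive rate
    ppv = tp / (tp + fp)          # positive predictive value = precision
    return tpr, fpr, ppv

# Hypothetical ground truth and classifier output for ten samples
truth      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
prediction = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(binary_rates(truth, prediction))  # (0.75, 0.1666..., 0.75)
```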
Chapter 34
Biometric Recognition
Biometric identification and recognition use various biometrics to identify or confirm the identity of a person (Zhang 2011).
34.1
Human Biometrics
Many more mature and widely used biometric features with the help of image technology have been developed (Zhang 2017b). Human biometrics From the physiology, anatomy, and behavior of the human body, different characteristics of the human body can be measured and can be used to identify different individuals. The biometric features that have been researched more widely and are commonly used, with the help of image technology, mainly include anatomical and physiological features, human face, fingerprints, palm prints, hand shape, iris, retina, vein blood vessels, etc., and behavior characteristics, footprints, handwriting, signature, gait, etc. In addition, there are other techniques, such as speech, voiceprint, chromosomal DNA, and even talking style. Identity verification Confirm a person’s identity based on the inspection of certain biological characteristics or biometric features (such as faces, fingerprints, etc.). This is not the same as identifying an unknown (unknown) person. Here you only need to compare the observed information with a model. Biometric feature Features that can be used to verify biometric identity. Typical features such as various features used for fingerprint recognition, face recognition, gait analysis, etc.
Properties of biometrics The characteristics that biometric features should have or expect to have when used for identification. For example, to identify or confirm the identity of a person, the biometric features used should mainly meet universality (owned by everyone), uniqueness (different from person to person), stability (not dependent on changes in person’s age, time, and environment), and collection convenience (i.e., easy collection, simple equipment, and small impact on people). Biometrics Identify people based on their anatomical, physiological, and behavioral characteristics. Broadly refers to the use of various biological characteristics to identify identities. At present, there have been more studies on, or more widely used, human biological characteristics with the help of image technology: face, fingerprints, palm prints, back of the hand, hand shape, veins, iris, retina, expression, gait, cadence, footprints, handwriting, signatures, and so on. Biometric systems A system that identifies biological characteristics (faces, fingerprints, expressions, signatures, etc.) of an individual’s body or behavior. There are two main modes of operation: verification and identification. Verification performs a one-to-one comparison to determine if it is the person claimed. Identification makes one-to-many comparisons to find out and determine who this person is. Biometric recognition Techniques and processes for verifying and identifying the identity of organisms by using biological characteristics such as anatomical and physiological characteristics, or behaviors that reflect individual characteristics. The biometric technology used varies with the biometrics used. Multimodal biometric approaches A biometric identification method that uses more than one anatomical, physiological, or behavioral feature to record, verify, or identify. The shortcomings and deficiencies of using only one single biometric are compensated by using multiple cooperating biometrics. It can be further divided into real multimode systems (processing the same biometrics through different input modes), multi-biometric systems (combining the use of multiple biometrics), and multi-expert systems (the fusion of different classifiers for the same biometrics). Feature-based recognition See biometric recognition. Gesture-based user interface Various methods of communicating commands with electronic devices are based on the acquisition and analysis of a person’s face or body movement. The key here is that the analysis should be very fast to avoid the lag between command issuing and action proceeding.
Appearance-based approach A method for face detection and localization. This method is similar to the method based on template matching and also uses template and template matching, except that the template (or model) here is obtained through learning or training. In other words, the appearance-based method uses templates obtained through training and learning and uses the template matching method to detect and locate faces. Gaze direction estimation Estimate the gaze location when people observe. Used for human-computer interaction. Gaze location See gaze direction estimation. Gaze direction tracking Gaze direction estimation is performed continuously in a video sequence or in the field of view of a live camera.
34.2
Subspace Techniques
Subspace-based methods are a relatively successful class of dimensionality reduction techniques that have been widely used in face recognition and expression recognition (Bishop 2006; Zhang 2011).

Subspace Mathematically, a partial space with a dimension smaller than the full space (here a space refers to a set with some specific properties). In other words, it is a subset of high-dimensional space, which can be composed of a set of orthogonally normalized basic vectors. For example, Fig. 34.1 shows a 2-D subspace L(u, v) in 3-D space R³, where two vectors u and v can be regarded as a unique pair of orthonormal vectors that determine the subspace L. The dimension of this subspace is 2.
Fig. 34.1 A 2-D subspace L (u, v) in 3-D space R3
L(u, v) v 0 X
u Z
A subspace is also a subset of a vector space, closed for addition and scalar multiplication. In data analysis, subspace structures can be detected using techniques such as principal component analysis.

Fig. 34.1 A 2-D subspace L(u, v) in 3-D space R3

Subspace method 1. Describes a method for transforming a vector space into a lower-dimensional subspace. For example, project an N-D vector onto its first two principal components to generate a 2-D subspace. 2. One of the mainstream methods of face recognition. High-dimensional face image features are compressed into a low-dimensional subspace through spatial transformation (linear or nonlinear) for recognition. Typical techniques include principal component analysis, independent component analysis, linear discriminant analysis, and kernel-based analysis, as well as discriminative projection embedding, class-dependent feature analysis, etc.
Face subspace The eigenface is represented as a linear subspace composed of orthogonal basis vectors.
Kernel tensor-subspace [KTS] A tensor-based subspace method using the kernel trick. With the help of independent modeling methods and nonlinear mapping, it can help build multilinear structures and model multiple factors. It has strong discrimination ability and does not require prior knowledge of the factor under consideration. Also known as individual kernel tensor face.
Grassmannian space A set of k-D linear subspaces in an n-D vector space V, denoted as Gr(n, k).
Subspace classification A classification method based on the principal eigenvectors of the subspace. The main steps include first subtracting the sample average of the entire training set from the feature vector, then estimating the correlation matrix for each category Ci and calculating the main eigenvectors (corresponding to the several largest eigenvalues), and taking these eigenvectors as column vectors to form a matrix Mi. Given a vector x to be classified, x is classified into the category Ci if

‖M_j^T x‖ < ‖M_i^T x‖,  ∀ j ≠ i
At this time, the category Ci corresponds to the maximum norm subspace projection of x. It can be seen that subspace classification combines feature selection or extraction with classifier design. Subspace classifier A classifier that implements subspace classification on pattern vectors.
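A minimal NumPy sketch of this maximum-norm decision rule is given below; the function names, the use of an eigendecomposition of each class correlation matrix, and the dictionary-of-class-samples input format are illustrative assumptions rather than details given in this handbook.

```python
import numpy as np

def train_subspace_classifier(class_samples, k):
    """For each class, build M_i whose columns are the k principal eigenvectors
    of that class's correlation matrix, after subtracting the global training mean."""
    mean = np.mean(np.vstack(list(class_samples.values())), axis=0)
    models = {}
    for label, X in class_samples.items():
        Xc = X - mean                      # subtract the sample average of the whole training set
        R = Xc.T @ Xc / len(Xc)            # class correlation matrix estimate
        _, V = np.linalg.eigh(R)           # eigenvectors, eigenvalues in ascending order
        models[label] = V[:, -k:]          # keep the k principal eigenvectors as columns of M_i
    return mean, models

def classify(x, mean, models):
    """Assign x to the class whose subspace projection has maximum norm."""
    xc = x - mean
    return max(models, key=lambda c: np.linalg.norm(models[c].T @ xc))
```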
Subspace analysis Describing data sets in one or more subspaces, or performing analysis work in subspaces. A mixture model of probabilistic principal component analysis components is a probabilistic construction method.
Projection matrix in subspace A matrix describing the projection of vectors in a subspace. The orthogonal projection result of a vector x on the subspace U (consisting of the vector group {u1, u2, . . ., um}) is a vector:

x* = Σ_{i=1}^{m} (x^T u_i) u_i

This can be written as

x* = Σ_{i=1}^{m} u_i u_i^T x = Px

where

P = Σ_{i=1}^{m} u_i u_i^T

is the subspace projection matrix, by which any vector can be mapped to its projection vector on the subspace U. Using the orthonormal properties of the vector group {u1, u2, . . ., um}, we can prove that P^T = P. See projection of a vector in subspace.
Projection of a vector in subspace A basic concept in subspaces. Let the set of orthonormal vectors constituting the subspace U be {u1, u2, . . ., um}. If they satisfy u_i^T u_j = δ_ij, the (orthogonal) projection result of a vector x on U is

x* = Σ_{i=1}^{m} (x^T u_i) u_i

In addition, the vector x' (= x − x*) is called the vertical residual, which is perpendicular to the subspace U. The vectors x, x*, and x' and their relationships can be seen in Fig. 34.2.

Fig. 34.2 Vector projection in subspace

Subspace shape models A type of model that uses essential structural information in the covariance of landmark points. It assumes that the shape vectors are very close to a K-D linear space, so the K vectors in this space can be used to describe each shape vector.
Subspace learning A method of machine learning in which subspaces are learned from many observations. It can be used to extend the bilinear decomposition implicit in the PCA and SVD methods to multilinear (tensor) decomposition to model multiple related factors simultaneously. Principal subspace When defining the principal component transform, it refers to a low-dimensional linear space where the data is orthogonally projected. In this subspace, the variance of the projection data is maximized.
34.3 Face Recognition and Analysis
As a kind of biometric recognition technology, face recognition has obvious advantages in terms of non-intrusive data collection and acceptability to the person being identified (Zhang 2015a).
34.3.1 Face Detection
Face detection The process of identifying and locating (e.g., using a bounding box) a human face in an image or a video sequence. Common methods combine human motion analysis and skin color analysis. Figure 34.3 shows a result of detecting faces of different sizes. The detected faces are marked by boxes.
Face detection and localization One of the key steps in face image analysis. Search for and extract faces from the input image, and determine information such as the position and size of the face.
Template-matching-based approach A face detection and localization method. Model the face first, build the corresponding template, and then detect and locate the face through template matching.
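As a hedged illustration of how such detection is commonly done in practice, the sketch below uses OpenCV's pretrained Haar-cascade frontal-face detector; this is one common detector among many, not a method prescribed by this handbook, and the file name and parameter values are conventional defaults.

```python
import cv2

# Load OpenCV's bundled frontal-face Haar cascade (a common pretrained detector).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("group_photo.jpg")            # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detectMultiScale scans the image at several scales, so faces of
# different sizes (as in Fig. 34.3) can all be found.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                 minSize=(30, 30))

for (x, y, w, h) in faces:                       # mark each detected face with a box
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```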
Fig. 34.3 Detecting faces of different sizes
Fig. 34.4 Face detection and localization with the help of bilateral symmetry
Feature-based method 1. A face detection and localization method, which searches the image for specific (organ-related) corners, edges, skin tone regions, texture regions, and more. 2. A shape description method. Use the features extracted from the target to describe its shape characteristics.
Feature-based approach Same as feature-based method.
Bilateral symmetry A constraint property of symmetry. If a given 2-D or 3-D target can be decomposed into two symmetrical halves, one on each side of a line or a plane, the target is bilaterally symmetric, also known as axisymmetric or plane symmetric. Many biological organisms, mammals, and plants have such characteristics. Fig. 34.4 shows a schematic diagram of the use of the bilateral symmetry of frontal faces to help judge human faces (and exclude non-faces) in face detection and localization.
Fig. 34.5 Detection of eye contour and pupil contour
Fig. 34.6 Detection of inside and outside contours of lips
Face knowledge-based approach A method for face detection and localization. Establish rules for the relationship between facial organ features based on prior knowledge of faces, and then perform face detection and localization and judge the results according to these rules.
Face feature detection Determine the location of features (eyes, nose, mouth, etc.) on a person's face. This work is often performed after face detection, although it is sometimes considered as part of face detection. Figure 34.5 shows an example of eye contour and pupil contour detection. The left image is the original image and the right image is the detection result (indicated by white lines). Figure 34.6 shows an example of contour detection on the inside and outside of the lips. The left image is the original image and the right image is the detection result (indicated by white lines).
Global component In face recognition, shape information representing the entire face image. It generally corresponds to a low-frequency component.
Local component Same as detail component.
Appearance Observation of an object from a specific viewpoint and under specific lighting conditions. It is a feature obtained using the grayscale nature of pixels. For face images, it mainly refers to underlying information such as the gray level of the face, especially information that reflects local slight changes. Appearance feature extraction mainly uses the local characteristics of the image.
Appearance change Changes in images that cannot be easily attributed to motion, for example, because the object itself changes shape.
Appearance feature The characteristics of an object or scene associated with its visual appearance, as opposed to (and independent of) those derived from shape, motion, or behavioral analysis.
Eye location The work of determining the position of the eyes in a face image. Some methods consider the detection of blinks, the detection of facial features, and so on.
34.3.2 Face Tracking
Face tracking Tracking a face in a series of images, such as a video. Often used as an integral part of human-computer interaction.
Head tracking Tracking the pose (position and posture) of a person's head (often interacting with a computer system) in a video. That is, the position and orientation (or line of sight) of the head are tracked in the image sequence. Commonly used is stereo vision-based head tracking (i.e., using stereo video to improve the effect).
Stereo-based head tracking An application of real-time stereo vision algorithms involving tracking the head position and orientation. The result can be used to control the sight of a person, or to obtain the position of a user who interacts with a computer or a game system, and to control the virtual environment or game based on that position information. Also refers to a specific head tracking technique in which the information obtained by stereo vision is used. The use of stereo vision can greatly improve the reliability of tracking compared to monocular vision.
Face tracker A device for performing parameter estimation on each frame of a video sequence, such as the pose, appearance, and expression of a human face.
Facial organ An organ on the human face. In face recognition, facial organs are the main clue to distinguish different faces. In facial expression classification and facial expression recognition, not only must the facial organs themselves be determined, but the changes in the characteristics of these organs also provide important clues for judging the expression. There are many facial organs. At present, the eyes and mouth are of main concern; the others include the nose, ears, and eyebrows.
Facial feature The facial features can be observed or extracted. It is the basis of face recognition or expression recognition. In practice, various facial organs (and their mutual positional relationship) are often used as facial features. Facial features detection The process of identifying and locating feature points or regions of interest on the human face, such as the eyes or the center of the eyes. 3-D facial features detection The process of identifying and locating facial feature points using 3-D (grid or point cloud) facial data. Facial organ extraction A key step in face recognition. Including detection of facial organs and judgment of their relative positions. For face recognition, facial organs are the main clues to distinguish different faces, while for facial expression classification, changes in the characteristics of these organs also provide important clues for judging the type of expression. The organs that play a more important role in face recognition and facial expression classification are mainly the eyes and mouth, while the role of the nose, ears, eyebrows, etc. is relatively small. Facial organ tracking A key step in face recognition. In image sequences, changes in the position, shape, and movement of organs on the face are continuously detected. This can also provide more information for facial expression classification. Face liveness detection A way to avoid using photos, 3-D models, or videos of faces instead of real faces to trick facial recognition biometric issues. These methods include detecting various facial movements and features (such as blood vessels) that are difficult to fake. Eye tracking 1. Eye tracking: Track the position of the human eye at different times in the face image sequence. Sometimes it is also necessary to consider detecting and tracking the line of sight. 2. Eyelid tracking: The process or technique of detecting and tracking the contours of the upper and lower eyes. This can be done on the basis of iris tracking. According to the geometric model of eyes, after the outline of the eyelid is represented by a parabola, the parameters of the parabola can be determined by detecting the position of the corner of the eye. However, when the eyes are closed during the blinking process, the upper and lower eyelids overlap, and the parabola representing the outline of the eyelids becomes a straight line. When the eyes are reopened then, because the movement directions of the upper and lower eyelids are different, the tracking of the eyelids will often cause problems, and the errors will be propagated to the subsequent frame images. To solve this problem, a prediction mode is used, that is, using the state pattern of the eyes before the eyes
are closed to predict the state pattern of the eyes after the eyes are reopened. If k is used in an image sequence to represent the label of the first frame of detected eye closure, l is the number of frames with closed eyes, and p is the number of frames used for prediction (p < k), then the state pattern M_{k+l+1} after the eyes are reopened is

M_{k+l+1} = Σ_{i=1}^{p} w_i M_{k−i}
where wi is the prediction weight.
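A minimal NumPy sketch of this weighted prediction is shown below; the uniform default weights and the representation of each eye-state pattern as a feature vector indexed by frame label are illustrative assumptions.

```python
import numpy as np

def predict_reopened_state(patterns, k, p, weights=None):
    """Predict the eye-state pattern after reopening, M_{k+l+1}, as the weighted
    sum  sum_{i=1..p} w_i * M_{k-i}  of the p patterns observed just before the
    eyes closed. `patterns` is assumed indexable by frame label, e.g., a dict
    {frame_label: feature_vector}."""
    w = weights if weights is not None else np.full(p, 1.0 / p)   # prediction weights w_i
    return sum(w[i - 1] * np.asarray(patterns[k - i]) for i in range(1, p + 1))
```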
34.3.3 Face Recognition
Face recognition A biometric recognition technology that has been widely studied and applied. Refers to the technology and work of identifying a face from an image (as a specific person recorded in a face database). The key steps involved are: (1) face detection and localization, (2) facial organ extraction, (3) facial organ tracking, (4) reducing feature dimension and extracting features, and (5) face recognition or face verification. In general, different situations such as lighting, distance, pose, and occlusion need to be considered, as well as changes in expression and the passage of time.
3-D face recognition Face recognition based on a 3-D face model. The 3-D model provides more geometric information about the face shape, which can overcome the problems and limitations of 2-D face recognition, and is not affected by changes in lighting and pose. With the development of technology, 3-D image acquisition systems are getting cheaper, and the 3-D data acquisition process is getting faster. All this is taking the 3-D method out of the laboratory and into practical application.
Laplacian face A technique for projecting a human face into a face subspace using locality-preserving projection. It tries to obtain the subspace of the face based on the eigenface manifold structure by retaining local information.
Eigenface An image combining principal components of a set of original face images using principal component analysis. The basic idea of the eigenface representation method in face recognition is to decompose the face image into a set of eigenfaces. These eigenfaces constitute orthogonal basis vectors in face space. Given a face image to be identified, it can be projected into the face space, and its position in the face space can be compared with those of other known faces to determine the
similarity between the two. In practice, the eigenfaces correspond to the eigenvectors determined by the matrix A whose columns are face images. The basis of the eigenface approach is that any face image A can always be reconstructed by starting from a mean image m and adding a small set of scaled symbol images si (which can be obtained from a set of training images using principal component analysis):

A = m + Σ_{i=0}^{N−1} a_i s_i

Among them, the optimal coefficients ai for any face image A can be calculated as follows:

a_i = (A − m) · s_i

It can also be used for fast face image matching. Reconstructing a face image with (part of) the N symbol images is equivalent to projecting the face image onto a linear subspace F. This subspace is called the face space, as shown in Fig. 34.7. Since the eigenfaces are orthogonal to each other and can be normalized, the distance between a projected face and the mean face m (distance in face space, DIFS) is

DIFS = ‖Ā − m‖ = √( Σ_{i=0}^{N−1} a_i² )

The distance between the original image A and its projection Ā in the face space (distance from face space, DFFS) is the orthogonal distance to that plane (the AĀ line is perpendicular to the mĀ line) and can be calculated directly in pixel space.

Fig. 34.7 Face space and the definition of distance in it

Distance in face space [DIFS] See eigenface.
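The following NumPy sketch illustrates the eigenface computation and the DIFS/DFFS distances defined above; treating the training data as a matrix of flattened face images, using an SVD, and keeping N components are illustrative assumptions, not steps mandated by this handbook.

```python
import numpy as np

def build_eigenfaces(faces, N):
    """faces: array of shape (num_images, num_pixels), one flattened face per row."""
    m = faces.mean(axis=0)                       # mean face m
    X = faces - m
    # Principal components via SVD; the rows of Vt are orthonormal eigenfaces s_i.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return m, Vt[:N]                             # keep the N leading eigenfaces

def difs_dffs(A, m, S):
    """Project face A onto the face space spanned by the eigenfaces S (rows)."""
    a = S @ (A - m)                              # coefficients a_i = (A - m) . s_i
    A_proj = m + S.T @ a                         # reconstruction (projection) Ā
    difs = np.linalg.norm(a)                     # distance in face space, ||Ā - m||
    dffs = np.linalg.norm(A - A_proj)            # distance from face space, ||A - Ā||
    return difs, dffs
```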
Distance from face space [DFFS] See eigenface.
Tensor face A simplified representation of face images that uses a high-order tensor to represent the different factors involved in imaging. These factors can be separated out with the help of tensor decomposition, so that face recognition can be performed effectively in different application environments.
Individual kernel tensor face See kernel tensor-subspace.
Fisher face A technique combining the linear projection of eigenfaces and elastic graph matching. It uses Fisher linear discriminant analysis to obtain a robust classifier that enhances the accuracy of face recognition.
Temporal face recognition The process or technology of face recognition based on video sequences rather than still images. It incorporates temporal information.
Distance-based approach In face recognition, a method of projecting a test image into a lower-dimensional space and calculating its distance from each prototype image. The closest image is taken as the matching result.
Detail component In face recognition, information representing the details of a face image. It is a high-frequency component, resulting from local distortion caused by facial expressions, etc.
Nonlinear mapping In face recognition, the transformation relationship between input space and feature space.
Explicit registration In face recognition, a technique for explicitly determining the locations of face key points with maximum a posteriori probability.
Implicit registration In face recognition, an operation that marginalizes over key point locations, eliminating the need to explicitly determine them.
Face Recognition Technology [FERET] Program A 3-year research project started in 1993. The goal was to explore the feasibility of face recognition algorithms and build a baseline against which to measure progress. An experimental protocol for testing and comparing face recognition algorithms was also specified. See FERET database.
Face inversion effect The phenomenon that a human face is perceived and recognized better when it is in an upright position (hair up, chin down) than when it is upside down (hair down, chin up). This effect is stronger (more obvious) for human faces than for other scenes.
34.3.4 Face Image Analysis
Face image analysis The study of face images using image analysis techniques. Typical research includes face identification, face recognition, facial expression classification, and determination of a person's gender, age, and health status, and even inference of a person's occupation, emotion, psychological state, temperament, etc. The analysis process (from the input image to the output result) includes several steps, as shown by the modules in Fig. 34.8.
Face analysis A general term covering facial image and model analysis. Often refers to the analysis of facial expressions.
Face perception In the human visual system, the process and method describing and explaining how human beings know the face in front of them and see it.
Face modeling Modeling faces according to the characteristics of the human head. Image-based modeling is a commonly used method. Although every face looks different, a person's head and face can be described with considerable accuracy using dozens of parameters. Also refers to the use of certain kinds of models to represent human faces. These models are mainly derived from one or more images and can be used for face authentication, face recognition, etc.
Head and face modeling Modeling the head and face of a person with the help of images of the head and face. Although at first glance each person's head and face are different, the actual shape of a person's head and face can be reasonably described with only a few dozen parameters. In other words, although the shape and appearance of a person's head and face vary a lot, they are still limited and can be modeled with low-dimensional non-rigid shapes.
Fig. 34.8 Face image analysis flowchart: input image → face detection/tracking → feature extraction → feature dimension reduction → judgment decision → output result
Face authentication Techniques for confirming that an image of a face corresponds to a specific person. Unlike face recognition, only a single person model is considered here. Human face verification In face recognition, a judgment is made as to whether the identified person belongs to a specific set. It is a supervised identification. It is assumed that the test image originates from a registered user. Compare human face identification. Face verification Same as face authentication and human face verification. Close-set evaluation In face recognition, it is the same as human face verification. Human face identification In face recognition, the judgment of the identity of the recognized person. Belongs to unsupervised identification. It is not required that the test images originate from registered users. Compare human-face verification. Face identification Same as human face identification. See face recognition. Open-set evaluation In face recognition, it is the same as human face identification. Face registration The process and method of locating facial key points such as the corners of the eyes, nose, and mouth in a face image and transforming them to a normalized position. Face transfer A more complex video-based animation technique. A new source video, rather than just an audio track, was used to drive the animation of the previously acquired video, that is, to redraw a video of the speaker’s head with appropriate visual speech, expressions, and head gesture elements. This technique is often used to animate 3-D face models. Face indexing Method for indexing a database of known faces. This work is often a leading step in face recognition.
34.4 Expression Analysis
Expressions are a basic way for humans to express emotions. People can express their thoughts and feelings accurately, fully, and subtly through expressions, and they can also recognize each other’s attitude and inner world through expressions (Zhang 2011).
34.4.1 Facial Expression
Expression The look/expression presented by the movements, postures, and changes of facial organs. It often refers to a person's facial expressions and the external expression of human emotions. It is also a basic way for humans to express emotions and an effective means in nonverbal communication. People can express their thoughts and feelings accurately, fully, and subtly through expressions, and they can also recognize each other's attitude and inner world through expressions. The classification and recognition of facial expressions is an important task in face image analysis.
Subtle facial expression A facial expression that is small in magnitude, incomplete, or has not reached its peak intensity.
Facial expression The external expression of emotion on the face. It physically originates from the contraction of facial muscles, resulting in short-term deformation of facial features such as the eyes, eyebrows, lips, and skin texture, often manifested by wrinkles. For image technology, see facial expression classification and facial expression recognition.
Expression tensor A 3-D tensor used to represent facial expression features and structures. It can be represented by a tensor A ∈ R^{I×J×K}, where I represents the number of individuals or people (p), J represents the number of expression (e) classes, and K is the dimension of the original facial expression feature (f) vector. An illustration of the expression tensor A is shown in Fig. 34.9. The two axes in the figure correspond to the number of people and the number of expression classes. Each square represents an expression feature vector (corresponding to a certain expression of a person), so that the 3-D tensor is displayed on a plane.
Each expression feature vector can be combined in rows or columns. All the boxes in the i-th column represent the expression feature vectors related to the i-th person, and all the boxes in the j-th row represent the expression feature vectors related to the j-th expression.

Fig. 34.9 Graphic representation of expression tensor

Nonverbal communication Human communication without words. Typical examples are expressions, sign language, and various body languages.
Neutral expression A blank or unemotional expression on a human face.
Expression recognition Same as facial expression recognition.
Facial expression recognition A biometric recognition technology that has received extensive attention and research. Recognition of facial expressions is usually performed in three steps: (1) face detection and localization, (2) facial expression feature extraction, and (3) facial expression classification.
Expression classification Same as facial expression classification.
Facial expression classification The result of dividing facial expressions into predefined categories based on the expression features extracted from the human face. One classification definition divides expressions into six basic categories: happy, sad, angry, surprised, disgusted/depressed, and fear/frightened. Another classification definition treats neutral expressionlessness as a basic expression, giving seven basic categories. Sociologists divide expressions into 18 single categories: laughter, smile, mockery, disappointment, worry, concern, anger, fright, disgust, arrogance, horror, fear, doubt, anxiety, trouble, disdain, contempt, and pleading.
Expression understanding See facial expression analysis.
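A minimal NumPy sketch of assembling such an I × J × K expression tensor from per-person, per-expression feature vectors is given below; the dictionary input format and variable names are illustrative assumptions.

```python
import numpy as np

def build_expression_tensor(feature_vectors, I, J, K):
    """feature_vectors[(i, j)] is the K-D expression feature vector of
    person i showing expression class j (both 0-indexed)."""
    A = np.zeros((I, J, K))              # expression tensor A in R^{I x J x K}
    for (i, j), f in feature_vectors.items():
        A[i, j, :] = f                   # one feature vector per (person, expression) cell
    return A
```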
34.4.2 Facial Expression Analysis
Facial expression analysis Starting from an image or an image sequence, studying and recognizing a person's facial expressions. The result of the analysis is often the assignment of the observed expression to a predetermined category. Typical classification results include six types of expression: anger, disgust, fear, happiness, sadness, and surprise.
Fig. 34.10 Neutral and six types of expression examples
Fig. 34.10 shows an example of seven expressions (neutral plus the above six categories, from left to right) of the same person (from the Japanese female expression library, i.e., the JAFFE library). Expression feature vector Vector data reflecting the characteristics of the expression. It is related to the methods and techniques used to express and describe facial features. Facial expression feature extraction Detection of expression-related features on human faces. It is closely related to facial organ extraction. Three tasks need to be completed here: (1) obtaining original features, (2) reducing feature dimension and extracting feature, and (3) feature decomposition. Obtaining original feature One of the three tasks of facial expression feature extraction. It mainly extracts geometric features, appearance features, and sequence features from face images. Some are permanent features (eyes, lips, etc.) and some are transient features (crow’s feet caused by certain expressions, etc.). Reducing feature dimension and extracting feature One of the three tasks of facial expression feature extraction. On the basis of obtaining original features, by reducing the dimensionality of the features, the problems of information redundancy, high dimensionality, and insufficient discrimination in the original feature data are overcome to more effectively represent facial expressions. Feature decomposition One of the three tasks of facial expression feature extraction. On the basis of reducing feature dimension and extracting feature, factors that interfere with expression classification are further removed, and a feature data set that is more favorable for classification is obtained. Facial expression feature tracking Continuous detection of changes in the position, shape, and movement of features related to expressions on a human face. It is closely related to facial organ tracking. Facial action The action and the result of the face deformation, they need to be determined by using a geometrically invariant parameter of a given facial deformation intensity.
Facial action coding system [FACS] A hand-coded system, proposed on the basis of anatomy, that can represent all possible facial displays. In it, facial deformations are matched with facial muscle movements, and more than 7000 combinations can be obtained with the help of 44 basic "action units." A specific combination of these active units can represent expressions with specific emotions, including six types: anger, disgust, fear, happiness, sadness, and surprise.
It is a system that attempts to characterize the entire range of facial movements mainly by mapping the physical changes related to emotions. It was first proposed by the psychologists Ekman and Friesen. It consists of two parts: one is called the "action unit" (AU), which is the basic action of a single muscle or muscle group, and the other is the action descriptor (AD), which includes several consecutive actions.
FACS is a widely accepted facial expression evaluation system, but it has some disadvantages: manual FACS coding takes time and effort. After special training, human observers can achieve the ability to encode face displays, but this training takes about 100 hours per coder. At the same time, manual coding is inefficient, and coding criteria can change, making it difficult to control. In order to avoid the heavy burden of manual facial expression coding, psychologists and computer vision researchers use computers for automatic facial expression recognition. However, some problems are encountered when using computers to recognize these objective facial expressions automatically. For example, since FACS and the six basic emotion categories are descriptive, some of the language used to describe these categories (such as "upper eyelid tightening, lips trembling, tightening or shrinking your mouth," etc.) is difficult to compute and model. These descriptions are quite instinctive to humans, but it is quite difficult to translate them into computer programs.
Active unit [AU] The basic unit for describing motion in a facial action coding system. Each unit corresponds to one or several facial muscles from an anatomical point of view. Muscle movements include contraction or relaxation.
Action unit [AU] The minimum unit obtained by decomposing an action sequence; also refers to the minimum unit obtained by removing the motion expression from the original measurement of the pixel motion itself (such as optical flow). For example, in facial expression recognition research, it refers to the basic unit used to describe changes in facial expression. Anatomically, each action unit corresponds to one or more facial muscles. Facial muscle movements reflect changes in facial expressions. This research method is relatively objective and commonly used in psychological research. For example, some researchers have attributed the various facial expression changes to combinations of 44 action units and established the corresponding relationship between expressions and action units. However, there are more than 7000 combinations of the 44 action units, which is a huge number. In addition, from the perspective of
detection, there are many action units that are difficult to obtain based on 2-D facial images. Facial action keyframe In facial expression recognition by means of dynamic recognition technology, a video frame corresponding to a sudden change in facial motion or deformation intensity. It generally corresponds to the transition from the neutral point to the apex (neutral-to-apex), that is, the interval from no deformation to the maximum deformation strength. Facial motion capture Motion capture of an organ on a person’s face or a specific location on it. A typical application is expression detection and representation, which is usually performed by obtaining expression key points (stickable tags). Facial animation The ways in which facial expressions change and how they are displayed. In practice, facial organs are detected and modeled to achieve animation. See face feature detection. A widely used application based on 3-D head modeling. Once a parametric 3-D model with shape and appearance (surface texture) is constructed, it can be used to track one’s facial movements and use the same movements and expressions to animate different characters. In order to make the 3-D deformation meaningful, the corresponding vertices obtained by scanning different people need to establish correspondence first. You can then use PCA to parameterize 3-D deformed models more naturally. If you analyze the different parts (such as the eyes, nose, and mouth) separately, you can also increase the flexibility of the model.
34.5 Human Body Recognition
The classification and recognition of human movements and postures are important tasks for biometrics (Sonka et al. 2014; Zhang 2017b).
34.5.1 Human Motion
Human motion analysis In image engineering, a general term describing the application of motion analysis techniques to the human body. This analysis is used to track a moving human body, to identify a person's pose, and to derive its 3-D characteristics.
Human motion capture Motion capture of the human body (mainly limbs and joints). Widely used in animated films, sports training, etc.
Human motion tracking Locating the human body continuously in a video sequence. It can be done with a variety of clues, characteristics, and information. There are two types of methods commonly used, model-based and non-model-based, depending on whether a predefined shape model is used. A commonly used representation is the line drawing. See model-based tracking.
Flow-based human motion tracking Use of optical flow calculations for human tracking in video. It is usually performed by matching human limbs, where each limb is divided into two parts (the upper limb into the forearm and the upper arm, and the lower limb into the lower leg and the thigh), each simplified into a moving rectangle.
Body part tracking Tracking human body parts in video sequences, often used to estimate human poses.
Limb extraction An extraction technique and process for human body parts in image interpretation and spatial-temporal behavior understanding. This includes (1) extracting the upper or lower limbs of a person or animal, such as for tracking, and (2) extracting the almost invisible edges (the term "limb" comes from astronomy) of a surface where it curves away from the observer. See also occluding contours.
Pedestrian surveillance See person surveillance.
Pedestrian detection A category of object detection. The detection target is an upright, stationary, or moving human body. Automatically identify and locate walking people in scenes such as streets. It is common to superimpose a frame (rectangular or oval) on the image to indicate the detection result. An example is shown in Fig. 34.11.
Human pose estimation Image-based geometric analysis of human joint poses. This is often done after a skeleton fit of the human motion tracking results. Fig. 34.12 shows an example of a line drawing of a human skeleton.
Human body modeling Modeling a person's torso or organs (such as a hand) with the help of human body images. These are highly articulated structures, so 3-D kinematic models are often used. They are represented as kinematic chains of segmented rigid skeleton elements connected by joints, which specify the length of each limb in the skeleton and the 2-D and 3-D angles of the limbs rotated relative to each other.
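As a hedged illustration of pedestrian detection in practice, the sketch below uses OpenCV's built-in HOG descriptor with its default pretrained people detector; this is one common detector rather than the only approach implied by the entry, and the parameter values are conventional defaults.

```python
import cv2

hog = cv2.HOGDescriptor()
# Default SVM coefficients trained for detecting upright people.
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("street_scene.jpg")          # hypothetical street image
# Scan the image with a sliding window over several scales.
rects, weights = hog.detectMultiScale(frame, winStride=(8, 8),
                                      padding=(8, 8), scale=1.05)

for (x, y, w, h) in rects:                      # superimpose a rectangle per detected person
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
```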
Fig. 34.11 Pedestrian detection
Fig. 34.12 Line drawing representation of a skeleton
Kinematic chain See human body modeling. Human body shape modeling Modeling the structure and relationships of the human body (whole or component).
34.5.2 Other Analysis
Gait analysis Analysis of the movement of the human body, especially the legs. Often used for biometric or medical purposes. After detecting the human body, typical methods often use skeletonization techniques to represent the limbs as articulated objects for analysis.
Gait classification 1. Classifying different body movements (such as walking, running, jumping, etc.). 2. Biometric identification of people based on their gait parameters.
Skin color analysis A set of techniques for analyzing the color of skin in images, which can be used, for example, to retrieve images of people from a database. See color, colorimetry, color image, color image segmentation, color matching, and color-based image retrieval.
Skin color detection Detection based on the color of the skin; a very effective means of face detection that is also often used to detect limbs.
Skin color model A statistical model of the appearance of human skin in an image. Typical models are often based on a histogram or a Gaussian mixture model of the observed pixel colors. The core observation is that, when brightness is corrected for, almost everyone's skin has a similar color, and this color differs from the colors of much of the other scenery observed in the scene. Complications arise from changes in scene brightness and the effects of shadows. Skin color models are often used for face detection or online porn scene screening.
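A minimal sketch of pixel-level skin color detection by thresholding in the YCrCb color space follows; the threshold values are commonly quoted rules of thumb and are assumptions, not values given in this handbook.

```python
import cv2
import numpy as np

def skin_mask(image_bgr):
    """Return a binary mask of candidate skin pixels using fixed
    Cr/Cb thresholds (a simple, brightness-tolerant heuristic)."""
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)      # Y, Cr, Cb lower bounds
    upper = np.array([255, 173, 127], dtype=np.uint8)   # Y, Cr, Cb upper bounds
    mask = cv2.inRange(ycrcb, lower, upper)
    # Clean up small speckles with a morphological opening.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```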
34.6 Other Biometrics
There are many types of human biometrics, including fingerprints, gestures, irises, and so on (Theodoridis and Koutroumbas 2003, Zhang 2017b).
34.6.1 Fingerprint and Gesture
Fingerprint recognition A biometric recognition technology that has been widely studied and applied. A fingerprint is the pattern on the front surface of the skin at the end of a finger. Its pattern, including the number, location, and interrelationship of intersections and breakpoints, is closely related to the individual, and it can be used as a feature to uniquely verify the identity of a person. The main steps of fingerprint identification include (1) fingerprint collection with a fingerprint reader, (2) fingerprint image preprocessing, (3) fingerprint feature extraction, (4) feature comparison with the fingerprints in the database, and (5) identity verification.
Fingerprint minutiae In fingerprint recognition, the main characteristics of a fingerprint, such as a ring or a thread.
Fingerprint indexing Same as fingerprint database indexing.
Fig. 34.13 Hand orientation
Fingerprint database indexing Use of multiple fingerprints for indexing in a fingerprint database. When attempting to perform fingerprint identification in a database, this method allows only a small number of fingerprint samples to be considered.
Fingerprint identification Verifying individuals by comparing unknown fingerprints with previously known fingerprints.
Palm-print recognition A biometric recognition technology. Palm-print refers to the texture and curve features on the palm. Although these characteristics are not regular, they correspond one-to-one to individuals and do not change as a person grows, so they can be used to identify and distinguish the identities of different people.
Palm-geometry recognition A collective name for palm-print recognition and hand shape recognition.
Gesture analysis Basic analysis of video data expressing human gestures. It generally refers to the work preceding gesture recognition.
Hand orientation The direction of the hand, approximately defined as the vector pointing from the center of the wrist to the root of the longest finger (middle finger), as shown in Fig. 34.13.
Hand tracking Tracking of the human hand in video sequences, often used for human-computer interaction.
Hand-geometry recognition A biometric recognition technology based on the geometric characteristics of the hand and fingers, such as the length and width of the fingers, the size of the palm, and the ratios between different finger sizes. Some of these characteristics are easy to obtain and provide certain identity information; some of them change over time, which makes the distinction between individuals insufficient. Therefore, if only hand-geometry features are used for identity recognition, good results can be obtained only when the set of samples to be identified is limited.
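A tiny NumPy sketch of the hand orientation definition above (wrist center to middle-finger root) follows; the 2-D point inputs and function name are illustrative assumptions.

```python
import numpy as np

def hand_orientation(wrist_center, middle_finger_root):
    """Return the unit direction vector and its angle (radians) from the
    wrist center to the root of the middle finger, per the definition above."""
    v = np.asarray(middle_finger_root, float) - np.asarray(wrist_center, float)
    angle = np.arctan2(v[1], v[0])          # orientation angle in the image plane
    return v / np.linalg.norm(v), angle
```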
Fig. 34.14 Digital sign language symbols
Gesture recognition Differentiating human gestures, which can be used as a natural means of human-computer interaction. Early methods relied primarily on data gloves. Cameras are now often used to obtain images, and image recognition technology is used to extract finger position and motion information from the images for classification.
Non-affective gesture A hand pattern whose movement and composition are not due to emotion, such as sign language.
Hand-sign recognition A type of gesture recognition for sign language. A set of digital sign language symbols (from 0 to 9, from left to right) can be seen in Fig. 34.14.
34.6.2 More Biometrics
Iris The annular region on the bulb wall of the human eye, at the forefront of the middle vascular membrane, between the pupil and the sclera. It is one of the more mature and widely used biometric features exploited with the help of image techniques. It has a high degree of uniqueness and remains unchanged for life. See cross section of human eye.
Iris detection Detection of the iris of the human eye. Detection of iris visibility can also help determine whether the eye is closed.
Iris recognition A biometric recognition technology that has received extensive attention and research; it uses the physiological structure of the stripes in the human eye's iris for identification. The texture of the iris surface is determined during development before birth and can be used to uniquely verify a person's identity. The main steps involved in iris recognition include (1) collecting iris images with infrared light, (2) enhancing the iris images, (3) extracting iris features, and (4) performing texture analysis on the iris. It is considered less accurate but more convenient to use than retina-based biometrics.
Lip tracking An application of image technology to track the position and shape of human lips in a video sequence. The purpose can be to complement lip reading and deaf gestures, or to concentrate resolution in image compression.
Viseme A model established for the appearance of the human face, especially the movement of the lips, generally taking a spoken phoneme as the unit.
Lip shape analysis An application of image technology to understand the position and shape of human lips; it is a part of face analysis. The target can be face recognition or expression classification.
Tongue print A form of personal biometrics that is considered unique to everyone, much like a fingerprint.
Age progression Consideration of the process of changing visual appearance due to a person's age. Generally, the evolution of age must be considered in face detection, face recognition, and face modeling.
Signature identification One type of technology used to verify handwritten signatures, also called "dynamic signature verification." A domain of biometrics. See handwriting verification, handwritten character recognition, fingerprint identification, and face identification.
Signature verification In image engineering, the technology of comparing and authenticating a real signature image against a fake signature image. In practice, to automatically identify a person's handwritten signature with the help of image technology, it is necessary to determine whether a signature sufficiently matches a reference sample. A simple four-step method is as follows:
1. Perform binarization of the signature image.
2. Extract the medial axis of the signature pattern (the medial axis transform can be used).
3. Use the watershed algorithm to extract the structural features of the signature.
4. Use a structural matching algorithm to determine the signature closest to the database signatures.
See handwriting verification and handwritten character recognition.
Digital signature The digital equivalent of a traditional (handwritten) signature. In the circulation of digital products, it can be used to confirm the identity of the sender. In practice, a digital signature can be constructed by computing a one-way hash of the information and encrypting the digest with the sender's private key.
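As a hedged illustration of the hash-then-sign construction just described, the sketch below uses Python's hashlib together with the third-party `cryptography` package (an assumed dependency, not one named by this handbook); RSA with PKCS#1 v1.5 padding is only one conventional choice.

```python
import hashlib
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

message = b"digital product payload"                # hypothetical information to sign
digest = hashlib.sha256(message).hexdigest()        # one-way hash of the information

# The sender signs with the private key (sign() hashes internally as well);
# anyone holding the public key can verify the signature.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
signature = private_key.sign(message, padding.PKCS1v15(), hashes.SHA256())

public_key = private_key.public_key()
public_key.verify(signature, message, padding.PKCS1v15(), hashes.SHA256())  # raises if invalid
```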
Part IV
Image Understanding
Chapter 35
Theory of Image Understanding
Image understanding, as a high level of image engineering, focuses on the basis of image analysis and combines with artificial intelligence and cognitive theory, to further study the nature of the objects in the image and their interconnections, to understand the meaning of image content, and to interpret the corresponding objective scenarios for guiding and planning actions (Bishop 2006; Davies 2012; Forsyth and Ponce 2012; Hartley and Zisseman 2004; Marr 1982; Prince 2012; Szeliski 2010; Zhang 2017c).
35.1 Understanding Models
The theoretical model of image understanding is still under development, and attention has been paid to the characteristics of human vision, such as initiative, selectivity, and purpose (Marr 1982; Zhang 2017c).
35.1.1 Computational Structures
Model for image understanding system Models that determine the systems, system modules, system functions, and system workflows of image understanding. A typical model is shown in Fig. 35.1.

Fig. 35.1 Model for image understanding system (modules: objective world, image acquisition, image understanding, internal representation, computer, knowledge base)

The task of understanding visual information here starts by acquiring images from the objective world, and the system works according to the collected images and operates on the internal representation form of the system. The internal representation of the system is an abstract representation of the objective world, and image understanding is to associate the input and its implicit structure with the explicit structure that already exists in the internal representation. The processing of visual information needs to be guided by knowledge. The knowledge base should be able
to represent analogous, propositional, and procedural structures, allow fast access to information, support querying of analogous structures, facilitate transformation between different structures, and support belief maintenance, inference, and planning.
Multilayer sequential structure In the model for image understanding system, a hierarchical structure that regards the image understanding process as an information processing process. It has definite inputs and outputs. The image understanding system is organized into a series of modules at different levels, combined in a serial manner. Each module (with the cooperation and coordination of other modules) performs certain specific tasks in order, gradually completing the predetermined visual tasks.
Multi-module-cooperated structure In the model for image understanding system, a structure in which the entire image understanding system is divided into a number of modules, each with definite inputs and outputs, that cooperate and interact with one another.
Radiance structure surrounding knowledge database A structure in the model for image understanding system. It adopts a structure analogous to the human visual system, characterized by the knowledge base being at the center; the system is not layered as a whole, and signals are exchanged and processed multiple times between the various modules of the system and the knowledge base.
Tree structure rooting on knowledge database A structure of a model for image understanding system. It mainly represents a module classification method under a tree structure, where the knowledge base is at the root of the tree.
Dynamic geometry of surface form and appearance A theory different from Marr's visual computational theory. It holds that the perceived surface shape is the combined result of multiple processing actions distributed over multiple spatial scales, while an actual 2.5-D representation does not exist.
Vision computing Visual information processing technology. Now mainly refers to the various tasks completed by computers to obtain and use visual information, which make use of various imaging technologies.
Computational imaging See computational photography and computational camera.
Computational photography [CPh] With the help of image processing and image analysis technology, acquiring images that cannot be obtained with ordinary cameras, or generating images that exceed the capabilities of traditional imaging systems from one or more photographs. Such methods are constantly being studied, and some have been incorporated into digital cameras and become standard features, for example, panoramic scan mode and multiple shots (to reduce noise in low-light conditions). Also, the methods and processes of using cameras to capture, record, and analyze natural scenes. This allows for possible follow-up actions, such as avoiding collisions. Computational photography uses multiple images with small changes to record scenes, or some form of camera (such as a webcam) to record changing scenes in video. In addition, the scene may be recorded by changing the operation of the digital camera (i.e., zoom, focus, etc.) during recording, or by changing the parameters of the digital camera (e.g., field of view, depth of field, and lighting) during recording. See computational camera.
Computational symmetry A general term for algorithmic processing of various symmetries. See bilateral symmetry, symmetry detection, and symmetric axis transform.
Computational complexity The amount of computation required by an algorithm or to complete a calculation. It can also be expressed as a lower bound on the theoretical number of basic computational operations required to perform a given algorithm or process, regardless of the hardware used. For example, an algorithm that iteratively calculates each pixel of a W × H image (such as a mean filter) has a computational complexity of O(WH). It is one of the commonly used evaluation criteria for image matching. The speed of a matching algorithm in image matching indicates its effectiveness in specific applications. The computational complexity can be expressed as a function of the image size (considering the number of additions or multiplications required for each unit). Generally, the computational complexity of a matching algorithm is expected to be a linear function of the image size.
Computational tractability In computational complexity theory, the property of whether a given algorithm or process is feasible given a limited number of operations (related to time scales) and
resources (related to memory and storage). If it is not feasible, it is said to be computationally intractable.
Active sensing 1. Active or purposive perception of the environment. For example, moving the camera in space to obtain multiple optimal fields of view of the target or scene. See active vision, purposive vision, and sensor planning. 2. Perceptual technology that actively projects energy patterns. For example, laser light is projected into a scene to help obtain target information. See laser stripe triangulation and structured light triangulation.
Active perception Same as active sensing.
Perceptual feature Same as visual feature.
Perceptual feature grouping A framework of knowledge-based image understanding theory. This theoretical framework differs from Marr's visual computational theory in that it holds that the human vision process is only a recognition process and has nothing to do with reconstruction. In order to identify a 3-D object, human perception can be used to describe the object; this is accomplished directly from 2-D images under the guidance of knowledge, without the need for bottom-up 3-D reconstruction from the visual input.
Purposive perception Perception whose motivation comes directly from the task goals. For example, a robot may need to search for specific landmark points to achieve self-localization.
35.1.2 Active, Qualitative, and Purposive Vision
Active, qualitative, and purposive vision A framework combining active vision, qualitative vision, and purposive vision. The basic idea is to let an active observer collect multiple images of the scene according to the task goal, in order to convert the ill-posed problem of 3-D reconstruction in vision into a clearly defined and easily solved problem. It can also be seen as an extension of active vision. It further emphasizes that the vision system should be task-guided and purpose-guided and also have the ability of active perception. The main idea behind these concepts can be summarized as the transformation of the problem, including the following three aspects of work:
1. Have an active observer collect multiple images of the scene. 2. Relax the requirements of the original “complete reconstruction” target or scene to a qualitative level, for example, it is only necessary to state that one target is closer to the observer than another. 3. The vision system is not required to consider the general situation, but only considers narrowly defined goals that are clearly defined, so that a dedicated solution can be implemented for specific application problems. Active vision A computer vision technology that controls the movement of a camera or sensor to simplify the nature of the problem. Cameras move in a defined or indeterminate way, or a person turns their eyes to track a target in the environment, a technique and method to perceive the world. It is emphasized here that vision is an active participant in the objective world. This includes tracking the target of interest and adjusting the vision system to optimize certain tasks. Its basic idea is to allow active observers to collect multiple images of the scene in order to transform ill-posed problems into clearly defined and easily solved problems. This is the core of the active vision framework, also known as active perception. In active vision, the “attention” ability of human vision is incorporated, and the selective perception of space, time, and resolution can be achieved by changing camera parameters or processing data acquired by the camera. Camera motion provides additional conditions for studying the shape, distance, and motion of the target. With these conditions, some ill-posed structural problems can be transformed into good structural problems, and some unstable problems can be turned into stability problems. Different from active imaging, active vision focuses on observation strategies rather than observation means. For example, rotate a camera at a uniform angular speed and fix it at a point at the same time to perform an absolute calculation of the depth of the scene point, instead of a relative depth calculation that depends on the speed of the camera movement only. See kinetic depth. Viewpoint planning Determine where an active vision system will look next to maximize the likelihood of achieving certain predetermined goals. A common situation is to calculate the position of a distance sensor at several successive positions to obtain a complete 3-D model of the target. After collecting N images, the viewpoint planning problem is to choose the positions of N + 1 images to maximize the collection of new data, and at the same time ensure that the new positions can register the new data with the existing N images. Next view planning When detecting a target, or acquiring a geometric model or an appearance-based approach, it may be necessary to observe the target from several viewpoints. The next view planning determines where to place the next camera (or move the current camera to a new location). To change the relative layout of the camera and target, move the camera or move the target. Making such a plan is based on the observed
scene (where the target is unknown) or on a geometric model (where the target is known).
Next view prediction See next view planning.
Best next view See next view planning.
Autonomous exploration An active vision method that emphasizes object tracking.
Active stereo A variant of traditional binocular vision in which one of the cameras is replaced with a structured light projector that projects light onto the target of interest. If the camera is calibrated, triangulating the 3-D coordinates of a target point only requires determining the intersection of a ray with a known structure in the light field.
Active fusion A concept proposed with an active meaning in active vision. It emphasizes that, when combining information, the fusion process must actively select the information sources to be analyzed and control the processing of the data.
Qualitative vision A visual paradigm. The basic idea is that many vision tasks can be better accomplished by computing only qualitative descriptions of targets and scenes from images, without the accurate measurements required by quantitative vision. It is proposed under the framework of computational theory for human vision. It seeks a vision based on qualitative descriptions of targets or scenes, so that certain problems can still be solved when quantitative descriptions are difficult to obtain. The basic motivation is not to compute and represent geometric information that is not needed for qualitative (non-geometric) tasks or decisions. The advantage of qualitative information is that it is less sensitive than quantitative information to various unwanted transformations (such as slight changes of viewing angle) or to noise. Qualitative or invariant descriptions allow easy interpretation of observed events at different levels of complexity.
Purposive vision The field of research in computer vision that associates perception with purposeful action. That is, the position or parameters of the imaging system are purposefully changed to facilitate the execution of the visual task or to make its completion possible. Examples include changing lens parameters in depth from defocus to obtain information about depth, or turning the camera around a target to obtain comprehensive shape information. It represents a special visual concept: emphasis is placed on selecting the techniques for accomplishing visual tasks according to the visual purpose. The vision system makes decisions based on visual purposes and decides whether to perform qualitative
analysis or quantitative analysis (in practice, there are quite a few occasions where only qualitative results are needed and quantitative results of higher complexity are not required). For example, it is decided whether to completely recover information such as the position and shape of an object in the scene or just to detect whether an object exists in the scene. The basic motivation is to compute only the part of the information that is actually needed, which may allow a simpler solution to the visual problem.
Active vision framework A theoretical framework for image understanding based on the initiative of human vision (or, more generally, biological vision). It differs from Marr's visual computational theory by considering two special mechanisms of human vision: selective attention and gaze control.
Selective attention A special mechanism of human vision considered by the active vision framework. Studies have shown that human vision does not treat all parts of the scene equally, but selectively pays special attention to some parts according to need, while only roughly observing or even ignoring other parts.
Selective attention mechanism See selective attention.
Gaze control The ability of a human or a robot head to control its gaze direction. In human vision, it is a special mechanism considered by the active vision framework. Studies have shown that people can adjust their eyeballs as needed and "gaze" at different locations in the environment at different times to effectively obtain useful information. According to this feature, the camera parameters can be adjusted so that visual information suitable for the specific task is always obtained. Gaze control includes gaze stabilization and gaze change.
35.2
Marr’s Visual Computational Theory
The visual computational theory proposed by Marr outlines a framework for understanding visual information (Marr 1982). This framework is both comprehensive and refined; it is the key to making the study of visual information understanding rigorous and to raising visual research from the level of description to the level of mathematical science (Zhang 2017c).
35.2.1 Theory Framework
Marr's visual computational theory A framework for understanding visual information proposed by the British neurophysiologist and psychologist Marr in 1982. Also called computational vision theory. It outlines a comprehensive and concise framework for visual information processing, which makes the research on visual information understanding more rigorous and raises visual research from the level of description to the level of mathematics; it has had a great impact. Its key points include the following:
1. Vision is a complex information processing process.
2. There are three elements of visual information processing, namely, computational theory, algorithm implementation, and hardware implementation.
3. There are three levels of internal representation of visual information, namely, the primal sketch, the 2.5-D sketch, and the 3-D representation.
4. Visual information processing is organized according to functional modules.
5. The formal representation of computational theory must consider certain constraints.
Marr's theory Short for Marr's theory of the human visual system. It is a collective but incomplete theory. Some of its key stages include the raw primal sketch, the full primal sketch, the 2.5-D sketch, and 3-D target recognition.
Representation by reconstruction In Marr's visual computational theory, the idea that to explain the scene one must first carry out a 3-D reconstruction of the scene.
Intermediate representation A representation in any of the intermediate steps generated from the input representation until a specific output representation is obtained. For example, according to Marr's visual computational theory, the raw primal sketch, the full primal sketch, and the 2.5-D representation are all intermediate representations between the input image and the final reconstructed 3-D model.
Computational vision theory Same as Marr's visual computational theory.
Visual computational theory Same as computational vision theory.
Marr's framework Marr's framework for the entire visual computational theory, which focuses on the description of the three levels of a visual information processing system. These three levels are also called the three elements of visual information processing: computational theory, representations and algorithms, and hardware implementation.
Computational theory One of the three elements of visual information processing in Marr's visual computational theory. When using a computer to solve a specific problem, if there is a program that can give the output in a finite number of steps for a given input, then the problem is computable according to computational theory. The research goals of computational theory include decision problems, computable functions, and computational complexity.
It is also the highest level in Marr's framework. The questions to answer are: what is the purpose of the computation, and why is it computed (what are the known constraints, or what can be used to solve the problem)? In the description theory of computer vision algorithms promoted by Marr, a process can be described at three levels: computational theory, algorithm (a series of actions), and implementation (program). Computational theory assumes a mathematical connection between input and output and a description of the nature of the input data, such as a statistical distribution. The advantage of this approach is that the nature of the processing is made explicit by the computational theory level, so that it can be compared with the nature of other processes that solve the same problem. With this approach, the specific implementation details that could otherwise confuse such comparisons can be ignored.
Algorithm implementation One of the three elements of visual information processing in Marr's visual computational theory. Here, on the one hand, the input and output representations to be processed must be selected, and on the other hand, the algorithm that accomplishes the conversion between representations must be determined.
Representations and algorithms The middle level in Marr's framework. The questions to be answered are: how should the input, the output, and the information between them be represented, and which algorithms should be used to calculate the required results (implementing the computational theory and achieving the conversion between representations)?
Hardware implementation One of the three elements of visual information processing in Marr's visual computational theory. It pays attention to how to physically implement the representations and processing algorithms of visual information.
It is also the lowest level in Marr's framework. The question to answer is how to map representations and algorithms to actual hardware (such as a biological vision system or special silicon), i.e., how to implement representations and algorithms physically? In turn, one considers how hardware constraints can be used to guide the choice of representations and algorithms (this involves specific details of the computational structure).
35.2.2 Three-Layer Representations
Primal sketch 1. The first of the three internal representations of visual information in Marr's visual computational theory. It is a 2-D representation that corresponds to a collection of image features and describes the contours along which the properties of object surfaces change. The primal sketch representation attempts to extract distinctive primitives in the image and describe their spatial relationships. It can provide information about the outline of each object in the image, and it is a sketchy (succinct) form of representation for 3-D objects. As a representation for early vision introduced by Marr, it focuses on low-level features such as edges. The full primal sketch combines the information computed from the raw primal sketch (mostly the edge, bar, endpoint, and blob features extracted from the image) to construct, for example, subjective contours. See Marr's theory, Marr-Hildreth edge detector, and raw primal sketch.
2. A representation of image primitives or texels, such as bars, edges, connected components, and terminators. In texture analysis, the extraction of image primitives is generally accompanied by a process of tracking the primitives. Statistics such as the types of primitives, primitive orientation, size parameter distribution, primitive contrast distribution, and primitive spatial density can then be obtained from the representation of primitives. Modern representations of primitives are often more complicated. For example, sparse coding theory and the Markov random field concept can be combined to define primitives. The image is then decomposed into sketchable regions (modeled with sparse coding) and non-sketchable regions (modeled with Markov random fields), and the texels are extracted from the sketchable regions of the image.
Raw primal sketch The first-level representation constructed according to the perception process of Marr's visual computational theory; it relies heavily on the detection of local edge features. It can represent many types of features, such as position, orientation, contrast, and scale around the center, as well as edges, bars, and terminated bars. See primal sketch.
Full primal sketch In Marr's visual computational theory, a representation consisting of the raw primitive units plus combination information. It contains the image structures corresponding to scene structures (such as the image regions corresponding to scene surfaces).
2.5-D Sketch One of the three levels of internal representation of visual information in Marr's visual computational theory, the second (middle) level. Firstly, the target is
decomposed, according to the principle of orthogonal projection and a certain sampling density, into a number of surface elements of a certain size and shape on the visible surface of the object. Each surface element has its own orientation, which is represented by a normal vector. The needle diagram (each vector shown as an arrow) formed by combining all the normal vectors constitutes a 2.5-D sketch (also called a needle map). The 2.5-D sketch is a depth image scanned from a single viewpoint; it indicates the orientation of the surface elements of the object and thus gives information on the surface shape, so it is an intrinsic image. Its characteristic is that it not only represents part of the object's outline information (similar to the primal sketch representation) but also represents the orientation information of the visible object surface in a coordinate system centered on the observer. This allows all data to be represented in a single image, where each pixel value records the distance information of the observed scene. The reason why it is not called a 3-D image is that it does not explicitly express the (invisible) back of the scene. The name 2.5-D sketch comes from the fact that, although local changes in depth and discontinuities have been resolved, the absolute distance of the observer from each scene point may still be uncertain. This sketch is the central structure of Marr's visual computational theory. It is a middle-level representation of the scene, indicating the arrangement of the visible surfaces relative to the observer. The sketch can also be considered to contain three elements: contour, texture, and shading information obtained from the primal sketch, stereo vision, and motion.
Viewer-centered representation A representation of the 3-D world maintained by an observer (human or robot). The observer is in the world coordinate system; as the observer moves, the representation of the world changes. Compare object-centered.
Observer-centered The essence of the 2.5-D representation in Marr's visual computational theory: the information on the (visible) surface of an object is represented from the perspective of the observer.
Viewpoint-dependent representations See viewer-centered representation.
Needle map An image representation for displaying 2-D and 3-D vector fields (such as surface normals or velocity fields). In the needle map, each pixel corresponds to a vector, which is represented by a short arrow; the length of the arrow is proportional to the magnitude of the vector and the direction of the arrow is the vector direction. To avoid overcrowding the image, generally only a subset of all pixels is drawn. Figure 35.2 shows an object consisting of a cylinder placed on a square platform and the needle diagram showing the surface normals of the object (a rendering sketch is given after Fig. 35.2). See 2.5-D sketch.
Fig. 35.2 Needle map example
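As an illustration of the needle map just described, the following Python sketch renders a subsampled field of surface normals as short arrows with matplotlib. It is only a hedged example, not taken from the handbook: the bump-shaped test surface, the subsampling step, and all variable names are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Height field z(x, y): a spherical bump on a flat plate (illustrative data).
x, y = np.meshgrid(np.linspace(-1, 1, 64), np.linspace(-1, 1, 64))
z = np.sqrt(np.clip(0.5 - x**2 - y**2, 0.0, None))

# Surface normals from the height gradient: n is parallel to (-dz/dx, -dz/dy, 1).
dzdy, dzdx = np.gradient(z)
nx, ny, nz = -dzdx, -dzdy, np.ones_like(z)
norm = np.sqrt(nx**2 + ny**2 + nz**2)
nx, ny = nx / norm, ny / norm

# Needle map: draw only a subset of the pixels to avoid overcrowding.
step = 4
plt.quiver(x[::step, ::step], y[::step, ::step],
           nx[::step, ::step], ny[::step, ::step],
           angles='xy', scale=20, width=0.003)
plt.gca().set_aspect('equal')
plt.title('Needle map (projected surface normals)')
plt.show()
```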
2.1-D Sketch A simplified version of the 2.5-D sketch showing the various scene regions in order of their relative depth from front to back. It is the technique, and the result, of obtaining depth information about objects from a single image; on this basis, the outlines of occluded objects can also be restored. It is a qualitative, hierarchical representation, whereas the 2.5-D sketch emphasizes quantitatively and accurately describing the specific depth of each scene region. It was an attempt to obtain the low-level depth reconstruction results of the early stage of human vision, in which the terminations of edges are given an important role. Consider the problem of decomposing the domain D of an image. It is assumed that there is a minimum number of interrupted edges, cracks, corner points, and sharp points in the image, and that appropriate interrupted edges are generated by the occlusion of regions. The result is that the domain D is decomposed into overlapping regions R1 ∪ R2 ∪ ⋯ ∪ Rn sorted according to occlusion, which is called a 2.1-D sketch. It can be formulated as a minimization problem, from which a set of optimal contours called nonlinear splines can be obtained that minimize the squared length and curvature; this is necessary to construct a 2.1-D sketch of an image, which can maintain the continuity of the interrupted edges. The model can further be used for image segmentation.
3-D representation One of the three levels of internal representation of visual information in Marr's visual computational theory, the third and highest level. It is a scene-centered representation (i.e., it includes the invisible parts of objects) that comprehensively describes the shape and spatial organization of 3-D objects in a world-centered coordinate system.
Scene-centered In Marr's visual computational theory, the property that 3-D spatial information is represented for all (even invisible) parts of the objective scene.
World-centered Same as scene-centered.
Object-centered See scene-centered. Object-centered representation A model representation that describes the positions of features and the components of the model relative to the target’s own position. This can be a relative description (e.g., the mouth is 3 cm below the nose) or a local coordinate system (e.g., with the nose as the coordinate origin, the right eye is at (0.2, 30, 13)). This representation is opposed to the observer-centered representation.
Chapter 36
3-D Representation and Description
Image understanding requires the interpretation of the objective world with the help of images, so 3-D space and entities need to be represented and described (Hartley and Zisserman 2004; Zhang 2017c).
36.1
3-D Point and Curve
Points and lines are the basic components of 3-D solids, and they also reflect local features in 3-D space (Davies 2005).
36.1.1 3-D Point
Point A basic concept in Euclidean geometry that represents an infinitely small element. In image technology, pixels are regarded as image points, and "points in the scene" often refer to positions in the 3-D space viewed by the camera.
2-D point feature 1. A point of interest in a 2-D image, such as a corner point or a line intersection. 2. A feature of a pixel in a 2-D image, such as its gray value or gradient value.
3-D point feature A characteristic of a point on a 3-D target, or a characteristic that a 3-D point has. An example of the former is that the normals of the points on a sphere point away from the center of the sphere; an example of the latter is that a corner has a large curvature.
Fig. 36.1 Three symmetrical planes of a cubic object
Point cloud A large set of points in 3-D space. It commonly consists of information such as the orientation and distance carried by the reflected laser spots recorded when the surface of a scene is illuminated with a laser beam.
Frontier point A point on the object surface whose normal is known. In many cases, this can be used to derive the distribution of light reflected from the surface. It can also be used to recover surface shape in stereo vision applications.
Plane Same as flat.
Symmetry plane The plane in 3-D space in which the axes of bilateral symmetry lie. In Fig. 36.1, the cubic object represented by solid lines has three symmetry planes drawn with dotted lines. It is the 3-D generalization of the 2-D line of symmetry.
Hyperplane A geometric structure obtained by extending the concept of a plane in 3-D space to a higher-dimensional N-D space. It can be expressed as the set of points x = (x1, x2, . . ., xN)T satisfying the equation a1x1 + a2x2 + . . . + aNxN = c, where a is a fixed vector and c is a constant.
Singular point 1. A point of a 2-D curve whose local features are irregular. There are three types, namely, the first type of cusp point, the inflection point, and the second type of cusp point, as shown in Fig. 27.7d, c, a (see regular points). 2. In HSI color space, a point on the I axis, whose S value is 0 and whose H is undefined.
Cusp point A class of curve singularities. Cusp points can be divided into the first type and the second type, both opposite to regular points and corresponding, respectively, to Fig. 27.7d, a (see regular points).
Umbilical point A point on the object surface where the curvature is the same in all directions. Also called a spherical point. All points on a sphere are umbilical points.
Appearance singularity A viewpoint at which a small change in the position of the observer causes a drastic change in the appearance of the observed scene in the image. Here, drastic changes refer to the appearance or disappearance of image features. This is in contrast to the situation at a generic viewpoint. For example, when looking at a vertex of a cube from a long distance, small changes in viewpoint still guarantee that the three surfaces meeting at the vertex are seen. But when the viewpoint moves into the infinite plane containing one of the cube's surfaces (an appearance singularity), one or more of the surfaces disappear.
36.1.2 Curve and Conic
Curve A collection of connected points in 2-D or 3-D space, where each point has at most two adjacent points (the two end points have only one adjacent point each). A curve can be defined by a set of connected points, using either an implicit function (such as x² + y = 0) or an explicit function (such as (–t, t²) for all t). It can also be defined as the intersection of two surfaces (such as the intersection of the planes X = 0 and Y = 1).
Singularity event A point in the domain of a geometric curve or surface where the first derivative is zero.
Curve invariant point A special point on a curve whose geometric properties do not change under projective transformation. Such points can be identified from the curve and used to establish the correspondence between multiple views of the same scene. Two typical curve invariant points are inflection points and double tangent points, as shown in Fig. 36.2.
Fig. 36.2 Curve invariant points: inflection points and double tangent points (with their double tangent line)
Fig. 36.3 Tolerance band example (break point and tolerance regions)
Curve invariant A measurement of a curve that remains unchanged under a certain transformation. For example, the arc length and curvature of a curve are invariant under Euclidean transformations.
Speed of a curve The rate at which points on the curve change with the arc length parameter. Given a curve f = [x(s), y(s)], the speed of the curve at point s is

T(s) = √[(∂x/∂s)² + (∂y/∂s)²]

Curve normal The vector perpendicular to the tangent vector of the curve at a given point, lying in the plane that locally contains the curve. See local geometry of a space curve.
Curve inflection A point on the curve with zero curvature, where adjacent points on one side of the curve have positive curvature values and those on the other side have negative curvature values. For example, see Fig. 27.12 in curve bi-tangent.
Osculating circle For a point on a curve, the circle that is most closely tangent to the curve at that point.
Tolerance band algorithm An algorithm for incrementally dividing a curve into linear segment primitives. The current line segment defines two parallel boundaries of the tolerance band, located at a preselected distance on either side of the line segment. When a new curve point (break point) leaves the tolerance band, the current line segment ends and a new line segment begins, as shown in Fig. 36.3.
Outward normal direction Given a curve f = [x(s), y(s)], the outward normal direction of the curve at point s is
N(s) = (∂y(s)/∂s, −∂x(s)/∂s) / √[(∂x/∂s)² + (∂y/∂s)²]
where the denominator is the speed of the curve at point s.
Inside of a curve For a closed curve f = [x(s), y(s)], consider a point p = [px, py] on the plane that is not on this curve. Let the point on the curve closest to p be q and the arc length at that point be sq. If (p – q)·N(sq) < 0, then point p is said to belong to the inside of the curve, where N(sq) is the outward normal direction of the curve at point q. Compare outside of a curve.
Outside of a curve For a closed curve f = [x(s), y(s)], consider a point p = [px, py] on the plane that is not on this curve. Let the point on the curve closest to p be q and the arc length at that point be sq. If (p – q)·N(sq) > 0, then point p is said to belong to the outside of the curve, where N(sq) is the outward normal direction of the curve at point q. Compare inside of a curve.
Fig. 36.4 Four forms of conics
Conic A set of curves obtained by intersecting a cone with a plane (also called conic sections), including the four forms of circle, ellipse, parabola, and hyperbola, as shown in Fig. 36.4. Their general 2-D form is ax² + bxy + cy² + dx + ey + f = 0. See conic fitting.
Plane conic Same as conic.
Conic invariant An invariant of a conic section. If the conic curve is represented by its canonical form ax² + bxy + cy² + dx + ey + f = 0, where a² + b² + c² + d² + e² + f² = 1, the invariants to rotation and translation are functions of the eigenvalues of the matrix of the main quadratic part, A = [a b/2; b/2 c]. For example, its rank and determinant, which are relatively easy to calculate, are both invariants. For an ellipse, the eigenvalues are a function of the two
radii. The only invariant under an affine transformation is the type of the conic. The invariant under a projective transformation is the set of signs of the eigenvalues of the 3 × 3 matrix of the conic in homogeneous coordinates.
Conic section The planar section used to cut a cone to obtain a conic curve.
Conic fitting Fitting a set of data points {(xi, yi)} with a geometric model of a conic section ax² + bxy + cy² + dx + ey + f = 0. Commonly used for fitting circles and ellipses.
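Conic fitting as described above can be carried out by linear least squares on the coefficient vector (a, b, c, d, e, f) under a unit-norm constraint. The sketch below uses the SVD null-space solution and minimizes the algebraic (not geometric) distance; the data, function name, and normalization are assumptions for illustration, not the handbook's reference method.

```python
import numpy as np

def fit_conic(points):
    """Least-squares fit of a*x^2 + b*x*y + c*y^2 + d*x + e*y + f = 0.

    Returns the unit-norm coefficient vector (a, b, c, d, e, f), taken as the
    right singular vector of the design matrix with the smallest singular value.
    """
    x, y = points[:, 0], points[:, 1]
    design = np.column_stack([x**2, x*y, y**2, x, y, np.ones_like(x)])
    _, _, vt = np.linalg.svd(design)
    return vt[-1]

# Noisy samples of the unit circle x^2 + y^2 - 1 = 0 (illustrative data).
t = np.linspace(0.0, 2.0 * np.pi, 100)
pts = np.column_stack([np.cos(t), np.sin(t)]) + 0.01 * np.random.randn(100, 2)
coeffs = fit_conic(pts)
print(coeffs / coeffs[0])   # approximately (1, 0, 1, 0, 0, -1)
```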
36.1.3 3-D Curve
Space curve A curve that may follow a path in 3-D space, that is, it is not confined to a plane.
Local geometry of a space curve A geometric method to describe the spatial structure and properties of points on a space curve. Referring to the coordinate system in Fig. 36.5, where C represents the space curve, P is the point under consideration (can be set to the origin of the coordinate system), the center of curvature of C at P is O, N is the normal plane, T is the osculating plane, R is the rectifying plane, t is the tangent vector, n is the principal normal vector, and b is the binormal vector.
Principal normal The intersection of the normal plane and the osculating plane in the coordinate system of the local geometry of a space curve (see Fig. 36.5). It is usually represented by the vector n.
Binormal The intersection of the normal plane and the rectifying plane in the coordinate system of the local geometry of a space curve (see Fig. 36.5). It is mostly represented by the vector b.
Fig. 36.5 Local geometry of a space curve
Curve binormal A vector perpendicular to both the tangent vector and the normal vector at any given point on a curve in 3-D space. See local geometry of a space curve.
Normal plane In the local geometry of a space curve (see Fig. 36.5), the plane N formed by the infinitely many straight lines through the origin that are perpendicular to the tangent.
Osculating plane In the local geometry of a space curve (see Fig. 36.5), the plane passing through the origin of the coordinate system that comes closest to the neighboring points of the considered point. It is unique.
Torsion A limit in the local geometry of a space curve (see Fig. 36.5). Consider the rate of change of the osculating plane along the space curve (this can be seen as a generalization of curvature). Suppose there are two adjacent points P and P′ on the curve C; calculate the angle between the two osculating planes corresponding to them (or the angle between the normals of these two planes) and divide it by the distance between the two points. As P′ tends to P, the angle and the distance both tend to 0, while their ratio has a limit. This limit is called the torsion of curve C at point P. In differential geometry, it represents the intuitive notion of local twisting when moving along a 3-D curve. The torsion τ(t) of a 3-D space curve C(t) is the scalar

τ(t) = –n(t) • db(t)/dt = [C′(t), C″(t), C‴(t)] / ‖C′(t) × C″(t)‖²

where n(t) is the normal of the curve, b(t) is the binormal of the curve, and the symbol [a, b, c] represents the scalar triple product a • (b × c).
Torsion scale space [TSS] The description space of a space curve, based on the movement of the zero-crossings of the torsion function along the curve as the smoothing scale increases. Smoothing is achieved by applying Gaussian smoothing of increasing scale to the space curve. Visual representations of the torsion scale space are often organized as 2-D scale drawings in which the zero-crossings appear as a function of arc length.
Rectifying plane In the local geometry of a space curve (see Fig. 36.5), the plane R passing through the origin and perpendicular to both the normal plane and the osculating plane.
Fig. 36.6 Frenet frame diagram
Frenet frame A triplet consisting of three orthogonal unit vectors (normal, tangent, and binormal/bi-tangent) used to describe the local properties of a point on a space curve, as shown in Fig. 36.6.
Normal map A map showing the surface normal at each grid position. If the original geometric image is highly compressed, the normal map can be used to compensate for the loss in visual fidelity.
Normal vector A vector indicating the local (facet) orientation of a 3-D plane or surface; it is generally normalized.
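The Frenet frame and the torsion formula given above can be evaluated numerically for a sampled space curve. The following sketch uses finite differences; the helix test curve, the function name, and the sampling are assumptions made only for illustration.

```python
import numpy as np

def frenet_and_torsion(C, t):
    """Tangent, normal, binormal, and torsion of a sampled space curve C(t) (N x 3)."""
    d1 = np.gradient(C, t, axis=0)          # C'
    d2 = np.gradient(d1, t, axis=0)         # C''
    d3 = np.gradient(d2, t, axis=0)         # C'''
    T = d1 / np.linalg.norm(d1, axis=1, keepdims=True)
    dT = np.gradient(T, t, axis=0)
    N = dT / np.linalg.norm(dT, axis=1, keepdims=True)
    B = np.cross(T, N)
    # Scalar triple product formula: tau = [C', C'', C'''] / ||C' x C''||^2
    cross12 = np.cross(d1, d2)
    tau = np.einsum('ij,ij->i', cross12, d3) / np.sum(cross12**2, axis=1)
    return T, N, B, tau

# Right-handed helix: the expected torsion is h / (r^2 + h^2).
r, h = 2.0, 0.5
t = np.linspace(0.0, 4.0 * np.pi, 400)
C = np.column_stack([r * np.cos(t), r * np.sin(t), h * t])
_, _, _, tau = frenet_and_torsion(C, t)
print(tau[200], h / (r**2 + h**2))   # both approximately 0.1176
```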
36.2
3-D Surface Representation
When people observe a 3-D scene, the first thing they see is the outer surface of the object (Davies 2012). In general, the outer surface is made up of a set of curved surfaces (Hartley and Zisserman 2004). To express the outer surfaces of 3-D objects and describe their shapes, the outer contour lines or outer contour surfaces of the objects can be used (Sonka et al. 2014; Szeliski 2010).
36.2.1 Surface
Surface A set of interconnected points in 3-D space; informally, a 2-D object in 3-D space. Mathematically speaking, a surface is a 2-D subset of ℝ³ that is locally topologically equivalent, almost everywhere, to the open unit ball in ℝ². This implies that a point cloud is not a surface, and that a surface can have cusp points and boundaries. Parameterizing the surface yields a function from ℝ² to ℝ³, defining the 3-D surface point x(u, v) as a function of the 2-D parameter (u, v). Limiting (u, v) to a subset of ℝ² gives a subset of the surface. The surface is the set S of points on it, defined on domain D:
S = {x(u, v) | (u, v) ∈ D ⊂ ℝ²}

Surface element [Surfel] The basic unit obtained by decomposing a surface.
Surface orientation A convention to determine whether the surface normal, or its opposite, points outside the space enclosed by the surface.
Fundamental form A metric used to determine the local characteristics of a surface. See first fundamental form and second fundamental form.
First fundamental form See surface curvature.
Metric determinant 1. A metric for measuring curvature. 2. For a surface, the square root of the determinant of the first fundamental form matrix of the surface.
Second fundamental form See surface curvature.
Local shading analysis A class of methods to infer and determine the shape and pose of target surface elements based on the analysis of image brightness in local neighborhoods, under perspective or parallel projection. It uses the small neighborhood of the corresponding surface pixels to establish the relationship between the differential surface structure and the local structure of the corresponding intensity image. It decomposes the surface into a collection of small neighborhoods around each pixel and considers only the local orientation of the surface. Its advantage is that surface-related information can be obtained directly from a single monochrome grayscale image, without the need for 3-D reconstruction in the form of explicit depth.
Reflectance image Same as reflectance map.
Reflectance map A graph describing the relationship between the brightness of a scene and its surface orientation. The orientation of the object surface can be represented by the gradient (p, q), and its relationship with brightness is R(p, q). The graph obtained by drawing R(p, q) as contours is the reflectance map. The form of the reflectance map depends on the nature of the object surface material and the position of the light source; in other words, the reflectance map combines information on the surface reflection characteristics and the light source distribution.
It can be used to describe the reflectivity of a material as a function of the local surface orientation relative to the viewer. The most commonly used is the Lambertian reflectance map based on Lambert's law (a code sketch is given at the end of this subsection). See shape from shading and photometric stereo.
Surface inspection One of the functions of a machine vision system, used to check finished products for defects (such as scratches and unevenness).
Surface continuity A mathematical property of a single point parameterized by (u, v) on the surface {x(u, v) | (u, v) ∈ ℝ²}. If an infinitesimal movement from (u, v) in any direction does not cause a sudden change in the value of x, then the surface is continuous at that point. If a surface is continuous at all points, it is said to be continuous everywhere. Compare surface discontinuity.
Surface segmentation Dividing a surface into simpler surface patches. Given a surface defined on domain D, determine the partition D = {D1, . . ., Dn} so that certain goodness criteria are satisfied. For example, one may require that the maximum distance from the points of each Di to a best-fit quadric surface be less than a given threshold. See range data segmentation.
Surface discontinuity The case where the normal vector of a point on the surface is not continuous. Such points often lie on fold edges, where the direction of the surface normal can vary greatly. Compare surface continuity.
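The reflectance map entry above mentions the Lambertian case; the sketch below evaluates the common Lambertian form in gradient space. This specific formula is standard in the shape-from-shading literature and is added here as an assumption rather than quoted from the handbook; the light-source gradient (ps, qs) and all names are illustrative.

```python
import numpy as np

def lambertian_reflectance_map(p, q, ps, qs):
    """Lambertian reflectance R(p, q) for a light source with gradient (ps, qs).

    R = (1 + p*ps + q*qs) / (sqrt(1 + p^2 + q^2) * sqrt(1 + ps^2 + qs^2)),
    clipped at 0 for self-shadowed orientations.
    """
    num = 1.0 + p * ps + q * qs
    den = np.sqrt(1.0 + p**2 + q**2) * np.sqrt(1.0 + ps**2 + qs**2)
    return np.clip(num / den, 0.0, None)

# Sample R on a grid of surface gradients; iso-contours of this array
# form the reflectance map described above.
p, q = np.meshgrid(np.linspace(-2, 2, 201), np.linspace(-2, 2, 201))
R = lambertian_reflectance_map(p, q, ps=0.5, qs=0.3)
print(R.shape, float(R.max()))
```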
36.2.2 Surface Model
Surface model A geometric model representing a surface, often used for rendering or object recognition. There are many surface models, such as triangulated models or free-form surfaces. As a model for representing 3-D objects, it is built on the outer surface of objects in a 3-D scene. In general, the outer surface of an object is made up of a set of curved surfaces. To represent the outer surfaces of 3-D objects and describe their shapes, the outer contour lines or outer contour surfaces of the objects can be used. The surface model uses only the outer surface of the 3-D object to describe the object (in fact, for opaque 3-D objects, most imaging methods can only observe the outer surface). Compare volumetric models.
Biquadratic In the parametric representation of a surface, the case in which the bivariate polynomial describing a surface element has degree 2. A biquadratic surface element can be represented as
z = a0 + a1x + a2y + a3xy + a4x² + a5y²

where a0, a1, . . ., a5 are surface parameters.
Bicubic In the parametric representation of a surface, the case in which the bivariate polynomial describing a surface element has degree 3. A bicubic surface element can be represented as

z = a0 + a1x + a2y + a3xy + a4x² + a5y² + a6x³ + a7x²y + a8xy² + a9y³

where a0, a1, . . ., a9 are surface parameters (a least-squares fitting sketch is given after this group of entries).
Parametric bicubic surface model A model that uses bivariate third-order polynomials to describe surfaces.
Parametric mesh A type of 3-D surface modeling that defines a surface as a combination of primitives. A typical example is the non-uniform rational B-spline (NURBS).
Parametric surface A 3-D surface modeled using parameters. The symmetric parametric model obtained from an elastically deformed version of the generalized cylinder is a typical example. In it, a vector function f encodes the 3-D coordinates (x, y, z), and the domain (s, t) encodes the surface parameterization: s is a parameter along the spine (axis) of the deformable tube, and t is a parameter around the deformable tube. When using this model to fit an image-based silhouette curve, the model can be constrained with various smoothing forces and radial symmetry forces.
Nonparametric surface A 3-D surface that is not modeled using parameters. The general triangular mesh is a commonly used model, in addition to splines and subdivision surfaces.
Aspect graph A general 3-D object representation method. Figure 36.7 shows an example of an aspect graph drawn for a tetrahedron (the simplest polyhedron), where the nodes correspond to the aspects and the arcs correspond to changes of aspect. A path in the graph represents the observer's orbit around the target. Figure 36.8 shows an example of a set of aspect graphs for a car. An arc in the graph represents the transition between two adjacent views (nodes), and the change between aspects can also be called a visual event. See characteristic view.
Progressive mesh [PM] A representation method for 3-D structures that can be seen as a generalization of 2-D image pyramids. Generally used to represent or draw 3-D surfaces with different levels of detail based on a 3-D model.
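The biquadratic surface element defined earlier in this subsection can be fitted to scattered depth samples z(x, y) by linear least squares over the six parameters a0, ..., a5. The sketch below is illustrative: the synthetic data, random seed, and function name are assumptions.

```python
import numpy as np

def fit_biquadratic(x, y, z):
    """Least-squares estimate of (a0..a5) in
    z = a0 + a1*x + a2*y + a3*x*y + a4*x^2 + a5*y^2."""
    A = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coeffs

# Noisy samples of a known biquadratic patch (illustrative data).
rng = np.random.default_rng(0)
x, y = rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200)
true = np.array([0.2, -1.0, 0.5, 0.3, 2.0, -0.7])
z = (true[0] + true[1]*x + true[2]*y + true[3]*x*y
     + true[4]*x**2 + true[5]*y**2) + 0.01 * rng.standard_normal(200)
print(np.round(fit_biquadratic(x, y, z), 2))   # close to the true parameters
```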
Fig. 36.7 Aspect graph of a tetrahedron
Fig. 36.8 Car aspect view schematic
Subdivision connectivity The property that gradual subdivision of a triangle mesh preserves its connectivity. When modeling mesh levels (similar to pyramids) with subdivision connectivity, multiple levels from fine resolution to coarse resolution can be constructed to display different amounts of detail.
36.2.3 Surface Representation
Surface representations Representation of 3-D depth scan data. There are many explicit surface representation methods, such as triangular mesh surfaces, spline surfaces, subdivision surfaces, and so on. Not only do they allow generation of high-precision models; they also support processes such as interpolation, smoothing, decomposition, and simplification.
Fig. 36.9 Surface boundary representation
Surface boundary representation A method for defining surface models in computer graphics. It defines a 3-D object as a collection of surfaces with boundaries. The topology of the model indicates which surfaces are connected and which boundaries are shared between surface patches. Figure 36.9 shows the boundary representation of a 3-D surface of an object and its parameters: edges 1–9 and vertices a–g with 3-D curve descriptions, as well as the connectivity of these entities. For example, surface B is surrounded by curves 1 to 4, and the two vertices of curve 1 are b and c.
Surface harmonic representation Representation of surfaces using linear combinations of given harmonic basis functions. Generally, a spherical coordinate system is used.
Algebraic surface A surface representation using algebraic (Euclidean) geometry defined in N-D Euclidean space. Regular 3-D surfaces include planes, spheres, tori, and generalized cylinders in 3-D. An algebraic surface can be expressed as {x: f(x) = 0}.
Algebraic point set surfaces A smooth surface model defined from a point cloud representation, using a locally moving least squares fit of an algebraic surface (part of an algebraic sphere).
Surface simplification Given a complete surface representation, computing a simplified version of it to obtain different levels of detail.
Surface interpolation Generating a continuous surface from sparse data such as 3-D points. For example, given a set S = {p1, p2, . . ., pN} of N sampled data points, we want to generate other points in ℝ³ that lie on a smooth surface passing through all the points in S. Common techniques include radial basis functions, splines, and natural neighbor interpolation. The aim is to combine a set of discrete control points into a surface, or to let a surface pass through a point cloud; this is also called scattered data interpolation.
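One of the common techniques mentioned above for surface interpolation, radial basis functions, can be sketched directly with a Gaussian kernel. The kernel choice, its width, the small regularization term, and the data below are assumptions made for illustration, not the handbook's prescription.

```python
import numpy as np

def rbf_interpolate(centers, values, queries, sigma=0.3):
    """Interpolate a height field z = f(x, y) through sparse samples
    using Gaussian radial basis functions centered at the samples."""
    def kernel(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :])**2, axis=-1)
        return np.exp(-d2 / (2.0 * sigma**2))
    K = kernel(centers, centers) + 1e-8 * np.eye(len(values))  # slight ridge for stability
    weights = np.linalg.solve(K, values)
    return kernel(queries, centers) @ weights

# Sparse samples of a smooth surface (illustrative data).
rng = np.random.default_rng(1)
xy = rng.uniform(-1, 1, (50, 2))
z = np.sin(np.pi * xy[:, 0]) * np.cos(np.pi * xy[:, 1])
grid = np.column_stack([g.ravel() for g in np.meshgrid(np.linspace(-1, 1, 21),
                                                       np.linspace(-1, 1, 21))])
z_grid = rbf_interpolate(xy, z, grid)
print(z_grid.shape)   # (441,) interpolated heights passing through the samples
```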
Fig. 36.10 Bilinear surface interpolation
Bilinear surface interpolation Given discrete samples fij = {f(xi, yj)}, i = 1, . . ., n, j = 1, . . ., m, determine the value of the function f(x, y) at an arbitrary position (x, y). The samples are arranged on a 2-D grid, and the value at position (x, y) is obtained by interpolating the values of the four grid points around it. With the distances d1, d1′, d2, d2′ defined as in Fig. 36.10, we have

f_bilinear(x, y) = (A + B) / [(d1 + d1′)(d2 + d2′)]
A = d1d2 f11 + d1′d2 f21
B = d1d2′ f12 + d1′d2′ f22

The dashed lines in the figure help to memorize the formula: each function value fij is multiplied by the two d values nearest to it (a code sketch is given after this group of entries).
Scattered data interpolation The operation of reconstructing a surface constrained by a set of sparse data. The surface is often parameterized as a height field z = f(x, y) or as a 3-D parameterized surface [x(u, v), y(u, v), z(u, v)], where 0 ≤ u, v ≤ 1. See surface interpolation.
Natural neighbor interpolation A form of spatial interpolation that uses a Voronoï region segmentation of the image to provide the weights for interpolation.
Surflet A directed surface point (p, n), where p is the surface point and n is the surface normal at that point. A pair of such surface points constitutes a local descriptor with translation-invariant and rotation-invariant characteristics; such descriptors are combined into a histogram to obtain a compact descriptor for a 3-D object.
Surface normal A direction orthogonal to the surface. For a parameterized surface x(u, v), the normal is a unit vector parallel to (∂x/∂u) × (∂x/∂v). For an implicit surface F(x) = 0, the normal is a unit vector parallel to ∇F = [∂F/∂x, ∂F/∂y, ∂F/∂z].
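Returning to bilinear surface interpolation as noted above, a minimal sketch is given below. It uses the standard unit-cell form of the interpolation, with d1, d1′, d2, d2′ taken as the horizontal and vertical distances of (x, y) to the cell edges (this labeling may differ from that of Fig. 36.10); the sample grid and function name are assumptions.

```python
import numpy as np

def bilinear(grid, xs, ys, x, y):
    """Bilinear interpolation of f(x, y) from samples grid[j, i] = f(xs[i], ys[j]).

    Assumes xs and ys are uniformly spaced and (x, y) lies strictly inside the grid.
    """
    i = np.searchsorted(xs, x) - 1
    j = np.searchsorted(ys, y) - 1
    d1, d1p = x - xs[i], xs[i + 1] - x       # horizontal distances to the cell edges
    d2, d2p = y - ys[j], ys[j + 1] - y       # vertical distances to the cell edges
    f11, f21 = grid[j, i], grid[j, i + 1]
    f12, f22 = grid[j + 1, i], grid[j + 1, i + 1]
    # Each corner value is weighted by the distances to the opposite edges,
    # normalized by the cell area (d1 + d1')(d2 + d2').
    return (f11*d1p*d2p + f21*d1*d2p + f12*d1p*d2 + f22*d1*d2) / ((d1 + d1p) * (d2 + d2p))

xs = ys = np.linspace(0.0, 1.0, 11)
grid = np.add.outer(ys, xs)                  # f(x, y) = x + y, exactly bilinear
print(bilinear(grid, xs, ys, 0.37, 0.52))    # 0.89
```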
Tangent plane A plane passing through a point on the surface and perpendicular to the surface normal at that point.
Spatial normal fields The field formed over a scene surface by the surface normals at spatially adjacent surface points. Such fields can also be derived from shape from shading algorithms.
Surface fitting Parameterize a family of parametric surfaces xθ(u, v) with a parameter vector θ. For example, a 3-D sphere can be parameterized with four parameters: three represent the center coordinates and one represents the radius. Given a set of N sampled data points {p1, p2, . . ., pN}, the task of surface fitting is to find the surface parameters that best fit the given data. Common interpretations of "best fit" include finding the surface with the smallest sum of Euclidean distances to the data points, or maximizing the probability that the data points are its noisy samples. Common techniques include least squares fitting, nonlinear optimization of surface parameters, and so on.
Geometric distance In curve and surface fitting, the shortest distance from a given point to a given curve or surface. In many fitting problems, calculating the geometric distance is expensive but yields more accurate solutions. Compare algebraic distance.
Simplex angle image [SAI] A way to represent the surface of an object. It stores different surface curvatures in a generalized representation. It starts with a predefined 3-D mesh and iteratively deforms the mesh until it matches the surface to be represented, keeping a one-to-one correspondence between the initial nodes and the deformed mesh. The curvature features calculated from the deformed mesh are stored in the initial mesh and used to represent the object. This method can also be used for non-convex objects by storing the connectivity between these meshes.
Cube map A way to represent 2-D geometric primitives (especially straight lines and vanishing points) with the help of 3-D representation methods while avoiding trigonometric functions. If the line equation m = (n, d) is projected onto the unit sphere (the unit sphere can be represented using the spherical coordinates m = (cosθ•cosϕ, sinθ•cosϕ, sinϕ), where θ is the polar angle and ϕ is the azimuth angle), it can represent the straight line originally represented in 2-D polar coordinates. Further, projecting m onto the surface of the unit cube (as shown in Fig. 36.11), we get another representation, the cube map. To calculate the cube map coordinates of a 3-D vector m, first determine its largest (in absolute value) component, that is, m = max(|nx|, |ny|, |d|), where nx and ny are the two components of n and d is the distance of the line from the origin. One of the six cube faces can then be selected based on this value. This method avoids the use of trigonometric functions but requires some logical judgment.
Fig. 36.11 Cube chart around unit ball
Fig. 36.12 Results of projecting half a cube into three subspaces (Sub 1, Sub 2, Sub 3)
Fig. 36.13 Cube map example (the six faces ±x, ±y, ±z with local coordinates s, t)
One advantage of using cube maps is that all straight lines passing through a point correspond to line segments on the cube surface, which is useful when using the original Hough transform. If d ≥ 0 is enforced by ignoring the polarity of the edge direction (the gradient sign), only half a cube (i.e., only three cube faces) needs to be used, as shown in Fig. 36.12. An example of a cube map is shown in Fig. 36.13; the left side is the result of unfolding from the –z direction, and the right side is a real example.
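A hedged sketch of the cube-map coordinate computation described in this entry: the face is chosen from the dominant component of m = (nx, ny, d) and the remaining two components are divided by it, so that no trigonometric functions are needed. The function name and the face encoding are assumptions for illustration.

```python
import numpy as np

def cube_map_coords(nx, ny, d):
    """Map the line parameters m = (nx, ny, d) to a cube face and 2-D coordinates.

    The face is selected by the largest-magnitude component of m; the other two
    components are divided by it to obtain the coordinates on that face.
    """
    m = np.array([nx, ny, d], dtype=float)
    k = int(np.argmax(np.abs(m)))            # index of the dominant component
    others = [i for i in range(3) if i != k]
    u, v = m[others] / m[k]                  # coordinates on the chosen face
    face = ('+x' if m[k] > 0 else '-x',
            '+y' if m[k] > 0 else '-y',
            '+z' if m[k] > 0 else '-z')[k]
    return face, u, v

# Line x*cos(30 deg) + y*sin(30 deg) = 2, written as (nx, ny, d):
print(cube_map_coords(np.cos(np.pi / 6), np.sin(np.pi / 6), 2.0))  # ('+z', ~0.43, ~0.25)
```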
36.2.4 Surface Description
Surface description The description of surface characteristics, which can include color, texture, shape, reflectivity, local differential geometry, global characteristics (such as elongation), and so on.
Apparent contour For a surface S in 3-D space, the set of critical values of the projection of S onto the plane; in other words, the silhouette of the surface S. If the surface is transparent, the apparent contour can be decomposed into a set of closed curves with double points and cusp points. The convex envelope of the apparent contour is also the boundary of its convex envelope.
Surface average value A statistic for resolving the ambiguity problem of the marching cubes layout. First calculate the average of the four vertex values on the ambiguous face, and then select the topological manifold corresponding to the layout by comparing this average with a predetermined threshold.
Iso-surface A surface in 3-D space on which the value of a function is constant, that is, f(x, y, z) = C, where C is a constant.
Surface roughness A measure of the texture of a surface; it can also be considered as the difference between the surface and an ideal smooth surface. The texture affects the appearance of the surface, because it changes the way light is reflected and gives different possibilities for the reflected light. Image analysis applications attempt to determine surface roughness characterizations, as this is part of manufacturing monitoring processes.
Surface roughness characterization A monitoring application (such as monitoring a painted surface) in which the surface roughness is estimated.
Asperity scattering A light scattering effect commonly encountered in photography or modeling of human skin. It is caused by sparsely scattered point scatterers on the surface. For human skin, these point scatterers appear as fluff on the surface (short, thin, light-colored, and hard to notice).
Orientation of object surface An important descriptive characteristic of the object surface. For a smooth surface, each point on it has a corresponding tangent plane. The orientation of this tangent plane (equivalently, the direction of the normal) indicates the orientation of the surface at that point.
Surface curvature A measure of the shape of a 3-D surface; it reflects a characteristic that remains unchanged when the surface is rotated or translated in 3-D space. The shape of a surface can be represented by the principal curvatures at each point on it. To calculate the principal curvatures, the first fundamental form and the second fundamental form need to be considered. In the differential geometry of a surface, the first fundamental form captures the arc length of curves on the surface. If the surface is parameterized by a smooth function x(u, v), the tangent vectors of the surface at (u, v) are given by the partial derivatives xu(u, v) and xv(u, v). One can define the dot products E(u, v) = xu•xu, F(u, v) = xu•xv, and G(u, v) = xv•xv. Then the arc length along a curve on the surface is given by the first fundamental form, ds² = Edu² + 2Fdudv + Gdv². The matrix of the first fundamental form is the 2 × 2 matrix

I = [E F; F G]
The second fundamental form captures curvature information. The second-order partial derivatives are xuu(u, v), xuv(u, v), xvu(u, v), and xvv(u, v). The surface normal at point (u, v) is the unit vector n(u, v) along xu × xv. The matrix of the second fundamental form at (u, v) is the 2 × 2 matrix

II = [xuu•n xuv•n; xvu•n xvv•n]
If d = (du, dv)T is a direction in the tangent space (so its 3-D direction is t(d) = du xu + dv xv), then the normal curvature in the direction d is K(d) = (dT II d)/(dT I d). As d varies, the maximum and minimum curvatures at the point (u, v) are the principal curvatures of that point, given by the generalized eigenvalues of II z = K I z, that is, by the solutions of the quadratic equation det(II – K I) = 0. The term is sometimes also used for the maximum or minimum curvature at each point of a 3-D surface.
Gaussian curvature A measure of the curvature of a surface at a point. It is the product of the maximum and minimum values of the normal curvature over all directions through the point, that is, the product of the curvatures in the two principal directions on a 3-D surface. See mean curvature and surface curvatures.
Parabolic point A class of points on a smooth surface where the Gaussian curvature is zero (one principal curvature vanishes). See mean and Gaussian curvature shape classification.
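The computation just described (fundamental form matrices I and II and the generalized eigenvalue problem II z = K I z) can be sketched numerically. The cylinder test surface, finite-difference step, and function name below are assumptions for illustration, not the handbook's reference implementation.

```python
import numpy as np

def principal_curvatures(x_func, u, v, eps=1e-4):
    """Principal curvatures of a parametric surface x(u, v) at (u, v),
    via the fundamental form matrices and det(II - K*I) = 0."""
    f = x_func
    xu = (f(u + eps, v) - f(u - eps, v)) / (2 * eps)
    xv = (f(u, v + eps) - f(u, v - eps)) / (2 * eps)
    xuu = (f(u + eps, v) - 2 * f(u, v) + f(u - eps, v)) / eps**2
    xvv = (f(u, v + eps) - 2 * f(u, v) + f(u, v - eps)) / eps**2
    xuv = (f(u + eps, v + eps) - f(u + eps, v - eps)
           - f(u - eps, v + eps) + f(u - eps, v - eps)) / (4 * eps**2)
    n = np.cross(xu, xv)
    n = n / np.linalg.norm(n)
    I_mat = np.array([[xu @ xu, xu @ xv],
                      [xu @ xv, xv @ xv]])          # first fundamental form I
    II_mat = np.array([[xuu @ n, xuv @ n],
                       [xuv @ n, xvv @ n]])         # second fundamental form II
    K1, K2 = np.linalg.eigvals(np.linalg.solve(I_mat, II_mat))
    return K1, K2, K1 * K2, 0.5 * (K1 + K2)         # principal, Gaussian, mean

# Cylinder of radius 2: principal curvature magnitudes 1/2 (around) and 0 (along the axis).
cylinder = lambda u, v: np.array([2.0 * np.cos(u), 2.0 * np.sin(u), v])
print(principal_curvatures(cylinder, 0.3, 0.4))
# approximately (-0.5, 0.0, 0.0, -0.25); the signs depend on the normal orientation.
```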
Hyperbolic surface region A local saddle-shaped region on a 3-D surface. In such a region, the Gaussian curvature is negative (so the signs of the two principal curvatures are opposite).
Integral curvature The integral of the Gaussian curvature over the object region.
Mean curvature The average of the curvatures in the two principal directions on a 3-D surface. It is one of the components that mathematically characterize the local surface shape at a point on a smooth surface: each point can be described by a pair of principal curvatures, and the mean curvature is their average. See surface curvatures.
36.2.5 Surface Classification
Surface shape classification Using the curvature information of a surface to classify each point on the surface as belonging to a locally ellipsoidal, hyperbolic, cylindrical, or planar patch. For example, given a parameterized surface x(u, v), the classification function C(u, v) is a mapping from the (u, v) domain to a set of discrete class labels. See surface class.
Surface shape discontinuity A discontinuity in the classification of the shape of a scene surface, i.e., a discontinuity in the classification function C(u, v). Figure 36.14 shows the discontinuity at the point P on a fold edge.
Range edge See surface shape discontinuity.
Local surface shape The shape of a small neighborhood (region) surrounding a point on the object surface. It can often be classified into one of several surface shape categories and computed as a function of surface curvature.
Fig. 36.14 Surface shape discontinuity
Table 36.1 Eight surface types determined by Gaussian curvature G and mean curvature H

          G < 0           G = 0     G > 0
H < 0     Saddle ridge    Ridge     Peak
H = 0     Minimal         Flat      (not possible)
H > 0     Saddle valley   Valley    Pit

Note: The terms in bold type in the table indicate that they are included as headings in the text.
Fig. 36.15 Flat schematic
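The eight surface types of Table 36.1 can be looked up directly from the signs of G and H; the sketch below is an illustration (the tolerance used to decide "equal to zero" and the function name are assumptions).

```python
def classify_surface(G, H, tol=1e-6):
    """Surface type from the signs of the Gaussian (G) and mean (H) curvature,
    following Table 36.1."""
    sg = 0 if abs(G) < tol else (1 if G > 0 else -1)
    sh = 0 if abs(H) < tol else (1 if H > 0 else -1)
    table = {(-1, -1): 'saddle ridge', (-1, 0): 'minimal',      (-1, 1): 'saddle valley',
             ( 0, -1): 'ridge',        ( 0, 0): 'flat',         ( 0, 1): 'valley',
             ( 1, -1): 'peak',         ( 1, 0): 'not possible', ( 1, 1): 'pit'}
    return table[(sg, sh)]

print(classify_surface(G=0.0, H=0.4))    # 'valley'
print(classify_surface(G=-0.2, H=-0.1))  # 'saddle ridge'
```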
Surface classification A classification description of surfaces based on a combined sign analysis of the Gaussian curvature and the mean curvature. Let the Gaussian curvature be denoted G and the mean curvature H; then eight surface types can be obtained according to their signs. See Table 36.1.
Mean and Gaussian curvature shape classification A classification method for local (very small) surface patches (often only one pixel in depth images). The method is based on the signs of the mean curvature and the Gaussian curvature of the surface patches. The goal is to assign each surface patch to one of a set of simple surface shape types. Standard surface shape types include plane, concave cylindrical (valley), convex cylindrical (ridge), concave ellipsoid (pit), convex ellipsoid (peak), saddle valley, saddle ridge, and minimal. See surface classification.
Flat A surface type obtained by surface classification. It corresponds to the case where both the Gaussian curvature and the mean curvature are equal to 0. Refer to Fig. 36.15.
Valley 1. A surface type in surface classification. It corresponds to the case where the Gaussian curvature is equal to 0 and the mean curvature is greater than 0. See Fig. 36.16.
Fig. 36.16 Valley schematic
Fig. 36.17 Ridge schematic
2. A dark, slender object in a grayscale image, so named because it corresponds to a valley when the image is viewed as a 3-D surface, i.e., as an elevation map in which brightness is a function of image position.
Ridge 1. Spine: A special discontinuity type of the brightness function that includes a change from dark to light and then back to dark, which results in thin edges and line segments. See step edge, roof edge, and edge detection. 2. Spine surface: A surface type obtained by surface classification. It corresponds to the case where the Gaussian curvature is equal to 0 and the mean curvature is less than 0. Refer to Fig. 36.17 for a schematic.
Fig. 36.18 Pit schematic
Ridge detection A class of algorithms used to detect ridges in an image, mainly including edge detectors and line detectors.
Pit 1. A surface type obtained by surface classification. It corresponds to the case where both the Gaussian curvature and the mean curvature are greater than 0. See Fig. 36.18. 2. A general term indicating that a pixel value is less than the values of the neighboring pixels around it; for example, a darker spot observed in a brighter background in the image corresponds to a pit. Compare peak. 3. A spot-shaped depression on a surface.
Peak 1. A surface type obtained by surface classification. It corresponds to the case where the Gaussian curvature is greater than 0 and the mean curvature is less than 0. A schematic diagram is shown in Fig. 36.19. 2. See top surface. 3. A general term indicating that a pixel value is greater than the surrounding pixel values; for example, a brighter spot observed in a darker background in the image corresponds to a peak.
Saddle valley A surface type obtained by surface classification, shown in Fig. 36.20. It corresponds to the case where the Gaussian curvature is less than 0 and the mean curvature is greater than 0.
Fig. 36.19 Peak schematic
Fig. 36.20 Saddle valley schematic
Saddle ridge A surface type obtained by surface classification, shown in Fig. 36.21. It corresponds to the case where both the Gaussian curvature and the mean curvature are less than 0.
Minimal A surface type obtained by surface classification. It corresponds to the case with a Gaussian curvature less than 0 and a mean curvature equal to 0. See Fig. 36.22.
Minimal point A class of points on a hyperbolic surface where the two principal curvatures are equal in magnitude but have opposite signs, i.e., K1 = –K2.
Fig. 36.21 Saddle ridge schematic
Fig. 36.22 Minimal surface schematic
Cylindrical surface region
A region forming part of a cylindrical surface. Equivalently, a region in which the Gaussian curvature is zero and the mean curvature is nonzero at every point. See surface classification.
Cylinder extraction
Method for determining a cylinder and its constituent data points from 2.5-D and 3-D images obtained by sampling a 3-D cylinder.
Cylinder patch extraction
Given a depth image or a set of 3-D data points, the task of finding the (often connected) set of points lying on the surface of a cylinder and, in general, determining the equation of that cylinder. This process is useful for detecting and modeling pipelines in depth images of industrial scenes.
36.2.6 Curvature and Classification
Curvature sign patch classification
Classification (method) of a surface based on the signs of the local mean and Gaussian curvatures, or on the principal curvature sign class. See mean and Gaussian curvature shape classification.
Curvature segmentation
See mean and Gaussian curvature shape classification.
Patch classification
In shape classification, the assignment of a surface patch (region) to a specific category; curvature estimates or shading methods are often used to compute dense depth data for this purpose. See curvature sign patch classification and mean and Gaussian curvature shape classification.
Principal curvature sign class
See mean and Gaussian curvature shape classification.
Principal curvature
The maximum or minimum normal curvature obtained at a surface point along a principal direction. As shown in Fig. 36.23, the two principal curvatures and their directions completely specify the local shape of the surface. If the surface is a cylinder of radius r, the principal curvature along the cylinder axis (T2 direction) is 0, and the principal curvature perpendicular to the axis (T1 direction) is 1/r.
Fig. 36.23 Principal curvatures
Principal direction
The direction in which the normal curvature takes an extreme value, that is, the direction of a principal curvature. Each point on a 3-D curved surface can have many normal curvature values. The direction of maximum curvature and the direction of minimum curvature are both called principal directions, and they are orthogonal to each other. The two principal curvatures and the two principal directions together completely determine the local shape of the surface. Referring to Fig. 36.23 (see principal curvature), T1 and T2 denote the two principal directions, and the surface normal at the point p is orthogonal to both T1 and T2. For any surface, you can always determine at least one direction of greatest curvature (such as T1) and at least one direction of smallest curvature (such as T2).
Fig. 36.24 Koenderink's surface shape classification diagram (shape index S from –1 to 1)
For relatively flat surfaces, there may be several directions of maximum and minimum curvature at some points, in which case any of them may be chosen. Let the curvature magnitudes in the two principal directions be K1 and K2, respectively; the Gaussian curvature is then

G = K1 K2

If the surface is locally elliptical, the Gaussian curvature is positive; if the surface is locally hyperbolic, the Gaussian curvature is negative. The mean curvature can also be calculated from K1 and K2:

H = (K1 + K2)/2

H determines whether the surface is locally convex (mean curvature negative) or concave (mean curvature positive).
Koenderink's surface shape classification
A classification scheme different from the commonly used 3-D surface shape classification based on mean curvature and Gaussian curvature. The two intrinsic shape parameters are recombined into two new parameters: one parameter S represents the local shape type of the surface (hyperbolic, cylindrical, spherical, or planar), and the other parameter C represents the degree of bending of the shape. Some shape categories in Koenderink's surface shape classification scheme are shown in Fig. 36.24.
Normal curvature
A term describing the local characteristics of a surface. For a curved surface, the intersection of the surface with a plane containing the surface normal n at a point p gives a plane curve C passing through p. The normal curvature is the curvature of the curve C at the point p. The cutting plane can have any orientation about the surface normal. In Fig. 36.25, the plane containing the normal n at the point p intersects the curved surface in the curve C.
Shape index
A shape descriptor that distinguishes the type of surface shape according to the principal curvatures of a surface patch in the scene, usually denoted by S. Its formula is
Fig. 36.25 The plane where the normal is located intersects the surface
S = (2/π) arctan[(kM + km) / (kM − km)]

Among them, kM and km represent the maximum and minimum principal curvatures, respectively. For a planar patch, S is not defined. A parameter R related to the shape index, called the degree of curvature (curvedness), can be measured for a surface patch by

R = [(km² + kM²) / 2]^(1/2)

All curvature-based shape classes can be mapped to a unit circle in the R-S plane, with the planar patch at the origin.
Shape magnitude class
Part of a local surface curvature representation scheme in which each point has a surface class and a curvature magnitude. This representation is a more common alternative to shape classification based on the two principal curvatures, or on the mean curvature and Gaussian curvature.
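A short sketch of the shape index S and curvedness R defined above, computed from the two principal curvatures; the function name is an illustrative choice, not a fixed API:

```python
import math

def shape_index_and_curvedness(k_max, k_min):
    """Shape index S = (2/pi)*arctan((kM + km)/(kM - km)) and
    curvedness R = sqrt((km**2 + kM**2)/2) for a surface patch."""
    if k_max < k_min:
        k_max, k_min = k_min, k_max                 # enforce kM >= km
    R = math.sqrt((k_min ** 2 + k_max ** 2) / 2.0)
    if k_max == 0.0 and k_min == 0.0:
        return None, R                              # planar patch: S is not defined
    S = (2.0 / math.pi) * math.atan2(k_max + k_min, k_max - k_min)
    return S, R

print(shape_index_and_curvedness(1.0, -1.0))   # symmetric saddle: S = 0.0, R = 1.0
print(shape_index_and_curvedness(1.0, 1.0))    # spherical cap:    S = 1.0, R = 1.0
```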
36.2.7 Various Surfaces
Surface class
Koenderink's classification of local surface shapes according to two functions of the two principal curvatures. These two functions are the shape index S

S = (2/π) tan⁻¹[(KM + Km) / (KM − Km)]

and the curvedness C
C = [(KM² + Km²) / 2]^(1/2)

where KM > Km are the two principal curvatures. Surface categories include planes (C = 0), hyperboloids (|S| < 3/8), ellipsoids (|S| > 5/8), and cylindrical surfaces (3/8 < |S| < 5/8). They can be further divided into concave (S < 0) and convex (S > 0).
Freeform surface
A surface that does not follow any particular mathematical form. For example, the surface of a piece of cotton material can take any shape.
Smooth surface
A commonsense term describing the characteristics of a surface, meaning that derivatives of all orders are defined at each point on the surface (i.e., a C∞ surface). In practice, C2 continuity, that is, having at least second derivatives at all points, is taken as smoothness.
Smooth region
1. A uniform region in a grayscale image with neither edges nor texture.
2. In stereo vision, a region with relatively uniform gray levels and no obvious features. In such a region, the gray values in matching windows at different positions are close or even identical, so the matching position cannot be determined. This kind of mismatching caused by smooth gray-level areas is unavoidable in binocular stereo matching.
Tensor product surface
A parametric model of a curved surface, commonly used in computer modeling and graphics applications. The shape of the surface is defined by the product of two polynomials (often cubic) in independent surface coordinates.
Implicit surface
A three-dimensional representation method. An indicator function (characteristic function) F(x, y, z) indicates which 3-D points are inside the object (F(x, y, z) < 0) and which are outside the object (F(x, y, z) > 0). The surface is represented as the collection of points where the function has the value zero. For example, the equation of a sphere of radius r centered at the origin is x² + y² + z² = r², which can be written via the function f(x, y, z) = x² + y² + z² − r². The set of points satisfying f(x, y, z) = 0 is an implicit surface.
Indicator function
See implicit surface.
Subdivision surface
An explicit representation of a surface, which can not only generate a highly detailed model but also support various processing operations, such as interpolation, smoothing, simplification, and sampling.
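A small illustration of the implicit surface and indicator function idea described above, using the sphere f(x, y, z) = x² + y² + z² − r²; the function name is illustrative only:

```python
def sphere_indicator(x, y, z, r=1.0):
    """Implicit function of a sphere of radius r centred at the origin.
    f < 0: inside the object; f > 0: outside; f = 0: on the implicit surface."""
    return x * x + y * y + z * z - r * r

for p in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 1.0)]:
    f = sphere_indicator(*p)
    side = "inside" if f < 0 else ("on surface" if f == 0 else "outside")
    print(p, side)
```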
Netted surface
The surface of an object in 3-D space composed of a set of polygons (a polyhedral mesh). Suppose the mesh surface consists of a set of vertices (V), edges (B), and faces (F). Then the genus G of the mesh surface can be calculated by the following formula:

2 − 2G = V − B + F

Quadric
A surface defined by a second-order polynomial. See conic.
Quadric surface model
A model that uses a second-order polynomial in 3-D space (XYZ) to describe a quadric. It has ten parameters and can describe cylindrical, ellipsoidal, parabolic, and hyperbolic surfaces. Its general formula can be written as

Σ_{i,j,k = 0,1,2} a_ijk x^i y^j z^k = 0
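A hedged sketch of evaluating the ten-parameter quadric surface model above; the coefficient keys (i, j, k exponent triples) are an ordering assumed here for illustration:

```python
def quadric_value(coeffs, x, y, z):
    """Evaluate F(x, y, z) = sum of a_ijk * x**i * y**j * z**k over the ten
    second-order terms; F = 0 on the quadric surface."""
    monomials = {
        (2, 0, 0): x * x, (0, 2, 0): y * y, (0, 0, 2): z * z,
        (1, 1, 0): x * y, (0, 1, 1): y * z, (1, 0, 1): x * z,
        (1, 0, 0): x,     (0, 1, 0): y,     (0, 0, 1): z,
        (0, 0, 0): 1.0,
    }
    return sum(coeffs.get(ijk, 0.0) * m for ijk, m in monomials.items())

# Unit sphere: x^2 + y^2 + z^2 - 1 = 0
sphere = {(2, 0, 0): 1.0, (0, 2, 0): 1.0, (0, 0, 2): 1.0, (0, 0, 0): -1.0}
print(quadric_value(sphere, 1.0, 0.0, 0.0))   # 0.0, so this point lies on the quadric
```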
Quadratic variation
1. Any function that can be modeled with a second-order polynomial (such functions represent changes in some variables).
2. A specific measure of surface shape deformation, computed for a surface f(x, y) as

f_xx² + 2 f_xy² + f_yy²

This measure can be used to constrain the smoothness of a reconstructed surface.
Quadric patch
A quadric defined over a limited range of the independent variables or parameters. For example, in depth image analysis, a portion of the depth surface that can be well described by a quadric.
Quadric patch extraction
A set of algorithms used to identify a portion of a surface that can be approximated by a quadric patch. The techniques used are similar to those used for conic curve fitting. See surface fitting and least squares surface fitting.
Absolute quadric
The quadric in 3-D projective space. It can be represented by a 4 × 4 matrix of rank 3 (I3 is the 3 × 3 identity matrix, O3 is a 3 × 1 zero vector):

Q = [ I3   O3 ]
    [ O3ᵀ  0  ]

Similar to the absolute conic, in order to be unchanged under Euclidean transformations and rescaled under similarity transformations, it takes the following form under an affine transformation (A is a 3 × 3 orthogonal matrix):

Q = [ AᵀA  O3 ]
    [ O3ᵀ  0  ]

Under a projective transformation, it becomes an arbitrary 4 × 4 matrix of rank 3.
Hyperquadric
A class of three-dimensional representation methods that includes the superquadric surfaces. The hyperquadric surface model can describe any convex polyhedron.
Superquadric
A closed surface spanned by a coordinate vector whose x, y, and z components are functions of the angles ϕ and θ, obtained as the spherical product of two 2-D parameter curves (which follow from the basic trigonometric formulas). The spherical product of h and m is

h ⊗ m = [ a h1(ϕ) m1(θ) ]
        [ b h1(ϕ) m2(θ) ]
        [ c h2(θ)        ]

where (a, b, c)ᵀ is the scaling vector. Also refers to a type of geometry that can be seen as a generalization of the basic quadrics, giving an implicit function approach for modeling 3-D objects. The superellipsoid is the form most commonly used in image engineering; its parametric equation is

[(x/Ax)^(2/sw) + (y/Ay)^(2/sw)]^(sw/sj) + (z/Az)^(2/sj) = 1

Among them, Ax, Ay, and Az are the superquadric surface dimensions in the X, Y, and
Z directions; sw is the square parameter of the latitude plane and sj is the square parameter of the longitude plane, and their values control the degree of squareness of the model. The square parameters take values in the interval [0, 2]: values close to 0 give nearly square shapes, and values close to 2 give nearly triangular shapes; for values in this interval the shapes are all convex. In order to model a wider variety of shapes, other rigid or non-rigid deformations are often combined with superquadric surfaces. The superquadric is also a 3-D generalization of the superellipse, with solution set expressed as

|x/a|^α + |y/b|^β + |z/c|^γ = 1

As with the superellipse, it is not easy to fit 3-D data with a superquadric surface, although there are some successful examples.
Torus
1. The volume swept by moving a ball along a circular path in 3-D space.
2. The surface of the above volume.
Sphere
1. Spherical surface: A surface in a space of any dimension, defined by the points x satisfying ||x – c|| = r, where c is the center and r is the radius of the sphere.
2. Sphere: The solid enclosed by the above surface, defined by the points x satisfying ||x – c|| ≤ r.
Superellipse
A class of 2-D curves of which the ellipse and the Lamé curve are typical special cases. The general form of a superellipse can be expressed as

|x/a|^α + |y/b|^β = 1

It is difficult to fit data with a superellipse, because the dependence of the shape on the parameters α and β is strongly nonlinear. Figure 36.26 shows two superellipses, one convex with α = β = 3 and the other concave with α = β = 1/2.
Fig. 36.26 Two superellipses
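A sketch of the superellipse and superellipsoid inside-outside tests discussed above. The superellipsoid is written here in the standard Barr form with squareness parameters sw (latitude) and sj (longitude), which is one common way of stating the formula above; function names are illustrative:

```python
def superellipse(x, y, a, b, alpha, beta):
    """|x/a|**alpha + |y/b|**beta; equals 1 on the superellipse."""
    return abs(x / a) ** alpha + abs(y / b) ** beta

def superellipsoid(x, y, z, Ax, Ay, Az, sw, sj):
    """Inside-outside function of a superellipsoid with scales Ax, Ay, Az and
    squareness parameters sw and sj: 1 on the surface, < 1 inside, > 1 outside."""
    xy = abs(x / Ax) ** (2.0 / sw) + abs(y / Ay) ** (2.0 / sw)
    return xy ** (sw / sj) + abs(z / Az) ** (2.0 / sj)

print(superellipse(1.0, 0.0, a=1.0, b=1.0, alpha=3, beta=3))            # 1.0 (on the curve)
print(superellipsoid(0.0, 0.0, 2.0, 1.0, 1.0, 2.0, sw=1.0, sj=1.0))     # 1.0 (on the surface)
```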
36.3 3-D Surface Construction
The basic unit in a 3-D image is a voxel. If the contour voxels of a 3-D object have a certain gray value, then these voxel points will constitute an iso-surface, which is the interface between this object and backgrounds or other objects (Sonka et al. 2014; Zhang 2017c).
36.3.1 Surface Construction
Surface mesh
A surface boundary representation in which the faces are usually planar and the edges straight. Such representations are often combined with efficient data structures such as the winged-edge and quad-edge structures so that various geometric and topological properties can be computed quickly. Hardware acceleration of polygon drawing is a feature of many computers.
Moving least squares [MLS]
In point-based surface modeling, a method of constructing a continuous 3-D surface based on a local estimate of the sampling density.
Surface tiling
The process and method of obtaining a polygon collection representation (including vertices, edges, and faces) of an object surface from volumetric data. Iso-surface construction and representation techniques are often used here, that is, the voxels on the contours of all objects having a certain gray value are combined to form the boundary surface between the object and other objects or the background.
Surface triangulation
See surface mesh.
Triangulation
The process or technique of determining the coordinates (x, y, z) of a point in the original 3-D space from its positions in two (or more) perspective projections. It is assumed here that both the perspective center and the perspective projection plane are known. See Delaunay triangulation, surface triangulation, stereo triangulation, and structured light triangulation. This is the inverse of the pose estimation problem. The easiest way to solve it is to determine the 3-D point p that is closest to all the 3-D rays corresponding to the matched 2-D feature positions {xi} observed by the cameras (with camera matrices Pi = Ki[Ri | ti], where ti = −Ri ci and ci is the center of the i-th camera). As shown in Fig. 36.27 (subscripts L and R represent the left and right camera systems, respectively), these rays start from ci and follow the directions vi = N(Ri⁻¹ Ki⁻¹ xi), where N(·) denotes normalization to unit length, so a ray can be written as ci + di vi.
Fig. 36.27 Triangulation schematic
The point on ray i closest to p is denoted qi; it minimizes the distance ||ci + di vi − p||². The minimum is attained at di = viᵀ(p − ci), so that

qi = ci + vi viᵀ (p − ci) = ci + (p − ci)∥

Thus, the squared distance between p and qi is

ri² = ||(I − vi viᵀ)(p − ci)||² = ||(p − ci)⊥||²

Finding the point p closest to all rays is therefore a least squares problem, solved by minimizing the sum of all the ri²:

p = [Σi (I − vi viᵀ)]⁻¹ [Σi (I − vi viᵀ) ci]
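A compact NumPy sketch of the least-squares formula above, given camera centres c_i and ray directions v_i; the function name is an illustrative choice:

```python
import numpy as np

def triangulate_rays(centers, directions):
    """Least-squares 3-D point closest to all rays c_i + d * v_i."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, v in zip(np.asarray(centers, float), np.asarray(directions, float)):
        v = v / np.linalg.norm(v)          # make v_i a unit vector
        M = np.eye(3) - np.outer(v, v)     # projector orthogonal to v_i
        A += M
        b += M @ c
    return np.linalg.solve(A, b)

# Two rays that intersect at (1, 1, 0)
c = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
v = [[1.0, 1.0, 0.0], [-1.0, 1.0, 0.0]]
print(triangulate_rays(c, v))              # approximately [1. 1. 0.]
```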
Triangulated model See surface mesh. Planar facet model See surface mesh. Surface patch A limited surface region, generally smaller. Planar patches See surface triangulation. Planar patch extraction Techniques for finding planar regions or slices, most of which use depth images. This technique is useful in 3-D pose estimation because there are some model-based
matching techniques that give more accurate results for planar than non-planar surfaces. Facet model-based extraction Feature information extraction from range data based on facet model. Here, facets refer to small and simple surface areas (see the planar facet model). See planar patch extraction.
36.3.2 Construction Techniques
Surface tiling from parallel planar contour
Method for interpolating the parallel profiles of a 3-D object to obtain the object surface. It can be regarded as extracting a surface tiling grid from plane contours represented as vectors. In practice, the parallel profiles are often the object contours detected in 2-D images acquired layer by layer. The tiling of parallel section contours is generally achieved by interpolating between contours with (triangular) surface elements; the problem can be described as tiling the outer surface between two adjacent parallel polygons with a series of triangles in 3-D space. There are two main steps: the first is to determine a pair of initial vertices, one from each of the two adjacent polygons, which form one edge of a triangle; the second is to pick, based on the known pair, the next adjacent vertex that completes the triangle. The second step is repeated, constructing triangle after triangle, until a closed contour surface is formed. Figure 36.28 shows the surface of a 3-D circle obtained by surface tiling two parallel profile contours.
Fig. 36.28 Surface of a circle obtained from two parallel profile contours
Contour interpolation and tiling
See surface tiling from parallel planar contour.
Tiling
An operation in contour interpolation and tiling. Triangular meshes are often used to build surfaces that cover two corresponding contours in adjacent planes. The basic idea is to generate an optimized set of triangular patches, according to some criterion, that approximates the object surface. In a more general sense, arbitrary surfaces can also be used to fit the surface between corresponding contours; in that case a parametric surface representation is often used to obtain higher-order
continuity. Generally, it also refers to cover the object surface with a polygon (conventionally a regular polygon). Corresponding A problem or step in contour interpolation and tiling. Includes two levels. First, if there is only one contour in both planes, the correspondence problem only involves determining the correspondence and corresponding points in the adjacent plane contours. Second, if there are more than one contour in both planes, the problem is more complicated. At this time, to determine the corresponding contours in different planes, not only the local characteristics of the contours but also the global characteristics of the contours need to be considered. Branching The contour interpolation and tiling occurs when the contour of a 3-D target is divided into at least two from one plane to an adjacent plane. In general, the contour correspondence when a branch occurs cannot be determined only by the local information at the branch. It is often necessary to consider the overall geometric information and topological relationship of the contour. Wireframe representation A 3-D geometric representation by means of representing vertices and edges connecting vertices. It does not include a description of the surface between the edges and especially does not include information to remove hidden lines. Figure 36.29 shows a wireframe representation of a car. Wireframe The closed outline of an object or region. Also refers to a representation of the entity, in which the object is displayed in multiple layers, each layer showing only its edge point set with line segments. Wireframe representation is an approximation method that uses a set of outer contour lines (generally polygons) to represent a 3-D object. This representation is often used to store object models that need to be matched or recognized by the visual system.
Fig. 36.29 Wireframe representation of a car
Fig. 36.30 Cubes with eight voxels as vertices
Weaving wall
A term for a group of connected edges in a set of space-time objects. When reconstructing a surface from occluding contours, edges are first extracted and linked in each input image and then matched across successive images using the pairwise epipolar geometry. The result is a woven wall.
March cube [MC] algorithm
A surface tiling algorithm that can be used to determine the iso-surface of an object in a 3-D image. Consider a cube whose vertices are eight voxels, as shown in Fig. 36.30a, where black voxels represent foreground voxels and white voxels represent background voxels. This cube has six cubes that are 6-adjacent to it (up, down, left, right, front, back). If all eight vertex voxels of the cube belong to the foreground, or all belong to the background, the cube is an internal cube. If some of the eight vertex voxels belong to the foreground and some belong to the background, the cube is a boundary cube. The iso-surface to be represented lies within the boundary cubes. For the cube of Fig. 36.30a, the iso-surface can be the shaded rectangle in Fig. 36.30b. The march cube algorithm examines the voxels of each cube as it travels from one cube to an adjacent cube through the image, determines the boundary cubes that intersect the object surface, and also determines the intersections. This process is performed for each cube. Depending on how the object surface intersects a cube, the part of the surface inside the cube is generally curved, but it can be approximated by a polygonal patch. Each such patch is easily decomposed into a set of triangles, and the triangles thus obtained can in turn be composed into a triangular mesh of the object surface to reconstruct the surface.
Marching cubes [MC]
A method of locating a surface using voxel data by determining a polygonal approximation to the surface from the voxel representation. Each vertex of a voxel cell is assigned a binary value depending on whether it is inside or outside the object, and a set of eight vertices forming a virtual cube defines a polygonal surface. Since there are only 256 possible configurations of the 8 vertices in each group, the algorithm is very efficient. Given a function f(·) defined on the voxels, the algorithm estimates the position of the surface f(x) = c for a constant value c. This is equivalent to determining whether and where the surface intersects the 12 edges of a voxel.
Fig. 36.31 Each cube is broken down into five tetrahedrons
Fig. 36.32 Two options (even and odd schemes) for breaking cubes into tetrahedrons
Many implementations propagate this operation from one voxel to its neighboring voxels; this is where the word "march" comes from. See march cube algorithm.
Gradient consistency heuristics
A method for resolving the ambiguous configurations that can arise in the marching cubes layout. The average gradient at the four corner points of an ambiguous face is used to estimate the gradient at the center of that face; the topology chosen for the ambiguous face is determined by the direction of this gradient and corresponds to one of the possible layouts.
Wrapper algorithm
A surface tiling method, also called the marching tetrahedral algorithm, whose goal is to obtain the outer surface of a 3-D object. In this algorithm, each cube is first divided into five tetrahedrons, as shown in Fig. 36.31. Four of the tetrahedrons have edges of the same length, and the fifth (the rightmost one in the figure) has faces of the same size. Voxels belonging to the object are regarded as inside it, and voxels not belonging to it as outside; the voxels at the black-and-white junctions in the figure constitute the outer surface of the 3-D object. There are two ways to decompose a cube into a set of tetrahedrons (see Fig. 36.32), which can be called the odd and even schemes, respectively. The decomposition of the voxel grid alternates between the even and odd phases so that the tetrahedrons in adjacent cubes match each other, so as to finally obtain a consistent surface.
Fig. 36.33 Three cases of surface intersection (one, two, or three black vertices)
This can avoid the ambiguous layout problems easily produced by the march cube algorithm. After breaking each cube into a set of tetrahedrons, the algorithm determines whether the object surface intersects each tetrahedron. Note that each tetrahedron involves four voxels. If all four voxels of a tetrahedron are inside the object, or none of them is, the object surface does not intersect that tetrahedron, and it can be disregarded in subsequent processing. Finally, for each tetrahedron that does intersect the object surface, the algorithm estimates the boundary where the surface crosses each face of the tetrahedron (all of which are polygons). The intersection point on the edge connecting each pair of vertices can be obtained approximately by linear interpolation between the values at the two end vertices. Based on these intersection points, the vertices of the tiled surface can be determined, as shown by the black dots in Fig. 36.33. The tiled surface pieces are the shaded planes, and the orientation of each piece is indicated by an arrow. Orientation helps distinguish the inside and outside of each tiled surface; it is agreed here that, viewed from outside the tetrahedron, the orientation is counterclockwise. To keep the object's topology consistent, this convention should be adopted throughout the surface grid.
Marching tetrahedral algorithm [MT algorithm]
Same as wrapper algorithm.
Marching tetrahedra
See marching tetrahedral algorithm.
Tetrahedral spatial decomposition
A method of decomposing 3-D space into a set of (space-filling) tetrahedrons instead of the usual rectangular voxels. Tetrahedral spatial decomposition allows a tetrahedron to be decomposed iteratively into eight smaller tetrahedrons. In Fig. 36.34, a tetrahedron is broken down into eight smaller tetrahedrons, one of which is shaded. See also the method for decomposing a cube into five tetrahedrons in the wrapper algorithm.
Fig. 36.34 Tetrahedral spatial decomposition
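Both the marching cubes and marching tetrahedra schemes place surface vertices on cell edges by linear interpolation of the sampled function between the two edge endpoints; a minimal sketch (names are illustrative) is:

```python
def interpolate_crossing(p1, p2, f1, f2, c):
    """Point where the iso-surface f = c crosses the edge from p1 to p2,
    assuming f1 = f(p1) and f2 = f(p2) lie on opposite sides of c."""
    if f1 == f2:
        raise ValueError("edge endpoints have equal values; no unique crossing")
    t = (c - f1) / (f2 - f1)                     # fraction of the way from p1 to p2
    return tuple(a + t * (b - a) for a, b in zip(p1, p2))

# Iso-value 0.5 crossing an edge whose endpoint values are 0.0 and 1.0
print(interpolate_crossing((0, 0, 0), (1, 0, 0), 0.0, 1.0, 0.5))   # (0.5, 0.0, 0.0)
```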
36.4 Volumetric Representation
For the vast majority of objects in the real world, only their surfaces can be seen, yet these objects are actually 3-D entities (Shapiro 2001). These entities can be represented in a variety of ways depending on the specific application (Bishop 2006; Davies 2012; Forsyth and Ponce 2012; Hartley and Zisserman 2004).
36.4.1 Volumetric Models Volumetric models A model for representing 3-D objects. It is a model for 3-D solid representation. Common volume models include spatial occupancy arrays, cell decomposition (quadtree, octree), geometric models (rigid body models), generalized cylinders, etc. They directly/explicitly express the outer surfaces and inner parts of a 3-D object. Compare surface models. Volumetric primitive The basic elements of building a 3-D object. For example, in the generalized cylinder model, various penetrating axes and various cross sections are volume primitives that form a generalized cylinder. Theoretically, their combination can construct various 3-D objects, and there are many ways to combine them. Volume 1. A region in 3-D space. A subset of R3. A (possibly infinite) set of points. 2. A 3-D space surrounded by a closed surface. Unit ball An n-D sphere with a radius of 1. Generally, n is 3; if n is 2, it is a projection of 3-D sphere—circle; if n is greater than 3, it is a hypersphere.
Ellipsoid
A 3-D solid whose section by any intersecting plane is an ellipse (including a circle). It comprises the set of all points satisfying

x²/a² + y²/b² + z²/c² = 1

where (x, y, z) are the coordinates of the points. An ellipsoid is a basic shape unit that can be combined with other basic shape units to represent complex objects and scenes.
Volume skeletons
The skeleton of a 3-D point set, obtained by extending the definition of the skeleton of a 2-D curve or region.
Geometric model
A solid representation technology suitable for representing rigid bodies. Also called rigid body model. The model can be 2-D (such as a multi-curve) or 3-D (such as a surface-based model). For example, constructive solid geometry is a typical geometric model that can describe the geometric shape of some objects or scenes.
Rigid body model
Same as geometric model.
Solid model
Same as geometric model.
Constructive solid geometry [CSG]
A geometric model that defines and represents a 3-D shape with a mathematically definable set of basic shapes. In the constructive solid geometry representation system, rigid bodies of more complex shape are represented as combinations of simpler rigid bodies through a group of set operations (such as the union, intersection, and difference operations of Boolean set theory). Figure 36.35 shows a solid constructed by cutting a cylinder out of a block.
Fig. 36.35 Entity construction example (a block minus a cylinder)
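One way to realize the set operations of constructive solid geometry is with implicit membership tests. The sketch below is an illustration assumed for this entry (not a particular CSG library) and builds the "block minus cylinder" solid of Fig. 36.35:

```python
def in_block(p, half=1.0):
    x, y, z = p
    return max(abs(x), abs(y), abs(z)) <= half          # axis-aligned cube of side 2*half

def in_cylinder(p, r=0.5):
    x, y, z = p
    return x * x + y * y <= r * r                       # infinite cylinder along the z axis

def in_difference(p):
    """CSG difference: points in the block but not in the cylinder (Fig. 36.35)."""
    return in_block(p) and not in_cylinder(p)

print(in_difference((0.8, 0.8, 0.0)))   # True: inside the block, outside the cylinder
print(in_difference((0.0, 0.0, 0.0)))   # False: removed by the cylinder
```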
Set theoretic modeling See constructive solid geometry. Non-rigid model representation A model representation in which the shape of a model can be changed with the extension of few parameters. This type of model is useful for models with variable shapes, such as pedestrians. The shape change can be over time, or between different instances. The change in apparent shape comes from perspective projection and non-corresponding observation points. In contrast, rigid body models always give the same shape, regardless of the observer’s perspective. Articulated object An object consisting of a set of (usually) rigid elements or subassemblies connected by joints. The components can be arranged in different configurations. The human body is a typical example (at this time, each limb part is represented by a skeleton). Articulated object segmentation Methods for obtaining articulated objects from 2-D or 3-D data. Articulated object model A representation of an articulated object. It includes both the parts of the object, as well as the relative relationship and range of motion between the parts (typically joint angles). Part segmentation A type of technology that divides a set of data into object components, each component has its own identity (such as dividing the human body into head, body, and limb). There are two methods for segmenting the brightness image and the depth image. Many geometric models have been used for components such as generalized cylinders, superellipses, and superquadric surfaces. See articulated object segmentation. Cell decomposition A general form of stereoscopic representation technology that can be viewed as domain decomposition in 2-D space. The basic idea is to gradually decompose objects until they are decomposed into units that can be represented uniformly. Both spatial occupancy arrays and octrees can be considered special cases of cell decomposition. Quasi-disjoint The nature of cells obtained from cell decomposition can have different complex shapes but do not share volume. Glue The only combination operation that can be performed on the 3-D solid unit after the object is decomposed. Because different units do not share volume, they can be combined into larger entities.
Fig. 36.36 An octree example (a set of voxels and the corresponding tree representation)
Fig. 36.37 Graph representation of a three-level octree
Octree [OT]
A hierarchical data structure and basic solid representation technology. It is the 3-D generalization of the quadtree representation used in 2-D space. When represented as a graph, each non-leaf node can have up to eight child nodes. An example of an octree is shown in Fig. 36.36. It is a volume representation in which 3-D space is iteratively decomposed into eight smaller volumes using planes parallel to the coordinate planes XY, YZ, and XZ. Connecting every eight child volumes with their parent volume constitutes a tree representation. When a volume contains only one object or is empty, there is no need to decompose it further; in this way, the representation is more efficient than a pure voxel representation. Figure 36.37 shows the graph representation of a three-level octree: the left diagram is the highest level, each of whose octants expands into the middle diagram, and each octant of that in turn expands into the right diagram.
Irregular octree
Similar to an irregular quadtree, but split in 3-D space.
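A hedged sketch of the octree idea: a cubic region is subdivided into eight octants until each node is uniformly inside or outside the object. The occupancy test passed in (and the corner-based test used in the example, which is only an approximation) are assumptions made for illustration:

```python
import itertools

def build_octree(occupied, origin, size, max_depth):
    """occupied(origin, size) must return 'full', 'empty' or 'mixed' for the cube
    with the given corner and edge length. Returns a nested-dict octree."""
    state = occupied(origin, size)
    if state != "mixed" or max_depth == 0:
        return {"origin": origin, "size": size, "state": state}
    half = size / 2.0
    children = []
    for dx, dy, dz in itertools.product((0, 1), repeat=3):
        child_origin = (origin[0] + dx * half,
                        origin[1] + dy * half,
                        origin[2] + dz * half)
        children.append(build_octree(occupied, child_origin, half, max_depth - 1))
    return {"origin": origin, "size": size, "state": "mixed", "children": children}

# Example occupancy test: a ball of radius 0.4 centred at (0.5, 0.5, 0.5),
# judged from the cube corners only (an approximation).
def ball_occupancy(origin, size):
    corners = [(origin[0] + i * size, origin[1] + j * size, origin[2] + k * size)
               for i, j, k in itertools.product((0, 1), repeat=3)]
    inside = [sum((c - 0.5) ** 2 for c in p) <= 0.16 for p in corners]
    return "full" if all(inside) else ("empty" if not any(inside) else "mixed")

tree = build_octree(ball_occupancy, (0.0, 0.0, 0.0), 1.0, max_depth=3)
print(tree["state"])   # 'mixed' at the root
```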
36.4.2 Volumetric Representation Methods 3-D object The collection of related points in 3-D space often forms a solid, and the outer surface forms its outline. Objects in the objective world can be regarded as 3-D objects. In image engineering, it also refers to three-dimensional objects of interest in 3-D images. 3-D object representation 3-D promotion of object representation. What is represented is a 3-D object. Stereo representation Same as 3-D object representation. 3-D shape representation Representation of 3-D object shape in 3-D space. 3-D shape descriptor Descriptor describing the shape of the object in 3-D space. 3-D object description 3-D promotion of object description. Described as a 3-D object. 4-D representation Time series representation of 3-D space data, that is, 4-D representation of 3-D space plus 1-D time. For example, using stereo vision to collect video data, the result obtained is 4-D. 4-D approach A solution to a given problem, where time and 3-D spatial information are used or considered. See 4-D representation. Volumetric representation A data structure whereby a part of a 3-D space can be digitized. Typical examples are voxmaps, octrees, and spaces surrounded by surface representations. A method of representing a three-dimensional object as a collection of 3-D entities (rather than using only visible surfaces). In other words, for the vast majority of objects in the real world, although they can usually only be seen their surface, they are actually 3-D entities, and their full representation requires the use of entities. Representation of 3-D data or deep scan data. Can be used for voxel coloring, space carving, etc. Spatial occupancy array 1. For 2-D images, a basic class of region decomposition methods. At this time, each array cell corresponds to a pixel. If a pixel is in a given region, the cell value is set to 1; if a pixel is outside the given region, the cell value is set to 0. In this way, the set of all points with a value of 1 represents the region to be represented.
2. For 3-D images, a basic solid representation technique. In this case, each array cell corresponds to a voxel. If a voxel is inside the given solid, the cell value is set to 1; if it is outside the given solid, the cell value is set to 0. In this way, the set of all cells with value 1 represents the solid being described.
Spatial occupancy
A representation of an object or scene in which 3-D space is broken down into a grid of voxels. Voxels contained in the object are marked as occupied, and the other voxels (not included in the object) are marked as free space. This representation is very useful in situations where the position of the object is more important than its characteristics (such as robot navigation).
Space carving
1. A generalization of voxel coloring that works even when the cameras surround the scene. A subset of cameras satisfying the voxel coloring constraints is selected iteratively, and carving proceeds alternately along the different axes of the 3-D voxel grid (modifying the model).
2. A method of constructing a 3-D solid model from 2-D images. Starting from a voxel representation in a 3-D spatial occupancy array, voxels that cannot maintain "photo consistency" across the 2-D image collection are removed. The order of voxel elimination is the key to space carving, since it can avoid difficult visibility calculations. See shape from photo-consistency.
Polyhedron
A 3-D solid each of whose faces is planar; it can be regarded as a "3-D polygon." It is used as a primitive in many 3-D modeling schemes because many hardware accelerators can process polygons very quickly. The simplest polyhedron is the tetrahedron, as shown in Fig. 36.38.
Euler theorem for polyhedral object
A theorem describing the relationship between the number of vertices V, the number of edges B, and the number of faces F of a polyhedron. The expression is

V − B + F = 2

For a non-simply connected polyhedral object, it takes the following form:

V − B + F = 2(C − H)

where C is the number of (volume) connected components and H is the number of holes in those connected components.
Fig. 36.38 The simplest polyhedron—tetrahedron
Fig. 36.39 2-D polyhedron schematic
Polyhedral object
A three-dimensional object or scene whose surfaces are planar. It can be seen as a 3-D generalization of a polygon.
Polytope
In an n-D Euclidean vector space, the convex hull of a point set, or the intersection of a finite number of closed half-spaces. A simple example of the two cases in 2-D space is shown in Fig. 36.39: the left picture is the convex hull of eight points, and the right picture shows the intersection (the central part) of four closed half-spaces (all facing the center).
Characteristic view
A general 3-D object representation method; also refers to the method and result of representing an object by multiple views. It aggregates all possible 2-D projections of a convex polyhedral object into a limited number of topological equivalence classes, and different views within each topological equivalence class are related by linear transformations. The views are selected so that small changes in viewpoint do not cause large changes in appearance (such as singular events). Actual objects can have a large number of singular points, so practical methods for generating characteristic views require approximation techniques, such as using only views on a tessellated sphere or representing only views that are essentially stable over a large region of the sphere. See aspect graph and appearance based recognition.
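A quick numerical check of the Euler formula V − B + F = 2(C − H) stated above, using a cube and a tetrahedron (both have one connected component and no holes):

```python
def euler_holds(V, B, F, C=1, H=0):
    """Check V - B + F == 2 * (C - H) for a polyhedral object."""
    return V - B + F == 2 * (C - H)

print(euler_holds(V=8, B=12, F=6))   # cube: 8 - 12 + 6 = 2 -> True
print(euler_holds(V=4, B=6, F=4))    # tetrahedron: 4 - 6 + 4 = 2 -> True
```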
36.4.3 Generalized Cylinder Representation
Generalized cylinder
The generalized cylindrical primitive used in the generalized cylinder representation.
Generalized cones
A type of generalized cylinder in which the cross section changes (increases or decreases) along the through axis.
Fig. 36.40 Variations of generalized cylinder (axis: line or curve; cross-section contour: line or curve; symmetry: rotational reflection symmetry, reflection symmetry, or asymmetry; size: constant, increasing, or increasing/decreasing)
Generalized cylinder representation A general three-dimensional representation technology using a generalized cylin der combination with a certain axis (through axis) and a certain section (cross section) to represent a 3-D object. Also called generalized cylinder model. Various variations of the generalized cylinder can be obtained by combining various changes of the axis and the cross section, as shown in Fig. 36.40. Swept object representation A 3-D object representation scheme. Among them, a 3-D object consists of a 2-D cross section swept along a central axis or trajectory. For example, a brick can be constructed by brushing a rectangle along a straight line. The generalized cylindrical representation (see Fig. 36.40) is a typical scheme that allows changing the size of the cross section and the curvature of the central axis or trajectory. For example, a cone can be obtained by swiping a circumference along a straight axis while linearly increasing/decreasing its radius. Plane sweeping An algorithm that virtually sweeps a plane across a 3-D volume while processing the data encountered. The 2-D version of this algorithm is called line-sweeping technology. Line sweeping A 2-D version of the plane sweeping technique. Ribbon 1. Projection of a basic generalized cylinder from 3-D to 2-D. 2. For the shape of a tubular flat object, the contours of such objects are parallel, such as roads in aerial photos. 3. Often refers to regions with different sizes and large elongation in the image.
Sweep representation
Same as generalized cylinder representation.
Generalized cylinder model
Same as generalized cylinder representation.
Axis
A class of primitives in generalized cylinder representations. It can be a straight line or a curve in 3-D space.
Cross section
1. A measure of the absorption or scattering of radiation; it can be regarded as the effective area of a scattering medium that completely absorbs or scatters the radiation.
2. A class of primitives in generalized cylinder representations. The boundary of the section can be a straight line or a curve; it can be rotationally and reflectively symmetric, only rotationally or only reflectively symmetric, or asymmetric. Its shape may or may not change as it is swept, and its area can also vary.
Cross-sectional function
The function describing the moving cross section in a generalized cylinder representation, which yields a 3-D representation of an object as the section is swept. Defining the volume requires an axis curve, a cross section, and a section function (giving the section at each point of the axis). The section function determines how the size or shape of the section changes as a function of its position along the axis. See generalized cone. Figure 36.41 shows a schematic diagram, where (a) is the cross section, (b) is the axis, (c) is the section function, and (d) is the truncated pyramid generated by sweeping the section along the axis while scaling it according to the section function.
Geometrical ion [Geon]
A basic qualitative solid unit for component-based identification; it can be used for recognition by components. Geons can be obtained from the generalized cylinder (see Fig. 36.40) through four kinds of variation:
Fig. 36.41 The role of section functions
Fig. 36.42 Geometric primitive example
1. Edge: straight or curved, and their combinations
2. Symmetry: rotation, reflection, asymmetry, and their combinations
3. Dimensional changes: constant, expansion, contraction, and their combinations
4. Ridge: straight or curved, and their combinations
Figure 36.42 shows several typical geometric primitives.
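A sketch of the swept object / cross-sectional function idea described in this subsection: a circular cross section is swept along a straight axis while a section function scales its radius (a linearly shrinking radius yields a cone, as noted for the swept object representation). The function names are illustrative only:

```python
import math

def sweep_circle(radius_at, length, n_axis=20, n_theta=36):
    """Sample a generalized cylinder: a circular cross section of radius
    radius_at(s) swept along the z axis for s in [0, length]."""
    points = []
    for i in range(n_axis + 1):
        s = length * i / n_axis
        r = radius_at(s)                      # section function: radius at position s
        for j in range(n_theta):
            t = 2.0 * math.pi * j / n_theta
            points.append((r * math.cos(t), r * math.sin(t), s))
    return points

# Cone: radius shrinks linearly from 1 at the base to 0 at the tip
cone_points = sweep_circle(lambda s: 1.0 - s / 5.0, length=5.0)
print(len(cone_points))   # (n_axis + 1) * n_theta sampled surface points
```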
Chapter 37
Stereo Vision
Stereo vision mainly studies how to use (multi-image) imaging technology to obtain the distance (depth) information of objects in a scene from (multiple) images (Davies 2012; Forsyth and Ponce 2012; Hartley and Zisserman 2004; Jähne and Hauβecker 2000; Prince 2012; Szeliski 2010). In fact, the human vision system is a natural stereo vision system (Marr 1982).
37.1 Stereo Vision Overview
Stereo vision observes the same scene from two or more viewpoints, collects a set of images from different perspectives, and then uses the principle of triangulation to obtain the disparity between corresponding pixels in different images, obtains depth information from it, and then calculates the shape of the objects in the scene and their relative positions between them (Davies 2012; Forsyth and Ponce 2012; Hartley and Zisserman 2004; Jähne and Hauβecker 2000; Prince 2012; Sonka et al. 2014; Szeliski 2010; Zhang 2009).
37.1.1 Stereo
Stereo
A general term that refers to a class of problems in which multiple images of the same scene are collected and used to restore various 3-D characteristics, such as surface shape, orientation, or curvature. In binocular stereo, two images are acquired from different viewpoints to calculate a 3-D structure. In trinocular stereo and multi-nocular stereo, three or more images are obtained. In photometric stereo, the viewpoints of image acquisition are the same, but the lighting conditions change, so the surface orientation of the scene can be calculated.
Volumetric reconstruction
A class of techniques for computing solid (volumetric) representations from image data. Typical examples include X-ray CAT, space carving, the visual convex hull, etc. See image reconstruction.
Volumetric 3-D reconstruction
Same as 3-D reconstruction.
Stereo triangulation
For a point in the scene, given its positions in two 2-D images obtained with cameras at known positions, determine its 3-D position. In the absence of noise, each 2-D point defines a 3-D ray by back projection, and the 3-D scene point lies at the intersection of the two rays. For noisy data, the optimal triangulation is computed by maximizing the probability that the two image points are noisy projections of the estimated 3-D point. A similar approach can be used in multi-nocular situations.
Omni-directional stereo
Stereo vision based on two panoramic images. These images are typically generated by rotating a pair of stereo cameras.
Depth perception
The ability to perceive distance from visual stimuli (such as motion, parallax, etc.). See stereo vision and shape from motion.
Distance perception
See depth perception.
Random dot stereogram
A stereo image pair in which both images are random dot images (i.e., binary images obtained by randomly assigning pixels the values 0 and 1, i.e., black and white), the second image being obtained by modifying the first. For example, in Fig. 37.1, the right image is obtained by shifting a rectangular region in the center of the left image horizontally. A strong 3-D effect can be obtained by viewing the pair with appropriately converged (e.g., crossed) eyes or with a stereoscope. See stereo and stereo vision.
Fig. 37.1 Random dot stereogram
Fig. 37.2 Random pattern template
A stereogram designed to study depth perception. The stereogram consists of two pictures. They have the same random dot structure except for a small region in the center; this central region also has the same random dot structure in the two pictures, but it is displaced horizontally between them (see Fig. 37.2). Except for the shaded part, the two pictures have the same random dot structure (represented by 0 and 1), and the elements of the central regions are represented by A and B (corresponding to 0 and 1). The central regions differ by one column in the horizontal direction, and the space exposed by the shift is filled with additional random elements (represented by X and Y). No depth perception is obtained by looking at either picture alone, but if they are viewed through a stereoscope (an instrument invented in the first half of the nineteenth century that presents a different picture to each eye), the fusion function of the human visual system produces a three-dimensional impression: the horizontally displaced region appears vividly above the surrounding background, showing that the visual system has transformed the disparity (horizontal displacement) between the pair of pictures into a perception of depth. It should be noted that in Fig. 37.2 the central region is shifted in different directions in the left and right pictures: in the left picture it moves to the right, and in the right picture it moves to the left. For this region the two eyes therefore receive a disparity and perceive the region as closer to the observer than the surrounding background. If the directions of movement of the small region in the two pictures are exchanged, the disparity relationship is reversed, and the observer perceives the region as farther away than the surrounding background. If the random pattern of Fig. 37.2 is printed in two different colors (usually red and green), it forms a stereo vision picture (3-D picture). Observed under normal conditions such a picture does not produce clear and complete shapes and patterns, but if special color filters are used so that each eye can only see one
mode (equivalent to using a stereo mirror), it can produce the effect of creating a sense of depth. Autostereogram An image similar to a random point stereogram, in which corresponding features are combined into a single image. Using the stereo fusion method, you can perceive the 3-D shape in the 2-D image. Euclidean reconstruction 3-D reconstruction of the scene using Euclidean reference frames. Different from affine reconstruction and projective reconstruction. This is the most complete reconstruction that can be achieved. The use of stereo vision is an example.
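A minimal NumPy sketch of how a random dot stereogram such as those discussed above can be generated: the right image copies the left one, shifts a central patch horizontally by a few pixels, and refills the uncovered strip with fresh random dots. Sizes and names are illustrative assumptions:

```python
import numpy as np

def random_dot_stereogram(size=128, patch=40, shift=4, seed=None):
    """Return a (left, right) pair of binary images whose central patch
    carries a horizontal disparity of 'shift' pixels."""
    rng = np.random.default_rng(seed)
    left = rng.integers(0, 2, size=(size, size))
    right = left.copy()
    r0 = (size - patch) // 2
    c0 = (size - patch) // 2
    # shift the central patch to the left in the right image (crossed disparity)
    right[r0:r0 + patch, c0 - shift:c0 + patch - shift] = left[r0:r0 + patch, c0:c0 + patch]
    # fill the strip uncovered on the right-hand side of the patch with new random dots
    right[r0:r0 + patch, c0 + patch - shift:c0 + patch] = rng.integers(0, 2, size=(patch, shift))
    return left, right

left, right = random_dot_stereogram(seed=0)
print(left.shape, right.shape)   # (128, 128) (128, 128)
```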
37.1.2 Stereo Vision Stereo vision Originally refers to the ability to use two eyes to determine the 3-D structure of a scene. Now refers to a class of methods that use binocular or multi-nocular methods to obtain images and further calculate depth information in the scene. It is the technology and process of transforming a pair of images into the surface of a visible object in 3-D representation according to human spatial vision. The basic idea of stereo vision is to observe the same scene from two or more viewpoints, obtain a set of images at different perspectives, and then use the principle of triangulation to obtain the parallax between corresponding pixels in different images and obtain depth information from it and then calculate the position and shape of the object in the scene and the spatial relationship between them. See stereo. Stereo vision system A practical system that can realize the function of stereo vision. The human vision system is a natural stereo vision system. Computer technology and electronic devices can also be used to form a stereo vision system. Its main body includes six modules: (1) camera calibration, (2) stereo imaging, (3) matching feature extrac tion, (4) stereo matching, (5) 3-D information recover, and (6) post-processing for stereo vision. Panoramic image stereo Observation results using a stereo vision system with a very large field of view (such as 360 azimuth and 120 elevation). Parallax map and depth can be restored for the entire field of view simultaneously. The general stereo vision system needs to achieve the same effect by moving and aligning the results. Narrow baseline stereo A form of stereo vision triangulation in which the sensor positions are close to each other. The baseline is the line between the sensors, and the baseline length represents the distance between the sensor locations. The image data in a video sequence
acquired using a motion camera is often considered to meet the conditions of narrow baseline stereo vision. Wide baseline stereo Stereo correspondence problems that arise when two images whose correspondence is to be determined are separated by a long baseline. In particular, a 2-D window in one image will look very different from a 2-D window in the other image, for reasons such as occlusion, foreshortening, lighting effects, and so on. Long baseline stereo Same as wide baseline stereo. Short baseline stereo See narrow baseline stereo. Temporal stereo 1. Stereo vision obtained by moving one camera instead of using two separate cameras. 2. Combine multiple stereo vision fields of a dynamic scene to get better observation results than a single field of view. Projective stereo vision A set of stereo vision algorithms based on projective geometry. Key concepts precisely described by the projection framework include epipolar geometry, funda mental matrices, and projective reconstruction. Projective reconstruction Reconstruct the geometry of a scene from a set (or series) of images in a projected space. The transformation from the projected coordinates to the Euclidean coordinates is easy to calculate when the Euclidean coordinates of five points are known in projection base. See projective geometry and projective stereo vision. Affine stereo A method of reconstructing a scene using two views corrected from a known viewpoint on the scene. A simple but very robust approximation to the geometry of stereo vision, the estimated position, shape, and surface orientation. Calibration can be done very easily by looking at only four reference points. Any two observations on the same plane surface can be linked by an affine transformation that maps one image to another. This includes a translation and a tensor, called a parallax gradient tensor representing the distortion of the image shape. If the standard unit vectors X and Y in an image are the projections of some vectors on the object surface, the linear mapping between the images can be represented by a 2 3 matrix A, and then the first two columns of A will be the corresponding vector in other image. Because the center of gravity of the plane is mapped to the center of gravity of the two images, it can be used to determine the orientation of the surface.
Correlation-based stereo In two images, a dense stereo reconstruction is performed by cross-correlation of the local image neighborhood to find corresponding points. At this time, the depth can be calculated by stereo triangulation. Feature-based stereo A method of solving the corresponding problem in stereo vision, in which the features from two images are compared. In contrast to correlation-based stereo vision. Edge-based stereo A special case of feature-based stereo vision, where the feature used is an edge. Model-based stereo Modeling and rendering combining geometric and image information. First, estimate the rough geometry, and then use local information to calculate a finer offset map for each plane. The previous step can be performed interactively, using photogrammetry to model. Gradient matching stereo A method for stereo vision matching, which is performed by matching image gradients or deriving features from image gradients. Weakly calibrated stereo The required calibration information is only the binocular stereo vision algorithm of the fundamental matrix between the cameras. In general multi-vision situations, camera calibration can be determined to be only one perspective ambiguity. A weakly calibrated system cannot determine Euclidean properties such as absolute scales, but can return results equivalent to Euclidean reconstruction perspective. Uncalibrated stereo Stereo vision reconstruction without pre-calibrating the camera. Given a pair of images taken with an uncalibrated camera, the fundamental matrix is calculated based on the point correspondence, and then the images can be corrected and traditional stereo vision calibration can be performed. The result of uncalibrated stereo vision is a 3-D point in the projected coordinate system, rather than the Euclidean coordinate system allowed by the calibrated setting. Passive stereo algorithm An algorithm that uses only information obtained by a still camera or group of cameras under ambient lighting. This is in contrast to active vision in stereo vision, where the camera or camera group may move, or some projection stimulus may be used to help solve the stereo correspondence problem. Coherence detection Stereo vision technology that searches for the largest patch correlation to generate features across two images. Its effect depends on good correlation measures and proper patch size.
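A small sketch of correlation-based stereo on rectified images: for each pixel of the left image, a square window is compared against windows along the same scanline of the right image, and the offset with the best score is kept as the disparity. The SSD score (a minimum-distance stand-in for maximum correlation) and the parameter values are illustrative choices:

```python
import numpy as np

def block_matching_disparity(left, right, max_disp=16, half=3):
    """Dense disparity for a rectified stereo pair via window-based SSD matching."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    rows, cols = left.shape
    disp = np.zeros((rows, cols), dtype=int)
    for r in range(half, rows - half):
        for c in range(half + max_disp, cols - half):
            patch = left[r - half:r + half + 1, c - half:c + half + 1]
            best, best_d = np.inf, 0
            for d in range(max_disp + 1):      # search along the same scanline
                cand = right[r - half:r + half + 1, c - d - half:c - d + half + 1]
                ssd = np.sum((patch - cand) ** 2)
                if ssd < best:
                    best, best_d = ssd, d
            disp[r, c] = best_d                # larger disparity means a closer point
    return disp
```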
37.1.3 Disparity Disparity In stereo vision, the difference between the images of the same scene captured by the two cameras (binocularly). In other words, when the same 3-D spatial point is projected onto two 2-D images, the positions of the two image points differ; that is, there is a relative displacement between the corresponding points in the two images of a stereo pair. The magnitude of the disparity is inversely proportional to the depth of the observed scene (the distance between the scene and the camera), so the disparity contains the depth information of the 3-D object and is the basis of stereo depth calculation. Parallax There are several related but not identical definitions and explanations: 1. The difference in position observed after projecting a 3-D point onto a pair of 2-D perspective images. This difference is caused by the shift of the perspective center and the difference in optical axis orientation, and can be used to estimate the depth of the 3-D point. 2. When an object is observed from the two endpoints of a baseline, the angle between the two lines leading from the two endpoints to the object. 3. The angle between the lines from one point (which may be a moving point) to two viewpoints. In motion analysis, two scene points previously projected from one viewpoint onto the same image point will be projected onto different image points as the camera moves; the vector between these two new points is the parallax. 4. When judging the relative distance between two distant objects with the two eyes on the same straight line, moving the eyes along that line gives the impression that the more distant object moves in the same direction as the eyes. This type of parallax often appears between an instrument's reading pointer and its scale; it can be eliminated by adding a mirror behind the scale and making the pointer coincide with its mirror image when reading. As another example, if the imaging system has not eliminated chromatic aberration, each color has its own focusing surface and color parallax occurs; for optical instruments, a lens group with a focal stop can be used to eliminate chromatic aberration. 5. The annual parallax or solar parallax of the position of a star due to the movement of the Earth. It is the basis for calculating the distances of stars and is also used to define a distance unit in astronomy: one parsec is the distance of a star whose (maximum) annual parallax is 1″, which is approximately 3.26 light-years = 3.0857 × 10^13 km. 6. In photogrammetry, the coordinate difference between image points of the same name on a stereo image pair. The components parallel to the photographic baseline direction are called left and right parallaxes; those perpendicular to the baseline direction are called top and bottom parallaxes; the difference between the left and right parallaxes at any two points is called the left-right parallax difference.
Under stereo viewing conditions, the difference in left and right parallaxes can reflect the physiological parallax of the human eye, so a stereo model can be observed. 7. In photography, the position of the object seen in the camera's viewfinder does not match the position of the object captured through the lens. Also called parallelism. The farther the camera is from the object, the smaller the parallax; the closer the distance, the greater the parallax. Disparity limit The maximum parallax value allowed in a potential stereo vision match. Evidence from the human visual system supports the existence of parallax limits. Parallactic angle The angle between the two lines connecting an object (point) to the two ends of the baseline of binocular vision, when that same object (point) is viewed from the two endpoints. Depth acuity The minimum discrimination threshold for binocular parallax. In stereo vision observation, when two objects are at different distances and the difference exceeds a certain limit, the observer can use binocular parallax to distinguish the difference in distance between the two objects. This discrimination ability is called depth acuity, and it can be measured by the difference angle; the minimum depth acuity of most people's eyes is in the range of 0.0014–0.0028 rad. Disparity gradient A constraint used in stereo matching. It is the gradient in the disparity map of the stereo pair and is an estimate of the surface slope at each image point. Suppose $A_l = (a_{xl}, a_y)$ and $B_l = (b_{xl}, b_y)$ in the left image (Fig. 37.3(a)), and $A_r = (a_{xr}, a_y)$ and $B_r = (b_{xr}, b_y)$ in the right image (Fig. 37.3(c)). According to the epipolar constraint, the y-coordinates of corresponding points are equal. Their coordinates in the central-eye (cyclopean) image (Fig. 37.3(b)) are the averages:

$$A_c = \left(\frac{a_{xl}+a_{xr}}{2},\ a_y\right), \qquad B_c = \left(\frac{b_{xl}+b_{xr}}{2},\ b_y\right)$$
Fig. 37.3 Disparity gradient
According to this, the cyclopean separation degree S can be defined, that is, the separation distance of the two points in the cyclopean image:

$$S(A,B) = \sqrt{\left(\frac{a_{xl}+a_{xr}}{2} - \frac{b_{xl}+b_{xr}}{2}\right)^2 + \left(a_y - b_y\right)^2} = \sqrt{\frac{1}{4}\left[(a_{xl}-b_{xl}) + (a_{xr}-b_{xr})\right]^2 + \left(a_y - b_y\right)^2} = \sqrt{\frac{1}{4}\left(x_l + x_r\right)^2 + \left(a_y - b_y\right)^2}$$

where $x_l = a_{xl} - b_{xl}$ and $x_r = a_{xr} - b_{xr}$. The matching parallax between A and B is

$$D(A,B) = (a_{xl} - a_{xr}) - (b_{xl} - b_{xr}) = (a_{xl} - b_{xl}) - (a_{xr} - b_{xr}) = x_l - x_r$$

The parallax gradient of this matched pair is the ratio of the matched parallax to the cyclopean (central-eye) separation:

$$G(A,B) = \frac{D(A,B)}{S(A,B)} = \frac{2\,(x_l - x_r)}{\sqrt{(x_l + x_r)^2 + 4\left(a_y - b_y\right)^2}}$$
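As a quick numerical illustration of these definitions, the following minimal sketch (not from the handbook; the coordinate values are invented) computes the cyclopean separation, the matching parallax, and the disparity gradient for two matched point pairs:

```python
import numpy as np

def disparity_gradient(al, ar, bl, br):
    """Disparity gradient G(A, B) of two matched point pairs.

    al, ar: (x, y) image coordinates of point A in the left and right images.
    bl, br: (x, y) image coordinates of point B in the left and right images.
    Under the epipolar constraint the y-coordinates of left/right mates are equal.
    """
    al, ar, bl, br = map(np.asarray, (al, ar, bl, br))
    # Cyclopean (central-eye) positions: average of left and right coordinates.
    ac = (al + ar) / 2.0
    bc = (bl + br) / 2.0
    s = np.linalg.norm(ac - bc)            # cyclopean separation S(A, B)
    d = (al[0] - ar[0]) - (bl[0] - br[0])  # matching parallax D(A, B)
    return d / s

# Example with fabricated coordinates:
print(disparity_gradient(al=(100, 50), ar=(90, 50), bl=(120, 60), br=(112, 60)))
```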
See binocular stereo. Disparity gradient limit A limitation of the human visual system revealed by psychophysical experiments. It is the maximum disparity gradient value allowed in a potential stereo vision match. The human visual system can only fuse stereo images whose parallax lies within a certain limit. Compare the disparity smoothness constraint. Disparity map An image in which the attribute value at each pixel position represents the parallax at that position. Sub-pixel-level disparity Parallax with numerical accuracy reaching the sub-pixel level (i.e., finer than one pixel). Plane plus parallax A framework that arranges multiple cameras on a plane and uses the parallax relative to a reference plane to estimate depth. Global aggregation methods In stereo matching, methods in which the parallax calculation is performed without an explicit aggregation step, because constraints enforcing global smoothness achieve a similar effect. Compare local aggregation methods.
Cooperative algorithm An algorithm that solves a problem with a series of local interactions between adjacent structures (rather than some global process that accesses all data). The value of a structure changes iteratively based on changes in the values of adjacent structures (points, lines, regions, etc.), and it is expected that the entire process will converge to a good solution. This kind of algorithm is well suited to tasks with a large number of locally parallel processes (such as SIMD) and is sometimes used as a model for human-computer interaction image technology. In stereo matching, it refers to an early class of methods for calculating parallax. Inspired by the human stereo vision computing model, it uses a nonlinear operation to iteratively update the parallax estimate, and its overall behavior is similar to that of global optimization algorithms.
37.1.4 Constraint Geometric constraint A limitation, based on geometry, on the possible physical configuration or appearance of an object. It is mostly used in stereo vision (such as epipolar constraints), motion analysis (such as rigid-body constraints), and object recognition (such as limiting the relationships between special categories of objects or features). Geometric similarity constraint The similarity of the geometric characteristics of corresponding parts (line segments, contours, or regions) in the two images obtained in binocular vision. It can help increase reliability and improve matching efficiency in stereo matching. Compatibility constraint When matching two images, the requirement that points of a specific gray level can only match points of the same gray level. It is a special case of the photometric compatibility constraint. Photometric compatibility constraint In the correspondence analysis of stereo images, the assumption that the corresponding pixels in the two images to be matched have similar brightness or color values. Correspondence constraint See stereo correspondence problem. Feature compatibility constraint The requirement that the physical causes (surface discontinuities, object shadows, etc.) of the corresponding parts (line segments, contours, or regions) in the two images obtained in binocular vision have the same origin.
Uniqueness constraint The requirement that the correspondence between pixels in the two images obtained in binocular vision is generally one-to-one. The exceptional case occurs when the object points corresponding to multiple pixels in one image are seen along the same ray by the camera collecting the other image. Uniqueness stereo constraint A simplifying constraint in stereo matching or stereo reconstruction which requires that one point in one image corresponds to only one point in the other images. This holds in most cases, except at object edges or when pixels are not completely opaque. Continuous constraint A constraint used in stereo matching. It states that the disparity varies smoothly (gradually) near a matching point for most points in the image, except in occlusion regions or discontinuity regions. It is also called the disparity smoothness constraint. Disparity smoothness constraint Same as continuous constraint.
37.1.5 Epipolar Epipolar geometry A special geometric relationship between two perspective cameras in binocular stereo matching. It can be explained with the help of Fig. 37.4, where the Z axis points in the observation direction, the optical axes of the left and right image planes both lie in the XZ plane, and the angle between them is θ. C′ and C″ are the optical centers of the left and right image planes, respectively, and the line connecting them is the system baseline (of length B), also called the optical center line. The optical center line intersects the left and right image planes at the points E′ and E″.
Fig. 37.4 Epipolar geometry
Fig. 37.5 Epipole and epipolar lines
Epipolar line constraint In epipolar geometry, the constraint relationship between projection points on the two image planes. That is, the projection point on the right image plane corresponding to the projection point of the object point W on the left image plane must lie on the line between the epipole and the origin on the right image plane; correspondingly, the projection point on the left image plane must lie on the line between the epipole and the origin on the left image plane. Epipolar constraint A geometric constraint that can reduce the dimensionality of the stereo correspondence problem. In a stereo image pair, the possible matching points in the other image corresponding to any point in one image are constrained to a straight line called an epipolar line. This constraint can be described mathematically using the fundamental matrix. See epipolar geometry. Epipolar correspondence matching Stereo matching using epipolar constraints. Epipole The intersection point of the optical center line with the image plane in epipolar geometry. It is also the point through which all the epipolar lines of a camera pass, as shown in Fig. 37.5. See epipolar geometry. Epipole location The operation of determining the epipoles. Epipolar line In epipolar geometry, the line connecting the intersection of the optical center line with the image plane and the origin of the image plane, that is, the intersection of the epipolar plane and the image plane; see Fig. 37.4. See epipolar constraint. Epipolar transfer The transfer of corresponding epipolar lines in a stereo image pair, which can be defined by means of a homography. See stereo and stereo vision.
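As an illustrative sketch (not from the handbook), given a fundamental matrix F and using the common convention x′ᵀFx = 0, the epipolar line in the second image associated with a point in the first image can be computed directly; the F used below is a made-up placeholder rather than a calibrated matrix:

```python
import numpy as np

def epipolar_line(F, x):
    """Epipolar line l' = F x in the second image for a point x = (u, v) in the
    first image, using the convention x'^T F x = 0.  Returns (a, b, c) with
    a*u' + b*v' + c = 0."""
    xh = np.array([x[0], x[1], 1.0])     # homogeneous coordinates
    a, b, c = F @ xh
    return a, b, c

# Placeholder fundamental matrix (illustrative values only, not from calibration).
F = np.array([[0.0,   -1e-4,  0.01],
              [1e-4,   0.0,  -0.03],
              [-0.01,  0.03,  1.0 ]])
a, b, c = epipolar_line(F, (120.0, 85.0))
print(a, b, c)   # any correspondence x' of x must satisfy a*u' + b*v' + c = 0
```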
Epipolar plane The plane containing the optical center line and the object point in epipolar geometry. It is also the plane defined by any real-world scene point and the optical centers of the two cameras (see Fig. 37.4). Epipolar plane motion See epipolar plane image analysis. Epipolar volume The extension of the epipolar plane. When the camera moves perpendicular to the epipolar plane, a series of images can be stacked to form a volume. Novel view synthesis Starting from multiple images of the same object obtained from multiple viewpoints, the process of synthesizing the view of the object from a new viewpoint. One way to achieve this is 3-D reconstruction, which uses binocular stereo vision to obtain 3-D images and then uses computer graphics to render the reconstruction from various angles. However, the main method of new viewpoint synthesis is to use epipolar geometry to directly synthesize new images from two or more images of the object without 3-D reconstruction. Viewpoint selection Selection of high-information viewpoints when rendering views or acquiring 3-D models. Viewpoint selection also refers to the selection of the set of source fields of view in image-based rendering.
37.1.6 Rectification Stereo image rectification The process of placing the corresponding points of a pair of images obtained with pinhole cameras onto corresponding epipolar lines. Stereo image rectification resamples the 2-D images to generate two new images with the same number of lines and places the points on corresponding epipolar lines in the corresponding rows. For some stereo vision algorithms, this can reduce the amount of computation. However, for some relative orientations (such as translation along the optical axis), rectification is difficult to achieve. Rectification 1. A technique that transforms two images into some aligned geometric form (such as making the vertical pixel coordinates of corresponding points equal). See stereo image rectification. 2. A process applied to photographic images in measurement. In ordinary photography, the object plane is often not parallel to the image plane, so the projected image is distorted. The correction method is to compensate for the original tilt when re-photographing, so as to obtain the correct image.
Fig. 37.6 Image rectification schematic
When generalized to image rectification, computer image processing methods are used instead. Planar rectification A set of rectification algorithms that project the original images onto a plane parallel to the camera baseline. See stereo and stereo vision. Plane sweep A rectification-related method in stereo matching: scan a set of planes through the scene, reproject the different images onto these planes, and measure the consistency between them. Polar rectification A rectification algorithm designed to overcome camera geometry problems in uncorrected vision applications. It reparameterizes the images around the epipoles in a polar coordinate system. Image rectification The process of deforming a pair of stereo images to make the corresponding epipolar lines (defined by the epipoles of the two cameras and any 3-D scene point) collinear. The epipolar lines are usually transformed to be parallel to the horizontal axis, so that corresponding image features or image structures can be found on the same grid line. This reduces the computational complexity of the stereo correspondence problem (reducing the search space by one dimension). Consider the images before and after rectification in Fig. 37.6. The light from the object point W intersects the left image L at (x, y) before correction and at (X, Y) after correction. Each point on the pre-rectification image can be connected to the center of the lens and the line extended to intersect the rectified image R, so for each point on the pre-rectification image its corresponding point on the rectified image can be determined. The coordinates of the points before and after rectification are related by the projective transformation (a1 to a8 are the coefficients of the projection transformation matrix):
$$x = \frac{a_1 X + a_2 Y + a_3}{a_4 X + a_5 Y + 1}, \qquad y = \frac{a_6 X + a_7 Y + a_8}{a_4 X + a_5 Y + 1}$$
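A minimal sketch, under the assumption that the coefficients a1–a8 are already known, of how this projective mapping can be used to resample the rectified image from the original one (the coefficient values below are illustrative placeholders; nearest-neighbor sampling stands in for the interpolation used in practice):

```python
import numpy as np

def rectify(image, a, out_shape):
    """Build the rectified image: for each rectified pixel (X, Y), map back to
    the original image coordinates (x, y) with the projective transformation and
    sample by nearest neighbor.  a = (a1, ..., a8)."""
    a1, a2, a3, a4, a5, a6, a7, a8 = a
    H, W = out_shape
    out = np.zeros(out_shape, dtype=image.dtype)
    Y, X = np.mgrid[0:H, 0:W].astype(float)
    den = a4 * X + a5 * Y + 1.0
    x = (a1 * X + a2 * Y + a3) / den
    y = (a6 * X + a7 * Y + a8) / den
    xi, yi = np.round(x).astype(int), np.round(y).astype(int)
    valid = (xi >= 0) & (xi < image.shape[1]) & (yi >= 0) & (yi < image.shape[0])
    out[valid] = image[yi[valid], xi[valid]]
    return out

# Illustrative coefficients (a mild projective distortion), not from a real calibration.
img = np.random.rand(240, 320)
rect = rectify(img, a=(1.0, 0.02, 3.0, 1e-4, 0.0, 0.01, 1.0, -2.0), out_shape=(240, 320))
```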
In a broad sense, image rectification can also refer to the correction of geometric problems in various imaging processes, including parallax, perspective, curved object shapes (such as planetary surfaces), camera shake (such as unsteady hands when shooting handheld), and motion during photography (such as image distortion in aerial photography from an aircraft). Among these, the problem of motion can be solved by means of special instruments, while the other problems can be solved in software through computer processing. Image pair rectification See image rectification. Scanline stereo matching Stereo vision matching performed on the results of image rectification, where corresponding points on a scanline have the same index. See rectification and stereo correspondence problem. Fundamental matrix In stereo vision, a matrix that relates two corresponding pixels in two unrectified images. It contains all the information used for camera calibration and represents the bilinear relationship between corresponding points p and p′ in a binocular stereo vision image pair. The fundamental matrix F combines the two sets of camera parameters K and K′ and the relative position t and orientation R of the cameras. A point p in one image matching p′ in the other image satisfies $\mathbf{p}^T F \mathbf{p}' = 0$, where $F = (K^{-1})^T S(t) R^{-1} (K')^{-1}$ and S(t) is the skew-symmetric matrix of t. The fundamental matrix has 7 degrees of freedom, two more parameters than the essential matrix, but the functions of the two matrices are similar. Essential matrix In stereo vision, a matrix describing the relationship between the coordinates of the projection points, on two images, of the same object point in objective space. It can be decomposed into an orthogonal rotation matrix followed by a translation matrix. In binocular stereo vision, the essential matrix E describes the bilinear constraint in camera coordinates between the corresponding image points p and p′: $\mathbf{p}'^T E \mathbf{p} = 0$. This constraint is the basis of some reconstruction algorithms, and E is a function of the camera's translation and rotation in the reference frame of the world coordinate system. See fundamental matrix. Epipolar standard geometry A special stereo geometric structure in stereo vision that simplifies the calculation of epipolar lines. As shown in Fig. 37.7, the two imaging planes lie in the same plane and are vertically aligned. It is assumed that there is no lens distortion and that the principal focal lengths of the two lenses are equal. The line (row) coordinates of the principal points in the two images are equal. Both images are rotated to be parallel to the baseline, and the relative poses of the two cameras differ only along the X axis.
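The algebraic relationship between the essential and fundamental matrices can be sketched as follows. This is not the handbook's code; it uses the common textbook convention E = [t]×R and F = K₂⁻ᵀ E K₁⁻¹ with p₂ᵀ F p₁ = 0 (which may differ in transposition convention from the formula quoted above), and the calibration values are invented:

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v = t x v."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential(R, t):
    """Essential matrix E = [t]_x R (textbook convention, normalized coordinates)."""
    return skew(t) @ R

def fundamental(K1, K2, R, t):
    """Fundamental matrix F = K2^{-T} E K1^{-1} relating pixel coordinates: p2^T F p1 = 0."""
    return np.linalg.inv(K2).T @ essential(R, t) @ np.linalg.inv(K1)

# Invented calibration: identical intrinsics, small rotation, sideways translation.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
theta = np.deg2rad(2.0)
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([0.1, 0.0, 0.0])
F = fundamental(K, K, R, t)
print(np.linalg.matrix_rank(F))  # a fundamental matrix has rank 2
```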
Fig. 37.7 Epipolar standard geometry
Because the imaging planes are parallel to each other and the epipoles lie at infinity, the epipolar line of a point is the image line with the same line coordinate as that point. Standard rectified geometry The geometric configuration obtained by rectifying a stereo image pair. The corresponding epipolar lines are horizontal, and the parallax at points at infinity is zero. In general, it is not possible to rectify an arbitrary set of images simultaneously unless their optical centers are collinear. Standard stereo geometry A special binocular stereo imaging geometry configuration. Two cameras with the same focal length f are placed side by side, the distance between the optical centers OL and OR of the left and right cameras is b, and the angle between the two optical axes is zero. For two corresponding points (xL, yL) and (xR, yR) in the left and right images, the depth value Z is
$$Z = \frac{f \cdot b}{x_L - x_R}$$
The calculation in this camera configuration is simple, and stereo analysis can be reduced to the detection of corresponding pixels in the left and right images, so many stereo vision systems use this configuration. The term also refers to a geometry that simplifies epipolar calculations. It is assumed here that the lenses are not distorted, the row coordinates of the principal points in the two images are equal, and the column axis is parallel to the baseline (the relative pose between the two cameras includes only a translation in the X direction, without rotation). Because the imaging planes are parallel to each other, the epipoles lie at infinity. The epipolar line of a point is the image line with the same line coordinate as that point; that is, all the epipolar lines are horizontal and mutually aligned. Almost any stereo vision geometry can be converted to the standard stereo geometry (also known as the epipolar standard geometry); the only exception is when an epipole lies inside the image, but this generally does not occur in practice. The work or process of converting images into the standard geometry is called image rectification.
Fig. 37.8 Convert any stereo vision geometry to standard stereo geometry
Referring to Fig. 37.8, two new imaging planes lying in the same plane need to be constructed. To keep the 3-D geometry consistent, the locations of the projection centers must remain the same, that is, O1 = Or1 and O2 = Or2. However, the camera coordinate systems need to be rotated so that their X axes coincide with the baseline. In addition, two new principal points Cr1 and Cr2 need to be constructed, and the vector connecting them must also be parallel to the baseline. Further, the vector from each principal point to its projection center must be perpendicular to the baseline. This leaves two degrees of freedom: first, a common principal focal length needs to be selected; second, the common plane containing the two imaging planes can be rotated around the baseline. These parameters can be selected by minimizing image distortion; of course, the original images should be completely contained in the rectified images. Finally, the pixels of the original images need to be projected into the rectified images, which can be determined from the intersection of each pixel's line of sight with the rectified image plane, for example projecting the pixels P1 and P2 of the original images to Pr1 and Pr2, respectively. In order to obtain the gray value of a pixel in the rectified image, interpolation is needed (because Pr1 and Pr2 may not lie on image grid points).
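Under the standard stereo geometry described above, depth follows directly from Z = f·b/(xL − xR); a minimal sketch with invented numbers:

```python
import numpy as np

def depth_from_disparity(x_left, x_right, f, b):
    """Depth Z = f*b / (x_L - x_R) for the standard (rectified, parallel-axis)
    stereo geometry.  f: focal length in pixels, b: baseline length."""
    disparity = np.asarray(x_left, dtype=float) - np.asarray(x_right, dtype=float)
    return f * b / disparity

# Example: focal length 800 px, baseline 0.12 m, a few matched column coordinates.
print(depth_from_disparity([410.0, 388.0], [390.0, 380.0], f=800.0, b=0.12))
# -> [ 4.8 12. ]  (metres)
```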
37.2
Binocular Stereo Vision
The key step of obtaining depth information is to use stereo matching to establish the relationship between the same spatial points in different images and calculate the corresponding parallax image (Forsyth and Ponce 2012; Hartley and Zisserman 2004; Szeliski 2010).
37.2.1 Binocular Vision Binocular stereo vision Same as binocular vision. Binocular A system with two cameras that simultaneously observe the same scene from similar viewpoints. The term also refers to optical instruments that must be viewed with both eyes. Note that a "binocular" instrument does not necessarily produce a stereo effect; some only make viewing with both eyes comfortable without giving a sense of stereo. Stereoscopic acuity (such as binocular depth perception) is required for stereo sharpness. In the binocular field of view, the overlapping parts of the fields of view of the two eyes give the stereoscopic effect. This common field of view is approximately 120° in the horizontal direction and approximately 140° in the vertical direction. See stereo vision. Binocular imaging The forms of stereo imaging that use two collectors to image the same scene from one location each, or one collector to image the same scene from two locations, or one collector with an optical imaging system that simultaneously forms two images of the same scene, in order to obtain depth information. Also called binocular stereo imaging. There are multiple implementation modes, such as:
1. Binocular parallel horizontal mode
2. Binocular focused horizontal mode
3. Binocular axial mode
4. Binocular angular scanning mode
Binocular imaging model A model that uses two cameras (e.g., in the binocular horizontal mode) to obtain scene images. It can be regarded as a combination of two monocular imaging models and can obtain two images of the same scene with different viewpoints (similar to the human eyes). In actual imaging, two monocular systems can be used to acquire the images at the same time, or one monocular system can be used to acquire them at two positions in turn (in which case it is generally assumed that the subject and the light source have not moved). A monocular system can also acquire two images simultaneously with the help of optics (such as a mirror). Binocular vision Vision in which a person observes the world with both eyes at the same time. It also refers to computer vision using a binocular imaging model. Binocular effects are mainly stereoscopic effects. Some optical instruments, such as binocular telescopes, can enhance this effect; however, some optical instruments do not necessarily provide stereo vision. For example, ordinary binocular microscopes often only allow the two eyes to observe at the same time.
Panum fusional area In binocular vision, the region in which only a single image is perceived. When the two eyes look at the same given point in the scene, an image is formed on the retina of each eye (the corresponding difference in position between the two images is the parallax). If the point is on the extension of the line connecting the focal point of the eye and the retinal image point, then no double image will be perceived (i.e., it is perceived as a single point). Panum discovered in 1858 that as long as the binocular parallax is small enough, the double images will be fused together. The spatial region within which no double image of an object is perceived is the Panum fusional area. Panum vision A law of binocular vision: for binocular vision to yield a single image, it is not required that all image points on the two retinas coincide, but only that the corresponding points fall within the same tiny range on the retina. In stereoscopic vision, Panum vision is often at work. Actually, only one external object image falls strictly on corresponding points of the retinas; objects in front of and behind it cannot be imaged at the corresponding points, but people still have a stereoscopic sense because their images fall within the range indicated by Panum's law. Objects that are far apart form double images outside this range, but people do not perceive the double images, because one of them is suppressed. Such a range is called a Panum fusional area, and nerve endings in the Panum fusional area can selectively correspond to corresponding points in the other eye to generate single vision. This area is oval-shaped; it is smaller toward the fovea and larger toward the edge of the retina. If the viewing angle is above 3°, the diameter in the horizontal direction is about 3% of the viewing angle, which is called the Fisher law. In Panum vision, there is generally a difference between the objective lines of sight of the two eyes and the subjective direction of the object, caused by Panum's area. Fisher law See Panum vision. Vergence 1. In a stereo vision system, the angle between the optical axes of the two cameras when they look at the same point in the scene. 2. The difference between the offset angle settings of the two cameras. 3. A quantity or index describing the degree of convergence or divergence of the rays in a beam. The vergence at a point is numerically equal to the curvature of the wavefront (a geometric surface composed of wave points of the same phase) at that point. Its sign is negative when diverging, positive when converging, and zero when the rays are parallel. Vergence maintenance In a stereo vision system, the cyclic action of ensuring that the two cameras are looking at the same scene point at the same time; it is controlled by a program that places the cameras at the required positions.
Binocular tracking The use of binocular stereo vision to track objects or features in 3-D space. Two-view geometry See binocular stereo. Segmentation-based stereo matching A method in which the object regions to be matched are first obtained through image segmentation and a parallax is then assigned to each region to perform stereo image matching. Unlike most pixel-based stereo matching, more global information is used here. Area-based correspondence analysis A method of combining several neighboring pixels in the same window into a block for correspondence analysis. This can establish unambiguous pixel-to-pixel correspondence when the two images contain a large number of pixels with similar brightness or color values. The similarity between pixels is converted into the similarity between the brightness or color values of all pixels in the block. Based on this method, a dense disparity map can be further obtained. See stereo matching based on region gray-level correlation. Area-based stereo A technique used in the correspondence analysis of stereo images. By combining several neighboring pixels in a window (such as 5 × 5 or 7 × 7) into a block, the detection of similarity between pixels becomes the detection of similarity between the brightness or color values of all pixels in the blocks. This can establish unambiguous pixel-to-pixel correspondence. Same as correlation-based block matching. Region-based stereo Same as area-based stereo. Stereo matching based on region gray-level correlation Stereo matching using the correlation of sets of pixel gray values in different regions. The neighborhood properties of each point in the different images to be matched need to be considered to build a pattern for matching. Local aggregation methods In stereo matching, the most typical example is stereo matching based on region gray-level correlation, which explicitly aggregates the gray levels within the matching template or window. In practice, features such as color and texture can also be used, and different positions in the template or window can be given different weights. Compare global aggregation methods.
37.2.2 Correspondence Stereo correspondence problem A key issue in stereo vision: determining the projection points in two 2-D images that correspond to the same 3-D scene point. Pairs of such projection points are said to satisfy a "correspondence relationship." Correspondence problems are often solved by matching, but matching only features or pixel neighborhoods is often ambiguous, resulting in a large amount of computation and differing results. In order to reduce the matching space, it is often required that the points satisfying the correspondence relationship also meet certain constraints, such as having similar orientation and variance, local smoothness, or unique matching. A strong constraint is the epipolar constraint: the two image points should lie on the same 3-D ray (line of sight), and the projection of this ray onto the second image is an epipolar curve; for pinhole cameras, the curve is a straight epipolar line. This constraint can greatly reduce the possible matching space (from 2-D to 1-D). Correspondence 1. The relationship between corresponding points in different images in image registration. 2. The relationship between corresponding points in a pair of stereo vision images, obtained by imaging the same scene point. Corresponding points In binocular vision, two points in the two retinal images that correspond to the same point of the observed object. These two points can be regarded as the results of two imaging systems imaging the same object point. If a point p in one image and a point q in another image are different projections of the same 3-D object point, they form a corresponding point pair (p, q). The visual correspondence problem is to match all corresponding points in two images obtained from the same scene. Correspondence matching The process or technique of establishing connections between entities that should correspond to each other. Typical examples are corresponding point matching, dynamic pattern matching, template matching, matching of equivalent ellipses of inertia, object matching, relational matching, string matching, structure matching, etc. Stereo matching A key step in stereo vision. The purpose is to establish the relationship between the projections of the same spatial point in different images, so as to obtain the parallax value and then the depth image. Common methods include stereo matching based on region gray-level correlation and stereo matching based on features. See stereo correspondence problem.
Block matching A technique in region-based stereo vision. It is assumed here that all pixels in an image block have the same parallax value, so only one parallax value needs to be determined for each block. Patch-based stereo See stereo matching based on region gray-level correlation. Correlation-based block matching Matching image blocks based on a correlation calculation; a key step in correlation-based stereo vision. Sub-pixel-refinement A step in stereo matching to improve matching accuracy. It is generally performed after obtaining a preliminary discrete correspondence. Many methods are used, including iterative gradient descent and fitting the matching cost curve at discrete parallax levels. Intensity matching Correspondence found in a pair of images by matching the gray-level patterns of the pixels. The goal is to determine image neighborhoods with similar pixel brightness. All image points, feature points, or points of interest can be considered for matching. Correlation-based stereo matching is an algorithm that uses intensity matching. z-keying An application of real-time stereo matching in which the actors in the foreground are segmented from the background using depth information. The purpose is often to replace the background (background replacement) with some other image content (possibly a computer-generated image) to make up a new image. Correspondence problem See stereo correspondence problem. Correspondence analysis The process of automatically determining the corresponding elements between two images. See stereo matching. The main methods can be divided into area-based correspondence analysis and feature-based correspondence analysis. Phase-matching stereo algorithm An algorithm that solves the stereo correspondence problem by finding the similarity of Fourier transform phases. Dense correspondence In stereo matching, obtaining the correspondence of every pixel instead of a coarse (sparse) feature or object correspondence.
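A minimal sketch (not the handbook's algorithm) of correlation-based block matching on a rectified image pair, using the SSD cost and the parabola-fit sub-pixel refinement mentioned above; the window size and disparity range are arbitrary choices:

```python
import numpy as np

def block_matching(left, right, max_disp=32, half_win=3):
    """Dense disparity by SSD block matching on a rectified pair (the disparity
    is searched along the same scanline).  Returns a float disparity map."""
    H, W = left.shape
    disp = np.zeros((H, W), dtype=float)
    for y in range(half_win, H - half_win):
        for x in range(half_win + max_disp, W - half_win):
            ref = left[y - half_win:y + half_win + 1, x - half_win:x + half_win + 1]
            costs = np.empty(max_disp + 1)
            for d in range(max_disp + 1):
                cand = right[y - half_win:y + half_win + 1,
                             x - d - half_win:x - d + half_win + 1]
                costs[d] = np.sum((ref.astype(float) - cand.astype(float)) ** 2)  # SSD
            d0 = int(np.argmin(costs))
            # Sub-pixel refinement: fit a parabola through the costs at d0-1, d0, d0+1.
            if 0 < d0 < max_disp:
                c_m, c_0, c_p = costs[d0 - 1], costs[d0], costs[d0 + 1]
                denom = c_m - 2 * c_0 + c_p
                disp[y, x] = d0 + (0.5 * (c_m - c_p) / denom if denom != 0 else 0.0)
            else:
                disp[y, x] = d0
    return disp
```

This brute-force loop is meant only to make the cost aggregation explicit; practical implementations vectorize the search or use incremental window sums.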
Dense stereo matching A set of methods for establishing correspondence between all pixel pairs in stereo images through matching. The disparity map thus obtained can be used for depth estimation. See stereo correspondence problem. Dense disparity map A diagram showing parallax values of all pixel positions in a stereo image. See area-based correspondence analysis. Dense reconstruction A set of techniques for estimating the depth of each pixel of an input image or a series of images. The result is a dense sampling of the imaged 3-D surface. Implementation methods include distance perception, stereo vision, etc.
37.2.3 SIFT and SURF Scale-invariant feature transform [SIFT] An effective descriptor based on local invariants (obtaining such a descriptor was the original motivation). It reflects the characteristics of certain key points in the image and has high discrimination ability. The value of the descriptor does not change with scale changes or rotations of the image, and it is strongly robust to changes in lighting and to geometric deformations of the image. When constructing a scale-invariant feature transform descriptor for a key point, the neighborhood centered on the key point is divided into several sub-regions, a gradient direction histogram is calculated for each sub-region, and the histograms are combined into a description vector that forms the descriptor. It is specifically a feature point (keypoint) descriptor that gives a unique signature of the brightness pattern in the 16 × 16 neighborhood around the feature point. To calculate it, the 8-bin histograms of gradient magnitude and direction in each of the 4 × 4 blocks of the 16 × 16 neighborhood are considered; the 8-bin histograms of the 16 blocks are concatenated to form a 128-D vector. Invariant feature Features that do not change with imaging conditions (such as lighting and camera geometry). Commonly used for object recognition. A typical example is the SIFT descriptor. 3-D SIFT A 3-D extension of the scale-invariant feature transform (SIFT), which can be used for voxel data. Gradient location-orientation histogram [GLOH] A variant of the descriptor based on the scale-invariant feature transform. A log-polar hierarchical structure is used to replace the four-quadrant structure of the scale-invariant feature transform. In other words, sector blocks in a polar coordinate system replace the blocks in the rectangular coordinate system.
Fig. 37.9 Partition of gradient location-orientation histogram
Referring to Fig. 37.9, the region is divided into three levels in the radial direction (the radii of the three rings are 6, 11, and 16, respectively). Except for the center circle, which is not divided into blocks, each of the other two rings is divided into 8 blocks according to 45° sectors, so the space is divided into 17 blocks in total. A gradient direction histogram is calculated in each block, with the direction quantized to 16 levels (only the center circle block is shown in the figure). Thus, the gradient location-orientation histogram has 272 dimensions. Gradient location-orientation histogram [GLOH] features A 128-D robust descriptor that is an extension of the SIFT descriptor, using a histogram with 272 bins (17 locations and 16 orientations) in log-polar coordinates. The dimensionality of the histogram is reduced to 128 with principal component analysis. Speeded-up robust feature [SURF] A method for quickly detecting salient feature points in an image; the term also refers to the corresponding image feature detector and descriptor. It was proposed with the basic goal of accelerating SIFT computation, so in addition to the stability of the SIFT method, it reduces the computational complexity and has good real-time ability for detection and matching. The detector improves the calculation speed by approximating the Gaussian smoothing of the scale space with ±1-weighted templates based on the Haar wavelet (using the technique of integral images). The descriptors associated with the detected points are based on a set of 4 × 4 square sub-regions, each of which contains 5 × 5 = 25 pixels and can be described by 4 values, which are again calculated using Haar wavelets, producing a 64-D vector. The overall result is an operator similar to the SIFT operator, but several times faster and more stable.
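As a usage sketch, SIFT keypoints and descriptors can be computed with common libraries; the example below assumes OpenCV 4.4 or later (where SIFT_create is available in the main module; SURF, being patented, is only in the contrib package and may be absent) and uses placeholder file names:

```python
import cv2

# Detect SIFT keypoints and 128-D descriptors in a stereo pair (placeholder file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(left, None)
kp2, des2 = sift.detectAndCompute(right, None)

# Match descriptors with a ratio test (Lowe's criterion) to reject ambiguous matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "putative correspondences")
```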
3-D SURF A 3-D extension of the speeded-up robust feature, which can then be used for voxel data. The extension can be based on voxel-based stereo representations or surface-based stereo representations. Census transform histogram [CENTRIST] A feature descriptor similar to a SIFT descriptor, specifically designed to leverage the bag-of-words classification framework for scene classification or topological location recognition. It mainly encodes the structural characteristics of the scene and eliminates texture detail information, and it can obtain higher scene classification performance than the feature descriptors currently used within the same framework. It is a holistic representation method that can be easily generalized to category recognition.
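The census transform that underlies CENTRIST can be sketched in a few lines; this is an illustrative 3 × 3, 8-bit version, and CENTRIST is then, roughly, the histogram of such codes over the image or over spatial blocks:

```python
import numpy as np

def census_transform(img):
    """8-bit census code per pixel: each bit records whether a 3x3 neighbor is
    smaller than the center pixel.  Border pixels are left as zero."""
    img = np.asarray(img, dtype=float)
    H, W = img.shape
    codes = np.zeros((H, W), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    center = img[1:H - 1, 1:W - 1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        codes[1:H - 1, 1:W - 1] |= ((neighbor < center) << bit).astype(np.uint8)
    return codes
```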
37.3
Multiple-Ocular Stereo Vision
Using the multiple-ocular method is more complicated than using the binocular method but has certain advantages, including reducing the uncertainty of image matching in binocular stereo vision technology, eliminating mismatches caused by smooth regions on the surface of the scene, and reducing match errors caused by periodic patterns on the surface of the scene (Zhang 2017c).
37.3.1 Multibaselines Multibaseline stereo Depth calculation in multiple-ocular stereo vision that uses more than two cameras to obtain multiple baselines between camera pairs. It is sometimes used to simplify the solution of the correspondence problem, because it can reduce the length of the baseline between individual camera pairs while effectively providing a wide baseline between cameras that are far apart. Baseline The line connecting the two lens centers (optical centers) in a binocular stereo imaging system, or the line connecting any two lens centers in a multiple-ocular stereo vision imaging system. The baseline length is the distance between the (optical) centers. Multiple-ocular imaging A method of imaging the same scene with more than two collectors at different positions and/or orientations, or a stereo imaging method of imaging the same scene with one collector at more than two positions and/or orientations. Also called multiple-ocular stereo imaging. Binocular imaging is a special case.
Multiple-ocular stereo imaging Same as multiple-ocular imaging. Multiple-ocular stereo vision See multiple-ocular imaging. Multi-ocular stereo A stereo triangulation method or process that uses more than two cameras (or one camera in more than two locations) to obtain 3-D information. When there are exactly three cameras, it is often directly called trinocular stereo vision. Multi-view geometry See multi-sensor geometry. Multi-view stereo A generalization of binocular stereo vision that uses multiple (usually more than two) perspective images to recover scene depth information. In a broad sense it also includes binocular stereo vision. In practice, it can be multiple-nocular stereo vision imaging along one baseline, but more emphasis is placed on covering the imaging field of view of the observed scene with multiple cameras. See multi-sensor geometry. Voxel coloring A method to recover an object or scene model from multiple images. The central idea is to intersect, in three-dimensional space, the viewing rays obtained from different viewpoints and to use the viewpoint consistency constraint, that is, a real intersection should have the same color in all views (the surface color of the voxel at the intersection of the rays). See space carving. It is also a global optimization method in multi-view stereo vision. It performs a front-to-back planar scan of the 3-D volume and simultaneously computes illumination (photo-) consistency and visibility. In each plane, all voxels with consistent illumination are marked as part of the object. The 3-D volume obtained in this way is guaranteed to produce a noise-free, resampling-free 3-D model with consistent illumination, and it contains any real 3-D object model that could have generated the images. However, voxel coloring works only when all cameras are on one side of the scanning plane, which is not possible when the cameras surround the scene. Space carving 1. A generalization of voxel coloring that works even when the cameras are arranged in a circle around the scene. It iteratively selects subsets of cameras that satisfy the voxel coloring constraints and carves alternately along different axes of the 3-D voxel grid (modifying the model). 2. A method of constructing a 3-D volumetric model from 2-D images. Starting from a voxel representation in a 3-D spatial occupancy array, voxels that cannot maintain "photo-consistency" with the 2-D image collection are removed. The order of voxel elimination is the key to space carving, as it allows difficult visibility calculations to be avoided. See shape from photo-consistency.
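The photo-consistency test at the heart of voxel coloring and space carving can be sketched as a simple spread check on the colors a voxel projects to; projection and visibility handling are omitted, and the threshold value is an arbitrary assumption:

```python
import numpy as np

def photo_consistent(colors, threshold=15.0):
    """colors: N x 3 array of RGB samples gathered from the images in which the
    voxel is visible.  The voxel is kept if the samples agree (low spread)."""
    colors = np.asarray(colors, dtype=float)
    spread = colors.std(axis=0).mean()   # average per-channel standard deviation
    return spread < threshold

# A voxel seen as nearly the same color in three views is kept; a mixed one is carved away.
print(photo_consistent([[200, 90, 60], [198, 92, 58], [203, 88, 61]]))   # True
print(photo_consistent([[200, 90, 60], [40, 120, 200], [90, 200, 30]]))  # False
```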
Voxel carving See space carving. Poisson surface reconstruction A scene reconstruction method used in multi-view stereo vision, in which multiple depth maps are first computed independently and the partial maps are then merged into a complete 3-D model. It uses a smooth 0/1 inside-outside function (which can also be seen as a clipped signed distance function).
37.3.2 Trinocular Trinocular stereo Multi-view stereo vision process using three cameras. Three-view geometry See trinocular stereo. Orthogonal trinocular stereo imaging A special multiple-nocular stereo vision imaging mode. Figure 37.10 shows a schematic diagram where the positions of the three cameras are on the three vertices of a right triangle. Generally, the right-angle vertex is taken as the origin, and its two edges coincide with the X axis and Y axis, respectively. The positions of the three cameras are labeled L, R, and T. A horizontal stereo image pair can be obtained from L and R with a baseline of Bh; a vertical stereo image pair can be obtained from L and T with a baseline of Bv. Gradient classification In orthogonal stereo matching based on orthogonal trinocular stereo imaging, a fast method for selecting matching directions. In order to reduce the matching error, the matching direction needs to be selected reasonably during orthogonal stereo matching. The basic idea of gradient-based classification is to first compare the smoothness of each region in the image in both the horizontal and vertical directions. Use a vertical image pair to match the smoother regions in the horizontal direction, and use horizontal image pairs for matching smoother regions in the vertical direction. The comparison between the horizontal and vertical smoothness can be performed by calculating the gradient direction of the region. Fig. 37.10 Camera positions in orthogonal trinocular stereo imaging
Fig. 37.11 Arbitrarily arranged trinocular stereo imaging system
Arbitrarily arranged trinocular stereo model The trinocular stereo model in its general form. In a trinocular stereo imaging system, the three cameras can be arranged in any form besides the common arrangements of a straight line or a right triangle. Figure 37.11 shows a schematic diagram of an arbitrarily arranged trinocular stereo imaging system, where C1, C2, and C3 are the optical centers of the three image planes, respectively. According to the relationships of epipolar geometry, given an object point W and any two optical center points, an epipolar plane can be determined. The intersection of this plane with an image plane is called an epipolar line (represented by L plus a subscript). In the trinocular stereo imaging system, there are two epipolar lines on each image plane, and the intersection point of the two epipolar lines is also the projection of the object point W onto that image plane. Trifocal plane In a trinocular stereo imaging system, the plane determined by the optical centers of the three cameras (the optical centers of the three image planes). Trifocal tensor The tensor representing the imaging of a single 3-D point observed in three 2-D fields of view obtained from three perspective projections. Algebraically, it can be expressed as a 3 × 3 × 3 array whose elements are $T_i^{jk}$. If the projections of a single 3-D point in the first, second, and third fields of view are x, x′, and x″, respectively, the following nine equations are satisfied:
$$x^i\, x'^j\, \varepsilon_{j\alpha r}\, x''^k\, \varepsilon_{k\beta s}\, T_i^{\alpha\beta} = 0_{rs}$$
where both r and s vary between 1 and 3 (summation over repeated indices is implied). In the above formula, ε (epsilon) is the ε (permutation) tensor, whose value is
$$\varepsilon_{ijk} = \begin{cases} +1 & \text{if } i, j, k \text{ is an even permutation of } 1, 2, 3 \\ 0 & \text{if any two of } i, j, k \text{ are identical} \\ -1 & \text{if } i, j, k \text{ is an odd permutation of } 1, 2, 3 \end{cases}$$
This formula is linear in the elements of T, so given enough corresponding 2-D points x, x′, and x″, the elements can be estimated. Because not every 3 × 3 × 3 array corresponds to an achievable camera configuration, several nonlinear constraints on the tensor elements must also be incorporated in the estimation. Trilinear tensor Same as trifocal tensor. Sum of squared differences [SSD] A criterion for similarity measures. It is generally the result of summing the squares of the gray-level differences between corresponding pixels in the images. Sum of SSD [SSSD] A criterion for similarity measures. The sums of squared differences (SSD) computed as functions of the inverse distance are themselves summed; that is, the sum of squared differences is first calculated for each inverse distance, and then the multiple sums obtained in this way are added together. Normalized sum of squared difference [NSSD] A normalized similarity measure, obtained by normalizing the sum of squared differences.
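A small sketch of the SSD and one possible normalized variant between two equally sized patches; the zero-mean, unit-norm normalization used here is a common choice, not a formula fixed by the handbook entry:

```python
import numpy as np

def ssd(p, q):
    """Sum of squared gray-level differences between two equally sized patches."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum((p - q) ** 2)

def nssd(p, q, eps=1e-12):
    """SSD after removing each patch's mean and scaling to unit norm, so the
    measure is insensitive to affine brightness changes."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    pn = (p - p.mean()) / (np.linalg.norm(p - p.mean()) + eps)
    qn = (q - q.mean()) / (np.linalg.norm(q - q.mean()) + eps)
    return np.sum((pn - qn) ** 2)
```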
37.3.3 Multiple-Nocular Four-nocular stereo matching A special form of multiple-nocular stereo vision matching. As shown in Fig. 37.12, figure (a) shows the projection imaging of the scene point W. The imaging points on the four images are p1, p2, p3, and p4, respectively; they are the intersections of the four rays R1, R2, R3, and R4 with the four image planes. Figure (b) shows the projection imaging of a straight line L passing through the scene point W. The images of the straight line in the four images are four straight lines l1, l2, l3, and l4, which lie in the four planes Q1, Q2, Q3, and Q4. Geometrically speaking, the ray passing through C1 and p1 must also pass through the intersection of planes Q2, Q3, and Q4. Algebraically speaking, given a quadri-focal tensor and any three straight lines passing through three of the image points, the position of the fourth image point can be derived.
Fig. 37.12 Four-nocular stereo matching schematic diagram
Fig. 37.13 Orthogonal multiple-nocular stereo imaging
Multifocal tensor Generally refers to a geometric description of the linear relationship between the field of view of four and more cameras. There are special names for the case of two or three cameras, which are called bifocal tensor and trifocal tensor, respectively. Quadri-focal tensor Tensor describing four optical centers. It can also be regarded as an algebraic constraint imposed on four corresponding points of four fields of view at the same time, similar to the epipolar constraint of two cameras and the trifocal tensor of three cameras. When imaging with four cameras that are not collinear, four views can be obtained, and the 3-D reconstruction of the scene can be performed with the help of four views according to the quadri-focal tensor. See four-nocular stereo matching, epipolar geometry, and stereo correspondence problem. Orthogonal multiple-nocular stereo imaging Stereo imaging technology combining unidirectional multiple-nocular imaging and orthogonal trinocular stereo. A schematic diagram of the shooting position of the orthogonal multiple-nocular image sequence is shown in Fig. 37.13. Let the camera shoot along the points L, R1, R2, . . . on the horizontal line and the points L, T1, T2, . . . on the vertical line to obtain a series of stereo images with orthogonal baselines.
Theoretically, multiple images can be acquired not only in the horizontal and vertical directions but also in the depth direction (along the Z axis), such as at the two positions D1 and D2 along the Z axis in Fig. 37.13. However, practice has shown that the contribution of depth-direction images to the recovery of 3-D information in the scene is not obvious. Uncertainty problem In stereo matching, the phenomenon of mismatch caused by periodical patterns (repeated features) in the image. This problem is also a problem of ambiguity, because periodically repeated features can cause stereo matching results to be non-unique. Periodical pattern Repeated and regular grayscale distributions appearing in an image. In stereo matching, when the surface of an object has a repeating pattern, it often causes a one-to-many ambiguity problem in region-based stereo vision methods, which causes matching errors. Inverse distance The reciprocal of the distance between the camera and the object. Combining it with the calculation of SSSD in multiple-nocular stereo vision imaging can eliminate the uncertainty problem caused by periodical patterns in the image. Periodicity estimation The problem of estimating the period of an event or phenomenon. For example, given a texture generated by repetition, determining the size of the repeated pattern. Multilinear constraint A general term for geometric constraints relating the multiple fields of view of a point. For the two-, three-, and four-view cases, see epipolar constraint, trilinear constraint, and quadrilinear constraint. Trilinear constraint The geometric constraint between the three fields of view of a scene point (i.e., the intersection of three epipolar lines). This is analogous to the epipolar constraint in the two-view situation. Quadrilinear constraint The geometric constraint on four fields of view of one point (the intersection of four epipolar lines). See epipolar constraint and trilinear constraint.
37.3.4 Post-processing Post-processing for stereo vision The procedure of processing the 3-D information obtained after stereo matching in order to obtain complete 3-D information and to eliminate errors.
Commonly used post-processing mainly includes the following three types: (1) depth map interpolation, (2) error correction, and (3) accuracy improvement. Disparity error detection and correction A method of directly processing disparity maps to detect and correct matching errors in them. For example, ordering constraints can be used to detect the locations and number of mismatches, and iterative methods can be used to eliminate the mismatches. Parallax error detection and correction Techniques and processes for detecting and correcting errors in the parallax calculated in stereo vision. Zero-cross correction algorithm A comparatively fast and general parallax error detection and correction algorithm. Using the ordering constraint relationship between corresponding feature points, the errors in the disparity map are eliminated and a disparity map satisfying the ordering matching constraint is established. A characteristic of this algorithm is that it processes the disparity map directly and is independent of the specific stereo matching algorithm that generated the disparity map. In this way, it can be added, as a general disparity-map post-processing step, to various stereo matching algorithms without modifying the original stereo matching algorithm. Second, the amount of computation of this method is proportional only to the number of mismatched pixels, so the computational load is small. Total cross number In the parallax error detection and correction algorithm, the total number of mismatches of all pixel pairs in the cross region. See zero-cross correction algorithm. Disparity search range In stereo vision, the range of disparities over which corresponding points are searched. Improving accuracy The process of improving the accuracy of the disparity values at each point of the disparity map. Parallax calculation and depth information recovery are the basis for various subsequent tasks in stereo vision, so high-precision disparity calculation is important. For this reason, the disparity can be further refined to the sub-pixel level after obtaining the pixel-level disparity usually given by general stereo matching. Ordering constraint 1. In feature-based stereo matching, the relationship between the arrangement order of feature points on the visible surface of the observed object and their projection order (along the epipolar line) in the image obtained after imaging. Because the object points and their projection points are on opposite sides of the optical center in this case, these two orders are exactly opposite. 2. In parallax error detection and correction, the relationship between the arrangement order of two objects in space and their projection order on the stereo
image pair. Because the objects and their projections are on the same side of the optical center at this time, the two orders are consistent. 3. A stereo vision constraint. It states that two points that appear in one order in one image are likely to still appear in the same order in the other images. This constraint is not necessarily true when points originate from objects at different depths (i.e., when they exhibit parallax).
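As a small illustration of the ordering constraint and the total cross number defined earlier, the following hedged sketch counts the pairs of matches along corresponding epipolar lines whose left-image and right-image orders disagree; the data layout is an assumption made for the example.

```python
def total_cross_number(matches):
    """Count pairs of matches that violate the ordering constraint.

    matches: list of (x_left, x_right) coordinates of matched features lying
             on the same pair of corresponding epipolar lines.
    Two matches "cross" when their order along the left epipolar line differs
    from their order along the right one; the total over all pairs is the
    quantity used to locate and count mismatches.
    """
    crossings = 0
    for i in range(len(matches)):
        for j in range(i + 1, len(matches)):
            dl = matches[i][0] - matches[j][0]
            dr = matches[i][1] - matches[j][1]
            if dl * dr < 0:          # opposite order in the two images
                crossings += 1
    return crossings
```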
Chapter 38
Multi-image 3-D Scene Reconstruction
Using a (fixed) camera to collect multiple different images of the scene can also recover the depth of the scene (Forsyth and Ponce 2012; Hartley and Zisserman 2004; Sonka et al. 2014; Szeliski 2010). This can be seen as converting the redundant information shared between multiple images into depth information (Zhang 2017c).
38.1
Scene Recovery
From the perspective of 3-D scene projection to 2-D image, the process of 3-D reconstruction based on the image is also the process of restoring the scene (Hartley and Zisserman 2004; Prince 2012).
38.1.1 3-D Reconstruction 3-D reconstruction A technique that uses images collected from a scene to reconstruct 3-D scenes and their distribution. Also called scene recovering. 3-D structure recovery See 3-D reconstruction. Photo pop-up A strategy for inferring the structure of 3-D scenes by identifying certain scene structures from a single image using image segmentation and recognition technology. A typical approach is to first calculate the superpixels of the image and aggregate them into regions that may share similar geometric marks. After marking all pixels, use various classifiers and the statistical information learned from the
labeled image to identify the content of the region, and use region boundaries to determine different perspectives that can quickly display the scene in the image. 3-D information recover A key step in recovering 3-D information in a scene in stereo vision. Using the parallax image obtained from the stereo matching of the 2-D images, the depth distance is calculated. 3-D stratigraphy 1. A branch of geology that focuses on the sequence and relative location of strata and their relationship to geological time scales. For example, to obtain information from formations, modeling is required to effectively display archeological sites or to determine the structure of rocks and soil in geological surveys. In image engineering, it mainly refers to the modeling and analysis of 3-D scenes. 2. Modeling and visualization tools for displaying different strata underground. It is often used to visualize archeological sites or to detect rocks and soil during geological surveys. The related concepts and technologies are useful for studying 3-D scenes and 3-D images. Affine reconstruction A 3-D reconstruction in which the choice of basis is ambiguous only up to an affine transformation. Parallel planes in the Euclidean basis are still parallel in affine reconstruction. A projective reconstruction can be upgraded to an affine reconstruction by determining the plane at infinity. A common method is to locate the absolute quadratic curve in the reconstruction. Quasi-affine reconstruction A projective reconstruction with an additional constraint (i.e., the plane at infinity does not split the reconstructed scene). This is an intermediate level between projective reconstruction and affine reconstruction. Surface reconstruction The process of building a 3-D surface model from a set of related data (such as a set of 3-D points, a set of cross sections of scanned data, a set of filled voxels, etc.). Model-based reconstruction In 3-D reconstruction, the object to be reconstructed is modeled first and then reconstructed according to the model. This requires some knowledge of the object to be reconstructed in advance, so that more detailed and reliable models can be constructed using special representations and techniques. A few typical examples are architectural modeling, head and face modeling, and human body modeling. Silhouette-based reconstruction A 3-D solid model is reconstructed from the intersections generated by projecting binary silhouettes into 3-D space, giving a visual hull (sometimes called a line hull). Because the volume consistent with the visible silhouettes is the largest such volume, and the surface elements are tangent to the line of sight along the silhouette outline, the visual hull is somewhat similar to the convex hull of the point collection.
Profile curve-based reconstruction Same as shape from profile curves. Slice-based reconstruction Reconstruct a 3-D object by slicing or cutting through a set of planes. The slice plane is generally regularly (at a given interval) swept along the plane normal to the volume required by the object under consideration. See tomography, computed axial tomography, single photon emission computed tomography, and nuclear magnetic resonance.
38.1.2 Depth Estimation Depth information In stereo vision, the distance information along the line between the camera and the object obtained from two or more 2-D images. Depth estimation The process of estimating the distance between a sensor and an object in a scene to be imaged. Two common depth estimation methods are stereo vision and distance perception. Relative depth The difference in depth (distance from an observer) between two points. In some cases, it may not be possible to calculate the actual depth or the absolute depth, but it is possible to calculate the relative depth. Depth map interpolation A stereo vision post-processing step for restoring the complete information of the visible surface of a scene. In the algorithm for stereo matching based on features, because the features are often discrete, the matching result is only a few sparse matching points, so that only the disparity values at these points can be directly recovered. In order to obtain a complete depth map of each point on the surface, a step of interpolating and reconstructing the disparity values recovered previously is needed, that is, interpolation of discrete data to obtain disparity values not at the feature points. There are many interpolation methods, such as nearest neighbor interpolation, bilinear interpolation, and trilinear interpolation. Range image merging When modeling complex 3-D objects, it is often necessary to combine multiple independent depth images. There are three main steps: (1) register some 3-D surface models; (2) integrate them into a coherent 3-D surface; and (3) model fitting, where parametric representations can be used (such as generalized cylinders, superquadrics), nonparametric models (such as triangular meshes), or physics-based models.
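Referring back to the depth map interpolation entry above, the following hedged sketch densifies sparse feature-point disparities over the whole image grid; the use of scipy.interpolate.griddata and the nearest-neighbour fill outside the feature convex hull are illustrative choices, not the handbook's prescription.

```python
import numpy as np
from scipy.interpolate import griddata

def densify_disparity(points, disparities, shape):
    """Interpolate sparse disparities (from feature-based matching) to every pixel.

    points      : (N, 2) array of (row, col) feature locations
    disparities : (N,) disparity values recovered at those features
    shape       : (rows, cols) of the output disparity map
    """
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    dense = griddata(points, disparities, (rows, cols), method='linear')
    # pixels outside the convex hull of the features get nearest-neighbour values
    nearest = griddata(points, disparities, (rows, cols), method='nearest')
    dense[np.isnan(dense)] = nearest[np.isnan(dense)]
    return dense
```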
Volumetric range image processing [VRIP] A technique for global correspondence search of a set of depth images based on local descriptors that are invariant to 3-D rigid body transformation. Also refers to a method of merging two or more aligned 3-D surfaces into a single model. First, calculate a weighted signed distance function for each depth image and combine the depth images using a weighted average method. To make the representation more compact, run-length coding is used to encode empty, visible, and varying signed distance voxels, and only the signed distance values close to each surface are stored. After computing the combined signed distance function, a zero-crossing surface extraction algorithm (such as the marching cubes algorithm) can be used to recover a meshed surface model. Edge in range image The position in the depth image where the depth changes abruptly (also the position where the gray level of the image changes abruptly). If the objects in the scene are composed of polyhedra, the intersections of their surfaces correspond to the edge points. Depth image edge detector See range image edge detector. Depth distortion The systematic and incorrect distance estimation mainly coming from the measurement and calculation errors of the intrinsic parameters. Indices of depth The empirical knowledge in spatial perception that helps judge the spatial position of objects. Humans do not have organs used to sense distance directly or specifically. Perception of space depends not only on vision but also on indices (cues) of depth, including non-visual indices of depth, binocular vision indices of depth, and monocular vision indices of depth. Binocular vision indices of depth The indices of depth obtained with the help of binocular imaging, in which two different independent images are acquired. The difference between the two views results in binocular parallax, which is inversely proportional to depth (i.e., the distance from the eye or camera to the object). This is the basis of stereo vision. Non-visual indices of depth In the human visual system, the indices of depth obtained with the help of physiological or anatomical information. There are two common types: 1. Eye focus adjustment. When looking at different objects near and far, the eye adjusts its lens through the eye muscles to obtain a clear vision on the retina. This regulatory activity provides information about the distance of the object to the signals transmitted to the brain, from which the brain can give an estimate of the distance of the object. 2. Convergence of binocular boresight. When viewing different objects near and far, the two eyes will adjust themselves to align the respective foci to the object, so that the object is mapped to the most sensitive area of the retina. In order to
align the two eyes with the object, the visual axes of the two eyes must complete a certain convergence movement, that is, they must converge for near objects and diverge for far objects. The movement of the eye muscles that control the convergence of the visual axes can also provide the brain with information about the distance of the object. Range image registration An important step in depth image merging that aligns multiple depth images to build a 3-D model. The most common method is the iterative closest point algorithm. Range image alignment Same as range image registration.
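The following is a compact, hedged sketch of one common form of the iterative closest point algorithm mentioned above (nearest-neighbour pairing plus an SVD-based rigid alignment); it omits outlier rejection and convergence tests, and the function and variable names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp(src, dst, iterations=20):
    """Rigidly align point set `src` (N, 3) to point set `dst` (M, 3).

    Each iteration pairs every (transformed) source point with its nearest
    destination point, then solves for the best rotation R and translation t
    in the least-squares sense (SVD / Kabsch solution) and accumulates them.
    """
    tree = cKDTree(dst)
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iterations):
        moved = src @ R.T + t
        nn = dst[tree.query(moved)[1]]             # closest destination points
        mu_s, mu_d = moved.mean(axis=0), nn.mean(axis=0)
        H = (moved - mu_s).T @ (nn - mu_d)         # cross-covariance matrix
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R_step = Vt.T @ D @ U.T                    # best incremental rotation
        t_step = mu_d - R_step @ mu_s
        R, t = R_step @ R, R_step @ t + t_step     # accumulate the transform
    return R, t
```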
38.1.3 Occlusion Occlusion When one object is located between the observer and another object, the closer object in the acquired image occludes the farther object. The occluded surface is the part of the farther object that is blocked by the closer object. In Fig. 38.1, the closer cylinder (partially) blocks the farther box. Occlusion detection Identify features or targets in a scene that cause occlusion (visual overlap). Occlusion is generated by other parts of the scene. Occlusion recovery Attempts to infer the shape and appearance of hidden surfaces due to occlusion. This recovery can help improve the integrity of scenes and targets in virtual reality. Occluding contour Visible edges of smooth surfaces away from the viewer. A 3-D space curve on the surface is specified, and the line of sight from the observer to a point on the space curve is perpendicular to the surface normal at that point. The 2-D image of this curve can also be called an occlusion contour. The edge detection process can be used to find the occlusion contour. When a vertical cylinder is viewed from the side, its leftmost and rightmost edges are occluding edges. See occlusion. Same as profile curve. Fig. 38.1 Cylinder block cuboid
Occluding contour analysis A general term with multiple meanings: 1. Occluding contour detection. 2. Reasoning of the 3-D surface shape at the occluding contour. 3. Determining the relative depth of the surfaces on both sides of the occluding contour. Occluding contour detection Determine which image edges originate from an occluding contour. Profile curve The target contour curve obtained by observing the scene at a certain angle. Also called occluding contour. For opaque scenes, the part inside the profile curve obscures the part behind the scene. Occlusion understanding A general term indicating the analysis and judgment of scene occlusion. The work here includes occluding contour detection, determining the relative depth of the surfaces on both sides of an occluding contour, and searching for T-junctions as cues for occlusion, depth sorting, and so on. Inter-position A phenomenon when objects block each other. At this time, the occluding object is interposed between the observer and the occluded object, and it can be determined that the distance from the occluding object to the observer is closer than the distance from the occluded object to the observer.
38.2
Photometric Stereo Analysis
Photometric stereoscopy is a method of restoring the surface orientation of a scene with a series of images acquired under the same viewing angle but under different lighting conditions (Hartley and Zisserman 2004; Jähne and Hauβecker 2000; Szeliski 2010).
38.2.1 Photometric Stereo Photometric stereo A 3-D reconstruction using one camera method. Also called shape from light moving. It needs a series of images with the same viewing angle but collected under different lighting conditions (such as using a fixed camera but using different positions of the light source to obtain two images from which sufficient information can be extracted to restore the depth) to restore the surface orientation of the scene
(more precisely, the surface normal at each surface point). Different reflection maps can be obtained from such multiple images, which together constrain the surface normal at each point. Photometric stereo eliminates the matching problem that arises when multiple cameras are placed at fixed positions relative to the scene. Photometric stereo imaging A mode of stereo imaging in which the acquisition device is fixed relative to the scene and the light source is moved relative to the scene to collect multiple images. Also called moving light imaging. Because the surface of the same scene has different brightness under different lighting conditions, using the images obtained in this way can only give the surface orientation (relative depth) of the object, but cannot give the depth information of the absolute distance between the collector and the scene. Moving light imaging Same as photometric stereo imaging. Photometric stereo analysis The work and techniques of identifying the surface orientation of objects based on the measured image gray values or color vector values acquired with the camera under different lighting layouts (or, more accurately, different radiance values of the light source). In photometric stereo analysis, color images using vector values have two obvious advantages over grayscale images: (1) by using color information, photometric stereo analysis can also be performed on non-fixed scenes containing moving objects; (2) with the help of physical phenomena, such as the processing of highlights in color images, photometric stereo analysis (under certain conditions) can also be performed on non-Lambertian surfaces in the scene. Solving photometric stereo The solution and process of solving the image brightness constraint equation according to the given image brightness. Thus, the surface orientation determined by the surface gradients p and q can be obtained, and the shape of the original imaged object can be restored. In general, the correspondence from image brightness to surface orientation is not unique. This is because there is only one degree of freedom (the luminance value) at each spatial position but two degrees of freedom (the gradient values in two directions). In practical applications, it is often necessary to solve two or more image brightness constraint equations simultaneously (by changing the light sources to obtain multiple images) to achieve photometric stereo. Shape from photometric stereo See photometric stereo. Same as shape from light moving. Shape from light moving Same as photometric stereo.
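A hedged sketch of the classical least-squares formulation of photometric stereo for a Lambertian surface under k ≥ 3 known distant light sources follows; the array layout and names are illustrative assumptions.

```python
import numpy as np

def photometric_stereo(images, lights):
    """Recover per-pixel albedo and unit surface normals (Lambertian model).

    images : (k, H, W) stack taken from a fixed viewpoint under k lightings
    lights : (k, 3) unit light-direction vectors
    Solves I = L g per pixel in the least-squares sense, where g = albedo * normal.
    """
    k, H, W = images.shape
    I = images.reshape(k, -1)                            # (k, H*W) brightness matrix
    g, *_ = np.linalg.lstsq(lights, I, rcond=None)       # (3, H*W) scaled normals
    albedo = np.linalg.norm(g, axis=0)
    normals = g / np.maximum(albedo, 1e-12)              # avoid division by zero
    return albedo.reshape(H, W), normals.reshape(3, H, W)
```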
38.2.2 Illumination Models Illumination 1. Same as illuminance. 2. The purpose of lighting is to make important features of the illuminated object appear and to suppress unwanted features. For this reason, it is necessary to consider the interaction between the light source and the irradiated object, the spectral composition of the light source and the irradiated object, and the relative orientation of the light source and the irradiated object, because this can enhance the degree of visibility of certain features. Illuminance The light flux per unit area on the surface of a scene illuminated by a light source. The unit is lx (lux); 1 lx = 1 lm/m². It is the total amount of light that arrives at a point on the surface from all directions. Therefore, it is equal to the irradiance weighted by the response curve of the human eye. The total amount of visible light incident on a point on the surface. Measured in lux (lumens per square meter) or foot-candles (lumens per square foot). Illuminance will decrease as the distance between the observer and the light source increases. Illumination variance In face recognition, the brightness change of the light source in the direction relative to the face when the face image is collected. More broadly, when capturing images, changes in the illumination of the imaging surface caused by changes in scene lighting. Shadow factor The ratio of direct illumination to total illumination. When a scene is illuminated by the outside world, its total illumination E has both the contribution of direct lighting and the contribution of indirect lighting. If E0 is used to represent the illuminance that cannot directly reach the scene, that is, the illuminance that is blocked, the shadow factor is (E − E0)/E. Also called shadow rate. Illumination component The luminous flux incident on the surface of the visible scene. It is a factor that determines the gray level of an image and generally changes slowly in space. In theory, its value is always greater than zero (only considering the case of incident light), but it is not infinite (because it should be physically achievable). Illumination estimation A method of determining the position and orientation of a light source in an image using tones, shadows, and specular reflections. Illumination function A functional representation of the illumination component.
cos4 law of illuminance The illuminance E imaged by a lens group at a point is proportional to the fourth power of the cosine of the angle w formed by the principal ray pointing at the point and the optical axis. It can be expressed as E = E0cos⁴w, where E0 is the illuminance of the image point on the optical axis. This law indicates that the luminosity of the edge portion imaged by a wide-angle objective lens decreases rapidly. The ratio of the illuminance of the off-axis image to the on-axis image on the focal plane is called relative contrast. Illumination model A mathematical model used to calculate the lighting effect of a scene, given the physical form of the scene and the nature of the light source. Describes the situation in which light irradiates the surface of a scene and has different illuminances affected by light intensity, incident angle, line of sight, and surface reflection coefficient. A description of the light component that is reflected or refracted from a surface. The standard illumination model contains three basic components: ambient light, diffuse illumination, and specular reflection. Image irradiance equation An equation that relates the surface reflection R of an object to the observed brightness E. Excluding a scalar scaling factor (depending on lighting intensity, surface color, light conversion efficiency, etc.), it can be written as E(x, y) = R(p, q), where (x, y) are pixel coordinates, and the surface normal can be written as (p, q, −1). In general, the same reflection value corresponds to a set of surface normals (with one degree of freedom). It is the equation linking the surface of an object with the intensity of the perceived image. It can be expressed as E(x, y) = ρ(u, v) R[N(u, v)·L, N(u, v)·V, V·L], where E(x, y) is the irradiance of the light sensor at the image plane position (x, y), (u, v) is the position parameter of the surface of the object, ρ(u, v) is the reflectance there, R represents the reflection function, N(u, v) is the direction of the normal at this point, L is the direction of the light source, and V is the direction of observation. Image brightness constraint equation An equation established between the brightness E(x, y) of the pixel at (x, y) and the reflection characteristic R(p, q) of the scene position corresponding to that pixel. The normalized equation can be written as E(x, y) = R(p, q). Using monocular images to solve the image brightness constraint equation is an ill-posed problem, and additional information is needed to have a unique solution.
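As an illustration of the right-hand side of the image brightness constraint equation, here is a small, hedged sketch of a Lambertian reflectance map R(p, q) in gradient space; the light-source gradient (ps, qs) parameterization is a standard convention assumed for the example, not notation taken from this handbook.

```python
import numpy as np

def reflectance_map(p, q, ps, qs):
    """Lambertian reflectance map R(p, q) in gradient space.

    The surface normal is proportional to (p, q, -1) and the light-source
    direction to (ps, qs, -1); R is the cosine of the angle between them,
    clipped to zero for self-shadowed orientations.
    """
    num = 1.0 + p * ps + q * qs
    den = np.sqrt(1.0 + p**2 + q**2) * np.sqrt(1.0 + ps**2 + qs**2)
    return np.clip(num / den, 0.0, None)

# The image brightness constraint equation then reads E(x, y) = R(p, q):
# one measured brightness per pixel constrains the two unknowns p and q,
# which is why a single image alone leaves the problem ill-posed.
```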
Fig. 38.2 Shape recovery with motion
Smooth constraint 1. Constraints used to solve the image brightness constraint equation using monocular images. In the smooth constraint, the surface of the object is considered smooth, or continuous in depth (more generally, the target can have a piecewise continuous surface, only smooth at the boundary). In this way, the orientation of two adjacent surface elements on the surface is not arbitrary. Together, they should give a continuous and smooth surface. 2. The assumption that a small change in the RGB value of the scene surface will only cause a small change in the spectral reflection coefficient.
38.3
Shape from X
Various methods for recovering the shape of scenes are often referred to as (obtaining) "shape from X," where X can represent illumination changes, shadows, focal lengths, contours, textures, scene movements, etc. (Davies 2012; Forsyth and Ponce 2012; Hartley and Zisserman 2004; Jähne and Haußecker 2000; Prince 2012; Szeliski 2010).
38.3.1 Various Reconstructions Shape from X Method for estimating the surface shape or pose of a 3-D scene from one of many characteristics (X factors such as stereo vision, shadows, contours, motion, polarization, specularity, auto-focus, etc.). See 3-D reconstruction using one camera. Shape from orthogonal views See shape from contours. Active structure from X Restore scene depth and 3-D structure with active sensing technology. A typical example is a technique that uses motion information plus shape from X. Figure 38.2 shows a schematic diagram of recovering the shape from structured light. The
external surface of the object can be both scanned by using the light plane by moving the camera and sequentially exposed to the plane rays of the camera by rotating the object. Shape from polarization A technique to restore the local shape based on the polarization properties of the surface being observed. The basic idea is to illuminate the surface with known polarized light, estimate the polarization state of the reflected light, and then use this estimation to associate the surface normal with the measured polarization parameter in a closed model. In practice, the polarization estimates can be noisy. This method is likely to be effective when the brightness image cannot provide information (such as a highlight surface without features). Reconstruction of vectorial wavefront Reconstruction technique using vector wavefront to record polarization-state information. General photographic methods record the light and shade and color in an imaging manner. Holography records the light, darkness, color, and phase of an object, but it does not use imaging. Therefore, when observation is required, the recorded hologram should be illuminated with the same light as in the original recording to achieve reproduction. In addition to recording the light, darkness, color, and phase of an object, vector wavefront recording also records polarization-state information, but it is not recorded by photography. During reconstruction, it is necessary to irradiate the recording map with light having a polarization state during recording in order to realize vector wavefront reproduction. Because the electrical vector characterizing the polarization state is directional, this reproduction method is called vector wavefront reconstruction. Shape from interreflection A 3-D reconstruction using one camera method. Standard methods of recovering shape from shading or photometric stereo assume that the surface is illuminated by a single point light source. When the surfaces are close to each other, one surface will also receive the luminous flux of the other surface due to mutual reflection. These additional illuminances will cause errors in shape recovery. By modeling the causes of mutual reflection, the influence of mutual reflection can be eliminated iteratively, and a better estimate of the true shape can be obtained. Shape from perspective A technique for estimating depth using different features in clues provided by perspective projection. For example, the movement of the perspective camera along the optical axis will change the size of the imaging target. Shape from multiple sensors An algorithm that uses information gathered by a set of homogeneous or heterogeneous sensors to recover the shape of an object. In the former case, see also stereo vision. In the latter case, see also sensor fusion. Shape from structured light See structured light triangulation.
Shape from contours A 3-D reconstruction using one camera method. The shape of the surface of the scene is reconstructed based on the contour information of the 3-D scene projected onto the image plane. Among them, there is a commonly used technique called shape from silhouette. In addition, there is work to understand shapes based on the differential properties of apparent contours. Shape from silhouette A 3-D reconstruction using one camera method. It is a form of shape from contours. Restoring the shape from the silhouette includes extracting the silhouette of the target from some visual observations and intersecting the 3-D cones generated by back-projecting the target silhouettes through the projection centers. The intersected volume is called the visual hull. Shape from profile curves The shape information of the scene is obtained by matching the profile curves of the scene. However, the main difficulty encountered by this type of method is that the position of the profile curve varies with the camera viewpoint, so directly matching the profile curves in the two images and performing triangulation calculations will give the wrong shape. However, if there are three or more close images, it is possible to use local arcs to fit the positions of the corresponding edge primitives, thereby directly obtaining a semi-dense surface mesh from the matching results. Another advantage of this type of method is that the surface shape can be reconstructed from an untextured surface, in which case the foreground and background only need to have different gray levels or colors. Shape from vergence A technical method for determining the depth of a scene using two cameras. The two cameras are fixed on the same pole and rely on two servo motors to control their optical axes to converge in a plane containing the two optical centers. Such a device is also called a stereo head. When the optical axis convergence point is on the scene, the depth of the scene can be determined. See binocular focused horizontal mode. Stereo heads See shape from vergence.
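A hedged voxel-carving sketch of silhouette-based reconstruction (the visual hull described above) follows: a voxel is kept only if it projects inside every binary silhouette. The 3x4 projection matrices and mask layout are assumptions made for the example.

```python
import numpy as np

def visual_hull(voxels, cameras, silhouettes):
    """Carve an approximate visual hull from binary silhouettes.

    voxels      : (N, 3) candidate voxel centres in world coordinates
    cameras     : list of 3x4 projection matrices, one per view
    silhouettes : list of binary masks (H, W), one per view
    Returns a boolean array marking the voxels lying inside every silhouette cone.
    """
    X = np.hstack([voxels, np.ones((len(voxels), 1))])    # homogeneous coordinates
    keep = np.ones(len(voxels), dtype=bool)
    for P, mask in zip(cameras, silhouettes):
        x = X @ P.T                                        # project into the image
        u = np.round(x[:, 0] / x[:, 2]).astype(int)        # column coordinate
        v = np.round(x[:, 1] / x[:, 2]).astype(int)        # row coordinate
        H, W = mask.shape
        inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        hit = np.zeros(len(voxels), dtype=bool)
        hit[inside] = mask[v[inside], u[inside]] > 0       # inside this silhouette?
        keep &= hit
    return keep
```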
38.3.2 Structure from Motion Structure from motion A kind of 3-D reconstruction using one camera method. Solving the optical flow equation is used to recover the relationship between objects in the scene and the structure of the scene itself. There are two cases: 1. Determine the shape characteristics of a moving object and its position and velocity from an image sequence obtained from a fixed camera.
Fig. 38.3 Bisection problem and related Hessian matrix
2. Determine the shape characteristics and pose of an object and the speed of the camera from an image sequence obtained by a moving camera. It can also be regarded as starting from a collection of scene points and using their motion information to restore the 3-D shape of the collection. It includes simultaneous estimation of 3-D geometry (structure) and camera pose (motion). It is a dichotomy (bisection) between structure and motion. Each feature point xij in a given image depends on the position of a 3-D point pi (pi = (xi, yi, zi)) and the pose of a 3-D camera (Rj, cj) (Rj indicates the orientation; cj indicates the optical center). See Fig. 38.3(a), where nine circles represent nine 3-D points and four squares represent four cameras. The lines in the figure represent 2-D features that indicate which points are visible in which cameras. If the values of all these points are known or fixed, then all camera equations become independent, and vice versa. If the structure variables are ordered before the motion variables, the resulting Hessian matrix has the layout shown in Fig. 38.3(b). Among them, the block in the upper left corner is the point (structure) Hessian App, and the block in the upper right corner is the point-camera Hessian Apc. Letting the motion (camera) Hessian be represented by Acc and the reduced motion Hessian be represented by A′cc (A′cc contains a non-zero term between two cameras whenever they can see a common 3-D point, as shown by the dashed arc in Fig. 38.3(a) and the corresponding square in Fig. 38.3(b)), then the following Schur complement can be used to calculate the reduced motion Hessian: A′cc = Acc − ApcᵀApp⁻¹Apc. See structure and motion recovery. Schur complement See structure from motion.
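A small numpy sketch of forming the reduced camera (motion) system via the Schur complement just given follows; the block names mirror the entry, and treating the point block as a single dense matrix (rather than the block-diagonal form exploited in practice) is a simplification for illustration.

```python
import numpy as np

def reduced_camera_system(A_pp, A_pc, A_cc, b_p, b_c):
    """Eliminate the structure (point) block from the normal equations.

    Full system:
        [ A_pp    A_pc ] [dp]   [b_p]
        [ A_pc^T  A_cc ] [dc] = [b_c]
    Returns the Schur complement A'_cc = A_cc - A_pc^T A_pp^{-1} A_pc and the
    correspondingly reduced right-hand side, from which the camera update dc
    can be solved before back-substituting for the point update dp.
    """
    A_pp_inv = np.linalg.inv(A_pp)            # block-diagonal (3x3 per point) in practice
    S = A_cc - A_pc.T @ A_pp_inv @ A_pc       # reduced motion Hessian
    rhs = b_c - A_pc.T @ A_pp_inv @ b_p
    return S, rhs
```

Because App is block-diagonal with one small block per 3-D point, its inverse is cheap to form, which is why eliminating the structure variables first is the usual choice.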
Line-based structure from motion A method of restoring structure from motion with the help of line segments. First, detect the line segments in the image, and then combine them using the common vanishing points in each image. Then use the vanishing points for camera calibration, and use the apparent and trifocal tensor to match the line segments corresponding to the common vanishing points. These line segments can be used to infer block structure models for planes and scenes. Plane-based structure from motion A method for recovering structure from motion with the aid of planar information. A special case of recovering structure from motion, it is more suitable for situations where there are many planar structures in the scene (such as outdoor buildings or indoor furniture). In these cases, feature-based or brightness-based methods can be used to directly estimate the homography between different planes, and the obtained information can be used to simultaneously estimate or infer the camera's pose and the plane equations. Two-frame structure from motion The structure from motion problem restricted to two views (a single image pair). See also epipolar geometry for its analysis. Constrained structure from motion In the structure from motion method, there are certain constraints on the target motion in the scene and/or on the scene and the target structure. The general structure from motion does not limit the scene or target, but if some special information of the scene or target can be used, it will be easier to obtain better results. For example, when the camera moves around a fixed center on a fixed arc, or the target is on a rotating platform, motion modeling is easier. For another example, if there are some higher-level geometric primitives such as straight lines or planes in the scene, it will be more convenient to build a 3-D model. A typical example is the scene of a building, where straight lines or planes are often parallel or perpendicular to each other. Such special information can play an important role in restoring the structure. Non-rigid structure from motion General extension of the structure from motion. Restore the structure of a 3-D non-rigid object from a series of 2-D images (possibly without calibration) containing a moving non-rigid object. In dynamic scenes, non-rigid structures are often encountered. Uncertainty in structure from motion Because many highly related parameters need to be estimated in the structure from motion, and the "true value" is often not known, it is possible to produce a large number of uncertain estimation results, such as bas-relief ambiguity. Moving light display A sequence of images collected for an object with point light sources in a dark scene. The light sources are viewed as a set of moving bright spots. This type of image sequence was used in early studies of structure from motion.
Structure and motion recovery The structure of the scene and the 3-D position of the camera are calculated simultaneously from a series of images of the 3-D scene. Common strategies rely on multi-view tracking of 2-D image entities (such as points or edges of interest) to obtain constraints on 3-D entities (points or lines) and camera movement. Constraints can be embodied in entities such as fundamental matrices and trifocal tensors, which can be estimated only from image data and further used to calculate the 3-D position of the camera. Recovery can reach up to some equivalence class of acquisition scenarios, and any scenario in the class may generate the observed data. Typical examples are projective reconstruction and affine reconstruction. Bas-relief ambiguity Starting from an image under an orthogonal projection condition, shadows are used to reconstruct a 3-D target based on Lambertian reflection, which leads to ambiguity. Assuming the real surface is z(x, y), the surface family az(x, y) + bx + cy will generate the same image under the above observation conditions, so for any a, b, and c values, all reconstructions are equally effective. The ambiguity thus belongs to a family with three parameters. In the structure from motion, an uncertainty problem is caused by the need to estimate many interrelated parameters without knowing the true values. This makes it difficult to estimate the 3-D depth and camera movement of a scene at the same time. At this time, it is often impossible to determine with certainty which direction an inverted version of a target is moving in, or whether a target is turning left or right. Small motion model A mathematical model that expresses very small (ideally infinitesimal) motion between adjacent video frames. Mainly used in shape from motion. See optical flow. Motion from line segments The motion information is obtained by matching the line segments in the image. Photo tourism A system based on a sequence of photos that determines the current location and the next travel activity plan. The 3-D position and pose of the camera from which each image was taken are obtained using the structure from motion technique. Tomasi-Kanade factorization A maximum likelihood solution to the problem of structure and motion recovery. It is assumed that an affine camera is used to observe a static scene, and the position of the observation point (x, y) is disturbed by Gaussian noise. If m points are observed in n views, the 2n × m observation matrix containing all observation points has a rank of 3. The optimal rank-3 approximation of this matrix can be reliably obtained by singular value decomposition. Thereafter, the 3-D points and camera positions can be easily extracted (although there may still be affine ambiguities).
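A hedged numpy sketch of the Tomasi-Kanade factorization described above follows; the measurement-matrix layout (x rows stacked above y rows) and the symmetric splitting of the singular values between motion and structure are conventional choices assumed for illustration, and the affine ambiguity is not resolved.

```python
import numpy as np

def tomasi_kanade(W):
    """Affine factorization of a (2n, m) measurement matrix W.

    Rows 0..n-1 hold x coordinates and rows n..2n-1 hold y coordinates of m
    points tracked in n views. Returns the affine motion matrix M (2n, 3) and
    the structure matrix S (3, m), defined up to an affine ambiguity.
    """
    W0 = W - W.mean(axis=1, keepdims=True)        # register each row to its centroid
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)
    U3, s3, V3 = U[:, :3], s[:3], Vt[:3]          # best rank-3 approximation
    M = U3 * np.sqrt(s3)                          # camera (motion) factors
    S = np.sqrt(s3)[:, None] * V3                 # 3-D structure factors
    return M, S
```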
Fig. 38.4 Kinetic depth (fixed point, target, swept trajectory)
38.3.3 Shape from Optical Flow Shape from optical flow See structure from motion. From optical flow to surface orientation Key steps in finding structure from motion. Optical flow contains information about the structure of the scene, so the orientation of the surface can be deduced from the optical flow moving across the surface of the object, and the orientation of each surface can determine the structure of the scene and the relationships between the scene parts. Structure from optical flow The camera motion is restored by calculating the optical flow constrained by the fundamental matrix of infinitesimal motion. The small motion approximation replaces the rotation matrix R with I − [w]×, where w is the rotation axis, that is, the unique vector satisfying Rw = w. Shape from specular flow See shape from motion. Shape from motion Algorithm for estimating the 3-D shape (structure) and depth of a scene from motion information contained in an image sequence. Some methods are based on tracking a sparse set of image features, such as Tomasi-Kanade factorization; others are based on dense motion fields (i.e., optical flow) to reconstruct dense surface shapes. Stereo from motion A way to obtain depth information by using the motion trajectory of the camera. If a monocular system is used to acquire a series of images in multiple poses, the same 3-D space point will correspond to coordinate points on different image planes and generate parallax. If the motion trajectory of the camera is used as the baseline, taking any two images and matching the features therein makes it possible to obtain depth information of the scene. Kinetic depth A method of estimating depth at a feature point (often an edge point) of an image by means of controlled motion of the sensor. This method is generally not performed on all points of the image (because of lack of sufficient image structure), nor is it performed in regions where the accuracy of the sensor changes slowly (such as a wall). A typical situation is shown in Fig. 38.4, where the camera rotates along a
circular track staring at a fixed point in front, but the target is not necessarily at this point. Temporal derivative The term in the optical flow equation that represents the change of the gray value of an image point over time; it corresponds to the change in pixel brightness between adjacent frames in a video sequence. Special case motion In the shape from motion, a sub-problem that restricts the camera movement in advance. Examples include plane motion, turntable motion or single-axis motion, and pure translation motion. In each case, the constrained motion simplifies the general problem, resulting in one or more closed-form solutions, with high efficiency and increased accuracy. Similar advantages can also be obtained from approximations such as affine cameras and weak perspective. Moving observer A moving camera or other sensor. Moving observers have been used extensively in the study of structure from motion. Dynamic stereo Stereo vision with a moving observer. In this way, the technique of shape from motion can be used together with stereo vision technology.
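To make the role of the temporal derivative concrete, here is a hedged sketch that writes the optical flow equation Ix·u + Iy·v + It = 0 for every pixel in a small window and solves it in the least-squares sense (a Lucas-Kanade-style estimate); the derivative approximations and window handling are simplifications for illustration.

```python
import numpy as np

def flow_in_window(frame0, frame1, y, x, w=7):
    """Least-squares optical flow (u, v) in a (2w+1) x (2w+1) window.

    frame0, frame1 : consecutive grayscale frames (2-D arrays)
    y, x           : centre of the window (assumed far enough from the border)
    """
    Iy, Ix = np.gradient(frame0.astype(float))        # spatial derivatives
    It = frame1.astype(float) - frame0.astype(float)  # temporal derivative
    win = (slice(y - w, y + w + 1), slice(x - w, x + w + 1))
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)    # solve Ix*u + Iy*v = -It
    return u, v
```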
Chapter 39
Single-Image 3-D Scene Reconstruction
The use of monocular single images for scene restoration is actually an ill-posed problem. This is because when a 3-D scene is projected onto a 2-D image, the depth information is lost (Davies 2005; Hartley and Zisserman 2004). However, in practice, there are still many depth clues in the image, so it is possible to recover scenes from them under certain constraints or prior knowledge (Davies 2012; Forsyth and Ponce 2012; Jähne and Hauβecker 2000; Shapiro and Stockman 2001; Szeliski 2010; Zhang 2017c).
39.1
Single-Image Reconstruction
From the information point of view, just using a monocular single image directly for scene reconstruction is not enough in general, and it often requires some relevant prior knowledge (Davies 2005; Jähne and Haußecker 2000). 3-D reconstruction using one camera A technology that uses only one camera to implement 3-D scene reconstruction. At this time, the scene can be placed at a certain position in space and single or multiple images collected (corresponding to different times or environments), and various 3-D clues can be used to reconstruct the scene. Because the shape of the scene is the most basic and important of its intrinsic properties, various 3-D reconstruction using one camera methods are often referred to as "shape from X," where X can represent changes in brightness, illumination, shading, texture, and contour shapes, scene movement, and more. 3-D scene reconstruction The use of one or more 2-D images to restore the 3-D structure of an objective scene.
Monocular vision indices of depth The indices of depth obtained from some physical conditions of the stimulus itself when observing the scene (monocular vision). They mainly include size and distance, changes in lighting, linear perspective, texture gradient, occlusion of objects, self-occlusion, motion parallax, etc. To obtain these depth cues, it is often necessary to rely on the experience and learning capabilities of the observer. Shape from monocular depth cues A method that uses monocular vision indices of depth detected from a single image for scene shape estimation. Among them, the indices of depth mainly provide the relative orientation and relative depth information of different scenes or different parts of the object. Monocular vision Vision that uses only monocular gaze and observation. Under certain conditions, people can also realize the depth perception of spatial scenes by relying only on monocular vision. This is because some physical conditions of the stimulus itself in monocular vision can also become perceptual depth and distance cues (monocular vision indices of depth). Monocular visual space The visual space behind the lens of an optical system. This space is often assumed to be unstructured, but the depth of the scene can be recovered from the out-of-focus blur that occurs in that space. Visual cue A type of information extracted from visual data that helps observers better understand the world. For example, an occluding contour can help understand the depth order and tone of an object, giving clues about the shape of the surface (see shape from shading). Monocular Use a single camera, sensor, or eye. This is in contrast to binocular vision or multiple-ocular stereo vision using more than one sensor. Sometimes it means that the image data is obtained from a single viewpoint, because if a single camera is used over time, it is mathematically equivalent to multiple cameras. Linear perspective 1. One of the monocular depth cues. According to the laws of geometric optics, light passing through the center of the pupil generally gives a true image of the central projection. Roughly speaking, this projective transformation can be described as a projection from a point to a plane, called linear perspective. Due to the existence of linear perspective, closer objects occupy larger viewing angles and appear larger in size; distant objects occupy smaller viewing angles and appear smaller in size. 2. A method for representing a 3-D scene on a 2-D plane. At this time, it is necessary to make near objects larger than distant objects in order to conform to the way that the human eye perceives the real world. Compare isometric perspective.
Fig. 39.1 The coordinate system of perspective 3 points
Single view reconstruction See single view metrology. Single view metrology An application based on vanishing point estimation and camera calibration technology (commonly used in architectural scenes). Users can interactively measure the height and other dimensions of buildings in the image and build 3-D models composed of flat planes. First, two orthogonal vanishing points are determined on the ground, and a vanishing point is determined in the vertical direction. Next, the user marks several scales in the image (such as the height of a reference target), and then other scales (such as the height of another target) can be automatically calculated. Perspective n-point problem [PnP] In perspective imaging, the problem of calculating 3-D scene coordinates (thus determining the scene shape and pose) based on the coordinates of n known image points. Camera pose can also be estimated based on n known 3-D to 2-D point correspondences. It has been proven that the collinear 3 points and coplanar 4 points, 5 points, and 6 points have unique solutions in theory. Perspective 3 points is a special case. Perspective 3 points [P3P] Under the condition that the 3-D scene model and the focal length of the camera are known, the coordinates of three image points are used to calculate the geometry and pose of the 3-D scene. The distance between each pair of the three points is known. See Fig. 39.1 for the coordinate relationship between the image, camera, and scene. This is a reverse perspective problem. It is known that the corresponding point on the image plane xy of each 3-D scene point Wi (i = 1, 2, 3) is pi, and the coordinates of Wi are now calculated based on pi.
Here the line from the origin to pi also passes through Wi, so if we let vi be the unit vector on the corresponding line (pointing from the origin to pi), the coordinates of Wi can be obtained by the following formula: Wi = ki vi, i = 1, 2, 3. The (known) distance between each pair of the above three points is (m ≠ n) dmn = ||Wm − Wn||. Substituting the expression of Wi into the calculation formula of dmn, we get dmn² = ||km vm − kn vn||² = km² − 2km kn (vm · vn) + kn². In this way, three quadratic equations about ki can be obtained, in which the three dmn² on the left of the equations are already known according to the 3-D scene model, and the three dot products vm · vn can also be calculated based on the coordinates of the image points. The P3P problem of determining the Wi coordinates thus becomes a problem of solving three quadratic equations with three unknowns. This problem can be solved using iteration.
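As a hedged numerical sketch of this last step, the three quadratic equations in k1, k2, k3 can be handed to an off-the-shelf root finder; the input dictionaries and the need for a reasonable starting guess are assumptions of the example, and in practice a closed-form P3P solver followed by refinement is more common.

```python
import numpy as np
from scipy.optimize import fsolve

def solve_p3p_distances(d, cosang, k0):
    """Solve k_m^2 - 2 k_m k_n (v_m . v_n) + k_n^2 = d_mn^2 for k1, k2, k3.

    d      : known model distances, keyed by the pairs (1, 2), (1, 3), (2, 3)
    cosang : dot products v_m . v_n of the unit viewing vectors, same keys
    k0     : initial guess for (k1, k2, k3)
    The returned k_i give the scene points as W_i = k_i * v_i.
    """
    pairs = [(1, 2), (1, 3), (2, 3)]

    def residual(k):
        kk = {1: k[0], 2: k[1], 3: k[2]}
        return [kk[m]**2 - 2*kk[m]*kk[n]*cosang[(m, n)] + kk[n]**2 - d[(m, n)]**2
                for (m, n) in pairs]

    return fsolve(residual, k0)
```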
39.2
Various Reconstruction Cues
Many methods have been proposed from the perspective of (monocular) single-image restoration. Some methods are more general (with a degree of generality), and some methods need to meet specific conditions (Davies 2005; Peters 2017; Shapiro and Stockman 2001; Szeliski 2010).
39.2.1 Focus Shape from focus A method for estimating the scene depth at each pixel by changing the focus setting of the camera and finding the setting that gives the optimal focus (minimum blur) in the neighborhood of the pixel under investigation. Of course, pixels corresponding to different depths will reach optimal focus under different settings. It can be seen as an active approach. At the in-focus position, the image contains the most high-frequency information. It is assumed here that there is a model of the relationship between depth and image focus. The model has a series of parameters (such as optical parameters) that need to be calibrated. Note that the camera uses a large aperture
so that it can generate focused image points within the smallest possible depth interval. Compare shape from defocus. Same as depth from focus. Depth from focus A 3-D reconstruction using one camera method based on measuring the focus distance; it uses the relationship between the change in focal length and the depth of the scene when the camera lens focuses on scenes at different distances. The distance from a camera to a point in space can be determined by acquiring a series of images that are increasingly well focused. Also called autofocus and software focus. Focus following In image acquisition, when the object of interest moves, the camera focus is slowly changed to keep the object in focus. See depth from focus. Shape from zoom A problem in calculating the shape of a scene using at least two images acquired under different zoom settings. What is considered here is the distance from the sensor to each scene point. The basic idea is to differentiate the perspective projection equation with respect to the focal length to obtain a representation that relates the change in focal length to the pixel's movement along the depth, from which the required depth distance is calculated. Shape from defocus A method of estimating the scene depth at each pixel and estimating the surface shape of a scene from multiple images obtained under different controlled focus settings. It is assumed that the connection between depth and image focus is a closed model that contains a certain number of parameters (such as optical parameters) that need to be calibrated. Once the values of the pixels in the images are obtained, the depth can be estimated. Note that the camera uses a large aperture, so that whether the points in the scene are in focus can be judged within the smallest possible depth interval. Compare shape from focus. Probe pattern The pattern of light projected from a detector that can be used to assess the depth from defocus. Shape from scatter trace A method to restore the object shape of a transparent or translucent object using a moving light source or a moving camera or both, by combining multiple observations of a point. The set of measurements at each pixel is the scattering trajectory. Depth from defocus A 3-D reconstruction using one camera method that uses the relationship between scene depth, camera parameters, and the degree of image blur to obtain scene depth information from directly measurable parameters. The degree of blur of the object image gives an index of the depth of the object, and the blur increases when the object surface leaves the focus distance of the camera.
Therefore, the depth can be estimated from the amount of blur caused by defocus. There are several issues to note here: 1. The amount of blur increases in both directions with distance from the focal plane. Therefore, it is necessary to use two or more images taken at different focusing distances, or move the object in the depth direction and find the point that gives the sharpest image. 2. The magnification of the object varies with the focus distance or with the movement of the object. This can be modeled explicitly (but will complicate the problem) or use the principle of telecentric optics to approximate the camera (aperture placed in front of the lens is required). 3. The amount of defocus needs to be reliably estimated. A simple method is to average the square of the gradient in the region, but this is affected by several factors, including the target magnification problem mentioned above. A better approach is to carefully design the rational filter.
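A hedged sketch of the shape/depth-from-focus idea follows: for each pixel, pick the focus setting that maximizes a local sharpness (high-frequency) measure across a focal stack. The squared Laplacian smoothed over a window is one common focus measure, used here purely for illustration; converting the winning index to metric depth requires the calibrated depth-focus model mentioned above.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def depth_from_focus(stack, window=9):
    """Index of the sharpest focus setting at every pixel.

    stack : (K, H, W) images of the same scene acquired under K focus settings
    Returns an (H, W) integer map of the focus setting giving maximum local
    sharpness; a calibrated focus-distance table maps it to metric depth.
    """
    measures = np.stack([uniform_filter(laplace(img.astype(float)) ** 2, window)
                         for img in stack])          # per-setting focus measure
    return np.argmax(measures, axis=0)               # best setting per pixel
```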
39.2.2 Texture 3-D texture The texture and characteristics of an imaged 3-D surface. For example, in perspective projection, the density of texels varies with distance, from which some properties of 3-D surfaces, such as shape, distance, and orientation, can be estimated. Shape from texture A 3-D reconstruction using one camera method. According to the spatial change information of the surface texture of the scene, the surface shape of the scene is reconstructed, or the surface normal field is restored (it may differ from the surface shape by a scale factor). Because the surface orientation of the scene can reflect the shape of the object, reconstructing the shape from the texture is also called the surface orientation from the texture. The deformation (texture gradient) of the plane texture recorded in the image depends on the shape of the textured surface. Techniques for shape estimation include techniques that use statistical texture features and regular texture patterns. Surface orientation from texture See shape from texture. Isotropy assumption A typical assumption for the texture pattern in the process of surface orientation from texture. That is, for isotropic textures, the probability of finding a texture primitive on the texture plane has nothing to do with the orientation of the texture primitive. In other words, the probability model for isotropic textures does not need to consider the orientation of the coordinate system on the texture plane.
Fig. 39.2 Combination of texture maps
Recovery of texture map The last step of the 3-D reconstruction. After 3-D modeling the target, the surface of the target needs to be further restored. First you need to parameterize the texture coordinates as a function of the 3-D surface position (per mesh in the mesh representation). The second thing is to project the texture perspective onto the corresponding pixels of the image. View-dependent texture map 1. A method used in recovery of texture maps when there are multiple texture image sources. Different texture image sources are used for each polygon facet, and the weight of each source depends on the angle between the virtual camera (surface normal) and the source image. 2. A method similar to view interpolation. But instead of correlating independent depth maps with each input map and generating a 3-D model of the scene, different images are used as the source of the texture maps according to the current position of the virtual camera, as shown in Fig. 39.2. The contribution of viewpoints P1 and P2 depends on the angles θ1 and θ2 between the lines of sight of V1 and V2 with the line of virtual camera. Texture stereo technique A technique combining a shape from texture method and a stereo vision method. By obtaining two images of the scene at the same time and using the texture information in them to estimate the direction of the surface of the scene, the complex problem for matching corresponding points is avoided. In this method, the two imaging systems used are linked by a rotation transformation. Orientation from change of texture The main technique for shape from texture. The description of textures here is mainly based on the idea of structural approach, that is, complex textures are composed of basic texture elements repeatedly arranged and combined in a certain regular form. The use of the texture of the object surface to determine the surface orientation should consider the impact of the imaging process, which is specifically related to the connection between the texture of the scene and the texture of the image. During the process of acquiring an image, the texture structure on the original
Fig. 39.3 Texture change and surface orientation
scene may change when projected onto the image. This change may be different depending on the orientation of the surface on which the texture is located. Therefore, such a 2-D image has 3-D information on the orientation of object surface. There are mainly three types of texture changes, and the commonly used recovery orientation methods can also be divided into three types: (1) methods based on change of texel size, (2) methods based on the change of texel shape, and (3) methods based on changes of spatial relation among texels. Figure 39.3 gives an example of the relationship between the three types of texture changes and surface orientation in turn, and the arrows indicate the direction of maximum change. Figure 39.3a shows that due to the law of near big far small in perspective projection, texels with different positions will have different changes in size after projection. Figure 39.3b shows that the shape of the texels on the surface of the object may change to some extent after perspective projection imaging. Figure 39.3c shows that the change of the parallel grid line segment set after perspective projection imaging also indicates the direction of the plane where the line segment set is located. Methods based on change of texel size A method of orientation from change of texture. The orientation of the plane where the texels are located can be determined according to the direction of the maximum value of the texel projection size change rate. At this time, the direction of the texture gradient depends on the rotation angle of the texel around the camera axis, and the value of the texture gradient gives the angle at which the texel is inclined from the line of sight. Methods based on change of texel shape A method of orientation from change of texture. The orientation of the texture plane is determined by two angles, that is, the angle rotated relative to the camera axis and the angle inclined relative to the line of sight. For a given original texel, these two angles can be determined based on the changes after imaging. For example, on a plane, a texture consisting of a ring becomes an ellipse on an inclined plane. At this time, the orientation of the principal axis of the ellipse determines the
angle of rotation relative to the camera axis, and the ratio of the lengths of its long and short axes reflects the angle of inclination relative to the line of sight.

Methods based on change of spatial relation among texels
A method of orientation from change of texture. If the texture consists of a regular grid of texels, its vanishing points can be computed first and then used to determine its vanishing line. The direction of the vanishing line indicates the rotation angle of the texels relative to the camera axis, and the intersection of the vanishing line with the line x = 0 indicates the angle at which the texels are tilted relative to the line of sight.

Vanishing point
The intersection point of the line segments in a set of intersecting line segments in a grid of texels. For a perspective image, the vanishing point of a plane is formed by texels (consisting of parallel lines) projected onto the image plane from an infinite distance, that is, the convergence point of the parallel lines at infinity. It also refers to the image point at which two parallel 3-D lines meet at infinity. A pair of parallel 3-D lines can be expressed as a + sn and b + sn, where a and b are base points, n is the common direction, and s is a scalar parameter. The vanishing point is the image of the 3-D direction [n 0]^T.

Vanishing line
A 2-D image line produced by the intersection of a 3-D plane with the plane at infinity. It is also the straight line determined by two vanishing points from a grid of texels on the same surface. The horizon in an image is the image of the intersection of the ground plane with the plane at infinity, just as a pair of railway tracks meet at a vanishing point (the intersection of two parallel lines with the plane at infinity). Figure 39.4 shows the vanishing lines obtained where the ground meets roads and railways.

Fig. 39.4 Vanishing points and vanishing lines

Horizon line
A straight line defined by all vanishing points of the same plane. The most common horizon line is associated with the ground line, as shown in Fig. 39.5.

Ground plane
The horizontal plane corresponding to the ground, that is, the ground surface on which the target is placed. This concept is only really useful when the ground is substantially flat. Figure 39.6 shows a schematic diagram of the ground plane.
Fig. 39.5 Ground line schematic
Fig. 39.6 Schematic diagram of the ground plane
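As a small, hedged illustration of the vanishing point entry above: in homogeneous image coordinates, the line through two image points and the intersection of two image lines are both given by cross products, so the common intersection of two imaged parallel 3-D lines (e.g., the two rails of a track) can be computed directly. The function names and example points below are illustrative, not taken from the handbook.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous image line through two image points given as (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def vanishing_point(line1, line2):
    """Intersection of two homogeneous image lines, i.e., their cross product."""
    v = np.cross(line1, line2)
    return v[:2] / v[2]          # back to inhomogeneous pixel coordinates

# Two imaged rails (projections of parallel 3-D lines), each given by two points.
rail_a = line_through((100, 400), (220, 100))
rail_b = line_through((500, 400), (380, 100))
print(vanishing_point(rail_a, rail_b))   # their common intersection = vanishing point
```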
39.2.3 Shading

Image tone
The light and dark levels of an image, that is, the density levels from black through gray to white. If the tones of an image are rich, the image looks soft; if the tones are poor, the image looks harsh. See shade.

Image tone perspective
One of the ways of expressing the depth of space in photography. When light passes through the atmosphere, the diffusing effect of the air on the light makes the light-dark contrast, the sharpness of outlines, and the saliency of colors of nearby scenery appear greater than those of distant scenery; the closer the scenery, the stronger the effect, and the farther, the weaker. This phenomenon produces a tone perspective effect in the photograph, which can convey the spatial depth of the scene and the spatial position of objects.
Shading in image
The brightness variation over the image plane produced by imaging the object surface. The shading of an object expresses the spatial changes in brightness caused by the orientations of the various parts of the surface when the object is illuminated by the light in the scene, so the orientation of each part of the object surface can be estimated from it, thereby achieving reconstruction from shading.

Shading
1. Shadowing: steps and techniques for rendering 3-D scenes with different tones. See shade.
2. Density: the display effect obtained by changing the hue, color saturation, intensity, composition, or other attributes of display elements or of specific regions or parts of the displayed image in the screen display region.
3. Tone: a brightness pattern composed of light and dark gradation regions of a brightness image. Changes in surface brightness can be attributed to changes in illumination intensity, different surface orientations, and the surface's own reflectivity.

Shadow stripe
A variant of the method of light stripe ranging. A straight bar is swung to generate a shadow stripe, and distance is measured from the change of the shadow stripe across the scene surface. In this case only an ordinary point light source is needed.

cos² fringe
A fringe whose light intensity distribution is proportional to the square of a cosine. Here the cosine argument is inversely proportional to the wavelength λ and proportional to the position x at which the fringe occurs, that is, I = A cos²(Cx/λ), where A and C are constants of the experiment. It follows that the light and dark bands of such fringes have equal width.

Shading correction
A class of techniques for removing unwanted shading effects, such as an inconsistent intensity distribution due to non-uniform lighting. All such techniques assume a shading model, that is, a photometric model of the imaging that formalizes the relationship between the measured image brightness and the camera parameters (mainly gain and offset), the illumination, and the reflection. See shadow and photometry.

Ray tracing
1. A method of describing how light from a light source bounces multiple times in a scene before reaching the camera. Every ray is traced from the camera to the light source (or vice versa). It is suitable for scenes dominated by specular reflection (such as mirrors, polished balls, etc.). Compare radiosity.
2. A way of modeling the brightness of a scene. It is generally assumed that light leaves the light source, reaches a surface visible to the camera, is reflected, and arrives at the camera with changed intensity and color. If the scene is dominated by
highlight surfaces (such as glass objects or specular spheres), it is best to use ray tracing techniques to build the lighting model and determine the brightness of each surface.

Shape from shading
One of the methods of 3-D reconstruction using one camera. The shape of the scene surface is reconstructed from the shading of the image produced by the spatial variation of the brightness of the scene surface. In other words, the task is to recover the surface normal field (which determines the surface shape up to a scale factor) from the shading pattern (light and shadow) of a single image. The key idea is that, assuming a reflectance map for the scene (typically a Lambertian surface), an image brightness constraint equation can be written that relates the surface normal to the illumination intensity and the image brightness. This constraint can be used to recover the normals by additionally assuming local surface smoothness.

Reconstruction from shading
See shading in image.

Shading from shape
Given an image and a geometric model, the technique of recovering the reflectivity of an isolated object surface. It is, however, not exactly the inverse of the shape from shading problem.
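As a minimal sketch of the image brightness constraint mentioned under shape from shading: for a Lambertian surface the brightness is proportional to the cosine of the angle between the surface normal and the light direction, which can be written as a reflectance map over gradient space. This follows the N = [p q 1]^T convention used later in this chapter; the function names and light direction are illustrative assumptions.

```python
import numpy as np

def lambertian_brightness(normal, light, albedo=1.0):
    """Image brightness constraint for a Lambertian surface: E = albedo * max(0, N.L)."""
    n = normal / np.linalg.norm(normal)
    l = light / np.linalg.norm(light)
    return albedo * max(0.0, float(np.dot(n, l)))

def reflectance_map(p, q, light=(0.2, 0.3, 1.0)):
    """Reflectance map R(p, q) in gradient space, with surface normal [p, q, 1]."""
    return lambertian_brightness(np.array([p, q, 1.0]), np.asarray(light, float))

print(reflectance_map(0.0, 0.0))   # a facet facing the viewer
print(reflectance_map(0.5, -0.2))  # a tilted facet changes brightness with orientation
```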
39.2.4 Shadow

Shade
1. A term in the labeling of contours technique. If a continuous surface in a 3-D scene does not occlude part of another surface from the viewpoint but does block the light from illuminating that part, it casts a shadow on that part of the second surface. Such a shadow on the surface is not caused by the shape of the surface itself but is the result of other parts blocking the light.
2. See umbra of a surface.

Shadow
1. The result of light and dark variation in the image caused by (mutual) occlusion of the objects in the scene. There are two types: self-shadows and cast-shadows.
2. The shape of the shadow region together with the relative distribution of the corresponding light intensity. When the edge of an opaque object blocks the radiation of a point light source, the projected shadow is completely dark and is called the umbra. If the light source is extended, part of the light from the source can reach the region while another part cannot, and a penumbra is obtained.
3. A part of the scene that the illumination source cannot reach directly, either because of self-occlusion (resulting in an attached shadow or self-shadow) or because it is
blocked by other targets (resulting in a cast-shadow). This part is therefore darker than its surroundings. Examples of different shadows are shown in Fig. 39.7.

Fig. 39.7 Examples of different shadows (attached shadow and cast-shadow)

Umbra
A dark region resembling the outline of an object, formed when the light emitted by a point light source is blocked by an opaque object. Also refers to the completely dark region or part of a shadow (see Fig. 39.8). It can be seen as being produced by a special light source condition, namely a light source that cannot be reached at all.

Fig. 39.8 Various shadows

Penumbra
A region that is only partially shadowed because the light source is partially blocked. It differs from the completely shaded umbra region. Figure 39.9 shows a schematic diagram. It is the half-bright, half-dark part produced when a relatively large luminous body (light source) is blocked by an opaque object. In astronomy, the corresponding schematic diagram for an eclipse is shown in Fig. 39.10. The penumbra is one of the two types of shadow. At the junction of penumbra and umbra, partial eclipses occur. Compare umbra.

Fig. 39.9 Penumbra and umbra

Soft shadow
Same as penumbra.

Shadow detection
Using photometric characteristics to determine the shadow regions in an image that correspond to shadows in the scene. This is useful both for true color estimation and for region analysis. See color, color image segmentation, color matching, photometry, and region segmentation.
Fig. 39.10 Schematic diagram of the eclipse
Shadow matting
Matting in shadowed images. Similar to smoke matting, but instead of assuming a constant foreground, the color of each pixel is allowed to vary between fully illuminated and completely shadowed. This color can be estimated from the maximum and minimum values observed at each pixel as the shadow moves across the scene. The local shadow matte obtained in this way can be used to re-project the shadow into a new scene. If the target scene has non-planar geometry, it can be scanned by sweeping a stick-like shadow across the scene. The new shadow matte can then be combined with the computed deformation field so that the shadow drapes correctly over the new scene.

Shadow analysis
The process of analyzing surface regions that reflect onto each other in order to find shadows. At the borders of shadowed regions, the smoothing of pixel values acts on the regions affected by mutual reflection. For shadow analysis, a histogram of the brightness values of the pixels in the region to be inspected is computed. If there is no shadow border in the region, the histogram is balanced. Otherwise, if the region includes a lit part and a dark part, this shows up in the histogram as two separate peaks. In this case, the difference between the peaks can be determined and this difference added to the brightness value of all pixels in the shadow. Next, if the light source and/or surface is colored, the HSI values can be converted into RGB space.

Shadow type labeling
A problem or technique similar to shadow detection, but one that additionally requires classification of the different shadow types.

Self-shadow
The shadow produced when one part of an object in the scene occludes another part of the same object (self-occlusion); see the figure under shadow.

Cast-shadow
A shadow caused by different objects in the scene blocking each other, that is, the shadow caused by the projection of one target onto another (see the figure under shadow).
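The histogram step in the shadow analysis entry above can be sketched in code: build a brightness histogram of the region, check whether it has two peaks (a lit and a shaded part), and if so report the brightness offset between them. This is a rough illustration only; the peak-picking heuristic, function name, and synthetic data are assumptions, not the handbook's algorithm.

```python
import numpy as np

def shadow_offset(region, bins=64):
    """Crude bimodality check on a region's brightness histogram (shadow analysis sketch).

    Returns the estimated lit-vs-shaded brightness difference, or None if the
    histogram looks unimodal (no shadow border inside the region).
    """
    hist, edges = np.histogram(np.ravel(region), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    smooth = np.convolve(hist, np.ones(3) / 3.0, mode="same")   # light smoothing
    peaks = [i for i in range(1, bins - 1)
             if smooth[i] >= smooth[i - 1] and smooth[i] >= smooth[i + 1] and smooth[i] > 0]
    if len(peaks) < 2:
        return None                      # balanced histogram: no shadow border found
    peaks.sort(key=lambda i: smooth[i], reverse=True)
    dark, bright = sorted(centers[peaks[:2]])
    return bright - dark                 # add this to the pixels inside the shadow

rng = np.random.default_rng(0)
lit = rng.normal(160.0, 5.0, 500)        # synthetic lit and shaded brightness samples
shaded = rng.normal(90.0, 5.0, 500)
print(shadow_offset(np.concatenate([lit, shaded])))
```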
Shadow understanding
Estimating various characteristics of a 3-D scene from the appearance or size of shadows, such as the height of a person or the width of a road. See shadow type labeling.

Shape from shadows
A technique that uses a set of images of an outdoor scene obtained at different times (corresponding to different positions of the sun) to recover the geometric relationships of the scene. Various assumptions and knowledge about the position of the sun can be used to recover the geometric information. It is sometimes also referred to as "recovering/rebuilding from darkness." A similar method can also be used for indoor scenes, in which case the optical and geometric characteristics of the artificial light sources must be considered.
39.2.5 Other Cues

Shape from specularity
An algorithm for estimating local shape from the specularities of a surface. A specularity constrains the surface normal because the incident angle equals the reflection angle. Detecting the specularities in the image is itself, however, not an easy problem.

Epipolar plane image analysis
A reconstruction method, and a way of determining shape from motion, in which self-motion is analyzed in epipolar plane images. The slope of a straight line in an epipolar plane image is proportional to the distance between the target and the camera, and vertical lines correspond to features at infinity.

Shape from photo-consistency
A technique for multi-view (photo) reconstruction based on partitioning space. The basic constraint is that the shape under consideration must be "photo-consistent" with all input photos, that is, it must have consistent brightness in the photos from all cameras.

Photo-consistency
A concept describing the stitching or matching similarity between images. In much stereo vision work, the photo-consistency of an image with the reference image f_r(x, y) can be computed with the help of disparity space images:
C(x, y, d) = Σ_k D[ f(x, y, d, k) − f_r(x, y) ]
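A hedged sketch of this disparity space sum: assuming rectified grayscale views, so that a candidate disparity d is a pure horizontal shift, and taking the absolute difference as D (one common choice, not necessarily the handbook's). The function name is illustrative.

```python
import numpy as np

def photo_consistency(ref, others, max_disp):
    """Disparity space image C(x, y, d) = sum_k D[f(x, y, d, k) - f_ref(x, y)]."""
    ref = np.asarray(ref, dtype=float)
    h, w = ref.shape
    C = np.zeros((h, w, max_disp + 1))
    for d in range(max_disp + 1):
        for f in others:
            # Resample f at (x + d, y); np.roll wraps at the border (kept simple here).
            shifted = np.roll(np.asarray(f, dtype=float), -d, axis=1)
            C[:, :, d] += np.abs(shifted - ref)
    return C   # a low C(x, y, d) means the views are photo-consistent at disparity d

# Usage sketch: disparity = np.argmin(photo_consistency(ref, [f1, f2], 32), axis=2)
```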
See shape from photo-consistency.

Gradient space representation
A surface-orientation representation used in recovering shape from shading. In the schematic diagram of Fig. 39.11, a 3-D surface is represented as z = f(x, y); the normal of a surface element can then be expressed as N = [p q 1]^T, where (p, q) is the coordinate of a point G in the 2-D gradient space. With the help of the gradient space representation, the structural relationships formed by intersecting planes can be understood.

Fig. 39.11 Schematic diagram for gradient space representation

Self-occluding
The phenomenon in which a continuous surface of a 3-D object in the scene not only occludes parts of the surfaces of other objects but also occludes other parts of itself. Self-occlusion occurs only on concave objects.
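A small sketch of the gradient space representation described above: the gradient-space coordinates (p, q) and the corresponding normals of a surface z = f(x, y) can be computed numerically from a depth array. It follows the N = [p q 1]^T convention used in the entry; the names and the example plane are illustrative.

```python
import numpy as np

def gradient_space(z):
    """Gradient-space coordinates (p, q) and unit normals for a surface z = f(x, y)."""
    q, p = np.gradient(z)            # np.gradient returns d/dy (rows) first, then d/dx
    n = np.stack([p, q, np.ones_like(z)], axis=-1)       # N = [p q 1]^T per element
    n /= np.linalg.norm(n, axis=-1, keepdims=True)
    return p, q, n

# A tilted plane z = 0.3 x + 0.1 y maps to a single gradient-space point G = (0.3, 0.1).
y, x = np.mgrid[0:32, 0:32]
p, q, n = gradient_space(0.3 * x + 0.1 * y)
print(p[16, 16], q[16, 16])
```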
Chapter 40
Knowledge and Learning
Knowledge is the result of previous human understanding of the objective world and a summary of experience (Witten 2005). It can guide the current understanding of new changes in the objective world (Jähne and Haußecker 2000). The interpretation of scenes needs to be based on existing knowledge and learning (Forsyth and Ponce 2012; Goodfellow et al. 2016; Hartley and Zisserman 2004; Prince 2012; Sonka et al. 2014).
40.1
Knowledge and Model
Knowledge is the sum of experience accumulated by human beings in understanding the objective world (Witten 2005). Knowledge is often represented by models and is therefore often referred to directly as models (Zhang 2017c).
40.1.1 Knowledge Classification

Knowledge
The sum of human cognition of the world and the experience accumulated in understanding it. Knowledge is often expressed as rules or facts that can be used for inference. To use knowledge effectively, one needs to study what knowledge is available and how to represent and manage it.

Knowledge representation
The forms and methods used to represent knowledge. From the perspective of form, the representation of knowledge can be divided into procedural representation and declarative representation.
A general term for encoding knowledge in a computer. In computer vision systems, knowledge is often related to visual processing and object recognition. A common knowledge representation scheme is the geometric model, which records the 2-D and 3-D shapes of objects; another commonly used scheme is the frame.

Static knowledge
In declarative representations, knowledge describing non-dynamic facts about objects and their fixed relationships.

Object knowledge
In the classification of knowledge, the knowledge reflecting the characteristics of each object in the objective world.

Data mining
Information acquisition technology that extracts useful information and regularities from very large data sets through the patterns and knowledge embedded in the data. It is also the process of extracting useful information from large data collections. This information can be descriptive (providing an understanding of patterns or structures in the data) or predictive (estimating one or more variables from other variables). The focus here is on allowing people to learn and understand structural patterns or behaviors from data rather than relying on autonomous machine performance.

Image mining
An extension of traditional data mining. It studies how to obtain useful information by extracting the patterns and knowledge embedded in images (including videos). Traditional data mining mainly deals with structured data; images are not structured data, so new techniques need to be researched. Image content is mainly perceptual, and the interpretation of the information carried by an image is often subjective.

Classification of knowledge
The division of knowledge into classes for research and application. Common knowledge can be classified in many ways. One way divides knowledge into three categories: (1) scene knowledge, including the geometric models of the scene objects and the spatial relationships between them; (2) image knowledge, including the types of features in the image and their relationships, such as the mutual positions of lines, contours, and regions; and (3) the mapping knowledge between the scene and the image, including the attributes of the imaging system, such as focal length, projection direction, and spectral characteristics. Another way divides knowledge into three categories: (1) procedural knowledge, (2) visual knowledge, and (3) world knowledge. In the study of image understanding, knowledge can also be divided into two categories: (1) scene knowledge, that is, knowledge about the scene, especially about the objects in the scene, mainly objective facts about the objects under study; and (2) procedural knowledge, that is, knowledge about which knowledge can or should be used on which occasions and how to use it, in other words, knowledge about using knowledge.
Knowledge can also be classified according to content: (1) object knowledge, (2) event knowledge, (3) procedure knowledge, and (4) meta-knowledge. Meta-knowledge In classification of knowledge, a type of knowledge about knowledge. Scene knowledge In the objective world, the facts about the objects themselves and their interrelationships, as well as the laws concluded by experience. Represents the knowledge used in a type of image understanding, including the geometric model of the scene and the spatial relationship and interaction between them. This kind of knowledge is generally limited to certain definite scenes, also known as a priori knowledge about scene. Knowledge is often represented by a model and is therefore often referred to directly as a model. Image knowledge Knowledge used in image understanding. It includes the types of features in the image and their relationships, such as lines, contours, and regions. Compare scene knowledge and mapping knowledge. Mapping knowledge In image understanding, knowledge that can connect scenes with images. It mainly includes the properties of the imaging system, such as focal length, projection direction, and spectral characteristics. Procedural knowledge In image understanding, knowledge related to operations such as selecting algorithms and setting algorithm parameters. Knowledge can be divided into three categories: procedural knowledge, visual knowledge, and world knowledge. It is generally believed that the level of visual knowledge is lower than that of world knowledge, reflecting the low-to-medium content in the scene, and more specific and special than world knowledge. It is mainly used for preprocessing the scene and providing support for world knowledge. The level of world knowledge is higher than that of visual knowledge and reflects the high-level content in the scene. It is more abstract and comprehensive than visual knowledge. It is mainly used for high-level interpretation of the scene and is the basis for image understanding. Procedural knowledge is related to visual knowledge and world knowledge, and it helps to refine visual knowledge and acquire world knowledge. Visual knowledge Knowledge used in image understanding. Related to image-forming models, such as when a 3-D object is slanted and produces shadows. It is generally believed that visual knowledge reflects the middle- and low-level content in the scene, which is more specific and special, and is mainly used for preprocessing of the scene and provides support for world knowledge.
World knowledge A type of knowledge used in image understanding. World knowledge refers to the overall knowledge of the problem domain, such as the connection between objects in the image and the connection between the scenery and the environment. Because the level is relatively high and reflects the high-level content in the scene, it is more abstract and comprehensive than visual knowledge. It is mainly used for high-level interpretation of the scene and is the basis for image understanding. Visual routine A term developed by Ullman in 1984 to refer to a subunit of a vision system that performs a particular task, somewhat similar to behavior control in a robot. Perform It reflects how to operate in classification of knowledge.
40.1.2 Procedure Knowledge

Procedure knowledge
Knowledge about applying knowledge. See procedural knowledge.

Cybernetics knowledge
Same as procedure knowledge.

Layered cybernetics
A control regime in which the processes are executed in a fixed order.

Cybernetics
The behaviors and processes that use procedure knowledge to make decisions (control mechanisms). They depend on the purpose of the visual processing and on the knowledge of the scene. Depending on the control order and the degree of control, common control mechanisms and processes can be divided into several categories, as shown in Fig. 40.1.

Fig. 40.1 Classification of control mechanisms

Bottom-up cybernetics
A control framework and process; a layered cybernetics. Its characteristic is that the entire process depends essentially completely on the input image data (so it is also called data-driven control). Figure 40.2 shows the flowchart. The raw data is
gradually transformed into more organized and useful information through a series of processing steps, while the data volume is gradually compressed as the process progresses. In this control mechanism, knowledge about the objects is only used when matching the description of the scene. The advantage is that, merely by changing the object model, different objects can be handled. However, because the low-level processing has nothing to do with the application field, the low-level processing methods are not necessarily well suited to a specific scenario and therefore not necessarily very effective. In addition, if an undesired result occurs at some processing step, the system may not be able to detect it, the next processing step will inherit the error, and many errors may accumulate in the final result.

Fig. 40.2 Bottom-up control mechanism

Bottom-up
Reasoning from data to conclusions. In image engineering, it refers to using low-level data to generate hypotheses, which are continuously refined as the algorithm progresses. For example, a data-driven problem may adopt the following control strategy: instead of using an object model in the early stages, only general knowledge about the world is used; the features extracted from the observed image data are then collected and interpreted to produce a sufficiently high-level scene description. Compare top-down.

Data-driven
The bottom-up way of control in the control process. Its characteristic is that the entire process depends essentially completely on the input image data, so it is also called data-based control.

Top-down cybernetics
A layered cybernetics framework and process. Its characteristic is that the entire process is controlled by the (knowledge-based) model (so it is also called model-driven control); see Fig. 40.3. Generally, this kind of control is used for the verification of results, that is, to predict the target and verify the hypothesis. Because the
knowledge about the scene is used in multiple processing steps, and these steps only carry out the work necessary for them, the overall efficiency of this control is high. Most practical systems use this control. Note that this control is not suitable when too many hypotheses have to be made; otherwise the verification becomes too time-consuming to bear.

Fig. 40.3 Top-down control mechanism

Top-down
An inference method that searches the data for evidence of high-level hypotheses. For example, a hypothesize-and-test algorithm may have a strategy for guessing the position of a circle in an image, then compare the hypothesized circle with the edges in the image, and keep those circles that are supported by a good match. Another example is a human body recognizer that uses body parts (such as the head, torso, and limbs) obtained directly from the image data and can further identify smaller subcomponents.

Non-layered cybernetics
Control in which no fixed order is imposed on the execution of the processes. Heterarchical cybernetics is a typical non-layered cybernetics.

Feedback cybernetics
A cybernetics framework and theory in which part of the analysis results from a higher-level processing stage is fed back to a lower-level processing stage to improve its performance. It belongs to non-layered cybernetics. Figure 40.4 shows an example in which the result of feature extraction is fed back to the feature extraction step. Prior knowledge about the features is necessary here, because the extracted features are tested against this knowledge and the feature extraction process is improved on the basis of this knowledge. Other analysis results can also be fed back; for example, the results of scene description can be fed back into the feature extraction process, and the results of scene interpretation can be fed back into the interpretation process. Note that a feedback control framework only works effectively if the feedback results contain important information that can guide subsequent processing; otherwise it may cause waste.
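To make the top-down hypothesize-and-test idea above concrete, here is a small sketch: propose a circle (center and radius) and score it by how many points on the hypothesized circle land on or next to edge pixels in an edge map. The function names, tolerance, and synthetic edge map are illustrative assumptions.

```python
import numpy as np

def verify_circle(edge_map, cx, cy, r, samples=90, tol=1):
    """Fraction of sampled circle points that fall on (or next to) an edge pixel."""
    h, w = edge_map.shape
    hits = 0
    for t in np.linspace(0.0, 2.0 * np.pi, samples, endpoint=False):
        x = int(round(cx + r * np.cos(t)))
        y = int(round(cy + r * np.sin(t)))
        if 0 <= x < w and 0 <= y < h:
            patch = edge_map[max(0, y - tol):y + tol + 1, max(0, x - tol):x + tol + 1]
            hits += bool(patch.any())
    return hits / samples      # keep hypotheses whose support exceeds a threshold

# Tiny check: a synthetic circle of edge pixels supports the matching hypothesis.
edges = np.zeros((64, 64), dtype=bool)
for t in np.linspace(0.0, 2.0 * np.pi, 360):
    edges[int(32 + 20 * np.sin(t)), int(32 + 20 * np.cos(t))] = True
print(verify_circle(edges, 32, 32, 20), verify_circle(edges, 32, 32, 10))
```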
Fig. 40.4 Feedback cybernetics
Fig. 40.5 Heterarchical cybernetics
Feedback cybernetics can be divided into ordered feedback (predetermined feedback order) and disordered feedback (dynamically determined feedback order) according to whether there is a fixed feedback order. Feedback Use the output of a system to control the actions and behavior of the system. Heterarchical cybernetics A control framework or process. Belongs to non-layered cybernetics. The characteristic is that the processing process does not have a predetermined order. As shown in Fig. 40.5, part of the processing results of each stage will be reflected in the subsequent processing. From this point, it is essentially a feedback cybernetics, and the most effective processing means is selected for processing at any time. According to whether the knowledge is fed back to the lowest-level processing stage, the control mechanisms at different layers can be divided into two categories. One is to feed knowledge directly to the bottom. Generally speaking, the processing of raw data requires a lot of calculations. If feedback information can be used to
reduce the amount of calculation, feedback is meaningful. However, if high-level processing is started too early, the results will not be very good, because if the fed-back knowledge itself is not sufficient, it will not help the low-level processing much. The other type does not feed knowledge directly back to the lowest level (it feeds back only to higher layers); that is, an input image is processed to a certain extent without using scene knowledge, and knowledge is then used on this basis.

Demon
In knowledge representation, a special program running in the background. It is suitable for non-hierarchical control. Its characteristic is that it does not start when its name is called but starts when a critical situation occurs. It provides a modular knowledge representation that can represent both scene knowledge and cybernetics knowledge. It constantly watches the progress of visual processing and runs spontaneously when certain conditions are met. A common example is a program that verifies or guarantees that the modules of a complex system function properly.

Non-hierarchical control
A structured way of arranging the order of operations in an image interpretation system. Its characteristic is that there is no main process that orders the operations or operators; instead, each operator can observe the current result and decide whether it is able to operate and whether to perform the operation.

Knowledge of cybernetics
Knowledge of the strategies and decisions in an information processing process.

Control strategy
The guidelines and policies behind a sequence of processing steps in an automated image technique. For example, control can be top-down (searching the image data to verify a desired target) or bottom-up (gradually acting on the image data to infer hypotheses). A control strategy can also allow the selection of alternative processes, parameter values, or assumptions.

Model-based control strategy
A class of top-down, goal-oriented strategies that use hypothesize-and-test methods.

Man in the loop
Including a person in a control system, such as a driverless vehicle or a drone. The person remains in the loop until the system can perform the intended work fully automatically.
40.1.3 Models

Model
An abstract representation of certain objects or types of objects. Also, the form in which scene knowledge is represented in knowledge representation, that is, the factual characteristics of the scenery in the objective world. The term model
reflects the fact that any natural phenomenon can only be described to a certain degree (of accuracy or precision).

2-D model
A model for representing image characteristics in knowledge representation. Its advantage is that it can be used directly for matching images or image characteristics. Its disadvantage is that it cannot easily represent fully the geometric characteristics of objects in 3-D space and the relationships between objects.

3-D model
The description model of a 3-D scene, usually referring to the shape description model of the 3-D scene. Commonly used in model-based computer vision. The model representing the position and shape of 3-D objects in the scene, and their other relationships, in a knowledge representation is also a 3-D model.

Model comparison
In face recognition, a way of identifying by comparing a set of models. Each of these models interprets the data (matching test images with prototype images) in a different way, and the model that best explains the data (with the highest likelihood) must be selected.

Model acquisition
The process of learning to build a model, generally from observed examples of the structure to be modeled. This can be as simple as learning the coefficients of a distribution; for example, one can learn the texture characteristics that distinguish tumor cells from normal cells in an image. One can also learn the structure of an object, such as building a model of a building from video sequences collected from various viewpoints. Another way to obtain a model is to learn the characteristics of an object, such as which properties and relations distinguish a square from other geometric shapes.

Model building
Often refers to the process of building a geometric model, but it can also refer to the process of building other, broader models. In practice, a model can, for example, be built from observed examples of the structure in a video sequence. See model acquisition.

Image-based modeling
1. A 3-D modeling method that takes images as input and represents the scene as a hierarchical collection of depth images, often computed with the help of users.
2. The process of generating 3-D texture-mapped models from multiple images.

Dynamic correlation
A Monte Carlo random modeling and identification framework commonly used in financial modeling to identify trends.
Model parameter update
Adjustments and changes to model parameters. When model parameter learning is performed, the performance of the model is often optimized by means of an iterative algorithm, which results in the model parameters being updated in each iteration.

Model parameter learning
See parameter estimation.

Model reconstruction
See model acquisition.

Over-determined inverse problem
A problem in which, when building a model, the model is described by only a few parameters but there is a large amount of data for determining them. A common example is fitting a straight line through a large number of data points. In this case, it may not be possible to determine a straight line passing exactly through all the data points, only a straight line that minimizes the total distance from all data points. Compare under-determined problem.

Under-determined problem
A problem caused by having too little data available when building the model; that is, more parameters are needed to describe the model than the available data can determine. A typical example is computing dense motion vectors from an image sequence; another example is computing a depth map from a stereo pair. To solve such a problem, additional constraints must be added. Compare over-determined inverse problem.
40.1.4 Model Functions

Model assumption
An assumption made, on the basis of actual observations, in order to describe an objective scenario effectively and build a model. This can simplify the problem and remove some minor disturbances, but the resulting model only describes the objective scene with a certain degree of accuracy and correctness. The observed results may be completely consistent with the model assumptions, but this does not guarantee the correctness of the model assumptions. For example, when an object is illuminated by light of different intensities, the positions where the image gray level changes in space indicate the boundary of the object; however, if the thickness of the object varies and the light does not fall vertically, the places where the imaged gray value changes are not necessarily the boundary of the object.

Model order
The number of variables used to describe a model. The more variables, the higher the order; for example, the number of lag variables in an autoregressive model.
Model order selection
A special case of model selection in which there is a series of similar models of increasing complexity, for example, autoregressive models with different history lengths or polynomial regression models with polynomials of different orders.

Model structure
In a representation (model), the organization of primitives such as edge segments or surfaces.

Latent structure
Non-quantified or hidden geometric relationships, statistics, or correlation characteristics exhibited by a system.

Model exploitation
A directed detection method in which both positive and negative knowledge are used to guide the search process. Compare model exploration.

Model exploration
Non-directed detection of the model, as opposed to model exploitation.

Model base
A database containing multiple models, often used as part of a recognition process.

Model library
Same as model base.

Model base indexing
The task of selecting from a model base one or more candidate models for a known structure in the scene. This often avoids exhaustively invoking and testing every member of the model base.

Model robustness
The property of a statistical model whose performance is not significantly affected by outliers. See robust statistics.

Model matching
Identifying a corresponding, often known, model from some unknown data, for example, matching an unknown 2-D or 3-D shape extracted from an image to a shape from a database of known object models. See object recognition.

Model registration
A general term for (spatially) aligning a geometric model with a set of image data. This process may require estimating the coordinate transformation, such as rotation, translation, and scaling, that maps the model onto the image data. The parameters of the object model, such as the object size, may also have to be estimated, and when fitting, perspective distortion may need to be considered. Figure 40.6 shows an example: the 2-D model in the left image is registered to the corresponding position in the gray-level image on the right.
Fig. 40.6 A model registration example
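As a rough sketch of the kind of coordinate transform estimation mentioned under model registration: given matched 2-D model and image points, a scale, rotation, and translation can be estimated by least squares (a Procrustes-style fit). This is one possible formulation under the stated assumptions, not the handbook's specific algorithm; names and the demo points are illustrative.

```python
import numpy as np

def fit_similarity(model_pts, image_pts):
    """Least-squares similarity transform: image ~ s * R @ model + t (2-D points, Nx2)."""
    m = model_pts - model_pts.mean(axis=0)
    d = image_pts - image_pts.mean(axis=0)
    U, S, Vt = np.linalg.svd(d.T @ m)              # 2x2 cross-covariance (unnormalized)
    sign = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, sign])                       # guard against an accidental reflection
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / (m ** 2).sum()
    t = image_pts.mean(axis=0) - s * R @ model_pts.mean(axis=0)
    return s, R, t

# A square model observed scaled by 2, rotated 30 degrees, and translated.
model = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
a = np.deg2rad(30.0)
R_true = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
image = 2.0 * model @ R_true.T + np.array([5.0, -3.0])
print(fit_similarity(model, image))                # recovers scale 2, 30 degrees, (5, -3)
```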
Model fitting
See model registration.

Model-driven
The top-down way of control in the control process. Its characteristic is that the entire process is controlled by a (knowledge-based) model, hence the name model-driven.

Model invocation
See model base indexing.

Model inference
The process of parameter estimation within one model, or of model selection among several models.

Model topology
A field of mathematics concerned with properties of the model that are invariant under continuous deformation of the modeled object, such as connectivity.

Model selection
1. The work or task of selecting a model from many candidate models, the goal being to select the "best" model among them. For supervised learning tasks, "best" is most usefully defined in terms of predictive performance on test data. An important issue to consider here is overfitting: performance on the training set is not necessarily a good indicator of performance on the test data set. Generally, the more parameters a model has, the greater the danger of overfitting, so methods that penalize the number of parameters are often used, such as the Akaike information criterion, the Bayesian information criterion, and minimum description length. The models can also be compared in a Bayesian way on the basis of marginal likelihood and prior probability. Probably the most commonly used method for selecting predictive models is cross-validation. In addition, there is no need always to choose a single model; the predictions of different models can also be combined, as in ensemble learning or Bayesian model averaging.
2. A simple approximation to Bayesian model averaging in which only the single most likely model is used to make predictions.
3. In image mosaicking, determining the most suitable registration model for a given situation or data set is also a model selection problem.
4. See model base indexing.

Model evidence
See Bayesian model comparison.
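A rough illustration of model order selection with a held-out validation set, as discussed under model selection and model order selection: fit polynomials of increasing order and keep the one with the lowest validation error (overfitting shows up as training error falling while validation error rises). The data and split are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 60)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, x.size)     # noisy samples of a smooth curve

train, val = np.arange(0, 60, 2), np.arange(1, 60, 2)  # simple hold-out split
errors = {}
for order in range(1, 9):                              # candidate model orders
    coeffs = np.polyfit(x[train], y[train], order)
    resid = np.polyval(coeffs, x[val]) - y[val]
    errors[order] = float(np.mean(resid ** 2))         # validation error, not training error

best = min(errors, key=errors.get)
print(best, errors[best])                              # a modest order usually wins here
```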
40.2
Knowledge Representation Schemes
When human knowledge is incorporated into image understanding, it is also necessary to determine in what form the knowledge is represented, that is, knowledge representation (Jähne and Haußecker 2000; Shapiro and Stockman 2001; Witten 2005).
40.2.1 Knowledge Representation Models

Knowledge representation scheme
A specific method of representing knowledge, for example, the demon, blackboard system, database of image functions, language of image processing, agent of image understanding, frame, unit, logic system, semantic network, and production system.

Procedural representation
A form of knowledge representation in which a body of knowledge is represented as how to apply it (procedure knowledge). The advantage of this form is that it is easy to represent existing knowledge about how to do something and heuristic knowledge about how to do it effectively. It is also a class of representations used in artificial intelligence to record procedural knowledge of how to perform a task; a traditional example is the production system. In contrast, declarative representations record how an entity is constructed.

Declarative representation
In knowledge representation, a general way and process of representing knowledge as a stable set of state facts together with controls over these facts (also known as model knowledge). The knowledge in a declarative representation is fixed knowledge about objects and the relationships between them. If such representations are to be used for pattern recognition, matching is required. One advantage of this method is that it can be used for different purposes: each fact needs to be stored only once, and the number of times it is stored is independent of the number of ways in which these facts are used. New facts can easily be added to the existing collection without changing other facts or changing the representation process.

Iconic model
1. A representation form with image characteristics, for example, the templates in template matching.
2. One of the two models of declarative representation. It is a simplified diagram that retains features. To build an iconic model, one only needs to use the image of the
object. Because essentially only the shape of the object is considered, and not its gray level, it can be matched directly with the input image and the processing is simple. The iconic model is effective when the number of objects in the image is relatively small, but this condition is not easily satisfied in a normal 3-D scene. Compare graph model.

Iconic
A symbol or representation with image characteristics.

Database of image functions
A knowledge representation method that embeds the knowledge to be represented into image functions and collects them into a library. The functions provided in the library can include input-output operations (such as read, store, digitize), single-pixel-based operations (such as histograms), neighborhood-based operations (such as convolution), image operations (such as arithmetic and logic), transforms (such as the Fourier transform and Hough transform), texture analysis, image segmentation, feature extraction, object representation, object description, pattern recognition, pattern classification, as well as drawing, display, and other auxiliary operations.

Frame
1. A way of representing knowledge. It can be regarded as a complex structured semantic network (framework), suitable for recording a related set of facts, inference rules, preconditions, etc.
2. Short for frame image: a complete standard television picture with even and odd fields.

Semantic net
A graph representation in which nodes represent objects (with their attributes) in a given domain and arcs represent the (possibly directed) connections between the objects. Figure 40.7 shows an arch on the left and its semantic net representation on the right. See symbolic object representation, graph model, and relational graph.

Semantic network
A method of knowledge representation. A semantic network is a labeled directed graph in which objects are represented as nodes and the links between objects are represented as labeled arcs connecting the nodes. The semantic network gives a visual representation of the connections between the elements in the image and can represent knowledge effectively from a visual perspective.
Fig. 40.7 Semantic net representation (nodes: arch, beam, columns; arcs: part, support)
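A minimal sketch of how the arch semantic net of Fig. 40.7 could be held in code: nodes for the objects and labeled, directed arcs for the "part" and "support" relations. The data structure and names are purely illustrative.

```python
# Nodes are objects; each arc is (from_node, relation_label, to_node).
semantic_net = [
    ("arch", "part", "beam"),
    ("arch", "part", "column-1"),
    ("arch", "part", "column-2"),
    ("column-1", "support", "beam"),
    ("column-2", "support", "beam"),
]

def related(net, node, label):
    """Follow all arcs with a given label leaving a node."""
    return [dst for src, rel, dst in net if src == node and rel == label]

print(related(semantic_net, "arch", "part"))         # the parts of the arch
print(related(semantic_net, "column-1", "support"))  # what column-1 supports
```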
Production system
1. A method of knowledge representation; also a form of modular knowledge representation system, also called a rule-based system or expert system. Knowledge is represented as a collection of demons called production rules. Production systems are suitable for manipulating symbols but are less effective for signal processing. With the development of distributed computing, the production system has gradually become modularized and evolved into the blackboard system.
2. A computational method of logical reasoning in which the logic is expressed as a set of "production rules." A rule has the form "LHS → RHS," that is, it states that if the pattern or condition set recorded on the left-hand side (LHS) is true, then the action specified on the right-hand side (RHS) is performed. A simple example is: "If fewer than 10 edge segments are detected, reduce the threshold by 10%."
3. An industrial system that makes certain products.
4. A system intended for actual use, as opposed to a demonstration system.

Production rule
A rule for representing knowledge in a production system. Each production rule has the condition-action form.

Condition-action
The form of rules in production systems, also often referred to as production rules. It refers to formal rules of the form α → β or IF α THEN β, where α is the left term or antecedent of the production and β is the right term or consequent of the production.

Expert system
A human-machine system with the ability to solve a specific problem (based on the knowledge of specially trained experts). A modular knowledge representation system that uses accessible knowledge and heuristics to solve problems. It works on the basis of expert knowledge, is rule-based, and is generally limited to a particular domain. It is a typical application of production systems. See knowledge-based vision and production system.

Blackboard
In artificial intelligence, a special category of database that can be accessed as a whole; also a method of knowledge representation. It is a data structure that represents the status of information processing and is also suitable for non-layered cybernetics. Its knowledge sources can be feature extraction, image segmentation, region classification, or object recognition. In an expert system, it is used to record items such as the current data, working hypotheses, and intermediate decisions about a specific problem. It is also often called temporary storage or short-term storage.
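A toy sketch of the condition-action form of production rules, built around the edge-threshold rule quoted in the entry above. The working-memory dictionary, rule functions, and cycle cap are illustrative assumptions; in a real system the edge detector would be re-run after each firing so the conditions change.

```python
# Each production rule is a (condition, action) pair over a shared working memory.
rules = [
    (lambda s: s["edge_segments"] < 10,                     # LHS: condition / antecedent
     lambda s: s.update(threshold=s["threshold"] * 0.9)),   # RHS: action / consequent
    (lambda s: s["edge_segments"] > 500,
     lambda s: s.update(threshold=s["threshold"] * 1.1)),
]

def run_production_system(state, rules, max_cycles=10):
    """Repeatedly fire the first rule whose condition holds (a very small interpreter)."""
    for _ in range(max_cycles):           # the cycle cap keeps this demo from looping forever
        fired = False
        for condition, action in rules:
            if condition(state):
                action(state)
                fired = True
                break
        if not fired:
            break
    return state

print(run_production_system({"edge_segments": 7, "threshold": 100.0}, rules))
```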
40.2.2 Knowledge Base

Knowledge base
An abstract representation of the world; a collection of knowledge. A knowledge base has a hierarchical structure, and its content can be divided into analogical representation models and propositional representation models, which are similar to human representations of the world. A knowledge base should have the following properties:
1. It can represent analogies, propositions, and process structures.
2. It allows quick access to information.
3. It can be easily expanded.
4. It supports querying the category structure.
5. It allows connections and transformations between different structures.
6. It supports belief maintenance, inference, and planning.
Propositional representation model A kind of knowledge base representation model. The properties of propositional representations include (1) dispersion, (2) discreteness, (3) abstraction, and (4) inference. Dispersion 1. One of the characteristics of propositional representations in a propositional representation model. An element represented by a proposition may appear in many different propositions, and these propositions can be represented in a consistent manner using the semantic network. 2. The phenomenon that white light is broken down into various colors. That is, white light will be dispersed into various colors when it is refracted. To be precise, this corresponds to the quantitative relationship of the refractive index n of a medium with the wavelength λ. 3. The scattering of light as it passes through a medium, that is, diffusion. Discreteness In the propositional representation model, one of the characteristics of propositional representation. Propositions usually do not represent continuous change, but they can make propositions approach continuous-change values with arbitrary precision. Abstraction In the propositional representation model, one of the characteristics of propositional representation. Propositions are either true or false and have no similarities in geometry and structure to the situation being represented. Inference 1. A thinking process that derives an unknown conclusion from one or more known judgments (prerequisites). The main methods include deductive reasoning and inductive reasoning. Deductive reasoning is based on general laws, using logical
proofs or mathematical operations to derive the laws that special facts should follow, proceeding from the general to the special. Inductive reasoning is the generalization of general concepts, principles, or conclusions from many individual things, proceeding from the special to the general.
2. One of the three key issues of the hidden Markov model. It is also the first step of the classification problem (the second step is decision). In this step, training data are needed to learn the posterior probabilities of the various classes.
3. Reasoning: one of the characteristics of propositional representations in a propositional representation model. Propositional representations can be manipulated with fairly uniform computations, and these computations can be used to develop reasoning/inference rules that derive new propositions from old ones.

Analogical representation model
A kind of knowledge base representation model. An analogical representation has the following characteristics: coherence, continuity, analogy, and simulation.

Coherence
One of the characteristics of the analogical representation model: each element of a represented situation appears only once, and the relationship between each element and the other elements can be accessed.

Continuity
One of the characteristics of the analogical representation model. For example, similar to the continuity of motion and time in the physical world, an analogical representation allows continuous states of change to be represented.

Analogy
1. One of the characteristics of the analogical representation model, namely that the structure of the representation reflects, or is equivalent to, the relational structure of the situation being represented.
2. A class of work in object recognition whose goal is to find similarities between the transformations of different objects.

Simulation
One of the characteristics of the analogical representation model: the analogical model can be queried and manipulated by arbitrarily complicated computational processes, which simulate the represented situation physically and geometrically.
40.2.3 Logic System

Logic system
A knowledge representation method based on predicate logic.

Predicate logic
A type of knowledge that characterizes the relationships between specific things or abstract concepts. Also called first-order logic.
Unit
A knowledge representation method. It can be seen as an attempt to explain the frame representation in terms of predicate logic.

Predicate calculus
A symbolic language and an important element of predicate logic; also called mathematical calculus. In most cases, logic systems are based on the first-order predicate calculus and can represent almost anything.

Resolution
1. A parameter that reflects the fineness of an image and corresponds to the amount of image data in image engineering. It indicates the number of pixels per unit area, scale, perspective, etc. A 2-D image can be represented by a 2-D array f(x, y), so spatial resolution must be considered in the X and Y directions, and magnitude resolution must be considered for the property F.
2. The ability to distinguish or recognize under different circumstances, for example:
(1) The resolution of the human eye: the ability of the human eye to recognize details. It depends on the shape and illumination of the object. Generally, the two ends of an object cannot be distinguished when the angle they subtend at the eye is smaller than about 1 arc minute.
(2) The resolution of a telescope: the angle between two just-distinguishable points relative to the instrument's entrance pupil. The image of a point light source in a telescope is a diffraction pattern, a bright spot in the center surrounded by rings of alternating light and dark. The diffraction spots of two adjacent point images partially overlap; they can be distinguished when the depth of the valley where the two spots overlap is 19% of the peak, which is the Rayleigh criterion of resolution. If D is the entrance pupil diameter (objective lens diameter), the resolution limit is 1.22λ/D, where λ is the wavelength of light; at this limit the ratio of the valley depth to the peak is 0.19.
(3) The resolution of a microscope: the distance between two points of the object that can just be distinguished. The minimum resolvable distance according to the Rayleigh criterion is 0.61λ/(n sin ϕ), where λ is the wavelength, n is the refractive index of the object-space medium, and ϕ is half the opening angle (aperture angle) subtended by the edge of the objective lens at the object.
(4) The ability of an instrument to separate two adjacent spectral lines. The ratio of the wavelength λ to the smallest resolvable wavelength difference Δλ, that is, λ/Δλ, represents the resolution. For a prism, according to the Rayleigh criterion, the resolution is λ/Δλ = b·dδ/dλ, where b is the width of the beam falling on the prism and dδ/dλ is the angular dispersion (δ is the deflection angle). When the prism is at the position of minimum deflection, this becomes t·dn/dλ, where t is the length of the thickest part of the prism and n is its refractive index.
3. The inference rule used in the predicate calculus to prove theorems. The basic steps are first to express the basic elements of the problem in clause form, then to find the antecedent and consequent of the implications that
can be matched, and then to substitute variables so that the atoms become equal and the match can be made. The resolvent clause obtained after matching consists of the unmatched left and right parts. In this way, the proof of a theorem is transformed into resolving clauses until an empty clause is produced, and the empty clause signals a contradiction. In the sense that all true theorems can be proved, this resolution rule is complete; in the sense that no false theorem can be proved, this resolution rule is sound.

Resolvent
In the predicate calculus, the clause obtained after matching using the resolution inference rule. It consists of the unmatched left and right parts. The proof of a theorem can be transformed into resolving clauses until an empty clause is produced, and the empty clause signals a contradiction.

Existential quantifier
A quantifier in the predicate calculus. It is written "∃", and ∃x means that there exists an x.

Universal quantifier
A quantifier in the predicate calculus expressing "for all". It is written "∀", and ∀x means for all x.

Antecedent
In the predicate calculus, the left part of an implication written with "⇒", connected by logical conjunction. If the antecedent is empty, the implication "⇒ P" can be regarded as representing P.

Consequent
In the predicate calculus, the right part of an implication written with "⇒", connected by logical conjunction. If the consequent is empty, the implication "P ⇒" represents the negation of P, that is, "~P".

Logic predicate
Generally regarded as a type of well-organized and widely used knowledge. See predicate logic.

Well-formed formula [WFF]
A legal predicate calculus expression.

Conjunction
A special logical expression obtained by connecting other expressions with ∧. For example, if the predicate "DIGITAL(I)" represents the sentence "Image I is a digital image" and the predicate "SCAN(I)" represents the sentence "Image I is a scanned image," then the clause "DIGITAL(I) ∧ SCAN(I)" represents the sentence "Image I is a digital image and a scanned image." Compare disjunction.

Disjunction
A special logical expression obtained by connecting other expressions with ∨. For example, if the predicate "DIGITAL(I)" represents the sentence "Image I is a
digital image,” and the predicate “ANALOGUE(I)” represents the sentence “image I is an analog image,” then the clause “DIGITAL(I) ˅ ANALOGUE(I)” means the sentence “Image I is a digital image or an analog image.” Compare conjunction.

Clausal form syntax A syntax followed by logical representations. It can be used for inference and learning in image understanding. The logical expression can be written as (∀x1, x2, . . ., xk) [A1 ˄ A2 ˄ . . . ˄ An ⇒ B1 ˅ B2 ˅ . . . ˅ Bm], where each A and B is an atom. The left and right parts of the clause are called the condition and the conclusion of the clause, respectively.

Non-clausal form syntax A syntax followed by logical representations. Logical representations that follow this syntax include atoms, logical conjunctions, existential quantifiers, and universal quantifiers.
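As an illustration of the resolution rule described above, the following Python sketch applies resolution to propositional clauses only; it omits the variable substitution (unification) step that full predicate-calculus resolution requires, and the clauses at the end are a made-up example.

# Minimal sketch of resolution by refutation on propositional clauses.
# A clause is a frozenset of literals; "~P" denotes the negation of "P".

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """All resolvents of two clauses: drop one complementary pair of literals."""
    out = []
    for lit in c1:
        if negate(lit) in c2:
            out.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return out

def refute(clauses):
    """Return True if the empty clause (a contradiction) can be derived."""
    clauses = set(clauses)
    while True:
        new = set()
        for a in clauses:
            for b in clauses:
                if a == b:
                    continue
                for r in resolve(a, b):
                    if not r:
                        return True        # empty clause: theorem proved
                    new.add(frozenset(r))
        if new <= clauses:
            return False                   # nothing new can be derived
        clauses |= new

# Premises: P => Q (written as the clause ~P or Q) and P; negated conclusion: ~Q.
kb = [frozenset({"~P", "Q"}), frozenset({"P"}), frozenset({"~Q"})]
print(refute(kb))   # True: Q follows from P and P => Q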
40.3 Learning
The model can also be established step by step based on the image data obtained from the object and scene. Such a modeling process is essentially a learning process. There are many ways to learn (Hartley and Zisserman 2004; Jähne and Haußecker 2000; Prince 2012; Szeliski 2010).
40.3.1 Statistical Learning

Statistical learning The process of using a statistical model and related data to reason about the distribution that generated the data. It mainly includes two types: supervised learning and unsupervised learning.

Statistical learning theory A theoretical framework for studying machine learning, also known as computational learning theory. It originates from the probably approximately correct learning framework.

Computational learning theory Same as statistical learning theory.

Statistical relational learning [SRL] A machine learning method that integrates techniques such as relational/logical representation, probabilistic inference, and data mining to obtain a likelihood model of relational data.
Probabilistic model learning The process of parameter estimation in a statistical model or of model selection in a set of models. See statistical learning.

Bayesian model learning See probabilistic model learning.

Transferable belief model [TBM] Can be seen as a generalization of the probability model. It is a partial-knowledge representation model that allows explicit modeling of the doubt (uncertainty) between several hypotheses. It can handle imprecise and uncertain information and provides some tools to combine this information.

Learning 1. In pattern recognition, the process of obtaining the relationships between patterns directly from the sample patterns, or the process of obtaining a decision function based on the training samples. The latter case is often called training. 2. One of the three key issues of the hidden Markov model. 3. Acquisition of knowledge or skills. For a more specific definition in most cases, see machine learning. See also supervised learning and unsupervised learning. 4. The process of parameter estimation and model selection.

Learning classifier A classifier that performs an iterative training process. The classification performance and accuracy can be enhanced after each iteration.

Strong learner A learner that, with high probability, achieves an error very close to zero. Opposite of weak learner. Bootstrap is a way to combine many weak learners to obtain a strong learner.

Weak learner The base classifier used in the bootstrap. This is a class of learners that (with high probability) is definitely better than random selection but, for a given problem, performs much worse than a strong learner. Bootstrap is a way to combine many weak learners to obtain a strong learner. See weak classifier.

Cascaded learning A supervised learning technique in which the weak classifier (weak learner) with the best performance in a large set is successively selected to build a binary classifier. It starts with a bootstrap method on the entire training set and then concentrates on the samples misclassified by the previously selected classifiers. The selected weak learners are sorted according to their training performance, and unseen samples are classified by the cascade. Commonly used for object detection. Figure 40.8 shows a schematic diagram of a cascaded learning classifier, in which the object is detected by passing through a series of cascaded classifiers.
Fig. 40.8 Cascaded classifiers: a candidate passes through classifiers C1, C2, . . ., Cn in sequence; a refusal at any stage rejects it, and the object is detected only after it passes all stages
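A minimal Python sketch of the cascade idea follows; the stage classifiers and thresholds are illustrative assumptions, not taken from the handbook. A candidate is rejected as soon as any stage refuses it and is reported as detected only if every stage passes it.

def cascade_classify(candidate, stages):
    """stages: list of functions, each returning True (pass) or False (refuse)."""
    for stage in stages:
        if not stage(candidate):
            return False          # early rejection, as in Fig. 40.8
    return True                   # passed all stages: object detected

# Toy stages ordered from cheap/coarse to strict (hypothetical feature names).
stages = [
    lambda x: x["contrast"] > 0.2,
    lambda x: x["edge_density"] > 0.5,
    lambda x: x["template_score"] > 0.8,
]

print(cascade_classify({"contrast": 0.4, "edge_density": 0.7, "template_score": 0.9}, stages))  # True
print(cascade_classify({"contrast": 0.1, "edge_density": 0.9, "template_score": 0.9}, stages))  # False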
Learning rate parameter A parameter that weights the adjustment made at each iteration of the learning process. For example, k in the following formula: w(n+1) = w(n) + kΔ.

Probably approximately correct [PAC] learning framework A theoretical framework for studying machine learning; it is the predecessor of statistical learning theory. It makes use of the probably approximately correct judgment criterion, and its purpose is to understand how large a data set needs to be for good generalization and to bound the computational cost of learning. The bounds derived from a probably approximately correct framework often describe the worst case, because they hold for any distribution p(x, t), as long as the training and test samples are drawn (randomly) from the same distribution, and for any function f(x), as long as it belongs to F. The distributions faced in practical applications often have obvious regularity, so if no assumptions are made about the form of the distribution, the obtained bounds will be very conservative, or the size of the data set needed to reach a certain generalization ability will be overestimated. An improvement on this is the PAC-Bayesian framework, which relaxes the requirement that f(x) belongs to F but still does not restrict the distribution, so the obtained bound remains relatively conservative.

Probably approximately correct [PAC] A probability criterion. Suppose a data set D of size N is obtained from a joint distribution p(x, t), where x is the input variable and t is the class label. Consider only the noise-free case, where the class label is determined by some (unknown) deterministic function t = g(x). Consider the function f(x; D), which is obtained from the function space F based on the training set D. If its expected error rate is lower than a predetermined threshold ε, that is, if it satisfies the following formula, the function f(x; D) is said to have good generalization ability.
E[I(f(x; D) ≠ t)] < ε

Among them, I(•) is the indicator function, and the expectation is taken relative to the distribution p(x, t). The left side of the above formula is a random variable because it depends on the training set D, and the probably approximately correct learning framework requires that, for a data set D drawn from p(x, t), the above formula hold with probability greater than 1 – δ. Here δ is another pre-specified parameter: “approximately” refers to the requirement of a small error rate (less than ε), and “probably” refers to requiring this with high probability (greater than 1 – δ). For a given model space F and given parameters ε and δ, probably approximately correct learning aims to provide a minimum bound on the size N of the data set that meets the criterion. A key quantity in probably approximately correct learning is the Vapnik-Chervonenkis dimension, that is, the VC dimension, which provides a measure of the complexity of a function space and allows the probably approximately correct framework to be extended to spaces containing an infinite number of functions.

PAC-Bayesian framework See probably approximately correct.

Vapnik-Chervonenkis [VC] dimension See probably approximately correct learning framework.
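The following Python sketch illustrates the role of the learning rate parameter k in an update of the form w(n+1) = w(n) + kΔ; the least-squares objective and the toy data are illustrative assumptions.

import numpy as np

def train(X, t, k=0.1, n_iter=100):
    """Iteratively adjust the weights w; k is the learning rate parameter."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        delta = X.T @ (t - X @ w)    # adjustment direction (negative gradient)
        w = w + k * delta / len(t)   # w(n+1) = w(n) + k * Delta
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 2.0, 3.0])
print(train(X, t))   # approaches [1, 1]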
40.3.2 Machine Learning

Machine learning The process by which computer systems achieve self-improvement through self-learning. The main emphasis is on studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills, or to reorganize existing knowledge structures, so as to continuously improve their performance. Machine learning problems can often be reduced to search problems; different learning methods are defined and distinguished by different search strategies and search space structures. One of the main research directions of machine learning is automatically learning to identify complex patterns and make intelligent decisions based on data. It is a collection of methods for automatically analyzing structure in data. The research work can be divided into two categories:
1. Unsupervised learning: also known as descriptive modeling, where the goal is to discover patterns or structures of interest in the data.
2. Supervised learning: also known as predictive modeling, where the goal is to predict the value of one or more variables (on the basis of some other given values).
These goals are similar to those of data mining, but here the focus is more on automated machine performance than on how people learn from data.
High-dimensional data In machine learning, a setting in which a large number of features are used to describe the signals in a database D = [d1, d2, . . ., dn], with the number of features comparable to the number of samples n.

Supervised learning A collective term for a large class of machine learning methods. It refers to the techniques and processes used in pattern classification or pattern recognition to adjust the parameters of the classifier by using a set of (training) samples of known categories so as to achieve the required performance. Its task can be expressed as follows: given an input x, predict a response variable or output y on the basis of a set of training samples. The output can be a category label (for a classification problem), a continuous-valued variable (for a regression problem), or a more general goal. The training data used for learning include the input vectors and their corresponding target vectors. See supervised classification. Compare unsupervised learning.

Unsupervised learning A general term for a large class of learning methods that find patterns or structures of interest in data without teaching or supervising signals to guide the search process. The training data used for learning include the input vectors but no corresponding target vectors. The goals of unsupervised learning include:
1. Clustering: find collections of similar samples in the data.
2. Density estimation: determine the distribution of the data in the input space.
3. Visualization: project the data from a high-dimensional space to a 2-D or 3-D space.
It can also refer to a class of machine learning methods, namely the techniques and processes of pattern classification or recognition that do not use (training) samples of known categories and instead rely on learned clusters to adjust the parameters of the classifier to achieve the required performance. See unsupervised classification. Examples of unsupervised learning include clustering, latent variable models, dimensionality reduction, and so on. Compare supervised learning.

Unsupervised feature selection Feature selection for an unsupervised learning problem with no response or target variables. For example, features that are highly correlated with or dependent on the retained subset can be eliminated, or a subset of features can be selected to maximize an indicator of clustering performance.

Semi-supervised learning A learning method between supervised learning and unsupervised learning. In supervised learning, the data set contains multiple input-output pairs. In semi-supervised learning, learners can get more examples but only the inputs (i.e., these examples are not labeled). The learning objective is to improve performance on the unlabeled examples by using the labeled examples.
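To make the contrast between supervised and unsupervised learning concrete, the following Python sketch fits a nearest-centroid classifier from labeled samples and then, on the same data without labels, runs a few iterations of k-means clustering; the toy data are an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian blobs in 2-D.
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)), rng.normal([3, 3], 0.5, (50, 2))])
t = np.array([0] * 50 + [1] * 50)            # class labels (target vectors)

# Supervised: use the labels to form class centroids, then classify by nearest centroid.
centroids = np.array([X[t == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
print("supervised accuracy:", (pred == t).mean())

# Unsupervised: ignore the labels and let k-means discover two clusters.
means = X[rng.choice(len(X), 2, replace=False)]  # random initialization
for _ in range(10):
    assign = np.argmin(np.linalg.norm(X[:, None] - means[None], axis=2), axis=1)
    means = np.array([X[assign == c].mean(axis=0) for c in (0, 1)])
print("cluster sizes:", np.bincount(assign))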
Weakly supervised learning A strategy for supervised learning when the supervision information is insufficient. Although learning is traditionally divided into supervised learning and unsupervised learning, tasks are sometimes encountered in which the supervision information is relatively insufficient. For example, when learning to perform object localization, there may be only image-level information about whether a specific object is present, while the location of the object is unknown. As another example, multiple instance learning is also a typical case of weakly supervised learning.

Multi-class supervised learning A special case of supervised learning, emphasizing that more than two classes of samples need to be classified or identified.

Active learning Ways and means of learning about the environment and obtaining information through interaction, for example, proactively observing an object from a new perspective. It now mostly refers to a machine learning method in which the learning agent can actively query the environment to obtain data samples. For example, a classification method can identify sub-regions of the input space in which it is less reliable and therefore request more training samples in those sub-regions to characterize the input. Such methods are considered supervised learning methods.

Multiple instance learning A variant of supervised learning. Learners do not receive a set of input-output pairs, but a set of bags of examples. Each bag contains many inputs but only one label. If all inputs in the bag are positive, the label is positive; if all inputs in the bag are negative, the label is negative.

Machine learning styles The forms in which machine learning tasks are executed. They can differ between fields. For example, there are four main types in data mining: (1) classification learning, (2) association learning, (3) clustering, and (4) numeric prediction.

Association learning A machine learning method in the field of data mining. It aims not only to determine a particular class value but also to find associations between different features.

Numeric prediction A machine learning method in the field of data mining. The conclusion it predicts is not a discrete class but a numerical quantity.

Evidence approximation The name used in the machine learning literature for what the statistics literature calls empirical Bayes, type 2 maximum likelihood, or generalized maximum likelihood.

Empirical Bayes Same as evidence approximation.
Type 2 maximum likelihood Same as evidence approximation. Generalized maximum likelihood Same as evidence approximation.
40.3.3 Zero-Shot and Ensemble Learning

Zero-shot learning [ZSL] Learning to classify categories for which there are no training samples.

Cross-categorization learning A type of learning that recognizes targets without training samples.

Zero-shot recognition Recognition of categories without training samples. For example, after introducing the concept of attributes, cross-categorization learning can be performed. Specifically, an attribute classifier is used to describe unknown target categories with attributes. If a textual description of attributes is used, it is possible to learn a new target type model based on these descriptions without using image samples, that is, zero-shot recognition.

Ensemble learning Combining multiple learning models in an attempt to produce better predictive performance than a single model. Typical techniques include the combination of bagging, bootstrap, and Bayesian statistical models.

Decision forest A general term that encompasses variations of the random forest concept in ensemble learning.

Bag of detectors An ensemble-learning-driven object detection method in which a set of independently trained detection concepts (detectors), which can be of the same type (such as random forests), is used. The individual results obtained from this set of detectors are combined in an integrated or bootstrapping manner to produce the final detection result.

Random forest An example of ensemble learning, in which a classifier consists of many decision trees. Differences between the decision trees can be obtained by training on random subsets of the training set or by random selection of features. The overall classification result for a given example is obtained by combining the results of the learned set of decision trees.
Fig. 40.9 Bootstrap classification process diagram
Bootstrap aggregation A method that can improve the stability and accuracy of machine learning algorithms. It avoids overfitting problems by reducing the variance of the model. It is a way to introduce variability into the different models of a combined model. Consider a regression problem that predicts the value of a single continuous variable. If it is possible to generate M bootstrap data sets, a prediction model ym(x) can be trained on each of them, m = 1, 2, . . ., M. The combined prediction is then

y_a(x) = (1/M) Σ_{m=1}^{M} y_m(x)

This process is called bootstrap aggregation. It is commonly used in decision tree methods.

Bagging 1. Ensemble: An ensemble learning method in which the end result is a combination of the results of the members of an ensemble. There are two cases of combination: the majority voting method is used in classification, and the averaging method is used in regression. 2. Bootstrap aggregation: Same as bootstrap aggregation.

Boosting process Boosting (bootstrapping) is an iterative process that uses weak classifiers one by one. The schematic of a bootstrap classification process is shown in Fig. 40.9. There are two types of samples in Fig. 40.9a; at the beginning, their weights are the same. The first weak classifier divides the samples into two groups, left and right; the weights of the misclassified samples in Fig. 40.9b increase (represented by the increase in diameter); in Fig. 40.9c, the second weak classifier is used to divide the samples into two groups; the weights of the misclassified samples in Fig. 40.9d are also increased; in Fig. 40.9e, the third weak classifier is used to divide the samples into two groups, up and down; the final classifier in Fig. 40.9f is a combination of the first three weak classifiers (the weights of the samples misclassified by the third weak classifier are also increased).

Bootstrap A method for determining frequentist error bars. Suppose the original data set includes N data points, X = {x1, x2, . . ., xN}. By randomly extracting N data points from X (putting each back after extraction), a new data set Y can be formed. In Y, some
points in X may not appear, and some points in X may appear multiple times. Repeating this process L times gives L new data sets, which are called bootstrap data sets. Using these data sets, the statistical accuracy of parameter estimates can be evaluated by observing the variability of the predictions across them.

Bootstrap data set See bootstrap.
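A minimal Python sketch of the bootstrap and of bootstrap aggregation follows; the data and the per-set predictor (a simple mean) are illustrative assumptions chosen only to keep the example short.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, size=40)            # original data set X of N points

M = 200
bootstrap_estimates = []
for _ in range(M):
    idx = rng.integers(0, len(x), len(x))    # sample N points with replacement
    bootstrap_estimates.append(x[idx].mean())  # y_m: prediction from the m-th set

bootstrap_estimates = np.array(bootstrap_estimates)
print("aggregated prediction:", bootstrap_estimates.mean())  # (1/M) * sum of y_m
print("bootstrap error bar  :", bootstrap_estimates.std())   # spread across data sets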
40.3.4 Various Learning Methods

Continuous learning A general term describing how a system continuously updates its models based on current data. For example, for change detection, the background model is updated according to the change of lighting during the day.

Incremental learning Essentially progressive learning. See continuous learning.

Online learning A promising learning method in which the parameters are updated as soon as a new training sample is obtained. See incremental learning.

Bayesian statistical model A typical statistical model. Consider a statistical model M with a parameter q. The commonly used statistical modeling method is to estimate the parameters of the model given the data D, for example, by maximum posterior estimation. The Bayesian statistical model uses probability to quantify the uncertainty about q. Consider first the prior distribution p(q|M). Combining it with the likelihood function p(D|q, M) gives the posterior distribution p(q|D, M), that is,

p(q|D, M) = p(D|q, M) p(q|M) / p(D|M)

Among them, the normalization factor p(D|M) is called the marginal likelihood. In words, the above formula can be written as

posterior = (likelihood × prior) / marginal likelihood

This model can be used for prediction by averaging over the posterior. That is, for a supervised learning task with input x and output y, we get
p(y|x, D, M) = ∫ p(y|x, q, M) p(q|D, M) dq
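The following Python sketch works through the posterior = likelihood × prior / marginal-likelihood relation numerically for a coin-flip model on a discrete parameter grid; the grid, prior, and data are illustrative assumptions, not from the handbook.

import numpy as np

q = np.linspace(0.01, 0.99, 99)            # candidate parameter values (heads probability)
prior = np.ones_like(q) / len(q)           # flat prior p(q|M)

data = [1, 1, 0, 1, 1, 0, 1, 1]            # observed flips D (1 = heads)
k, n = sum(data), len(data)
likelihood = q**k * (1 - q)**(n - k)       # p(D|q, M)

marginal = np.sum(likelihood * prior)      # p(D|M), the normalization factor
posterior = likelihood * prior / marginal  # p(q|D, M)

# Prediction by averaging over the posterior: p(y=1|D, M) = sum_q p(y=1|q) p(q|D, M)
print("predictive probability of heads:", np.sum(q * posterior))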
Reinforcement learning A technique for determining appropriate actions to maximize returns or rewards in a given situation. The learning algorithm here is different from supervised learning: it uses a series of attempts to find the best results, so the learning algorithm interacts with the environment through a series of states. In many cases, the current action affects not only the immediate return but also the returns of all subsequent steps. A feature of reinforcement learning is striking a balance between exploration and exploitation; focusing too much on either one gives bad results.

Exploitation In reinforcement learning, the strategy of systematically selecting actions that are known to yield high returns.

Exploration In reinforcement learning, the strategy in which the system chooses to try new actions to see whether they are effective.

Learning from observation Learning by observing the objective world and the environment. See learning.

Trial-and-error A method for exploring the characteristics of a black-box system through constant guessing and error correction. It is a purely empirical learning method. The first step is guessing or experimenting, and the second step is checking and correcting. It focuses on specific problems (it does not try to generalize), is problem-oriented (it does not explore the reasons for success), and requires little knowledge, but it does not guarantee optimal methods (it only finds some solution).

Multiple kernel learning Given a certain number of kernel functions suitable for the problem, finding the optimal linear combination of these kernel functions.

Relevance learning A class of methods that use distance metrics to measure the similarity of examples in the input feature space. A typical example is the K-nearest neighbor algorithm. In relevance learning, performance can be improved by changing the measure, for example, by re-weighting different dimensions.

Visual learning Learning visual models, or more broadly anything that can be used to perform visual tasks, from a set of images (samples). It is a direction within the more general field of automatic learning. Work on face recognition and image database indexing represents important applications of visual learning. See unsupervised learning and supervised learning.
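To make the exploration/exploitation trade-off concrete, here is a small Python sketch of an ε-greedy strategy on a three-armed bandit; the reward probabilities and the value of ε are illustrative assumptions.

import random

true_reward = [0.2, 0.5, 0.8]      # unknown expected reward of each action
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
epsilon = 0.1                      # fraction of steps spent exploring

random.seed(0)
for step in range(2000):
    if random.random() < epsilon:                   # exploration: try a new action
        a = random.randrange(3)
    else:                                           # exploitation: pick the best so far
        a = max(range(3), key=lambda i: estimates[i])
    r = 1.0 if random.random() < true_reward[a] else 0.0
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]  # incremental average of returns

print("estimated rewards:", [round(e, 2) for e in estimates])
print("most chosen action:", counts.index(max(counts)))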
Table 40.1 Three main types of transfer learning methods

Inductive transfer learning: source domain and target domain are the same or related; source task and target task are related; source data | target data are label | label (multi-task learning) or no label | label (self-learning).

Direct transfer learning: source domain and target domain are related; source task and target task are the same; source data | target data are label | no label.

Unsupervised transfer learning: source domain and target domain are related; source task and target task are related; source data | target data are no label | no label.
Transfer learning Using what has been learned in one domain to help learning in a new domain. The former domain is generally called the source domain, and the latter the target domain. Accordingly, there are source tasks and source data in the source domain and target tasks and target data in the target domain. Transfer learning uses the knowledge learned in the source domain when completing the source task to solve the problem of completing the target task with the target data in the target domain. The introduction of attributes paves the way for transfer learning. At present, there are three main types of transfer learning methods; some of their characteristics are shown in Table 40.1.

Sequential learning A type of learning method or idea that uses only one observation (or a few data points) to learn at a time and removes the used data before using the next observation. Since there is no need to store or load all the data into memory, this method is also useful when the amount of data is large.

One-shot learning Learning about a category based on only a few data points from that category. This requires the transfer of knowledge from other categories that have already been learned.
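A minimal Python sketch of the sequential learning idea follows: the running mean of a data stream is updated one observation at a time, and no past observation is stored; the stream values are illustrative.

def sequential_mean(stream):
    mean, n = 0.0, 0
    for x in stream:                 # one observation at a time
        n += 1
        mean += (x - mean) / n       # incremental update; used data are discarded
    return mean

print(sequential_mean([2.0, 4.0, 6.0, 8.0]))  # 5.0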
40.4 Inference
The interpretation of the scene needs to be based on existing knowledge and carried out by means of reasoning (Prince 2012; Shapiro and Stockman 2001; Witten 2005).
40.4.1 Inference Classification

Inference 1. A thinking process that derives an unknown conclusion from one or more known judgments (premises). The main methods include deductive inference and inductive inference. Deductive inference starts from general laws and uses logical proofs or mathematical operations to derive the laws that special facts should follow, from the general to the special. Inductive inference generalizes concepts, principles, or conclusions from many individual things, from the special to the general. 2. One of the three key issues of the hidden Markov model. It is also the first step in the classification problem (the second step is decision-making). In this step, training data are needed to learn the posterior probabilities of the various classes. 3. One of the characteristics of propositional representations in propositional representation models. Propositional expressions can be manipulated with fairly uniform calculations, which can be used to develop inference rules to derive new propositions from old ones.

Spatial reasoning Inference from a geometric perspective, not from symbolic or linguistic information. Same as geometric reasoning.

Geometric reasoning The use of geometric shapes for inference to complete tasks such as robot motion planning, spatial position estimation, shape similarity, and more.

Structure reasoning Inferential analysis of 3-D object structures with the help of labeling of contours technology in 2-D images.

Temporal reasoning The process of analyzing the time relationship between signals or events, for example, determining whether an event seen in one video stream is a good predictor of an event seen in another video stream. See temporal event analysis.

Variational inference 1. A set of approximate reasoning techniques using global criteria, also known as variational Bayes. 2. One of the two steps in solving the LDA model. It refers to determining the topic mixing probability q of the image and the probability that each word is generated by the topic z, given the hyperparameters a and b and the observation variable s.

Variational inference for Gaussian mixture The combination of variational inference with mixture of Gaussians models.

Local variational inference Variational inference whose goal is to find bounds on functions of a single variable or a group of variables in a model. This is different from general variational inference,
where the goal is to directly find an approximation of the entire posterior distribution over all random variables.

Variational Bayes Same as variational inference.

Monte Carlo sampling An approximate inference method based on numerical sampling. This method is very flexible and can be used for a wide range of models, but its computational cost is large.

Top-down model inference A form of reasoning in which analysis proceeds from the entire goal to its constituent parts. High-level information is used to generate predictions for lower-level structures, and then these predictions are tested. For example, one can first construct a statistical classifier to evaluate the likelihood that a region contains a person, then refine it with a subclass classifier to evaluate the likelihood that a sub-region of that region contains a human head, and then use a further subclass classifier to find eyes, nose, mouth, etc. Failure in the lower layers may lead to failure in the higher layers. This form of reasoning runs opposite to the bottom-up approach.

Running intersection property A useful property in graph-based reasoning. It states that if a variable is contained in two cliques, then it must also be contained in every clique on the path connecting those two cliques in the graph. This property ensures that the reasoning about a variable is consistent throughout the graph.

Sum-product algorithm A class of efficient and accurate inference algorithms that can be applied to tree structures.

Viterbi algorithm An algorithm that, when a hidden Markov model is used, applies the max-sum algorithm to find the most probable sequence of states. It is a dynamic programming algorithm for determining the most likely interpretation and can be used to determine the sequence of hidden states in a hidden Markov model given the observed data.

Max-sum algorithm An algorithm that can be regarded as the application of dynamic programming in the context of graphical models; it is closely related to the sum-product algorithm. It helps to identify the variable settings with the greatest probability, as well as to determine the value of that probability.
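As an illustration of the Viterbi (max-sum) idea, the following Python sketch runs dynamic programming over a tiny two-state hidden Markov model; the transition and emission probabilities and the observations are illustrative assumptions.

import numpy as np

pi = np.array([0.6, 0.4])                    # initial state distribution
A = np.array([[0.7, 0.3], [0.4, 0.6]])       # A[j, k] = p(z_n = k | z_{n-1} = j)
E = np.array([[0.9, 0.1], [0.2, 0.8]])       # E[k, x] = p(x_n = x | z_n = k)
obs = [0, 0, 1, 1, 0]                        # observed symbols

K, N = len(pi), len(obs)
logp = np.log(pi) + np.log(E[:, obs[0]])     # best log-probability ending in each state
back = np.zeros((N, K), dtype=int)
for n in range(1, N):
    cand = logp[:, None] + np.log(A)         # cand[j, k]: come from j, go to k
    back[n] = cand.argmax(axis=0)
    logp = cand.max(axis=0) + np.log(E[:, obs[n]])

# Backtrack the most probable state sequence.
states = [int(logp.argmax())]
for n in range(N - 1, 0, -1):
    states.append(int(back[n][states[-1]]))
print("most probable states:", states[::-1])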
40.4.2 Propagation

Belief propagation [BP] An inference technique that was originally used for tree structures but is now also popular for cyclic graphs (such as MRFs). It is closely related to the dynamic programming method, because both pass messages back and forth in the tree or graph. It is an algorithm for exact inference on a directed graph without loops and is equivalent to a special form of the sum-product algorithm. Assume a joint probability distribution over a set of random variables defined by a probabilistic graphical model. With knowledge of the values of some variables e (the evidence), it is often desirable to perform probabilistic reasoning to obtain a conditional distribution p(x|e) for the remaining variables (or a subset of them) x. In belief propagation, the posterior marginal distribution of each variable in x is obtained through a message-passing process. For graphs without cycles, the algorithm is exact; more generally, a junction tree algorithm can be used to calculate the posterior marginal distributions for any probabilistic graphical model. The forward-backward algorithm is a special case of belief propagation in the hidden Markov model (HMM).

Forward propagation A situation in which information flows forward through the network.

Forward-backward algorithm A special case of belief propagation, used for probabilistic inference in hidden Markov models by propagating messages forward and backward along the Markov chain. It is a two-stage message-passing algorithm that efficiently calculates the posterior distribution of the hidden variables in the context of a hidden Markov model, whose graph is a tree structure. There are several variants of the basic algorithm that all obtain the exact marginal distributions; the most commonly used is the alpha-beta algorithm. This algorithm is mainly composed of a forward α-recursion and a backward β-recursion, so it is also called forward-backward recursion. First write the following conditional independence properties:
p(X | z_n) = p(x_1, . . ., x_n | z_n) p(x_{n+1}, . . ., x_N | z_n)
p(x_1, . . ., x_{n-1} | x_n, z_n) = p(x_1, . . ., x_{n-1} | z_n)
p(x_1, . . ., x_{n-1} | z_{n-1}, z_n) = p(x_1, . . ., x_{n-1} | z_{n-1})
p(x_{n+1}, . . ., x_N | z_n, z_{n+1}) = p(x_{n+1}, . . ., x_N | z_{n+1})
p(x_{n+2}, . . ., x_N | z_{n+1}, x_{n+1}) = p(x_{n+2}, . . ., x_N | z_{n+1})
p(X | z_{n-1}, z_n) = p(x_1, . . ., x_{n-1} | z_{n-1}) p(x_n | z_n) p(x_{n+1}, . . ., x_N | z_n)
p(x_{N+1} | X, z_{N+1}) = p(x_{N+1} | z_{N+1})
p(z_{N+1} | z_N, X) = p(z_{N+1} | z_N)

where X = {x_1, x_2, . . ., x_N}. In the following, consider using the EM algorithm as an efficient framework for maximizing the likelihood function of a hidden Markov model. The EM algorithm first selects model parameters and records them as t^old. In step E, these parameter values are used to determine the posterior distribution of the hidden variables, p(Z|X, t^old). This posterior distribution is then used to evaluate the expected value of the logarithm of the complete-data likelihood function (as a function of the parameters t), giving the function Q(t, t^old):

Q(t, t^old) = Σ_Z p(Z|X, t^old) ln p(X, Z|t)

Next, use M(z_n) to represent the marginal posterior distribution of the hidden variable z_n, and use J(z_{n-1}, z_n) to represent the joint posterior distribution of two consecutive hidden variables:

M(z_n) = p(z_n | X, t^old)
J(z_{n-1}, z_n) = p(z_{n-1}, z_n | X, t^old)

For each value of n, a set of K non-negative numbers with a sum of 1 can be used to store M(z_n). Similarly, a non-negative K × K matrix (with sum 1) can be used to store J(z_{n-1}, z_n). If M(z_nk) is used to represent the conditional probability of z_nk = 1, then, because the expected value of a binary random variable is the probability that it takes the value 1, we have
M(z_nk) = E[z_nk] = Σ_{z_n} M(z_n) z_nk
J(z_{n-1,j}, z_nk) = E[z_{n-1,j} z_nk] = Σ_{z_{n-1}, z_n} J(z_{n-1}, z_n) z_{n-1,j} z_nk

It further follows that

Q(t, t^old) = Σ_{k=1}^{K} M(z_1k) ln π_k + Σ_{n=2}^{N} Σ_{j=1}^{K} Σ_{k=1}^{K} J(z_{n-1,j}, z_nk) ln A_jk + Σ_{n=1}^{N} Σ_{k=1}^{K} M(z_nk) ln p(x_n|ϕ_k)

The goal of step E is to efficiently calculate M(z_n) and J(z_{n-1}, z_n). In step M, Q(t, t^old) is maximized with respect to the parameters t = {π, A, ϕ}, in which M(z_n) and J(z_{n-1}, z_n) are treated as constants. The maximization with respect to π and A can be achieved by the Lagrange multiplier method; the result is

π_k = M(z_1k) / Σ_{j=1}^{K} M(z_1j)

A_jk = Σ_{n=2}^{N} J(z_{n-1,j}, z_nk) / Σ_{l=1}^{K} Σ_{n=2}^{N} J(z_{n-1,j}, z_nl)

In order to maximize Q(t, t^old) with respect to the parameters ϕ_k, it is only necessary to maximize the log-likelihood of the emission density p(x|ϕ_k) weighted by M(z_nk). For example, for a Gaussian emission density p(x|ϕ_k) = N(x|μ_k, Σ_k), maximizing Q(t, t^old) gives

μ_k = Σ_{n=1}^{N} M(z_nk) x_n / Σ_{n=1}^{N} M(z_nk)

Σ_k = Σ_{n=1}^{N} M(z_nk) (x_n − μ_k)(x_n − μ_k)^T / Σ_{n=1}^{N} M(z_nk)
Fig. 40.10 The forward recursive process of calculating α, shown on a fragment of the lattice diagram between steps n−1 and n (states k = 1, 2, 3; weights A_j1; data contribution p(x_n|z_{n,1}))
Baum-Welch algorithm Same as forward-backward algorithm.

α-recursion The recursion of the quantity α(z_n) in the forward-backward algorithm; it is also called forward recursion. Using the conditional independence properties together with the sum and product rules of probability gives

α(z_n) = p(x_1, . . ., x_n, z_n)
= p(x_1, . . ., x_n | z_n) p(z_n)
= p(x_n | z_n) p(x_1, . . ., x_{n-1} | z_n) p(z_n)
= p(x_n | z_n) p(x_1, . . ., x_{n-1}, z_n)
= p(x_n | z_n) Σ_{z_{n-1}} p(x_1, . . ., x_{n-1}, z_{n-1}, z_n)
= p(x_n | z_n) Σ_{z_{n-1}} p(x_1, . . ., x_{n-1}, z_n | z_{n-1}) p(z_{n-1})
= p(x_n | z_n) Σ_{z_{n-1}} p(x_1, . . ., x_{n-1} | z_{n-1}) p(z_n | z_{n-1}) p(z_{n-1})
= p(x_n | z_n) Σ_{z_{n-1}} p(x_1, . . ., x_{n-1}, z_{n-1}) p(z_n | z_{n-1})

The final forward recursion is

α(z_n) = p(x_n | z_n) Σ_{z_{n-1}} α(z_{n-1}) p(z_n | z_{n-1})

Note that the above summation has K terms in total and that the right-hand side must be evaluated for each of the K values of z_n, so the computational cost of each recursive step for α is O(K^2). Figure 40.10 shows the recursive process of calculating α using a lattice diagram. In this fragment of the lattice diagram, the quantity α(z_{n,1}) is obtained by taking the elements α(z_{n-1,j}) of α(z_{n-1}) at step
n−1, summing them with the weights A_{j1} (corresponding to the values of p(z_n|z_{n-1})), and then multiplying by the data contribution p(x_n|z_{n,1}). To initiate this recursion, the following initial condition is required:

α(z_1) = p(x_1, z_1) = p(z_1) p(x_1|z_1) = Π_{k=1}^{K} {π_k p(x_1|ϕ_k)}^{z_1k}
This means that α(z_1k) takes the value π_k p(x_1|ϕ_k) for k = 1, 2, . . ., K. Starting from the first node in the sequence, α(z_n) is calculated for each hidden node. Because each step of the recursion involves multiplication by a K × K matrix, the total computational cost of evaluating all these quantities over the entire sequence is O(K^2 N).

β-recursion The recursion of the quantity β(z_n) in the forward-backward algorithm; it is also called backward recursion. Using the conditional independence properties together with the sum and product rules of probability gives

β(z_n) = p(x_{n+1}, . . ., x_N | z_n)
= Σ_{z_{n+1}} p(x_{n+1}, . . ., x_N, z_{n+1} | z_n)
= Σ_{z_{n+1}} p(x_{n+1}, . . ., x_N | z_n, z_{n+1}) p(z_{n+1} | z_n)
= Σ_{z_{n+1}} p(x_{n+1}, . . ., x_N | z_{n+1}) p(z_{n+1} | z_n)
= Σ_{z_{n+1}} p(x_{n+2}, . . ., x_N | z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)
The final backward recursion is

β(z_n) = Σ_{z_{n+1}} β(z_{n+1}) p(x_{n+1} | z_{n+1}) p(z_{n+1} | z_n)

This is a backward message-passing algorithm, which uses β(z_{n+1}) to calculate β(z_n). At each step, the emission probability p(x_{n+1}|z_{n+1}) is multiplied by the transition matrix p(z_{n+1}|z_n), and then z_{n+1} is marginalized out to absorb the effect of observing x_{n+1}. Figure 40.11 shows the recursive process of calculating β using a lattice diagram.
Fig. 40.11 The backward recursive process of calculating β, shown on a fragment of the lattice diagram between steps n and n+1 (states k = 1, 2, 3; weights A_{1k}; emission contributions p(x_{n+1}|z_{n+1,k}))
In this fragment of the lattice diagram, the quantity β(z_{n,1}) is obtained by taking the elements β(z_{n+1,k}) of β(z_{n+1}) at step n+1, multiplying them by the weights A_{1k} (corresponding to the values of p(z_{n+1}|z_n)) and by the corresponding values of the emission density p(x_{n+1}|z_{n+1,k}), and summing. To initiate this recursion, an initial condition is required, namely a value for β(z_N). This value can be obtained by setting n = N in the expression for the posterior marginal, which gives

p(z_N|X) = p(X, z_N) β(z_N) / p(X)
It is correct provided that β(z_N) = 1 for all settings of z_N.

Flooding schedule A method of message transmission in belief propagation that, in each time period, transmits all messages on every path simultaneously in both directions.

Loopy belief propagation [LBP] 1. An extension of belief propagation. Belief propagation may not give optimal results in a cyclic graph, because a node with multiple parent nodes may have a different optimal value for each parent node. Loopy belief propagation updates the graph in parallel, so that optimized results can still be given when there are loops in the graph. 2. A method for performing simple approximate inference in a graph. The basic idea is to use the sum-product algorithm, although this does not guarantee good convergence.

Max-product rule A variant of belief propagation that performs the same calculations (inference) as the dynamic programming method but uses probabilities instead of energies.

Serial message passing schedule A message-passing method in belief propagation that passes only one message on each channel during a time period.
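A minimal Python sketch of the α (forward) and β (backward) recursions on a small two-state HMM follows; scaled messages are used for numerical stability, and the model probabilities and observations are illustrative assumptions.

import numpy as np

pi = np.array([0.6, 0.4])                  # p(z_1)
A = np.array([[0.7, 0.3], [0.4, 0.6]])     # A[j, k] = p(z_n = k | z_{n-1} = j)
E = np.array([[0.9, 0.1], [0.2, 0.8]])     # E[k, x] = p(x_n = x | z_n = k)
obs = [0, 0, 1, 1, 0]
N, K = len(obs), len(pi)

alpha = np.zeros((N, K))
beta = np.ones((N, K))                     # beta(z_N) = 1 initial condition
scale = np.zeros(N)

alpha[0] = pi * E[:, obs[0]]               # alpha(z_1) = pi_k p(x_1 | phi_k)
scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
for n in range(1, N):
    # forward: alpha(z_n) = p(x_n|z_n) * sum over z_{n-1} of alpha(z_{n-1}) p(z_n|z_{n-1})
    alpha[n] = E[:, obs[n]] * (alpha[n - 1] @ A)
    scale[n] = alpha[n].sum(); alpha[n] /= scale[n]

for n in range(N - 2, -1, -1):
    # backward: beta(z_n) = sum over z_{n+1} of beta(z_{n+1}) p(x_{n+1}|z_{n+1}) p(z_{n+1}|z_n)
    beta[n] = (A @ (beta[n + 1] * E[:, obs[n + 1]])) / scale[n + 1]

posterior = alpha * beta                   # marginal posteriors p(z_n | X)
print(posterior)
print("log likelihood:", np.log(scale).sum())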
Expectation propagation [EP] A form of deterministic approximate inference based on minimizing the Kullback-Leibler divergence. Consider the problem of minimizing KL(p||q) with respect to q(z), where p(z) is a fixed distribution and q(z) is a member of the exponential family and can therefore be written as

q(z) = h(z) g(η) exp{η^T u(z)}

As a function of η, the Kullback-Leibler divergence becomes (C is a constant)

KL(p||q) = −ln g(η) − η^T E_{p(z)}[u(z)] + C

where the constant term is independent of the natural parameters η. Within this family of distributions, KL(p||q) can be minimized by setting the gradient with respect to η to zero, which gives

−∇ ln g(η) = E_{p(z)}[u(z)]

Since the negative gradient of ln g(η) is also given by the expectation of u(z) under the distribution q(z), we get

E_{q(z)}[u(z)] = E_{p(z)}[u(z)]

It can be seen that the optimal solution corresponds to matching the expected sufficient statistics. For example, if q(z) is a Gaussian N(z|μ, Σ), then the KL divergence is minimized by setting the mean μ of q(z) equal to the mean of the distribution p(z) and the covariance Σ equal to the covariance of p(z). This method is sometimes called moment matching.

Moment matching Same as assumed density filtering. See expectation propagation.

Assumed density filtering [ADF] A special case of expectation propagation in which all approximation factors except the first are initialized to 1 and all factors are updated only once. For online learning (where data points arrive in sequence), assumed density filtering is more appropriate: each data point is learned from and then discarded before the next data point is considered. In a batch setting, however, there is the opportunity to reuse the data points multiple times to improve accuracy, and it is this idea that expectation propagation exploits. If assumed density filtering is used on batch data, the results obtained may have an undesired dependence on the (arbitrary) order in which the data points are considered.
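The following Python sketch illustrates moment matching numerically: the Gaussian q that minimizes KL(p||q) simply matches the mean and covariance of p, estimated here from samples of an assumed bimodal p (the mixture used is an illustrative assumption, not from the handbook).

import numpy as np

rng = np.random.default_rng(2)
# Samples from a bimodal p(z): a mixture of two 2-D Gaussians.
samples = np.vstack([rng.normal([-2, 0], 0.5, (500, 2)),
                     rng.normal([+2, 0], 0.5, (500, 2))])

mu = samples.mean(axis=0)               # mean of q set to the mean of p
Sigma = np.cov(samples, rowvar=False)   # covariance of q set to the covariance of p

print("matched mean      :", np.round(mu, 2))
print("matched covariance:")
print(np.round(Sigma, 2))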
Chapter 41
General Image Matching
The interpretation of the scene is a process of continuous cognition, so the information obtained from the image needs to be matched with the existing model that explains the scene. Matching can be done at the pixel level, the object level, or at a more abstract level (Prince 2012; Shapiro and Stockman 2001; Zhang 2017c).
41.1 General Matching
Generalized matching includes not only specific matching of low-level pixels or sets of pixels of the image but also abstract matching related to the nature and connection of the image object or object property and even to the description and interpretation of the scene (Hartley and Zisserman 2004; Prince 2012; Shapiro and Stockman 2001; Zhang 2017c).
41.1.1 Matching

General image matching The expansion of image matching in terms of concept, meaning, method, and technology. Generalized matching can consider not only the geometric and gray-level properties of the image but also other abstract properties and attributes of the image. It recognizes the unknown by linking it with the known, or uses models stored in the computer to identify unknown input visual patterns, in order to eventually establish an interpretation of the input.

Exhaustive matching Considering all possible matches. Hypothesis testing methods are an alternative.
Hierarchical matching Matching at levels of increasing detail. This technique can be used for matching images or more abstract representations.

Matching Finding the correspondence between two things. See image matching and general image matching.

Tie points 1. In geometric distortion correction, corresponding known positions in the distorted and undistorted images. From them, the parameter values of the spatial transformation model can be determined, so that the image coordinate transformation can be performed. Same as control point. 2. A pair of matching points in two different images (or in one image and one map). Given enough matching points, the geometric relationships between the images can be estimated, such as by computing rectification, image alignment, geometric transformations, fundamental matrices, etc.

Matching metric Metrics and metric functions used for matching.

Dissimilarity metric A measure of the degree of difference between two objects or structures. It is often converted into a Euclidean distance between two feature values.

Fisher-Rao metric A metric defined by Fisher with the Rao distance. The Rao distance provides a measure of the difference between two probability distributions. Given two probability densities p and q that both belong to a parametric family with parameter α, suppose p(x) = p(x|α) and q(x) = q(x|α + δα), where δα is a small increment of α. To first order, the squared distance D(p, q) satisfies 2D(p(·|α), p(·|α + δα)) = δα^T J(α) δα + (higher-order terms), where J(α) is a matrix that defines a Riemannian metric on the set of values taken by α. This metric is called the Fisher-Rao metric.

Match verification A technique or process for verifying a preliminary hypothesized match of an object or scenario. For example, in specific object recognition (or retrieval) over an image database, a set of feature points is computed in advance for each image in the database, and their associated descriptors are put into an index structure. During recognition or retrieval, feature points are extracted from the new image and compared with the previously stored object feature points. Once a sufficient number of matching feature points are found, match verification can be performed, that is, it is determined whether the spatial distribution of these matched feature points is consistent with the spatial distribution of the feature points of the image objects in the database.
Match move In the visual effects industry, the estimation of the 3-D motion of a camera with respect to the geometric layout of a 3-D scene in order to superimpose 3-D graphics or computer-generated images on the scene. Essentially, the technique of structure from motion is used to estimate the 3-D motion of a camera in a 3-D scene so that 3-D graphics or computer-generated images can be superimposed on the footage. Because in this process the motion of the synthetic 3-D camera used to render the graphics must match the real-world camera, it is called match move. For small motions, or motions that involve only camera rotation, only one or two tracked points are needed to calculate the required visual motion. For a plane moving in 3-D space, four points are needed to compute the homography. Planar overlays can then be used, for example, to replace billboard content in sports broadcasts. The general case of match move requires estimating the full 3-D camera pose along with the focal length (zoom) of the lens and sometimes also its radial distortion parameters. For more complex scenes, the technique of recovering structure from motion is generally used to obtain the camera motion and the 3-D structure information at the same time.

Symbolic matching More abstract matching using higher-level symbolic units. It is different from template matching: it does not use pixel properties directly but uses the overall characteristics or features of the object of interest itself, such as object size, orientation, and spatial relationships, to calculate similarity.

String descriptor A relational descriptor. Suitable for describing recurring relational structures of basic elements. The basic elements are first represented by characters, and then strings are constructed with the help of a formal grammar and rewriting rules. In other words, strings (a 1-D structure) are first used to represent structural units, and then a formal method (such as a formal grammar) is used to describe the connections between these units. See structural approach.

String matching A kind of general image matching, usually used for object matching. A character string can represent the contour or a feature point sequence of an object region. Consider two strings a1a2. . .an and b1b2. . .bm. If, matching them element by element starting from a1 and b1, ak = bk at the k-th position, the two strings are said to have a match at that position. If M is used to represent the total number of matches between the two strings, the number of unmatched symbols is

Q = max(‖A‖, ‖B‖) − M

Among them, ‖A‖ and ‖B‖ represent the lengths (numbers of symbols) of the strings A and B, respectively. It can be shown that Q = 0 if and only if A and B are exactly the same.
Fig. 41.1 Interpretation tree search: model features M1, M2, M3 (and a wildcard *) are considered for data features Data 1 to Data 4; X marks inconsistent branches and ? marks branches still to be tested
A simple similarity measure between A and B is

R = M / Q = M / [max(‖A‖, ‖B‖) − M]
It can be seen that a larger R value indicates a better match. When A and B match exactly (all corresponding symbols match), the R value is infinite; when no symbol of A matches a symbol of B (M = 0), the R value is zero.

Matching method A general term used to indicate how the correspondence between two structures (as in surface matching) or feature sets is established. See stereo correspondence problem.

Interpretation tree search A method for matching the elements of two discrete sets. For each feature from the first set, a depth-first search tree is constructed to consider all possible matching features from the second set. After a match is found for a feature (one that satisfies a set of consistency tests), the search turns to matching the remaining features. When no match is possible for a given feature, the algorithm backtracks. Figure 41.1 shows a partial interpretation tree that matches model features with data features, where * is a wildcard, X means inconsistent, and M denotes a model feature.

Search tree A data structure that can be used to record each choice in a problem-solving activity. The next action or decision can be searched for in the space of alternative choices. This tree can be generated either explicitly or implicitly by the sequence of actions. For example, a tree that records alternative model-data feature matches is a special search tree, the interpretation tree search. If each non-leaf node has two children, it is a binary search tree. See decision tree and tree classifier.

Matching feature extraction Feature extraction in stereo vision for matching between multiple images. In order to obtain depth information by using the disparity between the results of observing the same scene from different observation directions, it is necessary to determine the correspondence of the same scene across the different images. In
order to accurately establish the corresponding relationship, it is necessary to select appropriate image features. The feature here is a general concept that can represent various types of specific representations and abstract descriptions of pixels or pixel sets. The commonly used matching features can be divided into point-like features, line-like features, and region features from small to large in size.
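A minimal Python sketch of the string matching measure R = M/Q defined above follows; the test strings are illustrative.

def string_match_measure(A, B):
    M = sum(1 for a, b in zip(A, B) if a == b)   # number of matched positions
    Q = max(len(A), len(B)) - M                  # number of unmatched symbols
    return float("inf") if Q == 0 else M / Q

print(string_match_measure("abcde", "abcde"))   # inf (exact match)
print(string_match_measure("abcde", "abXde"))   # 4.0
print(string_match_measure("abc", "xyz"))       # 0.0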
41.1.2 Matching Function

Matching function See similarity metric.

Centroid distance A simple metric for the closeness between two sets of points (clusters). The centroid distance is the absolute value of the difference between the means of the points belonging to the two point sets.

Similarity distance For two sets of points, generally refers to the Hausdorff distance between them.

Hausdorff distance A distance function that measures the distance between two sets of points, used as a measure of the distance between two sets of image points. For each point in each of the two sets, determine the minimum distance to any point in the other set; the Hausdorff distance is the maximum of these minimums. Given two finite point sets A = {a1, a2, . . ., am} and B = {b1, b2, . . ., bn}, the Hausdorff distance between them is

H(A, B) = max[h(A, B), h(B, A)]

where

h(A, B) = max_{a∈A} min_{b∈B} ‖a − b‖
h(B, A) = max_{b∈B} min_{a∈A} ‖b − a‖

In the above two formulas, the norm ‖•‖ can be taken with different indices. The function h(A, B) is called the directed Hausdorff distance from set A to set B and describes the largest of the distances from the points a ∈ A to their nearest points in B. Similarly, the function h(B, A) is called the directed Hausdorff distance from set B to set A and describes the largest of the distances from the points b ∈ B to their nearest points in A. Since h(A, B) and h(B, A) are asymmetric, the maximum of the two is generally taken as the Hausdorff distance between the two point sets.
Fig. 41.2 A schematic diagram for the calculation of the Hausdorff distance h(A, B) between the point sets {a1, a2} and {b1, b2}, using the point-to-point distances d11, d12, d21, d22
The geometric significance of the Hausdorff distance can be explained as follows: if the Hausdorff distance between the point sets A and B is d, then for any point in either point set, a circle of radius d centered on that point contains at least one point of the other point set. If the Hausdorff distance of the two point sets is 0, the two point sets coincide completely. Figure 41.2 shows a schematic diagram of the calculation of h(A, B): first, calculate d11 and d12 starting from a1 and take the minimum value, d11; then calculate d21 and d22 starting from a2 and take the minimum value, d21; finally, take the maximum of d11 and d21 to get h(A, B) = d21.

Modified Hausdorff distance [MHD] An improvement on the Hausdorff distance. The concept of statistical averaging is used to overcome the sensitivity of the Hausdorff distance to a few noise points or outliers. Given two finite point sets A = {a1, a2, . . ., am} and B = {b1, b2, . . ., bn}, the modified Hausdorff distance is

H_MHD(A, B) = max[h_MHD(A, B), h_MHD(B, A)]

where

h_MHD(A, B) = (1/N_A) Σ_{a∈A} min_{b∈B} ‖a − b‖
h_MHD(B, A) = (1/N_B) Σ_{b∈B} min_{a∈A} ‖b − a‖
Among them, N_A indicates the number of points in the point set A, and N_B indicates the number of points in the point set B.

Standard deviation modified Hausdorff distance [SDMHD] An improvement on the Hausdorff distance, made with the help of second-order statistics to overcome the insensitivity of the Hausdorff distance to the distribution of points inside the point sets. Given two finite point sets A = {a1, a2, . . ., am} and B = {b1, b2, . . ., bn}, the standard deviation modified Hausdorff distance is

H_SDMHD(A, B) = max[h_SDMHD(A, B), h_SDMHD(B, A)]

where
h_SDMHD(A, B) = (1/m) Σ_{a∈A} min_{b∈B} ‖a − b‖ + k·S(A, B)
h_SDMHD(B, A) = (1/n) Σ_{b∈B} min_{a∈A} ‖b − a‖ + k·S(B, A)
In the above two formulas, the parameter k is a weighting coefficient, and S(A, B) represents the standard deviation of the distances from the points of point set A to their nearest points in point set B:

S(A, B) = sqrt{ (1/m) Σ_{a∈A} [ min_{b∈B} ‖a − b‖ − (1/m) Σ_{a∈A} min_{b∈B} ‖a − b‖ ]^2 }

S(B, A) represents the standard deviation of the distances from the points of point set B to their nearest points in point set A:

S(B, A) = sqrt{ (1/n) Σ_{b∈B} [ min_{a∈A} ‖b − a‖ − (1/n) Σ_{b∈B} min_{a∈A} ‖b − a‖ ]^2 }

The above standard deviation modified Hausdorff distance not only considers the average distance between all points of the two point sets but also, by introducing the standard deviation of the distances between the point sets, adds information about the distribution of the points (the consistency of the point distribution between the two point sets), so the characterization of the point sets is more detailed.

Edge eigen-vector-weighted Hausdorff distance A Hausdorff distance weighted by a weight function calculated from the first eigenface of the face edge image. It can be used to match a test image or candidate image region with a predefined face edge template to measure the similarity between them. This is useful in face recognition, where the difference between the test face and other faces needs to be found, so it is appropriate to use the amplitude of the first eigenface as the weight function.

Edge frequency-weighted Hausdorff distance A Hausdorff distance whose weight function is determined for matching by counting the edge information in images. It is often used to reflect the structural information of the human face more directly and effectively. The weight function can be represented as a grayscale image that represents the frequency of occurrence of edge points. By binarizing this grayscale image, a binary image of the face edge model can be obtained.
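A minimal Python sketch of the Hausdorff distance and the (mean-based) modified Hausdorff distance for 2-D point sets follows; the point coordinates are illustrative.

import numpy as np

def directed_min_dists(A, B):
    """For each point of A, the distance to its nearest point in B."""
    return np.min(np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2), axis=1)

def hausdorff(A, B):
    h_ab = directed_min_dists(A, B).max()   # h(A, B)
    h_ba = directed_min_dists(B, A).max()   # h(B, A)
    return max(h_ab, h_ba)

def modified_hausdorff(A, B):
    h_ab = directed_min_dists(A, B).mean()  # average instead of maximum
    h_ba = directed_min_dists(B, A).mean()
    return max(h_ab, h_ba)

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[0.1, 0.1], [1.1, 0.0], [0.0, 3.0]])
print("Hausdorff distance:", hausdorff(A, B))           # dominated by the outlier (0, 3)
print("modified (MHD)    :", modified_hausdorff(A, B))  # much less sensitive to it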
Reliability of a match An evaluation index for image matching. If the index exceeds a certain threshold, the match can be considered correct.

Reliability One of the commonly used evaluation criteria for image matching; it can also be used for other image technologies. It refers to how many times an algorithm achieves satisfactory results out of a total number of tests. If N pairs of images are tested and M tests give satisfactory results, and N is large enough and representative of the images, then M/N indicates the reliability. The closer M/N is to 1, the more reliable the algorithm. The reliability of an algorithm is predictable.

Hamming distance A metric of the difference between discrete objects. It can be measured by the number of positions at which two bit strings differ. Taking strings as an example, let A = {a1a2. . .an} and B = {b1b2. . .bn} be two strings of the same length. The Hamming distance between them can be calculated by the following formula:

D_H(A, B) = #{i | a_i ≠ b_i}

For example, the Hamming distance between 00000 and 10000 is 1, and the Hamming distance between 10000 and 10101 is 2.

Levenshtein distance A distance metric that measures the similarity between two strings. It is defined as the minimum number of operations (deletions, insertions, and substitutions) required to transform one string into the other.

Levenshtein edit distance An information-theoretic metric used to determine the difference between two strings. It can be calculated as the minimum number of edits required to convert one string to the other. Valid editing operations are delete, insert, and replace.
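A minimal Python sketch of the Hamming distance and the Levenshtein edit distance follows; the example strings are illustrative.

def hamming(a, b):
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))        # D_H = #{i | a_i != b_i}

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,             # delete x
                           cur[j - 1] + 1,          # insert y
                           prev[j - 1] + (x != y))) # substitute (or keep)
        prev = cur
    return prev[-1]

print(hamming("00000", "10000"))         # 1
print(hamming("10000", "10101"))         # 2
print(levenshtein("kitten", "sitting"))  # 3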
41.1.3 Matching Techniques

Dynamic pattern matching A general image matching technique commonly used for object matching. In the matching process, a representation pattern of the object to be matched is first established dynamically, and then the object matching is carried out. In the process of reconstructing 3-D cells from sequential medical slice images, the flowchart of the dynamic pattern matching method for determining the correspondence of the cross sections of the same cell in adjacent slices is shown in Fig. 41.3, which has six steps:
Fig. 41.3 Dynamic pattern matching flowchart
1. Select a matched profile from the matched section (or initial reference section).
2. Build the pattern representation of the selected matched profile.
3. Determine the candidate region on the section to be matched (a priori knowledge can be used to reduce the amount of calculation and the ambiguity).
4. Select the profile to be matched in the candidate region.
5. Build the pattern representation of the selected profile to be matched.
6. Use the similarity between section patterns to check the correspondence between sections. If they match, the matching result for the profiles is given. If not, adjust the parameters and start again.

Absolute pattern
A pattern with absolute coordinates constructed in dynamic pattern matching. It has rotation invariance but does not have translation invariance. Compare relative pattern.

Relative pattern
A pattern without absolute coordinates constructed in dynamic pattern matching. In addition to rotation invariance, it also has translation invariance. Compare absolute pattern.

Structural matching
A kind of general image matching; a match between a reference structure and a structure to be matched that share the same parts.

Structure matching
See recognition by components.

Structural description
A representation containing explicit information about object components and the relationships between the components. For 2-D or 3-D entities, it is a form of relational description that can be used for structure matching. In this form, it contains a set of primitives, each with its own attribute description; it also includes a set of named relationships, which can be represented by multivariate tuples, where each component corresponds to the primitives of the named relationship.
Fig. 41.4 Human face template and spring model
Given a function h: A → B, if the following formula is satisfied,

$$S \circ h \subseteq T$$

it is an associative isomorphism from the N-ary relation S ⊆ A^N to the N-ary relation T ⊆ B^N, where S ∘ h = {(b1, b2, ..., bN) ∈ B^N | for some (a1, a2, ..., aN) ∈ S, bn = h(an), n = 1, 2, ..., N}. Further, if such an isomorphic relation exists and satisfies

$$S \circ h = T \quad \text{and} \quad T \circ h^{-1} = S$$

then the relationship S and the relationship T match. See part-based representation and geometric model.

Pictorial structures
A representation structure for scenes. The scene is divided into parts that are elastically connected to each other, and the pictorial structure represents both the parts (nodes) and the connection relationships (edges). The "templates and springs model" is a classic example.

Templates and springs model
A physical analog model for face structure matching. As shown in Fig. 41.4, the templates corresponding to the facial organs are connected by springs. The spring function describes the relationship between the templates and can take different values (even infinite). The relationships among the templates generally obey certain constraints. For example, in a face image, the two eyes are generally on the same horizontal line, and their distance always lies within a certain range. The quality of the match is a function of the goodness obtained by using the templates for local fitting and the
energy required to lengthen the spring by fitting the structure to be matched to the reference structure. The general form of the matching metric of the template and the spring is as follows:

$$C = \sum_{d \in Y} C_T[d, F(d)] + \sum_{(d, e) \in (Y \times E)} C_S[F(d), F(e)] + \sum_{c \in (N \cup M)} C_M(c)$$
Among them, CT represents the dissimilarity between the template and the structure to be matched, CS represents the dissimilarity between the structure to be matched and the components that make up the object, CM represents the penalty for failing to match a component, and F(•) is a mapping from the reference structural template to the structural components to be matched. F divides the reference structure into two types: structures that can be found in the structure to be matched (belonging to the set Y) and structures that are not found in the structure to be matched (belonging to the set N). Similarly, the components can also be divided into two types: components that exist in the structure to be matched (belonging to set E) and components that do not exist in the structure to be matched (belonging to set M).

Elastic graph matching
A matching method based on the dynamic link architecture. In face recognition, a face can be represented by a grid-like sparse graph, where nodes represent feature vectors and edges represent distance vectors between the nodes. The graph is deformed during matching so that the nodes approach the corresponding points of the model graph. The elastic graph matching method can tolerate changes in pose, expression, and lighting.

Elastic matching
An optimization technique that minimizes the distortion between locally varying data and corresponding features or pixels of a given model. Commonly used in model-based object recognition.

Elastic deformation
Originally, the property of a material that completely recovers from the deformation caused by an external force after that force is removed. In image registration, this model can be used to build a grid with which to register two images.

Dynamic link architecture [DLA]
A structure that describes the conscious activity of the human brain. The strength of the links changes over time, hence the name dynamic linking. Commonly used in elastic graph matching.

Relational matching
A class of matching algorithms based on relational descriptors. See relational graph.
Relation matching
A kind of general image matching. The objective scene is decomposed into components, the mutual relationships between the components of the object are considered, and the set of mutual relationships is used to achieve matching. In relation matching, the two expressions to be matched are both relations. Generally, one of them is called the object to be matched, and the other is called the model. There are two relationship sets, Xl and Xr, where Xl belongs to the object to be matched and Xr belongs to the model; they are expressed as

$$X_l = \{R_{l1}, R_{l2}, \ldots, R_{lm}\}$$
$$X_r = \{R_{r1}, R_{r2}, \ldots, R_{rn}\}$$

Among them, R_{l1}, R_{l2}, ..., R_{lm} and R_{r1}, R_{r2}, ..., R_{rn} represent the different relationships among the various components of the object to be matched and of the model, respectively. To match the two relation sets Xl and Xr and realize the mapping from Xl to Xr, a series of corresponding mappings p_j must be found that satisfy

$$\mathrm{dis}_C(X_l, X_r) = \inf_p \left\{ \sum_j^m V_j \sum_i W_{ij}\, C\!\left[E_{ij}(p_j)\right] \right\}$$

Among them, dis_C(Xl, Xr) represents the distance, in terms of the error E, between the corresponding terms expressed by the respective pairs of Xl and Xr; p is the transformation between the corresponding expressions; V and W are weights; and C(E) represents the cost expressed in terms of the error E.

Composite operation
An operation in relation matching, denoted by the operator ∘. Suppose p is the corresponding transformation (mapping) from S to T, p^{-1} is the corresponding transformation (inverse mapping) from T to S, and Rl and Rr are the corresponding relationship expressions: Rl ⊆ S^M = S(1) × S(2) × ... × S(M), Rr ⊆ T^N = T(1) × T(2) × ... × T(N). Then Rl ∘ p means transforming Rl by the transform p, that is, mapping S^M to T^N, and Rr ∘ p^{-1} indicates that Rr is transformed by the inverse transformation p^{-1}, that is, T^N is mapped to S^M:

$$R_l \circ p = f[T(1), T(2), \ldots, T(N)] \in T^N$$
$$R_r \circ p^{-1} = g[S(1), S(2), \ldots, S(M)] \in S^M$$

Here f and g represent combinations of expressions of a relationship.
41.2 Image Matching
Image matching can be performed at different (abstract) levels, including images, objects, relationship structures, scene descriptions, etc. (Hartley and Zisserman 2004; Jähne and Haußecker 2000; Prince 2012; Szeliski 2010; Zhang 2009).
41.2.1 Image Matching Techniques

Image matching
An important class of image understanding techniques. It is often based on the geometric and grayscale properties of the images (in recent years, more feature characteristics of images and objects have also been considered) to establish correspondences between different images.
1. The comparison of two images, often evaluated using cross-correlation. See template matching.
2. In content-based image retrieval, finding images from the image library that correspond in some way to a given example image.

Matching based on raster
Matching performed pixel by pixel using the raster representation of the image.

Property-based matching
The process of comparing two entities (image features or patterns) using their characteristics. Here the characteristics can be various physical descriptions of regions, such as the moments of a region. See classification, boundary property, and metric property.

Image invariant
A characteristic of an image or image feature that remains unchanged when the image is disturbed or transformed. For example, the ratio between colors in an image will not change due to image scaling. Such invariants are useful in image pattern recognition, image matching, and image retrieval.

Constrained matching
A common term for identification by comparing two objects under their respective constraints or common constraints. Typical examples are detecting a moving target in a specific region (as opposed to a stationary target) or tracking the same person in different fields of view.

Object matching
A general image matching method that uses the objects in the image as the basic units of matching.

Polygon matching
A class of techniques used to match polygon shapes. See polygon.
Pairwise geometric histogram [PGH]
A line- or edge-based shape representation for object matching, particularly suitable for 2-D images. The histogram representation can be obtained by calculating the relative angle and perpendicular distance of each line segment from all other line segments. This representation is invariant to rotation and translation. The Bhattacharyya distance can be used to compare pairwise geometric histograms.

Dynamic time warping
A technique for matching a series of observations (usually sampled over time) with a sequence of feature models. The goal is to get a one-to-one match between observations and features. Because the rate of observation may change (time-varying), some features may be skipped or matched with multiple observations. The goal of dynamic time warping is to minimize the number of skips or reduce multi-sample matching. Effective algorithms for solving this problem are often based on the linear ordering of the sequences. See hidden Markov models.
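As a concrete illustration of dynamic time warping, the following sketch (plain Python with NumPy; function and variable names are illustrative) computes the classic DTW cost between two 1-D sequences by dynamic programming:

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping cost between 1-D sequences x and y."""
    n, m = len(x), len(y)
    # D[i, j] = minimal accumulated cost aligning x[:i] with y[:j].
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])        # local matching cost
            D[i, j] = cost + min(D[i - 1, j],      # skip a sample of x
                                 D[i, j - 1],      # skip a sample of y
                                 D[i - 1, j - 1])  # match both samples
    return D[n, m]

a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 3, 3, 2, 1, 0]
print(dtw_distance(a, b))  # small cost: b is a time-warped version of a
```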
41.2.2 Feature Matching Techniques

Feature matching
After (geometric or other) features have been extracted from two or more images and their representations and descriptions obtained, the technique and process of establishing the correspondence between them. There are two main points here: one is to choose a matching strategy, that is, to determine which correspondences are passed on for further processing; the other is to design efficient data structures and algorithms to achieve effective matching. There are two strategies in practice: matching the image characteristics of the same object in two or more images, or using the characteristics of a known object to match the characteristics of an unknown object. Typical examples of the former include feature-based stereo, and typical examples of the latter include feature-based recognition.

Point matching
A class of point-to-point methods and processes for feature matching or stereo vision correspondence.

Point similarity measure
The process of measuring the similarity of image points; it also refers to the measurement function. Here, a small neighborhood of each image point must be considered so that enough information is available to characterize the position of the point. Typical measurement functions include cross-correlation, the sum of absolute differences (SAD), and the sum of squared differences (SSD).
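A minimal sketch of such point (patch) similarity measures, assuming two equal-sized grayscale patches stored as NumPy arrays (function names are illustrative):

```python
import numpy as np

def sad(p, q):
    """Sum of absolute differences between two equal-sized patches."""
    return np.abs(p.astype(float) - q.astype(float)).sum()

def ssd(p, q):
    """Sum of squared differences between two equal-sized patches."""
    d = p.astype(float) - q.astype(float)
    return (d * d).sum()

def ncc(p, q):
    """Zero-mean normalized cross-correlation; values near 1 mean very similar."""
    p = p.astype(float) - p.mean()
    q = q.astype(float) - q.mean()
    denom = np.sqrt((p * p).sum() * (q * q).sum())
    return (p * q).sum() / denom if denom > 0 else 0.0

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(7, 7))
noisy = patch + rng.normal(0, 2, size=(7, 7))   # slightly perturbed copy
print(sad(patch, noisy), ssd(patch, noisy), ncc(patch, noisy))
```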
Feature point
1. The position of a specific image feature (or its center) in the image, or a pixel with unique or special properties in the image. Often used as landmark points or control points in image matching, such as corner points, inflection points, etc.
2. A point identified in the image using special local operators (SIFT, SURF, etc.). Also called a salient point. Can be used as landmark points and control points in image matching.

Feature point correspondence
Matching corresponding feature points in two or more images. It is assumed here that the feature points to be matched in different images all correspond to the same point in the scene. If the correspondence between the feature points can be established, triangulation techniques can be used to estimate the depth of the scene from binocular stereo vision, the fundamental matrix, a homography, the trifocal tensor, etc., so as to reconstruct the scene or determine the 3-D object motion.

Feature tracking using an affine motion model
A feature matching method that uses an affine motion model to compare a window to be matched with a reference window. Generally, a translation model is used first, and the position thus estimated is then used to initialize the affine registration between the two windows. In object tracking, the search range is generally the region around the predicted position in adjacent frames. The tracker thus obtained is called the Kanade-Lucas-Tomasi tracker.

Feature match densification
Further enhancement of preliminary feature matching results. Matching can be a progressive process: after obtaining some preliminary matches, they can be used as seeds to verify further matches. This process can be regarded as a random sampling process, and it can also be solved by RANSAC. For stereo vision, for example, corresponding points can be further searched along the epipolar lines.

Feature match verification
Verification of feature matching results. Generally, after obtaining some hypothetical matching results, geometric registration can be used to verify which of these matches are inliers and which ones are outliers.

Feature-based correspondence analysis
The method of performing correspondence analysis between selected image features; it can be used for stereo matching. Features can be particularly noticeable parts of the image, such as edges or pixels along edges. The analysis can be based on the characteristics of the features, such as the orientation or length of an edge. Note that in this case, due to the sparseness of features, a dense disparity map often cannot be obtained directly by analyzing the correspondence. See stereo matching based on features.

Sparse correspondence
A concept that stands in opposition to dense correspondence. See feature-based correspondence analysis and stereo matching based on features.
Stereo matching based on features
Stereo matching using feature points or feature point sets in the images to be matched. Frequently used features are inflection points and corner points, edge segments, and object contours in the image. In recent years, SIFT feature points and SURF feature points have been used more frequently. These features should be unique in the image.

Sparse matching point
A matching point obtained directly from stereo matching based on features. Since the feature points are only some specific points on the object and there is a certain interval between them, the matching points obtained by feature-based stereo matching are sparsely distributed in space. The remaining (corresponding) points can be obtained by interpolation.
41.2.3 Correlation and Cross-Correlation

Correlation
1. A mathematical operation. The correlation of the two functions f(x, y) and g(x, y) can be expressed as

$$f(x, y) \star g(x, y) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} f(x+u, y+v)\, g(x, y)\, \mathrm{d}x\, \mathrm{d}y = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} f(x, y)\, g(x+u, y+v)\, \mathrm{d}x\, \mathrm{d}y$$
If f(x, y) and g(x, y) are the same function, the correlation is called autocorrelation; if they are not the same function, the correlation is called cross-correlation.
2. The degree of linear correlation or dependence between two or more quantities.

Pearson's correlation coefficient
See correlation.

Pairwise correlation
See correlation.

Paired contours
A pair of contours that appear in the image at the same time and have a special spatial relationship. For example, the contours generated by riverbeds in aerial images, or the contours of human limbs (arms, legs). This co-occurrence can be used to make their detection more robust.
Paired boundaries
See paired contours.

Magnitude correlation
1. The relationship between the magnitudes of two vectors, regardless of their phase or orientation.
2. A frequency space correlation that uses the amplitude information of the frequency spectrum to establish a matching correspondence between images.

Grayscale correlation
Cross-correlation between the gray values in an entire image or image window.

Frequency space correlation
A method of image matching or image registration through correlation calculations in the frequency domain. Here, the image is first converted to the frequency domain by the (fast) Fourier transform, and then the correlation calculation is performed in the frequency domain to establish the correspondence.

Phase correlation
1. A method of frequency space correlation using the phase information of the frequency spectrum to establish a matching correspondence between two images. See phase correlation method.
2. A motion estimation method using the Fourier transform's translation-phase duality (i.e., an offset in the spatial domain is equivalent to a phase offset in the frequency domain). The frequency spectra of the two images to be matched are whitened by dividing their per-frequency product by the amplitudes of the Fourier transforms (before the final inverse Fourier transform):

$$\mathcal{F}\{E_{PC}(\boldsymbol{u})\} = \frac{F_1(\boldsymbol{\omega})\, F_2^{*}(\boldsymbol{\omega})}{\|F_1(\boldsymbol{\omega})\|\, \|F_2(\boldsymbol{\omega})\|}$$
When using logarithmic polar coordinates and the rotation and scaling properties of the Fourier transform, spatial rotation and scaling can also be estimated from the phase offset in the frequency domain. See planar motion estimation.

Phase correlation method [PCM]
A frequency domain motion estimation technique that uses a normalized cross-correlation function calculated in the frequency domain to estimate the relative offset between two image blocks (one in the anchor frame and the other in the target frame). The basic idea is to perform a spectrum analysis on two successive fields and compute the phase differences. These offsets undergo an inverse transformation that reveals peaks whose positions correspond to the inter-field motion. The main steps are as follows (a code sketch of these steps appears at the end of this subsection):
1. Calculate the discrete Fourier transform of each block in the target frame and the anchor frame (recorded as frame k and frame k + 1, respectively).
2. Calculate the normalized cross power spectrum between the discrete Fourier transforms according to the following formula:
$$C_{k,k+1} = \frac{F_k\, F_{k+1}^{*}}{\left\| F_k\, F_{k+1}^{*} \right\|}$$
Among them, F_k and F_{k+1} are the DFTs of two blocks having the same spatial position in frame k and frame k + 1, and * denotes the complex conjugate.
3. Calculate the phase correlation function PCF(x):

$$\mathrm{PCF}(\boldsymbol{x}) = \mathcal{F}^{-1}(C_{k,k+1}) = \delta(\boldsymbol{x} + \boldsymbol{d})$$

where $\mathcal{F}^{-1}$ denotes the inverse DFT and d is the displacement vector.
4. Determine the peak of PCF(x).

Cross-correlation matching
Matching of two sets based on cross-correlation. The closer the cross-correlation value is to 1, the better the match. For example, in correlation-based stereo vision, for each pixel in the first image, its corresponding pixel in the second image is the pixel with the highest correlation value. To calculate the correlation value, a local neighborhood of each pixel is generally used.

Cross-correlation
Correlation between two different functions. The cross-correlation of two signals can be calculated by the dot product of the vectors or matrices expressing the signals. It is the standard method for estimating the degree of correlation between two sequences. Given two sequences {x_i} and {y_i}, where i = 0, 1, ..., (N – 1), the cross-correlation between them at lag d is defined as

$$C_d = \frac{\sum_i (x_i - m_x)\,(y_{i-d} - m_y)}{\sqrt{\sum_i (x_i - m_x)^2}\,\sqrt{\sum_i (y_{i-d} - m_y)^2}}$$

where m_x and m_y are the means of the corresponding sequences.

Intensity cross-correlation
The use of cross-correlation of intensity data between two images.

Intensity data
Image data representing the brightness of the measured light. In practice, there is often not a linear mapping between the measured light and the stored value. This term can also refer to the brightness of the observed visible light.

Angle between a pair of vectors
The angle between two vectors in space. For a pair of vectors x and y, the angle θ between them is
$$\theta = \arccos \frac{\boldsymbol{x} \cdot \boldsymbol{y}}{\|\boldsymbol{x}\|\, \|\boldsymbol{y}\|}$$
where the dot product (denoted as x•y) is equal to the length of x projected onto y and the norms of vectors x and y are ||x|| and ||y||. This can be used to determine the similarity between images (representing images as vectors).
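Returning to the phase correlation method steps listed above, here is a minimal sketch (NumPy FFT; array names and the test setup are illustrative) that locates the peak of the phase correlation function between two image blocks related by a pure translation:

```python
import numpy as np

def phase_correlation_peak(block_k, block_k1):
    """Locate the peak of the phase correlation function of two equal-sized blocks."""
    F_k = np.fft.fft2(block_k)            # step 1: DFT of each block
    F_k1 = np.fft.fft2(block_k1)
    cross = F_k * np.conj(F_k1)           # step 2: normalized cross power spectrum
    C = cross / (np.abs(cross) + 1e-12)   # small constant avoids division by zero
    pcf = np.real(np.fft.ifft2(C))        # step 3: phase correlation function
    peak = np.unravel_index(np.argmax(pcf), pcf.shape)  # step 4: locate the peak
    # Wrap coordinates so that peaks beyond half the block size become negative.
    return [p if p <= s // 2 else p - s for p, s in zip(peak, pcf.shape)]

rng = np.random.default_rng(1)
a = rng.random((64, 64))
b = np.roll(a, shift=(5, -3), axis=(0, 1))  # b is a circularly shifted copy of a
# The peak appears at x = -d (PCF(x) = delta(x + d)); this prints [-5, 3].
print(phase_correlation_peak(a, b))
```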
41.2.4 Mask Matching Techniques

Template matching
A basic image matching method. A small image (the template) is used to match a part (sub-image) of a larger image. The purpose of matching is to determine whether the small image exists in the large image and, if so, to further determine its position in the large image. An example of template matching is shown in Fig. 41.5 for a template image w(x, y) of size J × K and a large image f(x, y) of size M × N, with J ≤ M and K ≤ N, where (s, t) is the translation of the template.
Template matching is also a strategy for locating an object in an image. The template is a 2-D image of the desired object and is compared with all windows of the image to be searched that have the same size as the template. A window whose computed difference from the template (using normalized correlation, the sum of squared differences (SSD), etc.) is smaller than a given threshold corresponds to the desired object. To obtain invariance to scale, rotation, or other transformations, the template must be subjected to the corresponding transformations.

Geometric model of eyes
A model describing the structure of the eye with geometric figures. Generally, it refers to the geometric model constructed to represent the upper and lower eyelids
Fig. 41.5 Template matching
Fig. 41.6 An improved geometric model for eye
and the iris information (including the position and radius of the iris, the positions of the corners of the eye, the degree of eye opening, etc.). A typical eye geometry model uses a circle and two parabolas to represent the contour of the iris and the contours of the upper and lower eyelids, respectively. It is also called a deformable model or template. An improved model can be seen in Fig. 41.6. This model can be represented by a seven-tuple (O, Q, a, b, c, r, θ), where O is the center of the eye (located at the origin of the coordinates), Q is the center of the iris, a and c are the heights of the upper and lower parabolas, b is half the length of the parabolas, r is the radius of the iris, and θ represents the angle between the line on which the two parabolas intersect and the X axis. In this model, the intersection points of the two parabolas are two corner points (P1 and P2). In addition, the iris circle has two intersection points with the upper and lower parabolas, respectively, forming the other four corner points (P3 and P4, P5 and P6). Using the information of P1 and P2 can help adjust the width of the eyes, and using the information of P3 to P6 can help determine the height of the eyelids and compute the eye parameters accurately and robustly.

Geometric flow
A variant of the deformable template. It makes the template more suitable for template matching by changing the form of the initial curve. Geometric here means that the change is entirely determined by the geometry of the curve.

Region matching
1. Establishing a correspondence between the members of two sets of regions.
2. Determining the similarity between two regions, that is, solving the feature matching problem between regions. See template matching, color matching, and color histogram matching.

Correlation function
In template matching, a function that measures similarity. Suppose what needs to be matched is a template image w(x, y) of size J × K and a large image f(x, y) of size M × N, with J ≤ M and K ≤ N. The correlation function between f(x, y) and w(x, y) can be written as
$$c(s, t) = \sum_x \sum_y f(x, y)\, w(x - s, y - t)$$
where (s, t) is the matching position, s = 0, 1, 2, ..., M – 1 and t = 0, 1, 2, ..., N – 1. Compare correlation coefficient.

Correlation coefficient
In template matching, a function that measures similarity. Suppose what needs to be matched is a template image w(x, y) of size J × K and a large image f(x, y) of size M × N, with J ≤ M and K ≤ N. The correlation coefficient between f(x, y) and w(x, y) can be written as

$$C(s, t) = \frac{\sum_x \sum_y \left[f(x, y) - \bar{f}(x, y)\right]\left[w(x - s, y - t) - \bar{w}\right]}{\left\{\sum_x \sum_y \left[f(x, y) - \bar{f}(x, y)\right]^2 \sum_x \sum_y \left[w(x - s, y - t) - \bar{w}\right]^2\right\}^{1/2}}$$

where (s, t) is the matching position, s = 0, 1, 2, ..., M – 1 and t = 0, 1, 2, ..., N – 1; $\bar{w}$ is the mean of w (which needs to be calculated only once); and $\bar{f}(x, y)$ is the mean of the region in f(x, y) corresponding to the current position of w. The correlation coefficient normalizes the correlation function, so it is less sensitive to changes in the magnitudes of f(x, y) and w(x, y).

Minimum average difference function
A metric function for template matching. Suppose what needs to be matched is a template image w(x, y) of size J × K and a large image f(x, y) of size M × N, with J ≤ M and K ≤ N. The minimum average difference function between f(x, y) and w(x, y) can be expressed as

$$M_{ad}(s, t) = \frac{1}{MN} \sum_x \sum_y \left| f(x, y) - w(x - s, y - t) \right|$$

where s = 0, 1, 2, ..., M – 1 and t = 0, 1, 2, ..., N – 1.

Minimum mean square error function
A metric function for template matching. Suppose what needs to be matched is a template image w(x, y) of size J × K and a large image f(x, y) of size M × N, with J ≤ M and K ≤ N. The minimum mean square error function between f(x, y) and w(x, y) can be written as

$$M_{me}(s, t) = \frac{1}{MN} \sum_x \sum_y \left[ f(x, y) - w(x - s, y - t) \right]^2$$

where s = 0, 1, 2, ..., M – 1 and t = 0, 1, 2, ..., N – 1.
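The similarity measures just defined can all be evaluated by sliding the template over the large image. A minimal sketch (NumPy; names are illustrative, the template is placed by its top-left corner rather than by the book's (x − s, y − t) shift convention, and the loops are kept explicit for clarity rather than speed) that locates the best match using the correlation coefficient:

```python
import numpy as np

def match_template_corrcoef(f, w):
    """Return the position (s, t) maximizing the correlation coefficient of w inside f."""
    M, N = f.shape
    J, K = w.shape
    w_zero = w.astype(float) - w.mean()            # template mean removed once
    best, best_pos = -np.inf, (0, 0)
    for s in range(M - J + 1):                     # valid placements only
        for t in range(N - K + 1):
            region = f[s:s + J, t:t + K].astype(float)
            r_zero = region - region.mean()        # local mean of f under the template
            denom = np.sqrt((r_zero ** 2).sum() * (w_zero ** 2).sum())
            if denom > 0:
                c = (r_zero * w_zero).sum() / denom
                if c > best:
                    best, best_pos = c, (s, t)
    return best_pos, best

rng = np.random.default_rng(2)
f = rng.random((60, 80))
w = f[20:28, 35:45].copy()                         # template cut out of the image
print(match_template_corrcoef(f, w))               # expected position (20, 35), score ~1.0
```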
Matched filter
An operator that produces a strong response in the output image if the input image contains a pattern that matches it. For example, by tuning the filter to match a certain object at a certain scale, the object can be detected. It is similar to template matching; the only difference is that a matched filter can also be tuned to match spatially separated patterns. This term was first used in signal processing and was later introduced into image processing.

Spatial matched filter
See matched filter.

Coarse-to-fine
In various matching methods, a strategy for speeding up the computation. In matching (especially pixel-based matching), it is often necessary to evaluate all possible candidates and choose the best result. When the search range is large, a small search step leads to a large amount of calculation, while a large search step leads to low accuracy. If a coarse-to-fine strategy is adopted, a good balance can be achieved between calculation speed and matching accuracy.

Pattern matching
See template matching.

Correlation matching
Matching of entities by means of correlation.
41.2.5 Diverse Matching Techniques

Matching based on inertia equivalent ellipses
A method of object matching based on the inertia equivalent ellipse derived from the moments of inertia. As shown in Fig. 41.7, each object in the images to be matched is represented by its equivalent ellipse, and the problem of matching the objects is transformed into the matching of their equivalent ellipses.

Equivalent ellipse of inertia
An ellipse equal in inertia to an object region and normalized by the region area. Also called the inertia equivalent ellipse, or the equivalent ellipse for short. From the calculation of the object's moments of inertia, the directions and lengths of the two
Fig. 41.7 Matching based on inertia equivalent ellipses
principal axes of the ellipse are completely determined. The slopes of the two major axes (relative to the horizontal axis) are k and l, respectively:

$$k = \frac{1}{2H}\left[(A - B) - \sqrt{(A - B)^2 + 4H^2}\right]$$
$$l = \frac{1}{2H}\left[(A - B) + \sqrt{(A - B)^2 + 4H^2}\right]$$
The lengths of the two semi-major axes (p and q) are

$$p = \sqrt{2 \Big/ \left[(A + B) - \sqrt{(A - B)^2 + 4H^2}\right]}$$
$$q = \sqrt{2 \Big/ \left[(A + B) + \sqrt{(A - B)^2 + 4H^2}\right]}$$

Among them, A and B are the moments of inertia of the object around the X and Y coordinate axes, respectively, and H is the product of inertia. The object in a 2-D image can be regarded as a planar uniform rigid body. The equivalent inertia ellipse established on it reflects the distribution of the points on the object. With the inertia equivalent ellipse, two irregular regions can be registered.

Inertia equivalent ellipse
Same as equivalent ellipse of inertia.

Histogram matching
Same as histogram specification.

Color histogram matching
A technique used for color image indexing, where similarity measures can use various distances between the color histograms of the two images, such as the Kullback-Leibler distance or the Bhattacharyya distance.

Similarity computation based on histogram
In order to realize image retrieval, the matching calculation is performed by means of the statistical histograms of the images. The goal is to obtain similarity distances between images. It has been widely used in color-based image retrieval. There are multiple methods for this similarity calculation, and different distance measures can be used.

Pyramid matching
Matching with the help of pyramids built using the pyramid transform.

Spatial pyramid matching
A form of pyramid matching in which spatial decomposition is performed hierarchically. For example, this decomposition can be achieved by iteratively subdividing
Fig. 41.8 Statistics of vocabulary in images
with a factor of 2 in each direction. Compared with ordinary pyramid matching, it has the advantage of considering the spatial distribution of different features. At each level and spatial partition, some form of bag-of-words matching is performed, and then the matching results are combined hierarchically. In this process, weighting according to the size of the spatial area can also be considered. Specifically, the image to be matched can be decomposed into a pyramid structure, and the number of vocabulary words contained in each part is counted for matching. The vocabulary statistics of each part can be gathered with the help of histograms, as shown in Fig. 41.8 (taking a two-level pyramid as an example, the left picture is level 0 and the right picture is level 1), where the different graphic symbols represent different words and the histograms in the small graphs represent the region statistics.

Geometric model matching
Comparison of two geometric models, or comparison of a model with a set of image data (corresponding to the target), in order to identify the target.

Curve matching
A method of comparing the currently considered data set with a previously modeled curve or another curve data set and establishing a correspondence relationship. If the modeled curve corresponds closely to the data set, a similar interpretation can be made. Curve matching is different from curve fitting, because curve fitting minimizes over the parameters of a theoretical model rather than over the actual samples.

Contour matching
See curve matching.

Contour matching with arc length parameterization
A simple contour matching method. Consider the two objects with similar contours in Fig. 41.9. First, subtract the average $x_0 = \int_s x(s)\,\mathrm{d}s$ from the object representation. Next, scale the result so that s runs from 0 to 1 instead of 0 to S, that is, divide x(s) by S. Finally, the Fourier transform is performed on the normalized result, regarding each x = (x, y) as a complex number. If the original contours are the same up to a certain rotation and translation, the resulting Fourier transform is just a
Fig. 41.9 Contour matching with arc length parameterization
scale shift in amplitude, a constant phase difference (derived from the rotation), and a starting-point difference with respect to s.

Chamfer matching
A technique based on the comparison of contours that evaluates the similarity of two point sets using the concept of the chamfer distance. It can be used to match edge images with the help of distance transforms. To determine the parameters (such as translation, scaling, etc.) used to register a test image to a reference image, the binary edge map of the test image is compared with the distance transform. The edges are detected in the first image, the distance transform of its edge pixels is calculated, and then the edges in the second image are matched against it. This can also be used for the alignment of binary images. See Hausdorff distance.

Mean squared edge distance [MSED]
A similarity metric used when matching with edges. The advantage of using edge matching is that it is insensitive to lighting changes. Edge matching is performed between the edges of the image and the edges of the template. For each edge point in the template, the edge point in the image closest to it needs to be found. The concern here is the distance to the nearest edge point, not which specific point is closest. Therefore, one can perform edge detection on the image, obtain a segmented edge map, and apply a distance transform to the background of the edge map. If the distance measure between the edge points of the template and the edge points of the edge map is lower than a threshold, the template can be considered to have a match. Let the set of edge points in the template be S, and let the distance transform of the background of the edge map be d(x, y); then the mean squared edge distance between the template and the image is (with n = |S| the number of edge points in the template)

$$\mathrm{MSED}(x, y) = \frac{1}{n} \sum_{(s, t) \in S} d(x + s, y + t)^2$$
It should be noted that if some edge points in S do not appear in d(x, y) (e.g., due to occlusion or detection problems), the distance between these edge points and their closest edge points in the image may be large, resulting in no matching location.

Boundary matching
See curve matching.
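A minimal sketch of the MSED computation, assuming SciPy's Euclidean distance transform is available (the edge map and template used here are illustrative binary arrays):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def msed_map(edge_image, template_points):
    """MSED of a set of template edge points at every placement in the image."""
    # Distance from every pixel to its nearest edge pixel (edges are marked with 1).
    d = distance_transform_edt(edge_image == 0)
    H, W = edge_image.shape
    h = max(s for s, t in template_points) + 1
    w = max(t for s, t in template_points) + 1
    n = len(template_points)
    out = np.full((H - h + 1, W - w + 1), np.inf)
    for x in range(H - h + 1):
        for y in range(W - w + 1):
            out[x, y] = sum(d[x + s, y + t] ** 2 for s, t in template_points) / n
    return out  # low values indicate good template placements

edges = np.zeros((40, 40), dtype=int)
edges[10, 10:20] = 1                      # a horizontal edge segment in the image
template = [(0, t) for t in range(10)]    # template: a horizontal 10-pixel edge
scores = msed_map(edges, template)
print(np.unravel_index(np.argmin(scores), scores.shape))  # (10, 10): best placement
```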
Edge matching
See curve matching.

Volume matching
Using volume representations to identify correspondences between targets or between subsets of targets.

Sub-volume matching
Matching between a video and a 3-D sub-image serving as the template. For example, it is possible to match actions with templates from the perspective of spatiotemporal motion. The main difference between this method and component-based methods is that it does not need to extract action descriptors from the extreme points of scale space but only needs to check the similarity between two local spatiotemporal patches (by comparing the motion between them). The advantage of sub-volume matching is that it is more robust to noise and occlusion. If combined with optical flow characteristics, it is also robust to appearance changes. The disadvantage of sub-volume matching is that it is susceptible to background changes.

Model-assisted search
An algorithm that searches for matches (such as in face recognition) in which higher-level models (such as a 3-D model of a face) are used to assist the search.
41.3 Image Registration
Compared with image matching, image registration generally has a narrower meaning (Hartley and Zisserman 2004). It mainly refers to establishing the correspondence between images obtained at different times or from different positions, especially the geometric correspondence (for geometric correction). The final effect is often reflected at the pixel level (Forsyth and Ponce 2012; Szeliski 2010; Zhang 2017c).
41.3.1 Registration

Registration
See image alignment and image registration.

Fourier-based alignment
Image registration using the Fourier transform. The principle is that the Fourier transform of a translated signal has the same amplitude as the Fourier transform of the original signal but a linearly varying phase.
Fig. 41.10 Huangpu River basin model overlaid on aerial Shanghai map
Map registration
Registering a symbolic map with aerial or satellite imagery. For this, it is often necessary to identify roads, buildings, or ground targets. Figure 41.10 shows the effect of overlaying the Huangpu River basin model on an aerial Shanghai map.

Elastic registration
A higher-order version of elastic matching, in which corresponding points or surfaces are matched or merged by minimizing some deformation cost.

Affine registration
Using affine transformations to register two or more images, two or more surface meshes, or two or more point clouds.

Procrustes analysis
A method of comparing two data sets by minimizing the squared error over translation, rotation, and scaling (and then performing various forms of registration).

Procrustes average
A concept in Procrustes analysis. Procrustes analysis aligns two targets and defines the Procrustes distance between them. Given a set of targets, a reference target can be found that minimizes the Procrustes distance between the reference target and each target. This reference target is the Procrustes average.

Medical image registration
A special but very typical image registration application. The sequence images obtained by CT or MRI in medicine need to be used to build a complete three-dimensional tissue model or to track the development of a patient's lesions over time. It is often necessary to use mutual-information-based similarity measures to help track the change of the same patient's condition over time (longitudinal studies). In addition, it is often necessary to compare the images of different patients to determine their common ground or to detect the results caused by variations and lesions
(horizontal studies). Finally, medical objects are often smooth, elastic, and deformable. All these factors make image registration widely used in medicine.
The term is also used generally to indicate the registration of two or more medical image types. It can also refer to atlas-based registration. Typical examples include the registration of TCT images with MRI images.

Atlas-based registration
Often refers to the registration of different kinds of medical images. See medical image registration.

Ultrasound sequence registration
Registration of successively acquired, superimposed ultrasound images.

Symbolic registration
Same as symbolic matching.

Rigid registration
Image registration in which neither the model nor the data is allowed to deform. This simplifies registration to a Euclidean transformation that estimates the alignment of the model with the data. Contrast with non-rigid registration.

Non-rigid registration
Registering or aligning two objects whose configurations can differ (unlike rigid bodies). For example, a pedestrian or a facial feature (such as a mouth) is non-rigid because its shape changes over time. This kind of matching is often needed in medical imaging because parts of the human body deform. Non-rigid registration is more complicated than rigid registration. See alignment and registration.

Joint entropy registration
Registration of data using joint entropy (a measure of the degree of uncertainty) as the criterion.

Iterative closest point [ICP]
(1) An algorithm that minimizes the difference between two point clouds. The algorithm iteratively modifies the (translation, rotation) transformation so as to minimize the distance between the two initial scans. It is often used to reconstruct 3-D surfaces from different scans, to localize a robot, and to obtain optimal path planning.
(2) A technique mainly used for the alignment/registration of rigid bodies. The basic idea is to match the rigid bodies according to selected geometric characteristics, take these matched points as putative corresponding points, and then solve for the registration parameters based on this correspondence. These registration parameters are then used to transform the data. Finally, the same geometric features are used to determine a new correspondence relationship. Specifically, the following two steps are repeated until the goal is achieved: 1) given the estimated transformation from the first shape to the second shape, find the closest feature in the second shape to each feature in the first shape; and 2) given the new closest feature set, re-estimate the
transformation that maps the first feature set to the second feature set. Many variants of this algorithm require a good estimate of the initial shape alignment.

Iterative closest point [ICP] algorithm
See iterative closest point.

Iterated closest point [ICP]
A widely used 3-D registration technique, characterized by alternating between two tasks. The first task is to determine the closest points between the two surfaces to be registered, and the second task is to solve a 3-D absolute orientation problem. See iterative closest point.
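The two alternating ICP steps can be sketched as follows (NumPy; a 2-D rigid case with brute-force nearest neighbors and an SVD-based pose update; all names are illustrative and the fixed iteration count stands in for a real convergence test):

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares rotation R and translation t mapping point set P onto Q."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - cP).T @ (Q - cQ))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # avoid a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, cQ - R @ cP

def icp(src, dst, iters=30):
    """Iteratively align point set src to point set dst."""
    cur = src.copy()
    for _ in range(iters):
        # Step 1: closest point in dst for every point of cur.
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(axis=2)
        matched = dst[d2.argmin(axis=1)]
        # Step 2: re-estimate the rigid transformation and apply it.
        R, t = best_rigid_transform(cur, matched)
        cur = cur @ R.T + t
    return cur

rng = np.random.default_rng(3)
model = rng.random((50, 2))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle)],
                   [np.sin(angle),  np.cos(angle)]])
scene = model @ R_true.T + np.array([0.5, -0.2])   # rotated and shifted copy
aligned = icp(model, scene)
print(np.abs(aligned - scene).max())               # small residual if ICP converged
```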
41.3.2 Image Registration Methods

Image registration
A technique for aligning, overlapping, or matching two homogeneous objects (such as images, curves, or models). The correspondence between two (or more) images describing the same scene should be established such that corresponding pixels originate from the same physical unit in the imaged scene. More specifically, it refers to computing a geometric transformation that superimposes one image region on another. For example, image registration determines the region that is common to two images and then determines a planar transformation (rotation and translation) to align them. Similarly, curve registration determines the transformation between similar or identical parts of two curves. The transforms need not be rigid; non-rigid registration is common in medical image processing (such as digital subtraction angiography). Note that in most cases there is no exact solution, because the two objects are not exactly the same, and the best approximate solution requires the use of least mean squares or other more complicated methods.
Image registration is closely related to image matching, but there are differences. Generally speaking, the meaning of image registration is relatively narrow. The main goal is to establish the correspondence relationship, especially the geometric relationship, between images obtained at different times or from different positions. The final effect is often reflected at the pixel level. From this point of view, registration can be seen as matching at a lower level of representation.

Control point
1. Corresponding points used in geometric distortion correction. They can be used to calculate the coordinate transformation parameters that realize the spatial transformation.
2. In parametric smooth curve representation systems, points used to specify the shape of a curve of a given order (such as linear, quadratic, or cubic). A curve is defined as an algebraic connection between all specified control points in
Fig. 41.11 A curve and four control points
space, but the points themselves need not be points of interest. Figure 41.11 shows a curve and four control points, two of which are not on the curve but still affect the degree of curve bending. Control points can be thought of as a variation of point-based algebraic curves. They can also be extended to surface representations.
3. In image registration, a typical feature commonly used in feature-based alignment. With control points, the registration parameters can be calculated using methods such as least squares, thereby registering the images.

Ground control point
Same as control point. A term commonly used in photogrammetry.

Landmark detection
A general term for detecting features used for image registration. Registration can be performed between two images or between a model and an image. A landmark can be task-specific, such as a component on a circuit board or an anatomical feature on a person (such as the tip of the nose); it can also be a more general image feature, such as various points of interest.

Intensity-based alignment
Establishing a connection between two images or two targets by estimating the intensity difference (which can be between corresponding pixels but is generally between corresponding small templates).

Feature-based alignment
Techniques and methods for establishing the connection between two images or two targets by estimating the difference in feature characteristics (which can be motion characteristics, geometric positions, etc.).

Phase-based registration
A registration technique that uses the local phase information of two images to determine the correct alignment of the images.

Multi-sensor alignment
Aligning images acquired by different sensors, often from different modalities (such as visible and infrared light). See multimodal image alignment.
Multi-image registration
A general term for geometrically aligning two or more images. Such registration allows pixels from different images to be overlapped or combined with each other. See sensor fusion. For example, aligning a series of successive brightness images one by one can help generate a surrounding panoramic image mosaic. Alternatively, the images can come from different kinds of sensors (see multimodal fusion). For example, magnetic resonance images and computed tomography images acquired of the same part of the human body can be registered to provide doctors with richer information.

Multimodal image alignment
Registration of images obtained by sensors based on different modalities, such as visible light and infrared light. Because different modalities generally have different responses, image registration is generally performed by aligning the same features instead of directly using brightness.

Temporal alignment
Determining the time-to-time correspondence between two time-based sequences. Frame images in a video sequence are typical examples with time-to-time correspondence. Dynamic time warping is a technique commonly used for temporal registration.

Multi-view image registration
See multi-image registration.

Temporal offset
1. The amount of time shift required to temporally register two time-based sequences or signals. For example, the number of frames or seconds between the starting points of two gait cycles.
2. The time interval between two events, for example, between the start time and end time of a sports match (the length of the match).
41.3.3 Image Alignment

Image alignment
The process of spatially matching and aligning two images, one of which is called the template image and the other the input image. This can be used to estimate the difference or change between the two images. See image registration.

Alignment
A method for geometric model matching by registering geometric models with image data. See image registration.
3-D alignment
Alignment of a set of 3-D spatial points by means of a coordinate transformation. If the 3-D transform's motion parameters are linear, as for translation, similarity, and affine motion, the least squares method can be used.

Surface matching
Identifying corresponding points on two 3-D surfaces; often used as a guide for surface registration.

Geometric alignment
Same as geometric correction. Several feature points on the two targets are aligned to facilitate the matching of the targets.

Coincidental alignment
A phenomenon in which two structures look related but are actually independent of each other, or in which the alignment is just the result of observing from some specific viewpoint. For example, random edges may be collinear, random planes may appear smooth, or the corners of targets may look close together. See nonaccidentalness.

Base alignment
Forcing a line of text onto a horizontal line, regardless of the relative size or other attributes of the letters.

Text analysis
In the context of image analysis, the detection and decoding of characters and words in image and video data.
41.3.4 Image Warping

Deformable shape registration
A form of registration between two object contours. It allows the contours to deform under a non-affine transformation in order to align one contour with another. Commonly used for registration between sequential video frames to achieve tracking of deforming targets. See non-rigid registration and elastic registration. Contrast with affine registration.

Image morphing
A geometric transformation technique that gradually changes the geometric relationships between the parts of one image and transforms it into another image. The basic idea is to produce a visual illusion by displaying the images in the middle of the change process. Image morphing was common in television, movies, and advertising in the 1980s and 1990s, but its impact has since decreased. In order to achieve the deformation, users often need to specify control points in the initial image and the final image. These control points are then used to generate two meshes (one for each image). Finally, the two meshes are connected using an affine transformation. See morphing.
Fig. 41.12 Morphing sequence
The term also refers to the process of warping and blending two or more images together. It is an effective way to change the appearance of an image or to animate it. Generally, the two images to be transformed are first warped toward an intermediate image and then blended together, so that the result avoids the ghosting caused by directly blending the two images.

Morphing
The method and process of starting from one image or object shape and changing it into another image or object shape through a series of intermediate image deformations (geometric transformation and color), in a continuous and seamless form. Figure 41.12 shows a morphing sequence from a cat to a tiger.

Congruencing
An operation that uses geometric distortion or deformation. Two images with different geometries but the same set of objects can be spatially transformed so that the size, shape, position, and orientation of each target in one image are the same as those of the corresponding target in the other image.

View morphing
Also called view interpolation. It can be used to generate a smooth 3-D animation from one view of a 3-D scene to another. In order to generate such a transition sequence, the camera matrix (i.e., a matrix containing internal and external parameters such as camera position, orientation, and focal length) needs to be smoothly interpolated first. Although linear interpolation can also be used, in order to obtain better results, it is often necessary to use raised cosine camera parameters and try to move the camera along a circular trajectory. To generate the intermediate frames, there are two types of methods: establishing a complete 3-D correspondence set, or generating a 3-D model for each reference view.

View interpolation
A family of techniques for synthesizing a new view based on two or more views previously collected of a target or scene. Techniques that can accomplish this include simple interpolation using feature point correspondences between images and projective transformation using the trifocal tensor. See view morphing.
Skew
A geometric error introduced by imaging with a non-orthogonal pixel grid (in which the rows and columns of pixels do not form strict 90° angles). Generally, it is only considered in high-precision photogrammetric applications.

Inverse compositional algorithm
A fast and effective image registration algorithm. Instead of updating an additive estimate Δp of the deformation parameters, it iteratively solves for an inverse incremental deformation W(x; Δp)^{-1}.

Mismatching error
Error due to failure to match accurately.

Geman-McClure function
A commonly used robust penalty function, also a robust metric for registration error. It is itself a smoothly changing function: it grows quadratically for small values of x but is relatively stable for large values of x:

$$\mathrm{GM}(x) = \frac{x^2}{1 + x^2/a^2}$$
Among them, a is a constant, called the outlier threshold, which can be calculated with the help of mean absolute deviation.
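A tiny sketch comparing the Geman-McClure penalty with the plain squared error (NumPy; the outlier threshold value used here is an arbitrary illustration):

```python
import numpy as np

def geman_mcclure(x, a=1.0):
    """Robust penalty: roughly x**2 for |x| << a, saturating near a**2 for large |x|."""
    return x ** 2 / (1.0 + x ** 2 / a ** 2)

residuals = np.array([0.1, 0.5, 1.0, 5.0, 50.0])
print(residuals ** 2)                 # squared error explodes for the outlier 50.0
print(geman_mcclure(residuals, a=1))  # robust penalty saturates below a**2 = 1
```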
41.4 Graph Isomorphism and Line Drawing
Graph isomorphism is a way to solve relation matching (Marchand-Maillet and Sharaiha 2000; Zhang 2009). With the help of line drawing markers, 3-D scenes and corresponding models can be matched to explain the scene (Shapiro and Stockman 2001).
41.4.1 Graph Matching

Graph matching
A general term describing techniques that compare two graph models. Some techniques attempt to discover graph isomorphism or subgraph isomorphism, while others only attempt to establish similarity between graphs.

Many-to-many graph matching
A version of graph matching in which vertex clusters obtained from one graph are matched with vertex clusters of another graph. This method is
Fig. 41.13 Many-to-many graph matching
especially effective in applications where a one-to-one matching is not perfect. Figure 41.13 shows a schematic.

Graph isomorphism
The case in which there is a one-to-one correspondence between the vertices and edges of two graphs. That graphs G and H are isomorphic is written G ≅ H. The necessary and sufficient condition is that there exist the following mappings between V(G) and V(H) and between E(G) and E(H):

$$P: V(G) \to V(H)$$
$$Q: E(G) \to E(H)$$
Among them, the mappings P and Q preserve the incidence relationships between vertices and edges. Determining whether two graphs are isomorphic is considered an NPC problem.

Isomorphics
Two graphs that form an isomorphic relationship. See isomorphic graphs.

Isomorphic decomposition
The process of decomposing a graph into several isomorphic subgraphs.

Isomorphic graphs
Two graphs satisfying isomorphism. For two graphs that have the same geometric representation but are not identical graphs, if the labels of the vertices and edges of one graph are appropriately renamed, a graph identical to the other graph can be obtained. Each such graph is called an isomorphic graph.

Isomorphic subgraphs
See isomorphic graphs.

Subgraph isomorphism
Graph isomorphism between a part (subgraph) of one graph and the whole of another graph. It can be used to determine the equality of a pair of subgraphs of two given graphs. Given graph A and graph B, the subgraph isomorphism problem is to enumerate all subgraph pairs (a, b), where a ⊂ A, b ⊂ B, a is isomorphic to b, and some given predicate p(a, b) is true. Proper modification of this problem can solve many graph problems, including determining the shortest path and finding the largest and smallest maximal cliques. The number of subgraphs of a given graph grows exponentially with the number of vertices, so the generalized subgraph isomorphism problem is NP-complete.
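Because isomorphism checking is expensive in general, the brute-force definition is easiest to state directly in code. The following sketch (plain Python; only practical for very small graphs, names are illustrative) tests all vertex permutations:

```python
from itertools import permutations

def are_isomorphic(vertices_g, edges_g, vertices_h, edges_h):
    """Brute-force graph isomorphism test for small undirected graphs."""
    if len(vertices_g) != len(vertices_h) or len(edges_g) != len(edges_h):
        return False
    eh = {frozenset(e) for e in edges_h}
    for perm in permutations(vertices_h):
        p = dict(zip(vertices_g, perm))                       # candidate vertex mapping P
        mapped = {frozenset((p[u], p[v])) for u, v in edges_g}
        if mapped == eh:                                      # edge mapping Q induced by P
            return True
    return False

# Two 4-cycles with different vertex names are isomorphic.
print(are_isomorphic(['A', 'B', 'C', 'D'],
                     [('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'A')],
                     ['u', 'v', 'w', 'x'],
                     [('u', 'w'), ('w', 'v'), ('v', 'x'), ('x', 'u')]))  # True
```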
Fig. 41.14 Subgraph isomorphism schematic
Fig. 41.15 Line graph schematic
Figure 41.14 shows an example of subgraph isomorphism, where the matching can be expressed as A:b – C:a – B:c.

Double-subgraph isomorphism
Graph isomorphism between a subgraph of one graph and a subgraph of another graph.
41.4.2 Line Drawing

Line drawing
See labeling of contours.

Line graph
For a graph G, the graph whose vertices correspond to the edges of G, denoted L(G). Two vertices of L(G) are connected if the corresponding edges of G have a common vertex. An example is shown in Fig. 41.15, with G on the left and L(G) on the right.

Arc set
In graph theory, one of the two sets that make up a graph. A graph G is defined by a finite non-empty vertex set V(G) and a finite edge set E(G). Each element of E(G) is called an arc or edge of G, and it connects a pair of vertices. See vertex set.

Line-drawing analysis
1. Analysis of hand-drawn or CAD drawings with the purpose of extracting symbol descriptions or shape descriptions. For example, research has considered extracting 3-D building models from CAD drawings. Another application is constructing circuit descriptions by analyzing hand-drawn circuit sketches.
2. The analysis of line intersections in a polyhedral blocks world scene with the purpose of understanding the 3-D structure of the scene.
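A minimal sketch of constructing the line graph L(G) from an edge list (plain Python; the edge list approximating the small graph of Fig. 41.15 is an assumption):

```python
from itertools import combinations

def line_graph(edges):
    """Vertices of L(G) are the edges of G; connect them when they share a vertex in G."""
    lg_vertices = [tuple(sorted(e)) for e in edges]
    lg_edges = [(e1, e2) for e1, e2 in combinations(lg_vertices, 2)
                if set(e1) & set(e2)]          # shared endpoint in G
    return lg_vertices, lg_edges

# Assumed edge list for a small graph on vertices u, v, w, x.
G_edges = [('u', 'v'), ('u', 'x'), ('v', 'x'), ('w', 'x')]
vertices, edges = line_graph(G_edges)
print(vertices)   # the L(G) vertices: uv, ux, vx, wx
print(edges)      # pairs of G-edges that share a vertex
```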
Fig. 41.16 Waltz line labeling example
Diagram analysis Syntactic analysis of line-drawing images (possibly with related text in reports or other documents). This field is closely related to visual language analysis. Waltz line labeling A scheme for interpreting polyhedral line images in the blocks world. The lines in each image are labeled to indicate which kind of edge in the scene produced them: concave, convex, occluding, cracked, or shadow. Considering the constraints provided by the junction labels in a constraint satisfaction problem, Waltz showed that sets of line segments whose labels are locally ambiguous can often be disambiguated globally. See line drawing. Figure 41.16 gives an example of line labeling. In the picture, a hollow cylinder is placed on a platform, there is a mark on the cylinder, and the cylinder creates a shadow on the platform. The side of the cylinder has two limbs. The upper top contour is divided into two parts by the two limbs. The upper contour limb occludes the background (platform), and the lower contour limb occludes the interior of the cylinder. The creases around the platform are convex, while the creases between the platform and the cylinder are concave. Shape from line drawing A symbolic algorithm that infers the 3-D nature of the scene object with the help of line drawings (unlike the accurate shape measurements in other "recover/rebuild shape from X" techniques). First, certain assumptions are made about the types of lines allowed, such as polyhedral objects only, no surface marks or shadows, and at most three lines meeting at an intersection. Second, a dictionary of line intersections is built, and a symbolic label is assigned to each possible line intersection under the given hypothesis. Finally, the line drawing can be traversed to obtain the labelings that meet the conditions. In general position The assumption, when a line drawing is labeled, about how a specific object is observed in a specific situation. It is assumed that each surface of the object is a plane and that the corner points where the planes intersect are formed by three faces (such a 3-D object can be called a trihedral corner). In this case, a small change in the viewpoint will not cause a change in the topological structure of the line drawing, that is, it will
Fig. 41.17 Line label schematic
not cause the disappearance of faces, edges, and connections. The object is in a general (normal) position in this case. Line label In an ideal world scene of polyhedral blocks, a symbol marked to describe and distinguish various types of line segments (here the line segments are not necessarily straight lines). Typical line segment types include discontinuities in convex or concave surfaces (fold edges), occluding contours (limbs) formed by fold edges against the background, crack edges created when the edges of two polyhedra are aligned, shadow edges, etc. To distinguish these different types, different labels are used to mark these line segments. Labeling is a step in scene understanding that helps determine the 3-D structure of a scene. See junction label. Figure 41.17 shows some commonly used line labels, in which convex edges are represented by +, concave edges are represented by –, and occluding edges are represented by >. Line labeling The technique or process of numbering line segments; a method for assigning line labels. In early scene understanding work, the edge segments are first detected and the line segments are numbered, and then the 3-D structure of the object is derived from the 2-D line segment topology. Link An edge whose two endpoints are different in a line drawing. Limb In labeling of contours technology, the contour formed where the normal of a continuous surface in a 3-D scene is perpendicular to the direction of the line of sight and the surface direction changes smoothly and continuously. Usually formed when viewing a smooth 3-D surface from the side. It is not a real edge that exists objectively in the 3-D scene, but an edge that is perceived visually, and its position depends on the direction of the line of sight. Blade In labeling of contours technology, when a continuous surface (called an occluding surface) in a 3-D scene occludes a part of another surface (called an occluded surface), and one walks along the contour of the former surface, a smooth continuous
Fig. 41.18 Crack edge schematic
contour line is formed in the surface normal direction. It is a true edge of the 3-D scene. Parallel edge Same as multiple edge. Multiple edge Two parallel edges with similar directions in the line drawing. Also called parallel edges. More generally refers to multiple edges sharing the same pair of vertices in a graph. Loop An edge whose two endpoints are the same in the line drawing. Also called circle. Crease A contour mark formed when the orientation of a 3-D visible surface changes suddenly or two 3-D surfaces meet at an angle. On both sides of the crease, the points on the 3-D surface are continuous, but the normal direction of the surface is not continuous. Crack edge In line labeling studies, represents the edge where two aligned blocks meet. It lies on the boundary between pixels. Each pixel has four crack edges, which are located between the pixel and its four 4-neighbor pixels, as shown in Fig. 41.18. The direction of the edge points in the direction of increasing grayscale, and the amplitude of the edge is the absolute value of the gray-level difference between the two pixels on both sides of the edge. Here, there are neither step edges nor fold edges.
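A minimal sketch of the crack edge definition above, assuming a small NumPy array stands in for a grayscale image (the values are arbitrary):

```python
# Crack-edge amplitudes of a grayscale image: one crack edge between each pair
# of 4-neighboring pixels; amplitude = absolute gray-level difference.
import numpy as np

img = np.array([[10, 10, 80],
                [10, 20, 80],
                [10, 10, 80]], dtype=float)

# Crack edges between horizontally adjacent pixels, shape (H, W-1).
vertical_cracks = np.abs(img[:, 1:] - img[:, :-1])
# Crack edges between vertically adjacent pixels, shape (H-1, W).
horizontal_cracks = np.abs(img[1:, :] - img[:-1, :])

# The sign of the difference tells which way the "increasing grayscale"
# direction points for each crack edge.
vertical_direction = np.sign(img[:, 1:] - img[:, :-1])   # +1: brighter to the right

print(vertical_cracks)
print(horizontal_cracks)
print(vertical_direction)
```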
41.4.3 Contour Labeling Labeling of contours A method for representing a 3-D scene with a 2-D image. When a 3-D scene is projected onto a 2-D image, regions of each part's surface will be formed separately. The boundaries of each surface are displayed as outlines in the 2-D image, and the outlines of the objects are used to form the line drawings of the objects. For relatively simple scenes, one can use line drawings, that is, 2-D images with outline
markers to represent the interrelationships between the various surfaces of the 3-D scene. This markup can also be used to match 3-D scenes with corresponding models to explain the scene. Mark In the labeling of contours technology, when the local part of the 3-D surface has different reflectances, there is a boundary between light and dark parts. This is not due to the shape of the parts of the 3-D surface but due to the different reflectance (and therefore the brightness) of the parts of the 3-D surface. Labeling via backtracking A method for automatically labeling line drawings. It expresses the problem to be solved as: given a set of edges in a 2-D line drawing, a label is given to each edge to explain the situation in 3-D. It first arranges the edges into a sequence and then generates a path in a depth-first manner. It makes all possible labels for each edge in turn and checks the consistency of the new label with other edge labels. If the connection produced by the new label is contradictory, then it falls back to consider another path; otherwise it continues to consider the next edge. If the labels assigned to all edges in this way satisfy the consistency, a labeling result for the line drawing can be obtained (a short code sketch is given at the end of this subsection). Labeling with sequential backtracking A way to automatically label line drawings. First arrange the edges of the graph into an ordered sequence (as far as possible, the edges that are most constrained by the label are placed in front), generate the path in a depth-first manner, and at the same time make all possible labels for each edge in turn, and check the consistency of the new label with other edge labels. If the labels given to all edges in the graph in this way satisfy the consistency, a labeling result is obtained, and a possible 3-D target structure can be inferred. Line matching The process of establishing correspondence between lines in two line sets. One set can be a geometric model, such as in model-based recognition, model registration, alignment, and so on. On the other hand, lines can also be extracted from different images, such as in feature-based stereo, or when the epipolar geometry is estimated between two views. Planar scene 1. The case when the depth of the scene is small relative to the distance from the camera. At this point, the scene can be regarded as planar, and some useful approximation methods can be used, such as the homography between the two fields of view obtained with a perspective camera. 2. The situation when the surface of all objects in the scene is flat. As in the blocks world scene. Blocks world A simplified field of early artificial intelligence and computer vision research. That is, a simple representation of the objective world, in which the scene is considered to
Fig. 41.19 Examples of two types of junction labels (an arrow junction and a "Y" junction)
include only entities (scene and/or object) with a flat surface, which is often used in experimental environments to solve image analysis problems. The basic feature is to limit the analysis to a simplified geometric object (such as a polyhedron) and assume that a geometric description of the object (such as an image edge) can be easily obtained from the image. Junction label A symbolic label used for the patterns formed where edges meet. This type of method is mainly used in the blocks world, where all objects are polyhedra, so all line segments are straight and only meet in a limited number of structures. See line label. Figure 41.19 shows a "Y" connection (a corner of a cube viewed from the front) and an arrow connection (a corner of a cube viewed from the side).
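The following sketch illustrates the labeling-via-backtracking idea described earlier in this subsection. It is not the handbook's algorithm: the edge names and the junction constraint catalog are toy stand-ins rather than the real Huffman-Clowes/Waltz junction dictionary.

```python
# Depth-first backtracking search for a consistent edge labeling.
LABELS = ["+", "-", ">", "<"]          # convex, concave, two occlusion senses

edges = ["e1", "e2", "e3"]
# Hypothetical junction constraints: each junction lists its edges and the
# label combinations it permits (in the order the edges are listed).
junctions = [
    {"edges": ("e1", "e2"), "allowed": {("+", "+"), (">", "-")}},
    {"edges": ("e2", "e3"), "allowed": {("+", "+"), ("-", "<")}},
]

def consistent(assign):
    """A partial assignment is consistent if no junction is already violated."""
    for j in junctions:
        labels = tuple(assign.get(e) for e in j["edges"])
        if None in labels:              # junction not fully labeled yet
            if not any(all(g is None or g == a for g, a in zip(labels, allowed))
                       for allowed in j["allowed"]):
                return False
        elif labels not in j["allowed"]:
            return False
    return True

def backtrack(i=0, assign=None):
    assign = {} if assign is None else assign
    if i == len(edges):
        return dict(assign)             # every edge labeled consistently
    for lab in LABELS:
        assign[edges[i]] = lab
        if consistent(assign):
            result = backtrack(i + 1, assign)
            if result is not None:
                return result
        del assign[edges[i]]            # contradiction: fall back, try next label
    return None

print(backtrack())                      # e.g. {'e1': '+', 'e2': '+', 'e3': '+'}
```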
Chapter 42
Scene Analysis and Interpretation
The high level of image understanding requires a clear explanation of the high-level meaning of the image. Understanding images is actually about understanding the scene. The high-level interpretation of the scene is based on the analysis and semantic description of the scene (Duda et al. 2001; Theodoridis and Koutroumbas 2009; Zhang 2017c).
42.1
Scene Interpretation
The understanding of visual scenes can be expressed as the combination of various image technologies, based on visually perceived scene environmental data, with the mining of features and patterns in visual data from different perspectives, such as computational statistics, behavioral cognition, and semantics, so as to realize the effective analysis and understanding of the scene (Forsyth and Ponce 2012; Marr 1982; Prince 2012).
42.1.1 Image Scene Image scene It can be regarded as a snapshot that the camera instantly sees, and it is an image of the objective world. Scene coordinates A 3-D coordinate system that describes the location of an objective scene relative to the origin of a given coordinate system. Other coordinate systems are the camera coordinate system, the observer-centered coordinate system, and the object-centered coordinate system.
Scene layout Location and interrelationships of major elements in the scene. This knowledge can be used for scene decomposition and scene labeling. For example, the layout of an outdoor scene often includes the sky above, green spaces under the sun, roads, cars on the road, and so on. Scene brightness The luminous flux radiated from the scene surface. It is the power emitted by a unit area on the surface of the light source within a unit solid angle, and the unit is W·m⁻²·sr⁻¹ (watts per square meter per steradian). Scene brightness is related to radiance. Scene radiance See atmospheric scattering model. Fundamental radiometric relation The relationship between scene brightness L and measured illumination E:

E = L (π/4) (d/λ)² cos⁴α
Among them, d is the diameter of the lens, λ is the focal length of the lens, and α is the angle between the light from the scene surface element to the center of the lens and the optical axis. Scene vector A representation used in video analysis to describe what is happening in the current frame. Each position k in the scene vector [s_t1, . . ., s_tk, . . ., s_tK] corresponds to a different event category (such as a car driving, a person riding a bicycle, etc.), and the value of s_tk is the number of instances of class k detected at time t. Cumulative scene vector For a given scenario, a vector representation of the time-accumulated activity classification output. Each vector dimension corresponds to an activity class. In behavior analysis, when scene vector measurement is performed for each video frame, the cumulative scene vector can be used to overcome the semantic interruption caused by scene inactivity in small cycles. Scene recovering See 3-D reconstruction. Image interpretation A general term for an image operation or process that is used to extract and describe an image, as opposed to the process of producing a human-viewable output image. It is often assumed that the description given is semantically (very) high-level, such as "This is a historical relic from the Ming Dynasty" and "that student is riding to the classroom." A broader definition also includes extracting information or evidence for the next (often non-image processing) activity (such as judgment or decision-making).
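A small numerical sketch of the fundamental radiometric relation above; the function name and the example lens values are illustrative assumptions.

```python
# Evaluating E = L * (pi/4) * (d / lambda)^2 * cos^4(alpha).
import math

def image_irradiance(L, d, focal_length, alpha):
    """Irradiance E for scene radiance L (W m^-2 sr^-1), lens diameter d,
    focal length, and off-axis angle alpha (radians)."""
    return L * (math.pi / 4.0) * (d / focal_length) ** 2 * math.cos(alpha) ** 4

# Example: f/2 lens (d = 25 mm, f = 50 mm), a scene element 10 degrees off axis.
E = image_irradiance(L=100.0, d=0.025, focal_length=0.050, alpha=math.radians(10))
print(E)   # irradiance in W m^-2, smaller off axis because of the cos^4 falloff
```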
Fig. 42.1 Examples of nonaccidentalness (non-accidentally ending and non-accidentally parallel lines)
Nonaccidentalness A general principle that can be used to improve image interpretation. Based on the concept that when there are some regularities in the image, they are most likely derived from regularities in the scene. For example, if the endpoints of two straight lines are close to each other, this may be the result of the lines' endpoints being accidentally aligned with the observer; but it is more likely that the two endpoints are at the same point in the observed scene. Figure 42.1 gives example cases of line termination and line orientation that are unlikely to happen by chance. Image interpretation Description and interpretation of image content obtained from the scene. In essence, it is necessary to understand the meaning of the scene reflected by the image and give a representation at the semantic layer. Knowledge-based vision 1. A visual strategy with a knowledge base and an inference component. Typical knowledge-based vision is mainly rule-based, combining information obtained by processing images with information in a knowledge base, inferring which hypotheses should be generated and verified, what new information can be inferred from the existing information, and which new primitives need further extraction. 2. A style of image interpretation, based on multiple processing components that can perform different image analysis processes, some of which can solve the same problem in different ways. The components are linked by an inference algorithm that understands the capabilities of the components, such as when they are available and when they are not. Another common component is the task-dependent knowledge representation component used to guide inference algorithms. Some mechanisms for representing uncertainty are also common, and they record the confidence of the system's output processing results. For example, a knowledge-based vision system for road networks can be used for the analysis of aerial images of road networks. It includes modules for detecting straight sections, intersections, and forest roads, as well as modules for map surveying, road classification, and curve connections.
42.1.2 Scene Analysis Scene analysis Work and techniques to analyze scenes with the help of images to detect and identify objects and obtain scene information. It can also be described as the process of examining an image or a video to infer relevant scene information. Information here can refer to the identity of an object, the shape of a visible surface, and the spatial or dynamic relationship between some objects. See shape from contour, object recognition, and symbolic object representation. Scene decomposition Segment an image into regions with semantic meaning. For example, an image of the interior of an office can be segmented into regions such as tables, chairs, cabinets, and so on. See semantic image segmentation. Object-based representation A representation that specifies the object category and pose. This representation is good for inferring the meaning of the content in the scene. Scene object labeling Method for assigning semantic labels to object regions in an image. Semantic labeling A method for assigning semantic symbols to objects in an image in scene object labeling. Constraint propagation A technique for solving the problem of object labeling in a scene. The local consistency (local optimum) is first obtained using local constraints, and then the local consistency is adjusted to the global consistency (global optimum) of the entire image using an iterative method. Here the constraints are gradually propagated from the local to the global. Compared with methods that solve the labeling by directly comparing the connections between all the objects in the image, this can save computation. Image parsing A method for determining a semantically meaningful identifier for each pixel (or image block) in a given image. Scene labeling The identity of the scene units is determined from the image data, and labels are provided that fit their nature and role. See labeling problem, region labeling, relaxation labeling, image interpretation, and scene understanding. Scene flow Multi-view video stream representing a dynamic scene using multiple cameras. It can be used to reconstruct the 3-D shape of the object at each moment and estimate the complete 3-D motion of corresponding surface points between each frame of images.
Dynamic scene Different from the general assumption in restoring shape from motion (i.e., the scene is considered to be a rigid body, and the camera is moving), a scene in which the objects themselves are moving and changing. Dynamic scene analysis A method for tracking moving objects in a scene, determining the movement of the object, and identifying moving objects by analyzing images that change over time. In recent years, many dynamic scene analyses have relied on descriptive knowledge of the scene. Scene recognition See scene classification. Latent Dirichlet process [LDP] In scene analysis, when using multiple labeled objects for training, the process of modeling objects and components in a generative model in order to build a geometric model describing their mutual position. Information on the distribution of all objects and parts can be learned from a large label library and then used in the inference/recognition process to label elements in the scene. Intelligent scene analysis based on photographic space [ISAPS] A technology that predicts the scene in the field of view and chooses the optimal settings for the key functions of the camera.
42.1.3 Scene Understanding Scene understanding Based on the scene analysis, a further explanation of the scene is provided. Among them, context plays an important role and can help provide useful semantic information. Or a semantic interpretation of a scene from image data is built, that is, it describes the scene with the identity of the objects in the scene and the relationship between the objects. See image interpretation, object recognition, symbolic object representation, semantic net, graph model, and relational graph. Scene modeling The process of modeling the main characteristics of the scene. Construct geometric models, graph models, or other types of models for the scene to describe the content of scene and location of objects in the scene. The common method is to directly extract, represent, and describe the low-level attributes (color, texture, etc.) of the scene and then use the classification and recognition to learn and reason about the high-level scene information. Image perception The use of images to observe and understand the scene, that is, to use the images to feel and perceive the objective world.
Gist of a scene The refined representation and description of the scene represent the unique characteristics of the scene. It can help predict or determine where a particular object is likely to appear, which can be used to help understand the scene. After imaging the scene, it often corresponds to the gist of an image. Gist of an image Correspondence of scene points on the image. Often used to match complete images. Gist image descriptor The spatial envelope model of a scene is based on the four global characteristics of the image (naturalness, openness, roughness, and scalability). It is the output of a set of image filters (such as the Gabor filters learned from image data), which are combined to give estimates of the above four characteristics, respectively. These four estimates combine to define an image global descriptor, which is more suitable for image clustering and retrieval. Scene reconstruction Estimating the 3-D geometric information of a scene from the collected image data can also include the restoration of reflective properties in the scene, such as the shape of the visible surface or the outline of the scene and the color of the surface. See reconstruction, architectural model reconstruction, volumetric reconstruction, surface reconstruction, and slice-based reconstruction. Semantics of mental representations In scene interpretation, a representation that does not require reconstruction. It is characterized by trying to represent in a natural and predictable way. According to this view, a sufficiently reliable feature detector constitutes a primitive representation of the existence of a certain feature in the visual world. The representation of the entire object and scene can then be constructed from these primitives (if there are enough primitives). Semantic scene understanding A concept associated with semantic image segmentation, but slightly broader to include the case of multiple images or videos. Their common goal is to separate unique objects from image data and identify their types. A related goal is to identify and label everything in the data and establish an overall interpretation of the scene. See image segmentation and object recognition. Semantic scene segmentation See semantic image segmentation. Category model A concept corresponding to high-level semantics, representing imitation based on the basic form of matter. Here, the physical object represents the basic form of matter, and the model represents a simplified description or imitation of an objective scene or system.
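Related to the gist image descriptor entry above, the following sketch computes a gist-style global descriptor from a Gabor filter bank using OpenCV. It is only an approximation of the idea; the filter parameters and grid size are arbitrary choices, not the published spatial envelope implementation.

```python
# Gist-style descriptor: average Gabor filter energy over a coarse grid.
import cv2
import numpy as np

def gist_like_descriptor(gray, orientations=4, grid=4):
    gray = cv2.resize(gray, (128, 128)).astype(np.float32)
    features = []
    for k in range(orientations):
        theta = k * np.pi / orientations
        # Arguments: ksize, sigma, theta, lambd, gamma (values chosen arbitrarily).
        kernel = cv2.getGaborKernel((17, 17), 3.0, theta, 8.0, 0.5)
        response = np.abs(cv2.filter2D(gray, cv2.CV_32F, kernel))
        h, w = response.shape
        for i in range(grid):              # average filter energy in each grid cell
            for j in range(grid):
                cell = response[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
                features.append(cell.mean())
    return np.array(features)              # orientations * grid * grid values

# Usage: descriptor = gist_like_descriptor(cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE))
```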
Middle-level semantic modeling A method to obtain the semantic interpretation of the scene. It uses the recognition of objects to improve the performance of low-level feature classification and to resolve the semantic gap between high-level concepts and low-level attributes. At present, many research methods are based on visual vocabulary modeling and dividing the scene into specific semantic categories according to the semantic contents and distribution of the objects. Scene interpretation The work that uses images to explain the scene. It is more important to consider the meaning of the entire image without necessarily explicitly verifying a particular target or person. It can also be regarded as a follow-up application domain of action recognition. Some methods actually used only consider the results captured by the camera, from which the activities and situations in the scene can be learned and identified by observing the movement of the target without necessarily determining the identity of the target. This strategy is effective when the target is small enough and can be represented as a point in 2-D space. 3-D interpretation With the help of models or knowledge, the meaning of one or more images is inferred, that is, the content and meaning of the scene corresponding to the image is explained. Also called image interpretation, which is essentially scene interpretation. For example, there are a certain number of line segments in the image. Considering the characteristics of perspective projection, it can be considered that there are some polyhedra in the scene (such as man-made objects: buildings, tables, etc.).
42.1.4 Scene Knowledge A priori knowledge about scene Same as scene knowledge. A priori distribution A record of the probability distribution of an agent’s beliefs about certain uncertainties; these beliefs are made without considering some evidence or data. Conjugate prior A special prior. For a given probability distribution p(x|y), a prior p( y) conjugated to the likelihood function can be determined so that the posterior distribution has the same functional form as the prior. In Bayesian probability theory, if the posterior probability distribution p(y|x) and the prior probability distribution p( y) belong to the same distribution family (type), they are called conjugate distributions. Among them, relative to the likelihood function, the prior probability distribution is called a conjugate prior. For example, a family of Gaussian distributions is conjugated to itself with respect to a Gaussian likelihood function (self-conjugated): if the
likelihood function is Gaussian, choosing a conjugate prior ensures that the posterior probability distribution is also Gaussian. Conjugate distribution See conjugate prior. A priori probability Probability based on past experience and analysis, or the probability of an event occurring when there is no information about a particular environment. It can be represented as p(s), which is the probability of occurrence of s before any evidence is observed. Compare a posteriori probability. Suppose a set Q of equally probable outcomes is produced by a given action. If a specific event E corresponds to a subset S of these outcomes, then the a priori probability or theoretical probability of E is defined as

P(E) = size(S) / size(Q)
Prior probability Same as a priori probability. A priori information The general knowledge of the scene reflected by the image is also called a priori knowledge. It means people's prior understanding and historical experience of the scene environment and things before collecting images. A priori information is related to both the objective scene and the subjective factors of the observer. For example, in image segmentation evaluation, the a priori information incorporated can be used to measure the type and amount of characteristic information of the image to be segmented that is combined when designing the segmentation algorithm. This can be used to compare the advantages and disadvantages of the algorithm in terms of stability, reliability, and efficiency to a certain extent. Noninformative prior A prior with a distribution form that has a very small (as small as possible) effect on the posterior distribution. This situation is often referred to as "let the data speak for itself." Improper prior A special prior. Considering the distribution p(x|y) governed by the parameter y, the prior distribution p(y) = constant can be used as a suitable prior. If y is a discrete variable with K values, this is equivalent to simply setting the prior probability of each value to 1/K. But for continuous parameters, if there is no limit to the domain of y, the integral over y will diverge, so this prior distribution cannot be properly normalized, and this prior is called improper. In practice, if the corresponding
posterior distribution is appropriate (i.e., it can be properly normalized), an improper prior can still be used. Atypical co-occurrence A special situation in which the combined occurrence of two or more observations or events does not meet prior expectations. A priori knowledge See a priori information. A priori domain knowledge Prior known information about the types of scene or object. See a priori probability. A posteriori probability The probability obtained by revising the prior probability after obtaining the information of the "result". For example, in a communication system, the a posteriori probability is the probability that the receiving end knows that the message was sent after receiving a certain message. It can also be seen as the probability of an event occurring when there is some information about the environment. The a posteriori probability can be expressed as p(s|e), which is the probability that the situation s will appear after the evidence e is observed. It can be calculated based on the prior probability and evidence by Bayes' theorem. Compare a priori probability. Scene constraint Any constraint imposed on images taken of a scene based on the nature of that scene. For example, a building's side walls and ground are often perpendicular to each other. These constraints can help explain the scene with images. Manhattan world A scene in 3-D space composed of planes with three mutually orthogonal orientations. Many real-world or man-made objects, such as the inside and outside of most buildings, can be viewed as Manhattan world scenes.
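A small worked sketch of conjugacy and a posteriori probability using the Beta-Bernoulli pair; the prior parameters and counts are arbitrary illustration values, not tied to any entry's notation.

```python
# The Beta distribution is a conjugate prior for the Bernoulli/binomial
# likelihood, so the posterior is again a Beta and updating reduces to
# adding observed counts to the prior parameters.
from scipy import stats

a_prior, b_prior = 2.0, 2.0          # Beta(2, 2) prior on the success probability
successes, failures = 7, 3           # observed evidence

a_post, b_post = a_prior + successes, b_prior + failures   # conjugate update
posterior = stats.beta(a_post, b_post)

print(posterior.mean())              # a posteriori estimate, about 0.64
print(posterior.interval(0.95))      # 95% credible interval
```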
42.2
Interpretation Techniques
The interpretation of the scene can rely on many theories and tools, such as fuzzy sets and fuzzy operations, genetic algorithms, discrete labeling and probabilistic labeling techniques, bag of words model/feature bag model, pLSA model, and LDA model (Duda et al. 2001; Prince 2012; Sonka et al. 2014; Theodoridis and Koutroumbas 2009; Zhang 2017c).
42.2.1 Soft Computing Soft computing Contrary to traditional computing (hard computing). The main characteristics of traditional computing are rigor, determinism, and precision. Soft computing simulates the biochemical processes (human perception, brain structure, evolution, immunity, etc.) of biological intelligent systems in nature, through the tolerance of uncertain, inaccurate, and incomplete truth values, to obtain lower-cost computing and robust solutions. Traditional artificial intelligence can perform symbolic operations. This is based on the assumption that human intelligence is stored in a symbolic knowledge base. But the acquisition and representation of symbolic knowledge have limited the application of artificial intelligence. Soft computing generally does not perform too many symbolic operations, so in a sense, soft computing is a supplement to traditional artificial intelligence. At present, soft computing mainly includes four computing modes: (1) fuzzy logic, (2) artificial neural network, (3) genetic algorithm, and (4) chaos theory. These models complement and cooperate with each other. Fuzzy logic A logical form that allows for a range of possibilities (i.e., a degree of truth) between true and false. A soft computing model that mimics the way the human brain judges and reasons about uncertain concepts. For description systems with unknown or uncertain models, fuzzy sets and fuzzy rules can be used for reasoning and fuzzy comprehensive judgments, so they are good at representing qualitative knowledge and experience with unclear boundaries. With the help of the concept of membership function, it can deal with inexact data and problems with more than one solution. Certainty representation Any intermediate technique that encodes trust/belief into hypotheses, conclusions, calculations, etc. Example representations include probability and fuzzy logic. Artificial neural network [ANN] A mathematical model of an algorithm that imitates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. It is a nonlinear system composed of a very large number of simple calculation processing units (neurons). Also called neural network for short. It belongs to the soft computing mode/model. It adjusts the interconnection relationship between a large number of nodes in the network system to achieve the purpose of processing information. It can mimic the information processing, storage, and retrieval functions of the human nervous system at different levels and layers and has various capabilities such as learning, memory, and calculation.
Genetic algorithm [GA] A search algorithm based on natural selection and genetic theory that combines the rules of survival of the fittest in the process of biological evolution with the random information exchange mechanism of chromosomes within the population. It optimally seeks a solution by iteratively refining a small set of candidates (a process that mimics the process of genetic evolution). Use the suitability of the set of possible solutions (populations) to generate new populations until certain conditions are met (e.g., the optimal solution has not changed in a given number of iterations). Belongs to a soft computing model. Mutation One of the basic operations of genetic algorithms. Occasionally and randomly changing a certain code in some code strings (such as about one in every one thousand codes in the evolution from one generation to the next) to maintain various local structures and avoid losing some of the characteristics of the optimal solution. Reproduction One of the basic operations in genetic algorithms. The reproduction operator is responsible for keeping good code strings alive and letting other code strings die according to probability. The reproduction mechanism copies code strings with high fitness to the next generation. The selection process is probabilistic. The probability that a code string is copied to a new sample is determined by its relative fitness in the current population. The higher the fitness of a code string, the greater the probability of survival; the lower the fitness of a code string, the smaller the probability of survival. The result of this process is that code strings with high fitness will have a higher probability of being copied to the next-generation group than code strings with low fitness. Since the total number of code strings in the population generally remains stable, the average fitness of the new group will be higher than that of the original group. Crossover One of the basic operations of genetic algorithms. The newly generated code strings are randomly paired, a boundary position is randomly determined for each pair of code strings, and new code strings are generated by exchanging the segments of the two code strings beyond the boundary position, as shown in Fig. 42.2. Not all newly generated code strings need to be crossed. Generally, a probability parameter is used to control the number of code strings to be crossed. Another way is to keep the best code string in its original form.
Fig. 42.2 Crossover of code strings (two parent strings exchange their segments after a randomly chosen boundary position to produce two new strings)
Random crossover See crossover. Genetic programming Applying genetic algorithms to programs expressed in some programming language, so as to evolve programs that meet certain evaluation criteria. Fitness The value of the evolutionary function in a genetic algorithm. The genetic algorithm directly uses the objective function rather than knowledge derived from or assisted by the objective function. In genetic algorithms, the search for better new solutions depends only on the evolutionary function itself. Fitness function In the literature of genetic algorithms, it refers to the objective function that needs to be optimized. The fitness function evaluates a given candidate solution and returns a value proportional to its degree of optimality. Chaos theory A soft computing mode/model with both qualitative thinking and quantitative analysis. When a dynamic system cannot be explained and predicted using a single data relationship, but must use overall, continuous data relationships, chaos theory can be used to explore its behavior.
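The following toy sketch (not from the handbook) puts the reproduction, crossover, and mutation operators together on bit strings whose fitness is simply the number of 1s; the population size, string length, and rates are arbitrary.

```python
# A toy genetic algorithm: roulette-wheel reproduction, one-point crossover,
# and rare mutation on bit strings.
import random

def fitness(s):
    return sum(s)

def reproduce(pop):
    """Fitness-proportional (roulette-wheel) selection of a new population."""
    weights = [fitness(s) + 1e-9 for s in pop]
    return [list(random.choices(pop, weights=weights)[0]) for _ in pop]

def crossover(a, b):
    """Swap the tails of two parents at a random boundary position."""
    cut = random.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(s, rate=0.001):
    """Occasionally flip a bit (roughly one code in a thousand)."""
    return [1 - bit if random.random() < rate else bit for bit in s]

random.seed(0)
pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for generation in range(50):
    pop = reproduce(pop)
    random.shuffle(pop)
    pop = [c for a, b in zip(pop[::2], pop[1::2]) for c in crossover(a, b)]
    pop = [mutate(s) for s in pop]
print(max(fitness(s) for s in pop))   # typically approaches 20 as the population evolves
```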
42.2.2 Labeling Object labeling Labeling or naming the object region segmented from the image is the process of giving the object a unique semantic symbol. The ultimate goal is to obtain a proper interpretation of the scene image. Region labeling A class of algorithms used to give a label or meaning to each region of a given image after segmentation to obtain proper image interpretation. Representative techniques include relaxation labeling, probabilistic relaxation labeling, and interpretation trees (see interpretation tree search). See labeling problem. Discrete labeling A class of methods for labeling scene objects. It assigns only one label to each object, and the main consideration is the consistency of each object label in the image. In extreme cases, it can be regarded as a special case where the probability of one label is 1 and the other labels are 0 in probabilistic labeling. Discrete relaxation A method for discretely labeling objects in a scene. The possible types of each object are iteratively limited based on its relationship with other objects in the scene. Its
purpose is to obtain globally consistent interpretations from locally consistent connections where possible. It can obtain unambiguous labeled results in most practical situations, but it is an oversimplified method that cannot deal with incomplete or inaccurate scenes and objects. Discrete relaxation labeling Discrete labeling method using relaxation technology. It is a bottom-up method of scene interpretation. Probabilistic labeling A class of methods for labeling scene objects. It allows multiple labels to be assigned to a set of coexisting objects. These labels are weighted with probability, and each label has a degree of trust. It always gives the labeling result and the corresponding degree of trust. Compare discrete labeling. Probabilistic relaxation labeling Probabilistic labeling method using relaxation technology. It can also be seen as an extension of relaxation labeling, where each object that needs to be labeled (i.e., each image feature) is not simply given a label but a set of probabilities, one for each possible label. This method may overcome the problems caused by missing or extra objects in segmentation, but it may also give ambiguous and inconsistent explanations. See belief propagation. Probabilistic relaxation A method of data interpretation in which local inconsistencies play a suppressing role and local coherence plays an encouraging role. The goal is to combine these two effects to constrain the probabilities. A method for labeling objects in a scene with probability. It can overcome the problem of data loss or data errors, but it may lead to inconsistent or ambiguous results.
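A minimal sketch of a probabilistic relaxation labeling update; the compatibility coefficients, damping factor, and update rule below are one common generic choice, not the handbook's specific formulation.

```python
# Probabilistic relaxation labeling for two objects and two labels: compatible
# neighbors iteratively reinforce each other's label probabilities.
import numpy as np

P = np.array([[0.6, 0.4],      # initial label probabilities for object 0
              [0.5, 0.5]])     # ... and for object 1

# r[i, j, a, b]: compatibility of label a on object i with label b on object j.
r = np.zeros((2, 2, 2, 2))
r[0, 1] = r[1, 0] = np.array([[ 1.0, -1.0],
                              [-1.0,  1.0]])   # neighbors prefer equal labels

for _ in range(20):
    # Support q[i, a] = sum_j sum_b r[i, j, a, b] * P[j, b]
    q = np.einsum("ijab,jb->ia", r, P)
    P = P * (1.0 + 0.5 * q)                    # damped support update
    P = np.clip(P, 1e-6, None)
    P = P / P.sum(axis=1, keepdims=True)       # renormalize per object

print(P)   # both objects move toward the same (first) label
```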
42.2.3 Fuzzy Set Fuzzy set A special set in which the degree to which an element belongs to the set is represented by a membership function between 0 and 1. Can also refer to data combined into a set, where each item is associated with a level/possibility of set membership. See fuzzy set theory. Fuzzy set theory A theory for dealing with fuzzy sets (including imprecise or fuzzy events). Compare crisp set theory. Crisp set theory The traditional set theory as opposed to fuzzy set theory, in which any element of the space either belongs to a given set completely or does not belong to it at all.
Membership function In fuzzy mathematics, a function introduced to represent the uncertainty of a research object. If 0 is used to indicate not belonging at all, and 1 is used to indicate fully belonging, a value between 0 and 1 indicates a different degree of certainty. The function representing this uncertainty is called a membership function and can describe the different degrees of certainty that an element belongs to a set. Minimum normal form membership function A normalized membership function that requires the membership value of at least one element in the fuzzy set domain to be 1. Maximum normal form membership function A normalized membership function that requires the membership value of at least one element in the fuzzy set domain to be 0. Fuzzy connectedness A property function μ_ψ describing the overall fuzzy connection in an image. It assigns a value in [0, 1] to each pair of pixels f and g in the image. This assignment can be based on the fuzzy affinity of the pixel pairs. Pixels f and g do not have to be (directly) adjacent. They can be connected through a series of adjacent pixels to form a path, where the fuzzy affinity of each pair of adjacent pixels can be recorded as ψ(f_i, f_i+1), 0 ≤ i ≤ N–1 (N is the path length). The strength of each path is defined as the minimum fuzzy affinity between pairs of adjacent pixels on the path, that is, the strength of the entire path L is determined by its weakest local connection:

ψ′(L) = min_{0≤i≤N–1} ψ(f_i, f_i+1)
There may be many paths between each pair of pixels f and g. Letting M represent the set of all paths connecting f and g, the fuzzy connectedness can be expressed as

μ_ψ(f, g) = max_{L∈M} ψ′(L)
Fuzzy affinity The function describing the local fuzzy connection between pixels, which can be written as ψ(f, g) ∈ [0, 1]; it reflects the strength of the connection between two fuzzy adjacent pixels. It can be a function of pixel attributes (such as grayscale) or other derived properties (such as gradients). Compare fuzzy connectedness. Fuzzy adjacency A special type of adjacency relationship between two image units, determined by the fuzzy adjacency function μ(f, g) ∈ (0, 1], where f and g represent two pixels, respectively. Compare hard-adjacency.
Fuzzy similarity measure Non-limiting simulation of similarity measures when evaluating fuzzy feature pairs. The most intuitive way to calculate similarity is based on the distance between features, using the most commonly used fuzzy correspondence measure of the finite distance measure. Here are a few proposed measures (let d be the distance measure between the two fuzzy sets A and B):

1. S(A, B) = 1/[1 + d(A, B)]
2. S(A, B) = exp[–α d(A, B)], where α is the measure of steepness
3. S(A, B) = 1 – β d(A, B), where β = 1, 2, 1
42.2.4 Fuzzy Calculation Zadeh operator There are mainly three operators acting on fuzzy sets: fuzzy intersection, fuzzy union, and fuzzy complement. Fuzzy space The region of support (domain) of a fuzzy set. Fuzzy intersection Intersection of sets in fuzzy space. Given two fuzzy sets A and B, whose domains are D_A and D_B and whose membership functions are M_A(a) and M_B(b), the fuzzy intersection of A and B is

A ∩ B: M_{A∩B}(a, b) = min[M_A(a), M_B(b)]

Fuzzy union Union of sets in fuzzy space. Given two fuzzy sets A and B, whose domains are D_A and D_B and whose membership functions are M_A(a) and M_B(b), the fuzzy union of A and B is

A ∪ B: M_{A∪B}(a, b) = max[M_A(a), M_B(b)]

Fuzzy complement The complement of a set in fuzzy space. Given a fuzzy set A, whose domain is D_A and whose membership function is M_A(a), the fuzzy complement of A is

A^c: M_{A^c}(a) = 1 – M_A(a)
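A minimal sketch of the Zadeh operators on discrete fuzzy sets, assuming for simplicity that both sets are defined over the same domain (the membership values are arbitrary):

```python
# Zadeh operators: fuzzy intersection (min), union (max), and complement (1 - M).
A = {"x1": 0.2, "x2": 0.7, "x3": 1.0}
B = {"x1": 0.5, "x2": 0.3, "x3": 0.8}

fuzzy_intersection = {x: min(A[x], B[x]) for x in A}   # M_{A∩B} = min(M_A, M_B)
fuzzy_union        = {x: max(A[x], B[x]) for x in A}   # M_{A∪B} = max(M_A, M_B)
fuzzy_complement_A = {x: 1.0 - A[x] for x in A}        # M_{A^c} = 1 - M_A

print(fuzzy_intersection)   # {'x1': 0.2, 'x2': 0.3, 'x3': 0.8}
print(fuzzy_union)          # {'x1': 0.5, 'x2': 0.7, 'x3': 1.0}
print(fuzzy_complement_A)   # about {'x1': 0.8, 'x2': 0.3, 'x3': 0.0}
```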
Fig. 42.3 Model and steps for fuzzy reasoning (fuzzy rules → fuzzy composition → fuzzy solution space → de-fuzzification → decision)
Fuzzy reasoning Reasoning of fuzzy sets with the help of fuzzy rules. The information in each fuzzy set needs to be combined with certain rules to make decisions. The basic model and main steps are shown in Fig. 42.3. Starting from fuzzy rules, the basic relationship to determine the membership in the related membership function is called fuzzy composition. The result obtained by fuzzy composition is a fuzzy solution space. To make a decision based on the solution space, there must be a process of de-fuzzification. See fuzzy logic. Monotonic fuzzy reasoning The simplest kind of fuzzy reasoning that can get the solution of a problem directly without using fuzzy composition and de-fuzzification. Fuzzy solution space In fuzzy reasoning, the result of fuzzy composition of fuzzy rules. Correlation minimum A fuzzy composition method in fuzzy reasoning. The fuzzy membership function of the original result is truncated to limit the fuzzy result, which is characterized by simple calculation and simple de-fuzzification. Compare correlation product. Fuzzy composition One of the main steps of fuzzy reasoning (see Fig. 42.3). Combine different fuzzy rules to help make decisions. There are different combining mechanisms that can be used to combine fuzzy rules, including min-max rules and correlation products. Correlation product In fuzzy reasoning, a fuzzy composition method obtained by scaling the fuzzy membership function of the original result. This preserves the shape of the original fuzzy set. Compare correlation minimum. Fuzzy rule In fuzzy reasoning, a series of unconditional and conditional propositions. The form of an unconditional fuzzy proposition is x is A, and the form of a conditional fuzzy proposition is if x is A then y is B, where A and B are two fuzzy sets and x and y represent scalars in the corresponding domain of the two. Min-max rule In fuzzy reasoning, a fuzzy rule for making decisions. A series of minimization and maximization processes are used to combine the various knowledge relevant to the decision-making process.
Fig. 42.4 De-fuzzification using composite moment method
Fig. 42.5 De-fuzzification using composite maximum method
De-fuzzification One of the main steps of fuzzy reasoning (see Fig. 42.3). On the basis of fuzzy composition, a vector containing multiple scalars (each scalar corresponding to a solution variable) that can best represent the information in the fuzzy solution set is further determined, so as to obtain a precise solution for decision-making. This process is performed independently for each solution variable. Two commonly used de-fuzzification methods are the composite moments method and the composite maximum method. Composite moments A method of de-fuzzification in fuzzy reasoning. One must first determine the center of gravity c of the membership function of the fuzzy solution and transform the fuzzy solution into a crisp solution variable c, as shown in Fig. 42.4. The results of the composite moments method are sensitive to all rules (affected by all solutions), and the method is often applied in control. Composite maximum A method of de-fuzzification in fuzzy reasoning. The goal is to determine the domain point with the largest membership value in the membership function of the fuzzy solution. If the maximum lies on a plateau, the crisp solution d is given at the center of the plateau, as shown in Fig. 42.5. The result of the composite maximum method depends on the single rule with the highest predicted truth value (unaffected by other solutions), and the method is often applied in recognition.
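A minimal sketch of the two de-fuzzification methods, applied to a fuzzy solution set sampled on a discrete domain (the triangular membership function is an arbitrary example):

```python
# Composite moments (centroid) and composite maximum de-fuzzification.
import numpy as np

y = np.linspace(0.0, 10.0, 101)                         # solution-variable domain
mu = np.clip(1.0 - np.abs(y - 6.0) / 3.0, 0.0, None)    # triangular membership L(y)

# Composite moments: center of gravity of the membership function.
c = np.sum(y * mu) / np.sum(mu)

# Composite maximum: center of the plateau where membership is maximal.
d = np.mean(y[mu == mu.max()])

print(c, d)   # both near 6.0 for this symmetric solution set
```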
42.2.5 Classification Models Scene classification Determine the type of a particular image or video frame. For example, a system can divide images into acquisitions from certain scenes (indoor, outdoor, suburban, urban). Bag-of-words model A model derived from natural language processing. After being introduced into the image field, it is often called a bag-of-features model. Can describe the connection between the object and the scene and is often used for image and video classification. Bag of words Also called bag of features or bag of keypoints. Can refer to a simple classification recognition algorithm. The flow is shown in Fig. 42.6. First, detect key patches (small regions centered on key points), extract object features from them, quantize them to obtain the distribution (often a histogram) over the learned visual words (feature cluster centers), and use a classification algorithm to learn a decision surface and perform classification. See bag of features. Latent semantic indexing [LSI] An indexing method and mechanism different from keyword search. It tries to bypass natural language understanding when searching for what is needed and uses statistical methods to achieve the same goal. The idea originates from solving the problem of text retrieval in natural language. Vocabulary in natural language texts is characterized by polysemy and synonymy. Due to polysemy, a retrieval algorithm based on exact matching will extract many words that are not needed by users; due to synonymy, a retrieval algorithm based on exact matching will miss much of the vocabulary that users want. Latent semantic indexing uses statistical analysis of a large number of samples to find the correlation between different words, so that the search results are closer to what the user is really looking for; at the same time, this can also ensure the high efficiency of the search. Probabilistic latent semantic analysis [pLSA] A probabilistic graphical model built to solve object and scene classification. It is based on probabilistic latent semantic indexing. Originating from the study of natural language and text, its original definitions of terms use concepts from text, but it is also easy to generalize to the field of images (especially with the help of a bag of feature model framework).
Fig. 42.6 The flowchart of classification using bag of words (detect key patches → extract features → compute histogram → determine decision surface)
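One possible realization of this flowchart with scikit-learn is sketched below; the descriptors(img) function that returns local feature vectors for an image is a hypothetical placeholder, and the vocabulary size and classifier are arbitrary choices.

```python
# Bag-of-visual-words classification: learn a codebook, build word histograms,
# then learn a decision surface on the histograms.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def bow_histogram(img, codebook, descriptors):
    words = codebook.predict(descriptors(img))           # quantize to visual words
    hist, _ = np.histogram(words, bins=np.arange(codebook.n_clusters + 1))
    return hist / max(hist.sum(), 1)                     # normalized word histogram

def train(train_images, train_labels, descriptors, n_words=200):
    all_desc = np.vstack([descriptors(img) for img in train_images])
    codebook = KMeans(n_clusters=n_words, n_init=10).fit(all_desc)   # learn visual words
    X = np.array([bow_histogram(img, codebook, descriptors) for img in train_images])
    classifier = LinearSVC().fit(X, train_labels)        # learn a decision surface
    return codebook, classifier

def predict(img, codebook, classifier, descriptors):
    return classifier.predict([bow_histogram(img, codebook, descriptors)])[0]
```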
Fig. 42.7 Basic LDA model
It is usually described with a set of documents and the text contained in the document. In the image field, this can correspond to a set of images and bag of features describing the images. pLSA models the distribution of text in a document, using a small number of topics in proportion. pLSA is related to latent Dirichlet allocation, but uses a point estimate for the proportion of individual documents instead of Dirichlet priors. Simply using principal component analysis on the text/document matrix is called latent semantic analysis. pLSA provides a better model for such data because constraints such as the number of non-negative characters are considered. See topic model. Topic In probabilistic latent semantic analysis, the semantic content and core concepts of the analyzed object. Latent emotional semantic factor Represents an intermediate semantic layer concept between low-level image features and high-level emotion categories. Probabilistic latent semantic analysis models can be used to learn latent emotional semantic factors. Probabilistic latent semantic indexing [pLSI] Latent semantic indexing with probability statistics. Latent Dirichlet allocation [LDA] 1. A set of probabilistic models that can be used for learning and inference in image understanding. The basic model can be represented by Fig. 42.7, where the box represents the repeat/set; the large box represents the image set, and M represents the number of images in it; the small box represents the repeated selection of topics and words in the image, and N represents the number of words in the image. Generally, it is considered that N is independent of q and z. The leftmost latent node a corresponds to the Dirichlet prior parameter of the subject distribution in each image; the second latent node q on the left represents the subject distribution in the image (qi is the subject distribution in image i), and q is also called mixed probability parameter; the third latent node z on the left is the topic node, and zij represents the topic of the word j in image i; the fourth node on the left is the only observation node (with shadow), s is the observation variable, and sij represents the j-th word in image i. The upper right node b is the polynomial distribution parameter of the topic-word layer, that is, the Dirichlet prior parameter of the word distribution in each topic.
The above basic model is a three-layer Bayesian model. Among them, a and b belong to the hyperparameters of the image collection layer, q belongs to the image layer, and z and s belong to the visual vocabulary layer. Model solving includes two processes of approximate variational inference and parameter learning. 2. A latent variable model in which each observation vector is generated from a combination of considered sources or topics. To obtain this model, a latent probability vector is drawn from a Dirichlet distribution, and observations are generated from a combination of sources defined by these probabilities. In contrast, a mixture model focuses on observation vectors originating from a single source. Latent Dirichlet allocations are used to identify topics in a document. See topic model and probabilistic latent semantic analysis. Latent variable Variables in the graph model whose values are not observed (in the training set). Latent variable model Some models of the observed variable x that represent common latent or latent factors. A single discrete latent variable gives a mixture model. Models using continuous vectors of latent variables include factor analysis, independent compo nent analysis, and latent Dirichlet allocation. Latent behavior A non-quantified or latent geometric relationship, statistics, or correlation characteristic presented by a system or entity. Generally, this refers to behaviors that are not visible or hidden and are not easily measured or quantified. See latent variable model. Hidden variable Same as latent variable. Intensive variables One of two latent variables (the other is an extensive variable). Its number is independent of the size of the data set. Extensive variables A variable that scales with the size of the data set. One of two latent variables (the other is an intensive variable). Latent identity variable [LIV] Basic multidimensional variables representing individual identities. LDA model Same as latent Dirichlet allocation. Topic model A model that describes the connection between a set of documents and the words contained in them. In image technology, it can be mapped to a group of images, where each image is described by a bag of features. The basic idea is to model the
Fig. 42.8 Basic SLDA model
distribution of words in a document with a smaller number of topics. See latent Dirichlet allocation and probabilistic latent semantic analysis. Parameter learning One of two steps in solving the LDA model. Refers to the process of determining the hyperparameters a and b given the set of observed variables S = {s1, s2, . . ., sM}. Supervised latent Dirichlet allocation [SLDA] A model obtained by introducing category information into the latent Dirichlet allocation model. Its classification performance is higher than that of the latent Dirichlet allocation model. Its graph model is shown in Fig. 42.8. The meaning of each upper node: the leftmost hidden node a corresponds to the Dirichlet prior parameter of the topic distribution in each image; the second hidden node q represents the topic distribution in the image (qi is the topic distribution in image i), and q is also called the mixture probability parameter; the third hidden node z on the left is the topic node, and zij represents the topic of the j-th word in image i; the fourth node s on the left is the only observation node (shaded), s is the observation variable, and sij represents the j-th word in image i; the fifth node b on the left holds the polynomial distribution parameters at the topic-word level, that is, the Dirichlet prior parameters for the word distribution in each topic. In addition, a category label node l related to topic z is added in the lower part. The parameter h in the softmax classifier can be used to predict the label l corresponding to the topic z ∈ Z = {zk}, k = 1, 2, . . ., K. The inference for topic z in the SLDA model is affected by the category label l, so that the learned word-topic distribution hyperparameter d is more suitable for classification tasks (it can also be used for labeling).
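An illustrative sketch of fitting a topic model with scikit-learn's LatentDirichletAllocation; the toy count matrix and the number of topics are arbitrary, and the variables loosely mirror the per-image topic proportions (q) and the topic-word distributions described above.

```python
# Topic model on a document-word count matrix; in the image setting each row
# would hold the visual-word counts of one image.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 50))      # 100 images x 50 visual words (toy data)

lda = LatentDirichletAllocation(n_components=5, random_state=0)   # 5 topics
theta = lda.fit_transform(counts)              # per-image topic proportions
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # topic-word distributions

print(theta.shape, phi.shape)                  # (100, 5) (5, 50)
```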
Chapter 43
Image Information Fusion
Image information fusion is an information technology and process that comprehensively processes information data from multiple image sensors to obtain more comprehensive, accurate, and reliable results than using only single sensor information data (Theodoridis and Koutroumbas 2003; Zhang 2015f).
43.1
Information Fusion
In information fusion, information obtained by multiple sensors (which may be redundant, complementary, or cooperative) is put together for comprehensive processing to improve the integrity and accuracy of information acquisition (Zhang 2017c).
43.1.1 Multi-sensor Fusion Information fusion The technology and process of combining information from multiple sources, specifically the same space and time information data obtained from different sources. The goal is to obtain comprehensive, accurate, and reliable results. Data from different sources often come from different sensors, so information fusion is also called multi-sensor information fusion. Fusion can expand the coverage of spatial and temporal information detection, improve detection capabilities, reduce information ambiguity, and increase the credibility of decision-making and the reliability of information systems. See sensor fusion.
Image information fusion A special type of information fusion or multi-sensor information fusion. The image is used as a processing and operation object, and information in multiple images is combined to make a judgment decision. Compared with image fusion, image information fusion focuses more on higher-level fusion. Multi-sensor information fusion See information fusion. Multi-sensory fusion Combine information from multiple different sensors. This work is quite difficult, because very different information and its representations need to be aligned, and the information can be very different in content. See sensor fusion. Multi-sensor geometry The relative distribution of the field of view obtained by a group of sensors or by a single sensor from different locations. One consequence of the different positions is that the 3-D structure of the scene can be inferred. Although not all sensors are required to be of the same type, the same type of sensors are generally used for convenience. Multi-spectral analysis Use the brightness of the observed image at different wavelengths to help understand the observed pixels. A simple version uses RGB image data. In satellite remote sensing analysis, seven or more frequency bands are often used (including several infrared light wavelengths). Recent hyperspectral sensors can give measurements at 100–200 different wavelengths. Multi-spectral segmentation Segmentation of multispectral images. There are two types of strategies: (1) segment each image channel separately and then combine the results; and (2) combine the information of each image channel before segmentation. Multimodal analysis A general term for image analysis using image data derived from more than one sensor type. It is often assumed that the data is already registered, so each pixel record gets data from two or more types of sensors from the same part of the scene being observed. Multimodal neighborhood signature A feature point is described based on image data in its neighborhood. This data comes from several registered sensors, such as X-ray imaging and magnetic resonance imaging. Multimodal interaction A new type of human-computer interaction that combines multiple modes. The shortcomings of one mode can often be compensated by other modes, which can improve usability. For example, visually impaired users can interact with voice; in noisy environments, users can wear tactile gloves to interact.
Multimodal user interface A user interface built using a combination of modal interaction techniques. At present, new technologies such as gesture input, gaze tracking, and expression classification are mainly adopted and combined, so that users can interact with computer systems in multiple channels in a natural, parallel, and collaborative manner. Precise information and fast capture of user’s intentions effectively improve the naturalness and efficiency of human-computer interaction. Fusion Determine the process by which two things work together and combine into one. Data from multiple sources can be combined into a single expression. Fusion move A fusion optimization method. The idea is to replace part of the current best estimates with hypotheses obtained by more basic methods and combine them with local gradient descent to better minimize energy. Algorithm fusion Combination of different algorithms for information fusion. Empirical mode decomposition [EMD] 1. A parameterless data-driven analysis tool that decomposes nonlinear and non-stationary signals into intrinsic mode functions. In image fusion, images from different imaging modes are first decomposed into their intrinsic mode functions and then combined. Based on the combined intrinsic mode functions, a fused image can be reconstructed. 2. A method for analyzing natural signals in the time domain, which is often nonlinear and non-stationary. Comparable with methods such as 2-D Fourier transform and wavelet decomposition, although the basis function represen tation is derived empirically. Related technologies are defined by algorithms, not by theory. For a signal x(t): (1) Determine all extreme points of x(t). (2) Interpolate between the minimum and maximum values, respectively, to obtain the envelope Emin(t) and envelope Emax(t). (3) Calculate the mean m(t) ¼ [Emin(t) + Emax(t)]/2. (4) Extract details: x(t+1) ¼ x(t) – m(t). (5) Iterate the residual x(t+1). Intrinsic mode function [IMF] A function that represents the intrinsic primitives of a signal mode. LANDSAT A series of satellites launched by the United States to obtain satellite images of information on the surface of the earth. Launched in 1999, LANDSAT 7 orbits the earth every 16 days.
LANDSAT satellite
One of a group of Earth Resource Technology satellites launched by the United States. No. 8 was launched on February 11, 2013.

LANDSAT satellite image
An image acquired using a LANDSAT satellite. The first LANDSAT remote sensing satellite was launched by the United States in 1972 and carried a video tube (vidicon) camera that could collect images in three visible light bands. A multispectral scanning system (MSS) on the satellite can provide images of four wavelengths, three of which are in the visible range and the other in the near-infrared range. Since then, the United States has launched another 7 LANDSAT satellites (of which the 6th failed). The MSS sensors on the first five satellites can be used in combination (although they are currently disabled), and the spatial resolution achieved is 56 m × 79 m. LANDSAT-7's ground resolution (panchromatic band) is up to 15 m.
43.1.2 Mosaic Fusion Techniques Image mosaic Combining several images to form a larger image, which can cover a wider (even 360 ) viewing angle or field of view, or can display more content (such as a photo wall combined with many small photos). Photomontage Combine multiple images taken on the same scene to get the best appearance of each object scene in the scene. Compare image stitching. Image stitching A term originating from the photography world, which stitches aerial photos into large-scale photo mosaics based on surveying and mapping reference points or manually determined control points (most digital maps and satellite photos are based on this). In image engineering, it refers to stitching multiple images into a panoramic image. It can be used to stitch partially overlapping images, or to combine images that are repeatedly collected for the same scene (the purpose is to obtain the best possible combination and appearance of each primitive). A technology that combines a set of images (with some overlap with each other) collected from different parts of the same scene to form a larger seamless panorama. This involves constructing models of camera movements, realizing image align ment (recover the scene's original spatial distribution and connection), and performing brightness adjustments (to eliminate exposure differences between different images). A common method is to combine multiple images collected in different orientations or geometric sizes into a single panoramic image mosaic.
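As an illustration of image stitching, the following sketch uses OpenCV's high-level Stitcher, which internally bundles feature matching, alignment, and blending. The file names are placeholders, and this is only one possible way to realize the stitching process described above.

```python
# Hypothetical sketch of panorama stitching with OpenCV's Stitcher class.
import cv2

images = [cv2.imread(name) for name in ("left.jpg", "middle.jpg", "right.jpg")]

stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
status, panorama = stitcher.stitch(images)

if status == cv2.Stitcher_OK:
    cv2.imwrite("panorama.jpg", panorama)   # the stitched panoramic image
else:
    print("stitching failed, status code:", status)
```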
Up vector selection In image stitching, calculations are performed to ensure that the vertical axis in the stitched image points upward. When shooting, it is possible to perform different pan and tilting movements. At this time, choose to make the horizontal side of the camera parallel to the ground. Seam carving An image operation or algorithm that adjusts the size of an image based on content perception. It includes finding seams in the original image (an optimal 8-connected pixel path from top to bottom or left to right) and using this information to complete (1) eliminating (“cropping”) the least contribution to the image content to reduce the image size and (2) expanding the image by inserting more seams. By using these operations in both (horizontal and vertical) directions, you can convert an image to a new size without losing as little meaningful content as possible. Seam selection in image stitching A key step in image stitching. When stitching a panorama using multiple partially overlapping images, the seams need to be selected in the region where the adjacent images are consistent to avoid seeing the transition from one image to another. Therefore, it is necessary to avoid the phenomenon that the seams look unnatural due to cutting the moving target. For a pair of adjacent images, this process can be carried out by means of dynamic programming, starting from one edge of the overlapping region and ending at the other edge. Gap closing In image stitching, a technique for estimating a set of rotation matrices and focal lengths and concatenating them to generate a large panorama. View integration Generate a combined image (or video) by performing image mosaic to several overlapping (i.e., overlapping field of view) images (or video streams). Bundle adjustment The earliest use of photogrammetry refers to the process of simultaneously adjusting the orientation and position of a group of multiple overlapping photos to build a panoramic image. In image engineering, it refers to the attitude optimization process and technology in the process of global registration of multiple images at the same time to restore the structure and spell out the panorama during scene restoration. Here, robust nonlinear minimization of measurement errors is performed. An algorithm used to optimize the 3-D coordinates of camera and point set positions from 2-D image measurements. It minimizes model fitting errors and the cost function of camera changes. Bundle refers to the light between the detected 3-D feature and the center of each camera. These beams need to be adjusted iteratively (relative to the camera center and feature position).
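A minimal sketch of one step of the seam carving operation described above: a single vertical 8-connected seam of minimum energy is found by dynamic programming and removed. The gradient-magnitude energy map and the implementation details are illustrative choices, not the only possible ones.

```python
# Remove one minimum-energy vertical seam from a grayscale image (numpy).
import numpy as np

def remove_one_vertical_seam(gray):
    h, w = gray.shape
    gy, gx = np.gradient(gray.astype(float))
    energy = np.abs(gx) + np.abs(gy)                 # simple energy map

    # Dynamic programming: accumulate the cheapest 8-connected path from the top.
    cost = energy.copy()
    for i in range(1, h):
        left  = np.r_[np.inf, cost[i - 1, :-1]]
        up    = cost[i - 1, :]
        right = np.r_[cost[i - 1, 1:], np.inf]
        cost[i] += np.minimum(np.minimum(left, up), right)

    # Backtrack the minimum-cost seam from bottom to top.
    seam = np.empty(h, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for i in range(h - 2, -1, -1):
        j = seam[i + 1]
        lo, hi = max(0, j - 1), min(w, j + 2)
        seam[i] = lo + int(np.argmin(cost[i, lo:hi]))

    # Delete the seam pixel from every row, shrinking the width by one.
    mask = np.ones((h, w), dtype=bool)
    mask[np.arange(h), seam] = False
    return gray[mask].reshape(h, w - 1)
```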
Fig. 43.1 A panoramic image mosaic
Panoramic image mosaic A type of technology that combines a set of partially overlapping images into a single panoramic image. In typical cases, panoramic image mosaics generate images with both very high resolution and a very large field of view. Such images cannot often be directly acquired with actual cameras. Panoramic mosaics can be constructed in different ways, but generally include three necessary steps: (1) determine the correspondence between two adjacent images (see stereo correspondence problem); (2) use correspondence to determine the relationship between two images (or between the current mosaic and a new image); and (3) merge the new image into the current mosaic. Figure 43.1 shows a panoramic mosaic image obtained from multiple images. Planar mosaic Panoramic image mosaic of a flat scene. A transformation that connects the different fields of view of a planar scene is homography. Document mosaicing An image mosaic of the document. Photo-mosaic The large-scale high-resolution photo collection composed of a group of photos is the basis of today’s digital maps and satellite photos and is also used to generate large-angle panoramas. In which the image registration and image stitching techniques are used. Cylindrical mosaic A photo mosaic method in which a single 2-D image is projected onto a cylinder. This situation is only possible when the camera is rotated around a single axis or the projection center of the camera remains substantially constant from the closest scene point.
43.2
Evaluation of Fusion Result
The evaluation of the image fusion effect often uses different methods and indicators. The evaluation of image fusion effect includes subjective evaluation and objective evaluation (Zhang 2009).

Evaluation of image fusion result
Evaluation of the image fusion effect according to the purpose of image information fusion. It is an important content to be studied in image fusion. Different methods and indicators are often used to evaluate the effects of fusion at different levels. Among them, the information richness of the fused image and the improvement of image quality are always the fundamental purpose of image fusion (the latter is particularly important for low-level fusion), and they are also the basic criteria for measuring the effects of various fusion methods.

Mutual information
1. In information theory, the amount of information common to two random variables; it is an information measure that reflects the correlation between two events. Given two events X and Y, the mutual information between them can be expressed as

I(X, Y) = H(X) + H(Y) − H(X, Y)

Among them, H(X) and H(Y) are the self-information of X and Y, respectively, and H(X, Y) is the interrelating entropy. It can also be written as I(X, Y) = H(X) + H(Y) − H(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X), where H(•) represents the entropy function. Note that I(X, Y) is symmetrical.
2. In image information fusion, an objective evaluation index based on the amount of information. The mutual information between two images reflects their information connection, which can be calculated using the probability meaning of the image histogram. If the histograms of the fused image and the original image are hf(l) and hg(l), respectively (l = 1, 2, . . ., L), and Pfg(lf, lg) indicates the joint probability that, at the same location of the two images, the gray value of the pixel in the original image is lf and the gray value of the pixel in the fused image is lg, then the mutual information between the two images is

H(f, g) = \sum_{l_g=0}^{L} \sum_{l_f=0}^{L} P_{fg}(l_f, l_g) \log \frac{P_{fg}(l_f, l_g)}{h_f(l_f)\, h_g(l_g)}
If the fused image g(x, y) is obtained from the two original images f1(x, y) and f2(x, y), then the mutual information of g(x, y) with f1(x, y) and f2(x, y) (if the
mutual information of f1 and g is related to the mutual information of f2 and g, the relevant part H(f1, f2) needs to be subtracted) is

H(f_1, f_2, g) = H(f_1, g) + H(f_2, g) − H(f_1, f_2)

This reflects the amount of original image information contained in the final fused image.

Information entropy
In image information fusion, an objective evaluation index based on the amount of information. Referred to simply as entropy. The entropy of an image is an indicator of the richness of the information in the image and can be calculated using the histogram of the image. Let the histogram of the image be h(l), l = 1, 2, . . ., L (L is the number of bins in the histogram); then the information entropy of the image can be expressed as

H = −\sum_{l=0}^{L} h(l) \log [h(l)]
If the entropy of the fused image is greater than the entropy of the original image, it indicates that the information amount of the fused image is greater than that of the original image.

Entropy
Abbreviation for information entropy.
1. In layman's terms, the amount of disorder in a system.
2. A measure of the information content of a random variable X. Given that X has a set of possible values \mathcal{X}, with probabilities {P(x), x ∈ \mathcal{X}}, the entropy H(X) of X is defined as (note 0 log 0 := 0)

H(X) = −\sum_{x \in \mathcal{X}} P(x) \log P(x)

For a multivariate distribution, the joint entropy H(X, Y) of X and Y is

H(X, Y) = −\sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} P(x, y) \log P(x, y)
When the values of a set are represented as a histogram, the entropy of the set can be defined as the entropy of the probability distribution function represented by the histogram. In an image, the entropy E of different regions is often different, the entropy value of relatively uniform regions is small, and the entropy value of regions that change a lot is relatively large. As in Fig. 43.2, E1 < E2 < E3.
Fig. 43.2 Different regions have different entropy in the image
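A small sketch (assuming 8-bit grayscale images and numpy) of the histogram-based entropy and mutual information measures defined above; the probabilities are estimated from normalized histograms, and the random test images are placeholders.

```python
# Histogram-based entropy and mutual information between two images.
import numpy as np

def entropy(img, levels=256):
    h, _ = np.histogram(img, bins=levels, range=(0, levels))
    p = h / h.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(f, g, levels=256):
    joint, _, _ = np.histogram2d(f.ravel(), g.ravel(),
                                 bins=levels, range=[[0, levels], [0, levels]])
    p_fg = joint / joint.sum()
    p_f = p_fg.sum(axis=1, keepdims=True)      # marginal of f
    p_g = p_fg.sum(axis=0, keepdims=True)      # marginal of g
    nz = p_fg > 0
    return np.sum(p_fg[nz] * np.log2(p_fg[nz] / (p_f @ p_g)[nz]))

# Placeholder data: a "fused" image g correlated with an "original" image f.
rng = np.random.default_rng(0)
f = rng.integers(0, 256, (64, 64))
g = np.clip(f + rng.integers(-8, 8, (64, 64)), 0, 255)
print(entropy(f), mutual_information(f, g))
```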
Conditional entropy
Given the value of one random variable x, the average information required to specify the value of another random variable y. Let the joint distribution of x and y be p(x, y). Given x, the conditional entropy of y is

H[y|x] = −\iint p(y, x) \ln p(y|x) \, dy \, dx

Using the product rule, we know that conditional entropy satisfies

H[x, y] = H[y|x] + H[x]

Among them, H[x, y] is the differential entropy of p(x, y), and H[x] is the differential entropy of the marginal distribution p(x). The above formula can be interpreted as follows: the information needed to describe x and y is the sum of the information needed to describe x alone and the additional information needed to specify y for a given x.

Interrelating entropy
In image information fusion, an objective evaluation index based on the amount of information. The interrelating entropy between the fused image and the original image reflects the correlation between the two images. If the histograms of the fused image g and the original image f are hg(l) and hf(l), respectively, l = 1, 2, . . ., L, then the interrelating entropy between them is

C(f : g) = −\sum_{l_g=0}^{L} \sum_{l_f=0}^{L} P_{fg}(l_f, l_g) \log P_{fg}(l_f, l_g)
Among them, Pfg(lf, lg) represents the joint probability that the pixels at the same position in the two images have a gray value of lf in the original image and a gray
value of lg in the fused image. Generally speaking, the larger the interrelating entropy of the fused image and the original image, the better the fusion effect.

Joint entropy
Same as interrelating entropy.

Gray-level disparity
In image information fusion, an objective evaluation index based on statistical characteristics. The gray-level deviation (corresponding to spectral distortion) between the fused image and the original image reflects the difference in spectral information between the two. Let f(x, y) be the original image and g(x, y) be the fused image, both of size N × N; the gray-level disparity is

D = \frac{1}{N \times N} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} \frac{|g(x, y) − f(x, y)|}{f(x, y)}

When the difference is small, it indicates that the fused image retains the gray information of the original image better.

Standard deviation
1. The positive square root of the variance.
2. An objective evaluation index based on statistical characteristics in image information fusion. The gray standard deviation of an image reflects the dispersion of each gray value relative to the average gray value and can be used to evaluate the contrast of the image. Letting g(x, y) be the fused image of size N × N and μ be its mean gray value, the standard deviation is

σ = \sqrt{ \frac{1}{N \times N} \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} [g(x, y) − μ]^2 }

If the standard deviation of an image is small, it means that the contrast of the image is small (the contrast between adjacent pixels is smaller), the overall image tone is relatively uniform, and less information can be observed. If the standard deviation is large, the situation is reversed.

α-Family of divergences
A large class of divergence measures. Letting p(x) and q(x) be two probability distributions, the general formula of the alpha divergence family is

D_α(p \| q) = \frac{4}{1 − α^2} \left( 1 − \int p(x)^{(1+α)/2} q(x)^{(1−α)/2} \, dx \right)
Among them, the continuous parameter α takes a value between −1 and 1. When α = 0, the resulting symmetric divergence is linearly related to the following Hellinger distance:

D_H(p \| q) = \int \left[ p(x)^{1/2} − q(x)^{1/2} \right]^2 dx

The square root of the Hellinger distance is a commonly used distance metric. Dα(p||q) is always greater than or equal to zero, and the equal sign holds only when p(x) = q(x). If p(x) is a fixed distribution, consider minimizing Dα(p||q) with respect to the set of distributions q(x). For α ≤ −1, the divergence is called zero forcing: any x satisfying p(x) = 0 will cause q(x) = 0, and generally q(x) will underestimate the support of p(x) and tend to seek the mode with the largest mass. Conversely, for α ≥ 1, the divergence is called zero avoiding: any x satisfying p(x) > 0 will cause q(x) > 0, and generally q(x) will overestimate the support of p(x) and stretch to cover all of p(x). The two forms of Kullback-Leibler divergence correspond to two special cases of the alpha divergence family: KL(p||q) corresponds to the limit when α approaches 1, and KL(q||p) corresponds to the limit when α approaches −1.

Zero avoiding
See α-family of divergences.

Zero forcing
See α-family of divergences.

Hellinger distance
See α-family of divergences.

Kullback-Leibler divergence [KL divergence]
An information metric, also called cross entropy. Given an unknown distribution p(x), another distribution q(x) is used to approximate it. Then, when transmitting the value of x, the average increase in information (in nats) due to the approximation of p(x) with q(x) is the Kullback-Leibler divergence between p(x) and q(x), calculated as follows:

KL(p \| q) = −\int p(x) \ln q(x) \, dx − \left( −\int p(x) \ln p(x) \, dx \right) = −\int p(x) \ln \frac{q(x)}{p(x)} \, dx

The Kullback-Leibler divergence is always non-negative and equals zero only when p(x) = q(x) (i.e., when the approximation is exact). It is asymmetric, that is, KL(p||q) ≠ KL(q||p), but adding the two directions gives a symmetric cross entropy. See α-family of divergences.
Relative entropy
Same as Kullback-Leibler divergence.

Cross entropy
An objective evaluation index based on the amount of information in image information fusion. Also called directional divergence. The cross entropy between the fused image and the original image directly reflects the relative difference in the amount of information contained in the two images. See symmetric cross entropy.

Directional divergence
Same as cross entropy.

Symmetric cross entropy
Cross entropy in symmetrical form. If the histograms of the fused image and the original image are hg(l) and hf(l), respectively, l = 1, 2, . . ., L, then the symmetric cross entropy between them is

K(f : g) = \sum_{l=0}^{L} h_g(l) \log \frac{h_g(l)}{h_f(l)} + \sum_{l=0}^{L} h_f(l) \log \frac{h_f(l)}{h_g(l)}
The smaller the symmetric cross entropy, the more information the fused image gets from the original image.

Kullback-Leibler distance
A measure of the relative entropy (also called distance) between two probability densities p1(x) and p2(x). It can be expressed as

D(p_1 \| p_2) = \int p_1(x) \log \frac{p_1(x)}{p_2(x)} \, dx

This distance is directed, that is, D(p1||p2) is different from D(p2||p1).
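The divergence-based indices above can be computed directly from normalized histograms. The following sketch (numpy, with a small epsilon to avoid log 0) is one possible implementation, not a prescribed one; the example histograms are placeholders.

```python
# KL divergence and symmetric cross entropy between two normalized histograms.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = p / p.sum()
    q = q / q.sum()
    return np.sum(p * np.log((p + eps) / (q + eps)))

def symmetric_cross_entropy(hf, hg, eps=1e-12):
    # K(f:g) = KL(hg || hf) + KL(hf || hg)
    return kl_divergence(hg, hf, eps) + kl_divergence(hf, hg, eps)

hf = np.array([0.1, 0.2, 0.3, 0.4])      # histogram of original image
hg = np.array([0.25, 0.25, 0.25, 0.25])  # histogram of fused image
print(kl_divergence(hf, hg), symmetric_cross_entropy(hf, hg))
```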
43.3
Layered Fusion Techniques
Generally, the multi-sensor image fusion method is divided into three levels from low to high, that is, pixel-level fusion, feature-level fusion, and decision-level fusion, which help improve the performances in feature extraction, object recognition, and decision-making, respectively (Bishop 2006; Bow 2002; Duda et al. 2001; Theodoridis and Koutroumbas 2009).
Fig. 43.3 Three layers of image fusion
43.3.1 Three Layers

Three layers of image fusion
The image information fusion method can be divided into three levels from low to high: pixel layer fusion, feature layer fusion, and decision layer fusion. This division is related to the steps needed to complete the overall task with the help of image fusion. In the three-level flowchart of image fusion shown in Fig. 43.3, there are four steps from collecting scene images to making a decision, namely, image acquisition, feature extraction, object recognition, and decision. The three levels of image fusion occur exactly between these four steps: pixel layer fusion is performed after image acquisition (or before feature extraction), feature layer fusion is performed after feature extraction (or before object recognition), and decision layer fusion is performed after object recognition (or before decision-making).

Pixel layer fusion
One of the three layers of image fusion. It is the fusion performed at the bottom layer (data layer). By processing and analyzing the physical signal data (two or more images) collected by the image sensors, the object features are obtained to give a single fused image. The advantage of pixel layer fusion is that it can retain as much original information as possible, so pixel layer fusion has higher accuracy than feature layer fusion or decision layer fusion. The main disadvantages of pixel-level fusion are the large amount of processed information, poor real-time performance, high computational cost (high requirements for data transmission bandwidth and registration accuracy), and the requirement that the fused data be obtained by the same or similar sensors.

Image fusion
The image technology and process of comprehensively processing image data obtained from different sources for the same scene to obtain comprehensive, accurate, and reliable results. It can be performed at three levels, including pixel layer fusion, feature layer fusion, and decision layer fusion.
Feature layer fusion One of the three layers of image fusion. It's the fusion in the middle layer. First it is needed to extract features from the original image, obtain some object information (such as the edges, contours, shapes, surface orientations, and distances between the objects), and synthesize them to obtain a more confident judgment result. This not only retains important information in the original image but also compresses the amount of data, which is more suitable for data obtained by heterogeneous or very different sensors. The advantage is that the amount of data involved is less than the pixel layer fusion, which is conducive to real-time processing, and the features provided are directly related to decision analysis; the disadvantage is that the accuracy is often worse than the pixel layer fusion (because the degree of extraction is a certain degree of abstraction). Feature fusion A method to improve the robustness of recognition. It combines the different characteristics to more effectively grasp the object characteristics. It uses different feature filters at the same time when checking the co-occurrence probability of the extracted multiple joint features. Joint feature space A space formed by combining (such as concatenating) different features (vectors). For example, if f1 is a 2-D vector and f2 is a 3-D vector, then they jointly define a 5-D joint feature space. If you want to operate in this joint feature space, considering that f1 and f2 may have different scales, you need to adjust (normalize) the operation kernel. The new core should be
K(f_{12}) = k\!\left( \frac{\|f_1\|^2}{h_1^2} \right) k\!\left( \frac{\|f_2\|^2}{h_2^2} \right)

Among them, the parameters h1 and h2 are used to control the width of the two kernels, respectively.

Decision layer fusion
One of the three layers of image fusion. At the highest level, the best decision can be made directly based on certain criteria and the credibility of each decision. Before decision layer fusion, object classification or object recognition has already been completed for the data obtained from each sensor. Decision layer fusion is often performed with various symbolic operations. The advantages are strong fault tolerance, good openness, and real-time performance. The disadvantage is that much of the original information may have been lost during fusion, so the accuracy in space and time is often lower.

Bayesian fusion
A feature layer fusion method. Can also be used for decision layer fusion. Letting the sample space S be divided into A1, A2, . . ., An that satisfy (1) Ai ∩ Aj = ∅ for i ≠ j; (2) A1 ∪ A2 ∪ . . . ∪ An = S; and (3) P(Ai) > 0, i = 1, 2, . . ., n, then for any event B with P(B) > 0, there is
P(A_i | B) = \frac{P(A_i, B)}{P(B)} = \frac{P(B | A_i) P(A_i)}{\sum_{j=1}^{n} P(B | A_j) P(A_j)}

If the decision problem of multiple sensors is regarded as a division of the sample space, the Bayesian conditional probability formula can be used to solve this problem. Taking a system with two sensors as an example, the observation result of the first sensor is B1, and the observation result of the second sensor is B2. The possible decisions of the system are A1, A2, . . ., An. Assuming that the two observations are conditionally independent given each decision, and using the prior knowledge of the system and the characteristics of the sensors, the Bayesian conditional probability formula can be written as

P(A_i | B_1 \wedge B_2) = \frac{P(B_1 | A_i) P(B_2 | A_i) P(A_i)}{\sum_{j=1}^{n} P(B_1 | A_j) P(B_2 | A_j) P(A_j)}
Finally, the decision that gives the system the maximum a posteriori probability can be selected as the final decision.
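A minimal numerical sketch of this two-sensor Bayesian fusion rule is given below; the prior and likelihood values are illustrative placeholders, and the maximum a posteriori (MAP) decision is selected at the end.

```python
# Two-sensor Bayesian fusion: combine per-class likelihoods with the prior.
import numpy as np

prior = np.array([0.5, 0.3, 0.2])            # P(A_i) for n = 3 decisions
lik_sensor1 = np.array([0.6, 0.3, 0.1])      # P(B1 | A_i)
lik_sensor2 = np.array([0.5, 0.4, 0.1])      # P(B2 | A_i)

unnormalized = lik_sensor1 * lik_sensor2 * prior
posterior = unnormalized / unnormalized.sum()   # P(A_i | B1 ^ B2)

best = int(np.argmax(posterior))
print("posterior:", posterior, "MAP decision: A%d" % (best + 1))
```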
43.3.2 Method for Pixel Layer Fusion Method for pixel layer fusion A method for achieving the pixel layer fusion. The operands are the pixels in the original image. Since the original images to be fused often have some different but complementary characteristics, the method adopted to fuse them must also take into account their different characteristics. Typical methods include weighted average fusion, HSI transform fusion, PCA transformation fusion method, pyramid fusion, wavelet transform fusion, high-pass filter fusion, Kalman filter fusion, regression model fusion, parameter estimation fusion, etc. Weighted average fusion A method of pixel layer fusion. Can also be used for feature layer fusion. First, the image with lower spatial resolution in the two participating images is expanded by resampling to an image with the same resolution as the other participating image, and then the two images are weighted averaged in grayscale to get the fused image. HSI transform fusion A pixel layer fusion method. First select the lower-resolution group from the two groups of participating images, treat the three band images as RGB images, and transform them into HSI space. Then replace the I component in the previous HSI space with another set of higher-resolution images that participate in the fusion. Finally, perform the HSI inverse transformation on the three components H, S, and I, and use the obtained RGB image as the fusion image.
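As a concrete illustration of weighted average fusion, the simplest of the pixel layer methods listed above, the following sketch resamples the lower-resolution image to the size of the higher-resolution one and averages the gray levels. The file names and the equal weights are placeholders; OpenCV is assumed to be available.

```python
# Hypothetical sketch of weighted average fusion at the pixel layer.
import cv2

low = cv2.imread("low_res.png", cv2.IMREAD_GRAYSCALE)    # lower resolution
high = cv2.imread("high_res.png", cv2.IMREAD_GRAYSCALE)  # higher resolution

# Resample the lower-resolution image to the size of the other image.
low_up = cv2.resize(low, (high.shape[1], high.shape[0]),
                    interpolation=cv2.INTER_CUBIC)

# Weighted average in grayscale (equal weights here, purely illustrative).
fused = cv2.addWeighted(low_up, 0.5, high, 0.5, 0)
cv2.imwrite("fused.png", fused)
```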
HSI transformation fusion method See HSI transform fusion. PCA transformation fusion method A basic pixel layer fusion method based on principal component analysis (PCA). The specific steps are as follows (assuming that the two images to be fused are f(x, y) and g(x, y), the spectral coverage of f(x, y) is larger than g(x, y), and g(x, y) has a higher spatial resolution than f(x, y)): 1. Select three or more band images in f(x, y) for PCA transformation. 2. Perform a histogram matching between g(x, y) and the first principal component image obtained by the above PCA transformation, so that their gray mean and variance are consistent. 3. Replace the first principal component image obtained by performing PCA transformation on f(x, y) with g(x, y) after matching as above, and then perform inverse PCA transformation to obtain a fused image. Pyramid fusion A pixel layer fusion method. The lower-resolution image in the two images participating in the fusion is expanded by resampling into an image having the same resolution as the other image participating in the fusion, and then the two images are pyramid-decomposed, and the decomposition of fusion is performed at each level, and finally a fused image is reconstructed from the fused pyramid. Pyramid fusion method Same as pyramid fusion. Wavelet transform fusion A pixel layer fusion method. First perform wavelet transform on the two participating images (generally, one image has a higher spatial resolution than the other image, but the spectral coverage of this image is smaller than the other image) to obtain the respective low-frequency sub-image and high-frequency detail sub-image. Then combine the low-frequency sub-image of the image with larger spectral coverage but lower spatial resolution with the high-frequency detail sub-image of image with smaller spectral coverage but higher spatial resolution and perform inverse transform to get the fused image. Wavelet transform fusion method See wavelet transform fusion. Combined fusion method In the methods for pixel layer fusion, a method of combining different methods to overcome the problems caused by the individual use of each fusion method. Common fusion methods include HSI transform fusion and wavelet transform fusion, principal component transform fusion, and wavelet transform fusion.
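A sketch of pyramid fusion along the lines described above: Laplacian pyramids of two registered, same-size grayscale images are built, the levels are merged, and the fused image is reconstructed. The max-absolute merge rule for detail levels, the averaging of the coarsest level, and the level count are illustrative choices; OpenCV and numpy are assumed, and the random images are placeholders for real registered data.

```python
# Laplacian pyramid fusion of two registered grayscale images.
import cv2
import numpy as np

def laplacian_pyramid(img, levels=4):
    gp = [img.astype(np.float32)]
    for _ in range(levels):
        gp.append(cv2.pyrDown(gp[-1]))
    lp = []
    for i in range(levels):
        up = cv2.pyrUp(gp[i + 1], dstsize=(gp[i].shape[1], gp[i].shape[0]))
        lp.append(gp[i] - up)        # band-pass detail at level i
    lp.append(gp[-1])                # coarsest approximation level
    return lp

def fuse_pyramids(lp1, lp2):
    fused = [np.where(np.abs(a) >= np.abs(b), a, b)
             for a, b in zip(lp1[:-1], lp2[:-1])]   # details: larger magnitude
    fused.append(0.5 * (lp1[-1] + lp2[-1]))         # coarsest level: average
    return fused

def reconstruct(lp):
    img = lp[-1]
    for level in reversed(lp[:-1]):
        img = cv2.pyrUp(img, dstsize=(level.shape[1], level.shape[0])) + level
    return np.clip(img, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
f = rng.integers(0, 256, (256, 256)).astype(np.float32)   # placeholder image 1
g = rng.integers(0, 256, (256, 256)).astype(np.float32)   # placeholder image 2
fused = reconstruct(fuse_pyramids(laplacian_pyramid(f), laplacian_pyramid(g)))
print(fused.shape, fused.dtype)
```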
43.3.3 Method for Feature Layer Fusion Method for feature layer fusion A method to achieve the feature layer fusion. The operation object is the feature extracted from the image. Typical methods include weighted average fusion, Bayesian fusion, evidence reasoning fusion, neural network fusion, cluster analysis fusion, entropy fusion, voting fusion, etc. Feature-based fusion See feature layer fusion. Evidence reasoning fusion A feature layer fusion method. Can also be used for decision layer fusion. Evidence reasoning is often named by the surname Dempster-Shafer theory and is called the D-S theory. The method of achieving evidence reasoning fusion is called evidential reasoning method, in which the principle of probability additivity in Bayesian fusion is abandoned and a principle called semi-additivity is used instead. Evidential reasoning method See evidence reasoning fusion. Evidence reasoning See evidence reasoning fusion. Dempster-Shafer theory A theory that can be used for feature layer fusion and decision layer fusion. Also called evidence reasoning. A principle called semi-additivity is used for probability, which is suitable for the case where the credibility of two opposite propositions is very small (both cannot be judged according to the current evidence). Dempster-Shafer The author of a belief modeling method for testing hypotheses that represents information in the form of beliefs and incorporates credible metrics against the hypothesis. Basic probability number In evidence reasoning, a parameter that reflects the credibility of the event itself.
43.3.4 Method for Decision Layer Fusion Method for decision layer fusion A way to achieve the decision layer fusion. The operation object is the equivalent relationship or independent decision extracted from the image. Typical methods include knowledge-based fusion, Bayesian fusion, evidence reasoning fusion,
neural network fusion, fuzzy set theory fusion, reliability theory fusion, logic module fusion, production rule fusion, fusion based on rough set theory, etc.

Fusion based on rough set theory
A decision layer fusion method. Let U (U ≠ ∅) be a non-empty finite set of objects of interest, called the universe. An arbitrary subset X of U is called a concept in U. The set of concepts in U is called knowledge about U (often expressed in the form of attributes). Let R be an equivalence relation (which can represent the attributes of things) on U; then a knowledge base is a relation system K = {U, R}, where R is the set of equivalence relations on U. A subset X of U that cannot be defined by R can be described (approximately) by two exact sets (the upper approximation set and the lower approximation set). The R boundary of X is defined as the difference between the upper approximation set and the lower approximation set, that is,

B_R(X) = \overline{R}(X) − \underline{R}(X)

Let S and T be equivalence relations on U. The S positive domain of T (the set of elements of U that can be accurately classified into the equivalence classes of T) is

P_S(T) = \bigcup_{X \in T} \underline{S}(X)

The dependency of S and T is

Q_S(T) = \frac{card[P_S(T)]}{card(U)}

Among them, card(∙) represents the cardinality of a set. From the above, 0 ≤ QS(T) ≤ 1. Using the dependency QS(T) of S and T, we can determine the compatibility of the two equivalence relations S and T: when QS(T) = 1, S and T are compatible; when QS(T) ≠ 1, S and T are not compatible. When applying rough set theory to multi-sensor information fusion, the dependency relationship QS(T) of S and T is used. By analyzing a large amount of data, the intrinsic relationships among them can be found and compatible information eliminated, so as to determine the minimum invariant core in the data and obtain the fastest fusion method based on the most useful decision information.

Rough set theory
A theory suitable for feature layer fusion and decision layer fusion. Different from fuzzy set theory, which can express fuzzy concepts but cannot count fuzzy elements, rough set theory provides exact mathematical formulas for calculating the number of fuzzy elements. The description of rough sets uses the following conventions. Let L ≠ ∅ be a finite set of objects of interest, called the universe of discourse. For any subset X in L,
call it a concept in L. The set of concepts in L is called knowledge about L (often expressed in the form of attributes). Letting R be an equivalence relation defined on L (which can represent the attributes of things), then a knowledge base is a relation system K = {L, {R}}, where {R} is the set of equivalence relations on L. According to whether the subset X can be defined by R, the R rough set and the R exact set can be distinguished. The upper and lower bounds of a rough set are both exact sets.

Rough set
See rough set theory.

R rough set
A special set in rough set theory. Let L ≠ ∅ be a finite set of objects of interest, called the universe of discourse. Any subset X in L is called a concept in L. The set of concepts in L is called knowledge about L (often expressed in the form of attributes). Let R be an equivalence relation defined on L (which can represent the attributes of things); then a knowledge base is a relation system K = {L, {R}}, where {R} is the set of equivalence relations on L. For any subset X in L, if it cannot be defined by R, it is an R rough set; if it can be defined by R, it is an R exact set. See fusion based on rough set theory.

Exact set
See R rough set and rough set theory.

R exact set
A special set in rough set theory. Regarding the universe U of the objects of interest, consider any subset X (a concept in U). If X can be represented by the equivalence relation R, it is an R exact set. See fusion based on rough set theory. Compare R rough set.

Upper approximation set
An exact set for describing a rough set; it is the upper bound of the rough set. For knowledge R, it is the set of all elements of the finite set (universe of discourse) of objects of interest that may belong to a given subset X (at least it cannot be ruled out that they belong to X). It can be expressed as

\overline{R}(X) = \{ x \in U : R(x) \cap X \neq \emptyset \}

where R(x) is the equivalence class containing x.

Lower approximation set
An exact set for describing a rough set, which is also the lower bound of the rough set. For knowledge R, it is the set of all elements of the finite set (universe of discourse) of objects of interest that must be classified into a given subset X, which can be expressed as
\underline{R}(X) = \{ x \in U : R(x) \subseteq X \}

where R(x) is the equivalence class containing x.

Minimum invariant core
A kernel in the fusion based on rough set theory. Let R be a set of equivalence relations and R ∈ R. If I(R) = I(R − {R}), where I(•) represents the indiscernibility relation, then R can be omitted from R (it is unnecessary); otherwise it cannot be omitted (it is required). When no R ∈ R can be omitted, the set R is independent. If R is independent and P ⊆ R, then P is also independent. The set of all relations in P that cannot be omitted is called the core of P. A core is a collection of knowledge features that cannot be omitted.
Chapter 44
Content-Based Retrieval
Content-based image and video retrieval is essentially to grasp high-level semantic content, which belongs to the category of image understanding (Alleyrand 1992; ISO/IEC JTC1/SC29/WG11 2001; Zhang 2007).
44.1
Visual Information Retrieval
The purpose of visual information retrieval (VIR) is to quickly extract a collection or sequence of images related to a query from a visual database collection (Alleyrand 1992; Zhang 2009).
44.1.1 Information Content Retrieval

Visual information retrieval
An important research area of information technology. The main purpose is to quickly extract images or video clips related to a query requirement from a visual database (which can be isolated or online). In fact, it is an extension of traditional information retrieval, including visual media in information retrieval.

Visual information
Information obtained through vision and vision computing. The media that provide visual information are mainly images, graphics, and videos.

Inverted index
A data structure suitable for fast information retrieval. For example, when retrieving a specific word in a document, an inverted index can be calculated in advance for each individual word associated with the document (or news, or web page), and a
Fig. 44.1 Basic block diagram of content-based visual information retrieval system
word matching to a specific query is quickly retrieved based on how often the specific word appears in the document. Query by image content [QBIC] An early term for content-based image retrieval. Originated from a famous system developed by IBM in the 1990s for querying based on image content. It allowed users to give features such as the shape, color, or texture of the image or object that they want to query. Then the system used this as an example to search, browse, and extract the images that meet the conditions from the large online (image) database. A set of techniques for selecting images from an image database using examples of the desired image content (this is different from text search). The content here mainly refers to information such as color, texture, or shape and has now been extended to the semantic content and emotional content of images. See image database indexing. Content-based retrieval Retrieval based on the content of the information, not its data carrier, structure, time of formation, etc. In image engineering, the main research is visual information retrieval. The basic framework of the system for retrieving visual information based on content is shown in Fig. 44.1, which is mainly composed of five modules (all inside the dotted rectangular box in the figure). The information user issues a query request, and the system converts the query request into a computer internal description. With these descriptions matched with the information in the database, the required information data is extracted. The user can use this directly after verification or improve the query conditions and start a new round of retrieval. Image content Broadly refers to the meaning or semantic content of an image. Determining and extracting image content is an important basis and main purpose of using images. Image Hashing Techniques and processes for extracting short digital sequences by performing unidirectional mapping based on image content characteristics. Can be used to identify the image. Unlike the image watermarking technique, since the extraction of the Hash does not require modification of the image data, the original image quality can be maintained intact.
Content-based image retrieval [CBIR] Quickly search and extract the required image from the database based on the content similarity. The key is to effectively determine and capture the content of the image. Involves techniques for image analysis, query, matching, indexing, searching, and verification. It is an image database search method that matches based on the content of images in the database, as opposed to a method that uses text descriptors for indexing. Duplicate image retrieval Detect images with the same or near the same content, regardless of the process of geometric compression or image encoding used for storage. The similarity of the matching can be based on pixel calculation, feature-based calculation, or a combination of the two. This is a special case of image retrieval. Content-based video retrieval [CBVR] Quickly query and extract the required video clips from the database based on the content similarity. The key is to effectively identify and capture the content of the video. Involves video analysis, query, organization, matching, indexing, searching, and other technologies. Near duplicate image/video retrieval An image/video retrieval strategy. It is necessary to identify whether the image or video in the database is almost the same as the image or video of a given query after a certain transformation (such as image cropping, compression, scaling, filtering, etc.).
44.1.2 Image Retrieval Image retrieval The process or technique of retrieving a single image or sequence of images from an image database. Specifically, the image in the image library can be compared with a query example (image or text), and then the search method outputs the image that can be most recently matched according to given criteria. Similarity measure A measure used to measure the similarity between at least two entities (such as images or targets) or variables (such as feature vectors). Commonly used distance measures, such as Hausdorff distance, Mahalanobis distance, Minkowski dis tance, and so on. Similarity The degree of similarity between two entities (such as images or objects) or variables (such as feature vectors) in terms of various characteristics. It can be measured by various similarity measures.
The property that makes two entities (such as images, targets, features, models, values, etc.) or collections similar (i.e., how consistent they look with each other). The similarity transformation produces perfectly similar structures. A similarity measure can quantify the degree of similarity between two structures that may not be exactly the same. Examples of similar structures include two polygons that are the same except for their size, two image neighborhoods that have the same brightness values but are scaled by different multiplier factors, and so on. The concept of similarity is the core of some traditional vision problems, including stereo correspondence problems, image matching, and geometric model matching.

Similarity metric
A measure that quantifies the similarity of two entities. For example, cross-correlation is a commonly used measure of image region similarity. See feature similarity, graph similarity, grayscale similarity, point similarity measure, and matching method.

Sum of absolute differences [SAD]
A criterion for similarity metrics. It is generally the result of summing the absolute values of the gray differences between corresponding pixels in the images, so it is an L1 norm.

Term frequency-inverse document frequency [TF-IDF]
A visual vocabulary-based measure used in image retrieval. After mapping the query image to the visual vocabulary that composes it, the database image can be extracted by matching. This can be achieved by matching the query and the vocabulary distribution (term frequency) nwi/ni of the database image, where nwi is the number of occurrences of the word w in image i and ni is the total number of words in image i. In order to reduce the weight of frequently occurring words and focus on infrequently occurring words (which are thus more informative), an inverse document frequency weight log(N/Nw) is also used, where Nw is the number of images containing the word w and N is the total number of images in the database. Combining these two factors gives the term frequency-inverse document frequency metric:

t_i = \frac{n_{wi}}{n_i} \log \frac{N}{N_w}
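A short sketch of this weighting applied to a bag-of-visual-words image database is given below; the word counts are random placeholders standing in for real quantized feature histograms.

```python
# TF-IDF weighting of visual-word counts for a database of N images.
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(0.2, size=(1000, 4096))        # N images x W visual words

tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)   # n_wi / n_i
n_images = counts.shape[0]
df = np.maximum((counts > 0).sum(axis=0), 1)                     # N_w
idf = np.log(n_images / df)                                      # log(N / N_w)

tfidf = tf * idf        # one weighted descriptor row per database image
```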
Term frequency See term frequency-inverse document frequency. Inverse document frequency See term frequency-inverse document frequency. Metrics for evaluating retrieval performance In image retrieval, metrics are defined to evaluate the performance of the retrieval system and the retrieval results obtained by different technologies. Among them, the
determination of the degree of image matching is an important part of judging the retrieval performance, and the degree of matching can often be measured by means of measures describing similarity. Commonly used similarity measures include three types: (1) metrics based on histogram statistics, (2) metrics based on pixel statistics, and (3) metrics based on pixel differences. Because the purpose of image retrieval is to find and extract the required images or videos, in many cases it is also necessary to consider the number and ranking of the similar images retrieved.

Retrieving ratio
A metric for evaluating retrieval performance. For a given query image, retrieval algorithm, and parameter set, the retrieval rate can be expressed as

η_{ij}(N, T) = \frac{n_{ij}}{N},  i = 1, 2, . . ., M;  j = 1, 2, . . ., K

Among them, M is the total number of query images, K is the total number of parameter sets used by the algorithm, N is the total number of similar images in the image library, T is the total number of images extracted by the system, and nij is the number of similar images extracted with the i-th query image and the j-th group of parameters. The retrieval rate defined in this way is related to the total number T of extracted images. For the case of querying an image using K settings, the average retrieval rate can be calculated by the following formula:

η_i(N, T) = \frac{1}{K} \sum_{j=1}^{K} \frac{n_{ij}}{N} = \frac{1}{KN} \sum_{j=1}^{K} n_{ij},  i = 1, 2, . . ., M

For the case of querying M images with K settings, the average retrieval rate can be calculated by the following formula:

η(N, T) = \frac{1}{M} \sum_{i=1}^{M} η_i = \frac{1}{KMN} \sum_{i=1}^{M} \sum_{j=1}^{K} n_{ij}
Efficiency of retrieval
A metric for evaluating retrieval performance. For a given query image, let N be the total number of similar images in the image library, n be the number of similar images extracted, and T be the total number of images extracted by the system. The retrieval efficiency is

η_T = \begin{cases} n / N & N \le T \\ n / T & N > T \end{cases}
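The two figures above reduce to a few lines of code; the following sketch simply encodes their definitions, with illustrative numbers.

```python
# Retrieving ratio and retrieval efficiency for one query.
def retrieving_ratio(n, N):
    # n similar images returned out of N similar images in the library
    return n / N

def retrieval_efficiency(n, N, T):
    # T is the total number of images returned by the system
    return n / N if N <= T else n / T

print(retrieving_ratio(8, 10), retrieval_efficiency(8, 10, 20))
```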
Compressed domain retrieval Visual information retrieval is performed directly on the compressed data, which does not require or does not completely require decompression. Retrieval of compressed images or videos before decompression or before complete decompression. The advantages of this include: 1. The amount of data in the compressed domain is smaller than the amount of data in the original domain or the decompressed domain, which is conducive to improving the efficiency of the entire system, especially when the retrieval system requires real-time response. 2. The additional operation steps of decompression can be (partly) omitted, which can reduce the processing time and the equipment overhead. 3. Many image or video compression algorithms have made a large number of processing and analysis works during the compression process. If the results of these processing and analysis can be used in retrieval, it can reduce repetitive work and improve retrieval efficiency. 4. In feature-based image retrieval, the corresponding feature vector is stored in addition to the image when storing the image (building a database). 5. In the compressed domain, some feature vector information is already included in the compression coefficient, and the extra storage amount can be omitted. Compressed domain The space where the image is compressed in image engineering. Tattoo retrieval An application of image retrieval in which the search pattern is tattoo. The purpose is to help identify a person’s legal situation.
44.1.3 Image Querying Image querying Same as image database indexing, but shorter. The most basic query is based on visual features such as color, texture, and shape. Compare image retrieval. Query by example A typical method for query in image database. The user gives an example image and asks the system to retrieve and extract (all) similar images from the image database. A variation on this approach is to allow the user to combine multiple images or draw a sketch to get a sample image. The sample images should be consistent or similar in content to the images to be retrieved. Query by sketch A typical method for query in image database. It is an implementation method of query by example. The user draws a sketch as an example image as required and asks the system to retrieve and extract (all) similar images in the image database.
Drawing sketch See query by sketch. Relevance feedback Feedback on whether the returned results (of search or image retrieval) are relevant, the feedback can come from the user or the system. The feedback can be used to improve subsequent searches. Query expansion A technique of content-based image retrieval. The previous image returned by the initial query is submitted for query again to obtain more candidate results, thereby improving the retrieval performance of images that are difficult to query (such as samples affected by occlusion). Normalized discounted cumulative gain [DCG] In image retrieval, the result of a query is a sorted list of images. To evaluate this result, assume that each image has been labeled with a correlation score. Discounted cumulative gain (DCG) measures the quality of the search results, penalizing images that are highly relevant but rank behind in the image list. The normalization process calculates the ratio of the observed DCG to the best DCG available. Human-computer interaction [HCI] An interdisciplinary study of the interaction between a computer system and a user. It involves the research in computer science and artificial intelligence, psychology, computer graphics, computer vision, and other fields. It also refers to the research and practice of interaction methods between human users and computer systems. Often used to better design software and hardware interfaces or improve the workflow of using software. Lightpen Originally referred to a device that uses a laser beam to indicate position. It is now referred also to a human-computer interaction device that allows a person to indicate a position on a computer screen by touching the screen with a pen at a desired location. The computer then draws, selects, etc. at that location. It is essentially a mouse, but it doesn’t work on the (mouse) pad but on the display screen. Motion-based user interaction The user takes the motion field information to control a particular application. User interface [UI] A medium for interaction and information exchange between users and computer systems. In many cases, it is also called a human-machine interface. It can realize the conversion between the internal storage form of information and the acceptable operation form of human being, so that the user can conveniently and efficiently control the system to complete the work that the user wants the system to complete. For example, in content-based image retrieval and content-based video retrieval, the user interface plays an important role in querying and verifying.
Human-machine interface See user interface.
44.1.4 Database Indexing Image database indexing A technique for associating an index (such as a keyword) with an image, which allows images in the image database to be indexed efficiently. Thumbnail A reduced representation of an image. Sufficient information should be provided so that the observer can see the general situation of the original image. For an image database, thumbnails are often used to indicate which content images are mainly included, so that it can be easily judged whether they are needed images before opening them. In image retrieval, the candidate images by querying are often represented in the form of thumbnails to display multiple images on one page, which is convenient for users to choose. Epitome model For a given image, the representation of a small miniature in which the basic shapes and texture units are concentrated. The mapping from the miniature to its original pixels is implicit. With the change to the implicit mapping, multiple images can share the same miniature. In the miniature model, raw pixel values are used to characterize texture and color characteristics (instead of filtering response). Miniature can be derived with the help of models of production systems. It is assumed here that the original image can be generated from the miniature by copying pixel values and adding Gaussian noise. In this way, as a learning process, by checking the best match, different size slices from the image can be added to the miniature (a very small image for checking the best possible match). Then, the miniature is updated according to the newly sampled image slice. This process continues iteratively until the miniature stabilizes. Figure 44.2 shows an example image and two miniatures of different sizes. Visible epitomes are relatively compact representations of images. Image database The collection of a large number of images also generally has functions such as image input/archiving, image management, and image retrieval. Visual database An application program for organizing, managing, and saving various images. Image gallery A set of images for statistical analysis, texture analysis, or feature analysis. Query in image database One of the basic steps and tasks in content-based image retrieval.
Fig. 44.2 Left: original color image; middle: its 32 × 32 epitome; right: its 16 × 16 epitome
Image search
Retrieval of images from the web or a database based on keywords and visual similarity. Many search engines work with keywords in the title, the text of the attachment, or the file name, and use information such as click-through rate. With the improvement of recognition technology, more and more systems work with visual features and content similarity, and they can work without keywords or when keywords are incorrect.

Image tagging
Automatically or manually assigning relevant keywords to images in the image database to help with image retrieval or scene classification.

Image annotation
A technique for (semantic) text labeling of images and the objects within them. In other words, the process of documenting the object content or context of an image. On this basis, content-based image retrieval using text descriptions can be realized (annotation links the image data at the lower level with the semantic content at the higher level), and it can also help to explain the scene represented by the image. Here the content of each image needs to be described with multiple text keywords (or sentences) as accurately and comprehensively as possible. See image classification.

Vocabulary tree
A data structure derived from document retrieval, suitable for identification and retrieval in large image databases. Each leaf node corresponds to a visual word, and the intermediate nodes correspond to clusters of similar visual words. An image database is organized into a hierarchical tree structure, where the search for matching images is to compare the description of the index
image with the descriptors at each branch of the tree. The tree can be constructed from a database using a hierarchical K-means clustering method.

Labeling problem
Given a set of image structures S (which can be pixels or more structured objects such as edges) and a set of labels L, the labeling problem can be described as how to assign a label l ∈ L to each image structure s ∈ S. This process generally relies on the image data and the adjacent labels. For example, a typical remote sensing application is to label image pixels based on the type of land (such as sand, water, snow, forest, etc.).

Object indexing
A method to organize object models based on certain primitives (commonly shape-based primitives) to efficiently locate similar models. See indexing and model base indexing.

Spatial indexing
1. Converting an object to a number so that it can quickly be compared to other objects. This is closely related to the calculation of the spatial transformation and imaging distortion of the object. For example, an object can be indexed by its compactness or elongation after representing it as a set of 2-D boundary points.
2. An effective data structure designed for searching and storing geometric quantities. For example, querying nearest points can be made more efficient by calculating spatial indexes (such as Voronoï diagrams, distance transforms, k-D trees, binary space partitions, etc.).
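A minimal sketch of spatial indexing in the second sense, using a k-D tree from SciPy to answer nearest-point queries; the random 2-D points stand in for whatever geometric quantities an application would actually store.

```python
import numpy as np
from scipy.spatial import cKDTree

# A set of 2-D points (e.g., object locations, or low-dimensional feature values).
rng = np.random.default_rng(0)
points = rng.random((1000, 2))

tree = cKDTree(points)                     # build the spatial index once
dist, idx = tree.query([0.5, 0.5], k=3)    # nearest-neighbour queries are then fast
print(idx, dist)                           # indices and distances of the 3 nearest points
```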
44.1.5 Image Indexing

Image indexing
See image database indexing.

Indexing
The process of using keywords to retrieve an element from a data structure. It has become a powerful concept extending from programming to image technology. For example, given an image and a set of candidate models, the identity of an object is often established by identifying some features (elements with characteristics) in the image and using the characteristics of those features to index the database of models. See model base indexing.

Indexing structure
In feature matching, a physical structure that needs to be constructed in order to search for all corresponding feature points. Typical indexing structures include Hash tables and multidimensional search trees.
Structured codes
A code that can be efficiently represented and supports efficient search for the closest code word.

Multidimensional search trees
A typical indexing structure. Among them, the most commonly used are k-D trees; others include metric trees.

K-D tree
A multidimensional search tree; it is also a special binary tree for K features. At each node, the value of one of the K features is used to determine how to access its subtrees. The nodes of the left subtree have feature values smaller than the decision value, and the nodes of the right subtree have feature values larger than the decision value. An example of K-D tree representation in 2-D feature space is shown in Fig. 44.3, where Fig. 44.3a represents the stepwise division of the image (the thick division line in the figure represents the current division) and Fig. 44.3b gives the resulting K-D tree (only one feature is considered in the figure, so all nodes are still divided into 3 categories as in a quadtree or a binary tree; multiple features can also be used in practice). The last graph in Fig. 44.3a shows the region represented by each leaf node of the K-D tree obtained in Fig. 44.3b. Note that during the stepwise division of the K-D tree, the division direction changes alternately. Because it is not necessary to divide the entire image matrix into two equal-sized subimages each time, it is more flexible and easier to adapt to the shape of the object in the image. In order to represent the same image, the final number of nodes used in the K-D tree is generally smaller than that of the binary tree. It can be considered that in the process of constructing the K-D tree, there is not only division but also merging.

Hashing
A simple way to implement an indexing structure. Feature descriptors are mapped into fixed-size cells based on a certain function. When matching, each new feature is also hashed, its similar hashes are searched to return possible candidates, and these candidates are further sorted or ranked to determine a valid match.
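A minimal sketch of the hashing idea described above: descriptors are quantized into fixed-size cells that serve as hash keys, and a query returns the candidates stored in its cell. The cell size and the uniform grid are illustrative assumptions; a practical system would also probe neighboring cells and re-rank the candidates by exact distance.

```python
import numpy as np
from collections import defaultdict

def cell_key(descriptor, cell_size=0.25):
    # Quantize the descriptor into a fixed-size cell; the cell acts as the hash key.
    return tuple(np.floor(np.asarray(descriptor) / cell_size).astype(int))

def build_index(descriptors):
    index = defaultdict(list)
    for i, d in enumerate(descriptors):
        index[cell_key(d)].append(i)
    return index

def query(index, descriptor):
    # Return candidate matches from the same cell (no neighbor probing here).
    return index.get(cell_key(descriptor), [])

descs = np.random.rand(500, 4)
index = build_index(descs)
print(query(index, descs[7]))   # candidates sharing descriptor 7's cell (includes 7 itself)
```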
Fig. 44.3 K-D tree representation of 2-D feature space: (a) stepwise division of the image; (b) the resulting K-D tree
Hash table
A typical indexing structure. See Hashing.

Hash function
A mapping between variable-length strings and fixed-length strings. In general, the Hash mapping of a string reduces its length.

One-way Hash
A Hash function with a small amount of forward computation and a large amount of inverse (reverse) computation. In other words, given an input, it is easy to get the output; but given a desired output, it is practically impossible to recover the original input.

Message authentication code
A one-way Hash of a message appended to the message, used to verify that the message has not changed between the time the Hash was appended and the time of the test.

Locality sensitive Hashing [LSH]
A more elaborate but widely used version of Hashing, in which features are indexed using a set of independently calculated Hash functions.

Spatial Hashing
See spatial indexing.

Parameter-sensitive Hashing [PSH]
An extension of locality sensitive Hashing that is more sensitive to the distribution of points in the parameter space.

Geometric Hashing
An efficient way to achieve geometric primitive template matching (assuming that a subset of geometric primitives in the image exactly matches the geometric primitives of the template). Some geometrically invariant features are mapped into a Hash table, and this Hash table is used for identification. The method initially used points as primitives, but line segments can also be used as primitives. The basic geometric Hashing method uses affine transformations to determine the allowed transformation space; other types of transformations can also be used after extending the method. It is based on the fact that three non-collinear points (called base points) can be used to determine an affine basis of the 2-D plane. If these three points are p0, p1, p2, then all other points can be represented by a linear combination of the three:

q = p0 + s(p1 − p0) + t(p2 − p0)

A characteristic of the above representation is that an affine transformation does not change the equation; that is, the value of (s, t) depends only on the three base points and has nothing to do with the affine transformation. In this way, the value of (s, t) can be regarded as the affine coordinates of the point q. The same applies to line segments:
three non-parallel line segments can be used to define an affine basis. If the restricted transformation is a similarity transformation, then two points can determine a basis; but if two line segments are used, only a rigid body transformation can be determined. Geometric Hashing creates a Hash table to quickly determine the potential location of a template, thereby reducing the amount of template matching effort. Suppose there are M points in the template and N points in the image. The Hash table is created as follows: for every three non-collinear points on the template, the affine coordinates (s, t) of the other M − 3 points in the template are calculated and used as the index of the Hash table. For each point, the current three base points are stored in the Hash table. To search for a template in the image, randomly select three image points and construct the affine coordinates (s, t) of the other N − 3 points. Using (s, t) as the index of the Hash table, the serial numbers of three base points can be obtained, giving a vote for those three base points in the image. If no three base points correspond to the randomly selected points in the calculation result, the vote is not accepted. If three base points can be found for the randomly selected points, the vote is accepted and the base point numbers are recorded. Thus, if enough votes are accepted, there is enough evidence that a template exists in the image, and it can be further checked whether the template really exists. In fact, it is only necessary to find one correct basis in order to find the template. Therefore, if K of the M template points appear in the image, the probability of finding at least one correct basis in L trials is

P = 1 − [1 − (K/N)³]^L

If a similarity transformation is used, the exponent 3 in the brackets changes to 2. For example, if K/N = 0.2 and the required probability of finding the template is 99%, the affine transformation needs 574 trials, while the similarity transformation only needs 113 trials. As a comparison, the number of potential correspondences between the M primitives in the template and the N primitives in the image is on the order of NM, that is, O(NM).
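The trial counts quoted above can be reproduced by inverting the probability formula for L. The helper below is only a numerical check; d stands for the exponent (3 for an affine basis, 2 for a similarity basis).

```python
import math

def trials_needed(k_over_n, prob, d):
    # Invert P = 1 - (1 - (K/N)**d)**L for the number of trials L.
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - k_over_n ** d))

print(trials_needed(0.2, 0.99, d=3))   # 574 trials for an affine transformation
print(trials_needed(0.2, 0.99, d=2))   # 113 trials for a similarity transformation
```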
44.2 Feature-Based Retrieval
The initial stage of content-based retrieval is feature-based image retrieval, which mainly uses visual features such as color, texture, shape, spatial relationship, and motion information (Jähne 2004; Russ and Neal 2016; Zhang 2017b).
44.2.1 Features and Retrieval

Feature-based image retrieval [FBIR]
The initial stage of content-based image retrieval. The features here are mainly visual features, such as color, texture, shape, structure, spatial relation, and movement state.

Visual feature
Features that directly affect the human visual system and produce visual perception. Also called perceptual features. Commonly used visual features include color, texture, shape, structure, motion, spatial relation, and so on.

Puzzle
A type of puzzle game. Basic jigsaw puzzles require combining given small units into a large picture in a certain order or logic according to their pattern, texture, shape, etc. In this process, puzzlers often need to recognize and reason about patterns. This is very similar to image retrieval based on visual features.

Combined feature retrieval
Retrieval by combining different visual features (color, texture, shape, etc.). Different visual features reflect the attributes of the image from different angles, and each has its own characteristics in describing the content of the image. In order to describe image content comprehensively and effectively improve retrieval performance, different types of visual features are often used in combination, or comprehensive features are constructed for retrieval.

Feature layer semantic
The semantics contained in various visual features, which have more global characteristics. Although the feature layer can be regarded as one layer of the semantic layer in a broad sense, studies of the semantic layer often refer to higher-level research.

Texture-based image retrieval
One of the fields of feature-based image retrieval. Texture-based image retrieval uses various texture descriptors to describe the texture information in the image; at the same time, the spatial information in the image can also be quantitatively described to a certain extent. It can be considered that texture-based image retrieval is content-based image retrieval using texture as the matching criterion.

Internal Gaussian normalization
A normalization method used to calculate the similarity distance between components with different physical meanings and different value ranges within the same feature vector. The internal components of the feature vector can then have the same status when the similarity measurement is performed, thereby reducing the influence of a small number of oversized or undersized components on the overall similarity.
The feature vector of dimension M can be written as F = [f1 f2 . . . fM]^T. For example, if I1, I2, . . ., IK are used to represent the images in the image database, then for any image Ii, the corresponding feature vector is Fi = [fi,1 fi,2 . . . fi,M]^T. Assuming that the series of feature component values [f1,j f2,j . . . fi,j . . . fK,j]^T conforms to a Gaussian distribution, the mean mj and standard deviation σj can be calculated, and then the following formula can be used to normalize fi,j to the interval [–1, 1] (the superscript (N) represents the normalized result):

fi,j^(N) = (fi,j − mj) / σj

After normalizing according to the above formula, each fi,j is transformed into fi,j^(N) with N(0, 1) distribution. If 3σj is used for normalization, the probability that the value of fi,j^(N) falls in the interval [–1, 1] reaches 99%.
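A minimal NumPy sketch of internal Gaussian normalization as described above, using the 3σ variant so that roughly 99% of values fall in [−1, 1]. Clipping the remaining outliers to [−1, 1] is an added assumption, not prescribed by the entry.

```python
import numpy as np

def internal_gaussian_normalize(F, clip=True):
    # F: K x M matrix, one M-dimensional feature vector per image in the database.
    # Each component is normalized with 3*sigma so that ~99% of values fall in [-1, 1].
    m = F.mean(axis=0)
    s = F.std(axis=0)                     # assumes no component is constant
    Fn = (F - m) / (3.0 * s)
    return np.clip(Fn, -1.0, 1.0) if clip else Fn

# Components with very different value ranges become comparable after normalization.
F = np.random.rand(100, 5) * [1, 10, 100, 1000, 0.1]
Fn = internal_gaussian_normalize(F)
print(Fn.min(), Fn.max())
```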
External Gaussian normalization
A normalization method used to calculate the similarity distance between feature vectors with different physical meanings and value ranges. Different types of feature vectors then have the same status in the calculation of similarity distances. Let an M-dimensional feature vector be F = [f1 f2 . . . fM]^T. For example, if I1, I2, . . ., IK are used to represent the images in the image database, then for any image Ii, the corresponding feature vector is Fi = [fi,1 fi,2 . . . fi,M]^T. The main steps of Gaussian normalization are as follows:
1. Calculate the similarity distance between the feature vectors FI and FJ corresponding to each pair of images I and J in the image database:
DIJ = distance(FI, FJ),  I, J = 1, 2, . . ., K and I ≠ J
2. Calculate the mean mD and standard deviation σD of the K(K − 1)/2 distance values obtained from the above formula.
3. For the query image Q, calculate the similarity distance between it and each image in the image database and record these as D1Q, D2Q, . . ., DKQ.
4. For D1Q, D2Q, . . ., DKQ, first perform Gaussian normalization according to the following formula (where the superscript (N) indicates the normalized result):
DIQ^(N) = (DIQ − mD) / σD
Convert the normalized result to the interval [–1, 1], and then perform a linear transformation as follows:
DIQ^(N) = [(DIQ − mD)/(3σD) + 1] / 2
It is easy to see that 99% of the values of DIQ^(N) thus obtained fall in the interval [0, 1]. The actual similarity distance does not necessarily conform to a Gaussian distribution. In this case, a more general relationship (Chebyshev's inequality) can be used:

P[−1 ≤ (DIQ − m)/(Lσ) ≤ 1] ≥ 1 − 1/L²

If L is 3, the probability that the similarity distance falls in the [–1, 1] interval is 89%; if L is 4, the probability that the similarity distance falls in the [–1, 1] interval is 94%.

Image normalization
An operation to eliminate differences in scene lighting. The typical method is to subtract the mean of the image and then divide by the standard deviation, which yields an image with zero mean and unit variance. Because the actual image is not a Gaussian random sample, such methods cannot completely solve the problem. In addition, the position of the light source often causes changes in tone, which cannot be corrected by this method.

Shape-based image retrieval
One of the fields of feature-based image retrieval. Shape is not only an important feature for describing image content but also a direct feature for describing an object in an image. Shapes are often associated with objects and have a certain semantic meaning, so shape features can be regarded as higher-level features than color or texture. However, no precise mathematical definition of shape has been found so far, whether based on geometric, statistical, or morphological measures, that is completely consistent with human perception.

Structure-based image retrieval
One of the fields of feature-based image retrieval. The concept of structure here focuses more on describing the mutual position and orientation relationships between parts of the same target in the image. If a component is replaced with a target, it can be explained by a spatial relationship.

Structure feature
Features involved in structure-based image retrieval. To a certain extent, it can be regarded as a feature between texture and shape, which can express the connection between non-repeating brightness patterns and object parts.

Sketch-based image retrieval
A type/mode of image retrieval in which the database index is based on a sketch drawn by a user of a desired target. For example, a sketch, logo, or icon of a human face can be used. The difficulty in retrieving by this method stems from the fact that
the sketch is not a faithful copy of the desired target, while the retrieval is expected to be relatively accurate in shape. Often this type of search emphasizes the use of shape information, and some systems also use color information.

Spatial-relation-based image retrieval
One of the fields of feature-based image retrieval. When the image contains many independent targets and the search results emphasize the positional relationships between them, especially when there is more than one target (not too many in practice) and the shape and size of the targets can be ignored compared with the distances between them, image retrieval using spatial relations is more suitable.

Spatial relation
The association of two or more spatial entities, described in terms of connections or correlations. Examples include the orthogonality or parallelism of lines or planes. In image engineering, the mutual position and orientation of different objects. For example, one image region may be contained in another image region (inclusion); such relations also reflect the mutual position and orientation of the objects in the scene.

Spatial proximity
A description, given in real space (not feature space), of the distance between two objects.

Location-based retrieval
Retrieval of information based on the location of a mobile device. The term is most commonly associated with mobile phone devices but can also be used in augmented reality applications, where information may be overlaid on specific locations in an image or video.
44.2.2 Color-Based Retrieval

Color based image retrieval
An example of general image database indexing technology, where one of the main indexes for an image database is derived from a color sample (i.e., the color distribution of a sample image) or a set of textual color terms (such as "red" and "green"). It is one of the branches of feature-based image retrieval. Color is an important feature for describing image content, and it was applied earlier than other features in content-based image retrieval. Color features are relatively well defined, and their extraction is relatively easy (it can be performed on a single pixel). Retrieving an image using a color feature requires the following two points to be determined: (1) the feature can represent the characteristics of the color information of the image, which are related to the selected color model or color space; and (2) the feature can be used to calculate the degree of similarity.
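As a small illustration of point (1) above, the following sketch builds a quantized joint RGB histogram as a color feature; the choice of 8 bins per channel is an assumption made for illustration.

```python
import numpy as np

def rgb_histogram(image, bins=8):
    # image: H x W x 3 array with values in [0, 255].
    # Quantize each channel into `bins` levels, build a joint color histogram,
    # and normalize it so that images of different sizes can be compared.
    q = (image // (256 // bins)).reshape(-1, 3)
    idx = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

img = np.random.randint(0, 256, (120, 160, 3))
print(rgb_histogram(img).shape)   # (512,) for 8 bins per channel
```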
Intensity-based database indexing
One of the image database indexing techniques; it uses a description of brightness, such as a histogram of (monochrome or color) pixels or a vector of local derivatives.

Color based database indexing
See color based image retrieval.

Color indexing
Using color information, such as color histograms, for image database indexing. A key issue encountered here is the change in lighting. The ratio of colors between adjacent positions can be used to obtain illumination invariance.

GoF/GoP color descriptor
A color descriptor recommended in the international standard MPEG-7. The scalable color descriptor has been extended to make it suitable for video clips or a set of still images. This descriptor adds two more bits than the scalable color descriptor to indicate how the color histogram was computed, including:
1. Average: The average histogram is equivalent to aggregating the histograms of the individual frames and normalizing.
2. Median: The median histogram is equivalent to combining the median values of the corresponding histogram bins of the individual frames, which is less sensitive to errors and noise than the average histogram.
3. Intersection: The intersection histogram calculates, for each bin, the minimum value over the image frames to find the least common colors. Intersection histograms are different from histogram intersections: a histogram intersection only gives a scalar measure, while an intersection histogram provides a vector measure.
The similarity measure used to compare scalable color descriptors can also be used to compare group of frames (GoF) color descriptors or group of pictures (GoP) color descriptors.

Dominant color descriptor [DCD]
A basic color descriptor recommended and used in the international standard MPEG-7. It is suitable for describing local features in an image, especially for occasions where only a few colors are needed to characterize the color information in a region of interest. When using shape features to describe an object region in the image, the region is often extracted first. On this basis, the dominant color descriptor can be used; that is, the region is described by a set of dominant colors to obtain both the shape description and the color description of the region.

Reference color table
A method of similarity computation based on histograms. The image colors are represented by a set of reference colors. This set of reference colors should cover the various colors that can be visually perceived, while the number of colors can be less than the number of colors used in the original image. In this way, a simplified histogram can be obtained. The resulting feature vector used for similarity calculation (matching) is
f = [r1 r2 . . . ri . . . rN]^T
Among them, ri is the frequency of occurrence of the i-th color, and N is the size of the reference color table. The similarity value between the query image Q and the database image D, using the above feature vector and weighting, is

P(Q, D) = W^T(fQ − fD) = sqrt[ Σi Wi (riQ − riD)² ],  i = 1, 2, . . ., N

Among them, the weight is calculated by the following formula:
Histogram intersection
A method of similarity computation based on histograms. In content-based image retrieval, let HQ(k) and HD(k) be the feature statistics histograms of the query image Q and the database image D, and let L be the number of features; then the similarity between the two images is

P(Q, D) = [Σk min(HQ(k), HD(k))] / [Σk HQ(k)],  k = 0, 1, . . ., L − 1
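A direct transcription of the histogram intersection formula above into NumPy; the toy histograms are arbitrary.

```python
import numpy as np

def histogram_intersection(h_query, h_db):
    # P(Q, D) = sum_k min(H_Q(k), H_D(k)) / sum_k H_Q(k)
    h_query = np.asarray(h_query, dtype=float)
    h_db = np.asarray(h_db, dtype=float)
    return np.minimum(h_query, h_db).sum() / h_query.sum()

print(histogram_intersection([4, 2, 0, 1], [3, 3, 1, 0]))   # (3 + 2 + 0 + 0) / 7 ≈ 0.714
```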
Central moment method
A method of similarity computation based on histograms that can be used for image query and retrieval. Let MiQR, MiQG, MiQB denote the central moments of order i (i ≤ 3) of the R, G, and B component histograms of the query image Q, and let MiDR, MiDG, MiDB denote the central moments of order i (i ≤ 3) of the R, G, and B component histograms of the database image D; then the similarity value P(Q, D) between Q and D is

P(Q, D) = sqrt[ WR Σi (MiQR − MiDR)² + WG Σi (MiQG − MiDG)² + WB Σi (MiQB − MiDB)² ],  i = 1, 2, 3

Among them, WR, WG, and WB are weighting coefficients.
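A hedged sketch of the central moment method above: central moments of each component histogram are computed (treating the histogram as a distribution over its bin indices, which is one plausible reading) and combined with the weighted distance just given; the equal weights and random histograms are placeholders.

```python
import numpy as np

def central_moments(hist, order=3):
    # Central moments of a 1-D histogram over its bin indices.
    hist = np.asarray(hist, dtype=float)
    p = hist / hist.sum()
    k = np.arange(len(hist))
    mean = (k * p).sum()
    return np.array([((k - mean) ** i * p).sum() for i in range(1, order + 1)])

def moment_distance(moments_q, moments_d, weights=(1.0, 1.0, 1.0)):
    # Weighted combination following the P(Q, D) formula above;
    # W_R, W_G, W_B are application-dependent and set to placeholders here.
    return np.sqrt(sum(w * ((mq - md) ** 2).sum()
                       for w, mq, md in zip(weights, moments_q, moments_d)))

rng = np.random.default_rng(1)
hists_q = [rng.integers(1, 50, 64) for _ in range(3)]   # R, G, B histograms of Q
hists_d = [rng.integers(1, 50, 64) for _ in range(3)]   # R, G, B histograms of D
print(moment_distance([central_moments(h) for h in hists_q],
                      [central_moments(h) for h in hists_d]))
```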
Central moment
A region moment computed relative to the centroid of the region. For example, the moment of order p + q of the image f(x, y) is written as

mpq = Σx Σy x^p y^q f(x, y)

Then the central moment of order p + q of f(x, y) can be expressed as

Mpq = Σx Σy (x − x̄)^p (y − ȳ)^q f(x, y)

where x̄ = m10/m00 and ȳ = m01/m00 are the coordinates of the centroid of the region of f(x, y).

Normalized central moment
The region moment, relative to the centroid of the region, normalized in amplitude. If the central moment of order p + q of the image f(x, y) is denoted by Mpq, the normalized central moment can be expressed as

Npq = Mpq / M00^γ,  where γ = (p + q)/2 + 1,  p + q = 2, 3, . . .
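The moment definitions above translate directly into NumPy. The sketch below uses the convention that x indexes columns and y indexes rows (an assumption about the coordinate convention) and checks the normalized central moments on a simple binary region.

```python
import numpy as np

def raw_moment(f, p, q):
    y, x = np.mgrid[:f.shape[0], :f.shape[1]]
    return (x ** p * y ** q * f).sum()

def central_moment_pq(f, p, q):
    m00, m10, m01 = raw_moment(f, 0, 0), raw_moment(f, 1, 0), raw_moment(f, 0, 1)
    xc, yc = m10 / m00, m01 / m00            # centroid of the region
    y, x = np.mgrid[:f.shape[0], :f.shape[1]]
    return ((x - xc) ** p * (y - yc) ** q * f).sum()

def normalized_central_moment(f, p, q):
    gamma = (p + q) / 2 + 1                   # valid for p + q >= 2
    return central_moment_pq(f, p, q) / central_moment_pq(f, 0, 0) ** gamma

f = np.zeros((64, 64)); f[20:40, 10:50] = 1.0   # a simple binary region
print(normalized_central_moment(f, 2, 0), normalized_central_moment(f, 0, 2))
```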
44.3 Video Organization and Retrieval
Video programs often have obvious structural characteristics. On this basis, content organization, index structure, and content-based query can be established (Bovik 2005; Zhang 2017c).
44.3.1 Video Organization

Video shot
A basic physical unit of video, shot for short. It consists of a series of frame images arranged in time order, obtained by a camera shooting continuously. A video shot is usually taken in the same scene and usually contains a set of continuous actions at a certain position in space. Each video program contains a series of different shots.

Shot
Short for video shot.
Shot organization strategy
Principles and methods for combining multiple shots to form higher-level physical or semantic units. A typical example is to organize shots into episodes. Typical techniques used are shot clustering and prior knowledge analysis.

Key frame
Originally an animation technique in computer graphics, in which the key frames of a sequence were drawn by an experienced animation designer, and the interpolated frames between them were drawn by a less experienced animation designer. In motion sequence analysis in computer vision, key frames are selected video frames, typically showing motion discontinuities, and the scene motion between them can be smoothly interpolated. In a video program, one or more frame images that are representative of a scene and characterize that scene.

Key frame selection
Selecting the video frames with the largest changes. Can be applied to video coding, video summarization, etc.

Video key frame
1. In video compression, a frame that is stored completely. This frame can be used as the basis for progressive compression. If the viewpoint has changed significantly, or the scene itself has changed significantly, it is often necessary to determine a new key frame.
2. In video analysis, a completely preserved video frame. These frames can be used for video summarization or to understand an action in a scene.
3. In computer animation, the frames that define the start and end of a smooth transition; the frames between them can be obtained by interpolation.

Episode
In video organization, a semantic unit that represents video content. It includes a set of content-related shots (not necessarily connected or adjacent in time). Generally, it describes a story or event with related content.

Representative frame of an episode
In image retrieval, one or more frame images that are representative of an episode and can summarize its content.

Logical story unit
A semantic unit in video organization. It contains a set of temporally continuous shots in which elements with similar visual content recur. It can also be seen as a special case of an episode.

Shot detection
Segmentation of the shots in a video along the time axis. Also called time domain segmentation. Considering the large amount of data in video, in practice the segmentation mainly uses the boundary-based method; that is, the transition position of a shot is determined by detecting the boundary between shots.
There are two main types of transitions between shots: abrupt change and gradual change. See image segmentation.

Map of average motion in shot [MAMS]
In shot detection, a feature that jointly reflects the time, space, and motion of a shot. It is a cumulative average map of the 2-D distribution, which keeps the spatial distribution information of motion. For a shot, the map of average motion in shot can be expressed as

M(x, y) = (1/L) Σi |fi(x, y) − fi+v(x, y)|,  i = 1, 2, . . ., L − v,  L = N − v
Among them, N is the total number of frames of the shot, i is the frame index, and v is the frame interval used for measuring the difference between frames.

Precision
1. A metric for evaluating retrieval performance. It is a measure of accuracy: the number of relevant images retrieved by the system divided by the total number of images retrieved (i.e., the true positives plus the false alarms). It can be expressed as

Precision = (# of relevant images retrieved) / (# of total images retrieved)

It can also be used for classification tasks. In this case, the precision P corresponds to the positive predictive value and can be expressed as

P = Tp / (Tp + Fp)

Among them, Tp is the number of true positives, and Fp is the number of false positives. A perfect precision score of 1.0 indicates that every retrieved image is relevant, but it does not provide information on whether all relevant images were retrieved. Therefore, precision alone is not sufficient; for example, it is also necessary to consider the number of all relevant images by calculating the recall. Precision can also be evaluated using only the first n results obtained; this measure is called precision at n and is denoted P@n.
2. See measurement precision.
3. The repeatability of the results obtained by a device performing multiple measurements under the same conditions. A typical measure is the standard deviation of the expected error. For example, the precision of a vision system that measures linear dimensions can be evaluated by
making ten million measurements on a known target and calculating the standard deviation of these measurements. Compare accuracy.
4. The number of significant bits to the right of the decimal point in a floating-point or double-precision number.

Specificity
In binary classification, a measure often discussed alongside precision. An index to measure the performance of object classification, mainly used in medical contexts. A binary classifier C(x) returns the label + or – for a sample x. Comparing these predictions with the true labels yields true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Specificity is defined as the true negative rate, TN/(TN + FP), i.e., the percentage of negative samples that are correctly labeled as negative. See sensitivity.

Recall
A metric for evaluating retrieval performance. It is a measure of completeness, computed by dividing the number of relevant images retrieved by the system by the total number of relevant images in the database (i.e., the total number that should be retrieved). It can be expressed as

Recall = (# of relevant images retrieved) / (# of total relevant images)

This indicator can also be used for classification tasks. In this case, the recall R can be expressed as

R = Tp / (Tp + Fn)

Among them, Tp is the number of true positives, and Fn is the number of false negatives. A perfect recall score of 1.0 indicates that all relevant images have been retrieved but does not provide information on how many irrelevant images were also retrieved. Therefore, recall alone is not sufficient; for example, it is also necessary to calculate the precision to account for the number of irrelevant images.

Recall rate
Same as recall.
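A minimal numerical illustration of the precision and recall definitions above.

```python
def precision_recall(num_retrieved_relevant, num_retrieved, num_relevant_total):
    # Precision = relevant retrieved / all retrieved; Recall = relevant retrieved / all relevant.
    precision = num_retrieved_relevant / num_retrieved
    recall = num_retrieved_relevant / num_relevant_total
    return precision, recall

# Example: 8 of the 10 returned images are relevant, out of 20 relevant images in the database.
print(precision_recall(8, 10, 20))   # (0.8, 0.4)
```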
44.3.2 Abrupt and Gradual Changes

Abrupt change
A type of transition between consecutive video shots, also called a cut. A cut corresponds to a sudden change in some pattern (generated/caused by changes in scene
brightness or color, movement of the target or background, changes in edge contours, etc.) between two adjacent frames. The frame before the cut belongs to the previous shot; the frame after the cut belongs to the next shot. Therefore, a cut corresponds to a rapid transition between shots, and it is generally considered that the cut itself has no time length (no cut frame). Compare gradual change.

Cut detection
Identification of the frames in a movie or video where the camera viewpoint suddenly changes. Here the viewpoint change includes changing to another viewpoint of the same scene as well as changing to a new scene. See abrupt change.

Straight cut
Same as abrupt change.

Gradual change
A transition between video shots that slowly changes from one shot to another, often lasting a dozen or several dozen frames. Also called optical cut. It is commonly used in video editing (and also in movie editing). The most typical forms include dissolve, fade in, and fade out; others include slide, pull up, pull down, pop-off, pop-on, translate, flip, wipe, matte, rotate, and so on. Compare abrupt change.

Optical cut
Same as gradual change.

Dissolve
One of the most typical gradual changes between video shots. It can be obtained by optical processing or by changing the gray level or color of pixels. It can also be regarded as a superimposed combination of a fade-in and a fade-out. In this case, the boundary between the two shots spans several frames, forming a transition sequence including a start frame and an end frame. A common effect is that the last few frames of the previous shot gradually dim and disappear, while the first few frames of the subsequent shot progressively appear. During the transition, frames of both shots appear (transparent overlay). Compare fade in and fade out.

Fade in
One of the most typical gradual changes between video shots. A shot is brightened (starting from darker) until the last frame becomes completely white (in practice, the process stops when normal display is reached). It can be obtained by optical processing or by changing the gray level or color of pixels. In this case, the boundary between the two shots spans several frames, constituting a transition sequence including a start frame and an end frame. A common fade-in effect is that the first few frames of the latter shot slowly and evenly emerge from a completely black screen. Compare fade out.

Fade out
One of the most typical gradual changes between video shots. Fade out corresponds to the optical process of continuously dimming a shot (starting from normal display)
until the last frame becomes completely black. It can be obtained by optical processing or by changing the gray level or color of pixels. In this case, the boundary between the two shots (the latter shot can be very dark) spans several frames, forming a transition sequence including a start frame and an end frame. A common fade-out effect is that the last few frames of the previous shot slowly and evenly darken until the screen becomes completely black. Compare fade in.

Slide
A typical gradual change between video shots. The first frame of the next shot is pulled in from one side/corner of the screen, while the last frame of the previous shot is pulled out from the other side/corner of the screen.

Pull up
A typical gradual change between video shots. The first frame of the next shot is pulled up from the bottom and gradually covers the last frame of the previous shot.

Pull down
1. A typical gradual change between video shots. The first frame of the next shot is pulled down from the top and gradually covers the last frame of the previous shot.
2. The sample rate conversion (from high to low) from movie to video.

Pop-off
When the video shot is switched, the last frame of the previous shot is removed from the screen and disappears. A typical gradual change.

Pop-on
When the video shot is switched, the first frame of the latter shot appears on the screen and occupies the entire picture. A typical gradual change.

Translate
A typical gradual change between video shots. The last frame of the previous shot is gradually pulled out from one side of the screen, revealing (from the other side) the first frame of the next shot.

Flip
A typical gradual change in which the last frame of the previous shot is progressively turned over to reveal, from the other side, the first frame of the next shot.

Wipe
A typical gradual change between video shots. The first frame of the next shot gradually passes over and covers the last frame of the previous shot (as if the last frame were gradually erased).

Matte
A typical gradual change between adjacent video shots. A dark template gradually infiltrates or penetrates the screen and covers the last frame of the previous shot until the screen becomes completely black.
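Relating to the shot detection, abrupt change, and cut detection entries in this subsection, the following hedged sketch flags a cut when the histogram difference between consecutive frames exceeds a threshold. The bin count and threshold are illustrative assumptions, and gradual changes would require more elaborate measures than this simple frame-to-frame test.

```python
import numpy as np

def detect_cuts(frames, bins=16, threshold=0.4):
    # frames: list of H x W grayscale arrays. A cut is declared between frames whose
    # normalized histogram difference exceeds the (data-dependent) threshold.
    def hist(f):
        h, _ = np.histogram(f, bins=bins, range=(0, 256))
        return h / h.sum()
    cuts = []
    prev = hist(frames[0])
    for i in range(1, len(frames)):
        cur = hist(frames[i])
        if 0.5 * np.abs(cur - prev).sum() > threshold:   # total variation distance
            cuts.append(i)        # shot boundary between frame i-1 and frame i
        prev = cur
    return cuts

# Synthetic example: a dark shot followed by a bright shot.
frames = [np.full((60, 80), 40)] * 5 + [np.full((60, 80), 200)] * 5
print(detect_cuts(frames))   # [5]
```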
44.3.3 Video Structuring

Video program analysis
One of the important topics in content-based video retrieval. The purpose is to establish or recover the (semantic) structure of a video program and to query, index, or extract key content based on this structure.

Video program
A work combining pictures and sound.

Video structure parsing
Classifying the segmented shots using the prior structure and characteristics known for a video. The prior structure of the video can be sequential or hierarchical, within which the main scenes can be predicted or detected. Typical applications include television news and weather forecasts.

Ranking sport match video
In video program analysis, sorting the different shots according to their importance and excitement, so that they can be indexed and retrieved as needed. There are always some climax events in sports games, and there are many uncertain factors in the games, which make the highlight shots corresponding to special events a major attraction of sports game programs (so sports game videos are also called event videos). After the program segments are sorted according to importance and excitement, a corresponding selection can be made according to time, bandwidth, etc. for receiving and watching.

Sport match video
Videos about sporting events. They generally have a strong structure, and there are some climax events, which provide temporal clues and constraints for sports video analysis. In addition, the environment of a sports game is often specific, but there are many uncertain factors in the game; the time and location of an event cannot be determined in advance, so the video generation process cannot be controlled during the game. The shooting methods of sports games also have many characteristics; for example, slam dunks in basketball games are shot not only in the air but also from below the backboard. See ranking sport match video.

Highlight shot
Shots that correspond to special events in sports match videos. See ranking sport match video.

Home video organization
In content-based video retrieval, the process of establishing a content index for home video recordings. The goal is to get the needed video clips faster and more efficiently. Home videos mainly record different segments of people's lives, so the importance of all shots is often equal. Home video consists primarily of unedited original video shots, mainly time-stamped data. In terms of content, clustering is often performed based on the similarity of shots.
Two-stage shot organization
A hierarchical organization strategy in home video organization. At the first level of shot clustering, changes of scene are detected, that is, positions where the motion region and the environmental characteristics change at the same time, which represent a transfer of episode and place. At the second level of shot clustering, the interior of a scene is considered, and changes in a certain region (excluding simultaneous changes) of the motion region or the environmental characteristics are further detected. These represent a change of focus within the same shot (corresponding to different moving targets but the same environment) or a shift of the position of the attended target (corresponding to the same moving target but different environments). Changes at this clustering level can be called sub-scenario transitions. In other words, the first level of shot clustering distinguishes high-level scene semantics, and the second level of shot clustering detects weaker semantic boundaries (boundaries between two parts with different semantic content) within the same scene. Such a two-level shot organization strategy can make the content organization of video shots more flexible and accurate and closer to the high-level semantic understanding of video content.

Video retrieval based on motion feature
One of the important topics in the research of content-based video retrieval. Motion is a basic element for video analysis and is unique to video data. Motion is directly related to the relative position change of spatial entities or to camera motion. Motion information indicates the development and change of video image content along the time axis and plays a very important role in describing and understanding video content. Retrieval effects based on specific content can also be obtained through retrieval based on motion features. For example, with the analysis of global motion, one can know whether the camera performed operations such as panning, zooming, or tilting when shooting. As another example, with the analysis of local motion, one can understand the motion characteristics of the targets of interest in the scene.
44.3.4 News Program Organization

News program structuring
One of the technologies of content-based video retrieval. In video program analysis, the inherent hierarchical structure of news videos is automatically extracted, so that they can be organized and retrieved as needed. The structural characteristics of news videos are relatively obvious, and the structures of various types of news programs have good consistency. The main content is composed of a series of news story units, each corresponding to a news item. The relatively fixed hierarchical structure of news videos provides a variety of information clues that are
helpful for understanding video content, establishing index structures, and conducting content-based queries.

News video
A category of video programs that is particularly widely used and specifically covers news content. Compared with news program, it highlights and emphasizes the carrier or medium of the program.

News program
A program that covers news; a video program that covers news on TV. Various types of news programs have good consistency, and their main content is composed of a series of news story units (news items). Compared with news video, it highlights and emphasizes the form and content of the show.

News story
A layer in news program structuring. It represents a video unit that tells a relatively independent event with clear semantics. Compare news item.

News item
The physical structure of a news video corresponding to a layer of a news story.

Main speaker close-up [MSC]
A close-up of a single person's head in a news video. Examples include live reporter shots, interviewee shots, speaker shots, etc. The main object in the video frame is the portrait of the person, not the background activity. Unlike face detection, the main concern when detecting a speaker shot is whether a dominant person's head appears in the video, rather than whether other objects appear at a specific position on the screen; the detailed features of the person's face do not have to be extracted. In applications, a statistical model of motion can be constructed, and a method of matching a head template can be used to detect important speaker shots.

Announcer's shot
A shot of the announcer in a news video program. It is the first shot of a news story unit, which is different from other "speaker shots," such as reporters' shots, interviewees' shots, and speakers' shots. The detection of the announcer's shot is the basis of news video structuring, especially the segmentation of news items. Also called positioning shot.

Anchor shot
See announcer's shot.

News abstraction
A more compact representation of the original video sequence extracted through news program structuring. To extract the abstract, various visual features can be used, or the analysis results of audio, video, and text content can be used comprehensively.
Visual rhythm
1. A visual summary of a complete video, such as listing key frames and transition shot frames in a single-image display.
2. A repeating visual pattern, such as the lines observed in a window.
44.4 Semantic Retrieval
Semantic features, also called logical features, are high-level features that can be further divided into objective features and subjective features; they support retrieval at a more abstract semantic level that is better aligned with human cognition (Zhang 2007; Zhang 2015c).
44.4.1 Semantic-Based Retrieval

Semantic-based visual information retrieval [SBVIR]
A high-level form of content-based retrieval. Visual information retrieval at the semantic level can avoid or bypass the problem of the semantic gap and is more in line with people's understanding of information content. See semantic feature.

Semantic feature
A high-level feature in content-based retrieval, also called logical feature. From the perspective of human cognition, people's description and understanding of images are mainly carried out at various semantic levels. In addition to describing objective things, semantics can also describe subjective feelings and more abstract concepts. Semantic features can be further divided into objective features and subjective features.

Logical feature
Same as semantic feature.

Semantic event
In video organization, a video clip with a specific semantic meaning or connotation. It should be noted that the time domain segmentation of a video decomposes the video into shots, but each such shot does not necessarily correspond to a (complete) semantic event.

Semantic layer
The research level that emphasizes representing, describing, and explaining the meaning of image content. Image semantics have ambiguity, complexity, and abstraction and are generally divided into multiple layers. The broad semantic layer includes:
1. Feature layer semantics (such as color, texture, structure, shape, and motion), directly related to visual perception.
2. Object layer semantics (such as people and things) and spatial relation semantics (such as a person in front of a building or a ball on grass). This requires logical reasoning and identification of the categories of the objects in the image.
3. Scene layer semantics (such as seashore, wilderness, and indoor).
4. Behavior layer semantics (such as running, conducting image retrieval, and performing programs).
5. Affective layer semantics (such as pleasing images and exhilarating videos).
The narrow semantic layer mainly involves cognitive level semantics and abstract attribute semantics.

Semantic image annotation and retrieval
Image retrieval from an image database based on symbolic descriptors (such as keywords that describe images). For example, instead of using color histograms directly, histograms and other information are used to infer the existence of certain targets (such as people, cars, etc.) to describe the image content. The descriptors used can also be combined with a probability that reflects the existence of the described target.

Semantic video indexing
Video indexing based on conceptual units (such as words, image patches, or video clips that describe video content), as opposed to methods that use sets of numerical properties, such as color histograms.

Semantic video search
See semantic video indexing.

Semantic gap
The gap between low-level features and high-level semantics in content-based retrieval. It arises because retrieval systems often use low-level features for retrieval, while humans perceive the objective world using high-level semantic knowledge; this is also a difference between computers and people. The solutions that have been proposed mainly focus on mapping low-level visual features of images to high-level semantics to bridge the semantic gap. It can also be seen as the difference between two different representations of an object. For example, a square can be described as a set of four line segments of equal length placed end to end with right angles between adjacent segments; it can also be described using a specific set of pixels in a binary image. Although both describe a square, it takes a lot of computation to prove that they are equivalent, because these descriptions use very different levels of semantics.

Annotation
A general term for assigning a label to an image or a region of an image. Annotation can be performed manually to obtain reference truth values, or automatically with the help of scene semantic segmentation, scene understanding, and augmented reality results. Figure 44.4 shows an example of face annotation.
Fig. 44.4 An example of face annotation (annotated regions: head, eyes, mouth)
Fig. 44.5 Multilayer image description model (layers from bottom to top: original image, meaningful region, visual perception, object, and scene, connected by pre-segmentation, feature extraction, object recognition, and scene understanding, leading to high-level description and interpretation)
44.4.2 Multilayer Image Description

Multilayer image description model
A model that analyzes and extracts image content at different levels to achieve a progressive understanding of image semantics. Figure 44.5 shows a schematic flowchart. The model consists of five layers: (1) the original image layer, (2) the meaningful region layer, (3) the visual perception layer, (4) the object layer, and (5) the scene layer. Based on the description of the previous layer, the description of the next layer is obtained through certain operations. It is a context-driven algorithm based on prior knowledge.
Original image layer
One layer of the multilayer image description model. It corresponds to the original image and is at the lowest level.

Meaningful region layer
One layer of the multilayer image description model. A meaningful region is a region in an image that has a certain meaning relative to human visual perception. In order to obtain the representation and description of the meaningful region layer, the original image needs to be pre-segmented. This is different from image segmentation in the usual sense: there is no need to perform pixel-level fine segmentation of the image, and it is often only necessary to extract relatively complete distinct regions.

Visual perception layer
One layer of the multilayer image description model. After obtaining the description of the meaningful region layer, feature extraction needs to be performed to obtain the description of the visual perception layer and use it to interpret the image content. According to the specific application, features that can fully describe the image content should be selected.

Object layer
One layer of the multilayer image description model. The entire image is divided into several regions that each represent a certain semantic and have certain relationships. Based on their own needs and application requirements, users can specify to the system what kind of semantic regions (i.e., objects) they want the image to contain, and they can further specify what kind of relationships should be satisfied between these objects.

Object layer semantic
The semantics of the object regions in an image or video. It belongs to the semantic layer and has certain local and abstract characteristics.

Scene layer
One layer of the multilayer image description model, the highest layer in the model. The main consideration is the semantic concept embodied by the image as a whole.

Scene-layer semantic
A semantic in the semantic layer. It represents the semantics contained or conveyed by the original objective scene and has overall comprehensive characteristics.

Subjective feature
A semantic feature mainly related to the attributes of objects and scenes, expressing the meaning and purpose of the object or scene. Subjective features are closely related to subjective perception and emotion.

Objective feature
A semantic feature mainly related to the recognition of objects in images and the motion of objects in videos.
Table 44.1 Five typical atmospheres and their illuminance and hue characteristics

Number  Atmosphere type            Illuminance                             Hue
1       Vigor and strength         Large illumination and large contrast   Bright
2       Mystery or ghastfulness    Large contrast                          Dim or dull
3       Victory and brightness     Small contrast                          Warm tone
4       Peace or desolation        Small contrast                          Cold tone
5       Lack unity and disjoint    Scattered                               —

Note: The terms in bold type in the table indicate that they are included as headings in the text
Cognitive level semantic
A synonym for object layer semantics and scene-layer semantics.

Cognitive level
The level of recognizing and describing objective things. In image engineering, it mainly corresponds to the object layer and the scene layer. At these higher levels, cognition plays an important role in understanding image content.
44.4.3 Higher Level Semantics

Atmosphere semantic
The atmosphere embodied or created by an image or video picture; it is an emotional-layer semantic that has both subjective and abstract features. The five typical atmospheres roughly defined by the environment and object lighting conditions, together with their illuminance and hue characteristics, are shown in Table 44.1.

Vigor and strength
An atmosphere semantic type created by an image or video. The picture has large illumination, large contrast, and bright colors.

Mystery or ghastfulness
An atmosphere semantic type created by an image or video. The picture has a large contrast, while the tone is dim or dark.

Victory and brightness
An atmosphere semantic type created by an image or video. The picture mostly has low contrast (soft) and warmer tones.

Peace or desolation
An atmosphere semantic type created by an image or video. The picture mostly has low contrast and cooler tones.

Lack unity and disjoint
An atmosphere semantic type created by an image or video. The main feature of the picture is that the illumination distribution is relatively random.
Abstract attribute semantic
A collective term for behavior layer semantics and emotional-layer semantics.

Affective computing
In artificial intelligence, the field concerned with the processing, recognition, and interpretation of human emotions, as well as related technologies and system design.

Affective state
The emotional state of a person or animal, related to emotional feelings or responses. It is often measured and determined through posture analysis or expression analysis. See affective gesture.

Emotional-layer semantic
A semantic of human feelings and emotions in the semantic layer. It has strongly subjective and abstract characteristics.

Emotional state recognition
Automatic analysis and classification of the external appearance of a person's emotional, psychological, and mental state. It makes full use of machine vision and sensor technology.

Human emotion
Related to many of people's feelings, ideas, and behaviors; it can be expressed through expressions, behaviors (this information can often be obtained through image technology), and physiological indicators.
44.4.4 Video Understanding Video understanding A generalization of image understanding. Take video as input. Video understanding can be achieved on the basis of basic video computing with the help of learning capabilities. Video annotation Associate symbolic objects, such as text descriptions or index entries, with video frame images. Video corpus A set of videos or video clips that need to be analyzed to obtain information in them. These videos can be labeled. Video search 1. Discover and collect video connections on the network. 2. Locate the required video in the video database (see video retrieval). 3. Look for specific content in a video.
Video browsing In content-based video retrieval, nonlinear browsing of organized video data based on the organizational structure.

Video retrieval Abbreviation for content-based video retrieval. Extract videos or video clips of interest from a video database (or online video collection). Retrieval can be based on the similarity of the content, or on the keywords or metadata associated with the video.

Video thumbnail A small-sized image that gives an impression of video content. There are many software packages with the ability to implement or generate video thumbnails. Google's videos page is an example.

Video content analysis A set of methods for extracting different kinds of information from video, such as main characteristics, quantification of sports behavior, scene types, technology categories used, icon appearance, etc.

Video semantic content analysis An analysis in which video content descriptions are given in terms such as actors, goals, relationships, or activities. This is in contrast to analyses that pay more attention to numerical quantities, such as video frames, motion distances, or color histograms.

Video summarization A brief summary of important events from a set of video images. For example, a summary can be obtained by extracting video key frames and keeping their order without repeating similar content multiple times. The extraction of highlight shots of sports games is also a commonly used form of summary. See mosaic-based video compression.

Video organization One of the important work contents in content-based video retrieval. The original video stream is combined into a new structural relationship and indexed according to the requirements of the specific application, so that users can use the video database more effectively. This is basically a process of continuously abstracting the video code stream and gradually obtaining high-level expressions.

Video organization framework Structural model for organizing video content. A typical organizational framework is shown in Fig. 44.6, where Fig. 44.6a shows the overall block diagram and Fig. 44.6b shows a specific schematic diagram of the organizational structure. According to Fig. 44.6a, three types of operations can be performed on videos: video organization, video browsing, and video retrieval. With reference to Figs. 44.6a and b, the video is gradually organized into four layers (video organization) from bottom to top, namely, the frame image layer F, the shot layer S, the episode layer E, and the video program layer P. Going from top to bottom, nonlinear browsing (video browsing) can be performed, from video to episode, from episode to
shot, and finally from shot to frame image. Video retrieval (and video indexing) can be performed at each layer, which is especially effective at the shot layer and the episode layer.

Video indexing Allows video annotations to be searched in response to a specific query. The query can be "in which frame has this person appeared?" or "from which frame did the event begin?"

Video mining A general term that refers to extracting information from a video. Its tasks include detecting suspicious objects in a scene or monitoring the manufacture of products. More recently, it refers to discovering specific patterns from a video sequence or collection, such as analyzing traffic patterns to ensure safety, statistical analysis of sports performance, determining the cost of producing a video, and quantifying customer behavior in a store.

Parallel tracking and mapping [PTAM] Simultaneously performs bundle adjustment for key frames selected from the video stream and robust real-time pose estimation for intermediate frames.
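As an illustration of the four-layer structure described in the video organization framework entry, the following is a minimal Python sketch (the class names, fields, and toy frame ranges are assumptions for illustration, not taken from the handbook) of a program/episode/shot/frame hierarchy with simple top-down browsing:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:                       # shot layer S: a run of consecutive frames
    frames: List[int]             # frame indices belonging to this shot

@dataclass
class Episode:                    # episode layer E: logically related shots
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Program:                    # video program layer P (top of the hierarchy)
    episodes: List[Episode] = field(default_factory=list)

def browse(program: Program):
    """Nonlinear top-down browsing: program -> episode -> shot -> frame."""
    for e, episode in enumerate(program.episodes):
        for s, shot in enumerate(episode.shots):
            # a real browser would show a key frame here; we just report the range
            print(f"episode {e}, shot {s}: frames {shot.frames[0]}..{shot.frames[-1]}")

# toy example: one program with two episodes
p = Program([Episode([Shot(list(range(0, 50))), Shot(list(range(50, 90)))]),
             Episode([Shot(list(range(90, 140)))])])
browse(p)
```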
Chapter 45
Spatial-Temporal Behavior Understanding
Image understanding requires the use of images to fully grasp the scene information, analyze the meaning of the scene, explain the scene dynamics, and then understand the semantic information they convey (Zhang 2018b).
45.1
Spatial-Temporal Techniques
The understanding of spatial-temporal behavior requires spatio-temporal technology to comprehensively detect and track various objects of the scene in time and space, and based on the full grasp of spatio-temporal information, use knowledge to reason, to judge behavior, and to explain the scene (Zhang 2017c).
45.1.1 Techniques and Layers

Spatial-temporal behavior understanding An image understanding process that explains the behavior of specific objects in a scene with the help of spatial-temporal techniques. The understanding of spatial-temporal behavior needs to use the collected images to determine which sceneries are in the scene and how they change their position, pose, velocity, relationship, etc. in space over time. This includes acquiring objective information (collecting image sequences), processing related visual information, analyzing (representing and describing) the information content, and interpreting the image/video information on this basis to achieve behavior learning and recognition. In other words, it is necessary to grasp the actions of the scenes in space and time, determine the purpose of the activity, and then understand the semantic information they convey.
Action primitives The first layer in the spatial-temporal behavior understanding. The basic atomic action unit used to construct the action generally corresponds to brief object motion information in the scene. Action The second layer in the spatial-temporal behavior understanding. It is a collection (ordered combination) with a specific meaning composed of a series of action primitives of the subject/initiator. General movements often represent a simple action pattern performed by a single person, usually only lasting on the order of seconds. The result of human movement often results in a change in pose/ orientation. Activity The third layer in the spatial-temporal behavior understanding. Refers to a combination of a series of time-continuous actions performed by the subject/sponsor to complete a certain task or achieve a certain goal (mainly stressing logical combinations) and represents a specific task, behavior, or purpose. Activities are relatively large-scale movements that generally depend on the environment and interacting people. Activities often represent complex (possibly interactive), sequential actions performed by multiple people and often last longer. See activity classification. Event 1. A record in stereology when a probe intersects an object scene. It is the basis for quantitative estimation of geometric units. 2. The fourth layer in the spatial-temporal behavior understanding. Refers to certain activities and results that occur in a particular time and space. The actions are often performed by multiple subjects/sponsors (group activities). Detection of specific events is often associated with abnormal activity. 3. A statistical description of the image and a collection of random test outputs. The probability of an event can be used to define the distribution function of a random variable. Behavior The fifth layer in the spatial-temporal behavior understanding. Its subject/sponsor mainly refers to human or animal. Emphasize that the subject/sponsor is subject to thoughts to change actions, continue activities, and describe events in a specific environment/context. Spatial-temporal techniques Techniques for the spatial-temporal behavior understanding, including highdimensional motion analysis, 3-D pose detection of targets, space-time tracking, behavior judgment, and so on. Spatio-temporal filtering Filtering of images in space and time. It is a generalization of general temporal filtering and spatial filtering.
Fig. 45.1 Example of matching errors when using the SSD algorithm

Fig. 45.2 Using spatio-temporally shiftable windows in avoiding matching errors: (a) the shiftable windows in the three frames; (b) the corresponding epipolar plane image
Spatio-temporal filter A filter that performs image filtering in space and time.

Spatio-temporally shiftable windows A template for constructing epipolar plane images. Consider a simple sequence of three frame images, as shown in Fig. 45.1, where the frame image in the middle is the reference frame. There is a moving foreground object (labeled O) and a fixed background (labeled B). In different frames, the background regions I, J, K, and L are partially occluded, respectively. If an SSD algorithm is used, a mismatch will be produced in two cases: when the window containing the pixel to be matched lies in the background region centered on the black pixel, and when the window straddles the depth discontinuity (abrupt change) centered on the white pixel. The use of spatio-temporally shiftable windows can reduce problems caused by partially occluded regions and depth discontinuities. Referring to Fig. 45.2(a), the spatio-temporally shiftable window centered on the white pixel in region O can be correctly matched in each frame; the spatio-temporally shiftable window centered on the black pixel in region B can get the correct match in the left frame, but a temporal selection is needed to disallow the match in the right frame image. Figure 45.2(b) shows an epipolar plane image corresponding to this sequence, which provides more details on why temporal selection works. In this figure, the three images corresponding to the left, middle, and right are shown as three-dimensional horizontal layers that pass through the epipolar plane image. The spatio-temporally
shiftable windows around black pixels are represented by rectangles; it can be seen that the right image is not used for matching.

Spatio-temporal frequency response A combination of temporal frequency response and spatial frequency response. The spatial frequency response (contrast sensitivity) at a given temporal frequency (in cycles per second) and the temporal frequency response (contrast sensitivity) at a given spatial frequency (in cycles per degree) can be obtained from it. Experiments show that at higher temporal frequencies, both the peak frequency and the cutoff frequency shift down in the spatial frequency response. They also help verify an intuitive expectation: in contrast to the eye's ability to resolve the spatial detail of a still image, the eye cannot distinguish details of too high a spatial frequency when the target moves very fast. The key significance of this finding for the design of television and video systems is the possibility of compensating for temporal resolution with spatial resolution, or vice versa. The use of interlaced scanning in TV system design takes advantage of this fact.

Spatial and temporal redundancy A special kind of data redundancy that exists between pixels in spatially adjacent and temporally related video images.

Space-time stereo The expansion of ordinary stereo vision along the time axis. A unified framework for obtaining depth information using triangulation technology. This can be achieved with two or more cameras continuously shooting.
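As a concrete illustration of the spatio-temporal filtering entries above, here is a minimal sketch (assuming NumPy and SciPy are available; the array sizes and sigma values are arbitrary choices) that smooths a synthetic video volume along both the temporal and the spatial axes with a 3-D Gaussian kernel:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# synthetic video volume: 60 frames of 128 x 128 pixels, axes (t, y, x)
video = np.random.rand(60, 128, 128).astype(np.float32)

# spatio-temporal Gaussian filtering: sigma_t acts along the time axis,
# sigma_y and sigma_x act within each frame
sigma_t, sigma_y, sigma_x = 2.0, 1.5, 1.5
smoothed = gaussian_filter(video, sigma=(sigma_t, sigma_y, sigma_x))

# setting sigma_t = 0 reduces this to pure spatial filtering,
# and sigma_y = sigma_x = 0 to pure temporal filtering
print(video.shape, smoothed.shape)
```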
45.1.2 Spatio-Temporal Analysis Spatio-temporal analysis Motion image analysis on 3-D stereo composed of sequential 2-D images. For example, epipolar plane images, motion occlusion, spatio-temporal autoregressive models, etc. Spatio-temporal space Generally, it refers to the 3-D representation space of a video sequence as a whole, which consists of a set of 2-D video frames stacked one after another. See spacetime cuboid. It can be extended to 4-D, which is to combine 3-D data over a period of time (such as derived from CT, MRI, etc.). When analyzing differential geometry, spatiotemporal space can also be considered continuous. Spatio-temporal relationship match Given a set of feature points in spatio-temporal space, find the spatial relationship (such as proximity) and time relationship (such as after) between a pair of points. There are different ways to determine feature points, such as using a FAST interest
point detector. With the help of space-time relationship matching, many descriptions of actions can be obtained, especially activities involving multiple people. Matching complex descriptions can help to perform behavior classification.

Space-time descriptor An image behavior descriptor that combines spatial and temporal elements. The descriptor may be based on a space-time cuboid located at a space-time interest point, or based on a cumulative motion history image. These descriptors are used in particular for behavior analysis and behavior classification.

Space-time cuboid A piece of image data that changes with time, connected to form a 3-D (or higher-dimensional) volume, that is, a 3-D data volume. A cuboid here generally refers to a small subset of the entire data set. For example, given a space-time interest point in a video sequence, an N × N × (2T + 1) cuboid can be obtained by stitching the N × N neighborhood data in the current frame with the corresponding neighborhoods in the previous T frames and the subsequent T frames (see Fig. 45.3). Analysis of the data in it yields unique descriptors for similar space-time patterns.

Action cuboid 3-D space (1-D time plus 2-D space) for detecting or tracking object motion. The typical situation is a video acquired over a period of time. Figure 45.4 shows the trajectory of a ping-pong ball hit by a ping-pong player in an action cuboid. The action cuboid can be seen as the generalization to 3-D of the window (or region of interest) used for object detection in 2-D.

Cuboid descriptor A localized 3-D spatial-temporal feature descriptor that can describe a region in a video sequence, similar to the description of a 2-D image using a region descriptor. It can also be used to describe objects in an explicit 3-D voxel space.

Space-time interest point A class of interest points whose unique properties are derived from the spatial and temporal characteristics of the pixels in their neighborhood. These points can be used for action recognition or feature point tracking.
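To make the space-time cuboid concrete, here is a minimal NumPy sketch (the array names, sizes, and the interest-point coordinates are illustrative assumptions) that cuts an N × N × (2T + 1) cuboid out of a video volume around a space-time interest point:

```python
import numpy as np

def extract_cuboid(video, t, y, x, N=9, T=5):
    """Cut a (2T+1) x N x N cuboid around the space-time interest point (t, y, x).

    video: array of shape (frames, height, width); points too close to the
    border are rejected for simplicity.
    """
    r = N // 2
    if not (T <= t < video.shape[0] - T and
            r <= y < video.shape[1] - r and
            r <= x < video.shape[2] - r):
        raise ValueError("interest point too close to the volume border")
    return video[t - T:t + T + 1, y - r:y + r + 1, x - r:x + r + 1]

video = np.random.rand(100, 120, 160)            # synthetic video volume
cuboid = extract_cuboid(video, t=50, y=60, x=80, N=9, T=5)
print(cuboid.shape)                               # (11, 9, 9)
```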
Fig. 45.4 Action cuboid schematic
FAST interest point detector A spatio-temporal point of interest detector. A corner feature detector using a machine learning algorithm, which greatly improves the speed by learning the discriminative characteristics. It is often used for non-maximum suppression to improve the adaptability to noise. Point of interest Special points of interest in the image. It can be a feature point, a salient point, a landmark point, etc. Point of interest/activity path [POI/AP] model In scene modeling, a model constructed by combining points of interest and activity paths. In learning the path model of interest points, the main work includes (1) activity learning, (2) adaption, and (3) feature selection. Property learning In the spatial-temporal behavior understanding, an algorithm used to learn and characterize the attributes of spatio-temporal patterns.
45.1.3 Action Behavior Understanding

Action behavior understanding A subcategory of spatial-temporal behavior understanding, which emphasizes establishing the relationship between subject and action. At present, from coarse-grained to fine-grained, it can be divided into three levels: single-label actor-action recognition, multi-label actor-action recognition, and actor-action semantic segmentation.

Single-label actor-action recognition A simple case of actor-action understanding (action behavior understanding) problem. It corresponds to the general action recognition problem and considers the case where a single actor (subject) initiates a single action. There are three
models available at this time: naïve Bayesian model, joint product space model, and tri-layer model.

Multi-label actor-action recognition An extension of the actor-action understanding (action behavior understanding) problem. It corresponds to the case where there are multiple subjects and/or multiple actions initiated, that is, the case of multiple labels. In this case, there are still three models available, similar to single-label actor-action recognition: naive Bayesian model, joint product space model, and tri-layer model. In other words, three types of classifiers can still be considered: a multi-label actor-action classifier using the naive Bayesian model, a multi-label actor-action classifier in the joint product space, and an actor-action classifier based on a tri-layer model that combines the first two.

Joint product space model In the spatial-temporal behavior understanding, a model that uses the actor space X and the action space Y to generate a new labeling space Z. Here, the product relationship is used: Z = X × Y. In the joint product space, a classifier can be learned directly for each actor-action tuple. Obviously, this method emphasizes the existence of actor-action tuples, which can eliminate the occurrence of unreasonable combinations, and it is possible to use more cross-actor-action features to learn a more discriminative classifier. However, this method may not make good use of the commonality across different subjects or different actions, such as adults and children walking and waving.

Actor-action semantic segmentation An extension of the actor-action understanding (action behavior understanding) problem, corresponding to the case of working at the finest granularity. It also includes other coarser-grained issues, such as detection and location. Here, the task is to find labels for the actor-action on each voxel throughout the video. In this case, the following four models can be used: naive Bayesian model, joint product space model, bi-layer model, and tri-layer model.

Bi-layer model A model used in actor-action semantic segmentation. Compared with the joint product space model, it adds the actor-action interaction between separate actor nodes and action nodes. But compared with the tri-layer model, the spatio-temporal changes of these interactions are not considered.

Tri-layer model A model that unifies the naive Bayesian model and the joint product space model. It learns classifiers in the actor space X, the action space Y, and the joint actor-action space Z at the same time. In inference, it infers the Bayesian term and the joint product space term separately and then combines them linearly to get the final result. It not only performs cross-modeling of actor-action but also models different actions initiated by the same actor and the same action initiated by different actors.
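A minimal sketch of the joint product space idea follows (the actor and action label names and the list of invalid pairs are invented for illustration): the actor label set X and the action label set Y are combined into a product label space Z = X × Y, after which unreasonable actor-action tuples can be filtered out before training a classifier.

```python
from itertools import product

actors = ["adult", "child", "dog", "car"]           # actor space X
actions = ["walking", "waving", "rolling"]           # action space Y

# joint product space Z = X x Y: one label per actor-action tuple
Z = [(a, b) for a, b in product(actors, actions)]

# unreasonable combinations can be removed before learning a classifier
invalid = {("car", "walking"), ("car", "waving"), ("dog", "waving")}
Z_valid = [pair for pair in Z if pair not in invalid]

print(len(Z), "raw tuples ->", len(Z_valid), "valid actor-action labels")
```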
45.2
Action and Pose
An action is a collection (ordered combination) of practical significance that consists of a series of action primitives of the subject/initiator. In general, performing actions often leads to changes in human posture (Peters 2017; Szeliski 2010).
45.2.1 Action Models

Action modeling Means and process of building a model to achieve automatic action extraction and recognition. The main methods can be divided into three categories: nonparametric modeling, volumetric modeling, and parametric time-series modeling.

Nonparametric modeling A class of main methods for action modeling. A set of features is extracted from each frame of the video, and then these features are matched with the stored template. Typical methods include using 2-D templates, 3-D object models, and manifold learning methods.

Temporal template A typical method in nonparametric modeling for action. The background is removed first, and then the foreground blocks extracted from a sequence are combined into a single still image. There are two ways to combine them: one is called the motion energy image; the other is called the motion history image.

Volumetric modeling A major method for action modeling. It does not extract features frame by frame, but treats the video as a 3-D volume with voxels (intensity) and extends standard image features (such as scale space extremes, spatial filter response) to 3-D. Typical methods include spatio-temporal filtering methods, component-based methods, sub-volume matching methods, and tensor-based methods.

Constellation Multiple identical or similar objects given in a specific structural arrangement or distribution. For example, the positions of multiple actors in a dance performance are combined. In the component-based volumetric modeling method, the video is regarded as a series of collections, and each collection includes components in a small time sliding window. Different actions can include similar spatio-temporal components but can have different geometric relationships. Combining global geometry into component-based video representations makes up a constellation of components.
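As an illustration of the temporal template idea, here is a minimal motion history image (MHI) sketch in NumPy; the frame-differencing threshold and decay duration are arbitrary assumptions, and a real system would use proper background subtraction rather than simple frame differencing:

```python
import numpy as np

def motion_history_image(frames, thresh=0.05, tau=30):
    """Build an MHI from a (T, H, W) stack of grayscale frames in [0, 1].

    Pixels that move in the current frame are set to tau; elsewhere the
    previous value decays by 1, so recent motion appears brighter.
    """
    mhi = np.zeros(frames.shape[1:], dtype=np.float32)
    for t in range(1, len(frames)):
        moving = np.abs(frames[t] - frames[t - 1]) > thresh   # crude foreground mask
        mhi = np.where(moving, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi / tau            # normalize to [0, 1] for display or matching

frames = np.random.rand(40, 64, 64).astype(np.float32)       # synthetic sequence
mhi = motion_history_image(frames)
mei = (mhi > 0).astype(np.uint8)    # the motion energy image is its binary version
print(mhi.shape, int(mei.sum()))
```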
Constellation model A model built from constellation forms. Contains a number of components, and their relative positions are coded with their average position and a full covariance matrix. It can represent not only positional uncertainty but also potential correlations (covariances) between different components. Parametric time-series modeling A dynamic modeling (method) for associated action and time. Through modeling, specific parameters for a set of actions are estimated from the training set to characterize the actions. This method is more suitable for complex actions that span the time domain. Typical methods include hidden Markov models, linear dynamical systems, and nonlinear dynamic systems. Binary space-time volume A method for representing motion when modeling 3-D moving objects. By superimposing blocks/blobs with background removed, the motion and shape information of the object can be obtained. Action model The model of the object of interest (mainly people) defined or learned in advance reflects the characteristics of its actions and activities. It can be used to match or contrast with the motion in the scene or image for motion detection or action recognition. Similar to using an object model in model-based object recognition. Action representation Actions are modeled with the help of spatial-temporal feature vectors to represent actions in a given video sequence. Action detection Techniques and processes for automatically discovering and locating the actions and activities of an object of interest from an image. It is often a step in video analysis, which mainly uses the time characteristics of the action. Compare action localization. Action localization Techniques and processes for automatically determining where the action and activity of an object of interest occur from an image (or further from a scene). See action detection. Cyclic action Periodic or cyclical actions and activities (such as walking or running). Temporal segmentation can be performed by analyzing the self-similar matrix. It is further possible to add markers to the athletes and track the markers, and use the affine distance function to build a self-similar matrix. If a frequency transformation on the self-similar matrix is performed, the peaks in the spectrum correspond to the frequency of the movement (if you want to distinguish between walking and running people, the gait cycle can be calculated). An analysis of the matrix structure determines the type of action.
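The cyclic action entry above describes building a self-similarity matrix and applying a frequency transform to recover the movement (e.g., gait) frequency. A minimal sketch of that analysis follows, using a synthetic oscillating marker trajectory (the feature choice, noise level, and period are assumptions for illustration):

```python
import numpy as np

def self_similarity_matrix(features):
    """features: (T, D) array, one feature vector (e.g., tracked marker
    coordinates) per frame; returns the T x T matrix of pairwise distances."""
    diff = features[:, None, :] - features[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# synthetic periodic motion: a marker oscillating with a period of 25 frames
t = np.arange(200)
features = np.stack([np.sin(2 * np.pi * t / 25), np.cos(2 * np.pi * t / 25)], axis=1)
features += 0.05 * np.random.randn(*features.shape)

S = self_similarity_matrix(features)

# frequency analysis of one row of the matrix: the dominant spectral peak
# corresponds to the frequency of the cyclic movement
spectrum = np.abs(np.fft.rfft(S[0] - S[0].mean()))
freqs = np.fft.rfftfreq(len(S[0]), d=1.0)          # cycles per frame
print("estimated period:", 1.0 / freqs[spectrum.argmax()], "frames")
```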
Action unit [AU] The minimum unit obtained by decomposing the action sequence, also refers to the minimum unit obtained by removing the motion representation from the original measurement of the pixel motion itself (such as optical flow). For example, in facial expression recognition research, it refers to the basic unit used to describe the change in facial expression. From an anatomical perspective, each action unit corresponds to one or more facial muscles. Facial muscle movements reflect changes in facial expressions. This research method is more objective and commonly used in psychological research. For example, some people attributed various facial expression changes to the combination of 44 action units and established the corresponding relationship between facial expressions and action units. However, there are more than 7000 combinations of 44 action units. The number is huge. In addition, from the perspective of detection, there are many action units that are difficult to obtain based only on 2-D facial images.
45.2.2 Action Recognition Action recognition Based on the motion detection, the motion types (such as running, walking, etc.) are further divided. That is, to distinguish the meaning of the action and distinguish between different actions. Typical movements to be identified include walking, running, jumping, dancing, picking up objects, sitting down, standing up, waving hands, etc. It is also the basic step of activity recognition and behavior classification, because activities or behaviors often consist of multiple actions. Atomic action In posture analysis and action recognition, a short sequence of basic limb movements that constitutes a motion pattern related to high-level movements. For example, raise your left leg or your right leg while walking. Rate-invariant action recognition Recognize actions regardless of their speed. See action recognition. View-invariant action recognition The recognition operation is performed independently of the viewpoint. To this end, it is necessary to construct a model that does not depend on the viewpoint. This is different from many techniques for action recognition based on models generated from data collected from a given viewpoint, which are limited to videos seen from the same viewpoint. Affordance and action recognition Recognition of actions and action trends. The action trend represents the possibility of an action being performed by an entity. Recognizing action trends is to predict possible actions and lay the foundation for recognizing actions themselves. See action recognition.
45.2.3 Pose Estimation Pose The comprehensive state of the location and orientation of the scene in space. Although independent of the size of the scene, it depends on the relationship of the scene to the coordinate system. It is generally an ill-posed problem to restore the pose of the scene from the scene’s image, and additional constraints are required to solve it. The pose of the camera is determined by the external parameters of the camera, and the pose of the person reflects the action state and behavior of the person. The position and orientation of an object in a given frame of reference. In image technology, the given frame of reference is often the world coordinate system or the camera coordinate system. Pose estimation is a typical image technique. Pose variance In face recognition, the change in face orientation relative to the camera’s principal axis. The main consideration is the left and right rotation and the pitching of human face. Pose consistency An algorithm for determining whether two shapes are equivalent. Also called viewpoint consistency. For example, given two sets of points G1 and G2, the algorithm finds a sufficient number of corresponding points to determine the mapping transformation T between the two sets and then uses T for all other points in G1. If the transformed points are close enough to the points in G2, the consistency is satisfied. 2-D pose estimation A special case of 3-D pose estimation. Given two sets of points {xj} and {yk}, determine the best Euclidean transformation {R, t} (determining pose) and matching matrix {Mjk} (determining correspondence) that can connect the two. 3-D pose estimation The process of determining the transformation (translation and rotation) of an object in one coordinate system relative to another coordinate system. Generally, only rigid body objects are considered, and the model is known in advance, so the position and orientation of the object in the image can be determined based on feature matching. The problem can be described as given two sets of points {xj} and {yk}, to determine the Euclidean transformation (determining pose) and the matching matrix (determining correspondence) that can best connect the two. Assuming that the two sets of points are corresponding, the two can be accurately matched under this transformation. View correlation A pose estimation technique often used when the 3-D structure of the scene is known.
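For the correspondence-based formulation in the 2-D and 3-D pose estimation entries above, a standard least-squares solution (when the correspondences between the two point sets are already known) is the SVD-based rigid alignment sketched below; this is a generic illustration under that assumption, not the specific algorithm of any particular system:

```python
import numpy as np

def rigid_pose(X, Y):
    """Least-squares Euclidean transform {R, t} such that Y ~ R @ X + t.

    X, Y: (N, 3) arrays of corresponding 3-D points.
    """
    cx, cy = X.mean(axis=0), Y.mean(axis=0)
    H = (X - cx).T @ (Y - cy)                    # 3 x 3 covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflection
    R = Vt.T @ D @ U.T
    t = cy - R @ cx
    return R, t

# synthetic test: rotate and translate a random point set, then recover the pose
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -1.0, 2.0])
Y = X @ R_true.T + t_true
R, t = rigid_pose(X, Y)
print(np.allclose(R, R_true), np.allclose(t, t_true))
```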
Focused filters A set of filters used to estimate the pose of the object. Their strongest response indicates the pose of the object.

Through-the-lens camera control A pose estimation technique often used when the 3-D structure of the scene is known.

Self-localization A problem of estimating the position of a sensor in an environment constructed from image or video data. This problem can be regarded as a geometric model matching problem, provided that there is a sufficiently complex physical model, that is, it contains enough points to have a comprehensive solution to the 3-D pose estimation problem. In some cases, it is possible to directly identify a sufficient number of identification points. If there is no knowledge of the scene at all, tracking technology or optical flow technology can still be used to gradually obtain the corresponding points, or stereo correspondences can be obtained in multiple simultaneous frame images.

Affective gesture Gestures produced by the body (of a person or animal) that indicate emotional feelings or responses. Used for posture analysis to indicate social interaction.

Affective body gesture See affective gesture.

Viewpoint consistency constraint A term used by Lowe to describe 3-D model projection. It requires that a 3-D model that matches a set of 2-D line segments must be able to be projected onto these 2-D line segments from at least one 3-D camera position. Essentially, these 2-D and 3-D data must allow pose estimation.

Partially constrained pose A special pose constraint situation. The object is subject to a set of restrictions that restrict its allowable orientation or position but do not completely fix either of them. For example, a car on a highway cannot drive off the ground and can only rotate about an axis perpendicular to the road surface.
45.2.4 Posture Analysis Posture analysis A set of techniques used to estimate the pose (such as sitting, standing, pointing, squatting) of an articulated object (such as a human body).
Pose representation Express the pose (angular position) of a 3-D target in a given frame of reference. A common representation is the rotation matrix, which can be parameterized with different methods, such as Euler angles (yaw, roll and pitch angle), rotation angles around the coordinate axis, axis-angle curves, and quaternions. See orientation representation and rotation representation. Orientation representation See pose representation. Orientation The nature of a point, line, or region facing or pointing to space can also refer to the posture or state of the body in space. For example, the orientation of a vector (i.e., the direction of this vector) is indicated by its unit vector; the orientation of an ellipsoid is indicated by its principal direction; the orientation of a wireframe model is by its own reference boxes that are specified relative to a world reference box. Orientation error The amount of error associated with an orientation value. The difference between the expected (ideal) orientation value and the actual orientation value. Gesture component space The space representing each component of the pose. Once the video sequence is decomposed into atomic actions, the pose can be represented by a hybrid model. To automatically determine how many typical groups are existing, they can be represented in the pose component space where the axis of the space is the principal component of the motion. Visual pose determination The process of determining the position and orientation of a known object in space using a suitable object model and one or more cameras, depth sensors, or triangulation-based vision sensors. Pose estimation Determine the orientation and position of a 3-D object based on one or more images. The term is also commonly used to refer to transforms that align a geometric model with image data. There are many techniques to achieve this. See alignment, model registration, orientation representation, and rotation representation. See also 2-D pose estimation and 3-D pose estimation. Pose determination See pose estimation. Iterative pose estimation A robust, flexible, and accurate attitude estimation strategy. One method based on this strategy is to first guess an initial pose and then iteratively minimize the reconstruction error of the unknown pose parameters by nonlinear least squares. Another method based on this strategy is to first use an orthogonal projection model
to obtain the initial estimate and then use a more accurate perspective projection model to iteratively refine the initial estimate. Gesture segmentation The video sequence is divided into a stream of atomic actions, which can be mapped into a template dictionary specified by the user in advance. Gesture recognition can be performed based on the segmentation results to determine the information they convey. Gesture tracking Calculate the attitude trajectory of a person or animal in the pose component space. Pose clustering A set of algorithms that use clustering techniques to solve pose estimation problems. See clustering, pose, and K-means clustering.
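The pose representation entry earlier in this section lists several parameterizations of a rotation (Euler angles, rotation matrix, axis-angle, quaternion). A minimal conversion sketch using SciPy follows; the angle values and the intrinsic z-y-x convention are assumptions made only for illustration:

```python
import numpy as np
from scipy.spatial.transform import Rotation

# yaw-pitch-roll Euler angles, in degrees (intrinsic z-y-x convention assumed)
yaw, pitch, roll = 30.0, 10.0, -5.0
rot = Rotation.from_euler("ZYX", [yaw, pitch, roll], degrees=True)

R = rot.as_matrix()        # 3 x 3 rotation matrix
q = rot.as_quat()          # quaternion (x, y, z, w)
rv = rot.as_rotvec()       # axis-angle: unit axis scaled by the angle (radians)

# all representations describe the same orientation
print(np.allclose(Rotation.from_quat(q).as_matrix(), R))
print(np.linalg.norm(rv))  # rotation angle in radians
```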
45.3
Activity and Analysis
An activity is a combination of a series of actions performed by a subject/sponsor in order to complete a certain task or achieve a certain goal (emphasis is mainly placed on logical combinations) (Jähne and Hauβecker 2000; Forsyth and Ponce 2012; Zhang 2017c).
45.3.1 Activity

Activity segmentation The work of dividing a video sequence into a series of subsequences based on changes in the activities performed along the video sequence.

Activity recognition Identification of human or animal action and behavior. In addition to obtaining information such as the person's movement and posture, it is also necessary to determine what kind of behavior it is in relation to the environment (by combining motivation, goals, etc.). See activity classification.

Activity classification An activity consisting of an action time series is classified into a predefined category (given one of a set of discrete tags).

Activity model The representation of a given activity for activity classification, similar to the object model in model-based object recognition.
Activity representation See activity model.

Declarative model In activity modeling and identification, a model based on logical methods that is used to describe all desired activities with scene structure and events. The activity model includes interactions between objects in the scene.

Regional activity Activities that occur in a local or current region (as opposed to global activities).

Group activity Activities involving multiple people in a shared public space.

Contextual entity In modeling an activity, other actors or environmental things related to the actors in the activity.

Crowd flow analysis Use the motion properties (such as optical flow field) obtained from crowd video sequences to complete behavior analysis. See anomalous behavior detection.

POI/AP model Short for point of interest/activity path model.

Adaption A work in learning POI/AP models. It should be able to adapt to the requirements of adding new activities and removing activities that no longer continue, and to validate the model. See adaptive.

Feature selection 1. Choosing the suitable features or properties for a specific task is a key step in object classification and object recognition. The requirements for features often include being detectable, distinctive or discriminative, stable, and reliable. 2. In general, the selection of many features according to needs. That is, the most important set of related variables or parameters is selected from a large number of features describing an image or an object to construct a robust learning model. 3. A work in POI/AP model learning. Determine the right dynamics representation layer for a particular task. For example, only spatial information may be needed to determine which way a car is going, while speed information is often required to detect an accident.

Activity learning In the spatial-temporal behavior understanding, a work in learning the POI/AP model. It does this by comparing trajectories (of the scenery's activity), where the trajectory length may be different, and the key is to maintain an intuitive understanding of similarity.
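The activity learning entry emphasizes comparing trajectories whose lengths may differ. One common way to do this (a generic sketch under that requirement, not necessarily the method used in any particular POI/AP system) is dynamic time warping over 2-D trajectory points:

```python
import numpy as np

def dtw_distance(traj_a, traj_b):
    """Dynamic time warping distance between two trajectories of possibly
    different lengths; traj_a is (Na, 2) and traj_b is (Nb, 2)."""
    na, nb = len(traj_a), len(traj_b)
    D = np.full((na + 1, nb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            cost = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[na, nb]

# two similar paths sampled at different rates, plus a dissimilar one
t1, t2 = np.linspace(0, 1, 50), np.linspace(0, 1, 80)
path_a = np.stack([t1, t1 ** 2], axis=1)
path_b = np.stack([t2, t2 ** 2], axis=1)
path_c = np.stack([t2, 1.0 - t2], axis=1)
print(dtw_distance(path_a, path_b), dtw_distance(path_a, path_c))
```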
Fig. 45.5 Activity graph schematic (nodes: Walk, Stand, Run, Jump)
Table 45.1 Activity transition matrix

Activity | Stand | Walk | Run | Jump
Stand | 0.90 | 0.60 | 0.15 | 0.25
Walk | 0.55 | 0.75 | 0.50 | 0.05
Run | 0.05 | 0.20 | 0.30 | 0.30
Jump | 0.40 | 0.25 | 0.05 | 0.10
Activity path [AP] In scene modeling, the image region in which the subject/object moves/travels while the event occurs. It can be obtained by means of activity learning on the basis of tracking moving objects.
45.3.2 Activity Analysis

Activity analysis The work and method of analyzing the behavior of a person or thing in a video sequence to identify actions that occur on the fly or a long sequence of actions. For example, identifying potential intruders in a restricted region.

Activity graph A graph recording activity transitions, where each node corresponds to an activity or a phase of an activity, and each arc represents a connection between two activities or between two activity phases. Figure 45.5 shows a simple activity graph.

Activity transition matrix When there are N activities, it is an N × N matrix, where each item corresponds to the transition probability between two states. Here the state itself is the activity that appears in the scene. See state transition probability. Table 45.1 shows the activity transition matrix corresponding to Fig. 45.5.

Multi-view activity representation A representation of the activity obtained by multiple camera systems across multiple fields of view.
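A minimal sketch of how an activity transition matrix of the kind shown in Table 45.1 could be estimated from an observed sequence of per-frame activity labels follows; the label sequence here is synthetic and the resulting numbers are illustrative only:

```python
import numpy as np

activities = ["Stand", "Walk", "Run", "Jump"]
index = {a: i for i, a in enumerate(activities)}

# synthetic observed activity sequence (one label per time step)
sequence = ["Stand", "Stand", "Walk", "Walk", "Run", "Walk", "Stand",
            "Jump", "Stand", "Walk", "Run", "Run", "Walk", "Stand"]

counts = np.zeros((len(activities), len(activities)))
for cur, nxt in zip(sequence[:-1], sequence[1:]):
    counts[index[cur], index[nxt]] += 1

# normalize each row so that entry (i, j) = P(next = j | current = i)
transition = counts / counts.sum(axis=1, keepdims=True)
print(np.round(transition, 2))
```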
Automatic activity analysis Automatic analysis of object actions and activities in the scene according to the built scene model. Virtual fencing A typical automatic activity analysis task. Limit the scope of activities without physical entities. Specifically, you can first determine the fence boundary (range) in the monitoring system and provide early warning of internal events. Once there is an intrusion, the analysis module is triggered, and then the high-resolution PTZ camera is controlled to obtain details of the intrusion, such as starting statistics on the number of intrusions. Speed profiling A typical automatic activity analysis job. A typical situation is to use tracking technology to obtain the dynamic information of the object and to analyze the change of its speed of movement. Path classification A typical automatic activity analysis task. In the automatic activity analysis, the information obtained from the historical movement pattern is used to determine which activity path can best explain the new data, so as to classify the activity path. Abnormality detection A typical automatic activity analysis task. By identifying specific behaviors, detect and extract atypical behaviors and events. Anomaly detection Detection of undesirable targets, events, behaviors, etc. in a given environment. The typical method is to compare with models of normally occurring targets, events, behaviors, etc. It is generally regarded as an unsupervised learning job, often used in visual industrial inspection and automated visual monitoring. Online anomaly detection An automatic method for locating interesting or unusual events, typically using continuous real-time data such as video. See anomaly detection. Object interaction characterization A typical automatic activity analysis task. Belongs to high-level analysis. In different environments, different types of interactions can exist between different objects. Strictly defining object interactions is currently difficult. Online activity analysis A typical automated activity analysis task. The system quickly infers ongoing behaviors (including online analysis, identification, evaluation activities, etc.) based on incomplete data.
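The virtual fencing entry describes triggering an alert when a tracked object enters a fence region that has no physical counterpart. A minimal sketch of the geometric core of such a check follows (a ray-casting point-in-polygon test; the fence polygon and the tracked positions are made-up values):

```python
def inside_polygon(point, polygon):
    """Ray-casting test: is (x, y) inside the closed polygon [(x0, y0), ...]?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                       # edge crosses the scan line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

fence = [(100, 100), (400, 100), (400, 300), (100, 300)]   # virtual fence boundary
track = [(50, 80), (90, 120), (150, 150), (420, 150)]       # tracked object positions

for frame, pos in enumerate(track):
    if inside_polygon(pos, fence):
        print(f"frame {frame}: intrusion at {pos} -> trigger analysis / PTZ camera")
```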
45.4
Events
Events often refer to some (irregular) activity that occurs at a specific time period and a specific spatial location. Usually the actions are performed by multiple subjects/initiators (group activities). The detection of specific events is often associated with abnormal activity (Prince 2012).
45.4.1 Event Detection

Visual event Circumstances in which certain aspects of visual behavior change, such as when a person enters a place that should not be entered or a vehicle deviates from its normal route.

Primitive event The basic unit used to compose complex events. For example, in the event of a table tennis match, athletes entering the field, practicing, serving, scoring, etc. can all be regarded as primitive events.

Event detection Detect (specific) activity in a scene by analyzing a series of images.

Unusual behavior detection Finding abnormal behavior through the analysis of video data. Abnormal behaviors include driving against the traffic direction, brawls in public places, and pedestrians appearing in inappropriate places. Models of normal behavior can be constructed using methods such as supervised classification, semi-supervised learning, or unsupervised behavior modeling. See anomaly detection.

Anomalous behavior detection A special case of monitoring and analyzing human movement and behavior. In particular, the detection of behavior that is inconsistent with normal behavior, such as detecting intruders or behaviors that could lead to crime.

Statistical insufficiency problem The problem that a sound statistical model cannot be estimated for a certain phenomenon due to insufficient training samples. This happens mostly in abnormal behavior detection, where there are many examples of normal behavior but few or even no examples of abnormal behavior. See sparse data.

Bottom-up event detection A method for detecting events in a bottom-up manner. In this context, a motion feature, an action unit, or an atomic action may be used as the lowest level among the multiple levels. See behavior hierarchy.
Local motion event Spatio-temporal events in a video sequence that describe the movement of an object over a limited period of time in part of the visible scene. See spatio-temporal analysis.

One-class learning The goal is to take a set of data points from the probability distribution P as input and obtain a "simple" subset S of the input space as the output of learning. Here, the probability that a test point from P is outside S is equal to some specified probability. One way to achieve this learning is to use a one-class support vector machine. Another method is to perform a probability density estimation and use a probability profile to define S. One-class learning can be used for outlier rejection and abnormality detection.

Abnormality detection A typical automatic activity analysis task. By identifying specific behaviors, detect and extract atypical behaviors and events.

Contextually incoherent An abnormal phenomenon that cannot be explained by contextual information or prior models (such as behavioral models and prior distributions). See anomaly detection.

Outlier detection Identification of samples or points outside the main distribution.

Outlier rejection Identify outlier points and eliminate them from the current process.
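A minimal sketch of one-class learning for abnormality detection follows, using scikit-learn's one-class SVM on synthetic 2-D feature vectors; the kernel, gamma, nu, and test points are arbitrary assumptions chosen only to illustrate the idea:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# training set: feature vectors of normal behavior only
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# nu bounds the fraction of training points treated as outliers
model = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(normal)

# test points: some normal, some far from the training distribution
test = np.vstack([rng.normal(size=(5, 2)), [[6.0, 6.0], [-7.0, 5.0]]])
labels = model.predict(test)        # +1 = normal, -1 = abnormal
print(labels)
```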
45.4.2 Event Understanding Event understanding Recognition of an event in a series of images (such as many people approaching or concentrating toward the same location). This needs to be done based on the data provided by event detection. Contextual event Significant changes in the scene that are independent of the object of interest (such as a crowd) and at specific locations in the visual environment. For example, there may be forgotten objects in the scene, the lights on the road may be off, and pedestrians on the roadside may stop or cross the street. Dynamic event Spread along time, not instant events.
Dynamic topic model In a large collection of documents (or articles), a set of probabilistic time-series models can help analyze the temporal evolution of different topics. It is also used in adaptive behavior models in behavior analysis. Temporal topology A way to represent signals, events, and temporal connections to activities that are independent at specific moments. For example, it can be said that event A occurred before event B without specifying how long before. A similar situation is the connection between the geometric model and the graph model, where the geometric model indicates where each feature is, and the graph model can indicate the adjacent relationship between the features. Event video See ranking sport match video. Event knowledge In classification of knowledge, the knowledge reflects different actions of subjects in the objective world. Video event challenge workshop It has been held internationally since 2003 to integrate a variety of capabilities to build a series of conferences based on domain ontology of general knowledge. The conference has initially defined six areas of video surveillance: (1) security around and inside, (2) monitoring of railway crossing, (3) visual banking monitoring, (4) visual subway monitoring, (5) warehouse security, and (6) airport apron security. Video event markup language [VEML] A formal language directed by video event challenge workshops. Used to label the VERL events in the video. Video event representation language [VERL] A formal language directed by video event challenge workshops. Can help complete the ontology representation of complex events based on simple sub-events.
45.5
Behavior and Understanding
Behavior emphasizes that the subject/sponsor is subject to thought to change actions, continue activities, and describe events in a specific environment/context. Various methods have been proposed for the understanding and interpretation of behavior (Forsyth and Ponce 2012; Zhang 2017c).
45.5.1 Behavior Behavior profile The semantic labeling of a behavior model corresponds to the prior of the behavior pattern. Generally used for anomalous behavior detection. Often used as a synonym for behavior patterns. Behavior profiling The use of behavior profiles in a behavior analysis task. Visual behavior 1. Observed action or observable part of an action, such as opening a door. 2. Behavior that manifests during perception, such as adjusting the camera’s pointing direction. 3. Biological processes and activities that allow visual perception, such as watching or tracking a moving object. Behavior representation See behavior model. Behavior hierarchy A hierarchical behavioral representation consisting of three layers: atomic actions (i.e., sequences of basic limb movements, such as waving hands and raising legs), actions (consisting of a series of atomic actions, such as walking), and activities or events (a series of actions contained in specific time and space ranges, such as walking to the door and opening the door). Behavior posterior At time t, the posterior probability P(Bt|et) of a given behavior Bt when an event behavior pattern et appears. See behavior hierarchy. Behavior pattern A series of sequential actions (or events) that constitute a given human or animal behavioral activity. Generally represented as a time-series feature vector. In this context, the work of behavior detection, behavior classification, and behavior analysis translates into a special case of general pattern recognition. See behavior hierarchy and behavior model. Behavior prediction The prediction of a given behavior pattern is generally at a given position in a multi-camera system, based on the latest knowledge of the behavior pattern up to the current point in time. In addition, multi-camera systems also require explicit or implicit knowledge about multi-camera topologies. See multi-camera behavior correlation. Behavior footprint Temporal statistical distribution of behavior patterns in a given pixel or region of interest. Records the behavior (category over time) at a point in the image. Usually
represented as a histogram. An example is shown in Fig. 45.6 (behavior footprint representation): the left image is the scene image, and the right graph shows the statistical histogram of the behavior change in the framed part of the scene over time.

Behavior layer semantic A kind of semantics in the semantic layer, at a higher level. It mainly involves the semantics of human actions and behaviors and has more subjective characteristics and certain abstract characteristics.
45.5.2 Behavior Analysis Behavior analysis A vision procedure or technical model for tracking and identifying human behavior. Often used in threat analysis. Behavior detection Detect specific behavior patterns in a video sequence. See behavior analysis and anomalous behavior detection. Behavior localization Determine the (temporal and spatial) location of a given behavior or pattern in a single image. It can be divided into two cases: (1) the position in 3-D space is determined by the context of the scene; (2) the position in time is determined by the video sequence. Salient behavior The behavior of a person or system that is different from normal. Behavioral saliency Consideration of the concept of saliency as opposed to behavior analysis. Can help identify patterns of behavior that are different from normal. See anomalous behavior detection.
Cumulative abnormality score A measure of abnormality used in behavioral analysis. Given a prior behavioral model, the effect of noise reduction is accumulated by accumulating a history of the likelihood of anomalous behavior in each region along the time axis. In the presence of noise, a threshold is used to control the detection of abnormal behavior alarms (increasing the time decay for occurrences below the threshold). Behavior classification For the classification of a given sequence of activities, a new label corresponding to the prior behavior is assigned to each discrete set. Behavior class distribution 1. In behavior classification, a global prior distribution of multiple classes. 2. In behavior classification, intra-class distribution within a given category. See within class scatter matrix and distribution overlap. Behavior affinity matrix An affinity matrix that captures the behavior similarities can be used for behavior classification and similar tasks. Situation graph tree A behavior representation structure that describes activities, including alternating sequences of actions, that can be used for video-based behavior classification. Construct a graph where the nodes contain logical predicates that identify the state and the actions expected in this state. An arc represents a probability transition between these modes. States can be expanded hierarchically to generate a subgraph tree. Behavior correlation The technique derived by using cross-correlation for matching is used for spatiotemporal analysis of video sequences for behavior detection and behavior analysis. Multi-object behavior (Action or motion) related behavior of multiple objects in a scene. Multi-camera distributed behavior Behavior when the object is viewed by multiple cameras. These behaviors need to be relevant in both time and space domains. Multi-camera behavior correlation Temporal and spatial correlation of activity or behavior in a multi-camera field of view. Helps establish connections between multiple cameras.
45.5.3 Behavior Interpretation Behavior interpretation A general term for extracting high-level semantic descriptions of behavior patterns that occur in video sequences. See behavior analysis and image interpretation. Behavioral context 1. The context of image understanding and computer vision system action relative to its environment. 2. Contextual information used to help explain behavior. Trajectory-based representation A representation of an object and its behavior based on its trajectory or path. A path can be represented either on an image or in scene space. A typical application of this representation is to detect possible car-related criminal activities in parking lots by observing pedestrians’ atypical (abnormal) walking patterns. Group association context Contextual information used in behavior analysis work, the goal is to understand the group activities associated with several people in the same group, where some knowledge about the behavior of a single person is extracted through behavior classification. Behavior model A model that is defined or learned for a given behavior used for behavior detection, behavior classification, and general behavior analysis. Similar to the object model in model-based object recognition, in which the model is matched with a given but not seen behavioral paradigm to ensure semantic understanding. Behavior learning A method of generalizing goal-driven behavioral models through learning algorithms (such as reinforcement learning). Distributed behavior An agent-based optimization method in which there is a direct analogy between the points visited in the search space and the agents motivated by independent behavior. Unsupervised behavior modeling An unsupervised learning form for automatically learning behavior models based on examples. For example, models can be built based on clustering and trajectory-based representations. Statistical behavior model A statistical form of a behavioral model, such as the hidden Markov model (HMM). One advantage of this form of statistics is that you can maintain a probabilistic estimate of a hypothetical behavior.
Adaptive behavior model Behavioral models with adaptive characteristics that facilitate online updating of models for behavior learning. A typical process consists of three steps: model initialization, online abnormal behavior detection, and online model updating through unsupervised learning. See unsupervised behavior modeling.

Linear dynamical system [LDS] A more general parametric modeling method than the hidden Markov model. The state space is not limited to a set of finite symbols, but can take continuous values in the space R^k, where k is the dimension of the state space. The simplest linear dynamical system is a first-order time-invariant Gauss-Markov process, which can be expressed as

x(t) = A x(t - 1) + w(t),   w ~ N(0, P)
y(t) = C x(t) + v(t),   v ~ N(0, Q)

where x ∈ R^d is the d-D state vector and y ∈ R^n is the n-D observation vector.

Bayes classifier A classifier that assigns an unknown pattern x to the class si whose decision function satisfies di(x) > dj(x) (for all j ≠ i). The basis of the Bayes classifier is that the classification decision can be made based on the probability distribution of the training samples of each class, that is, the unknown object should be given the most likely class according to the observed characteristics. To implement the mathematical calculations performed by the Bayesian classifier, three probability distributions are required:
1. The prior probability of each class sk, recorded as P(sk).
2. The unconditional distribution of the feature vector of the measured pattern x, denoted as p(x).
3. The class conditional distribution, that is, the probability of x given the class sk, denoted as p(x|sk).

According to Bayes' theorem, these three distributions are used to calculate the posterior probability that the pattern x originates from the class sk, denoted as p(sk|x):

p(sk|x) = p(x|sk) P(sk) / p(x) = p(x|sk) P(sk) / [Σ (k=1 to W) p(x|sk) P(sk)]
46.2.3 Belief Networks Belief networks A set of graphical models in activity modeling and recognition. Bayesian network is a simple belief network. Bayesian network A simple belief network. It encodes a set of random variables as local conditional probability density (LCPD) and then encodes the complex conditional dependencies between them. Also called a directed graphical model, where the connections of the graph have a special directionality indicated by arrows. Directed graphs are more suitable for representing causal connections between random variables. Compare undirected graphical models. It also represents a belief modeling method using graph structure. Nodes are variables and arcs are implicit causal dependencies (corresponding to a given
1610 Fig. 46.13 Nodes in Markov blanket
46
Related Theories and Techniques
F
C
G
S
T
probability). These networks are useful in fusing multiple data in a unified and rigorous way. Bayesian graph A graphical model to explain the joint probability distribution. The nodes in the graph represent variables (states) and the arcs represent the influence of one variable on another. It is a promotion of the Bayesian network. Cutset sampling A variation on the Bayesian network sampling method, which only samples a subset of variables in the network and uses accurate reasoning for the rest. Markov blanket In machine learning, a Markov blanket of a node C in a Bayesian network includes the parent node F of the node, the child node S of the node, and other parents of the child node T of the node. Node G is shown in Fig. 46.13. Markov boundary Same as Markov blanket. Local conditional probability densities [LCPD] See Bayesian network. Dynamic belief network [DBN] A generalization of the simple Bayesian network by combining the time dependence between random variables. Also called dynamic Bayesian network. Compared with the traditional hidden Markov model (HMM), which can only encode one hidden variable, the dynamic belief network can encode complex conditional dependencies between several random variables, which is suitable for modeling and identifying complete activities. Dynamic Bayesian network [DBN] Directed graphical models or Bayesian networks that represent processes that evolve over time. A typical dynamic Bayesian network is specified by an initial state distribution and a transition model. The transition model indicates the evolution of the system from the current time to the future, that is, from a time t to a time t + 1 for
46.3
Graph Theory
1611
a discrete system. Hidden Markov model is a simple example of dynamic Bayesian network. Basic belief assignment [BBA] A method of assigning basic evidence to every proposition of the problem under consideration, exactly as defined by a probability function.
46.3
Graph Theory
Graph theory is a mathematical branch of the theory that studies graphs. It has been widely used in image engineering (Marchand-Maillet and Sharaiha 2000; Shapiro and Stockman 2001).
46.3.1 Tree Graph-theory A branch of mathematics that studies the theory of graphs. It has been widely used in image engineering. Tree A connected acyclic graph. That is, a graph that does not include a circle. Vertices with a degree of 1 are called leaves (nodes). A tree is also a connected forest, and every component of the forest is also a tree. Each side of the tree is a cut edge. Adding an edge to the tree forms a circle. Tree structure A structural pattern descriptor in image pattern recognition. Suitable for expressing relatively complex hierarchies between elements. Tree grammar A set of syntax rules for structural pattern recognition of tree structures. Defined by a five-tuple G ¼ ðN, T, P, r, SÞ Among them, N and T are the non-terminal symbol set and the terminal symbol set, respectively; S is a starting symbol contained in N, which is generally a tree; P is a set of production rules whose general form is Ti ! Tj, where Ti and Tj are trees; r is a sorting function and records the number of direct descendants of nodes whose labels are terminal symbols in the syntax.
1612
46
Related Theories and Techniques
Fig. 46.14 Polytree example
Stripe tree A binary tree data structure for hierarchically representing planar arc fragments. The root node of the tree structure represents the smallest-sized rectangle surrounding the arc. Each non-leaf node in the tree breaks the arc segment, and its bounding box (or bounding box) approaches two consecutive arc segments. The two child nodes contain the smallest rectangle that can enclose the two arc segments that belong to it. In this way, the leaf nodes of the tree will have an enclosure close enough to their containing arc segments. Polytree A directed tree structure in a graph. Among them, some nodes have more than one parent node (at the same time there are more than one node without parent node), but there is only one arc between them (regardless of the direction of the arc), as shown in Fig. 46.14. R-tree A dynamic data structure. The feature space is divided into multi-dimensional rectangles, and these rectangles can partially overlap or all overlap. R-tree is very effective when searching for points or regions in an image. Tree descriptor A relational descriptor. Suitable for describing basic element hierarchical relationship structure. The basic elements in the image description can be represented by nodes, and the physical connection relationship between the basic elements can be recorded by pointers. Tree description See tree descriptor. Geodetic graph Graph with at least one shortest path between any two vertices/endpoints on it. Geodetic line A line connecting any two vertices/endpoints on a geodetic map. Interval tree A tree used to represent all scale structures of an image. When constructing an interval tree, start with the root of the tree corresponding to the largest scale of the image, and gradually search in the image to the smaller scale. This constitutes an
46.3
Graph Theory
1613
Fig. 46.15 Dendrogram example
effective search structure where each node is the parent of a node with a specific interval value. Dendrogram A visual representation of the hierarchy (such as similarity or relatedness) between elements or objects. A special tree structure in which all leaf nodes have the same depth. Figure 46.15 gives an example. Maximal spanning tree A data structure that meets special conditions. If the weight of the connection between two clusters is defined as the number of nodes shared by the two clusters, and the weight of a tree structure is the sum of the weights of all connections of the tree structure, then the maximum spanning tree is connecting the tree with the highest weight among all possible tree structures of the cluster. If the tree is condensed, that is, each cluster that is a subset of the other clusters has been absorbed into a larger cluster, this gives a junction tree. For junction trees, a message passing algorithm that is essentially equivalent to the sum-product algorithm can be used to determine the marginal distribution and conditional distribution of the variables. Junction tree See maximal spanning tree. Junction tree algorithm A method for precise reasoning with the help of junction trees. Join tree A tree-structured undirected graph in which nodes correspond to maximal clusters and arcs correspond to connections between pairs of nodes with common variables. It is important to select the pairs of nodes to be connected. By choosing to be able to give a maximal spanning tree, that is, to choose the largest weight among all possible joint trees of the clique. Here, the largest means that the number of joint nodes is the largest. If the joint tree is compressed so that a sub-clique is absorbed into a larger clique, a junction tree is given. Message passing algorithm See maximal spanning tree.
1614
46
Related Theories and Techniques
Tree-reweighted message passing algorithm A message passing algorithm in which weighted weights are continuously updated as messages are passed.
46.3.2 Graph Graph A (visual) mathematical representation of a relationship. It also represents a relational data structure. A graph G is composed of a vertex set (also called a node set) V (G) and an edge set (also called an arc set) E(G). G ¼ ½V ðGÞ, E ðGÞ ¼ ½V, E Among them, each element in E(G) corresponds to an unordered pair of two vertices (nodes in the graph) in V(G), which is called an edge (line) of G. A graph G can also be regarded as a triple, including a set of vertices V(G), an edge set E(G), and a relationship that makes each edge related to two (not necessarily different) vertices, and the two vertices are called the endpoints of this edge. Degree In (acyclic) graphs, the number of associated edges d(x) of a vertex x. If it is a directed graph, it can also distinguish between “in” and “out,” where the associated edge refers to the edge with x as the endpoint. Descendant node In the graph, the nodes on the path in the direction of the arrow pass after the previous node. Cut edge In the graph, deleting the edges that increase the number of connected elements. It can be proven that an edge is a cut edge if and only if it does not belong to any (closed) loop. For graph G, G–e and G–E can be used to represent the subgraphs obtained by deleting an edge e and an edge set E from G, respectively. In Fig. 46.16, edges ux, uw, and xw belong to a triangle. The number of connected elements will not increase when deleting any edge, so it cannot be a cut edge; only the edge uv is a
Fig. 46.16 Cut edge
u
v
x
w
46.3
Graph Theory
1615
Fig. 46.17 Cut vertex
Fig. 46.18 A graph and its complement
u
v
x
w
u x
u v
G y
x
G w
y
v
C
w
cut edge (after deletion, vertex v is still retained and becomes a new connected unit). Compare cut vertex. Cut vertex The vertex that deleting them in the graph will increase the number of connected units. For graph G, G–v and G–V can be used to represent the subgraphs obtained by deleting a vertex v and a vertex set V from G, respectively. In Figure 46.17, vertex u is the cut vertex (the edges uv, uw, and ux do not exist after deletion, but the vertices v, w, and x are still retained); but vertex v is not the cut vertex (the number of connected units is increase). Compare cut edges. Simple graph Graphs that do not include multiple edges and loops. Underlying simple graph A simple spanning subgraph is obtained by removing all the multiple edges and loops in a graph. See simple graph. Complement graph A simple graph whose vertex set is the same as the original simple graph, but the edge set is composed of all edges that are not part of the original graph. Figure 46.18 shows a graph G (left) and its complement GC (right). Among them, {u, x, y} is a clique of size 3, and {v, w} is an independent set of size 2. They are the maximum clique and the largest independent set, respectively. Because the clique becomes an independent set and the independent set becomes a clique under the definition of the complement graph, these two sets are exactly swapped in the complement graph GC.
1616
46
Related Theories and Techniques
Fig. 46.19 A cycle
u y
v
x
w
Dual graph For a graph G, the graph G* in which each vertex corresponds to a region in G. If a region in G has a common edge in G, then only in this case are the vertices in G* adjacent. Cycle Graphs with equal number of vertices and edges. Its vertices can be placed on a circle (the edges are connected in a circle), and the two vertices are adjacent if and only if they appear consecutively on the circle. Figure 46.19 shows a cycle formed by vertices u, v, w, x, and y. If any edge is deleted from the cycle, it becomes a path. Hypergraph A generalization or extension of the graph. It still consists of a vertex set and an edge set, but the vertices can contain any subset of vertices, and the edges can contain any subset of edges. Wiener index The sum of the distances between all the vertex pairs in a graph. It can be proven that in all graphs with n vertices, the Wiener index D(T ) ¼ ∑d(u, v) takes the minimum value on the star form and the maximum value on the path, and only in these two cases the extreme values can be obtained. Supergraph A relationship between two graphs. For graphs G and H, if V(H ) ⊆ V(G), E (H ) ⊆ E(G), then graph G is the parent graph of graph H and is denoted as G ⊇ H. Compare subgraph. Spanning supergraph A relationship between two graphs. For graphs G and H, if H ⊆ G, V(H ) ¼ V(G), then graph G is the spanning supergraph of graph H. Compare spanning subgraphs. Proper supergraph A relationship between two graphs. If graph G is the supergraph of graph H, but G 6¼ H, then graph G is called the proper mother graph of graph H. Compare proper subgraph.
46.3
Graph Theory
1617
Subgraph A contained relationship between two graphs. For graphs H and G, if V(H ) ⊆ V (G), E(H ) ⊆ E(G), then graph H is a subgraph of graph G, and it is denoted as H ⊆ G. Compare supergraph. Spanning subgraph A relationship between two graphs. For graphs H and G, if H ⊆ G and V(H ) ¼ V (G), then graph H is a spanning subgraph of graph G. Compare spanning supergraph. Proper subgraph A relationship between two graphs. If graph H is a subgraph of graph G, but H 6¼ G, then graph H is called a proper subgraph of graph G. Compare property supergraph. Induced subgraph A special subgraph in graph theory. For non-empty vertex subset V0 (G) of graph G, V0 (G) ⊆ V(G), if there is a subgraph of graph G with V0 (G) as the vertex set, with all edges having both endpoints in graph G inside V0 (G) as the edge set, this subgraph is the induced subgraph of graph G, which is denoted as G [V0 (G)] or G [V0 ]. Edge-induced subgraph A special subgraph. For a non-empty edge subset E0 (G) of graph G, E0 (G) ⊆ E(G), if there is a subgraph of graph G with E0 (G) as the edge set and the endpoints of all edges in graph G as the vertex set, then the subgraph is an edge-derived subgraph of graph G, and it is denoted as G[E0 (G)] or G[E0 ].
46.3.3 Graph Representation Graph representation The structure of the graph is used to represent the results of the connection between the objects in the image. In the graph search and graph cut methods, the structure of the graph is used to represent the relationship between the gray values of adjacent pixels, while in the topology description, the structure of the graph is used to represent the connection between connected components. See graph model. Arc of graph The connecting line (usually a straight line) between two nodes in a graph. Arc The links between the nodes in the graph are also called edges. See edge. Arc set In graph theory, one of two sets that make up a graph. A graph G is defined by a finite non-empty vertex set V(G) and a finite edge set E(G). Each element of E(G) is called an arc or edge of G, and it connects a pair of vertices. See vertex set.
1618
46
Related Theories and Techniques
Fig. 46.20 Edges and capacity in a directed graph
s
S
T
t
Edge set A set of finite edges in a line drawing. It is also one of the two sets that make up the graph. A graph G is composed of a finite set of non-empty vertices V(G) and a finite set of edges E(G). Each element in E(G) is an edge in G. Each edge corresponds to an unordered pair of vertices in V(G). Capacity Restrictions on network traffic in graph theory. There are two cases: 1. Traffic passing through one edge in the network. 2. Traffic passing through one cut in the network. Source/sink cut In graph theory, the set [S, T] is composed of all edges of the source point set S to the sink point set T. Where S and T are partitions of the node set and s 2 S, t 2 T. The capacity of the cut is the total capacity of all edges in [S, T], which is written as cap (S, T). Given a cut [S, T], each s-t path in the graph needs to use at least one edge in [S, T], so the maximum feasible flow is cap(S, T ). Referring to Fig. 46.20, in a directed graph, [S, T] represents a set of edges consisting of a head in S and a tail in T. So the capacity of cutting [S, T] is not affected by the edges from T to S. Node of graph In the graph, a symbolic representation of certain features or entities. Nodes are connected by arcs, and arcs represent the relationship between different nodes. Vertex set 1. In graph theory, one of two sets that make up a graph. A graph G is defined by a finite non-empty vertex set V(G) and a finite edge set E(G). Each element of V(G) is called the vertex of G. A pair of vertices corresponds to an element in the edge set. See arc set. 2. A non-empty set consisting of a finite number of vertices in a line drawing. Vertex 1. Vertex: A point at the end of a line segment, also called an endpoint. A vertex is usually the common point of two or more line segments. 2. Nodes: The points in the graph that represent things, and the lines between the nodes represent the connections between things. Degree of a graph node The weight between two connected nodes in the graph.
46.3
Graph Theory
1619
Fully connected graph A graph where each node is connected to all other nodes. A graph with connected edges between any pair of nodes. Given N random variables x1, x2, . . ., xN, their joint distribution is p(x1, x2, . . ., xN). By repeatedly using the probability multiplication rule, the joint distribution can be written as the product of the conditional distribution of each variable: pðx1 , x2 , . . . , xN Þ ¼ pðxN jx1 , x2 , . . . , xN1 Þ . . . pðx2 jx1 Þpðx1 Þ For a given N, it can be represented as a directed graph containing N nodes, each node corresponds to a conditional distribution on the right side of the above formula, and each node has a connected edge from a lower number entry. Such a graph is a fully connected graph. Graph similarity The degree of similarity of two graph representations. In practice, different graph representations will not be exactly the same. At this time, we can further consider the double-subgraph isomorphism to measure the similarity. Lattice diagram A graph representation method for variable strings, where nodes represent the states of a variable and a variable corresponds to a column in the graph. For each state of a given variable, there is an edge connecting the state of the previous variable (determined by the probability maximization). Once you know the most probable value of the final node, you can simply return to the most probable state of the previous variable along the edge, so you can finally return to the first node. This corresponds to back-propagating information, called backtracking. Trellis diagram Same as lattice diagram. Colored graph Generalization of the representation of the graph. The requirements for the same vertices and the same edges in the original graph definition were relaxed, allowing vertices to be different and edges to be different. The different elements in the graph are represented by different colors, which are called the vertex color (meaning the vertex is marked with different colors) and the edge color (meaning the edges are marked with different colors). So a generalized colored graph G can be expressed as G ¼ ½ðV, CÞ, ðE, SÞ Among them, V is the vertex set, C is the vertex chromaticity set, E is the edge set, and S is the edge chromaticity set. Can be expressed as V ¼ fV 1 , V 2 , . . . , V N g
1620
46
Related Theories and Techniques
Fig. 46.21 Winged edge representation
Face 1 Connection
Current edge
Connection Face 2
C ¼ fC V 1 , C V 2 , . . . , C V N g E ¼ eV i V j jV i , V j 2 V S ¼ sV i V j jV i , V, j 2 V Different vertices can have different colors, and different edges can have different colors. Winged edge representation A graph representation of a polyhedron. Where nodes represent vertices, edges, and faces. Faces point around the edge nodes, the edge nodes point to the vertices, and the vertices point to the connected edges, which point to the adjacent faces. The “wing edge” is derived from the fact that a current edge has four connections to the previous and subsequent edges, and these edges surround the two faces containing the first edge, as shown in Fig. 46.21.
46.3.4 Graph Geometric Representation Geometric representation of graph A way of representing a graph, where the vertices of the graph are represented by circles and the edges are represented by straight lines or curves connecting the vertices. Graphs with edges greater than or equal to 1 can have an infinite variety of geometric representations. Geometric realization of graph Same as geometric representation of graph. Geometric realization Same as geometric representation. Geometric representation Same as geometric representation of graph. See geometric model. Factor graph A graph construction method or graph constructed accordingly. Both directed graphs and undirected graphs can represent a global function with several variables as a factor product corresponding to a subset of these variables. The subgraph
46.3
Graph Theory
1621
Fig. 46.23 Use plates to compactly represent nodes
vn
R N
Fig. 46.22 Factor graph example
x1
x2
fa
fb
x3
fc
fd
makes this decomposition more obvious by introducing additional nodes representing the factors themselves on the basis of the nodes representing the variables. Write the joint distribution of a set of variables as a factor multiplication: pð y Þ ¼
Y
f s ðxs Þ
s
Among them, xs represents a subset of variables, and factor fs is a function corresponding to a subset of variables xs. In the factor graph, there are not only nodes corresponding to each variable (usually represented by circles) but also additional nodes corresponding to each factor fs(xs) (usually represented by small squares). Edges are connected by variable nodes. Consider the distribution that can be expressed as p(x) ¼ fa(x1, x2)fb(x1, x2) fc(x2, x3)fd(x3). Its factor graph is shown in Fig. 46.22. Identical graphs Two graphs that satisfy the identical relationship. For graphs G and H, if and only if V(G) ¼ V(H ) and E(G) ¼ E(H ), G and graph H are identical. At this time, the two graphs can be represented by using the same geometric representation of graph. However, when two graphs can be represented by the same geometric representation, they are not necessarily identical graphs. Identical The two graphs G and H can be represented by the same geometric representation of graphs, and their vertex sets V(G) ¼ V(H ) and arc sets E(G) ¼ E(H ). A graph symbol used to represent nodes in a graph more compactly. For example, consider the directed graph model in the left side of Fig. 46.23, which can be represented by the plate as the right side, where only a single node vn and a plate surrounding it are drawn. Here the plate is labeled N to indicate that there are N such nodes, of which only the node vn is explicitly drawn.
1622
46
Related Theories and Techniques
Incidence matrix In graph theory, it can have different meanings depending on the object being discussed: 1. A matrix of 0 and 1 in the graph. Where element (i, j) takes 1 if and only if vertex i is associated with edge j. 2. A matrix of 1, 0, and 1 in a directed graph. The element (i, j) takes 1 when vertex i is the head of edge j; 1 is taken when vertex i is the tail of edge j; 0 is taken in other cases. 3. In general, a matrix representing membership. Incident In graph theory, there are connecting edges between a pair of vertices; or there are common vertices between two edges. Generally, the edge e composed of unordered pairs of vertices A and B in the graph is denoted as e $ AB or e $ BA, and A and B are called the endpoints of e, and edge e joins A and B. In this case, vertices A and B are associated with edge e, and edge e is also associated with vertices A and B. Two vertices associated with the same edge are adjacent, and the two edges with common vertices are also adjacent.
46.3.5 Directed Graph Orientation of graph A directed graph obtained by specifying the head and tail for each edge in the (undirected) graph. Directed graph A graph in which edges or arcs follow in a single direction, as opposed to an undirected graph. A triplet G consisting of a vertex set V(G), an edge set E(G), and a function that assigns an ordered pair of vertices to each edge. The first vertex of an ordered pair is the tail of the edge, and the second vertex is the head of the edge; they are collectively referred to as the endpoints of the edge. An edge at this time refers to the edge from its tail to its head. For example, OnTopOf(A, B) indicates that object A is above B, which is a property that can be used in a directed graph. On the other hand, proximity is a property that can be used in undirected graph. Adj(A, B) indicates that the object A is adjacent to B and also implies adj(B, A), that is, the object B is adjacent to A. Directed graphical model [DGM] A joint probability distribution defined on a set of random variables. Its graph structure is a directed acyclic graph, where each node in the graph represents a random variable. The joint distribution of a variable set x is p(x) ¼ ∏ip(xi|Pai), where Pai represents the parent node of node i in the graph. The missing edges in the graph suggest that the conditions are independently linked. For example, in
46.3
Graph Theory
Fig. 46.24 Directed acyclic graph
1623
1
4
2
3
5
6
Fig. 46.24 (directed acyclic graph), the factorization is p(x1, x2, . . ., x6) ¼ p(x1)p(x4) p(x2|x1)p(x5|x2, x4)p(x3|x5)p(x6|x5). In some cases, directional arrows can be interpreted as expressing causality (see the probabilistic causal model), although there is actually nothing about the essential causality of a directed graph model. Directed graphical models are sometimes considered Bayesian networks or belief networks. Probabilistic causal model A model used in artificial intelligence to express causality. The simplest causal model is a causal graph, which is essentially an acyclic graph, in which nodes represent variables and directed arcs represent causes and effects. A probabilistic causal model is a causal diagram in which each variable is conditionally dependent on the probability distribution of its cause. See probabilistic graphical model. Strict digraph A directed graph with no cycles and only one corresponding edge in each ordered vertex pair. Directed acyclic graph [DAG] A special kind of directed graph, which contains only directed edges. There is no closed path to move from one node to another in the direction of the arrow until it returns to the starting node. Such a graph without a directed cycle is called a directed acyclic graph, and the nodes in the graph are ordered, and it is impossible to reach the node ranked earlier from one node. If there is an edge between a pair of vertices u and v, there must be u ! v or v ! u, but not both. At the same time, there must be no closed pathways (cycles). A directed acyclic graph is used to define a directed graphical model. Figure 46.24 shows a directed acyclic graph. Directed cycle See directed acyclic graph. Directed acyclic graph SVM [DAGSVM] See multiclass support vector machine. Directed edge Edges in a directed graph. In the graph cut algorithm, the directed edge corresponds to the region contour segment, which can distinguish the transition from the light region to the dark region and the transition from the dark region to the light region.
1624 Fig. 46.25 Undirected graph and directed graph
46 Undirect മ
Related Theories and Techniques
A
C
A
Direct മ
D
C
D
B
B
Fig. 46.26 Maximum “clique” in the graph
1
3
4
5
6
2
Transitive closure Contains a collection of transitive relationships. For a directed graph D, if there is a path from node x to node y and node y to node z in D, then there is x ! z in this graph, which represents the transitive relationship. Undirected graph A graph in which each edge represents a symmetric connection (i.e., regardless of the order in which the nodes are connected). Can be regarded as a special kind of directed graph, that is, symmetric directed graph. A graph can also be said that the arcs are bidirectional, as opposed to directed (unidirectional) graphs, as shown in Fig. 46.25. One feature that can be used in undirected graphs is adjacency: adj(A, B) indicates that node A is adjacent to node B and implies adj(B, A), that is, node B is adjacent to node A. Undirected graphical model [UGM] A model that defines a joint probability distribution on a set of random variables. The graph structure here is an undirected graph. The joint probability distribution is defined by the product of positive factors, and each factor corresponds to a maximum “clique” in the graph. In Fig. 46.26, there are five maximum “cliques,” namely, (1, 2), (2, 3, 5), (3, 4), (4, 6), and (5, 6). If x is used to represent a random variable, the joint probability distribution can be written as p(x) ¼ ∏cFc(xc)/Z, where Fc(xc) represents the function of the clique c and Z is partition function or normalization constant such that the sum of p(x) is 1. Because Fc(x) is always positive, it can be conveniently expressed by an exponential function, that is, Fc(xc) ¼ exp.[Ec(xc)]. With this expression, we can get Fc(xc) ¼ exp.[Ec(xc)], which is called the Boltzmann distribution. An undirected graph model is also called Markov Random Field (MRF): here the conditional
46.3
Graph Theory
1625
distribution of the term Markov and the variable xi (after giving other variables) depends only on its nearest neighbors in the graph. A common application of undirected graph model in image engineering is the grid graph structure of the nearest neighbor interaction between corresponding pixels, where each node has a variable (i.e., a label binary variable indicating the foreground or background). This pairwise interaction typically records the pattern that close locations have the same label. See conditional random field.
46.3.6 Graph Model Graph model One of two models for declarative representation. Represents the more abstract nature of objects in 2-D images or 3-D scenes. The nodes in the graph model of a 2-D image represent the basic elements in the image, and the lines represent the connections between the nodes. The nodes in the graph model of a 3-D scene represent the solid units in the scene, and the lines still represent the connections between the nodes. Typical applications include object representation (see graph matching) and edge gradient calculation (see graph search). Compare the iconic model. Sparse flexible model A graph model for component-based recognition. The parts are ordered and the position of each part depends on the positions of up to K of its ancestors. For example, for the graph shown in Fig. 46.27a, the corresponding sparse flexible model with K ¼ 1 is shown in Fig. 46.27b, and the corresponding sparse flexible model with K ¼ 2 is shown in Fig. 46.27c. Sparse graphical model A graph model whose random variables depend on only a few other variables. Graphical model Same as probabilistic graphical model.
A
B
E
F
A
E
A
E
C
C
B
C
B
D
D
F
D
F
(a) Fig. 46.27 Sparse flexible model examples
(b)
(c)
1626
46
Fig. 46.28 D-separation schematic
a
a
f e
Related Theories and Techniques f
b
e
c
c
(a)
(b)
b
Past-now-future [PNF] A graph model that can model complex time relationships (sequence, period, parallelism, synchronization, etc.). If extended within the framework of a dynamic belief network, it can be used to model complex temporal ordering situations. D-separation An important feature of the graph model is that it can directly obtain the conditional independent properties of the joint distribution from the graph (d stands for directed) without any analysis and calculation. Also refers to the framework that implements this function. To describe directed separation, consider a general directed graph where A, B, and C are arbitrary sets of disjoint nodes (their AND may be smaller than the set of nodes in the graph). Now we need to determine whether a given directed acyclic graph implies a specific condition to independently describe A B|C. To this end, all possible paths from any node in A to any node in B can be checked. If any path contains a node that meets one of the following two conditions, the path is blocked: 1) the arrows in the path either pass head-to-tail at the node or pass tail-to-tail at the node, and the node belongs to C; and 2) the arrows in the path pass head-to-head at the node, but the node or its descendants are not in the set C. If all the paths are blocked, then A and B are directly separated by C. At this time, the joint distribution of all variables in the graph satisfies A B|C. D-separation can be illustrated visually with the help of Fig. 46.28. In Fig. 46.28a, the path from a to b is not blocked by node f, because this node is a node that is connected tail-to-tail and is a node that has not been observed; nor it is blocked by node e; although node e is a head-to-head node, it has a descendant node c in the condition set (indicated by the shade). Therefore, the condition independent description a b|c is not implied in the figure. In Fig. 46.28b, the path from a to b is blocked by node f, because this node is a tail-to-tail node, but it is also an observed conditional node (indicated by a shade). According to the figure, the conditions independently describe a b|c for any distribution. It should also be pointed out that the path from a to b is also blocked by node e, because node e is a node that is connected head-to-head, and neither itself nor its descendants are in the condition set. Q
Q
Q
Q
46.3
Graph Theory
1627
Ancestral sampling A model that is closely related to a graph model and is sampled from a given probability distribution. Consider a joint distribution p(x1, x2, . . ., xK) with K variables, which can be decomposed into pð xÞ ¼
K Y
pðxk jPk Þ
k¼1
Among them, x ¼ {x1, x2, ..., xK}, and Pk represents the parent set of xk. This corresponds to a directed acyclic graph. Assume that the variables have been sorted such that each node has a higher ordinal number than its parent node. The goal now is to collect samples x10 , x20 , . . ., x0 K from this joint distribution. Start with the node with the lowest number, and take a sample from the distribution p(x1), which is called x10 . Next, each node is performed in turn. For node n, a sample is taken from the conditional distribution p(xn|Pn). At this time, the parent node has been set to the sample value. Note that at each step, these parent values are always available because they correspond to the lower-numbered nodes that have been sampled. Once the samples are obtained from the final variable xK, the purpose of sampling from the joint distribution is achieved. In order to obtain samples from the edge distribution of some corresponding variable subsets, it is only necessary to obtain sample values from the required nodes and ignore the sample values of the remaining nodes. For example, to sample from the distribution p(x3, x6), you only need to sample the entire joint distribution before retaining the values x30 and x60 and discard the remaining values {x0 j 6¼ 3,6}. Inference in Graphical Models Use graph model for reasoning. Using the structural information in the graph model can help find effective inference algorithms and make the structure of these algorithms transparent. Structure learning 1. Learn the geometric model. 2. Learn the nodes and arcs of a graph model, such as in a Bayesian network, in a hidden Markov model, or in a more general model (e.g., directed acyclic graph). Latent trait model A model with both continuous hidden variables and discrete observed variables. Blocked path Conditional independence of a node in a directed probability graph blocking the path between the other two nodes. Can be divided into the following three. 1. First consider the directed graph with three nodes shown in Fig. 46.29. The joint distribution corresponding to the three variables can be written as
1628
46
Related Theories and Techniques
Fig. 46.29 The first directed graph with 3 nodes
c
a
b
pða, b, cÞ ¼ pðajcÞpðbjcÞpðcÞ If no variables are observed, you can determine whether a and b are independent by marginalizing c in the above formula: pða, bÞ ¼
X
pðajcÞpðbjcÞpðcÞ
c
In general, the above formula will not be decomposed into p(a)p(b), so a⊥ ⊥ = bj∅ Among them, ∅ represents the empty set, and the symbol ⊥⊥ / means that the conditional independent characteristic does not hold in general. Now suppose that the condition is on the variable c (i.e., node c is shaded, as shown in Fig. 46.30). With the help of Bayes’ theorem, the conditional distribution of a and b after c is given by pða, bjcÞ ¼ pða, b, cÞ=pðcÞ ¼ pðajcÞpðbjcÞ So the conditional distribution characteristics is a⊥ ⊥b j c By considering the path from node a through node c to node b, a simple graphic explanation of the above results can be given. It can be considered here that node c is “tail-to-tail” relative to the path, because it is connected to the tail of two arrows, and such a path indicates that node a and node b are related. However, when node c is used as a condition (as shown in Fig. 46.30), this conditional node blocks the path from node a to node b, causing node a and node b to become (conditional) independent. 2. Consider again the directed graph with three nodes shown in Fig. 46.31. The joint distribution corresponding to the three variables can be written as
46.3
Graph Theory
1629
c
a
b
Fig. 46.30 Directed graph with the first condition on variable c
c
a
b
Fig. 46.31 The second directed graph with 3 nodes
pða, b, cÞ ¼ pðaÞpðcjaÞpðbjcÞ Similarly, assuming no variables are observed, you can still determine whether a and b are independent by marginalizing c in the above formula: pða, b, cÞ ¼ pðaÞ
X
pðcjqÞpðbjcÞ ¼ pðaÞpðbjaÞ
c
As before, under normal circumstances, the above formula will not be decomposed into p(a)p(b), so there is also
a⊥ ⊥ = bj○ Now assuming that the condition is on the variable c (i.e., node c is shaded, as shown in Fig. 46.32), the condition distribution of a and b after c is given as pða, bjcÞ ¼ pða, b, cÞ=pðcÞ ¼ pðaÞpðcjaÞpðbjaÞ=pðcÞ ¼ pðajcÞpðbjcÞ So still get the conditional distribution characteristics a⊥⊥b j c By considering the path from node a through node c to node b, a simple graphic explanation of the above results can also be given. Here, the path of node c from node a to node b is considered to be “head-to-tail.” Such a path connecting node a and node b indicates that the two nodes are related. But when node c is used as a condition (as shown in Fig. 46.32), this conditional node also blocks the path from node a to node b, causing node a and node b to become (conditional) independent.
1630
46
Related Theories and Techniques
c
a
b
Fig. 46.32 Directed graph with the second condition on variable c
a
b
c Fig. 46.33 The third directed graph with 3 nodes
3. Finally, consider the directed graph with three nodes shown in Fig. 46.33 (the detailed behavior is more complicated). The joint distribution corresponding to the three variables can be written as pða, b, cÞ ¼ pðaÞpðbÞpðcja, bÞ First consider the case where no variable is observed, then by marginalizing c in the above formula pða, bÞ ¼ pðaÞpðbÞ This shows that a and b are independent (unlike the first two cases), and the result is
a⊥⊥b j ○ Now suppose that the condition is on the variable c (i.e., node c is shaded, as shown in Fig. 46.34), then the condition distribution of a and b after c is given as pða, bjcÞ ¼ pða, b, cÞ=pðcÞ ¼ pðaÞpðbÞpðcja, bÞ=pðcÞ This will not normally be broken down into p(a)p(b), so we get
a⊥ ⊥ = bj○ From the above, the third case is the opposite of the first two cases. However, a simple graphical explanation of the results can be given. Here the path of node c from node a to node b is considered “head-to-head” because it is connected to the heads of the two arrows. When node c is not observed, it blocks the path from node a to node b, causing node a and node b to become independent. However, when
46.3
Graph Theory
Fig. 46.34 Directed graph with the third condition on variable c
1631
a
b
c node c is used as a condition (as shown in Fig. 46.34), the path is unblocked, causing node a and node b to be related. The third case has an extension when the condition of node c is used, that is, if any descendant node of node c (if there is a path from node x to node y, and each step along the path follows in the direction of the arrow, it is said that node y is a descendant of node x), and the path blocking can be released, and node a and node b become related. Head-to-head See blocked path. Head-to-tail See blocked path. Tail-to-tail See blocked path.
46.3.7 Graph Classification Clique A concept in graph theory, nodes in cliques are completely connected. Also refers to the subset of all nodes in a graph connected to all other nodes; or the set of nodes with edges between each other. In a fully connected graph, each vertex is a neighbor of all other vertices. The graph shown in Fig. 46.35 is a clique with four vertices, which also includes other cliques with fewer vertices (such as a clique consisting of vertices A, B, and C only). See complement graph. Compare independent set. Maximal clique A special grouping. The maximal clique in the graph is such a clique that if you pull in a node from the graph or add any node to the clique, it is no longer a clique. This can be explained with the help of an undirected graph with four variables as shown in Fig. 46.36. The graph includes five cliques, each clique has two variables, namely, {A, B}, {B, C}, {C, D}, {A, C}, and {B, D}, and two maximal cliques with three nodes, namely, {A, B, C} and {B, C, D}. The set {A, B, C, D} lacks a link from A to D, so it does not constitute a clique.
1632
46
Related Theories and Techniques
Fig. 46.35 Clique schematic
Fig. 46.36 Cliques and maximal cliques
A
B
C
D
A B
C D
Fig. 46.37 Maximal clique example
A B
C
E
D
A clique not included in any other clique in a graph. It is not a true subset of any other clique, that is, there are no more nodes connected to all nodes in the graph. Can be used to represent maximized matching structures in association graph matching algorithms. There can be different sizes; the focus here is on maximizing rather than size. There are two maximal cliques Fig. 46.37: BCDE and ABE. Maximum clique The clique with the most nodes in a graph. As shown in Fig. 46.37, the maximum clique BCDE. The problem of maximum clustering is to find the maximum clique of a given graph. Same as maximal clique. Treewidth of a graph Number of variables in the maximum clique in the graph. In practice, the size of the maximum clique is reduced by one to ensure that the width of a tree is one. For undirected graphs, this refers to a number associated with the graph. When decomposing the tree of a graph, it refers to the size of the largest vertex set in the graph. Independent set A graph consists of pairs of non-adjacent nodes. See complement graph. Compare clique.
46.3
Graph Theory
1633
Fig. 46.38 Wayne diagram representation of a distribution set
P
D
U
Dependency map A graph that reflects the independent state of each condition that its distribution meets. For any distribution, completely separated graphs (graphs without links) are trivial dependency maps. Compare independency map. Independency map A special graph where each conditional independent state implied by the graph satisfies a specific distribution. For any distribution, a fully connected graph is a trivial independency map. Compare dependency map. Perfect map A graph that reflects each conditional independent characteristic that is satisfied by its distribution and that each conditional independent state implied by the graph satisfies a particular distribution. Therefore, it is both an independency map and a dependency map. Consider a set of distributions defined on a set of variables. Some of these distributions have a directed graph that is a perfect graph. Some distributions have an undirected graph that is a perfect graph. There are also some distributions that are not perfect, whether they are directed graph or undirected graph. Some relationships between them can be represented by means of the Wayne graph of Fig. 46.38, where P represents the set of all distributions, D represents a set of directed graphs in which a perfect map exists, and U represents an undirected graph in which a perfect map exists. Association graph A graph that connects mutually different related factors (elements, parts, etc.). Also called a relational graph. With the help of logical relationships, the links between these factors can be analyzed. For example, there is a graph for structural matching (such as matching between geometric models and data descriptions). Each node in the graph corresponds to the pairing between the model and the data features. An arc indicates that two connected nodes are compatible with each other. The maximum clique can be used to determine the appropriate or optimal match. Figure 46.39 shows the correspondence of one set of model features A, B, and C with another set of model features a, b, c, and d. The maximum clique consisting of A:a, B:b, and C:c is a possible match. Association graph matching A graph matching technology based on association graph. The graph is defined as G ¼ [V, P, R], where V is the node set, P is the unit predicate set for the node, and R is
1634
46
Fig. 46.39 Association graph and matching examples
Related Theories and Techniques
A:a B:d
B:b C:c
C:d
the set of binary relations between the nodes. The predicate here represents a statement that takes only one of two values, TRUE or FALSE. The binary relationship describes the attributes of a pair of nodes. Association graph matching is the matching of nodes and nodes, two-valued relationship, and two-valued relationship in two graphs. Relational graph 1. A graph representation of an image. Consider each pixel as a node. Each pixel is connected to each other pixel, and each connection is given an inverse weight to the absolute value of the difference between the gray values of the two connected pixels. These weights measure the similarity between two pixels. In this way, an image can be represented as an undirected correlation graph. 2. A graph that uses arcs to express the relationship between the properties of image entities (such as regions or other features). Entities correspond to the nodes of the graph. For example, the properties commonly used for regions are adjacency, containment, connectivity, and relative region sizes. Relational model See relational graph. Graph median Corresponding to the vertices of a graph in graph theory, the sum of the lengths of the shortest paths from this vertex to all other vertices can be minimized. Graph partitioning Operations and methods for splitting a graph into subgraphs that satisfy certain criteria. For example, given a graph representing a polygon composed of all edge segments in an image, it can be divided into subgraphs corresponding to each object in the scene. Graph pruning Simplify the graph by removing vertices and edges in the graph. Graph theoretic clustering Clustering methods using concepts in graph theory, especially using efficient graph theory algorithms (such as maximum flow) for clustering. Graph clustering A clustering problem where each input is a graph. Compare graph theoretic clustering.
46.4
Compressive Sensing
1635
Graph classification problem A classification problem where each input is a graph.
46.4
Compressive Sensing
Compressive sensing is a novel signal sampling and processing theory. The theoretical framework and research contents of compressive sensing mainly involve three aspects: sparse representation, measurement encoding, and decoding reconstruction (Donoho 2006).
46.4.1 Introduction Compressive sensing [CS] A novel signal sampling and processing theory. With the sparse representation of the signal, random sampling is performed at a sampling rate much smaller than the Nyquist rate, discrete samples of the signal are obtained, and then the original signal is perfectly reconstructed by a nonlinear reconstruction algorithm. A sparse representation was specifically used to allow a relatively small number of measurement samples to be used to store and reconstruct the signal. This goal is achieved by solving an under-constrained system of linear equations and placing sparse constraints on the original signal. It has been used for image acquisition, image compression, and image registration. Also called compressive sensing, com pressed sampling, and spatial sampling. Compressive sampling [CS] A novel signal sampling and processing theory. This theory combines signal sampling and compression, improves the signal acquisition and information processing capabilities, breaks through the bottleneck of Nyquist’s theorem-Shannon sampling theorem, and greatly relieves the pressure of data acquisition, storage, transmission, and analysis. The theory states that when the signal is compressible or sparse in a certain transform domain, by collecting a small number of nonadaptive linear projection values of the signal, constructing a mathematical reconstruction model, and solving it using a reconstruction algorithm, the original signal can be reconstructed more accurately. The framework and research content of this theory mainly involve three aspects: sparse representation, measurement coding, and decoding reconstruction. Compressed sampling [CS] Same as compressive sensing.
1636
46
Related Theories and Techniques
Variable reordering The key to using a direct method (non-iterative) to solve the sparse problem is to determine a valid order for the variables. This reduces the amount of filling. Multi-frontal ordering In the field of variable reordering, with the help of domain decomposition technology, the minimum degree ordering and nested dissection ordering are typical methods. Independent subgraphs can be processed in parallel on separate processors. Minimum degree ordering In variable reordering, when faced with unstructured data, heuristic knowledge is used to determine a method for correct reordering. Suitable for general adjacent (sparse) graphs. Nested dissection ordering In variable reordering, when faced with unstructured data, a method to obtain correct reordering with the help of heuristic knowledge. Specifically, it iteratively splits the graph into two halves along a small size boundary.
46.4.2 Sparse Representation Sparse representation The results obtained by expanding an image on a set of linearly independent overcomplete redundant bases. If x is used to represent the coordinates of a pixel f (x, y), the super-complete representation of a vector f 2 RN consisting of all pixels of an image can be written as f ¼ Zx where Z is an N K-dimensional matrix, where K > > N. x 2 RK is the coefficient of the linear expansion of the image f on the overcomplete dictionary Z. When || x||0 < < N, that is, the number of non-zero elements in the set x is far less than N, the above expression can be said to be a sparse representation. It is one of the three key processing steps for compressive sensing. It needs to find some orthogonal basis or compact frame that is compressible to the signal, so that it can achieve a sparse representation of the signal. From a mathematical point of view, it is necessary to construct a sparse transformation matrix to achieve a linear decomposition of multidimensional data. It also represents the representation method used for sparse description of the object. Given a large dictionary with N possible image descriptors or image features, a binary vector of length N can be used to describe an object (or an image), where the elements of the binary vector indicate which feature is used to the object. Because only a few features are used, the representation used for description is sparse. This
46.4
Compressive Sensing
1637
representation can be extended to include the encoding of the existence of a relationship. See sparse coding. Sparse coding [SC] An artificial neural network method for simulating mammalian vision system. From a mathematical perspective, sparse coding is a multidimensional data description method. In data analysis, sparse coding is a dictionary representation to find data, so that each data can be represented as a linear combination of a few atomic signals. Also refers to a method of describing some data using a few examples (or a few examples from many neurons) in a large set of descriptors. For example, a set of Gabor filters of different sizes, scales, and orientations can be used as the descriptor set. Based on this, several non-zero coefficients of the Gabor filter can be used to describe a specific image slice. These non-zero coefficients must be selected to reconstruct the image slice well. See sparse representation. Sparse data 1. Data containing many zero elements. 2. Only contains data for a few special configuration examples, so the probability of this configuration is very small and it is difficult to estimate. Sparsity problem A general machine learning problem in which there is not enough data to accurately or comprehensively estimate a model. Here the model can be a statistical model but inaccurate for the probability of sparse events, or it can be a graph model but lack the connection between nodes. See sparse data. Overcomplete dictionary The base or frame in sparse representation is often referred to as the dictionary, and the vector elements in it are called atoms. Atoms See overcomplete dictionary. Method of optimal directions A technique for finding dictionary representations of data in sparse coding. Its essence is to optimize the atomic signal matrix and the weight matrix alternately, and the dictionary is updated by performing a simple pseudo-inverse calculation on the weight matrix. Skeletal set A set consisting of a small amount of data with a limited difference from the complete data. For example, in 3-D reconstruction from a video sequence, only a few frames of image can be used for reconstruction, and the results are only constantly different from the results of reconstruction using all frames of images or less than a certain limit. Missing data Not available and therefore requires estimated data. For example, a pedestrian in video surveillance may not be visible in several frames due to being blocked by other
1638
46
Related Theories and Techniques
people or objects for a period of time. In order to obtain its motion trajectory, a guess estimation is needed. Missing at random Refers to the process by which the mechanism of data loss does not depend on unobserved data. Direct sparse method When the Jacobian matrix (or approximate Hessian matrix) is very sparse (sparse patterns appear in the matrix), the method of storing the matrix in a compressed form is used. At this time, the amount of storage required is proportional to the number of non-zero entries in the matrix. Compressed form In the direct sparse method, the way to store the Hessian matrix. See compressed sparse row. Compressed sparse row [CSR] A direct sparse method. The sparse Cholesky decomposition is used instead of the standard Cholesky decomposition. The method of storing the Hessian matrix and the related factors of the Cholesky decomposition in compressed form. A triple (i, j, Cij) is stored as a list. Skyline storage A storage method used in a direct sparse method that stores only adjacent vertical spans between non-zero elements. Automatic relevance determination [ARD] A mechanism that pushes some parameters (a subset) to zero, which can lead to sparsification. Direct sparse matrix techniques The technique of processing the sparse matrix directly is used for the sparse matrix. For example, in many optimization problems, the Jacobian matrix and the (approximate) Hessian matrix are generally very sparse. In this case, the sparse Cholesky decomposition can be used directly instead of the regular Cholesky decomposition. In the sparse Cholesky decomposition, the Hessian matrix and related Cholesky bases are stored in compressed form, and the amount of storage is proportional to the number of non-zero coefficients.
46.4.3 Measurement Coding and Decoding Reconstruction

Measurement coding
One of the three key processing steps for compressive sensing. It refers to the representation of perceptual measurements in compressed sensing. Perceptual measurements are obtained by sparsely measuring/observing the signal using a linear
projection that is not related to the sparse transform matrix. The representation of perceptual measurements is often called coding, so the step as a whole is called measurement coding.

Null space property
To achieve compressive sensing, a property that the measurement matrix used should satisfy. The null space property is a necessary condition to ensure reconstruction, but the influence of noise is not considered in its derivation. Compare restricted isometry property.

Restricted isometry property [RIP]
To achieve compressive sensing, a property that the measurement matrix used should satisfy. Unlike the null space property, the restricted isometry property takes into account the cases where the measured values are noisy or where errors are introduced during quantization.

Decoding reconstruction
One of the three key processing steps for compressive sensing. It guarantees the solution of the compressive sensing model and is the core part of compressive sensing. Its task is to accurately reconstruct the high-dimensional original signal from the low-dimensional compressed measurements (perceptual measurements). The work here is to design a fast reconstruction algorithm that recovers the signal from a very small number of linear observations. The quality of the reconstruction algorithm determines the quality of the recovered signal.

Basis pursuit
A basic reconstruction model in compressive sensing. Suppose the vector $\mathbf{x} \in \mathbb{R}^N$ is a sparse signal of length N, and the measurement matrix $\Phi: \mathbb{R}^N \to \mathbb{R}^M$ is known. The goal is to reconstruct the original target signal x from a small number of measured values $\mathbf{y} \in \mathbb{R}^M$, with M < N. The direct reconstruction method is a combinatorial optimization problem, which is NP-hard. One solution is to use the convex $L_1$ norm to approximate the non-convex $L_0$ norm and transform the combinatorial optimization problem into a convex optimization problem. With this approximation, if the measured values are free of noise, the basis pursuit formulation is
$$\min \|\mathbf{x}\|_1, \quad \text{s.t.} \quad \mathbf{y} = \Phi\mathbf{x}$$
If there is a small amount of bounded noise in the measurements, the formulation for basis pursuit denoising is
$$\min \|\mathbf{x}\|_1, \quad \text{s.t.} \quad \|\Phi\mathbf{x} - \mathbf{y}\|_2 \le \varepsilon$$

Basis pursuit denoising
See basis pursuit.
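A minimal sketch of the noise-free basis pursuit model above, recast as a linear program and solved with scipy.optimize.linprog; the problem sizes, the random Gaussian measurement matrix, and the seed are illustrative assumptions, not part of the original entry.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, y):
    """min ||x||_1 s.t. Phi @ x = y, rewritten with x = u - v, u, v >= 0."""
    M, N = Phi.shape
    c = np.ones(2 * N)                      # objective: sum(u) + sum(v) = ||x||_1
    A_eq = np.hstack([Phi, -Phi])           # Phi @ (u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y)     # default variable bounds are >= 0
    u, v = res.x[:N], res.x[N:]
    return u - v

rng = np.random.default_rng(0)
N, M, K = 50, 20, 3                         # signal length, measurements, sparsity
x_true = np.zeros(N)
x_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
y = Phi @ x_true
x_hat = basis_pursuit(Phi, y)
print(np.max(np.abs(x_hat - x_true)))       # typically close to zero
```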
Matching pursuit [MP]
A method to achieve object recognition through the best decomposition of images. In the sparse reconstruction of compressive sensing, it refers specifically to an iterative greedy reconstruction algorithm. It views a signal as a linear combination of the elements of a basis/dictionary. The dictionary is a sampling matrix $\Phi \in \mathbb{R}^{M \times N}$, each column of which represents a prototype signal (also known as an atomic signal). Reconstruction can be described as follows: given a signal y, seek to describe y through a sparse combination of elements of these bases/dictionaries.

Orthogonal matching pursuit [OMP]
An improved iterative greedy algorithm for matching pursuit (MP) in the sparse reconstruction of compressive sensing (a minimal sketch appears at the end of this subsection). In this algorithm, instead of subtracting the dictionary vector with which the residual is most correlated, the residual r is projected onto the orthogonal complement of the subspace spanned by all columns selected so far at each loop. Since the residual r is always orthogonal to the selected columns, the same column will not be selected twice in OMP, so its maximum number of cycles is significantly reduced. If the size of the sampling matrix is M × N and the sparsity is K, the computational complexity of OMP is O(MNK).

Stagewise orthogonal matching pursuit [StOMP]
An improved iterative greedy algorithm based on orthogonal matching pursuit (OMP) for sparse reconstruction in compressive sensing. It differs from OMP, which selects only one element from the dictionary in each loop, in that it selects a set of elements at a time. This set consists of the dictionary columns whose correlation with the residual exceeds a certain threshold, and the residual is updated with respect to the selected collection of dictionary columns. The computational complexity of StOMP is O(MN ln N), which is much smaller than that of OMP.

Subspace pursuit [SP]
An improved iterative greedy algorithm based on orthogonal matching pursuit (OMP) for sparse reconstruction in compressive sensing. It can perform exact signal reconstruction in the noise-free case and approximate signal reconstruction in the noisy case. For any measurement matrix, as long as it satisfies the restricted isometry property, SP can accurately reconstruct the original signal from the noise-free measurement vector.

Count-median sketch
A combinatorial algorithm for sparse reconstruction in compressive sensing. It requires the measured values to be free of noise. It can handle both positive and negative signals. When the measured values and the sampling matrix are known, the coefficients of the reconstructed object signal are obtained from the observed values using the median value. Compare count-min sketch.

Count-min sketch
A combinatorial algorithm for sparse reconstruction in compressive sensing. It requires the measured values to be free of noise. It only considers non-negative signals. When
the measured values and the sampling matrix are known, the coefficients of the reconstructed object signal are obtained from the observed values using the smallest amplitude. Compare count-median sketch.

Shrinkage cycle iteration
A reconstruction algorithm based on the principle of convex optimization in compressive sensing. Also called soft thresholding. Its basic idea is to replace the $L_0$ norm with a convex function and to optimize a convex objective function J(x) over an unknown variable x in a convex subset of $\mathbb{R}^N$. This type of algorithm has high reconstruction accuracy and requires less measurement data, but its calculation speed is slow.

Soft-thresholding
Same as shrinkage cycle iteration.

Bayesian compressive sensing [BCS]
A reconstruction algorithm based on the Bayesian principle in compressive sensing. Its basic idea is to take nondeterministic factors into account through a model of the unknown signal, that is, to consider a sparse signal with a known probability distribution and to reconstruct, from random measurements, a signal that matches this probability distribution. This algorithm considers the temporal correlation between signals, so it can provide higher reconstruction accuracy than other reconstruction algorithms for signals with strong temporal correlation. However, since there is no exact concept of "distortion-free" or "reconstruction error" in the framework of Bayesian signal modeling, the original object signal cannot be reconstructed without distortion from a given number of measurements.

Fast marginal likelihood maximization
A progressive model construction method that makes full use of signal sparsity in Bayesian compressive sensing (BCS). It can reduce the computational complexity from O(N²) to O(NM²) using the expectation maximization algorithm, where N is the length of the sparse signal and M < N is the length of the measurement vector.

Relevance vector machine [RVM]
A Bayesian learning method, a Bayesian sparse kernel regression and classification technique, and a Bayesian algorithm in compressive sensing. It can produce sparse classification results. Its mechanism is to select a few of many candidate basis functions and use their linear combination as the classification function. From the perspective of compressive sensing, it can be regarded as a method for determining the components of a sparse signal. This signal assigns different weights to some basis functions, which are the column vectors of the measurement matrix. In contrast to support vector machines, which are decision machines that cannot provide posterior probabilities, relevance vector machines can provide posterior probability outputs. The RVM shares many characteristics with support vector machines but avoids their main problems. It usually yields a sparser model, resulting in faster performance while maintaining similar generalization errors.
A Bayesian sparse kernel method for regression and classification. The prediction function is represented as a linear combination of kernel functions centered on the data points $\mathbf{x}_i$, $i = 1, 2, \ldots, n$. As in support vector regression (and unlike Gaussian process regression and kernel ridge regression), the solution of the relevance vector machine is sparse in the coefficients, that is, many of the terms are zero.

Relevance vector
The basis vectors corresponding to the non-zero weights in a relevance vector machine.
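Relating back to the orthogonal matching pursuit entry earlier in this subsection, the following is a minimal greedy-reconstruction sketch in Python/NumPy; the assumption that the sparsity K is known, and all variable names, are illustrative.

```python
import numpy as np

def omp(Phi, y, K):
    """Orthogonal matching pursuit: greedily pick the dictionary column most
    correlated with the residual, then re-fit on all selected columns."""
    M, N = Phi.shape
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(K):
        j = int(np.argmax(np.abs(Phi.T @ residual)))   # most correlated atom
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef          # orthogonal to chosen atoms
    x = np.zeros(N)
    x[support] = coef
    return x

rng = np.random.default_rng(1)
Phi = rng.standard_normal((20, 50)) / np.sqrt(20)
x_true = np.zeros(50); x_true[[3, 17, 40]] = [1.0, -2.0, 0.5]
print(np.max(np.abs(omp(Phi, Phi @ x_true, K=3) - x_true)))   # near zero
```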
46.5 Neural Networks
Artificial neural networks (ANNs), also referred to as neural networks (NNs), are mathematical and algorithmic models that imitate the behavioral characteristics of animal neural networks and perform distributed parallel information processing (Goodfellow et al. 2016).
46.5.1 Neural Networks

Artificial intelligence [AI]
The ability or technology to simulate, perform, or reproduce certain functions related to human intelligence with a computer. Compare human intelligence.

Human intelligence
The human abilities of understanding the world, judging things, learning from the environment, planning behavior, reasoning and thinking, problem-solving, and so on. The human visual function is one embodiment of human intelligence. Artificial intelligence uses computers to achieve the corresponding human functions and capabilities.

Neural network [NN]
Originally refers to the mathematical expression of information processing in biological systems; it also refers to many models that emphasize biological characteristics, such as many effective models in statistical pattern recognition (e.g., multilayer perceptrons). Abbreviation for artificial neural network. An artificial neural network consists of a series of units (artificial neurons) connected in a network. Connections generally have adjustable strengths, called weights. Using learning algorithms (or training algorithms) to adjust the weights allows the neural network to perform pattern recognition tasks, through supervised learning or unsupervised learning. Artificial neural networks can be regarded as simplified or abstract models of biological networks, although a lot of work considers only the pattern recognition characteristics of these nonlinear models. Artificial neural networks can have different
architectures, such as the feed-forward layout from the input layer through intermediate hidden layers to the output layer (as used in multilayer perceptron networks or radial basis function networks), or competitive architectures that include feedback (as used in Hopfield networks and Kohonen networks).

Neural-network classifier
A classifier implemented using electronic networks inspired by the neural structure of the brain. It processes one item at a time, learns by comparison with the classification of known real items, and completes the task through multiple iterations.

Convolution neural network [CNN]
An artificial feedforward neural network, which is also a variant of the multilayer forward network (multilayer perceptron), whose artificial neurons respond to other units within a certain coverage area around them (equivalent to a convolution operation). The basic structure of a CNN includes two kinds of layers. One is the convolutional layer, in which the input of each neuron is connected to the local receptive field of the previous layer and local features are extracted, so it is also called the feature extraction layer. The other is the pooling layer, which combines the extracted feature maps, so it is also called the feature mapping layer.

Deep convolution neural network [DCNN]
Multilayer (now hundreds of layers) convolutional neural networks, often referred to as deep neural networks (DNN).

Deep neural network [DNN]
Generally refers to a neural network with two or more hidden layers, but it can also refer to a neural network with more hidden layers than a shallow neural network. Abbreviation for deep convolution neural network.

Deep learning
A method in machine learning based on learning from data; it is a branch of unsupervised learning. It attempts to imitate the working mechanism of the human brain and to establish a learning neural network to analyze, recognize, and interpret data such as images. It combines low-level features to form more abstract high-level representations (attribute categories or features) in order to discover distributed feature representations of the data. This is similar to how humans first grasp simple concepts and then use them to represent more abstract semantics. With the help of deep learning, unsupervised or semi-supervised feature learning can be used to achieve efficient feature extraction.

Shallow neural network
Usually refers to a neural network with only one hidden layer, but it can also refer to a neural network with fewer hidden layers than a deep neural network.

Single-layer perceptron network
A neural network for supervised learning problems. It maps d-D input data x to the space of d′-D output data y. The single-layer perceptron network is characterized by a d′ × d weight matrix W and a transfer function S, and can be written as f(x) = S(Wx), where S is applied to the vector components element by element.
Each output unit is a perceptron, and S is its activation function. A typical S is the logistic sigmoid function $S(z) = 1/(1 + e^{-z})$. Compare multilayer perceptron network.

Weight sharing
A mechanism in traditional neural networks. With the help of weight sharing, the convolution of image pixel values with kernels containing weight parameters can be realized.

Soft weight sharing
A method to effectively reduce the complexity of a network with a large number of weights. It uses regularization to encourage groups of weights to take similar values. The division into weight groups, the mean weight value of each group, and the spread of the weight values within each group are all determined by the learning process.

Weight parameter
See weight sharing.
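A minimal sketch of the single-layer perceptron mapping f(x) = S(Wx) described above, using the logistic sigmoid as S; the particular weight matrix and input vector are arbitrary illustrations.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))           # S(z) = 1 / (1 + e^{-z})

def single_layer_perceptron(W, x):
    """f(x) = S(W x): map a d-D input to a d'-D output through one weight matrix."""
    return logistic(W @ x)

W = np.array([[0.5, -1.0, 0.2],                # d' = 2 outputs, d = 3 inputs
              [1.5,  0.3, -0.7]])
x = np.array([1.0, 2.0, -1.0])
print(single_layer_perceptron(W, x))           # two values in (0, 1)
```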
46.5.2 Special Neural Networks

Convolutional neural network
A neural network with invariance built into the network structure, which makes it well suited to the recognition of image data. The structure of the convolutional neural network is shown in Fig. 46.40.

Fig. 46.40 Convolutional neural network diagram: input image → convolutional layer → subsampling layer

In the convolutional layer there are many units; they are organized into planes, each of which is called a feature map. The units in a feature map take input only from a small region of the image (the local receptive field) to take advantage of local properties, and all units in the feature map share the same weights. If the units are regarded as feature detectors, all units in a feature map detect the same pattern at different positions in the input image. Since the weights are shared, the activation of these units is equivalent to the convolution of the pixel brightness with the "kernel" containing the weight parameters. If the input image is translated, the activation of the feature map will also be translated by the same amount, but the
intensity will not change. This provides the basis for the invariance of the network output to translation and distortion of the input image. In the network, the output of the convolutional layer constitutes the input of the subsampling layer. For each feature map of the convolutional layer, there is a unit plane in the subsampling layer, where each unit receives input from a small receptive region of the corresponding feature map of the convolutional layer. These units perform subsampling. The small receptive regions are chosen to be contiguous but non-overlapping, so that the numbers of rows and columns of the subsampling layer are half those of the convolutional layer. In this way, the unit responses of the subsampling layer are less sensitive to small offsets of the image in the corresponding region of the input space.

Convolution layer
One of the two basic structural layers in convolutional neural networks. It mainly plays the role of feature extraction, so it is also called the feature extraction layer.

Convolutional layer
The layer that performs the convolution function in a deep neural network; it usually comes immediately after the input data. There can be multiple convolutional layers in a deep neural network.

Radius of a convolution kernel
The number of pixels from the center element of the convolution template to the edge element of the template. For example, a 3 × 3 template has a radius of 1, and a 5 × 5 template has a radius of 2.

Long short-term memory [LSTM]
A typical recurrent neural network, which can also be considered a time-recursive neural network. It is suitable for processing and predicting data that changes over time (especially with longer intervals and delays).

Feature map
See convolutional neural network.

Local receptive field
See convolutional neural network.

Feed-forward neural network
A neural network where data enters from the input and proceeds in a single direction toward the output.

Recurrent neural network [RNN]
A type of recursive neural network that takes sequence data as input, recurses in the evolution direction of the sequence, and connects all cyclic units in a chain to form a closed loop. Typical recurrent neural networks include long short-term memory networks, bidirectional recurrent neural networks, and so on.
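As a rough sketch of the weight sharing and feature map ideas in the convolutional-layer entries above (not the handbook's code), the following computes one feature map by sliding a single shared 3 × 3 kernel (radius 1) over an image; like many CNN implementations, it performs correlation without flipping the kernel.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image; every output unit reuses the
    same weights (weight sharing), producing a single feature map."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])       # 3 x 3 kernel, radius (3 - 1) // 2 = 1
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)                    # (4, 4)
```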
Circular convolution
A special kind of convolution. The circular convolution of two vectors {x_i} and {y_j} of length n is
$$c_k = \sum_{i=0}^{n-1} x_i\, y_j, \quad \text{where } 0 \le k < n,\; j = (i - k) \bmod n$$

Bidirectional recurrent neural network [Bi-RNN]
A simple recurrent neural network consisting of two RNNs stacked on top of each other. It assumes that the current output is related not only to previous elements of the sequence but also to subsequent elements. The final output is determined by the states of the hidden layers of these two RNNs.

Back-propagation neural nets
A typical neural network that compares the output of the network with the expected output and calculates an error measure based on the sum of squared differences. Then, by changing the weights of the network, the gradient descent method is used to minimize the error measure. If the samples of the training set are t_i, the actual outputs are x_i, and the expected outputs are y_i, then the error is
$$E = \sum_i \sum_j \left(x_{ij} - y_{ij}\right)^2$$
The algorithm iteratively updates the expected values (k is the iteration index of the update):
$$y_{ij}(k+1) = y_{ij}(k) - \varepsilon \frac{\partial E}{\partial y_{ij}}$$
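A minimal sketch of the circular convolution entry above, following the index convention j = (i − k) mod n given there; the example vectors are arbitrary.

```python
import numpy as np

def circular_convolution(x, y):
    """c_k = sum_i x_i * y_{(i - k) mod n}, using the entry's index convention."""
    n = len(x)
    c = np.zeros(n)
    for k in range(n):
        for i in range(n):
            c[k] += x[i] * y[(i - k) % n]
    return c

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 0.0, 0.0, 1.0])
print(circular_convolution(x, y))    # [5. 3. 5. 7.]
```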
Back-propagation
One of the most studied neural network training algorithms in supervised learning. It is named after the fact that the difference between the calculated response and the expected response at the network output is propagated back toward the network input. This difference is only one of the inputs to the network weight-updating process. An effective method for evaluating the gradient of an error function for a feed-forward neural network. A local message-passing framework in which information alternately flows forward and backward through the network. Also called error backpropagation.

Error backpropagation
Same as back-propagation.

Recursive neural network [RNN]
An artificial neural network with a tree structure whose nodes recursively process input information according to their connection order. It is regarded as a generalization of the recurrent neural network. When each parent node of a recursive neural network is connected to only one child node, its structure is equivalent to a fully connected recurrent neural network.
Hopfield network
A form of neural network that is an undirected graphical model with pairwise potentials. It is used as an associative memory network and can solve optimization problems.
46.5.3 Training and Fitting

Training
See learning.

Training sequence
A collection of sequences for training. It can be divided into two parts: one is the data sequence, and the other is the corresponding category identification sequence. Each training datum includes the pattern in the data sequence and the corresponding category information in the category identification sequence. The training sequence can be used to estimate the conditional probability distributions for constructing decision rules, or to construct the decision rules themselves.

Training procedure
The process of using training sequences to build decision rules. This process can be performed in two ways: parallel and iterative (serial). In the parallel training process, the entire training sequence is used all at once, and the process of constructing decision rules is independent of the order of the training data in the training sequence. In the iterative training process, the entire training sequence is used multiple times, and the decision rules are adjusted or updated according to each datum; this process depends on the order of the training data in the training sequence.

Window-training procedure
An iterative training procedure. Here, the window is a subset of the pattern measurement space. Only when the training data fall within a given window (generally including the decision boundaries) are the decision rules adjusted.

Early stopping
A method to control the training effect and solve the over-training problem. The training of a nonlinear model is a process of iteratively reducing an error function. For many optimization algorithms used for training, the error is a non-increasing function of the iteration index. However, for independent data (commonly known as the validation data set), the error often decreases first and then increases due to overfitting. The training should therefore be terminated when the error relative to the validation data set is minimal, so that the training result has good generalization performance. In other words, the downward trend of the validation error is used to determine when to end the training, which can be illustrated using Fig. 46.41. Figure 46.41a shows how the error on the training set changes with the number of iteration steps, and Fig. 46.41b shows how the error on the validation set changes
with the number of iteration steps. In order to obtain the best generalization performance, it is necessary to terminate the training at the position of the vertical dotted line in the figure, so as to minimize the error on the validation set.

Fig. 46.41 Changes in the errors on the training set (a) and on the validation set (b) with the number of iteration steps

Dynamic training
Training (method) that dynamically changes parameters based on the characteristics of the test data.

Hinge loss function
A function used to train a machine learning classifier (typically a support vector machine). For $x \in \mathbb{R}$, the hinge loss is defined as $H(x) = \max(0, 1 - x)$.

Fitting
A special general image matching method. It expands the matching between the two representations to be matched: image structure and relational structure.

Function interpolation
A fitting problem for a function. Given a set of input vectors {x_i} and corresponding target values {t_i}, both of size N, a smooth function f(x) must be determined so that it fits each target value exactly, that is, f(x_i) = t_i. To this end, f(x) can be represented as a linear combination of radial basis functions, each centered on a data point (see the sketch at the end of this subsection):
$$f(\mathbf{x}) = \sum_{i=1}^{N} w_i\, h(\|\mathbf{x} - \mathbf{x}_i\|)$$
Among them, the coefficients {w_i} can be calculated by the least squares method. Because the number of coefficients equals the number of constraints, the resulting function f(x) fits each target value exactly.

Overfitting
1. A problem that arises in the training of neural networks: the performance on the training set is acceptable, but the results on test samples that do not appear in the training set decline significantly. It can also be understood as meaning that the
trained neural network is too dependent on the samples in the training set and has failed to generalize its learning results to new samples that have not been encountered before.
2. A situation during training and testing. If the performance of a model on an independent test set drawn from the same data-generating distribution is worse than on the training set, the model is said to overfit the training set.
3. In curve fitting, the fitting result fits only the given sampling points but matches other points badly, resulting in a poor fit of the overall curve.

Dropout
A technique to overcome the problem of overfitting. It randomly disconnects some nodes of the neural network during training, thereby slightly changing the structure of the neural network, preventing over-adaptation to a fixed parameter group and improving the generalization ability.

Underfitting
A stage or process in data fitting. Suppose there is a set of models of monotonically increasing complexity for a data set, such as the order of the polynomial in polynomial regression or the number of hidden units in a multilayer perceptron network. Because the model complexity changes from low to high, the model will underfit the data generator at the beginning, that is, it lacks sufficient flexibility to model the structure in the data. As the model becomes more complex, at some optimal complexity point, the model will begin to overfit the data.
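Returning to the function interpolation entry earlier in this subsection, the following sketch fits the weights w_i by solving the linear system implied by f(x_i) = t_i, using a Gaussian as the radial basis function; the width sigma and the sample data are illustrative assumptions.

```python
import numpy as np

def rbf_interpolate(X, t, query, sigma=1.0):
    """Exact-fit RBF interpolation: one Gaussian basis per data point,
    weights w solved from H w = t so that f(x_i) = t_i."""
    def h(r):
        return np.exp(-r**2 / (2.0 * sigma**2))
    # pairwise distances between the N training points
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    w = np.linalg.solve(h(D), t)                 # N coefficients, N constraints
    d = np.linalg.norm(query[None, :] - X, axis=-1)
    return float(h(d) @ w)

X = np.array([[0.0], [1.0], [2.0], [3.0]])       # 1-D inputs as an N x 1 array
t = np.array([0.0, 1.0, 0.0, -1.0])
print(rbf_interpolate(X, t, np.array([1.0])))    # reproduces t_1 = 1.0 (up to rounding)
```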
46.5.4 Network Operations

Weight-space symmetry
In a neural network, changing the sign of the weights into a hidden unit and, at the same time, the sign of the weights out of that hidden unit gives the same mapping function. In addition, if the input and output weights of one hidden unit are swapped with the input and output weights of another hidden unit, the mapping function does not change.

Shrinkage
A method of reducing coefficient values in statistics. If the values of parameters are reduced, it can be called parameter shrinkage. It is called weight decay in machine learning and neural network studies, because it causes the weight values to decay toward 0 during sequential learning. In the case of regression with a quadratic penalty, it is called ridge regression.

Parameter shrinkage
See shrinkage.
Weight decay
See shrinkage.

Pooling
A way to reduce the data dimension, similar to subsampling a signal. In training deep neural networks, it also reduces the amount of data that needs to be processed. There are various pooling methods; the more typical ones include average pooling, maximum pooling, and L2 pooling.

Pooling layer
The layer that performs the pooling function in a deep neural network; it is generally located after the convolutional layer and the activation function. It is one of the two basic structural layers in a convolutional neural network. It mainly combines the extracted feature maps, so it is also called the feature mapping layer.

Feature mapping layer
Same as pooling layer.

Maximum pooling
A pooling method in which the value of a pooled neighborhood is replaced by the maximum value in the neighborhood.

Average pooling
A pooling method in which the value of a pooled neighborhood is replaced by the average value in the neighborhood.

L2 pooling
A pooling method in which the values of a pooled neighborhood are replaced by the 2-norm of all values in the neighborhood.

Radial basis function [RBF]
A function whose value depends only on the distance from a center c to the input x, $d(\mathbf{x}) = d(\|\mathbf{x} - \mathbf{c}\|)$. The Euclidean distance is commonly used. A typical radial function is $d(r) = \exp(-ar^2)$, where a is a positive constant. The radial basis function is a spherically symmetric function. Considering the (interpolation) task of reconstructing a surface from a 3-D point cloud, the surface can be expressed as z = f(x, y); if the value at point (x_i, y_i) is g_i, then
$$z = f(x, y) = \frac{\sum_i w_i(x, y)\, g_i}{\sum_i w_i(x, y)}$$
in which the weights are
$$w_i(x, y) = K\big(\|(x, y) - (x_i, y_i)\|\big)$$
where K(·) is the radial basis function.
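A minimal sketch of the pooling entries above (maximum, average, and L2 pooling over non-overlapping neighborhoods); the 2 × 2 pooling size and the test feature map are illustrative.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over size x size neighborhoods
    (max, average, or L2, as in the pooling entries above)."""
    H, W = feature_map.shape
    H2, W2 = H // size, W // size
    blocks = feature_map[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    if mode == "l2":
        return np.sqrt((blocks ** 2).sum(axis=(1, 3)))
    raise ValueError(mode)

fm = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fm, 2, "max"))        # 2 x 2 output, halving rows and columns
```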
Radial basis function network
A neural network consisting of radial basis function units. It can be used for supervised learning, mapping input data x to an output space y. The standard radial basis function network structure includes a hidden layer of radial basis function units connected to the input x; the number of units in this layer can be arbitrary. The outputs of these hidden units are connected to the output layer through a parameter (weight) matrix, possibly followed by additional nonlinearities. A common method for specifying the centers of a radial basis function network is to randomly select a subset of the data points.
46.5.5 Activation Functions

Activation function
1. Describes the nonlinear function predicted by the model. In the statistical literature, the inverse of the activation function is called the link function. When the model is a linear function of the input variables, the model predicts $y(\mathbf{x}) = \mathbf{w}^{\mathrm T}\mathbf{x} + w_0$, so y is a real number. To solve a classification problem, it is necessary to predict discrete class labels or, more generally, posterior probabilities in [0, 1]. To this end, the above model is generalized: the linear function of w is transformed by a nonlinear activation function f(·) to obtain
$$y(\mathbf{x}) = f\!\left(\mathbf{w}^{\mathrm T}\mathbf{x} + w_0\right)$$
2. The decision surface in classification corresponds to y(x) = constant, i.e., $\mathbf{w}^{\mathrm T}\mathbf{x} + w_0$ = constant, so although f(·) is a nonlinear function, the decision surface is a linear function of x. For this reason, the above formula is also called a generalized linear model.
3. The thresholding function in an artificial neural network, usually a smooth function. It is often used after the convolutional layer. Commonly used activation functions include the sigmoid function, hyperbolic tangent function, and rectifier function.

Link function
See activation function.

Sigmoid function
A commonly used activation function, whose representation is (the curve is shown in Fig. 46.42)
$$h(z) = \frac{1}{1 + e^{-z}}$$

Fig. 46.42 Sigmoid function

Its derivative is
$$h'(z) = \frac{\partial h(z)}{\partial z} = h(z)\,[1 - h(z)]$$
The sigmoid function is symmetric about the activation function axis.

Logistic sigmoid function
A non-convex, non-concave function, defined as (its shape is shown in Fig. 46.43)
$$S(x) = \frac{1}{1 + e^{-x}}$$

Fig. 46.43 Logistic sigmoid function

This function is very similar to the probit function with its argument scaled by $\sqrt{\pi/8}$. It can map the entire range of the independent variable into a limited interval and is a "squashing" function. It also satisfies the following symmetry property:
$$S(-x) = 1 - S(x)$$
If you take its logarithm, you get a concave function (which can be verified by calculating the second derivative). The conjugate function of the logistic sigmoid function is
$$g(t) = -t \ln t - (1 - t)\ln(1 - t)$$
This is exactly the binary entropy function of a variable that takes the value 1 with probability t. The inverse of the logistic sigmoid function can be written as (called the logit function)
$$x = \ln\frac{S}{1 - S}$$
Logit function
See logistic sigmoid function.

Softmax function
The multi-class generalization of the logistic sigmoid function. It represents a smoothed version of the maximum function.

Normalized exponential function
Same as softmax function.

Rectifier function
A commonly used activation function, also called the ReLU activation function because it uses a rectifier linear unit. Its representation is (Fig. 46.44)
$$h(z) = \max(0, z)$$

Fig. 46.44 Rectifier function

Its derivative is
$$h'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$
The rectifier function is asymmetric about the (vertical) activation function axis.

ReLU activation function
See rectifier function.

Rectifier linear unit [ReLU]
The unit used in the ReLU activation function.
Hyperbolic tangent function
A commonly used activation function, whose representation is (Fig. 46.45)
$$h(z) = \tanh(z) = \frac{1 - e^{-2z}}{1 + e^{-2z}}$$

Fig. 46.45 Hyperbolic tangent function

Its derivative is
$$h'(z) = 1 - [h(z)]^2$$
The hyperbolic tangent function is symmetric about the activation function axis.
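A small sketch collecting the activation functions and derivatives given in this subsection; the input grid is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    h = sigmoid(z)
    return h * (1.0 - h)                 # h'(z) = h(z)[1 - h(z)]

def tanh_act(z):
    return np.tanh(z)                    # equals (1 - e^{-2z}) / (1 + e^{-2z})

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2         # h'(z) = 1 - h(z)^2

def relu(z):
    return np.maximum(0.0, z)            # h(z) = max(0, z)

def d_relu(z):
    return (z > 0).astype(float)         # 1 if z > 0, else 0

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh_act(z), relu(z), sep="\n")
```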
46.6 Various Theories and Techniques

Research in image engineering is inspired by many other disciplines and is supported by many new theories, new methods, and new technologies (Bishop 2006; Chan and Shen 2005; Zhang 2009).
46.6.1 Optimization

Optimization
A general term used to indicate a search for the parameter values that maximize or minimize a certain quantity.

Optimization parameter estimation
See optimization.

Parameter estimation
A set of techniques used to estimate the parameters of a given parametric model. For example, suppose a set of image points lie on an ellipse. Considering the implicit ellipse model $ax^2 + bxy + cy^2 + dx + ey + f$, the parameter vector $[a, b, c, d, e, f]^{\mathrm T}$ can be estimated by least squares surface fitting.
Well-determined parameters
Parameters whose values are strictly constrained/limited by the original data.

Second-order cone programming [SOCP]
A convex optimization problem. It can be written as minimizing $\mathbf{f}^{\mathrm T}\mathbf{x}$ subject to $\|A_i\mathbf{x} + \mathbf{b}_i\|_2 \le \mathbf{c}_i^{\mathrm T}\mathbf{x} + d_i$, $i = 1, 2, \ldots, m$, and $F\mathbf{x} = \mathbf{g}$, where $\mathbf{f} \in \mathbb{R}^n$, $A_i \in \mathbb{R}^{n_i \times n}$, $\mathbf{b}_i \in \mathbb{R}^{n_i}$, $\mathbf{c}_i \in \mathbb{R}^n$, $d_i \in \mathbb{R}$, $F \in \mathbb{R}^{p \times n}$, and $\mathbf{g} \in \mathbb{R}^p$. Here $\mathbf{x} \in \mathbb{R}^n$ is the optimization variable.

Functional optimization
An analysis technique that optimizes (maximizes or minimizes) complex functions of continuous variables.

Constrained optimization
Optimization of a function whose parameters must satisfy certain constraints. More generally, given a function f(x), search for the x that maximizes (or minimizes) f(x) while satisfying g(x) = 0 and h(x) ≥ 0. Here, the functions f, g, and h can all take vector-valued arguments, and both g and h can be vector-valued functions expressing multiple constraints. Optimization subject to equality constraints can be carried out with the help of the Lagrange multiplier technique. Quadratic optimization subject to equality constraints leads to a generalized eigensystem. Optimizing f subject to general g and h can be achieved by iterative methods, such as sequential quadratic programming.

Lagrange multiplier technique
A constrained optimization method that finds a solution to a numerical problem with one or more constraints. The traditional Lagrange multiplier technique determines the parameter vector ω that minimizes or maximizes the function φ(ω) = γ(ω) + λχ(ω), where γ(ω) is the function to be minimized, χ(ω) is a constraint function whose value is zero when the argument satisfies the constraints, and λ is a Lagrange multiplier.

Particle swarm optimization
An optimization method based on simulating the behavior of animal societies, that is, the dynamic change of a group of particles in the search space follows the laws of an animal society. Each particle is governed by the following pair of update equations:
$$\mathbf{v}_{i,t+1} = \mathbf{v}_{i,t} + c_1\,\mathrm{rand}()\,(\mathbf{b}_{i,t} - \mathbf{d}_{i,t}) + c_2\,\mathrm{rand}()\,(\mathbf{d}_{j,t} - \mathbf{d}_{i,t})$$
$$\mathbf{d}_{i,t+1} = \mathbf{d}_{i,t} + \mathbf{v}_{i,t}$$
where d_{i,t} is the position of particle i at time t, v_{i,t} is its velocity, b_i is its best recorded position, d_j is the position of the particle in the group that has the best evaluation score, f(p_i) is an evaluation function that maps particle positions to values, and rand() is a randomly generated floating-point number in [0, 1]. The parameters c_1 and c_2 are the weights of the local and global best values, respectively.
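A minimal particle swarm optimization sketch following the update equations above; the swarm size, iteration count, search range, test function, and the velocity clamping (a common practical safeguard that is not part of the entry's equations) are all assumptions.

```python
import numpy as np

def particle_swarm(f, dim, n_particles=30, iters=200, c1=1.5, c2=1.5,
                   v_max=1.0, seed=0):
    """Minimal PSO: positions d, velocities v, personal bests b, global best g."""
    rng = np.random.default_rng(seed)
    d = rng.uniform(-5, 5, (n_particles, dim))     # positions d_{i,t}
    v = np.zeros((n_particles, dim))               # velocities v_{i,t}
    b = d.copy()                                   # personal best positions b_i
    b_val = np.array([f(p) for p in d])
    g = b[np.argmin(b_val)].copy()                 # best particle in the group
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = v + c1 * r1 * (b - d) + c2 * r2 * (g - d)
        v = np.clip(v, -v_max, v_max)              # clamping: practical safeguard
        d = d + v
        vals = np.array([f(p) for p in d])
        better = vals < b_val
        b[better], b_val[better] = d[better], vals[better]
        g = b[np.argmin(b_val)].copy()
    return g, float(f(g))

best, value = particle_swarm(lambda p: np.sum(p ** 2), dim=2)
print(best, value)                                 # best position found so far
```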
Levenberg-Marquardt optimization
A numerical multivariate optimization method that smoothly switches between gradient descent, far from the local optimum, and the second-order Hessian method.

Newton optimization
An effective optimization method. The goal is to determine a local minimum of a function f: ℝⁿ → ℝ starting from a position x₀. Given the gradient ∇f of the function and the Hessian matrix H at x_k, the Newton update is $\mathbf{x}_{k+1} = \mathbf{x}_k - H^{-1}\nabla f$. If f is a quadratic form, then a single Newton step directly gives the global minimum. For general f, repeated Newton steps converge to a local optimum.

Derivative-based search
A numerical optimization method used when the gradient can be estimated. A typical example is the quasi-Newton method, which generates estimates of the inverse Hessian matrix in order to determine the next iteration point.

Conjugate direction
An optimization framework in which a set of independent directions is determined in the search space. Consider a pair of vectors u and v: if $\mathbf{u}^{\mathrm T} A \mathbf{v} = 0$ holds for a given matrix A, then u and v are called conjugate. The conjugate direction optimization method designs a series of optimization directions that are conjugate with respect to a reference matrix, but determining them does not require that matrix explicitly.

Conjugate gradient
A basic numerical optimization technique. The minimum of a numerical objective function is determined by iteratively descending along mutually non-interfering (conjugate) directions. The conjugate gradient method does not need to calculate second derivatives and can determine the optimum of an N-D quadratic form in N iterations. In contrast, the Newton method requires only one iteration, while the gradient descent method may require an arbitrarily large number of iterations.

Pareto optimal
When solving a multi-objective problem, a solution of a non-trivial problem in which all objectives cannot be optimized simultaneously by a single solution. A Pareto point is said to occur if one goal cannot be further improved without another goal deteriorating at the same time. The solution of the multi-objective problem is an infinite set of Pareto points.
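A minimal sketch of the Newton update x_{k+1} = x_k − H⁻¹∇f from the Newton optimization entry; the quadratic test function illustrates the single-step property mentioned there, and its particular values are arbitrary.

```python
import numpy as np

def newton_minimize(grad, hess, x0, iters=20):
    """Newton's method: x_{k+1} = x_k - H^{-1} grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        step = np.linalg.solve(hess(x), grad(x))   # solve H * step = grad
        x = x - step
    return x

# Quadratic example f(x) = 0.5 x^T A x - b^T x: one Newton step reaches the minimum
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
print(newton_minimize(grad, hess, [5.0, 5.0], iters=1))   # equals solve(A, b)
```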
46.6.2 Kernels

Kernel
(1) A small matrix of values used in an image convolution operator.
(2) Structuring elements for operations of mathematical morphology. See kernel function.

Kernel trick
A method of generalizing various algorithms based on defining an inner product through the kernel. Here, the algorithms are those in which the input vectors appear only as scalar products. The basic idea of the generalization is to replace the scalar product with another kernel, so as to obtain the corresponding kernelized algorithm. A typical example is kernel principal component analysis. If an algorithm depends on the input data vectors only through inner products between pairs of vectors, the inner product can be replaced by a kernel function and the computation moved to the feature space. This method can be used in kernel principal component analysis and kernel Fisher discriminant analysis.

Kernel substitution
Same as kernel trick.

Kernel bandwidth
The size of the kernels used in methods based on kernel techniques.

Kernel-matching pursuit learning machine [KMPLM]
A machine learning method based on the kernel trick. Compared with the SVM, its classification performance is equivalent, but it has a sparser solution.

Kernel function
1. A function in an integral transform (such as the exponential term in a Fourier transform).
2. An operation function for pixels in an image (see convolution operator).
3. A function that converts the inner product operation in a high-dimensional space into a function calculated in the low-dimensional input space. In the mean shift method, the role of the kernel function is to make contributions to the mean shift that differ according to the distance x between the feature point and the mean.
4. A special class of binary functions that can be expressed as $K(\mathbf{x}, \mathbf{y}) = \langle f(\mathbf{x}), f(\mathbf{y})\rangle$, where x and y are defined in an N-D space (such as a feature space), f(·) is a mapping function from the N-D space to an M-D space, often with M ≫ N, and ⟨·, ·⟩ represents the inner product. In practice the kernel function generally satisfies symmetry and positive definiteness and represents the distance between the two mapping results.
5. A function k(x, x′) with two inputs x and x′, mapping x and x′ into ℝ, such that the resulting Gram matrix is positive definite for any set of points. This function quantifies, in a sense, the similarity of the inputs x and x′. The kernel function can also be regarded as a dot product in the feature space of the nonlinear mapping from x to ϕ(x). For example, it can be used in kernel Fisher discriminant analysis and support vector machines. The radial basis function is a typical kernel function, which can be written as $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2)$. The kernels commonly used to define kernel functions include the Fisher kernel, Gaussian kernel, homogeneous kernel, and so on.
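As a rough illustration of sense 5 of the kernel function entry and of the kernel trick (not from the handbook), the following builds the Gram matrix of a small point set with a Gaussian kernel; the data points and σ are arbitrary.

```python
import numpy as np

def gaussian_kernel(x, x2, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), a typical kernel function."""
    return np.exp(-np.sum((x - x2) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    """Gram matrix K with K[i, j] = k(x_i, x_j); algorithms written purely in
    terms of inner products can use K in place of explicit feature mappings."""
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(gram_matrix(X, gaussian_kernel))     # symmetric, ones on the diagonal
```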
Homogeneous kernel
A type of kernel used in a kernel function. It is a function only of the magnitude of the distance (the Euclidean distance is often used) between the input variables, which is more general than a fixed kernel, and can be written as $k(\mathbf{x}, \mathbf{x}') = k(\|\mathbf{x} - \mathbf{x}'\|)$.

Radial basis kernel
Same as homogeneous kernel.

Stationary kernel
A type of kernel used in a kernel function. It is a function only of the difference of the input variables and is therefore unaffected by translations of the input space; it can be written as $k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}')$.

Gaussian kernel
A transform kernel with a Gaussian probability density. Its form is
$$k(\mathbf{x}, \mathbf{x}') = \exp\!\left\{-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2\right\}$$
Because it is not considered here as a probability density, the normalization coefficient is omitted.

Graph kernel
A kernel function that returns a scalar describing the similarity between two graphs.

Graph embedding
A mapping from a graph to a vector space. The extracted features can be used to define a graph kernel.

Kernel learning
Methods for determining the parameters of a kernel function. Methods such as support vector machines and Gaussian process regression rely on a kernel function, where the kernel may have some free parameters that need to be determined. Similarly, the method of linearly combining kernels is called multi-kernel learning, and the relative weights of the different kernels need to be determined.

Kernel density estimation
A simple method to estimate a density function given a sparse set of samples. The data are smoothed, that is, convolved with a fixed kernel of width h, which can be written as
$$f(\mathbf{x}) = \sum_i K(\mathbf{x} - \mathbf{x}_i) = \sum_i k\!\left(\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{h^2}\right)$$
where the x_i are the input samples and k(r) is the kernel function.
Kernel density estimator
A class of density models used for density estimation. Consider a small hypercube region R centered on a point x, and estimate the probability density of the point distribution there. In order to count the total number of points K in R, the following function can be defined (D is the dimensionality of the data points):
$$k(\mathbf{u}) = \begin{cases} 1 & |u_i| \le 1/2,\; i = 1, 2, \ldots, D \\ 0 & \text{otherwise} \end{cases}$$
It represents a unit cube centered at the origin and is a typical kernel function, also called a Parzen window here. From the above formula, for a point x_n, if it falls inside a cube of side length h centered at x, then k[(x − x_n)/h] = 1; if it falls outside this cube, then k[(x − x_n)/h] = 0. The total number of points K falling in the cube is
$$K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$$
Substituting the above formula into the density estimation formula of K-nearest neighbors, the following kernel density estimator (also called the Parzen estimator) is obtained:
$$p(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{h^D}\, k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$$
where $h^D = V$ is the volume of a hypercube with side length h in D-dimensional space. Considering the symmetry of the function k(u), the above formula can also be interpreted as the average, over N cubes centered on the data points x_n, of the number of points falling in each cube. The kernel function k(u) can take other forms, as long as it satisfies the following two conditions:
$$k(\mathbf{u}) \ge 0, \qquad \int k(\mathbf{u})\, d\mathbf{u} = 1$$
The first condition guarantees that the resulting probability distribution is non-negative everywhere, and the second guarantees that it integrates to 1.

Parzen window technique
Same as kernel density estimation.

Parzen estimator
Same as kernel density estimator.
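A minimal kernel density estimation sketch in the spirit of the Parzen estimator above, but with a Gaussian kernel in place of the hypercube kernel; the bandwidth h and the synthetic data are illustrative assumptions.

```python
import numpy as np

def parzen_density(x, samples, h):
    """Kernel density estimate p(x) = (1/N) sum_n (1/h^D) k((x - x_n)/h),
    here using a Gaussian kernel k."""
    samples = np.atleast_2d(samples)
    N, D = samples.shape
    u = (x - samples) / h
    k = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2.0 * np.pi) ** (D / 2.0)
    return float(np.mean(k) / h ** D)

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 1))                  # samples from N(0, 1)
print(parzen_density(np.array([0.0]), data, h=0.3))   # roughly the N(0,1) density at 0 (~0.4)
```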
Parzen window
1. A nonparametric method for density estimation. Given a sample {x_i}, i = 1, 2, . . ., n, drawn from some underlying density, the Parzen window estimate can be obtained as
$$p(\mathbf{x}) = \frac{1}{n}\sum_i K(\mathbf{x}, \mathbf{x}_i)$$
where K is a kernel function; a Gaussian is a commonly used kernel.
2. A triangular weighting window, often used to limit leakage into spurious frequencies when calculating the power spectrum of a signal.
3. The probability density function, i.e., the kernel function, in mean shift. See kernel density estimator.

Mercer condition
The condition that a given kernel must satisfy so that it can be represented as the combination of a nonlinear mapping and an inner product.
46.6.3 Stereology

Stereology
The science that studies the structure of 3-D entities and the geometric relationships between their 2-D images. In typical stereology applications, the information used to study a structure is provided by a set of 2-D images obtained from the structure. These images can come from a structured reflective surface, from a transmissive solid section, or from a projection of the solid shape.

Principles of stereology
The basic rules in stereology for the quantitative estimation of the total number of geometric units inside the object of interest. In stereology methods, the geometric properties of the units in 3-D space can be obtained or measured with the help of probes. Different probes are needed for different structures in 3-D space in order to generate events and obtain counts. Examples of different combinations of probes and scenes can be found in Table 46.1.

Structure elements of stereology
The types of units in stereology that make up a 3-D space scene, including the 3-D scene, 2-D surface, 1-D line segment, and 0-D point. See principles of stereology.

Geometric properties of structures
The geometric properties of stereology structural elements. They can often be divided into two categories: metric properties and topological properties.
Table 46.1 Different combinations of probes and scenes

Probe             Point (0-D)     Line (1-D)      Surface (2-D)   Volume (3-D)
Scene             Volume (3-D)    Surface (2-D)   Curve (1-D)     Point (0-D)
Countable event   One event       One event       Two events      Three events

Note: Terms indicated in bold in the table are included as headings in the text and can be consulted for reference
Probe
In stereology, a geometric primitive inserted into space to record its intersections with the structure of interest, in order to measure or obtain the geometric properties of the spatial units. The most commonly used stereology probes include points, lines, surfaces, and volumes (including two closely spaced parallel planes). See principles of stereology.

Stereoscopy
A method of artificially creating a sense of space (the stereoscopic phenomenon). The main method is to make two pictures corresponding to the perspective views of the actual object as seen by the left and right eyes, respectively.

Stereoscopic phenomenon
The sense of depth caused by the differing images that objects at different distances form on the retinas of the two eyes. It does not include the depth sensation obtained through monocular effects or binocular convergence. The stereoscopic phenomenon can only sense the difference in depth between two relatively close objects; it cannot estimate their absolute distance. In addition, five conditions must be met for a correct stereoscopic phenomenon: (1) the two objects must not shield each other or be obscured by other objects; (2) the two objects must not be too far apart in the horizontal direction; (3) the two objects must not be too far apart in the vertical direction; (4) if the two objects are connected by a line, the left and right eyes should not observe from opposite sides of that straight line; and (5) if the objects are in a
straight line and this line is parallel to the two eyes, there will be no stereoscopic phenomenon.

Stereoscopic radius
The radius of a circle, centered on the observer, within which the stereoscopic phenomenon occurs. Outside this circle no stereoscopic phenomenon is produced, because the stereo parallax that objects beyond this circle form for the two eyes is too small, i.e., as small as the stereoscopic acuity. For example, with a stereoscopic acuity of 10″ and an interpupillary distance of 65 mm, the stereoscopic radius is (0.065/10) × 206,000 ≈ 1339 m (206,000 is approximately the number of arcseconds in a radian).

Stereoscopic acuity
A concept related to stereo parallax when observing with stereoscopy. Let the angle subtended by the baseline (i.e., the interpupillary distance) at a distant object be the parallax angle Δ of the object. For the same baseline b, objects at different distances R have different parallax angles. The smallest difference between two parallax angles that an observer can still distinguish is called the stereoscopic acuity, δΔ = b/R. The stereoscopic acuity is generally taken as 10″, and experienced observers can reach 5″. Given the stereoscopic acuity, it can be converted into a distance (depth) via R = b/δΔ.
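A tiny sketch reproducing the stereoscopic radius arithmetic above; the constant 206,265 (arcseconds per radian) replaces the rounded 206,000 used in the entry.

```python
ARCSEC_PER_RADIAN = 206265.0          # ~206,000 in the rounded example above

def stereoscopic_radius(baseline_m, acuity_arcsec):
    """R = b / (delta_Delta in radians), as in the worked example."""
    return baseline_m * ARCSEC_PER_RADIAN / acuity_arcsec

print(round(stereoscopic_radius(0.065, 10.0)))   # about 1341 m (1339 m with 206,000)
```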
46.6.4 Relaxation and Expectation Maximization

Relaxation
A technique for assigning values from a continuous or discrete set to the nodes of a graph (or grid) by diffusing the effects of local constraints. The grid can be an image grid, in which case the nodes correspond to pixels or features (such as edges or regions). In each iteration, each node interacts with its neighbors and changes its value based on the local constraints. As the number of iterations increases, the effect of the local constraints gradually spreads to more and more distant parts of the grid. When there is no further change, or the changes become negligible, the process has converged.

Slack variable
In an optimization problem, a variable added to convert an inequality constraint into an equality constraint. Slack variables are used to replace an inequality constraint with the combination of an equality constraint and a non-negativity constraint.

Equality constraint
A constraint expressed as an equation, for example f(x) = 0.

Inequality constraint
A constraint in the form of an inequality, for example f(x) ≥ 0.
Relaxation labeling
A relaxation technique that assigns labels from a discrete set to grid nodes or graph nodes. A traditional example in artificial intelligence is Waltz line labeling. See line-drawing analysis.

Relaxation matching
A relaxation labeling technique for model matching, whose purpose is to label (match) each model primitive against the scene primitives. Starting from an initial labeling, the algorithm uses a coherence measure to iteratively reconcile adjacent labels. See discrete relaxation, relaxation labeling, and probabilistic relaxation labeling.

Expectation Maximization [EM]
An algorithm for calculating the maximum likelihood estimate of a latent variable model from sampled data. It contains two steps: the E step is the expectation step, which calculates the posterior probability of the latent variables based on the current parameter estimates; the M step is the maximization step, which maximizes the expectation of the complete-data likelihood under the posterior probability obtained in the E step. The EM algorithm decomposes a difficult maximum likelihood problem into two steps, each of which is easier to implement. The method works well even when some values are missing.

Expectation step [E step]
The E step in expectation maximization.

Maximization step [M step]
The M step in expectation maximization.

Expectation conditional maximization [ECM]
A general EM method that is an extension of the expectation maximization algorithm. It performs several constrained optimizations in place of each M step of the EM method. For example, the parameters can be divided into groups, and the M step decomposed into multiple steps, each of which optimizes one subset while keeping the other subsets fixed.

Generalized expectation maximization [GEM]
An extension of the expectation maximization algorithm that addresses the case in which the M step is difficult. Instead of maximizing L(q, θ) with respect to θ, it changes the parameters so as to increase its value. In addition, since L(q, θ) is a lower bound of the log-likelihood function, each complete EM cycle of the generalized expectation maximization algorithm is guaranteed to increase the log-likelihood value.

Monte Carlo EM algorithm
An expectation maximization (EM) algorithm implemented with the Monte Carlo method. Consider a model with a latent variable Z, a visible (observed) variable X, and a parameter θ. In the M step, the function to be optimized with respect to θ is the expectation of the complete-data log-likelihood, which can be written as
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \int p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})\, \ln p(\mathbf{Z}, \mathbf{X} \mid \boldsymbol{\theta})\, d\mathbf{Z}$$
The sample set $\{\mathbf{Z}^{(l)}\}$ is obtained by sampling from the estimate of the current posterior distribution $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})$. A finite sum over the elements of this sample set can be used to approximate the above formula:
$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) \approx \frac{1}{L}\sum_{l=1}^{L} \ln p(\mathbf{Z}^{(l)}, \mathbf{X} \mid \boldsymbol{\theta})$$
Next, the Q function can be optimized using the methods commonly used in the M step. This process is called the Monte Carlo EM (expectation maximization) algorithm.

Stochastic EM
A special case of the Monte Carlo EM algorithm, in which a finite mixture model is considered and only one sample is drawn at a time in the E step.

Data augmentation
1. The process or technique of introducing unobserved data (implicit or auxiliary variables) into the original data. This strategy is used in the expectation maximization (EM) algorithm to maximize the likelihood function, and in Markov chain Monte Carlo methods.
2. A Bayesian algorithm for calculating the posterior distribution of a parameter vector. Similar to the expectation maximization (EM) algorithm, it also includes two steps: an estimation step (I step, corresponding to the E step) and a posterior step (P step, corresponding to the M step).
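As a concrete illustration of the E and M steps (not the handbook's example), the following sketch runs EM for a two-component 1-D Gaussian mixture; the initialization, iteration count, and synthetic data are assumptions.

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture:
    E step computes responsibilities, M step re-estimates the parameters."""
    mu = np.array([x.min(), x.max()])            # crude initialization
    var = np.array([x.var(), x.var()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E step: posterior probability (responsibility) of each component
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: maximize the expected complete-data log-likelihood
        Nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return pi, mu, var

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])
print(em_gmm_1d(x))     # mixing weights near (0.6, 0.4), means near (-2, 3)
```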
46.6.5 Context and RANSAC

Context
Information, knowledge, etc., related to or accompanying certain data and contributing to the full meaning of the data. For example, the spatial context of a pixel in a video sequence can be the intensities of its surrounding pixels in the same frame, and the temporal context of that pixel can be its intensities at the same position (same image coordinates) in the previous and subsequent frames. Some information is ambiguous in the absence of context. For example, from the differential optical flow only the normal flow can be estimated, and estimating the complete flow requires considering the spatial context of each pixel. At the scene understanding level, if it is known from contextual information that the image data come from a theater performance, it is easy to distinguish between an actual fight and a staged action.
Contextual method
A method to search for new features by considering the spatial arrangement or distribution of the features already detected in the image.

Contextual information
Additional information about the image, video, or scene being analyzed that can help evaluate the output. It can be obtained by assisted perception, by a priori labeling of the content, or by removing the primary object of interest from the scene background. See context-aware algorithm.

Contextual knowledge
A priori contextual information about a given image or video, for example, the camera viewing angle or the distance to the object of interest. See contextual information.

Context-sensitive
Describes image processing operations that are adjusted according to the image content; also called content-related.

Context-aware algorithm
Technologies that span a range of domains such as visual saliency, object detection, object tracking, and object recognition. Their common feature is that, in addition to considering the specific subject of the task, they also use contextual information to maximize efficiency. For example, awareness of crossing regions in the scene is used in pedestrian detection. More generally, additional sensor information is used to provide context (e.g., the Global Positioning System provides location clues to mobile devices).

Context dependent
The processing result or output obtained using a context-aware algorithm.

Canonical direction
The simplest and most obvious direction given the context (a priori environment or domain knowledge), or the direction of smallest amplitude. For example, the canonical motion direction implied on a sports field is forward.

Random sample consensus [RANSAC]
A technique, strategy, or process that selects a small set of deterministic samples and then uses it to validate a large number of samples. In the case of many outliers, it first finds a set of interior points and proceeds from there. This overcomes the problem that iteratively reweighted least squares does not easily converge to the global optimum when there are many outliers. It is a general method for fitting models to data containing outliers. Its goal is to determine which points in the data are outliers and to remove them from the final model fit. Random sample consensus repeatedly attempts to fit the model using a random subset of the data, in the expectation that, sooner or later, a subset without outlier points will be found that fits a good model. To increase the probability of this happening, it chooses the smallest-size subset that can uniquely fit the model. After selecting the
smallest size data subset and fitting it, random sample consensus judges the quality of the fit. Specifically, each data point is determined to be an interior point or an outlier point according to the fitting result. This requires some knowledge of the expected variation of the real model: if the distance of a data point from the model exceeds the expected variance (by, say, two or three times), it is classified as an outlier point; otherwise it is classified as an interior point. For each subset, the interior points are counted. The above process is repeated many times: in each iteration, a minimum subset is randomly selected to fit the model, and the number of interior points satisfying the model is counted. After a predetermined number of iterations, the subset with the most interior points is selected to calculate the required model. The complete random sample consensus algorithm includes the following steps (a minimal code sketch is given after the Preemptive RANSAC entry below):
1. Randomly select a smallest data subset.
2. Use this minimum subset to estimate the model parameters.
3. Calculate the number of interior points after the model is fitted.
4. Repeat the first three steps a predetermined number of times.
5. Re-estimate the parameters based on all the interior points in the best fit result.
Think of it as a robust estimator that takes into account the effects of outliers in the data used. Commonly used in least squares estimation problems (such as parametric surface fitting problems). First consider the minimum-sized subset of data needed to solve the problem, and then look for statistical consistency in the results. See least median of squares estimation, M-estimation, and outlier rejection. Inlier A statistical point whose distance from the expected value is less than the expected distance in the statistics. For example, in RANSAC technology, the point where the difference from the predicted value is less than a given threshold. Samples that fall within a hypothetical probability limit (such as 95%). Compare outlier. Outlier Special data that is statistically significantly different from other data. In the point set distribution, points that are far away from the cluster of points. If a data set meets certain rules except for individual points or can be well expressed by a model, then these individual points are outliers. Classification of points into outliers requires consideration of both the model used and the statistical characteristics of the data. Progressive sample consensus [PROSAC] An improved random sample consensus. Add the most confident match to the random sample first, so that the process of finding a good set of interior points can be accelerated statistically. Preemptive RANSAC An improved random sample consensus. By selecting only a few samples in the initial iteration and obtaining satisfactory hypothetical results from them for subsequent scoring and selection, the efficiency is improved.
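The following is a minimal sketch of the five-step random sample consensus procedure listed above, applied to fitting a straight line to 2-D points. The function name, inlier threshold, and iteration count are illustrative choices and not part of the handbook entry.

```python
"""Minimal RANSAC sketch for 2-D line fitting (illustrative only)."""
import numpy as np

def ransac_line(points, n_iter=100, inlier_thresh=1.0, seed=None):
    """Fit y = a*x + b to `points` (N x 2 array) with random sample consensus."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        # 1. Randomly select a smallest data subset (two points define a line).
        i, j = rng.choice(len(points), size=2, replace=False)
        (x1, y1), (x2, y2) = points[i], points[j]
        if np.isclose(x1, x2):
            continue                       # degenerate minimal sample, skip it
        # 2. Estimate the model parameters from the minimal subset.
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # 3. Count the interior (inlier) points for this hypothesis.
        residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
        inliers = residuals < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
        # 4. Repetition is handled by the surrounding loop.
    # 5. Re-estimate the parameters from all inliers of the best hypothesis.
    a, b = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], deg=1)
    return a, b, best_inliers
```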
Least median of squares [LMS] For a set or series of residuals, the median of their squared values. For example, when using the random sample consensus algorithm, the interior points must first be determined, that is, the points within a circle of radius ε around the predicted position, i.e., those whose residuals satisfy ||ri|| ≤ ε. The median value of ||ri||² can then be calculated; the random selection process is repeated N times, and the sample set with the largest number of interior points (or the smallest median residual) is taken as the final result. See Theil-Sen estimator. Theil-Sen estimator A robust estimator for curve fitting. Consider a family of curves with parameters a1, ..., ap used to parameterize and fit the data x1, ..., xn. If q is the smallest number of points that can uniquely define a1, ..., ap, then the Theil-Sen estimate of the optimal parameters b1, ..., bp is obtained from the median of the estimates taken over all subsets of q points. For line fitting, for example, the number of parameters (slope and intercept) is p = 2, and the number of points that define a fit is also q = 2. In this way, the Theil-Sen estimate of the slope is the median of the two-point slope estimates over all pairs of points (a minimal sketch follows the Chirality entry below). Theil-Sen estimators are not statistically efficient and do not have particularly high breakdown points, in contrast to estimators such as RANSAC and least median of squares. Chirality The property that an object does not coincide with its mirror image. In stereo vision, it means that a reconstructed point should be in front of all cameras, that is, at a positive distance along the line of sight from the camera. Chirality can be used to help determine the direction (sign) of rotation and translation. It can also be used in RANSAC to distinguish between possible and impossible configurations. In addition, chirality can be used to transform a projective reconstruction into a quasi-affine reconstruction.
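As referenced in the Theil-Sen estimator entry above, the following is a minimal line-fitting sketch (slope taken as the median of all pairwise slopes, intercept as the median of y − slope·x); the function name and interface are illustrative.

```python
"""Minimal Theil-Sen line-fit sketch (illustrative only)."""
import numpy as np
from itertools import combinations

def theil_sen_line(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]                 # all two-point slope estimates
    slope = np.median(slopes)                  # robust to outlying points
    intercept = np.median(y - slope * x)       # robust intercept estimate
    return slope, intercept
```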
46.6.6 Miscellaneous Adaptive The characteristic that an algorithm adjusts its parameters based on the data at hand to optimize performance. In image engineering, it refers to processing techniques whose parameters can be selected automatically according to the characteristics and changes of the image itself. Many technologies use this feature to improve their applicable range and performance. Typical examples include adaptive contrast enhancement, adaptive filtering, adaptive smoothing, and so on. Adaptivity The ability of a device or algorithm in image engineering to automatically determine the nature of the next operation and its method based on the characteristics of the input image and on previous operation steps and results.
Adaptive reconstruction A data-driven method of generating statistically significant data in a 3-D point cloud region where data may be lost due to sampling issues. Self-organizing map [SOM] A model that represents a 2-D nonlinear manifold as a regular array of discrete points. It is somewhat similar to the K-means method. At the beginning, the prototypes are randomly distributed. During training, they self-organize to approximate a smooth manifold. But unlike the K-means method, the self-organizing map model does not optimize a cost function, so it is difficult to set the model parameters and to assess convergence. Same as Kohonen network. Kohonen network A method for clustering, analyzing, and visualizing high-dimensional multivariate data, which can generate a topological organization of the input data. The response of the entire network to a given data vector can be used as a low-dimensional signature of that data vector. Also called self-organizing map. Generative topographic mapping [GTM] A nonlinear latent variable model that can be trained efficiently (the term also refers to the method of constructing the model). Generative topographic mapping uses a hidden distribution defined by a finite grid of delta functions on the hidden space. Marginalization over the hidden space then only requires summing the contributions of the grid positions. The model thus obtained fits the data set with a 2-D nonlinear manifold and projects the data points back to the hidden space for visualization by evaluating the posterior distribution over the hidden space. GTM can be thought of as a probabilistic version of the self-organizing map. However, GTM optimizes the log-likelihood function, and the resulting model defines a probability density in the data space. It corresponds to a constrained Gaussian mixture model in which the components have the same variance and all mean values lie on a smooth 2-D manifold. Manifolds in GTM are defined on continuous surfaces (rather than on prototype vectors as in the self-organizing map), so it is possible to calculate the magnification factor of a locally scaled manifold corresponding to the fitted data set and the directional curvatures of the manifold. They can be visualized together with the projected data to provide further information about the model. Directional curvatures See generative topographic mapping. Distortion measure [DM] An objective function used in clustering. Considering the data set {xi}, there is a corresponding set of binary indicator variables {rik}, where k = 1, 2, ..., K indicates which of the K categories the data point xi is assigned to. In other words, if the data point xi is assigned to class k, then rik = 1, while rij = 0 for j ≠ k. This is the 1-of-K coding scheme. The distortion measure is then
DM = Σ_{i=1}^{I} Σ_{k=1}^{K} r_ik ||x_i − μ_k||²
It represents the sum of the squared distances of each data point to the mean vector of the class to which it is assigned. To minimize DM, {rik} and {μk} must be determined. This can be done using an iterative technique consisting of two successive steps (one updating rik and the other updating μk). First select some initial values for μk; then, in step 1, fix μk and optimize DM with respect to rik, and in step 2, fix rik and optimize DM with respect to μk. This two-step optimization process is repeated until convergence (a minimal sketch of the iteration is given after the Ambiguity problem entry at the end of this section). Compared with the expectation maximization (EM) algorithm, the two steps here correspond to the expectation step and the maximization step, respectively. Nondeterministic polynomial [NP] problem A problem for which no polynomial-time (polynomial-complexity) solution algorithm has so far been found. Also called a nondeterministic problem of polynomial complexity. Problems that can be solved by a computer (which can only compute and cannot guess) in polynomial time are called polynomial problems (P problems). The P problems are a subset of the NP problems. An NP problem can only be approached by indirect guessing, so it is a nondeterministic problem. There are many problems for which it is difficult to find a polynomial-time solution algorithm (one may not even exist), but if a solution is given, its correctness can be judged in polynomial time. This leads to the second definition of the NP problem: an NP problem is a problem for which the correctness of a solution can be verified in polynomial time. NP-complete A special set of problems in computational complexity that can currently be solved only in exponential time. In the worst case, if the number of input data is N, the computation time is of exponential order of magnitude O(e^N). For the subset of such exponential problems called NP-complete, if an algorithm that executes within polynomial time O(N^p) (for some p) can be found for one of them, then a related polynomial-time algorithm can also be found for any other NP-complete problem. Nondeterministic polynomial complete [NPC] problem A special class of NP problems. All problems belonging to the class of NP-complete problems can be transformed into any one of them in polynomial time. In this way, as long as a polynomial-time solution to one NPC problem is found, all the other problems can be solved in polynomial time. Considering the second definition of the NP problem, if all possible solutions of an NP problem can be judged correct in polynomial time, then it can be called a nondeterministic polynomial complete problem (NPC problem). Ambiguity problem A problem having more than one explanation or understanding. For example, the uncertainty caused by periodically repeating features in stereo matching, or the aperture problem in imaging.
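As referenced under Distortion measure above, the following is a minimal sketch of the two-step minimization of DM (the K-means-style iteration); the function and variable names, the iteration count, and the initialization are illustrative choices.

```python
"""Minimal sketch of the two-step minimization of the distortion measure DM."""
import numpy as np

def minimize_distortion(x, k, n_iter=50, seed=0):
    """x: (N, D) data array; k: number of classes. Returns assignments r, means mu, and DM."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    mu = x[rng.choice(len(x), size=k, replace=False)].copy()   # initial mean vectors
    r = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Step 1: fix mu, assign each point to the class with the nearest mean (1-of-K coding).
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = d2.argmin(axis=1)
        # Step 2: fix r, set each mean to the centroid of the points assigned to it.
        for j in range(k):
            if np.any(r == j):
                mu[j] = x[r == j].mean(axis=0)
    dm = ((x - mu[r]) ** 2).sum()    # the distortion measure DM for the final assignment
    return r, mu, dm
```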
Chapter 47
Optics
Optics is a physical discipline that studies the behavior and properties of light and is also a discipline related to optical engineering technology (Peatross and Ware 2015).
47.1
Optics and Instruments
Both optics and optical instruments are closely related to imaging technology. For example, image acquisition is an optical process first, and the commonly used camera is a typical optical instrument (Forsyth and Ponce 2012; Jähne 2004; Tsai 1987).
47.1.1 Classifications Classical optics A collective term for geometric optics and wave optics. Geometric optics Same as geometrical optics. Geometrical optics A branch of optics. Think of electromagnetic radiation as a pencil of lines that diverges in all directions from the source to describe its propagation approximately. It is an approximation of the wave theory of electromagnetic theory when the wavelength approaches zero. Also called ray optics. Applying the concept of light to study the reflection and refraction of light is the main basis for lens design. Among them, light propagation follows the laws of straight propagation, independent propagation, and reflection and refraction. The basis of geometric optics is that light
travels in straight lines and can only be abruptly bent by reflection or refraction; diffraction is ignored. Geometry optics Same as geometrical optics. Wave optics Starting from the fact that light is a transverse wave, the branch of physics that studies the phenomena and principles of light during its propagation. Also called physical optics. Physical optics Same as wave optics. Engineering optics A collective term for geometric optics and physical optics in engineering applications. Among them, the application of geometric optics involves the measurement of length and angle, inspection of mechanical parts, parallelism and smoothness inspection, accurate measurement of distant objects, alignment of optical axes, and so on. Telecentric optics A lens system designed to move an image plane along an optical axis without changing the magnification or image position of an imaged world point. One implementation is to place the aperture on the front focal plane of the lens (see Fig. 4.7 under telecentric imaging), which ensures that these rays are parallel behind the lens. In this way, when the imaging plane perpendicular to the optical axis moves back and forth along the optical axis, the position of the imaged point on the image plane does not change. Quantum optics The study of light and of the interaction of light and matter at the microscopic level. Information optics The optical branch formed by introducing the linear system theory and Fourier analysis methods commonly used in communication theory into optics, that is, Fourier optics. Information theory is concerned with the transmission of information in a system, and the optical system can be regarded as a communication channel. Information optics discusses the propagation of information in the light field. Photonics Technologies and disciplines that generate and detect light and other radiation. Including radiation emission, transmission, deflection, amplification, and detection. Involves lasers and other light sources, fiber optics, and electro-optical instruments. Infophotonics Interdisciplinary study of photonics and information science and technology. Studies using photons as carriers for propagation (including modulation, transmission),
Fig. 47.1 Relationship between Gaussian optics and pinhole camera model
reflection, refraction, absorption (including flooding), dispersion, polarization, interference, diffraction, scattering, loss, degeneration, trapping, bunching, transition, compression, structure, etc. Techniques and methods for processing (controlling, transforming), amplifying, coupling, storing, detecting, recording, displaying, and alarming such information, as well as vision and various application problems, exploring the qualities of the photon in the time, space, and frequency domains. Gaussian optics An idealized optical system. It is a near-axis (paraxial) approximation obtained by replacing the sine of the angle of incidence with the angle itself in the law of refraction when the angle of incidence is small (so that the law of refraction can be expressed linearly). Also refers to an ideal optical system (when using thick lenses). In Gaussian optics, concentric beams of light converge to a point after passing through a lens system composed of spherical lenses. In practice, phenomena that are inconsistent with Gaussian optics in any optical system can be regarded as aberrations. The goal of optical system design is to maximize the incident angle over which Gaussian optics is satisfied. Compare pinhole camera model. In the Gaussian optical system, there are two projection centers, one in the entrance pupil for the object-side beam and one in the exit pupil for the image-side beam. Since the pinhole camera model has only one projection center, to establish a projection center in the Gaussian optical system it is necessary to move the projection center in the exit pupil to the projection center of the entrance pupil (keeping the object-side field of view angle θ unchanged), as shown in Fig. 47.1. At the same time, it is necessary to virtually move the image plane to a point at a distance c (principal distance) from the projection center Q of the entrance pupil (keeping the size of the image unchanged). The new image-side field of view angle θ" is equal to the object-side field of view angle θ, thus obtaining

c = (h′/h)·s = β·s = f′(1 − β/βP)

where β is the lateral magnification of the lens, f′ is the image-side focal length, and βP is the pupil magnification factor.
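A small numerical sketch of the principal-distance relation as reconstructed above; the focal length, magnification, and pupil magnification are made-up example values, and the relation itself should be checked against the original source.

```python
"""Numeric check of c = f' * (1 - beta / beta_P), the relation quoted above (illustrative)."""

def principal_distance(f_image, beta, beta_p):
    """f_image: image-side focal length f'; beta: lateral magnification;
    beta_p: pupil magnification factor (assumed known)."""
    return f_image * (1.0 - beta / beta_p)

# Example: a 50 mm lens imaging at magnification beta = -0.05
# with pupil magnification beta_P = 1 (a symmetric lens assumed):
c = principal_distance(50.0, -0.05, 1.0)
print(c)   # 52.5 mm: for a finite object distance the principal distance exceeds f'
```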
Paraxial optics An optical branch that studies the geometric imaging laws of the paraxial region of an optical system. Its imaging formula is 1/l′ − 1/l = 1/f′, generally called the Gaussian formula; the other formula is x′x = −f′², generally called the Newton formula. In both formulas, l′ is the image distance, l is the object distance, f′ is the image-side focal length, f is the object-side focal length, x′ is the focal image distance, x is the focal object distance, y′ is the image height, and y is the object height. It can be seen that the vertical (lateral) magnification is β = l′/l = f′/x = x′/f = y′/y, and the axial magnification is α = x′/x. The sign of each quantity is determined according to the optical sign convention. The paraxial imaging formula of geometrical optics is the imaging formula of an ideal optical system, and the aberrations in optical design are the differences obtained by comparison with it. Paraxial optics is also called first-order optics or Gaussian optics (a minimal numerical sketch is given at the end of this subsection). Biological optics A subject that studies the biological body's own luminescence and the effects of low radiation, such as visible light, ultraviolet light, and infrared light, on the biological body. Also called photo-biophysics. Biomedophotonics A discipline that detects the energy released by biological systems in the form of photons. On this basis, the structure, function, and performance information of biological systems reflected by photons can be understood. It also tries to use photons to act on biological systems to make their characteristics, structures, and extracts develop in a direction that meets some specific human needs.
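A minimal numerical sketch of the paraxial (Gaussian) imaging formula quoted in the Paraxial optics entry above; the example distances and the sign convention (object distance negative for a real object in front of the lens) are assumptions for illustration.

```python
"""Sketch of the Gaussian imaging formula 1/l' - 1/l = 1/f' (illustrative values)."""

def image_distance(l, f_image):
    """Solve 1/l' - 1/l = 1/f' for the image distance l'."""
    return 1.0 / (1.0 / f_image + 1.0 / l)

f_prime = 50.0           # image-side focal length f' (mm)
l = -200.0               # object distance (mm); real object in front -> negative
l_prime = image_distance(l, f_prime)
beta = l_prime / l       # lateral (vertical) magnification beta = l'/l
print(l_prime, beta)     # about 66.67 mm and -0.33: real, inverted, reduced image
```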
47.1.2 Instruments Beam splitter An optical system that splits incident unpolarized light into two orthogonally polarized beams (intersecting at 90°), as shown in Fig. 47.2. Spectrophotometer A device or instrument that measures incident ray radiation in a narrow wavelength band and breaks it down into spectral colors and intensity along the spectrum. Its effect is similar to Newton's prism, but a diffraction grating is used in practice. Fig. 47.2 Beam splitter
Absorption spectrophotometer Apparatus and equipment that can give out the spectral intensity of light with different wavelengths after being absorbed by different substances. Imaging spectrometer A device that generates spectrum by passing incident light through a dispersive optical unit (prism or grid). It is a non-scanning spectrometer that is imaged directly onto a linear sensor array. Gonioreflectometer An arrangement of light sources and sensors used to collect intensity data from many angles to model a bidirectional reflectance distribution function for a given surface or target. Fiberscope A flexible fiber optic device that allows observation of the normally invisible part of a target. Mainly used for medical testing. Spatial light modulator A programmable optical filter (often based on liquid crystal technology) that can be programmed to control the amount of light passing through the filter. Can be used for data projection devices. Another application is optical image processing, where filters are placed on the Fourier transform plane to allow frequency domain filtering. Densitometer A hardware device that measures the density (tone quality or lightness) of computer monitors, photographic negatives, transparencies, images, etc. It is an important tool for accurate color management. Coherent fiber optics Bundle many fiber optic elements into a single cable assembly and align the spatial locations of the individual fibers for image transmission. Apparent magnification That is subjective magnification. When using a telescope, microscope, magnifying glass, etc. to observe the scene, both the eye function and the instrument characteristics need to take into account the magnification. This magnification is different from the horizontal, vertical, and angular magnifications defined purely in the lens group and is the ratio of the image size on the retina after using the instrument to the image size directly observed by the eye only. For a telescope, the subjective magnification is equal to the ratio of the objective focal length to the eyepiece focal length. For microscopes and magnifying glasses, the subjective magnification is equal to the ratio of the apparent distance to the focal length of the instrument image, that is, the ratio of 250 mm to the millimeters of the focal distance of the instrument image. Subjective magnification Same as apparent magnification.
Mirror A surface with specular reflection: incident light is reflected only at the same angle as the incident angle and in the plane containing the surface normal (i.e., the angle of incidence equals the angle of reflection). Mirror image An image that has the same size as and is symmetrical to the original, but cannot be made to coincide with it by any rotation or translation. If the original is a right-handed rectangular coordinate system, the mirror image is a left-handed rectangular coordinate system. Catadioptric optics A general method that uses a mirror in combination with a traditional imaging system to obtain a wide viewing angle, such as 180 degrees. It is often desirable for a catadioptric system to have a single viewpoint so that a geometrically correct perspective/projection image can be generated from the acquired image. An optical system that combines refraction and reflection: light is refracted by a transmission lens and then focused by a curved mirror. In such a system, there may not be a single point through which all light passes (some systems using a fisheye lens have a similar situation). Omnidirectional sensing Literally, sensing in all directions simultaneously. In practice, it refers to the use of mirrors and lenses to project most of the lines of sight at a point onto the image of a single camera. It generally cannot see the space behind the mirror and lens. See catadioptric optics. Figure 47.3 shows a camera using a spherical mirror to obtain a very wide field of view. Spherical mirror A mirror whose shape is part of a sphere. Can be used in catadioptric cameras. Fig. 47.3 Omnidirectional sensing schematic
Fig. 47.4 Concave mirror schematic
Concave mirror A mirror used for imaging. The mirror has a concave surface that can be used to reflect light to a focal point. The reflecting surface is rotationally symmetric about the optical axis or the principal axis, and the mirror surface may be a spherical surface, an ellipsoidal surface, a paraboloid, a hyperboloid, or a part of another surface. Also called a converging mirror because it condenses light to a focal point. In the case of a spherical mirror, the midpoint between the center of the sphere C and the vertex is the focal point F of the mirror, as shown in Fig. 47.4. Note that in Fig. 47.4, the image size obtained from the imaged object can be determined by three different constructions, as shown by the three rays of light emerging from the top of the imaged object. Conical mirror A mirror with a tapered (or partially tapered) shape. It is especially useful in robot navigation, because placing a camera facing the apex of a cone and aligning the axis of the cone with the optical axis gives a 360° view. Cone mirrors have previously been used to produce distorted images called "anamorphoses."
47.2
Photometry
Photometry mainly uses various principles and theories of light to help measure the objective world. It often needs to consider the response of the human eye, including physiological factors (Forsyth and Ponce 2012; Jähne 2004; Sonka et al. 2014).
47.2.1 Intensity Photometry The field of optics related to light intensity measurement, that is, the part of optics that measures the quantity or spectrum of light. A branch of radiometry
(considering connections to the human eye's response). A discipline that studies the measurement of light in the processes of emission, propagation, absorption, and scattering. It is the corresponding measurement discipline in the visible light band, taking into account the subjective factors of the human eye. In image engineering, visible light is the most common electromagnetic radiation. Collecting visible light images from scenes requires knowledge related to photometry. The following physical quantities are commonly used in photometry to describe the light energy emitted, transferred, or received: (1) luminous flux, (2) luminous intensity, (3) brightness, and (4) illumination. Photometric measurement studies the intensity of light and its measurement. It also evaluates the visual effects of radiation based on the physiological characteristics of the organs of human vision and certain agreed specifications. The measurement methods are divided into visual measurement (subjective photometry) and instrument-based physical measurement (objective photometry). Subjective photometry directly compares the brightness of the two halves of the field of view and then converts it into objectively measured quantities, such as luminous intensity and luminous flux. Objective photometry uses physical devices instead of human eyes for photometric comparison. In computer vision, commonly used photometric models describe the luminous flux radiated from a surface (which can be physically on a radiation source or virtual) or from the surface of an illuminated object. A widely used photometric model can be found in Lambert's law. Intensity [I] 1. The brightness of a light source. A component of the HSI model. It is proportional to the reflectivity of the surface of the object. If the surface of the object is not colored, there is only a one-dimensional change in brightness. For colored object surfaces, the more white light is added to the color, the brighter the object surface is; the more black is added, the smaller the brightness. 2. Intensity: The total energy radiated from the light source. Measured in watts (W). It is also used for image data that record the brightness received from the scene. Light intensity Same as luminescence intensity. Luminescence intensity A physical quantity that indicates the intensity of visible light radiance emitted by a light source in a certain range of directions. The luminous flux emitted by the light source divided by the total solid angle of space, 4π, gives the average luminous intensity of the light source. The luminous flux emitted by the light source, or by a part of the light-emitting surface, within a solid angle element dω in a certain direction, divided by that solid angle element, gives the luminous intensity in that direction. The unit is candela (cd), or lm/sr; symbol Iv. Luminous intensity Same as luminescence intensity.
Radiance 1. For a point light source, the luminous flux transmitted in a unit solid angle in a certain direction; the unit is candela (cd), 1 cd = 1 lm/sr. 2. For a point radiation source, the radiant flux transmitted in a unit solid angle in a certain direction; unit: watt per steradian (W/sr). 3. The amount of light (emission energy) leaving a surface. Light can be generated by the surface itself, such as the surface of a light source; it can also be generated by the surface reflecting other incident light, such as a non-blackbody surface. The surface can be real, like a wall, or it can be virtual, like an infinite plane. See irradiance and radiometry. 4. Also called emissivity. Unit, W/(sr·m²); symbol, L. Let the radiant flux transmitted by the small surface element dA at a point on the surface of the object in a given direction be dΦ. This direction is represented by a small cone element with solid angle dΩ. Then the transmitted radiance is L = dΦ(dΩ·dA·cosθ)⁻¹. Here dA·cosθ represents the projection of the small facet dA onto the plane perpendicular to the considered direction, and θ is the angle between the small facet normal and the projection direction. 5. Same as radiation intensity and radiation luminance. Radiation luminance The radiation intensity of a point on the surface of a radiation source in a certain direction. Unit: watt per steradian per square meter (W/(sr·m²)). Differential threshold When measuring luminous intensity, the smallest difference in intensity between two stimuli that can be perceived (distinguished). Photometric brightness An old name for brightness that is no longer used. Filter method 1. A method of using filters to achieve image processing. For example, use a high-pass filter to sharpen the image and a low-pass filter to eliminate noise. 2. A simple method for feature selection. The selection of features is taken as a preprocessing step of a learning method, calculating univariate statistics on a feature-by-feature basis. This type of method is simple to calculate, but because it does not consider the relationships between variables or the nature of the learning method trained on the features, it may obtain only inaccurate sub-optimal results. 3. A measurement method in photometry and colorimetry. Applying a filter makes two comparable light sources the same in terms of luminosity or chromaticity. The filter performance must then be accurately expressed so that the test data are accurate. This filter is called a compensation filter. Photon An electromagnetic radiation energy packet. Einstein proposed it in 1917 to explain the photoelectric effect.
Light quantum The energy of photons in electromagnetic radiation (composed of positive and negative particles of the same size and moving at the speed of light).
47.2.2 Emission and Transmission Luminous emittance Luminous flux per unit area of the light source. Unit, lm/m2; symbol, Mv. Luminous exitance Same as luminous emittance. Emissivity A metric of a surface that is suitable for measurements that emit or radiate electromagnetic waves. Expressed as the ratio of surface thermal excitation to blackbody thermal excitation at the same temperature. The emissivity of a black body is 1, and the emissivity of natural materials is between 0 and 1. Luminous efficacy 1. Radiated light efficiency: The quotient of the total luminous flux divided by the total radiant flux. It is a measure of the effectiveness of radiation-induced perception of the human eye. 2. Lighting system light efficiency: The quotient of the total luminous flux divided by the input power of the bulb. The unit is lm/W. Luminous flux The radiant flux (visible radiant energy corrected by the eye) of the human eye’s visual characteristics as specified by the International Committee on Illumination, or the radiant flux received by a physical receiver with a spectral sensitivity of V(λ), is the amount of light of all wavelengths that pass through a certain spatial cross section per unit time. It is proportional to the perceived brightness. Also called optical radiation power and optical radiation quantity. It is equal to the radiant flux radiated by a candlelight source into a solid angle of one steradian. Unit, lm; symbol, Φv. Luminous transmittance The ratio of the amount of luminous flux transmitted from an object to the amount of luminous flux incident on this object. Transmittance The ratio of radiant luminous flux to incident luminous flux under specific conditions. The ratio of the power (output) through the transparent object to the incident power (input). It is not the (intrinsic) nature of the material, but it also depends on the measurement setup for light transmission. For example, for the transmittance of a thin plate, it is necessary to consider the emission of radiant flux on the front and back. See reflectivity and transmissivity.
Transmissivity 1. A descriptor based on a region density feature. It is given by the ratio of the luminous flux that penetrates the object to the luminous flux incident on the object. 2. The ratio of the radiant flux passing through the surface of an optical medium to the radiant flux incident on the surface. It is a property of the material that is considered only for a single surface. See reflectivity and transmittance. Transmission The phenomenon of light passing through an object after it is irradiated onto the object, that is, the situation where the optical radiation is received through the object. Transmittance space A color space with a lower correlation between the three channels than the RGB color space. Also called T space. It can be converted from RGB color space. Suppose T space is represented by T = [T1 T2 T3]^T and RGB color space is represented by C = [R G B]^T; then T = MC, where
M = [0.0255 0.1275 1.0965; 0.3315 0.5610 0.1785; 1.5045 0.3825 0.0510]
Here, to convert [255; 255; 255] in RGB space to T space and still [255; 255; 255], multiply each term by 255. Transparency A property of the surface of a scene that radiation (such as visible light) can penetrate, with the result that objects on the other side of the surface can be seen. Nontransparent surfaces are opaque. A property of a substance that reflects its ability to allow light (radiation) to penetrate. A highly transparent substance has a weak ability to absorb light, so most light is relatively easy to penetrate. Compare opacity. Opaque When light cannot pass through a substance or structure. This causes shadows and occlusions. Opacity A property of the substance itself that reflects its ability to absorb light (radiation). Materials with high opacity have a strong ability to absorb light, so light is more difficult to transmit. Compare transparency.
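As an illustration of the RGB-to-T-space conversion T = MC given in the Transmittance space entry above; note that the 3 × 3 arrangement of the printed coefficients of M is a reconstruction of the garbled layout in the source and should be treated as indicative only.

```python
"""Sketch of the T = M C conversion from the Transmittance space entry (values indicative)."""
import numpy as np

M = np.array([[0.0255, 0.1275, 1.0965],
              [0.3315, 0.5610, 0.1785],
              [1.5045, 0.3825, 0.0510]])

def rgb_to_t(rgb):
    """rgb: length-3 vector C = [R, G, B]^T; returns T = [T1, T2, T3]^T."""
    return M @ np.asarray(rgb, dtype=float)

print(rgb_to_t([255, 0, 0]))   # response of the three T channels to pure red
```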
Translucency Transmission of light through a diffusive interface such as frosted glass. There are multiple possible exit directions for the incident light entering the translucent material.
47.2.3 Optical Properties of the Surface Lambert's law The lightness and darkness observed on an ideal diffuse reflection surface is independent of the position of the observer, but changes with the angle θ between the surface normal and the direction of the light source, as shown in Fig. 47.5.
Fig. 47.5 Lambert's law schematic (light source, surface normal, angle θ, camera)
Lambert's cosine law Cosine law of luminous intensity. For a completely diffuse surface element dS, the luminous intensity Iθ in any direction is proportional to the cosine of the angle θ between that direction and the surface normal, that is, Iθ = I0·cosθ, where I0 is the luminous intensity along the normal. By definition, I0 = L·dS, where L is the brightness, so Iθ = L·dS·cosθ, where dS·cosθ is the effective area of the surface in the θ direction. Lambert's law A law that describes the absorption of ray radiation passing through a certain distance of a medium. If the intensity of ray radiation with wavelength λ at a certain point in the medium is I0, after passing through the distance x the intensity becomes Ix due to absorption (the intensity reduction dI is sometimes called the absorption intensity). The absorption constant (linear absorption coefficient) is k = (ln I0 − ln Ix)/x. In practical applications, it is more convenient to take the base 10 logarithm, giving k = (lg I0 − lg Ix)/x and m = k·lg e = 0.4343 k. Lambertian surface A typical ideal scattering surface. Also called a diffuse reflection surface, it is a surface whose reflection follows Lambert's law and is more commonly called a frosted or matte surface. A surface that emits light uniformly in all directions is a completely diffuse surface. For materials with a Lambertian surface, the light reflected from the surface has the same brightness in all directions. Such surfaces have the appearance of equal brightness when viewed from various viewpoints. The lightness (shade) of a surface depends only on the relative direction of the incident light. The orientation of such a surface relative to the camera is
irrelevant to the imaging process: it receives the same light from the surface regardless of which direction the camera is in. However, the orientation of such a surface relative to the illumination source is still very important: if the surface turns away from the direction of the light source, the surface will not receive light and will appear black. Diffuse reflection surface Same as Lambertian surface. Lambert surface Same as Lambertian surface. All over reflecting surface Same as Lambertian surface. Lambertian reflection A reflection on a Lambertian surface that scatters light uniformly in all directions. This phenomenon is often associated with shading. Also called diffuse reflection. Apparent surface The result of projecting a surface onto the plane perpendicular to the visual axis when a viewer does not look at it from the front (looking from the front means the visual axis coincides with the surface normal). If the angle between the surface normal and the visual axis is ϕ and the area of the surface is S, then the apparent area is Sϕ = S·cosϕ. If the luminous intensity measured in this direction is Iϕ, then the luminance in this direction is Lϕ = Iϕ(S·cosϕ)⁻¹. When this light-emitting face has the same brightness in all directions, then Lϕ·S = I0 = constant, so Iϕ = I0·cosϕ, where I0 is the luminous intensity in the direction perpendicular to the surface; this is Lambert's law. Matte surface A surface whose reflection characteristics follow the Lambertian surface model. Matte reflection Same as diffuse reflection. See Lambertian reflection. Non-Lambertian surface A surface on which light is not reflected uniformly in all directions. There are many objects with non-Lambertian surfaces (such as those with highlight reflections). The outer surface of the aircraft in Fig. 47.6 is a non-Lambertian surface. Non-Lambertian reflection See non-Lambertian surface. Isotropic reflection surface An ideal surface (physically unachievable) that reflects the incoming radiation evenly in all directions after receiving the incident radiation. When a surface is considered to be radiating outward, it is also called a homogeneous emission surface. Observing this type of surface at an (oblique) angle will make it brighter,
Fig. 47.6 Examples of objects with non-Lambertian surfaces
and its brightness depends on the inverse of the cosine of the radiation angle. The contours in the reflection map of this surface are parallel straight lines. Homogeneous emission surface See isotropic reflection surface. Isotropy radiation surface Ideal object surface that can radiate uniformly in all directions. As opposed to Lambertian surface properties. Physically unachievable, so it is a hypothetical, ideal, or extreme situation. Such a surface will look brighter when viewed obliquely. This is because tilt reduces the visible surface area, and from the assumption that the radiation itself has not changed, the amount of radiation per unit area will increase. Compare isotropic reflection surface.
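A minimal sketch of Lambert's cosine law from the entries above, in the vector form commonly used for shading Lambertian surfaces; the function name and the clamping of back-facing surfaces to zero are illustrative choices.

```python
"""Minimal Lambertian shading sketch: I_theta = I_0 * cos(theta)."""
import numpy as np

def lambertian_intensity(normal, to_light, i0=1.0):
    """Return I_0 * cos(theta), clamped to zero for surfaces facing away from the light."""
    n = normal / np.linalg.norm(normal)
    l = to_light / np.linalg.norm(to_light)
    return i0 * max(np.dot(n, l), 0.0)

# A surface tilted 60 degrees away from the light direction receives half the intensity:
print(lambertian_intensity(np.array([0.0, 0.0, 1.0]),
                           np.array([np.sin(np.pi / 3), 0.0, np.cos(np.pi / 3)])))
# -> 0.5
```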
47.3
Ray Radiation
Ray radiation involves the emission, propagation, measurement, etc. of various electromagnetic radiations (Peatross and Ware 2015).
47.3.1 Radiation Radiation 1. Any kind of electromagnetic spectrum emission. 2. Radiation produced by radioactive materials, including electromagnetic radiation (such as gamma-ray photons).
Radiometric image formation Photometric image formation in general, because (visible) light is a special type of electromagnetic radiation. Radiometric calibration When there is a nonlinear relationship between the radiant energy and the gray value of the image, the process of determining the nonlinear response and finding its inverse response function. Applying the inverse response function to a nonlinearly responding image yields a linearly responding image. More generally, the process of trying to estimate radiation from pixel values. The idea of radiometric correction is that the light (radiation intensity) entering a real camera is generally altered by the camera itself. A simple correction model is E(i, j) = g(i, j)·I(i, j) + o(i, j), where for each pixel (i, j), E is the radiation intensity to be estimated, I is the measured intensity, and g and o are the pixel-dependent gain and offset that need to be corrected, respectively (a minimal correction sketch is given at the end of this subsection). The true value of E can be measured with a photometer (see photometry). Radiometric calibration under laboratory conditions often uses a calibrated grayscale card: the gray values of the different gradient bars are measured and compared with the known reflection coefficients of each bar to obtain a series of independent measurements. Radiometric response function The function relating the photon light intensity reaching the camera lens from the scene radiation to the value stored in the image file in the computer, as shown in Fig. 47.7. It can be seen as a function that defines how the recorded illuminance value is converted into a brightness value in the imaging device. It is affected by, or is a function of, multiple factors, such as lens aperture and shutter speed, sensor gain, and the analog-to-digital converter. α-radiation A type of radiation produced by many radioactive elements, consisting of helium nuclei (each with two neutrons and two protons). Also written as alpha radiation. β-radiation A special type of radiation produced by many radioactive elements, consisting of electrons. Also written as beta radiation. Radiosity The total radiation on the surface of a scene. That is, the total radiation (including emitted and reflected radiation) leaving the surface. Also describes a method in which light from a light source bounces multiple times in a scene before reaching the camera.
Fig. 47.7 From scene radiation to image file (scene radiation → lens → aperture → shutter → sensor → gain → A/D → image file)
It associates the brightness value with a rectangular surface region in the scene. Suitable for situations where the scene is close to being composed of surfaces with a uniform bidirectional reflection function and simple geometric illuminants. Radiant exitance Radiation flux per unit area emitted by a surface. That is, the quotient of the radiant energy flux (i.e., radiated power) dΦ leaving a surface element dA at a point on the surface divided by the area of that element. In fact, it is the total radiated power emitted into a half space per unit area. Also called radiance. The unit is watt per square meter (W/m²); symbol, M. Radiation flux The radiant energy passing through a certain cross section of space per unit time. Also known as radiant power and radiation amount. Unit: watt (W). Radiance factor When the illuminance of incident light on a small facet is E, the ratio of the radiant brightness L of the small facet to that illuminance, that is, the radiance factor μ = L/E. It reflects the radiant power of small facets when illuminated by light. Radiant intensity 1. Radiation brightness: same as radiance. 2. Radiation density: same as radiant flux. Radiation intensity Same as radiance.
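Referring back to the Radiometric calibration entry in this subsection, the following is a minimal sketch of applying the per-pixel correction model E(i, j) = g(i, j)·I(i, j) + o(i, j); the gain and offset arrays are assumed to have been estimated beforehand (e.g., from a calibrated grayscale card), and all values are toy examples.

```python
"""Sketch of applying a per-pixel gain/offset radiometric correction."""
import numpy as np

def apply_radiometric_calibration(measured, gain, offset):
    """measured, gain, offset: arrays of identical shape (one value per pixel)."""
    return gain * measured + offset

# Toy example: a 2 x 2 "image" with spatially varying gain and offset.
I = np.array([[10.0, 20.0], [30.0, 40.0]])
g = np.array([[1.1, 0.9], [1.0, 1.2]])
o = np.array([[0.5, -0.5], [0.0, 1.0]])
print(apply_radiometric_calibration(I, g, o))   # estimated radiation E per pixel
```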
47.3.2 Radiometry Radiometry 1. Radiometric measurement: A technique for measuring the radiant flux of electromagnetic radiation (wavelengths of about 1 nm to 1 mm). The basic method used is to absorb radiation on a given surface and then detect the effect. 2. Radiation metrology: The field of objective measurement of electromagnetic radiation energy. The research content includes the detection and measurement of radiant energy (either at separated wavelengths or over a range of wavelengths), the interaction of radiation and materials (absorption, reflection, scattering, and emission), and so on. In image engineering, the brightness of the images collected from a scene involves knowledge related to radiometry. In radiometry, the following quantities are used to describe the radiant energy emitted, transferred, or received: (1) radiation flux, (2) radiance, and (3) radiant intensity. 3. Sometimes specifically refers to the measurement of optical radiation (between 3 × 10^11 and 3 × 10^16 Hz, i.e., wavelengths between 0.01 and 1000 μm). Optical radiation includes ultraviolet, visible, and infrared radiation. Commonly used units are watts/m² and photons/(sr·s). Radiometry in this broader sense covers a larger area than photometry, which measures only visible light.
Spectroradiometer 1. A device for obtaining a color image of a scene. After the light passes through the shutter, it is directed to a concave diffraction grating. The grating decomposes the incoming signal and focuses the dispersed signal onto the photosensitive unit. The device has high spectral resolution, accuracy, and stability. However, a disadvantage of spectroradiometers is that they can only measure a single point. Therefore, it is almost impossible to use them to capture an entire scene. 2. A device that measures the radiation reflected, transmitted, or emitted by an object as a function of the wavelength of the electromagnetic radiation. Optical part of the electromagnetic spectrum Electromagnetic radiation with a wavelength of 400–700 nm. Radiation in this range produces the subjective sensation of color in the human eye. Electromagnetic [EM] radiation The propagation of electric and magnetic fields. That is, the process and phenomenon of electromagnetic waves emitted or leaked into space generated by the interactive change of electric and magnetic fields. It can be characterized by the oscillation frequency ν, the wavelength λ, and the propagation speed s. In a vacuum, electromagnetic radiation and light travel at the same speed, s = c ≈ 3 × 10^8 m/s. The wavelengths of electromagnetic radiation cover about 24 orders of magnitude. Light is special electromagnetic radiation that can be perceived by the human eye. Radiant energy The energy emitted by electromagnetic radiation. Unit, joule (J); symbol, Q. Luminous energy A measure of the time integration of the luminous flux. Can be used to describe the radiant energy the eye receives from a photographic flash. Unit, lumen second (lm·s); symbol, Qv. Radiant energy fluence The ratio of the radiant energy incident on a given sphere in space to the cross-sectional area of the sphere. Unit, joules per square meter (J/m²); symbol, Ψ. Electromagnetic spectrum The wavelength range of electromagnetic radiation. From radio waves at one end (wavelength 1 m or longer) to gamma rays (wavelength 0.01 nm or shorter) at the other end. As shown in Fig. 47.8, the wavelengths from γ-rays to radio waves cover a large range (about 10^13). The optical spectrum generally refers to the range from the far ultraviolet with a wavelength of about 10 nm to the far infrared with a wavelength of 0.1 cm. Wavelengths of less than 10 nm belong to gamma (γ) rays and X-rays. Wavelengths greater than 0.1 cm belong to microwaves and radio waves. The optical spectrum can be divided into far ultraviolet, near ultraviolet, visible light, near infrared, mid-infrared, far infrared, and extreme far infrared according to wavelength. The visible spectrum band, that is, the spectral band where the radiant energy generates visual stimulus to the human eye and forms a bright feeling,
Fig. 47.8 Electromagnetic spectrum, from γ-rays and X-rays (wavelengths below 10 nm) through ultraviolet, visible light (about 0.4–0.8 μm), and near, mid, far, and extreme far infrared, to microwaves and radio waves (wavelengths above 0.1 cm) (E extreme, S super, U ultra, V very, H high, M mid, L low, F frequency)
generally refers to wavelengths from 0.38 to 0.76 μm. The measurement of the total visual stimulus produced in the human eye is the field of photometry. Radiation spectrum That is, the electromagnetic spectrum. All electromagnetic waves propagating in vacuum at the speed of light, arranged in order of increasing wavelength. The shortest wavelengths are γ-rays, and the longest are radio waves. Cosmic radiation The collective term for primary radiation, secondary radiation, electromagnetic cascade radiation, and the atmospheric showers caused by these three. These include a variety of unstable, short-lived (10^−8 to 10^−6 s) mesons. In the secondary radiation, there are also broad showers from single strong particles with energies of more than 10^15 eV, which can reach an area of 1000 m². Cosmic ray A stream of high-energy particles from cosmic space. The source has not been fully understood so far. The cosmic rays outside the earth's atmosphere are called primary or primitive cosmic rays. The main component is protons, followed by alpha particles and a small number of light atomic nuclei. The energy is extremely high, reaching above 10^20 eV. After entering the atmosphere, they collide with atmospheric molecules, atoms, and ions and undergo geomagnetic interaction, causing transmutation. The resulting matter contains π mesons (subatomic particles), and the π mesons generate μ particles, forming secondary cosmic rays (more than half of which are μ particles). This part of the rays has great penetrating power, can penetrate deep water and underground, and is called the hard component. The other part is mainly electrons and photons; it has small penetrating power and is called the soft component. Various cosmic rays can be observed and studied through imaging. Radiant exposure The radiant energy Qe received per unit area on a surface. The radiant exposure is denoted He according to the CIE regulations, He = dQe/dA, where A represents the surface area. The unit is joules per square meter (J/m²).
Radiance exposure The total radiant energy received per unit illuminated area within the exposed or illuminated visual range. Unit: joules per square meter (J/m2). Radiant flux Power that enters or leaves an object in the form of electromagnetic radiation (symbol Φ). Unit: Watt (W). Radiated energy per unit time, that is, the amount of energy emitted or absorbed per unit time. See radiation, irradiance, and radiometry. Radiant flux density The power per unit area that is emitted into or out of an object in the form of electromagnetic radiation. Unit, Wm2; symbol, E or M. Radiant power The intensity of the three primary colors in the RGB color space defined by CIE. Radiant energy flux The power emitted, transmitted, or received in the form of radiation, is also called radiated power. Unit, Watt (W); symbol, P. Radiant efficiency The ratio of the radiant flux emitted by a radiation source to the power provided for the radiation source. Dimensionless; symbol η. Irradiance The amount of energy received at one point on the surface corresponding to the scene point. Corresponds to the radiant flux per unit area incident on a surface. Also called radiant flux density. Radiometric quantity In radiology, a series of physical quantities (or basic concepts) derived from the concept of radiant flux or radiant intensity, such as radiant exitance, radiance, and irradiance. Unlike photometry, which involves only the visible light band, the full band of electromagnetic radiation is considered here, and it is an objective measurement. It does not consider that the eyes have different responses to different wavelengths. The unit used is W (lm in photometry).
47.3.3 Radiometry Standards Radiance map The radiation distribution of the scene. Sometimes referred to as a high dynamic range image. Radiance format A commonly used format for relatively compact storage of high dynamic range images. Its file extension is .pic and .hdr. A single common exponent is used, but
Fig. 47.9 Coding scheme of radiance format (red, green, and blue mantissas plus a shared exponent)
each channel has its own mantissa, as shown in Fig. 47.9, and each pixel requires 32 bits to represent it. Ray radiation Energy and its propagation in the form of electromagnetic waves or photons. The wavelength of ray radiation is 100 nm–1 mm. Considering the physiological and visual effects of the human eye, it can also be divided into three parts: wavelengths of 100–380 nm are called ultraviolet, wavelengths of 380–780 nm are called visible light, and wavelengths of 780 nm–1 mm are called infrared rays. Blackbody radiation Electromagnetic radiation emitted by a blackbody at a constant uniform temperature (such as the filament of a light bulb or the coating of an artificial light source). The spectrum of emitted radiation (i.e., the color of light) depends only on the temperature of the blackbody and increases in intensity with increasing temperature, in the color range from dark red to bright blue and white (the visible spectrum). Here, the blackbody refers to its visual appearance at room temperature. Blackbody An object that can theoretically absorb external electromagnetic radiation without reflection or transmission. In the objective world, such an absolute blackbody does not exist. The actual blackbody spectral radiance follows the Planck formula, that is, the relationship between the radiant exitance M in the blackbody spectrum and the temperature T and wavelength λ is

M(λ, T) = c1 / {λ^5 [exp(c2/(λT)) − 1]} · Δλ

where c1 = 2πhc² = 3.7417749 × 10^−16 W·m², c2 = hc/k = 1.438769 × 10^−2 m·K, Δλ is the narrow band considered, h = 6.6260693 × 10^−34 J·s is Planck's constant, c = 2.99792458 × 10^8 m/s is the speed of light, and k = 1.3806505 × 10^−23 J/K is the Boltzmann constant. The blackbody can absorb all electromagnetic radiation falling on its surface and can be regarded as an ideal source of pure thermal radiation, whose spectrum is directly related to temperature. Fig. 47.10 shows how the spectral radiance of a blackbody at three different temperatures (i.e., the energy radiated by a blackbody per unit area within a unit solid angle and a unit wavelength, with unit watt per steradian per square meter per nanometer, i.e., W·sr⁻¹·m⁻²·nm⁻¹) changes with wavelength (note that the horizontal axis is a logarithmic scale). Between the two vertical dashed lines in the figure is the wavelength range of visible light. It can be seen from the figure that when T = 1000 K, radiation starts to enter the visible light range (mainly
Fig. 47.10 Spectral radiance of a blackbody at three different temperatures (1000 K, 3000 K, and 6500 K), with the visible-light range marked between the dashed lines
mid-infrared and far infrared when it is below 1000 K, which makes people feel hot). The red glow is the first thing seen when the object is heated. The curve at T = 3000 K gives the spectrum of an incandescent lamp, with many red components. The curve at T = 6500 K is the daylight (white light) spectrum. Planckian locus Also called the blackbody locus, that is, the color locus traced in chromaticity space as the temperature of the blackbody (incandescent) radiator changes. Generally, it goes from low-temperature deep red through orange, yellow-white, and white to high-temperature blue, as shown by the trace in Fig. 47.11. Non-blackbody An object that absorbs most of the external electromagnetic radiation and reflects or transmits only a small part of it. Generally, it refers to objects that are closer to a blackbody than to an absolute white body. Gray body An object that emits at high temperature and has an emissivity (constant in the visible light band) less than 1. It emits continuous radiation whose radiant energy at each wavelength is smaller than that of a blackbody at the same temperature; the ratio of the two is called the blackness of the gray body. Strictly speaking, only black bodies, that is, objects that radiate according to Planck's law, emit white light, and other objects should be regarded as gray bodies. If an object does not reduce its emissivity proportionally at all wavelengths, the object is not a gray body but a selective radiator.
Fig. 47.11 Planckian locus in chromaticity space
Absolute black body An object made of a substance that absorbs all radiation and has no reflective power. This is an ideal source of pure thermal radiation, which does not actually exist. Absolute white body An object with a diffuse reflection coefficient of 1 for any wavelength of light. Also called standard white. Correlated color temperature The temperature of the blackbody when the light emitted by the light source is closest to the color of the blackbody radiating light at a certain temperature (i.e., the closest color distance on the CIE 1960 UCS chart). It is also a color temperature defined by the temperature of a black radiator for a given light source. At this temperature, the color produced by the black radiator is the closest to the color produced by the light source under the same brightness and the same observation conditions. Taking the most commonly used light source D65 as an example, its correlated color temperature is 6504 K. Isotemperature line On the black body trajectories of the CIE 1960 UCS diagram and the CIE 1931 chromaticity diagram, straight lines projecting at a certain interval on which the correlated color temperatures are equal. Also called correlated color temperature line.
47.3.4 Special Lights Ultraviolet [UV] Generally speaking, it refers to the part of the electromagnetic radiation spectrum with a wavelength of about 10–380 nm. Sometimes it also refers to the 300–400 nm region of the solar spectrum. It can also refer to electromagnetic radiation with a wavelength of 40–300 nm (far ultraviolet) and 300–420 nm (near ultraviolet). Its short wavelength makes it useful for detecting surface details. Ordinary glass is opaque to ultraviolet radiation, while quartz glass is transparent. It also often excites fluorescent materials. Visible light Commonly known as red, orange, yellow, green, cyan, blue, and violet light, lying in a very narrow frequency range around 10^15 Hz. The term describes electromagnetic radiation with a wavelength of 400 nm (blue) to 700 nm (red), corresponding to the sensitive range of the rod cells and cone cells in the human eye. Infrared light Electromagnetic energy with a wavelength between about 700 nm and 1 mm. Sometimes it is further divided into near infrared, mid-infrared, far infrared, extreme far infrared, and so on. Its wavelength is just a little longer than that of visible light and just a little shorter than that of microwave radio. Infrared light is often used in machine vision systems because it is easily received by most semiconductor image sensors while remaining invisible to the human eye. In addition, it is a measure of the heat radiated from the observed scene. Infrared [IR] See infrared light. Infrared ray Invisible electromagnetic radiation with a wavelength greater than that of red light. The range is usually set between 780 nm and 1 mm. Also called infrared radiation or heat rays. It was originally discovered and named because it produces a response in thermal detectors (thermocouples) placed beyond the red end of the visible spectrum. Near infrared [NIR] Infrared light with a wavelength of about 0.7 μm to 1.1 μm. Some people also take it as the part of the sunlight spectrum with a wavelength of 700 nm–830 nm. Thermal infrared [TIR] Infrared light with a wavelength of about 8 μm to 12 μm; also called long-wave infrared. Long-wave infrared [LWIR] Infrared rays with a wavelength of about 8 μm to 12 μm; also called thermal infrared.
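As a small illustration of the wavelength ranges listed above, the Python sketch below maps a wavelength to a named band. The boundary values are taken from the entries above (the ray-radiation split at 380/780 nm, NIR up to 1.1 μm, TIR 8–12 μm); the helper function itself is only an illustrative assumption, not part of the handbook.

```python
def classify_wavelength(wavelength_nm):
    """Return a band name for a wavelength given in nanometers,
    using the boundaries quoted in the entries above."""
    if 100 <= wavelength_nm < 380:
        return "ultraviolet"
    if 380 <= wavelength_nm <= 780:
        return "visible light"
    if 780 < wavelength_nm <= 1_000_000:   # 1 mm = 1e6 nm
        if wavelength_nm <= 1100:
            return "near infrared"
        if 8000 <= wavelength_nm <= 12000:
            return "thermal (long-wave) infrared"
        return "infrared"
    return "outside the ray-radiation range"

for w in (250, 550, 900, 10_000):
    print(w, "nm ->", classify_wavelength(w))
```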
Monochromatic light Electromagnetic radiation consisting only of light of the same wavelength; it is characterized by the single parameter of its wavelength. It is an ideal single-frequency light, corresponding to a single sine wave without beginning or end. Quasi-monochromatic light Light confined to an extremely narrow frequency band. Ideal monochromatic light would be single-frequency light, a sine wave without beginning or end, but what can be obtained in practice can only be called quasi-monochromatic light, which is a mixture of sine waves. Also called composite light, hybrid light, and multicolor radiation. Band A range of wavelengths in the electromagnetic spectrum over which the sensor used to acquire the image has non-zero sensitivity. A typical color image contains three color bands.
47.4 Spectroscopy

Spectroscopy is a branch of optics that studies the interaction between electromagnetic waves and matter by means of spectra (Peatross and Ware 2015). It is an interdisciplinary subject that mainly involves physics and chemistry.
47.4.1 Spectrum Spectrum 1. White light (belonging to multi-color light) is divided by a dispersion system (such as prisms and gratings) and formed into color bands arranged in order of wavelength or frequency. For example, after sunlight splits through a prism, it forms a color spectrum that is continuously distributed in the order of red, orange, yellow, green, cyan, blue, and purple. Invisible ultraviolet light and infrared light can be detected by a detector after being separated or changed into a visible image by an image converter and displayed. According to different bands, it can be divided into infrared spectrum, visible spectrum, ultraviolet spectrum, and X-ray spectrum. 2. Short for Fourier spectrum. 3. The electromagnetic spectrum value in a certain range. Visible spectrum A spectrum composed of human-perceivable optical radiation (having a wavelength of about 380–780 nm).
Normalized spectrum The result of eliminating the dependence of the image spectrum on the imaging geometry. For Lambertian surfaces, the imaging process can be modeled as follows:

Q = (m • n) ∫_0^{+∞} S(λ) I(λ) R(λ) dλ
Among them, Q is a record of a sensor with a sensitivity function S(λ). At this time, the surface is observed according to the normal vector n and the reflection function R(λ), and the surface is illuminated by a light source with spectrum I(λ) in direction m. It can be seen that the spectrum of an image is composed of different Q values corresponding to different camera sensors, and a simple scale factor m•n depends on the imaging geometry. The effects of imaging geometry and illumination spectrum interfere with each other. For a multispectral camera with L spectral bands, the above formula can be generalized as

Q_l(i, j) = (m • n_ij) ∫_0^{+∞} S_l(λ) I(λ) R_ij(λ) dλ,   l = 1, 2, ..., L
To eliminate the dependence on the imaging geometry, it can be calculated as follows:

q_l(i, j) = Q_l(i, j) / Σ_{k=1}^{L} Q_k(i, j),   l = 1, 2, ..., L
These q_l(i, j) values constitute the normalized spectrum of the pixel (i, j) and are independent of the imaging geometry. Spectral signature A string of values in different spectral bands associated with a pixel in a multispectral image. A multispectral image includes a set of grayscale images, each of which corresponds to a band, representing the brightness of the scene according to the sensitivity of the sensor corresponding to that band. In a multispectral image, the property value of a pixel is a vector, and each component represents the gray value in one band. Spectral unmixing The process of distinguishing the materials in the region corresponding to the pixels in a remote sensing image. Considering the (relatively coarse) spatial resolution of the remote sensing image, the ground region corresponding to one pixel will contain multiple substances, and their spectral signatures are mixed within the pixel. Separating them to obtain regions corresponding to the different substances is called spectral unmixing.
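A minimal Python sketch of the normalized-spectrum computation defined above follows; the array names and the toy 2×2 three-band image are illustrative assumptions.

```python
import numpy as np

def normalized_spectrum(Q):
    """Normalize a multispectral image Q of shape (L, H, W) so that the
    geometry-dependent scale factor m.n cancels:
        q_l(i, j) = Q_l(i, j) / sum_k Q_k(i, j)
    """
    total = Q.sum(axis=0, keepdims=True)        # sum over the L bands
    return Q / np.where(total == 0, 1, total)   # avoid division by zero

# Toy example: a 3-band, 2x2 image
Q = np.array([[[10.0, 20.0], [30.0, 40.0]],
              [[ 5.0, 10.0], [15.0, 20.0]],
              [[ 5.0, 10.0], [15.0, 20.0]]])
q = normalized_spectrum(Q)
print(q[:, 0, 0])   # -> [0.5 0.25 0.25], independent of the overall scale
```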
Spectrum quantity The rate of change of the spectrum with wavelength, that is, the number of spectral lines in a unit wavelength interval. Also called spectral quantity. Spectral differencing 1. A method for detecting highlights using three images of the same scene. Spectral differencing here does not refer to the calculation of partial derivatives of color image functions, but to the calculation of the spectral difference between pixels in a color image. This technique uses three color images to detect highlights (similar to trinocular stereo analysis), which are acquired from three different viewing directions under the same lighting conditions. 2. A method to identify the color pixels from two color images collected from different viewing directions under the same lighting conditions. These pixels do not coincide with any other pixels of the other image in a 3-D color space, such as RGB color space. In order to find the color pixels with inconsistent line of sight, the spectral differencing method calculates the minimum spectral distances (MSD) image. It is assumed here that the color of the light and the color of the object are different in the scene. Therefore, for white light, there is no corresponding white object in the scene. This is because the highlight cannot be separated when the highlight color is the same as the object color. Spectral angular distance A measure of the difference between two spectra. Think of each spectrum as a straight line (a vector) in space; the angle between the two lines reflects the difference between these two spectra. Spectral reflectance See reflectance. Spectral radiance The energy radiated by a blackbody per unit area within a unit solid angle at a unit wavelength (interval). The unit is watt per steradian per square meter per nanometer (W•sr^−1•m^−2•nm^−1). Spectral response The output of the imaging sensor when illuminated. The response R to a monochromatic light of wavelength λ is the product of the input light intensity I and the spectral response s(λ) at that wavelength, that is, R = I•s(λ). Spectral response function Functions that describe the spectral response of the red, green, and blue sensors in color cameras:

R = ∫ L(λ) S_R(λ) dλ
G = ∫ L(λ) S_G(λ) dλ
B = ∫ L(λ) S_B(λ) dλ
Among them, L(λ) is the incident spectrum at a given pixel, and S_R(λ), S_G(λ), and S_B(λ) are the spectral sensitivities of the red, green, and blue sensors, respectively. Spectral sensitivity See spectral response function. Photosensor spectral response Characterizes the output of the sensor as a function of the input spectral frequency. See spectral response, Fourier transform, frequency spectrum, and spectral frequency.
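The following Python sketch shows how the R, G, B responses above could be approximated by numerical integration over sampled spectra; the Gaussian-shaped sensitivities and the flat incident spectrum are illustrative assumptions, not measured camera data.

```python
import numpy as np

lam = np.arange(400, 701, 5, dtype=float)          # wavelengths in nm

# Assumed (made-up) Gaussian sensor sensitivities and a flat incident spectrum
def gaussian(center, width):
    return np.exp(-0.5 * ((lam - center) / width) ** 2)

S_R, S_G, S_B = gaussian(600, 40), gaussian(540, 40), gaussian(450, 40)
L = np.ones_like(lam)                               # L(lambda): flat spectrum

# R = integral of L(lambda) S_R(lambda) d(lambda), approximated by the trapezoid rule
R = np.trapz(L * S_R, lam)
G = np.trapz(L * S_G, lam)
B = np.trapz(L * S_B, lam)
print(R, G, B)
```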
47.4.2 Spectroscopy Spectroscopy An interdisciplinary study of the interaction between electromagnetic waves and matter by means of spectra, which mainly involves physics and chemistry. Through the study of spectroscopy, one can grasp many microscopic and macroscopic properties, from atoms and molecules to the universe. Emission spectroscopy A discipline that studies the emission spectrum (the spectrum directly emitted by an object's luminescence). When an atom or molecule at a high energy level transitions to a lower energy level, radiation is generated, and the excess energy is emitted to form an emission spectrum. Absorption spectroscopy A branch of spectroscopy that is parallel to emission spectroscopy. It uses the fact that the atoms or molecules of a substance absorb radiation (light) of different wavelengths to different degrees to estimate the structure of a substance or identify the type of substance. Absorption spectrophotometers are commonly used as research tools. Spectral absorption curve A curve that reflects the absorption of radiation at different wavelengths in the spectrum. The three cones (L cone, M cone, and S cone) and the rod cells in the retina each have different spectral absorption curves. Biological spectroscopy The discipline that studies the electronic absorption spectra of organic molecules in the ultraviolet (200–360 nm) and visible light (360–780 nm) regions, the vibration
spectrum of the infrared region (780 nm–25 μm), and the emission spectra of certain metal atoms in the ultraviolet and visible light regions. Medical spectroscopy is part of it. Spectral luminous efficiency See spectral luminous efficiency curve. Spectral luminous efficiency function Spectral luminous efficiency as a function of wavelength. See spectral luminous efficiency curve. Spectral luminous efficiency curve The curve relating the spectral luminous efficiency value of light at each wavelength to the wavelength, obtained when the maximum value of the spectral luminous efficiency is set to 1. Spectrum color The color of a single light. But it is not a strictly monochromatic light; theoretical monochrome does not exist in reality. Spectral colors are the light colors that are closest to a single color. Sometimes they are also called pure spectrum colors for clarity. They form the spectrum locus in the chromaticity diagram, and other colors are called out-of-spectrum colors. Whiteness Visual subjective perception of the depth of white. For spectrum colors, the higher the saturation or purity, the lower the whiteness, and vice versa. Spectral histogram 1. A histogram of the L-D space in which a multispectral image with L bands lives, with the value of the pixel in one band measured along each axis. It can be seen as a generalization of the gray histograms of ordinary images and the color histograms of color images. 2. A vector containing the marginal distributions of the filter responses, which implicitly combines the local structure of the image (by examining spatial pixel connections) and global statistics (by calculating the marginal distributions). When a sufficient number of filters is used, the spectral histogram can uniquely represent any image (up to a translation). Let {F^(k), k = 1, 2, . . ., K} represent a group of filters. Convolve the image with these filters, and generate a histogram for each filtered response:

H_I^(k)(z) = (1/|I|) Σ_{(x, y)} δ[z − I^(k)(x, y)]
Among them, z represents a bin (histogram bar) of the histogram, I^(k) is a filtered image, and δ(•) is a δ function. Thus, for the selected filter bank, the spectral histogram can be expressed as

H_I = [H_I^(1), H_I^(2), . . ., H_I^(K)]

Spectral histograms are invariant to translation (this is often a desirable feature in texture analysis). Ideal white It is generated by superimposing three light beams (corresponding to tristimulus values) and has a spectrum of equal energy at all wavelengths. In other words, the spectral density is the same at all wavelengths. Equal energy spectrum on the wavelength basis Same as ideal white. Equal-energy spectrum The spectrum when the energy density corresponding to the unit wavelength light intensity is constant in a certain wavelength range.
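Returning to the spectral histogram defined above, the following Python sketch illustrates its construction for a small, assumed filter bank (an identity filter and two difference filters); the filter choice and bin count are illustrative, not prescribed by the text.

```python
import numpy as np
from scipy.ndimage import convolve

def spectral_histogram(image, filters, bins=16, value_range=(-255, 255)):
    """Concatenate the marginal histograms of the filter responses,
    H_I = [H_I^(1), ..., H_I^(K)], each normalized by the image size |I|."""
    hists = []
    for f in filters:
        response = convolve(image.astype(float), f, mode='reflect')
        h, _ = np.histogram(response, bins=bins, range=value_range)
        hists.append(h / image.size)
    return np.concatenate(hists)

# Assumed small filter bank: identity, horizontal and vertical differences
filters = [np.array([[1.0]]),
           np.array([[1.0, -1.0]]),
           np.array([[1.0], [-1.0]])]

image = np.random.randint(0, 256, (64, 64))
H = spectral_histogram(image, filters)
print(H.shape)   # (3 * 16,) = (48,)
```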
47.4.3 Spectral Analysis Spectral analysis 1. Analysis performed in the space, time, or electromagnetic spectrum. 2. Generally refers to any analysis that includes eigenvalue detection. This concept is a bit vague, so the number of “spectrum technologies” is large. Sometimes equivalent to principal component analysis. Spectral filtering A filter adjusted to a different spectral frequency is used to modify the light before entering the sensor. A typical use is related to laser sensing, where filters are selected to pass only light at the laser frequency. Another typical use is to eliminate ambient infrared light to increase the sharpness of the image (most silicon-based sensors are also sensitive to infrared light). Spectral frequency Electromagnetic spectrum or spatial frequency. Spectral distribution The distribution of spectral power spectrum or electromagnetic spectrum. Spectral power distribution [SPD] A parameter describing the radiation (physical energy) of a light source. It gives the relative intensity of radiation corresponding to different wavelengths. Relative spectral radiant power distribution Normalized spectrum of modeled daylight changes. First, the spectrum of daylight with wavelengths from 300 nm to 830 nm is recorded at widely varying locations, times of day, and seasons. In this way, an ensemble version of the daylight spectrum
is first generated. Then for each measured spectrum, record the emitted power spectrum (W/m^2) of daylight at each unit wavelength interval (m). Finally, normalize these absolute measurements so that all spectra have a value of 100 at the wavelength λ = 560 nm. Spectral density function See power spectrum. Empirical color representation The user-centered formulation of the spectrum of light reflections. There are two main types: (1) statistical representation, which ignores the intensity component of light and uses chromaticity modeling, and (2) basis function representation, which is generated by modeling the data with various bases. Spectral factorization A method for designing a linear filter, based on a differential equation with a given spectral density function, when it is driven by white noise. Spectral decomposition method See spectral analysis. Spectral graph theory The study of graph characteristics through eigen-analysis of graph-related matrices (such as adjacency matrices or graph Laplacians). Spectral graph partitioning A method to obtain a graph partition using eigen-analysis of a graph-related matrix, which can be used for image segmentation. Spectral clustering A graph-theoretic form of clustering. The similarity between pairs of data points is first recorded in a square similarity matrix. Spectral graph partitioning is then performed to obtain clusters. Self-similarity matrix A matrix whose elements are similarity-function values. Given a set of targets {o1, o2, . . ., on} described by feature vectors [p1, p2, . . ., pn], a self-similarity matrix M can be defined. Each element Mij in M is a user-defined similarity function sim(pi, pj) between the objects oi and oj. A commonly used similarity function is the Euclidean distance, that is, sim(pi, pj) represents the Euclidean distance between the vector pi and the vector pj. Given a matrix M, clustering of similar objects can be performed. See spectral clustering. Positive color Originally refers to the five pure colors of blue, red, yellow, white, and black; now refers to a variety of pure spectrum colors. Emissive color Another name for positive colors.
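The sketch below builds the self-similarity matrix defined above and feeds it to a spectral clustering routine; the Gaussian conversion from Euclidean distance to affinity and the scikit-learn call are illustrative assumptions rather than a method prescribed by the text.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Feature vectors p_1..p_n for n targets (toy 2-D data: two loose groups)
P = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])

# Self-similarity matrix: here sim(p_i, p_j) is the Euclidean distance,
# converted to an affinity with an assumed Gaussian kernel.
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
A = np.exp(-D**2 / (2 * 1.0**2))

labels = SpectralClustering(n_clusters=2, affinity='precomputed',
                            random_state=0).fit_predict(A)
print(labels)   # e.g., [0 0 0 1 1 1]
```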
47.4.4 Interaction of Light and Matter Interaction of light and matter Light has several modes of interaction with matter. As shown in Fig. 47.12, these include three types of reflection, two types of transmission, and one type of absorption. Specular reflection One of the interactions of light and matter: reflection of electromagnetic radiation by a smooth surface that behaves like a mirror. In specular reflection, the incident light and the reflected light are in the same plane, and the angles between them and the normal of the reflecting surface are equal. A specularly reflecting surface is smooth on the wavelength scale of the electromagnetic radiation. Its direction depends on the macrostructure of the surface shape. The specular reflection or specularity is formed as follows: the 3-D position L of the light source, the surface normal N at the surface point P, and the camera center C all lie in the same plane, and the angles ∠LPN and ∠NPC are equal, as shown in Fig. 47.13. It is one of the components of a typical bidirectional reflectance distribution function and is very dependent on the direction of light exit. Specular reflection coefficient A parameter that reflects the reflective properties of a specular surface. It is related to the surface material. Ideal specular reflecting surface An idealized surface reflection model: the surface reflects all incident light like a mirror, and the wavelength of the reflected light depends only on the light source and has nothing to do with the color of the reflective surface.
Fig. 47.12 How light interacts with matter: (1) specular reflection, (2) diffuse reflection, (3) backside reflection, (4) diffuse transmission, (5) directional transmission, (6) absorption
Fig. 47.13 Specular reflection
Highlight Bright light emitted from a light source and reflected from a smooth surface to the camera. It is not fixed on the object surface but moves over the points on the object surface that meet the reflection conditions. At a highlight point, the surface color information and texture information are essentially lost, so highlights are a great obstacle to the computation of object features and make object recognition difficult. See specular reflection. Glint Specular reflections (highlights) visible on a mirror-like surface, as shown in Fig. 47.14. Diffuse reflection One of the interactions of light and matter. The incident light is diffusely reflected in essentially all directions. It is often associated with shading, which has a smooth change in brightness relative to the surface normal, without gloss. A diffusely reflecting surface is rough on the wavelength scale of the electromagnetic radiation. Diffuse reflection originates from light that is selectively absorbed and re-emitted inside the material, so the reflected light often carries a strong body color. Also called Lambertian reflection or matte reflection. Light is scattered uniformly in all directions. The ideal Lambertian diffusion effect is to have the same reflected energy in all directions regardless of the direction of the incident light flux. Figure 47.15 shows a diagram of diffuse reflection; N indicates the direction of the normal.
Fig. 47.14 Glint example
Fig. 47.15 Diffuse reflection
Fig. 47.16 Light distribution caused by diffuse reflection and specular reflection
Uniform diffuse reflection The phenomenon of uniform reflection in all directions, entirely regardless of the incident direction. This is an ideal case for surface characterization. Lobe of intense reflection The directional lobes generated by specular reflection at a certain viewing angle, as opposed to the uniform lobes generated by diffuse reflection, as shown in Fig. 47.16. The reason for the lobe is that specular reflection does not always give ideal results like a mirror; it depends on the angle of incidence of the light. The width of the lobes is determined by the microstructure of the surface. Backside reflection One of the interactions of light and matter. It is caused by the incident light at the interface between two transparent media, and it can cause ghosting. Water color The mixed perception of light reflected from the water surface and light reflected from the bottom. Water itself is colorless, which differs from the chromaticity impression obtained by comparative observation. Pure water appears blue. The maximum spectral transmission wavelength of water is 470.0 nm (the transparent region is the 420.0–470.0 nm band), which belongs to short-wavelength light. The short-wavelength light is scattered and transmitted, while the long-wave visible light is absorbed. Therefore, shallow water is bright blue, with more white light components; deep water is dark blue, because the blue light that reaches the bottom of the water must be diffusely reflected and then scattered before reaching the human eye. The light intensity becomes weaker with the depth of the water. Diffuse transmission One of the interactions of light and matter. When light passes through a transmissive material with a rough surface, the transmitted light spreads evenly. Uniform diffuse transmission The phenomenon of uniform transmission in all directions independently of the incident direction. This is an ideal case for surface characterization.
Directional transmission One of the interactions of light and matter. When light strikes a transparent material, the transmitted light undergoes a highly directional transmission in accordance with the laws of geometric optics. Absorption One of the interactions of light and matter. It is the attenuation of light caused by the light passing through an optical system or striking an object surface; that is, the phenomenon in which radiation (such as light) is weakened by matter. It can be divided into two types: true absorption and apparent absorption. True absorption is the conversion of energy into another form, such as photochemical absorption that converts light into chemical energy or photoelectric absorption that converts it into electrical energy. Apparent absorption includes light scattering and photoluminescence: the light still exists, but its wavelength and direction have changed; in particular, phosphorescence involves a delay in time. A common process is the conversion of light into heat in a medium. After the incident light hits the medium, except for reflection and transmission, the remaining part is absorbed by the medium. The amount absorbed depends on the medium; the blackbody absorbs the most. Absorptance The ratio of the radiant or luminous flux absorbed by an object to the incident luminous flux under given conditions. The absorptance is not only related to the nature of the material but also depends on the given setting. See absorption coefficient. Absorption coefficient A measure indicating the internal absorbance α of an optical medium. It can also refer to the unit absorption of an optical medium for monochromatic light at a given wavelength and under other given conditions. Here, the absorption of light refers to the phenomenon that when light propagates in an optical medium, its intensity is attenuated with the propagation distance (penetration depth). If Φ is the incident light flux and L is the thickness of the optical medium, then dΦ/Φ = −α(λ)dL. Integrating gives the exponential law of absorption known as the Beer–Lambert law: Φ = Φ0 exp(−αL). Compare absorptance. Beer–Lambert law See absorption coefficient. Photoelectric effect Under the irradiation of electromagnetic waves with a frequency higher than a certain threshold frequency, the electrons in some substances are excited by photons to form a current; this is the phenomenon or process of photo-generated electricity.
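A brief Python sketch of the Beer–Lambert attenuation above follows; the absorption coefficient value and the medium thicknesses are illustrative numbers only.

```python
import math

def transmitted_flux(phi0, alpha, thickness):
    """Beer-Lambert law: flux remaining after passing through a medium,
    Phi = Phi0 * exp(-alpha * L)."""
    return phi0 * math.exp(-alpha * thickness)

phi0 = 100.0      # incident flux (arbitrary units)
alpha = 0.5       # assumed absorption coefficient [1/m]
for L in (0.0, 1.0, 2.0, 5.0):
    print(f"L = {L} m -> transmitted flux = {transmitted_flux(phi0, alpha, L):.2f}")
```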
47.5 Geometric Optics
Geometric optics is a branch of optics that analyzes and describes optical concepts from a geometric point of view (Peatross and Ware 2015; Zhang 2017a).
47.5.1 Ray Ray In geometric optics (also called ray optics), the flow of light traveling along a straight line. Principal ray Light that passes through the center of the aperture stop and also (virtually/extended) through the centers of the entrance pupil and the exit pupil. In Fig. 47.17, the principal ray is represented by a thick red solid line, and its (imaginary/extended) optical path through the entrance pupil (ENP) center Q and the exit pupil (EXP) center Q′ is represented by a thick red long-dashed line; the light passing through the entrance pupil and exit pupil edges is represented by a thin solid brown line, and the (imaginary/extended) light path passing through the edges of the entrance pupil and exit pupil is represented by a thin green dashed line. These rays determine the cone of light entering and exiting the lens system and thus the luminous flux that reaches the imaging plane. Chief ray Same as principal ray. Pencil of lines A bundle of straight lines through a common point. For example, let p be a general point of the bundle and p0 be the point through which all the lines pass. A pencil of lines can be represented by p = p0 + λv, where λ is a real number and v is the direction of a single line (both are parameters). Ray tracking When using geometric optics to design a lens group, select a number of representative rays, and use a set of formulas to track the rays step by step according to the order of the lenses in the lens group.
Fig. 47.17 Principal ray diagram
For example, paraxial and marginal rays are used to study spherical aberration and the chromatic aberration for the spectral lines C, D, F, and so on. Image construction by graphics Method for determining the imaging position by means of geometric mapping. There are three typical rays in geometric optics: rays passing through the focal point emerge parallel to the optical axis; rays parallel to the optical axis emerge through the focal point; and rays passing through the center of curvature are not deflected. Using these three typical rays, geometric-optics mapping can be used to trace the rays and initially determine the image position so as to correct aberrations. This method is intuitive and convenient to use when accuracy requirements are not high or when a preliminary design is performed. Dichromatic model A color model in which the light reflected from a surface is considered to be the sum of two components, that is, the sum of body (volume) reflection and interface reflection. The body reflection follows Lambert's law; the interface reflection models highlights. Application areas of the model include color constancy, color image segmentation, and shape recovery. Body color The self-color of nonluminous objects (i.e., objects that are only visible under lighting conditions). The color attached to an object is different from the surface color. Light penetrates into the object to reveal its body color, so transmitted light and reflected light have the same color. For color measurement of body color, CIE developed a standard illuminant in 1931. The spectral distribution of the illumination directly affects the perceived or measured body color of illuminated objects. Therefore, the color value of the body color is only defined in relation to the spectral power distribution of the illumination. CIE recommends the use of the term "object color." Dichromatic reflection model [DRM] A model that describes the properties of light reflection on the surface of an insulating material (such as plastic or pigment) with non-uniform optical properties. Generally refers to a hybrid reflection that does not specifically model the specular reflection component. The surface of these substances consists of an interface and an optically neutral, colored, pigment-containing medium. It can be regarded as a reflection model of a homogeneous substance. It is considered here that the visible color of a uniform material object illuminated by a single light source is the sum of the following two terms (i.e., the sum of the light radiation reflected at the interface and the ray radiation reflected at the body surface):

L_r(V_r; λ) = L_i(V_r, V_i, N; λ) + L_b(V_r, V_i, N; λ)

Among them, V_i represents the incident ray, V_r represents the line of sight, N represents the normal of the illuminated surface element, λ represents the wavelength of the light source color, L_i is
the radiation amount of the light reflected at the interface, and L_b is the radiation amount reflected at the body surface. The above formula can also be written as

L_r(V_r; λ) = C_i(λ) M_i(V_r, V_i, N) + C_b(λ) M_b(V_r, V_i, N)

That is, L_i and L_b are the products of the relative power spectra C_i(λ) and C_b(λ) and the radiation amplitudes M_i(V_r, V_i, N) and M_b(V_r, V_i, N), respectively, where the relative power spectrum is only related to the wavelength and the radiation amplitude depends only on the surface geometry. This model can be used in segmenting highlighted color objects and in Bayer pattern demosaicing techniques under different shadow conditions. Bidirectional reflectance distribution function [BRDF] A function describing the spatial relationship between incident and reflected rays. It is also a function indicating the reflection characteristics of the scene surface. Consider the coordinate system shown in Fig. 47.18, where N is the normal of the surface element and OR is an arbitrary reference line. The direction of a ray I can be represented by the angle θ (called the polar angle) between the ray and the element's normal and the angle ϕ (called the azimuth angle) between the orthographic projection of the ray on the scene surface and the reference line. If (θi, ϕi) is used to indicate the direction of light incident on the surface of the scene, and (θe, ϕe) is used to indicate the direction of reflection toward the observer's line of sight, the bidirectional reflectance distribution function can be introduced with the help of Fig. 47.19. Fig. 47.18 Polar angle θ and azimuth angle ϕ indicating the direction of light
Fig. 47.19 Bidirectional reflectance distribution function
When light is incident on the object surface along the direction (θi, ϕi) and the observer views the brightness of the surface in the direction (θe, ϕe), the corresponding function can be recorded as f(θi, ϕi; θe, ϕe), where θ represents the polar angle and ϕ represents the azimuth angle. The unit of this function is the inverse of the solid angle (sr^−1), and its value ranges from zero to infinity (the latter meaning that an arbitrarily small incidence causes a unit radiance to be observed). Note that f(θi, ϕi; θe, ϕe) = f(θe, ϕe; θi, ϕi), that is, it is symmetrical with respect to the directions of incidence and reflection. Suppose that the illuminance produced by light incident on the object surface along the (θi, ϕi) direction is δE(θi, ϕi), and the brightness of the reflection (emission) observed from the (θe, ϕe) direction is δL(θe, ϕe); then the bidirectional reflectance distribution function is the ratio of brightness to illuminance, i.e.,

f(θi, ϕi; θe, ϕe) = δL(θe, ϕe) / δE(θi, ϕi)
Further, f(θi, ϕi; θe, ϕe) can be parameterized by the angle of incidence vi = (θi, ϕi), the angle of reflection ve = (θe, ϕe), and the light direction (lx, ly, n) relative to the local surface coordinate system, as shown in Fig. 47.20. The symmetry of the bidirectional reflectance distribution function with respect to vi and ve is also called Helmholtz reciprocity. If the bidirectional reflectance distribution function is also considered to be a function of the wavelength λ, it can be expressed as f(θi, ϕi; θe, ϕe, λ) or f(lx, ly, n; λ). Spatially varying BRDF [SVBRDF] A bidirectional reflectance distribution function that is different at each point in space. It can be expressed as f(x, y, θi, ϕi; θe, ϕe, λ). Reflectance estimation A class of techniques for estimating the bidirectional reflectance distribution function (BRDF). It is mainly used in recovering shape from shading and in image-based rendering. The goal is to draw an arbitrary image of the scene based on the geometric and photometric information in the video. See physics-based vision. Fig. 47.20 Bidirectional reflectance distribution function
Isotropic BRDF A bidirectional reflectance distribution function for isotropic materials. At this time, the general bidirectional reflection distribution function f(θi, ϕi; θe, ϕe) can be simplified to f(θi, θe, |ϕe –ϕi|; λ) or f(lx, ly, n; λ). Anisotropic BRDF Compare isotropic BRDF. Anisotropy The phenomenon that the image shows different properties in different directions. For example, in many 3-D images, the resolutions in the X and Y directions are generally the same, but the resolution in the Z direction is often different. The voxels at this time will not be cubes, but cuboids. A similar situation often occurs in video images, and the resolution along the time axis is often different from the resolution along the spatial axes, resulting in inconsistent changes in moving objects in various directions. Bidirectional scattering surface reflectance-distribution function [BSSRDF] The 8-D bidirectional reflectance distribution function when modeling the scattering effect of the surface through the long-distance transmission of light through the medium.
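As a toy illustration of the BRDF concept above, the sketch below evaluates the simplest isotropic case, a Lambertian BRDF (a constant equal to albedo/π), and checks Helmholtz reciprocity numerically; the albedo value is an assumed example.

```python
import math

def lambertian_brdf(theta_i, phi_i, theta_e, phi_e, albedo=0.6):
    """Simplest isotropic BRDF: a Lambertian surface reflects equally in all
    directions, so f is a constant albedo / pi (unit: 1/sr)."""
    return albedo / math.pi

# Helmholtz reciprocity: f(v_i; v_e) == f(v_e; v_i)
vi = (math.radians(30), 0.0)
ve = (math.radians(60), math.radians(45))
assert lambertian_brdf(*vi, *ve) == lambertian_brdf(*ve, *vi)

# Reflected radiance for an illuminance E arriving from v_i:  dL = f * dE
E = 100.0
print("reflected radiance:", lambertian_brdf(*vi, *ve) * E)
```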
47.5.2 Reflection Reflection 1. The phenomenon in which part of the light returns from the interface into the first medium when it travels from one medium to the interface with another medium. It is a type of interaction of light and matter occurring when light passes through the interface between different media. Compare total reflection. 2. An optical phenomenon in which all light incident on a surface is deflected away without absorption, diffusion, or scattering. An ideal mirror is a perfect reflective surface. Given a single incident ray reaching the surface, the angle of incidence is equal to the angle of reflection, as shown in Fig. 47.21. See specular reflection.
Fig. 47.21 Mirror reflection
3. The result obtained by flipping an image left and right. It can be seen as a mathematical transformation of the image, where the output image is obtained by flipping the input image about a given transformation line in the image plane. 4. In morphology for binary images, the result of flipping the binary set from left to right and up and down relative to the reference point. The mapped set is symmetric to the original set; the two can be said to be transposes of each other. 5. In morphology for gray-level images, the image is first flipped about the vertical axis and then flipped about the horizontal axis. This is equivalent to rotating the image by 180° around the origin. For a grayscale image f(x, y), its reflection about the origin can be expressed as

f̂(x, y) = f(−x, −y)

Albedo After the surface of a diffusing object is irradiated with normally incident light, the ratio of the light diffused in all directions to the incident light. Also called whiteness, it was originally used in astronomy to describe the reflective ability of an object's surface. Its value is between 0 and 1, or between 0% and 100%; a surface corresponding to 0% is ideally black, and a surface corresponding to 100% is ideally white. If an object reflects 50% of the light shining on it, its reflectance is said to be 0.5. Figure 47.22 shows some schematic diagrams of albedo.
Fig. 47.22 Some schematic diagrams of albedo (values 1.0, 0.75, 0.5, 0.25, 0.0)
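Returning to item 5 of the Reflection entry above, here is a minimal NumPy sketch of the gray-level reflection f̂(x, y) = f(−x, −y); the small test array is an arbitrary example.

```python
import numpy as np

def reflect_image(f):
    """Reflection used in gray-level morphology: flip the image about the
    vertical axis and then the horizontal axis, i.e. f_hat(x, y) = f(-x, -y),
    which equals a 180-degree rotation about the origin."""
    return np.flip(f, axis=(0, 1))

f = np.array([[1, 2],
              [3, 4]])
print(reflect_image(f))
# [[4 3]
#  [2 1]]
```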
Reflectance ratio A photometric invariant for segmentation and recognition. It stems from the observation that the illumination on either side of a reflectance edge or color edge is almost the same. Therefore, although it is not possible to separate reflectance and illumination from the observed brightness, the ratio of the brightness on the two sides of the edge is equal to the ratio of the reflectances and is independent of the illumination. Thus, for a wide range of reflectances, this ratio does not change with illumination and local surface geometry. Reflectance factor The ratio of the radiant flux (or luminous flux) reflected from the surface of the object to be analyzed to the radiant flux (or luminous flux) reflected from a fully diffuse reflective surface under specific lighting conditions and within a specified solid angle. It is also the ratio of the radiant flux reflected from a surface to the radiant flux incident on the surface. The suffix -ance implies the nature of a specific region of the material and can include multiple reflections, such as reflections on the front and back of a sheet, or back and forth between two reflective surfaces (this increases the reflected radiant flux). Unit, 1; symbol, R. Compare reflectivity. Reflection law Describes the laws of light reflection. It mainly includes two points: (1) the incident ray, the normal perpendicular to the reflecting surface passing through the incident point, and the reflected ray are all in the same plane; (2) the absolute values of the angles of incidence and reflection measured from the normal are equal and their directions are opposite. Reflection component The part of the scene brightness that reflects the incident light. It is a factor that determines the gray level of an image. Its value changes sharply at the junction of different objects. In practice, the incident light is often regarded as a unit value, so the reflection component obtained from the object is determined by the reflection ratio (from 0 in the case of total absorption without reflection to 1 in the case of total reflection without absorption). Reflection function A functional form of the reflection component. Reflection coefficient A parameter that reflects the reflective nature of a surface. It is the ratio of the intensity of the light reflected from the object surface to the intensity of the incident light projected on the object. It is related to the projection angle, intensity, and wavelength of the incident light, the nature of the surface material of the object, and the measurement angle of the reflected light. Reflectance modeling A method to overcome the effects of lighting in order to restore surface characteristics. When the illumination has an obvious directionality and the object moves relative to the illumination, highlights or shadows may be produced, which affects the reliable reconstruction of the surface.
Fig. 47.23 Effect of reflection operator
At this time, shadow removal needs to be performed explicitly, that is, by modeling the direction of the illumination source and estimating the surface reflection characteristics. Reflectance model A model that represents how light is reflected by a material. See Lambertian surface, Oren-Nayar model, and Phong reflectance model. Reflection operator A linear transformation that, intuitively, maps each point or vector in a given space to its mirror image. The effect is shown in Fig. 47.23. The matrix H corresponding to the transformation has the property HH = I, that is, H^−1 = H, so the reflection matrix can be said to be its own inverse. See rotation. Luminous reflectance The ratio of the luminous flux reflected from the surface of an object to the incident luminous flux. Retroreflection Reflection of light back toward the light source, ideally with as little scattering as possible. Inverse light transport The process of decomposing a rendered or real image into a sum of reflection images, where each image records light according to the number of times it has been reflected before reaching the camera. Oren-Nayar model A reflectance model for diffuse reflection from rough surfaces. One-bounce model A model describing the mutual reflection between two surfaces. Consider two semi-infinite planes A and B with different colors. The color signal C at a point on A includes the light reflected from surface A without mutual reflection (a no-bounce color signal), plus the light reflected from surface B and then reflected from surface A; the latter light is reflected from surface B again, and so on. To simplify the model, this infinite sequence of reflections is truncated after the first inter-reflection, hence "one-bounce."
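Returning to the reflection operator entry above, a short sketch follows: the Householder construction H = I − 2nn^T reflects vectors across the plane (line) with unit normal n, and H is indeed its own inverse. The specific normal vector used is just an example.

```python
import numpy as np

def reflection_matrix(normal):
    """Reflection operator across the hyperplane with the given unit normal n:
    H = I - 2 n n^T.  H is symmetric and its own inverse (H H = I)."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    return np.eye(len(n)) - 2.0 * np.outer(n, n)

# Example: reflect across the vertical axis in 2-D (normal along x)
H = reflection_matrix([1.0, 0.0])
print(H @ np.array([3.0, 2.0]))       # -> [-3.  2.]
print(np.allclose(H @ H, np.eye(2)))  # -> True (H is its own inverse)
```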
47.5.3 Various Reflections Fresnel reflection A reflection on the surface of an optically non-uniform insulating substance. The surface of these substances consists of an interface and an optically neutral colored pigment-containing medium. The interface separates the surface from the surrounding medium, typically air. When the surface is irradiated, part of the radiation hitting the surface cannot penetrate the surface and enter the material, but is reflected at the interface. This is Fresnel reflection, which has a spectral distribution close to the incident light source. It can reflect the incident light energy like a mirror, the spectrum does not change, and it is not affected by the surface material. Compare body reflection. Interface reflection Same as Fresnel reflection. Surface reflection Same as Fresnel reflection. Surface reflectance A description of how a surface interacts with light. See reflectance. Body reflection The reflection caused by the interior of an object when light hits it. At this time, light energy diffuses into the interior of the object (especially the dielectric), and the pigment particles therein absorb light of a certain wavelength to make the object appear a certain color. Compare Fresnel reflection. Phong reflectance model An empirical model of local lighting that models surface reflections as a mixture of diffuse reflection and specular reflections. Phong illumination model A lighting model suitable for describing the characteristics of reflective surfaces that are not completely flat. It is assumed that the specular reflection is the strongest when the angle θ between the reflection line and the line of sight is 0, and as θ increases, the specular reflection decreases as cos^n θ, where n is the specular reflection coefficient. Phong shading model A reflection model. Factors such as diffuse and specular components in reflection and ambient light are considered. The ambient light here does not depend on the orientation of the surface but on the color of the ambient light and the color of the scene. Phong shading See Phong shading model.
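The sketch below evaluates a simple Phong-style illumination at a single surface point, combining ambient, diffuse, and cos^n-falloff specular terms as described above; all coefficient values and vectors are illustrative assumptions, not parameters given in the text.

```python
import numpy as np

def normalize(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def phong_intensity(n, l, v, ka=0.1, kd=0.6, ks=0.3, shininess=20, light=1.0):
    """Phong-style shading at one point: ambient term + diffuse term (N.L)
    + specular term falling off as cos^n of the angle between the mirror
    reflection direction and the viewing direction."""
    n, l, v = normalize(n), normalize(l), normalize(v)
    diffuse = max(float(np.dot(n, l)), 0.0)
    r = 2.0 * np.dot(n, l) * n - l          # mirror reflection of l about n
    specular = max(float(np.dot(r, v)), 0.0) ** shininess
    return light * (ka + kd * diffuse + ks * specular)

# Example: light nearly overhead, viewer slightly off the mirror direction
print(phong_intensity(n=[0, 0, 1], l=[0, 0.3, 1], v=[0, -0.2, 1]))
```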
Specularity See highlight and specular reflection. Cook-Torrance model A shading model of surface reflection in computer graphics for 3-D rendering. It is an alternative to the commonly used Phong illumination model; it incorporates the physical properties of reflection and provides better reflection rendering effects on rough and metallic surfaces.
47.5.4 Refraction Refraction An optical phenomenon in which light is deflected as it passes between different optical media, such as from air to water. The amount of deviation is determined by the difference in the refractive indices of the two media and satisfies Snell's law:

n1 sin(a1) = n2 sin(a2)

Among them, n1 and n2 are the refractive indices of the two media, and a1 and a2 are the corresponding angles in the two media. Law of refraction The law that describes the relationship between the incident angle and the exit angle when light is refracted at the interface between two media. As shown in Fig. 47.24, the refractive index of the first medium is n1, and the refractive index of the second medium is n2. Light is refracted at the interface between the two media. The incidence angle θ1 is the angle between the incident light and the interface normal, and the exit angle θ2 is the angle between the outgoing light and the interface normal. The relationship between these two angles is

n1 sin θ1 = n2 sin θ2
Fig. 47.24 Law of refraction
The law of refraction indicates that the process of imaging with a lens is in practice a nonlinear process; that is, after concentric beams pass through the lens, they do not converge exactly at one point. However, when the incident angle θ1 is small, θ1 can be used instead of sin θ1 to obtain the (paraxial approximation) linear form of the refraction law (corresponding to the pinhole camera model):

n1 θ1 = n2 θ2

Brewster's angle The angle of maximum polarization when light is reflected from the surface of a dielectric material. The light is polarized perpendicular to the surface normal. The degree of polarization depends on the angle of incidence, the refractive index of air, and that of the reflecting medium. Brewster's angle can be calculated as follows:

θB = tan^−1(n2/n1)

Among them, n1 and n2 are the refractive indices of the incident medium (e.g., air) and the reflecting medium, respectively. Refractive index A characteristic of light-propagation media. For each medium, the speed v at which light travels is always less than the speed c at which light travels in a vacuum. The ratio n = c/v is the refractive index of the medium. See index of refraction. Index of refraction The ratio of the sine of the incident angle to the sine of the refraction angle. It is also equal to the ratio of the refractive index of the refracting medium to the refractive index of the incident medium. It can be divided into the absolute index of refraction and the relative index of refraction. If the first medium is a vacuum, the ratio obtained is the absolute index of refraction. For a given wavelength of light, the absolute index of refraction of a medium is the ratio of the speed of electromagnetic waves (light) in a vacuum to its speed in this medium. The relative index of refraction of two media is the ratio of their absolute indices of refraction; that is, the index of refraction between the two media is also the ratio of the speeds of light in the two media. This ratio is used in lens design to explain the bending of light that occurs when light enters one material from another (Snell's law). Absolute index of refraction The refractive index obtained when light enters another medium from a vacuum. The refractive index is always defined for two media that are in contact. Generally, the refractive index of a gas is mostly given relative to a vacuum, while the refractive indices of general liquids and solids are mostly given relative to air.
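The following Python sketch applies Snell's law and the Brewster-angle formula above, and also reports when total internal reflection occurs (covered by the Total reflection entry that follows); the refractive-index values for air, water, and glass are common textbook figures used here only as examples.

```python
import math

def refraction_angle(n1, n2, theta1_deg):
    """Snell's law: n1 sin(theta1) = n2 sin(theta2).
    Returns theta2 in degrees, or None if total internal reflection occurs."""
    s = n1 * math.sin(math.radians(theta1_deg)) / n2
    if abs(s) > 1.0:
        return None                      # beyond the critical angle
    return math.degrees(math.asin(s))

def brewster_angle(n1, n2):
    """Brewster's angle: theta_B = atan(n2 / n1)."""
    return math.degrees(math.atan(n2 / n1))

print(refraction_angle(1.0, 1.33, 30))    # air -> water, ~22.1 degrees
print(refraction_angle(1.5, 1.0, 60))     # glass -> air: None (total reflection)
print(brewster_angle(1.0, 1.5))           # ~56.3 degrees for air -> glass
```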
Total reflection A special kind of refraction. When light is emitted from an optically dense medium with a higher refractive index to an interface of a light sparse medium with a lower refractive index, if the incident angle is increased, the refraction angle will increase faster, the refracted light will become weaker, and the reflected light will become stronger. When the incident angle is greater than a critical angle, the refracted light disappears, and the intensity of the reflected light is equal to the intensity of the incident light (i.e., all the incident light is reflected back to the optical dense medium). Birefringence When a natural light beam is incident on a single optical axis anisotropic crystal medium, it splits into two plane-polarized lights (i.e., linear vibration birefringence) perpendicular to each other in the direction of electric vector vibration. The two split beams have different propagation speeds. One beam is an ordinary light wave, which follows the general law of refraction, and the refractive index is independent of the direction. The other beam is a non-ordinary light wave, which does not follow the general law of refraction, and the refractive index varies with direction to varying degrees. The difference in refractive index between the two is called birefringence. Polarization A state in which the vibration of the electromagnetic field of electromagnetic radiation is restricted to a single plane and is inconsistent with the direction of propagation. In the rays of electromagnetic radiation, the direction of polarization is the direction of the electric field vector. The polarization vector is always in a plane perpendicular to the rays. The direction of polarization in a ray can be random (nonpolarized) or constant (linearly polarized), or there are two polarized plane elements connected to each other perpendicular to each other. In the latter case, depending on the amplitude of the two waves and their relative phases, the combined electric vector trajectory is an ellipse (elliptical polarization). Elliptical polarization and planar elements can be converted to each other through an optical system with birefringence. It also refers to the characterization of polarized light. Polarized light Light whose direction of vibration is asymmetric with respect to the direction of propagation. Light waves are electromagnetic waves, and electromagnetic waves are transverse waves. The vibration vector of the transverse wave (including electronic and magnetic vibration vectors for electromagnetic waves) is perpendicular to its propagation direction. The plane formed by the forward direction of the light wave and the direction of vibration is called the vibration plane. The vibration plane of light is limited to a certain fixed direction and is called plane polarized light or linearly polarized light. If the horizontal and vertical components of the electric field are nondeterministically superimposed, the resulting light is nonpolarized light. What does not satisfy this superposition condition is polarized light, and the locus of the
electric field tip is an ellipse (elliptically polarized light). Light is often partially polarized, that is, it can be viewed as the sum of fully polarized light and completely nonpolarized light. In computer vision, polarization analysis is a field of physics-based vision that has been used for metal-dielectric discrimination, surface reconstruction, fish classification, defect detection, and structured light triangulation. Nonpolarized light Natural light, or light from a light source that is not polarized. Although the direction of vibration of the electric vector of such light is perpendicular to the direction of travel of the light, that is, perpendicular to the direction of propagation of the light energy, no vibration direction is dominant. Abbe constant A constant that describes the ratio of refraction to dispersion of an optical system. Its expression can be written as

Vd = (nd − 1) / (nf − nc)
where n is the refractive index and the subscript indicates the wavelength: d represents the helium line D (587.6 nm), and f and c represent the hydrogen lines F and C (486.1 nm and 656.3 nm), respectively. Fresnel equations An equation describing the change in light as it passes through the interface of two media with different refractive coefficients. This special equation corresponding to the electric field strength is also called Fresnel equation.
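A tiny computation of the Abbe constant defined above follows; the three refractive-index values correspond roughly to a common crown glass and are only illustrative.

```python
def abbe_number(n_d, n_f, n_c):
    """Abbe constant V_d = (n_d - 1) / (n_f - n_c), using the refractive
    indices at the d, F, and C spectral lines."""
    return (n_d - 1.0) / (n_f - n_c)

# Illustrative crown-glass-like indices at 587.6, 486.1, and 656.3 nm
print(abbe_number(n_d=1.5168, n_f=1.5224, n_c=1.5143))   # ~ 63.8
```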
47.6 Wave Optics
Wave optics is a branch of optics. It studies the phenomenon and principle of light in the process of propagation from the perspective of light as waves (Peatross and Ware 2015).
47.6.1 Light Wave Light A general term for electromagnetic radiation. A collective name for γ rays, X-rays, ultraviolet, visible light, infrared rays, etc., which can produce photochemical effects or photoelectric effects. It may refer to the lighting in the scene or to the radiation originating from the scene that is incident on the sensors. Visible, infrared, or ultraviolet light is used in most imaging applications.
Generally, it refers to the electromagnetic wave band that can stimulate the retina of the human eye to produce a visual sensation and is generally called visible light. The propagation velocity in a vacuum is c = 2.99792458 × 10^8 m/s. Light can be described by particles (called photons) or by electromagnetic waves (wave-particle duality). A photon is a tiny packet of electromagnetic vibration energy that can be characterized by its wavelength or frequency. Wavelength is generally measured in meters (m) and its multiples and submultiples. Frequency is measured in hertz (Hz) and its multiples. Wavelength (λ) and frequency (f) are related by the following expression:

λ = v / f
Among them, v is the speed of wave propagation in the medium, which is usually approximated by the speed of light (c) in a vacuum. Light wave Especially refers to the electromagnetic waves that can be perceived through vision or through conversion by the photoelectric effect, including ultraviolet waves, visible light waves, and infrared waves. Wavelength For periodic waves (such as light waves), the distance between two adjacent points with the same phase along the direction of propagation (or the distance between two consecutive peaks of the wave). The wavelength is often denoted by λ, and its value is the result of dividing the wave speed by the frequency. Basic unit: m (meter). Electromagnetic waves, especially visible light (wavelengths of about 400–700 nm), are important in imaging technology. Nanometer A unit derived from the International System of Units (SI) basic unit of length, denoted nm. 1 nm = 1 × 10^−9 m, or 3.937 × 10^−8 in, or 10 Å (angstroms). For example, the wavelengths of colors (RGB) are generally measured in nanometers. Micron One millionth of a meter. All kinds of visible light have a wavelength of less than one micron. Wave number The number of wavelengths λ in a unit length. A factor of 2π is often added in use, as in the circular frequency. Light vector A periodic vector, perpendicular to the direction of light travel, that represents the light (as a transverse wave). According to the electromagnetic theory of light, both the electric and the magnetic vector can be taken as the light vector, but considering the mechanisms of polarization and scattering, it is more appropriate to use the electric vector as the light vector. In an isotropic medium, the electric vector is the electric field
strength vector. In a crystal, the electric vector should be the electric displacement vector. Coherent light Light whose emitted waves have the same wavelength and phase. Such light waves remain focused over long distances. Incoherent In wave optics, the state in which there is no fixed phase relationship between two waves. If two incoherent waves are superimposed, the interference effect does not last longer than the coherence time of the waves. Interference When ordinary light interacts with materials whose dimensions are similar to the wavelength of light, or when coherent light interacts, the most obvious effects from the perspective of computer vision are the generation of interference fringes and the effects of laser illumination. When an image is transmitted in an electronic medium, the term also refers to electronic interference. Interference fringe The fringes produced, following the principle of optical interference, by two adjacent beams. When optical interference occurs, the most significant effect is the generation of interference fringes on the surface illuminated by the light. These are parallel, almost equally spaced, alternating lighter and darker bands. These bands can cause blurring of edge positions. Interference color The mixed color of non-monochromatic light when it interferes. The reason is that some wavelengths of light interfere constructively, some interfere destructively, and some are weakened less.
47.6.2 Scattering and Diffraction

Scattering Due to the action of gas molecules and aerosols in the atmosphere, the atmospheric effect produced when electromagnetic radiation (mainly the short visible wavelengths) propagates in all directions. See Rayleigh scattering. Scatter matrix Given a set of d-D points represented as column vectors {x1, x2, . . ., xn}, the following d × d matrix:

S = ∑_{i=1}^{n} (xi − μ)(xi − μ)^T

where μ = (1/n) ∑_{i=1}^{n} xi is the mean. It is also equal to (n – 1) times the sample covariance matrix.
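As an illustrative sketch (not from the handbook), the scatter matrix and its relation to the sample covariance matrix can be checked with NumPy; the array shapes and names below are arbitrary:

```python
import numpy as np

def scatter_matrix(X):
    """Scatter matrix S of n d-D points stored as the rows of X (shape n x d)."""
    mu = X.mean(axis=0)      # mean vector of the point set
    D = X - mu               # deviations of each point from the mean
    return D.T @ D           # sum over i of (x_i - mu)(x_i - mu)^T

# Example: 100 random points in 3-D
X = np.random.rand(100, 3)
S = scatter_matrix(X)

# S equals (n - 1) times the sample covariance matrix
n = X.shape[0]
assert np.allclose(S, (n - 1) * np.cov(X, rowvar=False))
```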
Scattering coefficient A measure of the part of the radiation in an optical medium that is redirected into other directions because of inhomogeneities in the refractive index. Equal thicknesses of the medium scatter equal fractions of the radiant flux: dΦ/Φ = β(λ)dx. Unit: 1/m. See absorption coefficient and extinction coefficient. Volumetric scattering A physical lighting effect or phenomenon in which radiant energy (including light) is redistributed by materials suspended in the transmitting medium, such as dust, haze, etc. Rayleigh scattering Scattering of light as it passes through a substance composed of particles that are much smaller than the wavelength. Also called molecular scattering. The frequency of the scattered light is the same as that of the incident light. The intensity of the scattered light is inversely proportional to the fourth power of the wavelength and is proportional to the square of the cosine of the angle between the incident light and the scattered light. According to Rayleigh scattering, blue light in daylight is scattered more by molecules in the air than light with longer wavelengths (such as red and green light), making the sky blue. Rayleigh criterion A method to quantify the surface roughness of an object based on the wavelength λ of the electromagnetic radiation. It determines whether the surface will behave more like a specular reflector or more like a diffuse reflector. If the root mean square of the irregularities of a surface is greater than λ/(8cosθ), where θ is the angle of incidence, then the surface can be considered rough. Ideal scattering surface An idealized surface scattering model. All points are equally bright in the observation direction (regardless of the angle between the line of sight and the surface normal), and the surface reflects all incident light completely without absorption. Subsurface scattering A special type of scattering, where light is not only reflected from the surface of an object but also partially penetrates beneath the surface and is then re-emitted after a series of internal reflections. This is an important factor affecting the appearance of a surface; a typical example is human skin. Dispersion 1. One of the characteristics of propositional representation in propositional representation models. An element represented by a proposition may appear in many different propositions, and these propositions can be represented in a consistent manner using the semantic network.
2. The phenomenon that white light is broken down into various colors. That is, white light will be dispersed into various colors when it is refracted. To be precise, this corresponds to the quantitative relationship between the refractive index n of a medium and the wavelength λ. 3. The scattering of light as it passes through a medium. Compton backscattering [CBS] A nondestructive detection technique for obtaining a cross-sectional image of a subject. In atomic physics, Compton scattering (also known as the Compton effect) refers to the phenomenon that when photons of X-rays or γ-rays interact with matter, they lose energy and their wavelength becomes longer. The energy of the gamma photons scattered in a certain direction is determined, and the intensity of the scattered gamma photons is proportional to the number of electrons in the effective interaction volume in the target. Therefore, by scanning the target with gamma photons, the material information of the target can be obtained. Compton backscatter imaging uses the target's backscattering of rays to reconstruct the image, and it is sensitive to organic matter of low atomic number but high density (such as drugs and explosives). Diffraction The bending of the propagation direction when waves pass the edges or apertures of obstacles. The smaller the aperture and the larger the wavelength, the more pronounced this phenomenon is. It also refers to the phenomenon that waves continue to propagate after encountering obstacles or small holes. At this time, the first-order wavefront produces a weaker second-order wavefront, and the two interfere with each other and self-interfere, forming various diffraction patterns. Geometrical theory of diffraction Asymptotic theory for analyzing the diffraction characteristics of electromagnetic waves using the concept of rays. It is essentially a geometrical treatment of wave optics. Airy disk The ringed circular spot, caused by the diffraction of light, into which the image of an object point is spread on the focal plane. This round spot will affect the sharpness of the image. If the lens magnification is β and the pupil magnification is βp (the ratio of the exit pupil diameter to the entrance pupil diameter), then
I = [2J1(r)/r]^2
Among them, J1(r) is the first-order Bessel function of r = π(x^2 + y^2)^(1/2)/(λFe), λ is the wavelength of light, and the effective f-number is Fe = F(1 − β/βp), where F is the f-number of the lens. The first minimum of the above formula is reached at r1 = 1.21967λFe, which is often regarded as the radius of the diffraction spot.
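As an illustrative sketch (not part of the handbook), the Airy pattern I = [2J1(r)/r]^2 and the radius of its first dark ring can be evaluated with SciPy; the wavelength and f-number used here are arbitrary example values:

```python
import numpy as np
from scipy.special import j1   # first-order Bessel function J1

wavelength = 500e-9            # 500 nm green light, in meters
Fe = 8.0                       # effective f-number

x = np.linspace(1e-9, 20e-6, 2000)      # radial distance on the image plane (m)
r = np.pi * x / (wavelength * Fe)       # argument of the Airy formula
I = (2 * j1(r) / r) ** 2                # normalized intensity, peak value 1

# The first zero of 2*J1(r)/r lies at r = 3.8317, i.e. x1 = 1.21967 * wavelength * Fe
x1 = 1.21967 * wavelength * Fe
print(f"Airy disk radius: {x1 * 1e6:.2f} micrometers")   # about 4.9 um
```

This reproduces the statement below that r1 is roughly 5 μm for green light at Fe = 8.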
Fig. 47.25 Diffraction grating (a light source illuminates a diffraction grating and produces light stripes of orders m = 0, 1, 2)
When the effective f-number Fe becomes larger, that is, when the aperture stop becomes smaller, r1 increases; and when λ becomes larger, r1 also increases. For example, for green light at λ = 500 nm, r1 ≈ 5 μm when Fe = 8. Diffraction grating An array of diffractive elements that produces the effect of periodically changing the phase and/or amplitude of a wave. The simplest way to achieve this is to use an array of slits (see moiré interferometry), as shown in Fig. 47.25. Grating See diffraction grating. Diffraction limit The maximum image resolution of an imaging system or an optical system. This limit is set by the size of the aperture and the wavelength of the observed light. Diffraction-limited Describes an optical system whose aberrations have been corrected, so that its resolution is limited only by the diffraction effect. Fresnel diffraction Diffraction observed in the vicinity of the lens aperture (the near field), where the light is clearly bent and forms fringes.
Chapter 48
Mathematical Morphology for Binary Images
Mathematical morphology is a mathematical tool for analyzing images based on morphology (Serra 1982).
48.1 Image Morphology
Image morphology uses structuring elements with certain morphology to measure and extract the corresponding shapes in the image in order to achieve the purpose of image analysis and recognition (Serra 1982; Zhang 2017b).
48.1.1 Morphology Fundamentals

Mathematical morphology A mathematical field that analyzes images based on morphology. It can be regarded as a theory for analyzing spatial structure. Also called image algebra. Its characteristic is to use structuring elements with a certain shape to measure and extract the corresponding shapes in an unknown image in order to achieve the purpose of image analysis and recognition. Its mathematical foundation is set theory. Its algorithms have a naturally parallel structure. The operation object of mathematical morphology can be a binary image, a grayscale image, or a color image. The basic operations have their own characteristics for binary images and grayscale (multivalued) images, and there are many kinds of combined operations and practical algorithms. Geometric morphometrics A statistical tool for analyzing relative geometrical information related to living things, especially geometrical changes caused by various processes.
Image morphology A method of image processing using set operations. See mathematical morphology operation. Fuzzy morphology A special mathematical morphology based on fuzzy logic instead of traditional Boolean logic. Image algebra See mathematical morphology. Properties of mathematical morphology The nature of mathematical morphology operations. It depends only on the operation itself, not on the operation object. Commonly used mathematical morphological properties include shift invariance, commutativity, associativity, increasing, idempotency, extensive, and anti-extensive. For the four basic operations of mathematical morphology for binary image, binary dilation, binary erosion, binary opening, and binary closing, the cases with or without the above properties can be summarized into Table 48.1 (let A be the object of calculation, B be the structuring element). For the four basic operations of mathematical morphology for gray-level image, gray-level dilation, gray-level erosion, gray-level opening, and graylevel closing, they have or do not have the above properties (because the shift is not used in the operations of mathematical morphology for gray-level image, instead the ordering/ranking operation is used, so the shift invariance is not considered) which can be summarized into Table 48.2 (set f as the input image and b as the structuring element).
Table 48.1 The properties of the four basic operations of mathematical morphology for binary images

Property | Dilation | Erosion | Opening | Closing
Shift invariance | (A)x ⊕ B = (A ⊕ B)x | (A)x ⊖ B = (A ⊖ B)x | A ○ (B)x = A ○ B | A • (B)x = A • B
Commutativity | A ⊕ B = B ⊕ A | | |
Associativity | (A ⊕ B) ⊕ C = A ⊕ (B ⊕ C) | (A ⊖ B) ⊖ C = A ⊖ (B ⊕ C) | |
Increasing | A ⊆ B ⇒ A ⊕ C ⊆ B ⊕ C | A ⊆ B ⇒ A ⊖ C ⊆ B ⊖ C | A ⊆ B ⇒ A ○ C ⊆ B ○ C | A ⊆ B ⇒ A • C ⊆ B • C
Idempotency | | | (A ○ B) ○ B = A ○ B | (A • B) • B = A • B
Extensive | A ⊆ A ⊕ B | | | A ⊆ A • B
Anti-extensive | | A ⊖ B ⊆ A | A ○ B ⊆ A |
Table 48.2 The properties of the four basic operations of mathematical morphology for gray-level images

Property | Dilation | Erosion | Opening | Closing
Commutativity | f ⊕ b = b ⊕ f | | |
Associativity | ( f ⊕ b) ⊕ c = f ⊕ (b ⊕ c) | ( f ⊖ b) ⊖ c = f ⊖ (b ⊕ c) | |
Increasing | f1 ≤ f2 ⇒ f1 ⊕ b ≤ f2 ⊕ b | f1 ≤ f2 ⇒ f1 ⊖ b ≤ f2 ⊖ b | f1 ≤ f2 ⇒ f1 ○ b ≤ f2 ○ b | f1 ≤ f2 ⇒ f1 • b ≤ f2 • b
Idempotency | | | ( f ○ b) ○ b = f ○ b | ( f • b) • b = f • b
Extensive | f ≤ f ⊕ b | | | f ≤ f • b
Anti-extensive | | f ⊖ b ≤ f | f ○ b ≤ f |
Shift invariance A technique or process or description that does not change with the shift (translation) of an image (or object). The invariant nature of the object at different locations in the image. Compare scale invariance and rotation invariance. A property of mathematical morphology. It means that the result of the shift (displacement) operation does not vary depending on the sequence of the displacement operation and other operations, or that the result of the operation has nothing to do with the displacement of the operation object. Commutativity One of the operational properties of mathematical morphology. It indicates that changing the order of the operands during the operation has no effect on the result. Associativity One of the operational properties of mathematical morphology. It means that each operation object can be combined in different sequences during the operation without affecting the result. Increasing One of the operational properties of mathematical morphology. Take the mathe matical morphology for binary image as an example, let MO be the mathematical morphological operator, and set A and structuring element B as the operation objects. If MO(A) ⊆ MO(B) when A ⊆ B, then MO has increasing. It can also be said that MO has inclusiveness or MO has order-preserving properties. It can be shown that dilation and erosion, as well as opening and closing, have all been increasing. For mathematical morphology for gray-level images, considering the input image f (x, y) and the structuring element b(x, y), the same properties can be obtained. Increasing transformation Transformations/manipulations with increasing or order-preserving properties. Typical morphological operations, such as dilation, erosion, opening, and closing, can be considered as increasing transformations. Suppose that T is used to represent
1726
48 Mathematical Morphology for Binary Images
transformations and lowercase letters are used to represent transformation objects. If a ⊆ b, then T(a) ⊆ T(b). Idempotency One of the operational properties of mathematical morphology. Taking mathe matical morphology for binary image as an example, let MO be a mathematical morphology operator; idempotency means that given a set A, then MOn(A) ¼ MO(A) holds. In other words, no matter how many times the MO operation, the result is the same as one operation. For mathematical morphology for gray-level images, considering the input image f(x, y) and the structuring element b(x, y), the same properties can be obtained. Extensive The result of the operator’s set operation in mathematical morphology includes the original set. Take the mathematical morphology for binary image as an example. Let MO be a mathematical morphological operator. Given a set A, if MO(A) ⊇ A, MO is extensive. Dilation and closing are both extensive. For mathematical morphology for gray-level images, considering the input image f(x, y) and the structuring element b(x, y), the same properties can be obtained. Anti-extensive A property of mathematical morphology operation. Taking the mathematical mor phology for binary image as an example, the result of the operator’s set operation is included in the original set. Let MO be a mathematical morphological operator and A be an operation object. If MO(A) ⊆ A, then MO is said to have anti-extensive. Erosion and opening are both anti-extensive. For mathematical morphology for gray-level images, considering the input image f(x, y) and the structuring element b(x, y), the same properties can be obtained.
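As a quick numerical illustration (not from the handbook), several of these properties — idempotency of opening and closing, anti-extensivity of opening, and extensivity of closing — can be checked on an arbitrary gray-level image with SciPy:

```python
import numpy as np
from scipy.ndimage import grey_opening, grey_closing

rng = np.random.default_rng(0)
f = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)  # arbitrary gray-level image
size = (5, 5)                                             # flat 5x5 structuring element

opened = grey_opening(f, size=size)
closed = grey_closing(f, size=size)

assert np.array_equal(grey_opening(opened, size=size), opened)  # idempotency of opening
assert np.array_equal(grey_closing(closed, size=size), closed)  # idempotency of closing
assert np.all(opened <= f)                                      # opening is anti-extensive
assert np.all(closed >= f)                                      # closing is extensive
```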
48.1.2 Morphological Operations Mathematical morphology operation A set of mathematically defined image processing and image analysis operations that can be applied to binary images and grayscale images. The result of the operation is mainly based on the spatial pattern of the input data value, not just the data value itself. For example, a morphological line thinning algorithm (see thinning operator) will determine the position in an image where a line is represented by more than one pixel wide data. The algorithm further chooses to set one of these redundant pixels to 0. Figure 48.1 shows an example. Figure (a) is before thinning and Figure (b) is after thinning. Morphological transformation A binary image and grayscale image transformation, whose basic feature is that it mainly works on the pattern of pixel values rather than on the value itself. Some
Fig. 48.1 Mathematical morphology thinning example: (a) before thinning; (b) after thinning
typical examples include dilate operators, erode operators, thinning operators, and so on. Dilation A basic mathematical morphology operation on images. According to the different types of images, such as binary image, grayscale image, and color image, it can be divided into binary dilation, gray-level dilation, and color dilation. Dilate operator An operator that performs an operation that expands a binary or grayscale object relative to the background. The effect is to fill the holes inside the object and connect the object region that is closer. Compare erode operator. Erosion A basic mathematical morphological operation on images. According to the different types of images, such as binary image, grayscale image, and color image, it can be divided into binary erosion, grayscale erosion, and color erosion. Erode operator An operator that performs an operation that shrinks a binary or grayscale object relative to the background. The effect is to remove isolated object regions and separate object regions that are connected only by shreds. Compare dilate operator. Opening A basic mathematical morphology operation on images. Depending on the different types of images, such as binary image, grayscale image, and color image, it can be divided into binary opening, gray-level opening, and color opening. Open operator An operator that performs the mathematical morphology operation to a binary image. The operator can be composed of a sequence of N erode operators followed by N dilate operators, all of which use the same specific structuring element. This operator is useful both in separating contact objects and removing small regions. Figure 48.2 shows a schematic diagram, where the right image is obtained by opening the left image with a structuring element of the suitable size. Closing A basic mathematical morphology operation on images. According to the different types of images, such as binary image, grayscale image, and color image, it can be divided into binary closing, gray-level closing, and color closing.
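As an illustrative sketch (not from the handbook) of the dilate, erode, open, and close operators described above, SciPy's binary morphology routines can be applied to a small synthetic image:

```python
import numpy as np
from scipy.ndimage import binary_opening, binary_closing

# Synthetic binary image: a 9x9 square object with a hole, plus an isolated noise pixel
A = np.zeros((20, 20), dtype=bool)
A[3:12, 3:12] = True      # object
A[6, 6] = False           # small hole inside the object
A[16, 16] = True          # isolated noise pixel

B = np.ones((3, 3), dtype=bool)   # 3x3 structuring element

opened = binary_opening(A, structure=B)   # removes the isolated noise pixel
closed = binary_closing(A, structure=B)   # fills the small hole

print(opened[16, 16], closed[6, 6])       # False True
```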
Fig. 48.2 The effect of open operator
Close operator An operator that successively applies two binary morphological operations, performing a dilation operation first and then an erosion operation. The effect is to fill small holes or gaps in the image. Dissymmetric closing A mathematical morphology operation. This involves dilating the image with one structuring element and eroding the resulting image with another structuring element, so that the dilation and erosion are (partially) complementary. Since two different structuring elements are used, the two operations are not necessarily symmetrical. Compare closing. Geodesic transformation An improved mathematical morphological transformation based on the geodesic distance. It can operate on only part of the image or object, and the structuring elements it uses can be adjusted for each pixel as needed. Let the geodesic distance between two points p and q in the set P be dP( p, q), and let the geodesic sphere BP(c, r) in P, where c is the center and r is the radius, be defined as

BP(c, r) = {p ∈ P, dP(c, p) ≤ r}

Considering a subset Q of P, the geodesic dilation DP^(r)(Q) of size r of Q in P can be defined as

DP^(r)(Q) = ∪_{c∈Q} BP(c, r) = {p ∈ P, ∃c ∈ Q, dP(c, p) ≤ r}

Similarly, the geodesic erosion EP^(r)(Q) of size r of Q in P can be defined as

EP^(r)(Q) = {c ∈ Q, BP(c, r) ⊆ Q} = {c ∈ Q, ∀p ∈ P∖Q, dP(c, p) > r}

Figure 48.3(a) shows a schematic diagram of geodesic dilation. Consider the set P (yellow region, including two parts that are not connected) and a subset Q (white area). Dilate Q with the geodesic sphere (red dotted line) to get the dilation result
Fig. 48.3 Geodesic dilation (a) and geodesic erosion (b)
(light red region). Figure 48.3(b) shows a schematic diagram of geodesic erosion. Consider the set P (yellow region, including two parts that are not connected) and a subset Q (white area). Erode Q with the geodesic sphere (red dotted line) to get the erosion result (light red region). Geodesic operator In mathematical morphology, an operator that uses two input images. First apply a morphological operator (a basic dilation or erosion operator) to the first image, and then require the result to be greater than or equal to the second image. Geodesic dilation See geodesic transformation. Geodesic erosion See geodesic transformation. Geodesic sphere See geodesic transformation.
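As a hedged illustration (not from the handbook), a discrete geodesic dilation of a marker set Q inside a mask set P is commonly realized by iterating an elementary dilation followed by an intersection with P; the helper below is a sketch of that idea using SciPy:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def geodesic_dilation(Q, P, size=1):
    """Geodesic dilation of the marker Q (boolean array) of given size inside the mask P."""
    result = Q & P                       # the marker must lie inside the mask
    B = np.ones((3, 3), dtype=bool)      # elementary (unit) structuring element
    for _ in range(size):
        result = binary_dilation(result, structure=B) & P   # dilate, then restrict to P
    return result

# Example: P consists of two disjoint blobs, Q marks a pixel in the first blob only
P = np.zeros((10, 10), dtype=bool)
P[1:4, 1:4] = True
P[6:9, 6:9] = True
Q = np.zeros_like(P)
Q[2, 2] = True
print(geodesic_dilation(Q, P, size=20).sum())   # 9: only the first blob is reached
```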
48.1.3 Morphological Image Processing Morphology 1. Shape: the state of the earth’s surface. This is one of the ground physical properties that can be obtained with the pixel values of synthetic aperture radar images. 2. Morphology: A branch of biology that studies the structure of animals and plants. Originally refers to a discipline that specializes in the nature of biological forms, the goal of which is to describe the form of organisms and study their regularity. The main consideration is the shape of a structure. In image engineering, mathematical morphology is used. See mathematical morphology operation. Structuring element A special object in mathematical morphology operations. It is itself a well-defined set of images (in mathematical morphology for binary image) or functions
(in mathematical morphology for gray-level image). The operation of mathematical morphology is an operation on the object with structuring elements. For each structuring element, first specify a coordinate origin, which is the reference point for the structuring element to participate in the morphological operation. Note that the origin of the structuring element may or may not be included in the structuring element (i.e., the origin itself does not necessarily belong to the structuring element), but the operation results in the two cases are often different. Basic neighborhood structure for morphological image processing. A structuring element is usually a small image that defines a shape pattern. Morphological operations on a source image can combine structuring elements and source images in a variety of ways. Golay alphabet In mathematical morphology, a structuring element obtained by rotating (with symmetry) a structuring element sequence on an appropriate image grid (such as a triangular grid, a square grid, a hexagonal grid, an octagonal grid, etc.). More generally, a designation for a sequence of combined structuring elements. For example, the sequence of a set of (eight) combined structuring elements for sequential thinning B(i) = (E(i), F(i)) is as follows (here, the structuring element templates are represented in matrix form, and the other four templates can be obtained from symmetry or rotation):
B(1) = [0 0 0; * 1 *; 1 1 1]   B(2) = [* 0 0; 1 1 0; 1 1 *]
B(3) = [1 * 0; 1 1 0; 1 * 0]   B(4) = [1 1 *; 1 1 0; * 0 0]

(each template is written row by row, from the top row to the bottom row)
Among them, 1 indicates that the element belongs to E(i), 0 indicates that it belongs to F(i), and * indicates a don't care element, that is, the value is not important or can be taken arbitrarily (because it does not participate in the calculation). Simply connected Structuring elements without holes are called simply connected structuring elements. Don't care element In a mathematical morphology operation, an element of the structuring element template that can take any value (the element can be ignored in the operation). Morphological image processing The use of mathematical morphological operations to process images. Commonly used are dilate operators, erode operators, open operators, close operators, and so on. In such methods, images are considered as collections of points.
Morphologic algorithm A morphological analysis algorithm based on the basic operations and combined operations of mathematical morphology. It can not only extract image components but also perform preprocessing and post-processing on objects containing shapes of interest. Morphologic factor Quantity or characteristics without units. It is usually the ratio between different geometric factors and geometric parameters of the object. Morphological filtering A filtering technique that uses mathematical morphological operations on an image. Among them, a many-to-one binary (or Boolean) function h is used in a window W in the input image f(x, y) to output an image g(x, y): gðx, yÞ ¼ h½Wf ðx, yÞ Examples of Boolean operations used are as follows: 1. OR: This is equivalent to morphological dilation using a square structuring element of the same size as W. 2. AND: This is equivalent to the morphological erosion using a square structural element of the same size as W. 3. MAJ (majority): This is a morphologically equivalent filter of a median filter for binary images. Morphological filter A nonlinear filter that modifies the geometric features of an image locally through morphological transformation. If each image in Euclidean space is regarded as a set, morphological filtering is a set operation that changes the shape of the image. Given the filtering operation and filtering output, a quantitative description of the input image geometry can be obtained. One implementation of the morphological filter is a combination of opening and closing. Morphological segmentation Binary images are processed using mathematical morphological operations to extract isolated regions of a desired shape or object. The desired shape or object is specified with a morphological kernel. This process can also be used to separate contact objects. Morphometry Techniques and processes for measuring the shape of an object. Morphological shape decomposition [MSD] A better method for extraction of skeleton based on mathematical morphology.
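As an illustrative sketch (not from the handbook), the three Boolean window filters listed above — OR, AND, and MAJ — coincide for binary images with a maximum filter (dilation), a minimum filter (erosion), and a median filter, respectively, and can be expressed with SciPy:

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter, median_filter

rng = np.random.default_rng(1)
f = rng.random((32, 32)) > 0.7          # arbitrary binary image
W = 3                                   # 3x3 window, i.e. a square structuring element

g_or  = maximum_filter(f, size=W)       # OR over the window  = binary dilation
g_and = minimum_filter(f, size=W)       # AND over the window = binary erosion
g_maj = median_filter(f, size=W)        # majority vote       = binary median filter
```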
48.2 Binary Morphology
When the operation object of mathematical morphology is a binary image, it is called binary morphology (Gonzalez and Woods 2008; Serra 1982; Zhang 2017b).
48.2.1 Basic Operations

Mathematical morphology for binary images Mathematical morphology applied to binary images. The operation object is regarded as a set of binary pixels, and the operations are performed on this set. The two sets involved in each operation are not equivalent: A is generally used to represent the set of image pixels (the operand), B is the structuring element, and B operates on A. Morphology for binary images Same as mathematical morphology for binary images. Binary mathematical morphology Same as mathematical morphology for binary images. Binary dilation A basic operation in mathematical morphology for binary image. Let the dilate operator be represented by ⊕. The dilation of image A by structuring element B is written A ⊕ B; the expression is

A ⊕ B = {x | [(B̂)x ∩ A] ≠ ∅}

The above formula shows that when A is dilated with B, B is first reflected about the origin (B̂), and then the reflection is translated by x, giving (B̂)x. Here, the intersection of A and (B̂)x must not be the empty set. In other words, the set obtained by dilating A with B is the set of origin positions of B for which the displaced B intersects at least one non-zero element of A. According to this explanation, the above formula can also be written as

A ⊕ B = {x | [(B̂)x ∩ A] ⊆ A}

This formula helps to borrow the concept of convolution to understand the dilation operation. If B is regarded as a convolution mask, dilation is achieved by first reflecting B about the origin and then sliding the reflection continuously over A.
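As a hedged illustration (not from the handbook), the set definition of binary dilation and erosion can be tried out directly with SciPy's built-in routines on a small example:

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

A = np.zeros((7, 7), dtype=bool)
A[2:5, 2:5] = True                       # a 3x3 square object

B = np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]], dtype=bool)    # cross-shaped structuring element

dilated = binary_dilation(A, structure=B)   # grows the object by the shape of B
eroded  = binary_erosion(A, structure=B)    # shrinks the object by the shape of B

print(dilated.sum(), eroded.sum())          # 21 1: the square grows to 21 pixels, shrinks to 1
```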
Minkowski addition An operation between two sets that is the same as binary dilation. Letting A and B be two binary sets, the Minkowski addition of A and B can be expressed as

A ⊕ B = ∪_{b∈B} (A)b
Binary erosion A basic operation of mathematical morphology for binary image. Let the erode operator be represented by ⊖. The erosion of image A by structuring element B is written A ⊖ B; the expression is

A ⊖ B = {x | (B)x ⊆ A}

The above formula shows that the result of eroding A with B is the set of all x such that (B)x, i.e., B translated by x, is still contained in A. In other words, the set obtained by eroding A with B is the set of origin positions of B for which B is completely included in A. The above formula can help to understand the erosion operation by borrowing the concept of correlation. The effect is to "shrink" or "refine" the object in a binary image. The scope and direction of the refinement are controlled by the size and shape of the structuring element. Minkowski subtraction An operation between two sets, similar to binary erosion. Letting A and B be two binary sets, the Minkowski subtraction of A and B can be expressed as

A ⊖ B = ∩_{b∈B} (A)b

It can be seen that, compared with binary erosion, only the displacement of A is in the other direction. Iterative erosion In mathematical morphology for binary image, the operation of iteratively eroding an image with the same structuring element. Iterative erosion can be used to obtain region seeds. For example, iteratively eroding the original image f with the unit circular structuring element b can be expressed as

fk = f ⊖ kb, k = 1, 2, . . ., m

Here f1 = f ⊖ b, f2 = f1 ⊖ b = f ⊖ 2b, and so on, continuing until fm = f ⊖ mb and fm+1 = ∅; m is the largest number of erosions for which the result is non-empty. Binary opening A basic operation of mathematical morphology for binary image. That is, using the same structuring element B, the image A is first eroded (operator ⊖) and then dilated
(operator ⊕). The open operator is represented by ○. The opening of A by B is written A ○ B; the expression is

A ○ B = (A ⊖ B) ⊕ B

The binary opening operation can filter out spurs smaller than the structuring element in the binary image and can also cut slender overlaps to achieve separation. Opening characterization theorem A theorem representing the ability of a binary opening operation to extract shapes from images that match the structuring element used. If ○ is the open operator, ⊖ is the erode operator, A is the binary image, B is the structuring element, and (B)t is the translation of B by t, then the opening characterization theorem can be expressed as

A ○ B = ∪ {(B)t | (B)t ⊆ A}

The above formula shows that opening A with B selects the points in A that match B (in shape), and these points can be obtained by the translations of the structuring element B that are completely contained in A. Spurs Originally refers to a component with short spikes (thorns) fixed to the back of a rider's boots that can be used to urge the horse to run fast. It often refers to a defect of the metal wires in printed circuit board (PCB) images, corresponding to extra bumps on the metal lines; these bumps change the line width. The opening operation in mathematical morphology can be used to eliminate "spurs." Considering that the layout of metal lines has three directions, horizontal, vertical, and diagonal, octagonal structuring elements are often used. Binary closing A basic operation of mathematical morphology for binary image. First use the same structuring element B to dilate the binary image A (operator ⊕), and then erode (operator ⊖) the result. The close operator is represented by •. The closing of A by B is written A • B; the formula is

A • B = (A ⊕ B) ⊖ B
The binary closing operation can fill gaps or holes that are smaller than the structuring element in the binary image and can also connect short breaks to build connectivity. Closing characterization theorem A theorem that represents the binary closing operation's ability to extract shapes from images that match the structuring element used. If • is used to represent the close operator, A is an image, B is a structuring element, and
(B̂)t represents that B is first reflected about the origin (B̂) and then its reflection is translated by t, then the closing characterization theorem can be expressed as

A • B = {x | x ∈ (B̂)t ⇒ (B̂)t ∩ A ≠ ∅}

The above formula shows that the result of the closing of A by B includes all the points that meet the following condition: whenever the point is covered by (B̂)t, the intersection of (B̂)t and A is not empty.
Mouse bites A defect on a metal line in a printed circuit board (PCB) image that corresponds to a notch or missing piece and changes the line width. Closing operations in mathematical morphology can be used to eliminate "mouse bites." Considering that the layout of metal lines has three directions, horizontal, vertical, and diagonal, octagonal structuring elements are often used. Duality A property between two things that are both connected and opposed. Two concepts or theories have similar properties and can be applied to one or the other. For example, the connection of several points in projective space is of the same nature as the connection of several lines. These connections are dual. There are many examples in image engineering, for example, the point–line duality and the point–sine curve duality in the Hough transform. There are also examples in mathematical morphology: duality of binary dilation and erosion, duality of binary opening and closing, duality of gray-level dilation and erosion, duality of gray-level opening and closing, and so on.
b represents the mapping of B about the origin, and Ac represents the Among them, B complement of A. Duality of binary opening and closing In the two operations of binary opening and binary closing, one operation on the image object is equivalent to the other operation on the image background. Let A ○
1736
48 Mathematical Morphology for Binary Images
B mean using B to open A, and A • B mean using B to close A. The duality of binary opening and closing can be expressed as b ðA∘BÞc ¼ Ac • B b ðA • BÞc ¼ Ac ∘B b represents the mapping of B about the origin, and Ac represents the Among them, B complement of A. Hit-or-Miss [HoM] transform One of the basic operations of mathematical morphology for binary image. The operator that implements this operation is called a hit-or-miss operator. It is a basic tool for shape detection. In fact, two structuring elements are used, which correspond to two operations. Let A be the original image and E and F be a pair of disjoint (no intersection) sets (a pair of structuring elements is determined). The expression of the hit-to-hit transform * is
Among them, Ac represents the complement of A, represents a dilate operator, and represents an erode operator. Any pixel z in the hit-or-miss transform result satisfies: E + z is a subset of A, and F + z is a subset of Ac. In turn, a pixel z that meets the above two conditions must be in the result of the hit-or-miss transform. E and F are called hit structuring element and miss structuring element, respectively. It should be noted that the two structural elements must satisfy E \ F ¼ ∅; otherwise, the hit-or-miss transform will give the result of the empty set. Figure 48.4 shows an example of using a hit-or-miss transform to determine the location of a square region of a given size. Figure 48.4(a) shows the original image, including four solid squares of 3 3, 5 5, 7 7, and 9 9 respectively. The 3 3 solid square E in Figure 48.4(b) and the 9 9 box F(1 pixel in width) in Figure 48.4(c) are combined to form the structuring element B ¼ (E, F). In this example, the hit-or-miss transform is designed to hit the region covering E and “miss out” the region F. The final result is shown in Figure 48.4(d). Hit-or-Miss [HoM] operator An operator of mathematical morphology, it works by using a logical AND on corresponding bits of pixels in an input image and a structuring element. This
(a)
(b)
(c)
Fig. 48.4 Determine the square region with the hit-or-miss transform
(d)
48.2
Binary Morphology
1737
operation works best for binary images, but it can also be used for grayscale images. See hit-or-miss transform.
48.2.2 Combined Operations and Practical Algorithms Combined operation A morphological analysis operation combined with basic operations of mathematical morphology. Thinning algorithm A method to implement the medial axis transform to obtain a regional skeleton. The skeleton is obtained by iteratively eliminating the boundary points of the region. The conditions for eliminating these boundary points are that these points are not endpoints, and the removal will not affect the connectivity and will not erode the region too much. Thinning operator An operator of mathematical morphology. Thinning is an operation that can be used to eliminate selected foreground pixel regions in a binary image. Thinning operators are a bit like erode operators or open operators. It has many applications, especially skeletonization, and repairing the output of edge detection to make the line segment a single pixel width. Thinning is generally only used for binary images; it also gives binary images as output. In Fig. 48.5, the left image is a binary image of a region, and the right image is the result obtained by thinning it with 8-neighborhood structuring elements. Thinning A technique that uses mathematical morphology for binary image to erode an object region without splitting it into multiple sub-regions. Let structure element B thinning the set A be denoted as A B, which can be expressed by the hit-or-miss transform as A B ¼ A ðA * BÞ ¼ A \ ðA * BÞc Among them, the superscript c represents the complement set operation. If you define a series of structuring elements {B} ¼ {B1, B2, . . ., Bn}, where Bi+1 represents the result of Bi rotation, thinning can also be defined as a series of operations: Fig. 48.5 Thinning effect
1738
48 Mathematical Morphology for Binary Images
A fBg ¼ A ðð ððA B1 Þ B2 Þ Þ Bn Þ This process is to first thin the object region to be thinned with B1 and then to refine the previous result with B2 and so on until it is thinned with Bn. The whole process can be repeated until there is no change in the results of two consecutive thinnings. Comparing thickening. Line thinning See thinning. Sequential thinning In mathematical morphology, thinning operations are performed sequentially. Letting A denote an image and {B(1), B(2), . . ., B(n)} denote n combined structuring elements B(i) ¼ (E(i), F(i)), then the sequential thinning can be written as A BðiÞ ¼ A Bð1Þ Bð2Þ BðnÞ Pruning An operation that uses mathematical morphology for binary image to the results of thinning and extraction of skeleton. It is an important supplement to the operations of thinning and extraction of skeleton. It is often used as a post-processing means for thinning and extraction of skeleton to remove excess parasitic components (“spurs,” i.e., small deformations that do not match the overall structure). Here we need to use a set of structuring elements, including iteratively using some structuring elements, to eliminate noise pixels. Thickening operator A mathematical morphology operator. Thickening is an operation that can be used to expand a selected foreground pixel region in a binary image. The thickening operator is somewhat like a dilate operator or a close operator. It has many applications, including determining an approximate convex hull of a shape, building a skeleton by influence zones, and so on. Thickening is generally only used for binary images, and it also gives binary images as output results. In Fig. 48.6, the left image is a binary image, and the right image is the result of thickening it five times in the horizontal direction.
Fig. 48.6 Thickening effect
48.2
Binary Morphology
1739
Thickening The operation is the opposite of the effect of using the mathematical morphology for binary image operation to achieve the thinning. The structure element B is used to thicken the binary set A (denoted A B), which can be expressed by the hit-ormiss transform * as
If you define a series of structuring elements {{B} ¼ {B1, B2, . . ., Bi, Bi+1, . . ., Bn}, where Bi+1 represents the result of Bi rotation, thickening can also be defined as a series of operations:
In practice, the background of the set A may be thinned first, and then the thinned background may be supplemented to obtain a thickening result. In other words, if you want to thickening set A, you can first construct its complement C (¼ Ac), then thinning C, and finally find Cc. Sequential thickening In mathematical morphology, the thickening operation is performed sequentially. Letting A denote an image and {B(1), B(2), . . ., B(n)} denote n combined structuring elements B(i) ¼ (E(i), F(i)), then the sequential thickening can be written as
Skeleton by maximal balls In mathematical morphology, the skeleton obtained by using the maximal ball, that is, in the discrete set A, the set S(A) of the center p of the largest ball: SðAÞ ¼ fp 2 A : ∃r 0, Bðp, r Þ is the maximal ball in Ag According to the above definition, the skeleton of a disc is its center, but the skeleton of two contacting discs is three points instead of a straight line connecting the two centers, as shown in Fig. 48.7. Maximal ball A term used in mathematical morphology to calculate the skeleton. The sphere can be represented by B( p, r), which represents the set with p as the center and r as the Fig. 48.7 Skeletons of discs
Skeleton of a disc
Skeleton of two contacting discs
1740
48 Mathematical Morphology for Binary Images
Fig. 48.8 Maximal ball and non-maximal ball
Non-maximal ball Maximal ball B A
S
radius (r 0), where the distance between each point and the center p is d r. Here d can be Euclidean distance, city block distance, or chessboard distance. A ball in a discrete set is called the maximum if and only if there is no larger ball in the discrete set that can contain the ball. In Fig. 48.8, the maximal ball is not the larger ball B, but the smaller ball A. This is because in the set S, it is no longer possible to include a ball containing the ball A, but there may be a ball containing the ball B. Quench function In mathematical morphology, the function used to represent the correlation between each point p in the skeleton S(A) of the binary set A and the radius rA( p) in the calculation of the skeleton by maximal ball. An important property of rA( p) is that the quenching function can guarantee the complete reconstruction of the original set A through the union operation of the maximal ball B: A¼
[ ½p þ r A ðpÞB
p2SðAÞ
Granulometry A name derived from stereology, used mostly in biology and materials science. It is an important tool for mathematical morphology, which can help obtain the shape information of the object without segmenting the object. Research on the size characteristics of a set (such as the size of a set of regions). It is usually performed by means of a series of opening operations in mathematical morphology (using structuring elements of increasing size) and analyzing the distribution of the resulting size. Sieving analysis A mathematical morphology method based on the principle of granulometry. Given a number of differently sized objects, a series of sieve (structuring element) with increasing size is used for sieving to classify objects into different categories according to size. The result can be represented as a size histogram (the horizontal axis is the object size and the vertical axis is the number of objects of various sizes), which is also called the granulometric curve or granulometric spectrum. Granulometric spectrum Particle size distribution results obtained from particle size analysis. See sieving analysis.
48.2
Binary Morphology
1741
Granulometric curve See sieving analysis. Sieve filter A morphological filter that only allows dimensional structures in a narrow range to pass. For example, to extract bright spotlike defects of size n n pixels (n is an odd number), the following sieve filters can be used (where ○ represents the open operator and the superscript represents the size of the structuring element):
f ∘bðnnÞ f ∘bðn 2Þðn 2Þ
The function of the sieve filter is similar to that of the band-pass filter in the frequency domain. The first term in the above formula removes all bright spot structures with a size smaller than n n, and the second term removes all bright spot structures with a size smaller than (n – 2) (n – 2). Subtracting these two terms leaves a structure with dimensions between n n and (n – 2) (n – 2). Generally, the sieve filter works best when the size of the filtering structure is several pixels. Region filling algorithm A method to obtain the region enclosed by a given contour by using operations of mathematical morphology for binary image. Suppose a region in a given image has a set of 8-connected boundary points as A, and the structuring element B includes a pixel’s 4-neighborhood. First assign a point in the region to 1, and then fill it according to the following iterative formula (set the points in other regions to 1 as well): X k ¼ ðX k 1 BÞ \ Ac , k ¼ 1, 2, 3,
Among them, Ac represents the complement of A, and represents the dilate operator. When Xk ¼ Xk–1, the iteration is stopped. At this time, the intersection of Xk and A includes the interior of the filled region and its outline.
Chapter 49
Mathematical Morphology for Gray-Level Images
The operation object of the gray-level image mathematical morphology is a graylevel image or a color image with multiple attribute values (Gonzalez and Woods 2008; Sonka et al. 2014; Zhang 2017b).
49.1
Gray-Level Morphology
Gray-level mathematical morphology is the result of generalizing binary mathematical morphology by gray value ordering. It considers not only the spatial position but also the amplitude of the gray values of pixels (Gonzalez and Woods 2008; Sonka et al. 2014; Zhang 2017b).
49.1.1 Ordering Relations Grayscale morphology Same as grayscale mathematical morphology. Maximum operation In mathematical morphology for gray-level image, the operation of calculating the maximum value. Corresponds to the union operation in mathematical mor phology for binary image. Minimum operation In mathematical morphology for gray-level image, the operation of calculating the minimum value. Corresponds to the intersection operation in mathematical morphology for binary image.
Maximum value In mathematical morphology for gray-level image, the result obtained by performing the maximum operation on two signals f(x) and g(x) point by point. Can be expressed as ð f _ gÞðxÞ ¼ max f f ðxÞ, gðxÞg Among them, for any value a, there are max{a, 1} ¼ a. If x 2 D[f] \ D[g], then ( f _ g)(x) is the maximum value of the two finite values f(x) and g(x); otherwise ( f _ g)(x) ¼ 1; if x 2 D[f] – D[g], then ( f _ g)(x) ¼ f(x); if x 2 D[g] – D[f], then ( f _ g)(x) ¼ g(x); finally, if x is not in the support region of any of f(x) and g(x), i.e., x2 = D[f] [ D[g], then ( f _ g)(x) ¼ 1. Minimum value In mathematical morphology for gray-level image, the result obtained by performing the minimum operation on two signals f(x) and g(x) point by point. Can be expressed as ð f ^ gÞðxÞ ¼ min f f ðxÞ, gðxÞg Among them, for any value a, there are min{a, 1} ¼ 1. If x 2 D[f] \ D[g], then ( f _ g)(x) is the minimum value of the two finite values f(x) and g(x); otherwise ( f ^ g)(x) ¼ 1. Ranking problem The problem of determining the order or level of a certain number of objects or data items. Ordering relation A relationship between elements in a collection. For u, v 2 S (both elements u and v belong to the set S), let denote the ordering relation, and then u v means u precedes v, or u is the same as v. If the elements of the collection are natural numbers, then the ordering relation can be represented by . Strict ordering relation A relationship between the elements (may be symbols) in the set S. For u, v 2 S, let ≺ denote the ordering relation, and then u ≺ v means u (strictly) precedes u ≺ v. If the elements of the set are natural numbers, then the strict ordering relation can be represented by k, gk disappears in fl. The first step in ultimate erosion is conditional dilation: Uk ¼
f kþ1 fbg ; f k
The second step of the ultimate erosion is to subtract the above dilation result from the erosion of f, that is, gk ¼ f k U k If there are multiple objects in the image, the union of their respective gk can be calculated to obtain the ultimate eroded object set g. In other words, the ultimate erosion image is g¼ [
k¼1, m
gk
where m is the number of times of erosion. Shrinking erosion A special conditional erosion is also a specific ultimate erosion.
49.2
49.2
Soft Morphology
Soft Morphology
Soft morphological filter can be defined on the basis of weighted sorting statistics. It is less sensitive to additive noise and changes in object shape (Zhang 2017b). Soft mathematical morphology An extension of gray-level mathematical morphology, replacing the original maximum and minimum operations with other ranking operations. For example, in a neighborhood of an image corresponding to a structuring element of 5 5, the central pixel is replaced by 80% of the pixels in which it is ranked. Sometimes the order can also be weighted. See fuzzy morphology. Soft morphology Same as soft mathematical morphology. Soft morphological filter A morphological filter defined on the basis of weighted-order statistics. Replace flat structuring elements with flat structuring systems to increase flexibility in noise tolerance. The two basic operations are soft dilation and soft erosion. On this basis, soft opening and soft closing can also be combined. Structuring system Structuring elements used in soft morphological filters in mathematical mor phology for gray-level images. It can be described as [B, C, r], including three parameters: the finite plane sets C and B, C ⊂ B, and a natural number r satisfying 1 r |B|. Set B is called a structural set, |B| is its size, C is its hard center, B – C gives its soft contour, and r is the order of its center. Flat structuring element An element with a value of zero in the support region of a structuring element. Makes a subset of the collection of image planes. Flat filter Morphological filters using flat structuring elements. Stack filter A nonlinear morphological filter based on flat structuring elements. It is a generalization of the median filter (in a broader sense, it can be seen as a generalization of various ordering filters). It can effectively maintain the constant region and monotonous change region in the image, suppress the pulse interference, and maintain the edge. From a filter implementation perspective, a stack filter can be seen as a way to implement some nonlinear filters in the binary domain. Flat structuring system Promotion of flat structuring elements. Expressed as [B, C, r], where B and C are a set of finite planes, C ⊂ B, and r is a natural number that satisfies 1 r |B|. The set B is called a structural set, C is the (hard) center of B, B – C gives the (soft) contour of B, and r is the order of the center of B (i.e., C).
1754
49 Mathematical Morphology for Gray-Level Images
Soft erosion A basic operation in soft morphological filters. The soft erosion of the image f using the flat structuring system [B, C, r] is denoted as f ⊝ [B, C, r](x), and the expression is f ⊝½B, C, r ðxÞ ¼ composite set fr ◊ f ðcÞ : c 2 C x g [ rth minimum
f ð bÞ : b 2 ð B C Þ x
Among them, ◊ represents repeated operations; a multiset is a set of a series of objects on which repeated operations can be performed. The above formula shows that the soft erosion of the image f using the structural system [B, C, r] at any position x is accomplished by moving the sets B and C to position x and using the f value in the moved set to form a composite set. Among them, the f value of the hard center C of the structure set B is repeatedly used r times, and then the r-th smallest value in the composite set is taken. Soft dilation A basic operation in soft morphological filters. The soft dilation of the image f using the flat structuring system [B, C, r] is denoted as f [B, C, r](x), and the expression is f ½B, C, r ðxÞ ¼ composite sets fr ◊ f ðcÞ : c 2 C x g [ f ðbÞ : b 2 ðB C Þx r‐th maximum Among them, ◊ represents repeated operations; a multiset is a set of a series of objects on which repeated operations can be performed. The above formula shows that the soft dilation of the image f using the structural system [B, C, r] at any position x is accomplished by moving the sets B and C to position x and using the f value in the moved set to form a composite set. Among them, the f value of the hard center C of the structure set B is repeatedly used r times, and then the r-th maximum value in the composite set is taken.
Chapter 50
Visual Sensation and Perception
Vision is an important function of human understanding of the world. Vision includes “sense” and “perception,” so it can be further divided into visual sensation and visual perception. In many cases, visual sensation is often called vision, but in fact visual perception is more important and complicated (Aumont 1994; Davis 2001; Finkel and Sajda 1994; Peterson and Gibson 1994; Zakia 1997).
50.1
Human Visual System
The human visual process is a complex process consisting of multiple steps. In a nutshell, the visual process consists of three sequential processes: optical process, chemical process, and neural processing process (Marr 1982).
50.1.1 Human Vision Human vision The most important way for humans to perceive the world through the senses. Vision is the most important way for humans to recognize the objective world. The visual process can be seen as a complex process from sensation (feeling the image obtained by 2-D projection of the 3-D world, i.e., light stimulation) to perception (recognizing the 3-D world content and meaning). The human visual system can analyze, associate, interpret, and comprehend the observed phenomena rationally and discriminate and judge the outline of information in the visual process. Vision is not only a psychological and physical perception but also a source of creativity. Its experience comes from the perception and analysis of the surrounding environment.
50
Visual Sensation and Perception
Human visual system [HVS] The sum of organs and functions on the human body that implement and complete the vision procedure. Primitives only represent human visual organs, including eyes, retinas, optical nerve networks, side regions of the brain, and striated cortex in the brain. Now generally includes a variety of visual functions, such as brightness adaptation, color vision, horoptera, and so on. Vision system 1. Abbreviation for the human visual system. 2. A device constructed by humans that uses computers to implement the functions of the human visual system. Characteristics of human vision A description of the characteristics, performance, and attributes of the human visual system. Involves basic facts about the nature of the perceived scene, such as brightness, contrast, sharpness (small details), color, motion, and flicker. It also includes visual masking effects, color constancy, contrast sensitivity, motion constancy, and sensation constancy. Human visual properties Various characteristics and masking effects of the human visual system (i.e., different sensitivity to different brightness or brightness contrast). Common ones include: 1. Brightness masking property: The human eye is less sensitive to noise added in high-brightness regions; 2. Texture masking properties: If the image is divided into smooth regions and texture-dense regions, the sensitivity of the human visual system to smooth regions is much higher than that of texture-dense regions; 3. Frequency property: The human eye has different sensitivity to different spatial frequency components in the image. The human eye has low sensitivity to highfrequency content and relatively high resolution in low-frequency regions; 4. Phase property: The human eye is less sensitive to changes in the signal phase angle than to changes in the amplitude; 5. Directional nature: The human eye has a choice of directions when observing the scene. The human visual system is most sensitive to changes in light intensity in the horizontal and vertical directions and is least sensitive to changes in light intensity in oblique directions. Stereo fusion A capability of the human visual system. When a pair of stereo images are independently displayed to two eyes, the 3-D spatial situation of the scene can be perceived, that is, the stereo correspondence problem is solved. The fact that people can fuse in random dot stereograms shows that high-level recognition is not needed to solve all stereo vision correspondence problems.
50.1
Human Visual System
Normal vision Ideal state for observation with eyes. When the eye is not making any accommodation (eye muscle relaxation), the parallel light emitted by the distant object is just focused on the retina (the far point is at infinity). Relative field The full range a person sees when looking at an object in the scene with one eye. Including regions where imaging is not clear. This range is limited by the nose, cheeks, eyebrows, frame, eye corners, etc. It is approximately 60 upward, 75 downward, 60 inward, and 100 outward. Also called monocular FOV. Visual angle The opening angle of the range of the observed scene or the size of the object (both ends of the object) to the eyes. Elevation When the line of sight is above the horizontal line, the angle between the horizontal line and the line of sight in the vertical plane. Listing rule When the eye changes the line of sight, the new position of the eyeball will rotate once relative to the original position, which corresponds to a simple rotation around the center of rotation of the eye in the plane determined by the new and old lines of sight.
50.1.2 Organ of Vision
Organ of vision The overall structure of the visual zone of the eye, the optic nerve, and the cortex of the brain.
Configural processing hypothesis In the human visual system, inverting a face destroys the configural information (the spatial relations between facial features) while leaving the feature information itself (eyes, nose, mouth) relatively intact.
Brain A functional unit in the human visual system that processes information. The brain uses neural signals obtained from the sensors on the retina and passed through the optic nerve to generate neural functional patterns that are perceived as images. See eye.
Occipital lobe One of the components of the human brain. The fibers from the lateral geniculate nucleus terminate at this part. Because it is the primary receptive field of nerve impulses from the eye, it is also called the receptive field.
Visual cortex A part of the brain, located in the posterior cortex, that is mainly responsible for the processing of visual information. The human visual cortex includes the primary visual cortex (striated cortex) and the advanced visual cortex (extrastriate cortex). The two hemispheres of the brain each have a portion of the visual cortex. The visual cortex of the left hemisphere receives information from the right field of view, and the visual cortex of the right hemisphere receives information from the left field.
Striated cortex The part of the visual cortex in the human brain also known as the primary visual cortex (V1). There, the response to the light stimulus undergoes a series of processing steps to finally form the appearance of the scene, thereby transforming the perception of light into the perception of the scene. It outputs information to the extrastriate cortex.
Extrastriate cortex The advanced visual cortex. Its input information is obtained from the striated cortex; some of it is involved in processing the spatial location information of objects and related motion control, and some is involved in object recognition and content memory.
Log-polar transformation The transformation applied to the image in the visual cortex.
Oriented Gaussian kernels and their derivatives A spatio-temporal filter used to describe the properties of the spatio-temporal structure of cells in the visual cortex. A Gaussian kernel is used to filter in the spatial domain, and a Gaussian derivative is used to filter in the time domain. After thresholding the responses, the results can be combined into histograms. This method can provide simple and effective features for far-field (non-close-shot) video.
Oriented Gabor filter bank A spatio-temporal filter bank used to describe the spatio-temporal structure of cells in the visual cortex. When analyzing video, a video fragment can be considered as a spatio-temporal volume defined in XYT space. For each voxel (x, y, t), a Gabor filter bank is used to compute a local appearance model over different orientations and spatial scales and a single time scale. In this way, the video volume data can be filtered, and further specific features can be derived from the responses of the filter bank. (A small spatial filter-bank sketch follows these entries.)
Synapse The structural part where neurons contact each other in the body; it is also where neurons are functionally connected and plays an important role in information transmission.
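As a companion to the oriented Gabor filter bank entry, the following is a minimal purely spatial (2-D) sketch of building such a bank. The kernel size, scales, orientations, and aspect ratio are illustrative assumptions, not parameters from the source.

```python
import numpy as np

def gabor_kernel(ksize, sigma, theta, wavelength, gamma=0.5):
    """Real part of a 2-D Gabor kernel with orientation theta (radians)."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)      # rotated coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * xr / wavelength)
    return envelope * carrier

# A small bank: 4 orientations x 2 scales (illustrative values).
bank = [gabor_kernel(31, sigma, theta, wavelength=4 * sigma)
        for sigma in (2.0, 4.0)
        for theta in np.linspace(0, np.pi, 4, endpoint=False)]
```

Convolving an image (or, with an added temporal dimension, a video volume) with each kernel and collecting the responses gives the kind of oriented, multi-scale description the entry refers to.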
50.1.3 Visual Process
Vision An important function by which humans understand the objective world. Vision includes two steps, "sight" and "perception"; the former corresponds to visual feeling and the latter to visual perception, that is, the visual sensation produced when the image of the scene stimulates the retina, and the visual perception obtained in the visual cortex.
Line of sight A straight line from the observer or camera to the scene (usually to some specific object).
Visual image Generally refers to the image obtained by human eyes.
View volume The subspace of the (infinite) 3-D space that is bounded by the projection center of the camera and the boundary of the visible region on the image plane. This subspace may also be limited by near or far planes due to focus and depth-of-field constraints. Figure 50.1 shows a schematic (a small point-in-volume sketch follows the figure caption).
Visual edge See perception of visual edge.
Perception of visual edge Perception of the boundary (which may not necessarily be a geometric or physical boundary) between two surfaces of different brightness when viewed from the same viewpoint. There can be many reasons for the difference in brightness, such as different lighting and different reflection properties.
Visual performance The ability of humans to complete certain visual tasks with the help of the visual organs. The ability to distinguish object details and the ability to recognize contrast are often used as visual performance indicators.
Vision procedure The complex sequence of steps by which the human visual system completes human visual function, in the order of optical procedure, chemical procedure, and neural process procedure.
Fig. 50.1 View volume (schematic labels: view volume, projection center)
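To make the view volume entry concrete, the sketch below tests whether a 3-D point in camera coordinates lies inside a simple symmetric pinhole view volume. The focal length, image half-width, and near/far clipping distances are assumed values for illustration only, not values from the source.

```python
def in_view_volume(point, focal=1.0, half_width=0.5, near=0.1, far=100.0):
    """Test whether a 3-D point (camera coordinates, z along the
    optical axis) lies inside a symmetric pinhole view volume."""
    x, y, z = point
    if not (near <= z <= far):            # near/far clipping planes
        return False
    u, v = focal * x / z, focal * y / z   # perspective projection
    return abs(u) <= half_width and abs(v) <= half_width

print(in_view_volume((0.2, -0.1, 2.0)))   # True: projects inside the image
print(in_view_volume((3.0, 0.0, 2.0)))    # False: outside the image boundary
```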
Optical process The first step of the vision procedure. The physical basis of the optical process is the human eye. When the eye focuses on the object in front of it, the light entering the eye from outside is imaged on the retina. The optical process basically determines the size of the image.
Optical procedure The procedure of imaging using light and lenses.
Chemical procedure The second step of the vision procedure. The light-receiving cells (photosensitive units of the human eye) are distributed on the surface of the retina to receive light energy and form a visual pattern. There are two types of light-receiving cells: cone cells and rod cells. Cone cells work in strong light and sense color; the vision of cone cells is called photopic vision. Rod cells work only in very dark light, do not sense color, and only provide an overall view of the field of vision; the vision of rod cells is called scotopic vision. The chemical process basically determines the brightness and/or color of the image.
Rhodopsin A photoreceptor pigment formed by the combination of retinal and protein in the cone and rod cells of the human eye. It is mainly found in rod cells. It is sensitive to weak light and is gradually synthesized in the dark (it is generally fully regenerated after about 30 min in the dark, at which point visual sensitivity to light reaches its highest value). After absorbing light, it is broken down by chemical reactions and generates neural signals. It helps in the conversion of information in the chemical procedure that determines the brightness or color of the image.
Neuron process procedure The third step of the vision procedure. In the brain's nervous system, the response to light stimuli undergoes a series of processing steps to finally form the appearance of the scene, thereby transforming the perception of light into the perception of the scene.
Visual pathway The complete route by which an image is formed by the photon receivers on the retina, converted into electrical signals transmitted along the optic nerve, and passed to the visual cortex in the brain responsible for visual perception.
Reaction time The time interval from when the eye receives the light stimulus until the brain becomes aware of the light effect. The response time is t_r = a/L^n + t_i, where t_i is the invisible part, about 0.15 s. The first term is called the visible part (a is a constant), and it varies with the brightness L; it is 0 when L is large and about 0.15 s when L is close to the visual threshold.
Visual masking Same as visual masking effect.
Visual masking effect In the human visual system, the phenomenon in which one visual stimulus causes another visual stimulus to become invisible or attenuated. For example, in an image, noise in a flat region is more obvious (easier to find) than noise in a textured region. This is due to visual masking: the gray-level fluctuations of the textured region itself weaken the human eye's perception of the noise.
Masking Originally, a measure of how a person's response to one stimulus is affected when a later stimulus occurs. Now it refers to the phenomenon that one image component (target) has reduced visibility or perceptibility due to the presence of another component. The human visual system is affected by several types of masking, such as:
1. Brightness masking: The visual threshold increases with the increase of background brightness, which is the result of brightness adaptation. In addition, high brightness levels increase the effect of flicker.
2. Contrast masking: Errors (and noise) in highlight regions are often difficult to find, which is due to a characteristic of the human visual system, namely that the visibility of details in one image component is weakened by the presence of another.
3. Edge masking: Errors near edges are often difficult to detect.
4. Texture masking: Errors in texture regions are often difficult to detect, while the human visual system is sensitive to errors in uniform regions.
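The texture-masking property above is often exploited for perceptual noise shaping, for example when embedding a watermark: more noise can be hidden where local activity is high. The following is a minimal sketch of that idea, using the local standard deviation as a crude masking map; the window size, base amplitude, and gain are assumed values, and this is not a procedure taken from the handbook.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def masking_map(image, window=7):
    """Crude texture-masking map: local standard deviation of the image.
    Higher values mean added noise is less visible there."""
    img = image.astype(np.float64)
    mean = uniform_filter(img, window)
    mean_sq = uniform_filter(img * img, window)
    return np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))

def embed_noise(image, noise, base_amplitude=1.0, gain=0.1):
    """Scale the noise by the masking map before adding it to the image."""
    amp = base_amplitude + gain * masking_map(image)
    return np.clip(image.astype(np.float64) + amp * noise, 0, 255)
```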
50.2 Eye Structure and Function
The human eye is an important part of the human visual system and the physical basis for realizing optical processes (Zhang 2017a).
50.2.1 Eye Structure
Eye One of the two main parts of the human visual system. The human eye is its input sensor. See brain.
Eye-camera analogy The correspondence of the lens, pupil, and retina in the eye to the lens, aperture, and imaging surface in the camera.
Cross section of the human eye The cross section of the human eye as observed from the side of the body. Figure 50.2 shows a simplified cross section of the human eye with the names of
the parts marked, to help understand the anatomy of the human eye and the characteristics related to imaging.
Fig. 50.2 Cross section of the human eye (labeled parts: anterior chamber, pupil, lens, iris, retina, fovea, blind spot, optic nerve)
Anterior chamber The part of the eyeball that lies between the cornea and the iris. The anterior chamber is filled with a colloidal liquid called the anterior chamber fluid, with a refractive index of 1.336. See cross section of the human eye.
Pupil The small circular opening in the center of the iris, in front of the lens of the eyeball. The muscle fibers in the iris control its size, which is related to the accommodation of the eyes and the retinal illuminance. It shrinks when the outside light is bright and expands when the outside light is dark. The diameter is about 3–4 mm under normal light, can be reduced to about 1.5 mm in strong light, and can be enlarged to about 8 mm in dim light. What one usually sees is the image of the pupil, i.e., the entrance pupil of the eye, which is about 0.12 times larger than the pupil itself. Its role in imaging corresponds to the aperture of the camera. See cross section of the human eye.
Lens A crystal at the front of the human eye sphere (eyeball). Roughly lenticular (approximately 10 mm radius of curvature in front and approximately 6 mm radius of curvature in the rear), it provides refractive power and focusing for different distances. Its role in imaging corresponds to the lens of the camera. See cross section of the human eye.
Iris The annular wall on the bulb wall of the human eyeball, at the forefront of the middle vascular membrane, between the pupil and the sclera. It is one of the more mature and widely used biometric features exploited with the help of image technology. It has a high degree of uniqueness and does not change during life. See cross section of the human eye.
Yellow spot A small yellow spot in the middle of the retina. Oblate, with a diameter of about 2–3 mm, up to 5 mm. Its center is the fovea, where cone cells are most numerous and vision and color vision are most acute. Yellow spot pigments make the macula less sensitive to short-wavelength light.
Fovea A high-resolution region in the middle of the retina of the human eye. It is a small pit in the middle of the retina, also called the macula (yellow spot). The occupied field of view is about 1°–2°; its center, the central pit, has a diameter of about 0.2–0.3 mm. It has the sharpest vision and the highest resolution. Similar regions in artificial sensors (such as log-polar sensors) mimic this retinal configuration of photon receivers. See blind spot and cross section of the human eye.
50.2.2 Retina
Retina The last of the three coats (fibrous membrane, vascular membrane, and retina) of the eyeball wall of the human eye, that is, the light-sensitive surface at the back of the eye. The retina has a complex layered structure that converts incident light into neural impulses, which are transmitted to the brain through the visual pathways. It carries light receptors and nerve fibers; it is mainly composed of visual cells that sense light stimulation and a variety of neurons that communicate and conduct the resulting pulses. The nerve fibers transmit the pulses generated by light stimulation into the brain. There are no visual cells at the point where the optic nerve exits the eyeball, so there is no vision where the fibers gather to enter the brain; this point, at the back of the eye toward the nose, is called the blind spot. On the other side, at about the same distance (toward the ear), there is a small dimple facing the pupil, called the fovea, where vision is most acute. See cross section of the human eye.
Blind spot The part of the retina that is not sensitive to light. It is near the fovea. It is the point where the optic nerve enters the retina; there are no visual cells there, so light stimuli cannot be felt. This is a structural limitation. At this point, the eye cannot detect radiation, so the image of an external object cannot cause a light sensation when it falls on this place. See cross section of the human eye.
Indirect vision A way of turning the eyeball to see a scene more clearly. When the image of the scene does not fall in the central region of the retina, the image may not be clear because of the lower visual cell density there; to see more clearly, one can rotate the eyeball to make the image fall on the fovea.
Perifovea The region of the retina far from the fovea. Anatomically, the retina is divided into ten layers along the direction of light travel, and in the transverse direction the whole retina is divided into a middle part and an outer part. In the middle there are (1) the fovea, 1.5 mm in diameter; (2) the region near the fovea, a ring 0.5 mm wide where cells are concentrated; and (3) the far fovea (perifovea), a ring 1.5 mm wide.
Retinal disparity The difference between the images on the retinas of the two eyes when a person looks at a three-dimensional object with depth. This difference arises from the slightly different angles at which each eye views the object.
Retina illuminance The illuminance produced on the retina by outside objects. According to the definition of illuminance and Lambert's law, it can be shown that the retinal illuminance ∝ (object brightness × pupil area)/(distance from the posterior node of the eye to the retina)^2. The pupil area and the distance from the posterior node of the eye to the retina can be considered constant, so the retinal illuminance is only proportional to the brightness of the object and is not related to other quantities, such as the size of the object or the direction and distance of the object from the observer.
Retinal illumination Illumination when the stimulus acts on the retina. See retina illuminance.
Troland The CIE-recommended unit of retinal illuminance, symbol Td. One Td is the illuminance on the retina when an object with a brightness of 1 cd/m2 is viewed through a pupil area of 1 mm2.
Retinex Abbreviation of retinal cortex.
Retinal cortex See retinex theory.
Retinex theory A color theory that describes the color constancy of the human visual system. The word retinex comes from the condensation of the two words retina and cortex. It can explain the principles of scene perception and/or image formation in the human visual system. Because it is assumed that each component of the color signal (in RGB color space) is processed separately as a monochrome signal, the result of the combination should depend only on the reflection characteristics of the surface and have nothing to do with lighting conditions. Based on this theory, the reflection characteristics of different surfaces are compared within the short, medium, and long spectral ranges, respectively. The comparison results in the color channels give a lightness record. Assume that the brain compares the brightness record values of
each color channel in the same scene. This has nothing to do with the spectral composition of the light (and is therefore independent of the relative intensity of the light). According to this theory, the construction of surface colors in the brain is the result of a "comparison of comparisons." This is fundamentally different from other color perception theories: it involves only comparisons, without mixing or superposition. Note, however, that this theory does not provide a complete description of human color constancy. The retinal cortex theory is also a theory that explains the different perceptions of the human visual system under different brightness and ambient light. It points out that the perceived lightness is mainly related to the light reflected by the scene, is largely independent of the illuminating light, and exhibits color constancy. For an image f(x, y), according to the theory, f(x, y) = E(x, y) · R(x, y), where E(x, y) represents the illumination component of the surrounding environment and has nothing to do with the scene, and R(x, y) represents the reflection ability of the scene and has nothing to do with the lighting. In this way, the theory gives a mathematical model for image acquisition. If the reflection can be estimated correctly, it can provide a representation that does not change with illumination for any given image. See retinex algorithm.
Retinex algorithm An image enhancement algorithm based on the retinex theory; the purpose is to compute a light-source-independent quantity, lightness, at each pixel of the image. The key observation here is that the illumination on the surface of the scene changes slowly, resulting in a slow change in the observed surface brightness. This is in contrast to the strong changes at reflectance or fold edges of the scene. The retinex algorithm works on the observed brightness B = L · I, where L is the surface brightness (or reflectance) and I is the illumination. Take the logarithm of B at each pixel, and the product of L and I becomes a sum of logarithms. Slow changes in surface brightness can be detected by differencing and then eliminated by thresholding. Combining the results gives a lightness map (possibly up to an arbitrary scalar factor).
Retina recognition A biometric recognition technology that has received extensive attention and research. The retina is one of the organs that plays an important role in human vision. Retina recognition uses the uniqueness of the blood vessel distribution pattern around the retina to verify a person's identity. The main steps involved in retinal recognition include (1) collecting a retinal image with a rotary scanner and extracting features; (2) digitizing the feature point data; and (3) identifying by template matching.
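The log-domain decomposition described in the retinex algorithm entry can be sketched as a single-scale retinex, in which a wide Gaussian blur of the image stands in for the slowly varying illumination. This is only an illustrative sketch under that assumption (the surround width sigma is an arbitrary choice), not the exact procedure described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def single_scale_retinex(image, sigma=80.0, eps=1.0):
    """log(observed) minus log(smoothed observed): the smoothed image
    stands in for the slowly varying illumination I, so the difference
    approximates the log of the surface reflectance L."""
    img = image.astype(np.float64) + eps           # avoid log(0)
    illumination = gaussian_filter(img, sigma)     # crude illumination estimate
    lightness = np.log(img) - np.log(illumination)
    # Rescale to 0..255 for display.
    lightness -= lightness.min()
    if lightness.max() > 0:
        lightness *= 255.0 / lightness.max()
    return lightness
```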
Invariance hypothesis An analytical hypothesis for studying visual perception and cognition. The analytical method begins with the analysis of visual stimuli and attempts to link isolated scientific elements to all aspects of the real perceptual experience. All analytical theories are based on invariance assumptions. For a given retinal projection pattern, it can be considered that there are an infinite number of possible scenes that could give rise to this pattern. The invariance hypothesis assumes that, among the many possible scenes, the observer will always choose one and only one.
Light adaptation The process by which the sensitivity of the retina decreases as the amount of incident light increases. After a person moves from a dark environment to a bright environment, the visual sensitivity rapidly decreases to adapt to the bright environment. It is the brightness adaptation performed by the cone cells of the retina at a brightness of about 3 cd/m2 (candela per square meter) or above. Compare dark adaptation.
Dark adaptation The process by which the sensitivity of the retina increases when the amount of incident light decreases. When people move from a bright environment to a dark environment, their visual sensitivity gradually improves to adapt to the dark environment. This applies when the brightness is below about 0.03 cd/m2 (candela per square meter). At this time, it is mainly caused by the adaptation of the rod cells of the retina. Compare light adaptation.
50.2.3 Photoreceptor Photoreceptor The human eye’s retina is full of light-sensitive receiving cells. Divided into cone cells and rod cells. See chemical procedure. Cone cell Cells in the retina of the human eye that sense ray radiation. It is a class of retinal photoreceptor neurons, named after the dendritic cones. There are approximately 6,000,000–7,000,000 cone cells on the retina surface of the human eye. The central region (fovea) is densely distributed, and the surrounding region gradually decreases, which is very sensitive to color. Cone cells can be divided into three types, which have different response curves to the incident radiation spectrum. The combination of the three cone responses can produce color feeling, which plays a decisive role in forming color vision. Cone cells also play a decisive role in visual acuity under photopic vision conditions. Humans can use these cells to distinguish details, mainly because each cell is connected to its own nerve ending. Compare rod cells.
Cone Short for cone cell. Photoreceptor cells in the retina of the human eye that are responsible for color vision. They work best in bright light. Their number is much smaller than that of the coexisting rod cells, which work in dim light.
L cones One of the three types of sensor contained in the human eye. The sensitive range corresponds to long light wavelengths (responsible for the feeling called "red").
M cones One of the three types of sensor contained in the human eye. The sensitive range corresponds to medium light wavelengths (responsible for the feeling called "green").
S cones One of the three types of sensor contained in the human eye. The sensitive range corresponds to short light wavelengths (responsible for the feeling called "blue").
Rod cell A columnar photoreceptor (receiver of photons) in the retina that is very sensitive to all wavelengths of visible light. A type of cell in the retina of the eyeball responsible for receiving ray radiation, named because its dendrites are rod-shaped. There are approximately 75–150 million rod cells on the retina surface of each eye. They have a large distribution surface but a lower resolution, because several rod cells are connected to the same nerve ending. Rod cells work only in very dark light and play a decisive role in the perception of light and dark under scotopic vision conditions, being more sensitive to low light levels. Rod cells mainly provide an overall visual image of the field of view; because there is only one type of rod cell, they do not produce color perception. In the retina, rod cells and cone cells are complementary.
Scotopia vision Same as scotopic vision.
Rod Abbreviation for rod cell.
Ganglion cell Cells in the retina of the human eye that perform an operation very close to the ∇2 operation (convolving the image with ∇2). Each cell's response to a light stimulus is limited to a local neighborhood called a receptive field. The receptive field has a center-surround organization with two complementary cell types. Close to the center is the ON-center neuron, which increases its activity during light stimulation; away from the center is the OFF-center neuron, whose activity is inhibited during light stimulation. (A difference-of-Gaussians sketch of this center-surround behavior follows the OFF-center neuron entry.)
ON-center neuron A sensory unit of nerve center cells in the retina of the eye. It includes central and peripheral receptive fields. Exposure of the central receptive field causes
excitation of the ON-center neuron (i.e., the frequency of its voltage pulses rises). On the other hand, exposure of the peripheral receptive field causes inhibition (i.e., the frequency of its voltage pulses decreases). If the central receptive field and the surrounding receptive field are exposed at the same time, the response from the central receptive field always dominates. If the light stimulus disappears, inhibition occurs in the central receptive field and excitation occurs in the peripheral receptive field. Compare OFF-center neuron.
OFF-center neurons A cellular sensory unit in the nerve center of the retina of the eye. It includes central and peripheral receptive fields. Exposure of the central receptive field causes inhibition of the OFF-center neuron (i.e., the frequency of its voltage pulses decreases). On the other hand, exposure of the peripheral receptive field causes excitation (i.e., the frequency of its voltage pulses increases). If the central and peripheral receptive fields are exposed at the same time, the responses from the peripheral receptive fields always dominate. If the light stimulus disappears, excitation is generated in the central receptive field and inhibition in the peripheral receptive field. Compare ON-center neuron.
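The ∇2-like center-surround behavior of ganglion cells described above is commonly approximated by a difference of Gaussians. The following is a minimal sketch of that approximation; the two Gaussian widths are assumed illustrative values, not values from the source.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def center_surround_response(image, sigma_center=1.0, sigma_surround=3.0):
    """Difference-of-Gaussians approximation of an ON-center receptive
    field: positive where the center is brighter than its surround."""
    img = image.astype(np.float64)
    center = gaussian_filter(img, sigma_center)
    surround = gaussian_filter(img, sigma_surround)
    on_center = center - surround      # the OFF-center response is its negation
    return on_center
```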
50.3 Visual Sensation
Visual sensation is the lower level of vision, and it mainly receives external light stimuli. It covers the physical characteristics of light, the extent to which light stimulates the visual receptors, and the sensation produced by the visual system after light acts on the retina (Aumont 1994).
50.3.1 Sensation Sensation The direct reflection of the individual attributes of things caused by the fact that objective things act on human sensory organs. Sensation belongs to the perceptual stage of knowledge and is the source of all knowledge. If closely coordinated with perception, it can provide material for thinking activities. Optically speaking, the sense of light is the sensation. This sensation is the original impression obtained by the organs of vision. It can be changed due to environmental changes and accumulation of knowledge, but it cannot be fabricated. The color perception obtained on the basis of sensation (processed by the brain) is perception. Sensation constancy In the human visual system, the characteristics (size, shape, color, brightness, etc.) of the external objective scene have changed, but the subjective feeling remains unchanged. See perception constancy and perceptual constancy.
Sensory adaptation The process or state in which the senses of the human sensory organs change as stimuli are received. It can be regarded as a mode of interaction over time, manifested in that the stimulus received first by the receptor affects and changes the effect of the stimulus that comes next.
Sensation rate The reciprocal of the time required from the start of the stimulus to the sensation. For light stimulation, the sensation rate is the reciprocal of the time required from the start of the light stimulation of the human eye until it produces a visual sensation.
Sensitivity 1. In binary classification, the same as recall. 2. In the process of perception, the ability of the human visual system to respond to direct external stimuli. 3. Mainly used in the medical context to measure the performance of object classification. A binary classifier C(x) returns the label + or – for the sample x. Comparing these predictions with the true labels gives true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Sensitivity is defined as the true positive rate, TP/(TP + FN), i.e., the percentage of positive samples that are correctly labeled as positive. See specificity.
Visual feeling The lower part of vision; it mainly receives external stimuli and obtains information from the outside. It is mainly concerned, at the molecular micro-level, with the basic properties (such as brightness and color) of people's response to light (visible radiation). The main research contents of visual feeling are (1) the physical characteristics of light, such as light quanta, light waves, and spectra; (2) the extent to which light stimulates visual receptors, such as photometry, eye structure, visual adaptation, visual intensity and sensitivity, temporal characteristics of vision, and spatial characteristics of vision; and (3) the feeling produced by the processing of the visual system after light acts on the retina, such as brightness and hue.
Visual sensation The low-level stage of vision. As far as normal eyes are concerned, as long as they are open, everything in the range of sight will stimulate the retina and cause sensation (passive sensation); however, people only pay attention to the scenes of interest or attend intentionally (active sensation). Compare visual perception.
Stimulus 1. Radiant material energy that excites the receptor and has effects on the body. 2. A target or event that a computer vision system can detect.
Receptive field 1. The region corresponding to the sensory unit of a nerve center cell in the retina of the eye, that is, the region of the retina that responds to light stimulation. In such a region, a suitable light stimulus can elicit excitation and/or inhibition. The main cells responsible for visual perception in the retinal region are
rod cells and cone cells, which work at low and high light intensities, respectively. In the retina, the receptive fields of adjacent nerve center cells overlap considerably. 2. The region of visual space that leads to a photopic response. 3. For an image used as input, the region used to calculate each output value. See region of support.
Receptive field of a neuron The part of a sensory organ that causes a neuron to respond when stimulated (such as a region of the retina). This part always corresponds to at least one neuron in the sensory conduction system or the cortical sensory area.
Perceptual image When the image formed by the human eye is transmitted to the brain, the corresponding image generated in the brain. The photochemical action on the retina can be transmitted to the brain through a series of complex biochemical processes of the retinal nerves. The image of an external object formed on the retina may be somewhat distorted, and the distortion may be aggravated, with psychological effects added (such as optical illusions), in the later part of the conduction process. For example, the objective world forms an inverted image on the retina, yet a person perceives an upright image, which indicates that the image on the retina is different from the sensory image. The generation of the image in the brain from the image on the retina is somewhat similar to the process of television shooting, transmission, and display, but not very much like the process of photography.
Duplicity theory of vision A theory that explains visual perception. According to the theory, the retina is composed of two types of cells, cone cells and rod cells, which are connected to separate nerve endings. The cone cells are responsible for strong light and color. Rod cells are completely absent in the fovea and become more numerous toward the edge of the retina; they are responsible for vision in weak light but cannot sense color.
Afterimage After the eye has adapted to a completely dark environment, when light is suddenly presented, the image of the original environment will still appear in the eye. It usually lasts a few seconds. Afterimages also appear with the eyes closed. The afterimage is sometimes bright (positive afterimage) and sometimes dark (negative afterimage), and it goes through several alternations, but its clarity and intensity often do not correspond. When the afterimage differs from the original in lightness, the one whose lightness is similar to the original is called a positive afterimage, and the complementary one is a negative afterimage. When the color of the afterimage differs from the original, the one whose color is similar to the original is called a positive afterimage, and the complementary one is a negative afterimage.
Residual image Same as afterimage.
Visual afterimage The visual image that remains after the light stimulus stops acting, that is, after the stimulating light disappears. Also known as a visual retention image. It can be divided into positive afterimage and negative afterimage. A positive afterimage presents the same color as the original perception, and a negative afterimage presents the opposite color to the original perception.
Positive afterimage An afterimage that is consistent in color and lightness with the original image.
Negative afterimage An afterimage that is complementary to the original object and original image in terms of color and lightness. For colored objects, the afterimage color is complementary to the original color. For uncolored objects, the lightness and darkness of the afterimage are opposite to those of the original.
Motion afterimage After observing a moving object for a long period of time, or after looking at a stationary object during continuous motion, the human eye still feels that there is a moving image after the line of sight is removed, and its direction may be the same as or opposite to the original motion. See afterimage.
50.3.2 Brightness
Luminance 1. The amount of visible light that enters the human eye from the surface of a scene. It is also the intensity measured from one location in the scene. In other words, it is the amount of visible light that leaves a point on a surface in a given direction due to reflection, transmission, and/or emission. It can be measured by the luminous flux per unit solid angle and unit area projected onto a plane perpendicular to the propagation direction. The standard unit is candela per square meter (cd/m2), or lm/(sr·m2), also called the nit (1 nit = 1 cd/m2) in the United States. 2. One of the three basic quantities describing a color light source. A measure of the amount of information an observer receives from a light source, measured in lumens (lm). It corresponds to the radiant power of the light source weighted by a spectral sensitivity function (a characteristic of the HVS).
Luminosity A parameter that measures the intensity of light in an image. The greater the luminosity, the greater the intensity of the light and the brighter it appears, and vice versa.
Lightness The intensity of the reflected light felt from the surface of a monochrome object. It shows that the human visual system tries to extract the reflectivity of objects based on the brightness in the scene.
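The luminance entry notes that, as a quantity describing a light source, luminance is radiant power weighted by a spectral sensitivity function of the HVS. For digital images, a common stand-in for such a weighting is the Rec. 709 luminance combination of linear RGB. The sketch below uses those standard coefficients, which are a well-known convention rather than something taken from this handbook.

```python
import numpy as np

def relative_luminance(rgb_linear):
    """Weighted sum of linear R, G, B using the Rec. 709 / sRGB
    luminance coefficients (an HVS-based spectral weighting)."""
    weights = np.array([0.2126, 0.7152, 0.0722])
    return rgb_linear @ weights   # works for a single pixel or an (H, W, 3) array

print(relative_luminance(np.array([1.0, 1.0, 1.0])))  # 1.0 for white
```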
Perception of luminosity contrast Perception of changes in the distribution of brightness in space. For a surface, the brightness perceived by the mind is basically determined by its relationship with the brightness of the surrounding environment (especially the background). If two objects have similar brightness ratios to their respective backgrounds, they will appear to have the same brightness, which has nothing to do with their absolute brightness. Conversely, when two objects have different brightness ratios to their respective backgrounds, the same object will appear brighter on a darker background than on a lighter background. Brightness sensitivity The ability of the human eye or the human visual system to distinguish between different light intensities generally varies with the average brightness function. Perceptual models use brightness sensitivity to determine the visibility of changes in light intensity. Photopic response A curve modeling the sensitive wavelengths of the human eye under normal lighting conditions. Under these conditions, cone cells are the best photoreceptors in the retina to respond to light. The peak of the response curve appears at 555 nm, which indicates that the human eye is most sensitive to green-yellow under normal lighting conditions. When the light intensity is very low, the rod cell determines the response of the human eye. At this time, the scotopic response can be used to model. The peak of the response curve appears at about 510 nm. Scotopic response See photopic response. Unrelated perceived color Color felt on dark background. When the background is relatively dark, a person’s subjective perception of the color of an object requires that the brightness of the object exceeds a certain threshold, the observation time exceeds a certain threshold, and the area of the retinal stimulated region exceeds a certain threshold. When the background is relatively bright, the human eye also observes black. The brighter the surroundings, the darker the subject looks at in the middle of the field of view. The most prominent example is gray in a light background, which was originally white in a dark background (as orange in a dark background becomes brown in a light background). Therefore, the color in the dark background is called an unrelated perceived color, and the color in the light background is called a non-unrelated perceived color (affected by the background color). Absolute threshold In the human visual system, the minimum intensity of the physical stimulus required to stimulate the response of system. Just-noticeable difference [JND] Changes in brightness that the human visual system can just distinguish. For a grayscale image, it refers to the grayscale difference between two adjacent pixels that
can be distinguished by the human eye. Set the gray value of one pixel to g and the gray value of another pixel to g + Δg; then Δg is the visual threshold, and Δg/g is basically a constant. See contrast sensitivity.
Visual quality 1. Reflects the distortion, degradation, or deterioration of the image after acquisition or processing compared with the original scene. 2. Generally refers to a measure of the quality of the image itself, for example in terms of definition, contrast, etc.
Definition Another term for "picture quality" or visual quality. High-definition images are usually bright, with sharp edges and many details; low-definition images are usually dim, blurry, and smudged.
Contrast sensitivity The ability of the human eye or human visual system to distinguish changes in brightness. It reflects the degree to which the human visual system distinguishes brightness differences. It is a function of spatial frequency and is also related to the size of the observed object and the length of the presentation time. If the test is performed with a grating composed of lines of different thicknesses and contrasts, the closer the contrast between the light and dark lines of the grating perceived by the eye is to the contrast between the light and dark lines of the original test grating, the greater the contrast sensitivity. Under ideal conditions, people with good vision can distinguish a brightness contrast of 0.01, that is, the contrast sensitivity can reach a maximum of 100. On the other hand, it can also refer to the sensitivity of the human eye when the observed contrast changes, defined as the smallest brightness difference that can be detected by an observer. This can be measured with a test pattern composed of two adjacent blocks, as shown in Fig. 50.3. The observer's field of view is mostly filled by the ambient brightness (Y0). In the central region, the left semicircular portion has a test brightness value (Y), and the right semicircular portion has a slightly increased brightness value (Y + ΔY). The ΔY value is adjusted step by step, and the participant indicates when the difference between the two semicircles becomes observable. At this point, the corresponding Y and ΔY values are recorded. This process is repeated for a fairly wide range of brightness values.
Fig. 50.3 Contrast sensitivity test mode (a central disc split into a left half of brightness Y and a right half of brightness Y + ΔY, on a surround of brightness Y0)
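As an illustration of the test field just described, the sketch below renders a pattern like the one in Fig. 50.3: an ambient background of brightness Y0 with a central disc whose left and right halves differ by ΔY. The size, radius, and brightness values are arbitrary illustrative choices.

```python
import numpy as np

def contrast_test_pattern(size=512, radius=120, y0=80, y=128, delta_y=4):
    """Square field of ambient brightness y0 with a central disc whose
    left half has brightness y and right half y + delta_y."""
    img = np.full((size, size), float(y0))
    cy = cx = size // 2
    yy, xx = np.mgrid[0:size, 0:size]
    disc = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    img[disc & (xx < cx)] = y             # left semicircle: Y
    img[disc & (xx >= cx)] = y + delta_y  # right semicircle: Y + delta Y
    return img
```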
Contrast sensitivity threshold A characteristic index of the human visual system. The higher the background brightness of an image region, the higher the contrast sensitivity threshold of the human visual system there, and therefore the lower the sensitivity of the human eye to noise added in that region (the less easily the noise is observed). Using this property, if watermark embedding is performed in high-brightness regions of the image, more additional information can be embedded.
50.3.3 Photopic and Scotopic Vision
Daylight vision Vision under strong light (above about 1 cd/m2). It belongs to photopic vision. Also called cone vision (because mainly the cone cells are at work).
Subjective black The lowest brightness at which the human eye can perceive a luminous region on a dark background. This value is not absolute: it is related to whether dark adaptation is complete, to the spectral components of the light, and to the size of the light-emitting area (the size of the image on the retina) and its position. In addition, the length of time for which the light appears also has an effect. The subjective black corresponding to the daylight vision region (foveal threshold) is about 10^-3 cd/m2, while the subjective black corresponding to the night-light vision region (extra-foveal threshold) is about 10^-6 cd/m2. When the night-light vision threshold is expressed in lm, this value is about 10^-12 lm (for a visual angle of 10° and a pupil of 8 mm).
Photopic vision Vision under strong light (about several cd/m2 or more). An eye adapted to this condition is a light-adapted eye. Under these conditions the cone cells play the major role on the retina, so it is also called cone vision. In addition to distinguishing brightness, people can then also perceive color. Compare scotopic vision.
Photopic zone The range of visual brightness corresponding to photopic vision. The brightness is about 10^2–10^6 cd/m2.
Scotopic vision Vision in low light (the observed subject's brightness is below about 10^-2 cd/m2). Vision that is mediated primarily by the rod cells of the retina. In practice, it is often necessary to dark-adapt the eyes. Because the rod cells provide only a sense of light and dark when they function, there is no sense of color, so the scenes people see in dim light are all gray and black. Compare photopic vision.
Night-light vision Same as scotopic vision.
Peripheral rod vision Same as scotopic vision.
Scotopic zone The range of visual brightness corresponding to scotopic vision. The brightness is approximately between 10^-6 and 10^-2 cd/m2.
Mesopic vision Among the visual states between photopic vision and scotopic vision, the vision in which the cone cells and rod cells of the retina function simultaneously (or neither dominates). At this time, the brightness of the field of view is about 10^-2–10 cd/m2. The International Commission on Illumination (CIE) recommends 0.05 cd/m2 and a field of view of 20° as the standard state for studying vision in this range.
Purkinje phenomenon Under different lighting levels, the human eye's sensitivity to light of various colors changes or shifts. Under photopic vision conditions, the human eye is most sensitive to yellow-green light, owing to the action of the cone cells; under scotopic vision conditions, the human eye is most sensitive to blue-green light, owing to the action of the rod cells. The Purkinje phenomenon can also be described as the phenomenon in which the spectral luminous efficiency curve and its maximum move toward shorter wavelengths as vision passes from photopic through mesopic to scotopic. See Purkinje effect.
Purkinje effect A subjective effect based on the Purkinje phenomenon. It shows that lights of different colors (especially red and blue) have different subjective brightness to the human eye during the day (photopic vision) and at night (scotopic vision). If two objects (one red and one blue) appear equally bright during the day, the blue object appears brighter at night.
50.3.4 Subjective Brightness Subjective brightness The subjective perceived brightness of the human visual system. That is, the brightness of the observed object, which is judged by the person according to the strength of the retina to feel the light stimulus, is often called the apparent brightness. It is a psychological term closely related to (objective) brightness. For a light source with a certain light emitting area (extended light source), the subjective brightness of the eye is specified as the image brightness on the retina. It is directly proportional to the product of the retinal illuminance and the aperture factor, which is directly proportional to the brightness of the object, the pupil size, and the aperture factor. See apparent luminance.
Visual brightness Same as subjective brightness.
Brilliance Used to indicate subjective brightness. It also represents brightness, lightness, fidelity, etc.
Brightness The subjective feeling of the eyes about a light source or about the surface lightness of a scene. It is the intensity of light felt from the image itself, not a characteristic of the scene being depicted. It has a certain correspondence with the amount of light emitted by the scene. When observing a light source, the brightness depends on the intensity of the light source but may differ from the objective luminosity. When observing the reflective surface of a scene, the brightness depends on the intensity of the light source and the reflection coefficient of the surface, but it may differ from the objective illumination. Brightness represents the state or degree of lightness of a scene and is an attribute of visual sensation. From a visual point of view, brightness refers to the achromatic series from very dim (black) to very bright (dazzling). Brightness is the amount of ray radiation that reaches a detector after being incident on a surface. It is usually measured in lux or ANSI lumens. When combining images, the values are scaled to fit the bit pattern used; for example, if an 8-bit byte is used, the maximum value is 255. See luminance. Brightness can also be simply understood as the intensity of colors. Different colors can have different intensities; generally speaking, yellow is higher in intensity than blue.
Apparent brightness The observer's feeling that the light in some parts of the field of view is stronger or weaker. This feeling depends first on the luminance of the object, which corresponds to the sensitivity of the cone cells on the retina to different wavelengths of light. However, there are also rod cells on the retina; according to the Purkinje effect, their sensitivity differs from that of the cone cells, so luminance alone cannot completely determine the subjective brightness, which also depends on whether the rod cells are involved.
Apparent luminance The luminance of the light collected from a uniformly diffusing object, under the assumption that the pupil is filled with luminous flux and the aperture effect does not exist. When optical instruments are used, the pupil of the eye may be filled with luminous flux, but the aperture effect can still reduce the effect of the light, so the subjective brightness of the field of view does not depend entirely on the luminance of the object. The apparent luminance is exactly the subjective brightness that the eyes feel (taking into account the aperture effect, with other conditions such as adaptation to light unchanged). Therefore, the apparent luminance is proportional to the product of the retinal illuminance and the aperture factor, that is, it is directly proportional to the luminance of the object, the pupil size, and the aperture factor.
Mach band Bright and dark bands that can be perceived at the junction between the brightly and darkly illuminated regions of the field of view. They are the result of a subjective edge contrast effect. The contrast effect is most obvious in the edge region, so it is also called edge contrast. The Mach band phenomenon is a kind of edge contrast phenomenon. The Mach bands correspond to the different gray bands in the Mach band effect. Each band is itself uniform, but because the subjective brightness perceived is related to the surroundings, that is, the human visual system tends to produce overshoot or undershoot at the boundary between regions of different intensity, it looks like a band with a gray-level change.
Mach band effect An effect of the human visual system: people perceive changes in brightness at the edges of regions of constant brightness. This change makes the parts nearer the lighter region appear darker, while the parts nearer the darker region appear brighter. Therefore, the human visual system tends to overestimate or underestimate the boundary values between regions of different brightness. It was discovered in 1865 by the Austrian physicist Mach. Figure 50.4 can be used to introduce the Mach band and its mechanism. Figure 50.4a is a Mach band pattern, which includes three parts: the left side is a uniform low-brightness region, the right side is a uniform high-brightness region, and the middle is a region that gradually transitions from low brightness to high brightness. Figure 50.4b is the corresponding brightness distribution. Looking at Fig. 50.4a, one can see a band at the junction of the left and middle zones that appears darker than the left zone, and a band at the junction of the middle and right zones that appears brighter than the right zone. In fact, the dark band and the bright band do not exist in the objective stimulus; they are the result of subjective brightness, as shown in Fig. 50.4c. The above illusion caused by the Mach band pattern can be explained by the lateral inhibition between adjacent nerve cells on the retina. Lateral inhibition refers to the inhibitory effect of a cell on its surrounding neighboring cells. This effect is directly proportional to the intensity of light stimulation to the cell, that
Fig. 50.4 Mach band and its mechanism ((a) Mach band pattern; (b) brightness distribution; (c) subjective brightness; (d), (e) a row of sensory cells with their stimulus intensities, inhibition values, and visual responses, illustrating lateral inhibition)
is, the stronger the stimulation of a cell, the greater the inhibitory effect on neighboring cells. See Fig. 50.4d, e. Cell 3 is inhibited by Cell 2 and Cell 4. Because Cell 4 is stimulated more strongly than Cell 3, the inhibition of Cell 3 is greater than that of Cell 2, so it appears darker. In the same way, the inhibitory effect of Cell 6 is less than that of Cell 7, so it appears brighter. The brightness of the pattern affects the width of the Mach band: as the brightness of the pattern decreases, the width of the dark band increases, but the width of the bright band is not affected by the change in brightness. The Mach band disappears completely at very high or very low illuminance. When the brightness difference between the uniform dark region and the uniform bright region increases, that is, when the gradient of the brightness distribution in the middle region increases, the Mach band becomes more pronounced.
Lateral inhibition 1. The phenomenon that similar neurons can inhibit each other. When a neuron is stimulated, it generates neural excitation; when another neuron is stimulated, the latter's response can inhibit the former's response. 2. The inhibitory effect of a cell on surrounding cells. This effect is directly proportional to the intensity of the light stimulus to the cell, that is, the stronger the stimulus to a cell, the greater the inhibitory effect on neighboring cells. The optical illusion produced by observing the Mach band pattern can be explained by the lateral inhibition between adjacent nerve cells on the retina. 3. A process that weakens a given feature or eliminates neighboring features. For example, in the Canny edge detector, a local maximum of the gradient magnitude causes the neighboring gradient values across the edge (rather than along the edge) to be set to zero.
Simultaneous contrast The subjective brightness perceived by the human visual system from the surface of an object is affected by the relative relationship between the brightness of the surface and that of the surrounding environment. Specifically, simultaneous contrast means that the same object (reflecting the same brightness) looks brighter when placed on a darker background and darker when placed on a lighter background. See Fig. 50.5, where all the small squares in the center have exactly the same brightness. However, as can be seen from the figure, a square looks brighter when it is on a dark background and darker when it is on a light background, so from left to right the small squares appear to darken gradually. At this time, it is actually not a
Fig. 50.5 Simultaneous contrast examples
gray level that is perceived, but the difference between that gray level and the gray levels of the surrounding region. In addition, when two contrasting colors are presented at the same time, the tendency of the color of a region to enhance the complementary color of an adjacent region is also a manifestation of simultaneous contrast.
Simultaneous brightness contrast The phenomenon that the perception of surface brightness depends on the perceived background brightness. To introduce this phenomenon, consider the case where a gray surface is surrounded by a white surface and by a black surface, respectively. The gray surface placed on the white surface looks darker than the gray surface placed on the black surface. This phenomenon is simultaneous brightness contrast, also often called simultaneous contrast.
Simultaneous color contrast Contrast in the opponent color model. The perception of a colored surface depends on the color surrounding the surface. For example, a gray surface surrounded by red regions looks bluish green. Opponent color models can be used to describe inducing colors and induced colors. This effect of color contrast can be seen as changing color constancy in a systematic way. It is also an index describing the difference between adjacent pixels or pixel sets in a color image. The observed color of a surface region depends on the color of the surrounding surface. For example, the surface of a gray region surrounded by red regions looks blue-green; this perceived color is called the induced color, and the surrounding color is the inducing color. This type of contrast is also called simultaneous color contrast.
Tristimulus values According to the three-color (trichromatic) theory, the three spectral intensities required to match different colors. They indicate the relative amounts of the three primary colors that need to be combined to match a given color. According to the principle of metameric matching, in order to generate an arbitrary color C(λ) in the brain, it is necessary to simultaneously project a spectrum P1(λ) with intensity A1(C), a spectrum P2(λ) with intensity A2(C), and a spectrum P3(λ) with intensity A3(C). This can be written as C(λ) ≈ A1(C)P1(λ) + A2(C)P2(λ) + A3(C)P3(λ). If the intensities of the projection spectra required to produce a white sensation in the brain are A1(W), A2(W), and A3(W), the normalized quantity Aj(C)/Aj(W) is invariant to the observer and can be used as the tristimulus value of the color C(λ): Tj(C) = Aj(C)/Aj(W), j = 1, 2, 3.
Note that the tristimulus values thus defined correspond to the objective tristimulus values of the color system. Further, one can write
C(λ) ≈ Σ_{j=1}^{3} Tj(C) Aj(W) Pj(λ)
In this expression, Tj(C) is the objective tristimulus value, which can completely and objectively describe a color with respect to a given color system. The factor Aj(W) in the formula is the only subjective quantity, the one that makes the equation specific to a particular observer. If it is ignored, then as long as the tristimulus values used are the average color matching functions of many people with normal vision, an equation corresponding to a standard observer is obtained. Using the standard observer's tristimulus values will generate different color sensations for different observers, because the sensations are based on the hardware inside each observer, that is, the way they personally observe color.
Color matching experiments An experimental method for determining tristimulus values. A white screen is divided into two parts. Monochromatic light (= a single wavelength) of unit radiated power is projected onto one half (called the monochromatic reference stimulus). The three primary colors are projected simultaneously onto the other half. An observer adjusts the intensities of the three lights until the colors seen on the two halves of the screen cannot be distinguished. The three matching intensities are the tristimulus values for the wavelength of the monochromatic reference stimulus.
Luminosity coefficient An integral part of the color theory of tristimulus values. It refers to the contribution of a given primary color to the total perceived brightness.
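As a small numerical illustration of the definition above, the tristimulus values are simply the matching intensities for the color divided, component by component, by the matching intensities for white. The matching intensities in the example below are made up purely for illustration and do not come from any measured color system.

```python
import numpy as np

def tristimulus_values(a_c, a_w):
    """Tj(C) = Aj(C) / Aj(W), j = 1, 2, 3: matching intensities for the
    color divided by the matching intensities for the reference white."""
    a_c = np.asarray(a_c, dtype=float)
    a_w = np.asarray(a_w, dtype=float)
    return a_c / a_w

# Hypothetical matching intensities from a color matching experiment.
A_C = [0.30, 0.55, 0.20]   # intensities of the 3 primaries matching C
A_W = [0.90, 1.00, 0.85]   # intensities of the 3 primaries matching white
print(tristimulus_values(A_C, A_W))
```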
50.3.5 Vision Characteristics
Spatial characteristic of vision The effect of spatial factors on visual perception. In fact, vision is first and foremost a spatial experience. Cumulative effect of space on vision A spatial characteristic of vision. When observing an object, if each light quantum is absorbed by a rod cell, then the critical light energy Ec at which the stimulus is detected with a probability of 50% is proportional to the visible area A and the surface brightness L of the object, that is,
Ec = k·A·L

Among them, k is a constant, which is related to the units used for Ec, A, and L. Note that the area over which the above relationship holds has a critical value Ac (corresponding to a solid angle of 0.3 steradian). When A < Ac, the above law holds; otherwise, it does not. Spatial summation One of the spatial characteristics of vision. It reflects the cumulative effect of different stimuli throughout space. Due to the spatial cumulative effect, a light stimulus that is not visible when falling on a small region of the retina because of insufficient strength may be seen when falling on a large region of the retina. In other words, if a dark object is too small to be visible, an equally dark but larger object may still be observed. Spatial frequency Frequency of spatial structure repetition. When visually perceiving the existence of an object, both the spatial position and the spatial structure of the object must be extracted. The spatial position can be determined by the position of the object image on the retina, while the spatial structure is characterized by the spatial frequency. Spatial frequency is expressed by the number of lines per unit length, which indicates the fineness or density of the structure of an object. For a given test grating, measuring the amount of image intensity change gives the spatial frequency. Consider a stationary sinusoidal grating arranged along the x direction; its light intensity distribution function is

C(x) = C0[1 + m cos(2πUx)]

Among them, C0 represents the average light intensity of the grating; m is the grating light intensity variation factor; and U is the spatial frequency of the grating, that is, the inverse of the distance between two adjacent peaks (the inverse of the spatial period). For a sine grating distributed along the y direction, the spatial frequency V is likewise given by the inverse of the spatial period. The spatial frequency of a grating of any orientation can be decomposed into two spatial frequencies along the x direction and the y direction, or equivalently regarded as the superposition of the latter two. Spatial frequency can be completely characterized by frequency changes in two orthogonal directions, such as horizontal and vertical. If fx is called the horizontal frequency (the unit is cycles per horizontal unit distance) and fy the vertical frequency, then the frequency pair (fx, fy) characterizes the spatial frequency of a 2-D image. Obviously, the lower the spatial frequency, the sparser the grating fringes, and the higher the spatial frequency, the denser the grating fringes. Visually, any object or optical image can be roughly regarded as a superposition of many sinusoidal gratings with different spatial frequencies and spatial orientations. The uniform background is the zero-frequency component. The component whose light distribution changes slowly is called a low-frequency component,
and the details where the light intensity changes sharply are high-frequency components. From the principle of optical diffraction, it is known that all optical systems detect the zero-frequency and low-frequency components more easily, while the high-frequency components are cut off due to the limited aperture. For this reason, all optical systems have low-pass filter properties, and the resolution of spatial details is limited. The optical system of the eyeball is no exception. Spatial frequency can also be described as the rate at which brightness repeats as image space is scanned. In a 2-D image, space refers to the XY plane of the image. In 3-D images, space refers to the XYZ volume of the image. In video images, space refers to the XYT volume. In Fig. 50.6, there is a significant repetition pattern in the horizontal direction, and its spatial frequency is about 0.04 cycles/pixel. Temporal characteristics of vision The effect of time on visual perception. The influence of time can be explained from three aspects: 1. Most visual stimuli change over time or are generated over time. 2. The eyes are constantly moving, which constantly changes the information obtained by the brain. 3. Perception is not an instantaneous process, because information processing always takes time. Cumulative effect of time on vision A temporal characteristic of vision. When observing an object of ordinary brightness, the degree of stimulation of the eye is directly proportional to the time interval (observation duration) of the stimulus when that interval is less than a given value. If Ec is the critical light energy at which the stimulus is perceived with a 50% probability, then it is proportional to the visible area A, surface brightness L, and observation time interval T of the object, that is,
Fig. 50.6 Repeated pattern horizontally
Ec = A·L·T

The condition for the above formula is that T < Tc, where Tc is the critical time interval. Reaction time The time interval from when the eye receives the light stimulus until the brain becomes aware of the light effect. The reaction time is tr = a/L^n + ti, where ti is the invisible (fixed) part, about 0.15 s. The first term is called the visible bound (a is a constant), and it varies with the brightness L: it is close to 0 when L is large and about 0.15 s when L is close to the visual threshold. Response time Same as reaction time. Temporal frequency response The curve of the contrast sensitivity of the human visual system as a function of temporally repeating patterns. It mainly depends on several factors, such as viewing distance, display brightness, and ambient light. Experiments show: 1. The peak frequency of the contrast sensitivity increases with the average image brightness. 2. One of the reasons why the eyes have lower sensitivity at high temporal frequencies is that the sensation of an image is retained for a short time interval after the image is removed. This phenomenon is called visual persistence. 3. The critical flicker frequency (below this frequency a single flash is perceived as flicker, and above it the flashes merge into a continuous, smooth moving image sequence) is directly proportional to the average brightness of the display. Critical flicker frequency The minimum flicker frequency of a constant stimulus that can be felt when different lights alternately appear in a certain field of view within a short period. Temporal frequency A measure of how fast a physical process changes. The movement of a target in the field of view inevitably causes the light information received by the retina to change constantly. The degree of change can be characterized by the temporal frequency. A one-dimensional sine grating distributed along the x direction and moving at speed R in the positive x direction has a light distribution that can be expressed as

C(x, t) = C0[1 + m cos(2πU(x − Rt))]

or, written differently,

C(x, t) = C0[1 + m cos(2π(Ux − Wt))]
The time-dependent terms of the above two formulas are related to the movement speed R and to the temporal frequency W of the light distribution change caused by the movement, respectively. A sinusoidal grating that moves at a speed R is completely equivalent to a sinusoidal grating that changes in place at the temporal frequency W. It can be seen that the temporal frequency is directly related to the speed of movement. Given the spatial structure of an object, the higher the temporal frequency, the faster its movement speed, and vice versa. Daily visual experience shows that the human eye can only produce motion perception for objects moving at an appropriate speed; motion that is too fast or too slow cannot cause a sense of movement. Correspondingly, physical processes whose temporal frequency of change is too high or too low cannot be perceived visually as motion. This shows that the vision system also has low-pass and band-pass filtering characteristics in the temporal frequency domain. Frequency sensitivity The ability of the human eye or the human visual system to distinguish between stimuli of different frequencies. Visual sensitivity to frequency can be divided into three types: spatial frequency sensitivity, temporal frequency sensitivity, and color sensitivity. Smooth pursuit eye movement Eye movement that tracks a moving object. It brings the temporal frequency at which the motion is perceived close to zero, thereby avoiding motion blur and improving the ability to distinguish spatial details. This condition is called dynamic resolution and is used by humans to evaluate the ability to resolve details in real moving images. Visual acuity The ability of the visual system to identify spatially close visual stimuli. It can be measured with a grating of parallel black and white stripes of equal width: once the stripes are too close, they cannot be distinguished. Under optimal lighting conditions, the smallest distance between stripes that a person can distinguish corresponds to a viewing angle of 0.5 minute of arc. A visual acuity of 1 indicates the resolution of a person at a standard distance (5 m) when the viewing angle is 1 minute of arc. It indicates the accuracy of the details of the scene that the human eye can see under good lighting conditions. It is a type of spatial resolution of human vision, that is, sight, the ability to distinguish small details such as contours, fonts, grid shapes, stereoscopic structures, cursors, and other object shapes and sizes. Visual acuity specifically corresponds to the size of the smallest test object that an observer can see. The visual acuity of the human eye is related to the arrangement of sensory cells on the retina, the size of the pupil, the brightness of the object in the scene, and the observation time. An eye chart can be used to measure visual acuity, which is in effect a psychophysical measurement aiming to determine the threshold of perception.
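To make the grating notation above concrete, here is a minimal Python sketch (the parameter values are illustrative, not taken from this handbook) that evaluates the moving sinusoidal grating C(x, t) = C0[1 + m cos(2π(Ux − Wt))] and recovers the equivalent drift speed R = W/U.

    import numpy as np

    C0 = 0.5   # average light intensity of the grating
    m = 0.8    # intensity variation (modulation) factor
    U = 0.04   # spatial frequency, in cycles per pixel
    W = 2.0    # temporal frequency, in cycles per second

    x = np.arange(256)   # pixel positions along the x direction
    t = 0.1              # a single time instant, in seconds

    # Light distribution of the moving grating at time t.
    C = C0 * (1.0 + m * np.cos(2.0 * np.pi * (U * x - W * t)))

    # A grating drifting at speed R is equivalent to one flickering in place
    # at temporal frequency W = U * R, so the drift speed is R = W / U.
    R = W / U
    print("grating values:", C[:5])
    print("equivalent drift speed:", R, "pixels per second")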
Fig. 50.7 C-optotype
C-optotype A visual standard used in international eye charts to determine visual acuity. As shown in Fig. 50.7, it is composed of 5 units of detail both horizontally and vertically. The width of the black line is 1/5 of the diameter, and the opening of the ring is also 1/5 of the diameter. When the target diameter is 7.5 mm, the opening of the ring is 1.5 mm. When performing a visual inspection, the optotype is placed at a distance of 5 m from the subject (with illumination close to 150 lx (lux)), and the subject is asked to indicate the direction of the gap (the size of the gap imaged on the retina is then about 0.005 mm). If the direction can be pointed out correctly, the subject's vision is considered normal and is set to 1.0. Landolt ring Same as C-optotype. Landolt broken ring Same as Landolt ring. Myopia When the eye is at rest (i.e., without accommodation), the image of a distant object is formed in front of the retina. Near objects are then seen clearly while distant objects appear blurred. This may be because the eyeball is too long while the refraction is normal, because the refractive parts are abnormal (the curvature or the refractive index is too high), or because the eyeball is of normal length but the lens has too strong a refractive power. These problems can be corrected by wearing a concave lens (i.e., ordinary myopia glasses). Nearsightedness Same as myopia. Astigmatism 1. A refractive error of the eye (it may occur at the same time as myopia and hyperopia) in which light is focused at different positions inside the optical system. Figure 50.8 shows a schematic diagram. This phenomenon occurs because the lens has an irregular curvature, so the light is not focused at a point but within a small range. The problem can be corrected with a toric lens, which has a larger refractive power along one axis than along the other.
Fig. 50.8 Astigmatism schematic diagram
Fig. 50.9 Aberration schematic diagram
2. An aberration. If the plane defined by the lens optical axis and the off-axis object point is called the sagittal plane (also called the meridian plane), and the plane perpendicular to the sagittal plane is called the coronal plane (also called the tangential plane), then the light emitted by the off-axis object point in the sagittal plane and in the coronal plane, after passing through the lens, will not intersect at one point but is focused into two short lines perpendicular to the sagittal and coronal planes, respectively, resulting in aberration, as in Fig. 50.9. The two short lines are called the sagittal focal line It and the coronal focal line Is. Somewhere between them the smallest diffuse round spot is obtained. Smaller astigmatism can be obtained by using an aperture stop with a larger f-number. 3. A kind of Seidel aberration. As shown in Fig. 50.10, the light emitted by an off-axis object point (not on the principal axis of the optical system) makes an angle with the optical axis. After refraction by the lens, its meridian image point (in the plane formed by the principal ray and the principal axis of the optical system) and its sagittal image point (in the plane perpendicular to the plane formed by the principal ray and the principal axis of the optical system) do not coincide
Fig. 50.10 Seidel aberration schematic diagram
(the converging power of the convex lens for the meridian rays is greater than that for the sagittal rays). The gap between the two image points is called astigmatism and causes unclear imaging. At the midpoint between the two image points, a minimum circle of dispersion is formed. The images formed at the two image point positions are ellipses whose principal axes are perpendicular to each other.
50.3.6 Virtual Vision
Central vision Vision operated by the cone cells in the central part of the retina (around the fovea, but not the fovea itself). Central eye An imaginary eye between the two eyes. Imagine that its retina is formed by superimposing the corresponding points of the two eyes, where the corresponding points refer to the images formed by the two eyes of the same object point. The visual content obtained by the central eye is the same as that of the two eyes, but the sight of the two eyes is merged into a single sight, which is consistent with actual vision. See cyclopean eye. Cyclopean eye A single imaginary visual organ whose imaging is equivalent to binocular imaging. In practice, when the observer focuses the eyesight on a closer object, there is a certain angle between the sight axes of the two eyes, and neither points straight forward. But when the two eyes look at the object, they are aligned to a common visual direction through convergence, and the resulting image is single, as if seen by one eye. From the perspective of subjective perception, the two eyes can be regarded as a single organ, and a theoretically imaginary single eye in the middle of the two eyes can be used to represent this organ, called the central eye. As shown in Fig. 50.11, each pair of corresponding points on the retinas of both eyes has a common visual direction. When the object is directly in front at C, it acts
Fig. 50.11 Cyclopean eye schematic diagram
on the foveae CL and CR of the left and right eyes, respectively. When CL and CR are imagined to overlap, the position of the object point C is on the fovea FC of the cyclopean eye, and its direction is in the middle of the cyclopean eye, that is, the subjective visual direction is directly forward. When the object is at S, it acts on the points SL and SR of the left and right eyes, respectively; for the cyclopean eye, the object is located at FS. The cyclopean eye is a very useful concept when dealing with spatial perception. When a person orients an object in space, he regards himself as the center of visual space and takes the direction from the fovea of the cyclopean eye straight ahead as the forward direction of vision to determine the orientation of the object. Since the object is subjectively seen in a single direction, this direction is called the subjective direction, and it relates the object to the cyclopean eye imagined between the two eyes. All objects falling on the two visual directions (the two optical axes) are perceived as lying in the same subjective direction; it is as if the two corresponding points on the two retinas have the same direction value. Virtual camera A camera-like device that does not physically exist but whose effect is similar to that of a camera. See cyclopean eye. Cyclopean separation See disparity gradient. Cyclopean view In the analysis of stereo images, the view from the midpoint of the baseline between the two cameras, which is equivalent to the result observed with the cyclopean eye. A term in stereo image analysis, derived from the mythical Cyclops with only one eye. When performing stereo reconstruction of a scene based on two cameras, it is necessary to consider which coordinate system is used to represent the reconstructed 3-D coordinate positions, or which viewpoint is used to express the reconstruction process. The cyclopean viewpoint falls at the midpoint of the baseline between the two cameras.
Center-surround differences A mechanism of biological vision that gives an excited response to the most visually noticeable region and a suppressed response to its surroundings, making the center more prominent than the surrounding region. Binocular single vision Vision in which the two eyes are well matched and each image falls on corresponding points of the two retinas, so that a single image is perceived. Horoptera One of the functions of the human visual system. When a person observes an objective scene, the images formed by a set of spatial points on the corresponding parts of the two retinas (corresponding points) are different, but they form corresponding point pairs (corresponding to the same points in space) and are fused into one in the visual center of the brain. In other words, the two retinal images are combined after being transmitted to the cerebral cortex to produce a single visual image with a sense of depth. Binocular diplopia The situation in which the images of an object fall on non-corresponding positions of the two retinas, so that each eye sees a separate image.
50.4 Visual Perception
Visual perception is the higher level of vision, which transforms external stimuli into meaningful content. Constancy, color perception, visual attention mechanisms, etc., all play important roles in it (Finkel and Sajda 1994; Peterson and Gibson 1994; Zakia 1997).
50.4.1 Perceptions
Perception The overall awareness formed by organizing and combining sensations. It is also the process of understanding the objective world through the analysis of sensor inputs (such as images). Perceptual characteristics include relativity, integrity, constancy, organization, meaning (understanding), selectivity, and so on. Psychophysics A subject area that studies the relationship between the perception of humans (or other organisms) and the physical stimuli that give rise to that perception.
Situational awareness A general psychological term that refers to the ability to perceive important factors in an environment; it manifests itself in making predictions about how the environment will change. In the context of image understanding, it generally refers to a combination of scene understanding and event understanding. Visual perception The high or advanced stage of vision. It turns external stimuli into meaningful content. For example, people form concepts such as the existence, shape, size, color, position, height, brightness, and intensity of the scene they are observing and further produce feelings about the environment, reactions, interpretations, understandings, and even decisions and actions. Visual perception mainly concerns how people react after receiving visual stimuli from the objective world and the way they react, and it studies how the appearance of external world space is formed through vision; it therefore involves psychological factors and sometimes the use of memory (a prior) to help make judgments. Visual perception is a group of activities performed in the nerve center, which can organize scattered stimuli in the visual field into a whole with a certain shape so as to express and understand the world. Visual perception can be divided into brightness perception, color perception, shape perception, space perception, motion perception, and so on. Compare visual sensation. Brightness perception The human visual system's visual perception of brightness. It is an attribute of human subjective perception, from which it is judged that a region seems to emit more or less light. The subjective brightness perceived by a person is proportional to the logarithm of the luminous intensity incident into the eyes. Brightness adaptation 1. The adaptation of the human visual system to the perceived brightness of light sources. The overall brightness range to which the human visual system can adapt is large, spanning about 10 orders of magnitude (10^10) from the dark vision threshold to the dazzling limit. However, the human visual system cannot work over such a large range at the same time; it achieves brightness adaptation by changing its specific sensitivity range. The range of brightness that the human visual system can distinguish at the same time is only a small range centered on the brightness adaptation level, which is much smaller than the total adaptation range. See Fig. 50.12. 2. Same as dark adaptation.
Fig. 50.12 Brightness range of the visual system. The overall range extends from the dark vision threshold to the dazzling limit; at any moment only a narrower specific range centered on the current adaptation level is available.
Brightness adaptation level In brightness adaptation, the current sensitivity level of the visual system under given conditions. Referring to Fig. 50.12, the range of brightness that the human eye perceives at a certain moment is a small range centered on this adaptation level. Brightness adjustment Increasing or decreasing the brightness of an image. To decrease the brightness, linear interpolation can be performed between the image under consideration and a pure black image; to increase it, linear extrapolation can be performed between a pure black image and the object image. The interpolation/extrapolation function is

v = (1 − α)·i0 + α·i1

Among them, α is a blending factor (values below 1 darken the image, values above 1 brighten it), v is the output pixel value, and i0 and i1 are the pixel values of the black image and the object image, respectively. See gamma correction and contrast enhancement. Color perception The human visual system's visual perception of color. Its physiological basis is related to the chemical processes of visual perception and to the neural processing of the nervous system in the brain. Color exists only in the eyes and brain of a person; color vision is the result of the human brain's interpretation. Color perception is a color psychophysics phenomenon that combines two main components: 1. The spectrum of the light source (usually described by its spectral power distribution (SPD)) and the physical properties of the observed surface (such as its ability to absorb and reflect light). 2. The physiological and psychophysical characteristics of the human visual system (HVS). Color vision Sensation and perception of color. It is an inherent ability of the human visual system. Human vision can not only perceive the stimulation of light but also perceive electromagnetic waves of different frequencies as different colors. Shape perception The human visual system's visual perception of the shape and structure of objects. Gloss A perceptual characteristic of the shape of an object's surface caused by observing the spatial distribution of light reflected from the object. The gloss behavior can be determined from the regularly (specularly) reflected light components. Spatial perception The human visual system's visual perception of spatial depth. People can perceive a 3-D visual space from the visual image formed on the retinal surface, that is, they can also obtain depth/distance information.
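A minimal NumPy sketch of the brightness adjustment rule v = (1 − α)·i0 + α·i1 from the brightness adjustment entry above, with i0 a pure black image; the test image and the α values are only illustrative.

    import numpy as np

    def adjust_brightness(image, alpha):
        # i0 is a pure black image; alpha < 1 darkens (interpolation toward
        # black), alpha > 1 brightens (extrapolation beyond the original).
        black = np.zeros_like(image, dtype=float)
        v = (1.0 - alpha) * black + alpha * image.astype(float)
        return np.clip(v, 0, 255).astype(np.uint8)   # keep a valid gray range

    img = np.full((4, 4), 100, dtype=np.uint8)       # hypothetical test image
    darker = adjust_brightness(img, 0.5)             # gray level 100 -> 50
    brighter = adjust_brightness(img, 1.5)           # gray level 100 -> 150
    print(darker[0, 0], brighter[0, 0])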
Motion perception 1. Motion sensation: People can feel the movement of an object only when the object has a certain displacement within a certain time. Of course, the perception of motion is also related to the surrounding environment in terms of brightness, chroma, and shape characteristics, and to the position of the visual image on the retina. When there is a reference object, the object motion threshold is about 1–2 arc minutes per second. If there is no reference object, the motion threshold is about 15–30 arc minutes per second. Under the same conditions, the threshold at the edge of the retina is higher than at the fovea. The maximum detectable speed of movement (which depends strongly on the brightness of the object) corresponds to an angular movement of about 1.5°–3.5° in about 0.01 s; faster motion cannot be detected. With a reference object and at an appropriate speed, the angular movement threshold is about 20 arc seconds, and without a reference object it is about 80 arc seconds. It can be seen that the threshold of motion perception (motion acuity) is higher than the resolution threshold (visual acuity) for stationary objects, that is, motion is more difficult to observe. 2. Motion perception: the human visual system's visual perception of motion. Current theories suggest that there are motion detectors in the human visual system that can register signals affecting adjacent points on the retina. On the other hand, the human visual system can also use knowledge of a person's own movement to avoid attributing the movement of the body or the eyes to movement of the scene. Perception of motion Same as motion perception. Perception of movement See motion perception. Motion parallax 1. When there is relative movement between the observer and the scene, the perception corresponding to the visual difference produced. Because motion changes the size and position of the scene on the retina, it creates a sense of depth. Here, the different distances between the observer and parts of the scene cause the change in viewing angle to differ: the viewing angle of a closer object changes more, and that of a distant object changes less. Motion parallax corresponds to the displacement vector (including magnitude and direction) obtained by projecting the scene's motion onto the image, so it can be further divided into distance-caused motion parallax and direction-caused motion parallax, both of which can generate indices of depth through perception in the cerebral cortex of the brain. 2. A kind of motion-related information. For example, when a person moves laterally to both sides, the resulting image contains change information due to its relative movement with respect to the retina. 3. When the camera is moving, the effect of the apparent relative displacement of targets observed from different viewpoints. See parallax.
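The following small numerical illustration relates motion parallax to depth under a simple pinhole camera model (an assumption added here for illustration, not a formula from the entry above): when the observer translates laterally by T, the image of a static point at depth Z shifts by approximately f·T/Z, so nearer points appear to move faster than farther ones.

    # Hypothetical parameters for the illustration.
    f = 500.0                        # focal length, in pixels
    T = 0.1                          # lateral translation of the observer, meters
    depths = [1.0, 2.0, 5.0, 10.0]   # depths Z of static scene points, meters

    for Z in depths:
        dx = f * T / Z               # approximate image displacement in pixels
        print(f"depth {Z:4.1f} m -> image displacement {dx:5.1f} pixels")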
Fig. 50.13 Geometric interpretation of distance-caused motion parallax
Fig. 50.14 Geometric interpretation of direction-caused motion parallax
Distance-caused motion parallax A motion parallax caused by a change in distance. As shown in Fig. 50.13, when the observer moves from left to right with velocity v and observes objects A and B, the visual images obtained at f1 are A1 and B1, and the images obtained at f2 are A2 and B2, respectively. The observer perceives that the visual size of object A changes faster than that of object B, and the (stationary) objects A and B appear to move away from each other (as if they were moving). Direction-caused motion parallax A motion parallax caused by a change in viewing direction. As shown in Fig. 50.14, when the observer moves from top to bottom at a speed v, the perceived motion rotates around the gaze point. If the fixation point is P, the closer point A is observed to move in the direction opposite to the observer, and the farther point B to move in the same direction as the observer. This is an index of depth arising from perception in the cerebral cortex of the brain. Dynamic index of depth An index of depth provided by the retina during movement. A typical example is motion parallax, the information generated when the image moves relative to the retina as a person moves laterally. Another example is dynamic perspective: when a person moves in a car, the continuous change of the field of view produces a flow on the retina. The speed of the flow is inversely proportional to the distance, which provides distance information.
50.4.2 Perceptual Constancy
Perceptual constancy The ability of adults to perceive the physical properties of scenes in real life fairly accurately, without being completely affected by changes in observation conditions. For example, a person can estimate the physical size of objects at different distances fairly accurately, rather than relying only on changes in viewing angle. Further examples are color constancy, motion constancy, size constancy, and spectral constancy. Compare with perception constancy. Brightness constancy 1. A mechanism of the human visual system, namely the tendency of people, when observing the world, to keep the brightness they perceive stable under different lighting conditions. This stabilizes the relative shifts in object brightness caused by constantly changing lighting in the environment, so that the object always looks as if it were in the same lighting environment. 2. An assumption made in computer vision that the brightness of an object remains constant despite changes in ambient lighting. Specifically, in stereo vision matching, it is assumed that corresponding pixels in the two images to be matched always have the same brightness value. See color constancy and illumination constancy. Brightness constancy constraint See brightness constancy. Illumination constancy The effect that the perceived brightness of a surface is approximately constant, irrespective of the illumination. Color constancy A characteristic of the human visual system. It helps people correctly recognize the color of a scene in different seasons (such as summer or winter) and at different times of day (such as dawn or dusk). Color constancy is essentially the relative stability of color perception and classification of objects when lighting and observation conditions change. When the issue of color purity is not involved, it refers to the characteristic that color perception does not change with the illuminance. For example, as long as the spectral distribution of the illumination source is not changed, a white surface will appear white under various illuminations. Similarly, a red object is always considered red under daylight conditions (whether in early morning or at noon). However, there are differences in color perception between the center and the edges of the human retina, and if the spectral distribution of the illumination source differs, the color vision will differ even if the illuminance is the same. Therefore, color constancy is not a strict concept. Psychological experiments show that the constancy of color vision is affected by spatial depth information and the complexity of the scene.
von Kries hypothesis A hypothesis that the primate color constancy process scales the color channels independently. This assumption can be implemented by multiplication with a diagonal matrix; a small code sketch is given at the end of this subsection. Supervised color constancy A technique that uses spectral reference values to convert color values under an unknown light into color values under a known light. First, a reference picture is collected under the unknown lighting conditions. Next, the color of the illumination is determined with reference to known spectral values. Based on the assumption that the lighting does not change, the desired image can be generated. However, the assumption that the spectral power distribution of the illumination does not change is generally not valid for outdoor images. Spectral constancy Eliminating the dependence of pixel spectral signatures on the image acquisition conditions. In this way, a pixel's spectral signature can be used to identify the type of target represented by that pixel. Size constancy The tendency that, no matter how much the distance between the observer and the scene changes, a particular object is always perceived as having the same size. Size constancy depends on the correlation between the apparent distance of the object and the size of its visual image on the retina. Motion constancy The adaptation of the human visual system to the speed of a moving object imaged on the retina, so that the actual change (due to changes in viewpoint, self-movement, etc.) is perceived as following a constant velocity law. Perception constancy The tendency that, after some changes occur in the objective scene, the subjective perception effect (size, shape, color, brightness, etc.) of the human visual system remains the same. Constancy phenomenon The phenomenon that some characteristics of the objective scene have changed, but the subjective perception (size, shape, color, brightness, etc.) remains the same. This is due to a correction process in the brain's center. Also called constant sensation. Common cases are: 1. Constant size: Objects at different distances are perceived as having the same size. This only holds within a certain distance range. 2. Constant color: Although the illumination changes, the color of the object remains the same. For example, daytime illumination can vary by a factor of about 500, but people hardly feel that the colors of scenes change. 3. Constant shape: When a flat object is viewed from the front, it has a certain shape, and if it is rotated, one still perceives the original shape.
4. Constant motion: Although the speed at which some moving objects move across the retina is changing, they are perceived to be moving at a constant speed. 5. Constant position: The position of the object on the retina changes when the eyeball is rotated, but the object is felt not to have moved. 6. Constant direction: When the human body tilts, the perceived main direction of the outside world does not change. 7. Constant brightness: The subjective brightness does not change when the illuminance changes.
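The sketch below illustrates the von Kries-style diagonal scaling mentioned in the von Kries hypothesis entry above: each color channel is scaled independently, here so that a reference white patch maps to equal channel responses. The pixel values are hypothetical, and this is only one simple way such a diagonal transform might be applied.

    import numpy as np

    image = np.array([[[200.0, 150.0, 100.0],
                       [180.0, 140.0,  90.0]]])       # H x W x 3, RGB-like values

    white_patch = np.array([200.0, 160.0, 120.0])     # measured response to "white"
    target_white = np.array([255.0, 255.0, 255.0])    # desired response to "white"

    # Per-channel gains; equivalent to multiplying each pixel's color vector
    # by the diagonal matrix diag(g_R, g_G, g_B).
    gains = target_white / white_patch
    adapted = np.clip(image * gains, 0, 255)

    print(adapted)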
50.4.3 Theory of Color Vision
Tristimulus theory of color perception A theory of human perception of color. There are three types of cone cells in the human visual system, each with a different spectral response curve, so the perception of any incident light can be expressed as three intensities, which roughly correspond to long (maximum at about 558–580 nm), medium (maximum at about 531–545 nm), and short (maximum at about 410–450 nm) wavelengths. Color stimulus function The function φ(λ) describing the radiation causing a color stimulus in the eye. For luminescent objects, the color stimulus function is equal to the spectral power S(λ). For the nonluminous object colors defined by the CIE, the color stimulus function is the product of the spectral power S(λ) and the spectral reflection factor R(λ). For a fluorescent sample, the fluorescence function SF(λ) is added to the color stimulus function of the nonluminous object color. Based on these three situations, the color stimulus function can be expressed as

φ(λ) = S(λ)                  for a luminescent sample
φ(λ) = S(λ)R(λ)              for a nonluminous object sample
φ(λ) = S(λ)R(λ) + SF(λ)      for a fluorescence sample
Theory of color vision A theory that explains how color vision is produced, based on existing physiological knowledge or several hypothetical properties. The problems faced by such theories are complex. Objectively, in terms of the physical properties of light, the light acting on the eyes differs only in wavelength (frequency); subjectively, however, after the human visual system responds to light of these wavelengths and transmits the response to the brain, thousands of intricate colors can be produced. Although there are many color vision theories, there is no universal theory that can explain all color vision phenomena. There are three important theories:
1. The three primary color theory: It can explain and deal with many problems in color photography, color printing, color selection, color matching, color mixing, and other applications, and three types of cone cells with three pigments have indeed been found in the human eye. 2. Opponent color theory: Focused on the psychological aspects of color vision. After the three cone cells receive the signals of the three primary colors, these are combined into two opponent pairs, red-green and yellow (red + green)-blue; the three signals are also added to form a photometric signal, and comparing this photometric signal with neighboring photometric signals creates the contrast between black and white. 3. Visual band theory: The pathway from the retina to the brain is divided into several bands (stages), and each band plays a different role in information processing and transmission. Chromatopsy Same as color vision. Three primary colors See primary color. T unit The unit of the three primary colors. Any color can be expressed by multiplying its chromaticity coordinates by the light color units of the three primary colors and adding the results; the sum of these three products is the T unit of the color. Trichromatic sensing See trichromatic theory. Young-Helmholtz theory A theory that explains color vision. Also called the trichromatic theory. It holds that any color sensation can be produced by mixing three basic spectra. Specifically, color vision is based on three different kinds of cone cells, which are particularly sensitive to light of long, medium, and short wavelengths, respectively. These three types of cone cells, more sensitive to long-, medium-, and short-wavelength light, are also called red cone cells, green cone cells, and blue cone cells, respectively. In 1965, three color-sensitive cone cells with different pigments were found in the retina of the human eye, basically corresponding to red-, green-, and blue-sensitive detectors; this is generally considered a proof of the theory. The theory holds that all colors can be made up of three primary colors (the primary colors of additive color mixture are red, green, and blue; the primary colors of subtractive color mixture are magenta, yellow, and cyan (blue-green)). Young trichromacy theory Same as Young-Helmholtz theory. Three-color theory Same as Young trichromacy theory.
Trichromatic theory Same as Young trichromacy theory. Trichromatic theory of color vision Same as Young trichromacy theory. Trichromacy The property of having three independent color receiver channels. For a camera, the three channels are RGB. For primate retinas, the three channels correspond to the long-, medium-, and short-wavelength (LMS) receptors. Color matching 1. A process or technique that uses a combination of three primary colors to produce a desired color. According to the theory of three primary colors, any color stimulus can be matched with a mixture of three basic stimuli. A specific method is to change the intensity ratio of the three colors red, green, and blue so that the obtained color C can be expressed as

C ≅ rR + gG + bB

In the formula, ≅ means matching (i.e., the color sensations on both sides are the same); r, g, and b represent the scaling coefficients of R, G, and B, respectively, with r + g + b = 1. That is, a color stimulus C can be matched by r units of basic stimulus R, g units of basic stimulus G, and b units of basic stimulus B. 2. The process or technique of mixing one color to achieve the same color perception as another given color.
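As a small worked example of the normalization used in the color matching entry above (the numbers are illustrative only), coefficients r, g, b with r + g + b = 1 can be obtained from the matched amounts of the three primaries by dividing by their sum:

    # Hypothetical matched amounts of the primaries R, G, B for a color C.
    R0, G0, B0 = 2.0, 5.0, 3.0
    total = R0 + G0 + B0

    r, g, b = R0 / total, G0 / total, B0 / total
    print(r, g, b, r + g + b)   # 0.2 0.5 0.3 1.0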
50.4.4 Color Vision Effect
Trichromat According to the three-color theory, a person who perceives color using three types of cone cells in the retina. Each color Cx produced by a primary color light source can be obtained by additively mixing three suitable colors C1, C2, and C3. The perception equation is (α, β, γ, δ are weight coefficients)

αC1 + βC2 + γC3 ≅ δCx

The wavelengths of the primary colors C1, C2, and C3 have been standardized internationally. For most people, the constants α, β, γ, and δ used to produce a given hue are actually the same, and they are called normal trichromats. A small percentage of people deviate from these constants, and they are called anomalous trichromats.
Normal trichromat See trichromat. Anomalous trichromat Including dichromat and monochromat. See trichromat. Color blindness The condition of people lacking red or green photoreceptors (cone cells) (in rare cases, people lacking blue photoreceptors). Such people can perceive far fewer color values than normal trichromats, or have a poor ability to distinguish small color differences. The most severe case is the monochromat, who cannot perceive color at all or cannot distinguish colors. Monochromat A person with only one type of photoreceptor (cone cell) in the retina, who can therefore only sense gray levels. According to statistics, such people account for less than 0.005% of the total population. Dichromat A person with only two types of cones in the retina from birth; according to statistics for Europe and North America, about 2% of the population. For dichromats, all colors C are matched by combining two colors:

δC ≅ αC1 + βC2

Among them, C1 and C2 are two of the three primary colors, and α, β, and δ are coefficients. The range of color values that a dichromat can perceive is much smaller than that of normal trichromats. Most dichromats lack red cones or green cones, showing red-green blindness. Protanopia A subcategory of dichromatic vision. Such a person sees a narrower part of the red end of the spectrum than normal people and easily confuses pale and crimson lights, as well as cyan and violet, as the same color. The most sensitive region of the spectrum is shifted toward the violet side, and the neutral point is near 490 nm. Also called the first type of color blindness. Protanomalous vision An abnormal condition of color vision, also known as type A color weakness. Patients have normal trichromatic vision, but when mixing red and green to match yellow, they use more red than ordinary people, and the spectral luminosity curve is shifted toward short wavelengths; the condition lies between normal vision and protanopia. Dichromatic vision A special type of color blindness, also known as "partial color blindness." Patients describe every color in the spectrum in terms of two hues. This includes red-green blindness (more common), in which only blue and yellow can be seen, and yellow-blue blindness, in which only red and green can be seen.
Pseudo-isochromatic diagram Pictures used to check whether color vision is normal. Also called color-blind pictures, color-blind tables, and confusion pictures. For example, there are pictures whose embedded figures people with color blindness can or cannot recognize, and pictures from which normal and abnormal observers recognize different symbols. Metamerism The property that two kinds of light with different spectral distributions have the same color visual effect, that is, equal tristimulus values. It also refers to color stimuli that have the same color but different spectral distributions under the same observation conditions. Metameric colors Two color stimuli with the same tristimulus values but different radiant flux spectral distributions, which exhibit the same color visual effect. Also called metamerism. Here color is defined by a limited number of channels, each of which covers a certain spectral range. It can be seen that the same metameric color can be produced by multiple spectral distributions. Metameric Describes two color samples that differ in spectral composition but give people the same or a similar color experience (i.e., look the same) under certain conditions. This means that two objects that appear to have the same color may have completely different colors under different conditions. Metameric stimulus A stimulus that can produce metamerism. If a self-luminous body is considered, the stimulus depends on the degree of difference in its spectral distribution, on whether the observer's color vision is normal, and on whether a standard observer is assumed. If a non-self-luminous object is considered, the stimulus depends only on the spectral distribution of the light source. Metameric matching Color matching of metameric colors. Two kinds of light with different spectral distributions that have the same color vision effect, and hence the same tristimulus values, are called metameric stimuli or metameric lights. Such color matching is metameric matching. Inducing color The color stimulus of a certain region in the field of view that affects the color perception of its adjacent region. Compare induced color. Induced color The affected color when a color stimulus in one region of the field of view influences the color perception of an adjacent region. Compare inducing color.
50.4.5 Color Science
Color science A subject that studies the response of the retina to electromagnetic waves of various wavelengths or bands and the judgment of the brain on this response. It forms part of bio-optics and also covers physics, physiology, and psychology. Object perceived color A color that can only be associated with objects with the help of some actual experience or things. Compare non-object perceived color. Nonluminous object color The color of light reflected or transmitted from the surface of an object. It is related both to the wavelength of the original incident light and to the reflection characteristics of the surface of the object. See mutual illumination. Non-object perceived color One of four colors: red, yellow, green, and blue. It takes a certain time for the human eye to generate color vision after receiving a light stimulus, and only the four colors red, yellow, green, and blue can be recognized at the beginning. They are independent and abstract and are called non-object colors or aperture colors. See free color. Aperture color The color of an object observed through a small light hole. Due to the limitation of the aperture diameter, the original body color of the light-emitting object may not be perceived. Free color A perceived color freed from the observation of known objects. In color vision, many psychological factors affect people's judgment of color. One of the most important is the joint presentation of the objects and backgrounds whose color is to be judged. To avoid this effect, there are generally two ways to make the color attached to an object a free color. One method is to observe through a small hole, so that the observer cannot know to what object the seen color belongs (to avoid psychological cues); this is called aperture color. Another method is the Maxwell observation method commonly used in many photometric experiments and instruments, that is, illuminating a point of the object and using a single lens to image that point onto the pupil (so that only a single point is seen). Advancing color A color that makes an object look closer than it actually is. Derived from a subjective characteristic of the human visual system. Receding color A surface color that makes an object appear farther than it actually is. Derived from a subjective characteristic of the human visual system.
Contractive color A color that makes the actual object appear smaller. Derived from a subjective characteristic of the human visual system. Expansive color A color that makes the actual object appear bigger. Derived from a subjective characteristic of the human visual system. Warm color A color that gives a warm feeling after observation. Related to people's subjective perception of certain colors. This mainly comes from the experience of the colors of objects that generate heat and extends to similar colors, such as red, orange, yellow, wine red, iron red, jujube, and purple red. They are the colors of light with longer wavelengths in the visible range. Compare cool color. Cool color A color that gives a cool feeling after observation, such as cyan, blue, and green. Compare warm color. Subjective color The subjectively perceived color of the human visual system, which does not necessarily correspond one-to-one with the objective wavelength of light. Given a certain wavelength of light, its color can be determined. However, some colors can be produced by mixing two other colors (wavelengths) of completely different light. For example, with the proper combination of red and green, the same color sensation as that of sodium yellow can be obtained. Such a mixed color is called a subjective color. Color difference The perceptual difference between two color stimuli. Color difference formula A formula for calculating the difference between two color stimuli. Color difference threshold For a given luminance, the smallest change in color that the eye can discern when the light intensity is kept constant and only the color changes. Also called color sensitivity. The color change can be expressed by different parameters. If it is expressed by the dominant wavelength l, it is from l to l + Δl; if it is expressed by the color purity p, it is from p to p + Δp; if it is expressed by chromaticity coordinates, it is from (x, y) to (x + Δx, y + Δy). There are different ways to express the color difference threshold, which include: 1. Represented by spectral color (color purity equal to 1): the minimum is at 490 nm and 590 nm, with a width of about 1 nm; the threshold rises rapidly toward the violet and red ends of the visible spectrum. The color difference threshold decreases when the luminosity is increased, that is, the number of distinguishable colors increases and the sensitivity increases, but excessively strong or weak light will increase the color difference threshold. When the color purity is not high enough, the color difference threshold will also increase, that is, the number of colors that can be distinguished
will decrease. The most sensitive region occurs when the brightness is between 15 cd/m2 and 10,000 cd/m2; at this level, about 100 pure spectral tones and 30 kinds of purple colors can be distinguished. In the visible spectrum, the hue threshold has minima at 440 nm, 490 nm, and 590 nm and maxima at 430 nm, 455 nm, 540 nm, and 650 nm. 2. The color purity threshold indicates the minimum color purity difference that can be distinguished when the brightness of the two compared lights differs while the light wavelength and the dominant wavelength are unchanged. In the vicinity of white light, where the color purity is low, the threshold varies strongly: the thresholds at both ends of the spectrum are small, and the thresholds in the middle are large. In the vicinity of spectral colors, where the color purity is high, there is little change with wavelength. 3. The threshold on the chromaticity diagram can reflect the influence of the dominant wavelength and the color purity at the same time, but the color difference threshold determined in this way is not very accurate and does not completely match natural color (image) differences. The brightness at which colorless vision turns into color vision is called the colorless light threshold. Color contrast When the retina is stimulated with light of different wavelengths, the unstimulated part (peripheral field) and the stimulated part (inner field) present a contrasting color sensation. For example, when the peripheral field is stimulated by light, the greater the difference between its hue and the hue of the inner field, the more brilliant the inner field becomes. This is an interaction influenced by secondary complementary stimuli surrounding the stimulus. However, a colorless stimulus can induce a black peripheral field only with white light in the inner field; otherwise it cannot. This shows that the black peripheral field also induces a secondary white feeling in the inner field. Successive color contrast An index describing the difference between adjacent pixels or pixel sets in a color image; it is also a phenomenon of human visual perception. It happens when a person observes a colored region for a long time and then turns to observe a black and white region. At this time, the afterimage of the previously observed region appears as an opposite color (negative afterimage) or close to the original color (positive afterimage). Color appearance The subjective color perception when a person focuses on an object. According to the theory of color vision, when the observed object is a nonluminous body, a kind of object color is felt, and the feeling at this time is related to the other scene elements around the object. When the observed object is a luminescent body, the surrounding scene appears dark. The subjective feelings of these two modes in 3-D color space are not exactly the same. Regardless of the mode in which the color appearance appears, it depends on four factors: (1) the spectral distribution of the light source; (2) the transmission or reflection properties of the focused object and surrounding
objects; (3) the size and shape of the object in the field of view; and (4) the visual response of the viewer to the spectrum. Chromatic adaptation A phenomenon in which the human eye’s perception of color changes under the stimulation of colored light.
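The color difference formula entry above does not commit to a specific formula; as one common illustration, the sketch below evaluates the CIE 1976 color difference, the Euclidean distance between two color stimuli expressed in the CIELAB space. The coordinates used are hypothetical.

    import math

    def delta_e_76(lab1, lab2):
        # Delta E*ab = sqrt((dL*)^2 + (da*)^2 + (db*)^2)
        return math.sqrt(sum((c1 - c2) ** 2 for c1, c2 in zip(lab1, lab2)))

    color_a = (52.0, 10.0, -8.0)   # (L*, a*, b*) of the first stimulus
    color_b = (50.0, 12.0, -6.0)   # (L*, a*, b*) of the second stimulus

    print("Delta E (CIE76):", delta_e_76(color_a, color_b))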
50.4.6 Visual Attention
Human visual attention theory A theory that studies the principles and characteristics of the human visual system's visual attention. It assumes that, during a certain period of time, the human visual system attends to and processes only a part of the details of the entire environment and hardly considers the rest. This is the visual attention mechanism, which enables people to quickly locate objects of interest in a complex visual environment. Visual attention [VA] The use of low-level feature detection to guide high-level scene analysis and object recognition strategies. From the perspective of human observation of the world, searching for patterns of attention through glances is a common method. Also, the ability of a vision system to selectively focus on some parts of the environment and not others. Attention See visual attention. Visual attention mechanism The perception mechanism by which the vision system selectively attends to part of the environment and not to other parts. It has two basic characteristics: directivity and concentration. Directivity manifests itself as selectivity among multiple stimuli that appear in the same period of time; concentration manifests itself as the ability to inhibit interfering stimuli, and its generation, range, and duration depend on the characteristics of the external stimuli and on human subjective factors. Visual attention mechanisms are mainly divided into two categories: bottom-up, data-driven pre-attention mechanisms and top-down, task-driven post-attention mechanisms. Bottom-up processing is saliency detection driven by the underlying data without the guidance of prior knowledge. It belongs to a lower-level cognitive process and does not consider the impact of cognitive tasks on saliency extraction, so its processing speed is relatively fast. The top-down process discovers salient targets in a task-driven manner. It belongs to a higher-level cognitive process: it needs to process consciously according to the task and extract the required regions of interest, so its processing speed is relatively slow.
Human visual attention mechanism Same as visual attention mechanism. Visual attention model A model that uses computers to simulate the human visual attention mechanism. After-attention mechanism See visual attention mechanism. Focus of attention [FOA] A feature, target, or region that the human visual system focuses on when observing. Weight of attention region When analyzing and organizing a video, one of the two weights describing the viewer's impression of the video. The weight of attention region WAR can be calculated in two cases:

$$W_{AR} = \begin{cases} S, & L < L_0 \\ S\,(1 + L/R_L), & L \ge L_0 \end{cases}$$
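For example, a minimal sketch of evaluating this piecewise weight (the meanings of S, L, L0, and RL are explained just below; the function name is an illustrative choice):

def weight_of_attention_region(S, L, L0, RL):
    # S: lens zoom parameter; L: horizontal panning speed of the shot;
    # L0: minimum panning speed that has an impact; RL: control factor.
    if L < L0:
        return S                   # slow panning is treated as random shake
    return S * (1.0 + L / RL)      # faster panning strengthens the weight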
Among them, WAR is proportional to the lens zoom parameter S and is strengthened by the speed of the horizontal panning shot; L0 is the minimum value of panning speed that can have an impact, and panning motion less than L0 is considered random shake; RL is used to control the factor by which the panning speed affects the weight. In this time-weighted model, weights greater than 1 are emphasized, while values less than 1 are ignored. Therefore, the product of the weight of the attention region of the shot and the weight of the background can be specified as 1. Conspicuity maps The output of a biological model-based pre-attention mechanism in a visual attention mechanism. First, color features, brightness features, and direction features (low-level visual features) are extracted from the image by linear filtering, and 12 color feature maps, 6 brightness feature maps, and 24 directional feature maps are formed after Gaussian pyramids, central-peripheral difference operators, and normalization. By combining these feature maps separately, a conspicuity map of color, brightness, and direction can be formed. After the conspicuity maps of the three features are linearly fused to generate a saliency map, a two-layer winner-take-all (WTA) neural network can be used to obtain the salient region. Fixation In physiology, the result of fixed visual attention to a single location, where the eye or sensor is fixedly pointing at that location. Gaze stabilization A gaze control method corresponding to the positioning process in vision. Similar to the object detection process.
Gaze correction An application of stereo head tracking and 3-D reconstruction. In a video conference using a laptop, the camera is often placed above the display. When a participant looks at his screen (a window on it), his gaze no longer meets the subject but instead looks down without looking at the other person. If two or more cameras are used to replace the original single camera, it is possible to construct a virtual view through synthesis to generate virtual eye contact. Real-time video matching is used to build an accurate 3-D head model, and view interpolation is used to synthesize individual views within it. Gaze change A means of gaze control. Corresponding to the rotation of the eyeball, the purpose is to control the next fixation point according to the needs of a specific task, similar to the tracking process of an object. Visual search A task that searches an image or a scene to find a pre-specified target. It is often used as an experimental tool in psychophysics. Eye movement The movements, performed by six extraocular muscles, through which the two eyes act together to form a single image of the object at the fixation point. Eye movements can be divided into two kinds, intentional and unintentional. Unintentional movements, such as the instability of the eyeball, let the eye glance aimlessly at the side of an object. Unintentional movement is not the purposeful visual tracking of a moving object, or looking at a still scene while moving, or the physiological reflex of the eyeball when the eyes are closed. In intentional movements (such as aiming or reconnaissance), eye movements are always jumping (saccadic). Accommodation The process of adjusting the eye to sharpen the image on the retina. See accommodation convergence. Accommodation convergence A series of actions to coordinate the eyes in order to see closer objects: (1) the two eyes each adjust their accommodation separately; (2) the two eyes converge (converge on the object); (3) the eyes turn slightly downward; and (4) the pupils narrow slightly. In fact, adjusting the line of sight to see nearby objects naturally causes the eyes to converge. It is generally considered that the accommodation is proportional to the convergence, that is, when the accommodation is regarded as 1 m⁻¹, the convergence is 2–6 m⁻¹. Convergence When looking at a close object, the two eyes turn toward each other (rotate toward each other). The closer an object is, the more they rotate and the greater the convergence effect. Convergence is a cue for depth vision. It loses its effect for objects more than about 15 to 18 m away. See non-visual indices of depth.
50.5
Visual Psychology
Visual psychology studies the psychological mechanism and response of objective images through visual organs. Illusion is an improper sensation and false perception of objective things by the human senses. The most obvious optical illusions are the geometric illusions (Aumont 1994; Finkel and Sajda 1994; Peterson and Gibson 1994; Zakia 1997).
50.5.1 Laws of Visual Psychology Color psychophysics The branch of psychophysics that explains color perception. The perception of color begins with a colored light source that emits electromagnetic radiation with a wavelength of about 400–700 nm. Part of the radiation is reflected from the surfaces of objects in the scene, and this reflected light reaches the human eye; part of the radiation penetrates objects in the scene, and the transmitted light reaches the human eye after passing through them. These lights give a colored sensation by means of the retina and brain. Pop-out effect A criterion or method used by psychologists to determine whether a (visual) effect actually comes from (visual) perception. For example, in the left picture of Fig. 50.15, several green circles among many red circles are easily detected (they pop out), while in the right picture of Fig. 50.15, it is much more difficult to detect the green circle due to the influence of the green missing-corner matrix.
Fig. 50.15 Pop-out effect schematic diagram
Weber's law A psychophysical law that indicates the relationship between mental and physical quantities. It reflects the relationship between the minimum perceptible difference between two stimulus intensities and the absolute intensity of the stimulus and points out that the threshold of the perceived difference varies with the original stimulus amount and shows a certain regularity. For example, in visual perception, if I represents the brightness of the target (a physical quantity reflecting the intensity of the stimulus) and ΔI represents the perceived brightness change (minimum
perceptible difference), then Weber's law can be expressed as ΔI/I = W, where W is the Weber rate. The Weber rate can vary with different stimuli and is constant at moderate stimuli (i.e., this law only applies to stimuli in the medium range). For most normal people, this law can be expressed as

$$\frac{\Delta I}{I} \approx 0.02$$

Among them, ΔI is the smallest difference in gray value that can be resolved by the human eye when the brightness is I. Because the ratio ΔI/I is a constant, when the value of I is small (darker gray), the smallest perceptible difference that can be resolved is also smaller. Weber's law can be understood as follows: if the difference between two stimuli with values I and I + ΔI can be observed and perceived, then as long as ΔI/I ≤ ΔJ/J, it should be possible to perceive the difference between the stimuli with values J and J + ΔJ. Weber-Fechner law Same as Weber's law. Gestalt theory A theory created by a school of psychology and cognition in Germany in the early twentieth century. These include laws about how humans organize and aggregate visual elements to form objects (shapes). Common rules include: 1. The law of approach: Elements that are close in space are more likely to be perceived as belonging to a common shape than elements that are separated. 2. The law of similarity: elements with similar shapes or sizes are more likely to be perceived as belonging to similar collective shapes. 3. The law of continuity: when a shape is incomplete, there is a natural tendency to continue the shape to complete. 4. The law of closure: When moving a shape, all elements moving at the same time are considered to belong to the same overall shape. Gestalt theory states that the way an object is observed is determined by the entire environment or the place where the object itself exists. In other words, the visual elements in the human field of view are either attracted (combined) or repelled (uncombined) with each other. The close, similar, continuous, and closed Gestalt rules describe the different ways of combining in the field. Gestalt Gestalt is German, which means "shape." In the first half of the twentieth century, the Gestalt school of psychology, led by German psychologists Wertheimer, Köhler, and Koffka, did far-reaching and influential work on the theory of perception (and thus computer vision). Its basic tenet is that the perceptual mode has a holistic nature and often cannot be explained by a single component. In other words, the whole contains more information than the sum of the parts. This concept is
summarized into several basic laws (closeness, similarity, closure, “common destiny” or good form, saliency) and can be applied to all psychological phenomena (not just perception). Many low-level tasks in computer vision, especially perceptual grouping and perceptual organization, use these ideas. See visual illusion. See Gestalt theory. Perceptual organization A psychological conclusion based on Gestalt theory. The central principle is that the human visual system prefers or is more satisfied with certain organizations (or explanations) of visual stimuli than others. A well-known example is that a cube drawn in a line grid is immediately interpreted as a 3-D object rather than a combination of 2-D lines. This concept has been used in some low-level vision systems to determine the set of low-level features most likely to be produced by the object of interest. In addition, the interpretation hypothesis for the optical illusion of geometric figures is also based on the organization of perception. Perceptual grouping See perceptual organization. Prototype matching A matching theory proposed by psychologists of Gestalt theory. It is believed that, for the currently observed image of a letter “A,” no matter what its shape is or where it is placed, it is similar to the previously known “A.” In long-term memory, humans do not store countless templates of different shapes, but use similarities abstracted from various types of images as prototypes and use this to test the images to be recognized. If a prototype analog can be found in the image to be recognized, then the recognition of this image is achieved. Perspective inversion 1. Inversion of perspective: A phenomenon of illusion in human vision. When people observe the cube in the Necker cube illusion, they will intermittently treat the middle two vertices as the closest points to themselves, resulting in the exchange of visual concerns. 2. Perspective inversion: Starting from the image points obtained from the 3-D point of the scene, determine the position of the point in the 3-D scene. That is, solve the perspective projection equation to obtain 3-D coordinates. See absolute orientation. Perceptual reversal Perspective inversion in psychology. Cylinder anamorphosis A special effect that intentionally reverses an image so that it can only be viewed correctly through a mirror. This is an art that has been popular since the eighteenth century. With this effect, the image can create an ancient atmosphere or an illusion of the times.
Alternating figures It is possible to see the patterns of the two images alternately. For example, the graphic in Fig. 50.16 is called a “Rubin bottle” or “Peter-Paul cup.” On the one hand, it can be seen as a wide mouth cup (considering the middle part); on the other hand, it can be seen as a side portrait of two people facing each other (considering the left and right parts). When one image is seen, the other is used as the background. Optical illusion Due to physiological and psychological optics, incorrect perception of the phenomenon observed by the eye. For example, if you look at the white blocks in the black frame and the black blocks in the white frame, even though the size of the two blocks and the two frames are exactly the same, you always feel that the white blocks are larger. This is because the image of bright spots on the retina is always a small round spot rather than a point, so the white block image invades the black frame, while the white frame image invades the black block. This phenomenon is called light infiltration. The fan-shaped illusion (or Hering illusion) shown in Fig. 50.17 is a type of light hallucination. The two parallel lines in the middle on the left figure are curved inward, and the two parallel lines in the right figure are curved outward. One explanation for this hallucination is the neural displacement theory. Same as illusion of geometric figure. Fig. 50.16 An example of alternating figures
Fig. 50.17 Fan-shaped hallucination
Fig. 50.18 An impossible object
Phosphene Originally refers to the light sensation produced by the eye’s retina at the moment of external mechanical stimulation or electrical stimulation (non-light stimulation), etc. Now it also refers to the phenomenon that the image formed on the retina is different from the one perceived by people due to psychological effects (including various optical illusions). Critical bandwidth Originally refers to the psychophysical properties of hearing. For a narrow-band noise source corresponding to a certain loudness, if the bandwidth is increased, the sound loudness remains unchanged when the bandwidth does not exceed the critical bandwidth. The critical bandwidth varies with the center frequency of the band. Impossible object An object that does not physically exist. Figure 50.18 gives an example. The step on the upper surface cycle is not possible in practice.
50.5.2 Illusion Illusion People’s organs incorrectly sense objective things. The optical illusion of geometric figures is the geometric illusion produced by the human visual system. Visual illusion The illusion of the human visual system in observing objective things. The most typical is the illusion of geometric figures. The perception of scenes, targets, movements, etc. does not match the actual reality of the generated image or image sequence. In general, illusion can be caused by a specific arrangement of visual stimuli, observation conditions, and the response of the human visual system. Common examples include the Ames room (a room that
looks normal but makes a person appear to change height when standing at different positions) and the Ponzo illusion (when two equal line segments are viewed as the projections of objects in 3-D space, their perceived lengths differ). As shown in Fig. 50.19, the horizontal bands in the distance appear to be longer than the horizontal bands in the near field. See Gestalt theory.
Fig. 50.19 Ponzo illusion
Subjective contour The contour or shape of an object that can be seen by the human visual system without a difference in brightness. This kind of contour perception without direct stimulation is also called an illusion boundary. It stems from the edges that people perceive in an image based on the integrity principle of Gestalt theory, especially when no image evidence appears. Illusion boundary Same as subjective contour. Subjective boundary Same as subjective contour. Visual deformation The visual illusion of geometric figure caused by a sensory deformation of a graphic through vision. These deformations occur in the vision of normal people, and they occur immediately, often without the influence of external factors; they can be produced simply by observing the graphic. Ebbinghaus illusion An illusion of size, or an illusion of relative size perception. A typical situation is shown in Fig. 50.20. The left and right thick-line circles surrounded by other circles are the same size, but because they are surrounded by relatively larger and smaller circles, respectively, the thick-line circle on the left looks smaller than the thick-line circle on the right. This illusion is also called the Titchener circles.
Fig. 50.20 An example of Ebbinghaus illusion
Fig. 50.21 Necker cube illusion
Titchener circles Same as Ebbinghaus illusion. Motion contrast An illusion of movement. For example, if you look at two symmetrical figures alternately, you will feel that the two figures are swinging back and forth. Necker illusion Abbreviation of Necker cube illusion. Necker cube Draw an orthogonal projection of a cube. It may be explained in two ways, this is the Necker cube illusion. Necker cube illusion An illusion of geometric figure. This is a typical example showing that using only the elementary representations in Marr’s visual computation theory does not guarantee a unique interpretation of the scene. See Fig. 50.21. If the observer focuses on the intersection of the upper right three lines of the small square in the center of Fig. 50.21a, it will be interpreted as Fig. 50.21b, that is, the imaged cube is regarded as Fig. 50.21c. If the observer focuses on the intersection of the lower left three lines of the small square in the center of Fig. 50.21a, it will be interpreted as Fig. 50.21d, that is, the imaged cube is considered as Fig. 50.21e. This is because Fig. 50.21a gives people a clue of (part of) a 3-D object (cube). When a person tries to recover 3-D depth from it with the help of empirical knowledge, due to the difference between the two integrated methods, he will draw two different interpretations. It can be seen that the Necker cube illusion indicates that the brain makes different
assumptions about the scene and even makes decisions based on incomplete evidence. Necker reversal An ambiguity that arises when recovering 3-D structures from multiple images. Under affine observation conditions, a set of 2-D image sequences generated by a set of rotating 3-D points is the same as a set of 2-D image sequences generated by another set of 3-D points rotated in opposite directions. There may be two sets of solutions. The different set of points here can be a mirror image of the first set of points on any plane perpendicular to the camera’s optical axis.
50.5.3 Illusion of Geometric Figure and Reason Theory Illusion of geometric figure A situation in which what a person perceives when observing a geometric figure does not correspond to the actual stimulation pattern. This happens when people's attention is focused only on a certain feature of the figure under the influence of various subjective and objective factors. Also called visual deformation. Refers to the visual distortion of graphics. These deformations occur in normal human vision, and they happen instantly, often without the influence of other external factors. Common illusions of geometric figures can be divided into the following two categories based on the propensity for the errors caused: 1. Optical illusion in quantity or dimension, including optical illusion caused by size and length. 2. Optical illusion in direction, caused by a change in the direction of a straight line or a curve. Psychologists and physiologists have put forward many kinds of theories/hypotheses to explain the causes of optical illusions of geometric figures, such as the typical eye moving theory, perspective theory, confusion theory, contrast theory, assimilation theory, neural displacement theory, etc. Although each of these theories can explain some types of illusion, no single theory can give a reasonable explanation for all illusions of geometric figure. In other words, a unified theory of interpretation has never been established. Geometric illusion Same as illusion of geometric figure. Eye moving theory An explanatory point of view about the illusion of geometric figures. It has two main forms. One form holds that a person's impression of the length of an object is based on the eye scanning the object from one end to the other. Because the vertical movement of the eyeball is more laborious than the horizontal movement, the false impression that a vertical distance is longer than the same horizontal distance is generated. This form is often used to explain the illusion of Fig. 50.22a, that is, people will feel that the
Fig. 50.22 Two optical illusions that can be explained by the eye moving theory
Fig. 50.23 Incorrect comparison example
vertical line is longer than the horizontal line, but the two lines are actually the same length. The other form holds that the features in the figure will cause the eye gaze point to be wrong, thereby creating the illusion. For example, in Fig. 50.22b, the lines between the two pairs of arrows have the same length, but due to the influence of the direction of the arrows, the scanning line of the eye exceeds or does not reach the end points of the arrows. As a result, people's perception of the length of the lines is wrong: one feels that the line between the two arrows above is shorter than the line between the two arrows below, but the two lines are actually the same length. Perspective theory An explanatory point of view about the illusion of geometric figures. Also known as the misapplied constancy theory. The core concept of the perspective theory is that the graphics that cause optical illusions imply depth through perspective, and this depth suggestion leads to changes in the perception of the size of the graphics. The general rule of change is that the size of the parts of the graphic that represent more distant objects is enlarged and the size of the parts that represent closer objects is reduced. Misapplied constancy theory See perspective theory. This name places more emphasis on the suggestion that the illusion results from a subjective misreading. Confusion theory A theory that explains the optical illusion of geometric figures. It is considered that the observer did not correctly select the object to be compared when determining the size of an object or a space for comparison, or that the figure to be compared is confused with other figures. As shown in Fig. 50.23, the distance from the right side of circle A to the left side of circle B is equal to the distance from the left side of circle B to the right side of circle C, but the former looks significantly longer. The reason is that the distance to be compared is confused with the distance between the circles. Incorrect comparison theory Another name for confusion theory. This name points out the nature of the mis-comparison intuitively.
Fig. 50.24 One illusion of geometric figure that can be explained by the contrast theory
Fig. 50.25 One illusion of geometric figure that can be explained by the assimilation theory
Contrast theory In the theory of optical illusion of geometric figures, it is considered that the judgment is shifted to the opposite side due to the comparison with different reference objects when judging the scene. For example, two circles of the same size appear different in size when they are surrounded by different circles, for example, surrounded by a set of small circles or a set of large circles (the circle surrounded by a set of large circles looks smaller than the circle surrounded by a set of small circles), as shown in Fig. 50.24. Assimilation theory An explanatory point of view about the illusion of geometric figures. It is believed that when people make judgments on scenery, they assimilate it or part of it into its background or frame of reference, which leads to wrong judgments. For example, in Fig. 50.25, the diagonal in the parallelogram on the left looks longer than the diagonal in the parallelogram on the right, but the two diagonals are actually the same length. This phenomenon can be explained by the assimilation theory. When a parallelogram is divided into two parts of different sizes, the smaller the small quadrangle, the shorter its diagonal looks. This is because, when a person perceives the pattern, a certain component of the stimulus is assimilated to its background or reference system. Neural displacement theory An explanatory point of view about the illusion of geometric figures. It is based on physiology, emphasizing that optical illusions are caused by disturbed physiological mechanisms or by interactions between visual perception regions. This theory holds that a strong stimulus to a certain area will reduce the sensitivity of the surrounding regions, resulting in different perceptions and interpretations of various parts of the object. The basis is that the activity of a certain position in a nerve can inhibit the activity of its neighboring regions. A typical example is the phenomenon of simultaneous contrast. If the same piece of gray paper is placed in two different backgrounds of black and white, people seem to perceive a significant difference in
brightness: when the bright background produces strong suppression, the paper looks darker; when the dark background produces weak suppression, the paper looks whiter. Good figure principles The part of the visual field that is perceived as a meaningful figure is often the most meaningful figure (i.e., the good figure) among the various combinations that the same stimulus may display. The specific factors that make up a good figure include two types: the stimulus factors, mainly related to the characteristics of the stimulus itself, and the non-stimulus factors, which change according to the subjective conditions of the observer.
Chapter 51
Application of Image Technology
Image engineering not only integrates many similar disciplines, but also further emphasizes the application of image technology (Poynton 1996; Zhang 2018a).
51.1
Television
Digital television is an area where image technology is widely used (Poynton 1996; Zhang 2007).
51.1.1 Digital Television Digital television [DTV] A display technology that can fully compress, encode, transmit, store, and broadcast moving images, sounds, and data. An audiovisual system for receiving/playing television programs by viewers. Compared with traditional analog TV, digital TV has the advantages of high-definition pictures, high-fidelity stereo sound, signal storage, full use of frequency resources, and the ability to form multimedia systems together with computers. The signal transmission rate in standard digital television is 19.39 Mbit/s. Digital TV can be further divided into low-definition television, standard-definition television, high-definition television, and ultra-high-definition television according to its definition. Scanning raster Scans in raster mode along horizontal lines from left to right. All the lines together form a scanning raster or television raster.
Television raster See scanning raster. Television scanning The process of converting pixels in an image into a video voltage-time response according to brightness and chrominance. The receiver is kept completely synchronized with the transmitted scanning sequence through the synchronization control signal. Scanning can be performed line by line (progressive scanning), or interlaced scanning can be performed by scanning the odd lines first and then the even lines. The advantage of the latter is that a picture is divided into two and the field frequency is doubled, that is, 25 frames/s becomes 50 fields/s, which can effectively eliminate flicker. Low-definition television [LDTV] Digital TV with a resolution of 340 × 255 pixels. The horizontal resolution of the image is greater than 250 lines, the frame rate is up to 60 fps, and the screen aspect ratio is 4:3. Compare HDTV. Standard-definition television [SDTV] A type of digital television with a horizontal resolution of 480–600 lines, a resolution of 720 × 576 pixels, a frame rate of up to 60 fps, and a screen aspect ratio of 4:3. High-definition television [HDTV] A system that produces, plays, and displays high-quality television images. Developed on the basis of digital television. The horizontal resolution of the image is 720 lines (progressive) or 1080 lines (interlaced) or higher, with a resolution of up to 1920 × 1080 pixels, a frame rate of up to 60 fps, a screen aspect ratio of 16:9, and a brightness bandwidth of 60 MHz. According to ITU regulations, the image quality seen by a normal-vision viewer at a distance of three times the height of the display screen of a high-definition television system should give the same impression as viewing the original scene or performance. Ultra-high-definition television [UHDTV] An extension of HDTV within the digital TV specification, with a physical screen resolution of 3840 × 2160 pixels or more. The International Telecommunication Union (ITU) released the international standard for ultra-high-definition television on August 23, 2012: ITU-R Recommendation BT.2020. This standard regulates the resolution, frame rate, color space, and color coding of UHD TVs. The resolution is divided into two levels: (1) Ultra HD 4K: horizontal resolution 3840, vertical resolution 2160, and aspect ratio 16:9, with about 8.3 million pixels in total, and (2) Ultra HD 8K: horizontal resolution 7680, vertical resolution 4320, and aspect ratio 16:9, with about 33.2 million pixels in total. It can be seen that "4K TV", as commonly referred to, covers only part of this standard. There are nine frame rates: 120p, 60p, 59.94p, 50p, 30p, 29.97p, 25p, 24p, and 23.976p. They all use progressive scanning.
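As a quick arithmetic check of the pixel counts quoted above (the 8-bit, 60 fps raw-rate figure at the end is an illustrative assumption, not part of the standard):

uhd_4k = 3840 * 2160   # 8,294,400 pixels, i.e., about 8.3 million
uhd_8k = 7680 * 4320   # 33,177,600 pixels, i.e., about 33.2 million

# Raw (uncompressed) rate of 4K at 60 fps, 3 color channels, 8 bits per channel:
raw_bits_per_second = uhd_4k * 60 * 3 * 8   # about 11.9 Gbit/s, hence the need for compression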
Closed-circuit television [CCTV] A video capture system placed in a special environment that allows only limited access by observers; it can be used for video surveillance. Telecine Conversion from motion picture (movie) format to standard video equipment format. Videophone A telephone on which the other party's dynamic image can be seen on the display during a call. Also known as picturephone. Picturephone A telephone on which the other party's dynamic image can be seen on the display during a call. There are two types of passbands for the image signals, narrowband and broadband: narrowband is about 1 MHz, which can be transmitted by equalizing amplification on long-distance lines; broadband is about 4 MHz, which can only be transmitted over optical cables. Audio The frequency of sound waves (sounds) that can be heard by normal human ears; it also refers to sound signals or sound data. Generally speaking, video includes video data and audio data. In digital video, audio is usually encoded as a data stream that can be separated from or interleaved with the video data. In digital television, audio and video are interleaved.
51.1.2 Color Television Color TV format Specific systems and technical standards for color television broadcasting. Common standards include NTSC (developed by the United States, used in the United States and Japan, etc.), PAL (developed by Germany, used in countries such as Germany and China), and SECAM (developed by France, used in countries such as France and Russia). National Television System Committee [NTSC] format One of three widely used color TV formats. This format was developed by the National Television System Committee. The color model used is the YIQ model. Phase Alteration Line [PAL] format One of three widely used color TV formats. It is also an analog composite video standard. The color model used in the PAL format is the YUV model, which uses two color difference signals (U and V) that are low-pass filtered and combined into a single chrominance signal; the phase of one chrominance component is switched between alternate lines, which is the source of the PAL name and also its most obvious feature.
Séquentiel Couleur À Mémoire [SECAM] format One of three widely used color TV formats. It is mainly used in France and most Eastern European countries. The color model in the SECAM format uses the YUV model. It plays 819 lines per second. Séquentiel Couleur À Mémoire [SECAM] See Séquentiel Couleur À Mémoire format. Color model for TV Based on different combinations of three primary colors (R, G, B), used for color models of color TV formats. The YUV model is used in the PAL TV system, and the YIQ model is used in the NTSC TV system. YIQ model A color model used in the color television system NTSC, which separates the luminance signal (Y) and two color signals: an in-phase signal (about orange/cyan) and a quadrature signal (about purple/green). I and Q are the results obtained by rotating the U and V components in the YUV model by 33°, respectively. After rotation, I corresponds to the color between orange and cyan, and Q corresponds to the color between green and purple. Because the human eye is less sensitive to color changes between green and purple than to color changes between orange and cyan, the number of bits required for the Q component may be less than the number of bits for the I component during quantization, and the bandwidth required for transmission of the Q component can be narrower than that for transmission of the I component. The conversion from RGB to YIQ can be expressed as [Y I Q]^T = M[R G B]^T, where

$$M = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ 0.596 & -0.275 & -0.321 \\ 0.212 & -0.523 & 0.311 \end{bmatrix}$$

In addition, the values of Y, I, and Q can also be obtained by using the normalized R′, G′, B′ in the NTSC format (with γ-correction). The calculation is

$$\begin{bmatrix} Y \\ I \\ Q \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ 0.596 & -0.275 & -0.321 \\ 0.212 & -0.523 & 0.311 \end{bmatrix} \begin{bmatrix} R' \\ G' \\ B' \end{bmatrix}$$

The inverse transformation of R′, G′, and B′ calculated from Y, I, and Q is

$$\begin{bmatrix} R' \\ G' \\ B' \end{bmatrix} = \begin{bmatrix} 1.000 & 0.956 & 0.620 \\ 1.000 & -0.272 & -0.647 \\ 1.000 & -1.108 & 1.700 \end{bmatrix} \begin{bmatrix} Y \\ I \\ Q \end{bmatrix}$$
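A minimal sketch of these two conversions using the matrices above (NumPy; the function names are illustrative):

import numpy as np

RGB_TO_YIQ = np.array([[0.299,  0.587,  0.114],
                       [0.596, -0.275, -0.321],
                       [0.212, -0.523,  0.311]])
YIQ_TO_RGB = np.array([[1.000,  0.956,  0.620],
                       [1.000, -0.272, -0.647],
                       [1.000, -1.108,  1.700]])

def rgb_to_yiq(rgb):
    # rgb: array of shape (..., 3) with gamma-corrected R', G', B' in [0, 1].
    return rgb @ RGB_TO_YIQ.T

def yiq_to_rgb(yiq):
    return yiq @ YIQ_TO_RGB.T

# Round trip for pure red: (1, 0, 0) -> (0.299, 0.596, 0.212) -> back to about (1, 0, 0)
print(yiq_to_rgb(rgb_to_yiq(np.array([1.0, 0.0, 0.0]))))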
YUV model The color model used in the PAL and SECAM color TV formats, where Y represents the luminance component, and U and V are proportional to the color differences B – Y and R – Y, respectively, representing the chrominance components. The values of Y, U, and V can be calculated from the normalized R′, G′, and B′ (R′ = G′ = B′ = 1 corresponding to the reference white) in the PAL format (with γ-correction) by

$$\begin{bmatrix} Y \\ U \\ V \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.147 & -0.289 & 0.436 \\ 0.615 & -0.515 & -0.100 \end{bmatrix} \begin{bmatrix} R' \\ G' \\ B' \end{bmatrix}$$

The inverse transform of R′, G′, and B′ calculated from Y, U, and V is

$$\begin{bmatrix} R' \\ G' \\ B' \end{bmatrix} = \begin{bmatrix} 1.000 & 0.000 & 1.140 \\ 1.000 & -0.395 & -0.581 \\ 1.000 & 2.032 & 0.001 \end{bmatrix} \begin{bmatrix} Y \\ U \\ V \end{bmatrix}$$
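As a quick check of these matrices, for a pure red input (R′, G′, B′) = (1, 0, 0) the forward transform gives Y = 0.299, U = −0.147, and V = 0.615; substituting these into the inverse transform gives R′ = 0.299 + 1.140 × 0.615 ≈ 1.000, G′ = 0.299 + 0.395 × 0.147 − 0.581 × 0.615 ≈ 0.000, and B′ = 0.299 − 2.032 × 0.147 + 0.001 × 0.615 ≈ 0.000, recovering the original color up to rounding in the coefficients.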
NTSC color The range of colors that a typical TV can reproduce is much smaller than 24-bit color.
51.2
Visual Surveillance
Visual surveillance and intelligent transportation are inseparable from the support of various image technologies (Szeliski 2010; Zhang 2018a).
51.2.1 Surveillance Surveillance An application area of machine vision is mainly concerned with the monitoring of various activities in the scene. This involves background modeling and human motion analysis techniques. Visual surveillance Use a video camera to observe what is usually in a specific region, including people and vehicles. Applications that use visual surveillance include security region surveillance, flow analysis, and user behavior profiles.
Fig. 51.1 Automated visual surveillance schematic
Person surveillance A set of technologies for detecting, tracking, counting, and identifying people or their actions in CCTV for security reasons. For example, there are many automatic monitoring systems for parking lots, banks, airports, and other places. A typical system needs to be able to detect the presence of people and track their movements over time. It is also possible to identify people with the help of an existing face database and divide people's behavior into predefined categories (such as normal or abnormal). See anomalous behavior detection, face recognition, and face tracking. Wide-area scene analysis [WASA] Video-based surveillance using a camera network over a field of view larger than that of a single camera. Intelligent visual surveillance On the basis of intelligent camera control, further consideration of object recognition and scene interpretation is used to implement visual surveillance. Automated visual surveillance The popularization and application of automatic scene understanding technologies such as object detection, object tracking, and, more recently, behavior classification. Figure 51.1 shows a group of screens for tracking and detecting faces. Purposive-directed A guidance approach adopted by a vision system. For example, an active vision system can improve its performance on specific tasks by actively controlling camera parameters and adjusting the function of its working modules according to the requirements of the vision purpose. Bin-picking On an assembly line, having a robotic manipulator equipped with a vision sensor pick up specified workpieces (such as screws and nuts). For hand-eye robot systems equipped with vision sensors, the challenges come from image segmentation, object recognition in cluttered environments, and pose estimation.
51.2.2 Visual Inspection Inspection A general term for identifying defects by visual inspection of a target. Examples of commonly used inspection applications in practice include inspection of printed circuit boards to check for cracks or soldering failures, inspection of holes or discoloration in paper production, inspection of violations in food production, etc. Visual inspection A general term that originally referred to the use of the human visual system and now refers to the inspection of products and production processes by analyzing images or videos; it can be used for quality control of production lines. Machine inspection Automatically inspecting product quality or defects on the assembly line with vision-based equipment. Intruder detection An application of machine vision, often by analyzing video sequences, to detect people or creatures that do not normally appear in the scene. Obstacle detection The use of visual data to detect targets in front of an observer; it is commonly used in mobile robot applications. Crack detection Detection of cracks and defects on manufactured items using industrial visual surveillance. It is also commonly used for in situ monitoring of heavy engineering structures or leak detection in pipeline engineering production. Automatic visual inspection A monitoring process using imaging sensors, image processing, pattern recognition, or computer vision technology to measure and interpret imaged targets to determine if they are within acceptable tolerances. Automatic visual inspection systems often combine materials processing, lighting, and image acquisition technologies. Specially designed computer hardware can also be included when needed. In addition, the system also needs to be equipped with suitable image analysis algorithms. Inspection of PC boards See inspection. Currency verification Checking the authenticity of printed currency and coins, in which optical character recognition technology is used.
51.2.3 Visual Navigation Visual navigation The use of visual data (usually a video sequence) to steer vehicles, aircraft, ships, etc. through specific environments. With different assumptions, one can determine the distance to an obstacle, the time to contact, and the shape and identity of the targets seen. Depth sensors, including acoustic sensors (see acoustic sonar), are also commonly used. See visual servoing and visual localization. Global positioning system [GPS] A satellite system that determines the position of each GPS receiver in absolute earth reference coordinates. The accuracy of standard civilian GPS is basically on the order of meters. Higher accuracy can be obtained using differential GPS. Localization Often refers to spatial positioning. For the determination of a feature, target, or object position in an image, its coordinates or bounding box are often used for marking. In some cases, localization can also refer to temporal positioning, that is, determining an event in time, such as determining the discarding of an object in a video sequence. Occupancy grid A map construction technique mainly used for autonomous vehicle navigation (see the sketch after this group of entries). A grid is a set of squares or cubes representing a scene, and each cell can be marked according to whether the observer believes the corresponding scene region is empty (navigable) or full (occupied). Probability measures can also be used. Information from depth images, binocular stereo, or sonar sensors is often used to update the grid as the observer moves. Passive navigation In image engineering, the motion of a camera is determined from a time-varying image sequence. Visual servoing Using the observed motion in the image as feedback to guide a robot, for example, moving the robot so that its end effector aligns with and contacts the desired target. Typically, such systems have little or no knowledge of the camera positions, their connection to the robot, or the robot dynamics. The relevant parameters need to be learned while the robot is moving. Visual servoing allows the results of corrections to be applied while the robot is working. Such a system can adapt to unconventional situations, such as a robotic arm flexing under heavy load or a motor slipping, or situations in which relying solely on robot dynamics and kinematics does not provide sufficient accuracy to reliably accomplish the expected action. Because only image measurements are available, the inverse dynamics problem is more difficult than in traditional servoing.
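As an illustrative sketch of the occupancy grid idea described above (the cell size, grid extent, and simple counting update are assumptions made for the example, not part of the definition):

import numpy as np

class OccupancyGrid:
    # 2-D grid of cells; each cell accumulates evidence of being occupied.
    def __init__(self, width_m, height_m, cell_m):
        self.cell_m = cell_m
        self.hits = np.zeros((int(height_m / cell_m), int(width_m / cell_m)))
        self.visits = np.zeros_like(self.hits)

    def update(self, x_m, y_m, occupied):
        # Fold one depth/sonar measurement at world position (x, y) into the grid.
        r, c = int(y_m / self.cell_m), int(x_m / self.cell_m)
        self.visits[r, c] += 1
        if occupied:
            self.hits[r, c] += 1

    def occupancy(self):
        # Estimated probability that each cell is occupied (0.5 where unobserved).
        return np.where(self.visits > 0, self.hits / np.maximum(self.visits, 1), 0.5)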
Real-time processing Any calculation made within a defined time in a given process. For example, in visual servoing, the tracking system provides position data to the control algorithm to generate a control signal; if the control signal is generated too slowly, the entire system may become unstable. Different processes impose very different constraints on real-time processing. When processing video stream data, "real-time" refers to processing the data of the current frame completely before the next frame is collected (there can be several frames of lag overall, such as in a pipeline parallelism system). Adaptive visual servoing See visual servoing. Robot behavior The movement of robot equipment. Fully autonomous humanoid robot soccer A bold plan proposed during the Robot Soccer World Cup 2003 to build an autonomous humanoid robot team capable of playing football. It is hoped that such a team will defeat the (human) World Cup champion team by 2050, playing in accordance with the rules of FIFA (Fédération Internationale de Football Association). Weld seam tracking Using visual feedback to control a welding robot so that it keeps welding along the desired seam. Simultaneous localization and mapping [SLAM] A vision algorithm used primarily by mobile robots. It allows robots to gradually build and update geometric models as they explore unknown environments. Given the part of the model constructed so far, the robot can determine its position (self-localization) relative to the model. It can also be understood as follows: the robot starts to move from an unknown position in an unknown environment; during the movement, it not only performs its own positioning based on the existing map and the estimated position but also performs incremental mapping based on its own positioning, so as to achieve autonomous positioning and navigation of the robot. Concurrent mapping and localization [CML] Same as simultaneous localization and mapping. Visual odometry Same as simultaneous localization and mapping.
51.2.4 Traffic Autonomous vehicle Computer-controlled mobile robot. People only need to operate at very high levels, such as indicating the ultimate goal of work (the end of a journey). The autonomous navigation system used needs to complete visual tasks such as road detection, selfpositioning, landmark point positioning, and obstacle detection, as well as robot tasks such as route planning and engine control. Automotive safety As an application of image technology, car safety mainly considers how to detect unexpected obstacles, such as pedestrians on the road, in autonomous driving. Generally used in situations where active vision technology such as radar or LIDAR does not work well. Vehicle detection An application example of the object recognition problem, where the task is to find the desired vehicle in the video image. Road structure analysis A class of technologies used to obtain road information from images. The image used can be a close-up image (such as an image of an asphalt road collected from a moving vehicle, used to automatically draw defects over long distances) or a remote sensing image (to analyze the geographic structure of the road network). Traffic analysis Analysis of video data of car traffic to identify license plates, detect accidents, find congestion, calculate flux, etc. Traffic sign recognition Starting from a single image or video clip, the identification or classification of road traffic information signs. Detection of the traffic sign is often a basic initial step. This capability is needed for autonomous vehicles. Vehicle license plate analysis The location of the vehicle license plate is determined in the video image, and the characters on it are identified. License plate recognition An application of computer vision or image technique whose goal is to identify a vehicle’s number plate from image data. The image data is often collected by a camera placed at a location where the vehicle is decelerating (such as a toll booth on a highway). Number plate recognition See license plate recognition.
51.3
Other Applications
In recent years, image technology has been applied in more and more fields (Davies 2012; Gonzalez and Woods 2018; Goshtasby 2005; Pratt 2007; Russ and Neal 2016; Zhang 2018a).
51.3.1 Document and OCR Optical character recognition [OCR] Automatic recognition of printed and written characters by computer. The concept of optical character recognition was proposed more than 80 years ago. As a high-speed, automatic information entry method, optical character recognition has been widely used in office automation, news publishing, machine translation, and other fields. According to the characteristics of the recognition object, it can be divided into print recognition and handwriting recognition. The basic recognition process includes character image input, image preprocessing, character segmentation and feature extraction, and character classification and recognition. A generic term for extracting text descriptions of letters from text images. Common professional applications include bank documents, handwritten digits, handwritten characters, cursive text, Chinese characters, English characters, Arabic characters, etc. Character segmentation In optical character recognition, a text image is divided into single characters, with one character corresponding to one region. Various image segmentation algorithms can be used here, and sometimes it is necessary to determine separate regions of interest for characters to make the segmentation more robust and accurate. In addition, it is often necessary to use mathematical morphology techniques for post-processing to connect the separated parts of the same character. Character recognition See optical character recognition. Printing character recognition An important domain in optical character recognition. The recognition objects are printed characters. Because the objects to be recognized are relatively standardized, such recognition has now reached a practical level. Ascender Refers to the seven lowercase letters b, d, f, h, k, l, and t. When typesetting, they are taller than other letters and sometimes need to be adjusted. Compare descender. Descender Refers to the five lowercase letters g, j, p, q, and y. When typesetting, they are lower than other letters and sometimes need to be adjusted. Compare ascender.
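A minimal sketch of the basic recognition process listed above, i.e., binarization, character segmentation, and classification; here classify_glyph stands in for any trained character classifier, and all names are illustrative:

import numpy as np
from scipy.ndimage import label, find_objects

def ocr_pipeline(gray, classify_glyph, threshold=128):
    # gray: 2-D array with dark text on a light background.
    binary = gray < threshold                 # preprocessing: simple binarization
    labeled, n = label(binary)                # character segmentation: connected components
    text = []
    for box in find_objects(labeled):         # one bounding box per candidate character
        glyph = binary[box]
        text.append(classify_glyph(glyph))    # classification and recognition step
    # Sorting the boxes into reading order is omitted for brevity.
    return "".join(text)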
Handwritten character recognition Automatic recognition of handwritten characters. Slant normalization A class of algorithms commonly used for handwritten character recognition that converts slanted cursive characters to vertical characters. See optical character recognition. Handwriting recognition An important branch of optical character recognition. Recognition objects are handwritten characters, which vary a lot. Considering the real-time nature and the use of writing process information, it can be further divided into online handwriting recognition and offline handwriting recognition. Online handwriting recognition A handwriting recognition method that recognizes text being written on the tablet while writing. Information such as the stroke order and orientation of the writing can be used for recognition. Compare offline handwriting recognition. Offline handwriting recognition Application technology for offline recognition of recorded handwritten text. Compare online handwriting recognition. Cursive script recognition A branch of optical character recognition. In which the characters to be automatically recognized and classified are handwriting (also known as ligature). Postal code analysis A set of image analysis techniques for understanding handwritten or printed postal codes, including letters and numbers. See handwritten character recognition and optical character recognition. Zip code analysis See postal code analysis. Document analysis Generic term describing operations that retrieve information from a document, for example, character recognition and document mosaicing. Fit text to curve Align the text with the alignment curve, that is, arrange the text into the required shape, as shown in Fig. 51.2. Fit text to path Same as fit text to curve. Fig. 51.2 Arrange the text into the desired shape
Document retrieval Identification of documents in scanned document libraries based on certain criteria. Character verification The process of confirming that the printed or displayed characters can be read by humans within a certain tolerance range. Can be used in applications such as marking. Handwriting verification Identification and verification of the handwriting style corresponding to a specific individual.
51.3.2 Medical Images Macroscopic image analysis In medical image research, the analysis of images of human organs (such as the brain or heart). Compare microscopic image analysis. Cardiac image analysis Techniques and processes involved in developing 3-D vision algorithms that track heart motion from magnetic resonance images and echocardiograms. Angiography A method of imaging blood vessels using X-ray opaque stains (contrast agents), which need to be introduced into the blood vessels during the examination. It also refers to the manipulation and study of the images thus obtained. Digital subtraction angiography A basic technique in medical image processing that detects, monitors, and visualizes blood vessels by subtracting the background image from the target image. Typically, the blood vessels are seen more clearly by using X-ray contrast media. See medical image registration. Coronary angiography A set of image processing techniques (often based on X-ray data) for visualizing and monitoring the blood vessels surrounding the heart (coronary arteries). See angiography. Color Doppler A method for noninvasive blood flow imaging that shows the heart and other body parts in 2-D echocardiographic images. Blood flow in different directions is displayed in different colors. Endoscope An optical instrument or device used to visually inspect the internal conditions of various body organs. See fiberscope.
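The core of the digital subtraction angiography entry above is literally an image subtraction; a minimal sketch follows (the array names are illustrative):

import numpy as np

def digital_subtraction_angiogram(target_image, background_image):
    # Subtract the pre-contrast background image from the contrast-filled
    # target image; the blood vessels remain as the dominant differences.
    # Both inputs are 2-D arrays of the same size.
    return target_image.astype(float) - background_image.astype(float)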
Capsule endoscopy An endoscopic technique for examining a patient's digestive tract for medical diagnosis. A small untethered camera similar in size and shape to a pill is used. The patient swallows the capsule, which records images of the gastrointestinal tract, including images of the small intestine (difficult to obtain with traditional tethered endoscopy). The images are transmitted from the capsule using wireless data transmission technology, and the capsule is naturally discharged from the patient and recovered. Virtual bronchoscopy An alternative to an imaging system with an endoscope, which can generate a virtual view of the lung system using techniques such as magnetic resonance imaging. Mammogram analysis X-ray examination of women's breasts, typically done to detect possible signs of cancer development. Breast scan analysis See mammogram analysis. Cell microscopic analysis An automated process/procedure for the detection and analysis of different cell types in images acquired with a microscope vision system. Common examples include blood cell analysis and precancerous cell analysis. Chromosome analysis Vision technology that uses microscope images to diagnose genetic problems such as chromosomal disorders. Human chromosomes are usually arranged in 23 pairs and displayed in a standard diagram.
51.3.3 Remote Sensing Remote sensing Any kind of detection technology that obtains the electromagnetic characteristics of an observed scene without directly contacting it. It is a technology that uses electromagnetic radiation or other radiation to detect and quantify the physical, chemical, and biological characteristics of an object without the sensor touching the object. At present, it mainly refers to the use of sensors on satellites or airplanes to observe the ground and the atmosphere from above the ground, to collect and record relevant information about objects and phenomena in the earth's environment, and to display them as images, spectra, colors, forms, etc. It does not directly contact the target (object or phenomenon) but comprehensively detects its nature, shape, and laws of change. The collection (using airplanes or satellites), analysis, and understanding of image information primarily derived from the earth's surface. Commonly used in agricultural, forest,
meteorological and military applications. See multispectral analysis, multispectral image, and geographic information system. Remote sensing system Realization of remote sensing ingestion system. Can be divided into five types: (1) camera system, UV photography, full-color photography (black and white), color photography, color infrared photography (false color), infrared photography, multispectral photography, etc.; (2) electronic optical system, TV system, multiband scanner, (near, middle, and far), and infrared scanner; (3) microwave and radar systems; (4) electromagnetic wave nonimaging system, γ-rays, radiation measurement and spectral measurement, and laser; and (5) gravity system and magnetic system. Also included are related delivery, transmission, or communication modules. Overproportion In the spectral decomposition of remote sensing images, the problem of determining the mixing ratio of different substances in the ground region corresponding to pixels. Probe data Data obtained by means of remote sensing or parameters (such as location) describing remote sensing. Telepresence Interact with the target through a visual or robotic device at a distance from the user. Examples include driving a remote camera used by a user’s helmet display, transmitting audio remotely, and using local controls to manipulate remote machinery and tactile feedback from the remote to the surrounding environment. Systeme Probatoire d’Observation de la Terra [SPOT] Spaceborne Earth Observation System developed by the French Space Research Centre. Also refers to a series of satellites launched by France, which can provide the earth with satellite imagery. For example, SPOT-5, launched in May 2002, provides an updated global image every 26 days. SPOT imaging spectra includes 500–590 nm, 610–680 nm, and 790–890 nm. The spatial resolution of the panchromatic band is 10 m, and the amplitude resolution is 6 bits; the spatial resolution of the multispectrum is 20 m, and the amplitude resolution is 8 bits. Satellite Probatoire d’Observation de la Terra [SPOT] Launched by France in 1986, the satellite has two imaging systems. One imaging system is used for three visible bands with a resolution of 20 m; the other imaging system produces a panchromatic image with a resolution of 10 m. Each system includes two units, so stereo image pairs can be obtained. Topographic map 1. A representation that displays both geometric shapes and identified features (such as man-made structures) on a land map. Height is usually represented as an elevation profile (contour). This form of map can also be used to represent the results of image analysis.
Fig. 51.3 Digital elevation map schematic
2. In the nervous system, the layout of neurons, in which the response characteristics of neurons change regularly with their spatial positions in the neural region. Digital terrain map Same as digital elevation map. Digital elevation map A sampled and quantized map in which each point corresponds to a height above the reference ground plane. A simple schematic is shown in Fig. 51.3; it resembles a series of pulses (elevations) on a regular grid (spatial coordinates). Terrain analysis Analysis and interpretation of data representing the shape of the planet's surface. Typical data structures are digital elevation maps or triangulated irregular networks (TINs). Digital terrain model [DTM] A numerical matrix representing terrain height as a function of geographic or spatial location.
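As a concrete illustration of the digital elevation map and terrain analysis entries above, the following sketch stores a DEM as a 2-D array of heights on a regular grid and derives a slope map by finite differences. The height values and the 30 m grid spacing are hypothetical, chosen only for the example.

import numpy as np

# Hypothetical DEM: heights (in meters) sampled on a regular grid,
# with an assumed ground spacing of 30 m between neighboring samples.
dem = np.array([[10.0, 12.0, 15.0, 18.0],
                [11.0, 13.0, 17.0, 21.0],
                [12.0, 15.0, 20.0, 26.0],
                [14.0, 18.0, 24.0, 31.0]])
cell_size = 30.0  # meters per grid step (illustrative value)

# Height gradients along rows (y) and columns (x) by finite differences.
dz_dy, dz_dx = np.gradient(dem, cell_size)

# Slope angle in degrees at every grid point.
slope_deg = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))
print(slope_deg.round(1))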
51.3.4 Various Applications Image editing program Computer software for scanning, editing, processing, manipulating, saving, and outputting images. Based on the pixel representation (raster representation) of the image, Photoshop is a typical representative. Area composition A term for desktop publishing. Arrange text, images, etc. on one page for plate making.
Fig. 51.4 Barcode example
Barcode reading Methods and algorithms for detecting, imaging, and interpreting black parallel lines of different widths. The arrangement of these parallel lines encodes detailed information about a product or other object. Barcodes themselves follow many coding standards and arrangements. Figure 51.4 gives an example. Map analysis Analysis of an image of a map (e.g., obtained with a flatbed scanner) to extract a symbolic description of the ground depicted by the map. Owing to the widespread use of digital map libraries today, such analysis is now rarely needed. Digital heritage Archeology and monument protection using digital technology. In applications related to digital heritage, detailed 3-D models of cultural relics are first acquired and then used for tasks such as analysis, preservation, restoration, and art reproduction. Fine art emulation The use of image processing methods to make digital photos look as if they were produced by certain painting or artistic methods, especially in a similar artistic style. CAVE Recursive abbreviation for cave audio and video experience. It is an immersive audio and video environment with multiple screens surrounding the observer, forming a complete 3-D surround cave environment. By tracking the head, the display is updated to follow the observer's head movement. Natural material recognition Identification of natural materials, which are naturally produced, as opposed to man-made materials. Fax A communication method that converts the pictures (text, drawings) on paper into a special format with a fax machine or computer and then transfers them over a telephone line or network to another computer or fax machine. The receiving computer often cannot directly edit the content, but can only view and print it as a picture. Facsimile telegraphy A technology that converts still images (photos, maps, drawings, diagrams, etc.) or text into scanned light or electrical signals through scanning and photoelectric conversion, transmits them over a transmission line or radio waves, and reproduces them at a distance. The transmitting device consists of a
scanner, photoelectric conversion, DC amplification, shaping, frequency modulator, carrier amplification, low-pass filter, and output laser or output circuit. The receiving device consists of a photoelectric converter or input circuit, a high-pass filter, a limiting amplifier, a low-pass filter, a frequency discriminator, a preamplifier, a recording power amplifier, and a scanning recorder. Phototelegraphy Same as facsimile telegraphy.
Chapter 52
International Organizations and Standards
The in-depth research of image engineering and the rapid development and wide application of image technology have promoted the formulation of international standards in the field of image. This work is mainly led by some specialized international standardization organizations (see the website).
52.1 Organizations
Many international organizations have participated in the formulation and promotion of international standards in the field of image engineering (see their official website).
52.1.1 International Organizations International Standardization Organization [ISO] An organization responsible for setting international standards. ISO works with the ITU to develop the JPEG and MPEG series of international standards. International Electrotechnical Commission [IEC] An organization mainly engaged in international standardization in the field of electrical and electronic technology. Founded in 1906. Participated in the development of JPEG and MPEG series of international standards. The two organizations, IEC and ISO, are legally independent. IEC is responsible for international standardization in the electrical and electronics fields, and ISO is responsible for other domains.
International Telecommunication Union [ITU] A United Nations agency. Responsible for developing and recommending international standards for data communication and image compression. Formerly known as the Consultative Committee for International Telephone and Telegraph. Consultative Committee for International Telephone and Telegraph [CCITT] The predecessor of the International Telecommunication Union. International Telecommunication Union, Telecommunication Standardization Sector [ITU-T] An organization specialized in the development of international standards related to telecommunications under the management of the International Telecommunication Union. Specifically responsible for formulating and recommending international standards for data communication and image compression. International Radio Consultative Committee One of the permanent bodies of the International Telecommunication Union, specializing in the development of international standards related to radiocommunication and technical business issues. Founded in 1927, it merged with the International Frequency Registration Board (IFRB) on March 1, 1993, and became the Radiocommunication Sector of the International Telecommunication Union (ITU), or ITU-R for short. CCIR Abbreviation of Comité Consultatif International des Radiocommunications (in French); same as International Radio Consultative Committee. International Frequency Registration Board [IFRB] One of the permanent bodies of the International Telecommunication Union; its purpose is to help all member states use radio communication channels rationally and to minimize harmful interference. It also makes technical recommendations to member states to enable them to use as many wireless circuits as possible in particularly crowded frequency bands. It was established in 1947 and, since March 1, 1993, has been merged with the International Radio Consultative Committee to become today's Radiocommunication Sector of the International Telecommunication Union (ITU), referred to as ITU-R. International Color Consortium [ICC] An international organization set up by several companies to resolve color inconsistencies. It has developed an international standard for the color management system (CMS). Using the Profile Connection Space (PCS) as the intermediate color space for color space conversion, a device is characterized through the connection between its color space (RGB or CMYK) and the PCS, thereby achieving open management of color and making color transfer independent of specific color devices.
International Committee on Illumination Academic organization in the field of international lighting engineering. Abbreviated as CIE, derived from the initials of its French name, Commission Internationale de l'Éclairage. CIE See International Committee on Illumination. International Astronomical Union [IAU] An international nongovernmental organization of professional astronomers from around the world. Founded in 1919. Society of Photo-Optical Instrumentation Engineers [SPIE] An international optical engineering society, also known as the International Society for Optics and Photonics. Founded in 1955, it now includes more than 250,000 members from 173 countries. Institute of Electrical and Electronics Engineers [IEEE] An international society of electronic technology and information science engineers. Formally established in 1963 (one of its predecessors was created in 1884). It publishes journals, magazines, periodicals, proceedings, books, and standards in many related fields. Society of Motion Picture and Television Engineers [SMPTE] Founded in 1916, with members in more than 85 countries, it is an international industry association in the fields of film, television, video, and multimedia. Copy Protection Technical Working Group [CPTWG] A working group composed of technicians and content protection experts in related industries; established in 1995, it is dedicated to DVD video copy control systems. Data Hiding Sub-Group [DHSG] A sub-working group on video watermarking under the Copy Protection Technical Working Group.
52.1.2 National Organizations National Institute of Standards and Technology [NIST] An organization directly under the US Department of Commerce, engaged in basic and applied research in physics, biology, and engineering, as well as research in measurement technology and methods, providing standards, standard reference data, and related services. Both the Face Recognition Technology program (FERET program) and the FERET database are derived from the institute.
National Television Standards Committee [NTSC] The national committee responsible for the development of television and video standards; also refers to the television standard (NTSC format) it has developed. National Television System Committee [NTSC] An agency responsible for the management and standardization of television in the United States. It also refers to an analog video standard developed by it, which targets color video. Its frame rate is 30 Hz, and each image has 525 lines; the image size is 640 × 480 pixels. It is also viewed as a signal recording system used to encode video data at approximately 60 video fields per second. Mainly used in the United States, Japan, and other countries. American Newspaper Publishers Association [ANPA] An association of the US newspaper industry that develops standards and specifications related to desktop publishing. ANPA color specifications have been accepted as the standard for newspaper printing. ANPA colors have also been integrated into most image editing programs. Electronic Industries Association [EIA] An American electronics industry standard-setting agency. Advanced Television Systems Committee [ATSC] A nonprofit international organization (also known as the US Advanced Television Business Advisory Board) focused on developing voluntary standards for digital television. Its organizational members come from industries such as broadcasting, broadcasting equipment, film, consumer electronics, computers, cable television, satellite, and semiconductors. The committee adopted a national standard for ATSC digital television on September 15, 1995, in which the source encoding uses MPEG-2 video compression and AC-3 audio compression; the channel encoding uses VSB modulation and provides two modes: a terrestrial broadcasting mode (8VSB) and a high data rate mode (16VSB). Audio Video Coding Standard Workgroup of China [AVS] China's expert group on audio and video standards. Formed by the former Ministry of Information Industry of the People's Republic of China, its full name is "Digital Audio and Video Codec Technical Standard Working Group." Its task is to formulate, develop, and revise standards for common technologies such as compression, decompression, processing, and representation of digital audio and video.
52.2 Image and Video Coding Standards
Most of the established international standards for images are related to the encoding of images and videos. (Gonzalez and Woods 2018; Zhang 2017a)
52.2.1 Binary Image Coding Standards International standard for images Standards related to images and image techniques, developed and recommended by international organizations. Joint Bi-level Imaging Group [JBIG] 1. Binary Image Joint Group: an organization that formulates international standards for binary image compression. It is composed of two organizations, ISO and CCITT. A series of international standards for binary image compression have been developed. 2. International standard for binary image coding: the international standard ITU-T T.82 (ISO/IEC 11544) for compression of still binary images. Developed by the Binary Image Joint Group in 1993. JBIG is also called JBIG-1 because JBIG-2 was introduced later. JBIG uses adaptive technology, applying adaptive templates and adaptive arithmetic coding to the images given by halftone output technology to improve performance. JBIG also implements progressive transmission and reconstruction through pyramid-like layered coding (from high resolution to low resolution) and layered decoding (from low resolution to high resolution). QM-coder The adaptive binary arithmetic encoder specified in the international standards for binary image coding and still image coding. The principle used is the same as arithmetic coding, but it applies to the case where the input symbol is a single bit; only fixed-precision integer arithmetic operations (addition, subtraction, and shifting) are used, and there is no multiplication (it is replaced by an approximation). JBIG-2 Improved version of JBIG. ITU-T T.88 (ISO/IEC 14492), formulated by the Joint Bi-level Imaging Group in 2000. Its characteristic is to use different encoding methods for different image content: symbol-based coding for text and halftone regions, and Huffman coding or arithmetic coding for other content regions. Supports lossy, lossless, and progressive encoding. G3 format An international image format suitable for facsimile images. CCITT's corresponding special group (Group 3) was responsible for its formulation, hence the name; also known as fax group 3. It uses nonadaptive, 1-D run-length coding technology. All current fax machines used on the public switched telephone network (PSTN) use the G3 format. G4 format An international image format suitable for facsimile images. CCITT's corresponding special group (Group 4) was responsible for its development, hence the name; also known as fax group 4. Supports 2-D run-length coding. Can be used for fax machines on digital networks such as ISDN.
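To make the run-length coding mentioned in the G3 and G4 entries concrete, here is a minimal sketch of 1-D run-length extraction from a binary scanline. It covers only the first stage of what a fax coder does; the actual G3 standard further maps the run lengths to modified Huffman code words, which is omitted here.

def run_lengths(scanline):
    # Return the run lengths of a binary scanline, starting (by fax
    # convention) with the first white run, which may have length zero
    # if the line begins with black.
    runs, current, count = [], 0, 0  # lines are assumed to start white (0)
    for pixel in scanline:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current, count = pixel, 1
    runs.append(count)
    return runs

line = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0]  # 0 = white, 1 = black
print(run_lengths(line))  # [3, 2, 4, 1, 2]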
52.2.2 Grayscale Image Coding Standards Joint Photographic Experts Group [JPEG] An agency that sets international standards for still (gray or color) image compression. It is composed of two organizations, ISO and CCITT. Several international standards for still (gray or color) image compression have been developed. JPEG standard An international standard for still (gray or color) image compression. Developed by the Joint Photographic Experts Group in 1994 and numbered ISO/IEC 10918. The JPEG standard actually defines three encoding systems:
1. A DCT-based lossy coding basic system, for most compression applications.
2. An extended or enhanced coding system based on a layered incremental mode, for high compression ratio, high accuracy, or progressive reconstruction applications.
3. A lossless system based on the DPCM method in predictive coding, for applications without distortion.
Joint Photographic Experts Group [JPEG] format A commonly used image file format suggested by the Joint Photographic Experts Group. The suffix is jpg. JPEG is an international standard for still image coding. In addition to lossless compression methods, the standard can save considerable space when using lossy compression methods. The JPEG standard only defines a standardized encoded data stream and does not specify the format of the image data file. The JPEG format can be regarded as a collective name for formats compatible with JPEG code streams. JPEG File Interchange Format [JFIF] A commonly used image file format. Uses the encoded data stream defined by the JPEG standard. A JFIF image is a JPEG image represented in grayscale, or in Y, Cb, Cr component color, and contains a JPEG-compatible file header. Lossless and near-lossless compression standard of JPEG [JPEG-LS] An international standard for still (gray or color) image compression developed by the Joint Photographic Experts Group in 2000. Numbered ISO/IEC 14495. This standard is divided into two parts: one part corresponds to lossless compression, and the other part corresponds to near-lossless compression (also known as quasi-lossless compression). This latter part is also called JPEG's near-lossless compression standard. No DCT or arithmetic coding is used in this standard. Only limited quantization is performed in near-lossless compression. With several existing neighbors of the current pixel used as context, encoding is performed using context prediction and run-length coding to achieve a balance between improving the compression rate and maintaining image quality. JPEG-NLS See lossless and near-lossless compression standard of JPEG.
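The DCT-based baseline system named in the JPEG standard entry can be illustrated with a small numerical sketch. The code below applies an orthonormal 2-D DCT to a single 8 × 8 block and quantizes the coefficients with one uniform step; real JPEG instead uses standard quantization tables, zigzag ordering, and entropy coding, all of which are omitted here.

import numpy as np

def dct_matrix(n=8):
    # Orthonormal type-II DCT matrix of size n x n.
    c = np.array([[np.cos(np.pi * (2 * j + 1) * i / (2 * n))
                   for j in range(n)] for i in range(n)]) * np.sqrt(2.0 / n)
    c[0, :] *= np.sqrt(0.5)
    return c

C = dct_matrix(8)
block = np.random.randint(0, 256, (8, 8)).astype(float) - 128  # level shift
coeff = C @ block @ C.T                 # forward 2-D DCT
step = 16.0                             # single illustrative quantization step
quantized = np.round(coeff / step)
reconstructed = C.T @ (quantized * step) @ C + 128  # inverse DCT + shift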
Motion JPEG A method of encoding motion video or television signals using JPEG technology. It operates independently on each frame, using only intraframe coding. Motion JPEG provides a quick way to access any frame in a video. JPEG XR A continuous-tone still image compression algorithm and file format developed by Microsoft Corporation. Originally called Windows Media Photo. An image codec that supports high dynamic range image coding, including lossless compression and lossy compression, and requires only integer operations. A file can contain multiple images and carries a ".jxr" extension. In addition, each image can be partially decoded. JPEG-2000 An international standard for still (gray or color) image compression. Developed by the Joint Photographic Experts Group in 2000. JPEG-based image compression techniques and algorithms are used, in which the DCT is replaced by two different wavelet decompositions, namely, the "CDF 9/7" and "CDF 5/3" wavelets. The JPEG-2000 standard covers both lossless compression and lossy compression. In addition to improving the compression quality of the image, many functions have been added, including progressive compression and transmission based on image quality, visual perception, and resolution; random access and processing of the code stream (different positions can be accessed quickly and easily); and the ability of the decoder to scale, rotate, and crop the image while decompressing. It also features an open structure and backward compatibility. MQ-coder A context-based adaptive binary arithmetic encoder. It has been used in the international standard JPEG-2000 and the international standard JBIG-2. Resolution scalability A feature option of the international standard JPEG-2000 that allows extraction of both original-resolution and lower-resolution images from the same source of compressed image data. Flexible Image Transport System [FITS] An N-dimensional image data format developed by the International Astronomical Union in 1982 and used by NASA in 1990 to transmit astronomical images. Pixels are encoded as 8-, 16-, or 32-bit integers, or 32- or 64-bit floating-point numbers.
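The JPEG-2000 entry above mentions the reversible CDF 5/3 wavelet. A minimal 1-D lifting implementation of its analysis step is sketched below; the boundary handling simply repeats edge samples, whereas the standard prescribes symmetric extension, and the 2-D tiling and multi-level decomposition are omitted.

def cdf53_forward(x):
    # 1-D reversible CDF 5/3 lifting; returns (lowpass, highpass).
    # Expects an even-length integer sequence; edge samples are repeated
    # instead of the standard's symmetric extension.
    n = len(x)
    assert n % 2 == 0
    d = []  # predict step: high-pass (detail) coefficients
    for i in range(n // 2):
        left = x[2 * i]
        right = x[2 * i + 2] if 2 * i + 2 < n else x[2 * i]
        d.append(x[2 * i + 1] - (left + right) // 2)
    s = []  # update step: low-pass (approximation) coefficients
    for i in range(n // 2):
        d_prev = d[i - 1] if i > 0 else d[0]
        s.append(x[2 * i] + (d_prev + d[i] + 2) // 4)
    return s, d

lowpass, highpass = cdf53_forward([10, 12, 14, 13, 11, 9, 8, 8])
print(lowpass, highpass)  # [10, 14, 11, 8] [0, 1, 0, 0]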
52.2.3 Video Coding Standards: MPEG International standard for videos Standards related to video and video techniques, developed and recommended by international organizations.
Video compression standard A specification for video compression developed by international organizations. The more commonly used include the MPEG standards for video coding in consumer applications, such as MPEG-1, MPEG-2, and MPEG-4, and the ITU-T standards for video format standardization in telecommunications applications, such as H.261, H.263, H.264, and H.265. Joint Video Team [JVT] The team jointly formed in 2001 by the ITU-T video coding experts group and the ISO/IEC Moving Picture Experts Group (MPEG) to develop an advanced video coding international standard. Moving Pictures Experts Group [MPEG] A group that develops digital audio and video standards (for VCD, DVD, digital TV, etc.). It also represents the expert group formed by ISO and CCITT to develop international standards for moving images (international standards for videos). At present, MPEG-1, MPEG-2, MPEG-4, MPEG-7, and MPEG-21 have been formulated. MPEG is also commonly used to refer to media stored in the MPEG-1 format. MPEG-1 An international standard for moving (gray or color) image compression developed by the Moving Pictures Experts Group in 1993. Numbered ISO/IEC 11172. It is the first standard for mixed encoding of "high-quality" video and audio. It is essentially an entertainment-quality video compression standard. It is mainly used for the storage and extraction of compressed video data on digital media. It has been widely used in CD-ROM and video compact disc (VCD). Video presentation units [VPUs] Uncoded digital images received by a video coder, as specified in the international standard MPEG-1. Audio presentation unit In the international standard for images MPEG-1, the uncoded audio samples received by the audio digitizer. Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbps Same as MPEG-1. Video compact disc [VCD] Corresponds to a full-motion, full-screen video standard. The video uses MPEG-1 compression encoding, with a resolution of 352 × 240 pixels for the NTSC format and 352 × 288 pixels for the PAL format. MPEG-2 An international standard for moving (gray or color) image compression developed by the ISO Moving Picture Experts Group in 1994. Numbered ISO/IEC 13818, it deals with the
transmission of studio-quality audio and video. It covers four levels of video resolution and is a standard for compressed video transmission, suitable for the range from ordinary TV (code rate 5 ~ 10 Mbps) to high-definition television (code rate 30 ~ 40 Mbps, the original MPEG-3 content). The rate has been later extended up to 100 Mbps. A typical application is digital video disc (DVD). Generic coding of moving pictures and audio Same as MPEG-2. Program stream A set of audio, video, and data elements with a common time relationship specified in the international standard MPEG-2. Generally used for sending, storing, and playing. Transport streams In the international standard MPEG-2, a set of program (audio, video, and data) streams or elementary streams. MPEG-4 An international standard for moving (gray or color) image compression developed by the ISO Moving Picture Experts Group (MPEG). The first version was developed in 1999, and the second version was developed in 2000, numbered ISO/IEC 14496. The original focus is similar to H.263 applications (up to 64 Kbps for very low bit rate channels). Later it expanded to include a large number of multimedia applications, such as applications on the network. It is mainly used for compressed video transmission. It is suitable for the transmission of dynamic images (frame rate can be lower than 25 frames or 30 frames per second) on narrow bandwidth (generally less than 64 Kbps) communication lines for application of low bit rate image compression. Its overall goal is to uniformly and effectively encode a variety of audio and video (AV), including still images, sequence images, computer graphics, 3D models, animations, languages, sounds, etc. Because the processing object is relatively wide, MPEG-4 is also called the international standard for low-rate audio, video, and multimedia communications. Coding of audio-visual object Same as MPEG-4. MPEG-4 coding model Coding model in object-based multimedia compression standards. One of the measurement units used is the facial definition parameter. Facial definition parameters [FDP] In the international standard MPEG-4, a set of markers describing the smallest motion that can be felt on a face.
52.2.4 Video Coding Standards: H.26x H.261 An international standard for motion (gray or color) image compression developed by CCITT in 1990. Mainly suitable for applications such as video conferences and video phones. It is also called the p × 64 standard (p = 1, 2, ..., 30), because its code stream can be 64, 128, ..., 1920 Kbps. It allows transmission of motion video over a T1 line (with a bandwidth of 1.544 Mbps) with a delay of less than 150 ms. In H.261, hybrid coding and hybrid decoding were first adopted, which greatly influenced some subsequent international standards for image compression. H.263 An international standard for motion (gray or color) image compression developed by ITU-T in 1995. In 1998 and 2000, ITU-T introduced H.263+ and H.263++, respectively, which further reduced the bit rate, improved the encoding quality, and expanded the functions. A video compression standard that uses a very low bit rate compression format. Because it was originally designed to support video conferencing, it can support several video frame sizes such as 128 × 96, 176 × 144, 352 × 288, 704 × 576, and 1408 × 1152. It is suitable for PSTN and mobile communication networks. Also widely used in IP video communications. H.264/AVC An international standard for motion (gray or color) image compression developed by the Joint Video Team in 2003. Numbered ISO/IEC 14496-10 (AVC) and ITU-T H.264. The goal is to improve compression efficiency while providing a network-friendly video representation that supports both "conversational" (such as videophone) and "non-conversational" (such as broadcast or streaming) video applications. Also known as Part 10 of MPEG-4, it is a flexible standard widely used for high-definition video compression. Compared with H.263, it adds many new features and provides a lower bit rate and more efficient compression flexibility, which can be used in a wide range of network environments. It consists of two main layers: an independently encoded Video Coding Layer (VCL) and a Network Abstraction Layer (NAL) that formats the video and adds header information for transmission. Adaptive in-loop de-blocking filter One of the main technologies adopted by the international standard H.264/AVC in coding to eliminate the block distortion caused by block-based coding. Context-adaptive binary arithmetic coding [CABAC] An optional entropy coding method in the video international standard H.264. The coding performance is better than that of context-adaptive variable-length coding, but the computational complexity is also higher.
Context-adaptive variable-length coding [CAVLC] An entropy coding method used in the video international standard H.264. It is mainly used to encode luminance and chrominance residual transform and quantized coefficients. Compare context-adaptive binary arithmetic coding. H.265/HEVC An international standard for motion (gray or color) image compression developed by the Joint Video Team in 2013. Numbered ISO/IEC 23008-2 HEVC and ITU-T H.265. The compression efficiency is about double that of H.264/AVC. High Efficiency Video Coding [HEVC] Same as H.265. Coding unit [CU] One of the three basic units defined by the international standard H.265/HEVC. It is also the smallest unit used for coding in H.265/HEVC, with a minimum size of 8 × 8. In the (image frame) prediction mode, some coding units are converted into predict units. When transforming the image residuals, the coding unit will be transformed into a transform unit for DCT transformation and quantization. Coding tree unit [CTU] An operating unit obtained by dividing images in the international standard H.265/HEVC. According to different coding settings, the coding tree unit may be composed of multiple coding units. For example, a coding tree unit can be converted into coding units by recursive decomposition through a quadtree structure. Coding tree block [CTB] The pixel block corresponding to the coding tree unit. Predict unit [PU] One of the three basic units defined by the international standard H.265/HEVC. In the image frame prediction mode, it is converted from some coding units. Predict block [PB] The pixel block corresponding to the predict unit. Transform unit [TU] One of the three basic units defined by the international standard H.265/HEVC. When converting the image residual, it is converted by the coding unit. Transform block [TB] The pixel block corresponding to the transform unit defined by the international standard H.265/HEVC. Low-complexity entropy coding [LCEC] An entropy coding method used in the international standard H.265/HEVC. It is an improvement on context-adaptive variable-length coding in the international standard H.264/AVC.
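The coding tree unit entry above describes recursive quadtree decomposition of a CTU into coding units. The toy sketch below splits a 64 × 64 block down to an 8 × 8 minimum CU size; the variance threshold used as the split criterion is purely illustrative and only stands in for the rate-distortion decision a real HEVC encoder makes.

import numpy as np

MIN_CU = 8

def split_ctu(block, x=0, y=0, threshold=100.0):
    # Recursively split a square block into CU-like leaves and return
    # them as (x, y, size) tuples. The variance test below is only a
    # stand-in for a real encoder's rate-distortion decision.
    size = block.shape[0]
    if size <= MIN_CU or np.var(block) < threshold:
        return [(x, y, size)]
    h = size // 2
    leaves = []
    for dy in (0, h):
        for dx in (0, h):
            leaves += split_ctu(block[dy:dy + h, dx:dx + h],
                                x + dx, y + dy, threshold)
    return leaves

ctu = np.random.randint(0, 256, (64, 64)).astype(float)
print(len(split_ctu(ctu)), "coding units")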
High-complexity entropy coding [HCEC] An entropy coding method used in the international standard H.265/HEVC. It is an improvement on the context-adaptive binary arithmetic coding in the international standard H.264/AVC. H.266/VVC An international standard for motion (grayscale or color) image compression that the Joint Video Team has been developing since 2016. Also known as Versatile Video Coding (VVC). Proposals were collected in 2017, reference software was selected in 2018, and the standard is expected to be completed in 2020. Compared with H.265/HEVC, H.266/VVC adds ternary-tree and binary-tree splits to the block-partitioning structure and allows independent partitioning of luminance and chrominance in I-frames; it adds the LM prediction mode in intraframe prediction, together with wide-angle intra prediction and multi-reference-line intra prediction; and it adds an affine mode in inter prediction. Versatile Video Coding [VVC] Same as H.266/VVC.
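All the hybrid coders listed in this subsection, from H.261 through H.266/VVC, rely on motion-compensated inter prediction. The following full-search block-matching sketch finds the displacement of one block between two frames by minimizing the sum of absolute differences (SAD); real encoders add sub-pixel refinement, fast search patterns, and rate-distortion costs, which are omitted here.

import numpy as np

def best_motion_vector(cur, ref, bx, by, bsize=16, search=8):
    # Full-search block matching: return the (dy, dx) displacement into
    # the reference frame that minimizes the sum of absolute differences.
    block = cur[by:by + bsize, bx:bx + bsize].astype(int)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > ref.shape[0] or x + bsize > ref.shape[1]:
                continue
            sad = np.abs(block - ref[y:y + bsize, x:x + bsize].astype(int)).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv

ref = np.random.randint(0, 256, (64, 64))
cur = np.roll(ref, shift=(2, 3), axis=(0, 1))   # content moved down 2, right 3
print(best_motion_vector(cur, ref, bx=24, by=24))  # expected (-2, -3) for this shift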
52.2.5 Other Standards H.32x Generic term for ITU-T transmission protocols mainly used for video telephony and multimedia conferences. H.320 A communication standard developed by ITU-T in 1996. It is mainly used for video telephony and multimedia conferences on Narrowband Integrated Services Digital Network (N-ISDN). Narrowband Integrated Services Digital Network [N-ISDN] Integrated services digital network with a transmission rate below 2 Mbps. Compare Broadband Integrated Services Digital Network. H.321 A communication standard developed by ITU-T. It is mainly used for video telephony and multimedia conferences on Broadband Integrated Services Digital Network (B-ISDN), such as ATM network. Broadband Integrated Services Digital Network [B-ISDN] Integrated services digital network with a transmission rate of over 2 Mbps. Compare narrowband integrated services digital network. H.322 A communication standard originally proposed by ITU-T. The purpose is to perform video telephony and multimedia conferences on a local area network with
quality of service (QoS) guarantees. It was basically based on H.320; because its scope was covered by the later H.323, it lost its significance before it was finalized. H.323 A communication standard developed by ITU-T. It is mainly used for video telephony and multimedia conferences on packet-based networks (PBN) without quality of service (QoS) guarantees. Packet networks include Ethernet, signaling networks, and so on. H.323 versions 1–4 were released in 1996, 1998, 1999, and 2000, respectively. H.324 A relatively low-rate multimedia communication terminal standard developed by ITU-T. H.263 is a standard under it.
52.3 Public Systems and Databases
There are many public image technology research platforms and related databases, and most of them are available from the Internet (see the website).
52.3.1 Public Systems Information system An organic whole that connects computers and databases with communication means. Used to collect, store, transmit, classify, process, and retrieve information. Geographic information system [GIS] A system for managing and analyzing geographic data distributed in 2-D space. The data can be map-oriented, with qualitative attributes recording the region in vector form as points, lines, and planes, or image-oriented, with quantitative attributes in the form of grids of rectangular cells. Image information system An information system that uses computers to carry out visual perception and cognition and that integrates image acquisition, image processing, image analysis, and image understanding. In a broad sense, image information includes information perceived by the human visual system, image information obtained by various visual devices, other representations derived from this information, advanced expressions and behavior plans abstracted from the above information, knowledge that is closely related to this information, and the experience required to process it. The image information system can also store, transmit, operate on, process, and manage this information.
3DPO system A typical system for scene recognition and localization using depth images. Its full English name is the three-dimensional part orientation system. ACRONYM system A visual system developed by Brooks et al. for identifying 3-D images. It is a domain-independent, model-based, representative 3-D image information system that uses generalized cylinder primitives to represent both the stored models and the objects extracted from the scene. Users interactively generate object models with graphics, and the system automatically predicts the expected image features, finds candidate features, and interprets them. It can recognize and classify objects with existing models and extract 3-D information such as the shape, structure, 3-D position, and orientation of objects from monocular images. AMBLER An autonomous active vision system designed and developed by NASA and Carnegie Mellon University. Carried by a 12-legged robot, it is intended for planetary exploration. Fan system An object recognition system using depth images as input. KB Vision A knowledge-based image understanding environment that offers new concepts and methods to help verify and evaluate image understanding. Its three-layer model structure is described in terms of a low-level image matrix, a middle-level symbolic description, and a high-level scene. KHOROS An image processing development and software execution environment that integrates scientific data visualization with software development tools and application programming interfaces (APIs). It contains a large set of operators and is also very convenient for users to extend with their own applications. The accompanying system has an interactive development workspace where operators can be instantiated and connected in series by clicking and dragging. VISIONS system A 2-D image information system based on regions in an image. Its task is to give a correct interpretation of the regions in the image, and then of the scenery in the natural scene, under the guidance of certain prior knowledge. The name stands for visual integration by semantic interpretation of natural scenes. Bhanu system A 3-D scene analysis system for 3-D object shape matching. Hypothesis Predicted and Evaluated Recursively [HYPER] A typical vision system in which geometric connections obtained from a polygon model are used for recognition.
Spatial Correspondence, Evidential Reasoning, and Perceptual Organization [SCERPO] A vision system developed by David Lowe that demonstrates the ability to recognize complex polyhedral targets (shavers) in complex scenes. Nevatia and Binford system A typical machine vision system using generalized cylinder technology to describe and identify curved objects. Image Discrimination Enhancement Combination System [IDECS] A typical vision system. Yet Another Road Follower [YARF] A road tracker. Developed by Carnegie Mellon University for autonomous driving systems. This is a typical system in structured road recognition.
52.3.2 Public Databases Database An application that manages, saves, and displays large amounts of organically organized data. Scene-15 data set A database for studying scene image classification. It includes 15 types of indoor and outdoor scenes with 4,485 pictures in total: 241 of suburbs, 308 of urban areas, 360 of coasts, 374 of mountains, 328 of forests, 260 of highways, 292 of streets, 356 of buildings, 315 of shops, 311 of factories, 410 of outdoor scenes, 215 of offices, 289 of living rooms, 216 of bedrooms, and 210 of kitchens. The resolution of the images is around 300 × 300 pixels. UIUC-Sports data set A database for studying scene image classification. It includes 8 types of sports scenes with 1,579 pictures in total: 250 of rowing, 200 of badminton, 182 of polo, 137 of petanque, 190 of skateboarding, 236 of croquet, 190 of sailing, and 194 of rock climbing. Image resolutions range from 800 × 600 pixels to 2400 × 3600 pixels. ImageNet A large labeled image library that collects images for the roughly 80,000 nouns (synonym sets) of WordNet (a network that organizes English vocabulary by classification). As of April 2010, 500 to 1,000 carefully reviewed examples (labeled images) had been collected for 14,841 synonym sets, and approximately 14 million images had been collected by the end of 2012.
Caltech-101 data set A database for studying image classification of common objects. There are 102 types of objects in total. In addition to the background type, the remaining 101 types contain a total of 8,677 pictures. The number of pictures per type ranges from 31 to 800, and the resolution is about 300 × 200 pixels. In most pictures the object has been transformed and normalized, is located in the center, and is seen from the front, and the background is relatively simple. Caltech-256 data set A database for studying common scene image classification. It can be seen as an extension of the Caltech-101 data set and contains a total of 257 types of objects. In addition to the background category, the remaining 256 categories contain a total of 29,780 pictures. The number of pictures per type is at least 80, and the resolution is about 300 × 200 pixels. Compared to the Caltech-101 data set, the object in a Caltech-256 image is not necessarily located in the center of the frame and has a more complex background, and the internal variation within each class of images is also larger (the intra-class differences are greater). Mixed National Institute of Standards and Technology [MNIST] database A data set from the National Institute of Standards and Technology (NIST). The training set includes 60,000 digit samples, and the test set includes 10,000 digit samples. Each digit sample is a 28 × 28 grayscale image with the digit centered. Both sets consist of digits written by 250 different people, 50% of them high school students and 50% staff of the Census Bureau. NIST data set A collection of handwritten digits produced by the National Institute of Standards and Technology (NIST). The pixels in the raw data images are black and white (binary images). See the Mixed National Institute of Standards and Technology database. KITTI data sets A data set for researching and evaluating stereo matching algorithms, collected and managed by the Karlsruhe Institute of Technology (KIT), Germany. The depth maps in its training and test sets were obtained with radar mounted on a car and labeled. It currently includes KITTI 2012 and KITTI 2015. KITTI 2012 contains 194 pairs of training images and 195 pairs of test images, both in lossless PNG format. KITTI 2015 contains 200 pairs of training images and 200 pairs of test images (4 color images per scene), both in lossless PNG format. Compared with KITTI 2012, KITTI 2015 has more dynamic scenes. The Actor-Action Data set [A2D] A video database constructed to study situations in which multiple subjects initiate multiple actions. A total of seven subject categories are considered: adult, baby, cat, dog, bird, car, and ball; and nine action categories: walking, running, jumping, rolling, climbing, crawling, flying, eating, and no movement (not belonging to the first eight categories).
International Affective Picture System [IAPS] A database for researching and testing image classification problems based on sentiment semantics. It contains a total of 1,182 color pictures covering a wide range of object categories, which can be divided into ten categories from an emotional perspective: five positive (amusement, contentment, excitement, awe, and undifferentiated positive) and five negative (anger, sadness, disgust, fear, and undifferentiated negative). PASCAL Visual Object Classes [VOC] challenge An annual technical competition organized by PASCAL. The organizer provides test images; contestants design algorithms to classify the images and compete according to the accuracy and efficiency of their algorithms.
52.3.3 Face Databases HRL face database An image database of 10 people, each with at least 75 lighting changes, with a resolution of 193 × 254 pixels. It is the first database to systematically set lighting conditions. CMU hyperspectral face database A database of face images. Color images with a resolution of 640 × 480 pixels were collected from 54 people at wavelengths of 0.45–1.1 μm (at intervals of 10 nm). The face images were taken 1–5 times at each wavelength over 6 weeks. In addition, each person's image at each wavelength contains four types of illumination changes: lights at a 45° angle to the subject from the left side, a single front light, lights at a 45° angle to the subject from the right side, and all three lights turned on. AR database A database of facial images of 76 men and 60 women. For 118 of them, 26 frontal color images of 768 × 576 pixels were collected in two groups. These two groups of images were taken at an interval of 2 weeks. The images in each group are divided into 13 categories according to the conditions of capture: (1) normal expression + normal lighting, (2) smile + normal lighting, (3) angry + normal lighting, (4) shock + normal lighting, (5) normal expression + right side light, (6) normal expression + left side light, (7) normal expression + left side and right side lights, (8) eyes blocked + normal light, (9) eyes blocked + right side light, (10) eyes blocked + left side light, (11) mouth blocked + normal light, (12) mouth blocked + right side light, and (13) mouth blocked + left side light. The two letters A and R are derived from the initials of the surnames of the database's two first builders. CAS-PEAL database A face image database containing 99,594 color images with 360 × 480 pixels of 1040 people. The 1040 people include 595 Chinese men and 445 Chinese women. These images include changes in pose, expressions, attachments, lighting, and more. There
are three types of posture changes: (1) no up or down movement of the head, (2) head up, and (3) head down. Each category also includes nine head poses from left to right. There are five types of expression changes: smile, frown, surprise, closed eyes, and open mouth. Attachment changes include six cases, with three kinds of glasses and three kinds of hats. Illumination changes consist of 15 types of light, with fluorescent light sources placed on 3 platforms at elevation angles of –45°, 0°, and 45° to the face and at azimuth angles of –90°, –45°, 0°, 45°, and 90° to the face. FERET database A face database consisting of 14,051 grayscale images of 256 × 384 pixels of 1199 people, constructed by the Face Recognition Technology program. There are 2 expression changes, 2 lighting changes, 15 pose changes, and 2 shooting time changes in these images. The two kinds of expression changes are normal expression pictures and smile pictures; the two kinds of lighting changes include differences in camera and lighting conditions. The 15 posture changes are divided into two groups: one group has the head rotating left and right between –60° and 60°, with 9 variations, namely –60°, –40°, –25°, –15°, 0°, 15°, 25°, 40°, and 60°; the other group has the head rotating left and right between –90° and 90°, with 6 variations, namely –90°, –67.5°, –22.5°, 22.5°, 67.5°, and 90°. Finally, the two types of shooting time changes refer to shooting time intervals of 0 to 1031 days (an average of 251 days). In addition, the FERET database specifies three data sets, training, gallery, and probe, and provides a standardized test protocol, which greatly facilitates the comparison of different algorithms (one can compare directly with previously reported results without repeating previous experiments). KFDB database A 640 × 480 resolution color image database containing 1000 Korean faces. These images have three kinds of changes: pose, lighting, and expression. Each person's posture changes are divided into three categories: (1) the hair falls naturally over the forehead and no glasses are worn; (2) the hair is combed back from the forehead; and (3) glasses are worn. Each type of image contains rotations of the face from left to right in seven directions: –45°, –30°, –15°, 0°, 15°, 30°, and 45°. Each person's lighting changes are divided into two categories: (1) using a fluorescent lamp as the light source and (2) using an incandescent lamp as the light source. Each type of image contains eight types of lighting changes, formed by eight light sources placed at equal intervals of 45° on a circle centered on the center of the face. Everyone's expression changes include the natural expression, happiness, surprise, anger, and blinking (shock). MIT database A database of 433 grayscale images with 120 × 128 pixels of 16 people. These images contain three changes in lighting conditions, three scale changes, and three head rotation changes. The three lighting conditions are frontal lighting, lighting at 45° to the face, and lighting at 90° to the face. The three head rotations are head up, head left, and head right.
ORL database A database of 400 grayscale images with 92 × 112 pixels of 40 people. Ten different images were collected from each person. All images have similar dark backgrounds. Different images of the same person are taken at different times, with different lighting, different head positions (upward, downward, left, and right), different facial expressions (open/closed eyes, laugh/serious), and different face details (with/without glasses), but usually these changes do not all appear in the ten images of the same person at the same time. PIE database A database of 41,368 images with 640 × 486 pixels of 68 people. These images contain rich pose, lighting, and expression changes. In terms of posture changes, under the same lighting conditions, each person was photographed in 13 pictures, including 9 pictures taken with the head rotated from about –90° to 90° and 4 pictures with the head tilted down, tilted up, tilted up left, and tilted up right. As far as lighting changes are concerned, there are two types of background lighting: one with the backlight turned on and the other with it turned off. Under each background lighting and pose, 21 images were taken of each person, including 8 images with the flash placed in the same plane as the center of the face and positioned between –90° and 90°, and 13 images taken with the flash placed above (or diagonally above) the person's face at an elevation of approximately –67.5° to 67.5°. The expression changes in each person's images are normal expressions, smiles, and blinks (shocks). UMIST database A database of 564 gray images with 220 × 220 pixels of 20 people. Dozens of images were collected from each person. The main feature of this database is that each person's images cover the continuous pose change of the head from one side to the front. The user can use the mirror images of these images to synthesize the continuous pose change of the head from the other side to the front. YALE database A database of 165 gray face images with 100 × 100 pixels of 15 people. Eleven different images were collected from each person. The characteristics of these 11 images are the following: (1) front light exposure; (2) wearing glasses; (3) expression is happy; (4) left light exposure; (5) no glasses; (6) neutral expression; (7) right side light exposure; (8) expression is sad; (9) expression is sleepy; (10) expression is surprised; and (11) blinking. YALE database B A database containing 5760 gray face images with 640 × 480 pixels of 10 people. Each person was photographed in 585 images in 9 poses and 65 lighting conditions while maintaining a natural expression, unobstructed and under constant background lighting. The nine posture changes include head rotation changes at nine different angles to one side, of which one image is a frontal image, five images are rotated by 12° up, down, left, upper left, and lower left around the optical axis of the camera, respectively, and three images are rotated up left, left, and bottom left
by 24 around the camera’s optical axis. The user can use the mirror image of these nine pictures as the nine pose pictures of the head turning to the other side. The 65 lighting conditions consist of the flash firing separately at 65 different positions when shooting.
52.4 Other Standards
There are many standards developed in different fields and for different purposes, and the number is still increasing (ISO/IEC 2001; Zhang 2007, 2017c).
52.4.1 International System of Units International System of Units A simple and scientific practical unit system created by the International Metrology Committee. Its symbol is "SI." There are currently seven basic units (SI units): meters (m), kilograms (kg), seconds (s), amperes (A), Kelvin (K), candela (cd), and moles (mol). Le Système International d'Unités Same as International System of Units. SI unit Units of measurement adopted by the International Metrology Committee. The name comes from Le Système International d'Unités. lx SI unit of illuminance. 1 lux is the luminous flux (lm) per unit area (m²) on a surface. lm SI unit of luminous flux. Candela [cd] The basic unit of the International System of Units (SI) for luminous intensity. 1 candela means that a luminous flux of 1 lm is radiated into a unit solid angle; it is the luminous intensity of a specific light source in a given direction. The light source emits monochromatic radiation with a frequency of 540.0154 × 10¹² Hz, and in this direction the radiation intensity is (1/683) W/sr. It is the luminous power emitted by a point light source in a specific direction within each unit solid angle. The contribution of each wavelength is weighted by a standard photometric function defined as
L(E, D, A) = E·D²/A    (luminous function)
Among them, E represents the illuminance measurement; D represents the distance between the aperture plane and the photometer, with D² = rc² + rs² + d², where rc is the radius of the (finite) aperture, rs is the radius of the light source, and d is the distance between the light source and the aperture; and A represents the area of the source aperture. Boltzmann constant A physical constant related to temperature and energy, denoted k, with k = 1.3806505 × 10⁻²³ J/K, where J is the energy unit Joule and K is the thermodynamic temperature unit Kelvin. According to the resolution of the International Metrology Conference, from May 20, 2019, 1 Kelvin corresponds to a Boltzmann constant of 1.380649 × 10⁻²³ J/K. The physical meaning of the Boltzmann constant is that it is the coefficient relating the average translational kinetic energy Ek of a single gas molecule to the thermodynamic temperature T: Ek = (3/2)kT. Kelvin An international unit indicating temperature, recorded as K. Generally used to describe the characteristics of light and, accordingly, to investigate the lighting of an image. Planck's constant A physical constant related to ray radiation, denoted h, with h = 6.6262 × 10⁻³⁴ J·s, where J is the energy unit Joule and s is the time unit second. The energy of a light quantum at frequency ν is E = hν. Planck's law of blackbody radiation is defined as
B(λ) = (2hc²/λ⁵) · 1/[exp(hc/(kλT)) − 1]    [W·m⁻²·m⁻¹]
where T is the absolute temperature of the light source, λ is the wavelength of light, the speed of light c = 2.998 × 10⁸ m·s⁻¹, and the Boltzmann constant k = 1.381 × 10⁻²³ J·K⁻¹.
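As a numerical check of Planck's law as written above, the sketch below evaluates the spectral radiance of a blackbody at 2856 K, the color temperature quoted later for standard illuminant A; the wavelength grid and the rounded constants are chosen only for illustration.

import numpy as np

h = 6.626e-34   # Planck's constant, J*s
c = 2.998e8     # speed of light, m/s
k = 1.381e-23   # Boltzmann constant, J/K

def planck(wavelength_m, T):
    # Spectral radiance B(lambda) of a blackbody at temperature T (in K).
    return (2.0 * h * c ** 2 / wavelength_m ** 5 /
            (np.exp(h * c / (k * wavelength_m * T)) - 1.0))

wavelengths = np.linspace(380e-9, 780e-9, 5)  # a few visible wavelengths, meters
print(planck(wavelengths, 2856.0))            # rises toward the red end at 2856 K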
52.4.2 CIE Standards CIE standard photometric observer To measure and calculate photometric quantities uniformly, the International Committee on Illumination specifies, on the basis of many experimental studies, the tristimulus value functions describing the relative spectral response of human vision, that is, the relative spectral sensitivity (spectral luminous efficiency function).
Table 52.1 2 CIE standard observer’s primary color wavelength λ and corresponding relative power spectrum S Primary color R G B
λ(nm) 700.0 546.1 435.8
S 72.09 1.379 1.000
Standard observer An ideal person described by tristimulus values obtained by averaging the color matching functions of many people with normal color vision. 2° CIE standard observer Corresponds to the standard color stimulus at a viewing angle of 2°. It is defined by the single primary colors and the color matching used. The wavelengths and relative power spectra of the primary colors are shown in Table 52.1. CIE 1931 standard colorimetric observer The imaginary observer of chromaticity defined by the CIE 1931 standard chromatic system for a field-of-view angle of 1°~4°. The spectral tristimulus values X(λ), Y(λ), and Z(λ) are selected so that the spectral luminous efficiency V(λ) is taken as the Y(λ) value and used as the reference. CIE 1964 standard colorimetric observer The observer obtained when the field of view of the CIE 1931 standard chromatic system is expanded to more than 4°. The symbols of the spectral tristimulus values are X₁₀(λ), Y₁₀(λ), and Z₁₀(λ). The coordinate system is similar to the CIE 1931 standard chromaticity system. The subscript 10 indicates a large field of view of 10°. CIE 1931 standard colorimetric system The standard chromatic system developed by CIE in 1931. Also called the XYZ chromatic system. See CIE colorimetric system. CIE 1960 uniform color space [UCS] A uniform color space developed as an improvement of the CIE 1931 standard chromatic system. Its color space linearly transforms the CIE 1931 xy chromaticity diagram to improve the uniformity of the color space, but does not make the brightness uniform. It has been replaced by the CIE 1976 L*a*b* color space. CIE colorimetric functions Chromaticity functions derived from a set of reference stimuli (primary colors) X, Y, and Z. They are also called the CIE spectral tristimulus values, represented by x(λ), y(λ), and z(λ), respectively. They are the three color components (XYZ) of equal-energy monochromatic light stimuli in the CIE 1931 standard chromatic system.
Standard illuminant A standardized spectral distribution specified by CIE. To compare and measure colors, a set of color measurements must be given for each color based on the spectral distribution (or radiation) of the illumination. To reduce this variation, some spectral distributions have been standardized internationally. Note that there is no "standard light source" (physical light source): in the corresponding standard, different spectral distributions are listed without considering whether there are light sources (lamps) with those distributions. CIE recommends four types of illuminants. The standard illuminant A is representative of artificial lighting with tungsten light and corresponds to a Planckian (total) radiator at a color temperature of 2856 K. Illuminant B represents direct sunlight with a color temperature of 4874 K. The standard illuminant C represents average daylight with a color temperature of 6774 K and is used to represent "sunlight"; however, considering its lack of long-wave UV radiation, it was replaced by the standard illuminant D65. The standard illuminant D65 represents average daylight with an absolute color temperature of 6504 K; it is one of the average daylight illuminants defined for color temperatures from 4000 to 25000 K and is used as the representative of average daylight. The disadvantage of the standard illuminant D65 is that it cannot be reproduced with any technical light source. The spectral distributions of the various standard illuminants can be looked up in tables and libraries. Generally, the term also refers to having a standard spectrum published for a given situation (generally agreed internationally by CIE) that allows color researchers and manufacturers to use a recognized spectral distribution for luminaires. For example, incandescent light bulbs can be used as a standard illuminant. CIE standard sources The four artificial light sources A, B, C, and D specified by the International Committee on Illumination in order to realize the CIE standard illuminants. Photopic standard spectral luminous efficiency The spectral luminous efficiency function for photopic vision recommended by CIE in 1924 and adopted by the International Bureau of Metrology in 1931. Luminance efficiency A sensor-specific function V(λ) that determines the contribution of the light I(x, y, λ) observed at the sensor position (x, y) with wavelength λ to the brightness l(x, y) = ∫ I(λ)V(λ) dλ (see the numerical sketch below). Scotopic standard spectral luminous efficiency In the case of brightness levels below 10⁻³ cd/m², the spectral luminous efficiency contributed mainly by the rod cells of the human visual system.
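The luminance efficiency entry integrates a spectral power distribution against V(λ). The sketch below performs that integration numerically; the coarse V(λ) samples and the flat test spectrum are illustrative stand-ins for the tabulated CIE data.

import numpy as np

# Coarse, illustrative samples of the photopic efficiency V(lambda);
# real work should use the tabulated CIE values at 1 nm or 5 nm steps.
wl = np.array([400.0, 450.0, 500.0, 550.0, 600.0, 650.0, 700.0])  # nm
V = np.array([0.0004, 0.038, 0.323, 0.995, 0.631, 0.107, 0.0041])

def luminance(spectrum):
    # Approximate l = integral of I(lambda) * V(lambda) d(lambda)
    # by the trapezoidal rule; 'spectrum' gives I at the wavelengths in wl.
    return np.trapz(spectrum * V, wl)

flat = np.ones_like(wl)   # an equal-energy test spectrum
print(luminance(flat))    # relative (unnormalized) luminance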
52.4.3 MPEG Standards

MPEG-7
A standard developed by the ISO Moving Picture Experts Group (MPEG). An international standard for content-based retrieval, commonly referred to as the international standard for the multimedia content description interface. It was formulated in 2001 and numbered ISO/IEC 15938. Unlike MPEG-2 and MPEG-4, which compress multimedia content for specific applications, it specifies how the structure and characteristics of multimedia content (compressed under different standards) are described so that they can be used by search engines. The goal is to specify a standard set of descriptors for different kinds of multimedia information. These descriptions should be related to the information content so that various multimedia information can be queried quickly and efficiently.

Multimedia content description interface
Same as MPEG-7.

MPEG-7 descriptor
Part of the international standard MPEG-7. It describes low-level image and video (including audio) characteristics such as color, texture, motion, shape, and more.

Color layout descriptor
A color descriptor recommended in the international standard MPEG-7, designed to capture the global and spatial color distribution information of an image; it expresses the spatial distribution information of color. The main steps to obtain the color layout descriptor are as follows:
1. Map the image from RGB color space to YCbCr color space; the mapping formulas are
   Y = 0.299R + 0.587G + 0.114B
   Cr = 0.500R − 0.419G − 0.081B
   Cb = −0.169R − 0.331G + 0.500B
2. Divide the entire image into 64 blocks (each of size (W/8) × (H/8), where W is the width of the entire image and H is the height of the entire image), calculate the average value of each color component over all pixels in each block, and use this as the representative color (principal color) of the block;
3. Convert this set of 64 representative colors to the YCbCr color space, and perform a discrete cosine transform (DCT) on the 8 × 8 array of block averages of each component;
4. Zigzag-scan and quantize the DCT coefficients; a few low-frequency coefficients of each of the three components are taken together to form the color layout descriptor of the image.
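The steps above translate almost directly into array operations. The following Python sketch is an illustration under stated assumptions rather than the normative MPEG-7 procedure: quantization is omitted, the number of retained coefficients is arbitrary, SciPy's DCT is used as a stand-in, and the zigzag scan is approximated by an anti-diagonal ordering.

```python
import numpy as np
from scipy.fft import dctn  # 2-D DCT, used here as a stand-in for the DCT in the standard

def color_layout_descriptor(rgb, n_coeffs=6):
    """Simplified sketch of an MPEG-7-style color layout descriptor (no quantization).

    rgb: float array of shape (H, W, 3) with values in [0, 1] (assumption).
    Returns the first n_coeffs low-frequency DCT coefficients per channel.
    """
    H, W, _ = rgb.shape
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # Step 1: RGB -> YCbCr using the coefficients given in the entry above
    Y  =  0.299 * R + 0.587 * G + 0.114 * B
    Cb = -0.169 * R - 0.331 * G + 0.500 * B
    Cr =  0.500 * R - 0.419 * G - 0.081 * B

    # Step 2: 8 x 8 grid of block averages (representative colors)
    def block_means(c):
        return c[:H - H % 8, :W - W % 8].reshape(8, H // 8, 8, W // 8).mean(axis=(1, 3))

    # Diagonal (zigzag-like) ordering of the 8 x 8 coefficient indices
    zigzag = sorted(((i, j) for i in range(8) for j in range(8)),
                    key=lambda p: (p[0] + p[1], p))

    # Steps 3-4: DCT of each 8 x 8 array of averages, keep a few low-frequency coefficients
    desc = []
    for c in (Y, Cb, Cr):
        coeffs = dctn(block_means(c), norm='ortho')
        desc.extend(coeffs[i, j] for (i, j) in zigzag[:n_coeffs])
    return np.array(desc)
```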
Scalable color descriptor
A color descriptor recommended in the international standard MPEG-7. It uses the color histogram of the HSV space, and both the number of histogram bins and the accuracy of the representation are scalable. In this way, the number of histogram bins or the accuracy of the representation can be selected according to the requirements of retrieval accuracy.

Color structure descriptor
A color descriptor recommended in the international standard MPEG-7, designed to capture the structural information of the color content of an image. It is a feature descriptor that combines color and spatial relations. It uses a generalization of the concept of the color histogram: a structuring element generates a histogram based on the occurrence of local colors in a local neighborhood, rather than just at each individual pixel. By counting colors within an 8 × 8 block (the structuring element) placed over the image, one can distinguish between two images that have the same color components over the full image but different spatial distributions of those components. In other words, the color structure descriptor describes both the color content itself and the structure of that content by means of the structuring element. See color layout descriptor.

Motion descriptor
A descriptor used to describe the movement of an object or the movement of the scene in a video.

Descriptor of motion trajectory
A descriptor that describes the motion characteristics of an object by describing its motion trajectory. The motion trajectory descriptor recommended by the international standard MPEG-7 consists of a series of key points and a set of functions that interpolate between these key points. Key points are represented by coordinate values in a 2-D or 3-D coordinate space as required, and an interpolation function is attached to each coordinate axis: x(t), y(t), and z(t) describe the horizontal, vertical, and depth components of the trajectory, respectively.

Motion activity descriptor
A motion descriptor recommended in the international standard MPEG-7. It describes the spatial distribution of motion activity and can be used to represent motion density, motion rhythm, and so on.

Motion activity
A high-level metric describing the level of activity or pace of action in a video or a segment of it. Depending on the time resolution or the size of the time interval, there are different ways to characterize the activity of the motion. For example, at a relatively coarse level such as an entire video program, it can be said that sports competitions have a higher activity and video-telephony sequences have a lower activity. At a finer interval level, motion activity can distinguish between clips with and without events in the video. Motion activity is also useful in video indexing (such as surveillance applications) and browsing (arranging video clips based on activity or hierarchy). The description of motion activity can be measured by the magnitude of the motion vector in each macroblock of an MPEG code stream.
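A minimal sketch of measuring motion activity from macroblock motion vectors, as described above. This is an illustration only: the spread measure (standard deviation of motion-vector magnitudes) is one common choice, and the thresholds that map it to activity levels are made up here, not taken from MPEG-7.

```python
import numpy as np

def motion_activity(motion_vectors, thresholds=(1.0, 3.0, 6.0, 10.0)):
    """Estimate a coarse motion-activity level from macroblock motion vectors.

    motion_vectors: array of shape (N, 2) with (dx, dy) per macroblock,
                    e.g. extracted from an MPEG code stream.
    thresholds: hypothetical boundaries separating activity levels 1..5.
    """
    magnitudes = np.linalg.norm(motion_vectors, axis=1)
    spread = magnitudes.std()                      # intensity of motion in this segment
    level = 1 + int(np.searchsorted(thresholds, spread))
    return level, spread

# Hypothetical motion vectors for one frame or segment
mv = np.array([[0.5, 0.2], [4.0, -3.0], [2.0, 1.0], [-6.0, 5.0]])
print(motion_activity(mv))
```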
MPEG-21
An international standard for multimedia applications developed by the Moving Picture Experts Group in 2005 and numbered ISO/IEC 18034. It attempts to use a unified framework to integrate and standardize various multimedia services; the ultimate goal is to coordinate multimedia technology standards at different levels and thereby establish an interactive multimedia framework. This framework can support a variety of application domains, allow different users to use and transfer different types of data, and implement the management of intellectual property rights and the protection of digital media content.

Multimedia framework
Same as MPEG-21.
52.4.4 Various Standards

Digital Imaging and Communications in Medicine [DICOM]
A unified image data format (the DICOM format, comprising DICOM file meta-information and a DICOM data set) and a set of data transmission standards for digital medical imaging equipment.

DICOM format
The format for storing (image) files according to the DICOM standard. It includes the DICOM file meta-information and the DICOM data set. A DICOM data set consists of DICOM data elements arranged in a defined order.

DICOM file meta-information
The part of the DICOM format that contains information identifying the data set.

DICOM data set
One of the components of the DICOM format.

DICOM data element
The elements that make up a DICOM data set. Each element includes four parts: tag, value representation (VR), value length (VL), and the actual value.

PostScript [PS] format
A commonly used image file format, mainly used to insert images into printed materials such as books. PostScript is a programming language, more precisely a page description language. In the PS format, grayscale images are represented as ASCII-encoded decimal or hexadecimal numbers. The file extension for this format is ".ps".

Encapsulated PostScript format
An encapsulated form of the PostScript (PS) format. The file extension is ".eps".
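To make the four-part structure of a DICOM data element concrete, here is a small Python sketch (an illustration only, not the normative DICOM encoding: transfer syntaxes, byte order, and the file preamble are ignored, and the sample elements are hypothetical):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class DataElement:
    """One DICOM data element: tag, value representation, value length, actual value."""
    tag: tuple    # (group, element), e.g. (0x0010, 0x0010) for Patient's Name
    vr: str       # value representation, e.g. 'PN', 'US'
    length: int   # value length in bytes
    value: Any    # the actual value

# A DICOM data set is an ordered collection of data elements (simplified sketch)
data_set = [
    DataElement((0x0010, 0x0010), 'PN', 8, 'DOE^JOHN'),   # hypothetical patient name
    DataElement((0x0028, 0x0010), 'US', 2, 512),          # rows
    DataElement((0x0028, 0x0011), 'US', 2, 512),          # columns
]
for e in data_set:
    print(f"({e.tag[0]:04X},{e.tag[1]:04X}) VR={e.vr} VL={e.length} value={e.value!r}")
```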
Camera Link
A standard that regulates the physical interface between a digital camera and an image grabber or frame grabber. First introduced in October 2000, it has three configurations (basic, intermediate, and complete, which require 24, 48, and 64 bits, respectively, and use 4, 8, and 12 channels, respectively). The data transmission rate of the basic configuration can reach 255 MB/s, and the higher configurations can reach 510 MB/s and 680 MB/s.

Camera Link specification
See Camera Link.

IEEE 1394
A specification developed by the Institute of Electrical and Electronics Engineers; a serial digital bus standard/system. Its standard transmission rate is 100 Mbps (the improved version of 2006 can support up to 800 Mbps), it supports hot plugging of peripherals, and its power, control, and data signals are carried by a single cable. This bus system can address up to 64 cameras from a single interface, and multiple computers can simultaneously acquire images from the same camera. See FireWire.

FireWire
The initial name given to a specific data transmission technology. It now refers to the international industry standard (IEEE 1394) for high-performance serial buses (high-speed interfaces). With data rates up to 400 Mb/s and 800 Mb/s, it can connect up to 63 peripherals and can also be used to connect digital cameras and personal computers.

EIA-170
One of the standards of the US Electronic Industries Association. It defines an analog video standard for black-and-white (monochrome) video, with a frame rate of 30 Hz and 525 lines per image. The image size is 640 × 480 pixels.

AVS standards
A series of audio and video coding standards developed by the Audio Video Coding Standard Workgroup of China.

Audio Video Interleave [AVI] format
An audio and video format from Microsoft. Unlike MPEG, it is not (or has not yet become) an international standard, so the compatibility of AVI video files and players is not always guaranteed.
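A quick sanity check of the rates quoted above (simple arithmetic, not from the handbook; 8 bits per pixel is assumed for the digitized monochrome signal):

```python
# Raw data-rate estimate for an EIA-170-style monochrome stream (8 bits/pixel assumed)
width, height, fps, bytes_per_pixel = 640, 480, 30, 1
rate_bytes_per_s = width * height * fps * bytes_per_pixel
print(rate_bytes_per_s / 1e6, "MB/s")   # about 9.2 MB/s, far below Camera Link's 255 MB/s base rate
```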
Index
Numbers 1/f noise, 589 12-adjacency, 361 12-adjacent, 361 12-connected, 365 12-connectedness, 365 12-connectivity, 363 12-neighborhood, 357 16-adjacency, 361 16-adjacent, 361 16-connected, 365 16-connectedness, 365 16-connectivity, 363 16-neighborhood, 358 18-adjacency, 361 18-adjacent, 361 18-connected, 366 18-connected component, 368 18-connectedness, 366 18-connectivity, 363 18-neighborhood, 359 1-bit color, 12 1-D projection, 234 1-D RLC (1-D run-length coding), 680 1-D run-length coding [1-D RLC], 680 1-D space, 5 1-norm, 300 1-of-K coding scheme, 935 2 CIE Standard observer, 1858 2.1-D Sketch, 1270 2.5-D image, 17 2.5-D model, 17 2.5-D sketch, 1268 24-bit color, 20 24-bit color images, 20
24-neighborhood, 359 256 colors, 20 26-adjacency, 362 26-adjacent, 362 26-connected, 366 26-connected component, 369 26-connectedness, 366 26-connectivity, 363 26-neighborhood, 359 2-D coordinate system, 188 2-D differential edge detector, 876 2-D Fourier transform, 396 2-D image, 17 2-D image exterior region, 49 2-D input device, 154 2D-LDA (2-D linear discriminant analysis), 1216 2-D line equation, 844 2-D linear discriminant analysis [2-D-LDA], 1216 2-D-log search method, 794 2-D model, 1397 2-D operator, 511 2-D PCA, 431 2-D point, 5 2-D point feature, 1273 2-D pose estimation, 1559 2-D projection, 234 2-D projective space, 234 2-D RLC (2-D run-length coding), 680 2-D run-length coding [2-D RLC], 680 2-D shape, 1099 2-D shape descriptor, 1106 2-D shape representation, 1099 2-D space, 5
1870 2-D view, 196 2-D Voronoï diagram, 1121 2-norm, 300 2N-point periodicity, 414 3:2 pull down, 779 32-bit color, 21 32-bit color image, 21 3-adjacency, 362 3-adjacent, 362 3-connected, 365 3-connectedness, 365 3-connectivity, 363 3-D alignment, 1460 3-D computed tomography, 625 3-D coordinate system, 188 3-D data, 199 3-D data acquisition, 199 3-D deformable model, 988 3-D differential edge detector, 887 3-D distance, 373 3-D edge model, 869 3-D face recognition, 1241 3-D facial features detection, 1240 3-D image, 17 3-D imaging, 198 3-D information recover, 1356 3-D interpretation, 1477 3-D line equation, 844 3-D model, 1397 3-D model-based tracking, 1149 3-D moments, 1026 3-D morphable model, 986 3-D motion estimation, 1133 3-D motion segmentation, 949 3-D neighborhood, 360 3-D object, 1315 3-D object description, 1315 3-D object representation, 1315 3-D operator, 511 3-D parametric curve, 983 3-D perspective transform, 226 3-D photography, 237 3DPO system, 1850 3-D point, 6 3-D point feature, 1273 3-D pose estimation, 1559 3-D reconstruction, 1355 3-D reconstruction using one camera, 1373 3-D representation, 1270 3-D scene reconstruction, 1373 3-D shape descriptor, 1315 3-D shape representation, 1315 3-D SIFT, 1343
Index 3-D skeleton, 1006 3-D space, 5 3-D stratigraphy, 1356 3-D structure recovery, 1355 3-D SURF, 1345 3-D surface imaging, 198 3-D texture, 1378 3-D video, 776 3-D vision, 66 3-D volumetric imaging, 198 3-D Voronoï diagram, 1121 3-neighborhood, 357 4-adjacency, 362 4-adjacent, 362 4-connected, 365 4-connected boundary, 368 4-connected component, 368 4-connected contour, 368 4-connected distance, 371 4-connected Euler number, 1029 4-connected region, 368 4-connectedness, 365 4-connectivity, 363 4-D approach, 1315 4-D representation, 1315 4-directional chain code, 989 4-neighborhood, 357 4-path, 367 6-adjacency, 362 6-adjacent, 362 6-coefficient affine model, 460 6-connected, 365 6-connected component, 368 6-connectedness, 365 6-connectivity, 364 6-neighborhood, 357 8-adjacency, 362 8-adjacent, 362 8-bit color, 20 8-coefficient bilinear model, 460 8-connected, 365 8-connected boundary, 369 8-connected component, 368 8-connected contour, 369 8-connected distance, 372 8-connected Euler number, 1029 8-connected region, 369 8-connectedness, 365 8-connectivity, 364 8-directional chain code, 989 8-neighborhood, 358 8-path, 367 8-point algorithm, 122
Index A A*, 909 A* algorithm, 909 A2D (The Actor-Action Data set), 11852 AAM (active appearance model), 1102 Abbe constant, 1717 Aberration, 151 Aberration function, 151 Aberration order, 140 Abnormality detection, 1565, 1567 Abrupt change, 1535 Absolute black body, 1692 Absolute calibration, 122 Absolute conic, 123 Absolute coordinates, 185 Absolute index of refraction, 1715 Absolute orientation algorithm, 240 Absolute orientation, 240 Absolute pattern, 1437 Absolute point, 185 Absolute quadric, 1301 Absolute sensitivity threshold, 112 Absolute sum of normalized dot products, 303 Absolute threshold, 1772 Absolute ultimate measurement accuracy [absolute UMA, AUMA], 964 Absolute UMA (absolute ultimate measurement accuracy), 964 Absolute white body, 1692 Absorptance, 1704 Absorption, 1704 Absorption coefficient, 1704 Absorption image, 36 Absorption spectrophotometer, 1675 Absorption spectroscopy, 1697 Absorptive color, 731 Abstract attribute, 1033 Abstract attribute semantic, 1034, 1546 Abstraction, 1404 ACC (accuracy), 1050 Accommodation, 1806 Accommodation convergence, 1806 Accumulation method, 851 Accumulative cells, 851 Accumulative difference, 1140 Accumulative difference image [ADI], 1139 Accumulator array, 851 Accuracy [ACC], 1050 Accuracy of the camera parameter, 88 Accurate measurement, 1051 Achromatic color, 723 Achromatic lens, 135 Achromatic light, 211
1871 Acoustic image, 26 Acoustic imaging, 180 Acoustic SONAR, 26 Acousto-optic-tunable filter [AOTF], 158 Acquisition, 168 Acquisition device, 85 ACRONYM system, 1850 Action, 1550 Action behavior understanding, 1554 Action cuboid, 1553 Action detection, 1557 Action localization, 1557 Action model, 1557 Action modeling, 1556 Action primitives, 1550 Action recognition, 1558 Action representation, 1557 Action unit [AU], 1249, 1558 Activation function, 1651 Active, qualitative, purposive vision, 1262 Active appearance model [AAM], 1102 Active attack, 706 Active blob, 9 Active calibration, 121 Active camera, 115 Active constraint, 1207 Active contour model, 913 Active contour tracking, 920 Active contours, , 913, 920 Active fusion, 1264 Active illumination, 218 Active imaging, 199 Active learning, 1413 Active net, 1103 Active perception, 1262 Active pixel sensor [APS], 105 Active rangefinding, 199 Active recognition, 1169 Active sensing, 1262 Active sensor, 99 Active shape model [ASM], 1102 Active stereo, 69, 1264 Active structure from X, 1364 Active surface, 1103 Active triangulation, 209 Active triangulation distance measurement, 209 Active unit [AU], 1249 Active vision, 1263 Active vision framework, 1265 Active vision imaging, 199 Active volume, 69 Activity, 1550 Activity analysis, 1564
1872 Activity classification, 1562 Activity graph, 1564 Activity learning, 1563 Activity model, 1562 Activity path [AP], 1564 Activity recognition, 1562 Activity representation, 1563 Activity segmentation, 1562 Activity transition matrix, 1564 Actor-action semantic segmentation, 1555 Actual color, 763 Acutance, 480 AdaBoost, 1200 Adaline (adaptive linear element), 1226 Adaption, 1563 Adaptive, 1667 Adaptive behavior model, 1573 Adaptive bilateral filter, 532 Adaptive coding, 657 Adaptive contrast enhancement, 446 Adaptive edge detection, 862 Adaptive filter, 543 Adaptive filtering, 543 Adaptive histogram equalization [AHE], 491 Adaptive Hough transform, 854 Adaptive in-loop de-blocking filter, 1846 Adaptive linear element [Adaline], 1226 Adaptive median filter, 528 Adaptive meshing, 1118 Adaptive non-maximal suppression [ANMS], 905 Adaptive pyramid, 819 Adaptive reconstruction, 1668 Adaptive rejection sampling, 323 Adaptive smoothing, 518 Adaptive thresholding, 931 Adaptive triangulation, 1119 Adaptive visual servoing, 1827 Adaptive weighted average [AWA], 518 Adaptive weighting, 518 Adaptivity, 1667 ADC (analog-to-digital conversion), 250 Add noise, 585 Added mark, 691 Added pattern, 690 Added vector, 691 Additive color, 730 Additive color mixing, 730 Additive image offset, 472 Additive noise, 583 Ade’s eigenfilter, 1074 ADF (assumed density filtering), 1427 ADI (accumulative difference image), 1139
Index Adjacency, 353 Adjacency graph, 354 Adjacency matrix, 354 Adjacency relation, 354 Adjacent, 354 Adjacent pixels, 354 Advanced Television Systems Committee [ATSC], 1840 Advancing color, 1801 Advantages of digital video, 775 Adversary, 706 AEB (automatic exposure bracketing), 193 Aerial photography, 83 Affective body gesture, 1560 Affective computing, 1546 Affective gesture, 1560 Affective state, 1546 Affine, 459 Affine arc length, 459 Affine camera, 110 Affine curvature, 460 Affine flow, 459 Affine fundamental matrix, 458 Affine invariance, 1048 Affine invariant, 1048 Affine invariant detector, 1048 Affine invariant region, 1049 Affine length, 459 Affine moment, 459 Affine motion model, 460 Affine quadrifocal tensor, 110 Affine reconstruction, 1356 Affine registration, 1455 Affine stereo, 1325 Affine transformation, 457 Affine trifocal tensor, 110 Affine warp, 459 Affinity, 923 Affinity matrix, 921 Affinity metric, 921 Affordance and action recognition, 1558 After-attention mechanism, 1805 Afterimage, 1770 AGC (automatic gain control), 123 Age progression, 1256 Agent of image understanding, 63 Agglomerative clustering, 1213 AHE (adaptive histogram equalization), 491 AI (artificial intelligence), 1642 AIC (akaike information criterion), 1607 Air light, 612 Airy disk, 1721 Akaike information criterion [AIC], 1607
Index Albedo, 1710 Algebraic curve, 983 Algebraic distance, 375 Algebraic multigrid [AMG], 809 Algebraic point set surfaces, 1285 Algebraic reconstruction technique [ART], 641 Algebraic surface, 1285 Algorithm fusion, 1495 Algorithm implementation, 1267 Algorithm test, 955 Alignment, 1459 All over reflecting surface, 1683 Alphabet, 651 Alternating figures, 1810 Alychne, 741 AM (amplitude-modulated) halftoning technique, 266 Ambient illumination, 220 Ambient light, 212 Ambient space, 6 Ambiguity problem, 1669 AMBLER, 1850 American Newspaper Publishers Association [ANPA], 1840 AM-FM halftoning technique, 267 AMG (algebraic multigrid), 809 Amplifier noise, 586 Amplitude, 44 Amplitude-frequency hybrid modulated halftoning technique, 267 Amplitude-modulated [AM] halftoning technique, 266 Analog, 39 Analog image, 39 Analog to digital conversion [ADC], 250 Analog video, 774 Analog video raster, 774 Analog video signal, 774 Analog video standards, 774 Analog/mixed analog-digital image processing, 59 Analogical representation model, 1405 Analogy, 1405 Analytic curve finding, 846 Analytic photogrammetry, 240 Analytic signal, 438 Analytical evaluation, 958 Analytical method, 958 Analyzer, 159 Anamorphic lens, 134 Anamorphic map, 25 Anamorphic ratio, 137 Ancestral sampling, 1627
1873 Anchor frame, 1131 Anchor shot, 1540 AND operator, 476 Ando filter, 885 Angiography, 1831 Angle between a pair of vectors, 1446 Angle scanning camera, 199 Angular field, 172 Angular frequency, 342 Angular orientation feature, 1085 Animation, 7 Anisometry, 18 Anisotropic BRDF, 1709 Anisotropic diffusion, 603 Anisotropic filtering, 603 Anisotropic smoothness, 604 Anisotropic structure tensor, 604 Anisotropic texture filtering, 1064 Anisotropy, 1709 ANMS (adaptive non-maximal suppression), 905 ANN (artificial neuron network), 1480 Annihilation radiation, 628 Annotation, 1542 Announcer’s shot, 1540 Anomal trichromat, 1709 Anomalous behavior detection, 1566 Anomaly detection, 1565 Anonymity, 719 ANPA (American Newspaper Publishers Association), 1840 Antecedent, 1407 Anterior chamber, 1762 Anti-alias, 548 Anti-aliasing filter, 547 Anti-blooming drain, 99 Anti-extensive, 1726 Anti-Hermitian symmetry, 298 Anti-mode, 926 Anti-parallel edges, 847 Anti-podal quaternion, 465 Anti-symmetric component, 436 Anti-symmetric, 1101 AOTF (acousto-optic tunable filter), 158 AP (activity path), 1564 AP (average precision), 1051 Aperture, 143 Aperture aberration, 152 Aperture color, 1801 Aperture control, 144 Aperture problem, 144 Aperture ratio, 143 Aperture stop, 143
1874 Apoachromatic lens, 135 Apparent brightness, 1776 Apparent contour, 1289 Apparent error rate, 1229 Apparent luminance, 1776 Apparent magnification, 1675 Apparent motion, 1129 Apparent movement, 1130 Apparent surface, 1683 Appearance, 1238 Appearance based recognition, 1174 Appearance based tracking, 1148 Appearance change, 1239 Appearance enhancement transform, 445 Appearance feature, 1239 Appearance flow, 1158 Appearance model, 1102 Appearance prediction, 1070 Appearance singularity, 1275 Appearance-based approach, 1233 APS (active pixel sensor), 105 AR (augmented reality), 80 AR database, 1853 AR model (auto-regressive model), 1070 Arbitrarily arranged trinocular stereo model, 1348 Arbitrary mapping, 479 Arc, 1617 Arc length, 981 Arc length parameterization, 982 Arc of graph, 1617 Arc second, 175 Arc set, 1464, 1617 Archetype, 1101 Architectural model reconstruction, 83 Architectural modeling, 83 Architectural reconstruction, 83 ARD (automatic relevance determination), 1638 Area composition, 1834 Area of influence, 76 Area of region, 1022 Area of spatial coverage, 76 Area scan camera, 113 Area sensors, 100 Area statistics, 1020 Area under curve [AUC], 1053 Area Voronoï diagram, 1121 Area-based, 970 Area-based correspondence analysis, 1340 Area-based stereo, 1340 Area-perimeter ratio, 1023 Arithmetic coding, 667
Index Arithmetic decoding, 667 Arithmetic mean filter, 520 Arithmetic operation, 471 ARMA (auto-regressive moving average), 1071 ARMA (auto-regressive moving average) model, 1071 Arnold transform, 714 Arrangement rule, 1080 Array processor, 344 ART (algebraic reconstruction technique), 641 ART without relaxation, 642 Arterial tree segmentation, 950 Articulated object, 1313 Articulated object model, 1313 Articulated object segmentation, 1313 Articulated object tracking, 1147 Artifact noise, 581 Artifact, 580 Artificial intelligence [AI], 1642 Artificial neuron network [ANN], 1480 Ascender, 1829 ASM (active shape model), 1102 Aspect equivalent, 196 Aspect graph, 1283 Aspect ratio, , 785, 1111 Asperity scattering, 1289 Aspherical lens, 134 Assimilation theory, 1816 Association graph, 1633 Association graph matching, 1633 Association learning, 1413 Associativity, 1725 Assumed density filtering [ADF], 1427 Astigmatism, 147, 1785 Asymmetric application, 651 Asymmetric fingerprinting, 700 Asymmetric key cryptography, 717 Asymmetric key watermarking, 700 Asymmetric SVM, 1206 Asymmetric watermarking, 700 Asynchronous reset, 98 Atlas-based registration, 1456 Atlas-based segmentation, 950 Atmosphere optics, 612 Atmosphere semantic, 1545 Atmospheric extinction coefficient, 612 Atmospheric scattering model, 611 Atmospheric transmittance, 612 Atmospheric turbulence blur, 612 Atmospheric window, 612 Atomic action, 1558 Atoms, 1637 ATR (automatic target recognition), 1169
Index ATSC (Advanced Television Systems Committee), 1840 Attack, 706 Attack to watermark, 707 Attacker, 706 Attention, 1804 Attenuation, 596 Attribute, 1031 Attribute adjacency graph, 1033 Attribute theory, 1032 Attribute-based morphological analysis, 1033 Attributed graph, 1032 Attributed hypergraph, 1033 Atypical co-occurrence, 1479 AU (action unit), 1249, 1558 AUC (area under curve), 1053 Audio, 1821 Audio presentation unit, 1844 Audio video interleave [AVI] format, 1863 Audio-Video Coding Standard Workgroup of China [AVS], 1840 Augmented reality [AR], 80 Augmented vector, 453 Augmenting path algorithm, 921 AUMA (absolute ultimate measurement accuracy), 964 Authentication, 712 Authentication mark, 713 Auto-associative networks, 1209 Autocalibration, 121 Auto-contrast, 480 Autocorrelation, 1077 Auto-correlation of a random field, 1586 Auto-covariance of random fields, 1586 Autofocus, 141 Autofocusing, 141 Automated visual surveillance, 1824 Automatic, 1226 Automatic activity analysis, 1565 Automatic contrast adjustment, 481 Automatic exposure bracketing [AEB], 193 Automatic gain control [AGC], 123 Automatic relevance determination [ARD], 1638 Automatic target recognition [ATR], 1169 Automatic visual inspection, 1825 Automaton, 1226 Automotive safety, 1828 Autonomous exploration, 1264 Autonomous vehicle, 1828 Auto-regressive hidden Markov model, 1595 Auto-regressive model [AR model], 1070 Auto-regressive moving average [ARMA], 1071
1875 Auto-regressive moving average [ARMA] model, 1071 Autostereogram, 1324 Average absolute difference, 1058 Average gradient, 521 Average length of code words, 652 Average pooling, 1650 Average precision [AP], 1051 Average smoothing, 518 Averaging filter, 518 AVI (audio video interleave) format, 1863 AVS (Audio Video Coding Standard Workgroup of China), 1840 AVS standards, 1863 AWA (adaptive weighted average), 518 Axial aberration, 153 Axial chromatic aberration, 152 Axial representation, 1003 Axial spherical aberration, 152 Axiomatic computer vision, 67 Axis, 1319 Axis of elongation, 431 Axis of least second moment, 1009 Axis of rotation, 465 Axis of symmetry, 1003 Axis/angle representation of rotations, 462 Axis-angle curve representation, 464 Axonometric projection, 236 Azimuth, 30
B B (blue), 728 B2D-PCA (bidirectional 2D-PCA), 431 Back face culling, 259 Back focal length [BFL], 139 Back light, 212 Back lighting, 216 Back porch, 783 Back-coupled perceptron, 1210 Background, 830 Background estimation and elimination, 1751 Background flattening, 1751 Background labeling, 830 Background modeling, 1140 Background motion, 1141 Background normalization, 1140 Background plate, 858 Background replacement, 854 Background subtraction, 1140 Back-projection, 638 Back-projection filtering, 639 Back-projection of the filtered projection, 640
1876 Back-projection reconstruction, 639 Back-projection surface, 169 Back-propagation, 1646 Back-propagation neural nets, 1646 Backside reflection, 1730 Backtracking, 911 Backup, 272 Backward difference, 877 Backward kinematics, 142 Backward mapping, 617 Backward motion estimation, 1132 Backward pass, 383 Backward problem, 82 Backward zooming, 137 BAD (biased anisotropic diffusion), 603 Bag of detectors, 1414 Bag of features, 1185 Bag of words, 1488 Bagging, 1415 Bag-of-feature model, 1185 Bag-of-words model, 1488 Balance measure, 929 Balanced filter, 544 Ballooning energy, 914 Band, 1694 Banding, 581 Band-pass filter, 545 Band-pass filtering, 545 Band-pass notch filter, 556 Band-pass pyramids, 819 Band-reject filter, 546 Band-reject filtering, 546 Band-reject notch filter, 557 Bandwidth selection in mean shift, 937 Bar, 847 Bar detection, 847 Bar detector, 847 Bar edge, 847 Barcode reading, 1835 Barrel distortion, 579 Bartlett filter, 525 Barycentric coordinates, 1021 Base alignment, 1460 Baseline, 1345 Basic belief assignment [BBA], 1611 Basic logic operation, 476 Basic move, 380 Basic NMF [BNMF] model, 1192 Basic probability number, 1509 Basis, 387 Basis function, 388 Basis function representation, 388 Basis pursuit, 1639
Index Basis pursuit denoising, 1639 Bas-relief ambiguity, 1369 Batch training, 646 Baum-Welch algorithm, 1424 Bayesian compressive sensing [BCS], 1641 Bayer filter, 158 Bayer pattern, 158 Bayer pattern demosaicing, 754 Bayes classifier, 1608 Bayes criterion, 1203 Bayes factor, 1603 Bayes filtering, 1606 Bayes’ rule, 1606 Bayes’ theorem, 1605 Bayesian adaptation, 1609 Bayesian data association, 1150 Bayesian fusion, 1506 Bayesian graph, 1610 Bayesian information criterion [BIC], 1606 Bayesian learning, 1609 Bayesian logistic regression, 294 Bayesian model averaging, 1603 Bayesian model comparison, 1602 Bayesian model learning, 1409 Bayesian model selection, 1607 Bayesian model, 1601 Bayesian network, 1609 Bayesian Occam’s razor, 1606 Bayesian parameter inference, 1605 Bayesian probability, 1605 Bayesian saliency, 1037 Bayesian statistical model, 1416 Bayesian temporal model, 1602 BBA (basic belief assignment), 1611 BCS (Bayesian compressive sensing), 1641 BE (bending energy), 1107 Beam splitter, 1674 Beer–Lambert law, 1704 Behavior, 1550 Behavior affinity matrix, 1571 Behavior analysis, 1570 Behavior class distribution, 1571 Behavior classification, 1571 Behavior correlation, 1571 Behavior detection, 1570 Behavior footprint, 1569 Behavior hierarchy, 1569 Behavior interpretation, 1572 Behavior layer semantic, 1570 Behavior learning, 1572 Behavior localization, 1570 Behavior model, 1572 Behavior pattern, 1569
Index Behavior posterior, 1569 Behavior prediction, 1569 Behavior profile, 1569 Behavior profiling, 1569 Behavior representation, 1569 Behavioral context, 1572 Behavioral saliency, 1570 Belief networks, 1609 Belief propagation [BP], 1421 Beltrami flow, 534 Benchmarking, 1043 Bending, 1107 Bending energy [BE], 1107 BER (bit error rate), 690 Bernoulli distribution, 330 Bessel function, 281 Bessel-Fourier boundary function, 400 Bessel-Fourier coefficient, 400 Bessel-Fourier spectrum, 400 Best approximation classifier, 1202 Best approximation criterion, 1202 Best next view, 1264 Betacam, 785 Between-class covariance, 1177 Between-class scatter matrix, 1177 BFL (back focal length), 139 B-frame, 802 Bhanu system, 1850 Bhattacharyya coefficient, 375 Bhattacharyya distance, 374 Bias, 985 Bias and gain, 797 Bias field estimation, 630 Bias parameter, 630 Biased anisotropic diffusion [BAD], 603 Biased noise, 594 Bias-variance trade-off, 309 BIC (Bayesian information criterion), 1606 Bicubic, 1283 Bicubic spline interpolation, 973 Bidirectional 2D-PCA [B2D-PCA], 431 Bidirectional frame, 802 Bidirectional prediction frame, 801 Bidirectional prediction, 802 Bidirectional recurrent neural network [Bi-RNN], 1646 Bidirectional reflectance distribution function [BRDF], 1707 Bidirectional scattering surface reflectancedistribution function [BSSRDF], 1709 Bidirectional texture function [BTF], 71 Bilateral filter, 523 Bilateral filtering, 502
1877 Bilateral smoothing, 502 Bilateral symmetry, 1237 Bilateral telecentric lenses, 134 Bi-layer model, 1555 Bi-level image, 18 Bi-linear, 620 Bi-linear blending, 620 Bi-linear blending function, 620 Bi-linear filtering, 1064 Bi-linear interpolation, 618 Bi-linear kernel, 526 Bi-linear surface interpolation, 619, 1286 Bi-linear transform, 460 Bi-linearity, 619 Bimodal histogram, 925 Binarization, 830 Binary attribute, 1032 Binary classifier, 1205 Binary closing, 1734 Binary decision, 1205 Binary decomposition, 678 Binary dilation, 1732 Binary entropy function, 664 Binary erosion, 1733 Binary graph cut optimization, 922 Binary image, 12 Binary image island, 19 Binary mathematical morphology, 1732 Binary moment, 19 Binary MRF, 1591 Binary noise reduction, 596 Binary object feature, 1044 Binary object recognition, 1169 Binary opening, 1733 Binary operation, 470 Binary order of Walsh functions, 392 Binary region skeleton, 1006 Binary relation, 287 Binary space-time volume, 1557 Binary tree, 1007 Binning, 472 Binocular, 1338 Binocular angular scanning mode, 202 Binocular axis mode, 201 Binocular diplopia, 1789 Binocular focused horizontal mode, 200 Binocular horizontal mode, 200 Binocular image, 33 Binocular imaging, 1338 Binocular imaging model, 1338 Binocular parallel horizontal mode, 200 Binocular single vision, 1789 Binocular stereo, 200
1878 Binocular stereo imaging, 199 Binocular stereo vision, 1338 Binocular tracking, 1340 Binocular vision, 1338 Binocular vision indices of depth, 1358 Binoculars, 166 Binormal, 1278 Binomial distribution, 328 Binomial filter, 504 Bin-picking, 1824 Biological optics, 1674 Biological spectroscopy, 1697 Biomedophotonics, 1674 Biometric feature, 1231 Biometric recognition, 1232 Biometric systems, 1232 Biometrics, 1232 Bipartite graph, 66 Bipartite matching, 66 Bipolar impulse noise, 590 Biquadratic, 1282 Birefringence, 1716 Bi-RNN (bidirectional recurrent neural network), 1646 B-ISDN (Broadband integrated services digital network), 1848 Bi-spectrum technique, 597 Bit, 49 Bit allocation, 675 Bit depth, 13, 49 Bit error rate [BER], 690 Bit map, 48 Bit plane coding, 660 Bit plane encoding, 678 Bit rate, 776 Bit resolution, 13 Bit reversal, 790 Bi-tangent, 981 Bitmap format, 13 Bitmapped file format, 273 Bitplane, 12 Bit-plane decomposition, 13 Bitshift operator, 350 Bivariate time series, 777 Black level, 168 Blackboard, 1403 Blackbody, 1690 Blackbody radiation, 1690 Blade, 1466 Blanking, 781 Blanking interval, 781 Bleeding, 759 Blending image, 37
Index Blending operator, 716 Blind coding, 702 Blind deconvolution, 562 Blind detection, 702 Blind image deconvolution, 811 Blind image forensics, 713 Blind signal separation [BSS], 335 Blind source separation, 335 Blind spot, 1763 Bling embedding, 702 Bling watermarking, 702 Blob, 9 Blob analysis, 949 Blob extraction, 949 Block, 8 Block coding, 658 Block distance, 371 Block matching, 1342 Block thresholding, 931 Block truncation coding [BTC], 658 Block-circulant matrix, 574 Blocked path, 1627 Blockiness, 663 Blocking artifact, 659 Blocking Gibbs sampling, 1600 Block-matching algorithm [BMA], 793 Blocks world, 1468 Blooming, 103 Blooming effect, 104 Blotches, 595 Blue [B], 728 Blue noise, 589 Blue noise dithering, 264 Blue noise halftone pattern, 267 Blue screen matting, 856 Blum’s medial axis, 1001 Blur, 564 Blur circle, 139 Blur estimation, 565 Blurring, 275 b-move, 380 BMA (block-matching algorithm), 793 BMP, 13 BMP format, 13, 272 BNMF (basic NMF) model, 1192 Bob, 782 Body color, 1706 Body part tracking, 1251 Body reflection, 1713 Boltzmann constant, 1857 Boltzmann distribution, 334 Booming, 92 Boosting, 1199
Index Boosting process, 1199, 1415 Bootstrap, 1415 Bootstrap aggregation, 1415 Bootstrap data set, 1416 Bootstrap filter, 1155 Border, 903 Border effects, 515 Border length, 1017 Border matting, 857 Border tracing, 912 Bottom-hat transformation, 1750 Bottom-up, 1393 Bottom-up cybernetics, 1392 Bottom-up event detection, 1566 Bottom-up segmentation, 834 Boundary, 903 Boundary accuracy, 1087 Boundary closing, 903 Boundary description, 1012 Boundary detection, 902 Boundary diameter, 1016 Boundary extraction algorithm, 905 Boundary following, 903 Boundary grouping, 913 Boundary invariant moment, 1017 Boundary length, 1017 Boundary matching, 1453 Boundary moment, 1017 Boundary pixel of a region, 904 Boundary point, 904 Boundary point set, 904 Boundary property, 1015 Boundary region of a set, 286 Boundary representation, 973 Boundary segment, 902 Boundary segmentation, 901 Boundary set, 902 Boundary signature, 975 Boundary temperature, 1117 Boundary thinning, 905 Boundary-based method, 832 Boundary-based parallel algorithm, 832 Boundary-based representation, 970 Boundary-based sequential algorithm, 832 Boundary-region fusion, 939 Bounding box, 997 Bounding region, 996 Box filter, 520 Box-counting approach, 75 Box-counting dimension, 75 Boxlet, 425 Box-Muller method, 349 BP (belief propagation), 1421
1879 bps, 776 Bracketed exposure, 193 Brain, 1757 Branching, 1307 BRDF (bidirectional reflectance distribution function), 1707 Breadth-first search, 911 Breakpoint detection, 846 Breast scan analysis, 1832 Brewster’s angle, 1715 Bright-field illumination, 221 Brightness, 1776 Brightness adaptation, 1790 Brightness adaptation level, 1791 Brightness adjustment, 1791 Brightness constancy, 1794 Brightness constancy constraint, 1794 Brightness constancy constraint equation, 1164 Brightness image forming model, 44 Brightness perception, 1790 Brightness sensitivity, 1772 Brilliance, 1776 Broadband integrated services digital network [B-ISDN], 1848 Broadcast monitoring, 705 Broadcast video, 777 Brodatz texture, 1066 Brodatz’s album, 1066 B-snake, 915 B-spline, 972 B-spline fitting, 973 B-spline snake, 915 BSS (blind signal separation), 335 BSSRDF (bidirectional scattering surface reflectance-distribution function), 1709 BTC (block truncation coding), 658 BTF (bidirectional texture function), 71 Bundle adjustment, 1497 Burn, 107 Burn-in, 1601 Busyness measure, 929 Butterfly filter, 851 Butterworth band-pass filter, 554 Butterworth high-pass filter, 552 Butterworth low-pass filter, 548
C C (cyan), 728 CABAC (context-adaptive binary arithmetic coding), 1846 CAC (constant area coding), 679 CAD (computer-aided design), 82
1880 Calculus of variation, 302 Calibration matrix, 123 Calibration object, 119 Calibration patterns, 119 Calibrator, 262 CALIC (context-based adaptive lossless image coding), 687 Caltech-101 data set, 1852 Caltech-256 data set, 1852 Camera, 108 Camera calibration, 118 Camera calibration target, 119 Camera connectivity matrix, 117 Camera constant, 116 Camera coordinate system, 186 Camera coordinates, 187 Camera geometry Camera jitter, 88 Camera line of sight, 89 Camera Link, 1863 Camera Link specification, 1863 Camera Lucida, 261 Camera matrix, 124 Camera model, 173 Camera motion, 88 Camera motion compensation, 89 Camera motion estimation, 89 Camera motion vector, 89 Camera pipeline, 115 Camera pose, 116 Camera position estimation, 89 Camera resectioning, 118 Camera roll, 89 Camera topology inference, 117 Candela [cd], 1856 Canny edge detector, 890 Canny filter, 890 Canny operator, 889 Canny’s criteria, 890 Canonical configuration, 112 Canonical correlation analysis [CCA], 1583 Canonical direction, 1665 Canonical factor, 1584 Canonical variate, 1584 Capacity, 1618 Capsule endoscopy, 1832 Cardiac gated dynamic SPECT, 627 Cardiac image analysis, 1831 Cardinal elements of the lens, 130 Carrier image, 37 CART (classification and regression trees), 1178 Cartesian coordinates, 277 Cartography, 81
Index Cascade of coordinate transformations, 449 Cascade of transformations, 450 Cascaded Hough transform, 853 Cascaded learning, 1409 Cascading Gaussians, 533 CAS-PEAL database, 1853 Cast-shadow, 1386 CAT (computed axial tomography), 625 Cat mapping, 715 Catadioptric optics, 1676 Categorization, 1166, 1177 Categorization of image processing operations, 60 Categorization of texture features, 1066 Category, 1176 Category identification sequence, 1177 Category model, 1476 Category recognition, 1176 Category-level recognition, 1177 Cathode-ray oscillograph, 261 Cathode-ray tube [CRT], 260 Cauchy distribution, 330 Causality, 1594 Caustic, 172 CAVE, 1835 Cavity, 1029 CAVLC (context-adaptive variable-length coding), 1847 CB (coding block), 653 CBCT (cone beam CT), 633 CBIR (content-based image retrieval), 1515 CBS (compton back-scattering), 1721 CBVR (content-based video retrieval), 1515 CCA (canonical correlation analysis), 1583 CCD (charge-coupled device), 103 CCD camera, 111 CCD sensor, 103 CCI (color colorfulness index), 611 CCIR, 1838 CCIR camera, 114 CCITT (Consultative Committee for International Telephone and Telegraph), 1838 CCTV (closed-circuit television), 1821 cd (Candela), 1856 CDF (cumulative distribution function), 488 Ceiling function, 280 Cell coding, 679 Cell decomposition, 1313 Cell microscopic analysis, 1832 Cellular array, 344 Census transform histogram [CENTRIST], 1345
Index Center line, 1003 Center of curvature, 1019 Center of mass, 1014 Center projection, 236 Center-surround, 839 Center-surround differences, 1789 Center-surround operator, 839 Central difference, 877 Central eye, 1787 Central limit theorem, 307 Central moment, 1532 Central moment method, 1531 Central vision, 1787 Central-slice theorem, 637 CENTRIST (census transform histogram), 1345 Centroid, 1014 Centroid distance, 1433 Centroid of region, 1021 Centroid tracker, 1146 Centroid tracking, 1146 Cepstrum, 401 Certainty representation, 1480 CFA (class-dependence feature analysis), 1184 CFA (color filter array), 757 CFG (context-free grammar), 1224 CFS (cluster fast segmentation), 1752 CG (computational geometry), 82 CG (computer graphics), 69 CGI (computer-generated images), 7 CGM (computer graphics metafile), 71 Chain code, 988 Chain code histogram, 989 Chain code normalization, 990 Chain code normalization for rotation, 990 Chain code normalization with respect to starting point, 990 Chained transformations, 450 Chamfer distance, 373 Chamfer matching, 1453 Chamfering, 383 Change detection, 1136 Channel, 16 Chan-Vese energy functional, 946 Chan-Vese functional, 946 Chaos theory, 1482 Character recognition, 1829 Character segmentation, 1829 Character verification, 1831 Characteristic view, 1317 Characteristics of human vision, 1756 Charge transfer efficiency, 103 Charge-coupled device [CCD], 103 Charge-injection device [CID], 104
1881 Chart, 7 Chart-less radiometric calibration, 120 Chebyshev moments, 1107 Chebyshev norm, 300 Checkboard effect, 580 Chemical procedure, 1760 Chernoff distance, 374 Chessboard distance, 372 Chessboard distance metric, 372 Chief ray, 1705 Chip sensor, 96 Chirality, 1667 Chi-squared distribution, 325 Chi-squared test, 325 Cholesky decomposition, 284 Chord distribution, 1104 Choropleths, 763 Chroma, 746 Chroma keying, 735 Chroma subsampling, 764 Chromatic aberration, 152 Chromatic adaptation, 1804 Chromatic coefficient, 730 Chromatic color, 723 Chromatic light, 211 Chromatic light source, 214 Chromaticity, 735 Chromaticity coordinates, 749 Chromaticity diagram, 732 Chromatopsy, 1797 Chrominance, 735 Chromosome analysis, 1832 Chunking, 303 CID (charge-injection device), 104 CIE, 1839 CIE 1931 standard colorimetric observer, 1858 CIE 1931 standard colorimetric system, 1858 CIE 1946 standard colorimetric observer, 1858 CIE 1960 UCS (uniform color space), 1858 CIE 1960 uniform color space [UCS], 1858 CIE 1976 L*a*b*color space [CIELAB], 749 CIE 1976 L*u*v*color space [CIELUV], 749 CIE 1976 uniform chromaticity scale diagram, 733 CIE chromaticity coordinates, 749 CIE chromaticity diagram, 733 CIE colorimetric functions, 1858 CIE colorimetric system, 734 CIELAB (CIE 1976 L*a*b*color space), 749 CIE L*A*B* model, 747 CIELUV (1976 L*u*v*color space), 749 CIE L*U*V* model, 748 CIE standard photometric observer, 1857
1882 CIE standard sources, 1859 CIE XYZ chromaticity diagram, 733 CIF (common intermediate format), 275 Cipher, 719 Cipher key, 717 Cipher text, 717 Circle, 848 Circle detection, 847 Circle fitting, 848 Circle of confusion, 138 Circulant matrix, 574 Circular convolution, 1646 Circularity, 1113 Circumscribed circle, 1113 City-block distance, 371 City-block metric, 372 Class identifier, 1178 Class of a pixel, 828 Class of shapes, 1108 Class recognition, 1176 Class separability, 1176 Class-dependent feature analysis [CFA], 1184 Classical optics, 1671 Classical probability, 308 Classification, 1176 Classification and regression trees [CART], 1178 Classification error, 965 Classification learning, 1178 Classification of image segmentation techniques, 831 Classification of knowledge, 1390 Classification of planar shapes, 1108 Classification of spatial filters, 503 Classification of thresholding techniques, 927 Classification scheme of literatures for image engineering, 56 Classified image, 36 Classifier, 1196 Clausal form syntax, 1408 Clear text, 717 CLG method, 1129 Clip operation, 615 Clip transform, 486 Clipping, 615 Clique, 1631 Clockwise spectrum, 726 Close neighborhood, 356 Close operator, 1728 Closed half-space, 299 Closed lower half-space, 299 Closed parametric curve, 984 Closed set, 904
Index Closed set recognition, 1170 Closed upper half-space, 299 Closed world, 1182 Closed-circuit television [CCTV], 1821 Close-set evaluation, 1245 Close-set object, 904 Closing, 1727 Closing characterization theorem, 1734 CLSM (confocal laser scanning microscope), 161 Clump splitting, 950 Cluster, 1211 Cluster analysis, 1213 Cluster assignment function, 1211 Cluster distribution, 1211 Cluster fast segmentation [CFS], 1752 Cluster of polygons, 995 Clustering, 1211 CLUT (color look-up table), 764 Clutter, 580 CML (concurrent mapping and localization), 1827 CMOS (complementary metal-oxidesemiconductor), 104 CMOS sensor, 105 CMOS sensors with a logarithmic response, 105 CMOS (complementary metal-oxidesemiconductor) transistor, 105 c-move, 380 CMU hyperspectral face database, 1853 CMY model, 742 CMY, 742 CMYK, 743 CMYK composition, 743 CMYK model, 742 CNI (color naturalness index), 611 Coarse cluster contour, 1124 Coarse/fine, 807 Coarseness, 1079 Coarse-to-fine, 1450 Coarse-to-fine processing, 807 Coaxial diffuse lights, 224 Coaxial illumination, 218 Coaxial telecentric light, 221 Cocktail party problem, 335 Code book, 652 Code division multiplexing, 271 Code word, 652 Code-book vector, 684 Codec, 650 Coded pattern, 209 Coding, 648
Index Coding block [CB], 653 Coding of audio-visual object, 1845 Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbps, 1844 Coding redundancy, 653 Coding tree block [CTB], 1847 Coding tree unit [CTU], 1847 Coding unit [CU], 1847 Coding-decoding flowchart of vector quantization, 685 Coffee transform, 387 Cognitive level, 1545 Cognitive level semantic, 1545 Cognitive vision, 66 Coherence, 1405 Coherence detection, 1326 Coherence scale/volume, 162 Coherent fiber optics, 1675 Coherent light, 1719 Coherent motion, 1130 Coincidental alignment, 1460 Collimated lighting, 217 Collimating, 627 Collinearity, 232 Collineation, 232 Collusion attack, 708 Collusion-secure code, 708 Color, 723 Color appearance, 1803 Color balance, 724 Color based database indexing, 1530 Color based image retrieval, 1529 Color blindness, 1799 Color calibration, 761 Color casts, 259 Color channels, 763 Color chart, 726 Color checker chart, 727 Color circle, 725 Color clustering, 833 Color colorfulness index [CCI], 611 Color complement, 726 Color constancy, 1794 Color contrast, 1803 Color conversion, 738 Color co-occurrence matrix, 770 Color correction, 761 Color descriptor, 765 Color difference, 1802 Color difference formula, 1802 Color difference threshold, 1802 Color differential invariant, 1049
1883 Color doppler, 1831 Color edge, 767 Color edge detection, 768 Color efficiency, 217 Color encoding, 723 Color enhancement, 757 Color feature, 765 Color filter array [CFA], 757 Color filtering for enhancement, 757 Color fringe, 759 Color fusion, 759 Color gamut, 723 Color gradient, 770 Color grading, 759 Color halftone filter, 764 Color halftoning, 764 Color Harris operator, 769 Color histogram, 769 Color histogram matching, 1451 Color HOG, 770 Color image, 19 Color image demosaicing, 754 Color image edge detection, 862 Color image enhancement, 757 Color image processing, 763 Color image restoration, 764 Color image segmentation, 833 Color image thresholding, 932 Color imbalance, 576 Color indexing, 1530 Color layout descriptor, 1860 Color line model, 729 Color lookup table [CLUT], 764 Color management, 739 Color mapping, 756 Color matching, 1798 Color mixing, 729 Color mixture, 729 Color mixture model, 729 Color model, 737 Color model for TV, 1822 Color moment, 770 Color naturalness index [CNI], 611 Color negative, 107 Color noise reduction, 758 Color normalization, 739 Color palette, 20 Color perception, 1791 Color pixel value, 763 Color printing, 744 Color psychophysics, 1807 Color purity, 721 Color quantization, 764
1884 Color range, 723 Color re-mapping, 761 Color representation system, 738 Color representation, 723 Color saturation, 722 Color science, 1801 Color sequence, 744 Color shift, 259 Color similarity, 738 Color slicing, 758 Color slicing for enhancement, 757 Color space, 738 Color space conversion, 739 Color stimulus function, 1796 Color structure descriptor, 1861 Color system, 738 Color tables, 20 Color temperature, 211 Color texture, 1062 Color tolerance, 739 Color transformations, 764 Color TV format, 1821 Color video, 776 Color vision, 1791 Color wheel, 725 Colorbar, 19 Color-difference signal, 747 Colored graph, 1619 Colored noise, 589 Colorimetric model, 739 Colorimetry, 721 Colorization, 755 Color matching experiments, 1780 Coma, 148 Combinational problem, 921 Combinatorial explosion, 922 Combined feature retrieval, 1526 Combined fusion method, 1508 Combined operation, 1737 Combined reconstruction method, 637 Common intermediate format [CIF], 275 Commutativity, 1725 Compact set, 286 Compactness, 1111 Compactness descriptor, 1111 Compactness hypothesis, 1182 Comparison of shape number, 1098 Compass edge detector, 886 Compass gradient operator, 886 Compatibility constraint, 1330 Complement graph, 1615 Complement of a set, 286 Complementary colors, 726
Index Complementary metal-oxide-semiconductor [CMOS], 104 Complementary metal-oxide-semiconductor [CMOS] transistor, 105 Complementary wavelength, 729 Complementary-color wavelength, 737 Complete authentication, 712 Complete set, 390 Complete tree-structured wavelet decomposition, 421 Completeness checking, 68 Completion, 607 Complex plane, 297 Complexity, 1116 Complexity of evaluation, 967 Complexity of watermark, 693 Compliant device, 707 Component video, 786 Composed parametric curve, 983 Composite and difference values, 808 Composite filter, 543 Composite Laplacian mask, 510 Composite logic operation, 477 Composite maximum, 1487 Composite moments, 1487 Composite operation, 1440 Composite video, 786 Compositing, 607 Compound decision rule, 1221 Compound lenses, 137 Compound transforms, 450 Compressed domain, 1518 Compressed domain retrieval, 1518 Compressed form, 1638 Compressed sampling [CS], 1635 Compressed sparse row [CSR], 1638 Compression, 654 Compression factor, 655 Compression gain, 655 Compression noise, 648 Compression ratio, 654 Compressive sampling [CS], 1635 Compressive sensing [CS], 1635 Compressive video, 777 Compton back-scattering [CBS], 1721 Computation cost, 959 Computational camera, 114 Computational complexity, 1261 Computational geometry [CG], 82 Computational imaging, 1261 Computational learning theory, 1408 Computational photography [CPh], 1261 Computational symmetry, 1261
Index Computational theory, 1267 Computational topology, 79 Computational tractability, 1261 Computational vision, 67 Computational vision theory, 1266 Computed axial tomography [CAT], 625 Computed tomography [CT], 625 Computer graphics [CG], 69 Computer graphics metafile [CGM], 71 Computer vision [CV], 65 Computer-aided design [CAD], 82 Computer-aided tomography, 625 Computer-generated images [CGI], 7 Computer-generated map, 7 Concave degree, 1110 Concave function, 281 Concave mirror, 1677 Concave residue, 998 Concave vertex, 839 Concavity, 1110 Concavity measure, 929 Concavity tree, 999 Concavo-convex, 1110 Concentric mosaic, 19 Concurrence matrix, 1076 Concurrency, 1574 Concurrent mapping and localization [CML], 1827 CONDENSATION (CONditional DENSity propagATION) algorithm, 1154 CONDENSATION tracking, 1155 Condenser lens, 133 Condition number, 898 Condition-action, 1403 Conditional density propagation, 1155 Conditional dilation, 1752 Conditional distribution, 323 Conditional entropy, 1501 Conditional Gaussian distribution, 326 Conditional independence, 323 Conditional mixture model, 323 Conditional ordering, 766 Conditional probability, 324 Conditional random field [CRF], 1588 Conditional replenishment, 798 Cone, 1767 Cone beam CT [CBCT], 633 Cone cell, 1766 Cone-beam projection, 633 Configural-processing hypothesis, 1757 Confocal, 162 Confocal laser scanning microscope [CLSM], 161
1885 Conformal mapping, 297 Confusion matrix, 1229 Confusion table, 1229 Confusion theory, 1815 Congruencing, 1461 Conic, 1277 Conic fitting, 1278 Conic invariant, 1277 Conic section, 1278 Conical mirror, 1677 Conjugate antisymmetry, 298 Conjugate direction, 1656 Conjugate distribution, 1478 Conjugate gradient descent, 645 Conjugate gradient, 1656 Conjugate prior, 1477 Conjugate symmetry, 297 Conjugate symmetry of Fourier transform, 410 Conjugate transpose, 298 Conjunction, 1407 Connected, 364 Connected component, 368 Connected component labeling, 370 Connected line segments, 370 Connected polygons, 370 Connected polyhedral object, 370 Connected region, 369 Connected sets, 369 Connectedness, 368 Connectivity, 363 Connectivity analysis of run, 908 Connectivity number, 1029 Connectivity of run, 909 Connectivity paradox, 367 Consequent, 1407 Conservative smoothing, 596 Consistent estimate, 1053 Consistent gradient operator, 886 Conspicuity maps, 1805 Constancy phenomenon, 1795 Constant area coding [CAC], 679 Constellation, 1556 Constellation model, 1557 Constrained least squares, 290 Constrained least squares restoration, 601 Constrained matching, 1441 Constrained optimization, 1655 Constrained restoration, 599 Constrained structure from motion, 1368 Constraint propagation, 1474 Constraint satisfaction, 350 Constructing a combined Delaunay on Voronoï mesh, 1119
1886 Constructive solid geometry [CSG], 1312 Consultative Committee for International Telephone and Telegraph [CCITT], 1838 Content-based image retrieval [CBIR], 1515 Content-based retrieval, 1514 Content-based video retrieval [CBVR], 1515 Context, 1664 Context dependent, 1665 Context-adaptive binary arithmetic coding [CABAC], 1846 Context-adaptive variable-length coding [CAVLC], 1847 Context-aware algorithm, 1665 Context-based adaptive lossless image coding [CALIC], 687 Context-free grammar [CFG], 1224 Context-sensitive, 1665 Context-sensitive grammar, 1224 Contextual classification, 1180 Contextual entity, 1563 Contextual event, 1567 Contextual image classification, 1886 Contextual information, 1665 Contextual knowledge, 1665 Contextual method, 1665 Contextually incoherent, 1567 Continuity, 1405 Continuous constraint, 1331 Continuous convolution, 342 Continuous Fourier transform, 399 Continuous image, 39 Continuous learning, 1416 Continuous random variable, 1580 Continuous-tone color, 21 Continuous-tone image, 27 Contour, 580 Contour analysis, 1096 Contour edgelet, 1124 Contour filling, 607 Contour following, 849 Contour grouping, 849 Contour interpolation and tiling, 1306 Contour linking, 849 Contour matching, 1452 Contour matching with arc length parameterization, 1452 Contour partitioning, 974 Contour relaxation, 849 Contour representation, 974 Contour signature, 976 Contour smoothing with arc length parameterization, 982
Index Contour tracking, 849 Contour-based recognition, 1175 Contour-based representation, 970 Contour-based shape descriptor, 1105 Contouring, 580 Contourlet, 974 Contractive color, 1802 Contra-harmonic mean filter, 602 Contrast, 481 Contrast across region, 961 Contrast enhancement, 445 Contrast manipulation, 480 Contrast ratio, 481 Contrast reversal, 481 Contrast sensitivity, 1773 Contrast sensitivity threshold, 1774 Contrast stretching, 481 Contrast theory, 1816 Control point, 1457 Control strategy, 1396 Controlled-continuity splines, 973 Convergence, 1806 Convergence theorem, 1208 Convert images, 36 Convex body, 1109 Convex combination, 1109 Convex deficiency, 998 Convex function, 280 Convex property, 1110 Convex region, 1109 Convex residue, 998 Convex set, 1108 Convex vertex, 839 Convex-hull, 997 Convex-hull construction, 997 Convexity, 1019 Convexity ratio, 1110 Convolution, 340 Convolution back-projection, 639 Convolution layer, 1645 Convolution mask, 509 Convolution neural network, 1644 Convolution operator, 342 Convolution theorem, 341, 406 Convolutional layer, 1645 Convolutional neural network, 1644 Co-occurrence matrix, 1076 Cook-Torrance model, 1714 Cool color, 1802 Cooling schedule for simulated annealing, 568 Cooperative algorithm, 1330 Coordinate system transformation, 186 Coordinate system, 182
Coordinate transformation, 449 Coordinate-dependent threshold, 931 Coordinates of centroid, 1021 Coplanarity, 1049 Coplanarity invariant, 1049 C-optotype, 1785 Copy attack, 707 Copy control, 707 Copy detection, 707 Copy Protection Technical Working Group [CPTWG], 1839 Copyright protection, 705 Core line, 1004 Coring, 1886 Corner detection, 840 Corner detector, 840 Corner feature detector, 840 Corner point, 840 Corner-based Delaunay mesh on a digital image, 1119 Corner-based Delaunay on Voronoï mesh on an image, 1119 Corner-based Voronoï mesh on a digital image, 1119 Coronary angiography, 1831 Correction equation, 1153 Correlated color temperature, 1692 Correlation, 1444 Correlation based optical flow estimation, 1164 Correlation based stereo, 1326 Correlation coefficient, 1449 Correlation distortion metric, 695 Correlation function, 1448 Correlation matching, 1450 Correlation minimum, 1486 Correlation model, 95 Correlation noise, 583 Correlation product, 1486 Correlation quality, 695 Correlation theorem, 408 Correlation-based block matching, 1342 Correlogram, 700 Correspondence, 1341 Correspondence analysis, 1342 Correspondence constraint, 1330 Correspondence map, 1131 Correspondence matching, 1341 Correspondence problem, 1342 Correspondence-based morphing, 615 Corresponding, 1307 Corresponding points, 1341 cos² fringe, 1383 cos⁴ law of illuminance, 1363
1887 Cosine diffuser, 213 Cosine integral image, 24 Cosine transform, 412 Cosine-cubed law, 217 Cosmic radiation, 1688 Cosmic ray, 1688 Cost function, 910 Counterclockwise spectrum, 726 Count-median sketch, 1640 Count-min sketch, 1640 Coupled hidden Markov model, 1594 Covariance matrix, 1584 Covariance of a random vector, 1582 Covariance propagation, 1585 Covariance, 1584 Cover, 286, 698 Cover of a set, 286 Cover work, 697 Covert channel, 718 CPh (computational photography), 1261 CPTWG (Copy Protection Technical Working Group), 1839 Crack code, 991 Crack detection, 1825 Crack edge, 1467 Crack following, 992 Cramer-Rao lower bound, 290 Crease, 1467 CRF (conditional random field), 1588 Crimmins smoothing operator, 525 Crisp set theory, 1484 Criteria for analytical method, 958 Critical bandwidth, 1811 Critical flicker frequency, 1783 Critical motion, 122 Cross entropy, 1504 Cross ratio, 1167 Cross section, 5, 1319 Cross section of the human eye, 1761 Cross-categorization learning, 1414 Cross-correlation, 1445 Cross-correlation matching, 1446 Cross-correlation of random fields, 1586 Cross-covariance of random fields, 1587 Crossing number, 1030 Crossover, 1481 Cross-sectional function, 1319 Cross-validation, 1227 Crowd flow analysis, 1563 CRT (cathode-ray tube), 260 Cryptography, 717 Crystallize, 1119 CS (compressed sampling), 1635
1888 CS (compressive sampling), 1635 CS (compressive sensing), 1635 CSG (constructive solid geometry), 1312 CSR (compressed sparse row), 1638 CT (computed tomography), 625 CT image, 24 CTB (coding tree block), 1847 CTU (coding tree unit), 1847 CU (coding unit), 1847 Cube map, 1287 Cubic spline, 973 Cuboid descriptor, 1553 Cumani operator, 887 Cumulative abnormality score, 1571 Cumulative distribution function [CDF], 488 Cumulative effect of space on vision, 1780 Cumulative effect of time on vision, 1782 Cumulative histogram, 488 Cumulative scene vector, 1472 Currency verification, 1825 Curse of dimensionality, 278 Cursive script recognition, 1830 Curvature, 1018 Curvature estimates, 1018 Curvature of boundary, 1018 Curvature of field, 149 Curvature primal sketch, 1018 Curvature scale space, 815 Curvature segmentation, 1297 Curvature sign patch classification, 1297 Curvature-driven diffusion, 567 Curve, 1275 Curve approximation, 985 Curve binormal, 1279 Curve bi-tangent, 981 Curve evolution, 980 Curve fitting, 984 Curve inflection, 1276 Curve invariant, 1276 Curve invariant point, 1275 Curve matching, 1452 Curve normal, 1276 Curve pyramid, 818 Curve representation system, 979 Curve saliency, 980 Curve segmentation, 846 Curve smoothing, 980 Curve tangent vector, 981 Curvelet, 418 Curvelet transform, 418 Cusp point, 1274 Cut detection, 1536 Cut edge, 1614
Cut vertex, 1615 Cutoff frequency, 541 Cutset sampling, 1610 CV (computer vision), 65 Cyan [C], 728 Cybernetics, 1392 Cybernetics knowledge, 1392 Cycle, 1616 Cyclic action, 1557 Cyclopean eye, 1787 Cyclopean image, 23 Cyclopean separation, 1788 Cyclopean view, 1788 Cylinder anamorphosis, 1809 Cylinder extraction, 1296 Cylinder patch extraction, 1296 Cylindrical coordinates, 184 Cylindrical mosaic, 1498 Cylindrical surface region, 1296
D DAG (directed acyclic graph), 1206, 1623 DAP (direct attribute prediction), 1034 Dark adaptation, 1766 Dark channel prior, 609 Dark current, 586 Dark current noise, 586 Dark noise, 586 Dark-field illumination, 221 Data association, 1150 Data augmentation, 1664 Data compression, 654 Data density, 48 Data fusion, 95 Data Hiding Sub-Group [DHSG], 1839 Data integration, 95 Data mining, 1390 Data parallelism, 347 Data payload, 693 Data reduction, 654 Data redundancy, 653 Data structure, 346 Database, 1851 Database of image functions, 1402 Data-driven, 1393 Daylight vision, 1774 DBN (dynamic Bayesian network), 1610 DBN (dynamic belief network), 1610 DC (directional contrast), 1038 DCD (dominant color descriptor), 1530 DCNN (deep convolution neural network), 1643
Index DCT (discrete cosine transform), 412 DDR (digitally reconstructed radiograph), 625 DDW (double domain watermarking), 704 Deblocking filter, 799 Deblocking, 663 Deblur, 565 Deblurring, 565 Decay factor, 596 Decimation, 246 Decision, 1220 Decision boundary, 1222 Decision forest, 1414 Decision function, 1220 Decision layer fusion, 1506 Decision region, 1222 Decision rule, 1221 Decision surface, 1222 Decision theory, 1221 Decision tree, 1222 Declarative model, 1563 Declarative representation, 1401 Decoding, 648 Decoding image, 35 Decoding reconstruction, 1639 Decomposable filter, 544 Decomposition methods, 282 Decompression, 649 Deconvolution, 341 Decryption, 717 DED (differential edge detector), 877 Deep convolution neural network [DCNN], 1643 Deep learning, 1643 Deep neural network [DNN], 1643 Deficit of convexity, 998 Definition, 479, 1773 Defocus, 137 Defocus blur, 575 Defogging, 608 Deformable contour, 918 Deformable model, 987 Deformable shape, 1099 Deformable shape registration, 1460 Deformable superquadric, 1102 Deformable template, 987 Deformable template model, 987 Deformation energy, 1102 De-fuzzification, 1487 Degenerate configuration, 231 Degenerate vertex, 839 Degradation, 576 Degradation system, 576 Degraded image, 35
1889 Degree, 1614 Degree of a graph node, 1618 Degree of expansion, 1114 Degree of freedom [DoF], 185 Dehazing, 609 De-interlacing, 782 Delaunay mesh, 1120 Delaunay mesh nerve, 1120 Delaunay nucleus, 1120 Delaunay on Voronoï mesh overlay, 1119 Delaunay tessellation, 1120 Delaunay triangle, 1120 Delaunay triangle region, 1120 Delaunay triangulation, 1119 Delaunay wedge, 1120 Delaunay-on-Voronoï mesh nerve on an image, 1119 Delaunay-on-Voronoï mesh on an image, 1119 Delete attack, 708 Demon, 1396 Demosaicing, 754 Dempster-Shafer, 1509 Dempster-Shafer theory, 1509 Dendrogram, 1613 Denoising, 596 Dense correspondence, 1342 Dense disparity map, 1343 Dense optical flow computation, 1160 Dense reconstruction, 1343 Dense stereo matching, 1343 Densitometer, 1675 Densitometry, 1023 Density, 44 Density estimation, 44 Density of a region, 1023 Density-weighted center, 1014 Dependency map, 1633 Depth, 12 Depth acuity, 1328 Depth distortion, 1358 Depth estimation, 1357 Depth from defocus, 1377 Depth from focus, 1377 Depth gradient image, 31 Depth image, 31 Depth image edge detector, 1358 Depth information, 1357 Depth map, 31 Depth map interpolation, 1357 Depth of field, 138 Depth of focus, 190 Depth perception, 1322 Depth resolution, 45
1890 Depth sensor, 101 Deriche edge detector, 891 Deriche filter, 892 Derivative estimation, 522 Derivative estimation by function fitting, 522 Derivative of a color image, 768 Derivative of Gaussian filter, 534 Derivative-based search, 1656 Derived measurement, 1045 Derived metric, 1045 Descattering, 565 Descendant node, 1614 Descender, 1829 Description coding, 657 Description-based on boundary, 1012, 1103 Description-based on contour, 1104 Description based on region, 1012 Descriptions of object relationship, 1027 Descriptor, 1012 Descriptor for boundary, 1012 Descriptor for contour, 1012 Descriptor for region, 1012, 1020 Descriptor of motion trajectory, 1861 Design of visual pattern classifier, 1199 Destriping, 579 Destriping correction, 579 Detail component, 1243 Detect then track, 1149 Detection attack, 708 Detection criterion, 890 Detection of pepper-and-salt noise, 592 Detection probability ratio, 959 Detection rate, 1228 Detection region, 698 Deterministic approach, 599 Device-independent bitmap [DIB], 273 DFD (displaced frame difference), 796 DFFS (distance from face space), 1243 DFT (discrete Fourier transform), 399 DGM (directed graphical model), 1622 DHSG (Data Hiding Sub-Group), 1839 DIAC (dual of the image of the absolute conic), 123 Diagonal approximation of Hessian matrix, 898 Diagonal covariance matrix, 1585 Diagonal-adjacency, 362 Diagonal-adjacent, 362 Diagonalization of matrix, 575 Diagonalization, 575 Diagonal-neighborhood, 360 Diagram analysis, 1465 Diameter bisection, 848 Diameter of boundary, 1015
Index Diameter of contour, 1016 Diaphragm, 141 DIB (device-independent bitmap), 273 Dichroic filter, 157 Dichromat, 1799 Dichromatic model, 1706 Dichromatic reflection model [DRM], 1706 Dichromatic vision, 1799 DICOM (digital imaging and communications in medicine), 1862 DICOM data element, 1862 DICOM data set, 1862 DICOM file meta-information, 1862 DICOM format, 1862 Dictionary-based coding, 686 Dictionary-based compression, 686 Diffeomorphism, 1194 Difference distortion metric, 695 Difference image, 474 Difference matting, 856 Difference of Gaussian [DoG], 535 Difference of Gaussian [DoG] filter, 534 Difference of Gaussian [DoG] operator, 535 Difference of low-pass [DOLP] transforms, 820 Difference of offset Gaussian [DooG] filter, 535 Differential coding, 680 Differential edge detector [DED], 877 Differential geometry, 278 Differential image compression, 680 Differential invariant, 278 Differential pulse-code modulation [DPCM], 671 Differential threshold, 1679 Differentiation filtering, 881 Diffraction, 1721 Diffraction grating, 1722 Diffraction limit, 1722 Diffraction-limited, 1722 Diffuse bright-field back-light illumination, 223 Diffuse bright-field front light illumination, 223 Diffuse illumination, 220 Diffuse reflection, 1702 Diffuse reflection surface, 1683 Diffuse transmission, 1703 Diffusion, 446 Diffusion filtering, 534 Diffusion function, 446 Diffusion MRI tractography, 629 Diffusion smoothing, 533 Diffusion tensor imaging, 178 DIFS (distance in face space), 1242 Digamma function, 485 Digital, 6
Index Digital arc, 367 Digital camera, 108 Digital camera model, 109 Digital chord, 250 Digital curve, 251 Digital elevation map, 1834 Digital filter, 540 Digital fingerprints, 693 Digital geometry, 249 Digital heritage, 1835 Digital holography, 196 Digital image, 6 Digital image forensics, 713 Digital image processing [DIP], 59 Digital image watermarking, 699 Digital imaging and communications in medicine [DICOM], 1862 Digital micro-mirror device [DMD], 165 Digital panoramic radiography, 238 Digital photogrammetry, 239 Digital pixel sensors [DPS], 100 Digital radiography, 238 Digital rights management [DRM], 705 Digital signal processor [DSP], 335 Digital signal processor [DSP] chip, 335 Digital signature, 1256 Digital single lens reflex [DSLR] camera, 115 Digital straight line segment, 251 Digital subtraction angiography, 1831 Digital television [DTV], 1819 Digital terrain map, 1834 Digital terrain model [DTM], 1834 Digital topology, 78 Digital versatile disc [DVD], 270 Digital video, 774 Digital video disc [DVD], 775 Digital video express [DIVX], 775 Digital video formats, 774 Digital video image, 6 Digital video processing, 775 Digital video signal, 775 Digital video standards, 774 Digital visual space, 7 Digital Voronoï topology, 79 Digital watermarking, 699 Digitally reconstructed radiograph [DDR], 625 Digitization, 249 Digitization box, 249 Digitization model, 249 Digitization scheme, 249 Digitizer, 249 Dihedral edge, 869 Dilate operator, 1727
1891 Dilation, 1727 Dimension, 278 Dimensional reduction technique, 1190 Dimensionality, 278 Dimensionality reduction, 1189 Dimensionlessness, 1104 DIP (digital image processing), 59 Direct attenuation, 611 Direct attribute prediction [DAP], 1034 Direct component, 410 Direct filtering, 792 Direct gray-level mapping, 495 Direct imaging, 176 Direct LDA [DLDA], 1216 Direct linear transform [DLT], 116 Direct message codes, 690 Direct metric, 1044 Direct sparse matrix techniques, 1638 Direct sparse method, 1638 Directed acyclic graph [DAG], 1206, 1623 Directed acyclic graph SVM [DAGSVM], 1623 Directed bright-field back-light illumination, 222 Directed bright-field front-light illumination, 221 Directed cycle, 1623 Directed dark-field front-light illumination, 222 Directed edge, 1623 Directed graph, 1622 Directed graphical model [DGM], 1622 Directed illumination, 215 Direction, 531 Direction detection error, 882 Directional contrast [DC], 1038 Directional curvatures, 1668 Directional derivative, 531 Directional derivative filter, 530 Directional difference filter, 532, 878 Directional differential operator, 532, 877 Directional divergence, 1504 Directional filter, 543 Directional response, 881 Directional selectivity, 532 Directional transmission, 1704 Directionality, 1068 Direction-caused motion parallax, 1793 Dirichlet distribution, 333 Dirichlet prior, 334 Dirichlet process mixture model, 334 Dirichlet tessellation, 1122 Dirty paper codes, 696 Discontinuity detection, 861 Discontinuity preserving filtering, 937
1892 Discontinuity preserving regularization, 570 Discontinuous event tracking, 1149 Discrepancy criteria, 962 Discrete, 250 Discrete cosine transform [DCT], 412 Discrete curvature, 1018 Discrete curve evolution, 980 Discrete disk, 381 Discrete distance, 371 Discrete Fourier transform [DFT], 399 Discrete geometry, 9 Discrete Hartley transform, 435 Discrete Karhunen-Loève [KL] transform, 426 Discrete KL transform, 426 Discrete labeling, 1482 Discrete object, 9 Discrete random variable, 1580 Discrete relaxation, 1482 Discrete relaxation labeling, 1483 Discrete sine transform [DST], 413 Discrete straightness, 250 Discrete wavelet transform [DWT], 415 Discreteness, 1404 Discriminant function, 1214 Discrimination function, 1214 Discrimination-generating Hough transform, 854 Discrimination power, 1214 Discriminative attribute, 1032 Discriminative graphical model, 1215 Discriminative model, 1215 Discriminative random field [DRF], 1590 Disjoint view, 171 Disjunction, 1407 Disordered texture, 1090 Disparity, 1327 Disparity error detection and correction, 1352 Disparity gradient, 1328 Disparity gradient limit, 1329 Disparity limit, 1328 Disparity map, 1329 Disparity search range, 1352 Disparity smoothness constraint, 1331 Disparity space image [DSI], 28 Dispersion, 729, 1404, 1720 Displaced frame difference [DFD], 796 Displacement field, 1157 Displacement vector [DV], 1157 Display, 258 Dissimilarity metric, 1430 Dissolve, 1536 Dissymmetric closing, 1728 Distance from face space [DFFS], 1243
Index Distance function, 371 Distance in face space [DIFS], 1242 Distance map, 382 Distance measure, 375 Distance measurement based on chain code, 375 Distance measurement based on local mask, 384 Distance measuring function, 375 Distance metric, 375 Distance perception, 1322 Distance sensor, 101 Distance transform [DT], 382 Distance-based approach, 1243 Distance-caused motion parallax, 1793 Distortion, 577 Distortion coefficient, 128 Distortion measure [DM], 1668 Distortion metric, 695 Distortion metric for watermarks, 695 Distortion polynomial, 127 Distortion suppression, 614 Distributed behavior, 1572 Distributed camera network, 116 Distribution function, 321 Distribution of point objects, 1027 Distribution of unwatermarked works, 698 Distribution overlap, 319 Distribution-free decision rule, 1221 Distributivity of Fourier transform, 410 Dithered index modulation, 697 Dithered quantization, 264 Dithering, 264 Dithering technique, 264 Divide and conquer, 994 Divisive clustering, 941 DIVX (digital video express), 775 DLA (dynamic link architecture), 1439 DLDA (direct LDA), 1216 DLT (direct linear transform), 116 DM (distortion measure), 1668 DM (Δ-modulation), 670 DMD (digital micro-mirror device), 165 DNN (deep neural network), 1643 Document analysis, 1830 Document mosaicing, 1498 Document retrieval, 1831 Dodge, 107 DoF (degree of freedom), 185 DoG (difference of Gaussian), 535 DoG (difference of Gaussian) filter, 534 DoG (difference of Gaussian) operator, 535 Dollying, 92
Index DOLP (difference of low-pass) transforms, 820 Domain, 249 Domain of a function, 279 Domain scaling law, 405 Dome-shaped light panel with a diffuser in front of the lights, 225 Dominant color descriptor [DCD], 1530 Dominant motion direction, 1128 Dominant plane, 576 Dominant wavelength, 722 Don’t care element, 1730 DooG (difference of offset Gaussian) filter, 535 Doppler, 1142 Dot per inch, 48 Double domain watermarking [DDW], 704 Double-subgraph isomorphism, 1464 Down-conversion, 247 Downhill simplex, 1194 Down-rounding function, 279 Down-sampling, 248 DPCM (differential pulse-code modulation), 671 DPS (digital pixel sensors), 100 Drawing, 7 Drawing exchange file [DXF], 276 Drawing sketch, 1519 DRF (discriminative random field), 1590 DRM (dichromatic reflection model), 1706 DRM (digital rights management), 705 Dropout, 1649 D-separation, 1626 DSI (disparity space image), 28 DSLR (digital single lens reflex) camera, 115 DSP (digital signal processor), 335 DSP (digital signal processor) chip, 335 DST (discrete sine transform), 413 DT (distance transform), 382 DTM (digital terrain model), 1834 DTV (digital television), 1819 Dual graph, 1616 Dual of the image of the absolute conic [DIAC], 123 Dual problem, 305 Dual quaternion, 465 Dual-energy gamma densitometry, 1044 Duality, 1735 Duality of binary dilation and erosion, 1735 Duality of binary opening and closing, 1735 Duality of gray-level dilation and erosion, 1748 Duality of gray-level opening and closing, 1748 Dual-tree wavelet, 425 Duplicate image retrieval, 1515
1893 Duplication correction, 606 Duplicity theory of vision, 1707 Dust filtering, 597 DV (displacement vector), 1157 DVD (digital versatile disc), 270 DVD (digital video disc), 775 DWT (discrete wavelet transform), 415 DXF (drawing exchange file), 276 Dyadic order of Walsh functions, 392 Dye-sublimation printer, 263 Dynamic appearance model, 1012 Dynamic Bayesian network [DBN], 1610 Dynamic belief network [DBN], 1610 Dynamic correlation, 1397 Dynamic event, 1567 Dynamic geometry of surface form and appearance, 1260 Dynamic imagery, 777 Dynamic index of depth, 1793 Dynamic link architecture [DLA], 1439 Dynamic Markov random field, 1592 Dynamic occlusion, 1143 Dynamic pattern matching, 1436 Dynamic programming, 911 Dynamic range, 482 Dynamic range compression, 482 Dynamic range of camera, 87 Dynamic recognition, 1169 Dynamic scene, 1475 Dynamic scene analysis, 1475 Dynamic snakes, 916 Dynamic stereo, 1371 Dynamic texture, 1090 Dynamic texture analysis, 1090 Dynamic threshold, 930 Dynamic time warping, 1442 Dynamic topic model, 1568 Dynamic training, 1648
E EAG (effective average gradient), 932 Early stopping, 1647 Early vision, 65 Earth mover’s distance [EMD], 379 Ebbinghaus illusion, 1812 EBMA (exhaustive search block-matching algorithm), 795 Eccentricity, 1115 Eccentricity transform, 1115 Echocardiography, 22 ECM (expectation conditional maximization), 1663
1894 EDCT (even-symmetric discrete cosine transform), 413 Edelsbrunner-Harer Convex Combination Method, 998 Edge, 860 Edge adaptive smoothing, 525 Edge description parameter, 871 Edge detection, 859, 861 Edge detection operator, 862 Edge detection steps, 862 Edge detector, 862 Edge direction, 872 Edge editing, 444 Edge eigen-vector weighted Hausdorff distance, 1435 Edge element, 861 Edge element combination, 902 Edge element extraction, 861 Edge enhancement, 444 Edge extraction, 861 Edge finding, 861 Edge following, 903 Edge frequency-weighted Hausdorff distance, 1435 Edge gradient, 875 Edge gradient image, 875 Edge grouping, 902 Edge histogram descriptor [EHD], 873 Edge image, 35 Edge in range image, 1358 Edge linking, 902 Edge magnitude, 872 Edge map, 862 Edge matching, 1454 Edge motion, 1143 Edge orientation, 872 Edge per unit area, 872 Edge pixel, 872 Edge point, 871 Edge preserving smoothing, 524 Edge set, 1618 Edge sharpening, 520 Edge template, 510 Edge tracking, 903 Edge type labeling, 868 Edge-based segmentation, 834 Edge-based stereo, 1326 Edge-induced subgraph, 1617 Edgel, 861 Edgelet, 861 Edge-point histogram, 871 Edge-point linking, 909 EDST (even-antisymmetric discrete sine transform), 414
Index Effective average gradient [EAG], 932 Effective focal length [EFL], 139 Efficiency, 1051 Efficiency of an algorithm, 350 Efficiency of retrieval, 1517 EFL (effective focal length), 139 EFM (enhance FLD model), 1217 EGI (extended Gaussian image), 29 Egomotion, 1133 Egomotion estimation, 1133 EHD (edge histogram descriptor), 873 EIA (Electronic Industries Association), 1840 EIA-170, 1863 Eigen-decomposition, 428 Eigenface, 1241 Eigenfilter, 1074 Eigenimage, 38 Eigenregion, 1178 Eigenspace representation, 428 Eigenspace-based recognition, 1173 Eigentracking, 1147 Eigenvalue, 428 Eigenvalue analysis, 428 Eigenvalue decomposition, 428 Eigenvalue transform, 428 Eigenvector, 428 Eigenvector projection, 428 e-insensitive error function, 1207 EIT (electrical impedance tomography), 625 Elastic deformation, 1439 Elastic graph matching, 1439 Elastic matching, 1439 Elastic registration, 1455 Electrical impedance tomography [EIT], 625 Electromagnetic optical imaging, 181 Electromagnetic [EM] radiation, 1687 Electromagnetic spectrum, 1687 Electron beam computed tomography, 662 Electron microscopy, 160 Electronic image, 33 Electronic Industries Association [EIA], 1840 Electronic shot noise, 586 Electronic shutter, 106 Electronically tunable filter [ETF], 157 Electron-pair annihilation, 628 Electrostatic process, 264 Elementwise operation, 470 Elevation, 1757 Elimination attack, 708 Ellipse of inertia, 1009 Ellipsoid, 1312 Elliptic snake, 917 Elliptical K-means, 320 Elongatedness, 1114
Index Elongation, 1111 EM (Expectation Maximization), 1663 EM (electromagnetic) radiation, 1687 Embedded coding, 659 Embedder, 691 Embedding attack, 709 Embedding function, 920 Embedding region, 698 Emboss filter, 506 EMD (Earth mover’s distance), 379 EMD (empirical mode decomposition), 1495 Emission computed tomography, 626 Emission CT, 626 Emission image, 36 Emission spectroscopy, 1697 Emission tomography, 626 Emissivity, 1680 Emissive color, 1700 Emotional-layer semantic, 1546 Emotional state recognition, 1546 Empirical Bayes, 1413 Empirical color representation, 1700 Empirical discrepancy method, 961 Empirical evaluation, 960 Empirical goodness method, 960 Empirical method, 959 Empirical mode decomposition [EMD], 1495 Empty interior set, 974 Empty sub-image, 18 Encapsulated PostScript format, 1862 Encoding, 652 Encryption, 717 End member, 14 End-member extraction, 14 Endoscope, 1831 Energy function, 913 Energy minimization, 914 Energy-based segmentation, 834 Engineering optics, 1672 Enhance FLD model [EFM], 1217 ENP (entrance pupil), 174 Ensemble learning, 1414 Entity, 170 Entrance pupil [ENP], 174 Entropy, 1500 Entropy coding, 665 Entropy of image, 664 Entropy of random variables, 1582 Environment map, 36 Environment matte, 856 EP (expectation propagation), 1427 Epanechnikov kernel, 937 EPI (epipolar plane image), 33
1895 EPI strip, 34 Epipolar constraint, 1332 Epipolar correspondence matching, 1332 Epipolar geometry, 1331 Epipolar line, 1332 Epipolar line constraint, 1332 Epipolar plane, 1333 Epipolar plane image [EPI], 33 Epipolar plane image analysis, 1387 Epipolar plane motion, 1333 Epipolar standard geometry, 1335 Epipolar transfer, 1332 Epipolar volume, 1333 Epipole, 1332 Epipole location, 1332 Episode, 1533 Epitome model, 1520 Equal energy spectrum on the wavelength basis, 1699 Equal-distance disk, 381 Equal-energy spectrum, 1699 Equal-interval quantizing, 248 Equality constraint, 1662 Equalization, 490 Equal-probability quantizing, 248 Equidensity, 124 Equidifference weft multicone projection, 236 Equidistance projection, 236 Equivalence relation, 287 Equivalent ellipse of inertia, 1009, 1450 Equivalent f-number, 142 Erasable watermarking, 701 Erf function, 312 Ergodic random field, 1587 Ergodicity, 1588 Erlang noise, 589 Erode operator, 1727 Erosion, 1727 Error, 1058 Error analysis, 1058 Error backpropagation, 1646 Error correction, 1059 Error correction encoder, 650 Error function, 312 Error propagation, 1059 Error rates, 1058 Error-correcting code, 650 Error-correction training procedure, 1059 Essential matrix, 1335 E step (expectation step), 1663 Estimating the degradation function, 562 Estimation theory, 660 ETF (electronically tunable filter), 157
1896 Euclidean distance, 372 Euclidean distance transform [DT], 382 Euclidean geometry, 277 Euclidean reconstruction, 1324 Euclidean space, 277 Euclidean transformation, 456 Euler angles, 89 Euler number, 1028 Euler number of 3-D object, 1029 Euler number of region, 1028 Euler theorem for polyhedral object, 1316 Euler vector, 1029 Euler-Lagrange, 302 Euler-Lagrange equations, 302 Euler-Poincaré characteristic, 1028 Euler’s formula, 1029 EV (exposure value), 194 Evaluation criteria, 957 Evaluation framework, 954 Evaluation measure, 958 Evaluation metric, 958 Evaluation of image fusion result, 1499 Even field, 781 Even function, 281 Even-antisymmetric discrete sine transform [EDST], 414 Even-chain code, 988 Even-symmetric discrete cosine transform [EDCT], 413 Event, 1550 Event detection, 1566 Event knowledge, 1568 Event understanding, 1567 Event video, 1568 Evidence approximation, 1413 Evidence reasoning, 1509 Evidence reasoning fusion, 1509 Evidential reasoning method, 1509 Evolving fronts, 913 Evolving interfaces, 913 Exact authentication, 712 Exact evaluation of Hessian matrix, 898 Exact set, 1511 Example-based super-resolution, 810 Example-based texture synthesis, 1092 Excess kurtosis, 1581 Excitation purity, 736 Exhaustive matching, 1429 Exhaustive search block-matching algorithm [EBMA], 795 Existential quantifier, 1407 Exit pupil [EXP], 174 EXP (exit pupil), 174
Index Expansion coefficient, 419 Expansion function, 419 Expansion move, 923 Expansive color, 1802 Expectation, 308 Expectation conditional maximization [ECM], 1663 Expectation of first-order derivative, 880 Expectation propagation [EP], 1427 Expectation step [E step], 1663 Expectation value, 309 Expectation-Maximization [EM], 1663 Expected value, 309 Expert system, 1403 Expert system for segmentation, 956 Explicit registration, 1243 Exploitation, 1417 Exploration, 1417 Exponential distribution, 329 Exponential family, 324 Exponential high-pass filter, 552 Exponential hull, 1000 Exponential low-pass filter, 549 Exponential noise, 588 Exponential smoothing, 525 Exponential transformation, 483 Exponential twist, 463 Exposure, 193 Exposure bracketing, 193 Exposure differences, 194 Exposure time, 194 Exposure value [EV], 194 Expression, 1246 Expression classification, 1247 Expression feature vector, 1248 Expression recognition, 1247 Expression tensor, 1246 Expression understanding, 1247 EXR, 274 EXR (OpenEXR), 274 Extended Gaussian image [EGI], 29 Extended graphics array [XGA], 258 Extended Kalman filtering, 1153 Extended light source, 214 Extended source, 214 Extending binary morphology to gray-level morphology, 1745 Extensive, 1726 Extensive variables, 1490 Exterior border point, 902 Exterior camera parameter, 88 Exterior orientation, 125 Exterior parameters of camera, 88
External camera calibration, 125 External coefficient, 88 External energy function, 914 External energy, 914 External force, 914 External Gaussian normalization, 1527 External parameter, 88 External pixels of the region, 905 External representation, 970 Extinction coefficient, 612 Extracted mark, 691 Extraction function, 691 Extraction of connected component, 370 Extraction of skeleton, 1005 Extrastriate cortex, 1758 Extremal pixel, 10 Extremal point, 999 Extremes of a region, 1016 Extrinsic calibration, 125 Extrinsic camera calibration, 125 Extrinsic parameters, 88 Extrude, 469 Eye, 1761 Eye location, 1239 Eye movement, 1806 Eye moving theory, 1814 Eye tracking, 1240 Eye-camera analogy, 1761
F FA (factor analysis), 1573 Face analysis, 1244 Face authentication, 1245 Face detection, 1236 Face detection and localization, 1236 Face feature detection, 1238 Face identification, 1245 Face image, 21 Face indexing, 1245 Face knowledge-based approach, 1238 Face liveness detection, 1240 Face modeling, 1244 Face perception, 1244 Face recognition, 1241 Face registration, 1245 Face subspace, 1234 Face tracker, 1239 Face tracking, 1239 Face transfer, 1245 Face verification, 1245 Face-image analysis, 1244 Face-inversion effect, 1244
1897 Face-Recognition Technology [FERET] Program, 1243 Facet model, 13 Facet model extraction, 14 Facet model-based extraction, 1306 Facial action, 1248 Facial animation, 1250 Facial definition parameters [FDP], 1845 Facial expression, 1246 Facial expression analysis, 1247 Facial expression classification, 1247 Facial expression feature extraction, 1248 Facial expression feature tracking, 1248 Facial expression recognition, 1247 Facial feature, 1240 Facial features detection, 1240 Facial motion capture, 1250 Facial organ, 1239 Facial organ extraction, 1240 Facial organ tracking, 1240 Facial action keyframe, 1250 Facial-action coding system [FACS], 1249 FACS (facial-action coding system), 1249 Facsimile telegraphy, 1835 Factor analysis [FA], 1573 Factor graph, 1620 Factored-state hierarchical hidden Markov model [FS-HHMM], 1594 Factorial conditional random field [FCRF], 1589 Factorial hidden Markov model, 1594 Factorization, 1129 Factors influencing measurement errors, 1055 Fade in, 1536 Fade out, 1536 False alarm, 1230 False alarm rate, 1230 False color, 763 False contour, 580 False contouring, 580 False identification, 1230 False negative, 1230 False negative probability, 1230 False positive, 1229 False positive probability, 1230 False positive rate [FPR], 1230 False-color enhancement, 762 False-color image, 21 Fan system, 1850 Fan-beam projection, 633 Fan-beam reconstruction, 634 Far light source, 214 Fast algorithm for block matching, 794
1898 Fast and efficient lossless image compression system [FELICS], 665 Fast elliptical filtering, 544 Fast Fourier transform [FFT], 397 FAST interest point detector, 1554 Fast marching method [FMM], 945 Fast marginal likelihood maximization, 1641 Fast multiplication by the Hessian, 898 Fast NMF, 1192 Fast wavelet transform [FWT], 416 Fax, 1835 FBIR (feature-based image retrieval), 1526 FCRF (factorial conditional random field), 1589 FDP (facial definition parameters), 1845 Feathering, 446 Feature, 1184 Feature adjacency graph, 1188 Feature clustering, 1189 Feature compatibility constraint, 1330 Feature contrast, 1187 Feature decomposition, 1248 Feature descriptor, 1187 Feature detection, 1188 Feature dimension reduction, 1189 Feature extraction, 1188 Feature extraction layer, 1189 Feature fusion, 1506 Feature image, 35 Feature layer fusion, 1506 Feature layer semantic, 1526 Feature location, 1188 Feature map, 1645 Feature mapping layer, 1650 Feature match densification, 1443 Feature match verification, 1443 Feature matching, 1442 Feature measurement, 1043 Feature motion, 805 Feature orientation, 1187 Feature point, 1443 Feature point correspondence, 1443 Feature point tracking, 1146 Feature representation, 1187 Feature selection, 1188, 1563 Feature selection method, 1188 Feature similarity, 1187 Feature space, 1186 Feature stabilization, 1189 Feature tracking, 1146 Feature tracking using an affine motion model, 1443 Feature tracks, 1146 Feature vector, 1184
Index Feature-based alignment, 1458 Feature-based approach, 1237 Feature-based correspondence analysis, 1443 Feature-based fusion, 1509 Feature-based image retrieval [FBIR], 1526 Feature-based method, 1237 Feature-based optical flow estimation, 1164 Feature-based recognition, 1232 Feature-based stereo, 1326 Feature-based tracking, 1146 Feedback, 1395 Feedback cybernetics, 1394 Feed-forward neural nets, 1645 Feret box, 997 FELICS (fast and efficient lossless image compression system), 665 FERET database, 1854 FERET (Face-Recognition Technology) Program, 1243 Feret’s diameter, 1016 f-factor, 142 FFL (front focal length), 139 FFT (fast Fourier transform), 397 FI (fragmentation of image), 964 Fiber optics, 271 Fiberscope, 1675 Fidelity factor, 541 Fiducial point, 119 Field angle, 172 Field averaging, 782 Field merging, 782 Field metric, 1044 Field of view [FOV], 170 Field rate, 779 Field stop, 172 Fifth-generation CT scanner, 632 Figure of merit [FOM], 925, 962 Figure-ground separation, 69, 829 Fill factor, 96 Filled area, 1022 Filling, 607 Film camera, 106 Film scanner, 155 Filmless camera, 107 Filter, 157 Filter coefficients, 540 Filter factor, 157 Filter method, 540, 1679 Filter of the back-projection, 639 Filter ringing, 541 Filter selection, 540 Filtered back-projection, 640 Filtering, 539
Index Filtering along motion trajectory, 792 Filtering the image, 539 Filtering threshold, 540 Fine art emulation, 1835 Fine cluster contour, 1124 Fingerprint database indexing, 1254 Fingerprint identification, 1254 Fingerprint indexing, 1253 Fingerprint minutiae, 1253 Fingerprint recognition, 1253 Finite automaton, 1226 Finite difference, 877 Finite element analysis, 304 Finite element model, 304 Finite impulse response [FIR], 339 Finite impulse response [FIR] filter, 339 Finite series-expansion reconstruction method, 642 FIR (finite impulse response), 339 FIR (finite impulse response) filter, 339 FireWire, 1863 First fundamental form, 1281 First order Markov chain, 1598 First-derivative filter, 879 First-generation CT scanner, 630 First-order derivative edge detection, 880 Fish eye lens, 135 Fish tank virtual reality, 79 Fisher criterion, 1220 Fisher criterion function, 1220 Fisher discriminant function, 1216 Fisher face, 1243 Fisher information matrix, 1220 Fisher kernel, 1218 Fisher law, 1339 Fisher LDA (Fisher linear discriminant analysis), 1216 Fisher linear discriminant [FLD], 1218 Fisher linear discriminant analysis [Fisher LDA, FLDA], 1216 Fisher linear discriminant function, 1216, 1220 Fisher-Rao metric, 1430 FITS (flexible image transport system), 1843 Fit text to curve, 1830 Fit text to path, 1830 Fitness, 1482 Fitness function, 1482 Fitting, 1648 Fitting circles, 870 Fitting ellipses, 870 Fitting lines, 870 Fixation, 1805 Fixed pattern noise, 594
1899 Fixed temperature simulated annealing, 568 Fixed-inspection system, 1054 Flange focal distance, 140 Flash and non-flash merging, 238 Flash matting, 857 Flash memory, 270 Flash photography, 237 Flat, 1292 Flat field, 194 Flat filter, 1753 Flat shading, 74 Flat structuring element, 1753 Flat structuring system, 1753 Flatbed scanner, 155 Flat-field correction, 195 Flat-fielding, 194 FLD (Fisher linear discriminant), 1218 FLDA (Fisher linear discriminant analysis), 1216 Flexible image transport system [FITS], 1843 Flexible template, 1097 Flexible-inspection system, 1054 Flicker noise, 589 Flip, 1537 FLIR (forward looking infrared) camera, 109 Floating-point operations [FLOPS], 346 Flooding schedule, 1426 Floor function, 279 Floor operator, 279 FLOPS (floating-point operations), 346 Flow field, 1160 Flow histogram, 1159 Flow vector field, 1160 Flow-based human motion tracking, 1251 Fluorescence microscopy, 161 Fluorescence, 212 Fluorescent lamp, 210 Fluorescent lighting, 216 Fluorescent screen, 261 Flying spot scanner, 155 FM (frequency modulation), 266 FM halftoning technique, 266 FM (frequency modulation) mask, 266 FMM (fast marching method), 945 fMRI (functional magnetic resonance imaging), 629 FND (furthest neighbor distance), 378 f-number, 142 FOA (focus of attention), 1805 FOC (focus of contraction), 1158 Focal depth, 190 Focal length, 139, 188 Focal length ratio, 140
1900 Focal plane, 191 Focal point, 189 Focal surface, 190 Focus, 141 Focus control, 141 Focus following, 1377 Focus invariant imaging, 178 Focus length, 139 Focus length of camera, 140 Focus length of lens, 140 Focus of attention [FOA], 1805 Focus of contraction [FOC], 1158 Focus of expansion [FOE], 1158 Focus series, 190 Focused and tilted ring light, 221 Focused filters, 1560 FOE (focus of expansion), 1158 Fog concentration factor, 609 Fogging, 609 Fold edge, 869 FOM (figure of merit), 925, 962 Foot-of-normal, 853 Foreground motion, 1129 Foreground, 830 Foreground-background, 830 Foreshortening, 226 Forgery attack, 709 Form factor, 1112 Formal definition of image segmentation, 826 Formal grammar, 1225 Formal language, 1225 Formal language theory, 1225 Förstner operator, 840 Forward difference, 877 Forward Hadamard transform, 393 Forward Hadamard transform kernel, 393 Forward image transform, 386 Forward kinematics, 1142 Forward looking infrared [FLIR] camera, 109 Forward looking radar, 163 Forward mapping, 617 Forward motion estimation, 1132 Forward pass, 383 Forward problem, 82 Forward propagation, 1421 Forward transform, 387 Forward transformation kernel, 387 Forward Walsh transform, 391 Forward Walsh transform kernel, 391 Forward warping, 617 Forward zooming, 93 Forward-backward algorithm, 1421 Forward-coupled perceptron, 1210
Index Four-color printing, 744 Four-color process, 744 Four-color theory, 750 Fourier affine theorem, 405 Fourier boundary descriptor, 971 Fourier convolution theorem, 406 Fourier correlation theorem, 407 Fourier descriptor, 411 Fourier domain convolution, 399 Fourier domain inspection, 399 Fourier image processing, 411 Fourier matched filter object recognition, 411 Fourier power spectrum, 400 Fourier rotation theorem, 408 Fourier scaling theorem, 404 Fourier shape descriptor, 411 Fourier shearing theorem, 406 Fourier shift theorem, 407 Fourier similarity theorem, 408 Fourier slice theorem, 403 Fourier space, 409 Fourier space smoothing, 410 Fourier spectrum, 399 Fourier transform [FT], 395 Fourier transform pair, 395 Fourier transform representation, 395 Fourier-based alignment, 1454 Fourier-based computation, 411 Fourier-Bessel transform, 396 Four-nocular stereo matching, 1349 Four-point gradient, 521 Four-point mapping, 454 Fourth-generation CT scanner, 631 Fourth-order cumulant, 1581 FOV (field of view), 170 Fovea, 1763 Fovea image, 22 Foveation, 22 FPR (false positive rate), 1230 Fractal, 75, 76 Fractal coding, 683 Fractal dimension, 75 Fractal image compression, 683 Fractal measure, 75 Fractal model, 76 Fractal model for texture, 1066 Fractal representation, 76 Fractal set, 76 Fractal surface, 76 Fractal texture, 1062 Fraction of correctly segmented pixels, 965 Fragile watermark, 700 Fragmentation of image [FI], 964
Index Frame, 778, 1402 Frame average, 791 Frame buffer, 779 Frame differencing, 792 Frame grabber, 165 Frame image, 27 Frame interpolation, 783 Frame of reference, 185 Frame rate, 779 Frame store, 778 Frame transfer sensor, 101 Free color, 1801 Free end, 843 Free landmark, 975 Freeform surface, 1300 Freeman code, 988 Free-viewpoint video, 777 Freeze-frame video, 777 Frei-Chen orthogonal basis, 850 Frenet frame, 1280 Frequency, 342 Frequency convolution theorem, 341 Frequency domain, 399 Frequency domain coding, 657 Frequency-domain filter, 527 Frequency domain filtering, 545 Frequency domain filtering technique, 545 Frequency domain motion detection, 1136 Frequency domain sampling theorem, 242 Frequency domain watermarking, 703 Frequency filter, 545 Frequency filtering technique, 545 Frequency hopping, 271 Frequency-modulated halftoning technique, 266 Frequency-modulated interference measurement, 208 Frequency modulation [FM], 266 Frequency modulation [FM] mask, 266 Frequency response function, 399 Frequency sensitivity, 1784 Frequency space correlation, 1445 Frequency spectrum, 399 Frequency window, 343 Frequentist probability, 308 Fresnel diffraction, 1722 Fresnel equations, 1717 Fresnel lens, 135 Fresnel reflection, 1713 Frobenius norm, 300 From optical flow to surface orientation, 1370 Front focal length [FFL], 139 Front-light, 212
1901 Front porch, 783 Frontal, 191 Frontal plane, 191 Frontier point, 1274 Front-lighting, 216 FS-HHMM (factored-state hierarchical hidden Markov model), 1594 FS (full-search) method, 795 f-stop, 142 FT (Fourier transform), 395 Full bleed, 264 Full perspective projection, 229 Full prime sketch, 1268 Full-color enhancement, 757 Full-color process, 744 Full-frame sensors, 100 Full-frame transfer sensor, 100 Full-search [FS] method, 795 Fully autonomous humanoid robot soccer, 1827 Fully connected graph, 1619 Function based model, 1014 Function based recognition, 1175 Function interpolation, 1648 Function of contingence-angle-versus-arclength, 978 Function of distance-versus-angle, 977 Function of distance-versus-arc-length, 976 Functional combination, 1092 Functional magnetic resonance imaging [fMRI], 629 Functional matrix, 771 Functional optimization, 1655 Functional representation, 1014 Fundamental form, 1281 Fundamental matrix, 1335 Fundamental radiometric relation, 1472 Furthest neighbor distance [FND], 378 Fused image, 37 Fusion, 1495 Fusion based on rough set theory, 1510 Fusion move, 1495 Fuzzy adjacency, 1485 Fuzzy affinity, 1484 Fuzzy c-means cluster analysis, 1214 Fuzzy complement, 1485 Fuzzy composition, 1486 Fuzzy connectedness, 1484 Fuzzy intersection, 1485 Fuzzy k-means cluster analysis, 1214 Fuzzy logic, 1480 Fuzzy morphology, 1724 Fuzzy pattern recognition, 1173 Fuzzy reasoning, 1486
Fuzzy rule, 1486 Fuzzy set, 1483 Fuzzy set theory, 1483 Fuzzy similarity measure, 1485 Fuzzy solution space, 1486 Fuzzy space, 1485 Fuzzy union, 1485 FWT (fast wavelet transform), 416
G G (green), 728 G3 format, 1841 G4 format, 1841 GA (genetic algorithm), 1481 Gabor filter, 436 Gabor spectrum, 437 Gabor transform, 435 Gabor wavelet, 423 Gain noise, 593 Gait analysis, 1252 Gait classification, 1252 Galerkin approximation, 1152 Gallery image, 24 Gallery set, 1167 Gamut, 762 Gamut mapping, 761 Ganglion cell, 1767 Gap closing, 1497 Garbage matte, 858 Gating function, 320 Gauge coordinates, 1054 Gauging, 1054 Gaussian, 533 Gaussian averaging, 534 Gaussian band-pass filter, 554 Gaussian blur, 533 Gaussian blur filter, 533 Gaussian convolution, 533 Gaussian curvature, 1290 Gaussian derivative, 533 Gaussian distribution, 324 Gaussian filter, 533 Gaussian focal length, 140 Gaussian high-pass filter, 551 Gaussian image pyramid, 821 Gaussian kernel, 1658 Gaussian low-pass filter, 549 Gaussian map, 28 Gaussian Markov random field [GMRF], 1592 Gaussian mixture, 320 Gaussian mixture model [GMM], 320 Gaussian noise, 587
Index Gaussian optics, 1673 Gaussian process, 1179 Gaussian process classification, 1179 Gaussian process latent variable model [GP-LVM], 1179 Gaussian process regression, 293 Gaussian pyramid, 821 Gaussian pyramid feature, 821 Gaussian random field, 1179 Gaussian smoothing, 533 Gaussian speckle, 595 Gaussian sphere, 29 Gaussian-Gamma distribution, 327 Gaussian-Hermite moment, 46 Gaussian-Wishart distribution, 329 Gaze change, 1806 Gaze control, 1265 Gaze correction, 1806 Gaze direction estimation, 1233 Gaze direction tracking, 1233 Gaze location, 1233 Gaze stabilization, 1805 GC (gray-level contrast), 45 GCRF (generalization of CRF), 1589 GDSI (generalized disparity space image), 28 GEM (generalized expectation maximization), 1663 Geman–McClure function, 1462 General camera model, 174 General evaluation framework, 955 General grammar, 1224 General image degradation model, 574 General image matching, 1429 General image representation function, 4 General scheme for segmentation and its evaluation, 956 Generality of evaluation, 967 Generalization of CRF [GCRF], 1589 Generalization sequence, 1182 Generalized cones, 1317 Generalized co-occurrence matrix, 1076 Generalized curve finding, 846 Generalized cylinder, 1317 Generalized cylinder model, 1319 Generalized cylinder representation, 1318 Generalized disparity space image [GDSI], 28 Generalized eigenvalue, 429 Generalized eigenvalue problem, 429 Generalized expectation maximization [GEM], 1663 Generalized finite automaton [GFA], 1226 Generalized Hough transform [GHT], 852 Generalized linear classifier, 1198
Index Generalized linear model, 1198 Generalized maximum likelihood, 1414 Generalized mosaic, 1094 Generalized order statistics filter, 506 Generalized singular value decomposition [GSVD], 284 Generate and test, 1183 Generating point, 1122 Generational copy control, 707 Generative model, 1593 Generative model for texture, 1066 Generative topographic mapping [GTM], 1668 Generic Coding of Moving Pictures and Audio, 1845 Generic object recognition, 1169 Generic viewpoint, 196 Genetic algorithm [GA], 1481 Genetic programming, 1482 Genlock, 1575 Genus, 1031 Geodesic, 379 Geodesic active contour, 918 Geodesic active region, 918 Geodesic dilation, 1729 Geodesic distance, 378 Geodesic erosion, 1729 Geodesic operator, 1729 Geodesic sphere, 1729 Geodesic transform, 379 Geodesic transformation, 1728 Geodetic graph, 1612 Geodetic line, 1612 Geographic information system [GIS], 1849 Geometric algebra, 278 Geometric alignment, 579, 1460 Geometric center, 1014 Geometric compression, 655 Geometric constraint, 1330 Geometric correction, 614 Geometric deformable model, 918 Geometric distance, 1287 Geometric distortion, 613 Geometric distortion correction, 613 Geometric feature learning, 1096 Geometric feature proximity, 1096 Geometric feature, 1096 Geometric flow, 1448 Geometric Hashing, 1524 Geometric heat flow, 951 Geometric illusion, 1814 Geometric invariant, 1048 Geometric mean filter, 527 Geometric mean, 527
1903 Geometric model, 1312 Geometric model matching, 1452 Geometric model of eyes, 1447 Geometric morphometrics, 1723 Geometric nerve, 1125 Geometric operations, 614 Geometric perspective transformation, 226 Geometric properties of structures, 1660 Geometric realization, 1620 Geometric realization of graph, 1620 Geometric reasoning, 1419 Geometric representation, 1620 Geometric representation of graph, 1620 Geometric restoration, 614 Geometric shape, 1096 Geometric similarity constraint, 1330 Geometric transformation, 613 Geometrical aberration, 150 Geometrical image modification, 614 Geometrical ion [geon], 1319 Geometrical lens aberration, 151 Geometrical light-gathering power, 87 Geometrical optics, 1671 Geometrical theory of diffraction, 1721 Geometry image, 33 Geometry optics, 1672 Geomorphology, 80 geon (geometrical ion), 1319 Gestalt theory, 1808 Gestalt, 1808 Gesture analysis, 1254 Gesture component space, 1561 Gesture recognition, 1255 Gesture segmentation, 1562 Gesture tracking, 1562 Gesture-based user interface, 1232 GFA (generalized finite automaton), 1226 Ghost, 581 Ghost artifacts, 580 GHT (generalized Hough transform), 852 Gibbs distribution, 334 Gibbs phenomenon, 411 Gibbs point process, 1592 Gibbs random field, 1592 Gibbs sampler, 569 Gibbs sampling, 1600 GIF (graphics interchange format), 274 Gimbal lock, 91 GIQ (grid-intersection quantization), 251 GIS (geographic information system), 1849 GIST (gradient domain image stitching), 716 Gist image descriptor, 1476 Gist of a scene, 1476
1904 Gist of an image, 1476 GLD (gray-level distribution), 1062 Glint, 1702 Global, 448 Global aggregation methods, 1329 Global atmospheric light, 612 Global component, 1238 Global illumination, 220 Global motion, 1129 Global operation, 448 Global point signature, 1014 Global positioning system [GPS], 1826 Global property, 1014 Global shutter, 106 Global structure extraction, 1101 Global threshold, 928 Global tone mapping, 179 Global transform, 387 Globally ordered texture, 1089 GLOH (gradient location-orientation histogram), 1343 GLOH (gradient location-orientation histogram) features, 1344 Gloss, 1791 Glue, 1313 GML (group mapping law), 492 GMM (Gaussian mixture model), 320 GMRF (Gaussian Markov random field), 1592 GoF/GoP color descriptor, 1530 Golay alphabet,Golay, 1730 Gold standard, 1042 Golden image, 50 Golden template, 1042 Golomb code, 667 Gonioreflectometer, 1675 Good figure principles, 1817 Goodness, 960 Goodness criteria, 960 Gouraud shading, 74 GP-LVM (Gaussian process latent variable model), 1179 GPS (global positioning system), 1826 GPU (graphic processing unit), 70 GPU algorithms, 70 Grabber, 165 GrabCut, 922 Gradient, 875 Gradient based flow estimation, 1163 Gradient classification, 1347 Gradient consistency heuristics, 1309 Gradient constraint equation, 1163 Gradient descent, 644 Gradient domain, 876
Index Gradient domain blending, 716 Gradient domain tone mapping, 761 Gradient edge detection, 875 Gradient feature, 876 Gradient filter, 881 Gradient image, 28 Gradient location-orientation histogram [GLOH], 1343 Gradient location-orientation histogram [GLOH] features, 1344 Gradient magnitude thresholding, 872 Gradient mask, 881 Gradient matching stereo, 1326 Gradient operator, 881 Gradient operator in hexagonal lattices, 253 Gradient space, 875 Gradient space representation, 1388 Gradient vector, 876 Gradient vector flow [GVF], 917 Gradient-domain image stitching [GIST], 716 Gradual change, 1536 Graduated non-convexity, 598 Grain, 11 Grammar, 1223 Grammatical representation, 1097 Granular noise, 670 Granularity, 1070 Granulometric curve, 1741 Granulometric spectrum, 1740 Granulometry, 1740 Graph, 1614 Graph classification problem, 1635 Graph clustering, 1634 Graph cut, 920 Graph embedding, 1658 Graph isomorphism, 1463 Graph kernel, 1658 Graph matching, 1462 Graph median, 1634 Graph model, 1625 Graph partitioning, 1634 Graph pruning, 1634 Graph representation, 1617 Graph search, 910 Graph similarity, 1619 Graph theoretic clustering, 1634 Graph-based merging, 946 Graphic processing unit [GPU], 70 Graphic technique, 70 Graphical user interface [GUI], 71 Graphical model, 1625 Graphics, 8 Graphics interchange format [GIF], 274
Index Graph-theory, 1611 Grassfire algorithm, 1002 Grassfire transform, 1002 Grassmannian space, 1234 Grating, 1722 Gray body, 1691 Gray code, 679 Gray level, 44 Gray level-gradient value scatter, 882 Gray level resolution, 45 Gray scale, 44 Gray scale co-occurrence, 1076 Gray scale correlation, 1445 Gray scale distribution model, 45 Gray scale gradient, 882 Gray scale image, 27 Gray scale mathematical morphology, 1745 Gray scale moment, 1024 Gray scale morphology, 1743 Gray scale texture moment, 1078 Gray value, 45 Gray value moment, 1024 Gray-code decomposition, 678 Gray-level closing, 1748 Gray-level contrast [GC], 45 Gray-level co-occurrence matrix, 1077 Gray-level dependence matrix, 1077 Gray-level dilation, 1747 Gray-level discontinuity, 861 Gray-level disparity, 1502 Gray-level distribution [GLD], 1062 Gray-level erosion, 1747 Gray-level histogram, 487 Gray-level histogram moment, 1085 Gray-level image, 27 Gray-level interpolation, 616 Gray-level mapping, 479 Gray-level mathematical morphology, 1745 Gray-level opening, 1748 Gray-level range, 45 Gray-level run-length, 680 Gray-level run-length coding, 680 Gray-level similarity, 832 Gray-level slicing, 486, 753 Gray-level transformations, 479 Gray-level-difference histogram, 1065 Gray-level-difference matrix, 1077 Gray-level-to-color transformation, 755 Greedy algorithm, 917 Greedy search, 917 Greek (gamma)-ray, 14 Green [G], 728 Green noise dithering, 265
1905 Green noise halftone pattern, 267 Grid filter, 544 Grid of texel, 1064 Grid polygon, 995 Grid theorem, 995 Grid-intersection quantization [GIQ], 251 Ground control point, 618, 1458 Ground following, 1148 Ground plane, 1381 Ground tracking, 1148 Ground truth, 1043, 1166 Group activity, 1563 Group association context, 1572 Grouping, 941 Grouping transform, 62 Group-mapping law [GML], 492 Growing criteria, 939 Growth geometry coding, 679 GSVD (generalized singular value decomposition), 284 GTM (generative topographic mapping), 1668 GUI (graphical user interface), 71 Guided filtering, 503 GVF (gradient vector flow), 917 GVF snake, 917
H H (hue), 746 H.261, 1846 H.263, 1846 H.264/AVC, 1846 H.265/HEVC, 1847 H.266/VVC, 1848 H.320, 1848 H.321, 1848 H.322, 1848 H.323, 1849 H.324, 1849 H.32x, 1848 Haar function, 424 Haar matrix, 424 Haar transform, 424 Haar wavelet, 423 Hadamard transform, 393 Hadamard transform kernel, 394 Hadamard-Haar transform, 394 Hahn moment, 1027 Half tone, 265 Half-octave pyramids, 819, 820 Half-space, 298 Half-tone pattern, 266 Half-toning, 265
1906 Half-toning technique, 265 Halo, 610 Halo effect, 610 Hamiltonian, 910 Hammersley–Clifford theorem, 1592 Hamming code, 650 Hamming distance, 1436 Hand orientation, 1254 Hand tracking, 1254 Hand-eye calibration, 120 Hand-eye coordination, 120 Hand-geometry recognition, 1254 Handle, 1029 Hand-sign recognition, 1225 Handwriting recognition, 1830 Handwriting verification, 1831 Handwritten character recognition, 1830 Hankel transform, 396 Hann window, 508 Hard assignment, 1185 Hard constraint, 922 Hard segmentation, 829 Hard-adjacency, 354 Hardware implementation, 1267 Hardware-oriented model, 739 Harmonic analysis, 411 Harmonic homology, 231 Harmonic mean, 602 Harmonic mean filter, 602 Harris corner detector, 841, 842 Harris detector, 842 Harris interest point operator, 841 Harris operator, 841 Harris-Stephens corner detector, 841 Hartley, 435 Hartley transform, 435 Hash function, 1524 Hash table, 1524 Hashing, 1523 Hat transform, 1749 Hausdorff dimension, 75 Hausdorff distance, 1433 HBMA (hierarchical block matching algorithm), 795 HCEC (high-complexity entropy coding), 1848 HCF (highest confidence first), 645 HCI (human-computer interaction), 1519 HCV (hue, chroma, value) model, 746 HDF (hierarchical data format), 275 HDR (high dynamic range), 179 HDR (high dynamic range) image, 35 HDR (high dynamic range) image format, 179 HDRI (high dynamic range imaging), 179
Index HDTV (high-definition television), 1820 Head and face modeling, 1244 Head tracking, 1239 Head-to-head, 1631 Head-to-tail, 1631 Heaviside step function, 339 Height image, 32 Helical CT, 632 Hellinger distance, 1503 Helmholtz reciprocity, 205, 1708 Helmholtz stereo imaging, 205 Helmholtz stereopsis, 205 Hermitian symmetry, 298 Hermitian transpose, 298 Hessian, 897 Hessian matrix, 897 Hessian operator, 898 Heterarchical cybernetics, 1395 Heterogeneous sensor network, 96 Heteroscedastic model, 319 Heteroscedasticity, 319 Heuristic search, 910 HEVC (high efficiency video coding), 1847 Hexagonal grid, 253 Hexagonal image representation, 254 Hexagonal lattices, 253 Hexagonal sampling efficiency, 254 Hexagonal sampling grid, 253 Hexagonal sampling mode, 254 Hidden Markov model [HMM], 1593 Hidden semi-Markov model, 1596 Hidden state variable, 1153 Hidden variable, 1490 Hidden watermark, 702 Hiding image, 37 Hierarchical, 807 Hierarchical Bayesian model, 1605 Hierarchical block matching algorithm [HBMA], 795 Hierarchical clustering, 1213 Hierarchical coding, 681 Hierarchical data format [HDF], 275 Hierarchical decision rule, 1221 Hierarchical hidden Markov model, 1594 Hierarchical Hough transform, 852 Hierarchical image compression, 681 Hierarchical K-means, 935 Hierarchical matching, 1430 Hierarchical mixture of experts [HME], 1223 Hierarchical model, 347 Hierarchical motion estimation, 1131 Hierarchical progressive image compression, 682 Hierarchical texture, 1062
Index Hierarchical thresholding, 925 Hierarchical variable transition HMM [HVT-HMM], 1594 Hierarchy of transformations, 453 High complexity entropy coding [HCEC], 1848 High dynamic range [HDR], 179 High dynamic range [HDR] image, 35 High dynamic range [HDR] image format, 179 High dynamic range imaging [HDRI], 179 High efficiency video coding [HEVC], 1847 High-boost filter, 550 High-boost filtering, 501 High-definition television [HDTV], 1820 High-dimensional data, 1412 Higher-order graph cut, 922 Higher-order interpolation, 617 Higher-order local entropy, 960 Higher-order singular value decomposition [HOSVD], 284 Highest confidence first [HCF], 645 High-frequency emphasis, 551 High-frequency emphasis filter, 550 High-key photo, 22 High-level vision, 65 Highlight, 1702 Highlight shot, 1538 High-pass filter [HPF], 546 High-pass filtering, 546 High-speed camera, 109 High-speed cinematography/motion picture, 238 Hilbert transform, 438 Hinge error function, 1207 Hinge loss function, 1648 Histogram, 487 Histogram analysis, 496 Histogram bin, 489 Histogram density estimation, 498 Histogram dissimilarity measure, 497 Histogram distance, 497 Histogram equalization, 490 Histogram equalization with random additions, 491 Histogram feature, 496 Histogram hyperbolization, 495 Histogram hyperbolization with random additions, 495 Histogram intersection, 1531 Histogram manipulation, 495 Histogram mapping law, 492 Histogram matching, 1451 Histogram modeling, 495 Histogram modification, 493
1907 Histogram moment, 497 Histogram of edge direction, 873 Histogram of local appearance context [HLAC], 1106 Histogram of oriented gradient [HOG], 882 Histogram of oriented gradients [HOG] descriptor, 882 Histogram of shape context [HOSC], 1104 Histogram processing for color image, 769 Histogram shrinking, 494 Histogram sliding, 495 Histogram smoothing, 495 Histogram specification, 486 Histogram stretching, 493 Histogram thinning, 494 Histogram transformation, 489 Hit-or-miss [HoM] operator, 1736 Hit-or-miss [HoM] transform, 1736 HLAC (histogram of local appearance context), 1106 HME (hierarchical mixture of experts), 1223 HMM (hidden Markov model), 1593 HOG (histogram of oriented gradient), 882 HOG (histogram of oriented gradients) descriptor, 882 Hold-out set, 1228 Hole filling, 1092 Hole, 848 Holistic recognition, 1170 Hologram, 195 Holography, 195 HoM (hit-or-miss) operator, 1736 HoM (hit-or-miss) transform, 1736 Home video organization, 1538 Homogeneity assumption, 1069 Homogeneity, 1069 Homogeneous, 1069 Homogeneous coordinates, 452 Homogeneous emission surface, 1684 Homogeneous kernel, 1658 Homogeneous Markov chain, 1599 Homogeneous noise, 583 Homogeneous random field, 1588 Homogeneous representation, 452 Homogeneous texture, 1062 Homogeneous texture descriptor [HTD], 1069 Homogeneous vector, 453 Homography, 231 Homography transformation, 231 Homomorphic filter, 558 Homomorphic filtering, 558 Homomorphic filtering function, 558 Homoscedastic noise, 585
1908 Homotopic skeleton, 78 Homotopic transformation, 78 Homotopic tree, 78 Hopfield network, 1647 Horizon line, 1381 Horn-Schunck algorithm, 1161 Horn-Schunck constraint equation, 1161 Horn-Schunck constraint equation for color image, 1161 Horoptera, 1789 HOSC (histogram of shape context), 1104 HOSVD (higher-order singular value decomposition), 284 Hotelling transform, 426 Hough forest, 852 Hough transform [HT], 850 Hough transform based on foot-of-normal, 853 Hough transform line finder, 851 HPF (high-pass filter), 546 HRL face database, 1853 HSB (hue, saturation, brightness) model, 747 HSI (hue, saturation, intensity) model, 744 HSI space, 746 HSI transform fusion, 1507 HSI transformation fusion method, 1508 HSI-based colormap, 756 HSL (hue, saturation, lightness) model, 747 HSV (hue, saturation, value) model, 746 HSV space, 746 HT (Hough transform), 850 HTD (homogeneous texture descriptor), 1069 Hu moment, 1026 Huber function, 845 Hue [H], 746 Hue angle, 762 Hue enhancement, 760 Hue, chroma, value [HCV] model, 746 Hue, saturation, brightness [HSB] model, 747 Hue, saturation, intensity [HSI] model, 744 Hue, saturation, lightness [HSL] model, 747 Hue, saturation, value [HSV] model, 746 Hueckel edge detector, 899 Huffman coding, 666 Human biometrics, 1231 Human body modeling, 1251 Human body shape modeling, 1252 Human emotion, 1546 Human face identification, 1245 Human intelligence, 1642 Human motion analysis, 1250 Human motion capture, 1251 Human motion tracking, 1251 Human pose estimation, 1251
Index Human vision, 1755 Human visual attention mechanism, 1805 Human visual attention theory, 1804 Human visual properties, 1756 Human visual system [HVS], 1756 Human-computer interaction [HCI], 1519 Human-face verification, 1245 Human-machine interface, 1520 HVS (human visual system), 1756 HVT-HMM (hierarchical variable transition HMM), 1594 Hybrid encoding, 657 Hybrid Monte Carlo, 250 Hybrid optimal feature weighting, 1188 HYPER (hypothesis predicted and evaluated recursively), 1850 Hyperbolic surface region, 1291 Hyperbolic tangent function, 1654 Hyper-centric camera model, 175 Hyper-centric imaging, 177 Hyperfocal distance, 140 Hypergraph, 1616 Hyper-Laplacian distribution, 333 Hyper-Laplacian norm, 300 Hyperparameter, 1605 Hyperplane, 1274 Hyperprior, 1605 Hyperquadric, 1302 Hyperspectral image, 17 Hyperspectral imaging, 180 Hyperspectral sensor, 103 Hyperstereoscopy, 239 Hypostereoscopy, 240 Hypothesis predicted and evaluated recursively [HYPER], 1850 Hypothesis testing, 957 Hypothesize and test, 1183 Hypothesize and verify, 1183 Hysteresis edge linking, 902 Hysteresis thresholding, 890 Hysteresis tracking, 891
I I (intensity), 1678 I1, I2, I3 model, 752 IA (illumination adjustment), 217 IA (image analysis), 61 IAC (image of absolute conic), 123 IAP (indirect attribute prediction), 1035 IAPS (International Affective Picture System), 1853 IAU (International Astronomical Union), 1839
IBP (iterative back-projection), 644 IBPF (ideal band-pass filter), 553 IBR (image-based rendering), 71 IBRF (ideal band-reject filter), 556 ICA (independent component analysis), 1194 ICC (International Color Consortium), 1838 ICC profiles, 749 ICE (iterative constrained end-membering), 14 ICM (iterated conditional modes), 646 Iconic, 1402 Iconic image, 24 Iconic model, 1401 Iconic recognition, 1169 Iconoscope, 155 ICP (iterated closest point), 1457 ICP (iterative closest point), 1456 ICP (iterative closest point) algorithm, 1457 Ideal band-pass filter [IBPF], 553 Ideal band-reject filter [IBRF], 556 Ideal edge, 868 Ideal filter, 546 Ideal high-pass filter [IHPF], 550 Ideal line, 250 Ideal low-pass filter [ILPF], 547 Ideal point, 453 Ideal scattering surface, 1720 Ideal specular reflecting surface, 1701 Ideal white, 1699 IDECS (image discrimination enhancement combination system), 1851 Idempotency, 1726 Identical, 1621 Identical graphs, 1621 Identifiability, 1183 Identification, 1168 Identity axiom, 1041 Identity verification, 1231 IE (image engineering), 55 IEC (International Electrotechnical Commission), 1837 IEEE (Institute of Electrical and Electronics Engineers), 1839 IEEE 1394, 1863 I-frame, 799 IFRB (International Frequency Registration Board), 1838 IFS (image file system), 272 IFS (iterated function system), 683 IFT (inverse Fourier transform), 396 IGS (Interpretation-guided segmentation), 836 IHPF (ideal high-pass filter), 550 IHS (intensity, hue, saturation), 746 IHT (iterative Hough transform), 852
1909 i.i.d. (independent identically distributed), 324 i.i.d. (independent identically distributed) noise, 584 IIDC (Instrumentation and Industrial Digital Camera), 110 IIDC (Instrumentation and Industrial Digital Camera) standard, 110 IIR (infinite impulse response), 339 IIR (infinite impulse response) filter, 339 IK (inverse kinematics), 81 Ill-conditioned problem, 563 Illegitimate distortion, 580 Ill-posed problem, 562 Illuminance, 1362 Illuminant direction, 215 Illumination, 1362 Illumination adjustment [IA], 217 Illumination component, 1362 Illumination constancy, 1794 Illumination estimation, 1362 Illumination field calibration, 221 Illumination function, 1362 Illumination model, 1363 Illumination variance, 1362 Illusion, 1811 Illusion boundary, 1812 Illusion of geometric figure, 1814 ILPF (ideal low-pass filter), 547 Image, 3 Image acquisition, 167 Image addition, 471 Image algebra, 1724 Image alignment, 1459 Image analogy, 58 Image analysis [IA], 61 Image analysis system, 62 Image annotation, 1521 Image anti-forensics, 713 Image arithmetic, 471 Image authentication, 711 Image binning, 489 Image blending, 715 Image block, 8 Image border, 849 Image brightness, 44 Image brightness constraint equation, 1363 Image capture, 167 Image capturing equipment, 85 Image class segmentation, 833 Image classification, 1177 Image coder, 649 Image coder and decoder, 649 Image coding, 647
1910 Image coding and decoding, 647 Image communication, 271 Image completion, 607 Image composition, 71 Image compression, 648 Image connectedness, 368 Image construction by graphics, 1706 Image content, 1514 Image conversion, 4 Image converter, 166 Image coordinate system, 187 Image coordinate system in computer, 188 Image coordinate transformation, 449 Image coordinates, 449 Image cropping, 615 Image database, 1520 Image database indexing, 1520 Image decipher, 1171 Image decoder, 650 Image decoding, 648 Image decoding model, 648 Image decomposition, 8 Image decompression, 648 Image degradation, 574 Image degradation model, 574 Image dehazing, 608 Image denoising, 595 Image descriptor, 1013 Image difference, 472 Image digitization, 6 Image discrimination enhancement combination system [IDECS], 1851 Image display, 257 Image display device, 260 Image dissector, 156 Image distortion, 577 Image division, 475 Image domain, 386 Image editing program, 1834 Image encoder, 632 Image encoding, 648 Image engineering [IE], 55 Image enhancement, 441 Image enhancement in frequency domain, 442 Image enhancement in space domain, 442 Image epitome, 49 Image feature, 837 Image feature extraction, 837 Image fidelity, 649 Image field, 173 Image file, 272 Image file format, 272 Image file system [IFS], 272
Index Image filtering, 539 Image flipping, 450 Image flow, 1159 Image forensics, 713 Image formation, 169, 176 Image fragmentation, 8 Image frame, 777 Image fusion, 1505 Image gallery, 1520 Image generation, 954 Image geometric nerve, 57 Image geometry, 57 Image gray-level mapping, 479 Image grid, 49 Image Hashing, 1514 Image haze removal, 608 Image hiding, 714 Image histogram, 487 Image iconoscope, 155 Image impainting, 605 Image indexing, 1522 Image information fusion, 1494 Image information hiding, 713 Image information security, 705 Image information system, 1849 Image intensifier, 444 Image intensity bin, 489 Image intensity bucket, 488 Image intensity histogram, 488 Image interleaving, 12 Image interpolation, 615 Image interpretation, 1472, 1473 Image invariant, 1441 Image irradiance equation, 1363 Image isocon, 155 Image knowledge, 1391 Image lattice, 49 Image magnification, 451 Image manipulation, 60 Image matching, 1441 Image matting, 854 Image matting and compositing, 855 Image memory, 269 Image mining, 1390 Image modality, 170 Image model for thresholding, 925 Image morphing, 1460 Image morphology, 1724 Image mosaic, 1496 Image motion estimation, 1131 Image multiplication, 475 Image negative, 485 Image noise, 581
Index Image normalization, 1528 Image observation model, 175 Image of absolute conic [IAC], 123 Image operation, 470 Image orthicon tube, 156 Image padding, 607 Image pair rectification, 1335 Image palette, 20 Image parsing, 1474 Image patch, 8 Image pattern, 1171 Image pattern recognition, 1171 Image perception, 1475 Image plane, 187 Image plane coordinate system [IPCS], 187 Image plane coordinates, 188 Image printing, 263 Image processing [IP], 59 Image processing hardware, 61 Image processing operations, 60 Image processing software, 60 Image processing system, 61 Image processing toolbox [IPT], 60 Image projection reconstruction, 634 Image pyramid, 820 Image quality, 50 Image quality measurement, 51 Image querying, 1518 Image quilting, 1092 Image reconstruction, 636 Image reconstruction from projection, 636 Image rectification, 1334 Image registration, 1457 Image repair, 605 Image representation, 42 Image representation function, 42 Image representation with matrix, 42 Image representation with vector, 43 Image resizing, 615 Image resolution, 46 Image restoration, 561 Image retrieval, 1515 Image rippling, 451 Image sampling, 168 Image scaling, 450 Image scene, 1471 Image search, 1521 Image segmentation, 825, 942 Image segmentation evaluation, 953 Image semantics, 63 Image sensor, 99 Image sequence, 803 Image sequence analysis, 803
1911 Image sequence fusion, 804 Image sequence matching, 804 Image sequence stabilization, 791 Image sequence understanding, 803 Image sharpening, 513 Image sharpening operator, 513 Image shrinking, 450 Image size, 42 Image smoothing, 500 Image space, 5 Image stabilization, 791 Image stitching, 1496 Image storage, 269 Image storage devices, 269 Image subtraction, 472 Image subtraction operator, 472 Image tagging, 1521 Image tampering, 711 Image technique, 56 Image tessellation, 1122 Image texton, 1062 Image time sequence, 777 Image tone, 1382 Image tone perspective, 1382 Image topology, 1028 Image transfer, 271 Image transform, 385 Image twirling, 451 Image understanding [IU], 63 Image understanding environment [IUE], 63 Image warping, 451 Image watermarking, 699 Image zooming, 451 Image-based lighting, 216 Image-based modeling, 1397 Image-based rendering [IBR], 71 Image-based visual hull, 74 Image-flow equation, 1160 ImageNet, 1851 Image-processing operator, 60 Imaginary primaries, 737 Imaging element, 169 Imaging geometry, 169 Imaging method, 169 Imaging modalities, 170 Imaging radar, 163 Imaging spectrometer, 1675 Imaging spectroscopy, 162 Imaging spectrum, 162 Imaging surface, 169 Imaging transform, 186 IMF (Intrinsic mode function), 1495 Imperceptible watermark, 701
1912 Implicit curve, 974 Implicit registration, 1243 Implicit surface, 1300 Implicit synchronization, 706 Importance sampling, 1155 Importance weights, 1155 Impossible object, 1811 Impostor, 855 Improper prior, 1478 Improved NMF [INMF] model, 1192 Improving accuracy, 1352 Impulse noise, 590 Impulse response, 339 In general position, 1465 Incandescent lamp, 210 Incidence matrix, 1622 Incident, 1622 Incident light, 215 Incident light measurement, 215 Incoherent, 1719 Incorrect comparison theory, 1815 Increasing, 1725 Increasing transformation, 1725 Incremental learning, 1416 Independency map, 1633 Independent component, 1194 Independent component analysis [ICA], 1194 Independent factor analysis, 1195 Independent identically distributed [i.i.d.], 324 Independent identically distributed [i.i.d.] noise, 584 Independent motion detection, 1136 Independent noise, 583 Independent random variables, 1580 Independent set, 1632 Independent variables, 1582 Independent-frame, 799 Index of refraction, 1715 Indexed color images, 19 Indexed color, 20 Indexing, 1522 Indexing structure, 1522 Indicator function, 1300 Indices of depth, 1358 Indirect attribute prediction [IAP], 1035 Indirect filtering, 792 Indirect imaging, 177 Indirect radiography, 168 Indirect vision, 1763 Individual kernel tensor face, 1243 Induced color, 1800 Induced subgraph, 1617 Inducing color, 1800
Index Industrial vision, 69 Inefficiency, 1051 Inequality constraint, 1662 Inertia equivalent ellipse, 1009, 1451 Inference, 1404, 1419 Inference in Graphical Models, 1627 Infimum, 287 Infinite impulse response [IIR], 339 Infinite impulse response [IIR] filter, 339 Inflection point, 840 Influence function, 306 Influence of lens resolution, 1055 Influence of sampling density, 1056 Influence of segmentation algorithm, 1056 Infophotonics, 1672 Information entropy, 1500 Information fusion, 1493 Information geometry, 279 Information hiding, 714 Information matrix, 1220 Information of embedded watermark, 692 Information optics, 1672 Information source, 651 Information system, 1849 Information theory, 660 Information-lossy coding, 662 Information-preserving coding, 661 Information-source symbol, 651 Informed coding, 697 Informed detection, 697 Informed embedding, 697 Infrared [IR], 1693 Infrared image, 15 Infrared imaging, 181 Infrared light, 1693 Infrared ray, 1693 Infrared sensor, 102 Inkjet printer, 263 Inlier, 1666 INMF (improved NMF) model, 1192 Inner orientation, 125 Inpainting, 605 Input cropping, 494 Input-output hidden Markov model, 1595 Inscribed circle, 1113 Inseparability, 698 Inside of a curve, 1277 Inspection, 1825 Inspection of PC boards, 1825 Instance recognition, 1169 Instantaneous code, 663 Institute of Electrical and Electronics Engineers [IEEE], 1839
Index Instrumentation and industrial digital camera [IIDC], 110 Instrumentation and Industrial Digital Camera [IIDC] standard, 110 Integer lifting, 425 Integer wavelet transform, 425 Integral curvature, 1291 Integral image, 23 Integral image compression, 659 Integral invariant, 278 Integral projection function, 233 Integrated optical density [IOD], 1023 Integrated orthogonal operator, 850 Integrating sphere, 119 Integration time, 194 Intelligent camera control, 109 Intelligent photo editing, 607 Intelligent scene analysis based on photographic space [ISAPS], 1475 Intelligent scissors, 915 Intelligent visual surveillance, 1824 Intensity [I], 1678 Intensity cross-correlation, 1446 Intensity data, 1446 Intensity enhancement, 443 Intensity flicker, 581 Intensity gradient, 443 Intensity gradient direction, 443 Intensity gradient magnitude, 443 Intensity histogram, 488 Intensity image, 28 Intensity level slicing, 486 Intensity matching, 1342 Intensity of embedded watermark, 692 Intensity sensor, 101 Intensity slicing, 753 Intensity, hue, saturation [IHS], 746 Intensity-based alignment, 1458 Intensity-based database indexing, 1530 Intensity-based image segmentation, 834 Intensive variables, 1490 Interaction of light and matter, 1701 Interaction potential, 1590 Interactive computer vision, 67 Interactive image processing, 59 Interactive local tone mapping, 761 Interactive restoration, 562 Inter-camera appearance variance, 117 Inter-camera gap, 117 Interest operator, 827 Interest point, 837 Interest point detector, 837 Interest point feature detector, 838
1913 Interface reflection, 1713 Interference, 1719 Interference color, 1719 Interference fringe, 1719 Interferometric SAR, 164 Inter-frame, 793 Inter-frame coding, 799 Inter-frame filtering, 793 Inter-frame predictive coding, 799 Interior border point, 904 Interior camera parameter, 85 Interior orientation, 125 Interior parameters of camera, 86 Interior point, 904 Interlaced scanning, 780 Interline transfer sensor, 100 Intermediate representation, 1266 Internal camera calibration, 124 Internal coefficient, 87 Internal energy (or force), 914 Internal energy function, 914 Internal feature, 1008 Internal Gaussian normalization, 1526 Internal orientation, 125 Internal parameter, 87 Internal pixel of a region, 904 Internal representation, 971 International Affective Picture System [IAPS], 1853 International Astronomical Union [IAU], 1839 International Color Consortium [ICC], 1838 International Committee on Illumination, 1839 International Electrotechnical Commission [IEC], 1837 International Frequency Registration Board [IFRB], 1838 International Radio Consultative Committee, 1838 International standard for images, 1841 International standard for videos, 1843 International Standardization Organization [ISO], 1837 International System of Units, 1856 International Telecommunication Union [ITU], 1838 International Telecommunication Union, Telecommunication Standardization Sector [ITU–T], 1838 Internet photos, 24 Inter-pixel redundancy, 654 Interpolation, 606 Interpolation kernel, 616 Inter-position, 1360
1914 Interpretation tree search, 1432 Interpretation-guided segmentation [IGS], 836 Interreflection, 219 Interreflection analysis, 219 Inter-region contrast, 960 Interrelating entropy, 1501 Interval tree, 1612 Intraframe, 793 Intraframe coding, 798 Intraframe filtering, 792 Intra-region uniformity, 961 Intrinsic calibration, 124 Intrinsic camera calibration, 124 Intrinsic dimensionality, 39 Intrinsic image, 38 Intrinsic mode function [IMF], 1495 Intrinsic parameter, 86 Intrinsic property, 39 Intruder detection, 1825 Invariance, 1046 Invariance hypothesis, 1766 Invariant, 1046 Invariant contour function, 1104 Invariant feature, 1343 Invariant Legendre moments, 1027 Invariant moments, 1026 Invariant theory, 305 Invariants for non-collinear points, 1168 Inverse compositional algorithm, 1462 Inverse convolution, 341 Inverse distance, 1351 Inverse document frequency, 1516 Inverse fast wavelet transform, 416 Inverse filter, 599 Inverse filtering, 599 Inverse Fourier transform [IFT], 396 Inverse gamma distribution, 333 Inverse Hadamard transform, 394 Inverse Hadamard transform kernel, 394 Inverse half-toning, 266 Inverse harmonic mean, 602 Inverse harmonic mean filter, 602 Inverse Hessian, 899 Inverse imaging problem, 82 Inverse kinematics [IK], 81 Inverse light transport, 1712 Inverse operator, 448 Inverse perspective, 227 Inverse perspective transformation, 227 Inverse problem, 81 Inverse projection, 640 Inverse rendering, 82 Inverse square law, 217
Index Inverse texture element difference moment, 1078 Inverse tone mapping, 179 Inverse transform, 387 Inverse transformation kernel, 387 Inverse Walsh transform, 391 Inverse Walsh transform kernel, 391 Inverse warping, 449 Inverse Wishart distribution, 329 Invert operator, 485 Inverted index, 1513 Invertible watermark, 701 IOD (integrated optical density), 1023 IP (image processing), 59 IPCS (image plane coordinate system), 187 IPT (image processing toolbox), 60 IR (infrared), 1693 Iris, 1225, 1762 Iris detection, 1225 Iris recognition, 1225 IRLS (iterative reweighted least squares), 288 IRLS (iteratively reweighted least squares), 288 Irradiance, 1689 Irregular octree, 1314 Irregular quadtree, 1007 Irregular sampling, 245 ISAPS (intelligent scene analysis based on photographic space), 1475 ISO (International Standardization Organization), 1837 ISO setting, 107 ISODATA clustering, 935 isomap (isometric feature mapping), 1190 Isometric feature mapping [isomap], 1190 Isometric perspective, 227 Isometric transformation, 455 Isometry, 455 Isomorphic decomposition, 1463 Isomorphic graphs, 1463 Isomorphic sub-graphs, 1463 Isomorphics, 1463 Isophote, 6 Isophote curvature, 6 Iso-surface, 1289 Isotemperature line, 1692 Isotropic BRDF, 1709 Isotropic covariance matrix, 1585 Isotropic detector, 886 Isotropic gradient operator, 513 Isotropic operator, 513 Isotropic reflection surface, 1683 Isotropic scaling, 467 Isotropy assumption, 1378
Index Isotropy radiation surface, 1684 Iterated closest point [ICP], 1457 Iterated conditional modes [ICM], 646 Iterated function system [IFS], 683 Iterative algorithm, 642 Iterative back-projection [IBP], 644 Iterative blending, 716 Iterative closest point [ICP], 1456 Iterative closest point [ICP] algorithm, 1457 Iterative conditional mode, 567, 926 Iterative constrained end-membering [ICE], 14 Iterative endpoint curve fitting, 985 Iterative erosion, 1733 Iterative factorization, 229 Iterative Hough transform [IHT], 852 Iterative pose estimation, 1561 Iterative reconstruction re-projection, 638 Iterative reweighted least squares [IRLS], 288 Iterative transform method, 638 Iteratively reweighted least squares [IRLS], 288 ITU (International Telecommunication Union), 1838 ITU-T (International Telecommunication Union, Telecommunication Standardization Sector), 1838 IU (image understanding), 63 IUE (image understanding environment), 63 Iverson’s bracket, 305
J Jacobian matrix, 771 Jaggies, 244 JBIG (Joint bi-level imaging group), 1841 JBIG-2, 1841 Jensen’s inequality, 280 JFIF (JPEG file interchange format), 1842 JLS (jump linear system), 1574 JND (just noticeable difference), 1772 Join tree, 1613 Joint bilateral filter, 524 Joint bi-level imaging group [JBIG], 1841 Joint distribution function, 317 Joint entropy, 1502 Joint entropy registration, 1456 Joint feature space, 1506 Joint invariant, 1049 Joint Photographic Experts Group [JPEG], 1842 Joint Photographic Experts Group [JPEG] format, 1842 Joint probability density function, 317 Joint probability distribution, 316
1915 Joint product space model, 1555 Joint Video Team [JVT], 1844 Jordan parametric curve, 984 JPEG (Joint Photographic Experts Group), 1842 JPEG file interchange format [JFIF], 1842 JPEG (Joint Photographic Experts Group) format, 1842 JPEG standard, 1842 JPEG XR, 1843 JPEG-2000, 1843 JPEG-LS (lossless and near-lossless compression standard of JPEG), 1842 JPEG-NLS, 1842 Judder, 581 Jump edge, 868 Jump linear system [JLS], 1574 Junction label, 1469 Junction tree, 1613 Junction tree algorithm, 1613 Just-noticeable difference [JND], 1772 JVT (Joint Video Team), 1844
K K nearest neighbors [KNN], 1196 K nearest neighbors [KNN] algorithm, 1197 K2 algorithm, 911 Kalman filter, 1150 Kalman filter equation, 1151 Kalman filter function, 1152 Kalman filtering, 1151 Kalman smoother function, 1152 Kalman snakes, 916 Kalman-Bucy filter, 1151 Kanade-Lucas-Tomasi [KLT] tracker, 1156 Karhunen–Loève [KL] transform, 426 KB Vision, 1850 KCCA (kernel canonical correlation analysis), 1584 K-D tree, 1523 Kelvin, 1857 Kendall’s shape space, 1099 Kernel, 1656 Kernel 2D-PCA, 432 Kernel bandwidth, 1657 Kernel canonical correlation analysis [KCCA], 1584 Kernel density estimation, 1658 Kernel density estimator, 1659 Kernel discriminant analysis, 1217 Kernel Fisher discriminant analysis [KFDA], 1217
1916 Kernel function, 1657 Kernel LDA [KLDA], 1217 Kernel learning, 1658 Kernel PCA, 432 Kernel regression, 292 Kernel ridge regression, 292 Kernel substitution, 1657 Kernel tensor-subspace [KTS], 1234 Kernel tracking, 1148 Kernel trick, 1657 Kernel-based PCA [KPCA], 432 Kernel-matching pursuit learning machine [KMPLM], 1657 Kernel-singular value decomposition [K-SVD], 284 Key frame, 1533 Key frame selection, 1533 Key management, 717 Key point, 838 Key space, 717 KFDA (kernel Fisher discriminant analysis), 1217 KFDB database, 1854 KHOROS, 1850 Kinematic chain, 1252 Kinematic model, 1135 Kinematic motion models, 135 Kinescope, 261 Kinescope recording, 259 Kinetic depth, 1370 Kirsch compass edge detector, 887 Kirsch operator, 887 KITTI data sets, 1852 KL divergence (Kullback-Leibler divergence), 1503 KL (Karhunen–Loève) transform, 426 KLDA (kernel LDA), 1217 KLT (Kanade-Lucas-Tomasi) tracker, 1156 K-means, 934 K-means cluster analysis, 934 K-means clustering, 933 K-medians, 934 KMPLM (Kernel-matching pursuit learning machine), 1657 K-nearest neighbors [KNN] classifier, 1196 Knee algorithm, 927 Knight-distance, 373 Knight-neighborhood, 360 KNN (K nearest neighbors), 1196 KNN (K nearest neighbors) algorithm, 1197 KNN (K-nearest-neighbors) classifier, 1196 Knowledge, 1389 Knowledge base, 1404
Index Knowledge of cybernetics, 1396 Knowledge representation, 1389 Knowledge representation scheme, 1401 Knowledge-based, 683 Knowledge-based coding, 682 Knowledge-based computer vision, 67 Knowledge-based vision, 67, 1473 Koch’s triadic curve, 77 Kodak-Wratten filter, 158 Koenderink’s surface shape classification, 1298 Kohonen network, 1668 KPCA (kernel-based PCA), 432 Krawtchouk moments, 1025 Kronecker order, 394 Kronecker product, 286 Kruppa equations, 122 K-SVD (kernel-singular value decomposition), 284 KTC noise, 588 KTS (kernel tensor-subspace), 1234 Kullback-Leibler distance, 1504 Kullback-Leibler divergence [KL divergence], 1503 Kurtosis, 1581 Kuwahara filter, 523
L L cones, 1767 L1 norm, 300 L2 norm, 300 L2 pooling, 1650 L1 image coding, 662 Lab color space, 747 L*a*b* model, 747 Label, 908 Label image, 36 Labeling, 908 Labeling of contours, 1467 Labeling problem, 1522 Labeling via backtracking, 1468 Labeling with sequential backtracking, 1468 Lack unity and disjoint, 1545 Lacunarity, 848 LADAR (laser detection and ranging), 163 Lagrange multiplier technique, 1655 Laguerre formula, 1167 Lambert surface, 1683 Lambertian reflection, 1683 Lambertian surface, 1682 Lambert’s cosine law, 1682 Lambert’s law, 1682 Laminography, 623
Index Landmark detection, 1458 Landmark point, 974 Landolt broken ring, 1785 Landolt ring, 1785 LANDSAT, 1495 LANDSAT satellite, 1496 LANDSAT satellite image, 1496 Landscape image, 26 Landscape mode, 258 Language attribute, 1032 Language of image processing, 60 Lanser filter, 892 Laplace approximation, 314 Laplace-Beltrami operator, 896 Laplacian, 895 Laplacian eigenmap, 1190 Laplacian eigenspace, 1187 Laplacian face, 1241 Laplacian image pyramid, 822 Laplacian matrix, 1187 Laplacian matting, 855 Laplacian mean squared error, 696 Laplacian operator, 894 Laplacian pyramid, 821 Laplacian smoothing, 517 Laplacian of Gaussian [LoG], 536 Laplacian-of-Gaussian [LoG] edge detector, 893 Laplacian-of-Gaussian [LoG] filter, 535 Laplacian-of-Gaussian [LoG] kernel, 535, 893 Laplacian-of-Gaussian [LoG] operator, 536, 894 Laplacian-of-Gaussian [LoG] transform, 536 Large field, 171 Laser, 263 Laser computing tomography, 624 Laser detection and ranging [LADAR], 163 Laser illumination, 218 Laser printer, 263 Laser radar (LADAR), 163 Laser range sensor, 102 Laser speckle, 1159 Laser stripe triangulation, 209 LASSO (least absolute shrinkage and selection operator), 572 Latent behavior, 1490 Latent class analysis, 324 Latent Dirichlet allocation [LDA], 1489 Latent Dirichlet process [LDP], 1475 Latent emotional semantic factor, 1489 Latent identity variable [LIV], 1490 Latent semantic indexing [LSI], 1488 Latent structure, 1399
Latent trait model, 1627 Latent variable, 1490 Latent variable model, 1490 Lateral aberration, 153 Lateral chromatic aberration, 153 Lateral histogram, 1008 Lateral inhibition, 1778 Lateral spherical aberration, 153 Lattice, 1118 Lattice code, 697 Lattice coding, 697 Lattice diagram, 1619 Law of refraction, 1714 Laws operator, 1073 Laws’ texture energy measure, 1074 Layer, 18 Layered clustering, 1213 Layered cybernetics, 1392 Layered depth image [LDI], 31 Layered depth panorama, 32 Layered motion estimation, 1131 Layered motion model, 128 Layered motion representation, 1138 Layered representation, 1594 Layout consistent random field, 1587 LBG (Linde-Buzo-Gray) algorithm, 684 LBP (local binary pattern), 1081 LBP (Loopy belief propagation), 1426 LCD (liquid crystal display), 260 LCEC (low-complexity entropy coding), 1847 LCPD (local conditional probability densities), 1610 LCTF (liquid-crystal-tunable filter), 157 LDA (latent Dirichlet allocation), 1489 LDA (linear discriminant analysis), 1215 LDA model, 1490 LDI (layered depth image), 31 LDP (latent Dirichlet process), 1475 LDS (linear dynamical system), 1573 LDTV (low-definition television), 1820 Learned motion model, 1135 Learning, 1409 Learning classifier, 1409 Learning from observation, 1417 Learning rate parameter, 1410 Learning-based super-resolution, 810 Least absolute shrinkage and selection operator [LASSO], 572 Least confusion circle, 138 Least mean squares algorithm [LMS algorithm], 645 Least mean square estimation, 289 Least-mean-squares filter, 601
1918 Least median of squares [LMS], 1667 Least median of squares estimation, 289 Least-significant-bit watermarking, 697, 703 Least squares, 288 Least squares curve fitting, 986 Least squares estimation, 289 Least squares fitting, 845, 879 Least squares line fitting, 844 Least squares surface fitting, 271 Leave-K-out, 1227 Leave-one-out, 1227 Leave-one-out test, 1227 LED (light-emitting diode), 260 LED panels or ring lights with a diffuser in front of the lights, 224 Left-handed coordinate system, 182 Legendre moment, 1025 Legitimate distortion, 580 Length of boundary, 1016 Length of chain code, 989 Length of code word, 652 Length of contour, 1017 Lens, 126, 1762 Lens decentering distortion, 146 Lens distortions, 146 Lens equation, 126 Lens flare, 127 Lens glare, 127 Lens group, 137 Lens law, 126 Lens model, 126 Lens type, 130 Leptokurtic probability density function, 316 Letterbox, 258 Level of detail [LOD], 808 Level set, 945 Level set front propagation, 918 Level set methods, 945 Level set tree, 945 Levenberg-Marquardt approximation of Hessian matrix, 898 Levenberg-Marquardt optimization, 1656 Levenshtein distance, 1436 Levenshtein edit distance, 1436 Lexicographic order, 394 LFA (local feature analysis), 434 LFF (local feature focus) method, 1175 License plate recognition, 1828 Lie group, 1194 Lifted wavelets, 425 Lifting, 425 Lifting scheme, 425 Light, 1717
Index Light adaptation, 1766 Light amplification for detection and ranging, 163 Light detection and ranging [LIDAR], 163 Light field, 72 Light intensity, 1678 Light quantum, 1680 Light source, 213 Light source color, 214 Light source detection, 214 Light source geometry, 213 Light source placement, 213 Light stripe, 208 Light stripe ranging, 209 Light transport, 71 Light vector, 1718 Light wave, 1718 Light-emitting diode [LED], 260 Light-emitting semiconductor diode, 260 Lighting, 215 Lighting capture, 215 Lightness, 1771 Lightpen, 1519 Likelihood function, 313 Likelihood ratio, 313 Likelihood score, 313 Limb, 1466 Limb extraction, 1251 Limited angle tomography, 624 Limiting resolution, 47 Linde-Buzo-Gray [LBG] algorithm, 684 Line and field averaging, 783 Line at infinity, 453 Line averaging, 783 Line cotermination, 839 Line detection, 843 Line detection operator, 844 Line drawing, 1464 Line equation, 844 Line fitting, 844 Line following, 912 Line graph, 1464 Line grouping, 912 Line hull, 1000 Line intersection, 838 Line junction, 838 Line label, 1466 Line labeling, 1466 Line linking, 913 Line matching, 1468 Line moment, 1017 Line moment invariant, 1017 Line number, 781
Index Line of sight, 1759 Line scan camera, 112 Line segment, 843 Line segmentation, 846 Line sensor, 100 Line spread function [LSF], 337 Line sweeping, 1318 Line thinning, 1738 Linear, 295 Linear appearance variation, 442 Linear array sensor, 113 Linear blend, 297 Linear blend operator, 297 Linear classifier, 1198 Linear composition, 1091 Linear decision function, 1221 Linear degradation, 576 Linear discriminant analysis [LDA], 1215 Linear discriminant function, 1215 Linear dynamical system [LDS], 1573 Linear features, 1014 Linear filter, 501 Linear filtering, 501 Linear gray-level scaling, 486 Linear independence, 296 Linear interpolation, 618 Linear operator, 295 Linear perspective, 1374 Linear programming [LP], 302 Linear programming [LP] relaxation, 302 Linear quantizing, 248 Linear regression, 294 Linear regression model, 1199 Linear scale space representation, 814 Linear sharpening filter, 506 Linear sharpening filtering, 501 Linear shift invariant [LSI] filter, 516 Linear shift invariant [LSI] operator, 516 Linear smoothing filter, 505 Linear smoothing filtering, 501 Linear symmetry, 355 Linear system of equations, 296 Linear transformation, 295 Linear-Gaussian model, 1573 Linearity of Fourier transform, 410 Linearly non-separable, 296 Linearly separable, 296 Linear machines, 1198 Linear-median hybrid filter, 516 Line-based structure from motion, 1368 Line-drawing analysis, 1464 Line-likeness, 1080 Link, 1466
1919 Link function, 1651 Liouville’s Theorem, 1098 Lip shape analysis, 1256 Lip tracking, 1256 Liquid crystal display [LCD], 260 Liquid-crystal-tunable filter [LCTF], 157 Listing rule, 1757 Litting a sum of Gaussians, 930 LIV (latent identity variable), 1490 Live lane, 916 Live wire, 916 LLE (locally linear embedding), 1190 lm, 1856 LMS (least median of squares), 1667 LMS algorithm (least-mean squares algorithm), 645 Lobe of intense reflection, 1703 Local, 1046 Local aggregation methods, 1340 Local binary pattern [LBP], 1081 Local component, 1238 Local conditional probability densities [LCPD], 1610 Local contrast adjustment, 446 Local curvature estimation, 1018 Local distance computation, 383 Local enhancement, 445 Local feature analysis [LFA], 434 Local-feature-focus [LFF] method, 1175 Local frequency, 343 Local geometry of a space curve, 1278 Local invariant, 1046 Local motion, 1129 Local motion event, 1567 Local motion vector, 1129 Local operation, 447 Local operator, 446 Local orientation, 355 Local point invariant, 1046 Local pre-processing, 446 Local receptive field, 1645 Local roughness, 1080 Local shading analysis, 1281 Local shape pattern [LSP], 1083 Local surface shape, 1291 Local ternary pattern [LTP], 1083 Local threshold, 930 Local variance contrast, 446 Local variational inference, 1419 Local wave number, 343 Local-feature-focus algorithm, 1175 Locality preserving projection [LPP], 1190 Locality sensitive Hashing [LSH], 1524
1920 Localization, 1826 Localization criterion, 890 Locally adaptive histogram equalization, 491 Locally linear embedding [LLE], 1190 Locally ordered texture, 1089 Location recognition, 1170 Location-based retrieval, 1529 Locking state, 1145 LOD (level of detail), 808 LoG (Laplacian of Gaussian), 536 LoG (Laplacian-of-Gaussian) edge detector, 893 LoG (Laplacian-of-Gaussian) filter, 535 LoG (Laplacian-of-Gaussian) kernel, 535 LoG (Laplacian-of-Gaussian) operator, 536, 894 LoG (Laplacian-of-Gaussian) transform, 536 Logarithmic filtering, 501 Logarithmic Gabor wavelet, 423 Logarithmic transform, 482 Log-based modification, 482 Logic operation, 475 Logic operator, 475 Logic predicate, 1407 Logic sampling, 1156 Logic system, 1405 Logical feature, 1541 Logical image, 34 Logical story unit, 1533 Logical object representation, 972 Logistic regression, 293 Logistic regression model, 293 Logistic sigmoid function, 1652 Logit function, 1653 Log-likelihood, 314 Log-normal distribution, 328 Log-polar image, 41 Log-polar stereo, 41 Log-polar transformation, 1758 Long baseline stereo, 1325 Long motion sequence, 1134 Long short-term memory [LSTM], 1645 Longitudinal aberration, 154 Longitudinal chromatic aberration, 154 Longitudinal magnification, 174 Longitudinal spherical aberration, 154 Long-time analysis, 1134 Long-wave infrared [LWIR], 1693 Lookup Table, 761 Loop, 1467 Loopy belief propagation [LBP], 1426 Loss function, 921 Loss matrix, 1220
Lossless and near-lossless compression standard of JPEG [JPEG-LS], 1842 Lossless coding, 661 Lossless compression, 661 Lossless compression technique, 662 Lossless predictive coding, 669 Lossless predictive coding system, 669 Lossy coding, 662 Lossy compression, 662 Lossy compression technique, 662 Lossy image coding, 662 Lossy image compression, 662 Lossy predictive coding, 670 Lossy predictive coding system, 671 Low- and high-frequency coding, 686 Low complexity entropy coding [LCEC], 1847 Low frequency, 342 Low-activity region, 172 Low-angle illumination, 219 Low-definition television [LDTV], 1820 Lower approximation set, 1511 Lower half-space, 298 Low-level scene modeling, 64 Low-level vision, 65 Low-pass filter [LPF], 546 Low-pass filtering, 546 Lowe’s curve segmentation method, 846 Lowe’s method, 846 LP (linear programming), 302 LP (linear programming) relaxation, 302 LPF (low-pass filter), 546 Lp-norm, 300 LPP (locality preserving projection), 1190 LSF (line spread function), 337 LSH (locality sensitive Hashing), 1524 LSI (latent semantic indexing), 1488 LSP (local shape pattern), 1083 LSI (linear shift invariant) filter, 516 LSI (linear shift invariant) operator, 516 LSTM (long short-term memory), 1645 LTP (local ternary pattern), 1083 Lucas-Kanade algorithm, 1163 Lucas-Kanade method, 1163 Lumigraph, 73 Luminance, 192, 1771 Luminance efficiency, 1859 Luminance gradient-based dense optical flow algorithm, 1161 Luminescence intensity, 1678 Luminosity, 1771 Luminosity coefficient, 1780 Luminous efficacy, 1680 Luminous emittance, 1680
Index Luminous energy, 1687 Luminous exitance, 1680 Luminous flux, 1680 Luminous intensity, 1678 Luminous reflectance, 1712 Luminous transmittance, 1680 Lumisphere, 73 Luv color space, 748 LWIR (long-wave infrared), 1693 lx, 1856 LZW, 686 LZW coding, 686 LZW compression, 686 LZW decoding, 686
M M (magenta), 728 M cones, 1767 M step (maximization step), 1663 MacAdam ellipses, 736 Macbeth color chart, 727 Macbeth color checker, 727 Mach band, 1777 Mach band effect, 1777 Machine inspection, 1825 Machine learning, 1411 Machine learning styles, 1413 Machine vision, 68 Machine vision system [MVS], 68 Macro lens, 134 Macro-block, 798 Macro-block partition, 798 Macro-block sub-partition, 798 Macrophotography, 137 Macro-pixel, 166 Macroscopic image analysis, 63, 1831 Macro-texture, 1062 MAD (mean absolute deviation), 1057 MAD (median absolute deviation), 529 Magenta [M], 728 Magnetic disc, 270 Magnetic resonance image, 25 Magnetic resonance imaging [MRI], 628 Magnetic tape, 270 Magneto-optical disc, 270 Magnification, 89 Magnification factors, 89 Magnitude correlation, 1445 Magnitude function, 46 Magnitude of the gradient [MOG], 879 Magnitude quantization, 248 Magnitude resolution, 45
1921 Magnitude response, 881 Magnitude-modulated phase measurement, 207 Magnitude-retrieval problem, 402 Mahalanobis distance, 376 Mahalanobis distance between clusters, 377 Main speaker close-up [MSC], 1540 Mammogram analysis, 1832 MAMS (map of average motion in shot), 1534 Man in the loop, 1396 Manhattan distance, 372 Manhattan world, 1479 Manifold, 1193 Many-to-many graph matching, 1462 Map analysis, 1835 MAP (maximum a posteriori) estimation, 1203 Map of average motion in shot [MAMS], 1534 MAP (maximum a posteriori) probability, 566 MAP (mean average precision), 1051 Map registration, 1455 Mapping function, 479 Mapping knowledge, 1391 March cube [MC] algorithm, 1308 Marching cubes [MC], 1308 Marching tetrahedra, 1280 Marching tetrahedral algorithm [MT algorithm], 1310 Margin, 1206 Marginal distribution, 322 Marginal Gaussian distribution, 326 Marginal likelihood, 1603 Marginal ordering, 765 Mark, 1468 Marked point process, 1583 Marker, 944 Marker-controlled segmentation, 944 Markerless motion capture, 1136 Marking space, 691 Markov blanket, 1610 Markov boundary, 1610 Markov chain, 1597 Markov chain Monte Carlo [MCMC], 1599 Markov decision process [MDP], 1597 Markov model, 1592 Markov network, 1591 Markov process, 1573 Markov property, 1593 Markov random field [MRF], 1590 Markovian assumption, 1601 Marr edge detector, 896 Marr operator, 896 Marr-Hildreth edge detector, 896 Marr’s framework, 1266 Marr’s theory, 1266
Marr’s visual computational theory, 1266 Mask, 507 Mask convolution, 508 Mask operation, 508 Mask ordering, 509 Masking, 1761 Masking attack, 708 MAT (medial axis transform), 1000 Match move, 1431 Match verification, 1430 Matched filter, 1450 Matching, 1430 Matching based on inertia equivalent ellipses, 1450 Matching based on raster, 1441 Matching feature extraction, 1432 Matching function, 1433 Matching method, 1432 Matching metric, 1430 Matching pursuit [MP], 1640 Mathematical morphology, 1723 Mathematical morphology for binary images, 1732 Mathematical morphology for gray-level images, 1745 Mathematical morphology operation, 1726 MATLAB, 346 MATLAB operators, 346 Matrix decomposition, 282 MATrix LABoratory, 346 Matrix of inverse projection, 641 Matrix of orthogonal projection, 235 Matrix of perspective projection, 228 Matrix of weak perspective projection, 231 Matrix-array camera, 113 Matrix-pyramid [M-pyramid], 818 Matte, 1537 Matte extraction, 855 Matte reflection, 1683 Matte surface, 1683 Matting, 855 MAVD (mean absolute value of the deviation), 963 Max filter, 530 Max-flow min-cut theorem, 921 Maximal ball, 1739 Maximal clique, 1631 Maximal component run length, 1081 Maximal nucleus cluster [MNC], 1123 Maximal nucleus triangle clusters [MNTC], 1125 Maximally stable extremal region [MSER], 1037
Index Maximization step [M step], 1663 Maximum a posteriori [MAP] estimation, 1203 Maximum a posteriori [MAP] probability, 566 Maximum clique, 1632 Maximum entropy, 564 Maximum entropy Markov model [MEMM], 1593 Maximum entropy method, 564 Maximum entropy restoration, 564 Maximum flow, 921 Maximum likelihood, 1203 Maximum likelihood estimation, 1203 Maximum likelihood method, 306 Maximum likelihood-expectation maximization [ML-EM], 643 Maximum margin, 1204 Maximum margin Hough transform, 854 Maximum normal form membership function, 1484 Maximum operation, 1743 Maximum pooling, 1650 Maximum posterior probability, 566 Maximum value, 1744 Max-min sharpening transform, 529 Max-product rule, 1426 Max-sum algorithm, 1420 Maxwell color triangle, 731 MBD (minimum barrier distance), 1039 MC (marching cubes), 1308 MC (march cube) algorithm, 1308 MCMC (Markov chain Monte Carlo), 1599 MCMC (Monte Carlo Markov chain), 1600 m-connected, 366 m-connected component, 369 m-connected region, 369 m-connectivity, 364 MDC (minimum directional contrast), 1038 MDL (minimum description length), 1221 MDN (mixture density network), 318 MDP (Markov decision process), 1597 MDS (multidimensional scaling), 1191 Mean, 519 Mean absolute deviation [MAD], 1057 Mean absolute value of the deviation [MAVD], 963 Mean and Gaussian curvature shape classification, 1292 Mean average precision [MAP], 1051 Mean curvature, 1292 Mean distance to boundary, 1117 Mean field annealing [MFA], 568 Mean field approximation, 568 Mean filter, 519
Index Mean shift, 936 Mean shift cluster analysis, 1213 Mean shift clustering, 936 Mean shift filtering, 520 Mean shift image segmentation, 936 Mean smoothing, 518 Mean smoothing operator, 519 Mean square difference [MSD], 1057 Mean squared edge distance [MSED], 1453 Mean squared error [MSE], 1057 Mean squared reconstruction error [MSRE], 1057 Mean structural similarity index measurement [MSSIM], 53 Mean value coordinates, 188 Mean variance, 1056 Meaningful region layer, 1544 Meaningful watermark, 702 Meaningless watermark, 702 Mean-shift smoothing, 500 Mean-square signal-to-noise ratio, 656 Measurement accuracy, 1050 Measurement coding, 1638 Measurement equation, 1045 Measurement error, 1055 Measurement matrix, 1045 Measurement precision, 1051 Measurement resolution, 1046 Measurement unbiasedness, 1051 Mechanical vignetting, 130 Media, 5 Media space, 698 Medial, 1004 Medial axis skeletonization, 1001 Medial axis transform [MAT], 1000 Medial line, 1003 Medial surface, 1004 Median absolute deviation [MAD], 529 Median filter, 527 Median filtering, 529 Median flow filtering, 529 Median smoothing, 529 Medical image, 21 Medical image registration, 1455 Medical image segmentation, 949 Medium transmission, 612 MEI (motion energy image), 1135 Membership function, 1484 Membrane model, 563 Membrane spline, 563 MEMM (maximum entropy Markov model), 1593 Memory, 270
1923 Memory-based methods, 1175 MER (minimum enclosing rectangle), 997 Mercer condition, 1660 Merging technique, 994 Mesh compression, 1118 Mesh generators, 1118 Mesh model, 1118 Mesh subdivision, 1118 Mesh-based warping, 615 Mesopic vision, 1775 Message, 690 Message authentication code, 1524 Message mark, 690 Message passing algorithm, 1613 Message pattern, 690 Message vector, 690 Message-passing process, 318 Messaging layer, 272 M-estimation, 306 M-estimator, 306 Meta-knowledge, 1391 Metameric, 1800 Metameric colors, 1800 Metameric matching, 1800 Metameric stimulus, 1800 Metamerism, 1800 Metamers, 98 Method for decision layer fusion, 1509 Method for feature layer fusion, 1509 Method for pixel layer fusion, 1507 Method for texture analysis, 1064 Method of histogram distance, 497 Method of optimal directions, 1637 Methods based on change of spatial relation among texels, 1381 Methods based on change of texel shape, 1380 Methods based on change of texel size, 1380 Metric, 1041 Metric combination, 1042 Metric determinant, 1018, 1281 Metric property, 1042 Metric reconstruction, 121 Metric space, 1044 Metric stratum, 455 Metric tree, 1044 Metrical calibration, 121 Metrics for evaluating retrieval performance, 1516 Mexican hat operator, 342 MFA (mean field annealing), 568 M-files, 346 MFOM (modified figure of merit), 962 MHD (modified Hausdorff distance), 1434
1924 MHI (motion history image), 1139 Micro-mirror array, 165 Micron, 1718 Microscope, 160 Microscopic image analysis, 62 Micro-texture, 1062 Microwave image, 16 Microwave imaging, 181 Mid-crack code, 991 Middle-level semantic modeling, 1477 Middle-level vision, 65 Mid-point filter, 526 Mid-point filtering, 527 Mid-sagittal plane, 630 Mid-wave infrared [MWIR], 16 Millimeter wave image, 16 Millimeter-wave radiometric images, 16 Millions of instructions per second [MIPS], 346 MIMD (multiple instruction multiple data), 343 Mimicking, 58 Min filter, 530 Minimal, 1295 Minimal external difference, 948 Minimal imposition, 944 Minimal point, 1295 Minimal spanning tree, 947 Minimum average difference function, 1449 Minimum barrier distance [MBD], 1039 Minimum bounding rectangle, 997 Minimum degree ordering, 1636 Minimum description length [MDL], 1221 Minimum directional contrast [MDC], 1038 Minimum distance classifier, 1202 Minimum enclosing rectangle [MER], 997 Minimum error classifier, 1203 Minimum error criterion, 1203 Minimum invariant core, 1512 Minimum mean square error function, 1449 Minimum normal form membership function, 1484 Minimum operation, 1743 Minimum value, 1744 Minimum vector dispersion [MVD], 897 Minimum vector dispersion edge detector, 897 Minimum-perimeter polygon [MPP], 993 Minimum-variance quantizing, 673 Minkowski addition, 1733 Minkowski distance, 374 Minkowski loss, 293 Minkowski subtraction, 1733 Min-max rule, 1487 Minus lens, 133 MIP mapping, 1091
Index MIPS (millions of instructions per second), 346 Mirror, 1676 Mirror image, 1676 Misapplied constancy theory, 1815 Misclassification rate, 1229 Mis-identification, 1229 Mismatching error, 1462 Missing at random, 1638 Missing data, 1637 Missing pixel, 97 Miss-one-out test, 1227 MIT database, 1854 Mixed connectivity, 364 Mixed National Institute of Standards and Technology [MNIST] database, 1852 Mixed path, 367 Mixed pixel, 25 Mixed Poisson-Gaussian noise, 587 Mixed reality, 80 Mixed-connected, 366 Mixed-connected component, 369 Mixed-connected region, 39 Mixing proportion, 318 Mixture coefficient, 319 Mixture component, 319 Mixture density network [MDN], 318 Mixture distribution, 318 Mixture distribution model, 318 Mixture model, 318 Mixture of Bernoulli distribution, 331 Mixture of experts, 320 Mixture of Gaussians, 320 Mixture of linear regression model, 294 Mixture of logistic regression model, 293 ML-EM (maximum likelihood-expectation maximization), 643 MLP (multilayer perceptron), 1209 MLPN (multilayer perceptron network), 1209 MLS (moving least squares), 1304 MNC (maximal nucleus cluster), 1123 MNC contour, 1123 MNC contour edgelet, 1124 MNC convex hulls, 1124 MNC-based image object shape recognition methods, 1125 MNIST (Mixed National Institute of Standards and Technology) database, 1852 MNTC (maximal nucleus triangle clusters), 1125 Modal deformable model, 70 Mode, 926 Mode filter, 526 Model, 1396
Index Model acquisition, 1397 Model assumption, 1398 Model base, 1399 Model base indexing, 1399 Model building, 1397 Model comparison, 1397 Model evidence, 1401 Model exploitation, 1399 Model exploration, 1399 Model fitting, 1400 Model for image understanding system, 64, 1259 Model inference, 1400 Model invocation, 1400 Model library, 1399 Model matching, 1399 Model of image capturing, 168 Model of reconstruction from projection, 636 Model of tracking system, 1145 Model order, 1398 Model order selection, 1399 Model parameter learning, 1398 Model parameter update, 1398 Model reconstruction, 1398 Model registration, 1399 Model robustness, 1399 Model selection, 1400 Model structure, 1399 Model topology, 1400 Model-assisted search, 1454 Model-based coding, 682 Model-based compression, 682 Model-based computer vision, 66 Model-based control strategy, 1396 Model-based feature detection, 1189 Model-based object recognition, 1173 Model-based recognition, 1173 Model-based reconstruction, 1356 Model-based segmentation, 834 Model-based stereo, 1326 Model-based texture features, 1067 Model-based tracking, 1149 Model-based vision, 63 Moderate Resolution Imaging Spectroradiometer image, 15 Modes of stereo imaging, 198 Modified figure of merit [MFOM], 962 Modified Hausdorff distance [MHD], 1434 Modular eigenspace, 1173 Modulation transfer function [MTF], 338 Modulator, 338 MOG (magnitude of the gradient), 876 Moiré contour pattern, 204
1925 Moiré contour stripes, 204 Moiré fringe, 204 Moiré interferometry, 204 Moiré patches, 32 Moiré pattern, 204 Moiré topography, 205 Moment, 1024 Moment characteristic, 1024 Moment invariant, 1024 Moment matching, 1427 Moment of intensity, 1024 Moment of the gray-level distribution, 1085 Moment preserving, 864 Moments of random variables, 1580 Mondrian, 26 Mondrian image, 26 Monochromat, 1799 Monochromatic light, 1694 Monochrome, 726 Monochrome image, 27 Monocular, 1374 Monocular image, 33 Monocular vision, 1374 Monocular vision indices of depth, 1374 Monocular visual space, 1374 Monogenic signal, 439 Monogenic wavelets, 423 Monotonic fuzzy reasoning, 1486 Monotonicity, 279 Monte Carlo EM algorithm, 1663 Monte Carlo Markov chain [MCMC], 1600 Monte Carlo method, 1599 Monte Carlo sampling, 1420 Moore boundary tracing algorithm, 912 Moravec corner detector, 842 Moravec interest point operator, 838 Morphable models, 986 Morphic transformation, 468 Morphing, 1461 Morphologic algorithm, 1731 Morphologic factor, 1731 Morphologic image reconstruction, 943 Morphological edge detector, 1750 Morphological filter, 1731 Morphological filtering, 1731 Morphological gradient, 1750 Morphological image processing, 1730 Morphological image pyramid, 822 Morphological segmentation, 1731 Morphological shape decomposition [MSD], 1731 Morphological smoothing, 1750 Morphological transformation, 1726
1926 Morphology, 1729 Morphology for binary images, 1732 Morphology for gray-level images, 1745 Morphometry, 1731 Mosaic, 1094 Mosaic filter, 158 Mosaic-based video compression, 797 Mosquito noise, 585 Most probable explanation, 317 Mother wavelet, 425 Motion, 1127 Motion activity, 1861 Motion activity descriptor, 1861 Motion afterimage, 1771 Motion analysis, 1134 Motion blur, 576 Motion capture, 1136 Motion coding, 799 Motion compensated prediction, 800 Motion compensated video compression, 800 Motion compensation, 800 Motion constancy, 1795 Motion contrast, 1813 Motion deblurring, 793 Motion descriptor, 1861 Motion detection, 1136 Motion direction profile, 1157 Motion discontinuity, 1129 Motion energy image [MEI], 1135 Motion estimation, 1130 Motion factorization, 1128 Motion feature, 1128 Motion field, 1156 Motion from line segments, 1369 Motion history image [MHI], 1139 Motion history volume, 1139 Motion image, 27 Motion JPEG, 1843 Motion layer segmentation, 1143 Motion model, 1128 Motion moment, 1137 Motion object tracking, 1147 Motion parallax, 1792 Motion pattern, 1135 Motion perception, 1792 Motion prediction, 799 Motion region type histogram [MRTH], 1138 Motion representation, 1137 Motion segmentation, 1143 Motion sequence analysis, 1134 Motion smoothness constraint, 1128 Motion stereo, 199 Motion tracking, 1147
Index Motion trajectory, 1139 Motion understanding, 1135 Motion vector [MV], 1137 Motion vector directional histogram [MVDH], 1137 Motion-adaptive, 792 Motion-based user interaction, 1519 Motion-compensation filter, 800 Motion-detection-based filtering, 792 Mounting a ring light into a diffusely reflecting dome, 224 Mouse bites, 1735 Movement analysis, 1134 Moving average filter, 520 Moving average smoothing, 1128 Moving image, 27 Moving least squares [MLS], 1304 Moving light display, 1368 Moving light imaging, 1361 Moving object analysis, 1134 Moving object detection, 1141 Moving object detection and localization, 1142 Moving object segmentation, 1142 Moving object tracking, 1147 Moving observer, 1371 Moving Pictures Experts Group [MPEG], 1844 MP (matching pursuit), 1640 m-path, 367 MPEG (Moving Pictures Experts Group), 1844 MPEG-1, 1844 MPEG-2, 1844 MPEG-21, 1862 MPEG-4, 1845 MPEG-4 coding model, 1845 MPEG-7, 1860 MPEG-7 descriptor, 1860 MPP (minimum-perimeter polygon), 993 M-pyramid (matrix-pyramid), 818 MQ-coder, 1843 MRF (Markov random field), 1590 MRF inference, 1592 MRI (magnetic resonance imaging), 628 MRTH (motion region type histogram), 1138 MSC (main speaker close-up), 1540 MSD (mean square difference), 1057 MSD (morphological shape decomposition), 1731 MSE (mean squared error), 1057 MSED (mean squared edge distance), 1453 MSER (maximally stable extremal region), 1037 MSRE (mean squared reconstruction error), 1057
Index MSSIM (mean structural similarity index measurement), 53 MT algorithm (marching tetrahedral algorithm), 1310 MTF (modulation transfer function), 338 Multiband images, 16 Multibaseline stereo, 1345 Multi-camera behavior correlation, 1571 Multi-camera distributed behavior, 1571 Multi-camera system, 116 Multi-camera system blind area, 116 Multi-camera topology, 116 Multichannel images, 16 Multichannel kernel, 16 Multi-class logistic regression, 294 Multi-class supervised learning, 1413 Multi-class support vector machine, 1205 Multidimensional edge detection, 888 Multidimensional histogram, 489 Multidimensional image, 18 Multidimensional scaling [MDS], 1191 Multidimensional search trees, 1523 Multifilter bank, 559 Multifocal tensor, 1350 Multifocus images, 38 Multi-frame motion estimation, 1131 Multi-frontal ordering, 1636 Multigrid, 809 Multi-grid method, 809 Multi-image registration, 1459 Multi-label actor-action recognition, 1555 Multilayer image description model, 1543 Multilayer perceptron [MLP], 1209 Multilayer perceptron network [MLPN], 1209 Multilayer sequential structure, 1260 Multilevel method, 812 Multilinear constraint, 1351 Multilinear method, 284 Multimedia content description interface, 1860 Multimedia framework, 1862 Multimodal analysis, 1494 Multimodal biometric approaches, 1232 Multimodal fusion, 95 Multimodal image alignment, 1459 Multimodal interaction, 1494 Multimodal neighborhood signature, 1494 Multimodal user interface, 1495 Multi-module-cooperated structure, 1260 Multi-nocular stereo, 1346 Multi-object behavior, 1571 Multi-order transforms, 450 Multi-parametric curve, 983 Multi-pass texture mapping, 1091
1927 Multi-pass transforms, 450 Multiple edge, 1467 Multiple instance learning, 1413 Multiple instruction multiple data [MIMD], 343 Multiple kernel learning, 1417 Multiple light source detection, 214 Multiple motion segmentation, 1143 Multiple target tracking, 1148 Multiple view interpolation, 618 Multiple-image iterative blending, 716 Multiple-ocular image, 33 Multiple-ocular imaging, 1345 Multiple-ocular stereo imaging, 1346 Multiple-ocular stereo vision, 1346 Multiplexed illumination, 220 Multiplication color mixing, 731 Multiplicative image scaling, 475 Multiplicative noise, 583 Multi-resolution, 808 Multi-resolution image analysis, 808 Multi-resolution method, 808 Multi-resolution refinement equation, 808 Multi-resolution theory, 808 Multi-resolution threshold, 808 Multi-resolution wavelet transform, 808 Multi-scale, 811 Multi-scale description, 812 Multi-scale entropy, 812 Multi-scale integration, 812 Multi-scale method, 812 Multi-scale representation, 811 Multi-scale standard deviation, 1117 Multi-scale theory, 811 Multi-scale transform, 815 Multi-scale wavelet transform, 813 Multi-sensor alignment, 1458 Multi-sensor geometry, 1494 Multi-sensor information fusion, 1494 Multi-sensory fusion, 1494 Multislice CT scanner, 633 Multi-spectral analysis, 1494 Multi-spectral images, 16 Multi-spectral imaging, 180 Multi-spectral segmentation, 1494 Multi-spectral thresholding, 932 Multi-symbol message coding, 690 Multi-tap camera, 109 Multi-temporal images, 777 Multi-thresholding, 926 Multi-thresholds, 926 Multivariate density kernel estimator, 937 Multivariate normal distribution, 325 Multi-view activity representation, 1564
Multi-view geometry, 1346 Multi-view image registration, 1459 Multi-view stereo, 1346 Mumford-Shah functional, 946 Munsell chroma, 751 Munsell color chart, 727 Munsell color model, 727, 751 Munsell color notation system, 727, 752 Munsell color system, 751 Munsell renotation system, 751 Munsell value, 751 Mutation, 1481 Mutual illumination, 219 Mutual information, 1499 Mutual interreflection, 219 MV (motion vector), 1137 MVD (minimum vector dispersion), 897 MVDH (motion vector directional histogram), 1137 MVS (machine vision system), 68 MWIR (mid-wave infrared), 16 Myopia, 1785 Mystery or ghastfulness, 1545
N N16 space, 355 NA (numerical aperture), 145 Naïve Bayesian classifier, 1604 Naïve Bayesian model, 1604 Nameable attribute, 1032 NAND operator, 476 Nanometer, 1718 Narrow baseline stereo, 1324 Narrowband Integrated Services Digital Network [N-ISDN], 1848 Nat, 49 National Institute of Standards and Technology [NIST], 1839 National Television Standards Committee [NTSC], 1840 National Television System Committee [NTSC], 1840 National Television System Committee [NTSC] format, 1821 Natural image matting, 857 Natural image statistics, 307 Natural light, 211 Natural material recognition, 1835 Natural media emulation, 480 Natural neighbor interpolation, 1286 Natural order of Walsh functions, 393 Natural order, 393
Natural parameters, 324 Natural scale detection transform, 416 Natural vignetting, 129 NC (nucleus cluster), 1123 NCC (normalized cross-correlation), 304 NCD (normalized compression distance), 375 NDC (normalized device coordinates), 178 NDM (normalized distance measure), 963 Near duplicate image/video retrieval, 1515 Near infrared [NIR], 1693 Near light source, 214 Nearest neighbor distance [NND], 377 Nearest neighbor distance ratio [NNDR], 377 Nearest-class mean classifier, 1202 Nearest neighbor interpolation, 617 Near-lossless coding, 662 Near-lossless compression, 662 Near-photo-quality, 263 Near-photo-quality printer, 262 Near-photorealistic, 262 Nearsightedness, 1785 Necker cube illusion, 1813 Necker cube, 1813 Necker illusion, 1813 Necker reversal, 1814 NEE (noise equivalent exposure), 584 Needle map, 1269 Negate operator, 486 Negative afterimage, 1771 Negative lens, 133 Negative posterior log likelihood, 1607 Negentropy, 1582 Neighborhood, 356 Neighborhood averaging, 518 Neighborhood of a point, 356 Neighborhood operator, 512 Neighborhood pixels, 361 Neighborhood processing, 512 Neighborhood-oriented operations, 447 Nerve, 1125 Nested dissection ordering, 1636 Net area, 1022 Netted surface, 1301 Neural displacement theory, 1816 Neural network [NN], 1642 Neural-network classifier, 1643 Neuron process procedure, 1760 Neutral color, 723 Neutral expression, 1247 Nevatia and Binford system, 1851 Nevatia operator, 878 News abstraction, 1540 News item, 1540
Index News program, 1540 News program structuring, 1539 News story, 1540 News video, 1540 Newton optimization, 1656 Next view planning, 1263 Next view prediction, 1264 N-fold cross-validation, 1228 Night sky light, 212 Night-light vision, 1774 NIR (near infrared), 1693 N-ISDN (Narrowband Integrated Services Digital Network), 1848 NIST (National Institute of Standards and Technology), 1839 NIST data set, 1852 NLF (noise level function), 584 NLPCA (nonlinear PCA), 433 NMF (non-negative matrix factorization), 1191 NMR (nuclear magnetic resonance), 629 NMSF (non-negative matrix-set factorization), 1192 NN (neural network), 1642 NND (nearest neighbor distance), 377 NNDR (nearest neighbor distance ratio), 377 Nod, 93 Nodal point, 137 Node of graph, 1618 Noise, 582 Noise characteristics, 584 Noise equivalent exposure [NEE], 584 Noise estimation, 595 Noise level function [NLF], 584 Noise model, 582 Noise reduction, 596 Noise removal, 596 Noise source, 584 Noise suppression, 597 Noiseless coding theorem, 660 Noiselets, 424 Noise-to-signal ratio [NSR], 966 Noise-whitening filter, 1195 Non-accidentalness, 1473 Non-affective gesture, 1255 Non-blackbody, 1691 Non-central camera, 115 Non-clausal form syntax, 1408 Non-complete tree-structured wavelet decomposition, 421 Non-convexity, 1110 Nondeterministic polynomial [NP] problem, 1669
1929 Nondeterministic polynomial complete [NPC] problem, 1669 Non-hidden watermark, 703 Non-hierarchical control, 1396 Noninformative prior, 1478 Non-intrinsic image, 39 Non-Jordan parametric curve, 984 Non-Lambertian reflection, 1683 Non-Lambertian surface, 1683 Non-layered clustering, 1213 Non-layered cybernetics, 1394 Nonlinear degradation, 576 Nonlinear diffusion, 567 Nonlinear filter, 504 Nonlinear filtering, 501 Nonlinear mapping, 1243 Nonlinear mean filter, 505 Nonlinear operator, 522 Nonlinear PCA [NLPCA], 433 Nonlinear sharpening filter, 506 Nonlinear sharpening filtering, 501 Nonlinear smoothing filter, 505 Nonlinear smoothing filtering, 501 Nonlocal patches, 9 Nonluminous object color, 1801 Non-maximum suppression, 905 Non-maximum suppression using interpolation, 906 Non-maximum suppression using masks, 906 Nonmetric MDS, 1191 Non-negative matrix factorization [NMF], 1191 Non-negative matrix-set factorization [NMSF], 1192 Non-object perceived color, 1801 Nonoverlapping field of view, 171 Nonparametric clustering, 1213 Nonparametric decision rule, 1221 Nonparametric method, 315 Nonparametric modeling, 1556 Nonparametric surface, 1283 Nonparametric texture synthesis, 1092 Non-photorealistic rendering [NPR], 72 Nonpolarized light, 1717 Nonrecursive filter, 504 Non-regular parametric curve, 984 Non-relaxation algebraic reconstruction technique, 642 Non-rigid model representation, 1313 Non-rigid motion, 1132 Non-rigid registration, 1456 Non-rigid structure from motion, 1368 Non-rigid tracking, 1149 Non-separating cut, 1030
1930 Non-simply connected polyhedral object, 370 Non-simply connected region, 370 Non-singular linear transform, 296 Non-smooth parametric curve, 984 Non-subtractive dither, 264 Non-symbolic representation, 972 Nonuniform illumination, 218 Non-uniform rational B-splines [NURBS], 973 Nonverbal communication, 1247 Non-visual indices of depth, 1358 Norm, 300 Norm of smooth solution, 571 Normal curvature, 1298 Normal distribution, 325 Normal flow, 1159 Normal focal length, 140 Normal kernel, 936 Normal map, 1280 Normal order of Walsh functions, 393 Normal plane, 1279 Normal trichromat, 1799 Normal vector, 1280 Normal vector to an edge, 874 Normal vision, 1757 Normal-gamma distribution, 328 Normalization, 303 Normalized central moment, 1532 Normalized compression distance [NCD], 375 Normalized correlation, 304 Normalized cross-correlation [NCC], 304 Normalized cut, 922 Normalized DCG (discounted cumulative gain), 1519 Normalized device coordinates [NDC], 178 Normalized discounted cumulative gain [DCG], 1519 Normalized distance measure [NDM], 963 Normalized exponential function, 1653 Normalized spectrum, 1695 Normalized squared error, 966 Normalized sum of squared difference [NSSD], 1349 Normal-rounding function, 279 Normal-Wishart distribution, 329 NOT operator, 477 Notability, 1036 Notability of watermark, 694 Notch filter, 556 Novel view synthesis, 1333 NP (nondeterministic polynomial) problem, 1669 NPC (nondeterministic polynomial complete) problem, 1669
Index NP-complete, 1669 NPR (non-photorealistic rendering), 72 NSLDA (null space LDA), 1217 NSR (noise-to-signal ratio), 966 NSSD (normalized sum of squared difference), 1349 NTSC (National Television Standards Committee), 1840 NTSC (National Television System Committee), 1840 NTSC camera, 114 NTSC color, 1823 NTSC (National Television System Committee) format, 1840 Nuclear magnetic resonance [NMR], 629 Nuclear magnetic resonance imaging, 630 Nucleus, 1124 Nucleus cluster [NC], 1123 Null space LDA [NSLDA], 1217 Null space property, 1639 Number of pixels mis-classified, 965 Number plate recognition, 1828 Numeric prediction, 1413 Numerical aperture [NA], 145 NURBS (non-uniform rational B-splines), 973 Nyquist frequency, 242 Nyquist rate, 242 Nyquist sampling rate, 242
O O’Gorman edge detector, 889 Object, 830 Object beam, 196 Object boundary, 973 Object characteristics, 1011 Object class recognition, 1171 Object classification, 1172 Object contour, 974 Object count agreement [OCA], 963 Object description, 1011 Object detection, 831 Object extraction, 831 Object feature, 1012 Object grouping, 831 Object identification, 1172 Object indexing, 1522 Object interaction characterization, 1565 Object knowledge, 1390 Object labeling, 1482 Object layer, 1544 Object layer semantic, 1544 Object localization, 849
Object matching, 1441 Object measurement, 1043 Object perceived color, 1801 Object plane, 191 Object recognition, 1168 Object region, 831 Object representation, 969 Object segmentation, 831 Object shape, 1096 Object tracking, 1147 Object verification, 1172 Object-based, 682 Object-based coding, 682 Object-based representation, 1474 Object-boundary quantization [OBQ], 252 Object-centered representation, 1271 Objective criteria, 958 Objective evaluation, 52 Objective feature, 1544 Objective fidelity, 649 Objective fidelity criterion, 649 Objective function, 910 Objective quality measurement, 52 Objectness, 950 Object-oriented, 682 Object-specific metric, 1044 Oblique illumination, 216 Oblivious, 702 OBQ (object-boundary quantization), 252 Observation model, 95 Observation probability, 1153 Observation variable, 1153 Observed unit, 1166 Observer, 1133 Observer-centered, 1269 Obstacle detection, 1825 Obtaining original feature, 1248 OCA (object count agreement), 963 Occam’s razor, 1606 Occipital lobe, 1757 Occluding contour, 1359 Occluding contour analysis, 1360 Occluding contour detection, 1360 Occluding edge, 869 Occlusion, 1359 Occlusion detection, 1359 Occlusion recovery, 1359 Occlusion understanding, 1360 Occupancy grid, 1826 OCR (optical character recognition), 1829 Octave pyramids, 819 Octree [OT], 1314
1931 OD (optical density), 1023 ODCT (odd-symmetric discrete cosine transform), 413 Odd chain code, 988 Odd field, 781 Odd-antisymmetric discrete sine transform [ODST], 414 Odd-symmetric discrete cosine transform [ODCT], 413 ODST (odd-antisymmetric discrete sine transform), 414 Off-axis imaging, 178 OFF-center neurons, 1768 Offline handwriting recognition, 1830 Offline video processing, 789 Offset noise, 593 OFLD (optimal Fisher linear discriminant), 1216 OFW (optimal feature weighting), 1188 OLDA (optimal LDA), 1216 Omnicamera, 114 Omnidirectional camera, 114 Omnidirectional image, 35 Omnidirectional sensing, 1676 Omni-directional stereo, 1322 Omnidirectional vision, 114 Omnidirectional vision systems, 114 OMP (orthogonal matching pursuit), 1640 ON-center neuron, 1767 One response criterion, 890 One-bounce model, 1712 One-class learning, 1567 One-lobe raised cosine, 508 One-point perspective, 227 One-shot learning, 1418 One-versus-one classifier, 1197 One-versus-rest classification, 1182 One-versus-the-rest classifier, 1197 One-way Hash, 1524 Online activity analysis, 1565 Online anomaly detection, 1565 Online filtered inference, 1154 Online handwriting recognition, 1830 Online learning, 1416 Online likelihood ratio test, 336 Online model adaptation, 1133 Online processing, 336 Online video processing, 789 Online video screening, 789 Opacity, 1681 Opaque, 1681 Opaque overlap, 1093 Open half-space, 298
1932 Open lower half-space, 299 Open neighborhood, 356 Open neighborhood of a pixel, 361 Open operator, 1727 Open parametric curve, 984 Open set, 904 Open set recognition, 1169 Open upper half-space, 299 Open Voronoï region, 1122 OpenEXR [EXR], 274 Opening, 1727 Opening characterization theorem, 1734 Open-set evaluation, 1245 Open-set object, 904 Operations of camera, 91 Operator, 446, 510 Opponent color model, 750 Opponent color space, 750 Opponent color system, 750 Opponent color theory, 750 Opponent colors, 750 Opponent model, 750 Optical axis, 189 Optical center, 108 Optical character recognition [OCR], 1829 Optical coherence tomography, 625 Optical computing tomography, 624 Optical cut, 1536 Optical density [OD], 1023 Optical depth, 1023 Optical disc, 270 Optical flow, 1158 Optical flow boundary, 1159 Optical flow constraint equation, 1164 Optical flow equation, 1162 Optical flow field, 1160 Optical flow field segmentation, 1160 Optical flow region, 1159 Optical flow smoothness constraint, 1164 Optical illusion, 1810 Optical image, 39 Optical image processing, 40 Optical imaging, 181 Optical interference computer tomography, 628 Optical lens, 130 Optical part of the electromagnetic spectrum, 1687 Optical pattern, 1159 Optical procedure, 1760 Optical process, 1760 Optical transfer function [OTF], 338 Optical triangulation, 204, 209 Optical vignetting, 130 Optimal basis encoding, 1172
Index Optimal color, 722 Optimal feature weighting [OFW], 1188 Optimal Fisher linear discriminant [OFLD], 1216 Optimal LDA [OLDA], 1216 Optimal path search, 911 Optimal predictor, 672 Optimal quantizer, 674 Optimal thresholding, 930 Optimal uniform quantizer, 675 Optimization, 1654 Optimization method, 1131 Optimization parameter estimation, 1654 Optimization technique, 642 Optimization-based matting, 857 Optimum filter, 544 Optimum statistical classifier, 1203 OR operator, 477 Ordered landmark, 975 Ordered over-relaxation, 1601 Ordered subsets-expectation maximization [OS-EM], 644 Ordered texture, 1062 Ordering, 502 Ordering constraint, 1352 Ordering filter, 502, 523 Ordering relation, 1744 Orderless images, 41 Order-statistics filter, 506 Order-statistics filtering, 501 Ordinal transformation, 506 Ordinary Voronoï diagram, 1121 Oren-Nayar model, 1712 Organ of vision, 1757 Orientation, 1561 Orientation error, 1561 Orientation from change of texture, 1379 Orientation map, 27 Orientation of graph, 1622 Orientation of object surface, 1289 Orientation representation, 1561 Oriented Gabor filter bank, 1578 Oriented Gaussian kernels and their derivatives, 1758 Oriented pyramid, 819 Oriented smoothness, 468 Oriented texture, 1068 Original image layer, 1544 ORL database, 1855 Orthogonal Fourier-Mellin moment invariants, 1027 Orthogonal gradient operator, 881 Orthogonal image transform, 390 Orthogonal least squares, 289
Index Orthogonal masks, 850 Orthogonal matching pursuit [OMP], 1640 Orthogonal multiple-nocular stereo imaging, 1350 Orthogonal projection, 234 Orthogonal random variables, 1580 Orthogonal regression, 295 Orthogonal transform codec system, 676 Orthogonal transform, 390 Orthogonal tri-nocular stereo imaging, 1347 Orthographic, 390 Orthographic camera, 110 Orthographic projection, 235 Orthography, 234 Orthoimage, 34 Orthonormal, 308 Orthonormal vectors, 43 Orthoperspective, 227 Orthoperspective projection, 228 Osculating circle, 1276 Osculating plane, 1279 OS-EM (ordered subsets-expectation maximization), 644 OT (octree), 1314 OTF (optical transfer function), 338 Otsu method, 928 Outer orientation, 126 Outer product approximation of Hessian matrix, 898 Outer product of vectors, 43 Outlier, 1666 Outlier detection, 1567 Outlier rejection, 1567 Out-of-focus blur, 575 Output cropping, 494 Outside of a curve, 1277 Outward normal direction, 1276 Over operator, 471 Over proportion, 1833 Overcomplete dictionary, 1637 Overcomplete wavelets, 423 Over-determined inverse problem, 1398 Overfitting, 1648 Overlapped binocular field of view, 171 Over-relaxation, 1600 Over-sampling, 247 Over-segmentation, 828 Over-segmented, 828
P P3P (perspective 3 points), 1375 PAC (probably approximately correct), 1410
PAC (probably approximately correct) learning framework, 1410 PAC-Bayesian framework, 1411 PACS (picture archiving and communication system), 272 Padding, 514 Paired boundaries, 1445 Paired contours, 1444 Pairwise correlation, 1444 Pairwise geometric histogram [PGH], 1442 PAL camera, 114 PAL (phase alteration line) format, 1821 Palette, 20 Paley order of Walsh functions, 393 Palm-geometry recognition, 1254 Palm-print recognition, 1254 Pan, 91 Pan angle, 93 Panchromatic, 21 Panchromatic image, 21 Panning, 93 Panography, 237 Panoramic, 171 Panoramic image mosaic, 1498 Panoramic image stereo, 1324 Panoramic video camera, 113 Panscan, 258 Pan/tilt/zoom [PTZ] camera, 113 Pantone matching system [PMS], 743 Panum fusional area, 1339 Panum vision, 1339 Parabolic point, 1290 Parallactic angle, 1328 Parallax, 1327 Parallax error detection and correction, 1352 Parallel edge, 1467 Parallel edge detection, 901 Parallel processing, 345 Parallel projection, 235 Parallel technique, 58, 833 Parallel tracking and mapping [PTAM], 1548 Parallel-boundary technique, 901 Parallel-region technique, 923 Parameter estimation, 1654 Parameter learning, 1491 Parameter of a digital video sequence, 775 Parameter shrinkage, 1649 Parameter-sensitive Hashing [PSH], 1524 Parametric bi-cubic surface model, 1283 Parametric boundary, 982 Parametric contour models, 919 Parametric curve, 982 Parametric edge detector, 899
1934 Parametric mesh, 1283 Parametric model, 919 Parametric motion estimation, 1131 Parametric motion model, 1131 Parametric surface, 1283 Parametric time-series modeling, 1557 Parametric transformation, 469 Parametric warps, 615 Parametric Wiener filter, 601 Para-perspective, 230 Para-perspective projection, 229 Paraxial, 154 Paraxial aberration, 154 Paraxial optics, 1674 Pareto optimal, 1656 Parseval’s Theorem, 401 Part recognition, 1174 Part segmentation, 1313 Part-based model, 1174 Part-based recognition, 1174 Part-based representation, 971 Partial order, 1745 Partial rotational symmetry index, 1085 Partial translational symmetry index, 1085 Partial volume interpolation, 618 Partially constrained pose, 1560 Partially observable Markov decision process [POMDP], 1597 Partially ordered set, 1745 Particle counting, 949 Particle filter, 1154 Particle filtering, 1154 Particle flow tracking, 1156 Particle segmentation, 943 Particle swarm optimization, 1655 Particle tracking, 1156 Partition Fourier space into bins, 409 Partition function, 565, 566 Partitioned iterated function system [PIFS], 684 Parzen estimator, 1659 Parzen window, 316, 1660 Parzen window technique, 1659 PASCAL (pattern analysis, statical modeling and computational learning), 1166 PASCAL visual object classes [VOC] challenge, 1853 PASCAL VOC (visual object classes) challenge, 1853 Passive attack, 706 Passive navigation, 1826 Passive sensing, 99 Passive sensor, 99 Passive stereo algorithm, 1326
Index Past-now-future [PNF], 1626 Patch, 992 Patch classification, 1297 Patch-based motion estimation, 1136 Patch-based stereo, 1342 Path, 366 Path classification, 1565 Path coherence function, 1150 Path coherence, 1150 Path finding, 911 Path from a pixel to another pixel, 366 Path learning, 1144 Path modeling, 1144 Path tracing, 1147 Path-connected, 364 Pattern, 1165 Pattern analysis, statistical modeling, and computational learning [PASCAL], 1166 Pattern class, 1165 Pattern classification, 1166 Pattern classification technique, 1166 Pattern classifier, 1166 Pattern grammar, 1097 Pattern matching, 1450 Pattern noise, 594 Pattern recognition [PR], 1172 Payload, 692 PB (predict block), 1847 PCA (principal component analysis), 430 PCA algorithm, 431 PCA transformation fusion method, 1508 PCA-SIFT (principal component analysisscale-invariant feature transform), 434 PCM (phase correlation method), 1445 PCM (pulse-coding modulation), 657 PCS (profile connection space), 749 PDF (probability density function), 315 PDM (point distribution model), 1102 PE (probability of error), 1058 Peace or desolation, 1545 Peak, 1294 Peak signal-to-noise ratio, 656 Peak SNR [PSNR], 656 Pearson kurtosis, 1581 Pearson’s correlation coefficient, 1444 Pedestrian detection, 1251 Pedestrian surveillance, 1251 Penalty term, 569 Pencil of lines, 1705 Penumbra, 1385 Pepper noise, 592 Pepper-and-salt noise, 592
Index Percentage area mis-classified, 965 Percentile filter, 504 Percentile method, 930 Perceptibility, 695 Perceptible watermark, 701 Perception, 1789 Perception constancy, 1795 Perception of luminosity contrast, 1772 Perception of motion, 1792 Perception of movement, 1792 Perception of visual edge, 1759 Perception-oriented color model, 739 Perceptron, 1208 Perceptron criterion, 1209 Perceptron learning algorithm, 1208 Perceptron network, 1208 Perceptual adaptive watermarking, 701 Perceptual constancy, 1794 Perceptual feature, 1262 Perceptual feature grouping, 1262 Perceptual grouping, 1809 Perceptual image, 1770 Perceptual organization, 1809 Perceptual reversal, 1809 Perceptual shaping, 698 Perceptual significant components, 696 Perceptual slack, 696 Perceptually based colormap, 756 Perceptually uniform color space, 752 Perfect map, 1633 Perform, 1392 Performance characterization, 954 Performance comparison, 954 Performance judgement, 955 Performance-driven animation, 7 Perifovea, 1764 Perimeter, 1017 Perimeter of region, 1017 Periodic color filter arrays, 158 Periodic noise, 583 Periodical pattern, 1351 Periodicity estimation, 1351 Periodicity of Fourier transform, 411 Peripheral rod vision, 1775 Person surveillance, 1824 Perspective, 225 Perspective 3 points [P3P], 1375 Perspective camera, 109, 173 Perspective distortion, 226 Perspective inversion, 228, 1809 Perspective n-point problem [PnP], 1375 Perspective projection, 227 Perspective scaling, 226
1935 Perspective scaling factor, 227 Perspective theory, 1815 Perspective transform, 226 Perspective transformation, 226 PET (positron emission tomography), 627 Petri net, 1574 PFELICS (progressive fast and efficient lossless image compression system), 665 P-frame, 800 PGH (pair-wise geometric histogram), 1442 Phantom, 634 Phantom model, 634 Phase alteration line [PAL] format, 1821 Phase change printers, 263 Phase congruency, 401 Phase correlation, 1445 Phase correlation method [PCM], 1445 Phase space, 336 Phase spectrum, 401 Phase unwrapping technique, 336 Phase-based registration, 1458 Phase-matching stereo algorithm, 1342 Phase-retrieval problem, 336 Phong illumination model, 1713 Phong reflectance model, 1713 Phong shading, 1713 Phong shading model, 1713 Phosphene, 1811 Phosphorescence, 212 Photo consistency, 1387 Photo pop-up, 1355 Photo Tourism, 1369 Photoconductive camera tube, 156 Photoconductive cell, 156 Photodetector, 155 Photodiode, 156 Photoelectric effect, 1704 Photogrammetric modeling, 239 Photogrammetry, 239 Photography, 237 Photomap, 22 Photometric brightness, 1679 Photometric calibration, 193 Photometric compatibility constraint, 1330 Photometric decalibration, 193 Photometric image formation, 177 Photometric invariant, 1049 Photometric normalization technique, 442 Photometric stereo, 1360 Photometric stereo analysis, 1361 Photometric stereo imaging, 1361 Photometry, 1677 Photomicroscopy, 237
1936 Photomontage, 1496 Photo-mosaic, 1498 Photon, 1679 Photon noise, 586 Photonics, 1672 Photopic response, 1772 Photopic standard spectral luminous efficiency, 1859 Photopic vision, 1774 Photopical zone, 1774 Photoreceptor, 1766 Photorealistic, 51 Photosensitive sensor, 99 Photosensitive unit, 105 Photosensor spectral response, 1697 Photosite, 94 Phototelegraphy, 1836 Physical optics, 1672 Physics-based vision, 67 Physiological model, 739 PICT (picture data format), 275 Pictorial pattern recognition, 1171 Pictorial structures, 1438 Picture, 4 Picture archiving and communication system [PACS], 272 Picture data format [PICT], 275 Picture element, 10 Picture tree, 1100 Picture-phone, 1821 PIE database, 1855 Piecewise linear “tent” function, 526 Piecewise linear transformation, 495 Piecewise rigidity, 1132 PIFS (partitioned iterated function system), 684 Pincushion distortion, 579 Pinhole camera, 107 Pinhole camera model, 107, 173 Pinhole model, 107 Pink noise, 589 Pipeline parallelism, 344 Pirate, 706 Pit, 1294 Pitch, 90 Pitch angle, 90 Pixel, 9 Pixel addition operator, 472 Pixel by pixel, 470 Pixel classification, 828 Pixel connectivity, 363 Pixel coordinate transformation, 448 Pixel coordinates, 10 Pixel counting, 1022
Index Pixel distance error, 962 Pixel division operator, 475 Pixel edge strength, 872 Pixel exponential operator, 483 Pixel grayscale resolution, 45 Pixel intensity, 10 Pixel intensity distributions, 10 Pixel interpolation, 617 Pixel jitter, 165 Pixel labeling, 908 Pixel layer fusion, 1505 Pixel logarithm operator, 482 Pixel multiplication operator, 475 Pixel number error, 965 Pixel per inch, 49 Pixel spatial distribution, 963 Pixel subsampling, 11 Pixel subtraction operator, 473 Pixel transform, 480 Pixel value, 10 Pixel-based representation, 11 Pixel-change history, 11 Pixel-class-partition error, 965 Pixel-dependent threshold, 928 Place recognition, 1170 Planar calibration target, 119 Planar facet model, 1305 Planar mosaic, 1498 Planar motion estimation, 1132 Planar patch extraction, 1305 Planar patches, 1305 Planar projective transformation, 231 Planar rectification, 1334 Planar scene, 1468 Planar-by-complex, 975 Planar-by-vector, 974 Planckian locus, 1691 Planck’s constant, 1857 Plane, 1274 Plane at infinity, 453 Plane conic, 1277 Plane figures, 69 Plane plus parallax, 1329 Plane projective transfer, 233 Plane sweep, 1334 Plane sweeping, 1318 Plane tiling, 70 Plane-based structure from motion, 1368 Platykurtic probability density function, 316 Playback control, 707 PLDA (probabilistic linear discriminant analysis), 1216 Plenoptic camera, 115
Index Plenoptic function, 73 Plenoptic function representation, 73 Plenoptic modeling, 74 Plenoptic sampling, 73 Plessey corner finder, 842 Plücker coordinate, 188 Plücker coordinate system, 188 Plumb-line calibration method, 128 Plumb-line method, 128 pLSA (probabilistic latent semantic analysis), 1488 pLSI (probabilistic latent semantic indexing), 1489 PM (principal manifold), 1194 PM (progressive mesh), 1283 PMF (probability mass function), 316 PMS (pantone matching system), 743 PNF (past-now-future), 1626 PNG (portable network graphics) format, 273 PnP (perspective n-point problem), 1375 POI/AP model, 1563 POI/AP (point of interest/activity path) model, 1554 Point, 1273 Point at infinity, 453 Point cloud, 1274 Point detection, 837 Point distribution model [PDM], 1102 Point feature, 1013 Point invariant, 1046 Point light source, 214 Point matching, 1442 Point object, 837 Point of extreme curvature, 1019 Point of interest, 1554 Point of interest/activity path [POI/AP] model, 1554 Point operation, 447 Point operator, 447 Point sampling, 244 Point similarity measure, 1442 Point source, 213 Point transformation, 447 Point-based rendering, 74 Pointillisme, 763 Pointillist painting, 763 Point-spread function [PSF], 336 Poisson distribution, 333 Poisson matting, 857 Poisson noise, 587 Poisson noise removal, 587 Polar coordinates, 184 Polar harmonic transforms, 1048
1937 Polar rectification, 1334 Polarization, 1716 Polarized light, 1716 Polarizer, 159 Polarizing filter, 159 Polya distribution, 330 Polycurve, 980 Polygon, 992 Polygon matching, 1441 Polygonal approximation, 992 Polygonal network, 995 Polyhedral object, 1317 Polyhedron, 1316 Polyline, 974 Polyphase filter, 602 Polytope, 1317 Polytree, 1612 POMDP (partially observable Markov decision process), 1597 Pooling, 1650 Pooling layer, 1650 Pop-off, 1537 Pop-on, 1537 Pop-out effect, 1807 Portable bitmap, 273 Portable floatmap, 274 Portable graymap, 273 Portable image file format, 273 Portable network graphics [PNG] format, 273 Portable networkmap, 274 Portable pixmap, 273 Portrait mode, 264 Pose, 1559 Pose clustering, 1562 Pose consistency, 1559 Pose determination, 1561 Pose estimation, 1561 Pose representation, 1561 Pose variance, 1559 Position, 278 Position descriptor, 1013 Position detection, 68 Position invariance, 1047 Position invariant, 1047 Position variable, 336 Position-dependent brightness correction, 443 Positive afterimage, 1771 Positive color, 1700 Positive definite matrix, 429 Positive predictive value [PPV], 1229 Positive semi-definite, 429 Positive semidefinite matrix, 429 Positron emission tomography [PET], 627
1938 Postal code analysis, 1830 Posterior distribution, 319 Posterior probability, 319 Posterize, 762 Post-processing for stereo vision, 1351 PostScript [PS] format, 1862 Posture analysis, 1560 Potential field, 565 Potential function, 565 Power method, 285 Power spectrum, 400 Power-law transformation, 483 PPCA (probabilistic principal component analysis), 433 PPCA (probabilistic principal component analysis) model, 434 PPN (probabilistic Petri net), 1575 PPV (positive predictive value), 1229 PR (pattern recognition), 1172 Precise measurement, 1053 Precision, 1534 Precision matrix, 1585 Precision parameter, 1075 Predicate calculus, 1406 Predicate logic, 1405 Predict block [PB], 1847 Predict unit [PU], 1847 Predicted-frame, 800 Prediction equation, 1153 Prediction-frame, 800 Predictive coding, 669 Predictive compression method, 669 Predictive error, 672 Predictor, 672 Predictor-controller, 1153 Preemptive RANSAC, 1666 Prefix compression, 665 Prefix property, 666 Pre-image, 40 Preprocessing, 58 Prewitt detector, 884 Prewitt gradient operator, 884 Prewitt kernel, 884 Primal sketch, 1268 Primary aberrations, 147 Primary color, 728 Primitive detection, 837 Primitive event, 1566 Primitive pattern element, 1165 Primitive texture element, 1064 Prim’s algorithm, 947 Principal axis, 189 Principal component, 430
Index Principal component analysis [PCA], 430 Principal component analysis-scale-invariant feature transform [PCA-SIFT], 434 Principal component basis space, 431 Principal component representation, 431 Principal component transform, 428 Principal curvature, 1297 Principal curvature sign class, 1297 Principal curve, 434 Principal direction, 1297 Principal distance, 231 Principal manifold [PM], 1194 Principal normal, 1278 Principal point, 189 Principal ray, 1705 Principal subspace, 1236 Principal surface, 434 Principal texture direction, 1068 Principles of stereology, 1660 Printer, 262 Printer resolution, 264 Printing character recognition, 1829 Prior probability, 1478 Private watermarking, 701 Privileged viewpoint, 196 Probabilistic aggregation, 948 Probabilistic causal model, 1623 Probabilistic data association, 1150 Probabilistic distribution, 307 Probabilistic graphical model, 317 Probabilistic Hough transform, 852 Probabilistic inference, 318 Probabilistic labeling, 1483 Probabilistic latent semantic analysis [pLSA], 1488 Probabilistic latent semantic indexing [pLSI], 1489 Probabilistic linear discriminant analysis [PLDA], 1216 Probabilistic model learning, 1409 Probabilistic Petri net [PPN], 1575 Probabilistic principal component analysis [PPCA], 433 Probabilistic principal component analysis [PPCA] model, 434 Probabilistic relaxation labeling, 1483 Probabilistic relaxation, 1483 Probability, 308 Probability density, 314 Probability density estimation, 315 Probability density function [PDF], 315 Probability distribution, 307 Probability mass function [PMF], 316
Index Probability of error [PE], 1058 Probability-weighted figure of merit, 963 Probably approximately correct [PAC], 1410 Probably approximately correct [PAC] learning framework, 1410 Probe data, 1833 Probe image, 22 Probe pattern, 1377 Probe set, 1167 Probe video, 777 Probe, 1661 Probit function, 311 Probit regression, 312 Procedural knowledge, 1391 Procedural representation, 1401 Procedure knowledge, 1392 Process colors, 743 Processing strategy, 958 Procrustes analysis, 1455 Procrustes average, 1455 Production rule, 1403 Production system, 1403 Profile connection space [PCS], 749 Profile curve, 1360 Profile curve-based reconstruction, 1357 Profiles, 1101 Program stream, 1845 Progressive coding, 658 Progressive compression, 658 Progressive fast and efficient lossless image compression system [PFELICS], 665 Progressive image coding, 658 Progressive image compression, 658 Progressive image transmission, 271 Progressive mesh [PM], 1283 Progressive sample consensus [PROSAC], 1666 Progressive scan camera, 113 Progressive scanning, 780 Projection, 232 Projection background process, 237 Projection center, 234 Projection histogram, 1008 Projection imaging, 178, 232 Projection matrix, 233 Projection matrix in subspace, 1235 Projection of a vector in subspace, 1235 Projection plane, 234 Projection printing, 178 Projection theorem for Fourier transform, 402 Projective depth, 124 Projective disparity, 124 Projective factorization, 234
1939 Projective geometry, 233 Projective imaging mode, 232 Projective invariant, 233 Projective plane, 234 Projective reconstruction, 1325 Projective space, 233 Projective stereo vision, 1325 Projective stratum, 233 Projective transformation, 453 Proof of ownership, 705 Proper kurtosis, 1581 Proper subgraph, 1617 Proper supergraph, 1616 Properties of biometrics, 1232 Properties of mathematical morphology, 1724 Property learning, 1554 Property of watermark, 693 Property-based matching, 1441 Proposal distribution, 322 Propositional representation model, 1404 PROSAC (progressive sample consensus), 1666 Protanomalous vision, 1799 Protanopia, 1799 Protected conjugate gradients, 303 Prototype, 350 Prototype matching, 1809 Proximity matrix, 1214 Pruning, 1738 PS (PostScript) format, 1862 Pseudo-color, 753 Pseudo-color enhancement, 753 Pseudo-color image, 19 Pseudo-color image processing, 753 Pseudo-color transform, 754 Pseudo-color transform function, 754 Pseudo-color transform mapping, 755 Pseudo-coloring, 755 Pseudo-coloring in the frequency domain, 755 Pseudo-convolution, 341 Pseudo-isochromatic diagram, 1800 Pseudo-random, 348 Pseudo-random numbers, 349 PSF (point-spread function), 336 PSH (parameter-sensitive Hashing), 1524 PSNR (peak SNR), 656 Psychophysical model, 752 Psychophysics, 1789 Psycho-visual redundancy, 653 PTAM (parallel tracking and mapping), 1548 p-tile method, 930 PTZ (pan/tilt/zoom) camera, 113 PU (predict unit), 1847
1940 Public watermarking, 700 Public-key cryptography, 700 Pull down, 1537 Pull up, 1537 Pulse response function, 338 Pulse time duration measurement, 206 Pulse-coding modulation [PCM], 657 Punctured annulus, 848 Pupil, 1762 Pupil magnification, 174 Pure light, 735 Pure purple, 735 Pure spectrum, 735 Purity, 722 Purkinje effect, 1775 Purkinje phenomenon, 1775 Purple, 736 Purple boundary, 736 Purple line, 736 Purposive perception, 1262 Purposive vision, 1264 Purposive-directed, 1824 Puzzle, 1526 Pyramid, 817 Pyramid architecture, 817 Pyramid fusion, 1508 Pyramid fusion method, 1508 Pyramid matching, 1451 Pyramid transform, 817 Pyramid vector quantization, 685 Pyramidal radial frequency implementation of shiftable multi-scale transforms, 808 Pyramid-structured wavelet decomposition, 818
Q QBIC (query by image content), 1514 QCIF (quarter common intermediate format), 275 QIM (quantization index modulation), 697 QM-coder, 1841 QMF (quadrature mirror filter), 559 QP (quadratic programming), 303 QR factorization, 282 QT (quadtree), 1006 Quadratic discriminant function, 1214 Quadratic form, 303 Quadratic programming [QP], 303 Quadratic variation, 1301 Quadrature filters, 559 Quadrature mirror filter [QMF], 559 Quadric, 1301
Index Quadric patch, 1301 Quadric patch extraction, 1301 Quadric surface model, 1301 Quadri-focal tensor, 1350 Quadrilinear constraint, 1351 Quadtree [QT], 1006 Qualitative metric, 1045 Qualitative vision, 1264 Quality, 50 Quality of a digital image, 51 Quality of a Voronoï region, 1122 Quality of an MNC contour shape, 1124 Quantitative criteria, 1045 Quantitative metric, 1045 Quantization, 248 Quantization error, 673 Quantization function in predictive coding, 673 Quantization index modulation [QIM], 697 Quantization noise, 594 Quantization watermarking, 701 Quantizer, 673 Quantum efficiency, 103 Quantum noise, 585 Quantum optics, 1672 Quarter common intermediate format [QCIF], 275 Quarter-octave pyramids, 820 Quasi-Euclidean distance (DQE), 372 Quasi-affine reconstruction, 1356 Quasi-disjoint, 1313 Quasi-invariant, 1049 Quasi-monochromatic light, 1694 Quasinorm, 300 Quaternion, 464 Quench function, 1740 Query by example, 1518 Query by image content [QBIC], 1514 Query by sketch, 1518 Query expansion, 1519 Query in image database, 1520 Quincunx sampling, 820
R R (red), 728 R exact set, 1511 R rough set, 1511 RAC (relative address coding), 681 RADAR (radio detection and ranging), 163 Radar image, 16 Rademacher function, 392 Radial basis function [RBF], 1650 Radial basis function network, 1651
Index Radial basis kernel, 1658 Radial blur, 446 Radial distortion, 578 Radial distortion parameters, 578 Radial feature, 409 Radial lens distortion, 127 Radially symmetric Butterworth band-pass filter, 555 Radially symmetric Butterworth band-reject filter, 557 Radially symmetric Gaussian band-pass filter, 555 Radially symmetric Gaussian band-reject filter, 558 Radially symmetric ideal band-pass filter, 555 Radially symmetric ideal band-reject filter, 557 Radiance, 1679 Radiance exposure, 1689 Radiance factor, 1686 Radiance format, 1689 Radiance map, 1689 Radiance structure surrounding knowledge database, 1260 Radiant efficiency, 1689 Radiant energy, 1687 Radiant energy fluence, 1687 Radiant energy flux, 1689 Radiant exitance, 1686 Radiant exposure, 1688 Radiant flux, 1689 Radiant flux density, 1689 Radiant intensity, 1686 Radiant power, 1689 Radiation, 1684 Radiation flux, 1628 Radiation intensity, 1686 Radiation luminance, 1621 Radiation spectrum, 1688 Radio band image, 16 Radio detection and ranging [RADAR], 163 Radio frequency identification [RFID], 1171 Radio wave imaging, 182 Radiography, 169 Radiometric calibration, 1685 Radiometric correction, 579 Radiometric distortion, 579 Radiometric image formation, 1685 Radiometric quantity, 1689 Radiometric response function, 1685 Radiometry, 1686 Radiosity, 1685 Radius of a convolution kernel, 1645 Radius ratio, 1113
1941 Radius vector function, 978 Radon inverse transform, 438 Radon transform, 437 RAG (region adjacency graph), 1008 Ramer algorithm, 995 Ramp edge, 868 Random access camera, 110 Random crossover, 1482 Random dot stereogram, 1322 Random error, 1055 Random feature sampling, 1583 Random field, 1587 Random field model, 1585 Random field model for texture, 1066 Random forest, 1414 Random noise, 585 Random process, 1583 Random sample consensus [RANSAC], 1665 Random variable, 1579 Random walk, 380 Random walker, 381 Randomized Hough transform, 854 Random-watermarking false-positive probability, 691 Random-work false-positive probability, 691 Range compression, 482 Range data, 32 Range data fusion, 32 Range data integration, 32 Range data registration, 32 Range data segmentation, 32 Range edge, 1291 Range element [rangel], 12 Range flow, 1159 Range image, 32 Range image alignment, 1359 Range image edge detector, 32 Range image merging, 1357 Range image registration, 1359 Range of a function, 279 Range of visibility, 610 Range sensing, 205 Range sensor, 101 Rangefinding, 204, 210 Rangel (range element), 12 Rank filter, 506 Rank order filtering, 501 Rank support vector machine, 1206 Rank transform, 490 Rank-based gradient estimation, 523 RankBoost, 1749 Ranking function, 1749 Ranking problem, 1744
1942 Ranking sport match video, 1538 RANSAC (random sample consensus), 1665 RANSAC-based line detection, 844 Raster, 40 Raster image, 40 Raster image processing, 40 Raster scanning, 40 Rasterizing, 41 Raster-to-vector, 41 Rate distortion, 661 Rate-invariant action recognition, 1558 Rate-distortion coding theorem, 661 Rate-distortion function, 661 Rate-distortion theorem, 661 Rauch-Tung-Striebel [RTS] function, 1152 RAW image format, 274 Raw primal sketch, 1268 Ray, 1705 Ray aberration, 151 Ray radiation, 1690 Ray space, 72 Ray tracing, 1383 Ray tracking, 1705 Rayleigh criterion, 1720 Rayleigh criterion of resolution, 47 Rayleigh noise, 588 Rayleigh resolving limit, 47 Rayleigh scattering, 1720 RBC (recognition by components), 1174 RBF (radial basis function), 1650 RCT (reflection CT), 628 Reaction time, 1760, 1783 Real-aperture radar, 164 Reality, 80 Real-time processing, 1827 Recall, 1535 Recall rate, 1535 Receding color, 1801 Receiver operating curve [ROC], 1053 Receptive field, 1769 Receptive field of a neuron, 1770 Recognition, 1168 Recognition by alignment, 1170 Recognition by components [RBC], 1174 Recognition by parts, 1175 Recognition by structural decomposition, 1175 Recognition with segmentation, 1170 Reconstruction, 635 Reconstruction based on inverse Fourier transform, 637 Reconstruction based on series expansion, 644 Reconstruction by polynomial fit to projection, 638
Index Reconstruction error, 636 Reconstruction from shading, 1384 Reconstruction of vectorial wavefront, 1365 Reconstruction using angular harmonics, 638 Reconstruction-based super-resolution, 810 Record control, 707 Recovery of texture map, 1379 Recovery state, 1146 Rectangle detection, 850 Rectangularity, 1116 Rectification, 1333 Rectified configuration, 112 Rectifier function, 1653 Rectifier linear unit [ReLU], 1653 Rectifying plane, 1279 Rectilinear camera rig, 112 Recurrent neural network [RNN], 1645 Recursive block coding, 658 Recursive contextual classification, 1181 Recursive filter, 543 Recursive neural network [RNN], 1646 Recursive region growing, 939 Recursive splitting, 942 Recursive-running sums, 1073 Red [R], 728 Reduced ordering, 766 Reducing feature dimension and extracting feature, 1248 Reduction factor, 817 Reduction window, 817 Redundant data, 653 Reference beam, 196 Reference color table, 1530 Reference frame transformation, 791 Reference image, 50 Reference mark, 690 Reference pattern, 690 Reference plane, 124 Reference vector, 690 Reference views, 1169 Reference white, 730 Reflectance, 1710 Reflectance estimation, 1708 Reflectance factor, 1711 Reflectance image, 1281 Reflectance map, 1281 Reflectance model, 1712 Reflectance modeling, 1711 Reflectance ratio, 1711 Reflection, 1709 Reflection coefficient, 1711 Reflection component, 1711 Reflection computed tomography, 628
Index Reflection CT [RCT], 628 Reflection function, 1711 Reflection image, 36 Reflection law, 1711 Reflection operator, 1712 Reflection tomography, 628 Reflective color, 724 Reflectivity, 1710 Reflexive, 1745 Refraction, 1714 Refractive index, 1715 Refresh rate, 779 Region, 835 Region adjacency graph [RAG], 1008 Region aggregation, 938 Region boundary extraction, 835 Region centroid, 1021 Region concavity tree, 999 Region consistency, 1012 Region decomposition, 1006 Region density feature, 1023 Region detection, 835 Region filling, 607 Region filling algorithm, 1741 Region growing, 938 Region identification, 1168 Region invariant, 1048 Region-invariant moment, 1026 Region labeling, 1482 Region locator descriptor, 1021 Region matching, 1448 Region merging, 941 Region moment, 1024 Region neighborhood graph, 1007 Region of acceptable distortion, 696 Region of acceptable fidelity, 696 Region of interest [ROI], 827 Region of support, 862 Region outside a closed half-space, 299 Region propagation, 1148 Region proposal, 951 Region representation, 971 Region segmentation, 835 Region shape number, 1114 Region signature, 1008 Region snake, 917 Region splitting, 941 Region tracking, 1148 Regional activity, 1563 Region-based active contours, 920 Region-based image segmentation, 835 Region-based method, 832 Region-based parallel algorithms, 832
1943 Region-based representation, 970 Region-based segmentation, 834 Region-based sequential algorithm, 832 Region-based shape descriptor, 1106 Region-based signature, 971 Region-based stereo, 1340 Region-dependent threshold, 930 Registration, 1454 Regression, 290 Regression function, 290 Regression testing, 290 Regular grammar, 1225 Regular grid of texel, 1064 Regular least squares, 288 Regular parametric curve, 983 Regular point, 979 Regular polygon, 992 Regular shape, 984 Regular tessellation, 1093 Regularity, 1093 Regularization, 569 Regularization parameter, 569 Regularization term, 570 Regularized least squares, 288 Reinforcement learning, 1417 Rejection sampling, 322 Relation matching, 1440 Relational descriptor, 1028 Relational graph, 1634 Relational Markov network [RMN], 1592 Relational matching, 1439 Relational model, 1634 Relational shape description, 1096 Relation-based method, 1027 Relative address coding [RAC], 681 Relative aperture, 145 Relative attribute, 1032 Relative brightness, 45 Relative brightness contrast, 46 Relative data redundancy, 653 Relative depth, 1357 Relative encoding, 680 Relative entropy, 1504 Relative extrema density, 1086 Relative extrema, 1086 Relative field, 1757 Relative maximum, 1086 Relative minimum, 1086 Relative motion, 1129 Relative operating characteristic [ROC] curve, 1053 Relative orientation, 185 Relative pattern, 1437
1944 Relative saturation contrast, 760 Relative spectral radiant power distribution, 1699 Relative ultimate measurement accuracy [relative UMA, RUMA], 964 Relative UMA (ultimate measurement accuracy) [RUMA], 964 Relaxation, 1662 Relaxation algebraic reconstruction technique, 642 Relaxation labeling, 1663 Relaxation matching, 1663 Relaxation segmentation, 950 Relevance feedback, 1519 Relevance learning, 1417 Relevance vector, 1642 Relevance vector machine [RVM], 1641 Reliability of a match, 1436 Reliability, 1436 Relighting, 216 ReLU (rectifier linear unit), 1653 ReLU activation function, 1653 Remote sensing, 1832 Remote sensing image, 25 Remote sensing system, 1833 Removal attack, 709 Remove attack, 708 Rendering, 71 Renormalization group transform, 350 Repeatability, 1044 Repetitiveness, 1086 Representation, 970 Representation by reconstruction, 1266 Representation of motion vector field, 1137 Representations and algorithms, 1267 Representative frame of an episode, 1533 Reproducibility, 1053 Reproduction, 1481 Reproduction ratios, 238 Requirement for reference image, 967 Re-sampling, 246 Resection, 118 Reset noise, 594 Residual, 672 Residual error, 1056 Residual graph, 921 Residual image, 1770 Resolution, 46, 1406 Resolution limit, 47 Resolution of optical lens, 136 Resolution of segmented image, 958 Resolution scalability, 1843 Resolution-independent compression, 655
Index Resolvent, 1407 Resolving power, 87 Resource sharing, 1575 Response time, 1783 Responsivity, 97 Restoration, 562 Restoration transfer function, 562 Restored image, 35 Restricted isometry property [RIP], 1639 Resubstitution method, 1182 Reticle, 166 Retina, 1763 Retina illuminance, 1764 Retina recognition, 1765 Retinal cortex, 1764 Retinal disparity, 1764 Retinal illumination, 1764 Retinal image, 22 Retinex, 1764 Retinex algorithm, 1765 Retinex theory, 1764 Retrieving ratio, 1517 Retro-illumination, 220 Retroreflection, 1712 Reverse engineering, 81 Reversible authentication, 712 Reversing color pixel separation process, 933 Reversing grayscale pixel separation process, 933 Rewriting rule, 1225 RFID (radio frequency identification), 1171 RGB, 728 RGB model, 740 RGB-based color map, 756 Rhodopsin, 1760 Ribbon, 1318 Ricci flow, 1099 Rice noise, 588 Ridge, 1293 Ridge detection, 847, 1294 Ridge regression, 292 Riemann surface, 297 Riemannian manifold, 278 Riesz transform, 438 Right-handed coordinate system, 183 Rigid body, 457 Rigid body model, 1312 Rigid body motion, 1132 Rigid body segmentation, 833 Rigid body transformation, 456 Rigid motion estimation, 1132 Rigid registration, 1456 Rigidity constraint, 1132
Index Ring artifact, 581 Ring filter, 1067 RIP (restricted isometry property), 1639 RMN (relational Markov network), 1592 RMS (root mean square) error, 1057 RNN (recurrent neural network), 1645 RNN (recursive neural network), 1646 Road structure analysis, 1828 Robbins-Monro algorithm, 290 Roberts cross detector, 885 Roberts cross gradient operator, 885 Roberts cross operator, 885 Roberts kernel, 885 Robinson edge detector, 892 Robinson operators, 892 Robot behavior, 1827 Robot vision, 68 Robust, 694 Robust authentication, 712 Robust estimator, 306 Robust least squares, 288 Robust penalty function, 306 Robust regression, 292 Robust regularization, 570 Robust statistics, 306 Robust technique, 306 Robust watermark, 702 Robustness, 694 Robustness of watermark, 693 ROC (receiver operating curve), 1053 ROC curve, 1053 ROC (relative operating characteristic) curve, 1053 Rod, 1767 Rod cell, 1767 Rodriguez’s rotation formula, 463 Rodriguez’s formula, 463 ROI (region of interest), 827 Roll, 91 Rolling, 89 Rolling shutter, 106 Rolling shutter camera, 105 Roof edge, 868 Roof function, 280 Root mean square [RMS] error, 1057 Root-mean-square signal-to-noise ratio, 657 Rotating mask, 510 Rotation, 465 Rotation estimation, 466 Rotation invariance, 1048 Rotation invariant, 1048 Rotation matrix, 461 Rotation operator, 461
1945 Rotation representation, 461 Rotation theorem, 409 Rotation transformation, 461 Rotational symmetry, 466 Rotoscoping, 7 Rough set, 1511 Rough set theory, 1510 Roughness, 1080 Rounding function, 279 Roundness, 1115 RS-170, 776 R-S curve, 977 R-table, 852 R-tree, 1612 RTS (Rauch-Tung-Striebel) function, 1152 Rubber sheet distortion, 78 Rubber sheet model, 563 Rubber band area, 1023 Rule-based classification, 1180 RUMA (relative ultimate measurement accuracy), 964 Run-length coding, 679 Run-length compression, 680 Run-length feature, 1072 Running intersection property, 1420 Running total, 1072 RVM (relevance vector machine), 1641
S S (saturation), 759 S cones, 1767 SAD (sum of absolute differences), 1516 SAD (sum of absolute distortion), 1058 Saddle ridge, 1295 Saddle valley, 1294 Safe RGB colors, 742 SAI (simplex angle image), 1287 Salience, 1036 Saliency, 1036 Saliency map, 1036 Saliency smoothing, 1037 Salient behavior, 1570 Salient feature, 1037 Salient feature point, 1037 Salient patch, 1037 Salient pixel group, 1037 Salient point, 1037 Salient regions, 1037 Salt noise, 592 Salt-and-pepper noise, 590 Sample covariance, 244 Sample mean, 243
1946 Sample-based image completion, 607 Sampling, 241 Sampling bias, 244 Sampling density, 243 Sampling efficiency, 244 Sampling mode, 243 Sampling pitch, 243 Sampling rate, 245 Sampling rate conversion, 247 Sampling scheme, 243 Sampling theorem, 242 Sampling-importance resampling [SIR], 245 Sampson approximation, 986 SAR (synthetic aperture radar), 163 SAR (synthetic aperture radar) image, 25 SART (simultaneous algebraic reconstruction technique), 643 SAT (symmetric axis transform), 1002 Satellite image, 25 Satellite probatoire d’observation de la terra [SPOT], 1833 Saturability, 763 Saturated color, 725 Saturation [S], 759 Saturation enhancement, 759 Saturation exposure, 194 Savitzky-Golay filtering, 517 SBQ (square-box quantization), 251 SBVIR (semantic-based visual information retrieval), 1541 SC (sparse coding), 1637 Scalability, 663 Scalability coding, 663 Scalable color descriptor, 1861 Scalable vector graphics [SVG] format, 274 Scalar, 170 Scale, 813 Scale function, 417 Scale invariance, 1047 Scale invariant, 1047 Scale operator, 813 Scale reduction, 813 Scale selection, 813 Scale selection in tone mapping, 761 Scale space, 811 Scale space filtering, 815 Scale space image, 33 Scale space matching, 815 Scale space primal sketch, 815 Scale space representation, 814 Scale space technique, 814 Scale theorem, 408 Scaled orthographic projection, 235
Index Scaled rotation, 455 Scale-invariant feature transform [SIFT], 1343 Scaling, 467 Scaling factor, 467 Scaling function, 418 Scaling matrix, 466 Scaling transformation, 466 Scanline, 780 Scanline slice, 780 Scanline stereo matching, 1335 Scanner, 154 Scanning, 780 Scanning electron microscope [SEM], 160 Scanning raster, 1819 Scatter matrix, 1719 Scattered data interpolation, 1286 Scattering, 1719 Scattering coefficient, 1720 Scattering model image storage device, 270 Scatterplot, 30 SCC (slope chain code), 988 Scene, 170 Scene analysis, 1474 Scene brightness, 1472 Scene classification, 1488 Scene completion, 607 Scene constraint, 1479 Scene coordinates, 1471 Scene decomposition, 1474 Scene flow, 1474 Scene interpretation, 1477 Scene knowledge, 1391 Scene labeling, 1474 Scene layer, 1544 Scene layout, 1472 Scene modeling, 1475 Scene object labeling, 1474 Scene radiance, 1472 Scene recognition, 1475 Scene reconstruction, 1476 Scene recovering, 1472 Scene understanding, 1475 Scene vector, 1472 Scene-15 data set, 1851 Scene-centered, 1270 Scene-layer semantic, 1544 SCERPO (spatial correspondence evidential reasoning and perceptual organization), 1851 SCFG (stochastic context-free grammar), 1225 Schur complement, 1367 Schwarz criterion, 1607 Scotoma, 171
Index Scotopia vision, 1767 Scotopic response, 1772 Scotopic standard spectral luminous efficiency, 1859 Scotopic vision, 1774 Scotopic zone, 1775 Scratch removal, 606 Screen, 261 Screen capture, 261 Screen resolution, 261 Screening, 828 Screw motion, 467 SD (symmetric divergence), 966 SDMHD (standard deviation modified Hausdorff distance), 1434 SDTV (standard definition television), 1820 Seam carving, 1497 Seam selection in image stitching, 1497 Search tree, 1432 Second fundamental form, 1281 Second-order difference, 892 Second order Markov chain, 1598 Secondary colors, 729 Second-derivative operator, 888 Second-generation CT scanner, 631 Second-generation watermarking, 700 Second-moment matrix, 1025 Second moment of region, 1025 Second-order cone programming [SOCP], 1655 Second-order-derivative edge detection, 888 Secret communication, 718 Secret embedded communication, 718 Security, 705 Security of watermark, 693 Seed, 938 Seed and grow, 939 Seed pixel, 939 Seed region, 939 Segment adjacency graph, 356 Segmentation, 826 Segmentation algorithm, 834 Segmentation by weighted aggregation [SWA], 950 Segmentation-based stereo matching, 1340 Segmented image, 35 Seidel aberrations, 147 Selective attention mechanism, 1265 Selective attention, 1265 Selective authentication, 712 Selective filter, 542 Selective filtering, 543 Selective vision, 69
1947 Selectively replacing gray level by gradient, 442 Self-calibration, 121 Self-information, 660 Self-information amount, 660 Self-information quantity, 660 Self-inverting, 287 Self-localization, 1560 Self-motion active vision imaging, 199 Self-occluding, 1388 Self-organizing map [SOM], 1668 Self-shadow, 1386 Self-similarity matrix, 1700 SEM (scanning electron microscope), 160 Semantic attribute, 1032 Semantic event, 1541 Semantic feature, 1541 Semantic gap, 1542 Semantic image annotation and retrieval, 1542 Semantic image segmentation, 835 Semantic labeling, 1474 Semantic layer, 1541 Semantic net, 1402 Semantic network, 1402 Semantic primitive, 836 Semantic region, 836 Semantic region growing, 836 Semantic scene segmentation, 1476 Semantic scene understanding, 1476 Semantic texton forest, 1089 Semantic video indexing, 1542 Semantic video search, 1542 Semantic-based, 682 Semantic-based coding, 682 Semantic-based visual information retrieval [SBVIR], 1541 Semantics of mental representations, 1476 Semi-adaptive compression, 659 Semiangular field, 172 Semi-fragile watermark, 700 Semi-hidden Markov model, 1596 Semi-open tile quantization, 251 Semi-regular tessellation, 1093 Semi-supervised learning, 1412 Semi-thresholding, 926 Senior aberration, 147 Sensation, 1768 Sensation constancy, 1768 Sensation rate, 1769 Sensitivity, 1769 Sensitivity analysis attack, 709 Sensor, 94 Sensor component model, 94
1948 Sensor element, 94 Sensor fusion, 95 Sensor model, 94 Sensor modeling, 95 Sensor motion compensation, 98 Sensor motion estimation, 98 Sensor network, 95 Sensor noise, 593 Sensor path planning, 97 Sensor placement determination, 98 Sensor planning, 97 Sensor position estimation, 98 Sensor response, 97 Sensor sensitivity, 97 Sensor size, 96 Sensor spectral sensitivity, 97 Sensors based on photoemission principles, 99 Sensors based on photovoltaic principles, 100 Sensory adaptation, 1769 Separability, 388 Separable class, 1178 Separable filter, 542 Separable filtering, 542 Separable point spread function, 337 Separable template, 510 Separable transform, 388 Separating color image channels, 763 Sequence camera, 114 Sequencing, 1574 Sequency order of Walsh functions, 393 Sequential edge detection, 909 Sequential edge detection procedure, 909 Sequential gradient descent, 645 Sequential learning, 1418 Sequential minimal optimization [SMO], 1206 Sequential Monte Carlo method, 1599 Sequential technique, 833 Sequential thickening, 1739 Sequential thinning, 1738 Sequential-boundary technique, 909 Sequential-region technique, 937 Serial message passing schedule, 1426 Series expansion, 419 Series-coupled perceptron, 1210 Set, 286 Set interior, 286 Set theoretic modeling, 1313 Seventh-generation CT scanner, 633 Shade, 1384 Shading, 1383 Shading correction, 1383 Shading from shape, 1384 Shading in image, 1383
Index Shadow, 1384 Shadow analysis, 1386 Shadow detection, 1385 Shadow factor, 1362 Shadow matting, 1386 Shadow stripe, 1383 Shadow type labeling, 1386 Shadow understanding, 1387 Shallow neural network, 1643 Shannon sampling theorem, 241 Shannon-Fano coding, 666 Shape, 1095 Shape analysis, 1096 Shape boundary, 1124 Shape class, 1108 Shape classification, 1097 Shape compactness, 1111 Shape compactness descriptor, 1111 Shape complexity, 1116 Shape complexity descriptor, 1116 Shape context, 1104 Shape decomposition, 1096 Shape description, 1103 Shape descriptor, 1105 Shape descriptor based on curvature, 1106 Shape descriptor based on polygonal representation, 1020 Shape descriptor based on the statistics of curvature, 1107 Shape elongation, 1111 Shape from contours, 1366 Shape from defocus, 1377 Shape from focus, 1376 Shape from interreflection, 1365 Shape from light moving, 1361 Shape from line drawing, 1465 Shape from monocular depth cues, 1374 Shape from motion, 1370 Shape from multiple sensors, 1365 Shape from optical flow, 1370 Shape from orthogonal views, 1364 Shape from perspective, 1365 Shape from photo consistency, 1365 Shape from photometric stereo, 1361 Shape from polarization, 1365 Shape from profile curves, 1366 Shape from scatter trace, 1377 Shape from shading, 1384 Shape from shadows, 1387 Shape from silhouette, 1366 Shape from specular flow, 1370 Shape from specularity, 1387 Shape from structured light, 1365
Index Shape from texture, 1378 Shape from vergence, 1366 Shape from X, 1364 Shape from zoom, 1377 Shape grammar, 1097 Shape index, 1298 Shape magnitude class, 1299 Shape matching, 1097 Shape matrix, 1099 Shape measure [SM], 961 Shape membership, 1108 Shape modeling, 1097 Shape moment, 1026 Shape number, 1017 Shape parameters, 917 Shape perception, 1791 Shape perimeter, 1124 Shape priors, 915 Shape recognition, 1097 Shape recovery, 1097 Shape representation, 1099 Shape template, 1098 Shape template models, 1098 Shape texture, 1062 Shape unrolling, 971 Shape-based image retrieval, 1528 Shaped pattern, 698 Shapeme histogram, 1106 Shape transformation-based method, 1098 Sharp filtering, 501 Sharpening, 520 Sharpening filter, 504 Sharp-unsharp masking, 522 Shearing theorem, 406 Shearing transformation, 468 Shepp-Logan head model, 634 Shift Huffman coding, 666 Shift invariance, 1047, 1725 Shift-invariant point spread function, 337 Shift-invariant principle, 296 Shifting property of delta function, 9 Shock graph, 1002 Shock tree, 1001 Short baseline stereo, 1325 Short-time analysis, 1134 Short-time Fourier transform [STFT], 396 Short-wave infrared [SWIR], 15 Shot, 1532 Shot and spike noise, 592 Shot detection, 1533 Shot noise, 586 Shot organization strategy, 1533 Shrinkage, 1649
1949 Shrinkage cycle iteration, 1641 Shrinking erosion, 1752 Shutter, 106 Shutter control, 106 SI unit, 1856 Side information, 698 Side-looking radar, 163 Sieving analysis, 1740 SIF (source input format), 275 SIF (source intermediate format), 275 SIFT (scale-invariant feature transform), 1343 Sifting, 339 Sigmoid function, 1595 Signal coding system, 650 Signal processing, 334 Signal quality, 336 Signal processing-based texture feature, 1067 Signal-to-noise ratio [SNR], 655 Signal-to-watermark ratio, 692 Signature, 975 Signature curve, 976 Signature identification, 1256 Signature verification, 1256 Signed distance transform, 383 Silhouette, 1100 Silhouette-based reconstruction, 1356 SIMD (single instruction multiple data), 343 Similarity, 1515 Similarity computation based on histogram, 1451 Similarity computation, 497 Similarity distance, 1433 Similarity measure, 1515 Similarity metric, 1516 Similarity theorem, 408 Similarity transformation, 454 Simple boundary, 974 Simple decision rule, 1221 Simple graph, 1615 Simple lens, 131 Simple linear iterative clustering [SLIC], 1212 Simple neighborhood, 361 Simplex, 1194 Simplex angle image [SAI], 1287 Simply connected, 1730 Simply connected polyhedral object, 370 Simply connected region, 370 Simulated annealing, 567 Simulated annealing with Gibbs sampler, 568 Simulating, 58 Simulation, 1405 Simultaneous algebraic reconstruction technique [SART], 643
1950 Simultaneous brightness contrast, 1779 Simultaneous color contrast, 1779 Simultaneous contrast, 1778 Simultaneous localization and mapping [SLAM], 1827 Sinc filter, 547 Single Gaussian model, 1141 Single instruction multiple data [SIMD], 343 Single threshold, 926 Single thresholding, 926 Single view metrology, 1375 Single view reconstruction, 1375 Single-camera system, 115 Single-chip camera, 111 Single-class support vector machine, 1205 Single-image iterative blending, 716 Single-label actor-action recognition, 1554 Single-layer perceptron network, 1643 Single-lens reflex camera, 115 Single mapping law [SML], 492 Single-parametric curve, 982 Single-photon emission computed tomography [SPECT], 626 Single-photon emission CT [SPECT], 628 Single-scale retinex, 442 Singular point, 1274 Singularity event, 1275 Singular value decomposition [SVD], 282 Sinusoidal interference pattern, 576 Sinusoidal projection, 235 SIR (sampling-importance resampling), 245 Situation graph tree, 1571 Situational awareness, 1790 Sixth-generation CT scanner, 632 Size constancy, 1795 Skeletal set, 1637 Skeleton, 1004 Skeleton by influence zones [SKIZ], 1001 Skeleton by maximal balls, 1739 Skeleton model, 1949 Skeleton point, 1004 Skeleton transform, 1004 Skeletonization, 1004 Sketch-based image retrieval, 1528 Skew, 1462 Skew correction, 614 Skew symmetry, 1020 Skewness, 1075 Skewness of probability density function, 316 Skin color analysis, 1253 Skin color detection, 1253 Skin color model, 1253 Skinny-singular value decomposition [skinnySVD], 284
skinny-SVD (skinny-singular value decomposition), 284 SKIZ (skeleton by influence zones), 1001 Skyline storage, 1638 Slack variable, 1662 SLAM (simultaneous localization and mapping), 1827 Slant, 92 Slant edge calibration, 327 Slant normalization, 1830 Slant transform, 439 Slant-Haar transform, 439 SLDA (supervised latent Dirichlet allocation), 1491 SLDS (switching linear dynamical system), 1574 SLIC (simple linear iterative clustering), 1212 Slice, 623 Slice-based reconstruction, 1357 Slide, 1537 Sliding window, 508 Slippery spring, 914 Slope chain code [SCC], 988 Slope density function, 978 Slope overload, 671 SM (shape measure), 961 Small motion model, 1369 Smallest univalue segment assimilating nucleus operator, 840 Smart camera, 109 Smart coding, 683 SMG (split, merge and group), 940 SML (single mapping law), 492 SMO (sequential minimal optimization), 1206 Smoke matting, 856 Smooth constraint, 1364 Smooth filtering, 499 Smooth motion curve, 1144 Smooth parametric curve, 984 Smooth pursuit eye movement, 1784 Smooth region, 1300 Smooth surface, 1300 Smoothed local symmetries, 1005 Smoothing, 517 Smoothing filter, 503 Smoothness constraint, 1163 Smoothness penalty, 571 SMPTE (Society of Motion Picture and Television Engineers), 1839 Snake, 914, 915 Snake model, 915 SNR (signal-to-noise ratio), 655 Sobel detector, 883 Sobel edge detector, 883
Index Sobel gradient operator, 883 Sobel kernel, 883 Sobel operator, 883 SOC (system on chip), 345 Society of Motion Picture and Television Engineers [SMPTE], 1839 Society of Photo-Optical Instrumentation Engineers [SPIE], 1839 SOCP (second-order cone programming), 1655 Soft assignment, 1185 Soft computing, 1480 Soft constraint, 922 Soft dilation, 1754 Soft erosion, 1754 Soft focus, 238 Soft mathematical morphology, 1753 Soft matting, 857 Soft morphological filter, 1753 Soft morphology, 1753 Soft segmentation, 829 Soft shadow, 1385 Soft vertex, 839 Soft weight sharing, 571, 1644 Softmax function, 1653 Soft-thresholding, 1641 Solid angle, 347 Solid model, 1312 Solidity, 1116 Solid-state imager, 102 Solid-state sensor, 102 Solution by Gauss-Seidel method, 296 Solution by Jacobi’s method, 296 Solving optical flow equation, 1163 Solving photometric stereo, 1361 SOM (self-organizing map), 1668 Sony RGBE pattern, 1950 Source, 651 Source coding theorem, 661 Source decoder, 652 Source encoder, 651 Source geometry, 213 Source image, 35 Source input format [SIF], 275 Source intermediate format [SIF], 275 Source placement, 213 Source/sink cut, 1618 SP (subspace pursuit), 1640 Space carving, 1316, 1346 Space curve, 1278 Space division multiplexing, 335 Space domain, 5 Space-filling capability, 1117 Space-time cuboid, 1553 Space-time descriptor, 1553
1951 Space-time image, 35 Space-time interest point, 1553 Space-time stereo, 1552 Space-variant processing, 827 Space-variant sensor, 101 Spanning subgraph, 1617 Sparse coding [SC], 1637 Sparse correspondence, 1443 Sparse data, 1637 Sparse flexible model, 1625 Sparse graphical model, 1625 Sparse iterative constrained end-membering [SPICE], 14 Sparse matching point, 1444 Sparse representation, 1636 Sparsity problem, 1637 Spatial aliasing, 244 Spatial and temporal redundancy, 1552 Spatial angle, 348 Spatial averaging, 519 Spatial characteristic of vision, 1780 Spatial clustering, 935 Spatial coordinate transformation, 448 Spatial correspondence evidential reasoning and perceptual organization [SCERPO], 1851 Spatial coverage, 1117 Spatial derivatives, 521 Spatial domain, 442 Spatial domain filter, 503 Spatial domain filtering, 499 Spatial domain smoothing, 517 Spatial filter, 499 Spatial filtering, 499 Spatial frequency, 1781 Spatial Hashing, 1524 Spatial indexing, 1522 Spatial light modulator, 1675 Spatial masking model, 706 Spatial matched filter, 1450 Spatial noise, 593 Spatial noise descriptor, 593 Spatial noise filter, 504 Spatial normal fields, 1287 Spatial occupancy, 1316 Spatial occupancy array, 1315 Spatial perception, 1791 Spatial proximity, 1529 Spatial pyramid matching, 1451 Spatial quantization, 248 Spatial reasoning, 1419 Spatial relation, 1529 Spatial relationship between pixels, 353 Spatial relationship of primitives, 1081
1952 Spatial resolution, 48 Spatial sampling, 245 Spatial sampling rate, 245 Spatial statistics, 307 Spatial summation, 1781 Spatial transformation, 614 Spatial window, 507 Spatial domain sampling theorem, 241 Spatially varying BRDF [SVBRDF], 1708 Spatial-relation-based image retrieval, 1529 Spatial-temporal behavior understanding, 1549 Spatial-temporal techniques, 1550 Spatio-temporal analysis, 1552 Spatio-temporal filter, 1550 Spatio-temporal filtering, 1550 Spatio-temporal Fourier spectrum, 399 Spatio-temporal frequency response, 1552 Spatio-temporal relationship match, 1552 Spatio-temporal space, 1552 Spatio-temporally shiftable windows, 1551 SPD (spectral power distribution), 1699 SPD (symmetric positive definite), 429 Special case motion, 1371 Specification noise, 592 Specificity, 1535 Specified histogram, 492 Speckle, 594 Speckle noise, 594 Speckle reduction, 595 SPECT (single-photon emission CT), 626 Spectral absorption curve, 1697 Spectral analysis, 1699 Spectral angular distance, 1696 Spectral clustering, 1700 Spectral constancy, 1795 Spectral decomposition method, 1700 Spectral density function, 1700 Spectral differencing, 1696 Spectral distribution, 1699 Spectral factorization, 1700 Spectral filtering, 1699 Spectral frequency, 1699 Spectral graph partitioning, 1700 Spectral graph theory, 1700 Spectral histogram, 1698 Spectral imaging, 179 Spectral luminous efficiency, 1698 Spectral luminous efficiency curve, 1698 Spectral luminous efficiency function, 1698 Spectral power distribution [SPD], 1699 Spectral radiance, 1696 Spectral reflectance, 1696 Spectral response, 1696
Index Spectral response function, 1696 Spectral sensitivity, 1697 Spectral signature, 1695 Spectral unmixing, 1695 Spectrophotometer, 1674 Spectroradiometer, 1687 Spectroscopy, 1697 Spectrum, 1694 Spectrum approach, 1084 Spectrum color, 1698 Spectrum function, 1084 Spectrum locus, 735 Spectrum quantity, 1696 Specular flow, 1158 Specular reflection coefficient, 1701 Specular reflection, 1701 Specularity, 1714 Speed of a curve, 1276 Speed profiling, 1565 Speeded-up robust feature [SURF], 1344 Sphere, 1303 Spherical, 1116 Spherical aberration, 150 Spherical coordinate system, 183 Spherical coordinates, 183 Spherical harmonic, 281 Spherical linear interpolation, 621 Spherical mirror, 1676 Spherical spin image, 37 Sphericity, 1112 Sphericity ratio, 1116 Sphering, 1195 SPICE (Sparse iterative constrained end-membering), 14 SPIE (Society of Photo-Optical Instrumentation Engineers), 1839 Spin image, 37 Splatting, 73 Spline, 972 Spline smoothing, 973 Spline-based motion estimation, 1137 Split and merge, 940 Split and merge algorithm, 940 Split, merge, and group [SMG], 940 Splitting technique, 993 Spoke, 1125 Sport match video, 1538 SPOT (satellite probatoire d’observation de la terra), 1833 Spot colors, 743 Spot detection, 595 Sprite, 18 Spurs, 872, 1734
Index Square degree, 348 Square grid, 253 Square image representation, 253 Square lattice, 253 Square sampling efficiency, 253 Square sampling grid, 253 Square sampling mode, 253 Square-box quantization [SBQ], 251 Squared error clustering, 1212 SR (super-resolution), 809 sRGB color space, 741 SRL (statistical relational learning), 1408 SSD (sum of squared differences), 1349 SSIM (structural similarity index measurement), 52 SSSD (sum of SSD), 1349 Stability of views, 1693 Stack filter, 1753 Stadimetry, 205, 210 Stagewise orthogonal matching pursuit [StOMP], 1640 Standard CIE colorimetric system, 734 Standard color values, 741 Standard conversion, 247 Standard definition television [SDTV], 1820 Standard deviation, 1502 Standard deviation modified Hausdorff distance [SDMHD], 1434 Standard deviation of noise, 584 Standard illuminant, 1859 Standard image, 50 Standard imaging, 176 Standard lens, 131 Standard observer, 1858 Standard rectified geometry, 1336 Standard stereo geometry, 1336 Standard-definition television [SDTV], 1820 State inference, 1153 State model, 95 State space, 911 State space model, 1152 State transition probability, 1597 State variable, 336 State vector, 1155 Static knowledge, 1390 Static recognition, 1169 Stationary camera, 108 Stationary kernel, 1658 Stationary random field, 1588 Statistical approach, 1071 Statistical behavior model, 1572 Statistical classifier, 307 Statistical decision theory, 660
1953 Statistical differencing, 521 Statistical distribution, 307 Statistical feature, 1075 Statistical error, 1055 Statistical fusion, 307 Statistical independence, 1582 Statistical insufficiency problem, 1566 Statistical learning, 1408 Statistical learning theory, 1408 Statistical model, 1952 Statistical moment, 306 Statistical pattern recognition, 1172 Statistical relational learning [SRL], 1408 Statistical shape model, 1103 Statistical shape prior, 1103 Statistical texture, 1072 Statistical texture feature, 1067 Statistical uncertainty, 1054 Steepest descent, 646 Steerable filters, 543 Steerable pyramid, 818 Steerable random field, 1587 Steganalysis, 718 Steganographic message, 719 Steganography, 715, 718 Stego-key, 719 Stem plot, 346 Step edge, 868 Steradian, 348 Stereo, 1321 Stereo camera calibration, 122 Stereo convergence, 203 Stereo correspondence problem, 1341 Stereo from motion, 1370 Stereo fusion, 1756 Stereo heads, 1366 Stereo image, 18 Stereo image rectification, 1333 Stereo imaging, 198 Stereo matching, 1341 Stereo matching based on features, 1444 Stereo matching based on region gray-level correlation, 1340 Stereo representation, 1315 Stereo triangulation, 1322 Stereo view, 203 Stereo vision, 1324 Stereo vision system, 1324 Stereo-based head tracking, 1239 Stereographic projection, 30 Stereology, 1660 Stereophotogrammetry, 239 Stereophotography, 239
1954 Stereoscope, 203 Stereoscopic acuity, 1662 Stereoscopic imaging, 203 Stereoscopic imaging with angle scanning camera, 203 Stereoscopic phenomenon, 1661 Stereoscopic photography, 240 Stereoscopic radius, 1662 Stereoscopy, 1661 STFT (short-time Fourier transform), 396 Stiffness matrix, 304 Still image, 27 Still video camera, 108 Stimulus, 1769 Stippling, 763 Stirling’s approximation, 304 StirMark, 706 Stochastic approach, 566 Stochastic completion field, 1583 Stochastic context-free grammar [SCFG], 1225 Stochastic EM, 1664 Stochastic gradient descent, 645 Stochastic process, 1583 Stochastic sampling technique, 245 StOMP (stagewise orthogonal matching pursuit), 1640 Stop-number, 142 Storage camera tube, 156 Straight cut, 1536 Stratification, 122 Streaming video, 779 Stretch transformation, 467 Striated cortex, 1758 Strict digraph, 1623 Strict order, 1744 Strict ordering relation, 1744 Strict-ordered set, 1745 String, 1745 String description, 1015 String descriptor, 1015, 1431 String grammar, 1015 String matching, 1431 String structure, 1015 Strip artifacts, 501 Stripe noise, 593 Stripe ranging, 209, 210 Stripe tree, 1612 Strobe duration, 215 Strobed light, 215 Strong edge pixels, 891 Strong learner, 1409 Strong texture, 1089 Structural approach, 1079
Index Structural description, 1437 Structural matching, 1437 Structural pattern descriptor, 1225 Structural pattern recognition, 1173 Structural similarity index, 52 Structural similarity index measurement [SSIM], 52 Structural texture, 1079 Structural texture feature, 1067 Structure and motion recovery, 1369 Structure elements of stereology, 1660 Structure factorization, 1129 Structure feature, 1528 Structure from motion, 1366 Structure from optical flow, 1370 Structure learning, 1627 Structure matching, 1437 Structure reasoning, 1419 Structure tensor, 842 Structure-based image retrieval, 1528 Structured codes, 1523 Structured illumination, 218 Structured light, 208 Structured light imaging, 208 Structured light source calibration, 208 Structured light triangulation, 208 Structured model, 347 Structured SVM, 1206 Structuring element, 1729 Structuring system, 1753 Subband, 686 Subband coding, 685 Sub-category recognition, 1170 Subcomponent, 347 Subcomponent decomposition, 347 Subdivision connectivity, 1284 Subdivision surface, 1300 Sub-Gaussian probability density function, 316 Subgraph, 1617 Subgraph isomorphism, 1463 Sub-image decomposition, 675 Sub image, 18 Subjective black, 1774 Subjective boundary, 1812 Subjective brightness, 1775 Subjective color, 1802 Subjective contour, 1812 Subjective evaluation, 52 Subjective feature, 1544 Subjective fidelity, 649 Subjective fidelity criteria, 649 Subjective magnification, 1675 Subjective quality measurement, 52
Index Subjectivity and objectivity of evaluation, 968 Suboptimal variable-length coding, 666 Subordering, 765 Sub-pixel, 11 Sub-pixel edge detection based on expectation of first-order derivatives, 864 Sub-pixel edge detection based on moment preserving, 863 Sub-pixel edge detection based on tangent information, 865 Sub-pixel edge detection, 863 Sub-pixel interpolation, 617 Sub-pixel-level disparity, 1329 Sub-pixel-precise contour, 867 Sub-pixel-precise thresholding, 867 Sub-pixel-precision of fitting, 986 Sub-pixel-refinement, 1342 Subsampling, 247 Subsequence-wise decision, 1223 Subset, 287 Subspace, 1233 Subspace analysis, 1235 Subspace classification, 1234 Subspace classifier, 1234 Subspace learning, 1236 Subspace method, 1234 Subspace pursuit [SP], 1640 Subspace shape models, 1235 Substitution attack, 707 Subsurface scattering, 1720 Subtle facial expression, 1246 Subtractive color, 730 Subtractive color mixing, 731 Subtractive dither, 264 Subtractive image offset, 474 Subtractive primary color, 731 Sub-volume matching, 1454 Successive color contrast, 1803 Successive doubling, 397 Sum of absolute differences [SAD], 1516 Sum of absolute distortion [SAD], 1058 Sum of squared differences [SSD], 1349 Sum of SSD [SSSD], 1349 Summed area table, 24 Sum-product algorithm, 1420 Sum-product algorithm for hidden Markov model, 1596 Super extended graphics array [SXGA], 258 Super extended graphics array+ [SXGA+], 259 Super video graphics adapter [SVGA], 259 Superellipse, 1303 Super-Gaussian probability density function, 316
1955 Supergraph, 1616 Supergrid, 49 Superparamagnetic clustering, 1213 Superpixel, 11 Superpixel segmentation, 949 Superposition principle, 295 Superquadric, 1302 Super-resolution [SR], 809 Super-resolution reconstruction, 810 Super-resolution restoration, 810 Super-resolution technique, 809 Super-slice, 907 Super video compact disc [SVCD], 270 Supervised classification, 1181 Supervised color constancy, 1795 Supervised latent Dirichlet allocation [SLDA], 1491 Supervised learning, 1412 Supervised texture image segmentation, 1088 Supervised texture segmentation, 1088 Super-voxel, 12 Support region, 5 Support vector, 1204 Support vector machine [SVM], 1203 Support vector machine for regression, 1204 Support vector regression, 1204 SURF (speeded-up robust feature), 1344 Surface, 1280 Surface average value, 1289 Surface boundary representation, 1285 Surface class, 1290 Surface classification, 1292 Surface color, 724 Surface completion, 606 Surface continuity, 1282 Surface curvature, 1290 Surface description, 1289 Surface discontinuity, 1282 Surface element [surfel], 1281 Surface fitting, 1287 Surface harmonic representation, 1285 Surface inspection, 1282 Surface interpolation, 1285 Surface light field, 72 Surface matching, 1460 Surface mesh, 1304 Surface model, 1282 Surface normal, 1286 Surface orientation, 1281 Surface orientation from texture, 1378 Surface patch, 1305 Surface reconstruction, 1356 Surface reflectance, 1713
1956 Surface reflection, 1713 Surface representations, 1284 Surface roughness, 1289 Surface roughness characterization, 1289 Surface segmentation, 1282 Surface shape classification, 1291 Surface shape discontinuity, 1291 Surface simplification, 1285 Surface tiling, 1304 Surface tiling from parallel planar contour, 1306 Surface tracking, 1148 Surface triangulation, 1304 surfel (surface element), 1281 Surflet, 1286 Surrounding region, 996 Surveillance, 1823 Survival of the fittest, 1154 SUSAN corner finder, 840 SUSAN operator, 840 SVBRDF (spatially varying BRDF), 1708 SVCD (super video compact disc), 270 SVD (singular value decomposition), 282 SVG (scalable vector graphics) format, 274 SVGA (super video graphics adapter), 259 SVM (support-vector machine), 1203 SWA (segmentation by weighted aggregation), 950 Swap move, 923 Sweep representation, 1319 Swendsen-Wang algorithm, 568 Swept object representation, 1318 SWIR (short-wave infrared), 15 Switching hidden Markov model, 1595 Switching linear dynamical system [SLDS], 1574 Switching state space model, 1595 SXGA (super extended graphics array), 259 SXGA+ (super extended graphics array), 258 Symbol coding, 659 Symbol-based coding, 659 Symbolic, 971 Symbolic image, 30 Symbolic matching, 1431 Symbolic object representation, 971 Symbolic registration, 1456 Symmetric application, 651 Symmetric axis transform [SAT], 1002 Symmetric component, 436 Symmetric cross entropy, 1504 Symmetric cross-ratio functions, 1167 Symmetric divergence [SD], 966 Symmetric key cryptography, 717
Index Symmetric positive definite [SPD], 429 Symmetric transform, 387 Symmetry, 1101 Symmetry axiom, 1041 Symmetry detection, 1101 Symmetry group, 455 Symmetry line, 1003 Symmetry measure, 1107 Symmetry plane, 1274 Synapse, 1758 Sync pulse, 1575 Sync separation, 1574 Synchronization, 1574 Syntactic classifier, 1225 Syntactic pattern recognition, 1225 Syntactic texture description, 1065 Synthetic aperture radar [SAR], 163 Synthetic aperture radar [SAR] image, 25 Synthetic image, 7 System attack, 708 System model for image coding, 650 System noise, 585 System on chip [SOC], 345 System uncertainty, 1055 Systematic comparison and characterization, 967 Systematic error, 1055 Systolic array, 345
T T unit, 1797 TA (tree annealing), 498 Tabu search, 911 Tagged image file format [TIFF], 274 Tail-to-tail, 1631 Tamper resistance, 706 Tangent angle function, 974 Tangent distance, 572 Tangent method, 867 Tangent plane, 1287 Tangent propagation, 572 Tangent space, 1194 Tangent vector to an edge, 874 Tangential distortion, 146 Tangential distortion of lens, 146 Tapped delay line, 1601 Target frame, 1131 Target image, 35 Target recognition, 1169 Target tracking, 1147 Task parallelism, 58 Task-directed, 68
Index Tattoo retrieval, 1518 Taut-string area, 1022 TB (transform block), 1847 TBD (texture browsing descriptor), 1069 TBM (transferable belief model), 1409 TBN (texture by numbers), 1092 TCF-CFA (tensor correlation filter based CFA), 1184 TCT (transmission CT), 625 Tee junction, 838 Telecentric camera, 111 Telecentric camera model, 175 Telecentric illumination, 215 Telecentric imaging, 177 Telecentric lens, 133 Telecentric optics, 1672 Telecine, 1821 Telepresence, 1833 Telescope, 164 Television camera, 113 Television microscope, 161 Television raster, 1820 Television scanning, 1820 Tell-tale watermarking, 701 TEM (transmission electron microscope), 161 Temperature parameter, 567 Template, 507 Template convolution, 509 Template image, 24 Template matching, 1447 Template tracking, 1148 Template-based representation, 971 Template-matching-based approach, 1236 Templates and springs model, 1438 Temporal aliasing, 244 Temporal alignment, 1459 Temporal and vertical interpolation, 782 Temporal averaging, 792 Temporal characteristics of vision, 1782 Temporal correlation, 797 Temporal derivative, 1163, 1371 Temporal event analysis, 1602 Temporal face recognition, 1243 Temporal frequency, 1783 Temporal frequency response, 1783 Temporal interpolation, 783 Temporal model, 1602 Temporal offset, 1459 Temporal process, 335 Temporal reasoning, 1419 Temporal representation, 972 Temporal resolution, 778 Temporal segmentation, 804
1957 Temporal stereo, 1325 Temporal synchronization, 1575 Temporal template, 1556 Temporal texture, 1090 Temporal topology, 1568 Temporal tracking, 1147 Tensor correlation filter-based CFA [TCF-CFA], 1184 Tensor face, 1243 Tensor product surface, 1300 Tensor representation, 424 Term frequency, 1516 Term frequency-inverse document frequency [TF-IDF], 1516 Terrain analysis, 1834 Tessellated image, 36 Tessellated viewsphere, 197 Tessellation, 1122 Test procedure, 955 Test sequence, 1182 Test set, 1182 Tetrahedral spatial decomposition, 1310 Texel, 1063 Texem, 1064 Text analysis, 1460 Texton, 1063 Texton boost, 1080 Textual image, 24 Textual image compression, 648 Texture, 1061 Texture addressing mode, 71 Texture analysis, 1064 Texture boundary, 1087 Texture browsing descriptor [TBD], 1069 Texture by numbers [TBN], 1092 Texture categorization, 1065 Texture classification, 1065 Texture composition, 1092 Texture contrast, 1072 Texture description, 1064 Texture descriptor, 1064 Texture direction, 1068 Texture element, 1063 Texture element difference moment, 1063 Texture energy, 1070 Texture energy map, 1070 Texture energy measure, 1070 Texture enhancement, 1064 Texture entropy, 1078 Texture feature, 1066 Texture field grouping, 1087 Texture field segmentation, 1087 Texture gradient, 1068
1958 Texture grammar, 1080 Texture homogeneity, 1072 Texture image, 28 Texture mapping, 1090 Texture matching, 1065 Texture model, 1065 Texture modeling, 1065 Texture motion, 1070 Texture orientation, 1068 Texture pattern, 1065 Texture primitive, 1080 Texture recognition, 1065 Texture region, 1087 Texture region extraction, 1087 Texture representation, 1064 Texture second-order moment, 1078 Texture segmentation, 1087 Texture spectrum, 1069 Texture statistical model, 1066 Texture stereo technique, 1379 Texture structural model, 1079 Texture synthesis, 1091 Texture tessellation, 1093 Texture transfer, 1090 Texture uniformity, 1078 Texture-based image retrieval, 1526 TF-IDF (term frequency-inverse document frequency), 1516 The Actor-Action Data set [A2D], 1852 Theil-Sen estimator, 1667 Thematic mapper [TM], 98 Theorem of the three perpendiculars, 738 Theory of color vision, 1796 Theory of image understanding, 63 Thermal imaging, 181 Thermal infrared [TIR], 1693 Thermal noise, 585 Thick lens, 132 Thick lens model, 133 Thick shape, 1108 Thickening, 1739 Thickening operator, 1738 Thin lens, 131 Thin shape, 1108 Thinness ratio, 1117 Thinning, 1737 Thinning algorithm, 1737 Thinning operator, 1737 Thin-plate model, 563 Thin-plate spline, 564 Thin-plate spline under tension, 564 TIR (thermal infrared), 1693 Third-generation CT Scanner, 631
Index Third-order aberrations, 147 Three layers of image fusion, 1505 Three primary colors, 1797 Three-CCD camera, 111 Three-chip camera, 111 Three-color theory, 1797 Three-point gradient, 521 Three-point mapping, 454 Three-step search method, 794 Three-view geometry, 1347 Threshold, 924 Threshold based on transition region, 931 Threshold coding, 675 Threshold selection, 924 Thresholding, 924 Thresholding based on concavity-convexity of histogram, 928 Thresholding with hysteresis, 891 Through-the-lens camera control, 1560 Thumbnail, 1520 Tie points, 1430 TIFF (tagged image file format), 274 Tightness and separation criterion [TSC], 1211 Tikhonov regularization, 570 Tiling, 1306 Tilt angle, 93 Tilt direction, 93 Tilting, 92 Time code, 779 Time convolution theorem, 341 Time delay index, 243 Time division multiplexing, 271 Time domain sampling theorem, 242 Time series, 778 Time to collision, 120 Time to contact, 120 Time to impact, 120 Time window, 507 Time-frequency, 816 Time-frequency grid, 816 Time-frequency window, 816 Time-of-flight, 206 Time-of-flight range sensor, 102 Time-scale, 815 Time-varying image, 777 Time-varying random process, 1583 Time-varying shape, 919 Titchener circles, 1813 TLS (total least squares), 289 TM (thematic mapper), 98 T-number, 143 Toboggan contrast enhancement, 523 Tobogganing, 944
Index Toeplitz matrix, 285 Tolerance band algorithm, 1276 Tolerance interval, 1054 Tomasi view of a raster image, 43 Tomasi-Kanade factorization, 1369 Tomography, 623 Tonal adjustment, 760 Tone mapping, 761 Tongue print, 1256 Top surface, 1747 Top-down, 1394 Top-down cybernetics, 1393 Top-down model inference, 1420 Top-hat operator, 1749 Top-hat transformation, 1749 Topic, 1487 Topic model, 1490 Topographic map, 1833 Topography, 80 Topological descriptor, 78 Topological dimension, 77 Topological PCA [TPCA], 433 Topological property, 78 Topological representation, 77 Topology, 77 Topology inference, 79 Topology of a set, 286 Topology space, 77 Torsion, 1279 Torsion scale space [TSS], 1279 Tortuosity, 1019 Torus, 1303 Total cross number, 1352 Total least squares [TLS], 289 Total reflection, 1716 Total variation, 570 Total variation regularization, 571 TPCA (topological PCA), 433 TPR (true positive rate), 1228 T-pyramid (tree-pyramid), 818 Trace transform, 438 Tracking, 92, 1145 Tracking state, 1146 Traffic analysis, 1828 Traffic sign recognition, 1828 Training, 1647 Training procedure, 1647 Training sequence, 1647 Training set, 1166 Trajectory, 1143 Trajectory clustering, 1144 Trajectory estimation, 1144 Trajectory learning, 1143
1959 Trajectory modeling, 1144 Trajectory preprocessing, 1144 Trajectory transition descriptor, 1144 Trajectory-based representation, 1572 Transaction tracking, 705 Transfer function, 337 Transfer learning, 1418 Transferable belief model [TBM], 1409 Transform, 386 Transform block [TB], 1847 Transform coding, 675 Transform coding system, 676 Transform domain, 386 Transform kernel, 386 Transform pair, 387 Transform unit [TU], 1847 Transformation, 387 Transformation function, 386 Transformation kernel, 386 Transformation matrix, 386 Transform-based representation, 971 Transition probability, 1597 Transition region, 931 Transitive, 287 Transitive closure, 1624 Translation, 468 Translation invariance, 1047 Translation invariant, 1047 Translation matrix, 468 Translation of Fourier transform, 410 Translation theorem, 407 Translation transformation, 468 Translational motion estimation, 1132 Translucency, 1682 Transmission, 1681 Transmission computed tomography, 626 Transmission CT [TCT], 625 Transmission electron microscope [TEM], 161 Transmission tomography, 626 Transmissivity, 1681 Transmittance, 1680 Transmittance space, 1681 Transmitted color, 724 Transmitted light, 212 Transparency, 1681 Transparent layer, 259 Transparent motion, 1130 Transparent overlap, 1092 Transparent watermark, 702 Transport layer, 272 Transport streams, 1845 Trapezoid high-pass filter, 551 Trapezoid low-pass filter, 548
1960 Treating of border pixels, 513 Tree, 1611 Tree annealing [TA], 498 Tree automaton, 1226 Tree classifier, 1222 Tree description, 1612 Tree descriptor, 1612 Tree grammar, 1611 Tree search method, 910 Tree structure, 1611 Tree structure rooting on knowledge database, 1260 Tree-pyramid [T-pyramid], 818 Tree-reweighted message passing algorithm, 1614 Treewidth of a graph, 1632 Trellis code, 650 Trellis diagram, 1619 Trellis-coded modulation, 651 Trial-and-error, 1417 Triangle axiom, 1041 Triangle similarity, 1012 Triangular factorization, 282 Triangular grid, 252 Triangular image representation, 253 Triangular lattice, 252 Triangular norms, 470 Triangular sampling efficiency, 253 Triangular sampling grid, 252 Triangular sampling mode, 253 Triangulated model, 1305 Triangulation, 1304 Trichromacy, 1798 Trichromat, 1798 Trichromatic sensing, 1797 Trichromatic theory of color vision, 1798 Trichromatic theory, 1798 Trifocal plane, 1348 Trifocal tensor, 1348 Trihedral corner, 840 Tri-layer model, 1555 Trilinear constraint, 1351 Trilinear interpolation, 620 Trilinear mapping, 1091 Trilinear MIP mapping, 1091 Trilinear tensor, 1090 Trilinearity, 305 Trimap, 828 Trinocular image, 33 Trinocular stereo, 1347 Tristimulus theory of color perception, 1796 Tristimulus values, 1779 Troland, 1764
Index True color enhancement, 757 True negative, 1129 True positive, 1228 True positive rate [TPR], 1228 True color, 757 True color image, 20 Truncated Huffman coding, 666 Truncated median filter, 528 Truncated SVD [tSVD], 284 Truncation, 472 Trustworthy camera, 114 Try-feedback, 957 TSC (tightness and separation criterion), 1211 TSS (torsion scale space), 1279 T-stop, 142 tSVD (truncated SVD), 284 TU (transform unit), 1847 Tube camera, 109 Tube sensor, 102 Tukey function, 845 Tunnel, 1031 Twisting, 1019 Two-frame structure from motion, 1368 Two-screen matting, 856 Two-stage shot organization, 1539 Two-view geometry, 1340 Type 0 operations, 60, 447 Type 1 error, 1230 Type 1 operations, 60, 447 Type 2 error, 1198 Type 2 maximum likelihood, 1414 Type 2 operations, 60, 447 Types of camera motion, 91 Types of light sources, 210 Types of texture, 1089
U UGM (undirected graphical model), 1624 UCS CIE 1960 (uniform color space), 1858 UHDTV (ultra-high-definition television), 1820 UI (user interface), 1519 UIQI (universal image quality index), 51 UIUC-Sports data set, 1851 ULR (unstructured lumigraph rendering) system, 74 Ultimate erosion, 1752 Ultimate measurement accuracy [UMA], 964 Ultimate point, 1013 Ultra-high-definition television [UHDTV], 1820 Ultrasonic imaging, 181 Ultrasound image, 26
Index Ultrasound sequence registration, 1456 Ultra-spectral imaging, 180 Ultraviolet [UV], 1693 Ultraviolet image, 15 UM (uniformity measure), 961 UMA (ultimate measurement accuracy), 964 Umbilical point, 1275 Umbra, 1385 Umbra of a surface, 1747 UMIST database, 1855 Unary code, 665 Unauthorized deletion, 709 Unauthorized detection, 709 Unauthorized embedding, 709 Unauthorized reading, 709 Unauthorized removal, 709 Unauthorized writing, 709 Unbiased noise, 594 Unbiasedness, 1051 Uncalibrated approach, 120 Uncalibrated camera, 115 Uncalibrated stereo, 1326 Uncalibrated vision, 120 Uncertainty, 308 Uncertainty and imprecision data, 1053 Uncertainty image scale factor, 87 Uncertainty in structure from motion, 1368 Uncertainty of measurements, 1054 Uncertainty problem, 1351 Uncertainty representation, 308 Unconstrained restoration, 598 Uncorrelated noise, 583 Uncorrelated random fields, 1587 Uncorrelated random variable, 1580 Under-determined problem, 1398 Under-fitting, 1649 Underlying simple graph, 1615 Under-segmented, 828 Undirected graph, 1624 Undirected graphical model [UGM], 1624 Unidirectional prediction, 800 Unidirectional prediction frame, 800 Unified PCA [UPCA], 433 Uniform chromaticity chart, 734 Uniform chromaticity coordinates, 734 Uniform chromaticity scale, 734 Uniform color model, 752 Uniform color space, 752 Uniform diffuse reflection, 1703 Uniform diffuse transmission, 1703 Uniform distribution, 322 Uniform illumination, 218 Uniform linear motion blurring, 793
1961 Uniform noise, 589 Uniform pattern, 1082 Uniform predicate, 826 Uniformity measure [UM], 961 Uniformity within region, 961 Unimodal, 279 Unipolar impulse noise, 590 Uniquely decodable code, 663 Uniqueness constraint, 1331 Uniqueness of watermark, 694 Uniqueness stereo constraint, 1331 Uniquenesses, 297 Unit ball, 1311 Unit quaternion, 465 Unit sample response, 338 Unit vector, 43 Unit, 1406 Unitary transform, 389 Unitary transform domain, 389 Univalue segment assimilating nuclear [USAN], 841 Universal image quality index [UIQI], 51 Universal quality index [UQI], 51 Universal quantifier, 1407 Universal serial bus [USB], 111 Unrectified camera, 115 Unrelated perceived color, 1772 Unsharp masking, 444, 522 Unsharp operator, 444 Unstructured lumigraph rendering [ULR] system, 74 Unstructured lumigraph, 74 Unsupervised behavior modeling, 1572 Unsupervised classification, 1181 Unsupervised clustering, 1214 Unsupervised feature selection, 1412 Unsupervised learning, 1412 Unsupervised texture image segmentation, 1088 Unsupervised texture segmentation, 1088 Unusual behavior detection, 1566 UPCA (unified PCA), 433 Up-conversion, 247 Up vector selection, 1497 Updating eigenspace, 428 Upper approximation set, 1511 Upper half-space, 298 Up-rounding function, 280 Up-sampling, 247 UQI (universal quality index), 51 USAN (univalue segment assimilating nuclear), 841 USB (universal serial bus), 111
USB camera, 110 User interface [UI], 1519 Utility function, 921 UV (ultraviolet), 1693
V VA (visual attention), 1804 Validation, 1183 Validation set, 1183 Valley, 1292 Valley detection, 847 Valley detector, 847 Value in color representation, 723 Value quantization, 250 Vanishing line, 1381 Vanishing point, 1381 Vapnik-Chervonenkis [VC] dimension, 1411 Variable conductance diffusion [VCD], 603 Variable focus, 136 Variable reordering, 1636 Variable-length code, 665 Variable-length coding, 665 Variable-speed photography, 237 Variance, 1075 Variants of active contour models, 918 Variational analysis, 301 Variational approach, 301 Variational Bayes, 1420 Variational calculus, 302 Variational inference for Gaussian mixture, 1419 Variational inference for hidden Markov model, 1596 Variational inference, 1419 Variational mean field method, 302 Variational method, 301 Variational model, 302 Variational problem, 301 VC (Vapnik-Chervonenkis) dimension, 1411 VCD (variable conductance diffusion), 603 VCD (video compact disc), 1844 VDED (vector dispersion edge detector), 897 VDT (video display terminal), 260 Vector Alpha-trimmed median filter, 528 Vector dispersion edge detector [VDED], 897 Vector field, 34 Vector image, 34 Vector map, 70 Vector median filter, 528 Vector ordering, 765 Vector quantization [VQ], 684 Vector rank operator, 767
Index Vehicle detection, 1828 Vehicle license plate analysis, 1828 Vehicle tracking, 1149 Velocity, 1157 Velocity field, 1157 Velocity moment, 1157 Velocity smoothness constraint, 1158 Velocity vector, 1134 VEML (video event markup language), 1568 Vergence, 1339 Vergence maintenance, 1339 VERL (video event representation language), 1568 Versatile video coding [VVC], 1848 Vertex, 1618 Vertex set, 1618 Vertical interpolation, 783 VGA (video graphics array), 258 Victory and brightness, 1545 Video, 773 Video accelerator, 261 Video adapter, 261 Video analysis, 804 Video annotation, 1546 Video-based animation, 788 Video-based rendering, 74 Video-based walkthroughs, 788 Video board, 261 Video browsing, 1547 Video camera, 113 Video capture, 787 Video card, 261 Video clip, 805 Video clip categorization, 805 Video codec, 796 Video coder, 796 Video coding, 796 Video compact disc [VCD], 1844 Video compression, 796 Video compression standard, 1844 Video compression technique, 797 Video computing, 803 Video conferencing, 784 Video container, 784 Video content analysis, 1547 Video corpus, 1546 Video data rate, 778 Video decoder, 796 Video decoding, 796 Video decompression, 797 Video de-interlacing, 781 Video denoising, 791 Video descriptor, 804
Index Video deshearing, 790 Video display terminal [VDT], 260 Video enhancement, 790 Video error concealment, 798 Video event challenge workshop, 1568 Video event markup language [VEML], 1568 Video event representation language [VERL], 1568 Video format, 786 Video frequency, 776 Video graphics array [VGA], 258 Video image, 27 Video image acquisition, 787 Video imaging, 787 Video indexing, 1548 Video key frame, 1533 Video mail [V-mail], 784 Video matting, 858 Video mining, 1548 Video mosaic, 805 Video-mouse, 784 Video organization framework, 1547 Video organization, 1547 Videophone, 1755 Video presentation unit [VPUs], 1844 Video processing, 787 Video program, 1538 Video program analysis, 1538 Video quality assessment, 784 Video quantization, 787 Video rate system, 784 Video restoration, 788 Video retrieval, 1547 Video retrieval based on motion feature, 1539 Video sampling, 787 Video screening, 788 Video search, 1546 Video segmentation, 804 Video semantic content analysis, 1547 Video sequence, 787 Video sequence synchronization, 788 Video shot, 1532 Video signal, 785 Video signal processing, 785 Video sprites, 804 Video stabilization, 790 Video standards, 785 Video stippling, 788 Video stream, 778 Video structure parsing, 1538 Video summarization, 1547 Video technique, 785 Video terminology, 784
1963 Video texture, 788 Video thumbnail, 1547 Video transmission, 788 Video transmission format, 788 Video understanding, 1546 Video volume, 778 Vidicon, 156 View combination, 1174 View correlation, 1559 View integration, 1497 View interpolation, 1461 View morphing, 1461 View region, 197 View volume, 1759 View-based eigenspace, 1173 View-based object recognition, 1175 View-dependent reconstruction, 233 View-dependent texture map, 1379 Viewer-centered representation, 1269 Viewfield, 171 Viewing angle, 175 Viewing distance, 176 Viewing field, 171 Viewing space, 176, 196 Viewing sphere, 176 View-invariant action recognition, 1558 Viewpoint, 196 Viewpoint consistency constraint, 1560 Viewpoint invariance, 1049 Viewpoint planning, 1263 Viewpoint selection, 1333 Viewpoint space partition [VSP], 197 Viewpoint-dependent representations, 1269 Viewsphere, 196 Vignetting, 128 Vigor and strength, 1545 Virtual, 80 Virtual aperture, 145 Virtual bronchoscopy, 1832 Virtual camera, 1788 Virtual color, 736 Virtual endoscopy, 80 Virtual fencing, 1565 Virtual image plane, 192 Virtual reality [VR], 79 Virtual reality markup language [VRML], 79 Virtual reality modeling language [VRML], 79 Virtual view, 80 Virtual viewpoint video, 777 Viscous model, 987 Viseme, 1256 Visibility, 610 Visibility class, 172
1964 Visibility level [VL], 610 Visibility locus, 117 Visible light, 1693 Visible spectrum, 1694 Visible watermark, 701 Visible-light image, 15 Vision, 1759 Vision computing, 1261 Vision procedure, 1759 Vision system, 1756 VISIONS system, 1850 Visual acuity, 1784 Visual after image, 1771 Visual angle, 1757 Visual appearance, 1011 Visual attention [VA], 1804 Visual attention maps, 30 Visual attention mechanism, 1804 Visual attention model, 1805 Visual behavior, 1569 Visual brightness, 1776 Visual code word, 1186 Visual computational theory, 1266 Visual context, 836 Visual cortex, 1758 Visual cue, 1374 Visual database, 1520 Visual deformation, 1812 Visual edge, 1759 Visual event, 1566 Visual feature, 1526 Visual feeling, 1769 Visual gauging, 1054 Visual hull, 999 Visual illusion, 1811 Visual image, 1759 Visual industrial inspection, 69 Visual information, 1513 Visual information retrieval, 1513 Visual inspection, 1825 Visual knowledge, 1391 Visual learning, 1417 Visual localization, 849 Visual masking effect, 1761 Visual masking, 1760 Visual navigation, 1826 Visual odometry, 1827 Visual pathway, 1760 Visual patterns, 1165 Visual perception, 1790 Visual perception layer, 1544 Visual performance, 1759 Visual pose determination, 1561
Index Visual quality, 1773 Visual rhythm, 1541 Visual routine, 1392 Visual salience, 1036 Visual scene, 170 Visual search, 1806 Visual sensation, 1769 Visual servoing, 1826 Visual space, 6 Visual surveillance, 1823 Visual tracking, 1149 Visual vocabulary, 1186 Visual words, 1186 Visualization computation, 82 Visualization in scientific computing, 82 Viterbi algorithm, 1420 VL (visibility level), 610 VLBP (volume local binary pattern), 1082 VLBP (volume local binary pattern) operator, 1082 V-mail (video mail), 784 Vocabulary tree, 1521 Volume, 1311 Volume detection, 629 Volume element, 11 Volume local binary pattern [VLBP], 1082 Volume local binary pattern [VLBP] operator, 1082 Volume matching, 1454 Volume rendering, 72 Volume skeletons, 1312 Volumetric 3-D reconstruction, 1322 Volumetric image, 17 Volumetric modeling, 1556 Volumetric models, 1311 Volumetric primitive, 1311 Volumetric range image processing [VRIP], 1358 Volumetric reconstruction, 1322 Volumetric representation, 1315 Volumetric scattering, 1720 Von Kries hypothesis, 1795 Von Mises distribution, 328 Voronoï cell, 1120 Voronoï diagram, 1120 Voronoï mesh nerve, 1123 Voronoï nerve nucleus, 1123 Voronoï polygon, 1121 Voronoï region, 1122 Voronoï tessellation, 1122 Voxel carving, 1347 Voxel coloring, 1346 Voxel, 11
Vox-map, 11 VPUs (video presentation unit), 1844 VQ (vector quantization), 684 VR (virtual reality), 79 VRIP (volumetric range image processing), 1358 VRML (virtual reality markup language), 79 VRML (virtual reality modeling language), 79 VSP (viewpoint space partition), 197 VVC (versatile video coding), 1848
W Wachspress coordinates, 996 Walkthrough, 367 Wallis edge detector, 896 Walsh functions, 391 Walsh order, 393 Walsh transform, 390 Walsh transform kernel, 391 Walsh-Kaczmarz order, 393 Waltz line labeling, 1465 Warm color, 1802 Warping, 449 WASA (wide-area scene analysis), 1824 Water color, 1703 Waterhole for the watershed algorithm, 944 Waterline for the watershed algorithm, 944 Watermark, 689 Watermark decoder, 691 Watermark detection, 689 Watermark detector, 691 Watermark embedding strength, 692 Watermark embedding, 692 Watermark image, 24 Watermark key, 691 Watermark pattern, 690 Watermark vector, 690 Watermarking, 700 Watermarking in DCT domain, 704 Watermarking in DWT domain, 704 Watermarking in transform domain, 703 Watermark-to-noise ratio, 692 Watershed, 942 Watershed algorithm, 943 Watershed segmentation, 942 Watershed transform, 942 Wave number, 1718 Wave optics, 1672 Wave propagation, 1001 Waveform coding, 657 Wavelength, 1718 Wavelet, 421
1965 Wavelet boundary descriptor, 1028 Wavelet calibration, 416 Wavelet decomposition, 420 Wavelet descriptor, 1028 Wavelet function, 417 Wavelet modulus maxima, 416 Wavelet packet decomposition, 421 Wavelet pyramid, 420 Wavelet scale quantization [WSQ], 677 Wavelet series expansion, 418 Wavelet transform [WT], 415 Wavelet transform codec system, 676 Wavelet transform coding system, 677 Wavelet transform fusion, 1508 Wavelet transform fusion method, 1508 Wavelet transform representation, 415 Wavelet tree, 420 Wax thermal printers, 263 WBS (white block skipping), 679 WD (Wigner distribution), 397 WDT (weighted distance transform), 383 Weak calibration, 122 Weak classifier, 1202 Weak edge pixels, 891 Weak learner, 1409 Weak perspective, 230 Weak perspective projection, 230 Weak texture, 1089 Weakly calibrated stereo, 1326 Weakly supervised learning, 1413 Weaknesses of exhaustive search block matching algorithm, 795 Wearable camera, 114 Weave, 782 Weaving wall, 1308 Webcam, 114 Weber-Fechner law, 1808 Weber’s law, 1807 Wedge filter, 1067 Weight decay, 1650 Weight of attention region, 1805 Weight of background, 1140 Weight parameter, 1644 Weight sharing, 1644 Weighted 3-D distance metric, 373 Weighted average fusion, 1507 Weighted averaging, 519 Weighted distance transform [WDT], 383 Weighted finite automaton [WFA], 1226 Weighted least squares [WLS], 288 Weighted median filter, 528 Weighted prediction, 797 Weighted walkthrough, 380
Weighted-order statistic, 523
Weight-space symmetry, 1649
Weld seam tracking, 1827
Well-determined parameters, 1655
Well-formed formula [WFF], 1407
Well-posed problem, 562
WFA (weighted finite automaton), 1226
WFF (well-formed formula), 1407
White balance, 730
White block skipping [WBS], 679
White level, 259
White noise, 587
White point, 735
Whiteness, 1698
Whitening, 1195
Whitening filter, 1195
Whitening transform, 1195
Whitening transformation, 1195
Wide angle lens, 134
Wide baseline stereo, 1325
Wide field of view, 171
Wide-area scene analysis [WASA], 1824
Width function, 997
Width parameter, 327
Wiener filter, 600
Wiener filtering, 600
Wiener index, 1616
Wiener-Khinchin theorem, 1583
Wigner distribution [WD], 397
Window, 507
Window function, 507
Window scanning, 508
Window-based motion estimation, 1136
Windowed Fourier transform, 397
Windowed sinc filter, 529
Windowing, 507
Window-training procedure, 1647
Winged edge representation, 1620
Winner-takes-all [WTA], 948
Wipe, 1537
Wireframe, 1307
Wire removal, 606
Wireframe representation, 1307
Wishart distribution, 329
Within-class covariance, 1176
Within-class scatter matrix, 1176
WLS (weighted least squares), 288
Woodbury identity, 285
Work, 698
World coordinate system, 186
World coordinates, 186
World knowledge, 1392
World-centered, 1270
Wrapper algorithm, 1309
Wrapper methods, 1188
Wrapping mode, 70
Wrong focus blur, 575
WSQ (wavelet scale quantization), 677
WT (wavelet transform), 415
WTA (winner-takes-all), 948
Wyner-Ziv video coding, 797
X
Xenon lamp, 211
XGA (extended graphics array), 258
XNOR operator, 477
XOR operator, 477
X-ray, 15
X-ray CAT/CT, 624
X-ray image, 15
X-ray imaging, 181
X-ray scanner, 155
X-ray tomography, 624
XYZ colorimetric system, 734
Y
YALE database, 1855
YALE database B, 1855
Yaw, 90
YC1C2 color space, 747
YCbCr color model, 747
YCbCr color space, 747
Yellow [Y], 728
Yellow spot, 1763
Yet Another Road Follower [YARF], 1851
YIQ model, 1822
Young trichromacy theory, 1797
Young-Helmholtz theory, 1797
YUV model, 1823
Z
Zadeh operator, 1485
Zernike moment, 1025
Zero avoiding, 1503
Zero crossing of the Laplacian of a Gaussian, 894
Zero forcing, 1503
Zero-cross correction algorithm, 1352
Zero-cross pattern, 512
Zero-crossing filtering, 502
Zero-crossing operator, 512
Zero-crossing point, 512
Zero-mean homogeneous noise, 593
Zero-mean noise, 593
Zero-memory source, 652
Zero-order interpolation, 616
Zero-shot learning [ZSL], 1414
Zero-shot recognition, 1414
Zip code analysis, 1830
z-keying, 1342
Zonal coding, 675
Zoom, 136
Zoom in, 93
Zoom lens, 136
Zoom out, 137
Zooming, 137
Zooming property of wavelet transform, 419
ZSL (zero-shot learning), 1414
Zucker-Hummel operator, 887
Zuniga-Haralick corner detector, 843
Zuniga-Haralick operator, 843
Others
Déjà vu [DjVu], 275
DjVu (déjà vu), 275
Le Système International d’Unités, 1856
SCART (syndicat des constructeurs d’appareils radio récepteurs et téléviseur), 262
SECAM (Séquentiel Couleur À Mémoire), 1822
SECAM (Séquentiel Couleur À Mémoire) format, 1822
Séquentiel Couleur À Mémoire [SECAM], 1822
Séquentiel Couleur À Mémoire [SECAM] format, 1822
SPOT (système probatoire d’observation de la terre), 1833
Syndicat des constructeurs d’appareils radio récepteurs et téléviseur [SCART], 262
Système probatoire d’observation de la terre [SPOT], 1833
α-channel, 855
α-family of divergences, 1502
α-matted color image, 855
α-matting, 855
α-quantile (alpha quantile), 487
α-radiation, 1685
α-recursion, 1424
α-trimmed mean filter, 519
β-distribution, 331
β-radiation, 1685
β-recursion, 1425
γ (gamma), 483
γ-correction (gamma correction), 484
γ-curve (gamma curve), 485
γ-densitometry, 1044
γ-distribution, 332
γ-function (gamma function), 484
γ-noise, 588
γ-ray image, 14
γ-transformation, 435, 483
Δ-modulation [DM], 670
ϕ-effect, 1130
ψ-s curve, 976
a posteriori probability, 1479
a priori distribution, 1477
a priori domain knowledge, 1479
a priori information, 1478
a priori information incorporated, 958
a priori knowledge, 1479
a priori knowledge about scene, 1477
a priori probability, 1478