
Yu-Jin Zhang
Image Engineering 3
De Gruyter Graduate

Also of Interest

Image Engineering Vol. 1: Image Processing
Y-J. Zhang, 2017
ISBN 978-3-11-052032-3, e-ISBN 978-3-11-052422-2, e-ISBN (EPUB) 978-3-11-052411-6

Image Engineering Vol. 2: Image Analysis
Y-J. Zhang, 2017
ISBN 978-3-11-052033-0, e-ISBN 978-3-11-052428-4, e-ISBN (EPUB) 978-3-11-052412-3

Color Image Watermarking
Q. Su, 2016
ISBN 978-3-11-048757-2, e-ISBN 978-3-11-048773-2, e-ISBN (EPUB) 978-3-11-048763-3, Set-ISBN 978-3-11-048776-3

Modern Communication Technology
N. Zivic, 2016
ISBN 978-3-11-041337-3, e-ISBN 978-3-11-041338-0, e-ISBN (EPUB) 978-3-11-042390-7

Yu-Jin Zhang

Image Engineering

Volume III: Image Understanding

Author
Yu-Jin ZHANG
Department of Electronic Engineering
Tsinghua University, Beijing 100084
The People's Republic of China
E-mail: [email protected]
Homepage: http://oa.ee.tsinghua.edu.cn/∼zhangyujin/

ISBN 978-3-11-052034-7
e-ISBN (PDF) 978-3-11-052413-0
e-ISBN (EPUB) 978-3-11-052423-9

Library of Congress Cataloging-in-Publication Data
A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2017 Walter de Gruyter GmbH, Berlin/Boston
Typesetting: Integra Software Services Pvt. Ltd.
Printing and binding: CPI books GmbH, Leck
Cover image: Sorrentino, Pasquale/Science Photo Library
Printed on acid-free paper
Printed in Germany

www.degruyter.com

Preface

This book is Volume III of "Image Engineering" and is focused on "Image Understanding," the high layer of image engineering. It has grown out of the author's research experience and teaching practice, over more than 20 years, for full-time undergraduate and graduate students at various universities, as well as for students and engineers taking summer courses. It is prepared with students and instructors in mind, with the principal objective of introducing the basic concepts, theories, methodologies, and techniques of image engineering in a vivid and pragmatic manner.

Image engineering is a broad subject encompassing computer science, electrical and electronic engineering, as well as mathematics, physics, physiology, and psychology. Readers of this book should have some preliminary background in one of these areas. Knowledge of linear system theory, vector algebra, probability, and random processes would be beneficial but is not strictly necessary.

This book consists of eight chapters covering the main branches of image understanding. It has in total 57 sections and 94 subsections, with 176 figures, 16 tables, and 409 numbered equations, in addition to 51 examples and 96 problems (solutions for 16 of them are provided in this book). Moreover, over 200 key references are given at the end of the book for further study.

This book can be used for the third course, "Image Understanding," in the course series on image engineering, for graduate students in the disciplines of computer science, electrical and electronic engineering, image pattern recognition, information processing, and intelligent information systems. It can also be of great help to scientists and engineers doing research and development in related areas.

Special thanks go to De Gruyter and Tsinghua University Press, and their staff members. Their kind and professional assistance is truly appreciated.

Last but not least, I am deeply indebted to my wife and my daughter for their encouragement, patience, support, tolerance, and understanding during the writing of this book.

Yu-Jin ZHANG

Contents

1 Introduction to Image Understanding
1.1 The Development of Image Engineering
1.1.1 Review of Basic Concepts and Definitions
1.1.2 An Overview of a Closed Image Technology Survey
1.1.3 A New Image Engineering Survey Series
1.2 Image Understanding and Related Disciplines
1.2.1 Image Understanding
1.2.2 Computer Vision
1.2.3 Other Related Disciplines
1.2.4 Application Domains of Image Understanding
1.3 Theory Framework of Image Understanding
1.3.1 The Visual Computational Theory of Marr
1.3.2 Improvements for Marr's Theoretical Framework
1.3.3 Discussions on Marr's Reconstruction Theory
1.3.4 Research on New Theoretical Frameworks
1.4 Overview of the Book
1.5 Problems and Questions
1.6 Further Reading

2 Stereo Vision
2.1 Modules of Stereo Vision
2.1.1 Camera Calibration
2.1.2 Image Capture
2.1.3 Feature Extraction
2.1.4 Stereo Matching
2.1.5 Recovering of 3-D Information
2.1.6 Post-processing
2.2 Region-Based Binocular Matching
2.2.1 Template Matching
2.2.2 Stereo Matching
2.3 Feature-Based Binocular Matching
2.3.1 Basic Methods
2.3.2 Matching Based on Dynamic Programming
2.4 Horizontal Multiple Stereo Matching
2.4.1 Horizontal Multiple Imaging
2.4.2 Inverse-Distance
2.5 Orthogonal Trinocular Matching
2.5.1 Basic Principles
2.5.2 Orthogonal Matching Based on Gradient Classification
2.6 Computing Subpixel-Level Disparity
2.7 Error Detection and Correction
2.7.1 Error Detection
2.7.2 Error Correction
2.8 Problems and Questions
2.9 Further Reading

3 3-D Shape Information Recovery
3.1 Photometric Stereo
3.1.1 Scene Radiance and Image Irradiance
3.1.2 Surface Reflectance Properties
3.1.3 Surface Orientation
3.1.4 Reflectance Map and Image Irradiance Equation
3.1.5 Solution for Photometric Stereo
3.2 Structure from Motion
3.2.1 Optical Flow and Motion Field
3.2.2 Solution to Optical Flow Constraint Equation
3.2.3 Optical Flow and Surface Orientation
3.3 Shape from Shading
3.3.1 Shading and Shape
3.3.2 Gradient Space
3.3.3 Solving the Brightness Equation with One Image
3.4 Texture and Surface Orientation
3.4.1 Single Imaging and Distortion
3.4.2 Recover Orientation from Texture Gradient
3.4.3 Determination of Vanishing Points
3.5 Depth from Focal Length
3.6 Pose from Three Pixels
3.6.1 Perspective Three-Point Problem
3.6.2 Iterative Solution
3.7 Problems and Questions
3.8 Further Reading

4 Matching and Understanding
4.1 Fundamental of Matching
4.1.1 Matching Strategy and Groups
4.1.2 Matching and Registration
4.2 Object Matching
4.2.1 Measurement of Matching
4.2.2 String Matching
4.2.3 Matching of Inertia Equivalent Ellipses
4.3 Dynamic Pattern Matching
4.3.1 Flowchart of Matching
4.3.2 Absolute Patterns and Relative Patterns
4.4 Relation Matching
4.5 Graph Isomorphism
4.5.1 Fundamentals of the Graph Theory
4.5.2 Graph Isomorphism and Matching
4.6 Labeling of Line Drawings
4.6.1 Labeling of Contours
4.6.2 Structure Reasoning
4.6.3 Labeling with Sequential Backtracking
4.7 Problems and Questions
4.8 Further Reading

5 Scene Analysis and Semantic Interpretation
5.1 Overview of Scene Understanding
5.1.1 Scene Analysis
5.1.2 Scene Perception Layer
5.1.3 Scene Semantic Interpretation
5.2 Fuzzy Reasoning
5.2.1 Fuzzy Sets and Fuzzy Operation
5.2.2 Fuzzy Reasoning Methods
5.3 Image Interpretation with Genetic Algorithms
5.3.1 Principle of Genetic Algorithms
5.3.2 Semantic Segmentation and Interpretation
5.4 Labeling of Objects in Scene
5.4.1 Labeling Methods and Key Elements
5.4.2 Discrete Relaxation Labeling
5.4.3 Probabilistic Relaxation Labeling
5.5 Scene Classification
5.5.1 Bag of Words/Bag of Feature Models
5.5.2 pLSA Model
5.5.3 LDA Model
5.6 Problems and Questions
5.7 Further Reading

6 Multisensor Image Fusion
6.1 Overview of Information Fusion
6.1.1 Multisensor Information Fusion
6.1.2 Sensor Models
6.2 Image Fusion
6.2.1 Main Steps of Image Fusion
6.2.2 Three Layers of Image Fusion
6.2.3 Evaluation of Image Fusion
6.3 Pixel-Layer Fusion
6.3.1 Basic Fusion Methods
6.3.2 Combination of Fusion Methods
6.3.3 The Optimal Decomposition Levels
6.3.4 Examples of Pixel-Layer Fusion
6.4 Feature-Layer and Decision-Layer Fusions
6.4.1 Bayesian Methods
6.4.2 Evidence Reasoning
6.4.3 Rough Set Methods
6.5 Problems and Questions
6.6 Further Reading

7 Content-Based Image Retrieval
7.1 Feature-Based Image Retrieval
7.1.1 Color Features
7.1.2 Texture Features
7.1.3 Shape Features
7.2 Motion-Feature-Based Video Retrieval
7.2.1 Global Motion Features
7.2.2 Local Motion Features
7.3 Object-Based Retrieval
7.3.1 Multilayer Description Model
7.3.2 Experiments on Object-Based Retrieval
7.4 Video Analysis and Retrieval
7.4.1 News Program Structuring
7.4.2 Highlight of Sport Match Video
7.4.3 Organization of Home Video
7.5 Problems and Questions
7.6 Further Reading

8 Spatial–Temporal Behavior Understanding
8.1 Spatial–Temporal Technology
8.1.1 New Domain
8.1.2 Multiple Layers
8.2 Spatial–Temporal Interesting Points
8.2.1 Detection of Spatial Points of Interest
8.2.2 Detection of Spatial–Temporal Points of Interest
8.3 Dynamic Trajectory Learning and Analysis
8.3.1 Automatic Scene Modeling
8.3.2 Active Path Learning
8.3.3 Automatic Activity Analysis
8.4 Action Classification and Recognition
8.4.1 Action Classification
8.4.2 Action Recognition
8.5 Modeling Activity and Behavior
8.5.1 Modeling Action
8.5.2 Activity Modeling and Recognition
8.6 Problems and Questions
8.7 Further Reading

Answers to Selected Problems and Questions
References
Index

1 Introduction to Image Understanding

This book is the final volume of the book set "Image Engineering" and focuses on image understanding. Image understanding, building on the results of image processing and image analysis, attempts to interpret the meaning of images at a high level, so as to provide semantic information closely related to human thinking and to help further in making decisions and guiding actions according to the understanding of scenes.

The sections of this chapter are arranged as follows:

Section 1.1 first provides an overview of image engineering by reviewing and summarizing its development, and then presents some statistical data from two related literature surveys covering the past years (such as the number of publications and the changes of titles and journals over a duration of 30 years in the first survey, and the number of papers in 5 classes published each year in 15 journals over more than 20 years in the second survey).

Section 1.2 summarizes the research contents of image understanding and its position in image engineering, discusses the relationship between image understanding and computer vision, and examines the connections with, and differences from, some related disciplines.

Section 1.3 introduces the main points of Marr's theory of visual computation, which form an important foundation for image understanding and computer vision. In addition, some improvements to Marr's theory and discussions on the reconstruction theory are provided.

Section 1.4 overviews the main contents of each chapter in the book and indicates the characteristics of its preparation and some prerequisite knowledge for this book.

1.1 The Development of Image Engineering

First, an overview of the development of image engineering is provided.

1.1.1 Review of Basic Concepts and Definitions

Images can be obtained by using different observing and capturing systems from the real world in various forms and manners. They can act, directly and/or indirectly, on human eyes and produce visual perception (Zhang, 1996c). The human visual system is a typical example with such capabilities. Vision is an important function by which humans observe and cognize the world. About 75% of the information humans obtain about the outside world comes from the visual system, which reflects not only the huge amount of visual information but also its high utilization rate. Because of this, many systems have been developed to image the objective scene using a variety of radiations, so as to visualize the world. An image is a physical form of visual information.


The final result of the acquisition of digital images of a scene is often a sampled array of energy, so a matrix or array is used to represent the image: the coordinates of each element correspond to the location of a scene point, and the value of each element corresponds to a physical quantity of that scene point (a short illustrative code sketch is given later in this subsection). People use a variety of technical methods and means of processing to treat the image in order to obtain the necessary information.

From a broad perspective, image technology can be seen as the variety of techniques operating on images in general. It covers a wide range of tasks and processes carried out with the help of computers and other electronic devices, such as image acquisition, sensing, coding, storage, and transmission; image synthesis and generation; the display, rendering, and output of images; image transformation, enhancement, restoration, and reconstruction from projection; image segmentation, feature extraction, and measurement; target detection, representation, and description; sequence image correction; image database building, indexing, querying, and extraction/retrieval; image classification, expression, and recognition; 3-D image reconstruction; image model building, image information fusion, image knowledge utilization, and matching; the interpretation and understanding of images and scenes, as well as reasoning, learning, judgment, decision making, and behavior planning based on them (how to deduce the goal to be achieved and to construct the sequence of operations for the target); and so on. In addition, image technology may also include the design and fabrication of hardware and systems to perform the functions described above. Many of these specific techniques have been described in Volumes I and II of this book set.

Incorporated research and integration of image techniques can be carried out within the overall framework of image engineering (Zhang, 1996a). As is well known, engineering refers to the collection of disciplines formed by applying the principles of natural science to the industrial sectors. Image engineering is a new discipline for research and application across the whole image field, developed by using the principles of mathematics, optics, and other basic sciences, combined with electronic technology, computer technology, and the technical experience accumulated in various image applications. In fact, the development and accumulation of image technology over many years have laid a solid foundation for establishing image engineering as a discipline, and the various kinds of image applications have also put forward urgent needs for its establishment (Zhang, 2013b).

Image engineering is very rich in content and extremely wide in application. It can be divided into three layers according to the level of abstraction, the research methods, the operands, and the amount of data involved (see Figure 1.1): image processing (IP), image analysis (IA), and image understanding (IU). Image processing refers to relatively low-level operations; it is mainly focused on the pixel level of the image, so the data volume involved is very large. Image analysis is at the middle level; segmentation and feature extraction turn the original image, expressed in pixels, into a relatively simple non-graphic description.
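Returning to the array representation of an image mentioned at the start of this subsection, the following minimal Python sketch illustrates it concretely. The sketch is ours, not the book's; the 4 x 4 array values and the variable names are invented for the example.

# Illustrative sketch of the array representation of a digital image.
# The gray-level values below are invented for this example.
import numpy as np

image = np.array([
    [ 12,  40,  43,  10],
    [ 35, 200, 210,  30],
    [ 33, 205, 198,  28],
    [ 11,  38,  41,   9],
], dtype=np.uint8)          # rows/columns sample scene positions; values are
                            # quantized brightness in [0, 255]

row, col = 1, 2             # coordinates of one element (one scene point)
print(image[row, col])      # physical quantity sampled at that point -> 210
print(image.shape)          # size of the sample array -> (4, 4)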

[Figure 1.1: The three layers of image engineering. From low to high: image processing (operand: pixel; large data volume; low abstraction level), image analysis (operand: object; middle level), and image understanding (operand: symbol/semantics; smaller data volume; high abstraction level).]

Image understanding (the focus of this volume, as indicated in Figure 1.1 by the shading) mainly refers to high-level (symbolic) operations, basically based on more abstract descriptions, for making judgments and decisions. Its processes and methods are more similar to human thinking and reasoning. Here, as the degree of abstraction increases, the amount of data is gradually reduced. Specifically, through a series of processes, the original image data are gradually transformed into more organized and usable information. During these processes, semantics are gradually introduced and the operands progressively change. In addition, high-level operations can guide low-level operations and improve their performance, so that complex tasks can be completed efficiently.

Generally speaking, image engineering is the combination of image processing, image analysis, and image understanding, and it also includes their engineering applications. From the point of view of concept, image engineering can not only accommodate many similar disciplines but also emphasizes the application of image technology. Image engineering is used here to summarize the entire field of image research and application, and it also makes the relationship between image processing, image analysis, and image understanding closer.

Image engineering is a new interdisciplinary subject for the systematic study of various image theories, techniques, and applications. In its research methods, image engineering can learn from mathematics, physics, biology, physiology (especially neurophysiology), psychology, electronics, computer science, and many other disciplines. In its research scope, image engineering intersects with pattern recognition, computer vision, computer graphics, and many other specialties. In addition, the research progress of image engineering is closely related to the theory and technology of artificial intelligence, fuzzy logic, genetic algorithms, neural networks, and so on, and its development and application are closely related to biomedicine, communication, document processing, industrial automation, materials, military reconnaissance, remote sensing, traffic management, and many other areas. Image engineering is thus a new discipline that comprehensively and systematically studies image theory, expounds the principles of image technology, spreads the application of image technology, and summarizes practical production experience.

Considering the contents of this book, the main components of image engineering can be shown by the overall framework in Figure 1.2.

[Figure 1.2: The overall framework for image engineering. A scene is captured by image acquisition; the resulting image passes through image processing, image analysis, and image understanding (the basic modules, enclosed in a dashed box), producing data and interpretations for the user/system. These modules are supported by knowledge, vision theory, control and strategy, and tools such as artificial intelligence, compressive sensing, convolutional neural networks, deep learning, fuzzy logic, genetic algorithms, and machine learning.]

The basic modules of image engineering are shown within the dashed box in Figure 1.2. Various image techniques are used here to help people obtain information from the scene. The first step is to use a variety of means to obtain images from the scene. Next, the low-level processing of the image mainly aims to improve the visual effect of the image, or to reduce the amount of image data while keeping the visual effect; the results of processing are mainly for the user to view. The middle-level analysis of the image mainly detects, extracts, and measures the objects of interest in the image; the results of analysis provide the user with data describing the characteristics and properties of the image objects. Finally, the high-level understanding of the image grasps the image content and explains/interprets the original objective scene by studying the properties of the objects in the image and their mutual relations; the result of understanding provides the user with information about the objective world, which can guide and plan human action. The use of image technology from the low level to the high level has been strongly supported by many new theories, tools, and technologies from other disciplines and domains. To complete these tasks, some appropriate control strategies are also required.

Volume I of this book set described the basic principles and techniques of low-level image processing in detail, and Volume II described those of middle-level image analysis. This volume (Volume III) covers the basic principles and techniques of high-level image understanding. This includes the acquisition and expression of 3-D objective scene information (on the basis of processing and analysis), the reconstruction of the scene from images, the interpretation of the scene, the related knowledge and its application in the above processes, as well as the use of control and strategy. The study of high-level understanding has recently become a focus of image technology research and development, and the integrated application of different levels of image technology has promoted the rapid development of the image business.


1.1.2 An Overview of a Closed Image Technology Survey

As with other emerging disciplines, papers on image engineering and its various applications quickly reached a publishing rate of more than a thousand a year in the early eighties. Regular conferences and meetings are held on many aspects of the subject, and there are many survey articles, paper collections, meeting proceedings, and journal special issues. In fact, several hundred thousand papers have been published on the techniques and applications of image engineering.

A well-known bibliography series was developed over 30 years to provide a convenient compendium of the research in picture processing (before 1986) and in image processing and computer vision (after 1986). This series was ended by its author in 2000, after 30 survey papers had been published. More than 40 journals (almost all of them US or international journals) and around 10 proceedings of large-scale conferences were involved, with 34,293 cited papers covering a long period. A brief summary of the 30 papers in this series is given in Table 1.1 (Zhang, 2002c). Except for the first two papers, each paper provides a list of related publications from the year preceding its publication (the second paper covers the three preceding years).

The first two papers were published in the journal ACM Computing Surveys, and the remaining papers were associated with the journal Computer Graphics and Image Processing (CGIP) and its descendants. CGIP originated in 1972, changed its name to Computer Vision, Graphics and Image Processing (CVGIP) in 1983, and then separated into CVGIP: Graphical Models and Image Processing and CVGIP: Image Understanding in 1991. The bibliography series was continued in the latter journal. In 1995, both journals changed their names again: the former became Graphical Models and Image Processing (GMIP), and the latter became Computer Vision and Image Understanding (CVIU). It was in CVIU that the bibliography series published its last six papers.

1.1.3 A New Image Engineering Survey Series

The above bibliography survey series had a few limitations:
(1) No attempt was made to summarize or evaluate the different papers cited.
(2) No attempt was made to provide statistics about the different papers cited.
(3) No attempt was made to analyze (or discuss) the distributions of the different papers cited.

To overcome these limitations, a new series of bibliography surveys on image engineering has been published since 1996. In contrast to the previous bibliography survey, it provides not only a classification of the papers but also a statistical analysis of them. In fact, three kinds of summary statistics have been developed, based on this new series of publication surveys and analyses covering more than 21 years. Some results are presented below.

Table 1.1: An overview of a closed survey series.

Survey titles: paper 1, "Picture Processing by Computer"; paper 2, "Progress in Picture Processing: 1969–71"; papers 3–17, "Picture Processing: 19xx" (xx = 72–86); papers 18–30, "Image Analysis and Computer Vision: 19xx" (xx = 87–99).
Journals of publication: papers 1–2 appeared in ACM Computing Surveys; the remaining papers appeared in Computer Graphics and Image Processing (CGIP) and its successors (CVGIP, CVGIP: Image Understanding, and CVIU), as described in the text.

#     Index Year   Number   Published Year
1     ~69          408      1969
2     69–71        580      1972
3     72           350      1973
4     73           245      1974
5     74           341      1975
6     75           354      1976
7     76           461      1977
8     77           609      1978
9     78           819      1979
10    79           700      1980
11    80           897      1981
12    81           982      1982
13    82           1185     1983
14    83           1138     1984
15    84           1252     1985
16    85           1063     1986
17    86           1436     1987
18    87           1412     1988
19    88           1635     1989
20    89           1187     1990
21    90           1611     1991
22    91           1178     1992
23    92           1897     1993
24    93           1281     1994
25    94           1911     1995
26    95           1561     1996
27    96           2148     1997
28    97           1691     1998
29    98           2268     1999
30    99           1693     2000

1.1.3.1 Classification of Image Techniques

The scheme for classifying image techniques used in this bibliography series is discussed first. The classification of papers into groups can be considered as the problem of partitioning a set into subsets. An appropriate classification should satisfy the following four conditions:
(1) Every paper must be in a group.
(2) All groups together must include all papers.
(3) The papers in the same group should have some common properties.
(4) The papers in different groups should have certain distinguishing properties.

In other words, the groups should form a partition of the set of selected papers, as the sketch below illustrates.
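The following minimal Python sketch is illustrative only: the group names follow Table 1.2, but the paper identifiers are invented. It checks the two mechanically verifiable conditions, coverage and non-overlap; conditions (3) and (4) concern the semantics of the groups and cannot be checked this way.

# Illustrative check that a grouping of papers forms a partition.
# The paper identifiers below are invented for this example.
papers = {"p1", "p2", "p3", "p4", "p5"}
groups = {
    "IP": {"p1", "p2"},   # image processing
    "IA": {"p3"},         # image analysis
    "IU": {"p4", "p5"},   # image understanding
}

# Conditions (1) and (2): every paper belongs to some group, and the groups
# jointly cover exactly the set of selected papers.
covered = set().union(*groups.values())
assert covered == papers, "a paper is missing or does not belong to the set"

# No paper may appear in two groups (the groups must be pairwise disjoint).
assert sum(len(g) for g in groups.values()) == len(covered), "groups overlap"

print("The grouping is a valid partition of the paper set.")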


Taking into consideration the above conditions and the status of development in the field, the selected papers on image engineering have been classified into five categories: image processing (IP), image analysis (IA), image understanding (IU), technique applications (TA), and summaries and surveys (SS). All papers in these five categories have been further classified into 23 subcategories. A complete and compact classification of the theories and techniques of image engineering is shown in Table 1.2. It is easy to verify that the above four conditions are satisfied.

Table 1.2: Classification scheme of the literature in image engineering.

Group                        Subgroup
IP: Image Processing         P1: Image capturing (including camera models and calibration) and storage
                             P2: Image reconstruction from projections or indirect sensing
                             P3: Filtering, transformation, enhancement, restoration, inpainting, quality assessing
                             P4: Image and/or video coding/decoding and international coding standards
                             P5: Image digital watermarking, forensic, image information hiding, and so on
                             P6: Multiple/super-resolutions (decomposition/interpolation, resolution conversion)
IA: Image Analysis           A1: Image segmentation, detection of edge, corner, interest points
                             A2: Representation, description, measurement of objects (bi-level image processing)
                             A3: Feature measurement of color, shape, texture, position, structure, motion, and so on
                             A4: (2-D) object extraction, tracking, discrimination, classification and recognition
                             A5: Human organ (biometrics) detection, location, identification, categorization, etc.
IU: Image Understanding      U1: (Sequential, volumetric) image registration, matching and fusion
                             U2: 3-D modeling, representation, and real world/scene recovery
                             U3: Image perception, interpretation, and reasoning (semantic, machine learning)
                             U4: Content-based image and video retrieval (in various levels, related annotation)
                             U5: Spatial-temporal technology (3-D detection, tracking, behavior understanding)
TA: Technique Applications   T1: System and hardware, fast algorithm implementation
                             T2: Telecommunication, television, web transmission, and so on
                             T3: Documents (texts, digits, symbols)
                             T4: Biomedical imaging and applications
                             T5: Remote sensing, radar, surveying, and mapping
                             T6: Other application areas
SS: Summary and Survey       S1: Cross-category summary (combination of image processing/analysis/understanding)


1.1.3.2 Distribution of Publications

A summary of the number of publications concerning image engineering in the years from 1995 to 2016 is shown in Table 1.3. For each year, the table provides the total number of papers published in the surveyed journals (#T), the number of papers selected for the survey because they relate to image engineering (#S), and the selection ratio (SR = #S/#T). The statistics for IP, IA, and IU are also provided for comparison: the listed numbers are the numbers of papers published each year, and the percentages in parentheses are the ratios of IP/IA/IU papers to all selected image engineering papers (the arithmetic is illustrated by the short sketch following Table 1.3). Over these 22 years, the total number of papers related to image understanding is only about half of that related to image processing or image analysis, which indicates that research on IU still has great room for development and that the study of IU is promising.

The distribution of the 23 subcategories for 2016 is shown in Figure 1.3. The papers under the category IU have been further classified into five subcategories; their contents and average numbers of papers per year are listed in Table 1.4.

Table 1.3: Summary of image engineering over 22 years.

Year      #T       #S       SR        IP              IA              IU
1995      997      147      14.7%     35 (23.8%)      51 (34.7%)      14 (9.52%)
1996      1,205    212      17.6%     52 (24.5%)      72 (34.0%)      30 (14.2%)
1997      1,438    280      19.5%     104 (37.1%)     76 (27.1%)      36 (12.9%)
1998      1,477    306      20.7%     108 (35.3%)     96 (31.4%)      28 (9.15%)
1999      2,048    388      19.0%     132 (34.0%)     137 (35.3%)     42 (10.8%)
2000      2,117    464      21.9%     165 (35.6%)     122 (26.3%)     68 (14.7%)
2001      2,297    481      20.9%     161 (33.5%)     123 (25.6%)     78 (16.2%)
2002      2,426    545      22.5%     178 (32.7%)     150 (27.5%)     77 (14.3%)
2003      2,341    577      24.7%     194 (33.6%)     153 (26.5%)     104 (18.0%)
2004      2,473    632      25.6%     235 (37.2%)     176 (27.8%)     76 (12.0%)
2005      2,734    656      24.0%     221 (33.7%)     188 (28.7%)     112 (17.1%)
2006      3,013    711      23.60%    239 (33.6%)     206 (29.0%)     116 (16.3%)
2007      3,312    895      27.02%    315 (35.2%)     237 (26.5%)     142 (15.9%)
2008      3,359    915      27.24%    269 (29.4%)     311 (34.0%)     130 (14.2%)
2009      3,604    1008     27.97%    312 (31.0%)     335 (33.2%)     139 (13.8%)
2010      3,251    782      24.05%    239 (30.6%)     257 (32.9%)     136 (17.4%)
2011      3,214    797      24.80%    245 (30.7%)     270 (33.9%)     118 (14.8%)
2012      3,083    792      25.69%    249 (31.4%)     272 (34.3%)     111 (14.0%)
2013      2,986    716      23.98%    209 (29.2%)     232 (32.4%)     124 (17.3%)
2014      3,103    822      26.49%    260 (31.6%)     261 (31.8%)     121 (17.7%)
2015      2,975    723      24.30%    199 (27.5%)     294 (40.7%)     103 (14.2%)
2016      2,938    728      24.78%    174 (23.9%)     266 (36.5%)     105 (14.4%)
Total     56,391   13,577             4,296 (31.64%)  4,285 (31.56%)  2,010 (14.80%)
Average   2,563    617      24.08%    195             195             91
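As a quick illustration of how the entries in Table 1.3 are obtained, the following Python sketch recomputes the 2016 row. The variable names are ours; the counts are taken directly from the table.

# Arithmetic behind Table 1.3, using the 2016 row as an example.
total_published = 2938      # #T: papers published in the surveyed journals
selected = 728              # #S: papers selected as related to image engineering
ip, ia, iu = 174, 266, 105  # IP, IA, IU counts among the selected papers

print(f"SR = {selected / total_published:.2%}")  # -> 24.78%, as in the table
print(f"IP share = {ip / selected:.1%}")         # -> 23.9%
print(f"IA share = {ia / selected:.1%}")         # -> 36.5%
print(f"IU share = {iu / selected:.1%}")         # -> 14.4%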

[Figure 1.3: Distribution of papers in the 23 subcategories (P1–P6, A1–A5, U1–U5, T1–T6, S1) for 2016.]

Table 1.4: The classification of the image understanding category.

Subcategory                                                                              #/Year
U1: (Sequential, volumetric) image registration, matching, and fusion                   47
U2: 3-D modeling, representation, and real world/scene recovery                         19
U3: Image perception, interpretation, and reasoning (semantic, machine learning)        5
U4: Content-based image and video retrieval (in various levels, related annotation)     25
U5: Spatial-temporal technology (3-D detection, tracking, behavior understanding)       12

1.2 Image Understanding and Related Disciplines

Image understanding and several other disciplines are closely linked, with related contents and overlapping coverage. Some general discussions are provided in the following.

1.2.1 Image Understanding

Image engineering includes three levels from low to high. As image understanding is at the high level of image engineering, it focuses on the further study of the properties of the objects in an image and the relationships among them, based on the results of image analysis and combined with artificial intelligence and cognitive theory, in order to understand the meaning of the image content and to interpret the corresponding objective scene, so as to guide and plan action. If image analysis mainly studies the objective world centered on the observer (mainly studying observable things), then image understanding, to a certain extent, takes the objective world as the center and grasps the whole world (including things not directly observable) with the help of knowledge, experience, and so on.

Image understanding is concerned with how to describe and judge the scene according to the image. It uses computers to build systems that help explain the meaning of images, so as to use image information to explain the objective world.


Image understanding determines what information about the objective world needs to be captured through image acquisition, and what objects should be extracted through image processing and image analysis, so as to perform the required tasks. It also determines what information is used for further treatment to obtain the desired decision. It is necessary to study mathematical models of understanding and, by programming these models, to achieve computer simulation of the ability to understand. Many of these tasks cannot be fully automated, given the limitations of current computer capabilities and the level of image-understanding techniques. In most cases, the "system" completes the work at the lower levels, and the person needs to complete the rest at the higher levels, as shown in Figure 1.4. Image understanding is meant to help people recognize and interpret the objective world through images. If there is no system at all, people must perform all the tasks. If the system only has low-level capabilities, then people need to complete the middle- and high-level work on the basis of the system's results. If the system has both low-level and middle-level capabilities, then the user only needs to complete the high-level work on the basis of the system's results. If the system has the ability to work from the low level up to the high level, then users can obtain decisions directly. The current bottleneck in research and development is mainly at the high level.

1.2.2 Computer Vision

The human visual process can be thought of as a complex process from sensation (what is sensed is the image obtained by 2-D projection of the 3-D world) to perception (the cognition of the content and meaning of the 3-D world by using 2-D images) (Kong, 2002). The ultimate goal of vision, in a narrow sense, is to make a meaningful interpretation and description of the scene for the observer. In a broad sense, it also includes action plans based on these interpretations and descriptions, which take into account the surrounding environment and the wishes of the observer. Computer vision refers to the use of computers to achieve human visual functions, so as to perceive images of actual objects and scenes and to make meaningful judgments (Shapiro, 2001). This is in fact also the goal of image understanding.

[Figure 1.4: The system and user cover different levels. From low to high (IP, IA, IU, decision), the system may cover none of the levels, only the low level, the low and middle levels, or all levels, with the human completing the remaining levels up to the decision.]


1.2.2.1 Research Method

There are two main research methods in computer vision.

Bionics Method: With reference to the structures and principles of the human visual system, the bionics method seeks to establish corresponding processing modules, or to produce visual equipment, that complete similar functions. There are three related questions (Sonka, 2008):
What is? An empirical issue: it needs to determine how existing visual systems are designed.
What should be? A normative issue: it needs to determine the desired characteristics that a natural or ideal vision system should meet.
What could be? A theoretical issue: it needs to determine the mechanisms of an intelligent vision system.

Engineering Method: Starting from an analysis of the functions of the human visual process, the engineering method does not deliberately simulate the internal structure of the human visual system, but only considers the inputs and outputs of the system and uses all available means to achieve the system's functionality. The engineering method is also the main method discussed in this book.

1.2.2.2 Realization of the Engineering Method

According to the direction of information flow and the amount of prior knowledge used, there are two ways of using engineering methods to achieve human visual functions:

Bottom-Up Reconstruction: The 3-D shape of the object is reconstructed from an image or a set of images, where either luminance images or depth images can be used. Marr's visual computing theory (see Section 1.3.1) is a typical approach; it is strictly bottom-up and requires little prior knowledge of the object.

Top-Down Recognition: This is also known as model-based vision. The prior knowledge about the object is represented by a model of the object, among which 3-D models are the most important. In recognition based on CAD models, because of the constraints embedded in the model, visual problems that are otherwise uncertain can be solved in many cases.

1.2.2.3 Research Objectives

The main research objectives of computer vision can be summarized into two categories, which complement each other.

The first research goal is to build computer vision systems that complete a variety of visual tasks. In other words, the computer is enabled to use various visual sensors (such as CCD or CMOS camera devices) to obtain images of the scene; to perceive and recover the geometric properties, pose and structure, movement, relative positions, and so on, of objects in the 3-D environment; to identify, describe, and interpret the objective scene; and then to make judgments or decisions. Here the main research concerns the technical mechanism. At present, work in this direction is focused on constructing specialized systems to complete specialized visual tasks in various real-world situations, while the development of general-purpose systems remains a long-term aim (Jain, 1997).

The second research goal is to further master and understand human brain vision (as in computational neuroscience) by exploring its working mechanisms. Here the main research concerns the biological mechanism. For a long time, the human brain visual system has been studied from physiological, psychological, neurological, cognitive, and other aspects, but we are still far from discovering all the mysteries of the visual process; it can be said that the research on, and mastery of, the visual mechanism lags far behind that of visual information processing. It should be pointed out that a full understanding of human brain vision would also promote in-depth study of computer vision (Finkel, 1994). With a better grasp of the remarkable understanding ability of the human visual system, the development of new image understanding and computer vision algorithms could be advanced. This book will mainly consider the first research goal.

1.2.2.4 The Relationship Between Image Understanding and Computer Vision

Image understanding and computer vision are closely related. An image is a physical form for expressing visual information, and image understanding must be done by computer, based on image processing and image analysis. Computer vision, as a discipline, has a very close relationship and a certain degree of crossover with many subjects, especially image processing, image analysis, and image understanding, which all take the image as the main research object. Computer vision mainly emphasizes the use of computers to achieve human visual functions, which in fact requires the use of techniques from all three levels of image engineering, although the current research contents are more related to image understanding.

The intimate link between image understanding and computer vision can also be seen from a definition of computer vision (Sonka, 2008): the central problem of computer vision is to understand the scene composed of objects, and its 3-D nature, from a sequence of images of a moving or stationary scene obtained by one or several monocular, mobile or stationary observers. This definition basically matches the definition of image understanding. The complexity of the understanding task is related to the specific application. If little prior knowledge is available, as in human vision of natural scenes, then understanding is complex; but in most cases where the environment and the objects are limited, the possible explanations are also limited, and understanding may not be too complicated.


In the construction of image/visual information systems, computers can assist human beings in completing a variety of visual tasks. Image understanding and computer vision both require projective geometry, probability theory and stochastic processes, artificial intelligence, and other theories in different respects. For example, both make use of two types of intelligent activity: one is perception, such as perceiving the distance, orientation, shape, velocity, or mutual relations of the visible parts of a scene; the other is thinking, such as analyzing the behavior of objects with the help of the scene structure, inferring changes in the scene, making decisions, and planning the actions of the subjects. Computer vision was originally studied as an artificial intelligence problem, so it was also called image understanding (Shah, 2002). In fact, the two terms are often used in combination. In essence, they are interrelated, and in many cases their contents overlap; there is no absolute boundary, either conceptually or in practice. In many situations they complement each other, although they may have dissimilar focuses. It is more appropriate to regard them as different terminology used by professionals with different backgrounds, and they will not be intentionally distinguished in this book.

1.2.3 Other Related Disciplines

Image understanding and computer science are closely related (image processing and image analysis, as the basis of image understanding, are also closely related to computer science). In addition to computer vision, other computer-related disciplines, such as machine vision/robot vision, pattern recognition, artificial intelligence, and computer graphics, have played an important role in the development of image understanding and will continue to exert an important influence.

Machine vision/robot vision and computer vision are inextricably linked and are, in many cases, used as synonyms. In general, computer vision is considered to focus more on the theory and algorithms of scene analysis and image interpretation, while machine vision/robot vision is more concerned with image acquisition, system construction, and algorithm implementation. Machine vision/robot vision is thus more closely related to the technique applications of image engineering (Zhang, 2009b).

Patterns cover a wide range, and an image is one kind of pattern. Recognition refers to the mathematics and technology for automatically creating symbolic descriptions or logical inferences from objective facts, so pattern recognition (PR) is defined as a discipline that classifies and describes objects and processes in the objective world (Bishop, 2006). At present, the recognition of image patterns mainly concentrates on the classification, identification, expression, and description of the content of interest (objects) in the image, and has a considerable intersection with image analysis.


In image understanding, the concepts and methods of pattern recognition are also widely used. However, visual information has its own particularity and complexity, and traditional pattern recognition (the competitive learning model) does not cover all the research topics of image understanding.

Human intelligence mainly refers to the human ability to understand the world, judge things, learn from the environment, plan behavior, reason and think, solve problems, and so on. Artificial intelligence refers to the ability and technology of using computers to simulate, execute, or regenerate certain functions related to human intelligence (Nilsson, 1980; Winston, 1984; Dean, 1995). Visual function is one embodiment of human intelligence, so image understanding and computer vision are closely related to artificial intelligence. Many artificial intelligence techniques are used in the research of image understanding; in a sense, image understanding can also be regarded as an important application field of artificial intelligence, which needs to be realized through the theoretical research achievements and systems of artificial intelligence.

Computer graphics studies how to generate a "picture" from a given description or set of data, and it also has a close relationship with computer vision. Computer graphics is commonly referred to as the inverse problem of computer vision, because vision extracts 3-D information from 2-D images, while graphics uses 3-D models to generate 2-D scenes. Computer graphics is often associated more with image analysis: some graphics can be thought of as the visualization of image analysis results, and the generation of computer-realistic scenes can be considered the inverse process of image analysis (Zhang, 1996a). In addition, graphics technology plays a significant role in human-computer interaction and in the modeling of visual systems. A good example combining computer graphics and computer vision is image-based rendering, a basic introduction to which can be found in Zhang (2002b). It should be noted that a procedure in computer graphics usually approaches a deterministic problem (a problem that can be solved by mathematical methods), which differs from image understanding and computer vision, where many uncertainties exist. In many practical applications, people are more concerned about the speed and accuracy of graphics generation, and a compromise between real-time performance and fidelity needs to be achieved.

From a broader point of view, image understanding uses engineering methods to solve biological problems and to perform functions inherent in biological systems, so it has a mutual-learning and interdependent relationship with biology, physiology, psychology, neurology, and other disciplines. In recent years, image understanding researchers and visual psychophysicists have been working closely together, and a series of research results have been obtained. In addition, image understanding is an engineering application science and is inseparable from electronics, integrated circuit design, communication engineering, and so on. On the one hand, the study of image understanding makes full use of the achievements of these disciplines; on the other hand, the applications of image understanding have greatly promoted their in-depth research and development.


1.2.4 Application Domains of Image Understanding

In recent years, image understanding has been widely used in many fields; the following are some typical examples:
(1) Industrial vision, such as automated production lines, industrial inspection and testing, postal automation, computer-aided surgery, micro-medical operations, and various robots working in hazardous situations. The use of image and visual technology for production automation can speed up the production process, ensure consistent quality, and avoid errors caused by fatigue, inattention, and so on.
(2) Human-computer interaction, such as intelligent agents and visual recognition, so that the computer can readily understand human hand gestures, lip movements (lip reading), and torso movement (gait), and execute instructions accordingly. This is not only consistent with human interaction habits but also increases the convenience of interaction, enables telepresence, and so on.
(3) Visual navigation, such as autonomous vehicles, cruise missile guidance, unmanned aircraft, mobile robots, precision guidance of operations, and all aspects of intelligent transportation. It can avoid human involvement and the resulting risks, and can improve accuracy and speed.
(4) Virtual reality, such as pilot training, medical operation simulation, scene modeling, and battlefield environment representation. It can help people transcend their physiological limits, create immersive sensations, and improve work efficiency.
(5) Automatic interpretation of images, including the automatic judgment and explanation of radiological images, microscopic images, remote-sensing multi-band images, synthetic aperture radar images, and aerospace and aerial images. With the development of technology in recent years, the types and numbers of images have increased rapidly, and automatic interpretation has become an important means of coping with this expansion of information.
(6) Research on the human visual system and its mechanisms, the human brain, physiology, and so on.

It is also worth noting that, during the 2003 RoboCup (World Robot Soccer Cup), a bold goal was proposed: by 2050, a fully autonomous humanoid robot soccer team should be formed that could play by the official FIFA rules of the game and defeat the human World Cup champion team (see www.robocup.org). It is interesting to note that the time allotted to such a project (about 50 years) is comparable to the span from the Wright brothers' first aircraft to Apollo sending people to the moon and back safely, and to the span from the invention of the digital computer to the construction of "Deep Blue," which defeated the human chess champion.

16

1 Introduction to Image Understanding

Now more than 10 years have passed, based on the current research and technical level, to realize this goal seems still very difficult, though someone declared that the era of humanoid robots has come. Some models of humanoid robots (such as “Cornell” and “Denis”) using passive walking principles, the robots (such as “speaker”) with artificial vocal articulation driven by artificial lung airflow, the robots (such as “Domo”) with the rubber touch sensor to touch the outside world have been introduced in the laboratory, but let the robot imitate human visual feeling, especially visual perception to work is still a very challenging task. However, it is important to develop and implement such a long-term plan. The meaning of achieving this goal is not (only) for providing another entertainment or sports applications of the image understanding and computer vision technology. To achieve such a goal, people must get a deeper understanding of “image understanding,” to do more cutting-edge research for image understanding technology, to make the image understanding system with higher performance, and to push image understanding technology to an even wider field.

1.3 Theory Framework of Image Understanding

In the early days, research on image understanding and computer vision lacked a comprehensive theoretical framework. Studies of target recognition and scene understanding in the 1970s basically detected linear edges first and then combined them to form more complex structures. However, detecting such basic elements reliably is difficult and unstable in practice, so understanding systems could only take simple lines and corners as input and deal with a so-called building-block world.

1.3.1 The Visual Computational Theory of Marr

In 1982, Marr published the book Vision (Marr, 1982), which summarizes his and his colleagues' work on visual computation based on a series of results obtained from studies of human vision. The visual computational theory proposed there outlines a framework for understanding visual information and visual information processing. The framework is both comprehensive and refined, and it was key to making the research on understanding visual information rigorous, raising visual research from a descriptive level to the level of mathematical science. Marr's theory points out that the purpose of vision should be understood before its details; this holds for a variety of information processing tasks (Edelman, 1999). The main points of the theory are as follows (Marr, 1982).

1.3.1.1 Vision Is a Complex Process of Information Processing

Marr believes that vision is an information processing task whose processes are far more complex than people imagine, and its difficulty is often not appreciated.


One of the main reasons for this is that, while it is difficult for a computer to understand an image, doing so is often trivial for a person. In order to understand the complex process of vision, two problems must be addressed. One is the representation of visual information; the other is the processing of visual information. Here, a representation refers to a formal system (e.g., the Arabic numeral system or the binary number system) that can express certain entities or certain types of information, together with a number of rules on how the system works. Some of the information in a representation is salient and unambiguous, while other information is hidden and obscure. The representation has a great impact on the difficulty of the subsequent information processing. As for visual information processing, it achieves its goal through the continuous processing, analysis, and understanding of information, the conversion between different forms of representation, and gradual abstraction. To complete the visual task, processing should be conducted at several different levels and in several different aspects.

Recent biological studies have shown that when a biological organism perceives the outside world, its visual system can be divided into two cortical visual subsystems, that is, two visual pathways: the WHAT path and the WHERE path. The information transmitted by the WHAT path relates to the objects in the outside world, while the WHERE path transfers the spatial information of the objects. In conjunction with the attention mechanism, WHAT information can be used to drive bottom-up attention, form awareness, and perform object recognition, while WHERE information can be used to drive top-down attention and to treat spatial information. This research result is consistent with Marr's point of view, because, according to Marr's computational theory, the visual process is a process of information processing whose main purpose is to find out what objects are in the external world and where they are located in space.

1.3.1.2 Three Key Elements of Visual Information Processing

To fully understand and explain visual information, three key elements in Marr's theory must be grasped simultaneously, namely, the computational theory, the algorithm implementation, and the hardware implementation.

First, if a task is to be done with a computer, then it should be computable. This is the computability problem, which is to be answered by computational theory. For a given problem, if there is a program that, for a given input, can produce the output in a finite number of steps, then this problem is computable. Computability theory has three research objects, namely, decision problems, computable functions, and computational complexity. A decision problem is to determine whether an equation has a solution. The study of computable functions mainly discusses whether a function can be calculated; for example, a mathematical model such as the Turing machine can be used to determine whether a function is computable. Computational complexity mainly concerns the NP-complete problem; generally, one asks whether an effective algorithm with polynomial complexity in time and space exists.


problems that can be solved by a polynomial time algorithm form a P class, P ⊆ NP. The NP-complete problem is the most difficult problem in the NP class. However, NP-complete does not mean that there is no way to solve, for some problems the approximate solution that meets specific application can be obtained. The highest level of visual information understanding is the abstract computational theory. There is no definite answer to the question of whether vision is computable with modern computer. Vision is a process of visual feeling plus visual perception. The mechanism of human visual function in terms of micro-anatomical knowledge and objective knowledge of visual psychology is still grasped very little, so the discussion of visual computability is still relatively limited, mainly concentrated in how to complete certain specific visual tasks with the existing capabilities of computer for the number and symbol processing. Present visual computability often refers to the case that given an input to the computer, whether or not the similar results as human vision can be obtained. The computational goal here is clear, and the output requirement can be determined after the input is given, so the emphasis is on the information understanding step in the conversion from input to output. For example, given an image of a scene (input), the computational goal is to obtain the interpretation of the scene (output). Visual computational theory has two main research aspects: one is what is computed and why such computation is needed; the other is to put forward certain constraints so as to uniquely determine the final results of the computation. Second, the computing targets of the current computer are the discrete numbers or symbols. Since the computer’s storage capacity has a certain limit, so with the computational theory in hand, the realization of the algorithm (algorithm implementation) must also be considered. Therefore, it is required to choose a suitable representation for the entity operated by the processing. On the one hand, the input and output representations of the processing should be selected; on the other hand, the representation conversion algorithms must be determined. Representation and algorithm are mutually restrained, in which three points need to be paid attention to: first, under normal circumstances there can be many optional representations; second, the determination of algorithms often depends on the choice of representations; third, given a representation, it can have a variety of algorithms for task completion. From this point on, the chosen representation and the method of operation are closely related. The instructions and rules that are commonly used to process are called algorithms. Finally, with the representations and algorithms, how to implement the algorithms in physics is also necessary to be considered. Especially with the continuous improvement of real-time requirements, dedicated hardware implementation problems are often put forward. It should be noted that the determination of the algorithm often depends on the physical characteristics of the hardware to be used for algorithm implementations, while the same algorithm can also be implemented by using different technical approaches.


The above discussion can be summarized in Table 1.5.

Table 1.5: The meaning of the three key elements of visual information processing (key element – meaning and the problems solved).
– Computational theory: What is the goal of the computation? Why should one make such a computation?
– Representation and algorithms: How can the computational theory be achieved? What are the input and output representations? What algorithms are used to achieve the conversion between representations?
– Hardware implementation: How can the representations and algorithms be implemented physically? What are the details of the computational structure?

The above three key elements have a certain logical causal connection, but there is no absolute dependence. In fact, there are many different options for each element. In many cases, the interpretation of a problem related to one element is essentially independent of the other two elements, or the problem can be explained by only one or two of these elements. The above three key elements are also called the three levels of visual information processing; different issues need to be interpreted at different levels. The relationship between them is often represented as in Figure 1.5 (in fact, two levels are more appropriate), where the forward arrow indicates guidance and the reverse direction indicates serving as the basis. Note that once a computational theory is available, the representations and algorithms are interrelated with the hardware implementation.

Figure 1.5: The links between the three key elements of the visual information processing (computational theory, representation and algorithms, hardware implementation).

1.3.1.3 Three-Level Internal Representation of Visual Information
According to the definition of visual computability, the process of visual information processing can be decomposed into a number of conversion steps from one representation to another. Representation is the key to the processing of visual information. A basic theoretical framework for the study of visual information processing is composed of the three-level representation structure of the visual world, which is established, maintained, and interpreted by visual processing. For most philosophers, what the essence of visual representation is, how it relates to perception, and how it supports action can all have different interpretations. However, they agree



that the answers to these questions are related to the concept of “representation,” Edelman (1999). Primal sketch representation It is a 2-D representation that is a collection of image features, which describes the contours where there are property variations of the object surface. Primal sketch representation provides information about the contours of each object in the image and is a sketch form of representation for a 3-D object. This form of representation can be proved from the human visual process, as people observing the scene always pay attention first to the part with dramatic changes, so the basic primal sketch representation must be a stage in the human visual process. It should be noted that the use of primal sketch representation alone does not guarantee a unique interpretation of the scene. In the case of the Necker’s illusion shown in Figure 1.6, Marr (1982), if the observer focuses on the intersection of the three lines at the upper right of Figure 1.6(a), it will be interpreted as in Figure 1.6(b), and the cube imaged is shown in Figure 1.6(c). If the observer focuses on the intersection of the three lines at the bottom left of Figure 1.6(a), it will be interpreted as in Figure 1.6(d), and the cube imaged is shown in Figure 1.6(e). This is because Figure 1.6(a) gives some clues for (part of) a 3-D object (cube), but when a person tries to recover the 3-D depth from the empirical knowledge, two different integration approaches can be used with two results of different interpretations, respectively. Incidentally, Necker’s illusion can also be explained by the viewpoint reversal, Davies (2005). When people observe the cube, they will intermittently take the two middle vertices as the nearest respective points from themselves. In psychology, this is called perceptual reversal. Necker’s illusion indicates that the brain may make different assumptions about the scene and even make decisions based on incomplete evidence. 2.5-D Sketch The 2.5-D sketch presentation is intended solely to accommodate the computational functions of the computer (see Section 1.3.4). It decomposes the object according to the principle of orthogonal projection with a certain sampling density, so that the visible surface of the object is decomposed into a number of surface elements with a certain size and geometrical shape. Each surface element has its own orientation, which can be represented by the surface normal vector. All these normal vectors (each is shown

Figure 1.6: Necker's illusion.

by an arrow) form a 2.5-D sketch (also called a pin diagram, because it has a needle-like pattern). In the 2.5-D sketch, the orientations of the normal vectors are observer-centered. The specific steps for obtaining a 2.5-D sketch are: orthogonally project the visible surface onto a number of unit surface elements, represent the orientation of each unit surface element by its normal vector, and draw the normal vectors superimposed inside the visible parts of the object contour. One example of a 2.5-D sketch is shown in Figure 1.7.

Figure 1.7: An example of 2.5-D sketch.

The 2.5-D sketch is an eigen-image, because it represents the orientation of the surface elements of the object and thus gives information about the shape of the surface. It is characterized both by the representation of part of the object boundary information, which is similar to the primal sketch, and by the representation of the orientation information of the visible surface with the observer as the center. The surface orientation is an intrinsic property, and the depth is also an intrinsic property. Therefore, it is possible to convert a 2.5-D sketch into a (relative) depth map. Given the partial derivatives p and q of z(x, y) with respect to x and y, respectively, z(x, y) can in theory be recovered by integrating along an arbitrary curve in the plane:

z(x, y) = z(x₀, y₀) + ∫_{(x₀,y₀)}^{(x,y)} (p dx + q dy)        (1.1)
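To make eq. (1.1) concrete, the following minimal sketch (ours, not from the book) recovers a relative depth map from given gradient estimates p and q by integrating along one fixed path; the array layout and function name are assumptions made here for illustration.

```python
# Minimal sketch of eq. (1.1): recover a relative depth map z from gradient
# estimates p = dz/dx (along columns) and q = dz/dy (along rows) by
# integrating along one fixed path: first across the top row, then down
# each column.
import numpy as np

def depth_from_gradients_path(p, q, z0=0.0):
    rows, cols = p.shape
    z = np.zeros((rows, cols))
    z[0, 0] = z0
    z[0, 1:] = z0 + np.cumsum(p[0, 1:])               # integrate p along the top row
    z[1:, :] = z[0, :] + np.cumsum(q[1:, :], axis=0)  # integrate q down the columns
    return z

# For the plane z = 2x + 3y the gradients are constant and are recovered exactly.
p = np.full((4, 5), 2.0)   # dz/dx
q = np.full((4, 5), 3.0)   # dz/dy
print(depth_from_gradients_path(p, q))
```

For exact, noise-free gradients the result is independent of the chosen path; the discussion below addresses the noisy case.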

However, p and q are calculated from a noisy image in practice, so the above integral becomes path dependent; the integration along a closed path is then nonzero. Nevertheless, since p and q are known everywhere, the available information is actually more than needed, and the least-mean-square method can be used to search for the surface that best matches the measured surface gradients. In order to minimize the fitting error, z(x, y) is chosen to minimize

∬_I [(z_x – p)² + (z_y – q)²] dx dy        (1.2)

where p and q are the estimates of the gradient, and z_x and z_y are the partial derivatives of the best-fitting surface. This is in fact a variational problem. The integral to be minimized has the form

∬ F(z, z_x, z_y) dx dy        (1.3)

The Euler equation is

F_z – (∂/∂x) F_{z_x} – (∂/∂y) F_{z_y} = 0        (1.4)

With F = (z_x – p)² + (z_y – q)², the following equation is obtained:

(∂/∂x)(z_x – p) + (∂/∂y)(z_y – q) = ∇²z – (p_x + q_y) = 0        (1.5)

This equation is consistent with intuition, because it indicates that the Laplacian of the desired surface must equal the Laplacian estimate p_x + q_y computed from the given data. It can be proved that, for the integral defined by eq. (1.3), the natural boundary condition is

F_{z_x} (dy/ds) – F_{z_y} (dx/ds) = 0        (1.6)

where s is the arc length along the boundary. This gives

(z_x – p)(dy/ds) = (z_y – q)(dx/ds)        (1.7)

or

(z_x, z_y)^T · (dy/ds, –dx/ds)^T = (p, q)^T · (dy/ds, –dx/ds)^T        (1.8)

In eq. (1.8), (dy/ds, –dx/ds)^T is the normal vector at the boundary point s. It shows that the normal derivative of the desired surface must be consistent with the normal derivative estimated from the given data.
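As an illustration of eq. (1.5) (again ours, not the book's implementation), the sketch below solves the discrete Poisson equation ∇²z = p_x + q_y by simple Jacobi iteration; for brevity it holds the boundary values of z fixed instead of enforcing the natural boundary condition of eqs. (1.6)–(1.8).

```python
# Illustrative solver for eq. (1.5): find the surface z whose Laplacian equals
# the divergence p_x + q_y of the estimated gradient field, using Jacobi
# iterations on a regular grid (grid spacing 1). The boundary values of z are
# held fixed, a simplification of the natural boundary condition (1.6)-(1.8).
import numpy as np

def surface_from_gradients(p, q, z_init, iterations=2000):
    div = np.zeros_like(p, dtype=float)
    div[:, 1:-1] += (p[:, 2:] - p[:, :-2]) / 2.0   # central difference p_x
    div[1:-1, :] += (q[2:, :] - q[:-2, :]) / 2.0   # central difference q_y
    z = z_init.astype(float).copy()
    for _ in range(iterations):
        # Discrete Poisson update: z_ij = (sum of the 4 neighbours - div_ij) / 4
        z[1:-1, 1:-1] = 0.25 * (z[1:-1, :-2] + z[1:-1, 2:]
                                + z[:-2, 1:-1] + z[2:, 1:-1]
                                - div[1:-1, 1:-1])
    return z
```

Here z_init supplies both the fixed boundary values and the starting guess for the interior; a full implementation would handle the boundary with eq. (1.8) instead.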

The combination of the 2-D primal sketch and the 2.5-D sketch yields the 3-D information (including boundaries, depth, reflection properties, etc.) of the object within the contour that the observer can see (i.e., the visible part). This representation is consistent with the 3-D objects as understood by humans.

3-D representation
This is an object-centered form of representation (i.e., it also includes the parts of the object that are not visible). It describes the shape and spatial organization of 3-D objects in an object-centered coordinate system.

Now return to the visual computability problem.

Figure 1.8: Three-level representation of Marr's framework (image input → pixel representation → early processing → primal sketch → interim processing → 2.5-D sketch → post-processing → 3-D representation).

From a computer or information processing perspective, the visual computability problem can be divided into several steps. There is a form of representation between each pair of steps, and each step is a computation or processing method connecting the preceding and following representation forms, as shown in Figure 1.8. According to the above-described three-level representation, the problem to be solved by visual computability is how to express the 3-D world starting from the basic image representation, through the primal sketch and the 2.5-D sketch. These representations are summarized in Table 1.6.

Table 1.6: Representation framework for visual computability problems (name – purpose – basic element).
– Image: represents the brightness of the scene or the illumination of the object. Basic element: pixel (value).
– Primal sketch: represents the location of the brightness changes in the image, their geometrical distribution, and the organization structure of the object contours. Basic elements: zero-crossing points, endpoints, corners, inflection points, edge segments, boundaries, and so on.
– 2.5-D sketch: represents the orientation, depth, contour, and other properties of the visible object surface in the observer-centered coordinate system. Basic elements: local surface orientation ("needle" primitives), discontinuity points of surface orientation, depth, depth discontinuity points.
– 3-D map: describes the shape and the spatial organization of shapes by voxels or sets of surfaces in an object-centered coordinate system. Basic element: 3-D model, with the axis as the skeleton and voxels or surface elements attached to the axis.

1.3.1.4 Visual Information Organization in the Form of Functional Blocks
The idea that the visual information system is composed of a set of relatively independent functional modules is supported not only by computational, evolutionary, and epistemological arguments, but also by experimental results demonstrating the separation of some functional modules. In addition, psychological studies have shown that people can obtain different intrinsic visual information through a variety of cues or from their combination. This suggests that the visual information system should include a number of modules, each of which obtains a specific visual cue and performs certain processing. The final task of visual information understanding can then be completed by combining different modules with different weights in accordance with the environment. According to this point of view, complex processing can be done with some simple, independent functional modules. This simplifies the research methods and reduces the difficulty of specific implementations, which is also important from an engineering point of view.

1.3.1.5 Formalizing Computational Theory Must Consider Constraints
In the process of image acquisition, the information of the original scene may undergo a variety of changes, including:
(1) When a 3-D scene is projected as a 2-D image, information about the depth and the invisible parts of the object is lost.
(2) Images are always obtained from a specific viewpoint, and images of the same scene from different viewpoints will differ. In addition, information may be lost when different objects occlude each other or when different parts of one object occlude each other.
(3) Imaging projection merges the illumination, the object geometry and surface reflection characteristics, the camera characteristics, the spatial relationship among the light source, the object, and the camera, and so on, into a single image gray value, and it may be hard to separate them again.
(4) Noise and distortion are inevitably introduced during the imaging process.
A problem is well posed if its solution exists, is unique, and depends continuously on the initial data (boundary conditions). If one or more of these conditions are not met, the problem is ill posed (underdetermined). Because of the above changes of the information from the original scene, it is difficult to solve the visual problem as the inverse of the optical imaging process, since this inverse problem is ill posed. To solve it, constraints need to be identified according to the general characteristics of the external objective world and turned into precise assumptions, so that definite conclusions that stand up to examination can be drawn. Constraint conditions are generally obtained by means of prior knowledge; adding such constraints clarifies the computational problem and turns an ill-posed problem into one that can be solved.
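As a toy numerical illustration of this idea (ours, not an example from the book), the following sketch shows how adding a smoothness constraint, obtained from prior knowledge, turns an underdetermined (ill-posed) inverse problem into one with a unique and stable solution; all names and parameter values here are arbitrary choices.

```python
# Toy illustration of regularization: recover a smooth signal x from fewer
# noisy measurements b = A x + n than unknowns. Plain least squares is
# underdetermined (ill posed); adding a smoothness constraint ||D x||^2,
# i.e., prior knowledge about the solution, makes the answer unique and stable.
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 20                                   # 50 unknowns, only 20 measurements
x_true = np.sin(np.linspace(0.0, np.pi, n))
A = rng.standard_normal((m, n))
b = A @ x_true + 0.01 * rng.standard_normal(m)

D = (np.eye(n) - np.eye(n, k=1))[:-1]           # first-difference (smoothness) operator
lam = 1.0                                       # weight of the constraint

# Regularized normal equations: (A^T A + lam * D^T D) x = A^T b
x_reg = np.linalg.solve(A.T @ A + lam * D.T @ D, A.T @ b)
print("relative error:", np.linalg.norm(x_reg - x_true) / np.linalg.norm(x_true))
```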

1.3.2 Improvements for Marr's Theoretical Framework
Marr's visual computational theory is among the first and most influential theories in visual research, and it has actively promoted research in this field. It has played an important role in the research and development of image understanding and computer vision. Marr's theory also has its shortcomings, of which four issues are related to the overall framework (see Figure 1.8):
(1) In the framework, the input is passive: the system only processes the given input image.
(2) The processing purpose of the framework is fixed: it is always to restore the location and shape of the objects in the scene.
(3) The framework lacks, or does not pay adequate attention to, the guiding role of high-level knowledge.
(4) The information processing flow in the whole framework is basically bottom-up, unidirectional, and without feedback.

In view of the above problems, a series of improvement ideas have been put forward in recent years. Corresponding to the framework of Figure 1.8, it has been improved and integrated with some new modules, as shown in Figure 1.9. In conjunction with Figure 1.9, the following four aspects of the improvements for the framework of Figure 1.8 are discussed in detail. (1) Human vision is proactive/initiative, such as changing the viewing angle as needed to help identify. Active vision refers to the vision system that determines the motion of the camera to obtain the appropriate image from the appropriate location and viewing angle based on the results of the analysis and the current requirements of the visual task. Human vision is also selective. It can stare at (looking at the region of interest with a higher resolution), or it can turn a blind eye to some parts of the scene. Selective vision refers to the visual system that can determine the attention points of camera to obtain the corresponding image on the basis of the results of existing analysis and visual tasks of the current requirements. Taking into consideration of these factors, an image acquisition module is added to the improvement framework, which is also considered together with other modules in the framework. This module selects the image acquisition mode according to the visual purpose. A detailed discussion of initiative and selectivity is also available in Section 1.3.4. The above active and selective vision can also be viewed as two forms of active vision, Davies (2005): moving the camera to focus on a specific target of interest in the current environment; focusing on a particular region of the image and dynamically interacting with it to provide explanation. Although these two

Figure 1.9: Improved Marr's framework (image acquisition, early processing, interim processing, and post-processing, together with added modules for visual purpose, high-level knowledge, and feedback).

forms of active vision seem similar, they have certain differences. In the first form, the initiative is mainly reflected in the observation by the camera; in the second form, it is mainly reflected in the processing level and strategy. Although both forms involve interaction, that is, vision has initiative, moving a camera to record and store the complete scene is a very expensive process, and the overall explanation thus obtained will not necessarily all be used. Therefore, collecting only the currently most valuable parts of the scene, narrowing their scope, and enhancing their quality to obtain a useful explanation is a better imitation of the human process of interpreting scenes.
(2) Human vision can be adjusted for different purposes. Purposive vision refers to decision making by the visual system based on visual goals, such as whether to perform a complete and comprehensive restoration of the information on the location and shape of objects in the scene, or just to detect whether there is an object in the scene; the latter may allow a simpler solution to the visual problem. The key issue here is to determine the purpose of the task, so in the improved framework the visual purpose box is added, Aloimonos (1992). It helps to decide between qualitative analysis and quantitative analysis according to the purpose (in practice, there are many occasions where qualitative results are sufficient and complex quantitative results are not required). Comparatively, qualitative analysis still lacks complete mathematical tools. The motivation of purposive vision is to make clear only some of the information. For example, the collision avoidance of an autonomous vehicle does not require an accurate shape description, and some qualitative results are sufficient. This idea does not yet have a solid theoretical basis, but research on some biological vision systems has already provided many examples. Qualitative vision, which is closely related to purposive vision, seeks qualitative descriptions of objects or scenes. Its motivation is not to express the geometric information that is not required for qualitative (nongeometric) tasks or decisions. The advantage of qualitative information is that it is less sensitive than quantitative information to a variety of unwanted alterations (such as slightly varying the viewing angle) or to noise. Qualitative or invariant descriptions allow easy interpretation of the observed events at different levels of complexity.
(3) Humans have the ability to completely solve visual problems while obtaining only part of the information from the image, because of the implicit use of various kinds of knowledge. For example, obtaining object shape information from CAD design data (using an object model library) can greatly reduce the difficulty of recovering the shape of objects from a single image. The use of high-level knowledge can solve the problem of lacking low-level information, so in the improved framework the high-level knowledge box is added, Huang (1993).

(4) There is interaction between sequential processing steps in human vision. Although the mechanism of this interaction is still not well understood, it is widely recognized that the feedback of high-level knowledge and late results has an important effect on early processing. From this point of view, the feedback control flow is added to the improved framework.

1.3.3 Discussions on Marr’s Reconstruction Theory Marr’s theory emphasizes the reconstruction of the scene, and takes the reconstruction as the basis for understanding the scene. 1.3.3.1 The Problems of Reconstruction Theory According to Marr’s theory, the common core concept of different visual tasks is representation, and the common processing goal is to reconstruct the scene according to the visual stimulus and incorporate the result into the representation. If the visual system wants to restore the characteristics of the scene, such as the surface reflection property of the objects, the direction and speed of movement of objects, the surface structure of the objects, then the representations to help various reconstruction tasks are needed. Under such a theory, different works should have the same conceptual core, understanding processes, and data structures. In his theory, Marr shows how people can extract a variety of clues from the inside to build the representation of visual world. If such a unified representation is regarded as the ultimate goal of visual information processing and decision making, then vision can be viewed as a reconstruction process that begins with stimulation and sequentially which acquires and accumulates. This idea of interpreting the scene after the reconstruction of scene can simplify the visual tasks, but it does not exactly match the human visual function. In fact, reconstruction and interpretation are not always sequential but need to be adjusted according to visual purposes. The above reconstruction assumptions have also been challenged. For example, some people who are contemporaneous with Marr have raised doubts on taking the visual process as a hierarchical process with single-channel data processing. One of the significant contributions is that, on the basis of long-lasting studies of psychophysics and neuropsychology, it has been shown that the hypothesis of a single pathway is untenable. When Marr wrote his book Vision, there was only a small amount of psychological research results that took into account the high-level visual information of primates. There was quite little knowledge of the anatomy and functional organization of high-level visual areas. As the new data continue to gain insight into the entire visual process, it is found that visual processes are less and less like a single-pass process, Edelman (1999). Fundamentally speaking, a correct representation of the objective scene should be available for any visual work. If this is not the case, then the visual world itself (an external illustration of internal representation) cannot support visual behavior.


Nonetheless, further researches have revealed that the reconstruction-based representation (in aspects below) is a poor interpretation for understanding vision, or has a series of problems, Edelman (1999). First, look at the meaning of reconstruction or classification. If the visual world can be built internally, then the visual system is not necessary. In fact, capturing an image, creating a 3-D model or even giving a list of important stimulus features could not guarantee the recognition or classification. Of all the possible ways of interpreting the scene, the method containing reconstruction makes the largest circle, since reconstruction does not directly contribute to the interpretation. Second, it is hard to achieve representation by reconstruction from the original image in practice. From the perspective of computer vision, it is very difficult to restore scene representation from the original image. Now, there are many discoveries in biological vision that support other representation theories. Finally, the reconstruction theory is also problematic conceptually. The source of the problem can be related to the theoretically reconstructed work that can be applied to any representation. Apart from the problem of implementing reconstruction in concrete terms, one may first ask whether it is worthwhile to look for a representation of universal unity. Because the best representation should be the most suitable representation for the work, so a universal and uniform representation is not necessarily needed. In fact, according to the theory of information processing, it is self-evident to choose the right and correct representation for a given computational problem. This importance is also pointed out by Marr himself. 1.3.3.2 Representation Without Reconstruction In recent years, some studies and experiments show that the interpretation of the scene does not have to be based on the scene of the 3-D recovery (reconstruction). Or more precisely, the interpretation does not have to be built on the complete 3-D reconstruction of the scene. Since there are a series of problems in the realization of representation according to reconstruction, other forms of representation have also been paid attention to and studied. For example, one representation was first proposed by Locke in the book Concerning Human Understanding and is now commonly referred to as “semantics of mental representations” (Edelman, 1999). Locke suggests using natural and predictable methods for representation. According to this point of view, a sufficiently reliable feature detector constitutes a primitive representation of the existence of a characteristic/feature in the visual world. The representation of the entire object and the scene can then be constructed from these primitives (if there is enough amount). In the theory of natural computing, the original concept of feature level is developed under the influence of finding the “insect detector” from frog retina. Recent computer vision and computational neuroscience studies have shown that modifications to the original hypothesis on feature-level representation can be used as a replacement for the reconstruction theory. There are two differences between the current feature detection and traditional feature detection. One is that a set of


feature detectors can have far greater expressiveness than any individual detector; and second, many theoretical researchers recognize that a “symbol” is not the only element that combines features. Consider the representation of spatial resolution as an example. In a typical case, the observer can see two straight lines that are close to each other (the distance between them may be smaller than the distance between the photoreceptors in the center). The early assumption is that at some stages in the cortical process, the visual input is reconstructed with subpixel accuracy, so that it is possible to obtain a smaller distance than the pixels in the scene. Proponents of the reconstruction theory do not believe that a feature detector can be used to construct the visual function. Marr said: “The world is so complex that it cannot be analyzed with a feature detector.” Now this view has been challenged. In spatial resolution, for example, a set of patterns covering the viewing area can contain all the information needed to determine the offset, without the need for reconstruction. As another example, consider the perceptions of the coherent motion. In the central cortical region of the monkey, a recipient cell with motion consistent with a particular direction can be found. It is believed that the joint movement of these cells represents the movement of the field of view (FOV). To illustrate this point, note that given a central cortical region and in the field of view to determine the movement are synchronized. Artificial simulations of cells can produce behavioral responses similar to those of real motion, with the result that the cells reflect motion events, but visual motion is difficult to be reconstructed from motion in the central cortical region. This means that motion can be determined without reconstruction. The above discussion shows that new thinking is needed for Marr’s theory. The description of computational hierarchy for a task determines its input and output representations. For a low-level task, such as binocular vision, both input and output are very clear. A system with stereoscopic vision must receive two different images of the same scene and produce a representation that explicitly expresses the depth information. However, even in such a task, reconstruction is not very necessary. In stereoscopic observations, qualitative information, such as the depth order of the observed surface, is useful and relatively easy to calculate, and is also close to what the human visual system actually does. In high-level task, the choice of representation is more ambiguous. An identification system must be able to accept the images of object or scene to be identified, but what should be the representation of the desired identification? It is not enough to simply store and compare the original images of the object or scene. As pointed out by many researchers, the appearance of objects is related to their direction of observation, to their illumination, and to the existence and distribution of other objects. Of course, the appearance of objects and their own shape are also related. Can one restore its geometric properties from the apparent representation of an object and use it as a representation? Previous studies have shown that this is not feasible.


To sum up, on the one hand, complete reconstruction appears unsatisfactory for many reasons; on the other hand, representing an object only by its original images is unreliable. However, these rather obvious shortcomings do not mean that the whole theoretical framework based on the concept of representation is wrong; they only indicate the need to further examine the basic assumptions behind the representation concept.

1.3.4 Research on New Theoretical Frameworks Limited to history and other factors, Marr did not study how to use mathematical methods to strictly describe the visual information. Although he fully studied the early vision, he did not talk about the representation and utilization of visual knowledge, as well as the recognition based on visual knowledge. In recent years, there have been many attempts to establish a new theoretical framework. For example, Grossberg claimed the establishment of a new visual theory – dynamic geometry of surface form and appearance, Grossberg (1987). It is pointed out that the perceived surface shape is the total result of multiple processing actions distributed over multiple spatial scales, so the actual 2.5-D sketch does not exist, which poses a challenge to Marr’s theory. Another new visual theory is the network-symbol model, Kuvich (2004). In this model framework, it is not necessary to accurately calculate the 3-D model of the scene, but to transform the image into an understandable relational format similar to the knowledge model. This has some similarities with the human visual system. In fact, it is difficult to process the natural image with geometric manipulation. The human brain constructs the network-symbol relation structure of the visual scene and uses different clues to establish the relative order of the scene surface respective to the observer, and the relationship between the objects. In the network-symbol model, the object recognition is performed not on the basis of the FOV but on the derived structure, which is not affected by the influence of local change and object appearance. Two more representative works are introduced below. 1.3.4.1 Knowledge-Based Theoretical Frameworks Knowledge-based theoretical frameworks are developed around perceptual feature groupings, Lowe (1987), Goldberg (1987), Lowe (1988). The physiological basis of the theoretical framework derives from the results of the study of psychology. The theoretical framework believes that the human visual process is only a recognition process, and has nothing to do with reconstruction. In order to identify the 3-D object, human perception can be used to describe the object. The recognition can be completed under the guidance of knowledge and directly through the 2-D image, without the need of bottom-up 3-D reconstruction through visual input.


Figure 1.10: A knowledge-based theoretical framework (image features, perceptual organization, recognition with object models, and verification).

The process of understanding 3-D scenes from 2-D images can be divided into the following three steps (see Figure 1.10):
(1) With the help of a perceptual organization procedure, extract from the image features those groupings and structures that remain invariant over a wide range of viewing directions.
(2) Construct models by means of image features, in which the search space is reduced by probabilistic ranking.
(3) Find the spatial correspondence by solving for the unknown viewpoint and model parameters, so that the projection of the 3-D model is directly matched to the image features.
In the whole process, there is no need to measure the surface of the 3-D object (no reconstruction is needed); the information about the surface is inferred by using perceptual principles. This theoretical framework shows high stability in handling occlusion and incomplete data. It introduces feedback and emphasizes the guiding role of high-level knowledge in vision. However, practice shows that on some occasions requiring judgment of object size or estimation of object distance, identification alone is not enough and 3-D reconstruction must be made. In fact, 3-D reconstruction still has a very wide range of applications. For example, in the virtual human project, 3-D reconstruction from a series of slices yields a great deal of information about the human body. As another example, based on the 3-D reconstruction of tissue, the 3-D distribution of cells can be obtained, which supports the localization of the cells.

1.3.4.2 Active Vision-Based Theoretical Frameworks Active vision framework has been put forward mainly based on human vision (or more general biological vision) initiative. Human vision has two special mechanisms: Mechanism of Selective Attention: Not all the things seen by the eye are of concern/interest to humans. The useful visual information is usually distributed in a certain range of space and time period, so human vision does not see all parts of the scene equally, but according to the need to selectively pay special attention to some


of them, while the other parts of the general observation are even turned a blind eye. According to this feature of the selective attention mechanism, multi-azimuth, and multi-resolution sampling can be performed at the time of image acquisition, then information related to a particular task can be selected or retained. Gaze Control: People can adjust the eye, so that people can “look” at different locations in the environment and in different times according to the need for obtaining useful information. This is the gaze control, or attention control. Following this way, it is possible to always obtain visual information suitable for a particular task by adjusting the camera parameters. Gaze control can be divided into gaze stabilization and gaze change. The former refers to the positioning process, such as target tracking; the latter is similar to the rotation of the eye, which controls the fixation in the next step according to the requirements of specific tasks. The active vision-based theoretical framework, taking into consideration the human visual mechanism, is shown in Figure 1.11. Active vision-based theoretical framework emphasizes that visual system should be task-oriented and purpose-oriented. At the mean time, the visual system should have the ability to take the initiative to perceive. The active vision system can control the motion of the camera through the mechanism of active control of the camera parameters, and can coordinate the relationship between the processing task and the external signal according to the existing analysis results and the current requirements of the visual task. These parameters include the camera’s position, orientation, focal length, aperture, and so on. In addition, active vision also incorporates the “attention” ability. By changing the camera parameters or by processing the data after the camera to control the “attention,” it can achieve the choice/selection of perception on space, time, resolution, and so on. Similar to knowledge-based theoretical frameworks, active vision-based theoretical frameworks also attach great importance to knowledge. It believes that knowledge is a high-level ability to guide visual activity, which should be used in performing visual tasks. However, the current active vision-based framework lacks feedback. On

Figure 1.11: Active vision-based theoretical framework (attention selection, gaze stabilization, fixation control, gaze change, hand-eye coordination, robot structure fusion).


the one hand, this non-feedback structure does not match the biological vision system. On the other hand, a non-feedback structure often leads to poor accuracy, strong influence of noise, and high computational complexity, and it also lacks adaptability to the application and the environment.

1.4 Overview of the Book This book has eight chapters. The current chapter makes a general introduction to image understanding and its development, and distinguishes it from some related subjects. By presenting Marr’s theory and related improvements, a general view for the tasks and procedure of image understanding is provided. Chapter 2 is titled stereo vision. This chapter discusses the modules of stereo vision systems, region-based binocular matching, feature-based binocular matching, horizontal multiple stereo matching, orthogonal trinocular matching, the computing of subpixel level disparity, and techniques for error detection and correction. Chapter 3 is titled 3-D shape information recovery. This chapter discusses the principle of photometric stereo, techniques for structure from motion, shape recovering from shading, relation of texture and surface orientation, depth computation from focal length, and the estimation from three pixels. Chapter 4 is titled matching and understanding. This chapter discusses the fundamentals of matching, object-matching techniques, dynamic pattern matching, techniques for relation matching, the principle of graph isomorphism, and the labeling of line drawings for matching. Chapter 5 is titled scene analysis and semantic interpretation. This chapter discusses several techniques for understanding the scene content, such as fuzzy reasoning, genetic algorithms, scene object labeling, and scene classification using bag of words model/bag of features model, pLSA model, and LDA model. Chapter 6 is titled multi-sensor information fusing. This chapter discusses some general concepts for information fusion, the main steps and layers as well as the evaluation of image fusion, techniques for pixel-layer fusion, and techniques for feature-layer and decision-layer fusions. Chapter 7 is titled content-based image retrieval. This chapter discusses the typical techniques for feature-based image retrieval, the principle of motion-featurebased video retrieval, the general framework for object-based retrieval, and representative examples of video analysis and retrieval. Chapter 8 is titled spatial-temporal behavior understanding. This chapter discusses a number of techniques for understanding object behavior in the scene, such as the detection of key points, the learning and analysis of dynamic trajectory and activity path, and the classification and identification of actions and activities, in a sequence from low level to high level. Each chapter of this book is self-contained, and has a similar structure. After a general outline and an indication of the contents of each section in the chapter,


the main subjects are introduced in several sections. In the end of every chapter, 12 exercises are provided in the Section “Problems and Questions.” Some of them involve conceptual understanding, some of them require formula derivation, some of them need calculation, and some of them demand practical programming. The answers or hints for two of them are collected at the end of the book to help readers to start. The references cited in the book are listed at the end of the book. These references can be broadly divided into two categories. One category is related directly to the contents of the material described in this book, the reader can find from them the source of relevant definition, formula derivation, and example explanation. References of this category are generally marked at the corresponding positions in the text. The other category is to help the reader for further study, for expanding the horizons or solve specific problems in scientific research. References of this category are listed in the Section “Further Reading” at the end of each chapter, where the main contents of these references are simply pointed out to help the reader targeted to access. For the learning of this book, some basic knowledge is generally useful: (1) Mathematics: The linear algebra and matrix theory are important, as the image is represented by matrix and the image processing often requires matrix manipulation. In addition, the knowledge of statistics, probability theory, and stochastic modeling are also very worthwhile. (2) Computer science: The mastery of computer software technology, the understanding of the computer architecture system and the application of computer programming methods are very important. (3) Electronics: Many devices involved in image processing are electronic devices, such as camera, video camera, display screen, and so on. In addition, electronic board, FPGA, GPU, SOC, and so on, are frequently used. Some specific pre-requests for this book would be some basic knowledge of image processing and image analysis (Volumes I and II of this book set, see also (Zhang, 2012b) and (Zhang, 2012c)).

1.5 Problems and Questions
1-1 Why is image understanding considered to be at the top layer of image engineering, compared with image processing and image analysis?
1-2 What are the aims of computer vision? What is the basic purpose, and what is the ultimate purpose?
1-3 What are the connections between the understanding of images and the realization of visual functions by computers? Why are they introduced together?
1-4 There are several types of technical methods for achieving image understanding. What are their characteristics?
1-5* Images and graphics are closely related but also different. Try to make a discussion, and summarize the key points.
1-6 Try to illustrate (show an example) that an image itself does not have all the required geometric information to restore the objective scene.
1-7 Try to illustrate (show an example) that an image itself does not have all the detailed information of an objective scenario.
1-8 What are the chances of building a humanoid robot team in 2050 to compete with the human soccer team? What is your opinion?
1-9* Can you cite some visual tasks that can be computed, but whose algorithms are difficult to implement; or whose algorithms can be designed, but whose hardware is difficult to realize?
1-10 Why is the 2.5-D representation needed? What benefits and problems does it bring? Are there other ways or approaches?
1-11 Under what circumstances are the reconstruction and interpretation of the scene not serial, or can they be accomplished simultaneously?
1-12 Considering some recent advances in science (including artificial intelligence, physiology, psychology, bionics, neural networks, genetic algorithms, machine learning, soft science, etc.), where can Marr's theory be supplemented and modified?

1.6 Further Reading
1. The Development of Image Engineering
   – The series of survey papers for image engineering can be found in Zhang (1996a, 1996b, 1997, 1998a, 1999a, 2000a, 2001, 2002a, 2003a, 2004a, 2005a, 2006, 2007a, 2008a, 2009a, 2010, 2011a, 2012a, 2013a, 2014, 2015a, 2016, and 2017).
   – Some summaries on this survey series can be found in Zhang (1996c, 2002c, 2008b, and 2015d).
   – A comprehensive introduction to the contents of image engineering can be found in Zhang (2007b and 2013b).
2. Image Understanding and Related Disciplines
   – More information for computer vision can be found in Ballard (1982), Levine (1985), Horn (1986), Shirai (1987), Haralick (1992), Haralick (1993), Faugeras (1993), Jähne (1999a, 1999b, 1999c, and 2000), Shapiro (2001), Forsyth (2003), Hartley (2004), Davies (2005), Sonka (2008), Szeliski (2010), Davies (2012), Forsyth (2012), Prince (2012).
   – More information for machine vision can be found in Jain (1995), Snyder (2004), Steger (2008).
3. Theory Framework of Image Understanding
   – An image understanding framework based on Bayesian network and image semantic features can be found in Luo (2005).
   – There are also more and more methods for the representation of images. One example is using nonnegative tensor decomposition, Wang (2012, 2013).
4. Overview of the Book
   – The main materials in this book are extracted from the books: Zhang (2007c), Zhang (2012d).

2 Stereo Vision The real world is 3-D. The purpose of 3-D vision is to recover the 3-D structure and description of a scene and to understand the properties of objects and the meaning of a scene from an image or an image sequence. Stereo vision is an important technique for getting 3-D information. Though the principles of 3-D vision can be considered a mimic of the human vision system, modern stereo vision systems have more variations. The sections of this chapter are arranged as follows: Section 2.1 provides an overview of the process of stereoscopic vision by listing the six functional modules of stereo vision. Section 2.2 describes the principle of binocular stereo matching based on region graylevel correlation and several commonly used techniques. Section 2.3 introduces the basic steps and methods of binocular stereo matching based on different kinds of feature points, and the principle of obtaining the depth of information from these feature points. Section 2.4 discusses the basic framework of horizontal multiple stereo matching, which using the inverted distance can, in principle, reduce the mismatch caused by periodic patterns. Section 2.5 focuses on the method of orthogonal trinocular stereo matching, which can eliminate the mismatch caused by the smooth region of the image through matching both the horizontal direction and the vertical direction at the same time. Section 2.6 presents an adaptive algorithm that is based on local image intensity variation pattern and local disparity variation pattern to adjust the matching window to obtain the parallax accuracy of subpixel level. Section 2.7 introduces an algorithm to detect the error in the disparity map obtained by stereo matching and to make corresponding corrections.

2.1 Modules of Stereo Vision
A stereo vision system normally consists of six modules, Barnard (1982). In other words, six tasks are needed to perform stereo vision work.

2.1.1 Camera Calibration
Camera calibration determines the internal and external parameters of a camera, according to certain camera models, in order to establish the correspondence between the real points in the scene and the pixels in the image. In stereo vision, several cameras are often used, each of which needs to be calibrated. When deriving 3-D information from 2-D computer image coordinates, if the camera is fixed, the


calibration can be carried out only once; if the camera is moving, several calibrations might be needed.
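In practice, calibration is often performed with planar chessboard targets. The following schematic sketch uses OpenCV routines for this purpose; this is one common toolchain, not necessarily the one assumed in this book, and the image path and pattern size are placeholders.

```python
# Schematic camera calibration with a planar chessboard: detect the inner
# corners in several views, then estimate the intrinsic matrix, the lens
# distortion, and one rotation/translation (extrinsic) pair per view.
import glob
import cv2
import numpy as np

pattern = (9, 6)                                    # inner corners per row and column
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for name in glob.glob("calib_images/*.png"):        # placeholder path
    gray = cv2.imread(name, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
print("intrinsic matrix:\n", K)
```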

2.1.2 Image Capture Image capture or image acquisition deals with both spatial coordinates and image properties, as discussed in Chapter 2 of Volume I of this book set. For stereo imaging, several images are required, and each is captured in a normal way, but the relation among them should also be determined. The commonly used stereo vision systems have two cameras, but recently, even more cameras are used. These cameras can be aligned in a line, but can also be arranged in other ways. Some typical examples have been discussed in Chapter 2 of Volume I of this book set.

2.1.3 Feature Extraction Stereo vision uses the disparity between different observations in space to capture 3-D information (especially the depth information). How to match different views and find the correspondence of control points in different images are critical. One group of commonly used techniques is matching different images based on well-selected image features. Feature is a general concept, and it can be any abstract representation and/or a description based on pixels or pixel groups. There is no unique theory for feature selection and feature detection, so feature extraction is a problem-oriented procedure. Commonly used features include, in an ascending order of the scales, point-like features, line-like features, and region-like features. In general, larger-scale features could contain abundant information, based on which fast matching can be achieved with a smaller number of features and less human interference. However, the process used to extract them would be relatively complicated and the location precision would be poor. On the other hand, smaller-scale features are simple to represent or describe and they could be located more accurately. However, they are often required in a large number and they normally contain less information. Strong constraints and suitable strategy are needed for robust matching.
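As a concrete illustration of feature extraction and feature-based correspondence (one possible choice, not the specific methods developed later in this chapter), the sketch below detects point-like ORB features in a stereo pair and matches them with a brute-force matcher under a cross-check constraint; the file names are placeholders.

```python
# Point-like feature extraction and tentative matching between the two images
# of a stereo pair using ORB features and a brute-force Hamming matcher with
# cross-checking as a simple consistency constraint.
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)     # placeholder file names
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp_left, des_left = orb.detectAndCompute(left, None)
kp_right, des_right = orb.detectAndCompute(right, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_left, des_right), key=lambda m: m.distance)
print("tentative point correspondences:", len(matches))
```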

2.1.4 Stereo Matching Stereo matching consists of establishing the correspondence between the extracted features, further establishing the relation between pixels in different images, and finally obtaining the corresponding disparity images. Stereo matching is often the most important and difficult task in stereo vision. When a 3-D scene is projected on a 2-D plan, the images of the same object under different viewpoints can be quite different. In addition, as many variance factors, such as lighting condition, noise interference, object shape, distortion, surface property, camera characteristics, and so


on, are all mixed into a single gray-level value, determining different factors from the gray levels is a very difficult and challenging problem. This problem, even after many research efforts, has not been well resolved. In the following sections, stereo-matching methods based on region correlation and feature match up will be described.

2.1.5 Recovering of 3-D Information Once disparity images are obtained by stereo matching, the depth image can be calculated and the 3-D information can be recovered. The factors influencing the precision of the distance measurement include quantization effect, camera calibration error, feature detection, matching precision, and so on. The precision of the distance measurement is propositional to the match and location precision, and is propositional to the length of the camera baseline (connecting different positions of cameras). Increasing the length of the baseline can improve the precision of the depth measurement, but at the same time, it could enlarge the difference among images, raise the probability of object occlusion and thus augment the complexity of the matching. To design a precise stereo vision system, different factors should be considered together to keep a high precision for every module.
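For the parallel horizontal model, the conversion from disparity to depth reduces to Z = f·B/d, with f the focal length (in pixels), B the baseline length, and d the disparity. The small sketch below (ours) applies this relation to a disparity map.

```python
# Depth from disparity for the parallel (horizontal) binocular model:
# Z = f * B / d, with f the focal length in pixels, B the baseline length
# (in the chosen world unit) and d the disparity in pixels.
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline):
    depth = np.zeros_like(disparity, dtype=float)
    valid = disparity > 0                 # zero disparity corresponds to infinite depth
    depth[valid] = focal_length_px * baseline / disparity[valid]
    return depth
```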

2.1.6 Post-processing The 3-D information obtained after the above steps often exhibits certain errors due to variance reasons or factors. These errors can be removed or reduced by further postprocessing. Commonly used post-processes are the following three types. 2.1.6.1 Depth Interpolation The principal objective of stereo vision is to recover the complete information about the visible surface of objects in the scene. Stereo-matching techniques can only recover the disparity at the feature points, as features are often discrete. After the feature-based matching, a reconstruction step for interpolating the surface of the disparity should be followed. This procedure interpolates the discrete data to obtain the disparity values at nonfeature points. There are diverse interpolation methods, such as nearest-neighbor interpolation, bilinear interpolation, spline interpolation, modelbased interpolation, and so on, Maitre (1992). During the interpolation process, the most important aspect is to keep the discontinuous information. The interpolation process is a reconstruction process in some sense. It should fit the surface consistence principles, Grimson (1983). 2.1.6.2 Error Correction Stereo matching is performed among images suffering from geometrical distortion and noise interference. In addition, periodical patterns, smoothing regions in the


matching process, occluding effects, nonstrictness of restriction, and so on, can induce different errors in disparity maps. Error detection and correction are important components of post-processing. Suitable techniques should be selected according to the concrete reasons for error production. 2.1.6.3 Improvement of Precision Disparity computation and depth information recovery are the foundation of the following tasks. Therefore, high requirements about disparity computation are often encountered. To improve the precision, subpixel-level precision for the disparity is necessary, after obtaining the disparity at the pixel level.
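A common way to reach subpixel-level disparity, sketched here only for illustration (Section 2.6 develops a more elaborate adaptive method), is to fit a parabola through the matching costs at the integer disparity and its two neighbors and take the analytic minimum.

```python
# Subpixel refinement of an integer disparity d: fit a parabola through the
# matching costs at d-1, d and d+1 and move to the analytic minimum.
def subpixel_disparity(d, cost_left, cost_center, cost_right):
    denom = cost_left - 2.0 * cost_center + cost_right
    if denom <= 1e-12:                    # flat or non-convex cost: keep the integer value
        return float(d)
    offset = 0.5 * (cost_left - cost_right) / denom
    return d + offset

print(subpixel_disparity(10, 5.0, 2.0, 4.0))   # about 10.1
```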

2.2 Region-Based Binocular Matching
The most important task in obtaining the depth images is to determine the correspondence between the related points in the binocular images. In the following, the discussion considers parallel horizontal models. By considering the geometrical relations among different models, the results obtained for parallel horizontal models can also be extended to other models (see Section 2.2 of Volume I in this book set). Region-based matching uses gray-level correlation to determine the correspondence. A typical method is to use the mean-square difference (MSD) to judge the difference between the two groups of pixels to be matched. The advantage of this method is that the matching results are not influenced by the precision of feature detection, so high-accuracy locations of the points and a dense disparity surface can be obtained, Kanade (1996). The disadvantage of this method is that it depends on the statistics of the gray-level values of the images and thus is quite sensitive to the structure of the object surface and to the illumination and reflection conditions. When the object surfaces lack enough texture detail or if there is large distortion in capturing the image, this matching technique will encounter some problems. Besides gray-level values, other values derived from the gray levels, such as the magnitude or direction of the gray-level derivative, the Laplacian of the gray level, the curvature of the gray level, and so on, can also be used in region-based matching. However, some experiments have shown that the results obtained with gray-level values are better than the results obtained with the other derived values, Lew (1994).

2.2.1 Template Matching
Template matching can be considered the basis for region-based matching. The principle is to use a small image (the template) to match a subregion in a big image. The result of such a match is used to verify whether the small image is inside the big image, and to locate the small image in the big image if the answer is yes. Suppose that the task is to find the matching location of a J × K template w(x, y) in an M × N image f(x, y).

Figure 2.1: Illustration of template matching.

Assume that J ≤ M and K ≤ N. In the simplest case, the correlation function between f(x, y) and w(x, y) can be written as

c(s, t) = \sum_x \sum_y f(x, y)\, w(x - s, y - t)    (2.1)

where s = 0, 1, 2, ..., M – 1 and t = 0, 1, 2, ..., N – 1. The summation in eq. (2.1) is over the overlapping region of f(x, y) and w(x, y). Figure 2.1 illustrates the situation, in which the origin of f(x, y) is at the top-left of the image and the origin of w(x, y) is at its center. For any given place (s, t) in f(x, y), a particular value of c(s, t) will be obtained. With the change of s and t, w(x, y) moves inside f(x, y). The maximum of c(s, t) indicates the best matching place. Besides the maximum correlation criterion, another criterion used is the minimum mean error function, given by

M_{me}(s, t) = \frac{1}{MN} \sum_x \sum_y \left[f(x, y) - w(x - s, y - t)\right]^2    (2.2)

In VLSI hardware, the calculation of the square operation is complicated. Therefore, the square value is replaced by an absolute value, which gives the minimum average difference function

M_{ad}(s, t) = \frac{1}{MN} \sum_x \sum_y \left|f(x, y) - w(x - s, y - t)\right|    (2.3)

The correlation function given in eq. (2.1) has the disadvantage of being sensitive to changes in the amplitude of f(x, y) and w(x, y). To overcome this problem, the following correlation coefficient can be defined:

C(s, t) = \frac{\sum_x \sum_y \left[f(x, y) - \bar{f}(x, y)\right]\left[w(x - s, y - t) - \bar{w}\right]}{\left\{\sum_x \sum_y \left[f(x, y) - \bar{f}(x, y)\right]^2 \sum_x \sum_y \left[w(x - s, y - t) - \bar{w}\right]^2\right\}^{1/2}}    (2.4)


where s = 0, 1, 2, ..., M – 1, t = 0, 1, 2, ..., N – 1, w̄ is the average value of w(x, y), and f̄(x, y) is the average value of f(x, y) in the region coincident with the current location of w(x, y). The summations are taken over the coordinates that are common to both f(x, y) and w(x, y). Since the correlation coefficient C(s, t) is scaled in the interval [–1, 1], its values are independent of changes in the amplitude of f(x, y) and w(x, y).

2.2.2 Stereo Matching
Following the principle of template matching, a pair of stereo images (called the left and right images) can be matched. The process consists of four steps:
(1) Take a window centered at a pixel in the left image.
(2) Construct a template according to the gray-level distribution in the above window.
(3) Use the above template to search among all pixels in the right image in order to find a matching window.
(4) The center pixel in this matching window is the pixel corresponding to the pixel in the left image.

2.2.2.1 Constraints
A number of constraints can be used to reduce the computational complexity of stereo matching. The following list gives some examples, Forsyth (2003).
Compatibility Constraint   The two matching pixels/regions of the two images must have the same physical properties.
Uniqueness Constraint   The matching between two images is a one-to-one matching.
Continuity Constraint   The disparity variation near the matching position should be smooth, except at regions with occlusion.
Epipolar Line Constraint   Look at Figure 2.2. The optical center of the left camera is located at the origin of the coordinate system, the two optical centers of the left and right cameras are connected by the X-axis, and their distance is denoted B (also called the baseline). The optical axes of the left and right image planes lie in the XZ plane with an angle θ between them. The centers of the left and right image planes are denoted C′ and C″, respectively. Their connecting line is called the optical centerline. The optical centerline crosses the left and right image planes at points E′ and E″, respectively. These two points are called the epipoles of the left and right image planes. The optical centerline is in the same plane as the object point W. This plane is called the epipolar plane, which crosses the left and right image planes at lines L′ and L″. These two lines are called the epipolar lines of the object point W projected on the left and right image planes.

Figure 2.2: Illustration of epipoles and epipolar lines.

Epipolar lines limit the corresponding points on the two stereo images. The projected point of object point W on the right image plane, corresponding to the left image plane, must lie on L″. On the other hand, the projected point of object point W on the left image plane, corresponding to the right image plane, must lie on L′.
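As an illustration of how the correlation coefficient of eq. (2.4) can be combined with the epipolar line constraint, the following Python sketch (not from the original text; the function and variable names are invented for illustration) matches a window taken from the left image against candidate positions on the same row of the right image, assuming rectified images so that corresponding points lie on the same scanline. Image borders are ignored for brevity.

import numpy as np

def ncc(window, candidate):
    # Correlation coefficient of eq. (2.4) for two equally sized patches.
    w = window - window.mean()
    c = candidate - candidate.mean()
    denom = np.sqrt((w * w).sum() * (c * c).sum())
    return (w * c).sum() / denom if denom > 0 else 0.0

def match_along_epipolar_line(left, right, y, x, half=5, max_disp=64):
    # Window centered at (y, x) in the left image.
    win = left[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    best_d, best_score = 0, -1.0
    # Epipolar (same-row) search: only horizontal shifts are examined.
    for d in range(0, max_disp + 1):
        xr = x - d                      # candidate column in the right image
        if xr - half < 0:
            break
        cand = right[y - half:y + half + 1, xr - half:xr + half + 1].astype(float)
        score = ncc(win, cand)
        if score > best_score:
            best_d, best_score = d, score
    return best_d, best_score

A dense disparity map is obtained by repeating this search for every pixel; the maximum of C(s, t) plays the role of the best matching place, and restricting the search to the same row is precisely the epipolar line constraint.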

2.2.2.2 Factors Influencing View Matching Two important factors influencing the matching between two views are: (1) In real applications, due to the shape of objects or the occlusion of objects, not all objects viewed by the left camera can be viewed by the right one. In such cases, the template determined by the left image may not find any matching place in the right image. Some interpolation should be performed to estimate the matching results. (2) When the pattern of a template is used to represent the pixel property, the assumption is that different templates should have different patterns. However, there are always some smooth regions in an image. The templates obtained from these smooth regions will have the same or similar patterns. This will cause some uncertainty that will induce some error in the matching. To solve this problem, some random textures can be projected onto object surfaces. This will transform smooth regions to textured regions so as to differentiate different templates. Example 2.1 Real examples of the influence of smooth regions on stereo matching. Figure 2.3 gives a set of examples showing the influence of smooth regions on stereo matching. Figure 2.3(a, b) represents the left and right images of a pair of stereo images, respectively. Figure 2.3(c) is the disparity map obtained with stereo vision, in which darker regions correspond to small disparities and lighter regions correspond to large disparities. Figure 2.3(d) is the 3-D projection map corresponding to Figure 2.3(c). Since some parts of the scene, such as the horizontal eaves of the tower and the building, have similar gray-level values along the horizontal direction, the search along it will produce errors due to mismatching. These errors are manifested in Figure 2.3(c) by the nonharmonizing of certain regions with respect to surrounding regions and are visible in Figure 2.3(d) by some sharp burrs. ◻∘

Figure 2.3: Real examples of the influence of smooth regions on stereo matching.

2.3 Feature-Based Binocular Matching One method to determine the correspondence between the related parts in binocular images is to select certain feature points in images that have a unique property, such as corner points, inflexion points, edge points, and boundary points. These points are also called the control points or the matching points. The main steps involved in a feature-based matching are listed in the following: (1) Select several pairs of control points from binocular images. (2) Match these pairs of points (see below). (3) Compute the disparity of the matched points in two images to obtain the depth at the matching points. (4) Interpolate sparse depth values to get a dense depth map.

2.3.1 Basic Methods
Some basic methods for feature-based matching are presented first.

2.3.1.1 Matching with Edge Points
A simple method for matching uses just the edge points, Barnard (1980). For an image f(x, y), its edge image can be defined as

t(x, y) = \min\{H, V, L, R\}    (2.5)

where H (horizontal), V (vertical), L (left), and R (right) are defined by

H = [f(x, y) - f(x - 1, y)]^2 + [f(x, y) - f(x + 1, y)]^2    (2.6)
V = [f(x, y) - f(x, y - 1)]^2 + [f(x, y) - f(x, y + 1)]^2    (2.7)
L = [f(x, y) - f(x - 1, y + 1)]^2 + [f(x, y) - f(x + 1, y - 1)]^2    (2.8)
R = [f(x, y) - f(x + 1, y + 1)]^2 + [f(x, y) - f(x - 1, y - 1)]^2    (2.9)

Decompose t(x, y) into small nonoverlapping regions W, and take the point with the maximum value in each region as a feature point. For every feature point in the left image, consider all possible matching points in the right image as a set of points. Thus, a label set for every feature point in the left image can be obtained, in which a label l can be either the disparity of a possible matching or a special label for nonmatching. For every possible matching point, the following computation is performed to determine the initial matching probability P^{(0)}(l):

A(l) = \sum_{(x, y) \in W} \left[f_L(x, y) - f_R(x + l_x, y + l_y)\right]^2    (2.10)

where l = (lx , ly ) is the possible disparity and A(l) represents the gray-level similarity between two regions and is inversely proportional to P(0) (l). Starting from P(0) (l), assigning positive increments to the points with small disparities and negative increments to the points with large disparities can update P(0) (l) with a relaxation method. With the iteration, the kth matching probability P(k) (l) for correct matching points will increase and that for incorrect matching points will decrease. After certain iterations, the points with maximum P(k) (l) can be determined as matching points.
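A small sketch of the edge measure of eqs. (2.5)–(2.9) (an illustrative implementation, not code from the book) can make the feature-point selection concrete:

import numpy as np

def edge_image(f):
    # t(x, y) = min{H, V, L, R} of eqs. (2.5)-(2.9); borders are left at zero.
    t = np.zeros(f.shape, dtype=float)
    for y in range(1, f.shape[0] - 1):
        for x in range(1, f.shape[1] - 1):
            c = float(f[y, x])
            H = (c - f[y, x - 1]) ** 2 + (c - f[y, x + 1]) ** 2
            V = (c - f[y - 1, x]) ** 2 + (c - f[y + 1, x]) ** 2
            L = (c - f[y + 1, x - 1]) ** 2 + (c - f[y - 1, x + 1]) ** 2
            R = (c - f[y + 1, x + 1]) ** 2 + (c - f[y - 1, x - 1]) ** 2
            t[y, x] = min(H, V, L, R)
    return t

Feature points are then taken as the maxima of t(x, y) within each small nonoverlapping region W, as described above.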

2.3.1.2 Matching with Zero-Crossing Points
Convolving the image with the Laplacian operator produces zero-crossing points. The patterns around the zero-crossing points can be taken as matching elements, Kim (1987). Considering the different connectivities of zero-crossing points, sixteen zero-crossing patterns can be defined, as shown by the shaded cells in Figure 2.4. For each zero-crossing pattern in the left image, collect all possible matching points in the right image to form a set of possible matching points. In stereo matching, assign an initial matching probability to each point. Using a similar procedure as for matching with edge points, the matching points can be found by iterative relaxation.

2.3.1.3 Depth of Feature Points In the following, Figure 2.5 (obtained by removing the epipolar lines in Figure 2.2 and moving the baseline to the X-axis) is used to explain the corresponding relations among feature points.

Figure 2.4: Illustration of 16 zero-cross patterns.

Figure 2.5: Illustration of a binocular vision system.

In a 3-D space, a feature point W(x, y, –z), after respective projections on the left and right images, will be

(u', v') = (x, y)    (2.11)

(u'', v'') = [(x - B)\cos\theta - z\sin\theta,\; y]    (2.12)

In eq. (2.12), the computation for u'' is made by two coordinate transformations: a translation followed by a rotation. Equation (2.12) can also be derived with the help of Figure 2.6, which shows a plane that is parallel to the XZ plane in Figure 2.5. This gives

u'' = OS = ST - TO = (QE + ET)\sin\theta - \frac{B - x}{\cos\theta}    (2.13)

By noting that W is on the –Z axis, it is easy to get

u'' = -z\sin\theta + (B - x)\tan\theta\sin\theta - \frac{B - x}{\cos\theta} = (x - B)\cos\theta - z\sin\theta    (2.14)

Once u'' is determined from u' (i.e., the matching between the feature points is established), the depth of the point that is projected onto u' and u'' can be determined from eq. (2.13) by

-z = u''\csc\theta + (B - u')\cot\theta    (2.15)
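As a quick numerical check of eq. (2.15) (the numbers below are invented purely for illustration and are not from the original text), the depth follows directly once u' and u'' have been matched:

import math

def minus_z(u1, u2, B, theta):
    # eq. (2.15): -z = u'' csc(theta) + (B - u') cot(theta)
    return u2 / math.sin(theta) + (B - u1) / math.tan(theta)

# Assumed illustrative values: baseline B = 1.0, convergence angle theta = 30 degrees.
B, theta = 1.0, math.radians(30.0)
u1 = 0.2                                                  # u' = x of the point
u2 = (u1 - B) * math.cos(theta) - 2.0 * math.sin(theta)   # u'' from eq. (2.12) with z = 2.0
print(-minus_z(u1, u2, B, theta))                         # recovered depth z, prints approximately 2.0

The recovered value of z agrees with the value used to synthesize u'', which confirms that eqs. (2.12) and (2.15) are mutually consistent.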

Figure 2.6: Coordinate arrangement for computing the disparity in a binocular vision system.

2.3.1.4 Sparse Matching Points
Feature points are particular points on the objects and they are located apart from each other. A dense disparity map cannot be obtained from the sparse matching points alone, so it is impossible to uniquely recover the object shape. For example, four coplanar points are shown in Figure 2.7(a); these are the sparse matching points obtained by the disparity computation. Suppose that these points are on the surface of an object; then it is possible to find an infinite number of surfaces that pass through these four points. Figure 2.7(b–d) shows just a few of them. It is evident that it is not possible to uniquely recover the object shape from the sparse matching points.

2.3.2 Matching Based on Dynamic Programming
Matching of feature points is needed to establish the corresponding relations among feature points. This task can be accomplished with the help of ordering constraints (see below) and by using dynamic programming techniques, Forsyth (2003). The ordering constraint is also a popularly used type of constraint in stereo matching, especially for feature-based matching. Consider three feature points on the surface of an object in Figure 2.8(a). They are denoted sequentially by A, B, and C. The orders of their projections on the two images are just reversed, as c, b, a and c′, b′, a′, respectively. This inversion of the order of the corresponding points is called the ordering constraint.

Figure 2.7: It is not possible to recover uniquely the object shape from the sparse matching points.

Figure 2.8: Illustration of ordering constraints.

In real situations, this constraint may not always be satisfied. For example, when a small object is located in front of a big object, as shown in Figure 2.8(b), the small object will occlude parts of the big object. This not only removes points c and a′ from the resulting images, but also violates the ordering constraint for the order of the projections on the images. However, in many cases the ordering constraint still holds, and it can be used to design stereo-matching techniques based on dynamic programming (see Section 2.2.2 in Volume II of this book set). The following discussion assumes that a set of feature points has been found on corresponding epipolar lines, as shown in Figure 2.8. The objective is to match the intervals separating those points along the two intensity profiles, as shown in Figure 2.9(a). According to the ordering constraint, the order of the feature points must be the same, although the occasional interval in either image may be reduced to a single point, corresponding to the missing correspondences associated with occlusion and/or noise. The above objective can be achieved by considering the problem of matching feature points as a problem of optimizing a path's cost over a graph. In this graph, nodes correspond to pairs of left and right feature points, and arcs represent matches between the left and right intervals of the intensity profiles, bounded by the features of the corresponding nodes, as shown in Figure 2.9(b). The cost of an arc measures the discrepancy between the corresponding intervals.

Figure 2.9: Matching based on dynamic programming.


2.4 Horizontal Multiple Stereo Matching
In horizontal binocular stereo vision, there is the following relation between the disparity d of the two images and the baseline B linking the two cameras (λ represents the focal length):

d = B\,\frac{\lambda}{|\lambda - Z|} \approx B\,\frac{\lambda}{Z}    (2.16)

The last step follows because the distance satisfies Z ≫ λ in most cases. From eq. (2.16), the disparity d is proportional to the baseline length B for a given distance Z. The longer the baseline length B is, the more accurate the distance computation is. However, if the baseline is too long, the search for matching points has to be carried out over a large disparity range. This not only increases the computation but also causes mismatching when there are periodic features in the image (see below). One solution for the above problem is to use a control strategy of matching from coarse to fine scale, Grimson (1985). That is, start matching from low-resolution images to reduce the mismatching and then progressively search in high-resolution images to increase the accuracy of the measurement.

2.4.1 Horizontal Multiple Imaging
Using multiple images can solve the above problem and improve the accuracy of the disparity measurement, Okutomi (1993). In this method, a sequence of images (along the horizontal direction) is used for stereo matching. The principle is to reduce the mismatching error by computing the sum of squared differences (SSD), Matthies (1989). Suppose that a camera is moving horizontally and captures a set of images f_i(x, y), i = 0, 1, ..., M at locations P_0, P_1, P_2, ..., P_M, with baselines B_1, B_2, ..., B_M, as shown in Figure 2.10. According to Figure 2.10, the disparity between the two images captured at points P_0 and P_i, respectively, is

d_i = B_i\,\frac{\lambda}{Z}, \quad i = 1, 2, \ldots, M    (2.17)

Figure 2.10: Illustration of horizontal multiple imaging.


As only the horizontal direction is considered, f(x, y) can be replaced by f(x). The image obtained at each location is

f_i(x) = f(x - d_i) + n_i(x)    (2.18)

where the noise n_i(x) is assumed to have a Gaussian distribution with a mean of 0 and a variance of \sigma_n^2, that is, n_i(x) \sim N(0, \sigma_n^2). The value of the SSD at x in f_0(x) is

S_d(x; \hat{d}_i) = \sum_{j \in W} \left[f_0(x + j) - f_i(x + \hat{d}_i + j)\right]^2    (2.19)

where W represents the matching window and \hat{d}_i is the estimated disparity value at x. Since the SSD is a random variable, its expected value is

E[S_d(x; \hat{d}_i)] = E\left\{\sum_{j \in W} \left[f(x + j) - f(x + \hat{d}_i - d_i + j) + n_0(x + j) - n_i(x + \hat{d}_i + j)\right]^2\right\} = \sum_{j \in W} \left[f(x + j) - f(x + \hat{d}_i - d_i + j)\right]^2 + 2 N_w \sigma_n^2    (2.20)

where N_w denotes the number of pixels in the matching window. Equation (2.20) indicates that S_d(x; \hat{d}_i) attains its minimum at d_i = \hat{d}_i. If the image has the same gray-level patterns at x and x + p (p ≠ 0), it has

f(x + j) = f(x + p + j), \quad j \in W    (2.21)

From eq. (2.20), it will have

E[S_d(x; d_i)] = E[S_d(x; d_i + p)] = 2 N_w \sigma_n^2    (2.22)

Equation (2.22) indicates that the expected values of SSD will attain extremes at two locations x and x + p. In other words, there is an uncertainty problem or ambiguity problem, which will produce a mismatching error. Such a problem has no relation with the length or the number of baselines, so the error cannot be avoided even with multiple images.

2.4.2 Inverse-Distance
The inverse-distance t is defined as

t = \frac{1}{Z}    (2.23)

From eq. (2.16),

t_i = \frac{d_i}{B_i \lambda}    (2.24)

\hat{t}_i = \frac{\hat{d}_i}{B_i \lambda}    (2.25)

where t_i and \hat{t}_i are the real and estimated inverse-distances, respectively. Substituting eq. (2.25) into eq. (2.19), the SSD corresponding to t is

S_t(x; \hat{t}_i) = \sum_{j \in W} \left[f_0(x + j) - f_i(x + B_i \lambda \hat{t}_i + j)\right]^2    (2.26)

Its expected value is

E[S_t(x; \hat{t}_i)] = \sum_{j \in W} \left\{f(x + j) - f[x + B_i \lambda (\hat{t}_i - t_i) + j]\right\}^2 + 2 N_w \sigma_n^2    (2.27)

Taking the sum of the SSDs for the M inverse-distances gives the SSSD (sum of SSD) in inverse-distance

S^{(S)}_{t(12\cdots M)}(x; \hat{t}) = \sum_{i=1}^{M} S_t(x; \hat{t}_i)    (2.28)

The expected value of this new measuring function is

E[S^{(S)}_{t(12\cdots M)}(x; \hat{t})] = \sum_{i=1}^{M} E[S_t(x; \hat{t}_i)] = \sum_{i=1}^{M} \sum_{j \in W} \left\{f(x + j) - f[x + B_i \lambda (\hat{t}_i - t_i) + j]\right\}^2 + 2 M N_w \sigma_n^2    (2.29)

Returning to the problem of having the same patterns at x and x + p, as indicated in eq. (2.21), here it has

E[S_t(x; t_i)] = E\!\left[S_t\!\left(x; t_i + \frac{p}{B_i \lambda}\right)\right] = 2 N_w \sigma_n^2    (2.30)

Note that the uncertainty problem still exists, as a minimum also occurs at t_p = t_i + p/(B_i \lambda). However, with the change of B_i, t_p changes but t_i does not change. This is an important property of the SSSD in inverse-distance. By using such a property, it is possible to select different baselines to make the false minima appear at different locations. Taking the case of using two baselines B_1 and B_2 (B_1 ≠ B_2) as an example, it can be derived from eq. (2.29) that

E[S^{(S)}_{t(12)}(x; \hat{t})] = \sum_{j \in W} \left\{f(x + j) - f[x + B_1 \lambda (\hat{t}_1 - t_1) + j]\right\}^2 + \sum_{j \in W} \left\{f(x + j) - f[x + B_2 \lambda (\hat{t}_2 - t_2) + j]\right\}^2 + 4 N_w \sigma_n^2    (2.31)

It can be proven that when \hat{t} ≠ t, it has, Okutomi (1993),

E[S^{(S)}_{t(12)}(x; \hat{t})] > 4 N_w \sigma_n^2 = E[S^{(S)}_{t(12)}(x; t)]    (2.32)

This means that at the correct matching location t there is a true minimum of S^{(S)}_{t(12)}(x; \hat{t}). The uncertainty problem caused by repeated patterns can thus be solved by using two different baselines.
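A minimal numerical sketch (illustrative Python, not from the book; all numbers are invented) of the SSSD idea in eqs. (2.26)–(2.31): two SSD curves computed for different baselines share a minimum at the true inverse-distance, but their false, periodicity-induced minima fall at different positions, so their sum has a single clear global minimum.

import numpy as np

def scene(u):
    # A 1-D periodic intensity profile with period 8.
    return 2.0 + np.cos(np.pi * u / 4.0)

def ssd_inverse_distance(f0, fi, x, B_lambda, t_hat, half=4):
    # eq. (2.26) with a sub-pixel shift obtained by linear interpolation.
    j = np.arange(-half, half + 1)
    shifted = np.interp(x + B_lambda * t_hat + j, np.arange(len(fi)), fi)
    return float(np.sum((f0[x + j] - shifted) ** 2))

x_axis = np.arange(400)
t_true, B1, B2 = 0.05, 100.0, 150.0      # B1, B2 stand here for the products B*lambda
f0 = scene(x_axis)
f1 = scene(x_axis - B1 * t_true)          # disparity d1 = B1 * t_true
f2 = scene(x_axis - B2 * t_true)          # disparity d2 = B2 * t_true

x = 200
for t_hat in np.linspace(0.0, 0.15, 31):
    s1 = ssd_inverse_distance(f0, f1, x, B1, t_hat)
    s2 = ssd_inverse_distance(f0, f2, x, B2, t_hat)
    print(f"{t_hat:.3f}  {s1:8.3f}  {s2:8.3f}  {s1 + s2:8.3f}")

In the printed table, s1 dips to zero near t = 0.05 and again near t = 0.13, and s2 dips near t = 0.05 and near t ≈ 0.10, but only at the true inverse-distance t = 0.05 are both terms small, so the summed column has a single global minimum there.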

Example 2.2 The effect of the new measuring function.
One example showing the effect of the new measuring function is illustrated in Figure 2.11, Okutomi (1993). Figure 2.11(a) shows a plot of f(x) given by

f(x) = \begin{cases} 2 + \cos(\pi x/4) & -4 < x < 12 \\ 1 & x \le -4,\; x \ge 12 \end{cases}

Suppose that for baseline B_1 it has d_1 = 5 and \sigma_n^2 = 0.2, and the window size is 5. Figure 2.11(b) gives E[S_{d1}(x; d)], which has two minima, at d = 5 and d = 13, respectively. Now a pair of images with baseline B_2 is used, the new baseline being 1.5 times the old one. The obtained E[S_{d2}(x; d)], shown in Figure 2.11(c), has two minima, at d = 7 and d = 15, respectively. The uncertainty problem still exists, and the distance between the two minima has not changed. Using the SSD in inverse-distance, the curves of E[S_{t1}(x; t)] and E[S_{t2}(x; t)] for baselines B_1 and B_2 are plotted in Figure 2.11(d, e), respectively. From these two figures, it can be seen that E[S_{t1}(x; t)] has two minima, at t = 5 and t = 13, and E[S_{t2}(x; t)] has two minima, at t = 5 and t = 10. The uncertainty problem still exists when only the inverse-distance is used. However, the minimum for the correct matching position (t = 5) does not move, while the minimum for the false matching position changes with the alteration of the baseline length. Therefore, adding the two SSD values in inverse-distance gives the expectation curve E[S^{(S)}_{t(12)}(x; t)] shown in Figure 2.11(f). The minimum at the correct matching position is smaller than the minimum at the false matching position. In other words, there is a global minimum at the correct matching position. The uncertainty problem is thus solved. ◻∘

Consider that f(x) is a periodic function with a period T. Every S_t(x, t) is then a periodic function of t with period T/(B_i \lambda), so there will be a minimum in every interval of length T/(B_i \lambda). When two baselines are used, the corresponding S^{(S)}_{t(12)}(x; t) is still a periodic function of t but with a different period T_{12}

Figure 2.11: Expected values of various functions.

T_{12} = \mathrm{LCM}\!\left(\frac{T}{B_1 \lambda}, \frac{T}{B_2 \lambda}\right)    (2.33)

Here, LCM denotes the least common multiple. It is evident that T12 should not be smaller than T1 or T2 . By suitably selecting baselines B1 and B2 , it is possible to allow only one minimum in the searching region.
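A small worked example of eq. (2.33), with invented numbers purely for illustration: if the texture period is T = 24 pixels and the two baselines give B_1\lambda = 4 and B_2\lambda = 6, then

\frac{T}{B_1\lambda} = 6, \quad \frac{T}{B_2\lambda} = 4, \quad T_{12} = \mathrm{LCM}(6, 4) = 12

so over a search interval of length 12 (in inverse-distance units) the combined function S^{(S)}_{t(12)} exhibits a single minimum, whereas the individual functions would exhibit two and three minima, respectively.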

2.5 Orthogonal Trinocular Matching One problem often encountered in stereo vision is that mismatching arises for gray-level smooth regions. This problem cannot be solved by the above multiple stereo-matching technique, though it can reduce the mismatching caused by periodic patterns. In real applications, horizontal smooth regions often have visible gray-level differences along the vertical direction. In this case, the mismatching problem caused by the horizontal smoothness can be solved by vertical matching. Similarly, when the region is smooth along the vertical direction, the correct matching could be achieved


by a horizontal searching. Considering both vertical and horizontal matching, techniques for the orthogonal trinocular matching are proposed, Ohta (1986).

2.5.1 Basic Principles
The basic principles of orthogonal trinocular matching are discussed first.

2.5.1.1 Removing Mismatching in Smooth Regions
Since both horizontal smooth regions and vertical smooth regions can appear in real images, both horizontal pairs and vertical pairs of images should be captured. In the simplest case, two orthogonal pairs of cameras are arranged in a plane, as shown in Figure 2.12. The left image L and the right image R form the horizontal stereo image pair with baseline B_h. The left image L and the top image T form the vertical stereo image pair with baseline B_v. These two pairs of images form a set of orthogonal trinocular images (the two baselines can have different lengths).

Figure 2.12: Locations of cameras for orthogonal trinocular imaging.

Suppose that the three captured images are

f_L(x, y) = f(x, y) + n_L(x, y)
f_R(x, y) = f(x - d_h, y) + n_R(x, y)    (2.34)
f_T(x, y) = f(x, y - d_v) + n_T(x, y)

where d_h and d_v are the horizontal and vertical disparities, respectively. In the following discussion, it is assumed that d_h = d_v = d. The two SSDs corresponding to the horizontal and vertical directions, respectively, are

S_h(x, y; \hat{d}) = \sum_{j,k \in W} \left[f_L(x + j, y + k) - f_R(x + \hat{d} + j, y + k)\right]^2
S_v(x, y; \hat{d}) = \sum_{j,k \in W} \left[f_L(x + j, y + k) - f_T(x + j, y + \hat{d} + k)\right]^2    (2.35)

Adding them produces the orthogonal disparity measuring function O^{(S)}(x, y; \hat{d}), given by

O^{(S)}(x, y; \hat{d}) = S_h(x, y; \hat{d}) + S_v(x, y; \hat{d})    (2.36)

The expected value of O^{(S)}(x, y; \hat{d}) is

E[O^{(S)}(x, y; \hat{d})] = \sum_{j,k \in W} \left[f(x + j, y + k) - f(x + \hat{d} - d + j, y + k)\right]^2 + \sum_{j,k \in W} \left[f(x + j, y + k) - f(x + j, y + \hat{d} - d + k)\right]^2 + 4 N_w \sigma_n^2    (2.37)

where N_w denotes the number of pixels in the matching window W. From eq. (2.37), when \hat{d} = d,

E[O^{(S)}(x, y; d)] = 4 N_w \sigma_n^2    (2.38)

That is, E[O^{(S)}(x, y; \hat{d})] gets its minimum at the position of the correct disparity. It can be seen from the above discussion that, to remove periodic patterns in one direction, the inverse-distance is not necessary.

Example 2.3 Removing mismatching in smooth regions by orthogonal trinocular.
Figure 2.13(a–c) represents the left, right, and top images with horizontal and vertical smooth regions. The disparity map obtained by matching the horizontal binocular images is shown in Figure 2.13(d). The disparity map obtained by matching the vertical binocular images is shown in Figure 2.13(e). The disparity map obtained by matching the orthogonal trinocular images is shown in Figure 2.13(f). In Figure 2.13(d), some visible mismatching occurs at the horizontal smooth region (shown by horizontal strips). In Figure 2.13(e), some visible mismatching occurs at the vertical smooth region (shown by vertical strips). Such mismatching strips do not appear in Figure 2.13(f). Figure 2.13(g–i) represents 3-D plots corresponding to Figure 2.13(d–f), respectively. ◻∘

2.5.1.2 Reducing Mismatching Caused by Periodic Patterns
The technique of orthogonal trinocular matching can reduce the mismatching caused by smooth regions as well as the mismatching caused by periodic patterns. Consider the case where the object has both horizontal and vertical periodic patterns. Suppose that f(x, y) is a periodic function with horizontal and vertical periods T_x and T_y, respectively, given by

f(x + j, y + k) = f(x + j + T_x, y + k + T_y)    (2.39)

where Tx and Ty are nonzero constants. From eq. (2.35) to eq. (2.38), the following equations can be derived

Figure 2.13: Removing mismatching in smooth regions by orthogonal trinocular.

E[S_h(x, y; \hat{d})] = E[S_h(x, y; \hat{d} + T_x)]    (2.40)

E[S_v(x, y; \hat{d})] = E[S_v(x, y; \hat{d} + T_y)]    (2.41)

E[O^{(S)}(x, y; \hat{d})] = E[S_h(x, y; \hat{d} + T_x) + S_v(x, y; \hat{d} + T_y)] = E[O^{(S)}(x, y; \hat{d} + T_{xy})]    (2.42)

T_{xy} = \mathrm{LCM}(T_x, T_y)    (2.43)

According to eq. (2.43), if Tx ≠ Ty , the expected period of O(S) (x, y; d), Txy , would be larger than the expected period of Sh (x, y; d), Tx , or the expected period of Sv (x, y; d), Ty . Consider the range of the disparity search for matching. Suppose that d ∈ [dmin , dmax ]. The number of minimum occurrences in E[Sh (x, y; d)], E[Sv (x, y; d)], and E[O(S) (x, y; d)] are respectively given by

N_h = \frac{d_{max} - d_{min}}{T_x}, \quad N_v = \frac{d_{max} - d_{min}}{T_y}, \quad N = \frac{d_{max} - d_{min}}{\mathrm{LCM}(T_x, T_y)}    (2.44)

According to eqs. (2.43) and (2.44), N ⩽ min (Nh , Nv )

(2.45)

This indicates that when substituting Sh (x, y; d) or Sv (x, y; d) by O(S) (x, y; d) as the similarity function, the number of minimum occurrences in E[O(S) (x, y; d)] is smaller than those of either E[Sh (x, y; d)] or E[Sv (x, y; d)]. Example 2.4 Reducing the mismatching caused by periodic patterns with orthogonal trinocular matching. Figure 2.14(a–c) represents the left, right, and top images of a square prismoid (truncated pyramid) with periodic textures on its surface. The disparity map obtained by matching the horizontal binocular images is shown in Figure 2.14(d). The disparity map obtained by matching the vertical binocular images is shown in Figure 2.14(e). The disparity map obtained by matching the orthogonal trinocular images is shown in Figure 2.14(f). Due to the influence of the periodic patterns, there are many mismatching points in Figure 2.14(d, e). Most of these mismatching points are removed in Figure 2.14(f). Figure 2.14(g–i) represents 3-D plots corresponding to Figure 2.14(d–f), respectively. ◻∘ 2.5.2 Orthogonal Matching Based on Gradient Classification In the following, a fast orthogonal matching method based on gradient classification is described. 2.5.2.1 Algorithm Flowchart The principle of this method is first to compare the smoothness of regions along the horizontal direction and the vertical direction. In horizontal smoother regions, the matching is based on a vertical image pair. In vertical smoother regions, the matching is based on the horizontal image pair. To judge whether a region is horizontal smooth or vertical smooth, the gradient direction of this region is used. The flowchart of this algorithm is shown in Figure 2.15. The algorithm has four steps. (1) Compute the gradients of fL (x, y) and obtain the gradient direction of each point in fL (x, y).

Figure 2.14: Reducing the mismatching caused by periodic patterns with orthogonal trinocular matching.

Figure 2.15: Flowchart of the matching algorithm with 2-D search.

(2) Classify fL(x, y) into two parts, with near horizontal gradient directions and with near vertical gradient directions, respectively.
(3) Use the horizontal pair of images to compute the disparity in the regions with near horizontal gradient directions, and use the vertical pair of images to compute the disparity in the regions with near vertical gradient directions.
(4) Combine the disparity values of the above two results to form a complete disparity image and then a depth image.

In gradient image computation, only two gradient directions are required. A simple method is to select the horizontal gradient value G_h and the vertical gradient value G_v as (W represents the width of the mask for gradient computation)

G_h(x, y) = \sum_{i=1}^{W/2} \sum_{j=y-W/2}^{y+W/2} \left|f_L(x - i, j) - f_L(x + i, j)\right|    (2.46)

G_v(x, y) = \sum_{j=1}^{W/2} \sum_{i=x-W/2}^{x+W/2} \left|f_L(i, y - j) - f_L(i, y + j)\right|    (2.47)
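The gradient computation of eqs. (2.46) and (2.47) and the classification rule described in the next paragraph can be sketched as follows (illustrative Python, not the book's code; the mask width W is assumed even so that W/2 is an integer, and borders are skipped):

import numpy as np

def gradient_classification(fL, W=4):
    # Returns a Boolean map: True marks pixels with near horizontal gradient direction (Gh > Gv).
    h, w = fL.shape
    half = W // 2
    f = fL.astype(float)
    horizontal = np.zeros((h, w), dtype=bool)
    for y in range(half, h - half):
        for x in range(half, w - half):
            # Gh of eq. (2.46): column differences accumulated over the vertical window.
            Gh = sum(abs(f[j, x - i] - f[j, x + i])
                     for i in range(1, half + 1)
                     for j in range(y - half, y + half + 1))
            # Gv of eq. (2.47): row differences accumulated over the horizontal window.
            Gv = sum(abs(f[y - j, i] - f[y + j, i])
                     for j in range(1, half + 1)
                     for i in range(x - half, x + half + 1))
            horizontal[y, x] = Gh > Gv
    return horizontal

Pixels flagged True are then matched with the horizontal image pair and the remaining pixels with the vertical pair, after which the two partial disparity maps are combined as in step (4) above.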

The following classification rules can be used. If Gh > Gv at a pixel in fL (x, y), this pixel should be classified to the part with near horizontal gradient directions and be searched in the horizontal image pair. If Gh < Gv at a pixel in fL (x, y), this pixel should be classified as the part with near vertical gradient directions and be searched in the vertical image pair. Example 2.5 A real example in reducing the mismatching caused by smooth regions with orthogonal trinocular matching. A real example in reducing the influence of smooth regions on stereo matching in Example 2.1 by using the above orthogonal trinocular matching based on the gradient classification is shown in Figure 2.16. Figure 2.16(a) is the top image corresponding to the left and right images in Figure 2.3(a, b), respectively. Figure 2.16(b) is the gradient image of the left image. Figure 2.16(c, d) represents the near horizontal gradient direction image and the near vertical gradient direction image, respectively. Figure 2.16(e, f) represents disparity maps obtained by matching with the horizontal image pair and the vertical image pair, respectively. Figure 2.16(g) is the complete disparity image obtained by combining Figure 2.16(e, f), respectively. Figure 2.16(h) is its corresponding 3-D plot. Comparing Figure 2.16(g) and 2.16(h), with Figure 2.3(c) and 2.3(d), respectively, the mismatching has been greatly reduced with the orthogonal trinocular matching. ◻∘ 2.5.2.2 Some Discussions on Mask Size Two types of masks have been used in the above procedure. One is the gradient mask used to compute the gradient-direction, another is the matching (searching) mask

Figure 2.16: A real example of reducing the mismatching caused by smooth regions with orthogonal trinocular matching.

Figure 2.17: Illustration of the influence of gradient masks.

used to compute the correlation between gray-level regions. The sizes of both gradient masks and matching masks influence the matching performance, Jia (1998). The influence of gradient masks can be explained with the help of Figure 2.17, in which two regions (with vertices A, B, C, and with vertices B, C, D, E) of different gray levels are given. Suppose that the point that needs to be matched is P, which is located near the horizontal edge segment BC. If the gradient mask is too small, as is the case of the square in Figure 2.17(a), the horizontal and vertical regions are hardly

Figure 2.18: Illustration of the influence of matching masks.

separated as Gh ≈ Gv . It is thus possible that the matching at P will be carried out with the horizontal image pair, for instance, and the mismatching could occur due to the smoothness in the horizontal direction. On the other hand, if the gradient mask is big enough, as is the case of the square in Figure 2.17(b), the vertical image pair will be selected and used for the matching at P, and the error matching will be avoided. The size of the matching mask also has great influence on the performance of the matching. Big masks can contain enough variation for matching to reduce the mismatching, but big masks can also produce big smoothness. The following two cases should be distinguished. (1) Matching is around the boundary of texture regions and smooth regions, as shown in Figure 2.18(a). If the mask is small and can only cover smooth regions, the matching will have some randomness. If the mask is big and covers both two types of regions, correct matching can be achieved by selecting suitable matching images. (2) Matching is around the boundary of two texture regions, as shown in Figure 2.18(b). As the mask is always inside the texture regions, the correct matching can be achieved no matter what the size of the mask is.

Figure 2.19: Results of orthogonal matching of multiple imaging.


Example 2.6 Results of orthogonal matching of multiple imaging.
In addition to the orthogonal trinocular images, one more image along the horizontal direction and one more image along the vertical direction are used (i.e., multiple imaging along both directions) to give the disparity map shown in Figure 2.19(a). Figure 2.19(b) is its 3-D plot. The results here are even better than those shown in Figure 2.16. ◻∘

2.6 Computing Subpixel-Level Disparity In a number of cases, pixel-level disparity obtained by normal stereo-matching algorithms is not precise enough for certain measurements. The computation of the subpixel-level disparity is thus needed. In the following, an adaptive algorithm based on local variation patterns of the image intensity and disparity is introduced, which could provide subpixel-level precision of the disparity, Kanade (1994). Consider a statistical distribution model for first-order partial differential of the image intensity and disparity used for stereo matching, Okutomi (1992). Suppose that images fL (x, y) and fR (x, y) are the left and right images of an intensity function f (x, y), respectively. The correct disparity function between fL (x, y) and fR (x, y) is dr (x, y), which is expressed by fR (x, y) = fL [x + dr (x, y) , y] + nL (x, y)

(2.48)

where n_L(x, y) is the Gaussian noise satisfying N(0, \sigma_n^2). Suppose that two matching windows W_L and W_R are placed at the correct matching position in the left and right images, respectively. In other words, W_R is placed at pixel (0, 0) in the right image f_R(x, y), while W_L is placed at pixel [d_r(0, 0), 0] in the left image f_L(x, y). If the disparity values in the matching windows are constants, that is, d_r(u, v) = d_r(0, 0), then f_R(u, v) should be equal to f_L[u + d_r(0, 0), v] if no noise influence exists. In real situations, d_r(u, v) in the matching window is a variable. Expanding f_L[u + d_r(u, v), v] at d_r(0, 0) gives

f_L[u + d_r(u, v), v] \approx f_L[u + d_r(0, 0), v] + [d_r(u, v) - d_r(0, 0)]\,\frac{\partial}{\partial u} f_L[u + d_r(0, 0), v] + n_L(u, v)    (2.49)

Substituting eq. (2.49) into eq. (2.48) yields

f_R(u, v) - f_L[u + d_r(0, 0), v] \approx [d_r(u, v) - d_r(0, 0)]\,\frac{\partial}{\partial u} f_L[u + d_r(0, 0), v] + n_L(u, v)    (2.50)


Suppose that the disparity d_r(u, v) in the matching windows satisfies the following statistical distribution model, Kanade (1991)

d_r(u, v) - d_r(0, 0) \sim N\!\left(0, k_d \sqrt{u^2 + v^2}\right)    (2.51)

where \sim means satisfying a distribution and k_d is a constant. This model indicates that the expectation of the disparity at (u, v) equals the expectation of the disparity at the center of the window (0, 0), but the variance of their disparity difference increases with the distance from (u, v) to (0, 0). Suppose further that the first-order partial differential of the image intensity at (u, v) in f_L(x, y) satisfies the following statistical model

\frac{\partial}{\partial u} f_L(u, v) \sim N(0, k_f)    (2.52)

where k_f is a constant that denotes the fluctuation of the image intensity along a local direction (see below). This model indicates that the expectation of the image intensity at (u, v) equals the expectation of the image intensity at (0, 0). The uncertainty of this assumption increases with the distance between (u, v) and (0, 0). According to the above two statistical models and the assumption that the first-order partial differentials of the image intensity and disparity are statistically independent, it can be proven that the statistical distribution of the image intensity difference between the pair of stereo images,

n_S(u, v) = f_R(u, v) - f_L[u + d_r(0, 0), v]    (2.53)

can be considered as a distribution of Gaussian noise, Kanade (1991),

n_S(u, v) \sim N\!\left(0, 2\sigma_n^2 + k_f k_d \sqrt{u^2 + v^2}\right)    (2.54)

where

k_f = E\left\{\left[\frac{\partial}{\partial u} f_L[u + d_r(0, 0), v]\right]^2\right\}    (2.55)

From eq. (2.54), it can be seen that the fluctuation of the first-order partial differential of f_L(x, y) and the disparity d_r(u, v) in the matching windows form a combined noise n_S(u, v) with the image noise n_L(u, v). This combined noise satisfies a zero-mean Gaussian distribution. Its variance consists of two parts: one is a constant 2\sigma_n^2, coming from the image noise; the other is a variable proportional to \sqrt{u^2 + v^2}, coming from the local uncertainty in the matching windows. Such an uncertainty can be described by an additional noise whose energy is proportional to the distance between the center pixel and the surrounding pixels. When the disparity in the window is a constant


(k_d = 0), this additional noise is zero. The stronger the fluctuation in the matching window, the more uncertain the contribution of the surrounding pixels. Suppose that d_0(x, y) is an initial estimate of the correct disparity d_r(x, y). Expanding f_L[u + d_r(0, 0), v] at u + d_0(0, 0) gives

f_L[u + d_r(0, 0), v] = f_L[u + d_0(0, 0), v] + B_d\,\frac{\partial}{\partial u} f_L[u + d_0(0, 0), v]    (2.56)

Substituting eq. (2.56) into eq. (2.53) yields

n_S(u, v) = f_R(u, v) - f_L[u + d_0(0, 0), v] - B_d\,\frac{\partial}{\partial u} f_L[u + d_0(0, 0), v]    (2.57)

where Bd = dr (0, 0) – d0 (0, 0) is the amended value for the disparity, which needs to be estimated. It can be proven that the conditional probability density of Bd satisfies the Gaussian distribution, so it can be computed in the following steps. (1) Obtain an initial disparity value first by using any pixel level stereo-matching algorithm. (2) For each pixel, where the subpixel level disparity is to be estimated, select the window for the disparity estimation with the minimum uncertainty and compute the amended value for the disparity. (3) Stop the computation of the amended value for the disparity when it converges or attains a pre-defined iteration number. Such a procedure for computing the subpixel-level disparity can also be used for multiple imaging or orthogonal trinocular vision cases. Example 2.7 Illustration of the subpixel-level disparity. One illustration showing the results of subpixel-level disparity is given in Figure 2.20. Figure 2.20(a–c) represents the left, right, and top images of an orthogonal trinocular system for a square pyramid. To facilitate the matching, a layer of texture is covered on its surface. Figure 2.20(d) is the pixel-level disparity map obtained with the orthogonal trinocular matching, Figure 2.20(e) is the subpixel-level disparity map obtained with the binocular matching. Figure 2.20(f) is the subpixel-level disparity map obtained with the orthogonal trinocular matching. Figure 2.20(g–i) are the 3-D plots of Figure 2.20(d–f), respectively. Both the subpixel-level disparity maps obtained with the binocular and orthogonal trinocular matching have higher precision than that of the pixel-level disparity map obtained with orthogonal trinocular matching. By comparing Figure 2.20(e, f), it can be seen that the binocular matching with the subpixel-level disparity produces some mismatching along the diagonal direction, while the trinocular matching with subpixel-level disparity has no such problem. ◻∘
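As a much simpler, commonly used alternative for pushing below pixel precision (this is not the adaptive method of this section, just an illustration of the general idea), the SSD values around the integer minimum can be fitted with a parabola and the sub-pixel offset read off the vertex:

def parabolic_subpixel(ssd_minus, ssd_zero, ssd_plus):
    # Fit a parabola through the SSD values at d-1, d, d+1 and return the
    # sub-pixel offset of its vertex relative to the integer disparity d.
    denom = ssd_minus - 2.0 * ssd_zero + ssd_plus
    if denom == 0:
        return 0.0
    return 0.5 * (ssd_minus - ssd_plus) / denom

# Example: SSD samples 10.0, 2.0, 6.0 at d-1, d, d+1 give an offset of about +0.17 pixel.

The method described above goes further: for each pixel it selects the estimation window with the minimum uncertainty under the statistical model of eq. (2.54) and iterates the correction B_d, rather than relying on a fixed local fit.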

Figure 2.20: Illustration of the subpixel-level disparity.

Table 2.1: Results of volume computation with various methods.

                   Real Volume   Pixel Level   Subpixel Level (binocular)   Subpixel Level (orthogonal trinocular)
Volume (m³)        2.304         3.874         2.834                        2.467
Relative error     —             68%           23%                          7%

Example 2.8 Disparity computation and volume measurement. Suppose that the side length of the square is 2.4 m and the height of the pyramid is 1.2 m. The results of the volume computation with the disparity maps obtained by using the above three matching methods are listed in Table 2.1, Jia (2000c). The influence of the precision for the disparity computation on the precision of measurement is evident. ◻∘


Figure 2.21: Real computation example of subpixel-level disparity.

Example 2.9 Real computational examples of subpixel-level disparity.
The original images used for the computation in Figure 2.21 are those shown in Figure 2.3(a, b) and Figure 2.16(a). Figure 2.21(a) is the pixel-level disparity map obtained with the orthogonal trinocular matching algorithm. Figure 2.21(b) is the subpixel-level disparity map obtained with the binocular matching algorithm. Figure 2.21(c) is the subpixel-level disparity map obtained with the orthogonal trinocular matching algorithm. ◻∘

2.7 Error Detection and Correction There are a number of error sources in the computation of disparity maps, such as the existence of periodic patterns or smooth regions, various occlusions, different constraints and so on. In the following, a general and fast error detection and correction algorithm for the disparity map is introduced, Jia (2000b). It can directly process disparity maps, independent of any matching algorithm. Its computational complexity is just proportional to the number of mismatching pixels.

2.7.1 Error Detection
According to the ordering constraint discussed in Section 2.3, the concept of the ordering match constraint can be introduced. Suppose that fL(x, y) and fR(x, y) are a pair of images, and OL and OR are their imaging centers, respectively. As shown in Figure 2.22, P and Q are two nonoverlapping points in space, PL and QL are the respective projections of P and Q on fL(x, y), and PR and QR are the respective projections of P and Q on fR(x, y). Denoting the X coordinate by X(∙), it can be seen from Figure 2.22 that in correct matching X(PL) ≤ X(QL) and X(PR) ≤ X(QR) when X(P) < X(Q), or X(QL) ≤ X(PL) and X(QR) ≤ X(PR) when X(P) > X(Q). Denoting implication by ⇒, the following conditions are satisfied

X(P_L) \le X(Q_L) \Rightarrow X(P_R) \le X(Q_R)
X(P_L) \ge X(Q_L) \Rightarrow X(P_R) \ge X(Q_R)    (2.58)

Figure 2.22: Illustration of the ordering match constraint.

If these conditions hold, PR and QR are said to satisfy the ordering match constraint; otherwise they are crossing. Using the ordering match constraint, crossing match regions can be detected. Let PR = fR(i, j) and QR = fR(k, j) be two pixels in the jth line of fR(x, y), and let their matching points in fL(x, y) be denoted PL = fL(i + d(i, j), j) and QL = fL(k + d(k, j), j), respectively. Let C(PR, QR) be the cross label between PR and QR. If eq. (2.58) holds, C(PR, QR) = 0; otherwise, C(PR, QR) = 1. The cross number Nc corresponding to a pixel PR is defined as

N_c(i, j) = \sum_{k=0,\, k \ne i}^{N-1} C(P_R, Q_R)    (2.59)

where N is the number of pixels in the jth line.

2.7.2 Error Correction
Calling the regions with nonzero cross numbers cross-regions, the mismatching error in the cross-regions can be corrected by the following algorithm. Suppose that {fR(i, j) | i ⊆ [p, q]} is the cross-region corresponding to PR; the total cross number in the cross-region, Ntc, is

N_{tc}(i, j) = \sum_{i=p}^{q} N_c(i, j)    (2.60)

The procedure for correcting mismatching points in cross-regions has the following steps:
(1) Find the pixel fR(l, j) with the maximum cross number. Here,

l = \max_{i \subseteq [p, q]} [N_c(i, j)]    (2.61)

(2) Determine the search range {fL(i, j) | i ⊆ [s, t]} for the matching point fR(k, j), where

s = p - 1 + d(p - 1, j), \quad t = q + 1 + d(q + 1, j)    (2.62)

(3) Search for a new matching point that can reduce the total cross number Ntc in the above range.
(4) Use the new matching point to correct d(k, j), eliminating the mismatching corresponding to the pixel with the currently maximum cross number.

The above procedure can be iterated. Once a mismatching pixel is corrected, the procedure can be applied to the rest of the mismatching pixels. After correcting d(k, j), the new Nc(i, j) in the cross-region can be calculated by using eq. (2.59), and a new Ntc can be obtained. The procedure is repeated until Ntc = 0. Since the criterion used in this algorithm is to make Ntc = 0, the algorithm is called the zero-cross correction algorithm.

Example 2.10 Matching error detection and removing.
Suppose that the computed disparity values for the region [153, 163] in the jth line of the image are listed in Table 2.2. The distribution of the match points before correction is shown in Figure 2.23. According to the corresponding relation between fL(x, y) and fR(x, y), it is known that the points in [160, 162] are the mismatching points. Following eq. (2.59), Table 2.3 gives the computed cross numbers.

Table 2.2: Disparity in cross-regions.

i         153   154   155   156   157   158   159   160   161   162   163
d(i, j)    28    28    28    27    28    27    27    21    21    21    27

Figure 2.23: The distribution of match points in cross-region before correction.

Table 2.3: Cross numbers in region [153, 163].

i         153   154   155   156   157   158   159   160   161   162   163
Nc          0     1     2     2     3     3     3     6     5     3     0

According to Table 2.3, [fR(154, j), fR(162, j)] is a cross-region. Following eq. (2.60), Ntc = 28 can be determined. Following eq. (2.61), the pixel with the maximum cross number is fR(160, j). Following eq. (2.62), the search range for the new matching point of fR(160, j) is {fL(i, j) | i ⊆ [181, 190]}.

Figure 2.24: The distribution of the match points in cross-region after correction.

Searching new matching points that correspond to

fR(160, j) and that can reduce Ntc gives fL(187, j) in this range. Adjust the disparity value d(160, j) corresponding to fR(160, j) to a new value d(160, j) = X[fL(187, j)] – X[fR(160, j)] = 27. The above procedure can be repeated until Ntc = 0 for the whole region. The distribution of the match points after the correction is shown in Figure 2.24, in which all mismatching in the region [160, 162] has been removed. ◻∘

Example 2.11 Real error detection and removing.
One real example of error detection and removing is shown in Figure 2.25. The stereo-matching images used are those of Figure 2.3(a, b). Only part of the images is shown here, as Figure 2.25(a). Figure 2.25(b) is the disparity map obtained with stereo matching. Figure 2.25(c) is the result after further correction. Comparing Figure 2.25(b) and (c), it is easy to see that many mismatching points (black and white points on the gray background) in the original disparity map have been removed. Such a procedure for correcting errors in the disparity map is also applicable for the cases of multiple imaging or orthogonal trinocular vision. ◻∘

Figure 2.25: Real error detection and removing.
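The cross-number computation of eq. (2.59) can be checked with a few lines of Python against the data of Example 2.10 (an illustrative sketch, not the book's implementation):

def cross_numbers(cols, disparity):
    # cols: x coordinates of pixels in the jth row of the right image.
    # disparity: d(i, j) for those pixels; the matching x in the left image is i + d(i, j).
    left = [i + d for i, d in zip(cols, disparity)]
    nc = []
    for a in range(len(cols)):
        count = 0
        for b in range(len(cols)):
            if a == b:
                continue
            # A cross occurs when the left-image order inverts the right-image order.
            if (cols[a] - cols[b]) * (left[a] - left[b]) < 0:
                count += 1
        nc.append(count)
    return nc

cols = list(range(153, 164))
disparity = [28, 28, 28, 27, 28, 27, 27, 21, 21, 21, 27]     # Table 2.2
nc = cross_numbers(cols, disparity)
print(nc)             # [0, 1, 2, 2, 3, 3, 3, 6, 5, 3, 0], matching Table 2.3
print(sum(nc[1:10]))  # total cross number Ntc = 28 over the cross-region [154, 162]

Running the sketch reproduces the cross numbers of Table 2.3 and the total Ntc = 28 used in the example.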


2.8 Problems and Questions
2-1* In Figure 2.11 of Volume I, suppose that λ = 0.05 m and B = 0.4 m. Obtain the disparity values for two cases: the point W is at (1, 0, 2 m) and at (2, 0, 3 m). If the disparity is 0.02 m, what is the corresponding Z?
2-2 A given stereo vision system has λ = 0.05 m and B = 0.2 m. (1) Draw the function curve of the disparity D for the object distance Z. (2) If the system resolution is 0.1 line/mm, what is the effective measuring distance?
2-3 Show that the correlation coefficient of eq. (2.4) has values in the interval (–1, 1).
2-4 Give some real examples for point features, line features and blob features. Discuss their functions in stereo matching.
2-5 Compare the two feature point-matching methods in Section 2.3. What are their differences in computation time and robustness?
2-6 Prove eq. (2.12) by coordinate transforms.
2-7 When using the method of horizontal multiple stereo matching, what would be the advantage of making the different baselines in an integer proportion?
2-8 In Example 2.2, in addition to baselines B1 and B2, add a third baseline B3. Suppose that B3 = 2B1. Draw the curves E[St3(x; t)] and E[S^(S)_t(123)(x; t)].
2-9* In Figure Problem 2-9, Bh = Bv. If a more accurate measurement for the depth of the point P is required, which pair of images should be used: the left image L and the right image R, or the left image L and the top image T?

Figure Problem 2-9
2-10 Consider Figure 2.12. Compare the changes produced by moving the camera in the XY plane and along the Z-axis. In what circumstances would moving the camera in the XY plane be more suitable for obtaining an accurate distance measurement?
2-11 Compare the accuracy of the disparity obtained by using the following two processes: use subpixel edge detection to find the boundary of an object and then perform stereo matching, or directly use the subpixel disparity computation.
2-12 Implement the algorithm for error detection and correction in Section 2.7, and verify the algorithm with the help of the data in Table 2.2.


2.9 Further Reading
1. Modules of Stereo Vision
   – The introduction of stereo vision can be found in many computer vision textbooks; for example, Haralick (1992, 1993), Jähne (2000), Shapiro (2001), and Forsyth (2003).
   – One example of using stereo vision for scene understanding can be found in Franke (2000).
   – Using models to improve stereo matching can be found in Maitre (1992).
2. Region-Based Binocular Matching
   – Many matching techniques based on regions have been proposed, and a comparison can be found in Scharstein (2002).
   – One assumption in correlation-based matching is that the observed surface is parallel to the two projection planes. When this assumption is not valid, the strategy discussed in Devernay (1994) can be used.
3. Feature-Based Binocular Matching
   – Besides the types of feature points mentioned in the text, other features, such as straight lines, Medioni (1985), moments, Lee (1990), and boundary junctions, can be used.
   – An O(1) disparity refinement method based on belief aggregation and belief propagation is proposed in Huang (2016).
4. Horizontal Multiple Stereo Matching
   – Further discussion on multiple stereo matching can be found in Kim (2002).
5. Orthogonal Trinocular Matching
   – Further discussion on orthogonal trinocular matching with experiments can be found in Forsyth (2003).
   – Multiple stereo matching and orthogonal trinocular matching can also be combined, Jia (2000a).
   – One multi-view stereo reconstruction work can be found in Zhu (2011a).
6. Computing Subpixel-Level Disparity
   – The principle behind the algorithm in Section 2.6 can be found in Okutomi (1993).
   – Different discussions on subpixels can be found in Zhang (2001b).
   – Other influencing factors on depth estimation can be found in Zhao (1996).
7. Error Detection and Correction
   – A comprehensive discussion on matching error detection, correction, and evaluation can be found in Mohan (1989).
   – The error in the disparity map causes an error in the depth estimation. The influence of camera calibration on the depth estimation can be found in Zhao (1996).

3 3-D Shape Information Recovery
Stereo vision, introduced in the previous chapter, requires determining which point in one image corresponds to a given point in the other image. This is often a difficult problem in real situations. To avoid this problem, various techniques using different 3-D cues for information recovery based on a single camera (whose position is fixed, though several images may be taken) have been proposed, Pizlo (1992). Among the intrinsic properties of a scene, the shape of a 3-D object is the most important. Using a single camera to recover the 3-D shape information is often called "shape from X," in which X can represent the illumination change, shading, contour, texture, motion, and so on. Some early important work of this kind was conducted in the 1970s, Marr (1982). The sections of this chapter are arranged as follows: Section 3.1 introduces a photometric stereo method for determining the orientation of the surface in a scene using a series of images with the same viewing angle but different illumination. Section 3.2 discusses the principle and technique of acquiring the surface orientation of a moving object by detecting and calculating the optical flow field. Section 3.3 discusses how to reconstruct the surface shape of the object according to the different image tones produced by the spatial variation of the brightness on the surface. Section 3.4 presents the principles of three techniques for restoring the surface orientation based on the change (distortion) of the surface texture elements after the imaging projection. Section 3.5 describes the relationship of object depth with the focal distance changes caused by focusing on objects at different distances. This indicates that the object distance can be determined from the focal length that makes the object sharp. Section 3.6 introduces a method for computing the geometry and pose of a 3-D object using the coordinates of three points in an image under the condition that the 3-D scene model and the focal length of the camera are known.

3.1 Photometric Stereo

Photometric stereo is an important method for recovering surface orientation. This method requires a set of images, which are taken from the same viewing angle but with different lighting conditions. The method is easy to implement, but requires the control of lighting in the application environment.


3.1.1 Scene Radiance and Image Irradiance

Scene radiance and image irradiance are related but different concepts. The former is the power emitted per unit solid angle per unit area of the surface of the light source, with the unit W·m⁻²·sr⁻¹. The latter is the power falling per unit area of a surface (here, the image plane), with the unit W·m⁻².

3.1.1.1 The Relationship Between Scene Radiance and Image Irradiance
Now consider the relationship between the radiance at a point on an object in the scene and the irradiance at the corresponding point in the image, Horn (1986). As shown in Figure 3.1, a lens of diameter d is located at a distance λ from the image plane. Let a patch on the surface of the object have an area δO, while the corresponding image patch has an area δI. It is supposed that the ray from the object patch to the center of the lens makes an angle α with the optical axis and that there is an angle θ between this ray and the surface normal. The object patch is at a distance z from the lens (–z means that the direction from the lens to the object points to –Z), measured along the optical axis.

Figure 3.1: An object patch and the corresponding image patch.

The solid angle of the cone of rays leading to the patch on the object is equivalent to the solid angle of the cone of rays leading to the corresponding patch in the image. The apparent area of the image patch as seen from the center of the lens is δI cos α. The distance of this patch from the center of the lens is λ/cos α, so the solid angle subtended by this patch is δI cos α/(λ/cos α)². The solid angle of the patch on the object as seen from the center of the lens is δO cos θ/(z/cos α)². The equivalence of the two solid angles gives

\frac{\delta O}{\delta I} = \frac{\cos\alpha}{\cos\theta}\left(\frac{z}{\lambda}\right)^2   (3.1)

Since the lens area is π(d/2)², the solid angle subtended by the lens as seen from the object patch is

\Omega = \frac{1}{4}\frac{\pi d^2}{(z/\cos\alpha)^2}\cos\alpha = \frac{\pi}{4}\left(\frac{d}{z}\right)^2\cos^3\alpha   (3.2)


The power of the light originating from the patch and passing through the lens is

\delta P = L\,\delta O\,\Omega\cos\theta = L\,\delta O\,\frac{\pi}{4}\left(\frac{d}{z}\right)^2\cos^3\alpha\,\cos\theta   (3.3)

where L is the radiance of the surface in the direction toward the lens. Since no light from other areas reaches this image patch, the irradiance of the image at the patch is

E = \frac{\delta P}{\delta I} = L\,\frac{\delta O}{\delta I}\,\frac{\pi}{4}\left(\frac{d}{z}\right)^2\cos^3\alpha\,\cos\theta   (3.4)

Taking eq. (3.1) into eq. (3.4) yields

E = L\,\frac{\pi}{4}\left(\frac{d}{\lambda}\right)^2\cos^4\alpha   (3.5)

The image irradiance E is proportional to the scene radiance L and to the diameter of the lens d, and is inversely proportional to the distance λ from the lens to the image plane. When the object is fixed, L can be considered a constant, and so can λ and d. If the camera rotates, the change of the irradiance is reflected in the angle α. Factoring the difference of fourth powers and using cos α − cos(α + Δα) = 2 sin(α + Δα/2) sin(Δα/2) gives

\Delta E = L\,\frac{\pi}{4}\left(\frac{d}{\lambda}\right)^2\left[\cos^4(\alpha+\Delta\alpha) - \cos^4\alpha\right] = -2L\,\frac{\pi}{4}\left(\frac{d}{\lambda}\right)^2\left[\cos^2\alpha + \cos^2(\alpha+\Delta\alpha)\right]\left[\cos\alpha + \cos(\alpha+\Delta\alpha)\right]\sin\!\left(\alpha+\frac{\Delta\alpha}{2}\right)\sin\frac{\Delta\alpha}{2}   (3.6)

As Δα is quite small, the above equation becomes

\Delta E \approx -L\,\frac{\pi}{4}\left(\frac{d}{\lambda}\right)^2\left[4\cos^2\alpha\,\cos\alpha\,\sin\alpha\right]\Delta\alpha = -L\,\pi\left(\frac{d}{\lambda}\right)^2\cos^2\alpha\,\sin(2\alpha)\,\frac{\Delta\alpha}{2}   (3.7)

The angular velocity of the camera is therefore

v = \frac{\Delta\alpha}{\Delta t} = \frac{\Delta E/\Delta t}{-L\,\pi\,(d/\lambda)^2\cos^2\alpha\,\sin(2\alpha)/2}   (3.8)

which is proportional to the change rate of the irradiance, ΔE/Δt.
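To make the fall-off in eq. (3.5) concrete, the following minimal Python sketch evaluates the image irradiance for one set of made-up imaging parameters; none of the numbers come from the text, they are only placeholders for illustration.

```python
import numpy as np

# Made-up imaging parameters, only to illustrate eq. (3.5)
L = 100.0                    # scene radiance, W m^-2 sr^-1
d = 0.01                     # lens diameter, m
lam = 0.05                   # distance from the lens to the image plane, m
alpha = np.deg2rad(10.0)     # angle between the ray and the optical axis

E = L * (np.pi / 4.0) * (d / lam) ** 2 * np.cos(alpha) ** 4
print(f"image irradiance E = {E:.2f} W/m^2")

# The cos^4(alpha) factor is the off-axis fall-off: the same scene radiance
# produces less irradiance away from the optical axis.
print(f"fall-off relative to the axis: {np.cos(alpha) ** 4:.4f}")
```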

Figure 3.2: Angle θ and angle φ.

Figure 3.3: Direction (θi, φi) and direction (θe, φe).

3.1.1.2 Bi-directional Reflectance Distribution Function
Consider the coordinate system shown in Figure 3.2, where N is the normal vector of the surface patch, OR is an arbitrary reference line, and the direction of a ray I can be represented by an angle θ (the polar angle, between I and N) and an angle φ (the azimuth, between the perpendicular projection of the ray onto the surface and OR). Using Figure 3.2, a ray falling on the surface can be specified by the direction (θi, φi) and a ray toward the viewer can be specified by the direction (θe, φe), as shown in Figure 3.3. The bi-directional reflectance distribution function (BRDF) can be denoted f(θi, φi; θe, φe). This function represents how bright a surface appears when it is viewed from one direction while the light falls on it from another. It is the ratio of the radiance δL(θe, φe) to the irradiance δE(θi, φi), given by

f(\theta_i,\varphi_i;\theta_e,\varphi_e) = \frac{\delta L(\theta_e,\varphi_e)}{\delta E(\theta_i,\varphi_i)}   (3.9)

Consider now the extended source (see Section 2.3 of Volume I of this book set). In Figure 3.4, the solid angle corresponding to a patch of size δθi in the polar angle and size δφi in the azimuth angle is δω = sin θi δθi δφi. Denoting E₀(θi, φi) as the radiance per unit solid angle coming from the direction (θi, φi), the radiance from the patch is E₀(θi, φi) sin θi δθi δφi and the total irradiance of the surface is

E = \int_{-\pi}^{\pi}\int_{0}^{\pi/2} E_0(\theta_i,\varphi_i)\sin\theta_i\cos\theta_i\,d\theta_i\,d\varphi_i   (3.10)

Figure 3.4: Illustration of the integration of the extended light source.

To obtain the radiance for the whole surface, the products of the BRDF and the irradiance from all possible directions are integrated. This gives the radiance in the direction toward the viewer:

L(\theta_e,\varphi_e) = \int_{-\pi}^{\pi}\int_{0}^{\pi/2} f(\theta_i,\varphi_i;\theta_e,\varphi_e)\,E_0(\theta_i,\varphi_i)\sin\theta_i\cos\theta_i\,d\theta_i\,d\varphi_i   (3.11)

3.1.2 Surface Reflectance Properties

Consider two extreme cases: an ideal scattering surface and an ideal specular reflection surface, Horn (1986).

An ideal scattering surface is also called a Lambertian surface. It appears equally bright from all viewing directions and reflects all incident light. From this definition, its BRDF must be a constant. Integrating the radiance of the surface over all emittance directions and equating the result to the total irradiance gives

\int_{-\pi}^{\pi}\int_{0}^{\pi/2} f(\theta_i,\varphi_i;\theta_e,\varphi_e)\,E(\theta_i,\varphi_i)\cos\theta_i\,\sin\theta_e\cos\theta_e\,d\theta_e\,d\varphi_e = E(\theta_i,\varphi_i)\cos\theta_i   (3.12)

where the factor cos θi converts the irradiance to the direction of N. The BRDF of the ideal scattering surface is therefore

f(\theta_i,\varphi_i;\theta_e,\varphi_e) = 1/\pi   (3.13)

For an ideal scattering surface, the radiance L and the irradiance E thus have the relation

L = E/\pi   (3.14)

Suppose that an ideal Lambertian surface is illuminated by a point source of radiance E, which is located in the direction (θs, φs) and has the radiance

E(\theta_i,\varphi_i) = E\,\frac{\delta(\theta_i-\theta_s)\,\delta(\varphi_i-\varphi_s)}{\sin\theta_i}   (3.15)

It is then derived from eq. (3.13) that

L = \frac{1}{\pi}E\cos\theta_i \qquad \theta_i \ge 0   (3.16)

Equation (3.16) is called Lambert's law.


Figure 3.5: Illustration of a specular reflection surface.

If an ideal Lambertian surface is under a uniform radiance E, then

L = \int_{-\pi}^{\pi}\int_{0}^{\pi/2} \frac{E}{\pi}\sin\theta_i\cos\theta_i\,d\theta_i\,d\varphi_i = E   (3.17)

Equation (3.17) shows that the radiance of the patch is the same as that of the source.

An ideal specular reflection surface reflects all the light arriving from the direction (θi, φi) into the direction (θe, φe), as shown in Figure 3.5. The BRDF is proportional to the product of the two impulses δ(θe − θi) and δ(φe − φi − π) with a factor k. The factor is obtained by requiring the integral over all emittance directions of the surface to equal one:

\int_{-\pi}^{\pi}\int_{0}^{\pi/2} k\,\delta(\theta_e-\theta_i)\,\delta(\varphi_e-\varphi_i-\pi)\sin\theta_e\cos\theta_e\,d\theta_e\,d\varphi_e = k\sin\theta_i\cos\theta_i = 1   (3.18)

The BRDF is then

f(\theta_i,\varphi_i;\theta_e,\varphi_e) = \frac{\delta(\theta_e-\theta_i)\,\delta(\varphi_e-\varphi_i-\pi)}{\sin\theta_i\cos\theta_i}   (3.19)

For an extended source, taking eq. (3.19) into eq. (3.11) yields

L(\theta_e,\varphi_e) = \int_{-\pi}^{\pi}\int_{0}^{\pi/2} \frac{\delta(\theta_e-\theta_i)\,\delta(\varphi_e-\varphi_i-\pi)}{\sin\theta_i\cos\theta_i}\,E(\theta_i,\varphi_i)\sin\theta_i\cos\theta_i\,d\theta_i\,d\varphi_i = E(\theta_e,\varphi_e-\pi)   (3.20)

3.1.3 Surface Orientation

A smooth surface has a tangent plane at every point, and the orientation of this tangent plane can be used to represent the orientation of the surface at that point. The surface normal can be used to specify the orientation of this plane. In practice, the coordinate system is chosen such that one axis is lined up with the optical axis of the imaging system


while the other two axes are parallel to the image plane, as shown in Figure 3.6. A surface is then described in terms of its perpendicular distance –z from some reference plane parallel to the image plane.

Figure 3.6: Using the distance –z to describe a surface.

Figure 3.7: Surface orientation and first partial derivatives.

The surface normal can be found by taking the cross-product of any two (nonparallel) lines in the tangent plane, as shown in Figure 3.7. Taking a small step δx in the x-direction from a given point (x, y), the change in z is δz = δx × ∂z/∂x + e, where e represents a higher-order term. Denote the first-order partial derivatives of z with respect to x and y by p and q, respectively. Looking at Figure 3.7, the change along the z-direction is pδx when a step δx is taken in the x-direction, and the change along the z-direction is qδy when a step δy is taken in the y-direction. The former step can be written as [δx 0 pδx]ᵀ, which is parallel to r_x = [1 0 p]ᵀ, and the latter can be written as [0 δy qδy]ᵀ, which is parallel to r_y = [0 1 q]ᵀ. The surface normal (pointing toward the viewer) is then given by

N = r_x \times r_y = [1\ 0\ p]^T \times [0\ 1\ q]^T = [-p\ -q\ 1]^T   (3.21)

A unit normal vector is obtained by

\hat N = \frac{N}{|N|} = \frac{[-p\ -q\ 1]^T}{\sqrt{1+p^2+q^2}}   (3.22)

The angle θe between the surface normal and the direction to the lens is obtained by taking the dot-product of the (unit) surface normal vector and the (unit) view vector [0 0 1]ᵀ (from the object to the lens):

\hat N \cdot \hat V = \cos\theta_e = \frac{1}{\sqrt{1+p^2+q^2}}   (3.23)
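As a quick numerical illustration of eqs (3.22) and (3.23), the short Python sketch below computes the unit normal and cos θe from a gradient pair (p, q). The function names are chosen here only for illustration and are not part of any established library.

```python
import numpy as np

def surface_normal(p: float, q: float) -> np.ndarray:
    """Unit surface normal of eq. (3.22), pointing toward the viewer."""
    n = np.array([-p, -q, 1.0])
    return n / np.linalg.norm(n)

def cos_emittance(p: float, q: float) -> float:
    """cos(theta_e) of eq. (3.23): angle between the normal and the view vector [0, 0, 1]^T."""
    return 1.0 / np.sqrt(1.0 + p * p + q * q)

# A patch tilted only in the x-direction
p, q = 0.5, 0.0
print(surface_normal(p, q))   # approximately [-0.447, 0.000, 0.894]
print(cos_emittance(p, q))    # approximately 0.894, i.e. the z-component of the unit normal
```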

3.1.4 Reflectance Map and Image Irradiance Equation

Now consider the link between the pixel gray scale (image brightness) and the pixel gradient (surface orientation).

3.1.4.1 Reflectance Map
When a point source of radiance E illuminates a Lambertian surface, the scene radiance is

L = \frac{1}{\pi}E\cos\theta_i \qquad \theta_i \ge 0   (3.24)

where θi is the angle between the surface normal vector [–p –q 1]ᵀ and the direction vector toward the source [–p_s –q_s 1]ᵀ. The dot-product gives

\cos\theta_i = \frac{1 + p_s p + q_s q}{\sqrt{1+p^2+q^2}\,\sqrt{1+p_s^2+q_s^2}}   (3.25)

Taking eq. (3.25) into eq. (3.24) gives the relation between the object brightness and the surface orientation. Such a relation function is called the reflectance map and is denoted R(p, q). For a Lambertian surface illuminated by a single distant point source, the reflectance map is

R(p,q) = \frac{1 + p_s p + q_s q}{\sqrt{1+p^2+q^2}\,\sqrt{1+p_s^2+q_s^2}}   (3.26)

The reflectance map can be represented by a contour map in the gradient space. According to eq. (3.26), contours of constant brightness are nested conic sections in the PQ plane, since R(p, q) = c implies (1 + p_s p + q_s q)² = c²(1 + p² + q²)(1 + p_s² + q_s²). The maximum of R(p, q) is achieved at (p_s, q_s).

Example 3.1 Illustration of the reflectance map of the Lambertian surface.
Three examples of reflectance maps of the Lambertian surface are shown in Figure 3.8. Figure 3.8(a) shows the case of p_s = 0, q_s = 0 (nested circles). Figure 3.8(b) shows the case of p_s ≠ 0, q_s = 0 (ellipses or hyperbolas). Figure 3.8(c) shows the case of p_s ≠ 0, q_s ≠ 0 (hyperbolas). ◻∘

Figure 3.8: Some reflectance maps of the Lambertian surface.

Different from the Lambertian surface, a surface that emits radiation equally in all directions can be called a homogeneous emission surface. Such a surface appears brighter when viewed obliquely. This is because the visible surface area is reduced, so the same radiated power comes from a foreshortened area. In this case, the brightness depends on the inverse of the cosine of the emittance angle. Since the radiance is proportional to cos θi/cos θe, and cos θe = 1/(1 + p² + q²)^{1/2},

R(p,q) = \frac{1 + p_s p + q_s q}{\sqrt{1+p_s^2+q_s^2}}   (3.27)

The contours of constant brightness are now parallel lines, because R(p, q) = c gives (1 + p_s p + q_s q) = c√(1 + p_s² + q_s²). These lines are orthogonal to the direction of (p_s, q_s).

Example 3.2 The reflectance map of a homogeneous emission surface.
Figure 3.9 illustrates an example of the reflectance map of a homogeneous emission surface. Here, p_s/q_s = 1/2, and the contour lines have a slope of 2. ◻∘

Figure 3.9: An example of the reflectance map of a homogeneous emission surface.

3.1.4.2 Image Constraint Equation
The irradiance at a point in the image, E(x, y), is proportional to the radiance at the corresponding point on the object in the scene. Suppose that the surface gradient at this point is (p, q) and the radiance there is R(p, q). Setting the constant of proportionality to one yields

E(x,y) = R(p,q)   (3.28)

This is the image brightness constraint equation or image irradiance equation. It indicates that the gray-level value E(x, y) at the pixel (x, y) depends on the reflectance property R(p, q).

Consider a sphere with a Lambertian surface illuminated by a point source at the same place as the viewer (see Figure 3.10). As θe = θi and (p_s, q_s) = (0, 0), the relation between the radiance and the gradient is given by

R(p,q) = \frac{1}{\sqrt{1+p^2+q^2}}   (3.29)

If the center of the sphere is on the optical axis, its surface equation is

z = z_0 + \sqrt{r^2 - (x^2+y^2)} \qquad x^2+y^2 \le r^2   (3.30)

where r is the radius of the sphere and –z₀ is the distance from the center of the sphere to the lens. Since p = –x/(z – z₀) and q = –y/(z – z₀), it follows that (1 + p² + q²)^{1/2} = r/(z – z₀), and finally

E(x,y) = R(p,q) = \sqrt{1 - \frac{x^2+y^2}{r^2}}   (3.31)

According to eq. (3.31), the brightness falls off in the image from its maximum at the center to zero at the edge. This conclusion can also be obtained by considering the source emission direction S, the viewing direction V, and the surface normal N. When people see such a change in brightness, they may conclude that the image was captured from a round, probably spherical, object.
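The sketch below synthesizes the image predicted by eq. (3.31) and also evaluates the general Lambertian reflectance map of eq. (3.26). It is only a minimal illustration under assumed settings: the image size, the clipping of self-shadowed orientations to zero, and the unit radius are all choices made here, not prescriptions from the text.

```python
import numpy as np

def sphere_image(r: float = 1.0, n: int = 256) -> np.ndarray:
    """Synthesize E(x, y) for a Lambertian sphere lit from the viewing direction, eq. (3.31)."""
    x, y = np.meshgrid(np.linspace(-r, r, n), np.linspace(-r, r, n))
    rho2 = x ** 2 + y ** 2
    E = np.zeros((n, n))
    inside = rho2 <= r ** 2
    E[inside] = np.sqrt(1.0 - rho2[inside] / r ** 2)   # maximum at the center, zero at the rim
    return E

def reflectance_map(p, q, ps=0.0, qs=0.0):
    """Lambertian reflectance map R(p, q) of eq. (3.26) for a source direction (ps, qs)."""
    num = 1.0 + ps * p + qs * q
    den = np.sqrt(1.0 + p ** 2 + q ** 2) * np.sqrt(1.0 + ps ** 2 + qs ** 2)
    return np.clip(num / den, 0.0, None)   # negative values would mean self-shadowing

img = sphere_image()
print(img.max(), img[0, 0])   # close to 1.0 near the center, 0.0 outside the sphere
```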

Figure 3.10: A sphere illuminated by a point source.


3.1.5 Solution for Photometric Stereo

Given an image, it is expected that the shape of the object in the image can be recovered. There is a unique mapping from the surface orientation to the radiance. However, the inverse mapping is not unique, as the same brightness can originate from many surface orientations. A contour of constant R(p, q) connects such a set of orientations in the reflectance map. In some cases, the surface orientation can be determined with the help of particular points where the brightness attains its maximum or minimum. According to eq. (3.26), for a Lambertian surface, R(p, q) = 1 only when (p, q) = (p_s, q_s); at such a point the surface orientation is uniquely defined by the surface brightness. In general, however, the mapping from the image brightness to the surface orientation is not unique. This is because at every spatial location the brightness has only one degree of freedom, while the orientation has two degrees of freedom (two gradient components). To recover the surface orientation, some additional information should be introduced. To determine the two unknowns p and q, two equations are required. Two images, taken under different lighting conditions as shown in Figure 3.11, yield the following two equations:

R_1(p,q) = E_1 \qquad R_2(p,q) = E_2   (3.32)

If these two equations are linearly independent, a unique solution for p and q is possible.

Example 3.3 Solution of photometric stereo.
Suppose that

R_1(p,q) = \sqrt{\frac{1+p_1p+q_1q}{r_1}} \qquad \text{and} \qquad R_2(p,q) = \sqrt{\frac{1+p_2p+q_2q}{r_2}}

where

r_1 = \sqrt{1+p_1^2+q_1^2} \qquad \text{and} \qquad r_2 = \sqrt{1+p_2^2+q_2^2}

Figure 3.11: Two images are taken from the same viewpoint with different lighting.

Provided p_1/q_1 ≠ p_2/q_2, p and q can be solved from eq. (3.32) as

p = \frac{(E_1^2 r_1 - 1)q_2 - (E_2^2 r_2 - 1)q_1}{p_1 q_2 - q_1 p_2} \qquad \text{and} \qquad q = \frac{(E_2^2 r_2 - 1)p_1 - (E_1^2 r_1 - 1)p_2}{p_1 q_2 - q_1 p_2}

Given two corresponding images taken under different lighting conditions, a unique solution can thus be obtained for the surface orientation at each point. ◻∘

Example 3.4 Illustration of photometric stereo.
Two images of a sphere, taken under different lighting conditions (the same source has been put in two different positions), are shown in Figure 3.12(a) and Figure 3.12(b), respectively. The surface orientation determined by photometric stereo is shown in Figure 3.12(c) using an orientation-vector representation (each vector is represented by a line segment). It can be seen that the surface orientation is perpendicular to the paper at places closest to the center of the sphere, while it is parallel to the paper at places closest to the edges of the sphere. ◻∘

Figure 3.12: Computing the surface orientation by photometric stereo.

The reflectance properties of a surface can differ from place to place. A simple example is that the irradiance is the product of a reflectance factor (between 0 and 1) and some function of orientation. Suppose that a surface is like a Lambertian surface except that it does not reflect all of the incident light. Its brightness is represented by ρ cos θi, where ρ is the reflectance factor. To recover the reflectance factor and the gradient (p, q), three images are required. Introducing the unit vectors in the directions of the three source positions gives

S_j = \frac{[-p_j\ -q_j\ 1]^T}{\sqrt{1+p_j^2+q_j^2}} \qquad j = 1,2,3   (3.33)


Then the irradiance is

E_j = \rho\,(S_j \cdot N) \qquad j = 1,2,3   (3.34)

where

N = \frac{[-p\ -q\ 1]^T}{\sqrt{1+p^2+q^2}}   (3.35)

is the unit surface normal. For the two unknowns N and ρ, three equations can be obtained:

E_1 = \rho\,(S_1 \cdot N) \qquad E_2 = \rho\,(S_2 \cdot N) \qquad E_3 = \rho\,(S_3 \cdot N)   (3.36)

Combining these equations gives

E = \rho\,S\,N   (3.37)

where the rows of the matrix S are the source directions S₁, S₂, S₃, and the components of the vector E are the three brightness measurements. Suppose that S is non-singular. It can be shown that

\rho N = S^{-1}E = \frac{1}{S_1\cdot(S_2\times S_3)}\left[E_1(S_2\times S_3) + E_2(S_3\times S_1) + E_3(S_1\times S_2)\right]   (3.38)

The direction of the surface normal is thus obtained, up to a constant factor, as a linear combination of three vectors, each of which is perpendicular to the directions of two of the light sources and is weighted by the brightness measured under the third source. The reflectance factor ρ is then recovered by finding the magnitude of the resulting vector.

Example 3.5 Recover the reflectance factor by three images.
Suppose that a source has been put in three places, (–3.4, –0.8, –1.0), (0.0, 0.0, –1.0), and (–4.7, –3.9, –1.0), and three images are captured. Three equations can be established, and the surface orientation and reflectance factor can be determined. Figure 3.13(a) shows three reflectance property curves. It is seen from Figure 3.13(b) that when ρ = 0.8, the three curves are joined at p = –1.0, q = –1.0. ◻∘

Figure 3.13: Recover the reflectance factor by three images.
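The three-source recovery of eq. (3.38) amounts to a small linear solve per pixel. The following Python sketch writes it with an explicit matrix inverse for clarity and checks it on a single synthetic pixel; the source directions, gradient, and albedo value used in the test are arbitrary illustrative numbers, not data from the example above.

```python
import numpy as np

def photometric_stereo(E1, E2, E3, s1, s2, s3):
    """Recover albedo rho and unit normal N from three images, following eq. (3.38).

    E1, E2, E3 : brightness images (H x W arrays) taken under sources s1, s2, s3
    s1, s2, s3 : unit source-direction vectors of shape (3,), as in eq. (3.33)
    """
    S = np.stack([s1, s2, s3])              # rows are the source directions, eq. (3.37)
    S_inv = np.linalg.inv(S)                # assumes S is non-singular
    E = np.stack([E1, E2, E3], axis=-1)     # H x W x 3
    G = E @ S_inv.T                         # G = rho * N at every pixel, i.e. S^{-1} E
    rho = np.linalg.norm(G, axis=-1)        # albedo is the magnitude of rho * N
    N = G / np.maximum(rho[..., None], 1e-12)
    return rho, N

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Synthetic one-pixel check: a flat patch with gradient (p, q) and albedo 0.8
p, q, rho_true = 0.3, -0.2, 0.8
n_true = unit([-p, -q, 1.0])
s1, s2, s3 = unit([0.2, 0.1, 1.0]), unit([-0.3, 0.2, 1.0]), unit([0.1, -0.4, 1.0])
E = [np.array([[rho_true * max(np.dot(s, n_true), 0.0)]]) for s in (s1, s2, s3)]
rho, N = photometric_stereo(E[0], E[1], E[2], s1, s2, s3)
print(rho[0, 0], N[0, 0])    # approximately 0.8 and n_true
```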

3.2 Structure from Motion

The structure of a scene consisting of a number of objects, or of an object consisting of several parts, can be recovered from the motion of the objects in the scene or from the different motions of the object parts.

3.2.1 Optical Flow and Motion Field

Motion can be represented by a motion field, which is composed of the motion vectors of the image points.

Example 3.6 Computation of a motion field.
Suppose that at a particular moment one point P_i in the image is mapped from a point P_o on the object surface, as shown in Figure 3.14. These two points are connected by the projection equation. Let P_o have a velocity V_o relative to the camera. It induces a motion V_i in the corresponding P_i. The two velocities are

V_o = \frac{d\,r_o}{dt} \qquad V_i = \frac{d\,r_i}{dt}   (3.39)

where r_o and r_i are related by

\frac{1}{\lambda}\,r_i = \frac{1}{r_o\cdot\hat z}\,r_o   (3.40)

where λ is the focal length of the camera, ẑ is the unit vector along the optical axis, and r_o·ẑ is the distance from the center of the camera to the object. ◻∘

Figure 3.14: Object point and image point are connected by a projection equation.

According to visual psychology, when there is relative motion between an observer and an object, the optical patterns on the surface of the object provide the motion


and structure information of the object to the observer. Optical flow is the apparent motion of these optical patterns. Three factors are important for optical flow: the motion or velocity field, which is the necessary condition for optical flow; the optical patterns, which carry the information; and the projection from the scene to the image, which makes the flow observable.

Optical flow and the motion field are closely related but different. When there is optical flow, certain motion must exist. However, when there is some motion, optical flow may not appear.

Example 3.7 The difference between the motion field and optical flow.
Figure 3.15(a) shows a sphere with a homogeneous reflectance property rotating in front of a camera, with the light source fixed. There are brightness variations across the surface image; however, these variations do not change with the rotation of the sphere, so the gray-level values of the image do not change with time. In this case, though the motion field is not zero, the optical flow is zero everywhere. In Figure 3.15(b), a fixed sphere is illuminated by a moving light source. The gray-level values at different locations of the image change as the illumination changes according to the source motion. In this case, the optical flow is not zero but the motion field of the sphere is zero everywhere. Such motion is called apparent motion. ◻∘

Figure 3.15: The difference between a motion field and the optical flow.

3.2.2 Solution to Optical Flow Constraint Equation

The optical flow constraint equation is also called the image flow constraint equation. Denote by f(x, y, t) the gray-level value of an image point (x, y) at time t, and by u(x, y) and v(x, y) the horizontal and vertical velocities of the image point (x, y), respectively. The optical flow constraint equation can be written as

f_x u + f_y v + f_t = 0   (3.41)

where f_x, f_y, f_t are the gray-level gradients along the X, Y, T directions, respectively.
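The gradients in eq. (3.41) have to be estimated from the image data. The sketch below shows one simple finite-difference choice from two consecutive frames; it is only one possible discretization, not the one prescribed by the text.

```python
import numpy as np

def flow_gradients(f1: np.ndarray, f2: np.ndarray):
    """Estimate fx, fy, ft of eq. (3.41) from two consecutive frames (float arrays of equal size)."""
    favg = 0.5 * (f1 + f2)              # average the two frames to reduce noise
    fx = np.gradient(favg, axis=1)      # gray-level change along X (columns)
    fy = np.gradient(favg, axis=0)      # gray-level change along Y (rows)
    ft = f2 - f1                        # gray-level change along T
    return fx, fy, ft
```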



Figure 3.16: Values of u and v satisfying eq. (3.41) lie on a line.

In addition, eq. (3.41) can also be written as (fx , fy ) ∙ (u, v) = –ft

(3.42)

3.2.2.1 Computation of Optical Flow: Rigid Body
Equation (3.41) is a linear constraint equation on the velocity components u and v. In the velocity space, as shown in Figure 3.16, the values of u and v satisfying eq. (3.41) lie on a line, with

u_0 = -f_t/f_x \qquad v_0 = -f_t/f_y \qquad \theta = \arctan(f_x/f_y)   (3.43)

All points on this line are solutions of the optical flow constraint equation. In other words, one equation cannot uniquely determine the two variables u and v; another constraint should be introduced. When dealing with rigid bodies, it is implied that all neighboring points have the same optical velocity, that is, the change rate of the optical flow is zero:

(\nabla u)^2 = \left(\frac{\partial u}{\partial x} + \frac{\partial u}{\partial y}\right)^2 = 0   (3.44)

(\nabla v)^2 = \left(\frac{\partial v}{\partial x} + \frac{\partial v}{\partial y}\right)^2 = 0   (3.45)

Combining these two equations with the optical flow constraint equation, the optical flow can be calculated by solving the minimization problem

\varepsilon(x,y) = \sum_x\sum_y\left\{(f_xu + f_yv + f_t)^2 + \lambda^2\left[(\nabla u)^2 + (\nabla v)^2\right]\right\}   (3.46)

In eq. (3.46), the value of λ depends on the noise in the image; when the noise is strong, λ must be larger. To minimize the error in eq. (3.46), take the derivatives with respect to u and v and set the results to zero, which yields

f_x^2u + f_xf_yv = -\lambda^2\nabla u - f_xf_t   (3.47)

f_y^2v + f_xf_yu = -\lambda^2\nabla v - f_yf_t   (3.48)

The above two equations are also called the Euler equations. Let ū and v̄ be the average values in the neighborhoods of u and v, and let ∇u = u – ū and ∇v = v – v̄. Equations (3.47) and (3.48) become

(f_x^2 + \lambda^2)\,u + f_xf_y\,v = \lambda^2\bar u - f_xf_t   (3.49)

(f_y^2 + \lambda^2)\,v + f_xf_y\,u = \lambda^2\bar v - f_yf_t   (3.50)

The solutions of eq. (3.49) and eq. (3.50) are

u = \bar u - \frac{f_x\left[f_x\bar u + f_y\bar v + f_t\right]}{\lambda^2 + f_x^2 + f_y^2}   (3.51)

v = \bar v - \frac{f_y\left[f_x\bar u + f_y\bar v + f_t\right]}{\lambda^2 + f_x^2 + f_y^2}   (3.52)

Equations (3.51) and (3.52) provide the basis for solving u(x, y) and v(x, y) with an iterative method. In practice, the following relaxation iterative equations are used:

u^{(n+1)} = \bar u^{(n)} - \frac{f_x\left[f_x\bar u^{(n)} + f_y\bar v^{(n)} + f_t\right]}{\lambda^2 + f_x^2 + f_y^2}   (3.53)

v^{(n+1)} = \bar v^{(n)} - \frac{f_y\left[f_x\bar u^{(n)} + f_y\bar v^{(n)} + f_t\right]}{\lambda^2 + f_x^2 + f_y^2}   (3.54)
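A minimal sketch of the relaxation iteration of eqs (3.53) and (3.54) is given below, using the gradient estimates from the earlier sketch. The 4-neighbor average, the value of λ, and the iteration count are arbitrary choices made here for illustration only.

```python
import numpy as np

def local_average(a: np.ndarray) -> np.ndarray:
    """4-neighbor average used for the u-bar and v-bar terms (one simple choice)."""
    pad = np.pad(a, 1, mode="edge")
    return 0.25 * (pad[:-2, 1:-1] + pad[2:, 1:-1] + pad[1:-1, :-2] + pad[1:-1, 2:])

def iterative_flow(fx, fy, ft, lam=10.0, n_iter=100):
    """Relaxation iteration of eqs (3.53) and (3.54), starting from u = v = 0."""
    u = np.zeros_like(fx)
    v = np.zeros_like(fx)
    denom = lam ** 2 + fx ** 2 + fy ** 2
    for _ in range(n_iter):
        u_bar, v_bar = local_average(u), local_average(v)
        common = (fx * u_bar + fy * v_bar + ft) / denom
        u = u_bar - fx * common
        v = v_bar - fy * common
    return u, v
```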

The initial values can be u⁽⁰⁾ = 0, v⁽⁰⁾ = 0 (a line passing through the origin). Equations (3.53) and (3.54) have a simple geometric explanation: the new iterative value at a point (u, v) equals the average value in the neighborhood of this point minus an adjustment that is in the direction of the brightness gradient, as shown in Figure 3.17.

Figure 3.17: Illustration of a geometric explanation.

Example 3.8 Real example of the optical flow detection.
Some results of a real detection of optical flow are shown in Figure 3.18. Figure 3.18(a) is an image of a soccer ball. Figure 3.18(b) is the image obtained by rotating Figure 3.18(a) around the vertical axis. Figure 3.18(c) is the image obtained by rotating



Figure 3.18(a) clockwise around the viewing axis. Figures 3.18(d) and (e) are the detected optical flows in the above two cases, respectively. ◻∘

Figure 3.18: Real example of optical flow detection.

It can be seen from Figure 3.18 that the optical flow has larger values at the boundaries between the white and black regions, as the gray-level change is sharp there, and smaller values inside the white and black regions, as the gray-level values there show almost no change during the movement.

3.2.2.2 Computation of Optical Flow: Smooth Motion
Another method for introducing more constraints is to use the property that the motion is smooth across the field. It is then possible to minimize a measure of the departure from smoothness,

e_s = \iint\left[(u_x^2 + u_y^2) + (v_x^2 + v_y^2)\right]dx\,dy   (3.55)

In addition, the error in the optical flow constraint equation,

e_c = \iint\left[E_xu + E_yv + E_t\right]^2dx\,dy   (3.56)

should also be small. Overall, the function to be minimized is e_s + λe_c, where λ is a weight. With strong noise, λ should take a small value.

3.2.2.3 Computation of Optical Flow: Gray-Level Break
Consider Figure 3.19(a), where XY is the image plane, I is the gray-level axis, and an object moves along the X direction with a velocity (u, v). At time t₀, the gray-level value at point P₀ is I₀ and the gray-level value at point P_d is I_d. At time t₀ + dt, the


gray-level value at P₀ has moved to P_d, and this forms the optical flow. Between P₀ and P_d there is a gray-level break.

Figure 3.19: Gray-level break cases.

Considering Figure 3.19(b), look first at the change of gray level along the path. As the gray-level value at P_d is the gray-level value at P₀ plus the gray-level difference between P₀ and P_d,

I_d = \int_{P_0}^{P_d}\nabla f\cdot dl + I_0   (3.57)

On the other hand, look at the change of gray level along the time axis. As the gray-level value observed at P_d changes from I_d to I₀,

I_0 = \int_{t_0}^{t_0+dt} f_t\,dt + I_d   (3.58)

The changes of gray-level values in the above two cases are equal, so combining eqs (3.57) and (3.58) gives

\int_{P_0}^{P_d}\nabla f\cdot dl = -\int_{t_0}^{t_0+dt} f_t\,dt   (3.59)

Substituting dl = [u v]ᵀdt into eq. (3.59) yields

f_x u + f_y v + f_t = 0

(3.60)

Now it is clear that the optical flow constraint equation can also be used when there is a gray-level break.

3.2.3 Optical Flow and Surface Orientation

Consider an observer-centered coordinate system XYZ (the observer is located at the origin), and suppose that the observer has a spherical retina, so that the world can


be projected onto a unit image sphere. Any point on this image sphere can be represented by a longitude φ, a latitude θ, and a distance from the origin r. The two coordinate systems are depicted in Figure 3.20 and can be interconverted:

Figure 3.20: Spherical and Cartesian coordinate systems.

x = r\sin\theta\cos\varphi   (3.61)

y = r\sin\theta\sin\varphi   (3.62)

z = r\cos\theta   (3.63)

r = \sqrt{x^2+y^2+z^2}   (3.64)

\theta = \arccos(z/r)   (3.65)

\varphi = \arctan(y/x)   (3.66)

The optical flow of an object point can be determined as follows. Let the velocity of the point be (u, v, w) = (dx/dt, dy/dt, dz/dt); the angular velocities of the image point in the φ and θ directions are then

\delta = \frac{v\cos\varphi - u\sin\varphi}{r\sin\theta}   (3.67)

\varepsilon = \frac{(ur\sin\theta\cos\varphi + vr\sin\theta\sin\varphi + wr\cos\theta)\cos\theta - rw}{r^2\sin\theta}   (3.68)

Equations (3.67) and (3.68) are the general expressions for the optical flow in the φ and θ directions. Consider a simple example. Suppose that the object is stationary and the observer is traveling along the Z-axis with a speed S. In this case, u = 0, v = 0, and w = –S. Taking them into eq. (3.67) and eq. (3.68) yields

\delta = 0   (3.69)

\varepsilon = S\sin\theta/r   (3.70)

Based on eq. (3.69) and eq. (3.70), the surface orientation can be determined. Suppose that R is a point in the patch on the surface and an observer located at O is looking at this patch along the viewing line OR as in Figure 3.21(a). Let the normal


Figure 3.21: An example of surface orientation determination.

vector of the patch be N. N can be decomposed into two orthogonal directions. One is in the ZR plane and makes an angle σ with OR, as in Figure 3.21(b). The other is in the plane perpendicular to the ZR plane (and thus parallel to the XY plane) and makes an angle τ with OR′ (see Figure 3.21(c); the Z-axis points out of the paper). In Figure 3.21(b), φ is constant, while in Figure 3.21(c), θ is constant. Consider σ in the ZR plane, as shown in Figure 3.22(a). If θ is given an infinitesimal increment δθ, the change of r is δr. Drawing an auxiliary line ρ, on one side it gives ρ/r = tan(δθ) ≈ δθ, and on the other side ρ/δr = tan σ. Combining them yields

r\,\delta\theta = \delta r\tan\sigma

(3.71)

Similarly, consider τ in the plane perpendicular to the ZR plane, as shown in Figure 3.22(b). If φ is given an infinitesimal increment δφ, the change of r is δr. Drawing an auxiliary line ρ, on one side it gives ρ/r = tan(δφ) ≈ δφ, and on the other side ρ/δr = tan τ. Combining them yields

r\,\delta\varphi = \delta r\tan\tau

(3.72)

Taking the limits of eq. (3.71) and eq. (3.72) yields

\cot\sigma = \frac{1}{r}\left[\frac{\partial r}{\partial\theta}\right]   (3.73)

\cot\tau = \frac{1}{r}\left[\frac{\partial r}{\partial\varphi}\right]   (3.74)

(b)

Figure 3.22: The process used to determine surface orientation.


in which r can be determined from eq. (3.64). Since ε is a function of both φ and θ, eq. (3.70) can be written as

r = \frac{S\sin\theta}{\varepsilon(\varphi,\theta)}   (3.75)

Taking partial derivatives with respect to φ and θ yields

\frac{\partial r}{\partial\varphi} = -S\sin\theta\,\frac{1}{\varepsilon^2}\frac{\partial\varepsilon}{\partial\varphi}   (3.76)

\frac{\partial r}{\partial\theta} = S\left(\frac{\cos\theta}{\varepsilon} - \frac{\sin\theta}{\varepsilon^2}\frac{\partial\varepsilon}{\partial\theta}\right)   (3.77)

Substituting eqs (3.76) and (3.77) into eqs (3.73) and (3.74) yields

\sigma = \operatorname{arccot}\left[\cot\theta - \frac{\partial(\ln\varepsilon)}{\partial\theta}\right]   (3.78)

\tau = \operatorname{arccot}\left[-\frac{\partial(\ln\varepsilon)}{\partial\varphi}\right]   (3.79)
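Under the same assumptions (a stationary scene and an observer translating along Z), the two surface angles of eqs (3.78) and (3.79) can be evaluated numerically from a sampled flow magnitude ε(φ, θ). The sketch below is only illustrative: it assumes ε is strictly positive, that θ stays away from 0, and it uses arccot x = arctan2(1, x) as the branch choice.

```python
import numpy as np

def surface_angles(eps: np.ndarray, phi: np.ndarray, theta: np.ndarray):
    """Evaluate eqs (3.78) and (3.79) on a grid eps[i, j] sampled at (phi[i], theta[j])."""
    ln_eps = np.log(eps)
    dln_dphi = np.gradient(ln_eps, phi, axis=0)      # d(ln eps)/d(phi)
    dln_dtheta = np.gradient(ln_eps, theta, axis=1)  # d(ln eps)/d(theta)
    cot_theta = 1.0 / np.tan(theta)[None, :]
    sigma = np.arctan2(1.0, cot_theta - dln_dtheta)  # arccot(x) written as arctan2(1, x)
    tau = np.arctan2(1.0, -dln_dphi)
    return sigma, tau
```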

3.3 Shape from Shading

In contrast to the techniques of photometric stereo and structure from motion, which need to capture at least two images of a scene, the technique of shape from shading can recover the shape information of a 3-D scene from a single image.

3.3.1 Shading and Shape

When an object in a scene is illuminated, different parts of its surface appear with different brightness, as they are oriented differently. This spatial variation of brightness is referred to as shading. The shading in an image depends on four factors: the geometry of the visible surface (the surface normal), the direction and energy of the incident radiation, the relative orientation and distance between the object and the observer, and the reflectance property of the object surface, Zhang (2000b).

Figure 3.23: Four factors influencing gray-level variation.

The meaning of the letters in Figure 3.23 is as follows. The object is represented by the patch Q, and the surface normal N indicates the orientation of the patch. The vector I represents the intensity and direction of the incident illumination. The vector V indicates the relative orientation and distance between the object and the observer. The reflectance property ρ of the object surface depends on the material of the surface; it is normally a scalar, but it can be a function of the patch's spatial position. According to Figure 3.23, the image gray level depends on the incident intensity I, the reflectance property ρ of the object surface, and the angle of incidence i between the incident light and the surface normal, which is given by


E(x,y) = I(x,y)\,\rho\cos i

(3.80)

Consider the case where the light source is just behind the observer, so that cos i = cos e. Suppose that the object has a Lambertian surface; that is, the reflected intensity does not change with the observation location. The observed intensity is then

E(x,y) = I(x,y)\,\rho\cos e

(3.81)

Since N = [p q –1]ᵀ and V = [0 0 –1]ᵀ, by overlapping the gradient coordinate system onto the XY coordinate system, as shown in Figure 3.24,

\cos e = \cos i = \frac{[p\ q\ -1]^T\cdot[0\ 0\ -1]^T}{\left|[p\ q\ -1]^T\right|\,\left|[0\ 0\ -1]^T\right|} = \frac{1}{\sqrt{p^2+q^2+1}}

(3.82)

Taking eq. (3.82) into eq. (3.80) yields

E(x,y) = I(x,y)\,\rho\,\frac{1}{\sqrt{p^2+q^2+1}}

(3.83)

Consider now the case where i ≠ e, and suppose that I is perpendicular to a patch whose normal is [p_i q_i –1]ᵀ. Then

\cos i = \frac{[p\ q\ -1]^T\cdot[p_i\ q_i\ -1]^T}{\left|[p\ q\ -1]^T\right|\,\left|[p_i\ q_i\ -1]^T\right|} = \frac{pp_i + qq_i + 1}{\sqrt{p^2+q^2+1}\,\sqrt{p_i^2+q_i^2+1}}

(3.84)

Figure 3.24: Overlap of gradient coordinate system and XY coordinate system.

Taking eq. (3.84) into eq. (3.80) yields the image gray level for an arbitrary incident angle:

E(x,y) = I(x,y)\,\rho\,\frac{pp_i + qq_i + 1}{\sqrt{p^2+q^2+1}\,\sqrt{p_i^2+q_i^2+1}}

(3.85)

Equation (3.85) can be written in a more general form as E(x, y) = R(p, q)

(3.86)

Equation (3.86) is just the image brightness constraint equation shown in eq. (3.28).

3.3.2 Gradient Space

A 3-D surface can be represented by z = f(x, y), and a patch on it can be represented by its normal N = [p q –1]ᵀ. From the orientation point of view, a 3-D surface patch corresponds to a point G(p, q) in the 2-D gradient space, as shown in Figure 3.25. Using the gradient space to study 3-D surfaces reduces the dimensionality, but this representation cannot determine the location of the surface in the 3-D coordinate system. In other words, a point in the 2-D gradient space represents all patches having the same orientation, but these patches can have different 3-D locations.

With the help of the gradient space, the structure formed by intersecting planes can be easily understood.

Example 3.9 Judge the structure formed by intersecting planes.
Several planes can intersect to make up a convex or a concave structure. Consider two intersecting planes S1 and S2 forming a joint line l, as shown in Figure 3.26, where G1 and G2 represent the points in the gradient space corresponding to the normals of the planes. The connecting line between G1 and G2 is perpendicular to the projection line l′ of l. Overlapping the gradient space onto the XY space, and projecting the two planes and the gradient points corresponding to their normals onto the overlapped space, the following conclusion can be made. The two planes form a convex structure if S and G have the same sign (are located on the same side of l′), as shown in Figure 3.27(a). The two planes form a concave structure if S and G have different signs (are located on opposite sides of l′), as shown in Figure 3.27(b).

Q q=f( p)

G

r θ=arctan(q/p) O

P

Figure 3.25: Gradient space.

96

3 3-D Shape Information Recovery

G2

Z G1

l S2

S1 O

X

l′

Y

Figure 3.26: Cross of two planes in 3-D space.

(a)

(b)

Figure 3.27: Convex or concave structures made by two planes in a 3-D space.

l'1

l1 GA

A

B

l3

l'3

A

l'1 GB

B

A

l'3

B

C C

l2

GA

(a)

C A

C B

l'2

GC

GB

B A

l'2

GC

(b)

(c)

Figure 3.28: Two cases of intersected three planes.

A further example is shown in Figure 3.28. In Figure 3.28(a), three planes A, B, C are intersected with intersect lines l1 , l2 , l3 , respectively. When the sequence of planes is AABBCC, clockwise, then the three planes form a convex structure, as shown in Figure 3.28(b). When the sequence of planes is CBACBA, clockwise, then the three planes form a concave structure, as shown in Figure 3.28(c). ◻∘ Referring back, eq. (3.83) can be rewritten as p2 + q2 = (

2

I(x, y)1 1 ) –1= 2 –1 E(x, y) K

(3.87)

where K represents the relative reflectance intensity. Equation (3.87) corresponds to a set of nested circles in the PQ plan, in which each circle represents the trace of patches

97

3.3 Shape from Shading

Q

Q

Q

C

GC

120° 150° B

60° 90°

30°

A

O

GB

P

(a)

O

90°

O

GA

P

P

(b)

(c)

Figure 3.29: Illustration of the application of the reflectance map.

having the same gray level. When i = e, the reflectance map is composed of circles having the same center. While for the case of i ≠ e, the reflectance map is composed of ellipses or hyperbolas. Example 3.10 Application of the reflectance map. Suppose that three planes A, B, C are observed, and they form the intersecting angles shown in Figure 3.29(a). The lean angle for each plane is not known, yet, but it can be determined with the help of the reflectance map. Suppose that I and V have the same directions, and KA = 0.707, KB = 0.807, and KC = 0.577. According to the characteristic that the connecting lines between G(p, q) of the two planes are perpendicular to these two planes, the triangle shown in Figure 3.29(b) can be obtained. The problem becomes to find GA , GB and GC on the reflectance map shown in Figure 3.29(c). Taking KA , KB , and KC into eq. (3.87) yields two groups of solutions: (pA , qA ) = (0.707, 0.707), (p󸀠A , q󸀠A ) = (1, 0),

(pB , qB ) = (–0.189, 0.707), (p󸀠B , q󸀠B ) = (–0.732, 0),

(pC , qC ) = (0.707, 1.225) (3.88) (p󸀠C , q󸀠C ) = (1, 1)

(3.89)

The solution of eq. (3.88) corresponds to the small triangle in Figure 3.29(c), while the solution of eq. (3.89) corresponds to the big triangle in Figure 3.29(c). Both of them are correct. ◻∘ 3.3.3 Solving the Brightness Equation with One Image In Section 3.1, the brightness equation has been solved by using additional images taken under different lighting conditions, which provide additional constraints for the brightness equation. Here, only one image is used, but some smooth constraints could be used to provide additional information.

98

3 3-D Shape Information Recovery

Q (a, b) θ0

P

0

Figure 3.30: The reflectance map of a linear case.

f (s)

E

O ap+bq

s

Figure 3.31: A strictly monotonic function that has an inverse.

3.3.3.1 Linear Case As shown in Figure 3.30, suppose that the reflectance map is a function of a linear combination of gradients, R(p, q) = f (ap + bq)

(3.90)

where a and b are constants. In Figure 3.30, the contours of the constant brightness are parallel lines in the gradient space. In eq. (3.90), f is a strictly monotonic function that has an inverse, f –1 . As shown in Figure 3.31 ap + bq = f –1 [E(x, y)]

(3.91)

The gradient (p, q) cannot be determined from a measurement of the brightness alone, but one equation that constrains its possible values can be obtained. The slope of the surface, in a direction that makes an angle ( with the x-axis, is m(() = p cos ( + q sin (

(3.92)

If a particular direction (0 is chosen, where tan (0 = b/a, cos (0 = a/√a2 + b2

(3.93)

sin (0 = b/√a2 + b2 The slope in this direction is m((0 ) =

ap + bq 1 = √a2 + b2 √a2 + b2

f –1 [E(x, y)]

(3.94)

3.3 Shape from Shading

99

The slope of the surface in a particular direction is thus obtained. Starting at a particular image point, and taking a small step of a length $s, a change in z of $z = m$s is produced by dz 1 = ds √a2 + b2

f –1 [E(x, y)]

(3.95)

Both x and y are linear functions of s, given by x(s) = x0 + s cos (

y(s) = y0 + s sin (

(3.96)

To find the solution at a point (x0 , y0 , z0 ) on the surface, integrating the differential equation for z yields

z(s) = z0 +

1 √a2 + b2

s

∫ f –1 [E(x, y)]ds

(3.97)

0

A profile of the surface along a line in the special direction (one of the straight lines in Figure 3.32) is obtained. The profile is called a characteristic curve. 3.3.3.2 Rotationally Symmetric Case When the point source is located at the same place as the viewer, the reflectance map is rotationally symmetric, which is given by R(p, q) = f (p2 + q2 )

(3.98)

Suppose that the function f is strictly monotonic and differentiable, with the inverse f –1 . It has p2 + q2 = f –1 [E(x, y)]

(3.99)

The direction of the steepest ascent makes an angle (s with the x-axis, where tan (s = p/q, so that cos (s = p/√p2 + q2

and

sin (s = q/√p2 + q2

Y t s z0

X

0 Figure 3.32: The surface can be recovered by integration along the lines.

(3.100)

100

3 3-D Shape Information Recovery

According to eq. (3.92), the slope in the direction of the steepest ascent is m((s ) = √p2 + q2 = √f –1 [E(x, y)]

(3.101)

In this case, the slope of the surface can be found, given its brightness, but the direction of the steepest ascent cannot be found. Suppose that the direction of the steepest ascent is given by (p, q). If a small step of a length $s is taken in the direction of the steepest ascent, the changes in x and y would be $x =

p √p2 + q2

$s

$y =

q √p2 + q2

$s

(3.102)

The change in z would be $z = m$s = √p2 + q2

$s = √f –1 [E(x, y)]$s

(3.103)

If taking the step length to be √p2 + q2 $s, the above expressions will be simplified as $ x = p$ s

$z = (p2 + q2 )$s = f –1 [E(x, y)]$ s

$ y = q$ s

(3.104)

A planar surface patch gives rise to a region of a uniform brightness in the image. Only curved surfaces show non-zero brightness gradients. To determine the brightness gradients, equations for the changes of p and q should be developed. Denote u, v, and w as the second partial derivatives of z with respect to x and y u=

𝜕2 z 𝜕x2

𝜕2 z 𝜕2 z =v= 𝜕x𝜕y 𝜕y𝜕x

w=

𝜕2 z 𝜕y2

(3.105)

Then, Ex = 2(pu + qv)f 󸀠

and

Ey = 2(pv + qw)f 󸀠

(3.106)

where f 󸀠 (r) is the derivative of f (r) with respect to its single argument r. The changes $p and $q caused by taking the step ($x, $y) in the image plane can be determined by differentiation, $ p = u$ x + v$ y

$q = v$ x + w$ y

(3.107)

$ p = (pu + qv)$s

$q = (pv + qw)$s

(3.108)

Following eq. (3.104),

3.3 Shape from Shading

101

Following eq. (3.106), $p =

Ex $s 2f 󸀠

$q =

Ey 2f 󸀠

$s

(3.109)

Therefore, in the limit as $s → 0, the following five differential equations are obtained (the dots denote differentiations with respect to s) ẋ = p

ẏ = q

ż = p2 + q2

ṗ =

Ex 2f 󸀠

q̇ =

Ey 2f 󸀠

(3.110)

Given starting values, this set of five ordinary differential equations can be solved numerically to produce a curve on the surface of the object. By differentiating ẋ = p and ẏ = q one more time with respect to s, the alternate formulation can be obtained as ẍ =

Ex 2f 󸀠

ÿ =

Ey 2f 󸀠

ż = f –1 [E(x, y)]

(3.111)

Since both Ex and Ey are image brightness measurements, these equations must be solved numerically. 3.3.3.3 General Smooth Case In general, the object surface is quite smooth. This induces the following two equations (∇p)2 = (

𝜕p 𝜕p 2 + ) =0 𝜕x 𝜕y

(3.112)

(∇q)2 = (

𝜕q 𝜕q 2 + ) =0 𝜕x 𝜕y

(3.113)

Combining them with the image brightness constraint equation, the problem of solving the surface orientation becomes the problem of minimizing the following total error 2

%(x, y) = ∑ ∑ {[E(x, y) – R(p, q)] + + [(∇p)2 + (∇q)2 ]} x

(3.114)

y

Denote p̄ and q,̄ the average values of p and q neighborhoods, respectively. Taking the derivatives of % to p and q, setting the results to zero, and taking ∇p = p – p̄ and ∇q = q – q̄ into eq. (3.114) yields 1 𝜕R [E(x, y) – R(p, q)] + 𝜕p 1 𝜕R ̄ y) + [E(x, y) – R(p, q)] q(x, y) = q(x, + 𝜕q

̄ y) + p(x, y) = p(x,

(3.115) (3.116)

102

3 3-D Shape Information Recovery

The iterative formulas for solving eq. (3.115) and eq. (3.116) are p(n+1) = p̄ (n) +

1 𝜕 R(n) [E(x, y) – R(p(n) , q(n) )] + 𝜕p

(3.117)

q(n+1) = q̄ (n) +

1 𝜕 R(n) [E(x, y) – R(p(n) , q(n) )] + 𝜕q

(3.118)

Example 3.11 The flowchart for solving the image brightness constraint equation. The flowchart for solving Eqs (3.117) and (3.118) is shown in Figure 3.33. It in principle can be used for solving eqs. (3.53) and (3.54), too. ◻∘ Example 3.12 Illustration of the shape from shading. Two illustrations of shape from shading are shown in Figure 3.34. Figure 3.34(a) is an image of a ball and Figure 3.34(b) is the corresponding surface orientation map. Figure 3.34(c) is another image of a ball and Figure 3.34(d) is the corresponding surface orientation map. ◻∘

3.4 Texture and Surface Orientation Shape from texture has been studied for quite some time, Gibson (1950). In the following, the estimation of the surface orientation based on texture distortion is discussed.

p(n)(x,y)

–(n) p (x,y)

1 —Σ 4

i,j

+ 1/λ

p(n)(x,y) q(n)(x,y) E(x,y) i,j

dR dq

(n)

x

+

–R(n)(p,q)

1/λ

p(n)(x,y) q(n)(x,y)

i,j

p(n+1)(x,y)

dR dq

1 —Σ 4

(n)

x

–(n) q (x,y)

+

q(n)(x,y) Figure 3.33: The flowchart for solving the image brightness constraint equation.

q(n–1)(x,y)

3.4 Texture and Surface Orientation

(a)

(b)

(c)

103

(d)

Figure 3.34: Illustration of shape from shading.

3.4.1 Single Imaging and Distortion The boundary of an object is formed by a set of consecutive line segments. When the lines in a 3-D space are projected on a 2-D image plane, several distortions can be produced. The projection of a point is still a point. The projection of a line depends on the projection of the points forming the line. Suppose that the two end points of a line are denoted W 1 = [X1 Y1 Z1 ]T and W 2 = [X2 Y2 Z2 ]T , and the points between them are (0 < p < 1) X1 X2 [ ] [ ] pW 1 + (1 – p)W 2 = p [ Y1 ] + (1 – p) [ Y2 ] [ Z1 ] [ Z2 ]

(3.119)

Using the homogenous coordinates, the two end points can be represented by PW 1 = [kX1 kY1 kZ1 t1 ]T , PW 2 = [kX2 kY2 kZ2 t2 ]T , where t1 = k(f – Z1 )/+, t2 = k(f – Z2 )/+. The points on the line are represented by (0 < p < 1) kX1 kX2 [ kY ] [ kY ] [ ] [ ] P [pW 1 + (1 – p)W 2 ] = p [ 1 ] + (1 – p) [ 2 ] [ kZ1 ] [ kZ2 ] [ t1 ] [ t2 ]

(3.120)

Their image plane coordinates are (0 ≤ p ≤ 1) w = [x

y]T = [

p X1 + (1 – p)X2 p t1 + (1 – p)t2

p Y1 + (1 – p)Y2 ] p t1 + (1 – p)t2

T

(3.121)

In the above, the projection results are represented by using p. On the other hand, the two end points on the image planes are w1 = [+X1 /(+ – Z1 ) +Y1 /(+ – Z1 )]T , w2 = [+X2 /(+–Z2 ) +Y2 /(+–Z2 )]T , and the points on the line can be represented by (0 < q < 1) +X1

+X2

1 ] + (1 – q) [ +–Z2 ] qw1 + (1 – q)w2 = q [ +–Z +Y1 +Y2 [ +–Z1 ] [ +–Z2 ]

(3.122)

104

3 3-D Shape Information Recovery

Their image plane coordinates are (represented by using q, with 0 ≤ q ≤ 1) w = [x

Y]T = [q

+X1 +X2 + (1 – q) + – Z1 + – Z2

q

+Y1 +Y2 T + (1 – q) ] + – Z1 + – Z2

(3.123)

Since the projection results represented using p are the image coordinates represented using q, eq. (3.121) and eq. (3.123) should be equal, which is given by q t2 q t2 + (1 – q)t1 p t1 q= p t1 + (1 – p)t2

p=

(3.124) (3.125)

The point represented using p in a 3-D space corresponds to only one point represented using q in a 2-D image plane. The projection (except the orthogonal projection) result of a line from a 3-D space to a 2-D image plane is still a line (length can be different). For an orthogonal projection, the result would be a point. Consider now the distortion of parallel lines. A point (X, Y, Z) on a 3-D line is X X0 a [ ] [ ] [ ] [ Y ] = [ Y0 ] + k [ b ] [ Z ] [ Z0 ] [c ]

(3.126)

where (X0 , Y0 , Z0 ) represent the start points, (a, b, c) are the direction cosines and k is a coefficient. For a set of parallel lines, they have the same (a, b, c), but different (X0 , Y0 , Z0 ). Substituting eq. (3.126) into eqs. (2.25) and (2.26) of Volume I of this book set yields x=+

y=+

(X0 + ka – Dx ) cos 𝛾 + (Y0 + kb – Dy ) sin 𝛾 –(X0 + ka – Dx ) sin ! sin 𝛾 + (Y0 + kb – Dy ) sin ! cos 𝛾 – (Z0 + kc – Dz ) cos ! + + (3.127) –(X0 + ka – Dx ) sin 𝛾 cos ! + (Y0 + kb – Dy ) cos ! cos 𝛾 + (Z0 + kc – Dz ) sin ! –(X0 + ka – Dx ) sin ! sin 𝛾 + (Y0 + kb – Dy ) sin ! cos 𝛾 – (Z0 + kc – Dz ) cos ! + + (3.128)

When the line extends on both sides to infinity, k = ±∞, eq. (3.127) and eq. (3.128) are simplified to x∞ = +

a cos 𝛾 + b sin 𝛾 –a sin ! sin 𝛾 + b sin ! cos 𝛾 – c cos !

(3.129)

y∞ = +

–a sin 𝛾 cos ! + b cos ! cos 𝛾 + c sin ! . –a sin ! sin 𝛾 + b sin ! cos 𝛾 – c cos !

(3.130)

105

3.4 Texture and Surface Orientation

It is seen that the projection of parallel lines depends only on (a, b, c), but not on (X0 , Y0 , Z0 ). The parallel lines having the same (a, b, c) will cross a point after infinitive extension. This point is called the vanishing point and will be discussed below.

3.4.2 Recover Orientation from Texture Gradient The surface orientation can be determined with the help of surface texture, and particularly with the help of the appearance change of textures. The principle of the structural techniques (Section 12.3) is adopted here. The texture image is often decomposed into primitives, called texels (for texture elements). When using the surface texture to determine the orientation of a surface, it must consider the imaging process. In this process, the original texture structure can be changed on the projection to the image. This change depends on the orientation of the surface, and carries the 3-D information of the surface orientation. The change of textures can be described by the texture gradients and classified into three groups. The methods for recovering orientation from textures can also be classified into three groups, as shown in Figure 3.35. 3.4.2.1 Texel Change in Size In perspective projection, texels that have different distances from viewers will change their apparent size after projection to the image, as shown in Figure 3.35(a). The direction of the maximum rate of the change of the projected texel size is the direction of the texture gradient. Suppose the image plane is in superposition with the paper and the view direction is pointed toward the inside. The direction of the texture gradient depends on the angle of the texel around the camera line of sight, and the magnitude of the texture gradient indicates how much the plane is tilted with respect to the camera. Example 3.13 Depth provided by texel change in size. Figure 3.36 presents two pictures showing the depth provided by the texel change in size. There are many pedals (texels) in Figure 3.36(a), which produce the depth

Y

(a)

(b)

Figure 3.35: Texture change and surface orientation.

O (c)

X

106

3 3-D Shape Information Recovery

(a)

(b)

Figure 3.36: Depth provided by the texel change in size.

impression of a scene with the gradual change of size from the front to the back. There are many columns and windows (texels) on the building in Figure 3.36(b), whose size change produces the depth impression and helps viewers to know the farthest part of the building. ◻∘ 3.4.2.2 Texel Change in Shape The shape of a texel after projection may change. If the original shape of the texels is known, the surface orientation can be induced by the texel change in shape. The orientation of surface is determined by two angles. For example, texture formed by circles will become ellipses if it is put on a tilted plane, as shown in Figure 3.35(b). Here, the direction of the principal axis specifies the angle around the camera axis, while the ratio (aspect ratio) of the major principal axis and the minor principal axis specifies the tilted angle with respect to the camera line of sight. Suppose that the plane equation of circular texels is ax + by + cz + d = 0

(3.131)

The circle can be viewed as the cross-line between the plane and a sphere that has the equation x 2 + y2 + z 2 = r 2

(3.132)

Combining eq. (3.131) and eq. (3.132) yields a2 + c2 2 b2 + c2 2 2adx + 2bdy + 2abxy d2 x + y + = r2 – 2 2 2 2 c c c c

(3.133)

This is an ellipse equation and can be further written as 2

[(a2 + c2 )x +

2 2 ad bd a2 d2 + b2 d2 ] +[(a2 + c2 )y + 2 ] +2abxy = c2 r2 –[ ] (3.134) 2 2 2 a +c b +c a2 + c2

3.4 Texture and Surface Orientation

107

α

Y Z θ

X O

Figure 3.37: A circular texel plane.

From eq. (3.134), the center point’s coordinates, major and minor principal axes can be determined. Based on these parameters, the rotation angle and tilted angle can be calculated. Another method used to judge the deformation of circular texels is to compute the major and minor principal axes of the ellipses. Look at Figure 3.37. The angle between the texture plane and the y-axis is !. In the image obtained, not only the circular texels become ellipses but also the density in the upside is higher than the downside (forming density gradient). If the original diameter of the circle is D, for the circle in the center, the major and minor principal axes are Dmajor (0, 0) = +

D Z

(3.135)

Dminor (0, 0) = +

D cos ! Z

(3.136)

where + is the focus of the camera and Z is the distance from object to lens. Consider texels not on the axis of the camera, such as the bright ellipses in Figure 3.37. If the Y-coordinate of the texel is y, the angle between the line from the origin to the texel and the Z-axis is (, then, Jain (1995) D Dmajor (0, y) = + (1 – tan ( tan !) Z Dminor (0, y) = +

D cos !(1 – tan ( tan !)2 Z

(3.137) (3.138)

The aspect ratio is cos !(1 – tan ( tan !), which will decrease with the increase of (. 3.4.2.3 Texel Change in Spatial Relation If the texture is composed of a regular grid of texels, then the surface orientation information can be recovered by computing the vanishing point. For a projection

108

3 3-D Shape Information Recovery

(a)

Figure 3.38: The regular grid of texels and the vanishing point.

(b)

map, the vanishing point is produced by the projection of the texels at infinity to the image plane. In other words, the vanishing point is the pool of parallel lines at infinity. Example 3.14 The regular grid of texels and the vanishing point Figure 3.38(a) shows a projective map of a cube with parallel grids on the surface; Figure 3.38(b) illustrates the vanishing points of three surfaces. ◻∘ By using two vanishing points obtained from the same regular grid of texels, the direction of the surface can be determined. The straight line that lies on the two vanishing points is called the vanishing line. The direction of the vanishing line indicates the angle that a texel is around the axis of the camera, while the cross-point of the vanishing line with x = 0 indicates the tilted angle of the texel with respect to the camera line of sight, as shown in Figure 3.35(c). A summary of the three methods for determining the surface orientation is given in Table 3.1.

3.4.3 Determination of Vanishing Points If a texture is composed of line segments, the vanishing points can be determined with the help of Figure 3.39. In Figure 3.39(a), a line is represented by + = x cos ( + y sin (

(3.139)

Table 3.1: A summary of the three methods used to determine surface orientation Method

Rotation Angle Around Viewing Line

Tilt Angle with Respect to Viewing Line

Using texel change in size Using texel change in shape Using texel change in spatial relation

Texture gradient direction

Texture gradient value

The direction of major principal axis of texel The direction of line connecting two vanishing points

Ration of texel major and minor principal axes The crosspoint of line connecting two vanishing points and x = 0

3.4 Texture and Surface Orientation

Y

Y

T

109

R

(xv,yv) (xv,yv)

(x,y)

r

(xv,yv)

λ θ O (a)

O

X

X O

(b)

S

(c)

O

w

W

(d)

Figure 3.39: Determination of vanishing points.

Denote “⇒” as a transform from one set to another set. The transform {x, y} ⇒ {λ, θ} maps a line in the image space XY to a point in a parametric space DC. The set of lines in the XY space that share the same vanishing point (xv, yv) will be projected onto a circle in the DC space. Taking λ = √(x² + y²) and θ = arctan(y/x) into the following equation yields

λ = xv cos θ + yv sin θ   (3.140)

Writing the result in the Cartesian system yields

(x − xv/2)² + (y − yv/2)² = (xv/2)² + (yv/2)²   (3.141)

Equation (3.141) represents a circle with radius √((xv/2)² + (yv/2)²) and center (xv/2, yv/2), as shown in Figure 3.39(b). This circle is the trace of the projection into the DC space of all lines that have (xv, yv) as their vanishing point.

The above method for determining the vanishing point has two drawbacks: one is that detecting a circle is more complicated than detecting a line and requires more computation; the other is that when xv → ∞ or yv → ∞, it follows that λ → ∞. To overcome these problems, an alternative transform {x, y} ⇒ {k/λ, θ} can be used, where k is a constant. In this case, eq. (3.140) becomes

k/λ = xv cos θ + yv sin θ   (3.142)

Mapping eq. (3.142) into the Cartesian system (s = (k/λ) cos θ, t = (k/λ) sin θ) yields

k = xv s + yv t   (3.143)

This is a line equation. The vanishing point at infinity is projected to the origin, and all lines that have the same vanishing point (xv, yv) correspond to a single line in the ST space, as shown in Figure 3.39(c). The slope of this line is −xv/yv, so this line is perpendicular to the vector from the origin to the vanishing point (xv, yv), and its distance from the origin is k/√(xv² + yv²). Another Hough transform can be used to detect this line.


Take the ST space as the original space and denote the new Hough space by RW. The line in the ST space corresponds to a point in the RW space, as shown in Figure 3.39(d), with the location coordinates

r = k/√(xv² + yv²)   (3.144)

w = arctan(yv/xv)   (3.145)

From eq. (3.144) and eq. (3.145), the coordinates of the vanishing point can be calculated as

xv = k/(r √(1 + tan² w))   (3.146)

yv = k tan w/(r √(1 + tan² w))   (3.147)
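A minimal numerical sketch of this chain of mappings is given below (Python/NumPy; it assumes the peak (r, w) has already been detected by the second Hough transform, and it writes eqs. (3.146) and (3.147) in the equivalent form xv = (k/r) cos w, yv = (k/r) sin w).

```python
import numpy as np

def vanishing_point_from_peak(r, w, k):
    """Recover the vanishing point from the RW-space peak (r, w), eqs. (3.146)-(3.147)."""
    return (k / r) * np.cos(w), (k / r) * np.sin(w)

# Check the chain of mappings for a known vanishing point (xv, yv).
k, xv, yv = 1.0, 3.0, 4.0
theta = np.linspace(0.1, 1.2, 5)                 # a few lines through the vanishing point
lam = xv * np.cos(theta) + yv * np.sin(theta)    # eq. (3.140)
s, t = (k / lam) * np.cos(theta), (k / lam) * np.sin(theta)
assert np.allclose(xv * s + yv * t, k)           # the trace is the line of eq. (3.143)
r, w = k / np.hypot(xv, yv), np.arctan2(yv, xv)  # eqs. (3.144) and (3.145)
print(vanishing_point_from_peak(r, w, k))        # approximately (3.0, 4.0)
```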

3.5 Depth from Focal Length
The depth of field of a thin lens depends on its focal length λ. The thin lens equation can be derived from Figure 3.40. A ray from an object point P parallel to the optical axis passes through the lens and reaches the image plane. From the geometry of Figure 3.40, the thin lens equation is

1/λ = 1/do + 1/di   (3.148)

where do and di are the distances from the object and from the image to the lens, respectively. When the projection of an object is in perfect focus on the image plane, a clear image is captured. If P is moved along the optical axis, it goes out of focus.

Figure 3.40: Depth of field of the thin lens.


The captured image then becomes a blur disk, which also moves along the optical axis. The diameter of the blur disk depends on both the resolution and the depth of field of the camera. Suppose that the aperture of the lens is A and the diameter of the blur disk is D. Consider the relation between D and the depth of field: if a blur disk of diameter D = 1 (in pixels) can be tolerated, what are the nearest object distance do1 and the farthest object distance do2? Note that at the nearest do the distance between image and lens is di1, while at the farthest do the distance between image and lens is di2. According to Figure 3.40, at the nearest do,

di1 = [(A + D)/A] di   (3.149)

From eq. (3.148), the nearest object distance is

do1 = λdi1/(di1 − λ)   (3.150)

Substituting eq. (3.149) into eq. (3.150) yields

do1 = λ[(A + D)/A]di / ([(A + D)/A]di − λ) = doλ(A + D)/(Aλ + Ddo)   (3.151)

Similarly, the farthest object distance is

do2 = λ[(A − D)/A]di / ([(A − D)/A]di − λ) = doλ(A − D)/(Aλ − Ddo)   (3.152)

The depth of field is defined as the difference between the farthest and nearest object planes for the given imaging parameters and the tolerated D. In this case,

Δdo = do2 − do1 = 2ADdoλ(do − λ) / [(Aλ)² − (Ddo)²]   (3.153)

From eq. (3.153), it is clear that the depth of field increases as the tolerated blur-disk diameter D increases and decreases as the focal length λ increases. If a larger blur disk is tolerated, a larger depth of field is obtained, and a lens with a shorter focal length gives a larger depth of field than a lens with a longer focal length. In other words, using a lens with a long focal length gives a small depth of field. By measuring the focal length, the distance between the object and the lens can be determined.
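As a quick numerical check of eqs. (3.151)–(3.153), the following sketch (Python; the function name and the numbers are illustrative, the units are arbitrary, and it assumes Aλ > D·do so that the far plane do2 is finite) computes the depth of field for two focal lengths.

```python
def depth_of_field(A, D, d_o, lam):
    """Depth of field of a thin lens from eqs. (3.151)-(3.153).
    A: aperture, D: tolerated blur-disk diameter, d_o: focused object distance,
    lam: focal length (all in consistent, arbitrary units)."""
    d_o1 = d_o * lam * (A + D) / (A * lam + D * d_o)   # nearest sharp plane, eq. (3.151)
    d_o2 = d_o * lam * (A - D) / (A * lam - D * d_o)   # farthest sharp plane, eq. (3.152)
    return d_o2 - d_o1                                  # eq. (3.153)

# A shorter focal length gives a larger depth of field for the same tolerated blur.
print(depth_of_field(A=20.0, D=1.0, d_o=500.0, lam=50.0))    # 600.0
print(depth_of_field(A=20.0, D=1.0, d_o=500.0, lam=150.0))   # 120.0
```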

3.6 Pose from Three Pixels According to the principle of projection, one pixel in an image could be the projection result of a line in the 3-D space. To recover the pose of a 3-D surface from a 2-D


image, some additional constraints are required. In the following, a simple method for computing the pose of an object from three pixels resulting from the projection of that object is introduced, Shapiro (2001). Here, the assumptions are that a geometric model of the object is known and the focal length of the camera is known.

3.6.1 Perspective Three-Point Problem
Using the 2-D image properties of a perspective transformation to compute a 3-D object’s properties is an inverse perspective problem. Here, three points are used and thus it is called the perspective three-point problem (P3P). The coordinate relations among the image, the camera, and the object are shown in Figure 3.41. Given three points Wi (i = 1, 2, 3) on an object and their corresponding pixels pi in the image, the task is to compute the coordinates of Wi based on pi. Note that the line from the origin through pi also passes through Wi, so by denoting vi as the unit vector along this line (pointing from the origin toward pi), the coordinates of Wi can be obtained from

Wi = ki vi,  i = 1, 2, 3   (3.154)

The distances among the three points are (m ≠ n)

dmn = ‖Wm − Wn‖   (3.155)

Taking eq. (3.154) into eq. (3.155) yields

d²mn = ‖km vm − kn vn‖² = k²m − 2km kn (vm ∙ vn) + k²n   (3.156)

3.6.2 Iterative Solution
There are now three quadratic equations in the three unknowns ki, in which the three d²mn are known from the object model and the three dot products vm ∙ vn are computed from the pixel coordinates, so the P3P problem (computing the positions of the three points Wi) is reduced to solving three quadratic equations.

Figure 3.41: The coordinate system for pose estimation.

In theory, eq. (3.156) has eight solutions (eight groups of [k1, k2, k3]). From Figure 3.41, however, it is clear that if [k1, k2, k3] is a solution, then [−k1, −k2, −k3] must also be a solution. Since the object can only be on one side of the camera in practice, at most four solutions are possible. It has also been proven that only two solutions are possible in the common case. The solutions can be obtained by non-linear optimization. The task is to find the roots ki of

f(k1, k2, k3) = k1² − 2k1k2(v1 ∙ v2) + k2² − d12²
g(k1, k2, k3) = k2² − 2k2k3(v2 ∙ v3) + k3² − d23²   (3.157)
h(k1, k2, k3) = k3² − 2k3k1(v3 ∙ v1) + k1² − d31²

Suppose that the initial values are close to [k1, k2, k3], but f(k1, k2, k3) ≠ 0. Now, some small changes [Δ1, Δ2, Δ3] are needed to drive f(k1 + Δ1, k2 + Δ2, k3 + Δ3) toward zero. Linearizing f(k1 + Δ1, k2 + Δ2, k3 + Δ3) in the neighborhood of [k1, k2, k3] yields

f(k1 + Δ1, k2 + Δ2, k3 + Δ3) ≈ f(k1, k2, k3) + [∂f/∂k1  ∂f/∂k2  ∂f/∂k3] [Δ1  Δ2  Δ3]^T   (3.158)

Similar equations can be obtained for the functions g and h. Combining the three results and setting the left side to zero yields

[0]   [f(k1, k2, k3)]   [∂f/∂k1  ∂f/∂k2  ∂f/∂k3] [Δ1]
[0] = [g(k1, k2, k3)] + [∂g/∂k1  ∂g/∂k2  ∂g/∂k3] [Δ2]   (3.159)
[0]   [h(k1, k2, k3)]   [∂h/∂k1  ∂h/∂k2  ∂h/∂k3] [Δ3]

The matrix of partial derivatives is the Jacobian matrix J. The Jacobian matrix J of the functions f, g, and h has the following form (vmn = vm ∙ vn)

                [J11 J12 J13]   [(2k1 − 2v12k2)  (2k2 − 2v12k1)  0              ]
J(k1, k2, k3) ≡ [J21 J22 J23] = [0               (2k2 − 2v23k3)  (2k3 − 2v23k2) ]   (3.160)
                [J31 J32 J33]   [(2k1 − 2v31k3)  0               (2k3 − 2v31k1) ]

If the Jacobian matrix J at (k1, k2, k3) is invertible, the following solution for the changes to the parameters exists


[Δ1  Δ2  Δ3]^T = −J⁻¹(k1, k2, k3) [f(k1, k2, k3)  g(k1, k2, k3)  h(k1, k2, k3)]^T   (3.161)

Adding the above changes to the previous parameters and denoting K^l as the l-th iterate yields

K^(l+1) = K^l − J⁻¹(K^l) f(K^l)   (3.162)

where f(K) denotes the vector [f  g  h]^T evaluated at K. The whole algorithm is summarized in the following.
Input: Three point pairs (Wi, pi), the focal length of the camera, and the tolerance on distance.
Output: Wi (coordinates of the 3-D points).
Step 1: Initialize. Compute d²mn from the object model according to eq. (3.155). Compute vi and vm ∙ vn from pi. Choose a starting parameter vector K¹ = [k1, k2, k3].
Step 2: Iterate until f(K^l) ≈ 0. Compute K^(l+1) according to eq. (3.162). Stop if |f(K^(l+1))| is within the tolerance or the number of iterations exceeds the limit.
Step 3: Compute the pose. According to eq. (3.154), compute Wi using K^(l+1).
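The NumPy sketch below implements the iteration of eqs. (3.157)–(3.162) under the stated assumptions; the unit viewing vectors vi are taken as already computed from the pixels pi and the focal length, and the function and variable names are illustrative, not from the book.

```python
import numpy as np

def p3p_iterate(v, d2, k0, tol=1e-10, max_iter=100):
    """Newton iteration for P3P, eqs. (3.157)-(3.162).
    v: 3x3 array whose rows are the unit viewing vectors v1, v2, v3;
    d2: squared model distances (d12^2, d23^2, d31^2); k0: starting [k1, k2, k3]."""
    v12, v23, v31 = v[0] @ v[1], v[1] @ v[2], v[2] @ v[0]
    k = np.asarray(k0, dtype=float)
    for _ in range(max_iter):
        k1, k2, k3 = k
        F = np.array([k1**2 - 2*k1*k2*v12 + k2**2 - d2[0],      # f, eq. (3.157)
                      k2**2 - 2*k2*k3*v23 + k3**2 - d2[1],      # g
                      k3**2 - 2*k3*k1*v31 + k1**2 - d2[2]])     # h
        if np.max(np.abs(F)) < tol:
            break
        J = np.array([[2*k1 - 2*v12*k2, 2*k2 - 2*v12*k1, 0.0],  # Jacobian, eq. (3.160)
                      [0.0, 2*k2 - 2*v23*k3, 2*k3 - 2*v23*k2],
                      [2*k1 - 2*v31*k3, 0.0, 2*k3 - 2*v31*k1]])
        k = k - np.linalg.solve(J, F)                            # update, eq. (3.162)
    return k[:, None] * v                                        # Wi = ki*vi, eq. (3.154)

# Synthetic check: build vi from known 3-D points and recover them.
W = np.array([[0.0, 1.0, 5.0], [1.0, 0.0, 6.0], [1.0, 1.0, 4.0]])
v = W / np.linalg.norm(W, axis=1, keepdims=True)
d2 = [((W[0]-W[1])**2).sum(), ((W[1]-W[2])**2).sum(), ((W[2]-W[0])**2).sum()]
print(p3p_iterate(v, d2, k0=[5.0, 6.0, 4.0]))   # close to W (one of the admissible solutions)
```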

3.7 Problems and Questions
3-1 Why is it possible to recover a 3-D scene from a single image?
3-2 Given a hemisphere with a Lambert surface, reflection coefficient r, and radius d, suppose that the incident light and the viewpoint are both at the top of the hemisphere. Compute the distribution of the reflection. Can you determine whether the surface is convex or concave?
3-3 Suppose that the hemisphere in Figure 3.4 has a Lambert surface. If the ray is parallel to the N-axis and in a downward direction, what will be the reflection map at the patch shown in the figure?
3-4 An ellipsoid x²/4 + y²/4 + z²/2 = 1 has an ideal scatter surface. If the radiance is 10 and the reflection coefficient is 0.5, obtain the irradiance observed at (1, 1, 1).
3-5* Suppose that the equation for a sphere with a Lambert surface is x² + y² + z² = r². When this sphere is illuminated by a point source with direction cosines (a, b, c), write the observed light intensity at a point (x, y) on the surface of the sphere.


3-6 Suppose that the observer is traveling along the Z-axis with a speed S and the object is also traveling along the Z-axis but with speed T. If S > T, write out the optical flow constraint equation.
3-7 How many cases exist for the intersection of four planes? How do you determine whether these intersections are convex or concave?
3-8 Compare the image brightness constraint equation and the optical flow constraint equation, and find their common points and differences.
3-9* Given a surface with a texture formed by circles, and these circles become ellipses of 0.04 m × 0.03 m after imaging projection, if the angle between the long axis of the ellipse and the X-axis is 135°, determine the orientation of the surface.
3-10 Suppose that there is a vanishing point in the projection map of a regular grid. If x = 0, y = 0, and y = 1 – x are lines passing through the vanishing point, what are the coordinates of this vanishing point?
3-11 Suppose that the aperture of a lens is 20 and the diameter of a tolerable blur disk is 1. When the focal length of the lens is 50, what is the depth of field? Repeat the calculations for focal lengths 150 and 500.
3-12 Design a set of data, including three groups of corresponding point pairs (Wi, pi), i = 1, 2, 3. Suppose the focal length of the camera is λ and a tolerance on the distance error is given. Use the iterative algorithm given in Section 3.6 to compute Wi.

3.8 Further Reading
1. Photometric Stereo
– Detailed discussion on photometric stereo can be found in Horn (1986).
– More discussion on illumination for image capturing can be found in Jähne (1999a).
2. Structure from Motion
– More discussion on structure from motion can be found in Forsyth (2003).
– Discussion about geometric aspects can be found in Hartley (2004).
3. Shape from Shading
– Shape from shading uses the intensity change of an object region to solve for the orientation of the object surface, Jähne (2000), Forsyth (2003).
4. Texture and Surface Orientation
– More information and examples on surface orientation from texture changes can be found in Aloimonos (1985), Shapiro (2001), and Forsyth (2003).
5. Depth from Focal Length
– Discussion on depth from focal length can also be found in Castleman (1996) and Jähne (1999a).
6. Pose from Three Pixels
– More details on the P3P problem can be found in Shapiro (2001).

4 Matching and Understanding
Understanding an image and understanding a scene are complex tasks that include many processes, such as observation, perception (combining visual input with some pre-defined representation), recognition (establishing the relationship among different internal representations), matching, cognition, interpretation, and so on. Among these processes, matching, which aims to connect something unknown with known facts, knowledge, and rules, is a critical task in image understanding.
Current matching techniques can be classified into two classes. One class is more concrete, corresponds to sets of pixels, and is called image matching. The other class is more abstract, corresponds to the properties of objects and the semantics of the scene, and is called generalized matching. Some techniques for image matching have already been discussed in Chapter 2. This chapter is mainly focused on the approaches and techniques of generalized matching.
The sections of this chapter are arranged as follows:
Section 4.1 first presents some matching strategies and classification methods, and then discusses the links and differences between matching and registration.
Section 4.2 focuses on the principles and measurement methods of object matching, and introduces several basic object matching techniques.
Section 4.3 introduces a dynamic pattern (a group of objects) matching technique, which is characterized by the fact that the patterns to be matched are established dynamically during the matching process.
Section 4.4 describes the matching of diverse types of relationships between objects; since the relationships can express different attributes of the object set, the matching here is relatively abstract and its application is more extensive.
Section 4.5 first introduces some basic definitions and concepts of graph theory, and then focuses on how to use graph isomorphism to match abstract structures or spatial relationships.
Section 4.6 introduces a labeling method for line drawings that expresses the relationship between the surfaces of a 3-D scene/object. With the help of such a labeling method, the 3-D scene and the corresponding model can also be matched.

4.1 Fundamental of Matching
In understanding an image, matching plays an important role. From the point of view of vision, searching for the required objects in a scene and matching the characteristics of these objects with a model are the main steps toward understanding the meaning of a scene.


4.1.1 Matching Strategy and Groups
Matching can be performed in different (abstract) layers. For each concrete matching process, two already existing representations are put into correspondence. When the two representations are of the same kind, matching is performed in the common domain. For example, if the two representations are both images, it is called image matching. If the two representations are both objects, it is called object matching. If the two representations are both relation structures, it is called relation matching. When the two representations are different, matching is performed in an extended sense, Ballard (1982).
Matching can be used to establish the relationship between two things, which can be accomplished via mapping. Matching can be classified into two groups according to the mapping functions used, image matching and object matching, as shown in Figure 4.1, Kropatsch (2001).

4.1.1.1 Matching in Object Space For matching in object space, the object O can be directly reconstructed by taking the inverse projective transforms TO1 and TO2 . An explicit model for O is required, and problem solving is achieved by establishing the correspondence between the image features and the object model features. The advantage of techniques based on matching in the object space is that these techniques fit the physical world. Therefore, even the case of occlusion can be treated.

4.1.1.2 Matching in Image Space
Matching in image space directly relates images I1 and I2 by the mapping function T12. In this case, the object model is implicitly included in the process of establishing T12. Generally, this process is quite complicated. If the surface of the object is smooth, an approximation using the affine transform reduces the computational complexity to a level comparable to that of matching in object space. However, in the case of occlusion, matching becomes hard to perform, as the assumption of smoothness no longer holds. The algorithms for image matching can be classified into three groups, according to the image model used.

Figure 4.1: Image matching and object matching.


Matching Based on Raster
It uses the raster representation of an image; that is, it tries to find a mapping function of the image regions by directly comparing gray-level functions. It can achieve highly accurate results, but it is sensitive to occlusion.
Matching Based on Features
In feature-based matching, the symbolic description of an image is decomposed by extracting salient features from the image. Then the search for corresponding features is performed according to the local geometric properties of the object description. Compared to matching based on raster, matching based on features is more suitable for cases with discontinuity and approximation.
Matching Based on Relationship
It is also called matching based on structure. Its techniques are based on the similarity among topological relations of features. Such a similarity exists in the feature adjacency graph. Matching based on relationship is quite general and can be used in many applications. However, it may produce a complicated search tree and so it has a high computational complexity.

4.1.2 Matching and Registration
Matching and registration are two closely related concepts and processes, and the techniques for these processes share many common points. However, they differ in scope when examined in detail. Registration has a narrower meaning. It mainly indicates the spatial or temporal correspondence among images, especially the geometric connection. The result of registration is largely at the pixel layer. Matching has a wider meaning: not only the gray-level or geometric properties, but also some abstract properties and attributes are considered.
There are also differences between image registration and stereo matching. The former needs to establish the correspondence between point pairs and to compute the parameters of the coordinate transform between two images on the basis of this correspondence. The latter needs only to establish the correspondence between the pairs of points and to compute the disparity for each point pair.
Considering the concrete techniques, registration can often be achieved with the help of the coordinate transform and the affine transform. Four factors or steps are important in registration methods, Lohmann (1998):
(1) Determine the feature space for the features used in registration.
(2) Restrict the search range and determine the search space so that the search process has a solution.
(3) Find a search strategy for scanning the search space.
(4) Define a similarity metric to determine whether a match exists.
Registration can be performed either in the image space or in a transformed space. One group of transformed-space methods is the frequency space correlation,


which performs correlation in the frequency space for registration. The images to be registered are first transformed into the frequency space by the fast Fourier transform. The corresponding relation is established in the frequency space by using either the phase information or the magnitude information. The former is called a phase correlation method and the latter a magnitude correlation method. The phase correlation computation for image translation, using the translation property of the Fourier transform, is introduced in the following using an example.
The translation vector between two images can be computed directly from the phases of their power spectra. Suppose that two images f1(x, y) and f2(x, y) have the following translation relationship

f1(x, y) = f2(x − x0, y − y0)   (4.1)

Then, their corresponding Fourier transforms are related by

F1(u, v) = F2(u, v) exp[−j2π(ux0 + vy0)]   (4.2)

If the Fourier transforms F1(u, v) and F2(u, v) of the two images f1(x, y) and f2(x, y) are combined into their normalized mutual power spectrum, the phase correlation is

exp[−j2π(ux0 + vy0)] = F1(u, v)F2*(u, v) / |F1(u, v)F2*(u, v)|   (4.3)

where the inverse Fourier transform of exp[−j2π(ux0 + vy0)] is δ(x − x0, y − y0). This implies that the relative translation of the two images f1(x, y) and f2(x, y) in the image space is (x0, y0). This translation can be determined by searching for the maximum value (caused by the pulse). The main steps of Fourier transform-based phase correlation are:
(1) Compute the Fourier transforms, F1(u, v) and F2(u, v), of the two images f1(x, y) and f2(x, y) that need to be registered.
(2) Remove the DC components and high-frequency noise and compute the product of the frequency spectrum components.
(3) Compute the (normalized) mutual power spectrum according to eq. (4.3).
(4) Obtain the inverse Fourier transform of the normalized mutual power spectrum.
(5) Search for the coordinates of the peak value, as these coordinates give the relative translation.
The computational complexity of the above registration method depends only on the size of the images and is not related to the overlapping regions of the images. This method uses only the phase information of the mutual power spectrum for image registration. Its advantages are simple computation and insensitivity to brightness changes between images (it is little influenced by illumination variation). Since the peak is sharp, high accuracy can be obtained.
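The following sketch (Python/NumPy; illustrative only, it skips step (2), the DC and noise filtering, and uses a circular synthetic shift so that the translation is recovered exactly) implements the remaining steps of this procedure.

```python
import numpy as np

def phase_correlation(f1, f2):
    """Estimate the translation (x0, y0) between two same-size images via the
    normalized mutual power spectrum, following eqs. (4.1)-(4.3)."""
    F1, F2 = np.fft.fft2(f1), np.fft.fft2(f2)
    cross = F1 * np.conj(F2)
    cross /= np.abs(cross) + 1e-12            # normalization; small term avoids dividing by zero
    corr = np.real(np.fft.ifft2(cross))       # approximates delta(x - x0, y - y0)
    y0, x0 = np.unravel_index(np.argmax(corr), corr.shape)
    return int(x0), int(y0)                   # peak coordinates give the relative translation

rng = np.random.default_rng(0)
f2 = rng.random((64, 64))
f1 = np.roll(f2, shift=(5, 9), axis=(0, 1))   # f1(x, y) = f2(x - x0, y - y0), with y0 = 5, x0 = 9
print(phase_correlation(f1, f2))              # -> (9, 5)
```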


4.2 Object Matching
Image matching takes the pixel as the unit, so the computational load is high and the efficiency is generally low. In real applications, the interesting objects are first detected and extracted, and the matching is then performed on the objects. If simple representations are used, the matching computation can be greatly reduced. Since objects can be represented in various ways, the matching of objects can also be performed by various methods. Some examples are string matching, shape number matching (Section 6.4.1), and inertia equivalent ellipse matching.

4.2.1 Measurement of Matching
To perform object matching and to meet the requirements of applications, some measurements of the quality and criteria of matching are needed.

4.2.1.1 Hausdorff Distance
An object is composed of many points. The matching of two objects can thus be considered the matching between two point sets. The Hausdorff distance (HD) can be used to describe the similarity of point sets, and matching using the HD has been widely employed. Given two finite point sets A = {a1, a2, . . . , am} and B = {b1, b2, . . . , bn}, their HD is defined by

H(A, B) = max[h(A, B), h(B, A)]   (4.4)

where

h(A, B) = max_{a∈A} min_{b∈B} ‖a − b‖   (4.5)

h(B, A) = max_{b∈B} min_{a∈A} ‖b − a‖   (4.6)

In the above equations, the norm ‖ ⋅ ‖ can take different forms. The function h(A, B) is called the directed HD from set A to set B; it gives the largest of the distances from each point a ∈ A to its nearest point in set B. Similarly, the function h(B, A) is called the directed HD from set B to set A; it gives the largest of the distances from each point b ∈ B to its nearest point in set A. Since the functions h(A, B) and h(B, A) are not symmetric, their maximum is taken as the HD between the two sets. The HD has a simple geometric meaning: if the HD between sets A and B is d, then every point of one set lies within a circle of radius d centered at some point of the other set. If the HD of two point sets is 0, the two sets coincide.
The above original HD selects the most mismatched point of the two point sets as the distance measure; thus, it is very sensitive to noise, outliers, and disturbances.


An improved measurement is called the modified Hausdorff distance (MHD), which uses the average value instead of the maximum value, Dubuisson (1994). In other words, eq. (4.5) and eq. (4.6) are replaced by

hMHD(A, B) = (1/NA) Σ_{a∈A} min_{b∈B} ‖a − b‖   (4.7)

hMHD(B, A) = (1/NB) Σ_{b∈B} min_{a∈A} ‖b − a‖   (4.8)

where NA is the number of points in set A and NB is the number of points in set B. Taking eqs. (4.7) and (4.8) into eq. (4.4) yields

HMHD(A, B) = max[hMHD(A, B), hMHD(B, A)]   (4.9)

The MHD calculates the distances from each point in A to the nearest point in B or from each point in B to the nearest point in A and uses the average of them as the distance measurement instead of choosing the maximum one. Hence, the MHD is less sensitive to noise and outlier points.
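A compact NumPy sketch of both measures is given below (illustrative only; it uses the Euclidean norm, although, as noted above, other norms are also possible).

```python
import numpy as np

def directed_distances(A, B):
    """For each point of A, the distance to its nearest point of B (min over B)."""
    diff = A[:, None, :] - B[None, :, :]            # pairwise differences
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)

def hausdorff(A, B):
    """Hausdorff distance, eqs. (4.4)-(4.6): maximum of the two directed distances."""
    return max(directed_distances(A, B).max(), directed_distances(B, A).max())

def modified_hausdorff(A, B):
    """Modified Hausdorff distance (MHD), eqs. (4.7)-(4.9): averages replace the maxima."""
    return max(directed_distances(A, B).mean(), directed_distances(B, A).mean())

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[0.1, 0.0], [1.0, 0.2], [0.0, 1.0], [5.0, 5.0]])  # one outlier point
print(hausdorff(A, B), modified_hausdorff(A, B))  # the outlier inflates the HD much more than the MHD
```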

4.2.1.2 Structure Matching Metrics Objects can often be decomposed into parts. Different objects can have the same parts but different structures. For structural matching, most matching metrics may be explained with the physical analogy of “templates and springs,” Ballard (1982). A structural matching is a matching between a reference structure and a structure to be matched. Imagine that the reference structure is put on a transparent rubber sheet. The matching process moves this sheet over the structure to be matched, distorting the sheet to get the best match. A match is not merely a correspondence, but a correspondence that has been quantified according to its goodness. The final goodness of match depends on the individual matches between the elements of the structures to be matched, and on the amount of work that it takes to distort the sheet. A computationally tractable form of this idea is to consider the model as a set of rigid templates connected by springs, as shown in Figure 4.2 for a face. The templates are connected by springs whose tension is a function of the relations between the elements of structures. A spring function can be arbitrarily complex and nonlinear. For example, the tension in the spring can attain very high or infinite values for configurations of the templates that are not allowed. Nonlinearity is good for such constraints as in a face image; the two eyes must be in a horizontal line and must be within fixed limits of distance. The quality of the match is a function of the goodness of the fit of the templates locally and the amount of energy needed to stretch the springs to


Figure 4.2: A template and a spring model of a face.


force the input onto the reference data. Costs may be imposed for missing or extra elements. A general form of matching metric for a template and spring model is given by

C = Σ_{d∈Y} CT[d, F(d)] + Σ_{(d,e)∈(Y×E)} CS[F(d), F(e)] + Σ_{c∈(N ∪ M)} CM(c)   (4.10)


where CT represents the dissimilarity between the template and the structure to be matched, CS represents the relation dissimilarity between the matched parts of the template and the structure to be matched, and CM represents the penalties for missing elements. Function F(.) represents the mapping from templates of the reference structure to elements of the structure to be matched. It partitions the reference templates into two classes: One can be found in the structure to be matched (belongs to the set Y) and the other cannot be found in the structure to be matched (belongs to set N). In analogy, parts can also be classified into two groups: One can be found in the structure to be matched (belongs to set E) and the other cannot be found in the structure to be matched (belongs to set M). As with other correlation metrics (e. g., Section 2.2), there are normalization issues involved in the structural matching metrics. The number of matched elements may affect the ultimate magnitude of the metric. For instance, if springs always have a finite cost, then the more elements are matched, the higher the total spring energy must be. However, this should not be taken to imply that a match of many elements is worse than a match of a few elements. On the other hand, a small, elegant match of a part of the input structure with one particular reference object may have much of the search structure unexplained. This good submatch may be less helpful than a match that explains more of the input. In eq. (4.10), this case is solved by acknowledging the missing category of elements.


4.2.2 String Matching
A string is a 1-D data structure. String matching techniques can be used to match two sequences of feature points or two boundaries of objects. Suppose that two boundaries of objects, A and B, are coded as strings a1 a2 . . . an and b1 b2 . . . bm (see Section 3.1.5 of Volume II in this book set). Starting from a1 and b1, if ak = bk in the k-th position, a match occurs. Let M represent the number of matches between the two strings; the number of symbols that do not match is then

Q = max(||A||, ||B||) − M   (4.11)

where ||arg|| is the length (number of symbols) of the string representation of the argument. It can be shown that Q = 0 if and only if A and B are identical. A simple measurement of the similarity between A and B is the ratio

R = M/Q = M/[max(||A||, ||B||) − M]   (4.12)

It can be seen from eq. (4.12) that the larger the value of R is, the better the matching is. For a perfect match, R is infinite, and for a nonmatch between symbols in A and B (M = 0), R is zero. Because matching is done symbol by symbol, the starting point on each boundary is important in terms of reducing the amount of computation. Any method that normalizes to the same starting point is helpful. 4.2.3 Matching of Inertia Equivalent Ellipses The matching between objects can also be performed with the help of their inertia equivalent ellipses. This has been used in the registration process for reconstructing 3-D objects from serial sections, Zhang (1991). Different from the matching based on object contours, which uses only the contour points, matching based on inertia equivalent ellipses uses all points of object regions. For any object, its ellipse of inertia can be computed as in Section 13.3. Starting from the ellipse of the inertia of object, its inertia equivalent ellipse can be obtained. From the matching point of view, once the objects to be matched are represented by their (inertia) equivalent ellipses, the problem of object matching becomes the problem of ellipse matching. This process is illustrated in Figure 4.3. For the object matching, coordinate transforms for translation, rotation, and scaling are commonly used. Parameters for these transformations can be obtained from the center coordinates of equivalent ellipses, the orientation angles (the angle between the major ellipse axis and X-axis) of two equivalent ellipses, and the lengths of two major ellipse axes.

124

Object 1

4 Matching and Understanding

Equivalent ellipse 1

Equivalent ellipse 2

Object 2

Figure 4.3: Matching with the help of inertia equivalent ellipses.

Consider first the center coordinates of equivalent ellipses; that is, the centroid of the object. Suppose that an object region consists of N pixels, then the centroid of this object is given by xc =

1 N ∑x N i=1 i

(4.13)

yc =

1 N ∑y N i=1 i

(4.14)

The translation parameters can be computed from the center coordinates of two equivalent ellipses. Next, consider the orientation angle of equivalent ellipses. It can be obtained from the two slopes of ellipse axis, k and l (let A be the inertia of the object around the X-axis and B be the inertia of the object around the Y-axis) as 6={

arctan(k) arctan(l)

AB

(4.15)

The rotation parameters can be computed from the difference of the two orientation angles. Finally, the lengths of the major and minor axes provide information on the object size. If the object is an ellipse, it is the same as its equivalent ellipse. While in general, the equivalent ellipse of an object approximates the object in both the inertia and size. The length of the axes can be normalized by using the object size. After normalization, the major axis of equivalent ellipse a in case of A < B can be computed (let H be inertia product) as a=√

2[(A + B) – √(A – B)2 + 4H 2 ] M

(4.16)

The scaling parameters can be computed from the ratio of the major axis lengths of equivalent ellipses. All parameters of the above three transformations can be computed independently. Therefore, the transforms for a matching can be carried out in sequence. One advantage of using the elliptical representation is the suppression of small local fluctuations. The averaging property inherent in the derivation of elliptical representation provides a global, rather than local, description of the object region.

4.3 Dynamic Pattern Matching

125

4.3 Dynamic Pattern Matching In dynamic pattern matching, the patterns are constructed dynamically as opposed to the previously introduced methods. The construction is based on the data to be matched and produced during the match process, Zhang (1990).

4.3.1 Flowchart of Matching In 3-D cell reconstruction from a serial section, one important task is to establish the correspondence between the profiles in adjacent sections. The flowchart for such a process is shown in Figure 4.4. In Figure 4.4, two sections to be matched are called the matched section and the matching section, respectively. The matched section is taken as a reference section. Once all profiles in the matching section are aligned with that of the matched section, this matching section becomes a matched section. Then it can be used as the reference section of the next matching section. This process can be repeated until all profiles in a serial of sections are matched. As it can be seen from the flowchart, dynamic pattern matching has six steps: (1) Select a matched profile from the matched section. (2) Construct the pattern representation of the selected profile. (3) Determine the candidate region in the matching section. (4) Select matching profiles from the candidate region. (5) Construct the pattern representations for matching profiles. (6) Test the similarity between two profile patterns to determine the correspondence in profiles. In the above technique, the local geometric information is used to search the corresponding profiles in consecutive sections and the human behavior is simulated in a pattern matching procedure.

Selection

Determination

Selection

Construction

Parameter

Construction

No-matching Pattern test

Matching

Figure 4.4: The flowchart of matching for dynamic patterns.

Matching results

126

4 Matching and Understanding

4.3.2 Absolute Patterns and Relative Patterns In dynamic pattern matching, the local “constellation” of profile points in one section is used as a template to search for a similar constellation in an adjacent section. For this purpose, a pattern representation is dynamically constructed for each profile. This pattern considers the relative positions between the profile to be matched and some profiles around it. For each matched profile, the m nearest surrounding profiles are chosen in the construction of the related matched pattern. This pattern will be composed of these m profiles together with the center profile. The pattern is specified by the center point coordinates (xo , yo ), the m distances measured from m surrounding profiles to the center point (d1 , d2 , . . . , dm ), and the m angles between m pairs of distance lines ((1 , (2 , . . . , (m ). A pattern vector can therefore be written as Pl = P(xl0 , yl0 , dl1 , (l1 , . . . , dlm , (lm )T .

(4.17)

Each element in the pattern vector can be considered a feature of the center profile. The maximum of m distances is called the diameter of the pattern. For each potential matching profile, a similar pattern needs to be constructed. All surrounding profiles falling into a circle (around the matching profile), which have the same diameter as the matched pattern are taken into account. These pattern vectors can be written in a similar manner, Pr = P(xr0 , yr0 , dr1 , (r1 , . . . , drn , (rn )T

(4.18)

The number n may be different from m (and can be different from one pattern to another), because of the deformation of the section and/or the end of the continuation of objects in the section. The above process for pattern formation is automatic. Moreover, this pattern formation method is rather flexible, as here no specific size has been imposed to construct these patterns. Those constructed patterns are also allowed to have a different number of profiles. The pattern thus constructed is called an “absolute pattern,” because its center point has absolute coordinates. One example of an absolute pattern is illustrated in Figure 4.5(a). Each pattern thus formed is unique in the section. Every pattern vector belongs to a fixed profile and presents specific properties of this profile. To match corresponding points in two adjacent sections by means of their patterns, translation and rotation of patterns are necessary. The absolute pattern formed as described above is invariant to the rotation around the center point because it is circularly defined. However, it is sensitive to translations (see Figure 4.5(b)). To overcome this inconvenience, from each absolute pattern, a corresponding “relative pattern” by discarding the coordinates of center point from the pattern vector is further constructed. The relative pattern vectors corresponding to eq. (4.17) and eq. (4.18) are given by

4.3 Dynamic Pattern Matching

Y

127

Y d1

y0

y' 0 y0

d1

θ2 dm

O

θm

θ1

θ1

d2

dm

θm

d2

x0

O

X

(a)

θ2

X

x0 x'0

(b)

Figure 4.5: Absolute patterns.

Y

Y

d1 θ1

θ1

d2

θm

d1

θ2 dm

dm

θm

O (a)

θ2

d2

X

O

X

(b)

Figure 4.6: Relative patterns.

Ql = Q(dl1 , (l1 , ⋅ ⋅ ⋅ , dlm , (lm )T

(4.19)

T

(4.20)

Qr = Q(dr1 , (r1 , ⋅ ⋅ ⋅ , drn , (rn )

The relative pattern corresponding to the absolute pattern in Figure 4.5(a) is shown in Figure 4.6(a). The uniqueness of the relative pattern is related to the number of surrounding profiles in the pattern and the distribution of profiles in sections. It can be seen from Figure 4.6(b) that the relative pattern has not only the rotation invariance but also the translation invariance. The above procedure has been used in a real application for symbolic reconstruction of profiles in consecutive sections, Zhang (1991). Using a pattern similarity test, which is based on the relative position measurements of all corresponding profile pairs, the relationship among different profile constellations can be determined. In general, more than 95% of the decisions made by the automatic procedure correspond to the human decision, while the percentage for normal global transformation-based methods is only 60–70% for the same data.

128

4 Matching and Understanding

4.4 Relation Matching The real world can be decomposed into many objects, and each object can still be decomposed into many parts/components. There are many relations among different components. An object can be represented by a set of relations among its components. Relation matching is an important task in image understanding. In relation matching, the two representations to be matched are both relations, in which one is often called an object to be matched and the other is called a model. In the following, the main steps of relation matching are introduced. Relation matching can be considered as finding the matching model when an object to be matched is given. Suppose that there are two relation sets Xl and Xr , where Xl belongs to the object to be matched and Xr belongs to the model. They can be represented as Xl = {Rl1 , Rl2 , . . . , Rlm }

(4.21)

Xr = {Rr1 , Rr2 , . . . , Rrm }

(4.22)

where Rl1 , Rl2 , . . . , Rlm and Rr1 , and Rr2 , . . . , Rrn are the representations of different relations between the parts of the object to be matched and the model. Example 4.1 Objects and relation representations. An object is illustrated by the drawing in Figure 4.7(a), which can be considered a frontal view map of a table. It has three parts and can be represented by Ql = (A, B, C). The relation set of these parts can be represented by Xl = {R1 , R2 , R3 }, where R1 represents the connection relation, R1 = [(A, B) (A, C)], R2 represents the up-down relation, R2 = [(A, B) (A, C)] and R3 represents the left-right relation, R3 = [(B, C)]. Another object is illustrated by the drawing in Figure 4.7(b), which can be considered a frontal view map of a table with a drawer in the middle. It has four parts and can be represented by Qr = {1, 2, 3, 4}. The relation set of these parts can be represented by Xr = {R1 , R2 , R3 }, where R1 represents the connection relation, R1 = [(1, 2) (1, 3) (1, 4) (2, 4) (3, 4)], R2 represents the up-down relation, R2 = [(1, 2) (1, 3) (1, 4)] and R3 represents the left-right relation, R3 = [(2, 3) (2, 4) (4, 3)]. ◻∘ Let dis(Xl , Xr ) denote the distance between Xl and Xr , which is formed by the differences of corresponding relation representations; that is, all dis(Rl , Rr ). The matching between Xl and Xr is the matching of all corresponding relations between the A

1 4

B

(a)

C

2

(b)

3 Figure 4.7: Illustrations of objects and relation representations.

4.4 Relation Matching

129

two sets. In the following, one relation is considered first. Using Rl and Rr to represent these two relation representations, it has Rl ⊆ SM = S(1) × S(2) × ⋅ ⋅ ⋅ × S(M)

(4.23)

N

Rr ⊆ T = T(1) × T(2) × ⋅ ⋅ ⋅ × T(N)

(4.24)

Define p as the transform from S to T, and p–1 as the transform from T to S. Denote symbol ⊕ as a composite operation, such that Rl ⊕ p represents the transform Rl by p (i.e., transform SM to T N ). In analog, Rr ⊕ p–1 represents the transform Rr by p–1 (i.e., transform T N to SM ) Rl ⊕ p = f [T(1), T(2), . . . , T(N)] ∈ T N –1

Rr ⊕ p

= g[S(1), S(2), . . . , S (M)] ∈ S

M

(4.25) (4.26)

where both f and g stand for combinations of some relation representations. Look at dis(Rl , Rr ). If the corresponding terms in these two relation representations are not equal, then for any corresponding relation p, the following four errors could exist E1 = {Rl ⊕ p – (Rl ⊕ p) ⋂ Rr } } } } } E2 = {Rr – Rr ⋂(Rl ⊕ p)} –1 –1 } E3 = {Rr ⊕ p – (Rr ⊕ p ) ⋂ Rl } } } } E4 = {Rl – Rl ⋂(Rr ⊕ p–1 )} }

(4.27)

The distance between two relation representations Rl and Rr is the sum of error terms in eq. (4.27) dis(Rl , Rr ) = ∑ Wi Ei

(4.28)

i

where weights Wi are used to count the influence of different error terms. If all corresponding terms in two relation representations are equal, then it is always possible to find a transform p that satisfies Rr = Rl ⊕ p and Rr ⊕ p–1 = Rl . In other words, the distance computed by eq. (4.28) would be zero. In this case, it is said that Rl and Rr are fully matched. In practice, the number of errors in E can be denoted C(E), and eq. (4.28) can be rewritten as disC (Rl , Rr ) ∑ Wi C(Ei )

(4.29)

i

To match Rl and Rr , a corresponding transform, which yields the minimum error (counted by number of errors) between Rl and Rr , needs to be searched. Note that E is a function of p, so the transform p looked for should satisfy

130

4 Matching and Understanding

disC (Rl , Rr ) = inf {∑ Wi C [Ei (p)]} p

(4.30)

i

Referring back to eqs. (4.21) and (4.22), to match two relations sets Xl and Xr , a set of transforms pj should be found, which make the following equation valid } {m disC (Xl , Xr ) = inf {∑ Vj ∑ Wij C [Eij (pj )]} p i } { j

(4.31)

In eq. (4.31), it is assumed that n > m, where Vj are the weights for counting the different importance of relations. Example 4.2 Matching of connection relations. Consider the case of matching two objects in Example 4.1 only using connection relations. From eqs. (4.23) and (4.24), it has Rl = [(A, B) (A, C)] = S(1) × S(2) ⊆ SM Rr = [(1, 2) (1, 3) (1, 4) (2, 4) (3, 4)] = T(1) × T(2) × T(3) × T(4) × T(5) ⊆ T N . When there is no component 4 in Qr , Rr = [(1, 2)(1, 3)]. In this case, p = {(A, 1) (B, 2)(C, 3)}, p–1 = {(1, A)(2, B)(3, C)}, Rl ⊕ p = {(1, 2)(1, 3)}, and Rr ⊕ p = {(A, B)(A, C)}. The four errors in eq. (4.27) are E1 = {Rl ⊕ p – (Rl ⊕ p) ∩ Rr } = {(1, 2)(1, 3)} – {(1, 2)(1, 3)} = 0 E2 = {Rr – Rr ∩ (R ⊕ p)} = {(1, 2)(1, 3)} – {(1, 2)(1, 3)} = 0 E3 = {Rr ⊕ p–1 – (Rr ⊕ p–1 ) ∩ Rl } = {(A, B)(A, C)} – {(A, B)(A, C)} = 0

.

E4 = {Rl – Rl ∩ (Rr ⊕ p–1 )} = {(A, B)(A, C)} – {(A, B)(A, C)} = 0 It is evident that dis(Rl , Rr ) = 0. When there is the component 4 in Qr , Rr = [(1, 2)(1, 3) (1, 4)(2, 4)(3, 4)]. In this case, p = {(A, 4)(B, 2)(C, 3)}, p–1 = {(4, A)(2, B)(3, C)}, Rl ⊕ p = {(4, 2)(4, 3)}, and Rr ⊕ p = {(B, A)(C, A)}. The four errors in eq. (4.27) become E1 = {(4, 2)(4, 3)} – {(4, 2) (4, 3)} = 0 E2 = {(1, 2)(1, 3)(1, 4)(2, 4)(3, 4)} – {(2, 4)(3, 4)} = {(1, 2)(1, 3)(1, 4)} E3 = {(B, A)(C, A)} – {(A, B)(A, C)} = 0

.

E4 = {(A, B)(A, C)} – {(A, B)(A, C)} = 0 If only the connection relation is considered, the order of components in the expression can be exchanged. From the above results, dis(Rl , Rr ) = {(1, 2)(1, 3)(1, 4)}. The

4.4 Relation Matching

131

numbers of errors are C(E1 ) = 0, C(E2 ) = 3, C(E3 ) = 0, C(E4 ) = 0. Finally, disC (Rl , Rr ) = 3. ◻∘ Matching uses the model stored in the computer to recognize the unknown pattern of the object to be matched. Once a set of transforms pj are obtained, their corresponding models should be determined. Suppose that the object to be matched X is defined in eq. (4.21). For each of a set of models Y1 , Y2 , . . . , YL (they can be represented by eq. (4.22)), a set of transforms p1 , p2 , . . . , pL can be found to satisfy the corresponding relation in eq. (4.31). In other words, all distance disC (X, Yq ) between X and the set of models can be obtained. If for a model Yq , its distance to X satisfies disC (X, Yq ) = min{disC (X, Yi )} i = 1, 2, ⋅ ⋅ ⋅ , L

(4.32)

for q ≤ L, it has X ∈ Yq . This means that X can be matched to the model Yq . In summary, the whole matching process has four steps. (1) Determining the relations among components (i.e., for a relation given in Xl , determine the same relation in Xr ). This requires m × n comparisons

Xl =

(2)

Rl1

Rr1

Rl2

Rr2

Rlm

Rrn

= Xr

(4.33)

Determining the transform for the matching relations; that is, determine p to satisfy eq. (4.30). Suppose that p has K possible forms. The task is to find, among K transforms, the one providing the minimum value for the weighted error summation. C

p1 : dis (Rl , Rr ) } { { 󳨀󳨀󳨀󳨀󳨀󳨀󳨀󳨀󳨀󳨀󳨀󳨀→ } } { } { } { { } { p2 : disC (Rl , Rr ) } } { Rl { 󳨀󳨀󳨀󳨀󳨀󳨀󳨀󳨀󳨀󳨀󳨀󳨀→ } Rr } { } { ⋅⋅⋅⋅⋅⋅ } { } { } { C { } { pK : dis (Rl , Rr ) } 󳨀 󳨀 󳨀 󳨀 󳨀 󳨀 󳨀 󳨀 󳨀 󳨀 󳨀 󳨀 → } {

(3)

(4.34)

Determining the transform set for the matching relation set. That is, K dis values are weighted disC (Rl1 , Rr1 ) { { { { disC (Rl2 , Rr2 ) disC (Xl , Xr ) ⇐ { { ⋅⋅⋅⋅⋅⋅ { { C { dis (Rlm , Rrm )

(4.35)

Note that eq. (4.35) assumes m ≤ n. Therefore, only m pairs of relations have correspondence, while other n – m relations exist only in relations set Xr .

132

(4)

4 Matching and Understanding

Determining the model (find the minimum in L disC (Xl , Xr )) p1

{ 󳨀󳨀→ Y1 → disC (X, Y1 ) { { { p2 { { 󳨀󳨀 → Y2 → disC (X, Y2 ) X{ { ⋅⋅⋅⋅⋅⋅ { { { { pL C { 󳨀󳨀→ YL → dis (X, YL )

(4.36)

4.5 Graph Isomorphism A graph is a data structure used for describing relations. A graph isomorphism is also a technique for matching relations. In the following, some fundamental definitions and concepts about the graph theory are first introduced, and then the graph isomorphic technique is discussed.

4.5.1 Fundamentals of the Graph Theory Some basic concepts, definitions, and representations of graphs are given first. 4.5.1.1 Basic Definitions In Graph theory, a graph G is defined by a limit and nonempty vertex set V(G) and a limit edge set E(G), which can be denoted G = [V(G), E(G)] = [V, E]

(4.37)

where every element in E(G), called a no-order edge, corresponds to a pair of vertexes. It is common to denote the elements in set V by capital letters and the elements in set E by small letters. The edge e formed by vertex pairs A and B is denoted e ↔ AB or e ↔ BA. Both A and B are the ends of e, or e joins A and B. In this case, vertexes A and B are incident with edge e, and edge e is incident with vertexes A and B. Two vertexes incident with the same edge are adjacent. Similarly, two edges incident with the same vertex are adjacent. If two edges have two same vertexes, they are called multiple edges or parallel edges. If the two vertexes of an edge are the same one, this edge is called a loop; otherwise, this edge is called a link. In the definition of a graph, two vertexes may be the same or may be different, while two edges may be the same or may be different, too. Different elements can be represented by vertexes with different colors. This is called the color property of vertexes. Different relations between elements can be represented by edges with different colors. This is called the color property of edges. A general/extended color graph G is denoted G = [(V, C) (E, S)]

(4.38)

4.5 Graph Isomorphism

133

where V is a vertex set, C is the color property set of vertexes, E is an edge set and S is the color property set of edges V = {V1 , V2 , . . . , VN }

(4.39)

C = {CV1 , CV2 , . . . , CVN }

(4.40)

E = {eVi Vj |Vi , Vj ∈ V}

(4.41)

S = {sVi Vj |Vi , Vj ∈ V}

(4.42)

4.5.1.2 Geometric Representation of Graphs Denote the vertexes of a graph by round points and the edges of a graph by lines or curves connecting vertexes. A geometric representation or geometric realization of a graph can be obtained. For the graph with the number of edges larger than 1, there may be infinite numbers of geometric representations. Example 4.3 Geometric representation of a graph. Suppose that V(G) = {A, B, C}, E(G) = {a, b, c, d}, where a ↔ AB, b ↔ AB, c ↔ BC, and d ↔ CC. This graph can be represented by Figure 4.8. In Figure 4.8, edges a, b, and c are adjacent to each other, and edges c and d are adjacent, but edges a and b are not adjacent to edge d. Similarly, vertexes A and B are adjacent, and vertexes B and C are adjacent, but vertexes A and C are not adjacent. Edges a and b are multiple edges, edge d is a loop, and edges a, b, and c are links. ◻∘ Example 4.4 Illustrations of color graph representation. The two objects in Example 3 can be represented by the two graphs in Figure 4.9, where the color properties of vertexes are distinguished by different vertex forms and the color properties of edges are distinguished by different line types. ◻∘ 4.5.1.3 Subgraph For two graphs G and H, if V(H) ⊆ V(G) and E(H) ⊆ E(G), graph H is called the subgraph of graph G, and this is denoted by H ⊆ G. On the other hand, graph G is called the super-graph of graph H. If graph H is the subgraph of graph G, but H ≠ G, graph H is called the proper subgraph of graph G, and graph G is called the proper super-graph of graph H, Sun (2004).

A a b B

c

d C

Figure 4.8: Geometric representation of a graph.

134

4 Matching and Understanding

A

1

Connection relation

Component type 1 1

Component type 2 B

C

2

Up-down relation

3

Left-right relation

Component type 3

Figure 4.9: Color graph representation of objects.

A

A

A

a

a b

d

c

B (a)

b

d c

B (b)

C

A a

a C

c

B (c)

C

c

B (d)

C

Figure 4.10: Graph and spanning subgraph.

A

B A

a b

c

d

a b

B A

c

B A

B

a

c

d

b

e C (a)

B

a

c

d e

D C (b)

(c)

DC (d)

C (e)

B d e

D C (f)

D

Figure 4.11: Four operations for obtaining the underlying simple graph.

If H ⊆ G and V(H) = V(G), graph H is called the spanning subgraph of graph G, and graph G is called the proper super-graph of graph H. For example, Figure 4.10(a) shows graph G, while Figure 4.10(b–d) represents different spanning subgraphs of graph G. By removing all multiple edges and loops from graph G, a simple spanning subgraph can be obtained. This subgraph is called the underlying simple graph of graph G. Among three spanning subgraphs shown in Figure 4.10(b–d), there is only one underlying simple graph; that is, Figure 4.10(d). In the following, graph G in Figure 4.11(a) is used to show four operations for obtaining the underlying simple graph. (1) For a nonempty vertex subset V 󸀠 (G) ⊆ V(G) in graph G, if one of the subgraphs of graph G takes V 󸀠 (G) as the vertex set and takes all edges with two ends in V 󸀠 (G) as the edge set, this subgraph is called an induced subgraph of graph G. It can be denoted as G[V 󸀠 (G)] or G[V 󸀠 ]. Figure 4.11(b) shows G[A, B, C] = G[a, b, c]. (2) For a nonempty edge subset E󸀠 (G) ⊆ E(G) in graph G, if one of the subgraphs of graph G takes E󸀠 (G) as the edge set and takes vertexes of all edges

4.5 Graph Isomorphism

135

in graph G as the vertex set, this subgraph is called an edge-induced subgraph of graph G. It can be denoted as G[E󸀠 (G)] or G[E󸀠 ]. Figure 4.11(c) shows G[a, d] = G[A, B, D]. For a nonempty vertex proper subset V 󸀠 (G) ⊆ V(G) in graph G, a subgraph of graph G is also called an induced subgraph of graph G if the following two conditions are satisfied. The first condition is that this subgraph take all vertexes, except those in V 󸀠 (G) ⊂ V(G), as its vertex set. The second condition is that this subgraph take all edges, except those in relation with V 󸀠 (G), as its edge set. Such a subgraph is denoted G – V 󸀠 . It has G – V 󸀠 = G[V\V 󸀠 ]. Figure 4.11(d) shows G – {A, D} = G[B, C] = G[{A, B, C, D} – {A, D}]. For a nonempty edge proper subset E󸀠 (G) ⊆ E(G) in graph G, if one of the subgraphs of graph G takes the rest of the edges after removing the edges in E󸀠 (G) ⊂ E (G) as the edge set, this subgraph is also called an induced subgraph of graph G, which can be denoted G – E󸀠 . Note that G – E󸀠 and G[E\E󸀠 ] have the same edge set, but they are not always identical. The former is always a spanning subgraph while the latter may not be. Figure 4.11(e) shows an example of the former, G – {c} = G[a, b, d, e]. Figure 4.11(f) shows an example of the latter, G[{a, b, c, d, e} – {a, b}] = G – A ≠ G – [{a, b}].

(3)

(4)

4.5.2 Graph Isomorphism and Matching Based on the above described concepts, a graph isomorphism can be identified and graph matching can be performed. 4.5.2.1 Identical Graphs and Graph Isomorphic According to the definition of graph, if and only if two graphs G and H satisfy V(G) = V(H) and E(G) = E(H), they are called identical graphs and can be expressed by the same geometric representations. For example, graphs G and H in Figure 4.12 are identical and expressed by the same geometric representations. However, if two graphs can be expressed by the same geometric representations, they are not necessarily identical. For example, graphs G and I in Figure 4.12 are not identical though they can be expressed by the same form of geometric representation. A a

B

c

b

X

A a

C

B

G = [V, E ] Figure 4.12: Identical graphs.

x

c

b H = [V, E ]

C

Y

z

y I = [V', E']

Z

136

4 Matching and Understanding

A e B

P Q P

G = [V, E ]

P(A) Q(e) P(B) I = [V', E']

Figure 4.13: Graph isomorphic.

For two graphs expressed by the same geometric representations but that are not identical, if the labels of the vertexes and/or the edges of one graph are changed, it is possible to make it identical to another graph. In this case, these two graphs can be called isomorphic. In other words, the isomorphism of two graphs means that there are one-to-one correspondences between the vertexes and edges of the two graphs. Two isomorphic graphs G and H can be denoted G ≅ H. The sufficient and necessary conditions for two graphs G and H to be isomorphic are given by P:

V(G) → V(H)

(4.43)

Q:

E(G) → E(H)

(4.44)

Mappings P and Q have the relation of Q(e) = P(u)P(v), ∀e ↔ uv ∈ E(G), as shown in Figure 4.13. 4.5.2.2 Determination of Isomorphism According to the above discussion, isomorphic graphs have the same structures and the differences are the labels of the vertex and/or edge. Some examples are discussed here. For simplicity, assume that all vertexes have the same color properties and all edges have the same color properties. Consider a single color graph (a special case of G) B = [(V)(E)]

(4.45)

where V and E are given by eqs. (4.39) and (4.41), respectively. The difference is that all elements in each set are the same here. With reference to Figure 4.14, suppose that two graphs B1 = [(V1 )(E1 )] and B2 = [(V2 )(E2 )] are given. Matching using graph isomorphism has the following forms, Ballard (1982).

Figure 4.14: Different forms of graph isomorphism.
Graph Isomorphism
Graph isomorphism is a one-to-one mapping between B1 and B2. For example, Figure 4.14(a, b) represents graph isomorphism. In general, denote f as a mapping; then for e1 ∈ E1 and e2 ∈ E2, there must be f(e1) = e2. In addition, for each edge of E1 connecting any pair of vertexes e1 and e′1 (e1, e′1 ∈ E1), there must be an edge of E2 connecting f(e1) and f(e′1).

Subgraph Isomorphism
Subgraph isomorphism is an isomorphism between a subgraph of B1 and the whole graph B2. For example, several subgraphs in Figure 4.14(c) are isomorphic with Figure 4.14(a).

Double-Subgraph Isomorphism
Double-subgraph isomorphism is an isomorphism between every subgraph of B1 and every subgraph of B2. For example, there are several double subgraphs in Figure 4.14(d) that are isomorphic with Figure 4.14(a).

One method that is less restrictive than isomorphism and converges faster is to use association graphs for matching, Snyder (2004). In matching with the association graph, a graph is defined as G = (V, P, R), where V represents a set of vertexes, P represents a set of unary predicates on vertexes, and R represents binary relations between vertexes. A predicate takes on only the value TRUE or FALSE. A binary relation describes a property possessed by a pair of vertexes. Given two graphs, an association graph can be constructed; matching with the association graph is to match the vertexes and the binary relations in the two graphs. A minimal sketch of this idea is given below.
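As an illustration only (not the book's algorithm), the following Python sketch builds an association graph from two small attributed graphs and searches for its largest clique, which corresponds to the largest consistent matching. The input data structures, the predicate labels, and the toy example are assumptions introduced here.

# A minimal sketch of matching with an association graph (illustrative only).
# Each input graph is given as {vertex: predicate_label} plus a dictionary of
# binary relations {(v1, v2): relation_label}; these structures are assumptions.
from itertools import combinations

def build_association_graph(p1, r1, p2, r2):
    # Nodes of the association graph pair up vertexes with the same predicate.
    nodes = [(a, b) for a in p1 for b in p2 if p1[a] == p2[b]]
    # Two nodes are connected if the pairings are compatible: distinct vertexes
    # on both sides and the same binary relation (possibly none) between them.
    edges = set()
    for (a1, b1), (a2, b2) in combinations(nodes, 2):
        if a1 != a2 and b1 != b2 and \
           r1.get((a1, a2), r1.get((a2, a1))) == r2.get((b1, b2), r2.get((b2, b1))):
            edges.add(((a1, b1), (a2, b2)))
    return nodes, edges

def largest_clique(nodes, edges):
    # Brute-force search, adequate only for very small graphs.
    def connected(u, v):
        return (u, v) in edges or (v, u) in edges
    for k in range(len(nodes), 0, -1):
        for cand in combinations(nodes, k):
            if all(connected(u, v) for u, v in combinations(cand, 2)):
                return list(cand)
    return []

# Hypothetical example: two small graphs with labeled vertexes and relations.
p1 = {'A': 'corner', 'B': 'corner', 'C': 'edge'}
r1 = {('A', 'B'): 'adjacent', ('B', 'C'): 'adjacent'}
p2 = {'X': 'corner', 'Y': 'corner', 'Z': 'edge'}
r2 = {('X', 'Y'): 'adjacent', ('Y', 'Z'): 'adjacent'}
nodes, edges = build_association_graph(p1, r1, p2, r2)
print(largest_clique(nodes, edges))   # the largest consistent vertex pairing

The printed clique pairs each vertex of the first graph with a compatible vertex of the second graph, which is exactly the kind of consistent correspondence that association-graph matching seeks.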

4.6 Labeling of Line Drawings

In a 3-D world, the surfaces of 3-D objects can be observed. When the 3-D world is projected onto a 2-D image, each surface forms a region, and the boundaries of surfaces in the 3-D world become the contours of the regions in the 2-D image. Line drawings are the result of representing the object regions by their contours. For simple objects, by labeling their line drawings (i.e., by using contour labels in 2-D images), the relationship among 3-D surfaces can be represented, Shapiro (2001). Such labels can be used to match 3-D objects with their models for interpreting the scene.

4.6.1 Labeling of Contours

Some commonly used terms in the labeling of contours are as follows:
(1) Blade: If a continuous surface (occluding surface) occludes another surface (occluded surface), and the surface normal direction changes smoothly and continuously when moving along the contour of the former surface, the contour is called a blade. To represent a blade in line drawings, a single arrowhead ">" is used. The direction of the arrow indicates which surface is the occluding surface. By convention, the occluding surface is located to the right side of the blade as the blade is followed in the direction of the arrow. On the two sides of a blade, the directions of the occluding surface and the occluded surface have no relation.
(2) Limb: If a continuous surface occludes not only another surface but also part of itself (self-occluding), the change of the surface normal direction is smooth and continuous, and this change is perpendicular to the viewing direction, then the contour line is called a limb. To represent a limb, two opposite arrows "< >" are used. Following the direction of the arrows, the surface direction does not change; following a direction that is not parallel to the limb, the surface direction changes continuously. A blade is a real edge of a 3-D object, but a limb is not. Both the blade and the limb belong to jump edges; on the two sides of a jump edge, the depth is not continuous.
(3) Crease: If an abrupt change of a 3-D surface or the joining of two different surfaces is encountered, a crease is formed. On the two sides of a crease, the points on the surface are continuous, but the surface normal directions are not. If the surface is convex at the crease, it is marked by a "+" symbol; if the surface is concave at the crease, it is marked by a "–" symbol.
(4) Mark: A mark on an image contour is caused by a change in the surface albedo, not by the 3-D surface shape. A mark can be labeled "M."
(5) Shade: If a continuous surface does not occlude another surface or part of another surface along the viewing direction, but does occlude the illumination from the light source for another surface or part of another surface, it causes shading on the second surface. The shading on the surface is not caused by the form of the surface itself, but by the influence of other parts of the surface. A shade can be labeled "S."

Example 4.5 Examples of contour labeling. Some examples of contour labeling are illustrated in Figure 4.15. In Figure 4.15, a hollow cylinder is put on a platform, there is a mark on the cylinder, and the cylinder forms shading on the platform. There are two limbs on the two sides of the cylinder. The top edges are divided into two parts by two limbs. The top-front (edge) represents a blade that occludes the platform while the bottom-front (edge) represents a surface that occludes the interior of the cylinder. All creases on the platform are convex, while the crease between the cylinder and the platform is concave. ◻∘

Figure 4.15: Examples of contour labeling.

Figure 4.16: Different interpretation of a line drawing.

4.6.2 Structure Reasoning

In the following, structure reasoning for 3-D objects is based on their contours in 2-D images. Suppose that all surfaces of an object are planes and all corners are formed by the intersection of three planes. Such a 3-D object is called a trihedral corner object. Two line-drawing examples of trihedral corner objects are shown in Figure 4.16. In Figure 4.16, both objects are in a general position, which means that a small change of viewpoint will not cause a topological change of the line drawing (no face, edge, or connection will disappear) at such a position. The two line drawings in Figure 4.16 have the same geometric structure; however, two different interpretations can be derived. The difference between them is the three additional concave creases in the line drawing of Figure 4.16(b). Therefore, the object in Figure 4.16(a) is interpreted as floating in space, while the object in Figure 4.16(b) is interpreted as glued to a back wall. To label the line drawing, three labels {+, –, →} can be used, in which "+" represents a nonclosed convex line, "–" represents a nonclosed concave line, and "→" represents a closed line. Under these conditions, there are four groups comprising 16 types of topologically possible line junctions: six types of L-junctions, four types of T-junctions, three types of arrow junctions, and three types of fork (Y-) junctions, as shown in Figure 4.17. Note that other topological combinations do not physically exist.

Figure 4.17: Sixteen topologically possible line junctions for images of trihedral blocks.

4.6.3 Labeling with Sequential Backtracking

There are different methods for the automatic labeling of line drawings. In the following, a method called labeling with sequential backtracking is introduced, Shapiro (2001). The problem to be solved can be stated simply: given a 2-D line drawing with a set of edges, assign each edge a label (the junction types used are shown in Figure 4.17) so as to interpret its 3-D cause. The method of labeling with sequential backtracking arranges the edges into a sequence; if possible, the edges with the most constrained labels should be assigned first. According to a depth-first strategy, each edge is labeled with all possible labels in turn and the consistency of the new label with the other edge labels is checked. If a newly assigned label produces a junction that is not among the junction types shown in Figure 4.17, the algorithm backtracks; otherwise, it tries the next edge. If all labels are consistent, a labeling result is obtained and a complete path to the leaves of the interpretation tree is found. A small sketch of this backtracking search is given after Table 4.1.

Example 4.6 Illustration of labeling with sequential backtracking.
Considering the pyramid with four faces shown in Figure 4.18, the interpretation tree obtained by labeling with sequential backtracking, including each step and the result, is shown in Table 4.1. It is seen from the interpretation tree that three complete paths can be obtained. They give three different interpretations of the line drawing. ◻∘

Figure 4.18: Pyramid used in the example.

Table 4.1: The interpretation tree for the line drawing of a pyramid. (The tree explores the labels {+, –, →} assigned to the edges in sequence; branches that give a conflicting interpretation of edge AB or that do not belong to the 16 junction types are pruned, and three complete paths remain, interpreted as glued to a wall, resting on a table, and floating in the air.)
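As an illustration only (not the book's implementation), the following Python sketch shows the sequential backtracking search for edge labels. The junction catalog ALLOWED below is a small, hypothetical subset of the 16 junction types of Figure 4.17, and the toy drawing at the end is invented, just to make the search mechanism concrete.

# Illustrative sketch of labeling with sequential backtracking.
# Edges get labels from {'+', '-', '>'}; a junction catalog lists which
# label combinations are physically possible (the catalog here is assumed).
LABELS = ['+', '-', '>']
ALLOWED = {
    'L': {('+', '+'), ('-', '-'), ('>', '+'), ('+', '>')},
    'fork': {('+', '+', '+'), ('-', '-', '-')},
}

def consistent(assignment, junctions):
    # A partial assignment is consistent if every fully labeled junction
    # matches one of the allowed label tuples for its type.
    for jtype, edge_ids in junctions:
        labels = tuple(assignment.get(e) for e in edge_ids)
        if None not in labels and labels not in ALLOWED[jtype]:
            return False
    return True

def backtrack(edges, junctions, assignment=None, solutions=None):
    assignment = {} if assignment is None else assignment
    solutions = [] if solutions is None else solutions
    if len(assignment) == len(edges):
        solutions.append(dict(assignment))      # a complete, consistent labeling
        return solutions
    edge = edges[len(assignment)]               # next edge in the fixed sequence
    for label in LABELS:
        assignment[edge] = label
        if consistent(assignment, junctions):   # otherwise backtrack immediately
            backtrack(edges, junctions, assignment, solutions)
        del assignment[edge]
    return solutions

# Hypothetical toy drawing: three edges meeting in one fork and one L-junction.
edges = ['AB', 'AC', 'AD']
junctions = [('fork', ['AB', 'AC', 'AD']), ('L', ['AB', 'AC'])]
print(backtrack(edges, junctions))

Each dictionary printed by backtrack corresponds to one complete path of the interpretation tree; with the full 16-type catalog of Figure 4.17, the same search applied to the pyramid of Figure 4.18 yields the three interpretations listed in Table 4.1.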

4.7 Problems and Questions

4-1 What is the relation between template matching and the Hough transform? Compare their computational expense when detecting points on a line.
4-2* Show that Q in eq. (4.11) is 0 if and only if A and B are identical strings.
4-3 Show that the translation, rotation, and scaling parameters discussed in Section 4.2 can be computed independently.
4-4 When constructing dynamic patterns, what relations, besides connecting line lengths and their angles, can still be used? Give the expressions for the pattern vectors.
4-5 Considering only the connecting relation, compute the distances between the object on the right side of Figure 4.7 and the two objects in Figure Problem 4-5, respectively.

Figure Problem 4-5

4-6 Show that if two graphs are identical, then the numbers of vertexes and edges in the two graphs are equal. However, if the numbers of vertexes and edges in two graphs are equal, the two graphs may not be identical.
4-7* Show that the two graphs in Figure Problem 4-7 are not isomorphic.

Figure Problem 4-7

4-8 Provide all subgraphs in Figure 4.14(c) that are isomorphic with Figure 4.14(a).
4-9 Draw the 16 subgraphs that have {A, B, C, D} as vertex sets in Figure Problem 4-9.

Figure Problem 4-9

4-10 How do you distinguish a blade from a limb using shading analysis?
4-11 List the junction types (see Figure 4.17) of all junctions for the two objects in Figure 4.16.
4-12 Design a line drawing that corresponds to a nonexistent object.


4.8 Further Reading

1. Fundamental of Matching
– Many matching techniques use Hough transforms, such as the general Hough transform, Chen (2001), the iterative Hough transform (IHT), Habib (2001), and the multi-resolution Hough transform, Li (2005a).
2. Object Matching
– Several modified Hausdorff distances have been proposed, see Gao (2002b), Lin (2003), Liu (2005), and Tan (2006).
– One method for combining curve fitting and matching for tracking of infrared objects can be found in Qin (2003).
3. Dynamic Pattern Matching
– Examples of the utilization of dynamic pattern matching in bio-medical image analysis can be found in Zhang (1991).
– A method similar to dynamic pattern matching, called the local-feature-focus algorithm, is discussed in Shapiro (2001).
4. Relation Matching
– Relation is a general concept, and relation matching can have many different forms; for example, see Dai (2005).
– Relation matching has been popularly used in content-based visual information retrieval, Zhang (2003b).
5. Graph Isomorphism
– A more detailed introduction to graph theory can be found in Marchand (2000), West (2001), Buckley (2003), and Sun (2004).
– Double-subgraph isomorphism has a big search space. It can be solved with the help of another graph method for cliques, Ballard (1982).
6. Labeling of Line Drawings
– The labeling of line drawings gets help from graph theory, so further reading can be found in Marchand (2000), Sun (2004), and so on.

5 Scene Analysis and Semantic Interpretation

Image understanding is, in essence, the understanding of the scene. Understanding a visual scene can be described as follows: on the basis of the visual perception of environmental data, and combined with a variety of image techniques, the characteristics and patterns of the visual data are mined from the perspectives of computational statistics, behavioral cognition, semantic interpretation, and so on, so as to achieve an effective explanation and cognition of the scene. From this point of view, scene understanding builds on scene analysis to achieve the goal of interpreting the scene semantics.

Scene understanding requires the incorporation of high-level semantics. Scene labeling and scene classification are both means of scene analysis directed toward the semantic interpretation of the scene. On the other side, to explain the semantics of the scene, further reasoning is required on the basis of the processing and analysis results of the image data. Reasoning is the process of collecting information, performing learning, and making decisions based on logic. Many other mathematical theories and methods can also be used in scene understanding.

The sections of this chapter are arranged as follows:
Section 5.1 provides a summary of the various scenarios and different tasks involved in scene understanding.
Section 5.2 discusses fuzzy reasoning. Based on an introduction to the concepts of fuzzy sets and fuzzy operations, the basic fuzzy inference methods are outlined.
Section 5.3 describes a method of using genetic algorithms for image segmentation and semantic interpretation, as well as for semantic inference judgment.
Section 5.4 focuses on the labeling of scene objects. Labeling scene objects is an important means of raising analysis results to the conceptual level; commonly used methods include discrete labeling and probabilistic labeling.
Section 5.5 discusses scene classification and introduces related definitions, concepts, and methods, primarily the bag of words/bag of features model, the pLSA model, and the LDA model.

5.1 Overview of Scene Understanding

Scene analysis and semantic interpretation are important for understanding the content of a scene, but research in this area is relatively less mature, and many of the issues are still under exploration and evolution.


5.1 Overview of Scene Understanding

145

5.1.1 Scene Analysis

Scene analysis must rely on image analysis techniques to obtain information on the subjects in the scene; it thus builds the foundation for further scene interpretation. Object recognition is an important step and one of the foundations of scene analysis. When recognizing a single object, it is generally believed that the image of the object region can be decomposed into several subregions (often corresponding to parts of the object); these subregions have relatively fixed geometry and together constitute the appearance of the object. In the analysis of natural scenes, however, which often contain a number of scene subjects, the relationships among the subjects are very complex and quite difficult to predict, so scene analysis must consider not only the internal relationships among the objects themselves but also the distribution and relative positions of different objects.

From a cognitive perspective, scene analysis is more concerned with people's perception and understanding of the scene. A large number of biological, physiological, and psychological experiments on scene analysis have shown that the global characteristic analysis of the scene often occurs in the early stage of visual attention. How to build on these research results and introduce the corresponding constraint mechanisms to establish a reasonable computational model are issues requiring further research and exploration.

In scene analysis, an important issue is that the visual content of the scene (the objects and their distribution) can have large uncertainty due to:
(1) Different lighting conditions, which can lead to difficulties in object detection and tracking.
(2) Different object appearances (although sometimes they have similar structural elements), which bring ambiguity to the identification of objects in the scene.
(3) Different observation scales, which often affect the identification and classification of the objects in the scene.
(4) Different positions, orientations, and mutual occlusions of the objects, which increase the complexity of object cognition.

5.1.2 Scene Perception Layer

Analysis and semantic interpretation of the scene can be conducted at different levels. Similarly to content-based coding, where model-based coding can be performed at three levels (the lowest level of object-based coding, the middle level of knowledge-based coding, and the highest level of semantic-based coding), the analysis and semantic interpretation of the scene can be divided into three layers.
(1) Local layer: The main emphasis of this layer is the analysis, recognition, or labeling of image regions for a local scene or a single object.
(2) Global layer: This layer considers the whole scene and focuses on the relationships between objects having a similar appearance and functionality, and so on.
(3) Abstraction layer: This layer corresponds to the conceptual meaning of the scene and provides an abstract description of the scene.

Taking the scene of a course being taught in a classroom as an example, the three layers for the above analysis and semantic interpretation task can be considered as follows. In the first layer, the main consideration is extracting objects from the image, such as teachers, students, tables, chairs, screens, projectors, and so on. In the second layer, the focus is on determining the positions and mutual relationships of the objects, such as the teacher standing in front of the classroom, students sitting facing the screen, the projector projecting the lecture notes onto the screen, and so on; the focus is also on analyzing the environment and functionality, such as indoor or outdoor, and what type of indoor room (office or classroom). In the third layer, it is required to describe the activities in the classroom (such as in-class or recess) and/or the atmosphere (such as serious or relaxed, calm or very dynamic).

Humans have a strong capability for the perception of scenes. For a new scene (especially a more conceptual scene), the human eye often needs only a single sweep to explain the meaning of the scene. For example, considering an open sport game on a playground, by observing the color of the grass and the runway (low-level features), the athletes and their running state on the runway (intermediate objects), plus reasoning based on experience and knowledge (conceptual level), people can immediately make the correct judgment. In this process, the perception of the middle layer has a certain priority. Some studies have shown that people have a strong ability to identify middle-layer objects (such as a stadium), which is faster than identifying and naming objects at the lower and/or upper layers. One hypothesis is that the higher priority of perception in the middle layer comes from two factors acting at the same time: one is maximizing the similarity inside classes (whether or not there are athletes on the playground), and the other is maximizing the difference between classes (even a classroom after class is still different from a playground). From the visual characteristics of the scene itself, objects of the same middle-layer class often have a similar spatial structure and exhibit similar behaviors.

In order to obtain the semantic interpretation of the scene, it is required to establish connections between the high-level concepts and the low-level visual features and middle-level object properties, as well as to recognize objects and their relative relationships. To accomplish this work, two modeling methods can be considered.

5.1.2.1 Lower Scenario Modeling
Such methods directly represent and describe the low-level properties (color, texture, etc.) of the scene, and then reason toward the high-level information of the scene by means of classification and recognition. For example, a large playground has grass (light green) and a track (dark red); classrooms have many desks and chairs (regular geometry with horizontal or vertical edges). These methods can be further divided into:
(1) Global method: It works with the help of statistics of the whole image (such as color histograms) to describe the scene. A typical example is the classification of images into indoor images and outdoor images; the outdoor images can be further divided into natural outdoor scenery images and artificial building images, and the artificial building images can be still further divided into tower images, playground images, and so on (a minimal sketch of this approach is given at the end of Section 5.1.2).
(2) Block method: It works by dividing the image into multiple image pieces called blocks (regular or irregular). For each block, the global method can be used, and then these results are integrated to provide the judgment.

5.1.2.2 Middle Semantic Modeling
Such methods improve the performance of low-level feature classification by means of the identification of objects, and address the semantic gap between high-level concepts and low-level attributes. A commonly used method classifies the scene into specific semantic categories according to the semantics and distribution of objects, by using visual vocabulary modeling (see Section 5.5.1). For example, the determination of a classroom from an indoor scene can be made by the existence of objects such as tables, chairs, a projector/projection screen, and so on. Note that the semantic meaning of the scene may not be unique, and indeed it is often not unique in practice (e.g., a projector can be in a classroom or in a conference room). Especially for outdoor scenes, the situation is more complicated, because the objects in the scene can have any size, shape, location, orientation, lighting, shadow, occlusion, and so on. In addition, several individuals of the same type of object may also have quite different appearances.
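As an illustration only (not from the book), the following Python sketch implements the global method in its simplest form: each image is described by a global color histogram and assigned to the class of the most similar class prototype. The feature choice, the class names, and the synthetic training data are assumptions made for this sketch.

# Minimal sketch of the "global method": classify a scene by a global color histogram.
# Requires numpy; images are H x W x 3 uint8 arrays (e.g., loaded with imageio or OpenCV).
import numpy as np

def color_histogram(image, bins=8):
    # Joint RGB histogram with bins**3 cells, normalized to sum to 1.
    pixels = image.reshape(-1, 3) // (256 // bins)
    idx = pixels[:, 0] * bins * bins + pixels[:, 1] * bins + pixels[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def train_prototypes(samples):
    # samples: {class_name: [image, ...]}; the prototype is the mean histogram.
    return {c: np.mean([color_histogram(im) for im in ims], axis=0)
            for c, ims in samples.items()}

def classify(image, prototypes):
    h = color_histogram(image)
    # Histogram intersection as similarity; larger means more similar.
    return max(prototypes, key=lambda c: np.minimum(h, prototypes[c]).sum())

# Hypothetical usage with synthetic "images": greenish outdoor vs. grayish indoor.
outdoor = [np.full((32, 32, 3), (60, 180, 70), dtype=np.uint8)]
indoor = [np.full((32, 32, 3), (128, 128, 120), dtype=np.uint8)]
protos = train_prototypes({'outdoor': outdoor, 'indoor': indoor})
test = np.full((32, 32, 3), (50, 170, 75), dtype=np.uint8)
print(classify(test, protos))   # expected: 'outdoor'

In practice the prototypes would be learned from many real training images, and a stronger classifier (e.g., k-nearest neighbors or an SVM on the histograms) would replace the nearest-prototype rule, but the overall global-method structure stays the same.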

5.1.3 Scene Semantic Interpretation

Scene semantic interpretation involves research and development on multiple fronts:
(1) Visual computing technology;
(2) Dynamic control strategies for vision algorithms;
(3) Self-learning of scene information;
(4) Rapid or real-time computing technology;
(5) Cooperative multi-sensory fusion;
(6) Visual attention mechanisms;
(7) Scene interpretation combining cognitive theory;
(8) System integration and optimization.

5.2 Fuzzy Reasoning

Fuzziness is a concept often associated with the opposite of clear or precise (crisp). In daily life, many vague things are encountered that have no clear quantitative limits, and some imprecise phrases have to be used to describe them. Fuzzy concepts can express a wide variety of noncritical, uncertain, and imprecise knowledge and information, as well as knowledge obtained from conflicting sources. Human-like qualifiers or modifiers, such as higher intensity or lower gray value, can be used to form fuzzy sets and to express knowledge about an image. Based on such an expression of knowledge, further reasoning can be conducted. Fuzzy inference needs the help of fuzzy logic, fuzzy operations, and fuzzy arithmetic/algebra.

5.2.1 Fuzzy Sets and Fuzzy Operation

A fuzzy set S in a fuzzy space X is a set of ordered pairs:

S = {[x, MS(x)] | x ∈ X}  (5.1)

where MS(x) represents the grade of membership of x in S. The values of the membership function are always nonnegative real numbers and are normally limited to [0, 1]. A fuzzy set is described solely by its membership function. Figure 5.1 shows some examples of using a crisp set and fuzzy sets to represent the concept "dark," where the horizontal axis corresponds to the image gray level x, which is the domain of definition of the membership function, and the vertical axis shows the value of the membership function L(x). Figure 5.1(a) uses a crisp set to describe "dark," and the result is binary (a gray level smaller than 127 is completely dark, while a gray level bigger than 127 is entirely not dark). Figure 5.1(b) is a typical fuzzy membership function whose value goes from 1 to 0 as the gray level goes from 0 to 255. When x is 0, L(x) is 1 and x completely belongs to the dark fuzzy set; when x is 255, L(x) is 0 and x does not belong to the dark fuzzy set at all. Between these two extremes, the gradual transition of L(x) indicates that intermediate gray levels partly belong to "dark" and partly do not. Figure 5.1(c) is a nonlinear membership function. It looks like a combination of Figure 5.1(a, b), but it still represents a fuzzy set.

Figure 5.1: Representations with crisp sets and fuzzy sets.

127

255 x

5.2 Fuzzy Reasoning

149

Operations on fuzzy sets can be carried out using fuzzy logic operations. Fuzzy logic is built on the basis of multi-valued logic; it studies the modes of fuzzy thinking, language forms, and laws with fuzzy sets. In fuzzy logic there are operations whose names are similar to those of ordinary logic operations but whose definitions differ. Let LA(x) and LB(y) represent the membership functions corresponding to fuzzy sets A and B, whose domains of definition are X and Y, respectively. The fuzzy intersection, fuzzy union, and fuzzy complement can be defined as follows:

Intersection A ∩ B: LA∩B(x, y) = min[LA(x), LB(y)]
Union A ∪ B: LA∪B(x, y) = max[LA(x), LB(y)]  (5.2)
Complement Ac: LAc(x) = 1 – LA(x)

Operations on fuzzy sets can also be achieved with ordinary algebraic operations by changing the shape of the fuzzy membership function point by point. Suppose the membership function in Figure 5.1(b) represents a fuzzy set D (dark); then the membership function of an enhanced fuzzy set VD (very dark) would be (as shown in Figure 5.2(a)):

LVD(x) = LD(x) ∙ LD(x) = LD²(x)  (5.3)

This kind of operations can be repeated. For example, the membership function of fuzzy set VVD (very very dark) would be (as shown in Figure 5.2(b)): LVVD (x) = L2D (x) ∙ L2D (x) = L4D (x)

(5.4)

On the other side, it is possible to define a weak fuzzy set SD (somewhat dark), whose membership function is (as shown in Figure 5.2(c)): LSD (x) = √LD (x)

(5.5)

Logical operations and algebraic operations can also be combined. For example, the membership function of the fuzzy set NVD (not very dark), that is, the complement of the enhanced fuzzy set VD, is (as shown in Figure 5.2(d)):

LNVD(x) = 1 – LD²(x)  (5.6)
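As an illustration only (not from the book), the following Python sketch evaluates the fuzzy set operations of eqs. (5.2) to (5.6) on sampled membership functions; the linear "dark" membership used here is an assumption matching the shape of Figure 5.1(b).

# Fuzzy set operations of eqs. (5.2)-(5.6) on sampled membership functions.
import numpy as np

x = np.arange(256)
L_D = 1.0 - x / 255.0            # dark, as in Figure 5.1(b)
L_VD = L_D ** 2                  # very dark, eq. (5.3)
L_VVD = L_D ** 4                 # very very dark, eq. (5.4)
L_SD = np.sqrt(L_D)              # somewhat dark, eq. (5.5)
L_NVD = 1.0 - L_D ** 2           # not very dark, eq. (5.6)

# Fuzzy intersection, union, and complement of eq. (5.2), applied pointwise.
L_and = np.minimum(L_D, L_NVD)
L_or = np.maximum(L_D, L_NVD)
L_not = 1.0 - L_D

print(L_D[64], L_VD[64], L_SD[64], L_NVD[64])   # memberships of gray level 64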

L(x) 1

0 (a)

255 x

0 (b)

(5.6)

L(x) 1

255 x

0 (c)

L(x) 1

255 x

0 (d)

255 x

Figure 5.2: The results of different operations on the original fuzzy set D in Figure 5.1(b).

150

5 Scene Analysis and Semantic Interpretation

Here NVD can be seen as N[V(D)]: LD(x) corresponds to D, LD²(x) corresponds to V(D), and 1 – LD²(x) corresponds to N[V(D)].

5.2.2 Fuzzy Reasoning Methods

In fuzzy reasoning, the information in several fuzzy sets is combined according to certain rules to make a decision, Sonka (2008).

5.2.2.1 Basic Model
The basic model and the main steps of fuzzy reasoning are shown in Figure 5.3. Starting from fuzzy rules, the determination of the basic relations of the memberships in the associated membership functions is called composition. The result of fuzzy composition is a fuzzy solution space. To make a decision on the basis of the solution space, a de-fuzzification (de-composition) process is required.

Fuzzy rules are a series of unconditional and conditional propositions. The form of an unconditional fuzzy rule is

x is A  (5.7)

The form of a conditional fuzzy rule is

if x is A then y is B  (5.8)

where A and B are fuzzy sets, and x and y represent scalars from their respective domains. The membership corresponding to an unconditional fuzzy rule is LA(x). Unconditional fuzzy propositions are used to limit the solution space or to define a default solution space. Since these rules are unconditional, they can be applied directly to the solution space with the help of fuzzy set operations. Now consider the conditional fuzzy rules. Among the currently used methods for decision making, the simplest is monotonic fuzzy reasoning, which can obtain the solution directly without the use of fuzzy composition and de-fuzzification (see below). For example, let x represent the outside illumination value and y represent the gray-level value of the image; then the fuzzy rule relating them is: if x is DARK then y is LOW. The principle of monotonic fuzzy reasoning is shown in Figure 5.4. Suppose the outside illumination value is x = 0.3; then the membership value is LD(0.3) = 0.4. If this value is used as the membership value LL(y) = LD(x), then the expectation of the

Fuzzy rules

Composition

Solution space

De-composition

Figure 5.3: The model and steps for fuzzy reasoning.

Decision

5.2 Fuzzy Reasoning

L(x) 1

L(y) 1

DARK

151

LOW

0.4 0

0.3

1

0

x

110

255

y

Figure 5.4: Monotonic fuzzy reasoning based on a single fuzzy rule.

high-low gray level of the image is y = 110, which is at a relatively low place in the range from 0 to 255.

5.2.2.2 Fuzzy Composition
The knowledge related to a decision-making process is often included in more than one fuzzy rule, but not every fuzzy rule makes the same contribution to the decision. There are different mechanisms for combining rules; the most commonly used approach is the min-max rule. In min-max composition, a series of minimization and maximization processes are used. One example is shown in Figure 5.5. First, the correlation minimum (the minimum of the predicate truth) LAi(x) is used to restrict the consequent fuzzy membership function LBi(y), where i denotes the ith rule (two rules are used in this example). The consequent fuzzy membership function LBi(y) is updated point by point to produce the fuzzy membership function

LBi+(y) = min{LAi(x), LBi(y)}  (5.9)

Finally, by seeking the maximum of the minimized fuzzy sets, point by point, the fuzzy membership function of the solution is

LS(y) = max_i {LBi+(y)}  (5.10)

L(x) 1

Predicate fuzzy set

L(y) 1

A

(5.10)

i

Consequent fuzzy set B L(y)

0 L(x') 1

0

Minimization

x

0 L(y') 1

A'

x

0

Maximization

Solution fuzzy set

y

0

B'

y

Figure 5.5: Using correlation minimum for fuzzy min-max composition.

y

152

5 Scene Analysis and Semantic Interpretation

L(x) 1

L(y)

Predicate fuzzy set

1

A

Consequent fuzzy set B L(y)

0 L(x') 1

Minimization

0

x

L(y') 1

A'

0

Maximization

y

0

B'

0

x

Solution fuzzy set

y

y

Figure 5.6: Using correlation product for fuzzy min-max composition.

Another approach is called correlation product; one example is given in Figure 5.6. This approach scales the original consequent membership functions instead of truncating them. Correlation minimum has the advantage of being computationally simple and easier to de-fuzzify, while correlation product has the advantage of keeping the original form of the fuzzy set, as shown in Figure 5.6.

5.2.2.3 De-fuzzification
Fuzzy composition gives the fuzzy membership function of one single solution for each solution variable. To determine the exact solution used for decision making, it is required to identify a vector, with one scalar per solution variable, that best expresses the information in the fuzzy solution set. This process is carried out independently for each solution variable and is called de-fuzzification. Two commonly used de-fuzzification methods are the moment composition method and the maximum composition method. The moment composition method first determines the centroid c of the membership function of the fuzzy solution and then converts the fuzzy solution into the crisp solution variable c, as shown in Figure 5.7(a). The maximum composition method determines the domain point at which the membership function of the fuzzy solution attains its maximum value; in case the maximum values form a plateau, the center d of the plateau provides the crisp solution, as shown in Figure 5.7(b). The moment composition method is sensitive to the results of all rules, while

0

(a)

L(y) 1

Fuzzy soltion set

c

Fuzzy soltion set

0

y

(b)

d

y

Figure 5.7: Two methods for de-fuzzification.

5.3 Image Interpretation with Genetic Algorithms

153

the maximum composition method has results depending on the single rule with maximum predicate value. The moment composition method is often used for control applications, while the maximum composition method is often used in recognition applications.
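As an illustration only (not the book's implementation), the following Python sketch puts the pieces of Section 5.2.2 together: two hypothetical rules are combined with min-max composition using the correlation minimum, and the crisp output is obtained by the moment (centroid) method. The rule definitions and membership shapes are assumptions.

# Min-max fuzzy inference with centroid de-fuzzification (illustrative only).
import numpy as np

y = np.linspace(0.0, 255.0, 256)                 # output domain: gray level

def dark(x):        return max(0.0, 1.0 - x)     # predicate memberships, x in [0, 1]
def bright(x):      return max(0.0, x)
def low_gray(y):    return np.clip(1.0 - y / 255.0, 0.0, 1.0)   # consequent sets
def high_gray(y):   return np.clip(y / 255.0, 0.0, 1.0)

def infer(x):
    # Correlation minimum: clip each consequent at its rule's predicate truth,
    # then combine the clipped sets with a pointwise maximum (min-max composition).
    clipped = [np.minimum(dark(x), low_gray(y)),
               np.minimum(bright(x), high_gray(y))]
    solution = np.maximum.reduce(clipped)
    # Moment (centroid) de-fuzzification gives the crisp output value.
    return float(np.sum(y * solution) / np.sum(solution))

print(infer(0.3))   # a dark illumination value maps to a fairly low gray level

Replacing the clipping by a multiplication of the consequent with the predicate truth would turn this into the correlation-product variant, and replacing the centroid by the argument of the maximum would give the maximum composition method.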

5.3 Image Interpretation with Genetic Algorithms

This section gives a brief description of the basic ideas and operations of genetic algorithms, and shows how genetic algorithms can be used for image segmentation and the explanation of semantic content.

5.3.1 Principle of Genetic Algorithms

Genetic algorithms (GA) use the mechanism of natural evolution to search for the extremes of an objective function. Genetic algorithms are not guaranteed to find the global optimum, but in practice they usually find solutions very close to the global optimal solution. Compared to other optimization techniques, genetic algorithms have the following properties:
(1) Genetic algorithms do not work on the parameter sets themselves but on a coding of these sets. They require the natural parameter set to be coded as a finite-length string over a limited set of symbols; in this way, every optimization problem is converted into a string representation. In practice, binary strings with only the two symbols 0 and 1 are often used.
(2) In each search step, genetic algorithms work on a large population of sample points instead of a single point. Thus, they have a greater chance of finding the global optimum.
(3) Genetic algorithms use the objective function directly, instead of derivatives or other auxiliary knowledge. The search for new and better solutions depends only on an evaluation function that describes the goodness of a particular string. The value of the evaluation function in genetic algorithms is called fitness.
(4) Genetic algorithms do not use deterministic rules but probabilistic transition rules. The transition from the current string population to a new and better string population depends only on the natural idea of supporting good strings with higher fitness while removing bad strings with lower fitness. This is the basic principle of genetic algorithms, in which the string with the best result has the highest probability of surviving the evaluation process.
The basic operations of genetic algorithms are reproduction, crossover, and mutation. Using these three basic operations, the survival of good strings and the death of bad strings can be controlled.

154

5 Scene Analysis and Semantic Interpretation

5.3.1.1 Reproduction
The process of reproduction makes good strings survive and other strings die according to probability. The reproduction mechanism copies strings with high fitness into the next generation, in which the selection of a string for reproduction is determined by its relative fitness in the current population. The higher the fitness of a string, the higher the probability that the string survives; the lower the fitness, the lower this probability. The result is that strings with higher fitness are more likely than strings with lower fitness to be reproduced into the population of the next generation. Since the number of strings in the population usually remains constant, the average fitness of the new generation will be higher than that of the previous generation.

5.3.1.2 Crossover
There are many ways to realize crossover. The basic idea is to randomly match pairs of the newly generated code strings, to determine a border location for each pair, and to generate new code strings by exchanging the heads of the string pair at the border location, as shown in Figure 5.8. Not all newly generated code strings need to cross; generally a probability parameter is used to control the required number of crossed code strings. Another option is to let the best code string retain its original form.

5.3.1.3 Mutation
The principle of the mutation operation is to infrequently and randomly change some codes in a code string (e.g., changing one in every thousand codes in the evolution from one generation to the next), in order to maintain a variety of local structures and avoid losing some optimal characteristics of the solutions.

The convergence of genetic algorithms deserves attention; it plays an important role in deciding when the evolution should be stopped. In practice, if the largest fitness in the population shows no significant increase over several generations of evolution, the evolution can be stopped.

5.3.1.4 Algorithm Steps
According to the above three basic operations, a genetic algorithm consists of the following steps:

ABCDEFGHIJKLMNOPQRS

ZYXWVUTSRQPO NOPQRS

ZYXWVUTSRQPONMLKJIH

ABCDEFGHIJKLM NMLKJIH

Figure 5.8: Illustration of code string crossover.

5.3 Image Interpretation with Genetic Algorithms

(1) Generate the encoded strings of the initial population and compute their objective function values (fitness).
(2) Copy code strings with high fitness (according to probability) into the new population, removing code strings with low fitness.
(3) Build new code strings from the code strings copied from the old population through crossover.
(4) Occasionally mutate a code randomly selected from the code strings.
(5) Rank the code strings of the current population according to their fitness.
(6) If the fitness value of the code string with maximum fitness has not significantly increased over several evolutionary steps, stop; otherwise, return to step 2 and continue the calculation.
A minimal sketch of such a loop is given below.
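As an illustration only (not the book's implementation), the following Python sketch follows the six steps above for a generic bit-string problem; the encoding length, the fitness function, and all parameters are assumptions.

# Minimal generic genetic algorithm skeleton (illustrative only).
import random

def genetic_algorithm(fitness, length=8, pop_size=20, p_cross=0.7,
                      p_mut=0.001, patience=10):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best, stale = max(pop, key=fitness), 0
    while stale < patience:                      # step 6: stop when no improvement
        # Step 2: fitness-proportional (roulette-wheel) reproduction.
        weights = [fitness(s) + 1e-9 for s in pop]
        pop = random.choices(pop, weights=weights, k=pop_size)
        # Step 3: single-point crossover on random pairs.
        nxt = []
        for a, b in zip(pop[::2], pop[1::2]):
            if random.random() < p_cross:
                cut = random.randrange(1, length)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            nxt += [list(a), list(b)]
        # Step 4: occasional mutation of single bits.
        pop = [[1 - g if random.random() < p_mut else g for g in s] for s in nxt]
        # Steps 5-6: track the best string and count stagnant generations.
        cand = max(pop, key=fitness)
        best, stale = (cand, 0) if fitness(cand) > fitness(best) else (best, stale + 1)
    return best

# Hypothetical fitness: prefer strings with many 1s (the "one-max" toy problem).
print(genetic_algorithm(lambda s: sum(s)))

For semantic segmentation, the bit string would be replaced by a string of region labels and the fitness by the segmentation objective function described in the next subsection.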

5.3.2 Semantic Segmentation and Interpretation

An important feature of the genetic algorithm is that it considers all the samples of the population in a single processing step: samples with high fitness are retained and the others die. This feature is suitable for semantic image segmentation. Here, semantic segmentation means dividing the image into regions based on semantic information and optimizing the result according to context and other high-level information.

5.3.2.1 Objective Function
Image interpretation using a genetic algorithm is based on the principle of hypothesis testing, where the objective function optimized by the genetic algorithm evaluates the quality of the image segmentation and its semantic interpretation. The algorithm begins with an over-segmented image (called the primary segmentation), whose starting regions are called primary regions. The algorithm iteratively updates and merges the primary regions into the current regions, that is, it continuously builds feasible region divisions and interprets new hypothesis samples. The primary regions can be described by a primary region adjacency graph, and combinations of primary region adjacency graphs produce specific region adjacency graphs. A specific region adjacency graph expresses the regions merged from all adjacent regions with the same interpretation. Each possible semantic image segmentation corresponds to only one specific region adjacency graph, which can therefore be used to evaluate the objective function of the semantic segmentation.

The design of the objective function (i.e., the fitness function of the genetic algorithm) is a key issue in semantic segmentation. This function must be based on the properties of the image regions and the relationships among these regions, and it requires a priori knowledge of the desired segmentation. The objective function consists of three parts:

156

(1)

5 Scene Analysis and Semantic Interpretation

(1) The confidence in interpretation ki for region Ri according to the nature of the region itself, which is proportional to the corresponding probability:

C(ki|Xi) ∝ P(ki|Xi)  (5.11)

(2) The confidence in interpretation ki for region Ri according to the interpretations kj of the neighboring regions Rj:

C(ki) = [C(ki|Xi)/NA] ∑_{j=1}^{NA} V(ki, kj) C(kj|Xj)  (5.12)

where V(ki, kj) represents the value of the compatibility function for two adjacent objects Ri and Rj, with labels Li and Lj, respectively, and NA is the number of regions adjacent to region Ri.
(3) The evaluation of the interpretation confidence for the whole image:

Cimage = (1/NR) ∑_{i=1}^{NR} C(ki)  (5.13)

or

C′image = ∑_{i=1}^{NR} [C(ki)/NR]²  (5.14)

wherein C(ki) can be computed according to eq. (5.12), and NR is the number of regions in the corresponding specific region adjacency graph. The genetic algorithm tries to optimize the objective function Cimage representing the current segmentation and interpretation hypothesis. The segmentation optimization function is based on unary properties of the hypothesized regions and on binary relations between these regions and their interpretations. In the evaluation of the local region confidence C(ki|Xi), a priori knowledge about the characteristics of the image under consideration is used (a small numerical sketch of eqs. (5.12) and (5.13) is given after the algorithm steps below).

5.3.2.2 Concrete Steps and Examples
According to the objective function described above, the procedure using the genetic algorithm for segmentation and semantic interpretation has the following steps:
(1) Initialize the original image into primary regions, and define the correspondence between the relative position of each region and the position of its label in the code strings generated by the genetic algorithm.
(2) Build the primary region adjacency graph.
(3) Select the starting population of code strings at random. Whenever possible, it is better to use a priori knowledge to determine the starting population.

5.3 Image Interpretation with Genetic Algorithms

(4)

(5) (6) (7)

157

(4) Genetically optimize the current population, using the current region adjacency graph to compute the value of the segmentation optimization function for each code string.
(5) If the maximum value of the optimization criterion did not change significantly in a number of successive steps, go to step 7.
(6) Let the genetic algorithm generate a new population of segmentation and interpretation hypotheses, and go to step 4.
(7) The final code string with the greatest confidence (the optimal segmentation hypothesis) represents the ultimate semantic segmentation and interpretation.
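As an illustration only (not the book's code), the following Python sketch evaluates the region and image confidences of eqs. (5.12) and (5.13) for a labeled region adjacency graph; the unary confidences, compatibility values, and adjacency data of the toy example are assumptions in the spirit of Example 5.1.

# Segmentation objective of eqs. (5.12) and (5.13) on a toy region adjacency graph.
def region_confidence(i, labels, unary, compat, adj):
    # Eq. (5.12): support for the label of region i from its adjacent regions.
    neighbors = adj[i]
    s = sum(compat[(labels[i], labels[j])] * unary[j][labels[j]] for j in neighbors)
    return unary[i][labels[i]] / len(neighbors) * s

def image_confidence(labels, unary, compat, adj):
    # Eq. (5.13): average region confidence over the whole image.
    n = len(labels)
    return sum(region_confidence(i, labels, unary, compat, adj) for i in range(n)) / n

# Hypothetical 3-region example: 'B' (ball) surrounded by 'L' (lawn).
unary = [{'B': 0.9, 'L': 0.1}, {'B': 0.2, 'L': 0.8}, {'B': 0.1, 'L': 0.9}]
compat = {('B', 'L'): 1.0, ('L', 'B'): 1.0, ('L', 'L'): 1.0, ('B', 'B'): 0.0}
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(image_confidence(['B', 'L', 'L'], unary, compat, adj))
print(image_confidence(['B', 'B', 'L'], unary, compat, adj))   # lower confidence

In the genetic algorithm of this section, image_confidence would serve as the fitness of a label code string, so hypotheses like the second one would tend to die out over the generations.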

Example 5.1 Semantic segmentation example One synthetic image and its graph representations are shown in Figure 5.9. Figure 5.9(a) represents a scene image with a ball on the grass, Sonka (2008). Preliminary semantic segmentation divided it into five different regions Ri , i = 1 ∼ 5, abbreviated 1 ∼ 5 in the figure. Figure 5.9(b) gives the region adjacency graph of Figure 5.9(a), in which nodes represent regions, and arcs connect nodes of adjacent regions. Figure 5.9(c) shows the dual graph of Figure 5.9(b), where nodes correspond to the intersection of different regional contours, and arcs correspond to the contour segment. ◻∘ Let B represent the marker for the ball, L represent the marker for the grass. High-level knowledge used is: the image has a round ball that is in the green grass region. Define two unary conditions: one is the confidence that a region in a ball region depends on the compactness of region C C(ki = B|Xi ) = compactness(Ri )

(5.15)

and another is the confidence that a region in a grassland region depends on the greenness of region G C(ki = L|Xi ) = greenness(Ri )

1

2

2

1 3

4

5

3 1

5

2 3 4 5

4 (a)

(5.16)

(b)

(c)

Figure 5.9: A synthetic image and its graph representations.

158

5 Scene Analysis and Semantic Interpretation

Suppose that the confidence of region constituted by ideal sphere ball and the confidence of region constituted by ideal lawn area are both equal to 1, that is C(B|circular) = 1

(5.17)

C(L|green) = 1

Define a binary condition: the confidence that a region is inside another region is provided by the following compatibility function r(B|is inside L) = 1

(5.18)

The confidences of other combinations are zero. Unary conditions show that the more compact a region, the more round it is, and the higher it is interpreted as a sphere. Binary condition indicates that a ball can only be completely surrounded by a lawn. Let the code strings represent the region labels in order of region numbers, that is, the region numbers correspond to the location of region labels in the code string. Suppose at random the code strings obtained by region labeling are BLBLB and LLLBL, which represent the two groups of segmentation hypotheses shown in Figure 5.10. In each group of Figure 5.10, the left figure is the interpretation, the up-right is the corresponding code string, and the low-right is the corresponding region adjacency graph. Suppose a randomized crossover is performed at the location| between the second position and third position (represented by |). According to eq. (5.13), the respective confidences of compactness representing ball region and of location that the ball region is inside grassland are as follows: BL|BLB ⇒ Cimage = 0.00, LL|LBL ⇒ Cimage = 0.12, LLBLB ⇒ Cimage = 0.20, BLLBL ⇒ Cimage = 0.00. The latter two code strings are obtained by swapping the parts before | in the former two code strings, as shown in Figure 5.11. From the above confidence values, it is seen that the second and third segmentation hypotheses corresponding to the second and third code strings are relatively the best. Therefore, the first and four code strings can be eliminated, while for the second and third code strings another randomized crossover between the third position and

B

L

L BLBLB

LLLBL

B (1, 3, 5)

L L

(4)

L (2)

B

B

L

(4)

(1, 2, 3, 5)

Figure 5.10: Primary interpretation, code string, and regional adjacency graph.

5.4 Labeling of Objects in Scene

L

B

L

LLBLB L

B

159

BLLBL

L

B

(1, 2, 4)

(3, 5)

B

L

L

B

(2, 3, 5)

(1, 4)

Figure 5.11: Interpretation, code string, and region adjacency graph after a random crossover.

fourth position can be conducted. The new confidences are as follows: LLL|BL ⇒ Cimage = 0.12, LLB|LB ⇒ Cimage = 0.20, LLLLB ⇒ Cimage = 0.14, LLBBL ⇒ Cimage = 0.18. The latter two code strings are obtained by swapping the parts after | in the former two code strings. Now, selecting the aforementioned second and fourth code strings, another randomized crossover is performed at the location between the fourth position and fifth position. The new confidences are as follows: LLBL|B ⇒ Cimage = 0.20, LLBB|L ⇒ Cimage = 0.18, LLBLL ⇒ Cimage = 0.10, LLBBB ⇒ Cimage = 1.00. The latter two code strings are obtained by swapping the parts after | in the former two code strings. Because the current code string LLBBB has the highest degree of confidence among available segmentation hypotheses, so genetic algorithm stops. In other words, if it continues to generate hypotheses, no better confidence would be obtained. Thus the obtained optimal segmentation result is shown in Figure 5.12.

5.4 Labeling of Objects in Scene Labeling objects in scene refers to the semantic labeling of object regions in image, that is, to assign the semantic symbols to objects. Here, it is assumed that regions corresponding to the image of scene have been detected, and the relationship between these regions and objects has been described by the region adjacency graph or by the semantic web. The property of the object itself can be described by unary relationship, while the relationship among objects can be described by binary or higher-order correlations. The objective of labeling objects in scene is to assign a tag (with semantic meaning) for each object in the image of scene so as to obtain the proper interpretation of the scene image. Thus obtained interpretation should be consistent with the scene

L LLBBB B

L

B

(1, 2)

(3, 4, 5)

Figure 5.12: Optimal interpretation, code string, and region adjacency graph.

160

5 Scene Analysis and Semantic Interpretation

knowledge. Tags need to have consistency (i.e., any two objects in image are in reasonable structure or relationship), and tend to have the most likely explanation when there are multiple possibilities. 5.4.1 Labeling Methods and Key Elements For the labeling of objects in scene, there are mainly two methods, Sonka (2008). (1) Discrete labeling. It assigns each object just a tag, the main consideration is the consistency of the image tags. (2) Probabilistic labeling. It allows assigning multiple tags to the jointly existing objects. These tags are weighted by the probability, and each tag has a confidence. The difference between the two methods is mainly reflected in the robustness of the scene interpretation. There are two results for discrete tag, one is that a consistent tag is obtained, the other is that the impossibility of assigning a consistent tag to scene is detected. Due to the imperfection of segmentation, discrete tag will generally give the result that is not consistent in description (even if only a few local inconsistencies were detected). On the other hand, the probability tag can always give the labeling results and corresponding trusts. Although it is possible to have some locally inconsistent results, which are still better than the description results provided by discrete tag, which are consistent but quite impossible. In extreme cases the discrete tag can be regarded as a special case of the probability tag, in which the probability of one tag is 1 and the probabilities of all other tags are 0. The method of scene labeling includes the following key elements: (1) A group of objects Ri , i = 1, 2, . . . , N; (2) For each object Ri , there is a limited set of tags Qi , this same set is also applicable to all objects; (3) A limited set of relationship between objects; (4) A compatibility function between related objects. This reflects the constraints for relationship between objects. If the direct relationship among all objects in image should be considered for solving labeling problems, a very large amount of computation is required. Therefore, the method of constraint propagation is generally used. That is, the local constraint is used to obtain local consistency (the local optimum), then the iterative method is used to adjust the local consistency to the global consistency of the entire image (the global optimum). 5.4.2 Discrete Relaxation Labeling One example for scene labeling is shown in Figure 5.13. Figure 5.13(a) shows a scene with five object regions (including the background), Sonka (2008). The five regions are

5.4 Labeling of Objects in Scene

161

denoted by B (background), W (window), T (table), D (drawer) and P (phone). Unary properties for describing objects are: (1) The windows are rectangular; (2) The table is rectangular; (3) The drawer is rectangular. Binary constraints/conditions are as follows: (1) The windows are located above the table; (2) The phone is on the table; (3) The drawer is inside the table; (4) The background is connected with image boundary. Under these constraints, some of the results in Figure 5.13(b) are not consistent. An example of a discrete relaxation labeling process is shown in Figure 5.14. First, all existing tags are assigned to each object, as shown in Figure 5.14(a). Then, the iterative checking of the consistency is performed for each object to remove those tags that is likely to not satisfy the constraints. For example, considering the connection of background and the boundary of the image could determine the background in the beginning, and subsequently the other objects could not be labeled as background. After removing the inconsistent tags, Figure 5.14(b) will be obtained. Then considering that the windows, tables, and drawers are rectangular, the phone can be determined, as shown in Figure 5.14(c). Proceeded in this way, the final consistency results with correct tags are shown in Figure 5.14(d).

Background

B D

Window Phone

T

Drawer

W

Table

P

(a)

(b)

Figure 5.13: Labeling of objects in scene.

B

BDPTW BDPTW

(a)

BDP TW

B

B

DTW

DPTW

W

DP TW

P

P

BDPTW

DPTW

DTW

D

BDPTW

DPTW

DTW

T

(b)

(c)

Figure 5.14: The process and result of discrete relaxation labeling.

(d)

162

5 Scene Analysis and Semantic Interpretation

5.4.3 Probabilistic Relaxation Labeling As a bottom-up method for interpretation, discrete relaxation labeling may encounter difficulties when the object segmentation was incomplete or incorrect. Probabilistic relaxation labeling method may overcome the problem of object loss (or damage) and the problem of false object, but may also produce certain ambiguous inconsistent interpretations. Considering the local structure shown on the left of Figure 5.15, its region adjacency graph is on the right part. The object Rj is denoted qj , qj ∈ Q, Q = {w1 , w2 , . . . , wT }. The compatibility function of two objects Ri and Rj with respective tags qi and qj is r(qi = wk , qj = wl ). The algorithm iteratively searches the best local consistency in whole image. Suppose in step b of iterative process the tag qi is obtained according to the binary relationship of Ri and Rj , then the support (wk for qi ) can be expressed as T

(b) s(b) j (qi = wk ) = ∑ r(qi = wk , qj = wl )P (qj = wl )

(5.19)

l=1

where P(b) (qj = wl ) is the probability that the region Rj is labeled as wl at this time. Consider all the N objects Ri (labeled with wi ) linked with Rj (labeled with wj ), the total support obtained is N

N

T

j=1

j=1

l=1

(b) S(b) (qi = wk ) = ∑ cij s(b) j (qi = wk ) = ∑ cij ∑ r(qi = wk , qj = wl )P (qj = wl )

(5.20)

where cij is a positive weight to meet ∑Nj=1 cij = 1, which represents the binary contact strength between two objects Ri and Rj . The iterative update rule is P(b+1) (qi = wk ) =

1 (b) P (qi = wk )S(b) (qi = wk ) K

(5.21)

where K is the normalization constant: T

K = ∑ P(b) (qi = wl )S(b) (qi = wl )

(5.22)

l=1

R3

R2 R3 R1

Ri R4

q4

q3

q2 Ri qi

R4

R2

q1 R1

Rk qk R k

Figure 5.15: Local structure and region adjacency graph.

5.5 Scene Classification

163

This is a nonlinear relaxation problem. Taking eq. (5.20) into eq. (5.21), a global optimization function can be obtained: N

T

N

T

F = ∑ ∑ P(qi = wk ) ∑ cij ∑ r(qi = wk , qj = wl )P(qj = wl ) k=1 i=1

j=1

(5.23)

i=1

The constraints for solutions T

∑ P(qi = wk ) = 1

∀ i P(qi = wk ) ≥ 0

∀ i, k

(5.24)

k=1

In the concrete use of probabilistic relaxation labeling method, the conditional probabilities of a tag for all objects are first determined, then the iteration repeats and the following two steps are continuously iterated: (1) Computing, according to eq. (5.23), the objective function representing the quality of scene labeling; (2) Updating the probability of a tag to increase the value of the objective function (to improve the quality of the scene labeling); Once the objective function value is maximized, then the best tag is obtained.

5.5 Scene Classification Scene classification is to determine the various specific regions in image, according to the principles of visual perception organization, and to give a conceptual explanation of the scene. Its concrete means and goals are to automatically classify and label the images according to a given set of semantic classes, so as to provide an effective context information for object identification and scene content interpretation. Scene classification is different from object recognition (but needs to have a full knowledge of the object). In many cases, it is required to classify the object before having its full information (in some cases, using only low-level information, such as color, texture, and so are already able to achieve categorization). Referring to the human visual cognitive processes, some primary classification and recognition have already met the requirements for the scene classification, namely the establishment of links between low-level features and high-level cognitive, and the determination and interpretation of semantic categories of the scene. Scene classification has a guiding role for object recognition. Naturally, most of the objects appear only in specific scenarios, so the correct judgment of the global scene can provide a reasonable context constraint mechanism for partial image analysis.


5.5.1 Bag of Words/Bag of Features Models
The bag of words model is derived from natural language processing. After its introduction into the image field, it is often referred to as the bag of features model: the bag is formed by collecting the features that belong to the same target class, which is where the name comes from, Sivic (2003). The model usually takes a directed graph structure; in an undirected graph there are probabilistic constraint relationships between nodes, while in a directed graph there are causal relationships between nodes, and an undirected graph can be regarded as a special kind of directed graph, the symmetric directed graph. In the bag of features model, the conditional independence between the image and the visual vocabulary is the theoretical basis of the model, but the model does not strictly encode the geometric information of the object components.
The original bag of words model only considers the co-occurrence relationships between the features corresponding to the words and the logical relationships of the topics, while it ignores the spatial relationships among features. However, in the image field, not only the image features themselves but also their spatial distributions are significant. In recent years, many feature descriptors (e.g., SIFT) with relatively high dimensionality have appeared. They represent, comprehensively and explicitly, the key points in an image and the characteristics of the small areas around them (different from corner points, which represent only location information and carry their own characteristics implicitly), and they also distinguish one key point and its surrounding area from another. In addition, these feature descriptors can overlap each other in image space, so the relationships among features can be better preserved. The utilization of these features enhances the ability to describe the spatial distribution of image features.
To represent and describe a scene with the bag of features model, features describing local regions must be extracted from the scene. These features can be called the visual vocabulary. If a scene is decomposed, there will be some basic components. By analogy with documents, a book is made up of many words; returning to the image field, the image of a scene can be considered to be composed of many visual words. From a cognitive point of view, each visual word corresponds to a feature (more precisely, a feature describing a local scene characteristic) in the image and is a basic unit reflecting the image content or the meaning of the scene.
Building a collection of visual vocabulary (a dictionary) may include the following steps:
(1) Extracting features;
(2) Learning the visual vocabulary;
(3) Quantizing the features with the visual vocabulary;
(4) Representing the image by the frequencies of the visual words.


Figure 5.16: The process of obtaining local region description features in image.

A specific example is shown in Figure 5.16. First, regions (the neighborhoods of key points) in the image are detected, and different kinds of regions are divided and extracted, as shown in Figure 5.16(a), where square regions are used for simplicity. Then, for each region, a feature vector is calculated to represent it, as shown in Figure 5.16(b). Next, the feature vectors are quantized into visual words and a codebook is built, as shown in Figure 5.16(c). Finally, the appearance frequency of each word is counted for every image (a few examples using histograms are shown in Figure 5.16(e) and Figure 5.16(f)); these frequencies are combined to give the representation of the images.
If the image is divided into a number of subregions, and each subregion is assigned a semantic concept, then each subregion can be taken as a visual unit with its own independent semantic meaning. Because similar scenes should have similar distributions of concepts, a scene can be classified into a specific semantic category according to the regional distribution of semantic concepts. If semantic concepts and visual vocabulary can be linked, then scene classification can be performed by means of the representation and description model of words.
The visual vocabulary can directly represent the objects, or it can represent only middle-level concepts in the neighborhoods of key points. The former requires detecting or segmenting the objects in the scene and then performing scene classification via object classification; for example, once the sky is detected, the image should be an outdoor one. The latter does not require direct segmentation of objects, but identifies the scene labels by using local descriptors obtained from training. There are three general steps:
(1) Feature point detection: The often-used methods include the image grid method and the Gaussian difference method. The former divides the image into a mesh and takes the center positions of the mesh cells as feature points. The latter uses the DoG operator to detect local feature points of interest, such as corners.

(2) Feature representation and description: Both the characteristics of the feature points themselves and those of their neighborhoods should be combined. In recent years the SIFT operator has often been used, in which feature point detection and feature representation and description are actually combined.
(3) Generating the dictionary: The local descriptors are clustered (e.g., using the k-means clustering method), and the cluster centers are taken to build the dictionary (a minimal sketch of steps (1)-(3) is given below).
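The following sketch builds a visual dictionary and bag-of-words histograms. Random vectors stand in for real SIFT descriptors, and scikit-learn's KMeans stands in for whichever clustering routine is preferred; both are assumptions made for illustration only.

```python
# A minimal sketch of dictionary generation and bag-of-words image representation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# (1)-(2) Pretend descriptors: 500 local descriptors of dimension 128 per image.
descriptors_per_image = [rng.random((500, 128)) for _ in range(10)]

# (3) Generate the dictionary: cluster all descriptors and keep the cluster centers.
all_desc = np.vstack(descriptors_per_image)
codebook = KMeans(n_clusters=50, n_init=4, random_state=0).fit(all_desc)

# Represent each image by the frequency of each visual word (a normalized histogram).
def bow_histogram(desc, codebook):
    words = codebook.predict(desc)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

hists = np.array([bow_histogram(d, codebook) for d in descriptors_per_image])
print(hists.shape)   # (10 images, 50 visual words)
```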

Example 5.2 Dictionary of visual vocabulary
An example of constructing a dictionary of visual vocabulary is given in Figure 5.17. In practice, the choice of the local region can make use of the SIFT local descriptor, the selected local region being a circular area centered on the key point. These local regions have similar characteristics, as shown in Figure 5.17(a). The constructed dictionary of visual vocabulary is shown in Figure 5.17(b), in which each subimage represents a single basic visual word (a cluster of key features) and can be represented by a vector, as shown in Figure 5.17(c). With the dictionary, the original image can be represented by a combination of the visual words, and the frequencies with which the various visual words are used reflect the image characteristics. ◻∘
In practical applications, the image is first expressed in terms of visual vocabulary with the help of a feature detection operator and a feature descriptor; then parameter estimation and probabilistic reasoning are carried out to obtain the iterative formulas for the parameters and the results of the probability analysis; finally, the trained models are analyzed to gain understanding. The most commonly used models are Bayesian-related models, such as the typical probabilistic latent semantic analysis (pLSA) model and the latent Dirichlet allocation (LDA) model. Within the framework of the bag of features model, the image is seen as a text, a topic discovered from the image is seen as an object class (such as teachers or sports grounds), and a scene comprising multiple objects is seen as a probabilistic model with a mixed group of topics; the classification into semantic categories can then be made by analyzing the topic distribution in the scene.

Figure 5.17: Get a visual vocabulary with SIFT local descriptor.


5.5.2 pLSA Model
The pLSA model is derived from probabilistic latent semantic indexing and is a graph model for object and scene classification, Sivic (2005). The pLSA model originated in the learning of natural language and texts, and its original definitions adopt the concepts of text, but it is also very easy to extend to the image field (in particular by means of the bag of features model).
5.5.2.1 Model Description
Suppose there is a set of images T = {ti}, i = 1, ..., N, where N is the total number of images; T contains visual words from the word set, that is, the dictionary (visual vocabulary) S = {sj}, j = 1, ..., M, where M is the total number of words. The properties of the image set T can be described by an N × M statistical co-occurrence matrix P, in which each element pij = p(ti, sj) represents the frequency of word sj appearing in image ti. In practice this is a sparse matrix.
The pLSA model uses a latent variable model to describe the data in the co-occurrence matrix. It associates each observation (word sj appearing in image ti) with a latent variable (called the topic variable) z ∈ Z = {zk}, k = 1, ..., K. Let p(ti) represent the probability of selecting image ti, p(zk|ti) the probability that topic zk appears in image ti (the distribution of the image over the topic space), and p(sj|zk) the probability that word sj appears in a particular topic zk (the distribution of the topic over the dictionary); then, by selecting an image ti with probability p(ti) and a topic with probability p(zk|ti), a word sj can be generated with probability p(sj|zk). The conditional probability model based on the co-occurrence of topics and words can be defined as

p(s_j \mid t_i) = \sum_{k=1}^{K} p(s_j \mid z_k)\, p(z_k \mid t_i)    (5.25)

That is, each word in every image may be formed by mixing the K latent topic variables p(sj|zk) with coefficients p(zk|ti). Thus, the element of the co-occurrence matrix P is

p(t_i, s_j) = p(t_i)\, p(s_j \mid t_i)    (5.26)
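As a small illustration of eq. (5.26), the sketch below builds the N × M co-occurrence matrix from the visual-word indices observed in each image; the word lists are made-up placeholders, not data from the text.

```python
# A minimal sketch of building the image-word co-occurrence matrix P.
import numpy as np

M = 6                                                  # dictionary size (visual words)
images = [[0, 0, 2, 3], [1, 1, 1, 5, 2], [4, 4, 0]]    # word indices observed per image
N = len(images)

P = np.zeros((N, M))
for i, words in enumerate(images):
    for w in words:
        P[i, w] += 1
P /= P.sum()                                           # joint frequencies p(t_i, s_j)
print(np.round(P, 3))
```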

A graph representation of the pLSA model is depicted in Figure 5.18, in which the boxes represent sets (the large box represents the image set, the small box the repeated selection of topics and words); the arrows represent the dependencies between nodes; the nodes are random variables: the left observation node t (shaded) corresponds to the image, the right observation node s (shaded) corresponds to the visual word described by the descriptors, and the intermediate node z is a latent node (unobserved) that represents the object category, namely the topic. The model establishes the probabilistic mapping among the topic z, the image t, and the visual vocabulary s, and selects the category corresponding to the maximum a posteriori estimate as the final classification result.

Figure 5.18: pLSA schematic model.

The objective of the pLSA model is to search for the vocabulary distribution p(sj|zk) under a specific topic zk and for the mixing ratios p(zk|ti) corresponding to a specific image, thereby obtaining the vocabulary distribution p(sj|ti) of the specific image. Equation (5.25) represents every image as a convex combination of K topic vectors, which can be illustrated by matrix operations, as shown in Figure 5.19: each column of the left matrix represents the visual words in a given image, each column of the middle matrix represents the visual words in a given topic, and each column of the right matrix represents the topics in a given image (object class).
5.5.2.2 Model Calculations
It is required to determine the topic vectors common to all images and the specific mixing coefficients for each image, with the aim of giving a high-probability model to the words appearing in the images, so that the category with the maximum a posteriori probability can be selected as the final object category. This can be done by optimizing the parameters of the following objective function so as to obtain a maximum likelihood estimate:

L = \prod_{j=1}^{M} \prod_{i=1}^{N} p(s_j \mid t_i)^{\,p(s_j, t_i)}    (5.27)

The maximum likelihood estimate for the latent variable model can be computed using the expectation-maximization (EM) algorithm, which consists of two steps, the E step and the M step.

Figure 5.19: Co-occurrence matrix decomposition.


The E step is an expectation step, in which the posterior probabilities of the latent variables are calculated on the basis of the current parameter estimates. The M step is a maximization step, in which the expected complete-data likelihood obtained from the E step is maximized. The E step can be expressed (using the Bayes formula) as

p(z_k \mid t_i, s_j) = \frac{p(s_j \mid z_k)\, p(z_k \mid t_i)}{\sum_{l=1}^{K} p(s_j \mid z_l)\, p(z_l \mid t_i)}    (5.28)

The M step has the iterative formula

p(s_j \mid z_k) = \frac{\sum_{i=1}^{N} p(s_j \mid z_k)\, p(z_k \mid t_i)}{\sum_{l=1}^{K} p(s_j \mid z_l)\, p(z_l \mid t_i)}    (5.29)

The E step and the M step are performed alternately until the termination condition is satisfied. Finally, the category judgment can be carried out by means of the following formula:

z^{*} = \arg\max_{z} \{ p(z \mid t) \}    (5.30)
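The following sketch fits a pLSA model with EM on a word-image count matrix, in the spirit of eqs. (5.27)-(5.30). The update used here is the standard count-weighted form of the M step, which differs slightly in notation from eq. (5.29), and the count matrix is random toy data.

```python
# A compact sketch of pLSA fitted with EM on toy counts.
import numpy as np

rng = np.random.default_rng(1)
n = rng.integers(0, 5, size=(20, 50)).astype(float)   # n[i, j]: count of word j in image i
N, M = n.shape
K = 3                                                  # number of latent topics

p_z_t = rng.dirichlet(np.ones(K), size=N)              # p(z_k | t_i), shape (N, K)
p_s_z = rng.dirichlet(np.ones(M), size=K)              # p(s_j | z_k), shape (K, M)

for _ in range(100):
    # E step: posterior p(z_k | t_i, s_j), cf. eq. (5.28)
    post = p_z_t[:, :, None] * p_s_z[None, :, :]       # (N, K, M)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M step: re-estimate p(s|z) and p(z|t) from count-weighted posteriors
    nw = n[:, None, :] * post                          # (N, K, M)
    p_s_z = nw.sum(axis=0)
    p_s_z /= p_s_z.sum(axis=1, keepdims=True)
    p_z_t = nw.sum(axis=2)
    p_z_t /= p_z_t.sum(axis=1, keepdims=True)

# Classification in the spirit of eq. (5.30): dominant topic per image
print(np.argmax(p_z_t, axis=1))
```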

Example 5.3 Expectation-maximization algorithm
The expectation-maximization algorithm is a statistical computing algorithm that searches for maximum likelihood or maximum a posteriori estimates of the parameters of a statistical probability model that depends on unobservable latent variables. It is an iterative technique for estimating unknown variables from part of the relevant known variables. The algorithm alternates between two steps:
(1) Computing the expected value (E step), namely using the existing estimates of the latent variables to calculate the expected value of the likelihood;
(2) Maximization (M step), namely estimating the desired parameter values on the basis of the expected likelihood obtained in the E step; the parameter values thus obtained are used in the next E step. ◻∘
5.5.2.3 Model Application Examples
Consider an emotion-based task of semantic image classification, Li (2010). An image includes not only visual scene information but also a variety of emotional semantic information. In other words, besides expressing the scenery, state, and environment of the objective world, an image can also evoke a strong emotional response. Different emotional adjectives can be used to express different categories of emotion. One sentiment classification framework divides all emotions into 10 categories: five positive categories (amusement, contentment, excitement, awe, and undifferentiated positive) and five negative categories


(anger, sadness, disgust, fear, and undifferentiated negative). The international community has established the International Affective Picture System (IAPS) database, Lang (1997), which contains a total of 1182 color pictures with very rich object categories. Some examples of pictures belonging to the above-mentioned 10 emotion categories are shown in Figure 5.20; Figure 5.20(a) to Figure 5.20(e) correspond to the five positive emotions and Figure 5.20(f) to Figure 5.20(j) to the five negative emotions.
In image classification based on emotional semantic information, the images are the pictures from the database, the words are selected from the emotional vocabulary, and the topics correspond to latent emotional semantic factors (an intermediate layer of semantic concepts between low-level image features and high-level emotional categories). First, the low-level image features obtained with the SIFT operator are clustered with the k-means algorithm to form the emotional dictionary. Then, the pLSA model is used to learn the latent emotional semantic factors, so as to obtain the probability distribution p(sj|zk) of every latent emotional semantic factor over the emotional words and the probability distribution p(zk|ti) of every picture over the latent emotional semantic factors. Finally, a support vector machine (SVM) is used to train the emotional image classifier, which is then applied to the classification of pictures into the different emotional categories.
Example 5.4 Classification test and result
Some experimental results obtained with the above method for emotion classification are shown in Table 5.1, where 70% of the pictures of each emotional class are taken as the training set and the remaining 30% as the test set. The training and testing processes are repeated 10 times, and the table shows the average correct classification rate over the 10 categories (%). The number of emotional words s is selected from 200 to 800 with an interval of 100, and the number of latent emotional semantic factors z ranges from 10 to 70 with a step of 10.
Table 5.1 shows the effects of different numbers of latent emotional semantic factors and emotional words on the image classification. When the number of latent emotional semantic factors is fixed, as the number of emotional words increases, the classification performance first gradually improves and then declines, the best being around s = 500. Similarly, when the number of emotional words is fixed, as the number of latent emotional semantic factors increases, the classification performance also first improves and then declines, the best being around z = 20. Therefore, by selecting s = 500 and z = 20, the best classification result can be achieved. ◻∘
5.5.3 LDA Model
The LDA model is a probabilistic model of a collection. It can be seen as formed by adding a hyperparameter layer to the pLSA model and building the probability distribution of the latent variable z, Blei (2003).

Figure 5.20: Example pictures of the 10 kinds of emotion categories: (a) amusement, (b) contentment, (c) excitement, (d) awe, (e) undifferentiated positive, (f) anger, (g) sadness, (h) disgust, (i) fear, (j) undifferentiated negative.

Table 5.1: Classification examples (average correct classification rate, %)

z \ s    200     300     400     500     600     700     800
10      24.3    29.0    33.3    41.7    35.4    36.1    25.5
20      38.9    45.0    52.1    69.5    62.4    58.4    45.8
30      34.0    36.8    43.8    58.4    55.4    49.1    35.7
40      28.4    30.7    37.5    48.7    41.3    40.9    29.8
50      26.5    30.8    40.7    48.9    39.5    37.1    30.8
60      23.5    27.2    31.5    42.0    37.7    38.3    26.7
70      20.9    22.6    29.8    35.8    32.1    23.1    21.9

5.5.3.1 Basic LDA Model
The LDA model can be illustrated by Figure 5.21, where the boxes represent sets (the large box represents the image set with M images; the small box represents the repeated selection of topics and words in an image, N being the number of words in an image; it is generally assumed that N is independent of q and z). Figure 5.21(a) shows the basic LDA model. The leftmost latent node, a, corresponds to the Dirichlet prior parameter of the topic distribution of every image. The second latent node from the left, q, represents the topic distribution of an image (qi is the topic distribution of image i); q is also called the mixing probability parameter. The third latent node from the left, z, is the topic node; zij represents the topic of word j in image i. The fourth node from the left is the only observation node (shaded): s is the observation variable, and sij represents the jth word in image i. The rightmost node, b, is the multinomial distribution parameter of the topic-word layer, namely the Dirichlet prior parameter of the word distribution of every topic.
As seen above, the basic LDA model is a three-layer Bayesian model, in which a and b are hyperparameters belonging to the image-set layer, q belongs to the image layer, and z and s belong to the visual vocabulary layer. The LDA model includes K latent topics z = {z1, z2, ..., zK}, and every word in an image is produced by a corresponding topic. Each image is composed of the K topics according to a specific probability q. The model parameter N obeys a Poisson distribution; q obeys a Dirichlet distribution, that is, q ∼ Dirichlet(a), where a is the prior parameter of the Dirichlet distribution (the topic distributions of the images follow the Dirichlet probability distribution).

Figure 5.21: LDA schematic model.


Each word is a term of the dictionary (visual vocabulary). It can be represented by a vector whose dimension equals the dictionary size, with only one component being 1 and all other components being 0. A word sj is selected from the dictionary with probability p(sj|q, b).
Solving the LDA model involves two processes: variational approximate inference (a Gibbs sampler can also be used, see Griffiths (2004)) and parameter learning. Variational inference refers to the process of determining the topic mixing probability q and the probability of every word being produced by topic z, given the hyperparameters a and b as well as the observed variables s, that is,

p(q, z \mid s, a, b) = \frac{p(q, z, s \mid a, b)}{p(s \mid a, b)} = \frac{p(q \mid a) \left[ \prod_{i=1}^{N} p(z_i \mid q)\, p(s_i \mid z_i, b) \right]}{\int p(q \mid a) \left[ \prod_{i=1}^{N} \sum_{z_i} p(z_i \mid q)\, p(s_i \mid z_i, b) \right] \mathrm{d}q}    (5.31)

Here the denominator p(s|a, b) is the likelihood function of the words. Because of the coupling between q and b, p(s|a, b) cannot be calculated directly. As seen from the LDA graph model, the coupling is induced by the conditional relationships among q, z, and s. Therefore, by deleting the connections between q and z as well as the observation node s, a simplified model is obtained, as shown in Figure 5.21(b), and an approximate distribution p′(q, z|r, f) of p(q, z|s, a, b) can be obtained as

p'(q, z \mid r, f) = p(q \mid r) \prod_{i=1}^{N} p(z_i \mid f_i)    (5.32)

where the parameter r is the Dirichlet distribution parameter of q, and f is the multinomial distribution parameter of z. Further, taking the logarithm of p(s|a, b):

\log p(s \mid a, b) = L(r, f; a, b) + \mathrm{KL}\left[ p'(q, z \mid r, f) \,\|\, p(q, z \mid s, a, b) \right]    (5.33)

The second term on the right-hand side is the KL divergence between the approximating distribution p′ and the LDA model distribution p. The smaller the KL divergence, the closer p′ approximates p. Minimizing the KL divergence can be realized by maximizing the lower bound L(r, f; a, b) of the likelihood function and solving for the model parameters r and f. Once r and f are known, q and z can be obtained by sampling.
The parameter learning process determines the hyperparameters a and b under the condition that the set of observed variables S = {s1, s2, ..., sM} is given. This can be achieved by variational EM iteration, in which the above variational inference algorithm is used in the E step to calculate the variational parameters r and f of each image, while in the M step the variational parameters of all images are collected and the partial derivatives with respect to the hyperparameters a and b are computed to maximize the lower bound L(r, f; a, b) of the likelihood function, finally yielding the estimates of the hyperparameters.
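To make the model concrete, the sketch below samples images from the LDA generative process just described (and listed step by step for the smoothed model in the next subsection). The corpus sizes, the dictionary size, and the hyperparameters a and b are arbitrary choices made for illustration.

```python
# A minimal sketch of the LDA generative process with NumPy.
import numpy as np

rng = np.random.default_rng(0)
M, K, V = 5, 3, 20          # images, topics, dictionary size
a, b = 0.5, 0.1             # symmetric Dirichlet hyperparameters

f = rng.dirichlet(np.full(V, b), size=K)        # f_k: word distribution of topic k
corpus = []
for i in range(M):
    q_i = rng.dirichlet(np.full(K, a))          # topic mixture of image i
    N_i = rng.poisson(50)                       # number of words, N ~ Poisson
    z = rng.choice(K, size=N_i, p=q_i)          # topic of each word
    words = [rng.choice(V, p=f[k]) for k in z]  # each word drawn from its topic
    corpus.append(words)

print([len(doc) for doc in corpus])             # generated "images" as word lists
```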


In practice, the basic LDA model is generally extended to the smoothed LDA model to get better results (to overcome sparseness problems when large data sets are used). The smoothed LDA model is shown in Figure 5.21(c), where K represents the number of topics in the model and f corresponds to a K × V Markov matrix (V is the dimension of the word vector) in which each row represents the word distribution of a topic. Here the image is represented as a random mixture of latent topics, and each topic is characterized by a distribution over words. For each image i in the image collection, the generative process of LDA is as follows:
(1) Select qi according to a Dirichlet distribution, qi ∼ Dirichlet(a), where i ∈ {1, ..., M} and Dirichlet(a) is the Dirichlet distribution with parameter a;
(2) Select fk according to a Dirichlet distribution, fk ∼ Dirichlet(b), where k ∈ {1, ..., K} and Dirichlet(b) is the Dirichlet distribution with parameter b;
(3) For each word sij, where j ∈ {1, ..., Ni}, select a topic zij ∼ Multinomial(qi) and then select a word sij ∼ Multinomial(fzij), in which Multinomial denotes the multinomial distribution.
5.5.3.2 SLDA Model
To further improve the classification performance of the LDA model, category information can be introduced, giving a supervised LDA model, the SLDA model, Wang (2009), whose graph model is shown in Figure 5.22. Figure 5.22(a) is the standard SLDA model; the meaning of each node in the upper portion is the same as in Figure 5.21(a), while in the lower portion a category tag node l related to the topic z has been added. It is possible to predict the tag l corresponding to the topics z ∈ Z = {zk}, k = 1, ..., K, with the parameter h of a Softmax classifier. The inference on the topic z in the SLDA model is influenced by the tag l, which makes the hyperparameter d of the word-topic distribution more suitable for classification tasks (it can also be used for labeling). The likelihood function of an image in the SLDA model is

p(s, l \mid a, b, h) = \int p(q \mid a) \sum_{z_i} \left[ \prod_{i=1}^{N} p(z_i \mid q)\, p(s_i \mid z_i, b) \right] p(l \mid z_{1:N}, h)\, \mathrm{d}q    (5.34)

where h is the control parameter of the category tag l. The parameter learning process determines the hyperparameters a, b, and h under the condition that the set of observed variables S = {s1, s2, ..., sM} and the category information {li}, i = 1, ..., M, are given. This can be achieved by using the variational EM algorithm.

Figure 5.22: SLDA schematic model.

The variational inference of SLDA determines the topic mixing probability q of an image, the topic probability z of every word, and the image category tag l, under the condition that the hyperparameters a, b, and h as well as the observed variables s are given. Compared with the LDA model, the variational EM algorithm and the variational inference method for SLDA are more complex, Wang (2009). The simplified SLDA model is shown in Figure 5.22(b); it is the same as the simplified LDA model in Figure 5.21(b).

5.6 Problems and Questions
5-1  Figure Problem 5-1 shows the membership functions of the (gray scale) dark fuzzy set D, the (gray scale) bright fuzzy set B, and the (gray scale) medium fuzzy set M. It can be seen that a given gray scale g can correspond to several different membership function values; try to explain this.


Figure Problem 5-1
5-2* In Problem 5-1, if LM(x) = 2/3 and LB(x) = 1/3, what is the gray level g?
5-3  As shown in Figure Problem 5-1, what are the values of LD(x), LM(x), and LB(x) when g = 180? And when g = 18?
5-4  Prove that in fuzzy logic, the AND of a set with its complement is not an empty set.
5-5  The semantic-based region growing methods can also be applied to the semantic segmentation and interpretation of images, but the use of genetic algorithms for semantic segmentation and interpretation is equivalent to a combination of image splitting and merging. Try to explain why this can be said. What advantages are brought by this approach?
5-6  If in Example 5.1 eq. (5.14) instead of eq. (5.13) is used for the calculation of the circularity of the region marked as sphere and of the confidence of the location of the sphere region in the grassland region, how are the steps and results of the random variation changed?
5-7  If the initial code strings obtained from the random region marking in Figure 5.8(a) are BLLLB and BBLBL, try to draw the two corresponding sets of


segmentation hypothesis graphs. In addition, what will the following processes of replication, crossover, and mutation be like?
5-8  There should be a series of intermediate steps between Figure 5.13(c) and Figure 5.13(d), which are to be supplemented. It is required to determine only one target at a time and to list the properties used.
5-9  Compare the differences between the image grid method and the Gaussian difference method in obtaining the local feature description of an image. What effects do they have on the obtained description vectors and on the final classification results?
5-10 There are some indoor scene images, such as classrooms, meeting rooms, offices, exhibition rooms, gymnasiums, libraries, and so on; try to use the pLSA model to classify these images. Show some examples with images, words, and topics (text and graphs), and describe the workflow.
5-11* For the basic LDA model shown in Figure 5.21(a), illustrate the nodes of the graph with concepts and examples from the image domain.
5-12 Perform the same scene classification tasks with the same requirements as in Problem 5-10, but this time using the SLDA model.

5.7 Further Reading
1. Overview of Scene Understanding
   – There are several books on scene semantic interpretation providing information on different research directions, for example, Luo (2005), Gao (2009).
   – A work on using Bayesian networks for image semantic interpretation can be found in Luo (2005).
2. Fuzzy Reasoning
   – More on fuzzy sets and fuzzy operations can be seen in specialized books, such as Cox (1994) and Zhangzl (2010).
   – The application of fuzzy reasoning in image pattern recognition can be seen in Duda (2001), Theodoridis (2009).
3. Image Interpretation with Genetic Algorithms
   – More information on genetic algorithms can be found in dedicated books, such as Mitchell (1996), Han (2010).
   – Genetic algorithms can also be used in combination with fuzzy set methods for segmentation based on the statistical properties of pixels and their spatial neighborhood information, as in Xue (2000).
4. Labeling of Objects in Scene
   – An early overview of relaxation labeling of targets in the scene can be seen in Kittler (1985).
   – A method of associating probability for automatic image tagging can be found in Xu (2008).
5. Scene Classification
   – In feature-based scene classification, the construction of the visual dictionary is an important task. Two examples of how to optimize the learning of dictionaries can be found in Duan (2012) and Liu (2012a).
   – An example of combining image classification and retrieval is shown in Zhang (2008c).
   – A semi-supervised image classification based on linear programming for bootstrapping can be seen in Li (2011).
   – A work on image classification using the identification of sparse coding can be found in Liu (2012).

6 Multisensor Image Fusion
Multisensor fusion treats information data supplied by various sensors. It can provide more comprehensive, accurate, and robust results than those obtained from a single sensor. Fusion can be defined as the combined processing of the data acquired from multiple sensors, as well as the assorting, optimizing, and conforming of these data, so as to increase the ability to extract information and to improve decision capability. Fusion can extend the coverage of space and time information, reduce fuzziness, increase the reliability of decisions, and improve the robustness of systems.
Image fusion is a particular type of multisensor fusion, which takes images (including video frames) as its operating objects. In a more general sense, the combination of multiresolution images can be considered a special type of fusion process. In this chapter, however, the emphasis is on the information fusion of multisensor images.
The sections of this chapter are arranged as follows:
Section 6.1 provides an overview of information fusion and describes the concepts of information types, fusion layers, and active fusion.
Section 6.2 introduces the main steps of image fusion, the three fusion layers, and the objective and subjective evaluation methods for fusion results.
Section 6.3 presents several basic methods of pixel-level fusion and some common techniques for combining them. A number of different examples of pixel-level fusion are provided.
Section 6.4 discusses the technical principles of three tools or methods used in feature-level fusion and decision-level fusion: Bayesian methods, evidence reasoning, and rough set theory.

6.1 Overview of Information Fusion The human perception of the outside world is the result of the interaction between the brain and many other organs, in which not only visual information but also much nonvisual information play a role. For example, the intelligent robots currently under investigation have different sensors for viewing, hearing, olfaction (the sense of taste), gestation (the sense of smell), the sense of touch (the sense of pain), the sense of heat, the sense of force, the sense of slide, and the sense of approach (Luo, 2002). All these sensors provide different profile information of a scene in the same environment. There are many correlations among the information of sensor data. To design suitable techniques for combining information from various sensors, theories and methods of multisensor fusion are required. DOI 10.1515/9783110524130-006


6.1.1 Multisensor Information Fusion Multisensor information fusion is a basic ability of human beings. Single sensors can only provide incomplete, inaccurate, vague, and uncertain information. However, information obtained by different sensors can even be contradictory. Human beings have the ability to combine the information obtained by different organs and then make estimations and decisions about the environment and events. Using a computer to perform multisensor information fusion can be considered a simulation of the function of the human brain for processing complex problems. 6.1.1.1 Layers of Information Fusion There are several schemes for classifying information fusion into various layers. For example, according to the information abstraction level, information fusion can be classed into five layers (the main considerations here are the strategic early warning in battlefields, in which C3 I–Command, Control, Communication, and Information systems are required) (He, 2000): Fusion in the Detection Layer is the fusion directly in the signal detection level of multisensors. That is, the signal detected by a single sensor is first preprocessed before transmitting it to the center processor. Fusion in the Position Layer is the fusion of the output signals of each sensor. Fusion here includes both time fusion and space fusion. From time fusion, the object’s states can be obtained. While from space fusion, the object’s moving trace can be obtained. Fusion in the Object Recognition Layer Object recognition is used to classify objects according to their attributes and/or properties. Fusion in the object recognition layer can be performed in three ways. (1) Decision fusion: Fusing the classification results of each sensor. (2) Feature fusion: Fusing the feature description vectors of each sensor. (3) Data fusion: Fusing the original data of each sensor. Fusion in the Posture Evaluation Layer tries to analyze the whole scene based on object recognition. This requires the combination of various attributes of objects, events, etc., to describe the action in a scene. Fusion in the Menace Estimation Layer Posture estimation emphasizes state while menace estimation emphasizes tendency. In the menace estimation layer, not only the state information is taken into account, but also the appropriate a priori knowledge should be used to estimate the tendency of the state changes and the results of possible events.


6.1.1.2 Discussions about Active Vision Systems It is well known that static, single image analysis constitutes an ill-posed problem. One reason is that the reconstruction of a 3-D scene from a 2-D image is underdetermined. Interpreting and recovering information from one single image has been the goal of many image-understanding systems since the 1980s. Researchers have tried to attain their goals by implementing artificial intelligence techniques (expert vision systems, modeling of domain knowledge, and modeling of image analysis knowledge). In the later of 1980s, a new paradigm called “active perception” or “active vision” was introduced (Aloimonos, 1992) and then extended to “active, qualitative, purposive” vision (Andreu, 2001). The main ideas behind these concepts are that the ill-posed problem of general vision can be well defined and solved easily under the following conditions: (1) If there is an active observer taking more than one image of the scene. (2) If the “reconstructionist” metric approach (see the discussion in Section 1.3.3) to vision is relaxed to a qualitative one, where it is sufficient to state, for example, that object A is closer to the observer than object B. (3) If, instead of general vision, a well-defined narrow purpose of the vision system is modeled (leading to a particular solution of a specific application problem). (4) If any combination of these three conditions are met. Let us consider condition (1) from a new perspective. If there is an active vision system taking more than one image of a scene, or even more general, if there are moving observers or objects in the scene and the system is equipped with several sensors, then the essential problem that has to be solved is how to integrate multiple information from multiple sensors taken at different moments. Information to be fused can be imperfect in many ways (wrong, incomplete, vague, ambiguous, and contradictory). Mechanisms are required to: (1) Select information from different sources; (2) Combine information into a new aggregated state; (3) Register spatially and temporally visual information; (4) Integrate information at different levels of abstraction (pixel level, feature level, object level, symbol level, etc.). 6.1.1.3 Active Fusion-Based Understanding Figure 6.1 shows a general schema of an image understanding system, with a special emphasis (boldface, bold arrows) on the role of fusion within the schema. The upper half corresponds to the real-world situation, while the lower half reflects its mapping in the computer. Line boxes denote levels of processing and dashed line boxes denote levels of representation, respectively. Solid arrows represent the data flow and dashed ones represent the control flow. The process of fusion combines information, actively selects the sources to be analyzed, and controls the processes to be performed on these data, which is called active fusion. Fusion can take place at isolated levels (e.g., fusing several input images

Figure 6.1: The framework of general image understanding based on active fusion.

producing an output image) or integrate information from different representational levels (e.g., generate a thematic map from a map, digital elevation model, and image information). Processing at all levels can be requested and controlled (e.g., selection of input images, choice of classification algorithms, and refinement of the results in selected areas).

6.1.2 Sensor Models Information fusion is based on information captured from different sensors, so the models of the sensor play an important role. 6.1.2.1 Fusion of Multisensor Information The advantages of fusing multisensor information include the following four aspects: (1) Using multisensors to detect the same region can enhance reliability and credibility. (2) Using multisensors to observe different regions can increase spatial coverage. (3) Using multi-types of sensors to examine the same object can augment information quantity and reduce the fuzziness. (4) Using multisensors to observe the same region can improve spatial resolution. In fact, when multisensors are used, even if some of them meet problems, other sensors can still capture environment information, so the system will be more robust. Since several sensors can work at the same time, processing speed can be increased, efficiency of utilization can be augmented, and the cost of information capturing will be reduced. Corresponding to the forms of multisensor information fusion, the information obtained from the outside can be classified into three types: Redundant Information is the information about the same characteristics in the environment, captured by several independent sensors (often the same modality). It


can also be the information captured by one sensor but at different times. Redundant information can improve the tolerance and reliability of a procedure. Fusion of redundant information can reduce the uncertainty caused by noise and increase the precision of the system. Complementary Information is the information about different characteristics in the environment, captured by several independent sensors (often the different modality). By combining such information, it is possible to provide complete descriptions about the environment. Fusion of complementary information can reduce the ambiguity caused by lacking certain features and enhance the ability to make correct decisions. Cooperation Information is the information of several sensors, from which other sensors can be used to capture further information. Fusion of such information is dependent on the time sequence that different sensors use. 6.1.2.2 Several Sensor Models Sensor model is an abstract representation of physical sensors and their processes of information processing. It describes not only its own properties but also the influence of exterior conditions on the sensor and the ability of interactions among different sensors. The probability theory can be used to model the multisensor fusion system Durrant (1988). Denote the observation value of sensor yi , the decision function based on the observation value Ti , and the action of decision ai , then ai = Ti (yi )

(6.1)

Now, consider a multisensor fusion system as a union of a set of sensors. Each sensor can be represented by an information structure Si , which includes the observation value of this sensor yi , the physical state of this sensor xi , the a priori probability distribution function of this sensor pi , and the relation between the actions of this sensor and other sensors, which is given by yi = Si (xi , pi , a1 , a2 , . . . , ai–1 , ai+1 , . . . an )

(6.2)

So the information structure of the set of sensors can be represented by n groups of S = (S1 , S2 , . . . , Sn ). If denoting the decision function T = (T1 , T2 , . . . , Tn ), the goal of information fusion is to obtain a consistent decision a, based on a set of sources, which describes the environment characteristic better than any single decision ai (i = 1, 2, . . . , n). If considering separately the actions of different parts on yi , (i.e., consider the conditional probability density function) three sub-models, the state model Six , the


observation model Sip, and the correlation model SiT, can be obtained. In other words, eq. (6.2) can be written as

S_i = f(y_i \mid x_i, p_i, T_i) = f(y_i \mid x_i)\, f(y_i \mid p_i)\, f(y_i \mid T_i) = S_{ix} S_{ip} S_{iT}    (6.3)

State Model: A sensor can change its space coordinates and/or its internal state over time (e.g., the displacement of a camera or the adjustment of the lens focus). The state model describes the dependency of the observation values of a sensor on the location/state of this sensor. This corresponds to the transform between different coordinate systems of the sensor. That is, the observation model of the sensor fi(yi|pi) and the correlation model fi(yi|Ti) are transformed into the current coordinate system of the sensor by using the state model fi(yi|xi).
Observation Model: The observation model describes the measurement model for the case where the position and state of a sensor as well as the decisions of other sensors are known. The exact form of fi(yi|pi) depends on many physical factors. To simplify, a Gaussian model is used:

S_{ip} = f(y_i \mid p_i) = \frac{1 - e}{(2\pi)^{m/2} |W_{1i}|^{1/2}} \exp\left[ -\frac{1}{2} (y_i - p_i)^{T} W_{1i}^{-1} (y_i - p_i) \right] + \frac{e}{(2\pi)^{m/2} |W_{2i}|^{1/2}} \exp\left[ -\frac{1}{2} (y_i - p_i)^{T} W_{2i}^{-1} (y_i - p_i) \right]    (6.4)

where 0.01 < e < 0.05 and the variance is |W1i|.
Suppose A1, A2, ..., An is a partition of the sample space with P(Ai) > 0, i = 1, 2, ..., n. Then for each event B with P(B) > 0, the following equation holds:

P(A_i \mid B) = \frac{P(A_i, B)}{P(B)} = \frac{P(B \mid A_i)\, P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\, P(A_j)}    (6.18)

If the multisensor decision is considered as a partition of the sample space, the Bayesian conditional probability formula can be used to solve the decision problem. Consider first a system with two sensors. Suppose that the observation result of the first sensor is B1, the observation result of the second sensor is B2, and the decisions possibly made by the system are A1, A2, ..., An. Assuming that A, B1, and B2 are mutually independent, the Bayesian conditional probability can be obtained by using a priori knowledge about the system and the properties of the sensors, given by

P(A_i \mid B_1 \wedge B_2) = \frac{P(B_1 \mid A_i)\, P(B_2 \mid A_i)\, P(A_i)}{\sum_{j=1}^{n} P(B_1 \mid A_j)\, P(B_2 \mid A_j)\, P(A_j)}    (6.19)

The above results can be extended to cases with more than two sensors. Suppose that there are m sensors and their observation results are B1, B2, ..., Bm. If these sensors are independent of each other and independent of the observation objects and conditions, the total posterior probability of the decision made by the system with m sensors is

P(A_i \mid B_1 \wedge B_2 \wedge \cdots \wedge B_m) = \frac{\prod_{k=1}^{m} P(B_k \mid A_i)\, P(A_i)}{\sum_{j=1}^{n} \prod_{k=1}^{m} P(B_k \mid A_j)\, P(A_j)}, \quad i = 1, 2, \ldots, n    (6.20)

The decision, which makes the system have the maximum posterior probability, will be taken as the final decision.


Example 6.1 Object classification using fusion
Suppose that there are four classes of objects (i = 1, 2, 3, 4) with the same prior probability P(Ai); that is, P(A1) = P(A2) = P(A3) = P(A4) = 0.25. Two sensors (j = 1, 2) are used, and their measurement values Bj follow Gaussian distributions with parameters (\mu_{ji}, \sigma_{ji}), so the a priori probability density of each measurement value is

P(B_j \mid A_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\left( -\frac{(b_j - \mu_{ji})^2}{2\sigma_{ji}^2} \right), \quad i = 1, 2, 3, 4, \quad j = 1, 2

The a priori probabilities can be obtained as follows. Suppose that the probability is 1 when the observation equals the mean value for each object, and that the probability of a real observation is taken as the sum of the two tail probabilities on both sides:

P(B_1 \mid A_1) = 2 \int_{B_1}^{\infty} p(b_1 \mid A_1)\, \mathrm{d}b_1, \qquad P(B_2 \mid A_1) = 2 \int_{B_2}^{\infty} p(b_2 \mid A_1)\, \mathrm{d}b_2
P(B_1 \mid A_2) = 2 \int_{-\infty}^{B_1} p(b_1 \mid A_2)\, \mathrm{d}b_1, \qquad P(B_2 \mid A_2) = 2 \int_{-\infty}^{B_2} p(b_2 \mid A_2)\, \mathrm{d}b_2
P(B_1 \mid A_3) = 2 \int_{-\infty}^{B_1} p(b_1 \mid A_3)\, \mathrm{d}b_1, \qquad P(B_2 \mid A_3) = 2 \int_{-\infty}^{B_2} p(b_2 \mid A_3)\, \mathrm{d}b_2
P(B_1 \mid A_4) = 2 \int_{-\infty}^{B_1} p(b_1 \mid A_4)\, \mathrm{d}b_1, \qquad P(B_2 \mid A_4) = 2 \int_{-\infty}^{B_2} p(b_2 \mid A_4)\, \mathrm{d}b_2

The final fusion result is

P(A_i \mid B_1 \wedge B_2) = \frac{P(B_1 \mid A_i)\, P(B_2 \mid A_i)\, P(A_i)}{\sum_{j=1}^{4} P(B_1 \mid A_j)\, P(B_2 \mid A_j)\, P(A_j)}, \quad i = 1, 2, 3, 4 ◻∘
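The following numerical sketch applies the decision fusion rule of eqs. (6.19)/(6.20). It uses Gaussian density values as likelihoods (rather than the doubled tail probabilities of Example 6.1), and the means, variances, and the two observed values are invented for illustration.

```python
# A minimal sketch of Bayesian decision-level fusion for two sensors and four classes.
import numpy as np
from scipy.stats import norm

P_A = np.array([0.25, 0.25, 0.25, 0.25])         # equal priors for the 4 object classes
mu = np.array([[1.0, 2.0, 3.0, 4.0],             # mu[j, i]: mean of sensor j for class i
               [0.5, 1.5, 2.5, 3.5]])
sigma = np.array([[0.5, 0.5, 0.5, 0.5],
                  [0.4, 0.4, 0.4, 0.4]])
b = np.array([1.2, 0.6])                          # observed values of the two sensors

# Likelihood of each class under each sensor, then the fused posterior, cf. eq. (6.20)
lik = norm.pdf(b[:, None], loc=mu, scale=sigma)   # shape (2 sensors, 4 classes)
post = P_A * lik.prod(axis=0)
post /= post.sum()
print(np.round(post, 3), "-> decide class", post.argmax() + 1)
```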

6.4.2 Evidence Reasoning
Bayesian information fusion is based on probability theory, in which the additivity of the probability is a basic requirement. However, the Bayesian method cannot be used when the reliabilities of two opposite propositions are both small (the determination cannot be made according to the current evidence). Evidence reasoning (proposed by Dempster and perfected by Shafer, also called the D–S theory) abandons full additivity and adopts semi-additivity. In the D–S theory, the set of propositions of interest is represented by a recognition framework. It defines a set function C : 2F → [0, 1], which satisfies

(1) C(∅) = 0, that is, no reliability is assigned to the empty set;
(2) ∑A⊆F C(A) = 1, that is, although any reliability value can be assigned to a proposition, the sum of the reliability values over all propositions must be 1.

Here, C is considered a basic reliability distribution on the recognition framework F. For every A ⊆ F, C(A) is called the basic probability number of A and reflects the reliability of A itself. For any proposition, the D–S theory defines a reliability function

B(A) = \sum_{E \subseteq A} C(E), \quad \forall A \subseteq F    (6.21)

Equation (6.21) shows that the reliability function of A is the sum of the reliability numbers of all subsets of A. From the definition of the reliability function, it follows that

B(\emptyset) = 0, \qquad B(F) = 1    (6.22)

Using only the reliability function to judge a proposition A is not enough, because B(A) cannot reflect the suspicion about A. So the suspicion about A also needs to be defined: for every A ∈ H, define

D(A) = B(\bar{A}), \qquad P(A) = 1 - B(\bar{A})    (6.23)

where D is a suspicion function, D(A) is the suspicion about A, P is a plausible function, and P(A) is the plausible degree of A. According to eq. (6.23), P can be represented by the C corresponding to B: for every A ∈ H,

P(A) = 1 - B(\bar{A}) = \sum_{E \subseteq F} C(E) - \sum_{E \subseteq \bar{A}} C(E) = \sum_{E \cap A \neq \emptyset} C(E)    (6.24)

If A ∩ E ≠ ∅, A is consistent with E. Equation (6.24) shows that P(A) includes the basic reliability numbers of all proposition sets consistent with A. From A ∩ \bar{A} = ∅ and A ∪ \bar{A} ⊆ F, it follows that

B(A) + B(\bar{A}) \le \sum_{E \subseteq F} C(E) = 1    (6.25)

That is,

B(A) \le 1 - B(\bar{A}) = P(A)    (6.26)

[B(A), P(A)] denotes the uncertain region for A. B(A) and P(A) are called the lower and upper limits of the probability, respectively. [0, B(A)] is the fully reliable region, which indicates the support for the proposition "A is true." [0, P(A)] is the non-suspicion region, which indicates that the evidence cannot refute the proposition "A is true."

Figure 6.14: A partition of information regions.

Figure 6.15: The composition of the reliability function.

The above two regions are illustrated in Figure 6.14. The bigger the region [B(A), P(A)], the higher the uncertainty. If a proposition is considered as an element of the recognition framework F, then every A with C(A) > 0 is called a focal element of the reliability function B.
Given two reliability functions B1 and B2 on the same recognition framework, denote by C1 and C2 their corresponding basic reliability distributions. The composition of the reliability functions is shown in Figure 6.15, where the vertical band represents the reliability of C1 assigned to A1, A2, ..., AK and the horizontal band represents the reliability of C2 assigned to E1, E2, ..., EL. The shaded region is the intersection of the horizontal and vertical bands, with measure C1(Ai)C2(Ej). The combined action of B1 and B2 is to assign C1(Ai)C2(Ej) to Ai ∩ Ej. Given A ⊆ F, if Ai ∩ Ej = A, then C1(Ai)C2(Ej) is part of the reliability assigned to A, and the total reliability assigned to A is ∑Ai∩Ej=A C1(Ai)C2(Ej). However, for A = ∅, part of the reliability would be assigned to the empty set according to the above description, which is obviously not reasonable. To solve this problem, every reliability must be multiplied by [1 − ∑Ai∩Ej=∅ C1(Ai)C2(Ej)]−1 so that the total reliability equals 1. The final rule for composing two reliability functions is (⊕ denotes the composition operation)

C(A) = C_1(A) \oplus C_2(A) = \frac{\sum_{A_i \cap E_j = A} C_1(A_i)\, C_2(E_j)}{1 - \sum_{A_i \cap E_j = \emptyset} C_1(A_i)\, C_2(E_j)}    (6.27)


The above procedure can be generalized to the composition of multiple reliabilities. Denoting by C1, C2, ..., Cn the reliability distributions of n groups of information derived from independent sources, the fused reliability function C = C1 ⊕ C2 ⊕ ... ⊕ Cn can be represented as

C(A) = \frac{\sum_{\cap A_i = A} \prod_{i=1}^{n} C_i(A_i)}{1 - \sum_{\cap A_i = \emptyset} \prod_{i=1}^{n} C_i(A_i)}    (6.28)

In practice, the information captured by a sensor is taken as evidence, and every sensor provides a group of propositions and reliability functions. Multisensor information fusion then becomes a procedure that combines different pieces of evidence into new evidence under the same recognition framework. This procedure has the following steps (a minimal sketch of the combination rule is given after the list):
(1) Compute, for each sensor, the basic reliability numbers, the reliability function, and the plausible function.
(2) Using the composition rule of eq. (6.28), compute the basic reliability numbers, the reliability function, and the plausible function under the joint action of all sensors.
(3) Select the object with the maximum support, under certain decision rules.
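The sketch below applies Dempster's combination rule, eq. (6.27), to two pieces of evidence over a small recognition framework; the propositions and mass values are made-up examples.

```python
# A minimal sketch of Dempster's rule of combination.
from itertools import product

def ds_combine(m1, m2):
    """Combine two basic reliability distributions given as {frozenset: mass}."""
    combined, conflict = {}, 0.0
    for (A, v1), (B, v2) in product(m1.items(), m2.items()):
        inter = A & B
        if inter:
            combined[inter] = combined.get(inter, 0.0) + v1 * v2
        else:
            conflict += v1 * v2                 # mass falling on the empty set
    norm = 1.0 - conflict                       # normalization factor of eq. (6.27)
    return {A: v / norm for A, v in combined.items()}

F = frozenset({"car", "truck", "bus"})
m1 = {frozenset({"car"}): 0.6, frozenset({"car", "truck"}): 0.3, F: 0.1}
m2 = {frozenset({"car"}): 0.5, frozenset({"bus"}): 0.2, F: 0.3}
for A, v in sorted(ds_combine(m1, m2).items(), key=lambda kv: -kv[1]):
    print(set(A), round(v, 3))
```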

6.4.3 Rough Set Methods
One problem with evidence reasoning in multisensor fusion is the combinatorial explosion of the composition. To alleviate this problem, it is necessary to analyze the complementary information and to compress the redundant information. Rough set theory provides a solution for this task. Different from fuzzy set theory, which can represent fuzzy concepts but cannot count the fuzzy elements, the number of fuzzy elements can be obtained with exact formulas in a rough set.
6.4.3.1 Definition of Rough Set
Suppose that U ≠ ∅ is a finite set formed by all objects of interest, called the definition domain. Any subset X of U is called a concept of U. A set of concepts of U is called knowledge about U (often represented in attribute form). Denote by R an equivalence relation (representing an attribute of objects) defined on U; then a knowledge database is a relation system K = {U, R}, where R is a set of equivalence relations defined on U (Zhangwx, 2001). For any subset X of U, if it can be defined by R, it is called an R-exact set; if it cannot be defined by R, it is called an R rough set. A rough set can be approximately described by two exact sets, the upper approximation set and the lower approximation set.


The upper approximation set is defined as

R^{*}(X) = \{ x \in U : R(x) \cap X \neq \emptyset \}    (6.29)

The lower approximation set is defined as

R_{*}(X) = \{ x \in U : R(x) \subseteq X \}    (6.30)

where R(x) is the equivalence class containing x. The R-boundary of X is defined as the difference between the upper approximation set and the lower approximation set, given by

B_R(X) = R^{*}(X) - R_{*}(X)    (6.31)

Example 6.2 Examples of a rough set Suppose that a knowledge database K = (U, R) is given, where U = {x1 , x2 , . . . , x8 }, and R is an equivalence set including equivalence sets E1 = {x1 , x4 , x8 }, E2 = {x2 , x5 , x7 }, E3 = {x3 }, and E4 = {x6 }. Now consider the set X = {x3 , x5 }. The lower approximation set will be R∗ (X) = {x ∈ U : R(x) ⊆ X} = E3 = {x3 }, while the upper approximation set will be R∗ (X) = {x ∈ U : R(x) ∩ X ≠ Ø} = E2 ∪ E3 = {x2 , x3 , x5 , x7 }, and the boundary set will be BR (X) = R∗ (X) – R∗ (X) = {x2 , x5 , x7 }. ◻∘ For knowledge R, R∗ (X) is the set including the elements that could be classified into X from U. For knowledge R, R∗ (X) is the set including the elements that must be classified into X from U. For knowledge R, BR (X) is the set including the elements that neither can be clearly classified into X, nor could be clearly classified into X,̄ (the complementary set of X), from U. In addition, R∗ (X) can be referred to as the R-positive domain of X; U – R∗ (X) can be called the R-negative domain of X; BR (X) can be called the boundary domain of X. Following these definitions, the positive domain is the set that includes all elements clearly belonging to X, according to the knowledge R; the negative domain is the set that includes all elements clearly belonging to X,̄ according to the knowledge R. The boundary domain is an uncertain domain in some senses. For the knowledge R, the elements belonging to the boundary domain cannot be clearly assigned to X or X.̄ R∗ (X) is the sum set of the positive domain and the boundary domain. A 2-D illustration for the above discussed sets and domains is given in Figure 6.16, in which the space is formed by rectangles, each of which represents an equivalent class in R. The region between R∗ (X) and R∗ (X) represents the R-boundary of X, which is an uncertain region for X. In summary, if and only if R∗ (X) = R∗ (X), X is a R-definable set; if and only if ∗ R (X) ≠ R∗ (X), X is a R rough set. In other words, R∗ (X) can be considered the maximum definable set in X and R∗ (X) can be considered the minimum definable set including X.

Figure 6.16: A 2-D illustration of a rough set.
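The sketch below computes the lower and upper approximations and the boundary of eqs. (6.29)-(6.31), checked against the data of Example 6.2.

```python
# A minimal sketch of rough set approximations.
def approximations(classes, X):
    """classes: the equivalence classes of R; X: the target subset of U."""
    lower = set().union(*(E for E in classes if E <= X))   # R_*(X), eq. (6.30)
    upper = set().union(*(E for E in classes if E & X))    # R^*(X), eq. (6.29)
    return lower, upper, upper - lower                     # boundary, eq. (6.31)

U_classes = [frozenset({"x1", "x4", "x8"}), frozenset({"x2", "x5", "x7"}),
             frozenset({"x3"}), frozenset({"x6"})]
X = {"x3", "x5"}
lower, upper, boundary = approximations(U_classes, X)
print(lower)                    # {'x3'}
print(upper)                    # {'x2', 'x3', 'x5', 'x7'}
print(boundary)                 # {'x2', 'x5', 'x7'}
print(len(lower) / len(upper))  # precision |R_*(X)| / |R^*(X)| = 0.25
```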

6.4.3.2 Description With Rough Set
The boundary domain is the undetermined domain arising from the incompleteness of the knowledge (i.e., the elements in BR(X) cannot be determined). Therefore, a subset X of U is rough when BR(X) ≠ ∅. The larger the boundary domain of a set X, the more fuzzy elements it contains and the lower its precision. To express this accurately, a definition of the precision is introduced:

d_R(X) = \frac{\mathrm{card}\,[R_{*}(X)]}{\mathrm{card}\,[R^{*}(X)]}    (6.32)

where card(⋅) denotes the cardinality of a set and X ≠ ∅. In Example 6.2, dR(X) = card[R*(X)]/card[R*(X)] = 1/4. The precision dR(X) reflects the completeness of the knowledge about the set X. For any R and X ⊆ U, 0 ≤ dR(X) ≤ 1. When dR(X) = 1, the R-boundary domain of X is empty and the set X is R-definable. When dR(X) < 1, the set X has a nonempty boundary domain and is R-nondefinable. The concept opposite to the precision is the roughness

h_R(X) = 1 - d_R(X)    (6.33)

It reflects the incompleteness of the knowledge about the set X. With the help of R_*(X) and R^*(X), the topological properties of a rough set can be characterized. In the following, four important types of rough sets are defined (their geometrical meanings can be found in Figure 6.16).
(1) If R_*(X) ≠ Ø and R^*(X) ≠ U, X is called an R rough definable set. In this case, it is possible to determine that some elements of U belong to X and that some other elements belong to X̄.
(2) If R_*(X) = Ø and R^*(X) ≠ U, X is called an indefinable set in R. In this case, it is possible to determine that some elements of U belong to X̄, but not whether any element of U belongs to X.
(3) If R_*(X) ≠ Ø and R^*(X) = U, X is called an outside indefinable set of R. In this case, it is possible to determine that some elements of U belong to X, but not whether any element of U belongs to X̄.
(4) If R_*(X) = Ø and R^*(X) = U, X is called an R total indefinable set. In this case, it is not possible to determine whether any element of U belongs to X or to X̄.


Example 6.3 Examples of a rough set
Suppose a knowledge base K = (U, R) is given, where U = {x0, x1, ⋅ ⋅ ⋅, x10} and R is an equivalence relation with the equivalence classes E1 = {x0, x1}, E2 = {x2, x6, x9}, E3 = {x3, x5}, E4 = {x4, x8}, and E5 = {x7, x10}.
The set X1 = {x0, x1, x4, x8} is an R-definable set, because R_*(X1) = R^*(X1) = E1 ∪ E4.
The set X2 = {x0, x3, x4, x5, x8, x10} is an R rough definable set. It has R_*(X2) = E3 ∪ E4 = {x3, x4, x5, x8}, R^*(X2) = E1 ∪ E3 ∪ E4 ∪ E5 = {x0, x1, x3, x4, x5, x7, x8, x10}, B_R(X2) = E1 ∪ E5 = {x0, x1, x7, x10}, and d_R(X2) = 1/2.
The set X3 = {x0, x2, x3} is an indefinable set in R, because R_*(X3) = Ø and R^*(X3) = E1 ∪ E2 ∪ E3 = {x0, x1, x2, x3, x5, x6, x9} ≠ U.
The set X4 = {x0, x1, x2, x3, x4, x7} is an outside indefinable set of R. It has R_*(X4) = E1 = {x0, x1}, R^*(X4) = U, B_R(X4) = E2 ∪ E3 ∪ E4 ∪ E5 = {x2, x3, x4, x5, x6, x7, x8, x9, x10}, and d_R(X4) = 2/11.
The set X5 = {x0, x2, x3, x4, x7} is an R total indefinable set, because R_*(X5) = Ø and R^*(X5) = U. ◻∘

6.4.3.3 Fusion Based on a Rough Set
When applying the rough set theory to multisensor information fusion, two concepts are used: the nuclear of a rough set and the reduction of a rough set. Let 𝐑 be a family of equivalence relations and let R ∈ 𝐑. If I(𝐑) = I(𝐑 − {R}), where I(∙) denotes the indiscernibility relation determined by a family of relations, then R can be omitted from 𝐑 (R is unnecessary); otherwise, R cannot be omitted from 𝐑 (R is necessary). If no R ∈ 𝐑 can be omitted, the family 𝐑 is independent. If 𝐑 is independent and 𝐏 ⊆ 𝐑, then 𝐏 is independent, too. The set of all relations in 𝐏 that cannot be omitted is called the nuclear of 𝐏 and is denoted C(𝐏). The relation between the nuclear and the reductions is

C(𝐏) = ⋂ J(𝐏)    (6.34)

where J(𝐏) represents the set of all reductions of 𝐏. It can be seen that the nuclear is included in every reduction and can be calculated as the intersection of all reductions. In the process of knowledge reduction, the nuclear is the set of knowledge characteristics that cannot be removed.

Suppose that S and T are equivalence relations on U. The S-positive domain of T (the set of elements whose equivalence classes can be accurately partitioned into the classes of T) is

P_S(T) = ⋃_{X∈T} S_*(X)    (6.35)

The dependency relation between S and T is

Q_S(T) = card[P_S(T)] / card(U)    (6.36)


It can be seen that 0 ≤ Q_S(T) ≤ 1. Using the dependency Q_S(T) between S and T, the consistency between the two equivalence relations S and T can be determined: when Q_S(T) = 1, S and T are consistent; when Q_S(T) ≠ 1, S and T are not consistent. When applying the rough set theory to multisensor information fusion, the dependency between S and T is used to help eliminate redundant (consistent) information and to determine the minimum nuclear. Once the most useful decision information has been found, the fastest fusion method can be obtained.
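The positive domain and the dependency degree of eqs. (6.35) and (6.36) can be sketched in a few lines of Python. The example partitions below are illustrative, not taken from the book.

```python
# Minimal sketch of the S-positive domain of T and the dependency Q_S(T).

def lower_approximation(partition, X):
    """Union of the equivalence classes of the partition fully contained in X."""
    return set().union(*(E for E in partition if E <= X))

def positive_domain(S_classes, T_classes):
    # Union of the S-lower approximations of every equivalence class of T, eq. (6.35).
    return set().union(*(lower_approximation(S_classes, X) for X in T_classes))

def dependency(S_classes, T_classes, U):
    # Q_S(T) = card[P_S(T)] / card(U), eq. (6.36).
    return len(positive_domain(S_classes, T_classes)) / len(U)

U = {'x1', 'x2', 'x3', 'x4', 'x5', 'x6'}
S = [{'x1', 'x2'}, {'x3'}, {'x4', 'x5', 'x6'}]   # equivalence classes of S (illustrative)
T = [{'x1', 'x2', 'x3'}, {'x4', 'x5', 'x6'}]     # equivalence classes of T (illustrative)
print(dependency(S, T, U))   # 1.0: here S can fully reproduce the partition of T
```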

6.5 Problems and Questions
6-1 What is the relation between active vision and active fusion?
6-2 When is the objective evaluation of fusion results more appropriate than the subjective evaluation of fusion results?
6-3 What is the advantage of fusion evaluation according to the fusion objectives?
6-4 Select two images, and obtain their weighted average fused image with various pairs of weights. Discuss the results.
6-5 Select a color image and decompose it into its R, G, B components. Take two of the three components and use the wavelet transform method to fuse them. Judge the fusion result with one statistics-based criterion and one information-based criterion.
6-6 Suppose that P(A1) = 0.1, P(A2) = 0.2, P(A3) = 0.3, and P(A4) = 0.4 in Example 6.1. What is the final fusion result?
6-7 Both Bayesian methods and evidence reasoning are based on the computation of probability. What are their differences? Design a set of test data and compare the computation results.
6-8* Prove the following properties of the upper approximation set and the lower approximation set:
(1) R^*(X ∩ Y) ⊆ R^*(X) ∩ R^*(Y).
(2) R^*(X ∪ Y) = R^*(X) ∪ R^*(Y).
6-9 Prove the following properties of the upper approximation set and the lower approximation set:
(1) R_*(X) ⊆ X ⊆ R^*(X).
(2) R_*(X ∪ Y) ⊇ R_*(X) ∪ R_*(Y).
(3) R_*(X ∩ Y) = R_*(X) ∩ R_*(Y).
6-10 Given a knowledge base K = (U, R), where U = {x0, x1, ⋅ ⋅ ⋅, x8}, and the equivalence classes are E1 = {x0, x4}, E2 = {x1, x2, x7}, and E3 = {x3, x5, x6, x8}, determine the types of the sets E1, E2, and E3.
6-11 Given a knowledge base K = (U, R), where U = {x0, x1, ⋅ ⋅ ⋅, x8} and R = {R1, R2, R3}, the equivalence relations R1, R2, and R3 have the following equivalence classes: R1 = {{x1, x4, x5}, {x2, x8}, {x3}, {x6, x7}}, R2 = {{x1, x3, x5}, {x6}, {x2, x4, x7, x8}}, and R3 = {{x1, x5}, {x6}, {x2, x7, x8}, {x3, x4}}. R has the equivalence classes R = {{x1, x5}, {x2, x8}, {x3}, {x4}, {x6}, {x7}}. Compute the nuclear and all reduction sets.
6-12* Prove the following theorem: The set X is an R rough definable set if and only if X̄ is an R rough definable set.

6.6 Further Reading
1. Summary of Information Fusion
– A recent trend in information processing for the last 20 years is the fusion of information (Zhang, 2016).
– Information fusion can be carried out with different forms and data, such as audio and video, multimedia, multi-modality (Renals, 2005), etc.
– A general introduction to image fusion techniques with multiple sensors can be found in (Zhang, 2015c).
2. Image Fusion
– Stereo vision can also be considered an image fusion process, in which views from more than two points are fused to provide the complete information (Russ, 2002).
– More examples of using wavelet transforms in fusion applications of remote sensing imaging can be found in (Bian, 2005).
– Multisensor image fusion in remote sensing is discussed in more detail in (Polhl, 1998).
– Fusion examples of the SAR and FLIR images and the related registrations can be found in (Chen, 2001).
3. Pixel-Layer Fusion
– An introduction to image fusion using the wavelet transform can be found in (Pajares, 2004).
– Detailed derivation and discussion for determining the optimal level of the wavelet decomposition in a fusion task can be found in (Li, 2005b).
4. Feature-Layer and Decision-Layer Fusions
– A tendency of fusion is going from pixels to regions (objects), and a general framework can be found in (Piella, 2003).
– More detailed information on the rough set theory and applications can be found in (Zhangwx, 2001).

7 Content-Based Image Retrieval

With the development of electronic devices and computer technology, a large number of images and video sequences (in general, groups of images) have been acquired, stored, and transmitted. As huge image databases are built, searching for the required information becomes complicated. Although keyword indexes are available, they have to be created by human operators and cannot fully describe the content of images. An advanced type of technique, called content-based image retrieval (CBIR), has been developed for this reason.

The sections of this chapter are arranged as follows:
Section 7.1 introduces the matching techniques and similarity criteria for image retrieval based on color, texture, and shape features.
Section 7.2 analyzes the video retrieval techniques based on motion characteristics (including global motion features and local motion features).
Section 7.3 presents a multilayer description model, including the original image layer, the meaningful region layer, the visual perception layer, and the object layer, for object-based high-level image retrieval.
Section 7.4 discusses the particular methods for the analysis and retrieval of three types of video programs (news video, sports video, and home video).

7.1 Feature-Based Image Retrieval

A typical database querying form is the query by example. Variations include composing multiple images or drawing sketches to obtain examples. Example images are described by appropriate features and are matched to the features of database images. The popularly used features are color, texture, and shape. Some other examples are the spatial relationship between objects and the structures of objects (Zhang, 2003b).

7.1.1 Color Features

Color is an important feature in describing the content of images. Many retrieval methods are based on color features (Niblack, 1998; Zhang, 1998b; Zhang, 1998c). Commonly used color spaces include the RGB and HSI spaces. In the following, the discussion is based on the RGB space, although the HSI space is often better.

Many color-based techniques use histograms to describe an image. The histogram of an image is a 1-D function

H(k) = n_k / n    k = 0, 1, ⋅ ⋅ ⋅, L − 1    (7.1)

where k is the index of the feature bins of the image, L is the number of bins, n_k is the number of pixels within feature bin k, and n is the total number of pixels in the image.


In content-based image retrieval, the matching technique plays an important role. With histograms, the matching of images can be made by computing different distances between histograms.

7.1.1.1 Histogram Intersection
Let H_Q(k) and H_D(k) be the histograms of a query image Q and a database image D, respectively. The matching score of the histogram intersection between the two images is given by Swain (1991)

P(Q, D) = [ Σ_{k=0}^{L−1} min[H_Q(k), H_D(k)] ] / [ Σ_{k=0}^{L−1} H_Q(k) ]    (7.2)
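As a small illustration of eq. (7.2), the following sketch computes the histogram intersection score with NumPy; the two example histograms are arbitrary placeholders.

```python
import numpy as np

def histogram_intersection(h_q, h_d):
    """P(Q, D) = sum(min(H_Q, H_D)) / sum(H_Q), cf. eq. (7.2)."""
    h_q = np.asarray(h_q, dtype=float)
    h_d = np.asarray(h_d, dtype=float)
    return np.minimum(h_q, h_d).sum() / h_q.sum()

h_query = np.array([10, 20, 30, 40], dtype=float)      # illustrative query histogram
h_database = np.array([15, 15, 35, 35], dtype=float)   # illustrative database histogram
print(histogram_intersection(h_query, h_database))     # 0.9
```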

7.1.1.2 Histogram Distance
To reduce computation, the color information in an image can be approximated roughly by the means of the histograms. For the RGB components, the feature vector for matching is

f = [μ_R  μ_G  μ_B]^T    (7.3)

The matching score of the histogram distance between two images is

P(Q, D) = √[(f_Q − f_D)^2] = √[ Σ_{R,G,B} (μ_Q − μ_D)^2 ]    (7.4)

7.1.1.3 Central Moment
The mean is the first-order moment of a histogram, and higher-order moments can also be used. Denote the i-th (i ≤ 3) central moments of the RGB components of a query image Q by M^i_QR, M^i_QG, and M^i_QB, respectively, and the i-th (i ≤ 3) central moments of the RGB components of a database image D by M^i_DR, M^i_DG, and M^i_DB, respectively. The matching score between the two images is

P(Q, D) = √[ W_R Σ_{i=1}^{3} (M^i_QR − M^i_DR)^2 + W_G Σ_{i=1}^{3} (M^i_QG − M^i_DG)^2 + W_B Σ_{i=1}^{3} (M^i_QB − M^i_DB)^2 ]    (7.5)

where W_R, W_G, and W_B are weights.

7.1.1.4 Reference Color Tables
The histogram distance is too rough for matching, while the histogram intersection needs a lot of computation. A good trade-off is to represent the image colors by a group of reference colors (Mehtre, 1995). Since the number of reference colors is less than that of the original image, the computation can be reduced. The feature vectors to be matched are

f = [r_1  r_2  ⋅ ⋅ ⋅  r_N]^T    (7.6)

where r_i is the frequency of the i-th color and N is the size of the reference color table. The matching value between two images is the weighted distance

P(Q, D) = ‖f_Q − f_D‖_W = √[ Σ_{i=1}^{N} W_i (r_iQ − r_iD)^2 ]    (7.7)

where

W_i = { r_iQ    if r_iQ > 0 and r_iD > 0
      { 1       if r_iQ = 0 or r_iD = 0    (7.8)

In the above four methods, the last three are simplifications of the first one with respect to computation. However, the histogram intersection has another problem. When the feature in an image cannot take all values, some zero-valued bins may occur. These zero-valued bins have an influence on the intersection, so the matching value computed from eq. (7.2) does not correctly reflect the color difference between the two images. To solve this problem, the cumulative histogram can be used (Zhang, 1998c). The cumulative histogram of an image is a 1-D function, given by

I(k) = Σ_{i=0}^{k} n_i / n    k = 0, 1, . . . , L − 1    (7.9)

The meanings of the parameters are the same as those in eq. (7.1). Using the cumulative histogram reduces the number of zero-valued bins, so that the distance between two histograms better reflects their similarity. Further improvement can be made by using the subrange cumulative histogram (Zhang, 1998c).
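The cumulative histogram of eq. (7.9) is easy to sketch; the gray-level images below are random stand-ins and the L2 comparison is one possible (assumed) choice of distance, not the only one used in practice.

```python
import numpy as np

def cumulative_histogram(image, levels=256):
    hist, _ = np.histogram(image, bins=levels, range=(0, levels))
    h = hist / image.size            # eq. (7.1): H(k) = n_k / n
    return np.cumsum(h)              # eq. (7.9): I(k) = sum_{i<=k} n_i / n

rng = np.random.default_rng(0)
img_q = rng.integers(0, 256, size=(64, 64))   # placeholder query image
img_d = rng.integers(0, 256, size=(64, 64))   # placeholder database image
d = np.linalg.norm(cumulative_histogram(img_q) - cumulative_histogram(img_d))
print(d)   # a smaller distance means more similar gray-level distributions
```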

7.1.2 Texture Features

Texture is also an important feature in describing the content of an image (see also Chapter 5 of Volume II in this book set). For example, one method using texture features to search JPEG images is proposed by Huang (2003). One effective method for extracting texture is based on the co-occurrence matrix. From the co-occurrence matrix M, different texture descriptors can be computed.

7.1.2.1 Contrast

G = Σ_h Σ_k (h − k)^2 m_hk    (7.10)


For coarse textures, the values of m_hk are concentrated in the diagonal elements; the values of (h − k) are then small and the value of G is also small. For fine textures, in contrast, the value of G is big.

7.1.2.2 Energy

J = Σ_h Σ_k (m_hk)^2    (7.11)

When the values of m_hk are concentrated in the diagonal elements, the value of J is big.

7.1.2.3 Entropy

S = − Σ_h Σ_k m_hk log m_hk    (7.12)

When the values of m_hk in the co-occurrence matrix are similar and are distributed all over the matrix, the value of S is big.

7.1.2.4 Correlation

C = [ Σ_h Σ_k h k m_hk − μ_h μ_k ] / (σ_h σ_k)    (7.13)

where μ_h, μ_k, σ_h, and σ_k are the mean values and standard deviations of m_h and m_k, respectively. Note that m_h = Σ_k m_hk is the sum of the row elements of M, and m_k = Σ_h m_hk is the sum of the column elements of M.

The size of the co-occurrence matrix is related to the number of gray levels of the image. Suppose the number of gray levels of the image is N; then the size of the co-occurrence matrix is N × N, and the matrix can be denoted M_(Δx,Δy)(h, k). First, the co-occurrence matrices for four directions, M_(1,0), M_(0,1), M_(1,1), and M_(1,−1), are computed. The above four texture descriptors are then obtained. Using their mean values and standard deviations, a texture feature vector with eight components is obtained.

Since the physical meanings and value ranges of these components are different, a normalization process should be performed to make all components have the same weights. Gaussian normalization is a commonly used normalization method (Ortega, 1997). Denote an N-D feature vector as F = [f_1 f_2 . . . f_N]. If the database images are denoted I_1, I_2, . . . , I_M, then for each I_i its corresponding feature vector is F_i = [f_i,1 f_i,2 . . . f_i,N]. Suppose that [f_1,j f_2,j . . . f_i,j . . . f_M,j] satisfies the Gaussian distribution; after obtaining the mean value m_j and the standard deviation σ_j, f_i,j can be normalized into [−1, 1] by


f^(N)_{i,j} = (f_{i,j} − m_j) / σ_j    (7.14)

The distribution of f^(N)_{i,j} is then an N(0, 1) distribution.
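A minimal sketch of the four co-occurrence descriptors of eqs. (7.10)-(7.13) for a single displacement (Δx, Δy) is given below. The quantization to eight gray levels, the random test image, and the function names are illustrative assumptions.

```python
import numpy as np

def cooccurrence(image, dx, dy, levels):
    """Normalized co-occurrence matrix m_hk for displacement (dx, dy)."""
    m = np.zeros((levels, levels))
    rows, cols = image.shape
    for y in range(max(0, -dy), rows - max(0, dy)):
        for x in range(max(0, -dx), cols - max(0, dx)):
            m[image[y, x], image[y + dy, x + dx]] += 1
    return m / m.sum()

def descriptors(m):
    h, k = np.indices(m.shape)
    contrast = ((h - k) ** 2 * m).sum()                    # eq. (7.10)
    energy = (m ** 2).sum()                                # eq. (7.11)
    entropy = -(m[m > 0] * np.log(m[m > 0])).sum()         # eq. (7.12)
    mh, mk = m.sum(axis=1), m.sum(axis=0)                  # row and column sums
    mu_h = (np.arange(len(mh)) * mh).sum()
    mu_k = (np.arange(len(mk)) * mk).sum()
    sd_h = np.sqrt(((np.arange(len(mh)) - mu_h) ** 2 * mh).sum())
    sd_k = np.sqrt(((np.arange(len(mk)) - mu_k) ** 2 * mk).sum())
    correlation = ((h * k * m).sum() - mu_h * mu_k) / (sd_h * sd_k)   # eq. (7.13)
    return contrast, energy, entropy, correlation

rng = np.random.default_rng(1)
img = rng.integers(0, 8, size=(32, 32))        # image quantized to 8 gray levels
print(descriptors(cooccurrence(img, 1, 0, levels=8)))
```

Computing the four descriptors for the four directions (1,0), (0,1), (1,1), and (1,−1) and taking their means and standard deviations yields the eight-component texture vector described above.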

7.1.3 Shape Features

Shape is also an important feature in describing the content of images (see also Chapter 6 of Volume II in this book set). A number of shape description methods have been proposed (Zhang, 2003b). Some typically used methods are geometric parameters (Scassellati, 1994; Niblack, 1998), invariant moments (Mehtre, 1997), the boundary direction histogram (Jain, 1996), and important wavelet coefficients (Jacobs, 1995).

7.1.3.1 MPEG-7 Adopted Shape Descriptors
MPEG-7 is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group), which was formally named "Multimedia Content Description Interface" (ISO/IEC, 2001). It provides a comprehensive set of audiovisual description tools to describe multimedia content. Both human users and automatic systems that process audiovisual information are within the scope of MPEG-7. More information about MPEG-7 can be found at the MPEG-7 website (http://www.mpeg.org) and the MPEG-7 Alliance website (http://www.mpeg-industry.com/).

A shape can be described by either boundary/contour-based descriptors or region-based descriptors. The contour shape descriptor captures characteristic shape features of an object or region based on its contour. One is the so-called curvature scale-space (CSS) representation, which captures perceptually meaningful features of the shape. This representation has a number of important properties, namely:
(1) It captures very well the characteristic features of the shape, enabling similarity-based retrieval.
(2) It reflects the properties of the perception of the human visual system and offers good generalization.
(3) It is robust to nonrigid motion.
(4) It is robust to partial occlusion of the shape.
(5) It is robust to perspective transformations, which result from changes of the camera parameters and are common in images and video.
(6) It is compact.

Some of the above properties of this descriptor are illustrated in Figure 7.1 (ISO/IEC, 2001); each frame contains very similar images according to CSS, based on the actual retrieval results from the MPEG-7 shape database. Figure 7.1(a) shows shape generalization properties (perceptual similarity among different shapes).

Figure 7.1: Some properties of CSS descriptor.

Figure 7.1(b) shows the robustness to nonrigid motion (man running). Figure 7.1(c) shows the robustness to partial occlusions (tails or legs of the horses).

Example 7.1 Various shapes described by shape descriptors
The shape of an object may consist of a single region or a set of regions, as well as some holes in the object, as illustrated in Figure 7.2. Note that the black pixels within the object correspond to 1 in an image, while the white background corresponds to 0. Since the region shape descriptor makes use of all pixels constituting the shape within a frame, it can describe any shape. In other words, the region shape descriptor can describe not only a simple shape with a single connected region, as in Figure 7.2(a, b), but also a complex shape that consists of holes in the object or several disjoint regions, as illustrated in Figure 7.2(c–e). The region shape descriptor not only can describe such diverse shapes efficiently in a single descriptor but also is robust to minor deformations along the boundary of the object.

Figure 7.2(g–i) are very similar shape images of a cup. The differences are at the handle: the shape in (g) has a crack at the lower handle, while the handle in (i) is filled. The region-based shape descriptor considers Figure 7.2(g, h) to be similar, but different from Figure 7.2(i), because the handle is filled.

Figure 7.2: Examples of various shapes.

Similarly, Figure 7.2(j, k) and 7.2(l) show parts of a video sequence where two disks are being separated. With the region-based descriptor, they are considered similar. ◻∘

The descriptor is also characterized by its small size and its fast extraction and matching. The data size of this representation is fixed to 17.5 bytes. The feature extraction and matching processes are straightforward, have a low computational complexity, and are suitable for tracking shapes in video data processing.

7.1.3.2 Shape Descriptor Based on Wavelet Modulus Maxima and Invariant Moments
Since the wavelet coefficients are obtained by sampling the continuous wavelet transform uniformly via a dyadic scheme, the general discrete wavelet transform lacks translation invariance. A solution is to use an adaptive sampling scheme. This can be achieved via the wavelet modulus maxima, which are based on irregular sampling of the multiscale wavelet transform at points that have some physical significance. Unlike regular sampling, such a sampling strategy makes the representation translation invariant (Cvetkovic, 1995). In two dimensions, the wavelet modulus maxima indicate the locations of edges (Mallat, 1992).

Example 7.2 Illustration of wavelet modulus maxima
In the first row of Figure 7.3, an original image is given. The second row shows seven images of the wavelet modulus at seven scales (the scale increases from left to right). The third row shows the corresponding maxima of the wavelet modulus. Visually, the wavelet modulus maxima of an image are located along the edges of the image (Yao, 1999). ◻∘

It has been proven that for an N × N image, the number of levels of the wavelet decomposition J should not be higher than J = log2(N) + 1. Experimental results show that the wavelet maxima at levels higher than 6 have almost no discrimination power. For different applications, different levels of wavelet decomposition can be selected.

Figure 7.3: Illustration of wavelet modulus maxima.


Fewer decomposition levels (about 1 to 2 levels) can be used in retrieving a clothes-image database to save computation, and more decomposition levels (5 to 6 levels) can be used in retrieving a flower-image database to extract enough shape information from the images.

Given a wavelet representation of images, the next step is to define a good similarity measurement. Considering the invariance with respect to affine transforms, the invariant moments can be selected as the similarity measurement. The moments extracted from the image form a feature vector. The feature elements in the vector are different physical quantities, and their magnitudes can vary drastically, thereby biasing the Euclidean distance measurement. Therefore, a feature normalization process is needed.

Let F = [f_1, f_2, . . . , f_i, . . . , f_N] be the feature vector, where N is the number of feature elements, and let I_1, I_2, . . . , I_M be the images. For image I_i, the corresponding feature vector is referred to as F_i = [f_i,1, f_i,2, . . . , f_i,j, . . . , f_i,N]. Since there are M images in the database, an M × N feature matrix F = {f_i,j} can be formed, where f_i,j is the j-th feature element of F_i. Each column of F is then a sequence of length M of the j-th feature element, represented as F_j. Assuming F_j to be a Gaussian sequence, the mean m_j and the standard deviation σ_j of the sequence are computed. Then, the original sequence is normalized to an N(0, 1) sequence as follows

f_{i,j} = (f_{i,j} − m_j) / σ_j    (7.15)

The main steps of this algorithm can be summarized as follows (further details can be found in Zhang (2003b)), and a small sketch of steps (3)-(5) is given after the list:
(1) Perform the wavelet decomposition of the image to get the modulus images.
(2) Compute the wavelet modulus maxima to produce the multiscale edge image.
(3) Compute the invariant moments of the multiscale edge image.
(4) Form the feature vector based on the moments.
(5) Normalize (internally) the magnitudes among the different feature elements in the feature vector.
(6) Compute the image similarity by using the feature vectors.
(7) Retrieve the corresponding images.
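The sketch below computes the first three of Hu's moment invariants from a binary edge image and then applies the normalization of eq. (7.15) over a hypothetical feature database; the random binary images stand in for real multiscale edge maps, and using only three invariants is a simplifying assumption.

```python
import numpy as np

def hu_moments(edge):
    """First three Hu invariant moments of a binary edge image (step 3)."""
    y, x = np.indices(edge.shape)
    m00 = edge.sum()
    xc, yc = (x * edge).sum() / m00, (y * edge).sum() / m00
    def eta(p, q):                      # normalized central moments
        mu = (((x - xc) ** p) * ((y - yc) ** q) * edge).sum()
        return mu / m00 ** (1 + (p + q) / 2)
    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    return np.array([
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2,
    ])

# Hypothetical database of 20 edge images -> feature matrix F (step 4),
# followed by the Gaussian normalization of eq. (7.15) (step 5).
features = np.vstack([hu_moments(np.random.default_rng(i).integers(0, 2, (64, 64)))
                      for i in range(20)])
normalized = (features - features.mean(axis=0)) / features.std(axis=0)
```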

Example 7.3 Retrieval results for geometrical images
Figure 7.4 gives two sets of results in retrieving geometric figures. In the presentation of these results, the query image is at the top-left corner and the retrieved images are ranked from left to right and top to bottom. Under the result images, the similarity values are displayed. As shown in Figure 7.4, the above algorithm ranks images of the same shape (regardless of translation, scaling, and rotation) as better matches. It is evident that this algorithm is invariant to translation, scaling, and rotation. ◻∘


Figure 7.4: Retrieval results for geometrical images.

7.2 Motion-Feature-Based Video Retrieval

Motion information represents the evolution of video content along the time axis, which is important for understanding the video content. It is worth noting that color, texture, and shape are common features for both images and video, while motion features are unique to video. Motion information in video can be divided into two categories: global motion information and local motion information.

7.2.1 Global Motion Features

Global motion corresponds to the background motion, which is caused by the movement of the camera (also called camera motion) and is characterized by the consistent movement of all points in a frame. The global motion of a whole frame can be modeled by a 2-D motion vector (Tekalp, 1995).

Motion analysis can be classified into short-time analysis (over a few frames) and long-time analysis (over several hundred frames). For a video sequence, short-time analysis can provide accurate estimates of the motion information. However, for understanding a movement, a duration of about one second or more is required. To obtain the meaningful motion content, a sequence of short-time analysis results should be used. Each of the short-time analysis results can be considered a point in the motion feature space, so the motion in an interval can be represented as a sequence of feature points (Yu, 2001b). This sequence includes not only the motion information in adjacent frames but also the time ordering relation in the video sequence.

The similarity measurement for feature-point sequences can be carried out with the help of string matching, as introduced in Section 4.2 (another possibility is to use the subgraph isomorphism introduced in Section 4.5). Suppose that there are two video clips l1 and l2, whose feature point sequences have lengths N1 and N2; these two feature point sequences can be represented by {f1(i), i = 1, 2, ⋅ ⋅ ⋅, N1} and {f2(j), j = 1, 2, ⋅ ⋅ ⋅, N2}, respectively. If the lengths of the two sequences are the same, that is, N1 = N2 = N, the similarity between the two sequences is


S(l1, l2) = Σ_{i=1}^{N} S_f[f1(i), f2(i)]    (7.16)

where S_f[f1(i), f2(i)] is a function for computing the similarity between two feature points, which can be the reciprocal of any distance function. If the two sequences have different lengths, N1 < N2, then the problem of selecting the beginning point for matching should be solved. In practice, subsequences l′2(t) of l2 with length equal to that of l1 and with different beginning times t are selected. Since l1 and l′2(t) have the same length, the similarity between them can be computed with eq. (7.16). By changing the beginning time, the similarity values for all possible subsequences can be obtained, and the similarity between l1 and l2 is determined as the maximum of these values

S(l1, l2) = max_{0≤t≤N2−N1} Σ_{i=1}^{N1} S_f[f1(i), f2(i + t)]    (7.17)
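The sliding-window matching of eqs. (7.16) and (7.17) can be sketched as follows; the random feature points and the reciprocal-distance similarity are illustrative assumptions rather than the specific features used in the cited experiments.

```python
import numpy as np

def point_similarity(p1, p2):
    """Similarity of two feature points: reciprocal of a distance, cf. S_f in eq. (7.16)."""
    return 1.0 / (1.0 + np.linalg.norm(p1 - p2))

def clip_similarity(f1, f2):
    """Maximum similarity over all beginning points, cf. eq. (7.17)."""
    if len(f1) > len(f2):
        f1, f2 = f2, f1                      # make f1 the shorter sequence
    n1, n2 = len(f1), len(f2)
    best = 0.0
    for t in range(n2 - n1 + 1):             # all possible beginning times
        s = sum(point_similarity(f1[i], f2[i + t]) for i in range(n1))
        best = max(best, s)
    return best

clip_a = np.random.default_rng(0).normal(size=(20, 4))   # 20 feature points, 4-D each
clip_b = np.random.default_rng(1).normal(size=(35, 4))
print(clip_similarity(clip_a, clip_b))
```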

7.2.2 Local Motion Features

Local motion corresponds to the foreground motion, which is caused by the movement of the objects in the scene. Since many objects can exist in one scene and different objects can move differently (in direction, speed, form, etc.), local motion can be quite complicated. For one image, the local motions inside it may need to be represented with several models (Yu, 2001a).

To extract local motion information from a video sequence, the spatial positions of the object's points in different frames of the video should be searched and determined. In addition, those points corresponding to the same object or the same part of an object, which have the same or similar motions, should be connected, to provide the motion representations and descriptions of the object or part of the object.

The motion representations and descriptions of objects often use the vector field of local motion, which provides the magnitude and direction of motion. It has been shown that the direction of motion is very important in differentiating different motions, so a directional histogram of local motion can be used to describe local motion. On the other hand, based on the obtained vector field of local motion, the frame can be segmented and the motion regions with different model parameters can be extracted. By classifying the motion models, a histogram of the local motion regions can be obtained. Since the parameter models of motion regions summarize the local motion information, the histogram of the local motion regions often has a higher-level meaning and a more comprehensive description ability. The matching of these histograms can also be carried out using methods such as histogram intersection, as shown in Section 7.1.


Figure 7.5: First frame in a query clip.


Figure 7.6: Results obtained with the directional histogram of the local motion vector.

Example 7.4 Feature matching with two types of histograms
The use of the above two histograms is illustrated below by an experiment. In this experiment, a nine-minute video sequence of a basketball match, which comes from an MPEG-7 standard test database, is used. Figure 7.5 shows the first frame of a penalty shot clip, which is used for querying. In this video sequence, besides the clip whose first frame is shown in Figure 7.5, there are five clips with similar content (penalty shots). Figure 7.6 shows the first frames of four clips retrieved by using the directional histogram of the local motion vector. From Figure 7.6, it can be seen that although the sizes and positions of the regions occupied by the players are different, the actions performed by these players are all penalty shots. Note that the player in Figure 7.6(a) is the same as that in Figure 7.5, but these two clips correspond to two different scenes. Figure 7.7 shows the first frame of another clip retrieved using the histogram of the local motion vector. In fact, all five penalty shot clips have been retrieved by this method. Compared with the directional histogram of the local motion vector, the histogram of the local motion vector provides better retrieval results. ◻∘

In the above discussion, the global and local motions are considered separately and also detected separately. In real applications, both the global motion and the local motion will be included in a video. In this case, it is required to first detect the global motion (corresponding to the motion in the background where there is no moving object), then to compensate the global motion to remove the influence of the camera motion, and finally to obtain the local motion information corresponding to the different foreground objects.


Figure 7.7: One more result obtained with the histogram of the local motion vector.

As much of the global and local motion information can be obtained directly from the compressed domain, motion-feature-based retrieval can also be performed directly in the compressed domain (Zhang, 2003b).

7.3 Object-Based Retrieval

How to describe an image's contents is a key issue in content-based image retrieval. In other words, among the techniques for image retrieval, the image content description model is a crucial one. Humans' understanding of image contents possesses several fuzzy characteristics, which indicates that the traditional image description model based on low-level image features (such as color, texture, shape, etc.) is not always consistent with human visual perception. Some high-level semantic description models are needed.

7.3.1 Multilayer Description Model

The gap between low-level image features and high-level image semantics, called the semantic gap, sometimes leads to disappointing querying results. To improve the performance of image retrieval, there is a strong tendency in the field to analyze images in a hierarchical way and to describe image contents on a semantic level. A widely accepted method for obtaining semantics is to process the whole image on different levels, which reflects the fuzzy characteristics of the image contents. In the following, a multilayer description model (MDM) for image description is depicted (see Figure 7.8); this model describes the image content with a hierarchical structure to achieve progressive image analysis and understanding (Gao, 2000b). The whole procedure is represented in four layers: the original image layer, the meaningful region layer, the visual perception layer, and the object layer. The description of a higher layer is generated from the description of the adjacent lower layer, and the image model is synchronously established by the procedure of progressively understanding the image contents. These different layers provide distinct information on the image content, so this model is suitable for access at different levels.

Figure 7.8: Multilayer description model.

In Figure 7.8, the left part shows that the proposed image model includes four layers, the middle part shows the corresponding formulas for the representations of the four layers, and the right part provides some representation examples for the four layers.

In this model, the image content is analyzed and represented in four layers. The adjacent layers are structured in such a way that the representations of the upper layers are directly extracted from those of the lower layers. The first step is to split the original image into several meaningful regions, each of which provides certain semantics in terms of human beings' understanding of the image contents. Then, proper features should be extracted from these meaningful regions to represent the image content at the visual perception layer. In the interest of the follow-up processing, such as object recognition, the image features should be selected carefully. Automatic object recognition overcomes the disadvantage of the large overhead of manual labeling while avoiding the drawback of insufficient content information representation by using only lower-level image features. Another important part of the object layer processing is relationship determination, which provides more semantic information among the different objects of an image.

Based on the above statements, the multilayer description model can be expressed by the following formula

MDM = {OIL, MRL, VPL, OL}    (7.18)

In eq. (7.18), OIL represents the lowest layer, the original image layer, with the original image data represented by f(x, y). MRL represents the labeled image l(x, y), which is the result of the meaningful region extraction (Luo, 2001). VPL is the description of the visual perception layer, and it contains three elements, F_MC, F_WT, and F_RD, representing mixed color features, wavelet packet texture features, and region descriptors, respectively (Gao, 2000a). The selection of VPL is flexible and it should


be based on the implementation. OL is the representation of the object layer and it includes two components, T and R_s. The former, T, is the result of the object recognition and indicates the attribute of each extracted meaningful region; detailed discussions are given later. The latter, R_s, is a K × K matrix that indicates the spatial relationship between every two meaningful regions, with K representing the number of meaningful regions in the whole image.

In brief, the four layers together form a "bottom-up" procedure (implying a hierarchical image process) of multilayer image content analysis. This procedure aims at analyzing the image at different levels so that the image content representation can be obtained gradually, from low levels to high levels.

7.3.2 Experiments on Object-Based Retrieval

Some experiments using the above multilayer description model for object-based retrieval are presented below as examples. In these experiments, landscape images are chosen as the data source. Seven object categories often found in landscape images are selected: mountain, tree, ground, water, sky, building, and flower.

7.3.2.1 Object Recognition
In Figure 7.9, four examples of the object recognition are presented. The first row shows the original images. The second row gives the extracted meaningful regions.

Figure 7.9: Experimental results for presegmentation and object recognition.

As mentioned above, the procedure of meaningful region extraction aims not at precise segmentation but rather at extracting the major regions that are striking to human vision. The third row represents the result of the object recognition, with each meaningful region labeled by a different shade (see the fourth row) to indicate the category it belongs to.

7.3.2.2 Image Retrieval Based on Object Matching
Based on the object recognition, the retrieval can be conducted at the object level. One example is shown in Figure 7.10. The user has submitted a query in which three objects, "mountain," "tree," and "sky," are selected. Based on this information, the system searches the database and looks for images with these three objects. The returned images from the image database are displayed in Figure 7.10. Though these images are different in the sense of visual perception, as they may have different colors, shapes, or structural appearances, all of them contain the three required objects.

Based on the object recognition, the retrieval can be further conducted using the object relationships. One example is given in Figure 7.11, which is the result of an advanced search. Suppose the user makes a further requirement based on the querying results of Figure 7.10, in which the spatial relationship between the objects "mountain" and "tree" should also satisfy a "left-to-right" relationship. In other words, not only should the objects "mountain" and "tree" be present in the returned images (the presence of the "sky" object is implicit), but the "mountain" should also be to the left of the "tree" in the images (in this example, the position of "sky" is not restricted). The results shown in Figure 7.11 are just a subset of those in Figure 7.10.

Figure 7.10: Experimental results obtained by object matching.


Figure 7.11: Further results obtained by object relationship matching.

7.4 Video Analysis and Retrieval

The purpose of video analysis is to establish/recover the semantic structure of video (Zhang, 2002b). Based on this structure, further querying and retrieval can easily be carried out. Many types of video programs exist, such as advertisements, animations, entertainments, films, home videos, news, sport matches, teleplays, etc.

7.4.1 News Program Structuring

A common video clip is physically made up of frames, and video structuring begins with identifying the individual camera shots from the frames. Video shots serve as the elementary units used to index the complete video clips. Many techniques for shot boundary detection have been proposed (O'Toole, 1999; Garqi, 2000; Gao, 2002a).

7.4.1.1 Characteristics of News Program
A news program has a consistent, regular structure that is composed of a sequence of news story units. A news story unit is often called a news item; it is naturally a semantic content unit and contains several shots, each of which presents an event with relative independence and clear semantics. In this way, the news story unit is the basic unit for video analysis.

The relatively fixed structure of news programs provides many cues for understanding the video content, establishing the index structure, and performing content-based querying. These cues can exist on different structure layers and provide relatively independent video stamps. For example, there are many speakers' shots in which the speakers or announcers repeatedly appear. Such a shot can be considered the beginning mark of a news item and is often called an anchor shot. Anchor shots can be used as hypothesized beginning points for news items. In news program analysis, the detection of anchor shots is the basis of video structuring. Two types of methods are used in the detection of anchor shots: direct detection and the rejection of non-anchor shots. One problem for correctly detecting anchor shots is that there are many "speakers' shots" in news programs, such as reporter shots, staff shots, lecturer shots, etc. All these shots are mixed, so the accurate


detection of anchor shots is difficult, and many false alarms and wrong detections could happen. In the following, a three-step detection method for anchor shots is presented (Jiang, 2005a). The first step is to detect all main speaker close-ups (MSC) according to the changes among consecutive frames; the results include real anchor shots but also some wrong detections. The second step is to cluster the MSCs using an unsupervised clustering method and to establish a list of MSCs (some postprocessing is performed based on the results of news title detection). The third step is to analyze the time distribution of the shots and to distinguish anchor shots from other MSC shots.

7.4.1.2 Detection of MSC
Anchor shot detection frequently relies on the detection of anchorpersons, which is often interfered with by other persons, such as reporters, interviewees, and lecturers. One way to solve this problem is to add a few preprocessing (Albjol, 2002) or postprocessing (Avrithis, 2000) steps to prevent possible false alarms. However, in the spirit of employing rather than rejecting the interferences, one can take proper steps to extract visual information from these "false alarms" and use this special information to give clues for video content structuring and to facilitate the anchor shot identification. Since a dominant portion of news programs is about human activities, human images, and especially camera-focused talking heads, play an important role. From an observation of common news clips, it is easily noticed that heads of presidents or ministers appear in political news, heads of pop stars or athletes appear in entertainment or sports news, heads of professors or instructors appear in scientific news, etc. Definitely, these persons are the ones users are looking for when browsing news video. Therefore, apart from being a good hint for the existence of anchorpersons, the focused heads themselves serve as a good abstraction of the news content regarding key persons.

The detection of a human subject is particularly important here to find MSC shots. As defined, an MSC denotes a special scene with a single focused talking head in the center of each frame, whose identification is possible using a number of human face detection algorithms. However, the face details, such as whether it has clear outlines of cheeks, nose, or chin, are not of concern here. What gives the most important clue at this point is whether it is a focused talking head in front of the camera, with the probability of belonging to an anchorperson, or some meaningless objects and background. From this viewpoint, skin color and face shape are not features that must be extracted. Thus, differing from other methods based on complicated face color/shape model training and matching (Hanjalic, 1998), identifying MSCs in a video stream can be carried out by motion analysis within a shot, with a simplified head-motion-model set. Some spatial and motional features of the MSC frame are illustrated in Figure 7.12.

7.4 Video Analysis and Retrieval

(a)

(b)

Left

(d)

(c)

Middle

Body region

231

Right

Body region

(e)

Body region

(f)

Figure 7.12: Some spatial and motional features of the MSC frame.

(1)

(2)

(3)

Generally, an MSC shot (see Figure 7.12(b)) has relatively lower activity compared to normal shots (see Figure 7.12(c)) in news video streams, but has stronger activity than static scene shots (see Figure 7.12(a)). Each MSC has a static camera perspective, and activity concentrates on a single dominant talking head in the center of the frame, with a size and position within a specific range. In MSC shots, the presence of a talking head is located in a fixed horizontal area during the shot, with three typical positions: left (L), middle (M), and right (R), as shown in Figure 7.12(d–f).

According to the above features, MSC detection can be carried out with the help of the map of average motion in shot (MAMS) (Jiang, 2005a). For the k-th shot, its MAMS is computed by Mk (x, y) =

1 L–d 󵄨󵄨 󵄨 ∑ 󵄨f (x, y) – fi+d (x, y)󵄨󵄨󵄨 L i=1 󵄨 i

L=

N d

(7.19)

where N is the number of frames in the k-th shot and d is the interval between the measured frames. MAMS is a 2-D cumulative motion map, which keeps the spatial distribution of the motion information. The change value of an MSC shot is between that of the normal shots and that of static scene shots, and can be described by the division of MAMS by its size. In other words, only the following shot can be a possible candidate Ts
60 shots < 10 shots

7.4 Video Analysis and Retrieval

A1

P1

P2

1

5

10

P3 Shot type

P1

News item 1

A2

P4

P5

233

15

Shot number

P6

News item 2 18

25

20

A1

P7

30 P8

P9

P10

News item 3 31

35

40

45

Figure 7.13: News item and shot structure.

beginning position, followed by related report shots and speaker’s shots, a structured news program is obtained. One example is shown in Figure 7.13. Using this structure, a user could locate any specific person in any news item and perform non-linear high-level browsing and querying (Zhang, 2003b).

7.4.2 Highlight of Sport Match Video Sport match videos are also quite structured. For example, a football match has two half games, while a basketball match further divides a half game into two sets. All these characteristics provide useful time cues and constraints for sport match analysis. On the other hand, there are always some climactic events in a sport match (sometimes called a sport video or event video), such as shooting in a football match and slam-dunk in basketball match. A sport match has much uncertainty, and the time and position of an event can hardly be predicted. The generating of a sport match video cannot be controlled by program makers during the match. 7.4.2.1 Characteristics of a Sport Video In sport matches, a particular scene often has some fixed colors, movements, object distributions, etc. As a limited number of cameras have been preset in the play ground, one event often corresponds to particular change of the scene. This makes it possible to detect interesting events by the change of the scene, or the existence or appearance of certain objects. For a video clip having players, the indexing can be made based on the silhouettes of players, the colors of their clothes, and the trajectories. For a video clip toward the audience, the pose and action can be extracted as the indexing entry. The highlight shot is often the most important part of a sport match. News reports on sport matches are always centered on highlighted events. In defining highlighted events, some prior knowledge is often used. The definition, content, and visual representation of highlighted events, vary from one sport match to another. For example, shooting in a football match is the focus, while slam-dunk and fast passing always

234

7 Content-Based Image Retrieval

attract much attention in a basketball match. From the point of querying view, it can be based on the detection of highlighted shots or based on the components of events, such as a ball, a goal, a board, etc. It can also be based on the type of activities, such as a free ball, a penalty, a three-point field goal, etc. The features of a sport match can be divided into three layers, the low layer, the middle layer, and the high layer (Duan, 2003). The low-layer features include motion vectors, color distributions, etc., which can be directly extracted from images. The middle-layer features include camera motions, object regions, motions, etc., which can be derived from the low-layer features. For example, the camera motion can be estimated from the motion vector histogram. The features of the high layer correspond to events and their relation with those of middle/low layer can be established with the help of some appropriate knowledge.

7.4.2.2 Structure of Table Tennis Matches Different from a football match that has fixed time duration, a table tennis match is based on scores. It has relatively unambiguous structure. One match is composed of several repeated scenes with typical structures. In particular, the following scenes can be distinct, such as the match scene, the service scene, the break scene, the audience scene, and the replay scene. Each scene has its own particularity and property. One table tennis match is formed by three to seven sets, while, for each set, dozen of games exist. The result of each game here induces one point in the score. The repetition of the above structure units constructs a match, in which each scene happens in a fixed sequence. For example, service scene is followed by the play scene, while the replay scene goes after a highlight play. The above structure is depicted in Figure 7.14. Example 7.5 Illustration of clustering results According to the above-described structure, shots in a table tennis match can be clustered. Figure 7.15 shows several clusters obtained by using an unsupervised clustering, in which each column shows shots in one class. ◻∘

Start

Game

Break

Game

Break

Game

Pause

Service

Play

Repeat

Audience

Figure 7.14: The structure of table tennis matches.

End

7.4 Video Analysis and Retrieval

235

Figure 7.15: Results of unsupervised clustering.

Video sequence

Player position detection

Table position detection

Player trajectory extraction

Basic ranking

Ball position detection

Ball trajectory extraction

Quality ranking

Final ranking Figure 7.16: Flowchart of object detection, tracking, and shot ranking.

7.4.2.3 Object Detection and Tracking For most audiences, not only the final score is important but also the level of a highlight and its incitement for each play. The ranking of a table tennis highlight can be obtained by the combination of the information of the trajectory of the ball, the position of the player, and the table. Figure 7.16 gives an overview of object detection, tracking, and ranking of shots Chen (2006). The player position, the table position, and the ball position should first be detected. Then, both the trajectories of the player and the ball need to be extracted. Using the results of the detection and extraction, the rank (consisting of both basic rank and quality rank) for each shot is determined.

236

7 Content-Based Image Retrieval

The table position detection is performed first. The table region can be specified by using the edge and the color features in a rough-fine procedure. The RGB color histogram (normalized to unity sum) of each rough table region is calculated and the histogram is obtained by uniformly quantizing the RGB color coordinates into 8 (Red), 8 (Green), and 8 (Blue) bins, respectively, resulting in a total of 512 quantized colors. Then, a radon transform is used to find the position of the table edge lines and to specify a rough boundary of the table. After the detection of tables, the players are detected. The two players are the most drastic moving objects in the playing field. By taking into account the spatial-temporal information of the image sequence, the two biggest connected squares inclosing the players can be determined, as illustrated in Figure 7.17. Ball detection and tracking is a challenging task. A combined analysis of the color, shape, size, and position is required. The detection of a ball candidate is based on the detection of the table region. Ball candidates can be classified into the in-table regions and the out-table regions. The consecutive losing of the in-table candidate balls indicates that the ball is out of the table. Detection of the ball in the in-table region is relatively simple compared to that of the ball in the out-table region. The latter may need a complicated process with the Bayesian decision (Chen, 2006). One tracking result is shown in Figure 7.18, in which Figure 7.18(a) through Figure 7.18(d) are four images sampled at an equal interval from a sequence. Figure 7.18(e) shows the trajectory obtained for the entire play in superposition. 7.4.2.4 Highlight Ranking To rank the highlight level precisely and in close connection with the human feeling, different levels of the human knowledge for evaluating match are considered. The

Figure 7.17: Detection result of players.

(a)

(b)

(c)

Figure 7.18: Illustration of ball tracking.

(d)

(e)

7.4 Video Analysis and Retrieval

237

ranking of a table tennis match can be divided into three layers: the basic ranking, the quality of the action, and the most sophisticated one – the feeling from the experience. The basic part describes the most common aspects of the match, which include the velocity of the ball and the players, the duration of the play for a point, and the distance between two strokes. The quality of the match involves the quality of the ball trajectory, the drastic degree of the player motion, and the consistency of the action. These fuzzy concepts are more suitable to be treated by a powerful fuzzy system. The last part mostly refers to the experience or the feeling of the referee. In the following, some details for the first two layers are provided (Chen, 2006). Basic Ranking The basic part of the table tennis highlight level describes the most direct feeling of a single point with the explicit knowledge and a statistical transfer from the feature detected, which is given by R = N (wv hv + wb hb + wp hp )

(7.21)

where N is the number of frames and wv , wb , and wp are the weights for different factors. The highlight rank is determined by three factors. The first is the average speed of ball, hv = f (

1 N ∑ v(i)) N i=1

(7.22)

where v(i) is the speed of i-th stroke. The second is the average distance running by the ball between two consecutive strokes, N1

N2

i=1

i=1

󵄨 󵄨 󵄨 󵄨 hb = f (∑ 󵄨󵄨󵄨b1 (i + 1) – b1 (i)󵄨󵄨󵄨 /N1 + ∑ 󵄨󵄨󵄨b2 (i + 1) – b2 (i)󵄨󵄨󵄨 /N2 )

(7.23)

where N1 and N2 are the total numbers of strokes made by player 1 and player 2, respectively, and b1 and b2 are the ball positions at the stroking point with corresponding players. The third is the average speed of the movement of the players between two consecutive strokes, N1

N2

i=1

i=1

󵄨 󵄨 󵄨 󵄨 hp = f (∑ 󵄨󵄨󵄨p1 (i + 1) – p1 (i)󵄨󵄨󵄨 /N1 + ∑ 󵄨󵄨󵄨p2 (i + 1) – p2 (i)󵄨󵄨󵄨 /N2 )

(7.24)

where p1 and p2 are the corresponding player’s speed at frame i. The function f (⋅) in the above three equations is a sigmoid function f (x) =

1 ̄ 1 + exp[–(x – x)]

(7.25)

It is used to convert a variable to a highlight rate so that different affection factors can be added together.

238

7 Content-Based Image Retrieval

Quality of Action To measure the quality of the game, the basic marking unit is a stroke, including the player’s premoving, the stroking, the trajectory and speed of the ball, and the similarity and consistency between the two adjacent strokes. The drastic degree of the motion of the player characterizes the moving variety of the players, 󵄨 󵄨 󵄨 󵄨 m(i) = wp f (󵄨󵄨󵄨p(i) – p(i – 2)󵄨󵄨󵄨) + ws f (󵄨󵄨󵄨s(i) – s(i – 2)󵄨󵄨󵄨)

(7.26)

where p(i) and s(i) are the positions and shapes (measured from the area of minimal bounding rectangle) of the stroking player at the stroke i. Function f is the same sigmoid function defined in eq. (7.25), wp and ws are the corresponding weights. The quality of the ball trajectory between two strokes is described by the length and velocity, for it is much more difficult to control a long trajectory of a ball flying at a high speed. This quality can be represented by t(i) = wl f (l(i)) + wb f (v(i))

(7.27)

where l(i) and v(i) are the length and average velocity along the ball trajectory between strokes i and i – 1, w_l and w_v are the weights, and f(·) is the sigmoid function. Finally, the variation of strokes is composed of three factors that depict the consistency and diversification of the match, since a dazzling match consists of various motion patterns:

$$u(i) = w_v f\left(v(i) - v(i-1)\right) + w_d f\left(d(i) - d(i-1)\right) + w_l f\left(l(i) - l(i-1)\right) \tag{7.28}$$

where v(i), d(i), and l(i) are the average velocity, direction, and length of the ball trajectory between strokes i and i – 1, w_v, w_d, and w_l are the weights, and f(·) is the sigmoid function.
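A companion sketch (same caveats as above; the weight dictionary and the use of scalar position/shape values are assumptions) evaluates the per-stroke quality terms of eqs. (7.26)–(7.28):

```python
import numpy as np

def sigmoid(x, x_bar=0.0):
    """Sigmoid of eq. (7.25)."""
    return 1.0 / (1.0 + np.exp(-(x - x_bar)))

def stroke_quality(p, s, l, v, d, i, w):
    """Quality terms for stroke i.

    p, s: per-stroke player position and shape (bounding-box area), as scalars here
    l, v, d: per-stroke length, average velocity, and direction of the ball trajectory
             between stroke i and stroke i-1
    w: weight dictionary, e.g. {'p': 1, 's': 1, 'l': 1, 'v': 1, 'd': 1}
    """
    m_i = w['p'] * sigmoid(abs(p[i] - p[i - 2])) + w['s'] * sigmoid(abs(s[i] - s[i - 2]))  # eq. (7.26)
    t_i = w['l'] * sigmoid(l[i]) + w['v'] * sigmoid(v[i])                                   # eq. (7.27)
    u_i = (w['v'] * sigmoid(v[i] - v[i - 1]) + w['d'] * sigmoid(d[i] - d[i - 1])
           + w['l'] * sigmoid(l[i] - l[i - 1]))                                             # eq. (7.28)
    return m_i, t_i, u_i
```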

7.4.3 Organization of Home Video
Compared to other types of video programs, home video has some particularities in terms of who shoots it and what is filmed (Lienhart, 1997). It is often less structured, so its organization is a more challenging task.

7.4.3.1 Characteristics of Home Video
In spite of the unrestricted content and the absence of a storyline, a typical home video still has certain structural characteristics: it contains a set of scenes, each of which is composed of ordered and temporally adjacent shots that can be organized into clusters that convey semantic meaning. The reason is that home video recording imposes temporal continuity. Unlike other video programs, home video just records life but does not compose a story. Therefore, every shot may have the same importance. In


addition, filming home video with a temporal back-and-forth structure is rare. For example, on a vacation trip, people do not usually visit the same site twice. In other words, the content tends to be localized in time. Consequently, discovering the scene structure above the shot level plays a key role in home video analysis. Video content organization based on shot clustering provides an efficient way of semantic video access and fast video editing.

Home video is not prepared for a very large audience (like broadcast TV), but for relatives, guests, and friends. In the analysis, the purpose and filming tactics should be considered. For example, the subjective feeling conveyed by the video information can be decomposed into two parts: one comes from motion regions that attract the attention of viewers, and the other from the general impression of the environment. At the same time, different types of camera motion should also be considered, since different camera motions may signify different changes in attention. For example, a zoom-in puts the attention of viewers more on motion regions, while a pan or tilt puts the attention of viewers more on the environment.

7.4.3.2 Detection of Attention Regions
Structuring video needs the detection of motion–attention regions. This is not equivalent to the detection of video objects. The latter task needs to determine the boundaries of objects accurately and to follow the changes of the objects quickly. The former task stresses more the subjective feeling of human beings with regard to the video. In this respect, the influence of the region detection on the subjective feeling is more important than an accurate segmentation. Commonly, most object detection methods fail if there is no specific clear-shaped object against the background. In order to circumvent the problem of actual object segmentation, a different concept, the attention region, is introduced. An attention region does not necessarily correspond to a real object but denotes a region with irregular movement compared to the camera motion. Based on the assumption that movement differing from the global motion attracts more attention (supported by the common observation that irregular motion is easily caught by human eyes against a static or regularly moving background), these regions are regarded as more important areas in contrast to the background image. Attention region detection requires less precision, since more emphasis is placed on the degree of human attention than on the accurate object outline. In practice, it is simply performed based on the outliers detected previously by the camera motion analysis. In addition, the attention region tracking process also becomes easier because of the coarse macroblock boundaries. The first step of attention region detection is to segment a "dominant region" from a single frame, which is illustrated in Figure 7.19. Figure 7.19(a) is a typical home video frame. This frame can be decomposed into two parts: the running boy on the lawn, as shown in Figure 7.19(b), and the background with grass and trees, as shown in Figure 7.19(c). The latter represents the environment and the former represents the attention region.


Figure 7.19: Illustration of attention region and environment.

The detection of attention regions does not require very high precision. In other words, the accurate extraction of the object boundary is not necessary. Therefore, the detection of attention regions can be performed directly in the MPEG compressed domain. Two types of information in the compressed domain can be used (Jiang, 2005b):
DCT Coefficients of Macroblocks: Among the DCT coefficients, the DC coefficient is easy to obtain. It is the direct component of the block and is eight times the average value in the block. It roughly reflects brightness and color information.
Motion Vectors of Macroblocks: Motion vectors correspond to a sparse and coarse motion field. They reflect the motion information approximately. With the motion vectors, a simple but effective four-parameter global motion model can be used to estimate the simplified camera motions (zoom, rotate, pan, and tilt), given by

$$\begin{cases} u = h_0 x + h_1 y + h_2 \\ v = h_1 x + h_0 y + h_3 \end{cases} \tag{7.29}$$
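As an illustrative sketch (not the authors' implementation), the four parameters can be estimated from the macroblock motion vectors by least squares with iterative outlier rejection, the procedure described in the next paragraph; the iteration count and threshold below are assumptions.

```python
import numpy as np

def fit_global_motion(x, y, u, v, n_iter=5, outlier_thresh=2.0):
    """Fit the four-parameter model of eq. (7.29) to macroblock motion vectors
    (u, v) observed at block centres (x, y), with iterative outlier rejection.
    Returns the parameters (h0, h1, h2, h3) and a boolean outlier mask."""
    inlier = np.ones(len(x), dtype=bool)
    h = np.zeros(4)
    for _ in range(n_iter):
        xs, ys, us, vs = x[inlier], y[inlier], u[inlier], v[inlier]
        # Stack the two linear equations per block: u = h0*x + h1*y + h2, v = h1*x + h0*y + h3
        A = np.concatenate([np.stack([xs, ys, np.ones_like(xs), np.zeros_like(xs)], axis=1),
                            np.stack([ys, xs, np.zeros_like(xs), np.ones_like(xs)], axis=1)])
        b = np.concatenate([us, vs])
        h, *_ = np.linalg.lstsq(A, b, rcond=None)
        # Residuals on all blocks; blocks deviating from the model become outliers
        ru = u - (h[0] * x + h[1] * y + h[2])
        rv = v - (h[1] * x + h[0] * y + h[3])
        inlier = np.hypot(ru, rv) < outlier_thresh
    return h, ~inlier   # outlier macroblocks are candidate attention regions
```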

A common least-squares fitting algorithm is used to optimize the model parameters h_0, h_1, h_2, and h_3. This algorithm iteratively examines the error produced by the current estimate of the camera parameters and generates an outlier mask, consisting of the macroblocks whose motion vectors do not follow the camera motion model; it then recomputes the camera parameters, forms a new outlier mask, and iterates. As it operates directly on MPEG video, motion vector preprocessing has to be performed to obtain a dense motion vector field.

Example 7.6 Detected attention regions
Some examples of detected attention regions are shown in Figure 7.20. All of them are macroblocks characterized by movements different from the global motion. Note that these regions are quite different from real objects. They could be a part of the


Figure 7.20: Detected attention regions.

real object (the kid's legs in Figure 7.20(a)) or several objects as a whole (the woman and kids jumping on a trampoline in Figure 7.20(b), the kid and little car on the road in Figure 7.20(c)). Reasonably, a region drawing the viewer's attention need not be a complete and accurate semantic object. ◻∘

7.4.3.3 Time-weighted Model Based on Camera Motion
The detection of attention regions divides the video content into two parts, spatially and temporally: the attention regions and the remaining regions. Two types of features can be used to represent the content of a shot: one is the features of the attention regions, which emphasize the part attracting the audience, and the other is the features of the other regions, which stress the global impression of the environment. Color is often considered the most salient visual feature that attracts the viewer's attention, since the color appearance, rather than the duration, trajectory, and area of a region, counts more in the mind. Thus, an effective and inexpensive color representation in the compressed domain – the DC color histogram – is used to characterize each shot. In contrast to calculating an overall average color histogram, the DC histograms of the macroblocks constituting the attention regions and the background are computed separately. The two types of histograms form a feature vector for each shot, which holds more information than a single histogram describing the global color distribution.

As each attention region and background appears in several frames, the histograms along time can be accumulated into a single histogram. Instead of the common averaging procedure, a camera motion-based weighting strategy is used, giving different importance to histograms at different times. It is well known that camera motions are always utilized to emphasize or neglect a certain object or segment of video (i.e., to guide viewers' attention). Actually, one can imagine the camera as a narrative eye. For example, camera panning imitates an eye movement, either tracking an object or examining a wider view of a scene, and a close-up indicates the intensity of an impression. In order to reflect the different impressions of the viewers caused by camera movement, a camera attention model can be defined. The camera movement controlled by the photographer is useful for formulating the viewer attention model. However, the parameters in the four-parameter global motion


model are not the true camera translation, scale, and rotation angles. Thus, these parameters have to be converted first to real camera motion for attention modeling (Tan, 2000), given by

$$\begin{cases} S = h_0 + 1 \\ r = h_1 / (h_0 + 1) \\ L = \sqrt{a^2 + b^2} = \sqrt{h_2^2 + h_3^2}\,/\,(h_0 + 1) \\ \theta = \arctan(b/a) = -\arctan(h_3/h_2) \end{cases} \tag{7.30}$$

where S indicates the interframe camera zoom factor (S > 1, zoom in; S < 1, zoom out), r is the rotation factor about the lens axis, a and b are the camera pan and tilt factors, L is the magnitude of camera panning (horizontal and vertical), and θ is the panning angle. The next step is mapping the camera parameters to the effect they have on viewers' attention. The camera attention can be modeled on the basis of the following assumptions (Ma, 2002).
(1) Zooming is always used to emphasize something. Zoom-in is used to emphasize details, while zoom-out is used to emphasize an overview. The faster the zooming speed, the more important the focused content.
(2) Situations of camera panning and tilting should be divided into two types: region tracking (with an attention region) and surroundings examining (without an attention region). The former corresponds to the situation of a camera tracking a moving object, so that much attention is given to the attention region and little to the background. Camera motion in the latter situation tends to attract less attention, since horizontal panning is often applied to neglect something (e.g., to change to another view) if no attention region exists. The faster the panning speed, the less important the content. Additionally, unless a video producer wants to emphasize something, vertical panning is not used, since it gives viewers an unstable feeling.
(3) If the camera pans/tilts too frequently or too slightly, it is considered random or unstable motion. In this case, the attention is determined only by the zoom factor.
In frames with attention regions, viewers' attention is supposed to be split into two parts, represented by the weight of the background W_BG and the weight of the attention region W_AR:

$$W_{BG} = 1 / W_{AR} \tag{7.31}$$

$$W_{AR} = \begin{cases} S & L < L_0 \\ S(1 + L/R_L) & L \geq L_0 \end{cases} \tag{7.32}$$


Figure 7.21: Time-weighted modeling based on camera motion.

where W_AR is proportional to S and is enhanced by panning, L_0 is the minimal camera pan (a panning magnitude less than L_0 is regarded as random), and R_L is a factor controlling how strongly panning affects the attention degree. In this model, a value greater than 1 means emphasis and a value less than 1 means neglect. In frames without an attention region, only the background attention weight W_BG is computed, by

$$W_{BG} = \begin{cases} S & L < L_0 \\ S/(1 + f(\theta)L/R_L) & L \geq L_0,\ \theta < \pi/4 \\ S(1 + f(\theta)L/R_L) & L \geq L_0,\ \theta \geq \pi/4 \end{cases} \tag{7.33}$$

The attention degree decreases in the case of horizontal panning (θ < π/4) and increases in the case of vertical panning (θ ≥ π/4). f(θ) is a function representing the effect of the panning angle on the decrease or increase rate. It gets smaller as the panning angle approaches π/4, because panning in this diagonal direction tends to have little effect on the attention.
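The attention weights of eqs. (7.31)–(7.33) can be sketched as below; the particular form of f(θ) and the default values of L_0 and R_L are assumptions for illustration only.

```python
import numpy as np

def attention_weights(S, L, theta, has_attention_region,
                      L0=1.0, R_L=8.0, f=lambda t: abs(np.pi / 4 - t) / (np.pi / 4)):
    """Camera-motion attention weights of eqs. (7.31)-(7.33).

    S: interframe zoom factor, L: panning magnitude, theta: panning angle.
    f(theta) models the effect of the panning angle; the form used here
    (largest for purely horizontal/vertical panning, zero at pi/4) is an assumption."""
    if has_attention_region:
        # eqs. (7.31)-(7.32): split attention between attention region and background
        w_ar = S if L < L0 else S * (1.0 + L / R_L)
        return {'attention_region': w_ar, 'background': 1.0 / w_ar}
    # eq. (7.33): only the background weight, modulated by the panning angle
    if L < L0:
        w_bg = S
    elif theta < np.pi / 4:
        w_bg = S / (1.0 + f(theta) * L / R_L)
    else:
        w_bg = S * (1.0 + f(theta) * L / R_L)
    return {'background': w_bg}
```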

Example 7.7 Time-weighted modeling based on camera motion
Figure 7.21 shows some examples of time-weighted modeling based on camera motion. Figure 7.21(a–c) are three frames extracted from the same shot of kids playing on the lawn. Although they share a similar background, the background attention weight W_BG differs in the attention model according to the different camera motions of the three frames. Figure 7.21(a) is almost stationary (without an attention region), so the attention is determined only by the zoom factor. Figure 7.21(b) is a left panning to track a running kid (with an attention region); thus, it has less weight on the background than on the attention region. Figure 7.21(c) also shows region tracking but with a higher panning speed than in Figure 7.21(b), and thus it has an even smaller background weight. It is observed in Figure 7.21 that the visual contents are spatially split into two attention parts, while they are temporally weighted by the camera motion parameters. ◻∘

Figure 7.22: Two-layer clustering of shots.

7.4.3.4 Strategy for Shot Organization
Using the shot features and weights obtained above, a feature vector is composed for each shot. The visual similarity between two shots is then computed. In particular, the similarity of the backgrounds and that of the attention regions are computed separately, for example, by using the normalized histogram intersection. Based on the similarity among shots, similar shots can be grouped. A two-layer shot organization strategy can be used (Zhang, 2008d). In the first layer, scene transitions are detected: a place where both the attention regions and the environment change indicates a change of location and scenario. In the second layer, either the attention regions or the environment changes (but not both). This can be a change of focus (a different moving object in the same environment) or a change of object position (the same moving object in a different environment). A change in the second layer is also called a change of subscene.

Example 7.8 Two-layer shot organization
One example of two-layer shot organization is shown in Figure 7.22. Five frames are extracted from five consecutive shots. White boxes mark the detected attention regions. The first-layer organization clusters the five shots into three scenes, which differ both in attention regions and in environments. The last three shots in scene 3 of the first layer can be further analyzed: the third and fourth shots in Figure 7.22 can be clustered together, as they have a similar background and similar attention regions, while the fifth shot has a different attention region from the third and fourth shots. In short, the first layer clusters shots by semantic event, while the second layer distinguishes different moving objects. ◻∘
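A hedged sketch of the shot feature construction and similarity computation described above is given below; the linear combination of the two similarities with weight alpha is an assumption, not a prescription from the original method.

```python
import numpy as np

def shot_feature(dc_hist_ar, dc_hist_bg, w_ar, w_bg):
    """Accumulate per-frame DC histograms of the attention region and background
    into one weighted histogram pair for a shot (weights from eqs. (7.31)-(7.33))."""
    h_ar = np.sum(np.asarray(w_ar)[:, None] * np.asarray(dc_hist_ar), axis=0)
    h_bg = np.sum(np.asarray(w_bg)[:, None] * np.asarray(dc_hist_bg), axis=0)
    return h_ar / max(h_ar.sum(), 1e-9), h_bg / max(h_bg.sum(), 1e-9)

def hist_intersection(h1, h2):
    """Normalized histogram intersection similarity."""
    return np.sum(np.minimum(h1, h2))

def shot_similarity(shot_a, shot_b, alpha=0.5):
    """Similarity between two shots, combining attention-region and background
    similarities; the weight alpha is an illustrative choice."""
    s_ar = hist_intersection(shot_a[0], shot_b[0])
    s_bg = hist_intersection(shot_a[1], shot_b[1])
    return alpha * s_ar + (1 - alpha) * s_bg
```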

7.5 Problems and Questions
7-1* Let H_Q(k) and H_D(k) be as shown in Figure Problem 7-1(a, b), respectively. Use the histogram intersection method to calculate the matching values. If H_Q(k) is as shown in Figure Problem 7-1(c), what is the matching value then?

Figure Problem 7-1

7-2 Select a color image and make a smoothed version using some averaging operations. Take the original image as a database image and the smoothed image as a query image. Compute their histograms and the matching values according to eqs. (7.2), (7.4), (7.5), and (7.7), respectively.
7-3 Design two different histograms such that the matching value obtained with the histogram intersection is zero.
7-4 Compute the values of four texture descriptors for a Lenna image and a Cameraman image, respectively. Compare these values and discuss the necessity of normalization.
7-5 Implement the algorithm for computing the shape descriptor based on wavelet modulus maxima. Draw several geometric figures and compute the descriptor values for them. Based on the result, discuss the sensitivity of the descriptors' values to the object shape.
7-6* There are three images shown in Figure Problem 7-6, which are silhouettes of trousers in a clothing image library. Select several shape descriptors to compare; discuss which descriptor is more suitable for distinguishing them from other garments, and which descriptor can best distinguish them from each other.

Figure Problem 7-6

7-7 Propose a method/scheme to combine the different types of descriptors presented in Section 7.1 into a new one. What is the advantage of this combined descriptor?
7-8 Given F_1 = f_1(i) = {1, 2, 3, 4, 5}, F_2 = f_2(i) = {2, 4, 6, 8, 10}, and F_3 = f_3(i) = {3, 6, 9}, assume that S_f[f_1(i), f_2(i)] is a Euclidean distance function:
(1) What is the value of similarity between F_1 and F_2 according to eq. (7.16)?

(2) What is the value of similarity between F_1 and F_3 according to eq. (7.17)?
7-9 Why has the research in content-based image retrieval been centered on semantic-based approaches in recent years? What are the significant advances in this area?
7-10 How many cues can be used in structuring news programs? How can these cues be extracted from broadcasting news programs?
7-11 How do you model the feeling of human beings watching a table tennis game? What kinds of factors should be considered?
7-12 If possible, make some home video recordings and classify them into different categories.

7.6 Further Reading
1. Feature-Based Image Retrieval
– A general introduction to content-based image retrieval can be found in Zhang (2003b).
– Some early work in visual information retrieval can be found in Bimbo (1999).
– More discussion on early feature-based image retrieval can be found in Smeulders (2000).
– Different feature descriptors can be combined in feature-based image retrieval (Zhang, 2003b).
– Several texture descriptors are proposed by the international standard MPEG-7; a comprehensive comparison has been made in Xu (2006).
2. Motion-Feature-Based Video Retrieval
– Motion information can be based on the trochoid/trajectory of the moving objects in the scene (Jeannin, 2000).
– Motion activity is a motion descriptor for the whole frame. It can be measured in an MPEG stream by the magnitude of motion vectors (Divakaran, 2000).
3. Object-Based Retrieval
– Even more detailed descriptions of object-based retrieval can be found in Zhang (2004b, 2005b).
– Segmentation plays an important role in object-based retrieval. New advances in image segmentation for CBIR can be found in Zhang (2005c).
– One way to incorporate human knowledge and experience into the loop of image retrieval is to annotate images before querying. One such example of automatic annotation can be found in Xu (2007a).
– Classification of images could help to treat the problem caused by retrieval in large databases. One novel model for image classification can be found in Xu (2007b).
– Recent advances in retrieval at the object level, the scene level, and even higher levels can be found in Zhang (2005b, 2007e, 2015e).
– One study on CNN-based matching for retrieval can be found in Zhou (2016).
4. Video Analysis and Retrieval
– One method of automated semantic structure reconstruction and representation generation for broadcast news can be found in Huang (1999).
– The use of an attention model for video summarization can be found in Ma (2002).
– Different video highlights for various sport matches have been generated. See Peker (2002) for an example.
– One approach using a parametric model for video analysis is described in Chen (2008).
– Recent research works in semantic-based visual information retrieval have advanced this field, and some examples can be found in Zhang (2007d).
– Some more details on the hierarchical organization of home video can also be found in Zhang (2015b).

8 Spatial–Temporal Behavior Understanding
An important task of image understanding is to process the images obtained from a scene in order to express their meaning and to guide action. For this purpose, it is necessary to determine which objects are in the scene and how their positions, attitudes, moving speeds, and spatial relationships change over time. In short, the goal is to grasp the actions of objects in space and time, to determine the purpose of those actions, and thus to understand the semantics of the information they convey.

Image-/video-based automatic understanding of object (human and/or organism) behavior is a very challenging research topic. It includes the acquisition of objective information from acquired image sequences, the processing and analysis of the relevant visual content (representation and description), as well as the interpretation of the image/video information. In addition, on the basis of the results obtained above, it is required to learn and recognize the behavior of objects in the scene. This work covers a great span and includes a number of research topics. Recently, action detection and recognition have attracted a lot of attention, and the related research has made significant progress. Compared to low-level image processing and middle-level image analysis, high-level behavior identification and interpretation (associated with semantics and intelligence) have only just started, with few research results. In fact, many definitions of the concepts are not yet very clear, and many techniques continue to evolve and be updated.

The sections of this chapter are arranged as follows:
Section 8.1 introduces and overviews the definitions, developments, and different layers of spatial–temporal technology.
Section 8.2 introduces the detection of key points (space–time points of interest) that reflect the concentration and change of motion information in the space–time domain.
Section 8.3 discusses the dynamic trajectories and activity paths formed by connecting points of interest. The learning and analysis of dynamic trajectories and activity paths help to grasp the state of the scene in order to further characterize the scene properties.
Section 8.4 describes several kinds of techniques for action classification and recognition, which are still the subject of ongoing research.
Section 8.5 describes the classification of techniques for modeling and recognizing actions and activities, as well as a variety of classes of typical methods.

8.1 Spatial–Temporal Technology
Spatial–temporal technologies are aimed at understanding spatial–temporal behavior, which is a relatively new goal for the research community. Many of the present works start at different levels; some general situations are described below.


Figure 8.1: Statistics of the numbers of publications for spatial–temporal technology in the last 12 years.

8.1.1 New Domain
The annual survey series of yearly bibliographies on image engineering, mentioned in Chapter 1, started in 1996 (for the publications of 1995) and has been carried out for 22 consecutive years (Zhang, 2017). When the series entered its second decade (with the literature statistics of 2005), following the appearance of some new hot spots in image engineering research and applications, a new subcategory (C5), spatial–temporal technology (including 3-D motion analysis, gesture and posture detection, object tracking, and behavior judgment and understanding), was added to the image-understanding category (C) (Zhang, 2006). The emphasis here is the comprehensive utilization of the variety of information possessed by images/video in order to interpret the dynamics of the scene and the objects inside it. In the past 12 years, the number of publications belonging to subcategory C5 in the annual survey series has reached a total of 168. Their distribution over the years is shown by the bars in Figure 8.1, in which a third-order polynomial curve fitted to the number of publications of each year is also drawn to show the trend. Overall, this is still a relatively new field of research, so its development is not yet very stable.

8.1.2 Multiple Layers
Currently, the research targets of spatial–temporal technology are mainly moving people or things and the changes of objects (particularly human beings) in a scene. According to the abstraction levels of representation and description, multiple layers can be distinguished from bottom to top:
(1) Action primitive: It refers to the atomic building unit of an action, and generally corresponds to the motion information of the scene in a short interval of time.
(2) Action: A collection (an ordered combination) composed of a series of action primitives produced by a subject/initiator, which has a specific meaning. Generally, an action represents a motion pattern of one person and lasts only a few seconds. Human actions often result in changes of body posture.
(3) Activity: It refers to a series of actions produced by a subject/initiator. These actions are combined (mainly emphasizing the logical combination) to complete a job or to reach a certain goal. An activity is a relatively large-scale motion and generally depends on the environment and on human interactions. An activity usually represents complex actions of more than one person (with possible interaction) and often lasts for a long period of time.
(4) Event: It refers to certain activities occurring in special circumstances (particular position, time, environment, etc.). Usually, the activity is performed by multiple subjects/initiators (group activity) and/or interacts with the external world. The detection of specific events is often associated with abnormal activity.
(5) Behavior: It emphasizes that the subject/initiator (mainly a human being), driven by intention, changes actions, performs sustained activities, and describes events in a specific environment/context.

In the following, the sport of table tennis is taken as an example to give some typical pictures at all of the above layers, as shown in Figure 8.2. A player's positioning, swing, and so on can be seen as typical action primitives, as shown in Figure 8.2(a). A player completing a serve (including tossing the ball, winding up, flicking the wrist, hitting, etc.) or returning the ball (including positioning, reaching out, swinging, stroking, etc.) are typical actions, as shown in Figure 8.2(b). However, the whole process in which a player walks over to the barrier and retrieves the ball is often seen as an activity.

Figure 8.2: Several pictures of a table tennis game.

Two players hitting the ball back and forth in order to win points is a typical activity scene, as shown in Figure 8.2(c). The competition between two or several sports teams is generally seen as an event, and awarding the players after the game, which leads to the ceremony shown in Figure 8.2(d), is also a typical event. After winning, a player making a fist for self-motivation can be regarded as an action, but is more often seen as a behavior of the player. In addition, when the players perform a good exchange, the audience applauding, shouting, and cheering is also attributed to the behavior of the audience, as shown in Figure 8.2(e).
It should be noted that the concepts of the last three layers are often not strictly distinguished and are used in many studies without distinction. For example, an activity may be called an event when it refers to some unusual activity (such as a dispute between two persons, or an elderly person falling during a walk); an activity may be called a behavior when the emphasis is mainly on the meaning of the activity, or on its nature (such as shoplifting actions or climbing over a wall for burglary being called theft). In the following discussion, unless special emphasis is made, the (generalized) term activity will be used uniformly to represent the last three layers.
The research in spatial–temporal technology has progressed from points (points of interest) to curves (object trajectories ≈ multiple points), to surfaces (allocation of activities ≈ compound curves), and to volumes (variation of behaviors ≈ stacks of surfaces).

8.2 Spatial–Temporal Interesting Points
The change of a scene usually comes from the motion of objects, especially accelerated motion. Accelerated motion of local structures in video images corresponds to objects with accelerated motion in the scene; they appear at locations with unconventional values in the image. It is expected that at these positions (image points) there is information about object movement in the physical world and about the forces changing object structure in the scene, so they are helpful for understanding the scene. For spatial–temporal scenes, the detection of points of interest (POI) has shown a tendency to expand from space to space–time (Laptev, 2005).

8.2.1 Detection of Spatial Points of Interest
In the image space, the image f^sp: R² → R can be modeled by its linear scale-space representation L^sp: R² × R_+ → R,

$$L^{sp}(x, y; \sigma_l^2) = g^{sp}(x, y; \sigma_l^2) \otimes f^{sp}(x, y) \tag{8.1}$$

that is, by convolving f^sp with a Gaussian kernel of variance σ_l²:

$$g^{sp}(x, y; \sigma_l^2) = \frac{1}{2\pi\sigma_l^2} \exp\left[-(x^2 + y^2)/2\sigma_l^2\right] \tag{8.2}$$

The idea behind the typical Harris detector of points of interest is to determine the spatial locations in f(x, y) with significant property changes in both the horizontal and the vertical directions. For a given observation scale σ_l², these points can be computed with the help of the second-order moment matrix, obtained by summation in a Gaussian window of variance σ_i²:

$$\mu^{sp}(\cdot; \sigma_l^2, \sigma_i^2) = g^{sp}(\cdot; \sigma_i^2) \otimes \left\{[\nabla L(\cdot; \sigma_l^2)][\nabla L(\cdot; \sigma_l^2)]^T\right\} = g^{sp}(\cdot; \sigma_i^2) \otimes \begin{bmatrix} (L_x^{sp})^2 & L_x^{sp} L_y^{sp} \\ L_x^{sp} L_y^{sp} & (L_y^{sp})^2 \end{bmatrix} \tag{8.3}$$

where L_x^sp and L_y^sp are the Gaussian derivatives L_x^sp = ∂_x[g^sp(·; σ_l²) ⊗ f^sp(·)] and L_y^sp = ∂_y[g^sp(·; σ_l²) ⊗ f^sp(·)] at the local scale σ_l². This second-moment descriptor can be viewed as the covariance matrix of the orientation distribution in a local neighborhood of a 2-D image. Therefore, the eigenvalues λ_1 and λ_2 (λ_1 ≤ λ_2) of μ^sp constitute descriptors of how f^sp changes along the two directions. If both λ_1 and λ_2 are very large, this indicates a point of interest. To detect such points, the positive maxima of a corner function are sought:

$$H^{sp} = \det(\mu^{sp}) - k \cdot \mathrm{trace}^2(\mu^{sp}) = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2 \tag{8.4}$$

At a point of interest, the eigenvalue ratio a = λ_2/λ_1 should be large. According to eq. (8.4), for positive local maxima of H^sp, a should satisfy k ≤ a/(1 + a)². So, if k = 0.25 is chosen, the maxima of H^sp will correspond to ideal isotropic points of interest (in this case a = 1, namely λ_1 = λ_2). A smaller value of k (corresponding to a larger value of a) allows the detection of sharper points of interest. A commonly used value in the literature is k = 0.04, corresponding to the detection of interest points with a < 23.
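A minimal sketch of this spatial detector (eqs. (8.1)–(8.4)) using Gaussian derivative filters is given below; the threshold on H and the 3 × 3 non-maximum suppression window are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_points(f, sigma_l=1.0, sigma_i=2.0, k=0.04, rel_thresh=0.01):
    """Spatial Harris points of interest following eqs. (8.1)-(8.4).
    sigma_l (local) and sigma_i (integration) are Gaussian standard deviations."""
    f = f.astype(float)
    # Gaussian derivatives of the scale-space representation (eq. (8.1))
    Lx = gaussian_filter(f, sigma_l, order=(0, 1))
    Ly = gaussian_filter(f, sigma_l, order=(1, 0))
    # Second-moment matrix entries smoothed at the integration scale (eq. (8.3))
    Axx = gaussian_filter(Lx * Lx, sigma_i)
    Axy = gaussian_filter(Lx * Ly, sigma_i)
    Ayy = gaussian_filter(Ly * Ly, sigma_i)
    # Corner function of eq. (8.4)
    H = (Axx * Ayy - Axy ** 2) - k * (Axx + Ayy) ** 2
    # Positive local maxima of H
    local_max = (H == maximum_filter(H, size=3)) & (H > rel_thresh * H.max())
    return np.argwhere(local_max)   # (row, col) coordinates of the interest points
```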

8.2.2 Detection of Spatial–Temporal Points of Interest
The above computation of points of interest in the spatial domain can be extended to the spatial–temporal domain, to detect spatial points of interest at particular locations in time. This trend started around ten years ago (Laptev, 2005). The detection of spatial–temporal points of interest is essential for extracting low-level motion features, and no background modeling is required.


The detection of points of interest is to find positions where both the spatial and the temporal changes are large. The detection generally uses techniques for extracting low-level motion features and does not require background modeling. One method first convolves the given video with a 3-D Gaussian kernel at different spatial and temporal scales. Then, the spatial–temporal gradients at each layer of the scale-space representation are calculated and collected from the neighborhoods of various points to produce a stable estimate of the spatial–temporal second-moment matrix. The local features can finally be extracted from this matrix.

Example 8.1 Examples of spatial–temporal points of interest
Figure 8.3 shows a fragment of a table tennis player swinging and batting. Several spatial–temporal points of interest are extracted from this sequence. The density of the spatial–temporal points of interest along the time axis is related to the frequency of the actions, and their spatial positions correspond to the trajectory of the motion and the range of the action. ◻∘

For modeling a spatial–temporal image sequence f, f: R² × R → R, it is convolved with an anisotropic Gaussian kernel (with uncorrelated spatial variance σ_l² and temporal variance τ_l²) to form a linear scale space L, L: R² × R → R,

$$L(\cdot; \sigma_l^2, \tau_l^2) = g(\cdot; \sigma_l^2, \tau_l^2) \otimes f(\cdot) \tag{8.5}$$

where the Gaussian kernel for the separation of space and time is

$$g(x, y, t; \sigma_l^2, \tau_l^2) = \frac{1}{\sqrt{(2\pi)^3 \sigma_l^4 \tau_l^2}} \exp\left[-\frac{x^2 + y^2}{2\sigma_l^2} - \frac{t^2}{2\tau_l^2}\right] \tag{8.6}$$

Using a separate scale parameter for the time domain is critical, because the temporal and spatial extents of events are generally independent. In addition, the events detected by the point-of-interest operator depend on the observation scales in both space and time, so the scale parameters σ_l² and τ_l² require separate treatment.


Figure 8.3: An example of spatial–temporal points of interest.


Similar to the spatial domain, the spatial–temporal second-order moment matrix is a 3 × 3 matrix composed of spatial and temporal derivatives convolved with the Gaussian function g(·; σ_i², τ_i²):

$$\mu = g(\cdot; \sigma_i^2, \tau_i^2) \otimes \begin{bmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{bmatrix} \tag{8.7}$$

(8.8)

= 𝜕t (g ⊗ f )

To detect a point of interest, the search for , with significant eigenvalues +1 , +2 , and +3 is conducted. This can be accomplished by extending eq. (8.4), through a combination of rank expansion and determinant of ,, to spatial–temporal domain: H = det(,) – k ∙ trace3 (,) = +1 +2 +3 – k(+1 + +2 + +3 )3

(8.9)

To prove that the positive local extremes of H corresponds to points having large +1 , +2 , and +3 (+1 ≤ +2 ≤ +3 ), it is required to define the ratio a = +2 /+1 and b = +3 /+1 and rewrite H as H = +31 [ab – k(1 + a + b)3 ]

(8.10)

Since H ≥ 0, so there are k ≤ ab/(1 + a + b)3 , and k will have the maximum possible value of k = 1/27 at a = b = 1. To a large value of k, the local extreme values of H will correspond to points with both great changes along spatial and temporal directions. Especially, if a and b are both set as a maximum of 23 as in space, then the value of k in the eq. (8.9) will be k ≈ 0.005. Therefore, the spatial–temporal points of interest in f can be obtained by detecting the positive local spatial–temporal maxima in H.

8.3 Dynamic Trajectory Learning and Analysis Dynamic trajectory learning and analysis attempt to provide certainty for monitoring the state of the scene by understanding and characterization of the change of each target position and the moving results (Morris, 2008). A flowchart for dynamic trajectory learning and analysis in video is shown in Figure 8.4. First, the target is detected (e.g., the pedestrian detection from a moving car, see Jia (2007) and tracked; then, the scene

8.3 Dynamic Trajectory Learning and Analysis


Figure 8.4: Flowchart for dynamic trajectory learning and analysis.

model is automatically constructed with the obtained trajectories; finally, the model is used to monitor the situation and provide labels for the activities. In the scene modeling, the points of interest (POI) are first determined within the image area and regarded as the locations where events happen; then, in the learning step, the activities are defined along activity paths (AP). A path characterizes how a target moves between the points of interest. Models constructed in this way can be called POI/AP models. The main tasks in POI/AP learning include:
(1) Activity learning: It is conducted by comparing trajectories. Though the lengths of trajectories may differ, the key issue is to maintain the intuitive perception of similarity.
(2) Adaptation: It studies the techniques for managing the POI/AP model. These techniques must be able to adapt to new activities online, to remove discontinued activities, and to validate the model.
(3) Feature selection: It is the determination of the correct level of dynamics representation for a specific task. For example, using only the spatial information can verify which road a car has passed, but to determine the cause of an accident, the speed information of the car is also required.

8.3.1 Automatic Scene Modeling
Automatically modeling a scene by means of dynamic trajectories includes the following three tasks (Makris, 2005):

8.3.1.1 Object Tracking
The identity of each observable object needs to be maintained in every frame. For example, tracking an object over T frames of video generates a series of inferred tracking states:

$$S_T = \{s_1, s_2, \cdots, s_T\} \tag{8.11}$$

where s_t may describe object characteristics such as location, speed, appearance, shape, and the like. This trajectory information constitutes the cornerstone


for further analysis. Through careful analysis of this information, it is possible to identify and understand different activities.

8.3.1.2 Detection of Points of Interest
The first task in image scene modeling is to figure out the regions of interest. In a topographic map for object tracking, these regions correspond to the nodes of a graph. The two types of nodes mostly considered are in/out regions and stop regions. In a classroom, for example, the former correspond to the doors of the classroom while the latter corresponds to the podium. In/out regions are the locations where objects enter or leave the field of view, that is, where tracked objects appear or disappear. These regions are often modeled by a 2-D Gaussian mixture model (GMM), Z ~ Σ_{i=1}^{W} w_i N(μ_i, Σ_i), with W components, which can be estimated with the EM algorithm. At an entry point, the data comprise the locations given by the first tracking states, while at a leaving point, the data comprise the locations given by the last tracking states. True regions can be distinguished from tracking noise by using a density criterion; the mixture density of state i is defined as

$$d_i = \frac{w_i}{\pi\sqrt{|\Sigma_i|}} > L_d \tag{8.12}$$

w 0√|C|

(8.13)

indicates the average density of signal cluster. Here, 0 < w < 1 is the weight defined by user, C is the covariance matrix of all the points concentrated in the region dataset. The compact mixing indicates the correct region, while the loose mixing indicates the tracking noise resulting from disruptions of tracking. Stop region comes from the landmark points in scene, which is the locations where objects tends to be fixed for some periods of time. The stop region can be determined by two methods with different criteria: (1) The speed of the tracking point is below a certain predetermined threshold value in this region; (2) All the tracking points are at least maintained inside a limited distance ring at a certain period of time. By defining a radius and a time constant, the second method can ensure that the object is indeed maintained in a specific range, while the first method may include objects with very low speed of the movement. For the analysis of activities, not only the locations must be accurately determined but also the time spent in each stop region should be grasped.

8.3 Dynamic Trajectory Learning and Analysis


Figure 8.5: Scheme for trajectory and path detection.

8.3.1.3 Active Path Finding
To understand behavior, the activity paths (AP) need to be determined. They can be obtained by filtering out false alarms and interrupted-tracking noise from the training set with the POI, keeping only the paths that start after entering the region and end before leaving the region. An activity of interest should be defined between two end points (points of interest). To distinguish between time-varying moving objects (such as a person walking or running along a pedestrian walkway), dynamic information varying with time needs to be added to the path. Figure 8.5 shows the three basic structures of path-finding algorithms; their main differences include the type of input, the motion vector, the trajectory (or video clips), and the way motion is abstracted. In Figure 8.5(a), the input is a single trajectory at time t; each point in the path is implicitly ordered. In Figure 8.5(b), a full trajectory is used as the input of the learning algorithm to directly establish an output path. In Figure 8.5(c), the decomposition of the path following the video timing is depicted: video clips are broken down into a set of motion words describing the activities; in other words, a video clip is annotated with the labels of certain activities according to the occurrences of motion words.

8.3.2 Active Path Learning
As the active path provides the information of object motion, an original trajectory can be represented by a sequence of dynamic measurements. For example, a common representation for a trajectory is a motion sequence

$$G_T = \{g_1, g_2, \cdots, g_T\} \tag{8.14}$$

where the motion vector is

$$g_t = [x_t, y_t, v_t^x, v_t^y, a_t^x, a_t^y]^T \tag{8.15}$$

It represents the dynamic parameters obtained from object tracking at time t, that is, the position [x, y]^T, the speed [v^x, v^y]^T, and the acceleration [a^x, a^y]^T.


Figure 8.6: Steps for path learning.

It is possible to learn the AP in an unsupervised way, using only the trajectories. The basic process, shown in Figure 8.6, includes three steps. Although the figure shows three separate sequential steps, they are often combined. In the following, some detailed explanations of the three steps are provided.

8.3.2.1 Preprocessing of Trajectory
Most of the work in path learning is to obtain trajectories suitable for clustering. The main difficulty comes from the time-varying characteristics of tracking, which lead to inconsistent trajectory lengths. Steps must be taken to guarantee a meaningful comparison between inputs of various sizes. In addition, the trajectory representation should preserve in the clustering space the similarity of the original trajectories. Preprocessing of trajectories consists of two tasks: normalization, to ensure that all paths have the same length, and dimension reduction, to map the paths into a new low-dimensional space in order to perform more robust clustering.
(1) The purpose of normalization is to ensure that all trajectories have the same length T_a. Two simple techniques are zero padding and extension. Zero padding adds zero entries to the end of a short trajectory. Extension prolongs the last part of the original trajectory until the required length is reached. Both are likely to expand the trajectory space to a very large one. In addition to examining the training set to determine the trajectory length T_a, it is also possible to make use of a priori knowledge for resampling and smoothing. Resampling combined with interpolation can ensure that all trajectories have the same length T_a. Smoothing can be used to eliminate noise, and the smoothed trajectory can be interpolated and sampled to a fixed length.
(2) Dimensionality reduction maps the trajectories into a new low-dimensional space, so that more robust clustering methods can be used. This can be accomplished by assuming a trajectory model and determining the parameters that best describe it. Commonly used techniques include vector quantization, polynomial fitting, multiresolution decomposition, hidden Markov models, subspace methods, spectral methods, and kernel methods. Vector quantization can be achieved by limiting the number of unique trajectories. If the dynamics of the trajectory are ignored and only the spatial coordinates are used, the trajectory can be seen as a simple 2-D curve and can be approximated by a minimum-mean-square polynomial of order m (each w_k is a weight coefficient):

$$x(t) = \sum_{k=0}^{m} w_k t^k \tag{8.16}$$
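Two of the preprocessing options mentioned above, fixed-length resampling and the polynomial approximation of eq. (8.16), can be sketched as follows; the length T_a, the order m, and the normalized time variable are assumptions.

```python
import numpy as np

def resample_trajectory(points, T_a=50):
    """Resample a trajectory (N, 2) to a fixed length T_a by linear interpolation
    over normalized position along the sequence, so trajectories of different
    lengths become directly comparable."""
    points = np.asarray(points, dtype=float)
    src = np.linspace(0.0, 1.0, len(points))
    dst = np.linspace(0.0, 1.0, T_a)
    return np.stack([np.interp(dst, src, points[:, d]) for d in range(points.shape[1])], axis=1)

def polynomial_fit(points, m=3):
    """Least-squares polynomial approximation of eq. (8.16), fitted separately to the
    x and y coordinates against a normalized time variable; returns the coefficients."""
    points = np.asarray(points, dtype=float)
    t = np.linspace(0.0, 1.0, len(points))
    wx = np.polyfit(t, points[:, 0], m)   # weight coefficients for x(t)
    wy = np.polyfit(t, points[:, 1], m)   # weight coefficients for y(t)
    return wx, wy
```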

In the spectral method, a similarity matrix S can be built for the training set, where each element s_ij represents the similarity between trajectory i and trajectory j. In addition, a Laplacian matrix L is built:

$$L = D^{-1/2} S D^{-1/2} \tag{8.17}$$

where D is a diagonal matrix whose i-th diagonal element is the sum of the i-th row of S. By decomposing L, the largest K eigenvalues can be determined. If the corresponding eigenvectors are put into a new matrix, each row of this matrix corresponds to a trajectory transformed into the spectral space, and the spectral trajectories can then be clustered with the k-means method.

8.3.2.2 Trajectory Clustering
Clustering is a commonly used machine-learning technique for determining the structure of unlabeled data. While observing the scene, motion trajectories can be collected and grouped into similar categories. In order to produce meaningful clusters, the trajectory-clustering process needs to consider three issues: the definition of a distance (similarity) measure, the strategy for cluster updating, and cluster validation.
1. Distance/similarity measure: Clustering depends on the definition of a distance (similarity) measure. As mentioned above, a major problem in trajectory clustering is that different trajectories generated by the same activity may have different lengths. To solve this problem, either preprocessing methods can be used or a size-independent distance measure can be defined. If two trajectories G_i and G_j have the same length,

$$d_E(G_i, G_j) = \sqrt{(G_i - G_j)^T (G_i - G_j)} \tag{8.18}$$

If the two trajectories G_i and G_j have different lengths m and n (m > n), the Euclidean distance can be modified so that it is insensitive to the dimension mismatch: the two trajectory vectors are compared element by element, and the last point g_{j,n} is used to accumulate the remaining distortion:

$$d_{ij}^{(c)} = \frac{1}{m}\left\{\sum_{k=1}^{n} d_E(g_{i,k}, g_{j,k}) + \sum_{k=1}^{m-n} d_E(g_{i,n+k}, g_{j,n})\right\} \tag{8.19}$$
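A minimal sketch of the two distances of eqs. (8.18) and (8.19), treating each trajectory as an array of per-time motion vectors:

```python
import numpy as np

def d_E(G_i, G_j):
    """Euclidean distance of eq. (8.18) between two equal-length trajectories
    given as (T, d) arrays (flattened before comparison)."""
    diff = np.asarray(G_i, float).ravel() - np.asarray(G_j, float).ravel()
    return np.sqrt(diff @ diff)

def d_c(G_i, G_j):
    """Cumulative distance of eq. (8.19) for trajectories of different lengths;
    the shorter trajectory is conceptually padded with its last point."""
    G_i, G_j = np.asarray(G_i, float), np.asarray(G_j, float)
    if len(G_i) < len(G_j):            # ensure G_i is the longer trajectory (length m)
        G_i, G_j = G_j, G_i
    m, n = len(G_i), len(G_j)
    total = sum(np.linalg.norm(G_i[k] - G_j[k]) for k in range(n))
    total += sum(np.linalg.norm(G_i[n + k] - G_j[n - 1]) for k in range(m - n))
    return total / m
```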

The Euclidean distance is relatively simple, but it performs badly when there is a time offset, because only aligned sequences can match. In that case, the Hausdorff distance can be considered. There is also a distance measure that does not


rely on the complete trajectory (outliers are not considered). Suppose the lengths of the trajectories G_i = {g_{i,k}} and G_j = {g_{j,l}} are T_i and T_j, respectively; then

$$D_o(G_i, G_j) = \frac{1}{T_i}\sum_{k=1}^{T_i} d_o(g_{i,k}, G_j) \tag{8.20}$$

where

$$d_o(g_{i,k}, G_j) = \min_{l \in \{\lfloor(1-\delta)k\rfloor, \cdots, \lceil(1+\delta)k\rceil\}}\left[\frac{d_E(g_{i,k}, g_{j,l})}{Z_l}\right] \tag{8.21}$$

where Z_l is a normalization constant given by the variance at point l. D_o(G_i, G_j) is used to compare a trajectory with the existing clusters; if two trajectories are to be compared, Z_l = 1 can be used. The distance measure so defined is the average normalized distance from any point to its best matching point, where the best match is sought within a sliding window centered at point l with width 2δ.
2. Clustering process and validation: The preprocessed trajectories can be grouped with unsupervised learning techniques. The trajectory space is broken down into perceptually similar clusters (such as roads). There are several approaches to clustering learning, such as iterative optimization, online adaptation, hierarchical approaches, neural networks, and co-occurrence decomposition. The paths learned with a clustering algorithm need further validation, because the real number of categories is not known. Most clustering algorithms require an initial choice of the desired number of classes K, but this is often incorrect. To this end, the clustering can be conducted for different values of K, and the K corresponding to the best result is taken as the real number of clusters. A tightness and separation criterion (TSC) can be used as the judgment criterion, which compares the distances between trajectories in the same cluster with the distances between trajectories in different clusters. Given a training set D_T = {G_1, ..., G_M},

$$TSC(K) = \frac{1}{M}\,\frac{\sum_{j=1}^{K}\sum_{i=1}^{M} f_{ij}^2\, d_E^2(G_i, c_j)}{\min_{ij} d_E^2(c_i, c_j)} \tag{8.22}$$

where f_ij is the fuzzy membership of trajectory G_i in cluster C_j (whose prototype is represented by c_j).

8.3.2.3 Path Modeling
After trajectory clustering, a graph model can be built from the obtained paths for efficient reasoning. A path model is a compact representation of a cluster. Path modeling can be conducted in two ways, as shown in Figure 8.7. One way considers the complete path, using the cluster center and an envelope (to indicate the extent of the path), as shown in Figure 8.7(a). The path from end to end has not only the average center line but also envelopes on both sides indicating the path range.


Figure 8.7: Two ways for path modeling.

Along the path, there may be some intermediate states providing the measurement sequence. The other way decomposes the whole path into a number of subpaths (using a tree structure), as in Figure 8.7(b). The path is represented as a tree of subpaths, and the probability of the predicted path is marked on the arc pointing from the current node to the next node.

8.3.3 Automatic Activity Analysis
Once the scene model is established, the activities and behaviors of the objects can be analyzed. For instance, a basic function of surveillance video is to validate events of interest. In general, whether an event is interesting can only be determined under specific circumstances. For example, a parking management system will focus on whether there is space to park, while in a smart meeting room the concern would be the communication among the participants. In addition to identifying a particular behavior, all atypical events need to be checked. By observing a scene for a long time, the system can analyze a series of activities and learn what the events of interest are. Some typical activity analysis examples are listed below (a small illustrative sketch of path classification and abnormality detection is given at the end of this subsection).
(1) Virtual fencing: Any monitoring system has a monitoring range. By setting up early warning at the border, an equivalent virtual fence is established for a certain range, and an intrusion triggers the system to start an analysis procedure. For example, high-resolution pan–tilt–zoom (PTZ) cameras can be used to capture the details of the intrusion and to compile statistics on the number of intrusions.
(2) Speed profiling: Virtual fencing uses only position information, while tracking can also provide dynamic information for speed-based early warning, such as speeding or road blockage.
(3) Path classification: Speed profiling uses only the current tracking data, while the activity path (AP) can also provide historical information to predict and interpret the incoming data. The behavior of an emerging target can be described by means of a maximum a posteriori (MAP) path:

$$L^* = \arg\max_k p(l_k | G) = \arg\max_k p(G | l_k)\,p(l_k) \tag{8.23}$$

This will help to determine which activity can best explain the new path data. Because the path prior distribution p(l_k) can be estimated from the training set, the problem reduces to maximum likelihood estimation with an HMM.
(4) Abnormality detection: Detecting abnormal events is often an important task of a monitoring system. Since the activity paths indicate the characteristics of typical activities, an abnormality is detected if a new path differs sufficiently from the normal ones. The abnormal mode can be detected by intelligent thresholding:

$$p(l^* | G) < L_l \tag{8.24}$$

that is, an abnormality is declared when even the path l* that most resembles the new trajectory G scores below the threshold L_l.
(5) Online activity analysis: Enabling online analysis, identification, and evaluation of activities is more powerful and useful than merely describing an activity with its path. An online system needs to quickly infer the current behavior from incomplete data (often on the basis of a graph model). Two examples are path prediction and track anomaly detection. Path prediction uses the tracking data collected so far to predict future behavior and refines the prediction as more data are collected. Forecasting activities from an incomplete trajectory can be expressed as

$$\hat{L} = \arg\max_j p(l_j | W_t G_{t+k}) \tag{8.25}$$

wherein W_t represents a window function and G_{t+k} is the trajectory up to the current time together with k predicted future tracking states. Track anomaly detection aims to detect an abnormal event as soon as it occurs, instead of flagging the entire track as an exception afterward. This can be achieved by using W_t G_{t+k} instead of G in eq. (8.24). The window function W_t is not necessarily the same as the one used in prediction, and the threshold may need to be adjusted according to the amount of data.
(6) Object-interaction characterization: Even higher-level analysis is expected to explain the interactions among different objects. As with abnormal events, strictly defining object interaction is very difficult; under different circumstances, different objects have different types of interactions. Take a car crash as an example. Every car has its spatial dimensions, which can be regarded as its personal space. When the car is moving, a minimum safety distance (minimum safety zone) has to be added around the personal space, so this spatial–temporal personal space changes with the movement: the faster the speed, the larger the minimum safety distance (particularly in the driving direction). A schematic diagram is shown in Figure 8.8, where the personal space is represented by a circle, while the safety zone changes with the speed (both magnitude and direction). If the safety zones of two cars


Figure 8.8: Use paths for collision assessment.

intersect, there is a possibility of collision, and this assessment can help plan the route.
Finally, it should be noted that simple activities can be analyzed relying only on speed and object position, but more complex activities require more measurements, such as adding the path curvature to detect an erratic walk. To provide more comprehensive coverage of activities and behaviors, multiple camera networks are often required. Activity trajectories may also come from the interconnected parts of an object (such as a human body), in which case the activity definitions should be made with respect to a set of trajectories.
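The path classification and abnormality test of eqs. (8.23) and (8.24), referred to above, can be sketched as follows; the score interface of the per-path models (e.g., per-path HMMs returning a log-likelihood) is an assumption.

```python
import numpy as np

def classify_path(G, path_models, priors, log_thresh):
    """MAP path classification of eq. (8.23) and abnormality test of eq. (8.24).

    path_models: dict mapping path label -> model with a score(G) method returning
                 log p(G | l_k) (e.g., a per-path HMM); this interface is assumed.
    priors:      dict mapping path label -> prior probability p(l_k).
    Returns (best_label, is_abnormal)."""
    log_post = {k: m.score(G) + np.log(priors[k]) for k, m in path_models.items()}
    best = max(log_post, key=log_post.get)
    # eq. (8.24): even the best-matching path explains the trajectory poorly
    is_abnormal = log_post[best] < log_thresh
    return best, is_abnormal
```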

8.4 Action Classification and Recognition
Vision-based human action recognition is the process of labeling an image sequence (video) with action labels. Once a representation of the observed image sequence or video has been obtained, this process can be turned into a classification problem.

8.4.1 Action Classification
Many techniques for action classification have been proposed (Poppe, 2010).

8.4.1.1 Direct Classification
Direct methods do not pay special attention to the time domain, even when video is used. The related methods put all observed information (from all frames) into a single representation, or recognize and classify actions separately for each frame. In many cases, since a high-dimensional image representation is required, a large amount of computation is inevitable. In addition, the representation may also include noisy features. Therefore, a compact, robust feature representation in a low-dimensional space is required for the classification. Either linear or nonlinear dimensionality reduction methods can be used; for example, PCA is a typical linear approach, and locally linear embedding (LLE) is a typical nonlinear method.


The classifiers used in direct classification can also differ. Discriminative classifiers are concerned with how to distinguish different categories rather than with modeling each category; the SVM is typical. In the boosting framework, a series of weak classifiers (each often using only a 1-D representation) is used to build a strong classifier. Besides AdaBoost, LPBoost can also be used; it obtains sparse coefficients and converges quickly. 8.4.1.2 Time Status Model Generative models try to learn the joint distribution between observations and actions, modeling each action class (with all its variations). Discriminative models try to learn the probability of the action classes given the observations; they are not concerned with modeling each category but with the differences between classes. The most typical generative model is the hidden Markov model (HMM), in which the hidden states correspond to the steps of an action. The HMM models state transition probabilities and observation probabilities. There are two independence assumptions: one is that a state transition depends only on the previous state; the other is that an observation depends only on the current state. Variants of the HMM include the maximum entropy Markov model (MEMM), the state-decomposed hierarchical hidden Markov model (FS-HHMM), and the hierarchical variable transition hidden Markov model (HVT-HMM). On the other hand, the discriminative group tries to model the conditional distribution given the observations and to combine multiple observations to distinguish different classes of action. Such models are advantageous for distinguishing related actions. The conditional random field (CRF) is a typical discriminative model; its improvements include the factorial CRF (FCRF), extended CRFs, and so on. 8.4.1.3 Action Detection Methods based on action detection explicitly model neither the object representation in the image nor the action itself. They connect the observed sequence with labeled video sequences to directly detect (already defined) actions. For example, video clips can be described as bags of words coded at different time scales, where each word corresponds to the gradient orientations of a local patch. A patch with low variation over time can be ignored, so that the representation focuses on the motion regions. When the motion is periodic (such as a person walking or running), the action is cyclic, that is, a cyclical action. In this case, temporal segmentation can be performed by analyzing a self-similarity matrix. In addition, tags can be attached to the motion initiator, and the self-similarity matrix can be built by tracking the tags and using an affine function of distance. The self-similarity matrix can further undergo a frequency transform; the peaks in the spectrum then correspond to the frequency of the movement (e.g., to distinguish a walking person from a running person, the gait cycle can be calculated). The matrix structure can be analyzed to determine the type of action.
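A minimal sketch of this periodicity analysis, assuming that a per-frame feature vector is already available (here a synthetic 2-D feature plays the role of, e.g., tracked tag positions): the self-similarity matrix is built from pairwise distances, and the dominant movement frequency is read from the Fourier spectrum of one of its rows.

import numpy as np

def self_similarity_matrix(features):
    # Pairwise distances between per-frame feature vectors; a cyclic action
    # produces a periodic structure in this matrix.
    diff = features[:, None, :] - features[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def dominant_frequency(ssm, fps):
    # Read the movement frequency from the spectrum of one row of the matrix
    # (a simple stand-in for the "frequency transform" step described above).
    signal = ssm[0] - ssm[0].mean()
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    return freqs[int(np.argmax(spectrum))]

# Hypothetical usage: a feature repeating every 30 frames, sampled at 25 fps
t = np.arange(120)
feat = np.stack([np.sin(2 * np.pi * t / 30), np.cos(2 * np.pi * t / 30)], axis=1)
print(dominant_frequency(self_similarity_matrix(feat), fps=25))   # about 0.83 Hz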


The main methods for action representation and description can be classified into two groups: appearance-based and body-model-based. In appearance-based approaches, descriptions of the foreground, contour, optical flow, etc. are used directly. In body-model-based methods, a body model is used to represent the structural features of the human body; for example, actions are represented by sequences of joints. Regardless of the kind of method used, detecting the human body as well as detecting and tracking some important parts of the human body (such as the head, hands, and feet) plays an important role. Example 8.2 Action recognition database Some sample pictures of the actions in the Weizmann action recognition database are shown in Figure 8.9 (Blank, 2005). From top to bottom, each row provides the pictures of one action: jumping jack (jack), lateral movement (side), bending (bend), walking (walk), running (run), waving one hand (wave1), waving both hands (wave2), forward skip (skip), jumping with both feet (jump), and jumping in place (pjump). ◻∘

8.4.2 Action Recognition The representation and recognition of actions and activities is not a very new domain, but it is still immature. Many methods for action recognition have been developed; they depend on the purpose of the research and on the application domain (Moeslund, 2006). For monitoring systems, human activities and human interactions are considered. For scene interpretation, the representation can be independent of the objects inducing the motion.

8.4.2.1 Holistic Recognition Holistic recognition puts the emphasis on identifying the whole body or each individual part of a single person. For example, information based on the structure and dynamics of the whole body may be used to identify a walking person or a walking gait. Most techniques are based on the human silhouette or outline and make little distinction among the various parts of the body. For example, there is a body-based identification technique that uses the human silhouette, samples its outline uniformly, and then performs a PCA decomposition. For the calculation of correlation in the spatial–temporal domain, the respective trajectories can be compared in the eigen-space. On the other hand, the use of dynamic information can not only recognize identity but also determine what this person is doing. Action recognition based on body parts makes use of the position and dynamic information of the body parts.
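A sketch of the outline-sampling step mentioned above, under a simplifying assumption: instead of arc-length resampling of a traced contour, the silhouette outline is sampled at uniformly spaced angles around the centroid (a radial signature). The fixed-length vectors obtained per frame can then be stacked over time and compared, for example in a PCA eigen-space as described. The mask used below is synthetic.

import numpy as np

def radial_contour_signature(mask, n_samples=64):
    # Sample the silhouette outline at n_samples uniformly spaced angles around
    # the centroid: for each angular bin keep the farthest foreground point.
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    angles = np.arctan2(ys - cy, xs - cx)
    radii = np.hypot(ys - cy, xs - cx)
    bins = np.linspace(-np.pi, np.pi, n_samples + 1)
    signature = np.zeros(n_samples)
    for i in range(n_samples):
        in_bin = (angles >= bins[i]) & (angles < bins[i + 1])
        if in_bin.any():
            signature[i] = radii[in_bin].max()
    return signature / max(signature.max(), 1e-9)   # simple scale normalization

# Hypothetical usage: a filled disk gives a nearly constant signature
yy, xx = np.mgrid[0:64, 0:64]
mask = (yy - 32) ** 2 + (xx - 32) ** 2 < 20 ** 2
print(radial_contour_signature(mask)[:8])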


Figure 8.9: Example images in Weizmann action recognition database.

8.4.2.2 Posture Modeling Recognition of human actions is closely related to body posture estimation. Body posture can be divided into action posture and postural posture (gesture): the former corresponds to the action behavior at a certain moment, while the latter corresponds to the orientation of the human body in 3-D space.


The main methods for posture representation and description can be classified into three groups: appearance-based, body-model-based, and 3-D reconstruction-based. (1) Appearance-based methods: They do not model the physical structure directly but use color, texture, outline, and other information for body posture analysis. Since only the appearance information in 2-D images is used, it is difficult to estimate the human pose. (2) Body-model-based methods: First, a graph model or a 2-D or 3-D model of the human body is used to model the body; then the body posture is estimated by analyzing these parameterized body models. This kind of method typically has high requirements on image resolution and on the precision of object detection. (3) 3-D reconstruction-based methods: First, multiple cameras at different locations are used to obtain images of the 2-D moving objects; through the matching of corresponding points, the 3-D moving objects are reconstructed; then the camera parameters and the imaging formula are used to estimate the body posture in 3-D space. Gesture modeling can be based on spatial–temporal points of interest (see Section 8.2). If the Harris corner detector is used, the spatial–temporal points of interest obtained are concentrated in regions with sudden changes of motion. The number of such points is small and sparse, so important motion information in the video may be lost, leading to detection failure. On the other hand, dense spatial–temporal points of interest can be extracted via motion intensity (the image can be convolved with a Gaussian filter in the spatial domain and a Gabor filter in the temporal domain), in order to fully capture the changes of motion. After extracting the spatial–temporal points of interest, a descriptor is built for each point first, and then each gesture is modeled. One particular method is to first extract spatial–temporal feature points from a training sample database as low-level features, with one posture corresponding to one set of spatial–temporal feature points. Then, an unsupervised classification method is used to cluster the samples into typical postures. Finally, each typical posture category is modeled with a Gaussian mixture model estimated by the EM algorithm. A recent trend in pose estimation for natural scenes is to detect the posture in a single frame, which can overcome the problems caused by unstructured scenes viewed by a single camera. For example, on the basis of robust part detection and probabilistic combination of parts, it has already been possible to obtain a fairly good estimation of 2-D posture in complex movies. 8.4.2.3 Activity Reconstruction An action results in a change of posture. If each stationary body posture is defined as a state, then a sequence of actions (an activity) can be built by conducting a single traversal through the states corresponding to the postures, with


the help of the state-space method (also known as the probabilistic network method). Based on such a sequence of actions, body actions and postures can be recovered. Based on posture estimation, significant progress has also been achieved in the automatic reconstruction of human activity from video. The original model-based analysis-synthesis approach conducts an effective search in posture space by means of multiview video capture. Many current methods focus on capturing the overall body movement rather than on building the details very precisely. Human activity reconstruction from a single view has also made much progress based on statistical sampling techniques. Currently, more attention is paid to how to use learned models to constrain the activity-based reconstruction. Studies show that using a strong a priori model is helpful for tracking specific activities in a single view. 8.4.2.4 Inter-Activity Interactivity is relatively complex. Two main types can be distinguished: interaction between humans and the environment, and interaction among different persons. (1) Interaction between human and environment: The human is the initiator of the activity, such as picking up a book from a table or driving a car on the road. This can be referred to as single (person) activity. (2) Interaction among different persons: Several persons interact with each other. It often refers to the exchange activities or contact behaviors of two (or more) persons. It can be seen as the combination of several single (atomic) activities, synchronized in spatial–temporal space. A single activity can be described by means of probabilistic graphical models. The probabilistic graphical model is a powerful tool for modeling continuous dynamic characteristic sequences, with a relatively mature theoretical basis. Its disadvantage is that the topology of the model depends on the structural information of the activity itself, so complex interactions require a large amount of training data to learn the model topology. In order to combine a number of single activities together, the statistical relational learning (SRL) approach can be used. SRL is a machine-learning method integrating relational/logical representation, probabilistic reasoning, and data mining in order to obtain a comprehensive likelihood model of relational data. 8.4.2.5 Group Activity Many quantitative changes may cause a qualitative change. A large number of objects involved in an activity may pose new problems and require new solutions. For example, the motion analysis of object groups is mainly concentrated on crowds of people, traffic flows, and dense groups of organisms in nature. The goals of research are the representation and description of object groups, the motion feature analysis of object groups, and the boundary constraints on the object


groups. In this case, grasping the unique behavior of each individual is weakened; more attention is paid to an abstract description of the individuals for describing the activities of the entire collection. For example, some studies draw on macroscopic kinematics to explore the movement of the particle stream and establish a kinetic theory of particle flow. On this basis, semantic analysis of the aggregation, dissipation, differentiation, and combination of object groups becomes indispensable for capturing the tendency and situation of the whole scene.

Example 8.3 Counting the number of people In many public places, such as squares and stadium entrances, counting the number of people is required. Figure 8.10 shows such a scene. Although there are many people in different forms in the scene, the concern here is only the number of people (passing) within a specific range (the area surrounded by the frame) (Jia, 2009). ◻∘
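A minimal sketch of the per-frame count in such a marked area, assuming person detections (bounding boxes) are already available from earlier modules; note that counting people passing through, as in the example, additionally requires tracking so that each person is counted only once. The boxes and the region below are hypothetical.

def count_in_region(detections, region):
    # Count detections whose foot point (bottom-centre of the box) lies inside
    # the rectangular counting region (x0, y0, x1, y1); boxes are (x, y, w, h).
    x0, y0, x1, y1 = region
    count = 0
    for (x, y, w, h) in detections:
        foot_x, foot_y = x + w / 2.0, y + h
        if x0 <= foot_x <= x1 and y0 <= foot_y <= y1:
            count += 1
    return count

# Hypothetical usage with three detected persons and one counting frame
boxes = [(10, 20, 30, 80), (200, 40, 28, 75), (120, 60, 32, 90)]
print(count_in_region(boxes, region=(0, 0, 160, 200)))   # -> 2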

8.4.2.6 Scene Interpretation Different from the recognition of objects in a scene, scene interpretation aims to comprehend the meaning of the entire image rather than to verify a particular person or object. In practice, many methods recognize activities by only considering the images acquired by the camera and observing the motion of the objects in them, without determining the identities of the objects. This strategy is effective when the object is small enough to be represented as a point in 2-D space. For example, a detection system for abnormal situations includes the following modules. First, the position, speed, size, and binary silhouette of the 2-D objects are extracted and vector-quantized to generate a codebook of prototypes. To take the temporal relationships between them into account, co-occurrence statistics can be used to produce a co-occurrence matrix. By iteratively defining probability functions over the prototypes in the two codebooks, a binary tree structure can be determined, in which the leaf nodes correspond to the probability distributions of the co-occurrence statistics in the matrix. The higher-level nodes correspond to simple scene events (such as the movements of pedestrians or cars), so they can be used to give a further explanation of the scene.
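A sketch of the first two modules under simple assumptions (random 4-D object descriptors standing in for position, speed, size, and silhouette features; a hand-rolled vector quantizer rather than a particular clustering library): the features are quantized into a small codebook, and the co-occurrence of codewords within a short temporal window is accumulated into a matrix.

import numpy as np

def learn_codebook(features, k, iters=20, seed=0):
    # Vector-quantize feature vectors into k prototype codewords (a few Lloyd iterations).
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((features[:, None] - codebook[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                codebook[j] = features[labels == j].mean(axis=0)
    return codebook

def quantize(features, codebook):
    return np.argmin(((features[:, None] - codebook[None]) ** 2).sum(-1), axis=1)

def cooccurrence_matrix(codes, k, window=5):
    # Count how often two codewords are observed within the same temporal window;
    # this plays the role of the co-occurrence statistics described above.
    C = np.zeros((k, k))
    for t in range(len(codes)):
        for s in range(t + 1, min(t + window, len(codes))):
            C[codes[t], codes[s]] += 1
            C[codes[s], codes[t]] += 1
    return C / max(C.sum(), 1.0)

# Hypothetical usage with random object descriptors observed over time
rng = np.random.default_rng(1)
feats = rng.normal(size=(300, 4))
cb = learn_codebook(feats, k=8)
print(cooccurrence_matrix(quantize(feats, cb), k=8).shape)   # -> (8, 8)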

Figure 8.10: Counting the number of people in flow monitoring.


8.5 Modeling Activity and Behavior A general action/activity recognition system should comprise, from the image sequence to the high-level interpretation, several steps (Turaga, 2008): (1) Capturing the input video or image sequence. (2) Extracting concise low-level image features. (3) Describing mid-level actions based on the low-level features. (4) Interpreting the image with high-level semantics, starting from the basic actions. Generally, a practical activity recognition system has a hierarchical structure. At the low level, there are modules for foreground-background separation and for object detection and tracking. At the middle level, the main module is for action recognition. At the high level, the most important module is the inference engine, which encodes the semantics of the activity according to the lower-level actions or action primitives and then understands the entire activity with the aid of learning. From an abstract point of view, the level of activity is higher than that of action. From a technical point of view, modeling and recognizing actions and activities can often be conducted with different techniques. A categorization scheme is shown in Figure 8.11 (Turaga, 2008). 8.5.1 Modeling Action The methods for action recognition can be divided into three groups: nonparameter modeling, volume modeling, and (timing) parameter modeling. Nonparameter-modeling methods extract a set of features from each video frame and match these features with a stored template. Volume-modeling methods do not extract features frame by frame but rather see the video as a 3-D volume of pixel intensities and extend standard 2-D image features (such as scale-space extrema and spatial filter responses) to 3-D. Methods of timing parameter modeling focus on the dynamic modeling of movement over time, estimating the specific parameters from a training set of actions. 8.5.1.1 Nonparameter Modeling Typical methods include using 2-D templates, using 3-D object models, manifold learning, and so on.

Figure 8.11: Classification of approaches for action and activity recognition (actions, from simple to complex: nonparameter, volume, and parameter modeling; activities: graphical model, syntax, and knowledge-based approaches).


Using 2-D Template Such methods include the following steps. First, motion detection and object tracking are performed in the scene. After tracking, a cropped sequence containing the object is established; scale variations can be compensated by size normalization. For a given movement, a periodicity index is calculated; if the movement is highly periodic, action recognition is performed. For recognition, the periodic sequence is divided into independent periods using the period estimate. The average period is divided into several temporal segments, and the flow characteristics are computed for each spatial point in every segment. The flow characteristics of each segment are averaged into a single frame. The frames of average flow over an activity period constitute the template for each action group. A typical approach is to build a temporal-domain template as the action model. First, background subtraction is performed, and then the extracted object regions from a sequence are combined into a single still image. There are two ways of combination. One assigns the same weight to all frames in the sequence; the combination thus obtained may be referred to as the motion energy image (MEI). The other assigns different weights to the frames, for example, giving newer frames larger weights and older frames smaller weights; the representation thus obtained is called the motion history image (MHI). For a given action, the images obtained by combination form a template (a code sketch is given at the end of this subsection). The calculation of invariant moments then leads to recognition. Using 3-D Object Model A 3-D object model is a model built for spatial–temporal objects, such as the generalized cylinder model and the 2-D contour superposition model. In the 2-D contour superposition model, the object movement and shape information are contained, whereby the geometric information of the object surface, such as peaks, pits, valleys, and ridges, can be extracted. If the 2-D contours are substituted by the blobs obtained from background subtraction, the binary spatial–temporal volume can be obtained. Manifold Learning Many action recognition tasks involve data in high-dimensional spaces. The feature space becomes exponentially sparse as the dimension grows, so building an effective model requires a large number of samples. By learning the manifold of the data, its intrinsic dimension can be determined. The number of degrees of freedom in this intrinsic dimension is small, which helps the design of efficient models in a low-dimensional space. The simplest method to reduce the dimension is principal component analysis (PCA), in which the data are assumed to lie in a linear subspace. In practice, except in special circumstances, the data do not lie in a linear subspace, so techniques that learn, from a large number of samples, the intrinsic geometry of the manifold are needed. Nonlinear dimensionality reduction techniques allow data points to be represented according to their proximity to each other on the nonlinear manifold. Typical methods include locally linear embedding (LLE) and Laplacian eigenmaps.
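The sketch below builds the MEI and MHI from a sequence of binary foreground masks, following the description above (equal weights for the MEI, recency-dependent weights for the MHI); the moving blob used as input is synthetic. Invariant moments of these two images can then serve as the action template features mentioned in the text.

import numpy as np

def motion_energy_and_history(masks, tau=None):
    # MEI: union of all masks (equal weights). MHI: larger values where motion
    # happened more recently; older motion decays by 1 per frame down to 0.
    masks = np.asarray(masks, dtype=bool)
    tau = tau if tau is not None else len(masks)
    mei = masks.any(axis=0).astype(np.uint8)
    mhi = np.zeros(masks.shape[1:], dtype=float)
    for d in masks:                        # process frames in temporal order
        mhi = np.where(d, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mei, mhi / tau                  # MHI normalized to [0, 1]

# Hypothetical usage: a small blob moving to the right over 5 frames
frames = np.zeros((5, 32, 32), dtype=bool)
for t in range(5):
    frames[t, 12:20, 4 + 5 * t:12 + 5 * t] = True
mei, mhi = motion_energy_and_history(frames)
print(mei.sum(), mhi.max())                # area swept by motion, most recent = 1.0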


8.5.1.2 Volume Modeling Typical methods include spatial–temporal filtering, using parts of the 3-D space (e.g., spatial–temporal points of interest), subvolume matching, and tensor-based methods. Spatial–temporal Filtering Spatial–temporal filtering is an extension of spatial filtering, where a bank of spatial–temporal filters is used to filter the space-time volume data of a video, and specific features are then derived from the responses of the filter bank. There is a hypothesis that the spatial–temporal properties of cells in the visual cortex can be described by the structures of the available spatial–temporal filters, such as oriented Gaussian kernels and their derivatives, as well as oriented Gabor filter banks. For example, a video clip can be considered as a space-time volume in XYT space. For each voxel (x, y, t), a local appearance model can be computed with Gabor filter banks at different orientations and spatial scales and at a single temporal scale. Using the average spatial probability of each pixel in a frame, the action can be recognized. Because the analysis is conducted at a single temporal scale, this method cannot be used when the frame rate changes. To cope with this, locally normalized histograms of spatial–temporal gradients can be extracted at several temporal scales, and then the histograms of the input video and of the stored sample videos are matched. Another method is to use a Gaussian kernel for the spatial filtering and a Gaussian derivative for the temporal filtering; the responses are incorporated into a histogram after thresholding. This method is capable of providing simple and effective features for far-field (non-close-shot) video. With efficient convolution, the filtering methods can be implemented easily and quickly. However, in most applications the filter bandwidth is not known in advance, so it is necessary to use large filter banks at multiple spatial and temporal scales to capture the actions effectively. Since the output response of each filter has the same number of dimensions as the input data, the use of large filter banks with multiple spatial and temporal scales is also subject to certain restrictions. Using Parts of 3-D Space A video can be seen as a collection of many local parts, each of which has a specific movement pattern. A typical approach is to use the spatial–temporal points of interest described in Section 8.1. In addition to using the Harris interest point detector, the spatial–temporal gradients extracted from the training set can also be used for clustering. Moreover, the bag-of-words model can be used to represent the action, where the bag of words is obtained by extracting spatial–temporal points of interest and clustering their features. Because the points of interest are local in nature, long-term temporal relationships are ignored. To solve this problem, the correlogram can be used: a video is seen as a series of sets, each set comprising the parts in a small sliding time window. This kind of method does not directly build a global geometric model for the local parts but sees them as a bag of features. Different actions can contain similar spatial–temporal components but can


have different geometric relationships. If the global geometry is incorporated into the part-based video representation, a constellation of parts is constituted. When the number of parts is large, this model becomes quite complex. The constellation model and the bag-of-features model can also be combined into a hierarchical structure: at the top, the constellation model has only a small number of components, and each component contains a bag of features at the bottom. Such a mixture combines the advantages of the two models. In most part-based methods, the detection of parts is based on linear operations, such as filtering and spatial–temporal gradients, so the descriptors are sensitive to appearance changes, noise, occlusion, and so on. On the other hand, owing to their localized nature, these methods are relatively robust to unstable backgrounds. Subvolume Matching Subvolume matching means matching between the video and subvolumes of a template. For example, actions and templates can be matched from the point of view of the correlation of spatial and temporal motion. The main difference between this approach and the part-based approach is that it does not need to extract action descriptors at the extreme points of the scale space, but checks the similarity between two spatial–temporal patches. However, the correlation computation over the whole video volume can be time consuming. One way to solve this problem is to extend the fast Haar features (box features), which have been very successful in object detection, to 3-D. The 3-D Haar features are the outputs of a 3-D filter bank whose filter coefficients are +1 and -1. The outputs of the filters can be combined by boosting to obtain robust performance. Another method is to see the video volume as a set of subvolumes of arbitrary shape, each of which is a homogeneous region obtained by clustering pixels that are close in appearance and in space. In this way, the given video is divided into many subvolumes, or super-voxels. Action templates can then be matched in these subvolumes by searching for the minimal set of regions that maximizes the overlap between the subvolumes and a set of templates. The advantage of subvolume matching is that it is relatively robust to noise and occlusion; if optical flow features are incorporated, it is also relatively robust to appearance changes. Its disadvantage is that it is more easily influenced by background changes. Tensor-based Methods A tensor is the multidimensional (spatial) extension of a matrix. A 3-D spatial–temporal volume may naturally be viewed as a tensor with three independent dimensions. For example, human action, human identity, and joint trajectories can be seen as the three independent dimensions of a tensor. By decomposing the overall data tensor into dominant modes (similar to an extension of PCA), it is possible to extract the labels for the movement and the identity of the acting person. Of course, the three dimensions of a tensor can also be taken directly as the three dimensions of the space-time volume, that is, (x, y, t).
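A minimal sketch of the 3-D box (Haar-like) features mentioned above: an integral volume makes any axis-aligned cuboid sum an O(1) lookup, and a +1/-1 split along the temporal axis gives a simple motion-sensitive feature. The random video volume and the chosen cuboid are hypothetical.

import numpy as np

def integral_volume(video):
    # Zero-padded cumulative sums along t, y and x, so any axis-aligned 3-D box
    # sum can later be read off with a constant number of lookups.
    iv = np.asarray(video, dtype=float).cumsum(0).cumsum(1).cumsum(2)
    return np.pad(iv, ((1, 0), (1, 0), (1, 0)))

def box_sum(iv, t0, t1, y0, y1, x0, x1):
    # Sum of video[t0:t1, y0:y1, x0:x1] by 3-D inclusion-exclusion.
    return (iv[t1, y1, x1] - iv[t0, y1, x1] - iv[t1, y0, x1] - iv[t1, y1, x0]
            + iv[t0, y0, x1] + iv[t0, y1, x0] + iv[t1, y0, x0] - iv[t0, y0, x0])

def haar3d_feature(iv, t0, t1, y0, y1, x0, x1):
    # A simple 3-D Haar-like feature with coefficients +1/-1: the difference
    # between the first and second temporal halves of a cuboid, which responds
    # to motion inside that region.
    tm = (t0 + t1) // 2
    return box_sum(iv, t0, tm, y0, y1, x0, x1) - box_sum(iv, tm, t1, y0, y1, x0, x1)

# Hypothetical usage on a random video volume (T, H, W)
video = np.random.rand(16, 32, 32)
iv = integral_volume(video)
print(haar3d_feature(iv, 0, 16, 8, 24, 8, 24))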


The tensor-based methods provide a direct way to match whole videos, without recourse to the middle-level representations used by the previous approaches. In addition, other types of features (such as optical flow and spatial–temporal filter responses) can easily be incorporated by increasing the number of dimensions of the tensor. 8.5.1.3 Parameter Modeling The first two groups of modeling methods are more suitable for simple actions; the modeling methods described below are more suitable for complex movements extending over time, such as ballet steps in a video or a musician playing with complex gestures. Typical methods include the hidden Markov model (HMM), linear dynamic systems, and nonlinear dynamic systems. Hidden Markov Model The hidden Markov model (HMM) is a typical state-space model. It is very effective for modeling time-series data, and it has good generalization and discrimination properties, so it is suitable for applications requiring recursive probability estimation. In constructing a discrete hidden Markov model, the state space is seen as a finite set of discrete points, and the evolution over time is modeled as probabilistic transitions from one state to another. The three key issues of the hidden Markov model are inference, decoding, and learning. The hidden Markov model was first used to recognize tennis strokes (shots), such as the backhand, backhand volley, forehand, forehand volley, and smash, where image models obtained with background subtraction are converted into hidden Markov models corresponding to the particular classes. Hidden Markov models can also be used for modeling time-dependent actions (e.g., gait). A single hidden Markov model can be used to model the action of a single person; for multiperson actions or interactions, a pair of hidden Markov models can be used to represent the alternating actions. Also, domain knowledge can be incorporated into the construction of the hidden Markov model, or the HMM can be combined with object detection to take advantage of the relation between the actions and the objects being acted upon. For example, a priori knowledge of the state durations can be incorporated into the hidden Markov model framework; the resulting model is called a semi-hidden Markov model (semi-HMM). If a discrete label used for high-level behavioral modeling is assigned to the state space, a mixed-state hidden Markov model is obtained, which can be used for nonstationary behavior modeling. Linear Dynamic System The linear dynamic system (LDS) is more general than the hidden Markov model, in that the state space is not constrained to be a finite set of symbols but can take continuous values in the ℝ^k space, where k is the dimension of the state space. The simplest linear dynamic system is the time-invariant first-order Gauss-Markov process:

x(t) = A x(t – 1) + w(t),    w ∼ N(0, P)    (8.26)
y(t) = C x(t) + v(t),        v ∼ N(0, Q)    (8.27)

wherein x ∈ ℝ^d is a d-D state vector, y ∈ ℝ^n is an n-D observation vector, d M (the length of B is greater than or equal to M at this time), that is, Q > 0. Similarly, the case in which the length of B is greater than the length of A can be proved. (2) The length of A is equal to the length of B: If A and B are not exactly the same string, then at least one symbol must be substituted, so Q > 0.

As discussed above, Q can only be zero if A and B are exactly the same string.
4-7
Use proof by contradiction. Suppose the two graphs are isomorphic. Then the parallel edges a and c should correspond to the parallel edges x and z (the only parallel edges), so the vertices A and B should correspond to the vertices X and Y. However, both A and B are endpoints of three edges, while X and Y are not both endpoints of three edges, which is a contradiction.

Chapter 5 Scene Analysis and Semantic Interpretation
5-2
g = 170.
5-11
{Hint} Refer to Figure 5.1.

Chapter 6 Multisensor Image Fusion
6-8
(1) Since X ∩ Y ⊆ X and X ∩ Y ⊆ Y, we have R∗(X ∩ Y) ⊆ R∗(X) and R∗(X ∩ Y) ⊆ R∗(Y).
(2) Since X ⊆ X ∪ Y and Y ⊆ X ∪ Y, we have R∗(X) ⊆ R∗(X ∪ Y) and R∗(Y) ⊆ R∗(X ∪ Y).
6-12
Assume X is R-rough and R-definable; then R∗(X) ≠ Ø and R∗(X) ≠ L.
R∗(X) ≠ Ø ⇔ ∃x ∈ X, [x]R ⊆ X ⇔ [x]R ∩ X̄ = Ø ⇔ R∗(X̄) ≠ L.
R∗(X) ≠ L ⇔ ∃y ∈ L, [y]R ∩ R∗(X) = Ø ⇔ [y]R ⊆ R∗(X̄) ⇔ R∗(X̄) ≠ Ø.

Chapter 7 Content-Based Image Retrieval
7-1
The match value between Figure Problem 7-1(a, b) is 7/16. The match value between Figure Problem 7-1(b, c) is 9/16.
7-6
For example, the ratio of height to width of the Feret box (see Section 3.3.2 in Volume II) as well as the form factor (see Section 6.3.1 in Volume II) can be used. The appearance of the trousers in the three images differs mainly in the bifurcation of the trouser legs. Thus, their height-to-width ratios of the Feret box would be significantly different, while their form factors would be quite close (the contour length and the silhouette area are similar).

Chapter 8 Spatial–Temporal Behavior Understanding
8-3
{Hint} One possible approach is to resample (or interpolate) the raw data so as to transform the anisotropic data into isotropic data.
8-12
The first line: define a testing procedure for a car parading in the parking lot, in which the car is represented by v and the parking lot is represented by lot. The second line: the detection begins when the car drives into the parking lot. The third line: the counter i is set to 0. The fourth line: a cycle of statistical counting starts. The fifth line: if the car is in the parking lot and the car drives one circle around the road in the parking lot, the counter is increased by 1. The sixth line: if the counter reaches a predetermined threshold n, the counting is stopped. The seventh line: the car leaves the parking lot, the detection is ended, and the program exits.
Consider a basketball player practicing shooting, as shown in Figure Solution 8-12. The player is represented by p and the basketball court by court. The detection starts from the moment the player enters the basketball court and begins shooting. Each time a shot is made (the ball goes into the basket), the counter i is increased by 1. When a predetermined threshold n is reached, the counting is stopped, the training is ended, and the player can leave the court.


PROCESS (practice-shoot-court (player p, basketball-court court),
  Sequence (enter (p, court),
    Set-to-zero (i),
    Repeat-Until (
      AND (move-in-court (p), inside (p, court), increment (i)),
      Equal (i, n)
    ),
    Exit (p, court)
  )
)

Figure Solution 8-12

References [Albiol 2002] Albiol A, Torres L, Delp E J. 2002. Video preprocessing for audiovisual indexing. Proc. ICASSP, 4: 3636–3639. [Aloimonos 1985] Aloimonos J, Swain J. 1985. Shape from texture. Proc. 9IJCAI, 926–931. [Aloimonos 1992] Aloimonos Y. (Ed.). 1992. Special Issue on Purposive, Qualitative, Active Vision. CVGIP-IU, 56(1): 1–129. [Andreu 2001] Andreu J P, Borotsching H, Ganster H, et al. 2001. Information fusion in image understanding. Digital Image Analysis – Selected Techniques and Applications. Springer, Heidelberg. [Avrithis 2000] Avrithis Y, Tsapatsoulis N, Kollias S. 2000. Broadcast news parsing using visual cues: A robust face detection approach. Proc. ICME, 1469–1472. [Ballard 1982] Ballard D. H., Brown C. M. 1982. Computer Vision. Prentice-Hall, New Jersey. [Barnard 1980] Barnard S T, Thompson W B. 1980. Disparity analysis of images. IEEE-PAMI, 2: 333–340. [Bian 2005] Bian H, Zhang Y-J, Yan W D. 2005. The study on wavelet transform for remote sensing image fusion. Proc. Fourth Joint Conference on Signal and Information Processing, 109–1139. [Bimbo 1999] Bimbo A. 1999. Visual Information Retrieval. Morgan Kaufmann, Inc., Burlington. [Blei 2003] Blei D, Ng A, Jordan M. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993–1022. [Bishop 2006] Bishop C M. 2006. Pattern Recognition and Machine Learning. Springer, Heidelberg. [Blank 2005] Blank B, Gorelick L, Shechtman E, et al. 2005. Actions as space-time shapes. ICCV, 2: 1395–1402. [Bregonzio 2009] Bregonzio M, Gong S G, Xiang T. 2009. Recognizing action as clouds of space-time interest points. CVPR, 1948–1955. [Buckley 2003] Buckley F, Lewinter M. 2003. A Friendly Introduction to Graph Theory. Pearson Education, Inc., London. [Castleman 1996] Castleman K R. 1996. Digital Image Processing. Prentice-Hall, New Jersey. [Chen 2001] Chen Y. 2001. Registration of SAR and FLIR images for ATR application. SPIE 4380: 127–134. [Chen 2006] Chen W, Zhang Y-J. 2006. Tracking ball and players with applications to highlight ranking of broadcasting table tennis video. Proc. 2006 IMACS Multi-conference on Computational Engineering in Systems Applications, 2: 1896–1903. [Chen 2008] Chen W, Zhang Y-J. 2008. Parametric model for video content analysis. PR, 29(3): 181–191. [Chen 2016a] Chen Q Q, Zhang Y-J. 2016a. Cluster trees of improved trajectories for action recognition. Neurocomputing, 173: 364–372. [Chen 2016b] Chen Q Q, Liu F, Li X, et al. 2016b. Saliency-context two-stream convnets for action recognition. Proc. 23ICIP, 3076–3080. [Cox 1994] Cox E. 1994. The Fuzzy Systems Handbook. AP Professional. Cambridge, England. [Cvetkovic 1995] Cvetkovic Z, Vetterli M. 1995. Discrete time wavelet extreme representation: Design and consistent reconstruction. IEEE-SP, 43: 681–693. [Dai 2005] Dai S Y, Zhang Y-J. 2005. Unbalanced region matching based on two-level description for image retrieval. PRL, 26(5): 565–580. [Davies 2005] Davies E R. 2005. Machine Vision: Theory, Algorithms, Practicalities (3rd Ed.), Elsevier, Amsterdam. [Davies 2012] Davies E R. 2012. Computer and Machine Vision: Theory, Algorithms, Practicalities (4th Ed.). Elsevier, Amsterdam. [Dean 1995] Dean T, Allen J, Aloimonos Y. 1995. Artificial Intelligence: Theory and Practice. Addison Wesley, New Jersey.


[Devernay 1994] Devernay F, Faugeras O. 1994. Computing differential properties of 3-D shapes from stereopsis without 3-D models. Proc. CVPR, 208–213. [Divakaran 2000] Divakaran A, Ito H, Sun H F, et al. 2000. Fade-in/out scene change detection in the MPEG-1/2/4 compressed video domain. SPIE 3972: 518–522. [Duan 2003] Duan L Y, Xu M, Tian Q. 2003. Semantic shot classification in sports video. SPIE 5021: 300–313. [Duan 2012] Duan F, Zhang Y-J. 2012. Max-margin dictionary learning algorithm for sparse representation. Journal of Tsinghua University (Sci & Tech), 50(4): 566–570. [Dubuisson 1994] Dubuisson M, Jain A K. 1994. A modified Hausdorff distance for object matching. Proc. 12ICPR, 566–568. [Duda 2001] Duda R O, Hart P E, Stork D G. 2001. Pattern Classification (2nd Ed.). John Wiley & Sons, Inc., New Jersey. [Durrant 1988] Durrant-Whyte H F. 1988. Sensor models and multisensor integration. Journal of Robotics Research, 7(6): 97–113. [Edelman 1999] Edelman S. 1999. Representation and Recognition in Vision. MIT Press, Boston. [Faugeras 1993] Faugeras O. 1993. Three-Dimensional Computer Vision: A Geometric Viewpoint. MIT Press, Boston. [Finkel 1994] Finkel L H, Sajda P. 1994. Constructing visual perception. American Scientist, 82(3): 224–237. [Forsyth 2003] Forsyth D, Ponce J. 2003. Computer Vision: A Modern Approach. Prentice Hall, New Jersey. [Forsyth 2012] Forsyth D, Ponce J. 2012. Computer Vision: A Modern Approach (2nd Ed.). Prentice Hall, New Jersey. [Franke 2000] Franke U, Joos A. 2000. Real-time stereo vision for urban traffic scene understanding. Proc. Intelligent Vehicles Symposium, 273–278. [Gao 2000a] Gao Y Y, Zhang Y-J. 2000a. Object classification using mixed color feature. Proc. ICASSP, 4: 2003–2006. [Gao 2000b] Gao Y Y, Zhang Y-J, Merzlyakov N S. 2000b. Semantic-based image description model and its implementation for image retrieval. Proc. 1ICIG, 657–660. [Gao 2002a] Gao X, Tan X. 2002a. Unsupervised video-shot segmentation and model-free anchorperson detection for news video story parsing. IEEE-CSVT, 12(9): 765–776. [Gao 2002b] Gao Y, Leung M K H. 2002b. Line segment Hausdorff distance on face matching. PR, 35(2): 361–371. [Gao 2009] Gao J, Xie Z. 2009. Image Understanding Theory and Method. Science Press, Beijing. [Gargi 2000] Gargi U, Kasturi R, Strayer S H. 2000. Performance characterization of video-shot-change detection methods. IEEE-CSVT, 10(1): 1–13. [Gibson 1950] Gibson J J. 1950. The Perception of the Visual World. Houghton Mifflin, Boston. [Goldberg 1987] Goldberg R R, Lowe D G. 1987. Verification of 3-D parametric models in 2-D image data. Proc. IEEE Workshop on Computer Vision, 255–257. [Griffiths 2004] Griffiths T L, Steyvers M. 2004. Finding scientific topics. Proc. National Academy of Sciences, 101(s1): 5228–5235. [Grimson 1983] Grimson W E L. 1983. Surface consistency constraints in vision. CVGIP, 24: 28–51. [Grimson 1985] Grimson W E L. 1985. Computational experiments with a feature based stereo algorithm. IEEE-PAMI, 7(1): 17–34. [Grossberg 1987] Grossberg S, Mingolia E. 1987. Neural dynamics of surface perception: Boundary webs, illuminants and shape-from-shading. CVGIP, 37(1): 116–165. [Habib 2001] Habib A, Kelley D. 2001. Automatic relative orientation of large scale imagery over urban areas using modified iterated Hough transform. ISPRS Journal of Photogrammetry & Remote Sensing, 56: 29–41.


[Han 2010] Han R F. 2010. The Principle and Application of Genetic Algorithm. Weapons Industry Press, Beijing. [Hanjalic 1998] Hanjalic A, Lagendijk R L, Biemond J. 1998. Template-based detection of anchorperson shots in news programs. Proc. ICIP, 3: 148–152. [Haralick 1992] Haralick R M, Shapiro L G. 1992. Computer and Robot Vision, Vol.1. Addison-Wesley, New Jersey. [Haralick 1993] Haralick R M, Shapiro L G. 1993. Computer and Robot Vision, Vol.2. Addison-Wesley, New Jersey. [Hartley 2004] Hartley R, Zisserman A. 2004. Multiple View Geometry in Computer Vision (2nd Ed.). Cambridge University Press, Cambridge. [He 2000] He Y, Wang G H, Lu D J, et al. 2000. Multi-Sensor Information Fusion and Applications. Publishing House of Electronics Industry, Beijing. [Horn 1986] Horn B K P. 1986. Robot Vision. MIT Press, Boston. [Huang 1993] Huang T, Stucki P (Eds.). 1993. Special Section on 3-D Modeling in Image Analysis and Synthesis. IEEE-PAMI, 15(6): 529–616. [Huang 1999] Huang Q, Liu Z, Rosenberg A. 1999. Automated semantic structure reconstruction and representation generation for broadcast news. SPIE 3656: 50–62. [Huang 2003] Huang X Y, Zhang Y-J, Hu D. 2003. Image retrieval based on weighted texture features using DCT coefficients of JPEG images. Proc. 4PCM, 3: 1571–1575. [Huang 2016] Huang X M, Zhang Y-J. 2016. An O(1) disparity refinement method for stereo matching. PR, 55: 198–206. [ISO/IEC 2001] ISO/IEC JTC1/SC29/WG11. 2001. Overview of the MPEG-7 standard, V.6, Doc. N4509. [Jacobs 1995] Jacobs C E, Finkelstein A, Salesin D. 1995. Fast multiresolution image querying. Proc. SIGGAPH’95, 277–286. [Jain 1995] Jain R, Kasturi R, Schunck B G. 1995. Machine Vision. McGraw-Hill Companies. Inc., New York. [Jain 1996] Jain A K, Vailaya A. 1996. Image retrieval using color and shape. PR, 29(8): 1233–1244. [Jain 1997] Jain A K, Dorai C. 1997. Practicing vision: integration, evaluation and applications. PR, 30(2): 183–196. [Jähne 1999a] Jähne B, Hau"ecker H., Gei"ler P. 1999a. Handbook of Computer Vision and Applications: Volume 1: Sensors and Imaging. Academic Press, Washington, D.C. [Jähne 1999b] Jähne B, Hau"ecker H, Gei"ler P. 1999b. Handbook of Computer Vision and Applications: Volume 2: Signal Processing and Pattern Recognition. Academic Press, Washington, D.C. [Jähne 1999c] Jähne B, Hau"ecker H, Gei"ler P. 1999c. Handbook of Computer Vision and Applications: Volume 3: Systems and Applications. Academic Press, Washington, D.C. [Jähne 2000] Jähne B, Hau"ecker H. 2000. Computer Vision and Applications: A Guide for Students and Practitioners. Academic Press, Washington, D.C. [Jeannin 2000] Jeannin S, Jasinschi R, She A, et al. 2000. Motion descriptors for content-based video representation. Signal Processing: Image Communication, 16(1–2): 59–85. [Jia 1998] Jia B, Zhang Y-J, Zhang N, et al. 1998. Study of a fast trinocular stereo algorithm and the influence of mask size on matching. Proc. ISSPR, 1: 169–173. [Jia 2000a] Jia B, Zhang Y-J, Lin X G. 2000a. Stereo matching using both orthogonal and multiple image pairs. Proc. ICASSP, 4: 2139–2142. [Jia 2000b] Jia B, Zhang Y-J, Lin X G. 2000b. Genera l and fast algorithm for dispar ity error detection and correct ion. Journal of Tsinghua University (Sci & Tech), 40(1): 28–31. [Jia 2000c] Jia B, Zhang Y-J, Lin X G., et al. 2000c. A sub-pixel-level stereo algorithm with right-angle tri-nocular. Proc. 1ICIG, 417–420. [Jia 2007] Jia H X, Zhang Y-J. 2007. A survey of computer vision based pedestrian detection for driver assistance system. 
Acta Automatica Sinica, 33(1): 84–90.


[Jia 2009] Jia H X, Zhang Y-J. 2009. Automatic people counting based on machine learning in intelligent video surveillance. Video Engineering, (4): 78–81. [Jiang 2005a] Jiang F, Zhang Y-J. 2005a. News video indexing and abstraction by specific visual cues: MSC and news caption. Video Data Management and Information Retrieval, Deb S. (ed.). IRM Press, Hershey-New York. Chapter 11 (254–281). [Jiang 2005b] Jiang F, Zhang Y-J. 2005b. Camera attention weighted strategy for video shot grouping. SPIE 5960: 428–436. [Kanade 1991] Kanade T, Okutomi M. 1991. A stereo matching algorithm with an adaptive window: Theory and experiment. Proc. ICRA, 1088–1095. [Kanade 1994] Kanade T, Okutomi M. 1994. A stereo matching algorithm with an adaptive window: Theory and experiment. IEEE-PAMI, 16(9): 920–932. [Kanade 1996] Kanade T, Yoshida A, Oda K, et al. 1996. A stereo machine for video-rate dense depth mapping and its new applications. Proc. 15CVPR, 196–202. [Kara 2011] Kara Y E, Akarun L. 2011. Human action recognition in videos using keypoint tracking. Proc. 19th Conference on Signal Processing and Communications Applications, 1129–1132. [Kim 1987] Kim Y C, Aggarwal J K. 1987. Positioning three-dimensional objects using stereo images. IEEE-RA, 1: 361–373. [Kim 2002] Kim C N, Mohan T, Hiroshi I. 2002. Generalized multiple baseline stereo and direct virtual view synthesis using range-space search, match, and render. IJCV, 47(1): 131–147. [Kittler 1985] Kittler J, Illingworth J. 1985. Relaxation labeling algorithms – A review. Image and Vision Computing, 3(4): 206–216. [Kong 2002] Kong B. 2002. Comparison between human vision and computer vision. Chinese Journal of Nature, 24(1): 51–55. [Kuvich 2004] Kuvich G. 2004. Active vision and image/video understanding systems for intelligent manufacturing. SPIE 5605: 74–86. [Lang 1997] Lang P J, Bradley M M, Cuthbert B N. 1997. International affective picture system (IAPS): Technical manual and affective ratings, NIMH Center for the Study of Emotion and Attention. [Laptev 2005] Laptev I. 2005. On space-time interest points. IJCV, 64(2/3): 107–123. [Lee 1990] Lee S U, et al. 1990. A comparative performance study of several global thresholding techniques for segmentation. CVGIP, 52: 171–190. [Levine 1985] Levine M D. 1985. Vision in Man and Machine. McGraw-Hill, New York. [Lew 1994] Lew M S, Huang T S, Wong K. 1994. Learning and feature selection in stereo matching. IEEE-PAMI, 16(9): 869–881. [Li 2005a] Li R, Zhang Y-J. 2005a. Automated image registration using multi-resolution based Hough transform. SPIE 5960: 1363–1370. [Li 2005b] Li R, Zhang Y-J. 2005b. Level selection for multi-scale fusion of out-of-focus image. IEEE Signal Processing Letters, 12(9): 617–620. [Li 2010] Li S, Zhang Y-J. 2010. Discovering latent semantic factors for emotional picture categorization. Proc. 17ICIP, 1065–1068. [Li 2011] Li S, Zhang Y-J. 2011. Semi-supervised classification of emotional pictures based on feature combination. SPIE-7881A: 0X1–0X8. [Lienhart 1997] Lienhart R, Pfeiffer S, Effelsberg W. 1997. Video abstracting. Communications of ACM. 40(12): 54–62. [Lin 2003] Lin K H, Lam K M, Sui W C. 2003. Spatially eigen-weighted Hausdorff distances for human face recognition. PR, 36: 1827–1834. [Liu 2005] Liu X M, Zhang Y-J, Tan H C. 2005. A new Hausdorff distance based approach for face localization. Sciencepaper Online, 200512–662 (1–9). [Liu 2012] Liu B D, Wang Y X, Zhang Y-J. 2012. Dictionary learning on multiple manifolds for image classification. 
Journal of Tsinghua University (Sci & Tech), 50(4): 575–580. [Lohmann 1998] Lohmann G. 1998. Volumetric Image Analysis. John Wiley & Sons and Teubner Publishers, New Jersey.


[Lowe 1987] Lowe D G. 1987. Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, 31(3): 355–395. [Lowe 1988] Lowe D G. 1988. Four steps towards general-purpose robot vision. Proc. 4th International Symposium on Robotics Research, 221–228. [Luo 2001] Luo Y, Zhang Y-J, Gao Y Y, et al. 2001. Extracting meaningful region for content-based retrieval of image and video. SPIE 4310: 455–464. [Luo 2002] Luo Z Z, Jiang J P. 2002. Matching Vision and Multi-information Fusion. China Machine Press, Beijing. [Luo 2005] Luo J B, Savakis A E, Amit S. 2005. A Bayesian network-based framework for semantic image understanding. PR, 38(6): 919–934. [Luo 2010] Luo S W, et al. 2010. The Prception Computing of Visual Information. Science Press, Beijing. [Ma 2002] Ma Y F, Lu L, Zhang H J, et al. 2002. A user attention model for video summarization. Proc. ACM International Multimedia Conference and Exhibition, 533–542. [Maitre 1992] Maitre H, Luo W. 1992. Using models to improve stereo reconstruction. IEEE-PAMI, 14(2): 269–277. [Mallat 1992] Mallat S, Hwang W L. 1992. Singularity detection and processing with wavelets, IEEE-IT, 38(2): 617–643. [Marchand 2000] Marchand-Maillet S, Sharaiha Y M. 2000. Binary Digital Image Processing – A Discrete Approach. Academic Press, Washington, D.C. [Marr 1982] Marr D. 1982. Vision – A Computational Investigation into the Human Representation and Processing of Visual Information. W.H. Freeman, New York. [Matthies 1989] Matthies L, Szeliski R, Kanade T. 1989. Kalamn filter-based algorithms for estimating depth from image sequences, IJCV, 3: 209–236. [Medioni 1985] Medioni G, Nevatia R. 1985. Segment-based stereo matching. CVGIP, 31(1): 2–18. [Mehtre 1995] Mehtre B M, Kankanhalli M S, Narasimhalu A D, et al. 1995. Color matching for image retrieval. PRL, 16: 325–331. [Mehtre 1997] Mehtre B M, Kankanhalli M S, Lee W F. 1997. Shape measures for content based image retrieval: A comparison. Information Processing & Management, 33(3): 319–337. [Mitchell 1996] Mitchell T M. 1996. An Introduction to Genetic Algorithms. MIT Press, Boston. [Mohan 1989] Mohan R, Medioni G. 1989. Stereo error detection, correction, and evaluation. IEEE-PAMI, 11(2): 113–120. [Niblack 1998] Niblack W, Hafner J L, Breuel T, et al. 1998. Updates to the QBIC system. SPIE 3312: 150–161. [Nilsson 1980] Nilsson N J. 1980. Principles of Artificial Intelligence. Tioga Publishing Co., Pennsylvania. [Ohta 1986] Ohta Y, Watanabe M, Ikeda K. 1986. Improved depth map by right-angled tri-nocular stereo. Proc. 8ICPR, 519–521. [Okutomi 1992] Okutomi M, Kanade T. 1992. A locally adaptive window for signal matching. IJCV, 7(2): 143–162. [Okutomi 1993] Okutomi M, Kanade T. 1993. A multiple – baseline stereo. IEEE-PAMI, 15(4): 353–363. [Ortega 1997] Ortega M, Rui Y, Chakrabarti K, et al. 1997. Supporting similarity queries in MARS. Proc. ACM Multimedia, 403–413. [O’Toole 1999] O’Toole C, Smeaton A, Murphy N, et al. 1999. Evaluation of automatic shot boundary detection on a large video test suite. Proc. Second UK Conference on Image Retrieval, 12. [Pajares 2004] Pajares G. 2004. A wavelet-based image fusion tutorial. PR, 37: 1855–1872. [Peker 2002] Peker K A, Cabasson R, Divakaran A. 2002. Rapid generation of sports video highlights using the MPEG-7 activity descriptor. SPIE 4676: 318–323.


[Piella 2003] Piella G. 2003. A general framework for multiresolution image fusion: from pixels to regions. Information Fusion, 4: 259–280. [Pizlo 1992] Pizlo Z, Rosenfeld A. 1992. Recognition of planar shapes from perspective images using contour-based invariants. CVGIP: Image Understanding, 56(3): 330–350. [Polhl 1998] Polhl C, Genderen J L. 1998. Multisensor image fusion in remote sensing: Concepts, methods and applications. International Journal of Remote Sensing, 19(5): 823–854. [Prince 2012] Prince S J D. 2012. Computer Vision – Models, Learning, and Inference. Cambridge University Press, Cambridge. [Qin 2003] Qin X, Zhang Y-J. 2003. A tracking method based on curve fitting prediction of IR object. Infrared Technology, 25(4): 23–25. [Renals 2005] Renals S, Bengio S (eds.). 2005 Machine Learning for Multimodal Interaction. LNSC 3869, Springer, Heidelberg. [Scassellati 1994] Scassellati B, Alexopoulos S, Flickner M. 1994. Retrieving images by-2D shape: A comparison of computation methods with human perceptual judgments. SPIE 2185: 2–14. [Scharstein 2002] Scharstein D, Szeliski R. 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, IJCV, 47(1): 7–42. [Shah 2002] Shah M. 2002. Guest introduction: The changing shape of computer vision in the twenty-first century. IJCV, 50(2): 103–110. [Shapiro 2001] Shapiro L, Stockman G. 2001. Computer Vision. Prentice Hall, New Jersey. [Shirai 1987] Shirai Y. 1987. Three-Dimensional Computer Vision. Springer-Verlag, Heidelberg. [Sivaraman 2011] Sivaraman S, Morris B T, Trivedi M M. 2011. Learning multi-lane trajectories using vehicle-based vision. ICCV Workshops, 2070–2076. [Sivic 2003] Sivic J, Zisserman A. 2003. Video Google: A text retrieval approach to object matching in videos. Proc. ICCV, II: 1470–1477. [Sivic 2005] Sivic J, Russell B C, Efros A A, et al. 2005. Discovering objects and their location in images. Proc. ICCV, 370–377. [Smeulders 2000] Smeulders A, Worring W M M, Santini S, et al. 2000. Content-based image retrieval at the end of the early years. IEEE PAMI, 22(12): 1349–1380. [Snyder 2004] Snyder W E, Qi H. 2004. Machine Vision. Cambridge University Press, Cambridge. [Sonka 2008] Sonka M, Hlavac V, Boyle R. 2008. Image Processing, Analysis, and Machine Vision (3rd Ed). Brooks/Cole Publishing, Toronto. [Steger 2008] Steger C, Ulrich M, Wiedemann C. 2008. Machine Vision Algorithms and Applications. Wiley-VCH, New Jersey. [Sun 2004] Sun H Q. 2004. Graph Theory and Applications. Science Press, Beijing. [Swain 1991] Swain M J, Ballard D H. 1991. Color indexing. IJCV, 7: 11–32. [Szeliski 2010] Szeliski R. 2010. Computer Vision: Algorithms and Applications. Springer, Heidelberg. [Tan 2000] Tan Y P, Saur D F, Kulkarni S R, et al. 2000. Rapid estimation of camera motion from compressed video with application to video annotation. IEEE-CSVT, 10(1): 133–146. [Tan 2006] Tan H C, Zhang Y-J. 2006. A novel weighted Hausdorff distance for face localization. Image and Vision Computing, 24(7): 656–662. [Tekalp 1995] Tekalp A M. 1995. Digital Video Processing. Prentice-Hall, New Jersey. [Theodoridis 2009] Theodoridis S, Koutroumbas K. 2009. Pattern Recognition (3rd Ed.). Elsevier Science, Amsterdam. [Tran 2008] Tran D, Sorokin A. 2008. Human activity recognition with metric learning. LNCS 5302: 548–561. [Wang 2009] Wang C, Blei D, Li F F. 2009. Simultaneous image classification and annotation. Proc. CVPR, 1903–1910. [Wang 2012] Wang Y X, Gui L Y, Zhang Y-J. 2012. 
Neighborhood preserving non-negative tensor factorization for image representation. Proc. 37ICASSP, 3389–3392.


[Wang 2013] Wang Y X, Zhang Y-J. 2013. Nonnegative matrix factorization: A comprehensive review. IEEE-KDE, 25(6): 1336–1353. [Weinland 2011] Weinland D, Ronfard R, Boyer E. 2011. A survey of vision-based methods for action representation, segmentation and recognition. CVIU, 115(2): 224–241. [West 2001] West D B. 2001. Introduction to Graph Theory (2nd Ed.). Pearson Education, Inc., London. [Winston 1984] Winston P H. 1984. Artificial Intelligence (2nd Ed.). Addison-Wesley, New Jersey. [Xu 2006] Xu F, Zhang Y-J. 2006. Comparison and evaluation of texture descriptors proposed in MPEG-7. International Journal of Visual Communication and Image Representation, 17: 701–716. [Xu 2007a] Xu F, Zhang Y-J. 2007a. A novel framework for image categorization and automatic annotation. In: Semantic-Based Visual Information Retrieval, IRM Press, Chapter 5 (90–111). [Xu 2007b] Xu F, Zhang Y-J. 2007b. Integrated patch model: A generative model for image categorization based on feature selection. PRL, 28(12): 1581–1591. [Xu 2008] Xu F, Zhang Y-J. 2008. Probability association approach in automatic image annotation. Handbook of Research on Public Information Technology, II: Chapter 56 (615–626). [Xue 2000] Xue J H, Zhang Y-J, Lin X G. 2000. Dynamicamic image segmentation using 2-D genetic algorithms. Acta Automatica Sinica, 26(5): 685–689. [Yao 1999] Yao Y R, Zhang Y-J. 1999. Shape-based image retrieval using wavelets and moments. Proc. International Workshop on Very Low Bitrate Video Coding, 71–74. [Yu 2001a] Yu T L, Zhang Y-J. 2001a. Motion feature extraction for content-based video sequence retrieval. SPIE 4311: 378–388. [Yu 2001b] Yu T L, Zhang Y-J. 2001b. Retrieval of video clips using global motion information. IEE Electronics Letters, 37(14): 893–895. [Zhang 1990] Zhang Y-J. 1990. Automatic correspondence finding in deformed serial sections. Scientific Computing and Automation (Europe) 1990, Chapter 5 (39–54). [Zhang 1991] Zhang Y-J. 1991. 3-D image analysis system and megakaryocyte quantitation. Cytometry, 12: 308–315. [Zhang 1996a] Zhang Y-J. 1996a. Image engineering in China: 1995. Journal of Image and Graphics. 1(1): 78–83. [Zhang 1996b] Zhang Y-J. 1996b. Image engineering in China: 1995 (Supplement). Journal of Image and Graphics. 1(2): 170–174. [Zhang 1996c] Zhang Y-J. 1996c. Image engineering and bibliography in China. Technical Digest of International Symposium on Information Science and Technology, 158–160. [Zhang 1997] Zhang Y-J. 1997. Image engineering in China: 1996. Journal of Image and Graphics. 2(5): 336–344. [Zhang 1998a] Zhang Y-J. 1998a. Image engineering in China: 1997. Journal of Image and Graphics. 3(5): 404–414. [Zhang 1998b] Zhang Y-J, Liu Z W, He Y. 1998b. Comparison and improvement of color-based image retrieval techniques. SPIE 3312: 371–382. [Zhang 1998c] Zhang Y-J, Liu Z W, He Y. 1998c. Color-based image retrieval using sub-range cumulative histogram. High Technology Letters, 4(2): 71–75. [Zhang 1999] Zhang Y-J. 1999. Image engineering in China: 1998. Journal of Image and Graphics. 4(5): 427–438. [Zhang 2000] Zhang Y-J. 2000. Image engineering in China: 1999. Journal of Image and Graphics. 5(5): 359–373. [Zhang 2001a] Zhang Y-J. 2001a. Image engineering in China: 2000. Journal of Image and Graphics. 6(5): 409–424. [Zhang 2001b] Zhang Y-J. 2001b. Image Segmentation. Science Press, Beijing. [Zhang 2002a] Zhang Y-J. 2002a. Image engineering in China: 2001. Journal of Image and Graphics. 7(5): 417–433.

294

References

[Zhang 2002b] Zhang Y-J. 2002b. Image Engineering (3): Teaching References and Problem Solutions. Tsinghua University Press, Beijing. [Zhang 2002c] Zhang Y-J. 2002c. Image engineering and related publications. International Journal of Image and Graphics, 2(3): 441–452. [Zhang 2003a] Zhang Y-J. 2003a. Image engineering in China: 2002. Journal of Image and Graphics. 8(5): 481–498. [Zhang 2003b] Zhang Y-J. 2003b. Content-Based Visual Information Retrieval. Science Press, Beijing. [Zhang 2004a] Zhang Y-J. 2004a. Image Engineering in China: 2003. Journal of Image and Graphics. 9(5): 513–531. [Zhang 2004b] Zhang Y-J, Gao Y Y, Luo Y. 2004b. Object-based techniques for image retrieval. Multimedia Systems and Content-based Image Retrieval, Deb S. (ed.). Idea Group Publishing, New York. Chapter 7 (156–181). [Zhang 2005a] Zhang Y-J. 2005a. Image engineering in China: 2004. Journal of Image and Graphics. 10(5): 537–560. [Zhang 2005b] Zhang Y-J. 2005b. Advanced techniques for object-based image retrieval. Encyclopedia of Information Science and Technology, Mehdi Khosrow-Pour (ed.). Idea Group Reference, Hershey-New York. 1: 68–73. [Zhang 2005c] Zhang Y-J. 2005d. New advancements in image segmentation for CBIR. Encyclopedia of Information Science and Technology, Mehdi Khosrow-Pour (ed.). Idea Group Reference, Hershey-New York. 4: 2105–2109. [Zhang 2006] Zhang Y-J. 2006. Image engineering in China: 2005. Journal of Image and Graphics. 11(5): 601–623. [Zhang 2007a] Zhang Y-J. 2007a. Image engineering in China: 2006. Journal of Image and Graphics. 12(5): 753–775. [Zhang 2007b] Zhang Y-J. 2007b. Image Engineering (2nd Ed.). Tsinghua University Press, Beijingt. [Zhang 2007c] Zhang Y-J. 2007c. Image Engineering (3): Image Understanding (2nd Ed.). Tsinghua University Press, Beijing. [Zhang 2007d] Zhang Y-J. (Ed.). 2007d. Semantic-Based Visual Information Retrieval. IRM Press, Hershey-New York. [Zhang 2007e] Zhang Y-J. 2007e. Toward high level visual information retrieval. Semantic-Based Visual Information Retrieval, Zhang Y J (ed.). IRM Press, Hershey-New York. Chapter 1 (1–21). [Zhang 2008a] Zhang Y-J. 2008a. Image engineering in China: 2007. Journal of Image and Graphics. 13(5): 825–852. [Zhang 2008b] Zhang Y-J. 2008b. A study of image engineering. Encyclopedia of Information Science and Technology (2nd Ed.), Information Science Reference, Hershey-New York. VII: 3608–3615. [Zhang 2008c] Zhang Y-J. 2008c. Image classification and retrieval with mining technologies. Handbook of Research on Text and Web Mining Technologies, Song M, Wu Y F B, (eds.). Chapter VI (96–110). [Zhang 2008d] Zhang Y-J, Jiang F. 2008d. Home video structuring with a two-layer shot clustering approach. Proc. 3rd International Symposium on Communications, Control and Signal Processing, 500–504. [Zhang 2009a] Zhang Y-J. 2009a. Image Engineering in China: 2008. Journal of Image and Graphics. 14(5): 809–837. [Zhang 2009b] Zhang Y-J. 2009b. Machine vision and image techniques. Automation Panorama. (2): 20–25. [Zhang 2010] Zhang Y-J. 2010. Image engineering in China: 2009. Journal of Image and Graphics. 15(5): 689–722.

References

295

[Zhang 2011] Zhang Y-J. 2011. Image engineering in China: 2010. Journal of Image and Graphics. 16(5): 693–702. [Zhang 2012a] Zhang Y-J. 2012a. Image engineering in China: 2011. Journal of Image and Graphics. 17(5): 603–612. [Zhang 2012b] Zhang Y-J. 2012b. Image Engineering (1): Image Processing (3rd Ed.). Tsinghua University Press, Beijing. [Zhang 2012c] Zhang Y-J. 2012c. Image Engineering (2): Image Analysis (3rd Ed.). Tsinghua University Press, Beijing. [Zhang 2012d] Zhang Y-J. 2012d. Image Engineering (3): Image Understanding (3rd Ed.). Tsinghua University Press, Beijing. [Zhang 2013a] Zhang Y-J. 2013a. Image engineering in China: 2012. Journal of Image and Graphics. 18(5): 483–492. [Zhang 2013b] Zhang Y-J. 2013b. Image Engineering (3rd Ed.). Tsinghua University Press, Beijing. [Zhang 2014] Zhang Y-J. 2014. Image engineering in China: 2013. Journal of Image and Graphics. 19(5): 649–658. [Zhang 2015a] Zhang Y-J. 2015a. Image engineering in China: 2014. Journal of Image and Graphics. 20(5): 585–598. [Zhang 2015b] Zhang Y-J. 2015b. A hierarchical organization of home video. Encyclopedia of Information Science and Technology (3rd Ed.), Mehdi Khosrow-Pour (ed.). Information Science Reference, Hershey-New York. Chapter 210 (2168–2177). [Zhang 2015c] Zhang Y-J. 2015c. Image fusion techniques with multiple-sensors. Encyclopedia of Information Science and Technology (3rd Ed.), Mehdi Khosrow-Pour (ed.). Information Science Reference, Hershey-New York. Chapter 586 (5926–5936). [Zhang 2015d] Zhang Y-J. 2015d. Statistics on image engineering literatures. Encyclopedia of Information Science and Technology (3rd Ed.), Mehdi Khosrow-Pour (ed.). Information Science Reference, Hershey-New York. Chapter 595 (6030–6040). [Zhang 2015e] Zhang Y-J. 2015e. Up-to-date summary of semantic-based visual information retrieval. Encyclopedia of Information Science and Technology (3rd Ed.), Mehdi Khosrow-Pour (ed.). Information Science Reference, Hershey-New York. Chapter 123 (1294–1303). [Zhang 2016] Zhang Y-J. 2016. Image engineering in China: 2015. Journal of Image and Graphics. 21(5): 533–543. [Zhang 2017] Zhang Y-J. 2017. Image engineering in China: 2016. Journal of Image and Graphics, 22(5): 563–573. [Zhangwx 2001] Zhang W X, Wu W Z, Liang J Y, et al. 2001. Rough Set Theory and Methods. Science Press, Beijing. [Zhangzl 2010] Zhang Z L, et al. 2010. Fuzzy Set Theory and Method, Wuhan University Press, Wuhan. [Zhao 1996] Zhao W Y, Nandhakumar N. 1996. Effects of camera alignment errors on stereoscopic depth estimates. PR, 29(12): 2115–2126. [Zheng 2012] Zheng Y, Zhang Y-J, Li X, et al. 2012. Action recognition in still images using a combination of human pose and context information. Proc. 19ICIP, 785–788. [Zhou 2016] Zhou D, Li X, Zhang Y-J. 2016. A novel CNN-based match kernel for image retrieval. Proc. 23ICIP, 2045–2049. [Zhu 2011a] Zhu Y F, Zhang Y-J. 2011a. Multi-view stereo reconstruction via voxels clustering and parallel volumetric graph cut optimization. SPIE 7872: 0S1–0S11. [Zhu 2011b] Zhu Y F, Torre F, Cohn J F, et al. 2011b. Dynamic cascades with bidirectional bootstrapping for action unit detection in spontaneous facial behavior. IEEE Trans. AC, 2(2): 79–91.

Index

2-D gradient space 95
2.5-D sketch presentation 20
3-D representation 22
abnormality detection 262
absolute pattern 126
action 249
action primitive 249
action posture 266
active fusion 181
active vision 25
active vision-based theoretical framework 32
activity 250
activity path (AP) 255, 257
algorithm implementation 18
anchor shot 229
artificial intelligence 14
attention region 239
bag of features model 164
bag of words model 164
Bayesian method 204
behavior 250
belief network 276
bibliography series 5
bibliography survey 5
bi-directional reflectance distribution function (BRDF) 75
bionics method 11
camera calibration 37
camera-focused talking heads 230
camera motion 222
central moment 215
classification of image techniques 6
color 214
color features 214
color graph 132
color histogram intersection 232
complementary information 183
computational theory 17
computer graphics (CG) 14
computer vision (CV) 10
constraint propagation 160
contour map 79
co-occurrence matrix 214
cooperation information 183
crossover 154
correlation coefficient 41
correlation function 41
correlation minimum 152
correlation model 184
correlation product 152
correspondence 40, 117
cross entropy 191
CT image 202
cumulative histogram 216
curvature scale-space (CSS) 218
cyclical action 264
decision-layer fusion 187
de-fuzzification 150, 152
depth of field 110
discrete relaxation labeling 160
disparity 38, 62
double-sub-graph isomorphism 137
D-S theory 205
dynamic belief networks (DBN) 276
dynamic geometry of surface form and appearance 30
dynamic pattern matching 125
dynamic programming 47
engineering method 11
entropy 191
epipolar line 42
epipolar plane 42
epipoles 42
expectation-maximization (EM) 168
Euler equation 88
evaluation of fusion results 188
event 250
evidence reasoning 205
feature-based matching 44, 118
feature-based registration 185
feature extraction 38
feature-layer fusion 187
frequency space correlation 118
fully reliable region 206
fused reliability function 208
fuzzy composition 150
fuzzy logic 148
fuzzy rules 150
fuzzy solution space 150
Gaussian mixture model (GMM) 256
Gaussian normalization 217
general Hough transform (GHT) 186
generative models 264
genetic algorithms 153
global motion 222
gradient masks 60
graph 132
graph isomorphism 137
gray-level disparity 190
hardware implementation 18
Hausdorff distance (HD) 120
high-level semantic description models 225
high-level task 29
highlight shot 233
histogram 214
histogram distance 215
histogram intersection 215
home video 238
homogeneous emission surface 79
horizontal binocular stereo vision 49
HSI transform fusion 194
ideal scatter surface 76
ideal specular reflection surface 77
identical graphs 135
image 1
image acquisition 38
image analysis (IA) 2
image brightness constraint equation 81, 101
image capture 38
image engineering (IE) 2, 3
image irradiance 73
image irradiance equation 81
image matching 117
image processing (IP) 2
image registration 118, 185
image technology 2
image understanding (IU) 2, 9
inertia equivalent ellipses 123
infrared image 201
integrate multiple information 181
International Affective Picture System (IAPS) 170
interrelating entropy 191
inverse -distance 50
isomorphic 136
isomorphic graphs 136
iterative Hough transform (IHT) 186
knowledge-based theoretical frameworks 30
labeling 137
labeling with sequential backtracking 140
Lambertian surface 76
landscape images 227
latent Dirichlet allocation (LDA) 166
line drawings 137
linear scale-space representation 253
local motion 223
long-time analysis 222
machine vision 13
magnitude correlation 119
main speaker close-ups (MSC) 230
map of average motion in shot (MAMS) 231
mapping 117
matching 117
matching based on inertia equivalent ellipses 123
matching based on raster 118
matching based on relationship 118
matching based on structure 118
matching in image space 117
matching in object space 117
matching masks 60
maximum composition 152
mean-square difference (MSD) 40
mean variance 191
membership function 148
minimum average difference function 41
minimum mean error function 41
min-max rule 151
mis-matching error 49, 50, 67
modified Hausdorff distance (MHD) 121
model-based vision 11
moment composition 152
monotonic fuzzy reasoning 150
motion analysis 222
motion energy image 271
motion field 85
motion vector 240
MPEG-7 218
multi-layer description model 225
multi-scale wavelet transforms 220
multi-sensor decision 204
multi-sensor information fusion 180
mutation 154
mutual information 192
network-symbol model 30
news item 229
news program 229
object interactive characterization 262
object layer 226
object matching 120
objective evaluation 189, 191
observation model 184
observer-centered coordinate system 90
online activity analysis 262
optical flow 86
optical flow constraint equation 86
optimal decomposition level 199
ordering constraint 47, 66
orthogonal tri-nocular matching 53, 55
parameter learning 174
path classification 261
pattern recognition (PR) 13
PCA-based fusion 194
perspective three point problem (P3P) 112
PET image 202
phase correlation 119
photometric stereo 72
pixel-layer fusion 187, 193
point of interest (POI) 251, 255
postural posture 266
primal sketch representation 20
probabilistic latent semantic analysis (pLSA) 166
probabilistic latent semantic indexing (pLSI) 167
probability Petri net (PPN) 277
probabilistic relaxation labeling 163
purposive vision 26
pyramid fusion 194
qualitative vision 26
radar image 201
redundant information 183
reference color 215
reflectance map 79, 97
region adjacency graph 162
region-based registration 185
relation matching 117, 128
relation representations 129
relative pattern 126
relaxation iterative equations 88
reliability function 206
remote sensing images 201
representation and algorithm 18
reproduction 154
robot vision 13
rotationally symmetric 99
rough set 208
scene analysis 144
scene radiance 73
selective vision 25
semantic gap 225
semantic interpretation 144
semantic labeling 159
sensor model 183
shading 93, 138
shape 218
shape from texture 102
short-time analysis 222
shot organization strategy 244
smooth constraint 97
space-time volume data 272
sparse matching points 47
spatial-temporal points of interest 252
speed profiling 261
sport match videos 233
SPOT whole spectrum images 193
state model 184
stereo matching 38, 118
stereo vision system 37
string matching 123, 222
structural matching 121
structure reasoning 139
sub-graph 133
sub-graph isomorphism 137, 222
subjective evaluation 189
sub-pixel 62
sub-pixel level disparity 64
sum of squared difference (SSD) 49
sum of SSD (SSSD) 51
supervised LDA model (SLDA) 175
surface normal 77
surface orientation 78, 82, 90, 102, 105
symbolic description 118
template and spring model 122
template matching 40
texel change in shape 106
texture 216
TM multi-spectrum images 195
topics 164
trajectory 235
vanishing line 108
vanishing point 107
variational inference 174
video clip 229
video event representation language (VERL) 280
video event markup language (VEML) 280
video shots 229
virtual fencing 261
visible-light image 201
visual computing theory 16
visual vocabulary 164
wavelet modulus maxima 220
wavelet transform fusion 195
weighted average fusion 193
zero-crossing points 45