Computer Vision and Graphics
Computational Imaging and Vision
Managing Editor MAX A. VIERGEVER
Utrecht University, Utrecht, The Netherlands
Editorial Board
GUNILLA BORGEFORS, Centre for Image Analysis, SLU, Uppsala, Sweden
THOMAS S. HUANG, University of Illinois, Urbana, USA
SABURO TSUJI, Wakayama University, Wakayama, Japan
Volume 32
Computer Vision and Graphics International Conference, ICCVG 2004, Warsaw, Poland, September 2004, Proceedings Edited by
K. Wojciechowski Silesian University of Technology, Gliwice, Poland
B. Smolka Silesian University of Technology, Gliwice, Poland
H. Palus Silesian University of Technology, Gliwice, Poland
R.S. Kozera The University of Western Australia, Crawley, Australia
W. Skarbek Warsaw University of Technology, Warsaw, Poland and
L. Noakes The University of Western Australia, Crawley, Australia
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN-10: 1-4020-4178-0 (HB)
ISBN-13: 978-1-4020-4178-5 (HB)
ISBN-10: 1-4020-4179-9 (e-book)
ISBN-13: 978-1-4020-4179-2 (e-book)
Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. www.springer.com
Printed on acid-free paper
All Rights Reserved © 2006 Springer No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Printed in the Netherlands.
Table of Contents, Part I
COMPUTER VISION
Method of Analysis of Low Resolution TV Images for Deformation Measurements of Historical Constructions . . . 1
M. Sklodowski, Z. Iwanow
Co-occurrences of Adapted Features for Object Recognition Across Illumination Changes . . . 7
D. Muselet, L. Macaire, J.-G. Postaire
A Fast and Robust Approach for the Segmentation of Moving Objects . . . 13
A. Al-Hamadi, R. Niese, B. Michaelis
Reconstruction Accuracy with 1D Sensors . . . 20
Y. Caulier, K. Spinnler
Shape Similarity to Alphanumeric Sign . . . 27
J. Lebiedź
Qualitative Characterization of Dynamic Textures for Video Retrieval . . . 33
R. Péteri, D. Chetverikov
Image Classifiers for Scene Analysis . . . 39
B. LeSaux, G. Amato
Vision-Based Analysis of the Tire Footprint Shape . . . 45
K. Jankowska, T. Krzyzynski, A. Domscheit
Measurement of the Length of Pedestrian Crossings through Image Processing . . . 51
M.S. Uddin, T. Shioyama
COMPUTATIONAL GEOMETRY
Shape Recovery of a Strictly Convex Solid from N-Views . . . 57
S. Collings, R.S. Kozera, L. Noakes (Special Session organized by: Ryszard Kozera and Lyle Noakes)
Joint Estimation of Multiple Light Sources and Reflectance from Images . . . 66
B. Mercier, D. Meneveaux (Special Session organized by: Ryszard Kozera and Lyle Noakes)
An Algorithm for Improved Shading of Coarsely Tessellated Polygonal Objects . . . 72
C. Singh, E. Walia
Multiresolution Analysis for Irregular Meshes with Appearance Attributes . . . 80
M. Roy, S. Foufou, F. Truchetet
Smooth Interpolation with Cumulative Chord Cubics . . . 87
R.S. Kozera, L. Noakes (Special Session organized by: Ryszard Kozera and Lyle Noakes)
A Parallel Leap-Frog Algorithm for 3-Source Photometric Stereo . . . 95
T. Cameron, R.S. Kozera, A. Datta (Special Session organized by: Ryszard Kozera and Lyle Noakes)
Noise Reduction in Photometric Stereo with Non-Distant Light Sources . . . 103
R.S. Kozera, L. Noakes (Special Session organized by: Ryszard Kozera and Lyle Noakes)
Hypergraphs in Diagrammatic Design . . . 111
E. Grabska, K. Grzesiak-Kopeć, J. Lembas, A. Łachwa, G. Ślusarczyk
GEOMETRICAL MODELS OF OBJECTS AND SCENES
3D Modeling of Outdoor Scenes from Omnidirectional Range and Color Images . . . 118
T. Asai, M. Kanbara, N. Yokoya
User-Controlled Multiresolution Modeling of Polygonal Models . . . 125
M. Hussain, Y. Okada, K. Niijima
Shape Similarity Search for Surfel-Based Models . . . 131
M.R. Ruggeri, D.V. Vranić, D. Saupe
Description of Irregular Composite Objects by Hyper-Relations . . . 141
J.L. Kulikowski
Automatic Face Synthesis and Analysis. A Quick Survey . . . 147
Y. Sheng, K. Kucharski, A.H. Sadka, W. Skarbek (Special Session organized by: Ryszard Kozera and Lyle Noakes)
3D Data Processing for 3D Face Modelling . . . 161
K. Ignasiak, M. Morgos, W. Skarbek, M. Tomaszewski (VISNET session organized by Wladyslaw Skarbek)
MOTION ANALYSIS, VISUAL NAVIGATION AND ACTIVE VISION Model of Deformable Rings for Aiding the Wireless Capsule Endoscopy Video Interpretation and Reporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 P.M. Szczypi´ nski, P.V.J. Sriram, R.D. Sriram, D.N. Reddy Single-Camera Stereovision Setup with Orientable Optical Axes . . . . . . . . . 173 L. Duvieubourg, S. Ambellouis, F. Cabestaing Feature-Based Correspondence Analysis in Color Image Sequences . . . . . . 179 A. Al-Hamadi, R. Niese, B. Michaelis A Voting Strategy for High Speed Stereo Matching: Application for Real-Time Obstacle Detection Using Linear Stereo Vision . . . . . . . . . . . 187 M. Harti, Y. Ruichek, A. Koukam Vehicle Detection Using Gabor Filters and Affine Moment Invariants from Image Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 T. Shioyama, M.S. Uddin, Y. Kawai Pointing Gesture Visual Recognition by Body Feature Detection and Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 S. Carbini, J.-E. Viallet, O. Bernier
IMAGE AND VIDEO CODING H.264 Based Coding of Omnidirectional Video . . . . . . . . . . . . . . . . . . . . . . . . 209 I. Bauermann, M. Mielke, E. Steinbach Experimental Comparison of Lossless Image Coders for Medical Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 A. Przelaskowski Discrete Orthogonal Transform for Gappy Image Extrapolation . . . . . . . . . 222 J. Polec, T. Karlub´ıkov´ a Bit-rate Control for Compression of Video with ROI . . . . . . . . . . . . . . . . . . . 228 W. Skarbek, A. Buchowicz, A. Pietrowcew, F. Pereira (VISNET session organized by Wladyslaw Skarbek) Intra-Frame Prediction for High-Pass Frames in Motion Compensated Wavelet Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 L. Cieplinski, J. Caball, S. Ghanbari (VISNET session organized by Wladyslaw Skarbek)
viii DCT-domain Downscaling for Transcoding MPEG-2 Video . . . . . . . . . . . . . 246 S. Dogan, S.T. Worrall, A.H. Sadka, A.M. Kondoz (VISNET session organized by Wladyslaw Skarbek)
COLOR AND MULTISPECTRAL IMAGE PROCESSING Metamer Set Based Measures of Goodness for Colour Cameras . . . . . . . . . . 252 A. Alsam, J.Y. Hardeberg Smoothing Jagged Spectra for Accurate Spectral Sensitivities Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259 A. Alsam, J.Y. Hardeberg Color Image Processing Using Generalized Weighted Vector Filters . . . . . . 267 R. Lukac, K.N. Plataniotis, A.N. Venetsanopoulos, B. Smolka Radiometric Calibration of a Multispectral Camera . . . . . . . . . . . . . . . . . . . 273 A. Mansouri, M. Sanchez, F.S. Marzani, P. Gouton Colour Reproduction Accuracy of Vision Systems . . . . . . . . . . . . . . . . . . . . . 279 H. Palus, D. Bereska Face Tracking Using Color, Elliptical Shape Features and a Detection Cascade of Boosted Classifiers in a Particle Filter . . . . . . . . . . . . . . . . . . . . . 287 B. Kwolek
IMAGE FILTERING AND ENHANCEMENT Automatic Contrast Enhancement by Histogram Warping . . . . . . . . . . . . . . 293 M. Grundland, N.A. Dodgson High Quality Deinterlacing Using Inpainting and Shutter-Model Directed Temporal Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301 D. Tschumperl´e, B. Besserer Virtual Restoration of Artworks Using Entropy-Based Color Image Filtering Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308 R. Lukac, K. Plataniotis, B. Smolka
VIRTUAL REALITY AND MULTIMEDIA APPLICATIONS Deformation and Composition of Shadow for Virtual Studio . . . . . . . . . . . . 314 Y. Manabe, M. Yamamoto, K. Chihara
ix Object Selection in Virtual Environments Using an Improved Virtual Pointer Metaphor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320 F. Steinicke, T. Ropinski, K. Hinrichs Super-resolved Video Mosaicing for Documents by Extrinsic Camera Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327 A. Iketani, T. Sato, S. Ikeda, M. Kanbara, N. Nakajima, N. Yokoya Multimedia Content Adaptation: May One Fit All? . . . . . . . . . . . . . . . . . . . 337 F. Pereira (VISNET session organized by Wladyslaw Skarbek)
BIOMEDICAL APPLICATIONS Center-Point Model of Deformable Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . 343 P.M. Szczypi´ nski 3D Visualization of Gene Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349 L. Zhang, X. Liu, W. Sheng The Multiple Image Stack Browser Suite and its Alignment Strategy . . . . 355 H. Hofmeister, C. G¨ otze, W. Zuschratter Fast 3D Pre-segmentation of Arteries in Computed Tomography Angiograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361 L. Fl´ orez-Valencia, F. Vincent, M. Orkisz Independent Component Analysis of Textures in Angiography Images . . . 367 E. Snitkowska, W. Kasprzak Detection on Non-parametric Lines by Evidence Accumulation: Finding Blood Vessels in Mammograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373 L. Chmielewski
IMAGE AND VIDEO DATABASES How Useful are Colour Invariants for Image Retrieval ? . . . . . . . . . . . . . . . . 381 G. Schaefer Image Retrieval for the WWW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387 B. Smolka, M. Szczepanski, K. Wojciechowski Extracting Semantic Information from Art Images . . . . . . . . . . . . . . . . . . . . 394 ˇ E. Sikudov´ a, M.A. Gavrielides, I. Pitas
PATTERN RECOGNITION Automatic Facial Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400 M. Sato, H. Murakami, M. Kasuga An Improved Detection Algorithm for Local Features in Gray-Level Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406 ´ A. Sluzek Unsupervised Scale-Space Texture Detector in Multi-Channel Images Based on the Structural Tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413 B. Cyganek Stem-end/Calyx Detection in Apple Fruits: Comparison of Feature Selection Methods and Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420 D. Unay, B. Gosselin Cascade of Operators for Facial Image Recognition and Indexing . . . . . . . . 426 W. Skarbek, K. Kucharski, M. Bober (VISNET session organized by Wladyslaw Skarbek)
COMPUTER ANIMATION MCFI-Based Animation Tweening Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 438 T. Sakchaicharoenkul Biomechanically Based Muscle Model for Dynamic Computer Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448 M. Dobˇsik, M. Frydrych Spline and Ideal: from Real to Virtual Sculptures, and Back . . . . . . . . . . . . 457 E. Bittar, O. Nocent, A. Heff A Common Feature Representation for Speech Frames and Image Contours . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. .. . . . W. Kasprzak, A. Okazaki, R. Seta (VISNET session organized by Wladyslaw Skarbek)
463
VISUALIZATION AND GRAPHICAL DATA PRESENTATION Rendering of Binary Alloys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469 P. Callet, A. Zymla On Visualization of Complex Image-Based Markup . . . . . . . . . . . . . . . . . . . . 477 J.W. Jaromczyk, M. Kowaluk, N. Moore
xi Visualizing Directional Stresses in a Stress Tensor Field . . . . . . . . . . . . . . . . 485 T. Jirka 3D Modelling of Large Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491 V. Sequeira, G. Bostr¨ om, M. Fiocco, D. Puig, A. Rossini, J.G.M. Gon¸calves (VISNET session organized by Wladyslaw Skarbek)
Table of Contents, CD-ROM COMPUTER VISION Local Image Structure Analysis Using Conic Sections . . . . . . . . . . . . . . . . . . 503 C. Perwass Gated Images: New Perspectives for Vision-Based Navigation . . . . . . . . . . . 509 ´ A. Sluzek, T.C. Seong Evolutionary Approach to Finding Iterated Function Systems for a Two Dimensional Image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516 A. Bielecki, B. Strug Masks and Eigenvectors Weights for Eigenfaces Method Improvement . . . 522 M. Kawulok A New Approach to Corner Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528 M. Sarfraz, A. Masood, M.R. Asim Generation of an Accurate Facial Ground Truth for Stereo Algorithm Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534 A. Woodward, P. Leclercq, P. Delmas, G. Gimel’farb Subpixel Accurate Segmentation of Small Images Using Level Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540 M. Stommel, K.-D. Kuhnert On the Accuracy of Selected Image Texture Segmentation Methods . . . . . 546 M. Strzelecki, A. Materka A SBAN Stereovision Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552 M. P´erez-Patricio, O. Colot, F. Cabestaing Morphological Normalization of Image Binary Cuts . . . . . . . . . . . . . . . . . . . 558 A. Chupikov, S. Mashtalir, E. Yegorova
xii Model Based Multi-View Active Contours for Quality Inspection . . . . . . . . 565 P. d’Angelo, C. W¨ ohler, L. Kr¨ uger
COMPUTATIONAL GEOMETRY Automatic Lens Distortion Estimation for an Active Camera . . . . . . . . . . . 575 O. Lanz A Robust Discrete Approach for Shape From Shading and Photometric Stereo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581 B. Kerautret Computation of Room Acoustics Using Programmable Video Hardware . . 587 M. Jedrzejewski, K. Marasek Hardware Implementation of EWA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593 A. Herout, P. Zemˇcik, J. Reptin The Perspective-N-Point Problem for Catadioptric Sensors: an Analytical Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599 J. Fabrizio, J. Devars Dynamic Shadow Map Regeneration with Extended Depth Buffers . . . . . . 607 R. Wcislo, R. Bigaj High Precision Texture Mapping on 3D Free-form Objects . . . . . . . . . . . . . 613 Y. Iwakiri, T. Kaneko Adaptive Z-buffer Based Selective Antialiasing . . . . . . . . . . . . . . . . . . . . . . . . 619 P. Rokita Morphological Normalized Binary Object Metamorphosis . . . . . . . . . . . . . . 626 M. Iwanowski Bubble Tree Drawing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633 S. Grivet, D. Auber, J.-P. Domenger, G. Melancon Surface Reconstruction of 3D Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642 X. Li, C.-Y. Han, W.G. Wee Feature -Based Registration of Range Images in Domestic Environments . . 648 M. Wu ¨nstel, T. Rofer ¨
GEOMETRICAL MODELS OF OBJECTS AND SCENES Assessment of Image Surface Approximation Accuracy Given by Triangular Meshes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655 O.M. van Kaick, H. Pedrini
xiii Non-Uniform Terrain Mesh Simplification Using Adaptative Merge Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662 F.L. Mello, E. Strauss, A. Oliveira, A. Gesualdi On Using Graph Grammars and Artificial Evolution to Simulate and Visualize the Growth Process of Plants . . . . . . . . . . . . . . . . . . . . . . . . . . 668 D. Nowak, W. Palacz, B. Strug Automatic Tessellation of Quadric Surfaces using Grassmann-Cayley Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674 F. Jourdan, G. H´egron, P. Mac´e
MOTION ANALYSIS, VISUAL NAVIGATION AND ACTIVE VISION Visual Person Tracking in Sequences Shot from Camera in Motion . . . . . . 683 P. Skulimowski, P. Strumillo A Novel Object Detection Technique in Compressed Domain . . . . . . . . . . . 689 A.M.A. Ahmad A Method for Estimating Dance Action Based on Motion Analysis . . . . . . 695 M. Naemura, M. Suzuki A Feature Based Motion Estimation for Vehicle Guidance . . . . . . . . . . . . . . 703 O. Ambekar, E. Fernandes, D. Hoepfel Fast Uniform Distribution of Sequences for Fractal Sets . . . . . . . . . . . . . . . . 709 V.M. Chernov Occlusion Robust Tracking of Multiple Objects . . . . . . . . . . . . . . . . . . . . . . . 715 O. Lanz Face Tracking Using Convolution Filters and Skin-Color Model . . . . . . . . . 721 P. Gejgus, P. Kubini A New Hybrid Differential Filter for Motion Detection . . . . . . . . . . . . . . . . . 727 J. Richefeu, A. Manzanera, Pedestrian Detection Using Derived Third-order Symmetry of Legs . . . . . . 733 L. Havasi, Z. Szl´ avik, T. Szir´ anyi Automatic and Adaptive Face Tracking using Color-based Probabilistic Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 740 S.-F. Wong, K.-Y.K. Wong
IMAGE AND VIDEO CODING Fast And Robust Object Segmentation Approach for MPEG Videos . . . . . 746 A.M.A. Ahmad, S.-Y. Lee Hybrid Orthogonal Approximation of Non-Square Areas . . . . . . . . . . . . . . . 752 J. Polec, T. Karlub´ıkov´ a, A. Bˇrezina Large Texture Storage Using Fractal Image Compression . . . . . . . . . . . . . . . 758 J. Stachera, S. Nikiel Chen and Loeffler Fast DCT Modified Algorithms Implemented in FPGA Chips for Real-Time Image Compression . . . . . . . . . . . . . . . . . . . . 768 A. D¸abrowska, K. Wiatr
COLOR AND MULTISPECTRAL IMAGE PROCESSING Automatic Landmark Detection and Validation in Soccer Video Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 774 A. LeTroter, S. Mavromatis, J.-M. Boi, J. Sequeira More than Color Constancy Non-Uniform Color Cast Correction . . . . . . . . 780 M. Chambah Image Segmentation Based on Graph Resulting from Color Space Clustering with Multi-Field Density Estimation . . . . . . . . . . . . . . . . . . . . . . . 787 W. Tarnawski Multi-Field Density Estimation: A Robust Approach for Color Space Clustering in Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794 W. Tarnawski Fast Color Image Segmentation Based on Levellings in Feature Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800 T. Geraud, G. Palma, N. VanVliet Segmentation-Based Binarization for Color Degraded Images . . . . . . . . . . . 808 C. Thillou, B. Gosselin Compact Color Video Signature By Principal Component Analysis . . . . . . 814 T. Leclercq, L. Khoudour, L. Macaire, J.G. Postaire, A. Flancquart Comparison of Demosaicking Methods for Color Information Extraction . 820 F. Faille
IMAGE FILTERING AND ENHANCEMENT Blind Extraction of Sparse Images from Under-Determined Mixtures . . . . 826 W. Kasprzak, A. Cichocki, A. Okazaki Interactive Contrast Enhancement by Histogram Warping . . . . . . . . . . . . . . 832 M. Grundland, N.A. Dodgson Radial Basis Function Use for the Restoration of Damaged Images . . . . . . 839 K. Uhlir, V. Skala Methods for Designing the Recursive FIR Filters . . . . . . . . . . . . . . . . . . . . . . 845 V.V. Myasnikov
VIRTUAL REALITY AND MULTIMEDIA APPLICATIONS Camera Positioning Support Based on Potential Field in a Static Virtual Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 851 A. Wojciechowski Generation of a Static Potential Field for Camera Positioning Support in a Virtual Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 857 A. Wojciechowski Atmosphere Reproduction of the Landscape Image by Texture Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 863 T. Bando, N. Kawabata (Special Session organized by: Ryszard Kozera and Lyle Noakes)
BIOMEDICAL APPLICATIONS Wavelet Methods in Improving the Detection of Lesions in Mammograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 869 P. Bargiel, A. Przelaskowski, A. Wroblewska On Application of Wavelet Transforms to Segmentation of Ultrasound Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 875 P. Kie´s Application of Image Processing Techniques in Male Fertility Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 881 L. Witkowski, P. Rokita
xvi Mathematical Morphology and Support Vector Machines for Diagnosis of Glaucoma on Fundus Eye Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 888 ´ K. Stapor, A. Brueckner, A. Swito´ nski Flow Reduction Marching Cubes Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 894 P. Krˇsek, Supervised and Unsupervised Statistical Models for Cephalometry . . . . . . 900 S. Aouda, M. Berar, B. Romaniuk, M. Desvignes Retrieving Thermal Medical Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 906 G. Schaefer, S.Y. Zhu, B. Jones Visualizing Articular Cartilage Using Expectation Maximization to Compare Segmentation of MR Images by Independent Raters . . . . . . . . 912 P.A. Hardy, J.W. Jaromczyk, P.J. Thacker Morphological Method of Microcalcifications Detection in Mammograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 921 M. Ustymowicz, M. Nieniewski Bias and Noise Removal from Magnitude MR Images . . . . . . . . . . . . . . . . . . 929 M. Kazubek
IMAGE AND VIDEO DATABASES Multi Camera Automatic Video Editing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 935 S. Sumec Clustering Method for Fast Content-Based Image Retrieval . . . . . . . . . . . . . 946 D. Kinoshenko, V. Mashtalir, E. Yegorova Fully Automated Identification and Segmentation of Form Document . . . . 953 S. Mandal, S.P. Chowdhury, A.K. Das, B. Chanda
PATTERN RECOGNITION Low Resolution Image Sampling for Pattern Matching . . . . . . . . . . . . . . . . . 962 R. Brunelli Soft-Computing Agents Processing Webcam Images to Optimize Metropolitan Traffic Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 968 A. Faro, D. Giordano, C. Spampinato Structural Object Recognition by Probabilistic Feedback . . . . . . . . . . . . . . . 975 A. Barta, I. Vajk
xvii Evaluating the Quality of Maximum Variance Cluster Algorithms . . . . . . . 981 K. Rzadca Deformable Grids for Speech Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987 K. Slot, H. Nowak Didactic Pattern Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 993 M. Szwoch, W. Malina Pattern Matching with Differential Voting and Median Transformation Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1002 M. Marszalek, P. Rokita Two-dimensional-oriented Linear Discriminant Analysis for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1008 M. Visani, C. Garcia, J.-M. Jolion
MODELING OF HUMAN VISUAL PERCEPTION 3D Human Model Acquisition from Uncalibrated Monocular Video . . . . . . 1018 E. Peng, L. Li A Mura Detection Based on the Least Detactable Contrast of Human Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1024 K. Taniguchi, K. Ueta, S. Tatsumi Using Multi-Kohonen Self-Organizing Maps for Modeling Visual Perception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1031 M. Collobert New Methods for Segmentation of Images Considering the Human Vision Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1037 A. Kuleschow, K. Spinnler
COMPUTER ANIMATION Interactive Character Animation in Virtual Environments . . . . . . . . . . . . . . 1043 P. Cichocki, J. Rzeszut Empathic Avatars in VRML for Cultural Heritage . . . . . . . . . . . . . . . . . . . . 1049 S. Stanek Closed Form Solution for C 2 Orientation Interpolation . . . . . . . . . . . . . . . . . 1056 V. Volkov, L. Li
xviii A Compression Scheme for Volumetric Animations of Running Water . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1063 B. Benes, V. Tesinsky Joining NURBS-based Body Sections for Human Character Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1069 D. Byrnes, L. Li Motion Recovery Based on Feature Extraction from 2D Images . . . . . . . . . 1075 J. Zhao, L. Li, K.C. Keong Audiovisual Synthesis of Polish Using Two- and Three-Dimensional Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1082 J. Belkowska, A. Glowienko, K. Marasek
VISUALIZATION AND GRAPHICAL DATA PRESENTATION Representation and Visualization of Designs with the Use of Hierarchical Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1088 P. Nikodem Evolutionary Approach for Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . 1094 M. Sarfraz, M. Riyazuddin, M.H. Baig Chaos Games in Color Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1100 H. Jak´ obczak An Effective Contour Plotting Method for Presentation of the Postprocessed Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1112 I. Jaworska Meshing Techniques for Generation of Accurate Radiosity Solution for Virtual Art Gallery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1118 M. Pietruszka, M. Krzysztofik
Late Papers A Prolongation-based Approach for Recognizing Cut Characters . . . . . . . . 1125 A. Luijkx, C. Thillou, B. Gosselin
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1131
Preface As the speed, capabilities, and economic advantages of modern digital devices continue to grow, the need for efficient information processing, especially in computer vision and graphics, dramatically increases. Growth in these fields stimulated by emerging applications has been both in concepts and techniques. New ideas, concepts and techniques are developed, presented, discussed and evaluated, subsequently expanded or abandoned. Such processes take place in different forms in various fields of the computer science and technology. The objectives of the ICCVG are: presentation of current research topics and discussions leading to the integration of the community engaged in machine vision and computer graphics, carrying out and supporting research in the field and finally promotion of new applications. The ICCVG is a continuation of the former International Conference on Computer Graphics and Image Processing called GKPO, held in Poland every second year in May since 1990, organized by the Institute of Computer Science of the Polish Academy of Sciences, Warsaw and chaired by the Editor of the International Journal of Machine Graphics and Vision, Prof. Wojciech S. Mokrzycki. The ICCVG 2004 gathered 354 Authors from 30 countries: Australia (9), Belgium (4), Brazil (10), Canada (3), China (4), Colombia (1), Czech Republic (11), France (70), Germany (33), Greece (2), China (2), Hungary (5), India (8), Italy (13), Japan (30), Kuwait (2), New Zealand (4), Norway (3), Pakistan (2), Poland (83), Portugal (1), Russia (2), Saudi Arabia (3), Singapore (2), Slovakia (7), Taiwan (2), Thailand (1), Ukraine (5), United Kingdom (19) and USA (13). The ICCVG 2004 Proceedings contain 163 papers, each accepted on the basis of reviews by three independent referees. Contributions are organized into following sessions corresponding to the scope of the conference: Computer Vision, Computational Geometry, Geometrical Models of Objects and Scenes, Motion Analysis, Visual Navigation and Active Vision, Image and Video Coding, Color and Multispectral Image Processing, Image Filtering and Enhancement, Virtual Reality and Multimedia Applications, Biomedical Applications, Image and Video Databases, Pattern Recognition, Modeling of Human Visual Perception, Computer Animation, Visualization and Graphical Data Presentation. The ICCVG 2004 has been organized by the Association for Image Processing, Polish-Japanese Institute of Information Technology and the Silesian University of Technology. The Association for Image Processing integrates the Polish community working upon the theory and applications of computer vision and graphics. This is done through organization of scientific and technical meetings, publishing activities, establishing relations with other organizations having similar objectives, and promoting image-processing-based methods in the industrial community.
xx The Polish-Japanese Institute of Information Technology was founded in 1994 as a result of an agreement between the governments of Poland and Japan and is a unique product of the merger of two cultures. PJIIT combines modern technology and team work culture of Japan with Polish traditions in mathematics and related disciplines. Now, it is one of the leading, Polish non-state universities and cooperates with a number of EU, US and Japanese universities. The Silesian University of Technology is one of the leading technical universities in Poland. It gained its respectable position due to both high level of education, teaching currently more than 30000 students, and the leading-edge quality of research pursued by the academic staff. I would like to thank all members of the Program Committee, as well as the additional reviewers, for their help in selecting and ensuring high quality of the papers. I would also like to thank Bernadeta Bonio, Jadwiga Hermanowicz, Joanna Naróg, Tomasz Lewicki and Paweł Wiemann for their commitment to the conference organization and administration. I am highly grateful to the Polish-Japanese Institute of Information Technology for including this conference into the series of scientific events commemorating its 10th anniversary, hosting the Conference in its modern premises and for the help in the conference organization. Finally, I would like to invite everyone to the next conference ICCVG 2006, which will take place in the Polish-Japanese Institute of Information Technology, Warsaw, Poland in 2006. Konrad Wojciechowski Chairman of ICCVG2004
Organization
Conference Chairs Conference Chair – K. Wojciechowski, (Poland) Co-Chairs: – S. Ido, (Japan) – J.L. Kulikowski, (Poland) – W.S. Mokrzycki, (Poland)
Conference Committee Members E. Bengtsson, (Sweden) P. Bhattacharya, (United States) A. Borkowski, (Poland) D. Chetverikov, (Hungary) L. Chmielewski, (Poland) R. Chora´s, (Poland) S. Dellepiane, (Italy) M. Doma´nski, (Poland) U. Eckhardt, (Germany) A. Gagalowicz, (France) E. Grabska, (Poland) H. Heijmans, (Netherlands) J.M. Jolion, (France) A. Kasi´nski, (Poland) R. Klette, (New Zealand) W. Kosi´nski, (Poland) R. Kozera, (Australia) H. Kreowski, (Germany) M. Kurzy´nski, (Poland) W. Kwiatkowski, (Poland) G. Levina, (Russia) R. Lukac, (Canada) V. Lukin, (Ukraine)
A. Materka, (Poland) H. Niemann, (Germany) M. Nieniewski, (Poland) L. Noakes, (Australia) M. Orkisz, (France) H. Palus, (Poland) M. Paprzycki, (United States) D. Paulus, (Germany) J. Piecha, (Poland) K. Plataniotis, (Canada) J. Roerdink, (Netherlands) P. Rokita, (Poland) R. Sara, (Czech Republic) V. Skala, (Czech Republic) B. Smolka, (Poland) J. Sołdek, (Poland) G. Stanke, (Germany) R. Tadeusiewicz, (Poland) V. Valev, (Bulgaria) T. Vintsiuk, (Ukraine) J. Zabrodzki, (Poland) M. Zaremba, (Canada)
Reviewers Chmielewski Leszek, (Poland) Cyganek Bogusław, (Poland) Datta Amitava, (India) Doma´nski Marek, (Poland) Eckhardt Ulrich, (Germany) Grabska Ewa, (Poland) Jolion Jean-Michel, (France) Kaczmarzyk Paweł, (Poland) Kasi´nski Andrzej, (Poland) Kosi´nski Witold, (Poland) Kozera Ryszard, (Australia) Kulikowski Juliusz Lech, (Poland) Kurzy´nski Marek, (Poland) Luchowski Leszek, (Poland) Lukac Rastislav, (Canada) Marasek Krzysztof, (Poland)
Materka Andrzej, (Poland) Niemann Heinrich, (Germany) Palus Henryk, (Poland) Paulus Dietrich, (Germany) Roerdink Jos, (The Netherlands) Sara Radim, (Czech Republic) Skomorowski Marek, (Poland) ´ Slusarczyk-Hliniak Graz˙ yna, (Poland) Smolka Bogdan, (Poland) Stąpor Katarzyna, (Poland) ´ Swierniak Andrzej, (Poland) Szczepa´nski Marek, (Poland) Tadeusiewicz Ryszard, (Poland) Wojciechowski Konrad, (Poland)
METHOD OF ANALYSIS OF LOW RESOLUTION TV IMAGES FOR DEFORMATION MEASUREMENTS OF HISTORICAL CONSTRUCTIONS
Marek SKŁODOWSKI and Zdzisław IWANOW
Institute of Fundamental Technological Research, ul. Świętokrzyska 21, 00-049 Warszawa, POLAND
Abstract: An attempt at using low resolution industrial TV images for remote displacement measurement in historical constructions is presented. To increase the resolution of the measurements, images are resized using cubic splines and correlated using the Fourier transform. The method is calibrated in the laboratory and in a XVI century church. The presented results show the feasibility of this approach for the monitoring of structural deformations.

Key words: historical constructions; deformation measurement; image analysis; correlation; subpixel accuracy

1. INTRODUCTION
Deformations of historical constructions are often attributed to crack formation. In such a case, various sensors are used for the measurement of crack openings. These are mostly LVDT sensors1, fiber optic sensors of the FBG2 and SOFO type3, and PSD sensors4. None of these sensors is contactless, however: they must be fixed to the structure in the vicinity of a crack. Quite often such an intervention is undesirable or not allowed because of the historical and artistic value of the element surface. For this reason a contactless, remote displacement measurement method based on industrial TV equipment was developed and tested in a laboratory and in-situ. The method uses an interpolating spline procedure for the magnification of low resolution images, which is the contribution of the second author.
2. MEASUREMENT METHOD AND IMAGE PROCESSING
Industrial TV equipment was chosen for its high level of standardisation, repeatability of properties, low price, and the possibility of transmitting TV frames to a computer using an external USB frame grabber and a radio transmission link. This makes the whole system versatile; its image acquisition part is fully portable and can be used for temporal displacement monitoring of various structures. Only the TV cameras and radio transmitters must be fixed to the structure, in locations which do not damage elements of historical importance. In the presented experiments a black and white TV camera with a 1/3 inch CCD matrix was used. Its resolution was only 352x288 pixels and it was equipped with an f = 60 mm lens. After recording, the images were inspected to define a region of interest incorporating meaningful elements of the scene and large enough with respect to the expected displacement, which means that a template defined within the region of interest before the displacements take place will still fit into this region after the displacements. When a displacement takes place, the region of interest covers the same pixels in subsequent images, while the search area recorded within these pixels is slightly shifted due to the object translation. A template image (from the first TV frame, which serves as the reference) is magnified using a 2D cubic spline interpolation procedure, and so are the consecutive images of the region of interest. These magnified images are correlated5 for each displacement step with the magnified template, using search area and template spectra calculated with the Discrete Fourier Transform. An example of the correlation function is shown in Fig. 1 (left).
Figure 1. Visualization of correlation function (left) and approximated correlation peak (right).
The resulting coordinates of the correlation peak are expressed in terms of pixels of the magnified images of the search area, which are a subpixel representation of the original TV frames. The resolution of the correlation peak position depends on the magnification factor applied to the image of the region of interest, as a consequence of resampling the original subimages after their spline approximation. A magnification factor equal to 15 was chosen in both the horizontal and vertical directions. To further increase the resolution of the correlation shift, an area of ±5 pixels around the correlation peak is approximated with a parabolic function and resampled with 10 times more interpolating points. Thus the final theoretical resolution of the correlation peak coordinates with respect to the original 352x288 pixel frame was 1/150 of a pixel. This interpolated surface is shown in Fig. 1 (right). The next section gives the results of the experiments and the accuracy achieved by the proposed method.
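For illustration, the interpolation-correlation procedure described above can be sketched as follows. This is a minimal sketch and not the authors' code: it assumes the template and search area are available as grey-level NumPy arrays, uses SciPy's cubic-spline zoom for the magnification, and replaces the resampled parabolic surface fit with a simpler one-dimensional parabolic refinement per axis; all function and variable names are illustrative.

```python
import numpy as np
from scipy import ndimage

def subpixel_shift(template, search_area, magnification=15):
    """Estimate the (row, col) shift of search_area w.r.t. template,
    in pixels of the original (non-magnified) TV frame."""
    # 1. Magnify both sub-images with 2D cubic-spline interpolation (order 3).
    t = ndimage.zoom(template.astype(float), magnification, order=3)
    s = ndimage.zoom(search_area.astype(float), magnification, order=3)

    # 2. Cross-correlate in the spectral domain (Discrete Fourier Transform).
    corr = np.fft.ifft2(np.fft.fft2(s) * np.conj(np.fft.fft2(t, s.shape))).real

    # 3. Integer peak position in magnified-pixel units.
    py, px = np.unravel_index(np.argmax(corr), corr.shape)

    # 4. Parabolic refinement around the peak (simplified 1D fit per axis).
    def refine(profile, i):
        if 0 < i < len(profile) - 1:
            denom = profile[i - 1] - 2.0 * profile[i] + profile[i + 1]
            if denom != 0:
                return i + 0.5 * (profile[i - 1] - profile[i + 1]) / denom
        return float(i)

    row = refine(corr[:, px], py) / magnification
    col = refine(corr[py, :], px) / magnification
    return row, col
```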
3. EXPERIMENTS
The first calibration experiment was performed in a laboratory. A TV camera with a 2.4 GHz radio transmitter was placed on an optical bench and fixed to a table equipped with a micrometric screw allowing the camera to be translated in the direction perpendicular to its optical axis. As the region of interest, a part of a PC computer box situated 11.4 m from the camera was chosen. TV frames were recorded with the USB frame grabber connected to the 2.4 GHz radio receiver. The camera was translated several times perpendicularly to its axis, up to a distance of 8 mm, which is much more than expected for in-situ crack measurements. At each translation step the image was recorded and then processed according to the interpolation-correlation procedure described in the previous section. The results given in Fig. 4 (left) also show a linear fit to the experimental data. In-situ calibration experiments were performed in the XVI century church in Skępe, where various kinds of structural damage occur. The images presented in Fig. 2 show cracks developing on the left- and right-hand sides of the Triumph Arch of the church. Regions of interest were chosen near the middle of the presented images, with templates covering the cracks. Translations of the camera in these experiments were up to 2.5 mm, in a direction parallel to the arch surface, which was not perpendicular to the camera optical axis.
Figure 2. Left and right hand side of Triumph Arch in Skępe.
The measurement setup was similar to that of the laboratory experiment, except that the TV images were transmitted directly to the USB frame grabber without the radio transmission link. The distance from the camera to the wall surface was different in the left- and right-side experiments, but was of the same order of magnitude as in the laboratory. Figure 3 shows the recording equipment used in the church. Results for the right-hand side of the arch are presented in Fig. 4 (right), together with a linear regression fit.
Figure 3. Recording of TV images in Skępe.
Figure 4. Calibration curves for tests: left - in-lab, right - right-hand side of Triumph Arch.
Table 1 summarizes the results of the three tests, showing good linearity of the measurements and measurement errors of the order of 0.1 mm or less (standard deviation and error of fit for zero displacement). Thus an accuracy of 1/20 of a pixel was achieved for remote measurements with a low resolution industrial TV camera, without any additional markers or patterns placed on the surface of the structure. This accuracy is lower than in the case of fiber optic or LVDT sensors, but without the drawback of attaching sensors to the surface of valuable wall paintings.

Table 1. Summary of calibration results.
place            R2 of linear fit    σ [mm]    error of fit at zero [mm]
laboratory       0.998               0.10      -0.05
in situ (left)   0.993               0.07      -0.06
in situ (right)  0.993               0.07      +0.02
CONCLUSIONS

Calibration experiments performed in the laboratory and on a real historical structure show that it is possible to measure deformations of structural elements of historical/artistic value by remote observation with low resolution TV cameras. For 352x288 resolution images recorded with a 60 mm lens and a distance to the structure of 11-12 m, the accuracy of the displacement measurements was 0.1 mm. It is thus expected that increasing the accuracy by a factor of four is feasible for practical applications. The applied spline interpolation and spectral domain correlation of images result in a 1/20 pixel accuracy of the correlation peak position calculated from the original TV frames, without placing artificial markers or patterns on the surface of the structure.
REFERENCES

1. P. P. Rossi and C. Rossi, Surveillance and monitoring of ancient structures: recent developments, in: Structural Analysis of Historical Constructions, edited by P. Roca, J. L. Gonzales, E. Oniate and P. B. Lourenco (CIMNE, Barcelona, 1999), pp. 163-177.
2. M. P. Whelan, D. Albrecht and A. Capsoni, Remote structural monitoring of the cathedral of Como using an optical fiber Bragg sensor system, SPIE Int. Symp. on Smart Structures and Materials, San Diego, 242-252 (March 2002).
3. D. Inaudi, N. Casanova and B. Glisic, Long-term deformation monitoring of historical constructions with fiber optics sensors, in: Historical Constructions, edited by P. B. Lourenco and P. Roca (Guimaraes, Portugal, 2001), pp. 421-430.
4. R. Kozłowski, EC Project "Friendly Heating", Contract EVK4-CT-2001-00067, http://www.heritage.xtd.pl/friendly_heating/index.html.
5. J. P. Lewis, Fast template matching, Proc. of CIPPRS Vision Interface, Quebec, 120-123 (May 1995).
CO-OCCURRENCES OF ADAPTED FEATURES FOR OBJECT RECOGNITION ACROSS ILLUMINATION CHANGES
Damien Muselet, Ludovic Macaire and Jack-Gérard Postaire
Abstract
In this paper, we propose an original approach which allows to recognize objects in color images acquired under uncontrolled illumination conditions. For each pair of images to compare, the scheme consists in evaluating specific color features adapted to this pair. These adapted features are evaluated so that the distributions of adapted colors in the two images are similar only when they contain the same object. Then we propose to analyze the spatial co-occurrences between the adapted features to compute the image indices. Experimental tests on a public image database show the efficiency of this approach in the context of object recognition across illumination changes.
Keywords:
color, object recognition, adapted features, illumination changes, co-occurrence matrices.
1.
INTRODUCTION
Object searching in a database of color images which is a particular problem of the color image retrieval, is identical to the appearance based object recognition. In this framework, the recognition problem can be stated in terms of finding among all the target images of a database, those which contain the same object as that represented in the query image. In this paper, we specifically address the problem of recognizing objects when they are lighted by different illuminations during the image acquisitions, by considering that the differences are restricted to the temperature and the intensity of the used illuminations. Two cases exist : when the query and target images contain the same object lighted by two different illuminations, the images are similar (images (a)-(b) of figure 1). When the query and target
images do not represent the same object, they are different (images (a)-(c) or (b)-(c) of figure 1).
Figure 1. Images of objects: (a), (b), (c).
Since the colors of the objects in the images are not stable across illumination conditions, the effect of these changes has to be taken into account by the recognition scheme. Thus, each pixel P has to be characterized not by a color vector c(P) whose coordinates are the color component levels (c^R(P), c^G(P), c^B(P))^T, but by a color feature vector. In this context, the classical approach consists in evaluating color invariant vectors which are as insensitive as possible to illumination changes (Funt et al., 1998). These approaches are based on illumination change models which try to describe the variations of the color vectors of the pixels under illumination changes (Gevers and Smeulders, 1999). Unfortunately, these models are based on such restrictive assumptions about illuminations and acquisition devices that they lead to invariant vectors which are not totally independent of the illumination (Funt et al., 1998; Finlayson and Schaefer, 2001). That is why we propose another approach, which consists in considering each pair constituted of the query image and one of the target images, and in evaluating feature vectors adapted to each of these pairs. These adapted feature vectors, presented in the third section, are derived from an original illumination change model described in the second section. In this context, the image indexing scheme consists in extracting robust and efficient characteristic indices from the target and query images. Object recognition is performed by means of a matching scheme which compares the index of the query image with those of the target images. The matching scheme is based on a similarity measure between these image indices. The target images are ranked with respect to their similarity measures with the query image, in order to determine those which contain the same object as that contained in the query image. One of the most widely used image indices is the histogram of the adapted feature vectors (Swain and Ballard, 1991). Since histograms do not represent the spatial distribution of the adapted feature vectors in an image, we introduce, in the fourth section, a spatio-colorimetric image index based on the adapted feature co-occurrence matrices, which simultaneously take into account the adapted features of the pixels and their spatial interactions.
In order to assess the performance of the proposed object recognition scheme, we carry out experimental tests on a public image database and compare, in the last section, the obtained recognition results with those obtained by classical approaches.
2. ILLUMINATION CHANGES AND RANK MEASURES
Let us consider each of the three pairs of query and target color component images, denoted (I^k_que, I^k_tar), k = R, G, B, that can be extracted from the pair (I_que, I_tar) constituted of similar query and target color images. In each color component image I^k, each pixel P is characterized by its color component level c^k(P). The proposed model of illumination changes considers that the level c^k(P_tar) of the pixel P_tar observing an elementary surface lighted by the target illumination E_tar(λ) in the target color component image I^k_tar is expressed from the level c^k(P_que) of the pixel P_que observing the same elementary surface lighted by the illumination E_que(λ) in the query color component image I^k_que by:

c^k(P_tar) = f^k_{que,tar}(c^k(P_que)) + ρ^k(P_tar),   k = R, G, B.   (1)
The function f^k_{que,tar} is a monotonic increasing function associated with each pair of query and target color component images (I^k_que, I^k_tar). The function ρ^k is a bias function which depends on the pixel P_tar in the target color component image I^k_tar.

Let R^k[I](P) denote the rank measure of the pixel P within the color component image I^k, defined as (Hordley et al., 2002):

R^k[I](P) = ( Σ_{i=0}^{c^k(P)} H^k[I](i) ) / ( Σ_{i=0}^{N-1} H^k[I](i) ),   k = R, G, B,   (2)

where H^k[I](i) is the number of pixels characterized by the level i in the component image I^k and N is the number of levels used to quantize the color components.

When the two considered colour images are similar, the functions f^k_{que,tar}, k = R, G, B, of equation (1) do not modify the rank measures of the pixels which represent the same elementary surfaces within the two colour component images. The function ρ^k represents the possible modifications of the rank measures of the pixels which represent the same elementary surfaces within the two colour component images. These possible rank measure modifications are the consequences of illumination changes between the similar query and target images.

From our illumination change model, we assume that the rank measure of the pixel P_que (P_tar respectively) within the color component image I^k_que (I^k_tar respectively) is, among the rank measures of all the pixels of I^k_que (I^k_tar respectively), the closest to the rank measure of the pixel P_tar (P_que respectively) within the color component image I^k_tar (I^k_que respectively):

R^k[I_tar](P_tar) − R^k[I_que](P_que) = min_{P ∈ I^k_que} ( R^k[I_tar](P_tar) − R^k[I_que](P) ),
R^k[I_que](P_que) − R^k[I_tar](P_tar) = min_{P ∈ I^k_tar} ( R^k[I_que](P_que) − R^k[I_tar](P) ).   (3)

We assume that two pixels which represent the same elementary surface in two similar color component images respect equation (3). The pairs of pixels (P_que, P_tar) which verify this equation are called hereafter corresponding pixels of the pair of color component images (I^k_que, I^k_tar).
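As an illustration of equation (2), the rank measure of a pixel is simply the normalized cumulative histogram of its colour component image evaluated at that pixel's level. The short sketch below (an assumed implementation, not taken from the paper) computes it for every pixel of one component image given as an integer NumPy array.

```python
import numpy as np

def rank_measures(channel, n_levels=256):
    """R^k[I](P) for every pixel P of one colour component image `channel`
    (a 2D array of integer levels in [0, n_levels - 1])."""
    hist = np.bincount(channel.ravel(), minlength=n_levels).astype(float)
    cumulative = np.cumsum(hist)              # sum_{i=0..c} H^k[I](i)
    return cumulative[channel] / cumulative[-1]
```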
3. ADAPTED FEATURE VECTORS
We propose to characterize the pixels P by adapted feature vectors so that only the vectors of corresponding pixels in the two images are equal. Therefore, we independently analyze each pair of color component images. Within them, we detect the corresponding pixels thanks to equation (3) and label them with the same adapted feature levels. When the query and target images are similar, the adaptation scheme detects the same pairs of corresponding pixels in the three pairs of color component images, and these pixels are characterized by identical adapted feature vectors. When the two images are different, there is no reason for the feature vectors of many pixels in the two images to be equal. This scheme makes it possible to discriminate the case when the images are similar from the case when they are different. Once the pixels in the query and target images to compare are characterized by adapted feature vectors, the effects of illumination changes between the image acquisitions are eliminated. We propose to analyze the spatial co-occurrences between the adapted features of the pixels to compare the images.
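One possible way to search for candidate corresponding pixels, given here only as an illustration and not as the authors' implementation, is to look up, for every rank measure of one image, the closest rank measure in the other image; the condition of equation (3) then requires this nearest-neighbour relation to hold in both directions before two pixels are labelled with the same adapted feature level. The helper below sketches one direction of that search; names are illustrative.

```python
import numpy as np

def nearest_rank_index(rank_src, rank_dst_sorted):
    """For each source rank measure, the index of the closest value in the
    sorted destination rank measures (one half of the mutual condition
    of equation (3))."""
    pos = np.searchsorted(rank_dst_sorted, rank_src)
    pos = np.clip(pos, 1, len(rank_dst_sorted) - 1)
    left = rank_dst_sorted[pos - 1]
    right = rank_dst_sorted[pos]
    return np.where(rank_src - left <= right - rank_src, pos - 1, pos)
```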
4. INDEXING WITH NORMALIZED ADAPTED FEATURE CO-OCCURRENCE MATRICES
The adapted feature co-occurrence matrices are a generalization of the grey-level co-occurrence matrix proposed by Haralick (Haralick, 1979). Let us denote M_I^{k,k'} the adapted feature co-occurrence matrix which measures the spatial interaction between the adapted features l^k and l^{k'} in the image I. The cell M_I^{k,k'}(u, v) of this matrix indicates the number of occurrences that a pixel P' in the image I, whose adapted level l^{k'}(P') is equal to v, is located in the 8-neighborhood of a pixel P whose adapted level l^k(P) is equal to u. The index
of the image I is composed of the six following adapted feature co-occurrence matrices: M_I^{R,R}, M_I^{R,G}, M_I^{R,B}, M_I^{G,G}, M_I^{G,B} and M_I^{B,B}. The proposed similarity measure between the query image and the target image is the mean value of the six intersections between the corresponding normalized co-occurrence matrices (Muselet et al., 2002). Thus, the more similar the spatial disposition of the adapted features within the two images is, the closer to 1 the value of the similarity criterion between these two images.
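A sketch of how one such matrix and the intersection-based similarity could be computed is given below. This is an assumed implementation for illustration only (the exact normalization and the averaging over the six matrices follow Muselet et al., 2002), with the adapted levels stored as small integer label images.

```python
import numpy as np

def cooccurrence(labels_k, labels_kp, n_levels):
    """M_I^{k,k'}: counts of a pixel with adapted level v in channel k'
    lying in the 8-neighbourhood of a pixel with adapted level u in channel k."""
    m = np.zeros((n_levels, n_levels), dtype=np.int64)
    h, w = labels_k.shape
    for dy, dx in [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                   (0, 1), (1, -1), (1, 0), (1, 1)]:
        # Slice out all valid (centre, neighbour) pairs for this offset.
        u = labels_k[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
        v = labels_kp[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
        np.add.at(m, (u.ravel(), v.ravel()), 1)
    return m

def matrix_intersection(m1, m2):
    """Intersection of two normalized co-occurrence matrices: close to 1 when
    the spatial layout of the adapted features is similar in both images."""
    return np.minimum(m1 / m1.sum(), m2 / m2.sum()).sum()
```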
5. EXPERIMENTAL RESULTS AND CONCLUSIONS
In order to demonstrate the improvement brought by the intersection between adapted feature co-occurrence matrices for object recognition across illumination changes, we use a public database published by the University of East Anglia (Finlayson and Schaefer, 2001; UEA). Its 336 images contain 28 individual colorful designs lighted by one of the three available illuminations and acquired with the same viewing conditions by one of four different cameras (see figure 1). Finlayson demonstrates that the object recognition results obtained by the intersection between histograms of invariant vectors processed by the greyworld normalization or by 1D-histogram equalization outperform those obtained by classical object recognition methods (Finlayson and Schaefer, 2001; Finlayson et al., 2003). Hence, we propose to compare the results obtained by these two schemes with those obtained by the intersection between the adapted feature co-occurrence matrices. For this purpose, we use the same test protocol as that described in (Finlayson and Schaefer, 2001). For each tested method, the Average Match Percentile (AMP) is evaluated as:
AMP = (1 / N_retrieval) × Σ_{j=1}^{N_retrieval} (N_target − Rank(j)) / (N_target − 1),   (4)
where N_retrieval is the total number of retrievals (N_retrieval = 672), N_target is the number of target images analyzed for each retrieval (N_target = 28), and Rank(j) is the rank obtained by the similar target image in the j-th retrieval. The analyzed indexing scheme reaches perfect object recognition when the AMP reaches 100%.
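A direct transcription of equation (4), expressed here as a percentage and with illustrative names, could look as follows.

```python
def average_match_percentile(ranks, n_target=28):
    """ranks[j] is the rank (1 = best) of the similar target image in the
    j-th retrieval; n_target is the number of target images per retrieval."""
    return 100.0 * sum((n_target - r) / (n_target - 1.0) for r in ranks) / len(ranks)
```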
Table 1 shows the AMP obtained by each object recognition method with the UEA database; the intersection between the adapted feature co-occurrence matrices provides better results than those obtained by the intersection between classical invariant feature vector histograms for object recognition across illumination changes.

Table 1. Average Match Percentiles obtained by different object recognition methods with the UEA database. M is set to 16.
Object recognition method                                                                            AMP
Invariant feature vector histograms (greyworld) (Finlayson and Schaefer, 2001)                       93.96
Invariant feature vector histograms (equalization) (Finlayson et al., 2003; Hordley et al., 2002)    96.72
Adapted feature co-occurrence matrices                                                               99.17
These result improvements can be explained by three main points. First, the adapted feature vector processing is based on our original model of illumination changes, which uses less restrictive assumptions than those used by the classical models which lead to invariant feature vectors. Secondly, these adapted feature vectors depend on the pair of query and target images constructed during the retrieval, whereas the invariant feature vectors are determined by independently considering the query and target images. Finally, the co-occurrence matrices take into account both the distribution of the adapted feature vectors and the spatial interactions between them in the images, while the histograms only represent the distributions of the considered invariant feature vectors.
REFERENCES
Finlayson, G., Hordley, S., Schaefer, G., and Tian, G. (2003). Illuminant and device invariant colour using histogram equalisation. In Proc. of the IS&T/SID Eleventh Color Imaging Conf., pages 205–211, Scottsdale, USA.
Finlayson, G. and Schaefer, G. (2001). Colour indexing across devices and viewing conditions. In Proceedings of the 2nd Int. Workshop on Content-based MultiMedia Indexing, pages 215–221, Brescia, Italy.
Funt, B., Barnard, K., and Martin, L. (1998). Is machine colour constancy good enough? In Proceedings of the 5th European Conference on Computer Vision, pages 445–459.
Gevers, T. and Smeulders, A. (1999). Color-based object recognition. Pattern Recognition, 32:453–464.
Haralick, R. (1979). Statistical and structural approaches to textures. Proceedings of the IEEE, 67(5):786–804.
Hordley, S., Finlayson, G., Schaefer, G., and Tian, G. (2002). Illuminant and Device Invariant Colour Using Histogram Equalisation. Technical report SYS-C02-16, School of Information Systems, University of East Anglia, Norwich, United Kingdom.
Muselet, D., Macaire, L., and Postaire, J. (2002). A new approach for color person image indexing and retrieval. Machine Graphics & Vision, 11(2/3):257–283.
Swain, M. J. and Ballard, D. H. (1991). Color indexing. Int. Jour. of Computer Vision, 7(1):11–32.
UEA. http://vision.doc.ntu.ac.uk/datasets/uncalibimdb/database.html.
A FAST AND ROBUST APPROACH FOR THE SEGMENTATION OF MOVING OBJECTS
Ayoub K. Al-Hamadi; Robert Niese, Bernd Michaelis Institute for Electronics, Signal Processing and Communications (IESK) Otto-von-Guericke-University Magdeburg D-39016 Magdeburg, P.O. Box 4210 Germany
Abstract:
This paper proposes a technique for the automatic extraction of moving objects and the suppression of the remaining errors under disturbed image situations with a static camera. In this technique, we apply a modified difference-image-based approach for the segmentation of moving objects in video sequences. The second part of the paper examines the problem of suppressing the remaining errors by means of morphological, separation and shadow detection algorithms. The efficiency of the suggested approach for moving object segmentation is demonstrated here on the basis of the analysis of strongly disturbed image sequences.
Key words:
Segmentation of moving objects, Video sequence analysis
1. INTRODUCTION
Video object segmentation is required by numerous applications, ranging from high-level computer vision tasks1 and motion analysis3 to second-generation video coding2. In motion analysis or object tracking it is desirable to apply automatic techniques for the detection of moving objects from image sequences. Several kinds of methods have been suggested3,4,5,8,9. Many approaches to moving object detection for traffic monitoring and video surveillance proposed in the literature are based on background suppression methods4,8. This is because the difference image (DI) is produced quickly by simple subtraction. Using this DI, regions where the difference values are large are considered to be moving objects. Since real-time processing and robustness of the segmentation of moving objects are essential for video surveillance and tracking analysis,
a modified difference-image-based (MDI) approach is used in this paper. This approach is based on an AND operation between two successive difference images. This procedure allows the extraction of arbitrary objects. The elimination of remaining errors (holes, outliers or fusion of regions, and cast shadows) takes place by means of morphological, separation and shadow detection algorithms. The shadow detection algorithm is based on the use of the color structure code (CSC)11. A robust segmentation of moving image regions is obtained by this approach despite the influence of cast shadows and the changes of lighting conditions that often occur in real environments. The efficiency of the suggested method for the segmentation of moving objects is shown here by the analysis of strongly disturbed real image sequences (traffic scenes). The resulting image regions represent the Motion-Blobs (MB) that can be used for the solution of the correspondence problem in the tracking analysis.
2. THE MOVING OBJECT SEGMENTATION
2.1 Optical flow for object segmentation
In principle, moving objects can be segmented from the background by computing the optical flow. The transition borders between regions of discontinuity are used here for separating the scene into moving objects and stationary background, as long as the computed vector field is reliable (smooth). The differential method is used for the computation of the optical flow in the image sequence; it is sensitive to brightness fluctuations between consecutive images and to superimposed noise6,7,11. Hence, in the case of such disturbances, a false motion vector field is computed that no longer describes the actual movement (Fig. 3 A). The problem results, among other things, from the so-called short-range mechanism, whereby only a few pixels exert influence on the computation of a vector. In order to structure the vector field, an iterative procedure is introduced, which first determines the speed vectors. This causes a clearer separation between areas that belong to the moving object and those that pertain to the background. However, the obtained result is still incorrect due to noise and textures. To improve the evaluation, a special normalisation can be conducted, which ensures that only pixels residing within the moving zone have influence on the optical flow11. In simple sequences this algorithm supplies a feasible result after approximately 30-40 iterations. However, convergence cannot be proven mathematically for the optical flow6. A higher accuracy in the analysis of real image sequences cannot be reached using this approach (optical flow): even for the analysis of simple synthetic objects (only a homogeneous synthetic texture and only a small translation as synthesised movement), a lower accuracy limit of 5% is reached6,11. The flow vector field of this synthetic object is not well structured and has inconsistent vector lengths and angles, although the movement of the object is homogeneous. Improved results of the moving object segmentation can be obtained via computation of motion vector fields by means of full-search block matching (BM)3. The improvement of this procedure compared to optical flow is reached by long-range mechanisms (more information for the computation of the motion vectors). A disadvantage is the high running time of this algorithm, because the computation of the motion vector takes place via a surface-based similarity measure. The segmentation of moving objects from a stationary background by means of the transition borders between regions (i.e. discontinuities) of the displacement vector field does not represent an optimal solution for real image scenes, because the moved blocks and the stationary background do not indicate the accurate position of the moved objects. To accurately calculate the zero crossings between the moved blocks and the background, a hierarchical BM can be used3. However, the running time of this algorithm is nondeterministic and thus not suitable for real time. This motivates the next part of the paper.
2.2 A modified difference-image-based approach
Compared with the previously mentioned methods for segmenting moving objects, a difference image scheme is a simple way to detect moving objects in a scene, because the difference image is produced quickly by simple subtractions. Thereby, the pixel changes in the image are detected exactly, provided the change is caused by movement and not by noise artifacts. It has to be emphasised, however, that the transition borders between regions (i.e. discontinuities) are not cleaned up by this approach, and the zero crossings are not indicated in the resulting difference image (DI). A consequence is that the segmentation mask does not describe the object shape and position. This is the starting point for the following suggested approach for the segmentation of moving objects. A modified approach (MDI) has been developed for moving object segmentation. Instead of using temporal derivatives, two consecutive difference images are combined. Each of these difference images is created by subtracting two successive images. Subsequently, a binary threshold is applied and the two binary images are combined with an "AND" operator (Eq. 1), where T denotes the binarization threshold:

\[ \mathrm{MDI}_{t+1}(x,y) = \big[\,|I(x,y,t) - I(x,y,t+1)| > T\,\big] \;\wedge\; \big[\,|I(x,y,t+1) - I(x,y,t+2)| > T\,\big] \qquad (1) \]
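A minimal sketch of the MDI computation described above (an assumed implementation; the threshold value and the NumPy-based formulation are choices made here, not the authors'):

```python
import numpy as np

def mdi(frame0, frame1, frame2, threshold=20):
    """Modified difference image from three consecutive greyscale frames.

    Each difference image is binarized and the two binary masks are
    combined with a logical AND, as in Eq. (1).
    """
    d01 = np.abs(frame0.astype(np.int16) - frame1.astype(np.int16)) > threshold
    d12 = np.abs(frame1.astype(np.int16) - frame2.astype(np.int16)) > threshold
    return np.logical_and(d01, d12)
```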
Figure 1. The MDI approach for the determination of moving regions (Motion-Blobs).
This leads to the image MDI (Fig. 1), in which the exact object position (in the middle image) is determined. Even though potential regions of moving objects can be determined comfortably, the MDI approach requires extensive post-processing. In the case of weakly textured objects there are only very small changes in the difference images; hence, after applying a threshold to the difference images, holes arise. Image refinement focuses on removing these holes and smoothing contours. Morphological operations are applied to close the appearing holes. To reach the best results, a suitable squared structural element (SE) has to be chosen in the following way:
- Closing-Dilation with a larger SE to connect regions
- Closing-Erosion with a smaller SE to separate adjacent objects
To eliminate outliers and remove remaining errors, each detected region is smoothed and possibly separated if a certain criterion is fulfilled. Contour-based erosion is suited to this task. In this procedure, all pairs (p_i, p_j) of contour points are determined that have a Euclidean distance less than d_min; for each of these pairs, the line between p_i and p_j is erased from the binary region mask. This does not only lead to an erosion of the region contour but also to a separation of superficially connected parts of the region:

\[ \forall (p_i, p_j): \ \|p_i - p_j\| < d_{min}, \quad (p_i, p_j) \in \{p_1, \ldots, p_n\} \qquad (2) \]
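A naive sketch of this contour-based erosion/separation step (an assumed implementation; the contour extraction and the line rasterisation below are simplifications, not the authors' procedure):

```python
import numpy as np

def contour_points(mask):
    """Foreground pixels with at least one 4-connected background neighbour."""
    padded = np.pad(mask, 1)
    inner = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
             padded[1:-1, :-2] & padded[1:-1, 2:])
    return np.argwhere(mask & ~inner)

def contour_erosion(mask, d_min=5):
    """Erase the line between every pair of contour points closer than d_min (Eq. 2)."""
    mask = mask.astype(bool).copy()
    pts = contour_points(mask)
    for a in range(len(pts)):
        for b in range(a + 1, len(pts)):
            p, q = pts[a], pts[b]
            if np.linalg.norm(p - q) < d_min:
                # rasterise the segment p-q and remove it from the mask
                n = int(max(abs(p[0] - q[0]), abs(p[1] - q[1]))) + 1
                rr = np.round(np.linspace(p[0], q[0], n)).astype(int)
                cc = np.round(np.linspace(p[1], q[1], n)).astype(int)
                mask[rr, cc] = False
    return mask
```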
This algorithm is further used for the smoothing of the contour; outliers are then removed simply and quickly from the previously segmented regions (Fig. 1). For the detection and removal of cast shadows, a correction step is indispensable (Fig. 2). This takes place by using a so-called shadow detection algorithm (filter), which exploits particular color information of the moving region. The shadow filter is realized by a color segmentation algorithm, e.g. the color structure code CSC10, at two different segmentation stages. The first stage, the so-called fine segmentation, detects the shadow area: this shadow is homogeneous, there are no gradients inside the region, and there is no color information in the shadow region. The second stage, the so-called rough segmentation, removes the shadow area because it is not a part of the object of interest. The CSC algorithm is suitable for this task, since it is very fast and makes it possible to control the segment size exactly. Color analysis inside a shadow area yields relatively few segments of large area. A moving region which also contains a shadow possesses a small brightness and contains only little texture.
Figure 2. A robust segmentation of moving regions is reached by this approach despite the influence of cast shadows and changes of the lighting conditions.
In contrast to the shadow area, the object region contains stronger gradients, which result in many small CSC segments inside the object. This observation is used to produce a segmentation mask which gives information about the distribution of CSC segments within a moving region. The extraction of the moving object region is then easily realized, because the CSC segments belonging to the homogeneous shadow area have been removed; see the object outline (Fig. 2).
Figure 3. The analysis of moving objects in a real video sequence with the suggested MDI approach. Part A presents the analysis under the influence of brightness changes and shadow. Parts S1 and S2 show the robust and exact segmentation of real moving vehicles under the influence of cast shadows and of deformable objects (S2) in a long sequence.
The results of the segmentation of moving objects are presented for the conventional procedures (optical flow, DI scheme) and for the suggested approach (Fig. 3 (MDI)). It can be recognized that with the conventional methods the segmentation of moving objects is not reliable, because the shadow of the objects is extracted as a moving region (DI) although it does not belong to the object directly. One concludes that the use of the optical flow is not suitable for segmenting image regions, because many interference factors have to be considered. A robust segmentation of moving objects was reached by applying the suggested MDI method. It can be recognised that the segmentation mask describes the moving objects and the contours describe the actual object shape and position (see Fig. 3). The suggested method thus enables an exact and robust extraction of objects despite the influence of disturbances.
3. SUMMARY AND VIEW
A robust algorithm (MDI) was developed in the present work for the automatic segmentation of moving objects under the influence of disturbed image situations. The elimination of remaining errors takes place by means of morphological, separation and shadow detection algorithms. A robust segmentation of moving image regions is obtained by this approach despite the changes of lighting conditions and the shadows that often occur in real environments. The resulting image regions represent the motion blobs that can be used for the solution of the correspondence problem in the tracking analysis.
ACKNOWLEDGEMENTS This work was supported by BMBF grants: (03i1210A and 03i0404A).
REFERENCES
1. Ullman S.: High-Level Vision: Object Recognition and Visual Cognition, MIT Press, Cambridge, MA, 1996.
2. Torres L.; Delp E.J.: New trends in image and video compression, in X European Signal Processing Conference, Tampere, Finland, September 4-8, 2000.
3. Al-Hamadi A.; Michaelis B.: Intensity-based method for tracking of objects in colour video sequences under the influence of non-cooperative situations. SPPRA 2002, Crete, Greece, June 25-28; pp. 62-67.
4. Karmann K.P.; Brandt A.: Moving Object Recognition Using an Adaptive Background Memory, Elsevier Science B.V., pp. 289-296, 1990.
5. Smith S.M.; Brady J.M.: Real-Time Motion Segmentation and Shape Tracking, ASSET-2, PAMI(17), No. 8, August 1995, pp. 814-820.
6. Klette R.; Koschan A.; Schluens K.: Computer Vision; [ISBN 3-528-06625-3].
7. Horn B.K.P.; Schunck B.G.: Determining Optical Flow. AI 17, 1981, pp. 185-203.
8. Wang R.; Hong P.; Huang T.: Memory-based moving object extraction for video indexing; 15th ICPR; Barcelona 2000; Volume 1; pp. 811-814.
9. Kim C.; Hwang J.N.: A fast and robust moving object segmentation in video sequences; IEEE ICIP; Kobe, Japan, 1999, pp. 131-134.
10. Priese L.; Rehrmann V.: On hierarchical color segmentation and applications; Proceedings of the CVPR; pp. 633-634, IEEE Computer Society Press, June 1993, NY City.
11. Al-Hamadi A.; Michaelis B.; Niese R.: Towards robust segmentation and tracking of moving objects in video sequences. In: 3rd IEEE-EURASIP (2003); Rome; pp. 645-650.
RECONSTRUCTION ACCURACY WITH 1D SENSORS Application on Scanning Cameras Yannick Caulier Fraunhofer IIS, Am Wolfsmantel 33, D-91058 Erlangen [email protected]
Klaus Spinnler Fraunhofer IIS, Am Wolfsmantel 33, D-91058 Erlangen [email protected]
Abstract
This article describes a new calibration technique to determine the position and orientation of linescan cameras. The novelty of the method lies in the determination of the camera projection plane with no restriction on its world position. Using scanning properties and a special calibration pattern the external camera parameters are computed. An evaluation of the method accuracy is given by the minimum distance between the computed back projected ray and a 3D reference point.
Keywords:
Camera calibration; 3D reconstruction; feature detection.
1. INTRODUCTION
Image recording with linescan cameras differs from conventional matrix camera techniques. We developed a new method specially adapted to linescan recording; it is a multiple-step method using a known 3D calibration object. As will be seen later on, different parts of the calibration object are used in the stepwise determination of the sought external parameters.
2. THE LINESCAN CAMERA TECHNIQUE
Linescan cameras are often required when long objects moving with a constant velocity have to be recorded. A typical application can be found in industrial image processing, where objects have to be inspected at the end of
a production line (also called 'web inspection' (Greer, 2001)). In contrast to matrix cameras, linescan cameras have a 1D sensor, so that only single lines (slices) of the object are recorded (as in (Caprille B., 1990)). The movement of the object is parallel to the ground plane (which is also the (xw, zw) plane of the world coordinate system) along the xw axis (see Figure 1). Some parameters such as the focal length (f) (or magnification), the central point (r) (or principal point position) and the pixel distance (δ) are the same as for matrix cameras. Nevertheless, the camera frequency is also an internal camera characteristic which is directly linked to the object speed. (Gupta R., 1997) used the world motion as a camera parameter. Thus we consider F/v as the fourth intrinsic parameter and name it the scanning parameter (F is the camera frequency and v is the object speed). The external linescan camera parameters are the 3 rotation angles (αX, αY and αZ) and the 2 translation values (TX and TZ). As the translation TY would have to be given relative to an initial position of the recorded scene and we consider only one camera, we take it to be null. We do not consider the optical distortion.
3. CALIBRATION
3.1 Common techniques
Some applications restrict the calibration problem of 1D sensors to the projection plane. (Faugeras O., 1998) proposes a method where the position of at least 3 cameras situated in a plane is computed. In (Schmidt R., 2000) a method closer to our application is presented: the work consists in computing the focal distance and the position of 5 linescan cameras. The scanning principle is also referred to in the literature as the pushbroom camera model, used for satellite cameras. (Gupta R., 1997) determines the internal and external camera parameters by factorising the camera matrix linking 3D points with points projected on the sensor (perspective projection) and the speed of the world (orthographic projection). (Kim T., 2001) proposes a method to find the start position and attitude of a satellite camera.
3.2 Used technique
The fundamental idea is based on the way scanned images are constructed: the object moves and images are obtained along the projection plane. The main notion used is the comparison of the linear movement of the object (the speed, v) and of the obtained images (the camera frequency, F) through the main element binding them: the time, t. Using this principle, we can easily obtain the position of the projection plane in world coordinates. Afterwards, all object points scanned in that plane are projected on the camera sensor, so that we can use the classical pinhole camera model as described in (Tsai, 1997). For our experiment we developed a special 3D calibration object having a wedge form (the upper plane of the object contains a grid with known geometric characteristics). The grid is made of 900 cross points ordered in 9 square sections, numbered from left to right and from top to bottom. The object has a linear movement, so that the horizontal (respectively the vertical) lines are always perpendicular (respectively parallel) to the moving direction (see Figure 1).
Figure 1. (a) The calibration rig with its 9 sections (each contains 100 cross points). (b) General overview of the set-up, and construction of the projection plane.
4. DETERMINATION OF THE PROJECTION PLANE 3D POSITION
The projection plane is defined by two angles in the world coordinate system (Figure 1). The first, called αZ, is the rotation angle between the projection plane and the ground plane around the zw direction. The second, called αX, is the rotation angle of the projection plane with respect to the ground plane around the xw direction. For determining the angle αZ, we consider at a first stage the camera recording of one object height position and, at a second stage, after the object has moved with constant speed, the camera recording of another object height position (see Figure 1). The time to pass from one height position to the other depends on the frequency of the camera F but also on the speed of the object v, so that we obtain equation 1 (hO is the height of the recorded object portion, dI is the corresponding distance in the image and αO is the angle of the calibration rig). In Figure 1, let us consider the horizontal line of one recorded object section. The αX angle is obtained from the ratio of the Y to X distance in the images, weighted by the scanning parameter for the Y distance and by the resolution in X (RX) for the X distance (equation 1). This value can be directly deduced from the knowledge of the calibration pattern. We choose RX as the ratio of the number of
pixels between 2 adjacent lines (both situated on the same plane of the object) to the corresponding known distance.
\[ \alpha_Z = \operatorname{atan}\!\left(\frac{d_I/(F/v)}{h_O}\right), \qquad \alpha_X = \operatorname{atan}\!\left(\frac{d_Y/(F/v)}{d_X/R_X - h_O/\tan(\alpha_O)}\right) \qquad (1) \]

5. CAMERA POSITION IN THE PROJECTION PLANE
The 900 cross points are projected parallel to the moving direction of the object. Knowing the world position of the projection plane, we can easily obtain the position of those lines in the projection plane. This parallel projection (along the xw axis) first changes the 3D object points into 3D intersection points situated on the projection plane. Those points, transformed by a 3D Euclidean coordinate change (around the xw and zw axes respectively), give the object points in the projection plane coordinates. Since we know the object coordinates in the projection plane, we apply the classical technique using a pinhole camera model and obtain the projection matrix to pass from known 2D points to our 1D representation. The projection matrix of a 1D camera links the image coordinates and the object coordinates in the projection plane. We do this for all the 2D points in the projection plane and obtain an equation system which links the known parameters with the unknown calibration coefficients TX, TZ and αY.
6. EXPERIMENTS AND RESULTS
6.1 The laboratory set-up
The calibration set-up is made of a Dalsa Spyder camera having a 512-pixel sensor (pixel distance of 14 μm) which runs at a frequency of 1000 Hz. The wide-angle lens is a Zoom Nikkor with a focal length of 35 mm. The object moves with a constant speed of 12.5 mm/s through the projection plane of the camera, which is at a distance of approximately 1000 mm from the ground plane. All the image characteristics are obtained with sub-pixel accuracy with the use of interpolation functions.
6.2 Evaluation of the calibration quality
At the moment we evaluate our method by using only one camera. In a first step we measure the calibration accuracy and investigate the need for a further optimization (we seek to optimize the parameters αX, αY, αZ, TX and TZ by applying the Levenberg-Marquardt gradient method as described in (Press W. H., 1999)). The performance of the calibration is determined by the minimum distance between the back-projected ray and the object line (we call it the calibration error). This back-projected ray (contained in the projection plane) passes through the optical centre of projection and the computed sensor position. In the ideal case, this ray should intersect the corresponding known 3D object line. We investigate the influence of the projection plane parameter estimation by considering each of the 9 sections separately. The αZ angle is computed by considering the height difference between the top and bottom horizontal lines in a section. We compute the αX angle for the 10 horizontal lines in a section and take their average value. Those angles are then used to compute the rest of the external parameters for each section. The results for all 9 sections are written in Table 1 (this table shows the maximum and minimum values of the calibration error over the 100 points of each of the 9 sections).

Table 1. Maximum and minimum values of the calibration error (CE Max and CE Min, respectively) over the 100 points of each of the 9 sections (S1-S9).

              S1      S2      S3      S4      S5      S6      S7      S8      S9
CE Max[mm]  0.1015  0.1106  0.1334  0.0925  0.0935  0.0848  0.0847  0.0970  0.1075
CE Min[mm]  0.0002  0.0004  0.0021  0.0001  0.0003  0.0000  0.0008  0.0014  0.0008
Those results show that the knowledge of the position and orientation of the 3D reference points, together with the detection of the corresponding 2D image points, gives the best approximation of the calibration error (its maximum value over all the sections is 0.1334 mm). We tried to improve those results by applying the Levenberg-Marquardt minimization method (the optimized parameters are the extrinsic ones); we obtained the same calibration error but a worse accuracy in the external parameter estimation.
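The calibration error used above is the minimum distance between the back-projected 3D ray and the known 3D object line. A small sketch of that geometric computation (an assumed implementation, not the authors' code) is:

```python
import numpy as np

def line_line_distance(p1, d1, p2, d2):
    """Minimum distance between two 3D lines given by a point and a direction.

    Used here as the calibration error: line 1 is the back-projected ray,
    line 2 is the known 3D object line.
    """
    p1, d1 = np.asarray(p1, float), np.asarray(d1, float)
    p2, d2 = np.asarray(p2, float), np.asarray(d2, float)
    n = np.cross(d1, d2)
    if np.linalg.norm(n) < 1e-12:            # parallel lines: point-to-line distance
        return np.linalg.norm(np.cross(p2 - p1, d1)) / np.linalg.norm(d1)
    return abs(np.dot(p2 - p1, n)) / np.linalg.norm(n)
```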
6.3 Evaluation of the reconstruction accuracy
As we are using for the moment only one camera, we estimate the reconstruction accuracy by considering construction points (for camera calibration)
and reference points (for the determination of the reconstruction error). We take the 100 points of section 5 (situated in the middle of the calibration pattern) as references and consider the other sections (1, 2, 3, 4, 6, 7, 8 and 9) for the calibration of the camera. Table 2 shows the 8 reconstruction errors.

Table 2. Reconstruction error (the reference points are those of section 5, the construction points are those of sections 1, 2, 3, 4, 6, 7, 8, 9). Maximum, average and minimum reconstruction errors (Max RE, Avg RE and Min RE, respectively) over the 100 points for each of the 8 sections (S1-S4 and S6-S9).

              S1      S2      S3      S4      S6      S7      S8      S9
Max RE [mm] 0.1670  0.1994  0.5672  0.2631  0.2706  0.3333  0.1752  0.1400
Avg RE [mm] 0.0586  0.0666  0.3059  0.1021  0.1010  0.1317  0.0668  0.0525
Min RE [mm] 0.0007  0.0009  0.0701  0.0047  0.0020  0.0007  0.0013  0.0029
The average reconstruction errors for the 8 sections lie between 0.0525 mm and 0.3059 mm; the maximal error value measured over all the points surrounding the reference section 5 is 0.5672 mm.
7. CONCLUSION AND FUTURE WORK
We have shown that a combination of known calibration techniques with the special recording features of constantly moving objects recorded with linescan cameras can be used for determining the extrinsic camera parameters. Contrary to other known techniques for 1D sensors, there is no restriction on the world position of the camera projection plane. We demonstrate the validity of our method by obtaining an average reconstruction accuracy of about 0.3 mm. As the proposed approach permits determining the 3D position and orientation of 1D cameras, the advantage of our method lies in the fact that 3D reconstruction can be done with only 2 cameras. The use of an optimization method (non-linear approximation) brought a worse approximation of the extrinsic parameters and gave the same calibration error. We will adapt this 3D reconstruction method to the use of several cameras in order to obtain better reconstruction results (the accuracy of the method will be improved by combining the calibration results of more cameras).
REFERENCES
Caprille B., Torre V. (1990). Using vanishing points for camera calibration. International Journal of Computer Vision, 4:127–140.
Faugeras O., Quan L., Sturm P. (1998). Self-calibration of a 1d projective camera and its application to the self-calibration of a 2d projective camera. Technical report.
Greer, C. (2001). Fundamentals of machine-vision cameras (part 1).
Gupta R., Hartley R. I. (1997). Linear pushbroom cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:963–975.
Kim T., Shin D., Lee Y.-R. (2001). Development of a robust algorithm for transformation of a 3d object point onto a 2d image point for linear pushbroom imagery. Photogrammetric Engineering and Remote Sensing, 67(4):449–452.
Press W. H., Teukolsky S. A. (1999). Numerical Recipes in C (2nd ed.): the art of scientific computing.
Schmidt R., Schramm U., Hofman R., Caulier Y., Spinnler K., Wittenberg T. (2000). Automatic three-dimensional inspection, measurement and detection of errors in transparent pipes. Vision, Modelling, and Visualization, Saarbruecken, pages 19–24.
Tsai, R. Y. (1997). A versatile camera calibration technique for high accuracy 3d machine vision metrology using off-the-shelf tv cameras and lens. IEEE Journal of Robotics and Automation, (03):323–344.
SHAPE SIMILARITY TO ALPHANUMERIC SIGN
Proposition of New Criterion 1
JACEK LEBIEDŹ
Gdańsk University of Technology, Faculty of Electronics, Telecommunications and Informatics, ul. G. Narutowicza 11/12, 80-952 Gdańsk, Poland, e-mail: [email protected]
Abstract:
The paper describes different approaches to the evaluation of shape. It contains an analysis of their usefulness for calculating similarity to a letter or digit. The ability to check similarity to alphanumeric signs is needed for the estimation of preprocessing quality in the recognition of machine-typed documents4,9. It is also needed for distinguishing letter-like segments from other shapes when searching for inscriptions in pictures taken by an apparatus for the visually impaired3. Because the well-known approaches seem to be insufficient for evaluating shape similarity to alphanumeric signs, a new method is proposed. This method is based on a statistical analysis of the Maximal Square Map introduced in the paper.
Key words:
text detection, text extraction, text segmentation
1. INTRODUCTION
There are many methods for the detection of text in real scenes. Some of these methods use horizontal and vertical projection profiles to cut the image into columns and paragraphs (rather only for scanned paper documents), some work by grouping small components into larger segments until all blocks are found, while others treat text as a type of texture and use texture segmentation algorithms2,10,12. All these methods are based on the assumption that text exhibits spatial cohesion – text consists of characters of similar heights, orientation and spacing10. Therefore these methods often omit single and double letters or digits (e.g. tramway numbers, hotel room numbers on a doorplate). For the detection of such short texts we need to employ another
Funded in part by the 5FP EU Grant IST-2001-33441-MEMORIAL
approach based on an analysis of character shape. Additionally, this approach may be useful for the estimation of the segmentation quality of alphanumeric signs in a multiletter text. Latin, Greek and Cyrillic letters and Arabic numerals have a specific shape. Since humans can recognize signs of such a kind even though they come from alphabets unknown to the reader (e.g. Georgian letters – mchedruli), there apparently exist some features of shape that allow discriminating letter-like forms from other objects. In this paper we try to identify these features.
2. CLASSICAL SHAPE MEASURES
There is a large number of various shape measures. Below we present a short survey of these measures. Template matching is the simplest way of shape evaluation. Every figure to be evaluated is compared with all shape templates, counting the number of pixels that do not match (Fig. 1). The measure of similarity (in fact, dissimilarity, because zero means an ideal match) can be written as

\[ \varepsilon = \min_k \sum_{x,y} \lvert f(x,y) - f_k(x,y) \rvert , \qquad (1) \]

where f and fk stand for the characteristic functions of the examined figure and of the k-th template, respectively. For the evaluation of shape similarity to alphanumeric signs this approach needs at least one template for each letter (capital and small) and digit (over 60 templates for one font, style and size). The large number of templates makes the calculations time-consuming. An additional disadvantage of this method is its great sensitivity to translations, rotations and scaling.
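A literal sketch of equation (1) (assumed code; `figure` and `templates` are binary arrays of identical size here, which sidesteps the alignment and scaling issues mentioned above):

```python
import numpy as np

def template_dissimilarity(figure, templates):
    """Equation (1): minimum count of mismatching pixels over all templates."""
    figure = figure.astype(bool)
    return min(int(np.sum(figure ^ t.astype(bool))) for t in templates)
```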
Figure 1. Comparison between the examined figure and one of the templates.
Shape descriptors like eccentricity (φmax / φ⊥max, where φmax and φ⊥max are the maximum diameter and the diameter perpendicular to it)8, aspect ratio (φmax / φmin, where φmin is the minimum diameter), roundness (4·S / (π·φmax²), where S is the area) and form factor (4·π·S / L², where L is the perimeter) describe various geometrical features7. Geometrical features can also be characterized by moment invariants like M1 = N2,0 + N0,2, M2 = (N2,0 − N0,2)² + 4·N1,1², M7 = N2,0·N0,2 − N1,1², based on normalized central moments defined as Np,q = Mp,q / M0,0^((p+q)/2+1), where Mp,q = Σ (x−Ex)^p·(y−Ey)^q·f(x,y) is the central moment (the symbol E denotes the mean value)1. These descriptors allow recognizing given shapes (e.g. a disk "●" or the letter "K"), but they do not stand the test for a category of differentiated shapes like alphanumeric signs. A contour of letters should be smooth. Some approaches like fractal dimension, harmonic analysis and chain code representation allow describing the smoothness of a contour, but they can be used only for the elimination of rough shapes. Spades ("♠"), hearts ("♥"), squares ("■") or disks ("●") still have smooth contours. The elimination of some groups of shapes may be obtained by the use of topological properties. One of them, namely Euler's characteristic, is equal to the number of connected fragments of a figure decreased by the number of holes. On a raster grid it can be calculated by a one-pass algorithm (Fig. 2) as the number of pixel vertices (a vertex may belong to up to four pixels of the figure) minus the number of pixel edges (an edge may belong to up to two pixels) plus the number of pixels6. If a figure consists of one connected segment, the number of its holes may be expressed as one minus Euler's characteristic. Such a criterion allows distinguishing alphanumeric signs from figures that have more than two holes.
Figure 2. Neighborhood patterns and the incremental calculation of the Euler characteristic.
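The vertex/edge/pixel counting described above can be written down directly. The sketch below (an assumed implementation, counting each distinct vertex, edge and pixel of the union of closed pixels once) illustrates the idea, though not in the single-pass incremental form of Fig. 2:

```python
import numpy as np

def euler_characteristic(mask):
    """Euler characteristic of a binary figure: #vertices - #edges + #pixels."""
    vertices, edges = set(), set()
    pixels = 0
    for i, j in np.argwhere(mask):
        pixels += 1
        # the four corners of pixel (i, j)
        vertices.update({(i, j), (i + 1, j), (i, j + 1), (i + 1, j + 1)})
        # its four unit edges, keyed by position and orientation
        edges.update({(i, j, 'h'), (i + 1, j, 'h'), (i, j, 'v'), (i, j + 1, 'v')})
    return len(vertices) - len(edges) + pixels

# A 3x3 ring of foreground pixels (one hole): characteristic = 0, holes = 1 - 0 = 1.
ring = np.ones((3, 3), dtype=bool); ring[1, 1] = False
print(euler_characteristic(ring))
```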
Features of a figure skeleton (numbers of end points, nodes, external and internal branches, loops) may be treated as topological properties too. Because various shapes can have the same skeleton (e.g. the square "■" and the letter "x"), these features are unfit for the detection of alphanumeric signs. Skeleton analysis loses information about the width of branches and can be used only for the elimination of complex shapes. Additionally, the methods for constructing the skeleton and analysing it are rather complicated. Statistics based on the Euclidean distance map (EDM) hold out some hope of a successful evaluation of similarity to letters and digits. In the EDM every pixel is assigned a value that is its distance from the nearest background pixel. Constructing such a distance map is simple and needs only two passes7. As an example we can consider histograms of the EDM. For alphanumeric signs they should be relatively regular (Fig. 3). However, a map that has exactly one dominant bar in the histogram for letters and digits seems to be better for examination. Such a map is proposed in the next section.
Figure 3. Histograms of EDM for letters “S” and “O”.
Letters and numerals originated as handwritten marks. Hence their shapes have the form of lines with a constant width resulting from the thickness of the pen. A low value of the dispersion or variance of the hypothetical pen width (fiber width) for some shape may mean that the shape is a letter, a digit or a figure similar to an alphanumeric sign (e.g. "§"). The pen path length (fiber length) may serve as an additional criterion. Calculations of these pen (fiber) parameters need the construction of a skeleton and are rather complex. The next section contains a proposition of a simple transformation (map) which assigns a value corresponding to the pen width to each pixel of a figure. Statistics based on this map should be a good criterion for the evaluation of shape similarity to alphanumeric signs.
3. NEW PROPOSITION OF MEASURE BASED ON MAXIMAL SQUARE MAP
Let us consider a map where each pixel is assigned a value that is the diameter (or the radius) of the maximal disk containing the pixel and belonging to the figure entirely. Such a map gives for every pixel the size of the widest pen that can draw the given pixel. Because of the calculation complexity, it is better to consider squares instead of disks; the results should be similar. A map where every pixel is assigned a value that is the side of the maximal square containing the pixel and belonging to the figure entirely will be called the maximal square map (MSM). The MSM can be calculated in a similar way to the EDM. The MSM procedure implemented by the author uses three passes, but one may suppose that there exists a two-pass algorithm. For a shape that has the form of a line with constant width, the MSM should have almost the same values for most pixels. The dispersion (or variance) of the MSM should be close to zero for such a shape. Therefore it can be used as a criterion of similarity to an alphanumeric sign. The line length, calculated by dividing the number of pixels (figure area) by the squared mean value of the MSM (the area of the averaged square), can be treated as an additional criterion. Tests have shown that these criteria work in practice. The experiments were performed with scans of archival machine-typed documents (Figs. 4, 5)
as well as with photos taken on a street from pedestrian perspective (Figs. 6, 7).
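As an illustration of the MSM and of the two statistics built on it, here is a naive sketch (an assumed implementation — the author's three-pass procedure is not published here, so a brute-force propagation step is used instead):

```python
import numpy as np

def maximal_square_map(mask):
    """Assign to each foreground pixel the side of the largest all-foreground
    square that contains it (the MSM). Brute-force but faithful to the definition."""
    h, w = mask.shape
    # s[i, j] = side of the largest square whose bottom-right corner is (i, j)
    s = np.zeros((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            if mask[i, j]:
                s[i, j] = 1 if (i == 0 or j == 0) else \
                    min(s[i-1, j], s[i, j-1], s[i-1, j-1]) + 1
    msm = np.zeros((h, w), dtype=int)
    for i in range(h):
        for j in range(w):
            k = s[i, j]
            if k:  # propagate the square size to every pixel it covers
                block = msm[i-k+1:i+1, j-k+1:j+1]
                np.maximum(block, k, out=block)
    return msm

def letter_likeness(mask):
    """Dispersion of the MSM (small for pen-like shapes) and fiber length."""
    msm = maximal_square_map(mask)
    values = msm[mask.astype(bool)]
    fiber_length = values.size / (values.mean() ** 2)
    return values.std(), fiber_length
```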
Figure 4. Fragment of an archival machine typed document.
Figure 5. Evaluation of similarity to alphanumeric signs of black shapes from Fig. 4: a) darker figures with less dispersion of the MSM, b) darker figures with fiber length/width closer to 17 (e.g. disappearing of underlines).
Figure 6. Pedestrian perspective photo: a) original (P. Zabłocki), b) result of segmentation.
Figure 7. Evaluation of the similarity to alphanumeric signs for the segments from Fig. 6: a) darker figures with less dispersion of the MSM, b) black figures obtained by applying a threshold to the dispersion of the MSM and rejecting shapes that are too small and too narrow.
As we can see, recall, defined as the number of correct estimates divided by the total number of targets5, is near 1 (almost all alphanumeric signs are detected). Precision, equal to the number of correct estimates divided by the total number of estimates5, is less than the recall but is much greater than 0.5 (the majority of detected objects are alphanumeric signs). In the foreseeable future an evaluation of the results should be done with some common benchmark datasets5.
4. CONCLUSIONS
A discussion of various approaches to shape evaluation and of their usefulness for the estimation of similarity to alphanumeric signs has been presented. An efficient method based on the new MSM transformation is proposed. Test results of this method revealing its efficiency are shown.
REFERENCES
1. Choraś R. S.: Object Recognition Based on Shape, Texture and Color Information. Proceedings of the 3rd Conference on Computer Recognition Systems KOSYR'2003, Wrocław University of Technology, Wrocław (2003) 181–186.
2. Clark P., Mirmehdi M.: Recognising text in real scenes. International Journal on Document Analysis and Recognition 4 (2002) 243–257.
3. Kowalik R.: RADONN – Manual Apparatus for Reading of Inscriptions (only in Polish: RADONN – ręczny aparat do odczytywania napisów). Unpublished Report, Gdańsk University of Technology, Gdańsk (2002).
4. Lebiedź J., Podgórski A., Szwoch M.: Quality Evaluation of Computer Aided Information Retrieval from Machine Typed Papers Documents. Proceedings of the 3rd Conference on Computer Recognition Systems KOSYR'2003, WUT, Wrocław (2003) 115–121.
5. Lucas S. M., Panaretos A., Sosa L., Tang A., Wong S., Young R.: ICDAR 2003 Robust Reading Competitions. Proceedings of the Seventh International Conference on Document Analysis and Recognition ICDAR 2003 (2003).
6. Román-Roldán R., Gómez-Lopera J. F., Atae-Allah Ch., Martínez-Aroza J., Luque-Escamilla P. L.: A measure of quality for evaluating methods of segmentation and edge detection. Pattern Recognition 34 (2001) 969–980.
7. Russ J. C.: The Image Processing Handbook. CRC Press, Boca Raton (2002).
8. Sonka M., Hlavac V., Boyle R.: Image Processing, Analysis and Machine Vision. PWS Publishing (1998).
9. Wiszniewski B.: The Virtual Memorial Project. http://docmaster.eti.pg.gda.pl
10. Wu V., Manmatha R., Riseman E. M.: Finding text in images. Proceedings of 2nd ACM Conference on Digital Libraries (1997) 3–12.
11. Zhang D., Lu G.: Review of shape representation and description techniques. Pattern Recognition 37 (2004) 1–19.
12. Zhang J., Chen X., Hanneman A., Yang J., Waibel A.: A Robust Approach for Recognition of Text Embedded in Natural Scenes. Proceedings of International Conference on Pattern Recognition ICPR 2002 (2002).
QUALITATIVE CHARACTERIZATION OF DYNAMIC TEXTURES FOR VIDEO RETRIEVAL Renaud Péteri and Dmitry Chetverikov MTA SZTAKI - Hungarian Academy of Sciences 1111 Budapest, Kende u.13-17., Hungary Email: [email protected]
Abstract
A new issue in texture analysis is its extension to the temporal domain, known as dynamic texture. Many real-world textures are dynamic textures whose retrieval from a video database should be based on both dynamic and static features. In this article, a method for extracting features revealing fundamental properties of dynamic textures is presented. Their interpretation enables qualitative requests when browsing videos. Future work is finally outlined.
Keywords:
Dynamic texture; video retrieval; MPEG-7; qualitative feature; normal flow.
1. INTRODUCTION
The amount of digital images and videos available for professional or private purposes is growing quickly. Extracting useful information from these data is a highly challenging problem and requires the design of efficient content-based retrieval algorithms. The current MPEG-7 standardization (also known as the "Multimedia Content Description Interface") aims at providing a set of content descriptors of multimedia data such as videos. Among them, texture (Wu et al., 2001) and motion (Divakaran, 2001) were identified as key features for video interpretation. Combining texture and motion leads to a certain type of motion pattern known as Dynamic Textures (DT). As real-world scenes include a lot of these motion patterns, such as trees or water, any advanced video retrieval system will need to be able to handle DT. Because of their unknown spatial and temporal extent, the recognition of DT is a new and highly challenging problem, compared to the static case where most textures are spatially well segmented. In the task of handling qualitative human queries, we are concerned with understanding and extracting fundamental properties of DT. In this article, we present a method for extracting features whose qualitative interpretation enables discrimination between different sorts of DT. Dynamic texture recognition in videos is a recent theme, but it has already led to several
kinds of approaches. In reconstructive approaches (Szummer, 1995) or (Saisan et al., 2001), the recognition of DT is derived from a primary goal, which is to identify the parameters of a statistical model 'behind' the DT. Geometrical approaches (Otsuka et al., 1998; Zhong and Scarlaroff, 2002) consider the video sequence as a 3D volume (x, y and time t); features related to the DT are extracted by geometric methods in this 3D space. Qualitative motion recognition approaches (Nelson and Polana, 1992; Bouthemy and Fablet, 1998; Peh and Cheong, 2002) are based on the human ability to recognize different types of motion, both of discrete objects and of DT. The aim is not to reconstruct the whole scene from motion, but to identify different kinds of DT.
2. PROPOSED CHARACTERIZATION OF DYNAMIC TEXTURES
2.1 The normal flow as the medium of motion information
For video retrieval purposes, it is important to minimize the computational cost and to extract DT features which are easily interpretable by humans. To that aim, we have chosen a qualitative approach to the problem of DT recognition. It is based on the assumption that the full displacement field is not required: its computation is time-consuming and not always accurate. A partial flow measure, given by the normal flow, can provide sufficient information for recognition purposes. The normal flow vn is derived from the optical flow equation (Horn and Schunck, 1981) and is parallel to the local image gradient:

\[ v_n(p) = -\,\frac{I_t(p)}{\lVert \nabla I(p) \rVert}\, \mathbf{n} \qquad (1) \]

with p the pixel where the normal flow is computed, I_t(p) the temporal derivative of the image at p, ∇I(p) its gradient at p, and n a unit vector in the gradient direction. The normal flow field is fast to compute and can be estimated directly, without the iterative scheme used by regularization methods (Horn and Schunck, 1981). Moreover, it carries both temporal and structural information about the DT: temporal as it is related to moving edges, and spatial as it is linked to the edge gradient vectors.
2.2 Extraction of the normal flow
In order to extract numerical features characterizing a DT, the normal flow fields are computed for each DT. For reducing the sensitivity to noise of the normal flow, a Deriche blurring filter, set with σ = 1, is performed on the sequence, followed by a linear histogram normalization. Image regions with low spatial gradients are masked through an automatic threshold on the spatial gradient, as values of their motion flows would not be significant. The normal
flow is then computed according to formula (1), excluding ‘masked’ pixels. Fig. 1 illustrates the computation of the normal flow field (1.a), its norm (1.b) and its angle (1.c) on the "fire" sequence (presented further on Fig. 2).
Figure 1. Normal flow field (1.a), its norm (1.b) and its angle (1.c) on the "fire" sequence.
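A compact sketch of the normal flow extraction of Section 2.2 (assumed implementation: the smoothing is done here with a simple Gaussian instead of a Deriche filter, and the gradient-magnitude threshold is a free parameter):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def normal_flow(frame_t, frame_t1, sigma=1.0, grad_threshold=5.0):
    """Normal flow between two consecutive greyscale frames, as in Eq. (1).

    Returns (u, v), set to zero where the spatial gradient is too weak.
    """
    f0 = gaussian_filter(frame_t.astype(float), sigma)
    f1 = gaussian_filter(frame_t1.astype(float), sigma)
    It = f1 - f0                         # temporal derivative
    gy, gx = np.gradient(f0)             # spatial gradient
    mag = np.hypot(gx, gy)
    mask = mag > grad_threshold          # discard low-gradient ("masked") pixels
    scale = np.where(mask, -It / np.maximum(mag, 1e-9) ** 2, 0.0)
    # v_n = -(It / |∇I|) * n, with n = ∇I / |∇I|
    return scale * gx, scale * gy
```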
2.3 Numerical criteria
Several criteria have been defined for characterizing DT. These criteria are computed from the normal vector field as well as directly from the video sequence (Table 1).

Table 1. Numerical criteria for characterizing DT.
  Normal flow: Divergence, Rotational, Peakness, Orientation
  Raw sequence: Mean of the MRA, Variance of the MRA
- Criteria based on the normal vector field are: the average of the divergence (scaling motion) and of the rotational (rotational motion) over the whole video sequence V; the peakness of the distribution, defined as the average flow magnitude divided by its standard deviation; and the orientation homogeneity

\[ \phi = \frac{\lVert \sum_{i \in \Omega} v_i \rVert}{\sum_{i \in \Omega} \lVert v_i \rVert} \in [0, 1], \]

where v_i is the normal flow at i and Ω is the set of non-zero normal flow vectors. φ reflects the flow homogeneity of the DT with respect to its mean orientation; a more detailed description of its meaning is given in the appendix.
- Criteria computed from the raw video sequence are based on the temporal variation of the maximal regularity criterion (Chetverikov, 2000). For each frame of the sequence, the spatial texture regularity is computed in a sliding window, and the maximal value is selected. The features computed for the DT are the temporal mean and standard deviation of the maximum regularity areas (MRA). All the selected criteria are translation and rotation invariant.
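A sketch of how these flow-based criteria could be computed for one frame of normal flow (assumed implementation; the exact averaging over the sequence and the normalisations are not specified in the paper, so they are choices made here):

```python
import numpy as np

def flow_criteria(u, v):
    """Divergence, rotational, peakness and orientation homogeneity of a flow field."""
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    divergence = np.mean(np.abs(du_dx + dv_dy))
    rotational = np.mean(np.abs(dv_dx - du_dy))
    mag = np.hypot(u, v)
    nonzero = mag > 0
    peakness = mag[nonzero].mean() / mag[nonzero].std()
    # orientation homogeneity: |sum of vectors| / sum of magnitudes
    orientation = np.hypot(u[nonzero].sum(), v[nonzero].sum()) / mag[nonzero].sum()
    return divergence, rotational, peakness, orientation
```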
3. RESULTS ON A REAL DATASET
3.1 Experiments
The defined criteria have been applied to a real dataset covering a wide range of possible DT occurrences (Fig. 2): an escalator (A), a fire (B), a waving plastic sheet (C), clothes in a washing machine (D), a waving flag (E), smoke going upward (F), ripples on a river (G) and a strong water vortex (H).
Figure 2. The DT dataset (courtesy of the MIT).

3.2 Analysis of the results
The defined quantities have been computed on the whole sequence and averaged (Table 2).

Table 2. Numerical criteria for characterizing DT.

                          DT sample
Criterion       A      B      C      D      E      F      G      H
Divergence    0.032  0.250  0.067  0.108  0.073  0.045  0.128  0.173
Rotational    0.009  0.140  0.037  0.055  0.026  0.030  0.086  0.105
Peakness      0.507  0.657  0.757  0.376  0.375  0.459  0.829  0.652
Orientation   0.865  0.221  0.055  0.225  0.183  0.511  0.169  0.022
MRA mean      0.582  0.205  0.222  0.104  0.381  0.067  0.205  0.136
MRA variance  0.029  0.055  0.031  0.066  0.235  0.057  0.036  0.026
The divergence reflects converging and diverging fluxes. This explains the high values for fluids such as the fire B, the river G or the vortex H. Rigid objects such as the escalator A or the plastic sheet C have very low values. In the same way, the rotational criterion reflects circular movements around points, as for the vortex H or the fire B. As expected, the upward movement of the escalator has a value close to 0. The peakness criterion discriminates DT having homogeneous and high motion values (C or G) from DT with sparse or low motion values (D or E). The orientation criterion reflects a main motion orientation in the DT. The 1st row of Fig. 3 represents the dynamic textures A, C and F, with the orientation criterion superimposed. For a rigid and well-oriented motion like A, the homogeneity value of the orientation is high, reflecting a consistent main motion flow. The smoke of sequence F is not well segmented and is very volatile, resulting in a lower value of orientation homogeneity. However, the ascending motion of the smoke is still extractable. The last sequence C, the waving plastic sheet, has a very low main orientation value: the plastic sheet has an oscillating motion, resulting in an overall null displacement.
Figure 3. Orientation homogeneity and maximum regularity criteria (the values are reported in Table 2). 1st row: main orientation pointed by the triangle and its homogeneity (base of the triangle). 2nd row: values of the maximal regularity areas for each frame.
The spatial texture regularity criteria discriminate DT maintaining a spatially coherent structure through time from those with low values, reflecting a close-to-random spatial arrangement. The size of the sliding window was set to 80 × 80 pixels. The 2nd row of Fig. 3 represents the temporal evolution of the MRA value for A, C and F. The dynamic textures A and C have significant and stable regularity values, whereas F appears as a random texture. The flag E (Table 2) has a high regularity mean value, but it also has the highest variance, due to some frames where the flag is folded.
4. CONCLUSION AND FUTURE PROSPECTS
This article deals with the recent issue of dynamic texture recognition. We have proposed a method for extracting quantitative and qualitative fundamental features of DT. Based on the normal flow field and on the texture regularity, the derived criteria are fast to compute and easily interpretable, making it possible to handle qualitative human queries. It is possible to separate different sorts of motion: oriented (A) from isotropic (G) or rotating (H), as well as spatially regular (E) from random ones (F). The qualitative criteria can be used to guide the human query: a request 'water' can indeed appear in many occurrences in a video (waving, as a vortex or as an oriented flow). One can think of assigning weights to the defined coefficients according to this request. Our current work aims at testing the discriminative power of the features in a full classification process. The multi-scale properties in time and space of dynamic textures will also be studied.
ACKNOWLEDGMENTS This work was carried out during the tenure of an ERCIM fellowship, and was also supported by the EU Network of Excellence MUSCLE (FP6-507752).
APPENDIX: THE MEANING OF THE ORIENTATION CRITERION
Given Ω, the set of non-null normal flow vectors of a video sequence, we define the global motion vector as V = Σ_{i∈Ω} v_i. The idea of the orientation criterion is to compute the similarity in orientation between each normal flow vector and the global motion vector. One defines the orientation contribution at pixel i for one motion vector as φ_i = α(i) cos(θ_{v_i, V}), where α(i) favors the contribution of the highest-motion points:

\[ \alpha(i) = \frac{\lVert v_i \rVert}{\sum_{j \in \Omega} \lVert v_j \rVert} \in [0, 1]. \]

The overall orientation criterion on the sequence is then φ = Σ_{i∈Ω} φ_i = Σ_{i∈Ω} α(i) cos(θ_{v_i, V}), i.e.

\[ \phi = \sum_{i \in \Omega} \frac{\lVert v_i \rVert}{\sum_{j \in \Omega} \lVert v_j \rVert} \cdot \frac{v_i \cdot V}{\lVert V \rVert \, \lVert v_i \rVert} = \frac{\sum_{i \in \Omega} v_i \cdot V}{\lVert V \rVert \sum_{j \in \Omega} \lVert v_j \rVert}. \]

We finally obtain:

\[ \phi = \frac{\lVert \sum_{i \in \Omega} v_i \rVert}{\sum_{i \in \Omega} \lVert v_i \rVert} \in [0, 1]. \]
REFERENCES
Bouthemy, P. and Fablet, R. (1998). Motion characterization from temporal cooccurrences of local motion-based measures for video indexing. In ICPR'98, pages 905–908.
Chetverikov, D. (2000). Pattern regularity as a visual key. Image and Vision Computing, 18:975–986.
Divakaran, A. (2001). An overview of MPEG-7 motion descriptors and their applications. In Skarbek, W., editor, CAIP 2001, pages 29–40, Warsaw, Poland.
Horn, B. and Schunck, B. (1981). Determining optical flow. Artificial Intelligence, 17:185–203.
Nelson, R. C. and Polana, R. (1992). Qualitative recognition of motion using temporal texture. CVGIP: Image Understanding, 56(1):78–89.
Otsuka, K., Horikoshi, T., Suzuki, S., and Fujii, M. (1998). Feature extraction of temporal texture based on spatiotemporal motion trajectory. In ICPR, volume 2, pages 1047–1051.
Peh, C. H. and Cheong, L.-F. (2002). Synergizing spatial and temporal texture. IEEE Transactions on Image Processing, 11(10):1179–1191.
Saisan, P., Doretto, G., Wu, Y. N., and Soatto, S. (2001). Dynamic texture recognition. In Proceedings of the CVPR, volume 2, pages 58–63, Kauai, Hawaii.
Szummer, M. (1995). Temporal Texture Modeling. Technical Report 346, MIT.
Wu, P., Ro, Y. M., Won, C. S., and Choi, Y. (2001). Texture descriptors in MPEG-7. In Skarbek, W., editor, CAIP 2001, pages 21–28, Warsaw, Poland.
Zhong, J. and Scarlaroff, S. (2002). Temporal texture recognition model using 3D features. Technical report, MIT Media Lab Perceptual Computing.
IMAGE CLASSIFIERS FOR SCENE ANALYSIS Bertrand Le Saux and Giuseppe Amato ISTI - CNR di Pisa ∗ Via G. Moruzzi, 1 - 56124 - Pisa - Italy
[email protected], [email protected]
Abstract
The semantic interpretation of natural scenes, generally so obvious and effortless for humans, still remains a challenge in computer vision. We intend to design classifiers able to annotate images with keywords. Firstly, we propose an image representation appropriate for scene description: images are segmented into regions and indexed according to the presence of given region types. Secondly, we propound a classification scheme designed to separate images in the descriptor space. This is achieved by combining feature selection and kernel-method-based classification.
Keywords:
scene analysis, feature selection, image classification, kernel methods
1. INTRODUCTION
How might one construct computer programmes in order to understand the content of scenes? Several approaches have already been proposed to analyse and classify pictures, by using support-vector machines (SVM) on image histograms (Chapelle et al., 1999) or hidden Markov models on multi-resolution features (Li and Wang, 2003). To capture information specific to an image part or an object, approaches using blobs to focus on local characteristics have also been propounded (Duygulu et al., 2002). We believe that a segmentation of images into regions can provide more semantic information than the usual global image features. The images that contain the same region types are likely to be associated with the same semantic concept. Hence, one has to design a classification scheme to test the co-presence of these region types.
∗ This research was partially funded by the ECD project and the Delos Network of Excellence.
Figure 1. Feature extraction: the original image (a) is first segmented (b) then the boolean presence vector (c) is extracted by comparing the image regions to those in the region lexicon obtained with clustering techniques.
This paper is organised as follows: § 2 explains how the scene information is represented by means of presence vectors. The scene classifiers are described in § 3, and § 4 evaluates the proposed method.
2. FEATURE EXTRACTION
2.1 Collecting region types
The region lexicon - the range of the possible region types that occur in a set of images - is estimated through the following steps: a training set of generic images provides a range of the possible real scenes; each one is segmented into regions by using the mean-shift algorithm (Comaniciu and Meer, 1997); the regions are then pulled together and indexed using standard features - mean colour, colour histogram - for visual description; finally these indexes are clustered by using the Fuzzy C-Means algorithm (Bezdek, 1981) to obtain categories of visually-similar image regions: each cluster represents a region type.
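The clustering step can be sketched as follows, assuming each region has already been described by a feature vector (for instance its mean colour concatenated with a colour histogram); the fuzziness exponent, iteration count and convergence tolerance below are illustrative choices, not values from the paper.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, eps=1e-5, seed=0):
    """Cluster region descriptors X (n_samples, n_features) with Fuzzy C-Means.
    Returns the cluster centres (the 'region lexicon') and the membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)              # memberships sum to 1 per sample
    for _ in range(n_iter):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]       # fuzzy-weighted means
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)            # renormalise memberships
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return centres, U
```

Each returned centre stands for one region type of the lexicon.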
2.2 Representing the content of images
Given a region lexicon, every image can be described by a presence vector: each component corresponds to a region type and its value can be true or false
depending on whether the region type is present in the image. The decision on the presence of a region type is taken by measuring its similarity - in the visual-feature space - to the regions of the image. For instance, it is likely that a countryside image will contain sky, greenery and dark ground regions (cf. figure 1).
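A minimal sketch of how such a presence vector could be computed from the region lexicon; the use of a Euclidean distance in the visual-feature space and the similarity threshold are assumptions made here for illustration.

```python
import numpy as np

def presence_vector(region_descriptors, lexicon_centres, threshold):
    """Boolean presence vector: component t is True when at least one region of the
    image is close enough (in the visual-feature space) to region type t."""
    v = np.zeros(len(lexicon_centres), dtype=bool)
    for r in region_descriptors:                   # descriptors of the image's regions
        d = np.linalg.norm(lexicon_centres - r, axis=1)
        v[d < threshold] = True                    # region type judged present
    return v
```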
3. AUTOMATED IMAGE CLASSIFICATION
We aim to annotate images with keywords. The keyword is basically a mnemonic representation of a concept such as people, countryside, etc. We define a binary classifier for each considered concept through a two-step process. Firstly, a feature selection step determines which region types are meaningful for recognising a concept. Secondly, a kernel classifier is used to learn a decision rule based on the selected region types.
3.1 Feature selection
In our application, the feature selection (FS) is a filtering phase (Guyon and Elisseff, 2003). The most standard way to select features consists in ranking them according to their individual predictive power. Let Y denote a boolean random variable for the scene keyword to associate with the image. We denote $F_1, \ldots, F_p$ the boolean random variables associated with each feature, i.e. region type. Information theory (Gray, 1990) provides tools to choose the relevant features. The entropy measures the average number of bits required to encode the value of a random variable. For instance, the entropy of the class Y is $H(Y) = -\sum_y P(Y=y)\log(P(Y=y))$. The conditional entropy $H(Y|F_j) = H(Y, F_j) - H(F_j)$ quantifies the number of bits required to describe Y when the feature $F_j$ is already known. The mutual information of the class and the feature quantifies how much information is shared between them and is defined by:
$I(Y, F_j) = H(Y) - H(Y|F_j)$.   (1)
The probabilities are estimated empirically on the training samples as the ratio of relevant items to the number of samples. The selected features are the ones which convey the largest information $I(Y, F_j)$ about the class to predict.
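The ranking can be sketched directly from the empirical estimates described above; the function names and the choice of base-2 logarithms are illustrative.

```python
import numpy as np

def mutual_information(y, f):
    """I(Y, F) for two boolean arrays, with probabilities estimated empirically."""
    mi = 0.0
    for yv in (False, True):
        for fv in (False, True):
            p_joint = np.mean((y == yv) & (f == fv))
            p_y, p_f = np.mean(y == yv), np.mean(f == fv)
            if p_joint > 0:
                mi += p_joint * np.log2(p_joint / (p_y * p_f))
    return mi

def select_features(Y, F, k):
    """Keep the k region types carrying the most information about the keyword Y.
    F holds one boolean column per region type (the presence vectors)."""
    scores = [mutual_information(Y, F[:, j]) for j in range(F.shape[1])]
    return np.argsort(scores)[::-1][:k]
```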
3.2 Kernel-adatron classifiers
The adatron was first introduced as a perceptron-like procedure to classify data (Anlauf and Biehl, 1989). A kernel-based version was then proposed (Friess et al., 1998). It solves the margin-maximisation problem of the SVM (Vapnik, 1995) by performing a gradient ascent.
Table 1. Error rates for various keywords: comparison of various classification schemes applied to presence vectors. Null errors on the training set (denoted as "train error") indicate that the classifier over-fits the data. It is better to allow a few mis-classified training samples in order to obtain better results on the test set and thus ensure good generalisation.

keyword       linear adatron            kernel adatron            FS + kernel adatron
              train error  test error   train error  test error   train error  test error
snowy         0.0 %        9.2 %        2.3 %        8.9 %        2.4 %        8.5 %
countryside   0.0 %        12.6 %       0.0 %        9.1 %        8.0 %        8.4 %
people        3.6 %        16.4 %       0.5 %        14.1 %       3.6 %        7.5 %
streets       0.1 %        14.0 %       0.0 %        12.1 %       2.5 %        6.2 %
The training data-set is denoted $T = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Each $x_i$ is a reduced presence vector (with only the selected features) and $y_i$ is true or false depending on whether the image is an example of the concept to learn. For a chosen kernel K, the algorithm estimates the parameters $\alpha_i$ and $b$ of the decision rule, which tests if an unknown presence vector x corresponds to the same concept:

$f(x) = \mathrm{sign}\left(\sum_{i=1}^{n} y_i \alpha_i K(x, x_i) + b\right)$   (2)
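A minimal sketch of a kernel adatron trained by gradient ascent, together with the decision rule of Eq. (2); it assumes ±1 labels, a polynomial kernel and an illustrative learning rate, and is not the authors' exact implementation.

```python
import numpy as np

def kernel_adatron(X, y, kernel, lr=0.1, n_iter=200):
    """Gradient-ascent sketch of the kernel adatron; y must contain +1/-1 labels."""
    y = np.asarray(y, dtype=float)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    alpha = np.zeros(len(X))
    for _ in range(n_iter):
        z = (alpha * y) @ K                          # current margins (no bias yet)
        alpha = np.maximum(0.0, alpha + lr * (1.0 - y * z))
    b = -0.5 * (np.min(z[y == 1]) + np.max(z[y == -1]))   # centre the margin
    return alpha, b

def predict(x, X, y, alpha, b, kernel):
    """Decision rule of Eq. (2)."""
    return np.sign(sum(yi * ai * kernel(x, xi) for xi, yi, ai in zip(X, y, alpha)) + b)

poly_kernel = lambda u, v, d=2: (1.0 + np.dot(u, v)) ** d   # illustrative polynomial kernel
```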
4. EXPERIMENTS
4.1 Data-set
The data-set is composed of 4 classes of images containing 30 instances of a particular scene: snowy, countryside, streets and people and of a fifth one consisting of various images used to catch a glimpse of the possible real scenes. In the experiments, error rates are averaged on 50 runs using cross-validation with training of the classifier on 80% of the set (Duda et al., 2000).
4.2 Error rates
First we aim to test the validity of the image representation by presence vectors. A linear classifier tests only the co-presence of the region types to attribute a given label. The results (cf. table 1) show that the description scheme is efficient enough to separate different categories. Then, the use of a polynomial-kernel adatron yields smaller error rates, but over-fitting on the training data still impedes the correct classification of the more complex scenes such as people or streets. Finally, by selecting the most informative region types for each category ("FS + kernel adatron"), we obtain performances on complex scenes as good as those on simple ones.
Figure 2. The meaningful regions correspond to the region types that have a high mutual information with the label to predict. The upper row shows the original images and the lower row shows only the meaningful regions in these images.
4.3 Meaningful regions
For each concept, the feature selection allows us to retrieve the meaningful parts of the image: only these regions are then used by the classifier to recognise a given keyword. Figure 2 shows that the selected region types are consistent with what was intuitively expected: green ones are used to recognise countryside, skin-coloured ones people, and white ones snowy landscapes.
4.4 Evaluation
Our approach is compared with an SVM applied to image histograms (Chapelle et al., 1999). The error rates for both methods are shown in table 2. The histogram-based approach works well for the simple scenes that likely have high-density peaks on some colours. However, the performance is less effective for the more complex types of scene: various backgrounds make the generalisation harder. On the contrary, our classifiers used on presence vectors obtain roughly the same error rates for all kinds of scene. On the complex ones, both the segmentation into regions and the feature selection allow us to capture the details that differentiate these images from others without over-fitting.
5. CONCLUSION
We have presented in this article a new approach to scene recognition that intends to identify the image-region types in a given scene. Both the image representation and the classification scheme are particularly appropriate for scene description and provide a good trade-off between the classifier performance
Table 2. Error rates for various keywords: SVM on histograms vs. feature selection and polynomial adatron on presence vectors.

keyword       SVM on histograms         FS + polynomial adatron
              train error  test error   train error  test error
snowy         0.0 %        5.0 %        2.4 %        8.5 %
countryside   0.0 %        9.5 %        8.0 %        8.4 %
people        0.0 %        13.2 %       3.6 %        7.5 %
streets       0.1 %        15.2 %       2.5 %        6.2 %
and the prevention of over-fitting. Moreover the method is robust, since it does not require a fine tuning of a complex algorithm but on the contrary uses a succession of simple procedures.
REFERENCES Anlauf, J.K. and Biehl, M (1989). The adatron: an adaptive perceptron algorithm. Neurophysics Letters, 10:687–692. Bezdek, J.C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New-York. Chapelle, O., Haffner, P., and Vapnik, V. (1999). Svms for histogram-based image classification. IEEE Transactions on Neural Networks, 10:1055–1065. Comaniciu, D. and Meer, P. (1997). Robust analysis of feature spaces: Color image segmentation. In Proc. of CVPR, pages 750–755, San Juan, Porto Rico. Duda, R.O., Hart, P.E., and Stork, D.G. (2000). Pattern Classification. Wiley, New-York, 2nd edition. Duygulu, P., Barnard, K., de Freitas, J.F.G., and Forsyth, D. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proc. of ECCV, volume 4, pages 97–112, Copenhagen, Denmark. Friess, T.-T., Christianini, N., and Campbell, C. (1998). The kernel-adatron algorithm: a fast and simple learning procedure for support vector machines. In Proc. of ICML, Madison, Wisconsin. Gray, R.M. (1990). Entropy and Information Theory. Springer-Verlag, New York, New York. Guyon, I. and Elisseff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182. Li, J. and Wang, J.Z. (2003). Automatic linguistic indexing of pictures by a statistic modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1075– 1088. Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New-York.
VISION-BASED ANALYSIS OF THE TIRE FOOTPRINT SHAPE Klaudia Jankowska1 , Tomasz Krzyzynski1 , Andreas Domscheit2 1 Technical University of Koszalin, Raclawicka 15-17, 75-620 Koszalin, Poland [email protected] [email protected] 2 Continental AG, Jaedekamp 30, 30419 Hannover, Germany [email protected]
Abstract
In this paper we present an image processing application for automatic shape analysis of the tire-ground contact area. Such evaluations are essential since tires are responsible for supporting the vehicle and for transmitting the forces necessary to obtain the required kinematic behavior of the vehicle. Normally, analysis and comparison of footprints is done "manually", just by looking at them; automation by means of image processing can make them objective and more efficient.
Keywords:
Tire-ground contact area (tire footprint); background separation; images registration.
1. INTRODUCTION
Tires are the only part of the vehicle in contact with the ground. The contact area (called the footprint) is only 500–600 cm² in size for a truck tire. That small part of a tire is under a pressure of 0.6–1.6 MPa and is the place where acceleration, braking and lateral forces are generated1. Footprint shape and pressure distribution are foremost factors for tire wear and mileage performance, and have a significant influence on braking behavior. They also have to be taken into consideration during tire optimization for specific applications (load variations, inflation pressure, wheel position). On the other hand, footprint shape analysis is performed during quality control of tire production. As manual inspection of footprint images is time consuming, complex, unreliable and dependent on the inspector's form on a given day, an effort has been undertaken to automate this process. The overall goal of this work is to build a tire footprint assessment
system consisting of criteria that can be automatically calculated with the help of image processing methods. In this paper we address the first part of the work, namely footprint shape analysis, which will be followed by analysis of the pressure distribution in the contact area.
2. IMAGE ACQUISITION
Tires are pressed against an illuminated glass plate. A thin film is placed in between the tire and the glass plate. Due to dispersion, bright marks appear in the contact area of the tire (see Figure 1). Images are acquired in grayscale, where bright levels depict higher pressure. The obtained images are of the size 768 × 576 pixels (an example can be seen in Figure 2a).
Figure 1. Measurement method.
In order to accommodate the test stand to various tire sizes, a possibility to adjust the distance of the camera from the glass plate has been provided. For each of the six available positions the pixel size is measured during the test stand calibration. Another piece of information obtained in the calibration process is the pressure expressed as a function of pixel intensity. Its value is established by measuring the image intensity of a calibration rubber stamp of known contact area loaded with specified forces.
3. IMAGE PROCESSING
3.1 Background separation
As our overall goal is an automatic application, we were looking for an appropriate method to extract the footprint area from the image background (various thresholding algorithms are described in2). The target was to avoid the necessity to manually select the region of interest. The developed method consists of three steps. First, grayscale images $I_{gray}$ are converted to binary ones $I_{bw}$:

$I_{bw}(u, v) = \begin{cases} 1 & \text{if } I_{gray}(u, v) < T \\ 0 & \text{otherwise} \end{cases}$   (1)
where T is the gray level threshold value defined during measurement stand calibration (see Figure 2b). Then small objects recognized to be noise are removed (see Figure 2c). The limiting value for the noise object area is 0.3 cm², which corresponds to 39–84 pixels depending on the distance of the camera from the projection plane. After those steps the images still contain artefacts, usually caused by bending of the foil placed in between the tire and the glass plate. For some tread patterns they can be of a size comparable to the size of small tire blocks, which makes their identification more difficult. The difference between the mean gray level of the pixels of an object k and the threshold value T has been found to be an effective criterion to separate those unwanted objects:

$d(k) = \mu_I(k) - T$   (2)
Objects whose mean gray level intensity $\mu_I(k)$ differs from the threshold value T by less than 5 are recognized not to be part of the footprint area (see Figure 2d).
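The three-step background separation of Eqs. (1)-(2) might be sketched as follows; the connected-component labelling via scipy is an implementation choice, and the parameters (minimum area in pixels, gray-level margin of 5) follow the description above.

```python
import numpy as np
from scipy import ndimage

def separate_footprint(gray, T, min_area_px, delta=5):
    """Threshold (Eq. 1), remove small noise objects, then drop artefacts whose
    mean gray level is within 'delta' of the threshold T (Eq. 2)."""
    bw = gray < T                                    # Eq. (1): pixels below threshold T
    labels, n = ndimage.label(bw)
    keep = np.zeros_like(bw)
    for k in range(1, n + 1):
        mask = labels == k
        if mask.sum() < min_area_px:                 # noise removal
            continue
        d_k = gray[mask].mean() - T                  # Eq. (2): d(k) = mu_I(k) - T
        if abs(d_k) < delta:                         # too close to T: foil artefact
            continue
        keep |= mask
    return keep
```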
3.2 Shape parameters
For extracted footprint shapes height, width, area and height to width ratio are calculated. Scaling from pixel values to metrical values is done using pixel height and width obtained during calibration of the test stand.
3.3 Image registration
In order to perform automatic footprint shape comparison, images need to be registered, as the contact area appears at a different position within each image (see Figure 3a, b). A correlation method based on the Fast Fourier Transform3, according to the following equation, was used:
$C = \mathcal{F}^{-1}\left(\mathcal{F}(im1) * \mathcal{F}(rotated\ im2)\right)$   (3)
where im1 is the first image and rotated im2 is the part of the second image (enclosing the tire contact area) rotated by 180°. The position of im2 within im1 is indicated by the maximum pixel intensity value in the correlation image C (see Figure 3c, d).
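A sketch of the FFT-based correlation of Eq. (3), reading the product of spectra with a 180°-rotated template as a convolution that realises correlation; zero-padding and the interpretation of the peak position (up to the usual circular-shift convention) are simplified here.

```python
import numpy as np

def register_by_correlation(im1, im2_region):
    """Locate im2_region inside im1 via the FFT correlation of Eq. (3)."""
    h, w = im1.shape
    rotated = np.rot90(im2_region, 2)                # 180-degree rotation of the template
    F1 = np.fft.fft2(im1)
    F2 = np.fft.fft2(rotated, s=(h, w))              # zero-pad the template to im1's size
    C = np.real(np.fft.ifft2(F1 * F2))               # correlation image
    peak = np.unravel_index(np.argmax(C), C.shape)   # peak indicates the alignment offset
    return peak, C
```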
Figure 2. Background separation process: a) example of original image, b) result after thresholding, c) result after noise removal and finally, d) result after artifacts removal.
Figure 3. Registration of two images and their difference: a, b) original images of footprints, c) correlation image with maximum value indicated, d) footprints difference computed after registration of images, colors indicate where the contact area increased (black), decreased (light gray) or remained unchanged (dark gray pattern).
3.4 Shape comparison
Having the images registered allows us to compare footprint shapes. In Figure 3 one can see two footprints and the difference between them, with colors indicating the parts of the contact area which underwent changes.
4. RESULTS AND FUTURE WORK
The image database used for testing the method consists of 62 footprint images of tires varying in construction, tread pattern, size, load, inflation pressure and degree of wear. During the background separation process 747 objects were found in those images. 92.4% of them were classified correctly as tire part (448 objects) or noise artefact (242 objects). The objects not classified correctly were mainly those consisting simultaneously of a tire part and a background part. Overcoming this problem is one of the future tasks. Automation of the shape parameter calculation gives a practical benefit. Those parameters can be used to describe a particular tire or to compare different ones during the development or production quality control process. The effective registration method makes it possible to indicate the parts of the footprint where a change of shape occurred. This is valuable information additional to the shape dimension change. On the basis of the proposed methods an application with a graphical user interface has been developed. It is planned to extend this application with successive methods for footprint shape and pressure distribution assessment, classification and comparison.
ACKNOWLEDGMENTS The first author would like to thank the staff at Continental AG, Hannover, Germany for providing images for this work, and the Leonardo da Vinci Programme for partial support.
REFERENCES 1. John C. Dixon, Tires, suspension and handling, Warrendale, Pa.: Society of Automotive Engineers, second edition, 1996. 2. James R. Parker, Algorithms for image processing and computer vision, Wiley computer publishing, New York, 1997. 3. J. P. Lewis, Fast Template Matching, Vision Interface, pp. 120÷123, 1995.
MEASUREMENT OF THE LENGTH OF PEDESTRIAN CROSSINGS THROUGH IMAGE PROCESSING Mohammad Shorif Uddin and Tadayoshi Shioyama Department of Mechanical and System Engineering, Kyoto Institute of Technology Matsugasaki, Sakyo-ku, Kyoto 606-8585, Japan, E-mail: [email protected]
Abstract
A new computer vision based method for the measurement of the length of pedestrian crossings, with a view to developing a travel aid for blind people, is described. In a crossing, the usual black road surface is painted with constant-width periodic white bands. In Japan, this width is 45 cm. The crossing region as well as its length is determined using this concept. Experimental results using real road scenes with pedestrian crossings confirm the effectiveness of the proposed method.
Keywords:
Image analysis; computer vision; pedestrian crossing; measurement of length; travel aid for the blind.
1. INTRODUCTION
This paper discusses an application of computer vision to improve the mobility of millions of blind people all over the world. Usually, the blind use a white cane as a travel aid. The range of detection of special patterns or obstacles using a cane is very narrow. To improve the usefulness of the white cane, various devices have been developed such as the SONICGUIDE,1 the Mowat sensor,2 the Laser cane3 and the Navbelt4. However, these devices are not able to assist the blind at a pedestrian crossing, where information about the existence of the crossing, its length and the state of the traffic lights is important. There are traffic lights with special equipment which notifies the blind of a safe direction at a crossing by sounding a beeping noise during the green signal. However, such equipment does not inform the blind about the length of the crossing and is not available at every crossing; perhaps, it would take too long for such equipment to be put and maintained at every crossing. Blind people obviously cannot see, but can hear. Navigation is the number one barrier for them. The arrival of fast and cheap digital portable laptop computers with multimedia computing to convert audio-video streams in real time opens new pathways to the development of an intelligent navigation system for blind people.
In this paper we propose a simple image based method to measure the crossing length, assuming that the location of the pedestrian crossing is known. Previously, Shioyama et al.5 developed an image analysis method based on edge detection, Hough transformation and SUSAN feature detection techniques. However, it is complicated, computationally inefficient, and needs many parameters to adjust. In contrast to the previous method, the present method is fast and needs very few parameters to adjust. In order to evaluate the performance of the proposed method, experiments are performed using real road scenes with pedestrian crossings.
2. METHOD FOR MEASUREMENT OF CROSSING LENGTH
In a crossing, the usual black road surface is painted with constant-width periodic white bands. An image processing technique is applied for the extraction of the number of white and black bands in the crossing region.
Figure 1. Crossing length estimation model.
2.1 Crossing length estimation model
The crossing length estimation model is described in Fig. 1. Let $y_n$ and $y_0$ be the positions of the first and the end lines of the crossing in the image plane, respectively, $f$ the focal length of the camera, $h$ the camera height above the road surface, $d$ the crossing length (i.e. the horizontal distance from the observer to the end line of the crossing) and $w$ the width of a crossing band. If there are $n$ bands in the crossing image, then we easily find the following relations: $y_0 = fh/d$ and $y_n = fh/(d - nw)$. Then we get

$y_n - y_0 = fh \times \frac{nw}{d(d - nw)}.$   (1)
We can derive the expression for the crossing length from Eq. (1) as

$d = \frac{1}{2}\left( nw + \sqrt{(nw)^2 + \frac{4nw\,fh}{y_n - y_0}} \right).$   (2)
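Eq. (2) translates directly into a small helper; all quantities must be in consistent units (w = 0.45 m for the band width in Japan, f and h from the camera set-up).

```python
import math

def crossing_length(n_bands, w, f, h, y_n, y_0):
    """Crossing length d from Eq. (2)."""
    nw = n_bands * w
    return 0.5 * (nw + math.sqrt(nw * nw + 4.0 * nw * f * h / (y_n - y_0)))
```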
2.2 Feature extraction principle
Let $i(x, y)$ denote the intensity of an image, where $x$ and $y$ are the horizontal and vertical spatial coordinates, respectively. If the image is rotated by an angle $\theta$ and $(u, v)$ are the rotated coordinates corresponding to $(x, y)$, then $u$ and $v$ are calculated as follows: $u = x\cos\theta + y\sin\theta$ and $v = -x\sin\theta + y\cos\theta$. If there exists a pedestrian crossing in the image, then $\theta$ will closely correspond to the direction of the crossing bands. The differential of $i(x, y)$ along the $v$ direction includes alternate peaks and valleys on the edges of the crossing bands (i.e. from black to white and vice versa). Therefore, $\partial i/\partial v$ has local extremes on the edges of the crossing bands. Since the edges of the crossing bands are straight lines, the integration along the $u$ direction emphasizes the local extremes. Consequently, one can find crossing bands by analyzing the projection $\int_{-\infty}^{\infty} \frac{\partial i}{\partial v}\, du$, which is a one-dimensional function of $v$. The integral of the square of this projection becomes a good measure of closeness to the true crossing direction. Accordingly, we use the $\theta$ that maximizes $\int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} \frac{\partial i}{\partial v}\, du\right)^2 dv$. Using Parseval's formula, one can derive the following equation:

$\int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} \frac{\partial i}{\partial v}\, du\right)^2 dv = \frac{1}{2\pi}\int_{-\infty}^{\infty} \zeta^2 \left| I(-\zeta\sin\theta,\ \zeta\cos\theta) \right|^2 d\zeta,$   (3)

where $I(\xi, \eta) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} i(x, y)\, e^{-j(\xi x + \eta y)}\, dx\, dy$, $j$ is the imaginary unit, and $\zeta = -\xi\sin\theta + \eta\cos\theta$. We find the $\theta$ that maximizes the right hand side of Eq. (3).
2.3 Crossing direction estimation
The images used in this paper are of size (width × height) = (640 × 480) pixels, and the origin of the image coordinates is chosen at the upper left corner of the image. At first, the color image is converted to a gray scale image. We calculate the power spectrum using a 2D FFT of the gray scale image. In the FFT, to make the sample numbers an integer power of 2, we use the maximum possible region of the image. In taking the maximum region, we choose the lower part. This is due to the fact that the camera is set at the height of an observer's eye, so the crossing is always in the lower region of the image. For an image of size (640 × 480) pixels, we take (512 × 256) pixels from the lower region for the Fourier transformation; these are the maximum possible sample numbers. The maximum value of the power spectrum corresponds to the crossing direction.
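A rough spatial-domain equivalent of the search for θ may help fix ideas (the paper instead maximises the FFT power-spectrum expression of Eq. (3), which is faster); the angle range, one-degree step and rotation interpolation are assumptions made here.

```python
import numpy as np
from scipy import ndimage

def crossing_direction(gray_lower, angles_deg=np.arange(-45, 46, 1)):
    """For each candidate angle, rotate the lower image region so the bands become
    horizontal, differentiate along v, project along u and score the squared
    projection (the quantity maximised in Eq. (3))."""
    best_theta, best_score = None, -np.inf
    for theta in angles_deg:
        rot = ndimage.rotate(gray_lower, theta, reshape=False, order=1)
        div = np.diff(rot.astype(float), axis=0)   # derivative along the v direction
        proj = div.sum(axis=1)                     # integration along the u direction
        score = np.sum(proj ** 2)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta
```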
2.4 Crossing pattern extraction
We integrate the gray scale image along the u-axis direction for each v-axis position. The shape of this projection plot is periodic in nature if there exist crossing patterns in the image. Then the differentiation along the v direction of the integral data will give the location of the edges of the white and black bands. Next, we follow the steps below (a sketch of the two 1D filters used in Steps 1 and 5 is given after the list).

1. Starting from the bottom of the image, integrate all absolute values of the differentiation data and then smooth the integration result using a moving averaging window of size 1/20 of the image height. Due to the crossing patterns, the gradient of this smoothed integration result changes strongly at the end line of the crossing region. To emphasize this change, use a Laplacian filter of size $N_l$ = 1/10 of the image height. Finally, take the position of the global maximum of this filtered output; this is the end line of the crossing. The Laplacian filter can be described as $L(t) = a(t) - \frac{1}{N_l}\sum_{n=-N_l/2,\, n\neq 0}^{N_l/2} a(t+n)$, where $a(t)$ is the integration data after smoothing.

2. From the bottom position of the image to the end line of the crossing, extract important extremes by comparing the absolute values of the differentials.

3. Starting from the farthest local extreme (on the basis of its y position in the image), if there exist two adjacent extremes of the same sign, remove the extreme that has the lower value. The straight lines along the crossing direction at the remaining extremes' locations indicate the boundaries of the crossing bands.

4. We can estimate the crossing distance using Eq. (2) from the above extracted number of crossing bands. However, this estimation may be erroneous, as the regions neighboring the crossing have an influence on the extraction of the crossing bands. So, for a better estimation, we use the following steps to extract the crossing region from the whole image. A crossing region is characterized by alternating black and white bands, so if a band is white, then its mean intensity value must be greater than the mean values of the immediately preceding and following bands. Using the bands selected in Step 3, determine the white bands from the mean image intensity of each band region.

5. Integrate the gray values within each white band along the v direction. To find the left and right border points of each white band, use a step filter of size $N_s$ = 128 pixels. The step filter can be described as $S(t) = \frac{1}{N_s}\left(\sum_{n=-N_s/2}^{-1} b(t+n) - \sum_{n=1}^{N_s/2} b(t+n)\right)$, where $b(t)$ is the mean integration data along the v direction.

6. To determine the left and right border lines of the crossing region from the left and right border candidate points, respectively, use the minimum error strategy with omission.
7. Starting from the bottom of the image, integrate the image along the u direction for each v position within the left and right boundaries. Determine the position of the white and black bands from the differentials of these integration data using the above Steps 1 to 3, and then calculate the crossing length using Eq. (2).
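The two one-dimensional filters used in Steps 1 and 5 might be sketched as follows; the handling of the borders (the output is left unfiltered near the ends) is an implementation choice.

```python
import numpy as np

def laplacian_filter(a, N_l):
    """L(t) = a(t) - (1/N_l) * sum over n = -N_l/2..N_l/2, n != 0, of a(t+n)  (Step 1)."""
    half = N_l // 2
    out = np.array(a, dtype=float)
    for t in range(half, len(a) - half):
        neigh = np.concatenate([a[t - half:t], a[t + 1:t + half + 1]])
        out[t] = a[t] - neigh.sum() / N_l
    return out

def step_filter(b, N_s):
    """S(t) = (1/N_s) * (sum of b(t+n), n = -N_s/2..-1, minus sum of b(t+n), n = 1..N_s/2)  (Step 5)."""
    half = N_s // 2
    out = np.zeros(len(b))
    for t in range(half, len(b) - half):
        out[t] = (np.sum(b[t - half:t]) - np.sum(b[t + 1:t + half + 1])) / N_s
    return out
```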
Figure 2. Two experimental images of pedestrian crossing: (a) 17.00 m, 16.26 m, 4.3%; (b) 18.56 m, 13.78 m, 25.8%. The crossing region in each image is marked by border lines.
Figure 3. (a) Integration “Projection”, differentiation “differential”, integration of differential data after smoothing “int diff smooth”, Laplacian filtered output “int diff laplace”, important local extreme positions “imp local extremes” for the crossing image shown in Fig. 2(a) using the crossing region only, (b) straight line marks are drawn at the position of important local extremes for the same crossing image.
3. EXPERIMENTAL RESULTS
To evaluate the performance of the proposed method for the measurement of crossing length, we used 77 real images of crossing taken by a commercial digital camera under various illumination conditions by observing in various weathers (except rain). Two samples of experimental images are shown in Fig. 2. Under each image we show the true crossing length, the estimated crossing length and the percentage error. The extracted border lines are also shown
in these images. Fig. 3(a) presents the results of integration, differentiation, integration of the differential data after smoothing, the Laplacian filtered output and the important local extreme positions for the image of Fig. 2(a), using the crossing region only. In Fig. 3(b), straight line marks are drawn at the positions of the important local extremes for the same crossing image shown in Fig. 2(a). From the experimental results, we find that our proposed method is successful in determining the length of pedestrian crossings. The average relative estimation error is 8.5% and the r.m.s. error is 1.87 m. The maximum relative error is 25.8%. The maximum error occurs for the image shown in Fig. 2(b), due to the fact that the white paintings and also the image resolution are not perfect. Though in the present investigation the relative and maximum errors are somewhat large, we are confident that these errors will be reduced by (i) maintaining clear white paintings on the crossing, (ii) taking images with good resolution and (iii) adding a vehicle detection algorithm as a preprocessing step to this method, which will ensure no vehicle obstruction in the image. The approximate computation time of the proposed algorithm for the measurement of a crossing length is 1.30 s using a 1600 MHz Intel Pentium M processor.
4. CONCLUSIONS
In this paper, a simple and fast computer vision based pedestrian crossing length measurement technique has been described. Using 77 real road scenes with pedestrian crossings, the average relative estimation error and the r.m.s. error are found to be 8.5% and 1.87 m, respectively. The main sources of measurement error are the low image resolution and distorted white paintings. We are confident that the accuracy will be greatly increased by overcoming the above mentioned causes. As computer hardware costs are decreasing day by day, we hope the system will be affordable for the blind.
ACKNOWLEDGMENTS The authors are grateful for the support of Japan Society for the Promotion of Science under Grants-in-Aid for Scientific Research (No. 16500110 and No. 03232).
REFERENCES 1. L. Kay, Radio Electron. Eng. 44, 605-629, 1974. 2. D. L. Morrissette et al., J. Vis. Impairment and Blindness 75, 244-247, 1981. 3. J. M. Benjamin, in Carnahan Conf. on Electronic Prosthetics, 77-82, 1973. 4. S. Shoval et al., IEEE Trans. Syst. Man Cybern. C28, 459-467, 1998. 5. T. Shioyama et al., Meas. Sci. Technol. 13, 1450-1457, 2002.
SHAPE RECOVERY OF A STRICTLY CONVEX SOLID FROM N-VIEWS Simon Collings1 ∗ , Ryszard Kozera2 † , Lyle Noakes3 1 School of Mathematics and Statistics
The University of Western Australia 35 Stirling Hwy, Crawley, W.A. 6009 Perth, Australia [email protected] 2 School of Computer Science and Software Engineering
The University of Western Australia 35 Stirling Hwy, Crawley, W.A. 6009 Perth, Australia [email protected] 3 School of Mathematics and Statistics The University of Western Australia 35 Stirling Hwy, Crawley, W.A. 6009 Perth, Australia [email protected]
Abstract
In this paper we consider the problem of extracting the shape of a smooth convex solid, V ⊂ R3 , from a set of N photographs. The method begins by extracting the edges of each photograph. These edges are used to form a cone whose apex is the camera centre, which is guaranteed to enclose V. For a strictly convex solid any two such cones will most likely touch at two places (Collings et al., 2004), whose coordinates then give two data points which lie on V along with the orientation of the surface at these points. A set of cameras observing V yields a cloud of such points and normals. A new type of implicit surface is fitted to both the points and their normals. The implicit surface has the property of minimising a linear combination of first, second and third order energies, as in (Dinh et al., 2002), but with the added refinement of incorporating information about the surface orientation at each constraint point.
∗ With partial funding from The Western Australian Interactive Virtual Environments Centre. † With partial funding from The Alexander von Humboldt Foundation.
1. INTRODUCTION
Attempting to extract 3D information from a set of N photographs is a topic of much interest in the field of computer vision. Voxel colouring (Culbertson et al., 1999) and space carving (Katulakos and Seitz, 2000) go some way to solving the problem in a robust, but rather machine intensive way. In this paper it is shown that in the case of a smooth strictly convex solid, V ⊂ R³, being viewed by N cameras, whose parameters are all known, it is possible to calculate a set of surface points. An implicit surface can be fitted to these points and, provided there are a sufficient number of them, this surface forms a very good representation of V. Throughout this paper all concepts will be illustrated by the example of an ellipsoid, which is the zero level set of the function $F(x, y, z) = ax^2 + by^2 + cz^2 - r^2$. Further examples are given in the conclusion. Fig. 1 shows 2 simulated photographs of the ellipsoid V, along with a 3D rendering of the original shape. The 2 photographs are a subset of the 8 used for this example and are created using a ray-tracing style simulation. Let us suppose that the photographs are taken from known camera locations $\{c_1, \ldots, c_N\}$, with both intrinsic and extrinsic camera parameters assumed known. The reflectance map for this example is Lambertian (Horn, 1986), but the results presented here are essentially shape-from-silhouette, since no reflectance information is used. In a later paper the reflectance information may be used to refine the shape-from-silhouette result.
2. THE OCCLUDING EDGE THEOREM
For any given camera $c_i$, the occluding edge with respect to that camera is the set of points in V whose image under perspective projection through $c_i$ forms the boundary of the image of V. The occluding edge cone (OEC) is the unique cone whose apex is at the camera's centre and which passes through the occluding edge. The OEC for a given camera entirely encloses the solid, touching it at the occluding edge. To recover OECs, edge detection is performed on each image.
Figure 1. An ellipsoid and two photographs of an ellipsoid.
Figure 2. Edge detected image.
Figure 3. A tangent for each edge pixel.
There is a variety of edge detection techniques in use (Canny, 1983; Marr and Hildreth, 1980), but due to the artificial nature of our example, a simple gradient based method was found to work well. Shown in Fig. 2 is an example of an edge detected photograph. The edges are then scaled and back-projected to give a representation of the OEC. Fig. 4 shows two such cones, with the original ellipsoid V. Observe that for an arbitrary pair of cones there will quite likely be two points where the cones touch; indeed the following theorem holds:
Theorem 1 Let ci and cj , be two cameras observing a strictly convex solid, V and let Qi and Qj be their respective OEC’s. For the boundary of the strictly convex solid, write ∂V and for the tangent plane to, say, Qi at a point x ∈ Qi , write (T Qi )x . Providing neither camera is contained within the other’s OEC, there exist two points p1 , p2 ∈ Qi ∩ Qj ∩ ∂V, such that (T Qi )p1 = (T Qj )p1 = (T ∂V)p1 and (T Qi )p2 = (T Qj )p2 = (T ∂V)p2 , i.e. at each of these two points the OECs and V intersect and have identical tangent planes. The proof of Theorem 1 can be found in (Collings et al., 2004). Two OECs which satisfy the condition stated in the theorem will be called compatible and similarly their respective cameras will be called compatible cameras.
Figure 4. Two OECs touch on the surface of V.
3. SOLVING FOR THE INTERSECTION OF THE OECS
The points where two compatible OECs touch are determined by taking the edge detected photograph and finding the normal associated with each edge pixel in it. To find the associated normal for a given pixel, a local orthogonal least squares quadratic fit is performed with that pixel and several of its neighbours. Details of this fitting process will be given elsewhere. The derivative of the local quadratic gives an estimate for the tangent vector at each pixel. All such tangents are shown for one image in Fig. 3. Each tangent to the back-projected image is also tangential to the occluding edge of the object, where the image plane is considered to be embedded in R³. By taking the cross product of this tangent line with the ray that projects from the camera centre through the pixel and normalising, one obtains a normal to the cone at that pixel. For two compatible cameras $c_i$ and $c_j$ the above procedure results in a set of normals indexed by edge pixels, so that we have $n_k^i$ for $k = 1, \ldots, K$ and $n_l^j$ for $l = 1, \ldots, L$, where K and L are the numbers of edge pixels in images i and j respectively. These are unit normals, which lie on the unit 2-sphere, and the process of recovering them can be considered as a restriction of the Gauss map to the occluding edges of the object. To determine the approximate points to which the normals correspond, the L × K array whose entries are given by $\|n_k^i - n_l^j\|$ must be searched. To achieve this, a discrete gradient descent method is employed where a 3 × 3 mask is centred over a random location in the array. The minimum of the 9 data points covered by this mask is calculated and then the mask is re-centred over this local minimum. By repeating this procedure the mask will eventually end up centred on a local minimum, which then corresponds to one of the solutions. A different initialisation point will then usually yield the other solution. Suppose that one of the solutions is located at the index (k, l). The pixel k is back-projected through the camera centre $C_i$ to form a ray. Similarly a ray is formed by back-projecting the pixel l through the camera centred at $C_j$. The intersection of these two rays is then a point lying on the surface of V. For each pair of compatible cameras, two new points on the surface can be calculated. The result is a set of M points $s_1, \ldots, s_M$, and their corresponding normals $n_1, \ldots, n_M$. In the example described here there are eight cameras which are
spaced roughly evenly around V. For N cameras there are at most $2\binom{N}{2}$ possible solution points generated. Since not all cameras in this example were compatible, there were only M = 42 solution points generated. In Fig. 5, we see the cloud of points, the cloud of points with their normals, and the cloud of points superimposed over a plot of the original shape V.
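The discrete 3 × 3 mask descent over the array of normal differences described above can be sketched as follows; the array D and the random restarts needed to reach both local minima are left to the caller.

```python
import numpy as np

def mask_descent(D, start):
    """Descend over D[l, k] = ||n_k^i - n_l^j|| with a 3x3 mask from a starting index;
    returns the index of the local minimum reached."""
    L, K = D.shape
    l, k = start
    while True:
        l0, l1 = max(0, l - 1), min(L, l + 2)
        k0, k1 = max(0, k - 1), min(K, k + 2)
        window = D[l0:l1, k0:k1]
        dl, dk = np.unravel_index(np.argmin(window), window.shape)
        nl, nk = l0 + dl, k0 + dk
        if (nl, nk) == (l, k):          # mask is centred on a local minimum: stop
            return l, k
        l, k = nl, nk
```

Running the descent from several random starting indices usually yields the two touching points of a compatible camera pair.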
4. FITTING IMPLICIT RADIAL BASIS FUNCTIONS TO THE POINTS
4.1 Fitting without normal constraints
In (Dinh et al., 2002), a method is presented for reconstructing a surface from a cloud of data points. The method is a variational one, where the fitted surface is the level set of a function designed to minimise a linear combination of first, second and third order derivatives. As the data points are subject to noise and pixelisation error, a balance must be found between interpolation and approximation in the fitting process. Let $\{s_1, \ldots, s_M\}$ be the points calculated by the above procedure, which are known to lie near the surface V. Let $\phi, \tau \in \mathbb{R}$ and let $x = (x_1, x_2, x_3)$ be standard cartesian coordinates in $\mathbb{R}^3$. Also let

$U(f) = \sum_{i=1}^{3}\left(\frac{\partial f}{\partial x_i}\right)^{2}$, $\quad R(f) = \sum_{i=1}^{3}\sum_{j=1}^{3}\left(\frac{\partial^2 f}{\partial x_i \partial x_j}\right)^{2}$ $\quad$ and $\quad S(f) = \sum_{i=1}^{3}\sum_{j=1}^{3}\sum_{k=1}^{3}\left(\frac{\partial^3 f}{\partial x_i \partial x_j \partial x_k}\right)^{2}$.

The smoothness term in the cost function is given by $I(f) = \int_{\mathbb{R}^3} \left(\phi^2 U(f) + R(f) + \tau^2 S(f)\right) dx$. The cost function itself is given by:

$H(f) = I(f) + \sum_{i=1}^{M} \frac{f(s_i)^2}{\lambda_i},$   (1)

where the $\lambda_i$'s are the so-called regularisation parameters, which control a trade-off between interpolation and approximation for each data point. The distributional Euler-Lagrange equation for the smoothness term is given in (Jost and Li-Jost, 1998) as $-\phi^2\,\Delta f + \Delta^2 f - \tau^2\,\Delta^3 f = \delta$, where $\Delta f = \sum_{i=1}^{3}\frac{\partial^2 f}{\partial x_i^2}$ is the Laplace operator. The fundamental solution $\hat{f}$ to this equation can be found by means of the Fourier Transform (Gel'fand and Shilov, 1964; Szmydt, 1977) and is given in (Chen and Suter, 1996) as:

$\hat{f}(x) = \frac{1}{4\pi\phi^2\,\|x\|}\left(1 + \frac{1}{v - w}\left(w\,e^{-\sqrt{v}\,\|x\|} - v\,e^{-\sqrt{w}\,\|x\|}\right)\right)$, where $v = \frac{1 + \sqrt{1 - 4\tau^2\phi^2}}{2\tau^2}$ and $w = \frac{1 - \sqrt{1 - 4\tau^2\phi^2}}{2\tau^2}$.
Figure 5. The cloud of solution points, the solutions and their normals and the solutions and normal superimposed on V.
Figure 6. The fitted data and the surface itself.
Incorporating the constraints $f(s_i) = 0$ leads to the equation $-\phi^2\,\Delta f + \Delta^2 f - \tau^2\,\Delta^3 f = \sum_{i=1}^{M} w_i\,\delta_{s_i}$, where $w_i \in \mathbb{R}$. Equation 1 is minimised by a weighted sum of the fundamental solution centred at the constraint points. In addition to this we require terms to span the null space of I(f) (Girosi et al., 1993). Since I(f) contains only first order and higher derivatives, the null space can be spanned by a first order polynomial $p_0 + \langle p, x\rangle$, where $p = (p_1, p_2, p_3)$ and $\langle\cdot,\cdot\rangle$ is the Euclidean inner product. The function is then a radial basis function of the form

$f(x) = \sum_{i=1}^{M} w_i\,\hat{f}(x - s_i) + \langle p, x\rangle + p_0.$   (2)
Equation 2 is linear, but it cannot be solved for the weights $w_i$ so that $f(s_i) = 0$ for every i, because this would just produce the trivial result that every weight is identically 0. To overcome this problem an additional constraint is imposed in the form of an interior point of the surface, which is given an arbitrary positive value. This forces the weights to take on non-zero values. Solving for the weights is a linear problem, the precise implementation of which follows that of (Dinh et al., 2002). Shown in Fig. 6 is the implicit function fitted to the above data, along with the actual surface for comparison. The parameters used in the fitting are τ = 0.5, δ = 0.15. The non-surface constraint was given a λ value of 2 and the surface constraints were given a λ value of 0.0001.
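A sketch of the linear solve for the weights of Eq. (2) without normal constraints: it uses the fundamental solution above, one interior point with an arbitrary positive target, and the λ values quoted in the text. The side conditions on the polynomial part, the guard at ||x|| = 0 and the parameter names are implementation choices; the smoothness parameter φ is left to the caller.

```python
import numpy as np

def fhat(r, phi, tau):
    """Fundamental solution of (Chen and Suter, 1996) evaluated at radius r."""
    disc = np.sqrt(1.0 - 4.0 * tau**2 * phi**2)
    v = (1.0 + disc) / (2.0 * tau**2)
    w = (1.0 - disc) / (2.0 * tau**2)
    r = np.where(r == 0, 1e-12, r)                   # guard the removable singularity
    return (1.0 + (w * np.exp(-np.sqrt(v) * r) - v * np.exp(-np.sqrt(w) * r)) / (v - w)) \
           / (4.0 * np.pi * phi**2 * r)

def fit_rbf(points, interior, phi, tau=0.5, lam_surface=1e-4, lam_interior=2.0):
    """Solve for the weights w_i and the polynomial part [p0, p] of Eq. (2)."""
    S = np.vstack([points, interior[None, :]])
    h = np.zeros(len(S)); h[-1] = 1.0                # arbitrary positive interior value
    lam = np.full(len(S), lam_surface); lam[-1] = lam_interior
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    A = fhat(D, phi, tau) + np.diag(lam)             # regularised interpolation block
    P = np.hstack([np.ones((len(S), 1)), S])         # null-space polynomial p0 + <p, x>
    M = np.vstack([np.hstack([A, P]),
                   np.hstack([P.T, np.zeros((4, 4))])])
    sol = np.linalg.solve(M, np.concatenate([h, np.zeros(4)]))
    return sol[:len(S)], sol[len(S):]                # weights w_i, polynomial [p0, p]
```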
4.2 Fitting with normal constraints
The normal constraints have the form $\nabla f(s_i) = k_i\,n_i$, for $k_i \in \mathbb{R}\setminus\{0\}$, i.e. the gradient of the implicit surface is required to be a multiple of the prescribed normal. These are non-linear constraints, but they can be made linear by some preprocessing as follows. For each normal constraint $n_i$, choose two vectors, linearly independent from each other and from $n_i$. Applying the Gram-
Schmidt orthogonalisation process to these three vectors (beginning with $n_i$) yields two vectors $a_i^1$ and $a_i^2$, which are orthogonal to each other and to $n_i$. The two constraints $\langle\nabla f(s_i), a_i^1\rangle = 0$ and $\langle\nabla f(s_i), a_i^2\rangle = 0$ are then equivalent to the above single constraint. Note that the result of the above process is to replace the one non-linear constraint with two new linear constraints. To incorporate information about the normal constraints into the surface fitting method we need to solve the distributional equation (Gel'fand and Shilov, 1964)

$-\phi^2\,\Delta f + \Delta^2 f - \tau^2\,\Delta^3 f = \sum_{i=1}^{M} w_i\,\delta_{s_i} + \sum_{i=1}^{M}\sum_{j=1}^{2} v_i^j\,\langle\nabla\delta_{s_i}, a_i^j\rangle,$   (3)
which can be derived by considering Lagrange multipliers in the calculus of variations. The solution to this equation is a linear combination of the fundamental solution and its derivatives centred at the constraint points.
The linear system to be solved is then given by AW = S, where A is the matrix shown in Fig. 7, $W = [\,w_1 \ldots w_n\ v_1^1 \ldots v_n^2\ p_0\ p\,]^T$ is the column vector of weights and $S = [\,f(s_1) \ldots f(s_n)\ 0 \ldots 0\,]$ contains the required values of f at the constraint points. Although there is little visual difference between a point cloud fitted using either of the above two methods, preliminary experiments have shown that the inclusion of normal information into the fitting of the point cloud lends a degree of robustness to the removal of points. When no normal information is used, randomly removing a subset of the points from the fitting process will often cause the representation to collapse. Using the same subset of the points but incorporating normal information into the fitting process will still yield an adequate representation. This is a useful feature of incorporating normal information, where available, since the point cloud representation is sparse when few cameras are used.
Figure 7. The system matrix for the inclusion of normal information.
Figure 8. The ellipsoid, the recovered ellipsoid and the ellipsoid with normals.
5. CONCLUSION AND RESULTS
In this paper we have shown that in the case of a smooth, strictly convex solid under observation by a set of cameras, whose parameters are known, a set of points on the surface of the object can be calculated. It has been proven that, provided certain conditions are met, each pair of cameras yields a pair of calculable points on the object, and therefore a set of such cameras leads to a point cloud representation of the object. In the process of calculating these points we also obtain the orientation of the surface. This information can be incorporated into the fitting process, to give a representation that is robust to the removal of data points. The method is very effective. The results shown below do not use shading information, and so can be considered shape-from-silhouette results.
Figure 9. The implicit quartic $ax^4 + by^4 + cz^4 - r^2 = 0$ and the recovered shape from 8 photographs.
Figure 10. Egg shaped solid and its recovered shape with superimposed normals.
In Fig. 8 the original ellipsoid is shown, along with the recovered shape from 8 photographs using the method proposed above. The second example, shown in Fig. 9, uses the same camera configuration to recover the shape of an implicit quartic. For the final example, shown in Fig. 10, an egg shaped solid, which is a piecewise smooth joining of two ellipsoids, is recovered, also with the same camera configuration.
REFERENCES Canny, J. F. (1983). Finding edges and line in images. Master’s thesis, MIT AI Lab. Chen, F. and Suter, D. (1996). Multiple order Laplacian splines - including splines with tension. Technical report, Department of Electrical and Computer Systems Engineering, Faculty of Engineering, Monash University, Clayton 3168. Vic, Australia. Collings, S., Kozera, R., and Noakes, L. (2004). Statement and proof of the Occluding Edge Theorem for n-view shape recovery. In preparation. Culbertson, W. B., Malzbender, T., and Slabaugh, G. (1999). Generalised voxel coloring. International Workshop on Vision Algorithms, Corfu, Greece, pages 67–64. Dinh, H. Q., Turk, G., and Slabaugh, G. (2002). Reconstructing surfaces by volumetric regularization using radial basis functions. IEEE Transactions on Pattern Analysis and Machine Intelligence., 24(10):1358–1371. Gel’fand, I. M. and Shilov, G. E. (1964). Generalized Functions Vol. 1. Academic Press, New York. Girosi, F., Jones, M., and Poggio, Tomaso, P. (1993). Priors, stabilizers and basis functions: from regularization to radial, tensor and additive splines. Technical Report A.I. Memo No. 1430, Massachusetts Insititute of Technology Artificial Intelligence Laboratory. Horn, B. K. P. (1986). Robot Vision. MIT Press. Jost, J. and Li-Jost, X. (1998). Calculus of variations. Cambridge University Press, Cambridge. Katulakos, K. N. and Seitz, M. (2000). A theory of shape by space carving. Int. J. Comp. Vision, 38(3):199–218. Marr, D. and Hildreth, E. C. (1980). Theory of edge detection. In Procedings of the Royal Society, volume B, pages 187–217. Szmydt, Z. (1977). Fourier transformation and linear differential equations. D. Reidel Publishing Co., Dordrecht, revised edition.
JOINT ESTIMATION OF MULTIPLE LIGHT SOURCES AND REFLECTANCE FROM IMAGES Bruno Mercier and Daniel Meneveaux SIC Laboratory, University of Poitiers (France)
{mercier,daniel}@sic.univ-poitiers.fr
Abstract
In this paper, we propose a new method for jointly estimating the light sources and reflectance properties of an object seen through images. A classification process first identifies regions of the object having the same appearance. An identification method is then applied to jointly (i) decide which light sources are actually significant and (ii) estimate diffuse and specular coefficients for the surface.
Keywords:
Light sources detection; reflectance properties estimation; identification method.
1. INTRODUCTION
Light sources estimation is a key issue for many applications related to computer vision, image processing or computer graphics. For example segmentation algorithms, shape from shading methods or augmented reality approaches can be improved when the incoming light direction is known. From images, several works propose to estimate directional or point light sources: with stereo images (Zhou and Kambhamettu, 2002), using convex objects contours (Vega and Yang, 1994), shadows (Nillius and Eklundh, 2001) or reflectance maps (Guillou, 2000). Recent methods favor the use of lambertian or specular spheres for acquiring radiance maps (Debevec, 1998) or detecting point light sources (Powell et al., 2001; Zhang and Yang, 2000). Our approach can be applied to a set of images for a single object and does not need any additional test pattern or specific object such as a sphere. In this paper our contributions include: a new method for estimating several directional and point light sources from images of an object; an algorithm capable of estimating light sources jointly with the surface reflectance of the object; results for a series of experiments.
2. OVERVIEW
Our method applies to a set of images representing an object without cavity. For each image, we make the assumption that the camera position and
orientation are known. In practice, we use both synthesized images and photographs of real objects (see Figure 1).
Figure 1. a. Images of some virtual objects we used; b. Image of a real plastic object (clown model courtesy of Rodolphe, 3 years old); c. Voxel-based reconstructed clown.
A preprocessing step recovers a voxel-based geometry of the object, using the shape from silhouette approach proposed by (Szeliski, 1993) (Figure 1.c). For each voxel, a polygonal surface is produced according to the marching cubes algorithm and a normal is estimated (for more details, see (Mercier et al., 2003)). Each pixel of an image corresponds to the radiance emitted by the object in the direction of the camera (from one voxel to the camera). Each voxel is seen from several viewpoints; thus, for each voxel it is possible to store the set of radiances associated with pixels. We distinguish two types of light source (point and directional) and two types of surface (diffuse and specular-like). The broad lines of our light source estimation algorithm are:
- classify voxels into regions according to hue and orientation;
- for each region, estimate the type of surface (diffuse or specular-like);
- for each region, search for a point light source;
- for each region, search for a directional light source;
- for each region, identify source parameters and surface properties;
- validate light source positions/directions and surface properties.
3. LIGHT SOURCES DETECTION
In this work, as a BRDF model, we choose the modified Phong model proposed in (Lewis, 1994), since it is physically plausible and represents diffuse and/or specular surfaces with only 3 coefficients. According to this model, the radiance $L_r$ reflected at a point P towards a direction $\vec{R}$ is expressed as:

$L_r = \frac{L_s K_d}{\pi r^2}\cos\theta + \frac{(n+2)\,L_s K_s}{2\pi r^2}\cos\theta\,\cos^n\phi,$

where $L_s$ is the radiance emitted by a light source S; r is the distance between S and P; $K_d$ and $K_s$ are respectively the diffuse and specular reflection coefficients; n defines the specular lobe size; θ is the light incidence angle; φ is the angle between $\vec{R}_m$ (the mirror reflection direction) and $\vec{R}$.
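The model translates directly into a small helper (angles in radians; the clamping of negative cosines that real renderers usually apply is omitted here):

```python
import numpy as np

def reflected_radiance(Ls, Kd, Ks, n, r, theta, phi_angle):
    """Modified Phong model used above: diffuse term plus specular lobe."""
    diffuse = Ls * Kd * np.cos(theta) / (np.pi * r**2)
    specular = (n + 2) * Ls * Ks * np.cos(theta) * np.cos(phi_angle)**n / (2 * np.pi * r**2)
    return diffuse + specular
```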
3.1 Voxels classification
In this paper we make the assumption that the surface can be made up of different types of materials and lit by several light sources. To simplify the problem, voxels are firstly classified according to hue (using the HSV color space). For each class defined, voxels are then grouped again according to orientation, so that all the voxels in a group are likely to be lit by the same (single) light source.
3.2 Type of surface estimation
For each voxel class, the type of surface (diffuse or specular-like) is estimated with the help of a variation coefficient V computed from the radiance samples (pixels):

$V = \frac{1}{NbV}\sum_{i=1}^{NbV}\frac{1}{NbL_i}\sum_{j=1}^{NbL_i}\left(\frac{L_{i,j} - L_i^{moy}}{L_i^{moy}}\right)^{2},$

where NbV represents the number of voxels in the class, $L_{i,j}$ is the j-th radiance sample of $V_i$ ($V_i$ being the i-th voxel), $L_i^{moy}$ is the average radiance of $V_i$, and $NbL_i$ corresponds to the number of radiance samples of $V_i$. V varies according to the specular aspect of the surface.
3.3 Point source detection
For each voxel of a given class, we firstly estimate a directional light source; the point source position is then deduced from this set of directions. If the point source is far enough from the surface, the algorithm concludes it is a directional light source.
3.3.1 For diffuse surfaces. The radiance emitted by a diffuse surface element is constant whatever the reflection direction. This radiance corresponds to the product $L_s K_d \cos\theta$. For all the voxels of a given class we consider that $L_s K_d$ is constant. For each class, the voxel $V^{ref}$ having the highest radiance $L_V^{ref}$ is chosen as a reference for initializing our iterative process: its normal is used as the incident direction of light $\vec{I}_{V^{ref}}$, with $\theta_{V^{ref}} = 0$ and $L_V^{ref} = L_s K_d$. Consequently, for each voxel V in the class, $\theta_V = \arccos(L_V / L_V^{ref})$. This estimation of $\theta_V$ does not directly provide the incident direction but a cone of directions (Figure 2.a). For V, the incident direction belongs to the plane defined by the centers of V, $V^{ref}$ and the incidence direction $\vec{I}_{V^{ref}}$ (Figure 2.a). The intersection between the cone of directions and this plane gives 0, 1 or 2 incident directions. Momentarily, the algorithm ignores the voxels having two possible incident directions. All the single-incident directions are stored in the matrix form MX = D, where X corresponds to the searched point light source coordinates. X is obtained with the help of a pseudo-inverse matrix $M_{pi}$. From
Figure 2. a. Intersection between the cone and the plane containing the point source; b. Choice of a new voxel Ve for estimating point light source, where θe is known.
this first estimation, for each ignored voxel (having two correct incident directions), the one of the two directions most consistent with the estimation of X is added to the matrix system. A new estimation of $M_{pi}$ and X is computed. However, the quality of this estimation is very dependent on the choice of $V^{ref}$: the normal of $V^{ref}$ is rarely perfectly aligned with the light source direction. For refining the solution, the algorithm selects various $V^{ref}$ (see Figure 2.b) so as to reduce the following error criterion: $E_d = \sum_{i=1}^{2m}(M_i X - D_i)^2$, where $M_i$ is the i-th row of the matrix M and m corresponds to the number of incident directions. $E_d$ provides the significance of the final result. This process is repeated as long as the estimated error $E_d$ decreases.
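A minimal sketch of the least-squares solve and of the error criterion $E_d$; M and D are assembled from the per-voxel incident directions as described above.

```python
import numpy as np

def solve_point_source(M, D):
    """Least-squares solution of M X = D via the pseudo-inverse, with the residual
    error E_d used to compare different reference-voxel choices."""
    X = np.linalg.pinv(M) @ D
    Ed = float(np.sum((M @ X - D) ** 2))
    return X, Ed
```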
3.3.2 For specular-like surfaces. For non-Lambertian surfaces, the specular lobe can be used to estimate the light source position; the incident radiance $L_s$ is mostly reflected by the surface in the mirror direction $\vec{R}_m$. We propose to represent the specular lobe as a curved surface and to identify its coefficients from the voxel radiance samples (Figure 3.a). For estimating $\vec{R}_m$, we use a parabolic surface $L_{\alpha,\beta} = a(\alpha - \delta_\alpha)^2 + b(\beta - \delta_\beta)^2 + c$. From this equation, the mirror direction $\vec{R}_m$ is defined by $(\delta_\alpha, \delta_\beta)$ (see Figure 3.b). For estimating these coefficients we identify a, b, c, $\delta_\alpha$ and $\delta_\beta$ from the radiance samples with a gradient descent method. As for diffuse surfaces, the direction estimated for each voxel is stored in a linear system solved with a pseudo-inverse matrix to recover the point source position.
3.4
Directional source detection
3.4.1 For diffuse surfaces. The radiance emitted by a voxel V is: → −→ − −→ LV = Ls Kd cos(θV ) = Ls Kd ( I .NV ) where NV is the surface normal inside → − → − V and I is the searched incident direction. In this equation, Ls , Kd , and I are the same for all the voxels of a class. Again, this system can be described
70
Figure 3. a. Radiance samples for a voxel, the grey tint contains radiance samples of the specular lobe; b. Radiance emitted by a surface point according to the angles α and β (polar coordinates of the reflected direction).
and solved using a matrix form MX = D, where X represents the product → − Ls Kd I , D is a vector containing LV values and M corresponds to the set of −→ vectors NV .
3.4.2 For specular-like surfaces. As for point light sources, specular lobes can be used. Once a direction has been estimated for each voxel (for a given class), the light source direction corresponds to the average of all estimated directions.
3.5
Joint identification
Each light source estimation algorithm is independently applied for each voxel subclass, providing one point light source and one directional light source. For each analysis, we estimate an error: Ea =
bLi N bV N i=1 j=1
[(
Ls Kd (n + 2)Ls Ks cos θi + cos θi cosn φi,j ) − Li,j ]2 2 πr 2πr2
where Li,j corresponds to the radiance sample j of the voxel Vi ; the parameters Ls Kd , Ls Ks and n are unknown. We apply an identification algorithm with the help of a gradient descent method in order to obtain the final parameters Ls Kd , Ls Ks and n (keeping Ea as small as possible). A final step groups the detected light sources according to their types and position/orientation.
4.
RESULTS AND CONCLUSION
For validating our method, we used a set of voxels lit by only one (known) light source. Voxels positions, radiance samples directions and normals have been randomly generated. As shown in Table 4, our point light source detection method provides a directional light source when the distance is too high. With images of virtual objects, the estimated direction of incoming flux is precise about 15 degrees in the worst case; the average precision is about 6 degrees. Though validation is difficult to achieve through real objects, we applied our method to the clown shown in Figure 1.b. It has been lit by 2 spots,
Joint Estimation of Multiple Light Sources and Reflectance from Images Surface type
Object-source distance
Estimated final source
0 − 9m > 9m 0 − 2m 6m > 6m
point directional point none directional
diffuse specularlike Figure 4.
71
inaccuracy on source L s Kd n pos/dir L s Ks < 1cm 1% × < 1◦ < 15cm 1m 1% 5% < 1◦
Point light source detection with a 1-meter diameter object.
actually estimated as directional light sources with an error ranging from 30◦ to 40◦ . This error is mainly due to two factors. Firstly, the used object contains many unreconstructed small cavities. Secondly, our calibration method is still not precise enough. Our method could probably be improved with the help of specular spots seen on images. Moreover, since the method has proven efficient for virtual objects, we aim at validating it with photographs of real objects.
REFERENCES Debevec, Paul E. (1998). Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global illumination and high dynamic range photography. ACM Computer Graphics, 32(Annual Conference Series):189–198. Guillou, Erwan (2000). Simulation d’environnements complexes non lambertiens „ a partir d’images: application „ a la r«ealit«e augment«ee. PhD thesis, Universit«e de Rennes 1. Lewis, R. (1994). Making shaders more physically plausible. Computer Graphics Forum, 13(2). Mercier, B., Meneveaux, D., and Fournier, A. (2003). Lumigraphe et reconstruction g«eom«etrique. In AFIG 2003. Nillius, Peter and Eklundh, Jan-Olof (2001). Automatic estimation of the projected light source direction. In CVPR, pages I:1076–1083. Powell, M. W., Sarkar, S., and Goldgof, D. (2001). A simple strategy for calibrating the geometry of light sources. PAMI, 23(9):1022–1027. Szeliski, Richard (1993). Rapid octree construction from image sequences. In 58, editor, CVGIP: Image Understanding, volume 1, pages 23–32. Vega, O. E. and Yang, Y. H. (1994). Default shape theory: With application to the computation of the direction of the light source. CVGIP, 60(3):285–299. Zhang, Y. and Yang, Y. H. (2000). Illuminant direction determination for multiple light sources. In CVPR00, pages I: 269–276. Zhou, W. and Kambhamettu, C. (2002). Estimation of illuminant direction and intensity of multiple light sources. In ECCV02, page IV: 206 ff.
AN ALGORITHM FOR IMPROVED SHADING OF COARSELY TESSELLATED POLYGONAL OBJECTS Chandan Singh1, Ekta Walia2 1
Professor & Head, Department of Computer Science & Engg, Punjabi University, Patiala, INDIA. 2Lecturer, Department of Computer Science, National Institute of Technical Teachers Training & Research, Chandigarh, INDIA.
Abstract:
There are many objects whose shape requires them to be tessellated coarsely for example cone, cube, teapot with its body constituted by large triangles. The existing shading algorithms other than Phong Shading, cannot correctly render the specular highlights on such coarsely tessellated objects. It is known universally that Phong shading is still not used commercially due to its requirement of per-pixel normal vector normalization. In this paper, we present an algorithm that uses cubic bezier triangles for interpolation of diffuse and specular component of intensity. In most of such cases, this algorithm produces better visual results as compared to Phong Shading at comparatively lower cost.
Key words:
Quadratic Bezier Triangle, Cubic Bezier Triangle
1.
EXISTING METHODS OF SHADING
In increasing order of visual realism, there are two well-known shading methods. Gouraud shading1 also called intensity interpolation shading or color interpolation shading eliminates the intensity discontinuities across the adjacent polygons. Phong Shading2, also known as normal-vector interpolation shading, interpolates the surface normal vector rather than the interpolation of intensity. All the shading methods proposed so far have one common goal i.e. to attain shading effect, which approximates the quality of Phong shading. In this paper, we propose an algorithm for visually better shading for coarsely
72 K. Wojciechowski et al. (eds.), Computer Vision and Graphics, 72–79. © 2006 Springer. Printed in the Netherlands.
An Algorithm for Improved Shading
73
tessellated objects. Section 2 describes the use of Cubic Bezier triangles to separately interpolate the diffuse and specular components of intensity. Section 3 shows that this method gives significantly better visual result for objects that have been constituted by large triangles. The Cost comparisons of Cubic Bezier Triangle method and Phong shading have been done in this section. By comparing the cost of Phong Shading with that of our algorithm, we ensure that its cost should not be higher than Phong shading. Section 4 proposes a general adaptive model for shading.
2.
CUBIC BEZIER TRIANGLES FOR BETTER HIGHLIGHTS
Due to their shape constraints, there are some objects like a cone shown in Figure 1, which constitute few large triangles, and few small triangles. The part of an object which contains small triangles can be rendered with infact any method, sometimes for simplicity, even with the Gouraud shading method but the part which contains large triangles should either be subdivided and rendered with Gouraud Shading or Phong shading should be used to properly render them. We know that regular subdivision is computation intensive and so is Phong Shading, therefore, in this section we present a new shading algorithm which produces better shading at a lower cost, in particular for large triangles. The method of Brown3 to use quadratic bezier triangle for the interpolation of specular highlight gives very good approximation to phong shading only if the size of the triangles is considerably small. If the size of the rendered triangle is large, the shading is not accurate and therefore significant difference can be seen between Phong shading & this method. It is clear from the visual results shown in Figure 1. This problem is reduced by quadratically interpolating the diffuse component also, as mentioned in Walia and Singh4. However for shading a large triangle with high specular component, the method in Walia and Singh4 is also not a good approximation of Phong shading. Therefore, we suggest the use of Cubic Bezier triangle for interpolating the diffuse component as well as highlight component for proper shading of an object. In this method, a Cubic Bezier triangle is setup, which defines a cubic shading function for the underlying planar triangle.
74
Figure 1. Cone (tessellated with 16 triangles) Shaded with Brown Method3, Walia and Singh Method4, Cubic Bezier Triangle Method and Phong Shading2 respectively
2.1
Algorithm
Step 1: The normals N1, N2, N3 at the vertices are determined by averaging the face normals. The normals N4, N5, N6, N7, N8, N9 and N10 are determined by averaging the vertex normals at the one-third/two-third points on the edges and the centroid of the triangle as shown in Figure 2. Step 2: Set-up the Cubic Bezier triangle. The ten control points on the surface are shown in Figure 2. Barycentric coordinates can be determined from the screen space coordinates of a point p using the screen space areas a0,a1,a2, of the triangles pP030P003, pP003P300 and pP300P030 respectively. The barycentric coordinates b0, b1, b2 are evaluated using equation 1.
b0
a0 b1 a 0 a1 a 2
a1 b2 1 b0 b1 a 0 a1 a 2
(1)
75
An Algorithm for Improved Shading
Figure 2. Ten Control Points of Cubic Bezier Triangle and Position of the Normals
Step 3: The barycentric coordinates are employed to perform cubic interpolation for determining the diffuse as well as the specular component of the intensity at a given point p. The cosines for the highlight and diffuse component are determined using the ten normals as shown in Figure 2. These cosines namely c030, c003, c300, c021, c012, c102, c201, c210, c120, c111 are used to create ten control points namely, C030, C003, C300, C201, C102, C120, C210, C021, C012, C111 (for vertices, one-third-points of the edges and centroid), using the equations set 2. C030 = c030 C003 = c003 C300 = c300
(2)
C 021
(54c012 - 27c021 6c030 - 15c003) 18
C 012
(27c021 - 6C021 - 8c030 - c003) 12
C 201
(54c201 - 27c102 6c003 - 15c300) 18
C102
(27c102 - 6C201 - 8c003 - c300) 12
(54c120 - 27c210 6c300 - 15c030) C 210 18
(27c210 - 6C120 - 8c300 - c030) 12
C120
C111
(27c111 - c030 - c003 - c300 - 3C021 - 3C012 - 3C102 - 3C201 - 3C210 - 3C120) 6
76 Step 4: Cosines calculated for the diffuse component as well as specular component can be placed in the equation 3 to obtain their respective values at point p(b0,b1,b2) on the surface. c(b0,b1,b2) = C030b03+C003b13+C300b23+3C021b02b1+3C012b0b12 +3C102b12b2+3C201b1b22+3C210b22b0+3C120b2b02+6C111b0b1b2
(3)
where C030, C003, C300, C021, C012, C102, C201, C210, C120, C111 are the ten control points (for vertices, one-third-points of the edges and centroid) generated from the ten cosines evaluated at the ten keypoints of the triangular surface. Step 5: Repeat steps 3 and 4 for interpolation of specular component. Step 6: Convex Hull property of Cubic Bezier triangle can be used to limit the calculation of specular highlights to those triangles on which highlight is predicted to occur. For this limits can be defined that specular highlights will occur only for the cosines lying in the narrow range Clow and Chigh. Chigh = cos(0) = 1 and Clow = cos() where is the largest angle for which a specular reflection is visible. If any one of the control points is greater than Clow, then a specular highlight might occur on the triangle. In this case, it is necessary to perform cubic interpolation of the specular cosines calculated at the ten points of the triangle using the expression mentioned above. However, if all of the control points are less than Clow, then no specular highlight will occur in the triangle and therefore, it is not necessary to perform cubic interpolation of highlight cosines. Step 7: Table lookup can then be performed to determine the color corresponding to the intensity computed at each pixel.
3.
VISUAL RESULTS AND COMPUTATION COMPARISON
The results of this algorithm when executed for shading a teapot surface constituting a set of very large and small triangles as shown in Figure 3 are better than the results obtained from Phong Shading3.
3.1
Set-up Cost
For the case where viewer and light source are at infinite distance., the cost of setting up Cubic Bezier triangle is only 54 additions, 56 multiplications and 14 divisions. In addition to this, since normals are
An Algorithm for Improved Shading
77
calculated at 10 points therefore 10 normalization operations are required whereas in Phong Shading, the cost of determining linear equation parameters is 17 additions, 14 multiplications and 6 divisions. In addition to this, it requires 3 normalization operations. Considering that division and inverse square root are three times costly than multiplication and addition operations, the setup cost of our algorithm and Phong Shading will be 152 and 49 multiplications respectively. In case of viewer and light source being at finite distance, the set-up cost of our algorithm will increase as nine additional light L and view vector V will be constructed and normalized and then half way vectors H will be constructed and normalized. The set up cost for our algorithm will be 262 multiplications whereas it will be 49 multiplications for Phong Shading.
Figure 3. Teapot (Body other than lid, handle etc. tessellated with 16 triangles) Shaded with Cubic Bezier Triangle Method and Phong Shading.
3.2
Per-Pixel Cost
The per-pixel cost of our algorithm is contributed by two major operations, namely, computation of cubic shading functions and computation of barycentric coordinates. For our algorithm, the per-pixel operations include 25 multiplications and 9 additions for triangles without highlights whereas for triangles with highlights the per-pixel operations include 50 multiplications and 18 additions. In addition to this the computation of barycentric coordinates for every pixel requires 4 additions and 2 divisions. Since, highlight calculations are required for only 10 percent of the pixels, therefore the average per-pixel cost of our algorithm is 27.5 multiplications, 13.9 additions and 2 divisions. The per-pixel cost of Phong Shading comes from evaluation of linear function for computation of normals, normalization of normals and lighting
78 operations. This comes out to be 20 multiplications, 16 additions and 1 inverse square root operation. The number of multiplications required for per-pixel computation by our algorithm and Phong Shading is 47.4 and 39 respectively. This comparison has been done for the case when viewer and light source are at infinite distance. In case of viewer and light source being at finite distance, the cost of Phong shading will be very high because for this case L and V will have to be reconstructed and normalized everytime prior to constructing H. This will increase the per-pixel operations by 18 multiplications, 6 additions and 3 inverse square root. Thus, the per-pixel cost would be 72 multiplications instead of 39 multiplications. In such case, our algorithm will be less computation intensive as compared to Phong Shading. As mentioned in Table 1, In case of viewer and light source at infinite distance, the total cost of the proposed algorithm will be higher than Phong Shading only if the size of the triangle is very small e.g. if it consists of only 10 pixels. However, if the triangle is big in size e.g. it consists of 6000 pixels then as shown in Table 1, our algorithm is only slightly costlier than Phong Shading. In case of viewer and light source at finite distance, the total cost of the proposed algorithm will be less than Phong Shading. However, if the triangle is big in size then as shown in Table 1, our algorithm is reasonably cheaper (approximately 35% cheaper) than Phong Shading. Table 1. Cost Comparison when Viewer and Light Source are at InFinite Distance. Viewer & Light at Infinite Viewer & Light at Finite Distance Distance Cubic Bezier Phong Cubic Bezier Phong Shading Operations Triangle Shading Triangle Interpolation Interpolation Set-up Cost in terms 152 49 262 49 of multiplications Per-pixel cost in terms 47.4 39 47.4 72 of multiplications Total cost for triangle 626 439 736 769 with 10 pixels Total cost for triangle 284552 234049 284662 432049 with 6000 pixels
4.
ADAPTIVE MODEL FOR SHADING
We propose a new adaptive model for shading. The decision regarding which method should be used for shading includes a straightforward calculation of screen space area of every triangle to be rendered. Then average screen space area is computed. Before finally shading the triangles,
An Algorithm for Improved Shading
79
each triangle's area is compared with the average area. If the area is less than average area, then Gouraud Shading should be performed otherwise, Cubic Bezier Triangle method should be applied for better highlights.
REFERENCES 1. Gouraud H, “Continuous Shading of Curved Surfaces”, IEEE Transactions on Computers Vol 20, No. 6, 623-629, 1971 2. B.T.Phong, “Illumination for Computer Generated Pictures”, Communications of the ACM, Vol.18, No. 6, 1975 3. Russ Brown, “Modeling Specular Highlights Using Bezier Triangles”, Technical Report, Sun MicroSystems Laboratories, TR-99-75,1999. 4. Ekta Walia, Chandan Singh, “Bi-Quadratic Interpolation of Intensity for Fast Shading of Three Dimensional Objects”, Image and Vision Computing, New Zealand Proceedings, pp 96-101, 2003.
MULTIRESOLUTION ANALYSIS FOR IRREGULAR MESHES WITH APPEARANCE ATTRIBUTES Michaël Roy, Sebti Foufou, and Frédéric Truchetet LE2I, CNRS UMR 5158 - Universitéde Bourgogne 12 rue de la fonderie - 71200 Le Creusot - France
Abstract
We present a new multiresolution analysis framework based on the lifting scheme for irregular meshes with attributes. We introduce a surface prediction operator to compute the detail coefficients for the geometry and the attributes of the model. Attribute analysis gives appearance information to complete the geometrical analysis of the model.We present an application to adaptive visualization and some experimental results to show the efficiency of our framework.
Keywords:
multiresolution analysis, irregular meshes, appearance attributes, adaptive visualization
1.
INTRODUCTION
3D scanners usually produce huge data sets containing geometrical and appearance attributes. Geometrical attributes describe shape and dimensions of the object and include data relative to a point set on the object surface. Appearance attributes describe object surface properties such as colors, texture coordinates, etc. Multiresolution analysis is an efficient framework to represent a data set at different levels of detail, and to provide frequency information. It gives rise to many applications such as filtering, denoising, compression, editing, etc. Many papers present the multiresolution analysis and the wavelet transform in computer graphics domain. Here we refer to (Stollnitz et al., 1996) as an introduction. The goal of this work is to build a multiresolution mesh analysis managing both geometric and appearance attributes. Attribute management is very important especially for terrain models where the attributes are linked to the nature of the terrain. In some cases, attributes are more important than the terrain itself. The main contributions of this work are the use of the lifting
80 K. Wojciechowski et al. (eds.), Computer Vision and Graphics, 80–86. © 2006 Springer. Printed in the Netherlands.
Multiresolution Analysis for Irregular Meshes
81
scheme for the multiresolution analysis of irregular meshes with attributes, and the application of the proposed multiresolution framework to detail coefficient dependent visualization of meshes. In section 2, we review multiresolution mesh analysis schemes proposed in the literature. In section 3, we detail the proposed multiresolution analysis framework. Section 4 presents some experimental results and the application of the proposed method to detail-dependent visualization. Conclusion and ideas for future extensions are given in section 5.
2.
RELATED WORK
(Lounsbery et al., 1997) made the connection between wavelets and subdivision to define different levels of resolution. This technique makes use of the theory of multiresolution analysis and of the subdivision rules to construct a multiresolution representation for surfaces with subdivision connectivity. (Zorin et al., 1997) proposed a combination of subdivision and smoothing algorithms to construct a set of algorithms for interactive multiresolution editing of complex meshes with arbitrary topology. (Bonneau, 1998) introduced the concept of multiresolution analysis over non-nested spaces, which are generated by the so-called BLaC-wavelets. This concept was then used to construct a multiresolution analysis over irregular meshes. (Kobbelt et al., 1998) proposed a multiresolution editing tool for irregular meshes using the progressive mesh algorithm (Hoppe, 1996) to build the coarse resolution mesh, and a smoothing operator to estimate the high resolution mesh. (Guskovetal.,1999) presented a series of non-uniform signal processing algorithms designed for irregular triangulation, a smoothing algorithm combined with existing hierarchical methods is used to build subdivision, pyramid, and wavelet algorithms for meshes with irregular connectivity. Recently, (Valette and Prost, 2004) presented a new wavelet-based multiresolution analysis of irregular surface meshes using a new irregular subdivision scheme. The method is a fine-tocoarse decomposition, and use a complex simplification algorithm in order to define surface patches suitable for their irregular subdivision scheme.
3.
MULTIRESOLUTION ANALYSIS
Multiresolution analysis provides a framework that rigorously defines various approximations and fast analysis algorithms. This framework constructs iteratively approximation and detail parts forming successive levels of resolution of the original data set. The details capture the local frequency content of the data set, and are used to exactly reconstruct the original data set. Classical multiresolution analysis frameworks (such as wavelet transform) cannot be applied to irregularly sampled data sets. (Sweldens, 1998) proposed the lifting scheme that allows the multiresolution analysis of irregular samples.
82 We propose a multiresolution analysis framework suitable for triangular irregular meshes with appearance attributes. This framework decomposes a mesh in a series of levels of detail, and computes detail coefficients at each level. evenk−1
Mk
Split
evenk−1
Mk−1
Predict Dk−1
oddk−1
Figure 1.
Predict
Merge
Mk
oddk−1
Multiresolution mesh analysis framework.
The multiresolution mesh analysis framework is presented in Fig. 1. The decomposition is represented on the left part of the figure. Starting from a fine mesh Mk , two groups of vertices (odds and evens) are defined by the split operator. The odd vertices are designated to be removed, and the even vertices remain to create the coarse mesh Mk−1 that approximates the initial mesh. The odd vertices are predicted using the predict operator, and then subtracted to the original odd vertices to give the detail coefficients Dk−1 . The last step is the removal of the odd vertices from the initial mesh using a downsampling operator. The reconstruction is shown on the right part of the figure and is simply the inverse scheme. Starting from a coarse mesh Mk−1 , the odd vertices are re-inserted using an upsampling operator. The odd vertices are predicted using the predict operator, and then exactly reconstructed by adding the detail coefficients Dk−1 . In the following paragraphs, we detail the different steps required to build the multiresolution mesh analysis framework.
3.1
Downsampling and Upsampling
We employ the Progressive Mesh (PM) framework (Hoppe, 1996) to build downsampling and upsampling operators. In the PM setting an edge collapse provides the atomic downsampling step, and a vertex split becomes the atomic upsampling step. For the downsampling, we use the half-edge collapse operation, which does not introduce a new vertex position but rather subsamples the original mesh. Thus, it enables the contraction of nested hierarchies on unstructured meshes that can facilitate further applications. The downsampling operator removes about 33% of the vertices per level.
3.2
Split and Merge
The split operator takes a given mesh and selects the even and the odd vertices. The later are designated to be removed using half-edge collapse operations. In order to do global downsampling and upsampling, the odd vertices
Multiresolution Analysis for Irregular Meshes
83
are defined as a set of independent vertices (not directly connected by an edge). Different methods can be used to select the odd vertices, and thus the half-edge collapses to perform. Our algorithm performs an incremental selection by selecting one odd vertex and locking all adjacent vertices (even vertices). The selection ends when no more vertices can be selected. By selecting an odd vertex in order to remove it, we also select the even vertex it will merge with. In other words, we directly select a half-edge collapse. The Quadric Error Metric from (Garland and Heckbert, 1997) is used as a criterion for the selection of the odd vertices because it minimizes the length of the details and retains the visual appearance of the simplified mesh. Since the upsampling is completely done with vertex split operations, the merge operator is not required.
3.3
Predict
The predict operator estimates the odd vertices using the even vertices. We propose a prediction operator that uses the local surface geometry. Meshes coming from real world scenes usually contain appearance attributes such as colors, texture coordinates, etc. Also, we consider the vertex position and the normal vector as geometric attributes. Attributes are considered as vectors in Euclidian space defined on each vertex of the mesh. So a vertex is represented as an array composed of m attribute vectors (a1 , . . . , am ) where each an is an attribute vector. We define an application fn (v) = an that gives the attribute vector an of the attribute n associated with the vertex v. Our prediction operator estimates each odd vertex vik from the mesh Mk as a set of attributes fn (vik ) given by : k wi,j .fn (vjk ). (1) fn (vik ) = j∈V1k (i)
V1k (i) represents the one-ring neighborhood of the vertex vik . The wi,j are weights of the relaxation operator minimizing the curvature energy of an edge ei,j (Meyer et al., 2002) : cot αi,j + cot βi,j , l∈V k (i) cot αi,l + cot βi,l
k = wi,j
(2)
1
where αi,j and βi,j are the angles opposite to the edge ei,j . Predicted attributes are relaxed in terms of curvature energy of the analysed surface. This relaxation operator guaranties smooth variation of the attributes. This method also assumes that the attributes are linked to the surface. The wi,j coefficients are computed during the decomposition, and need to be stored to be re-used for the prediction step in the reconstruction.
84
4.
EXPERIMENTAL RESULTS
We present an application of our multiresolution framework to adaptive visualization. Our method uses the PM framework, which has proved its efficiency for view-dependent visualization of meshes (Hoppe, 1997). The PM framework selects the vertices to display using visibility criteria. We improve this selection using the detail coefficients of the analysis. If the detail coefficient length of a vertex is below a given threshold, the vertex is declared as irrelevant and thus is removed from the mesh. Figure 2 shows view-dependent visualization of the Buddha and the Earth model. On both Figures 2(a) and 2(b), the viewport used to visualize the models is shown as the yellow pyramid. We see that invisible vertices are removed using frustrum and backface culling. Irrelevant vertices are removed by thresholding the detail coefficients. Figure 3 shows view and detail dependent visualization of the Buddha model using the viewport represented in Fig. 2(a). The original model, from Stanford University, contains 543.652 vertices and 1.087.716 faces. Figure 3(a) shows the model visualized using only frustrum and backface culling. Figures 3(b) and 3(c) show the Buddha model visualized with our detail dependent technique. Each figure shows the result of different values of the threshold (the higher the threshold, the coarser the resolution). We see that important geometrical features of the model such as the high curvature regions are preserved. Figure 4(a) shows a model of the earth with color attributes. Figure 4(b) shows the same model simplified using our method. The detail coefficients of the color attributes are thresholded to remove irrelevant vertices. Detail dependent simplification insures the preservation of attribute features (e.g. the coastlines). The advantage of our method is that it allows more advanced visualization by computing the relevance of the vertices using the detail coefficients. Combinations of several attributes can be performed to improve the result of the simplification.
5.
CONCLUSION AND FUTURE WORK
We have presented a new multiresolution analysis for irregular meshes based on the lifting scheme. Our framework manages attributes such as color, normals, etc. Our method is easy to implement and results show the efficiency of the analysis over the attributes, which allows more complete multiresolution analysis. The next step in this work will focus on feature detection using detail orientation in order to build a semi-automatic denoising algorithms for scanned models.
85
Multiresolution Analysis for Irregular Meshes
(a)
(b)
Figure 2. View and detail dependent visualization of the Buddha model in (a) and the Earth model in (b). Both figures show the viewport used to visualize the models. Wireframe representation is overlaid on each model.
(a) Original model 102.246 vertices 204.522 faces
(b) τ =0.0001 31.848 vertices 63.684 faces
(c) τ =0.0004 9.293 vertices 18.524 faces
Figure 3. Detail dependent visualization of the Buddha model in (a). Figures (b) and (c) show two different values of the detail threshold used to segment the geometric details.
86
(a) Original model (327,680 faces)
(b) Simplified model (62,632 faces)
Figure 4. Adaptive visualization according to the color detail coefficients. Original model is shown in (a), and simplified model in (b). Relevant vertices are selected by thresholding the color detail coefficients. The model is a sphere with color representing the elevation of the earth from the ETOPO5 data set.
REFERENCES Bonneau, G.-P. (1998). Multiresolution analysis on irregular surface meshes. IEEE Transactions on Visualization and Computer Graphics, 4(4):365–378. Garland, M. and Heckbert, P. (1997). Surface simplification using quadric error metrics. In Proceedings of ACM SIGGRAPH, pages 209–216. Guskov, I., Sweldens, W., and Schröder, P. (1999). Multiresolution signal processing for meshes. In Proceedings of ACM SIGGRAPH, pages 325–334. Hoppe, H. (1996). Progressive meshes. In Proceedings of ACM SIGGRAPH, pages 99–108. Hoppe, H. (1997). View-dependent refinement of progressive meshes. In Proceedings of ACM SIGGRAPH, pages 189–198. Kobbelt, L., Campagna, S., Vorsatz, J., and Seidel, H.-P. (1998). Interactive multi-resolution modeling on arbitrary meshes. In Proceedings of ACM SIGGRAPH, pages 105–114. Lounsbery, M., DeRose, T., and Warren, J. (1997). Multiresolution analysis for surfaces of arbitrary topological type. ACM Transactions on Graphics, 16(1):34–73. Meyer, M., Desbrun, M., Schröder, P., and Barr, A.H. (2002). Discrete differential-geometry operators for triangulated 2-manifolds. In Proceedings of Visualization and Mathematics. Stollnitz, E. J., DeRose, T. D., and Salesin, D. H. (1996). Wavelets for Computer Graphics: Theory and Applications. Morgan Kaufmann. Sweldens, W. (1998). The lifting scheme: A construction of second generation wavelets. SIAM Journal on Mathematical Analysis, 29(2):511–546. Valette, S. and Prost, R. (2004). Wavelet-based multiresolution analysis of irregular surface meshes. IEEE Transactions on Visualization and Computer Graphics, 10(2):113–122. Zorin, D., Schröder, P., and Sweldens, W. (1997). Interactive multiresolution mesh editing. In Proceedings of ACM SIGGRAPH, pages 259–269.
SMOOTH INTERPOLATION WITH CUMULATIVE CHORD CUBICS Ryszard Kozera∗ School of Computer Science and Software Engineering The University Western Australia 35 Stirling Highway, Crawley 6009 WA, Perth Australia [email protected]
Lyle Noakes School of Mathematics and Statistics The University of Western Australia 35 Stirling Highway, Crawley 6009 WA, Perth Australia [email protected]
Abstract
Smooth cumulative chord piecewise-cubics, for unparameterised data from regular curves in Rn , are constructed as follows. In the first step derivatives at given ordered interpolation points are estimated from ordinary (non-C 1 ) cumulative chord piecewise-cubics. Then Hermite interpolation is used to generate a C 1 regular (geometrically smooth) piecewise-cubic interpolant. Sharpness of theoretical estimates of orders of approximation for length and trajectory is verified by numerical experiments. Good performance of the interpolant is also confirmed experimentally on sparse data. This may be applicable in computer graphics and vision, image segmentation, medical image processing, and in computer aided geometrical design.
Keywords:
Interpolation, cumulative chord parameterisation, length and trajectory estimation, orders of convergence.
1.
INTRODUCTION
Let γ : [0, T ] → Rn be a smooth regular curve, namely γ is C r for some r ≥ 1 and γ(t) ˙ = 0 for all t ∈ [0, T ]. Our task is to estimate γ and its length
∗ This
work was supported by an Alexander von Humboldt Foundation.
87 K. Wojciechowski et al. (eds.), Computer Vision and Graphics, 87–94. © 2006 Springer. Printed in the Netherlands.
88 T d(γ) ≡ 0 γ(t) ˙ dt from an ordered m + 1-tuple Qm = (q0 , q1 , . . . , qm ) of points in Rn , where qi = γ(ti ), 0 = t0 < t1 < . . . < ti < . . . < tm = T , and the ti are unknown. Here · is the Euclidean norm, and the data Qm is said to be unparameterised.
Definition 1 The unparameterised data Qm (for m ≥ 2) is admissible when the {ti }m i=0 satisfy: δm → 0
for δm = max{ti − ti−1 : i = 1, 2, . . . , m} .
(1)
From now on the subscript m in δm is suppressed. Recall the following:
Definition 2 A family {fδ : δ > 0} of functions fδ : [0, T ] → R is said to be O(δ p ) (fδ = O(δ p )) when there is a constant K > 0 such that, for some δ0 > 0, |fδ (t)| < Kδ p for all δ ∈ (0, δ0 ) and all t ∈ [0, T ]. A family {cδ : δ > 0} of numbers is said to be O(δ p ) when it is O(δ p ) as a family of constant functions. A family of vector-valued functions is said to be O(δ p ) when the component functions are all O(δ p ). An approximation γˆ : [0, Tˆ] → Rn to γ determined by Qm has an order p when, for some reparameterisation ψ : [0, T ] → [0, Tˆ], γˆ ◦ ψ − γ = O(δ p ) .
(2)
In the simpler case where the {ti }m i=0 are known, we have the following standard result (Kozera et al., 2003):
Example 3 If the ti are known, piecewise Lagrange interpolation through successive k + 1-tuples (qi , qi+1 , qi+2 , . . . , qi+k ), where k ≥ 1 and i = 0, k, 2k, 3k . . ., approximates γ and d(γ) with uniform errors at most O(δ k+1 ). For length estimation an extra necessary assumption on sampling mδ = O(1) is needed. Without real loss we may suppose m divisible by k. If we guess the ti blindly to be distributed uniformly: ti ≈ tˆi = mi ∈ [0, 1], the resulting uniform piecewise-quadratic γˆ : [0, 1] → Rn is sometimes uninformative. Indeed, for such a guess of ti , piecewise-linear interpolation approximates γ to order 2, but piecewise-quadratic approximations can actually degrade estimates (Noakes et al., 2001a; Noakes et al., 2001b). So without knowing the ti it might seem difficult to match the order 3 achieved in Ex. 3. However it turns out that higher order approximations are achievable for many special curves γ possibly sampled according to some restrictive rule. Some of those constraints require convexity of γ, its embedding in Euclidean space either R2 or R3 , an a priori knowledge of the derivatives of γ at Qm or more-or-less uniformity of ti : i.e. the existence of a constant 0 < λ < 1 such that for each m, and all i = 0, 1, 2, . . . , m − 1, λδ ≤ ti+1 − ti .
(3)
Smooth Interpolation with Cumulative Chord Cubics
89
Such schemes, studied in (de Boor et al., 1987; Lachance and Schwartz, 1991; Mørken and Scherer, 1997; Noakes and Kozera, 2003; Rababah, 1995 and Schaback, 1989), usually require numerical solutions of systems of nonlinear equations. In the present paper (a conference version of Kozera and Noakes, 2004) we examine sharpness of estimates for γ and d(γ) using admissible unparameterised data, and smooth cumulative chord piecewise cubic interpolation as introduced in the next section.
2.
SMOOTH CUMULATIVE CHORD CUBICS AND MAIN RESULT
Cumulative chord length parameterisations (Epstein, 1976; Kvasov, 2000 Chap.11; Lee, 1992) are often used in computer graphics (Piegl and Tiller, 1997 Section 9.2.1). One way of using this is as follows: set tˆ0 = 0
and tˆj = tˆj−1 + qj − qj−1 ,
(4)
for j = 1, 2, . . . , m. For some integer k ≥ 1 dividing m, i = 0, k, 2k, . . . , m− k, let γˆk be the curve satisfying γˆk (tˆj ) = qj ,
(5)
for all j = 0, 1, 2, . . . , m, and whose restriction γˆki to each [tˆi , tˆi+k ] is polynomial of degree at most k. Call the track-sum γˆk of the γˆki the cumulative chord piecewise degree-k polynomial approximation to γ defined by Qm = (q0 , q1 , . . . , qm ). Then (Noakes and Kozera, 2004; Noakes and Kozera, 2002):
Theorem 4 Suppose r ≥ k + 1 and k is 2 or 3. Let γˆ : [0, Tˆ] → Rn be the cumulative chord piecewise degree-k approximation defined by Qm . Then there is a piecewise-C r reparameterisation ψ : [0, T ] → [0, Tˆ], with γˆ ◦ ψ − γ ) = d(γ) + O(δ k+1 ). γ = O(δ k+1 ). If in addition mδ = O(1) then also d(ˆ For general admissible unparameterised data there is no improvement in convergence for cumulative chord piecewise-quartics with k = 4 (Kozera, 2003; Kozera, 2004). For k = 2, 3, although cumulative chord piecewisepolynomials match orders of convergence for parametric interpolation (Compare Ex. 3 and Th. 4) these are usually not C 1 at knot points tkj , where j ≡ 0 mod k. The present paper rectifies this deficiency for k = 3 using a new cumulative chord C 1 piecewise-cubic: 1. For each i = 0, 1, . . . , m − 3, let γˆ3i : [tˆi , tˆi+3 ] → Rn be the cumulative chord cubic interpolating qi , qi+1 , qi+2 , qi+3 at tˆi , tˆi+1 , tˆi+2 , tˆi+3 , respectively. Then we estimate the derivative of γ at qi as v(qi ) = γˆ3i (tˆi ) (see Fig. 1). Derivatives at qm−2 , qm−1 , qm are estimated by applying the same method to Qm in reverse order.
90
Figure 1.
Estimates of derivatives of γ at qi as v(qi ) = γˆ3i (tˆi ). 1 0.8 0.6 0.4 0.2 -0.5-0.25
0.25 0.5 0.75
1
1.25 1.5
-0.2
Figure 2. 7 points on the spiral γsp (dashed) sampled as in (9), interpolated by a cumulative chord C 1 piecewise-cubic γsph , with d(γsph ) = d(γsp ) + 6.122 × 10−3 .
2. Set γhi : [tˆi , tˆi+1 ] → Rn to be the cubic polynomial satisfying
γhi (tˆi+l ) = qi+l , γhi (tˆi+l ) = γˆ i+l (tˆi+l ) , for l = 0, 1
(6)
given in terms of divided differences by Newton’s Interpolation Formula (see de Boor, 2001 Chap. 1) by γhi (tˆ) = γhi [tˆi ] + γhi [tˆi , tˆi ](tˆ − tˆi ) + γhi [tˆi , tˆi , tˆi+1 ](tˆ − tˆi )2 +γhi [tˆi , tˆi , tˆi+1 , tˆi+1 ](tˆ − tˆi )2 (tˆ − tˆi+1 ) .
(7)
The cumulative chord C 1 piecewise-cubic γh : [0, Tˆ] → Rn is the track-sum of the γhi . The following holds (Kozera and Noakes, 2004):
Theorem 5 Suppose r = 4 for γ being a regular curve in Rn . There is a piecewise-C ∞ reparameterisation φ : [0, T ]→ [0, Tˆ], with γh ◦φ = γ+O(δ 4 ). If in addition mδ = O(1) then d(γh ) = d(γ) + O(δ 4 ).
91
Smooth Interpolation with Cumulative Chord Cubics 18 16 14 12 10 3.5
Figure 3.
4
4.5
5
-log |d(γsph ) − d(γsp )| against log m for a spiral γsp in Ex. 6.
For small δm the C 1 interpolant γh is regular, and in particular has neither cusps nor corners.
3.
NUMERICAL EXPERIMENTS
Here are some experiments, using Mathematica, and admissible data from smooth regular curves in R2 and R3 .
Example 6 Consider a regular spiral γsp : [0, 1] → R2 given by γsp (t) = (t + 0.2)(cos(π(1 − t)), sin(π(1 − t)))
(8)
with length d(γsp ) = 2.452. For γsph based on the 7-tuple Q6 with ti =
(−1)i+1 i + , t0 = 0 , tm = 1 , m 3m
(9)
and m = 6 is shown in Fig. 2: trajectory and length estimation seems good for such sporadic data. The errors in length estimates for m = 48, 90, 150, and 198 are 2.466 × 10−6 , 1.973 × 10−7 , 2.537 × 10−8 and 8.339 × 10−9 . The plot for γsph interpolation of − log |d(γsph ) − d(γsp )| against log m in Fig. 3, for m = 18, 24, 30, . . . , 198, appears almost linear, with least squares estimate of slope 4.011. Since T ≤ mδ ≤ 2T , orders of convergence in terms of δ are the same as with 1/m. Evidently, sampling (9) satisfies the condition mδ = O(1). Th. 5 holds also for a non-convex cubic γc with one inflection point at (0, 0).
Example 7 Consider the regular cubic γc : [0, 1] → R2 given by γc (t) = (2t − 1, (2t − 1)3 ). The cumulative chord C 1 piecewise-cubic γch based on the sampling (9) yields for m = 48, 90, 150, and 198, the errors in length estimates 1.919 × 10−7 , 2.628 × 10−8 , 3.778 × 10−9 and 1.276 × 10−9 respectively. Linear regression of − log |d(γch ) − d(γc )| against log m, for
92
1.5 1 0.5 0
1 0.5 -1
0 0
-0.5 1
-1
Figure 4. 7 points on the elliptical helix γeh (dashed) sampled as in Ex. 9, interpolated by a cumulative chord C 1 piecewise-cubic γehh , with d(γehh ) = d(γeh ) − 4.478 × 10−2 .
m = 180, 186, . . . , 300, yields the estimate 3.961 for convergence order of length approximation. Again, γch based on the 7-tuple Qm gives an excellent estimate on sporadic data for length d(γch ) = d(γc ) + 3.505 × 10−2 , where d(γc ) = 3.096. We verify now that Th. 5 holds also for samplings which are not necessarily more-or-less uniform (see (3)).
Example 8 Let γqh be the cumulative chord C 1 piecewise-cubic from a regular quartic curve γq : [0, 1] → R2 , γq (t) = (t, (t+1)4 /8) with d(γq ) = 1.4186 where sampling is according to: ti =
1 1 i i , for i even ; ti = + − , for i odd ; tm = 1 . (10) m m m m2
Clearly condition (3) does not hold for sampling (10). Errors in length estimates for m = 6, 48, 90, 150, and 198 are 1.912 × 10−5 , 2.557 × 10−9 , 3.483 × 10−10 , 5.394 × 10−11 and 1.884 × 10−11 . Linear regression of − log |d(γqh ) − d(γq )| against log m, for m = 480, 486, . . . , 600, yields the estimate 4.035 for convergence order of length approximation. Note that for sampling (10), the condition mδ = O(1) also holds. Finally, we experiment with a regular elliptical helix space curve.
Example 9 Fig. 4 shows γehh from 7 points on the elliptical helix γeh : [0, 2π] → R3 , given by γeh (t) = (1.5 cos t, sin t, t/4) and sampled with ti π(2i−1) either 2πi according as i is even or odd. Clearly mδ = O(1) holds m or m
Smooth Interpolation with Cumulative Chord Cubics
93
here. Although sampling is uneven, sparse, and not known for interpolation, γehh seems very close to γeh : d(γeh ) = 8.090 and d(γehh ) = 8.045. Linear regression of − log |d(γehh ) − d(γeh )| against log m, for m = 90, 96, . . . , 300, yields the estimate 4.002 for convergence order of length approximation. Errors in length estimates for m = 6, 48, 90, 150, and 198 are 4.478 × 10−2 , 2.902 × 10−5 , 2.044 × 10−6 , 2.614 × 10−7 and 8.620 × 10−8 . Similarly, for sampling (10) linear regression of − log |d(γehh ) − d(γeh )| against log m, for m = 240, 246, . . . , 300, yields the estimate 4.081 for convergence order of length approximation. The condition mδ = O(1) cannot be dropped from √ Th. 5. Indeed, for √ γsp sampled according according to t = 0 and t = 1/ m + ((i − 1)( m− 0 i √ 1)/(m − 1) m), for 1 ≤ i ≤ m, (for which mδ = O(1)) γsph (see (6)) gives order approximately 2.71 for length convergence using 6 ≤ m ≤ 95, whereas Th. 5 gives order 4.
4.
CONCLUSIONS
Quartic orders of convergence for length estimates given in Th. 5 for cumulative chord C 1 piecewise-cubics are sharp (at least for n = 2, 3). Curves need not be planar nor convex and sampling need not be more-or-less uniform. We do however need the mild condition (1). For length estimation an extra weak condition on sampling mδ = O(1) is needed. Our scheme also performs well on sporadic data. Asymptotically γh is geometrically smooth, i.e. with no corners and cusps. These good qualities should make cumulative chord C 1 piecewise cubics useful for many applications. For example, image segmentation is an important task in many applications, such as image interpretation in medical applications. Such tasks are usually approached using snakes (Blake and Isard, 1998; Desbleds-Mansard et al., 2001 or Kass et al., 1988) which are curves satisfying some variational condition. Typically an initial snake is chosen as a spline determined by specifying data points of interest. When these points are irregularly spaced the parameterisation becomes an important issue. Because of the good behavior noted already, cumulative chord C 1 piecewise-cubics seem an excellent choice.
REFERENCES Blake, A. and Isard, M. (1998). Active Contours. Springer-Verlag, Berlin Heidelberg New York. de Boor, C. (2001). A Practical Guide to Splines. Springer-Verlag, New York Berlin Heidelberg. de Boor, C., H¨ollig, K., and Sabin, M. (1987). High accuracy geometric Hermite interpolation. Computer Aided Geom. Design, 4:269–278. Desbleds-Mansard, C., Anwander, A., Chaabane, L., Orkisz, M., Neyran, B., Douek, P. C., and Magnin, I. E. (2001). Dynamic active contour model for size independent blood vessel lumen
94 segmentation and quantification in high-resolution magnetic resonance images. In Skarbek, W., editor, Proc. 9th Int. Conf. Computer Anal. of Images and Patterns, Warsaw Poland, volume 2124 of Lect. Notes Comp. Sc., pages 264–273, Berlin Heidelberg. Springer-Verlag. Epstein, M. P. (1976). On the influence of parameterization in parametric interpolation. SIAM J. Numer. Anal., 13:261–268. Kass, M., Witkin, A., and Terzopoulos, D. (1988). Active contour models. Int. J. Comp. Vision, 1:321–331. Kozera, R. (2003). Cumulative chord piecewise-quartics for length and trajectory estimation. In Petkov, N. and Westenberg, M. A., editors, Proc. 10th Int. Conf. Computer Anal. of Images and Patterns, Groningen The Netherlands, volume 2756 of Lect. Notes Comp. Sc., pages 697–705, Berlin Heidelberg. Springer-Verlag. Kozera, R. (2004). Asymptotics for length and trajectory from cumulative chord piecewisequartics. Fundamenta Informaticae, 61(3-4):267–283. Kozera, R. and Noakes, L. (2004). C 1 interpolation with cumulative chord cubics. Fundamenta Informaticae, 61(3-4):285–301. Kozera, R., Noakes, L., and Klette, R. (2003). External versus internal parameterization for lengths of curves with nonuniform samplings. In Asano, T., Klette, R., and Ronse, C., editors, Theoret. Found. Comp. Vision, Geometry Computat. Imaging, volume 2616 of Lect. Notes Comp. Sc., pages 403–418, Berlin Heidelberg. Springer-Verlag. Kvasov, B. I. (2000). Methods of Shape-Preserving Spline Approximation. World Scientific, Singapore. Lachance, M. A. and Schwartz, A. J. (1991). Four point parabolic interpolation. Computer Aided Geom. Design, 8:143–149. Lee, E. T. Y. (1992). Corners, cusps and parameterization: variations on a theorem of Epstein. SIAM J. of Numer. Anal., 29:553–565. Mørken, K. and Scherer, K. (1997). A general framework for high-accuracy parametric interpolation. Math. Computat., 66(217):237–260. Noakes, L. and Kozera, R. (2002). Cumulative chords and piecewise-quadratics.In Wojciechowski, K., editor, Proc. Int. Conf. Computer Vision and Graphics, Zakopane Poland, volume II, pages 589–595. Association for Image Processing Poland, Silesian University of Technology Gliwice Poland, Institute of Theoretical and Applied Informatics PAS Gliwice Poland. Noakes, L. and Kozera, R. (2003). More-or-less uniform sampling and lengths of curves. Quar. Appl. Math., 61(3):475–484. Noakes, L. and Kozera, R. (2004). Cumulative chord piecewise-quadratics and piecewise-cubics. In Klette, R., Kozera, R., Noakes, L., and J., Weickert, editors, Geometric Properties from Incomplete Data. Kluwer Academic Publishers. In press. Noakes, L., Kozera, R., and Klette, R. (2001a). Length estimation for curves with different samplings. In Bertrand, G., Imiya, A., and Klette, R., editors, Digit. Image Geometry, volume 2243 of Lect. Notes Comp. Sc., pages 339–351, Berlin Heidelberg. Springer-Verlag. Noakes, L., Kozera, R., and Klette, R. (2001b). Length estimation for curves with ε-uniform samplings. In Skarbek, W., editor, Proc. 9th Int. Conf. Computer Anal. of Images and Patterns, Warsaw Poland, volume 2124 of Lect. Notes Comp. Sc., pages 339–351, Berlin Heidelberg. Springer-Verlag. Piegl, L. and Tiller, W. (1997). The NURBS Book. Springer-Verlag, Berlin Heidelberg. Rababah, A. (1995). High order approximation methods for curves. Computer Aided Geom. Design, 12:89–102. Schaback, R. (1989). Interpolation in R2 by piecewise quadratic visually C 2 B´ezier polynomials. Computer Aided Geom. Design, 6:219–233.
A PARALLEL LEAP-FROG ALGORITHM FOR 3-SOURCE PHOTOMETRIC STEREO Tristan Cameron Ryszard Kozera∗ Amitava Datta The School of Computer Science and Software Engineering The University of Western Australia 35 Stirling Highway Crawley, W. A., 6009, Perth, Australia
Abstract
Existing Photometric Stereo methods provide reasonable surface reconstructions unless the irradiance image is corrupted with noise and effects of digitisation. However, in real world situations the measured image is almost always corrupted, so an efficient method must be formulated to denoise the data. Once noise is added at the level of the images the noisy Photometric Stereo problem with a least squares estimate is transformed into a non-linear discrete optimization problem depending on a large number of parameters. One of the computationally feasible methods of performing this non-linear optimization is to use many smaller local optimizations to find a minimum (called 2D Leap-Frog). However, this process still takes a large amount of time using a single processor, and when realistic image resolutions are used this method becomes impractical. This paper presents a parallel implementation of the 2D Leap-Frog algorithm in order to provide an improvement in the time complexity. While the focus of this research is in the area of shape from shading, the iterative scheme for finding a local optimum for a large number of parameters can also be applied to any optimization problems in Computer Vision. The results presented herein support the hypothesis that a high speed up and high efficiency can be achieved using a parallel method in a distributed shared memory environment.
Keywords: Photometric Stereo, Shape from Shading, Nonlinear Optimization, Parallel Processing, Noise Rectification.
∗ This
research was supported by an Alexander von Humboldt Foundation.
95 K. Wojciechowski et al. (eds.), Computer Vision and Graphics, 95–102. © 2006 Springer. Printed in the Netherlands.
96
1.
INTRODUCTION
Photometric Stereo consists of two independent steps: gradient computation and gradient integration. Existing linear noise removal algorithms (Simchony et al., 1990; Frankot and Chellappa, 1988) work on the assumption that after the first step of gradient computation the Gaussian nature of the noise is preserved, however, this is not the case and the resulting reconstructed surface can be incorrect as shown in (Noakes and Kozera, 2003a; Noakes and Kozera, 2003b). If we assume the Gaussian noise is added to the irradiance images, and not to the vector fields, then we must take the non-linear setting to solve the optimization problem. This depends (as shown in (Noakes and Kozera, 2003a; Noakes and Kozera, 2003b) on a large number of parameters. Since the numerical scheme for solving such an optimisation task involves the calculation of the Hessians (matrix of second derivatives that will have the size (M × M )2 where M is the image resolution) it is computationally expensive. It would also be very computationally expensive to calculate the eigenvalues of such a large matrix. The 2D Leap-Frog Algorithm proposed in (Noakes and Kozera, 1999; Noakes and Kozera, 2003a; Noakes and Kozera, 2003b) is an iterative method that is similar to the block-Gauss-Seidel, but is non-linear (Noakes and Kozera, 1999). The local optimization is broken into a series of smaller optimization problems (consisting of a smaller number of variables, and thus smaller Hessians) that can be solved much more quickly and with a wider variety of methods. This, however, is offset against the need for many small optimizations to converge to the global optimum. This paper proposes a parallel method for the 2D Leap-Frog Algorithm in order to accelerate the denoising and reconstruction step. The experiments run used three light sources and the initial guess consisted of the ideal surface with added Gaussian noise. The finding of a good initial guess is not covered in this paper and this is a different problem that is present in any non-linear optimization problem. Although the focus of this paper is on denoising shape from shading, the 2D Leap-Frog, and therefore the proposed parallel method, can be applied to any non-linear optimization problem in computer vision. The proposed parallel method would therefore be beneficial to many areas involving image processing or non-linear optimization such as medical imaging, synthetic aperture radar, and robot vision (Horn, 2001).
1.1
2D Leap-Frog
This section will give a simple geometric explanation of the 2D Leap-Frog Algorithm. Readers are referred to (Noakes and Kozera, 1999; Noakes and Kozera, 2003a; Noakes and Kozera, 2003b) for a more detailed definition.
97
A Parallel Leap-Frog Algorithm for 3-source Photometric Stereo
2
1
Figure 1.
Shows the pixels optimized by three different overlapping snapshots.
The 2D Leap-Frog Algorithm simplifies the large scale optimisation problem by blending solutions of small scale problems in order to converge to the maximum-likelihood estimate of the surface. This means that we have a large choice of ready-made algorithms for optimising the small scale problems (in this case the Levenberg-Marquardt method). In practice we deal with discrete data, and so we must define the domain of the 2D Leap-Frog Algorithm. Let our domain be Ω of size M × M , where M is a number of pixels. The domain of the subproblems, called snapshots, is given by Ω m , and has size k × k, where k = 4 pixels in this implementation. Let xn be a pixel in our domain. Then, given an initial guess u0 , the 2D Leap-Frog Algorithm moves to the unique maximum-likelihood estimate uopt of u by the following steps. An iteration begins at snapshot Ω 1 , in the bottom-left corner of Ω. From a snapshot Ω m , we obtain Ω m+1 by translating Ω m the horizontal axis, until the right edge of Ω is reached.
k 2
pixels along
The next snapshot is obtained by translating Ω 1 vertically by and then translating Ω m along the horizontal axis as before.
k 2
pixels,
An iteration is complete when the last snapshot optimised is Ω n , or the top-right corner of Ω. Fig. 1 shows the pixels optimised for each snapshot in the 2D Leap-Frog Algorithm. The snapshots are optimised according to the cost function of the 2D LeapFrog Algorithm, called the performance index (Noakes and Kozera, 1999; Noakes and Kozera, 2003a; Noakes and Kozera, 2003b). The performance index measures the distance between the images of the computed solution and
98 the noisy input images and can be defined analytically as follows: E(x1 , x2 , . . . , xN ) =
Ei (xi1 , . . . , xik ) + Ej (xj1 , . . . , xjk ) ,
(1)
i =j
where E is defined over Ω, N is large, ik and jk are small, all components of Ei are fixed, all components of Ej are free, and Ei and Ej are defined over a snapshot Ω m . By finding the optimum of Ej we decrease the energy of E (since all components of Ei are fixed). This procedure finds the suboptimal solution of E and in the case when the initial guess is good this solution is a global minimum.
2.
PARALLEL IMPLEMENTATION
The approach taken uses a root processor to perform the calculations along with the rest of the processors. The topology here is the one dimensional Cartesian network (Grama et al., 2003). A Cartesian, or grid topology consists of all the nodes forming a lattice or grid. Each processor communicates only with the processors connected to it, increasing robustness. The topology used is agrid of width one, meaning processors have at most two processors to communicate with, one above, and one below. The root processor deals with all the initialization details and initial communication and final communication, but it will behave in the same way as the other processors during the core of the algorithm. The image and initial guess are split up in the same way between the processors according to a pre-defined scheme. The image is split up into rows between the processors in as even a way as possible by root. Every set of rows must be divisible by 2 because of the size of the sub-squares used to process the image, which is 4 × 4 in this implementation. However, pixels outside the sub-square are also used in the non-linear optimization process as fixed parameters. This essentially means that when processing a sub-square, access to a 6×6, 5×6, 6 × 5, or 5 × 5 sub-domain of pixels must be available, depending on the boundary conditions. The consequence of this is that once the root processor has defined how the image is to be split among the processors, an ‘overlap’ row is added to the bottom and top of each set of rows. The bottom, or root, processor only receives a top overlap row and the top processor only receives a bottom overlap row. An important feature to note here is that the 2D Leap-Frog Algorithm converges to a globally minimal solution (given a good initial guess) due to the many overlapping local minimizations. This means that if the individual processors are allowed to work entirely independently of each other the convergence would not be guaranteed. This is because the sub-squares that are contained in the top two rows (excluding the overlap row) of processor n and the bot-
bottom two rows (excluding the overlap row) of processor n + 1 would never be processed. This area, called the 'buffer zone', must be processed prior to each iteration of the 2D Parallel Leap-Frog. Once these buffer zones have been processed, each processor is able to run the 2D Leap-Frog Algorithm on its remaining data. It must be noted that at the end of the processing of the buffers and at the end of each iteration the processors must be synchronized for the communication of the buffer data.
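As an illustration of the row distribution and overlap rows described above (a sketch under our own naming, not the paper's code), the partitioning could be computed as follows:

# Hypothetical sketch: split M image rows into p blocks of even size and record,
# for each block, the extra overlap rows received from the neighbours.
def partition_rows(M, p):
    assert M % 2 == 0 and p >= 1
    base = (M // p) // 2 * 2                 # even number of owned rows per processor
    extra_pairs = (M - base * p) // 2        # leftover rows, handed out two at a time
    blocks, start = [], 0
    for rank in range(p):
        rows = base + (2 if rank < extra_pairs else 0)
        lo, hi = start, start + rows                     # owned rows [lo, hi)
        recv_lo = lo - 1 if rank > 0 else lo             # one overlap row below
        recv_hi = hi + 1 if rank < p - 1 else hi         # one overlap row above
        blocks.append({"rank": rank, "own": (lo, hi), "with_overlap": (recv_lo, recv_hi)})
        start = hi
    return blocks

for b in partition_rows(64, 3):
    print(b)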
2.1 Parallel Efficiency
The efficiency of a parallel algorithm is expressed in terms of speed up, S_p = T_1/T_p, where T_1 is the time taken for one processor to finish the computation, and T_p is the time taken using p processors to finish the computation. The efficiency is given by E_p = S_p/p. It should be noted that in general 1/p ≤ E_p ≤ 1. So far when talking about speed up and efficiency we have been concerned with the computational time only. However, the performance of a distributed memory architecture must also take into account the communication time. The time taken for a program to run on p processors is given by T_p = (T_1/p) + T_c, where T_1 is the time taken to run on a single processor and T_c is the communication time of the program. We can now apply this to the formula for speed up above to give us

$$S_p = \frac{T_1}{(T_1/p) + T_c}. \quad (2)$$

2.2 Expected Efficiency
Using (2) from above, we can calculate an expected speed up, and therefore an expected efficiency, of the proposed parallel method. Let T_c = n/T_o, where T_c is the communication time in seconds, n is the number of bytes to send, and T_o is the number of bytes that can be sent per second. During the initialization phase p − 1 processors receive approximately 4M/p rows (each row consists of M double (8 byte) values). Each processor receives approximately M/p rows from each of the three irradiance images and from the initial guess. After the 2D Leap-Frog Algorithm has finished processing, p − 1 processors send approximately M/p rows (each row consisting of M double values) to the root processor. During a single iteration of the 2D Parallel Leap-Frog Algorithm using p processors, p − 1 processors send 2 rows each containing M double values at the beginning of the buffer processing. Following the buffer processing p − 1 processors send 2 rows each containing M double values.
The number of bytes that can be sent per second, T_o, is approximately 500,000,000 bytes (HP, 2003). This gives us the following equation for the expected communication time:

$$T_c = \frac{8\left((p-1)(4M/p) + (p-1)(M/p)\right) + 8 \cdot 500 \cdot 4M(p-1)}{500{,}000{,}000}, \quad (3)$$

where M is the image resolution and p is the number of processors. Note that the number of iterations is fixed at 500. This can be substituted into (2) to give an expected speed up.
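Purely as a worked example of equations (2) and (3) (the 500 iterations and the 500,000,000 bytes per second throughput are the values quoted in the text; the function itself is our sketch, not the authors' code):

# Hypothetical sketch: expected communication time, speed up and efficiency.
def expected_performance(T1, M, p, iterations=500, bytes_per_sec=500_000_000):
    n_bytes = 8 * ((p - 1) * (4 * M / p) + (p - 1) * (M / p)) \
              + 8 * iterations * (4 * M * (p - 1))       # equation (3), numerator
    Tc = n_bytes / bytes_per_sec                          # expected communication time
    Sp = T1 / (T1 / p + Tc)                               # equation (2)
    return Tc, Sp, Sp / p                                 # Tc, speed up, efficiency

# Example: a 256 x 256 image whose single-processor run takes 10,000 s, on 8 processors.
print(expected_performance(10_000.0, 256, 8))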
3. RESULTS
Experiments were carried out on a 4 node AlphaServer SC40 at the Interactive Virtual Environments Center (IVEC) (IVEC, 2003) known as CARLIN. Each node consists of four 667 MHz Alpha EV67 processors and has access to 4GB of main memory, with up to 108 GB of virtual memory. Each processor has a 64KB I-cache, a 64KB D-cache, and 8 MB of Level 2 cache (IVEC, 2003). The non-linear local minimization technique used was the Levenberg-Marquardt algorithm. The experimental surfaces were tested at three differing resolutions so as to ascertain the scalability of the algorithm. A total of four surfaces were tested; however, due to space limitations the reconstructions for only one will be presented in this paper. The timings for all four surfaces will be reported. Two of the surfaces tested were synthetic and two of the surfaces were of real data, one a jug and one a fish, both generated from 3D Studio Max models. The reconstruction of the Jug is shown in the results. The three different image resolutions tested were 64 × 64, 128 × 128, and 256 × 256 pixels. This gives us an idea how the implementation scales from a small image (64 × 64) to a large image (256 × 256). For each experiment noise from a Gaussian (Normal) distribution with mean 0 and standard deviation 0.2 was added to the ideal surface to generate the initial guess. The irradiance images, with intensity values ranging from 0 to 1, were corrupted with noise from a Gaussian distribution with mean 0 and standard deviation 0.02. It is expected that the speed up will be relatively high (efficiency above 60%) for the eight processor implementation.
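The corruption step described above can be reproduced in a few lines of NumPy (a sketch; the ideal surface and irradiance arrays below are placeholders, not the paper's data):

import numpy as np

rng = np.random.default_rng(0)
ideal_surface = np.zeros((64, 64))                        # placeholder ideal surface
initial_guess = ideal_surface + rng.normal(0.0, 0.2, ideal_surface.shape)

irradiance = np.clip(rng.random((3, 64, 64)), 0.0, 1.0)   # placeholder images in [0, 1]
noisy_images = irradiance + rng.normal(0.0, 0.02, irradiance.shape)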
4. CONCLUSION
As can be seen in Fig. 2(c) the reconstructed surface provides a reasonable representation of the ideal surface shown in Fig. 2(a). The value of the performance index (Noakes and Kozera, 2003a; Noakes and Kozera, 2003b) is also much closer to zero (ideal) for the reconstructed surface than that of the initial guess.
Figure 2. (a) The ideal surface with a performance index of zero. (b) The initial guess with a performance index of 88.451. (c) The reconstructed surface with a performance index of 5.055.
From Fig. 3, Fig. 4, and Fig. 5 it can be seen that the speed up was higher for the large resolution images than for the smaller resolution images. This is as expected since, when reconstructing the surface with a higher resolution, a larger percentage of the total time is spent in parallel processing as opposed to communication. The results given support our hypothesis even though the actual speed up is lower than the expected speed up. This is because the expected speed up was calculated using the maximal throughput of 500 MB per second when in reality this would rarely be achieved. Due to the effectiveness of this parallel algorithm, reconstructions can be done in far less time, and with a greater accuracy through more iterations.
REFERENCES
Frankot, Robert T. and Chellappa, Rama (1988). A method for enforcing integrability in shape from shading algorithms. IEEE Trans. Pattern Anal. Mach. Intell., 10(4):439–451.
Grama, A., Gupta, A., Karypis, G., and Kumar, V. (2003). Introduction to Parallel Computing. Addison Wesley.
Horn, B. K. P. (2001). Robot Vision. MIT Press in association with McGraw-Hill.
HP (2003). Hewlett-Packard Company. http://www.hp.com. Accessed 7/10/2003.
IVEC (2003). Interactive Virtual Environments Centre. http://www.ivec.org. Accessed 4/9/2003.
Noakes, L. and Kozera, R. (1999). A 2D Leap-Frog algorithm for optimal surface reconstruction. Proc. 44th Annual Meet. Opt. Eng. SPIE'99, III-3811:317–328.
Noakes, L. and Kozera, R. (2003a). Denoising images: Non-linear Leap-Frog for shape and light-source recovery. Chapter in Theoretical Foundations of Computer Vision: Geometry, Morphology and Computational Images, pages 419–436. Lecture Notes in Computer Science 2616.
Noakes, L. and Kozera, R. (2003b). Nonlinearities and noise reduction in 3-source photometric stereo. J. Math. Imag. and Vis., 2(18):119–127.
Simchony, T., Chellappa, R., and Shao, M. (1990). Direct analytical methods for solving Poisson equations in computer vision problems. IEEE Trans. Pattern Anal. Mach. Intell., 12(5):435–446.
Figure 3. The results for the 64 by 64 resolution images. (Panels: Time (sec) vs Number of Processors; Speed Up vs Number of Processors, expected and actual; Efficiency vs Number of Processors, expected and actual.)
Figure 4. The results for the 128 by 128 resolution images. (Panels: Time (sec) vs Number of Processors; Speed Up vs Number of Processors, expected and actual; Efficiency vs Number of Processors, expected and actual.)
Figure 5. The results for the 256 by 256 resolution images. (Panels: Time (sec) vs Number of Processors; Speed Up vs Number of Processors, expected and actual; Efficiency vs Number of Processors, expected and actual.)
NOISE REDUCTION IN PHOTOMETRIC STEREO WITH NON-DISTANT LIGHT SOURCES
Ryszard Kozera∗
School of Computer Science and Software Engineering, The University of Western Australia, 35 Stirling Highway, Crawley 6009 WA, Perth, Australia
[email protected]
Lyle Noakes School of Mathematics and Statistics The University of Western Australia 35 Stirling Highway, Crawley 6009 WA, Perth Australia [email protected]
Abstract
In classical photometric stereo, a Lambertian surface is illuminated from multiple distant point light-sources. In the present paper we consider nearby light-sources instead, so that the unknown surface is illuminated by non-parallel beams of light. In continuous noiseless cases, the recovery of a Lambertian surface from non-distant illuminations reduces to solving a system of non-linear partial differential equations for a bivariate function u, whose graph is the visible part of the surface. This system is more difficult to analyse than its counterpart, where light-sources are at infinity. We consider here a similar task, but with slightly more realistic assumptions: the photographic images are discrete and contaminated by Gaussian noise. This leads to a non-quadratic optimization problem involving a large number of independent variables. The latter imposes a heavy computational burden (due to the large matrices involved) for standard optimization schemes. We test here a feasible alternative: an iterative scheme called the 2-dimensional Leap-Frog Algorithm [14]. For this we describe an implementation for three light-sources in sufficient detail to permit code to be written. Then we give examples verifying experimentally the performance of Leap-Frog.
Keywords:
photometric stereo, non-quadratic optimization, noise reduction.
∗ This work was supported by the Alexander von Humboldt Foundation.
103 K. Wojciechowski et al. (eds.), Computer Vision and Graphics, 103–110. © 2006 Springer. Printed in the Netherlands.
1. INTRODUCTION
In shape-from-shading the task is to recover a typically Lambertian (light scattered equally in all directions) surface as the graph of an unknown function u : Ω → R defined over an image domain Ω. Each photograph, for a light-source positioned at infinity (yielding fixed illumination direction p = (p_1, p_2, p_3)), gives an image irradiance equation [4] for u,

$$\frac{p_1 u_x(x, y) + p_2 u_y(x, y) - p_3}{\sqrt{p_1^2 + p_2^2 + p_3^2}\,\sqrt{u_x^2(x, y) + u_y^2(x, y) + 1}} = E(x, y), \quad (1)$$
defining a first-order non-linear partial differential equation for u. Here E : Ω → R is the image intensity of the photograph. The left-hand side of (1) represents at each point (x, y) the cosine of the angle between the normal n = (u_x, u_y, −1) to the surface at (x, y, u(x, y)) and the light-source direction p. Evidently (1) is invariant with respect to the translation u → u + c, the so-called standard ambiguity. For other more subtle ambiguities, especially in single image shape-from-shading, see e.g. [1], [4], [6], [9], [10] or [15]. In multiple image shape-from-shading (called photometric stereo), the problem of determining u from several images is generically well-posed (up to the standard ambiguity) over the intersection of their respective domains, see e.g. [4], [7], [8] or [16]. For shape recovery one first determines the gradient ∇u, and then integrates it to u, up to the standard ambiguity. What complicates the task is that the intensity functions E^1, E^2, . . . , E^m, corresponding to m > 1 photographs, may be contaminated by noise. This feeds into the estimate v of ∇u. Consequently v is usually non-integrable in the sense that it is not the gradient of any C² function. If the noise is assumed to be the addition of uniform Gaussian noise to the gradient estimate v, the problem reduces to finding an integrable vector field v̂ nearest to v in a simple and well-defined sense. The approaches to this use linear methods: see e.g. [2], [4], [5], [12] or [17]. Whatever the approach, the underlying mathematical problem is one of quadratic optimization, and the difficulty lies not so much in the mathematical algorithms as in the assumption that v is contaminated by uniform Gaussian noise. It seems much more natural to assume noise added to the intensity functions E^s. Unfortunately this reduces to a non-convex optimization problem which depends on a huge number of variables (for real resolution of the photographs). The problem simplifies when we can make a good initial guess for u. Gradient Descent is an elementary approach that works fairly well in such circumstances, but regulating step-sizes can be tricky, especially considering that the huge number of independent variables makes calculation of Hessians prohibitively costly.
The 2D Leap-Frog Algorithm [11, 13, 14] is an iterative scheme, resembling block-Gauss-Seidel [3], but in a non-linear setting, which breaks the optimization into a sequence of optimization problems in fewer independent variables. Solving the smaller optimization problems can be done much more quickly, with a wider range of methods (for instance Hessians can be calculated as need be). This has to be traded off against the need for many small-scale optimizations. For a non-distant light-source q = (q_1, q_2, q_3), since at a given surface point (x, y, u(x, y)) the varying illumination direction is p = (x − q_1, y − q_2, u(x, y) − q_3), in the noiseless setting the counterpart of equation (1) now reads:

$$E(x, y) = \frac{(x - q_1)\,u_x(x, y) + (y - q_2)\,u_y(x, y) - (u(x, y) - q_3)}{\sqrt{(x - q_1)^2 + (y - q_2)^2 + (u(x, y) - q_3)^2}\,\sqrt{u_x^2(x, y) + u_y^2(x, y) + 1}}. \quad (2)$$
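A direct NumPy rendering of equation (2) (our sketch, not the authors' implementation; u, ux, uy and the pixel grid are assumed inputs):

import numpy as np

def irradiance_nearby(x, y, u, ux, uy, q):
    # Equation (2): Lambertian intensity for a nearby point light-source q = (q1, q2, q3).
    q1, q2, q3 = q
    num = (x - q1) * ux + (y - q2) * uy - (u - q3)
    den = np.sqrt((x - q1) ** 2 + (y - q2) ** 2 + (u - q3) ** 2) \
          * np.sqrt(ux ** 2 + uy ** 2 + 1.0)
    return num / den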
Note that there is no standard ambiguity now. In this paper we consider the left-hand side of (2) also to be contaminated with Gaussian noise. We reconstruct the surface for noisy images with 2D Leap-Frog and make a preliminary report on the performance of the algorithm in question for three light-source photometric stereo.
2. DISCRETIZATION AND 2D LEAP-FROG
Our pixels have side 1/M in Ω = [0, 1] × [0, 1], where M > 1 is a fixed integer. For 1 ≤ i, j ≤ M, with (x_i, y_j) the center of the pixel [(i − 1)/M, i/M] × [(j − 1)/M, j/M], set u_i^j = u(x_i, y_j) and E_ij = E(x_i, y_j). The central-difference approximations u_x(x_i, y_j) ≈ M(u_(i+1)^j − u_(i−1)^j)/2 and u_y(x_i, y_j) ≈ M(u_i^(j+1) − u_i^(j−1))/2 at internal pixels of Ω yield the discrete analogue of (2), ⟨ν̂_i^j, q^ij⟩ = E_ij, where

$$\hat{\nu}_i^j = \frac{\left(M(u_{i+1}^j - u_{i-1}^j)/2,\; M(u_i^{j+1} - u_i^{j-1})/2,\; -1\right)}{\sqrt{1 + \left(M(u_{i+1}^j - u_{i-1}^j)/2\right)^2 + \left(M(u_i^{j+1} - u_i^{j-1})/2\right)^2}}, \quad (3)$$

$$q^{ij} = \frac{(x_i - q_1,\; y_j - q_2,\; u_i^j - q_3)}{\sqrt{(x_i - q_1)^2 + (y_j - q_2)^2 + (u_i^j - q_3)^2}}, \quad (4)$$

and 1 < i, j < M. The values of u appearing in (3) are displayed in the M × M tableau

$$u = \begin{pmatrix} & u_2^M & u_3^M & \cdots & u_{M-2}^M & u_{M-1}^M & \\ u_1^{M-1} & u_2^{M-1} & u_3^{M-1} & \cdots & u_{M-2}^{M-1} & u_{M-1}^{M-1} & u_M^{M-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ u_1^2 & u_2^2 & u_3^2 & \cdots & u_{M-2}^2 & u_{M-1}^2 & u_M^2 \\ & u_2^1 & u_3^1 & \cdots & u_{M-2}^1 & u_{M-1}^1 & \end{pmatrix} \quad (5)$$
Figure 1. Covering an image Ω by the family F^kl = {S_ij^((l−1)l)}_(1≤i,j≤3) of sub-squares (here k = l − 1); panels (a)–(i) show the nine snapshot positions S_ij^((l−1)l), 1 ≤ i, j ≤ 3. Each S_ij^((l−1)l) consists of 2^(2(l−1)) pixels.
with no corners. Identify the space of all such tableaux with R^(M²−4) and write ⟨ν̂_i^j, q^ij⟩ = E_ij in the form

$$f^q(u) = E, \quad (6)$$

where f^q : R^(M²−4) → R^((M−2)²) is determined by the light-source q. Three light-sources q^s give three images E^s, and three systems of equations of the form (6). If the E^s are contaminated by independent uniform mean-zero Gaussian noise, let u_opt be a maximum-likelihood estimate of u. Then u_opt minimizes

$$\mathcal{J}(u_{\mathrm{opt}}) = \sum_{s=1}^{3} \left\| f^{q^s}(u_{\mathrm{opt}}) - E^s \right\|^2, \quad (7)$$

where ‖·‖ is the Euclidean norm on R^((M−2)²). We describe 9 classes of different non-quadratic optimization problems defined locally over a given rectangular sub-array of u (a snapshot) corresponding to different locations. For k < l and M = 2^l, Ω has M² pixels S_ij^l = [(i−1)/2^l, i/2^l] × [(j−1)/2^l, j/2^l], where 1 ≤ i, j ≤ 2^l. Cover Ω also by a family of overlapping squares (snapshots) F^kl = {S_ij^kl}_(1≤i,j≤2^(l−k+1)−1), each comprising 2^(2k) pixels, where S_ij^kl = [(i−1)2^(k−l−1), (i−1)2^(k−l−1) + 2^(k−l)] × [(j−1)2^(k−l−1), (j−1)2^(k−l−1) + 2^(k−l)], and 1 ≤ i, j ≤ 2^(l−k+1) − 1. In Fig. 1, k = l − 1 and l ≥ 2. The bottom-left snapshot S_11^2l is shown in Fig. 2, where k = 2. Similarly to (7), let J^rt(ũ^l) : R^8 → R be
Figure 2. (a) Free and fixed variables for the snapshot S_11^2l without 4-pixel value enforcement on u. (b) Selected pixels in S_11^2l over which the performance index J is locally minimized.
the local performance index in ũ^l ∈ R^8 variables representing u ∈ R^(M²−4) at pixels in M^rt(x_c, y_c) (see Fig. 2a) and defined to reduce the noise in E^s (for 1 ≤ s ≤ 3) over pixels from N^rt(x_c, y_c) (see Fig. 2b). Note that J^rt depends here also on fixed values v ∈ R^7 and w ∈ R^4 (see Fig. 2a). In a similar fashion, the eight remaining cases (see Fig. 1(b-i)) yield the corresponding local performance indices J^rtl : R^6 → R, J^tl : R^8 → R, J^brt : R^6 → R, J^brtl : R^4 → R, J^tlb : R^6 → R, J^br : R^8 → R, J^lbr : R^6 → R, and J^lb : R^8 → R, respectively. Note that snapshot overlaps are needed to guarantee the convergence of 2D Leap-Frog to the critical point of J (see [14]).
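As a sketch of how (3), (4), (6) and (7) fit together (illustrative code only; the central differences and the performance index follow the formulas above, while the array shapes and names are our assumptions):

import numpy as np

def performance_index(u, images, lights):
    # J(u) = sum_s || f^{q_s}(u) - E^s ||^2 over internal pixels, cf. (3), (4), (7).
    M = u.shape[0]
    xs = (np.arange(M) + 0.5) / M
    x, y = np.meshgrid(xs, xs, indexing="ij")               # x_i varies along axis 0
    ux = M * (np.roll(u, -1, 0) - np.roll(u, 1, 0)) / 2.0   # central differences,
    uy = M * (np.roll(u, -1, 1) - np.roll(u, 1, 1)) / 2.0   # valid at internal pixels
    nu = np.stack([ux, uy, -np.ones_like(u)], axis=-1)
    nu /= np.linalg.norm(nu, axis=-1, keepdims=True)
    J = 0.0
    for E, (q1, q2, q3) in zip(images, lights):
        q = np.stack([x - q1, y - q2, u - q3], axis=-1)
        q /= np.linalg.norm(q, axis=-1, keepdims=True)
        predicted = np.sum(nu * q, axis=-1)
        J += np.sum((predicted - E)[1:-1, 1:-1] ** 2)       # internal pixels only
    return J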
2-dimensional Leap-Frog: In this version of Leap-Frog, snapshots are square, with half-snapshot overlaps, and a particular order of snapshots is from left to right and then from bottom to top. An initial guess u_0^0 ∈ R^(2^(2l)−4) is assumed to be given. For n = 1, 2, . . . repeat the following steps until some halting condition is flagged.

Start with the left bottom snapshot S_11^kl (for k = l − 1 see Fig. 1(a)) and apply any non-linear optimization solver to J_0^rt with respect to u_0^l ∈ R^(2^k(2^k−2)), more precisely, adjust the variables in the snapshot to minimize J^rt. This yields a new update u_0^(n,c) ∈ R^(2^(2l)−4).

Pass now to the second snapshot (for k = l − 1 see Fig. 1(b)) of the first row, S_21^kl, and optimize J^rbl. Adjusting variables in the snapshot yields a new vector u_0^(n,c) ∈ R^(2^(2l)−4).

Continue until the last snapshot S_(2^(l−k+1)−1)1^kl in the first row (for k = l − 1 see Fig. 1(c)) and optimize J^tl accordingly. Adjusting variables in the snapshot yields a new vector u_0^(n,c) ∈ R^(2^(2l)−4). This completes the first row of the nth iteration.

Pass to the second row. Start with the S_12^kl snapshot and optimize J^brt (for k = l − 1 see Fig. 1(d)). Adjusting variables in the snapshot yields a new vector u_0^(n,c) ∈ R^(2^(2l)−4).

Pass to the second snapshot S_22^kl (a generic case) over which we optimize J^brtl (for k = l − 1 see Fig. 1(e)). Adjusting variables in the snapshot yields a new vector u_0^(n,c) ∈ R^(2^(2l)−4).

Continue until the last snapshot S_(2^(l−k+1)−1)2^kl in the second row is reached. Over this snapshot optimize J^tlb (for k = l − 1 see Fig. 1(f)). Adjusting variables in the snapshot yields a new vector u_0^(n,c) ∈ R^(2^(2l)−4).

Continue row by row (as specified in the previous steps), until the last row is reached. Now optimize J^br over S_1(2^(l−k+1)−1)^kl (for k = l − 1 see Fig. 1(g)). Adjusting variables in the snapshot yields a new vector u_0^(n,c) ∈ R^(2^(2l)−4).

Pass to the second snapshot of the last row, S_2(2^(l−k+1)−1)^kl, over which we optimize J^lbr (for k = l − 1 see Fig. 1(h)). Adjusting variables in the snapshot yields a new vector u_0^(n,c) ∈ R^(2^(2l)−4).

This continues, up until the last snapshot S_(2^(l−k+1)−1)(2^(l−k+1)−1)^kl in the last row is reached. Over this sub-square optimize J^lb for k = l − 1 (see Fig. 1(i)). Adjusting variables in the snapshot yields a new vector u_0^(n,c) ∈ R^(2^(2l)−4). This completes the nth iteration, and the resulting updated global values of u_0^(n,c) are labeled by u_0^n = u_0^(n,c) ∈ R^(2^(2l)−4).
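The sweep can be expressed compactly; the sketch below is ours (the per-snapshot objective local_J, the bookkeeping of free variables and the optimiser choice are assumptions, and for brevity it treats all k × k snapshot pixels as free instead of distinguishing the nine boundary cases J^rt, J^rtl, ... used above):

import numpy as np
from scipy.optimize import minimize

def leap_frog_iteration(u, local_J, k=4):
    # One 2D Leap-Frog pass: optimise each k x k snapshot, left to right, bottom to top.
    M = u.shape[0]
    step = k // 2
    for r in range(0, M - k + 1, step):
        for c in range(0, M - k + 1, step):
            window = (slice(r, r + k), slice(c, c + k))
            x0 = u[window].ravel()
            res = minimize(lambda x: local_J(u, window, x), x0, method="L-BFGS-B")
            u[window] = res.x.reshape(k, k)
    return u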
3. NUMERICAL EXPERIMENTS
Numerical experiments were conducted with Mathematica, using FindMinimum for snapshot optimizations, and Ω = [0, 1] × [0, 1]. For testing we consider a coarse grid with M = 16. Uniform Gaussian noise with mean zero and different standard deviations σ1 = 0.02 and σ2 = 0.05 is added to three photographs. Two configurations of three light-sources Li = {pi , qi , ri } are used (i = 1, 2), namely L1 = {(0, 0, −8), (8, 8, −8), (5, 0, 11)} and L2 = {(0, 0, −60),(48, 48, −48), (32, 0, 64)}. We report here (due to space limitation) on the performance of 2D Leap-Frog only for two examples.
Figure 3. (a) Bumpy surface. (b) Initial guess. (c) 2D Leap-Frog for σ1 and L1 . (d) 2D Leap-Frog for σ2 and L2 .
Define a bumpy surface (see Fig. 3(a)) as the graph of

$$u_b(x, y) = \frac{1}{16}\left(20 f((x, y), \vec{w}_1) - 15 f((x, y), \vec{w}_2) + 12 f((x, y), \vec{w}_3)\right), \quad (8)$$

where f(v_1, v_2) = exp(−100 ⟨v_1 − v_2 | v_1 − v_2⟩), v_1, v_2 ∈ R², and w_1 = (3/4, 1/2), w_2 = (1/4, 1/3), and w_3 = (1/3, 4/5). The initial guess u_0^0 is obtained by adding uniform Gaussian noise with standard deviation 1/2 and zero mean to u_b (see Fig. 3(a)). This guess is so bad that very little of Fig. 3(a) is visible in Fig. 3(b). The data comprised three images obtained by illuminating the bumpy surface from the three light-sources L_1 (or L_2) and then contaminated with uniform Gaussian noise of standard deviation σ_1 (or σ_2). After 110 iterations of 2D Leap-Frog the value of the performance index J (see (7)) for the initial guess in Fig. 3(b) is decreased from J(u_0^0) = 240.932 (or from J(u_0^0) = 247.088) to J(u_0^110) = 2.15855 (or to J(u_0^110) = 3.20951). The 2D Leap-Frog surface estimate is shown in Fig. 3(c) (or in Fig. 3(d), accordingly).
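Equation (8) and the noisy initial guess are easy to reproduce (a NumPy sketch on the coarse 16 × 16 grid mentioned above; all names are ours):

import numpy as np

def f(v, w):
    d = v - np.asarray(w)
    return np.exp(-100.0 * np.sum(d * d, axis=-1))

def u_b(x, y):
    # Bumpy test surface of equation (8).
    v = np.stack([x, y], axis=-1)
    return (20 * f(v, (0.75, 0.5)) - 15 * f(v, (0.25, 1 / 3)) + 12 * f(v, (1 / 3, 0.8))) / 16.0

M = 16
xs = (np.arange(M) + 0.5) / M
x, y = np.meshgrid(xs, xs, indexing="ij")
rng = np.random.default_rng(0)
u00 = u_b(x, y) + rng.normal(0.0, 0.5, (M, M))   # initial guess: sigma = 1/2 Gaussian noise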
4. CONCLUSIONS
2D Leap-Frog for noisy 3 light-source photometric stereo with non-distant illuminations is introduced and tested. Preliminary results show robust performance. Further analysis to establish sufficient conditions for 2D Leap-Frog to yield a global minimum of J and more testing for synthetic and real images with camera resolution are needed. The algorithm is amenable to parallelism.
REFERENCES
1. Brooks, M. J., Chojnacki, W., and Kozera, R. (1992). Impossible and ambiguous shading patterns. Int. J. Comp. Vision, 7(2):119–126.
2. Frankot, R. T. and Chellappa, R. (1988). A method of enforcing integrability in shape from shading algorithms. IEEE Trans. Patt. Rec. Machine Intell., 10(4):439–451.
3. Hackbusch, W. (1994). Iterative Solution of Large Sparse Systems of Equations. Springer, New York, Heidelberg, Berlin.
4. Horn, B. K. P. (1986). Robot Vision. McGraw-Hill, New York; Cambridge, MA.
5. Horn, B. K. P. (1990). Height and gradient from shading. Int. J. Comp. Vision, 5(1):37–75.
6. Hurt, N. E. (1991). Mathematical methods in shape-from-shading: a review of recent results. Acta Appl. Math., 23:163–188.
7. Kozera, R. (1991). Existence and uniqueness in photometric stereo. Appl. Math. Comput., 44(1):1–104.
8. Kozera, R. (1992). On shape recovery from two shading patterns. Int. J. Patt. Rec. Art. Intel., 6(4):673–698.
9. Kozera, R. (1995). On complete integrals and uniqueness in shape from shading. Appl. Math. Comput., 73(1):1–37.
10. Kozera, R. (1997). Uniqueness in shape from shading revisited. J. Math. Imag. Vision, 7:123–138.
11. Noakes, L. (1999). A global algorithm for geodesics. J. Australian Math. Soc. Series A, 64:37–50.
12. Noakes, L. and Kozera, R. (2001). The 2-D Leap-Frog, noise, and digitization. In Bertrand, G., Imiya, A., and Klette, R., editors, Digital and Image Geometry, volume 2243 of Lect. Notes Comp. Sc., pages 352–364, Berlin Heidelberg. Springer-Verlag.
13. Noakes, L. and Kozera, R. (2002). Denoising images: non-linear Leap-Frog for shape and light-source recovery. In Asano, T., Klette, R., and Ronse, C., editors, Geometry, Morphology, and Computational Imaging, volume 2616 of Lect. Notes Comp. Sc., pages 419–436, Berlin Heidelberg. Springer-Verlag.
14. Noakes, L. and Kozera, R. (2003). Nonlinearities and noise reduction in 3-source photometric stereo. J. Math. Imag. Vision, 18(3):119–127.
15. Oliensis, J. (1991). Uniqueness in shape from shading. Int. J. Comp. Vision, 6(2):75–104.
16. Onn, R. and Bruckstein, A. (1990). Uniqueness in shape from shading. Int. J. Comp. Vision, 5(1):105–113.
17. Simchony, T., Chellappa, R., and Shao, M. (1990). Direct analytical methods for solving Poisson equations in computer vision problems. IEEE Trans. Patt. Rec. Machine Intell., 12(5):435–446.
HYPERGRAPHS IN DIAGRAMMATIC DESIGN¹
Ewa Grabska, Katarzyna Grzesiak-Kopeć, Jacek Lembas, Andrzej Łachwa and Grażyna Ślusarczyk
Institute of Computer Science, Jagiellonian University, Nawojki 11, 30-072 Kraków, Poland
Abstract:
A specific diagram language, called the floor-layout language, is considered. It allows floor-layouts to be edited while maintaining specified restrictions. An internal diagram representation in the form of hypergraphs, needed for syntactic analysis, is described. Some examples of designing floor-layouts are presented.
Key words:
hypergraph, diagram, computer-aided design
1. INTRODUCTION
Visual (diagrammatic) communication plays an essential role in many computer aided design systems. This paper deals with diagrammatic design aided by computer, which is demonstrated on designing floor-layouts. A specific diagram language, called the floor-layout language, is considered. Its visual components correspond to symbols of floor-layouts (walls, windows, doors, etc.). Designers use the diagrammatic language to edit floor-layouts. This editing mode does not allow arbitrary drawings to be created, but is restricted to the floor-layout components which occur in this language. Maintaining specified restrictions requires an internal diagram representation which allows for syntactic analysis. This paper should be seen as the first step in developing a floor-layout language. We start with a description of its internal representation in the form of hypergraphs and outline a logical view of our model with the use of UML. Our language will be equipped with a parser based on hypergraphs similar to other parsers of diagrammatic languages (Minas, 2002). However,
¹ The research is supported by the Polish State Committee for Scientific Research (KBN) grant no. 0896/T07/2003/25.
111 K. Wojciechowski et al. (eds.), Computer Vision and Graphics, 111–117. © 2006 Springer. Printed in the Netherlands.
our main goal is to propose diagrammatic reasoning with the use of hypergraphs to support innovative design.
2. HIERARCHICAL LAYOUT HYPERGRAPHS
In this paper directed and hierarchical hyperedge-labelled hypergraphs are proposed to represent floor-layouts. These hypergraphs constitute a modification of the hypergraphs defined in (Minas, 2002) and in (Ślusarczyk, 2003) and will be called layout hypergraphs. Layout hypergraphs contain hyperedges of two types. Hyperedges of the first type are non-directed and correspond to layout components, while hyperedges of the second type represent relations among components. They are directed unless they represent symmetrical relations. Hyperedges can represent components and relations on different levels of detail. For example, layout components correspond to rooms, whose elements on the lower level of detail can be treated as walls, doors, windows, etc. Moreover, floor-layouts will be represented by hierarchical hypergraphs. In general we assume that hierarchical hyperedges of the first type represent groups of components. Therefore subcomponents of a given component can be represented at the same or a different level of detail in the form of another layout hypergraph nested in the graph hyperedge corresponding to this component. We propose to distinguish two basic relations between components on the highest level: accessibility and adjacency. Appropriate subtypes of these basic relations describe correspondence between lower level components. It should be noted that the number of hierarchy levels depends on the considered aspects of design.
Example 2.1 Let us consider two floor-layouts depicted in Figs. 1 and 2, respectively. A one-level layout hypergraph is sufficient to describe the structure of the first layout, while the structure of the second one needs to be described by a hierarchical layout hypergraph. A formal definition of a one-level layout hypergraph will precede the representation of the layout from Fig. 1. Let [i] denote the interval {1, ..., i} for i ≥ 0 (with [0] = ∅). Let {[i]} denote the family of intervals [i] for i ≥ 0. Let Σ_E = Σ_C ∪ Σ_R, where Σ_C ∩ Σ_R = ∅, and Σ_V be fixed alphabets of hyperedge and node labels, respectively.
Definition 2.1 A layout hypergraph over Σ = Σ_E ∪ Σ_V is a system G = (E_G, V_G, s_G, t_G, lb_G, ext_G), where:
1. E_G = E_C ∪ E_R, where E_C ∩ E_R = ∅, is a finite set of hyperedges, where elements of E_C represent object components, while elements of E_R represent relations,
2. V_G is a finite set of nodes,
3. s_G: E_G → V_G* and t_G: E_G → V_G* are two mappings assigning to hyperedges sequences of source and target nodes, respectively, in such a way that ∀e ∈ E_C: s_G(e) = t_G(e),
4. lb_G = (lb_V, lb_E), where:
- lb_V: V_G → Σ_V is a node labelling function,
- lb_E: E_G → Σ_E is a hyperedge labelling function, such that ∀e ∈ E_C: lb_E(e) ∈ Σ_C and ∀e ∈ E_R: lb_E(e) ∈ Σ_R,
Figure 1. Sample floor layout.
Figure 2. Another sample floor layout.
5. ext_G: [n] → V_G is a mapping specifying a sequence of hypergraph external nodes.
Hyperedges of the layout hypergraph are labelled by names of the corresponding components or relations. To each hyperedge a sequence of source and target nodes is assigned. Hypergraph nodes express potential connections between hyperedges. A hyperedge is called non-directed if the sequences of its source and target nodes are equal. Moreover, for each hypergraph a sequence of external nodes is determined. The length of this sequence specifies the type of a hypergraph.
Example 2.2 A one-level layout hypergraph, which represents the floor-layout shown in Fig. 1, is depicted in Fig. 3. Each hyperedge of E_C represents one room, while hyperedges of E_R correspond to relations of the highest level of detail, namely adjacency and accessibility. As has been mentioned, to represent the structure of the floor-layout presented in Fig. 2 we need a hierarchical layout hypergraph.
Definition 2.2 A hierarchical layout hypergraph over Σ is a system H = (E_H, V_H, s_H, t_H, lb_H, ext_H, ch_H), where:
1. E_H = E_C ∪ E_R and V_H are finite sets of hyperedges and nodes, respectively,
2. s_H, t_H: E_H → V_H* assign sequences of source and target nodes to hyperedges,
3. lb_H: V_H ∪ E_H → Σ_V ∪ Σ_E is a graph labelling function,
4. ext_H: [n] → Ext_H* is a mapping specifying a sequence of external nodes,
5. ch_H: E_C → P(A) is a child nesting function, where A = V_H ∪ E_H is called a set of hypergraph atoms, and the following conditions are satisfied:
- ∀a ∈ A, ∀e_1, e_2 ∈ E_C: a ∈ ch(e_1) ∧ a ∈ ch(e_2) ⇒ e_1 = e_2, i.e., one atom cannot be nested in two different hyperedges,
- ∀e ∈ E_C: e ∉ ch⁺(e), where ch⁺(e) denotes all descendants of a given hyperedge e, i.e., a hyperedge cannot be its own child,
- ∀e ∈ E_H: if there exists f ∈ E_C such that e ∈ ch_H(f), then all nodes of s_H(e) and t_H(e) belong to ch_H(f), i.e., source and target nodes of a nested hyperedge e are nested in the same hyperedge as e.
Figure 3. A one-level layout hypergraph representation of a floor-layout from Fig.1.
Example 2.3 A hierarchical layout hypergraph, which represents the floor-layout from Fig. 2, is shown in Fig. 4. It has three hierarchical hyperedges representing groups of rooms. The hierarchical hyperedge on the highest level, which is labelled apartment, contains two other hierarchical hypergraphs corresponding to a living area and sleeping area and two non-hierarchical hyperedges corresponding to an entrance hall and wc. The hierarchical hyperedge labelled sleeping area contains hyperedges representing a bedroom and bathroom, while the hierarchical hyperedge labelled living area contains three hyperedges representing open spaces with different functions, namely a living room, kitchen and hall. All hyperedges of E_R represent relations on the highest level of detail. Semantic information can be assigned to the hypergraph elements in the form of attributes. Therefore a notion of an attributed hierarchical layout hypergraph is introduced. Let A_V and A_E be sets of node and hyperedge attributes, respectively.
Definition 2.4 An attributed hierarchical layout hypergraph is a system AG = (H, att_V, att_E), where:
1. H = (E_H, V_H, s_H, t_H, lb_H, ext_H, ch_H) is a hierarchical layout hypergraph,
2. att_V: V_G → P(A_V) and att_E: E_G → P(A_E) are two functions assigning sets of attributes to nodes and hyperedges, respectively.
Example 2.4 In the hierarchical layout hypergraph shown in Fig. 4, hyperedges representing rooms have the attributes space and area assigned to them. The attribute space takes the value open_space for hyperedges representing the living room, kitchen, hall and bedroom, while it is equal to closed_space for hyperedges corresponding to the bathroom, wc, entrance hall and garage. The value of the attribute area specifies the quadratic area of each room.
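To make Definitions 2.1-2.4 concrete, the following is a minimal Python sketch of the underlying data structures (our illustration; the class names, fields and example labels are invented and do not come from the paper):

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Hyperedge:
    label: str                 # from Sigma_C (components) or Sigma_R (relations)
    sources: list              # sequence of source nodes, s_H(e)
    targets: list              # sequence of target nodes, t_H(e)
    attributes: dict = field(default_factory=dict)   # e.g. {"space": ..., "area": ...}
    children: list = field(default_factory=list)     # nested atoms, ch_H(e)

    def is_component(self):
        # component hyperedges are non-directed: equal source and target sequences
        return self.sources == self.targets

# Example: a bedroom accessible from a hall.
n1, n2 = Node("n"), Node("n")
bedroom = Hyperedge("bedroom", [n1], [n1], {"space": "open_space", "area": 14.0})
hall = Hyperedge("hall", [n2], [n2], {"space": "open_space", "area": 6.0})
access = Hyperedge("accessibility", [n2], [n1])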
3. DESIGN TOOLS
Figure 4. A hierarchical layout hypergraph representing a floor-layout from Fig.2.
In order to build a model of our system we have decided to take advantage of commercially proven approaches to software development, namely Object Oriented Analysis and Design using the UML (Booch et al., 1998). The UML, with a wide range of diagrams, enables us to represent not only a static view of a floor-layout but a dynamic one as well. Moreover, it allows for a clear division between the layout of data and the layout of possible realizations. In other words, it allows us to divide the layout of hypergraphs from the layout of diagrams created by the user. At the present stage of research we have concentrated on a logical view of our model, which describes structural mechanisms and the main classes in the floor-layout design. It is worth stressing here that we would like to incorporate some inference into the system. Since the UML enables us to model also the system's dynamics, we are able to present inference rules with the use of the same diagrammatic notation.
CONCLUSIONS
There exist different approaches to designing floor-layouts with the use of a computer (Borkowski et al., 1999; Heisserman, 1994). Our approach differs from the others as we plan to make our design system more "intelligent". We intend to apply hypergraph transformation rules which will make it possible
to infer useful facts about designs being created. Therefore we need an internal representation in the form of hierarchical hypergraphs which enables us to gather and extract information about designs on different levels of detail. Moreover, hypergraph transformations facilitate making modifications of successive design stages.
REFERENCES
1. Booch, G., Jacobson, I. and Rumbaugh, J., 1998, Unified Modeling Language User Guide, 1st ed., Addison Wesley, Boston.
2. Borkowski, A., Grabska, E., and Szuba, J., 1999, Visualisation of Graphs in ArchiCAD, LNCS 1774: 241-246.
3. Heisserman, J., 1994, Generative Geometric Design, IEEE Computer Graphics and Applications, March 1994, pp. 37-45.
4. Minas, M., 2002, Concepts and realization of a diagram editor generator based on hypergraph transformation, Science of Computer Programming 44: 157-180.
5. Ślusarczyk, G., 2003, Hierarchical Hypergraph Transformations in Engineering Design, Journal of Applied Computer Science 11 (2): 67-82.
3D MODELING OF OUTDOOR SCENES FROM OMNIDIRECTIONAL RANGE AND COLOR IMAGES Toshihiro ASAI1 , Masayuki KANBARA,1 and Naokazu YOKOYA1 1 Nara Institute of Science and Technology,
8916-5 Takayama, Ikoma, Nara 630-0192, JAPAN
{toshih-a, kanbara, yokoya}@is.naist.jp Abstract
This paper describes a method for modeling wide area outdoor environments by integrating omnidirectional range and color images. The proposed method efficiently acquires range and color data of outdoor environments by using an omnidirectional laser rangefinder and an omnidirectional multi-camera system (OMS). In this paper, we also give experimental results of reconstructing our campus from data acquired at 50 points.
Keywords:
3D modeling, outdoor environment, omnidirectional range image, omnidirectional color image
1. INTRODUCTION
3D models of outdoor environments can be used in a number of fields such as simulation and virtual walk-through. However, such 3D models are often made manually with high costs, so recently automatic 3D modeling has been widely investigated; for example, 3D shape estimation from an image sequence 1−3 and measuring outdoor environments by a laser rangefinder 4−6. The former has the problem that estimation accuracy for modeling of wide area outdoor environments is low. On the other hand, the latter can measure the shape of an object with high accuracy and over a long distance. This paper proposes a 3D reconstruction method for wide area outdoor environments. By using an omnidirectional laser rangefinder and an omnidirectional multi-camera system (OMS) which can capture a wide-angle high-resolution image, the range and color data of outdoor environments are efficiently acquired. Note that the omnidirectional color and range images are acquired approximately at the same position to register both of the images geometrically. Moreover, by using RTK-GPS and a gyro sensor to measure the position and orientation of the sensor system, the 3D outdoor scene model is efficiently generated. This paper is structured as follows. Section 2 describes the sensor system which is used to acquire omnidirectional range and color
118 K. Wojciechowski et al. (eds.), Computer Vision and Graphics, 118–124. © 2006 Springer. Printed in the Netherlands.
Figure 1. Sensor system mounted on a car.
images. Section 3 explains the 3D modeling method which integrates range and color images obtained at multiple positions. In Section 4, experimental results with the proposed method are described. Finally, Section 5 gives a summary and future work.
2. SENSOR SYSTEM
2.1 Data Acquisition
Fig. 1 illustrates the sensor system mounted on a vehicle. The system is equipped with an omnidirectional laser rangefinder (Riegl, LMS-Z360), an omnidirectional camera (Point Grey Research, Ladybug), and RTK-GPS. Fig. 2 shows an acquired omnidirectional range image in which the distance from the rangefinder is coded in intensity. Fig. 3 shows an omnidirectional color image generated from images acquired by the OMS. Note that the maximum measurable range of the rangefinder is about 200m and its measurement accuracy is within 12mm. The Ladybug has six radially located camera units in a camera block, and their positions and orientations are fixed. Since each camera can acquire a 768×1024 image, the Ladybug can acquire high-resolution omnidirectional color images which cover more than 75% of the full spherical view. The Ladybug is calibrated geometrically and photometrically in advance8. RTK-GPS and a gyro sensor are used to measure the position and orientation of the sensor system. The measurement accuracy is about 3cm. The orientation of the sensor system is measured by the gyro sensor whose measurement accuracy is within 0.5 deg. The yaw value measured by the gyro sensor usually includes an accumulative error. In this paper, to avoid the accumulative error, the measured yaw value of the gyro sensor is corrected by interlocking with RTK-GPS.
2.2 Alignment of Coordinate Systems
There are four coordinate systems in the sensor system; the rangefinder, the OMS, the RTK-GPS, and the gyro sensor coordinate systems. Geometrical
Figure 2. Omnidirectional range image.
Figure 3. Omnidirectional color image.
relationship among the coordinate systems is fixed, and these coordinate systems are registered to RTK-GPS(global) coordinate system as shown in Fig. 4. The method for estimating the transformation matrices which represent the relationship among the coordinate systems is described below.
(a) Matrix between OMS and rangefinder coordinate systems. By giving the corresponding points of range and color images, the transformation matrix can be estimated by 8 . (b) Matrix between rangefinder and gyro sensor coordinate systems. The transformation matrix can be estimated by measuring more than three markers whose positions in the gyro sensor coordinate system are known. The markers are placed at positions which can be measured by the rangefinder as shown in Fig. 5(a). The positions of markers in rangefinder coordinate system are estimated by the range data as shown in Fig. 5(b). (c) Matrix between gyro sensor and global coordinate systems. Z-axis of the gyro sensor coordinate system can correspond with Alt-axis of the global coordinate system when the gyro sensor is powered up. The transform
Figure 4. Relationship among the coordinate systems of the sensors.
Figure 5. Alignment of rangefinder and gyro sensor coordinate systems.
matrix usually consists of rotation and translation components. The translation component of the transformation matrix can be acquired by RTK-GPS. On the other hand, the rotation component can be estimated by the gyro sensor. However, only the offset between the yaw direction in the gyro sensor coordinate system and the yaw direction in the global coordinate system is unknown. The x-axis of the gyro sensor coordinate system is aligned with the Lat-axis of the global coordinate system by measuring the same point in the real scene with the rangefinder from two different points.
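Step (b) amounts to estimating a rigid transformation from at least three marker correspondences. One standard way to do this (shown below as our own sketch, not necessarily the estimation method the authors use) is the SVD-based Kabsch/Procrustes solution:

import numpy as np

def rigid_transform(src, dst):
    # Least-squares R, t with R @ src[i] + t ~= dst[i] for corresponding 3D points.
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cd - R @ cs
    return R, t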
3. 3D MODELING OF OUTDOOR ENVIRONMENTS
3.1 Registration of Multiple Range Images
In order to register the multiple range images, the ICP algorithm is used9,10 . The position and orientation of the rangefinder are acquired by the RTK-GPS and the gyro sensor, respectively. The error of orientation value influences
parts far from the rangefinder; therefore the position acquired by RTK-GPS is used as the position of the range data, while the orientation acquired by the gyro sensor is used as the initial value of the orientation of the range data. In the conventional ICP algorithm, the distance between points in paired range data is defined as an error, and the transformation matrix is calculated so that the error is minimized. The present rangefinder measures the distance by rotating the laser scan, thus the spatial density of data points depends on the distance; that is, close objects are measured densely and far objects are measured sparsely. This causes a problem in registering range data obtained at different positions. In order to overcome this problem, we define an error by computing the distance between a point in one data set and a plane determined by adjacent points in the other data set11.
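The modified error term can be illustrated as follows (a sketch with invented names; the plane is taken through three adjacent points of the other range data set):

import numpy as np

def point_to_plane_distance(p, q0, q1, q2):
    # Distance from point p to the plane through adjacent points q0, q1, q2.
    n = np.cross(q1 - q0, q2 - q0)
    n = n / np.linalg.norm(n)
    return abs(np.dot(p - q0, n))

# Example: unit distance from a point above the xy-plane.
print(point_to_plane_distance(np.array([0.2, 0.3, 1.0]),
                              np.zeros(3), np.array([1.0, 0, 0]), np.array([0, 1.0, 0])))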
3.2 Texture-mapping of Color Images on 3D Shape
The 3D shape obtained in the previous section is texture-mapped using the omnidirectional color images. Each triangular patch on the 3D shape is colored by the texture from the image which gives the highest resolution. However, this strategy fails when an occlusion occurs. The occlusion is detected when the whole 3D shape intersects with a triangular pyramid determined by the triangular patch vertices and the projection center of the camera. In such a case, the second highest resolution image is selected.
4. EXPERIMENTS
We have carried out experiments of reconstructing our campus. In the experiments, the range and color images are acquired at 50 points in our campus (about 250m × 300m). Fig. 6 shows the acquisition points of data in our campus. The sensor coordinate systems are aligned in advance with the proposed method described in Section 2.2. The resolution of each omnidirectional range image is 904×450. The number of polygons of the generated 3D model is 2,930,462. Fig. 7 illustrates the 2D CAD data of our campus overlaid with the generated model. We confirm that the generated model has no large distortion. Examples of rendering the generated model are shown in Fig. 8.
5. CONCLUSION
This paper has proposed a 3D modeling method which is based on integrating omnidirectional range and color images for wide area outdoor environments. In the experiments, a 3D model is actually generated from omnidirectional range and color images acquired at 50 points in our campus. We can move the viewpoint and look around the model freely. However, a sense of incongruity is observed in the generated model when different images are selected in neighboring polygons. Such an effect is mainly caused by the varying illumination
Figure 6. Range data acquisition points.
Figure 7. 2D CAD data overlaid on generated 3D model.
conditions during the measurement of the whole area. This problem in generating a textured 3D model should be investigated further.
REFERENCES
1. T. Sato, M. Kanbara, N. Yokoya and H. Takemura: "Dense 3-D Reconstruction of an Outdoor Scene by Hundreds-baseline Stereo Using a Hand-held Video Camera," International Jour. of Computer Vision, Vol. 47, No. 1-3, pp. 119–129, 2002.
2. C. Tomasi and T. Kanade: "Shape and Motion from Image Streams under Orthography: A Factorization Method," International Jour. of Computer Vision, Vol. 9, No. 2, pp. 137–154, 1992.
3. M. Okutomi and T. Kanade: "A Multiple-baseline Stereo," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 15, No. 4, pp. 353–363, 1993.
4. S. F. El-Hakim, C. Brenner and G. Roth: "A Multi-sensor Approach to Creating Accurate Virtual Environments," Jour. of Photogrammetry & Remote Sensing, Vol. 53, pp. 379–391, 1998.
5. H. Zhao and R. Shibasaki: "Reconstruction of Textured Urban 3D Model by Fusing Ground-Based Laser Range and CCD Images," IEICE Trans. Inf. & Syst., Vol. E-83-D, No. 7, pp. 1429–1440, 2000.
6. P. K. Allen, A. Troccoli, B. Smith, S. Murray, I. Stamos and M. Leordeanu: "New Methods for Digital Modeling of Historic Sites," IEEE Computer Graphics and Applications, Vol. 23, pp. 32–41, 2003.
7. C. Früh and A. Zakhor: "Constructing 3D City Models by Merging Aerial and Ground Views," IEEE Computer Graphics and Applications, Vol. 23, pp. 52–61, 2003.
Figure 8. Generated 3D model with the texture.
8. S. Ikeda, T. Sato, and N. Yokoya: "Panoramic Movie Generation Using an Omnidirectional Multi-camera System for Telepresence," Proc. 13th Scandinavian Conf. on Image Analysis, pp. 1074–1081, 2003.
9. P. J. Besl and N. D. McKay: "A Method for Registration of 3-D Shapes," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 14, No. 2, pp. 239–256, 1992.
10. T. Oishi, R. Sagawa, A. Nakazawa, R. Kurazume and K. Ikeuchi: "Parallel Alignment of a Large Number of Range Images," Proc. International Conf. on 3D Digital Imaging and Modeling, pp. 195–202, 2003.
11. K. Pulli: "Multiview Registration for Large Data Sets," Proc. International Conf. on 3D Digital Imaging and Modeling, pp. 160–168, 1999.
USER-CONTROLLED MULTIRESOLUTION MODELING OF POLYGONAL MODELS Muhammad Hussain1,2 , Yoshihiro Okada1,2 and Koichi Niijima1 1 Graduate School of Information Science and Electrical Engineering,
Kyushu University, 6-1, Kasuga Koen, Kasuga, Fukuoka 816-8580, Japan. 2 Intelligent Cooperation and Control, PRESTO, (JST)
{mhussain, okada, nijiima}@i.kyushu-u.ac.jp
Abstract
Exploiting the simplification hierarchy and hypertriangulation model, a unified framework (ADSIMP) has been proposed for user-controlled creation of multiresolution meshes5. For this system, here we present a new set of compact data structures for holding the multiresolution model, which is not only memory efficient but also supports the creation of the hypertriangulation model simultaneously with bottom-up vertex hierarchy generation. The newly proposed construction algorithm, based on the multiple-choice optimization simplification technique, is faster and makes ADSIMP available for real-time applications. Comparison with related work shows that the new system provides a combined environment with reduced memory usage and faster running times.
Keywords:
Adaptive simplification; multiresolution model; vertex hierarchy; hypertriangulation model; level of detail.
1. INTRODUCTION
Polygonal models are ubiquitous in 3D CG and its various applications, and because of their growing size and complexity, there is a challenging problem of their storage, real time manipulation and visualization. The most popular approach to alleviate this problem was the proposal of LOD and Multiresolution paradigms where an object is stored at k different (discrete/continuous) levels of detail. This approach necessitated polygonal simplification algorithms and throughout the last decade, a whole family of automatic polygonal simplification algorithms came into existence3 . While they produce very appealing results in many cases, they perform poorly at extremely low levels of detail because they are blind to semantic or high level meanings of a model. To overcome the shortcomings of uniform automatic simplification,
125 K. Wojciechowski et al. (eds.), Computer Vision and Graphics, 125–130. © 2006 Springer. Printed in the Netherlands.
a few methods1,2,6,7 have been proposed for adaptive simplification of polygonal meshes. The methods proposed in2,6,7 rely on the simplification hierarchy whereas Zeta1 relies on the hypertriangulation model. The algorithms presented in6,7 provide high-level control whereas Semisimp2 and Zeta1 provide low-level control to a user. Zeta1 provides relatively better control to a user but it is limited because it takes a pre-computed multiresolution representation of a polygonal model as input. Hussain et al.5 proposed a new system for adaptive simplification of meshes exploiting the hypertriangulation model and simplification hierarchy; this paper improves it in two ways: (1) the data structures are more compact and (2) the construction algorithm is faster. Exploiting a technique for simplification similar to the one proposed in8, we present a new construction algorithm. The overall organization of the paper is as follows. The following section describes in detail a unified framework for adaptive simplification, and the construction algorithm is detailed in Section 3. Techniques for navigation across continuous LODs, selective refinement and selective simplification are elaborated in Section 4. Results are discussed in Section 5 and Section 6 concludes the paper.
2. MULTIRESOLUTION MODEL
This section presents the details of a unified framework for sophisticated multiresolution mesh representation of orientable, 2-manifold polygonal models in R³; hereafter we will refer to it as the Adaptive Simplification Model (ADSIMP).
2.1 Simplification hierarchy
Half-edge collapse (ecol) transformation is invertible and naturally builds a bottom-up vertex hierarchy; Figure 1(a,d) shows ecol and its inverse vsplit transformation and the associated nodes of the vertex hierarchy. ADSIMP exploits the half-edge collapse transformation for simplification because the resulting vertex hierarchy can be encoded with as many nodes as there are vertices in the input mesh M and it is simple to implement. The sequence of half-edge collapse transformations is driven by a memory-efficient and feature-preserving error metric proposed in4; according to this, the cost of the half-edge collapse transformation e_st(v_s, v_t) → v_t (see Figure 1(b)) is Cost(e_st) = Σ_t Q_t, where Q_t = l_t · θ_t with l_t = (1/2)(A_1 + A_2), A_1 = area of triangle t(v_1, v_2, v_s), A_2 = area of triangle t′(v_1, v_2, v_t), and θ_t is the angle described by the normal of the triangle t when edge e_st is collapsed; the summation is taken over all triangles incident on v_s. For a detailed account, consult4.
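In code, the cost of a candidate collapse can be sketched as follows (our illustration with assumed inputs; triangles that would degenerate, i.e. those already containing v_t, are assumed to be excluded from the one-ring):

import numpy as np

def tri_area(a, b, c):
    return 0.5 * np.linalg.norm(np.cross(b - a, c - a))

def tri_normal(a, b, c):
    n = np.cross(b - a, c - a)
    return n / np.linalg.norm(n)

def collapse_cost(vs, vt, opposite_edges):
    # Cost(e_st) = sum_t Q_t, with Q_t = l_t * theta_t, over triangles t(v1, v2, vs)
    # incident on vs; opposite_edges lists the (v1, v2) pairs opposite to vs.
    cost = 0.0
    for v1, v2 in opposite_edges:
        lt = 0.5 * (tri_area(v1, v2, vs) + tri_area(v1, v2, vt))
        cos_theta = np.clip(np.dot(tri_normal(v1, v2, vs), tri_normal(v1, v2, vt)), -1.0, 1.0)
        cost += lt * np.arccos(cos_theta)
    return cost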
Figure 1. (a) Edge collapse (ecol) and vertex split (vsplit) operations and (d) corresponding nodes of the vertex hierarchy. (b) An illustration of ecol for error measure. (c) Gluing operation. (e) PackedEdge data structure.
2.2 Hypertriangulation model
The hypertriangulation model1 is a form of multiresolution representation of a polygonal mesh and is based on the idea of gluing. To comprehend the idea behind gluing, consider Figure 1(c): the half-edge collapse transformation e_uv(u, v) → v replaces the current patch T_u of triangles (the shaded region in Figure 1(c)) incident on vertex u with a new patch T_u′; instead of removing T_u, patch T_u′ is pasted over T_u by modifying the next adjacent relations of the half-edges e_v0, e_01, e_12, e_23, e_34, e_45 along the boundary of each patch; for instance, after gluing, the next adjacent half-edge e_4u of the boundary half-edge e_34 becomes inactive and e_4v becomes active, see Figure 1(c). Observe that when gluing is accomplished, there are two half-edges next to each of the half-edges along the common boundary of patches T_u and T_u′; a list of variable size is associated with each half-edge to store references to these next adjacent half-edges. Pasting is carried out by storing a new next adjacency in this list, setting it active and setting the current next adjacency inactive.
2.3 Multiresolution data structures
Here we present two new entities of data structures, Vertex and PackedEdge, which have been exploited to construct and hold the proposed multiresolution model. Hereafter, we will use PackedEdge and half-edge interchangeably.
Their representation in C++ format is as follows.

struct Vertex {
    float          position[3];   // vertex coordinates
    PackedEdge*    pe;            // one associated out-going half-edge
    int*           children;      // vertices that collapse to this vertex, in simplification order
    unsigned short ch_indx;
    int            parent;        // vertex this one collapses to; sign bit marks whether it is active
};
Here pe is a pointer to the associated PackedEdge that makes possible the traversal of all vertices in the 1-ring neighborhood of this vertex and its adjacencies; the children and parent fields encode the vertex hierarchy. children is a variable-length array that holds pointers to those vertices which will collapse to this vertex, in the order of their simplification, and parent holds the pointer to the vertex to which this vertex will be collapsed; its sign bit is used to hold the information whether this vertex is active or not in an LOD.

struct PackedEdge {
    int            origin;      // origin vertex of this half-edge
    PackedEdge*    twin;        // oppositely oriented half-edge
    PackedEdge**   next;        // next half-edges across different resolution levels
    unsigned short idx_next;    // index of the currently active next half-edge
};
Here the origin, twin and next fields are as shown in Figure 1(e); next is a variable-length array that holds, in order, the pointers to those half-edges which are next to this half-edge in ADSIMP across different resolution levels (see Figure 1(c)); one of these is active at a time and idx_next holds the index of this active half-edge. Note that there is no need of a data structure to explicitly specify facets: they are implicitly defined by the adjacencies which are encoded in the PackedEdge data structure. Vertex and PackedEdge records are stored in two dynamic arrays.
3. CONSTRUCTION ALGORITHM
The construction algorithm for ADSIMP described in Section 2 takes the original fully detailed polygonal mesh M as input and performs the following steps: 1 Initialize ADSIMP by creating a PackedEdge record for each half-edge e and a Vertex record for each vertex u of the original input mesh M. 2 Select k vertices randomly, and for each vertex u ∈ {u1, u2, ..., uk}, determine the optimal half-edge e_uv, where u is the origin and v is the head of e_uv; this is the one among the out-going half-edges of u whose collapse removes u and causes minimum geometric deviation. The cost of the corresponding optimal half-edge is stored as the cost of u (Section 2, simplification). 3 Select the minimum cost vertex u from {u1, u2, ..., uk}.
Figure 2. An interactive session. An LOD of the bunny model at uniform resolution, #faces 468, has been extracted (left top and bottom). Corresponding adaptively refined LOD (right top and bottom), #faces 1027; detail has been added at the head and tail. The small side window depicts the original model, #faces 69451.
4 Collapse e_uv by putting a reference to u in the children field of v and a reference to v in the parent field of u, and weld the new patch T_u′ onto the current patch T_u (Section 2, hypertriangulation model). 5 Repeat Steps 2 through 4 until no vertex can be removed.
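A high-level sketch of Steps 2-5, the multiple-choice decimation loop (illustrative Python; removable_vertices, optimal_halfedge and collapse stand for the operations described above and are assumed helpers, not the authors' API):

import random

def multiple_choice_decimation(mesh, k=8):
    # Repeat Steps 2-4 until no vertex can be removed (Step 5).
    while mesh.removable_vertices():
        verts = mesh.removable_vertices()
        candidates = random.sample(verts, min(k, len(verts)))                 # Step 2
        best = min(candidates, key=lambda v: mesh.optimal_halfedge(v).cost)   # Step 3
        mesh.collapse(mesh.optimal_halfedge(best))   # Step 4: record parent/children links
                                                     # in the hierarchy and glue the new patch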
4. SELECTIVE REFINEMENT AND SELECTIVE SIMPLIFICATION
After building ADSIMP by the method described in Section 3, it can be traversed to walk through the hierarchy of continuous LODs of a mesh by moving a vertex cut up or down the vertex hierarchy; this is accomplished by two key local operations: refine_vertex() and simplify_vertex(). simplify_vertex() moves up the next adjacency of each half-edge on the boundary of the patch T_u; for example, it sets the next adjacency of e_34 from e_4u to e_4v, see Figure 1(c), and refine_vertex() reverses this process. For a detailed account consult5.
5. DISCUSSION

We implemented ADSIMP using C++, MFC classes, and OpenGL on a system with a Pentium IV 2.9 GHz and 512 MB of main memory. Figure 2 shows snapshots of our system. One can efficiently navigate through the space of continuous LODs at run time using the slider, extract any fixed LOD, and further fine-tune it. The Vertex records occupy 26n bytes of memory and the memory occupied by the PackedEdge records does not exceed 216n bytes, so the overall memory occupied by ADSIMP is at most 242n bytes, n being the number of vertices. In the case of Zeta, this size is 347n bytes, which means that our proposed model consumes about 30% less memory. Although the authors of the methods proposed in 2,6,7 have not reported memory occupancy, the method of Youngihn et al. associates two quadric error metrics with each vertex, which requires 80n bytes for the error metrics alone, in addition to storing geometric and adjacency information. Analytically, the time complexity of the preprocessing phase is O(n). During an interactive session, the complexity of refining/simplifying s vertices using the simplification hierarchy is O(s); in the case of Zeta it is O(s log s), because a priority queue is maintained. Empirically, our system takes 1.87 s to build ADSIMP for the bunny model and can extract 550K triangles per second during an interactive session.
6. CONCLUSION

A new system has been proposed for user-controlled multiresolution modeling of polygonal meshes, based on more compact data structures and a faster construction algorithm. It is simple to implement, and provides the functionality of both low-level and high-level user-driven methods for adaptive simplification of polygonal meshes in a unified environment, with reduced memory consumption and faster execution time. Employing this model, one can efficiently navigate through continuous levels of detail of a mesh, extract a mesh at a constant desired resolution, and further locally simplify or refine the selected LOD to satisfy his or her needs.
REFERENCES
1. Cignoni, P., Montani, C., Rocchini, C., and Scopigno, R., Zeta: a resolution modeling system. GMIP: Graphical Models and Image Processing, 60(5):305-329.
2. Li, G., and Watson, B., Semiautomatic simplification. In ACM Symposium on Interactive 3D Graphics 2001, 43-48.
3. Luebke, D., A developer's survey of polygonal simplification algorithms. IEEE Computer Graphics & Applications, 24-35.
4. Hussain, M., Okada, Y., and Niijima, K., Efficient and feature-preserving triangular mesh decimation. Journal of WSCG, 12(1):167-174, 2004.
5. Hussain, M., Okada, Y., and Niijima, K., User-controlled simplification of polygonal models. Proc. 3DPVT04, IEEE Computer Society Press (to appear), 2004.
6. Pojar, E., and Schmalsteig, D., User-controlled creation of multiresolution meshes. In ACM Symposium on Interactive 3D Graphics 2003, pp. 127-130, 243.
7. Kho, Y., and Garland, M., User-guided simplification. In ACM Symposium on Interactive 3D Graphics 2003, pp. 123-126, 242.
8. Wu, J., and Kobbelt, L., Fast mesh decimation by multiple-choice techniques. In Proc. Vision, Modeling, and Visualization 2002, Erlangen, Germany, November 2002.
SHAPE SIMILARITY SEARCH FOR SURFEL-BASED MODELS
M. R. Ruggeri, D. V. Vranić, and D. Saupe
Department of Computer and Information Science, University of Constance, D-78457 Konstanz, Germany
Abstract:
We present a 3D object retrieval system in which a surfel-based model serves as a query and similar objects are retrieved from a collection of surfel-based models. The system comprises a surfelization technique for converting polygonal mesh models into a corresponding surfel-based representation, pose normalization of surfel-based models, a depth buffer-based method for describing the shape of surfel-based models, and a search engine. Our surfelization technique consists of an enhanced triangle rasterization procedure adapted to the geometric features of the original triangulated model. The surfel-based representation is normalized by applying a modification of the Principal Component Analysis to the set of surfels. The 3D-shape descriptor is extracted from the canonical coordinate frame of the surfel-based model using orthographic depth images. Each image is used as the input for the 2D Fast Fourier Transform. Appropriate magnitudes of the obtained coefficients are used as components of the feature vector. The resulting feature vector possesses an embedded multi-resolution representation, is invariant with respect to similarity transforms, and is robust with respect to outliers. The retrieval effectiveness of the presented method for characterizing the shape of surfel-based models is evaluated using precision-recall diagrams.
Key words:
point-based graphics; 3D model retrieval; point sampling.
1. INTRODUCTION
Point-based representation of 3D objects has recently received a lot of attention by the computer graphics community. The advantages of the pointbased representation are mostly appreciable in representing and rendering highly detailed surfaces1,2 where the rendering of traditional primitives (polygons) amounts to less than a pixel per primitive. We consider surfels as
132 our principal point-based primitives. Surfels (SURFace ELements introduced by Pfister et al.1) are a powerful paradigm that allows representing and collecting different information about the characteristics of the surface in the neighbourhood of each sample point. The number of surfel-based models is swiftly increasing. In the near future, this will result in the development of large and complex databases of surfel-based models requiring powerful retrieval systems, which will be able to perform appropriate and efficient queries. A 3D model retrieval system extracts several low-level features for each model, and measures the similarity between any two models in the low-level feature space. These low-level features are represented in a vector called feature vector. So far, shape similarity search of surfel-based models has not been studied in depth, even though the literature provides a rich variety of shape similarity methods for polygonal mesh models retrieval3. Our motivation is to propose a 3D object retrieval system in which a surfel-based model serves as a query and similar objects are retrieved from a collection of surfel-based models. To achieve our goal we need to extract an appropriate and compact feature vector from each surfel-based model and test its retrieval effectiveness on a database of such models. Such databases are not yet available and creating surfel-based models is still a difficult task. Thus, we converted already existing databases of triangular mesh models by using a surfelization procedure. Surfelization is the process of extracting surfels from a generic geometric representation of a 3D model (e.g., triangular mesh, implicit surface, parametric surfaces, CSG, etc.). We propose a geometric feature preserving surfelization method, which is able to automatically convert huge databases of triangulated 3D models into the corresponding surfel-based representation. The main idea of our surfelization method is to follow the triangulation of the original triangulated models so that all the geometric structures of the models can be appropriately sampled. In fact, we do not know a priori what the source of our triangular meshes is. Therefore, by following this approach we ensure a correct geometric conversion preserving all the shape characteristics of the original models. As shape descriptor of a surfel-based model we propose an extension of a depth buffer image-based descriptor for polygonal models4, which produced good retrieval results. First, each surfel-based model is normalized and aligned with respect to its principal component axes by applying a modification of the Principle Component Analysis to the set of surfels. A normalized depth buffer image is then extracted from each of the six sides of a suitable bounding cube by using orthographic projection. The resulting depth buffer images are processed with the 2D Fourier transform. The feature vector is built by appropriately sampling the magnitude of the
Fourier coefficients. The resulting feature vector is invariant with respect to similarity transforms, and robust with respect to outliers and level of detail. The retrieval effectiveness of the presented method for characterizing shape of surfel-based models is evaluated using precision-recall diagrams. The retrieval performance of our feature vector has then been compared with other different feature vectors (including the one based on depth-buffer images) extracted from the original triangulated 3D models.
2. PREVIOUS WORK
Extraction of surfel-based representations from geometric models has been studied in different settings. Some techniques require complicated computations with the geometry and/or image space1,5. Other techniques are based on estimations of the local curvature or on a local parametric approximation of the underlying surface2,6. Turk7 randomly spreads sampled points on triangular surfaces, and applies a relaxation procedure to optimize their position. The last category of surfelization techniques are conceived to be performed on the fly exploiting graphics hardware8,9. Cohen et al.9 present a triangle rasterization procedure that, given the distance between samples, covers the entire triangle surface with square tiles adapted to their edges. Our algorithm for surfelization of triangular meshes is inspired by9 as we use an adaptive triangle rasterization process, and by7 as our method is driven by the curvature and the geometric characteristic of the triangular surface. Point-based model retrieval has not yet been explored in depth. So far, only few matching techniques have been proposed for point-based models: based on differential geometry and algebraic topology10 and based on sampling a segmentation of the model built exploiting the Delaunay triangulation11. On the other hand a lot of work has been done to retrieve polygonal models. We classify methods for shape similarity of 3D models in three general categories: geometry-based, topology-based, and image-based. The geometry-based approach is characterized by using shape descriptors based on the geometric distribution of vertices or polygons12,3,4. Topologybased methods establish the similarity of two 3D models by comparing their topological structures13,3. Finally, image-based approaches consider two models similar if they look similar from any angle. Generally, these approaches extract a feature vector from a set of images obtained projecting 3D models onto different projection planes14,3,4. We refer to3 for a more extensive overview of shape descriptors and to15,4 for a detailed comparison between the different methods mentioned above. Recently shape benchmarks have been published for comparing different shape descriptors15,16, which show the promising retrieval effectiveness of image-based methods15. In
order to retrieve surfel-based models, we modify a depth-buffer image-based approach4, which shows high retrieval effectiveness for triangle mesh models (Section 4).
3. SURFELIZATION
Our system comprises a surfelization technique for converting already available retrieval test databases of triangular mesh models into a corresponding surfel-based representation. The goal of our surfelization technique is to sample an arbitrary triangular surface with surfels, preserving both its geometric properties and its contour properties. This ensures a correct shape approximation of the original triangulated model. The main idea of our surfelization algorithm is to follow the original triangulation, by sampling triangles with respect to the following geometric features:
- crease edges and boundary edges;
- a smooth local triangular surface at a vertex.
An edge is considered a crease edge if the dihedral angle θ between its two adjacent triangles exceeds a predefined threshold θ_t. An edge is detected as a boundary edge if it belongs to only a single triangle. We call these two types of edges preserved edges. For each vertex v, we consider all triangles sharing the vertex v as our local triangular surface Σ. We then compute the average plane π formed by all the vertices belonging to Σ17. The surface Σ is considered non-smooth if the orthogonal distance between the vertex v and the plane π exceeds a threshold d_t. The sampling algorithm starts from a seed vertex and samples the underlying local triangular surface near the features with a specific pattern. The remaining planar surface is sampled on a regular grid. Triangles sharing a preserved edge are sampled exactly along the edge line (Figure 1a). Surfel disks are then placed along the edge line. The radius and the position of each surfel are selected in order to keep the over-edge error e (Figure 1b) under a threshold e_max. Furthermore, the surfel radii must be in a predefined range [r_min, r_max]. When a triangular surface is locally smooth, a surfel is centred on each of its vertices, parallel to the average fitting plane17 of its neighbouring vertices. To select the radius of the surfel we consider two cases. In the first case, where at least one triangle sharing the vertex has a preserved edge (see Figure 1c), the radius of the surfel is set to the minimum length of the connected edges (r in Figure 1c). In the other case we set the radius to the maximum length of the connected edges (r in Figure 1d). In this last case, we also have to ensure
that preserved edges belonging to some neighbouring triangle do not overlap the surfel disk (see Figure 1d). In both cases the radius r is cropped to lie in the range [r_min, r_max], and the remaining unfilled area is sampled on a regular grid. Besides the two special cases mentioned above, the remaining triangles are sampled on a regular squared lattice without covering the already sampled area near the preserved features. Basically, this is a rasterization process, where surfel disks are placed at the centre of each output pixel with a radius equal to half the pixel diagonal. This technique is parallelizable and can be easily implemented exploiting the rasterization process of graphics hardware.
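The classification of preserved edges described above can be sketched as follows. This is our own illustration, not the authors' code: the angle between the unit face normals is used here as a proxy for the dihedral angle, and the threshold thetaT corresponds to θ_t.

    #include <algorithm>
    #include <cmath>

    // An edge is "preserved" if it is a boundary edge (only one adjacent triangle)
    // or a crease edge (angle between the adjacent face normals above thetaT).
    bool isPreservedEdge(int adjacentTriangles,
                         const float n1[3], const float n2[3], float thetaT) {
        if (adjacentTriangles == 1) return true;                // boundary edge
        float d = n1[0]*n2[0] + n1[1]*n2[1] + n1[2]*n2[2];      // cosine of the normal angle
        float angle = std::acos(std::min(1.0f, std::max(-1.0f, d)));
        return angle > thetaT;                                  // crease edge
    }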
Figure 1. Surfelization near preserved edges (a, b) and of a smooth local surface (c, d).
4. SHAPE SIMILARITY SEARCH
Our retrieval system aims to efficiently query a database of surfel-based models. An arbitrary surfel-based model serves as a query key to retrieve similar models from a collection of surfel-based models. Our retrieval method consists of three steps: normalization (pose estimation), feature extraction, and similarity search. Normalization (pose estimation). Surfel-based models are given in arbitrary units of measurement and in undefined positions and orientations. The normalization procedure aims to transform a model into a canonical coordinate frame, so that if one chose a different scale, position, rotation, or orientation of the model, its representation in these coordinates would remain the same. Moreover, the normalized representations corresponding to different levels of detail of the same model should be as similar as possible. We normalize each surfel-based model by applying a modification of the Principal Component Analysis (PCA) to the set of surfels. Following the usual PCA procedure, we compute the covariance matrix C through the following formulae:
C = (1/S) Σ_{i=1}^{N} (c_i − g)(c_i − g)^T S_i ,     g = (1/S) Σ_{i=1}^{N} c_i S_i ,     S = Σ_{i=1}^{N} S_i        (1)
where g is the center of gravity and S is the surface area of the entire model, N is the number of surfels, and S_i and c_i are the area and the center of the i-th surfel, respectively. C is a 3x3 symmetric, real matrix. The eigenvalues of C are then computed and sorted in decreasing order. From C we build the rotation matrix R, which has the normalized eigenvectors as rows, with non-negative elements on the main diagonal. First we center the surfel-based model on the point g by translating the surfel center points c_i. Then we apply the transformation defined by the rotation matrix R to the surfel center points and to the surfel normal vectors n_i. The resulting surfel-based model is now in canonical coordinates, invariant to translation and rotation.
Feature extraction. The features we extract represent the 3D shape of the objects present in a model. These features are stored as vectors (feature vectors) of fixed dimension. There is a trade-off between the required storage, the computational complexity, and the resulting retrieval performance. We adapted the approach proposed in4, in which a shape descriptor is extracted for each model from the canonical coordinate frame using six orthographic depth images. The depth images are computed by orthogonally projecting the surfel-based model onto the six faces of a cube establishing an appropriate region of interest. The size of the resulting depth images is 256x256 pixels. This operation can be easily performed exploiting a graphics hardware z-buffer.
The choice of the region-of-interest cube is very important and may affect the retrieval performance. We propose an approach that is robust with respect to outliers. The cube is aligned with the three canonical axes and centered at the center of gravity g (defined in Formula 1). The size of the cube, d_f, is proportional to the average distance d_avg of the surfels to the center of gravity g:

d_avg = (1/S) Σ_{i=1}^{N} ||c_i − g|| S_i ,     d_f = f · d_avg        (2)
Here f is a fixed ratio and we have used f = 3.9. Our experience suggests that this approach is superior to using the standard bounding cube, which may sensitively depend on outliers of the model. Figure 2 shows the model of an ant sampled with the corresponding six depth buffer images. It shows that the two antennae of the ant have been clipped by our cube, which improves retrieval performance by eliminating outliers. We apply the 2D discrete Fourier transform to each depth buffer image to obtain the corresponding Fourier spectrum. The feature vector is then built by storing, for each processed image, the magnitudes of low frequencies
Fourier coefficients as described in4. According to experimental results4 we set the dimension of our feature vector to 438: we store 73 real numbers for each image.
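As an illustration of this step, the following sketch computes the magnitudes of a few low-frequency 2D DFT coefficients of one depth image directly; in practice an FFT library would be used, and the exact selection of the 73 coefficients per image follows [4] and is not reproduced here, so the cut-off kmax is an assumed placeholder.

    #include <cmath>
    #include <complex>
    #include <vector>

    // Magnitudes of the low-frequency 2D DFT coefficients of a single W x H depth image.
    std::vector<float> lowFreqMagnitudes(const std::vector<float>& img,
                                         int W, int H, int kmax) {
        std::vector<float> feat;
        const double twoPi = 6.283185307179586;
        for (int u = 0; u < kmax; ++u)
            for (int v = 0; v < kmax; ++v) {
                std::complex<double> F(0.0, 0.0);
                for (int y = 0; y < H; ++y)
                    for (int x = 0; x < W; ++x) {
                        double phase = -twoPi * (u * double(x) / W + v * double(y) / H);
                        F += img[y * W + x] * std::exp(std::complex<double>(0.0, phase));
                    }
                feat.push_back(static_cast<float>(std::abs(F)));   // keep the magnitude only
            }
        return feat;
    }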
Figure 2. Surfel-based model (a) and the extracted depth buffer images (b).
Similarity search. The extracted features are designed so that similar 3D-objects are close in feature vector space. Using a suitable metric nearest neighbors are computed and ranked. In our case the distance between two feature vectors is computed by using the l1 or l2 norm. A variable number of objects are thus retrieved by listing the top ranking items.
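A minimal sketch of this ranking step, assuming feature vectors of equal length stored as plain float vectors, could look as follows (our illustration):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // l1 (norm == 1) or l2 (norm == 2) distance between two feature vectors.
    float distance(const std::vector<float>& a, const std::vector<float>& b, int norm) {
        float d = 0.0f;
        for (size_t i = 0; i < a.size(); ++i) {
            float diff = std::fabs(a[i] - b[i]);
            d += (norm == 1) ? diff : diff * diff;
        }
        return (norm == 1) ? d : std::sqrt(d);
    }

    // Returns database indices ordered from the most to the least similar model.
    std::vector<int> rankBySimilarity(const std::vector<float>& query,
                                      const std::vector<std::vector<float>>& database,
                                      int norm) {
        std::vector<int> order(database.size());
        for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
        std::sort(order.begin(), order.end(), [&](int i, int j) {
            return distance(query, database[i], norm) < distance(query, database[j], norm);
        });
        return order;
    }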
5. RESULTS
We have converted two test databases of triangular mesh models. The first database consists of 1814 3D polygonal models provided by the Princeton Shape Benchmark repository15. The second database consists of 1841 polygonal models provided by the CCCC repository16. These databases contain triangulated 3D models at different resolutions and with very different numbers of triangles. Both databases have been converted in less than two hours including the input/output time. The time required for the surfelization was 98.28 minutes for the first database and 100.74 minutes for the second database, on a 2.4 GHz Intel Pentium 4 system with 2 GB RAM, using a non-optimized C++ version of our program and software-only rasterization. Table 1 shows the execution time (in seconds) and the number of surfels produced by our surfelization technique, together with the number of geometric primitives (vertices and triangles) of the original triangular mesh. The fourth column shows the number of geometric features extracted from the triangulated models.

Table 1. Surfelization performance.
Model      Vertices   Triangles   Features   Surfels   Time (s)
Bunny      7691       15294       13800      283604    2.18
Ant        298        492         347        14202     0.09
Fish       18409      35270       31563      325851    5.16
Airplane   13908      27087       24185      470660    7.57
Figure 3. Result of the surfelization: the original triangulated models (a) and the corresponding surfel-based representations (b).
Figure 4. Precision-recall diagram comparing different feature vectors computed for the original triangle meshes with the proposed feature vector extracted from the corresponding surfel-based models. The dimensions of the feature vectors are given in brackets.
Figure 3 shows two different triangular mesh models (Figure 3a) and the corresponding surfel-based models (Figure 3b). As we can see both geometrically and visually the features of the original models have been preserved. The retrieval effectiveness of the presented method for characterizing shape of surfel-based models is evaluated by using precision-recall diagrams. We have compared our method with other retrieval methods for polygonal mesh models. Figure 4 shows the precision-recall diagram we use to compare these different feature vectors. These results refer to retrieval tests performed on the polygonal models of the original Princeton Shape Benchmark test database, and on the corresponding surfel-based
representation. We have compared our retrieval system for surfel-based models with three different feature vectors extracted from the original triangulated models. The first feature vector is based on depth buffer images4 (DBD in Figure 4). The second is a ray based feature vector with spherical harmonics representation18 (RSH in Figure 4). The third feature vector is computed by using the exponentially decaying EDT technique12 (EDTorig in Figure 4). For surfel-based models we have considered two feature vectors. The first is extracted by using our region-of-interest cube (DBSBM in Figure 4) and the second is extracted by using the standard bounding cube (DBSBM-SBC in Figure 4). The diagram shows that the retrieval performance of the depth buffer image-based descriptor we designed for surfel-based models is approximately the same as the performance of the corresponding descriptor of polygonal mesh objects.
6. CONCLUSION AND FUTURE WORK
We have presented a retrieval system for surfel-based models. The system includes: a surfelization technique for converting polygonal mesh models into a corresponding surfel-based representation, pose normalization of surfel-based models, a depth-buffer image-based method for describing the shape of surfel-based models, and a search engine. With the presented surfelization technique we have converted large databases of polygonal mesh models used for retrieval purposes into corresponding databases of surfel-based models in an acceptable period of time. Converting an already classified database allows us to skip the classification step, which is commonly done manually and is time-consuming. Our surfelization procedure preserves the geometric features of the original models, but does not consider their texture properties. Although this is not crucial for shape analysis and retrieval purposes, it is fundamental for realistic surfel-based model visualization. Future work will be driven in this direction. We have also proposed a suitable feature descriptor for surfel-based models. Our feature vector is derived from the image-based shape descriptors used for polygonal mesh models4. This approach obtained approximately the same good retrieval results with surfel-based models, as shown by our retrieval tests. However, our main goal will be to conceive further feature vectors that are strictly based on the intrinsic geometric properties of the set of surfels. Our retrieval system16,4 offers a user interface that gives non-specialist users easy and effective access to a 3D geometry database. We plan to extend our system to also support different point-based model representations.
ACKNOWLEDGMENTS
This work was supported by the strategic research initiative on Distributed Processing and Delivery of Digital Documents funded by the German Research Foundation, the Kurt Lion Foundation, and the DFG-Graduiertenkolleg "Explorative Analysis and Visualization of Large Information Spaces" of the University of Konstanz.
REFERENCES
1. Pfister H., Zwicker M., Van Baar J., Gross M.: Surfels: surface elements as rendering primitives. SIGGRAPH 2000, (Jul. 2000), pp. 335-342.
2. Alexa M. et al.: Computing and rendering point set surfaces. In IEEE Trans. on Computer Graphics and Visualization, (Jan.-Mar. 2003), Vol. 9, No. 1, pp. 3-15.
3. Tangelder J. W. H., Veltkamp R. C.: A survey of content based 3D shape retrieval methods. Shape Modeling International, Genova, Italy, (Jun. 2004).
4. Vranic D. V.: 3D model retrieval. Ph.D. Thesis, University of Leipzig, 2004.
5. Moenning C., Dodgson N. A.: Fast marching farthest point sampling. EUROGRAPHICS 2003, (Sep. 2003).
6. Zwicker M., Pauly M., Knoll O., Gross M.: Pointshop 3D: an interactive system for point-based surface editing. SIGGRAPH 2002, (Jul. 2002).
7. Turk G.: Re-tiling polygonal surfaces. SIGGRAPH 92, (Jul. 1992).
8. Wand M. et al.: The randomized z-buffer algorithm: interactive rendering of highly complex scenes. SIGGRAPH 2001, (Aug. 2001), pp. 361-370.
9. Cohen J., Aliaga D. G., Zhang W.: Hybrid simplification: combining multi-resolution polygon and point rendering. IEEE Visualization 2001, (Oct. 2001), pp. 37-44.
10. Collins A., Zomorodian A., Carlsson G., Guibas L.: A barcode shape descriptor for curve point cloud data. Symposium on Point-Based Graphics 2004, (Jun. 2004).
11. Dey T. K., Giesen J., Goswami S.: Shape segmentation and matching from noisy point clouds. Symposium on Point-Based Graphics 2004, (Jun. 2004).
12. Funkhouser T. et al.: A search engine for 3D models. ACM Trans. on Graphics, 22(1), (2003), pp. 83-105.
13. Hilaga M., Shinagawa Y., Kohmura T., Kunii T. L.: Topology matching for fully automatic similarity estimation of 3D shapes. SIGGRAPH 2001, (2001).
14. Chen D.-Y., Ouhyoung M., Tian X. P., Shen Y. T.: On visual similarity based 3D model retrieval. Computer Graphics Forum, (2003), pp. 223-232.
15. Shilane P., Min P., Kazhdan M., Funkhouser T.: The Princeton shape benchmark. Shape Modeling International 2004, Genova, Italy, (Jun. 2004).
16. Vranic D. V.: Content-based classification of 3D-models by capturing spatial characteristics. http://merkur01.inf.uni-konstanz.de/CCCC/.
17. Schroeder W., Zarge J., Lorensen W.: Decimation of triangle meshes. SIGGRAPH 92, (Jul. 1992), pp. 65-70.
18. Vranic D. V., Saupe D., Richter J.: Tools for 3D-object retrieval: Karhunen-Loeve transform and spherical harmonics. IEEE 2001 Workshop on Multimedia Signal Processing, (Oct. 2001), pp. 293-298.
DESCRIPTION OF IRREGULAR COMPOSITE OBJECTS BY HYPER-RELATIONS
Juliusz L. Kulikowski Institute of Biocybernetics and Biomedical Engineering PAS, Warsaw, Poland, e-mail [email protected]
Abstract:
A formal approach to the description of irregular geometrical forms by hyper-relations is presented. This approach is based on an extension of the method of composite object description based on the algebra of relations. It is shown that in some applications hyper-relations suit the description of irregular geometrical forms better than other conventional tools. The general concept is illustrated by an example of irregular form description by compositions of Freeman chains.
Key words:
irregular geometrical forms, relations, description, hyper-relations, composite objects

1. INTRODUCTION
Objects of composite and irregular form are investigated in many areas of the natural and/or technological sciences. Living biological cells or organs, expanding air or water pollution, evolving geophysical or meteorological phenomena, etc., can be visualized as static or form-changing objects and analyzed by image processing methods. At a first level of classification of natural objects' forms, dense and fuzzy forms can be distinguished. An object has a dense form if the form constitutes a subset of a geometrical space in the classical set theory sense. Otherwise, if it is described in probabilistic, fuzzy sets, rough sets or any other similar-type categories, it belongs to the large class of fuzzy forms. A typical example of a dense form description is the approximation of a heart ventricle form by a regular ellipsoid1. Such an approximation is sufficient
in most medical applications, despite the fact that a real heart ventricle is not exactly ellipsoidal. A typical example of a fuzzy form is one describing the expansion of chemical pollution in sea water. The polluted area observed in air or satellite photos is not dense in the above-given sense, but it can be characterized by a time-varying density function. The methods of dense and fuzzy geometrical form description are evidently different. Moreover, the forms of composite geometrical objects have, in general, a multi-level organization, and on each level the proper method of geometrical form description should be used. The aim of this paper is the presentation of models of geometrical form time-variations based on a hyper-relational approach. Principles of an approach to static composite form description based on an extended algebra of relations have been described in the literature2. The concept of a hyper-relation was originally presented in Kulikowski's work3. A model based on it seems to be flexible enough for the description of a variety of objects, including those having a multi-level structure. It will be shown below that fuzzy forms can be described by hyper-relations as well.
2. HYPER-RELATIONS AS TOOLS FOR IRREGULAR FORMS DESCRIPTION
Dense geometrical forms can be roughly divided into the classes of regular and irregular ones. A geometric form is called regular if it belongs to a typical, exactly defined and well-known class of geometrical objects in 2D or 3D space, like: points, straight lines or their segments, triangles, rectangles, circles, spheres, ellipsoids, etc. Each regular object can be characterized by a finite set of parameters describing both its geometrical form and its position in a fixed system of co-ordinates. Natural objects rather exceptionally take regular forms; however, for certain purposes their forms can be approximated by regular ones. In the more general case natural objects take irregular forms. Then, they can be described only by an a priori unlimited number of parameters like: contour points' co-ordinates, functional series' coefficients, moments, etc., describing their forms with a desired but limited accuracy. Let us denote by Ω(κ), κ = 1, 2, …, k, …, K, the set of all possible values of a parameter ω_κ describing a geometrical form. For practical reasons it will be assumed that the maximum number K of parameters under consideration is finite. On the basis of this family of sets taken
in a given linear order Ω = [Ω(κ)], a class C of its order-preserving subfamilies F(σ), σ = 1, 2, …, 2^K − 1, can be defined. Any Cartesian product

S(σ) = { X_κ Ω(κ) : Ω(κ) ∈ F(σ) }        (1)

where X_κ denotes a multiple Cartesian product of sets indexed by κ, thus represents a subspace of linearly ordered strings of parameters belonging to F(σ). If F(σ) consists of the sets of parameters characterizing a given class of geometrical forms, then any subset

r ⊆ S(σ)        (2)

is, by definition, a relation defined on S(σ). As such, it describes a subclass of objects (forms) characterized by the strings of the corresponding elements (values of parameters). The strings of elements satisfying a given relation are called its syndromes. Let us denote by R(σ) the family of all possible relations that can be defined on S(σ). Taking into account that R(σ) is, by definition, the set of all possible subsets of S(σ) (including the empty subset and S(σ) itself), it is possible to establish in R(σ) a Boolean algebra of relations as an algebra of subsets of S(σ). It has been shown in [2] that this algebra can be extended to the family of all relations that can be defined on any possible subfamilies S(σ) of Ω = [Ω(κ)], κ = 1, 2, …, K. In particular, if S(σ) and S(τ) are any two such (in general, different) subfamilies and r(σ)_μ, r(τ)_ν are any two relations defined, correspondingly, on S(σ) and S(τ), then their sum

r(στ) = r(σ)_μ ∪ r(τ)_ν        (3)

is a relation defined on S(σ) ∪ S(τ), consisting of all syndromes such that each of them projected on S(σ) satisfies r(σ)_μ or projected on S(τ) satisfies r(τ)_ν. Using relations for the description of composite geometrical objects has many good points, but it is limited by the fact that any relation consists of syndromes of a fixed length. This disadvantage leads to the concept of a first-order hyper-relation defined on Ω as a subset of various-length syndromes taken from selected subfamilies of Ω with preservation of their linear order3. Let Ω be a linearly ordered family of sets and let us take into account any subset of order-preserving subfamilies S(α), S(β), …, S(ε) ⊆ Ω, as well
as the corresponding Cartesian products F(α), F(β), …, F(ε). Any sum of subsets

h = r(α) ∪ r(β) ∪ … ∪ r(ε)        (4)

such that r(α) ⊆ F(α), r(β) ⊆ F(β), …, r(ε) ⊆ F(ε), will be called a first-order hyper-relation (an h-relation) defined on the linearly ordered family of sets S(α) ∪ S(β) ∪ … ∪ S(ε).
Let us denote by L_Ω the set of all possible strings of elements taken from any possible subfamilies S(σ) of Ω. Then the family H of all possible subsets of L_Ω, including the empty subset and L_Ω itself, constitutes a Boolean algebra of subsets, being at the same time a Boolean algebra of all h-relations defined on Ω. Application of the above-given concepts will be illustrated by the following example. Any irregular curve C on a discrete 2D plane can be described by a Freeman chain given by a linearly ordered sequence of ordered pairs [p, q], [d_i, λ_i], i ∈ I, where I is a set of natural indices of the chain segments, [p, q] are the discrete co-ordinates of the starting point of the chain, d_i ∈ D(8) ≡ {1, 2, …, 8} denotes the direction of the i-th segment of the chain in an 8-connective vicinity, and λ_i is a natural number indicating the length of the i-th segment (compare [6], Chapt. 12.2.3). The concept of h-relation will be used for the construction of an algebra of Freeman chains. Let us denote by N the set of natural numbers. We shall define a Cartesian product:

Ω = N × N × X_i [N × D(8)]_i        (5)

We take into account the subfamilies of Ω of the following form:

S(σ) ≡ S[I(σ)] = N × N × X_{i∈I(σ)} [N × D(8)]_i        (6)

I(σ) ⊆ I being a subset of indices of chain segments. Any subset r ⊆ S(σ) satisfying the condition that for each syndrome and any two consecutive terms [N × D(8)]_i, [N × D(8)]_{i+1} there is

d_i ≠ d_{i+1} ≠ d_i + 4 (mod 8)        (7)

(i.e. no two consecutive chain segments have the same or opposite directions) defines a relation describing Freeman chains consisting of
a fixed number of segments. Let us remark that this relation describes Freeman chains as directed objects, having their starting and end nodes, the nodes being enumerated in a unique way. If chains with various numbers of nodes are of interest, a set of subfamilies S(σ) based on subsets of indices I(σ) of various lengths and various combinations of indices should be taken into account. A sum

F = h(α) ∪ h(β) ∪ … ∪ h(η)        (8)

where h(α), h(β), …, h(η) are relations of the above-described type defined, correspondingly, on S(α), S(β), …, S(η), defines an h-relation describing Freeman chains consisting of the corresponding (in particular, all possible up to a fixed limit) numbers of segments. Each subset F(ν) ⊆ F is an h-relation describing some Freeman chains as well. This observation makes it possible to define some higher-order objects composed of Freeman chains. Let F(λ), F(ν) be two h-relations describing Freeman chains with the following properties. F(λ) contains all Freeman chains: 1° consisting of at least 1 and at most n, n > 1, segments, 2° with their starting node denoted by i_1, and 3° containing also a node denoted by i_0. F(ν) contains Freeman chains: 1° consisting of at least 1 and at most m, m > 1, segments, 2° with their starting node denoted by i_0, 3° having no other common nodes with F(λ) except i_0, and 4° such that:

d(λ)_0 ≠ d(ν)_0 ≠ d(λ)_0 + 4 (mod 8)        (9a)

d(λ)_{0−} ≠ d(ν)_0 ≠ d(λ)_{0−} + 4 (mod 8)        (9b)

where 0− denotes the index of the node preceding i_0 in F(λ). Then the intersection of h-relations

F = F(λ) ∩ F(ν)        (10)

defines an h-relation describing geometrical constructions composed of pairs of Freeman chains connected at i_0. A typical example of such a construction is shown in Fig. 1.
Figure 1. A typical example of a structure described by two connected Freeman chains.
In a similar way, more sophisticated structures consisting of three or more Freeman chains can be defined by algebraic combinations of several h-relations.
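To make condition (7) concrete, here is a small sketch (our own illustration, not the author's) that checks whether a sequence of segment directions forms a valid Freeman chain in the sense used above.

    #include <vector>

    struct Segment { int direction; int length; };   // the pair [d_i, lambda_i]

    struct FreemanChain {
        int p, q;                                     // starting point [p, q]
        std::vector<Segment> segments;
    };

    // Condition (7): no two consecutive segments may have the same or the
    // opposite (mod 8) direction. Directions are encoded 1..8 as in D(8).
    bool satisfiesCondition7(const FreemanChain& c) {
        for (size_t i = 0; i + 1 < c.segments.size(); ++i) {
            int d  = c.segments[i].direction;
            int dn = c.segments[i + 1].direction;
            if (dn == d) return false;                // same direction
            if ((dn - d + 8) % 8 == 4) return false;  // opposite direction
        }
        return true;
    }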
3. CONCLUSIONS
It has been shown that a large class of composite irregular geometrical forms, representing natural objects, can be formally described as h-relations or their algebraic combinations. This, in particular, concerns geometrical forms composed of simpler forms described by various types of shape coefficients, spectral coefficients, moments, etc. Due to the fact that h-relations describe classes of geometrical objects, an extension of the hyper-relational approach to fuzzy geometrical objects is also possible.
REFERENCES
1. W. Jakubowski (ed.). Diagnostyka ultradźwiękowa. (in Polish, PZWL, Warszawa, 1989).
2. J.L. Kulikowski. Relational Approach to Structural Analysis of Images (Machine Graphics and Vision, vol. 1, No 1-2, 1992, pp. 299-309).
3. J.L. Kulikowski. Rozpoznawanie hiperrelacji w komputerowych bazach wiedzy (in Polish, in: "Inżynieria wiedzy i systemy ekspertowe", OW Politechniki Wrocławskiej, Wrocław, 2000, pp. 281-291).
4. J. Zabrodzki (ed.). Grafika komputerowa, metody i narzędzia (in Polish, WNT, Warszawa, 1994).
AUTOMATIC FACE SYNTHESIS AND ANALYSIS. A QUICK SURVEY
Y. Sheng1, K. Kucharski2, A. H. Sadka1, W. Skarbek2
1 University of Surrey, 2 Warsaw University of Technology
Abstract:
Considerable interest has been devoted to automatic face synthesis and analysis over the last three decades. This paper surveys the current state of the art in face synthesis, and also presents selected face detection and eye detection algorithms, along with a facial feature extraction approach based on the Harris corner detector, in face analysis.
Key words:
3D face synthesis, 3D face modeling, face detection, eye detection, facial feature extraction, AdaBoost classifier, Harris corner detector
1. INTRODUCTION
Automatic face synthesis and analysis have received tremendous attention from practitioners in the research fields of both image processing and computer graphics. To give the research community a comprehensive insight into the field, this paper surveys the current state-of-the-art technology, ranging from face detection to modeling a specific face. The body of the paper is composed of two parts: face synthesis and face analysis. The first part introduces face synthesis in terms of a survey, which first conducts an investigation of current applications of face synthesis technology, followed by 3D face representation, and ends with methods of
Acknowledgement: The work presented was developed within Visnet, a European Network of Excellence (http://www.visnet-noe.org), funded under the European Commission IST FP6 programme.
synthesising a face. The first part is basically aimed at newcomers, giving them a recapitulative overview of advances in 3D face synthesis. For experts, the second part delves into face analysis from the image processing point of view, elaborating some methods recently proposed by the authors in face detection and facial feature extraction. The conclusions are drawn in the last section.
2. FACE SYNTHESIS
2.1 Applications
Applications of face synthesis have mainly focused on two major categories: virtual presence and non-presence applications. In the former, the 3D synthesised face generally serves as a virtual alternative, providing users with a sense of presence and bridging the gaps between the audio-visual sensations of current state-of-the-art visualisation equipment and the real sensation of live events. In the latter, the 3D synthesised face always plays an in-between role for implementing a special purpose.

2.1.1 Virtual presence
Virtual presence is one of the most active fields for 3D face synthesis, the idea of which is to improve human-machine interaction, and make the environment more vivid with the virtual presence of participants. The use of 3D face synthesis technology boosts advances in immersive videoconferencing and human-computer interface (HCI). For example, in the scenario of immersive videoconferencing, the basic idea is to place 3D visual representations of participants at predefined positions around the table with a shared virtual table environment (SVTE). People are led to believe the participants are present in the same room, but they are actually located at several remote locations (see IST project VIRTUE1). Using 3D synthesised face in immersive videoconferencing can offer rich communication modalities as similar as possible to those used in face-to-face meetings. It would also overcome the limitations of conventional multiparty videoconferencing, showing participants within one window, instead of showing them in separate windows as in Microsoft NetMeeting2 and Access Grid3. Moreover, this high-realism conferencing system will avoid the unnecessary cost of travel. HCI is another potential field for 3D face synthesis, and believed to have a wealth of applications ranged from e-commerce to distance education. In
HCI, a 3D virtual avatar that can be a duplicate of the animator is interactively present in terms of speech and facial expression. One motivation for using the 3D synthesised model is that sound and animation can convey information faster than tedious characters alone. Also, technical maturation in computer facial animation4 and audiovisual synchronisation gears up the success of 3D face synthesis in the field, especially after the standardisation of MPEG-45. For instance, targeting young Internet users, an e-mail service called PlayMail6 has been developed, which can translate the text and emotion of an email into a video with an animated talking face based on the MPEG-4 standard. Facing the gradually emerging applications, an MPEG-4 compliant Face Animation Engine (FAE) called EPTAMEDIA7 was produced, which enables the real-time rendering of a 3D face model from text, speech and a real actor.

2.1.2 Non-presence applications
Unlike in virtual presence, 3D synthesised face in non-presence applications always plays an in-between role for implementing a special purpose such as face identification, face tracking and 3D model-based video coding. Feng et al8 proposed to identify face using virtual frontal-view image. The idea was to first construct a 3D face from an unknown head-andshoulder image, followed by transforming the synthesised face, resulting in a frontal-view face from an arbitrary-view. As tackling a planar frontal-view face problem, the face could be identified by exploiting the existing methods. Valente et al9 proposed a facial feature tracking algorithm, in which a realistic 3D face was first generated. The synthesised face must be identical to the 2D face in the video frame in terms of shape, orientation and size so as to be used as a reference for block-matching face features with the real face. Both of the above two applications reportedly achieved more robust performances because the use of a 3D face model can bring a high degree of realism. 3D model-based video coding has received considerable attention for over two decades10,11. The idea came up when the conventional waveform coding reached its limitation in video transmission over narrowband networks. Given a head-and-shoulder video sequence, a generic 3D face model should ideally be adapted automatically to a detected face in the video frame. Facial texture should also be acquired without supervision, and then sent to the decoder with shape, size and orientation information so that a specific 3D face mimic can be generated at the decoding end. Between two successive frames of the input video sequence, the motion of the face is estimated by calculating a set of animation parameters. The shape parameters and texture with the background are only sent at the first frame
of the whole sequence. All the encoder needs to transmit is just the animation parameters and the changes of the background in the following frames. At the decoder, the input video is reconstructed with the individualised 3D face model rendered using the received animation parameters. By adopting the idea of 3D face synthesis, the compression efficiency can be greatly improved.
2.2 3D Face representation
Since the first parameterised face model was technically reported12, considerable endeavour has been devoted to modeling a face more realistically with a lower degree of computational complexity. Waters13 used a vector approach to simulate the facial muscles. Terzopoulos et al14 proposed a three-layer face model for constructing the detailed anatomical structure and dynamics of the human face. The three layers correspond to skin, fatty tissue and muscle tied to bone. Elastic spring elements connect each mesh node and each layer. Muscle forces propagate through the mesh systems to create animation. This model can achieve great realism. However, it is computationally expensive to simulate such multilayer lattices. The idea of using the elastic spring model has become widespread15,16. Especially in the latter work16, the face model reported earlier4 with reduced vertices is combined with the elastic spring model to provide a high degree of realism with concise computation. The CANDIDE models have been extensively accepted because of their simplicity and public availability17. The original CANDIDE18, consisting of 75 vertices and 100 triangles, was first motivated by model-based video coding. Later, Welsh at British Telecom19 created CANDIDE-2 with 160 vertices and 238 triangles covering the entire frontal head (including hair and teeth) and the shoulders. With the finalisation of the MPEG-4 standard, there is an emerging need to update the existing CANDIDE models and make them compliant with the Face Definition Parameters (FDPs). In the light of this demand, CANDIDE-320 was designed based on the original CANDIDE model. The update has been made by adding details to the mouth, nose and eyes of the original model, matching the corresponding Facial Feature Points (FFPs) in MPEG-4. Besides the above mesh-based 3D representation, the face can also be described by 3D eigenfaces. 3D laser scans of human heads (cf. Figure 1a) enable the representation of faces as a "cloud of N points" together with associated colour information (cf. Figure 1b). Typically such a spatial set contains more than 10^5 points in 3D space (N > 10^5) and is too complex as a 3D model for face recognition and face animation.
Figure 1. 3D laser scan: (a) color image rectified from cylinder along which the scan proceeds; (b) reconstructed 3D cloud of points with colour information.
Figure 2. Selected 3D eigenfaces which are visualized using eigenvectors obtained for shape and colour components.
It appears that the collection of “head clouds”, obtained for a large group of persons, and considered as a set of points in a high N dimensional space
can be approximated by a hyperplane (subspace) of relatively low dimension M (typically M ≈ 50). One possible linear algebraic basis spanning this hyperplane is obtained by Principal Component Analysis (PCA21) as the eigenvectors of the covariance matrix for the given training set of points. In the case of 2D facial images the PCA eigenvectors are called eigenfaces22, and by analogy, for a 3D facial cloud of points the PCA eigenvectors are recognized as 3D eigenfaces23 (cf. Figure 2). Having M 3D eigenfaces F_1, …, F_M and the average face F_0, any 3D face F can be approximated by a linear combination of eigenfaces, where the coefficients α_i are suitable dot products:

F ≈ F_0 + Σ_{i=1}^{M} α_i F_i ,     α_i = F_i^T (F − F_0)        (1)
In order to get a consistent PCA model, each cloud of points should be registered to a reference point cloud. To this end, a mesh of FAP points (defined in the MPEG-4 standard) and mappings between corresponding mesh triangles based on barycentric coordinates can be exploited.
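As an illustration of formula (1), the following sketch (our own, not the authors' implementation) projects a registered face onto the eigenfaces and reconstructs its approximation; faces are assumed to be stored as flat coefficient vectors of equal length (the concatenated, registered shape and colour components).

    #include <vector>

    // F ~ F0 + sum_i alpha_i * F_i, with alpha_i = F_i^T (F - F0).
    std::vector<float> reconstruct(const std::vector<float>& F,
                                   const std::vector<float>& F0,
                                   const std::vector<std::vector<float>>& eigenfaces) {
        std::vector<float> approx(F0);                    // start from the average face
        for (const std::vector<float>& Fi : eigenfaces) {
            float alpha = 0.0f;                           // alpha_i = F_i^T (F - F0)
            for (size_t j = 0; j < F.size(); ++j)
                alpha += Fi[j] * (F[j] - F0[j]);
            for (size_t j = 0; j < approx.size(); ++j)    // add alpha_i * F_i
                approx[j] += alpha * Fi[j];
        }
        return approx;
    }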
2.3 3D Face synthesis
To synthesise a specific 3D face, effort mainly concentrates on the extraction of the 3D data and texture of the real face.

2.3.1 Using a 3D laser scanner
Synthesising a 3D face using a Cyberware 3D laser scanner24 has been the most straightforward way, exploited by several researchers9,25,26. The 3D scanner measures an object by moving a scanning apparatus in a circular path around the object. The scanner contains a laser range finder and a colour camera, respectively generating a range map and a texture map, both of which are indexed by cylindrical coordinates, with the horizontal coordinate representing the azimuth angle around the head and the vertical coordinate representing vertical distance. By integrating the range data and texture data, the scanner creates a 3D face with a very dense vertex topology (over one million vertices). To reduce the computational complexity, an alternative with fewer primitives is required. This can be implemented by using the existing parameterised face models discussed in the previous subsection. In addition, the outcome of 3D scanning always comes with incomplete data, because the laser beam may be occluded or dispersed in areas like the hair and the neck. Most importantly, the high price limits the use of this special apparatus.
2.3.2 Using stereo images
To reduce the research cost, some researchers have employed stereo images to synthesise a 3D face27,28. Here the input images consist of a few calibrated stereo photos, and a disparity map between the photos is used to find the depth of the facial surface. However, conventional stereo algorithms often fail to accurately reconstruct the 3D face, because the image data do not provide enough information about the geometry of the face. Lengagne et al29 proposed a way to incorporate a priori information in a reconstruction process from a sequence of calibrated face images. But the computation here is rather heavy, due to tedious camera calibration30.

2.3.3 Using multi-view images
Compared with those using a clumsy and expensive 3D scanning apparatus or computationally heavy stereo images, most researchers prefer using multi-view images in 3D face synthesis, where only a PC with a video camera is necessary. In this case, several photos are taken from different views beforehand by rotating the head gradually in front of the camera, ensuring that the facial texture and depth can mostly be acquired31. Zhang et al32 used two arbitrary frontal view images to create the 3D face, but the information from only frontal view images is insufficient to recover the surface depth. Generally speaking, there is a trade-off between the computational complexity and the realism of the synthesis. In order to maintain the degree of realism with minimum computational complexity, methods using an orthogonal photo pair, consisting of a frontal view image and a face profile, from which the surface depth is visible, have been popular33,34. One of the challenges in this method is photogrammetric measurement, i.e. how to estimate the 3D coordinates (x, y, z) of facial features from two corresponding sets of 2D coordinates (x, y_f), (y_p, z) from the orthogonal pair. For generating a texture map, the most common way is to blend the frontal and profile images to form a cylindrical texture map35. Furthermore, if the image pair is not precisely orthogonal, and the facial features and luminance are incoherent between the image pair, a preprocessing step, including profile feature scaling, rotation to match the frontal image, and colour compensation between the two orthogonal images, is needed ahead of photogrammetric measurement and texture generation36. However, self-occlusion is inevitable in areas such as the neck when using two orthogonal images. To construct the 3D face more accurately, a face synthesis approach using nine views was introduced37, but the computational complexity increases as well.
2.3.4 Using a single view image
All the above methods have hardly been flawless, especially when faced with the demand for an easy-to-manipulate system in real-time applications like 3D model-based video coding for video conferencing, where a specific 3D face must be promptly rendered. 3D face synthesis from a single image has been able to achieve such a simple, quick system. An early attempt came from the understanding of linear transformations38. Here Vetter et al predefined a linear object class with a large number of different face photos. Then linear transforms were learned from the linear object class. Given only a single image of an object, i.e. a face, the system could synthesise 2D faces in different orientations using the linear transforms. But their algorithm skipped over modeling a 3D face, and cannot be used for 3D face synthesis applications. However, authentic 3D face synthesis from a single image has recently been reported by some research groups. Some algorithms are only capable of coping with precise front-view inputs39,40, which limits the manipulable range, while other algorithms41,42 require user interaction to point out facial features, which cannot satisfy the demand for automation. Although an automatic facial feature extraction method was embedded into the system16, only three feature points are not enough to back up pose estimation without a priori information. Due to the lack of depth, the estimation of 3D information from a 2D image still remains challenging and awaits a major breakthrough.
3. FACE ANALYSIS
Face analysis is an important, application-oriented task. In the case of face recognition and face modeling for animation, the following topics are of great importance: face detection, facial feature extraction, and facial feature tracking.
3.1 Face detection
Face detection using the AdaBoost classifier was introduced by Viola and Jones in their seminal paper in December 200143 and extended to detect multi-pose faces by Xiao, Li and Zhang44. Their really novel approach has shown how local contrast features found in specific positions of the object can be combined to create a strong face detector. AdaBoost has been known since the late eighties as a multi-classifier and a training procedure that takes a collection of weak classifiers, e.g. having a success rate of about 0.5, and boosts them by a suitable voting process to a very high level of performance.
Viola and Jones applied an early and suboptimal heuristic of Schapire and Freund for the AdaBoost training algorithm45. In general, the AdaBoost algorithm selects a set of classifiers from a family of weak classifiers {C_ω}, indexed by a compound parameter ω. A concerted action of these classifiers produces a strong classifier. For each object o, the classifier C_ω elaborates a decision G_ω(o) ∈ {−1, +1} on the membership of the object o in one of two classes labeled by −1 and +1. If G_ω(o) = 1, then the cost of this decision is J_ω(o) = α_ω, otherwise J_ω(o) = β_ω. The cost of the decision is a real number and it can be negative. In face detection, weak classifiers are defined using the concept of region contrast, i.e. the difference between the sums of pixel intensities in sub-regions of a distinguished image region; examples are presented in Figure 3. To obtain a weak classifier, the AdaBoost algorithm must be provided with a method of assigning to the region contrast a threshold that achieves the minimum of the average classification error on the training sequence of negative and positive image examples o_1, …, o_L.
Figure 3. Types of filters used in frontal face detection.
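Region-contrast features of this kind can be evaluated in constant time per rectangle with an integral image (summed-area table). The following sketch is our illustration of that standard technique, not the authors' code; the two-rectangle feature shown corresponds to the horizontal contrast filters of Figure 3.

    #include <vector>

    struct Integral {
        int W, H;
        std::vector<long long> s;        // (W+1) x (H+1) summed-area table
        Integral(const std::vector<unsigned char>& img, int w, int h)
            : W(w), H(h), s((w + 1) * (h + 1), 0) {
            for (int y = 0; y < h; ++y)
                for (int x = 0; x < w; ++x)
                    s[(y + 1) * (W + 1) + (x + 1)] =
                        img[y * w + x]
                        + s[y * (W + 1) + (x + 1)]
                        + s[(y + 1) * (W + 1) + x]
                        - s[y * (W + 1) + x];
        }
        // Sum of pixel intensities in the rectangle with top-left (x, y), size w x h.
        long long rectSum(int x, int y, int w, int h) const {
            return s[(y + h) * (W + 1) + (x + w)] - s[y * (W + 1) + (x + w)]
                 - s[(y + h) * (W + 1) + x]       + s[y * (W + 1) + x];
        }
    };

    // Horizontal two-rectangle contrast: left half minus right half of a region.
    long long twoRectContrast(const Integral& ii, int x, int y, int w, int h) {
        return ii.rectSum(x, y, w / 2, h) - ii.rectSum(x + w / 2, y, w / 2, h);
    }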
In order to get the costs of the weak decisions, AdaBoost calculates the weighted classification error ε:

ε = Σ_{i=1}^{L} w_i |G_ω(o_i) − y(o_i)|
where y(o_i) = 1 if the image o_i is a positive example and y(o_i) = −1 otherwise. Modifying the weights of the training set examples in order to find the next weak classifier (round t+1) that will become part of the strong classifier is the crucial step of the whole algorithm. The AdaBoost algorithm uses a very effective heuristic to minimize an upper bound of ε, namely w_{i,t+1} ← w_{i,t}·exp[−J_t(o_i)·y_i]. Although employing such a training scheme allows an ideal classification rate to be obtained on the training set, the output strong classifier has too many weak components to detect faces in real time. Moreover, it is impossible to choose a negative part of the training set that teaches the classifier to distinguish faces from the various and complex types of background encountered during the detection process. It means that this approach is insufficient when it comes to getting reasonably small false acceptance and false rejection rates, which should be at most 0.01% and 5% respectively. The AdaBoost cascade algorithm (Figure 4) can overcome these difficulties by applying multistage classification with strong classifiers. It should be noted that requirements of 50% false acceptance and 0.1% false rejection at every single stage allow satisfactory performance of the whole classifier to be reached within several stages. To fulfill that condition, the subsequent stages of the cascade have to be trained on negative examples generated with the help of the partial cascade learned so far.
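A minimal sketch of one boosting round may make the weight-update rule above concrete. The snippet below is an illustration, not the authors' implementation: it assumes the region-contrast values have already been extracted into a feature matrix, searches thresholded weak classifiers exhaustively, and uses the standard discrete-AdaBoost ±α costs rather than the general (α_ω, β_ω) pair described above.

```python
import numpy as np

def adaboost_round(features, labels, weights):
    """One boosting round: pick the best threshold on one contrast feature.

    features: (L, K) matrix of region-contrast values for L examples,
    labels:   (L,) array with values in {-1, +1},
    weights:  (L,) current example weights (summing to 1).
    Returns (feature index, threshold, polarity, alpha, updated weights).
    """
    L, K = features.shape
    best = (None, None, 1, np.inf)           # (k, theta, polarity, error)
    for k in range(K):
        for theta in np.unique(features[:, k]):
            for polarity in (+1, -1):
                pred = np.where(polarity * (features[:, k] - theta) > 0, 1, -1)
                err = np.sum(weights[pred != labels])
                if err < best[3]:
                    best = (k, theta, polarity, err)
    k, theta, polarity, err = best
    err = np.clip(err, 1e-12, 1 - 1e-12)
    alpha = 0.5 * np.log((1 - err) / err)    # confidence of the weak decision
    pred = np.where(polarity * (features[:, k] - theta) > 0, 1, -1)
    # weight update w_{i,t+1} <- w_{i,t} * exp(-J_t(o_i) * y_i), with J_t = alpha * pred
    weights = weights * np.exp(-alpha * pred * labels)
    return k, theta, polarity, alpha, weights / weights.sum()

# toy usage: 6 examples, 3 contrast features
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = np.array([1, 1, 1, -1, -1, -1])
w = np.full(6, 1 / 6)
print(adaboost_round(X, y, w))
```

In a cascade, this round would be repeated per stage until the stage meets the 50% / 0.1% targets mentioned above, with negatives regenerated by the partial cascade.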
Figure 4. Adaboost cascade diagram.
3.2
Eye detection
The AdaBoost cascade algorithm can be applied to eye pair detection and gives satisfactory results (compare Figure 5), but it has some limitations. The more the eyes are skewed, the less accurate detection becomes, whereas detecting the eyes separately requires photos of excellent resolution, a condition that often cannot be met in practice. In one work47 the author developed a specialised algorithm devoted to single eye localisation. The proposed procedure consists of the following stages:
– finding candidate points for eye centers in the face oval using colour cues;
– filtering of candidate points using an LDA classifier;
– finding a "good" pair of candidate points for eye centres.
The first step is a preprocessing technique that makes use of the fact that in both the pupil and the cornea of the eye the level of the red colour component should be rather low while the blue component level is high.
In addition, to deal with the situation when the eyes are shut, another mapping of the input image is introduced that can detect high-contrast regions.
Figure 5. Multiple face and eye pair detection by AdaBoost algorithm.
Figure 6. a) r-b map; b) contrast map of the facial image; c) sample eye detection results.
By consecutively applying the mapping of the original image to the r-b space (Figure 6a) and to the contrast space (Figure 6b), thresholding and intersecting the resulting maps, and then treating them with erosion and dilation operators to remove local holes, a substantially reduced list of eye candidates is obtained. A feature vector is extracted from a circle of radius R centered at every candidate eye center and is fed to the input of an LDA classifier. The classifier is trained on positive examples obtained by assuming the eye centre to lie within a close neighbourhood of the eye center marked manually. The negative examples are created by placing the eye center at any other location in the facial image. To benefit from the discriminatory abilities of the LDA transformation, negative examples are split into multiple classes using a Vector Quantisation algorithm, whereas positives are divided into various classes using information about face orientation, eyelid position and eye socket orientation. In the next step of the algorithm the eye centre candidates are assigned a weight defined as the distance to the nearest class if it is a positive one, and otherwise a "practical" infinity, e.g. 10^5. Finally, local minima of such a weight map are found and sorted using additional heuristic conditions. Exemplary results are presented in Figure 6 c).
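The colour-cue preprocessing can be sketched as follows. This is a rough illustration under stated assumptions: the map is computed as blue minus red so that eye regions score high, the thresholds, window size and morphology are arbitrary illustrative choices, and the LDA filtering stage is omitted.

```python
import numpy as np
from scipy import ndimage

def eye_candidates(rgb, rb_thresh=20, contrast_thresh=25, win=7):
    """Rough eye-centre candidates from an RGB face crop (uint8 HxWx3 array)."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    rb_map = b - r                                   # low red / high blue around pupil and cornea
    gray = 0.299 * r + 0.587 * g + 0.114 * b
    local_max = ndimage.maximum_filter(gray, size=win)
    local_min = ndimage.minimum_filter(gray, size=win)
    contrast_map = local_max - local_min             # high-contrast map for the closed-eye case
    mask = (rb_map > rb_thresh) & (contrast_map > contrast_thresh)
    # erosion followed by dilation removes local holes and speckles
    mask = ndimage.binary_dilation(ndimage.binary_erosion(mask))
    labels, n = ndimage.label(mask)
    centres = ndimage.center_of_mass(mask, labels, list(range(1, n + 1)))
    return centres                                   # list of (row, col) candidate centres
```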
3.3
Facial feature extraction
To form a facial feature set, general approaches can be adapted that find candidate features in any image for the sake of correspondence and tracking. These features may be those pixels of an image that have nontrivial gradients along two independent directions, measured in a specified window of analysis. The quality of a corner point with coordinates x = [x, y]^T as a candidate feature can be measured by the Harris criterion48:
C(x) = det(G) − k·trace²(G) = σ1·σ2 − k·(σ1 + σ2)²

G = [ Σ Ix²    Σ Ix·Iy ]
    [ Σ Ix·Iy  Σ Iy²   ]
where Ix, Iy are the respective image derivatives, σ1, σ2 are the eigenvalues of the matrix G, and k is a parameter. A given pixel becomes a feature point if the value of this criterion exceeds some threshold. Knowledge of typical face parts like eyes, nose, mouth, chin etc. allows choosing different local thresholds in image areas corresponding to those parts, as well as selecting a different number of features for each of them. Some refinements of the algorithm may also be applied, for instance accepting only a certain number of features with the greatest criterion value in each of the distinguished face regions.
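A compact sketch of the criterion and of the per-region feature selection is given below; the Sobel derivatives, Gaussian window and the values of k and n_best are conventional illustrative choices, not prescriptions from the survey.

```python
import numpy as np
from scipy import ndimage

def harris_response(image, k=0.04, sigma=1.5):
    """Harris corner criterion C(x) = det(G) - k * trace(G)^2 per pixel."""
    Ix = ndimage.sobel(image, axis=1, mode='reflect')
    Iy = ndimage.sobel(image, axis=0, mode='reflect')
    # entries of G accumulated over the analysis window (Gaussian weighting)
    Sxx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Syy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Sxy = ndimage.gaussian_filter(Ix * Iy, sigma)
    det_G = Sxx * Syy - Sxy ** 2
    trace_G = Sxx + Syy
    return det_G - k * trace_G ** 2

def select_features(image, region_mask, n_best=20):
    """Keep the n_best strongest corners inside one distinguished face region."""
    C = harris_response(image.astype(float))
    C = np.where(region_mask, C, -np.inf)
    idx = np.argsort(C, axis=None)[::-1][:n_best]
    return np.column_stack(np.unravel_index(idx, C.shape))   # (row, col) pairs
```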
4.
CONCLUSIONS
This paper conveys the current state-of-the-art technology, ranging from face synthesis to face analysis. The paper briefly surveys the current research status in face synthesis, and introduces an improved method using the AdaBoost classifier for face detection and facial feature extraction.
REFERENCES
1. VIRTUE, EU IST Project IST-1999-10044, http://www.virtue.eu.com
2. Microsoft NetMeeting, http://www.microsoft.com/windows/netmeeting/
3. Access Grid, http://www.accessgrid.org
4. F. Parke and K. Waters, "Computer facial animation", A K Peters, Ltd, 1996.
5. ISO/IEC IS 14496-2: MPEG-4 Visual, 1999.
6. PlayMail, http://playmail.research.att.com
7. EPTAMEDIA, www.eptamedia.com
8. G. C. Feng and P. C. Yuen, "Recognition of head-&-shoulder face image using virtual frontal-view image", IEEE Trans. on Systems, Man, and Cybernetics-Part A, Vol. 30, No. 6, pp. 871-883, 2000.
9. S. Valente and J. Dugelay, "A visual analysis/synthesis feedback loop for accurate face tracking", Signal Processing: Image Communication, Vol. 16, pp. 585-608, 2001.
10. K. Aizawa and T. S. Huang, "Model-based image coding: advanced video coding techniques for very low bit-rate applications", Proceedings of the IEEE, Vol. 83, No. 2, pp. 259-271, February 1995.
11. D. Pearson, "Developments in model-based video coding", Proceedings of the IEEE, Vol. 83, No. 6, pp. 892-906, June 1995.
12. F. Parke, "A parametric model for human faces", Tech. Report UTEC-CSc-75-047, University of Utah, 1974.
13. K. Waters, "A muscle model for animating three-dimensional facial expression", Computer Graphics, Vol. 21, No. 4, pp. 17-24, 1987.
14. D. Terzopoulos and K. Waters, "Physically-based facial modeling, analysis, and animation", The Journal of Visualization and Computer Animation, Vol. 1, pp. 73-80, 1990.
15. Y. Zhang, E. Sung and E. C. Prakash, "A physically-based model for real-time facial expression animation", Proc. of 3rd Conf. on 3D Digital Imaging and Modeling, pp. 399-406, 2001.
16. G. C. Feng, P. C. Yuen and J. H. Lai, "Virtual view face image synthesis using 3D spring-based face model from a single image", 4th Int. Conf. on Automatic Face and Gesture Recognition, pp. 530-535, 2000.
17. CANDIDE, http://www.icg.isy.liu.se/candide
18. M. Rydfalk, "CANDIDE, a parameterised face", Report No. LiTH-ISY-I-866, University of Linkoping, Sweden, 1987.
19. B. Welsh, "Model-based coding of images", PhD Dissertation, British Telecom Research Lab, Jan. 1991.
20. J. Ahlberg, "CANDIDE-3: an updated parameterised face", Report No. LiTH-ISY-R-2326, Linkoping University, Sweden, January 2001.
21. I. T. Jolliffe, "Principal Component Analysis, Second Edition", Springer, 2002.
22. M. Turk and A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, 3(1): 71-86, 1991.
23. V. Blanz, S. Romdhani and T. Vetter, "Face Identification across Different Poses and Illumination with a 3D Morphable Model", Proc. FG'02, pp. 202-207, 2002.
24. CYBERWARE Home Page, http://www.cyberware.com
25. P. Eisert and B. Girod, "Analyzing facial expression for virtual conferencing", IEEE Trans. on Computer Graphics and Application, Vol. 18, No. 5, pp. 70-78, 1998.
26. D. Terzopoulos and K. Waters, "Analysis and synthesis of facial image sequences using physical and anatomical models", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 15, No. 6, pp. 569-579, 1993.
27. R. Lengagne, J. Tarel and O. Monga, "From 2D images to 3D face geometry", Proc. of 2nd Int. Conf. on Automatic Face and Gesture Recognition, pp. 301-306, Oct 1996.
28. G. Galicia and A. Zakhor, "Depth based recovery of human facial features from video sequences", Proc. of IEEE Int. Conf. on Image Processing, Vol. 2, pp. 603-606, 1995.
29. R. Lengagne, P. Fua, O. Monga, "3D stereo reconstruction of human faces driven by differential constraints", Image and Vision Computing, Vol. 18, pp. 337-343, 2000.
30. Z. Zhang, "A flexible new technique for camera calibration", IEEE Trans. on PAMI, Vol. 22, No. 11, pp. 1330-1334, 2000.
31. C. Cheng and S. Lai, "An integrated approach to 3D face model reconstruction from video", IEEE ICCV Workshop on Recognition, Analysis and Tracking of Face and Gestures in Real-Time Systems, Vancouver, Canada, pp. 16-22, 2001.
32. Z. Zhang, Z. Liu, D. Adler, M. F. Cohen, R. Hanson and Y. Shan, "Robust and rapid generation of animated faces from video images: a model-based modeling approach", Microsoft Research, Technical Report MSR-TR-2001-101, 2001.
33. T. Goto, W. Lee, N. Magnenat-Thalmann, "Facial feature extraction for quick 3D face modeling", Signal Processing: Image Communication 17 (2002) 243-259.
34. W. Gao, Y. Chen, R. Wang, S. Shan and D. Jiang, "Learning and synthesizing MPEG-4 compatible 3D face animation from video sequence", IEEE Trans. on Circuits and Systems for Video Technology, Vol. 13, No. 11, pp. 1119-1128, 2003.
35. W. Lee and N. Magnenat-Thalmann, "Fast head modeling for animation", Image and Vision Computing 18 (2000) 355-364.
36. I. Park, H. Zhang, V. Vezhnevets and H. Choh, "Image-based photorealistic 3D face modeling", Proc. of the 6th International Conference on Automatic Face and Gesture Recognition, 2004.
37. S. Morishima, "Face analysis and synthesis", IEEE Signal Processing Magazine, pp. 26-34, May 2001.
38. T. Vetter and T. Poggio, "Linear object classes and image synthesis from a single example image", IEEE Trans. on PAMI, Vol. 19, No. 7, pp. 733-742, 1997.
39. Y. Hu, D. Jiang, S. Yan, L. Zhang, H. Zhang, "Automatic 3D reconstruction for face recognition", Proc. of the 6th IEEE Int. Conf. on Automatic Face and Gesture Recognition, 2004.
40. C. Kuo, R. Huang and T. Lin, "3D facial model estimation from single front-view facial image", IEEE Trans. on CSVT, Vol. 12, No. 3, 2002.
41. A. Valle, J. Ostermann, "3D talking head customization by adapting a generic model to one uncalibrated picture", Proc. of IEEE Int. Symposium on Circuits and Systems, Vol. 2, pp. 325-328, 2001.
42. S. Ho and H. Huang, "Facial modeling from an uncalibrated face image using a coarse-to-fine genetic algorithm", Pattern Recognition 34 (2001) 1015-1031.
43. P. Viola, M. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features", Computer Vision and Pattern Recognition, December 2001.
44. R. Xiao, M.-J. Li, H.-J. Zhang, "Robust Multipose Face Detection in Images", IEEE Trans. on Circuits and Systems for Video Technology, January 2004.
45. Y. Freund, R. E. Schapire, "A decision theoretic generalization of on-line learning and an application to boosting", Journal of Computer and System Sciences, 55(1):119-139, August 1997.
46. W. Skarbek, K. Kucharski, "Tutorial on Face and Eye Detection by AdaBoost Method", special VISNET session at Polish National Conference on Radiocommunications and Broadcasting KKRRiT, 16-18 June, 2004.
47. A. Pietrowcew, "Face detection and face recognition in digital images", PhD dissertation (in Polish), Warsaw, 2004.
48. Y. Ma, S. Soatto, J. Kosecka, S. Shankar Sastry, "An Invitation to 3-D Vision. From images to geometric models", Springer-Verlag, New York, 2004.
3D DATA PROCESSING FOR 3D FACE MODELLING
Krystian Ignasiak1, Marcin Morgoś1, Władysław Skarbek1, Michał Tomaszewski1,2
1 Institute of Radioelectronics
Warsaw University of Technology
{K.Ignasiak, M.Morgos, W.Skarbek, M.Tomaszewski}@ire.pw.edu.pl 2 Polish-Japanese Institute of Information Technology
Abstract
In this research we describe 3D data processing operations which were applied to 3D laser scans of human heads before 3D eigenfaces could be obtained both in shape space and color space. Modified median filtering was used to remove laser scan spikes. 3D scans registration exploits mapping of 3D mesh to the reference 3D mesh. The mesh is based on Facial Action Points (FAP) of MPEG-4 and the mapping itself on barycentric local coordinates in each triangle. 3D eigenfaces are computed by Singular Value Approximation (SVA) applied directly to 3D scans avoiding computation of prohibitively large covariance matrix.
Keywords:
median filtering, 3D scans registration, SVA for PCA, 3D eigenfaces, 3D face modelling
1.
INTRODUCTION
Recently, 3D face modelling has experienced widespread interest in the multimedia research community. The reasons are manifold. We point to a few of them: Face animation on personal computers can now be performed in real time even if the mesh model is very detailed. This is due to the fact that sophisticated graphics cards with built-in accelerators of OpenGL and its extensions have become affordable because of the high demand for real-time 3D interactive games. Face recognition using 3D face shape models is now possible as efficient algorithms have emerged within many competing projects. This is due to the current political demand for more reliable and cheap biometric recognition systems.
Forecasts are made for a high demand (in the near future) for distance learning systems based on fully immersive virtual classrooms. In such environments the video-conference paradigm is replaced by the concept of avatars similar to a real person in terms of visual, aural and interactive behavior. This approach imposes much lower bit-rate requirements on the communication system supporting e-learning.
Figure 1. Output data from 3D laser scanner: (a) color texture image, (b) reconstructed 3D cloud of points with color information.
One of the very promising approaches is the 3D eigenfaces approach combined with identification of PCA coefficients from 2D views produced by cheap video cameras1. In this paper we describe 3D data processing operations which were applied to 3D laser scans of human heads (cf. fig. 1) before 3D eigenfaces could be obtained both in shape and color spaces.
2.
3D LASER SCAN REGISTRATION
Our source data is produced by 3D laser scanner of Cyberware machinery. It measures in cylindrical coordinates (h, ϕ) the depth d of facial points and their color components (r, g, b) with intrinsic resolutions hres and ϕres . The height range ΔH is also an intrinsic parameter of the scanner and the angle range ΔΦ = 2π. The 3D data is represented by four 2D arrays: d(ih , iϕ ), r(ih , iϕ ), g(ih , iϕ ), b(ih , iϕ ), ih = 1, . . . , hres , iϕ = 1, . . . , ϕres . Therefore, setting the 3D coordinates origin at the bottom of the scan cylinder and in its center with y axis aligned with cylinder axis we have simple formulas for 3D coordinates for cloud of points represented by another three arrays x, y, z : y(ih , iϕ ) = h(ih , iϕ ), x(ih , iϕ ) = d(ih , iϕ ) sin(iϕ Δϕ), z(ih , iϕ ) = d(ih , iϕ ) cos(iϕ Δϕ), Δϕ = 2π/ϕres .
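These formulas translate directly into a small conversion routine; the sketch below assumes the scanner output has already been loaded into 2D arrays as described.

```python
import numpy as np

def scan_to_cloud(d, h):
    """Convert cylindrical scanner arrays to a 3D point cloud.

    d: (h_res, phi_res) depth array d(i_h, i_phi),
    h: (h_res, phi_res) height array h(i_h, i_phi) in the same indexing.
    Returns x, y, z following the formulas above (y axis aligned with the
    cylinder axis, origin at the bottom centre of the scan cylinder).
    """
    h_res, phi_res = d.shape
    dphi = 2.0 * np.pi / phi_res
    i_phi = np.arange(1, phi_res + 1)        # i_phi = 1 .. phi_res
    y = h
    x = d * np.sin(i_phi * dphi)             # broadcasts along the phi axis
    z = d * np.cos(i_phi * dphi)
    return x, y, z
```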
2.1
Spikes removal
Spikes are large depth values returned by the scanner for points of poor reflectance. They rarely occur on facial skin, which is what matters for our modelling, and therefore median filtering in small windows of the depth data component seems to be a good candidate method for their removal (cf. fig. 2).
Figure 2. Spikes removal with median filter.
We found that the so-called modified median filter had the best performance. This filter automatically increases its window size if the number of spikes is too high in the neighborhood of the given point.
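A possible reading of this filter is sketched below; the spike test (a simple threshold on depth) and the window growth policy are illustrative assumptions, not the exact rules used by the authors.

```python
import numpy as np

def modified_median(depth, spike_thresh, win=3, max_win=9):
    """Spike removal by a modified median filter on the depth component."""
    out = depth.copy()
    H, W = depth.shape
    rows, cols = np.nonzero(depth > spike_thresh)        # spike positions
    for r, c in zip(rows, cols):
        w = win
        while w <= max_win:
            half = w // 2
            patch = depth[max(0, r - half):r + half + 1,
                          max(0, c - half):c + half + 1]
            valid = patch[patch <= spike_thresh]          # spike-free neighbours
            if valid.size >= patch.size // 2 or w == max_win:
                if valid.size:
                    out[r, c] = np.median(valid)
                break
            w += 2    # too many spikes in the neighbourhood: enlarge the window
    return out
```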
2.2
Reference FAP mesh definition
3D scans registration exploits mapping of 3D mesh to the reference 3D mesh. The mesh is based on Facial Action Points (FAP) of MPEG-4 3 . It is manually designed on 2D color component using a simple triangulation rule: ensure the minimal number of edges inside triangles (cf. fig. 3). For each scan FAP points are manually registered using both 2D color scan image and 3D face visualization based on dense triangular mesh naturally obtained from 2D indexing of scans (cf. fig. 3).
2.3
Alignment of inertia features
Having a mesh Ms found for the given scan s, we consider its shape as a mask to filter out scan points outside of the mesh. Let Cs denote the set of all
Figure 3. FAP mesh and its overlay on a color face scan component (a) and on a 3D face view (b).
3D points of scan s which are in the mesh, i.e. with (ih, iϕ) indices included in the mesh Ms. Then, considering the cloud Cs as a physical rigid body, we can find its center of inertia and axis of inertia and next align them with the center of inertia and axis of inertia of the median mesh body C0 = medians(Cs) obtained for the median mesh M0 = medians(Ms). We consider the mesh M0 as the reference mesh. This step partially compensates the existing lack of alignment between the axis of the scan cylinder and the vertical axis of inertia of a human head.
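The alignment step can be illustrated with the following sketch, which estimates the inertia centre and axes of a masked point cloud and rigidly maps them onto those of the reference; sign and ordering ambiguities of the eigenvectors, which a practical implementation must resolve, are ignored here.

```python
import numpy as np

def inertia_frame(points):
    """Centre of inertia and axes of inertia of an (N, 3) point cloud."""
    centre = points.mean(axis=0)
    cov = np.cov((points - centre).T)        # 3x3 second-moment matrix
    _, axes = np.linalg.eigh(cov)            # columns = inertia axes
    return centre, axes

def align_to_reference(points, ref_points):
    """Rigidly move 'points' so its inertia centre/axes match the reference."""
    c, A = inertia_frame(points)
    c0, A0 = inertia_frame(ref_points)
    R = A0 @ A.T                             # rotation taking axes A onto A0
    return (points - c) @ R.T + c0
```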
2.4
Mapping to reference median mesh
After alignment of the inertia center and axis, the last step of data registration is the mapping of the scan mesh to the reference mesh. To this end we use barycentric local coordinates in each triangle. A simple rule is applied: for any point Q in the reference mesh, get its component (depth or color) by an interpolation of the component in the scan to be registered at the point P with the same barycentric coordinates in the corresponding triangle (with the same FAP point indices) of the mesh. In fig. 4, f^(c) denotes any component (depth or color) of the scan to be registered, while f_reg^(c) is the result of the registration.
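The registration rule can be sketched as below; f_scan stands for whatever interpolation of the scan component is used (e.g. a bilinear lookup) and is an assumption of this illustration.

```python
import numpy as np

def barycentric(p, a, b, c):
    """Barycentric coordinates (l_a, l_b, l_c) of 2D point p in triangle abc."""
    T = np.array([[a[0] - c[0], b[0] - c[0]],
                  [a[1] - c[1], b[1] - c[1]]])
    l = np.linalg.solve(T, np.asarray(p, float) - np.asarray(c, float))
    return np.array([l[0], l[1], 1.0 - l[0] - l[1]])

def transfer_component(Q, tri_ref, tri_scan, f_scan):
    """Sample the scan component at the point P that has the same barycentric
    coordinates in the corresponding scan triangle (same FAP indices).

    tri_ref / tri_scan: 3x2 arrays of triangle vertices,
    f_scan: callable (x, y) -> component value of the scan to be registered.
    """
    lam = barycentric(Q, *tri_ref)
    P = lam @ np.asarray(tri_scan, float)    # P = l_i*P_i + l_j*P_j + l_k*P_k
    return f_scan(P[0], P[1])                # f_reg(Q) ~ f(P)
```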
3.
PCA MODELLING OF FACE SHAPE AND COLOR
It appears that the collection of facial mesh clouds, obtained for a big group of L persons (L > 50) and considered as a set of points in a high N-dimensional space (here N > 4 · 10^4), can be approximated by a hyperplane (subspace) of relatively low dimension M (typically M ≈ 50). One of the possible linear algebraic bases spanning this subspace is obtained by Principal
Q = λi·Qi + λj·Qj + λk·Qk,    P = λi·Pi + λj·Pj + λk·Pk,    f_reg^(c)(Q) ≈ f^(c)(P)
Figure 4. The idea of registration of mesh content by barycentric coordinates (corresponding vertices of two triangles have the same FAP indices).
Component Analysis (PCA – cf. 2) as the eigenvectors of the covariance matrix for the given training set of vectors. In the case of 2D facial images the PCA eigenvectors are called eigenfaces4 and, by analogy, for a 3D facial cloud of points the PCA eigenvectors are called 3D eigenfaces1 (cf. fig. 5). Having M 3D eigenfaces F1, ..., FM and the average mesh content F0, any 3D mesh content F can be approximated by a linear combination of 3D eigenfaces, where the coefficients αi are appropriate dot products:

F ≈ F0 + Σ_{i=1..M} αi·Fi,    αi = (F − F0)^t Fi
3D eigenfaces were built in our research by Singular Value Approximation (SVA), which allows us to avoid building prohibitively large covariance matrices.
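The idea can be illustrated with a short sketch based on the singular value decomposition of the centred data matrix, which is one way to realize such an approximation without ever forming the N x N covariance matrix; it is not claimed to be the authors' exact SVA procedure.

```python
import numpy as np

def eigenfaces_svd(F, M=50):
    """3D eigenfaces via SVD of the centred data matrix.

    F: (L, N) matrix, one registered mesh content (depth and/or colour,
    flattened) per row; L persons, N-dimensional mesh space.
    """
    F0 = F.mean(axis=0)                          # average mesh content
    A = F - F0
    # economy SVD: factors are at most L x N, never N x N
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    basis = Vt[:M]                               # rows = 3D eigenfaces F_1..F_M
    return F0, basis

def project(F0, basis, f):
    """Coefficients alpha_i = (f - F0)^t F_i and the reconstruction of f."""
    alpha = basis @ (f - F0)
    return alpha, F0 + basis.T @ alpha
```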
4.
CONCLUSIONS
Operations on 3D scan data were described which are necessary to register data from different human head scans. While spike removal by modified median filtering removes inherent scanner faults, the inertia feature alignment and the subsequent mapping preserving local barycentric coordinates in the FAP mesh are novel steps in scan data registration. The quality of the designed 3D eigenfaces confirms the validity of our approach.
Figure 5. 3D eigenfaces visualized using eigenvectors for depth and color components.
ACKNOWLEDGEMENT The work presented was developed within VISNET, a European Network of Excellence (http://www.visnet-noe.org), funded under the European Commission IST FP6 programme.
REFERENCES
1. Blanz V., Romdhani S., Vetter T.: Face Identification across Different Poses and Illumination with a 3D Morphable Model, Proc. FG'02, pp. 202-207, 2002.
2. Jolliffe I.T.: Principal Component Analysis, Second Edition, Springer, 2002.
3. Ostermann J.: Animation of synthetic faces in MPEG-4, Computer Animation, June:49-51, 1998.
4. Turk M., Pentland A.: Eigenfaces for Recognition, Journal of Cognitive Neuroscience, 3(1):71-86, 1991.
MODEL OF DEFORMABLE RINGS FOR AIDING THE WIRELESS CAPSULE ENDOSCOPY VIDEO INTERPRETATION AND REPORTING
Piotr M. Szczypiński1,2, Parupudi V. J. Sriram3, Ram D. Sriram1, D. Nageshwar Reddy3
1 National Institute of Standards and Technology, Gaithersburg, USA
2 Institute of Electronics, Technical University of Łódź, Łódź, Poland
3 Asian Institute of Gastroenterology, Hyderabad, India
Abstract: The wireless capsule endoscopy (WCE) imaging technique provides detailed images of the gastrointestinal (GI) tract, in particular the small intestine, not feasible with earlier techniques. We present a model of deformable rings (MDR) for aiding in the interpretation of WCE videos. The model flexibly matches consecutive video frames with regard to displacement of distinctive portions of the GI tract's tube-like surface. It creates a map, a 2D representation of the surface, and estimates the relative velocity of a capsule endoscope as it traverses the GI tract. The map can be glanced through for rapid identification of possible abnormal areas for further detailed endoscopic video investigation. This method significantly reduces the total time spent in interpretation of the WCE videos.
Key words: deformable models, wireless capsule endoscopy, WCE, motion analysis
1.
INTRODUCTION
The human small intestine, which measures approximately 6 meters, cannot be visualized using a traditional endoscopic approach. The WCE6,9 is a relatively new technique that facilitates the imaging of the small intestine. The WCE system consists of a pill-shaped capsule with built-in video camera, light-emitting diodes, video signal transmitter, and battery, as well as a video signal receiver-recorder device. The capsule is ingested and passes through the GI tract. Currently, the capsule transmits video images at a rate of two frames per second for approximately 8 hours. The transmitted images are received and recorded by the external receiver-recorder device.
The investigation of video recordings is performed by a trained clinician. It is a tedious task that takes a considerable amount of time, usually more than an hour per recording. The video interpretation involves viewing the video and searching for bleeding, erosions, ulcers, polyps and narrow sections of the bowel due to disease or any other abnormal-looking entities. The MDR aims at aiding the WCE video interpretation. It preprocesses the WCE video recording to produce a map of the internal surface of the digestive system. The MDR also computes a rough estimate of capsule velocity as it passes through the GI tract. The map serves as a quick reference to the video sequences, supports identification of segments of the bowel and can be glanced through for quick identification of large-scale abnormalities. The estimate of the capsule velocity provides data sufficient for localization of video sequences that show narrow sections of the GI tract, where the capsule stopped or considerably slowed down. The paper is organized as follows. Section 2 depicts the class of video data that is processed by the MDR. Section 3 presents the novel structure of the MDR, which includes a unique technique for variable neighborhood tension computation, provides a general explanation of the model implementation and behavior, and introduces the original conception of forming a digestive system map. Finally, Section 4 demonstrates selected examples of GI tract maps and concludes the paper.
2.
WCE VIDEO PROPERTIES
The wireless capsule endoscope used in this study produces color images of the internal lumen of the GI tract, covering a circular 140° field of view [9]. Since the shape of the capsule is elongated and the GI tract is akin to a collapsed tube, most of the time the wireless capsule endoscope aligns in a direction parallel to the GI tract, heading the camera lenses forward or backward. Folded walls of digestive tract or intense peristaltic movements may cause the capsule to change its pitch or yaw in relation to the axis of the GI tract. However, such changes do not last long and the capsule eventually repositions to stay parallel with the axis of the GI tract. Therefore, it was assumed that most of the video frames contain images of the GI tract walls, which converge in perspective at the point located near the center of an image (figure 1 a). As the capsule passes through, portions of the tract shift outward or toward the center of an image. In the current approach to WCE video processing, we follow movements2-4,8 of the digestive system walls by elastic matching of consecutive video frames. We also estimate the average speed of these movements toward and outward the center of the video frame and collect data on the texture of internal lumen of the digestive system.
Figure 1. Example of WCE video frames: (a) video frame showing a fragment of small intestine interior, (b) initial form of MDR superimposed on the video frame and (c) frame with MDR after completion of the matching process.
3.
MDR MATCHING AND MAP ASSEMBLING
The MDR comprises nodes, which are connected to form a mesh. The mesh is positioned in the plane of the video frame. It forms concentric rings surrounding the center of the frame (figure 1.b). Every node of the mesh is referred to by a pair of indexes p = 1, 2,...P and q = 1, 2,...Q. The initial location of a node within the image coordinate system is given by the following formula:
>x
p,q
y p,q
@
T
ª § 2Sq · § 2Sq ·º ¸¸» ¸¸ sin ¨¨ rw p 1 «cos¨¨ © Q ¹¼ ¬ © Q ¹
T
(1)
where r is a radius of inner ring and w is a ratio of radii of adjacent rings. MDR nodes store information on local image properties, such as RGB color components, that were found at their locations within a preceding video frame. The nodes search within the vicinity of the current frame for locations having similar properties. Each node is pushed toward such a location. On the other hand, the arrangement of nodes within the MDR mesh has to be retained. This requirement is satisfied by modeling tensions within the model structure. The elastic matching is a process of successive displacements of nodes intended for finding a state, in which balance between the two effects is obtained1. For a computation of the image influence vector, several locations within the image, in the vicinity of a node (circular area with radius v), are randomly chosen. Image properties at these locations are compared with properties stored within the node by means of an Euclidean distance d in the properties’ space. The image influence vector f is directed from the node
toward the location with the smallest d value and its length is equal to the value of some parameter ξ. For the computation of tension, let us define the n-neighborhood of node p, q. The n-neighborhood is the set of all MDR nodes connected with node p, q by n or fewer lines; it includes the node p, q itself. The averaged transformation of a neighborhood from its initial position given by (1) can be defined by a translation vector T_p,q and a matrix J_p,q of scaling, rotation and shear. J_p,q is a square 2x2 matrix, which can be viewed as an average local Jacobian of the MDR mesh transformation. Vector T_p,q and matrix J_p,q are computed in the MDR for minimum mean square error of node positions. The tension vector for node p, q is computed by means of the following formula:

g_p,q = ρ · ( J_p,q·[x0_p,q , y0_p,q]^T + T_p,q − [x_p,q , y_p,q]^T )    (2)
where [x_p,q , y_p,q]^T is the vector of the actual location of node p, q, [x0_p,q , y0_p,q]^T is its initial location given by (1), and ρ is a tension parameter. The displacement of a node is computed using the following formula:
[x_p,q , y_p,q]^(i+1)T = [x_p,q , y_p,q]^(i)T + g_p,q^(i) + f_p,q^(i)    (3)
where the index i refers to discrete time, i.e. the iteration number. The process of displacing nodes is repeated until some state of equilibrium is reached, i.e. the average displacement distance of MDR nodes drops below some selected threshold value, or until the iteration index i exceeds some arbitrarily chosen maximum. It was experimentally found that the process of matching is more efficient if, at the start, the neighborhood n and vicinity v parameters are relatively large and are then reduced gradually throughout the process. In this manner, at the beginning of the matching process the model quickly adjusts its position with regard to global image changes; then, after decreasing the n and v parameters, it deforms locally5,7, matching local GI tract deformations. The process of elastic matching is repeated for every frame (m) of the WCE video recording. It must be noted that as the capsule moves forward the MDR expands while matching consecutive video frames; if the capsule moves backward the model shrinks. The average size of the MDR in relation to its initial state (1) can be evaluated by the determinant of the matrix J computed for a neighborhood of all the model nodes. To prevent the model from excessive expansion or from collapsing, det J is computed after completion of every matching. When det J > w², the outer ring of the model is erased and a new inner ring is created. If det J < w⁻², then the inner ring is erased and a new outer one is added. In either case of ring swapping, RGB vectors sampled at the locations of nodes forming the
outer ring are arranged in a row of pixels. All such rows are collected during the video processing to form an image, or map, of the GI system surface (figure 2). At every swapping, the capsule pace is evaluated with the formula:

v = γ · (det J_k − ω·det J_(k−1)) / ( (m_k − m_(k−1)) · ω · √(det J_k · det J_(k−1)) ),    where ω = w if det J_(k−1) < w² and ω = w⁻¹ if det J_(k−1) > w²    (4)
where k is a swapping event index, m is a frame index and γ is a parameter.
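The whole per-frame procedure of Section 3 can be summarized in the following sketch. It is a simplification, not the C++ DirectShow filter described later: the image-influence search and the least-squares neighbourhood fit are abstracted behind caller-supplied functions, and only the displacement rule (3), the equilibrium test and the det J ring-swap rule are spelled out.

```python
import numpy as np

def match_frame(nodes, nodes0, image_force, fit_affine, rho=1.0,
                max_iter=15, eps=0.05):
    """One elastic-matching pass of the MDR (a sketch under stated assumptions).

    nodes, nodes0: (P*Q, 2) current and initial node positions (formula (1)),
    image_force:   callable(nodes) -> (P*Q, 2) vectors f toward similar pixels,
    fit_affine:    callable(nodes, nodes0, i) -> (J, T), least-squares fit of
                   the n-neighbourhood transformation used in formula (2).
    """
    for _ in range(max_iter):
        g = np.empty_like(nodes)
        for i in range(len(nodes)):
            J, T = fit_affine(nodes, nodes0, i)
            # tension (2): pull toward the neighbourhood's affine prediction
            g[i] = rho * (J @ nodes0[i] + T - nodes[i])
        f = image_force(nodes)
        step = g + f                          # displacement rule (3)
        nodes = nodes + step
        if np.linalg.norm(step, axis=1).mean() < eps:
            break                             # equilibrium reached
    return nodes

def maybe_swap_rings(J_all, w):
    """Ring-swap test on the average model size det(J) after a matching pass."""
    detJ = np.linalg.det(J_all)
    if detJ > w ** 2:
        return "erase outer ring, create inner ring"
    if detJ < w ** -2:
        return "erase inner ring, create outer ring"
    return "keep rings"
```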
Figure 2. Examples of map fragments produced by MDR with corresponding video frames: (a) area of bleeding and (b) froth content.
4.
RESULTS AND CONCLUSIONS
The MDR was implemented in C++ in DirectShow technology10 as a video-processing module (filter). It was experimentally established that acceptable results could be obtained with models having P = 7 rings, with Q = 128 nodes per ring. The coefficients of image influence ξ and tension ρ were set to 1. The matching process is performed in two phases. During the first phase (10 iterations) the transformation matrix J is computed for all the nodes (n > 64); it is assumed that the vector T = [0, 0]T and the search vicinity is of radius v = 20. During the second phase (5 iterations), the model deforms locally, matching image details with parameters n = 2 and v = 5. The average time of processing an 8-hour video recording is less than 30 minutes using a PC with an Intel Pentium 4, 1.8 GHz processor. The database used for MDR testing consisted of over 30 video recordings. Twenty of them were used for preliminary tests and tuning of model parameters. Ten other known recordings consisting of 5 normal and 5 abnormal cases were used for the assessment of MDR utility. It was found that certain characteristics that indicate areas of bleeding, ulceration and obscuring froth could be recognized within maps. Therefore, the maps can be glanced through for quick identification of such abnormal areas, which
172 noticeably reduces examination time. Also, by means of capsule velocity estimation graphs, areas of capsule retention can be detected. These indicate narrow sections of the gastrointestinal tract. The model deals with images, which are obtained by a smooth forward/backward movement of a capsule. When the capsule moves sideways or jumps, the produced map may be ambiguous. Therefore, our future work will be focused on the development of models capable of tracking various kinds of capsule movements.
REFERENCES
1. M. Kass, A. Witkin, D. Terzopoulos, Snakes: Active Contour Models, Int. J. of Computer Vision, vol. 1, no. 4, 1988, pp. 321-331
2. H. Delingette, Adaptive and deformable models based on simplex meshes, IEEE Workshop on Motion of Non-Rigid and Articulated Objects, 1994, pp. 152-157
3. Yao Wang, O. Lee, A. Vetro, Use of two-dimensional deformable mesh structures for video coding, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 6, Dec. 1996, pp. 636-659
4. P. van Beek, A. M. Tekalp, N. Zhuang, I. Celasun, Minghui Xia, Hierarchical 2-D mesh representation, tracking, and compression for object-based video, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, March 1999, pp. 353-369
5. P. Szczypiński, A. Materka, Variable-Flexibility Elastic Model for Digital Image Analysis, Bulletin of the Polish Academy of Sciences, Technical Sciences, Vol. 47, No. 3, 1999, pp. 263-269
6. G. Iddan, G. Meron, A. Glukhowsky, P. Swain, Wireless Capsule Endoscopy, Nature 2000, pp. 405-417
7. P. Szczypiński, A. Materka, Object Tracking and Recognition Using Deformable Grid with Geometrical Templates, ICSES 2000, pp. 169-174
8. W. Badaway, A structured versus unstructured 2D hierarchical mesh for video object motion tracking, Canadian Conference on Electrical and Computer Engineering, Vol. 2, May 2001, pp. 953-956
9. D. G. Adler, C. J. Gostout, Wireless Capsule Endoscopy, Hospital Physician, May 2003, pp. 16-22
10. Microsoft Corporation, DirectShow, http://msdn.microsoft.com/library
Disclaimer: No approval or endorsement of any commercial product by the National Institute of Standards and Technology is intended or implied. Certain commercial equipment, instruments, or materials are identified in this report in order to facilitate better understanding. Such identification does not imply recommendations or endorsement by the National Institute of Standards and Technology, nor does it imply the materials or equipment identified are necessarily the best available for the purpose.
SINGLE-CAMERA STEREOVISION SETUP with Orientable Optical Axes
Luc Duvieubourg1, Sebastien Ambellouis2, and Francois Cabestaing1
1 LAGIS, Laboratoire d'Automatique, Génie Informatique & Signal, CNRS-UMR 8146, Bâtiment P2, Université des Sciences et Technologies de Lille, 59655 Villeneuve d'Ascq CEDEX, France.
2 Institut National de REcherche sur les Transports et leur Sécurité (INRETS), 20 rue Elisée Reclus, 59650 Villeneuve d'Ascq, France
[email protected]
Abstract
The stereovision sensor described in this article has been developed during a research project called RaViOLi, for “Radar and Vision Orientable, Lidar”. The main outcome of this project is the improvement of driving safety thanks to the analysis of redundant data coming from several cooperative sensors installed on an autonomous vehicle. One is a high precision stereovision sensor whose field of view can be oriented toward a region of interest of the 3D scene, like the road in front of the vehicle at long distance. The sensor is composed of a single camera, of two lateral mirrors, and of a prism rotating about its edge. The mirrors project the left and right images of the stereo pair onto both halves of the imaging surface of the camera, yielding the equivalent of two virtual cameras with parallel axes. Rotating the prism changes the orientation of both optical axes while keeping them parallel.
Keywords: Single-camera stereovision; mirror-based stereovision; steerable stereovision setup.
1.
INTRODUCTION
Stereovision techniques aim at recovering depth information from two or more 2D images of a scene. The standard stereovision sensor, similar to the pair of eyes of most evolved animals, is composed of two cameras placed side by side. Although simple to comprehend, this type of setup is much more complex to build than it appears at first sight. To simplify the matching process involved in stereopsis analysis, one often assumes that the two imaging systems — i.e. cameras and lenses — are perfectly identical. When this condition is not satisfied, calibration procedures are
used to determine the actual geometrical and optical parameters of the stereovision setup (Faugeras and Toscani, 1988; Tsai, 1987). If motion analysis is involved, the two cameras must be electronically synchronized to allow for simultaneous grabbing of both images. To avoid these constraints, some researchers have described optical setups which include a single camera and several mirrors (Lee and Kweon, 2000). When such a setup is correctly assembled, it becomes equivalent to a pair of virtual cameras with strictly identical optical properties. The basic concept has been presented in (Innaba et al., 1993) and an efficient implementation, which takes into account many technological constraints, has been described in (Mathieu and Devernay, 1995). Some applications require orienting the field of view of the stereo system toward a given feature in the scene. For example, in the case of vehicle-mounted sensors for road surveillance at long distance, the system must keep in sight the region of interest in which the neighboring vehicles are moving, even when the road curvature is high. In the previously described stereo setups, the virtual cameras have parallel but fixed optical axes, which means that to orient them toward a given feature in the scene, the whole setup must be rotated. Therefore, since high-speed and precise rotations are required, inertia becomes a major problem. In this paper, we describe a mirror-based stereovision setup in which the two optical axes can be oriented in any direction while remaining parallel. Orientation is changed by rotating a single prism, lighter and much smaller than the complete setup, therefore minimizing problems related to inertia. We present the principle of operation of this new steerable stereoscopic setup in section 2, and derive several of its properties from a geometrical analysis. Section 3 describes how this setup has been simulated with the POV-RayTM rendering software and shows several synthetic images. Section 4 concludes this paper.
2.
SINGLE-CAMERA STEREOSCOPIC SENSOR
A schematic top view of the stereoscopic sensor is presented in Figure 1. It is composed of two lateral plane mirrors (marked as a and b in Figure 1) and of a central prism with two planar reflective surfaces (marked as c). This setup is similar to the system described in (Mathieu and Devernay, 1995), except that the prism can rotate about the axis defined by the edge at the intersection of the reflective surfaces (marked as d). The optical axis of the camera intersects the edge of the prism, which is projected as a vertical straight line through the center of the image. This straight line, which does not move when the prism is rotated, splits the image into the two halves onto which the left and right views of the stereo pair are projected.
The prism rotation angle γ is measured between the symmetry axis of the prism and the optical axis of the real camera. The optical center of each virtual camera remains on a circle of radius d centered on a point that does not belong to the optical axis of either camera. We can show that the center point of each circle belongs to the straight line connecting the reflection of the mid-point of the prism edge in the lateral mirror and the optical center of the corresponding camera. The angle between the optical axis of a virtual camera and the optical axis of the real one is twice the angle γ, but in the opposite direction.
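Only the stated 2γ property is needed to steer the sensor; a minimal helper, with the sign convention (positive angles to the left) chosen arbitrarily for illustration, could look like this.

```python
import math

def virtual_axis_angle(gamma_deg):
    """Orientation of the virtual optical axes for a prism rotation gamma.

    Uses only the property stated above: the virtual axes turn by twice the
    prism angle, in the opposite direction (degrees, 0 = straight ahead,
    positive = to the left).
    """
    return -2.0 * gamma_deg

def prism_angle_for_target(lateral_offset_m, distance_m):
    """Prism rotation needed to centre a target seen at the given offset."""
    target_bearing = math.degrees(math.atan2(lateral_offset_m, distance_m))
    return -target_bearing / 2.0
```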
Figure 2. Stereoscopic setup without rotation (a) and with rotation of angle γ (b).
3.
SIMULATIONS AND LABORATORY PROTOTYPE
Before building a prototype of the single-camera stereoscopic sensor, we have modelled the imaging system using the POV-RayTM rendering software. Our goal was not only to verify that the actual setup would later give satisfactory images, but also to create a software environment that would allow us to test any image processing algorithm on synthetic image sequences of a road scene. With computer-generated images, the results can be compared to the perfectly known parameters of the synthetic scene, which is almost impossible on real road images. We present several images of a road scene which have been synthesized using our implementation in POV-Ray. The simulated stereo setup has the following characteristics: the focal distance f of the camera is either 25 or 50 millimeters, the width w of its image sensing surface is one third of an inch, the distance d between the optical centre and the prism is 125 millimeters, the distance 2S between the optical axes of the virtual cameras
is 400 millimeters, and the angle α is 45 degrees, which corresponds to a right angle between the reflective surfaces of the prism. The road described in the synthetic scene has two lanes, and turns to the right with a radius of curvature equal to one kilometer, which is a realistic value for a highway. The vehicle equipped with the stereoscopic setup drives on the rightmost lane, following another vehicle at a distance of 100 meters. Figure 3 shows the synthetic image rendered with a focal distance of 25 millimeters, which is a mid-range value between telephoto and wide-angle for this sensor size. The preceding vehicle appears clearly, but with a poor resolution. Because of the curvature of the road, if one increased the focal distance of the camera without steering the optical axis, the vehicle would leave the field of view of the system. In Figure 4, the prism has been rotated by 1.45 degrees to the left and the focal length has been increased from 25 millimeters to 50 millimeters. Therefore, the optical axes of the virtual cameras have been oriented toward the right side of the road and the vehicle appears in the center of the field of view with a better resolution.
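Under a circular-arc road model and the 2γ property above, the reported rotation can be checked with a few lines; the numbers are approximate.

```python
import math

R = 1000.0    # road radius of curvature [m]
s = 100.0     # distance to the preceding vehicle along the road [m]

# chord geometry of a circular arc seen from the ego vehicle
longitudinal = R * math.sin(s / R)                 # ~99.8 m ahead
lateral = R * (1.0 - math.cos(s / R))              # ~5.0 m toward the curve centre
bearing = math.degrees(math.atan2(lateral, longitudinal))   # ~2.87 degrees
prism_rotation = bearing / 2.0                     # virtual axes turn by 2*gamma
print(round(bearing, 2), round(prism_rotation, 2)) # ~2.87, ~1.43 (close to 1.45)
```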
Figure 3. Synthetic image, vehicle at 100 meters, focal distance 25 mm.
Figure 4. Same vehicle, focal distance 50 mm, orientation 1.45 degrees.
A prototype of the stereoscopic setup, which has been assembled in our laboratory, is shown in Figure 5. All the elements are installed on a planar optical table using rails to allow for precise adjustment of their positions. Manually orientable mounts have been used for the lateral mirrors, but the prism is fixed on a mount whose angular position can be precisely controlled by a stepper motor.
Figure 5. Laboratory prototype.
4.
CONCLUSION
In this paper, we have described a stereoscopic setup based on a single camera associated with a set of mirrors. By rotating a prism about its edge, the field of view of this system can be oriented toward any direction, which is mandatory when long distance perception is required. The setup has been simulated using the POV-RayTM rendering software and an actual prototype has been built. Although based on a single camera, the setup must be calibrated. The classical methods are not appropriate for this new sensor, and we are developing a dedicated one. We are also working on higher level techniques, like image-driven position control and video rate stereo analysis, that will allow us to use this system on a real vehicle.
ACKNOWLEDGMENTS The authors want to acknowledge the support of the Regional Council Nord-Pas de Calais that comes through a grant financing the RaViOLi project.
REFERENCES
Faugeras, O. and Toscani, G. (1988). The calibration problem. IEEE J. of Robotics and Automation, RA-3(4):323–344.
Innaba, M., Hara, T., and Inoue, H. (1993). A stereo viewer based on a single camera with view-control mechanisms. In Int. Conf. on Intelligent Robots and Systems, Yokohama, Japan.
Lee, D. H. and Kweon, I. S. (2000). A novel stereo camera system by a biprism. IEEE Trans. on Robotics and Automation, 16(5):528–541.
Mathieu, H. and Devernay, F. (1995). Système de miroirs pour la stéréoscopie. Rapport technique 172, INRIA, Sophia Antipolis. Projet Robotvis.
Tsai, Roger Y. (1987). A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE J. of Robotics and Automation, RA-3(4):323–344.
FEATURE-BASED CORRESPONDENCE ANALYSIS IN COLOR IMAGE SEQUENCES
Ayoub K. Al-Hamadi, Robert Niese, Bernd Michaelis Institute for Electronics, Signal Processing and Communications (IESK) Otto-von-Guericke-University Magdeburg D-39016 Magdeburg, P.O. Box 4210 Germany
Abstract:
Feature-based matching frequently suffers from accuracy and performance due to a high number of image features that need to be matched. Instead, a hierarchical feature extraction process is used here to solve the correspondence problem in image sequences. In the first step, a modified difference image technique is applied to generate Motion-Blobs. Secondly, a color segmentation is used to determine Motion-Blob sub-segments and the respective set of properties. These properties are used to finally solve the correspondence problem. Consequently stability and accuracy of the motion analysis are increased.
Key words:
Tracking analysis, feature extraction, color image processing, Segmentation
1.
INTRODUCTION
The solution of the correspondence problem is still an active field of research1,2,3 due to the variety of problems that may occur in real sequences: partial occlusions, merging/splitting, shadow, lighting change and deformations. This paper demonstrates a technique addressing the following two problems: automatic extraction of moving objects, and solution of the correspondence problem for motion analysis in video sequences taken by a stationary camera. In contrast to intensity-based methods, which usually evaluate successive images pixel by pixel, feature-based approaches use image features on a higher level4. There are various features available to solve the correspondence problem; besides simple primitives like edges, corners, and points there are also more complex geometrical primitives like rigid or flexible
contour models. A common disadvantage of these approaches is the high number of image primitives, which often leads to ambiguities when solving the correspondence problem. There are two ways in which the suggested method differs from other motion analysis methods (block-matching5, optical flow6, common feature-based methods7,8,9,12 and deformable-model-based methods9,13): (1) Automatic object-adapted region segmentation. (2) Correspondence between image features is determined by hierarchical correlation levels. This process leads to a highly reduced number of image features. Thus, ambiguities are decreased while performance and accuracy of the correlation process are increased. In addition to that, using the suggested method, the influence of image-specific interferences like changes in illumination, reflection, shadow or partial occlusion is clearly suppressed. Furthermore, the analysis is stable in the case of object-related interferences like deformation, rotation and increase or decrease in size. In particular, the splitting of objects into several parts and merging of previously separate parts can be handled appropriately. Such typical situations of merges and splits of object regions occur in dense traffic scenes.
2.
CONCEPT
The goal of this research work is the development of a hierarchical approach that is capable of segmenting and tracking objects which may vary in shape. The solution of the correspondence problem is a three-step process (Fig. 1). Initially, all image regions are determined that are supposed to depict moving objects. This step is referred to as motion segmentation. Motion segmentation is realized through a modified difference image technique (MDI). Instead of using temporal derivatives, two consecutive difference images are combined. Each of these difference images is created by subtracting two successive images. Subsequently a binary threshold is applied and both binary images are combined with an "AND" operator. The segmentation generates at a time t a motion mask, which itself is evaluated to generate image features of the first level. These image features are named Motion-Blobs (MB). In a further step, color segmentation is applied to dissect MBs into even smaller samples. There are several algorithms which are potentially suitable for this purpose. The algorithm which combines most of the features required is the Colour-Structure-Code (CSC)10. The CSC is an advanced region growing approach that combines the advantages of fast local region growing algorithms and the robustness of global methods10. CSC segmentation has been found to be a practical means to divide MB's. Correspondences are computed in each feature level. In the first level m:n matching is used to
handle splitting and merging of object regions. This step is primarily based on topological features of MB’s.
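A minimal sketch of the MDI motion segmentation follows; grayscale frames and the binary threshold value are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def motion_blobs(frame_prev, frame_curr, frame_next, thresh=20):
    """Modified difference image (MDI) motion mask and Motion-Blobs at time t.

    Two consecutive difference images |I_t - I_(t-1)| and |I_(t+1) - I_t| are
    thresholded and combined with a logical AND; the connected components of
    the resulting mask are taken as the Motion-Blobs.
    """
    d1 = np.abs(frame_curr.astype(int) - frame_prev.astype(int)) > thresh
    d2 = np.abs(frame_next.astype(int) - frame_curr.astype(int)) > thresh
    mask = d1 & d2                       # AND of the two binary difference images
    labels, n = ndimage.label(mask)
    blobs = [np.argwhere(labels == k) for k in range(1, n + 1)]
    return mask, blobs                   # pixel lists, one per Motion-Blob
```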
Figure 1. Paradigm for the solution of the correspondence problem.
In the second level a feature-based similarity criterion is used to determine corresponding CSC-segments. The set of corresponding CSC-segments is restricted through the prior matching of MB's. After determining all CSC-segment correspondences, object motion trajectories are obtained by evaluating the motion vectors of each CSC-segment.
2.1
Feature extraction and correspondence analysis
Separating the feature matching process into two levels offers great advantages with respect to accuracy and efficiency. In the first step correspondences are determined in feature level 1, i.e. between MB's. The MB correlation (MBC) is the basis for a subsequent CSC correlation (Fig. 1).
2.1.1
Motion-Blobs Correlation
The process of motion segmentation results in image regions that represent object candidates. These regions are referred to as Motion-Blobs (MB). Every MB holds a set of features that can be used to solve the correspondence problem. The set of all MB's embodies the first feature level. Specifically, each MB b_i is characterized by the feature vector F_MB(b_i):

F_MB(b_i) = [ RGB set M_MB;  Area A_MB;  Set of colour segments S_MB ]^T    (1)
The basis feature MMB is a set of RGB color vectors, which define the topology of a MB. Area AMB represents the absolute value of a Motion-Blob region, i.e. the number of all region pixels. Only very few parameters are necessary to achieve a match between these large regions. Due to pre-
processing, MB's usually appear in a small number. However, to cope with effects like merging and splitting of the object projections (Fig. 2), it is important to perform a multi-matching (m:n). In general, an arbitrary m:n (m-region to n-region) assignment is possible, where many MB's are simultaneously split and merged. The different cases are the following: {(a) 0 : 1 – emerge; (b) 1 : 0 – vanish; (c) 1 : 1 – simple movement; (d) 1 : n – split; (e) n : 1 – merge; (f) 2 : 2 – simultaneous split and merge of two blobs; (g) n : m – simultaneous split and merge of n blobs at a time}.
Figure 2. Examples MB-Correlation Case.
This matching is based on simple assumptions. Basically it is assumed that every MB derives from a projection of a moving object. Provided that no object moves faster than the sampling rate of the recording system can capture, projections of a moving object will appear at adjacent locations in successive images. Additionally, the size of projected objects is not expected to change dramatically between two recorded frames. Plain displacement, merging and splitting of MB's can be detected by evaluating size and location. As a result, sets of corresponding MB's are determined for each frame pair. After determining all MB correspondences, the more elaborate feature-based matching of CSC segments can be done.
2.1.2
CSC-Segments Correlation
Once all MB's are correlated, the second feature level can be processed. This task is performed for each correlated MB-set pair, i.e. all related CSC-segments will be correlated themselves on the basis of different matching criteria. The matching process is realized through a combination of four separately weighted correlation criteria, which achieves a high accuracy at a low computational expense. The matching is done between two CSC-segment sets M0 and M1 that belong to previously correlated MB's. Each CSC segment p_i is characterized by the feature vector F_CSC(p_i):

F_CSC(p_i) = [ Colour C_CSC^i;  Area contour AC_CSC^i;  Circumference;  Center;  Circularity ]^T    (2)
where color pixel and contour information is used to describe features of color segments. The circularity measure gives information about the shape
of a region. It is used to exclude segments that have a degenerated shape, i.e. a contour with many outliers. Usually, matching of such segments does not lead to useful results. With the features presented it is possible to reliably find correspondences between sets of color segments in successive images. In the matching process the segment feature vectors F_CSC(p_i) are used to define four similarity measures, which are based on topology, orientation and color of the segments. These are (1) the relative segment location EQ_pos within a set of corresponding MB's, (2) the inter-frame distance EQ_dist, (3) the colour value EQ_colour and (4) the size EQ_size. All measures are weighted and summed up to the total similarity measure EQ_total. This total similarity measure leads to a precise improvement of the match quality, and thus the determination of the motion trajectory of the object is ensured even in unfavourable situations.

EQ_total^i = ( Σ_{j=1..4} w_j^i )⁻¹ · ( w_1^i·EQ_pos^i + w_2^i·EQ_dist^i + w_3^i·EQ_colour^i + w_4^i·EQ_size^i )    (3)
where the weights w_j^i are chosen dynamically with respect to the actual data (the empirical selection in our case is w1 = w2 = w4 = 1 and w3 = 4). Specifically, weights should depend on the matching case of the underlying MB's, the number of CSC segments present and the spatial extent of the segment set. In m:n cases the distribution of segments is frequently much more diverse than in plain 1:1 cases. Thus m:n cases may benefit from relying more strongly on other similarity measures, namely EQ_colour. In general EQ_colour should carry the biggest weight (w3 = 4). Due to the high quality of the CSC segmentation, the majority of segments varies from the average color by an evident degree. For that reason color similarity presents the most important matching feature. EQ_dist and EQ_size work best at average weights. The global similarity measure EQ_total has to be computed between all segments. This leads to a computational complexity of O(n²). Segments usually match if the similarity measure EQ_total shows the highest value for the respective pair. After determining the best fitting segments, motion vectors can be computed. Given a matching M(p1, p2) of two CSC-segments p1 and p2, the respective motion vector v derives from the displacement of p1 and p2, i.e. the difference of the particular segment positions.
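A sketch of the weighted matching is given below. The partial measures are normalized to [0, 1] in an illustrative way (only the weighting w3 = 4 for colour follows the paper), and a greedy highest-score assignment stands in for whatever tie-breaking the authors use.

```python
import numpy as np

def total_similarity(seg_a, seg_b, weights=(1.0, 1.0, 4.0, 1.0)):
    """EQ_total of formula (3) for one candidate pair of CSC segments.

    seg_a / seg_b: dicts with 'pos' (relative location in the MB set),
    'center', 'color' (mean RGB) and 'area'; 1.0 means identical.
    """
    w1, w2, w3, w4 = weights
    eq_pos = 1.0 / (1.0 + np.linalg.norm(np.subtract(seg_a['pos'], seg_b['pos'])))
    eq_dist = 1.0 / (1.0 + np.linalg.norm(np.subtract(seg_a['center'], seg_b['center'])))
    eq_col = 1.0 / (1.0 + np.linalg.norm(np.subtract(seg_a['color'], seg_b['color'])))
    eq_size = min(seg_a['area'], seg_b['area']) / max(seg_a['area'], seg_b['area'])
    return (w1 * eq_pos + w2 * eq_dist + w3 * eq_col + w4 * eq_size) / sum(weights)

def match_segments(set_m0, set_m1):
    """O(n^2) similarity table followed by a greedy best-pair assignment."""
    pairs = []
    for i, a in enumerate(set_m0):
        scores = [total_similarity(a, b) for b in set_m1]
        if scores:
            j = int(np.argmax(scores))
            pairs.append((i, j, scores[j]))   # motion vector = pos(b_j) - pos(a_i)
    return pairs
```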
3.
RESULTS
We want to briefly demonstrate the analysis of real image sequences, which are overlaid by image-specific interferences. The primary goal of tracking is determining the location and speed of object primitives. The set of all motion values of a primitive is called (Motion-) Trajectory. The object
of interest is overlaid by shadow, lighting modifications and small partial occlusion (Fig. 3, A). These influences lead to clear modifications of the intensity values. The use of conventional intensity-based methods is not suitable due to the strong change of the intensity.
Figure 3. The analysis of moving objects in real sequences using the suggested technique. Sequence A presents the analysis under the influence of brightness change, shadow and partial occlusion. Sequences B&C show the results of the analysis as MB and motion trajectories. The MB’s (MB4, MB7and 6 at t=40) contain several objects. Using the suggested hierarchical approach, the analysis is stable and the motion trajectories (right) describe exactly the motion parameters of objects despite high traffic density and reflection.
In contrast to that, the suggested method allows error-free motion estimation due to a hierarchical feature extraction process. The results of moving object segmentation and motion analysis are visualized (Fig. 3 A). With a significantly reduced number of image features the approach achieves high performance and accuracy. The motion trajectory exactly describes the object movement despite overlaying interferences. Figure 3 B and C show results of the analysis of a dense traffic scene, i.e. motion trajectories and object tracking contours. It can be recognized that the motion trajectories describe the motion parameters of moving objects in spite of the fact that
some objects are merged due to reflection and the high density of traffic (merge of objects 1 and 7 at t=28, sequence B). When using only one feature level, the contours no longer describe the actual object shape and position (see MB's), because in dense traffic the MB's contain multiple objects. Nonetheless, the suggested method enables exact and robust wrapping and tracking of single objects (Fig. 3).
4.
SUMMARY AND CONCLUSION
A novel algorithm has been developed for automatic segmentation of moving objects and correspondence analysis under the influence of image-specific interferences. The solution of the correspondence problem in the tracking process takes place via hierarchical feature correlation from moving image regions (MB's). Thus, ambiguities are decreased while the performance and accuracy of the correspondence analysis process are increased. The matching process is realized through the combination of four separately weighted correlation tables that achieve a high accuracy at low computational expense.
ACKNOWLEDGMENTS This work was supported by LSA/ Germany grant (FKZ: 3133A/0089B).
REFERENCES
1. Huwer, S.; Niemann, H.: 2D-Object Tracking Based on Projection-Histograms. In: Burkhardt, H.; Neumann, B. (eds.): Computer Vision - ECCV'98, LNCS 1998, Vol. I, pp. 961-976.
2. Dickmanns, E.D.; Mysliwetz, B.D.: Recursive 3-D Road and Relative Ego-State Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 14, No. 2, Feb. 1992.
3. Calow, R.; Michaelis, B.; Al-Hamadi, A.: Solutions for model-based analysis of human gait. 25th DAGM, September 10-12, 2003, Magdeburg, Germany.
4. Guse, W.: Objektorientierte Bewegungsanalyse in Bildfolgen. VDI-Reihe 10, Nr. 223, 1992.
5. Mecke, R.; Michaelis, B.: A robust method for motion estimation in image sequences. AMDO 2000, Palma de Mallorca, Spain, 7-9 Sep. 2000, pp. 108-119 (LNCS 1899).
6. Lucena, M.J.; Fuertes, J.M.; Gomez, J.; Perez de la Blanca; Garrido, A.: Tracking from Optical Flow. 3rd IEEE-EURASIP, Sep. 18-20, 2003, Rome, Italy, pp. 651-655.
7. Rehrmann, V.: Object-oriented Motion Estimation in colour image sequences. In Proc. of the 5th European Conference on Computer Vision, Springer-Verlag, June 1998, Freiburg.
8. Deriche, R. and Faugeras, O.: Tracking line segments. Image and Vision Computing, 8(4): 261-270, 1990.
9. Blake, A., Curwen, R. and Zisserman, A.: A framework for spatio-temporal control in the tracking of visual contours. International Journal of Computer Vision, 11(2): 127-145, 1993.
10. Priese, L.; Rehrmann, V.: On hierarchical colour segmentation and applications. In Proc. of the CVPR, pp. 633-634, IEEE Computer Society Press, June 1993, New York City.
11. Al-Hamadi, A.; Niese, R.; Michaelis, B.: Towards robust segmentation and tracking of moving objects in video sequences. 3rd IEEE-EURASIP, ISPA 03, Rome, pp. 645-650.
12. Coifman, B.: A real-time computer vision system for vehicle tracking and traffic surveillance, submitted for publication in TR-C, revised December 1, 1998.
13. Martinez, S.V., Knebel, J.-F. and Thiran, J.-P.: Multi-Object Tracking using the Particle Filter Algorithm on the Top-View Plan. http://lts1pc19.epfl.ch/repository/Venegas2004_730.pdf
A VOTING STRATEGY FOR HIGH SPEED STEREO MATCHING
Application for real-time obstacle detection using linear stereo vision
Mohamed Harti, Yassine Ruichek and Abderrafiaa Koukam
Systems and Transportation Laboratory, University of Technology of Belfort-Montbéliard, 90010 Belfort Cedex, France
Abstract:
In this paper we propose a new stereo matching algorithm for real-time obstacle detection in front of a moving vehicle. The stereo matching problem is viewed as a constraint satisfaction problem where the objective is to highlight a solution for which the matches are as compatible as possible with respect to specific constraints. These constraints are of two types: local constraints, namely position, slope and gradient magnitude constraints, and global ones, namely uniqueness, ordering and smoothness constraints. The position and slope constraints are first used to discard impossible matches. Based on the global constraints, a voting stereo matching procedure is then achieved to calculate the scores of the possible matches. These scores are then weighted by means of the gradient magnitude constraint. The correct matches are finally obtained by selecting the pairs for which the weighted scores are maximum. The performance of the voting stereo matching algorithm is evaluated for real-time obstacle detection using linear cameras.
Key words:
edge extraction; depth perception; linear stereo vision; obstacle detection; realtime processing; stereo matching; voting strategy.
1.
INTRODUCTION
Depth perception is one of the most active research areas in computer vision. Passive stereo vision is a well known approach for extracting depth information of a scene. It consists in analyzing the stereo images of the scene seen by two or more video cameras from different viewpoints1. The difference of the viewpoint positions in the stereo vision system causes a relative displacement, called disparity, of the corresponding features in the stereo images. This relative displacement encodes the depth information, which is lost when the three dimensional structure is projected on an image plane. The key problem is hence the stereo matching task, which consists in comparing each feature extracted from one image with a number, generally large, of features extracted from the other image in order to find the corresponding one, if any. This process, which is difficult to perform, requires a lot of computation, as well as a large amount of memory2. Once the matching is established and the stereo vision system parameters are known, the depth computation is reduced to a simple triangulation technique. In the robot vision domain, the stereo matching problem is generally simplified by making hypotheses about the type of objects being observed and their visual environment so that structural features, such as corners or vertical straight lines, can be more or less easily extracted3. Indoor scenes, including a few rigid objects scattered without occlusions against a featureless background, are much easier to analyze than natural outdoor scenes of the real world4. With such restrictive assumptions, the number of candidate features for matching is substantially reduced so that computing times become acceptable for real-time processing without an important loss of useful information. Unfortunately, none of these hypotheses can be used in outdoor scenes, such as road environments, for detecting and localizing obstacles in front of a moving vehicle, because the features are too numerous to allow a reliable matching within an acceptable computing time5. Considering these difficulties, some authors have proposed to use linear cameras instead of matrix ones5-7. With these cameras, the information to be processed is drastically reduced since their sensor contains only one video line, typically 2,500 pixels, instead of, at least, 250,000 pixels with standard raster-scan cameras. Furthermore, they have a better horizontal resolution than video cameras. This characteristic is very important for an accurate perception of the scene in front of a vehicle. The aim of this work is to propose a new stereo matching algorithm for real-time obstacle detection in front of a moving car using linear stereo vision.
2.
STEREO VISION WITH LINEAR CAMERAS
A linear stereo system is built with two line-scan cameras, so that their optical axes are parallel and separated by a distance E (see Fig. 1). Their lenses have identical focal lengths f. The fields of view of the two cameras are merged in one single plane, called the optical plane, so that the cameras
shoot the same line in the scene. A specific calibration method has been developed to adjust the parallelism of the two optical axes in the common plane of view attached to the two cameras5.

Figure 1. Geometry of the linear cameras.
Let the base-line joining the perspective centers Ol and Or be the X-axis, and let the Z-axis lie in the optical plane, parallel to the optical axes of the cameras, so that the origin of the {X,Z} coordinate system stands midway between the lens centers (see Fig. 2).

Figure 2. Pinhole lens model.
Let us consider a point P(xp,zp) of coordinates xp and zp in the optical plane. The image coordinates xl and xr represent the projections of the point P in the left and right imaging sensors, respectively. This pair of points is
referred to as a corresponding pair. Using the pin-hole lens model, the coordinates of the point P in the optical plane can be found as follows:

z_P = E f / d   (1)

x_P = x_l z_P / f − E/2 = x_r z_P / f + E/2   (2)

where f is the focal length of the lenses, E is the base-line width and d = x_l − x_r is the disparity between the left and right projections of the point P on the two sensors.
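A minimal sketch of the triangulation of Eqs. (1) and (2) is given below; it assumes the sensor coordinates x_l and x_r are expressed in the same unit as the focal length f, and the function name is illustrative.

```python
def reconstruct_point(x_l, x_r, E, f):
    """Triangulate a matched edge pair seen by the two linear cameras.

    x_l, x_r : projections on the left/right sensors
    E        : base-line width, f : focal length
    Returns (x_P, z_P) in the optical plane, origin midway between the lenses.
    """
    d = x_l - x_r                # disparity; must be positive for a valid match
    if d <= 0:
        raise ValueError("non-positive disparity: the edges cannot correspond")
    z_P = E * f / d              # Eq. (1)
    x_P = x_l * z_P / f - E / 2  # Eq. (2)
    return x_P, z_P
```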
3.
EDGE EXTRACTION
Edge detection is performed by means of Deriche's operator8. After derivation, the pertinent local extrema are selected by splitting the gradient magnitude signal into adjacent intervals where the sign of the response remains constant7. In each interval of constant sign, the maximum amplitude indicates the position of a unique edge associated to this interval when, and only when, this amplitude is greater than a low threshold value t. The application of this thresholding procedure allows one to remove non-significant responses of the differential operator lying in the range [-t,+t]. The adjustment of t is not crucial. Good results have been obtained with t adjusted at 10% of the greatest amplitude of the response of the differential operator. Applied to the left and right linear images, this edge extraction procedure yields two lists of edges. Each edge is characterized by its position in the image, the amplitude and the sign of the response of Deriche's operator.
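The extremum selection described above can be sketched as follows for a one-dimensional gradient signal; this is not the Deriche filter itself, only the per-interval maximum selection with the threshold t, and the data layout is an assumption.

```python
import numpy as np

def extract_edges(gradient, t_ratio=0.1):
    """Select one edge per interval of constant sign of the gradient signal.
    Returns (position, amplitude, sign) tuples for extrema above the threshold."""
    gradient = np.asarray(gradient, dtype=float)
    t = t_ratio * np.max(np.abs(gradient))      # t at 10% of the largest response
    edges, start = [], 0
    for i in range(1, len(gradient) + 1):
        if i == len(gradient) or np.sign(gradient[i]) != np.sign(gradient[start]):
            segment = gradient[start:i]
            k = int(np.argmax(np.abs(segment)))
            if abs(segment[k]) > t:             # keep only significant extrema
                edges.append((start + k, float(segment[k]), int(np.sign(segment[k]))))
            start = i
    return edges
```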
4.
EDGE STEREO MATCHING
The proposed edge stereo matching algorithm is based on two types of constraints: local constraints, namely position, slope and gradient magnitude constraints, and global ones, namely uniqueness, ordering and smoothness constraints. The first two local constraints, i.e. the position and slope constraints, are used to discard impossible matches so as to consider only potentially acceptable pairs of edges as candidates. Resulting from the sensor geometry,
the position constraint assumes that a couple of edges i and j appearing in the left and right linear images, respectively, represents a possible match if the constraint xi > xj is satisfied, where x denotes the position of the edge in the image. The slope constraint means that only the pairs of edges with the same sign of the gradient are considered as possible matches. The third local constraint, i.e. the gradient magnitude constraint, is used for weighting the validity of the possible matches, which respect the position and slope constraints. This constraint supposes that important weights are affected to the pairs of edges for which the gradient magnitudes are close (and vice-versa). The global constraints are used to build a voting strategy between the possible matches in order to highlight the best ones. The uniqueness constraint assumes that one edge in the left image matches only one edge in the right image (and vice-versa). The ordering constraint allows promoting pairs for which the order between the edges is preserved. The smoothness constraint supposes that neighbouring edges have similar disparities.
4.1
Problem mapping
The edge stereo matching problem is mapped onto a NL×NR array M, called the matching array, where NL and NR are the numbers of edges in the left and right images, respectively (see Fig. 3). Each element Mlr of this array explores the hypothesis that the edge l in the left image matches or not the edge r in the right image. We consider only the elements representing the possible matches that meet the position and slope constraints.
Figure 3. Matching array. The white circles represent the possible matches that meet the position and slope constraints. The black circles represent the impossible matches that do not respect the position and slope constraints.
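A possible way to build the matching array from the two edge lists, applying only the position and slope constraints, is sketched below; the edge tuples follow the (position, amplitude, sign) layout assumed in the previous sketch.

```python
def possible_matches(left_edges, right_edges):
    """Boolean matching array: True where the position and slope constraints allow
    a pair of edges to be considered a candidate match."""
    n_l, n_r = len(left_edges), len(right_edges)
    allowed = [[False] * n_r for _ in range(n_l)]
    for l, (x_i, _, s_i) in enumerate(left_edges):
        for r, (x_j, _, s_j) in enumerate(right_edges):
            # position constraint: x_left > x_right; slope constraint: same sign
            allowed[l][r] = (x_i > x_j) and (s_i == s_j)
    return allowed
```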
4.2
Voting stereo matching strategy
After the mapping step, the stereo matching process is performed thanks to a voting strategy, which is based on the global constraints. The voting procedure is applied to all the possible matches that meet the position and slope constraints. For each element Mlr representing a possible match, the voting procedure consists first of determining among the other elements the voters, which are authorized to vote for the candidate Mlr. The voters are determined by using the uniqueness and ordering constraints. An element Ml'r' representing a possible match is considered as a voter for the candidate Mlr if the pairs (l,r) and (l',r') verify the uniqueness and ordering constraints (see Fig. 4). The voters perform their vote sequentially by contributing to the score of the candidate Mlr. The score updating rule is defined by means of the smoothness constraint as follows:

SM_lr(new) = SM_lr(previous) + f(X_lrl'r')   (3)

where X_lrl'r' is the absolute value of the difference between the disparities of the pairs (l,r) and (l',r'), expressed in pixels. The scores of the possible matches are initially set to 0. f is a non-linear function, which calculates the contribution of a voter. This function is chosen such that a high contribution corresponds to a high compatibility between the pairs (l,r) and (l',r') with respect to the smoothness constraint, i.e. when X_lrl'r' is close to 0, and a low contribution corresponds to a low smoothness compatibility, i.e. when X_lrl'r' is very large. In our case, this function is given by:

f(X) = 1/(1 + X)   (4)
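The voting procedure of Eqs. (3) and (4) can be sketched as a brute-force accumulation over all candidate pairs; the ordering test used here (both pairs appear in the same order in the two images) is a simplified stand-in for the uniqueness and ordering constraints of the paper.

```python
def vote(allowed, left_edges, right_edges):
    """Accumulate the smoothness votes of Eq. (3) with f(X) = 1/(1 + X) (Eq. 4)."""
    n_l, n_r = len(left_edges), len(right_edges)
    scores = [[0.0] * n_r for _ in range(n_l)]
    for l in range(n_l):
        for r in range(n_r):
            if not allowed[l][r]:
                continue
            d_lr = left_edges[l][0] - right_edges[r][0]   # disparity of (l, r)
            for lp in range(n_l):
                for rp in range(n_r):
                    if not allowed[lp][rp] or lp == l or rp == r:
                        continue                          # uniqueness constraint
                    if (lp - l) * (rp - r) < 0:
                        continue                          # ordering violated
                    d_other = left_edges[lp][0] - right_edges[rp][0]
                    scores[l][r] += 1.0 / (1.0 + abs(d_lr - d_other))
    return scores
```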
Once the voting process is achieved, the scores of all the possible matches are weighted by means of the gradient magnitude constraint. Before the weighting procedure, the scores are normalized so as to take values between 0 and 1. For each possible match (l,r), the weight Wlr is computed as follows:

W_lr = 2/(1 + e^(Y_lr − T)) − 1   (5)
where Ylr is the absolute value of the difference between the gradient magnitudes of the edges l and r appearing in the left and right images, respectively. T is adjusted such that a high weight is affected when the gradient magnitudes are close, i.e. when Ylr is close to 0, and a low weight is
affected when the gradient magnitudes are far-off, i.e. when Ylr is very large. A satisfying value of this parameter is experimentally selected as T = 20.

Figure 4. Voting strategy. The voters of the candidate Mlr are the elements (white circles) situated in the gray area of the matching array.
After the score weighting procedure, the correct matches are obtained by selecting the possible matches for which the weighted scores are maximum in the rows and columns of the matching array.
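The weighting and final selection can be sketched as follows, assuming the reconstructed form of Eq. (5) above; the clamp on the exponent and the strict row/column maximum test are implementation choices, not taken from the paper.

```python
import math

def select_matches(scores, left_edges, right_edges, T=20.0):
    """Normalise the vote scores, weight them with Eq. (5) and keep pairs whose
    weighted score is maximal in both their row and their column."""
    s_max = max((max(row) for row in scores), default=0.0) or 1.0
    n_l, n_r = len(scores), len(scores[0])
    weighted = [[0.0] * n_r for _ in range(n_l)]
    for l in range(n_l):
        for r in range(n_r):
            Y = abs(abs(left_edges[l][1]) - abs(right_edges[r][1]))
            W = 2.0 / (1.0 + math.exp(min(Y - T, 50.0))) - 1.0  # clamp avoids overflow
            weighted[l][r] = (scores[l][r] / s_max) * W
    matches = []
    for l in range(n_l):
        r = max(range(n_r), key=lambda j: weighted[l][j])
        if weighted[l][r] > 0 and all(weighted[i][r] <= weighted[l][r] for i in range(n_l)):
            matches.append((l, r))
    return matches
```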
5.
APPLICATION TO OBSTACLE DETECTION
A stereo set-up, built with two line-scan cameras, is installed on the roof of a car for periodically acquiring stereo pairs of linear images as the car travels (see Fig. 5). The tilt angle is adjusted so that the optical plane intersects the pavement at a given distance Dmax in front of the car. Fig. 6 shows a stereo sequence shot by this set-up where the linear images are represented as horizontal lines, time running from top to bottom. In this sequence, the prototype car travels in the central lane of the road and follows another car. The optical plane intersects gradually the shadow of the preceding car, then the whole car from the bottom to the top, as the distance between the two cars decreases. A third car pulls back into the central lane after overtaking the preceding car. The prototype car is itself overtaken by another one, which is travelling in the third lane of the road. The trajectories of the different vehicles during the sequence are shown in Fig. 7. On the pictures of Fig. 6 we can see the white lines that delimit the pavement of the road and, between these lines, the two dashed white lines and the preceding car. At the bottom of the pictures, we can also see, on the
leftmost lane, the car, which is overtaking the prototype car, and, in the middle, the shadow of the vehicle, which pulls back in front of the preceding car. The curvilinear aspect of the lines is due to the variations of the stereoscope tilt angle because of the uneven road surface. Note that the depth reconstruction is not affected by these oscillations, provided the two optical planes of the two cameras remain correctly calibrated when the car is running. The mechanical design of the stereoscope guarantees the stability of the calibration, even when the car is running on a rugged pavement.
Figure 5. Geometry of the stereo set-up on the vehicle.
Figure 6. Stereo sequence.
The stereo sequence of Fig. 6 has been processed by the proposed stereo matching algorithm. For each stereo pair, the disparities of all matched edges are used to compute the positions and distances of the edges of the objects seen in the stereo vision sector thanks to equations (1) and (2). The results are shown in Fig. 8 in which the distances are represented as gray levels, the darker the closer, whereas positions are represented along the horizontal axis. As in Fig. 6, time runs from top to bottom. Fig. 8 shows that the voting stereo matching algorithm provides good matching results. The edges of the two dashed lines have been correctly matched. The edges of the lines, which delimit the pavement, cannot be
matched continuously because they do not always appear in the common part of the fields of the cameras. The preceding vehicle is well detected as it comes closer and closer to the prototype car. The shadow of the vehicle, which pulls back in front of the preceding vehicle, is identified as a white continuous line, at the bottom of the reconstructed image. Finally, we can see the dark oblique line, which represents the vehicle overtaking the prototype car, at the bottom of the reconstructed image.

Figure 7. Trajectories of the vehicles during the sequence.
Figure 8. Reconstructed scene.
The proposed stereo matching algorithm is implemented on a PC with an Intel Pentium III running at 1 GHz. The processing time of the stereo sequence, which is composed of 200 stereo linear images, is about 80 ms. The average processing rate is hence 2500 pairs of stereo linear images per second.
6.
CONCLUSIONS
The aim of this paper is concerned with the stereo matching problem for real-time obstacle detection in front of a moving car. The stereo matching problem is viewed as a constraint satisfaction problem where the objective is to highlight a solution for which the matches are as compatible as possible with respect to specific constraints. These constraints are of two types: local constraints, namely position, slope and gradient magnitude constraints, and global ones, namely uniqueness, ordering and smoothness constraints. The position and slope constraints are first used to discard impossible matches so as to consider only potentially acceptable pairs as candidates. The searching process of the correct matches is then performed by means of a voting strategy, which is based on the global constraints. The scores calculated by the voting procedure for each possible pairs are then weighted thanks to the gradient magnitude constraint. The correct matches are finally obtained by selecting the pairs for which the weighted scores are maximum. The performance of the new edge stereo matching algorithm is evaluated for real-time obstacle detection using linear cameras. The tests carried out with stereo sequences acquired in real traffic conditions show the interest of the proposed approach in terms of robustness and reliability of depth computation with a very high processing rate.
REFERENCES
1. Jähne, B., and Haußecker, H. (2000). Computer Vision and Applications. Academic Press.
2. Barnard, S., and Fischler, M. (1982). Computational Stereo. ACM Computing Surveys, 14, pp. 553–572.
3. Kriegman, D.J., Triendl, E., and Binford, T.O. (1989). Stereo Vision and Navigation in Buildings for Mobile Robot. IEEE Transactions on Robotics and Automation, Vol. 5, No. 6.
4. Nitzan, D. (1988). Three-dimensional Vision Structure for Robot Application. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 10, No. 3, pp. 291–309.
5. Bruyelle, J.L. (1994). Conception and Realization of a Linear Stereoscopic Sensor: Application to Obstacle Detection in Front of Vehicles. PhD Thesis, University of Sciences and Technologies of Lille, France.
6. Inigo, R.M., and Tkacik, T. (1987). Mobile Robot Operation in Real-time With Linear Image Array Based Vision. Proceedings of the IEEE Intelligent Control Symposium, pp. 228–233.
7. Burie, J.C., Bruyelle, J.L., and Postaire, J.G. (1995). Detecting and Localising Obstacles in Front of a Moving Vehicle Using Linear Stereo Vision. Mathematical and Computer Modelling, Vol. 22, No. 4-7, pp. 235–246.
8. Deriche, R. (1990). Fast Algorithms for Low-level Vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 1, pp. 78–87.
VEHICLE DETECTION USING GABOR FILTERS AND AFFINE MOMENT INVARIANTS FROM IMAGE DATA
Tadayoshi Shioyama, Mohammad Shorif Uddin and Yoshihiro Kawai
Kyoto Institute of Technology, Matsugasaki, Sakyo-ku, Kyoto 606-8585, Japan
email: [email protected]
Abstract
This paper proposes a new method for detecting vehicles from an image. The method consists in three stages of segmentation, extraction of candidate window corresponding to a vehicle and detection of a vehicle. From the experimental results using 121 real images of road scenes, it is found that the proposed method can successfully detect vehicles for 120 images among 121 images.
Keywords:
Vehicle detection; Gabor filter; segmentation; affine moment invariants.
1.
INTRODUCTION
To improve the mobility of millions of blind people all over the world, an effective navigation system is very important. This paper addresses a vehicle detection algorithm for the purpose of developing a travel aid for the blind pedestrian. Many vision-based methods for detecting vehicles have been proposed for intelligent transportation control or collision avoidance in vehicles1–5. As far as we know, there have been no reports concerning vehicle detection for a pedestrian. In this paper, we present a new algorithm for detecting vehicles from an image viewed from a pedestrian. In our method, at first an image is segmented into regions by using not only color information but also Gabor filter outputs of a grayscale image. Second, we find a candidate rectangular window corresponding to a vehicle. Third, at each region in the window, we calculate the affine moment invariants for the contour of the region, and compare the invariants with the invariants of reference contours which usually exist in a vehicle region. If the window has a region with the same invariant as one of these reference contours, then the window is treated as an area corresponding to a vehicle. For the purpose of evaluating the performance of the proposed method in detecting vehicles, we perform experiments using many real images.
2.
SEGMENTATION TECHNIQUE
Segmentation is performed using the color information as well as the outputs of Gabor filters.
2.1
HSL color space
We use the HSL color model where H, S and L denote hue angle, saturation and lightness, respectively. The lightness L takes a value in the range [-1,+1], the saturation S takes a value in the range [0,1] and the hue angle H takes a value in the range [0, 2π). We define HSL(x, y) as a vector with three components: HSL(x, y) = (L, S cos H, S sin H), where (L, S, H) are values at an image coordinate (x, y).
2.2
Gabor filter
Let f(x, y) be a lightness L at (x, y) in an input image. Then a Gabor transformation of f(x, y) is given by a convolution:

z(x, y) = ∫∫_{−∞}^{∞} f(x − t, y − s) exp(−(t² + s²)/(2σ²)) exp(−j2π(u0 t + v0 s)) dt ds,   (1)
where (u0, v0) are the fundamental frequencies in a two-dimensional frequency space. The radial frequency r0 is given by r0² = u0² + v0². A Gaussian window of a Gabor function has a width σ. In order to keep a constant number of waves within the width σ, we introduce the following relation: σ r0 = 1/τ. For the purpose of detecting vehicles, we empirically use two radial frequencies and four directions. Eight Gabor filters are optimally allocated in a frequency space6. In this paper, we set r0 = 0.14, σ = 2.36 for low frequency, and r0 = 0.33, σ = 1.0 for high frequency.
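A sampled Gabor kernel matching Eq. (1) can be built as below; the kernel half-size and the mapping from the orientation θ to (u0, v0) via cosine/sine are assumptions, since the paper only states the radial frequencies and widths.

```python
import numpy as np

def gabor_kernel(r0, sigma, theta_deg, half_size=7):
    """Sampled complex Gabor kernel corresponding to the convolution of Eq. (1)."""
    theta = np.deg2rad(theta_deg)
    u0, v0 = r0 * np.cos(theta), r0 * np.sin(theta)   # assumed orientation mapping
    ax = np.arange(-half_size, half_size + 1)
    t, s = np.meshgrid(ax, ax)
    gauss = np.exp(-(t ** 2 + s ** 2) / (2.0 * sigma ** 2))
    wave = np.exp(-2j * np.pi * (u0 * t + v0 * s))
    return gauss * wave

# the eight filters of the bank: two radial frequencies, four directions
bank = [gabor_kernel(r0, sigma, th)
        for r0, sigma in ((0.14, 2.36), (0.33, 1.0))
        for th in (0, 45, 90, 135)]
```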
2.3
Gabor features
We denote GEθ r0 (x, y) as the normalized output of Gabor filter with a radial frequency r0 and direction θ for a grayscale image f (x, y), here r0 = h(High) and (Low), and θ = 0o , 45o , 90o and 135o . The normalization is carried out so that the maximal output of Gabor filter lies in the range of [0,1]. We define a feature vector GE(x, y) as a vector with eight components represented by GEθ r0 (x, y).
2.4
Segmentation
For segmentation, we use both the feature vector GE(x, y) and the color vector HSL(x, y). The segmentation is carried out by finding a candidate neighboring region j∗ for merging for each considered region i, which satisfies the
following condition:

E_{ij*} = min_j E_{ij},   E_{ij} ≡ D²_{ij} {R_{ij} + κ (M²_{ij} + M²_{ji})}   and   D²_{ij*} < T2,   (2)

where κ and T2 are constants, and D²_{ij} ≡ ||GE_i − GE_j||² + η ||HSL_i − HSL_j||². We denote by GE_i, HSL_i the averages of GE and HSL over region i, by R_{ij} ≡ n_i n_j/(n_i + n_j), by n_i the number of pixels in region i, and by M_{ij} the Mahalanobis distance from region i to region j.
Figure 1. Gabor filter outputs GE_{90°,h}.

Figure 2. Typical contours shown by white lines for a vehicle.

3. DETECTION OF VEHICLES
In this section, we find candidate regions for a vehicle by using the result of segmentation and Gabor filter outputs, and select a region truly corresponding to a vehicle by using affine moment invariants.
3.1
Extraction of candidate regions using segmented regions and Gabor filter outputs
From the characteristic of the Gabor filter outputs, it is found that the Gabor filter output GE90o h takes a large value at the boundary between a vehicle and a nearest road region as illustrated in Fig. 1. Hence, we extract candidate regions for a vehicle by the following algorithm:
Step 1) Find a point (XM, YM) with the greatest value GMAX of GE_{90°,h} on the boundary of the nearest road region. Here, the nearest road region is found from the segmented regions obtained in section 2.
Step 2) Consider a rectangular window whose lower left coordinate is (XM − cw, YM) and whose upper right coordinate is (XM + cw, YM + ch). Here, the origin of the image coordinates is set at the lower left point of the image. We set cw = 35 and ch = 50 for an image of size 160 × 120 pixels.
Step 3) Extract the regions whose centers of gravity are in the rectangular window obtained in step 2. The extracted regions are treated as candidate regions corresponding to a vehicle as illustrated in Fig. 3(a).
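A small sketch of Steps 2 and 3 is given below; the region representation (a dict with centre-of-gravity fields) is purely illustrative.

```python
def candidate_window(xm, ym, cw=35, ch=50):
    """Rectangular window of Step 2 around the strongest boundary response;
    the image origin is at the lower-left corner."""
    return (xm - cw, ym), (xm + cw, ym + ch)

def candidate_regions(regions, window):
    """Step 3: keep regions whose centre of gravity falls inside the window.
    A region is assumed to be a dict with 'cx' and 'cy' centre fields."""
    (x0, y0), (x1, y1) = window
    return [reg for reg in regions
            if x0 <= reg["cx"] <= x1 and y0 <= reg["cy"] <= y1]
```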
Figure 3. Example of candidate regions (a) and selected regions (b).

3.2 Selection of candidate regions using affine moment invariants
After the segmentation, a vehicle is in general segmented into multiple regions. Among these, some regions have typical contours corresponding to a vehicle, such as side body, side window, front window, bonnet and so on as illustrated by white lines in Fig. 2(a)∼(d). The typical contours are used for selecting regions corresponding to a vehicle.
3.2.1 Preselection of candidate regions by P²/A. We denote by P the peripheral length of a region contour and by A the area of the region. Then P²/A is an indicator of the complexity of the shape of the region. Hence, using P²/A, we preselect the candidate regions by removing very complicated regions with an extremely large P²/A. 3.2.2 Affine moment invariants. The viewpoint-related changes in images are expressed by affine transformations if the distances from a viewpoint to objects are sufficiently long in comparison with the differences of depths of objects. Therefore, we use affine moment invariants in order to select regions corresponding to a vehicle, because typical contours of such
regions are considered to be planar. Affine moment invariants are given by7

I1 = c1 (μ20 μ02 − μ11²) / μ00⁴,   (3)

I2 = c2 (μ20 (μ21 μ03 − μ12²) − μ11 (μ30 μ03 − μ21 μ12) + μ02 (μ30 μ12 − μ21²)) / μ00⁷,   (4)

I3 = log | (μ30² μ03² − 6 μ30 μ21 μ12 μ03 + 4 μ30 μ12³ + 4 μ21³ μ03 − 3 μ21² μ12²) / μ00¹⁰ |.   (5)

Here μ_pq denotes the central moment and ρ(x, y) the indicator function. The coefficients c_i, i = 1, 2 are used to make the values of I_i, i = 1, 2 comparable with each other. The coefficients are set as c1 = 10.0 and c2 = 1000.0.
3.2.3 Selection of candidate regions with affine moment invariants. In the learning process, the affine moment invariants I*_{im}, i = 1, 2, 3, m = 1, 2, .., M, of the typical contours are stored as reference models. For each region j which satisfies the condition n_j > 40 in a segmented test image, we calculate the affine moment invariants I_i, i = 1, 2, 3, and select the candidate regions which satisfy the following condition:

((I1 − I*_{1m})² + (I2 − I*_{2m})² + (I3 − I*_{3m})²)^{1/2} < ε   for some m = 1, .., M,   (6)

where ε is a threshold and is set as ε = 2.0 empirically. Figure 3(b) shows an example of a selected region found from the candidate regions in Fig. 3(a).
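The invariants of Eqs. (3)-(5) and the test of Eq. (6) can be sketched as follows; the moments are computed here from a filled region mask rather than from the contour samples used in the paper, and the helper names are illustrative.

```python
import numpy as np

def affine_invariants(mask, c1=10.0, c2=1000.0):
    """Compute I1, I2, I3 (Eqs. 3-5) from a binary indicator image of a region."""
    ys, xs = np.nonzero(mask)
    m00 = float(len(xs))
    xc, yc = xs.mean(), ys.mean()
    def mu(p, q):
        return float(np.sum((xs - xc) ** p * (ys - yc) ** q))
    mu20, mu02, mu11 = mu(2, 0), mu(0, 2), mu(1, 1)
    mu30, mu21, mu12, mu03 = mu(3, 0), mu(2, 1), mu(1, 2), mu(0, 3)
    i1 = c1 * (mu20 * mu02 - mu11**2) / m00**4
    i2 = c2 * (mu20 * (mu21 * mu03 - mu12**2)
               - mu11 * (mu30 * mu03 - mu21 * mu12)
               + mu02 * (mu30 * mu12 - mu21**2)) / m00**7
    i3 = np.log(abs((mu30**2 * mu03**2 - 6 * mu30 * mu21 * mu12 * mu03
                     + 4 * mu30 * mu12**3 + 4 * mu21**3 * mu03
                     - 3 * mu21**2 * mu12**2) / m00**10))
    return np.array([i1, i2, i3])

def matches_reference(inv, references, eps=2.0):
    """Eq. (6): accept the region if it is close to any stored reference model."""
    return any(np.linalg.norm(inv - ref) < eps for ref in references)
```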
3.3
Vehicle detection
Based on the regions selected by affine moment invariants, a vehicle is detected by the following algorithm:
Step 1) Check whether there is a region selected by affine moment invariants in subsection 3.2.3 or not.
Step 2) If there exists a selected region, then it is decided that there is a vehicle in the rectangular window. Draw a rectangle.
Step 3) If there exists no selected region, then it is decided that there is no vehicle in the window. Draw no rectangle on the grayscale image.
4.
EXPERIMENTAL RESULTS
We use 121 real images of road scenes including vehicles taken by a 3-CCD camera (Sony DCR-VX1000) at the height of human eye. The parameters are empirically set as follows: η = 1.0, κ = 1.0, T1 = 0.03, T2 = 0.7. We use the reference models for eight typical contours of a vehicle. Some experimental results are shown in Fig. 4. In this figure, a white rectangular window is shown when the method decides that there exists a vehicle, by using affine moment invariants of the reference models for a vehicle. The proposed method successfully detects vehicles for 120 real images among 121 images.
Figure 4. Some experimental results.

5. SUMMARY
We have proposed a new method for detecting a vehicle using Gabor filters and affine moment invariants. From experimental results, it is found that the proposed method can successfully detect vehicles in 120 images among 121 images. We are planning to check the performance of the proposed method for images without vehicles. It is planned to implement the present method with an audio communication link to blind people as a travel aid.
ACKNOWLEDGMENTS The authors are grateful for the support of the Japan Society for the Promotion of Science under Grant-in-Aid for Scientific Research (No. 16500110).
REFERENCES 1. Z. Duric, et al., Estimating relative vehicle motions in traffic scenes, Pattern Recognition, Vol.35 (2002) pp.1339-1353. 2. D. W. Murray and B. F. Buxton, Scene segmentation from visual motion using global optimization, IEEE Trans. on PAMI, Vol.9, No.2 (1987) pp.220-228. 3. X. Li, Z. Q. Liu and K. M. Leung, Detection of vehicles from traffic scenes using fuzzy integrals, Pattern Recognition, Vol.35 (2002) pp.967-980. 4. A. K. Jain, N. K. Ratha and S. Lakshmanan, Object detection using Gabor filters, Pattern Recognition, Vol.30 (1997) pp.295-309. 5. G. Adiv, Determining three-dimensional motion and structure from optical flow generated by several moving objects, IEEE Trans. on PAMI, Vol.7, No.4 (1985) pp.384-401. 6. K. Kawada and S. Arimoto, Hierarchical texture analysis using Gabor expansion, J. of the Institute of Electronics, Information and Communication Engineers, Vol.J78-DII, No.3 (1995) pp.437-444. 7. J. Flusser and T. Suk, Pattern recognition by affine moment invariants, Pattern Recognition, Vol.26 (1993) pp.167-174.
POINTING GESTURE VISUAL RECOGNITION BY BODY FEATURE DETECTION AND TRACKING
Sébastien Carbini, Jean Emmanuel Viallet and Olivier Bernier
France Telecom Research and Development TECH/IRIS/VIA, Technopole Anticipa, 2 Avenue Pierre Marzin, 22307 Lannion Cedex - France
{sebastien.carbini, jeanemmanuel.viallet, olivier.bernier}@rd.francetelecom.com Abstract
Among gestures naturally performed by users during communication, pointing gestures can be easily recognized and included in more natural new Human Computer Interfaces. We approximate the eye-finger pointing direction of a user by detecting and tracking, in real time, the 3D positions of the centre of the face and of both hands; the positions are obtained by a stereoscopic device located on the top of the display. From the head position and biometric constraints, we define both a rest area and an action area. In the latter area, the hands are searched for and the pointing intention is detected. The first hand spontaneously moved forward by the user is defined as the pointing hand whereas the second detected hand, when it first moves forward, is considered as the selection hand. Experiments on spatial precision, carried out with a group of users, show that the minimum size of an object to be easily pointed at is some 1.5 percent of the diagonal of the large display.
Keywords:
Non intrusive Human Computer Interface, pointing gesture, bi-manual interaction, detection and tracking of face and hands.
1.
INTRODUCTION
Computer vision applied to gesture recognition allows users to freely interact unencumbered, without carrying specific devices or markers. Amongst gestures occurring during non-verbal communication, pointing gestures can be easily recognized and included in more natural new human computer interfaces. Several studies have been performed in the field of vision based on pointing gesture recognition. In this paper, we consider a pointing gesture, not as spatio-temporal trajectory to be recognized, but together with 1,2,3 as the instantaneous pointed location on a display. In 3 , the pointed direction is given by the forearm of the user, estimated with a 3D model of the bust and of the arm. From an image processing point of view, the forearm exhibits few discriminat-
ing features and is thus difficult to detect when one points in the direction of the camera. On the other hand, the face has a stable and characteristic shape that can be detected and tracked. Without visual feedback, a pointing method with an axis that does not include an alignment with an eye (extended finger axis, forearm or extended arm axis) is less precise than a pointing gesture which involves aiming at a target using both one's eye and typically the tip of a finger. We propose to use an eye-alignment aiming convention and to approximate the "eye-tip of the finger" pointing direction by the face-hand direction. The first hand spontaneously moved towards the display by the user is defined as the pointing hand whereas the second detected hand (Figure 1-a), when it first moves forward, is considered as the selection hand, similar to a mouse click (Figure 1-b), or as a control of a third axis useful for virtual 3D world interactions.
Figure 1. (a): States of the pointing system. (b): Pointing and selection hands in action.

2. METHODOLOGY
In the pointing method described here, it is not a necessity to know beforehand the dominant hand of a user. It is both suited for right-handed or left-handed users. Furthermore, no calibration step or manual initialisation is needed. The face of the user is automatically detected as it enters the field of view of the camera, provided that the image face width is greater than 15 pixels and the out of plane rotation of the face is below 50 degrees 4 . The user begins to interact with the display, with the hand he or she favours to use. As soon as a body part is detected (face or hand), the body part is continuously tracked until tracking failure is automatically detected: then body part detection is retriggered. Pointing is taken into account only in the action area, where the hand is sufficiently in front of the head. Otherwise the hand lies in the rest area (hand on a hip for instance) and the user is able not to interact continuously (Figure 2-a).
2.1
Face detection and tracking
Since faces have relatively stable characteristics across persons, the user’s face is first detected. Detection is carried out with a set of neural networks4 .
The face detector input is a 15x20 pixels grey level image and it answers whether this image is a face or not. Since there is a trade-off between the false alarm rate and the detection rate we favour a very small false alarm rate. Tracking is done by fitting a statistical model to observations 5 . The model is based on skin-color and depth histograms. For skin colour detection (Figure 2-c), a simple skin colour look-up table is used. It was previously constructed from images of various persons under various lighting conditions. The disparity (Figure 2-d) is obtained with a stereo camera.
2.2
Hand detection and tracking
Contrary to faces, hands exhibit extremely variable shapes as seen from a camera and is thus difficult to detect, specifically at low image resolution. Once the face is detected and using disparity information, the 3D position of face is obtained. Moving skin colour zones are taken as hand candidates. Biometric constraints limit the search space to a centroid centred on the face. Furthermore, it is reasonable to admit that, when interacting with the display, the user moves its hand towards the display and sufficiently away from its face (more than 30 cm). Thus the hand search space is restricted to a volume delimited by a sphere and a plane, a volume called the ’action area’ (Figure 2-a). A hand
Figure 2. (a): 1. Rest area, 2. Action area, 3. Hand unreachable area, 4. Display - (b): Face position (square), pointing hand position (circle), selection hand position (cross) - (c) Skin color - (d) Disparity in both rest and action area.
is detected as a skin color moving zone, in the action area, the closest to the display. The first detected hand is considered to be the pointing hand. Then the second hand is detected in a similar manner after having previously discarded both the area of the first detected hand and the area of the corresponding arm from the search space. Indeed, a naked arm or an arm covered by a skin colour cloth could be mistakenly selected as the second skin colour moving zone closest to the display. In order to detect the arm, a skin colour zone is first initialised on the hand. Then it is merged, at each iteration, with skin colour neighbouring pixels with continuous depth values and located in front of the face. The arm detection iterates until the zone no longer grows. The second hand is used to control a command as it enters the action area or to control a third axis by changing its distance to the display.
Tracking of the two hands and their pointing-command labels is necessary, otherwise as soon as the selection hand gets closer to the display than the other hand, their actions would swap. Hand tracking, initialised upon hand detection, is performed in a similar manner as face tracking. With rolled-up sleeves or skin colour clothes, tracking can possibly position itself anywhere along the arm. In order to find the hand, the tracking solution is oriented, using the disparity gradient, towards the arm extremity the closest to the display. This reframing is inadequate for small gradients, but then the forearm, parallel to the display, is in the rest area and pointing is not taken into account.
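The action-area test described in Section 2.2 can be approximated as below; the sphere-plus-plane geometry and the 30 cm threshold follow the text, but the exact plane placement used by the authors is not specified, so this is only an assumed formulation with illustrative names.

```python
import numpy as np

def in_action_area(hand_xyz, face_xyz, display_normal, min_dist=0.30):
    """True when a 3D hand position lies in the 'action area': farther than
    min_dist (metres) from the face and sufficiently towards the display."""
    hand = np.asarray(hand_xyz, float)
    face = np.asarray(face_xyz, float)
    n = np.asarray(display_normal, float)
    n = n / np.linalg.norm(n)            # unit vector pointing from face to display
    outside_sphere = np.linalg.norm(hand - face) > min_dist
    beyond_plane = np.dot(hand - face, n) > min_dist
    return outside_sphere and beyond_plane
```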
2.3
Occlusions and tracking failures
Since the different body parts tracking rely on similar features, in case of occlusion of the face or a hand by another hand, only one of the body part is correctly tracked whereas the other is erroneously anchored on the first and cannot any longer be controlled by the user. The tracked zones remain fused even after the end of the physical occlusion. In pointing situation, the most frequent case of occlusion occurswhen a hand passes in front of the head in the camera direction. In order to solve this problem, we consider that it is the hand that moves and not the face: face tracking is temporarily interrupted until the end of the occlusion and for the hand tracking the search space is further constrained in order to discard pixels with disparity values too close to the face. The other cases of occlusions are more difficult to deal with and we first identify fusion of body parts. In case of a face-hand fusion, experiments show that usually it is the estimated hand that positions itself on the face and stays there, whereas if the estimated face is anchored on the physical hand then face tracking quickly fails, tracking failure is automatically detected and forces the reinitialisation of the face by detection. Therefore, in case of detected face-hand fusion, the face is kept and the hand destroyed and detected again. Hand-hand fusion leads to the elimination of both hands and hand detection is launched. Face and hand tracking failure, either in the rest or the action area, are automatically detected by estimating the number of skin-colour valid disparity pixels. If lower than a threshold, the body part tracking is considered to have failed and detection is automatically re-triggered.
3.
EXPERIMENTAL SET-UP AND PERFORMANCES
The mean performance of the pointing system is estimated by several experiments on spatial and temporal precision, carried out by a group of 14 users. A user is located at 1.5 m from a retro-projected image of 2 x 1.7 m size. A Bumblebee stereo camera6, used with a resolution of 160x120, is located above the display at an angle of 45 degrees (Figure 3-a) with the
display in order to maximize hand displacement on the image and thus spatial precision when the user points towards the four corners of the display. In this set-up, a 1 cm change of the face position (with a hand still) roughly corresponds to a 1 cm change of the aimed location on the display. The system performs at 25 Hz on a Pentium IV (3 GHz). In order to characterize temporal stability, a user is asked to aim at a cross at the centre of the display for about 10 seconds. The mean temporal stability 2.29 cm and a typical result is given in Figure 3-b. To evaluate the spatial precision, each user has to follow as closely as possible the path of rectangles, centred on the display and of decreasing size. The mean distance of the aimed location to the closest rectangle boundary point are estimated. The typical track given in 3-c is similar to the one obtained by 3 . One may notice that the
Figure 3. Pointing gestures experiments (a) Experimental setup: camera C, retroprojector RP and display D. (b) Temporal stability (mean distance = 2.29 cm). (c) Spatial precision path for a rectangle size of 102x85 cm (mean distance = 2.75 cm). (d) Mean distance D to the rectangle (of width W and height H).
Figure 4. Pointing gesture and target size (a) Typical inter-target trajectory: Reaction time R (distance>0,9), Trip time T (0,1 0
(2)
Note:
Figure 3. An example with 4 points.
with v_i the directing vector of the line (p'_i p_i). The line (p'_i p_i) is the reflection of the line (p'_i p''_i), so their directing vectors are linked: if we note v''_i the directing vector of the line (p'_i p''_i) and N_i the normal of the mirror at the point p'_i, we can write:

v_i = v''_i − 2 N_i (N_i · v''_i)   (3)

Note that all directing vectors are assumed to be unit. To locate the N points p_i in space you have to find all λ_i; you have to build a system of equations with all λ_i and solve it. Note that

p_i p_j = p_j − p_i   (4)
       = p'_j + λ_j v_j − (p'_i + λ_i v_i)   (5)
→ You can find the relation that expresses constraints on vector − p− i pj . Solving the obtained system will give you analytical expression of all λi . Many con→ straints can be used like the length of vectors (||− p− i pj || = Lij ) . The simplest way to obtain systematically a solution is to express linear combination → −−→ −−→ p− (cj − i pj + ck pi pk = pi pl ), and to solve the system to express all λ relatively to one of them. By the use of an additional equation (||pi pj || = Lij for example), you can find the expression of the last λ. Note that, as points are well known in the object coordinate space, coefficients c and Lij are known, the only unknown factors are the λ... The complexity of the solution depends on expressions you choose. Once you get an expression for all λ, you get the distance between points in space and the center of the mirror then, you can estimate the transformation between the sensor coordinate space and the object coordinate space. This method allow you to get an analytical solution for N > 3 if the points are coplanar and for N > 4 otherwise and for N = 3 if the points are aligned. Let us see on a simple example how to do.
2.2
Example
Suppose you have 4 points (p1, p2, p3, p4) coplanar in space and forming a parallelogram (Figure 3). Their projections onto the mirror are noted p'1, p'2, p'3
and p'4 respectively, and their projections on the image plane are noted p''1, p''2, p''3 and p''4 respectively. We can compute v_i as explained before and then compute the vectors p1p2 and p3p4. As the points form a parallelogram in space we can write:

p1p2 + p3p4 = 0   (6)

this gives us the system (one equation per coordinate x, y, z):

p'2 + λ2 v2 − (p'1 + λ1 v1) + p'4 + λ4 v4 − (p'3 + λ3 v3) = 0   (7)

rewritten differently:

λ2 v_x2 − λ3 v_x3 + λ4 v_x4 = λ1 v_x1 + v_0x
λ2 v_y2 − λ3 v_y3 + λ4 v_y4 = λ1 v_y1 + v_0y   (8)
λ2 v_z2 − λ3 v_z3 + λ4 v_z4 = λ1 v_z1 + v_0z

with

v_0 = p'1 − p'2 + p'3 − p'4   (9)

if we pose:

Δ_ijk = det | v_xi  v_xj  v_xk |
            | v_yi  v_yj  v_yk |   (10)
            | v_zi  v_zj  v_zk |
the solution of the system (8) is

λ2 = (Δ340 + Δ134 λ1)/Δ234   (11)
λ3 = (Δ240 + Δ124 λ1)/Δ234   (12)
λ4 = (Δ230 + Δ123 λ1)/Δ234   (13)
We have an expression of all λ_i relative to one of them. All expressions are divided by Δ234, and Δ234 is never equal to zero because the vectors v_i are not collinear, so there is no degenerate solution. We still have to find an expression of the last λ. If we pose L14 the length of the segment [p1 p4], we can write:

||p1p4|| = L14   (14)

⇔ ||p'4 + λ4 v4 − (p'1 + λ1 v1)|| = L14   (15)

⇔ λ1² − 2 λ1 λ4 (v1 · v4) + 2 (p'1 − p'4) · (λ1 v1 − λ4 v4) + λ4² + p'1² + p'4² − 2 p'1 · p'4 = L14²   (16)
(with ||v_i|| = 1 and L14 > 0)

Combined with the expression of λ4 (Eq. 13), we get:

A λ1² + B λ1 + C = 0   with λ1 > 0   (17)

and

A = 1 + (Δ123/Δ234)² − 2 (Δ123/Δ234) (v1 · v4)   (18)

B = 2 Δ230 Δ123/Δ234² + 2 (p'1 − p'4) · (v1 − (Δ123/Δ234) v4) − 2 (Δ230/Δ234) (v1 · v4)   (19)

C = −2 (Δ230/Δ234) (p'1 − p'4) · v4 + (p'1 − p'4)² + (Δ230/Δ234)² − L14²   (20)
This polynomial expression (Eq. (17)) can easily be solved and gives an expression of λ1. This solution, combined with Eqs. (11) to (13), gives the expression of all λ_i. Note that Eq. (17) has two solutions, but the correct one is the positive one. We have explained how to get an analytical solution of the P-N-P and given an example of a simple solution with N = 4 and a shape constraint. This constraint was added to simplify the expression of the solution and to make it shorter. If you want to get the general solution (without any constraint on shape), you just have to change equation (6) into

K1 p1p2 + K2 p1p4 − p1p3 = 0   (21)
Now let us see simplifications that bring the single viewpoint constraint.
3.
SIMPLIFICATION OF THE SOLUTION WITH THE SINGLE VIEWPOINT CONSTRAINT
The single viewpoint constraint ensures the presence and the uniqueness of the optical center of the mirror (Baker and Nayar, 1999). The computation of reflection on the mirror is simplified and particularly we are not obliged to compute the normal of the mirror to compute the reflection (Figure 4). If we note F the optical center of the mirror, the Eq. (1) can be rewritten:
p_i = F + λ_i v_i   (22)

with

v_i = (p'_i − F)/||p'_i − F||   (23)

Note that the points p'_i are always known, as they are at the intersection of the line (F p''_i) and the mirror. The expression of the vectors p_i p_j (Eq. (5)) is then simplified:

p_i p_j = F + λ_j v_j − (F + λ_i v_i)   (24)
       = λ_j v_j − λ_i v_i   (25)
Figure 4. Projection of points p_i according to the optical center of the mirror.
If the preceding example with the parallelogram p1 p2 p3 p4 is taken again, system (8) becomes:

λ2 v_x2 − λ3 v_x3 + λ4 v_x4 = λ1 v_x1
λ2 v_y2 − λ3 v_y3 + λ4 v_y4 = λ1 v_y1   (26)
λ2 v_z2 − λ3 v_z3 + λ4 v_z4 = λ1 v_z1

the solution of this system is then:

λ2 = (Δ134/Δ234) λ1   (27)
λ3 = (Δ124/Δ234) λ1   (28)
λ4 = (Δ123/Δ234) λ1   (29)
In the same way, Eq. (14) becomes:

||p1p4|| = L14 ⇔ ||λ4 v4 − λ1 v1|| = L14   (30)

⇔ λ4² − 2 (v1 · v4) λ1 λ4 + λ1² = L14²   (31)
(because ||v_i|| = 1 and L14 > 0)

combined with the new solution (29) we get:

λ1² = L14² / δ   (32)

δ = 1 − 2 (v1 · v4) Δ123/Δ234 + Δ123²/Δ234²   (33)

If we introduce this solution in Eqs. (27)-(29), we get a solution for each λ:

λ1 = L14 / √δ   (34)
λ2 = (Δ134/Δ234) L14 / √δ   (35)
λ3 = (Δ124/Δ234) L14 / √δ   (36)
λ4 = (Δ123/Δ234) L14 / √δ   (37)
The solution is then drastically simplified. (Note that this solution also works with planar camera).
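Under the reconstruction above, the single-viewpoint solution of Eqs. (32)-(37) can be sketched as follows; the function names are illustrative, the direction vectors are assumed to be unit, and the vertex ordering follows the example (L14 is the length of segment [p1 p4]).

```python
import numpy as np

def delta(v, i, j, k):
    """Determinant Δ_ijk of the three direction vectors taken as columns (Eq. 10)."""
    return float(np.linalg.det(np.column_stack((v[i], v[j], v[k]))))

def lambdas_single_viewpoint(v1, v2, v3, v4, L14):
    """Depths λ1..λ4 along the rays F + λ v_i for a parallelogram target,
    single-viewpoint case (Eqs. 32-37)."""
    v = {1: np.asarray(v1, float), 2: np.asarray(v2, float),
         3: np.asarray(v3, float), 4: np.asarray(v4, float)}
    d234 = delta(v, 2, 3, 4)
    b = delta(v, 1, 2, 3) / d234                 # Δ123 / Δ234
    dlt = 1.0 - 2.0 * float(np.dot(v[1], v[4])) * b + b ** 2   # Eq. (33)
    lam1 = L14 / np.sqrt(dlt)                    # Eq. (34)
    lam2 = delta(v, 1, 3, 4) / d234 * lam1       # Eq. (35)
    lam3 = delta(v, 1, 2, 4) / d234 * lam1       # Eq. (36)
    lam4 = b * lam1                              # Eq. (37)
    return lam1, lam2, lam3, lam4
```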
4.
CONCLUSION
We have presented a new way to solve the perspective-N-point problem. The solution is strictly analytical and works well with every kind of catadioptric panoramic sensors (but is not restricted to them). Expressions of solutions are simplified when the sensor respects the single viewpoint constraint. The choice of the geometrical form of points pi makes it possible to get a simple solution. This solution has a constant execution time. The solution can be used directly or can be used to initialize a minimisation process.
REFERENCES Abidi, M. A. and Chandra, T. (1995). A new efficient and direct solution for pose estimation using quadrangular targets: Algorithm and evaluation. IEEE Trans. Pattern Anal. Mach. Intell., 17(5):534–538. Alter, T. D. (1994). 3-d pose from 3 points using weak-perspective. IEEE Trans. Pattern Anal. Mach. Intell., 16(8):802–808. Ameller, M-A., Quan, Long, and Triggs, B. (2002). Camera pose revisited: New linear algorithms. In 14„eme Congr„es Francophone de Reconnaissance des Formes et Intelligence Artificielle. Paper in French. Baker, Simon and Nayar, Shree K. (1999). A theory of single-viewpoint catadioptric image formation. International Journal of Computer Vision, 35(2):175 – 196. Carceroni, Rodrigo L. and Brown, Christopher M. (1997). Numerical methods for model-based pose recovery. Technical Report TR659, University of Rochester - Computer Science Department. Fabrizio, Jonathan, Tarel, Jean-Philippe, and Benosman, Ryad (2002). Calibration of panoramic catadioptric sensors made easier. In Proceedings of IEEE Workshop on Omnidirectional Vision (Omnivis’02), pages 45–52, Copenhagen, Denmark. IEEE Computer Society. Fischler, Martin A. and Bolles, Robert C. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395. Haralick, Robert M, Lee, Chung Nan, Ottenberg, Kartsen, and Nölle, Michael (1991). Analysis and solutions of the three points perspective pose estimation problem. In Computer Vision and Pattern Recognition, pages 592–598. Horaud, Radu, Conio, Bernard, Leboulleux, Olivier, and Lacolle, Bernard (1989). An analytic solution for the perspective 4-point problem. Computer Vision, Graphics and Image Processing, 47:33–44. Quan, Long and Lan, Zhong-Dan (1999). Linear n-point camera pose determination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8):774–780. Vstone-Corporation (2000). http://www.vstone.co.jp.
DYNAMIC SHADOW MAP REGENERATION WITH EXTENDED DEPTH BUFFERS
Rafal Wcislo, Rafal Bigaj
AGH - University of Science and Technology, Cracow, Poland
[email protected], [email protected]
Abstract

The original shadow mapping algorithm assumes that the construction of the light's depth buffer and the screen rendering are sequential jobs that cannot be interlaced. The dynamic shadow map redrawing method presented in this article is a mutation of the base shadowing method that allows one to construct a variable depth buffer, giving a flexible way of adding primitives in place to an earlier rendered partial scene. A fast, innovative method for re-shading the previously lightened areas of the screen, which can be applied in real-time environments, will be presented.

1. INTRODUCTION
Many methods solving shadow generation problem have been developed4 : simple fake shadows, shadow volumes, area subdivision, depth buffers, ray tracing and many more. The ray tracing algorithms definitely give the most realistic shadows. However, they are still too slow for the real-time environments. The most widely studied for the last dozen years were methods based on depth buffers (also named z-buffers), that solves the shadowing problem by answering the question: “is the point visible from the light source”. Hansong Zhang in his article Forward Shadow Mapping,2 introduces the new method, which changes a little the view of shadow mapping by inverting the rendering phase order. In contradistinction to the standard algorithm neither mapping from a screen buffer to a shadow map is done nor texture mapping hardware techniques are used. The concept of the method is given below: 1 Render the scene from the light’s view to get the depth buffer.
2 Render the screen buffer (from the eye's view).
3 For each light's z-buffer point, map it to the corresponding screen area and, for each visible (in the eye's view) point from that region, update the pixel color with the light intensity.
The main problem with the algorithm above is mapping the light's z-buffer points to screen areas. This problem is named pixel warping. Because z-buffer points are not infinitely small and the shadow map and the screen buffer have different sampling rates, for each shadow map point the corresponding screen area must be found. The computation of the exact region is too expensive, therefore an estimation of the area is done instead. To avoid artifacts in regions between mapped areas, overestimation is commonly used.
2. PROBLEM DESCRIPTION
The new approach adds functionality that allows for dynamic 3D scene creation. In the base method1 (as in its other variants) the whole model structure of the scene must be known before rendering time: shadow map construction and screen buffer creation are sequential jobs that cannot be interleaved, and each model change (e.g. an object addition) requires regeneration of the whole scene. The new rendering solution allows objects to be added to the scene dynamically. This feature changes the main concept of the shadow mapping method and requires extensions of the depth buffer data. The price for the new functionality is increased memory usage and computational complexity. Usually this disadvantage is outweighed by the low scene regeneration cost. Adding an object requires essentially the same processing time for each primitive. This property can be exploited in scenes with a complicated static background: pre-generated buffers that include the static part of the objects' model can be computed once and then reused every frame, and only objects that change their positions require redrawing.
3. ALGORITHM OVERVIEW
The method introduced by Williams1 works in two phases. The first one is the generation of the shadow maps and the view z-buffer for all objects on the screen. The second phase is the core part of the 3D model drawing, which uses the previously collected depth buffers. This action sequence is static, and each change in the model requires repeating the whole process. The dynamic shadow map generation method is instead divided into n separate phases, where n is the number of primitives in the 3D model. No ordering of the input objects is required.
Figure 1. Dynamic shadow map construction.
3.1 New method's features: a sample
An overview of the method is presented in Figure 1. The scene model contains three objects (labelled with capital letters A, B, C), one light source (situated on the right side of the image) and a camera (at the bottom). The three objects are sequentially drawn on the light's shadow map and on the screen; the rendering order is shown. Nothing special happens after drawing primitives A and B: the depth buffers are constructed in the normal way. The interesting situation occurs while adding object C, which hides the middle part of object B in the light's view. While rendering object C from the light's view, a shadow must appear on the screen area (marked with letter 'c' in the figure) that has been lit since object B was drawn. While drawing object C on the shadow map, the points that cover object B's pixels are mapped to the screen buffer and changed to their shadowed form.
3.2 Algorithm phases
Each drawing phase consists of two subphases, which are analogous to the two parts of the base algorithm:
1 Drawing a separate primitive on the extended lights' depth buffers and un-lighting the newly shadowed screen areas.
2 Rendering the object on the screen:
(a) For each screen point of the primitive, map it to the light's view and get the corresponding depth buffer value.
(b) Get the value of the "is lit" predicate.
(c) Store the required data in the extended buffers.
(d) Set an adequate color for the screen point.
The first step is the generation of a single primitive on all lights' z-buffers. For each pixel drawn on the shadow map that turns out to be the nearest to the light and has a reference to a screen point, the corresponding screen points must be un-lighted. After the first step all buffers are ready to be used; the only action needed is rendering the primitive on the screen. While drawing the object, some information is stored in the extended buffers and later used for dynamic shadow addition.
3.3 Extended buffers
Two additional buffers are used in the shadow generation process: one associated with the light's shadow map and one with the screen buffer. All values of the extended buffers are specified in the second subphase of the primitive drawing procedure (rendering on the screen). While rendering any shape on the screen, a check is executed which looks at the shadow map and compares the depth of the mapped view-model point with the depth of the light-model point stored in the z-buffer. Information about the screen points mapped to the shadow map is thus fully known and can be stored in the additional space of the light's depth buffer. An extended shadow map cell consists of: the depth value (the ordinary value of the shadow map), a screen point referring flag (set if any screen point refers to the light's model point covered by this shadow buffer cell), a reference counting flag (set if the point is referred to by more than one screen point), and the coordinates of the referring screen point.
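A possible in-memory layout of one extended shadow-map cell, following the description above; the field names are illustrative assumptions, not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ExtendedShadowMapCell:
    depth: float = float("inf")          # ordinary shadow-map depth value
    referenced: bool = False             # screen point referring flag
    multi_referenced: bool = False       # reference counting flag (more than one referrer)
    screen_point: Optional[Tuple[int, int]] = None  # referred screen point coordinates
```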
3.4 Additional screen buffer
If a screen point that falls under the rendered primitive turns out to be lit by a given light source, the corresponding depth buffer cell and the light intensity (light color component) are available. In the object drawing subphase an additional screen buffer is created which holds these values. Each buffer cell stores two values: the id of the corresponding depth buffer cell and the light color component (the value derived from the lighting computation).
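A matching sketch of one cell of the additional screen buffer described above; again the names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ExtendedScreenCell:
    shadow_cell_id: int       # id of the corresponding light depth-buffer cell
    light_component: float    # stored light color contribution, used later for un-lightening
```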
3.5 Newly shaded area un-lightening
The image shown in Figure 2a was rendered with a simple method that reconstructs the screen points which should become shaded during the regeneration of the shadow map. If a newly drawn object covers existing shadow map data, then the color of the referred screen point must be reconstructed. The un-lightening of a pixel uses the light's color component stored in the additional screen buffer.
Figure 2a. Simple method artifacts.
Figure 2b. Additional screen buffer.
As presented in Figure 2a, this un-lightening method is highly imperfect. The reconstructed shadow is not a compact area but a set of scattered pixels (due to z-buffer under-sampling). The special actions presented in the next section must be adopted to solve this problem.
3.6 Compact shadow area reconstruction
The pixel-to-area projection is not a new problem. In the introduction to this article the pixel warping method (used in Forward Shadow Mapping2) was mentioned; it is based on image warping techniques3. That algorithm uses 3D transformations to compute the set of screen pixels that correspond to a given shadow map point. The new solution of backward pixel warping uses information collected during the previous screen data generation (backward means here that the method is based on values stored during forward pixel mapping). When using the references stored in the light's depth buffer, the mapping from the light's view to the eye's view is unnecessary. As described in 3.5, this data makes it possible to redraw single pixels of the screen buffer; the remaining problem is to grow a compact area from such a point. Figure 2b shows the colored additional screen buffer, which stores the identifiers of the corresponding light's points. Screen points with the same light z-buffer point id form compact areas which represent the screen regions corresponding to single pixels of the shadow map. This information can easily be used during shadow regeneration with the well-known, fast and easy flood-fill algorithm. The only modification to the simple method (3.5) is checking whether the repainted point (referenced by the z-buffer pixel that is overwritten) is referred to by more than one point. There is no gap in a single area created by the screen points with the same id, therefore flooding the shadow starting at any place of the region gives the expected effect. This method is fast enough for real-time applications: in the flood-fill algorithm 8 comparisons (with all neighbours) are needed for each repainted pixel, so this action does not increase the cost of the method much beyond its base form.
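A minimal runnable sketch of the backward pixel warping just described: starting from the screen point referenced by an overwritten shadow-map cell, the region of screen pixels carrying the same shadow-cell id is flood-filled and the stored light contribution is subtracted. The array names (cell_ids, light_component) stand in for the additional screen buffer of Section 3.4 and are assumptions, not the paper's API.

```python
from collections import deque

def unlight_region(screen, cell_ids, light_component, seed, cell_id):
    """Flood-fill (8-connectivity) over pixels whose stored shadow-cell id equals
    cell_id, removing that light's stored contribution from the screen buffer."""
    h, w = cell_ids.shape
    queue, visited = deque([seed]), {seed}
    while queue:
        y, x = queue.popleft()
        screen[y, x] -= light_component[y, x]
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w
                        and (ny, nx) not in visited
                        and cell_ids[ny, nx] == cell_id):
                    visited.add((ny, nx))
                    queue.append((ny, nx))
    return screen
```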
3.7 Method's problems and their solutions
3.7.1 Overwritten points differentiation. In some circumstances, while generating the extended buffers for a subsequent primitive, an unpleasant situation might occur. This can happen when an area of the screen that corresponds to a region of the light's z-buffer is overwritten by data resulting from a given object. While adding the new object to the shadow map, some of its edge points can be falsely redrawn. This happens because of z-buffer quantisation errors: a shadow map edge point is mapped to a screen area where some pixels lie on one triangle and others on the second one. If the stored z-buffer cell ids are constructed only from the cell's index in the buffer, then the situation described above might occur. To avoid this problem, an additional comparison must be made to distinguish points belonging to different primitives.
4. CONCLUSION
The presented method is dedicated to shadow mapping environments where the addition of new objects is common and would otherwise cause frequent regeneration of the static elements. The main cost of the algorithm is increased memory usage (the z-buffer must be at least doubled and the screen buffer tripled). The computational overhead is not very high because of the linear complexity per pixel redrawn in a given primitive. As mentioned before, the method works well, for example, for scenes with a static background, which can be generated once and reused in the further drawing process.
ACKNOWLEDGMENTS This work is supported by AGH grant 11.11.120.16.
REFERENCES 1. L. Williams, Casting curved shadows on curved surfaces, Computer Graphics (SIGGRAPH 78 Proceedings), Vol. 12(3), p. 270-274, August 1978. 2. H. Zhang, Forward shadow mapping, University of North Carolina at Chapel Hill, Eurographics Workshop on Rendering 1998, July 1998. 3. P. S. Heckbert, Fundamentals of texture mapping and image, masters thesis, Dept. of EECS, University of California, Berkeley, Technical Report No. UCB/CSD 89/516, June 1989. 4. A. Woo, P. Poulin, A. Fournier, A survey of shadow algorithms, IEEE Computer Graphics and Applications, Vol 10(6), p. 13-32, November 1990.
HIGH PRECISION TEXTURE MAPPING ON 3D FREE-FORM OBJECTS Yuya Iwakiri1 , Toyohisa Kaneko Department of Information and Computer Sciences Toyohashi University of Technology, Japan 1 Research Fellowship of the Japan Society for the Promotion of Science [email protected], [email protected]
Abstract
For creating a CG model from an existing 3D object, a laser scanner is commonly used to provide a geometry model and a digital camera to prepare photographs for texture mapping. It is important to have an accurate estimate of the camera parameters in order to precisely align photographs on overlapping regions. Alignment based on silhouettes has been employed most frequently. In this paper we examine the combination of silhouettes and texture patterns for aligning textures. Initially, an optimal strategy for camera position allocation is described. Then a new method of pattern matching is presented for more precise camera position estimation. Two toy objects were used to investigate the validity of our new method. It was found that our method could yield an alignment accuracy within one to two pixels (which corresponds to 0.5 to 0.6 mm).
1. INTRODUCTION
CG technology is expected to play an important role in preserving historical objects such as buildings, sculptures, etc. For such an existing (real-world) object, it is required to acquire a 3D geometrical model and its surface optical properties, typically textures. There are instruments, such as Cyberware scanners, that enable acquiring the geometry and texture at the same time; however, these instruments are very expensive and also lack flexibility in dealing with various object sizes. A more practical technique is to acquire a geometrical model with a laser scanner (or a CT scanner) and textures with a digital camera. In this case, aligning textures in an overlapping region is a very important task. There are a number of papers concerned with texture mapping3–5,7–10. A common alleviating solution to the pattern alignment problem is to use a blending function: two linear functions changing from 0 to 1 and 1 to 0 across an overlapping region. However, this solution does not work well for some patterns, such as line patterns, where misalignment yields two lines from a single line. Kurazume et al.6 proposed the use of a reflectance edge pattern from the reflectance image of a specially designed laser scanner; this method is based on the observation that the reflectance image is reasonably well correlated to texture patterns on the object surface. Bernardini et al.1 used texture patterns and range data for acquiring a geometry model and textures simultaneously using a stereo vision system for the Pietà Project; for each local region, an image-based registration algorithm was proposed.
2. METHOD
The principle of mapping photographs onto the surface of a 3D geometric model is explained using Figure 1, where two optical systems are considered: a real optical system and a virtual optical system. If these two systems are set identically, then the texture on the 3D model of the virtual system can be made the same as that of the real object by inverse projection of the photographs. It is assumed here that noteworthy texture patterns appear uniformly on the object surface; objects with a large patch without notable patterns are excluded. (This restriction is not serious, since some marks such as 'x' can be temporarily added on such a patch.)
2.1 Optimal strategy for camera positions
Consider a 3D object placed on a turntable. The camera distance is set so that the entire object is captured at any rotation angle. A certain number of overlapping photographs are taken to cover the entire object surface. By and large, a strategy with larger overlapping regions increases the number of photographs, while one with smaller regions decreases the number but makes texture alignment less reliable. For determining optimal camera positions, we employ a visibility matrix for 360 camera positions around the object (Figure 2). Each horizontal row records the visibility of each polygon of the object surface (Figure 4): 1 for a visible polygon and 0 for an invisible one. Hence, if there are common 1's between one camera position and another, the polygonal patch can be seen from both positions. Starting at position 1, the number of commonly seen polygons decreases as the other position moves away from position 1. If an overlapping ratio (e.g. 50%) is given, a second position can be chosen as the one closest to that ratio (Figure 5). Continuing in this manner, all camera positions are determined. Usually the overlap between the last position and the starting position 1 is larger or smaller than the target; in this case, a new target ratio is derived by averaging all the resulting ratios and the process is repeated. Finally, a set of camera positions with approximately identical overlapping ratios, as shown in Figure 6, is obtained.
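The following sketch illustrates one way to read the camera-selection procedure above; the visibility matrix is a boolean array of 360 candidate angles by the number of surface polygons. The greedy walk and its stopping rule are a simplified reading of the text (the paper additionally re-averages the achieved ratios and repeats), so treat this as an assumption-laden illustration rather than the authors' algorithm.

```python
import numpy as np

def pick_positions(visibility, target_ratio=0.5):
    """Walk around the object, advancing each time to the first candidate angle
    whose overlap ratio with the current position has dropped to target_ratio."""
    positions, current = [0], 0
    while True:
        vis_cur = visibility[current]
        seen_cur = max(int(vis_cur.sum()), 1)
        nxt = None
        for cand in range(current + 1, visibility.shape[0]):
            common = int(np.count_nonzero(vis_cur & visibility[cand]))
            if common / seen_cur <= target_ratio:
                nxt = cand
                break
        if nxt is None:          # walked past the last candidate: stop
            break
        positions.append(nxt)
        current = nxt
    return positions
```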
2.2 Silhouette-based matching
The silhouette of an object from a camera position is usually unique if the object is irregularly shaped. Hence the camera position can be estimated based
on a silhouette alone. The technique of reference 5, which is computationally very efficient, is used here to estimate the camera parameters.
2.3 Image-based matching
It has been observed that the use of silhouettes alone (silhouette matching) for camera parameter estimation is not sufficient, often resulting in misalignment of texture patterns. In this paper, the use of texture patterns (image-based matching) is proposed in addition. Consider two successive camera positions Vi and Vi+1. The overlapping region is projected onto the plane perpendicular to the camera direction at the mid angular point Vi' between Vi and Vi+1 (see Figure 7). The difference between the photographs taken at Vi and Vi+1 is measured as the mean absolute difference of pixel values. Starting from the initial position given by the silhouette matching, this difference e is minimized by varying the position of the second camera (the (i+1)-th position) while the first position is kept fixed:

e = (1/n) Σ_{k=1}^{n} |P_{i,k} − P_{i+1,k}|      (1)

where P_{i,k} and P_{i+1,k} are the k-th pixel values of the i-th and (i+1)-th camera positions, and n is the number of pixels in the projected overlapping region.
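A direct transcription of Eq. 1; the two arrays are assumed to hold the overlapping region of the photographs taken at Vi and Vi+1, already resampled onto the common projection plane (the resampling itself is not shown). In practice this value would be evaluated repeatedly while perturbing the (i+1)-th camera pose, keeping the pose with the smallest e.

```python
import numpy as np

def photo_difference(patch_i, patch_next):
    """Eq. 1: mean absolute pixel difference over the projected overlap region."""
    return float(np.mean(np.abs(patch_i.astype(float) - patch_next.astype(float))))
```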
2.4 Texture mapping
After the camera positions are estimated using both the silhouette and image-based matching described above, textures are mapped onto the object surface with a pair of blending functions:

αi = pi / Σ_{j=1}^{m} pj,   pi = 1 − φi / 90      (2)

where φi is the angle (in degrees) between the surface normal vector and the view vector of the i-th camera.
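A sketch of the blending weights of Eq. 2 for a single surface point; normal is the unit surface normal and view_dirs holds the unit view vectors of the m cameras that see the point (both hypothetical inputs). Clamping negative p_i to zero for angles beyond 90 degrees is an added assumption.

```python
import numpy as np

def blending_weights(normal, view_dirs):
    cosines = np.clip(view_dirs @ normal, -1.0, 1.0)
    phi = np.degrees(np.arccos(cosines))       # angle between normal and view vector
    p = np.clip(1.0 - phi / 90.0, 0.0, None)   # Eq. 2
    return p / p.sum()                         # alpha_i for each camera
```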
3. EXPERIMENTS
We employed a ceramic cat (Figure 8, size 11 × 17 × 12 cm) and a plastic giraffe (Figure 9, size 15 × 24 × 5 cm) for the experiments. They were scanned with a CT scanner, and their geometrical models were obtained by applying the marching cubes algorithm to the CT data. A VTK polygon reduction algorithm was applied to reduce the number of surface polygons to approximately 8212 for the cat and 6821 for the giraffe. The CT resolution is 0.625 mm × 0.625 mm × 1.0 mm. The method presented here was then applied. The global algorithm described in the previous section provided seven and eight camera positions for the cat and the giraffe, respectively. Silhouette matching and then image-based matching were carried out. Figures 10 and 12 show the resulting textured models, and Figures 11 and 13 show magnified pictures in order to demonstrate details. Figures 11(a) and 13(a) show misaligned patterns obtained with silhouette matching alone (see the double whiskers for the cat and the fuzzy textures for the giraffe), while Figures 11(b) and 13(b) show well aligned patterns (a single whisker for the cat and clearer textures for the giraffe). It can be said that our method is very effective in providing high precision texture mapping.
3.1 Verification
As described above, the difference between two images on an overlapping region is the sum of the R, G and B differences over all pixels. For verifying the alignment, we instead employ the difference between edge images. Edges are extracted by applying a 3×3 Sobel filter to the two images on each overlapping region (Figure 14). Strong edges, those in the top 5 percent, are extracted as shown in Figure 15, and are then thinned by applying a thinning algorithm. Finally, an ICP method2 is applied to locate corresponding edge points in the two edge images. For the cat, the average residual error was found to be 1.7 pixels (0.83 mm) with silhouette matching alone and 1.2 pixels (0.56 mm) with the new method. For the giraffe, the average error was 1.3 pixels (0.78 mm) and 1.2 pixels (0.65 mm), respectively.
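A runnable sketch of the verification preprocessing: Sobel edge magnitude on an overlapping-region image, keeping only the strongest 5% (the thinning and ICP steps are not shown). Function and parameter names are illustrative, not the authors' code.

```python
import numpy as np
from scipy import ndimage

def strong_edges(gray, keep_fraction=0.05):
    gx = ndimage.sobel(gray.astype(float), axis=1)
    gy = ndimage.sobel(gray.astype(float), axis=0)
    magnitude = np.hypot(gx, gy)
    threshold = np.quantile(magnitude, 1.0 - keep_fraction)
    return magnitude >= threshold
```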
4. CONCLUSION
We have presented a method for aligning textures accurately on a 3D geometric object. Initially we described an algorithm for allocating camera positions so that the overlapping ratio of each pair of adjacent camera positions is kept constant. Then a new image-based matching algorithm was presented to align texture patterns as accurately as possible. The method was applied to two objects, resulting in excellent texture alignments. In addition, a verification method that employs edges was found to be useful for quantifying the alignment accuracy. In the experiments, the accuracy was found to be about 0.56 mm for the cat and 0.65 mm for the giraffe.
REFERENCES 1. F. Bernardini, I. M. Martin, and H. Rushmeier, High-quality texture reconstruction from multiple scans. IEEE TVCG, 7(4), October 2001. 2. P. J. Besl and N. D. McKay, A method for registration of 3-d shapes. IEEE Tr. on PAMI, PAMI-14(2):239 – 257, 1992. 3. P. Debevec, Y. Yu, and G. Borshukov, Efficient view-dependent image-based rendering with projective texture-mapping. In 9th Eurographics workshop on rendering, pages 105 – 116, 1998. 4. P. E. Debevec, C. J. Taylor, and J. Malik, Modeling and rendering architecture from photographs; a hybrid geometry-and image-based approach. In Proc. of SIGGRAPH ’96, pages 11 – 20, 1996. 5. Y. Iwakiri and T. Kaneko, Pc-based realtime texture painting on real world objects. In Proc. of Eurographics 2001, volume 20, pages 105–113, 2001. 6. R. Kurazume, K. Nishino, Z. Zhang, and K. Ikeuchi, Simultaneous 2d images and 3d geometric model registration for texture mapping utilizing reflectance attribute. In Proc. of Fifth Asian Conference on Computer Vision, January 2002.
7. H. P. A. Lensch, W. Heidrich, and H. P. Seidel, Automated texture registration and stitching for real world models. In Proc. of Pacific Graphics 2000, pages 317 – 326, October 2000. 8. P. J. Neugebauer and K. Klein, Texturing 3d models of real world objects from multiple unregistered photographic views. In Proc. of Eurographics ’99, pages 245 – 256, September 1999. 9. E. Ofek, A. Pappoport, and M. Werman, Multi-resolution textures from image sequences. IEEE CG and Applications, 17(2):18 – 29, March – April 1997. 10. K. Pulli, M. Cohen, T. Duchamp, H. Hoppe, L. Shapiro, and W. Stuetzle, View-based rendering: Visualizing real objects from scanned range and color data. In Proc. of 8th Eurographics Workshop on Rendering, pages 23 – 34, June 1997.
Figure 1. Real and Virtual Camera Systems.
Figure 2. 360 camera positions.
Figure 3. Visibility condition.
Figure 4. Visibility matrix.
Figure 5. Initial camera positions where V1 and V7 are too close.
Figure 6. Final camera positions with similar overlapping ratios.
Figure 7. Overlapping region of two camera positions.
Figure 8. Model #1 (cat).
Figure 9. Model #2 (giraffe).
Figure 10. Textured model (cat): (a) before, (b) after.
Figure 11. Magnified texture (cat): (a) before, (b) after.
Figure 12. Textured model (giraffe): (a) before, (b) after.
Figure 13. Magnified texture (giraffe): (a) before, (b) after.
Figure 14. 5% top values of Sobel filter output.
Figure 15. Edges after thinning.
ADAPTIVE Z-BUFFER BASED SELECTIVE ANTIALIASING
Przemyslaw Rokita Warsaw University of Technology, Institute of Computer Science; Nowowiejska 15/19, 00-665 Warsaw, Poland; phone:(+48-22)6607753, fax: (+48-22)8251635; e-mail: [email protected]
Abstract:
Antialiasing is still a challenge in real-time computer graphics. In this paper we present a real-time selective antialiasing solution that builds upon our previous experience. We investigated existing approaches to real-time antialiasing and finally found a new, simpler solution. Our new idea is to use the z-buffer directly for extracting visible edge information. The method presented here can be used to improve image quality in graphics accelerators, but also in applications such as real-time ray tracing.
Key words:
antialiasing, real-time rendering, digital image processing, spatial convolution
1. INTRODUCTION
Most systems solve the aliasing problem by applying the supersampling technique. It requires a lot of computation time to process all the extra samples and a large amount of memory to store them. To reduce the memory cost, the accumulation buffer method was proposed1. The major drawback of this method is that the processing time is increased in proportion to the subpixel resolution. A partial solution that lowers the number of samples per pixel is to use the sparse mask method; such a solution was used in SGI's RealityEngine and Infinite Reality, but it still requires a large memory and expensive hardware. Another way to lower the number of required samples is to concentrate the rendering process only on polygon edges. As in 2, the color contribution of each polygon at a pixel is estimated by the area of the pixel covered by the polygon edge. The memory
usage in this approach is generally lower, but it requires dynamic memory allocation and has a high processing cost. An attempt to reduce both processing time and memory requirements was proposed as the A-buffer method3. An important drawback of this approach is that visibility may not be computed correctly when two fragments overlap in depth. A different kind of antialiasing method is based on postprocessing4: after the rendering phase, pixels are filtered to reduce the aliasing effect. As aliasing artifacts are in fact high frequency distortions, they can be reduced using lowpass filtering. A way to optimize the antialiasing process is to use a selective approach. The idea of selective antialiasing can be seen at work in the Matrox Parhelia PC graphics card7. It is called Fragment Antialiasing (FAA), as opposed to Full Screen Antialiasing (FSAA). It is an implementation of an idea introduced in the Talisman project (commodity real-time 3D graphics for the PC) by Jay Torborg and James T. Kajiya in 19965. The antialiasing method used there is based on the A-buffer algorithm3 mentioned above. An important advantage of selective antialiasing is a more accurate treatment, focusing only on the portions of the image that truly need antialiasing. Matrox states that fragment pixels, i.e. pixels that describe objects' surface edges, make up only 5-10% of a scene's pixels, and often less than that. The result is that after the fragments are identified, there are considerably fewer pixels to perform antialiasing operations on than when using FSAA. A potential problem is that in this approach textures do not get antialiasing processing performed on them. But we have to take into account that textures are usually filtered independently before the final antialiasing step, using trilinear and anisotropic filters. If we use FSAA on top of trilinear and anisotropic filters, the final effect is to blur texture detail. The conclusion is that selective antialiasing, compared to full-screen antialiasing, consistently improves performance and can give better texture quality. The layer/chunk architecture as proposed in Talisman and Parhelia gives good quality antialiasing, but it is complicated and rather costly (the Talisman project was cancelled, and Parhelia is a high-end, expensive graphics card). It also has a drawback: not all the edges that need antialiasing are processed6.
2. NEW APPROACH – Z-BUFFER BASED SELECTIVE ANTIALIASING
In this paper we present a solution that builds upon our previous experience. In 4 we proposed a new antialiasing algorithm (3D filtering using pixel-flow information) in the context of hybrid walk-through animation. One of the problems we had in this project was excessive blurring, which gave us the idea of using selective antialiasing. We investigated the existing approaches to selective antialiasing (Talisman and Parhelia described above) and finally found a new, simpler solution. Graphics hardware accelerators are based on the z-buffer, and we had previous experience in using the z-buffer as a parameter for adaptive filtering in real-time depth-of-field simulation8. This gave us the idea of using the z-buffer directly for extracting visible edge information. If we regard the z-buffer content as an intensity function, then we can extract the 2D edge mask (i.e. the part of the image for further antialiasing) using highpass filtering. The simplest way to do highpass filtering on digital images is to use spatial convolution. Convolution in the spatial domain corresponds to multiplication in the frequency domain, but in our case, because of the computational cost (processing in the frequency domain requires calculation of the discrete Fourier transform), we use the spatial domain. An important advantage of this approach is that hardware two-dimensional real-time convolvers are well known and available on the market. The edge detection we need can be achieved using the following Laplacian convolution mask:

 1 -2  1
-2  4 -2
 1 -2  1

The result of applying this operation to the z-buffer data can be seen in Fig. 1b.
Figure 1. Non-antialiased image of a scene that was used for tests (a - left); selection mask extracted from z-buffer - those pixels should be processed further with an antialiasing procedure (b - right)
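A runnable sketch of the selection step: convolve the z-buffer with the Laplacian mask given above and threshold the absolute response to obtain the pixels that need antialiasing. The threshold value here is an assumption to be tuned per scene; the text only states that a single threshold parameter controls how many edges are selected.

```python
import numpy as np
from scipy.ndimage import convolve

LAPLACIAN = np.array([[ 1, -2,  1],
                      [-2,  4, -2],
                      [ 1, -2,  1]], dtype=float)

def edge_selection_mask(zbuffer, threshold=0.01):
    response = np.abs(convolve(zbuffer.astype(float), LAPLACIAN, mode="nearest"))
    return response > threshold
```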
The mechanism for extracting edges from the z-buffer presented here is simple, robust and easily scalable. For example, we can extract all visible polygon edges; to achieve this effect, only one threshold parameter needs to be introduced. After selecting the pixels that need antialiasing with the method described above, the antialiasing itself can be done using one of two approaches:
- by supersampling, as traditionally used in the methods described above, or
- using the signal processing method described below (an example of the result can be seen in Figure 2b).
From the point of view of signal processing theory, the discontinuities and aliasing artifacts are high frequency distortions. This suggests the possibility of replacing the traditional, computationally expensive antialiasing techniques, such as unweighted and weighted area sampling and supersampling, by an appropriate image processing method, or more generally a signal processing method. Signal processing of digital images can in general be done in the frequency or in the spatial domain. Using the frequency domain is in our case rather impractical (as mentioned above, it requires computation of the DFT); the simplest solution is to use the spatial domain and, again, spatial convolution. To filter aliasing artifacts, which are high frequency distortions, we need a lowpass filter. In this case the convolution mask values are derived from the required impulse response of the filter; for the lowpass filtering we need here, the required impulse response is the Gaussian function. For the lowpass filtering of the selected pixels that need antialiasing, we can use the following convolution mask:

 1  2  1
 2  4  2
 1  2  1
It is a well-known problem that lowpass filtering causes blurring as a side effect, and in fact a loss of information in the processed signal or image. But it is also well known in the video community that the human eye is less sensitive to higher spatial frequencies than to lower ones, and this knowledge was, for example, used in designing video equipment. In the case of computer generated animation sequences we also have to consider that the content and the final quality of the resulting animation are judged by a human observer. In our previous research4 we discovered that perceived sharpness increases as pixel velocity increases. For example, an animation perceived as sharp and of good quality can be composed of relatively strongly blurred frames, while the same frames observed as still images would be judged blurred and unacceptable by a human observer. Our approach in 4 was based on this perceptual phenomenon, and took advantage of the visual system's compensation for the excessive blurring introduced by lowpass filtering of the animation frames.
Another way of increasing perceived sharpness, which we discovered and present here, is to filter pixels selectively. This idea led us to the solution presented here, which can be summarized as follows: first, select only object edges by applying to the z-buffer a spatial convolution with the Laplacian; second, apply lowpass spatial convolution filtering (or supersampling) to the selected pixels. The blurring effect of standard lowpass filtering can be further reduced if we use a more sophisticated filtering technique. We propose here to use directional filtering as an antialiasing tool. The best postprocessing antialiasing effect is obtained if we apply the lowpass filter along the local orientation of the antialiased features10. In 10 the authors propose a complicated curve fitting method for extracting the local feature orientation. Here we propose a simpler and more efficient solution: instead of using a curve fitting method based on second order intensity derivatives as in 10, we can directly use a set of first order derivatives applied to the z-buffer content. If we use a numerical approximation based on the Sobel operator, then in order to derive the local feature orientation in the four principal directions we apply the following set of convolution masks:

-1  0  1      -2 -1  0      -1 -2 -1       0 -1 -2
-2  0  2      -1  0  1       0  0  0       1  0 -1
-1  0  1       0  1  2       1  2  1       2  1  0
For each detected feature direction, an appropriate directional Gaussian filter is then applied:

 0  0  0       1  0  0       0  1  0       0  0  1
 1  2  1       0  2  0       0  2  0       0  2  0
 0  0  0       0  0  1       0  1  0       1  0  0
This way the lowpass filter is applied along the local features selected for antialiasing, filtering out the high frequency distortions due to intermodulation. In this approach the highpass filtering applied to the z-buffer has a twofold application:
- it selects the object edges that need to be antialiased,
- it gives a local feature direction, allowing for edge reconstruction.
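A runnable sketch of the whole selective, directional pass on a single-channel image: directional Sobel responses computed on the z-buffer pick the dominant local orientation, and the matching directional Gaussian (normalized, smoothing along the edge rather than across it) is applied only to the selected pixels. The pairing of each Sobel mask with a Gaussian mask, and the use of the maximal absolute response as the orientation estimate, are assumptions about how the masks above are meant to be combined, not a statement of the paper's exact pipeline.

```python
import numpy as np
from scipy.ndimage import convolve

SOBEL = [np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float),   # horizontal gradient
         np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]], float),   # diagonal gradient
         np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float),   # vertical gradient
         np.array([[0, -1, -2], [1, 0, -1], [2, 1, 0]], float)]   # anti-diagonal gradient
GAUSS = [np.array([[0, 1, 0], [0, 2, 0], [0, 1, 0]], float) / 4,  # smooth along vertical edge
         np.array([[0, 0, 1], [0, 2, 0], [1, 0, 0]], float) / 4,  # smooth along anti-diagonal edge
         np.array([[0, 0, 0], [1, 2, 1], [0, 0, 0]], float) / 4,  # smooth along horizontal edge
         np.array([[1, 0, 0], [0, 2, 0], [0, 0, 1]], float) / 4]  # smooth along diagonal edge

def directional_antialias(image, zbuffer, mask):
    image = image.astype(float)
    responses = np.stack([np.abs(convolve(zbuffer.astype(float), k, mode="nearest"))
                          for k in SOBEL])
    direction = responses.argmax(axis=0)                  # dominant gradient direction per pixel
    filtered = [convolve(image, g, mode="nearest") for g in GAUSS]
    out = image.copy()
    for d in range(4):
        sel = mask & (direction == d)                     # only pixels selected for antialiasing
        out[sel] = filtered[d][sel]
    return out
```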
3. SUMMARY
In this paper we have presented a selective antialiasing solution based on a simple mechanism for extracting edges from the z-buffer. Figures 2a-2b show an example of the obtained results. The approach proposed here preserves texture details: textures are usually filtered independently using a traditional texture filtering approach - trilinear or anisotropic filtering. Another important observation is that our perception system is more sensitive to aliasing artifacts on object boundaries than in textures9. In our daily life we meet many different and previously unknown textures on surrounding objects - as a result, aliasing artifacts can be perceived as texture features. On the other hand, we all know well what real-world object boundaries look like - so jagged edges on rendered object boundaries are easily perceptible and annoying. This observation was used in the algorithm described above. General purpose hardware real-time convolution filters are well known and available; for example, the Harris HSP48908 two-dimensional real-time convolver was introduced more than a decade ago, and hardware convolution was also available years ago in SGI's RealityEngine. The advantage of the solution presented here is that the same circuit architecture can be used for the selection (edges extracted from the z-buffer using Laplacian or gradient convolution masks) and for the antialiasing of the selected pixels (lowpass filtering using uniform or directional Gaussian convolution masks). This leads to a simple real-time implementation. The algorithm can also easily be modified to select for antialiasing not only object edges but also texture edges: convolution with the Laplacian can be applied directly to the image (instead of the z-buffer). It will then extract all the edges: object boundaries and high contrast transitions in texture patterns. This approach gives better results when texture filtering (trilinear or anisotropic) is not used. The approach presented here can be used to improve image quality in graphics accelerators, but also in applications such as real-time ray tracing. In real-time ray tracing the number of samples is limited to meet the expected frame rate, which leads to annoying aliasing; it could be corrected using a postprocessing stage based on the solution proposed above.
Figure 2. An enlarged part of the image from figure 1a (rendered without antialiasing) (a - left); an enlarged part of the image from figure 1a antialiased using z-buffer based selective antialiasing (b - right)
REFERENCES 1. P.Haeberli and K.Akeley, The Accumulation Buffer: Hardware Support for High-Quality Rendering, In Proc. of ACM Siggraph’90, pp. 309-318, August 1990. 2. M.Kelley, S.Winner, and K.Gould, A Scalable Hardware Renderer Accelerator using a Modified Scanline Algorithm, In Proc. of ACM Siggraph’92, pp. 241-248, July 1992. 3. L.Carpenter, The A-buffer, an Antialiased Hidden Surface Method, In Proc. of ACM Siggraph’84, pp.103-108, July 1984. 4. K. Myszkowski, P. Rokita, T. Tawara, Perception-Based Fast Rendering and Antialiasing of Walkthrough Sequences, IEEE Transactions on Visualization and Computer Graphics, vol. 6, no. 4, October/December 2000. 5. J. Torborg and J. Kajiya, Talisman: Commodity Realtime 3D Graphics for the PC, In Proc. of ACM Siggraph’96, pp. 353-363, August 1996. 6. http://www.digit-life.com/articles/matroxparhelia512/ 7. http://www.matrox.com/mga/products/parhelia512/technology/faa16x.cfm 8. P.Rokita, Generating of Depth-of-Field Effects in Virtual Reality Applications, IEEE Computer Graphics & Applications, vol.16, no. 2, pp. 18-21, March 1996. 9. H.R.Schiffman, Sensation and Perception – an Integrated Approach, John Wiley and Sons, 1990. 10. R. Lau, An Efficient Low-Cost Antialiasing Method Based on Adaptive Postfiltering, IEEE Transactions on Circuits and Systems for Video Technology, vol.13, no. 3, pp. 247256, March 2003.
MORPHOLOGICAL NORMALIZED BINARY OBJECT METAMORPHOSIS Marcin Iwanowski Warsaw University of Technology, Institute of Control and Industrial Electronics ul.Koszykowa 75, 00-662 Warszawa POLAND [email protected]
Abstract
The paper describes a method for binary 2D and 3D object metamorphosis using a normalized morphological interpolation function and a mask. Compared with the existing methods, the proposed one has two important advantages: the normalization of the interpolation function and a new formulation of the interpolator. The first allows obtaining a steady and smooth transformation of the area (volume) of the interpolated objects. The new formulation of the interpolator introduces a mask inside which the interpolation is performed; owing to the mask, one can define the area inside which the interpolation takes place. A new kind of mask is also proposed - it is equal to the convex hull of both input objects. Two examples of the interpolation of 2D and 3D objects are also given. The method can be applied to image reconstruction, as well as to computer-aided animations.
Keywords:
Mathematical morphology; image metamorphosis; computer animations.
1. INTRODUCTION
This paper describes a method for binary object metamorphosis6,14 by means of morphological interpolation1,3–5,7,10,12. Interpolation between two objects (compact sets of pixels) consists in generating a sequence of intermediary objects whose shape is transformed from the shape of the first input (initial) object into the shape of the second (final) one. The method proposed in this paper is based on the method introduced in 7, where the shape of the interpolated object is obtained by thresholding an interpolation function computed from morphological geodesic distance functions. Two principal improvements are proposed. The first one is the normalization of the interpolation function, which stabilizes the change of the area (or volume, in the case of 3D objects) for increasing interpolation levels; thanks to this, the interpolation sequence is characterized by a stable and constant transformation. The second improvement allows performing the interpolation inside a user-defined mask.
Thanks to this, the interpolated object does not have to be included in the union of both input objects, as proposed in 7, which was a problem especially when the intersection of the input objects was relatively small. The new formulation of the interpolator allows applying a mask which defines the area inside which the interpolation is performed. A new type of mask is also proposed - the convex hull of both input objects. Two examples show the interpolation results obtained by applying the proposed method to two- and three-dimensional input objects.
2. BASICS OF THE MORPHOLOGICAL INTERPOLATION
2.1 The interpolator
An interpolator provides a transformation which produces an interpolated object. It is a function of three principal arguments: two input objects (initial and final) and an interpolation level α. An interpolation level is a real number α such that 0 ≤ α ≤ 1. In this paper the interpolator is denoted as Int^R_{P→Q}(α), where P represents the initial object and Q the final one. The parameter R (such that P ∪ Q ⊂ R) denotes a mask inside which the interpolation is performed. The shapes of the interpolated objects turn from the shape of object P into the shape of object Q. For α = 0, the interpolated object is equal to the initial object (Int^R_{P→Q}(0) = P); for α = 1, to the final one (Int^R_{P→Q}(1) = Q). A sequence of interpolated objects produced for increasing values of α is an interpolation sequence.
2.2 Distance function calculation
The definition of the distance depends on the underlying image grid11. In the case of 2D images, the 4- and 8-connected grids are usually used; in some applications a 6-connected grid is also considered. For 3D images three types of connectivity are usually considered: 26-, 18- and 6-connectivity, depending on which neighbors are taken into account. To obtain an interpolation function for nested objects, geodesic distances are computed. They describe the distance to one object (inner, X) inside the other one (outer, Y), as the length of the shortest path connecting a given point p ∈ Y \ X with the closest point of the inner object X, such that every point of this path belongs to Y. A geodesic dilation8,9,11 of image X with mask Y of size 1 equals the intersection of the dilated image X with the mask Y: δ_Y^(1)(X) = δ(X) ∩ Y. The geodesic dilation of a given size λ is given by δ_Y^(λ)(X) = δ_Y^(1)(δ_Y^(1)(. . . δ_Y^(1)(X))), the elementary geodesic dilation being applied λ times. The geodesic distance is then defined as:

d_Y(X)[p] = inf{ i : p ∈ δ_Y^(i)(X) }
Figure 1. Elementary structuring elements in 4- (a), 8- (b), 6- (c), 18- (d) and 26- (e) connectivity.
The dilations are performed with the elementary structuring element, whose shape influences the values of the distance function. The elementary structuring elements, containing the closest neighbors, are shown in Fig. 1. Although distances based on the image grid are not Euclidean, they are sufficient for interpolation purposes. It is, however, possible to compute the Euclidean distance; in such a case one has to use the more sophisticated algorithms proposed e.g. in 12,13.
2.3 Interpolation between nested objects
Let X and Y be nested objects, inner and outer, respectively (X ⊂ Y). The interpolation function proposed in 7 is defined as:

int_Y(X)[p] = d_Y(X)[p] / (d_Y(X)[p] + d_{X^C}(Y^C)[p])      (1)

where X^C and Y^C stand for the complements of the objects X and Y, respectively. The interpolator based on Eq. 1 is defined as (T[α] stands for the thresholding operator at level α):

Int^Y_{X→Y}(α) = T[α](int_Y(X))      (2)

3. PROPOSED METHOD
3.1 Normalization
Figure 2. Nested objects: outer (a), inner (b), the area of the interpolated objects (c), interpolation sequence (d), interpolation function (e), normalized function (f), area function after normalization (g) and sequence obtained from the normalized interpolation function (h).
A disadvantage of the interpolator defined by Eq. 2 is that throughout the interpolation process the area (or volume, in the case of 3D objects) increases irregularly while the interpolation level α grows steadily; the speed of growth of the area/volume of the interpolated object is not constant. This effect is noticeable especially when one of the sets (the outer one) is elongated (see Fig. 2d): the object grows slower for lower α values and much faster for higher ones, so the ΔS/Δα factor is not constant. In order to stabilize the growth of the interpolated object Int^Y_{X→Y}(α) for increasing α, an additional normalization step is proposed. An impression of constant growth of the interpolated object depends on the relative difference of area (volume) between each two consecutive interpolated objects, which should be constant. To obtain this, the interpolation function is normalized using a normalizing factor n defined as:

n(α) = (S(Int^Y_{X→Y}(α)) − S(X)) / (S(Y) − S(X))      (3)
where S stands for the area (or volume, in the 3D case) of its argument. A normalized interpolation function is then defined as int'_Y(X)[p] = n(int_Y(X)[p]). The interpolated object is obtained by applying this to Eq. 2, which gives:

Int'^Y_{X→Y}(α) = T[α](int'_Y(X)) = T[α](n(int_Y(X)))      (4)
The above equation defines the normalized interpolator. An example of normalization is shown in Fig. 2. In contrast to a sequence obtained by means of Eq. 2 (see Fig. 2a), the increase of the surface area between consecutive frames of the sequence obtained from Eq. 4 (see Fig. 2g) is constant, which gives an impression of steady and stable shape metamorphosis. The stair-casing effect visible in Fig. 2g comes from the limited number of levels of the interpolation function before normalization: the function, defined as a real number, is computed in this example as an integer between 0 and 255. For a larger number of possible values the stair-case effect would be less noticeable or even invisible. In this example 8-connectivity was used.
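A runnable sketch of the normalized interpolation between nested 2D binary objects (Eqs. 1-4), using iterated geodesic dilations for the distance functions and an empirical remapping of the levels for the normalization. This is an illustration of the formulas with numpy/scipy under the assumption that X and Y are boolean arrays with X ⊂ Y and that the band Y \ X is geodesically connected to both boundaries; it is not the author's implementation.

```python
import numpy as np
from scipy import ndimage

STRUCT = np.ones((3, 3), bool)  # 8-connectivity

def geodesic_distance(marker, mask):
    """Distance from `marker` measured inside `mask` by iterated geodesic dilation."""
    dist = np.where(marker, 0.0, np.inf)
    current, i = marker.copy(), 0
    while True:
        i += 1
        grown = ndimage.binary_dilation(current, STRUCT) & mask
        new = grown & ~current
        if not new.any():
            return dist
        dist[new] = i
        current = grown

def interpolate_nested(X, Y, alpha):
    """Normalized interpolation of nested objects X ⊂ Y at level alpha."""
    band = Y & ~X
    d_in = geodesic_distance(X, Y)        # d_Y(X)
    d_out = geodesic_distance(~Y, ~X)     # d_{X^C}(Y^C)
    f = np.zeros(X.shape)
    f[band] = d_in[band] / (d_in[band] + d_out[band])        # Eq. 1
    # Eqs. 3-4: remap levels so that thresholding at alpha grows the area linearly
    levels = np.unique(f[Y])
    areas = np.array([np.count_nonzero((f <= t) & Y) for t in levels], float)
    n_of_level = (areas - X.sum()) / max(float(Y.sum() - X.sum()), 1.0)
    f_norm = f.copy()
    f_norm[Y] = np.interp(f[Y], levels, n_of_level)
    return (f_norm <= alpha) & Y                              # thresholding T[alpha]
```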
3.2 General case
Let P and Q be the initial and final objects (not necessarily nested as in the previous section); they can be nested, intersecting or disjoint. In the proposed method, the interpolation is performed inside an auxiliary object R - a mask. The final result of the interpolation at a given level α is obtained as the intersection of two interpolations of nested objects:

Int^R_{P→Q}(α) = Int^R_{P→R}(α) ∩ Int^R_{Q→R}(1 − α)      (5)

where Int^R_{P→R}(α) and Int^R_{Q→R}(1−α) are the objects interpolated between P and R, and between Q and R, respectively. The normalization is in this case performed separately for each of the two interpolators of nested objects.
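Reusing interpolate_nested from the sketch above, the general-case interpolator of Eq. 5 becomes a simple intersection; again an illustration under the same assumptions, with R a boolean mask containing both P and Q.

```python
def interpolate_general(P, Q, R, alpha):
    """Eq. 5: interpolation between arbitrary objects P and Q inside the mask R."""
    return interpolate_nested(P, R, alpha) & interpolate_nested(Q, R, 1.0 - alpha)
```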
3.3 The mask
The mask R from Eq. 5 defines the region inside which the interpolation is performed. This means that every interpolated object, as well as both input ones, must be included in the mask. An object intended to be used as a mask should fulfill two general conditions: it must be a single connected component, and it must include both objects P and Q: P ∪ Q ⊂ R. Different definitions of the mask are possible. One of them is the union of the input objects7. This solution - R = P ∪ Q - can be applied only in the case of two input sets with a non-empty intersection (P ∩ Q ≠ ∅); it cannot be applied to the interpolation of disjoint objects. Moreover, the results of interpolating objects with a relatively small intersecting area are often not satisfactory: the pixels belonging to the interpolated object give the impression of pouring from the initial into the final object4. Owing to the new definition (5), a new kind of mask is proposed. It is equal to the convex hull (CH) - the smallest convex object containing both input ones:

R = CH(P ∪ Q)      (6)

An algorithm for its generation is described in 11. For 3D objects, convex hull computation was investigated in 2.
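For 2D images, the convex-hull mask of Eq. 6 can be sketched with scikit-image's convex_hull_image, used here purely as an off-the-shelf stand-in for the algorithm cited in 11.

```python
from skimage.morphology import convex_hull_image

def convex_hull_mask(P, Q):
    """Eq. 6: mask equal to the convex hull of the union of both input objects."""
    return convex_hull_image(P | Q)
```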
4. RESULTS AND CONCLUSIONS
The interpolator described in the paper works on both two- and three-dimensional objects. The only difference is the structuring element used. In the first case it was a square one (8 neighbors, Fig. 1b), in the second one - a cube structuring element (26 neighbors, Fig. 1e).
Figure 3. The interpolation sequence between 3D objects (initial - (a), final - (i)).
Figure 4. 2D shapes - (a) and (b), contour represents the convex hull; the interpolation functions - (c) and (d).
Figure 5. The interpolation sequence between objects from Fig. 4.
The first example of interpolation is presented in Fig. 3; it is an interpolation of 3D objects. The sequence of 3D objects was produced with the normalized interpolation function and a mask equal to the union of both input objects. The second example interpolates between two 2D shapes; in this case the normalized function was also applied, but the mask was equal to the convex hull of the union of both input objects. The proposed method has two important advantages: the normalization of the interpolation function and the new formulation of the interpolator. The first feature allows obtaining a constant change of the area (volume) of the interpolated objects. The new formulation of the interpolator introduces a mask inside which the interpolation is performed; owing to this mask, one can define the area inside which the interpolation takes place. A new kind of mask is also proposed - it is equal to the convex hull of both input objects. Two examples of interpolation with the normalized interpolation function were given. The proposed method can be applied to area and volume modelling. The 2D interpolation can be used to generate 2D slices of a 3D volumetric object starting from existing slices. The method can also be efficiently applied to computer-aided animation of 2D and 3D images.
REFERENCES 1. Beucher S., Interpolation of sets, of partitions and of functions, in H. Heijmans and J. Roerdink Mathematical Morphology and Its Application to Image and Signal Processing, Kluwer, 1998. 2. Borgefors G., Nyström I., Sanniti di Baja G., Computing covering polyhedra of non-convex objects, in Proceedings of 5th British Machine Vision Conference, York, UK, pp. 275-284, 1994. 3. Iwanowski M., Serra J., Morphological-affine object deformation, in L. Vincent and D. Bloomberg, Mathematical Morphology and Its Application to Image and Signal Processing, pp.82-90, Kluwer, 2000. 4. Iwanowski M., Application of mathematical morphology to interpolation of digital images Ph. D. thesis, Warsaw University of Technology, School of Mines of Paris, Warsaw-Fontainebleau 2000. 5. Iwanowski M., Morphological binary interpolation with convex mask, Proc. of Int. Conf. on Computer Vision and Graphics, Zakopane, Poland 2002. 6. Lazarus F., Verroust A., Three-dimensional metamorphosis: a survey, The Visual Computer vol.14, pp.373-389, 1998. 7. Meyer F., Morphological interpolation method for mosaic images, in P. Maragos, R. W. Schafer, M. A. Butt, Mathematical Morphology and Its Application to Image and Signal Processing, Kluwer, 1996. 8. Serra J., Image Analysis and Mathematical Morphology vol.1, Academic Press, 1982. 9. Serra J., Image Analysis and Mathematical Morphology vol.2, Academic Press, 1988. 10. Serra J., Hausdorff distance and interpolations, in H. Heijmans and J. Roerdink, Mathematical Morphology and Its Application to Image and Signal Processing, Kluwer, 2003. 11. Soille P., Morphological Image Analysis - Principles and Applications, Springer Verlag, 1999, 2003. 12. Soille P., Spatial distributions from the contour lines: an efficient methodology based on distance transformation, J. of Visual Communication and Image Representation 2(2), June 1991, pp. 138-150. 13. Vincent L., Exact Euclidean distance function by chain propagations, Proc. IEEE Computer Vision and Pattern Recognition, 1991, pp. 520-525. 14. Wolberg G., Image morphing: a survey, The Visual Computer vol.14, pp.360-372, 1998.
BUBBLE TREE DRAWING ALGORITHM∗ S. Grivet,1 D. Auber,1 J. P. Domenger1 and G. Melançon2 1 LaBRI - Université Bordeaux 1, 351 Cours de la Libération, 33405 Talence, France 2 LIRMM, Montpellier, France ∗ Supported by ACI masse de Données NavGraph grant
{auber, grivet, domenger}@labri.fr [email protected]
Abstract
In this paper, we present an algorithm, called Bubble Tree, for the drawing of general rooted trees. A large variety of algorithms already exists in this field. However, the goal of this algorithm is to obtain a better drawing which makes a trade-off between the angular resolution and the length of the edges. We show that the Bubble Tree drawing algorithm provides a planar drawing with at most one bend per edge in linear running time.
1. INTRODUCTION
Hierarchical representations of information still remain central in Information Visualization. Their success mainly resides in the wide spectrum of applications for which they are relevant. Some applications focus on hierarchical data, making the use of tree layouts an obvious choice. For instance, tree representations for the visual analysis of file systems (Munzner, 1997) or phylogenies (Amenta and Klingner, 2002) are mandatory to reflect the structure of the data under study. Tree representations can still be relevant when the data is not hierarchical but consists of a general network or graph. In most cases, a tree is extracted from the network following an adequate search of the nodes and links. This approach makes sense when the task conducted by the user requires the selection of an entry point in the network, for instance, although the user is offered a hierarchical view of non-hierarchical data. The display of a tree then makes it possible for the user to change the focus of interest, assuming this focus is consistently positioned at the center of the screen. The visual information supported by the layout can also be complemented by visual cues such as node size and labels.
Tree layout algorithms more or less belong to two distinct categories. The first category corresponds to the so-called hierarchical drawing of trees originally proposed by Reingold and Tilford (Reingold and Tilford, 1981) (extended by Walker (Walker, 1990)) and reconsidered recently by Buchheim et al (Buchheim et al., 2002). Even if these algorithms do not perform well in terms of aspect ratio, their interest mainly resides in their ability to deal with varying node sizes. Another interesting feature is the possibility to map the layout into a radial representation, through a simple transformation more or less sending the bottom line of a top-down drawing to a circle (Eades, 1992). A clear advantage of such a representation is to allow direct comparison of the ancestor-descendant distance between nodes. However, this type of information is not sufficient when performing visual information retrieval or data mining. The second category of algorithms differs from the first one in that the focus is not on the hierarchical structure of the information, but rather on scalability and on the possibility of displaying large amounts of data. Tree maps, for instance, do not intuitively reflect the hierarchical structure of the information (Bruls et al., 2000; Shneiderman, 1991); they are mainly used in contexts where the user needs to directly access attribute values and changes. Cone trees, introduced in the pioneering work of Robertson (Robertson et al., 1991), suggested the use of 3D for displaying and navigating large hierarchies; several authors have later improved this technique (Carriere and Kazman, 1995; Jeon and Pang, 1998; Teoh and Ma, 2002). In this paper we first present some basic terminology about trees and aesthetic criteria. Then we describe precisely the two principal stages of our algorithm, and finally we discuss the time complexity of this algorithm. We conclude the paper with several drawings of an entire Linux file system containing about 270,000 files.
2. PRELIMINARIES
We define a (rooted) tree as a directed acyclic graph with a single source, called the root of the tree, such that there is a unique directed path from the root to any other node. Each node a on the path from the root to a node n is called an ancestor of n. For convenience we denote by ancestor(n) the unique nearest ancestor of n; this function is not defined when n is the root. Each node d such that there exists a directed path from n to d is called a descendant of n. We denote by outadj(n) (the children of n) the set of the nearest descendants of n. The degree of a node n, denoted deg(n), is the number of neighbours of n. We denote by ni the i-th child of the node n. Results from the graph drawing community show that if one wants to obtain an efficient drawing, one needs to make a trade-off between several aesthetic criteria (Battista et al., 1999). Here are the aesthetic criteria that our drawing algorithm takes into account.
- Crossing number: the edges should not cross each other.
- Number of bends: the polyline used to draw an edge should have as few bends as possible.
- Angular resolution: the minimal angle between two adjacent edges of a node n should be as close as possible to 2π/deg(n).
- The drawing of a subtree should not depend on its position in the tree; isomorphic subtrees should be drawn identically up to a translation and a rotation.
- The order of the children of a node should be respected in the final drawing.
3. ALGORITHM
The algorithm that we have designed is recursive, like Reingold and Tilford's algorithm (Reingold and Tilford, 1981). It uses a depth-first search traversal in order to draw the tree. As in Reingold and Tilford's algorithm, the linear running time is achieved by using relative positions for each node and by delaying the computation of node positions to a second phase of the algorithm. The idea of the algorithm is to use enclosing circles (instead of contours as in Reingold and Tilford's algorithm) to represent the space needed to draw a sub-tree. In the following we describe precisely the two stages of our algorithm. The first one is the computation of the position of the enclosing circle of each child's sub-tree relative to the node itself. The second one is the coordinate assignment, which prevents crossings in the final drawing.
3.1 Computation of the relative positions
We use a postfix (bottom-up) depth-first search procedure to assign the relative position γn of the center of the enclosing circle of n with respect to ancestor(n). If one considers that each sub-tree induced by a child of a node n has already been drawn and that one has an enclosing circle for each sub-tree drawing, the first part of our algorithm consists of placing each enclosing circle in the plane while preventing overlaps between them.
3.1.1 Enclosing circle location. In order to enhance the angular resolution in the final drawing, the idea of this algorithm is to place each enclosing circle around the node n. This operation can be done by determining an angular sector θi for each enclosing circle Ci. If the enclosing circles are placed inside their respective angular sectors, it is straightforward that there is no overlapping. Let r1, . . . , r|outadj(n)| be the radii of the enclosing circles. The first approach is to assign an angular sector θi proportional to ri such that

Σ_{i=1}^{|outadj(n)|} θi = 2π.
However, this can attribute an angular sector larger than π, which is too big to place a circle in. If such a case arises, we give an angular sector equal to π to the largest circle, and then we assign angular sectors to the others proportionally to their radii such that their sum is π. The total sum remains 2π. Then we place each enclosing circle such that it is internally tangent to its respective angular sector. If the angular sector is too big, the enclosing circle and the node n could overlap. This can be avoided by computing the distance δi from n to the center of the enclosing circle using the following formula:

δi = max( size(n) + ri , ri / sin(θi / 2) )
Using δi we place the center of the enclosing circle on the interior angle bisector of θi. Each coordinate γni is computed relative to the position of n.
γni = (xi, yi)  with  xi = δi cos( (Σ_{j=1}^{i} θj) − θi/2 )  and  yi = δi sin( (Σ_{j=1}^{i} θj) − θi/2 )
These coordinates will be used in the second stage of the algorithm in order to compute the final drawing. Figure 1 summarizes this part. One can see that fixing the angular sector to π when it is too big leaves a lot of space unused. We implement an O(n log n) algorithm to solve this problem. We start with an angular resolution Θ equal to 2π and a global radius R equal to Σ_{i=1}^{|outadj(n)|} ri. We treat the circles in decreasing order of their radii. If the sector calculated proportionally to the radius (θi = Θ · ri / R) is bigger than the maximum we can allocate for this circle (θi_max = 2 arcsin(ri / (ri + size(n)))), then we fix the sector to its maximum, decrease the angular resolution Θ by the angle θi_max, and decrease the global radius R by the radius ri. If all the sectors are set to their maximum, we use the remaining angular resolution Θ to space the sectors themselves. According to our tests the difference between the two solutions is not perceptible for normal trees.
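To make the simple proportional strategy concrete, here is a minimal Python sketch (ours, not the authors' implementation; the function name, the list-based interface and the scalar size(n) argument are assumptions) that assigns the sectors θi and computes the relative positions γni with the two formulas above.

import math

def place_children(node_size, radii):
    # Angular sector of each child's enclosing circle, proportional to its radius.
    total = sum(radii)
    sectors = [2.0 * math.pi * r / total for r in radii]
    # Simple variant from the text: cap the largest sector at pi and rescale the
    # remaining sectors so that their sum is pi (the total stays 2*pi).
    if max(sectors) > math.pi:
        k = sectors.index(max(sectors))
        rest = sum(r for i, r in enumerate(radii) if i != k)
        sectors = [math.pi if i == k else (math.pi * r / rest if rest > 0 else 0.0)
                   for i, r in enumerate(radii)]
    positions = []
    angle = 0.0
    for r, theta in zip(radii, sectors):
        angle += theta
        # Distance from the node so the circle fits in its sector without
        # overlapping the node: delta_i = max(size(n) + r_i, r_i / sin(theta_i / 2)).
        delta = max(node_size + r, r / math.sin(theta / 2.0)) if theta > 0 else node_size + r
        alpha = angle - theta / 2.0          # interior bisector of the sector
        positions.append((delta * math.cos(alpha), delta * math.sin(alpha)))
    return sectors, positions

# Example: a node of size 1 with three children whose enclosing radii are 2, 1 and 1.
print(place_children(1.0, [2.0, 1.0, 1.0]))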
3.1.2 Enclosing circle calculation. Now, we need to find the smallest circle enclosing the set of enclosing circles we have placed. A detailed comparison between different methodologies is given in (Freund et al., 2003). We use the incremental randomized algorithm proposed by Welzl (Welzl, 1991), which gives the optimal solution in linear average running time. In (Welzl, 1991) one needs to calculate the smallest enclosing circle of two circles (denoted EC2(C1, C2)) and of three circles (denoted EC3(C1, C2, C3)). The computation of EC2 is obvious. Let cn (resp. rn) be the center (resp. radius) of the circle Cn, and let ce (resp. re) be the center (resp. radius) of the smallest enclosing circle Ce.
Figure 1. Computation of locations: compute angular sectors, rescale, and compute relative positions.
The computation of EC3 can be done by solving the following system of equations:

|c1 ce|² = (re − r1)²
|c2 ce|² = (re − r2)²
|c3 ce|² = (re − r3)²

The algorithm EC(S, B) computes incrementally the enclosing circle of a set of circles S according to a set B of boundary circles of S. The definition of EC(S, B) is the following:

EC(∅, ∅) = ∅
EC(∅, {b1}) = {b1}
EC(∅, {b1, b2}) = {EC2(b1, b2)}
EC(S′ ∪ {c}, ∅): E = EC(S′, ∅); if c ∈ E then E, else EC(S′, {c})
EC(S′ ∪ {c}, {b1}): E = EC(S′, {b1}); if c ∈ E then E, else EC(S′, {b1, c})
EC(S′ ∪ {c}, {b1, b2}): E = EC(S′, {b1, b2}); if c ∈ E then E, else {EC3(b1, b2, c)}
The computation of the smallest enclosing circle of a set S corresponds to calculating EC(S, ∅). The selection of the circle c in S′ ∪ {c} needs to be done randomly to ensure the expected linear running time (Welzl, 1991). To apply the optimizations proposed in (Welzl, 1991), we use a double-ended queue (Knuth, 1973) to store S. This queue is initialized randomly at the beginning of the algorithm. We always dequeue the selected circle c from the end of the queue. When the recursive procedure is finished, we enqueue c at the end if c ∈ E, and at the beginning otherwise. Thus the elements of S have only been reordered. Placing a circle which is not in E at the beginning of the queue optimizes the order for the next calls. If one wants a strictly linear algorithm, one can use heuristics to approximate the smallest enclosing circle. For instance, one can start with the circumcircle of the bounding box of the set of circles and then merge all the circles incrementally, using EC2 as described in (Auber, 2003). Another heuristic determines the center of the enclosing circle by calculating the barycenter of the circle centers, weighted by the square of their radii. The radius of
the enclosing circle is then the maximum, over the circles, of the distance from this center to the circle's center augmented by that circle's radius. Our experiments show that on average the smallest enclosing circle is 7% better than those heuristics. Nevertheless, the experiments have shown that the drawings are almost the same whether we use the smallest enclosing circle or one of the heuristics presented above.
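A minimal sketch of the second heuristic just described (our own code; the representation of circles as (x, y, r) tuples is an assumption):

import math

def approx_enclosing_circle(circles):
    # circles: iterable of (x, y, r). Weight each center by r^2 (uniform if all r are 0).
    weights = [r * r for _, _, r in circles]
    if sum(weights) == 0:
        weights = [1.0] * len(circles)
    total = sum(weights)
    cx = sum(w * x for w, (x, _, _) in zip(weights, circles)) / total
    cy = sum(w * y for w, (_, y, _) in zip(weights, circles)) / total
    # Radius: largest center distance augmented by the corresponding circle radius.
    radius = max(math.hypot(x - cx, y - cy) + r for x, y, r in circles)
    return cx, cy, radius

# Example: the result contains all three circles but is not necessarily minimal.
print(approx_enclosing_circle([(0.0, 0.0, 1.0), (3.0, 0.0, 2.0), (0.0, 2.0, 0.5)]))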
3.1.3 Bend location. To reserve an angular sector needed for the connection of a node n with its ancestor, we add a dummy enclosing circle C during the enclosing circle placement process. The position of the bend βn is the intersection between the enclosing circle Cn and the line containing the center of C and n.
3.2 Coordinate assignment
After applying the previous algorithm we obtain, for each node n, an enclosing circle denoted Cn. Each center of Cn has a position γn relative to ancestor(n). Thus we can compute the position δn of the center of Cn relative to the center of Cancestor(n). Each node n has a position ζn and a bend position βn, both relative to the center of Cn.

function coordAssign(node n, Cnabs)
input: n, the node to draw; Cnabs, the absolute coordinates of the center of Cn.
1. Let rot be the rotation of center Cnabs such that Pancestor(n), rot(βn) + Cnabs and Cnabs are aligned.
2. Set Pn to rot(ζn) + Cnabs.
3. Set Pnβ to rot(βn) + Cnabs.
4. For all ni in outadj(n):
5.   call coordAssign(ni, rot(δni) + Cnabs), with δni = ζn + γni.

Figure 2. Coordinates assignment.
The final position of a node is obtained by using a pre-order traversal. After placing the root of the tree at the center of the view (coordinates (0,0)), we call a recursive function, detailed in Figure 2, to obtain the final absolute coordinates. In the following, we denote by Pn the final position of a node n and by Pnβ the final position of the bend associated to n for the reconnection. Figure 3 summarizes the rotation scheme of the algorithm.
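The following Python sketch is one way to render this pre-order pass (our own reading of the pseudocode in Figure 2; the Node fields id, zeta, beta, gamma and children, and the way the rotation angle is derived from the bend direction, are assumptions):

from dataclasses import dataclass, field
from typing import List, Tuple
import math

@dataclass
class Node:
    id: int
    zeta: Tuple[float, float]                 # node position relative to its circle center
    beta: Tuple[float, float]                 # bend position relative to its circle center
    gamma: Tuple[float, float] = (0.0, 0.0)   # circle center relative to the parent node
    children: List["Node"] = field(default_factory=list)

def rotate(p, angle):
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def coord_assign(node, center_abs, parent_pos, positions, bends):
    # Step 1: rotation about center_abs aligning the bend direction with the parent.
    if parent_pos is None or node.beta == (0.0, 0.0):
        angle = 0.0
    else:
        angle = (math.atan2(parent_pos[1] - center_abs[1], parent_pos[0] - center_abs[0])
                 - math.atan2(node.beta[1], node.beta[0]))
    zeta = rotate(node.zeta, angle)                           # step 2: node position
    positions[node.id] = (zeta[0] + center_abs[0], zeta[1] + center_abs[1])
    beta = rotate(node.beta, angle)                           # step 3: bend position
    bends[node.id] = (beta[0] + center_abs[0], beta[1] + center_abs[1])
    for child in node.children:                               # steps 4-5: recurse
        delta = rotate((node.zeta[0] + child.gamma[0], node.zeta[1] + child.gamma[1]), angle)
        coord_assign(child, (delta[0] + center_abs[0], delta[1] + center_abs[1]),
                     positions[node.id], positions, bends)

# Usage: positions, bends = {}, {}; coord_assign(root, (0.0, 0.0), None, positions, bends)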
3.3 Space and time complexity
The time complexity of the Bubble Tree algorithm is the sum of the complexities of the two stages described above. Each of these stages is linear, thus the complexity of the entire algorithm is linear. Furthermore, if one wants to obtain a layout using the optimal solution for the
Figure 3. Coordinates assignment.
first stage, we obtain an expected O(n log n) time complexity. This complexity comes from the use of Welzl's algorithm (Welzl, 1991) and from the relative position computation (cf. 3.1.1), which requires sorting the children of each node according to their enclosing circle radii. Regarding space, the algorithm requires storing five values for each node, so it is straightforward that it is linear in space.
4. CONCLUSION
In Figure 4, we present the drawings obtained with our algorithm and with two well-known tree drawing algorithms: Walker's algorithm (Walker, 1990) and the radial tree drawing proposed by Eades (Eades, 1992). The data set we use is an entire Linux file system, including users' directories, that contains about 270,000 nodes. Clearly the weak point of this algorithm is that we do not obtain a straight-line drawing. A straightforward bound on the number of bends is the number of internal nodes in the tree, which can be as large as the number of nodes minus one. However, in the final algorithm we automatically remove a bend βn if it is collinear with n and ancestor(n). This operation does not change the theoretical bound, which can be reached for a completely unbalanced tree. Figure 5 shows the spiral effect induced by the presence of several completely unbalanced sub-trees, which induce a large number of bends.
Figure 5. Spiral effect.
Figure 4. Bubble tree (left), radial tree (top right), Walker tree (bottom right).
Conversely, the number of bends is zero for a well-balanced tree, which produces a fractal effect (see Figure 6). On the set of sub-trees used in our experiments, we measured that the average number of bends is about 1% of the number of nodes; the best value is 0% and the worst value is 7.3%. In this paper we have presented an algorithm for drawing general rooted trees. The strong point of the Bubble Tree drawing is that it accentuates the angular resolution aesthetic criterion. Such a characteristic is very important for the purpose of Visual Information Retrieval. This algorithm has been implemented and compared with others using the Tulip software (Auber, 2001). During interactive visualization of huge file systems it has clearly demonstrated its efficiency. Indeed, the Bubble Tree algorithm makes it possible to easily detect isomorphic sub-trees even in a graph having more than 270,000 nodes. Furthermore, small modifications of the tree structure imply small modifications of the final drawing. This property is essential for the visual detection of similarities.
Figure 6. Fractal effect.
REFERENCES
Amenta, N. and Klingner, J. (2002). Case study: Visualizing sets of evolutionary trees. In IEEE InfoVis'02, pages 71–76.
Auber, D. (2001). Tulip. In Mutzel, P., Jünger, M., and Leipert, S., editors, 9th Symp. Graph Drawing, volume 2265 of Lecture Notes in Computer Science, pages 335–337. Springer-Verlag.
Auber, D. (2003). Graph Drawing Softwares, chapter Tulip – A Huge Graph Visualization Framework, pages 80–102. Mathematics and Visualization series. Springer-Verlag.
Di Battista, G., Eades, P., Tamassia, R., and Tollis, I. G. (1999). Graph Drawing: Algorithms for the Visualization of Graphs. Prentice-Hall.
Bruls, D. M., Huizing, C., and van Wijk, J. J. (2000). Squarified treemaps. In Data Visualization 2000, Proceedings of the joint Eurographics and IEEE TCVG Symposium on Visualization, pages 33–42. Springer.
Buchheim, C., Jünger, M., and Leipert, S. (2002). Improving Walker's algorithm to run in linear time. Technical report, Zentrum für Angewandte Informatik Köln, Lehrstuhl Jünger.
Carriere, J. and Kazman, R. (1995). Interacting with huge hierarchies: Beyond cone trees. In Gershon, N. and Eick, S., editors, IEEE Symposium on Information Visualization, pages 74–78, Atlanta, Georgia. Institute of Electrical and Electronics Engineers.
Eades, P. (1992). Drawing free trees. Bulletin of the Institute for Combinatorics and its Applications, 5:10–36.
Freund, R. M., Sun, J., and Xu, S. (2003). Solution methodologies for the smallest enclosing circle problem. Computational Optimization and Applications, 24-26.
Jeon, C. S. and Pang, A. (1998). Reconfigurable disc trees for visualizing large hierarchical information space. In IEEE InfoVis'98, pages 19–25.
Knuth, D. E. (1973). The Art of Computer Programming, volume 1. Addison-Wesley.
Munzner, T. (1997). H3: laying out large directed graphs in 3d hyperbolic space. In IEEE InfoVis'97, pages 2–10.
Reingold, E. M. and Tilford, J. S. (1981). Tidier drawings of trees. IEEE Transactions on Software Engineering, 7(2):223–228.
Robertson, G. G., Mackinlay, J. D., and Card, S. K. (1991). Cone trees: Animated 3d visualizations of hierarchical information. In SIGCHI, Conference on Human Factors in Computing Systems, pages 189–194. ACM.
Shneiderman, B. (1991). Tree visualization with tree-maps: A 2-d space filling approach. ACM Transactions on Graphics, pages 92–99.
Teoh, S. T. and Ma, K.-L. (2002). Rings: A technique for visualizing large hierarchies. In Kobourov, S. G. and Goodrich, M. T., editors, Graph Drawing, volume 2528 of Lecture Notes in Computer Science, pages 268–275. Springer.
Walker, J. Q. (1990). A node positioning algorithm for general trees. Software Practice and Experience, 20:685–705.
Welzl, E. (1991). Smallest enclosing disks (balls and ellipsoids). In Maurer, H. A., editor, New Results and New Trends in Computer Science, volume 555 of Lecture Notes in Computer Science. Springer-Verlag.
SURFACE RECONSTRUCTION OF 3D OBJECTS
Xiaokun Li, Chia Yung Han, William G. Wee
Dept. of Elec. & Comp. Eng. & Comp. Sci., University of Cincinnati, Ohio, 45221, USA
Abstract:
A priority-driven algorithm for surface reconstruction from a set of surface points is presented. It calculates the shape change at the boundary of the mesh area and builds a priority queue for the advance front of the mesh area according to these changes. The mesh growing process is then forced, through the queue, to propagate in a reliable direction. The algorithm reconstructs surfaces in a fast and reliable way. The efficiency of the proposed algorithm is demonstrated by the experimental results.
Key words:
Computational geometry; geometrical models of objects; data visualization, 3D object reconstruction; mesh growing; surface triangulation.
1. INTRODUCTION
3D object surface reconstruction and modeling from point clouds is an important topic in many fields of science and engineering, including computer graphics, computer vision, virtual reality, and reverse engineering1. Thus, a number of surface reconstruction algorithms have been proposed in recent years2,3,4,5,6. In this paper, a new incremental method for constructing triangular surfaces, called the priority-driven algorithm, is presented. The main contributions of the approach are a priority-driven strategy which forces the mesh growing process to propagate in an efficient way, a new triangulation method, and two invariants and four topological operations used to guarantee topological correctness and mesh quality in surface reconstruction.
2. ALGORITHM DESCRIPTION
2.1 Overview
In the algorithm, a set of surface points is the input and a set of triangles whose vertices are the input points is the output. The main steps of the algorithm can be briefly described as follows. First, all input points are stored in a kd-tree data structure7 for both point indexing and nearest neighbor searching. Then, a starting triangle is chosen at a 'flat' place on the object as the initial mesh area. The edges of the starting triangle constitute the initial advance front of the mesh area. The priority-driven strategy is then used to sweep the advance front ahead for mesh growing in an effective way. The algorithm reconstructs surfaces from both unorganized and organized points in a fast and efficient way. Moreover, it makes no assumptions about the points, such as local uniformity, absence of noise, closed or bounded surfaces, or restrictions on geometry or topology.
2.2 Data structure and terminology
How to efficiently find a point and search its neighboring points in a 3D point set is an important problem in surface reconstruction. In our research, a kd-tree based data structure and a standard neighbor search scheme7 are used. At any given stage of our algorithm, each point of the input data is tagged with one of three states: free, close, or accepted. The free points are those that are not yet included in the mesh. All input data are tagged free before running the algorithm. The free points close to the current front edge are also called close points; more precisely, the close points are those free points whose distance to the middle point of the current front edge is less than a predefined constant called the radius of the candidate region. Points that lie on the mesh are accepted points. The boundary segments of a triangle are called edges; a triangle has three edges. An edge on the advance front is a front edge. An accepted point on a front edge is also called a front point. A triangle of the surface is called a face. The point chosen for constructing the new triangle is called the reference point. Two invariants are maintained for mesh quality and topological correctness during the execution of our algorithm. Invariant 1: there should be no free, close, or accepted point in the interior or on an edge of any constructed triangle. Invariant 2: any edge of the triangular surface can be connected to only one or two triangles.
Two parameters need to be computed or set before the mesh growing process: the upper bound αmax on the angles of the newly constructed triangle, and the upper bound βmax on the angle between the normal of the newly constructed triangle and the normal of the triangle incident to the other side of the current front edge. In our implementation, αmax is set to 120° and βmax is set to 90°.
2.3 Priority-driven mesh growing
One open issue in incremental methods is that it is difficult to guarantee meshing correctness at places with sharp curvature changes, whereas it is always easier and safer to mesh data at a 'flat' place. The same situation also occurs in other methods. Therefore, during the mesh growing process, when the advance front reaches a 'sharp curvature change' place, if the mesh growing can be postponed there and forced to propagate over 'flat' places first, something interesting happens: in most cases, the mesh growing gradually propagates to the other side of the 'sharp curvature change' place through other paths and naturally stitches, or at least makes it easier to connect, the meshes on both sides of that place. The main idea of the method is to build a priority queue that forces the mesh to grow along the 'flat' places first. The entire procedure is similar to a greedy algorithm. Its major steps are as follows. Firstly, all input points are tagged as free. Then:
1. The initialization process determines the starting triangle at a flat place by studying the normal change of the surface. The three edges of the triangle are tagged as front edges, their cost T is set to zero, and they are pushed into a minimum priority queue (a sorted queue in which all edges are sorted by T in increasing order), called the edge heap. Then the following loop begins:
2. Pop the front edge with the smallest cost T from the edge heap and call it the current front edge.
3. Find the free points close to the current front edge using nearest neighbor search, and change their state from free to close.
4. Find the reference point in the close point set (see Section 2.4 for details).
5. Construct a new triangle through the reference point and the current front edge, and add it to the triangle list by choosing an appropriate topological operation, as defined in Section 2.4. Then, if new front edges are created in the triangulation process, compute their cost T by the cost calculation below and push them into the edge heap.
6. Cost calculation: before a new front edge is added to the edge heap, its cost T needs to be calculated. The cost T is designed for
measuring the curvature change of the local surface at the new edge, based on the fact that the normal deviation between the new triangle and its neighboring triangle (both connected to the current front edge) reflects the shape change of the local surface. T is obtained through the following equation:

T = 1 − (Na · Nb) / (|Na| |Nb|)
where Na is the normal of the new triangle incident to the current front edge and Nb is the normal of the triangle connected to the other side of the current front edge. The change of the cost T is monotonic.
7. Tag the reference point as accepted and set the other close points back to free.
8. Jump to step 2 and repeat the loop until the edge heap is empty.
9. In case the mesh growing stops before the entire point set is processed, a multiple-starting-seed strategy is used to continue the meshing process by jumping to step 1 while there are still unmeshed surface points.
A real example is shown in Fig. 1 to illustrate the mesh growing procedure.
Figure 1. Rabbit with 8,171 points and 16,300 faces. (a)–(c) show three intermediate states of the mesh growing process; (d) gives the final result.
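As an illustration of how the edge heap drives the growth, here is a small Python sketch (ours, not the authors' code; the edge identifiers and the example normals are placeholders) of the cost T and of the minimum priority queue used in steps 1, 2 and 5:

import heapq
import numpy as np

def edge_cost(normal_new, normal_neighbour):
    # T = 1 - (Na . Nb) / (|Na| |Nb|): zero on a flat surface, larger where the
    # local surface bends sharply across the front edge.
    na = np.asarray(normal_new, dtype=float)
    nb = np.asarray(normal_neighbour, dtype=float)
    return 1.0 - float(np.dot(na, nb) / (np.linalg.norm(na) * np.linalg.norm(nb)))

edge_heap = []                                           # minimum priority queue ("edge heap")
for e in [("a", "b"), ("b", "c"), ("c", "a")]:           # step 1: seed triangle edges, T = 0
    heapq.heappush(edge_heap, (0.0, e))
heapq.heappush(edge_heap, (edge_cost((0, 0, 1), (0, 0.4, 1)), ("c", "d")))  # a new front edge
while edge_heap:
    cost, edge = heapq.heappop(edge_heap)                # step 2: flattest front edge first
    # Steps 3-7 would gather close points, pick the reference point, build the new
    # triangle and push its front edges; omitted in this sketch.
    print(edge, round(cost, 3))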
2.4 Triangulation
The triangulation here is defined as the process of finding the reference point for the current front edge in order to construct a new triangle. Note that the triangulation of a new triangle is completed when an appropriate topological operation is chosen among four operations5: join, fill, link, and glue. The main idea of our triangulation method is to define a 3D candidate region with a predefined radius around the current edge
and to find all free and front points located in the region by searching the nearest neighboring points through the created kd-tree. Then, unsuitable candidate points are removed by two predefined rules to guarantee the mesh quality and prevent possible construction errors. Let t be the current trial point and ab the current front edge. Rule 1: compute the angle between at and ab, and the angle between bt and ba. If either is larger than αmax, the trial point is removed from the candidate set. Rule 2: calculate the angle between the normal of the triangle atb and the normal N of the triangle incident to the other side of the edge ab. If the angle is larger than βmax, the trial point is removed from the candidate set. To efficiently find the reference point in the candidate set, the priority-driven strategy is used again by assigning a cost to each candidate point and storing the points in a maximum priority queue (a heap in which all points are sorted by cost in decreasing order). In this way, the most likely reference point is popped from the queue first for consideration, which saves considerable computation time during triangulation. The cost computation considers the distance between the trial point and the current edge, and the normal change between the new triangle and its neighboring triangles. An intersection check is also carried out to guarantee the correctness of the triangulation.
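A minimal sketch of the two filtering rules (our own code, not the authors'; it assumes a consistent vertex ordering so that the computed triangle normal is oriented like the neighbouring one):

import numpy as np

ALPHA_MAX = np.deg2rad(120.0)   # upper bound on the triangle angles (alpha_max)
BETA_MAX = np.deg2rad(90.0)     # upper bound on the normal deviation (beta_max)

def angle(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def keep_candidate(a, b, t, neighbour_normal):
    a, b, t = np.asarray(a, float), np.asarray(b, float), np.asarray(t, float)
    # Rule 1: the angles at the edge endpoints must not exceed alpha_max.
    if angle(t - a, b - a) > ALPHA_MAX or angle(t - b, a - b) > ALPHA_MAX:
        return False
    # Rule 2: the normal of the triangle atb must not deviate from the
    # neighbouring triangle's normal by more than beta_max.
    return angle(np.cross(b - a, t - a), neighbour_normal) <= BETA_MAX

# Example: front edge ab in the z = 0 plane, neighbouring normal pointing up.
print(keep_candidate((0, 0, 0), (1, 0, 0), (0.5, 0.8, 0.1), (0, 0, 1)))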
3. EXPERIMENTS
The algorithm was implemented in C++ with the OpenGL API and run in a Windows environment. Ten data sets were tested with our algorithm. All tests were run on a PC with a 1.6 GHz P4 processor and 384 MB RAM. The input data sets include the Stanford rabbits (points: 8,171 and 35,947), Horse (points: 48,485), Club (points: 16,864), Head (points: 12,772), Knot (points: 10,000), Mechpart (points: 4,102), Hypersheet (points: 6,752), Dragon (points: 437,645), and Buddha (points: 543,652). Our test results show triangulation rates of around 20K triangles/sec (a real-time speed). An example is given in Figure 2.
4. CONCLUSION
In this paper, a priority-driven algorithm for surface reconstruction is proposed and presented. As the priority-driven strategy is used to force the mesh to grow along reliable paths, and efficient criteria are
developed for triangulation at each step of mesh growing, the algorithm is robust and fast.
Figure 2. Examples of Horse (96,507 faces), Club (33,617 faces), and Head (24,405 faces).
REFERENCES 1. M. Sonka, V. Hlavac, and R. Boyle, Image Processing, Analysis, and Machine Vision, second edition, PWS, (1999). 2. G. Turk and M. Levoy, Zippered Polygon Meshes from Range Images, ACM SIGGRAPH, pages 311-318, (1994). 3. Stanley J. Osher, Ronald P. Fedkiw, Level Set Methods and Dynamic Implicit Surfaces, Springer Verlag, November, (2002). 4. N. Amenta, S. Choi, T. K. Dey, and N. Leekha, A Simple Algorithm for Homeomorphic Surface Reconstruction, Intl. Journal on Computational Geometry & Applications, Vol. 12, pages 125-141, (2002). 5. F. Bernardini, J. Mittleman, H. Rushmeier, C. Silva, and G. Taubin, The ball-pivoting algorithm for surface reconstruction, IEEE Trans. on Visualization and Computer Graphics, Vol. 5(4), pages 349-359, (1999). 6. H. Hoppe, T. Derose, T. Duchamp, J. McDonald, and W. Stuetzle, Surface Reconstruction from Unorganized Point Clouds, ACM SIGGRAPH, pages 71-78, (1992). 7. J. H. Friedman, J. L. Bentley, and R. A. Finkel, An algorithm for finding best matches in logarithmic expected time, ACM Transactions on Mathematical Software, Vol. 3(3), pages 209-226, (1977).
FEATURE-BASED REGISTRATION OF RANGE IMAGES IN DOMESTIC ENVIRONMENTS Michael Wünstel, Thomas Röfer Technologie-Zentrum Informatik (TZI) Universität Bremen Postfach 330 440, D-28334 Bremen
{wuenstel, roefer}@informatik.uni-bremen.de Abstract
Registration is an important step for combining pictures that are taken either from different perspectives (multi-viewpoint), at different points in time (multi-temporal), or even from diverse sensors (multi-modal). The result is a single image that contains the combined information of all original images. We use 2½-D range data taken by a laser range scanner. During recording, occlusions can leave some of the objects incompletely scanned. Thus it is the task of the registration to produce images that give the most complete view of the scene possible. An algorithm often used for the registration of range images is the ICP algorithm (Rusinkiewicz and Levoy, 2001). This point-based algorithm only uses local information, and it has the advantage that it is universally applicable. The drawbacks are that it is very time-consuming, it may converge to a local minimum, and the result cannot be evaluated in absolute terms. Therefore, for our domestic scenario we present an approach that first detects certain global features of the room and then uses their semantic and spatial information to register the range images. This approach has the advantage that its tendency to get stuck in a local minimum is reduced and the overall performance is increased.
Keywords:
Registration, Matching, Range Image, ICP, Feature Extraction
1. INTRODUCTION
The algorithm most frequently used for 3-D point cloud registration is the ICP algorithm. In each step it determines the quality of the match and, if possible, calculates a better rigid transformation T between the two data sets. Many variants of the algorithm exist. An extensive comparison can be found in Rusinkiewicz and Levoy (Rusinkiewicz and Levoy, 2001). Their categorization follows the main steps of the algorithm: the selection of the points used, the matching of points, the weighting of corresponding points followed by a rejection of bad pairs, and the minimization
of the energy function based on the chosen error metric. The space of possible variations is therefore quite large, ranging from probabilistic approaches (Hähnel and Burgard, 2002) to numerically oriented approaches (Neugebauer, 1997). In Dalley and Flynn (Dalley and Flynn, 2001) several variants of the ICP algorithm have been tested and the basic limitations of the algorithm are presented. As the calculation, especially of the nearest neighbors, is computationally very expensive, and the algorithm may converge to a local minimum, we present an alternative that uses the knowledge of extracted features to match the data sets. We demonstrate the approach in domestic environments: we extract the walls of a room (vertical planes), identify their relations, and use this knowledge to match the original range images. We obtain a good way of evaluating the registration, not only by a number but by the relations of semantic features.
2. THE ALGORITHMS
In this section we present our implementation of the Levenberg-Marquardt ICP (LM-ICP) algorithm, which is based on the work of Fitzgibbon (Fitzgibbon, 2001). To overcome some problems of the generally applicable ICP algorithm mentioned above, we describe our feature-based matching method, inspired by the work of Stamos and Leordeanu (Stamos and Leordeanu, 2003), who extract planar regions and linear features at the areas of intersection and use them for registration. The approach presented here extends the work of Röfer (Röfer, 2002) from 2-D (matching line segments) to 3-D (matching planes, namely walls).
2.1 The LM-ICP Algorithm
The matching of planes is based on Fitzgibbon's ICP algorithm (Fitzgibbon, 2001) and will be presented in more detail. The fundamental ICP algorithm mainly consists of an energy function that represents the fitness of the match between two data sets and has to be minimized. The value of the energy function is the weighted sum of the squared distances between the points in data set B and their nearest neighbors in data set A:

E(a) = Σ_{i=1}^{|B|} w(|mφ(i) − T(a, ni)|)²
where:
a: parameters of the rigid transformation T from data set B to A
mi: point i of data set A
ni: point i of data set B
φ(i): index of the nearest point in data set A to point ni
w: weighting function, the identity if the distance is below a certain threshold and constant otherwise
The optimization consists of a main loop in which, in every step, a new rigid transformation T is calculated and the quality of the match is evaluated through the energy function. This is repeated until a termination condition is reached (either the progress per iteration step falls below a certain threshold or a maximal number of iterations is reached). In our implementation of the ICP algorithm we homogenize the points so that every point within a set of points has a minimum distance to its neighboring point. This prevents the algorithm from emphasizing regions with a high point density. To speed up the search for the nearest point of ni in the reference data set A, we insert the points into a kd-tree (Bærentzen, 2004). The search for the nearest neighbor is the most time-consuming task in our implementation. The weight of a pair of points is given by their distance if the distance is below a certain threshold, and has a constant value otherwise. The threshold is reduced in each minimization step until a minimal value is reached; therefore points that are farther away lose weight in the registration. No pairs are rejected. We use either the Euclidean distance or the Manhattan distance as an error metric. For the calculation of the new parameter values we apply the Levenberg-Marquardt algorithm. Besides the energy function, we use the difference quotient as an approximation of the required derivatives, which means that for every parameter two values of the energy function have to be calculated. As a starting position we move the center of gravity of data set B onto that of data set A.
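A minimal Python sketch of this energy (ours, not the authors' implementation; the Euler-angle parameterization of the rigid transform and the clamped weighting are assumptions consistent with the description above). A least-squares routine such as scipy.optimize.least_squares, driven by finite-difference derivatives, could then refine the parameters:

import numpy as np
from scipy.spatial import cKDTree

def rigid_transform(params, pts):
    # 6-parameter rigid transform: rotations rx, ry, rz (radians) then a translation.
    rx, ry, rz, tx, ty, tz = params
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return pts @ (Rz @ Ry @ Rx).T + np.array([tx, ty, tz])

def weighted_residuals(params, tree_A, pts_B, threshold):
    # w(|m_phi(i) - T(a, n_i)|): nearest-neighbour distance in A, clamped at the
    # threshold so far-away pairs keep a constant weight; E(a) is the sum of squares.
    dist, _ = tree_A.query(rigid_transform(params, pts_B))
    return np.minimum(dist, threshold)

# Small synthetic example: data set B is data set A shifted along x.
pts_A = np.random.rand(100, 3)
pts_B = pts_A + np.array([0.05, 0.0, 0.0])
tree_A = cKDTree(pts_A)
E = float(np.sum(weighted_residuals(np.zeros(6), tree_A, pts_B, 0.2) ** 2))
print(E)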
2.2 Our Approach: Feature-Based Registration
As already mentioned, the ICP algorithm is a powerful instrument for point registration, although it has some drawbacks. The quality of a match can be evaluated numerically; nevertheless, the overall success partly depends on more or less random parameters such as the starting position. Therefore, newer approaches try to extract features from the range data that can be used for matching (Schön, 2002). That is a genuinely difficult task since registration is usually a preprocessing step, and therefore only a limited number of features of the sub-scene are available. In our domain, domestic environments, the room itself supplies such features. We extract the walls and analyze the relations between them. The algorithm is inspired by the work of (Stamos and
Leordeanu, 2003) and (Röfer, 2002). The plane detection algorithm we use is based on the work of (Stamos and Allen, 2000). For every point it calculates the normal vector and the projection of the point onto the plane fitted to a neighboring region of this point. The neighboring points do not have to be searched for, as they are given directly through the ordering of the scanning process. To calculate the plane we use SVD (Singular Value Decomposition). As a result we get the normal vector of the plane and the point of the plane nearest to the origin. The normal vector is only accepted if the corresponding singular value is below a certain local threshold. The potential planes are then calculated using a region-growing algorithm where, again, co-planarity of neighboring values has to be achieved. The last step of the plane detection consists of a global comparison of the local normals with the "average" normal of the plane candidate. This enables us to sort out, for example, continuously curved surfaces. After the planes have been calculated, the actual registration algorithm can be started. Figure 1 shows an overview of the feature-based (FB) algorithm. It consists of three main parts. In the preprocessing step the walls are extracted and ordered spatially, and in each data set the ordered two-element subsets of walls are calculated. In the registration step, for every pair of the subsets calculated previously, the transformation between data sets A and B is determined: the data sets are aligned with regard to the first walls of each data set (step 1) and afterwards moved parallel to these walls until the other walls overlap (step 2). In the last step, the winning registration is selected. The quality of a match is calculated from the number of center points of data set B lying within a certain distance (0.05 m) of a wall of data set A. To additionally be able to rank registrations with the same number of matching walls, the overall sum of the distances is calculated.
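A small sketch of the SVD plane fit used for the per-point normals (our own code; the flatness threshold on the smallest singular value is left to the caller):

import numpy as np

def fit_plane(points):
    # Least-squares plane through an (n, 3) neighbourhood: the normal is the
    # direction of least variance; the smallest singular value measures flatness.
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    _, s, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    return vt[-1], centroid, s[-1]

# Example: noisy points near the plane z = 0.
pts = np.random.rand(50, 3) * [1.0, 1.0, 0.01]
normal, centroid, flatness = fit_plane(pts)
print(normal, flatness)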
3. EXPERIMENTS AND RESULTS
To perform our experiments we use the range data of an office scene. We first apply the LM-ICP algorithm to the data and afterwards the FB algorithm. The data sets consist of two scans taken from two different positions. The scene consists of the entrance area of an office with a bookshelf, a cupboard, a door and a chair. Figures 2a and 2b show the scene from two different perspectives together with the scanning equipment. Figure 3 shows the two data sets before the registration process; for the ICP-based algorithm the centers of gravity of the two data sets are subsequently aligned. The result of the LM-ICP algorithm can be seen in Figure 4. Figure 8 shows the planes detected for the range image resulting from the right position. Only planes larger than a certain threshold are taken into consideration. Here only neighboring information is used to fulfill the planarity condition; for the registration algorithm the condition has to be strengthened.
Figure 1. FB Registration Algorithm.
Figure 2a. Left view of the scene.
Figure 2b. Right view of the scene.
Figures 5 and 6 show the intermediate results of our feature-based matching algorithm. In the first step (Figure 5) the two corresponding planes WAi and WBs are aligned (brighter walls on the right). In the second step (Figure 6) the data set B is translated along WAi in such a way that WAj and WBt are congruent (brighter walls on the top). The black coordinate plane indicates the position and orientation of the scanning system. Figure 7 shows the final result of the best match. As the algorithm is based on the position of two corresponding wall pairs, it is ambiguous in the sense that more than one set describes the right transformation.
Figure 3. Starting formation with homogenized point density.
Figure 4. Result of the LM-ICP algorithm.
Figure 5. FB Matching, Step 1.
Figure 6. FB Matching, Step 2.
Figure 7. Result of the feature-based algorithm.
Figure 8. Planes detected (right position).
4. CONCLUSION AND OUTLOOK
We have presented an approach that bases the registration on the recognition of robust domestic features, namely vertical planes, predominantly the walls. In contrast to the common point-based ICP algorithm, this method does not use point correspondences but spatial relations of semantic features. In the future we plan to incorporate even more semantic information, such as the positions of certain pieces of furniture. Furthermore, strategies for the fast simultaneous handling of several scans have to be investigated.
ACKNOWLEDGMENTS This project is supported by the Deutsche Forschungsgemeinschaft, DFG, through interdisciplinary Transregional Collaborative Research Center “Spatial Cognition: Reasoning, Action, Interaction”.
REFERENCES Bærentzen, J. Andreas (14.06.2004). http://www.imm.dtu.dk/∼jab/. Dalley, Gerald and Flynn, Patrick J. (2001). Range image registration: A software platform and empirical evaluation. In Young, Danielle C., editor, Proceedings of the Third International Conference on 3-D Digital Imaging and Modeling (3DIM-01), pages 246–253, Quebec City, Canada. IEEE Computer Society, Los Alamitos, CA. Fitzgibbon, A. W. (2001). Robust Registration of 2D and 3D Point Sets. In Proceedings of the British Machine Vision Conference, pages 662–670. Hähnel, Dirk and Burgard, Wolfram (2002). Probabilistic Matching for 3D Scan Registration. In Fachtagung ROBOTIK 2002. Neugebauer, P. (1997). Geometrical cloning of 3D objects via simultaneous registration of multiple range images. In Proceedings of the 1997 International Conference on Shape Modeling and Application (SMI-97), pages 130–139, Los Alamitos, CA. IEEE Computer Society. Röfer, T. (2002). Using histogram correlation to create consistent laser scan maps. In IEEE International Conference on Robotics Systems (IROS-2002), pages 625–630. Rusinkiewicz, Szymon and Levoy, Marc (2001). Efficient variants of the ICP algorithm. In Young, Danielle C., editor, Proceedings of the Third International Conference on 3-D Digital Imaging and Modeling (3DIM-01), pages 145–152, Quebec City, Canada. IEEE Computer Society, Los Alamitos, CA. Schön, Nikolaus (2002). Feature evaluation for surface registration. Annual Report 32, University Erlangen-Nürnberg. Stamos, I. and Allen, P. (2000). 3-D Model Construction using Range and Image Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-00), pages 531–536, Los Alamitos. IEEE. Stamos, Ioannis and Leordeanu, Marius (2003). Automated feature-based range registration of urban scenes of large scale. In International Conference of Computer Vision and Pattern Recognition, volume 2, pages 555–561.
ASSESSMENT OF IMAGE SURFACE APPROXIMATION ACCURACY GIVEN BY TRIANGULAR MESHES Oliver Matias van Kaick, Hélio Pedrini Federal University of Paraná, Computer Science Department Curitiba, PR, Brazil
{oliver,helio}@inf.ufpr.br Abstract
Image quality assessment plays an important role in several image processing applications, including data approximation by triangular meshes. The determination of adequate metrics is essential for constructing algorithms that generate high quality models. This paper evaluates a number of different image measures used to refine a given triangular mesh until a specified accuracy is obtained, which are more effective than traditional metrics such as the magnitude of the maximum vertical distance between pairs of corresponding points in the images. Experiments show that a considerable reduction in the triangulation size can be obtained by using more effective criteria for selecting the data points. Several metrics for evaluating the overall quality of the resulting models are also presented and compared.

Keywords: Image quality assessment; surface approximation; triangular meshes.

1. INTRODUCTION
Surface approximation is of great interest in a variety of knowledge domains, including computer graphics, computer-aided design, computer vision, medical image analysis, and terrain modeling. These fields usually involve the generation, manipulation, interpretation, and visualization of large amounts of 3D spatial information. Substantial results have been reported in recent years on the generation of high-fidelity surface models 6,7,8. Polygonal surfaces are frequently used to approximate a finite set of data points, due mainly to their simplicity and flexibility. A common way of approximating surfaces is to use a triangulated irregular network (also called mesh or triangulation), where the structure of the model can be adjusted to reflect the density of the data. Relevant surface features can also be incorporated into the model. Several methods for generating triangular meshes from dense data sets have been proposed in the literature 7,9. However, the optimal placement of vertices and edges in a mesh is still an area of active research.
Most surface methods can be classified as refinement or decimation (also called coarsening or simplification) methods. Refinement methods 5,11 start with a minimal initial approximation of the surface and iteratively add new points to the triangulation until the model satisfies a specified level of accuracy. Decimation methods 7,1 start with a triangulation containing the entire set of data points and iteratively simplify it until the desired approximation criterion is achieved. A given set of points may in general produce many different triangulations, and the quality of the piecewise linear interpolation clearly depends on the specific triangulation of the data points. Mesh generation methods often use different error criteria to select points during the triangulation process and also to measure the fitness of the approximated surface models. The key to producing accurate models lies in the choice of an adequate point importance measure. A general comparison of triangulation approaches is difficult, because the criteria used to construct the triangulation are highly differentiated and there is no common way of measuring approximation error. Unfortunately, there is no formal and universally acknowledged definition of error 3. For rendering applications 4, for instance, similarity of appearance is crucial. In terrain modeling 7,5,12,2 the maximum vertical distance between the original data set and the approximated model is generally used as a measure to refine the mesh. For these reasons, it is important to have methods for assessing the image surface approximation accuracy between models, allowing researchers to design and evaluate new simplification approaches. The purpose of this paper is to evaluate a number of different measures, both to add new points to an initial sparse triangulation and to quantify the global difference between two surface models.
2. SURFACE APPROXIMATION QUALITY MEASURES
This section initially presents a number of metrics used to refine a given triangular mesh until a specified accuracy is satisfied. Experiments show that a significant reduction in the size of the triangulation can be obtained simply by adopting different metrics for selecting the points. Several metrics for evaluating the quality of the resulting models are also presented and compared.
2.1 Measures for mesh refinement
Methods for constructing triangular meshes typically add new points to an initial sparse triangulation until a specified degree of accuracy is achieved. Many different strategies can be used to determine the order of insertion. A common choice is to repeatedly add the worst-fitting point to the current triangulation, given by the magnitude of the maximum vertical difference between the original and approximated models 10, until all points fit within
a desired error tolerance. Although this procedure is not optimal, in the sense that the resulting triangulation may not have the minimal error for a particular number of points, it is simple to implement and produces compact triangulations. However, since the point of highest error is always selected, such a procedure is vulnerable to outliers. Other approximation criteria are possible, such as the root mean square error or the mean absolute error. The main contribution of this work is to rank the points by new criteria, instead of the maximum vertical distance. We briefly describe a number of metrics; experiments using these measures are presented afterwards. Each measure associates an error to a triangle t. f(p) and g(p) are the values at a point p(x, y) in the original model and in the approximated model, respectively, with 0 ≤ x ≤ (M − 1) and 0 ≤ y ≤ (N − 1), where M and N are the image dimensions, and 0 ≤ f(p) ≤ L and 0 ≤ g(p) ≤ L for all p, where L is the maximum intensity value. k(p) > 0 is the curvature at point p, calculated with a Laplacian filter, A(t) denotes the area of triangle t, and Δ(p) = f(p) − g(p). The measures are Maximum Error (ME), Absolute Vertical Sum (AVS), Squared Vertical Sum (SVS), Laplacian Maximum Error (LME), Laplacian Absolute Vertical Sum (LAVS), Laplacian Squared Vertical Sum (LSVS), Jaccard Coefficient (JC) and Czenakowski Distance (CZD):
AVS(t) =
p∈t
X
|Δ(p)|
SVS(t) =
p∈t
|Δ(p)| LME(t) = max p∈t k(p) 0
B B B JC(t) = ME(t) B1 − B @
X p∈t
(
p∈t
X |Δ(p)| LAVS(t) = k(p) p∈t
1 1, if f (p) = g(p) C C 0, otherwise C C C A(t) A
LSVS(t) =
CZD(t) = X „ p∈t
2.2
X [Δ(p)]2
X [Δ(p)]2 k(p) p∈t
ME(t).A(t) « 2 min(f (p), g(p)) 1− f (p) + g(p)
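For reference, a small Python sketch (ours, not the authors' code) that evaluates these per-triangle measures on the rasterized points of a triangle; the array-based interface and the use of the rasterized point count as A(t) in the example are assumptions:

import numpy as np

def refinement_measures(f_vals, g_vals, k_vals, area):
    # f_vals, g_vals: original and approximated values at the points p of triangle t;
    # k_vals: Laplacian curvature estimates k(p) > 0; area: A(t) (e.g. the pixel count).
    f_vals = np.asarray(f_vals, float)
    g_vals = np.asarray(g_vals, float)
    k_vals = np.asarray(k_vals, float)
    d = np.abs(f_vals - g_vals)
    me = float(d.max())
    return {
        "ME": me,
        "AVS": float(d.sum()),
        "SVS": float((d ** 2).sum()),
        "LME": float((d / k_vals).max()),
        "LAVS": float((d / k_vals).sum()),
        "LSVS": float(((d ** 2) / k_vals).sum()),
        "JC": me * (1.0 - np.count_nonzero(f_vals == g_vals) / area),
        "CZD": me * area * float(np.sum(1.0 - 2.0 * np.minimum(f_vals, g_vals)
                                        / (f_vals + g_vals))),
    }

# Example with five rasterized points of a triangle (values assumed positive).
print(refinement_measures([10, 12, 11, 13, 10], [10, 11, 11, 15, 9],
                          [1.0, 2.0, 1.5, 1.0, 1.0], area=5))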
2.2 Metrics for mesh evaluation
Objective image quality measures play an important role in various image processing applications 15,14 . Image quality measures are primarily used to evaluate the distortion due to, for instance, surface approximation, compression, blurring, or noise. The most frequently used metrics are deviations between the original and the coded images 10,13 , where mean square error (MSE) or signal to noise ratio (SNR) are the most common measures. A number of metrics are described below, whose main use is in the evaluation of the error introduced in the approximation of surfaces. These metrics are calculated using an error measure based on distances between each pair of corresponding image points, f (x, y) and g(x, y), with Δ(x, y) = f (x, y) − g(x, y). The metrics are Mean Average Error (MAE), Normalized
Mean Squared Error (NMSE), Root Mean Squared Error (RMSE), Peak Signal to Noise Ratio (PSNR), Normalized Correlation (NC), Czenakowski Distance (CZD), Multi-Resolution Error (MRE) and Jaccard Coefficient (JC):

MAE = (1 / (M·N)) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} |Δ(x, y)|

RMSE = sqrt( (1 / (M·N)) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} [Δ(x, y)]² )

NMSE = ( Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} [Δ(x, y)]² ) / ( Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} [f(x, y)]² )

PSNR = 10 log10( M·N·L² / Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} [Δ(x, y)]² )

NC = ( Σ_{x,y} (f(x, y) − μf)(g(x, y) − μg) ) / sqrt( Σ_{x,y} (f(x, y) − μf)² · Σ_{x,y} (g(x, y) − μg)² ), where μf is the mean of f and μg the mean of g.

CZD = (1 / (M·N)) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} ( 1 − 2 min(f(x, y), g(x, y)) / (f(x, y) + g(x, y)) )

JC = 1 − (1 / (M·N)) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} [1 if f(x, y) = g(x, y), 0 otherwise]

MRE = Σ_{r=1}^{log N} (1 / 2^r) (1 / 2^{2r−2}) Σ_{x=0}^{2^{r−1}−1} Σ_{y=0}^{2^{r−1}−1} |f̄_{x,y} − ḡ_{x,y}|, where f̄_{x,y} and ḡ_{x,y} represent the resolution block r of size (N / 2^{r−1}) × (N / 2^{r−1}) in the images.
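A small Python sketch of the global metrics (ours, not the paper's implementation; the MRE, which requires the multi-resolution block averages, is omitted, and f and g are assumed strictly positive for the CZD term):

import numpy as np

def evaluation_metrics(f, g, L=255.0):
    f = np.asarray(f, float)
    g = np.asarray(g, float)
    d = f - g
    n = d.size
    se = float((d ** 2).sum())
    return {
        "MAE": float(np.abs(d).mean()),
        "RMSE": float(np.sqrt(se / n)),
        "NMSE": se / float((f ** 2).sum()),
        "PSNR": 10.0 * np.log10(n * L ** 2 / se) if se > 0 else np.inf,
        "NC": float(((f - f.mean()) * (g - g.mean())).sum()
                    / np.sqrt(((f - f.mean()) ** 2).sum() * ((g - g.mean()) ** 2).sum())),
        "CZD": float((1.0 - 2.0 * np.minimum(f, g) / (f + g)).mean()),
        "JC": 1.0 - np.count_nonzero(f == g) / n,
    }

# Example on two small synthetic "images".
f = np.arange(1, 17, dtype=float).reshape(4, 4)
g = f + np.random.randint(0, 2, f.shape)
print(evaluation_metrics(f, g))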
3. EXPERIMENTAL RESULTS
A set of test images was approximated using the described measures for mesh refinement. The results for two images are shown in Figure 1. The number of points needed by each metric to obtain a fixed desired error level (RMSE) is shown in the charts. The metrics AVS, SVS, LAVS and LSVS generated the approximations with the smallest numbers of points. The JC metric performed like the ME and LME metrics, using more points than the previous measures. The CZD metric had the poorest results. Figure 2 shows the different triangulations generated for the approximation of one image, using the described measures. All triangular meshes satisfy the same error level (RMSE), and the results show the spatial distribution of points generated by each metric. Using the metrics AVS, SVS, LME, LAVS and LSVS, the triangulations had points evenly distributed across the image domain, yielding the best results. The ME and JC measures concentrated points at specific locations, mainly the portions of the image with high curvature, and the generated triangulations needed more points. Although the CZD metric also spread the points evenly across the image, the generated triangulation used more points than all the others. Figure 3 shows the described metrics for mesh evaluation calculated for the approximations of only one image, due to space limits. A fixed number of points was established, and the global error of the generated approximation was calculated using the various measures. The MAE, NMSE, RMSE, and CZD
metrics had similar results, showing the same global behavior. The PSNR and NC measures also had similar results, but ascending rather than descending compared with the previous ones. The JC measure also had similar results, but with a more linear output scale. The MRE metric presented a non-uniform output, although it measures the error at more levels of resolution.
4. CONCLUSIONS
Several different metrics used to refine triangular meshes were presented and evaluated in this work, motivated by the search for better criteria for the generation of more compact triangular models. Experiments demonstrated that a significant reduction in the size of the triangulation can be obtained by using more effective metrics than traditional ones. Several metrics for evaluating the quality of the resulting models were also described and compared.
REFERENCES 1. Cignoni, P., Montani, C., and Scopigno, R. (1998a). A comparison of mesh simplification algorithms. Computers and Graphics, 22(1):37–54. 2. Cignoni, P., Puppo, E., and Scopigno, R. (1995). Representation and visualization of terrain surfaces at variable resolution. In Scientific Visualization’95, pages 50–68. 3. Cignoni, P., Rocchini, C., and Scopigno, R. (1998b). Metro: Measuring error on simplified surfaces. Computer Graphics Forum, 17(2):167–174. 4. Cohen, J., Olano, M., and Manocha, D. (1998). Appearance-preserving simplification. In SIGGRAPH’98 Conf. Proceedings, Annual Conference Series, pages 115–122. 5. Garland, M. and Heckbert, P.S. (1995). Fast polygonal approximation of terrains and height fields. Technical Report CMU-CS-95-181, Carnegie Mellon University. 6. Garland, M. and Heckbert, P.S. (1997). Surface simplification using quadric error metrics. Computer Graphics, 31:209–216. 7. Heckbert, P.S. and Garland, M. (1997). Survey of polygonal surface simplification algorithms. In SIGGRAPH’97 Course Notes, 25. ACM Press. 8. Hoppe, H. (1996). Progressive meshes. Computer Graphics, 30:99–108. 9. Lindstrom., P. and Turk, G. (1998). Fast and memory efficient polygonal simplification. In IEEE Visualization, pages 279–286. 10. Little, J.J. and Shi, P. (2003). Ordering points for incremental TIN construction from DEMs. GeoInformatica, 7(1):5–71. 11. Schroeder, W.J., Zarge, J.A., and Lorensen, W.E. (1992). Decimation of triangle meshes. Computer Graphics, 26(2):65–70. 12. Snoeyink, J. and Speckmann, B. (1997). Easy triangle strips for TIN terrain models. In Ninth Canadian Conf. on Computational Geometry. 13. Wang, Z., Bovik, A.C., Sheikh, H.R., and Simoncelli, E.P. (2004). Image quality assessment: From error measurement to structural similarity. IEEE Transactions on Image Processing, 13(1). 14. Wang, Zhou and Bovik, Alan C. (2002). A universal image quality index. IEEE Signal Processing Letters, 9(3):81–84. 15. Zhou Wang, Alan C. Bovik and Lu, Ligang (2002). Why is image quality assessment so difficult? In IEEE Int. Conf. on Acoustics, Speech & Signal Processing.
Figure 1. Number of points used by approximations for two images: (a) klamath falls-e, (b) lena.
Figure 2. Triangulations for different approximations of the lena image with RMSE = 25: (a) original image, (b) ME (1048 points), (c) AVS (485 points), (d) SVS (440 points), (e) LME (497 points), (f) LAVS (526 points), (g) LSVS (434 points), (h) JC (1006 points), (i) CZD (1042 points).
Figure 3. Metrics for mesh evaluation calculated for the approximations of the klamath falls-e image: (a) MAE, (b) NMSE, (c) RMSE, (d) PSNR, (e) NC, (f) CZD, (g) MRE, (h) JC.
NON-UNIFORM TERRAIN MESH SIMPLIFICATION USING ADAPTATIVE MERGE PROCEDURES Flávio Mello1, Edilberto Strauss2, Antônio Oliveira2 and Aline Gesualdi2 1 Institute of Research and Development–IPD
Av das Américas 28705 D-13 Guaratiba CEP 23020-470 Rio de Janeiro, RJ, Brazil [email protected] 2 Department of Electronics and Computer Engineering –DEL–POLI–UFRJ
Computer Graphics Laboratory–LCG–COPPE–UFRJ P. Box 68504, CEP 21945-970 Rio de Janeiro, RJ, Brazil [email protected],[email protected],[email protected]
Abstract
The performance of a walkthrough over terrain models is deeply influenced by the high level of detail of real scenarios. To guarantee natural and smooth changes in a sequence of scenes, it is necessary to display the actual height-field maps at interactive frame rates. These frame rates can be achieved by reducing the number of rendered geometric primitives without compromising the visual quality. This paper describes an optimized algorithm for building a triangular mesh, the terrain model, which combines an efficient regular grid representation with low memory requirements, using a bottom-up approach.
Keywords:
Terrain mesh simplification; triangles merge; quadtrees.
1. INTRODUCTION
Terrain walkthroughs play an important role in virtual reality, as observed in computer systems such as Geographic Information Systems (GIS), military mission planning, and flight simulation. Regularly sampled grid data, known as a Digital Elevation Model (DEM), is required1 to represent terrain altimetry. However, the relationship between the actual map image resolution and its associated data can easily exceed the capabilities of typical graphics hardware, which makes a real-time interactive application impossible. In order to compute the DEM's triangular mesh, it is usually necessary that at least one of the following properties be fulfilled, as described in2–8.
The same terrain region may be triangulated differently depending on which of these properties are chosen. The mesh can:
- be conforming: a triangle is not allowed to have a vertex of another triangle in the interior of one of its edges;
- respect the input: the set of resulting mesh vertices is included in the set of DEM pixel coordinates;
- be well shaped: the angles of any mesh triangle are neither too large nor too small; usually the triangle angles are required to lie in the range from 45° to 90°;
- be non-uniform: the mesh is fine near the borders of the components (pixel gradient values) and coarse far away from these borders.
The non-uniform mesh generation method we describe in this paper is based on a quadtree structure. Our algorithm divides the DEM into a regular grid, and then merges the redundant triangles into bigger ones. The resulting mesh is conforming, although it is neither well shaped nor respects the input.
2.
RELATED WORK
Lindstrom et al.3 proposed a block-based algorithm, which uses a bottom-up strategy combined with quadtrees. Duchaineau et al.9 created the Real-time Optimally Adapting Meshes (ROAM), which have been widely used in games. ROAM learned much from Lindstrom's algorithm and is much faster. However, during SIGGRAPH 2000 course 39 a decrease in ROAM's performance was reported when rendering terrains at high levels of detail6,7. Hoppe5 generates TINs with View-Dependent Progressive Meshes in order to perform real-time terrain fly-over. It produces a mesh with far fewer triangles than a regular grid based one, but it spends too much time on optimization. Most of the work presented in this paper is based on Rottger's algorithm4. He gave in his paper an important method of crack elimination, which has since been used by developers. Rottger's algorithm is regular grid based, and uses an error metric different from Lindstrom's.
3.
OVERVIEW
The underlying data structure of the presented algorithm is basically a quadtree. For the discussion in this paper, it is assumed that the height field dimensions are 2^n × 2^m, where n and m might not be equal. The mesh generation presented here will be described as a sequence of two steps. First, the height field is recursively divided into four quadrants, called nodes. Every tree node is divided until a customized tree height is reached. Since no simplification criteria over the height field points are applied, the resulting tree is a fully divided quadtree. Every tree node corresponds to a square patch (triangle pair) of the height field. This implies that the squares represented by tree leaves correspond to the most refined subdivision of the height field. This subdivision represents a DEM regular and uniform grid.
The second step of the algorithm implements the merge of redundant triangles into bigger ones. Just the quadtree leaves are inserted into a triangle pair list (TPL), and the merge of the redundant triangles occurs during this insertion. At the end of this algorithm step, the TPL will be composed of different sized triangle pairs, representing the optimized mesh. Each leaf node may merge with a TPL node if they share two properties. First, the nodes must have a coincident edge, which means that their edges should not only be adjacent, but also have the same side sizes (see Figure 1a). The second merge property states that the nodes must have the same topology. It demands that the squares represented by the nodes be in the same spatial plane, i.e. have the same normal, as shown in Figure 1b. At first, it is checked whether a horizontal merge may occur, and then whether it may occur in the vertical direction. If the merge criteria are satisfied, the list node is removed from the TPL and the two nodes become a new larger one. This new node will also be tested against the other TPL elements during its insertion into the list. Thus, a node insertion can lead to many triangle pair merges.
Figure 1. Merge criteria: (a) Two neighbor squares may merge if their coincident edges are the same size. (b) Merge may occur only with coplanar patches.
The merging behavior varies according to the position of the component (pixel value gradient) over the quadtree subdivision. The inner height map pixel value that pulls a vertex to a higher or lower altitude may be located at one of the three positions illustrated in Figure 2a. The pixel may be coincident with the vertex neighboring all four patches (patches 5-6-7-8); coincident with the vertex neighboring just two patches (patches 5-7); or coincident with the corner of the quadtree subdivision (patch 7). These configurations of the inner height map pixels imply three primitive patch patterns created by the proposed algorithm, as shown in Figure 2b. Figure 2c represents the quadtree subdivision by Rottger4 and Lindstrom3 just before determining how the triangle fan should be configured in order to avoid cracks. In the first and second cases, the proposed method uses fewer triangles than Rottger and Lindstrom even before their attempt to eliminate the cracks. In the third case, the method uses more rendering triangles than the other ones. Since the other methods still need to do some splitting before drawing the triangulated height field patch, it is expected that the presented algorithm will need fewer triangles to render the patch in this case too. The probability p1 of an inner height map pixel value corresponding to the first case patch is 4/9. The proposed method represents the first
Figure 2. Different DEM subdivision according to the inner height map pixel value position. (a) Original Subdivision; (b) Adaptative Merge Subdivision; (c) Rottger Subdivision.
case patch using 6 triangles, while the Rottger method uses 7. So the rendering triangles drawing rate (RTDR) between the proposed algorithm and Rottger's, in this case, is 6/7. Therefore, the total rendering triangles drawing rate for an inner height map pixel value is RTDR_interior = Σ_{i=1}^{3} p_i × RTDR_i = (4/9 × 6/7) + (4/9 × 7/10) + (1/9 × 8/4) = 0.9143, which indicates that the proposed algorithm would use 8.57% fewer triangles than the Rottger method. A similar analysis can be made on the border of the height field. There, the height map pixel value may be coincident with the vertex neighboring two patches of two different nodes, or coincident with the vertex neighboring just two patches of the same node, or even coincident with the corner of the height field. Computing the RTDR_border of the height field, it would reach the value of 1.3482. This means that the proposed algorithm needs 34.82% more triangles than the Rottger method at the border. Although the Rottger algorithm performs better at the border of the height field, the proposed algorithm is still more efficient due to the gain obtained inside the height field. It must be observed that the number of inner triangles grows much faster than the number of border triangles as the quadtree height increases. In fact, the algorithm gets better results than Rottger's when the quadtree height increases from 4 to 5. Real applications frequently use quadtree height values between 7 and 9 [13].
4.
RESULTS
All screen shots have been taken from the application running on an ordinary PC with a 1 GHz processor and a GeForce 2 MX 400 video board. The 1024 × 1024 grid of Figure 3a represents elevation data of Salt Lake City West, taken from the US Geological Survey. The elevation at each grid vertex is given by an integer between 0 and 255, where one unit represents 8.125 meters. Figure 3b
represents a 2048 × 2048 grid extracted from the Rio de Janeiro City height field. It was taken from Instituto Militar de Engenharia and represents Rio de Janeiro City neighborhoods. One unit of elevation in this grid represents 0.418 meters.
Figure 3. Height field used for examples: (a) Salt Lake City West (United States) 1024×1024 grid; (b) Rio de Janeiro City Neighborhoods (Brazil) 2048 × 2048 grid.
The regular grid of Figure 3a, generated by a quadtree subdivision, would take 524,288 triangles since the quadtree height was set to 9. By applying the merge techniques described in this paper, the proposed method generates a rendering list with 384,146 triangles, while the Rottger method12 takes 461,104 triangles. This represents a 26.73% optimization when compared to the regular grid, and 16.69% when compared to the Rottger method12. Considering the case of Figure 3b, the regular grid generated by a quadtree subdivision would also take 524,288 triangles, because the quadtree height is the same for both examples. The proposed method generates a rendering list with just 313,052 triangles, while the Rottger method12 takes 378,036 triangles. This represents a 40.29% optimization when compared to the regular grid, and 17.19% when compared to the Rottger method12. The region presented in Figure 4a corresponds to the Salt Lake City mesh created by the proposed algorithm, while Figure 4b corresponds to the Rio de Janeiro neighborhoods mesh. In this example, the quadtree height was set to 6 for both regions because greater values produce too overcrowded figures. It should be observed that the resulting meshes are fine near the contour lines, and coarse far away from these curves.
5.
CONCLUSION
We presented a bottom-up algorithm for optimizing terrain mesh triangulations. The method has been implemented and provides high quality triangulations with thousands of geometric primitives. The generated mesh is conforming, although it is neither well shaped nor respects the input. The coherence between frames has not been exploited yet, but the method has achieved good frame rates on PC platforms, such as 47 fps. Critical future issues include level-of-detail rendering and an efficient paging mechanism, which will allow rendering height fields that do not entirely fit into RAM.
Figure 4. Generated meshes top view: (a) Salt Lake City West mesh; (b) Rio de Janeiro City Neighborhoods mesh.
REFERENCES
1. Turner, Bryan, Real-Time Dynamic Level of Detail Terrain Rendering with ROAM, 2000. (www.gamasutra.com/features/20000403/turner 01.htm)
2. Zhao, Youbing, Zhou Ji, Shi Jiaoying, Pan Zhigeng, A Fast Algorithm For Large Scale Terrain Walkthrough, 2002.
3. Lindstrom P., Koller D., et al., Real-time continuous level of detail rendering of height fields, Computer Graphics, SIGGRAPH '96 Proceedings, pp. 109-118 (1996).
4. Rottger, S., Heidrich, W., Slusallek, P., Seidel, H., Real-Time Generation of Continuous Levels of Detail for Height Fields, (1998).
5. Hoppe, H., Smooth View-Dependent Level-of-Detail Control and its Application to Terrain Rendering, Technical Report, Microsoft Research (1998).
6. Blow, Jonathan, Terrain Rendering at High Levels of Detail, Paper for the Game Developers' Conference 2000, San Jose, California, USA.
7. Blow, Jonathan, Terrain Rendering Research for Games, Slides for SIGGRAPH 2000 Course 39, 2000.
8. De Berg, Mark, et al., Computational Geometry - Algorithms and Applications, 2nd ed., Berlin, Springer, 2000, chap. 14.
9. Duchaineau, Mark, Wolinsky, Murray, et al., ROAMing Terrain: Real-time Optimally Adapting Meshes, IEEE Visualization '97 Proceedings, 1997.
10. Cormen, Thomas H., Leiserson, Charles E., Rivest, Ronald L., Introduction to Algorithms, MIT Press, 1997, pp. 221-243.
11. Markenzon, Lilian, Szwarcfiter, Jayme Luiz, Data Structures and its Algorithms, Livros Técnicos e Científicos, 1994. (in Portuguese)
12. S. Rottger, Terrain LOD Implementations - libMini, http://www.vterrain.org/LOD/Implementations/. [captured on 26/03/04]
13. A. Ogren, Continuous Level of Detail in Real-Time Terrain Rendering, MSc. Dissertation, University of Umea, January 2000.
ON USING GRAPH GRAMMARS AND ARTIFICIAL EVOLUTION TO SIMULATE AND VISUALIZE THE GROWTH PROCESS OF PLANTS Dominika Nowak, Wojciech Palacz, Barbara Strug Institute of Computer Science, Jagiellonian University, Nawojki 11, Cracow, Poland
{nowakd,palacz,strug}@ii.uj.edu.pl Abstract
In this paper we propose a hierarchical approach to representing plants and their environment. The graph grammar is used to simulate the growth process of each plant. This approach is combined with an artificial evolution that models the growth of the whole environment. The proposed approach is illustrated with an example of a set of fern-like structures.
Keywords:
Graph grammars; plant growth process; artificial evolution.
1.
INTRODUCTION
Human beings have always tried to understand Nature and learn from it. With the advent of computers, and computer graphics in particular, researchers were tempted to find ways of recreating the world on our monitors. Mathematical models for the simulation of developmental processes are one way of recreating the form of an organism and its growth process. Techniques allowing for the synthesis of realistic models of plants can be useful in computer assisted education, design, computer arts and landscape generation. For such a simulation a good mathematical model is needed. However, nearly all living organisms are extremely complex. Thus it is very difficult to describe them in a mathematical form. Even if such a model is found it is usually very complicated and may cause computational problems. So rather than looking for an exact mathematical formula for a specific organism (structure) it is easier to look for a method (a procedure) to generate it. Over recent years there has been a significant development of research in the field of plant-like structure generation, leading to the creation of two methodologies: space-oriented ones and structure-oriented ones. In space-oriented methods a state of each point of the model space is defined. The
best known space-oriented methods are diffusion limited growth,4 reaction-diffusion models1,6 and their discrete counterparts: cellular automata12 and diffusion limited aggregation13. In the structure-oriented methods the location of each component is described; the emphasis is put on where each module of the structure is located11. Probably the best known and most widely researched methods in this group are L-systems. L-systems were proposed by Lindenmayer5 as a tool for describing the growth process of simple multicellular organisms. They have been extended for higher plants and are now used to simulate linear and branching structures built from modules9. Though the expressive power of L-systems and their extensions is quite substantial, they lack the capability to address the problem of local modifications. Hence it is impossible to make changes confined to a small neighbourhood without affecting the whole structure. It is also impossible to derive the geometrical and topological properties directly from the string. Graph grammars seem to be particularly useful to address these drawbacks, as they produce graphs which have a topological structure. The main aim of our work is to simulate the growth of plants. We need a formal model which provides a convenient way of representing plant structures, and which allows for runtime modification.
2.
HIERARCHICAL REPRESENTATION
As many real-life plants are recursive (a smaller branch spawns from the larger branch, etc.), hierarchical graphs seem to be the most suitable. Moreover the hierarchical structure of the representation allows for the uniform representation of the plants and their environment which can be considered to be the node on higher level in the hierarchy. To be able to visualize plants we need information on geometrical objects represented by each node of a graph. This is done by using labelled nodes and assigning to each label a primitive it represents. Because the data on age of plants, their level, etc. is required, we decided to use attributed graphs and store this data as attribute values. A directed graph G = (V, E, s, t) consists of a finite set of nodes V , a finite set of edges E (collectively called atoms), and mappings s, t : E → V (nodes s(e) and t(e) are called the source and the target of edge e, respectively). Let an attribute be a function a : X → Da , where X ⊂ V , and Da is a set of attribute values. Let A be a finite set of attributes. Let B be a set of node labels. Attributed labelled graph G = (V, E, s, t, atr, lab) is a directed graph where mapping atr : V → P (A) assigns a set of attributes to each node and lab : V → B assigns labels to nodes. For every v ∈ V and a ∈ atr(v) the value of a(v) must be defined. Let ⊥ be a fixed value different from any node or edge, used to denote that a given node or edge has no parent. Hierarchical
graph G = (V, E, s, t, atr, lab, par) is an attributed labelled graph, with par : V ∪ E ∪ {⊥} → V ∪ E ∪ {⊥} being the parent assigning function. Function par has to be acyclic: no edge or node can be its own ancestor. A graph representing a plant is generated by a graph grammar, which contains a set of rules (productions). In terms of plant generation, applying a production corresponds to a single modification of a plant, and consecutive application of rules reflects its growth. A more formal treatment of the relevant theory can be found in2,8,10 – in this paper the focus is on its application. As an example, let us present a field of fern-like plants. Every fern starts as a single stem. The stem grows only at the top (this assumption was made for simplicity's sake). In every growth phase the stem sprouts either a pair of leaves or a branching sub-stem. This smaller stem also starts to grow. The graph productions displayed in Fig. 1 are used to grow new ferns. The productions not only create new nodes and edges, but also set new attribute values. As in many plants different forms of growth are possible, the grammar describing this growth is usually non-deterministic, i.e. the existence of many productions with identical left sides is allowed. A real number pi ∈ (0, 1) is assigned to each production to define the probability of applying this production. A change in the probability of applying a given production may result in a different graph, thus representing a differently looking plant. Still, as the same grammar is used, the plant would belong to the same class (species). Thus we have decided to associate a vector of probabilities with each plant. The environment in which plants are grown can be considered to be a tuple (hG, R, v, p), where hG is a hierarchical graph containing nodes representing all plants, R is a grammar describing their growth pattern, v = (v1, . . . , vk) is a sequence of parameters used in the visualization of the graph, and p = (p1, . . . , pj) is a sequence of numbers defining the probabilities of applying a given production (where j is the number of productions in the grammar R). Each plant is represented by a triple (h, v, p), where h is a hierarchical subgraph of hG describing the structure of the plant, and v and p are defined as above.
3.
SIMULATION OF THE GROWTH PROCESS
The simulation process we propose can be divided into two parts. One is the growth process of individual plants within the environment, which is overseen by the graph grammar and some additional conditions presented in this section. The other is the growth of the environment, i.e. the possibility for a plant to be "born" or to "die". This part of the growth is governed by an algorithm based on evolutionary computations3,7. The simulation of the growth process is started by applying the production from Fig. 1a N times, adding plants to the environment. The new plants consist of just a predefined initial graph g0 representing a small stem; the vectors p and v are
Figure 1. Some productions of the grammar for fern-like structures. [Production diagrams omitted; the recoverable attribute updates are: a) a new fern with a single stem is added to the environment (fern.level := 1, stem.alive := true, stem.birth := simulator.time); b) when stem.alive = true, the stem sprouts a pair of leaves and a new stem segment stem' (stem.alive := false, stem'.alive := true, stem'.birth := simulator.time); c) when stem.alive = true, the stem sprouts a new segment stem' and a branching sub-stem stem'' in a new fern'' (stem.alive := false, stem'.alive := true, stem''.alive := true, stem'.birth := stem''.birth := simulator.time, fern''.level := fern.level + 1); d) when stem.alive = true, the stem sprouts a new segment stem' and two branching sub-stems stem'', stem''' in new ferns fern'', fern''' (stem.alive := false, stem'.alive := true, stem''.alive := true, stem'''.alive := true, fern''.level := fern.level + 1, fern'''.level := fern.level + 1, stem'.birth := stem''.birth := stem'''.birth := simulator.time).]
Figure 2. Examples of fern-like structures.
copied from the environment. Then at each time step of the simulation a plant grows according to the productions of the grammar defined in the environment. The probability of applying a given production i depends on the ith element of the vector p assigned to the plant. These probabilities may be constant or may depend on some attributes of the nodes to which a given production is to be applied. In the case of fern-like structures the probability of applying a production depends on the position of the node corresponding to the left side of the production within the hierarchical graph representing a given plant. At the beginning of its development the plant is more likely to sprout a twig or leaves than to stop growing. The deeper in the hierarchy a node is, the less likely the sprouting of a new twig is and the more likely the use of a finishing production is. During the simulation new plants are also generated with some probability pnew, which is a global parameter. The structure of a new plant is created by production 1a; the set of parameters used to visualize it and the sequence of probabilities of applying the productions are inherited from its parent. These sequences may be mutated when the new offspring is generated. These parameters define the preferred way of growth of this plant and guarantee that this plant will be similar to its parent but very unlikely to be identical. The mutation applied to vectors v and p is based on a Gaussian distribution. A number xi ∈ [ai, bi] is mapped to xi' = xi + N(0, d) (and clipped to [ai, bi]), where ai and bi are the lower and upper bounds for a given parameter and d = r·(bi − ai). The value r corresponds to the amount of mutation actually applied.
The parent selection process is based on the age of plants, i.e. on the information of how many time steps a plant has survived so far, with the youngest and oldest plants being the least likely to be selected. The minimal age amin and maximal age amax for a plant to produce an offspring are also set. Fig. 2a depicts a graph representing the fern-like structure shown in Fig. 2b. A number of structures obtained in a simulation process is shown in Fig. 2c.
4.
CONCLUSIONS AND FUTURE WORK
In this paper we presented a model in which the environment contains only one "species" of plants, but nothing in the method prevents us from defining a grammar R able to grow different species within the same process. Alternatively, a set of grammars (a grammar system) could be associated with the same environment, each responsible for growing different plants. The parent selection process is based on the age of a plant only. A more complex fitness function taking into account how well the plant "grows" (i.e. how many new twigs or leaves it sprouts successfully) is currently being researched. Changing the global probability of generating offspring, pnew, over time is also planned.
REFERENCES
1. Fowler D. R., Meinhardt H., Prusinkiewicz P., Modeling seashells, Computer Graphics, 26(2), 379-387.
2. Grabska E., Theoretical Concepts of Graphical Modelling Part Three: State of the Art, Machine Graphics & Vision, vol. 3, no. 3, 1994, pp. 481-512.
3. Holland, J. H., Adaptation in Natural and Artificial Systems, Ann Arbor, 1975.
4. Kaandorp J., Fractal Modelling. Growth and Form in Biology, Berlin, Springer Verlag.
5. Lindenmayer A., Mathematical models for cellular interaction in development, parts 1 and 2, Journal of Theoretical Biology, 18, 280-312.
6. Meinhardt H., Models of Biological Pattern Formation, London, Academic Press.
7. Michalewicz, Z., Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, Berlin Heidelberg New York, 1996.
8. Palacz, W., Algebraic hierarchical graph transformation, Journal of Computer and System Sciences, vol. 68, no. 3, 2004, pp. 497-520.
9. Prusinkiewicz P., Lindenmayer A., The Algorithmic Beauty of Plants, New York, Springer.
10. Rozenberg, G., Handbook of Graph Grammars and Computing by Graph Transformation, World Scientific, London, 1997.
11. Smith A. R., Plants, Fractals and Formal Languages, Computer Graphics, 18(3), 1-10.
12. Toffoli T., Margolus N., Cellular Automata Machines: A New Environment for Modeling, Cambridge, MA, The MIT Press.
13. Witten T., Sander L., Diffusion-limited aggregation, Physical Review B, 27, 5686-5697.
AUTOMATIC TESSELLATION OF QUADRIC SURFACES USING GRASSMANN-CAYLEY ALGEBRA Frédéric Jourdan,1 Gérard Hégron,2 Pierre Macé3 1 École des Mines de Nantes, La Chantrerie, 4 rue Alfred Kastler. B.P. 20722, F-44307 Nantes; 2 CERMA, École d’architecture de Nantes, BP 81931, Rue Massenet F-44319 Nantes; 3 ARIAM,
École d’Architecture de Paris La Villette, 144 avenue de Flandre, F-75019 Paris
Abstract
Grassmann-Cayley algebra (GCA) provides an efficient formulation of projective geometry, allowing work with elementary geometrical objects at a low computational cost. In this paper we use GCA as a mathematical framework for modeling conic curves and quadric surfaces of 3D space, and for computing rational parameterizations of these. Then, through ad hoc sampling of the parameter space, we derive tessellations of conics and quadrics.
Keywords:
Grassmann-Cayley algebra; conics and quadrics; surface modeling; tessellations.
1.
INTRODUCTION
Algebraic surfaces of degree 2 play a significant role in the modeling of three-dimensional scenes. Indeed, quadric primitives feature in most CAD software, and commercial 3D modelers based on quadric surfaces (SGDL, 1997) are available. The problem of tessellating quadric surfaces is well-known. Actually, almost every 3D software package has its own solution. However, these solutions are generally based on euclidean geometric properties such as center, radius, or lengths of axes. Therefore they cannot be applied to general projective quadrics. Moreover, usual polygonizations of spheres or ellipsoids require the computation of trigonometric functions. We seek a less costly way to generate our polygonization. On the other hand, the problems of automatically parameterizing or tessellating general algebraic quadrics have been studied extensively (Wang et al., 1997). Existing results range from fast and simple algorithms that produce rational parameterizations of quadrics (Abhyankar and Bajaj, 1987), to an elaborate and comprehensive algorithm for tessellating a general rational parametric surface (Bajaj and Royappa, 1994). Other approaches focus on parameterizing
quadric surface patches as Bézier surfaces (Lodha and Warren, 1990; Teller and Séquin, 1991). The drawback of the simplest of these purely algebraic approaches lies in the visual aspect of the generated tessellations, with vertices accumulating onto specific points of the quadric. More elaborate methods overcome this difficulty but are much more complicated to implement. Our purpose consists in finding a middle way between those approaches in terms of simplicity of implementation and quality of triangles. Instead of a pure algebraic representation we adopt Rotgé's descriptive representation of quadrics (Rotgé, 1997), where a base tetrahedron uniquely determines a quadric incident to certain vertices and tangent planes. Using this we can retain some geometric intuition, while working in a projective framework. Grassmann-Cayley algebra (GCA) provides an efficient algebraic formalism, and at the same time conveys some geometrical meaning. It is therefore well suited to our purpose and we choose to use it as our main algebraic tool. In this paper we assume familiarity with projective geometry, a classical topic covered in many reference books (Berger, 1992; Audin, 2002).
2.
GRASSMANN-CAYLEY ALGEBRA
This section is but a very short introduction. Macé (Macé, 1997) gives a more detailed introduction from a practical point of view, whereas comprehensive treatment of the subject is to be found in Rota’s original paper (Rota and Stein, 1976). Applications of Grassmann-Cayley algebra to projective geometry have been investigated more than once, including applications to conics or quadrics (Rota and Stein, 1976; Hestenes and Ziegler, 1991; Li and Wu, 2003).
2.1
Basic properties
In this section E is a vector space of finite dimension n over the field R of real numbers. The exterior algebra of E, a standard tool of multilinear algebra (Kostrikin and Manin, 1989), is the algebra of antisymmetric tensors with the exterior product ∨, or join. As a vector space, it is spanned by elements of the form a1 ∨ · · · ∨ ap, ai ∈ E, which are called p-blades. If we choose a bracket, i.e. an alternating n-linear form on E, we can define an additional product ∧, or meet (Rota and Stein, 1976). The space E endowed with both ∧ and ∨ becomes the Grassmann-Cayley algebra GC(E). Meet and join are related by alternative laws, such as

(a ∨ b) ∧ (c ∨ d) = |acd| b − |bcd| a.    (1)
All operators are easy to implement. Following a tensor approach, we store blades as matrices. The products join and meet are then computed through
matrix multiplications, which are fast and easy to perform. Explicit formulae with examples can be found in previously published papers (Macé, 1997).
2.2
GCA and projective geometry
The key feature allowing applications of GCA to geometry is the ability of blades to represent subspaces of E. More precisely, we can associate to any blade a1 ∨ · · · ∨ ap the subspace spanned by the ai's. Thus there is a correspondence between p-blades and p-dimensional subspaces of E. In the projective space P(E), p-dimensional subspaces of E become (p − 1)-dimensional projective subspaces, so in the field of projective geometry vectors represent points, 2-blades represent lines, and 3-blades represent planes. By abuse of language, we will refer to blades as projective subspaces, so that we may speak of the 2-blade a ∨ b as the line through the points a and b. GCA's main operators each carry their own geometrical interpretation. The join operator can be understood as a union operator, whereas the meet is an intersection operator. Thus any ruler-only construction has an immediate transcription in GCA's formalism. Incidence relations are expressed by algebraic constraints. For instance a ∨ b ∨ c = 0 is equivalent to the collinearity of the three points a, b, and c. Grassmann-Cayley expressions are a compact way to make geometric statements. Another feature of GCA is the possibility to perform algebraic computations without losing geometric intuition. The coordinate-free nature of GCA's formalism makes it suitable for symbolic computations (Li and Wu, 2003; Sosnov et al., 2002), but effective computations in homogeneous coordinates are easy to perform as well.
3.
CONIC CURVES
Working in a projective plane, we start from a construction triangle determining a conic section, then derive a parameterization and a polygonization of the conic curve. This two-dimensional stage will be extensively used in our tessellation method for three-dimensional quadrics.
3.1
Geometric representation of conics
Given a triangle ABS and an extra point P not incident to any of ABS's sides, there is a unique conic section C such that A, B, and P are points of C, and the lines (AS) and (BS) are tangent to C (see Fig. 1). We call the triangle ABS the construction triangle of C. From these base points A, B, S, and P there exists a ruler-only geometric construction of C, which associates to any point M on the line (AB) a point of C. Figure 2 illustrates this classical construction, which is presented in many textbooks covering projective geometry, e.g. Berger (Berger, 1992).
Figure 1. The conic section C determined by the construction triangle ABS and the passing point P.
Figure 2. Construction of a point Q ∈ C from a given point M ∈ (AB).
3.2
Parameterization
A concise formulation of the construction of the point Q from the point M is given by the following equation

Q = B(AP ∧ MS) ∧ A(BP ∧ MS)    (2)

written in Grassmann-Cayley algebra's formalism, where for the sake of readability the join symbol ∨ has been omitted. This construction makes it possible to turn any parameterization of the line (AB) into a parameterization of the conic C. We choose to write M = A + λB. After insertion in Eq. (2), and expansion of the resulting formula using the alternative law (1), we get the following parameterization

Q(λ) = λ² |SBP| B + λ |BAP| S + |ASP| A.    (3)
With an appropriate linear transformation λ → kλ, we calibrate the parameterization so that Q(1) = P. The required scalar k is computed as the cross ratio of the four points B, A, A + B, and SP ∧ AB.
3.3
Polygonization
We use Eq. (3) to approximate the curve C by straight line segments. For any integer n, our segmentation algorithm produces a sequence of 4n points on C. We use these points as the vertices of the desired polyline. To generate the sequence of vertices, we need a sequence of scalars, that will be used as specific values for the parameter λ in Eq. (3). In order to avoid big irregularities in the distribution of the vertices along the curve, we have to use a non-uniform scalar sequence. Besides, we want to use the same fixed parameter sequence for every different conic. After some experimentation we chose the following sequence
which gave us our most acceptable results (see Fig. 3). Its incremental nature makes it fast to generate, so the computational cost of the segmentation process is minimal.
Figure 3. Examples of vertices generated by our algorithm with n = 6.
Figure 4. Interactively moving the point P in our test application, we pass smoothly from one conic type to another.
3.4
Handling different types of conics
We augment our vertex sequence with the point B = Q(±∞), so that our sequence loops from B to B. To render on screen our segmented approximation of the conic curve it is then sufficient to draw the segments linking each point of the sequence to its successor. We have to take care, however, of the segments which might cross the hyperplane at infinity H∞. We address this issue by computing the intersection of the conic C with the hyperplane H∞. A point Q(λ) ∈ C is incident to H∞ if and only if Q(λ) ∨ H∞ = 0. This leads to a real quadratic equation in λ, which we solve in order to get the affine type of the conic, together with its asymptotic directions given as points at infinity. Since we know the parameters of these points at infinity, we know which segments cross H∞, and we can choose not to draw these at most two segments, as shown in Figure 4. A conic C represented by ABS and P is degenerate if and only if P is incident to either line (AS) or (BS). In that case C consists of the lines (AS) and (BS). Degenerate conics are then easy to take into consideration.
4.
QUADRIC SURFACES
We now generalize the segmentation technique to three-dimensional quadrics. Our tessellation process follows the same steps as in the case of conics.
4.1
Projective representation of quadrics
There are two widely used approaches for representing quadrics (Miller, 1988). In the algebraic approach, they are represented by an algebraic equation, while in the geometric approach they are described by their type, plus a type-dependent collection of geometrical data such as directions, center, or radius. Due to the euclidean nature of these data, we will call the latter approach euclidean. In this paper we use the approach introduced in Rotgé's thesis (Rotgé, 1997), where a quadric is represented by a construction tetrahedron which is a generalization of section 3.1's construction triangle ABS. By comparison to the euclidean approach, we will speak of the projective representation. Like the euclidean representation, the projective representation conveys some geometric intuition about the shape of the represented quadric. It can for this reason be used in an interactive modeling system (Rotgé, 1997), a feature that could motivate its use in preference to the algebraic approach. Although not as immediately intuitive as the euclidean representation, the type-independent projective approach is better suited to our projective framework. In the projective representation, a quadric is determined by two non-coplanar construction triangles ABS1 and ABS2, which together with their respective passing points P1 and P2 define two conics C1 and C2. A result attributed to Monge (Rotgé, 1997) states that among the many quadrics incident to C1 and C2, there is a unique quadric Q such that S1 and S2 are conjugate with respect to Q. The points A, B, S1, and S2 are the vertices of the construction tetrahedron of Q. We call C1 and C2 the base conics of Q (see Fig. 5).
4.2
Parameterization
The parameterization technique we use is again an idea of Rotgé (Rotgé, 1997). It is based on a ruler-only construction of Q, whose summary follows. We start from the first base conic C1, which we can construct from its base points A, B, S1, and P1. Let us note X2 = AB ∧ S2P2. For each constructed point A' of C1, it is possible to construct the intersection point B' of the line A'X2 with the conic C1. As S2 is the pole of the plane containing C1, the conic C2' determined by the construction triangle A'B'S2 and the passing point P2 is another plane section of Q (see Fig. 5). Constructing C2' as in section 3, and repeating for every other point A' of C1, we construct every point of Q. The point B' is constructed using as an intermediate point the harmonic conjugate of X2 with respect to A and B. Using Eq. (3) to parameterize C1 and C2', then combining these two equations, we end up with a polynomial parameterization of the quadric.
Figure 5. A new construction tetrahedron A'B'S1S2 for Q has been constructed from the point A' ∈ C1. The conic C2' is shown in solid line, whereas the base conic C2 and the starting construction tetrahedron are shown in dotted lines.
4.3
Tessellation
We use the aforementioned parameterization together with our segmentation technique of section 3 in order to obtain a mesh of points on the quadric Q. Given a fixed integer n, we first generate a vertex sequence of 4n vertices on the base conic C1. Then for each vertex A' of this sequence, we segment the conic C2' issued from A', again with 4n points. We then have a total of 16n² vertices on Q, among which the point P2 appears 4n times as the point of parameter 1 in every conic C2'. The same holds for the point P2' of parameter −1, so P2 and P2' will play in our tessellation roles similar to the north and south poles in the polygonization of a sphere by meridians and parallels. The vertices are joined with their neighbours along isoparameter lines. Thus our mesh has quadrangular cells except in the vicinity of P2 or P2'. In order to get triangles, we separate the quadrangular cells along one of their diagonals. Examples of the resulting triangulation are shown in Figure 6. As in the conic case, we can consider different types of quadrics. The fact that S1 is the pole of the plane (ABS2) (and conversely) implies that the quadric Q is degenerate if and only if one of its base conics is degenerate. This allows us to handle degenerate quadrics in a convenient way. Likewise, the affine type of Q depends only on the type of the base conics C1 and C2. A difficulty not mentioned by Rotgé (Rotgé, 1997) arises from the fact that the construction of section 4.2 fails to generate the entire quadric Q when its type is a one-sheeted hyperboloid. In that case, the hyperboloid is only partially covered by the points we can construct. To overcome this difficulty, we need another point on the line (AB) to play the role of X2 in the construction. The relative position of this new point X3 with respect to the
points A and B should not be the same as X2's. Using X3, we can design a new parameterization algorithm for one-sheeted hyperboloids, yet with the extra cost of computing a quadric-line intersection for each vertex A' on C1. The tessellation algorithm for one-sheeted hyperboloids is then more complex and less efficient than the previous algorithm.
Figure 6. Examples of meshes generated with our technique: an ellipsoid, a cone, and a two-sheeted hyperboloid.
5.
DISCUSSION
There are several issues concerning the regularity of the obtained triangles. First of all, our algorithm as presented in this paper can be enhanced by some heuristics at the conic or quadric level to reduce mesh irregularity. We have chosen not to detail these heuristics here in order to focus on the main ideas of our techniques. Also, we can see in Fig. 6 that meshes generated for two-sheeted hyperboloids are much denser on the sheet containing the base points. This irregularity in our mesh does not seem really serious to us, because we expect users to be mostly interested in the sheet directly in contact with the construction tetrahedron. Another issue of the hyperboloid case comes from the fact that our vertices become very sparse away from the summits of the sheets. Again we minimize the seriousness of this irregularity, as in a real application every quadric would be clipped, and the vertices around infinity discarded.
6.
SUMMARY AND RESULTS
We have presented an algorithmic technique for producing a triangular mesh from quadric surfaces represented by base points. The algorithm is parameterized by an integer quantity growing with the number of vertices generated. As a first step of our tessellation process we also gave a method for segmenting planar conics.
To experiment with our algorithm, we have developed an interactive test application. With it we experienced interactive frame rates when deforming quadric meshes with up to ten thousand vertices. Our algorithm would benefit from a little polishing, but after some experimentation with it, we estimate that it meets our initial expectations acceptably.
REFERENCES
Abhyankar, S. S. and Bajaj, C. (1987). Automatic parameterization of rational curves and surfaces 1: conics and conicoids. Computer-Aided Design, 19(1):11-14.
Audin, Michèle (2002). Geometry. Springer Verlag.
Bajaj, C. L. and Royappa, A. (1994). Triangulation and display of rational parametric surfaces. In Proceedings of IEEE Visualization, pages 69-76.
Berger, Marcel (1992). Géométrie. Nathan.
Hestenes, D. and Ziegler, R. (1991). Projective geometry with Clifford algebra. Acta Applicandae Mathematicae, 23:25-63.
Kostrikin, Alexei I. and Manin, Yu. I. (1989). Linear Algebra and Geometry. Gordon and Breach Science Publishers.
Li, Hongbo and Wu, Yihong (2003). Automated short proof generation for projective geometric theorems with Cayley and bracket algebras: II. conic geometry. J. Symb. Comput., 36(5).
Lodha, S. K. and Warren, J. (1990). Bézier representation for quadric surface patches. Computer-Aided Design, 22(9):574-579.
Macé, Pierre (1997). Tensorial calculus of line and plane in homogeneous coordinates. Computer Networks and ISDN Systems, 29.
Miller, James R. (1988). Analysis of quadric-surface based solid models. IEEE Computer Graphics and Applications, 8(1):28-42.
Rota, Gian-Carlo and Stein, Joel (1976). Applications of Cayley algebras. In Colloquio Internazionale sulle Teorie Combinatorie, Roma.
Rotgé, Jean-François (1997). L'Arithmétique des Formes : une introduction à la logique de l'espace. PhD thesis, Université de Montréal, Faculté de l'Aménagement.
SGDL (1997). http://www.sgdl-sys.com/.
Sosnov, A., Macé, P., and Hégron, G. (2002). Semi-metric formal 3D reconstruction from perspective sketches. Lecture Notes in Computer Science, 2330:285-??
Teller, S. J. and Séquin, C. H. (1991). Constructing easily invertible Bézier surfaces that parameterize general quadrics. In Proc. 1991 ACM/SIGGRAPH Symposium on Solid Modeling and CAD/CAM Applications, pages 303-315.
Wang, Wenping, Joe, Barry, and Goldman, Ronald (1997). Rational quadratic parameterizations of quadrics. International Journal of Computational Geometry and Applications (IJCGA), 7(6):599-??
VISUAL PERSON TRACKING IN SEQUENCES SHOT FROM CAMERA IN MOTION
Piotr Skulimowski and Pawel Strumillo Institute of Electronics, Technical University of Lodz, Wolczanska 223, 90-924 Lodz, POLAND, e-mail: [email protected], [email protected]
Abstract:
A person tracking algorithm in video scenes shot by camera in motion is proposed. It is shown experimentally that for such a recording scenario an uncomplicated algorithm incorporating background motion subtraction can successfully track moving human silhouettes.
Key words:
frame; motion; person tracking; detection; estimation; motion vectors; MPEG; camera; video sequence, background subtraction.
1.
INTRODUCTION
Detection and tracking of moving objects is one of the major research areas in computer vision systems. This tendency has been reflected in recommendations of the second generation MPEG-4, MPEG-71 multimedia standards allowing for separate encoding and streaming of detected objects in video sequences. Applications range from video surveillance and robotics to biometrics and reconstruction of 3D scenes. Recent attempts in this field were communicated, e.g., in a paper2 where moving objects are detected by dividing video sequence frames into layers implementing edge tracking methods, or in another paper3 where active contour based techniques are used for tracking distributions like colour or texture of the objects. Within this research framework, detecting, tracking and/or identifying people from a distance have recently attracted particularly intensive research interest. A large number of concepts were proposed for modelling the human body using a decomposable triangulated graph4, silhouette analysis, human identification
by gait recognition5, and stereo vision6. Also, a real-time multi-person tracking system was proposed for detection and classification of moving objects into semantic categories7. It requires, however, multiple cameras deployed in the observed area. Recent research interest focuses also on object tracking tasks in the context of moving image background sub-regions. Analysis of such dynamic scenes has recently been approached by means of on-line autoregressive modelling8 and by using the optical flow concept coupled with kernel-based density estimation for background modelling and subtraction9. This work concentrates on the problem of visual person tracking in scenes shot by a camera that is not stationary. Such object tracking scenarios can take place in mobile robot systems or portable devices aiding blind or visually impaired persons. It is shown that, as opposed to complex algorithms8,9, an MPEG coding inspired algorithm incorporating motion compensation for background subtraction and simple processing of interframe difference images is capable of segmenting human silhouettes from dynamic scenes. The algorithm is intended to work robustly for camera-in-motion recordings of sequences and abrupt scene content variations.
2.
OBJECT TRACKING ALGORITHM
The developed algorithm concerns the far-field application aimed at detecting the presence and identifying moving silhouettes of one or many persons in video sequences. The task is to detect and track objects in dynamic scenes shot from a camera in motion (planar motion types are currently considered). The procedure for compensation of the moving background is intentionally based on estimating motion vectors in 16x16 blocks, similarly to the MPEG standards. The key steps of the algorithm are the following:
Figure 1. Overview of the proposed method for visual person tracking.
2.1
Background motion compensation and detection of objects in motion
The concept of motion estimation and compensation that is proposed in the MPEG standards is employed and modified for the task of background motion estimation. Motion estimation is carried out first for all image macroblocks of size 16x16. The Mean Absolute Distortion (MAD) criterion is used for matching of macroblocks by applying the simplest full-search method in a predefined vicinity1. A set of motion vectors (Δx, Δy) corresponding to each of the image macroblocks is obtained for two consecutive video frames F(i, j; t−1), F(i, j; t) (with i, j being pixel coordinates in a frame):

Z(t−1, t) = {(Δx0, Δy0), . . . , (Δxk, Δyk), . . . , (ΔxN−1, ΔyN−1)}    (1)
where N is the number of macroblocks of an image frame. Next, the current set of motion vectors Z is searched (separately in the x and y direction) for the most frequent motion coordinates Δx_estim, Δy_estim. Macroblocks from the current frame F(i, j; t) are subdivided into two groups by using a simple threshold criterion for each of the motion coordinates: |Δxk − Δx_estim| ≤ th and |Δyk − Δy_estim| ≤ th. If both motion coordinates of a macroblock meet this criterion the macroblock is classified as background, otherwise the macroblock is identified as an object in motion. Further, "pixels in motion" are merged into regions by testing the 4-connectivity neighbourhood condition.
2.2
Calculation of interframe differences
Consecutive interframe difference images FD(i, j; t) are calculated according to the formula:

FD(i, j; t) = F(i, j; t) − F(i + Δx_estim, j + Δy_estim; t−1)    (2)
and using an appropriate condition that takes care of boundary effects, i.e., pixels of the difference image are set to zero if their new, compensated coordinate falls outside the image frame. Interframe difference images with background motion compensation switched off and on are compared in Fig. 2.
Figure 2. Interframe difference image (a), difference of the frames with background compensation switched on (b); note that grey levels of the difference images are shifted by half a greyscale range for better visualisation.
2.3
Calculation of difference maps
Interframe difference maps are obtained by binarizing the difference images. A cumulative grey-scale histogram H(g), g = 0, 1, …, L−1, is computed for difference image pixels belonging to background-labelled macroblocks. A threshold level is selected as the lowest greyscale value g_th for which the cumulative histogram satisfies the condition H(g_th) ≥ A, with the constant A set experimentally to A = 0.989. The difference frame image is converted to a difference binary map FDB(i, j; t) using g_th as a threshold level, setting FDB(i, j; t) = 0 for pixels in motion and FDB(i, j; t) = 1 otherwise. Pixels in motion are merged into regions by testing the 4-connectivity neighbourhood condition. Finally, each frame pixel is labelled as either belonging to the background or to an object in motion (e.g., more than one object can be detected in the case of a multi-person tracking system).
2.4
Silhouette detection
This part of the algorithm focuses on applying a number of simple logical conditions to the tracked regions of the human silhouette: e.g. tracked persons can stop, change their direction of movement, or become invisible due to masking by other stationary or tracked objects. Such rules allow tracking to be maintained in situations when the tracked object becomes stationary or when its velocity is too small to detect it as a moving object. The tracked person silhouette is indicated in frame sequences by positioning a rectangle encompassing the body parts in motion.
3.
EXPERIMENTAL RESULTS
Tests were performed on video sequences of 320x240x24 bit resolution recorded at 19.6 frames per second. Images were recorded using a digital camera with video capture capability. The sequences were shot from "a free hand" with intentional motion of the camera during exposure. The developed person tracking algorithm was implemented in C++ on a PC powered by a 2.66 GHz Pentium IV processor. The achieved frame rate with the tracking algorithm running was 5 frames per second. This result can still be improved after careful code optimisation and implementation of fast motion estimation procedures (e.g. the three/four step or diamond search methods). An illustration of the person tracking algorithm's performance is given in Fig. 3 (see the motion of the background in consecutive frames). It is worth noting that tracking of both persons is maintained after they cross each other. Images of the detected silhouettes of the strolling persons are shown in Fig. 4. These binary maps can be used for further processing and analysis.
Figure 3. Demonstration of the developed person tracking algorithm (frames no. 3, 8, and 19 are shown only).
Figure 4. Difference maps obtained for the sequence shown in Fig. 3 (computed for frame pairs: 2-3, 7-8, and 18-19 correspondingly).
4.
CONCLUSIONS
The proposed person tracking algorithm is based on motion estimation techniques. It requires motion vectors that can be extracted from video sequences compressed by means of MPEG coding standards, or computed using motion estimation procedures. A speed-up of the algorithm, from the current rate of 5 frames per second, can be achieved by further C++ code optimisation and implementation of fast motion estimation procedures. An open problem, yet to be studied, is the determination of the class of camera motion trajectories for which the algorithm will maintain its tracking capability. This is a key problem in the envisaged application of the developed algorithm in a portable system for aiding blind and visually impaired persons.
REFERENCES
1. Skarbek W., 1998, Multimedia: algorithms and coding standards (in Polish), Akademicka Oficyna Wydawnicza, Warszawa.
2. Smith P., Drummond T., Cipolla R., 2004, Layered motion segmentation and depth ordering by tracking edges, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(4), pp. 479-494.
3. Freedman D., Tao Z., 2004, Active contours for tracking distributions, IEEE Transactions on Image Processing, 13(4), pp. 518-526.
4. Song Y., Goncalves L., Perona P., 2003, Unsupervised learning of human motion, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(7), pp. 814-827.
5. Wang L., Tan T., Ning H., Hu W., 2003, Silhouette analysis-based gait recognition for human identification, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12), pp. 1505-1518.
6. Harville M., 2002, Stereo person tracking with adaptive plan-view statistical templates, available at: www.hpl.hp.com/techreports/2002/HPL-2002-122.pdf.
7. Wei N., Long J., Han D., Yuan-Fang W., 2003, Real-time multi-person tracking in video surveillance, Proc. of the 2003 Joint Conference of the Fourth Pacific Rim Conference on Multimedia, Information, Communications and Signal Processing, 2, pp. 1144-1148.
8. Monnet A., Mittal A., Paragios N., Ramesh V., 2003, Background modeling and subtraction of dynamic scenes, The 9th IEEE International Conference on Computer Vision, pp. 1305-1312.
9. Mittal A., Paragios N., 2004, Motion-based background subtraction using adaptive kernel density estimation, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 27 June - 2 July, Washington DC, pp. 302-309.
A NOVEL OBJECT DETECTION TECHNIQUE IN COMPRESSED DOMAIN
Ashraf M.A. Ahmad Department of Computer Science and Information Engineering,National Chiao-Tung University, 1001 Ta-Hsueh Rd, Hsinchu, Taiwan,[email protected]
Abstract:
In this paper we propose a novel approach for robust motion-vector-based object detection in MPEG-1 video streams. The motion vector fields extracted directly from MPEG-1 video streams in the compressed domain are processed by our proposed system in order to reduce the noise within the motion vector content, obtain more robust object information, and refine this information. As a result, the object detection algorithm is more capable of accurately detecting objects with more efficient performance.
Key words:
MPEG-1, Filter, Object Detection, Motion Vector, Gaussian, Median, Texture Filter.
1.
INTRODUCTION AND RELATED WORK
As the proliferation of compressed video sequences in MPEG formats continues, the ability to perform video analysis directly in the compressed domain becomes increasingly attractive. To achieve the objective of identifying and selecting desired information, a reliable object detection mechanism is needed as a primary step. A lot of work has been done in the area of motion-based video object segmentation in the pixel domain1,2, which exploits visual attributes and motion information. However, very little work has been carried out in the area of compressed domain video object extraction. Pixel domain motion detection is performed based on the motion information at each pixel location, such as optical flow estimation3, which is very computationally demanding. In many cases, especially in the case of well-textured objects, the motion vector values reflect the movement of
objects in the stream very well. Some approaches4,5,15 utilize these motion vector values directly. Processing digital video directly in the compressed domain reduces processing time and enhances storage efficiency, speed, and video quality. Object detection directly in compressed video without full-frame decompression is clearly advantageous, since it is efficient and can more easily reach real-time processing speeds. Motion vector information is an important cue for humans to perceive video content. Thus, the need for reliable and accurate motion vector information becomes clear for those approaches that employ motion information2,3,4,5,15, as well as for obtaining highly efficient detection algorithms at the macroblock level. Motion vector information is sometimes difficult to use due to the lack of effective representation. Therefore, in this paper, we introduce a technique which can overcome those defects and produce more reliable motion vector information and smoothed object boundaries for the object detection technique. This technique will be described in detail. Previous work2,6 applies a median filter to the magnitude only, while we apply our scheme to both magnitude and direction, which results in a more accurate and reliable outcome in terms of object detection. The approach of6 used a spatial filter, namely a mean filter, which other authors7 have shown to be insufficient and unrealistic for real-time applications.
2.
OVERVIEW OF THE PROPOSED SYSTEM
First of all, we present the following diagram, which gives an abstract overview of our proposed system; we then describe its components in detail.
Figure 1. The System Overview.
In our proposed approach we first take an MPEG-1 video stream with the [IBBPBBPBBPBBPBB] structure. Fig. 1 shows the proposed system architecture. We extract the motion vectors from P-frames only in order to reduce the computational complexity. In a video at 30 fps, consecutive P-frames separated by two or three B-frames are still similar and do not vary too much, so it is sufficient to use the motion information of P-frames only to detect the objects. Meanwhile, we extract the DCT coefficients from I-frames; these coefficients include both the DC coefficient and the AC components, and they are readily available in the MPEG-1 stream. We then pass the DCT coefficients into a module that calculates the texture of each frame, and propagate this texture information to the P-frames using an inverse motion compensation technique. After that, we filter each motion vector based on its texture value, as described in the following sections. Furthermore, we use a median filter because it does not alter motion vector values; it simply rearranges motion vectors. Hence, the median filter is used to repair potential irregularities introduced by the previous filtering step and to straighten up any single motion vector that has been influenced. After obtaining the motion vector field's magnitude and direction values, we pass these values through the Gaussian filter. The steps are described in detail in the following sections.
3.
TEXTURE FILTER
In fact, in most cases motion vectors are not only inaccurate, but in some cases they are completely meaningless. This would not allow even a sturdy fitting stage to operate reliably, so it is more robust if these low-textured macroblocks are not included in the fitting. To do so, we analyze the AC components of the DCT coefficients, thus staying in the compressed domain.
3.1
Feature Extraction from MPEG-1
The MPEG-1 compressed video provides one motion vector for each macroblock of 16x16 pixels size. This means that the motion vectors are quantized to 1 vector per 16x16 block. The motion vectors are not the true motion vectors of a particular pixel in the frame. Our object detection algorithm requires the motion vectors of each P-frame from the video streams. For computational efficiency, only the motion vectors of P-frames are used by the object detection algorithm. Besides, we need to extract the DCT information from I-frames; this information is readily available in the MPEG-1 stream, so we do not need to spend much time decoding the MPEG stream. Hence our approach is suitable for real-time application environments.
3.2
Texture Energy Computation
Object regions are distinguished from the background using their distinctive texture characteristics. Some previously published methods fully decompress the video sequence before extracting the desired object regions. Our method helps in locating the candidate object regions directly in the DCT compressed domain using the intensity variation information encoded in the DCT domain. We can design a Directional Texture Energy Map (DTEM) in the DCT domain (Fig. 2) by assigning a directional intensity variation indicator to each coefficient in the DCT domain, whether DC or AC component, as follows: H denotes horizontal intensity variation, V denotes vertical intensity variation, and D denotes diagonal intensity variation.
Figure 2. Directional Texture Energy Map in DCT.
We process in the DCT domain to obtain the directional intensity variation, or so-called directional texture energy, using only the information in the compressed domain. We compute the horizontal energy Eh by summing up the absolute amplitudes of the horizontal harmonics of the block, which have been marked as H in the DTEM. We compute the vertical energy Ev by summing up the absolute amplitudes of the vertical harmonics of the block, which have been marked as V in the DTEM. We compute the diagonal energy Ed by summing up the absolute amplitudes of the diagonal harmonics of the block, which have been marked as D in the DTEM. Finally, we calculate the average energy Ea for each macroblock as the weighted average of the horizontal, diagonal and vertical energies, as in Eq. (1):

Ea = (3·Eh + 5·Ed + 3·Ev) / 11     (1)
After we obtain the average energy, the average energy values are thresholded to obtain blocks of large intensity variations. We update the motion vector values based on Ea as described in the following procedure. For every macroblock:

MV_new = MV_old · (100·Ea/Et) %   if Ea < Et
MV_new = MV_old                   if Ea ≥ Et     (2)
We have used an adaptive threshold value Et equal to 1.45 times the average texture energy of the corresponding DCT channel over all the blocks in the frame.
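A minimal sketch of the texture filter of Eqs. (1)-(2) is given below. It is not the authors' implementation: since Fig. 2 is not reproduced here, the assignment of DCT coefficients to the H, V and D groups (first row, first column, and remaining AC terms of the 8x8 block) is an assumption, as are the array layouts.

```python
import numpy as np

def directional_energies(dct_block):
    """Directional texture energies of one 8x8 DCT block (Eq. 1)."""
    a = np.abs(dct_block)
    e_h = a[0, 1:].sum()          # assumed horizontal harmonics (H)
    e_v = a[1:, 0].sum()          # assumed vertical harmonics (V)
    e_d = a[1:, 1:].sum()         # assumed diagonal harmonics (D)
    e_a = (3 * e_h + 5 * e_d + 3 * e_v) / 11.0
    return e_h, e_v, e_d, e_a

def texture_filter(motion_vectors, avg_energies, factor=1.45):
    """Scale down motion vectors of low-textured macroblocks (Eq. 2).
    motion_vectors: (N, 2) array; avg_energies: (N,) array of Ea values."""
    e_t = factor * avg_energies.mean()                 # adaptive threshold Et
    mv = motion_vectors.astype(np.float64).copy()
    weak = avg_energies < e_t
    mv[weak] *= (avg_energies[weak] / e_t)[:, None]    # MV_old * (100*Ea/Et)%
    return mv
```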
4.
SPATIAL FILTER
Up to this stage, we have the output motion vectors from the previous component. We pass the motion vector magnitude and direction values to the Gaussian filter; using both values instead of only one of them makes our object detection more robust and meaningful.
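A sketch of this spatial filtering stage is shown below, assuming the motion vector field is given on the macroblock grid. Smoothing the direction through its sine and cosine to avoid wrap-around artefacts is our own choice, not something specified in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_mv_field(mv_field, sigma=1.0):
    """Gaussian smoothing of magnitude and direction of a motion vector field.
    mv_field: (rows, cols, 2) array of (dy, dx) vectors per macroblock."""
    mag = np.hypot(mv_field[..., 0], mv_field[..., 1])
    ang = np.arctan2(mv_field[..., 0], mv_field[..., 1])
    mag_s = gaussian_filter(mag, sigma)
    # Smooth the direction via sine/cosine so that -pi and +pi stay neighbours.
    cos_s = gaussian_filter(np.cos(ang), sigma)
    sin_s = gaussian_filter(np.sin(ang), sigma)
    ang_s = np.arctan2(sin_s, cos_s)
    return mag_s, ang_s
```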
5.
OBJECT DETECTION
An object detection algorithm is used to detect potential objects in video shots. Initially, undesired motion vectors are eliminated. Subsequently, motion vectors that have similar magnitude and direction are clustered together, and each group of associated macroblocks with similar motion vectors is regarded as a potential object. Details of the object detection procedure are presented in previously published papers.
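Since the clustering step is only summarized here, the following sketch shows one plausible reading of it: macroblocks whose quantized magnitude and direction agree and that are spatially connected are grouped into candidate objects. The thresholds, quantization steps and minimum region size are assumptions.

```python
import numpy as np
from scipy.ndimage import label

def detect_objects(mag, ang, mag_step=2.0, ang_bins=8, min_blocks=4):
    """Group neighbouring macroblocks with similar motion into candidate objects.
    mag, ang: smoothed magnitude/direction maps on the macroblock grid."""
    moving = mag > 1.0                                   # drop tiny/undesired vectors
    q_mag = (mag / mag_step).astype(int)                 # quantized magnitude
    q_ang = (((ang + np.pi) / (2 * np.pi)) * ang_bins).astype(int) % ang_bins
    code = q_mag * ang_bins + q_ang                      # joint similarity code
    objects = []
    for c in np.unique(code[moving]):
        blobs, n = label(moving & (code == c))           # connected macroblock groups
        for i in range(1, n + 1):
            ys, xs = np.nonzero(blobs == i)
            if ys.size >= min_blocks:
                objects.append((ys.min(), xs.min(), ys.max(), xs.max()))
    return objects   # bounding boxes in macroblock coordinates
```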
6.
EXPERIMENTAL RESULTS AND DISCUSSION
We have designed an experiment in order to verify the performance of the proposed scheme on three video clips. These video clips are in MPEG format and are part of the MPEG testing dataset. Testing is performed against four types of related work: Group A uses the texture filter only8, Group B uses the spatial filter only, mainly a Gaussian filter7, Group C uses the texture and spatial filters as equally important6, Group D is our system, and finally a configuration without any kind of post-processing is included. We note that the performance of our system is consistently superior to the performance of the other schemes.
7.
CONCLUSIONS
Embedding our system, with a specific configuration, as a primary step before starting our object detection algorithm makes the performance much better and reduces the computation time for the object detection system as a whole. This has been verified by examining the results of our experiments.
REFERENCES: 1. N. Brady and N. O'Connor, "Object detection and tracking using an EM-based motion estimation and segmentation framework," Proc. IEEE ICIP, 925-928 (1996). 2. David P. Elias, The Motion Based Segmentation of Image Sequences, Ph.D. thesis (Trinity College, Department of Engineering, University of Cambridge, Aug. 1998). 3. R. Wang and T. Huang, "Fast Camera Motion Analysis in MPEG domain," Proc. ICIP, 691-694 (1999). 4. R. C. Jones, D. DeMenthon and D. S. Doermann, "Building mosaics from video using MPEG Motion Vectors," Proc. ACM Multimedia Conference, 29-32 (1999). 5. J. I. Khan, Z. Guo and W. Oh, "Motion based object tracking in MPEG-2 stream for perceptual region discriminating rate transcoding," Proc. ACM Multimedia Conference, 572-576 (2001). 6. R. Wang, H.-J. Zhang and Y.-Q. Zhang, "A Confidence Measure Based Moving Object Extraction System Built for Compressed Domain," Proc. ISCAS, 21-24 (2000). 7. Ashraf M.A. Ahmad, Duan-Yu Chen and Suh-Yin Lee, "Robust Compressed Domain Object Detection in MPEG Videos," Proc. of the 7th IASTED International Conference on Internet and Multimedia Systems and Applications, 706-712 (2003). 8. Yu Zhong, Hongjiang Zhang, and Anil K. Jain, "Automatic Caption Localization in Compressed Video," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4): 385-392 (2000). 9. Jianhao Meng, Yujen Juan, Shih-Fu Chang, "Scene Change Detection in a MPEG Compressed Video Sequence," in IS&T/SPIE Proceedings: Digital Video Compression Algorithms and Technologies, vol. 2419, San Jose (1995).
A METHOD FOR ESTIMATING DANCE ACTION BASED ON MOTION ANALYSIS
Masahide NAEMURA and Masami SUZUKI Advanced Telecommunications Research Institute, 2-2-2 Hikaridai, Keihanna Science City, Kyoto, Japan, 619-0288
Abstract:
We developed a method for estimating a dancer's performance level through motion analysis of video-captured dance action. The estimation is carried out by relating subjective evaluation results of dancers’ performance with motion features automatically extracted through off-the-shelf video processing. The motion features are rhythm elements of dance action. As a result, these rhythm elements are found to yield a strong correlation with the subjective evaluation of performance levels.
Key words:
dance, motion analysis, Hough transform, Kalman filter, rhythm detection
1.
INTRODUCTION
We are conducting research on computer-aided edutainment with a view toward creating learning environments where anybody can acquire advanced skills1. In order to make good use of computer technology in edutainment, it is important to identify the basic elements that characterize the performance difference between an amateur and an expert and to computationally analyze this difference. In this paper, we focus on dance actions as a sub-field of edutainment research and introduce an estimation method for integrating subjective evaluation of dance performance with the results of motion analysis using computer vision technology. There has been much research conducted on traditional dance forms like ballet, kabuki and Japanese traditional dance2,3. They have mainly focused on archiving as artistic expressions the detailed dance patterns performed by professionals. As for motion analysis of dance actions, previous research has
mainly been carried out to enhance the skills of professionals by using large-scale motion capture systems4,5. In contrast, our research seeks to establish a method for transferring professional skills to amateurs. The process of capturing motion features has to use an off-the-shelf method so that ordinary people can access and enjoy it. In order to reach this goal, we adopt a motion estimation method that consists of two steps. The first step is to analyze the difference in performance between professionals and amateurs and to identify the factors that cause this difference. We assume rhythm as the main candidate factor for dance. After identifying this factor, the next step is to extract, by means of video processing, the motion features that represent rhythm and to establish a method for estimating the dance performance by using motion features. Since rhythm is defined as periodic motion, we tried to extract motion features that periodically change their characteristics. By comparing analysis results of the obtained motion features with the results of subjective evaluation of dance performance, we demonstrate that motion features can be a key tool in estimating dance performance.
2.
SUBJECTIVE EVALUATION OF DANCE
Figure 1. Classification by results of subjective evaluation.
In order to clarify the difference in performance between experts and amateurs, we carried out subjective evaluations of dance. We chose street dance as a target dance action because it is mainly composed of rhythmical movements. We had evaluators score seven kinds of dance scenes on evaluation forms that graded dance on five levels, ranging from “poor” to “excellent.” Dances were performed by seven dancers consisting of three professionals and four amateurs. We adopted Scheffe’s paired comparison as a method for the subjective evaluation6. In Scheffe’s paired comparison,
A Method for Estimating Dance Action based on Motion Analysis
697
paired dance scenes were shown and their differences were judged by the evaluators. The 1% confidence intervals of all 21 (7C2) pairs were calculated on the basis of variance analysis. When this confidence interval remains at either a positive or a negative value, it can be said that there is a significant difference in performance levels between the dancers in the pair. The dance scenes were classified according to evaluation scores based on these significant differences. The classification results in Fig. 1 show that professional dancers can be clearly distinguished from amateur dancers. Next, we developed a motion analysis method that can extract motion features reflecting the results of the subjective evaluations.
3.
MOTION ANALYSIS
In order to obtain motion features that affect dance estimation, we focused on the body movements of dancers. The following procedure was used to extract these movements from dance scenes.
The procedure consists of the following steps, detailed in Sections 3.1-3.4: extraction of silhouette images, skeletonization of the silhouette images, Hough transform of the skeleton images, and parameterization and tracking of the main body axes, all applied to the input image sequences. The output of this procedure is a temporally sequenced set of Hough parameters that represent the dancer's movements. The parameters obtained in this way are used for evaluating the dance performance. Before applying the procedure to image sequences, it is necessary to manually identify the initial main body axes on the first frame. After this initialization, automatic tracking of the main body axes can be carried out.
3.1
Extraction of Silhouette images
Silhouette images of dancers are created by chroma-key techniques. We used a technique introduced in7 to achieve precise silhouettes. Since the method in7 is based on morphological segmentation using background color information that includes shadows and highlights, robust extraction of a dancer's silhouette can be achieved.
3.2
Skeletonization of the Silhouette images
In order to simplify the images and extract meaningful features, skeletonization is performed on the silhouette images. Skeletonization is
based on morphological skeleton processing8. This skeletonization can be expressed by the following equation:

SK(X) = ∪_{n=0}^{N} S_n(X) = ∪_{n=0}^{N} [ (X ⊖ nB) − (X ⊖ nB) ∘ B ]     (1)
where nB represents the structuring element, which can be expressed as

nB = B ⊕ B ⊕ … ⊕ B   (n times)     (2)
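A minimal sketch of this skeletonization is shown below, assuming a binary silhouette array and scipy's default cross-shaped structuring element for B (the paper does not specify B).

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_opening

def morphological_skeleton(silhouette, max_n=64):
    """Morphological skeleton of a binary silhouette, following Eq. (1):
    the union over n of (X eroded n times) minus its opening by B."""
    x = silhouette.astype(bool)
    skeleton = np.zeros_like(x)
    for n in range(max_n):
        eroded = binary_erosion(x, iterations=n) if n > 0 else x
        if not eroded.any():
            break                                # nothing left to erode
        s_n = eroded & ~binary_opening(eroded)   # skeleton subset S_n(X)
        skeleton |= s_n
    return skeleton
```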
3.3
Hough transform of skeleton images
Hough transform of the skeleton images generates parameter sets that represent motion features in a simple form. The Hough transform is widely used for extracting line segments from images and can be expressed as

ρ = x cos θ + y sin θ     (3)

where θ is the angle of the extracted line and ρ is its distance from the origin. The combination (ρ, θ) is the Hough parameter of the corresponding line. Hough parameters of line segments are obtained by thresholding frequencies in Hough space. In addition to the (ρ, θ) parameters, the coordinates of the end-points of the extracted line segments, x_b = (x_b, y_b) and x_e = (x_e, y_e), are computed. This process is carried out by matching the skeleton images with artificial lines drawn on the basis of the extracted Hough parameters. These end-points are attached to the obtained Hough parameters and processed in tracking the Hough parameters. In performing the Hough transform, some tricks were devised in order to make the process more efficient. Assuming that parameterization of the skeleton image on the previous frame was successfully carried out, the Hough transform on the current frame is performed in a confined area specified by the previously obtained parameters. Furthermore, the threshold value for Hough space was adaptively set according to the length of the line segment detected in the previous frame. Specifically, the confined area is the quadrilateral whose vertices are obtained by offsetting the end points (x_b, y_b) and (x_e, y_e) of the line segment detected in the previous frame by ±32 pixels in each coordinate, with the sign pattern chosen according to whether y_b < y_e. The threshold value is selected as 50% of the length of the line segment detected in the previous frame.
A Method for Estimating Dance Action based on Motion Analysis
699
It is more efficient to deal with these parameter sets instead of raw images because redundant information can be removed. After applying the Hough transform, a Hough parameter of a main body axis is chosen as a rhythmical factor. The obtained Hough parameters must then be tracked through temporal sequences.
3.4
Parameterization of main body axes and tracking parameters
It is necessary to track the Hough parameter of the main body axes during dance sequences in order to analyze the correlation between the parameter set and the dance performance. Tracking is done by collecting the candidate parameters in the neighborhood of the Hough parameter selected in the previous Hough space and then selecting the most likely one among them, assuming temporal continuity of Hough parameters. The extracted Hough parameters at time t+1 are obtained by the following procedure, where the Hough parameters at time t are denoted (ρ_t, θ_t). If there are any Hough parameters at time t+1 in the neighborhood of (ρ_t, θ_t) in the Hough space,

(ρ_{t+1}, θ_{t+1}) = M(Ng(ρ_t, θ_t))     (4)

In Eq. (4), Ng( ) is an operator that outputs the Hough parameters existing in the neighborhood of (ρ_t, θ_t). For example, if the detected Hough parameter is (ρ_t, θ_t), Ng(ρ_t, θ_t) is the set of Hough parameters on frame (t+1) that lie in an area of predefined size centered on (ρ_t, θ_t). M( ) is an operator that selects the most appropriate Hough parameters among the candidate parameters collected by the Ng( ) operator. The M( ) operator outputs the Hough parameters whose end-point information is spatially closest to the end-points of the previous Hough parameters. Assuming that the end points detected on frame t are x_b^t = (x_b^t, y_b^t) and x_e^t = (x_e^t, y_e^t), and that the i-th end points among the output parameters of the Ng( ) operator on frame (t+1) are x_bi^{t+1} = (x_bi^{t+1}, y_bi^{t+1}) and x_ei^{t+1} = (x_ei^{t+1}, y_ei^{t+1}), the M( ) operator is defined by the following equation:

(ρ_{t+1}, θ_{t+1}) = argmin_{(ρ_i, θ_i) ∈ H_i} ( ||x_b^t − x_bi^{t+1}|| + ||x_e^t − x_ei^{t+1}|| )     (5)
where Hi is a set of parameters that includes Hough parameters and the information of end points.
If there are no Hough parameters close to the previous ones, the Hough parameters at time t+1 are predicted by applying a Kalman filter9. The Kalman filter is expressed by the combination of the following equations using the state vector X_k and the observation vector Y_k.

Prediction:
X_{k+1} = A_k X_k
P_{k+1} = A_k P_k A_k^T + Q_k     (6)

Update:
K_k = P_k H_k^T (H_k P_k H_k^T + R_k)^{-1}
X_k = X_k + K_k (Y_k − H_k X_k)
P_k = (I − K_k H_k) P_k     (7)
where Q_k, R_k, P_k, and K_k are the process covariance, the observation covariance matrix, the state covariance, and the Kalman gain, respectively. P_k and K_k are sequentially computed by combining Eqs. (6) and (7).
A_k = [ I_{4x4}  I_{4x4} ; 0_{4x4}  I_{4x4} ],   H_k = ( I_{4x4}, 0_{4x4} )     (8)
where I4x4 is a 4 x 4 identity matrix and 04x4 is a 4 x 4 zero matrix. If there is no proper Hough parameter as a result of estimating Eq. (4), the observation covariance matrix is increased in order to ignore the obtained observation vector. Otherwise, the observation covariance matrix is set to 0, which means the state vector is derived only from the observation vector. Consequently, we have achieved robust tracking of Hough parameters by adaptively changing the observation covariance matrix according to the confidence of Eq. (4) when applying the Kalman filter.
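A compact sketch of this predict/update cycle for the eight-dimensional end-point state is given below. The dynamics and observation matrices follow Eq. (8); the noise covariances and the inflation value used to ignore unreliable observations are assumptions.

```python
import numpy as np

I4, Z4 = np.eye(4), np.zeros((4, 4))
A = np.block([[I4, I4], [Z4, I4]])     # constant-velocity dynamics, Eq. (8)
H = np.hstack([I4, Z4])                # we observe the four end-point coordinates

def kalman_step(x, P, y, Q, R, observation_valid=True):
    """One predict/update cycle of Eqs. (6)-(7) for the state
    (xb, yb, xe, ye, x'b, y'b, x'e, y'e)."""
    # Prediction (Eq. 6)
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    if not observation_valid:
        # No reliable Hough parameter: effectively ignore the measurement
        # by inflating the observation covariance, as described in the text.
        R = R + 1e6 * np.eye(4)
    # Update (Eq. 7)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (y - H @ x_pred)
    P_new = (np.eye(8) - K @ H) @ P_pred
    return x_new, P_new
```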
4.
EXPERIMENTAL RESULTS
Figure 2 shows the image obtained when the main body axis is properly extracted by the proposed method. Over a period of 10 seconds, the tracking of Hough parameters was successful in most of the dance scenes. To identify rhythmical factors from these motion features, we applied a Fourier transform to the temporal sequence of angle values of the Hough parameters. Figure 3 shows one of the spectral distributions obtained by the Fourier transform; each spectrum line corresponds to one dancer. Through these spectra, we found that professional dancers have a stronger peak than amateur dancers at a specific frequency. Furthermore, this frequency coincides almost exactly with the tempo of the music to which the dancers dance. From this it can be concluded that the sequence of Hough parameters represents rhythm in dance performance. Moreover, since the difference in the peak level of the spectrum corresponds to the results of subjective evaluation, automatic estimation of dance performance becomes possible by using this measure.
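The rhythm analysis amounts to a Fourier transform of the tracked angle sequence; a minimal sketch is shown below, where the frame rate and the mean-removal step are assumptions.

```python
import numpy as np

def rhythm_spectrum(theta, fps=30.0):
    """Amplitude spectrum of the tracked body-axis angle sequence.
    A pronounced peak near the music tempo indicates a rhythmic dance."""
    theta = np.asarray(theta, dtype=float)
    theta = theta - theta.mean()               # remove the DC component
    spectrum = np.abs(np.fft.rfft(theta))
    freqs = np.fft.rfftfreq(theta.size, d=1.0 / fps)   # in Hz
    return freqs, spectrum
```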
Figure 2. Result of motion analysis.
Figure 3. Spectra of Hough parameter sets.
702
5.
CONCLUSION
We developed a method for estimating dance performance by using motion analysis techniques. By representing motion features as Hough parameter sets, we show that the parameter sets have a correlation to the rhythm factor in dance. By analyzing the parameter sets with a Fourier transform, they become an effective tool for estimating dance performance. The next steps are to search for a way to analyze more detailed movements, such as those of the hands, legs and head, and to achieve real-time processing so that dancers can be given useful feedback on their movements during an actual dance.
ACKNOWLEDGMENTS This research was conducted through a grant for "Research on Interaction Media for High-Speed and Intelligent Networking" from the National Institute of Information and Communications Technology, Japan.
REFERENCES 1. R. A. Berry, M. Suzuki, N. Hikawa, M. Makino, “2003 The Augmented Composer Project: The Music Table”, ACM SIGGRAPH Emerging Technologies, Abstracts and Applications (2003). 2. A. Soga et al., “Motion Description and Composing System for Classic Ballet Animation on the Web,” Proc. 10th IEEE Roman, pp. 134-139 (2001). 3. L. M. Naugle, “Motion capture: Re-collecting the dance,” Proc. ICKL99, pp. 208-213 (1999). 4. A. Nakazawa et al., “Imitating human dance motion through motion structure analysis”, Proc. of IROS2002, pp. 2539-2544. 5. K. Hachimura and M. Nakamura, “Method of generating coded description of human body motion from motion-captured data,” IEEE ROMAN 2001, pp. 122-127 (2001). 6. H. Scheffe, “An analysis of variance for paired comparisons,” Journal American Statistical Association, vol. 47, pp. 381-400 (1952). 7. M. Naemura et al., “Morphological Segmentation of Sports Scenes using Color Information,” IEEE-T Broadcasting, vol. 46, No. 3, pp. 181-188 (2000). 8. I. Pitas, Digital Image Processing Algorithms and Applications, John Wiley & Sons, Inc. (2000). 9. A. Gelb ed., “Applied Optimal Estimation,” MIT Press, Cambridge, MA (1974).
A FEATURE BASED MOTION ESTIMATION FOR VEHICLE GUIDANCE Onkar Ambekar1 , Ernani Fernandes2 and Prof. Dr. Dieter Hoepfel1 1 University of Applied Sciences-Karlsruhe, Germany, 2 Valeo Switches and Detection System
GmbH, Germany
Abstract
Motion estimation is a key issue in research on intelligent vehicles. In this paper, we implement a real-time motion estimation technique based on the Kanade-Lucas-Tomasi technique. This algorithm is composed of two parts: feature (corner) selection, and tracking within consecutive frames. Experimental results on an image sequence of an outdoor scene prove the high accuracy and robustness of the algorithm.
Keywords:
Machine vision; real time processing; motion estimation; feature tracking.
1.
INTRODUCTION
The future of the application of novel electronic systems in the automotive sector is extremely promising. One of the most important aspects in the development of these technologies is driver comfort, which is based on the key technology of image processing. In this sense all issues concerning driving environment detection and recognizing targets in real time play an important role in the automotive industry. A few applications of image processing in vehicle guidance systems are: Rear view system: a camera mounted in the rear of the car is used to display the scene to the driver. The system is used to guide the driver in parking and manoeuvring. Lane departure warning system: a front-mounted camera is used to detect the road markings. It warns the driver if a deviation from a safe course is detected. Night vision system: an active systems solution comprising an infrared light source, camera and a head-up display is used to allow the driver a clear view of an insufficiently illuminated road.
The majority of work in motion estimation can be categorized into two streams: optical flow and model-based techniques. Although techniques based on optical flow1,3,4 can be implemented in real time, they inherently suffer from the aperture problem discussed below. Model-based techniques2 are computationally demanding and their implementation in real time requires special hardware; a costlier solution for the highly competitive automotive industry. Keeping in mind the limitations of the above methods, a feature (corner) tracking method based on the Kanade-Lucas-Tomasi technique has been implemented in this article. This method overcomes the aperture problem by extracting features such as corners. Though computationally more demanding than optical flow based techniques, this method can be implemented in real time.
2.
APERTURE PROBLEM
The non-equivalence of the projected motion and the optical flow is a consequence of the fact that motion information is carried in the image structure, i.e. in gray level variation. The aperture problem occurs when there is insufficient gray level variation in the considered image region to uniquely constrain the problem, i.e. more than one candidate motion fits the observed image data equally well. This implies that only the motion component in the direction perpendicular to a significant image gradient can be estimated with any degree of certainty. For example, consider estimating the motion of a section of an object boundary, such as the one shown in Figure 1A. Here, a single spatial image gradient lies within the small aperture, and consequently it is not possible to judge the correct motion. On the other hand, in the case of corners, two spatial image gradients are present within the aperture, and hence the motion can be uniquely determined (as shown in Figure 1B).
Figure 1. Aperture problem.
3.
FEATURE EXTRACTION METHOD
By definition, features are locations in the image that are perceptually interesting. A point feature, such as a corner, is defined as a point that can be easily recognized from one frame to the next (the so-called correspondence problem). The basic constraint used to solve the correspondence problem (locally in space and time) is the brightness constancy of the image region that surrounds a feature point. A well-known algorithm presented by Lucas and Kanade and refined by Tomasi and Kanade5 has been used here for corner detection. The Kanade-Lucas-Tomasi (KLT) corner detection operator is based on the local structure matrix, which can be written as:

C_str = G(r; σ) ∗ [ f_x²   f_x f_y ;  f_x f_y   f_y² ] = [ f̄_x²   f̄_x f_y ;  f̄_x f_y   f̄_y² ]     (1)

where the bar denotes the Gaussian-smoothed value of the corresponding product.
The derivatives of the intensity function f(x,y) are calculated at each point to obtain the values of the matrix (i.e., f_x² and so on). The local structure matrix is calculated by smoothing (integrating) the elements of the matrix by the Gaussian filter G(r; σ). The symmetric 2x2 matrix C_str of the system must be well conditioned and above the image noise level. The noise requirement implies that both eigenvalues of C_str must be large, while the conditioning requirement needs them not to differ by several orders of magnitude. Two small eigenvalues indicate a roughly constant intensity profile within a window. A large and a small eigenvalue correspond to a unidirectional texture pattern. Two large eigenvalues can represent corners and salt-and-pepper textures. In conclusion, if the two eigenvalues of C_str are λ1 and λ2, we accept a window as a corner if

min(λ1, λ2) > λ_thr     (2)
There are some practical issues that must be addressed in constructing an effective feature selector. For instance, we must avoid selecting different featurepoints within the same window, since this could cause feature mismatch during tracking. To this end, one can sort the pixels in an image based on the smallest singular value of Cstr (which can be taken as a cornerness). Then within a given area one may select only the best candidate as a feature point. Figure 2 shows detected corners on a outdoor image.
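The following sketch, assuming a grayscale image array, illustrates this selection rule: the smaller eigenvalue of C_str is computed per pixel, local maxima are kept so that no two features share a window, and the strongest candidates are returned. The smoothing scale, neighbourhood size and threshold fraction are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def klt_corners(image, sigma=2.0, n_corners=150, min_dist=8):
    """Corner detection via the smallest eigenvalue of the local structure
    matrix C_str of Eq. (1)."""
    img = image.astype(np.float64)
    fy, fx = np.gradient(img)
    # Elements of C_str, smoothed (integrated) by the Gaussian G(r; sigma).
    a = gaussian_filter(fx * fx, sigma)
    b = gaussian_filter(fx * fy, sigma)
    c = gaussian_filter(fy * fy, sigma)
    # Smaller eigenvalue of the symmetric 2x2 matrix [[a, b], [b, c]].
    lam_min = 0.5 * (a + c - np.sqrt((a - c) ** 2 + 4.0 * b ** 2))
    lam_thr = 0.01 * lam_min.max()          # assumed stand-in for lambda_thr
    # Keep only local maxima so that features do not cluster in one window.
    peaks = (lam_min == maximum_filter(lam_min, size=min_dist)) & (lam_min > lam_thr)
    ys, xs = np.nonzero(peaks)
    order = np.argsort(lam_min[ys, xs])[::-1][:n_corners]
    return list(zip(ys[order], xs[order]))
```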
4.
KANADE-LUCAS-TOMASI
Tracking the corners from the first to the current frame may be identified as a correspondence problem. Correspondence between these corners can be established by quantifying the dissimilarity of a corner between the
Figure 2. Detected corners are presented with a green color box of size 3x3.
first and the current frame. Dissimilarity is the corners’ rms residue between the two frames, and when it grows too large the corner should be abandoned. Image motion can occur due to: ego-motion of the camera, movement of the object when the camera is stationary, and when both the camera and the object move with respect to each other. In any case, the changes can be described as: I(x, y, t + τ ) = I(x − ξ(x, y, t, τ ), y − η(x, y, t, τ ))
(3)
Thus, the image I(x, y, t + τ ), taken at t + τ can be obtained from the preceding image I(x, y, t), taken at time t, by moving every point in the I(x, y, t) image. Thus, a point in x in the image I moves to point x + d in the second image J, and therefore: J(x + d) = I(x)
(4)
Now, due to the lack of brightness constancy and image noise, equation (4) is not satisfied. The problem of determining the motion parameters is to find the displacement d that minimizes the dissimilarity ε, which is given by:

ε = ∫_W [J(x + d) − I(x)]² w(x) dx     (5)
where W is the given feature window and w(x) is a weighting function. In the simplest case, w(x) = 1, or w could also be a Gaussian-like function to emphasize the central area of the window. To minimize ε, we differentiate it with respect to d and set the derivative to zero:

∂ε/∂d = ∫_W [J(x) − I(x) + gᵀd] g(x) w(x) dx = 0     (6)
where

g = ( ∂(I + J)/∂x , ∂(I + J)/∂y )ᵀ .
[J(x) − I(x)]g(x)w(x)dx = −
3
4 T
w
g (x)g(x)w(x)dx d
(7)
In other words, we must solve the equation: Gd = e
(8)
where, G is a 2x2 matrix: G=−
w
gT (x)g(x)w(x)dx
(9)
and e is the following 2x1 vector:
e=
w
[I(x) − J(x)]g(x)w(x)dx
(10)
Hence, 3
d=
5.
dx dy
4
= G−1 e
(11)
RESULTS AND CONCLUSION
Features such as corners and vertices are projection of 3-D features on 2-D image making them reliable for the purpose of motion estimation. Corners have at least two gradient directions, which avoids the aperture problem. In Figure 3, from top left to bottom right: the detected corners are shown in the first frame (in green box of size 3x3 pixels) and their tracked positions in the next frames are shown in red boxes of size 3x3 pixels. In the first frame 150 corners, as seen in the top left image of Figure 3, are detected based on the their cornerness value. The position of the corner and gray-value of surrounding pixels are used to track them in subsequent frames. At the frame rate of 25fps and small vehicle speed the frame difference between consecutive images is small and so is the motion. Here, simple translational model is sufficient for tracking the corners instead of a more complex affine motion model. With the help of the translational model, the KLT technique tracks more than 90 percent of the detected corners. The current implementation works with the speed of 3fps on images having 640x480 resolution. Results clearly show that the Kanade-Lucas-Tomasi corner tracking technique is most suitable for motion estimation for low speed vehicle movement and is robust when applied on images taken taken in outdoor.
708
Figure 3. From left to right: Corners detected in frame 1 are marked in green and their tracked positions in the next frame are marked in red.
REFERENCES 1. J. L. Barron, D. J. Fleet and S. S. Beauchemin, Performance of optical flow techniques. International Journal of Computer Vision, 12(1):43-77, 1994. 2. D. J. Heeger, Model for the extraction of image flow. Journal of the Optical Society of America (A), 4(8):1455-1471, 1987. 3. J. R. Jain and A. K. Jain, Displacement measurement and its application in interframe image coding. IEEE Transactions on Communications, 29(12):1799-1808, 1981. 4. V. Seferidis and M. Ghanbari, General approach to block-matching motion estimation. Optical Engineering, 32(7):1464-1474, 1993. 5. J. Shi and C. Tomasi, Good features to track. IEEE Conference on Computer Vision and Pattern Recognition, 593-600, 1994.
FAST UNIFORM DISTRIBUTION OF SEQUENCES FOR FRACTAL SETS VLADIMIR M. CHERNOV Image Processing Systems Institute of RAS, Samara, Russia
Abstract:
The problems of constructing point sequences with fast uniform distribution in the fractal domains associated with canonical number systems in quadratic fields are considered.
Key words:
Uniform distribution, fractals, canonical number systems
1.
INTRODUCTION
The well-known criterion of Weyl1 implies that for almost all (with respect to the Lebesgue measure) numbers x from the fundamental domain A = [0,1] and all natural g > 1, the values of the function a_n = {x gⁿ} are uniformly distributed in the set A:

T⁻¹ N_γ(T) = γ + R(T),   lim_{T→∞} R(T) = 0     (1)

where N_γ(T) is the number of values of a_n falling in [0, γ) ⊂ A for n = 1, 2, ..., T, and the symbol { } denotes the fractional part of a number. Despite the fact that Eq. (1) holds "with probability one", the first constructive example of numbers x meeting Eq. (1) was given by D. Champernowne2 only in 1933. Later many authors1,3 investigated the quantitative properties of R(T) for different x, g. For example, N. M. Korobov3 showed that there exist x such that for an arbitrary prime p = g, as T → ∞, the asymptotic relation Eq. (1) holds in the following form:

N_γ(T) = γT + O(T^{1/3} ln^{4/3} T)     (2)
The numbers x with fast uniform distribution of the values {x gⁿ}, i.e. such numbers that the remainder R(T) in Eq. (1) has the minimal possible order of magnitude, can be constructed using specific methods. These methods use the representation of real numbers in the g-ary number system and so-called normal periodic number systems (m-sequences). The basic ideas introduced for the construction of these numbers were widely used in generating multi-dimensional arrays and sequences with uniform distributions. This paper provides a two-dimensional generalization of the methods introduced in Ref. 3, including the distribution in "non-canonical" regions (regions different from the unit square).
2.
BASIC IDEAS
Let x be a number such that Eq. (2) holds. In the g-ary number system x can be represented in the form

x = x₁g⁻¹ + x₂g⁻² + ... ,   x_j ∈ {0, 1, ..., g−1}     (3)
In Ref. 3 a method was introduced to construct x via a special sequence of its digits x_j ∈ {0, 1, ..., g−1}. The sequence was "assembled" using blocks of values of m-sequences with increasing periods, i.e. of linear recurrent functions over finite fields with maximum period. The remainder term of R(T) in Eq. (2) is estimated using the results obtained in Ref. 4. The method is based on estimates of exponential sums with the recurrent function. In this particular case, if the shift operator P is applied iteratively to the representation of x in the form of Eq. (3),

Px = P(x₁g⁻¹ + x₂g⁻² + ...) = x₂g⁻¹ + x₃g⁻² + ... ,   P^{n+1}x = P(Pⁿx)     (4)

and the singular parts of the corresponding operator values are calculated, then the sequence a_n = {x gⁿ} is associated with the sequence composed of these singular parts. It is necessary to note that Eq. (2) is correct not only for the frequency of appearance of the values of a_n in an interval [0, γ) ⊂ A, but in any measurable (in terms of the Lebesgue measure) set Λ ⊂ A. In this case Eq. (2) has the following form:
N_Λ(T) = mes(Λ)·T + O(T^{1/3} ln^{4/3} T)     (5)
In this paper the two-dimensional real space R² is regarded as a completion Q̃(√d) of a two-dimensional quadratic algebra Q(√d) (a quadratic field of algebraic numbers). The completion is considered with respect to the topology induced by the norm in Q(√d). This interpretation of R² allows the generalization of the results obtained in Ref. 3, provided the two-dimensional analogs of the one-dimensional set and its properties are defined. Thus, it is necessary to answer the following questions. What is the "fundamental region", the analog of the unit segment A = [0,1], and what sets are the analogs of [0, γ) ⊂ A? How can a measure be defined on the two-dimensional fundamental region, i.e. what is the analogue of the Lebesgue measure on A = [0,1], and what sets are measurable with respect to the measure defined? Which representation in the Q̃(√d) algebra is the analog of Eq. (3) for real numbers?
3.
CANONICAL NUMBER SYSTEMS
Let Q(√d) be a quadratic field over Q: Q(√d) = {z = a + b√d; a, b ∈ Q}, where d is a square-free integer. Recall that if d > 0 the extension (or quadratic field) is called real, and if d < 0 it is called imaginary. Recall also that if the norm and the trace of z = a + b√d ∈ Q(√d) are integers (i.e. Norm(z) = a² − db² ∈ Z, Tr(z) = 2a ∈ Z), then the element z is called an algebraic integer in Q(√d). In contrast, "usual" integers are called integers in the field of rational numbers. Denote by S(√d) the ring of the integers in Q(√d). The term canonical number system in S(√d) was introduced in Refs. 6, 7. The algebraic integer α = A + B√d is called the base of a canonical number system in the ring of integers of the field Q(√d) if every integer z in Q(√d) can be uniquely represented by the finite sum

z = Σ_{j=0}^{k(z)} z_j αʲ ,   z_j ∈ N = {0, 1, ..., Norm(α) − 1}

The pair {α, N} is called the canonical number system in the ring S(√d) of integers in Q(√d). Below there are several examples of canonical number systems. 1. Suppose Norm(α) = 2; then there exist exactly three imaginary quadratic fields with rings of integers in which binary canonical number systems exist, namely (a) the ring of Gaussian integers S(i) ⊂ Q(i) with the base α = −1 ± i; (b) the ring S(i√7) ⊂ Q(i√7) with the base α = (−1 ± i√7)/2; (c) the ring S(i√2) ⊂ Q(i√2) with the base α = ±i√2. 2. Suppose Norm(α) = 3; then there exist only three imaginary quadratic fields with rings of integers in which ternary canonical number systems exist, namely (a) the field Q(i√2) with the bases α = −1 ± i√2; (b) the field Q(i√3) with the bases α = (−3 ± i√3)/2; (c) the field Q(i√11) with the bases α = (−1 ± i√11)/2. 3. In the ring of integers of a real quadratic field S(√d) ⊂ Q(√d), d > 0, the quinary ("pental") number system with the base α = (−5 ± √5)/2 is the canonical number system with the minimal number of digits.
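As a concrete illustration of case 1(a), the sketch below converts a Gaussian integer into its digit string in the binary canonical number system with base α = −1 + i. The digit-extraction rule (a + bi is divisible by −1 + i exactly when a + b is even) is standard, but the function itself is only an illustrative assumption, not part of the paper.

```python
def to_base_m1i(a, b):
    """Digits (least significant first) of the Gaussian integer a + b*i in the
    canonical number system with base alpha = -1 + i and digit set {0, 1}."""
    digits = []
    while a != 0 or b != 0:
        d = (a + b) % 2            # remainder modulo alpha: 0 or 1
        a -= d                     # now a + b*i is exactly divisible by alpha
        # (a + b*i) / (-1 + i) = ((b - a) - (a + b)*i) / 2
        a, b = (b - a) // 2, -(a + b) // 2
        digits.append(d)
    return digits or [0]

# Example: 2 = alpha**2 + alpha**3, so to_base_m1i(2, 0) returns [0, 0, 1, 1].
```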
4.
"DRAGONS" AND THEIR "SCALES"
Let z be an element of the Q̃(√d) algebra. It is easy to prove that there exists a representation of z in the form

z = Σ_{j=−∞}^{k(z)} z_j αʲ = Σ_{j=0}^{k(z)} z_j αʲ + Σ_{j=−∞}^{−1} z_j αʲ     (6)
where z_j ∈ N and {α, N} is a canonical number system in the ring S(√d). The first sum on the right-hand side of Eq. (6) is called the regular part and the second sum is called the singular part. Similarly, the numbers that can be represented by these sums are called regular and singular numbers, respectively. The properties of canonical number systems imply that a regular number is an algebraic integer, i.e. an element of S(√d). The sets of singular numbers for several algebras are graphically represented in Figs. 1-3. In the setting of the problem considered, these sets are the analogs of the fundamental region A = [0,1]. For example, for the ring S(i) the fundamental region is the closure of the Harter-Heighway "dragon", a well-known fractal set. The Q̃(√d) algebras are locally compact abelian groups with respect to the addition operation. This implies that on these algebras Haar measures invariant to additive shifts can be correctly defined. These measures induce normalized measures Mes(Δ; α, d) on the subsets Δ ⊂ A(α, d); Mes(Δ; α, d) is such that the measure of the corresponding fundamental region A(α, d) is equal to unity. In Figs. 1-3 the subsets ("dragon scales") of the fundamental regions are highlighted. The "dragon scales" are the analogs of the segments [k·p⁻ᵐ, (k+1)·p⁻ᵐ], where p = Norm(α) = 2, 3, 5.
Figure 1. Fundamental domains, associated with the binary canonical number systems.
Figure 2. Fundamental domains, associated with ternary canonical number systems.
Figure 3. Fundamental domains associated with quinary canonical number systems.
5.
MAIN RESULT
Let S(√d) be the ring of integers in Q(√d). Suppose the following conditions are true: {α, N} is a canonical number system in S(√d); Norm(α) = p, where p is prime; Δ ⊂ A(α, d) is a measurable set with respect to the measure Mes(Δ; α, d) on the Q̃(√d) algebra; and the sequence x₁, x₂, x₃, ... (x_j ∈ N) is such that for the number x = x₁p⁻¹ + x₂p⁻² + ... the corresponding sequence a(n) = {x gⁿ} has fast uniform distribution, i.e. Eq. (2) holds. If X ∈ Q̃(√d) is given by X = x₁α⁻¹ + x₂α⁻² + ... and N_Δ(T) is the number of values of the function

PⁿX = Pⁿ(x₁α⁻¹ + x₂α⁻² + ...) = P^{n−1}(x₂α⁻¹ + x₃α⁻² + ...)

that are contained in Δ ⊂ A(α, d) for n = 1, 2, ..., T, then as T → ∞ the following relation holds:

N_Δ(T) = Mes(Δ; α, d)·T + O(T^{1/3} ln^{4/3} T)
This research was financially supported by the RF Ministry of Education, the Samara Region Administration and the U.S. Civilian Research & Development Foundation (CRDF Project SA-014-02) as part of the joint Russian-American program "Basic Research and Higher Education" (BRHE), and by the Russian Foundation for Basic Research (project No. 03-01-00736).
REFERENCES 1. L. Kuipers and H. Niederreiter, Uniform Distribution of Sequences (Interscience Tracts, John Wiley and Sons, NY, 1974). 2. D. G. Champernowne, The construction of decimals normal in the scale of ten, J. Lond. Math. Soc. 8, 254-260 (1933). 3. N. M. Korobov, O raspredelenii drobnyh doley pokazatel'noj funkcii, Vestnik MGU, serija Matematika i mehanika, 4, 42-46 (1966). (in Russian) 4. N. M. Korobov, Raspredelenie nevychetov i pervoobraznyh kornej v rekurrentnyh rjadah, DAN SSSR, 88, 603-608 (1953). (in Russian) 5. R. Lidl and H. Niederreiter, Finite Fields (Addison-Wesley, Reading, Mass., 1983). 6. I. Kátai and J. Szabó, Canonical number systems for complex integers, Acta Sci. Math. (Szeged), 37, 255-260 (1975). 7. I. Kátai and B. Kovács, Kanonische Zahlensysteme in der Theorie der quadratischen algebraischen Zahlen, Acta Sci. Math. (Szeged), 42, 99-107 (1980).
OCCLUSION ROBUST TRACKING OF MULTIPLE OBJECTS∗ Oswald Lanz ITC-irst, 38100 POVO (TN), ITALY [email protected]
Abstract
This paper focuses on the problem of vision-based tracking of multiple objects. Probabilistic tracking in 3D supported by multiple video streams allows us to formalize an efficient observation model that is robust to occlusions. Each tracked object is assigned a support layer, a probabilistically meaningful pixel occupancy map, supplying weights used in the calculation of other objects observation likelihood. A Particle Filter implementation demonstrates the robustness of the resulting tracking system on synthetic data.
Keywords:
Multiple Object Tracking, Occlusions, Bayes Filter, Particle Filter
1.
INTRODUCTION
This paper focuses on the problem of vision-based multiple objects tracking with attention to robustness to occlusions. Previous work has concentrated on single camera 2D tracking. A probabilistic exclusion principle, introduced in (MacCormick and Blake, 1999), prevents a single pixel from being independently associated to similar hypotheses of different objects. The object state is enhanced with a discrete dimension that makes it possible to distinguish between foreground and background hypotheses. However, it is restricted to the specific type of contour-based measurements. An abstraction to object-level and configuration-level behavior is proposed in (Tao et al., 1999). Independent single object hypotheses are reviewed using heuristics based on blob coverage and compactness and therefore a founded probabilistic interpretation is not possible. In partially occluded situations feature-based tracking can be applied successfully to points that are suitably selected at each frame, as proposed in (Dockstader and Tekalp, 2001). A major drawback of feature-based methods is that an object that becomes wholly occluded is lost even if occluded for only a few frames. ∗ Research
partially funded by Provincia Autonoma di Trento under project PEACH
716 The use of multiple cameras allows 3D tracking, supporting the development of a robust likelihood model to handle occlusions in a principled way. The next section reviews the basics of Bayesian Filtering, a probabilistic tracking framework supporting multiple unsynchronized streams, with remarks on the multiple object domain. Section 3 presents the main contribution, namely robust tracking in the presence of occlusions. The implementation of the proposed method within a Particle Filter is also discussed. Section 4 describes an experiment demonstrating the robustness of the proposed ideas carried out on synthetic data.
2.
RECURSIVE BAYES FILTER AND MULTIPLE OBJECT TRACKING
Tracking can be formalized as a state estimation problem. Each object configuration is quantized to a state vector xt. The goal of tracking is to compute, at each time step, a density function p(xt) on the state space, representing the object's belief. A recursive Bayes filter estimates the current belief using stochastic propagation and Bayes theorem, according to

p(xt) ∝ p(zt|xt) ∫X p(xt|xt−1) p(xt−1) dxt−1 .
At each time step the previous belief p(xt−1 ) is propagated with a stochastic dynamic model p(xt |xt−1 ). Once a new observation zt is available the prediction is corrected by applying Bayes rule with an observation model p(zt |xt ). It is this second step that is the key to a successful application of Bayes filter and it rests on reliable modeling of the observation likelihood. Multiple objects can be tracked with one single Bayes filter, the joint filter, by concatenating their configurations into a single super-state vector. If the object states encode 3D information, occlusions can be modeled with a joint observation model. Unfortunately, applying the joint filter soon becomes impractical due to exponential complexity increase in the state space dimension. However, any independence assumption between different objects models can be exploited to substantially reduce the filters complexity. While for interobject behavior (density propagation) the assumption of independence might be reasonable in many applications, this is often not the case for observations (measurement update). Image formation involves projection into a lower dimensional space where occlusions can occur. Tracking multiple objects with fully independent filters means ignoring these important interactions.
3.
TRACKING IN THE PRESENCE OF OCCLUSIONS
Borrowing from the field of Computer Graphics, the image formation process can be formalized by defining a rendering function g(x) that maps a given
Figure 1. The silhouette of a person's specific pose and its support layer computed from a bimodal belief. The silhouette becomes blurred due to the estimation uncertainty.
object's configuration x into a picture of it. The observation model p(z|x) is then defined in terms of a probabilistic measure of similarity between the (noisy) image z and the synthesized view g(x). The facets of noise involved are manifold: global illumination effects are neglected, often only a very raw object model is available, and noise modeling in visual sensors is not straightforward. This may justify the common choice of defining the observation likelihood through a heuristic distance d(z1, z2) in image space between the observation z under analysis and a rendered hypothesis x: p(z|x) ∝ e−d(z,g(x)). The key to a robust multi-object observation model is to take visibility into account in the definition of the distance d. This paper proposes to do this by associating to each image pixel a weight representing its reliability of belonging to the object under analysis. Given an object rendering function g(x), let us introduce its silhouette rendering function Δg(x|s) according to
Δg(x|s) = { 1   if 0 < gs(x) < s
          { 0   elsewhere
where gs is the distance of the object surface points from the camera optical centre and s is the radius of a clipping sphere. It can be interpreted as a segmentation operator on its depth component: Δg = 1 where the object is visible within a distance s and 0 elsewhere. If the state of one object were known exactly, those pixels with Δg = 1 (i.e. its support) should be ignored in the likelihood calculation of all other objects that are behind it. However, the object’s position is only estimated, available in the form of a density function. For a given depth s let us introduce Au , the set of object configurations that render on a selected pixel u: Au = {x | 0 < [gs (x)]u < s}. From therelation between probability mass and probability density function P (A) = A p(x)dx the probability of occluding pixel u can be derived:
[Pocc]u = ∫Au [Δg(x|s)]u p(x) dx.
Figure 2. The current image, the support map for the head and upper body of the person in foreground obtained from his estimated belief, and the weighted image used for computing the occluded person likelihood (light color represents high occlusion probability, that is low weight)
The support layer of an object with belief p(x) at a given depth s can then be defined as

S(s) = Ep[Δg(x|s)] = ∫X Δg(x|s) p(x) dx.
Figure 1 shows S(s) for a bimodal belief. The probability that a pixel has been occluded by another tracked object is taken as the highest value among the support layers of the other objects: this is equivalent to classifying the pixel to be owned by the most likely occluding object. The pixel weight layer Wk(s) for object k at depth s is then defined as the probability of not being occluded by other objects, that is

Wk(s) = 1 − max_{l≠k} {Sl(s)}.
Static obstacles can be incorporated in this formulation: p(x) becomes an impulse function and its support layer is given by the (clipped) silhouette image. Wk(s) can be used to modulate the distance function of the observation model, reducing the influence of likely occluded pixels (see Figure 2). Tracking can also be enhanced in the case when objects are temporarily completely occluded. Such situations can be detected by considering low weight regions of W. Likelihood update is not performed as no measurements are available. To make the problem of Bayesian filtering computationally tractable in a nonlinear, non-parametric setting, Monte Carlo approximations have been proposed (e.g. Isard and Blake, 1998; Doucet et al., 2001). The underlying idea is to maintain a compressed representation of the belief in the form of a set of representative, weighted sample states, the particles. Basically, given its particle representation {πi,t−1, xi,t−1} at time t − 1, the belief approximation at time t is obtained by sampling a new particle set {xi,t} from the mixture density Σj p(xt|xj,t−1) πj,t−1 and by setting their weights to the observation likelihood p(zt|xi,t) (see Doucet et al., 2001 for a more detailed introduction). According to its definition the support layer of an object is given by the expectation of the silhouette rendering function over its estimated belief. In
Occlusion Robust Tracking of Multiple Objects
719
the particle filter framework it can be approximated by a weighted sum of silhouette images

S(s) ≈ (1/π) Σi Δg(xi|s) πi ,   with π = Σi πi .

If the geometric properties of the objects allow the ordering of the particles according to their distance to the camera, this can be exploited to speed up the observation weight calculations. In such situations the support layers can be computed incrementally, starting from the nearest configuration to the farthest one. The complexity O(N²K²) of the naive implementation reduces then to O(NK) for K objects, where each one is tracked with N particles.
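A sketch of this approximation is given below, assuming a render_silhouette(x) callback that produces the binary silhouette image of a single hypothesis; depth clipping and the incremental depth-ordered variant are omitted.

```python
import numpy as np

def support_layers(particle_sets, weights, render_silhouette, shape):
    """Per-object support layers S_k approximated by the weighted sum of
    particle silhouettes, normalized by the total particle weight."""
    layers = []
    for particles, w in zip(particle_sets, weights):
        s = np.zeros(shape, dtype=np.float64)
        for x, pi in zip(particles, w):
            s += pi * render_silhouette(x)       # pi * Delta_g(x_i | s)
        layers.append(s / np.sum(w))             # values lie in [0, 1]
    return layers

def weight_layer(layers, k):
    """W_k = 1 - max over the other objects' support layers."""
    others = [s for j, s in enumerate(layers) if j != k]
    if not others:
        return np.ones_like(layers[k])
    return 1.0 - np.max(np.stack(others), axis=0)
```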
4.
RESULTS AND CONCLUSIONS
Experiments are performed on synthetic data delivered by a real-time graphical simulator presented in (Santuari et al., 2003). Each person is modeled by a coarse histogram in RGB color space. Its state is defined in terms of its 2D position and velocity on a horizontal reference plane. Two calibrated cameras capture images from two orthogonal viewing directions. During tracking only one image per time step is analyzed: we alternate continuously between views. Particle propagation is performed using a constant velocity model with additive Gaussian noise. After propagation, particle positions are projected onto the image plane and ordered according to their camera distance. At each projected particle position a coarse head-torso silhouette kernel is extracted from W. The kernel is used to extract a rough 6x6x6 quantized RGB color histogram. Its Bhattacharyya coefficient-based distance to the object's reference histogram defines the likelihood (see Nummiaro et al., 2002 for a reference). Figure 3 shows the tracking of two people using 50 particles per target. The standard particle filter computes the likelihood of a particle by considering all pixels belonging to its silhouette. When a target is partially occluded its histogram is contaminated by the other objects' colors, which results in lower particle weights. This causes the belief mode to drift away from its true position, which can result in a lock on background clutter. This happens even if the target is occluded only in one camera and a bad likelihood is obtained only at each second time step. The proposed algorithm can succeed in tracking by ignoring likely occluded pixels. The histogram of a partially occluded configuration is extracted mainly from the visible part of its silhouette, which results in robust tracking. The advantage of the proposed algorithm over the traditional one is clearly demonstrated by the experiments performed on synthetic data. Future work aims at assessing tracking performance on real data. The proposed framework is suitable for the realization of distributed architectures. The Bayes filter naturally allows the distribution of computational load among several processing units. Likelihoods can be calculated on unsynchronized sensor
Figure 3. The left column shows the behavior of the standard particle filter. The dark lady is lost due to an occlusion. By considering the light man's support layer, the same algorithm can track without getting distracted (right column). Only the framed image is used as measurement.
nodes while a central unit maintains 3D target beliefs. The compressed particle belief representation reduces inter-camera communication to such a level that even a low-bandwidth connection could support it. From a theoretical perspective, it will be valuable to pursue the embedding of the proposed approach into the joint filter framework, deriving it as a belief marginalization approach.
REFERENCES
Dockstader, S. and Tekalp, A.M. (2001). Multiple camera tracking of interacting and occluded human motion. In Proc. of the IEEE, volume 89.
Doucet, A., de Freitas, N., and Gordon, N. (2001). Sequential Monte Carlo Methods in Practice. Springer-Verlag.
Isard, M. and Blake, A. (1998). Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision.
MacCormick, J. and Blake, A. (1999). A probabilistic exclusion principle for tracking multiple objects. In Proc. Int. Conf. Computer Vision.
Nummiaro, K., Koller-Meier, E., and Gool, L. Van (2002). Object tracking with an adaptive color-based particle filter.
Santuari, A., Lanz, O., and Brunelli, R. (2003). Synthetic movies for computer vision applications.
Tao, H., Sawhney, H. S., and Kumar, R. (1999). A sampling algorithm for detecting and tracking multiple objects. In Proc. Vision Algorithms.
FACE TRACKING USING CONVOLUTION FILTERS AND SKIN-COLOR MODEL
Gejgus P. and Kubini P.
DCGIP FMPI Comenius University, Bratislava, Slovakia
Abstract:
The aim of this paper is to propose a system for tracking facial features. The face blob is segmented using a stochastic skin-color model, and the features are then searched for within the detected face blob; we detect eyebrows, eyes, and nostrils in the area of the face. The system is only applicable under constant lighting conditions. The results seem promising.
Key words:
face detection, skin-color model, convolution filters, edge detectors
1. INTRODUCTION
Facial feature tracking is nowadays one of the most rapidly evolving areas of computer vision. We propose a system that uses one web camera to detect the face position and to track its features. Human face detection and tracking is a necessary step in many face analysis tasks, such as face recognition, model-based video coding, and content-aware video compression. Although these problems are easy for a human to solve, they are considered "hard" in machine vision. Segmentation of the face in such a system is very important for the localization of particular face features. There exist many methods for face segmentation, mostly based on stochastic skin-color models (a single Gaussian or a mixture of Gaussians)1,2, on skin-color detection combined with PCA3, or on neural networks4. Skin color is a distinctive feature of the human face, making color information useful for localizing the face in static and video images. Color information allows for fast processing, which is important for a tracking system that needs to run at reasonable frame rates.
The methods for facial feature detection and tracking can be roughly divided into two groups: modeling feature appearance with some pattern recognition method, and the use of empirical rules derived from observations of the exhibited feature appearance properties.
2. FACE SEGMENTATION
We have used a Gaussian skin-color model for face segmentation. Skin color can be represented in a chromatic color space. Chromatic colors (r, g), also known as "pure" colors, are defined by a normalization process:

$$r = \frac{R}{R+G+B}, \qquad g = \frac{G}{R+G+B}.$$

Blue is redundant after the normalization because r + g + b = 1. Chromatic colors have been effectively used to segment color images in many applications. The color distribution of skin colors of different people was found to be clustered in a small area of the chromatic color space. Although skin colors of different people appear to vary over a wide range, they differ much less in color than in brightness. In other words, skin colors of different people are very close, but they differ mainly in intensity. Kjeldsen5 showed that chromatic skin color has a normalized Gaussian distribution. Therefore, a face color distribution can be represented by a Gaussian model $N(m, \Sigma^2)$, where $m = (\bar{r}, \bar{g})$ with

$$\bar{r} = \frac{1}{N}\sum_{i=1}^{N} r_i, \qquad \bar{g} = \frac{1}{N}\sum_{i=1}^{N} g_i$$

and

$$\Sigma = \begin{bmatrix} \sigma_{rr} & \sigma_{rg} \\ \sigma_{gr} & \sigma_{gg} \end{bmatrix},$$

where m is the mean and Σ is the covariance matrix of the Gaussian model.
A set of hand-segmented images is needed to train the skin-color model. With this fitted Gaussian skin-color model, we can obtain the likelihood of skin color for any pixel of the image. If a pixel, after being transformed from RGB color space to chromatic color space, has a chromatic pair value (r, g), the likelihood of skin color for this pixel can be computed as

$$P(x) = \exp\left(-\tfrac{1}{2}(x - m)^{T}\,\Sigma^{-1}(x - m)\right),$$

where x = (r, g). Hence, this skin-color model can transform a color image into a grey-scale image such that the grey value at each pixel shows the likelihood of the pixel belonging to skin (see Fig. 1). With appropriate thresholding, the grey-scale image can then be further transformed into a binary image consisting of skin regions and non-skin regions. Connected-component analysis is performed, and the region with maximal area is taken as the region corresponding to the face. Further processing (feature detection) is performed only within this region.

Figure 1. Original image (left) and probability image (right)
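A minimal sketch of this segmentation step is given below, assuming a trained mean m and covariance Σ in rg space; the function names, the threshold value and the use of SciPy for connected components are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy import ndimage

def skin_likelihood(image_rgb, m, cov):
    """Per-pixel skin likelihood exp(-0.5 (x-m)^T Sigma^-1 (x-m)) in normalized
    rg space; m (length-2) and cov (2x2) are the trained model parameters."""
    rgb = image_rgb.astype(np.float64) + 1e-6          # avoid division by zero
    s = rgb.sum(axis=2)
    x = np.dstack([rgb[..., 0] / s, rgb[..., 1] / s])  # chromatic (r, g) image
    d = x - np.asarray(m)
    inv = np.linalg.inv(cov)
    maha = (d[..., 0] ** 2 * inv[0, 0]
            + 2 * d[..., 0] * d[..., 1] * inv[0, 1]
            + d[..., 1] ** 2 * inv[1, 1])
    return np.exp(-0.5 * maha)

def face_region(likelihood, threshold=0.5):
    """Threshold the likelihood image and keep the largest connected component."""
    labels, n = ndimage.label(likelihood > threshold)
    if n == 0:
        return None
    counts = np.bincount(labels.ravel())[1:]           # component sizes
    return labels == (np.argmax(counts) + 1)
```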
3. FEATURE TRACKING
3.1 Conversion to grayscale color model
We are developing a facial feature recognition system that can work with low-resolution web cameras. An input image is extracted from low-resolution video with colors encoded in RGB. However, the RGB model does not directly provide pixel intensity. To process the image as needed, we must convert it to a more appropriate color model. To enable access to pixel intensities we opted for the fastest conversion available, the standard RGB-to-grayscale image conversion.
3.2 Image preprocessing
An advantage is that the preprocessing is not applied to the whole image area: the search for facial features is performed only within the rectangle obtained in the previous step. Below we describe the segmentation process used to extract feature information. The image is preprocessed so that features can be extracted more easily. The preprocessing consists of Prewitt and median filtering followed by thresholding:
Figure 2. Original image and the same one after application of horizontal Prewitt edge detector.
1. Horizontal Prewitt filter (see Fig. 2) - the disadvantage of the Prewitt filter (and in fact of every edge detector) is that it amplifies noise in the image; this is why we apply a median filter in Step 2 of the recognition process.
2. Median filter - a nonlinear filter that reduces noise in the image. Noise would otherwise create many bogus edges in Step 1; false edges are edges that do not exist in the original image but appear in the edge image because of the amplified noise. The median filter removes them.
3. Threshold filter - this filter segments the image into areas that are further processed. Without the median filter, the noise would leave the thresholded image heavily cluttered.
A sketch of this pipeline is given after the figure caption below.
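The following sketch illustrates one possible implementation of this three-step preprocessing with NumPy/SciPy; the exact kernel orientation, median window and threshold value are assumptions for illustration only.

```python
import numpy as np
from scipy import ndimage

# A horizontal Prewitt kernel (responds to horizontal edge structures such as
# eyebrows and nostril shadows); the orientation convention is assumed here.
PREWITT_H = np.array([[-1, -1, -1],
                      [ 0,  0,  0],
                      [ 1,  1,  1]], dtype=np.float64)

def preprocess(gray, median_size=3, threshold=40):
    """Prewitt edge detection, median filtering to suppress bogus edges,
    then thresholding into the binary 'edge picture'."""
    edges = np.abs(ndimage.convolve(gray.astype(np.float64), PREWITT_H))
    smoothed = ndimage.median_filter(edges, size=median_size)
    return smoothed > threshold
```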
Figure 3. Image after preprocessing (left) and image with recognized features (right).
The image obtained after applying these filters will be called the edge picture. The segmented areas approximate the real edges in the picture. In Fig. 3 we can see an
example of the edge picture in which the eyebrows and nostrils are well segmented. After this step the eyebrows and nostrils are segmented, so the position of the whole face can be found. This can be combined with detection of the face position using image subtraction.
3.3 Features detection
1. Recognition of eyebrows. Recognizing the eyebrows is a simple process, since they are well segmented by the preprocessing step. Once the rectangle containing the face has been detected, we first find the rectangular envelopes of the segmented parts of the image. We then look for objects that have the approximate properties of eyebrows, i.e. they are approximately horizontal and their two rectangles lie approximately on the same line. "Approximately" means here that the rectangle coordinates differ only by a small constant.
2. Recognition of nostrils. Once the position of the eyebrows has been found, the position of the nostrils is searched for. The nostrils lie in the area shown in gray (Fig. 3, right). The filled gray rectangle in Fig. 3 is scanned for non-zero values from top to bottom, and the first two segmented objects are checked for their mutual position and their position relative to the eyebrows.
Figure 4. Feature recognition process on segmented face using skin-color model approach (right).
3. Recognition of eyes and pupils. The pupils are recognized in the original image using information about the eyebrows: the pupil is located relative to the eyebrow position. A first approximation of the pupil position is based on anthropometric measurements, i.e. the position is below the eyebrow, approximately in the middle. We then search around this point for the dark area of the pupil, which gives the pupil position. Facial feature tracking is based on detecting the facial features in the individual images of the video sequence.
4. EXPERIMENTAL RESULTS
The tracking of the upper facial features was tested on a standard PC with a 1 GHz Athlon processor, using a standard USB web camera. The achieved frame rate was 15 Hz at a resolution of 352×288 pixels. The tracker works well under constant lighting conditions during tracking.
5. CONCLUSION & FUTURE WORK
We are developing a modular system for image processing, of which the face and feature tracker will be a part. We are going to incorporate an adaptive skin-color model for face segmentation, which will cope with changing lighting conditions. We will also use an adaptive edge-detection technique for better detection of facial features. Lost features are recovered automatically, because we do not use correlation between successive frames.
ACKNOWLEDGEMENT
This research was partially supported by the VEGA grant Virtual Environments for WWW, No. 1/0174/03.
REFERENCES
1. G. Yang, A. Waibel, A Real-time Face Tracker. Workshop on Applications of Computer Vision, pp. 142-147 (1996).
2. M. Yang, N. Ahuja, Gaussian Mixture Model for Human Skin Color and Its Applications in Image and Video Databases, Proc. SPIE Vol. 3656, pp. 458-466 (1999).
3. N. Oliver, A. Pentland, F. Berard, LAFTER: Lips and Face real-time Tracker with Facial Expression Recognition. Proc. of Computer Vision and Pattern Recognition Conference (1997).
4. R. Feraud, O.J. Bernier, J.E. Viallet, M. Collobert, A Fast and Accurate Face Detector Based on Neural Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, v. 23, n. 1, pp. 42-53 (2001).
5. R. Kjeldsen and J. Kender, Finding Skin in Color Images. In 2nd Int. Conf. on Automatic Face and Gesture Recognition (1996).
6. A. Yilmaz, M. A. Shah, "Automatic Feature Detection and Pose Recovery for Faces", ACCV2002: The 5th Asian Conference on Computer Vision, 23rd-25th January 2002, Melbourne, Australia.
7. S. Spors, R. Rabenstein, "A Real-Time Face Tracker For Color Video", IEEE Int. Conf. on Acoustics, Speech & Signal Processing (ICASSP), Utah, USA, May 2001.
8. A. Colmeranez, B. Frey, Th. S. Huang, "Detection and Tracking of Faces and Facial Features", ICIP 1999, pp. 657-661, 1999.
A NEW HYBRID DIFFERENTIAL FILTER FOR MOTION DETECTION
Julien Richefeu, Antoine Manzanera
Ecole Nationale Supérieure de Techniques Avancées, Unité d'Electronique et d'Informatique, 32, Boulevard Victor, 75739 Paris CEDEX, France
Abstract
A new operator to compute time differentiation in an image sequence is presented. It is founded on hybrid filters combining morphological and linear recursive operations. It estimates recursively the amplitude of time-variation within a certain interval. It combines the change detection capability of the temporal morphological gradient, and the (exponential) smoothing effect of the linear recursive average. It is particularly suited to small and low amplitude motion. We show how to use this filter within an adaptive motion detection algorithm.
Keywords:
hybrid filter, temporal morphology, motion detection.
1. INTRODUCTION AND PRELIMINARIES
In some applications such as video databases, video compression, security monitoring or medical imaging, motion information is usually a more significant cue than color or texture. More specifically, the need to detect interesting moving objects in the scene is a fundamental low-level step in many vision systems. In this paper, we will concentrate on the case of video surveillance systems using a stationary camera. The challenge of motion detection lies in the ability to perform an accurate segmentation of the moving objects independently of their size, velocity and contrast with respect to the background of the scene. A large set of motion detection algorithms has already been proposed in the literature. They can be grouped into four main categories according to the type of inter-frame computations, i.e. to the way the time differentiation is performed. The first one is based on temporal gradient: a motion likelihood index is measured by the instantaneous change in the image intensity computed by differentiation of consecutive frames (Bouthemy and Lalande, 1993). These methods are naturally adaptive to changing environments, but are also dependent on the velocity and size of moving objects. This drawback can be
minimized using a multiple bank of spatiotemporal filters, which is done at the price of an increased complexity. The second category comprises the background subtraction techniques (Toyoma et al., 1999; Cheung and Kamath, 2003; Piccardi, 2004), which use a reference image (background) representing the stationary elements in the scene. Here the motion likelihood measure is the difference between the current frame and the background. These methods are less dependent on the velocity and size of the objects. Nevertheless, adaptation to a dynamic environment is a much more difficult task, which can penalize the detection of small amplitude motion (very slow or low contrast objects). The third type of approach is based on the computation of the local apparent velocity (optical flow) (Beauchemin and Barron, 1995), which is used as input to a spatial segmentation (Ranchin and Dibos, 2003). This method provides valuable information but it is in general more computationally complex and it is also sensitive to the reliability of the optical flow. Thus a trade-off has to be found between the smoothness of the flow field and the accuracy of segmentation. More recently, morphological filters have been employed (Ferrandiere et al., 1997; Salembier et al., 1998; Agnus et al., 2000) for video sequence analysis. By using spatiotemporal structuring elements, a local amplitude of variation can be computed as a motion likelihood index. Such a measure can be useful to detect small amplitude motion, but as it is sensitive to outliers, it is usually integrated over regions using connected operators. We propose in this paper a new differential operator based on a hybrid filter, combining morphological and linear operations. It computes a pixel-wise amplitude of time-variation over a recursively defined "temporal window". This method is designed to address the problem of small objects and slow motion while providing a certain noise immunity thanks to its linear part. We first present the forgetting morphological temporal gradient in Section 2. Then we show how to use the output of this filter within a motion detection algorithm in Section 3. Results are presented and discussed in the same section.
2. THE FORGETTING MORPHOLOGICAL TEMPORAL GRADIENT
In this section, we introduce the forgetting morphological temporal gradient and show its interest for motion detection systems. Considering an image sequence It(x), where t is a time index and x a (bidimensional) space index, morphological temporal filters are defined using a temporal structuring element τ = [t1, t2]. The temporal erosion (resp. dilation) is defined by $\varepsilon_\tau(I_t)(x) = \min_{z\in\tau}\{I_{t+z}(x)\}$ (resp. $\delta_\tau(I_t)(x) = \max_{z\in\tau}\{I_{t+z}(x)\}$).
The temporal (morphological) gradient γ is then defined by γτ(It) = δτ(It) − ετ(It). τ represents the temporal interval of interaction, which can be causal (e.g. [−3, 0]), anti-causal (e.g. [0, +5]) or both (e.g. [−1, +1]). Thus the temporal gradient corresponds to the amplitude of variation within this interval. The use of this operator suffers from two major drawbacks: (1) it implies the use of a buffer with size corresponding to the diameter of the structuring element, which can be very memory consuming; (2) it is very sensitive to sudden large variations (like impulse noise or slight oscillations of the sensor). To cope with these two problems, we use hybrid filters, which can be viewed as a recursive estimation of the values of the temporal erosion and dilation. Using parameter α, which is a real number between 0 and 1, the forgetting temporal dilation Mt (resp. erosion mt) is defined as shown in Figure 1. As in the classical running average (or exponential smoothing) defined by At(x) = αIt(x) + (1 − α)At−1(x), the inverse of α has the dimension of time. The semantics of Mt(x) (resp. mt(x)) is then the (estimated) maximal (resp. minimal) value observed at pixel x within the 1/α last frames. So as α tends to unity, Mt (resp. mt) tends to It, and as α tends to zero, Mt (resp. mt) tends to the maximal (resp. minimal) value observed during the whole sequence. The use of the term "forgetting" is justified by the fact that these operators attach more importance to the near past than to the far past.

Initialization, for each pixel x:      M0(x) = m0(x) = I0(x)
For each frame t, for each pixel x:    Mt(x) = αIt(x) + (1 − α) max{It(x), Mt−1(x)}
                                       mt(x) = αIt(x) + (1 − α) min{It(x), mt−1(x)}
For each frame t, for each pixel x:    Γt(x) = Mt(x) − mt(x)
Figure 1. The forgetting morphological temporal operators: Mt is the forgetting dilation, mt the forgetting erosion and Γt the forgetting morphological gradient.
Γt , the forgetting morphological temporal gradient, is used in the following as a differentiation filter because of its interesting properties: (1) it has the dimension of an amplitude of time-variation, so it is able to integrate motion over a long period depending on 1/α, and then to detect small or slow moving objects; (2) it is less sensitive to impulse noise because of its forgetting term, corresponding to the exponentially decreasing weights attached to the past values; (3) it only requires the use of two buffers to compute the forgetting erosion and dilation. Figure 2 displays the forgetting morphological operations compared with their morphological counterparts.
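The update rules of Figure 1 translate directly into a few lines of array code; the sketch below is an illustrative NumPy transcription (the class and variable names are ours, not the authors').

```python
import numpy as np

class ForgettingGradient:
    """Recursive forgetting dilation/erosion and their difference Gamma_t,
    following the update rules of Figure 1; alpha ~ 1 / (temporal window)."""
    def __init__(self, first_frame, alpha):
        self.alpha = alpha
        self.M = first_frame.astype(np.float64)   # forgetting dilation M_t
        self.m = first_frame.astype(np.float64)   # forgetting erosion m_t

    def update(self, frame):
        I = frame.astype(np.float64)
        a = self.alpha
        self.M = a * I + (1 - a) * np.maximum(I, self.M)
        self.m = a * I + (1 - a) * np.minimum(I, self.m)
        return self.M - self.m                    # forgetting gradient Gamma_t
```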
[Figure 2 panels - top row: ετ(It), δτ(It), γτ(It); bottom row: mt, Mt, Γt]
Figure 2. Application of the forgetting morphological operators (bottom), compared with the classical morphological temporal operators (top), computed on the frame t = 19 of the classical “Tennis” sequence (Berkeley). For comparison purposes the structuring element is τ = [−8, 0], and the forgetting term is α = 1/9. The gradients are displayed in reverse video mode. Note that we use the symmetrical gradient in order to treat dark and light objects the same way.
The forgetting morphological gradient thus represents a relevant motion likelihood measure. We show on the next section how it can be used within a complete moving objects detection algorithm.
3. MOTION DETECTION ALGORITHM
The filter presented above makes possible a good level of detection for motion whose amplitude is below the spatiotemporal discretization. Nevertheless, the forgetting term α needs to be adapted to the velocity of the moving object. As there are several objects with different sizes and velocities within the scene, it is necessary to adjust the value of α locally to the observations. In addition to this, we need a decision rule to discriminate the moving objects from the background. Because the scene is constantly evolving - typically under illumination or weather condition changes - the decision criterion has to be temporally adaptive. Furthermore, the temporal variation, possibly due to moving objects, but also to noise or irrelevant motion, is not uniformly distributed through the scene. So the decision must be locally differentiated. In our algorithm, we compute, from the forgetting morphological temporal filter, a local estimation of the spatiotemporal activity. This estimation is used both for deciding the pixel-level motion label and for adjusting the value of the forgetting term α. Following (Manzanera and Richefeu, 2004), we use the Σ-Δ filter to compute a second-order statistic on the sequence: the Σ-Δ filter St of a time series
Xt is a recursive approximation of the median defined by: St = St−1 + 1 if St < Xt and St = St−1 − 1 if St > Xt. Now, what we compute exactly is Vt, the Σ-Δ filter of N times the nonzero values of the forgetting morphological temporal filter. Then, the local estimation of the spatiotemporal activity is defined by Θt = Gσ(Vt), where Gσ is the bidimensional Gaussian filter with standard deviation σ. Θt is used as the first step of decision for the moving label of each pixel: Dt = 1 if Γt > Θt and Dt = 0 elsewhere. Θt is also used to update the forgetting term for every pixel. Inspired by (Pic et al., 2004), who used a locally adaptive learning rate for recursive background estimation, we compute the local α for pixel x by a similar formula:

$$\alpha_t(x) = \frac{\bar{\Theta}_t(x)^{2}}{k^{2}},$$

where Θ̄t is the complementary version of Θt (e.g. Θ̄t = 255 − Θt for images coded on 255 grey levels) and k is a constant used to set the range of values of α. α is then locally adjusted in such a way that the forgetting filters use long-term memory in areas of low spatiotemporal activity, and short-term memory in areas of high spatiotemporal activity. Thus, the detection of slow, small, or low-contrast moving objects will be enhanced, while large moving objects with high contrast will be better segmented.
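A rough sketch of one detection step is given below; the treatment of the non-zero gradient values, the clipping of α and the reconstructed α formula are our assumptions, and the default parameter values simply mirror those quoted for Figure 3.

```python
import numpy as np
from scipy import ndimage

def sigma_delta_step(S, X):
    """One Sigma-Delta update: move S by one grey level towards X."""
    return S + np.sign(X - S)

def detection_step(gamma, V, N=2, sigma=2.5, k=140.0):
    """Per-frame decision and local forgetting-term update.
    gamma: forgetting morphological gradient; V: Sigma-Delta activity state."""
    nonzero = gamma > 0
    V = np.where(nonzero, sigma_delta_step(V, N * gamma), V)       # Sigma-Delta of N*Gamma
    theta = ndimage.gaussian_filter(V.astype(np.float64), sigma)   # activity map Theta_t
    D = gamma > theta                                              # pixel-level motion label
    alpha = (255.0 - theta) ** 2 / k ** 2                          # local forgetting term
    return D, np.clip(alpha, 0.0, 1.0), V
```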
Figure 3. Results for the Hamburg Taxi sequence (frame n. 20). The parameters used are N = 2 (number of deviations) for St , σ = 2.5 for the standard deviation of the Gaussian filter used to compute Θt , k = 140 as the constant used in the computation of αt .
Figure 3 shows the motion detection algorithm steps applied on the Hamburg Taxi sequence. The last image in the figure represents Dt , the detection result after removal of the smallest regions using an alternated filter by reconstruction with a ball of radius 1 (Vincent, 1993). It can be seen that the small
amplitude motion, like the pedestrian at the top left, as well as low-contrast moving objects, like the dark car at the bottom left, are well detected, while the highly contrasted taxi at the center is better segmented. This is an effect of the adaptable memory of the forgetting filters, which fits their detection ability to the amount of motion.
4. CONCLUSION
We have presented a new hybrid differential filter and shown how it can be used in a motion detection algorithm with local adaptation of the forgetting terms. Like recursive filters, its computation time and memory consumption do not depend on the size of the temporal window. At present, we are investigating more sophisticated spatial interaction, in order to improve the spatiotemporal adaptivity and to quantify precisely the validity range of the algorithm.
REFERENCES
Agnus, V., Ronse, C., and Heitz, F. (2000). Spatio-temporal segmentation using morphological tools. In 15th ICPR, pages 885–888, Barcelona, Spain.
Beauchemin, S.S. and Barron, J.L. (1995). The computation of optical flow. In ACM Computing Surveys, volume 27(3), pages 434–467.
Bouthemy, P. and Lalande, P. (1993). Recovery of moving objects masks in an image sequence using local spatiotemporal contextual information. Optical Engineering, 32(6):1205–1212.
Cheung, S-C.S. and Kamath, C. (2003). Robust techniques for background subtraction in urban traffic video. In IEEE Int. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Nice, France.
Ferrandiere, E. Decenciere, Marshall, S., and Serra, J. (1997). Application of the morphological geodesic reconstruction to image sequence analysis. IEE Proceedings - Vision, Image and Signal Processing, 144(6):339–344.
Manzanera, A. and Richefeu, J. (2004). A robust and computationally efficient motion detection algorithm based on Σ-Δ background estimation. In ICVGIP, Kolkata, India.
Pic, M., Berthouze, L., and Kurita, T. (2004). Active background estimation: Computing a pixelwise learning rate from local confidence and global correlation values. IEICE Trans. Inf. & Syst., E87-D(1):1–7.
Piccardi, M. (2004). Background subtraction techniques: a review. In Proc. IEEE Conference on Computer. http://www-staff.it.uts.edu.au/∼massimo.
Ranchin, F. and Dibos, F. (2003). Moving objects segmentation using optical flow estimation. Technical report, UPD Ceremade. http://www.ceremade.dauphine.fr/CMD/preprints03/0343.pdf.
Salembier, P., Oliveras, A., and Garrido, L. (1998). Anti-extensive connected operators for image and sequence processing. IEEE Trans. on Image Processing, 7(4):555–580.
Toyoma, K., Krumm, J., Brumitt, B., and Meyers, B. (1999). Wallflower: principles and practice of background maintenance. In ICCV, pages 255–261, Kerkyra, Greece.
Vincent, L. (1993). Morphological grayscale reconstruction in image analysis: applications and efficient algorithms. IEEE trans. on Image Analysis, 2(2):176–201.
PEDESTRIAN DETECTION USING DERIVED THIRD-ORDER SYMMETRY OF LEGS
A novel method of motion-based information extraction from video image-sequences
László Havasi1, Zoltán Szlávik2 and Tamás Szirányi2
1 Péter Pázmány Catholic University, Piarista köz 1., H-1052 Budapest, Hungary; 2 Analogic and Neural Computing Laboratory, Hungarian Academy of Sciences, PO Box 63, H-1518 Budapest, Hungary
Abstract:
The paper focuses on motion-based information extraction from video imagesequences. A novel method is introduced which can reliably detect walking human figures contained in such images. The method works with spatiotemporal input information to detect and classify the patterns typical of human movement. Our algorithm consists of easy-to-optimise operations, which in practical applications is an important factor. The paper presents a new information-extraction and temporal-tracking method based on a simplified version of the symmetry which is characteristic for the legs of a walking person. These spatio-temporal traces are labelled by kernel Fisher discriminant analysis. With this use of temporal tracking and non-linear classification we have achieved pedestrian detection from real-life images with a correct classification rate of 96.5%.
Key words:
simplified symmetry, pedestrian detection, tracking, surveillance, kernel Fisher discriminant analysis
1. INTRODUCTION
In outdoor multi-camera systems such as city-wide distributed monitoring systems in public places, the image-resolution of the surveyed objects is usually comparatively low, while the image-noise originating from lighting conditions and background content is relatively high. In the recognition and tracking of humans by such systems, the first step is target detection. Detection of humans in video sequences has been attempted by
several different methods, depending on the requirements of the particular application. Model-based methods use matching with a priori shapes1, similarly to the case of optimisation for active contours. Active contour methods can in principle handle the detection problem, but the initialisation of weights is sensitive to image deformations and contour-splitting. Optical tracking of image parts and their high-level interpretation can lead to acceptable results2, but the method works satisfactorily only in cases where the image contains detailed textures. Song et al. have presented an unsupervised-learning method for derivation of a probabilistic model of human motion from unlabelled cluttered data3. They reported a very promising 4 percent error rate for pedestrian detection, albeit under favourable conditions (in a moderately restricted environment with persons viewed from the side). The periodic character of walking was exploited by Abdelkader, Cutler and Davis4; their method can detect humans based on an image sequence covering 5-8 step-periods. The practicability of the use of symmetries for human identification is discussed by Hayfron, Nixon and Carter5. In outdoor environments with practical image-resolution and varied lighting conditions, there is only one well-defined criterion for the recognition of a pedestrian: the walking person must use his two legs. The aim of our paper is to introduce an approach for pedestrian detection in real scenes which can produce a reasonably good false-positive detection rate. We outline a novel feature-extraction and tracking method that can reflect the inherent structural changes of target shape, thus enhancing the method's practical utility.
2. FEATURE EXTRACTION AND CLASSIFICATION USING SYMMETRY
Symmetry is a basic geometric attribute, and most objects have a characteristic symmetry-map. These unique and invariant properties lead to the applicability of symmetries in our approach to image-processing. Our method6 employs a modified shock-based method7: it calculates symmetries by propagating parallel waves from the ridge. The general shock-based approach (also called grey-level skeleton) has the limitation that it is sensitive to image noise, and particularly to the presence of discontinuous edges. Our symmetry-detection method is based on the use of morphological operators to simulate spreading waves from the edges. Each iteration involves one step of spreading; the symmetry points are marked at the collision-points of the waves. The radius of the extracted symmetry axis corresponds to the number of iterations, to the distance between the collision
point and the edge point from both parts. In our approach, we simplify the algorithm by using only horizontal morphological operators; since, in the practical cases we are considering, we essentially need to extract only vertical symmetries. This modification has the advantage that it assists in reducing the sensitivity to fragmentation. Sample outputs of the algorithm can be seen in Figure 1. The symmetry operator normally uses the edge map of the original image as its input; we used the Canny edge-detector algorithm to derive the locations of the edges (ridges). To test the robustness of the algorithm in processing real public-place scenes, in our trials we did not employ any background subtraction or change-detection method. The algorithm described is insensitive to minor edge fragmentations, and a “perfect” definition of the target outline is unnecessary.
Figure 1. An idealised outline of a walking person, together with the derived Level 1, Level 2, and Level 3 symmetry maps.
As illustrated in Figure 1, the symmetry concept can be extended by iterative operations. The symmetry of the Level 1 symmetry map is the Level 2 symmetry; and the symmetry of the Level 2 map is the Level 3 symmetry (L3S). The advantage of this approach is that it does not confound the local and global symmetries in the image, so these levels are truly characteristic for the shape-structure of the pair of legs. In the further processing steps we use only L3Ss. It is obvious however that image noise and edge fragmentation will damage the symmetries, especially the L3S. To minimise such errors an appropriate pre-processing method is to filter the captured image-frame using median filters, applied more than once, to remove small errors. This step has proved effective in processing compressed video images because it retains the essential structure of objects while it removes the irrelevant marks. Our symmetry-extraction method is less sensitive to edge fragmentation than is the original “skeleton” method; but nevertheless the L3Ss contain an accumulation of fragments from the preceding symmetry levels. To reduce this error we use vertical limiting operators at each level of processing. In addition, it is an important factor when the objects are small and near to one another on the image. The
vertically-oriented kernels help to avoid possible confusion with nearby neighbouring symmetries.
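As an illustration, the sketch below computes row-wise symmetry points as midpoints between consecutive edge pixels (equivalent to two horizontally spreading waves colliding) and iterates the operator to Level 3. The wave simulation by morphological operators and the vertical limiting kernels of the original method are simplified away here, so this is only an approximation of the described algorithm; the radius limit is an illustrative assumption.

```python
import numpy as np

def horizontal_symmetry(binary_map, max_radius=40):
    """One symmetry level: in every row, the midpoint between two consecutive
    'on' pixels is marked as a symmetry point; the radius is half the gap."""
    sym = np.zeros(binary_map.shape, dtype=bool)
    radii = np.zeros(binary_map.shape, dtype=np.int32)
    for y in range(binary_map.shape[0]):
        xs = np.flatnonzero(binary_map[y])
        for x0, x1 in zip(xs[:-1], xs[1:]):
            r = (x1 - x0) // 2
            if 0 < r <= max_radius:
                sym[y, x0 + r] = True
                radii[y, x0 + r] = r
    return sym, radii

def level3_symmetry(edge_map):
    """Level 3 symmetry (L3S): apply the operator to its own output twice."""
    s1, _ = horizontal_symmetry(edge_map)
    s2, _ = horizontal_symmetry(s1)
    return horizontal_symmetry(s2)   # (symmetry map, radii) of Level 3
```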
2.1 Temporal Tracking of Symmetries
The extracted L3Ss for a human target are in practice useful primarily with respect to analysis of images of the legs. The arms do not usually generate significant symmetries, among other reasons because of distortions arising from the perspective view, and because of their small size in proportion to the whole body. Thus the resulting symmetry-image from the arms is typically composed of small fragments which are difficult to distinguish from the noise. However, even the existence of clear symmetries in a single static image does not necessarily provide usable information about the image content; for this, we need to track the changes of the symmetry fragments by temporal comparisons. By using the radii of the symmetries an appropriate mask can be defined, with the aid of which the solution of this task becomes relatively easy. The symmetry fragments and their radii define an outline that can be used as a mask between frames to aid classification of the coherent fragments in successive frames, as illustrated in Figure 2.
Figure 2. Masks of the reconstructed symmetries from successive frames, superimposed on an original image; and the limits (marked by X-symbols) used to define symmetries for the classification task.
The tracking algorithm calculates the overlapping areas between symmetry masks; and as time progresses it constructs the largest overlapping one. The advantage of this simple algorithm is that it is tracking the complete leg movement and the associated structural changes, instead of just tracking selected feature points on the image by means of some optical correlation method. This inherent feature of the method increases the stability and the robustness of the results in cases where the edges of the target are partially "damaged" in some frames. The results of temporal tracking can be seen in Figure 3, where we demonstrate the resulting symmetry-traces in some real-life situations. With typical pedestrian
movement speeds, the tracking algorithm can work correctly when the frame rate is 10 frames/sec or more.
Figure 3. Sample symmetry-patterns of real-life pedestrian walking-tracks.
2.2 Classification of the Traces
Level 3 symmetries can also appear in other parts of the image, not only between the legs; and the tracking method also collects all of these related symmetries. We can superimpose these traces onto the original images (with information loss where the image-data overlaps), as can be seen in Figure 3. However, although this format is easy to visualise, the detail of these projections is unnecessarily high, which increases the computation time. The other reason why we use an alternative format for the traces is the importance of the radii, which are not very clearly defined in this representation. We therefore reduce the “input space” of the traces by using the following data representation, see Figure 2. One L3S of a frame can be represented by 6 parameters: an upper and a lower position each represented by (x, y) coordinates, and two widths. The coordinates define the positions of the upper and lower ends of the axis (X-symbols in Figure 2), while the widths define the horizontal distance between the two regression lines at the upper and lower points. In fact, the widths correspond to the radii at the upper and lower points of the symmetry axis. This trapezoidal representation can adequately describe both the orientation and the structure properties of the L3S. We collect L3Ss from 8 sequential frames, and thereby have in all 48 parameters for each trace. In the last step of this stage of the process these parameters, including the widths, are normalised in both x and y dimensions; by means of this normalisation the classification becomes invariant with respect to the object size in the image. In our classification method we used kernel-based (non-linear) Fisher discriminant analysis8. The original linear FDA method classifies two sample sets by maximising the between-class scatter while minimising the intra-class scatter of the features. The aim is to find linear projections that optimise the distinctiveness of the classes. When the problem is not linearly
separable, however, the solution given by FDA may not be satisfactory. In our case, using the linear method we find that we can achieve a 10% error rate, but with a false-positive rate of 8%, which is rather high. In the nonlinear extension of the method, we tested several kernel functions, and concluded that only the Gaussian radial and the inverse multiquadratic kernels produced acceptable classification rates. Practical test results are summarised below.
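The 48-dimensional trace descriptor can be assembled as sketched below; the tuple ordering of the six per-frame parameters and the min-max normalisation are our assumptions, since the exact scheme is not fully specified in the text.

```python
import numpy as np

def trace_feature_vector(trace):
    """Build the 48-dimensional trace descriptor: for each of 8 consecutive
    frames, the upper/lower end points of the L3S axis and the two widths,
    normalised in x and y so that classification is size-invariant.
    `trace` is a list of 8 tuples (x_up, y_up, x_low, y_low, w_up, w_low)."""
    t = np.asarray(trace, dtype=np.float64)          # shape (8, 6)
    assert t.shape == (8, 6)
    xs = np.concatenate([t[:, 0], t[:, 2]])
    ys = np.concatenate([t[:, 1], t[:, 3]])
    sx = xs.ptp() or 1.0                             # horizontal extent of the trace
    sy = ys.ptp() or 1.0                             # vertical extent of the trace
    t[:, [0, 2]] = (t[:, [0, 2]] - xs.min()) / sx
    t[:, [1, 3]] = (t[:, [1, 3]] - ys.min()) / sy
    t[:, [4, 5]] = t[:, [4, 5]] / sx                 # widths scale with x
    return t.ravel()                                 # 48-dimensional feature vector
```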
3. EXPERIMENTAL RESULTS
To evaluate the proposed method, we derived "walking" and "non-walking" traces from a considerable number of real-life outdoor video sequences representing a variety of different walk directions, viewing distances and surrounding situations. There were in all 1000 samples, and according to our manual classification these comprised 300 "walking" and 700 "non-walking" samples. In the experiments our main goal was to reliably detect human movements, but at the same time with a false-positive detection rate as small as possible. Before considering the numerical results, we summarise some practical limitations of the symmetry-tracking method which we noted. The L3Ss can be evolved only if the leg-opening is visible. In our tests we found that this meant that the direction of movement had to be at more than about 70° from the viewing axis; but this is not a serious limitation when more than one camera is monitoring the area9. Crowds, and some other specific "overlap" situations are the main cases which cause problems, although "overlap" does not always prevent successful tracking. The most common problematic cases were as follows: subject wearing a long coat; subject carrying a large bag etc. in the hand nearest to the camera; full masking of the legs by another person in the perspective view; partial masking by another person moving on a parallel track, with synchronised step periods. All in all, the proportion of such "problem" cases in the processed real-life video sequences was some 15%. For training the KFDA algorithm, we used 50 "walk" and 100 "non-walk" traces representing all situations of the data set. In our evaluation we found that both types of kernel function can achieve a 96% classification rate. At the same time we chose kernel parameters at points where the false-positive detection rate is zero, to keep the false detection error rate to a minimum. The good detection rates achieved confirm the power of the data representation introduced in Section 2.2. The final choice between the two kernel functions can be based on analysis of the between-class distances, and using this criterion we chose the Gaussian kernel function. In practice we
achieved an approximate 96.5% classification rate, with a 1.6% false-positive detection rate.
4. CONCLUSIONS
The method we describe can detect pedestrians in image-sequences obtained in outdoor conditions in real time. Even considering a single step-period, a very low false-detection rate is obtainable. To achieve this, we used a novel feature-extraction and tracking method that can reflect the natural structural changes of human leg-shape; the method seems promising for the purpose of providing a useful "understanding" of image content. Through experiments using a data set derived from real-life video sequences we found that the Gaussian kernel function is a good choice for the classification of traces. The low classification error-rate achieved demonstrates the power of our spatio-temporal data-representation method. The method appears suitable for the detection of human activity in images captured by video surveillance systems such as those typically used in public places.
REFERENCES
1. Mohan, A., Papageorgiou, C., and Poggio, T., 2001, Example-based object detection in images by components, IEEE Trans. PAMI, 23(4), pp. 349-361
2. Nguyen, H. T., Worring, M., and Dev, A., 2000, Detection of moving objects in video using a robust motion similarity measure, IEEE Trans. on Image Processing, 9(1)
3. Song, Y., Goncalves, L., and Perona, P., 2003, Unsupervised learning of human motion, IEEE Trans. PAMI, Vol. 25, pp. 814-828
4. Abdelkader, C., Cutler, R., and Davis, L., 2002, Motion-based recognition of people in eigen-gait space, Proc. of the 5th Int. Conf. on Automatic Face and Gesture Recognition
5. Hayfron, A. J., Nixon, M. S. and Carter, J. N., 2002, Human identification by spatio-temporal symmetry, International Conference on Pattern Recognition, pp. 632-635
6. Havasi, L., Szlávik, Z., 2004, Symmetry feature extraction and understanding, Proc. CNNA'04, Budapest, pp. 255-260
7. Sharvit, D., Chan, J., Tek, H. and Kimia, B.B., 1988, Symmetry-based indexing of image databases, J. Visual Comm. and Image Representation, vol. 9, no. 4, pp. 366-380
8. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., and Müller, K.-R., 1999, Fisher Discriminant Analysis With Kernels, Neural Networks for Signal Processing IX, pp. 41-48
9. Szlávik, Z., Havasi, L., Szirányi, T., 2004, Estimation of common groundplane based on co-motion statistics, ICIAR, Lecture Notes on Computer Science, accepted
AUTOMATIC AND ADAPTIVE FACE TRACKING USING COLOR-BASED PROBABILISTIC NETWORK
Shu-Fai Wong and Kwan-Yee Kenneth Wong
Department of Computer Science, The University of Hong Kong
{sfwong,kykwong}@cs.hku.hk

Abstract: Face tracking has potential applications in a wide range of commercial products such as face recognition systems. Commonly used face tracking algorithms can extract faces from images accurately, but they often take a long time to finish the detection process or require too much human intervention during system initialization. Recently, there has been increasing demand for robust and fast face tracking algorithms in applications such as video surveillance systems. This paper proposes a color modeling and testing scheme that increases the robustness and applicability of the tracking system. Experimental results show that the proposed system can detect and track faces from images reliably and quickly.

1. INTRODUCTION
Face tracking is a fundamental step in many vision systems such as automatic visual surveillance, virtual reality game interfaces and face recognition. The capability of locating the face in video continuously has recently become a necessary requirement in human-computer interface design. Since most commonly used appearance-based face detection algorithms (e.g. Li et al., 2002) are time consuming and complicated, researchers have started investigating the use of visual cues, instead of comparing whole image patches, to facilitate the search process. Motion, color, and the configuration of facial features have been used as visual cues (e.g. Haritaoglu et al., 2000). Recently, researchers have started investigating the possibility of using color alone as a visual cue in developing face tracking systems (e.g. Nummiaro et al., 2003; Perez et al., 2002). In most of these works, an adaptive color model is built and used to test the "skin-ness" of a given pixel. Such a detection approach is less sensitive to changes of shape, changes of illumination and cluttered backgrounds. However, most of these works do not mention the initialization of the color model; such initialization is commonly done manually and is tedious. This paper proposes an automatic and adaptive scheme for face tracking. Under the proposed scheme, a prior skin color model is built
automatically from training video. In the tracking stage, both the prior color model and an adaptive color model are used to detect the face within a probabilistic framework. Experimental results show that faces can be tracked efficiently.
2. PRIOR PROBABILISTIC MODEL
As mentioned in the previous section, the proposed system is designed to learn the color model with the least human intervention. Given several video sequences containing human motion, the system separates the static background and the dynamic foreground automatically. Foreground and background data are then processed and used to train the face color detector. In general, the face color detector contains the prior model of face color. The color model is prior because it is formed from skin colors of different people under different illumination settings, which are not specific to a certain target.
2.1 Motion segmentation and color space transformation
Motion segmentation can be done simply by interframe difference, which has been used in (Wang and Brandstein, 1998). With this approach, the pixelwise difference of consecutive frames is considered, and a binary map indicating pixels with a value difference above a threshold is generated. The rectangular bounding region of the image that covers all the white pixels in the binary map is taken as the foreground. The foreground and background patches obtained by motion segmentation are then converted into histograms for further analysis. As indicated in recent research on skin color analysis (e.g. Yang and Waibel, 1996), skin colors form a relatively compact cluster in a small region of the color space. Thus, the histogram of the foreground patch can be used to infer the skin color cluster. To increase the tolerance toward illumination changes in the environment, the original RGB values of pixels are converted into normalized color values, a measure of color that disregards brightness. The conversion is simply done by mapping to the normalized rg-space through r = R/(R + G + B) and g = G/(R + G + B). No further conversion is performed because conversion to, or combination with, other color spaces may not increase the compactness of the cluster much (Zarit et al., 1999). The histograms obtained in this step are illustrated in Figure 1(a).
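A compact sketch of these two training-stage steps is shown below; the frames are assumed to be grayscale for the difference, and the bin count and threshold are illustrative values, not the paper's.

```python
import numpy as np

def foreground_box(frame_a, frame_b, diff_threshold=25):
    """Interframe difference: bounding box of pixels whose grey-level change
    exceeds the threshold (the coarse foreground patch used for training)."""
    moved = np.abs(frame_a.astype(np.int32) - frame_b.astype(np.int32)) > diff_threshold
    ys, xs = np.nonzero(moved)
    if len(xs) == 0:
        return None
    return xs.min(), ys.min(), xs.max(), ys.max()

def rg_histogram(patch_rgb, bins=32):
    """Normalised histogram of (r, g) chromaticities over an image patch."""
    rgb = patch_rgb.reshape(-1, 3).astype(np.float64) + 1e-6
    s = rgb.sum(axis=1)
    r, g = rgb[:, 0] / s, rgb[:, 1] / s
    hist, _, _ = np.histogram2d(r, g, bins=bins, range=[[0, 1], [0, 1]])
    return hist / hist.sum()
```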
2.2 Radial basis function network
A Radial Basis Function Network (RBFN) is used to learn the prior color model given the foreground and background patches. One may object that skin color may not be the majority color in the foreground patch, due to the movement of clothing or some captured background color. However, according to our observations, skin color is statistically the majority color among
foreground patches, assuming a Gaussian distribution of the non-skin colors. Thus the final foreground histogram, which is the accumulated result of all histograms, contains skin color under various environments, while the background histogram contains color information of the background. Assuming the camera moves only within a certain environment (e.g. an office), the corresponding histogram will reflect the color distribution of that environment. These two histograms are used to train the RBFN so that the network can determine the probability of being skin color given a certain rg-value. In the RBFN, the learning data take the form {r, g, t}, where r and g are the rg-values and t is the class (+1 means the skin class while 0 means the non-skin class). The learning data are converted directly from the histograms of the foreground and background patches by assuming that the foreground patch indicates the skin class while the background patch indicates the non-skin class. In the RBFN, the rg-value is fed into a total of N radial basis functions (radial nodes j), which have mean μj and variance σj as parameters. In the proposed system, a Gaussian function is used as the basis function and thus a Mahalanobis distance is calculated at each node; the weighted sum is then obtained at the output node. In the learning process, the parameters of each radial node and the corresponding weights are adjusted such that the energy function

$$E = \frac{1}{2}\left[\left(t - \sum_{j=0}^{N} w_j \exp\!\left(-\frac{1}{2}\left(\frac{(r-\mu_{j,r})^2}{2\sigma_{j,r}} + \frac{(g-\mu_{j,g})^2}{2\sigma_{j,g}}\right)\right)\right)^{2} - \sum_{j=0}^{N} w_j\right]$$

is minimized. Once the learning process is finished, the radial nodes indeed represent the Gaussian mixture of the skin color distribution. The conditional probability of being a skin pixel can be determined by the network, given any input rg-value, as

$$P(\mathrm{skin} \mid r, g) = \sum_{j=0}^{N} w_j \exp\!\left(-\frac{1}{2}\left(\frac{(r-\mu_{j,r})^2}{2\sigma_{j,r}} + \frac{(g-\mu_{j,g})^2}{2\sigma_{j,g}}\right)\right).$$
Figure 1. (a) The histogram on the left shows the distribution of normalized rg-value of certain foreground patch while the histogram on the right shows those distribution of the corresponding background patch. (b) The graph on left shows the activation responses of RBFN for every rg-value while the graph on the right shows the skin color distribution of certain person under certain environment.
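For illustration, evaluating the trained network on a single chromatic pair might look like the sketch below; whether σ or σ² appears in the Gaussian denominators is ambiguous in the extracted formula above, so the form used here is an assumption.

```python
import numpy as np

def rbfn_skin_probability(r, g, mu, sigma, w):
    """Evaluate the trained RBF network on a chromatic (r, g) pair as a
    weighted sum of axis-aligned Gaussian basis functions.
    mu: (N, 2) means, sigma: (N, 2) per-dimension spreads, w: (N,) weights."""
    d = ((r - mu[:, 0]) ** 2 / (2.0 * sigma[:, 0]) +
         (g - mu[:, 1]) ** 2 / (2.0 * sigma[:, 1]))
    return float(np.sum(w * np.exp(-0.5 * d)))
```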
3. ADAPTIVE FACE DETECTION AND TRACKING
The prior model obtained in the previous section is too general for most applications. If the activation responses of the network are plotted (as in Figure 1(b)), the probability is quite high over a large range of input data, whereas the skin color distribution of a particular person at a given instant is quite compact, as shown in the same figure. This means that a large number of false positives may be reported if the RBFN is used alone. In order to achieve higher accuracy of skin color detection, a Bayesian framework is adopted to integrate the prior color model, the current color model and the dynamic information. The probability of being a skin pixel is formulated as

$$P(\mathrm{skin} \mid r, g, D, K) = \int\!\!\int P(\mathrm{skin} \mid r, g, M, K')\,P(K' \mid K)\,P(M \mid D)\,dM\,dK',$$

where D is the prior rg-value distribution for skin color, K is the dynamic information of the previous frame, M is the current skin color distribution and K' is the predicted dynamic information of the current frame. The probability is thus broken down into three factors: P(skin | r, g, M, K') represents the probability of being skin color given a certain rg-value and certain color and dynamic models; P(K' | K) represents the probability of a certain dynamic model given the previous dynamic information; P(M | D) represents the probability of a certain color model given the prior skin color model. Firstly, to estimate the current color model, the histogram of the detected face region is used. With the exception of the first frame, the face region is assumed to have been detected accurately by the Bayesian framework. By applying the framework to estimate the probability of being a skin pixel, the pixels with the highest probability are selected and reported in a binary map. The bounding box, the skin patch and the histogram are then extracted in the same way as in learning the prior color model. As shown in Figure 1(b), the distribution of skin color is quite compact at a given instant and can therefore be approximated by a Gaussian distribution (M ∼ N(μM, ΣM)). Thus, P(skin | r, g, M, K') can be approximated by the exponential of the negative Mahalanobis distance of a given rg-value from the distribution of the model M, while P(M | D) is approximated by the response of the radial basis function network with the mean value μM as input. For the first frame, the face is detected using the interframe difference and the prior color model: within the bounding box given by the interframe difference, skin color is detected using the prior color model, and the current color model is then built using the method described above. Secondly, to estimate the motion model, a second-order auto-regressive (AR) formula, Kt+1 = α0 Kt + α1 Kt−1, is used. P(K' | K) can then be approximated by the likelihood of the observed skin pixels given the dynamic model K', and the parameters of the AR model are updated accordingly to maximize that likelihood.
To summarize, the prior color model is first learnt from video sequences. The target is first located using the interframe difference and the prior color model. The current color model and the dynamic model are then built accordingly. In the next frame, the search window is moved according to the dynamic model. Within the search window, potential skin pixels are detected using the Bayesian framework. From this observation, both the current color model and the dynamic model are updated. The whole process repeats.
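As a sketch, the prediction step of this dynamic model reduces to a single line per state component; the coefficient values below are an illustrative constant-velocity initialisation (α0 = 2, α1 = −1), not the fitted values obtained by the likelihood maximisation described above.

```python
import numpy as np

def ar2_predict(k_t, k_t_minus_1, a0=2.0, a1=-1.0):
    """Second-order auto-regressive prediction K_{t+1} = a0*K_t + a1*K_{t-1};
    with a0=2 and a1=-1 this is a constant-velocity extrapolation."""
    return a0 * np.asarray(k_t, dtype=float) + a1 * np.asarray(k_t_minus_1, dtype=float)

# Example: window centre moved from (110, 78) to (120, 80) -> predicted (130, 82).
print(ar2_predict([120, 80], [110, 78]))
```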
4. EXPERIMENTS AND RESULTS
The proposed face detection algorithm was implemented using Visual C++ under Microsoft Windows. The experiments were done on a P4 2.26 GHz computer with 512 MB of RAM running Microsoft Windows. Two experiments were conducted to test the accuracy of the face detection scheme and the performance of the whole tracking system. In the first experiment, the system was tested by detecting faces in images using the prior color model. The qualitative result is shown in Figure 2(a), which shows that faces can be detected even under illumination variation. The testing images, of size 640×480, were captured from a webcam. As shown in the figure, pixels of distractors with color similar to skin color are usually reported as false positives; however, it is reasonable to assume that the search window is close to the target in most cases. In all 100 testing samples, the regions corresponding to faces were detected despite the presence of false positives. The processing time is 3 seconds when the detection algorithm is applied to the whole image. In the second experiment, the whole tracking system was tested. The qualitative result is shown in Figure 2(b). The result shows that the system can track faces under different illumination and backgrounds. The tracker is insensitive both to the size of the face (first row) and to distractors with skin color (second row). The system tracked the face accurately in all ten video sequences. The average processing frame rate is 15 Hz.
5. CONCLUSION
Face tracking is useful in various industrial applications. However, commonly used face tracking algorithms are either time consuming or not robust enough. This paper proposed a probabilistic framework that integrates the prior color model, the current color model and dynamic information. The system is fast because it uses color only; it is adaptive because of the use of the current color model; and it is easily applicable to other environments because of the automatic learning of the prior model. Experimental results show that the proposed algorithm can track faces efficiently and reliably.
Figure 2. (a) In this experiment, the system was tested under variation in illumination. First row shows the face detected in the green box. Second row shows the resultant binary image of the skin detection. (b) These two rows show the tracking result of the system. The face detected is bounded by the green box. The frame count is shown on the bottom of each frame.
REFERENCES
Haritaoglu, I., Harwood, D., and Davis, L. S. (2000). W4: Real-time surveillance of people and their activities. PAMI, 22(8):809–830.
Li, S. Z., Zhu, L., Zhang, Z., Blake, A., Zhang, H., and Shum, H. (2002). Statistical learning of multi-view face detection. In ECCV02, page IV: 67 ff.
Nummiaro, K., Koller-Meier, E., Svoboda, T., Roth, D., and Gool, L. Van (2003). Color-based object tracking in multi-camera environments. In DAGM 2003, pages 591–599.
Perez, P., Hue, C., Vermaak, J., and Gangnet, M. (2002). Color-based probabilistic tracking. In ECCV2002, pages 661–675.
Wang, C. and Brandstein, M. S. (1998). A hybrid real-time face tracking system. In Proc. Acoustics, Speech and Signal Processing.
Yang, J. and Waibel, A. (1996). A real-time face tracker. In WACV96, pages 142–147.
Zarit, B. D., Super, B. J., and Quek, F. K. H. (1999). Comparison of five color models in skin pixel classification. In Proc. of the International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, pages 58–63.
FAST AND ROBUST OBJECT SEGMENTATION APPROACH FOR MPEG VIDEOS
Ashraf M. A. Ahmad Department of Computer Science and Information Engineering, National Chiao-Tung University, 1001 Ta-Hsueh Rd, Hsinchu, Taiwan, [email protected]
Abstract:
In this paper we propose an efficient approach to Motion Vector (MV) based object detection in MPEG-1 video streams. In experiments performed on the MPEG testing dataset, with perceptual performance measured by the standard recall and precision metrics and efficiency measured by run time, our approach is remarkably superior to the alternative techniques. In addition to these results, we describe a user interface that we developed, in which users can adjust the configuration interactively.
Key words:
MPEG-1, Efficiency, Object Detection, Motion Vector, Motion Vector Selector
1.
INTRODUCTION AND RELATED WORK
Because digital imaging, video streams and their standards are becoming more prevalent, it is expedient to have effective algorithms and paradigms to process visual contents. To achieve the objective of identifying and selecting desired information, a reliable object detection mechanism is needed as a primary step. Although object detection has been studied for many years, it remains an open research problem. A robust, accurate and high performance approach remains a great challenge. Much work is being done in the area of motion based video object segmentation in the pixel domain1,2,3, which exploits the visual attributes and motion information. However, very little work has been carried out in the area of compressed domain video object extraction. Pixel domain motion
detection is performed based on the motion information at each pixel location, such as optical flow estimation, which is computationally very demanding. The motion information is, however, already available in the compressed domain. In many cases, especially for well-textured objects, the MV values reflect the movement of objects in the stream very well. Some approaches5,6,7 use these MV values directly. Processing digital video directly in the compressed domain reduces the processing time and improves storage efficiency, speed, and video quality. Moreover, with today's ever-growing volume of video data provided in compressed formats such as MPEG, it makes increasing sense to perform object detection in the compressed domain. Object detection directly in compressed video without full-frame decompression is clearly advantageous, since it is more efficient and can more easily reach real-time processing speeds. The need for reliable and accurate MV information is clear for approaches that employ motion information2,3,5,6,7, as well as for obtaining highly efficient detection algorithms at the macroblock level. However, MV information is sometimes difficult to use due to the lack of an effective representation and because it introduces large amounts of noise that make further processing of the data impractical. Besides, it is still far from ideal in performance, as the key motion estimation part is carried out using a coarse area-correlation method that has proven inaccurate. Some researchers21 elaborate on the noise in MVs due to camera noise and irregular object motion. It is also known that the MVs in MPEG may not represent the true motion of a macroblock, and the one-vector-per-16x16-pixel-block scheme makes detection of small objects difficult. Therefore, in this paper we introduce a technique that overcomes these defects and produces more reliable MV information and smoother object boundaries. This technique processes the raw MV fields extracted from P-frames using a Motion Vector Selector (MVS). The MVS removes MVs from non-textured macroblocks and smoothes the MVs based on how textured their macroblocks are. Thus, the resulting data are more representative of the original motion and more reliable for use by subsequent compressed-domain object detection algorithms. In this way, many situations that cause trouble for conventional approaches can be handled properly without complex operations, and the efficiency increases. Some works8,9 used only a median filter or a modified median filter in the compressed domain; not on the raw MVs, but on processed MVs, to repair the irregularities introduced by the hard limiting of some MV values or by the accumulation of MVs. In these approaches the time consumption is high, due to the computational complexity, and they partially need to return to the pixel domain. Moreover, one work9 used both P- and B-frames, and another8 applied a median filter to the magnitude only, while we apply our approach to both magnitude and direction, which results in a more accurate and reliable
outcome in terms of object detection. Another work4 used a spatial confidence measure, which is a mean filter; other authors11 showed this to be insufficient and unrealistic for real-time applications. In addition, the same work4 combined the texture and spatial measures with equal weight, which was also shown to be insufficient and unrealistic for real-time applications. Other works12,13 used the texture measure without regard to the spatial measure, which resulted in lower accuracy, while the spatial measure was used11 without regard to the texture measure, which resulted in less realistic results.
2.
OVERVIEW OF THE PROPOSED SCHEME
In our proposed approach we first take an MPEG-1 video stream with the [IBBPBBPBBPBBPBB] structure. Fig. 1 shows the proposed system architecture. Next, we extract the desired features from P-frames only, in order to reduce the computational complexity; in general, in a video at 30 fps, consecutive P-frames separated by two or three B-frames are still similar and do not vary much. After obtaining the MV field magnitude and direction values, we pass these values through the Motion Vector Selector (MVS). We then pass the filtered MVs to the object detection algorithm, which first keeps only MVs whose value exceeds a specific threshold. Then the object detection algorithm is applied to obtain a set of detected objects in each frame. The steps are described in detail in the following sections.
Figure 1. System overview.
2.1
Motion Information Extraction from MPEG1
The MPEG-1 compressed video provides one MV for each macroblock of size 16x16 pixels, which means that the MVs are quantized to one vector per 16x16 block. The MVs are not the true MVs of a particular pixel in the frame. Our object detection algorithm requires the MVs of each P-frame of the video stream. For computational efficiency, only the MVs of P-frames are used by the object detection algorithm. Without loss of generality,
we assume that the Group Of Pictures (GOP) has the standard structure [IBBPBBPBBPBBPBB].
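As an illustration of this step, the sketch below arranges the per-macroblock motion vectors of a P-frame into a regular field and picks out the P-frame positions inside a GOP. It assumes the MVs have already been parsed from the bitstream (actual MPEG-1 parsing is omitted); the container `raw_mvs` and all names are hypothetical, not part of the paper.

```python
import numpy as np

GOP_PATTERN = "IBBPBBPBBPBBPBB"   # standard GOP structure assumed by the paper

def p_frame_indices(gop_pattern=GOP_PATTERN):
    """Indices of the P-frames inside one GOP."""
    return [i for i, t in enumerate(gop_pattern) if t == "P"]

def mv_field(raw_mvs, frame_h, frame_w):
    """Arrange per-macroblock motion vectors into an (H/16, W/16, 2) array.

    raw_mvs: hypothetical dict {(row, col): (dx, dy)} already parsed from one P-frame.
    """
    rows, cols = frame_h // 16, frame_w // 16
    field = np.zeros((rows, cols, 2), dtype=np.float32)
    for (i, j), (dx, dy) in raw_mvs.items():
        field[i, j] = (dx, dy)
    return field
```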
3.
MOTION VECTOR SELECTOR
We propose a texture-based motion vector selector which operates directly in the DCT domain of MPEG-1 video. The DCT coefficients in MPEG-1 video10, which capture the directionality and periodicity of local image blocks, are used as texture measures to identify highly textured regions, as opposed to non-textured regions. Each unit block in the compressed images is assigned a measure of how textured it is, based on local horizontal, vertical, and diagonal intensity variations. This is the basic idea behind the MVS. Since our approach is texture-based, we propose to use the DCT coefficients directly from the compressed domain as texture features to refine the motion vector values. Figure 2 illustrates the MVS operation. The I-frame has no motion values; it stores the DCT information of the original frame. Although the I-frame provides no motion information, it tells us how textured the image blocks are, and we propagate that information to the P-frames. After computing the average energy from the DCT information, these average energy values are thresholded to obtain the blocks with large intensity variations. We then select the MVs with high texture values, as shown in Figure 2.
Figure 2. MVS Scheme.
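A minimal sketch of this selection step is given below. It assumes the DCT coefficient blocks propagated from the I-frame and the P-frame MV field are already available as arrays; using one representative 8x8 block per macroblock, the AC-energy texture measure, and the threshold value are illustrative simplifications, not necessarily the exact quantities used by the authors.

```python
import numpy as np

def ac_energy(dct_block):
    """Texture measure of one 8x8 DCT block: energy of the AC coefficients."""
    return float(np.sum(dct_block ** 2) - dct_block[0, 0] ** 2)

def motion_vector_selector(mv_field, dct_blocks, energy_thresh=5000.0):
    """Keep only MVs whose macroblock is textured enough; zero out the rest.

    mv_field   : (rows, cols, 2) array of motion vectors (one per macroblock)
    dct_blocks : (rows, cols, 8, 8) DCT coefficients propagated from the I-frame
                 (one representative block per macroblock, a simplification)
    """
    rows, cols, _ = mv_field.shape
    selected = np.zeros_like(mv_field)
    for i in range(rows):
        for j in range(cols):
            if ac_energy(dct_blocks[i, j]) >= energy_thresh:
                selected[i, j] = mv_field[i, j]   # high-texture block: MV is kept
    return selected
```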
In the following figures we show the results of using MVS over MVs. Fig. 3 - right shows the representation of each extracted MV over its corresponding frame without any processing, while Fig. 3 - left shows the representation of the filtered MV using the MVS for the same frame.
Figure 3. Smoothed MVs using the MVS (left); MVs without MVS (right).
4.
OBJECT DETECTION
So far we have obtained the filtered MVs. We start by eliminating undesired MVs before the detection process in order to achieve more robust performance. MVs with magnitude equal to or approaching zero are considered undesirable and hence are not taken into account. On the contrary, MVs with larger magnitude are considered more reliable and are therefore selected. An object detection algorithm is then used to detect potential objects in video shots. Initially, undesired MVs are eliminated. Subsequently, MVs with similar magnitude and direction are clustered together, and each group of associated macroblocks with similar MVs is regarded as a potential object. Details are presented in our previously published paper12.
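The sketch below shows one simple way to perform such a grouping: a flood fill over neighbouring macroblocks whose MVs agree in magnitude and direction. The thresholds are illustrative, and the exact clustering used by the authors is described in their earlier paper12; this is only a stand-in for it.

```python
import numpy as np

def detect_objects(mv_field, mag_thresh=1.0, mag_tol=2.0, ang_tol=np.pi / 8):
    """Group neighbouring macroblocks whose MVs are similar in magnitude and direction.

    Returns a label map (rows, cols) where 0 = background and k > 0 = object id.
    """
    rows, cols, _ = mv_field.shape
    mag = np.linalg.norm(mv_field, axis=2)
    ang = np.arctan2(mv_field[..., 1], mv_field[..., 0])
    labels = np.zeros((rows, cols), dtype=int)
    next_label = 0
    for i in range(rows):
        for j in range(cols):
            if mag[i, j] < mag_thresh or labels[i, j]:
                continue                       # too weak, or already assigned
            next_label += 1
            labels[i, j] = next_label
            stack = [(i, j)]
            while stack:                       # flood fill over similar neighbours
                y, x = stack.pop()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    v, u = y + dy, x + dx
                    if (0 <= v < rows and 0 <= u < cols and not labels[v, u]
                            and mag[v, u] >= mag_thresh
                            and abs(mag[v, u] - mag[y, x]) < mag_tol
                            and abs(np.angle(np.exp(1j * (ang[v, u] - ang[y, x])))) < ang_tol):
                        labels[v, u] = next_label
                        stack.append((v, u))
    return labels
```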
5.
RESULT AND DISCUSSION
We designed an experiment to verify the performance of the proposed system. In summary, the proposed system boosts detection performance, and its computational complexity is low. Both Gaussian and median filters are available as readily implemented components in hardware and software, and the MVs, DCT coefficients, and AC components are readily available in the MPEG-1 stream. As we refine the MVs into vectors that are easy to process, the execution time of the object detection algorithm after filtering is reduced significantly compared to that without any post-processing. Although we add another post-processing block, the efficiency in terms of execution time of the entire object detection process is the same or even better.
6.
CONCLUSION AND FUTURE WORK
With the increasing popularity of video over the internet and the versatility of video applications such as video surveillance, vision-based control, human-computer interfaces, medical imaging, and robotics, the usefulness and efficiency of video will rely heavily on object detection and related object tracking capabilities. Hence, we presented an effective, efficient, and reliable scheme for automatically extracting independently moving video objects using MV fields. In the future, we will use the proposed object detection scheme in a video streaming framework to achieve better streaming.
REFERENCES
1. N. Brady and N. O'Connor, "Object detection and tracking using an EM-based motion estimation and segmentation framework," Proc. IEEE ICIP, 925–928 (1996).
2. David P. Elias, The Motion Based Segmentation of Image Sequences, Ph.D. thesis (Trinity College, Department of Engineering, University of Cambridge, Aug. 1998).
3. N. Vasconcelos and A. Lippman, "Empirical Bayesian EM-based motion segmentation," Proc. of the IEEE CVPR, 1997, 527–532.
4. R. Wang, H.-J. Zhang and Y.-Q. Zhang, "A Confidence Measure Based Moving Object Extraction System Built for Compressed Domain," Proc. ISCAS, 21–24 (2000).
5. R. C. Jones, D. DeMenthon and D. S. Doermann, "Building mosaics from video using MPEG Motion Vectors," Proc. ACM Multimedia Conference, 29–32 (1999).
6. J. I. Khan, Z. Guo and W. Oh, "Motion based object tracking in MPEG-2 stream for perceptual region discriminating rate transcoding," Proc. ACM Multimedia Conference, 572–576 (2001).
7. D.-Y. Chen, S.-J. Lin and S.-Y. Lee, "Motion Activity Based Shot Identification and Closed Caption Detection for Video Structuring," Proc. VISUAL, 288–301 (2002).
8. S. Chien, S. Ma and L. Chen, "Efficient Moving Object Segmentation Algorithm Using Background Registration Technique," IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 7, 577–586 (2002).
9. Y. Ma and H.-J. Zhang, "A New Perceived Motion Based Shot Content Representation," IEEE ICIP, 426–429 (2001).
10. R. V. Babu and K. R. Ramakrishnan, "Compressed Domain Motion Segmentation for Video Object Extraction," Proc. of ICASSP, 3788–3791 (2002).
11. Roberto Castagno, Touradj Ebrahimi and Murat Kunt, "Video segmentation based on multiple features for interactive multimedia applications," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 5, pp. 111–122 (1999).
12. Ashraf M. A. Ahmad, Duan-Yu Chen and Suh-Yin Lee, "Robust Compressed Domain Object Detection in MPEG Videos," Proc. of the 7th IASTED International Conference on Internet and Multimedia Systems and Applications, 706–712 (2003).
13. Yu Zhong, Hongjiang Zhang and Anil K. Jain, "Automatic Caption Localization in Compressed Video," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4): 385–392 (2000).
HYBRID ORTHOGONAL APPROXIMATION OF NON-SQUARE AREAS
Jaroslav Polec, Tatiana Karlubíková and Anton Březina Faculty of Electrical Engineering and Information Technologies, Slovak University of Technology in Bratislava, Slovakia
Abstract:
This work deals with the approximation of non-square areas using combined discrete orthogonal transforms (DOT), where image data are represented using fewer spectral coefficients than with a single-transform approach. The work is based on the assumption that properly chosen orthogonal transforms complement each other in the process of image approximation.
Key words:
discrete orthogonal transform, triangulation, image approximation, image coding.
1.
INTRODUCTION
In the field of image compression, many algorithms are based on partitioning the original image into square blocks and then processing these blocks via an orthogonal transform (JPEG, MPEG). Rapidly developing technologies provide more efficient hardware equipment for image processing, thus enabling utilization of more sophisticated algorithms, using arbitrarily shaped regions rather than blocks. Although the approximation of an image using a single orthogonal transform is very simple and therefore quick, its efficiency can differ from image to image. If we are able to provide more orthogonal basis functions for the approximation algorithm, we can reduce these differences. Recent works based on segmented image approximation use only one DOT much like the traditional JPEG approach does. In our work we try to
use several orthogonal transforms at a time in order to decrease the number of spectral coefficients describing the image segments. We combine the advantages of two approaches – image segmentation into non-square areas and the use of more orthogonal basis functions.
The image is segmented into regions by a suitable homogeneity criterion. Regions are supposed to be quasi-stationary and to have uniform texture. Triangular regions offer a good trade-off between shape adaptability and description simplicity. Therefore the image is, after segmentation, polygonized and triangulated by a suitable method1,2. This reduces the coding complexity significantly, as we only have to store the vertices of the obtained triangles. The resulting image description is composed of a segment mask and the spectral coefficients obtained in the process of image approximation using a set of basis functions. Coding images this way seems more natural with respect to human psycho-visual perception.
Let us have a segment A of a 2D discrete image. This segment represents an arbitrarily shaped region with internal texture structure $f(n_1, n_2)$. First we find the circumscribing rectangle L of this segment. Next, we pad this rectangle with zeros so that its width and height are powers of 2, which allows us to use fast transform algorithms. The size of the rectangle is now $N_1 \times N_2$ points. The texture structure of segment A in rectangle L is then approximated using appropriate basis functions. In the 1D case, if $x(n) = \{x_0, x_1, \ldots, x_{N-1}\}$ is the set of points representing the image grey-level values and $u_k(n)$ is the set of orthogonal basis functions, we can obtain the corresponding spectral coefficients as3

$X_k = \sum_{n=0}^{N-1} x(n)\, u_k^*(n), \qquad k = 0, 1, \ldots, N-1$   (1)
In each approximation step a suitable basis function is selected, until the texture is approximated with sufficient quality. This method is called the "Matching Pursuit" algorithm4. Following more recent work5, we provide an overcomplete dictionary of basis functions for the approximation process, constructed as a mixture of the basis functions of several DOTs.
2.
SELECTION OF SUITABLE BASIS FUNCTION
The next problem is the determination of the best-suited basis function6. Let us have a set of orthogonal basis functions defined as

$u_{k_1,k_2}(n_1, n_2), \quad k_1 = 0, 1, \ldots, N_1-1, \quad k_2 = 0, 1, \ldots, N_2-1,$

where $n_1 = 0, 1, \ldots, N_1-1$ and $n_2 = 0, 1, \ldots, N_2-1$. After $\nu$ iteration steps we have an approximation $g^{\nu}(n_1, n_2)$ of the segment texture $f(n_1, n_2)$. Using linear approximation theory, this can be expressed as a sum of basis functions weighted by the appropriate spectral coefficients:

$g^{\nu}(n_1, n_2) = \sum_{(k_1,k_2) \in K_{\nu}} c^{(\nu)}_{k_1,k_2}\, u_{k_1,k_2}(n_1, n_2),$   (2)

where $K_{\nu}$ denotes the set of basis function indices used in $g^{\nu}(n_1, n_2)$ and $c^{(\nu)}_{k_1,k_2}$ the spectral coefficients. The remaining difference between the original texture and its approximation is then

$r^{\nu}(n_1, n_2) = f(n_1, n_2) - g^{\nu}(n_1, n_2)$   (3)
Now we want to approximate this difference with a suitable basis function in order to minimize the error function

$E_A = \sum_{(n_1,n_2) \in A} \left[ f(n_1, n_2) - g(n_1, n_2) \right]^2,$   (4)
where A stands for the approximated segment. We search6 for the basis function that maximizes

$\Delta E_A^{\nu} = \dfrac{\left[ \sum_{(n_1,n_2) \in A} r^{\nu}(n_1, n_2)\, u_{k_1,k_2}(n_1, n_2) \right]^2}{\sum_{(n_1,n_2) \in A} \left[ u_{k_1,k_2}(n_1, n_2) \right]^2}.$   (5)
Because we use a dictionary constructed from several DOTs, which results in an overcomplete set of basis functions, it is necessary that all of the basis functions are orthonormal, so that none of them is preferred in the approximation process. More background information on signal decomposition into overcomplete systems can be found in previous works4,5. However, the use of more transforms can increase the difficulty of image decoding, especially when the same coefficient is used by both transforms. In order to prevent this, masking matrices must be included in the implementation: before the step of finding a suitable basis function, we set to zero those spectral coefficients that have already been chosen by the other transforms.
3.
APPROXIMATION
The scheme of the approximation process is depicted in Fig. 1. We sequentially append the obtained spectral coefficients to an initially empty set in order to approximate the texture of a given segment.
Figure 1. Block diagram of proposed algorithm. DCT and DHT represent forward transform and DHT-1 and DCT-1 represent backward transforms.
In each step, the residual signal between the original and the approximated segment is calculated and decomposed into the created dictionary. If a spectral coefficient acquired by one of the basis functions has already been used in the approximation process, it is set to zero when acquired by the others, using the masking matrices. This prevents using the same coefficient obtained via different transforms. It also increases the speed of the decoding process, as we only have to calculate the inverse transforms of the spectral matrices and merge them by summing the elements at the same positions.
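A simplified sketch of this greedy loop is given below. It is only a stand-in for the method described above: in each step it selects the transform-domain coefficient of the zero-padded residual with the largest magnitude rather than evaluating Eq. (5) over the exact segment shape, and it shows only the DCT member of the dictionary (the DHT would be plugged in the same way). The masking matrices prevent the same coefficient from being picked twice.

```python
import numpy as np
from scipy.fft import dctn, idctn   # orthonormal 2-D DCT pair

# A second transform (e.g. the discrete Haar transform) would be added as another
# (forward, inverse) pair in this dictionary; only the DCT is shown here.
TRANSFORMS = {
    "DCT": (lambda x: dctn(x, norm="ortho"), lambda c: idctn(c, norm="ortho")),
}

def greedy_approximation(segment, mask, n_coeffs):
    """Greedy, matching-pursuit-like approximation of a masked segment.

    segment : 2-D array (zero-padded circumscribing rectangle of the segment)
    mask    : boolean array, True inside the segment A
    """
    approx = np.zeros_like(segment, dtype=float)
    used = {name: np.zeros(segment.shape, dtype=bool) for name in TRANSFORMS}  # masking matrices
    for _ in range(n_coeffs):
        residual = np.where(mask, segment - approx, 0.0)
        best = None
        for name, (fwd, _inv) in TRANSFORMS.items():
            coeffs = fwd(residual)
            coeffs[used[name]] = 0.0                    # coefficient already taken elsewhere
            k = np.unravel_index(np.argmax(np.abs(coeffs)), coeffs.shape)
            if best is None or abs(coeffs[k]) > abs(best[2]):
                best = (name, k, coeffs[k])
        name, k, value = best
        used[name][k] = True
        single = np.zeros_like(segment, dtype=float)
        single[k] = value
        approx += TRANSFORMS[name][1](single)           # add the selected basis function
    return approx
```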
4.
CONCLUSION
It is known that the first-order Markov process with a high correlation coefficient is a good model for natural images. For that process the performance of the DCT is asymptotically close to that of the Karhunen-Loeve transform (KLT). This gives the impression that the DCT is a nearly perfect solution for transform coding. But the nonuniqueness of the decomposition by basis pursuit gives us the possibility of adaptation, i.e., of choosing from among many representations the one that is most suited to our purposes. Having increased the complexity we were able to represent the texture of a given segment with fewer spectral coefficients (Figs. 2 - 5). The discrete cosine (DCT) and discrete Haar transform (DHT) were used in this work.
It has been shown that the use of several orthogonal transforms in the process of segmented image approximation is possible even when the segments are arbitrarily shaped areas.
Figure 2. The results of the approximation, image size 256x256: a) original, b) segmentation map, 132 segments, c) approximated using DCT II, d) approximated using DHT, e) proposed approach combining DCT II and DHT. Resulting PSNR in all approximated images is 25 dB.
Figure 3. Comparison of PSNR [dB] versus the number of used spectral coefficients (×10³) for DCT, DHT, and the proposed approach.
Figure 4. Complexity chart - overall PSNR versus the total number of multiplications made during the approximation process (counted as stated by Besslich3).
ACKNOWLEDGEMENTS This work was partially funded by projects: Audio, Video And Biomedical Signal Processing Using Non-Standard Forms Of DSP Algorithms, VEGA 0147/2003, and Methods And Algorithms For Digital Signal Processing, Protocol Development, Communication Channels Modeling And Simulation Of Signal Transmission In Different Environments, VEGA 0146/2003.
REFERENCES
1. Kolingerova, I., Zalik, B.: Improvements to randomized incremental Delaunay insertion, Computers and Graphics, Vol. 26, No. 3, Elsevier Science 2002, pp. 477-490
2. Kolingerova, I.: Modified DAG location for Delaunay triangulation, Computational Science – ICCS 2002, Part III, Amsterdam - Netherlands, pp. 125-134, LNCS 2331, Springer-Verlag, 2002
3. Besslich, Ph. W., Lu, T.: Diskrete Orthogonaltransformationen. Berlin: Springer-Verlag, 1990, 312 pp., ISBN 3-540-52151-8
4. Mallat, S., Zhang, Z.: Matching Pursuit in a time-frequency dictionary, IEEE Transactions on Signal Processing, Vol. 41, 1993, No. 12, pp. 3397-3415
5. Chen, S. S., Donoho, D. L., Saunders, M. A.: Atomic decomposition by basis pursuit, SIAM Review, Vol. 43, 2001, No. 1, pp. 129-159
6. Kaup, A., Aach, T.: Coding of Segmented Images Using Shape-Independent Basis Functions. IEEE Trans. on Image Processing, Vol. 7, 1998, No. 7, pp. 937-947
LARGE TEXTURE STORAGE USING FRACTAL IMAGE COMPRESSION
Stachera Jerzy1 and Nikiel Sławomir2 1 Institute of Computer Science, Warsaw University of Technology, ul Nowowiejska 15/19, 00665 POLAND, e-mail: [email protected], 2 Institute of Control and Computation Engineering, University of Zielona Góra, ul Podgórna 50, 65-246 Zielona Góra, POLAND, e-mail: [email protected]
Abstract:
Texture mapping has traditionally added visual realism to computer graphics images. Modern graphics boards reserve more and more space for image textures that are mapped in real time onto 3D meshes. Outside the graphics accelerators, texture images are stored in popular compressed formats that are not designed to meet real-time texture decompression requirements. In this paper we put forward an alternative technique for image texture handling. We use a fractal image compression scheme to decompress very large textures. The main idea is to exploit the high degree of local self-similarity in texture images representing natural scenes. The method delivers a very high compression ratio for complex textures and close to real-time decompression. The local compression scheme allows Region Of Interest access to the visual data. Properties of the method are compared to those of classical DCT and wavelet compression schemes.
Key words:
fractal compression, texture mapping, multiresolution
1.
INTRODUCTION
Texture mapping is used to add visual detail to a computer-generated scene; in its basic form it lays an image map onto polygon objects1. When mapped onto an object, the appearance of the object is modified by the corresponding data from the image, which can be patterns, bumps, or others2,3. The image is typically a sampled array, so a continuous, seamless texture must first be reconstructed from the samples. During the mapping the image must be warped to match perspective distortions. Then the warped texture is
filtered to avoid aliasing artefacts. The required filtering is approximated by one of several methods4,5. Current applications of three-dimensional computer graphics, however, require much larger textures to achieve a high degree of visual realism. This is particularly visible for humanoids and outdoor scenes, where repeating patterns create an artificial feel. The problem of texture handling arises in games and virtual reality simulations, particularly in panned background panoramas and in extreme close-ups, where visual detail is of most importance. Some general textures and geometry are stored and accessed locally, but all changes in the scene require heavy mirroring and downloads of upgraded data. Image textures can be too big to download via a broadband internet connection. There is therefore a continuing need for efficient texture transfer.
2.
PROBLEM DEFINITION
Most image compression schemes are oriented towards compression for storage or transmission. In choosing a compression scheme for texture mapping there are several restrictions to consider8. Decoding speed. Textures are compressed once and decompressed many times; it is an asymmetric process. Thus, decoding speed is the main factor that determines whether a given method is suitable. Random access. Region Of Interest (ROI) access to compressed texture data allows rendering directly from compressed textures. A texture compression scheme should provide fast random access to texture data; otherwise, the decompression process may limit the performance of the rendering pipeline. Compression rate and visual quality. A high compression rate allows more textures, or higher-resolution textures, to be stored in memory, and texture downloads become more efficient. While lossless compression methods preserve the original texture data, they do not achieve compression rates comparable to lossy methods.
3.
FRACTAL IMAGE COMPRESSION
3.1
PIFS
Fractal image compression was made possible thanks to the work of M. Barnsley and A. Jacquin. For that purpose an extension of IFS called a partitioned iterated function system (PIFS) was introduced. A PIFS consists of a complete metric space $(X, d)$, a set of domains $D \subset X$, and a set of contractive transformations $W: D \to X$. By the Contraction Mapping Fixed-Point Theorem, $W$ has a unique fixed point $f_w \in X$ satisfying $f_w = W(f_w)$. Thus, the image encoding problem is to find a PIFS such that its fixed point $f_w$ is as close as possible to the encoded image $I \in X$; that is, $W$ should minimise the distance between its fixed point $f_w$ and the encoded image $I$. Finding such a PIFS for a given image can be computationally expensive, since it involves a minimisation over many transformations. In order to make this problem solvable, only one class of transformations is considered.
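The fixed-point property itself can be illustrated with a toy one-dimensional contraction; this is only an illustration of the theorem, not of the texture decoder described later.

```python
def iterate_to_fixed_point(w, x0, n_iter=30):
    """Iterate a contractive map w; by the Banach fixed-point theorem the
    sequence converges to the unique fixed point regardless of x0."""
    x = x0
    for _ in range(n_iter):
        x = w(x)
    return x

# Toy contraction on the real line: w(x) = 0.5*x + 2 has the fixed point x = 4.
print(iterate_to_fixed_point(lambda x: 0.5 * x + 2, x0=100.0))   # ~ 4.0
```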
3.2
Fractal texture compression
The process of image coding is based on a set of contractive transformations $W$ such that its fixed point $f_w$ is an approximation of the coded image $I$. Thus $W$ defines a lossy code for the image $I$7,15. The compression scheme for textures is based on a block-oriented fractal compression scheme for images. Moreover, our algorithm allows ROI access and local decompression of texture regions. For compressing textures we define additional assumptions extending the basic fractal encoding scheme:
1. $N \times N$ – the size of the texture ($N = 2^l$),
2. $B \times B$ – the size of a range block ($B = 2^n$),
3. $D \times D$ – the size of a domain block, twice the size of a range block ($D = 2B$),
4. $M(\cdot)$ – the spatial contraction function, which averages four adjacent texture elements and then maps the averaged values onto the range block applying one of eight isometries; the resulting range block values are scaled by $s_i$ and shifted by $o_i$:

$r_i = w_i(d_i) = s_i M(d_i) + o_i$   (1)

5. the set of range blocks $R$ comes from a quadtree partition of the texture9.
(A sketch of fitting the parameters $s_i$ and $o_i$ for a single range/domain pair is given after this list.)
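The sketch below fits the scale $s_i$ and offset $o_i$ of Eq. (1) for one range/domain pair by least squares, using the 2x2 averaging contraction. The least-squares fit is the standard choice in Fisher-style fractal coders9; the paper does not spell this step out, so treat it as an assumption, and the eight isometries are omitted for brevity.

```python
import numpy as np

def contract(domain_block):
    """Spatial contraction M: average each 2x2 group of texels (D x D -> B x B)."""
    d = domain_block
    return 0.25 * (d[0::2, 0::2] + d[1::2, 0::2] + d[0::2, 1::2] + d[1::2, 1::2])

def fit_range_to_domain(range_block, domain_block):
    """Least-squares scale s and offset o so that s*M(d) + o approximates r."""
    m = contract(domain_block).ravel()
    r = range_block.ravel()
    n = r.size
    denom = n * np.dot(m, m) - m.sum() ** 2
    if denom == 0:                          # flat domain block: use the mean only
        return 0.0, float(r.mean())
    s = (n * np.dot(m, r) - m.sum() * r.sum()) / denom
    o = (r.sum() - s * m.sum()) / n
    return float(s), float(o)
```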
Figure 1. Quadtree partitioning
Figure 2. Lenna. Image size 512x512, minimum quadtree depth 5, maximum quadtree depth 7, search region depth 4.
In its basic form the fractal coding scheme does not allow local decoding: it is not possible to decode only one region of a texture without decoding all the domains that are related to that region. In our algorithm, we solve the problem of local decompression by restricting the search area of a given range. The search regions are defined by some initial number of quadtree partitions of the texture, smaller than the number of minimal partitions. In terms of the tree representation (Figure 1) created by the quadtree partitioning, search regions are the nodes at a level of the tree set between the root (the initial texture) and the minimum tree depth. Thus each search region is a square area; its size is at least twice the range size and it defines an independent domain pool. In the search part of the algorithm we compare ranges only with domains contained in the same search region. Each search region can be treated independently of the others, which allows ROI access to texture regions and local decompression. For the Lenna image (Figure 2), the search region size is 32x32 (white border),
the minimum range size is 4x4 and the maximum range size is 16x16 (black border).
The fractal compression scheme was introduced to encode greyscale images. In most cases textures are colour images stored in a format based on the RGB colour space. To take advantage of human insensitivity to colour changes we can convert a texture to the YUV colour space. The YUV colour space consists of three channels: the luminance channel Y and two chrominance channels U (hue) and V (saturation). The chrominance channels store information about colour, and it is possible to compress them with little to no visible degradation. The overall scheme for fractal texture compression consists of the following steps (the colour conversion step is sketched in code after this list):
- we convert the texture colour space to the YUV colour space,
- we average the chrominance channels (hue, saturation) to one-half resolution,
- we set the minimum quadtree depth $q_{min}$, the maximum quadtree depth $q_{max}$, and the search region depth $q_{search} \in [1, \ldots, q_{min}]$,
- for each component Y, U', V' we use the quadtree partitioning method and the region search strategy,
- we save the header information: quadtree depth, number of transformations for each component, texture resolution, …,
- we save the quantized transform parameters with the quadtree information for each component using a variable length code.
The independent processing of each YUV component makes it possible to utilise parallel processing during texture compression and decompression.
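The colour conversion and chrominance averaging steps might look as follows. The exact YUV weights are not given in the paper; the BT.601-style constants below are an assumption, and the function names are illustrative.

```python
import numpy as np

def rgb_to_yuv(rgb):
    """Convert an (H, W, 3) RGB texture to Y, U, V planes (BT.601 weights assumed)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.492 * (b - y)
    v = 0.877 * (r - y)
    return y, u, v

def halve(channel):
    """Average each 2x2 block: chrominance is stored at one-half resolution.
    Assumes even image dimensions."""
    return 0.25 * (channel[0::2, 0::2] + channel[1::2, 0::2] +
                   channel[0::2, 1::2] + channel[1::2, 1::2])

# y, u, v = rgb_to_yuv(texture); u2, v2 = halve(u), halve(v)
# Each of y, u2, v2 is then fractal-encoded independently (possibly in parallel).
```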
3.3
Fractal texture decompression
The compression scheme for textures must take its applications into consideration. As stated before, the scheme must above all be characterised by a fast decoding algorithm; its main drawback may be an expensive compression algorithm. A number of relatively simple decoding algorithms exist, and several decompression methods have been developed to optimize the decompression process9,12. It has been shown that the hierarchical decoding method is one order of magnitude less computationally expensive than the iterative method (assuming $B^2 \gg it$, where $it$ is the number of iterations of the iterative method)13. Moreover, we can decompress the texture in a finite, predetermined number of steps which depends on the partitioning of the texture rather than on the texture image itself. The hierarchical method was introduced only for fixed-size range blocks; in our scheme we modify that method, following the quadtree-partitioning extension by Malah12.
One property of fractals is resolution independence (or super-resolution). This also holds for textures compressed using the fractal compression method. In contrast to linear interpolation, which tends to blur the texture, fractal decompression preserves the richness of detail even at higher than the original resolution. In our compression scheme we average the chrominance information; the super-resolution property of the fractal decompression method allows us to decompress the chrominance channels (hue, saturation) at the original resolution without loss of visual detail.
The input data of the fractal decompression scheme come from the presented fractal compression scheme. We can decompress texture blocks whose size depends on the search region size used during compression. The decompression scheme utilises the hierarchical decompression method. In the first step, the transformations for a given texture block (range block and domain block size) are scaled by a factor $1/2^{max}$ (where $2^{max} \times 2^{max}$ is the size of the biggest range block) in order to approximate the top level of the PIFS fixed-point pyramid (this step can be omitted if the calculation is done before saving the coefficients to the file). At the top level we apply the transformations only once to a flat grey texture in order to approximate the fixed point at the coarsest resolution. Then we double the resolution by multiplying the transformations by a factor of two and applying them again. The process is repeated $\log_2(2^{max})$ times. The overall decompression process for a given texture block may be written in the following steps, for each YUV component (assuming that the U and V components were averaged to one-half):
1. we multiply the transformations by a factor of $1/2^{max}$ – optional,
2. we apply the transformations at the top level,
3. we multiply the transformations by 2,
4. we apply the transformations,
5. we repeat steps 3) and 4) until the required resolution is achieved,
6. for the U and V components – we multiply the transformations by 2 once more and apply them.
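For contrast with the hierarchical steps above, the sketch below shows the plain iterative decoder that the hierarchical scheme is compared against: the PIFS operator is applied repeatedly at full resolution, starting from a flat grey image. It is a simplified stand-in, not the hierarchical decoder of the paper, and the transform tuple layout is an assumption.

```python
import numpy as np

def apply_pifs(img, transforms):
    """One application of the PIFS operator W at the current resolution.

    transforms: list of ((ry, rx), (dy, dx), b, s, o) with range position,
    domain position, range size b (domain size 2b), scale s and offset o.
    """
    out = np.zeros_like(img)
    for (ry, rx), (dy, dx), b, s, o in transforms:
        d = img[dy:dy + 2 * b, dx:dx + 2 * b]
        # spatial contraction: average each 2x2 group of the domain block
        contracted = 0.25 * (d[0::2, 0::2] + d[1::2, 0::2] + d[0::2, 1::2] + d[1::2, 1::2])
        out[ry:ry + b, rx:rx + b] = s * contracted + o
    return out

def iterative_decode(transforms, size, n_iter=8):
    """Plain iterative decoding: start from flat grey and apply W repeatedly."""
    img = np.full((size, size), 128.0)
    for _ in range(n_iter):
        img = apply_pifs(img, transforms)
    return img
```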
4.
EXPERIMENTAL RESULTS
Table 1. Efficiency of the three compression methods (CR – compression ratio, PSNR – peak signal to noise ratio).

Image    | FCI (32x32): CR, PSNR[dB] | FCI (64x64): CR, PSNR[dB] | JPEG: CR, PSNR[dB] | JPEG2000: CR, PSNR[dB]
Building | 49.35:1, 27.25            | 65.25:1, 28.9             | 95.6:1, 30.7       | 197.4:1, 33.6
Earth    | 74.75:1, 29.69            | 100:1, 30.36              | 120.1:1, 30.0      | 200.1:1, 32.8
Nebulea  | 62.87:1, 33.4             | 168.3:1, 31.91            | 136.7:1, 32.7      | 422.2:1, 34.9
Sky      | 63.31:1, 35.68            | 182.6:1, 33.74            | 161.1:1, 31.4      | 542.9:1, 36.7
Map      | 51.18:1, 29.8             | 68.76:1, 30.32            | 89.2:1, 28.7       | 126.9:1, 30.1
Jupiter  | 47.41:1, 31.9             | 71.7:1, 32.11             | 96.4:1, 34.2       | 243.2:1, 32.8
We carried out the experiments mainly to achieve high compression rates (the test images are shown at the end of the paper). The table shows the compression ratio for a given peak signal to noise ratio. The results are presented for textures stored at 2048×2048 resolution in the RGB colour space. For the fractal compression scheme, the search region sizes are 32x32 (minimum partitions 7, maximum partitions 8) and 64x64 (minimum partitions 6, maximum partitions 8). FCI is the data format used in our compression scheme. It is based on the bit allocation scheme proposed by Y. Fisher9, where the transform coefficients are allocated as follows: $s_i$ (5-bit), $o_i$ (7-bit), $M$ (3-bit), and a variable length codeword is used for the domain location. There is a relation between the search region size and the compression ratio: enlarging the search region increases the compression ratio and the minimal size of the texture blocks that can be decompressed independently. The decompression time depends only on the search region size and is linear in $2^n$.
5.
DISCUSSION
Most image compression schemes that use transform methods such as the discrete cosine transform (JPEG) or wavelet functions (JPEG2000) are oriented towards image quality and high compression rates. High compression rates in those methods are achieved by efficient data decorrelation, quantisation, and entropy coding of the transform coefficients. The decompression process is usually multi-staged, with an expensive computation of the inverse transform. Texture mapping imposes other constraints on the compression scheme: a texture compression scheme is required to deliver a fast decompression algorithm. Because JPEG and JPEG2000 require a multi-staged decompression process, they are not suitable for real-time texturing applications. The proposed fractal compression scheme allows random access to a given texture block and local decompression. The compression ratio in our scheme depends only on the texture block size. By increasing the texture block size we increase the compression ratio, and the results are comparable to the JPEG algorithms, with the additional advantage of local decompression.
In most cases the aliasing problem in texture mapping is solved by the mip-mapping method: as the on-screen size of the texture changes, the mip-map which best approximates the texture is chosen. The method solves the problem of the texture transform. Theoretically, it is possible to generate a set of mip-maps which can be used at any level of detail, thus solving the problem of texture magnification; however, mip-map memory handling in this case may dominate the rendering process. Therefore in most hardware an interpolation method is used. The most commonly used linear interpolation tends to smooth the texture, resulting in a blurred texture image. The proposed fractal decompression scheme solves the problem of texture magnification by utilising fractal properties, namely super-resolution and a simple decoding algorithm. The super-resolution property is used in the proposed scheme to decode the chrominance channels at twice the encoded resolution. This approach adds detail in the colour information at higher than the original resolution, as a result of the fractal transform, and preserves the sharpness of the decoded texture. Moreover, the simple decoding algorithm makes it possible to implement the scheme in dedicated hardware for real-time texturing or to employ it on a mobile device.
6.
CONCLUSIONS AND FUTURE WORK
Texture mapping is a highly efficient technique for adding complex visual detail to three-dimensional geometry. Popular PC-based systems offer each year a new generation of accelerated graphics boards providing more space for image textures. Concerning virtual reality systems and the gaming industry, there is still a need for efficient handling of very large image textures. The highly efficient compression scheme proposed in this paper works well with textures depicting natural scenes. Our method utilises local self-similarity in natural images, allowing multi-resolution decompression and local decompression. Fractal texture handling offers significant savings both in storage and in decompression time, which comes close to meeting real-time constraints.
7.
TEST TEXTURES
Figure 3. A Building
Figure 4. Sky
Figure 5. Earth
Figure 6. Map
Figure 7. Nebulea
Figure 8. Jupiter
REFERENCES
1. Heckbert P., Survey of Texture Mapping, IEEE Computer Graphics and Applications, November 1986, pp. 56-67
2. Gardner G., Visual Simulation of Clouds, Computer Graphics, Siggraph'85 Proceedings, July 1985, pp. 297-303
3. Palewski M., Fractal shading model, MSc thesis, Institute of Control and Computation Engineering, University of Zielona Góra, February 2003 (in Polish)
4. Williams L., Pyramidal Parametrics, Computer Graphics, Siggraph'83 Proceedings, July 1983, pp. 1-11
5. Crow F., Summed-Area Tables for Texture Mapping, Computer Graphics, Siggraph'84 Proceedings, July 1984, pp. 207-212
6. Deering M. et al., The Triangle Processor and Normal Vector Shader: A VLSI System for High Performance Graphics, Computer Graphics, Siggraph'88 Proceedings, August 1988, pp. 21-30
7. Baharav Z., Malah D., Karnin E., Hierarchical Interpretation of Fractal Image Coding and Its Application, I.B.M Israel Science and Technology
8. Beers A., Agrawala M., Rendering from compressed textures,
9. Fisher Y., Fractal Image Compression: Theory and Application, Springer-Verlag, 1995
10. Welstead S., Fractal and Wavelet Image Compression Techniques, SPIE, 1999
11. Saupe D., Fractal Image Compression via Nearest Neighbor Search, NATO Advanced Study Institute Fractal Image Encoding and Analysis, Trondheim, 1995
12. Malah D., Hierarchical fast decoding of fractal image representation using quadtree partitioning, Technion - I.I.T, Israel
13. Cisar G., On entropy coding Fisher's fractal quadtree code, Joint Research Centre, Italy, 1996
14. Stachera J., Real-time Texturing using Fractal Image Compression, MSc thesis, Institute of Control and Computation Engineering, University of Zielona Góra, February 2003 (in Polish)
15. Stachera J., Nikiel S., Fractal Image Compression for Efficient Texture Mapping, The 12th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2004
CHEN AND LOEFFLER FAST DCT MODIFIED ALGORITHMS IMPLEMENTED IN FPGA CHIPS FOR REAL-TIME IMAGE COMPRESSION
Agnieszka Dąbrowska and Kazimierz Wiatr University of Science and Technology, Institute of Electronics, Mickiewicza 30,
Abstract:
The Discrete Cosine Transform (DCT) is one of the basic varieties of transform coding algorithms. The DCT is used in standard still-image compression algorithms (JPEG) and in video compression algorithms (MPEG, H.26x). In image compression algorithms the DCT is applied to blocks of 8x8 pixels. The paper presents problems connected with the implementation of DCT algorithms in reconfigurable FPGA structures. The authors implemented the DCT using Chen's and Loeffler's algorithms, because these algorithms are the most effective for FPGA implementation. The paper presents implementation results for Xilinx XCV200BG352 chips.
Key words:
DCT, Chen, Loeffler, FPGA
1.
INTRODUCTION
The two-dimensional discrete cosine transform can be divided into two one-dimensional discrete cosine transforms, one applied along the rows and the other along the columns. If $x_n$ denotes the vector of input values, then the vector of transformed values $y_n$ can be obtained from the dependence (Eq. 1):

$y_k = \alpha_k \sum_{n=0}^{N-1} x_n \cos\left[\frac{2\pi(2n+1)k}{4N}\right]$   (1)
where $k = 0, 1, \ldots, N-1$, $\alpha_0 = \sqrt{1/N}$ for $k = 0$ and $\alpha_k = \sqrt{2/N}$ in the remaining cases. The discrete cosine transform can also be represented in matrix form. For $N = 8$ the matrix–vector form is as follows (Eq. 2), where the coefficients $c_k$ are equal to $c_k = \cos(k\pi/16)$:

$$\begin{bmatrix} y_0\\ y_1\\ y_2\\ y_3\\ y_4\\ y_5\\ y_6\\ y_7 \end{bmatrix} = \alpha_k \begin{bmatrix} c_0 & c_0 & c_0 & c_0 & c_0 & c_0 & c_0 & c_0\\ c_1 & c_3 & c_5 & c_7 & -c_7 & -c_5 & -c_3 & -c_1\\ c_2 & c_6 & -c_6 & -c_2 & -c_2 & -c_6 & c_6 & c_2\\ c_3 & -c_7 & -c_1 & -c_5 & c_5 & c_1 & c_7 & -c_3\\ c_4 & -c_4 & -c_4 & c_4 & c_4 & -c_4 & -c_4 & c_4\\ c_5 & -c_1 & c_7 & c_3 & -c_3 & -c_7 & c_1 & -c_5\\ c_6 & -c_2 & c_2 & -c_6 & -c_6 & c_2 & -c_2 & c_6\\ c_7 & -c_5 & c_3 & -c_1 & c_1 & -c_3 & c_5 & -c_7 \end{bmatrix} \begin{bmatrix} x_0\\ x_1\\ x_2\\ x_3\\ x_4\\ x_5\\ x_6\\ x_7 \end{bmatrix} \quad (2)$$
As can be seen from Eq. (2), computing a single vector of transformed values requires 64 multiplications and 56 additions. For 8 vectors of input values, the number of operations necessary to compute the corresponding transformed values is 1024 multiplications and 896 additions. The number of multiplications may be reduced from 64 to 32 thanks to the symmetry of the coefficient matrix $c_k$, although the number of additions and subtractions then grows. In this case the matrix–vector form is as follows:

$$\begin{bmatrix} y_0\\ y_2\\ y_4\\ y_6\\ y_1\\ y_3\\ y_5\\ y_7 \end{bmatrix} = \alpha_k \begin{bmatrix} c_0 & c_0 & c_0 & c_0 & 0 & 0 & 0 & 0\\ c_2 & c_6 & -c_6 & -c_2 & 0 & 0 & 0 & 0\\ c_4 & -c_4 & -c_4 & c_4 & 0 & 0 & 0 & 0\\ c_6 & -c_2 & c_2 & -c_6 & 0 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 & c_1 & c_3 & c_5 & c_7\\ 0 & 0 & 0 & 0 & c_3 & -c_7 & -c_1 & -c_5\\ 0 & 0 & 0 & 0 & c_5 & -c_1 & c_7 & c_3\\ 0 & 0 & 0 & 0 & c_7 & -c_5 & c_3 & -c_1 \end{bmatrix} \begin{bmatrix} x_0+x_7\\ x_1+x_6\\ x_2+x_5\\ x_3+x_4\\ x_0-x_7\\ x_1-x_6\\ x_2-x_5\\ x_3-x_4 \end{bmatrix} \quad (3)$$
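The sketch below implements the direct form of Eq. (1) and the even/odd decomposition of Eq. (3) for N = 8 and checks that they agree on the input vector from Table 1. It is only a numerical illustration of the two formulas, not the FPGA implementation discussed later.

```python
import numpy as np

C = np.cos(np.arange(8) * np.pi / 16)                       # c_k = cos(k*pi/16)
ALPHA = np.array([np.sqrt(1 / 8)] + [np.sqrt(2 / 8)] * 7)   # alpha_0, alpha_k

def dct_direct(x):
    """Direct evaluation of Eq. (1) for N = 8 (64 multiplications)."""
    n = np.arange(8)
    return np.array([ALPHA[k] * np.sum(x * np.cos(2 * np.pi * (2 * n + 1) * k / 32))
                     for k in range(8)])

def dct_symmetric(x):
    """Even/odd decomposition of Eq. (3): two 4x4 products (32 multiplications)."""
    s = x[:4] + x[7:3:-1]                                    # x0+x7, ..., x3+x4
    d = x[:4] - x[7:3:-1]                                    # x0-x7, ..., x3-x4
    even = np.array([[ C[0],  C[0],  C[0],  C[0]],
                     [ C[2],  C[6], -C[6], -C[2]],
                     [ C[4], -C[4], -C[4],  C[4]],
                     [ C[6], -C[2],  C[2], -C[6]]]) @ s
    odd  = np.array([[ C[1],  C[3],  C[5],  C[7]],
                     [ C[3], -C[7], -C[1], -C[5]],
                     [ C[5], -C[1],  C[7],  C[3]],
                     [ C[7], -C[5],  C[3], -C[1]]]) @ d
    y = np.empty(8)
    y[0::2], y[1::2] = even, odd                             # interleave y0,y2,... and y1,y3,...
    return ALPHA * y

x = np.array([252, 250, 249, 247, 207, 223, 159, 239], dtype=float)  # input vector of Table 1
assert np.allclose(dct_direct(x), dct_symmetric(x))          # dct_direct(x)[0] ~ 645.59, as in Table 1
```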
2.
THE FAST DCT ALGORITHMS
DCT algorithms in which the number of required multiplications is reduced are called fast DCT algorithms. One such algorithm is Chen's algorithm, in which the number of multiplications is reduced to 16 and the number of additions/subtractions to 26. In Loeffler's algorithm the number of multiplications is reduced to 11 and the number of additions/subtractions to 29.
3.
PROBLEMS CONNECTED WITH IMPLEMENTATION OF THE DCT ALGORITHMS
During the implementation of 1D-DCT algorithms the question arises of what accuracy should be accepted for the multiplication coefficients: is an accuracy in the order of 10^-2 sufficient, or is it necessary to represent the multiplication coefficients with higher accuracy? An equally important question is how the chosen accuracy propagates to the transformed values. For a representation of the multiplication coefficients accurate to the second decimal place, the maximum difference between the analytically calculated transformed values and the values produced by the implemented Chen's algorithm is 12.75. The maximum difference appears for maximum values of the input pixels (x0 to x7 all equal to 255). In this case the value of the transform coefficient y0 is 734, whereas the value of this coefficient calculated analytically is 721.249. The theoretical maximum difference is 20.4, which is larger than the value obtained from the implemented algorithm. If the multiplication coefficients are represented with an accuracy to the third decimal place, the maximum difference between the analytically calculated transformed values and those obtained from the implemented algorithm is 1.182; the theoretical maximum difference at this accuracy is 2.04. The question remains what accuracy of representation of the multiplication coefficients is sufficient. We should also answer the question of how the differences between the y_n obtained from the implemented algorithm and the y_n calculated analytically propagate to the pixel values obtained by the inverse discrete cosine transform (IDCT). As can be seen from Table 1, at an accuracy in the order of 10^-2 the maximum difference MD between the value of an input pixel and the value of the pixel reconstructed by the implemented Chen's algorithm is 5. If
the representation’s accuracy of multiplication’s coefficients is in the order of 10-3 then the maximum difference MD between pixels is equal 2. Table 1. The consideration for a hypothetical input vector.
(y_n columns give the output of the implemented Chen's algorithm for coefficient accuracies 10^-2 and 10^-3; the difference columns give the difference between y_n received and y_n calculated; the last two columns give the pixels reconstructed by the implemented Chen's 1D-IDCT algorithm.)

x_n      | y_n analytical | y_n Chen 10^-2 | y_n Chen 10^-3 | diff. 10^-2 | diff. 10^-3 | x_n IDCT 10^-2 | x_n IDCT 10^-3
x0 = 252 | 645.5885       | 657            | 645            | -11.4115    | 0.58849     | 255            | 252
x1 = 250 | 55.33119       | 55             | 55             | 0.33119     | 0.33119     | 255            | 250
x2 = 249 | 5.037243       | 5              | 5              | 0.037243    | 0.03724     | 253            | 248
x3 = 247 | -27.3337       | -28            | -27            | 0.66633     | -0.33367    | 251            | 247
x4 = 207 | -22.6274       | -23            | -23            | 0.372583    | 0.37258     | 211            | 206
x5 = 223 | -21.849        | -23            | -21            | 1.151043    | -0.848958   | 226            | 223
x6 = 159 | 36.18185       | 37             | 36             | -0.81815    | 0.181849    | 162            | 159
x7 = 239 | -32.817        | -33            | -33            | 0.183041    | 0.183041    | 244            | 238
The implementation of the Chen’s DCT with the accuracy in the order of 10-3 in the structure XCV200BG352 of Xilinx occupies 574 from 2352 of SLICE blocks (what states 24% of the structure’s supplies), 20% LUT blocks (958 from 4704). The maximum attainable frequency for this implementation is equal 76,115MHz. For the implementation of the Chen’s DCT with an accuracy of representation of multiplication’s coefficients in the order of 10-2 this frequency reaches 105,831MHz for the structure XCV200BG352. In this case the SLICE blocks are occupied in 16% (382 from 2352) and LUT blocks in 13% (614 from 4704). From the viewpoint of working speed better is the implementation of algorithm with the accuracy in the order of 10-2. However from the viewpoint of pixels’ reconstruction’s accuracy better is the implementation of algorithm with the accuracy in the order of 10-3. There exists some possibility which will be a compromise between speed’s working and an accuracy of pixels’ reconstruction. It is enough to implement the Chen’s DCT algorithm with an accuracy of multiplication’s coefficients in the order of 10-2 except of multiplication’s coefficient occurrents at marking transformed value y0. What is visible from the Table 2 the maximum difference MD between the value of the input pixel and the value of the reconstructed pixel by the implemented Chen’s algorithm with an accuracy of multiplication’s coefficients in the order of 10-2 except of multiplication’s coefficient occurrents at marking transformed value y0 is equal 3. The maximum difference between the calculated analytically transformed values and the transformed values received from the implemented algorithm is equal 2,89.
In this case the implementation in the XCV200BG352 occupies 402 of 2352 SLICE blocks (17%) and 648 of 4704 LUT blocks (13%). The maximum attainable frequency is 93.932 MHz.
Table 2. Results for a hypothetical input vector.
(y_n is the output of the implemented Chen's algorithm with coefficient accuracy 10^-2, except for the coefficients used to compute y0; the last column gives the pixel reconstructed by the implemented Chen's 1D-IDCT algorithm.)

x_n      | y_n analytical | y_n implemented | difference | x_n IDCT
x0 = 253 | 393.15137      | 393             | 0.15137    | 255
x1 = 33  | 83.52634       | 82              | 1.526338   | 31
x2 = 247 | -25.132389     | -23             | -2.13239   | 247
x3 = 121 | 78.72797       | 80              | -1.27203   | 119
x4 = 87  | 46.669048      | 47              | -0.33095   | 84
x5 = 219 | 123.22545      | 126             | -2.77455   | 220
x6 = 123 | 157.36061      | 159             | -1.63939   | 124
x7 = 29  | 41.818001      | 42              | -0.182     | 30
The Loeffler’s 1D-DCT algorithm was implemented with accuracy’s representation of multiplication’s coefficients in the order of 10-3. In this case the maximum difference between the calculated analytically transformed values and the transformed values received from the implemented algorithm is equal 2,40562. The maximum difference MD between the value of the input pixel and the value of the pixel’s reconstruction is equal 2. Table 3. A case for the maximum difference between yn received and yn calculated for the accuracy in the order of 10-3 . xn received from the implemented Loeffler’s 1D-IDCT algorithm
(the last column gives the pixel reconstructed by the implemented Loeffler 1D-IDCT algorithm.)

x_n     | y_n analytical | y_n Loeffler | difference | x_n IDCT
x0 = 8  | 101.82338      | 102          | -0.17662   | 8
x1 = 16 | -51.53858      | -52          | 0.461416   | 14
x2 = 24 | 0              | 0            | 0          | 24
x3 = 32 | -5.387638      | -4           | -1.38764   | 30
x4 = 40 | 0              | 0            | 0          | 41
x5 = 48 | -1.307223      | -1           | -0.60722   | 48
x6 = 56 | 0              | 0            | 0          | 57
x7 = 64 | -0.405619      | 2            | -2.40562   | 63
The implementation of the Loeffler’s 1D-DCT algorithm with the accuracy in the order of 10-3 in the structure XCV200BG352 occupies 508 from 2352 SLICE blocks (21% SLICE blocks) and 877 from 4704 LUT blocks (18% LUT). The maximum attainable frequency for the implementation of the Loeffler’s 1D-DCT algorithm is equal 79,195MHz.
4.
SUMMARY
Comparing the parameters achieved by the implemented Chen and Loeffler algorithms with an accuracy in the order of 10^-3, the Loeffler algorithm is better. In terms of the achieved maximum frequency the Loeffler algorithm wins: its maximum frequency for coefficient accuracy in the order of 10^-3 is 79.195 MHz, whereas for the Chen algorithm at the same accuracy it is 76.115 MHz. In terms of occupied resources the Loeffler algorithm is also more economical: its implementation occupies 508 SLICE blocks and 877 LUT blocks for an accuracy of 10^-3, while Chen's algorithm with the same accuracy requires 574 SLICE blocks and 958 LUT blocks. For both algorithms with an accuracy of 10^-3 the accuracy of pixel reconstruction is the same, as the maximum difference MD between the input pixel values and the reconstructed pixel values is 2. The argument in favour of Chen's algorithm is the maximum difference between the analytically calculated transformed values and the transformed values obtained from the implemented algorithm: this difference is 1.182 for Chen's algorithm and 2.40562 for Loeffler's algorithm.
REFERENCES
1. Bukhari K., Kuzmanov G., Vassiliadis S.: DCT and IDCT Implementations on Different FPGA Technologies, www.stw.nl
2. Chotin R., Dumonteix Y., Mehrez H.: Use of Redundant Arithmetic on Architecture and Design of a High Performance DCT Macro-block Generator, www.asim.lip6.fr
3. Heron J., Trainor D., Woods R.: Image Compression Algorithms Using Re-configurable Logic, www.vcc.com
4. Jamro E., Wiatr K.: Dynamic Constant Coefficient Convolvers Implemented in FPGA, Lecture Notes in Computer Science, no. 2438, Springer-Verlag 2002, pp. 1110-1113
5. Trainor D., Heron J., Woods R.: Implementation of the 2D DCT Using a Xilinx XC6264 FPGA, www.ee.qub.ac.uk
6. Wiatr K.: Dedicated Hardware Processors for a Real-Time Image Data Pre-Processing Implemented in FPGA Structure. Lecture Notes in Computer Science, no. 1311, Springer-Verlag 1997, vol. II, pp. 69-75
7. Wiatr K.: Dedicated System Architecture for Parallel Image Computation used Specialised Hardware Processors Implemented in FPGA Structures. International Journal of Parallel and Distributed Systems and Networks, vol. 1, no. 4, Pittsburgh 1998, pp. 161-168
8. Wiatr K., Jamro E.: Implementation of Multipliers in FPGA Structure, Proc. of the IEEE Int. Symp. on Quality Electronic Design, Los Alamitos CA, IEEE Computer Society 2001, pp. 415-420
9. Woods R., Cassidy A., Gray J.: VLSI Architectures for Field Programmable Gate Arrays: A Case Study, www.icspat.com
AUTOMATIC LANDMARK DETECTION AND VALIDATION IN SOCCER VIDEO SEQUENCES
Arnaud Le Troter, Sebastien Mavromatis, Jean-Marc Boi and Jean Sequeira
LSIS Laboratory (UMR CNRS 6168) - LXAO group, University of Marseilles, France, [email protected]
Abstract
Landmarks are specific points that can be identified to provide efficient matching processes. Many works have addressed the automatic detection of such landmarks in images; our purpose is not to propose a new approach for this detection but to validate the detected landmarks in a given context, namely the 2D-to-3D registration of soccer video sequences. The originality of our approach is that it globally takes into consideration the color and the spatial coherence of the field to provide such a validation. This process is part of the SIMULFOOT project, whose objective is the 3D reconstruction of the scene (players, referees, ball) and its animation as a support for cognitive studies and strategy analysis.
Keywords:
Landmark detection, Recognition area, Classification
1.
INTRODUCTION
The SIMULFOOT project, developed at the LSIS laboratory, provides an interesting framework for theoretical research in 3D scene analysis and cognitive ergonomics1,2,3. It also supports many partnerships with various professionals. This project aims to produce 3D scene reconstruction and animation from video sequences4,5,6. 3D scene analysis and registration has been widely studied for decades but, up to now, no general solution has been brought to this problem; solutions are tied to given contexts about which we have some knowledge (a given set of elements, a given geometry). In the frame of the SIMULFOOT project, we can express the constraints as follows:
- Most relevant information is related to a 2D space (the field), for instance the lines drawn on it.
- Those 3D elements that are relevant have a known relation with the 2D space, such as the goal or the players, for instance, who are in 3D space but,
relatively to the scale factor - global view of the scene - their feet are considered as being on the field. - The other 3D elements, such as the gallery for instance, do not bring any information related to our problem (we only want to model what happens on the field). These considerations on the conditions of our problem have to be completed by two remarks. First, the field does not have a fixed color (even if it looks "green" or "brown") and some parts of it are hidden (by the players), but the field area is characterized by a color coherence and a spatial coherence. Second, the transformation that associates the points of the field in an image (captured with a camera) with the corresponding points in the model is a projective transformation, i.e. we can express it by means of a 3x3 matrix in the homogeneous space7. Taking into account all these considerations and remarks we have designed the following approach: we look for the lines in the image, we characterize the field area in the image, we only keep those lines belonging to the field area, we extract landmarks and look for the corresponding elements in the model, and we compute the transformation matrix. We will not describe the solutions we have brought at each step of the global process (it could not be done in so few pages) but we will focus on one of these steps, the landmark validation process, which is directly related to the field detection.
2. LANDMARK DETECTION
Let us consider the following image (Figure 1a). Many straight lines can be detected in it but not all of them are relevant: for instance, the top of the wall and the vertical lines are not significant.
Figure 1a. A soccer image. Figure 1b. After the use of a Laplacian filter. Figure 1c. After thresholding.
Landmarks are usually points. In this case, however, they may be straight lines as well as points, because there is an exact correspondence between them (the only available points are intersections of straight lines). Thus, we will talk either of points or of lines. We use a classical way of detecting such lines. We compute the lightness component and apply a Laplacian filter to it, as illustrated below (Figure 1b). We then obtain a binary image by thresholding the result of the previous step (Figure 1c). Finally, we use a "2 to 1" Hough transform to detect straight lines, using the classical polar description with ρ and θ to represent them (Figure 2a). The final result is the set of straight lines detected in the image (Figure 2b).
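A minimal sketch of this detection chain, assuming OpenCV and NumPy, is given below; the threshold value and Hough parameters are illustrative choices, not the values used by the authors.

```python
# Illustrative sketch (not the authors' code): lightness -> Laplacian -> threshold -> Hough.
import cv2
import numpy as np

def detect_field_lines(bgr_image, edge_thresh=30):
    # Lightness component (the L channel of HLS), as a stand-in for the paper's lightness image.
    hls = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HLS)
    lightness = hls[:, :, 1]

    # Laplacian filter, then binarisation with a fixed threshold.
    lap = cv2.Laplacian(lightness, cv2.CV_16S, ksize=3)
    edges = (np.abs(lap) > edge_thresh).astype(np.uint8) * 255

    # Standard (rho, theta) Hough transform on the binary edge image.
    lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=150)
    return [] if lines is None else [tuple(l[0]) for l in lines]  # list of (rho, theta)
```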
Figure 2a. The Hough space. Figure 2b. Straight lines detected.
Our goal was not to design an optimal approach for detecting such lines but to effectively detect them. And we do it even if this approach is not original, because these lines appear very clearly in the context of this application.
3. FIELD DETECTION AND LANDMARK VALIDATION
The field detection is the key point of our approach because it provides the mask for landmark validation. The only assumption we make is that the color of the soccer field is the majority color in the image (i.e. the camera is not oriented toward the stands and does not take a close-up of a player), as illustrated in Figure 1a. Research works have been developed to automatically extract the foreground elements from the background in video sequences, such as those described in 8 and 9 . But the problem addressed in these papers is quite different: we do not want to characterize the background but to find the area in the image that corresponds to the relevant part of the background. In other words, we want to provide a background segmentation into two parts, the relevant one and the non-relevant one. Let us also mention the work of Vandenbroucke on a method that provides a color space segmentation and classification before using snakes to track the players in video sequences 10 . The ideas developed in that paper are very interesting but they do not fully take advantage of the color and spatial coherence.
For these reasons, we have developed a new approach that integrates all these features. It consists in analyzing the pixel distribution in the color space, characterizing the relevant area in the color space (color coherence), selecting the corresponding points in the image, and using a spatial coherence criterion on the selected pixels in the image. All the pixels can be represented in a color space, which is a 3D space. Usually, the information captured by a video camera consists of quantified (red, green, blue) values, but this color information can be represented in a more significant space such as HLS (Hue, Lightness, Saturation), which provides an expression of the hue that seems determinant in this context. The first idea we developed was to select points on the basis of a threshold in the "natural" discrete HLS space, where "natural" means that we use the set of values associated with the quantified (R,G,B) values. The algorithm is very simple: we count the occurrences of the values in the HLS space and we keep those with the highest score. It is very simple but it is not efficient at all, because of various possible local value distributions. Many improvements have been brought to this algorithm; they are described in detail in 2 : we define a discrete HLS space; each pixel representation in this space acts as a potential source, so that this process produces a strong interaction between pixels that carry similar color information; a potential value (the sum of these potentials) is then assigned to each cell of this discrete HLS space; and we select those cells that have the highest potential values and that are connected (to take advantage of the color coherence).
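A possible reading of this color-coherence step is sketched below; the "potential" interaction is approximated by a Gaussian smoothing of the 3D HLS histogram, which is an assumption, not the authors' exact formulation, and the bin count and selection fraction are arbitrary.

```python
# Illustrative sketch only: color-coherence selection in a discrete HLS space.
import cv2
import numpy as np
from scipy.ndimage import gaussian_filter, label

def dominant_hls_cells(bgr_image, bins=32, keep_fraction=0.02, sigma=1.0):
    hls = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HLS)
    hist, edges = np.histogramdd(hls.reshape(-1, 3).astype(np.float32),
                                 bins=(bins, bins, bins),
                                 range=((0, 256), (0, 256), (0, 256)))
    potential = gaussian_filter(hist, sigma=sigma)            # similar colors reinforce each other
    threshold = np.quantile(potential, 1.0 - keep_fraction)   # keep the highest-potential cells
    selected = potential >= threshold

    # Keep only the connected set of cells containing the global maximum (color coherence).
    labels, _ = label(selected)
    best = labels[np.unravel_index(np.argmax(potential), potential.shape)]
    return labels == best, edges   # boolean mask over HLS cells, plus the bin edges
```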
Figure 3a. The classical HLS space. Figure 3b. A discrete HLS space.
At this step, we have obtained a pixel selection based only on a color coherence criterion (Figure 4a). We then use spatial coherence to select the field area: a closing connects the separated elements of the field (e.g. those separated by white lines) and an opening cuts off the "wrong connections" (e.g. if some green is present in the gallery, close to the field). Then we select the connected area that contains the most pixels (e.g. to eliminate some green parts in the gallery and only keep the field) and we fill the holes in this area (the green area in Figure 4b).
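A minimal sketch of this spatial-coherence cleanup, assuming SciPy and an arbitrary structuring-element radius, could look as follows.

```python
# Illustrative sketch (assumed implementation): closing, opening, largest-component
# selection and hole filling applied to the color-coherent pixel mask.
import numpy as np
from scipy.ndimage import (binary_closing, binary_opening,
                           binary_fill_holes, label)

def field_mask(color_coherent_pixels, radius=5):
    struct = np.ones((2 * radius + 1, 2 * radius + 1), dtype=bool)
    mask = binary_closing(color_coherent_pixels, structure=struct)  # bridge the white field lines
    mask = binary_opening(mask, structure=struct)                   # cut spurious links to the stands
    labels, n = label(mask)
    if n == 0:
        return mask
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                                  # ignore the background label
    largest = np.argmax(sizes)                    # keep the biggest connected area (the field)
    return binary_fill_holes(labels == largest)   # fill holes left by the players
```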
Figure 4a. Pixel classification using color coherence. Figure 4b. Main areas in the image. Figure 4c. Automatic field selection.
4. RESULTS
As an illustration of the whole process, we consider the initial example. Once we have selected the field area (Figure 4c), we only keep those lines that belong to it (Figure 5b).
Figure 5a. The initial image. Figure 5b. The selected lines.
We did not go through a theoretical study of the algorithm complexity. But, as an experimental result, we can say that on a classical laptop computer (less than 1 GHz): - The first process (discrete HLS space analysis) that is performed only when there are major changes, i.e. at the beginning of the sequence takes a few seconds - The current selection takes a few milliseconds
5. CONCLUSION
We have focused our work on the automatic landmark selection and validation because it is the key point of the global registration process. The approach we propose is very fast and robust: it is compatible with a rate of twenty images per second on a basic computer and it gives a correct selection even in very tricky situations. In addition, it does not depend on the color, except if there are various colors on wide areas of the field. Solutions to all the other steps have been implemented. They are not optimal but they make the global process run. One problem remains, which is automatically finding the logical relation between the detected landmarks and the corresponding model elements: this problem is easy to solve when the camera does not move a lot (i.e. when we stay in the same situation) but it would need the use of "Artificial Intelligence" procedures in the general case (i.e. when the camera is moving from one side of the field to the other).
REFERENCES
1. S. Mavromatis, J. Baratgin, J. Sequeira, "Reconstruction and simulation of soccer sequences," MIRAGE 2003, Nice, France.
2. S. Mavromatis, J. Baratgin, J. Sequeira, "Analyzing team sport strategies by means of graphical simulation," ICISP 2003, June 2003, Agadir, Morocco.
3. H. Ripoll, "Cognition and decision making in sport," International Perspectives on Sport and Exercise Psychology, S. Serpa, J. Alves, and V. Pataco, Eds. Morgantown, WV: Fitness Information Technology, Inc., 1994, pp. 70-77.
4. B. Bebie, "Soccerman: reconstructing soccer games from video sequences," presented at IEEE International Conference on Image Processing, Chicago, 1998.
5. M. Ohno, Shirai, "Tracking players and estimation of the 3D position of a ball in soccer games," presented at IAPR International Conference on Pattern Recognition, Barcelona, 2000.
6. C. Seo, Kim, Hong, "Where are the ball and players?: soccer game analysis with color-based tracking and image mosaick," presented at IAPR International Conference on Image Analysis and Processing, Florence, 1997.
7. S. Carvalho, Gattas, "Image-based Modeling Using a Two-step Camera Calibration Method," presented at Proceedings of International Symposium on Computer Graphics, Image Processing and Vision, Rio de Janeiro, 1998.
8. A. Amer, A. Mitiche, E. Dubois, "Context independent real-time event recognition: application to key-image extraction," International Conference on Pattern Recognition, Québec (Canada), August 2002.
9. S. Lefèvre, L. Mercier, V. Tiberghien, N. Vincent, "Multiresolution color image segmentation applied to background extraction in outdoor images," IST European Conference on Color in Graphics, Image and Vision, pp. 363-367, Poitiers (France), April 2002.
10. N. Vandenbroucke, L. Macaire, J. Postaire, "Color pixels classification in an hybrid color space," IEEE International Conference on Image Processing, pp. 176-180, Chicago, 1998.
MORE THAN COLOR CONSTANCY: NONUNIFORM COLOR CAST CORRECTION Techniques & Applications Majed Chambah CReSTIC, Université de Reims Champagne Ardenne, Reims, France.
Abstract:
Color constancy is a psychophysical phenomenon of the human visual system. Color constancy adaptation makes us perceive a scene stably regardless of changes in the color of the illuminant. Machine color constancy techniques aim to correct color casts introduced by an illuminant shift. These methods, however, are not suited for correcting stronger color casts caused by factors other than an illuminant shift. In this paper two recent methods for correcting non uniform color casts are presented. These methods were applied with success in the fields of digital film restoration and underwater imaging. The presented correction results are good and very promising.
Key words:
Color correction, color cast removal, color constancy, dye fading restoration, digital film restoration, underwater images.
1. INTRODUCTION
Color constancy is a psychophysical mechanism of the human visual system. It enables the human observer to identify objects in a scene even with changing the illuminant chromaticity. Man-made image processing equipment inherently does not have this capability. A scene illuminated with a non canonical illuminant (the canonical illuminant is the light source for which the camera is white balanced) introduces a color cast all over the taken picture of the scene. Several machine color constancy models were developed1-4. They aim to remove the introduced color cast by correcting the illuminant shift. The majority of color constancy models operates in two steps. The first step consists in estimating the illuminant of the scene, and the second one
consists in correcting the image by substituting the estimated illuminant with the canonical one. Common color constancy methods are suited for removing color casts introduced by an illuminant shift. They are often not suited for removing stronger color casts due to factors other than a simple illuminant shift, though. We present in this paper two recent techniques for correcting strong and non uniform color casts introduced either by more than one illuminant or by underwater environment or photochemical phenomena like color dye fading.
2. STRONG AND NON UNIFORM COLOR CASTS
In this paper we will focus mainly on two kinds of particularly strong and non uniform color casts. The first is introduced by a photochemical phenomenon: the bleaching of the color dyes of films. The second is caused by the underwater environment. The bleaching phenomenon is caused by spontaneous chemical changes in the image dyes of color films. Many films have taken on a distinct color cast, caused by the rapid fading of one or two image dyes. Dye fading is a chemically irreversible process. The color cast caused by the fading of the chromatic layers of the film is non uniform, i.e. it may have different colors in shadows and highlights. Fig. 3 illustrates a faded image with a cyan cast in highlights and a magenta cast in midtones and shadows. The correction of this kind of color cast has a significant cultural value, since all films are subject to fading, and hence represents a significant hope for cinematographic archives. Besides the fact that it can tackle artifacts that are out of reach of traditional photochemical restoration techniques, digital color restoration has the advantage of not affecting the original material, since it works on a digital copy. Images and videos taken in an aquatic environment present a strong and non uniform color cast due to the aquatic environment and to artificial (and/or natural) lighting. The cast has a different color and intensity in the foreground, in the background (greater depth), in shadows and in highlights. This non uniform cast has to be removed prior to any processing (object recognition, image retrieval, etc.) on this kind of images in order to recover the chromatic information inherent to the objects. Fig. 1 illustrates an example of an underwater image.
3. NON UNIFORM COLOR CAST CORRECTING TECHNIQUES
The techniques presented hereafter handle strong, non uniform and multiple color casts. Furthermore, they need no a priori information and can be used for a variety of images: natural, outdoor, indoor, underwater, etc.
3.1 ACE: Automatic Color Equalization
ACE, for Automatic Color Equalization5,6, is an algorithm for the unsupervised enhancement of digital images. It is based on a new computational approach that merges the "Gray World" and "White Patch" equalization mechanisms, while taking into account the spatial distribution of the color information. Inspired by some adaptation mechanisms of the human visual system, ACE is able to adapt to widely varying lighting conditions and to extract visual information from the environment efficiently. ACE has two stages. The first stage performs a chromatic spatial adjustment (responsible for color correction). It applies a sort of contrast enhancement, weighted by pixel distance, and produces an output image in which every pixel is recomputed according to the image content, approximating the visual appearance of the image. Each pixel p of the output image R is computed separately for each chromatic channel c as shown in Eq. (1), where I is the input image (notation as in the original papers5-7). The result is a local-global filtering.
$$R_c(p) = \frac{\displaystyle\sum_{j \in \mathrm{Im},\, j \neq p} \frac{r\big(I(p) - I(j)\big)}{d(p,j)}}{\displaystyle\sum_{j \in \mathrm{Im},\, j \neq p} \frac{r_{\max}}{d(p,j)}} \qquad (1)$$
I(i) - I(j) accounts for the basic pixel contrast interaction mechanism, d(.) is a distance function which weights the amount of local or global contribution, and r(.) is the function that accounts for the relative lightness appearance of the pixel; it has a saturation shape. The denominator of the fraction has been introduced to balance the filtering effect for pixels near the border, avoiding a vignetting effect. The distance d(.), with its parameter a, weights the global and local filtering effect (where x = i - j):
$$d(i,j) = d(x) = \frac{e^{a x^{2}} + e^{-a x^{2}}}{2} \qquad (2)$$
The second stage, a dynamic tone reproduction scaling, configures the output range to maximize the device dynamics. It maps the intermediate pixel array R into the final output image O. In this stage not just a simple linear maximization is performed: an estimate of the medium gray and of the maximum values of the intermediate image R are used to map the relative lightness appearance values of each channel into gray levels.
$$O_c(p) = \mathrm{round}\big(g + s_c\, R_c(p)\big) \qquad (3)$$
where g is the final medium gray and s_c is the slope of the mapping function.
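A naive, hedged sketch of the two ACE stages as read from Eqs. (1)-(3) is given below. It is not the authors' implementation: it is O(N^2) in the number of pixels (so only usable on small or subsampled images), r(.) is taken as a simple saturated linear function and d(.) as the Euclidean pixel distance, both of which are assumptions.

```python
import numpy as np

def ace_channel(channel, slope=5.0, r_max=1.0):
    h, w = channel.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)
    vals = channel.ravel().astype(np.float64) / 255.0

    r_out = np.empty_like(vals)
    for p in range(vals.size):
        d = np.linalg.norm(coords - coords[p], axis=1)
        d[p] = np.inf                                          # exclude j == p
        r = np.clip(slope * (vals[p] - vals), -r_max, r_max)   # saturated contrast term r(.)
        r_out[p] = np.sum(r / d) / np.sum(r_max / d)           # Eq. (1): local-global filtering

    # Second stage (Eq. 3): map relative lightness back to gray levels around medium gray.
    g = 127.5
    sc = g / max(np.max(np.abs(r_out)), 1e-6)
    return np.clip(np.round(g + sc * r_out), 0, 255).reshape(h, w).astype(np.uint8)

def ace(rgb_image):
    return np.dstack([ace_channel(rgb_image[:, :, c]) for c in range(3)])
```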
3.2 PHM: Progressive Hybrid Method
Classic color constancy methods such as gray world (GW) and white patch (WP) are global techniques, i.e. they determine the color cast from the whole image. They are hence only able to handle a uniform color cast. GW estimates the most prominent cast in the image, since it uses the mean of the image to estimate the cast. WP estimates the cast in highlights and is very sensitive to noise and clipping. PHM7 is based on a combination of GW and a Modified WP (more robust to noise and clipping effects). To avoid artifacts caused by a loss of gradation in intensity levels (since different methods are used depending on the intensity level of the current pixel), mid-tones (comprised between the thresholds h2 and h1) are corrected with a graduated combination of the GW and modified WP methods, as follows:
$$M = \frac{R + G + B}{3}$$
$$K_c = \begin{cases} \dfrac{S_c^{h_1}}{U_c^{h_1}} & \text{if } M \ge h_1 \\[2mm] \dfrac{S_c}{U_c} & \text{if } M \le h_2 \\[2mm] (1-\delta)\,\dfrac{S_c}{U_c} + \delta\,\dfrac{S_c^{h_1}}{U_c^{h_1}} & \text{if } h_2 \le M \le h_1 \end{cases} \qquad (4)$$
where U_c is the mean of channel c, U_c^{h1} is the mean of the pixels of channel c greater than h1, S_c is the target mean of channel c, S_c^{h1} is the target mean of the pixels of channel c greater than h1, and δ grades the combination of the two corrections in the mid-tones.
The corrected image O is obtained by applying a diagonal matrix:
$$\begin{bmatrix} K_R & 0 & 0 \\ 0 & K_G & 0 \\ 0 & 0 & K_B \end{bmatrix} \begin{bmatrix} R_I \\ G_I \\ B_I \end{bmatrix} = \begin{bmatrix} R_o \\ G_o \\ B_o \end{bmatrix} \qquad (5)$$
PHM is able to handle more than one cast in the same image. In fact GW estimates and corrects the color cast in mid-tones and shadows, while the Modified WP estimates and corrects the color cast in the highlights.
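A hedged sketch of this correction, as read from Eqs. (4)-(5), follows; the thresholds, target means and the linear blending weight delta are illustrative assumptions, not the published values.

```python
import numpy as np

def phm(rgb, h1=0.7, h2=0.3, target_mean=0.5, target_high=0.9):
    img = rgb.astype(np.float64) / 255.0
    m = img.mean(axis=2)                       # M = (R + G + B) / 3 per pixel
    high = m >= h1                             # highlight pixels used by the modified white patch

    out = np.empty_like(img)
    for c in range(3):
        u_c = img[:, :, c].mean()                          # channel mean (gray world)
        u_ch1 = img[:, :, c][high].mean() if high.any() else u_c
        k_gw = target_mean / max(u_c, 1e-6)                # M <= h2 : gray-world gain
        k_wp = target_high / max(u_ch1, 1e-6)              # M >= h1 : modified white-patch gain

        delta = np.clip((m - h2) / (h1 - h2), 0.0, 1.0)    # graduated mid-tone blend (assumed linear)
        k = (1.0 - delta) * k_gw + delta * k_wp            # per-pixel diagonal gain K_c
        out[:, :, c] = img[:, :, c] * k                    # Eq. (5): diagonal correction

    return np.clip(out * 255.0, 0, 255).astype(np.uint8)
```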
3.3 Experimental results
We tested the two techniques on a variety of faded and underwater images (3 faded sequences and 3 underwater sequences, which represent a total of 3000 images). We considered three kinds of images according to their chromatic diversity. Type 1 is the most chromatically diverse, like the overview image of Fig. 1. Type 3 is the least chromatically diverse, like the close-up image of Fig. 2. The results are judged visually (pleasantness and naturalness of the image), since we often do not have a ground truth (especially for digital restoration), and using statistical measures such as the polar hue histogram and RGB histograms5,8. Usual color constancy methods such as gray world and white patch are not efficient on this kind of images, since they are global and thus able to handle only uniform color casts. The GW method estimates the most prominent cast in the image, since it is based on the mean of the image. As a consequence, it corrects the mid-tones and the shadows of the image, but inverts the color cast in the highlights. It performs well only on images of type 1, since they are the most chromatically diverse. The WP method estimates the cast from the highlights. It is effectless on these images since the videos of the fish tanks are (in our case) over-saturated (some values are clipped) because of the variable lighting conditions. Since it is sensitive to noise and clipping, WP is not suited for underwater video color correction. The PHM method gives better results than GW since its combination with WP (actually MWP: Modified White Patch) permits to lighten the reverse color cast. PHM performs best on images of type 1, the most chromatically diverse ones. ACE gives the best results since it adapts widely to different non uniform color casts and is unsupervised. All the casts are removed from images of types 1 and 2. In the case of type 3 images, ACE introduces less reverse cast than the other methods and the object colors are correct. This fact is what
matters most when the color correction is done prior to processing such as object segmentation and/or recognition. Table 1 summarizes the comparison of PHM and ACE with the basic color constancy methods GW and WP. Figs. 1 to 4 illustrate some examples of image correction.

Table 1. Experimental results summary.
GW – Type 1: (+) mid-tones and shadows corrected; (-) reverse cast in highlights. Type 2: (+) overall balanced image; (-) a residual color cast. Type 3: (-) strong reverse cast.
WP – Types 1-3: effectless, since there are clipped values.
ACE – Type 1: (+) best balanced image; (+) very slight reverse cast. Type 2: (+) best balanced image; (+) good contrast. Type 3: (+) best balanced image; (-) slight reverse cast; (+) correct color of objects.
PHM – Type 1: (+) mid-tones and shadows corrected; (-) slight reverse cast in highlights. Type 2: (+) overall balanced image; (-) a residual color cast. Type 3: (-) reverse cast; (+) more balanced image than GW.

4. CONCLUSION
We have presented in this paper two methods for non uniform color cast removal. Both performed better than classic color constancy methods, which are suited only to illuminant shift correction. The methods were tested on series of faded and underwater images. The best performing method is the ACE based technique, which adapts to a variety of color casts and to different kinds of images (from overview to close-up) with no a priori knowledge about the scene.
REFERENCES
1. G. Finlayson, S. Hordley, "A theory of selection for gamut mapping color constancy", Proceedings of IEEE Computer Vision and Pattern Recognition, pp. 60-65, 1998.
2. H.-C. Lee, R. M. Goodwin, "Colors as seen by humans and machines", IS&T Annual Conference Proceedings, 1994, pp. 401-405.
3. J. Holm, "Adjusting for the scene adopted white", IS&T's PICS Conference, 1999, pp. 158-162.
4. B. Funt, V. Cardei, K. Barnard, "Learning color constancy", IS&T/SID 4th Color Imaging Conference, 1996, pp. 58-60.
5. M. Chambah, A. Rizzi, C. Gatta, B. Besserer and D. Marini, "Perceptual approach for unsupervised digital color restoration of cinematographic archives", SPIE Electronic Imaging, Santa Clara, California (USA), January 2003.
6. A. Rizzi, C. Gatta and D. Marini, "A New Algorithm for Unsupervised Global and Local Color Correction", Pattern Recognition Letters, vol. 24 (11), pp. 1663-1677, July 2003.
7. M. Chambah, B. Besserer and P. Courtellemont, "Recent Progress in Automatic Digital Restoration of Color Motion Pictures", SPIE Electronic Imaging, vol. 4663, pp. 98-109, San Jose, CA, USA, January 2002.
8. M. Chambah, D. Semani, A. Renouf, P. Courtellemont, A. Rizzi, "Underwater color constancy: enhancement of automatic live fish recognition", SPIE/IS&T Electronic Imaging 2004, San Jose, CA, USA, January 2004, vol. 5293, No. 19.
These figures can be found on this URL: http://leri.univ-reims.fr/~chambah/papers/ICCVG2004.pdf
IMAGE SEGMENTATION BASED ON GRAPH RESULTING FROM COLOR SPACE CLUSTERING WITH MULTI-FIELD DENSITY ESTIMATION
Wojciech Tarnawski Chair of Systems and Computers Networks,Faculty of Electronics, Technical University of Wroclaw, Wybrzeze Wyspianskiego 27, Poland, [email protected]
Abstract:
This paper presents a region-based segmentation algorithm executed in the image domain on the basis of the cluster set obtained by color space clustering with multi-field density estimation1. It is based on a graph constructed by taking into consideration color similarity, spatial adjacency and also indexes generated by the multi-field estimation. Segmentation results on standard color test images are also presented.
Key words:
image segmentation, region based segmentation, graph theory.
1. INTRODUCTION
The two-stage approach to the image segmentation process has lately become a very actively researched area2-4. In this paper the second stage of this process is proposed: a region-based segmentation algorithm working on the results of color space clustering with multi-field density estimation1. At the beginning of this process we map the clusters obtained in the color domain to the image domain. Each region in image space contains pixels that are spatially connected and belong to the same cluster in color space. The region segmentation algorithm is based on graph theory5 and takes into consideration color similarity, spatial adjacency and also indexes based on the multi-field estimation. The idea of the presented region-based segmentation is realized by means of a modification of this graph. It is needed
to point out that regions which are separated in the image domain can be represented by the same cluster in color space. This means that the color space clustering determines the minimal number of regions (segments) in the image domain. The one constraint imposed on the segmentation result obtained from color space clustering is that the resulting segmented image cannot be under-segmented.
2. REPRESENTATION OF CLUSTER SET BY THE DIRECTED GRAPH
Let Ω = {ω_1, ω_2, ..., ω_r} be the set of clusters found by the color space clustering. Every cluster ω_i, i = 1, 2, ..., r, is described by means of its prototype as follows:

$$P(\omega_i) = \left( \mathbf{f},\ \hat{h}_{p_2}(\mathbf{f}) \right) \qquad (1)$$

where f is the 3-dimensional vector describing the localization of the cluster prototype in the color space and ĥ_{p2}(f) is the value of the p2-order multi-field density estimator, where p2 is used in Eq. 3 in [1]. Let T be a matrix of size r x r describing the probability of mutual coexistence of pixels belonging to two clusters ω_i, ω_j. A single element of this matrix is defined as follows:

$$T[i,j] = \frac{K_{ij}}{K_i}, \qquad \omega_i, \omega_j \in \Omega \qquad (2)$$

where K_{ij} is the total number of image pixels whose color belongs to cluster ω_i and which are 8-neighbours of pixels whose color belongs to cluster ω_j, and K_i is the total number of pixels whose color belongs to cluster ω_i. According to the laws of probability the following quantities are also defined:

$$\bar{T}[i,j] = 1 - T[i,j], \qquad \Pr{}_T(\omega_i) = \sum_{j=1}^{r-1} T[i,j], \qquad \omega_i, \omega_j \in \Omega \qquad (3)$$
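A small sketch of how the coexistence matrix T of Eq. (2) could be computed from a label image is shown below; it is an assumed implementation that counts ordered 8-neighbour pairs, which is one possible reading of K_{ij}.

```python
import numpy as np

def coexistence_matrix(labels, r):
    """labels: 2-D int array with values in 0..r-1 (cluster index per pixel)."""
    t_counts = np.zeros((r, r), dtype=np.int64)
    h, w = labels.shape
    # Count, for every pixel of cluster i, its 8-neighbours belonging to cluster j.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    for dy, dx in offsets:
        src = labels[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
        dst = labels[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
        np.add.at(t_counts, (src.ravel(), dst.ravel()), 1)
    k_i = np.bincount(labels.ravel(), minlength=r).astype(np.float64)  # K_i per cluster
    return t_counts / np.maximum(k_i, 1)[:, None]                      # T[i, j] = K_ij / K_i
```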
Let D be a matrix of size r x (r-1) describing the distances ||P(ω_i) - P(ω_j)||, ω_i ≠ ω_j, between cluster prototypes in the color space. The last definition concerns a discrete distribution function defined from the histogram H calculated over the segmented image, in which every pixel is described by the label of its cluster from the set Ω:

$$B(\omega_i) = \sum_{j=1}^{i} H(\omega_j) \qquad (4)$$
Step 1 – sort the cluster prototypes P(ω_1), P(ω_2), ..., P(ω_r) in decreasing order, resulting in the sorted set denoted Ω*, according to the following index τ ascribed to every cluster ω_i (the definition of the index Δĥ_{(p1,p2)}(P(ω_i)) is given in the literature1):

$$\tau(\omega_i) = \max\!\left\{ \Delta\hat{h}_{(p_1,p_2)}(P(\omega_i)),\ \hat{h}_{p_2}(P(\omega_i)) \right\} \cdot \max\!\left\{ \Pr{}_T(\omega_i),\ T[i,i] \right\} \qquad (5)$$
This is the most important step in the graph construction, because the order of the cluster prototypes determines the later hierarchical relations, which in turn determine the sequential region merging in the image domain. Step 2 – sort the rows of the matrices T, D according to the order in the set Ω*, resulting in matrices denoted T*, D*. Let i be the index related to the order in the set Ω, and k the index related to the order in the set Ω*. Next, sort the elements of every k-th row of matrix D* in increasing order and then the elements of every k-th row of matrix T* in decreasing order. Step 3 – for k = 1, 2, ..., r look through every row of matrix T* and select the rows k having at the first position a value equal to T[i,i]. Next, define the subset Σ ⊂ Ω*, Σ = {σ_k}, including the clusters indexed by the selected values of k. Step 4 – look from the beginning through every element σ_k of the subset Σ, calculating the difference δ(σ_k) = τ(σ_k) - B(σ_k), and select k̂ = arg min_k δ(σ_k). Next, remove from the set Σ the elements σ_k for which k > k̂.
Finally, we construct a directed graph G_S whose nodes are the elements of the sets Σ and Ω*. The nodes from the set Σ will be called terminal nodes, denoted T_k, and those from the set Ω* normal nodes, denoted W_k. All nodes are indexed by k and the following conditions are fulfilled:
- every normal node W_k is the initial node of exactly two directed arcs, denoted e_k^1, e_k^2, and can also be a final node;
- terminal nodes are final nodes of directed arcs and are not the initial node of any arc;
- with every e_k^1, e_k^2 are associated the weights w_k^1, w_k^2, defined as follows. With every normal node W_k represented by ω_k ∈ Ω*, ω_k ∉ Σ, the following magnitudes are connected: d_k^min, equal to the value D*[k,1] describing the distance to the nearest cluster prototype, the corresponding cluster being denoted ω_k^d ∈ Ω*; and c_k^max = T*[k,1] only if T*[k,1] ≠ T[i,i], or c_k^max = T*[k,2] if T*[k,1] = T[i,i], the corresponding cluster being denoted ω_k^c. If P(ω_k^c) = P(ω_k^d) then w_k^1 = w_k^2 = 1, else w_k^1 = T[d,d] - T[d,k], where the index d is related to the cluster ω_k^d, and w_k^2 = T[c,c] - T[c,k], where the index c is related to the cluster ω_k^c.
3. REGION SEGMENTATION BASED ON GRAPH MODIFICATION
The region segmentation algorithm consists in a modification of the graph G_S and results in a set of directed sub-graphs. The hierarchical form of the obtained graphs makes it possible to execute the region merging process in the image space. The graph modification relies on generating, for the normal nodes W_k, parent nodes R(W_k) among the terminal nodes T_k, and on transforming part of the normal nodes into terminal nodes. The arc connecting a node with its parent node is chosen between the arcs e_k^1, e_k^2 and will be denoted e_k^*. The proposed
algorithm has two user-defined thresholds: T_C and T_D. The first parameter is the threshold on the distance between cluster prototypes in the color space below which cluster merging can be realized. The second determines the minimal value of the multi-field density estimator ascribed to a cluster prototype. The algorithm of the graph modification is the following. Determine the values of the user-defined parameters T_C, T_D. Look through the normal nodes W_k in the order k = r, r-1, r-2, ..., k̂. Step 1 – for every normal node W_k determine its parent node R(W_k), resulting in a node W_t, t ≠ k, connected with the proper cluster, on condition that R(W_t) does not exist and under the constraint expressed by the inequality w_k^1 ≤ T_C, as follows:
$$R(W_k) = \begin{cases} \omega_k^c & \text{if } P_{\hat{h}}(\omega_k^c) \le T_D \\ \omega_k^c & \text{if } P_{\hat{h}}(\omega_k^c) > T_D \ \wedge\ P(\omega_k^c) = P(\omega_k^d) \\ \omega_k^d & \text{if } P_{\hat{h}}(\omega_k^d) > T_D \ \wedge\ w_k^1 > w_k^2 \\ \omega_k^c & \text{if } P_{\hat{h}}(\omega_k^c) > T_D \ \wedge\ w_k^1 < w_k^2 \\ \omega_k^c & \text{if } P_{\hat{h}}(\omega_k^c) > T_D \ \wedge\ w_k^1 = w_k^2 \end{cases} \qquad (6)$$
792 prototypes are added to the value of this estimator ascribed to the cluster prototype which is the parent node. Basis of received parent nodes the region merging in image domain is realized.
4.
RESULTS
The presented color image segmentation algorithm has been implemented and later tested on the variety of standard test images and in the practical tasks of image analysis in industry6 and medicine7. On the Figure 1 (next page) are presented segmentation results for few standard test images presented in publications concerning with image analysis. They are: “objects”, “fruits”, “house” and “peppers”. The presented results were received based on the cluster sets received by means of the multi-field density estimation calculated from the image homogram3 built in modified HSI* color space8. The results of color image segmentation are organized as follows: first column includes input images, second – after color space clustering1, and the last – final results after region merging processing described in the current paper. For images “fruits” and “objects” the edges are presented for better visualization.
REFERENCES 1. 2. 3. 4.
5. 6. 7.
8.
W. Tarnawski, Multi-field density estimation: A robust approach for color space clustering in image segmentation, Proceedings of ICCVG’2004 (in this volume). Q. Chen, Yi L., Color image segmentation – an innovative approach, Pattern Recognition 35, pp. 395 – 405, (2002). H.D. Cheng, X.H Jiang., J. Wang, Color image segmentation based on homogram thresholding and region merging, Pattern Recognition 35, pp. 373 – 393, (2002). D. Comaniciu P. Meer , Mean shift: a robust approach toward feature space analysis, Transactions on Pattern Analysis and Machine Intelligence, vol. 24, No.5, pp. 603 – 619, May, (2002). R. Tadeusiewicz , M. FlasiĔski, Pattern Recognition (in polish), PWN, Warszawa, (1991). W.Tarnawski , Colour image segmentation algorithm in vectoral approach for automated optical inspection in electronics, Opto-Electronics Review, 11(3), 197-202 (2003). T.KrĊcicki , D. Dus , J. Kozlak, W. Tarnawski, M. Jelen, M. Zalesska-Krecicka , T. Szkudlarek , Quantitative evaluation of angiogenesis in laryngeal cancer by digital image measurement of the vessel density, Auris Nasus Larynx, 29, pp. 271-276, (2002). W. Tarnawski, Fast and efficient colour image segmentation based on clustering modified HSI space, Proc. 4th Sci. Symp. On Image Processing Techniques, Warsaw University of Technology, Serock, 244-257, (2002) (in Polish).
Figure 1. Image segmentation results on few test images. In each row, the original image is shown on the left side and it is followed by the image segmentation results generated by color space clustering and later after region merging process.
MULTI-FIELD DENSITY ESTIMATION: A ROBUST APPROACH FOR COLOR SPACE CLUSTERING IN IMAGE SEGMENTATION
Wojciech Tarnawski Chair of Systems and Computers Networks,Faculty of Electronics, Technical University of Wroclaw, Wybrzeze Wyspianskiego 27, Poland, [email protected]
Abstract:
This paper describes color space clustering based on multi-field density estimation in the color image segmentation process. It is based on a mode seeking approach executed with a multi-field pdf estimator calculated from the 3D histogram or homogram1. It results in clusters found in the color space of the segmented image. The next stage2 merges clusters in the image domain based on a graph modification. Unlike many existing clustering algorithms, the proposed approach does not require knowledge of the number of color clusters to be generated. Moreover, the clustering result is controlled by a cluster-validity measure, which makes possible a correct adjustment of the parameters of the proposed estimation.
Key words:
image segmentation, nonparametric density estimation, cluster analysis, feature space smoothing.
1. INTRODUCTION
Image segmentation is a process of partitioning image pixels based on selected image features. The pixels that belong to the same region must be spatially connected and have similar image features. If the selected segmentation feature is color, the color image histogram or homogram1 includes modes representing classes of objects/regions. In this paper we describe a color image segmentation system that performs color clustering in a color space. The color region segmentation in the image domain is presented in the literature2. Clusters can be viewed as regions of the color space in which
patterns are dense, separated by regions of low pattern density. That is why, in the first stage of image segmentation, an estimation of the probability density function (pdf) is performed on the basis of the histogram or homogram calculated in the color domain of the image. The clustering procedure consists in a mode-seeking approach in the pdf, where every mode corresponds to a class of objects/regions. Next, each mode is associated with a cluster prototype and each feature is assigned to the appropriate cluster. Every existing mode-seeking approach requires isolated, smoothed and steep modes. However, a histogram or homogram is usually composed of several overlapping and "noisy" modes. The simplest way to identify modes is to partition the feature space into a number of non-overlapping regions or cells and then to seek cells with relatively high frequency counts. The choice of the size of these cells is critical and is commonly governed by the Parzen window approach3 or by the nearest-neighbour approach4. Even if the cell-size requirement is met, the success of such an approach depends on two factors: cells of small volume give a very "noisy" estimate of the density, whereas large cells tend to overly smooth the density estimate. The multi-field density estimation described in this paper takes into account both small- and large-scale characteristics of the pdf distribution. The clustering procedure, based on a graph-theoretic approach5, is additionally controlled by a cluster-validity measure also defined in terms of the proposed multi-field estimator.
2. MULTI-FIELD DENSITY ESTIMATION
The pixel attributes of a color image are commonly represented as a 3D vector in a previously selected color space. In the first stage, we developed a multi-field density estimation algorithm to generate clusters of similar colors using a color histogram or homogram of an image, which represents a 3D discrete feature space. The one constraint is that the clustering algorithm requires a continuous pdf with isolated, smoothed and steep modes. The proposed estimation is based on the idea of multipole expansion7, commonly used in physics, particularly in problems involving the gravitational field of mass aggregations and the electric and magnetic fields of charge and current distributions. It should be noticed that the idea described in this paper is inspired by the multipole expansion but does not reflect the original conception. In our case "the charges" are the histogram or homogram values, and the proposed, simplified algorithm has a smaller computational complexity. Firstly, it assumes that the analyzed feature space is divided into cubes, as shown in Figure 1.
Figure 1. The idea of feature (color) space division.
The size of a single cube is always 2^{P_x} x 2^{P_y} x 2^{P_z}, where P_x, P_y, P_z = 1, 2, 3, .... This constraint makes further fast computation possible by using bit-shift operations. Every single cube is divided into eight sub-cubes, each of size 2^{(P_x-1)} x 2^{(P_y-1)} x 2^{(P_z-1)}, etc. The number of grids, denoted p, and the size of the biggest cube, described by means of the parameter c (understood as the power factors P_x, P_y, P_z), are the estimation parameters and they mainly decide about the quality of the segmentation. The set of cubes of the same size defines a single grid indexed by m, where m = c-1, c-2, c-3, ..., c-p. With every cube, denoted N_ψ^m, is associated the weight w expressed by the conditional expectation:
$$w(N_\psi^m) = E\left(\Theta^* \mid \Theta^* \in N_\psi^m\right) = \frac{1}{V(N_m)} \sum_{\theta^* \in N_\psi^m} \theta^* \Pr\left(\theta^* \mid \Theta^* \in N_\psi^m\right) \qquad (1)$$

where ψ = [ψ^(1), ψ^(2), ψ^(3)]^T is a vector denoting the localization of the cube's centroid in the feature space, and Pr(θ* | Θ* ∈ N_ψ^m) denotes the conditional probability, with the continuous random variable Θ* approximated by the image histogram or homogram values θ* ∈ Θ*. The normalized multi-field p-order estimator ĥ_p(θ*) of the pdf is defined by the following formula:
Multi-field Density Estimation: a Robust Approach for Color Space Clustering in Image Segmentation hˆ p (T * )
8G
3
6 ª º \ 1 * \ \ « ¦ « m (c p) V ( N ) ¦ w( N j j ) I m ( j )»» m j 1 m c p ¬ ¼ m c 1
1 2S
3
where: * \ m \ ( c p )
797
(2)
are the values (weights) of the Gaussian
distribution assigned for consecutive (according to index m) grids, I m ( j ) are interpolation values and V ( N m ) represents volume of the single cube in the m-th grid.
3. CLUSTERING ALGORITHM AND CLUSTER VALIDITY MEASURE
The image segmentation process starts with locating the peaks of the pdf estimated by means of the multi-field estimation in the 3D color space. This is done by the graph-theoretic clustering developed by Koontz, Narendra and Fukunaga5, and it is realized for the cube centroids from the most dense grid, indexed by m = c-p. The found clusters are described by means of directed trees composed of nodes equivalent to the mentioned cubes. The cubes with maximal values of ĥ_p(θ*) become cluster prototypes, and the cubes related to a prototype belong to its cluster. It is clear that the parameters of the multi-field estimation, i.e. the cell size c in the most dense grid and the number of grids p, play a major role in the success of a proper clustering. They decide about the "degree of smoothing" of the feature space, which may imply too large a number of clusters (over-segmentation) or may combine several peaks into a single cluster, resulting in incorrect segmentation in the form of under-segmentation. The under-segmentation result is undesirable, and the following cluster validity measure was defined to avoid this problem. Let Δĥ_{(p1,p2)}(P) be the increase measure of the cluster prototype P with the p1- and p2-order (p2 > p1, p2 = p1 + 1) multi-field estimator values, defined on the basis of the Canberra measure:
$$\Delta\hat{h}_{(p_1,p_2)}(P) = 1 - \frac{\left|\hat{h}_{p_1}(P) - \hat{h}_{p_2}(P)\right|}{\hat{h}_{p_1}(P) + \hat{h}_{p_2}(P)}, \qquad \Delta\hat{h}_{(p_1,p_2)}(P) \in [0,1] \qquad (3)$$
Let K_F = {P_1, P_2, ..., P_r} represent the set of cluster prototypes found by the clustering procedure. The cluster validity measure Q_F(P_1, P_2, ..., P_r) ∈ [0,1] is defined by the following formula:
$$Q_F(P_1,\ldots,P_r) = \frac{\displaystyle\sum_{P_i^*} \Delta\hat{h}_{(p_1,p_2)}(P_i^*) \left( 1 - \frac{D}{\min_{j \in K_F\setminus\{i\}} \left\| P_i^* - P_j^* \right\|} \right)}{\operatorname{card}\left( \left\{ P_i : \Delta\hat{h}_{(p_1,p_2)}(P_i) > T_F \right\} \right)} \qquad (4)$$
where D is the distance between the cube centroids in the most dense grid indexed by m = c-p, the symbol ||.|| denotes a norm, and T_F is a user-defined parameter (in practice T_F = 0.95) which determines the set of prototypes marked with *, i.e. those for which the condition Δĥ_{(p1,p2)}(P_i) > T_F is fulfilled; card denotes cardinality. Because the set of parameters of the multi-field estimation is finite, the correct set of cluster prototypes is the one for which Q_F(P_1, P_2, ..., P_r) is maximal.
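A hedged sketch of this validity measure, following the reconstructed Eqs. (3)-(4), is given below; the prototype coordinates and the two estimator values per prototype are taken as inputs, and the function names are illustrative.

```python
import numpy as np

def canberra_increase(h_p1, h_p2):
    # Delta h_(p1,p2)(P) = 1 - |h_p1 - h_p2| / (h_p1 + h_p2)
    return 1.0 - np.abs(h_p1 - h_p2) / (h_p1 + h_p2)

def cluster_validity(prototypes, h_p1, h_p2, cell_distance, t_f=0.95):
    """prototypes: (r, 3) centroid coordinates; h_p1, h_p2: (r,) estimator values;
    cell_distance: distance D between cube centroids in the densest grid."""
    delta = canberra_increase(np.asarray(h_p1, float), np.asarray(h_p2, float))
    strong = delta > t_f                      # prototypes marked with * in the text
    if not strong.any():
        return 0.0
    pts = np.asarray(prototypes, float)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.min(axis=1)               # min over j of ||P_i - P_j||
    score = np.sum(delta[strong] * (1.0 - cell_distance / nearest[strong]))
    return score / np.count_nonzero(strong)   # divide by card{P_i : delta > T_F}
```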
4. RESULTS
In this paper image segmentation results are presented for the test image called "objects" (Fig. 2a), illustrating the correlation between the segmentation result and the value of the proposed index Q_F. The presented results were obtained from the pdf estimated by means of the multi-field density estimation calculated from the image homogram1 built in the modified HSI* color space6. The results are organized as follows: in the first row the input image "objects" is presented; the second row includes segmentation results for p = 2 and c = 3 (Fig. 2b), c = 4 (Fig. 2c), c = 6 (Fig. 2d). For every segmentation result the value of the Q_F index is shown in the upper right corner. A detailed analysis of the results (look at the image fragments marked by circles) shows that the index value Q_F decreases particularly when a region is classified to an incorrect cluster, in spite of a better segmentation in general (compare Fig. 2d and Fig. 2c). However, the image with more (and proper) regions (over-segmentation) is the better result, because the found regions can be connected in the region segmentation process.
Figure 2. Segmentation results for different parameters of the multi-field density estimation. The values of the cluster validity measure Q_F are also presented (more details in the text).
REFERENCES
1. H.D. Cheng, X.H. Jiang, J. Wang, Color image segmentation based on homogram thresholding and region merging, Pattern Recognition, 35, pp. 373-393, (2002).
2. W. Tarnawski, Image segmentation based on graph resulting from color space clustering with multi-field density estimation, Proceedings of ICCVG'2004 (in this volume).
3. E. Parzen, On estimation of a probability function and mode, Ann. Math. Statist., 33, pp. 1065-1076, (1962).
4. R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis (John Wiley & Sons, New York, London, Sydney, Toronto, 1973).
5. W. Koontz, K. Fukunaga, P. Narendra, A graph-theoretic approach to nonparametric cluster analysis, IEEE Transactions on Computers, vol. C-25, No. 9, pp. 936-944, (1976).
6. W. Tarnawski, Fast and efficient colour image segmentation based on clustering modified HSI space, Proc. 4th Sci. Symp. on Image Processing Techniques, Warsaw University of Technology, Serock, 244-257, (2002) (in Polish).
7. E. Weisstein, World of Physics: Multipole expansion; http://scienceworld.wolfram.com/physisc/MultipoleExpansion.html
FAST COLOR IMAGE SEGMENTATION BASED ON LEVELLINGS IN FEATURE SPACE Thierry Geraud, Giovanni Palma, Niels Van Vliet EPITA Research and Development Laboratory 14-16 rue Voltaire, F-94276 Le Kremlin-Bicetre France [email protected]
Abstract
This paper presents a morphological classifier with application to color image segmentation. The basic idea of a morphological classifier is to consider a color histogram as a 3-D gray-level image, so that morphological operators can be applied to it. The final objective is to extract clusters in color space, that is, identify regions in the 3-D image. In this paper, we particularly focus on a powerful class of morphology-based filters called levellings to transform the 3D histogram-image to identify clusters. We also show that our method gives better results than other state-of-the-art methods.
Keywords:
Classification; color image segmentation; color spaces; mathematical morphology; levellings.
1. INTRODUCTION
A classical approach to segmentation is to perform data classification in a judiciously chosen feature space. In the case of color images, trivial spaces are color ones such as red-green-blue (RGB) space or others more relevant with respect to human color perception. However, when dealing with natural images, image segmentation is a rather difficult task. In such images, objects are often textured, specular, and subject to color gradation and to noise. Consequently, color modes or classes usually do not have “simple” shapes in feature space, that is, they cannot be described easily by parametric models such as those described in (Comaniciu and Meer, 1997). In this context, mathematical morphology appears to be a suitable tool for studying data and extracting classes. Every morphological classifier considers histograms as 3-D digital images in order to process them with common image operators. Segmentation of the color space is then used to classify the pixels in the original image. The main problem for natural images is to avoid over-segmented results.
In (Postaire et al., 1993) a very simple morphological classifier based on binary mathematical morphology is proposed. The 3-D histogram is first thresholded to get a binary image in which only cluster cores appear. A morphological closing is then applied for regularization purpose and a connected component labelling process identifies the clusters. Unfortunately, this method does not take full advantage of the “level-shape” of histograms. In (Zhang and Postaire, 1994) an evolution of the former method is described. Before thresholding, the 3-D histogram is pre-processed by a morphological filter which digs the valleys, in order to increase the separability of clusters. A major problem of this method is that the initial relief between two clusters must be contrasted enough for them to be separated. In (Park et al., 1998) a difference of Gaussians from the histogram is computed and then thresholded. The resulting binary image of cluster cores is processed by a morphological closing and a connected component labeling is performed. Each component, i.e. each cluster, is then dilated to enlarge its volume in the feature space. At this stage, one cannot assign a label to every color: some colors of the original image do not belong to any cluster of the color space. Park et al. propose to assign such colors to their respective nearest clusters. Last, a method is proposed in (Barata and Pina, 2002) which relies on morphological operators to model the clusters of training sets before to determinate class boundaries in feature space. However, this is not an automatic classifier. This paper describes a very simple, efficient and effective clustering method based on a morphology study of data in color space. In section 2 we present a general scheme for histogram filtering and classification and we recall the definitions and some properties about openings, connected filters and levellings. In section 3 we explain how to modify the histogram in feature space to obtain relevant classification and we compare our method with others. Last we conclude in section 4.
2. ABOUT HISTOGRAM FILTERING AND MATHEMATICAL MORPHOLOGY OPERATORS
For the sake of clarity, this section deals with 1-D functions: IR → IR. However, every notion given here extends naturally to higher dimensions. In the case of color images, these functions are IR3 → IR.
2.1 A General Scheme for Histogram Filtering and Classification
Basically, clustering in feature space aims at finding relevant peaks in this space. Consider the histogram of a gray image given in Figure 1. Without prior knowledge about the underlying intensity distributions of the object appearing
in the original image, we can assume that proper locations for peak separations are close to minimum values for the function. These values are pointed out by the red bullets on the left diagram in Figure 1.
Figure 1. Histogram Filtering.
In a histogram most of the maxima are not representative of the presence of classes. Such maxima are just due to local variations of the function. They should be removed in order to keep significant peaks only, thus avoiding an over-classification of the feature space. A simple approach is to apply a filter that keeps (respectively removes) the proper (resp. invalid) maxima. The result of such a filter is depicted on the right diagram in Figure 1. Every function maximum corresponds exactly to one relevant histogram peak. This is depicted by the green bullets on the right diagram in Figure 1. Furthermore, the expected locations of class separators (depicted by the bullets on the left diagram in Figure 1) are the only minima appearing in the filtered function. Partitioning the feature space into classes is thus equivalent to putting a frontier between every two maxima. Such frontiers should be located on function minima. If we consider the negative of the filtered function (that is, maxx (f (x)) − f (x)), the classification problem can be re-phrased as follows: partitioning the feature space into classes is equivalent to separating the function minima, and the separations between minima should be located on function crest values. This operation is performed using a morphological filter, the geodesic watershed transform, as described in (Vincent and Soille, 1991). Finally, we end up with the following classification scheme, which is a much simplified version of the one that has already been proposed in (Geraud et al., 2001).
1. Compute the image histogram;
2. if needed, regularize this histogram to get a better description of the data in feature space;
3. apply a filter on this function to suppress inconsistent maxima;
4. invert the result;
5. run the watershed transform to get a partition of the feature space.
Obviously, the quality of such a classifier is highly dependent on the properties of the filter used in step 3. Thus, choosing and designing the appropriate filter is a critical step.
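A hedged sketch of this five-step scheme, assuming scikit-image and SciPy, is shown below; the area opening used as the filter of step 3 is a stand-in for the volume levelling discussed later, and the bin count, sigma and area threshold are arbitrary illustrative values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.morphology import area_opening
from skimage.segmentation import watershed

def classify_colors(pixels, bins=32, sigma=0.5, area=50):
    """pixels: (N, 3) array of color coordinates; returns a class label per histogram cell."""
    hist, edges = np.histogramdd(pixels, bins=(bins,) * 3)                 # step 1: 3-D histogram
    hist = gaussian_filter(hist, sigma=sigma)                              # step 2: regularisation
    filtered = area_opening(hist.astype(np.float64), area_threshold=area)  # step 3: keep relevant peaks
    inverted = filtered.max() - filtered                                   # step 4: invert
    classes = watershed(inverted)                                          # step 5: feature-space partition
    return classes, edges
```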
2.2 About Openings, Connected Filters and Levellings
Let us consider a function f defined on points and whose values f (x) are quantified with n bits (this assumption allows us to simplify notations): f : X → [0..2^n − 1], where X is a set of points. If n = 1, f is a Boolean function; if n > 1, we will say that f is a scalar function. The flat zone of f containing x, denoted by Γx (f ), is the largest connected component that includes x and such that ∀x′ ∈ Γx (f ), f (x′) = f (x). We have Γx (f ) ⊂ X. In the following, when a set of points Z ⊂ X is given, we will denote Z^(i) a connected component of Z, so that Z = ∪i Z^(i). Given a function f , ft (x) is defined as: ft (x) = 1 if f (x) ≥ t, 0 otherwise, and the set Ft as: Ft = { x | ft (x) = 1 }
Morphological filters. A filter Φ is a morphological filter if it verifies two properties: it should be increasing (f ≤ g ⇒ Φ(f ) ≤ Φ(g)) and idempotent (Φ ◦ Φ = Φ). A morphological opening is an anti-extensive morphological filter (γB ≤ id, where B is a structuring element) that can be expressed as the composition of an erosion and a dilation using a structuring element, as defined in (Soille, 1999). Because the morphological opening is an anti-extensive filter, it can be used to suppress local maxima while "globally" keeping the information that was contained in the original image. However, these basic filters shift contours. This drawback makes their use prohibitive when object contours should be perfectly preserved (Geraud, 2003). Our objective is now to move to morphology-based filters that satisfy the contour preservation property. This family of filters is known as connected operators (Serra and Salembier, 1993). Connected operators. A filter ψ is a connected operator if the flat zones of the input function are included in the flat zones of the output function: y ∈ N (x) and f (x) = f (y) ⇒ ψ(f )(x) = ψ(f )(y), where N (x) denotes the neighborhood of x. An equivalent definition comes with the decomposition of f into flat zones: ψ is a connected operator if ∀x, Γx (f ) ⊂ Γx (ψ(f )). A criterion κ defined over a set is increasing if: (Z ⊆ Z′ and Z′ satisfies κ) ⇒ Z satisfies κ. The trivial opening of a connected set Z is defined by:
γκ (Z) = Z if Z satisfies κ, ∅ otherwise. This definition is trivially extended to a non-connected set Z = {Z^(i)} following γκ (Z) = ∪i γκ (Z^(i)). An attribute opening γκ of a function f relies on an increasing criterion κ: γκ (f ) = ∨t t·γκ (ft ), where γκ (Ft ) = ∪i γκ (Ft^(i)).
Figure 2. Area Opening and Volume Levelling.
A classical attribute opening is the area opening. The corresponding criterion is αλ, such that αλ (Z) is verified iff |Z| ≥ λ, where |Z| denotes the number of points of Z and where λ is a given threshold. In Figure 2, a scalar function f has been decomposed into the sets Ft^(i). The set F5 is depicted in red. It has three connected components, and F5^(1) = [75, 90] (so |F5^(1)| = 16). Filtering f by an area opening with λ = 15 gives the function γα15 (f ) that appears in Figure 2 when stacking the bold lines. For instance, we have γα15 (f )(120) = 12. Put differently, the bold lines represent the sets Ft^(i) which verify the criterion α15. For instance, the criterion is not verified for F5^(3) = {180, .., 190} since |F5^(3)| = 11, so we have γα15 (f )(185) < 5. Please note that the flat zone of f5 containing x = 185 is Γ185 (f5 ) = F5^(3).
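A small numeric check in the spirit of this example, under the assumption that scikit-image's grayscale area_opening matches the attribute opening defined above, could read as follows (the signal and the resulting values are simplified, not those of Figure 2).

```python
import numpy as np
from skimage.morphology import area_opening

f = np.zeros(200, dtype=np.uint8)
f[75:91] = 8      # a peak wide enough to survive (16 samples >= lambda = 15)
f[180:191] = 6    # a narrow peak (11 samples) that the opening with lambda = 15 flattens

g = area_opening(f[np.newaxis, :], area_threshold=15)[0]  # 1-D signal viewed as a 1xN image
print(g[80], g[185])   # the first peak keeps its height, the second one is lowered
```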
Levellings. Being connected is not such a strong property for an operator. Sometimes, we also want to preserve the local spatial ordering of function values. This leads to the definition of a sub-class of connected operators. A filter is a levelling if: y ∈ N(x) and f(x) < f(y) ⇒ ψ(f)(x) ≤ ψ(f)(y). An interesting levelling is the volume levelling (Vachier, 2001). A volume can be computed from every Ft^(i) following:

ν(Ft^(i)) = Σ_{t′ ≥ t, i′ such that Ft′^(i′) ⊆ Ft^(i)} |Ft′^(i′)|    (1)

For instance, the volume of F5^(1) is computed from the flat zones included in the ellipse drawn in Figure 2. We have ν(F5^(1)) = |F5^(1)| + |F6^(1)| + |F8^(1)|.
Last, the criterion used in filtering is based upon a volume threshold λ: νλ(Z) is verified iff ν(Z) ≥ λ. The filter is finally defined just like attribute openings: γνλ(f) = ∨t t · γνλ(ft), where γνλ(Ft) = ∪i γνλ(Ft^(i)). However, it is not an attribute opening, since the criterion computation does not rely on Ft^(i) only, but takes into account the components of Ft′ with t′ ≠ t.
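As an illustration, the following Python fragment is a naive, direct transcription of equation (1) and of the stacking rule above (integer-valued data assumed; function and variable names are ours). It is far slower than the union-find implementation the authors use (Section 3.3), but it makes the criterion explicit.

    import numpy as np
    from scipy import ndimage

    def volume_filtering(f, lam):
        """Keep, at every level t, the components of F_t whose volume is >= lam."""
        f = np.asarray(f, dtype=int)
        out = np.zeros_like(f)
        for t in range(1, f.max() + 1):
            labels, n = ndimage.label(f >= t)              # connected components of F_t
            for i in range(1, n + 1):
                comp = labels == i
                # Summing (f - t + 1) over the component equals the sum of |F_t'|
                # over all components F_t' (t' >= t) nested inside it: its volume.
                if int((f[comp] - t + 1).sum()) >= lam:
                    out[comp] = t                           # stack the surviving level
        return out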
3. PROPOSED METHOD AND COMPARATIVE RESULTS
Our method follows the classification scheme described in section 2.1. The key point is to use an appropriate filter to keep the relevant peaks in feature space only.
3.1 Histogram Filtering by Volume Levellings
Simply consider a histogram as a function f. The volume levelling filter then has a simple interpretation: its threshold is a number of pixels in the original image. The volume levelling process flattens a peak of the histogram only if the number of pixels (in the original image) which corresponds to the removed part of this peak is less than a given threshold. Another way to explain the meaning of this filter and the influence of the volume threshold is the following: no class will be created when the number of pixels from the original image that would support it is less than λ.
Figure 3. Classification using Volume Levelling Filtering: original image (left), our result (right, 6 classes).
The original robotic image contains 352×288 pixels encoded in 8-bit RGB; it is depicted in Figure 3 (left). First, the image histogram in the hue-saturation-lightness (HSL) space is computed (step 1). To speed up the classification process, the histogram is down-sampled to 5 bits per color component and then regularized (step 2) with a Gaussian kernel (sigma = 0.5). For the filtering step (step 3), the volume threshold has been set to 0.05% of the number of
pixels in the original image (λ = 506). Last, the filtered histogram is inverted (step 4) and the watershed transform (step 5) is applied to provide a partition of the color space into classes. This process leads to 6 color classes and, finally, the non-contextual labeling of the original image is depicted in Figure 3 (right). We can observe that the "green" class, corresponding to the table, is perfectly extracted in feature space, although the color of the table in the original image is not homogeneous. We have not yet performed a rigorous quantitative comparison of our results with other ones; nevertheless, extra results are accessible through the Internet from www.lrde.epita.fr/dload/papers/iccvg04/ for a qualitative comparison over various images.
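With the parameters just quoted, the whole scheme can be sketched as follows. This is only our approximate reconstruction: the HLS conversion, the down-sampling rule and the marker extraction are our guesses, volume_filtering is the naive function sketched in Section 2.2, and the authors' actual implementation relies on their olena library.

    import numpy as np
    import cv2
    from scipy.ndimage import gaussian_filter, label
    from skimage.segmentation import watershed

    def classify_colors(img_rgb, volume_ratio=0.0005, sigma=0.5, bits=5):
        hls = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2HLS)             # step 1: HSL-like space
        bins = 1 << bits                                           # 5 bits per component
        idx = (hls.reshape(-1, 3).astype(int) * bins) // 256
        hist, _ = np.histogramdd(idx, bins=(bins,) * 3, range=((0, bins),) * 3)
        hist = gaussian_filter(hist, sigma)                        # step 2: regularization
        lam = volume_ratio * img_rgb.shape[0] * img_rgb.shape[1]   # 0.05% of the pixels
        filtered = volume_filtering(hist.astype(int), lam)         # step 3: volume criterion
        markers, _ = label(filtered > 0)                           # one marker per kept peak
        classes = watershed(-filtered, markers)                    # steps 4-5: invert + watershed
        labels = classes[idx[:, 0], idx[:, 1], idx[:, 2]]          # non-contextual labeling
        return labels.reshape(img_rgb.shape[:2])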
Figure 4. State-of-the-art Morphological Classifications: Zhang et al. (left, 8 classes) and Park et al. (right, 9 classes).
Figure 4 depicts the result of the morphological classifiers proposed in (Zhang and Postaire, 1994), and in (Park et al., 1998) respectively. As we can see, classification in color space is less relevant than with our method (many artifacts appear in the resulting image due to bad class identification in feature space). Moreover, we have tried to tune the parameters of both of these classifiers but we did not succeed in getting a correct result with 6 classes.
3.2 Contextual Segmentation
The method presented here does not take contextual information into account to assign final labels to points. Thus, noise-like effects might appear in the labeled image. In such a case, a contextual labeling using Markov random fields can be applied, as presented in (Geraud et al., 2001).
3.3 Implementation Details
To regularize with a Gaussian kernel, we use the fast recursive implementation explained in (Deriche, 1993). For the volume filter, we use an implementation based on the union-find algorithm from Tarjan. Our implementation is an
adaptation of the one proposed for attribute openings in (Meijster and Wilkinson, 2002). In all our experiments, we use olena, a generic image processing library written in C++ that we have developed (Darbon et al., 2002). This library is available under the GNU Public Licence (GPL) through the Internet from http://olena.lrde.epita.fr
4. CONCLUSION
We have proposed a morphology-based classifier in feature space that takes advantage of levellings. First, it gives relevant results due to the strong properties of levellings. Second, both parameters of our method are very intuitive: a variance for the regularization, if needed, and the minimal number of pixels of a class. Last, the method is fast: a color image segmentation with our method takes less than 0.5 s on a common computer (a 1.7 GHz personal computer running GNU/Linux).
REFERENCES Barata, T. and Pina, P. (2002). Improving classification rates by modelling the clusters of training sets in features space using mathematical morphology operators. In Proceedings of the 16th Intl. Conf. on pattern Recognition, volume 4, pages 90–93, Quebec City, Canada. IEEE Computer Society. Comaniciu, D. and Meer, P. (1997). Robust analysis of feature spaces: Color image segmentation. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pages 750–755, San Juan, Puerto Rico. Darbon, J., Geraud, T., and Duret-Lutz, A. (2002). Generic implementation of morphological image operators. In Mathematical Morphology, Proceedings of the 6th Intl. Symposium (ISMM), pages 175–184. Sciro Publishing. Deriche, R. (1993). Recursively implementing the gaussian and its derivatives. Technical Report 1893, INRIA. Geraud, T. (2003). Fast road network extraction in satellite images using mathematical morphology and markov random fields. In Proceedings of the EURASIP Workshop on Nonlinear Signal and Image Processing (NSIP), Trieste,Italy. Geraud, T., Strub, P.Y., and Darbon, J. (2001). Color image segmentation based on automatic morphological clustering. In Proceedings of the IEEE Intl. Conf. on Image Processing, volume 3, pages 70–73. Meijster, A. and Wilkinson, M.H.F. (2002). A comparison of algorithms for connected set openings and closings. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24(4):484– 494. Park, S.H., Yun, I.D., and Lee, S.U. (1998). Color image segmentation based on 3-D clustering: Morphological approach. Pattern Recognition, 31(8):1061–1076. Postaire, J.-G., Zhang, R.D., and Lecocq-Botte, C. (1993). Cluster analysis by binary morphology. IEEE Trans. on PAMI, 15(2):170–180. Serra, J. and Salembier, P. (1993). Connected operators and pyramids. In Proceedings of SPIE Image Algebra and Mathematical Morphology IV, volume 2030, pages 65–76, San Diego, CA, USA. Soille, P. (1999). Morphological Image Analysis – Principles and Applications. Springer-Verlag. Vachier, C. (2001). Morphological scale-space analysis and feature extraction. In Proceedings of IEEE Intl. Conf. on Image Processing, volume 3, pages 676–679, Thessaloniki, Greece. Vincent, L. and Soille, P. (1991). Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans. on PAMI, 13(6):583–598. Zhang, R.D. and Postaire, J.-G. (1994). Convexity dependent morphological transformations for mode detection in cluster analysis. Pattern Recognition, 27(1):135–148.
SEGMENTATION-BASED BINARIZATION FOR COLOR DEGRADED IMAGES Celine Thillou, Bernard Gosselin Faculte Polytechnique de Mons, Belgium [email protected]
Abstract
Recently, a new kind of image, taken by a camera in a "real-world" environment, has appeared. It implies strong degradations that are missing in scanner-based pictures, as well as the presence of complex backgrounds. In order to segment text as properly as possible, a new binarization technique using color information is proposed. This information is not used from the beginning but at appropriate moments in the processing, so as to work on smaller regions of interest; this has the advantage of reducing the computation time. In this paper, the accent is put on stroke analysis and character segmentation. The binarization method takes them into account in order to improve character segmentation and recognition afterwards.
Keywords:
Binarization; color clustering; wavelet; character segmentation.
1. INTRODUCTION
A new kind of camera-based image in a mobile environment has appeared very recently. Such images can be taken by a camera embedded in a personal digital assistant or other mobile devices. This context therefore implies a range of degradations not present in classical scanner-based pictures, such as blur, perspective distortion, complex backgrounds, uneven lighting, etc. Thresholding, as the first step of OCR, is crucial: it is the first step where some information is lost after picture acquisition, and errors at this point are propagated all along the recognition system. The challenge of obtaining a very robust binarization method is therefore major. In our context, text areas are already detected, and this part is not considered here. Our test database is based on the public images from the ICDAR 2003 website, used for their robust reading competition. Unfortunately, no paper from this competition addressed text binarization and recognition, only text detection.
Figure 1. An overview of our binarization method.

1.1 A brief state of the art
Most existing binarization techniques are thresholding-based. Basically, they can be divided into two categories: global6 and local or adaptive3. Global methods attempt to binarize the image with a single threshold. By contrast, local methods change the threshold dynamically over the image according to local information. However, in our context, image processing systems need to process a large number of documents with different styles and without pre-specified parameters. Moreover, all these techniques perform poorly on complex backgrounds. In4, Liu and Srihari used the global Otsu5 algorithm to obtain candidate thresholds. Texture features were then measured from each thresholded image, based on which the best threshold was picked. Color information is not used, and this technique fails when different colors with almost the same intensity are present. Seeger7 created a new thresholding technique for camera images, like in our context, by computing a surface of background intensities and by performing adaptive thresholding for simple backgrounds. Wang9 tried to combine both color and texture information to improve results. This technique works well for images similar to our database, but the required computation time is very high and no consideration of connectivity between components is presented. By combining other techniques, some of them similar, our method overcomes these shortcomings. Good results have also been obtained in the domain of content retrieval and video segmentation in multimedia documents. Garcia2 uses color clustering for binarization. This method is based on k-means, and the decision about which cluster or combination of clusters has to be considered is based on a set of criteria concerning character properties.
2. OUR BINARIZATION APPROACH
A scheme of our proposed system is presented in Figure 1. Color information is used only after gray-scale denoising and coarse thresholding, in order to consider only useful parts and to decrease the time required for color clustering
Figure 2. The impact of applying wavelet denoising on gray-scale images. From top to bottom: original image, thresholded image by Otsu’s algorithm, reconstructed image using the method with wavelets8 which is then thresholded by Otsu’s algorithm.
with fewer pixels. Then a combination of results is either applied or not, according to a distance parameter, and this eventual combination is partial or total, in order to take into account the non-connectivity of characters.
2.1 Denoising pre-processing
An important problem for thresholding methods, especially for "real-world" pictures, comes from non-uniform illumination, which introduces noise. To correct it, we use a wavelet decomposition as described in8, which was a preliminary part of this entire method. The wavelet transform splits the frequency range of the gray-scale image into equal blocks and represents the spatial image for each frequency block, which gives a multiscale decomposition. We use a level-8 wavelet transform with the Daubechies1 16 wavelet and remove the low-frequency subimages, except the lowest one, for reconstruction, as shown in Figure 2.
2.2 Coarse thresholding
The well-known Otsu method is then applied directly on the reconstructed image. This thresholding method is used for its simplicity and is sufficient for an approximate binarization at this point. Moreover, this method is parameter-free, and therefore general, and gives all the useful information on our database for further processing. In order to take advantage of this coarse distribution and to use color information only where needed, a zonal mask is applied on the color image, as shown in Figure 3 (left). Actually, in order to consider only the color parts selected by this first threshold, a mask corresponding to useful text is applied by an AND operation on each R, G, B subimage. This is done in order to constrain the size of the region of interest.
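In Python terms, this coarse step might read as follows. This is a sketch under our own assumptions: scikit-image's threshold_otsu stands in for the Otsu step, and dark text on a brighter background is assumed when choosing the inequality.

    import numpy as np
    from skimage.filters import threshold_otsu

    def coarse_text_mask(gray_reconstructed, rgb):
        t = threshold_otsu(gray_reconstructed)        # single global threshold
        mask = gray_reconstructed < t                 # rough text/non-text split (dark text assumed)
        masked_rgb = rgb * mask[..., None]            # AND mask applied to each R, G, B plane
        return mask, masked_rgb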
2.3 Color clustering
In9, color clustering is done using graph-theoretical clustering, without giving the number of clusters, because the picture was not pre-processed.
Figure 3. Right: A zonal mask applied on initial color images, left: color clustering in three subimages: background, noise and foreground.
Figure 4. Two textual foreground clusters after color clustering.
Actually, pre-processing with the approximate thresholding does not lose any textual information at all on our database. We use the well-known K-means clustering with K = 3. The three dominant colors are extracted based on the color map of the picture. The color map is obtained by collecting the intensity values of every pixel and removing duplicate values. Color map bins are hierarchically merged according to a Euclidean metric in the color space. These merged bins form color clusters that are iteratively updated by the K-means algorithm using the same metric. Finally, each pixel in the image receives the value of the mean color vector of the cluster it has been assigned to. Thanks to our pre-processing with the mask, three clusters are enough for our database. A decomposition is shown in Figure 3 on the right.
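A rough equivalent of this clustering step, with scikit-learn's KMeans standing in for the authors' implementation and with variable names of ours, could be:

    import numpy as np
    from sklearn.cluster import KMeans

    def three_color_clusters(masked_rgb, mask):
        pixels = masked_rgb[mask].astype(float)                     # only pixels kept by the mask
        km = KMeans(n_clusters=3, n_init=10).fit(pixels)
        quantized = masked_rgb.astype(float).copy()
        quantized[mask] = km.cluster_centers_[km.predict(pixels)]   # mean color of each cluster
        return km, quantized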
2.4 Eventual combination
The background color is selected very easily and efficiently as the color with the highest rate of occurrence in the first and last lines and the first and last columns of pixels of the picture. Only two pictures are left; depending on the initial image, they correspond either to one foreground picture and one noise picture, or to two foreground pictures, as shown in Figure 3 (right) and in Figure 4. In9, the combination is based on some texture features, used to remove inconvenient pictures, and on a linear discriminant analysis. Here, the most probable useful picture is defined by means of skeletonisation. Actually, as the first thresholding corresponds in an approximate way to characters, a skeletonisation is used to get the color of the centers of characters, as in8. The Euclidean distance D between the mean color of a cluster and the mean color of the skeleton is computed. D is described in the following equation, with pr1, pg1 and pb1 the color values of one cluster for the R, G, B channels, and pr2,
Figure 5. The first sample 'study', with well-spaced characters, gives nearly the same result for our method and Wang's method9 (top). For the second sample 'point', our method (bottom) improves Wang's result for character segmentation.
pg2 and pb2 the mean color values of the skeleton. The cluster with the smallest distance to the skeleton is considered as the cluster with the main textual information. The combination to perform is decided according to the distance D between the mean color values of the two remaining clusters:

D = √( (pr2 − pr1)² + (pg2 − pg1)² + (pb2 − pb1)² )
If the distance is smaller than 0.5, the colors are considered similar and the second picture is deemed to be a foreground picture too. On the ICDAR database, this decision is valid in 98.4% of the cases and no false alarm is detected. For the remaining 1.6%, some useful information is lost, but recognition is still possible since the first selected picture is the most relevant foreground one.
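The cluster selection can be sketched as below; skeletonize comes from scikit-image, the names are ours, and the 0.5 threshold implicitly assumes colors normalized to [0, 1], which is our reading of the paper.

    import numpy as np
    from skimage.morphology import skeletonize

    def pick_text_cluster(coarse_mask, rgb, cluster_means):
        skel = skeletonize(coarse_mask)                    # centers of character strokes
        skel_color = rgb[skel].reshape(-1, 3).mean(axis=0)
        d = np.linalg.norm(cluster_means - skel_color, axis=1)
        text = int(np.argmin(d))                           # cluster closest to the skeleton color
        others = [i for i in range(len(cluster_means)) if i != text]
        return text, others

    # The decision whether to merge the other remaining cluster with the text
    # cluster is then taken by comparing their mutual distance D with 0.5.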
2.5 "Smart" combination
Connected components of the first foreground picture are computed to get the coordinates of their bounding boxes, so that the pixels added by the combination do not connect components. Only pixels that can be added this way will change the first foreground picture. On the other hand, some characters can remain broken if they were broken in the first foreground picture; but, in this case, the correction will be facilitated by the fact that the character parts will be closer. Some samples are shown in Figure 5 to compare the final results of the Wang9 method and of our binarization technique.
2.6 Experimental results
Classical OCRs are not designed for natural-scene character recognition, and it is quite difficult to obtain pertinent results with them. Therefore, to show the improvements brought by our binarization method, we report results based on visual judgement and on the number of connected components, the latter being a strong factor for the subsequent character segmentation. On the ICDAR 2003 database, 21% of the characters are no longer connected, compared to the thresholding method described in8. This corresponds to an improvement on 29% of the images, and the visual quality is drastically improved.
Concerning Wang's method described in9, the binarization results are quite similar, but 6.3% of the characters are no longer connected and segmentation is improved. This improvement concerns 11.3% of the images in the database. Moreover, in comparison with this latter method, the computation time is reduced by around 40%, depending on the image size. This reduction depends on the way the methods are implemented, but the time reduction is nevertheless obvious.
3. CONCLUSION AND FUTURE WORK
In this paper, we have presented a new binarization method for "real-world" camera-based pictures. Illumination and blur are corrected with a wavelet denoising. Color information is not used from the beginning, in order to reduce computation time and to use it at a more convenient step. Moreover, a smart combination is performed between clusters to get as much information as possible, with a compromise on the number of connected components, in order to improve character segmentation and recognition. Improvements have been obtained compared to other recent binarization techniques using color information. A way to discriminate between clean and noisy backgrounds is currently under investigation, to further decrease the computation time and to get smoother results in the case of clean backgrounds.
ACKNOWLEDGMENTS This work is part of the project Sypole and is funded by Ministere de la Region wallonne in Belgium.
REFERENCES 1. I. Daubechies, Ten lectures on wavelets, SIAM (1992). 2. C. Garcia and X. Apostolidis, Text detection and segmentation in complex color images, Proceedings of ICASSP 2000, (2000) Vol. IV, 2326–2330. 3. J. Kittler, J. Illingworth, Threshold selection based on a simple image statistic, CVGIP, 30 (1985) 125–147. 4. Y. Liu and S. N. Srihari, Document image binarization based on texture features, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.19, nˇr5, (1997) 540–544. 5. N. Otsu, A thresholding selection method from gray-level histogram, IEEE Transactions on Systems, Man, and Cybernetics, 9 (1979) 62–66. 6. P. K. Sahoo, S. Soltani, A. K. C. Wong, A survey of thresholding technique, CVGIP, 41 (1988) 233–260. 7. M. Seeger and C. Dance, Binarising camera images for OCR, ICDAR 2001, (2001) 54–59. 8. C. Thillou and B. Gosselin, Robust thresholding based on wavelets and thinning algorithms for degraded camera images, Proceedings of ACIVS 2004, (2004). 9. B. Wang, X-F. Li, F. Liu and F-Q. Hu, Color text image binarization based on binary texture analysis, Proceedings of ICASSP 2004, (2004) 585–588.
COMPACT COLOR VIDEO SIGNATURE BY PRINCIPAL COMPONENT ANALYSIS Thomas Leclercq1 , Louahdi Khoudour1 , Ludovic Macaire2 , Jack-Gérard Postaire2 , Amaury Flancquart1 1 INRETS-LEOST
Institut National de Recherche sur les Transports et leur Sécurité Laboratoire Electronique Ondes Signaux pour les Transports 20, rue Elisée reclus-59650 Villeneuve d’ascq, France [email protected] - http://www.inrets.fr/leost 2 Laboratoire d’Automatique LAGIS - UMR CNRS 81456
Université des Sciences et Technologies de Lille Cité Scientifique - Bâtiment P2 - 59655 Villeneuve d’Ascq - FRANCE http://www.lagis.univ-lille1.fr
Abstract
In the context of transport security, real-time detection of potentially dangerous situations is very important. We need to be able to identify pedestrians, such as intruders in forbidden areas, as quickly as possible. The signatures must be precise, in order to identify persons, and also compact, to be rapidly transmitted through a network. We propose in this paper a new signature based on a Principal Component Analysis (PCA) applied to color histograms.
Keywords:
Color deformable objects; image sequence comparison; video signature; data analysis; principal component analysis; color histograms.
1. INTRODUCTION
Supervision of public sites requires a high number of cameras, so that each of them observes a strategic location of the site under control. This multi-camera system is developed to track moving persons in the site. For this purpose, the cameras acquire top-view color images, so that the site is observed by different cameras without overlapping between the different fields of view. The analysed color image sequences contain persons who walk through these observed areas. The aim of such a multi-camera system is to identify the persons as they move under the different vision sensors, in order to determine their journey in the site.
In this paper, we propose an original approach for retrieving person image sequences acquired by different color cameras. The first section deals with classical methods used in video shot comparison. In the second section, we describe our method, which evaluates signatures of image sequences based on the color histograms of these images. The last section presents some preliminary results of sequence comparison using our signature method.
2. COLOR IMAGE SEQUENCES COMPARISON
Since we do not analyze the shape or the motion in the dynamic scenes, we compare two images using their color histograms. One of the most widely used similarity measures between two color images is the histogram intersection1. The similarity between two image sequences A and B could then be determined by analysing the distance matrix between the images of the two sequences: this matrix contains the Na * Nb distances that can be computed between each of the Na images of A and each of the Nb images of B, with respect to a given image similarity measurement. The processing of such a matrix is expensive in terms of computing time. That is why some methods in the literature reduce the number of histograms. One of the most widely used methods is key-frame extraction. In a given sequence, only Na' images are selected, assuming that they are the most representative of the sequence, according to different considerations2. The comparison between two sets of key-frames also gives a distance matrix, which needs just Na' * Nb' elementary image comparisons. The results, however, depend on the key-frame selection, which is not a trivial problem. Another method consists in calculating a single global feature from a complete sequence, such as the dominant color histogram3, which is too coarse to distinguish two different persons moving under the cameras.
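For reference, the histogram intersection and the full distance matrix it induces can be written in a few lines (our sketch, assuming flattened, normalized histograms):

    import numpy as np

    def histogram_intersection(h1, h2):
        # Swain & Ballard intersection of two normalized color histograms.
        return np.minimum(h1, h2).sum()

    def sequence_distance_matrix(hists_a, hists_b):
        # hists_a, hists_b: lists of flattened, normalized histograms, one per frame.
        return np.array([[1.0 - histogram_intersection(a, b) for b in hists_b]
                         for a in hists_a])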
3. SIGNATURE BY LINEAR DIMENSION REDUCTION
We propose to reduce the number of dimensions of the histograms via a Principal Component Analysis (PCA)4. We consider each color histogram of each image as a point in a high-dimensional space5. Each axis of this space corresponds to a cell of the histogram. The coordinate of the point along this axis is the number of pixels whose color corresponds to this cell. The successive histograms constitute a set of points, and the comparison of two sequences consists in comparing these two corresponding sets of points. A 3D color histogram created from a classical RGB image would contain 256*256*256 cells. We could reduce it to a 16*16*16 histogram. As such a histogram still contains 4096 cells, the dimension of the space in which the points are
816 projected is very high. In this article, we try to use three 2D color histograms: RG, RB and GB, reduced to 16*16 histograms. Each of them contains only 256 cells. A Principal Components Analysis (PCA) is then applied separately on each of the three spaces to reduce the number of dimensions with an adapted projection. It allows on one hand to exploit the redundancy of existing information, and on the other to reduce the noise inherent to any measure. Because of the small quantity of points (less than 200 points per sequence) compared with the relatively important number of axis, it seems that the totality of the information could be kept in a space of a reduced number of dimensions. We notice that the first three principal components of the PCA represent the quasi-totality of the variance (about 98%). The signature contains the three projectors adapted to the processed sequence, and the three sets of points that have been projected in a reduced space. Let us consider now the case where we compare a second sequence to the one that has been projected: we project the three sets of points coming from the second sequence, by means of the three projectors adapted to the first sequence. Then, we want to know if the second set of points comes from a sequence which contains the same person, or a different one. If the person is the same one, for each 2D histogram, the discrimination between the two sets of points should be low. Then, we could hope that if the person is different, this discrimination is higher.
4. RESULTS
In order to demonstrate the efficiency of our method, we have built a database containing image sequences of eight persons, each of them being observed by two cameras. Each sequence is referenced by a letter between A and H, associated with the observed person, and a number which corresponds to the camera. Figure 1 shows three image samples of three different individuals. In the database, the proposed procedure looks for the target sequence which is the closest to the request one.
4.1 Graphical results
Figure 2(a) shows the result of the PCA applied on the RG histogram for the person-sequence A1. Case 2(b) (A1-A2) shows the points representing two sequences of the same individual passing under two cameras, in the reduced space determined by the PCA on the points coming from A1. Figures 2(c) (A1-B1) and 2(d) (A1-C1) show projections, in the A1 components, of points representing two different persons: B1 is similar to A1 and A2 in terms of clothing colors, so the projected sets of points are rather close. C1 is very
Figure 1. Three person images extracted from the image sequences.
different from A1 and A2: all the points representing C1 are grouped on the right side of the space.
4.2 Numerical results
We first calculate a similarity measure between two sequences A and B, based on the histogram intersection: for each image of the A sequence, we look for the image of the B sequence that is the most similar, according to the average value of the three histogram intersections. This consists in finding the minimal value of each line of the similarity matrix between A and B. Then, we calculate the average value of these minimal similarities. In order to evaluate the performance of our method, we compute, for each of the three 2D histograms, the Euclidean distance between the gravity centers of the two projected sets of points after the PCA. Table 1 shows these two measures, where the A1 sequence is the target sequence and the other sequences are the candidate sequences. We notice that these two measures (method 1 and method 2) provide approximately the same results. Thanks to the reduction of dimensions by PCA, we characterize a sequence using only 2 or 3 principal components instead of the original 256 cells of the histogram, and we keep 98% of the original information.
4.3 Computing time
4.3.1 Comparison of two sequences. The comparison of two image sequences A and B using the three 2D histogram intersections takes around 2 seconds, depending on the number of images in the two sequences (160 on average, 190 at most). The projection of the A sequence in the components
Figure 2. Projections of different sequences in the first two axes of the PCA: (a) A1, (b) A1-A2, (c) A1-B1, (d) A1-C1.
Table 1. Distance to the target sequence A1. Method 1 = histogram intersections. Method 2 = distance between the gravity centers of two sequences, after PCA projection.

Request    Method 1          Method 2
A2         0.213             0.0246
B1/B2      0.278 / 0.302     0.0326 / 0.0455
C1/C2      0.634 / 0.672     0.1167 / 0.1188
D1/D2      0.623 / 0.696     0.1143 / 0.1193
E1/E2      0.585 / 0.628     0.1053 / 0.1121
F1/F2      0.355 / 0.409     0.0998 / 0.1001
G1/G2      0.557 / 0.598     0.1031 / 0.1074
H1/H2      0.696 / 0.700     0.1140 / 0.1176
of the B sequence, and the comparison between the A and B sets of projected points, takes less than 10 milliseconds on the same machine running at 2.4 GHz.
4.3.2 Signature processing. The creation of the signature includes a matrix inversion, whose computing time depends on the number of components that are valid. The average time is 2 seconds for 60 valid components, and 9 seconds in the worst case (96 valid components).
5. CONCLUSION
We have proposed an original approach for comparing color image sequences by means of color histograms. We are presently working on the coding of the color information (choice of the best adapted representation), which is an important preliminary stage for all the processing steps. Various colorimetric spaces6 will come to enrich the signatures, with the aim of analyzing the variations in the behavior of the series of attributes according to the space of representation. Other comparison methods, such as the Nearest Feature Line7, will also be integrated into the general method, in order to be more precise than simply analyzing the distance between gravity centers.
REFERENCES 1. M. J. Swain and D. H. Ballard, Color indexing, International Journal of Computer Vision, vol. 7(1), pp. 11-32, 1991. 2. A. Divakaran, R. Radhakrishnan and K. Peker, Motion Activity-based Extraction of KeyFrames from Video Shots, International Conference on Image Processing (ICIP 2002), vol. 1, pp. 932-935, Rochester NY, September 2002. 3. T. Lin, C. W. Ngo, H. J. Zhang and Q. Y. Shi, Integrating Color and Spatial Features for Content-based Video Retrieval, International Conference on Image Processing (ICIP 2001), vol. 3, pp. 592-595, Thessalonique, Greece, October 2001. 4. G. Saporta, Probabilités, Analyse de Données et Statistique, TECHNIP, Paris, 1990. 5. J. P. Cocquerez and S. Philipp, Analyse d’images: filtrage et segmentation, MASSON, Paris, 1995. 6. D. Muselet, L. Macaire, L. Khoudour and J. G. Postaire, Color invariant for person images indexing, European Conference on Colour in Graphics, Image and Vision (CGIV 2002), pp. 236-240, Poitiers, France, April 2002. 7. S. Z. Li, K. L. Chan and C. Wang, Performance Evaluation of the Nearest Feature Line Method in Image Classification and Retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22(11), pp. 1335-1339, 2000.
COMPARISON OF DEMOSAICKING METHODS FOR COLOR INFORMATION EXTRACTION Flore Faille Institute for Real-Time Computer Systems Technische Universität München, 80333 Munich, Germany [email protected]
Abstract
Single–chip color cameras use a color filter array to sample only one color per pixel. The missing information is interpolated with demosaicking algorithms. Several state–of–the–art and more recent demosaicking methods are compared in this paper. The aim is to find the method best suited for use in computer vision tasks. For this, the mean squared error for various images and for typical color spaces (RGB, HSI and Irb) is measured. The high inter–channel correlation model which is widely used to improve interpolation in textured regions is shown to be inaccurate in colored areas. Consequently, a compromise between good texture estimation and good reconstruction in colored areas must be found, e.g. with the methods by Lu et al.9 and Freeman2.

1. INTRODUCTION
Most digital color cameras are based on a single CCD or CMOS sensor combined with a color filter array (CFA): each pixel measures only one of the RGB colors. The most popular CFA is the Bayer CFA1 shown in Fig. 1. Demosaicking algorithms interpolate the sparsely sampled color information to obtain a full resolution image, i.e. three color values per pixel. Many demosaicking algorithms were designed (see the overview in Sect. 2). However, even recent methods are prone to interpolation errors or artifacts. The most common artifacts are shown in Fig. 4: “zipper” effects, false colors and desaturation of colored details. This raises the questions whether images acquired with a single–chip camera can be used to extract reliable color information for further computer vision tasks, and which demosaicking algorithm is best suited.

Figure 1. Bayer color filter array (2×2 pattern: G R / B G).

The previous comparisons between demosaicking methods, e.g.9,10, aimed at visually pleasing images, so their evaluation criteria were, in addition to Mean Squared Error (MSE) in RGB space, visual inspection and measures
based on human perception, like ΔE*ab 9,10. In computer vision tasks, separation between intensity and chrominance is widely used to increase robustness to illumination changes3,4. However, none of the previously used criteria can evaluate chrominance quality. For that reason, a detailed analysis using typical color spaces (HSI and Irb) is provided here. In addition, performance differences in colored, textured and homogeneous areas are emphasized. After an overview of the existing demosaicking algorithms, the methods chosen for comparison are introduced in Section 2. Section 3 presents the comparison framework and the results. A conclusion is given in Section 4.
2. DEMOSAICKING ALGORITHMS
A good overview of the many state–of–the–art methods is given in10. The simplest algorithm performs a bilinear interpolation of the three channels separately. The demosaicking quality is however poor, as shown in Fig. 4(a). To improve it, a larger neighborhood and gradient information can be considered. In addition, the high inter–channel correlation can be used by assuming the difference between R, G and B to be constant over a small neighborhood. This allows considerable improvements at a moderate complexity increase10. The most popular state–of–the–art methods are analyzed here: median–based postprocessing (MBP)2 and adaptive color plane interpolation (ACPI)6. The most interesting of the recently developed algorithms are5,7,9. In9, gradient information is included more flexibly and plays a bigger role in estimating the R and B channels. It is promising for color quality, so this method is analyzed here. In7, the neighborhood considered during interpolation is not chosen based on gradients but on the homogeneity of the results. In5, a compromise between high inter–channel correlation and fidelity to the sampled data is reached using constraint sets. Both methods produce fewer artifacts in textured areas but are too time–consuming for usual computer vision tasks.
2.1 Median–based postprocessing (MBP and EMBP)
MBP reduces demosaicking artifacts by enforcing inter–channel correlation2. After demosaicking, the difference images R − G and B − G are median–filtered. The image is then reconstructed using the filtered difference images δR/B and the CFA sampled data: for example, at a sampled G pixel, (R′, G′, B′) = (δR + G, G, δB + G). The algorithm works similarly at R and B pixels. Desaturation and isolated false intensity pixels appear near color edges with low inter–channel correlation (e.g. red/white edges), as shown in Fig. 4(b). To avoid such a contradiction between inter–channel correlation and sampled data, an Enhanced Median–Based Postprocessing (EMBP) is used in7,9, in which sampled values are changed too. The reconstruction step is, for all pixels: (R′, G′, B′) = (δR + G′, (R − δR + B − δB)/2, δB + G′), i.e. the new green value G′ averages R − δR and B − δB. The implementation proposed in9 was chosen here: already processed pixels are used to filter the following pixels, for a faster diffusion of the estimation, and only areas with sufficient gradient are postprocessed. Results are presented in Fig. 4(c).
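As an illustration only (our reconstruction, not Freeman's patented procedure; the 3×3 median window and the array layout are our choices), the MBP idea can be sketched as:

    import numpy as np
    from scipy.ndimage import median_filter

    def mbp(demosaicked, bayer_mask):
        # demosaicked: (H, W, 3) float image; bayer_mask: (H, W) array with values
        # 0/1/2 telling which of R/G/B was actually sampled at each pixel.
        R, G, B = demosaicked[..., 0], demosaicked[..., 1], demosaicked[..., 2]
        dR = median_filter(R - G, size=3)
        dB = median_filter(B - G, size=3)
        out = demosaicked.copy()
        g = bayer_mask == 1                      # sampled G pixels: keep G, rebuild R and B
        out[..., 0][g] = dR[g] + G[g]
        out[..., 2][g] = dB[g] + G[g]
        # Sampled R and B pixels are handled analogously, rebuilding G and the other channel.
        return out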
2.2 Adaptive color plane interpolation (ACPI)
ACPI is a state–of–the–art method using gradient information and inter–channel correlation6,10. To account for the CFA structure, the G channel is interpolated first and is used to process the R and B channels. Figure 2 presents the G channel estimation: gradients and interpolated G values depend on the laplacian of the R (B) channel. Similarly, estimated R and B values depend on the laplacian of the G channel. Gradients are used to estimate R (B) values at B (R) pixels, for which four R (B) neighbors exist (results shown in Fig. 4(d)).
Pixel neighborhood (a 5×5 cross centered on the sampled R pixel R5, from top to bottom: R1, G2, then the row R3 G4 R5 G6 R7, then G8, R9).
Compute the horizontal and the vertical gradients:
    H = |G4 − G6| + |R5 − R3 + R5 − R7|
    V = |G2 − G8| + |R5 − R1 + R5 − R9|
If H > V:
    G5 = (G2 + G8)/2 + (R5 − R1 + R5 − R9)/4
else if V > H:
    G5 = (G4 + G6)/2 + (R5 − R3 + R5 − R7)/4
else:
    G5 = (G2 + G8 + G4 + G6)/4 + (R5 − R1 + R5 − R9 + R5 − R3 + R5 − R7)/8
Figure 2. Interpolation of the G value at a sampled R pixel with ACPI6 . Sampled B pixels are processed similarly.
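Transcribed literally into Python, the rule of Figure 2 for one sampled R pixel becomes the function below (the function name is ours; r and g are the sparse red and green planes, and y, x index a sampled R location with a complete 5×5 neighborhood):

    def acpi_green_at_red(r, g, y, x):
        # Horizontal and vertical gradients of Fig. 2: G neighbors plus the R laplacian.
        H = abs(g[y, x-1] - g[y, x+1]) + abs(2*r[y, x] - r[y, x-2] - r[y, x+2])
        V = abs(g[y-1, x] - g[y+1, x]) + abs(2*r[y, x] - r[y-2, x] - r[y+2, x])
        if H > V:    # stronger horizontal variation: interpolate vertically
            return (g[y-1, x] + g[y+1, x]) / 2 + (2*r[y, x] - r[y-2, x] - r[y+2, x]) / 4
        if V > H:    # stronger vertical variation: interpolate horizontally
            return (g[y, x-1] + g[y, x+1]) / 2 + (2*r[y, x] - r[y, x-2] - r[y, x+2]) / 4
        return (g[y-1, x] + g[y+1, x] + g[y, x-1] + g[y, x+1]) / 4 \
               + (4*r[y, x] - r[y-2, x] - r[y+2, x] - r[y, x-2] - r[y, x+2]) / 8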
2.3 Weighted adaptive color plane interpolation (WACPI)
WACPI9 can be seen as an extension of ACPI. G values are estimated first. While ACPI interpolates horizontally or vertically, or takes the average of both directions, gradients are integrated more flexibly in WACPI. The contributions of every direction (up, down, left, right) are weighted by the gradient inverses before they are summed up and normalized to build the estimate. This enables the consideration of four directions instead of two (see Fig. 3), which enhances the performance at slanted edges and corners (see Fig. 4(e)). Contrary to ACPI, gradient information is considered for the interpolation of all R and B values. To achieve this, R (B) values at sampled B (R) pixels are estimated first and are used to calculate the R and B values at sampled G pixels. The exact algorithm can be found in9.

Figure 3. Gradient directions (UP, DOWN, LEFT, RIGHT).
3. COMPARISON OF THE ALGORITHMS
The algorithms are evaluated on simulated CFA sampled images by comparison with the original three–channel image. 24 different images of the Kodak
Figure 4. Enlarged details of the Small Lighthouse image9 (top: buoy, bottom: fences) and demosaicking artifacts: (a) bilinear interpolation, (b) MBP,2 (c) EMBP,7,9 (d) ACPI,6 (e) WACPI,9 and (f) original image. This figure is available in color online8 .
color image database5,9 are used. MBP and EMBP are applied after WACPI. The demosaicking results are illustrated in Fig. 4 to show the different artifacts. Bilinear interpolation causes “zipper” effects. For ACPI and WACPI, false colors are found near edges or corners. MBP and EMBP reduce these artifacts but introduce desaturation near colored edges. Additionally, MBP causes isolated false intensity pixels and EMBP causes color blurring. As mentioned before, popular color spaces for computer vision separate intensity and chrominance information to reduce sensitivity to light direction and intensity3,4. For this, chrominance components are based on ratios3,4. Robustness to specularities can be achieved using inter–channel differences4. HSI and Irb are two well-known color spaces based on these principles:

H = arctan( √3 (G − B) / (2R − G − B) ),   S = 1 − min(R, G, B)/I,   I = (R + G + B)/3

and

I = (R + G + B)/3,   r = R/I,   b = B/I.
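A small helper implementing these conversions, as we read them, is given below; np.arctan2 is used instead of a plain arctan to avoid the singularity of the quotient, and the epsilon guard is our addition.

    import numpy as np

    def to_hsi_irb(rgb):
        # rgb: float array (..., 3) with channels R, G, B.
        R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        eps = 1e-12                                   # avoids division by zero (our choice)
        I = (R + G + B) / 3.0
        H = np.arctan2(np.sqrt(3.0) * (G - B), 2.0 * R - G - B)
        S = 1.0 - np.minimum(np.minimum(R, G), B) / (I + eps)
        r, b = R / (I + eps), B / (I + eps)
        return np.stack([H, S, I], axis=-1), np.stack([I, r, b], axis=-1)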
Performances will be evaluated using the MSE between original and demosaicked images in HSI and Irb spaces. The MSE in RGB space is added for comparison with previous evaluations like9,10 . H, S, r and b are scaled by 100, so that their order of magnitude is similar to R, G, B and I. As the results depend on image content, the average MSE over all images is presented in Table 1. The performance discrepancy in textured and in homogeneous areas is emphasized (edges were detected with a Laplacian filter). ACPI, WACPI, MBP (+WACPI) and EMBP (+WACPI) require on average over all images 2, 8.5, 20 and 14 times as much execution time as bilinear interpolation. Bilinear interpolation, which does not use gradients and inter–channel correlation, achieves by far the worst results. Despite its low complexity, ACPI
Table 1. Demosaicking performance near edges, in homogeneous areas and in entire images. The average MSE over 24 images of the Kodak database is given for the RGB, HSI and Irb spaces, each row showing a channel (top row for R, H and r, etc.). As I is the same in HSI and Irb, the MSE for I is only given for HSI.
            |        Edges          |   Homogeneous areas    |     Entire images
Algorithm   | RGB    HSI     rb     | RGB    HSI     rb      | RGB    HSI     rb
bilinear    | 215    28.4    1.40   | 14.1   7.26    .0545   | 104    17.6    .669
            | 87.1   .966    1.57   | 5.64   .0549   .0587   | 42.0   .460    .748
            | 222    95.4           | 15.1   6.14            | 108    45.9
ACPI        | 38.1   11.7    .394   | 4.85   4.77    .0245   | 20.1   8.35    .202
            | 24.9   .268    .319   | 2.60   .0223   .0232   | 13.2   .141    .168
            | 33.6   15.8           | 4.52   1.89            | 18.4   8.46
WACPI       | 23.6   9.18    .232   | 3.44   4.05    .0178   | 12.4   6.63    .118
            | 13.3   .187    .210   | 1.71   .0173   .0180   | 7.22   .0959   .108
            | 21.6   9.44           | 3.52   1.35            | 11.7   5.08
MBP         | 25.2   8.18    .249   | 4.01   3.97    .0206   | 12.6   5.99    .120
            | 9.93   .195    .212   | 1.91   .0204   .0212   | 5.18   .0976   .106
            | 18.4   8.13           | 3.67   1.49            | 9.72   4.20
EMBP        | 29.7   8.98    .543   | 3.51   4.08    .0201   | 14.4   6.51    .245
            | 18.1   .264    .283   | 1.79   .0182   .0188   | 8.57   .124    .134
            | 24.0   9.64           | 3.58   1.36            | 12.1   5.07
allows significant enhancement. WACPI and MBP perform best. The quality of color ratios and of saturation is on average comparable for both (see the MSE for r, b and S). MBP improves texture estimation (see R, G, B and I) and reduces false color artifacts (see H). But its performance in homogeneous areas is worse, due to its higher sensitivity to the inaccuracy of the high inter–channel correlation model (constant color differences in a neighborhood) in colored areas. To emphasize this problem, Table 2 gives the average MSE in colored areas (areas with S ≥ 0.3). MBP shows the maximal performance drop. WACPI achieves here the best results. MBP's higher sensitivity to the model inaccuracy also results in more negative interpolated values: if these are not corrected to 0, the mean MSE over all (entire) images, e.g. for r, becomes .441 for WACPI and .965 for MBP. To summarize, MBP better estimates texture, but is outperformed by WACPI in colored areas. This explains our observation that MBP is better on images with fine textures (e.g. landscapes), while WACPI is better e.g. on images showing human–made objects and on close-ups. The overall performance of EMBP is moderate. It better estimates texture than MBP in colored areas, but achieves poor chrominance quality.
Table 2. Demosaicking performances in colored areas (with S ≥ 0.3). As in Table 1, the average MSE over all 24 images is given in RGB, HSI and Irb color spaces. The results for the bilinear interpolation are omitted, as it achieved by far the worst results in Table 1.
     |        ACPI           |        WACPI          |         MBP           |        EMBP
     | RGB    HSI     rb     | RGB    HSI     rb     | RGB    HSI     rb     | RGB    HSI     rb
     | 30.8   8.83    .747   | 19.2   6.98    .435   | 28.2   6.44    .452   | 27.2   7.66    .806
     | 19.6   .756    .598   | 10.7   .584    .409   | 11.3   .524    .407   | 15.8   .759    .484
     | 27.6   12.8           | 18.8   7.75           | 19.2   8.99           | 21.0   7.95

4. CONCLUSION
State–of–the–art and recent demosaicking algorithms were compared on various images. To verify if demosaicked images are suitable for computer vision tasks, the average MSE in typical color spaces (RGB, HSI and Irb) was measured. While the high inter–channel correlation model significantly improves interpolation results, it was also shown to be problematic in colored areas. WACPI and MBP (+WACPI) provide the best results. WACPI performs better in colored and in homogeneous areas. MBP better reconstructs texture and reduces false color artifacts. The MBP algorithm could be enhanced by processing only edge pixels like in EMBP: this would reduce execution time and improve performance in homogeneous areas. In addition, regions with saturated colors could be left unchanged. If emphasis lies on efficiency, ACPI and WACPI achieve the best compromise between speed and quality.
REFERENCES 1. Bayer, B. E. (1976). Color imaging array. United States Patent 3,971,065. 2. Freeman, W. T. (1988). Method and apparatus for reconstructing missing color samples. United States Patent 4,774,565. 3. Funt, B., Barnard, K., and Martin, L. (1998). Is machine colour constancy good enough? In ECCV98, pages 445–459. 4. Gevers, T. and Smeulders, A. W. M. (1999). Color–based object recognition. Pattern Recognition, 32:453–464. 5. Gunturk, B. K., Altunbasak, Y., and Mersereau, R. M. (2002). Color plane interpolation using alternating projections. IEEE Trans. on Image Processing, 11(9):997–1013. 6. Hamilton, J. and Adams, J. (1997). Adaptive color plane interpolation in single sensor color electronic camera. United States Patent 5,629,734. 7. Hirakawa, K. and Parks, T. W. (2003). Adaptive homogeneity–directed demosaicing algorithm. In ICIP03, pages III: 669–672. 8. Images available in color at: http://www.rcs.ei.tum.de/~faille/demosaicking.html 9. Lu, W. and Tan, Y.–P. (2003). Color filter array demosaicking: New method and performance measures. IEEE Trans. on Image Processing, 12(10):1194–1210. 10. Ramanath, R., Snyder, W. E., Bilbro, G. L., and Sander, W. A. (2002). Demosaicking methods for bayer color arrays. Journal of Electronic Imaging, 11(3):306–315.
BLIND EXTRACTION OF SPARSE IMAGES FROM UNDER-DETERMINED MIXTURES Włodzimierz Kasprzak1 , Andrzej Cichocki2 , Adam F. Okazaki1 1 Warsaw University of Technology, Inst. of Control and Computation Eng., ul. Nowowiejska
15/19, PL-00-665 Warsaw [email protected], [email protected] 2 Brain Science Institute RIKEN, Lab. for Advanced Brain Signal Processing, Hirosawa 2-1,
Wako-shi, 351-0198 Saitama, JAPAN [email protected]
Abstract
We propose a blind signal extraction approach to the extraction of binary and sparse images from their under-determined mixtures, i.e. when the number of sensors is lower by one than the number of unknown sources. A practically feasible solution is proposed for constrained classes of images, i.e. sparse, binary-valued and dynamically-constrained sources.
Keywords:
Blind signal processing; image enhancement; linear optimization; sparse signals.
1. INTRODUCTION
The goal of blind source separation (BSS) is to extract (statistically independent) unknown source signals from their linear mixtures without knowing the mixing coefficients1,4,6. This blind signal processing technique has so far found its main applications in data mining and biomedical signal processing problems. A precondition for the application of BSS solutions is that the number of (statistically independent) source signals is at most equal to the number of sensors and is known a priori. Typically it should be equal to the number of sensors and outputs. However, in practice these assumptions do not often hold. In this paper we consider the mixing case where the number of independent sources is higher than the number of mixtures by one.
2. THE BLIND SOURCE EXTRACTION PROBLEM
Denote by x(t) = [x1 (t), . . . , xn (t)]T the n-dimensional t-th data vector made up of the mixtures at discrete index value (usually time) t. The mixing
model in blind source separation (BSS) can then be written in the vector form x(t) = As(t) + n(t)
(1)
Here s(t) = [s1(t), . . . , sm(t)]^T is the source vector consisting of the m source signals at the index value t. Furthermore, each source signal si(t) is assumed to be a stationary zero-mean stochastic process, and n(t) denotes Gaussian noise. In standard neural and adaptive source separation approaches, an m × n separating matrix W is updated so that the m-vector y(t) = W x(t) becomes an estimate (y(t) = ŝ(t)) of the original independent sources1,7. A standard assumption in BSS is that the number m of the sources should be known in advance. Like in most neural BSS approaches, we have assumed up to now that the numbers m of the sources and l of the outputs are equal in the separating network. Generally, both these assumptions may not hold in practice. Now, let us consider the difficult case where there are fewer mixtures than sources: n < m. Then the n × m mixing matrix A in (1) has more columns than rows. In this case, complete separation is usually out of the question. However, some kind of separation may still be achievable, in special instances at least. This is the goal of blind signal extraction (BSE). The BSE problem has recently gained wider attention. Pajunen9 has proposed an algorithm for binary source separation that separates m binary sources from two or more mixtures. The restrictive assumptions about the sources are that the mixture vectors must not overlap and that the mixing matrix must have non-parallel column vectors. Chen & Donoho3 have applied a so-called Basis Pursuit approach for spectrum estimation. Basis Pursuit decomposes a signal into an optimal superposition of dictionary elements, where optimal means having the smallest l1 norm of coefficients among all such decompositions. Their dictionary includes overcomplete cosine and sine bases, and the Dirac basis. Hence this optimization principle leads to decompositions that can be very sparse. This topic has been studied theoretically in2. The authors show that it is possible to separate the m sources into n disjoint groups if, and only if, A has n linearly independent column vectors, and the remaining m − n column vectors satisfy the special condition that each of them is parallel to one of these n column vectors. Recently Li et al.8 have proposed a multi-stage approach to sparse signal extraction. In the first stage the observed data is transformed into the time-frequency domain via wavelets. Next, the use of a sophisticated hierarchical clustering technique allows one to identify the mixing matrix. In the last step the sparse sources are estimated alternatively by linear, quadratic or semi-definite programming optimization. In the case of images, their spectral distributions are very similar to each other, as their spectra are dominated by the first frequency coefficient. Therefore a different approach is proposed in our paper.
3. PROPOSED SOLUTION TO BSE
In general the vectors s(k) and x(k) are correlated, Rxs = E{x s^T} ≠ 0, but the noise vector is not correlated with s. Our objective is to find the best estimation matrix Âbest such that the pair of vectors n = x − As and s are no longer correlated with each other: 0 = E{(x − As) s^T} = Rxs − A Rss.
(2)
,best = Rxs R−1 . A ss
(3)
and
Assuming that only the sensor signals are available, our BSE approach consists of two main steps that are iterated together (until convergence is achieved) and of a third (optional) final refinement step: (I.a) estimate Â of the (unknown) mixing matrix A from the mixed (sensor) signal vector x(t); (I.b) estimate the source signals ŝ(t), for given Â and x(t); (II.) a final post-processing for specific signals (optional).
3.1 Extraction of sources
After the estimate Â (or Â1) of the mixing matrix A (or of the combined mixing matrix A1, respectively) is known, there exist potentially many solutions to the under-determined source extraction problem. We can solve it in special cases, at least. When the source signals are spiky and sparse, in the sense that they fluctuate mostly around zero and only occasionally have nonzero values, the problem of estimating the unknown signals can be converted to an extended linear programming problem, i.e. finding the optimal sequence of estimated source signals ŝi(k) (i = 1, ..., n) which minimizes the l1 norm:
Σ_k Σ_{i=0..n} |ŝi(t)|,
(4)
subject to the constraints: Â ŝ(t) = x(t),
or
Â1 ŝ(t) = v(t), ∀k.
(5)
A very efficient linear programming algorithm that allows one to minimize the l1 norm is known, called the FOCUSS algorithm5. We have applied this FOCUSS algorithm with the following iteration rule: ŝ(k + 1) = D(k + 1) inv[Â D(k + 1)] x(k + 1),
(6)
where the diagonal matrix D is obtained by: D(k + 1) = diag[ |ŝ(k)|^(1−p/2) ],
(7)
and inv[·] denotes the pseudo-inverse operation: inv[W] = W^T (W W^T)^(−1).
(8)
The initial diagonal elements are set to diag[D(0)] = [1, ..., 1]^T and the parameter p = 0.5. During the above iteration process a competition appears between the columns of Â as to which of them should represent the vector x. At the end, only some of the columns survive to represent v.
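Read literally, equations (6)-(8) amount to the following compact update per iteration; this is only our sketch (p = 0.5, pseudo-inverse via NumPy, and a small guard may be needed when entries of ŝ are exactly zero), not the authors' code.

    import numpy as np

    def focuss_step(A_hat, x, s_hat, p=0.5):
        # One iteration of eqs. (6)-(8): re-weighted minimum-norm source update.
        D = np.diag(np.abs(s_hat) ** (1.0 - p / 2.0))       # eq. (7)
        AD = A_hat @ D
        return D @ AD.T @ np.linalg.inv(AD @ AD.T) @ x      # eqs. (6) and (8)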
3.2 Mixing matrix estimation
With a given source signal estimate ŝ, the mixing matrix Â is estimated by using an iterative rule:
(9)
, T (A(k)R , The forgetting factor γ(k) = tr[A(k) s,s, − Rxs,)] ensures that , is kept approximately constant during the the Frobenius norm of the matrix A iteration process, thus enforcing the stability of the algorithm.
3.3
Postprocessing for (n-1) sensors to (n) sources
After all the previous steps a proper estimation of the sources is usually done for all such vector samples, where at least one source sample is equal to zero. If all sources have non-zero values, then the estimated sample vector is estimated wrongly. At least in some specific signal cases this error could be corrected. Let us assume that (n-1) sensors are available for (n) sources. Then the crossing section of (n-1) hyperplanes in the n-hyperspace determines a line in the n-dimensional space. A wrong signal vector corresponds to a point on this line, whereas the solution point is located somewhere else on this line. Thus the proper solution can be obtained by a linear shift of the estimated point. The direction cosine of the solution line is dependent on the known (estimated) mixing matrix and it is independent of the sources. Thus, if we find the proper correction value (at a given time sample) for one estimated source, we will be able properly to correct all the remaining outputs. Usually there is no need for any post-correction if the sources are spiky signals, i.e. with high probability in each time sample at least one of the sources is equal to zero.
3.4
The ST constrain
The proposed correction mechanism can be applied to several types of source signals. The sources may be binary signals or three-valued positive signals. They may even be of general waveform but subject to our so called ST-constrain,
830 i.e. they should fluctuate rather slowly and smoothly in comparison with the sampling frequency – only one source is allowed to have an amplitude change for two consecutive pixels at given image position.
4.
Test results
In our experiments, natural or synthetic grey-scale images (with 256 grey levels) are used. Their size is equal to 256 × 384 and 256 × 256. Before the start of the learning procedure the image signals should be transformed to zero–mean signals, and for compatibility with the learning rate and initial weights they are also scaled to the interval [−1.0, 1.0]. The obtained results can be assessed quantitatively by using suitable mathematical measures, like SNR (signal-to-noise ratio) between each reconstructed source and the corresponding original source. In the first two experiments the reconstruction of binary images is shown, i.e. binary edge images (Fig. 1) or binary intensity images (Fig. 2). Obviously these are non-spiky signals and in some areas, in which all three sources take non–zero values, the separation fails.
Figure 1. Example of edge image extraction - binary edge images: (a) three binary edge images, (b) two edge image mixtures, (c) reconstructed binary edge images.
Figure 2. Example of a natural face image reconstruction - binary source images: (a) three binary source images, (b) two mixtures, (c) reconstructed binary sources.
We have extended the testing to non-binary sources. If the sources are three-valued signals or they satisfy the specific ST-constraint, they can also be extracted from an under-determined mixture, i.e. if for n sources the number of mixtures is (n-1). In such situations the third post-processing step is applied. Examples of results for three-valued image sources are shown in Fig. 3. For an ST-constrained intensity image set the results are shown in Fig. 4. Obviously, natural image sets usually do not satisfy our ST-constraint. Hence this
constraint was artificially satisfied by making two more copies of each pixel (and adding them to the image) and left-shifting the second and third source by one or two pixels, respectively. As shown in Fig. 4, the extracted sources are disturbed by a low frequency error.
Figure 3. Example of a natural face image reconstruction - three-valued images: (a) three three-valued source images, (b) two mixtures, (c) reconstructed sources.
Figure 4. Example of a natural face image reconstruction, subject to the ST-restriction.
REFERENCES
1. A. J. Bell, T. J. Sejnowski, An information maximization approach to blind separation and blind deconvolution, Neural Computation, vol. 7, 1995, 1129–1159.
2. X. R. Cao, R. W. Liu, A general approach to blind source separation, IEEE Trans. on Signal Processing, vol. 44(1996), March 1996, 562–571.
3. S. S. Chen, D. L. Donoho, Application of Basis Pursuit in Spectrum Estimation, ICASSP'98, Proceedings, vol. 3(1998), 1865–1868.
4. A. Cichocki, S. Amari, Adaptive Blind Signal and Image Processing, John Wiley, Chichester, UK, 2003 (II. corrected edition).
5. I. F. Gorodnitsky, B. D. Rao, Sparse Signal Reconstruction from Limited Data Using FOCUSS: A Re-weighted Minimum Norm Algorithm, IEEE Transactions on Signal Processing, vol. 45(1997), No. 3, 600–616.
6. A. Hyvärinen, J. Karhunen, E. Oja, Independent Component Analysis, John Wiley, New York, 2001.
7. J. Karhunen, A. Cichocki, W. Kasprzak, P. Pajunen, On neural blind separation with noise suppression and redundancy reduction, Int. J. of Neural Systems, vol. 8(1997), No. 2, 219–237, World Scientific Publ., London–Singapore.
8. Y. Li, A. Cichocki, S. Amari, Analysis of sparse representation and blind source separation, Neural Computation, vol. 16(2004), 1–42.
9. P. Pajunen, An Algorithm for Binary Blind Source Separation, Helsinki Univ. of Technology, Lab. of Computer and Information Science, Report A36, 1996.
INTERACTIVE CONTRAST ENHANCEMENT BY HISTOGRAM WARPING Mark Grundland and Neil A. Dodgson Computer Laboratory, University of Cambridge Cambridge, United Kingdom [email protected]
Abstract:
We present an interactive contrast enhancement technique for the global histogram modification of images. Through direct manipulation, the user adjusts contrast by clicking on the image. Contrast around different key tones can be adjusted simultaneously and independently without altering their luminance. Histogram warping by monotonic splines performs the gray level mapping. User interfaces for contrast correction find application in digital photography, remote sensing, medical imaging, and scientific visualization.
Keywords:
Image editing, interactive image enhancement, contrast enhancement, histogram modification, gray level transformation function, monotonic splines.
I don't paint things. I only paint the difference between things. — Henri Matisse (1869-1954)
1.
INTRODUCTION
Contrast is an organizing principle of visual communication, serving to bring the visual structure of information into focus. Perception draws distinctions between objects from differences between intensities. For the viewer, contrast attracts attention. For the artist, contrast conveys emphasis. In graphic design, as a way of making the first impression, contrast is used to tell the eye where to go. When presenting information, contrast makes the composition legible. When a picture carries the message, contrast is often applied to underline it. As an accent or an overstatement, contrast engages the viewer’s interest. In imaging, contrast reflects a necessary compromise,
since the human visual system accommodates a dynamic range that is several orders of magnitude greater than the ones available to image reproduction systems. For a photograph to recreate the visual impression of a natural scene, the balance between light and dark tones often requires careful adjustment. For instance, the nature photographer Ansel Adams developed methods for selectively overexposing and underexposing the print by waving cardboard cutouts over portions of the print during exposure. While modern image processing offers effective region selection and feathering tools, interactive contrast adjustment has remained a needlessly cumbersome task.
Figure 1. By adjusting contrast, the user can emphasize different aspects of the same image.
Our aim is to provide a simple, flexible, and precise interactive procedure for specifying the global contrast enhancement of an image. This spatially invariant, global histogram modification operation is performed by a gray level transformation function T ( x ) . At each gray level, its slope determines the change in contrast while its displacement indicates the shift in tone. Simplicity requires that the entire transformation can be easily accomplished by pointing and clicking on the image, so that the user need not be concerned with the shapes of the transformation function and the image histogram. Flexibility requires that contrast can be adjusted simultaneously at more than one point in the tonal range. Precision requires that contrast can be adjusted independently at each chosen point of the tonal range. Our histogram warping method (Figure 1) meets these requirements by using
continuously differentiable monotonic splines to express the gray level transformation.
2.
RELATED WORK
In image enhancement1,2 for visual inspection, the role of user interaction in contrast adjustment has received surprisingly little attention. Apart from the numerous automatic algorithms, there are three broad approaches for interactively specifying gray level transformation functions. The transformation may be defined indirectly through histogram specification1,3-5. The user is still faced with the dilemma of selecting the correct histogram for the image. Without taking into account the original histogram, forcing the image to conform to an arbitrary histogram can yield unpredictable results since it is difficult to foresee how much distortion the transformation entails. Moreover, the relationship between the shape of a histogram and the relative contrast of an image may not be readily apparent to an untrained user. A histogram that appears ideal for one image can prove unsuitable for another despite any similarities between the two pictures. Only at most the first three statistical moments of the histogram have been shown to predictably affect contrast6. A flat histogram maximizes the entropy of the encoded information while a hyperbolic histogram maximizes the entropy of the perceived brightness7. As Gonzalez and Woods1 observe, "in general, however, there are no rules for specifying histograms". Alternatively, the gray level transformation may be expressed directly by a mathematical function y = T(x) with its parameters chosen by the user. Since reversing image polarity is not normally desirable, a monotonic increasing transformation T′(x) ≥ 0 is required to preserve the natural order of gray levels. Since abrupt transitions between differing degrees of stretching and compression of the tonal range can cause visible defects, a continuously differentiable C1 transformation is required to avoid artificial discontinuities in the new histogram f(T⁻¹(y)) T′(T⁻¹(y))⁻¹ that results from transforming the original histogram f(x). Our histogram warping technique uses splines designed to meet these two requirements. Previously, contrast enhancement has been performed by linear1,2,8, quadratic9, cubic5,9, sigmoidal2,10, logarithmic1, and power law1,2 functions. Default parameters may be obtained by the optimization of a mathematical criterion8 or the study of user preference10. These simple formulas lack the necessary degrees of freedom to express simultaneous and independent contrast adjustments at different points in the tonal range. Piecewise defined functions can cope with this challenge. Existing implementations fail to meet our requirements, as piecewise exponential11 and piecewise linear1,12 histogram transformations
are not continuously differentiable while cubic splines5 can cease to be monotonic in regions of heightened contrast.
Figure 2. Interactive histogram warping: the original image with its histogram, the contrast adjustments with the gray level transformation, and the resulting image with its histogram.
Existing image processing packages, such as Adobe Photoshop 7, invite the user to literally draw the gray level transformation curve. As the shape of the curve changes both tone and contrast at the same time, such a user interface demands considerable skill and practice. Instead of focusing on getting the image right, the user must pay attention to getting the curve right. Using design galleries13 or interactive evolution by aesthetic selection14 to explore the parameter space15 of gray level transformations is a plausible alternative, although these approaches to user interaction are usually reserved for applications where direct manipulation does not suffice.
3.
INTERACTION
A quick way to adjust contrast is to click on the picture. Our user interface (Figure 2, middle) displays the original image alongside the transformed image and a grouped list of contrast adjustments. The contrast
adjustment of several key tones can be combined in a single transformation. With the mouse, the user selects the key tones of the original image by clicking on it. Depending on the mouse button, the contrast of a key tone can be raised, lowered, or preserved. The degree of contrast adjustment can be set using the arrow keys or the mouse wheel. Acting as anchors, these key tones are preserved by the gray level transformation to ensure that the overall tonal balance of the image is maintained. What changes is the tonal spectrum around these key tones. Where the contrast is raised by increasing T′(x) > 1, the histogram is stretched, and image details of that tone become more prominent. Conversely, where the contrast is lowered by reducing 0 ≤ T′(x) < 1, the histogram is compressed, and image details become more subdued. Raising the contrast in one region of the tonal range necessitates lowering contrast in another, and vice versa. Where the transformation coincides with the identity mapping T(x) = x and T′(x) = 1, tone and contrast are left unchanged. As more key tones are added, the effect of the contrast adjustments becomes more subtle since the transformation preserves their luminance. Usually a transformation calls for no more than three key tones. A succession of transformations may be applied until the desired effect is achieved. Since the transformations are monotonic and invertible, as long as the gray levels are not quantized, no information is lost. Of course, quantization permits more efficient use of processor and memory resources by discretizing the gray level mapping as a lookup table.
4.
IMPLEMENTATION
Our histogram warping technique maps gray levels b_k = T(a_k) according to their contrast adjustments d_k = T′(a_k). For monotonicity, we require that the sequence a_k is strictly increasing, b_k is increasing, and d_k is finite and nonnegative. In our application, the key tones b_k = a_k are preserved and their contrast adjustments d_k are determined by the user. The endpoints of the dynamic range may be treated as key tones and, optionally, the contrast of these extreme highlights and shadows may be lowered to raise the contrast of the midtones. Alternatively, the endpoints of the image's tonal range may be remapped to set its white point and its black point. To ensure a continuously differentiable monotonic transformation for any valid choice of inputs, we rely on a piecewise rational quadratic interpolating spline16,17:

T(x) = b_{k−1} + (b_k − b_{k−1}) · [ r_k t² + d_{k−1} (1 − t) t ] / [ r_k + (d_k + d_{k−1} − 2 r_k)(1 − t) t ],    (1)

with r_k = (b_k − b_{k−1}) / (a_k − a_{k−1}) and t = (x − a_{k−1}) / (a_k − a_{k−1}) for x ∈ [a_{k−1}, a_k].
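A small sketch of how the spline of Eq. (1) could be evaluated, for instance to build a lookup table for quantized gray levels; this is not the authors' implementation, and the example key tones and slopes below are hypothetical.

```python
import numpy as np

def histogram_warp(x, a, b, d):
    """Evaluate the piecewise rational quadratic spline T(x) of Eq. (1).

    a : strictly increasing key tones a_0 < ... < a_K
    b : target gray levels b_k = T(a_k), assumed strictly increasing per segment
    d : slopes d_k = T'(a_k), finite and non-negative
    """
    x = np.asarray(x, dtype=float)
    k = np.clip(np.searchsorted(a, x, side='right'), 1, len(a) - 1)
    a0, a1, b0, b1, d0, d1 = a[k - 1], a[k], b[k - 1], b[k], d[k - 1], d[k]
    t = (x - a0) / (a1 - a0)
    r = (b1 - b0) / (a1 - a0)
    num = r * t**2 + d0 * (1 - t) * t
    den = r + (d1 + d0 - 2 * r) * (1 - t) * t
    return b0 + (b1 - b0) * num / den

# hypothetical usage: raise contrast around mid-tone 0.5 while keeping key tones fixed
a = np.array([0.0, 0.5, 1.0])
b = a.copy()                           # key tones preserve their luminance
d = np.array([0.5, 2.0, 0.5])          # >1 stretches, <1 compresses the histogram
lut = histogram_warp(np.linspace(0, 1, 256), a, b, d)   # lookup table for 8-bit images
```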
As future research, it would be interesting to explore more precise control over where the histogram is stretched and compressed by using a piecewise rational cubic interpolating spline with adjustable shape parameters17.
5.
CONCLUSION
Contrast adjustment is a common user task in digital photography, remote sensing, medical imaging, and scientific visualization. Our technique enables the user to quickly select the key tones of an image and change their contrast without altering their luminance. In future research, we will apply our histogram warping framework to other histogram modification tasks.
REFERENCES 1. Gonzalez, R. C. and Woods, R. E. 2002. Digital Image Processing, 2 ed. Prentice Hall. 2. Zamperoni, P. 1995. Image Enhancement. Advances in Imaging and Electron Physics, 92, 1-77. 3. Hummel, R. A. 1975. Histogram Modification Techniques. Computer Graphics & Image Processing, 4, 3, 209-224. 4. Gonzalez, R. C. and Fittes, B. A. 1977. Gray-Level Transformations for Interactive Image Enhancement. Mechanism & Machine Theory, 12, 1, 111-122. 5. O'Gorman, L. and Brotman, L. S. 1985. Entropy-Constant Image Enhancement by Histogram Transformation. Proceedings of SPIE, 575, 106-113. 6. Thompson, D. D. and Gonzalez, R. C. 1983. Image Enhancement by Moment Specification. In Proceedings of the 15th Southeastern Symposium on System Theory, 134-137. 7. Frei, W. 1977. Image Enhancement by Histogram Hyperbolization. Computer Graphics & Image Processing, 6, 3, 286-294. 8. Xu, X. and Miller, E. L. 2002. Entropy Optimized Contrast Stretch to Enhance Remote Sensing Imagery. In Proceedings of 16th International Conference on Pattern Recognition, vol. 3, 915-918. 9. Guo, L. J. 1991. Balance Contrast Enhancement Technique and Its Application in Image Colour Composition. International Journal of Remote Sensing, 12, 10, 2133-2151. 10. Braun, G. J. and Fairchild, M. D. 1999. Image Lightness Rescaling Using Sigmoidal Contrast Enhancement Functions. Journal of Electronic Imaging, 8, 4, 380-393. 11. Raji, A., Thaibaoui, A., Petit, E., et al. 1998. A Gray-Level Transformation-Based Method for Image Enhancement. Pattern Recognition Letters, 19, 13, 1207-1212. 12. Sang-Yeon, K., Dongil, H., Seung-Jong, C., et al. 1999. Image Contrast Enhancement Based on the Piecewise-Linear Approximation of Cdf. IEEE Transactions on Consumer Electronics, 45, 3, 828-834. 13. Marks, J., Andalman, B., Bearsley, P. A., et al. 1997. Design Galleries: A General Approach to Setting Parameters for Computer Graphics and Animation. In Proceedings of SIGGRAPH, 389-400.
14. Sims, K. 1993. Interactive Evolution of Equations for Procedural Models. Visual Computer, 9, 8, 466-476. 15. Taosong, H., Lichan, H., Kaufman, A., et al. 1996. Generation of Transfer Functions with Stochastic Search Techniques. In Proceedings of the 7th IEEE Visualization Conference, 227-234. 16. Gregory, J. A. and Delbourgo, R. 1982. Piecewise Rational Quadratic Interpolation to Monotonic Data. IMA Journal of Numerical Analysis, 2, 123-130. 17. Sarfraz, M., Al-Mulhem, M., and Ashraf, F. 1997. Preserving Monotonic Shape of the Data Using Piecewise Rational Cubic Functions. Computers & Graphics, 21, 1, 5-14.
RADIAL BASIS FUNCTION USE FOR THE RESTORATION OF DAMAGED IMAGES
Karel Uhlir, Vaclav Skala University of West Bohemia, Univerzitni 8, 30614 Plzen, Czech Republic
Abstract:
Radial Basis Function (RBF) can be used for reconstruction of damaged images, filling gaps and for restoring missing data in images. Comparisons with standard method for image inpainting and experimental results are included and demonstrate the feasibility of the use of the RBF method for image processing applications.
Key words:
inpainting, radial basis functions, interpolation, image processing
1.
INTRODUCTION
One of the interesting problems is how to reconstruct an image as well as possible from a damaged or incomplete original. This problem is referred to in many papers1. The main question is: "What value was in a corrupted position and how can I restore it?" The Radial Basis Function method (RBF) is based on the variational implicit functions principle and can be used for interpolation of scattered data. The possibility of missing data restoration (image inpainting) by the RBF method was mentioned in Kojekine & Savchenko2. They used this method for surface retouching and marginally for image inpainting as well. They used compactly supported radial basis functions (CSRBF)3 for reconstruction and an octree data structure for representation of the parts for reconstruction. The advantage of this method is that the linear system is sparse and can be solved easily4. The drawback of
This work was supported by the Grant No. MSM 235200005
this approach is the error which can result from an improper selection of the radius of support of the CSRBF. In this paper we used a global radial basis function for image reconstruction, inpainting and drawing removal.
2.
PROBLEM DEFINITION
Let us assume that we have an image Ω with resolution M × N and 256 gray levels. Some pixels have incorrect values (missing or overwritten), see Fig. 1(a-c). We would like to restore the original image or remove inpainting etc. Let us assume that we can detect "missing pixels", pixels with corrupted values or inpainted pixels5, too. For our experiments we used original images, see Fig. 1d, and noise, writing or drawing was used to corrupt them. Note that restoration of the original image is related to the scattered data interpolation problem, where many points are not defined and we want to find a value for them.
Figure 1. Images with inpainting, noise, scratches and the original one (a, b, c, d).
3.
RADIAL BASIS FUNCTIONS
Let us describe the RBF method now. The RBF method may be used to interpolate a smooth function given by n points. The resulting interpolating function thus becomes6:

f(x) = Σ_{j=1}^{n} λ_j φ(‖x − c_j‖) + P(x),    Σ_{i=1}^{n} λ_i c_{x_i} = Σ_{i=1}^{n} λ_i c_{y_i} = Σ_{i=1}^{n} λ_i = 0,    (1, 2)
where f(c_i) = h_i, for i = 1,…,n, c_j are given locations of a set of n input points (pixels), λ_j are unknown weights, x is a particular point and φ(‖x − c_j‖) is a radial basis function, ‖x − c_j‖ = r_j is the Euclidean distance (of pixels in our
case) and P(x) is a polynomial of degree m depending on the choice of φ. There are some popular choices for the basis function, e.g. the thin-plate spline φ(r) = r² log(r), the Gaussian φ(r) = exp(−ξr²), the multiquadric φ(r) = √(r² + ξ²), the biharmonic φ(r) = |r| and triharmonic φ(r) = |r|³ splines, where ξ is a parameter. Now we have the linear system of equations Eq. (1) with unknowns λ_j, a_x, a_y, a_z. Natural additional constraints for the coefficients λ_j must be included in Eq. (2) to ensure orthogonality of a solution. These equations and constraints determine the linear system:
B [λ; a] = [h; 0],  where  B = [ A  P ; Pᵀ  0 ]  and  P = [ c_{1x}  c_{1y}  1 ; … ; c_{nx}  c_{ny}  1 ],    (3)

A_{i,j} = φ(‖c_i − c_j‖),  i, j = 1, …, n,
a = [a_x, a_y, a_z]ᵀ,  λ = [λ_1, λ_2, ..., λ_n]ᵀ,  h = [h_1, h_2, ..., h_n]ᵀ.
The polynomial P(x) in Eq. (1) ensures positive-definiteness of the solution matrix B3. Afterwards, the linear equation system Eq. (3) is solved; once the solution vector with λ and a is known, the function f(x) can be evaluated for an arbitrary point x (a pixel position in our case)3,7,8,9.
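The following sketch shows how system (3) could be assembled and solved for the thin-plate spline basis; it is a generic illustration using numpy, not the code used by the authors.

```python
import numpy as np

def tps_basis(r):
    # thin-plate spline phi(r) = r^2 log r, with the convention phi(0) = 0
    with np.errstate(divide='ignore', invalid='ignore'):
        out = r ** 2 * np.log(r)
    return np.nan_to_num(out, nan=0.0, neginf=0.0)

def rbf_fit(centers, values):
    """Solve B [lambda; a] = [h; 0] of Eq. (3) for 2D points `centers` and values h."""
    n = len(centers)
    A = tps_basis(np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1))
    P = np.hstack([centers, np.ones((n, 1))])          # rows [c_x, c_y, 1]
    B = np.block([[A, P], [P.T, np.zeros((3, 3))]])
    sol = np.linalg.solve(B, np.concatenate([values, np.zeros(3)]))
    return sol[:n], sol[n:]                            # lambda weights, [a_x, a_y, a_z]

def rbf_eval(x, centers, lam, a):
    """Evaluate f(x) of Eq. (1) at a pixel position x = (x1, x2)."""
    phi = tps_basis(np.linalg.norm(x - centers, axis=-1))
    return phi @ lam + a[0] * x[0] + a[1] * x[1] + a[2]
```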
4.
IMAGE RESTORATION
For image reconstruction we used the RBF method mentioned above and applied it within a 5 × 5 window of pixels: DefineNeighborhood(5,5); LoadImage(Ω); Repeat For (i,j=1; i<…

[ΔGst − ΔGstp] < W, where W is a suitable threshold, or according to some fuzzy logic based rules. The crisp method implies defining many thresholds W, chosen discretionally depending on U. Interestingly, the fuzzy logic approach, implemented by soft agents resident on the PC boards installed at every crossroad, is less discretional once the memberships of the adopted fuzzy functions are defined
(e.g., the H-algorithm should not be applied if U is low and the car flow changes slowly, or if U is medium-high and the car flow changes very little), and allows us to implement strategies of traffic regulation derived from common sense rules. If the PC boards are powered by Fuzzy Processors4, the time to evaluate, according to fuzzy rules, whether the webcam images should be sent to a remote server for subsequent processing can be very small.
REFERENCES
1. Sussman J., Introduction to transportation systems (Artech House, 2000)
2. Davis L. et al., W4: real time surveillance of people and their activities, IEEE Trans. on Pattern Analysis and Machine Intelligence, 22/8 (2000)
3. Fisher R. et al., Connected Component Labeling (2003), http://homepages.inf.ed.ac.uk/rbf/HIPR2/label.htm#1
4. ST Microelectronics: The fuzzy processor ST52F13 – Catania 2002
STRUCTURAL OBJECT RECOGNITION BY PROBABILISTIC FEEDBACK András Barta and István Vajk Department of Automation and Applied Informatics, Budapest University of Technology and Economics, H-1111, Budapest, Goldman tér 3, Hungary
Abstract:
This paper investigates how Bayes networks can be applied to hierarchical object representation. It provides a hierarchical graph definition and a recursive algorithm to convert the objects to graph representations and also to reconstruct the image of the objects from its graph. The structural complexity of the objects is calculated to get compact representations.
Key words:
Object recognition, Graphical models, Bayesian networks, Graph theory, Complexity, Object library
1.
INTRODUCTION
The application of graphs for object recognition is an old idea, but lately it came into the limelight again. Graph structures with the help of probability models turned out to be powerful tools. They are called graphical models. A graphical model can capture and store the model description and also provide a computational background. Bayesian or belief networks encode probabilistic relationships among variables or objects. A probability is assigned to every node of the graph and the edges provide information about their dependencies. This probability represents the subjective belief or knowledge about the node. This paper investigates how Bayes networks can be applied to hierarchical object representation. It provides a hierarchical graph definition and a recursive algorithm to convert the objects to graph representations and also to reconstruct the image of the objects from its graph. The structural complexity of the objects is calculated to get compact representations. In this paper we present only the main ideas behind our algorithm, the exact theoretical description is not given. The definition of the graph structure and
the structural complexity is given in Section 2. Section 3 overviews the algorithm and the results of a simple simulation. The graph can be described by the joint probability distribution

P(n_1, n_2, ..., n_k) = Π_{i=1}^{k} P(n_i | pa(n_i)),    (1)
where n_i represents the state of the node and pa(n_i) represents the state of the parents of node n_i. In order to define the probabilistic network one must specify the prior probabilities of the root nodes and the conditional probabilities of the other nodes. Graphical models include several network formulations, hidden Markov models (Baum and Petrie, 1966), Markov random fields (Kindermann and Snell, 1980) and their variations. They are treated in a unified framework by Smyth (1997) and Heckerman (1996). The purpose of an object recognition system is to find appropriate image elements, features, and represent the image by the combination of these bases. These image elements should be stored in a library in a graph structure in a hierarchical way. The graph should capture this structure, store the required parameters and assign the probabilities to the nodes and edges. The probabilities of the graph evolve during the object recognition: the probabilities of some nodes increase, the others decrease. At the end of the object recognition the high probability nodes represent the image. There is a graph structure construction phase and a probability calculation phase. In this paper inferring the structure is not examined, though it is an important part of any object recognition system. The only way that the structure is modified is through the structural complexity optimization. There are many algorithms which can be used to calculate the probabilities of the network. The network can be used for top-down and bottom-up calculation. This bidirectional information flow helps to reduce the calculation complexity. The algorithm that is used is similar to the forward backward algorithm.
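The factorization in Eq. (1) can be illustrated with a toy three-node network; the node names, states and conditional probability tables below are purely hypothetical and only show how the product of conditionals is evaluated.

```python
# parents and conditional probability tables of a tiny, invented belief network
parents = {'edge': [], 'corner': ['edge'], 'square': ['corner']}
cpt = {
    'edge':   {(): {True: 0.6, False: 0.4}},
    'corner': {(True,): {True: 0.7, False: 0.3}, (False,): {True: 0.1, False: 0.9}},
    'square': {(True,): {True: 0.8, False: 0.2}, (False,): {True: 0.05, False: 0.95}},
}

def joint_probability(states):
    """P(n_1, ..., n_k) = prod_i P(n_i | pa(n_i)), Eq. (1)."""
    p = 1.0
    for node, state in states.items():
        parent_states = tuple(states[q] for q in parents[node])
        p *= cpt[node][parent_states][state]
    return p

print(joint_probability({'edge': True, 'corner': True, 'square': True}))  # 0.6*0.7*0.8
```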
2.
SPATIAL GRAPH DEFINITION
Several graph structures may be defined, depending on the represented objects, from a simple binary tree (Messmer and Bunke, 1998) to a sophisticated multi-hierarchical graph structure (Barta and Vajk, 2000). There is a good survey of graph-based methods by J.M. Jolion and W.G. Kropatsch (1998). The objects are identified by features and their relationships. The graph structure G is defined as follows. The graph has k
nodes and they have the form n_i = [r_i, a_iᵀ]ᵀ, where r_i is a reference to a library object or feature and a_i is a parameter vector. Since the features belong to parameterized feature classes, the a_i vector is necessary to identify their parameters. The graph has m edges and they have the form e_i = [q_iᵀ, t_iᵀ]ᵀ, where q_i is the connection vector and t_i is the relation vector. The q_l = [i, j]ᵀ vector identifies the nodes that are connected by the edge. The graph is represented by the node and edge matrices, G = (N, E),

N = [n_1  n_2  ...  n_k],    E = [e_1  e_2  ...  e_m].    (2)
Visual information is inherently spatially ordered, so we define t i to represent the spatial relationship of two features. The graph structure, presented here, is applicable directly only to two-dimensional objects, but similarly with more complicated spatial relationships, it can be redefined for three-dimensional objects also. The graph edge represents an orthogonal, linear transformation from one feature to another. This transformation has three components, translation (displacement), rotation and scaling and can be given in matrix form in homogeneous coordinates,
T = S(l) R(φ) D(d cos β, d sin β)
  = [ l 0 0 ; 0 l 0 ; 0 0 1 ] · [ cos φ  −sin φ  0 ; sin φ  cos φ  0 ; 0 0 1 ] · [ 1 0 0 ; 0 1 0 ; d cos β  d sin β  1 ],    (3)
and its parameters in vector form: t_i = [d, β, l, φ]ᵀ. From the graph of an object, the object library and the image coordinate system the object can be reconstructed. A picture element or a feature is represented in its own local F_i coordinate system. Since only two-dimensional objects are used and the scale factor is the same for both axes, the coordinate system can be represented by two points, the origin and the end of the x coordinate vector:
F = F(Po_x, Po_y, s, α) = [ Po_x  Po_y  1 ; Pe_x  Pe_y  1 ] = [ Po_x  Po_y  1 ; Po_x + s cos α  Po_y + s sin α  1 ],    (4)
and in vector form f_i = [Po_{x,i}, Po_{y,i}, s_i, α_i]ᵀ. Each feature is defined in a unit coordinate system and stored in the library. The features are transformed from the library to the coordinate system of the representing node. The coordinate system of a feature can be calculated from the coordinate system of the predecessor node and the transformation of their connecting edge, F_i^(F_{i−1}) = U T_{i−1,i}. This predecessor node then has to be transformed to the image coordinate system F_0, that is F_i^(I) = F_i^(F_{i−1}) T_{0,i} = U T_{i−1,i} T_{0,i}. This recursive algorithm transforms the objects back into the image coordinate system (Barta, 2002).
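A short sketch of how the transformation of Eq. (3) and the propagation of a feature coordinate system through one edge might look; the row-vector convention and the function names are assumptions made for the example, not the authors' code.

```python
import numpy as np

def edge_transform(d, beta, l, phi):
    """Product of the scaling, rotation and translation matrices of Eq. (3)
    for an edge parameter vector t = [d, beta, l, phi]^T (row-vector convention)."""
    S = np.diag([l, l, 1.0])                                    # scaling by l
    R = np.array([[np.cos(phi), -np.sin(phi), 0.0],
                  [np.sin(phi),  np.cos(phi), 0.0],
                  [0.0,          0.0,         1.0]])            # rotation by phi
    D = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [d * np.cos(beta), d * np.sin(beta), 1.0]])   # displacement
    return S @ R @ D

def feature_frame(parent_frame, d, beta, l, phi):
    """Propagate a frame F (rows [origin, end of x axis] in homogeneous form)."""
    return parent_frame @ edge_transform(d, beta, l, phi)

root = np.array([[0.0, 0.0, 1.0],      # origin of the image coordinate system
                 [1.0, 0.0, 1.0]])     # end of its x axis
child = feature_frame(root, d=2.0, beta=0.0, l=0.5, phi=np.pi / 2)
```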
This reconstructing algorithm iterates through the graph structure. Since most objects are composed of several simpler composite objects or features, in this way the representation becomes hierarchically structured. For object evaluation a graph distance and the structural complexity are defined. The graph distance can be used to compare features and objects to their corresponding image. The structural complexity quantifies how complex the objects are. It is an important quantity, because during the object recognition a simplified object representation is gained from the complex pixel representation. The graphs of the objects are compared against the library graphs. The comparison can be quantified by graph distance measures. Several graph distance measures have been suggested. The most frequently used definitions are listed by El-Sonbaty and Ismail (1998) and Messmer and Bunke (1998). In this paper a node distance definition is used. The features are compared and the sum of their differences is used as the graph distance, d_G(G_O, G_I) = Σ_{l=1}^{n} d(f_O^l, f_I^l). The upper index indicates the feature class. Generally the object representations of an image by image elements are not unique. In order to evaluate the many different representations a distance measure definition is necessary which quantifies the internal complexity of the objects. Simpler object descriptions should be preferred against more complex ones. The structural complexity of an object is the shortest possible description of the structural relationships of the object. In case of graph representation this is also a graph complexity measure. This complexity is defined as
c^M(o) = c_G^M(G) + c_L^M = c_node^M(G) + c_edge^M(G) + Σ_{l=1}^{k} c^M(o_l) I_l,    (5)

c_node^M = Σ_{i=1}^{n} (c_r + c_p(a_i)),    c_edge^M = Σ_{j=1}^{m} c_t(T_j),
and the structural complexity is defined as the minimum of this quantity:

c_s^M = min_{G*} c^M(o).    (6)
The minimum is calculated over every possible graph G* that represents the object. The complexity of an object consists of the complexity of its graph and the complexity of the library that is used for the representation. I_l is an indicator function for the used library objects. The
graph complexity of an object is defined as the sum of the complexity of its nodes and edges. The node complexity consists of a constant c_r value, which is responsible for referencing the feature class to the library, and of the complexity of the parameter vector belonging to that object class. The complexity of an edge is the complexity of the spatial transformation. In order to be able to calculate the complexity, a suitable metric M has to be defined on the c_r, c_p, c_t values.
3.
ALGORITHM, SIMULATION
In order to test the different methods a simple simulation is performed for high-level object recognition. Artificial objects are selected instead of real objects for the sake of simplicity and because the results are not distorted by the different feature detection algorithms. The objects are artificially created in a hierarchical way, building them up from simple to more complicated ones. Only straight lines and quadratic curves are used. The simulation consists of two parts: the object recognition and the object library creation. For testing the object recognition process an image with several of the library objects is created. The object recognition consists of two steps. In the first, initialization step, object hypotheses are created by assigning the prior probabilities to the nodes. In the second step the probabilities of the nodes are recalculated. The recognition starts from a random feature location. A second feature is selected randomly but with a normal position distribution, whose mean value is the position of the first feature and whose deviation is the average object size. From these two features an initial sub-graph is calculated. This sub-graph is searched in the library. If it is found, then the object in which it is found serves as an object hypothesis. The object hypothesis is compared against the image graph based on a graph distance measure. It is not necessary to calculate the whole image graph. From this initial sub-graph the image of the library object is projected back into the image and only those areas of the image are searched which are close to this projection. This projection is calculated recursively from the object graph based on the node probabilities. The probabilities in the backward path are calculated by the Bayes rule. This feedback stabilizes and speeds up the algorithm. In each step a new feature of the object is projected and compared to the closest feature in the image. Based on this distance measure the object probability is increased or decreased. A threshold is applied to accept or reject the object hypotheses. The simulation shows that for a wrong object hypothesis the graph distance measure increases rapidly, thus making it possible to abort the calculation early, after a few iteration steps, and start a new search. If an object is accepted, then that portion of the image is
identified as an object. The recognition process is controlled by the structural complexity. The calculation always proceeds such that the structural complexity is reduced.
ACKNOWLEDGMENT The research work was supported by the fund of the Control Research Group of the Hungarian Academy of Sciences and by the OTKA T042741 fund. The supports are kindly acknowledged.
REFERENCES 1. Barta, A., I. and Vajk, I., An Attributed Graph Structure for Object Recognition. IASTED International Conference on Signal Processing, Pattern Recognition and Applications, Crete, (2002) 2. Baum, L.E., Petrie, T., Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist. 37, 1554–1563. (1966) 3. El-Sonbaty, Y., M.A. Ismail, M.A., A new algorithm for subgraph optimal isomorphism. Pattern Recognition 2, 205-218. (1998) 4. Heckerman D., Bayesian networks for knowledge Discovery, Advances in Knowledge discovery and data mining, AAAI Press / MIT Press, pp. 273-307 (1996) 5. Ibañez, M.V. and Simó A., Parameter estimation in Markov random field image modeling with imperfect observations. A comparative study. Pattern recognition letters 24, pp 2377-2389, (2003) 6. Jolion J.M. and Kropatsch W.G., Graph Based Representation in Pattern Recognition, Springer-Verlag, (1998) 7. Kindermann, R., Snell, J.L., Markov random fields and their application. American Mathematical Society, Providence, RI, (1980) 8. Messmer, B.T., Bunke, H., A decision tree approach to graph and subgraph isomorphism detection. Pattern Recognition 12, (1979-1998). 9. Myers, R., Wilson, R.C., Hancock, E.R., Bayesian Graph Edit Distance, Pattern Analysis and Machine Intelligence 6, 628-635. (2000) 10. Smyth P., Belief networks, hidden Markov models, and Markov random field: A unifying view, Pattern recognition letters, pp. 1261-1268, (1997)
EVALUATING THE QUALITY OF MAXIMUM VARIANCE CLUSTER ALGORITHMS Krzysztof Rzadca Polish-Japanese Institute of Information Technology ul. Koszykowa 86, 02-008 Warsaw, Poland [email protected]
Abstract
An efficient and straightforward method to compute the variance in the Maximum Variance Cluster algorithm is proposed. Experiments on both artificial and real data sets have been performed. Both the time of execution and the quality of the results were considered. According to the results, the proposed method speeds up the MVC algorithm by up to 50% without losing its accuracy on the majority of the data sets considered.
1.
INTRODUCTION
The data clustering problem consists of finding a partition of given data in terms of similarity. Intuitively, data points assigned to each cluster should be more similar to the members of the same cluster than to the members of the other clusters. Therefore, a clustering algorithm maximizes the inner-cluster similarity, at the same time minimizing the between-cluster similarity. Clustering is an important step in data analysis and is used frequently in e.g. pattern recognition, image processing, information retrieval or data mining. More formally, let X = {x1, x2, . . . , xN} be a set of N = |X| data points in a p-dimensional space. Clustering consists of finding the set of clusters C = {C1, C2, . . . , CM} which minimizes a given criterion with given X and, usually but not necessarily, given M. There are a number of approaches to this problem (Jain et al., 1999). The algorithms differ both in terms of the clustering model used (hard/crisp, hierarchical/partitional, square error (MacQueen, 1967), graph theoretic, mixture-resolving (Mitchell, 1997), etc.) and in the way the search space is scanned (deterministic, stochastic, using genetic algorithms (Hall et al., 1999), simulated annealing, neural networks etc.). The rest of the paper is organized as follows. In section 2 we present the MVC algorithm. In section 3 we propose a method of reducing the computational
cost of the MVC. In section 4 we introduce cluster validity measures used for comparing the quality of the clusterings. Results of the experiments conducted are presented in section 5. Final remarks are presented in section 6.
2.
THE MAXIMUM VARIANCE CLUSTER ALGORITHM
The Maximum Variance Cluster algorithm (MVC) (Veenman et al., 2002) defines a novel constraint on the clustering model, since it restricts the minimum common diversity of every pair of clusters. Regarding the description proposed, the algorithm can be characterized as a stochastic partitional square error algorithm with crisp clusters. The MVC states that the variance of every pair of clusters must be higher than σ²_max (a parameter of the algorithm): ∀C_i, C_j, i ≠ j : σ²(C_i ∪ C_j) ≥ σ²_max, where σ²(Y) = H(Y)/|Y| is the variance of the cluster Y, H(Y) = Σ_{x∈Y} dist(x, μ(Y))² is the cluster error, dist is a distance measure function (e.g. Euclidean distance) and μ(Y) = (1/|Y|) Σ_{x∈Y} x is the cluster mean. The clustering part of the algorithm is performed as a stochastic optimization procedure in which the square error H is minimized while holding the constraint on the cluster variance. In each iteration, the algorithm tries to modify every cluster by applying one of the following operators. If σ²_A ≥ σ²_max, MVC uses the isolation operator and the cluster A is divided. The furthest point from the cluster mean forms a new singleton cluster. If σ²_A < σ²_max, the algorithm tries to apply the union operator. This operator unites the cluster with one of the neighboring clusters, but only when the joint variance of such a pair is below σ²_max. If the cluster has not been modified by the two previous operators, the algorithm uses the perturbation operator which tries to move a point which gives the best gain in terms of the error H from one of the neighbouring clusters. For a particular dataset, it is possible to determine the proper value (or values) of σ²_max by assessing the cluster tendency. In the MVC algorithm the cluster tendency can be represented as a curve showing the number of clusters (or the error H) as a function of σ²_max. Plateaus are regions on such a curve where the number of clusters does not change while increasing σ²_max. The best values for σ²_max come from the biggest plateaus. To construct such a curve one possibility is to repeat MVC with the value of σ²_max slightly changed in each run (the rMVC algorithm), but it involves a computational burden. One can also use a more robust heuristic, called Incremental MVC (IMVC, (Rzadca and Ferri, 2003)).
3.
ACCELERATING MVC
A detailed analysis of the MVC algorithm reveals that the operation repeated most often is the calculation of the current variance σ²_A of the cluster A. This is repeated for every pair of neighboring clusters in the union operator. The perturbation step also involves a related computation, as H_A = σ²_A · |A|. Every time the algorithm computes σ²_A, every data point from the cluster must be taken into account, so the computational complexity of this operation is O(n). Our proposal is to compute the diversity of a cluster differently. Inspired by statistics, we can compute the variance of a cluster A which consists of |A| p-dimensional points as:
σ² = 1/(|A| − 1) · Σ_{j=1}^{p} ( Σ_{i=1}^{|A|} x²_{i,j} − |A| μ(A)²_j ).    (1)
Let’s suppose that for each cluster we store the information about the current |A| sum of squared data points S(A)j = i=1 x2i,j and the current mean μ(A). On one hand, using those two values we can compute the variance without referring to the dataset. On the other, we can modify those values easily when the cluster is modified. If a point x∗ is added to a cluster, its values are added
to the sum: S(A)_j := S(A)_j + (x*_j)², and the mean is modified: μ(A)_j := (1/|A|)((|A| − 1) μ(A)_j + x*_j), where |A| is the size of the cluster with x* included. The formulas for removing a point from a cluster and joining two clusters are analogous. The result is that both the parameter computation (the variance and the error) and the operators can be performed in constant time, so their computational complexity is O(1). Later in this paper we will refer to the MVC algorithm which computes the variance in the way defined above as the Accelerated MVC (AMVC).
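A sketch of the constant-time bookkeeping described above (per-cluster sums of squares, mean and size); the class interface is hypothetical and simply mirrors Eq. (1) and the update formulas.

```python
import numpy as np

class ClusterStats:
    """O(1) per-cluster statistics for the Accelerated MVC (a sketch)."""
    def __init__(self, points):
        pts = np.asarray(points, dtype=float)
        self.n = len(pts)
        self.sq = (pts ** 2).sum(axis=0)      # S(A)_j = sum_i x_{i,j}^2
        self.mean = pts.mean(axis=0)          # mu(A)

    def add(self, x):                         # O(1) update when a point joins
        x = np.asarray(x, dtype=float)
        self.mean = (self.n * self.mean + x) / (self.n + 1)
        self.sq += x ** 2
        self.n += 1

    def remove(self, x):                      # O(1) update when a point leaves
        x = np.asarray(x, dtype=float)
        self.mean = (self.n * self.mean - x) / (self.n - 1)
        self.sq -= x ** 2
        self.n -= 1

    def variance(self):                       # Eq. (1), no pass over the data
        return (self.sq - self.n * self.mean ** 2).sum() / (self.n - 1)
```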
4.
CLUSTER VALIDITY
Comparing clusterings delivered by different algorithms with different parameter settings might be a difficult task, especially when considering multidimensional data sets which are hard to visualize. Therefore, indexes of cluster validity (Bezdek and Pal, 1998) have been introduced. Well separated clusters have small intra-cluster scatter, and large distances between clusters. Various validity indexes differ by the way those two features are measured and joined to form a real-valued validity function. The Davies-Bouldin Index (DB) defines the scatter of a cluster C_i as the qth root of the qth moment of the points in the cluster, S_{i,q} = ( (1/|X_i|) Σ_{x∈X_i} ‖x − μ_i‖₂^q )^{1/q}. To measure the distance between clusters C_i and C_j, one can measure the Minkowski distance of order t between their means: d_{ij,t} = ‖μ_i − μ_j‖_t. Given
c 1 Si,q + Sj,q max . c i=1 j,j =i dij,t
(2)
As one wants to have clusters with possibly small scatter and which lay far from each other, the DB index should be minimized. Dunn’s Index defines the scatter of a cluster Ci as its diameter: Δ(Ci ) = maxx,y∈Ci dist(x, y). A distance between clusters Ci and Cj can be expressed as a minimum distance between points in those two clusters: δ(Ci , Cj ) = minx∈Ci ,y∈Cj dist(x, y). Dunn’s index should be maximized and is defined as: δ(Xi , Xj ) . 1≤i≤c 1≤j≤c,j =i maxi≤k≤c Δ(Xk )
νD = min
5.
min
(3)
EXPERIMENTS AND RESULTS
The experiments reported in (Veenman et al., 2002) were repeated and broadened to other real-world data sets (Blake and Merz, 1998). In total, 3 twodimensional artificial and 8 multi-dimensional real world data sets were considered. All the real-world data sets were normalized before the experiments. Algorithms were run with the same parameter values as in (Veenman et al., 2002). For each dataset, cluster tendency curves were computed (using the rMVC and the IMVC and two ways of computing the variance: normal and 2 were selected (one for MVC, the other for AMVC). Then two values of σmax AMVC) in the middle of the plateaus with the biggest strength. The number of clusters for the k-means algorithm was set to be equal to the number of clusters returned by the MVC algorithm. Using those values, MVC, AMVC and k-means algorithms were compared – the time of execution (averaged on 10 runs, table 1) and the number of times the best result (regarding the square error H) was achieved in 100 runs were measured. Then, the indexes of cluster validity were measured for the best clusterings returned by each algorithm (table 2). The AMVC algorithm is up to 50% faster than the MVC, especially on larger data sets (D31, ecoli, segmentation). Both algorithms have execution times up to 1000 times longer than the k-means algorithm, but the difference decreases considerably as the size of the data set increases. The algorithms were not optimized and the parameters were not fine-tuned – we expect the execution time to drop considerably when using more speed-oriented implementation of the MVC and reducing the maximum number of iterations of the algorithm. The quality comparison (table 2) shows the supremacy of MVC and AMVC over k-means on the most of the data sets considered both in terms of number of times the best result was achieved and the match between proposed clustering
Evaluating the Quality of MaximumVariance Cluster Algorithms
985
Table 1. Parameters and execution times of the algorithms. data set bupa ecoli glass iris pima segmentation wine yeast O3 R15 D31
MVC 0.10 0.11 0.18 0.18 0.19 0.75 0.47 0.09 50 10 0.004
parameters AMVC 0.045 0.06 0.096 0.10 0.07 0.22 0.14 0.037 50 10 0.0035
k-means 2 6 3 2 2 2 2 6 6 15 31
MVC 10987 5929 4317 1647 24331 4519 3046 354827 645 3454 42859
time [ms] AMVC 8718 3219 3130 1379 17988 2771 1529 131252 418 2135 24654
k-means 15 45 14 4 56 17 18 479 6 56 851
Table 2. Comparison of the quality of the results obtained by algorithms. data set bupa ecoli glass iris pima segm wine yeast O3 R15 D31
MVC 73 100 1 100 11 22 100 6 99 100 100
hit rate AMVC 1 1 2 100 1 3 24 1 100 94 81
kmeans 72 1 3 97 36 12 36 1 1 3 1
MVC 1.429 0.817 1.021 0.550 1.717 1.203 1.404 1.365 0.276 0.355 0.612
DB indexes AMVC 1.127 0.829 0.758 0.550 1.708 0.878 1.379 1.268 0.276 0.354 0.612
kmeans 1.517 1.360 1.266 0.550 1.721 0.815 1.398 1.745 0.554 0.355 0.676
Dunn’s indexes MVC AMVC kmeans 0.085 0.059 0.082 0.098 0.075 0.038 0.106 0.075 0.035 0.358 0.358 0.358 0.071 0.044 0.071 0.070 0.159 0.141 0.168 0.160 0.168 0.022 0.028 0.025 0.340 0.340 0.075 0.194 0.189 0.194 0.018 0.027 0.016
and the data (index computation). On the artificial data sets, a clear relationship between the visual quality of the clusterings and the values of the indexes can be seen. In D31 and O3 data sets, the k-means algorithm is not able to find the right clustering, whereas both AMVC and MVC find it in the most of the repetitions. However, the difference on real world data sets is small. We see that either all the algorithms deal well with a dataset or all fail to cluster it. We think that the real datasets we experimented on are either well structured (like iris, wine, ecoli, or segmentation) or very noisy – some datasets (eg. yeast) are very hard even to classify ((Blake and Merz, 1998), file header). Regarding the DB index AMVC delivers better clusterings of some data sets. This suggests that one can further modify the criterion used in AMVC to bring it together to the validity measure – however this might be difficult without affecting the speed of the algorithm. The results of Dunn’s index are
986 on some data sets contrary to DB index (e.g. bupa, glass). As Dunn’s index focuses on the worst points, this might suggest that MVC-type algorithms do not deal well with points which do not fit well into any cluster. However, the results on the O3 dataset (which consisting of 3 well-defined 2D-gaussian clusters and 3 noisy outliers) show that given the appropriate variance, MVC deals very well with such points. The problem with real-world data sets is that the potential clusters are not so well-formed, so, if there were such a „good” 2 , it would lay in the region of instabilities on the cluster tendency value of σmax plot, and not in the plateau. This suggests that on some datasets the value of 2 from the biggest plateau is not necessarily optimal for the Dunn’s index. σmax
6.
CONCLUDING REMARKS
The MVC algorithm and two algorithms for assessing the cluster tendency were presented. A new method for computing the variance for the MVC algorithm was proposed. According to the results obtained, this method is faster and (on most of the data sets) delivers results as good as the original MVC algorithm. However, the advantage over the k-means algorithm reported for the MVC algorithm in previous papers could not be seen on most of the real world data sets.
ACKNOWLEDGMENTS We would like to thank prof. Anna Bielecka for her help.
REFERENCES Bezdek, J.C. and Pal, N.R. (1998). Some new indexes of cluster validity. IEEE Transactions on Systems, Man, And Cybernetics–Part B: Cybernetics , 28(3):301–315. Blake, C.L. and Merz, C.J. (1998). UCI repository of machine learning databases. Hall, L.O., Ozyurt, B., and Bezdek, J.C. (1999). Clustering with a genetically optimized approach. IEEE Transactions on Evolutionary Computation, 3(2):103–112. Jain, A.K., Murty, M.N., and Flynn, P.J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3):265–323. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Cam, L.M. Le and Neyman, J., editors, Proc. Fifth Berkeley Symp. Math. Statistics and Probability, volume 1, pages 281–297. Mitchell, T.M. (1997). Machine Learning. McGraw–Hill. Rzadca, K. and Ferri, F.J. (2003). Incrementally assessing cluster tendencies. In Perales, F.J., Campilho, A.J.C., de la Blanca, N. Perez, and Sanfeliu, A., editors, Pattern Recognition and Image Analysis, number 2653 in LNCS, pages 868–875. Springer-Verlag. Veenman, C.J., Reinders, M.J.T., and Backer, E. (2002). A maximum variance cluster algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1273–1280.
DEFORMABLE GRIDS FOR SPEECH READING
Krzysztof Slot and Hubert Nowak Institute of Electronics, Technical University of Lodz
Abstract:
The following paper presents results of research on the application of deformable grids in image analysis for the purpose of speech reading. The main contribution of the paper is a modification of the original paradigm through an extension of a pool of parameters used for evaluation of gridmatching result. This significantly improves classification performance and indicates that the adopted approach to image analysis offers reasonable representation of information on uttered phones.
Key words:
speech reading, deformable grids
1.
INTRODUCTION
Speech-reading is one of the most challenging image sequence analysis tasks due to several reasons, such as incompleteness and ambiguity of information on uttered phones that is conveyed in facial mimics, as well as because of facial expression subtlety and variability. Commonly used approach to speech reading is to employ standard speech-signal analysis methodology, i.e. to perform a classification of a sequence of vectors derived from consecutive image frames. While sequence classification is usually done using Hidden Markov Model (HMM) method, there are numerous approaches for extracting information relevant to speech-reading task from consecutive sequence frames. They cover a broad spectrum of methods such as face modeling through functional approximation1, feature sets2, probabilistic grids3, active contours and others. None of these approaches though proved to offer satisfactory solution to the problem, so there is still a space for search for new methods, especially taking into account potential
applications of speech-reading technology (disability assistance through audio-visual speech recognition1, low bit-rate video coding4 etc.). The following paper summarizes our research on the application of the deformable grid paradigm5 to the extraction of speech-reading relevant information from individual frames of a video sequence. Deformable grids possess several advantages that allow for accurate modeling of facial expressions related to speech uttering – they offer distributed and detailed object representation and provide robust mechanisms for matching reference templates against tested objects. To better adjust the deformable grid paradigm to properties of the speech reading task, we proposed two modifications of this concept. The first idea – intermediate deformable models6 – was aimed at achieving a substantial increase in image analysis speed, which is necessary to allow real-time processing of video sequences. The presented paper describes the second modification of the deformable grid paradigm, which attempts to increase analysis accuracy. The main idea of the proposal is to extend the pool of grid parameters, which are used for classification of image processing results, by elements that take into account local grid deformations. Extracting information on local structural properties of a matched grid conveys, in our belief, substantial clues for the correct identification of uttered phones. The structure of the paper is the following. Image analysis with deformable grids is summarized in Section 2. Parameters that have been introduced for grid match evaluation are described in Section 3. Finally, an experimental verification of the proposed approach is provided in Section 4.
2.
OBJECT RECOGNITION WITH DEFORMABLE GRIDS
Basic procedure for image analysis with deformable grids is to create reference models for each class considered in an experiment and then, to match these models against image objects, which are to be recognized. Gridmatching is a process of adjusting node locations in a way that minimizes some local cost function. This function reflects a similarity between information stored at grid's nodes and information encountered at corresponding image locations, as well as it quantifies mutual node interactions that account for grid's elasticity. Nodes get displaced until these two node-driving mechanisms get balanced. Once this occurs for all of the nodes, a model matching procedure completes and image-model similarity evaluation can be assessed. Parameters that are used for match evaluation fall into two categories – the first one evaluates a similarity itself, while the second one quantifies a level of grid’s deformation. The former group
comprises a single element, referred to as “image interaction energy” (terminology adopted from mechanics is commonly used in deformable grids theory). The latter group is composed of grid’s “internal energy” that reflects tensions that exist among nodes, grid's “external energy” used to evaluate divergence from initial grid's structure and finally, of two measures of grid's geometrical deformation (changes in average inter-node distances and grid's overall orientation). A vector composed of the presented parameters, computed for a grid after completion of grid-matching procedure, constitutes procedure’s output and is subject to classification.
2.1
Intermediate deformable models
A concept of intermediate deformable models was presented in literature6 as an attempt to alleviate the major drawback of the original paradigm, i.e. a computational cost of image analysis phase. The main idea behind the proposed modification was to create a single reference model for the given recognition task, instead of building separate models for each class considered in such a task. The reference model was a grid, designed by focusing on dissimilarities among different classes (as opposed to focusing on similarities to particular class’ prototypes, as it happens for the original approach). This way image analysis requires only a single model matching procedure, and its results can be directly used for the classification. To derive an appropriate model, a genetic algorithm (GA) was used as an optimization strategy. It has been shown that such an approach allows for correct data classification and reduces computational cost of image analysis procedure by multiple times.
3.
LOCAL GRID DEFORMATION MEASURES
The amount of information available in the speech-reading task is significantly smaller than in the case of speech analysis based on acoustic signals. A search for a rich representation of the frame-analysis result is therefore crucial if reasonable identification of utterances based purely on image information is to be expected. The pool of parameters used to evaluate grid-matching results in the original deformable grid concept comprises five global grid characteristics. This appears to be an insufficient representation, so we decided to extend the parameter set with other information derived from matched grids. We examined several candidates for processing-result descriptors, focusing on measures of local deformations derived for four grid regions (denoted by A, B, C and D in Fig. 1), such as average node displacements
computed for a quadrant (dr) or average tension vectors calculated for a deformed grid's quadrant (F):

\[
\vec{dr} = \frac{1}{N} \sum_{n \in X} \left( \vec{r}^{\,n}_{F} - \vec{r}^{\,n}_{0} \right),
\qquad
\vec{F} = \frac{1}{N} \sum_{n \in X} \vec{F}^{\,n}
\]
In the above formula, X is one of the quadrants, N is the number of nodes within the region, r_0^n and r_F^n denote the initial and final locations of the n-th node, and F^n is the tension computed for that node. We found that salient differences can be observed in these measures when processing images that correspond to different utterances (Fig. 1), and also that the results produced in the left and right quadrants of a grid were quite asymmetric.
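A minimal numpy sketch of the per-quadrant measures defined above is given below; it assumes that the node coordinates and tensions of a matched grid are available as arrays, and that the quadrant partition is passed in as index sets (these names are illustrative, not taken from the original implementation).

```python
import numpy as np

def quadrant_measures(r0, rF, F, quadrants):
    """Mean displacement and mean tension per grid quadrant.

    r0, rF    : (num_nodes, 2) arrays of initial and final node positions.
    F         : (num_nodes, 2) array of tension vectors at the matched nodes.
    quadrants : dict mapping a quadrant label ('A'..'D') to an array of node indices.
    Returns a dict {label: (mean_displacement, mean_tension)}, each a 2-vector.
    """
    measures = {}
    for label, idx in quadrants.items():
        dr = (rF[idx] - r0[idx]).mean(axis=0)   # average node displacement in the quadrant
        f = F[idx].mean(axis=0)                 # average tension vector in the quadrant
        measures[label] = (dr, f)
    return measures
```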
Figure 1. Image processing with deformable grids (from left to right): regions of intermediate deformable model, grid matching against phones ‘a’ and ‘e’ and selected local match measures computed for matched grids in both cases.
To identify the most discriminative features, the sixteen candidate parameters (two two-component vectors per quadrant) were subjected to linear discriminant analysis using a simple Fisher discriminant method. As a result, an eight-element feature set composed of selected tension and displacement measures was chosen to constitute the matched-grid representation.
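For illustration, a per-feature Fisher ratio of the kind that can drive such a selection is sketched below; the actual discriminant analysis used in this work may differ in detail.

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher ratio: between-class variance over within-class variance.

    X : (num_samples, num_features) matrix of grid descriptors.
    y : (num_samples,) class labels (phones).
    """
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)   # guard against constant features

# e.g. keep the eight highest-scoring of the sixteen candidate measures:
# selected = np.argsort(fisher_scores(X, y))[::-1][:8]
```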
4.
PHONE CLASSIFICATION RESULTS
An eight-element subset of the more than sixty phones of the Polish language was selected for our experiments. The subset was composed of six vowels (a, e, o, u, y, i), the consonant 'w' and an image of closed lips, which corresponds to the lack of utterance. This particular choice was made because class representatives are easy to identify manually, which was necessary for the correct derivation of the intermediate deformable model. The intermediate deformable model was derived using the procedure described by Nowak and Ślot6. It was then manually positioned over test images and the grid-matching procedure was executed. After the transients decayed, processing results were extracted from the matched grids, providing input for the data
classification procedure. Two output feature spaces were considered for matched grid representation: the five-element one, defined for the original deformable grid paradigm, and the proposed eight-element space made up of local measures. Two classifiers were used for the experiments – simple nearest mean classifier and feed-forward neural network with sigmoid-type nonlinear neurons. Intermediate deformable grid that has been used (rendered as gray-scale image), as well as sample processing results have been shown in Fig. 2.
Figure 2. Intermediate deformable grid and results of its matching against images that correspond to phones 'a', 'e' and 'u'
Classification results are summarized in Fig. 3, where nonlinear discriminant analysis of the grid-matching results is shown for two cases: with the global characteristics introduced in the original deformable grid paradigm (on the left) and with the local match measures proposed in this paper (on the right). The analysis was made using a data analysis software package7.
Figure 3. Nonlinear discriminant analysis for classification results of eight phones for global (left) and local (right) match measures
Quantitative assessment of the phone classification experiment has been presented in Table 1. Two hundred and forty test images (30 images per class) were subject to recognition. A substantial increase in classification accuracy can be observed for the case of the proposed, local evaluation of grid's deformations. Also, as it has been expected, the problem requires nonlinear decision boundaries, which can be produced, for example, using neural network classifier (Table 1).
Table 1. Eight-phone recognition experiment - classification errors.

                               Global deformation   Local deformation      Local deformation
                               measures             measures (mean dr)     measures (mean F)
  Nearest mean classifier      15.97 %              21.8 %                 14.29 %
  Neural network classifier     9.24 %               3.36 %                 1.68 %

5.
CONCLUSION AND FUTURE WORK
The idea of using local measures for deformable grid-based phone recognition in still images has been presented in this paper. It has been shown that the proposed approach allows correct classification of an eight-phone set. The major objective of the research, though, was not to find particular phone subsets that can be recognized correctly through still-image analysis, but to verify whether deformable grids provide sufficiently discriminative information to be exploited in the analysis of image sequences. We believe the results obtained justify expecting reasonable performance of utterance recognition based on image sequence analysis. The envisaged observation vectors (for a relevant HMM) will be either the eight-element local grid characteristics computed for consecutive frames or classification results (produced by e.g. a Bayes classifier) derived from these characteristics.
REFERENCES
1. I. Shdaifat, R. Grigat, D. Langmann, A system for automatic lip reading, Proc. of AVSP 2003, 121-126 (2003).
2. P. Niyogi, E. Petajan, J. Zhong, Feature based representation for audio-visual speech recognition, Proc. of AVSP-1999, 133-139 (1999).
3. J. F. Baldwin, T. P. Martin, M. Saeed, Automatic computer lip-reading using fuzzy set theory, Proc. of AVSP-1999 (1999).
4. L. Girin, Joint Matrix Quantization of Face Parameters and LPC Coefficient for Low Bit Rate Audiovisual Speech Coding, IEEE Trans. on Speech and Audio Processing, 12(3), 265-275 (2004).
5. A. K. Jain, Y. Zhong, S. Lakshmanan, Object Matching Using Deformable Templates, IEEE Trans. on PAMI, 18(3), 267-278 (1996).
6. H. Nowak, K. Ślot, Object classification with intermediate deformable models, Proc. of ECCTD, 240-243 (2003).
7. Texture analysis and classification software package "MazDa"; https://www.eletel.p.lodz.pl/ merchant/mazda
DIDACTIC PATTERN RECOGNITION SYSTEM Mariusz Szwoch and Witold Malina Technical University of Gdansk, Narutowicza 11/12, 80-952 Gdansk, Poland
Abstract:
In this paper the Pattern Recognition System is described, designed for feature extraction, classifier learning and classification. Patterns may be presented either in graphical or in numerical form. The system is created in a user-friendly manner and makes it possible to carry out experiments with different learning algorithms. Its additional advantages are enhanced capabilities for graphical presentation and visualization of the learning process and its results. The system is a universal tool that can be applied in didactics and lectures, as well as in some research work.
Key words:
pattern recognition, classification methods, feature selection, visualisation
1.
INTRODUCTION
Pattern recognition methods have become highly advanced and have been successfully applied in practice. This does not mean, however, that using them is always easy. The main obstacle to applying them in practice may be the lack of suitable software tools for individually carrying out some initial experiments. For this reason the comprehensive Pattern Recognition System (PRS) has been created in the Knowledge Engineering Department of Gdansk University of Technology. PRS allows patterns in 2D or multidimensional feature spaces to be input in an easy way. Its functionality and advanced possibilities for graphical presentation of the various stages of pattern pre-processing, learning and classification make PRS a very useful tool for didactic work and also for experiments and some research work. The main screen of the PRS system is presented in Fig. 1 - left.
2.
INTERFACE AND SYSTEM STRUCTURE
PRS has a graphical interface that enables comfortable and easy interactive work. Thanks to the user-friendly interface, controlling the system is clear and the user has at his or her disposal a wide range of possibilities to influence each stage of the pattern recognition process. The functions realised in the program are defined by its main options. The main menu (Fig. 1 - left) enables the choice of the 2- or multi-dimensional data type. The data type determines the way the data are input, the choice of learning algorithms and the presentation of the learning process and classification results. The multidimensional Data option enables the creation of graphical patterns, their conversion into feature vectors, reduction of the number of their features and creation of a new training set. The Learning option allows the classifier's learning to be performed according to a chosen method. The last function, Classification, makes it possible to give graphical queries to the classifier and to get its responses. Queries are processed the same way as the training set and classified by the method defined during the learning process. Results are displayed on the screen. The system also makes it possible to print data sets and information about the learning process and its results.
Figure 1. Main menu (left); setup window for algorithms and correction coefficients (right).
In general, two system components may be distinguished in PRS: one created for patterns in a 2D feature space, and the second for learning and classification of graphical patterns. These components differ in many respects, such as the creation of input patterns, the visualisation capabilities for pattern pre-processing, and the presentation of the learning process and its results.
3.
PATTERNS IN 2D FEATURE SPACE
The best possibilities for visualisation, demonstration and testing of the classifier system are achieved in the case of patterns in a 2D feature space. In that case it is possible to create the training set directly on the screen using the mouse. For all the classes (maximum 5) the patterns are distinguished by different colours and symbols. An "intelligent advisor" option guides the user through the numerous dialogs, allowing the needed classifier and some learning parameters and conditions to be chosen. Finally, the learning process is started. Its progress is continuously visualised on the screen by auxiliary lines, various information and control buttons. When not satisfied with the temporary results, the user may stop the learning process, change its parameters and start it once again. The final results may be demonstrated using decision borders and coloured decision regions. For the 2D case it is possible to estimate the quality of the learning process and the classification of the training set visually, by watching the decision borders with the training set in the background. Another possibility is provided by the criterial function J(w,n):
\[
J(w, n) = \frac{1}{m} \sum_{i=1}^{m} G\!\left[ s_i \, g(w[n], y[i]) \right] \qquad (1)
\]
where: G() – U-shaped function, m – number of patterns in the learning set, g() – discriminant function, s_i – coded class of the pattern y[i], n – iteration step. The criterial function J(w,n), computed at each iteration step, should decrease asymptotically during the learning process.
3.1
Learning Algorithms
PRS is capable of modeling linear, non-linear and minimal-distance classifiers. For each classifier type there is a group of implemented learning methods with their individual parameters and termination conditions. The system implements the following learning algorithms for 2D data5,7: 1. Iterative learning algorithms for linear classifiers. These algorithms are obtained from the following criterial functions: square, module and perceptron. Additionally, a linear classifier with a sigmoidal activation function (neuron) is available (Fig. 1 - right). 2. The non-iterative Fisher algorithm, implemented in standard and extended1,6 versions
\[
F(d) = \frac{\left[ d^{T}(\mu_{1} - \mu_{2}) \right]^{2}}{d^{T} A\, d} + \frac{d^{T}(\Sigma_{1} - \Sigma_{2})\, d}{d^{T} A\, d} \qquad (2)
\]
where A = Σ1 + Σ2 or A = I. 3. One- and multi-modal minimal-distance classifiers are implemented in PRS. They use different distance metrics: Euclidean, Minkowski and Mahalanobis. These learning methods are based on mode evaluation and on measuring the distance between the classified pattern and all the modes in the training set. 4. The non-linear algorithms may also be divided into iterative and non-iterative ones. Iterative learning algorithms obtained from criterial functions such as square, module and perceptron are implemented for a quadratic discriminant function. A sequential learning algorithm using the method of successive division of the feature space by two parallel lines is also used. This algorithm generates piece-linear discriminant functions. For iterative learning algorithms of the following form:
\[
w[n] = w[n-1] - \lambda_{n} \, \mathrm{grad}\, J(w[n-1], y[n]) \qquad (3)
\]
there is a possibility to choose different correction coefficients λ_n, such as: a constant, k/n, the coefficient optimal for linearly separable classes, and the optimal matrix coefficient K[n] for non-linearly separable classes [1, 4]. For linear algorithms some additional coefficients may be set up: k and c for the relaxation algorithm and σ for the sigmoidal one. In the case of the Fisher algorithm (for classes with normal distributions) one or two thresholds can be chosen. A decision threshold can also be set up in order to improve the classification reliability by introducing neutral decision regions. PRS makes it possible to control the initial values of the coefficient vector w[0] as well as the end condition of the iterative learning process. That condition is based on four indicators: the criterial function J(w, n), the number of misclassified patterns, the number of iteration steps and the norm ‖Δw[n]‖ of the change of the coefficient vector w. These indicators may be given numerical bounds and combined in a Boolean expression containing logical 'or' and 'and' operators. The next option makes it possible to set up the sequence in which the patterns are taken from the training set. The possibilities are: alternately from each class, according to class probabilities, by whole classes, or in random order. Example decision regions in the 2D feature space for different classifiers are presented in Fig. 2-4.
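A minimal sketch of an iterative learning rule of the form (3) is given below for the perceptron criterion with a 1/n correction coefficient and cyclic pattern presentation; PRS offers richer coefficient choices and stopping conditions than this illustration.

```python
import numpy as np

def iterative_training(patterns, labels, lam=lambda n: 1.0 / n, max_iter=1000):
    """Iterative learning of the form w[n] = w[n-1] - lambda_n * grad J(w[n-1], y[n]).

    Shown for the perceptron criterion: patterns are augmented feature vectors
    y[i] and labels are coded as +1 / -1.
    """
    w = np.zeros(patterns.shape[1])          # initial coefficient vector w[0]
    for n in range(1, max_iter + 1):
        i = (n - 1) % len(patterns)          # cyclic presentation of the training set
        y, s = patterns[i], labels[i]
        if s * np.dot(w, y) <= 0:            # misclassified pattern: gradient is -s*y
            w = w + lam(n) * s * y           # perceptron correction step
    return w
```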
Figure 2. Classification results for the multimodal case using a minimal distance classifier with the Euclidean metric (left); classification results for the multimodal case using a minimal distance classifier with the Mahalanobis metric (right).
Figure 3. Classification results for standard Fisher classifier (left); Classification results for extended Fisher classifier (right).
Figure 4. Classification results for multimodal case using piece-linear classifier (left); Classification results using iterative classifier (right).
3.2
Multi-Class Problems
In the case of many classes, three multi-class learning algorithms are implemented: dividing classes into pairs, separating one class from the others, and the perceptron multi-class algorithm5. In the first method, the learning process is carried out for each pair of classes. The resulting set of classifying functions determines the division of the feature space into particular decision regions. The second method (sequential) is also a two-class approach: for each class the remaining classes are grouped together, forming the second class from which the original one must be separated. Finally, in the proper multi-class algorithm all classes are separated simultaneously in the same process. Two versions of the perceptron multi-class algorithm are implemented - it is possible to correct one or more coefficient vectors of the discriminant functions. During the learning process its actual state is repeatedly displayed on the screen. It is also possible to interrupt the process at any time. When the end conditions are satisfied, the learning results are displayed together with information about the possibility of performing some more iterations. One can also examine the values of the criterial function J(n) during the whole learning process (for the two-class perceptron algorithm). For non-iterative learning algorithms the most appropriate technique is separating classes in pairs; it is commonly used, for example, for the Fisher and piece-linear learning algorithms. On the other hand, it is very easy to generalize minimal distance classifiers to multi-class problems. In that case, learning relies on joining the consecutive class discriminators.
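The second (sequential) strategy can be sketched as follows; the two-class trainer is assumed to be any of the algorithms listed above, and the function names are illustrative only.

```python
def one_vs_rest(train_two_class, patterns, labels, classes):
    """Multi-class learning by separating each class from the union of the others.

    train_two_class(patterns, coded_labels) returns a discriminant function g(x);
    the winning class is the one whose discriminant responds most strongly.
    """
    discriminants = {}
    for c in classes:
        coded = [1 if l == c else -1 for l in labels]
        discriminants[c] = train_two_class(patterns, coded)

    def classify(x):
        return max(classes, key=lambda c: discriminants[c](x))
    return classify
```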
4.
MULTIDIMENSIONAL PATTERNS OF GRAPHICAL OBJECTS
If the number of features N > 2, it is not possible to visualize the learning process and the classification results. PRS allows the user to easily go through all the steps of the learning process for multidimensional patterns, which are: creation of the data and the feature vector, feature selection, learning and classification. All current classifier parameters are displayed concurrently (Fig. 5 - left). An animated arrow points to the current step of the learning process. Other available steps are represented by active buttons and arrows. In such a case PRS offers extensive possibilities for demonstrating the different operations that deal with preparing patterns for the classification process. These operations are: - drawing graphical patterns in order to prepare the training or testing set;
- initial pattern processing, including discretization, quantization, construction of the feature vector and feature selection. The iterative learning process, however, may be supervised only by means of some parameters, such as the pattern classification error, the value of the criterial function and the number of iterations. The final estimation of the learning quality is carried out by examination of the training and testing sets and by analysis of the classification error.
Figure 5. Graphical menu for multi-dimensional classifier learning process (left); Image editor for graphical data (multidimensional learning process) (right).
In the multidimensional case the patterns may be input using the mouse, the keyboard or a file. The handiest way is drawing the patterns with the mouse in a special graphical editor (Fig. 5 - right). Patterns created in this way may then be disturbed by a pseudo-random noise generator with a uniform distribution and subjected to operations such as negation, symmetry and translation. Features of graphical patterns may be extracted both after their discretization and brightness quantization and by computing some moment invariants2. In the pattern discretization process an m×n matrix is used; the resulting values are obtained by quantizing each cell using a defined brightness transformation function. Moment invariants may be computed both on the source graphical patterns and on their discrete form. Scaling factors may also be applied independently to each feature. Besides the predefined moment invariants, one can also define one's own invariants based on geometrical moments. The resulting number of features is usually rather large (m×n) and may additionally be increased by the 28 moment invariants available in the system. In order to decrease the number of features, PRS offers automatic selection according to the modified Sebestyen criterion3,8 (Fig. 6 - left) as well as selection by hand. After feature selection, the training set is created and used later during the learning process.
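A simple sketch of the m×n discretization into cell-mean features is shown below; it assumes the pattern dimensions are not smaller than m and n, and the brightness transformation and moment invariants are omitted.

```python
import numpy as np

def grid_features(image, m, n):
    """Discretize a gray-level pattern into an m x n grid of cell means,
    flattened into a feature vector of length m*n."""
    h, w = image.shape                       # assumes h >= m and w >= n
    cells = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            cell = image[i * h // m:(i + 1) * h // m, j * w // n:(j + 1) * w // n]
            cells[i, j] = cell.mean()        # cell value = average brightness
    return cells.ravel()
```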
Figure 6. Feature selection window (left); classification window of multidimensional patterns (right).
In the case of multidimensional patterns, similar possibilities for choosing the learning algorithm and its parameters are available as in the case of patterns in a 2D feature space. In order to check the quality of the learning process, the training set is classified first. Then other, testing patterns may be classified. The patterns of the testing set are created in the same graphical editor as those of the training set (Fig. 6 - right). Both sets may have an unlimited number of patterns. The testing patterns are subjected to the same operations as the learning patterns (e.g. discretization and brightness quantization, computation of the moment invariants and feature selection). The patterns obtained in this way (feature vectors) are then classified.
5.
CONCLUSION
The Pattern Recognition System is a universal tool for pattern recognition. It provides functions for pattern preprocessing, feature selection and the modelling of classifiers of different types. The system has an attractive graphical presentation and allows visualization of the various stages of the pattern recognition process. PRS was created in a user-friendly manner, providing, for example, an intelligent advisor option as well as suggestions about which button should currently be pressed. The system may be used in didactics for laboratory purposes, to create demonstrations, and also for some research work.
REFERENCES
1. Malina W., "On an extended Fisher criterion for feature selection", Trans. on PAMI, Sept., 5, pp. 611-614, 1981.
2. Tadeusiewicz R., "Vision systems of industrial robots", WNT, Warsaw 1992 (in Polish).
3. Sobczak W., Malina W., "Information selection and reduction methods", WNT, Warsaw 1985 (in Polish).
4. Cypkin J., "The foundations of machine learning theory", WNT, Warsaw 1973 (in Polish).
5. Malina W., "The foundations of automatic image classification", Technical University of Gdańsk, Gdańsk 2002.
6. Okada T., Tomita S., "An Extended Fisher Criterion for Feature Extraction – Malina's Method and its Problems", Electronics and Communications in Japan, vol. 67-A, No 6, pp. 10-17, 1984.
7. Theodoridis S., Koutroumbas K., Pattern Recognition, Academic Press, 1999.
8. Denser L., "A hybrid multispectral feature selection in pattern recognition", IEEE Trans. Comp., Sept., pp. 1116-1117, 1971.
PATTERN MATCHING WITH DIFFERENTIAL VOTING AND MEDIAN TRANSFORMATION DERIVATION Improved Point-Pattern Matching Algorithm for Two-Dimensional Coordinate Lists Marcin Marszalek1 and Przemyslaw Rokita2 Institute of Computer Science, Warsaw University of Technology Nowowiejska 15/19, 00-665 Warsaw, Poland 1 [email protected]; 2 [email protected]
Abstract
We describe an algorithm for matching two-dimensional coordinate lists. Such matching should be immune to translation, rotation, scaling and flipping. We assume that coordinate lists may be only partially overlapping and we allow some random perturbations in the lists. Our goal is to match enough points to be able to derive a coordinates transformation with high confidence. The implementation described here is a part of SkySpy automated sky survey system. The presented algorithm is used to perform matching of stars detected during observations with star catalogues.
Keywords:
pattern matching; coordinate lists; differential voting; transformation derivation
1.
INTRODUCTION
One of the tasks an automated sky survey system must perform is matching the stars detected on acquired CCD frames with star catalogues. After astrometry (position measurement) in frame coordinates and relative photometry7 (brightness measurement) have been performed, we are given a list of observed star positions and brightnesses. On the other side, a list of stars contained in the star catalogue3 is available, where for each star we are given its position in celestial coordinates and its brightness on the magnitude scale. As new frames are continuously acquired, implementing a fast and robust algorithm capable of directly matching tens of stars found on a frame against the millions of stars contained in a catalogue seems, according to the literature, to be impossible at present. The solution we have found to fulfill the described requirements is to derive a coordinate transformation and use a simple and fast algorithm to match close neighbors in the same coordinate space. To derive a coordinate transformation between frame coordinates and celestial (catalogue) coordinates, we still need to perform pattern matching5, but
we can focus on matching only few stars, enough to derive a transformation. Those stars should be matched with high confidence. To reduce the number of considered catalogue stars, we use some additional information (measured star movement, observer location, hardware parameters) to calculate a rough estimate of observed sky region. Later on, previously found transformation parameters can be used as the hints for the algorithm. Although simplified, pattern matching in our case is still not a trivial task. The lists are not only subject to translation, rotation, scaling and flipping, but the areas concerned may be only partially overlapping, there may be random star additions and deletions (due to existence of non-star objects and measurement errors caused by cosmic rays or clouds for example) and random coordinates and/or brightnesses perturbations. Moreover, after pattern matching the list of paired points can contain occasional mismatches. We need an algorithm to quickly determine transformation parameters that would not be heavily affected by occasional mismatches, which is not a trivial task as well. In the following sections we will describe the algorithms used to perform the pattern matching and coordinates transformation derivation. We will shortly present previous achievements and describe improvements we have developed.
2.
OVERVIEW
As an input for our pattern matching algorithm we are given two lists of two-dimensional coordinates and measured object brightnesses. We need to derive a coordinates transformation between the two lists. Many algorithms were proposed to solve the stated problem (see the research of Murtagh6 ). We have based our work on the solution presented by Groth2 and developed in parallel (although not published) by Stetson8 . This approach, mimicking the way an astronomer performs matching sky images, was later improved by Valdes10 and tested by Hroch4 . The idea of the algorithm is to search for similar triangles constructed from the points on the two lists. We expect the relating points in the lists to generate many similar triangles, but also expect occasional matches generated by noncorresponding objects. To derive a list of objects corresponding with high confidence, we propose a differential voting method, which we believe is an important improvement compared to previously proposed method. Having a list of relating objects we derive a transformation between the two lists. As at this step a dominant majority of points are true matches, we propose an efficient method of median derivation of transformation parameters. As we are dealing with spherical coordinates, we use the direction cosines and matrix method for coordinates transformation where applicable. This method for solving celestial mechanics problems was proposed by Taki9 .
3.
THE ALGORITHM
First, we choose the m brightest objects from the first list and the n brightest objects from the second list. In fact, all previous works assumed the same number of elements in the matched lists, although the algorithm does not require this assumption. In our case it is better to choose more objects from the catalogue list. Having only a rough estimate of the area covered by the frame, we intentionally overestimate this parameter and pick more objects from the catalogue. We expect the best results when the number of related objects in the two lists is maximized, so we try to provide a similar density of chosen objects over the area. This implies selecting more objects from the catalogue list. We select the brightest objects, as those can be found on a CCD image with high confidence, and we can expect the brightest objects in one list to correspond to the brightest objects in the second list. It is worth noting that we do not require the same brightness ordering for the stars in the two lists. We have found that selecting the 15 to 30 brightest objects is sufficient for the algorithm, giving good results with acceptable computation times.
3.1
Finding similar triangles
Given the reduced object lists, our next step is to construct all possible triangles. From a list of n objects we can construct n(n − 1)(n − 2)/6 triangles. These will be represented in the so-called "triangle space". We need to choose a triangle representation that will let us find similar triangles, that is, find corresponding object triplets while being insensitive to translation, rotation, scaling and flipping. We find the representation proposed by Valdes better than the one originally presented by Groth. A triangle in the triangle space is represented as a two-dimensional point (x, y), where

x = a/b,  y = b/c  (1)

and a, b and c are the lengths of the triangle sides in decreasing order. Similar triangles will be located close to each other in the triangle space. Two techniques that speed up triangle generation and search are worth noting here. First, all the distances between objects in the reduced list may be precalculated (an optimization introduced by Valdes). This avoids calculating the distance between the same objects many times when the lengths of the triangle sides are determined. Second, the triangles in the triangle space may be presorted by one of the coordinates (an optimization introduced by Groth). We then use a moving window over the presorted coordinates to heavily reduce the number of candidate neighbors for each triangle considered. A similar technique based on presorting the triangles in the triangle space was proposed by Valdes, but we have found that, as we wish to find all similar triangles anyway, the binary search to locate the subset is not necessary. Also, as we were going to use differential voting later, we did not follow Groth's techniques of filtering out the non-useful triangles, nor the simplified
Figure 1a. CCD image of M45 (Pleiades) open cluster. Three stars creating an example triangle are marked.
Figure 1b. Catalogue data for M45 open cluster. Corresponding triangle is marked. Celestial coordinates grid is shown.
approach presented by Valdes. By using our new approach, the triangles generating many false matches can easily be discarded in the differential voting process. It is worth noting (as it was not directly stated in previous works) that the presented pattern matching algorithm should be applicable to any two-dimensional surface with a natural geometry. We use it for spherical geometry, but its use for Euclidean or hyperbolic geometries seems evident.
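A compact sketch of the triangle-space construction described above is given below; it precalculates the distances and presorts the triangle-space points by one coordinate, but omits the moving-window similarity search and any tolerance handling, which depend on implementation choices not detailed here.

```python
import itertools
import numpy as np

def triangle_space(points):
    """Represent every triangle from a coordinate list as the point (a/b, b/c),
    where a >= b >= c are its side lengths (representation after Valdes).

    points : (n, 2) array; returns a list of ((i, j, k), (x, y)) entries,
    sorted by the first triangle-space coordinate.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)  # precomputed distances
    triangles = []
    for i, j, k in itertools.combinations(range(n), 3):
        a, b, c = sorted((d[i, j], d[j, k], d[i, k]), reverse=True)
        if c > 0:                                  # skip degenerate triangles
            triangles.append(((i, j, k), (a / b, b / c)))
    triangles.sort(key=lambda t: t[1][0])          # presort for a moving-window search
    return triangles
```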
3.2
Differential Voting
When similar triangles have been found, we use a voting method to find corresponding objects. Each matching triangle pair votes for all of its vertex pairs. The argument for this technique is that corresponding vertex pairs will receive votes from many similar triangles, while non-corresponding pairs will receive few votes, coming from occasional mismatches. An important improvement of the algorithm we have found is to perform differential voting, that is, to treat as corresponding not the pairs that received the largest number of votes, but the pairs that received the largest number of votes compared to other pairs sharing a member. The reasoning behind our improvement is that we expect a one-to-one matching of objects in the two lists. As we require object matching with high confidence, if an object receives many votes in two pairs (with two different partners), we do not want to trust either of the associations. We propose to subtract from the number of votes received by an object pair in the traditional way the number of votes received by the pair with the highest number of votes and the same member. That is

\[
v'(m, n) = v(m, n) - \max\!\left( \max_{k \neq m} v(k, n),\; \max_{l \neq n} v(m, l) \right) \qquad (2)
\]
where k, m ∈ M, l, n ∈ N, v(m, n) is the number of votes the pair (m, n) has received in the traditional voting and v'(m, n) is the number of votes received in the proposed approach.
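A direct sketch of the correction (2), with negative results zeroed as in Table 1b, could be implemented as follows (the array layout is an assumption of the sketch).

```python
import numpy as np

def differential_voting(votes):
    """Apply the correction of equation (2) to a vote array.

    votes[m, n] is the number of similar-triangle votes received by the pair
    (object m from list 1, object n from list 2). For each pair, the best score
    of any competing pair sharing a member is subtracted; negatives are zeroed.
    """
    votes = np.asarray(votes, dtype=float)
    corrected = np.empty_like(votes)
    for m in range(votes.shape[0]):
        for n in range(votes.shape[1]):
            best_col = np.delete(votes[:, n], m).max() if votes.shape[0] > 1 else 0.0
            best_row = np.delete(votes[m, :], n).max() if votes.shape[1] > 1 else 0.0
            corrected[m, n] = votes[m, n] - max(best_col, best_row)
    return np.maximum(corrected, 0.0)
```

Applied to the vote array of Table 1a, this correction produces the array of Table 1b.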
Table 1a. Vote array before correction. The six highest vote getters are marked with an asterisk. Note the obvious mismatch in the upper-left corner.

  17*   5    2    1    0    0    3    2
  16*   6    2    1    0    0    4    2
   6   15    1    1    2    0    5    3
   0    1   20*   5    2    0    1    9
   0    0    4    5    2    2    0    2
   0    0    8    3    3   31*   0   10
   0    6    2    0   22*   1    3    5
   1    3    0    1    0    2   15    4
   0    2   13    1    4    2    3   22*
   1    2    1    1    6    1    3    2

Table 1b. Vote array after correction. The six highest vote getters are marked with an asterisk. Pairs that received a negative number of votes were zeroed.

   1    0    0    0    0    0    0    0
   0    0    0    0    0    0    0    0
   0    9*   0    0    0    0    0    0
   0    0    7*   0    0    0    0    0
   0    0    0    0    0    0    0    0
   0    0    0    0    0   21*   0    0
   0    0    0    0   16*   0    0    0
   0    0    0    0    0    0   10*   0
   0    0    0    0    0    0    0    9*
   0    0    0    0    0    0    0    0
Note that for each object in either of the lists there will be at most one object on the other list forming a pair that has received a positive number of votes. That is, for each object there will be at most one matched object, and the number of votes will be proportional to the confidence of the one-to-one matching.
3.3
Coordinates transformation derivation
When we have object pairs that can be associated with high confidence, we want to derive a coordinate transformation between the two lists. First we want to know whether the frame is flipped, and we need to correct the rough estimate of the field of view (that is, to scale the frame in celestial coordinates). Then we derive the celestial coordinates of the frame center and compute the frame rotation. We can determine whether the frame is flipped by constructing triangles1 from matched object pairs (in fact, from the few highest vote getters) and performing voting. At the same time we can calculate the scale correction factor from doubles of matched object pairs. We propose not to calculate the average of the factor but to compute the median value, as occasional mismatches may still occur. Having corrected the orientation and the scale, we can derive the remaining spherical coordinate transformation parameters using the direction cosines and matrix method for coordinate transformation derivation, as proposed by Taki9. Again, we calculate the final value of each transformation parameter as the median of the values derived from all doubles of pairs. We have found that this is a very simple and robust way to minimize the influence of the rare mismatches left after the pattern matching phase. Our tests showed that, after the differential voting phase, when we chose the six most voted object pairs to derive the transformation parameters, more than 80 percent of the results were correct. For other, especially non-time-critical applications, the iterative sigma clipping algorithm proposed by Valdes10 may be used.
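As an illustration of the median idea, the scale correction factor could be estimated as sketched below from doubles of matched pairs; the full derivation of the remaining parameters uses the direction cosines and matrix method on the sphere and is not reproduced here.

```python
import numpy as np

def median_scale_correction(coords1, coords2, pairs):
    """Median-based estimate of the scale factor between two matched lists.

    coords1, coords2 : arrays of point coordinates for the two lists.
    pairs            : list of (i, j) index pairs matched with high confidence.
    For every double of pairs, the ratio of inter-object distances is computed;
    the median of these ratios is robust to the occasional remaining mismatch.
    """
    ratios = []
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            i1, j1 = pairs[a]
            i2, j2 = pairs[b]
            d1 = np.linalg.norm(np.asarray(coords1[i1]) - np.asarray(coords1[i2]))
            d2 = np.linalg.norm(np.asarray(coords2[j1]) - np.asarray(coords2[j2]))
            if d1 > 0:
                ratios.append(d2 / d1)
    return float(np.median(ratios))
```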
4.
SUMMARY
The presented algorithm allows our automated sky survey system to find a coordinate transformation between CCD frame coordinates and celestial coordinates. The triangle matching and voting procedure successfully pairs the objects regardless of minor perturbations in the coordinate lists, as long as there are enough (six or more) bright objects appearing in both lists. The differential voting method discards most of the mismatches, and the coordinate transformation parameters may quickly be found. Our approach allowed us to abandon the complicated filtering methods and still reduce the number of mismatches, especially among the highest vote getters. Before we implemented the differential voting correction, mismatches appeared even among the three highest vote getters. After applying the correction, we can use the seven highest vote getters, and mismatches among those are very rare. In this situation median transformation derivation can be used as a quick and robust way to derive the transformation parameters. As the proposed voting correction may be applied to the original voting table with no computational penalty (the algorithm is O(mn)), we believe our finding may be useful. On a 1.1 GHz PC the average computation times for matching two point sets of 30 elements each were 0.03 s for the triangle space construction phase and 0.7 s for finding corresponding pairs. Introducing our differential voting correction added 0.1 s to the pair matching phase.
REFERENCES
1. Bronstein, I.N. and Semendjajew, K.A. (2004). Handbook of Mathematics. Springer, Berlin.
2. Groth, E.J. (1986). A Pattern-Matching Algorithm for Two-Dimensional Coordinate Lists. Astronomical Journal, 91:1244-1248.
3. Høg, E. et al. (2000). The Tycho-2 Catalogue of the 2.5 Million Brightest Stars. Astronomy and Astrophysics, 355:27.
4. Hroch, Filip. Computer Programs for CCD Photometry. Department of Theoretical Physics and Astrophysics, Masaryk University, Brno. http://munipack.astronomy.cz/doc/munipack.html.
5. Klette, R., Schluns, K., and Koschan, A. (1998). Computer Vision. Springer.
6. Murtagh, F. (1992). FOCAS Automatic Catalog Matching Algorithms. Publications of the Astronomical Society of the Pacific, 104:301.
7. Palmer, J. and Davenhall, A.C. (2001). The CCD Photometric Calibration Cookbook. Council for the Central Laboratory of the Research Councils.
8. Stetson, P.B. The Techniques of Least Squares and Stellar Photometry with CCDs. Dominion Astrophysical Observatory. http://nedwww.ipac.caltech.edu/level5/Stetson/frames.html.
9. Taki, Toshimi (2002). Matrix Method for Coordinates Transformation. http://www.asahi-net.or.jp/ zs3t-tk/matrix/matrix.htm.
10. Valdes, F. et al. (1995). FOCAS Automatic Catalog Matching Algorithms. Publications of the Astronomical Society of the Pacific, 107:1119.
TWO-DIMENSIONAL-ORIENTED LINEAR DISCRIMINANT ANALYSIS FOR FACE RECOGNITION Muriel Visani1 , Christophe Garcia1 and Jean-Michel Jolion 2 1 France Telecom R&D DIH/HDM
4, rue du Clos Courtel 35512 Cesson-Sevigne, France {muriel.visani,christophe.garcia}@rd.francetelecom.com 2 Laboratoire LIRIS, INSA Lyon
20, Avenue Albert Einstein Villeurbanne, 69621 cedex, France [email protected]
Abstract
In this paper, a new statistical projection-based method called Two-DimensionalOriented Linear Discriminant Analysis (2DO-LDA) is presented. While in the Fisherfaces method the 2D image matrices are first transformed into 1D vectors by merging their rows of pixels, 2DO-LDA is directly applied on matrices, as 2D-PCA. Within and between-class image covariance matrices are generalized, and 2DO-LDA aims at finding a projection space jointly maximizing the second and minimizing the first by considering a generalized Fisher criterion defined on image matrices. A series of experiments was performed on various face image databases in order to evaluate and compare the effectiveness and robustness of 2DO-LDA to 2D-PCA and the Fisherfaces method. The experimental results indicate that 2DO-LDA is more efficient than both 2D-PCA and LDA when dealing with variations in lighting conditions, facial expression and head pose.
Keywords:
Two-Dimensional-Oriented Linear Discriminant Analysis, Face Recognition, Feature Extraction, Statistical projection, Two-Dimensional Principal Component Analysis.
1.
INTRODUCTION
Since the seminal work of Sirovich and Kirby 6 , which showed that Principal Component Analysis (PCA) could be efficiently used for representing images of human faces, statistical projection-based methods have been widely used in the context of automatic face recognition. Turk and Pentland 7 proposed the very well-known Eigenfaces method, based on PCA, where a face
image can be represented as a weighted sum of a collection of images (eigenfaces) that define a facial basis. Belhumeur et al. 1 introduced the Fisherfaces method, based on Linear Discriminant Analysis (LDA) where class information, i.e. the identity of each face image, is taken into account for enhancing separation between different classes, while building the face space. In PCA-based and LDA-based face recognition methods, the h × w 2D face images must be first transformed into 1D image vectors of size h · w, which leads to high-dimensional image vector space, where statistical analysis, i.e. covariance matrix calculation and eigen system resolution, is costly, difficult and may be unstable. To overcome these drawbacks, Yang et al. 10 proposed recently the Two Dimensional PCA (2D-PCA) method that aims at performing PCA using directly the face image matrices, keeping the 2D structure of the face images. They have shown on various databases that 2D-PCA is more efficient than PCA for the task of face recognition. In addition, we have shown 8 on FERET 4 that 2D-PCA is more robust than PCA when dealing with face segmentation inaccuracies such as misaligned or badly scaled face images, with low-quality images and partial face occlusions. In this paper, we propose a novel class-based projection technique, called Two-dimensional-Oriented Linear Discriminant Analysis, that achieves better recognition results than traditional LDA-based approaches by taking advantages of the 2D matrix representation of the face images while substantially reducing computational and storage costs. The remainder of the paper is organized as follows. In section 2, we describe in details the principle and the algorithm of the proposed 2DO-LDA method, pointing out its advantages over previous projection-based methods. In section 3, a series of four experiments, on different international data sets, is presented to demonstrate the effectiveness and robustness of 2DO-LDA, with respect to variations in lighting conditions, facial expression and head pose and compare its performances with respect to the LDA and 2D-PCA methods. Finally, conclusions are drawn in section 4.
2.
2D-ORIENTED LINEAR DISCRIMINANT ANALYSIS
The classifier is constructed from a training set of n face image matrices X_i, containing h × w pixels, labeled by their corresponding identity. The views of one person form a class. The aim is to find a projection matrix P, of size w × k, providing efficient separation of the projected classes according to:

\[
\hat{X}_i = X_i \cdot P \qquad (1)
\]

where X̂_i is the h × k projected matrix of X_i onto the orthonormal basis P of the projection space. The column vectors (P_i)_{i=1...k} of P will be referred to as 2D-Oriented Discriminant Components (2DO-DCs) in the following.
The 2DO-DCs are chosen to jointly maximize the mean variation between classes and minimize the mean of the variations inside each class. Therefore, P can be chosen as the w × k matrix maximizing the following generalized Fisher criterion:

\[
J(P) = (\hat{S}_w)^{-1} \hat{S}_b \qquad (2)
\]

where Ŝ_w and Ŝ_b are respectively the generalized within-class and between-class covariance matrices of the n projected image matrices X̂_i, defined as:

\[
\hat{S}_w = \frac{1}{n} \sum_{c=1}^{C} \sum_{X_i \in \Omega_c} (\hat{X}_i - \bar{\hat{X}}_c)^{T} (\hat{X}_i - \bar{\hat{X}}_c)
\quad \text{and} \quad
\hat{S}_b = \frac{1}{n} \sum_{c=1}^{C} n_c \, (\bar{\hat{X}}_c - \bar{\hat{X}})^{T} (\bar{\hat{X}}_c - \bar{\hat{X}}) \qquad (3)
\]

where X̄̂_c is the mean matrix of the n_c projected images of class Ω_c (among C different classes) and X̄̂ is the mean matrix of all the n projected images of the training set. According to equations (1) and (3), criterion (2) is equivalent to the following criterion:

\[
J(P) = \frac{|P^{T} S_b P|}{|P^{T} S_w P|} \qquad (4)
\]

where S_w and S_b are respectively called the generalized within-class and between-class covariance matrices of the training set:

\[
S_w = \frac{1}{n} \sum_{c=1}^{C} \sum_{X_i \in \Omega_c} (X_i - \bar{X}_c)^{T} (X_i - \bar{X}_c)
\quad \text{and} \quad
S_b = \frac{1}{n} \sum_{c=1}^{C} n_c \, (\bar{X}_c - \bar{X})^{T} (\bar{X}_c - \bar{X}) \qquad (5)
\]

where X̄_c and X̄ are respectively the mean of the n_c images of class Ω_c and the mean of all the n images of the training set. Under the assumption that S_w is non-singular, the k vectors P_i maximizing criterion (4) are the k orthonormal eigenvectors of the matrix S_w^{-1} S_b corresponding to the largest eigenvalues. The matrix S_w is generally invertible due to the low dimension of the 2DO-DCs relative to the number of training samples (n ≫ w). Once they have been sorted in descending order of their corresponding eigenvalues, the number k of 2DO-DCs to consider can be determined as for the eigenfaces method9, traditionally by removing a given percentage of the last eigenvectors. As a statistical projection method, 2DO-LDA can be used for image compression, even if the projection space is chosen to be more discriminative than representative. The projected image X̂_i and P can be combined to obtain a reconstruction of the original image X_i; some results are shown in Figure 1. Classification of face images is performed in the projection space defined by P: when comparing two faces X_a and X_b, they are first projected onto P according to equation (1), giving their projections X̂_a and X̂_b. Then, a matrix-to-matrix distance is calculated between X̂_a and X̂_b, for instance the following
(a)
(b)
(c)
(d)
1011
Figure 1. (a) Original images (Asian Face Database PF01). (b) Corresponding reconstructed images with k = 2 2DO-DCs. (c) With k = 3. (d) With k = 20. The projection space is constructed from the training set of the first experiment (see section 3). With the third 2DO-DC the facial features (eye, nose, mouth) appear, but the head poses are not distinguishable yet. With more 2DO-DCs a good visual quality of reconstruction is obtained.
distance, used by Yang. et al.
10 :
d(Xˆa , Xˆb ) =
k
ˆ ˆ Xaj − Xbj 2
(6)
j=1
ˆ where Xij = Xi Pj is the projected vector of image Xi on the the j th 2DO-DC Pj , and · 2 is the standard L2 norm. It can be pointed out that 2DO-LDA offers strong advantages in comparison with 2D-PCA and the usual LDA method: In 2D-PCA the projection space is chosen to retain most of the total scatter of the training set, no matter if that scatter is explained by variations inside the same class (variations in facial expression for instance) or between two different classes. Thus, the projection space constructed from 2D-PCA can represent noise and this method is more suited for face representation than for face classification. It will be shown in section 3 that 2DO-LDA have a stronger discriminative power than 2D-PCA; 2DO-LDA is numerically more stable than the usual LDA method: for the LDA method the sample images are vectors of length w · h, and this large dimension leads to numerical instability when computing the within and between-class covariance matrices, from these vectors; 2DO-LDA allows an important storage gain with respect to the usual LDA method: while for the LDA method the length of the projection vectors Pi is w · h, for 2D-PCA their length is w. Moreover, the number k of selected projection vectors for LDA is traditionally 60% of the number of samples of the training set. We will see in section 3 that the number of 2DO-DCs needed to provide good face recognition rates is much smaller;
1012 Concerning LDA, the length of the projection vectors w · h is usually much larger than the number of samples n. Therefore, the within-class covariance matrix of the training set is generally non-invertible. The trick traditionally used is to perform LDA into a subspace previously constructed from PCA 5 . The corresponding algorithm will be denoted by "PCA+LDA" in the following. Applying first PCA generates an additive computational cost and leads to a loss of information that could, if kept, be discriminative. Concerning 2DO-LDA, the Sw matrix is generally invertible and therefore the algorithm can be applied directly on the training set.
3.
EXPERIMENTAL RESULTS
Four experiments are performed to assess the effectiveness and robustness of 2DO-LDA with respect to variations in lighting conditions, facial expression and head pose and compare its performance with LDA and 2D-PCA. Three face databases are used: the Asian Face Image Database PF01 2 , containing 17 views of each of 107 persons, the well-known FERET 4 face database, and the BioId Database 3 , containing 1521 face images of 23 people, extracted from video sequences. A face image preprocessing step is first applied to each image: it consists in centering the face in the image, setting the image to a size of 65 pixels wide by 75 pixels high, and equalizing its histogram. The first two experiments provide a comparison of the robustness to variations in head pose and facial expression, the third experiment aims at evaluating the efficiency in the presence of illumination variations. The last experiment consists in matching video sequences of faces with different lighting conditions, head poses and facial expressions. Figure 2 shows samples of the training and test sets used for these experiments. Experimental results are analyzed through two graphics: the compared recognition rates of 2D-PCA and 2DO-LDA across a varying number of projection vectors k (first column of Figure 3) and the compared Cumulative Match Characteristic (CMC) curves for LDA, 2D-PCA and 2DO-LDA (second column of Figure 3). A face is said to be recognized at rank j if an image of the same person is among the j th nearest into the projection space. The distance (6) is used for 2D-PCA and 2DO-LDA. Concerning LDA a L2 distance is performed in the projection space. In each CMC curve, the number k of projection vectors has been chosen so as to maximize the performances of the algorithm. The first experiment (see Figure 2.a) is performed on the Asian Face Image Database PF01. The training set contains 535 images of faces, 5 views in near-frontal pose per person. The test set contains 428 images (4 views per person), with stronger non-frontal head poses than in the training set. The training
Two-Dimensional-Oriented Linear Discriminant for Face Recognition
1013
Figure 2. Images used for experiments. (a) First experiment. First row: training set, second row: test set. (b) Second experiment. First row: training set, second row: test set. (c) Third experiment. First row: training set, the test set #1 contains the middle image of the first row (photo taken on 10/31/1994) and the test set #2 contains the second row image (photo taken on 05/21/1996). (d) Fourth experiment. First row: training set from the FERET database. Second (resp. third) row: an extract of a sequence from the test set #1 (resp. #2) from the BioId database.
and test sets present similar lighting conditions, and neutral facial expressions. The test set is compared to the training set. Figure 3.a shows that, for any number k of projection vectors varying from 1 to 15, 2DO-LDA provides better recognition rates than 2D-PCA. The projection vectors of both methods have the same length (here w = 65). The best recognition rate for 2DO-LDA (94,4%) is obtained with k = 8 projection vectors, and is 2,1% superior to the best recognition rate for 2D-PCA, obtained with k = 9. The PCA+LDA algorithm is computed from a sufficient number of 200 principal components. The best results for LDA are obtained with k = 40 projection vectors of length 75·65 = 4875 pixels. Figure 3.b shows that 2DO-LDA gives better results than both 2D-PCA and LDA methods, at the first rank as well as at higher ranks. Therefore, 2DO-LDA appears to be the most robust to head pose changes. The second experiment (see Figure 2.b) is performed on a subspace of the Asian Face Image Database PF01, with similar lighting conditions and frontal head poses. The training set contains 321 images, i.e. three views per person. One corresponds to a neutral expression; the two others are chosen randomly among the four expressions available in the database: happy, surprised, irritated and closed eyes. The test set contains the two remaining facial expressions, for each person. The test set is compared to the training set. From Figure 3.c we can see that 2DO-LDA gives better results than 2D-PCA with less projection vectors (up to 19,7% of difference with k = 6). Figure 3.d shows that
1014 2DO-LDA gives much better recognition rates than both 2D-PCA and LDA methods for a rank varying from 1 to 10 (the mean improvement on the 7 first ranks compared to LDA is about 12%). We have observed that about 68% of the misclassifications of 2D-PCA correspond to matching different persons with the same facial expression, while this kind of errors is involved in only 52% of the fewer misclassifications of 2DO-LDA. Therefore, the efficiency of 2DO-LDA is explained by a better ability to deal with facial expression changes. Indeed, these variations constitute most of the within-class scatter of the training set, which is minimized by 2DO-LDA when applying the criterion 4, while 2D-PCA maximizes the total scatter (containing the within-class scatter). The third experiment (see Figure 2.c) is performed on the FERET database. The training set contains 666 images of 152 persons. The number of images per person is variable, but always larger than three. For each person, the multiple views are taken on different days, under different lighting conditions. The time interval between two views of the same person can be long (from a few days to almost three years), thus one person can wear eyeglasses or a beard on a photo and not on another one (see first row of Figure 2.d). There are two test sets. The first one contains one image per person, taken from the training set. Test set #2 also contains one image per person, taken another day and not belonging to the training set. Lighting conditions are very different and time delay may be long from one test set to another. All the images used for this experiment contain near-frontal head pose; the facial expression can either be neutral or smiling. Test set #2 is then compared to test set #1. Figure 3.e shows that for both 2DO-LDA and 2D-PCA the recognition rates are low (inferior to 50%), which can be explained by the important dissimilarities between the two test sets. However, 2DO-LDA achieves better recognition rates than 2D-PCA (the best recognition rate for 2DO-LDA is achieved with only 5 2DO-DCs against 13 projection vectors for 2D-PCA, and is 5,3% better). The PCA+LDA algorithm is computed from 500 principal components. From Figure 3.f we can conclude that 2DO-LDA provides better results than 2D-PCA and LDA at the first rank as well as at higher ranks. The storage gain compared to LDA is important given that the best recognition rates for LDA are achieved with 100 projection vectors of length 4875 pixels against only 5 projection vectors of length 65 pixels for 2DO-LDA. Moreover, 2DO-LDA reaches 5,5% mean recognition rate improvement over LDA at the first five ranks. The last experiment (see Figure 2.d) is performed on both the FERET and BioId databases. The training set and test set #1 of the previous experiment constitute the training set, that contains consequently 818 images from 152 persons. Two test sets are taken from the BioId database. Each one contains 173 images of 18 persons, taken from different video sequences. There are important variations in lighting conditions, facial expression and head pose, within
Two-Dimensional-Oriented Linear Discriminant for Face Recognition
1015
and between the test sets. First, test set #2 is compared on an image-to-image basis to test set #1, and a recognition rate is obtained, as for previous experiments. Given that the two sets contain very different illumination conditions and head poses, the recognition rates are inferior to 65% for both 2DO-LDA and 2D-PCA techniques (see Figure 3.g). However, 2DO-LDA achieves much better recognition rates than 2D-PCA, with 15% difference between their respective maxima. The PCA+LDA algorithm is performed from 600 principal components. From Figure 3.h we can conclude that 2DO-LDA provides better recognition rates than 2D-PCA and LDA at the first rank as well as at higher ranks. The storage gain compared to LDA is very important: for LDA the best recognition rate is achieved with 550 projection vectors of length 4875 pixels against only 5 2DO-DCs of length 65 pixels for 2DO-LDA. Moreover, 2DOLDA reaches 17.6% mean recognition rate improvement over LDA at the first five ranks. The test sets are then compared on a sequence-to-sequence basis, applying a majority voting scheme to the comparion results obtained previously. 2DO-LDA recognizes 11 sequences from 18, while 2D-PCA recognizes at most 8 sequences.
4.
CONCLUSION
In this paper, we have proposed a new class-based projection method, called 2DO-LDA, that can be successfully applied to face recognition. This technique, by maximizing a generalized Fisher criterion computed directly from matrices of face images, constructs a discriminant projection matrix. 2DO-LDA is numerically more stable and allows an important storage gain in comparison with the usual LDA method. It has already been shown 10 ; 8 that 2D-PCA outperforms the traditional PCA method. In this paper, we have shown on various databases that 2DO-LDA is more efficient and robust to variations in lighting conditions, facial expression and head pose than both 2D-PCA and LDA.
ACKNOWLEDGMENTS Portions of the research in this paper use the FERET database of facial images collected under the FERET program.
[Figure 3 shows eight panels (a)-(h), two per experiment. The left panel of each pair plots the Recognition Rate against the number k of projection vectors Pi for 2D(O)-LDA and 2D-PCA; the right panel plots the Cumulated Recognition Rate against the rank for 2D(O)-LDA, 2D-PCA and LDA. Legend settings: first experiment 2D-LDA (k=8), 2D-PCA (k=9), LDA (k=40); second experiment 2D-LDA (k=5), 2D-PCA (k=12), LDA (k=30); third experiment 2D-LDA (k=5), 2D-PCA (k=13), LDA (k=100); fourth experiment 2D-LDA (k=5), 2D-PCA (k=6), LDA (k=550).]
Figure 3. Compared recognition rates of 2DO-LDA and 2D-PCA when varying the number k of projection vectors (first column) and compared CMC curves of 2DO-LDA, 2D-PCA and LDA, for each of the four experiments (second column).
REFERENCES
1. P.N. Belhumeur, J.P. Hespanha and D.J. Kriegman, Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection, Special Theme Issue on Face and Gesture Recognition, IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(7), 711-720, July 1997.
2. B.W. Hwang, M.C. Roh and S.W. Lee, Performance Evaluation of Face Recognition Algorithms on Asian Face Database, 6th IEEE Int. Conf. on Automatic Face and Gesture Recognition, 278-283, Seoul, Korea, May 2004.
3. O. Jesorsky, K.J. Kirchberg and R.W. Frischholz, Robust Face Detection Using the Hausdorff Distance, in Proc. 3rd International Conference on Audio- and Video-based Biometric Person Authentication, Springer, Lecture Notes in Computer Science, LNCS-2091, 90-95, Halmstad, Sweden, June 2001.
4. P.J. Phillips, H. Wechsler, J. Huang and P. Rauss, The FERET Database and Evaluation Procedure for Face Recognition Algorithms, Image and Vision Computing, 16(5), 295-306, 1998.
5. D.J. Swets and J. Weng, Using Discriminant Eigenfeatures for Image Retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 831-836, 1996.
6. L. Sirovitch and M. Kirby, A Low-Dimensional Procedure for the Characterization of Human Faces, Journ. of Optical Society of America A, 4(3), 519-524, 1987.
7. M. Turk and A. Pentland, Face Recognition Using Eigenfaces, Proc. IEEE Conference on Computer Vision and Pattern Recognition, 586-591, Maui, Hawaii, 1991.
8. M. Visani, C. Garcia and C. Laurent, Comparing Robustness of Two-Dimensional PCA and Eigenfaces for Face Recognition, to appear in 1st International Conference on Image Analysis and Recognition (ICIAR), September 2004.
9. W.S. Yambor, B. Draper and R. Beveridge, Analyzing PCA-based Face Recognition Algorithms: Eigenvector Selection and Distance Measures, in Empirical Evaluation Methods in Computer Vision, H. Christensen and J. Phillips (eds.), World Scientific Press, Singapore, 2002.
10. J. Yang, D. Zhang and A.F. Frangi, Two-Dimensional PCA: A New Approach to Appearance-Based Face Representation and Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(1), 131-137, January 2004.
3D HUMAN MODEL ACQUISITION FROM UNCALIBRATED MONOCULAR VIDEO
En Peng and Ling Li Department of Computing, Curtin University of Technology, Perth, Australia 6845. Email: @cs.curtin.edu.au
Abstract:
This paper presents a novel method to reconstruct a 3D body surface model of a human figure from 2D uncalibrated monocular video. Unlike other work on body reconstruction from monocular video, where either the source video is strictly selected or the skeleton proportions are assumed to be available beforehand, the proposed system automatically recovers the skeleton proportions and rebuilds the 3D human body shape in NURBS format without pose estimation.
Key words:
human body modeling, monocular images, 3D reconstruction
1.
INTRODUCTION
With the rapid development of computer technology in the last decades, a wide range of applications involving virtual humans has emerged, such as digital movies, video surveillance and virtual reality environments. These applications demand virtual humans that are realistic and animatable. Recovering the human body shape from image(s) avoids the expensive 3D scanning equipment employed in traditional body modeling approaches. Image-based modeling approaches can be divided into two groups according to the actor in the image(s): 1. static human body reconstruction and 2. dynamic human body reconstruction. Researchers in the first group use multi-view images of a posture-specified human figure2,4 or even a single image5. The images must capture specific views of the immobile human figure. Joint locations and skeleton proportions have to be known beforehand when only one image is used. The body cloning
results are generally accurate. Body reconstruction from image sequences containing human motion forms the other group of image-based human model reconstruction. Some researchers require the human figure to perform specified motions3 or limited motions7 under multiple cameras. Methods using multiple cameras share the same drawback as the static approach: the person is required to pose for the cameras in a specific place, normally a fully equipped laboratory or studio. Other researchers try to clone the human body from monocular video. They usually simply align a pre-defined 3D surface human model5 or a super-quadric model6 with the skeleton recovered from monocular images. Regardless of the accuracy of the posture estimation, one common limitation of their work is that the reconstructed body does not reflect the actual body shape in the images, since a shape mapping process is not included. As monocular video is widely available, we chose this kind of input source. In this paper we propose a novel method for reconstructing a 3D human body model from an uncalibrated monocular image sequence without human posture estimation. The rest of this paper is organized as follows: the proposed reconstruction system is described in Section 2, Section 3 gives experimental results and evaluation, and conclusions and future work are discussed in Section 4.
2.
SYSTEM DESIGN
2.1
System Overview
Figure 1. Overview of our system.
An overview of the proposed 3D human shape reconstruction system is illustrated in Fig. 1. A generic 3D surface model of human body consisting of NURBS patches is used as the basis for the reconstruction. The generic model is constructed to represent the human body shape in satisfactory details while using the fewest number of control points possible. A
monocular video sequence containing a full human body in motion is used as the input to the system. For the purposes of body modeling, the human figure should wear tight clothes to better reveal the body shape. In addition, the torso should be almost parallel to the image plane for the accuracy of the proposed reconstruction algorithm. Silhouette information is extracted from each frame. Body silhouettes are then segmented and labeled into different body parts. Seven main body parts are considered in the segmentation: torso, head, hip, upper arm, lower arm, thigh and calf. The torso is labeled in each frame, while the other parts are segmented and labeled only when they are not occluded by other body parts. A set of key frames is selected automatically for the estimation of the human skeleton proportions. Next, another set of key frames is selected automatically, based on different criteria, for the extraction of the surface shape of each body part. Finally, the generic human model is personalized according to the estimated skeleton proportions and the extracted surface shape. Skeleton proportion estimation and body surface extraction are the two key components of this system. They are discussed in more detail in the next two sections.
2.2
Skeleton Proportion Estimation
Our skeleton system employs a relative length representation for the skeleton measurements. In the relative length representation, the skeleton length of the torso segment is defined as the unit length. The relative lengths of the other body segments are to be estimated to reveal the skeleton proportions of the person recorded in the images. According to the general rules of perspective projection, if any two skeleton segments lie within the same plane parallel to the image plane in the same image, the projected length of either skeleton segment is its actual length scaled by the same factor. Thus the ratio between the actual lengths of any two skeleton segments is equal to the ratio of their projected lengths. Since the skeleton of the torso is assumed to be almost parallel to the image plane, the relative length parameter of another body segment can be calculated by comparing its projected length with the torso's, provided this body segment and the torso lie within the same plane parallel to the image plane. During human walking, the head and the hip are usually kept upright with the torso. Thus, the relative skeleton length of the head or the hip can easily be calculated from their projected lengths. In general walking motion, the two legs cannot both be in front of or behind the torso simultaneously; the same rule applies to the two arms1. Therefore, there are two main possibilities for the relation between any two symmetric body parts and the torso in most
cases under arbitrary camera views: either they are in the same plane parallel to the image plane, or one of the symmetric parts is in front of the plane parallel to the image plane that passes through the torso while the other is behind it (farther from the image plane). It is clear that for any two symmetric body parts, the shorter relative projected length in any image is usually not longer than the actual relative length. Meanwhile, the longer relative projected length may be much longer than the actual relative length in some frames if the perspective effect is strong. Thus, taking the shorter relative length in each image, the largest of these values among all images approximates the actual relative length. In this way, the skeleton proportion can be estimated for each symmetric part.
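The min-then-max rule can be sketched in a few lines of Python. This is only an illustration of the estimation principle; it assumes that the projected lengths of the symmetric part and of the torso have already been measured in each frame, and the function name and example values are hypothetical.

def relative_length(symmetric_projected, torso_projected):
    """Estimate the relative skeleton length of a symmetric body part.

    symmetric_projected: one (left, right) pair of projected lengths per frame.
    torso_projected: projected torso length per frame (defines the unit length).
    Per frame, the shorter of the two symmetric parts never over-estimates the
    actual relative length; the largest such value over all frames is kept as
    the estimate.
    """
    best = 0.0
    for (left, right), torso in zip(symmetric_projected, torso_projected):
        best = max(best, min(left, right) / torso)
    return best

# Example: thigh lengths over three frames, torso about 100 pixels in each.
print(relative_length([(48, 61), (52, 55), (40, 70)], [100, 102, 98]))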
2.3
Body Shape Data Extraction
Unlike other silhouette-based human shape recovery methods where the posture is estimated first3, we propose a novel method to extract the shape data from 2D silhouettes without the need to reconstruct the human posture. Our method extracts the shape data of each single body part and then combines them to provide the whole body information. Anatomically, the muscle structure that dominates the body shape is almost the same for every person. Hence, the 2D silhouettes of a body part of different persons seen from the same viewpoint can be regarded as similar. So, from a silhouette of a body part, it is possible to estimate the viewpoint of this silhouette. Once the viewpoint is estimated, the feature points can easily be mapped from the silhouette to the body part surface. In our system, different frames from the image sequences are chosen for each body part. These frames provide silhouettes of that body part from different views. For the torso, head and hip, only two frames each are needed, providing their silhouette examples from the front view and the side view. Four different frames showing silhouette examples of the limbs (upper arm, lower arm, thigh and calf) are needed, from four different views. In order to estimate the shape for each specified view, three stages are performed in our system: 1. labeled silhouette selection; 2. viewpoint estimation for each silhouette; 3. silhouette selection for the specified viewpoint. The purpose of stage one is to select the body part silhouettes that are most likely parallel to the image plane. By comparing the relative projected length to the estimated skeleton relative length calculated as described in the previous section, those silhouettes that appear too "short" are removed from the candidate list for the following stages. In the second stage, the viewpoint(s) are estimated for each labeled silhouette by comparing it to the stored silhouette examples, regardless of body fat. The comparison results in a dissimilarity value.
Iteratively flipping the generic silhouette horizontally or vertically, the minimum dissimilarity value is chosen to represent the difference between this pair of silhouettes. Given a threshold parameter, the possible viewpoints for the given silhouette can be estimated by comparing the dissimilarity to the threshold. As a result, one or more viewpoints may be estimated for the given silhouette. Stage three aims at selecting the silhouette that most likely reveals the actual shape for each viewpoint. Simply, if all candidate silhouettes for a specified viewpoint have only one estimated viewpoint, the one with the lowest dissimilarity is selected. However, in some cases silhouettes are assigned to many viewpoints due to the similarity of the side contours in the silhouette examples. To further determine the viewpoint of each silhouette, the cue of width difference can be used: the selection proceeds according to the width relationships in the example silhouettes. After these three stages, the shape of each body part for each specified view can be estimated, provided the original video contains at least one instance of it. When the shape of a certain body part in a certain specified view cannot be estimated, the generic shape has to be used, since no information about it is provided in the silhouettes.
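Stage two, the viewpoint estimation by comparison with stored silhouette examples, could be sketched roughly as follows. The pixel-disagreement dissimilarity, the set of flips and the assumption that all silhouettes are normalized to the same raster size are simplifications for illustration, not the exact measure used in the system.

import numpy as np

def viewpoint_candidates(silhouette, examples, threshold):
    """Return the viewpoints whose example silhouette is close enough.

    silhouette: 2-D 0/1 array of the body part extracted from one frame.
    examples: dict mapping a viewpoint name to an example silhouette of the
    same shape.  The dissimilarity is the fraction of disagreeing pixels,
    minimized over horizontal/vertical flips of the example.
    """
    def dissimilarity(a, b):
        variants = (b, np.fliplr(b), np.flipud(b), np.flipud(np.fliplr(b)))
        return min(np.mean(a != v) for v in variants)

    return [view for view, example in examples.items()
            if dissimilarity(silhouette, example) < threshold]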
3.
RESULTS
The proposed system is tested on two computer synthesized video sequences containing human motions.
Figure 2. Calf modeling (a) and results of human figure reconstruction (b).
The significant silhouettes for the different body parts are first labeled in every image. The automatic skeleton proportion estimation algorithm and the shape extraction algorithm are tested on both image sequences. With the estimated
skeleton proportion and the extracted shape for each view, every single body part can be reconstructed. As shown in Fig. 2(a), the control points of the edge curves derived from the generic NURBS calf model are mapped to the corresponding control points in the extracted shapes, which results in the new calf model. By synthesizing all reconstructed body parts, the full human model with an arbitrary pose can be produced. It can be seen in Fig. 2(b) that the human figures in both videos have been successfully represented by the 3D figures reconstructed by our system.
4.
CONCLUSIONS AND FUTURE WORK
This paper presents a novel method for reconstructing a 3D human body model from 2D uncalibrated monocular images. The system includes an intuitive method to estimate the skeleton proportions and an algorithm to extract the surface shape of each body part from uncalibrated images without posture estimation. The proposed system has been tested on two computer-synthesized image sequences, and the results are satisfactory. Future work will include the study of more human motion styles and improvements in the accuracy of the proposed algorithms.
REFERENCES
1. Z. Chen and H.J. Lee, Knowledge-Guided Visual Perception of 3-D Human Gait from a Single Image Sequence, IEEE Trans. on Systems, Man, and Cybernetics, 22(2), 1992.
2. A. Hilton and T. Gentils, Popup People: Capturing Human Models to Populate Virtual Worlds, SIGGRAPH, UK, 1998.
3. I.A. Kakadiaris and D. Metaxas, 3D Human Body Model Acquisition from Multiple Views, in Proceedings of ICCV'95, pages 618-623, Boston, MA, June 1995.
4. W. Lee, J. Gu and N.M. Thalmann, Generating Animatable 3D Virtual Humans from Photographs, Eurographics, 2000.
5. F. Remondino and A. Roditakis, Human Figure Reconstruction and Modeling from Single Image or Monocular Video Sequence, 4th International Conference on 3-D Digital Imaging and Modeling (3DIM), Banff, Canada, 2003.
6. A. Sappa, N. Aifanti, S. Malassiotis and M.G. Strintzis, Monocular 3D Human Body Reconstruction Towards Depth Augmentation of Television Sequences, ICIP 2003.
7. J. Starck and A. Hilton, Model-Based Multiple View Reconstruction of People, ICCV 2003.
A MURA DETECTION BASED ON THE LEAST DETECTABLE CONTRAST OF HUMAN VISION
Kazutaka Taniguchi, Kunio Ueta, and Shoji Tatsumi Dainippon Screen Mfg. Co., Ltd., Osaka City University
Abstract:
Mura - irregular lightness variation on a uniformly manufactured surface - must be detected to keep the quality of display devices high. A mura is perceived in spite of its low contrast if its spatial frequency falls in a sensitive range of human vision. We propose a method that considers the characteristics of human vision to detect mura on display device components whose mura intensity is lower than that of the final device.
Key words: mura, defect detection, human vision system, spatial frequency, contrast enhancement
1.
INTRODUCTION
Irregular lightness unevenness on a uniformly manufactured surface is called a mura. A mura is understood as a defect without a clear contour or contrast that gives viewers an unpleasant sensation. Mura degrade the quality of display devices, since these are viewed directly. The perceived mura intensity depends on the spatial frequency sensitivity of the human vision system. The display cell array has in most cases a higher frequency than human vision can resolve; the fluctuation perceived as mura therefore has a lower frequency than the cell array. Mura are brought about by a large variety of causes, such as unevenness of coated layer thickness, local non-uniformity of a chemical process, local surface roughness, size or location divergence of the regularly placed cells, and so on. Whether a mura is acceptable, and how much it degrades the quality of the device, depends on human perception.
A quantitative evaluation of mura intensity is required first for an automatic detection system, as the inspectors have acquired the skill of mura evaluation, whose consensus is usually based on the least acceptable sample. The intensity of mura in components of the devices is often lower than in the final device. Reducing mura at the component level contributes to the yield of final products. The results we report are on shadow masks of CRTs. There are several works on mura detection on final LCD panels1-4 or color filters of LCDs5, which have focused on quantification of mura above one percent fluctuation relative to the background level. However, low contrast mura on display device components such as a shadow mask, whose mura level is required to be lower than in the final device, have hardly been discussed. Noise perception on printed surfaces has been researched6, with a purpose similar to ours. There is also a lot of work on contrast enhancement, mostly intended for favorable rendering of a scene. These approaches share some goals with ours, such as clear detail expression while avoiding halos and signal enhancement without noise enhancement, but the overall goal is a little different: mura detection has to enhance a certain range of spatial frequencies and suppress the others, and does not have to preserve the overall image impression. A shadow mask is a metal mask, with a huge number of regularly etched holes, placed just behind the surface coated with fluorescent material in a CRT in order to let the scanning electron beam hit the appropriate target. The holes are produced by chemical etching, whose instability brings mura. If a hole is ten percent larger than the designed size, it is detected as a spot defect. But even when a hole with an aperture ratio of twenty percent is one percent larger, the resulting lightness difference is just 0.2 percent. Figure 1 shows examples of low contrast mura, which are hardly reproduced on screen or in print due to noise in the reproduction process.
2.
HUMAN VISION AND MURA INTENSITY
Low contrast mura can be perceived if they are gathered within a range of several millimeters at the high-sensitivity frequency of human vision, if the viewing environment is well prepared, and if the inspectors are trained.
Figure 1. Shadow mask (a) with mura and (b) without mura.
2.1
Luminance level of viewing condition
As shown in Figure 2, Blackwell7 measured the characteristics of human vision at various luminance levels using a 4' spot displayed for a short time (0.2 s). The contrast is the ratio of the spot's lightness variation to the background. Figure 2 shows that the human vision system can detect lower contrast at higher luminance levels. In contrast, the least detectable mura of an LCD panel according to SEMU (SEmi Mura Unit)1 is much higher than in Blackwell's experiment. There are a number of differences in conditions; under all of them, the practical conditions of SEMU are at a disadvantage compared to Blackwell's experiment with respect to detecting low contrast mura. The target intensity of the mura on the shadow masks is lower than SEMU, thanks to the high luminance, the movable sample and the stable illumination; the detection system is therefore required to be designed to detect this low contrast.
2.2
Spatial frequency response
Figure 3 shows that human vision has a band-pass sensitivity in spatial frequency, after Dooley6 and Sakata8. The mura detection system discussed in the next section is designed to have the band-pass characteristics shown as the dotted line in Figure 3.
[Figure 2 is a log-log plot of the least detectable contrast [%] (0.01 to 100) against the luminance level [cd/m2] (0.01 to 1000), with the SEMU and shadow mask operating points marked on the least-detectable-contrast curve.]
Figure 2. The least detectable contrast of human vision on luminance.
Figure 3. Frequency sensitivity of human vision system.
3.
PROCESS FLOW
The proposed method has three processing steps: 1) image acquisition, 2) contrast enhancement and 3) quantification.
3.1
Image acquisition phase
In the image acquisition phase, a high signal-to-noise ratio (S/N) is most important in order to detect lightness differences of less than one percent. It is essential to compensate for the illumination shading and for the sensitivity variation of each sensor of the array.
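One common way to perform such a compensation is a flat-field style correction using a dark image and an image of a uniform reference surface. The sketch below is a generic, assumed scheme for illustration, not necessarily the calibration actually used in the system.

import numpy as np

def compensate(raw, dark, flat):
    """Compensate illumination shading and per-sensor sensitivity variation.

    raw:  acquired image of the inspected surface.
    dark: image taken with the illumination off (sensor offsets).
    flat: image of a uniform reference (shading and sensitivity combined).
    The result is scaled so that a perfectly uniform surface maps back to the
    mean level of the reference.
    """
    flat_corr = flat.astype(np.float64) - dark
    corrected = (raw.astype(np.float64) - dark) / np.maximum(flat_corr, 1e-6)
    return corrected * flat_corr.mean()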
3.2
Contrast enhancement phase
First, in the contrast enhancement phase, the acquired image (as in Figure 1) is accumulated over square areas of sa x sa pixels. Second, the image is processed by a low pass filter (window size 2s1+1) to get an image with suppressed high frequencies. Third, a high pass filter processes the image Lij (i, j: pixel location) to reduce the low frequency distribution hindering effective contrast enhancement in the following process: each pixel value Lij is divided by a value derived from the average of the neighboring pixel values (window size 2s2+1) to get a high pass filtered image Hij as follows:
H_{ij} = \frac{L_{ij}}{\Big(\sum_{u=i-s_2}^{i+s_2}\sum_{v=j-s_2}^{j+s_2} L_{uv}\Big) \big/ (2s_2+1)^2}    (1)
The low pass filter and the high pass filter compose a band pass filter, whose characteristics can be designed through the parameter settings of sa, s1 and s2 to give the response curve in Figure 3. Finally, the image Hij is linearly mapped for contrast enhancement to get Eij. This image Eij is used as the visualized data for inspection (as in Figure 4):
E_{ij} = b\left(\frac{H_{ij} - 1}{r_c} + 1\right)    (2)
where rc is the contrast enhancement ratio and b is the background value of E. If Eij falls outside [0, max], it is clipped to 0 or to the maximum of the image quantization.
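A minimal numpy/scipy sketch of this enhancement chain is given below. The box filters used for the low-pass and averaging steps and all parameter values are illustrative stand-ins; only the structure (accumulation, low pass, division by the local mean as in Eq. (1), linear mapping as in Eq. (2), clipping) follows the description above.

import numpy as np
from scipy.ndimage import uniform_filter

def enhance(image, sa=4, s1=2, s2=16, rc=0.01, b=128, max_level=255):
    """Band-pass contrast enhancement along the lines of Eqs. (1)-(2)."""
    img = image.astype(np.float64)
    h, w = img.shape
    # Accumulate the acquired image over sa x sa blocks.
    acc = img[:h - h % sa, :w - w % sa].reshape(h // sa, sa, w // sa, sa).sum(axis=(1, 3))
    # Low pass with a (2*s1+1) window, then divide by the (2*s2+1) local mean (Eq. (1)).
    low = uniform_filter(acc, size=2 * s1 + 1)
    high = low / np.maximum(uniform_filter(low, size=2 * s2 + 1), 1e-9)
    # Linear mapping around the background value b with enhancement ratio rc (Eq. (2)).
    enhanced = b * ((high - 1.0) / rc + 1.0)
    return np.clip(enhanced, 0, max_level)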
Figure 4. Contrast enhanced image of Figure 1.
3.3
Quantification phase
In the quantification phase, the mura intensity is quantified as a set of values (evaluation scores) for local areas of the image. There are three types of mura - large scale mura, line mura, and partial mura - each of which correlates with a certain type of score among a) through g):
a) Low frequency power of a large local area: compute the FFT of square parts of the image Eij, take their power, and sum over a frequency range.
b) Divergence of a large local square area: compute the divergences of the square parts.
c) Maximum of projection: a low pass filter processes the image Eij; then compute the maximum deviations of the pixel values accumulated parallel to the mura direction, in regions elongated orthogonally to the mura direction.
d) Divergence of a local strip area: compute standard deviations from the total average value AVE in rectangular regions elongated parallel to the mura direction.
e) Small grains: count the pixels belonging to regions, bi-leveled by a threshold of deviation from AVE, that are larger than a threshold size.
f) Divergence of a small area: compute divergences from AVE of small rectangular regions.
g) Local edges: count the edge pixels of small square regions.
The values calculated as a) and b) are intended for the large scale mura, c) and d) for the line mura, and e), f) and g) for the partial mura, respectively. If every type of score from a) to g) for every local area is under a previously determined threshold level, the sample is judged as OK. We have applied the method to 1495 samples of shadow masks, which were judged by human inspectors into three groups: OK, NG and their border. The judgment of the method for the three groups is shown in Table 1. The method judges all samples in the NG group as NG. Although some false detections occur for OK samples, the rate is not high. For the border samples between OK and NG, the false rate is higher than for OK samples, which suggests that the quantification can be improved.
Table 1. Comparison of human inspection and automatic detection.
                 Machine: OK    Machine: NG
Human: OK            1001            18
Human: Border         397            73
Human: NG               0             6
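To make the quantification concrete, two of the scores can be sketched as follows; the block and strip sizes are hypothetical and only scores b) and d) are shown.

import numpy as np

def block_divergences(E, block=64):
    """Score b): divergence (variance) of Eij inside large square blocks."""
    h, w = E.shape
    return [E[i:i + block, j:j + block].var()
            for i in range(0, h - block + 1, block)
            for j in range(0, w - block + 1, block)]

def strip_deviations(E, strip=8, axis=0):
    """Score d): standard deviation from the global average AVE inside strips
    elongated parallel to the (assumed) mura direction."""
    ave = E.mean()
    n = max(1, E.shape[axis] // strip)
    return [float(np.sqrt(np.mean((s - ave) ** 2)))
            for s in np.array_split(E, n, axis=axis)]

def is_ok(E, thresholds):
    """OK only if every local score stays below its threshold (scores a), c),
    e), f) and g) would be checked in the same way)."""
    return (max(block_divergences(E)) < thresholds["block_div"] and
            max(strip_deviations(E)) < thresholds["strip_dev"])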
4.
CONCLUSIONS
A method for a mura detection system considering human vision has been proposed. The method converts the shadow mask images in Figure 1 into the images in Figure 4, which show a clear enhancement of the mura. The detection results on a number of samples are shown in Table 1, which confirms that the NG samples can be detected by the method. The detection sensitivity over spatial frequency depends on the window size parameters of the system. We have confirmed the effectiveness of the system based on both the spatial and the contrast characteristics of human vision. The method is applicable to detecting mura on other uniform surfaces.
REFERENCES
1. SEMI Document #3324, "New Standard: Definition of Measurement Index (SEMU) for Luminance Mura in FPD."
2. F. Saitou, "Uneven Area Defects Inspection on LCD Display Using Multiple Resolute Image," J. Jap. Soc. for Precision Engineering, 63(5), pp. 647-651, 1997 (in Japanese).
3. H. Nakano, Y. Yoshida and K. Fujita, "A Method to Aid Detection of Macro Defects of Color Liquid Crystal Display through Gabor Function," J. IEICE, Vol. J80-D-II, 3, pp. 734-744, 1997 (in Japanese).
4. K. Tanahashi and M. Kohchi, "Automatic Measurement Method of MURA in Liquid Crystal Displays Based on the Sensory Index," 8th Intelligent Mechatronics Workshop, pp. 183-188, 2003.
5. K. Nakashima, "Hybrid Inspection System for LCD Color Filter Panels," IMTC '94, pp. 689-691, 1994.
6. R.P. Dooley and R. Shaw, "Noise Perception in Electrophotography," J. Appl. Photogr. Eng., 5, 4, pp. 190-196, 1979.
7. H.R. Blackwell, "Contrast Thresholds of Human Eye," J. Opt. Soc. Am., 36, 624, 1946.
8. H. Sakata and H. Isono, "Chromatic Spatial Frequency Characteristics of Human Visual System," J. ITE of Japan, 31, 1, pp. 29-35, 1979.
USING MULTI-KOHONEN SELF-ORGANIZING MAPS FOR MODELING VISUAL PERCEPTION
Michel Collobert France Telecom R&D IRIS/VIA, F-22307 Lannion Cedex, France [email protected]
Abstract:
This paper presents a new organizing principle for perceptual systems based on multiple Kohonen self-organizing maps. These maps are arranged in order to model the global brain activity as seen in tomography pictures. In contrast to most neural network models, a perceived object or piece of knowledge is not represented by the final activity of a single neuron but by a configuration of activity of the whole set of neurons.
Key words:
Distributed perception, configuration of activity, Kohonen maps, somatosensory areas, neuron, chaos, multisensoriality, grandmother neuron.
1.
INTRODUCTION
In our human consciousness, each representation of the external world (objects, knowledge, etc.) is automatically translated into one word, i.e. into *one* symbol. Because of this, researchers in artificial intelligence (AI, including perception, cognition and generation of action) tend to construct their algorithms so as to obtain a single code as 'output'. In the field of neural networks, such as the well-known Kohonen map1 or Multi-Layer Perceptrons2, or in the field of Machine Learning3, the process is similar: only one of the elements used for coding the representation (or at most a few) is active at the end of the process. However, in living organisms with efficient and adaptive perceptual systems, the only important outputs are actually 'actions'. Thanks to new imaging techniques (such as PET or fMRI scans) we realize that in such a system a substantial portion of the neurons is activated at any given time
(Fig. 1). There is never only one neuron or one small group of neurons (as in the grandmother neuron concept) firing. Instead, many neurons from different brain areas are always firing.
Figure 1. Brain activity of Dr. Christof Koch (reproduced with his permission).
It has long been a great challenge to the scientific community, in particular the A.I. and robotics communities, to understand and model the functionality of perceptual systems. Even modeling the nervous system of a simple housefly4 would be a great success. Some researchers5 even suggest that we are missing something fundamental in the currently available models. The new paradigm presented in this paper aims at filling a small part of this void. It consists in constructing a model able to perform perception based on the simultaneous activity of several assemblies of neurons.
2.
DESCRIPTION OF THE MODEL
The model is based on some of the principles observed in biological nervous systems:
• Specialization of cerebral areas.
• Somatotopic maps.
• Hebbian learning6.
• Nonlinear and chaos dynamics1,4,6,8,9.
This section first presents the principles of representation and memorizing. The second part describes the principles of recognition/perception.
2.1
Representation.
The first two characteristics of biological nervous systems are used for the representation of input data. They are as follows:
• Specialization of cerebral areas. There is a functional specialization of the different parts of a nervous system. For instance, in human beings the brain is divided into many parts; one of them is the visual cortex, which is itself divided into several areas named MT, V1, V2, V3, etc. Each of these areas is specialized in a particular function, such as processing speed, edges, colour or position.
• Somatotopic maps. A large part of these brain areas are somatotopic, i.e. they perform a topology-preserving mapping of the sensory organs onto the cortex. This mapping is not linear: the more the brain has to discriminate between inputs from a sense organ, the greater the number of neurons it uses.
In order to simulate this, the new model presented here uses multiple Kohonen Self-Organizing Maps (KSOM)1: "multiple" to address specialization, and KSOMs to address the somatotopic properties. For example, to discriminate between different objects in a video stream using this model, it is possible to use one specialized KSOM per attribute used to represent these objects. Each object or type of object is then represented and memorized by one activity configuration of the whole set of KSOMs. The number of objects between which the system is able to discriminate follows a geometric series as a function of the number of neurons per map and the number of attributes/maps. So if we use, for example, only three one-dimensional KSOMs of 10 neurons each (one for colour, one for height, one for width), we are able, in theory, to discriminate between 10³ objects with only 30 'neurons' (Fig. 2).
Figure 2. Examples of activity configuration.
Each KSOM is trained with the different values of the associated attribute for all objects. Because an important feature of the KSOM is its non-linearity, we are sure to discriminate as well as possible between objects using this attribute. For example, if the set of objects to identify/discriminate is composed of faces, the KSOM associated with the colour attribute is composed only of neurons each responding to a particular hue of skin.
Another important feature of a KSOM is the somatotopic property. So when the value of an attribute varies for an object between two instants, the two corresponding configurations of activity will be very close. In addition, memorizing not a particular state of one object at a particular time but its prototype can be done by learning the statistics of its representation: the more often a particular value of an attribute is observed, the more active the corresponding neuron becomes. We thus obtain a configuration of activity of the kind shown in Fig. 3:
Fig. 3. Example of activity configuration of KSOM
Note that this representation presents similarities with parts of brain images obtained with scans, such as in fig. 4.
Figure 4. Example of activity configuration of brain maps.
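A toy sketch of this representation is given below: three one-dimensional maps of ten neurons each, where repeatedly observing an attribute value accumulates activity on the best-matching neuron, as in the prototype learning described around Fig. 3. The evenly spaced neuron centres stand in for trained KSOMs and are purely illustrative.

import numpy as np

class AttributeMap:
    """A 1-D map of `size` neurons covering one attribute (e.g. hue)."""
    def __init__(self, size, lo, hi):
        self.centres = np.linspace(lo, hi, size)   # stand-in for trained KSOM prototypes
        self.activity = np.zeros(size)             # accumulated prototype activity

    def respond(self, value):
        """Index of the best-matching neuron for this attribute value."""
        return int(np.argmin(np.abs(self.centres - value)))

    def observe(self, value):
        """The more often a value is seen, the more active its neuron becomes."""
        self.activity[self.respond(value)] += 1.0

# Three 10-neuron maps (hue, height, width): up to 10**3 distinguishable objects
# with only 30 neurons.
maps = {"hue": AttributeMap(10, 0.0, 1.0), "height": AttributeMap(10, 0.0, 2.5),
        "width": AttributeMap(10, 0.0, 1.5)}
for hue in (0.31, 0.30, 0.33):     # hue of one object over three frames
    maps["hue"].observe(hue)
print(maps["hue"].activity)        # peaked configuration of activity, cf. Fig. 3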
2.2
Memorizing
To memorize these configurations of activity, we use the well-known Hebbian learning rule: "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased"6. We suppose there is a connection between each neuron of each map and we apply this Hebbian rule, with normalization, during learning. Unlike the number of items the system is able to memorize, the number of connections grows according to an arithmetic series law.
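One Hebbian step with normalization over the whole set of neurons might be sketched as follows; the learning rate and the row-wise normalization are illustrative choices, not the exact scheme of the model.

import numpy as np

def hebbian_update(W, activity, rate=0.1):
    """Strengthen connections between co-active neurons, then renormalize.

    W: (N, N) connection matrix between all neurons of all maps.
    activity: length-N configuration of activity (all maps concatenated).
    """
    W = W + rate * np.outer(activity, activity)
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.maximum(norms, 1e-12)

# 30 neurons in total (three 10-neuron maps); memorize one configuration.
W = np.zeros((30, 30))
config = np.zeros(30)
config[[2, 13, 27]] = 1.0          # one active neuron per map
W = hebbian_update(W, config)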
2.3
Recognizing / Perception
The brain is known for its chaotic behavior8,9. Because of this chaotic character, the brain can be seen as a huge system of oscillators, each
oscillator moving from one memorized configuration of activity to another in the absence of external stimuli. A new configuration of activity resulting from sensory inputs can force the system to stabilize in a previously memorized state. This stabilization can be called recognition/perception. This is consistent with the findings of many psychologists, who attest that the brain is stabilized by the external world (the problem of sensory deprivation). Some neurobiologists even say that the brain tries to construct a coherence when faced with the external world7. Unfortunately, the mathematics that handles chaotic systems is of no help here. However, if we use another property of cortical areas, namely the diffusion of excitation between neurons, we can resolve this problem. Recognition then simply becomes a matter of searching for the nearest configuration of activity: 1. If the input configuration to recognize is strictly identical to a memorized one, the problem is solved. 2. Otherwise, the input configuration needs to be paired with one of the previously memorized configurations. Let us suppose that we have to pair the input configuration of Fig. 5 with one of the two previously memorized configurations of Fig. 2:
Figure 5. Input configuration.
To carry out pairing in this case, the principle of diffusion of the activity towards the adjacent neurons, (preferably with attenuation) can be used:
Figure 6. Example of the first stage of diffusion.
By superimposing this new configuration onto the two initial configurations 'A' and 'B', it appears that it is configuration 'B' which "agrees" best with the required form. The input form and the stored form 'B' enter into "resonance" while form 'A' is inhibited. Thanks to the non-linearity of the KSOM, a step of diffusion is normalized for each map's space, so a step of diffusion is equivalent for each attribute.
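The diffusion-and-resonance pairing can be sketched as below: the input activity is spread towards adjacent neurons with attenuation, and the stored configuration with the largest overlap is returned. The kernel, the number of diffusion steps and the dot-product overlap are illustrative simplifications (the exact-match shortcut of case 1 is omitted).

import numpy as np
from scipy.ndimage import convolve1d

def recognize(input_config, memorized, steps=3, attenuation=0.5):
    """Pair an input configuration of activity with the nearest memorized one.

    input_config: dict mapping a map name to its 1-D activity vector.
    memorized: dict mapping an object label to such a dict of activity vectors.
    """
    kernel = np.array([attenuation, 1.0, attenuation])

    def diffuse(config):
        return {name: convolve1d(act.astype(float), kernel, mode="constant")
                for name, act in config.items()}

    spread = input_config
    for _ in range(steps):
        spread = diffuse(spread)
    scores = {label: sum(float(np.dot(spread[name], act))
                         for name, act in config.items())
              for label, config in memorized.items()}
    return max(scores, key=scores.get)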
3.
CONCLUSION
The system has been tested on a relatively simple problem of tracking an object in a video stream. However, this model is still more theoretical than operational at the moment. First experiments in object tracking show that the model has some promising properties. These properties are due to its 'biological-like' features:
• The approach is compatible with true multi-sensoriality and data fusion, because there is no a priori problem in mixing different maps from different sensor modalities. For example, the paradigm could be used for speech recognition using both sound and lip movements.
• The model could be tuned to operate (to resonate) even if there is missing data. Such a feature may be useful in vision to treat occlusion problems, but it may also be useful to simulate logical inference if we see this problem as a 'tuning' problem between facts.
• In a hardware implementation of this model, no single neuron is important by itself (there is no 'grandmother cell'), so a chip design of this model would be more resistant to local failure than more traditional models. In our model, the failure of one component is not catastrophic for the system; the sole consequence is a decrease in the quality of the discrimination.
REFERENCES
1. T. Kohonen, Self-Organizing Maps, Springer, Berlin, 1995.
2. G.P. Zhang, Neural Networks for Classification: A Survey, IEEE Trans. on Systems, Man and Cybernetics - Part C, vol. 30, no. 4, Nov. 2000, pp. 451-462.
3. R. Collobert, Large Scale Machine Learning, Thesis, Paris VI, 2004.
4. N. Franceschini, J.M. Pichon and C. Blanes, From Insect Vision to Robot Vision, Phil. Trans. R. Soc. Lond. B (1992) 337, 283-294.
5. R. Brooks, The Relationship between Matter and Life, Nature (2001), vol. 409.
6. D.O. Hebb, The Organization of Behavior: A Neurophysiological Model, Wiley, New York, 1949.
7. A. Berthoz, Le sens du mouvement, O. Jacob, 1997 (The Brain's Sense of Movement, Harvard University Press, 2000).
8. H. Korn and P. Faure, Is There Chaos in the Brain? II. Experimental Evidence and Related Models, C. R. Biologies 326, 787-840 (2003), Elsevier.
9. P. Faure and H. Korn, Is There Chaos in the Brain? I. Concepts of Nonlinear Dynamics and Methods of Investigation, C. R. Acad. Sci. Paris, Ser. III 324 (2001), 773-793.
NEW METHODS FOR SEGMENTATION OF IMAGES CONSIDERING THE HUMAN VISION PRINCIPLES
Andreas Kuleschow, Klaus Spinnler Fraunhofer Institute for Integrated Circuits, Am Wolfsmantel 33, 91058 Erlangen, Germany
Abstract:
Total control of diverse production processes, among them the production of parts with treated metallic surfaces, has recently become widespread. Such control tasks can be difficult and tedious for human inspectors. Automated inspection systems are the only ones which can provide high accuracy, speed and reliability. But to detect the correct objects, such systems have to take the features of human perception into account. We propose new methods of adaptive thresholding which have "human-like" features and are robust against the detection of artefacts in the image.
Key words:
automated inspection systems, “human - like” image segmentation.
1.
INTRODUCTION. KNOWN METHODS
The general problem of image segmentation is very hard, because an image can include many different textures and objects. A lot of methods have been introduced over the years for different applications1,2. In a simple case, if an object and the background can evidently be divided according to their brightness, the so-called thresholding algorithm can be applied successfully. This method sorts the pixels according to their brightness by means of a threshold, which can be calculated, for example, by analysis of the histogram of the image. If this histogram has a minimum between two maxima, the corresponding brightness can be used as the threshold.1,3 A few improved methods of global thresholding have been developed,4-6 but those methods did not yield essential advantages for complex images.
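A crude sketch of the histogram-minimum rule mentioned above: smooth the grey-level histogram, locate the two dominant peaks and threshold at the deepest valley between them. The smoothing width and the minimum peak separation are arbitrary illustrative values.

import numpy as np

def histogram_minimum_threshold(image, smooth=5, min_separation=10):
    """Global threshold at the deepest minimum between two histogram maxima."""
    hist, _ = np.histogram(image, bins=256, range=(0, 256))
    hist = np.convolve(hist, np.ones(smooth) / smooth, mode="same")
    order = np.argsort(hist)[::-1]                 # bins sorted by height
    p1 = order[0]
    p2 = next(p for p in order[1:] if abs(p - p1) > min_separation)
    lo, hi = sorted((int(p1), int(p2)))
    return lo + int(np.argmin(hist[lo:hi + 1]))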
At the same time another approach, so-called adaptive thresholding, has been developed. Chow and Kaneko7 suggested a method for obtaining a variable thresholding surface, where a specific threshold for each area of the image must be calculated. If the threshold in some areas cannot be calculated directly from the histogram, it must be extrapolated from other areas. This extrapolation was the most controversial point of this method; it produced a lot of artefacts and required a special checking procedure to suppress them. A new approach to adaptive thresholding was offered by Yanowitz and Bruckstein.8 They suggested finding the boundaries, calculating the threshold there and, instead of extrapolating the thresholds, calculating a so-called potential surface (a surface that satisfies the Laplace equation) which passes through those boundary thresholds. The authors found the boundaries by the maximum of the brightness gradient and then calculated the potential surface using a successive over-relaxation method with fixed brightness at the boundary points. But this procedure needs hundreds of iterations and is very time-consuming. In order to accelerate the computation the authors suggested an approximate solution on a diluted grid.
2.
THE IMPORTANT FEATURES OF HUMAN PERCEPTION AND A NEW METHOD OF IMAGE SEGMENTATION
In order to construct an improved inspection system which can replace a human inspector, we must respect the features of human perception, otherwise our system will probably detect other objects in the same image. One of these important features is the hierarchy of objects that the human eye detects. In accordance with the experiments described by Ran and Farvardin,9 the sequence of object importance is as follows: 1. Boundary (or "strong edge") 2. Texture (or "weak edge") 3. Background or smoothed area of the object. Another important feature of human perception is the nearly linear sensitivity of the eye in the bright region of "normal" lightness (the so-called Weber region).10 In other words, if the observer accepts an edge with a difference of 20 gray levels on a background lightness of 200 as noticeable, one must likewise accept an edge with a difference of 5 levels on a background of 50. And for the observer, the brightness difference at the boundary is more important than its sharpness. Considering all of these features, we can state the following principles of human-like image processing:
• We seek boundaries first; areas with no boundaries must be "compressed".
• We recognise an edge by lightness differences, but the gradient of lightness must also be integrated into the method.
• All these differences must be normalized to the neighbourhood.
• Each pixel is sorted by a logical operation: if we detect a closed boundary in the image, we assign to all pixels inside this boundary the value "object", independently of the actual brightness of those pixels.
Our new method includes the following steps (steps 1-3 are sketched in code below):
1. Calculation of the average lightness of the image in a small neighbourhood.
2. Comparison of the current lightness of the image with the averaged value; marking the pixels where the normalized difference exceeds the selected threshold.
3. Collection of all marked pixels into blobs (by 4-neighbourhood). All those pixels evidently lie not far from some edge.
4. Execution of the adaptive boundary searching procedure for every blob.
5. Recognition and removal of duplicates (no artefacts are possible).
The adaptive boundary searching procedure is as follows:
a) Select one marked pixel as the centre of a thresholding area (at first the top-left pixel of the blob).
b) Threshold the area and calculate the difference between "bright" and "dark" pixels.
c) If the contrast is high enough, find the corresponding edge.
d) Follow the edge up to the border of the thresholding area.
e) Select the last pixel which belongs to both the object's edge and the border of the thresholding area as the next thresholding area centre.
f) Repeat steps a)-e), except for looking for a corresponding edge, until one of the following events occurs: either we close the shape's boundary, or we reach the edge of the AOI (image), or we lose the boundary due to poor contrast or the impossibility of thresholding the area.
Duplicates can appear due to detecting the same edge from two different blobs. Different rules can be applied to select the "right" object: one can select the largest object, or the richest in contrast, and so on. Steps 1-2 provide the detection of edges primarily by differences in brightness, but the sharpness of the edge is already taken into account by means of the size of the averaging filter. To threshold the area we used different known methods. We tested the deepest minimum of the histogram method, the Otsu5 method and entropic thresholding;6 the best results were obtained with a modified deepest-minimum-of-histogram method.
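The sketch announced above covers steps 1-3 only; the window size, the threshold and the use of scipy's uniform filter and 4-connected labelling are illustrative choices, not the exact implementation.

import numpy as np
from scipy.ndimage import uniform_filter, label

def mark_edge_candidates(image, window=7, threshold=0.1):
    """Steps 1-3: mark pixels whose normalized deviation from the local
    average exceeds the threshold, and collect them into 4-connected blobs.

    The difference is normalized by the local average itself, mimicking the
    roughly linear (Weber) sensitivity of the eye: a 20-level step on a
    background of 200 counts as much as a 5-level step on a background of 50.
    """
    img = image.astype(np.float64)
    local_mean = uniform_filter(img, size=window)
    marked = np.abs(img - local_mean) / np.maximum(local_mean, 1.0) > threshold
    four_connected = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
    blobs, n_blobs = label(marked, structure=four_connected)
    return blobs, n_blobs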
3.
A COMPARISON BETWEEN ADAPTIVE THRESHOLDING METHODS
We have successfully used this method for the inspection of machined metallic surfaces. The goal of such an inspection is to find defects like scratches and caverns, which differ from the clean surface in brightness. Due to uneven illumination and reflection of the surface, substantial variations of the background brightness are possible.
Figure 1. Test image (an interior surface of the drilled hole).
An example of such an image is shown in Fig. 1. Its size is 480x400 pixels. The objects are caverns in the surface, which look like dark spots.
Figure 2. Results of the adaptive thresholding method after Yanowitz-Bruckstein. Left: original method; right: method improved according to human vision principles.
The results of the original Yanowitz-Bruckstein method are shown in Fig. 2 on the left. The processing time for the image was 7.3 s with 574 iterations (Pentium 3, 700 MHz).
One can see the enormous number of artefacts in the image (no suppression of artefacts was performed), although the objects are detected and the space around the objects is free from artefacts. We will explain this interesting occurrence later; first consider the results achieved with our method. Our method during processing is shown in Fig. 3 on the left. The thresholded dark areas are marked black and the bright regions are marked light grey. The boundaries are white. This picture is a good demonstration of the fact that only a small part of the whole image is completely processed; most of the pixels were processed with the fast steps (1-2) of the procedure only. The final results of our method are shown in Fig. 3 on the right. The boundaries of the dark spots are marked white. The processing time was 0.03 s (Pentium 3, 700 MHz).
Figure 3. The new method of adaptive thresholding. Left: under processing, right: final results.
The cause of the numerous artefacts of the Yanowitz-Bruckstein method is, in our estimation, the unfortunate way the thresholding surface is calculated. If the image contains only a few small defects, or none at all, the thresholding surface will everywhere correspond to the potential surface. A pixel can then be found under this surface with a probability of 50%, and in this way a great number of artefacts can be generated. Only around defects will this thresholding surface be pressed down, due to the fixed pixels at the boundary of the defects. Our proposal to improve the Yanowitz-Bruckstein method in accordance with the human vision principles is as follows: we calculate the potential surface directly; by pushing this surface down by a normalized distance from its original position we obtain the thresholding surface. The experimental results using the improved Yanowitz-Bruckstein method are shown in Fig. 2 on the right. The level of the thresholding surface was calculated here as 0.7 of the value of the potential surface (the objects to detect
were dark). A few artefacts are still found (no contrast checking was applied), but the picture appears good. Following Yanowitz and Bruckstein's recommendations we attempted to accelerate the calculations by sampling the image with a step of 3-5 pixels and calculating the thresholding surface on this diluted grid. In this way the processing time can be reduced to 0.1-0.2 s with satisfactory thresholding quality. But this method remains slower than our new method.
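A sketch of the improved thresholding-surface construction, assuming the boundary pixels have already been found: the potential (Laplace) surface is relaxed with the boundary grey levels held fixed and then lowered to a fraction of its value (0.7 in the experiment above, for dark objects). The wrap-around border handling and the fixed iteration count are simplifications.

import numpy as np

def thresholding_surface(image, edge_mask, factor=0.7, iterations=500, omega=1.5):
    """Potential surface by successive over-relaxation, pushed down by `factor`.

    image: grey-level image; edge_mask: boolean array marking boundary pixels
    whose grey levels stay fixed.  A pixel is classified as a dark object
    wherever image < returned surface.
    """
    surf = image.astype(np.float64).copy()
    fixed = edge_mask.astype(bool)
    for _ in range(iterations):
        neighbours = 0.25 * (np.roll(surf, 1, 0) + np.roll(surf, -1, 0) +
                             np.roll(surf, 1, 1) + np.roll(surf, -1, 1))
        relaxed = surf + omega * (neighbours - surf)
        surf = np.where(fixed, surf, relaxed)
    return factor * surf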
4.
CONCLUSION
Two newly proposed methods for adaptive thresholding were implemented and tested. The basic advantage of our new method of adaptive thresholding is the high speed of calculation, due to the local application of the complex second stage of the algorithm. The basic advantage of the improved Yanowitz-Bruckstein method is its simple implementation. Both methods produce final images which are very similar to a human estimation of the original image.
REFERENCES
1. R.M. Haralick and L.G. Shapiro, Image Segmentation Techniques, Computer Vision, Graphics, and Image Processing, 29, 100-132 (1985).
2. N.R. Pal and S.K. Pal, A Review on Image Segmentation Techniques, Pattern Recognition, 26(9), 1277-1294 (1993).
3. Kenneth R. Castleman, Digital Image Processing, Prentice-Hall Inc., 1996, pp. 75-76.
4. Ralf Kohler, A Segmentation System Based on Thresholding, Comput. Vision Graphics Image Process., 15, 319-338 (1981).
5. N. Otsu, A Threshold Selection Method from Gray Level Histograms, IEEE Trans. Syst. Man Cybern., 9, 62-66 (1979).
6. N.R. Pal and S.K. Pal, Entropic Thresholding, Signal Processing, 16, 97-108 (1989).
7. C.K. Chow and T. Kaneko, Automatic Boundary Detection of the Left Ventricle from Cineangiograms, Comput. Biomed. Res., 5, 388-410 (1972).
8. S.D. Yanowitz and A.M. Bruckstein, A New Method for Image Segmentation, Comput. Vision Graphics Image Process., 46, 82-95 (1989).
9. Xiaonong Ran and Nariman Farvardin, A Perceptually Motivated Three-Component Image Model - Part I: Description of the Model, IEEE Trans. on Image Processing, 4, 401-415 (1995).
10. S.K. Pal and N.R. Pal, Segmentation Based on Measures of Contrast, Homogeneity, and Region Size, IEEE Trans. on Systems, Man and Cybernetics, SMC-17, 857-868 (1987).
INTERACTIVE CHARACTER ANIMATION IN VIRTUAL ENVIRONMENTS
Paweł Cichocki, Janusz Rzeszut Institute of Computer Science, Faculty of Electronics and Information Technology, Warsaw University of Technology
Abstract:
This paper presents a method for interactively animating characters that travel over uneven virtual terrain. The method is based on algorithmically generating limb and pelvis trajectories and animating the skeleton with a custom Inverse Kinematics algorithm. Simple but natural-looking and highly customizable models for the feet and pelvis trajectories are proposed. Fluent blending of different motions is possible with the proposed approach via linear parameter blending. The algorithm is fast, has very low memory requirements and needs no expensive preprocessing.
Key words:
Animation; Computer graphics; Inverse Kinematics; Legged locomotion.
1.
INTRODUCTION
Our method generates foot trajectories from a set of parameters and from the local characteristics of the landscape, such as slope (gradient) or height, at the points where the feet are to be planted or moved across. The parameters of the landscape are considered as the animated character takes each step, which means that the landscape can even be modified as the character walks (e.g. obstacles can change location). Besides planning the step, we model the foot trajectory during the swing phase. The pelvis trajectory is then also affected by the way the feet are placed on the terrain. The parameters of the Inverse Kinematics (IK) algorithm provide additional control over the look of the animation.
2.
THE STEP PLANNING ALGORITHM
Figure 1. Modifying step with slope.
Let e be the position of the foot as it hits the ground, b the foot takeoff position, d the movement direction vector and c a point Lowness units below the pelvis. Our step planning algorithm works in 8 phases:
1. Calculate the horizontal component of a first-guess position, p0 = c + StepSize·d
2. Get the land altitude at pi (the vertical component of ei)
3. Adjust p (calculate pi+1) based on the land altitude from phase 2
4. Repeat phases 2-3 a few times (6 times in our demo)
5. Check whether the terrain under the chosen spot has the right steepness
6. If not, try to change p and repeat phase 5
7. Sample the terrain height under the step line
8. Adjust StepHeight for the upcoming step
The length of a step, measured as ‖e − b‖, changes only slightly with the gradient. However, ‖c − p‖, as pictured in Fig. 1, being the horizontal component of c − e, does change significantly. Modification of the foot landing position p is done by intersecting a line approximating the terrain across the step line with an ellipse. Let us consider a two-dimensional cross-section of the problem in the direction of movement d. An ellipse centered at c, a units wide and b units tall, and the line l(u) = c + (h − c)u intersect at
u = \pm\left[\left(\frac{x_h - x_c}{a}\right)^2 + \left(\frac{y_h - y_c}{b}\right)^2\right]^{-1/2}    (1)
The width of the ellipse is a = StepLength, and the height b is equal to either VertStepScale or VertStepScale·DownSpeedUp, depending on whether
the step is upwards or downwards. The vertical component of the solution is replaced by the land altitude at that point. The above process is repeated a few times; currently our demo program does 6 iterations. Phases 5-6 are relatively simple: the foot landing position is moved forward a little (phase 6) and the gradient measurement is repeated (back to phase 5). This is done either until a suitable spot for planting the foot is found or until the maximum number of iterations has been reached. Moving the position forward has the nice side effect that the character tries to take a big step over obstacles if possible. Phases 7 and 8 try to make sure that the foot does not hit the ground as it swings. Currently the height (altitude) of 3 points along the line connecting b and e (the step line) is sampled. If any of those sampled altitudes is above the line by more than a certain fraction of StepHeight, then StepHeight is increased by the difference. If none of the sampled altitudes is above the step line, then StepHeight defaults to DefStepHeight.
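Phases 1-4 of the step planner can be sketched as follows; the terrain callback, the vector conventions (y up) and the parameter names are illustrative, and the two-dimensional cross-section of Eq. (1) is approximated with the horizontal and vertical offsets of the current guess.

import numpy as np

def plan_landing(c, d, terrain_height, step_length, vert_scale,
                 down_speed_up=1.5, iterations=6):
    """Iteratively refine the foot landing position (phases 1-4).

    c: point Lowness units below the pelvis; d: unit movement direction
    (horizontal); terrain_height(x, z): land altitude at a horizontal position.
    Each iteration intersects the line from c towards the current guess h with
    the ellipse of Eq. (1) and replaces the vertical component by the terrain
    altitude at the new point.
    """
    c, d = np.asarray(c, float), np.asarray(d, float)
    up = np.array([0.0, 1.0, 0.0])
    p = c + step_length * d                                  # first guess
    p[1] = terrain_height(p[0], p[2])
    for _ in range(iterations):
        h = p
        dx = np.linalg.norm((h - c) - ((h - c) @ up) * up)   # horizontal offset
        dy = h[1] - c[1]                                     # vertical offset
        b = vert_scale if dy >= 0 else vert_scale * down_speed_up
        u = ((dx / step_length) ** 2 + (dy / b) ** 2) ** -0.5   # Eq. (1), + root
        p = c + (h - c) * u
        p[1] = terrain_height(p[0], p[2])
    return p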
3.
THE FOOT AND PELVIS TRAJECTORIES
Foot animation can be divided into two phases: the support phase and the swing (air) phase. In the support phase the foot does not move relative to the ground, but it does move relative to the pelvis; if we know the speed of the pelvis it is enough to subtract it from the relative foot position in each frame. Suppose the trajectory of the foot during the swing is described by a foot trajectory function f(t). However, f(t), even if it has the right shape, might not have the appropriate timing. We can adjust the timing by introducing an intermediate variable s = g(t), so the foot trajectory is described by f(s) = f(g(t)). The function g(t) is a time modeling function. The foot moves above the straight line connecting the point b where the foot starts the swing and the point e where it lands. We modeled the height of the foot above that line as a sine wave series
f(s) = b·s + e·(1 − s) + v·u·( q·sin(sπ) + (1 − q)·( w·sin(2sπ) + (1 − w)·sin(3sπ) ) ),   (2)
where u is the up vector and v, q, w are the StepHeight, StepBase and StepTwist parameters respectively. The simplest time modeling function would be
g(t) = ((t − TimeOffset) mod LoopTime) / AirTime,   (3)
where LoopTime is the parameter describing the total gait cycle duration, TimeOffset is the parameter controlling the phase shift of the given foot (for biped characters this will usually be 0 for one foot and ½·LoopTime for the other), and AirTime controls how much time the foot spends in the air phase (how long the swing takes). We found that the function gb(t) = (½ − ½·cos(g(t)))·S + g(t)·(1 − S), where S is the SwingSoftness parameter, works better. The pelvis trajectory needs to be synchronized with the feet. It also needs to take into account such parameters as gravity. In our demo the horizontal speed of the pelvis in the direction of movement is calculated as
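A small Python sketch of the foot trajectory and timing functions follows, using the reconstructions of (2) and (3) above; the parameter and function names are ours, and g_soft follows the printed gb(t) expression literally.

```python
import math

def g(t, loop_time, air_time, time_offset=0.0):
    """Basic time modeling function, equation (3)."""
    return ((t - time_offset) % loop_time) / air_time

def g_soft(t, loop_time, air_time, swing_softness, time_offset=0.0):
    """Softened time modeling function gb(t) described in the text."""
    s = g(t, loop_time, air_time, time_offset)
    return (0.5 - 0.5 * math.cos(s)) * swing_softness + s * (1.0 - swing_softness)

def foot_position(s, b, e, up, step_height, step_base, step_twist):
    """Foot position during the swing, following equation (2): a point on
    the step line plus a sine-series offset along the up vector.
    Note that, as printed, the blend runs from e at s = 0 to b at s = 1.
    b, e and up are 3-tuples."""
    h = step_height * (step_base * math.sin(s * math.pi)
                       + (1.0 - step_base) * (step_twist * math.sin(2.0 * s * math.pi)
                                              + (1.0 - step_twist) * math.sin(3.0 * s * math.pi)))
    return tuple(bi * s + ei * (1.0 - s) + ui * h for bi, ei, ui in zip(b, e, up))

# Foot position in mid-swing over a unit step along x.
print(foot_position(0.5, (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0),
                    step_height=0.3, step_base=0.7, step_twist=0.5))
```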
v_pd = (4/3) · (a + Behindness) · TimeStep / LoopTime,   (4)
where a is the horizontal foot advancement in the direction of movement introduced by the last calculated swing. In our demo the height of the pelvis depends on the height of the lowest foot and on the average foot height. The parameter blending between those two is called PelvisLaziness. If it is 0, the pelvis height is determined entirely by the average foot height; if it is 1, it depends entirely on the minimum foot height. For swinging feet the height of the step line (the line connecting b and e) below the current foot position is taken (without the offset described by formula (2)). We modeled the horizontal pelvis displacement with a sine wave whose period equals LoopTime. The amplitude is equal to the x component of the PelvisAmplitude parameter and the phase is controlled by the x component of the PelvisOffset parameter. The vertical pelvis displacement offset was modeled with
A_y · sin(2π(t + O_y)/LoopTime) · (1 − P) + A_y · sin(4π(t + O_y)/LoopTime) · P,   (5)
where P stands for PelvisSmoothness, A for PelvisAmplitude and O for PelvisOffset. We also introduced a factor displacing the pelvis position in the direction of the movement by a sine wave with amplitude and phase shift equal to the z components of the PelvisAmplitude and PelvisOffset vector parameters respectively.
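The pelvis terms can be sketched the same way; the snippet below follows the reconstructions of (4) and (5) given above, so the 4/3 prefactor, the plus sign in (a + Behindness) and the use of the y components of PelvisAmplitude and PelvisOffset are assumptions carried over from that reconstruction.

```python
import math

def pelvis_forward_speed(a, behindness, time_step, loop_time):
    """Horizontal pelvis speed in the movement direction, equation (4)."""
    return (4.0 / 3.0) * (a + behindness) * time_step / loop_time

def pelvis_vertical_offset(t, amplitude_y, offset_y, loop_time, smoothness):
    """Vertical pelvis displacement, equation (5): a sine with period
    LoopTime blended with its double-frequency counterpart by
    PelvisSmoothness."""
    phase = 2.0 * math.pi * (t + offset_y) / loop_time
    return (amplitude_y * math.sin(phase) * (1.0 - smoothness)
            + amplitude_y * math.sin(2.0 * phase) * smoothness)
```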
4. THE IK ALGORITHM
Our starting point was the Cyclic Coordinate Descent (CCD) algorithm, introduced in 1991 by Wang and Chen2. A popular modification was introduced by Welman3 in 1993. Unfortunately, using those approaches did not produce fluent motion. Our biggest concern with CCD was the oscillation of the system around the solution. We will describe our algorithm for 1 degree of freedom (DOF) rotational joints. All joints in our system are 1 DOF rotational joints or can be decomposed into a few 1 DOF rotational joints.
Figure 2. A hierarchical system.
Consider a hierarchical system composed of N segments S1…SN. Each segment is defined by a translation vector li, a rotation axis ri and a rotation angle θi. The end of segment Si, pi, is the end of the previous segment (pi−1) translated by li and rotated relative to the previous segment (Si−1) by θi around ri. Let g denote the goal the end effector pN is to reach. Each joint defines a bone space matrix Mi. The matrix Ri is the same as matrix Mi without translation (rotation matrix). Generally Mi = Mi−1·Mrot(ri, θi)·Mtrans(li) and pi = Mi[0,0,0,1]^T. Let us introduce auxiliary vectors ti = pN − pi, ei = Ri ri × ti, and di = g − pi. If ri is a unit vector, our algorithm is to modify all angles by
dθ_i = s_i (e_i · d_i) / (‖e_i‖^2 ‖d_i‖ + c),   (6)
where c is a small constant and si is a parameter for each joint. The derivation of (6) can be found in Cichocki4.
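A rough Python sketch of one sweep of this update is shown below. It uses the reconstruction of (6) given above (so the exact denominator is an assumption), expects world-space joint data to be supplied directly, and, unlike a full solver, does not re-run forward kinematics after each angle change.

```python
import numpy as np

def angle_updates(p, axes, goal, s=0.5, c=1e-4):
    """Per-joint angle increments of equation (6) for a chain of 1 DOF
    rotational joints.  p[i] is the end point of segment i (numpy array),
    p[-1] is the end effector, axes[i] is the world-space rotation axis
    R_i r_i of joint i.  A full implementation would apply each increment
    and recompute the positions; this sketch only evaluates the formula."""
    p_end = p[-1]
    deltas = []
    for p_i, r_i in zip(p, axes):
        t = p_end - p_i                  # t_i = p_N - p_i
        d = goal - p_i                   # d_i = g - p_i
        e = np.cross(r_i, t)             # e_i = (R_i r_i) x t_i
        deltas.append(s * np.dot(e, d) /
                      (np.dot(e, e) * np.linalg.norm(d) + c))
    return deltas

# Two-segment planar arm, rotation axes along z.
# (The last increment is zero since t_N = 0 by definition.)
p = [np.array([1.0, 0.0, 0.0]), np.array([2.0, 0.0, 0.0])]
axes = [np.array([0.0, 0.0, 1.0])] * 2
print(angle_updates(p, axes, goal=np.array([1.8, 0.8, 0.0])))
```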
5. RESULTS
We implemented the method described above. A simple height-map based terrain was used as an example environment. As partly illustrated by the figures below, the proposed model animates characters well not only when traveling across flat surfaces, but also when moving up and down slopes, along various kinds of stairs (also traversing them), across slopes (one foot higher than the other), and across terrain where some places are not suitable for foot placement.
Figure 3.a) Running across a ramp b) Walking across a long stairs-pyramid.
REFERENCES
1. S.-K. Chung, J. K. Hahn, Animation of Human Walking in Virtual Environments, Computer Animation Proceedings, 1999.
2. L.-C. T. Wang, C. C. Chen, A Combined Optimization Method for Solving the Inverse Kinematics Problem of Mechanical Manipulators, IEEE Transactions on Robotics and Automation, vol. 7, no. 4, 1991, pp. 489-499.
3. Ch. Welman, Inverse Kinematics and Geometric Constraints for Articulated Figure Manipulation, Masters Thesis, Simon Fraser University, 1993.
4. P. Cichocki, Algorithms for Real-Time Character Animation in Virtual Environments, Bachelor Thesis, Warsaw University of Technology, 2003.
5. M. Gleicher, Retargeting Motion to New Characters, SIGGRAPH, 1998, pp. 33-42.
6. J. Lander, Oh My God I Inverted Kine: Inverse Kinematics for Real-Time Games, Game Developer Magazine, September 1998.
EMPATHIC AVATARS IN VRML FOR CULTURAL HERITAGE
STANEK S. Comenius University, SK-842 48 Bratislava, Slovakia
Abstract:
We introduce medium precision avatars in VRML worlds for cultural heritage presented on-line9. Avatars are integrated into the UI, or they appear in the cyber city at places where interesting stories can be told to the user9. Avatars in medium distance communication use empathy to deliver the message better7. We propose a new "delayed mirror" metaphor for empathic communication in virtual reality for cultural heritage8,9. It combines level of detail, real-time interaction and digital storytelling. The avatar's face structure is based on Perlin's minimal face2. This precision is intended for outdoor scenes and medium distance communication. However, the facial expressions of a virtual tourist can be captured by a cheap hardware set-up consisting of a pair of web cams7, and employed for improving the empathy. For body motions and deformations in VRML we use the H-Anim standard5.
Key words:
empathic avatars, Perlin’s face, H-Anim, cultural heritage, digital storytelling, virtual environment, MPEG-4
1. INTRODUCTION
In virtual environments there is still a need to deliver information to the user in a suitable way. We have proposed and designed avatars for better communication in virtual environments for cultural heritage. Our implementation is used in an already finished project dealing with cultural heritage8,9 (Fig. 1). In this project selected historic objects in Bratislava, the capital of Slovakia, are reconstructed and published on-line, and avatars tell facts and stories about them. The VRML avatars (city guides) are as empathic as possible at this moment. Our idea is to combine levels of detail,
real-time interaction and digital storytelling. In this paper we present in more detail the medium precision avatars based on Perlin's face2. This precision is intended for outdoor scenes and middle distance communication (middle length stories). We utilize the data obtained with feature tracking. The facial expressions of a virtual tourist (user) can be captured by a cheap hardware set-up consisting of a pair of web cams7, and employed for improving the empathy of communication with other users or with empathic avatars that can adapt their empathic reactions according to the user's facial expressions and head movements. In the communication between user and empathic avatar we understand empathy as a delayed mirroring of well-recognized user emotions. In this way the avatar acts as a "delayed mirror", and the reflected expressions have about half the weight. The emotionality of a story is given in advance and is combined with the measured head movements and facial expressions of the virtual tourist in real time. The stories are delivered both in verbal and nonverbal form.
Figure 1. Some examples from virtual-guided tour in VRML world at www.VHCE.info.
The paper is structured as follows. First, we document the medium precision avatar design. Then the export or creation of avatars for the VRML world and a few ideas about city navigation using an empathic avatar are described. Integration with the feature capturing system is briefly presented in the fourth section. Finally, some conclusions are given.
2. EMPATHIC AVATAR
Fig. 2a shows our experimental implementation (Avatar Toolkit) for authoring empathic avatars. The tool supports the creation of empathic avatars: it deals not only with face definition, but with the full body structure definition. This implementation is still in progress and currently we create relatively simple VRML avatars with minimal face complexity (as defined by Ken Perlin2). However, that can be useful for creating the many expressions needed for empathic communication. The implementation can now export created models to a VRML file supporting the defined functionality of the avatar. Empathy and functionality are created using expression changes that can be scripted depending on the storytelling timeline. We use predefined expressions and changes that are manually created or captured using the capturing system for head and facial movements.
Figure 2. Left (a): screenshot of Avatar Toolkit for creation and visualization of an empathic avatar; right (b): facial expression modification (the images show the shaded model of an avatar that can be imported from an XML file defining its appearance and functionality).
Fig. 2b shows examples of facial expressions created with simple segment deformations. The avatar model is imported from an XML file that defines its structure, possible deformations and functionality. It uses a segmented structure to define the avatar's visual look and functionality. It also has integrated Perlin noise2 to make the avatar more believable. The avatar model is a hierarchical segmented structure defined by joints and segments as described in the H-Anim standard representation for humanoids5. So every segment has a defined position and possible transformations through its joint, and also deformations that are used to create expressions of the segment. Such a structure is then easily exported to a VRML file that becomes part of a virtual world. Motion of this structure is achieved by applying any H-Anim motion. We can also combine these motions, divided into at least two layers: expressions and emotions. The first layer of motions and
emotions comes from the story and is divided into two sub-layers: mouth deformations and any other story emotion. The second layer is defined by the environment in which the storytelling avatar actually is (position of the user in the virtual world, position of the object within the story). Captured and recognized expressions of the user also belong to the second layer (see the last section). Currently we are focusing on creating facial expressions and head movements in real time for the virtual environment. Facial expressions are achieved using predefined deformations of the avatar's head model. The deformations are able to create the basic expressions defined in MPEG-41. At this moment the implementation is not fully compatible with the MPEG-4 standard for facial expression definition, but it is prepared so that it can easily be extended to full MPEG-4 compatibility. As we want to tell the story in an empathic way, it is also necessary to synchronize the avatar's lips with its voice (the audio that contains the story), so that it is also visually acceptable. For this we have defined a mouth deformation for every phoneme (at least those defined in MPEG-4), i.e. deformations of the segments defining the mouth and of segments related to the mouth. This way, having a textual representation of the story, we have information about the mouth deformation sequence that has to be synchronized with the storytelling audio. Marks are defined in the storytelling timeline for this synchronization. The definition of these marks is at this moment done manually with only simple automation, but it will become automatic with the integration of the system for feature capturing.
3. AVATAR IN VRML ENVIRONMENT
Fig. 3 (left) shows an avatar imported into a VRML environment within a virtual city. Some segments defining parts of the avatar's body can be replaced so that the avatar looks more human-like (see Fig. 1, Fig. 3 (right)). In this context, the avatar acts as a part of the user interface: if the user moves in the virtual world, the avatar is visible all the time, as are the buttons. The exported avatar model satisfies the definitions of the H-Anim standard, and we made simple extensions to it for our purposes. For example, we introduce a child node for the Joint node that defines audio. This way the avatar's speech comes from its mouth and spatial audio delivery is also achieved. It is mainly used in the second way of avatar integration in the virtual environment (if the avatar is far away you are not able to hear it). This functionality comes from the sound specification in VRML. To achieve synchronization in the virtual environment we have created VRML prototypes for Timeline, TimelineAction and ActionEvent. TimelineAction-s are executed according to the actual time fraction in the
Timeline. Each TimelineAction has a group of ActionEvent-s that define the start and weight of structure expressions. This way the change of structure state in time is defined (mouth deformation according to speech is also achieved this way). The export from Avatar Toolkit creates instances of these prototypes. The prototypes use the advantages of TimeSensor-s and ROUTE-s that are defined in the VRML specification.
Figure 3. Integration of virtual environment and Avatar as a part of UI.
As an example of integration of our avatar as a part of the UI, we created a prototype for a guided tour (GuidedTour) in the virtual environment (Fig. 1, Fig. 3 (right)). The user can switch between viewpoints, start/stop storytelling, switch between languages if possible, and turn on/off the function for natural behaviour of the avatar (Perlin noise). All that the author needs to define are: the possible viewpoints (accessible by pressing numbered buttons), the Storyteller (its audio, speech, emotions, and synchronization), and the guided tour viewpoints and their timings that define the viewpoint changes during the storytelling. All possible facial, head and body emotions are prepared. So defining the main look and feel of a guided tour consists of describing and defining timings for viewpoints during the guided tour, and also timings for emotions and subtitles for the story. Audio with speech and its synchronization with the lips are also pre-processed for the guided tour.
4. INTEGRATION WITH CAPTURING SYSTEM
Most of the time-dependent changes must be defined manually at the moment. After integration with the system for feature capturing, most of this tedious work will be automated. Using the capturing system, not only audio recording but also capturing of the expressions of a professional actor will be done, and the captured information will be used in the Avatar Toolkit to define
expressions and their timing throughout the story. As an advanced mode there will be the possibility of capturing the user's expressions and head movements in real time, and the information obtained will be used by the storytelling avatar with our "delayed mirror" metaphor to introduce empathy into the communication. Captured information can also be used for extending the UI for navigation or application property control6. For future work we want to add to the Avatar Toolkit the possibility to define the visual look of the avatar according to stereo images of the user. This will give the user an opportunity to have a visually corresponding avatar in the virtual environment. With a multi-user environment and real-time feature tracking, empathic communication between two users with a look similar to reality will also be possible.
5. CONCLUSION
We described a few novel ideas for empathic avatars for cultural heritage and cyber city applications. Cultural monuments have very exciting stories, but there is no need to present the full-length stories all the time. So, similarly to the distance which gives the precision of empathic communication in the ongoing Virtual Bratislava, we distinguish among three levels of detail for stories. There is no need for a distant storyteller to communicate with full precision. Whenever the distance and the measurable interest get closer, we increase the precision of facial expressions and digital stories7. Integration of the Avatar Toolkit with the feature capturing system will give us great power for delivering information to the user and for storytelling. This way of storytelling is also very suitable for some handicapped people. Future work includes low and high precision (MPEG-4 compatible) avatars.
ACKNOWLEDGEMENTS This work is partially supported by APVT grant No. 20-025502.
REFERENCES
1. Abrantes, G., Pereira, F., 1999, MPEG-4 Facial Animation Technology: Survey, Implementation, Results, IEEE CSVT vol. 9, no. 2, pp. 290-305, 1999.
2. Perlin, K. 2003. Face demo applet. http://mrl.nyu.edu/~perlin/facedemo
3. Qvortrup, L. ed. 2002. Virtual Interaction: Interaction in Virtual Inhabited 3D Worlds. London Berlin Heidelberg, Springer-Verlag, ISBN 1-85233-516-5.
4. Qvortrup, L. ed. 2001. Virtual Space: Spatiality in Virtual Inhabited 3D Worlds. London Berlin Heidelberg, Springer-Verlag, ISBN 1-85233-331-6.
5. H-Anim, http://www.h-anim.org/
6. Stanek, S., Ferko, A., 2002. Navigation and Interaction in Cyber Cities: Head Motions and Facial Expressions, pp. 75-78, Proc. SCG 2002, Bratislava, Slovak University of Technology 2002, ISBN 80-227-1773-8.
7. Stanek, S., Ferko, A., Kubini, P. 2003. Real-time Virtual Storytelling for Augmented Cultural Heritage: Message & Empathy, pp. 45-46 in Proc. of The First Research Workshop on Augmented VR, Geneva, MIRALab 2003.
8. Ferko, A. et al. 2004. Virtual Heart of Central Europe. CORP 2004. www.corp.at. Vienna: TU Wien 2004.
9. International EU project: Virtual Heart of Central Europe, www.vhce.info
CLOSED FORM SOLUTION FOR C2 ORIENTATION INTERPOLATION
Vasily Volkov and Ling Li
Curtin University of Technology, Perth, Australia
@cs.curtin.edu.au
Abstract:
We present a simple closed form solution for the C2 smooth quaternion interpolation problem. In contrast to other methods, our approach does not rely on cubic B-spline blending functions, which require the solution of a nonlinear tridiagonal system. Instead, we propose using a C2 interpolatory (cardinal) basis. Our method outperforms all alternatives and, being explicit, is absolutely stable.
Key words:
orientation, quaternions, interpolation, splines, blending functions.
1. INTRODUCTION
Orientation interpolation is an important problem in computer graphics, robotics and aerospace craft navigation. In computer graphics it is used, for example, for camera control and key-frame animation. Common representations of orientation1 are 3 × 3 orthogonal matrices, the group SO(3) = {R: R^T R = I, det(R) = 1}, and unit quaternions, the group S3 = {q: q ∈ R4, ||q|| = 1}. Quaternions are generally faster, intuitively easier and have become a standard tool in computer graphics2. We use them throughout this paper. Interpolation is the reconstruction of a continuous function from discrete samples. Given n sample values qi ∈ S3 and knots ti ∈ R, the function q: R → S3 such that q(ti) = qi is to be built. We are especially interested in functions of class C2[t1,tn], i.e. functions having a continuous second derivative. The second derivative in our case has the meaning of angular acceleration for the motion along the path q(t). In this paper we limit our attention to the uniform grid ti = i. Extension to a non-uniform grid is straightforward.
Many interpolation methods can be written in the form of convolution:
q(t) = Σ_i c_i K(t; i),   (1)
where K(·; i): R → R is the spatially varying reconstruction kernel, often taken piecewise-polynomial in practice for the sake of simplicity and efficiency. When performing the weighted averaging (1) in S3, we must take into account that this subset of R4 is not closed under linear superposition of its elements. In this case, equation (1) should be understood in different terms than a sum of scaled vectors. A few appropriate methods were proposed3,4,5. They are all nonlinear and some are even iterative5. The kernel K(t;i) determines the nature of the solution q(t). If the weighted averaging method is C∞, which is usually the case, then the solution completely inherits the smoothness of the kernel. If K(t;i) is interpolating (often called cardinal), i.e. it satisfies the Kronecker-delta property
K(j; i) = δ_ij = 1 for i = j, 0 otherwise,   (2)
then it is enough to put ci = qi in (1) to ensure interpolation q(i) = qi. Otherwise, a system of equations for ci must be solved. This is the case when smooth B-spline basis functions are used as the reconstruction kernel4,5,6. Though for interpolation in Euclidean space the resulting system is linear and easily solved6, in S3 it becomes non-linear and has to be solved iteratively5,7,8. These iterations do not necessarily converge7. In this paper we propose a C2 smooth interpolating kernel for use in quaternion interpolation instead of the B-spline basis. This kernel is introduced in the following chapter. Chapter 3 describes all relevant details of the Kim-Kim-Shin method of weighted averaging in S3. Results and performance are presented in Chapter 4. Chapter 5 concludes this paper.
2. C2 INTERPOLATING KERNEL
Our goal is to find C2 continuous piecewise-polynomial functions K(t;i) which satisfy K(j;i) = δ_ij for i,j = 1..n. A C2 piecewise-polynomial defined on knots 1, 2, …, n is a function equivalent to an m-degree polynomial on every interval [j, j+1], j = 1..n−1:
K(t; i) ≡ K_j(t; i) = Σ_{k=0..m} a_k^{i,j} (t − j)^k for t ∈ [j, j+1],
while the polynomials join in a C2 continuous manner, i.e. the following relations must hold for j = 2..n−2:
K_{j−1}(j; i) = K_j(j; i),   (3)
(d/dt) K_{j−1}(t; i)|_{t=j} = (d/dt) K_j(t; i)|_{t=j},   (4)
(d^2/dt^2) K_{j−1}(t; i)|_{t=j} = (d^2/dt^2) K_j(t; i)|_{t=j}.   (5)
Figure 1. Interpolating kernel K(t;i) built for 12 knots. Note the shift-invariance of the functions in the middle and their warp near the boundaries.
Equations (4) and (5) produce a linear algebraic system for the coefficients a_k^{i,j}. Such a system has to be solved when interpolating with C2 cubic splines6. It can be avoided by introducing additional constraints on the kernel functions9. The first and second derivatives of K(t;i) are required to agree with their numerical estimations based on finite differences (which are, in turn, based on the Taylor expansion):
(d/dt) K(t; i)|_{t=j} = (K(j+1; i) − K(j−1; i)) / 2,   (6)
(d^2/dt^2) K(t; i)|_{t=j} = K(j+1; i) − 2K(j; i) + K(j−1; i),   (7)
where the central difference in (6) is preferred to the one-sided one in favor of symmetry. Since the values of K(t;i) at the knots are given in (2), equations (3), (6), (7) define six constraints on every Kj(t;i) for j = 2..n−2 and four on K1(t;i) and Kn−1(t;i). Hence, Kj(t;i) can be found as a quintic polynomial for j = 2..n−2 and as a cubic for j = 1 and j = n−1. Derivations are straightforward and can easily be done with pencil and paper. The resulting functions K(t;i) for i = 4..n−3 are shift-invariant and can be compactly represented as
K(i+s; i) = −3|s|^5 + 7.5|s|^4 − 4.5|s|^3 − |s|^2 + 1, for |s| ≤ 1,
K(i+s; i) = |s|^5 − 7.5|s|^4 + 21.5|s|^3 − 29|s|^2 + 18|s| − 4, for 1 < |s| ≤ 2,
K(i+s; i) = 0 elsewhere.
The resulting function set is illustrated in Fig. 1. These functions have two properties that are important hereinafter. First, the support of K(t;i), i.e. the closure of the set where it is nonzero, is finite and short: it is [i−2, i+2] ∩ [1,n]. Also, these functions form a partition of unity on [1,n], i.e.
Σ_{i=1..n} K(t; i) ≡ 1 for every t ∈ [1,n].
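A direct Python transcription of the shift-invariant kernel (interior knots only) is given below, together with quick numerical checks of the Kronecker-delta and partition-of-unity properties; it is a sketch based on the coefficients reconstructed above, and the warped boundary pieces are omitted.

```python
def kernel(t, i):
    """Shift-invariant C2 interpolating kernel K(t; i) for interior knots."""
    s = abs(t - i)
    if s <= 1.0:
        return -3*s**5 + 7.5*s**4 - 4.5*s**3 - s**2 + 1.0
    if s <= 2.0:
        return s**5 - 7.5*s**4 + 21.5*s**3 - 29.0*s**2 + 18.0*s - 4.0
    return 0.0

# Kronecker-delta property and partition of unity, checked numerically.
assert abs(kernel(5, 5) - 1.0) < 1e-9 and abs(kernel(6, 5)) < 1e-9
assert abs(sum(kernel(5.3, i) for i in range(3, 9)) - 1.0) < 1e-9
```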
3. WEIGHTED AVERAGING IN S3
Given unit quaternions q_i ∈ S3 and weights K(t;i) ∈ R for i = 1..n, the following closed form for C∞ weighted averaging of quaternions is proposed by Kim et al.4:
q(t) = q_1^{B_1(t)} ∏_{i=2..n} (q_{i−1}^{−1} q_i)^{B_i(t)},   (8)
B_i(t) = Σ_{j=i..n} K(t; j).   (9)
B_i(t) is called the cumulative basis by the authors. A quaternion is raised to a power as q^t = exp(t log q), where the logarithm and exponential are:
log q = log(cos θ, v sin θ) = (0, vθ) and exp(0, vθ) = (cos θ, v sin θ).
Here the representation of quaternions in the form q = (cos θ, v sin θ), θ ∈ R, v ∈ R3, ||v|| = 1 was used. To make the logarithm function
unambiguous, the multi-valued θ is restricted to the interval [0, π], where it is unique. As was mentioned in the previous chapter, K(t;i) is a partition of unity and has support not wider than [i−2, i+2], hence
For i ≥ ⌈t⌉ + 2:  B_i(t) = Σ_{j=i..n} K(t; j) = 0, and
For i ≤ ⌊t⌋ − 1:  B_i(t) = Σ_{j=i..n} K(t; j) = 1.
Therefore, for t ∈ [i, i+1], 2 ≤ i ≤ n−2, equations (8) and (9) shrink to
q(t) = q_{i−1} (q_{i−1}^{−1} q_i)^{B_i(t)} (q_i^{−1} q_{i+1})^{B_{i+1}(t)} (q_{i+1}^{−1} q_{i+2})^{B_{i+2}(t)},   (10)
B_{i+k}(t) = Σ_{j=i+k..i+2} K(t; j).
For these values of i, formula (10) results in (here t = i + s, s ∈ [0, 1]):
B_i(t) = −s^5 + 2.5s^4 − 1.5s^3 − 0.5s^2 + 0.5s + 1,
B_{i+1}(t) = 2s^5 − 5s^4 + 3s^3 + 0.5s^2 + 0.5s,
B_{i+2}(t) = −s^5 + 2.5s^4 − 1.5s^3.
Figure 2. The cumulative basis for our C2 interpolating kernel.
Explicit expressions of Bi(t) for i = 1 and i = n – 1 are even shorter. Fig.2 illustrates the resulting cumulative basis.
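Putting the pieces together, the sketch below evaluates formula (10) on a uniform, 0-based sample grid using the cumulative basis polynomials reconstructed above. The quaternion helpers are standard (Hamilton product, conjugate as inverse for unit quaternions, power via the exp/log form), all names are ours, and the boundary segments that use the shorter cubic kernels are not handled.

```python
import numpy as np

def q_mul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def q_inv(q):
    """Inverse of a unit quaternion (its conjugate)."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def q_pow(q, b):
    """q ** b = exp(b log q) for a unit quaternion, with theta in [0, pi]."""
    theta = np.arccos(np.clip(q[0], -1.0, 1.0))
    norm_v = np.linalg.norm(q[1:])
    if norm_v < 1e-12:                 # degenerate axis: treat as identity
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = q[1:] / norm_v
    return np.concatenate(([np.cos(b * theta)], np.sin(b * theta) * axis))

def cumulative_basis(s):
    """B_i, B_{i+1}, B_{i+2} for t = i + s, s in [0, 1] (inner segments)."""
    return (-s**5 + 2.5*s**4 - 1.5*s**3 - 0.5*s**2 + 0.5*s + 1.0,
            2*s**5 - 5*s**4 + 3*s**3 + 0.5*s**2 + 0.5*s,
            -s**5 + 2.5*s**4 - 1.5*s**3)

def interpolate(quats, t):
    """Formula (10) on the uniform grid t_k = k (0-based); valid when
    1 <= floor(t) <= len(quats) - 3."""
    i = int(np.floor(t))
    q = quats[i - 1]
    for k, b in enumerate(cumulative_basis(t - i)):
        rel = q_mul(q_inv(quats[i - 1 + k]), quats[i + k])
        q = q_mul(q, q_pow(rel, b))
    return q

# Five random unit quaternions, evaluated inside a middle segment.
rng = np.random.default_rng(0)
qs = [q / np.linalg.norm(q) for q in rng.normal(size=(5, 4))]
print(interpolate(qs, 2.25))
```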
4. RESULTS
Our implementation generates N = 1,000,000 interpolated points for n = 10,000 random input samples in 1.57 sec on a 2.8 GHz Pentium 4. This is equivalent to ~1.6 µs per interpolated point. The same implementation has shown a performance of 3.1 µs/point on a 900 MHz Pentium III. Fig. 3 demonstrates an
example of the generated quaternion curves. Our method outperforms all alternatives known to us where performance is reported. Buss and Fillmore5 report at least 80 µs per interpolated point on a Pentium II 400 MHz. Alexa10 reports 3 µs on a 1 GHz Athlon PC for calculating the matrix exponent only, which is required for the generation of a single interpolated point. Moreover, his method generates poor curves, since it is based on global linearization. We expect that the methods of Kang, Park and Ravani11,12 have performance of the same order as our method, but their approaches have preprocessing steps and, unlike our method, are not local, i.e. adjustment of a single input sample affects the entire curve. Among the drawbacks of our approach is that it fails to be bi-invariant: reversing the input sample sequence results in a different interpolating curve. Another drawback is that the derivatives at the terminals of the generated curve cannot be controlled, e.g. if one desires to make a "natural" spline.
Figure 3. User edited orientation interpolation (right) and interpolation of 100 random orientations (projections on S2 are shown).
5. CONCLUSION
We described an explicit method for C2 quaternion interpolation. It has no restrictions on input samples, does not require solving a nonlinear system, and has high performance. The method is based on an interpolatory piecewise polynomial basis, whose analytic derivation has been explained in detail. Drawbacks of our method are the uncontrolled derivatives at the curve terminals and the lack of bi-invariance.
REFERENCES
1. M. L. Curtis, Matrix Groups (Springer-Verlag, 1984).
2. K. Shoemake, Animating rotation with quaternion curves, Proc. SIGGRAPH 85, pp. 245–254 (1985).
3. W. D. Curtis, A. L. Janin and K. Zikan, A Note on Averaging Quaternions, Proc. Virtual Reality Annual International Symposium '93, pp. 377–385 (1993).
4. M. J. Kim, M. S. Kim and S. Y. Shin, A general construction scheme for unit quaternion curves with simple high order derivatives, Proc. SIGGRAPH 95, pp. 369–376 (1995).
5. S. R. Buss and J. P. Fillmore, Spherical averages and applications to spherical splines and interpolation, ACM Trans. Graph., 20(2):95–126 (2001).
6. C. de Boor, A Practical Guide to Splines, Springer-Verlag (1978).
7. M. J. Kim, M. S. Kim and S. Y. Shin, A C2 continuous B-spline quaternion curve interpolating a given sequence of solid orientations, Proc. Computer Animation '95, pp. 72–81 (1995).
8. G. M. Nielson, ν-Quaternion splines for the smooth interpolation of orientations, IEEE Trans. Visualization and Computer Graphics, 10(2):224–229 (March/April 2004).
9. V. S. Ryabenkii, Introduction to Computational Mathematics, §3 (Fizmatlit, Moscow, 2000).
10. M. Alexa, Linear Combination of Transformations, Proc. SIGGRAPH 2002, pp. 380–387 (2002).
11. F. C. Park and B. Ravani, Smooth invariant interpolation of rotations, ACM Trans. on Graph., 16(3):277–295 (July 1997).
12. G. Kang and F. C. Park, Cubic spline algorithms for orientation interpolation, Int. J. Numer. Meth. Engng, 46(1):45–64 (September 1999).
A COMPRESSION SCHEME FOR VOLUMETRIC ANIMATIONS OF RUNNING WATER
Bedrich Benes, Tecnologico de Monterrey, Campus Ciudad de Mexico, [email protected]
Vaclav Tesinsky, FEL CVUT Prague
Abstract
Fluid animations are becoming a standard tool for computer animators. Simulations of turbulent gases, running water, eroded surfaces, or splashing waves are common, but still demanding, because they are usually calculated in a voxel space. This brings new requirements to the tools that are used for such animations. The data structures are enormous but provide good space and time coherency. We present a compression scheme that can be used for storing, accessing, and viewing such animations interactively. Key-frames are compressed by the RLE algorithm and in-betweens as difference frames. To display the scene we convert the level of water and the terrain surface to triangle meshes by the marching cubes algorithm. With this lossless technique we reach a compression factor of up to 1:100. Scenes can be decompressed quickly, displayed, and manipulated interactively.

1. INTRODUCTION
Computer animations of running water and turbulent gases have been in the focus of the computer graphics community for a long time. The interest has moved from ad-hoc techniques to the physically correct solution of the Navier-Stokes equations, which provides a complete simulation of the motion of a liquid. The landmark paper in this area is the work of (Foster and Metaxas, 1996). They introduced a simplified, but still physically acceptable, solution of the Navier-Stokes equations and demonstrated its usability for the simulation of running water. Later works focused on the practical aspects of the animations (Foster and Fedkiw, 2001) and on photorealistic animations (Enright et al., 2001). The common property of these techniques is that they work in a voxel space, albeit they use particles to display the level of water. Previewing, manipulating, storing, etc. of such scenes is difficult and places huge demands on the software and the hardware. Techniques that facilitate the work are necessary.
Figure 1. Results of erosion simulation of an artificial river. The left two images show the erosion process, the last frame displays only the deposited material.
We present a compression scheme that is specially designed for volumetric animations of running water. Animations are stored in a form that facilitates fast display and scene manipulation. The user can see interesting details, preview some areas in time, etc. The simulation can be stopped and rerun from any point. We use this scheme to simulate hydraulic erosion. These simulations are computationally very demanding, so the results should be saved in a reasonable way for analysis. Our scheme is based on key-framing. Reference scenes are compressed by Run Length Encoding (RLE). The in-betweens are saved using a differential scheme, allowing efficient scrolling forward and backward in the animation. Since we want to display the water and the terrain, the surfaces must be displayed. We detect them with the marching cubes algorithm and convert them into a boundary representation. The surfaces are stored as triangle meshes that are efficiently displayed by graphics hardware. This kind of display serves as a scene preview. Photorealistic rendering, such as in Figure 1, is provided by a specialized raytracer that is launched as an external application. This scheme facilitates not only displaying the simulation results but also their reuse. We store the complete information about the simulation: the pressure field, the velocity field, the states of the cells, etc. This allows us to rerun a previously stopped animation, go back and change some parameters, etc.
2. PREVIOUS WORK
An application for displaying volumetric medical data was introduced by (Avila et al., 1994). The VolVis system became quite famous in the area of medical data display, namely for its integration of various display algorithms and techniques. The system supports different three-dimensional input data that can be displayed as raw data, compression domain rendering, volumetric ray-tracing, and irregular grid rendering, among others. (Chiueh et al., 1997) present an integrated compression and visualization scheme that displays compressed volumetric scenes without actually decompressing
them. The compression is performed in the Fourier domain and therefore its primary application area is static scenes. This technique is suitable for displaying medical data rather than dynamic scenes obtained by algorithms of fluid dynamics. A hardware-assisted rendering approach was introduced by (Lum et al., 2001). They use the hardware support for texture display to render time-varying volumetric data. Two different compression schemes are used: palette-based encoding and temporal encoding. The latter is based on the DCT. A wavelet-based compression scheme for time-varying volumetric data was described by (Guthe and Straßer, 2001). They use a lossy compression scheme with coding frames similar to MPEG compression. (Rosa et al., 2003) presented an approach that is closely related to ours. They show a system for interactive display and manipulation of time-varying data from thermal flow simulations. The main difference to our paper is that, to compress a single scene, they exploit the hardware support for indexing three-dimensional textures. The scene-to-scene coherency is coded by the DCT. The advantage of the method is a special kind of coding that can be displayed with trilinear filtering, which makes the method very efficient.
3. COMPRESSION SCHEME
The application generates huge amounts of voxel data. For example, one simulation of a voxel space at resolution 300³ with 700 frames occupies, in uncompressed form, more than 300 GB of disk space. The compression scheme is designed to fulfill two goals. First, the compression and decompression must be lossless. We want to use it not only for storing an animation, but also for rendering and, most importantly, for rerunning the simulation from a desired point. The second goal is the ability to scroll quickly through the data. We want to move forward and backward in time as fast as possible and display the data correspondingly. We exploit two important properties of the volumetric data: it has high spatial coherency and high time coherency. The former is used for coding the key-frames (we should say "key-scenes"), the latter is used for storing the in-betweens. This scheme is depicted in Figure 2.
Figure 2. Compressed key frames and difference frames are stored.
A property of the volumetric data is that the majority of the values are clamped to the interval [0, 1], so we can easily store them in a fixed-point
representation. We multiply the data by 0xFFFF and store them as 16-bit integers. In this way only the data that is used for direct display is stored in this form; high precision is required for the data that is used for rerunning the simulation.
Key-frames exploit the high spatial coherency of the volumetric data. There are just a few types of material, water, and air, distributed in more or less continuous areas. The most complicated parts are the boundaries between different environments: the deposited material and the level of water. The spatial data coherency makes the data a candidate for Run Length Encoding (RLE). Each sequence of the same value is coded as a pair [n, v], where n is the number of repetitions and v is the value itself. If there is a sequence of highly varying values, they are stored in uncompressed form. The value zero is used to indicate the beginning and the end of an uncompressed sequence. For example, the sequence 000001234999999 will be coded as 5001234069. To perform the compression/decompression efficiently, the voxel structure is taken as a linear array, i.e., in the way it is stored in memory, which corresponds to scanning the rows of the first layer sequentially, then moving on to the second layer, etc. Testing this compression scheme on various scenes gives an average compression factor of 1:30. A key-frame is stored every twenty frames. Increasing the distance between key-frames increases the compression factor, but storing few key-frames makes scrolling difficult. We found twenty frames to be a good compromise between compression quality and interactivity.
In-betweens exploit the time coherency of the sequences. Two successive scenes do not vary very much. The scenes in between the key-frames are stored as difference frames relative to the previous frame (see Figure 2). The compression is asymmetrical, i.e., compression takes longer than decompression. When compressing, we take the previous frame, denoted by A, and the new frame B in the same form as in the previous case, i.e., as a long sequence of values. All values equal in the second array are simply skipped. We store the pair [−, n], where − indicates that we are skipping data and n is the number of bytes to skip. If there is a difference between the scenes, we store the exact value. We have measured the compression factor of the in-betweens on different scenes and the average value is around 1:400. The overall compression factor of the animations is around 1:100.
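The two coding steps can be sketched in a few lines of Python. The rule for deciding when a run is long enough to become an [n, v] pair is not spelled out in the text, so the min_run threshold below is our own assumption; note also that literal values equal to zero would clash with the zero delimiter and are assumed not to occur inside literal blocks.

```python
from itertools import groupby

def rle_encode(values, min_run=3):
    """Key-frame coding: repeated values become [count, value] pairs,
    short runs of varying values go into a zero-delimited literal block."""
    out, literal = [], []
    def flush():
        if literal:
            out.extend([0] + literal + [0])
            literal.clear()
    for value, group in groupby(values):
        count = len(list(group))
        if count >= min_run:
            flush()
            out.extend([count, value])
        else:
            literal.extend([value] * count)
    flush()
    return out

def diff_encode(prev, curr):
    """In-between coding: runs equal to the previous frame become
    ('skip', n) markers, changed values are stored verbatim."""
    out, skip = [], 0
    for a, b in zip(prev, curr):
        if a == b:
            skip += 1
        else:
            if skip:
                out.append(('skip', skip))
                skip = 0
            out.append(b)
    if skip:
        out.append(('skip', skip))
    return out

# The example from the text: 000001234999999 -> 5 0 0 1 2 3 4 0 6 9.
print(rle_encode([0, 0, 0, 0, 0, 1, 2, 3, 4, 9, 9, 9, 9, 9, 9]))
```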
4. IMPLEMENTATION AND RESULTS
We have developed an application that we use for displaying and manipulating the volumetric scenes that result from our erosion simulation algorithm. The application is written in C under OpenGL. The application allows zooming, scrolling the animation forward and backward by key-frames, saving the
Figure 3. The simulator.
preview, manipulating display parameters, changing the transparency, disabling and enabling materials, and spawning the simulator or the raytracer. The application window snapshots are in Figure 3. The system first decodes the first key-frame and the scene is stored in memory. The actual scene can be decompressed from a key-frame or an in-between. In the latter case, the scene must be calculated from the closest previous frame and the current difference frame. When the scene is calculated, the additional data structure representing the surface boundaries is also generated. We use the marching cubes algorithm, which generates a mesh of triangles (polygon soup). The meshes are small; we store them as OpenGL display lists, which makes any interactive manipulation immediate. One objection is that when we want to view an animation, a lot of work has to be done for each frame: the scene must be decoded, uncompressed, and the additional data structure must be generated. Keeping this in mind, we have experimented with storing the meshes together with the volumetric data, keeping the meshes in an LRU cache, etc. In the end we found that generating the data every time it is needed is satisfactory. For larger scenes it would definitely be better to store the mesh in the volumetric file. For small scenes (the biggest scene we use is 400³ voxels) this solution was fast enough.
5. CONCLUSIONS
A compression scheme for storing volumetric data is presented. The main purpose is to display and reuse the volumetric data produced by the fluid and erosion simulations. We use key-framing and difference-frame coding.
The scene that corresponds to a key-frame is compressed using RLE and the in-betweens are compressed by storing only the difference values. We exploit the spatial and the time coherency. The data does not vary very much in space, and this helps us to reach a compression factor of about 1:30 for key-frames. The changes between frames are also small, and this allows us to compress the in-betweens with a compression factor of about 1:400. The overall animation compression factor is 1:100. To display a scene we convert the data into sets of triangles by the marching cubes algorithm. The meshes are not stored together with the volumetric data, because the speed of decompression is high enough to give a good response from the entire system. We were exploring scenes with a size of around 0.5–1 GB in compressed form, which corresponds to scenes with an uncompressed size of up to 100 GB.
REFERENCES
Avila, Ricardo, He, Taosong, Hong, Lichan, Kaufman, Arie, Pfister, Hanspeter, Silva, Claudio, Sobierajski, Lisa, and Wang, Sidney (1994). VolVis: a diversified volume visualization system. In Proceedings of the conference on Visualization '94, pages 31–38. IEEE Computer Society Press.
Chiueh, Tzi-cker, Yang, Chuan-kai, He, Taosong, Pfister, Hanspeter, and Kaufman, Arie (1997). Integrated volume compression and visualization. In Proceedings of the 8th conference on Visualization '97, pages 329–ff. IEEE Computer Society Press.
Enright, Douglas, Marschner, Stephen, and Fedkiw, Ronald (2001). Animation and rendering of complex water surfaces. In Hughes, John F., editor, Proceedings of SIGGRAPH 2002, Computer Graphics Proceedings, Annual Conference Series, pages 736–744. ACM, ACM Press / ACM SIGGRAPH.
Foster, Nick and Fedkiw, Ronald (2001). Practical animation of liquids. In Fiume, Eugene, editor, Proceedings of SIGGRAPH 2001, Computer Graphics Proceedings, Annual Conference Series, pages 23–30. ACM, ACM Press / ACM SIGGRAPH.
Foster, Nick and Metaxas, Dimitri (1996). Realistic animation of liquids. Graphical Models and Image Processing, volume 58(5), pages 471–483.
Guthe, Stefan and Straßer, Wolfgang (2001). Real-time decompression and visualization of animated volume data. In Proceedings of the conference on Visualization '01, pages 349–356. IEEE Computer Society.
Lum, Eric B., Ma, Kwan-Liu, and Clyne, John (2001). Texture hardware assisted rendering of time-varying volume data. In Proceedings of the conference on Visualization '01, pages 263–270. IEEE Computer Society.
Rosa, Gabriel G., Lum, Eric B., Ma, Kwan-Liu, and Ono, Kenji (2003). An interactive volume visualization system for transient flow analysis. In Proceedings of the 2003 Eurographics/IEEE TVCG Workshop on Volume Graphics, pages 137–144. ACM Press.
JOINING NURBS-BASED BODY SECTIONS FOR HUMAN CHARACTER ANIMATION
Derek Byrnes and Ling Li
School of Computing, Curtin University of Technology, Perth, Australia
@cs.curtin.edu.au
Abstract
This paper presents a simple, effective method for constructing elbow and knee joints between rigid NURBS-based human body segments for animation. The purpose is to automatically and efficiently manipulate NURBS joint patches to provide smooth, realistic joint action. The production of fine surface detail is also illustrated by creating wrinkles in the joint acute angle, again in a simple, effective manner. By automating such processes the need to manually add surface detail is reduced; animator input and animation time are decreased.
Keywords:
NURBS, Joint Deformation, Animation
1. INTRODUCTION
In virtual human character animation, there are two conflicting goals: real-time and realistic animation. Animation models of high definition are reserved for offline productions, while real-time models lack credible levels of realism. Presented here is a method specifically for elbow and knee joint construction that builds on the simplicity of Bezier curves and on the sparseness of the NURBS control net structure. Using a Bezier curve to mechanically (automatically) compute joint control point locations based on the angle of incidence is a simple and fast method of joint construction. The addition of wrinkles in the acute angle of a joint is achieved by simply oscillating control point locations, producing surface detail that is otherwise produced using dynamics or manual surface manipulation. Animation generally begins with a skeletal hierarchy of rigid segment transforms. Such hierarchies accurately reflect the structure of the human body and are used with common animation techniques. Associated with the skeleton is a skin, either as a single surface or made up of patches. The skin is most frequently a polygon mesh or a parametric surface such as NURBS, and is expected to deform in a natural way as the underlying skeleton is moved.
Animation involves technologies such as free-form deformations (Lamousin and Waggenspack, 1994), which implicitly deform objects by deforming the space they occupy, or simple inbetweening (Lasseter, 1987), i.e. key-frame animation, both of which can be applied to either polygonal or parametric skin types. However, free-form techniques fail to provide sufficient levels of control to produce fine surface detail, while inbetweening is limited in its use for generating novel animation sequences; key frames must be generated beforehand. Linear Blend Skinning (Gleicher et al., 2003) may be applied to polygonal models, where a particular joint vertex is transformed by a constant multiple of the transform matrix associated with each bone acting at that joint. This method is limited, particularly for large deformations where a fixed number of polygons inadequately defines large curvatures. However, it is very fast for models with low polygon counts and is used extensively in gaming. A deformation technique incorporating implicit surfaces called metaballs simulates underlying muscle actions. Cross sections of these muscles are used as NURBS control points in the construction of a smooth skin (Chauvineau et al., 1996). Such an approach generally seeks to unify the body segments into a single smooth surface, and may require computationally intensive surface joining techniques (Shen et al., 1994). However, the strength of NURBS in this application is their small data sets when compared to polygon models, which is ideal for low bandwidth applications. They are limited only in the need to be tessellated in order to be rendered on present hardware. Hardware and processing limitations force animators to trade off fine detail for increased frame rates. To this end, the technique outlined herein seeks a uniform or smooth joint deformation for simplicity, while allowing skin detail such as wrinkles to be added after joint construction. The use of NURBS is favoured because of their small data sets and ability to produce smooth surfaces. The proposed method of joint creation is easy to implement and performs automatic joint deformation in a computationally inexpensive manner using Bezier curves. The location of control points is computed based on the angle of incidence of bones acting at a joint and in this respect is similar to skinning techniques, except that no weighting is required. Wrinkles are also added in a simple and inexpensive manner, and without manual intervention, by oscillating the position of control points in the joint acute angle region. The next section details the use of rigid human body segments and degree two Bezier curves for joint control point calculation, and also includes a description of wrinkle creation. The results of this process are outlined in section 3, followed by discussion in section 4 and conclusions in section 5.
2. JOINT CONSTRUCTION
Approximating an elbow or knee joint begins by utilizing the fact that these body segments are approximately cylindrical. Segments are defined using degree three NURBS surfaces, with a control points in the u direction and b control points in the v direction. The surfaces are closed in u and connect in a hierarchical manner. This hierarchy consists of transform matrices specifying the rotation of a dependent (child) surface about its pivot point. The pivot provides the segment's local coordinate system and orientation, and is located on the medial axis of the parent surface. Body segments are constructed with a one-to-one relationship in the control nets of parent and child (similar dimensions and orientation). This arrangement allows the use of Bezier curves to determine a NURBS control net for the joint. This is achieved by extending the control polygon span between the second last and last rows of control points. These lines pass through control points at the edge of the parent and intersect lines extended from the child. A Bezier polygon can be constructed using these two end control points and a third that lies at the intersection of the extended lines (or at a point on the shortest line between them). For each of these degree two Bezier curves, several sample points are used as NURBS control points. Then, for any particular rotation of the child, the joint surface can be smoothly defined to interpolate the parent and child surfaces implicitly as a product of the angle of incidence (Figure 1(a)). Solving Equation 1, as illustrated in Figure 1(a), yields Pa and Pb. This is based on the vector equation of a line (i.e. Pa = P1 + α(P2 − P1)) and the fact that the shortest line between two lines (Pa − Pb) is perpendicular to both (their dot product is zero).
(Pa − Pb) • (P2 − P1) = (Pa − Pb) • (P4 − P3) = 0   (1)
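Equation 1 reduces to a 2 x 2 linear system in the two line parameters. The sketch below is one way to solve it with NumPy; the function name and the usage example are ours, and parallel lines (a singular system) are not treated.

```python
import numpy as np

def closest_points(p1, p2, p3, p4):
    """Points Pa on the line through p1, p2 and Pb on the line through
    p3, p4 such that Pa - Pb is perpendicular to both lines (Equation 1)."""
    d1, d2 = p2 - p1, p4 - p3
    r = p1 - p3
    # Pa = p1 + alpha*d1, Pb = p3 + beta*d2; the two perpendicularity
    # conditions give a 2x2 system in (alpha, beta).
    system = np.array([[np.dot(d1, d1), -np.dot(d1, d2)],
                       [np.dot(d1, d2), -np.dot(d2, d2)]])
    rhs = np.array([-np.dot(r, d1), -np.dot(r, d2)])
    alpha, beta = np.linalg.solve(system, rhs)
    return p1 + alpha * d1, p3 + beta * d2

# The Bezier mid control point can then be taken on the segment Pa-Pb,
# e.g. at its midpoint.
pa, pb = closest_points(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]),
                        np.array([0.0, 1.0, 1.0]), np.array([0.0, 1.0, 2.0]))
print((pa + pb) / 2.0)
```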
G1 continuity is obtained because both NURBS and Bezier curves have endpoint tangents in the same direction as their last polygon spans. The Bezier mid control point is collinear with both the parent and child edge control points, producing a Bezier curve with G1 continuity at its ends. This in turn produces a NURBS surface that also interpolates its control polygon endpoints with G1 continuity (Figure 1(b),(c)). The reader is referred to (Piegl and Tiller, 1995) for an in-depth discussion of the properties of NURBS. Due consideration must be given to the intersection of the parent and child surfaces during motion. Intersections can be dealt with by first approximating the intersection region using the parent/child control polygons, since NURBS surfaces generally follow the shape of their control polygons. Smooth surfaces of a cylindrical nature allow a one-to-one relationship around the circumference and also along the length of interacting body segments. A proximity threshold is used to determine when related control points become close enough to have collided (required because segments are not strictly cylindrical), by determining which is closer to the parent centroid. A 2D array may be
constructed indicating which pairs of parent/child control points have collided and thus need to be dealt with when constructing the joint. This intersection map also indicates pairs of control points in the intersecting rows that do not collide. Attaching the joint not at the ends of the parent and child, but at the row of control points above the region of intersection (removing the part of the parent and child below the new join), allows a smooth joint to be constructed. The length of the arm may be preserved by moving control points in the joint to be coincident with control points in the overlapped rows that were not determined to be intersecting. The most notable candidates are control points on the outside of the elbow when control points in the same row but on the inside of the elbow are intersecting (Figure 1(b),(c)). The use of intersecting lines and Bezier surface evaluation is seen as an efficient means for joint construction, as commonplace, computationally efficient algorithms exist for both. Automating wrinkle creation during flexion is achieved by oscillating control points in the acute angle region by some percentage of their initial distance from the pivot point. There is an excess of control points in the acute region as the joint surface is progressively attached at higher rows along the parent. In the acute region control points become densely packed with flexion and only some of these are needed to describe curvature. It is necessary to define a threshold describing how close consecutive control points may get before forming wrinkles, and also a percentage of their initial distance from the pivot by which control points will be moved. For each v-direction row of control points, a closeness array is populated with control points that are close enough to be used; a second pass determines the new location of control points by applying the percentage initial distance and an alternating direction of movement.
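A minimal sketch of the wrinkle heuristic for one row of control points follows; the sinusoidal pattern with a period of eight points and the 10% displacement are illustrative values, not figures from the paper.

```python
import math

def wrinkle_row(points, pivot, amount=0.1, period=8):
    """Oscillate control points radially about the pivot: points are pushed
    away from or toward the pivot by a fraction of their distance to it,
    following a sinusoid along the row (points at the zeros stay put)."""
    result = []
    for k, p in enumerate(points):
        scale = 1.0 + amount * math.sin(2.0 * math.pi * k / period)
        result.append(tuple(v + (c - v) * scale for c, v in zip(p, pivot)))
    return result

# A short row of points roughly one unit from an origin pivot.
print(wrinkle_row([(1.0, 0.1 * k, 0.0) for k in range(8)], (0.0, 0.0, 0.0)))
```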
3. RESULTS
Initial results illustrate that overlapping can produce a joint surface that extends the length of the overlapped region, as well as providing a smooth surface interpolating the parent and child surfaces (Figure 1(b)). Subdividing the parent and child control nets at the joint attachment row and discarding the overlapped region produces three distinct surfaces connecting with G1 continuity (Figure 1(c)). However, subdividing is limited by the number of control points necessary to construct a degree p surface; generally bi-cubic surfaces were used, requiring a minimum of 4 × 4 control points. During flexion, the joint attachment point moves past the fourth row of control points from the pivot. This requires the parent and child surfaces to be amalgamated with the joint to produce a single
smooth bi-cubic arm surface. Arm surfaces constructed by joining the parts of the three surfaces generally have C2 continuity. It was determined that wrinkles could be created by moving control points in a sinusoidal manner: a maximum of three control points is moved toward the pivot, the fourth is stationary, and the next three are moved away from the pivot. This method produced visually pleasing results, as in Figure 1(e), and was preferable to alternating consecutive control points, which produced many small wrinkles that did not look realistic.
4. DISCUSSION
The main limitation of this method is that creating the joint surface at differing joint attachment rows causes the joint size to change abruptly at the transition point (Figure 1(e),(f)). A possible solution is to base the position of the Bezier midpoint on a barycentric combination of Bezier control point locations. The parameter θ scales between the smallest and largest angles for which the joint attaches at a particular row, that is, P(θ) = (1 − θ)Pi+1 + θPi. The size and number of wrinkles are also affected by changes in joint size. For instance, when the transition from one row to the next occurs, the joint typically goes from having several wrinkles to having few or none after the transition (Figure 1(e),(f)). This problem will be dealt with by the barycentric approach to Bezier midpoint positioning, but may be minimised by refining the control net to include smaller spacing between rows. This method is presently only suitable for hinge joints, as it relies on intersections occurring between parent and child in order to trim off undesirable effects. In joints such as the wrist, ankle and shoulder, it is possible to undergo rotations that do not cause parent/child intersections due to joint size, pivot location and the use of rigid body segments. In such situations, lines extended from the parent and child polygons can have directions that point inside or towards the centre of the joint. Whilst it is possible to join these surfaces, without an intersection this method cannot trim off areas where undesirable effects occur. To apply Bezier joint construction to other joints, the parent and child surfaces must be able to deform to produce tangents conducive to constructing a smooth, uniform curvature between the parent and child surfaces.
5.
CONCLUSIONS
The method for joint construction outlined here can be applied to hinge joints with the expectation of producing continuous joint surfaces with wrinkles in an efficient manner and approaching real-time application. The Bezier method is deemed unsuitable for joints of greater than one degree of freedom without significant modification. Future research will adapt the Bezier joint
construction method to other types of joints by exploring other means of avoiding undesirable surface effects, but still based only on angle of incidence.
Figure 1. (a) Joint approximation using Bezier curves. (b) Elbow joint surface construction at 45◦ rotation. (c) Elbow joint with overlapped surfaces removed at 45◦ rotation. (d) Elbow joint using a single surface at 45◦ rotation. (e) Wrinkles at 76◦ rotation. (f) Wrinkles at 78◦ . Note there are no wrinkles and joint size is increased.
REFERENCES Chauvineau, E., Shen, J., and Thalmann, D. (1996). Fast realistic human body deformations for animation and VR applications. In Proceedings of the Computer Graphics International (CGI96), pages 166–174, New York. IEEE Press. Gleicher, M., Mohr, A., and Tokheim, L. (2003). Direct manipulation of interactive character skins. In Proceedings of the 2003 Symposium on Interactive 3D Graphics, pages 27–30, New York. ACM Press. Lamousin, H. J. and Waggenspack, W. N. (1994). NURBS-based free-form deformations. IEEE Computer Graphics and Applications, 59(6):59–64. Lasseter, J. (1987). Principles of traditional animation applied to 3d computer animation. Computer Graphics, 21(4):35–44. Piegl, L. and Tiller, W. (1995). The NURBS Book. Springer-Verlag, Berlin, Germany. Shen, J., Magnenat-Thalmann, N., and Thalmann, D. (1994). Human skin deformation from cross-sections. In Insight Through Computer Graphics: Proceedings of Computer Graphics International 1994 (CGI94), pages 39–49, River Edge, New Jersey. World Scientific Publishing.
MOTION RECOVERY BASED ON FEATURE EXTRACTION FROM 2D IMAGES
Jianhui Zhao1, Ling Li2 and Kwoh Chee Keong3
1,3 School of Computer Engineering, Nanyang Technological University, Singapore, 639798; 2 School of Computing, Curtin University of Technology, Perth, Australia, 6102
Abstract:
This paper presents a method for motion recovery from monocular images containing human motions. Image processing techniques, such as spatial filtering, linear prediction, cross correlation, least square matching etc., are applied to extract feature points from 2D human figures with or without markers. A 3D skeletal human model with encoded angular constraints is adopted. An Energy Function is defined to represent the residuals between the extracted feature points and the corresponding points resulting from projecting the human model onto the projection plane. Then a procedure for motion recovery is developed, which makes it feasible to generate realistic human animations.
Key words:
Posture Reconstruction, Human Animation, Energy Function
1.
INTRODUCTION
There are two basic methods in classical computer animation: kinematics and dynamics approaches1. The disadvantage of these methods is their inability to filter out erroneous movements. If real images containing human motions are used to drive the virtual human body, more faithful motions and variations of dynamic scenes can be generated in the virtual world. This understanding leads us to a source from which a great amount of motion information can be obtained: monocular images containing human movements. This approach can be used in many fields2,3, e.g. virtual reality, choreography, rehabilitation, communication, surveillance systems, movie production, the game industry, image coding and gait analysis. However, due to the lack of information in the third dimension and the fact that the human body is an extremely complex object, the problem of generating 3D human motion from 2D images taken by a single camera is quite difficult. It is
mathematically straightforward to describe the process of projection from a 3D scene to a 2D image, but the inverse process is typically an ill-posed problem. Chen and Lee4 presented a method to determine 3D locations of human joints from a film recording walking motion. In this method, geometric projection theory, physiological and motion-specific knowledge, and graph search theory are used. Another approach is the divide-and-conquer technique reported by Holt et al. for human gait5. Although the simplicity of this approach is attractive, it is unsatisfying since it does not exploit the fact that the different components belong to the same model. The reconstruction method proposed by Camillo J. Taylor6 does not assume that the images are acquired with a calibrated camera, but the user is required to specify which end of each segment is closer to the observer. Barron and Kakadiaris7 estimated both the human's anthropometric measurements and pose from a single image. Their approach requires the user to mark the segments whose orientation is almost parallel to the image plane. The novelty of our approach is that it is able to deal with human motions from 2D images without camera calibration and user intervention. It provides an alternative way for human animation by low cost motion capture while avoiding many limitations of current motion tracking equipment.
2.
EXTRACTION OF FEATURE POINTS
2.1
Extraction from image with markers
Markers with different colors are stuck to the tight clothes of a human subject where the joints are located. Motions of the subject are recorded by a digital camcorder at 30 frames per second. The video sequence is composed of m monocular images in JPEG format, and each frame can be represented as a discrete two-dimensional function fj(x, y). Suppose there are n markers on the human figure, and each marker is represented as Mi(r, g, b), where r, g, and b are the color values of the marker in its Red, Green, and Blue planes respectively. Given a threshold value u, whether a pixel p(x, y) belongs to the marker can be determined by:
p(x, y) ∈ Mi  ⇔  Mi − u ≤ p(x, y) ≤ Mi + u      (1)
The procedure for feature extraction from monocular images with markers is:
Step 1, the 2D image fj(x, y) is used as input. Step 2, a 3×3 Low Pass Spatial Filter (LPSF) is used to reduce noise, i.e., the intensity value of a point is replaced by the average of all the pixels around it. Step 3, tracking of the markers is executed throughout fj(x, y) by Equation (1), and the pixels belonging to Mi(r, g, b) are selected. Step 4, the tracking result may be irregular, or composed of several discrete regions; thus the main continuous part is selected as the result, and the other parts are discarded. As illustrated in Figure 1, those in the blue circle are the main part, while those in the green circle are the discarded parts.
Figure 1. Processing of the tracked results.
Step 5, to extract the feature points, the following averaging method is applied:

x̄ = (1/n) Σi=1..n xi ,   ȳ = (1/n) Σi=1..n yi      (2)

where x̄ and ȳ are the averages of the coordinates of all pixels in the tracked result.
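Steps 2 to 5 of the marker-based procedure can be illustrated with the following sketch. It assumes NumPy and RGB frames scaled to [0, 1], and it simplifies Step 4 by keeping all matching pixels instead of only the main continuous region.

import numpy as np

def extract_marker_point(frame, marker_color, u):
    """Locate one colour marker in an RGB frame and return its centre.

    frame: (H, W, 3) float array with values in [0, 1]; marker_color: (r, g, b);
    u: colour threshold of Eq. (1).
    """
    # Step 2: 3x3 low-pass (mean) filter on each channel to reduce noise.
    smoothed = np.empty_like(frame)
    for c in range(3):
        acc = np.zeros_like(frame[:, :, c])
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                acc += np.roll(np.roll(frame[:, :, c], dy, axis=0), dx, axis=1)
        smoothed[:, :, c] = acc / 9.0

    # Step 3: select pixels whose colour lies within +/- u of the marker colour (Eq. 1).
    lower = np.asarray(marker_color) - u
    upper = np.asarray(marker_color) + u
    mask = np.all((smoothed >= lower) & (smoothed <= upper), axis=2)

    # Step 5: the feature point is the average coordinate of the tracked pixels (Eq. 2).
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())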
2.2
Extraction from image without markers
Suppose there are m monocular frames in a video sequence without markers, and n feature points on the human body. The frame is represented as fj(x, y), each feature point as Pi,j(x, y) (the ith feature point in the jth frame), and each template window with the feature point as its center as wi,j(x, y). Feature points in the first frame are picked manually, and the procedure for feature extraction from the other frames is: Step 1, the 2D image fj(x, y) is used as input. Step 2, a 3×3 LPSF is applied to reduce the noise. Step 3, Linear Prediction is used to predict P′i,j+1(x, y) based on the corresponding feature points in previous frames as follows:
P′i,j+1 = Pi,j                       for j = 1
P′i,j+1 = Pi,j + (Pi,j − Pi,j−1)     for 1 < j ≤ m      (3)
Step 4, Normalized Cross Correlation (NCC) is utilized to find matches of the template image wi,j(x, y) within the search image si,j+1(x, y), with P′i,j+1(x, y) as its center, by:
c(r, t) = ( Σx Σy si,j+1(x, y) wi,j(x − r, y − t) ) / ( Σx Σy s²i,j+1(x, y) Σx Σy w²i,j(x − r, y − t) )      (4)
The position where the maximum value of c(r, t) appears is selected as P″i,j+1(x, y). Step 5, the Least Square Matching method is applied to find the accurate Pi,j+1(x, y) from the initial point P″i,j+1(x, y), during which affine transformations (i.e. rotation, shearing, scaling, and translation) are considered as follows:
xnew = a0 + a1 x + a2 y ,   ynew = b0 + b1 x + b2 y      (5)
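A brute-force sketch of Steps 3 and 4 (linear prediction followed by the cross-correlation search of Eq. (4)) is given below; the least-squares affine refinement of Step 5 is omitted, and the function names and the use of grayscale windows are assumptions made for illustration only.

import numpy as np

def predict_next(p_curr, p_prev=None):
    """Linear prediction of Eq. (3): previous position plus the last displacement."""
    return p_curr if p_prev is None else p_curr + (p_curr - p_prev)

def ncc_match(search, template):
    """Search for the offset (r, t) maximising the correlation score of Eq. (4).

    search and template are 2-D grayscale arrays; the search window is assumed
    to be centred on the predicted position P'_{i,j+1}.
    """
    H, W = search.shape
    h, w = template.shape
    denom_t = np.sum(template ** 2)
    best_score, best_rt = -np.inf, (0, 0)
    for r in range(H - h + 1):
        for t in range(W - w + 1):
            patch = search[r:r + h, t:t + w]
            denom = np.sum(patch ** 2) * denom_t
            if denom == 0:
                continue
            score = np.sum(patch * template) / denom
            if score > best_score:
                best_score, best_rt = score, (r, t)
    return best_rt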
3.
MOTION RECOVERY
The employed 3D skeletal human model consists of 17 joints and 16 segments, and the joints are Hip, Abdomen, Chest, Neck, Head, Luparm, Ruparm, Llowarm, Rlowarm, Lhand, Rhand, Lthigh, Rthigh, Lshin, Rshin, Lfoot, Rfoot. Kinematic analysis resolves any motion into one or more of six possible components: rotation about and translation along the three mutually perpendicular axes. Rotational ranges of the joints8 are utilized as the geometrical constraints of human motion. An Energy Function (EF) is defined to express the deviations between the image features and the corresponding projection features as
EFi = Scale(1) Δ_anglei + Scale(2) Δ_lengthi + Scale(3) Δ_positioni      (6)
where Δ_anglei is the deviation of orientation, Δ_lengthi is the deviation of length, and Δ_positioni is the deviation of position, while Scale(1), Scale(2) and Scale(3) are weighting parameters.
The procedure for motion recovery is:
Step 1. Take a series of monocular images as input;
Step 2. Extract the feature points;
Step 3. Calculate the initial projection value of the 3D model for every body part;
Step 4. Joint Hip is translated in the plane parallel with the image to place the projected point of Hip at the accurate position;
Step 5. Rotate joint Hip based on Eq. (6); the descendant joints of Hip are Abdomen, Lthigh, Lshin, Lfoot, Rthigh, Rshin, and Rfoot;
Step 6. Adjust the other joints by considering their immediate descendants in the following order: Abdomen, Lthigh (and Rthigh), Lshin (and Rshin), Chest, Neck, Luparm (and Ruparm), Llowarm (and Rlowarm);
Step 7. Joint Hip is translated a second time, along the line defined by the position of the camera and the extracted point of Hip, to make the projected posture have the same size as the human figure in the 2D image;
Step 8. Rotate joint Hip by Eq. (6) again with reference to all the other joints of the human model;
Step 9. Adjust the other joints by considering all their descendant(s) in the same order as Step 6;
Step 10. Display the recovered 3D postures.
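One way to picture Steps 5-6 and 8-9 is a greedy per-joint search that tries small rotations inside the joint's angular limits and keeps those that reduce the deviation between projected and extracted points. The sketch below is an assumed illustration, not the paper's code: for brevity it uses only the position term of Eq. (6), and the project callable and the dictionaries passed in are hypothetical interfaces.

def adjust_joint(angles, joint, project, extracted, limits, step=1.0):
    """Greedily adjust one joint's rotation to lower a position-only cost.

    angles   : dict joint name -> (rx, ry, rz) rotation in degrees
    project  : callable(angles) -> dict of projected 2D joint positions
    extracted: dict of 2D feature points extracted from the image
    limits   : dict joint name -> ((rx_min, rx_max), (ry_min, ry_max), (rz_min, rz_max))
    """
    def cost(a):
        proj = project(a)
        return sum((proj[j][0] - extracted[j][0]) ** 2 +
                   (proj[j][1] - extracted[j][1]) ** 2 for j in extracted)

    best = dict(angles)
    best_cost = cost(best)
    for axis in range(3):
        for delta in (-step, step):
            trial = dict(best)
            r = list(trial[joint])
            r[axis] += delta
            lo, hi = limits[joint][axis]
            if not (lo <= r[axis] <= hi):     # respect the rotational ranges
                continue
            trial[joint] = tuple(r)
            c = cost(trial)
            if c < best_cost:
                best, best_cost = trial, c
    return best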
4.
EXPERIMENTAL RESULTS
The adopted method for human motion reconstruction from monocular images is tested on several video sequences of human motions, as shown in Figure 2 and Figure 3. There are 8 frames in the human kicking sequence of Figure 2, and 3 of them (1st, 3rd, 5th) are displayed; 3 of the 8 frames (2nd, 4th, 6th) in the human farewell sequence of Figure 3 are illustrated. Figures in the 1st column are 2D frames of the video sequence; figures in the 2nd column are the extracted feature points (red, dotted) and the animated results (black, solid) from the same viewpoint; figures in the 3rd and 4th columns are results from the side and top views respectively.
Figure 2. Recovered motion from a kicking sequence.
Figure 3. Recovered motion from a farewell sequence.
5.
CONCLUSION
An approach for the reconstruction of human postures and motions from monocular images is presented. The advantage of this method is that neither camera calibration nor user intervention is needed. Experiments show that
the reconstructed results are encouraging, while some improvements are needed. Future work includes automatic and accurate picking of the occluded feature points, further study of the Energy Function and the biomechanical constraints, etc.
REFERENCES 1. Yahya Aydin, Masayuki Nakajima, Database Guided Computer Animation of Human Grasping using Forward and Inverse Kinematics, Computers & Graphics, 23 (1999), Page(s): 145-154. 2. D.M.Gavrila, The Visual Analysis of Human Movement: A Survey, Computer Vision and Image Understanding, Vol. 73, No. 1, January 1999, Page(s): 82-98. 3. Thomas B. Moeslund and Erik Granum, A Survey of Computer Vision-Based Human Motion Capture, Computer Vision and Image Understanding 81, 2001, Page(s): 231-268. 4. Zen Chen and His-Jian Lee, Knowledge_Guided Visual Perception of 3-D Human Gait from a Single Image Sequence, Systems, Man, and Cybernetics, IEEE Transactions, Vol. 22, No. 2, March/April 1992, 336-342. 5. Robert J.Holt, Arun N.Netravali, Thomas S.Huang, Richard J.Qian, Determining Articulated Motion from Perspective Views: A Decomposition Approach, IEEE Workshop on Motion of Non-Rigid and Articulated Objects, 1994, 126-137. 6. Camillo J. Taylor, Reconstruction of Articulated Objects from Point Correspondences in a Single Uncalibrated Image, Computer Vision and Image Understanding 80, 2000, 349-363. 7. Carlos Barron and Ioannis A. Kakadiaris, On the Improvement of Anthropometry and Pose Estimation from a Single Uncalibrated Image, IEEE Workshop on Human Motion, 2000, 53-60. 8. Jianhui Zhao and Ling Li, Human Motion Reconstruction from Monocular Images Using Genetic Algorithms, Journal of Computer Animation and Virtual Worlds, 2004 (15), Page(s): 407-414.
AUDIOVISUAL SYNTHESIS OF POLISH USING TWO- AND THREE-DIMENSIONAL ANIMATION
Joanna BEŁKOWSKA, Anna GŁOWIENKO, Krzysztof MARASEK Polish-Japanese Institute of Information Technologies, Koszykowa 86, 02-008 Warszawa, [email protected]
Abstract:
In the paper basic problems related to audiovisual synthesis for Polish are described. Issues of lip synchronization in two- and three-dimensional animation are presented. As a result of the work a so-called "chatting head" has been prepared which can say all words typed at the computer keyboard.
Key words:
Audiovisual synthesis, lip sync, lips movement
1.
INTRODUCTION
The main purpose of the work carried out at the Polish-Japanese Institute of Information Technologies is the creation of an avatar (a personification of a computer system) which can communicate with a user using speech. Such a system is built from subsystems for speech recognition, speech synthesis, dialogue and animation. Issues related to synthesis, animation and synchronization of speech are called audiovisual synthesis. This vision could include: • an animated face that is able to show articulation and some emotions • the whole head or bust, together with their movements • an avatar, which could have the form of a whole figure (or only a bust), with movements not limited to the head but including also gestures or even the movements of the whole figure.
There is no obligation for the avatar to have a humanoid form. The acceptance of the system could even be higher if the user does not look for imperfections or fakeness in the human form and details. The purpose of this work was to define the main rules of lip movement for two- and three-dimensional articulation animation. A Polish diphone speech synthesis system (Festival1 based) was used6. For a given text the program generates synthetic speech using phonetic and phonological rules and a database of concatenation units. Acoustic speech synthesis is done using the MBROLA system6, which stores diphones in a form suitable for concatenation and easy prosodic modification. The text analysis module prepares a file with detailed information about the phones to be produced, how long they are, and how the pitch should be modified. This is enough for the audiovisual synthesis. Systems like this have already been created for other languages7; for Polish, however, it is one of the first attempts.
2.
SYNCHRONISATION OF LIPS MOVEMENTS
2.1
Speech Synthesis and Visual Representation of Phonemes
In order to convert text to speech, a text analysis module recognizes its composition and linguistic structure and converts this representation to speech. This process is linked with linguistic theories, manners of speech generation, and the acoustic and phonetic representation of the language. Linguists think that it is possible to represent the phonetics of the language through basic acoustic elements called phonemes. Phonemes are the characteristic sounds of a given language. However, to create an avatar not only phonemes are required; the visual representation of phonemes is essential. The visual representation of a phoneme is called a visem. It should be noticed that not every phoneme has a separate visual representation and that the relations between image and sound are not unequivocal5. The image is affected not only by acoustic elements but also by prosodic and paralinguistic features like breathing, eye and eyebrow movements, head movements and gestures. First experiments convinced us to use divisems for lip synchronization. Our observation is that for human perception a transition between visems is more important than a steady state of the lips. A divisem is a creation analogous to a diphone. A diphone begins in the first half of the first phone and ends in the half of the next phone; per analogiam, a divisem begins in the first half of the first visem and ends in the half of the following one. Many visem visualizations look similar, so they are linked into one group. This is correct, because there are sounds for which the lip position is almost identical and what differentiates them is, e.g., the type of phonation.
Table 1. Disney notation for Polish (SAMPA coded phones [4]). Each group of phones shares one basic lip position:
a, i
ts, d, g, j, k, n, r, s, x, I, z, dZ, tS, S, s', ts', z', dz', n'
e, e~
f, v
b, m, p
v, w
l, t
u
o, o~
2.2
Lip Movement in Two-Dimensional Animation
Following the notation introduced by Disney3, it is enough to have seven basic positions of the lips to create an impression of speech. It was accepted by animators all over the world – only a few language-specific changes need to be implemented. In order to synchronize the voice with the lip movements it is necessary to record the voice first and then create an "animation recipe". It is a table that contains frames with the characteristic phonemes. Blank places between them are filled using morphing techniques. To get more realistic lip movements it is also necessary to use "blank" lip positions which represent the transition between the expressed ones. The animator should also remember that the human face is almost never still. Thus additional, accidental movements of the lips should be added. Disney notation is used mainly in cartoons. The main targets of such films are children, who can forgive almost everything, so the animations don't have to be perfect. Disney notation is described in Table 1.
We propose a few simple rules important for the synchronization process: • first record the dialogue, later animate the face • phonemes should be animated according to their articulation: • never forget about single vowels; vowels are really strong and should be accented using special visems • for affricates pay special attention to the first part • lips don't move during nasal consonants, so use blank frames • plosives at the end of a word can be skipped • fricatives like /s, z, s'/, etc. shouldn't be skipped; they are very long and should have more frames. Never skip a phoneme that is located at the beginning of a word.
2.3
Lip Movement in Three-Dimensional Animation
Table 2. Visems in 3D animation.
a, i. Words: aktor, jabłko, izba, nic
ts, tS, ts', d, dz, dZ, g, x, j, k, N, n, r, s, S, s', I, z, Z, z'. Words: cały, czoło, ćma, dżem, jabłko, słoń, ziemia, źrebię
e, e~. Words: esej, gęś
f. Words: firma
l, t. Words: los, tom
b, m, p. Words: bar, dom, zapałka
o~, o. Words: jądro, słońce
w, u, v. Words: łąka, sługa, wąwóz
BLANK. It is not a visem but only a lip shape that is used during pauses.
For simplified audiovisual synthesis only a few visems are necessary to create a three-dimensional "chatting head". Nine is enough – eight for the basic lip shapes and one "blank" for pauses. In Table 2 the basic lip shapes for Polish are shown. They have been adapted from existing works2 – the basic lip dimensions for Polish sounds were taken and the allophones described there were grouped into visems. Three-dimensional animation is much more demanding than two-dimensional. A human face is so well known that it is hard to forgive even a slight mistake. The most complicated task was to create a three-dimensional humanoid head, and it took most of the time3. Knowledge of head anatomy was needed at this stage. Special attention was paid to the eyes, lips, mouth, teeth and nose – the characteristic parts of the human face. They all play an important role in the process of speaking. Before starting the animation it was necessary to create a database of face expressions. Based on that, visems were created. Visems were used in the simulation of lip movements. The easiest way was to compute the transitions between the divisems for a given utterance. Then .avi files can be used for morphing. For a human face we can use two types of morphing: divided morphing (which allows modifications of many face parts at the same time) and morphing to many targets (which allows creating different emotions). In audiovisual synthesis divided morphing is better to use. Speech animation is not only animation of lip shapes but also of the tongue, teeth and eyes. The best way to start is to create a library of morphings. It should contain different emotions and different tongue positions. To achieve realistic lip animation it was necessary to create a library of divisems. In our work we created about 400 .avi files as a result of morphing on two-dimensional pictures of the three-dimensional head. The final files were saved in MPEG format with a frequency of 100 fps. These are used to generate fluent transitions between different lip shapes. The synchronization of the "chatting head" with the voice is quite complicated. For that a computer program has been prepared. It reads SAMPA-coded text, generates an appropriate MBROLA file, calls the speech synthesizer and generates audio, searches the .avi library and concatenates the video parts. Finally, the animation is displayed on the screen synchronously with the audio. The program can also change the speaking rate. Additionally, it
can generate video clips which can be added to the library – so the program can be used with an arbitrary set of animations. The program can be connected to the Festival system, so that the audiovisual synthesis is done directly from text.
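The synchronization program described above can be pictured with the following sketch; the phone-to-visem table, the file layout and the helper names are placeholders, not the actual system's interfaces, and the external MBROLA and player calls are only indicated in comments.

from pathlib import Path

# Assumed, heavily abridged phone-to-visem grouping (compare Table 2).
PHONE_TO_VISEM = {"a": "a_i", "i": "a_i", "e": "e", "b": "b_m_p", "m": "b_m_p", "p": "b_m_p"}

def synthesize_utterance(sampa_phones, durations_ms, avi_library, workdir="out"):
    """Write the MBROLA input file and collect the divisem clips to concatenate.

    avi_library maps (visem_a, visem_b) pairs to pre-rendered clip paths.
    """
    workdir = Path(workdir)
    workdir.mkdir(exist_ok=True)

    # 1. One "phone duration" line per phone for the acoustic synthesis.
    pho_file = workdir / "utterance.pho"
    pho_file.write_text("\n".join(f"{p} {d}" for p, d in zip(sampa_phones, durations_ms)))

    # 2. Map phones to visems and look up the clip for every divisem transition.
    visems = [PHONE_TO_VISEM.get(p, "BLANK") for p in sampa_phones]
    clips = [avi_library[(a, b)] for a, b in zip(visems, visems[1:]) if (a, b) in avi_library]

    # 3. Here the real system calls MBROLA on pho_file, concatenates the clips
    #    and displays the result synchronously with the generated audio.
    return pho_file, clips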
3.
CONCLUSIONS
It is a big step from easy two-dimensional lip animation to three-dimensional animation. Cartoons with their simple graphics are not so demanding as far as details are concerned. However, when we want to get a "chatting head" the animation process is much more complicated. Special attention has to be paid to details, because the three-dimensional model is more or less similar to a human face and even a slight mistake is easy to recognize. Both in two- and three-dimensional animation only a few lip shapes were used. The final effect depended mainly on the way the transitions between them are modeled. The computer program allows synthetic speech to be synchronized with lip movements starting from text input.
REFERENCES 1. Black A., Taylor P., Festival Speech Synthesis System: system documentation (1.1.1), Human Communication Research Centre Technical Report HCRC/TR-83, Edinburgh, 1997 2. Bolla A., Phonetic conspectus of Polish, Hungarian Papers in Phonetics, Budapest, 1987 3. Fleming B., Animacja cyfrowych twarzy, Darris Dobbs, Helion, Gliwice, 2002 4. Marasek K., Large Vocabulary Continuous Speech Recognition System for Polish, Archives of Acoustics, Vol 28, 4, 293-303, 2003 5. McGurk H., McDonald J., Hearing lips and seeing voices, Nature, Vol. 264, pp. 746-748, 1976 6. Szklanny K., Przygotowanie bazy difonów języka polskiego dla realizacji syntezy mowy w systemie MBROLA, 50. Otwarte Seminarium z Akustyki, Szczyrk, 2003 7. CSLU Toolkit, http://www.cslu.ogi.edu/toolkit/
REPRESENTATION AND VISUALIZATION OF DESIGNS WITH THE USE OF HIERARCHICAL GRAPHS Piotr Nikodem Institute of Computer Science Jagiellonian University Nawojki 11 30-072 Krakow [email protected]
Abstract
1.
This paper deals with representations of artefacts used during the computer-aided design process. Two representations are proposed: an external one, based on the XML format, and an object-oriented one internal to the application. A graphical visualization of the designed model is also presented on a skeletal structure example.
INTRODUCTION
This paper describes an approach to representing structures in computer design systems. Representation problems emerged during a research project dealing with developing a tool for the optimum design of skeletal structures. A complete solution is proposed, which contains a theoretical model based on CP-graphs and hCP-graphs, an external format based on the XML standard and an object-oriented format internal to the application. Composition graphs have been proved to represent a wide range of objects, and are chosen as the model1,2. We decided that the external format should be based on the XML standard. There are several reasons for choosing XML as the basis: XML is a very popular data format and a standard controlled by the World Wide Web Consortium, which enables easy information exchange; a large number of tools for editing, manipulating and parsing XML is available; and XML enables easy file validation, using DTDs or schemas. An external format should be able to store the information from the model: the object structure and the additional information assigned to object elements. Initially, existing XML-based graph description formats were taken into consideration.
The closest solutions are the GML3 and GraphXML4 languages. Although GML is a powerful description language for graph drawing purposes and includes an extension mechanism, associating external data with a graph element is not well-defined, which causes adaptation problems. GraphXML is a complex format that gives the possibility to define hierarchical graphs, and has a built-in extension mechanism. Although this solution was closest to our needs, problems emerged during adaptation. Both the hierarchical structure and the additional data assigned to graph elements could be realized only by extensions, thus a large amount of work was required. Therefore we decided to develop our own format, simpler but more universal and capable of uncomplicated data extension. In association with the XML structure, an object model was developed and implemented in the Java programming language. For design visualization Java technologies were used as well. In the subsequent chapters all parts of the solution will be presented.
2.
GRAPHS AND HIERARCHICAL GRAPHS IN DESIGNING
The main advantage of a graph structure is its capability to include relationships between components besides information about the components themselves. In this approach composition graphs were used initially1,2, which have been proved to be able to represent a wide range of objects. There is, however, a variety of problems for which the CP-graph representation is insufficient. Relations between components can be not only spatial, but also hierarchical. That leads us to a better solution: an extension of the composition graph, the hierarchical CP-graph (hCP-graph). The possibility of using hierarchical graphs in designing has already been presented5,6. Formally, a hierarchical node is a triple (i, B, C), where i is a node identifier, B is a set of bonds and C is a set of the node's children. A bond is a pair (i, id), where i is the identifier of the node to which the bond belongs and id is the identifier of the bond itself. An edge e is a set {b, b'}, where b and b' are bonds belonging to the nodes connected by the edge e. Let X be a set of hierarchical nodes with a finite number of bonds and children. A hierarchical CP-graph G is defined as a pair (V, E), where V is a subset of X, and E is a set of edges, satisfying the following conditions: node identifiers are unique (and hence bonds are also unique), edges do not connect nodes related hierarchically, at most one edge can be connected to a bond, and a node may have at most one direct ancestor. Nodes and edges in hierarchical graphs can be labelled and attributed. Attributes can represent geometrical properties (position, size), but also visual
(colour) or any required by the problem. In order to create a visualization of the designed artifacts an interpretation is necessary. An interpretation of a given graph G is defined as a pair of functions (IV, IE), where IV assigns geometrical objects to nodes and their fragments to bonds, while IE establishes a correspondence between edges and sets of relations between objects. The geometrical objects used depend on the domain of the problem.
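The reference implementation discussed later in the paper is written in Java; purely as an illustration of the definitions above, a compact sketch of the hierarchical node, bond and graph structures might look as follows (the names and the minimal consistency check are assumptions, not the paper's code).

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Bond:
    node_id: str      # identifier i of the node the bond belongs to
    bond_id: str      # identifier id of the bond itself

@dataclass
class HierNode:
    node_id: str                                      # identifier i
    label: str = ""
    bonds: list = field(default_factory=list)         # set B of bonds
    children: list = field(default_factory=list)      # set C of child nodes
    attributes: dict = field(default_factory=dict)    # e.g. position, size, colour

@dataclass
class HierCPGraph:
    nodes: list = field(default_factory=list)         # V, a subset of X
    edges: list = field(default_factory=list)         # E, each edge a pair of bonds

    def add_edge(self, b1: Bond, b2: Bond):
        # At most one edge may be attached to a bond; a fuller implementation
        # would also check that b1 and b2 do not belong to hierarchically
        # related nodes and that all identifiers are unique.
        if any(b1 in e or b2 in e for e in self.edges):
            raise ValueError("bond already used by an edge")
        self.edges.append((b1, b2))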
3.
XML-BASED GRAPH REPRESENTATION
The proposed format is based on the XML standard. Each valid file contains a single root element, inside which any number of graph elements may appear. Each element inside a single graph has its unique identifier and label. Graph data is structured in the following way: a graph element contains any number of node and edge tags; a node element contains any number of child node tags (hierarchical graphs only) and bond tags; an edge element contains source and target properties which identify the appropriate bond elements; an edge element also has a kind property which determines whether the edge is directed (and, if it is directed, also its direction). The format described above is capable of representing a CP-graph or hierarchical CP-graph structure. Below a sample XML representation of a graph is presented.
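The original sample is not reproduced here; the following snippet only illustrates the structure described above, and the concrete tag and attribute names (graphs, graph, node, bond, edge, kind, source, target) are assumptions rather than the format's exact syntax.

<!-- Illustrative sketch only: element and attribute names are assumed. -->
<graphs>
  <graph id="g1" label="tower">
    <node id="n1" label="joint" x="0.0" y="0.0">
      <bond id="b1"/>
      <bond id="b2"/>
    </node>
    <node id="n2" label="joint" x="1.0" y="2.5">
      <bond id="b3"/>
    </node>
    <edge kind="undirected" source="b1" target="b3"/>
  </graph>
</graphs>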
The elements presented above describe only the structure of the graph. To complete the description, a mechanism for representing attributes of elements should be provided. Attributes may represent geometric positions, physical parameters or visualization hints of a particular element. Such information can be added as properties of tags placed inside the proper tag. This approach guarantees an effortless extension mechanism for elements' additional attributes. In the considered sample XML the coordinates of the nodes are stored in this way.
4.
OBJECT-ORIENTED GRAPH REPRESENTATION
The XML-based format is adequate for storing graphs, interchanging them among applications or manipulating them in editors, but problems arise when operations on graphs are required. Then it appears that XML is inefficient and inconvenient to use. To fill this gap an object representation is proposed. The graph structure is represented in a pure object model. Each element is an object and relations to other elements are represented as object references. Several interfaces are created for the standard functionality of graph elements:
IGraphElem - groups common graph element functionality (setting and getting unique ids, labels and attributes); all elements extend this interface,
IGraph - functionality for inserting and deleting nodes and edges,
INode - enables inserting and deleting bonds,
IBond - connects an edge with a node,
IEdge - connects two nodes,
IHierGraph - represents a hierarchical graph, extends IGraph, gives functionality for manipulating nodes on different levels of the hierarchy,
IHierNode - represents a node in a hierarchical graph, extends INode, gives functionality for manipulating child nodes.
The described interfaces can be easily implemented in any object-oriented language. Reference implementations for graphs and hierarchical graphs are created in the Java programming language. The implementations are placed in separate packages for maximum flexibility and code reusability as well. The packages
also contain externalization classes for storing and restoring graphs from XML files (the XML parser from JDK 1.4 was used). External data is validated during restoration, thus a schema definition is not required. All additional attributes defined for elements are also streamed into the appropriate objects without any code modifications.
5.
ARTEFACT VISUALIZATION ON SKELETAL STRUCTURES EXAMPLE
Visualization is an important part of computer design support systems. The presentation system described here is currently limited to skeletal structures only. The realization is consistent with the interpretation functions, part of CP-graph theory. Graph nodes and edges are represented by defined graphic primitives with respect to the elements' labels and attributes. The mapping is realized within the program code. Two visualisations are presented: 2-dimensional and 3-dimensional.
Figure 1. 2D visualization of flat transmission tower. B&W mode. Lines thickness and patterns are associated with forces in bars.
6.
CONCLUSIONS AND FUTURE WORK
Hierarchical graphs have been proved to be a promising tool for object description. The XML-based format proposed in this paper is used to represent skeletal structures and has shown a capability for uncomplicated extensions. It is expected to be usable for the depiction of other objects without any modification.
Figure 2. 3D visualization of the same transmission tower.
Visualization is assigned to one class of design problems and still requires much work. An XML format for defining the graphical representation of elements is planned. The visualization description file will be dynamically connected to classes of objects or to single objects. A more general and configurable visualization tool should also be developed.
REFERENCES 1. E. Grabska. (1993). Theoretical Concepts of Graphical Modelling. Part one: Realization of CP-graphs. MG&V, 2(1), 3-38. 2. E. Grabska. (1993). Theoretical Concepts of Graphical Modelling. Part two: CP-graph Grammars and Languages. MG&V, 2(2), 149-178. 3. M. Himsolt. (1997). GML - Graph Modelling Language. University of Passau. 4. I. Herman, M.S. Marshall. (2000). GraphXML - An XML based graph interchange format. Amsterdam: Centrum voor Wiskunde en Informatica. 5. E. Grabska, W. Palacz. (2000). Hierarchical graphs in creative design. MG&V, 9(1/2), 115-123. 6. A. Borkowski, E. Grabska, P. Nikodem, B. Strug. (2003). Searching for innovative structural layouts by means of graph grammars and evolutionary optimization. Rome: 2nd International Structural Engineering and Construction Conference.
EVOLUTIONARY APPROACH FOR DATA VISUALIZATION M. Sarfraz, M. Riyazuddin, and M. Humayun Baig Department of Information and Computer Science, King Fahd University of Petroleum and Minerals, KFUPM # 1510, Dhahran 31261, Saudi Arabia.
Abstract:
The Simulated Annealing heuristic is used for the weight optimization of NURBS for the visualization of data. The objective of the method is to visualize data by reducing the fitting error and obtaining a smooth curve.
Key words:
Data, Visualization, Simulated Annealing, NURBS, Algorithm.
1.
INTRODUCTION
Scientists use curve fitting for visualization in many applications such as data reduction, approximating noisy data, curve and surface fairing, and image processing applications like generating smooth curves from digitized data1. In the past researchers used analytical functions for curve fitting the input data. Since the shape of the underlying function of the data is frequently complicated, it is difficult to approximate it by a single polynomial. In this case, a spline and its variants are the most appropriate approximating functions2. A kth degree B-spline curve is uniquely defined by its control points and knot values, while for Non-Uniform Rational B-Spline (NURBS) curves the weight vector has to be specified in addition3. Through the manipulation of control points, weights, and/or knot values, users can design a vast variety of shapes using NURBS. Despite NURBS' power and potential, users are faced with the tedium of non-intuitively manipulating a large number of geometric variables4. Due to the high number of scanned data points, non-deterministic optimization strategies have to be applied to gain optimal approximation
results. Evolutionary algorithms show great flexibility and robustness5. Genetic algorithms and Tabu search have been applied to optimize NURBS parameters. Knots and the weights corresponding to the control points have been optimized using genetic algorithms6 for curve data. Tabu Search7 has been applied to optimize NURBS weights for surface data. In this paper, we have applied Simulated Annealing (SA) to the optimization of NURBS weights for curve data. The remainder of the paper is structured as follows. Section 2 briefly reviews NURBS. In Section 3, we review the Simulated Annealing (SA) optimization heuristic and also discuss the results. Finally, we conclude the paper in Section 4.
2.
THE NURBS
A NURBS curve generalizes a B-spline curve. It is the rational combination of a set of piecewise basis functions with n control points pi and their associated weights wi:

c(u) = ( Σi=1..n pi wi Bi,k(u) ) / ( Σi=1..n wi Bi,k(u) ),

where u is the parametric variable and Bi,k(u) are the B-spline basis functions. Assuming basis functions of degree k-1, a NURBS curve has n + k knots ti in non-decreasing sequence: t1 ≤ t2 ≤ … ≤ tn+k-1 ≤ tn+k. The basis functions are defined recursively using uniform knots as

Bi,1(u) = 1 for ti ≤ u < ti+1, and 0 otherwise,

with

Bi,k(u) = ((u − ti) / (ti+k-1 − ti)) Bi,k-1(u) + ((ti+k − u) / (ti+k − ti+1)) Bi+1,k-1(u).

The parametric domain is tk ≤ u ≤ tn+1. From the users' point of view, the NURBS knots are used to define the B-spline basis functions implicitly. Note that only the relative positions of consecutive knots are actually used to determine its geometry. For further characteristics of NURBS, the reader is referred to Piegl's work8.
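For illustration, the basis recursion and the rational combination above can be evaluated directly; the following sketch is not part of the original paper and uses 0-based lists with order k (degree k-1).

def bspline_basis(i, k, u, t):
    """B-spline basis function Bi,k(u) over the knot list t (0-based indices)."""
    if k == 1:
        return 1.0 if t[i] <= u < t[i + 1] else 0.0
    left = 0.0
    if t[i + k - 1] != t[i]:
        left = (u - t[i]) / (t[i + k - 1] - t[i]) * bspline_basis(i, k - 1, u, t)
    right = 0.0
    if t[i + k] != t[i + 1]:
        right = (t[i + k] - u) / (t[i + k] - t[i + 1]) * bspline_basis(i + 1, k - 1, u, t)
    return left + right

def nurbs_point(u, ctrl, weights, k, t):
    """Evaluate c(u) as the weighted rational combination of the control points."""
    num = [0.0] * len(ctrl[0])
    den = 0.0
    for i, (p, w) in enumerate(zip(ctrl, weights)):
        b = w * bspline_basis(i, k, u, t)
        den += b
        num = [n + b * c for n, c in zip(num, p)]
    return [n / den for n in num] if den else num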
3.
SIMULATED ANNEALING
We have used the Simulated Annealing optimization heuristic to optimize weights of the NURBS curve for data visualization. The SA algorithm9 was first proposed as a means to find equilibrium configuration of a collection of atoms at a given temperature. Kirkpatrick et al10 were the first to use the connection between this algorithm and mathematical minimization as the basis of an optimization technique for combinatorial (as well as other) problems. SA’s major advantage over other methods is its ability to avoid being trapped in local minima. The algorithm employs a random search, which not only accepts changes that decrease the objective function E, but also some changes that would increase it. The latter are accepted with a probability Prob(accept) = exp(-ΔE/T), where ΔE is the increase in E and T is a control parameter, which by analogy with the original application is known as the system “temperature” irrespective of the objective function involved. Briefly SA works in the following way. Given a function to optimize and some initial values for the variables, simulated annealing starts at a high artificial temperature. While cooling the temperature slowly, it repeatedly chooses a subset of the variables, and changes them randomly in a certain neighborhood of the current point. If the objective function has a lower function value at the new iterate, the new values are chosen to be the initial values for the next iteration. If the objective function has a higher function value at the new iterate, then the new values are chosen to be the initial values for the next iteration with a certain probability, depending on the change in the value of the objective function and the temperature. In order to implement simulated annealing, we need to formulate a suitable cost function for the problem being solved. In addition, as in the case of local search techniques, we assume the existence of a neighborhood structure, and need Neighbor function to generate new states (neighborhood states) from current states. And finally we need a cooling schedule that describes the temperature parameter T and gives rules for lowering it. The parameter value tj for each data point is a measure of the distance of the data point along the curve. One useful approximation for this parameter value uses the chord length between data points. Specifically, for j data points, the parameter value at the lth data point is
t1 = 0,
tl / tmax = ( Σs=2..l |Ds − Ds-1| ) / ( Σs=2..j |Ds − Ds-1| ),   l ≥ 2.

The maximum parameter value, tmax, is usually taken as the maximum value of the knot vector.
The control points are calculated using the least squares technique. A fairer or smoother curve is obtained by specifying fewer control polygon points than data points, i.e. 2 ≤ k ≤ n < j. Recalling that a matrix times its transpose is always square, the control polygon for a curve that fairs or smoothes the data is given by

[D] = [B] [P]
[B]T [D] = [B]T [B] [P]
[P] = [ [B]T [B] ]-1 [B]T [D],

where [D]T = [ D1(t1) D2(t2) . . . Dj(tj) ] are the data points, [P]T = [ P1 P2 . . . Pn+1 ] are the control points and [B] is the set of B-spline basis functions. The evaluation of the control points by least squares approximation can be viewed as an initial estimate of the fitted curve. Further refinement can be obtained by optimizing the different NURBS parameters, such as the knot values and the weights, in order to achieve better fitting accuracy. The error function (or cost function) between the measured points and the fitted curve is generally given by the following equation:

E = ( Σi=0..s ||Qi − S(α1, …, αn)||^r / s )^(1/r),

where Q represents the set of measured points; S(α1, …, αn) is the geometric model of the fitted curve, where (α1, …, αn) are the parameters of the fitted curve; s is the number of measured points and r is an exponent, ranging from 1 to infinity. The fitting task can then be viewed as the optimization of the curve parameters (α1, …, αn) to minimize the error (or cost) E. In case the exponent r is equal to 2, the above equation reduces to the least squares function. A recent publication11 showed that better results could be obtained by optimizing the weights while keeping the knot values uniformly distributed. However, the weights present a large number of independent variables (equaling the number of control points) to the optimization problem, which may lead to a large search space. Therefore, global optimization techniques are needed for optimizing such problems. We have used the Simulated Annealing optimization heuristic to optimize the weights of the NURBS curve. The initial solution S0 of the weight vector is randomly selected from the range [0, 0.5]. The number of elements in the weight vector corresponds to the number of control points. The cooling schedule10 used here is based on the idea that the initial temperature T0 must be large enough to virtually accept all transitions and that the changes in the temperature at each invocation of the Metropolis loop are small. The scheme provides guidelines for the choice of T0, the rate of decrement of T, the termination criterion and the length of the Markov chain (M).
Initial temperature T0: The initial temperature must be chosen so that almost all transitions are accepted initially. That is, the initial acceptance ratio χ(T0) must be close to unity, where

χ(T0) = (number of moves accepted at T0) / (total number of moves attempted at T0).

To determine T0, we start off with a small value of the initial temperature, T'0, in the Metropolis function. Then χ(T'0) is computed. If χ(T'0) is not close to unity, then T'0 is increased by multiplying it by a constant factor larger than one. The above procedure is repeated until the value of χ(T'0) approaches unity. The value of T'0 is then the required value of T0. Decrement of T: A decrement function is used to reduce the temperature in a geometric progression, and is given by Tk+1 = α Tk, k = 0, 1, …, where α is a positive constant less than one, since successive temperatures are decreasing. Further, since small changes are desired, the value of α is chosen very close to unity, typically 0.8 ≤ α ≤ 0.99.
Figure 1. Input and output for English alphabet “R” (left); visualizing data for open curve (right).
Length of Markov chain M: This is equivalent to the number of times the Metropolis loop is executed at a given temperature. If the optimization process begins with a high value of To, the distribution of relative frequencies of states will be very close to the stationary distribution. In such a case, the process is said to be in quasi equilibrium. The number M is based on the requirement that at each value of Tk quasi equilibrium is restored. Since at decreasing temperatures uphill transitions are accepted with decreasing probabilities, one has to increase the number of iterations of the Metropolis loop with decreasing T (so that the Markov chain at that particular temperature will remain irreducible and with all states being non null). A factor β is used (β > 1) which, in a geometric progression, increases
the value of M. That is, each time the Metropolis loop is called, T is reduced to αT and M is increased to βM. The neighborhood of each element of the weight vector is randomly selected within the range [weight_element_value, weight_element_value + 1]. Since the number of elements of the weight vector equals the number of control points, this range is selected in order to optimize the locality of the search. Figure 1 demonstrates obtaining an outline of the English alphabet letter "R" and an open data curve generated from a function.
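Putting the pieces together, the weight-optimization loop can be sketched as follows. The cooling constants, chain length and the fitting_error callable are placeholders, while the neighbourhood and the Metropolis acceptance follow the description above.

import math, random

def anneal_weights(weights, fitting_error, T0, alpha=0.9, beta=1.1, M0=50, T_min=1e-3):
    """Simulated annealing over the NURBS weight vector (illustrative sketch).

    weights       : initial weight vector (e.g. drawn from [0, 0.5])
    fitting_error : callable returning the cost E (the error function above)
    """
    current = list(weights)
    current_cost = fitting_error(current)
    T, M = T0, M0
    while T > T_min:
        for _ in range(int(M)):
            candidate = list(current)
            i = random.randrange(len(candidate))
            candidate[i] = candidate[i] + random.random()      # neighbour in [w_i, w_i + 1]
            cost = fitting_error(candidate)
            dE = cost - current_cost
            if dE < 0 or random.random() < math.exp(-dE / T):  # Metropolis acceptance
                current, current_cost = candidate, cost
        T *= alpha       # geometric cooling
        M *= beta        # longer Markov chains at lower temperatures
    return current, current_cost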
4.
CONCLUSION
An evolutionary approach to data visualization with NURBS for curve fitting using the Simulated Annealing heuristic has been presented. The Simulated Annealing optimization algorithm is used for the global optimization of the fitting error between a set of scanned points and a fitted curve.
REFERENCES 1. J. J. Chou & L. A. Piegl, Data Reduction using Cubic Rational B-Splines, IEEE Computer Graphics & Applications (1992). 2. Y. Yoshimoto, M. Moriyama and T. Harada T, Automatic Knot Replacement by a Genetic Algorithm for Data Fitting with a Spline. Shape Modeling and Applications, 1999. Proceedings of the International Conference on Shape Modeling International, 162 -169, (1999). 3. M. Hoffmann & I. Juhasz, Shape Control of Cubic B-spline and NURBS Curves by Knot Modifications. Proceedings of the International Conference on Information Visualization (IV 2001), IEEE Computer Society Press London (2001). 4. H. Xie & H. Qin, Automatic Knot Determination of NURBS for Interactive Geometric Design, IEEE, (2001). 5. F. Pontrandolfo, G. Monno & A. E. Uva, Simulated Annealing Vs Genetic Algorithms for Linear Spline Approximation of 2D Scattered Data. XII International Conference, Rimini, Italy, (2001). 6. S. A. Raza, Visualization with Spline using a Genetic Algorithm. Master Thesis, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia (2001). 7. A. M. Youssef, Reverse Engineering of Geometric Surfaces using Tabu Search Optimization Technique, Master Thesis, Cairo University, Egypt (2001). 8. L. Piegl, On NURBS: A Survey, IEEE computer graphics & applications, 11(1): 55-71, (1991). 9. N. Metropolis, A. Roshenbluth, M. Rosenbluth, A. Teller & E. Teller, Equation of State Calculations by Fast Computing machines. J. Chem. Phys., 21(6), 1087-1092 (1953). 10. S. Kirkpatrick, Jr. C. Gelatt, & M. Vecchi, Optimization by Simulated Annealing, Science, 220(4598): 498-516, (1983). 11. M. M. Shalaby, A. O. Nassef, & S. M. Metwalli, On the Classification of Fitting Problems for Single Patch Free-Form Surfaces in Reverse Engineering, Proceedings of the ASME Design Automation Conference, Pittsburgh, (2001).
CHAOS GAMES IN COLOR SPACE
Hadrian Jakóbczak Jagiellonian University, Institute of Computer Science, Nawojki 11, 30-072 Kraków, email:[email protected]
Abstract:
The paper proposes an extension of fractal generation with the use of chaos games to multidimensional spaces. Brightness of color is used to simulate additional higher dimensions. Examples of the generated fractals are shown.
Key words:
chaos games; fractal geometry; color.
1.
INTRODUCTION
Using Iterated Function Systems is one of the best known and most efficient ways of generating fractal images. One kind of this method is commonly called Chaos Games. This algorithm is well known for 2-dimensional patterns1. But the game can obviously take place in spaces of higher dimensions. The only difference is that showing the effect can be troublesome. It is not so bad if we try to generate patterns in 3-dimensional space, but what about the fourth or fifth dimension? Is there any way to give us intuition about the objects which we have already made? Treating color as an additional dimension has also been known for a long time2, but it is not often used. Computers give new possibilities of using this method to generate more complex fractals. We treat the brightness of each of the basic colors – red, green and blue – as one new dimension in order to get color patterns. We make experiments in order to show the variety of possible effects of this application and to explore how different initial conditions can affect the result.
2.
CHAOS GAMES – GENERAL RULES
Usually Chaos Games take place in 2-dimensional space and the pattern is generated according to simple rules. Before starting the game we have to choose special points, called base points. Base points stay constant during the game and one of their roles is to bound the pattern which is generated. Let b1, b2, ..., bm ∈ Rn be base points and p1, p2, ..., pm ∈ [0, 1] be probabilities assigned to them, i.e., each value pi represents the probability of choosing bi during the game. Let us assume that s ∈ Rn is a start point. Now we can start generating the fractal pattern. Each step of generation is as follows.
1. With suitable probability we choose one of the numbers 1, 2, ..., m. Let us say we have chosen i.
2. We change the position of s (it is now called the guide point). The new guide point s' will be situated at half of the distance between the old s and the chosen bi (the fixed proportion can be different).
3. We draw s' on the screen.
4. We come back to the beginning in order to choose the next number.
The procedure should be repeated until we can see a satisfying approximation of the fractal pattern, for example until the moment when we can not see any more changes after a new iteration.
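A minimal sketch of these rules (not taken from the paper) is given below; as an example it uses the five-point, three-coordinate configuration that appears in the experiments of Section 4, with the coordinates beyond the first two intended to be mapped to colour brightness when the points are plotted.

import random

def chaos_game(base_points, probabilities, start, steps=100000, ratio=0.5):
    """Generate chaos-game points in [0, 1]^n.

    Each iteration picks a base point according to its probability and moves
    the guide point a fixed fraction of the way towards it (0.5 by default).
    """
    s = list(start)
    points = []
    for _ in range(steps):
        b = random.choices(base_points, weights=probabilities, k=1)[0]
        s = [(1 - ratio) * si + ratio * bi for si, bi in zip(s, b)]
        points.append(tuple(s))
    return points

# Example: the five-point game with a third, "red depth" coordinate.
pts = chaos_game(
    base_points=[(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0), (0.25, 0.25, 1)],
    probabilities=[0.2] * 5,
    start=(0.5, 0.5, 0.5),
)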
3.
COLOR AS THE METHOD OF SIMULATING HIGHER DIMENSIONS

In this paper we simulate fractals in the spaces R3, R4 and R5. At first we bound the scene to multidimensional cubes [0, 1]n, n = 3, 4, 5. Because we are usually interested in bounded fractal patterns, this restriction is not serious – each bounded fractal can be scaled in order to fit in the [0, 1]n cube. But the basic problem is how to put on the screen points with more than two coordinates. Our proposition is to use color in RGB mode. The color of a point is represented by three coordinates, responsible for the brightness of red, green and blue, which take values between 0 and 1. We use each of them to simulate one dimension.
4.
EXPERIMENTS IN 3-DIMENSIONAL SPACE
The first example of a chaos game takes place in the square [0, 1]2. The base points are set as follows: b1 = (0, 0), b2 = (0, 1), b3 = (1, 0) and b4 = (1, 1), and the probabilities p1 = p2 = p3 = p4 = 0.25. The game starts from s = (0.5, 0.5), while the new guide point s' = (s'1, s'2) is calculated using the formula s' = ((s1 + bi1) / 2, (s2 + bi2) / 2). Figure 1 presents the effect of this game. The result (a red square) does not seem interesting. We can modify the initial conditions by adding another base point b5 = (0.25, 0.25). The probability of choosing each of the base points is still equal: p1 = p2 = p3 = p4 = p5 = 0.2. The result of this game is shown in figure 2. A fractal looking like a snow flake can be easily noticed, but there are still only two dimensions. Our image is monochromatic – each point is drawn in red at maximum available brightness. In the next experiment we introduce a third dimension and move the last base point deeper in red. Let b1 = (0, 0, 0), b2 = (0, 1, 0), b3 = (1, 0, 0), b4 = (1, 1, 0), and b5 = (0.25, 0.25, 1), where the value of the third coordinate is 1, meaning the maximum possible brightness of red. When starting the generation process we expect that in the new fractal points could have a different red color, although they are put exactly in the same places as in the pattern in figure 2. We see the approximation of the first 3-dimensional fractal pattern in figure 3.
Figure 1. Red square – the result of simple chaos game in 2-dimensional space. In all pictures pattern is fitted to square [0, 1]2. In each picture lower left corner of the image represents point (0, 0) while upper right represents (1, 1).
Figure 2. The result of modified chaos game with 5 base points. The pattern is monochromatic, because it is still situated in 2-dimensional space.
Figure 3. The effect of chaos game after adding red dimension. Pattern consists of the same points as in figure 2, although it looks different because full range of brightness of red color is now available.
Figure 4. Fractal pattern from figure 3 after small change in rule of assigning colors to its points. The pattern is not monochromatic although only one variable is responsible for color of all points.
1104 Fractal shown in figure 3 differs from the previous one. It is worth noticing that its self-similarity can be seen not only in shape, but also in brightness of color. The lightest red color appears near the point m = (0.25, 0.25), but we can find another local maximum in each smaller (and darker) copy of main pattern, for example in the one situated in upper right corner. In each copy the lightest part appears in the same place. This effect is seen even better if we change color value assigned to points drawn on screen. Next chaos game has the same initial conditions and generate the same fractal, but color of each point is taken according to the rule c = (2r, r, 0), where r is responsible for brightness of red, so to calculate color we still use only one variable (fig. 4). Now we do next changes in initial conditions. One of them is adding another base point in order to obtain more complex pattern. Lets add new point b6 = (0.9, 0.3, 0.3). New pattern, however, does not seem to be much different from the previous one (fig. 5). Another simple change is to modify values of probability assigned to base points. In all experiments base points were treated equally but now we can make some of them “stronger” than others. So let us leave six base points in their positions, but with modified initial values of probability. Let p1 = p2 = p3 = 0.1, but p4 = 0.2, and p5 = p6 = 0.25 – three points are stronger, the other three are weaker. In the figure 6 we can see what happens.
Figure 5. After adding another base point new pattern looks similar to fractal in figure 4.
Figure 6. Fractal pattern from figure 5 with different values of probability assigned to base points. The shadow of Sierpinski triangle can be seen in the picture.
It can be noticed that both fractals (from figures 5 and 6) have a similar shape and color. But there are also a lot of differences. In the second pattern we recognize the shape of the Sierpinski triangle. This well known shape appears because the game takes place mainly between three selected points – those which have a higher value of probability – while the other points are not chosen so often. But what we achieve in our fractal is not a common version of the Sierpinski triangle. The pattern can be seen as 3-dimensional because the points which generate it are in fact situated in three dimensions, although the effect is drawn on a 2-dimensional screen. We can say that simple 2-dimensional graphic rules are used to simulate something which seems to have one dimension more. Playing the chaos game we do not render our scenes but only pick single points which have a suitably calculated color. This illusion is similar to pencil drawings, which try to suggest on a paper sheet things which are obviously 3-dimensional. Or maybe it is better to compare it with paintings, where a combination of shadows and highlights suggests a 3-dimensional world. At this point it is worth asking if this effect could be useful to improve fractal compression of images or for fast generation of artificial worlds using only 2-dimensional graphics. Before the next game we have to make some more changes. Up till now the experiments followed each other, but for a moment we break this chain and take some steps forward. We change old and introduce new base points. Points b1 to b4 remain in the same place but b5 = (0.25, 0.75, 1), b6 = (0.4, 0.1, 0.7) and b7 = (0.9, 0.3, 0.4). The highest values of probability are also assigned to these three mentioned points. What's more, we allow a sequence of two transformations s → s' → s'' in one step of the game – in this situation we draw only s''. These sequences also have a high probability of being chosen. It would be the same if we added some more strong base points. Playing the game leads to the pattern shown in figure 7.
Like the previous examples, this fractal is also drawn using the rule c = (2r, r, 0). Another (this time purely technical) change is enlarging the points drawn on the screen – it produces a visual effect similar to oil painting. Now we introduce a new rule c = (2r, 2r, 1 – r) and play the same game once more (fig. 8). Although the shape is exactly the same, the different colors make the new fractal look unlike the previous one. The new color palette may seem very complicated, but it is still only one variable that sets the color value, and we are still in 3-dimensional space.
Figure 7. The result of the chaos game in 3-dimensional space. The points drawn on the screen were enlarged and the pattern looks like an oil painting.
Figure 8. Fractal from figure 7 with different color palette.
5. EXPERIMENTS IN 4-DIMENSIONAL SPACE
It is time to add another color dimension – the fourth coordinate will be responsible for the brightness of green. In the first example of the 4-dimensional chaos game the color is calculated according to the simplest rule: c = (r, g, 0). There are six base points: b1 = (0, 0, 0, 0), b2 = (0, 1, 0, 0), b3 = (1, 0, 0, 0), b4 = (1, 1, 0, 0), b5 = (0.25, 0.25, 1, 0) and b6 = (0.6, 0.9, 0, 1). The corresponding probabilities are p1 = p2 = p3 = p4 = 0.1 and, for the last two points, p5 = p6 = 0.3. We can notice that these initial conditions are similar to those of the game which generates the pattern drawn in figure 3; we add a sixth base point, the only one whose last coordinate is not 0. Comparing the two fractals (figures 3 and 9) we can notice that the new pattern contains the old one – it is the part drawn in red. Another interesting feature of the pattern shown in figure 9 is the “overlapping” of red and green points situated near the diagonal. It would be interesting to zoom into this part of the image and explore the behavior of these points. In the next 4-dimensional experiment we increase the number of base points. Points b1 to b5 stay the same, but b6 = (0.6, 0.3, 0, 1), b7 = (0.8, 0.1, 0.5, 0.8) and b8 = (0.8, 0.1, 0.6, 0.1). The first four points have probability 0.1 each, the other four 0.15. In the pattern generated during the game (fig. 10) we can see fractal features: for example, the upper left part of the whole image is a smaller copy of it, a second copy is seen in the upper right corner, and the next in the lower part on the left. And, what is more important to our considerations, all copies have a similar color palette. If we see an orange fog in the lower right corner of the pattern, the same (but, of course, smaller) group of orange points appears in each copy of the main image. So we can talk about the fractal character of the shape, but also about the fractal character of the color. Similar rules were used to generate the next picture. We add a base point b9 = (0.3, 0.6, 0.8, 0.3) and change the probability distribution. It is also allowed to apply two transformations in one step. This leads to a pattern with characteristic diagonal stripes and many empty spaces (fig. 11). After the next change – b8 = (0.4, 0.1, 0.1, 0.1), with a different probability of being chosen for each base point, but with some dominating points – we get a different pattern, which seems to consist of color blocks. We observe that the largest blocks in the middle are the lightest, while the smaller ones are darker. We also see that some blocks overlap and each of them is divided by some snake-like curves (fig. 12).
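For illustration, the following minimal sketch (not the authors' implementation; it assumes the classical "move halfway toward the chosen base point" contraction) plays the first 4-dimensional game above: the first two coordinates give the screen position, the last two drive the color through the rule c = (r, g, 0).

// Minimal sketch of the 4-dimensional chaos game (assumptions: classical
// halfway contraction toward the chosen base point; first two coordinates are
// the screen position, the last two are the color parameters r and g).
#include <array>
#include <cstdio>
#include <random>

int main() {
    using P4 = std::array<double, 4>;                       // (x, y, r, g)
    const P4 base[6] = { {0, 0, 0, 0}, {0, 1, 0, 0}, {1, 0, 0, 0},
                         {1, 1, 0, 0}, {0.25, 0.25, 1, 0}, {0.6, 0.9, 0, 1} };
    const double prob[6] = { 0.1, 0.1, 0.1, 0.1, 0.3, 0.3 };

    std::mt19937 gen(42);
    std::discrete_distribution<int> pick(prob, prob + 6);   // weighted choice

    P4 p = { 0.5, 0.5, 0.5, 0.5 };                           // arbitrary start
    for (int i = 0; i < 100000; ++i) {
        const P4& b = base[pick(gen)];
        for (int d = 0; d < 4; ++d)
            p[d] = 0.5 * (p[d] + b[d]);                      // halfway step
        if (i < 100) continue;                               // skip transient
        // plot point (x, y) with color rule c = (r, g, 0)
        std::printf("%f %f  rgb(%f, %f, 0)\n", p[0], p[1], p[2], p[3]);
    }
    return 0;
}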
Figure 9. 4-dimensional fractal pattern consisting of two overlapping parts.
Figure 10. A more complex 4-dimensional pattern. The fractal character of the color distribution can be seen as well as the fractal character of the shape.
Figure 11. 4-dimensional fractal pattern with red, orange and green diagonal stripes.
Figure 12. 4-dimensional fractal pattern consisting of color blocks.
6. EXPERIMENTS IN 5-DIMENSIONAL SPACE
We can play a similar game in 5-dimensional space as well. The base points are set as follows: b1 = (0, 0, 0, 0, 0), b2 = (0, 1, 0, 0, 0), b3 = (1, 0, 0, 0, 0), b4 = (1, 1, 0, 0, 0), b5 = (0.25, 0.25, 1, 0, 1), b6 = (0.6, 0.3, 0, 1, 0.5), b7 = (0.8, 0.1, 0.5, 0.8, 1), b8 = (0.4, 0.1, 0.1, 0.1, 0.7) and b9 = (0.3, 0.6, 0.8, 0.3, 0.7). If we compare this with the previous initial conditions, we notice that the only change is adding another coordinate to each base point, while the first four coordinates are just as before. The probability values also stay unchanged, so the shape of the generated fractal pattern is preserved, although it looks a bit different; the only real difference is another color palette. The color of each point is now a combination of three parameters r, g, b – each representing the brightness of one basic color: red, green and blue – and the final color value follows the rule c = (2r, 2g, 1 – b). The impression that the new pattern contains more points comes from the fact that points which were almost black and invisible in the previous picture are now light blue (fig. 13). At the end of this outline we show another 5-dimensional fractal, probably the most interesting one. With a bit of imagination it can be seen as the effect of drawing with a color pencil (fig. 14).
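As a small illustration of such a rule, the hypothetical helper below clamps c = (2r, 2g, 1 – b) to valid RGB values; it is only a sketch of one possible reading of the rule, not the authors' code.

// Hypothetical helper: map the color coordinates (r, g, b) of a point to an
// RGB triple with the rule c = (2r, 2g, 1 - b), clamped to the valid range.
struct RGB { double r, g, b; };

RGB colorRule(double r, double g, double b) {
    auto clamp01 = [](double v) { return v < 0.0 ? 0.0 : (v > 1.0 ? 1.0 : v); };
    return { clamp01(2.0 * r), clamp01(2.0 * g), clamp01(1.0 - b) };
}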
Figure 13. 5-dimensional fractal consisting of color blocks. Although only the color palette was changed, the new pattern looks completely different from that in figure 12.
Figure 14. 5-dimensional fractal imitating drawing with color pencil.
7. CONCLUSION
In this paper a modified version of the chaos game has been proposed. The algorithm leads to a new family of beautiful fractal patterns, whose fractal character can be observed not only in their shape but also in their color. Several changes of the initial conditions are made to show how they affect the final image. These results should be examined in more detail; they can be useful in making color fractal animations and in exploring other features of color fractals. Two such features were noted during the experiments. It is sometimes possible to imitate complex 3-dimensional objects using only the simplest 2-dimensional graphics and an appropriate
color. It is also sometimes possible to simulate, in fractal terms, techniques used in art. Could this become a way of imitating the work of artists? Maybe; even if not, it is certainly another proof that simple mathematical methods can lead to complex, beautiful structures.
REFERENCES
1. H. O. Peitgen, H. Jürgens, D. Saupe, Chaos and Fractals, (Springer Verlag, 1997).
2. T. F. Banchoff, Beyond the Third Dimension, (W. H. Freeman & Co, 1990).
AN EFFECTIVE CONTOUR PLOTTING METHOD FOR PRESENTATION OF THE POSTPROCESSED RESULTS
A new approach to contouring
Irena Jaworska
Cracow University of Technology, Warszawska 24, 31-154 Cracow, Poland, [email protected]
Abstract:
The paper presents a new method of creating contour lines based on irregularly spaced data. The analysis has been focused on contour plots as a useful tool for visualization and real-time verification of the postprocessing stage in a multigrid computational environment. A new, effective contouring concept has been introduced and developed to satisfy this objective. The presented approach is based on OpenGL.
Key words:
contour lines, visualization, clipping planes, OpenGL, postprocessing
1. INTRODUCTION
The development of an advanced graphical modeler and visualization software1 for an unstructured 2D-mesh generator2, designed to deal with the adaptive multigrid solution approach3, constitutes the objective of this research. Numerous graphical modelers and visualization programs are available. However, many of them have not been designed to deal with iterative algorithms and allow neither instant visualization of computation progress nor effective mesh modification during computation. Taking these requirements into account, a new advanced graphical tool has been designed particularly for use with the multigrid mesh generator. The given geometry of the considered domain is used to prepare an initial, relatively coarse mesh and to establish the base for the adaptive procedure. Depending on the solution method, each mesh has to contain
different topological information. After the calculation, an a posteriori error analysis is carried out. The result of this analysis is used to prepare a list of new nodes, which are added to the mesh. As new nodes appear, the mesh topology information must be updated. The tool can visualize all the mesh-related information for every step of the generation process described above, and allows interactive modification as well as dynamic presentation of the series of meshes and of the solution during the adaptation process. The software can work on-line, providing instant visualization of intermediate as well as final results. The presented approach is based on OpenGL4 and its toolkits GLUT and GLUI. The research has recently been focused on the postprocessing visualization, and particularly on contour plots, as an effective presentation tool for the calculated solutions. For this purpose, contouring should be a real-time process and must calculate contour lines using only the available data.
2. MAIN CONTOURING TECHNIQUES
Many approaches to constructing contour lines exist, varying mainly in the interpolation algorithms used to determine the location of the contour line between data points. The basic techniques considered for inclusion in our code are discussed below.
2.1 Interpolation
In general, this is a three-step approach. The first step calculates contour lines using a triangular irregular network (TIN), a mesh, or in any other way. The next step smooths the contour lines, and the final one draws them and appends the elevation points. However, calculating the contour lines from the initial data, given at points scattered over an area (at nodes of an irregular grid), using interpolation algorithms is neither trivial nor unambiguous. Any interpolation method may be used inside each triangle; sometimes, however, a triangle will have all three vertices on the same contour, which is undesirable. It has also been observed that contour lines of poor quality are obtained from scattered data. The Inverse Distance Weighted (IDW) method is based on the assumption that the closer a data point is, the bigger its influence. The best results are obtained with IDW when data points are located densely. If the given nodes are far away from each other or irregularly spaced, the results will be far from the expected ones; this algorithm also tends to produce too many “bull's eyes”. Another popular method, kriging, involves many matrix manipulations and requires a lot of processor time.
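For reference, a textbook formulation of the IDW estimate (illustrative only, not code from the presented system) computes the value at a query point as a weighted average of the scattered samples with weights 1/d^q:

// Generic Inverse Distance Weighted interpolation at (px, py) from scattered
// samples; q is the power parameter (q = 2 is a common choice).
#include <cmath>
#include <vector>

struct Sample { double x, y, value; };

double idw(const std::vector<Sample>& samples, double px, double py, double q = 2.0) {
    double num = 0.0, den = 0.0;
    for (const Sample& s : samples) {
        double d = std::hypot(s.x - px, s.y - py);
        if (d < 1e-12) return s.value;           // query point hits a sample
        double w = 1.0 / std::pow(d, q);
        num += w * s.value;
        den += w;
    }
    return num / den;
}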
2.2 Computational graphics approaches
The popular computational approaches, not firmly grounded in mathematical or physical theory, are the Marching Squares technique and object texturing. Contour lines are obtained very quickly using the Marching Squares algorithm5, but it works only on a regular grid, which is disadvantageous. Another method is based on the fact that contouring is closely related to height fields. A surface can be altitude-colored using a 1D texture map to hold the coloring scheme and specifying a generation function for the z coordinate which measures the distance from the reference surface (z = 0). This relationship can be represented as a contour map, but the quality of the obtained lines is insufficient.
3. A NEW “CUTTING” METHOD
To obtain real-time calculation and to improve the performance of plotting the available data, a new, effective concept has been developed and introduced. The contour lines may be calculated from the initial data given at any points of an area (at nodes of a regular or irregular grid, or as scattered data).
3.1 Clipping planes as a tool of easy contouring
The “cutting” method considered here is based on the fact that contouring depends on the height field. The surface is calculated using a TIN, a system of non-overlapping triangles resulting from a Delaunay triangulation (see Fig. 1). The vertices of the triangles are the initial reference points. The relief of the 3D object is in this case represented by a many-sided surface, each side of which is described by a linear function whose coefficients are determined from the solution values at the triangle vertices. A distinctive feature and advantage of the triangulated model is that no transformation of the initial data takes place, so this model does not introduce errors the way other interpolation methods do. The viewing volume in OpenGL may be limited by as many as n client-defined clip planes generating the clip volume. The surface may be altitude-divided using clipping planes parallel to the reference surface (the base of the 3D object). These cutting planes are used to generate contour lines: each client-defined plane specifies a half-space and finally generates the contour line for one elevation value. The glClipPlane function specifies a half-space using a four-component plane equation Ax + By + Cz + D = 0.
Figure 1. The surface, made of the Delaunay triangles.
Figure 2. Closed contour lines, obtained from clipping planes.
As the contour strips are parallel to the plane z = 0, the cutting plane equation must be z + D = 0, where D measures the distance from the plane z = 0 and sets the actual contour value. The clipping volume becomes the intersection of the viewing volume and all half-spaces defined by the additional clipping planes. When one draws a clipped region in the GL_LINE mode, the boundary edges of the polygon are drawn as line segments, which are treated as connected line segments for line stippling. Vertices are marked as boundary or non-boundary with an edge flag; OpenGL generates edge flags internally when it decomposes polygons. If the current edge flag is FALSE, the vertex is marked as the start of a non-boundary edge, and only the lines of the cutting polygon contour remain visible. This contour is the iso-line with value D. Each clipping plane therefore specifies a contour line by the client-defined value in its plane equation.
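A minimal sketch of this idea is given below; drawTriangulatedSurface() is a hypothetical helper that issues the TIN triangles with glEdgeFlag(GL_FALSE), so that only the edges created by the clipping remain visible. It is an illustration, not the author's code.

/* Sketch: draw one iso-line by clipping the triangulated surface with the
   plane z - contourValue = 0 and rendering the clipped polygons in GL_LINE
   mode. drawTriangulatedSurface() is a hypothetical helper that issues the
   TIN triangles with glEdgeFlag(GL_FALSE), so only clip-generated edges show. */
#include <GL/gl.h>

void drawTriangulatedSurface(void);              /* assumed to exist elsewhere */

void drawIsoLine(GLdouble contourValue)
{
    GLdouble plane[4] = { 0.0, 0.0, 1.0, -contourValue };  /* A=B=0, C=1, D */

    glClipPlane(GL_CLIP_PLANE0, plane);
    glEnable(GL_CLIP_PLANE0);

    glPolygonMode(GL_FRONT_AND_BACK, GL_LINE);   /* draw polygon edges only */
    drawTriangulatedSurface();                   /* the Delaunay TIN */

    glDisable(GL_CLIP_PLANE0);
    glPolygonMode(GL_FRONT_AND_BACK, GL_FILL);
}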
3.2 Drawing the elevation points
In order to put elevation points on each contour line, one must know where these values should be drawn in the window: the coordinates of the line, not only the elevation value D from the clipping plane equation, must be available. These coordinates and the attributes of the vertices located on the cutting plane can be extracted from the feedback array, which must be created before feedback mode is entered. When OpenGL, in the feedback mode, draws the clipped surface to obtain the contour lines, all primitives that would have been visible had the render mode been GL_RENDER are placed in the feedback buffer. In this case the iso-lines obtained from the clipping planes are stored in the feedback buffer as an array of line segment attributes; all contour lines are composed of those segments. Line segments returned to the application by the feedback mode arrive as a random sequence. Because for each elevation value the whole contour may consist of several lines, each separate contour line must be distinguishable. Two situations may occur: either all separate contour lines are closed, or some of them are closed and some (those having vertices on the surface boundary) are not. To make visualization such as the ColorMap and value labels on each iso-line possible, all contour lines have to be closed first. To close all separate contour segments, a bounding box is built around the whole 3D object (constructed from the Delaunay triangles). This bounding side is created as a linked strip of quadrilaterals spanned over the boundary points of the solution surface (z coordinate equal to the solution value) and their projections onto the reference surface (z coordinate equal to 0). Additionally, the vertices in each of these rectangles should be ordered counter-clockwise. When all separate contour lines become closed (see Fig. 2), one may obtain all parameters of each contour by sorting the random sequence of line segments from the feedback buffer in such a way that the endpoint of one line segment is the beginning of the next one. The process is continued until the end of the last segment becomes the beginning point of the first segment. One obtains the number of closed lines for every contour value together with all coordinates of their vertices. After that, the program may draw elevation points (object coordinate z) on each single closed contour line. This may be done using the GLUT library, which includes utility functions for stroke fonts.
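The chaining step can be sketched as follows (a simplified illustration; parsing of the feedback buffer itself is omitted): starting from any segment, the segment whose endpoint matches the current chain end is appended repeatedly until the chain closes on its starting point.

// Simplified chaining of unordered line segments (as read back from the
// feedback buffer) into one closed contour: repeatedly append the segment
// whose endpoint matches the current chain end until the loop closes.
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt  { double x, y, z; };
struct Seg { Pt a, b; };

static bool same(const Pt& p, const Pt& q, double eps = 1e-9) {
    return std::fabs(p.x - q.x) < eps && std::fabs(p.y - q.y) < eps &&
           std::fabs(p.z - q.z) < eps;
}

std::vector<Pt> chainOneContour(std::vector<Seg>& segs) {
    std::vector<Pt> loop;
    if (segs.empty()) return loop;
    loop.push_back(segs.back().a);
    loop.push_back(segs.back().b);
    segs.pop_back();
    while (!segs.empty() && !same(loop.back(), loop.front())) {
        bool extended = false;
        for (std::size_t i = 0; i < segs.size(); ++i) {
            if (same(segs[i].a, loop.back()))      loop.push_back(segs[i].b);
            else if (same(segs[i].b, loop.back())) loop.push_back(segs[i].a);
            else continue;
            segs.erase(segs.begin() + i);
            extended = true;
            break;
        }
        if (!extended) break;   // contour not closed (should not happen after
    }                           // the bounding box has been added)
    return loop;                // last point equals the first when closed
}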
3.3 ColorMap visualization
The next step of the postprocessing visualization has been focused on the ColorMap plot as an effective tool for presenting the calculated solutions. The algorithm uses the results of the previous calculations: the closed, separate contour lines and the sorted sequences of points for each closed line corresponding to every contour value. The OpenGL Utility Library (GLU) takes as input the vertices belonging to the contours, which describe hard-to-render polygons, and tessellates them. The ColorMap is obtained by filling these closed areas (now represented in OpenGL as polygons) with a defined color; each color corresponds to a contour value (see Fig. 3).
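A possible sketch of this step, following the classic GLU tessellator usage (illustrative only; the function-pointer casts are the usual pattern and may need platform-specific adjustment), is shown below.

/* Sketch: fill one closed contour with its color using the GLU tessellator.
   xyz holds the contour vertices as consecutive (x, y, z) triples. */
#include <GL/glu.h>
#include <cstddef>
#include <vector>

void fillContour(std::vector<GLdouble>& xyz, GLfloat r, GLfloat g, GLfloat b)
{
    GLUtesselator* tess = gluNewTess();
    gluTessCallback(tess, GLU_TESS_BEGIN,  (GLvoid (*)()) glBegin);
    gluTessCallback(tess, GLU_TESS_VERTEX, (GLvoid (*)()) glVertex3dv);
    gluTessCallback(tess, GLU_TESS_END,    (GLvoid (*)()) glEnd);

    glColor3f(r, g, b);                        /* color of this contour value */
    gluTessBeginPolygon(tess, NULL);
    gluTessBeginContour(tess);
    for (std::size_t i = 0; i + 2 < xyz.size(); i += 3)
        gluTessVertex(tess, &xyz[i], &xyz[i]);
    gluTessEndContour(tess);
    gluTessEndPolygon(tess);
    gluDeleteTess(tess);
}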
Figure 3. ColorMap visualization.
4. SUMMARY
A new, quick and effective contouring method, creating good-quality contour lines from the available data, has been developed here. The research is already advanced to a degree at which interesting results can be seen. The paper presents the status of research and development of the advanced graphic modeler cooperating with the mesh generator for meshless methods. The software being developed meets many modern requirements, including compliance with many programming and algorithmic standards.
REFERENCES
1. I. Jaworska, On graphic modeller for adaptive meshless FD and FE analysis, 15th International Conference on Computer Methods in Mechanics, (2003).
2. J. Orkisz, P. Przybylski, I. Jaworska, A mesh generator for an adaptive multigrid MFD/FE method, Computational Fluid and Solid Mechanics, 2082-2085 (Elsevier Sc. Ltd., 2003).
3. J. Orkisz, Finite Difference Method (Part III), in: Handbook of Computational Solid Mechanics, edited by M. Kleiber (Springer-Verlag, Berlin, 1998), pp. 336-432.
4. M. Woo, J. Neider, T. Davis, T. Shreiner, OpenGL Programming Guide, 3rd ed. (Addison-Wesley, 1999).
5. W. Schroeder, B. Lorensen, 3-D Surface Contours, Dr. Dobb's Journal, (1996).
MESHING TECHNIQUES FOR GENERATION OF ACCURATE RADIOSITY SOLUTION FOR VIRTUAL ART GALLERY
Maria Pietruszka and Mariusz Krzysztofik
Institute of Computer Science, Technical University of Lodz, Poland, [email protected]
Abstract:
The main problem of virtual art gallery visualization is achieving a compromise between the high cost of calculations (real-time rendering is required) and the level of image realism (global illumination is needed). The paper analyses various meshing techniques for the radiosity method. For one of them (Discontinuity Meshing) an original method of accurately updating the radiosities of new vertices is presented.
Key words:
Radiosity Method, Discontinuity Meshing, Virtual Art Gallery
1. INTRODUCTION
While seeking the most adequate solutions for virtual art galleries, we have assumed the following criteria:
1. The image should be characterized by a high level of realism.
2. The results should be independent of the actual field of view.
3. The number of polygons created in all stages of the calculations should not increase the cost of the final rendering to an extent restricting the walk around the gallery.
4. The calculation cost of the illumination distribution should be as low as possible; however, this should not restrict the preceding assumptions if the cost is borne only once (in a pre-computation phase).
5. The algorithm of scene subdivision used throughout the calculations of the illumination distribution should be, to a high degree, independent of the implementation of these calculations.
The last criterion is connected with the difficulty of introducing changes into the structure of an existing complex system, developed throughout
several years. The first four criteria result from the high requirements imposed on the visualization of works of art and on virtual reality systems with a mobile observer. Because real art galleries contain numerous static diffuse surfaces, it is necessary to use a global illumination model, such as radiosity, to achieve a high level of realism (criterion 1). The paper presents a part of the results in this area obtained by an interdisciplinary team of computer scientists and artists. Since our subdivision algorithm operates solely on the geometry of the existing mesh of patches, its application is not confined to the radiosity algorithm. The method of updating radiosities for new vertices combines these two algorithms.
2. RELATED WORKS
The global and indirect illumination distribution, dedicated to real-time rendering, is usually pre-calculated and stored as a texture map. A diffuse reflection is independent of the camera position; it can be accurately pre-calculated by the radiosity method and stored in a light or illumination map1,2. A specular reflection depends on the camera position; therefore, a specular reflection effect can be calculated by a simple local reflection model or using texture mapping (e.g. environment or reflection mapping)1. This predominant technology is currently being used for interactive architectural walkthroughs3 and computer games1, and it is suitable for virtual art galleries too4. The quality of the perceived scene image is chiefly determined by the scene subdivision into patches in all stages of the calculations. A scene is usually created using a commercial modeler which optimizes the geometry with regard to the final rendering. In that case the scene mesh is usually excessively dense and can comprise improperly situated, inaccurately attached or unoriented patches. In order to avoid anomalies in the illumination distribution, after the scene has been imported it is essential to eliminate abnormalities that arose in the preceding stages and to prepare a more adequate mesh for the radiosity algorithm2,4. This algorithm analyzes the distribution of the radiosity function B(x), measured in [W/m2], on the scene patches and on that basis decides whether to subdivide a patch or not. Each of the known subdivision methods, i.e. Adaptive Subdivision2, Discontinuity Meshing5, Adaptive Discontinuity Meshing6, works in a different manner; therefore, each of them has its advantages and disadvantages, which determine the feasibility of exploiting it for specific applications. The illumination distribution on the final mesh of patches can be stored as a mesh of triangles with per-vertex colors or as texture maps (light or illumination maps) that can be rendered using shading hardware1,4.
3. MESHING TECHNIQUES FOR RADIOSITY
In the classic Adaptive Subdivision, patches with excessively different vertex radiosities, or with an excessively large radiosity gradient across the surface, are considered irregularly illuminated and are subdivided into smaller elements, for which new values of the radiosity B(x) are calculated. The quality of the final illumination distribution depends to a high extent on the initial scene subdivision, which should take the distribution of shadow areas into account. During scene preparation, photographic documentation of a given room is extremely helpful, obviously when its virtual reproduction is supposed to imitate the real illumination and object positions. Although the Adaptive Subdivision method is relatively fast, it generates an excessive number of polygons from the real-time rendering point of view (fig. 1), and an excessive move of the observer toward a surface covered with shadow can cause zigzag shadow lines to emerge in the image. The problems grow for objects to which we would like to add transparency or specular reflection effects, because many programs (e.g. LightScape) do not store the hierarchy of scene objects created by the modeler. The main problem arises with a virtual art gallery for which the creator does not possess any photographic documentation depicting the light gradients upon the surfaces, because he designed illumination different from the original, positioned the existing exhibits differently or introduced others, which are not placed in the real room. In that case it seems more functional to use Discontinuity Meshing (fig. 2) or Adaptive Discontinuity Meshing, which generate a scene mesh based on automatically calculated cast shadow lines (fig. 3a). An advantage of Discontinuity Meshing is the fact that it generates fewer elements than Adaptive Subdivision, its images do not contain errors in the shadow lines, and they are independent of the field of view.
Figure 1. Details of illumination of Neoplastic Room (Museum of Art in Lodz): a) photograph, b) Adaptive Subdivision and Progressive Refinement Radiosity (LightScape software) generated 5 million polygons (illumination map - 4,4MB jpg).
Figure 2. Virtual room rendered with combined Ray-Casting Radiosity and Discontinuity Meshing4: a) illumination maps only, b) other light effects added (Museum of Art in Lodz).
Figure 3. Idea of Discontinuity Meshing: a) discontinuity mesh and shadow as a result of mutual interaction between the vertex V and the edge of the obstructing patch, b) determination of the form factors FVl between the new vertex and the source patches l, which shot energy in preceding iterations; the points on the source patch are sampled by rays.
Discontinuity Meshing can be utilized in numerous iterative radiosity algorithms (e.g. Ray-Casting Radiosity, Progressive Refinement Radiosity). Adaptive Discontinuity Meshing is an extension of Discontinuity Meshing that generates a yet smaller number of polygons. Unfortunately, the visibility testing procedures for the source patch, the potential subdivision of patches, as well as the updating of the vertex radiosities, are to a high degree “embedded” in the main radiosity algorithm. The main problem of Discontinuity Meshing is to update the radiosity BV for the new vertices (VNew) of the discontinuity mesh; BV is undefined for VNew. The function B(x) is usually reconstructed over the patch using interpolators
built on the elements of the discontinuity mesh, after its conversion to triangles. The exact method would require the radiosity to be recalculated for all vertices of the updated scene mesh from the form factors, which, due to the time-consuming computations, disqualifies it for compound scenes. Our proposal is presented below (result in fig. 2).

/* -----------------------------------------------------------------------
   Radiosity algorithm with scene mesh subdivision based on discontinuities */
Initial calculations (preprocessing);
Do until stop criterion is fulfilled {
    Select source patch j with greatest unshot energy ΔEj = ΔBj·Aj;
    If patch j generates subdivisions {
        DoSubdivision(j);      // generate a new scene mesh
        SaveStrongSource(j);
    }
    Calculate radiosity distribution on the new set of scene patches
        (e.g. Ray-Casting Radiosity);
}
/* prepare results for final rendering */
Do triangulation of patches;
Generate illumination maps;
/* ----------------------------------------------------------------------- */

The scene patches are described by the two following structures: CPatch stores the geometric and energetic data of a patch, and SubPatch binds a patch with a DMTree structure. Moreover, the SourcePatchesList stores the source patches selected in preceding iterations of the radiosity algorithm. If in a given iteration the chosen source patch j is recognized as a patch generating discontinuity lines, the function DoSubdivision(j) is called to determine these lines and to perform a new scene subdivision. The first step is to delete the old SubPatchesList and to create a new list. Next, a BSP tree for the scene polygons is created. It guarantees the proper order of polygon traversal while building the list of discontinuity segments (a, b, c in fig. 3a), i.e. the order in accordance with the polygon visibility from the source patch j. Based on these segments, for every SubPatch a DMTree is built and a Winged-Edge structure is created, which stores the topology and geometry of the discontinuity mesh elements. The final task of the DoSubdivision(j) function is to generate a new list of patches based on the SubPatchesList. Since the above method operates solely on the geometry of the hitherto existing mesh of patches, its application is not confined to the radiosity method.
To update the radiosity, for each new vertex VNew the SourcePatchesList is read and the suitable form factors FVl are calculated one after another, updating the radiosity BV of the given new vertex (fig. 3b):

B_V = B_V + \sum_l \rho \, F_{Vl} \, \Delta B_l, \qquad F_{Vl} = \sum_{k=1}^{n} H_k \, \frac{\cos\alpha_{Vk}\,\cos\alpha_{lk}}{\pi r_k^2} \, \frac{A_l}{n}    (1)

where ρ is the reflectivity of the Receiver patch, n is the number of rays rk cast from the source patch l (which shot energy in a preceding iteration) to the vertex VNew, and Hk is the visibility function of the patch l from the vertex VNew. Prior to performing the calculations of the radiosity distribution on the new set of scene patches generated by the function DoSubdivision(j), it is vital to ensure that the function SaveStrongSource(j) appends the current patch j, shooting energy into the scene, to the SourcePatchesList. Only then are all the scene vertices prepared to determine the radiosity contribution from patch j, and the information about the patch parameters is not lost and can be utilized in the following iterations.
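A direct, illustrative implementation of Eq. (1) could look as follows; SourcePatch, its sampler and the visibility test isVisible() are assumed helpers, not actual code of the described system.

// Illustrative implementation of Eq. (1): the form factor between a new
// vertex V and a source patch l, estimated by casting n rays to sample
// points on l. SourcePatch, its sampler and isVisible() are assumed helpers.
#include <cmath>

struct Vec3 { double x, y, z; };

static Vec3   sub(Vec3 a, Vec3 b)     { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static double dot(Vec3 a, Vec3 b)     { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3   scale(Vec3 a, double s) { return { a.x * s, a.y * s, a.z * s }; }

struct SourcePatch {
    double area;                     // A_l
    Vec3   normal;
    Vec3 (*sample)(int k, int n);    // k-th of n sample points on the patch
};

bool isVisible(const Vec3& from, const Vec3& to);   // visibility test H_k

double formFactorVertexToPatch(const Vec3& v, const Vec3& vNormal,
                               const SourcePatch& l, int n)
{
    const double kPi = 3.14159265358979323846;
    double F = 0.0;
    for (int k = 0; k < n; ++k) {
        Vec3   s   = l.sample(k, n);
        Vec3   d   = sub(s, v);
        double r2  = dot(d, d);
        Vec3   dir = scale(d, 1.0 / std::sqrt(r2));
        double cosV =  dot(dir, vNormal);            // cos(alpha_Vk)
        double cosL = -dot(dir, l.normal);           // cos(alpha_lk)
        if (cosV <= 0.0 || cosL <= 0.0) continue;    // facing away: no transfer
        double H = isVisible(v, s) ? 1.0 : 0.0;      // H_k
        F += H * cosV * cosL / (kPi * r2) * (l.area / n);
    }
    return F;
}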
4. CONCLUSIONS
None of the described methods meets all of the initial assumptions. Taking into account the advantages of Discontinuity Meshing, we have found it purposeful to develop it toward images reconstructing the illumination in a virtual art gallery as precisely as possible. The methods described in this paper can automatically generate a mesh which is well prepared both for calculating the radiosity values in successive iterations of a radiosity method and for the final rendering. A full scene subdivision usually takes from one second to several minutes; the time of the radiosity updates is shorter than the radiosity computation time.
ACKNOWLEDGEMENT
The research was partly performed within Project No 8T11C01619 supported by KBN – the State Committee for Scientific Research, 2000-2002.
REFERENCES
1. A. Watt, F. Policarpo, 3D Games, Real-time Rendering and Software Technology, Addison-Wesley, 2002.
2. I. Ashdown, Radiosity – A Programmer’s Perspective, J. Wiley & Sons Inc., New York, 1994.
3. R. Bastos, K. Hoff, W. Wynn, A. Lastra, Increased Photorealism for Interactive Architectural Walkthroughs, University of North Carolina at Chapel Hill, 1999.
4. M. Pietruszka et al., Time-spatial Models of Sculptures in Virtual Reality, Research project report for KBN, No. 8T11C01619, Technical University of Lodz, 2003, (unpublished), p. 150.
5. D. Lischinski, F. Tampieri, D. P. Greenberg, A Discontinuity Meshing Algorithm for Accurate Radiosity, IEEE CG&A, 12(6), 25-39 (1992).
6. W. Stürzlinger, Adaptive Mesh Refinement with Discontinuities for the Radiosity Method, (Fifth Eurographics Workshop on Rendering, Darmstadt, 1994), pp. 239-248.
A PROLONGATION-BASED APPROACH FOR RECOGNIZING CUT CHARACTERS
Antoine Luijkx, Celine Thillou and Bernard Gosselin
Faculte Polytechnique de Mons, TCTS Laboratory
[email protected], [email protected], [email protected]
Abstract
This work deals with Optical Character Recognition in the context of degraded characters, and more particularly cut characters. It complements a more general problem and investigates a field not often covered in the literature. The problem is divided into two sub-problems: first, modular networks separating the detection of cut characters from their classification are proposed. Then, the classification part is complemented with information coming from the prolongation of cut characters.
Keywords:
Optical character recognition; modular neural networks; cut characters.
1. INTRODUCTION
Although current OCR techniques work satisfactorily, the recognition rates drop for degraded characters. This paper presents results for one particular, common degradation: cut characters. Cut characters can be defined as incomplete characters whose missing part is not known. Examples of cut characters are shown in Figure 1.
Figure 1. Example of cut characters.
2. STATE OF THE ART
Only a few papers are available on this subject. Nijhuis4 suggests a technique for the automatic recognition of car license plates where, depending on the position of the camera, cut characters appear in the picture to be analyzed. Nijhuis has developed a technique based on modular networks. First, the characters to be classified are sent to a neural network previously trained on normal characters. If the outputs of this neural network are below a predefined threshold (i.e. if the confidence levels are too low), the corresponding characters are redirected to a dedicated network built on cut characters. The results of the two networks are then compared and the final class determined. The final recognition rates obtained are very high. However, this technique only deals with characters cut on the upper side, and assumes that a threshold can be defined for the confidence levels. Fukushima2 suggests a technique to recognize (and restore) occluded characters (characters partly hidden by stains, objects, etc.). The technique is based on a particular network (called the neocognitron). This network contains different layers extracting particular features. One layer is dedicated to detecting the occluded parts of the character and inhibits features coming from occluded zones, because they are irrelevant. Fukushima's algorithm seems to be a viable technique, although no recognition rate is detailed in his papers, nor any discussion of the algorithm's behaviour in the presence of degraded characters.
3. DETECTION AND CLASSIFICATION
Trying to develop a method similar to that of Nijhuis4, but able to deal with characters cut on the left, bottom, right and top sides, neural networks (Multi-Layer Perceptrons) were built, one trained on a set of normal characters and another trained on cut characters. But it soon appeared that a general criterion could not be determined to define a threshold on the network outputs deciding which characters should be considered as cut characters. The reason is that, even if the relative magnitudes of the outputs are reliable enough (the class with the highest output comes first, followed by the class with the second highest output, and so on), their absolute magnitudes are not reliable in themselves (the confidence level of the best output is often significantly less than 1). Detection of cut characters is therefore a problem of its own. Figure 2 shows the decomposition of the problem into two parts: the detection part first determines whether the character is processed “normally” or through a dedicated process for cut characters. Our efforts were mainly focused on the classification aspect, but preliminary results about detection are also shown in the following section.
Figure 2. Decomposition of the whole problem into two sub-problems: detection and classification.

4. DETECTION
A basic algorithm was developed for cut character detection, based on the analysis of the full picture containing the character to be processed. Characters are defined as cut characters if they are adjacent to the edges of the document, and the side of the cut is assumed to be the edge of the picture adjacent to the character. The results seem encouraging, but some errors still occur, mainly for characters adjacent to more than one edge of the picture. However, a simple post-processing block should improve the results. This method is not useful in the case of physically degraded characters (ancient documents, coffee stains, etc.), where the cuts can be anywhere in the picture.
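This rule can be sketched as follows (illustrative types and margin, not the authors' code): a character whose bounding box touches an edge of the picture is flagged as cut, and the touched edges give the assumed sides of the cut.

// Simplified detection rule: a character whose bounding box touches an edge
// of the picture is flagged as cut; the touched edges give the assumed sides
// of the cut. Types and the margin are illustrative.
#include <string>
#include <vector>

struct Box { int x0, y0, x1, y1; };              // character bounding box

std::vector<std::string> cutSides(const Box& b, int imgW, int imgH, int margin = 0) {
    std::vector<std::string> sides;
    if (b.x0 <= margin)            sides.push_back("left");
    if (b.y0 <= margin)            sides.push_back("top");
    if (b.x1 >= imgW - 1 - margin) sides.push_back("right");
    if (b.y1 >= imgH - 1 - margin) sides.push_back("bottom");
    return sides;                                 // empty => treated as uncut
}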
5. CLASSIFICATION
Once the detection has been performed, the characters detected as uncut should be processed by typical algorithms (see5), while the detected cut characters need a special process. Following the detection phase, all the cut sides of the (detected) cut characters are supposed to be known. Dedicated experts can therefore be used to classify those characters, as shown in Figure 3, where the four networks have been trained on characters cut on the same side (all the networks of this paper use contouring information, see Figure 4). The situation is very different from how people deal with cut characters, since people attempt to prolong the cut character mentally and compare it with a known character. Prolongation of cut characters should therefore help recognition, and an algorithm based on this assumption was developed. First trials at prolonging characters showed difficulties in finding the correct prolongation to apply, especially in a noisy environment. Moreover, for a particular cut character, there may be several good prolongations, as shown
Figure 3. A classifier built on dedicated experts.
Figure 4. Use of different probes to detect the contouring information.
in Figure 5. The novel algorithm we developed acts differently, performing all possible prolongations and then selecting the best results.
Figure 5. Possible prolongations of a cut character (on the left). This character is more probably a "5", but it could be a "S", a "6", or a "G".
Figure 6 illustrates the algorithm. The (detected) cut characters follow two different channels. The first one is the one presented in Figure 3; the second one uses prolongation. Depending on the side of the cut, the number of cuts and the number of connected components, several prolongations are performed. Figure 7 presents the different steps of a prolongation sequence. The skeletons of those prolongations are then extracted, in order to avoid variations due to inaccurate prolongations, as shown in Figure 8. Every skeleton of the prolonged characters is then sent to a network trained on uncut character skeletons.
Figure 6. A classifier using prolongation.
Figure 7. The different steps of a prolongation process.
Figure 8. The skeletonization process helps correct imperfect prolongations.
Each prolonged character produces as many outputs as there are possible classes. The problem is now to choose the correct class amongst the best outputs, so a selection amongst all the remaining character results is performed. Because a selection based on the absolute values of the network outputs is not reliable, the selection uses relative magnitudes in order to decrease the error rates: for each prolonged character, the best class is granted 4 points, the second class 2 points and the third 1 point. The candidates from the set of prolonged characters are then combined with the candidates of the dedicated network to generate the final outputs.
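The scoring can be sketched as follows (an illustration; topThreePerProlongation is an assumed input holding, for every prolonged character, the three best classes returned by the network).

// Sketch of the 4/2/1 voting: for every prolonged character the three best
// classes receive 4, 2 and 1 points; scores are accumulated over all
// prolongations and the class with the highest total is kept as candidate.
#include <cstddef>
#include <map>
#include <vector>

std::map<char, int> voteOverProlongations(
        const std::vector<std::vector<char> >& topThreePerProlongation) {
    const int points[3] = { 4, 2, 1 };
    std::map<char, int> score;
    for (std::size_t p = 0; p < topThreePerProlongation.size(); ++p)
        for (std::size_t r = 0; r < topThreePerProlongation[p].size() && r < 3; ++r)
            score[topThreePerProlongation[p][r]] += points[r];
    return score;
}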
6. RESULTS
The database used for testing our algorithm contains characters from pictures taken with a poor-resolution digital camera. It contains about 500 cut characters, cut on the four sides. With the algorithm of Figure 3, the recognition rate reaches 62% on the test database, and the use of prolonged characters (algorithm of Figure 6) improves the recognition rate by 5%.
7. CONCLUSIONS AND FUTURE WORK
First, prolongation seems to improve the results slightly. However, future work should be performed to improve the recognition rates, especially in the presence of background noise. On the other hand, a post-processing block, using for example the Viterbi algorithm and contextual information, should correct some current interpretation mistakes (like the confusion between “5” and “S”). The processing of cut words should also be improved: for instance, a word cut on the left has missing letters, so analysing letter sequences from the beginning of the word is irrelevant.
ACKNOWLEDGMENTS
We want to thank the people of the TCTS, who provided us with helpful advice.
REFERENCES
1. S. Ferreira, C. Thillou, B. Gosselin, From Picture To Speech: an Innovative Application for Embedded Environment, Proc. of the 14th ProRISC Workshop on Circuits, Systems and Signal Processing (ProRISC 2003), Veldhoven, Netherlands, 2003.
2. K. Fukushima, Restoring partly occluded patterns: a neural network model with backward paths, ICANN 2003, pp. 393-400, 2003.
3. B. Gosselin, Application de reseaux de neurones artificiels a la reconnaissance automatique de caracteres manuscrits, PhD Thesis, Faculte Polytechnique de Mons, 1996.
4. J. A. G. Nijhuis, A. Broersma, L. Spaanenburg, A modular neural network classifier for the recognition of occluded characters in automatic license plate reading, Proceedings of FLINS 2002, pp. 363-372, Ghent, Belgium, 2002.
5. C. Thillou, Degraded Character Recognition, DEA Report, Faculte Polytechnique de Mons, 2004.
Author Index Ahmad, A.M.A. CD-689, CD-746 Al-Hamadi, A. I-13, I-179 Alsam, A. I-252, I-259 Amato, G. I-39 Ambekar, O. CD-703 Ambellouis, S. I-173 d’Angelo P. CD-565 Aouda, S. CD-900 Asai, T. I-118 Asim, M.R. CD-528 Auber, D. CD-633 Baig, M.H. CD-1094 Bando, T. CD-863 Bargiel, P. CD-869 Barta, A. CD-975 Bauermann, I I-209 Be lkowska, J. CD-1082 Benes, B. CD-1063 Berar, M. CD-900 Bereska, D. I-279 Bernier, O. I-203 Besserer, B. I-301 Bielecki, A. CD-516 Bigaj, R. CD-607 Bittar, E. I-456 Bober, M. I-426 Boi, J.-M. CD-774 Bostr¨om, G. I-491 Bˇrezina, A. CD-752 Brueckner, A. CD-888 Brunelli, R. CD-962 Buchowicz, A. I-228 Byrnes, D. CD-1069 Caball, J. I-240 Cabestaing, F. I-173, CD-552 Callet, P. I-469 Cameron, T. I-95 Carbini, S. I-203
Caulier, Y. I-20 Chambah, M. CD-780 Chanda, B. CD-953 Chernov, V.M. CD-709 Chetverikov, D. I-33 Chihara, K. I-314 Chmielewski, L. I-373 Chowdhury, S.P. CD-953 Chupikov, A. CD-558 Cichocki, P. CD-1043 Cichocki, A. CD-826 Cieplinski, L. I-240 Collings, S. I-57 Collobert, M. CD-1031 Colot, O. CD-552 Cyganek, B. I-413 D¸abrowska, A. CD-768 Das, A.K. CD-953 Datta, A. I-95 Delmas, P. CD-534 Desvignes, M. CD-900 Devars, J. CD-599 Dobˇsik, M. I-448 Dodgson, N.A. I-293, CD-832 Dogan, S. I-246 Domenger, J.-P. CD-633 Domscheit, A. I-45 Duvieubourg, L. I-173 Fabrizio, J. CD-599 Faille, F. CD-820 Faro, A. CD-968 Fernandes, E. CD-703 Fiocco, M. I-491 Flancquart, A. CD-814 Fl´ orez-Valencia, L. I-361 Foufou, S. I-80 Frydrych, M. I-448 Garcia, C. CD-1008
1132
Gavrielides, M.A. I-394 Gejgus, P. CD-721 Geraud, T. CD-800 Gesualdi, A. CD-662 Ghanbari, S. I-240 Gimel’farb, G. CD-534 Giordano, D. CD-968 G lowienko, A. CD-1082 Gon¸calves, J.G.M I-491 Gosselin, B. I-420, CD-808, CD-1125 G¨ otze, C. I-355 Gouton, P. I-273 Grabska, E. I-111 Grivet, S. CD-633 Grundland, M. I-293, CD-832 Grzesiak-Kope´c, K. I-111 Han, C.-Y. CD-642 Hardeberg, J.Y. I-252, I-259 Hardy, P.A. CD-912 Harti, M. I-187 Havasi, L. CD-733 Heff, A. I-456 H´egron, G. CD-674 Herout, A. CD-593 Hinrichs, K. I-320 Hoepfel, D. CD-703 Hofmeister, H. I-355 Hussain, M. I-125 Ignasiak, K. I-161 Ikeda, S. I-327 Iketani, A. I-327 Iwakiri, Y. CD-613 Iwanow, Z. I-1 Iwanowski, M. CD-626 Jak´ obczak, H. CD-1100 Jankowska, K. I-45 Jaromczyk, J.W. I-477, CD-912 Jaworska, I. CD-1112 Jedrzejewski, M. CD-587 Jirka, T. I-485 Jolion, J.-M. CD-1008 Jones, B. CD-906 Jourdan, F. CD-674
Kaick, O.M. van CD-655 Kanbara, M. I-118, I-327 Kaneko, T. CD-613 Karlub´ıkov´ a, T. I-222, CD-752 Kasprzak, W. I-367, I-463, CD-826 Kasuga, M. I-400 Kawabata, N. CD-863 Kawai, Y. I-197 Kawulok, M. CD-522 Kazubek, M. CD-929 Keong, K.C. CD-1075 Kerautret, B. CD-581 Khoudour, L. CD-814 Kie´s, P. CD-875 Kinoshenko, D. CD-946 Kondoz, A.M. I-246 Koukam, A. I-187 Kowaluk, M. I-477 Kozera, R.S. I-57, I-87, I-95, I-103 Krˇsek, P. CD-894 Kr¨ uger, L. CD-565 Krzysztofik, M. CD-1118 Krzyzynski, T. I-45 Kubini, P. CD-721 Kucharski, K. I-147, I-426 Kuhnert, K.-D. CD-540 Kuleschow, A. CD-1037 Kulikowski, J.L. I-141 Kwolek, B. I-287 Lanz, O. CD-575, CD-715 Lebied´z, J. I-27 Leclercq, T. CD-814 Leclercq, P. CD-534 Lembas, J. I-111 Li, X. CD-642 Li, L. CD-1018, CD-1056, CD-1069, CD1075 Liu, X. I-349 Luijkx, A. CD-1125 Lukac, R. I-267, I-308 L achwa, A. I-111 Macaire, L. I-7, CD-814 Mac´e, P. CD-674 Malina, W. CD-993
Author Index Manabe, Y. I-314 Mandal, S. CD-953 Mansouri, A. I-273 Manzanera, A. CD-727 Marasek, K. CD-587, CD-1082 Marsza lek, M. CD-1002 Marzani, F.S. I-273 Mashtalir, S. CD-558 Mashtalir, V. CD-946 Masood, A. CD-528 Materka, A. CD-546 Mavromatis, S. CD-774 Melancon, G. CD-633 Mello, F.L. CD-662 Meneveaux, D. I-66 Mercier, B. I-66 Michaelis, B. I-13, I-179 Mielke, M. I-209 Moore, N. I-477 Morgos, M. I-161 Murakami, H. I-400 Muselet, D. I-7 Myasnikov, V.V. CD-845 Nakajima, N. I-327 Naemura, M. CD-695 Nieniewski, M. CD-921 Niese, R. I-13, I-179 Niijima, K. I-125 Nikiel, S. CD-758 Nikodem, P. CD-1088 Noakes, L. I-57, I-87, I-103 Nocent, O. I-456 Nowak, D. CD-668 Nowak, H. CD-987 Okada, Y. I-125 Okazaki, A. I-463, CD-826 Oliveira, A. CD-662 Orkisz, M. I-361 Palacz, W. CD-668 Palma, G. CD-800 Palus, H. I-279 Pedrini, H. CD-655 Peng, E. CD-1018
1133 Pereira, F. I-228, I-337 P´erez-Patricio, M. CD-552 Perwass, C. CD-503 Peteri, R. I-33 Pietrowcew, A. I-228 Pietruszka, M. CD-1118 Pitas, I. I-394 Plataniotis, K.N. I-267, I-308 Polec, J. I-222, CD-752 Postaire, J.-G. I-7, CD-814 Przelaskowski, A. I-216, CD-869 Puig, D. I-491 Reddy, D.N. I-167 Reptin, J. CD-593 Richefeu, J. CD-727 Riyazuddin, M. CD-1094 R¨ofer, T. CD-648 Rokita, P. CD-619, CD-881, CD-1002 Romaniuk, B. CD-900 Ropinski, T. I-320 Rossini, A. I-491 Roy, M. I-80 Ruggeri, M.R. I-131 Ruichek, Y. I-187 Rzadca, K. CD-981 Rzeszut, J. CD-1043 Sadka, A.H. I-147, I-246 Sakchaicharoenkul, T. I-438 Sanchez, M. I-273 Sarfraz, M. CD-528, CD-1094 Sato, M. I-400 Sato, T. I-327 Saupe, D. I-131 Saux, B. Le I-39 Schaefer, G. I-381, CD-906 Seong, T.C. CD-509 Sequeira, V. I-491 Sequeira, J. CD-774 Seta, R. I-463 Sheng, Y. I-147 Sheng, W. I-349 Shioyama, T. I-51, I-197 ´ Sikudov´ a, E. I-394 Singh, C. I-72
1134 Skala, V. CD-839 Skarbek, W. I-147, I-161, I-228, I-426 Sk lodowski, M. I-1 Skulimowski, P. CD-683 Slot, K. CD-987 ´ Slusarczyk, G. I-111 ´ Sluzek, A. I-406, CD-509 Smolka, B. I-267, I-308, I-387 Snitkowska, E. I-367 Spampinato, C. CD-968 Spinnler, K. I-20, CD-1037 Sriram, P.V.J. I-167 Sriram, R.D. I-167 Stachera, J. CD-758 Stanek, S. CD-1049 Stapor, K. CD-888 Steinbach, E. I-209 Steinicke, F. I-320 Stommel, M. CD-540 Strauss, E. CD-662 Strug, B. CD-516, CD-668 Strumillo, P. CD-683 Strzelecki, M. CD-546 Sumec, S. CD-935 Suzuki, M. CD-695 ´ Swito´ nski, A. CD-888 Szczepanski, M. I-387 Szczypi´ nski, P.M. I-167, I-343 Szir´ anyi, T. CD-733 Szl´ avik, Z. CD-733 Szwoch, M. CD-993 Taniguchi, K. CD-1024 Tarnawski, W. CD-787, CD-794 Tatsumi, S. CD-1024 Tesinsky, V. CD-1063 Thacker, P.J. CD-912 Thillou, C. CD-808, CD-1125 Tomaszewski, M. I-161 Troter, A. Le CD-774 Trouchetet, F. I-80 Tschumperl´e, D. I-301 Uddin, M.S. I-51, I-197 Ueta, K. CD-1024 Uhlir, K. CD-839
Unay, D. I-420 Ustymowicz, M. CD-921 Vajk, I. CD-975 Venetsanopoulos, A.N. I-267 Viallet, J.-E. I-203 Vincent, F. I-361 Visani, M. CD-1008 Vliet, N. Van CD-800 Volkov, V. CD-1056 Vrani´c, D.V. I-131 Walia, E. I-72 Wcislo, R. CD-607 Wee, W.G. CD-642 Wiatr, K. CD-768 Witkowski, L. CD-881 W¨ ohler, C. CD-565 Wojciechowski, A. CD-851, CD-857 Wojciechowski, K. I-387 Wong, K.-Y. K. CD-740 Wong, S.-F. CD-740 Woodward, A. CD-534 Worrall, S.T. I-246 Wroblewska, A. CD-869 W¨ unstel, M. CD-648 Yamamoto, M. I-314 Yegorova, E. CD-558, CD-946 Yokoya, N. I-118, I-327 Zemˇcik, P. CD-593 Zhang, L. I-349 Zhao, J. CD-1075 Zhu, S.Y. CD-906 Zuschratter, W. I-355 Zymla, A. I-469