Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany
4673
Walter G. Kropatsch Martin Kampel Allan Hanbury (Eds.)
Computer Analysis of Images and Patterns 12th International Conference, CAIP 2007 Vienna, Austria, August 27-29, 2007 Proceedings
Volume Editors Walter G. Kropatsch Martin Kampel Allan Hanbury Pattern Recognition and Image Processing Group Institute of Computer-Aided Automation Vienna University of Technology Favoritenstr. 9/1832, 1040 Vienna, Austria E-mail: {krw,kampel,hanbury}@prip.tuwien.ac.at
Library of Congress Control Number: 2007932885
CR Subject Classification (1998): I.5, I.4, I.3.5, I.2.10, I.2.6, F.2.2
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-540-74271-9 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-74271-5 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12108824 06/3180 543210
Preface
It was an honor and a pleasure to organize the 12th international conference on Computer Analysis of Images and Patterns (CAIP 2007) in Vienna, Austria. CAIP has been held biennially since 1985:
Year  Location                     Organizers
1985  Berlin, Germany              R. Klette
1987  Wismar, Germany              L.P. Iaroslavskii, A. Rosenfeld, W. Wilhelmi
1989  Leipzig, Germany             K. Voss, D. Chetverikov, G. Sommer
1991  Dresden, Germany             R. Klette
1993  Budapest, Hungary            D. Chetverikov, W. Kropatsch
1995  Prague, Czech Republic       V. Hlavac, R. Sara
1997  Kiel, Germany                G. Sommer, K. Daniilidis, J. Pauli
1999  Ljubljana, Slovenia          F. Solina, A. Leonardis
2001  Warsaw, Poland               W. Skarbek
2003  Groningen, The Netherlands   N. Petkov, M. Westenberg
2005  Versailles, France           A. Gagalowicz
2007  Vienna, Austria              W. Kropatsch, M. Kampel
This year 251 full scientific papers were submitted of which 120 were accepted based on the scientific reviews for presentation during the conference. Consequently the competition for acceptance in the final program was tough. The accepted papers were presented during the conference either as oral presentations or as posters in the non-overlapping scientific program. Oral presentations allowed the authors to reach a large number of participants, while posters allowed for more intense scientific interaction. We tried to continue the tradition of CAIP in providing a forum for scientific exchange at a high quality level. Three internationally recognized speakers accepted our invitation to present a stimulating research topic this year: Zygmunt Pizlo, Purdue University; Arnold Smeulders, University of Amsterdam; and Steven W. Zucker, Yale University. We would like to thank all the program committee members and additional reviewers for their valuable feedback enabling the authors to further improve the quality of their work. We are grateful to our sponsors, the International Association for Pattern Recognition, the Austrian Association for Pattern Recognition, the Austrian Computer Society, and the Vienna Convention Bureau.
Many thanks go to our local support team: Sigrid Elsinger, Martin Lettner, Ernestine Zolda and Alexander Dorfmeister. We also appreciate the help of the staff members of PRIP.

June 2007
Walter Kropatsch
Martin Kampel
Allan Hanbury
CAIP 2007 Organization
General Chairs
Walter G. Kropatsch     Vienna University of Technology
Martin Kampel           Vienna University of Technology

Steering Committee
André Gagalowicz        INRIA Rocquencourt, France
Reinhard Klette         The University of Auckland, New Zealand
Walter G. Kropatsch     Vienna University of Technology, Austria
Nicolai Petkov          University of Groningen, The Netherlands
Gerald Sommer           Christian-Albrechts-Universität zu Kiel, Germany

Organizing Committee
Sigrid Elsinger         Vienna University of Technology
Martin Lettner          Vienna University of Technology
Ernestine Zolda         Vienna University of Technology
Alexander Dorfmeister   Vienna University of Technology

Sponsors
CAIP 2007 was sponsored by the following organizations:
– Austrian Association for Pattern Recognition
– Austrian Computer Society
– International Association for Pattern Recognition (IAPR)
– Vienna Convention Bureau
Program Committee
Ahmed, S.               Kingston University, UK
Andres, E.              Université de Poitiers, France
Bayro-Corrochano, E.    Cinvestav, Mexico
Brun, L.                ENSICAEN, France
Chetverikov, D.         Hungarian Academy of Sciences, Hungary
De Stefano, C.          Università di Cassino, Italy
Del Bimbo, A.           Università degli Studi di Firenze, Italy
Di Gesu, V.             University of Palermo, Italy
Eklundh, J.             Royal Institute of Technology, Sweden
Ercil, A.               Sabanci University, Turkey
Fuchs, S.               TU Dresden, Germany
Gimel’farb, G.          University of Auckland, New Zealand
Hanbury, A.             Vienna University of Technology, Austria
Hlaváč, V.              Czech Technical University, Czech Republic
Iwanowski, M.           Warsaw University of Technology, Poland
Jiang, X.               University of Münster, Germany
Jolion, J.              Lyon Research Center, France
Kampel, M.              Vienna University of Technology, Austria
Klette, R.              The University of Auckland, New Zealand
Kozera, R.              University of Western Australia, Australia
Kropatsch, W.           Vienna University of Technology, Austria
Leonardis, A.           University of Ljubljana, Slovenia
Levine, M.              McGill University, Canada
Li, X.                  University of London, UK
Niemann, H.             University of Erlangen-Nürnberg, Germany
Nyström, I.             Uppsala University, Sweden
Perner, P.              IBaI Institute, Germany
Petkov, N.              University of Groningen, The Netherlands
Pitas, I.               Aristotle University of Thessaloniki, Greece
Pizlo, Z.               Purdue University, USA
Radeva, P.              Universitat Autònoma de Barcelona, Spain
Roerdink, J.B.T.M.      University of Groningen, The Netherlands
Rousson, M.             Siemens Corporate Research, USA
Sablatnig, R.           Vienna University of Technology, Austria
Sagerer, G.             University of Bielefeld, Germany
Sanniti di Baja, G.     Istituto di Cibernetica “E.Caianiello”, Italy
Sauerbier, M.           ETH Zürich, Switzerland
Schizas, C.             University of Cyprus, Cyprus
Sebe, N.                Universiteit van Amsterdam, The Netherlands
Siebel, N.              Christian-Albrechts-University of Kiel, Germany
Smeulders, A.           Universiteit van Amsterdam, The Netherlands
Soille, P.              EC Joint Research Centre, Italy
Sommer, G.              University of Kiel, Germany
Stathaki, T.            Imperial College London, UK
Suk, M.                 Sung Kyun Kwan University, Korea
Tan, T.                 Chinese Academy of Sciences, China
Tao, D.                 The Hong Kong Polytechnic University, Hong Kong, China
ter Haar Romeny, B.     TU Eindhoven, The Netherlands
van Vliet, L.           Delft University of Technology, The Netherlands
Vento, M.               University of Salerno, Italy
Villanueva, J.          Computer Vision Center, Spain
Wechsler, H.            George Mason University, USA
Weickert, J.            Saarland University, Germany
Wilkinson, M.           University of Groningen, The Netherlands
Wojciechowski, K.       Silesian Technical University, Poland
Yuan, Y.                Aston University, UK
Additional Reviewers
Blauensteiner, P.       Vienna University of Technology, Austria
Donner, R.              Vienna University of Technology, Austria
Gedda, M.               Uppsala University, Sweden
Haxhimusa, Y.           Purdue University, USA
Ion, A.                 Vienna University of Technology, Austria
Karlsson, P.            Uppsala University, Sweden
Langs, G.               Vienna University of Technology, Austria
Lettner, M.             Vienna University of Technology, Austria
Mičušík, B.             Vienna University of Technology, Austria
Niazi, K.               Uppsala University, Sweden
Nordin, B.              Uppsala University, Sweden
Reiter, M.              Vienna University of Technology, Austria
Strand, R.              Uppsala University, Sweden
Vidholm, E.             Uppsala University, Sweden
Wildenauer, H.          Vienna University of Technology, Austria
Table of Contents
Invited Talks

Human Perception of 3D Shapes . . . 1
    Zygmunt Pizlo
Connection Geometry, Color, and Stereo . . . 13
    Ohad Ben-Shahar, Gang Li, and Steven W. Zucker
Motion Detection and Tracking

Adaptable Model-Based Tracking Using Analysis-by-Synthesis Techniques . . . 20
    Harald Wuest, Folker Wientapper, and Didier Stricker
Mixture Models Based Background Subtraction for Video Surveillance Applications . . . 28
    Chris Poppe, Gaëtan Martens, Peter Lambert, and Rik Van de Walle
Applicability of Motion Estimation Algorithms for an Automatic Detection of Spiral Grain in CT Cross-Section Images of Logs . . . 36
    Karl Entacher, Christian Lenz, Martin Seidel, Andreas Uhl, and Rudolf Weiglmaier
Deterministic and Stochastic Methods for Gaze Tracking in Real-Time . . . 45
    Javier Orozco, F. Xavier Roca, and Jordi Gonzàlez
Integration of Multiple Temporal and Spatial Scales for Robust Optic Flow Estimation in a Biologically Inspired Algorithm . . . 53
    Cornelia Beck, Thomas Gottbehuet, and Heiko Neumann
Classification of Optical Flow by Constraints . . . 61
    Yusuke Kameda and Atsushi Imiya
Target Positioning with Dominant Feature Elements . . . 69
    Zhuan Qing Huang and Zhuhan Jiang
Speeding-Up Differential Motion Detection Algorithms Using a Change-Driven Data Flow Processing Strategy . . . 77
    Jose A. Boluda and Fernando Pardo
Foreground and Shadow Detection Based on Conditional Random Field . . . 85
    Yang Wang
n-Grams of Action Primitives for Recognizing Human Behavior . . . 93
    Christian Thurau and Václav Hlaváč
Human Action Recognition in Table-Top Scenarios: An HMM-Based Analysis to Optimize the Performance . . . 101
    Pradeep Reddy Raamana, Daniel Grest, and Volker Krueger
Grouping of Articulated Objects with Common Axis . . . 109
    Levente Hajder
Decision Level Multiple Cameras Fusion Using Dezert-Smarandache Theory . . . 117
    Esteban Garcia and Leopoldo Altamirano
Occlusion Removal in Video Microscopy . . . 125
    Brian Eastwood and Russell M. Taylor II
A Modular Approach for Automating Video Analysis . . . 133
    Gayathri Nadarajan and Arnaud Renouf
Rectified Reconstruction from Stereo Pairs and Robot Mapping . . . 141
    Antonio Javier Gallego, Rafael Molina, Patricia Compañ, and Carlos Villagrá
Estimation Track–Before–Detect Motion Capture Systems State Space Spatial Component . . . 149
    Przemyslaw Mazurek
Medical Imaging

Real-Time Active Shape Models for Segmentation of 3D Cardiac Ultrasound . . . 157
    Jøger Hansegård, Fredrik Orderud, and Stein I. Rabben
Effects of Preprocessing Eye Fundus Images on Appearance Based Glaucoma Classification . . . 165
    Jörg Meier, Rüdiger Bock, Georg Michelson, László G. Nyúl, and Joachim Hornegger
Flexibility Description of the MET Protein Stalk Based on the Use of Non-uniform B-Splines . . . 173
    Magnus Gedda and Stina Svensson
Virtual Microscopy Using JPEG2000 . . . 181
    Francisco Gómez, Marcela Iregui, and Eduardo Romero
A Statistical-Genetic Algorithm to Select the Most Significant Features in Mammograms . . . 189
    Gonzalo V. Sánchez-Ferrero and Juan Ignacio Arribas
Biomarker Selection System, Employing an Iterative Peak Selection Method, for Identifying Biomarkers Related to Prostate Cancer . . . 197
    Panagiotis Bougioukos, Dionisis Cavouras, Antonis Daskalakis, Ioannis Kalatzis, Spiros Kostopoulos, Pantelis Georgiadis, George Nikiforidis, and Anastasios Bezerianos
Automatic Segmentation of Femur Bones in Anterior-Posterior Pelvis X-Ray Images . . . 205
    Feng Ding, Wee Kheng Leow, and Tet Sen Howe
Assessing Artery Motion Compensation in IVUS . . . 213
    Debora Gil, Oriol Rodriguez-Leor, Petia Radeva, and Aura Hernàndez
Assessing Estrogen Receptors’ Status by Texture Analysis of Breast Tissue Specimens and Pattern Recognition Methods . . . 221
    Spiros Kostopoulos, Dionisis Cavouras, Antonis Daskalakis, Ioannis Kalatzis, Panagiotis Bougioukos, George Kagadis, Panagiota Ravazoula, and George Nikiforidis
Multimodal Evaluation for Medical Image Segmentation . . . 229
    Rubén Cárdenes, Meritxell Bach, Ying Chi, Ioannis Marras, Rodrigo de Luis, Mats Anderson, Peter Cashman, and Matthieu Bultelle
Automated 3D Segmentation of Lung Fields in Thin Slice CT Exploiting Wavelet Preprocessing . . . 237
    Panayiotis Korfiatis, Spyros Skiadopoulos, Philippos Sakellaropoulos, Christina Kalogeropoulou, and Lena Costaridou
Reconstruction of Heart Motion from 4D Echocardiographic Images . . . 245
    Michal Chlebiej, Krzysztof Nowiński, Piotr Ścislo, and Piotr Bala
Quantification of Bone Remodeling in the Proximity of Implants . . . 253
    Hamid Sarve, Carina B. Johansson, Joakim Lindblad, Gunilla Borgefors, and Victoria Franke Stenport
Delaunay-Based Vector Segmentation of Volumetric Medical Images . . . 261
    Michal Španěl, Přemysl Kršek, Miroslav Švub, Vít Štancl, and Ondřej Šiler
Non-uniform Resolution Recovery Using Median Priors in Tomographic Image Reconstruction Methods . . . 270
    Munir Ahmad and Andrew Todd-Pokropek
Detection of Postmenopausal Alteration of Bone Structure in Digitized X-rays . . . 278
    Constantin Vertan, Ion Ştefan, and Laura Florea
Blood Detection in IVUS Images for 3D Volume of Lumen Changes Measurement Due to Different Drugs Administration . . . 285
    David Rotger, Petia Radeva, Eduard Fernández-Nofrerías, and Josepa Mauri
Eigenmotion-Based Detection of Intestinal Contractions . . . 293
    Laura Igual, Santi Seguí, Jordi Vitrià, Fernando Azpiroz, and Petia Radeva
Brain Tissue Classification with Automated Generation of Training Data Improved by Deformable Registration . . . 301
    Daniel Schwarz and Tomas Kasparek
On Simulating 3D Fluorescent Microscope Images . . . 309
    David Svoboda, Marek Kašík, Martin Maška, Jan Hubený, Stanislav Stejskal, and Michal Zimmermann
Hierarchical Detection of Multiple Organs Using Boosted Features . . . 317
    Samuel Hugueny and Mikaël Rousson
Biometrics

Monitoring of Emotion to Create Adaptive Game for Children with Mild Autistic . . . 326
    P. Ravindra S. De Silva, Masatake Higashi, Stephen G. Lambacher, and Minetada Osano
A Simplified Human Vision Model Applied to a Blocking Artifact Metric . . . 334
    Hantao Liu and Ingrid Heynderickx
Estimating Reflectance Functions Using a Cyberware 3030 Scanner . . . 342
    Matthew P. Dickens and Edwin R. Hancock
Are Younger People More Difficult to Identify or Just a Peer-to-Peer Effect . . . 351
    Wai Han Ho, Paul Watters, and Dominic Verity
Lip Biometrics for Digit Recognition . . . 360
    Maycel Isaac Faraj and Josef Bigun
An Embedded Fingerprint Authentication System Integrated with a Hardware-Based Truly Random Number Generator . . . 366
    Murat Erat, Kenan Danışman, Salih Ergün, Alper Kanak, and Mehmet Kayaoglu
A New Manifold Representation for Visual Speech Recognition . . . 374
    Dahai Yu, Ovidiu Ghita, Alistair Sutherland, and Paul F. Whelan
Fingerprint Hardening with Randomly Selected Chaff Minutiae . . . 383
    Alper Kanak and İbrahim Soğukpınar
Wavelet-Based Fingerprint Region Selection . . . 391
    Almudena Lindoso, Luis Entrena, and Judith Liu-Jimenez
Face Shape Recovery and Recognition Using a Surface Gradient Based Statistical Model . . . 399
    Mario Castelán and Edwin R. Hancock
Representation of Facial Features by Catmull-Rom Splines . . . 408
    Marco Maggini, Stefano Melacci, and Lorenzo Sarti
Automatic Quantitative Mouth Shape Analysis . . . 416
    Augusto Salazar, Jorge Hernández, and Flavio Prieto
Color

Color Adjacency Histograms for Image Matching . . . 424
    Allan Hanbury and Beatriz Marcotegui
Segmentation of Distinct Homogeneous Color Regions in Images . . . 432
    Daniel Mohr and Gabriel Zachmann
Estimating the Color of the Illuminant Using Anisotropic Diffusion . . . 441
    Marc Ebner
Restoration of Color Images Degraded by Space-Variant Motion Blur . . . 450
    Michal Šorel and Jan Flusser
Real-Time Elimination of Brightness in Color Images by MS Diagram and Mathematical Morphology . . . 458
    Francisco Ortiz
Curves and Surfaces Beyond 2 Dimensions

Surface Reconstruction Using Polarization and Photometric Stereo . . . 466
    Gary A. Atkinson and Edwin R. Hancock
Curvature Estimation in Noisy Curves . . . 474
    Thanh Phuong Nguyen and Isabelle Debled-Rennesson
3D+t Reconstruction in the Context of Locally Spheric Shaped Data Observation . . . 482
    Wafa Rekik, Dominique Béréziat, and Séverine Dubuisson
Robust Fitting of 3D Objects by Affinely Transformed Superellipsoids Using Normalization . . . 490
    Frank Ditrich and Herbert Suesse
Fast and Precise Weak-Perspective Factorization . . . 498
    Levente Hajder, Ákos Pernek, and Csaba Kazó
A Graph-with-Loop Structure for a Topological Representation of 3D Objects . . . 506
    Rocio Gonzalez-Diaz, María José Jiménez, Belen Medrano, and Pedro Real
Reading Characters, Words, Lines...

Print Process Separation Using Interest Regions . . . 514
    Reinhold Huber-Mörk, Dorothea Heiss-Czedik, Konrad Mayer, Harald Penz, and Andreas Vrabl
Histogram-Based Lines and Words Decomposition for Arabic Omni Font-Written OCR Systems; Enhancements and Evaluation . . . 522
    Mohamed Attia and Mohamed El-Mahallawy
Semi-automatic Training Sets Acquisition for Handwriting Recognition . . . 531
    Jerzy Sas and Urszula Markowska-Kaczmar
Gabor-Based Recognizer for Chinese Handwriting from Segmentation-Free Strategy . . . 539
    Tong-Hua Su, Tian-Wen Zhang, De-Jun Guan, and Hu-Jie Huang
Image Based Recognition of Ancient Coins . . . 547
    Maia Zaharieva, Martin Kampel, and Sebastian Zambanini
Text Area Detection in Digital Documents Images Using Textural Features . . . 555
    Ilktan Ar and M. Elif Karsligil
Image Segmentation

Optimal Threshold Selection for Tomogram Segmentation by Reprojection of the Reconstructed Image . . . 563
    K. Joost Batenburg and Jan Sijbers
A Level Set Bridging Force for the Segmentation of Dendritic Spines . . . 571
    Karsten Rink and Klaus Tönnies
Knowledge from Markers in Watershed Segmentation . . . 579
    Sébastien Lefèvre
Image Segmentation Using Topological Persistence . . . 587
    David Letscher and Jason Fritts
Image Modeling and Segmentation Using Incremental Bayesian Mixture Models . . . 596
    Constantinos Constantinopoulos and Aristidis Likas
Model-Based Segmentation of Multimodal Images . . . 604
    Xin Hong, Sally McClean, Bryan Scotney, and Philip Morrow
Image Segmentation Based on Height Maps . . . 612
    Gabriele Peters and Jochen Kerdels
Shape

Measuring the Orientability of Shapes . . . 620
    Paul L. Rosin
A 3–Subiteration Surface–Thinning Algorithm . . . 628
    Kálmán Palágyi
Extraction of River Networks from Satellite Images by Combining Mathematical Morphology and Hydrology . . . 636
    Pierre Soille and Jacopo Grazzini
Fractal Active Shape Models . . . 645
    Polychronis Manousopoulos, Vassileios Drakopoulos, and Theoharis Theoharis
Decomposition for Efficient Eccentricity Transform of Convex Shapes . . . 653
    Adrian Ion, Samuel Peltier, Yll Haxhimusa, and Walter G. Kropatsch
Euclidean Shortest Paths in Simple Cube Curves at a Glance . . . 661
    Fajie Li and Reinhard Klette
A Fast and Robust Ellipse Detection Algorithm Based on Pseudo-random Sample Consensus . . . 669
    Ge Song and Hong Wang
A Definition for Orientation for Multiple Component Shapes . . . 677
    Joviša Žunić and Paul L. Rosin
Definition of a Model-Based Detector of Curvilinear Regions . . . 686
    Cédric Lemaître, Johel Miteran, and Jiří Matas
A Method for Interactive Shape Detection in Cattle Images Using Genetic Algorithms . . . 694
    Horacio M. González–Velasco, Carlos J. García–Orellana, Miguel Macías–Macías, Ramón Gallardo–Caballero, and Fernando J. Álvarez–Franco
A New Phase Field Model of a ‘Gas of Circles’ for Tree Crown Extraction from Aerial Images . . . 702
    Peter Horváth and Ian H. Jermyn
Shape Signature Matching for Object Identification Invariant to Image Transformations and Occlusion . . . 710
    Stamatia Giannarou and Tania Stathaki
Junction Detection and Multi-orientation Analysis Using Streamlines . . . 718
    Frank G.A. Faas and Lucas J. van Vliet
Decomposing a Simple Polygon into Trapezoids . . . 726
    Fajie Li and Reinhard Klette
Shape Recognition and Retrieval: A Structural Approach Using Velocity Function . . . 734
    Hamidreza Zaboli, Mohammad Rahmati, and Abdolreza Mirzaei
Improved Morphological Interpolation of Elevation Contour Data with Generalised Geodesic Propagations . . . 742
    Jacopo Grazzini and Pierre Soille
Image Registration and Matching

Cylindrical Phase Correlation Method . . . 751
    Jakub Bican and Jan Flusser
Extended Global Optimization Strategy for Rigid 2D/3D Image Registration . . . 759
    Alexander Kubias, Frank Deinzer, Tobias Feldmann, and Dietrich Paulus
A Fast B-Spline Pseudo-inversion Algorithm for Consistent Image Registration . . . 768
    Antonio Tristán and Juan Ignacio Arribas
Robust Least-Squares Image Matching in the Presence of Outliers . . . 776
    Patrice Delmas, Georgy Gimel’farb, Al Shorin, and John Morris
A Neural Network String Matcher . . . 784
    Abdolreza Mirzaei, Hamidreza Zaboli, and Reza Safabakhsh
Incorporating Spatial Information into 3D-2D Image Registration . . . 792
    Guoyan Zheng
Performance Evaluation and Recent Advances of Fast Block-Matching Motion Estimation Methods for Video Coding . . . 801
    Berenice Ramirez
Spectral Eigenfeatures for Effective DP Matching in Fingerprint Recognition . . . 809
    Boris Danev and Toshio Kamei
Registering Long-Term Image Series . . . 817
    Detlev Droege and Dietrich Paulus
Graph Similarity Using Interfering Quantum Walks . . . 823
    David Emms, Edwin R. Hancock, and Richard C. Wilson
Signal Decomposition and Invariants

Visual Speech Recognition Using Motion Features and Hidden Markov Models . . . 832
    Wai Chee Yau, Dinesh Kant Kumar, and Hans Weghorn
Feature Extraction of Weighted Data for Implicit Variable Selection . . . 840
    Luis Sánchez, Fernando Martínez, Germán Castellanos, and Augusto Salazar
Analysis of Prediction Mode Decision in Spatial Enhancement Layers in H.264/AVC SVC . . . 848
    Koen De Wolf, Davy De Schrijver, Wesley De Neve, Saar De Zutter, Peter Lambert, and Rik Van de Walle
Object Recognition by Implicit Invariants . . . 856
    Jan Flusser, Jaroslav Kautsky, and Filip Šroubek
An Automatic Microarray Image Gridding Technique Based on Continuous Wavelet Transform . . . 864
    Emmanouil Athanasiadis, Dionisis Cavouras, Panagiota Spyridonos, Ioannis Kalatzis, and George Nikiforidis
Image Sifting for Micro Array Image Enhancement . . . 871
    Pooria Jafari Moghadam and Mohamad H. Moradi
Wavelet Based Local Coherent Tomography with an Application in Terahertz Imaging . . . 878
    Xiao-Xia Yin, Brian W.-H. Ng, Bradley Ferguson, and Derek Abbott
Nonlinear Approximation of Spatiotemporal Data Using Diffusion Wavelets . . . 886
    Marie Wild
A New Wavelet-Based Texture Descriptor for Image Retrieval . . . 895
    Esther de Ves, Ana Ruedin, Daniel Acevedo, Xaro Benavent, and Leticia Seijas
Space-Variant Restoration with Sliding Discrete Cosine Transform . . . 903
    Vitaly Kober and Jacobo Gomez Agis
Features and Classification

Comparative Evaluation of Classical Methods, Optimized Gabor Filters and LBP for Texture Feature Selection and Classification . . . 912
    Jaime Melendez, Domenec Puig, and Miguel Angel Garcia
A Multiple Classifier Approach for the Recognition of Screen-Rendered Text . . . 921
    Steffen Wachenfeld, Stefan Fleischer, and Xiaoyi Jiang
Improving Stability of Feature Selection Methods . . . 929
    Pavel Křížek, Josef Kittler, and Václav Hlaváč
A Movie Classifier Based on Visual Features . . . 937
    Hui-Yu Huang, Weir-Sheng Shih, and Wen-Hsing Hsu
An Efficient Method for Filtering Image-Based Spam E-mail . . . 945
    Ngo Phuong Nhung and Tu Minh Phuong
SVM-Based Active Feedback in Image Retrieval Using Clustering and Unlabeled Data . . . 954
    Rujie Liu, Yuehong Wang, Takayuki Baba, Yusuke Uehara, Daiki Masumoto, and Shigemi Nagata
Hierarchical Classifiers for Detection of Fractures in X-Ray Images . . . 962
    Joshua Congfu He, Wee Kheng Leow, and Tet Sen Howe
An Efficient Nearest Neighbor Classifier Using an Adaptive Distance Measure . . . 970
    Omid Dehzangi, Mansoor J. Zolghadri, Shahram Taheri, and Abdollah Dehzangi
Accurate Identification of a Markov-Gibbs Model for Texture Synthesis by Bunch Sampling . . . 979
    Georgy Gimel’farb and Dongxiao Zhou
Texture Defect Detection . . . 987
    Michal Haindl, Jiří Grim, and Stanislav Mikeš
Extracting Salient Points and Parts of Shapes Using Modified kd-Trees . . . 995
    Christian Bauckhage

Author Index . . . 1003
Human Perception of 3D Shapes

Zygmunt Pizlo
Department of Psychological Sciences, Purdue University, West Lafayette, IN 47907-2081, U.S.A.
[email protected]
Abstract. This paper reviews the main approaches to 3D shape perception in both human and computer vision. The approaches are evaluated with respect to how plausibly they can generate adequate explanations of human vision. The criterion for plausibility is provided by existing psychophysical results. A new theory of 3D shape perception is then outlined. According to this theory, human perception of shapes critically depends on a priori shape constraints: symmetry and compactness. The role of depth cues is secondary, at best.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 1–12, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Reconstruction of a 3D shape from one or more 2D images has been a major challenge in computer vision during its 50-year history. Understanding how the human visual system accomplishes this task has been a major challenge in psychology for about a thousand years. The first written account of human shape perception can be found in Alhazen’s book published in 1083 [1]. Alhazen defined shape constancy as the fact that the percept of the shape of a given object remains constant despite changes in the shape of the object’s retinal image. The shape of the retinal image changes when the viewing orientation of the object relative to the observer changes. The shape constancy phenomenon has played a fundamental role in the study of shape perception, and has proved to be essential in understanding how humans perceive shapes veridically, i.e., as the 3D shapes really are in the physical world.

Despite the fact that the human and computer vision communities are working on the same problem, viz., formulating a theory of a seeing system, there has not been much interaction between the two. We begin with a brief explanation of why the two communities have worked separately. The main part of the paper presents the history of research on shape constancy, emphasizing the concepts, tools and theories used in both communities. The paper concludes with the presentation of a new theory of 3D shape perception.

1.1 Computer vs. Human Vision

The beginning of computer vision coincided with the Cognitive Revolution in psychology. The common assumption was that cognitive psychologists, neuroscientists, engineers and computer scientists would work together in formulating a theory of the human mind and brain, and implementing the theory in artificial systems. It became clear, very quickly, that this task is extremely difficult. First, psychologists and
neuroscientists did not really know much about how the mind and brain work. Second, computer scientists and engineers did not really know how to implement even the simplest mental functions, such as reading handwritten text. These two difficulties were responsible for the split between the two communities. The psychological community proceeded with studying mental functions that were simple enough to be modeled, such as perceptual coding of color, contrast and depth, whereas the computer vision community proceeded with an attempt to solve simple practical problems regardless of whether or not the solutions had anything to do with how the human visual system works. There was only a brief period in the 1970s during which David Marr tried to bring the two communities together by emphasizing the importance of understanding human vision as a general purpose vision system [2]. He did not succeed, but his views on the field of vision had a lasting influence on the psychological community. Now, after 50 years of research, the two communities are trying to join forces again and learn about each other’s methods, results and approaches. This is especially true in Europe, as exemplified by the launch of interdisciplinary efforts under the names ECVision and euCognition.

In order to understand where we are now, how we got here, and what should be done next, a review of the main approaches to 3D shape perception is presented next. The review is organized around the following topics: (i) the nature of 3D shape representation; (ii) the role of learning; (iii) the role of invariants; (iv) the role of surfaces; (v) shape parts; (vi) purposive vs. general purpose vision; and (vii) the role of a priori constraints.
2 Review of Main Approaches

Most approaches were tried by both communities, although with differing degrees of success. Only one or at most two approaches were limited to one community. This fact alone indicates the high degree of synergy between the efforts of these two communities.

2.1 Nature of Perceptual Representation of 3D Shapes

Perceptual representation of 3D shapes in the human mind is three-dimensional. There are a number of results and observations supporting this statement. For example, humans are able to mentally rotate 3D objects, as shown by Shepard and his colleagues [3]. It does not matter whether or not mental rotation is used in object recognition and whether the speed of mental rotation is constant, issues that have been discussed vigorously for the last 35 years. What really matters is that when presented with a novel 3D object, the observer has little difficulty imagining what the object looks like from a different viewing orientation. Similarly, given two different views of the same 3D object, the observer has no difficulty recognizing that they represent the same object [4]. Both phenomena are related to shape constancy. Shape constancy in the case of unfamiliar objects cannot be achieved without a 3D representation of the shapes. There was a small group of psychologists, starting with Rock, who claimed that the perceptual representation of 3D shapes is not 3D [5]. Rock himself claimed that perceptual representation of 3D objects is only 2.5D. Recall that according to Marr, 2.5D
representation refers to visible surfaces in the viewer-centered coordinate system. According to Rock, the human visual system does not represent an object in the object-centered coordinate system. In essence, Rock claimed that humans do not perceive 3D shapes, only their surfaces. Rock’s followers modified this view by claiming that perceptual representation is only 2D [6] and that shape constancy with 3D shapes is either not achieved at all or requires learning a large number of 2D views of each object (the so-called multiple view theory [7]). This theory of shape perception was first proposed by Helmholtz [8]. It had been ignored by most subsequent psychologists, including Gestalt psychologists [9], cognitive psychologists [10], Gibson [11], and Marr [2]. The reason was that this theory was inconsistent with the fact that shape constancy is universally achieved in everyday life (regardless of whether or not the 3D shapes are familiar). The only objects with which shape constancy is not achieved (unless extensive learning is allowed) are crumpled newspapers, amoebas and paperclips [5, 6]. It is difficult to justify the need for a multiple view theory if the theory is supported only by a handful of unstructured or unnatural objects. Shape constancy performance does improve with learning when unstructured objects are used. But is learning a part of the theory of perception or memory? More importantly, can learning do the job?

2.2 Role of Learning

The claim that the human mind at birth is a tabula rasa, and that perceptual mechanisms (algorithms) have to be learned during one’s life, originated with British empiricists in the 17th century [12]. According to this view, a newborn infant is not able to perceive shapes, depth, motion, colors etc. This extreme view gradually changed when confronted with the philosophy of nativism, as well as with empirical results.
Elaborating the ideas of such philosophers as Descartes, Reid and Kant, and of the physiologist Johannes Müller, Hering [13] offered a nativistic theory of the perception of color and space. Shortly after that, empirical results showed that the percept of depth and space is present very early in animal and human life. This line of research started with Thorndike, who demonstrated depth discrimination in newborn chicks [14]. Later, Hess showed that visual space perception, including binocular vision, is present at birth in chicks [15]. Experiments with young infants started with Gibson & Walk [16]. They showed that infants perceive depth very early in life. Reports of adults, born blind, whose vision had been restored, and who could not recognize objects without extensive learning, were originally taken as critical proof supporting the empiricist position [17]. We now know that the visual system can process visual information at birth [18], but that it loses this ability when visual stimulation is absent for many years. So, it is not absence of learning, but degeneration of the neural structure, that is responsible for poor vision in such patients. Similarly, experiments with distorting and inverting prisms, which were supposed to demonstrate that visual perception of space can be re-learned, showed that if there is any learning or adaptation, it takes place in the motor system, not in the visual system [19]. Recent brain imaging experiments confirmed these results. The spatial representation in the brain is not reorganized when the visual input is altered by wearing inverting prisms [20]. It is now well established and commonly accepted in the psychological community that the mechanisms underlying perception of contrast, color, motion, texture, depth,
and shape are innate, not learned. In particular, results with newborn infants demonstrated that they do perceive objects in 3D space, the way we do [21]. The percept of a 3D object produced by its 2D retinal image involves a priori constraints, and these constraints are innate, not learned. Early demonstrations by Ames [22] and Bruner & Goodman [23], suggesting the operation of learning in size and shape perception, have been shown to be flawed (e.g., see Bruner’s 1973 comments about his own early experiments [24]). Some cross-cultural studies reported differences in visual perception, but these differences can most likely be explained the same way as the results on adults born blind (see above). Namely, we all see things the same way when we are born, but if some class of stimuli is absent for an extended period after birth, the visual system may lose the ability to process it adequately. Losing perceptual abilities due to impoverished visual input is not exactly what proponents of learning theories had in mind. So, why are we still talking about perceptual learning? Why is learning a dominant approach in the computer vision community? Learning theories are convenient: instead of figuring out how the stimulus should be analyzed by a computer, we let the computer learn. In psychophysics, perceptual learning can be reliably demonstrated only in a handful of tasks, such as hyperacuity. Discrimination thresholds do improve with learning, and this improvement is related to the visual system’s ability to “learn” the magnitude of visual noise in a given part of the retina. Perceptual learning is also present when viewing fragmented pictures, like the Dalmatian. Considering the fact that these memory effects do not generalize to new images, it is likely that they represent the modulation of visual attention by memory. So, memory affects not how the image is analyzed in the brain, but rather which parts of the image an observer attends to.
If perceptual mechanisms are innate, they had to be acquired (learned) during evolution. It follows that “learning” may be understood more broadly, avoiding the traditional distinction between nativism and empiricism. The problem with this view is that we do not have a good theory of evolutionary learning. Therefore, talking about this as “learning” is likely to lead to confusion. If learning were an important factor in perception, learning would have been the main experimental methodology in visual psychophysics. And it is not. The fact that we all perceive our environment the same way (i.e., individual differences are extremely small), regardless of where we were born and what education we received, suggests once more that visual experience plays little or no role in visual perception. There are no “good vs. bad perceptionists” in the same sense as there are good vs. bad football or chess players.

2.3 Invariants

Gibson [25] was the first to propose that visual space and shape perception involve projective invariants, such as the cross ratio of four collinear points. According to Gibson, the retinal images of a binocular, actively moving observer provide direct information about the 3D scene. Depth cues, as advocated by Marr, and a priori constraints, as advocated by Gestaltists, are not needed because invariants can be computed from the images themselves (or, as Gibson put it, from the visual array). Systematic research on the use of invariants in computer vision did not start until 1988 [26]. In the case of planar figures, projective invariants can be computed from a single image. In the
case of solid objects, a single 2D image allows computation of projective invariants for only some classes of objects, such as a subset of polyhedra [27]. These invariants are called model-based. In order to compute projective invariants for an arbitrary object, one needs at least two images. An advantage of projective invariants is that they do not require the cameras to be calibrated. However, the cameras are often calibrated and in such cases, two perspective images allow the reconstruction of Euclidean invariants, such as angles and ratios of distances [28]. In fact, the human eyes are “calibrated cameras”: the visual system “knows” the geometry of the eyeballs. It follows that projective invariants have limited application in human and computer vision. More specifically, they have never been seriously considered as a model of human vision because they cannot distinguish among shapes that are easily distinguished by humans [29]. Perspective invariants provide a better model of human shape perception. They are model-based in the sense that they can establish perspective equivalence between an object and its image, without reconstructing the object [30-31]. These invariants have been shown to provide an adequate theory of shape recognition by humans [32]. Invariants, however, do not explain how a 3D percept is formed in the case of a single 2D image of an unfamiliar object. But we know that humans always perceive the 3D scene as 3D even if the scene is viewed with one eye. A theory of 3D shape reconstruction is needed.

2.4 Surfaces and Depth Cues

The first attempt to provide a general theory of vision was made by Marr [2]. Marr’s approach was motivated by Julesz’ work [33]. Julesz, using random dot stereograms, demonstrated that binocular perception of 3D surfaces is possible without any monocular information about the surfaces. Marr took this as evidence that figure-ground organization is not needed and, in fact, not performed by the visual system.
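The construction behind a random dot stereogram can be sketched in a few lines. The following Python sketch is illustrative only and is not taken from [33]: the function names `random_dot_stereogram` and `best_shift`, the image size, the size of the central square and the disparity of 4 pixels are all assumptions chosen for the example. Each image on its own is uniform random noise; the central square exists only as a horizontal offset between the two images.

```python
import random


def random_dot_stereogram(size=64, square=24, disparity=4, seed=0):
    """Return (left, right) images as lists of rows of 0/1 dots.

    The right image is a copy of the left one in which a central square
    has been shifted horizontally by `disparity` pixels.
    """
    rng = random.Random(seed)
    left = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]
    right = [row[:] for row in left]
    lo = (size - square) // 2
    hi = lo + square
    for y in range(lo, hi):
        # Copy the central square into the right image, shifted to the left.
        for x in range(lo, hi):
            right[y][x - disparity] = left[y][x]
        # Fill the strip uncovered by the shift with fresh random dots.
        for x in range(hi - disparity, hi):
            right[y][x] = rng.randint(0, 1)
    return left, right


def best_shift(left, right, lo, hi, max_shift=8):
    """Horizontal shift of the central patch that best aligns the two images."""
    def agreement(d):
        return sum(left[y][x] == right[y][x - d]
                   for y in range(lo, hi) for x in range(lo, hi))
    return max(range(max_shift + 1), key=agreement)


left, right = random_dot_stereogram()
# Matching the two images recovers the planted disparity of the square.
print(best_shift(left, right, lo=20, hi=44))
```

Outside the rows of the square the two images are identical, and neither image contains a contour of the square; the square is defined only by the correspondence between the images, which is the sense in which the depth percept requires no monocular information.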
Recall that figure-ground organization refers to establishing which regions and contours in the image represent individual objects, as opposed to background. Instead, according to Marr, the visual system assigns 3D distances and orientations to every point on the retina, based on depth cues such as binocular disparity, motion, texture, and shading. These local distances and orientations represent a description of visible surfaces of objects “out there.” Marr called this description the 2.5D sketch, and it was a critical element in his theory of shape reconstruction. Once the visible surfaces of an object are known, the visual system should be able to represent them in the object-centered coordinate system and use this representation to build a 3D model of the shape of this object. But is 3D shape perceptually derived from its surfaces? Mathematically this seems like a natural approach. Object-centered properties, such as curvature of the surface, are derived from orientations of the surfaces. But perceptually, shape may not be a derivative of surfaces. The first direct test of Marr’s claim was performed only recently. Li & Pizlo [34] asked subjects to judge (i) the 3D orientations of faces of a polyhedron, and (ii) the shapes of the faces. Each type of information is sufficient to reconstruct the 3D shape of the polyhedron. These reconstructions should represent the subject’s percept, as long as the visual system actually performs such reconstructions. Specifically, if 3D shape percept is derived from orientations of surfaces, both types
of judgments should lead to the same 3D shapes. They did not. The perceived 3D shape was consistent with the perceived shapes of its faces, but not with the perceived 3D orientations of the faces. The difference was large and systematic. The authors concluded that 3D shape percept is produced by applying simplicity constraints, such as symmetry, to the 2D shapes on the retina, not by using orientations of surfaces. In fact, the 3D shape, once perceived, provides information about its surfaces. So, contrary to the common wisdom, shape is a cue to surfaces, not the other way around. But note that before simplicity constraints can be applied, the 2D shape on the retina must be established. Human 3D shape perception involves figure-ground organization, not a 2.5D sketch.

2.5 Shape Parts

Theories of shape perception before 1985 treated shape as one of many perceptual properties and assumed that shape would be explained the same way as depth, speed and size. It was assumed that contextual information is always critical because the retinal image of a given property is always ambiguous. Consider the size of an object. The size of its camera or retinal image depends on the viewing distance. If the viewing distance is not known, the object’s size is also unknown. So, before the size can be judged, the viewing distance has to be estimated. The viewing distance has nothing to do with the physical size of the object. Instead, it represents context, which, in the case of size perception, must be “taken into account.” But shape is different from all other perceptual properties. Shape is unique because it is complex and structured. Shape, unlike size, distance or color, requires very many parameters to describe. It follows that not all information about shape is destroyed by perspective projection to the retina. In fact, only a small fraction of this information is destroyed, at least if something can be assumed about the 3D objects “out there”, for example that the objects are symmetric.
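The size ambiguity described above can be made concrete with a small calculation. The sketch below assumes a frontoparallel object viewed along its midline, so that its visual angle is 2·atan(size / (2·distance)); the function names are hypothetical and chosen only for illustration.

```python
import math


def visual_angle_deg(size, distance):
    """Visual angle, in degrees, of a frontoparallel object of the given size."""
    return math.degrees(2.0 * math.atan(size / (2.0 * distance)))


def size_from_angle(angle_deg, distance):
    """Physical size implied by a visual angle once a distance is assumed."""
    return 2.0 * distance * math.tan(math.radians(angle_deg) / 2.0)


# Two different objects that produce the same retinal size.
near = visual_angle_deg(size=1.0, distance=10.0)
far = visual_angle_deg(size=2.0, distance=20.0)
print(math.isclose(near, far))

# The same angle is compatible with any size; the assumed distance selects one.
for assumed_distance in (5.0, 10.0, 20.0):
    print(assumed_distance, round(size_from_angle(near, assumed_distance), 3))
```

A single number on the retina is therefore compatible with every physical size, and only context can resolve it. No comparable one-line ambiguity holds for a structured shape: as argued above, most of its many parameters survive projection once something like symmetry is assumed.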
This idea was first implemented by Biederman [35] in his object Recognition-By-Components (RBC) theory. Biederman assumed that most objects consist of simple volumetric parts. The parts formed a small subset of generalized cones and each was characterized by one or more symmetries. Because of the simplicity of the parts, they could be recognized from a single image. Once several parts of an object are recognized, they can be used to identify the entire object. Similar theories were proposed by Pentland [36] and Dickinson et al. [37]. Note that RBC theory is very different from a theory of any other perceptual property. The theory uses simple parts, but there are a number of them and their spatial arrangement, relative to one another, can vary a lot. As such, the theory capitalizes on the complexity of shapes of objects. Surfaces do not have to be reconstructed and depth cues do not play any role. Despite the initial promise, however, this theory did not lead to a working algorithm. No one was able to show how to actually find simple parts in real images. One possible explanation of the failure is the fact that parts of real objects are not necessarily simple. As a result, the fit between the predefined simple parts and images of real parts, was never good enough. What proponents of this approach seemed to have missed is that a priori simplicity constraints may be effective not only when 3D objects are simple. The only thing that is required is that 3D objects be simpler than their 2D images. This is definitely true in the case of symmetric objects; any symmetric objects, including complex ones.
2.6 Purposive Vision

Difficulties in solving the big problem of a general purpose vision system, posed by Marr, suggested to students of computer vision that it may be more realistic to solve smaller problems for individual applications. In particular, a robot should be active and should intelligently search for new information relevant to the task at hand, and test perceptual hypotheses, rather than passively try to reconstruct the entire 3D scene [38]. The term “purposive” was used in 1932 by Tolman, a behaviorist, who tried to revise behaviorism by adding cognitive elements to it [39]. Later on, the term was used by Wiener and his colleagues when they were trying to convince fellow engineers and psychologists that goal-directed behavior can be studied scientifically [40]. The idea that the visual system is like a scientist who tests hypotheses was put forth by Gregory [41]. At present, it is quite well established that human behavior and thinking are purposive. But there is no evidence that vision is purposive, that is, that the perceptual representation of a 3D scene depends on the purposes or motivations of the observer. It is true that visual attention can be affected by the observer’s purposes; the observer will look where the information is most likely to be found. Furthermore, the fact that the visual system is able to analyze only a few objects at a time implies that the observer will form the 3D percept of the attended objects first, before reconstructing the rest of the 3D scene. But except for these two limitations, the human visual system is a general purpose system.

2.7 A Priori Constraints

The fact that perception involves the operation of a priori simplicity constraints was recognized by Gestalt psychologists in the first half of the previous century [9,42]. Gestalt ideas did not lead, right away, to formal theories because the mathematical tools were not there.
A precise definition of simplicity requires the language of information theory, which was not presented until 1948 [43]. Computer simulations, which are essential in testing computational models of vision, could not be performed before 1941. The theory of inverse problems and a general method of solving them were not formulated until 1963 and became available in the English literature much later [44]. Finally, the explicit claim that inverse problems, regularization theory, Bayesian inference and simplicity constraints are all needed to explain vision was not put forth until 1985 [45]. More than half a century passed between the origin of Gestalt ideas and the first serious implementation of them. During that time, these ideas were criticized and dismissed for various reasons. Is human visual perception a difficult inverse problem? Does it have to be regularized? Does it involve simplicity constraints? The answer to all of these questions is in the affirmative, and this is especially true in the case of 3D shape perception [46]. When presented with a 2D image of a natural 3D scene, an observer perceives the scene as 3D. One can verify this by closing one eye and looking around. An empiricist argument, according to which we see 3D objects as 3D because we know these objects, can easily be rejected by looking at photographs that contain unfamiliar objects. Clearly, the visual system solves a difficult inverse problem because there are infinitely many 3D scenes that could produce any given 2D image. Would the problem get easier if more images were provided? This question was tested in several
studies [47-49]. For unstructured objects, like 3D polygonal lines and asymmetric polyhedra in which the visible contours are non-planar, shape constancy completely fails regardless of whether the subject views an object with one eye closed or with two eyes, and whether the object is rotating or not. For other objects, such as symmetric polyhedra, shape constancy is close to perfect under all viewing conditions. These results suggest that depth cues are neither necessary nor sufficient for achieving shape constancy, that is, for seeing shapes as they really are “out there.” What is both necessary and sufficient for achieving shape constancy is the availability of effective a priori constraints. When the retinal or camera image is produced by a symmetric object whose contours are planar and whose compactness is large, shape constancy is reliably achieved. Do real objects in the natural environment satisfy these constraints? It is difficult to answer this question at this point, but it is quite likely that the answer is “yes.” To the extent that this answer is accurate, the human visual system is a general purpose visual system. Obviously, there are cases when a priori constraints may lead to large perceptual errors. Such cases are not easy to find in the natural world, but are easy to design in the laboratory. This is what was done by Pizlo et al. [50]. The subjects in their experiment binocularly viewed a rotating stimulus. The stimulus was designed in such a way that a priori constraints (symmetry and compactness) were consistent with the percept of a rigidly rotating object, whereas motion and binocular disparity were consistent with a non-rigid object being stretched and then compressed along the line of sight of one eye. The subjects saw a rigidly rotating object, indicating that when a priori constraints and depth cues provide conflicting information, depth cues are ignored and the percept is determined by the constraints.
If the a priori shape constraints that are used by the visual system are indeed consistent with the shapes of real objects in the natural environment, the visual system should rely on these constraints, rather than on depth cues. It does.
3 A New Theory of 3D Shape Reconstruction

The main question becomes what the a priori constraints for shape are, rather than which cues are combined in the perception of 3D surfaces. Several constraints have already been identified. One of the most important is symmetry. When a 3D object is mirror symmetric, its single 2D orthographic image determines a one-parameter family of 3D shapes [51]. A single 2D perspective image leads to a unique reconstruction, although the reconstruction is expected to be unstable in the presence of visual noise [48,52]. Many objects are mirror symmetric: animals, humans, man-made objects. Symmetry is important: in the presence of gravity, mirror symmetric objects are mechanically stable. How difficult is it to determine whether a given image was produced by a mirror symmetric object? Not very difficult. If distinctive points can be identified, then the line segments connecting pairs of symmetric points are parallel in an orthographic image. If distinctive points cannot be identified, but surface contours can, and the contours are at least piecewise planar, then the planar parts of the contours are affine equivalent. Once it is known that a 2D image was produced by a symmetric object, the 3D reconstruction is substantially easier. The symmetry itself drastically reduces the family of possible 3D reconstructions from arbitrarily many
degrees of freedom to only one. There are other types of symmetries: translational and rotational. When a 2D contour is translated in 3D space, one produces an object called a generalized cone (or generalized cylinder, or a sweep) [53]. Such objects are also easy to reconstruct because of their translational symmetry. The same is true of objects that have rotational symmetries. If a 3D object has two symmetries, then its reconstruction is unique [51]. Again, in the presence of visual noise, the reconstruction is likely to be unstable. This means that other constraints are needed as well, either to stabilize the reconstruction or to produce a unique solution in the first place. The most straightforward constraint is provided by the likelihood function [54]. The 3D reconstructed object should not be too long in the depth direction because small changes in the viewing orientation would lead to large changes in its image. It follows that the likelihood function will favor objects which are short in the depth direction. Another constraint, not related to likelihood, is maximum compactness. 3D compactness is defined as $V^2/S^3$, where $V$ and $S$ are the object’s volume and surface area. Maximal compactness corresponds to maximizing volume for a given surface area, or minimizing surface area for a given volume. The 3D compactness constraint has never been used before in shape reconstruction models. Its first application was presented by Li & Pizlo [34] and Pizlo et al. [55]. These studies showed that maximal compactness leads to unique and stable reconstructions. Furthermore, the reconstructions tend to be accurate. The 3D compactness is a generalization of 2D compactness that was used in the past [56]. 3D compactness is also a generalization of the minimum variance of angles constraint. The main drawback of the minimum variance of angles constraint is that it can only be applied to polyhedra. Both compactness and minimum variance of angles give the object its volume.
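To make the compactness measure concrete, the sketch below evaluates the ratio of squared volume to cubed surface area for a few simple solids. It is a toy illustration under stated assumptions, not the reconstruction model of [34] or [55]: the helper names are hypothetical, and the one-parameter family used here is simply a box with a fixed 1 × 1 front face and unknown depth, standing in for the family of 3D shapes consistent with a single image.

```python
import math


def compactness_3d(volume, surface_area):
    """3D compactness V^2 / S^3; dimensionless, so independent of scale."""
    return volume ** 2 / surface_area ** 3


def sphere(radius):
    """(volume, surface area) of a sphere."""
    return 4.0 / 3.0 * math.pi * radius ** 3, 4.0 * math.pi * radius ** 2


def box(width, height, depth):
    """(volume, surface area) of a rectangular box."""
    volume = width * height * depth
    area = 2.0 * (width * height + height * depth + depth * width)
    return volume, area


# The sphere attains the upper bound 1 / (36 * pi); a unit cube gives 1 / 216.
print(compactness_3d(*sphere(1.0)), compactness_3d(*box(1.0, 1.0, 1.0)))

# Toy one-parameter family (an assumption of this sketch): the 1 x 1 front
# face is fixed by the image and the depth is unknown. Choose the depth
# that maximizes compactness.
depths = [i / 10 for i in range(1, 51)]
best_depth = max(depths, key=lambda d: compactness_3d(*box(1.0, 1.0, d)))
print(best_depth)
```

In this toy family the search selects a depth of 1.0, that is, the cube: very shallow and very elongated boxes are both penalized, so a single free parameter is fixed without any depth cue.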
Because all objects have volume, volume does not have to be derived from depth cues such as binocular disparity or motion parallax. It can be assumed. Only such unstructured objects as a crumpled newspaper or a polygonal line do not have volume. Note that volume does not have to be physically present. When looking at a Necker cube, we “see” its volume and surfaces even though it is a wire object. The same is true of other objects such as chairs and tables. In such cases, compactness can be computed for the convex hull of the object, rather than for the object itself. When a 3D object is asymmetric, its 3D shape can still be reconstructed by applying the compactness constraint and the likelihood function. Reconstruction is much easier if at least some contours on the object’s surface are planar or piecewise planar. Planarity plays the same role as symmetry: it reduces the parameter space of the family of possible 3D shapes. When an object is symmetric, the planarity constraint allows the reconstruction of those features on the symmetric object whose symmetric counterparts are occluded. As a result, one is often able to reconstruct the entire 3D shape, including its back (occluded) surfaces. It is expected that in the case of smoothly curved objects planarity can be replaced by a smoothness constraint. Preliminary psychophysical experiments showed that a 3D shape reconstruction model involving these three constraints (symmetry, compactness and planarity), in conjunction with the likelihood function, is able to account for the subjects’ percepts [29,34,48,52,54-55]. The model can explain not only the cases where the percept is veridical, but also cases where subjects make large errors. In this new model, depth cues are not used. They are not needed. The shape constraints are applied to a segmented 2D image. Specifically, the 3D shape constraints are applied to the 2D shapes
on the image. It follows that shape reconstruction presupposes that the figure-ground organization has been successfully established. The new model was tested on synthetic and real images. It works well in the presence of image noise. However, before we can claim that we have solved the big problem of formulating a theory of a general purpose vision system, we have to develop a good model of figure-ground organization. Figure-ground organization is itself a difficult inverse problem whose solution depends on the operation of a priori constraints. Finding out what these constraints are is the main question in human and computer vision.

Acknowledgements. This work was partially supported by the National Science Foundation and Department of Energy.
References
[1] Alhazen: The optics. Books 1-3 (Translated by Sabra, A.I., The Warburg Institute, London) (1083/1989)
[2] Marr, D.: Vision. W.H. Freeman, New York (1982)
[3] Shepard, R.N., Cooper, L.A.: Mental images and their transformations. MIT Press, Cambridge, MA (1982)
[4] Biederman, I., Gerhardstein, P.C.: Recognizing depth-rotated objects: Evidence and conditions from three-dimensional viewpoint invariance. Journal of Experimental Psychology: HP&P 19, 1162–1182 (1993)
[5] Rock, I., DiVita, J.: A case of viewer-centered object perception. Cognitive Psychology 19, 280–293 (1987)
[6] Tarr, M.J., Williams, P., Hayward, W.G., Gauthier, I.: Three-dimensional object recognition is viewpoint dependent. Nature Neuroscience 1, 275–277 (1998)
[7] Poggio, T., Edelman, S.: A network that learns to recognize three-dimensional objects. Nature 343, 263–266 (1990)
[8] von Helmholtz, H.: Treatise on Physiological Optics (translated from German, J.P.C. Southall). Thoemmes, Bristol (1910/2000)
[9] Koffka, K.: Principles of Gestalt Psychology. Harcourt, Brace, New York (1935)
[10] Hochberg, J.: Perception. Prentice-Hall, Englewood Cliffs, NJ (1978)
[11] Gibson, J.J.: The Ecological Approach to Visual Perception. Houghton Mifflin, Boston (1979)
[12] Locke, J.: An essay concerning human understanding. Clarendon, Oxford (1690/1975)
[13] Hering, E.: Beiträge zur Physiologie. Heft 1. Engelmann, Leipzig (1861)
[14] Thorndike, E.: The instinctive reactions of young chicks. Psychological Review 6, 191–282 (1899)
[15] Hess, E.H.: Space perception in the chick. Scientific American 195, 71–80 (1956)
[16] Gibson, E., Walk, R.: The visual cliff. Scientific American 202, 64–71 (1960)
[17] von Senden, M.: Space and sight. Methuen, London (1932/1960)
[18] Hubel, D.H., Wiesel, T.N.: Receptive fields of cells in striate cortex of very young, visually inexperienced kittens. Journal of Neurophysiology 26, 994–1002 (1963)
[19] Rock, I., Harris, C.S.: Vision and touch. Scientific American 216, 96–104 (1967)
[20] Linden, D.E.J., Kallenbach, U., Heinecke, A., Singer, W., Goebel, R.: The myth of upright vision. A psychophysical and functional imaging study of adaptation to inverting spectacles. Perception 28, 469–481 (1999)
[21] Slater, A., Morison, V.: Shape constancy and slant perception at birth. Perception 14, 337–344 (1985)
[22] Kilpatrick, F.P.: Explorations in transactional psychology. New York Univ. Press, NY (1961)
[23] Bruner, J.S., Goodman, C.C.: Value and need as organizing factors in perception. Journal of abnormal and social psychology 42, 33–44 (1947)
[24] Bruner, J.S.: Beyond the information given; studies in the psychology of knowing. Norton, NY (1973)
[25] Gibson, J.J.: The perception of the visual world. Houghton Mifflin, Boston (1950)
[26] Weiss, I.: Projective invariants of shapes. In: Proceedings of DARPA Image Understanding Workshop, Cambridge, MA, pp. 1125–1134 (1988)
[27] Rothwell, C.A.: Object recognition through invariant indexing. Oxford University Press, Oxford (1995)
[28] Longuet-Higgins, H.C.: A computer algorithm for reconstructing a scene from two projections. Nature 293, 133–135 (1981)
[29] Pizlo, Z.: 3D shape: its unique place in visual perception. MIT Press, Cambridge, MA (2008)
[30] Pizlo, Z., Rosenfeld, A.: Recognition of planar shapes from perspective images using contour-based invariants. Computer Vision, Graphics & Image Processing: Image Understanding 56, 330–350 (1992)
[31] Pizlo, Z., Loubier, K.: Recognition of a solid shape from its single perspective image obtained by a calibrated camera. Pattern Recognition 33, 1675–1681 (2000)
[32] Pizlo, Z.: A theory of shape constancy based on perspective invariants. Vision Research 34, 1637–1658 (1994)
[33] Julesz, B.: Foundations of cyclopean perception. University of Chicago Press, Chicago (1971)
[34] Li, Y., Pizlo, Z.: Is viewer-centered representation necessary for 3D shape perception. Journal of Vision 6 (2006) (Abstract 268)
[35] Biederman, I.: Human image understanding: recent research and a theory. Computer Vision, Graphics and Image Processing 32, 29–73 (1985)
[36] Pentland, A.P.: Perceptual organization and the representation of natural form. Artificial Intelligence 28, 293–331 (1986)
[37] Dickinson, S., Pentland, A., Rosenfeld, A.: From volumes to views: An approach to 3-D object recognition. CVGIP: Image Understanding 55, 130–154 (1992)
[38] Aloimonos, Y.: Purposive active vision, CVGIP: Image Understanding, pp. 840–850 (1992)
[39] Tolman, E.C.: Purposive behavior in animals and men. Century, NY (1932)
[40] Rosenblueth, A., Wiener, N., Bigelow, J.: Behavior, purpose and teleology. Philosophy of Science 10, 18–24 (1943)
[41] Gregory, R.L.: Perceptions as hypotheses. Philosophical Transactions of the Royal Society, London B290, 181–197 (1980)
[42] Wertheimer, M.: Principles of perceptual organization. In: Beardslee, D.C., Wertheimer, M., van Nostrand, D. (eds.) Readings in Perception, pp. 115–135, NY (1923/1958)
[43] Shannon, C.E.: A mathematical theory of communication. The Bell System Technical Journal 27, 379–423, 623–656 (1948)
[44] Tikhonov, A.N., Arsenin, V.Y.: Solutions of ill-posed problems. John Wiley & Sons, New York (1977)
[45] Poggio, T., Torre, V., Koch, C.: Computational vision and regularization theory. Nature 317, 314–319 (1985)
[46] Pizlo, Z.: Perception viewed as an inverse problem. Vision Research 41, 3145–3161 (2001)
[47] Pizlo, Z., Stevenson, A.K.: Shape constancy from novel views. Perception & Psychophysics 61, 1299–1307 (1999)
[48] Chan, M.W., Stevenson, A.K., Li, Y., Pizlo, Z.: Binocular shape constancy from novel views: the role of a priori constraints. Perception & Psychophysics 68, 1124–1139 (2006)
[49] Li, Y., Pizlo, Z.: Monocular and binocular perception of 3D shape: the role of a priori constraints. Journal of Vision 5 (2005) (Abstract 521)
[50] Pizlo, Z., Li, Y., Francis, G.: A new look at binocular stereopsis. Vision Research 45, 2244–2255 (2005)
[51] Vetter, T., Poggio, T.: Symmetric 3D objects are an easy case for 2D object recognition. In: Tyler, C.W. (ed.) Human symmetry perception and its computational analysis, pp. 349–359. Lawrence Erlbaum, Mahwah, NJ (2002)
[52] Chan, M.W., Pizlo, Z., Chelberg, D.M.: Binocular shape reconstruction: psychological plausibility of the 8 point algorithm. Computer Vision & Image Understanding 74, 121–137 (1999)
[53] Binford, T.O.: Visual perception by computer. IEEE Conference on Systems and Control. Miami (1971)
[54] Li, Y., Pizlo, Z.: Reconstruction of 3D symmetrical shapes by using planarity and compactness constraints. Annual meeting of the Vision Sciences Society, Sarasota, FL (Abstract 934) (2007)
[55] Pizlo, Z., Li, Y., Steinman, R.M.: A new paradigm for 3D shape perception. Perception 35 (2006) (ECVP abs)
[56] Brady, M., Yuille, A.: Inferring 3D orientation from 2D contour (an extremum principle). In: Richards, W. (ed.) Natural computation, pp. 99–106. MIT Press, Cambridge, MA (1983)
Connection Geometry, Color, and Stereo
Ohad Ben-Shahar1, Gang Li2, and Steven W. Zucker3
1 Computer Science, Ben Gurion University, Israel [email protected]
2 Real-Time Vision and Modeling Dept., Siemens Corp. Research, Princeton, NJ [email protected]
3 Computer Science, Yale University, New Haven, CT [email protected]
Abstract. The visual systems in primates are organized around orientation with a rich set of long-range horizontal connections. We abstract this from a differential-geometric perspective, and introduce the covariant derivative of frame fields as a general framework for early vision. This paper overviews our research showing how curve detection, texture, shading, color (hue), and stereo can be unified within this framework.
1 Introduction
Early vision is normally thought of as a collection of tasks, including edge detection, texture analysis, and stereo. The tools that are brought to bear to solve these tasks differ as well, with edge detection normally conceptualized in a signal-detection context, texture in the context of statistics for local image patches, and stereo as a problem in projective geometry. The integration of these tasks is normally accomplished with higher-level models. However, since the individual tasks are formulated in terms that differ from one another, these higher-level models are difficult to formulate in formal terms.

We have been pursuing a more unified approach. The motivation originally derived from our study of the visual systems in primates, especially the early cortical visual area V1. This is where orientation, as in edge orientation, is first abstracted, and it plays a key role in specifying the functional architecture, or layout, of cortex. While other feature dimensions, such as direction of motion or spatial frequency, are also important, because of space limitations we shall not consider them here. The remaining organizing element, eye of origin, is of course necessary for stereo.

In this short paper we simply overview our work, with a focus on the geometry that runs through the early vision problems of edge detection, oriented texture and shading analysis, stereo and color. Our goal is to highlight the (differential) geometric framework that is common to all of these problems. In the next section we introduce several concepts from modern differential geometry, especially the covariant derivative, the Frenet equations, and frame fields. Curvature emerges as the central connection between tangent orientations at nearby positions. We then illustrate the quantization of curvature for curves, which gives rise to co-circularity. Extending this to 3-D allows us to formulate the stereo problem for space curves. This provides a new framework for stereo, integrating position and orientation disparity, and illustrates how the frame-field structure can elaborate our problem formulations. The full 2-D frame field is the natural setting for oriented textures, and this involves two curvatures, one in the tangential and one in the normal direction. Finally, the extension to color, or at least hue, is sketched.

We stress that this presentation is intended as an overview of our work and as an entry to the literature. All of the results are described in greater detail, with complete references to related research, in the papers listed as references at the end. We hope that placing these pieces in juxtaposition to one another will illustrate our confidence in, and excitement about, the frame-field representation for early vision. We recommend consulting these more complete papers for the full story.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 13–19, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Connection Geometry in the Plane
A unit length tangent vector $E(q)$ attached to point $q = (x, y) \in \mathbb{R}^2$ is the natural representation of orientation in the plane. Attaching such a vector to points of interest (e.g., along a smooth curve or oriented texture) yields a unit length vector field. Assuming smoothness, an infinitesimal translation along the vector $V$ from $q$ yields a small rotation in the vector $E(q)$. A frame $\{E_T, E_N\}$, placed at the point $q$ with $E_T$ identified with $E(q)$, allows us to apply techniques from differential geometry (Fig. 1).

[Figure 1 labels: point $q$, displacement $V$, frame vectors $E_T = E(q)$ and $E_N$, rotation angle $\theta$.]
Fig. 1. Illustration of the geometry behind the connection equation and the covariant derivative
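Nearby tangents are displaced both in position and orientation according to the covariant derivatives, $\nabla_V E_T$ and $\nabla_V E_N$, which can also be represented as vectors in the basis $\{E_T, E_N\}$:
$$\begin{pmatrix} \nabla_V E_T \\ \nabla_V E_N \end{pmatrix} = \begin{pmatrix} w_{11}(V) & w_{12}(V) \\ w_{21}(V) & w_{22}(V) \end{pmatrix} \begin{pmatrix} E_T \\ E_N \end{pmatrix} . \qquad (1)$$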
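Note: the 1-forms $w_{ij}(V)$ are functions of the displacement $V$. Since the basis $\{E_T, E_N\}$ is orthonormal, they are skew-symmetric: $w_{ij}(V) = -w_{ji}(V)$. Thus $w_{11}(V) = w_{22}(V) = 0$ and the system reduces to the connection equation formulated by Cartan [1]:
$$\begin{pmatrix} \nabla_V E_T \\ \nabla_V E_N \end{pmatrix} = \begin{pmatrix} 0 & w_{12}(V) \\ -w_{12}(V) & 0 \end{pmatrix} \begin{pmatrix} E_T \\ E_N \end{pmatrix} . \qquad (2)$$
$w_{12}(V)$, the connection form, is linear in $V$, so it can be represented in terms of the frame $\{E_T, E_N\}$:
$$w_{12}(V) = w_{12}(a E_T + b E_N) = a\, w_{12}(E_T) + b\, w_{12}(E_N) ,$$
giving rise to the scalars
$$\kappa_T \doteq w_{12}(E_T) , \qquad \kappa_N \doteq w_{12}(E_N) , \qquad (3)$$
which we interpret as tangential ($\kappa_T$) and normal ($\kappa_N$) curvatures. Specializing to the one-dimensional case of curves, only $\nabla_{E_T}$ is necessary:
$$\begin{pmatrix} \nabla_{E_T} E_T \\ \nabla_{E_T} E_N \end{pmatrix} = \begin{pmatrix} 0 & w_{12}(E_T) \\ -w_{12}(E_T) & 0 \end{pmatrix} \begin{pmatrix} E_T \\ E_N \end{pmatrix} . \qquad (4)$$
Now $T$, $N$, and $\kappa$ can replace $E_T$, $E_N$, and $\kappa_T$, respectively, and this is the classical Frenet equation (primes denote derivatives with respect to arclength):
$$\begin{pmatrix} T' \\ N' \end{pmatrix} = \begin{pmatrix} 0 & \kappa \\ -\kappa & 0 \end{pmatrix} \begin{pmatrix} T \\ N \end{pmatrix} . \qquad (5)$$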
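As a small numerical illustration of the two curvatures, the following sketch evaluates the connection form for an orientation field written as an angle, E_T = (cos θ, sin θ). It relies on the standard flat-plane identity w12(V) = dθ(V), which follows from differentiating E_T along V; the function names and the circular test field are assumptions chosen for this example and are not taken from the paper.

```python
import math

def frame(theta):
    """Return the orthonormal frame (E_T, E_N) for orientation angle theta."""
    e_t = (math.cos(theta), math.sin(theta))
    e_n = (-math.sin(theta), math.cos(theta))
    return e_t, e_n

def connection_form(grad_theta, v):
    """w_12(V) = d(theta)(V): rotation rate of the frame when moving along V."""
    return grad_theta[0] * v[0] + grad_theta[1] * v[1]

def curvatures(theta, grad_theta):
    """Tangential and normal curvatures (kappa_T, kappa_N) of the field."""
    e_t, e_n = frame(theta)
    return connection_form(grad_theta, e_t), connection_form(grad_theta, e_n)

def circular_flow(x, y):
    """Hypothetical test field: tangents to concentric circles about the origin."""
    r2 = x * x + y * y
    theta = math.atan2(y, x) + math.pi / 2.0
    grad_theta = (-y / r2, x / r2)  # analytic gradient of atan2(y, x)
    return theta, grad_theta

theta, grad = circular_flow(2.0, 0.0)
kappa_t, kappa_n = curvatures(theta, grad)
print(kappa_t, kappa_n)  # tangential curvature is 1/r here, normal curvature vanishes
```

For concentric circles the tangential curvature equals the reciprocal of the radius and the normal curvature is zero, which matches the reduction from the full frame field to the Frenet equation for a single curve.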
3 Curves and Co-circularity
The original application of these ideas was to develop compatibility coefficients for a relaxation labeling process for curve detection in images; see [2]. The basic idea is that local estimates of an image curve, or approximations to the tangent, are obtained at all positions and all orientations. (These correspond to the local measurements of orientation in visual cortex of primates.) Consistent tangent estimates support one another, while inconsistent ones detract support. (This corresponds to the computation supported by the neural substrate of long-range horizontal connections.) The goal is to find the collection of tangents that maximizes support. (For a technical definition of support, see [3].) From the geometric perspective above, consistency can be interpreted directly in terms of transport along the osculating circle (a local second-order approximation to the curve at each point); see Fig. 2.
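The consistency test can be made concrete with a short sketch. It uses the elementary fact that two tangents lie on a common circle when they make equal and opposite angles with the chord joining their positions; the function names are illustrative, and the code is a geometric check only, not the compatibility coefficients of [2].

```python
import math

def orientation_diff(a, b):
    """Smallest difference between two orientations (angles taken modulo pi)."""
    d = (a - b) % math.pi
    return min(d, math.pi - d)

def cocircular_orientation(p1, theta1, p2):
    """Orientation at p2 of the circle through p1 and p2 that is tangent to theta1 at p1."""
    phi = math.atan2(p2[1] - p1[1], p2[0] - p1[0])  # direction of the chord
    return (2.0 * phi - theta1) % math.pi

def cocircularity_error(p1, theta1, p2, theta2):
    """Zero when the two tangents are co-circular; grows as theta2 departs from that."""
    return orientation_diff(theta2, cocircular_orientation(p1, theta1, p2))

# Two tangents sampled from the unit circle are consistent with each other.
a1, a2 = 0.3, 1.1
p1, p2 = (math.cos(a1), math.sin(a1)), (math.cos(a2), math.sin(a2))
print(cocircularity_error(p1, a1 + math.pi / 2, p2, a2 + math.pi / 2))
```

A relaxation process could turn this error into a support value, for instance by letting support fall off as the error grows; that mapping is a design choice left open here.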
4 (Oriented) Textures and Co-helicity
We now utilize the full differential geometry in the plane. In the neighborhood of an orientation within a texture, there is a full set of possible orientations; if we move in any direction within the texture the orientation can change. This implies the need for compatibility functions that are fully 2-D, and these are illustrated in Fig. 3. The construction is a direct extension of co-circularity with a helicoid (in position, orientation space) as the generalization of the osculating circle for curves. Two curvatures are necessary. See [6].

[Figure 2 diagram labels: "The osculating circle approximates a curve in the neighborhood of a point"; true image curve; local tangent; compatible tangent; incompatible tangent; point q; axes x, y.]
Fig. 2. The geometry of co-circularity for image curves. (top) The osculating circle provides an approximation to a curve locally via curvature. (bottom) Different quantizations of curvature indicate which tangent estimates should reinforce one another. In effect these “precompute” the different transports.

Fig. 3. Analysis of oriented textures. (top) Co-helical compatibilities for oriented textures. These are defined in a local neighborhood around the central tangent. For textures this neighborhood is 2-D. Note that now there is variation in orientation (i.e., curvature) in both the tangential and the normal directions, and that singularities arise naturally. (bottom) A Brodatz texture; initial measurements of orientation within the center-of-interest; relaxed (consistent) tangent vector field.
5 Analysis of Hue
One normally thinks of color in (red, green, blue) coordinates. However, when the color is mapped to the psychologically more useful (intensity, hue, saturation) coordinates, the color circle emerges. Considering only hue, the value at each pixel can be represented as a vector (which points to the proper location on the hue circle). The geometry is now close to that for oriented textures, modulo π vs 2π. Compatibility fields, which earlier were drawn among vectors, can now be drawn among hues. Notice that hue can change slightly with movement in either the tangential or the normal directions so, again, two (hue) curvatures are needed. The resulting system can be used for denoising [4]; for segmentation; and to provide a basis for color constancy [5].

[Figure 4 diagram labels: colors Green, Yellow, White, Cyan, Blue, Red, Magenta; axes H, S, V.]
Fig. 4. Geometry of color and hue flows. (top) The intensity-hue-saturation representation for color makes the hue circle explicit. The hue at each pixel can therefore be represented as a vector, and similar geometry to that for oriented textures emerges. (middle) An apple image with the hue represented at each pixel. (bottom) Hue compatibilities now indicate how hue varies in the tangential and the normal directions around the central pixel. They can be used for noise cleaning and object segmentation.
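The hue-as-vector representation can be sketched with the Python standard library. The helper names below are assumptions for illustration, and `colorsys.rgb_to_hsv` is used as one readily available hue definition, which need not be the exact color transform used by the authors. The point of the sketch is that hue differences wrap at 2π, whereas orientation differences wrap at π.

```python
import colorsys
import math

def hue_angle(r, g, b):
    """Hue of an RGB triple (components in [0, 1]) as an angle in [0, 2*pi)."""
    h, _s, _v = colorsys.rgb_to_hsv(r, g, b)
    return 2.0 * math.pi * h

def hue_vector(r, g, b):
    """Unit vector pointing to the pixel's location on the hue circle."""
    a = hue_angle(r, g, b)
    return (math.cos(a), math.sin(a))

def hue_difference(a, b):
    """Signed smallest rotation from hue angle b to hue angle a, in [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi

red = hue_angle(1.0, 0.0, 0.0)
green = hue_angle(0.0, 1.0, 0.0)
print(hue_vector(1.0, 0.0, 0.0), math.degrees(hue_difference(green, red)))
```

Hue gradients computed with `hue_difference` between neighboring pixels could then be projected onto the tangential and normal directions to estimate the two hue curvatures.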
6 Stereo
Stereo correspondence involves both differential and projective geometry. Classical projective geometry is well known in computer vision; here we stress how continuity of smooth objects (in this case curves, but also surfaces) can supplement the epipolar constraint for matching, and can supersede the ordering and other heuristic constraints. We formulate the basic transport operation in $\mathbb{R}^3$, for which we need to extend the Frenet equations to add torsion, or deviation from the osculating plane:
$$\begin{pmatrix} T' \\ N' \\ B' \end{pmatrix} = \begin{pmatrix} 0 & \kappa & 0 \\ -\kappa & 0 & \tau \\ 0 & -\tau & 0 \end{pmatrix} \begin{pmatrix} T \\ N \\ B \end{pmatrix} \qquad (6)$$
The central observation is that when the standard frontal-parallel plane assumption is violated, higher-order disparities are introduced. It is these new disparities that are most useful. To illustrate, consider the tangent component of the Frenet 3-frame. Note that this projects to a pair of (2-D) tangents, one in the left image and one in the right. They will have a classical spatial disparity as well as the higher-order orientation disparity; see Fig. 5. The compatibility functions and transport are defined in 3-D, and thus are implemented as relationships between pairs of tangents. The stereo system for space curves is described in [7]; the extension to surfaces is in [8].

Fig. 5. The geometry of stereo. (top) A space curve in 3-D projects into the left and the right images. Shown are two (space) tangents, each of which projects to a pair of (image) tangents. Compatibilities are thus defined over pairs of tangents, and include orientation as well as positional disparities. (bottom) A complex arrangement of twigs. The left and right images are shown, as is the stereo reconstruction from two viewpoints.
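The appearance of an orientation disparity can be checked with a minimal sketch. It assumes an idealized rectified rig, two pinhole cameras with parallel optical axes displaced along the x-axis, which is a simplification chosen for this example and not the camera model of [7]; all names and default values are hypothetical.

```python
import math

def project(point, tangent, cam_x, f=1.0):
    """Project a 3-D point and its tangent into a pinhole camera at (cam_x, 0, 0).

    Returns the image position and the orientation of the projected tangent.
    """
    X, Y, Z = point[0] - cam_x, point[1], point[2]
    tx, ty, tz = tangent
    x, y = f * X / Z, f * Y / Z
    # Derivative of the perspective projection along the space tangent.
    dx = f * (tx * Z - X * tz) / (Z * Z)
    dy = f * (ty * Z - Y * tz) / (Z * Z)
    return (x, y), math.atan2(dy, dx)

def wrap(angle):
    """Wrap an angle difference to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi

def disparities(point, tangent, baseline=0.1, f=1.0):
    """Positional (horizontal) and orientation disparity of one space tangent."""
    (xl, _), ol = project(point, tangent, -baseline / 2.0, f)
    (xr, _), orr = project(point, tangent, +baseline / 2.0, f)
    return xl - xr, wrap(ol - orr)

# A tangent lying in a fronto-parallel plane versus one receding in depth.
print(disparities((0.0, 0.0, 2.0), (1.0, 1.0, 0.0)))
print(disparities((0.0, 0.0, 2.0), (0.0, 1.0, 1.0)))
```

In this setup the tangent with no depth component projects to the same orientation in both images, while the tangent that recedes in depth produces a nonzero orientation disparity in addition to the positional one.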
7 Summary and Conclusions
In this lecture we attempted to illustrate how the geometry of interactions for smooth objects provides a unifying theme for many of the problems in early vision. The application to the neurobiology of long-range horizontal connections in the first visual cortical area can be found in [9].
References
1. O’Neill, B.: Elementary Differential Geometry. Academic Press, San Diego (1966)
2. Parent, P., Zucker, S.W.: Trace inference, curvature consistency, and curve detection. IEEE Trans. Pattern Analysis and Machine Intelligence 11, 823–839 (1989)
3. Hummel, R.A., Zucker, S.W.: On the foundations of relaxation labeling processes. IEEE Trans. Pattern Analysis and Machine Intelligence PAMI-5, 267–287 (1983)
4. Ben-Shahar, O., Zucker, S.W.: Hue fields and color curvatures: A perceptual organization approach to color image denoising. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, CVPR 03, Madison, WI (2003)
5. Ben-Shahar, O., Zucker, S.W.: Hue geometry and horizontal connections. Neural Networks 17(5-6), 753–771 (2004)
6. Ben-Shahar, O., Zucker, S.W.: The Perceptual Organization of Texture Flow: A Contextual Inference Approach. IEEE Trans. Pattern Analysis and Machine Intelligence 25(4), 401–417 (2003)
7. Li, G., Zucker, S.W.: Contextual Inference in Contour-Based Stereo Correspondence. Int. J. of Computer Vision 69(1), 59–75 (2006)
8. Li, G., Zucker, S.W.: Differential Geometric Consistency Extends Stereo to Curved Surfaces. In: Proc. 9-th European Conference on Computer Vision, Graz, Austria, May 7 - 13 (2006)
9. Ben-Shahar, O., Zucker, S.W.: Geometrical computations explain projection patterns of long-range horizontal connections in visual cortex. Neural Computation 16(3), 445–476 (2003)
Adaptable Model-Based Tracking Using Analysis-by-Synthesis Techniques
Harald Wuest1, Folker Wientapper2, and Didier Stricker2
1 Centre for Advanced Media Technology (CAMTech), Nanyang Technological University (NTU), 50 Nanyang Avenue, Singapore 649812
2 Department of Virtual and Augmented Reality, Fraunhofer IGD, TU Darmstadt, GRIS, Germany
[email protected]
http://www.igd.fhg.de/igd-a4/
Abstract. In this paper we present a novel analysis-by-synthesis approach for real-time camera tracking in industrial scenarios. The camera pose estimation is based on the tracking of line features which are generated dynamically in every frame by rendering a polygonal model and extracting contours out of the rendered scene. Different methods of the line model generation are investigated. Depending on the scenario and the given 3D model either the image gradient of the frame buffer or discontinuities of the z-buffer and the normal map are used for the generation of a 2D edge map. The 3D control points on a contour are calculated by using the depth value stored in the z-buffer. By aligning the generated features with edges in the current image, the extrinsic parameters of the camera are estimated. The camera pose used for rendering is predicted by a line-based frame-to-frame tracking which takes advantage of the generated edge features. The method is validated and evaluated with the help of ground-truth data as well as real image sequences.
1 Introduction
One of the key challenges for augmented reality applications is the real-time estimation of the camera pose. Several markerless tracking approaches exist which use either lines [1,2,3] or point features [4,5] or a mixture of both [6,7] to determine the camera pose. Tracking point features, for example with the well-known KLT tracker [8], can be very promising when the scene consists of many well-textured planar regions. In industrial scenarios, unfortunately, the objects to be tracked are often poorly textured and made of reflective materials, and methods based on point features often produce insufficient results. Line features are more robust against illumination changes and reflections and can therefore be more appropriate for the camera pose estimation.

In this paper we present a model-based tracking method which generates a 3D line model at runtime out of any arbitrary surface model of an object and uses this line model to track the object. The benefit of this approach is that the generated view-dependent line model does not contain any occluded edges and always consists of lines with an appropriate level of detail. Furthermore, since a line model is generated in every frame, the tracking of silhouette edges of round objects like pipes is possible. In the line model generation step a polygonal model of the object is rendered with a predicted camera pose, and contours are extracted by analyzing discontinuities of the z-buffer and the normal buffer or by detecting edges in the frame buffer. The fast creation of an edge map is performed on the graphics hardware. The generated 3D line model is the input for a line-based tracking method, where the extrinsic camera parameters are estimated by minimizing the error between the lines of the model and strong maxima of the image gradient. An application for this tracking approach is an augmented reality system for the maintenance of industrial facilities.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 20–27, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Line Model Generation
The main objective of the line model generation step is to acquire many good candidate lines that are also likely to be visible in the image, while keeping the computational cost affordable. In [2] a manually constructed static line model is used for tracking, and the rendering step is only needed to perform a visibility test. A method of extracting line features out of a textured scene is described by Reitmayr [9], where gradient edges are extracted out of a rendered image. Drummond [10] presents an approach that is based on identifying edges in a rendered CAD model. In their approach a binary space partition tree is used to perform hidden line removal. Representing an object as such a tree, however, is a complex preprocessing step.

Our approach uses an untextured surface model stored in a VRML file and extracts a reliable, view-dependent line model in every processed frame. In contrast to [9], one method for the real-time line model generation presented here is based only on the geometric properties of the object. This method is applied especially when no correct material properties of the model are given. It is related to the silhouette generation of polygonal models for the purpose of non-photorealistic rendering: Isenberg et al. [11] describe methods which create silhouettes of a model in both object space and image space. Hybrid algorithms which combine the advantages of both image space and object space methods exist [12]. As real-time performance is an important criterion in our application, we use only an image space method as detailed in [13].

The algorithm creates a view-dependent 3D line model as follows. First the surface model is rendered with the predicted camera pose. In a second step an edge map is created by analyzing discontinuities in the z-buffer and the normal buffer. Two types of edges are of interest: step edges (also called $C^0$ edges), which correspond to a surface partially occluded by another one, and crease edges ($C^1$), which are the locations where two adjacent surfaces with different orientation meet. For the purpose of their detection we present and discuss several filtering methods, detailed in subsections 2.1 and 2.2. Another method, which extracts edges out of the frame buffer, is described in section 2.3.
Fig. 1. Edge map generation: (a) shows a rendered object and (b) the z-buffer of that object. (c) illustrates the silhouette edges. The normal map is shown in (d) and the edges of the normal map in (e). The combined edge image can be seen in (f).
The last step consists of extracting the 3D lines of the edge map by means of a Canny-like edge extraction algorithm. A 3D contour in the world coordinate system is computed by un-projecting every pixel in the edge map with the information stored in the z-buffer. During the contour following of the edges in the edge map, 3D control points and their 3D directions are generated directly and used as the input for the registration step, which is described in section 3.

2.1 Edge Map Generation Using Laplacian Filtering of the z-buffer
Discontinuities in a z-buffer image are changes in depth, and a pixel at such a discontinuity can be regarded as a point on an edge of the object for the given camera viewing direction. Having rendered the surface model with the predicted camera pose, a second-order differential operator is applied to the z-buffer image to generate an edge image. As one approach in our implementation we use a simple 3 × 3 Laplacian filter mask in order to find points belonging to silhouette edges. For silhouette edges the Laplace operator returns a high absolute value both on the object border and on the neighboring pixel in the background. Since we are interested in the 3D coordinates of pixels which are located on the edge of an object and not on a background surface or the far plane, we only consider pixels with a negative response of the Laplacian filter as silhouette edge pixels. Therefore, it can be guaranteed that silhouette pixels are actually located on the rendered object. At strong depth discontinuities the Laplace filter produces a contour that is only one pixel thick, which is very desirable for the contour following later on. Figures 1(b) and (c) illustrate the z-buffer of a rendered scene and the resulting silhouette edge map.

2.2 Edge Map Generation Using the Normal Buffer
In contrast to the z-buffer method, which is mostly suitable for $C^0$ edges, another promising approach is the additional extraction of $C^1$ edges from the normal map as described in [11]. In [14] a method is described which creates a normal map by placing several light sources on the axes of the camera coordinate system. In our implementation we simply use a pixel shader to generate a normal map. To extract the edge map, the angle between neighboring normal vectors is calculated for every pixel and tested against a predefined threshold. Thereby an edge image is created which represents changes in surface orientation. Figures 1(d) and (e) show a normal map of a rendered scene and the resulting crease edge image. A combined edge map which includes both silhouette and crease edges can be seen in figure 1(f). The whole process of creating an edge image is done on the graphics hardware with two rendering passes. In the first pass the scene is rendered into a texture, where the normal vectors are stored in the (r, g, b) color components and the depth is stored in the alpha channel of the texture. In the second pass a pixel shader creates the combined edge map and stores the edge image in only one color channel. The depth information is thereby stored in the intensity value of that pixel.

2.3 Edge Map Generation Using the Frame Buffer
In highly detailed scenes, where the model consists of many triangles, the edge map obtained from geometric properties such as depth or surface orientation becomes very fuzzy and the resulting contours are not very distinct. If material properties of an object exist, edges can also be extracted out of the frame buffer by simply computing the image gradient. Again the edge image is created in two rendering passes. In the first pass the scene is rendered with only ambient lighting into a texture and the z-buffer value is stored in the alpha channel. In the second pass the image gradient of the (r, g, b) image is calculated. Pixels which are on the far side of a step edge are detected as described in section 2.1 and suppressed in the edge image. Thereby it can be guaranteed that all edge pixels are located on the object, which is necessary to get the correct 3D coordinates. In figure 2 the edge maps of a toy car are shown, which are generated with both the frame-buffer method and the approach using the normal buffer and the z-buffer. In this case the contours of the frame buffer are more distinct, especially in very detailed areas like the wheels or the engine hood.
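The geometric edge tests of sections 2.1 and 2.2 can be sketched on the CPU as follows. This is a plain-Python illustration, not the two-pass shader implementation described above. The function names, the thresholds, the sign convention of the Laplacian mask (centre weight +4, direct neighbours -1) and the assumption that depth grows with distance from the camera are all choices made for this example; under them, a negative response marks the near (object) side of a step edge.

```python
import math

def silhouette_mask(depth, threshold=0.05):
    """Flag object pixels that lie on a depth discontinuity.

    Assumed conventions: depth grows with distance from the camera, and the
    3x3 Laplacian mask has centre weight +4 and weight -1 for the four direct
    neighbours, so the response is negative on the object side of a step edge.
    """
    h, w = len(depth), len(depth[0])
    mask = [[False] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            response = (4.0 * depth[r][c] - depth[r - 1][c] - depth[r + 1][c]
                        - depth[r][c - 1] - depth[r][c + 1])
            mask[r][c] = response < -threshold
    return mask

def crease_mask(normals, max_angle):
    """Flag pixels whose unit normal differs from the right or lower neighbour
    by more than max_angle (radians). Background pixels hold None."""
    h, w = len(normals), len(normals[0])
    mask = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            n = normals[r][c]
            if n is None:
                continue
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if rr < h and cc < w and normals[rr][cc] is not None:
                    m = normals[rr][cc]
                    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(n, m))))
                    if math.acos(dot) > max_angle:
                        mask[r][c] = True
    return mask

def combined_edge_map(depth, normals, threshold=0.05, max_angle=math.radians(30)):
    """Union of silhouette (step) edges and crease edges."""
    sil = silhouette_mask(depth, threshold)
    crease = crease_mask(normals, max_angle)
    return [[s or k for s, k in zip(srow, krow)] for srow, krow in zip(sil, crease)]
```

Because only pixels with a negative response are kept, every silhouette pixel has an object depth value, which is what the later un-projection to 3D control points requires.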
3 Line Model Registration
The results of the line model generation can now be used as the input for the tracking step. Our implementation is based on the approach of [1] and [6]. The camera pose computation is based on the minimization of the distance between the projected control points of a 3D contour and a 2D edge in the image. As it is difficult to decide which of several gradient maxima along the scan line really corresponds to the control point on a model edge, more than one point is considered as a possible candidate. As presented in [6], we use multiple hypotheses to prevent the tracking from being perturbed by misleading strong contours. To make the pose estimation robust against outliers, an estimator function can be applied to the projection error. In our implementation we use the Tukey estimator function $\rho_{Tuk}$. More details can be found in [2].
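For reference, the commonly used form of the Tukey biweight estimator and its associated weight can be sketched as below. The tuning constant c = 4.685 is the usual default for residuals normalized by a robust scale estimate; it is an assumption here, since the paper does not state the value or the normalization it uses.

```python
def tukey_rho(residual, c=4.685):
    """Tukey biweight loss: grows like r^2/2 near zero, constant beyond |r| = c."""
    if abs(residual) <= c:
        u = 1.0 - (residual / c) ** 2
        return (c * c / 6.0) * (1.0 - u ** 3)
    return c * c / 6.0

def tukey_weight(residual, c=4.685):
    """Weight for iteratively reweighted least squares; zero for outliers."""
    if abs(residual) <= c:
        return (1.0 - (residual / c) ** 2) ** 2
    return 0.0

# Small residuals keep almost full weight, large ones are ignored entirely.
print(tukey_weight(0.5), tukey_weight(10.0))
```

In a pose optimization these weights would multiply the squared point-to-edge distances, so that control points matched to a wrong image edge stop influencing the estimate once their residual exceeds c.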
4 Prediction of the Camera Pose
The presented method for the line model registration is a local search method. The convergence towards local minima can be avoided if the initial camera pose used for the minimization is a very close approximation to the real camera pose. Therefore a prediction of the camera pose from frame to frame is very beneficial. As correspondences between 3D contour points and 2D image points exist, if the previous frame was tracked successfully, these correspondences can be used again to make a prediction of the current frame. Therefore the control points of the 3D contour generated in the previous frame are projected into the current image with the camera pose of the previous frame. Again a one-dimensional perpendicular search is performed, but instead of looking for gradient maxima, the point which is most similar to the 2D point in the last frame is regarded as the wanted 2D point. So for every 3D control point, the point with the highest result of a normalized cross correlation along the search line is regarded as the corresponding 2D image point. The calculation of the camera pose is done exactly as in section 3, except that only one 2D point for every 3D point exists.
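The correlation-based search along the scan line can be sketched as follows. For brevity the sketch compares 1-D intensity profiles sampled along the search line; the shape and size of the patches compared in the actual implementation are not specified in the text, so this is an assumption, as are the function names.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equally long intensity profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a flat profile carries no structure to correlate with
    return sum(x * y for x, y in zip(da, db)) / denom

def best_match(template, scanline):
    """Start index and score of the window on the scan line most similar to template."""
    k = len(template)
    scores = [ncc(template, scanline[i:i + k]) for i in range(len(scanline) - k + 1)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# The profile remembered from the previous frame reappears brighter and shifted.
previous_profile = [10, 50, 90]
current_scanline = [5, 5, 5, 20, 60, 100, 7, 7]
print(best_match(previous_profile, current_scanline))
```

Because the correlation is normalized, a uniform brightness change between frames does not affect the match, which suits the frame-to-frame prediction described above.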
5
Experimental Results
To evaluate the accuracy, the algorithm is tested with a synthetic image sequence. In this sequence a virtual model of a toy car is rendered with a predefined camera path, where the camera moves half around the model and back again. The result of the pose estimation is stored and compared with the given ground truth data. We analyze how the edge maps generated with geometry edges and with material edges affect the camera pose estimation. Figure 2 shows the error of the 6 extrinsic camera parameters for the different line model generation methods. For this particular example it can be seen that the method using material edges produces more accurate results. However, this is not surprising, since this method produces a clearer edge map if correct material properties are given. The error of most of the parameters oscillates around
Adaptable Model-Based Tracking Using Analysis-by-Synthesis Techniques
[Figure 2, panels (a) to (f): plots of translation error (x, y, z) and rotation error (α, β, γ) against frame number, for frames 0 to 400]
Fig. 2. Comparison of the error between the estimated and the real camera pose for geometry edges (a) and material edges (d). Panels (b) and (c) plot the error when using the geometry edges; (e) and (f) show the error when using the material edges.
0, except the camera z-translation, which is clearly above 0. It seems that the estimated camera is further away from the tracked object. The reason for this artefact is that the extracted silhouette edges are always on the object, but the gradient edges in the image have their peak between the object border and the background. The extracted silhouette edges therefore have an error of half a pixel, which mostly affects the z-component of the camera translation. An analysis of the processing time is carried out on a 2.8 GHz Pentium 4 with an ATI Radeon 9700 Pro graphics card. The average computational costs for every individual step are shown in Table 1. Only one 8-bit buffer holding the edge map with the depth information encoded in every pixel is read back from the graphics card to the main memory, which is a significant processing time benefit compared to [9], where both frame-buffer and z-buffer need to be accessed. Together with the image acquisition and the visualization, the system runs for this particular example with a frame rate of more than 20 Hz. The algorithm is tested on three real image sequences showing different industrial objects. Some frames of the resulting videos¹ with different models can be seen in Figure 3. If the 3D models consist only of geometric data and no material properties, the frame-buffer method is inapplicable, and only the geometry-based approach is used. The edges used for tracking are generated with the combined z-buffer and the normal-buffer method as depicted in Figure 1. After initializing
¹ Movies may be downloaded from http://www.igd.fhg.de/~hwuest/modeltracking.html
H. Wuest, F. Wientapper, and D. Stricker
Table 1. Average processing time of the individual steps of the tracking approach

step                                  time in ms
prediction step
  create correspondences              11.42
  predict pose                         2.39
tracking step
  render model / create edge map       3.98
  read GL buffer                       6.92
  create correspondences               8.56
  estimate pose                        3.10
total time                            36.37
[Figure 3: panels (a), (b), and (c)]
Fig. 3. Tracking objects in real image sequences of industrial scenarios
the first camera pose manually, it is possible to estimate the correct camera path throughout all sequences. The virtual toy car, which is used in the previous ground truth evaluation, is also tracked in a real image sequence. Both the method based on geometry and the one based on appearance are able to track the model throughout the whole sequence successfully. As in the synthetic image sequence, for this model the frame-buffer method produced tracking results with a more accurate camera pose and less jitter. All the sequences are tested without the prediction step as well. If large movements occur, especially fast rotations of the camera, the registration step does not produce correct results. The parameters of the camera pose get stuck in local minima and the overall tracking fails. Therefore a rough estimation of the camera pose is indispensable to handle fast camera movements.
6
Conclusion
We have presented a flexible tracking method, which is able to track an object with a given polygonal model by generating 3D contours on the fly and aligning these contours on the image gradient. Our method never runs into any scaling problems, since a line model is generated in every frame with an adequate level of detail. A major improvement is the ability of tracking occluding edges and silhouette edges of objects like tubes or pipes. Depending on the scene, a method
for the line model generation has to be chosen. When correct material properties exist, the frame-buffer method produces slightly better results. For large camera movements, a prediction of the camera pose for the model generation helps the pose estimation to converge in the next frame. Future work will concentrate on the integration of an inertial measurement unit to get an even better estimate of the predicted camera pose and thus stable results even in the case of large abrupt motions.
Acknowledgments The authors would like to thank their project partners Siemens Automation & Drives and Howaldtswerke Deutsche Werft for providing the test scenarios.
References
1. Comport, A., Marchand, E., Pressigout, M., Chaumette, F.: Real-time markerless tracking for augmented reality: the virtual visual servoing framework. IEEE Trans. on Visualization and Computer Graphics 12(4), 615–628 (2006)
2. Wuest, H., Vial, F., Stricker, D.: Adaptive line tracking with multiple hypotheses for augmented reality. In: ISMAR, pp. 62–69 (2005)
3. Lowe, D.G.: Robust model-based motion tracking through the integration of search and estimation. International Journal of Computer Vision 8(2), 113–122 (1992)
4. Bleser, G., Wuest, H., Stricker, D.: Online camera pose estimation in partially known and dynamic scenes. In: ISMAR, pp. 56–65 (2006)
5. Molton, N.D., Davison, A.J., Reid, I.D.: Locally planar patch features for real-time structure from motion. In: Proc. British Machine Vision Conference (2004)
6. Vacchetti, L., Lepetit, V., Fua, P.: Combining edge and texture information for real-time accurate 3d camera tracking. In: Proceedings of International Symposium on Mixed and Augmented Reality (ISMAR) (2004)
7. Rosten, E., Drummond, T.: Fusing points and lines for high performance tracking. In: IEEE International Conference on Computer Vision, pp. 1508–1511. IEEE Computer Society Press, Los Alamitos (2005)
8. Shi, J., Tomasi, C.: Good features to track. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR'94), pp. 593–600. IEEE Computer Society Press, Los Alamitos (1994)
9. Reitmayr, G., Drummond, T.: Going out: Robust model-based tracking for outdoor augmented reality. In: ISMAR, pp. 109–118 (2006)
10. Drummond, T., Cipolla, R.: Real-time tracking of complex structures with on-line camera calibration. In: British Machine Vision Conference, pp. 574–583 (1999)
11. Isenberg, T., Freudenberg, B., Halper, N., Schlechtweg, S., Strothotte, T.: A developer's guide to silhouette algorithms for polygonal models. IEEE Comput. Graph. Appl. 23(4), 28–37 (2003)
12. Northrup, J.D., Markosian, L.: Artistic silhouettes: A hybrid approach. In: Proceedings of the First International Symposium on Non-Photorealistic Animation and Rendering (NPAR) for Art and Entertainment, June 2000 (2000)
13. Nienhaus, M., Döllner, J.: Edge-enhancement - an algorithm for real-time non-photorealistic rendering. In: WSCG (2003)
14. Hertzmann, A.: Introduction to 3d non-photorealistic rendering: Silhouettes and outlines. In: SIGGRAPH 99, ACM Press, New York (1999)
Mixture Models Based Background Subtraction for Video Surveillance Applications
Chris Poppe, Gaëtan Martens, Peter Lambert, and Rik Van de Walle
Ghent University - IBBT
Department of Electronics and Information Systems - Multimedia Lab
Gaston Crommenlaan 8, B-9050 Ledeberg-Ghent, Belgium
{chris.poppe,gaetan.martens,peter.lambert,rik.vandewalle}@ugent.be
http://multimedialab.elis.ugent.be/
Abstract. Background subtraction is a method commonly used to segment objects of interest in image sequences. By comparing new frames to a background model, regions of interest can be found. To cope with highly dynamic and complex environments, a mixture of several models has been proposed in the literature. This paper proposes a novel background subtraction technique based on the popular Mixture of Gaussian Models technique. Moreover edge-based image segmentation is used to improve the results of the proposed technique. Experimental analysis shows that our system outperforms the standard system both in processing speed and detection accuracy. Keywords: Object Detection, Mixture of Gaussian Models, Video Surveillance.
1
Introduction
The detection and segmentation of objects of interest in image sequences is the first major processing step in many computer vision applications such as visual surveillance, traffic monitoring, and semantic annotation. The outcome of this step is usually used in other processing modules in computer vision applications. Therefore, it is important to achieve very high accuracy in the detection, with the lowest possible false alarm rates. The detection of moving objects in dynamic scenes has been the subject of research for several years and different approaches exist [1]. One of the most popular techniques is background subtraction, in which a background model is created and dynamically updated. Moving objects of interest, called foreground objects, are represented by the pixels that differ significantly from this background model. The Mixture of Gaussian Models (MGM) is one of the most popular background subtraction techniques [2]. By using a mixture of Gaussian models and a dynamic update scheme, it can handle highly complex scenes with difficult situations like moving trees and bushes, clutter, noise, and permanent changes of the background. Although it gives good results, the Gaussian models and the update scheme are complex to use. To overcome the complexity of the traditional MGM we present a simpler mixture of models technique (SMM). Moreover, we present a fast edge-based image segmentation which is used to improve the detection results. Section 2 elaborates on the mixture technique we propose and compares it with MGM. Subsequently, Sect. 3 discusses the spatial image segmentation we used. Section 4 elaborates on the combination of the proposed background subtraction technique with the spatial segmentation and shows experimental results. Finally, some conclusions and further work are formulated in Sect. 5.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 28–35, 2007. © Springer-Verlag Berlin Heidelberg 2007
2
Background Subtraction Using a Mixture of Models
2.1
Background Subtraction Using MGM
When using background subtraction, a background model is created which resembles the observed environment as closely as possible. A dynamic and complex environment typically introduces different background values for the same pixel (e.g., movement of branches), so a single background model is insufficient. Therefore, MGM was first proposed by Stauffer and Grimson in [2]. MGM is a time-adaptive per pixel subtraction technique in which every pixel is represented by a vector, called $I_p$, consisting of three color components (red, green, and blue). For every pixel a mixture of multivariate normal distributions, which are the actual models, is maintained and each of these models is assigned a weight.

$$G(I_p, \mu_p, \Sigma_p) = \frac{1}{\sqrt{(2\pi)^n \, |\Sigma_p|}} \; e^{-\frac{1}{2} (I_p - \mu_p)^T \Sigma_p^{-1} (I_p - \mu_p)} \tag{1}$$
Equation (1) gives the formula for a Gaussian distribution $G$. The parameters are $\mu_p$ and $\Sigma_p$, which are the mean and covariance matrix of the distribution, respectively, and $n$ is the number of components of $I_p$. For computational simplicity, the covariance matrix is assumed to be diagonal. For every new pixel a matching, an update, and a decision step are executed. The new pixel value is compared with the models of the mixture. A pixel is matched if its value occurs inside a confidence interval within 2.5 standard deviations from the mean of the model. In that case, the parameters of the corresponding distribution are updated according to (2), (3), and (4).

$$\mu_{p,t} = (1 - \rho)\, \mu_{p,t-1} + \rho\, I_{p,t} \tag{2}$$

$$\Sigma_{p,t} = (1 - \rho)\, \Sigma_{p,t-1} + \rho\, (I_{p,t} - \mu_{p,t}) (I_{p,t} - \mu_{p,t})^T \tag{3}$$

$$\rho = \alpha\, G(I_{p,t}, \mu_{p,t-1}, \Sigma_{p,t-1}) \tag{4}$$
$\alpha$ is the learning rate, which is a global parameter, and introduces a trade-off between fast adaptation and detection of slow moving objects. Each model has a weight, $w$, which is updated for every new image according to (5).

$$w_t = (1 - \alpha)\, w_{t-1} + \alpha\, M_t \tag{5}$$
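The matching and update steps of equations (1) to (5) can be sketched for a single pixel as follows, where the match indicator of (5) is 1 for the matched model and 0 for all others. This is a minimal illustration assuming a diagonal covariance stored as per-channel variances; the dictionary layout, the per-channel 2.5 standard deviation test, and the choice of the first matching model are assumptions. Replacing a model when no match is found is not covered.

```python
import math


def gaussian_density(pixel, mean, var):
    """Equation (1) for a diagonal covariance: product of 1-D normal densities."""
    density = 1.0
    for x, m, v in zip(pixel, mean, var):
        density *= math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
    return density


def matches(pixel, mean, var, k=2.5):
    """Assumed matching rule: every channel lies within k standard deviations of the mean."""
    return all(abs(x - m) <= k * math.sqrt(v) for x, m, v in zip(pixel, mean, var))


def update_mixture(pixel, models, alpha=0.01):
    """Apply equations (2)-(5) in place; each model is a dict with 'mean', 'var', 'weight'."""
    matched = None
    for model in models:
        if matched is None and matches(pixel, model["mean"], model["var"]):
            matched = model
    for model in models:
        m_t = 1.0 if model is matched else 0.0
        model["weight"] = (1.0 - alpha) * model["weight"] + alpha * m_t       # (5)
    if matched is not None:
        # (4): rho uses the mean and variance from the previous time step.
        rho = alpha * gaussian_density(pixel, matched["mean"], matched["var"])
        matched["mean"] = [(1.0 - rho) * m + rho * x
                           for m, x in zip(matched["mean"], pixel)]            # (2)
        # (3): the variance update uses the already updated mean.
        matched["var"] = [(1.0 - rho) * v + rho * (x - m) ** 2
                          for v, x, m in zip(matched["var"], pixel, matched["mean"])]
    return matched
```

Note that the density in (4) must be evaluated for every matched pixel, which is the cost that motivates the constant update parameter used in SMM.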
If the corresponding model introduced a match, $M_t$ is 1, otherwise it is 0. Formulas (2) to (5) represent the update step. Finally, in the decision step, the models are sorted according to their weights. MGM assumes that background pixels occur more frequently than actual foreground pixels. For that reason MGM defines a threshold based on these weights to decide which models of the mixture depict background or foreground. Indeed, if a pixel value occurs recurrently, the weight of the corresponding model increases and it is assumed to be background. The next section shows the changes we propose to MGM, which result in a technique called simple mixture of models (SMM).
2.2
Background Subtraction Using SMM
The authors of this paper believe that the true strength of MGM lies not in the use of the Gaussian distributions, but in the use of the mixture of weighted models which can deal with multimodal environments. Therefore, we inherit this concept in our system. The update step of MGM demands heavy calculations, since for every matched model the probability distribution function is used to calculate the parameter ρ. Previous research has shown that this calculation can be skipped since this value usually is very small and does not change much [3]. We adopt this in our system and use a constant parameter ρ. Since the distribution functions are now actually no longer used, we discard the conceptual use of the Gaussian models entirely and make use of models consisting of an average pixel value and a maximum difference between new values and the average. This difference will be evaluated according to two independent thresholds. An upper threshold is used for pixel values larger than the average, a lower one for pixels smaller than the average. This separation makes sense, since the symmetric structure of a Gaussian distribution gives a misinterpretation of real pixel values. New larger pixel values, which fall within the range of a model, do not infer that pixels with much smaller values should also be considered part of this model. Additionally, we incorporate the difference between a new pixel value and the last value for that pixel that was regarded as background. MGM only takes the generated models into account and does not regard the previous value for that pixel. As such, if values increase consistently, but with small intermediate steps, the pixel values will fall out of the range of the models. This effect can typically be experienced when a cloud gradually changes the lighting of the environment, which usually can occur in very small time spans (30 seconds), but it will affect MGM drastically. 
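The matching rule described so far can be sketched for a single gray value as follows. This is only an illustration under stated assumptions: the dictionary keys, the `step_threshold` parameter, and the way the drift test is combined with the range test (logical or) are hypothetical, since the text does not spell out these details.

```python
def smm_matches(value, model, last_background, step_threshold):
    """Return True if `value` is explained by `model` or by a small drift.

    model: dict with 'mean', 'upper' and 'lower' (all names are assumptions).
    last_background: last value of this pixel that was classified as background.
    """
    diff = value - model["mean"]
    if diff >= 0:
        # Values above the average are checked against the upper threshold only.
        in_range = diff <= model["upper"]
    else:
        # Values below the average are checked against the lower threshold only.
        in_range = -diff <= model["lower"]
    # A small change relative to the last background value is assumed
    # not to be caused by a foreground object.
    small_step = abs(value - last_background) <= step_threshold
    return in_range or small_step
```

For example, with a model of mean 100, upper threshold 10 and lower threshold 4, the value 108 matches while 94 does not, unless the previous background value was already close to 94.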
By checking if new pixel values differ only slightly from the previous value, one can assume that this small change is not due to the appearance of a foreground object. As such, we need to take the maximum difference between two consecutive pixels into account in our model. We use a new threshold which is dependent on the upper and lower thresholds of the model, to take the variation of pixel values into account. In this sense our approach resembles the methods applied in the W4 system [4]. Haritaoglu et al. use a minimum value, maximum value and maximum difference with the previous pixel value to evaluate new pixel values. However, these
parameters are defined during a training period and are only periodically updated. Moreover, they do not use different models to cope with the multimodal behaviour of complex environments. We use the same update steps as in MGM to adjust the mean, weights and thresholds of our models. The separation into an upper and a lower threshold allows them to be updated independently. The conceptual approach of sorting the models in MGM introduces additional complexity, so we alter this in our system. In most cases there is only one model which simulates the background. As such, the first step should be to check whether the weight of the matched model is larger than the predefined threshold. If not, the next step is to check whether the weight of the matched model is smaller than the weights of all other models. In this case the model surely depicts foreground. Since these two cases occur frequently, a costly sorting step can be avoided. If neither of the initial conditions is fulfilled, the sorting as introduced in MGM can be applied. To further increase the processing speed we have chosen to use a checkerboard pattern for the evaluation of the frames, as shown in the following pseudo-code:

for each pixel(x,y) in frame t
    if mod(x+y+t,2) == 1
        result(x,y) = doBackgroundSubtraction(pixel(x,y))
    else
        result(x,y) = result(x-1,y)
For the pixels which are not evaluated at the current frame, a very simple but fast interpolation is used. We copy the decision made for the pixel situated to the left of the current pixel. More advanced interpolations could be made, by using the pixel values in previous frames or the pixels surrounding the current pixel, but these would involve increased memory consumption or multiple passes of the image. By using this checkerboard pattern at each frame, only half of the pixels introduce the complexity of the above presented detection algorithm.
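A runnable sketch of the checkerboard evaluation is given below. The frame is assumed to be a list of rows, and `classify` is a hypothetical stand-in for the per-pixel background subtraction. The pseudo-code does not say what happens in the first column, where no left neighbour exists; here such pixels are simply classified directly, which is an assumption.

```python
def checkerboard_subtraction(frame, t, classify):
    """Classify every second pixel of `frame` and copy the left neighbour for the rest.

    frame: list of rows of pixel values; t: frame index;
    classify(x, y, value): hypothetical per-pixel detector returning the decision.
    """
    result = []
    for y, row in enumerate(frame):
        out_row = []
        for x, value in enumerate(row):
            # Assumption: column 0 has no left neighbour, so it is always classified.
            if (x + y + t) % 2 == 1 or x == 0:
                out_row.append(classify(x, y, value))
            else:
                out_row.append(out_row[x - 1])
        result.append(out_row)
    return result
```

Because the pattern depends on t, the set of directly evaluated pixels alternates from frame to frame, so every pixel is evaluated at least every second frame.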
3
Spatial Image Segmentation
SMM takes for each pixel only the past pixel values into account. Although the technique gives reasonable results, including spatial information (such as neighboring pixel values) promises to improve the results. We propose to incorporate spatial information into our system. For this approach we segment the image into different regions according to edge information. This segmentation will consequently be combined with SMM in Sect. 4 to improve the results of the object detection. We use the well-known Canny edge detection technique [5], since it runs very fast and achieves good results. Nevertheless, it is not an optimal edge detection technique and edges can be discontinuous. We want to segment the image based on these edges, so it is important to have as many closed regions as possible. Therefore, we built an extension on the output of the Canny edge detector. We
Fig. 1. Left: current frame, middle: Canny output, right: extended Canny output
detect loose ends with a 3x3 search window and extend the ends in the direction of the edge. Experimental results have shown that an extension of 5 pixels is sufficient for the test data we used. If more pixels are used, false segmentations will occur more frequently due to noise. Considering that surveillance cameras typically capture an entire environment, it is assumed that many small regions will be found. Therefore 5 pixels should be sufficient to close possible holes in the edges. This parameter is dependent on the resolution of the images and the complexity of the monitored environment. Fig. 1 shows the result of applying the Canny edge detector and the extended version on an image. As can be seen, the extended version is able to close some regions which were left open. After this extension step, the image is divided into closed regions and we use a flood fill operation to give each segment a different value. The pixels representing the edges get the value of the smallest neighboring region. As such, every pixel in the image can be assigned to one and only one region. This image segmentation runs in real-time; segmenting the PetsD2TeC2 sequence takes an average of 28 ms per frame. Fast segmentation is an important factor since the results will be used in the further processing of our surveillance system. Although image segmentation techniques exist which obtain better results, these are more complex and not suited for processing surveillance video data in real-time [6].
4
Extended SMM
4.1
Combining SMM with Spatial Image Segmentation
The goal is to improve the result of SMM by using the edge-based segmentation. To accomplish this we use a two-pass matching step, as shown in the following pseudo-code:

for each pixel(x,y) in frame t
    if SMMresult(x,y) == 1
        nrPix(getSegment(x,y))++
for each pixel(x,y) in frame t
    if SMMresult(x,y) == 0
        if nrPix(getSegment(x,y)) > 0.6*size(getSegment(x,y))
            SMMresult(x,y) = 1
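The two-pass step in the pseudo-code above can be sketched in Python as follows. The mask and the segment map are assumed to be lists of rows of equal shape, and the function name is hypothetical. The sketch returns a new mask instead of modifying the input, which gives the same result because the per-segment counts are fixed after the first pass.

```python
from collections import Counter


def extend_with_segments(smm_result, segments, ratio=0.6):
    """Mark all pixels of a segment as foreground if enough of them already are.

    smm_result: rows of 0/1 decisions from SMM; segments: rows of segment ids.
    """
    foreground_count = Counter()
    segment_size = Counter()
    # First pass: count the pixels and the foreground pixels of every segment.
    for mask_row, seg_row in zip(smm_result, segments):
        for is_fg, seg in zip(mask_row, seg_row):
            segment_size[seg] += 1
            if is_fg == 1:
                foreground_count[seg] += 1
    # Second pass: promote background pixels that lie in a mostly-foreground segment.
    return [[1 if is_fg == 1 or foreground_count[seg] > ratio * segment_size[seg] else 0
             for is_fg, seg in zip(mask_row, seg_row)]
            for mask_row, seg_row in zip(smm_result, segments)]
```

With the ratio of 0.6 from the pseudo-code, a segment of four pixels is filled when three of them are foreground, but not when only two are.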
Fig. 2. False positive rate for every 50th frame of PetsD2TeC2 sequence
First, we store for each region in the segmented image the number of corresponding foreground pixels resulting from SMM. In a second step, we go over all the background pixels resulting from SMM and see which segments correspond with their location. The segment is considered a foreground region if the number of foreground pixels exceeds a certain amount of the segment size. In that case the background pixels of this segment are denoted as foreground. As such the detection result of SMM is extended to comply with the segmentation result. This way, foreground pixels, which were not detected by only using temporal information in SMM, can still be found.
4.2
Experimental Results
To show the introduced improvements we compare MGM, SMM, and the combination of SMM with the edge-based image segmentation (eSMM). If a background pixel is misclassified as foreground, it is called a false positive. If a foreground pixel is not detected, it is called a false negative. Fig. 2 and Fig. 3, respectively, depict the false positive rate (FP rate) and false negative rate (FN rate) for the PetsD2TeC2 sequence (384x288) [7]. The FP rate is the percentage of real background pixels which were misinterpreted as foreground. The FN rate is the percentage of real foreground pixels which were not detected. A manual ground truth annotation has been done for every 50th frame of the sequence. For each of these frames, the ground truth annotation has been matched with the output of the detection techniques to find the number of pixels which are erroneously classified. As can be seen in Fig. 2, SMM has fewer erroneously interpreted background pixels than MGM. This is mostly due to the use of the maximum difference between consecutive pixel values. The use of the edge-based
Fig. 3. False negative rate for every 50th frame of PetsD2TeC2 sequence

Table 1. Average execution times for MGM, SMM, edge-based image segmentation and eSMM in milliseconds per frame

sequence      MGM             SMM             image segmentation   eSMM
              avg.   stdev.   avg.   stdev.   avg.   stdev.        avg.   stdev.
PetsD1TeC1    147    24       65     8        22     7             99     10
PetsD1TeC2    145    23       64     6        26     8             102    10
PetsD2TeC1    155    38       67     7        25     8             108    13
PetsD2TeC2    159    43       66     8        28     8             105    12
image segmentation in eSMM increases the false positives, but it still performs better than MGM. The increase in false positives is mostly due to background pixels that are erroneously interpreted as foreground by SMM and are then extended by the segmentation technique. Indeed, if many of these misdetections are situated in the same segment, the segment will be regarded as foreground, introducing false detections. Better segmentation techniques can improve the FP rate of eSMM. Fig. 3 shows that SMM has more missed foreground pixels than MGM. However, the extension with the edge-based image segmentation decreases the FN rate considerably and on average it performs better than MGM. Table 1 shows a comparison of the execution times for different test sequences. We have recorded for each frame the time it takes to perform the object detection and consequently present the average values and standard deviations. All sequences have a resolution of 352 by 288 pixels. As shown, SMM is much faster than MGM and allows processing 15 frames per second. Although eSMM introduces improvements of the detection results, it also slows down the system. Nevertheless, eSMM is still faster than MGM. The processing time for segmenting the image according to the edges is also shown, since this step can be done
entirely independently of SMM. A parallel implementation would therefore allow the processing time to be reduced even further.
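The FP rate and FN rate used in this section follow directly from their definitions and can be computed per annotated frame as sketched below. The masks are assumed to be lists of rows with truthy values for foreground; the function name is hypothetical.

```python
def error_rates(detected, ground_truth):
    """Return (fp_rate, fn_rate) in percent for two binary masks given as lists of rows."""
    fp = fn = background = foreground = 0
    for det_row, gt_row in zip(detected, ground_truth):
        for det, gt in zip(det_row, gt_row):
            if gt:
                foreground += 1
                if not det:
                    fn += 1      # real foreground pixel that was missed
            else:
                background += 1
                if det:
                    fp += 1      # real background pixel reported as foreground
    fp_rate = 100.0 * fp / background if background else 0.0
    fn_rate = 100.0 * fn / foreground if foreground else 0.0
    return fp_rate, fn_rate
```

Both rates are normalized by the number of real background and real foreground pixels, respectively, so a frame with few foreground pixels can show a high FN rate from only a handful of missed pixels.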
5
Conclusions
This paper presents a new background subtraction scheme based on the popular Mixture of Gaussian Models. We propose the use of simpler models and altered matching, update and decision steps. The system achieves comparable results with the original scheme but with faster response. Additionally, an image segmentation based on edge information has been introduced. The result of the segmentation is consequently used to improve the detection result of our system by reducing the number of false negatives. Experimental results show the gains in processing speed and detection results. An important advantage of our system is that current improvements of the MGM can be projected on our system too. The use of different color spaces, shadow removal techniques etc. can improve the results of MGM and SMM due to the similar structure.
Acknowledgments. The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT-Flanders), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union.
References
1. Dick, A., Brooks, M.J.: Issues in automated visual surveillance. International Conference on Digital Image Computing: Techniques and Applications, pp. 195–204 (2003)
2. Stauffer, C., Grimson, W.E.L.: Learning Patterns of Activity Using Real-Time Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 747–757 (2000)
3. Zang, Q., Klette, R.: Evaluation of an Adaptive Composite Gaussian Model in Video Surveillance. In: Petkov, N., Westenberg, M.A. (eds.) CAIP 2003. LNCS, vol. 2756, pp. 165–172. Springer, Heidelberg (2003)
4. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 809–830 (2000)
5. Canny, J.: A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 679–698 (1986)
6. De Bock, J., Pires, R., De Smet, P., Philips, W.: A Fast Dynamic Border Linking Algorithm for Region Merging. In: Blanc-Talon, J., Philips, W., Popescu, D., Scheunders, P. (eds.) ACIVS 2006. LNCS, vol. 4179, pp. 232–241. Springer, Heidelberg (2006)
7. Brown, L.M., Senior, A.W., Tian, Y., Connell, J., Hampapur, A., Shu, C., Merkl, H., Lu, M.: Performance Evaluation of Surveillance Systems Under Varying Conditions. IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (2005) http://www.research.ibm.com/peoplevision/performanceevaluation.html
Applicability of Motion Estimation Algorithms for an Automatic Detection of Spiral Grain in CT Cross-Section Images of Logs
Karl Entacher¹, Christian Lenz², Martin Seidel², Andreas Uhl², and Rudolf Weiglmaier²
¹ Salzburg University of Applied Sciences, Austria [email protected]
² Department of Computer Sciences, Salzburg University, Austria [email protected]
Abstract. Techniques for an automatic detection of spiral grain in cross-section CT images of logs are proposed and evaluated on sets of natural and artificial cross-section images. Explicit analysis of global rotation, block matching, and optical flow techniques are compared. Experimental results seem to indicate that spiral grain in fact cannot be modeled by a circular motion of luminance values in gray scale images.
1
Introduction
Spiral grain means that wooden fibers are not oriented in parallel along the axis of a tree but grow in a spiral manner around the pith. An occurrence of strong spiral grain heavily influences the quality of sawn timber, i.e. the mechanical properties may be reduced or boards show a tendency to twist after breakdown from a log or after a drying process. Spiral grain may occur, with different characteristics, in all wood species [1,2]. There are several different ways to measure spiral grain, which means determining or estimating the angle of fiber orientation relative to the pith. Traditional destructive methods cut out samples in radial direction to measure the slope of grain. Certain nondestructive methods predict spiral grain based on a determination of the fiber orientation on the surface of the log or board. For examples, see the application of the tracheid effect, which utilizes the light-conducting properties of the softwood tracheids to measure the direction of spiral grain [3,4], or an application of NIR spectroscopy for grain-angle determination [5]. The usage of computed tomography to predict spiral grain was studied in [6], where the slope of grain is estimated from pixel patterns of unwrapped growth ring images. Ekevad [7] determines local fiber directions from CT images using a method to calculate the principal directions of inertia (eigenvectors) of measurement spheres distributed throughout the axial direction of the wood object. In the present paper we study a different way for a possible automatic prediction of spiral grain from CT data of logs. When interpreting the CT cross-section images as temporally ordered and displaying them in a video-style manner, a rotation-type motion seems to be clearly present in the data. The aim of the present study is to compare different motion detection algorithms with respect to their usability for estimating spiral grain. In Section 2 we propose the different methods applied: global rotation, block matching and optical flow. Section 3 contains experimental results using DICOM data from large dimensioned specimens of Norway Spruce [8,9]. The final section concludes the findings.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 36–44, 2007. © Springer-Verlag Berlin Heidelberg 2007
2
Detection of Spiral Grain in CT Cross-Section Images
The first technique proposed (“rotation analysis”) assumes that spiral grain can actually be modeled by a global rotation of an entire cross-section image. For this purpose, all images in the set are aligned according to their pith. Subsequently, we compare each image (i) with rotated versions of image (i + 1) in an interval [−1.5°, +1.5°] in s = 1 . . . 100 steps. The rotated version of image (i + 1) exhibiting the smallest error when compared to image (i) indicates the rotation angle between images (i) and (i + 1). The problem of rotating images by an arbitrary angle (except for multiples of 90°) is that the rotated pixels do not fit exactly into the pixel matrix, so that we have to interpolate the gray values. We use either bilinear or bicubic interpolation to cope with this problem. Block matching [10] is mainly used for motion estimation in hybrid video coding schemes like MPEG-1,2,4 or the H.26X family. This technique assumes constant motion within a block of pixels and is restricted to a local translational motion model on a block basis (motion is represented by a single motion vector for each block). Given a block of pixels in the image (i + 1), the corresponding motion vector is computed by searching for the most similar block within a specific search area in image (i) centered around the position of the block in image (i + 1). The matching block is only accepted in case the error between the two blocks is below a certain threshold. Locally constant translational motion is supposed to approximate spiral grain fairly well. Motion vectors are expected to point into the direction of the spiral grain if present. In our implementation we use 8 × 8 pixel blocks and a search area of 2 pixels around the original block (due to the low amount of spiral grain to be expected). Finally, optical flow is a method for estimating the motion of picture elements in a sequence of images without imposing any restriction about the type of motion present.
As a result for each image pair, we get an optical flow field representing the motion of brightness information between the images. The Horn & Schunk [11] approach introduces an additional global smoothness constraint in order to minimize strong fluctuations in the flow field. For each pixel a set of equations has to be solved which is usually done in an iterative way which explains the slowness of this technique. The Lucas & Kanade [12] approach relaxes the global smoothness constraint into a local one which facilitates a significant reduction of computational demand. We apply both techniques to image pairs in our cross-section image sets assuming that spiral grain may be modeled by motion of brightness information between image pairs.
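The rotation analysis described above can be sketched as follows. This is a minimal illustration in plain Python (lists of rows, no array library) and not the implementation used for the experiments: the function names, the disk-shaped comparison area, the use of MSE as the error, and the synthetic test pattern in `synthetic_pair` are all assumptions made for the example. The sign of the returned angle is simply the one implied by the inverse mapping in `rotate_bilinear`.

```python
import math

def rotate_bilinear(img, angle_deg, cx, cy):
    """Rotate a gray image (list of rows) about (cx, cy), bilinear interpolation."""
    h, w = len(img), len(img[0])
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: source position of output pixel (x, y).
            dx, dy = x - cx, y - cy
            sx = cx + cos_a * dx + sin_a * dy
            sy = cy - sin_a * dx + cos_a * dy
            x0, y0 = math.floor(sx), math.floor(sy)
            if 0 <= x0 < w - 1 and 0 <= y0 < h - 1:
                fx, fy = sx - x0, sy - y0
                top = (1 - fx) * img[y0][x0] + fx * img[y0][x0 + 1]
                bottom = (1 - fx) * img[y0 + 1][x0] + fx * img[y0 + 1][x0 + 1]
                out[y][x] = (1 - fy) * top + fy * bottom
    return out

def mse_in_disk(a, b, cx, cy, radius):
    """Mean squared error restricted to a disk (assumed stand-in for the wood area)."""
    total, count = 0.0, 0
    for y in range(len(a)):
        for x in range(len(a[0])):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                total += (a[y][x] - b[y][x]) ** 2
                count += 1
    return total / count

def estimate_rotation(img_i, img_next, cx, cy, radius, max_angle=1.5, steps=100):
    """Angle in [-max_angle, +max_angle] that best maps img_next onto img_i."""
    best_angle, best_err = 0.0, float("inf")
    for s in range(steps + 1):
        angle = -max_angle + 2.0 * max_angle * s / steps
        candidate = rotate_bilinear(img_next, angle, cx, cy)
        err = mse_in_disk(img_i, candidate, cx, cy, radius)
        if err < best_err:
            best_angle, best_err = angle, err
    return best_angle, best_err

def synthetic_pair(angle_deg, size=41):
    """Hypothetical test data: a smooth pattern and an analytically rotated copy."""
    c = (size - 1) / 2.0
    a = math.radians(angle_deg)

    def pattern(x, y):
        return 128.0 + 60.0 * math.sin(0.2 * x) + 60.0 * math.cos(0.17 * y + 0.1 * x)

    img = [[pattern(x, y) for x in range(size)] for y in range(size)]
    nxt = [[pattern(c + math.cos(a) * (x - c) + math.sin(a) * (y - c),
                    c - math.sin(a) * (x - c) + math.cos(a) * (y - c))
            for x in range(size)] for y in range(size)]
    return img, nxt, c
```

With `steps=100` the candidate angles are spaced 0.03° apart, which matches the granularity of the angles reported in the tables. Restricting the error to a disk around the pith avoids the border pixels that the rotation leaves undefined.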
K. Entacher et al.
3 Experiments
3.1 Experimental Settings
We have used three different image sets for our analysis, all of them cross-section images of logs acquired by computed tomography (CT). We use two CT image sets of logs known to exhibit spiral grain; each image of both sets is 512×512 pixels with 8 bpp gray values. Set A (see Fig. 1(a)) consists of 42 slices and set B (see Fig. 1(b)) of 21 slices. The distance between the cross-section images is 8 mm in set A and 16 mm in set B. The light gray lines below the log show the patient rest, because the CT images are acquired with a medical scanner.
(a) sample from set A
(b) sample from set B
Fig. 1. Cross-section images used in experiments
For a proof of concept of our techniques we have artificially generated an image series of a log with “regular spiral grain”. For this purpose, we have rotated a single cross-section CT image of an entire tree (see Fig. 2(a)) around the pith by 0.5° clockwise (negative sign) per image, generating 18 images in that way (to avoid error accumulation, we have used the first image for each rotation and have rotated it by an increasing angle). The size of the images is 415 × 415 pixels, with 8 bpp gray values. We have generated two sets, one with bilinear and the other with bicubic pixel interpolation (see Section 2.1). After image alignment at the pith, an area located entirely within the wood area of the images is selected for rotation analysis (in order to avoid errors caused by black background pixels and noisy pixels from the bark, see Fig. 2(b)). Rotation analysis is conducted with image distances of 1 and 3, respectively. The image similarity measures employed in the experiments to determine the rotation angle are mean error (Diff), mean squared error (MSE), peak signal-to-noise ratio (PSNR), and normalized cross-correlation (X-corr).
3.2 Results
With respect to rotation analysis, the values given in the columns of Tables 1–3 indicate the computed rotation between two slices [in °] (determined with
Applicability of Motion Estimation Algorithms for an Automatic Detection
Fig. 2. Cross-section of entire tree and area used to measure spiral grain: (a) artificial spiral grain, (b) area to measure grain
different error measures in each column). All three tables display results where bilinear interpolation has been used when rotating and matching the images; however, the results are almost identical for bicubic interpolation. Also, there is hardly any difference whether slice distance 1 or 3 is employed. The error measures Diff, MSE, and PSNR deliver fairly consistent results; only X-corr often suggests a different amount of rotation compared to the other measures. For image set A (see Table 1), the results are rather disappointing. We notice a partially significant fluctuation of rotation angles among adjacent pairs of slices, where even the orientation of the rotation changes from positive to negative angles and back to positive ones (see e.g. slices 2–5 and 22–25). Of course, such a phenomenon cannot be caused by natural spiral grain, which means that we obviously measure nothing but artifacts, possibly caused by the high degree of similarity between the cross-section images even without rotation compensation. The results for image set B (see Table 2) are more encouraging. We notice areas of more homogeneous rotation angles compared to set A. For example, almost constant rotation is obtained in the sets of slices 1–5, 8–11, and 12–15. The log area between slices 15 and 21 seems to exhibit a low amount of rotation (and correspondingly little spiral grain). The pairs of slices where, similar to set A, a local fluctuation of rotation angles is observed are difficult to interpret. We assume that we are again confronted with artifacts of unclear provenance. Finally, when applying our rotation analysis approach to the set of artificially rotated images (see Table 3), we are able to identify the induced rotation with high precision (except for X-corr). This result shows that this technique is capable of detecting global rotation reliably in principle (in case rotation is actually present).
Taken together, the results of rotation analysis suggest that the assumption that spiral grain can be modeled by a global log rotation does not hold. When applied to the artificially rotated images, rotation angles are computed reliably, but when applied to natural cross-section slices we face high fluctuations of rotation angles even among adjacent slices, which cannot be explained by spiral grain.
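The four similarity measures compared above can be sketched as follows. The exact definitions used in the experiments are not given in the text, so this sketch makes assumptions: Diff is taken to be the mean absolute gray-value difference, PSNR uses a peak value of 255 (8 bpp), and X-corr is the zero-mean normalized cross-correlation. The function names are illustrative only.

```python
import math

def _flatten(img):
    """Turn a list of rows into one flat list of floats."""
    return [float(v) for row in img for v in row]

def mean_abs_diff(a, b):
    """Assumed definition of 'Diff': mean absolute gray-value difference."""
    x, y = _flatten(a), _flatten(b)
    return sum(abs(p - q) for p, q in zip(x, y)) / len(x)

def mse(a, b):
    """Mean squared error between two images of equal size."""
    x, y = _flatten(a), _flatten(b)
    return sum((p - q) ** 2 for p, q in zip(x, y)) / len(x)

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    e = mse(a, b)
    return float("inf") if e == 0 else 10.0 * math.log10(peak * peak / e)

def normalized_cross_correlation(a, b):
    """Zero-mean normalized cross-correlation in [-1, 1] (assumed variant)."""
    x, y = _flatten(a), _flatten(b)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((p - mx) * (q - my) for p, q in zip(x, y))
    den = math.sqrt(sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y))
    return num / den if den else 0.0
```

Because PSNR is a strictly decreasing function of MSE, the angle that minimizes MSE also maximizes PSNR, which is consistent with the identical MSE and PSNR columns in the tables. Diff and MSE are minimized when searching for the best rotation, whereas PSNR and X-corr are maximized.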
Table 1. Result of the rotation analysis on image set A (all values represent the rotation between two cross-section slices as a pair [in °] denoted in the “Image” column)
Image set A, rotation analysis with bilinear interpolation
              distance 1                        | distance 3
Image         Diff    MSE     PSNR    X-corr    | Image   Diff    MSE     PSNR    X-corr
01-02         1.2     1.2     1.2     1.2       | 01-04   -0.75   -0.78   -0.78   4.08
02-03         0.93    0.87    0.87    1.5       | 02-05   -0.81   -0.78   -0.78   4.5
03-04         -0.9    -0.96   -0.96   1.5       | 03-06   -0.96   -0.96   -0.96   -0.33
04-05         1.2     1.17    1.17    1.17      | 04-07   0       -0.45   -0.45   0.12
05-06         -0.06   -0.54   -0.54   -0.15     | 05-08   -0.72   -0.72   -0.72   -0.36
06-07         -0.66   -0.54   -0.54   -0.51     | 06-09   -0.06   -0.63   -0.63   -0.18
07-08         -0.06   -0.57   -0.57   -0.6      | 07-10   0       0.06    0.06    2.58
08-09         0       0       0       -0.54     | 08-11   4.17    3.66    3.66    4.5
09-10         0       0       0       1.5       | 09-12   4.17    4.14    4.14    4.5
10-11         1.17    1.14    1.14    1.47      | 10-13   -0.06   -0.24   -0.24   4.5
...           ...     ...     ...     ...       | ...     ...     ...     ...     ...
22-23         0.03    0.06    0.06    1.5       | 22-25   3.9     3.87    3.87    4.5
23-24         -1.47   -1.05   -1.05   -0.87     | 23-26   0       0.09    0.09    0.15
24-25         1.2     1.2     1.2     1.23      | 24-27   3.66    2.85    2.85    3.69
25-26         0       -0.06   -0.06   -0.78     | 25-28   2.4     2.79    2.79    1.83
...           ...     ...     ...     ...       | ...     ...     ...     ...     ...
38-39         1.26    1.26    1.26    1.5       | 38-41   4.2     4.2     4.2     2.13
39-40         0       0       0       0         | 39-42   4.17    0.27    0.27    4.08
40-41         1.23    1.23    1.23    1.29      |
41-42         1.23    1.23    1.23    1.29      |
Sum [°]       11.10   13.20   13.20   27.15     |         14.10   11.66   11.66   30.07
° per sample  0.27    0.32    0.32    0.66      |         0.34    0.28    0.28    0.73
min value     -1.5    -1.5    -1.5    -1.5      |         -2.13   -1.59   -1.59   -0.84
max value     1.26    1.26    1.26    1.5       |         4.2     4.2     4.2     4.5
Figs. 3(a) – 3(c) display typical visual examples of the results when block matching is applied to our image sets. Whereas the motion vectors obtained when applying block matching to the artificially rotated image set clearly indicate circular motion, the results for image sets A and B as displayed in Figs. 3(a) and 3(b) show, besides vectors oriented in the correct tangential direction, also highly irregular motion, where even adjacent motion vectors point in almost opposite directions. For many blocks, no motion vectors at all have been identified. Without the acceptable results for the artificial image set one might argue that the assumption of constant translational motion within a block is too crude an approximation for the actual rotation happening, but this does not seem to be the case. At present, the reason for the difference between the results for the artificial image set and the natural slices is not clear. In Figs. 4(a) – 4(c) we show typical visual examples when applying the Horn & Schunck optical flow approach to our image sets. Even improving upon block matching, the rotation of the artificial image set is perfectly determined (see Fig. 4(c)). For the natural image sets A
Table 2. Result of rotation analysis on image set B
Image set B, rotation analysis with bilinear rotation
              distance 1                        | distance 3
Image         Diff    MSE     PSNR    X-corr    | Image   Diff    MSE     PSNR    X-corr
01-02         -0.06   0.72    0.72    1.5       | 01-04   3.78    3.72    3.72    4.5
02-03         0.75    0.75    0.75    1.5       | 02-05   3.75    3.75    3.75    4.47
03-04         0.78    0.78    0.78    1.5       | 03-06   0.03    0.63    0.63    4.5
04-05         0.78    0.78    0.78    1.5       | 04-07   1.68    1.71    1.71    4.5
05-06         -0.96   -0.96   -0.96   -0.66     | 05-08   0       -0.06   -0.06   0
06-07         1.2     1.17    1.17    1.5       | 06-09   3.81    3.78    3.78    4.5
07-08         -0.06   -0.12   -0.12   0         | 07-10   3.75    3.75    3.75    4.5
08-09         1.23    1.23    1.23    1.5       | 08-11   4.17    4.17    4.17    4.47
09-10         1.23    1.26    1.26    1.5       | 09-12   3.78    3.78    3.78    4.5
10-11         1.26    1.26    1.26    1.5       | 10-13   2.91    2.94    2.94    4.5
11-12         -0.06   -0.09   -0.09   0         | 11-14   3.81    3.78    3.78    4.5
12-13         1.17    1.14    1.14    1.5       | 12-15   -0.57   -0.69   -0.69   4.29
13-14         1.23    1.26    1.26    1.47      | 13-16   4.17    -0.96   -0.96   4.38
14-15         1.26    1.26    1.26    1.5       | 14-17   4.17    4.17    4.17    4.47
15-16         0       0       0       1.5       | 15-18   1.26    1.05    1.05    1.2
16-17         0       0.03    0.03    1.5       | 16-19   1.65    1.68    1.68    4.47
17-18         0.03    0.06    0.06    0.15      | 17-20   1.26    1.23    1.23    1.26
18-19         0.69    1.11    1.11    1.5       | 18-21   0.57    0.6     0.6     1.17
19-20         0.03    0.03    0.03    1.5       |
20-21         0.03    0.06    0.06    0.12      |
Sum [°]       10.53   11.73   11.73   22.08     |         16.289  14.46   14.46   24.51
° per sample  0.526   0.526   0.526   1.104     |         0.814   0.723   0.723   1.226
min value     -0.96   -0.96   -0.96   -0.66     |         -0.57   -0.96   -0.96   0
max value     1.26    1.26    1.26    1.5       |         4.17    4.17    4.17    4.5
Fig. 3. Block matching applied to image sets: (a) image set A, (b) image set B, (c) artificial
and B the result is not that clear, but we notice areas of homogeneous optical flow in the lower left part of the slice in set A (see Fig. 4(a)) and in the lower and upper right part of the slice in set B. The judgement whether the indicated motion corresponds to spiral grain or is caused by other growth phenomena has to be left to experts in the wood industry. If it does correspond to spiral grain, then spiral grain is actually not of “spiral” nature. Finally, Figs. 5(a) – 5(c) display typical examples
Table 3. Result of rotation analysis on our artificial tree set (with “bilinear rotation”)
Artificial image set, -0.5°/img, total 8.5°, rot. analysis with bilinear interpol.
              distance 1                        | distance 3
Image         Diff    MSE     PSNR    X-corr    | Image   Diff    MSE     PSNR    X-corr
01-02         -0.51   -0.48   -0.48   -0.57     | 01-04   -1.53   -1.5    -1.5    -1.62
02-03         -0.51   -0.48   -0.48   -0.57     | 02-05   -1.53   -1.53   -1.53   -1.65
03-04         -0.51   -0.51   -0.51   -0.63     | 03-06   -1.53   -1.53   -1.53   -1.68
04-05         -0.51   -0.48   -0.48   -0.66     | 04-07   -1.53   -1.53   -1.53   -1.59
05-06         -0.51   -0.45   -0.45   -0.57     | 05-08   -1.53   -1.53   -1.53   -1.68
06-07         -0.51   -0.51   -0.51   -0.63     | 06-09   -1.53   -1.53   -1.53   -1.62
...           ...     ...     ...     ...       | ...     ...     ...     ...     ...
15-16         -0.51   -0.51   -0.51   -0.6      | 15-18   -1.53   -1.53   -1.53   -1.59
16-17         -0.51   -0.48   -0.48   -0.57     |
17-18         -0.51   -0.48   -0.48   -0.51     |
Sum [°]       -8.61   -8.01   -8.01   -10.14    |         -22.95  -22.77  -22.77  -24.42
° per sample  -0.506  -0.471  -0.471  -0.596    |         -0.51   -0.506  -0.506  -0.543
min value     -0.51   -0.51   -0.51   -0.69     |         -1.53   -1.56   -1.56   -1.71
max value     -0.45   -0.21   -0.21   -0.45     |         -1.53   -1.44   -1.44   -1.56
Fig. 4. Horn & Schunck optical flow applied to image sets: (a) image set A, (b) image set B, (c) artificial
Fig. 5. Lucas & Kanade optical flow applied to image sets: (a) image set A, (b) image set B, (c) artificial
of the application of the Lucas & Kanade approach. Whereas rotation is again reliably detected for the artificial set (except for areas near the pith), the optical flow fields for sets A and B are very irregular and do not seem to indicate any
relation to regular growth phenomena. Here the price for relaxing the global smoothness constraint becomes drastically visible.
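For reference, the block matching procedure evaluated above (8 × 8 blocks, a search area of 2 pixels, and an acceptance threshold) can be sketched as below. This is an illustrative sketch in plain Python and not the implementation used in the experiments: the mean absolute difference as block error, the threshold value `max_error`, and the function names are assumptions, since the text does not specify them.

```python
def block_error(cur, ref, bx, by, dx, dy, size):
    """Mean absolute difference between a block in `cur` and a shifted block in `ref`."""
    total = 0.0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total / (size * size)

def block_matching(ref, cur, block=8, search=2, max_error=10.0):
    """Match each block of `cur` (image i+1) inside `ref` (image i).

    Returns {(bx, by): (dx, dy)}; blocks whose best match exceeds
    `max_error` (an assumed threshold) get no motion vector.
    """
    h, w = len(cur), len(cur[0])
    vectors = {}
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    # Skip candidate blocks that leave the reference image.
                    if not (0 <= by + dy and by + dy + block <= h
                            and 0 <= bx + dx and bx + dx + block <= w):
                        continue
                    err = block_error(cur, ref, bx, by, dx, dy, block)
                    if best is None or err < best[0]:
                        best = (err, dx, dy)
            if best is not None and best[0] <= max_error:
                vectors[(bx, by)] = (best[1], best[2])
    return vectors
```

Blocks without an entry in the returned dictionary correspond to the case where no motion vector is identified because the best candidate is rejected by the threshold.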
4 Conclusion
While artificial global rotation may be detected and estimated reliably by all three types of techniques considered, natural spiral grain is much harder to capture. Whereas block matching and Lucas & Kanade optical flow seem to be entirely unsuitable for this application (at least in the parameter ranges applied in our implementation), we obtain at least partially coherent optical flow fields with the Horn & Schunck technique and reasonable rotation angles for parts of the data set using rotation analysis. A comparison with true spiral grain angles from our samples was not carried out, since the variation of global rotation vectors is still too strong. In future work we will concentrate on applying preprocessing techniques to the gray-scale cross-section images in order to improve on the results found so far. An application of the algorithms limited to special areas of the cross section will be studied as well.
Acknowledgements. The authors would like to thank Alfred Teischinger and Ulrich Müller from the University of Natural Resources and Applied Life Sciences in Vienna for the possibility to use DICOM data of large-dimensioned specimens of Norway spruce recorded within the project XXL-Wood [8,9], and for their suggestion to study the problem. This work was partially supported by the Austrian Science Fund (FWF) Grant P17434-N13.
References
1. Forest Products Laboratory. Wood handbook − Wood as an engineering material. Gen. Tech. Rep. FPL−GTR−113. Madison, WI: U.S. Department of Agriculture, Forest Service. Online Version (March 2007) (1999), http://www.fpl.fs.fed.us/
2. Harris, J.M.: Spiral grain and wave phenomena in wood formation. Springer, Heidelberg (1989)
3. Nyström, J.: Automatic measurement of compression wood and spiral grain for the prediction of distortion in sawn wood products. PhD thesis, Luleå University of Technology (2002), Available from: http://www.tt.luth.se/staff/jany/
4. Grönlund, A., Oja, J., Grundberg, S., Nyström, J., Ekevad, M.: Process control based on measurement of spiral grain and heartwood content. Draft to be presented at The 18th International Wood Machining Seminar, Vancouver, Canada (2007)
5. Gindl, W., Teischinger, A.: The potential of VIS- and NIR-spectroscopy for the nondestructive evaluation of grain-angle in wood. Wood and Fiber Science 34, 651–656 (2002)
6. Sepúlveda, P., Oja, J., Grönlund: Predicting spiral grain by computed tomography of Norway spruce. Journal of Wood Science 48, 479–483 (2002)
7. Ekevad, M.: Method to compute fiber directions in wood from computed tomography images. Journal of Wood Science 50, 41–46 (2004)
8. Teischinger, A., Patzelt, M.: XXL-Wood. Berichte aus Energie- und Umweltforschung 27/2006. BMVIT, Vienna, Austria. (March 2007) (2006), Online Version: www.fabrikderzukunft.at/
9. Teischinger, A., Buksnowitz, C., Müller, U.: Wood properties of old growth spruce and their technological potential. In: Kurjatko, S., Kudela, J., Lagaňa, R. (eds.) Proceedings of the 5th Symposium “Wood Structure and Properties 06”, pp. 413–416. Arbora Publishers, Zvolen, Slovakia (2006)
10. Furht, B.: Motion Estimation Algorithms for Video Compression. Kluwer Academic Publishers, Boston, MA (1997)
11. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
12. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of Imaging understanding workshop, pp. 121–130 (1981)
Deterministic and Stochastic Methods for Gaze Tracking in Real-Time
Javier Orozco1, F. Xavier Roca1, and Jordi Gonzàlez2
1 Computer Vision Center & Dept. de Ciències de la Computació, Edifici O, Campus UAB, 08193 Bellaterra, Spain
2 Institut de Robòtica i Informàtica Industrial (UPC – CSIC), C. Llorens i Artigas 4-6, 08028, Barcelona, Spain
Abstract. Psychological evidence demonstrates that eye gaze analysis is required for human computer interaction endowed with emotion recognition capabilities. The existing proposals analyse eyelid and iris motion by using colour information and edge detectors, but eye movements are quite fast and difficult to track precisely and robustly. Instead, we propose to reduce the dimensionality of the image data by using multi-Gaussian modelling, and to estimate transitions by applying partial differences. The tracking system can handle illumination changes, low image resolution and occlusions while estimating eyelid and iris movements as continuous variables. Therefore, this is an accurate and robust tracking system for eyelids and irises in 3D for standard image quality.
1 Introduction
A description of eyelid and iris motion is required for evaluating human emotion, truth and deception by combining psychological and pattern recognition techniques. Ekman and Friesen [4] already established that there are perceptible human emotions which can be detected early by analysing eyelid and iris movements. Applications in Human Computer Interaction (HCI) demand robustness and accuracy in real time, which determine the performance evaluation of already proposed techniques. In the literature, there are approaches for gaze analysis dealing with contour detectors, colour segmentation, the Hough transform and optical flow for eyelids and irises [10,6,9]. These methods are time-consuming and depend on the image quality. On the other hand, restricted detailed textures and templates have been proposed to apply template matching, skin colour detection and image energy minimization [1,7]; for example, Moriyama et al. [7] deal with three eyelid states, namely open, closed and fluttering, by constructing detailed templates of skin textures for eyelid, iris and sclera. This approach requires training and texture matching, so these methods are difficult to generalize to different image and environment conditions.
We propose in this paper robust and accurate eyelid and iris tracking by combining stochastic and deterministic approaches with reduced image data. To provide robustness, we reduce the input image by applying Appearance Models [3], which learn the skin texture on-line based on multi-Gaussian assumptions. To provide accuracy, we construct two Appearance-Based Trackers (ABT) to estimate the transition by partial differences with respect to appearance parameters. The first one excludes the sclera and iris information, while achieving fast and accurate eyelid adaptation for any kind of blinking and fluttering motion. The second one is able to track the iris movements, while retrieving the correct adaptation after eyelid occlusions or iris saccade movements. Both trackers agree on the best 3D mesh pose, which depends on the head position. The head pose estimation enhances the system capabilities for tracking eyelids and irises in different head positions, instead of restricting the input to a frontal face.
Compared to existing gaze tracking methods, our proposed approach has several advantages. First, our system achieves efficient eyelid and iris tracking, whose movements are encoded as continuous values. Second, this approach is suitable for real-time applications while handling occlusions, illumination changes, fast saccades and blinking movements. Third, we deal with small images and low resolution, which extends the capabilities of the system to different types of applications. The paper is organized as follows: Section 2 describes the theoretical foundations for appearance-based trackers, the stochastic observation and deterministic transition models. Section 3 presents experimental results and discussion. Finally, Section 4 concludes the paper with the main conclusions and future avenues of research.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 45–52, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Appearance-Based Tracking
2.1 Image-Data Reduction
The appearance model components are the deformable model and the texture. In order to model the eye region, we construct a 3D model of both left and right eyes, which is composed of 36 vertices and 53 triangles, see Fig. 1(a). This mesh covers the eyeballs, the upper and the lower eyelids, the sclera and the iris. The mesh deformation is determined by the matrix M, which is an n × i matrix:
$$M_{n,i} = m_{n,i} + G_{n,i,k} * \gamma_k , \qquad (1)$$
where n is the number of vertices and i corresponds to the Cartesian coordinates in the image plane. The matrix $m_{n,i}$ is determined by the biometry of each person in neutral position. The matrix $G_{n,i,k}$ deforms the mesh depending on
Fig. 1. The 3D mesh (a) is projected onto the input image (b) to construct the corresponding appearance (c)
the eyelid and iris movements. The eyelid and iris movements are controlled by the vector $\gamma_k$ for k = 0, 1, 2, for eyelid, iris yaw and iris pitch respectively. These variables are encoded according to the Facial Action Coding System (FACS), which obeys the MPEG-4 codification. Thereby, the 3D mesh $M_{n,i}$ can be adapted to the eye region, see Fig. 1(b). The 3D pose of the mesh is taken into account through $\rho = [\theta_x, \theta_y, \theta_z, x, y, s]$, while assuming a weak perspective projection model. Therefore, each 3D point $P_i = (X_i, Y_i, Z_i) \in M$ will be projected onto the image point $p_i = (u_i, v_i)$ by using a projection matrix B. That is, $(u_i, v_i) = B(X_i, Y_i, Z_i, 1)$. Consequently, given an input frame F and the parameters to modify the mesh, which are encoded as $q = [\rho, \gamma]$, we construct an appearance model to represent the eye region [2]. The shape is provided by the 3D mesh and the texture is obtained by applying a warping function $\Psi(F, q)$, which transfers each pixel from the input frame F into a reference texture according to the vector q. In this way, the final appearance $A(q) = [a_0, ..., a_l]$ is obtained, which depends on the mesh configuration q, see Fig. 1(c).
2.2 Stochastic Appearance Observation
The observation model aims to provide the expected appearances over time, which are based on previous estimations. Therefore, a likelihood distribution is an appropriate method to generate the expected appearances while learning from previous estimations. Given an appearance model $A(q) = [a_0, ..., a_l]$, which depends on the mesh configuration q, we assume that each pixel $a_i$ of the appearance follows a Gaussian distribution over time. Thus, we can collect all the variables in a multi-dimensional vector, which can be assumed to follow a Gaussian distribution as well, $N(\mu, \sigma^2)$. Here $\mu$ and $\sigma$ are vectors of l values according to the appearance variables $a_i$, which means that the variance must be computed component by component and not through the inner product of the vector $\sigma$. Therefore, the probability for each observation is given by the conditional likelihood function:
$$P(A_t \mid q_{t-1}) = \prod_{i=0}^{l} N(a_i; \mu_i, \sigma_i). \qquad (2)$$
The tracking goal is the estimation of the vector q at each frame t. We represent as $\hat{q}_t$ and $\hat{A}_t(\hat{q}_t)$ the tracked parameters and the estimated appearances [2]. For the sake of clarity, we assume that $A_t(q_t)$ and $A_t$ are equivalent and used depending on the specification level. The estimated average appearance is obtained by applying a recursive filtering technique. Thus, $\mu$ and $\sigma^2$ are updated for the next frame with respect to previous adaptations and a learning coefficient $\lambda$:
$$\mu_{t+1} = \lambda \mu_t + (1 - \lambda)\hat{A}_t \quad \text{and} \quad \sigma^2_{t+1} = \lambda \sigma^2_t + (1 - \lambda)(\hat{A}_t - \mu_t)^2 , \qquad (3)$$
where $\mu$ and $\sigma$ are initialized with the first appearance $A_0$.
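The observation model of Eqs. (2) and (3) can be illustrated with a short sketch. It is written in plain Python over lists of pixel values and is not the authors' ANSI C implementation; the function names are hypothetical, and the likelihood is evaluated in log form, with $\sigma_i$ read as the per-pixel standard deviation, which is an assumption about Eq. (2).

```python
import math

def update_model(mu, var, appearance, lam):
    """Recursive update of Eq. (3), applied component by component.

    `lam` is the learning coefficient: it weights the previous statistics,
    and (1 - lam) weights the newly tracked appearance.
    """
    new_mu = [lam * m + (1.0 - lam) * a for m, a in zip(mu, appearance)]
    new_var = [lam * v + (1.0 - lam) * (a - m) ** 2
               for v, m, a in zip(var, mu, appearance)]
    return new_mu, new_var

def log_likelihood(appearance, mu, var):
    """Logarithm of Eq. (2): sum of per-pixel Gaussian log-densities."""
    total = 0.0
    for a, m, v in zip(appearance, mu, var):
        total += -0.5 * math.log(2.0 * math.pi * v) - (a - m) ** 2 / (2.0 * v)
    return total
```

Working with the logarithm turns the product over pixels into a sum, which avoids numerical underflow for appearances with many pixels. The variances must be strictly positive for the likelihood to be defined, so a practical implementation would need some initial or minimum variance; that choice is not specified in the text.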
Fig. 2. The 3D mesh (a) and the appearance (b) for the eyelid tracker. The 3D mesh (c) and the appearance (d) for the iris tracker.
2.3 Deterministic Appearance Transition
In order to estimate the vector $q_t$ for the next frame, we adopt an adaptive velocity model, which is predicted by using a deterministic function to obtain the transition state based on the previous prediction, $\hat{q}_{t-1}$:
$$\hat{q}_t = \hat{q}_{t-1} + \Delta\hat{q}_t \qquad (4)$$
where $\Delta\hat{q}_t$ is the shift of the mesh configuration. Consequently, for each $\hat{q}_t$ we construct the corresponding appearance, which is compared with the likelihood average appearance by the Mahalanobis distance. Therefore, given Eq. (4), the appearance becomes $A_t \approx \mu_t$, which can be approximated via a first-order Taylor series expansion around $\hat{q}_t$ using the vanilla gradient descent method [8]:
$$A_t(q_t) \approx \Psi(F_t, \hat{q}_{t-1}) + \frac{\partial(\hat{A}_t, \hat{q}_t)}{\partial \hat{q}_t}(\hat{q}_t - \hat{q}_{t-1}). \qquad (5)$$
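A small sketch may help to make Eqs. (4) and (5) concrete. It estimates the appearance gradient by partial (finite) differences, one parameter at a time, and derives a shift for each component from the first-order expansion. This is a simplified, coordinate-wise least-squares step chosen for illustration and not necessarily the update used by the authors; `appearance_fn` is a hypothetical stand-in for the warping function $\Psi(F_t, q)$, and `delta` is an assumed difference step.

```python
def partial_difference_step(appearance_fn, q, target, delta=1e-3):
    """One update q + dq (Eq. 4) from a first-order expansion (Eq. 5).

    appearance_fn: maps a parameter list q to an appearance (list of floats).
    target:        the expected appearance, e.g. the current mean appearance.
    """
    base = appearance_fn(q)
    residual = [t - b for t, b in zip(target, base)]
    new_q = list(q)
    for k in range(len(q)):
        # Partial difference of the appearance with respect to parameter k.
        shifted = list(q)
        shifted[k] += delta
        moved = appearance_fn(shifted)
        grad_k = [(m - b) / delta for m, b in zip(moved, base)]
        norm = sum(g * g for g in grad_k)
        if norm > 0.0:
            # Least-squares shift along this single component.
            new_q[k] = q[k] + sum(g * r for g, r in zip(grad_k, residual)) / norm
    return new_q

def toy_appearance(q):
    """Hypothetical linear 'warp' used only to exercise the step function."""
    return [q[0], q[0], q[1], -q[1]]
```

For the linear toy appearance a single step recovers the parameters that generated the target; for a real warping function the step would be iterated until the distance to the expected appearance stops decreasing.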
ˆ t depends on both the previous As a result, the estimation of the vector q adapted and the current average appearance, as well as the minimization distance. The gradient is computed by partial differences with specific descent steps, due to the saccade movements and spontaneous blinking. Thus, tracking is enhanced by quickly retrieving the best adaptation while avoiding drifting problems. Illumination changes, occlusions and faster movements are considered as outliers by constraining with the Huber’s function [5] the gradient descent step for each component of the shift vector Δq. The Huber’s function, ξˆ function is defined as:
ˆ 1 if|y| ≤ c 1 dξ(y) = (6) η(y) = c y dy |y| if|y| > c where y is the value of a pixel in the appearance At normalized by the appearance ¯ and σ ¯ according to the Gaussian assumption for the Appearance. statistics μ ¯ Thus, we constrain the appearance registration on The constant c is set 3 ∗ σ. the probabilistic model. 2.4
Eyelid and Iris Tracking
Eyelids and irises perform fast movements, which are difficult to predict due to the spontaneous motion and the low resolution of images from monocular cameras. Therefore, a single probabilistic model for them would be quite uncertain. Instead, we propose a separate appearance model for each tracker, which avoids those pixels that add uncertainty to the appearance predictions, see Fig. 2. The eyelid tracker uses an appearance model which excludes the sclera and the iris regions by warping these pixels like eyelid skin. Subsequently, the iris tracker uses an appearance that includes the sclera and iris region. Thus, we construct two Appearance-Based Trackers (ABT), which are combined sequentially, as described next.
Eyelid Tracker $T_w$: estimates the eyelid position independently from the irises, see Fig. 2(a). The tracking vector is $w = [\rho, \gamma_0]$, and the appearance model is A(w), which excludes sclera and iris, see Fig. 2(b):
1. Obtain $A_t(w_{t-1})$ by applying the warping function $\Psi(F_t, w_{t-1})$.
2. Estimate the Gaussian parameters¹ $\mu_t(w)$ and $\sigma^2_t(w)$ by using Eq. (3). We try five different learning coefficients $\lambda_w$; the high values allow learning fast movements, while the low values keep more information to handle occlusions.
3. Compute the eyelid gradient based on the previous adaptation $w_{t-1}$ by using Eq. (5), while testing the whole FACS range [-1,1].
4. Obtain the best estimation using Eq. (4), by comparing the average and likelihood appearances through a Mahalanobis distance in an iterative Gauss-Newton process. The search involves exploitation more than exploration to avoid local minima and to estimate the spontaneous movements.
Iris Tracker $T_q$: estimates yaw and pitch orientations for the irises, see Fig. 2(c). The tracking vector is $q = [\rho, \gamma_0, \gamma_1, \gamma_2]$, which is used to obtain the appearance A(q):
1. Obtain $A_t(q_{t-1})$ by applying the warping function $\Psi(F_t, q_{t-1})$.
2. Estimate the Gaussian parameters $\mu_t(q)$ and $\sigma^2_t(q)$ by applying Eq. (3). The learning coefficient $\lambda_q$ is lower than $\lambda_w$ to keep more information from previous frames, since the iris motion is slower than the eyelid blinking.
3. Compute the iris gradient at $A_t(q_{t-1})$; it is estimated for $q = [\rho, \gamma_0, \gamma_1, \gamma_2]$ in the whole FACS range [-1,1], using Eq. (5).
4. Finally, compute the best estimation using Eq. (4), by minimizing the Mahalanobis distance. Convergence is achieved by using a shorter exploration than in the previous tracker.
Both trackers are connected through the error estimation, which is standardized according to the number of pixels of each appearance. Subsequently, the eyelid tracker provides the vector $w = [\rho, \gamma_0]$, while the iris tracker estimates the iris while correcting the previous mesh orientation, $q = [\rho, \gamma_0, \gamma_1, \gamma_2]$, see Fig. 3.
¹ Let $\mu(w)$ and $\sigma^2(w)$ be the Gaussian parameters corresponding to the eyelid tracker $T_w$. Similarly, $\mu(q)$ and $\sigma^2(q)$ are the Gaussian parameters for the iris tracker $T_q$.
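The error that connects both trackers can be sketched as a robust, per-pixel normalized distance between an appearance and its Gaussian model. The sketch below combines the Huber weights of Eq. (6) with a Mahalanobis-type distance under the per-pixel Gaussian assumption and divides by the number of pixels, so that appearances of different sizes become comparable. How exactly the authors combine these pieces is not stated, so the combination, the function names, and the choice to express the residual in units of the standard deviation (which turns the threshold into c = 3) are assumptions.

```python
import math

def huber_weight(y, c):
    """Weight function of Eq. (6): 1 inside [-c, c], c/|y| outside."""
    return 1.0 if abs(y) <= c else c / abs(y)

def robust_error(appearance, mu, var, c=3.0):
    """Huber-weighted squared distance, standardized by the number of pixels.

    Each residual is normalized by the per-pixel standard deviation, so the
    threshold `c` is expressed in standard deviations (assumed convention).
    """
    total = 0.0
    for a, m, v in zip(appearance, mu, var):
        y = (a - m) / math.sqrt(v)
        total += huber_weight(y, c) * y * y
    return total / len(appearance)
```

Pixels whose normalized residual exceeds the threshold contribute only linearly instead of quadratically, which limits the influence of occlusions and illumination changes on the comparison between candidate adaptations.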
Fig. 3. Eyelid tracking handles movements and blinks as continuous values; iris tracking handles spontaneous movements and saccades. The eyelid position is corrected for the pitch variation effects.
3 Experimental Results
Experiments were run on a 3.2 GHz Pentium PC, with ANSI C code. Three image sequences of 250 frames were used for testing both trackers; they were originally facial image sequences from which the eye region was cropped. They were recorded with monocular and photographic cameras without special illumination conditions. We do not use high image resolution, because the reference texture size is 14x18 pixels. We computed the eyelid estimation by using $T_w$ while dealing with any kind of blinking, e.g. open, closed and fluttering. In addition, the iris estimation is done by using $T_q$, which estimates yaw and pitch motion, saccade movements and eyelid occlusions. On the one hand, the eyelid tracker adapts the eyelid position for slow movements and blinks while estimating the mesh orientation, see Fig. 4. This tracker does not depend on iris estimations even when the pitch movement affects the eyelid position. However, this tracker warps the sclera and iris pixels as eyelid skin,
Fig. 4. Eyelid and iris trackers are applied sequentially while learning each appearance texture on-line, which enhances the robustness and accuracy
Fig. 5. Due to the tracking capability to handle illumination changes and occlusions, the tracker performs well with eyes wearing glasses
Fig. 6. The eyelid and iris tracking deals with small, low-resolution images. These images are 64x20 pixels, where each input eye is 18x12 pixels.
otherwise, those pixels are considered outliers when the eyelid is occluding the inner eye region. On the other hand, the iris tracker adapts well to the iris position while dealing with slow yaw and pitch motion, and retrieves the correct adaptation of the mesh after either saccade motion or iris movements while the eyelid occludes the iris. Both trackers use the FACS codification, which is expressed as continuous values between -1.0 and 1.0, see Fig. 3. The eyelid and iris pitch have similar plots because they commonly perform a synchronized and spontaneous movement. This correlation emerges although both are estimated independently. For an appearance model size of 14x18 pixels, we obtained a performance of 21 fps and 96% correct adaptations. The error is principally due to saccade motion or eyelid occlusions. Each sequence was also tested using an appearance size of 5x11 pixels, obtaining 52 fps for the iris tracker and 67 fps for the eyelid tracker, with an average adaptation accuracy of 85%. Iris tracking is a challenge for a small reference texture, where the iris size is 2x3 instead of 5x6 for the larger resolution. Learning the texture on-line and handling illumination changes are important capabilities for obtaining good adaptations when analysing low image resolution and occluded eye regions. Fig. 5 shows a 400-image sequence in which the subject is wearing sunglasses; the input eye region is 42x82 pixels. The tracking achieved 91% correct adaptations with a reference texture of 14x18 pixels. We obtain 82% correct adaptations when analysing small images with low resolution, where the input eye region is 10x18, see Fig. 6. The whole input image is 112x160, which is appropriate for video conference software.
4 Conclusions
Three main contributions were proven in this work. First, the information and dimensionality reduction for the input image by constructing appearance models,
which are appropriate for statistical modelling. Second, the stochastic observation model provides an accurate likelihood function, which is conditioned by previous estimations and accumulated appearance information. Third, the deterministic transition model allows generating an appearance space by applying first-order Taylor approximations around the previous adaptation. The experimental results have shown that, by sequentially combining two ABTs, the system is able to accurately estimate the eyelid and iris position in 3D. The system does not require high-quality images or specific illumination conditions, since we do not use colour information, edge detectors, or motion extraction algorithms. We have demonstrated in this framework that eyelid and iris motion can be estimated reliably for HCI applications and psychological systems, which demand real-time performance with accurate results. Our system provides a robust and accurate gaze description, able to handle occlusions, illumination changes, and small, low-resolution images in 3D.
Acknowledgement. This work is supported by EC grants IST-027110 for the HERMES project, IST-045547 for the VIDI-Video project, and by the Spanish MEC under projects TIN2006-14606 and DPI-2004-5414. Jordi Gonzàlez also acknowledges the support of a Juan de la Cierva Postdoctoral fellowship from the Spanish MEC.
References 1. Bernogger, S., Yin, L., Basu, A., Pinz, A.: Eye tracking and animation for mpeg-4 coding. In: Proceedings of the 14th International Conference on Pattern Recognition, vol. 2, pp. 1281–1284 (1998) 2. Cootes, T.F., Taylor, C.J.: Statistical Models of Appearance for Computer Vision. Imaging Science and Biomedical Engineering, University of Manchester (2004) 3. Edwards, G., Cootes, T., Taylor, C.: Face recognition using active appearance models. In: Burkhardt, H., Neumann, B. (eds.) ECCV 1998. LNCS, vol. 1407, pp. 581–695. Springer, Heidelberg (1998) 4. Ekman, P., Friesen, V.: Facial action coding system: A technique for the measurement of facial movement. Consulting Psychologists Press, Palo Alto (1978) 5. Huber, P.J.: Robust estimation of a location parameter. The Annals of Mathematical Statistics 35, 73–101 (1964) 6. Liu, H., Wu, Y., Zha, H.: Eye states detection from color facial image sequence. In: Proc. of the 2nd Conference on Image and Graphics, vol. 4875, pp. 693–698 (2002) 7. Moriyama, T., Xiao, J., Cohn, J., Kanade, T.: Meticulously detailed eye model and its application to analysis of facial image. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(5), 738–752 (2006) 8. Nocedal, J., Wright, S.: Numerical optimization. Springer, New York (1999) 9. Tan, H., Zhang, Y.: Detecting eye blink states by tracking iris and eyelids. Pattern Recognition Letters (2005) 10. Tian, Y.L., Kanade, T., Cohn, J.: Dual-state parametric eye tracking. In: International Conference on Automatic Face and Gesture Recognition, pp. 110–115 (2000)
Integration of Multiple Temporal and Spatial Scales for Robust Optic Flow Estimation in a Biologically Inspired Algorithm
Cornelia Beck, Thomas Gottbehuet, and Heiko Neumann
Inst. for Neural Information Processing, University of Ulm, Germany
{cornelia.beck,thomas.gottbehuet,heiko.neumann}@uni-ulm.de
Abstract. We present a biologically inspired iterative algorithm for motion estimation that combines the integration of multiple temporal and spatial scales. This work extends a previously developed algorithm that is based on mechanisms of motion processing in the human brain [1]. The temporal integration approach realizes motion detection using one reference frame and multiple past and/or future frames leading to correct motion estimates at positions that are temporarily occluded. In addition, this mechanism enables the detection of subpixel movements and therefore achieves smoother and more precise flow fields. We combine the temporal integration with a recently proposed spatial multi scale approach [2]. The combination further improves the optic flow estimates when the image contains regions of different spatial frequencies and represents a very robust and efficient algorithm for optic flow estimation, both on artificial and real-world sequences.
1
Introduction
The correct and complete detection of optic flow in image sequences remains a difficult task (see [3] for an overview of existing technical approaches). Exact knowledge about the movements of the surroundings is needed in many technical applications, e.g., in the context of autonomous robot navigation, but the need for real-time computation in combination with reliable motion estimates further complicates the task. Common problems of optic flow detection are the generation of smooth optic flow fields and the detection of independently moving objects at various speeds. A further difficulty is motion detection in temporarily occluded regions. Humans and animals solve these problems in everyday vision very accurately and quickly. Neurophysiological and psychophysical research has revealed some basic processing principles of the highly efficient mechanisms in the brain [4,5]. Here we present extensions of an algorithm derived from a neural model [6,1] based on these research results. The integration of different temporal scales improves the quality of the detected optic flow in case of temporal occlusions and leads to a more precise representation. In addition, a combination with a multi-scale approach makes the algorithm more robust in image sequences containing regions with different spatial frequencies.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 53–60, 2007. © Springer-Verlag Berlin Heidelberg 2007
2
Biologically Inspired Algorithm
For optic flow estimation, different approaches such as regularization, Bayesian models or spatiotemporal energy models have been developed to compute globally consistent flow fields [7,8,9]. Another way is to build a model that simulates the neural processing in the primate visual system. We previously presented such a neural model for optic flow detection based on the first stages of motion processing in the brain [6], namely the areas V1 and MT, including feedforward as well as feedback connections [10]. In model area V1 raw motion is initially detected, while model area MT estimates the optic flow of larger regions. To reduce both computing and memory requirements we derived an efficient algorithm from the neural model (see Fig. 1(a), details of the implementation in [1]). For the fast extraction of motion correspondences the algorithm uses a similarity measure from the class of rank-order approaches: a variation of the Census Transform [11] provides an abstract representation of directional derivatives of the luminance function. Accordingly, correspondences between two frames of an image sequence can be extracted at locations with the same Census values. Each motion correspondence (called hypothesis) includes a weight which indicates the likelihood of a particular velocity at a certain position. The recurrent signal modulates the likelihood of predicted hypotheses in the first module, enhancing existing motion estimates by (1). To improve the estimations, the hypotheses are integrated in space and velocity (2). In this feedforward integration step, hypotheses that are supported by neurons adjacent in position and velocity have an advantage over isolated motion hypotheses. The likelihoods are modified by a normalization via lateral shunting inhibition (3). The computation of a likelihood representation of motion estimation in module MT utilizes a homologous architecture on a coarser spatial scale (V1:MT ratio is 1:5).

$$\mathit{likelihood}^{V1}_{1} = \mathit{Input} \cdot \left(1 + C \cdot \mathit{likelihood}^{MT}_{3}\right) \tag{1}$$

$$\mathit{likelihood}^{V1}_{2} = \left(\mathit{likelihood}^{V1}_{1}\right)^{2} \ast G^{(\mathit{space})} \ast G^{(\mathit{vel})} \tag{2}$$

$$\mathit{likelihood}^{V1}_{3} = \mathit{likelihood}^{V1}_{2} \,\Big/\, \Bigl(0.01 + \sum_{\mathit{vel}} \mathit{likelihood}^{V1}_{2}\Bigr) \tag{3}$$
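The three processing steps of the V1 module (feedback modulation, integration in space and velocity, shunting normalisation) can be illustrated with a small sketch. It is not the authors' implementation: it assumes a one-dimensional position axis and a one-dimensional velocity axis, a binomial kernel in place of the Gaussians, a hypothetical feedback gain `C = 100`, and a squaring non-linearity before the integration step.

```python
def smooth(grid, kernel, axis):
    """Convolve a 2-D list with a 1-D kernel along one axis (zero padding)."""
    rows, cols = len(grid), len(grid[0])
    r = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k, w in enumerate(kernel):
                ii, jj = (i + k - r, j) if axis == 0 else (i, j + k - r)
                if 0 <= ii < rows and 0 <= jj < cols:
                    acc += w * grid[ii][jj]
            out[i][j] = acc
    return out


def v1_stage(inp, mt_feedback, C=100.0, kernel=(0.25, 0.5, 0.25)):
    """One pass through the three V1 steps.

    inp, mt_feedback: lists indexed as [position][velocity].
    C and kernel are illustrative assumptions, not values from the paper.
    """
    # Step 1: feedback from MT enhances hypotheses that were predicted.
    l1 = [[x * (1.0 + C * fb) for x, fb in zip(row_in, row_fb)]
          for row_in, row_fb in zip(inp, mt_feedback)]
    # Step 2: squaring, then integration over space (axis 0) and velocity (axis 1).
    l2 = [[x * x for x in row] for row in l1]
    l2 = smooth(smooth(l2, kernel, 0), kernel, 1)
    # Step 3: shunting normalisation over all velocities at each position.
    return [[x / (0.01 + sum(row)) for x in row] for row in l2]


# Two equally likely velocities at every position; feedback supports the first one.
inp = [[1.0, 0.0, 0.0, 0.0, 1.0]] * 3
fb = [[1.0, 0.0, 0.0, 0.0, 0.0]] * 3
out = v1_stage(inp, fb)
```

In this toy input the feedback-supported velocity ends up with a much larger normalised likelihood than its competitor, which is the disambiguating effect the recurrent signal is meant to have.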
3
Temporal Integration
In the algorithm presented in the last section, the initial motion detection is only calculated for two successive frames (the reference t0 and its previous frame t−1 ). If an image sequence contains temporal occlusions, e.g., an object moving in front of a background, this procedure will fail to calculate the correct optic flow for some of the image regions. The previous frame t−1 contains areas where parts of the background are occluded, while they are visible in frame t0 . In the standard algorithm this leads to wrong or missing motion estimates in such occluded regions. This problem can be solved by using motion cues of additional frames for the initial motion detection (see Fig. 1c and 3). A similar mechanism was proposed by [12] for ordinal depth segmentation of moving objects. The integration
Fig. 1. (a) Iterative model of optic flow estimation: V1 and MT-Fine represent the standard algorithm. The dashed box marks the extension added for the integration of multiple spatial scales where a coarse initial guess of the optic flow is calculated in V1 and MT-Coarse that supports the subsequent creation of hypotheses in V1 and MT-Fine. (b) Temporal integration of multiple frames. When using more than two input frames for the motion detection, the motion cues from the different combinations are calculated and the most reliable subset of these is used as input to V1. As the temporal distances between the frames vary, the velocities need to be scaled to the largest temporal distance. (c) Motion detection in sequences with occlusion can be solved using an additional temporally forward-looking step (future step).
of additional frames for motion detection does not only provide an advantage in the case of occlusions. Consider subpixel movement, e.g., in the center of an expanding flow field or for slowly moving objects, where common correlation based approaches are not able to resolve the motion. Utilizing additional frames with larger temporal distance rescales the small velocities to a detectable speed. This leads to a higher resolution of direction and speed and hence provides smoother flow fields (see Fig. 4). The mechanism of integrating motion correspondences of multiple frames used in our algorithm is depicted in Fig. 1(b). One future step is sufficient to solve the temporal occlusion problem whereas multiple past steps increase the accuracy of motion estimates. All hypotheses are scaled according to the maximum temporal distance and a subset with the largest likelihoods at each position is selected as input for V1.
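The scaling and selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions: hypotheses are given as displacements measured over a temporal distance of `dt` frames, all displacements are already expressed in the same temporal direction, and the function name and the subset size `k` are hypothetical.

```python
def integrate_hypotheses(hyps_by_dt, k=2):
    """Merge motion hypotheses obtained from frame pairs with different temporal distance.

    hyps_by_dt: {dt: [(position, (dx, dy), likelihood), ...]}, dt > 0 in frames.
    Returns {position: [((dx, dy), likelihood), ...]} with at most k entries,
    displacements rescaled to the largest temporal distance.
    """
    max_dt = max(hyps_by_dt)
    per_position = {}
    for dt, hyps in hyps_by_dt.items():
        scale = max_dt / dt  # bring every displacement to the largest temporal distance
        for pos, (dx, dy), lik in hyps:
            per_position.setdefault(pos, []).append(((dx * scale, dy * scale), lik))
    # Keep only the k most likely hypotheses per position as input to V1.
    return {pos: sorted(cands, key=lambda h: h[1], reverse=True)[:k]
            for pos, cands in per_position.items()}


# A displacement of one pixel over two frames is a half-pixel-per-frame movement
# that a two-frame comparison alone could not resolve.
merged = integrate_hypotheses({
    1: [((5, 5), (1.0, 0.0), 0.4)],
    2: [((5, 5), (1.0, 0.0), 0.9)],
})
```

After rescaling, the two-frame-distance hypothesis keeps its displacement while the one-frame-distance hypothesis is doubled, so both live in the same velocity space before the subset is chosen.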
4
Integration of Multiple Spatial Scales
A limitation of the standard algorithm is that movements in spatially low-frequency areas may not be detected on a fine scale due to the ambiguity of motion cues (background of example in Fig. 2). In contrast, using the same algorithm on a coarse version of the input images will detect this movement.
However, small moving objects are overlooked on a coarse scale, as they disappear in the subsampling and the motion integration process. In our algorithm, we use a coarse and a fine scale in a way that combines the advantages of both [2] (see Fig. 1a). In general, algorithms using multiple scales are realized with image pyramids where the motion estimation of coarser scales influences the estimation at finer scales as an initial guess [13,14]. While processing the input image at resolutions of different spatial frequencies provides more information for the motion estimation, combining the different scales is a problem: How can the estimations of a coarse scale be used without erasing small regions detected at a finer scale? We only consider the motion estimation of the coarse scale if the estimation of the fine scale is highly ambiguous. In addition, the coarse estimation has to be compatible with one of the motion correspondences of the fine scale. This prevents small objects from being overlooked.
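The combination rule in the preceding paragraph can be written down compactly. The sketch below is only one possible reading: the ambiguity test (runner-up likelihood close to the best one), the compatibility tolerance, and all names are assumptions introduced for illustration.

```python
import math


def combine_scales(fine_hyps, coarse_velocity, ambiguity=0.8, tolerance=1.0):
    """Choose a velocity at one position from fine-scale hypotheses and a coarse estimate.

    fine_hyps: list of ((vx, vy), likelihood); coarse_velocity: (vx, vy) or None.
    ambiguity and tolerance are hypothetical thresholds.
    """
    if not fine_hyps:
        return None
    ranked = sorted(fine_hyps, key=lambda h: h[1], reverse=True)
    best = ranked[0]
    # Assumed ambiguity criterion: the runner-up is almost as likely as the winner.
    ambiguous = len(ranked) > 1 and ranked[1][1] >= ambiguity * best[1]
    if not ambiguous or coarse_velocity is None:
        return best[0]  # a clear fine-scale estimate is never overwritten
    # The coarse estimate may only select among existing fine-scale correspondences.
    compatible = [h for h in ranked
                  if math.dist(h[0], coarse_velocity) <= tolerance]
    return compatible[0][0] if compatible else best[0]


clear = combine_scales([((1, 0), 0.9), ((-1, 0), 0.2)], (-1, 0))
resolved = combine_scales([((1, 0), 0.5), ((-1, 0), 0.5)], (-1.2, 0.1))
```

Because the coarse estimate can only promote a hypothesis that already exists at the fine scale, a small object with a clear fine-scale estimate keeps its own velocity even when the coarse scale reports background motion.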
Fig. 2. If the input image contains regions with a spatially low-frequency structure (background), an algorithm working on a single scale that can detect the movement of the object (small rectangle labeled by the black circle) will miss large parts of the movement in the background (fine scale). Using an additional processing scale solves this problem (two scales). Black positions are positions where no movement was detected, white positions indicate movement to the left, gray positions to the right.
5
Integration of Multiple Temporal and Spatial Scales
To take advantage of both temporal integration and processing at multiple scales, we implemented a version of the algorithm that comprises these two extensions in one approach. To keep the computing time as low as possible, the calculation of motion within the coarse scale is performed only on the initial estimation from two frames, whereas the fine scale uses multiple frames for temporal integration. Census Transformations of the image sequence are kept for the following steps, and the size of the chosen subsets of hypotheses used as input to V1 is limited to ensure constant computational time in further processing.
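Since the Census signatures are computed once per frame and reused, the correspondence search reduces to comparing stored codes. The following sketch uses a simple ternary 3x3 Census signature with a hypothetical tolerance `eps`; it is a generic illustration of signature matching and not necessarily the variation used in the algorithm.

```python
def census(img, eps=0):
    """Ternary 3x3 Census signature for every interior pixel of a 2-D list image."""
    h, w = len(img), len(img[0])
    sig = {}
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            centre, code = img[y][x], 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dx == 0 and dy == 0:
                        continue
                    n = img[y + dy][x + dx]
                    # 0: darker, 1: similar, 2: brighter than the centre pixel
                    digit = 0 if n < centre - eps else (2 if n > centre + eps else 1)
                    code = code * 3 + digit
            sig[(x, y)] = code
    return sig


def correspondences(sig_prev, sig_ref, radius=2):
    """Velocity hypotheses (dx, dy) at positions whose signature reappears in sig_prev."""
    hyps = {}
    for (x, y), code in sig_ref.items():
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                if sig_prev.get((x - dx, y - dy)) == code:
                    hyps.setdefault((x, y), []).append((dx, dy))
    return hyps


# Synthetic test pattern shifted one pixel to the right between the two frames.
prev = [[(x * 7 + y * 13 + x * y * 3) % 17 for x in range(8)] for y in range(8)]
ref = [[row[0]] + row[:-1] for row in prev]
sig_prev = census(prev)            # computed once, can be kept for later frame pairs
hyps = correspondences(sig_prev, census(ref), radius=1)
```

Keeping `sig_prev` around means that adding another frame to the temporal integration costs one new signature map plus the dictionary lookups, which matches the linear growth in computing time noted later in the text.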
6
Results
In the first experiment, we tested the advantages of the algorithm with temporal integration in the case of occlusion. As shown in Fig. 3, the sequence contains a rectangular object moving in front of a non-static background. Using only two frames t0 and t−1 for the initial motion detection, the resulting optic flow
Fig. 3. A rectangular object is moving to the right in front of a textured background moving to the left (see Input and Ground truth). Without temporal integration, the temporarily occluded region behind the object contains many wrong motion cues as no corresponding regions can be found (motion bleeding in third image). If an additional future time step is added to the initial motion detection, the correct corresponding regions can be matched for every position (fourth image). The correct position of the object is indicated by the black box.
Fig. 4. Results Yosemite Sequence after the first iteration (http://www.cs.brown.edu/people/black/Sequences/yosFAQ.html). (a) Exemplary input image, (b) Ground truth. For the motion in the sky we assume horizontal motion to the right as proposed in [3]. (c) Median angular error (degree) of module MT for the three different versions of the algorithm. After only one iteration, the results of the algorithm with temporal integration have very low error rates. (d) Standard algorithm: A lot of positions in the sky do not contain a motion hypothesis after the first iteration (black positions), there are some wrong motion hypotheses in the lower left. (e) Temporal integration: The flow field contains fewer errors and appears smoother, but there are still some void positions in the region of the sky. (f) The combination of temporal integration and the multi scale algorithm achieves motion hypotheses at every position after the first iteration, even in the coarse structure of the sky the movement to the right is correctly detected. The flow field contains only few errors and is very smooth.
represents many wrong velocities in the area temporarily occluded by the object. In contrast, the extended algorithm with one additional frame t+1 successfully integrates the motion at all positions of the background as well as of the object.
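The median angular error reported for the Yosemite sequence (Fig. 4(c)) can be computed along the following lines. The text does not state which definition of angular error was used; the sketch assumes the space-time angular error that is common in the optic flow literature, where each velocity (u, v) is extended to the 3-D direction (u, v, 1).

```python
import math
from statistics import median


def angular_error_deg(est, gt):
    """Angle in degrees between (u, v, 1) and (u_gt, v_gt, 1)."""
    (u, v), (ug, vg) = est, gt
    dot = u * ug + v * vg + 1.0
    norm = math.sqrt(u * u + v * v + 1.0) * math.sqrt(ug * ug + vg * vg + 1.0)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def median_angular_error(est_flow, gt_flow):
    """est_flow, gt_flow: {position: (u, v)}; positions without an estimate are skipped."""
    errors = [angular_error_deg(est_flow[p], gt_flow[p])
              for p in gt_flow if p in est_flow]
    return median(errors)


gt = {(0, 0): (1.0, 0.0), (1, 0): (1.0, 0.0)}
est = {(0, 0): (1.0, 0.0), (1, 0): (0.0, 1.0)}
err = median_angular_error(est, gt)
```

Note that this sketch ignores positions without any hypothesis, so a sparse but accurate flow field would score well; the density of the estimates has to be reported separately.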
Fig. 5. Optic flow estimation for a traffic sequence (from project LFS Baden-Wuerttemberg, no. 23-7532.24-12-12). (a) Camera movement, (b) Optic flow calculated with the algorithm using temporal integration and multiple scales: The car moving from left to right can clearly be segmented, in the lower right the gray colour indicates correctly an expanding flow field whereas the left part of the image is rather influenced by the rotational movement (white colour encodes movement to the left). We calculated the correctly detected movement in the regions of the car and in the surrounding region using masks shown in (c) and (d). The standard and the extended version of the algorithm correctly detect the car movement for about 90% (not shown). (e) The standard algorithm propagates the motion of the car to the surrounding area (black regions) and thus overestimates the size of the car considerably. (f) The extended version can significantly limit the overestimation. (g) Error plot representing the percentage of pixels overestimating the car size (in relation to its true size).
In a second experiment, improvements by temporal integration and one additional spatial scale are investigated using the Yosemite Sequence. Since in an expanding flow field the direction and speed of neighbouring positions change continuously, the higher resolution of velocities leads to a smoother flow field and fewer outliers (see Fig. 4). Multiple spatial scales further improve the results, as after only one iteration every position of the image contains motion estimates while it takes 3 frames to complete this in single scale conditions. The median angular error of the algorithm using temporal integration is considerably lower compared to the standard algorithm. The algorithm is tested in a third experiment with a real-world sequence acquired with a camera mounted in a moving car. This car is turning around a corner, while another car is crossing the street in front of it (Fig. 5). The optic flow calculated with the standard algorithm correctly detects the combination of rotational and expanding flow field as well as the car approaching from the left. Nevertheless, due to the coarse structure of the street and the occlusions generated by the moving car, the region behind the car shows wrong or no velocities and the car movement is propagated into the surrounding area. Using
the combined version of the algorithm the results are improved: In the standard algorithm, the region representing car movements is overestimated by 48% of the original size of the car versus 22% in the extended version (mean of 6 successive frames, Fig. 5(g)).
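The overestimation measure of Fig. 5(g) relates the number of pixels wrongly carrying the car's motion to the true size of the car. A minimal sketch, assuming that both the detected car-motion region and the ground-truth car mask are available as sets of pixel coordinates (the function name is hypothetical):

```python
def overestimation_percent(detected, true_mask):
    """Pixels labelled with car motion outside the car, in percent of the true car size.

    detected, true_mask: sets of (x, y) pixel coordinates.
    """
    if not true_mask:
        raise ValueError("true_mask must not be empty")
    spill = detected - true_mask  # car motion propagated into the surroundings
    return 100.0 * len(spill) / len(true_mask)


car = {(x, y) for x in range(10) for y in range(10)}   # 100 pixels
halo = {(x, 10) for x in range(10)} | {(10, y) for y in range(10)}
value = overestimation_percent(car | halo, car)
```

A per-frame value computed in this way would then be averaged over the successive frames, as done for the 6-frame means quoted above.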
7
Discussion and Conclusion
We presented the advantages of the integration of multiple temporal and spatial scales in a correlation-based algorithm for motion estimation. The extension to initial motion cues from more than two frames enables the detection of optic flow in regions of the scene that are temporarily occluded, as in many everyday scenarios (e.g., a pedestrian or car moving on a street). Only one additional temporally forward-looking frame is sufficient to fill these areas with correct motion cues, as shown in the first experiment in the results section. The idea is similar to the way occlusions are handled in [12]. Nevertheless, while the processing in their algorithm is the basis for the correct segmentation of moving objects by using occlusion information in the context of different disparities, we aim at a complete and correct optic flow field. Furthermore, the temporal integration of motion cues leads to a higher resolution of the velocity. This enables the calculation of a smoother flow field in case of continuous changes in the optic flow, as in self-motion sequences. Very slow velocities are only detected if more than two successive frames are employed for the initial motion detection. A comparison on experiment 3 against the recently proposed optic flow model by Ogale and Aloimonos [15] showed that our extended algorithm has less propagation of the car movement to the surroundings (mean of 22% versus 29%), while the correct estimation at the positions of the car only reaches 60% for their model (versus 90% for the extended version). One drawback of the integration of more frames might be a slower response to motion direction changes. We tested our algorithm with temporal integration over multiple time steps while changing the object's direction. The resulting optic flow was slightly disturbed at time steps with strong direction changes, but recovered immediately.
Furthermore, the approach could be improved by weighting the likelihood of the estimates of the various time steps according to their temporal distance to the reference frame, e.g., with a Gaussian function. We combined the advantages of temporal scales with multiple spatial scales as presented in [2] in one common algorithm. The results for image sequences containing object-background movement and structures of different spatial frequencies were improved. Considering multiple spatial scales allows motion detection in areas of low spatial frequency like a street. The temporal integration adds the optic flow for partially occluded areas and very slow movements. This helps to prevent the propagation of wrong motion cues to surrounding areas. Concerning the efficiency of the algorithm, the computing time for the initial detection of motion hypotheses increases linearly with the number of frames used. Limiting the number of hypotheses at each position in V1 ensures that the processing time in V1 and MT remains the same as for the standard version. In
conclusion, our algorithm for optic flow estimation combines temporal information of more than two frames and multiple spatial scales efficiently and achieves high quality results in real-world sequences with occlusions, slow movements and continuous direction changes. Acknowledgements. This research has been supported in part by a grant from the European Union (EU FP6 IST Cognitive Systems Integrated project: Neural Decision-Making in Motion; project number 027198).
References
1. Bayerl, P., Neumann, H.: A fast biologically inspired algorithm for recurrent motion estimation. IEEE Transactions On PAMI 29(2), 246–260 (2007)
2. Beck, C., Bayerl, P., Neumann, H.: Optic Flow Integration at Multiple Spatial Frequencies - Neural Mechanism and Algorithm. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Nefian, A., Meenakshisundaram, G., Pascucci, V., Zara, J., Molineros, J., Theisel, H., Malzbender, T. (eds.) ISVC 2006. LNCS, vol. 4291, pp. 741–750. Springer, Heidelberg (2006)
3. Beauchemin, S.S., Barron, J.L.: The Computation of Optical Flow. ACM Computing Surveys 27, 433–467 (1995)
4. Ungerleider, L.G., Haxby, J.V.: What and where in the human brain. Current Opinion in Neurobiology 4, 157–165 (1994)
5. Albright, T.D.: Direction and orientation selectivity of neurons in visual area MT of the macaque. J. Neurophys 52, 1106–1130 (1984)
6. Bayerl, P., Neumann, H.: Disambiguating Visual Motion through Contextual Feedback Modulation. Neural Computation 16, 2041–2066 (2004)
7. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
8. Weiss, Y., Fleet, D.J.: Velocity likelihoods in biological and machine vision. Probabilistic models of the brain: Perception and neural function, pp. 81–100. MIT Press, Cambridge, MA (2001)
9. Adelson, E., Bergen, J.: Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A 2(2), 284–299 (1985)
10. Hupé, J.M., James, A.C., Girard, P., Lomber, S.G., Payne, B.R., Bullier, J.: Feedback Connections Act on the Early Part of the Responses in Monkey Visual Cortex. J. Neurophys 85, 134–145 (2001)
11. Stein, F.: Efficient Computation of Optical Flow Using the Census Transform. In: Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A. (eds.) Pattern Recognition. LNCS, vol. 3175, pp. 79–86. Springer, Heidelberg (2004)
12. Ogale, A.S., Fermueller, C., Aloimonos, Y.: Motion Segmentation Using Occlusions. IEEE Transactions On PAMI 27(6), 988–992 (2005)
13. Simoncelli, E.: Coarse-to-fine Estimation of Visual Motion. IEEE Eighth Workshop on Image and Multidimensional Signal Processing, Cannes France (September 1993)
14. Burt, P.J., Adelson, E.H.: The Laplacian Pyramid as a Compact Image Code. IEEE Transactions On Communications 31(4), 532–540 (1983)
15. Ogale, A.S., Aloimonos, Y.: A roadmap to the integration of early visual modules. IJCV 72(1), 9–25 (2007)
Classification of Optical Flow by Constraints
Yusuke Kameda¹ and Atsushi Imiya²
¹ School of Science and Technology, Chiba University, Japan
² Institute of Media and Information Technology, Chiba University, Japan
Abstract. In this paper, we analyse mathematical properties of a spatial optical-flow computation algorithm. First, by numerical analysis, we derive the convergence property of the variational optical-flow computation method used for cardiac motion detection. From the convergence property of the algorithm, we clarify the condition for the scheduling of the regularisation parameters. This condition shows that, for accurate and stable computation with scheduling of the regularisation coefficients, we are required to control the sampling interval for numerical computation.
1
Introduction
In this paper, we analyse mathematical properties of a three-dimensional optical-flow computation algorithm, since three-dimensional optical flow is a fundamental method for non-invasive cardiac motion analysis [1,2]. Furthermore, classification and separation of the regions on the heart wall using optical flow yields features for cardiac diagnosis [10]. Optical flow is a well-established method in computer vision [3,4,5,6,7]. For variational cardiac optical-flow computation, consistencies and constraints additional to those of image motion analysis are used [1]. The gradient consistency is a typical additional consistency, and the thin plate deformation constraint and the divergence-free constraint are typical additional constraints. The gradient consistency is used for enhancement of the region boundary. The deformable constraint, which is equivalent to the thin plate deformation constraint [8], is used as a physical model of heart-wall deformation. The divergence-free constraint is based on the mass-preservation requirement for physical objects. Therefore, for accurate and stable computation, the scheduling of the regularisation parameters [9] is a fundamental problem.
There are two types of evaluation methods for inverse problems in non-invasive diagnosis. The first analyses the accuracy of the solution using normalised phantoms, that is, it evaluates the difference between the phantom, which is used as the ground truth, and the solution derived by the algorithm. The second is mathematics-based evaluation, that is, clarification of the convergence and stability of the algorithm by numerical analysis. In this paper, from the viewpoint of mathematics-based evaluation, we first derive the convergence property of the variational optical-flow computation method used for cardiac motion detection. Secondly, we prove that the divergence-free constraint is dependent on the first order smoothness constraint, which is called the Horn-Schunck regulariser [4], and that the vector spline constraint [16,13,14,15] is dependent on the second order smoothness constraint, which is equivalent to the thin plate deformation constraint [11,12]. Thirdly, from the convergence property of the algorithm, we clarify the condition for the scheduling of the regularisation parameters. This condition shows that, for accurate and stable computation with scheduling of the regularisation coefficients, we are required to control the sampling interval for numerical computation. Furthermore, using the first and second order derivatives, we show that it is possible to classify the motions of points in an image using the orders of the differential constraints [10].
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 61–68, 2007. © Springer-Verlag Berlin Heidelberg 2007
2
Vector Spline Constraints
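For matching of the deformable boundary between a pair of images [11], the minimisation of

$$J_{TP}(u,v) = \int_{\mathbf{R}^2} \left(f(x,y) - f(x-u,y-v)\right)^2 d\boldsymbol{x} + \lambda \int_{\mathbf{R}^2} \left(u_{xx}^2 + 2u_{xy}^2 + u_{yy}^2 + v_{xx}^2 + 2v_{xy}^2 + v_{yy}^2\right) dx\,dy \tag{1}$$

produces a promising result, assuming that each element of the vector $(u, v)$ is four-times differentiable. For the interpolation of the vector field on a plane, the solutions $\boldsymbol{u}$ which minimise

$$J_{1st}(\boldsymbol{u}) = \sum_{i=1,j=1}^{m,n} |\boldsymbol{u}(\boldsymbol{x}_{ij}) - \boldsymbol{x}_{ij}|^2 + \int_{\mathbf{R}^2} \left(\gamma_1 |\mathrm{div}\,\boldsymbol{u}|^2 + \gamma_2 |\mathrm{rot}\,\boldsymbol{u}|^2\right) d\boldsymbol{x} \tag{2}$$

for twice differentiable function $\boldsymbol{u}$, and

$$J_{2nd}(\boldsymbol{u}) = \sum_{i=1,j=1}^{m,n} |\boldsymbol{u}(\boldsymbol{x}_{ij}) - \boldsymbol{x}_{ij}|^2 + \int_{\mathbf{R}^2} \left(\gamma_1 |\nabla \mathrm{div}\,\boldsymbol{u}|^2 + \gamma_2 |\nabla \mathrm{rot}\,\boldsymbol{u}|^2\right) d\boldsymbol{x} \tag{3}$$

for four-times differentiable function $\boldsymbol{u}$, respectively, where $\{\boldsymbol{x}_{ij}\}_{i=1,j=1}^{m,n}$ is the given point set, are called the first-order vector-spline and the second-order vector-spline, respectively [14,15]. Here, the gradient operation in the second term of the regulariser in eq. (3) is computed as the vector gradient for the higher dimensional problems. Equations (1) and (3) are typical minimisation problems with second order constraints which are used in computer-vision problems.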
3
Optical Flow Computation
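For a spatio-temporal image $f(x, y, z, t)$, the total derivative is given as

$$\frac{d}{dt} f = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} + \frac{\partial f}{\partial z}\frac{dz}{dt} + \frac{\partial f}{\partial t}\frac{dt}{dt} \tag{4}$$

where $\boldsymbol{u} = (\dot{x}, \dot{y}, \dot{z})^\top = \left(\frac{dx}{dt}, \frac{dy}{dt}, \frac{dz}{dt}\right)^\top$ is the motion of each point $\boldsymbol{x} = (x, y, z)^\top$. Optical flow consistency [4,5,6]

$$\frac{d}{dt} f = 0 \tag{5}$$

implies that the motion of the point $\boldsymbol{u} = (u, v, w)^\top = (\dot{x}, \dot{y}, \dot{z})^\top$ is the solution of the singular equation,

$$f_x u + f_y v + f_z w + f_t = 0. \tag{6}$$

The optical-flow computation with the second order constraint is

$$J_{\alpha\beta}(\boldsymbol{u}) = \int_{\mathbf{R}^3} \left( |\nabla f^\top \boldsymbol{u} + f_t|^2 + \alpha\,\mathrm{tr}\,\nabla\boldsymbol{u}\nabla\boldsymbol{u}^\top + \beta\,\mathrm{tr}\,\boldsymbol{H}_{\boldsymbol{u}}\boldsymbol{H}_{\boldsymbol{u}}^\top \right) d\boldsymbol{x}, \tag{7}$$

for

$$\mathrm{tr}\,\boldsymbol{H}_{\boldsymbol{u}}\boldsymbol{H}_{\boldsymbol{u}}^\top = \mathrm{tr}\,\boldsymbol{H}_u\boldsymbol{H}_u^\top + \mathrm{tr}\,\boldsymbol{H}_v\boldsymbol{H}_v^\top + \mathrm{tr}\,\boldsymbol{H}_w\boldsymbol{H}_w^\top, \tag{8}$$

where $\boldsymbol{H}_f$ is the Hessian matrix of the function $f$. The Euler-Lagrange equation of the variational problem is

$$\Delta \boldsymbol{u} - \frac{\beta}{\alpha}\Delta^2 \boldsymbol{u} = \frac{1}{\alpha}\left(\nabla f^\top \boldsymbol{u} + f_t\right)\nabla f. \tag{9}$$

Let $(x_1, x_2, x_3)^\top = (x, y, z)^\top$. Setting

$$\mathrm{sh}^2\boldsymbol{u} = \sum_{i=1}^{n} \left\{ (a_i^-)^2 + (a_i^+)^2 \right\} \tag{10}$$

where

$$a_i^- = \frac{\partial}{\partial x_i} u_i - \frac{\partial}{\partial x_{i+1}} u_{(i+1)}, \quad a_i^+ = \frac{\partial}{\partial x_i} u_i + \frac{\partial}{\partial x_{i+1}} u_{(i+1)}, \tag{11}$$

for $u_4 = u_1$ and $x_4 = x_1$, there is a relation,

$$\mathrm{tr}(\nabla\boldsymbol{u}\nabla\boldsymbol{u}^\top) = \frac{1}{2}\left(\mathrm{div}^2\boldsymbol{u} + |\mathrm{rot}\,\boldsymbol{u}|^2 + \mathrm{sh}^2\boldsymbol{u}\right). \tag{12}$$

Furthermore, for $\mathrm{tr}\,\boldsymbol{H}\boldsymbol{H}^\top$, we have the relation

$$\mathrm{tr}\,\boldsymbol{H}\boldsymbol{H}^\top = |\nabla \mathrm{div}\,\boldsymbol{u}|^2 + |\nabla \mathrm{rot}\,\boldsymbol{u}|^2 + h^2(\boldsymbol{u}) + k^2(\boldsymbol{u}) \tag{13}$$

for

$$h^2(\boldsymbol{u}) = \begin{vmatrix} v_{xy} & u_{xy} \\ \Delta_{xy} v & \Delta_{xy} u \end{vmatrix} + \begin{vmatrix} w_{xy} & v_{xy} \\ \Delta_{yz} w & \Delta_{yz} v \end{vmatrix} + \begin{vmatrix} u_{xy} & w_{xy} \\ \Delta_{zx} u & \Delta_{zx} w \end{vmatrix} \tag{14}$$

$$k^2(\boldsymbol{u}) = \mathrm{div}\,\boldsymbol{s}, \quad \boldsymbol{s} = \left( \begin{vmatrix} v_y & w_y \\ v_z & w_z \end{vmatrix}, \begin{vmatrix} w_z & u_z \\ w_x & u_x \end{vmatrix}, \begin{vmatrix} u_x & v_x \\ u_y & v_y \end{vmatrix} \right)^\top \tag{15}$$

where

$$\Delta_{\alpha\beta} = \frac{\partial^2}{\partial\alpha^2} + \frac{\partial^2}{\partial\beta^2}, \quad \overline{\Delta}_{\alpha\beta} = \frac{\partial^2}{\partial\alpha^2} - \frac{\partial^2}{\partial\beta^2} \tag{16}$$

for $\alpha\beta \in \{xy, yz, zx\}$. These relations imply the next properties.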
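Assertion 1. The first order smoothness constraint $\mathrm{tr}\,\nabla\boldsymbol{u}\nabla\boldsymbol{u}^\top$ involves the divergence-free constraint $\mathrm{div}\,\boldsymbol{u}$. Therefore, the first order smoothness constraint involves the divergence-free condition.
Assertion 2. The second order minimum-deformation constraint $\mathrm{tr}\,\boldsymbol{H}_{\boldsymbol{u}}\boldsymbol{H}_{\boldsymbol{u}}^\top$ involves the second order vector spline constraint $\gamma_1|\nabla \mathrm{div}\,\boldsymbol{u}|^2 + \gamma_2|\nabla \mathrm{rot}\,\boldsymbol{u}|^2$. Therefore, considering the regularisation term $\mathrm{tr}\,\boldsymbol{H}_{\boldsymbol{u}}\boldsymbol{H}_{\boldsymbol{u}}^\top$ is equivalent to solving the vector spline minimisation [12,13,14,15,16] for optical-flow computation.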
4
Numerical Scheme and Its Stability
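In this paper, by embedding images in a rectangular region which encircles the images, we adopt the Dirichlet condition $f = 0$ on the boundary. Furthermore, for the sampled optical flow vector $\boldsymbol{u}_{ijk} = (u_{ijk}, v_{ijk}, w_{ijk})^\top$, we set the vectorisation of the sampled function as

$$\boldsymbol{u} := \begin{pmatrix} \boldsymbol{u}_{111} \\ \boldsymbol{u}_{112} \\ \vdots \\ \boldsymbol{u}_{MMM} \end{pmatrix}, \quad \boldsymbol{u}_{ijk} = \begin{pmatrix} u_{ijk} \\ v_{ijk} \\ w_{ijk} \end{pmatrix}. \tag{17}$$

Then, we have the discrete version of the Euler-Lagrange equation

$$\boldsymbol{L}\boldsymbol{u} - \frac{\beta}{\alpha}\boldsymbol{L}^2\boldsymbol{u} = \frac{1}{\alpha}(\boldsymbol{S}\boldsymbol{u} + \boldsymbol{s}), \quad \boldsymbol{s} = f_t \nabla f \tag{18}$$

where $\boldsymbol{L}$ is the discrete Laplacian operation. In eqs. (9) and (18), the operations on $\boldsymbol{u}$ are split to both sides of the equation. If both operations are non-singular, and the spectral radius of one is less than one while that of the other is larger than one, it is possible to derive a converging numerical scheme. However, the operation $\boldsymbol{S}$ is singular. In this section, we derive a converging iteration scheme from the equation

$$\boldsymbol{u} + \boldsymbol{L}\boldsymbol{u} - \frac{\beta}{\alpha}\boldsymbol{L}^2\boldsymbol{u} = \boldsymbol{u} + \frac{1}{\alpha}\boldsymbol{S}\boldsymbol{u} + \frac{1}{\alpha}\boldsymbol{s}, \quad \boldsymbol{S} = \nabla f \nabla f^\top \tag{19}$$

since the operations on both sides of the equation are non-singular. The second order numerical derivative is

$$\partial^2 u = \frac{u(i + h) - 2u(i) + u(i - h)}{h^2}. \tag{20}$$

Setting $\boldsymbol{D}^2$ to be the second order discrete differential operator derived by eq. (20), that is,

$$\boldsymbol{D}^2 = \begin{pmatrix} -2 & 1 & 0 & 0 & \cdots & 0 & 0 \\ 1 & -2 & 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & -2 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & 1 & -2 \end{pmatrix}, \tag{21}$$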
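As a small illustration of the operator in eq. (21), the following sketch builds the tridiagonal matrix for a short one-dimensional signal, forms the fourth order operator as its square, and applies both. The helper names are hypothetical, the sampling interval is taken as h = 1, and the Dirichlet condition appears implicitly as zero values outside the sampled range.

```python
def second_difference_matrix(m):
    """m x m matrix with -2 on the diagonal and 1 on the first off-diagonals."""
    return [[-2 if i == j else 1 if abs(i - j) == 1 else 0 for j in range(m)]
            for i in range(m)]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(mat, vec):
    """Matrix-vector product."""
    return [sum(c * v for c, v in zip(row, vec)) for row in mat]


D2 = second_difference_matrix(6)
D4 = matmul(D2, D2)                      # fourth order operator as the square of D2

samples = [i * i for i in range(1, 7)]   # u(i) = i^2 sampled at i = 1, ..., 6
second = matvec(D2, samples)             # equals 2 wherever the stencil is exact
```

The last entry of `second` deviates from 2 because the zero boundary value does not match the quadratic there; this is the price of the Dirichlet embedding and the reason the images are surrounded by a zero-valued region.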
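for the Dirichlet boundary condition, the fourth order operator is $\boldsymbol{D}^4 = \boldsymbol{D}^2\boldsymbol{D}^2$. Furthermore, the numerical operations $\boldsymbol{L}$ and $\boldsymbol{L}^2$, which stand for $\Delta$ and $\Delta^2$ in $\mathbf{R}^3$, respectively, are given as

$$\boldsymbol{L} = \boldsymbol{D}^2 \otimes \boldsymbol{I} \otimes \boldsymbol{I} + \boldsymbol{I} \otimes \boldsymbol{D}^2 \otimes \boldsymbol{I} + \boldsymbol{I} \otimes \boldsymbol{I} \otimes \boldsymbol{D}^2, \tag{22}$$

$$\boldsymbol{L}^2 = \boldsymbol{D}^4 \otimes \boldsymbol{I} \otimes \boldsymbol{I} + \boldsymbol{I} \otimes \boldsymbol{D}^4 \otimes \boldsymbol{I} + \boldsymbol{I} \otimes \boldsymbol{I} \otimes \boldsymbol{D}^4 + 2\left(\boldsymbol{D}^2 \otimes \boldsymbol{D}^2 \otimes \boldsymbol{I} + \boldsymbol{I} \otimes \boldsymbol{D}^2 \otimes \boldsymbol{D}^2 + \boldsymbol{D}^2 \otimes \boldsymbol{I} \otimes \boldsymbol{D}^2\right), \tag{23}$$

where $\boldsymbol{A} \otimes \boldsymbol{B}$ is the Kronecker product of matrices $\boldsymbol{A}$ and $\boldsymbol{B}$, and $\boldsymbol{I}$ is the identity matrix. For an appropriate positive number $\Delta\tau$, we have a splitting expression,

$$\mathrm{Diag}\left(\boldsymbol{I} + \frac{\Delta\tau}{\alpha}\boldsymbol{S}_{ijk}\right)\boldsymbol{u} = \boldsymbol{P}^\top \mathrm{Diag}\left(\boldsymbol{I} + \Delta\tau\boldsymbol{L} - \frac{\beta}{\alpha}\Delta\tau\boldsymbol{L}^2\right)\boldsymbol{P}\boldsymbol{u} - \frac{\Delta\tau}{\alpha}\boldsymbol{s}, \tag{24}$$

for the permutation matrix $\boldsymbol{P}$ which transforms $\boldsymbol{u}$ of eq. (17) to

$$\boldsymbol{v} := \mathrm{vec}\begin{pmatrix} \boldsymbol{u}_{111}^\top \\ \boldsymbol{u}_{112}^\top \\ \vdots \\ \boldsymbol{u}_{MMM}^\top \end{pmatrix} = \boldsymbol{P}\boldsymbol{u}. \tag{25}$$

From this, we have the iterative form

$$\boldsymbol{A}\boldsymbol{u}^{n+1} = \boldsymbol{B}\boldsymbol{u}^{n} + \boldsymbol{c} \tag{26}$$

for

$$\boldsymbol{A} = \mathrm{Diag}\left(\boldsymbol{I} + \frac{\Delta\tau}{\alpha}\boldsymbol{S}_{ijk}\right), \quad \boldsymbol{S}_{ijk} = \nabla f(i,j,k)\nabla f(i,j,k)^\top, \tag{27}$$

$$\boldsymbol{B} = \boldsymbol{P}^\top\left(\boldsymbol{I} + \Delta\tau\boldsymbol{L} - \frac{\beta}{\alpha}\Delta\tau\boldsymbol{L}^2\right)\boldsymbol{P}, \quad \boldsymbol{c} = -\frac{\Delta\tau}{\alpha}\boldsymbol{s}, \quad \boldsymbol{s} = f_t\nabla f(i,j,k). \tag{28}$$

Both $\boldsymbol{A}$ and $\boldsymbol{B}$ are non-singular and $\rho(\boldsymbol{A}) > 1$, where $\rho(\boldsymbol{A})$ is the maximum of the absolute values of the eigenvalues of the matrix $\boldsymbol{A}$. Therefore, if $\rho(\boldsymbol{B}) < 1$, then the iterative form eq. (26) is stable and converges to the solution of the original equation. We derive the condition on $\alpha$, $\beta$, and $\Delta\tau$ that guarantees $\rho(\boldsymbol{B}) < 1$. Setting $\boldsymbol{U}$ and $\boldsymbol{\Lambda}$ to be the discrete Fourier transform matrix and the diagonal matrix, respectively, we have the relations

$$\boldsymbol{L} = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^\top, \quad \boldsymbol{L}^2 = \boldsymbol{U}\boldsymbol{\Lambda}^2\boldsymbol{U}^\top, \tag{29}$$

where

$$\boldsymbol{\Lambda} = \mathrm{Diag}(\lambda_{ijk}), \quad \lambda_{ijk} = \mu_i + \mu_j + \mu_k, \quad \mu_k = -\left(1 - \cos\frac{\pi}{M+1}k\right). \tag{30}$$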
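Substituting these relations into

$$\rho\left(I + \Delta\tau L - \frac{\beta}{\alpha}\Delta\tau L^2\right) < 1, \tag{31}$$

we finally have the relation

$$\rho\left(I + \Delta\tau \Lambda - \frac{\beta}{\alpha}\Delta\tau \Lambda^2\right) < \left|1 - 4 \times 3\,\frac{\Delta\tau}{h^2} - 16 \times 3^2\,\frac{\beta}{\alpha}\frac{\Delta\tau}{h^4}\right|. \tag{32}$$

From eq. (32), we have the next theorem.

Theorem 1. If the inequality

$$\left|1 - 4 \times 3\,\frac{\Delta\tau}{h^2} - 16 \times 3^2\,\frac{\beta}{\alpha}\frac{\Delta\tau}{h^4}\right| < 1 \tag{33}$$

is satisfied, the iterative form is stable and converges to the solution of the original PDE.

The iteration is terminated if

$$\max_{x \in \mathbb{R}^3} |u^{n+1}(x) - u^{n}(x)| < |u^{n}(x)| \times 10^{-m}, \tag{34}$$

where $|u|$ is the Euclidean norm in $\mathbb{R}^3$.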
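The condition of Theorem 1 and the stopping rule of eq. (34) can be sketched on a toy problem. This is an illustration under stated assumptions rather than the method of the paper: the left-hand side of eq. (33) is read as an absolute value, the matrix A is taken to be purely diagonal (in the paper it is built from the 3 × 3 blocks S), the stopping rule compares the largest change with the largest magnitude of the previous iterate, and the function names and the 2 × 2 test system are hypothetical.

```python
def satisfies_theorem_1(dtau, h, alpha, beta):
    """Check |1 - 4*3*dtau/h^2 - 16*3^2*(beta/alpha)*dtau/h^4| < 1, as in eq. (33)."""
    value = 1 - 4 * 3 * dtau / h**2 - 16 * 3**2 * (beta / alpha) * dtau / h**4
    return abs(value) < 1

def iterate(a_diag, b, c, u0, m=6, max_iter=10_000):
    """Iterate A u_{n+1} = B u_n + c for a diagonal A (given as a list).

    Stops when the largest change is below 10**-m times the largest
    magnitude of the previous iterate, a simplified reading of eq. (34).
    Returns the last iterate and the number of iterations performed.
    """
    u = list(u0)
    for n in range(max_iter):
        rhs = [sum(b[i][j] * u[j] for j in range(len(u))) + c[i]
               for i in range(len(u))]
        new_u = [rhs[i] / a_diag[i] for i in range(len(u))]
        change = max(abs(x - y) for x, y in zip(new_u, u))
        size = max(abs(y) for y in u)
        u = new_u
        if change < size * 10**-m:
            return u, n + 1
    return u, max_iter

# Toy system: the fixed point solves (A - B) u = c, here u = (1/1.4, 1/1.4).
u, steps = iterate([2.0, 2.0], [[0.5, 0.1], [0.1, 0.5]], [1.0, 1.0], [0.0, 0.0])
print(u, steps)
print(satisfies_theorem_1(dtau=0.1, h=1.0, alpha=1e5, beta=0.0))
```

With β = 0 the check reduces to 0 < Δτ < h²/6, the familiar step-size bound for an explicit diffusion step on a three-dimensional grid.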
5 Numerical Examples
Let $u^{*}_{(\alpha\beta\gamma)}$ be the optical-flow vector computed using the α-th, β-th, and γ-th constraints. For each point $x$, we can have many optical-flow vectors $u^{*}_{(\alpha)}$, where α is a string of positive integers, for example, 1, 2, 12, 123, and so on. The vector $u^{*}_{(2)}$ is the deformation vector of the deformable boundary. Therefore, if $|u^{*}_{(2)}|$ is sufficiently small at point $x$, and $|u^{*}_{(1)}| \gg |u^{*}_{(2)}|$, the motion in the neighbourhood of this point $x$ is homogeneous. This geometrical property of optical flow computed by the variational method implies that the operation

$$D_{(i)} = \{x \mid |u^{*}_{(i)}| \gg |u^{*}_{(j)}|,\ i \neq j\} \tag{35}$$

derives segments in $\mathbb{R}^n$ using the orders of the constraints for the computation of variational problems. Generally, it is possible to adopt many constraints for the classification of moving points in a space with optical-flow vectors. For example, the two operations

$$D_h = \{x \mid |u^{*}_{(1)}| \gg |u^{*}_{(12)}|\}, \qquad D_d = \{x \mid |u^{*}_{(12)}| \gg |u^{*}_{(1)}|\} \tag{36}$$

classify a scene into a homogeneously moving part and a deforming part, since the regularisers $\mathrm{tr}\,\nabla u \nabla u^{\top}$ and $\mathrm{tr}\,D^2 u\, D^2 u^{\top}$ minimise the smoothness energy and the elastic energy of the solution $u$, respectively. Furthermore, the operation

$$u = \begin{cases} u^{*}_{(1)}, & \text{if } |u^{*}_{(12)}| \ll 1 \text{ on } x, \\ u^{*}_{(12)}, & \text{otherwise,} \end{cases} \tag{37}$$
[Figure 1: three-dimensional optical-flow field plots (axes x, y, z) for frames 2 and 3 of a 20-frame sequence. (a) α = $10^5$, β = 0. (b) α = $10^5$, β = $10^5$. (c) α = $10^5$, β = $10^4$.]

Fig. 1. Optical Flow Classification in Three-Dimensional Euclidean Space. (a), (b), and (c) are the optical-flow field images of the beating heart for the regularisation parameters (α, β) = ($10^5$, 0), ($10^5$, $10^5$), and ($10^5$, $10^4$), respectively. The Horn-Schunck constraint extracts the smooth motion u(1) of the whole heart. The second-order constraint extracts the deformable motion u(12). The combination of the Horn-Schunck constraint and the deformable constraint allows us to separate and classify the motions of points in images into smooth motion and deformable motion.
allows us to express multi-modal motion simultaneously, using variational computation methods. These properties imply that the orders of the constraints in variational problems act as the scale in scale-space analysis. Figure 1 shows the cardiac optical flow u(1) and u(12) extracted for several combinations of α and β. Panels (a), (b), and (c) are the optical-flow field images of the beating heart for the regularisation parameters (α, β) = ($10^5$, 0), ($10^5$, $10^5$), and ($10^5$, $10^4$), respectively. These sequences show that, as expected, the flow obtained with the Horn-Schunck constraint captures the smooth motion of the whole heart. Furthermore, with the second-order constraint, the deformable motion on the wall of the beating heart is extracted. The dominant deformable part extracted is the atria. These results indicate that the combination of the Horn-Schunck constraint and the deformable constraint allows us to separate and classify the motions of points in images into the smooth motion u(1) and the deformable motion u(12).
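The classification into a homogeneously moving part and a deforming part can be sketched as a per-point comparison of flow magnitudes. The following is only an illustration: it assumes that the comparison in the two sets is "much larger than", approximates that relation by a hypothetical factor `ratio`, and adds a third label for points where neither flow dominates; none of these choices is specified in the paper.

```python
import math

def norm(v):
    """Euclidean norm of a 3-D vector."""
    return math.sqrt(sum(x * x for x in v))

def classify(u1, u12, ratio=10.0):
    """Label each point as 'homogeneous', 'deforming' or 'mixed'.

    u1:  flow vectors computed with the first-order constraint only.
    u12: flow vectors computed with the first- and second-order constraints.
    ratio: hypothetical factor standing in for "much larger than".
    """
    labels = []
    for a, b in zip(u1, u12):
        na, nb = norm(a), norm(b)
        if na > ratio * nb:
            labels.append("homogeneous")   # point assigned to D_h
        elif nb > ratio * na:
            labels.append("deforming")     # point assigned to D_d
        else:
            labels.append("mixed")         # neither flow dominates
    return labels

labels = classify([(1, 0, 0), (0, 0, 0.01), (1, 1, 1)],
                  [(0.01, 0, 0), (0, 1, 0), (1, 1, 0)])
print(labels)
```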
6 Conclusions
In this paper, we first derived the convergence property of the variational optical-flow computation method used for volumetric cardiac motion detection. From this convergence property of the algorithm, we clarified the condition for the scheduling of the regularisation parameters. The images for our experiment were provided by the Robarts Research Institute at the University of Western Ontario through Professor John Barron. We thank Professor John Barron for allowing us to use his data set. The research was supported by Grant-in-Aid for Scientific Research No. 19500139 of the Japan Society for the Promotion of Science.
References

1. Zhou, Z., Synolakis, C.E., Leahy, R.M., Song, S.M.: Calculation of 3D internal displacement fields from 3D X-ray computer tomographic images. In: Proceedings of Royal Society: Mathematical and Physical Sciences, vol. 449, pp. 537–554 (1995)
2. Song, S.M., Leahy, R.M.: Computation of 3-D velocity fields from 3-D cine images of a human heart. IEEE Transactions on Medical Imaging 10, 295–306 (1991)
3. Aubert, G., Kornprobst, P.: Mathematical Problems in Image Processing: Partial Differential Equations and the Calculus of Variations. Springer, New York (2002)
4. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–204 (1981)
5. Nagel, H.-H.: On the estimation of optical flow: Relations between different approaches and some new results. Artificial Intelligence 33, 299–324 (1987)
6. Barron, J.L., Fleet, D.J., Beauchemin, S.S.: Performance of optical flow techniques. International Journal of Computer Vision 12, 43–77 (1994)
7. Weickert, J., Schnörr, C.: Variational optic flow computation with a spatiotemporal smoothness constraint. Journal of Mathematical Imaging and Vision 14, 245–255 (2001)
8. Timoshenko, S.P.: History of Strength of Materials. Dover, Mineola, NY (1983)
9. Chang, H.-H.: Variational approach to cardiac motion estimation for small animals in tagged magnetic resonance imaging, 2006. IEEE Pacific-Rim Image and Video Technology, 363–372 (2006)
10. Chang, H.-H., Moura, J.M.F., Yijen, L., Wu, Y.L., Ho, C.: Early detection of rejection in cardiac MRI: A spectral graph approach, 2006. IEEE International Symposium on Biomedical Imaging, Arlington, 113–116 (2006)
11. Grenander, U., Miller, M.: Computational anatomy: An emerging discipline. Quarterly of Applied Mathematics 4, 617–694 (1998)
12. Sorzano, C.Ó.S., Thévenaz, P., Unser, M.: Elastic registration of biological images using vector-spline regularization. IEEE Tr. Biomedical Engineering 52, 652–663 (2005)
13. Wahba, G., Wendelberger, J.: Some new mathematical methods for variational objective analysis using cross-validation. Monthly Weather Review 108, 36–57 (1980)
14. Amodei, L., Benbourhim, M.N.: A vector spline approximation. Journal of Approximation Theory 67, 51–79 (1991)
15. Benbourhim, M.N., Bouhamidi, A.: Approximation of vectors fields by thin plate splines with tension. Journal of Approximation Theory 136, 198–229 (2005)
16. Suter, D.: Motion estimation and vector spline. In: Proceedings of CVPR'94, pp. 939–942 (1994)
Target Positioning with Dominant Feature Elements

Zhuan Qing Huang and Zhuhan Jiang

School of Computing and Mathematics, University of Western Sydney, NSW, Australia
[email protected], [email protected]

Abstract. We propose a dominant-feature based matching method for capturing a target in a video sequence through the dynamic decomposition of the target template. The target template is segmented via intensity bands to better distinguish itself from the local background. Dominant feature elements are extracted from such segments to measure the matching degree of a candidate target via a sum of similarity probabilities. In addition, spatial filtering and contour adaptation are applied to further refine the object location and shape. The implementation of the proposed method has shown its effectiveness in capturing the target in a moving background and with non-rigid object motion.

Keywords: Object detection, object tracking.
1 Introduction

Searching for a target region that matches the object template in the previous frame of a video is somewhat different from object recognition in its traditional sense. The main task in this process is to locate the position of the candidate object in the current frame. Sometimes it may additionally require a detailed description of the object, such as its shape deformation. With the increasing variety of applications in the fields of video-based surveillance, monitoring, robotics and medical treatment, a great deal of research has been conducted in this regard. The main approaches considered so far in the literature for object tracking are template matching, the snake approach, and the use of kernel density estimation, to name a few. Template matching [1-3] matches the whole object region directly at different positions in the current frame to locate the new object position that leads to minimum errors. The snake approach [4-6, 11], on the other hand, emphasises the extraction of the object contour based on certain deforming criteria such as the minimisation of an energy function. More recently, kernel functions [7-10] have been utilised to estimate the minimum errors between the candidate and target regions, in the form of density distributions. Other methods may incorporate additional features such as the specification of the selected edges being tracked [12]. The statistical framework is also widely integrated into these algorithms. The multi-hypothesis method, for instance, approaches the problem by deriving the highest probability of certain hypotheses [13-14] with the Bayesian rule. In [4], as an example, the contour of the tracked object is propagated in the framework of a posterior probability.

The main purpose of this work is to approach the tracking problem with fewer of the limitations exhibited by the above-mentioned main techniques, so as to accommodate, at least to a certain extent, non-rigid movement, a moving background, object shape deformation and object colour changes, while remaining computationally practical. Our proposed approach will operate on a simple yet sufficiently distinguishing object representation on the local background within a statistical framework. The simplified object model results in a lower computational load and allows for non-rigid movement of an object in a moving background. The probabilistic similarity of object pixels subsequently improves the accuracy of the object detection in a different frame. The tracking of an object in a video sequence is then carried out on the basis of the dynamic segmentation of the object with respect to its local background in terms of the dominant elements of colours.

This paper is organised as follows. In section 2, we propose a new form of representation for the target template within the environment of a local background, and analyse the extraction of the colour intensity bands for the object. In section 3, we then develop a matching scheme in a probabilistic framework, based on the dominant elements of intensity bands, and follow it up with a refining finalisation. Experimental results are then reported in section 4, with a final conclusion made in section 5.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 69–76, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Dynamic Dominant Elements

Among the object features such as colour, edge and shape, the colour feature is often more significant in matching the object. We here propose to model an object through a simplified representation of the significant colour bands, and segment the object colours into suitable intensity bands determined by maximising the segment difference between the object and the local background.

2.1 Band Selection

Let $I_s$ be the intensity corresponding to the peak value of the object density $\rho_o$, which acts as the centre of the most significant band. The intensity bands $h_i$ are thus set to

$$h_i = \left[\, I_s + (i - k_b - 3/2)\,w,\; I_s + (i - k_b - 1/2)\,w \,\right], \qquad i = 1, \ldots, n, \tag{1}$$

where $k_b = \lfloor I_s/w + 1/2 \rfloor$, $n = k_b + \lfloor k_e \rfloor + k_p$, $k_e = (1 - I_s)/w + 1/2$, $k_p = 0$ or 1, and $w$ is the bandwidth. If the RGB colour space is used, then each colour component is dealt with independently. The main strategy for the band selection is to represent the object with the bands that most distinguish it from its local background. The technical method we introduce here is to calculate the "difference" of the neighbouring regions, one on the object side and one on the background side, along the object boundary. To carry this out, a set of boundary pixels $s_{i'}$, $1 \le i' \le n'$, typically equally spaced along the boundary of the object, is first selected, and a circle or a square is used to trap a local neighbourhood where, for instance, the circle is centred at a selected boundary pixel $s$ and the radius $r$ is half the distance between successively selected boundary pixels, as illustrated in figure 2.1 (a).

Fig. 2.1. (a) Typical local neighborhoods along the boundary. (b) Segment patterns in a sample circle.

For notational simplicity, such squares of width $2r$ will also be referred to as a "circle" of radius $r$. If we denote the segment membership as a bit pattern $\theta$, with each single bit indicating the presence of a particular segment element, then we can compare the segment membership $\theta^{+}$ on the object side and the membership $\theta^{-}$ on the background side, as shown in figure 2.1 (b). The task thus becomes searching for a $w$ that leads to the largest difference between $\theta^{+}$ and $\theta^{-}$. To calculate the segmental difference along a boundary section, we count up such differences while ignoring their magnitude (the difference of segment indices, or the distance of colour), via

$$d_\theta = \sum_{i=1}^{n'} \left[ \sum \left(\theta_i^{+} \oplus \theta_i^{-}\right) \Big/ \left(\sum \theta_i^{+}\right)^{2} \right], \tag{2}$$
where the total segment disparity $d_\theta$ indicates the extent to which the object separates itself from the background, $n'$ is the total number of circles selected along the boundary path, $\oplus$ represents the bit XOR operation, and the denominator is a normalising factor. We note that the $w$ having the largest $d_\theta$ is treated as the most desirable bandwidth, although $w$ may also be selected in a range corresponding to a higher $d_\theta$.

2.2 Extraction of Dominant Elements

We now seek to represent an object with a few feature elements that would distinguish the object from the local background and facilitate tracking the object. We first let the object colours and the local background colours be binned based on the corresponding bands. We then define the difference $\Delta_k$ and the ratio $\delta_k$ by

$$\Delta_k = Q_{ok} - Q_{bk}, \qquad \delta_k = Q_{ok} / Q_{Bk}, \tag{3}$$
where Qok and Qbk are respectively the number of object pixels and the number of local background pixels that fall into the kth bin, and QBk is the number of pixels in the background of the full frame that fall into the kth bin. To proceed with the band selection, we set thresholds μ for all Δi and ν>0 for all δi respectively. The kth band will be selected, and thus considered dominant, if both Δk >μ and δk >ν. Both conditions together ensure the kth band to highlight the object from background. For the selected m bands h'k, we denote by xk the corresponding centre. These xk will thus serve as the dominant elements to represent the object. We note that the bands obtained above are selected according to the intensity distribution without considering the actual spatial information. Once the h'k are derived, however, we can choose a small window for every pixel within a selected segment, keep within the window only the pixels in the same segment, and then calculate there the local mean, or another feature value by a different model, to substitute the pixel with the calculated value. Amongst such new pixel values, the closest to the middle of the band is chosen as the element representing the segment. The advantage of this second method is that it is able to reduce the effect of the colour noises in the background when only sporadic background pixels exhibit intensities very similar to those in the object.
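The band layout of eq. (1) and the dominance test built on eq. (3) can be sketched as follows. This is a minimal illustration, assuming intensities normalised to [0, 1] and pixel counts already binned per band; the function names, the example counts, and the handling of an empty full-frame bin are hypothetical choices not taken from the paper.

```python
import math

def intensity_bands(i_s, w, k_p=0):
    """Bands h_i of eq. (1) for intensities in [0, 1]; returns (k_b, bands)."""
    k_b = math.floor(i_s / w + 0.5)
    k_e = (1 - i_s) / w + 0.5
    n = k_b + math.floor(k_e) + k_p
    bands = [(i_s + (i - k_b - 1.5) * w, i_s + (i - k_b - 0.5) * w)
             for i in range(1, n + 1)]
    return k_b, bands

def dominant_bands(q_obj, q_local_bg, q_frame_bg, mu, nu):
    """Indices k whose difference Delta_k exceeds mu and ratio delta_k exceeds nu."""
    selected = []
    for k, (qo, qb, q_full) in enumerate(zip(q_obj, q_local_bg, q_frame_bg)):
        difference = qo - qb
        # Assumption: an empty full-frame bin counts as an unbounded ratio.
        ratio = qo / q_full if q_full > 0 else float("inf")
        if difference > mu and ratio > nu:
            selected.append(k)
    return selected

k_b, bands = intensity_bands(i_s=0.12, w=0.1)
# With 0-based indexing, bands[k_b] is the band i = k_b + 1 centred on I_s.
centre = sum(bands[k_b]) / 2
selected = dominant_bands([50, 5, 30, 25], [10, 20, 40, 5], [80, 200, 40, 100],
                          mu=0, nu=0.5)
print(k_b, len(bands), centre, selected)
```

In this example only the first band passes both thresholds: the last band has more object pixels than local background pixels, but its ratio to the full-frame background is too low.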
3 Improving Target Matching

We have proposed in the previous section an efficient model to represent an object so that searching for it in a different frame will be at a lower computational cost. In order to achieve better matching capability, we establish a probabilistic framework that also accommodates well the object representation in terms of dominant feature elements.

3.1 Probability Measurement and Synthesis

We here aim to locate pixels in the current frame that are similar to any dominant elements in the previous frame, and build up the object from such pixel patches. Let $\{y_i\}$ denote all the pixels in the current frame, or in the part of the frame corresponding to the local window in the previous frame. Then the probability of the pixel $y_i$ belonging to the $k$th segment of the object in the previous frame is proportional to

$$p_k(y_i) = \varphi(|y_i - x_k|), \tag{4}$$

where $\varphi$ is a similarity function such as a triangle or Gaussian function. Since the object to be tracked is characterised by its dominant elements, we sum up the probabilities of a pixel $y_i$ belonging to each such element to reflect its overall probability of being within the object region in the current frame

$$p(y_i) = \sum_{j=1}^{3} \sum_{k=1}^{m} \alpha_k^{(j)}\, p_k^{(j)}(y_i), \tag{5}$$

where $j$ corresponds to the 3 RGB colours, $\alpha_k^{(j)} \ge 0$ are chosen to reflect the importance of the object elements derived from the selected bands, $m$ is the total number of the elements, and $\sum_{j,k} \alpha_k^{(j)} = 1$. In this work, we typically adopt the $\alpha_k^{(j)}$ evenly for the selected bands. This sum will be referred to as the probability map later on. To yield a better object probability representation, we reduce the effect of the pixels that may accumulate sufficient probabilities but are not really close to any of the dominant elements, by limiting the contribution to colours within the distance $\chi$

$$p_k(y_i) = \begin{cases} \varphi(|y_i - x_k|) & \text{if } |y_i - x_k| \le \chi, \\ 0 & \text{if } |y_i - x_k| > \chi. \end{cases} \tag{6}$$
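The truncated similarity of eq. (6) and the weighted sum of eq. (5) can be sketched for a single RGB pixel. This is an illustration under assumptions: intensities are normalised to [0, 1], the similarity function is taken to be a triangle 1 − |t| clipped at zero, the weights are equal, and the function names and sample values are hypothetical.

```python
def triangle(t):
    """Triangular similarity, assumed here to be 1 - |t| clipped at zero."""
    return max(0.0, 1.0 - abs(t))

def band_probability(y, x_k, chi):
    """Truncated similarity p_k(y) of eq. (6) for one colour component."""
    distance = abs(y - x_k)
    if distance > chi:
        return 0.0
    return triangle(distance)

def object_probability(pixel, elements, chi):
    """Eq. (5) with equal weights over all channels and dominant elements.

    pixel:    (r, g, b) intensities in [0, 1].
    elements: one list of dominant element centres x_k per colour channel.
    """
    count = sum(len(channel) for channel in elements)
    total = 0.0
    for value, channel in zip(pixel, elements):
        for x_k in channel:
            total += band_probability(value, x_k, chi)
    return total / count

p = object_probability((0.12, 0.50, 0.90),
                       [[0.10], [0.52, 0.80], [0.30]],
                       chi=0.05)
print(p)
```

Here only the red element and the first green element lie within χ of the pixel, so two of the four equally weighted terms contribute.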
Setting χ=w/2, for instance, would exclude the pixels which have a distance to the kth element larger than half of the bandwidth. Another technique is to set a single sufficient threshold on all the probabilities pk(y) corresponding to each element before they are summed up in (5). And still another is to allow a different threshold for each pk(y) and apply a final threshold to the total sum p(y). As the dominant bands are selected mainly based on the nearby background in our approach, it is possible for the probabilities obtained in area distant to the object to also exhibit certain similarity to some segments. To reduce this interference, we introduce a spatial filter to filter out the distant pixels. Hence we first find the rough centre of highest probability region (candidate region) nearest to the previous object
location; then a masking function such as a Gaussian is applied. This subsequently results in the following modified probability distribution for a better object extraction

$$p'(y_i) = g(d(y_i))\, p(y_i), \tag{7}$$

where $d(y_i)$ is the distance from $y_i$ to the centre, and $g(d)$ is a decreasing function.

3.2 Object Extraction and Refinement

We first observe that the band selection process discriminates against those bands that are similar to the background, though the presence of different RGB colours may compensate for it to a good extent. We may thus refine the extracted object $p'(y_i)$ with the help of its shape in the previous frame. More precisely, the former shape is first projected onto the current probability map, and is then adapted so as to maximise the total probability of the current shape. This is done by shifting the shape centre to the new position $(C_x, C_y)$ determined by

$$(C_x, C_y) = \arg\max_{a,b} \sum_{y \in T_{a,b}\Omega_o} p'(y), \tag{8}$$

where $\Omega_o$ is the projected object region, and $T_{a,b}$ is a spatial translation by $a$ and $b$ pixels horizontally and vertically, respectively. We may also render the probabilities with the neighbours, similar to the Weighted Region Consolidation method in [15]. Based on the probability map, a threshold $\tau$ is chosen to identify the candidate object. In fact, the shape refinement can be achieved by classifying contour pixels $y'(s)$ according to $y'(s) \in \Omega_o$ if and only if $p(y'(s)) > \tau$. In other words, we expand the contour when the probability at the contour point is larger than the given threshold $\tau$ and shrink it when the probability is less than $\tau$. We can thus exclude the nearby independent background from the candidate object. Of course, this method may lead to ragged boundaries as it does not ensure the smoothness of the contour. Alternatively, we can propagate the contour in the framework of an energy function. Let the mapping $E: [0,1] \to \mathbb{R}^2$ be a parameterized closed curve; the energy function can then be expressed as

$$\phi(s) = \int_0^1 g(|\nabla p(s)|)\, |\dot{E}(s)|\, ds, \tag{9}$$

where $\nabla p(s)$ is the probability change at the pixel on the contour, $g(\cdot)$ is a monotonically decreasing function, and $\dot{E}(s)$ is the vector derivative of the curve with respect to its parameter $s$. A similar approach can be found in [11]. For better algorithmic stability, we may set a maximally allowed pixel shift for the contour in a single frame step so as to prevent the pixels from moving an unsuitably large distance in the case of substantial background interference.
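The spatial filtering of eq. (7) and the shape shift of eq. (8) can be sketched on a small probability map. This is a brute-force illustration under assumptions: the mask g is taken to be a Gaussian with a hypothetical width `sigma`, the search over translations is exhaustive within a hypothetical `max_shift`, and the region is a set of (row, column) pixels.

```python
import math

def spatial_filter(prob_map, centre, sigma):
    """Eq. (7) with a Gaussian mask g(d) = exp(-d^2 / (2 sigma^2))."""
    cr, cc = centre
    return [[p * math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2 * sigma ** 2))
             for c, p in enumerate(row)] for r, row in enumerate(prob_map)]

def best_shift(prob_map, region, max_shift):
    """Translation (a, b) of `region` maximising the summed probability, as in eq. (8).

    a is the horizontal shift (columns), b the vertical shift (rows).
    Pixels shifted outside the map contribute nothing.
    """
    rows, cols = len(prob_map), len(prob_map[0])
    best, best_score = (0, 0), float("-inf")
    for b in range(-max_shift, max_shift + 1):
        for a in range(-max_shift, max_shift + 1):
            score = 0.0
            for r, c in region:
                rr, cc = r + b, c + a
                if 0 <= rr < rows and 0 <= cc < cols:
                    score += prob_map[rr][cc]
            if score > best_score:
                best, best_score = (a, b), score
    return best, best_score

# A 5 x 5 map with a 2 x 2 block of high probability at rows 2-3, columns 3-4.
prob_map = [[0.0] * 5 for _ in range(5)]
for r, c in [(2, 3), (2, 4), (3, 3), (3, 4)]:
    prob_map[r][c] = 1.0
region = {(0, 0), (0, 1), (1, 0), (1, 1)}   # previous object shape
shift, score = best_shift(prob_map, region, max_shift=4)
filtered = spatial_filter(prob_map, centre=(2, 3), sigma=2.0)
print(shift, score)
```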
4 Experiments

We now experiment on a video sequence of 240x320 pixels, as shown in figure 4.1 (a) and (b). The object at the beginning is extracted from the static background as in figure 4.1 (c), and is then on the moving background in the subsequent frames. On the frame within which the object is already known, we create a local background by including the object within the 115x135 area. The disparities dθ along the object border for different radii are calculated and illustrated in figure 4.2 (a) and (b). The curves in figure 4.2 (a), from top to bottom, illustrate the effect of the circle radius changing from 2 to 5 pixels. We see from figure 4.2 (a) that some bandwidths can achieve larger local differences of the object from the background than others. We also observe there that a suitable bandwidth would be 25 to 35. With Is=0.12, the bands for a given bandwidth w can be obtained via (1), similar to figure 4.2 (c) and (d), where the dot-dash line denotes the centre of the bands. The corresponding segmentation is then shown in figure 4.2 (e) and (f). Figures 4.2 (c) to (f) are done for w=30; the contours in figure 4.2 (e) and (f) are drawn for a better illustration, and a brighter colour there indicates a larger number of pixels in a band within the object. For each band identified in figure 4.2 (c) and (d), we calculate Qok and Qbk respectively. For the band selection, we set μ=0. If the full frame is considered for the matching, we also set ν=0.5; see section 2.2 for the definition of μ and ν.
Fig. 4.1. (a) First frame. (b) Second frame. (c) Extracted object in the first frame.

Fig. 4.2. (a) Total segmental disparity against bandwidth for radius= 2,3,4,5. (b) Radius=5. (c) Bands chosen for the red color. (d) Bands chosen for the green color. (e) Segmentation with bands in (c). (f) Segmentation with bands in (d).
Let χ=w/2 for (6), and define ϕ in (4) as the triangular function ϕ=1–|t| for |t|

0 sgn(a) = 1, if a < 0 sgn(a) = −1, and if a = 0 sgn(a) = 0. The estimate is increased by one at every frame when it is smaller than the sample, or decreased by one when it is greater than the sample. The absolute difference between It and Mt is the first differential estimation Δt. Unlike in [7], in [6] the absolute difference Δt is used to compute the time-variance of the pixels, which represents their motion activity measure and is employed to decide whether a pixel is moving or static. The estimate Vt has the dimension of a temporal standard deviation. It is computed as a Σ-Δ filter of the difference sequence. Finally, in order to select pixels that have a significant variation rate over their temporal activity, the Σ-Δ filter is applied N = 4 times. After the first estimation of the background and the moving pixels, further spatiotemporal processing is necessary to: eliminate the non-significant pixels
J.A. Boluda and F. Pardo

Table 1. The Σ-Δ background estimation

Initialization, for each pixel x: M0(x) = I0(x)
For each frame t, for each pixel x: Mt(x) = Mt−1(x) + sgn(It(x) − Mt−1(x))
For each frame t, for each pixel x: Δt(x) = |Mt(x) − It(x)|
Initialization, for each pixel x: V0(x) = Δ0(x)
For each frame t, for each pixel x such that Δt(x) ≠ 0: Vt(x) = Vt−1(x) + sgn(N × Δt(x) − Vt−1(x))
For each frame t, for each pixel x: if Δt(x) < Vt(x) then Dt(x) = 0 else Dt(x) = 1
from the detection (noise, false detection), enhance the segmentation of the moving objects, reduce the ghost effect that produces false detection where a moving object leaves after a long stay, and reduce the aperture effect that leads to poor detection when the projected motion is weak. A Markovian model for spatiotemporal processing was proposed in [8]. Instead, several morphological reconstruction processes are proposed in [6]. A simple common-edges hybrid reconstruction has been performed to enhance Δt, as shown in equation (1). The inputs are the original image It and the Σ-Δ difference image Δt.

$$\Delta_t = \mathrm{HRec}^{\Delta_t}_{\alpha}\big(\mathrm{Min}(\nabla(I_t), \nabla(\Delta_t))\big) \tag{1}$$
The gradient modules of Δt and It are computed by estimating the first Sobel gradient and then computing the Euclidean norm. M in(∇(It ), ∇(Δt )) acts as a logical conjunction, retaining only the edges that belong both to an object of Δt and It . In order to recover the object in Δt , we reconstruct the common edges within Δt and with α as elementary structuring element (a ball with radius=3). This has been done by performing a geodesic reconstruction of the common edges (marker image) with Δt as reference (or conditional image). The conditional dilation for the reconstruction has been limited to four iterations since no relevant changes appear with more iterations. Thus, after Δt has been reconstructed, Vt and Dt are computed.
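The per-pixel recursion of Table 1 can be sketched on a sequence of small frames. This is a minimal illustration that follows the table literally, under the assumption that the update of Vt is applied only where Δt(x) is non-zero; frames are flat lists of integer grey levels, and the function names and sample data are hypothetical. The morphological reconstruction of equation (1) is not included.

```python
def sgn(a):
    """Sign function: 1 for a > 0, -1 for a < 0, 0 for a = 0."""
    return (a > 0) - (a < 0)

def sigma_delta(frames, n=4):
    """Run the Sigma-Delta estimation over `frames`.

    Returns the final background M, time-variance V and detection label D.
    """
    m = list(frames[0])                                   # M_0 = I_0
    v = [abs(mi - ii) for mi, ii in zip(m, frames[0])]    # V_0 = Delta_0
    d = [0] * len(m)
    for frame in frames[1:]:
        for x, i_t in enumerate(frame):
            m[x] += sgn(i_t - m[x])                       # background estimate
            delta = abs(m[x] - i_t)                       # first differential estimate
            if delta != 0:
                v[x] += sgn(n * delta - v[x])             # time-variance estimate
            d[x] = 0 if delta < v[x] else 1               # detection label
    return m, v, d

# Pixel 0 flickers around 10 (noise); pixel 1 jumps from 10 to 50 (motion).
frames = [[10, 10], [12, 50], [10, 50], [12, 50], [10, 50]]
m, v, d = sigma_delta(frames)
print(m, v, d)
```

After a few frames the flickering pixel is labelled static, because its difference stays below the time-variance estimate, while the pixel that jumped remains labelled as moving.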
Speeding-Up Differential Motion Detection Algorithms
3.2 Change-Driven Data Flow Algorithm
The original algorithm has been modified using the change-driven data flow processing strategy. With this procedure, not all the pixels are processed. Only the pixels that undergo a change greater than or equal to the CST fire the related instructions. Fig. 1 shows the data flow and the intermediate stored images.
[Figure 1: data-flow diagram. Trigger: |It+1(x) − It(x)| > CST. Intermediate images in order: Mt; Gx(It), Gy(It); Δt; Gx(Δt), Gy(Δt); ∇(It); ∇(Δt); Min(∇(Δt), ∇(It)); Δt' = Rec of Min(∇(Δt), ∇(It)) with reference Δt and structuring element α; Vt; Dt.]

Fig. 1. Modified algorithm data flow and intermediate images
An initial image It is stored in the computer as the current image. Any absolute difference of the current image pixel It+1 (x) with the stored image fires the intermediate images computation. Moreover, these updates must be done taking into account that the change of an input pixel may modify several output variables. For example, if a pixel is modified (with a value greater or equal to the CST) then its contribution to 6 pixels for the Sobel gradient image Gx and also for 6 pixels for image Gy must be updated. It may appear that a single pixel modification can produce an uncontrolled explosion of operations, but these operations are a simple addition (and sometimes a multiplication by two) per pixel and sometimes these can be reutilized. In the original algorithm for each pixel the Sobel gradient images Gx and Gy are computed in any case; with six additions per pixel with the corresponding multiplications.
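The incremental update of the Sobel image described above can be sketched for Gx. This is an illustration under assumptions: the horizontal Sobel kernel is applied as a correlation on interior pixels only, the stored image is overwritten only at pixels that fire, and the function names and sample frames are hypothetical. Each fired pixel touches at most six entries of Gx, since the centre column of the kernel is zero.

```python
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

def sobel_x(image):
    """Full G_x on interior pixels (border entries are left at zero)."""
    rows, cols = len(image), len(image[0])
    gx = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx[r][c] = sum(SOBEL_X[i][j] * image[r + i - 1][c + j - 1]
                           for i in range(3) for j in range(3))
    return gx

def change_driven_update(stored, gx, new_frame, cst):
    """Update `stored` and `gx` in place for pixels whose change reaches cst.

    Returns the number of pixels that fired.
    """
    rows, cols = len(stored), len(stored[0])
    fired = 0
    for pr in range(rows):
        for pc in range(cols):
            diff = new_frame[pr][pc] - stored[pr][pc]
            if abs(diff) < cst:
                continue                      # change below threshold: skip
            fired += 1
            stored[pr][pc] = new_frame[pr][pc]
            # Propagate the change to the neighbours with non-zero weight.
            for r in range(max(1, pr - 1), min(rows - 1, pr + 2)):
                for c in (pc - 1, pc + 1):
                    if 1 <= c < cols - 1:
                        gx[r][c] += SOBEL_X[pr - r + 1][pc - c + 1] * diff
    return fired

frame0 = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
frame1 = [[6, 2, 3, 4], [5, 9, 7, 8], [9, 10, 11, 13], [13, 14, 15, 16]]
stored = [row[:] for row in frame0]
gx = sobel_x(stored)
fired = change_driven_update(stored, gx, frame1, cst=3)
print(fired, gx == sobel_x(stored))
```

The final comparison checks that the incrementally maintained Gx matches a full recomputation on the stored image, which is the invariant the change-driven version relies on.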
4 Experimental Results
The original differential motion detection algorithm and the change-driven data flow version have been implemented. Both algorithms have been tested using several traffic sequences downloaded from the public ftp site of Professor H.-H. Nagel, http://i21www.ira.uka.de/image sequences/, at the University of Karlsruhe. Fig. 2(a) shows a sequence frame of 740 × 560 pixels. There are stationary cars in this frame that correctly appear as motionless, as part of the background. Fig. 2(b) shows the Mt or background image. At this point only three cars in the top left of the image, a cyclist below the cars, and a car partially occluded by a tree at the centre appear as moving objects, as the original version shows in Fig. 2(c). The change-driven data flow algorithm results are shown in Fig. 2(d).
Fig. 2. (a) Original sequence, (b) Background Mt image, (c) Dt in the original algorithm, (d) Dt in the modified algorithm with CST=3
There are no evident differences in terms of detected moving points between the original algorithm and the change-driven data flow version with CST = 3, as Fig. 2(c) and (d) show. The results are almost the same but the execution time decreases significantly. Fig. 3 shows the execution-time speed-up of the modified algorithm, for different CST values, over the original implementation. It must be noted that the change-driven data flow implementation with CST = 1 gives the same image results as the original but has a lower computational cost. Likewise, as CST increases, the speed-up improves, and with CST ≥ 1 the change-driven data flow implementation improves on the original algorithm's execution times (speed-up greater than 1).
[Figure 3: plot of the change-driven data flow speed-up (vertical axis, 1 to 2.4) against the CST value (horizontal axis, 1 to 6).]

Fig. 3. Measured speed-up obtained with the change-driven data flow algorithm
The execution-time speed-up improves because only pixels whose change is greater than or equal to CST are processed. In this way, the additional operations required by the data flow implementation are compensated by the decreasing number of pixels processed when CST ≥ 1. The immediate question is whether there is a limit to CST growth and therefore to the speed-up over the original implementation. The answer is that if CST is too high, very few points are processed, so there is no systematic background update and, moreover, moving points with a gray-value difference lower than CST are not detected. This produces two effects that limit the change-driven data flow strategy: as CST increases, fewer correct moving points are detected and more false positives appear. The effect of false positives due to the non-constant background computation is more relevant than the decrease in correct positives that contribute to the computation of the moving object centre. Values of CST ≥ 6 make this approach unfeasible, because further pepper-noise filtering is then required for Dt and, consequently, the change-driven data flow speed-up disappears. Therefore, practical values of CST in the tested sequences are 1 ≤ CST ≤ 5. The number of processed points is sequence and frame dependent but decreases as the CST value increases. Practical CST values reduce the number of pixels to be processed by a factor of five for CST ≥ 5, maintaining the same accuracy for the detection of moving objects while giving a speed-up of almost two for this CST value.
5 Conclusion
A procedure for speeding up differential motion detection algorithms has been presented. The change-driven data flow strategy is based on processing only the pixels associated with changes above a threshold. The implementation of this methodology requires extra storage to keep track of the intermediate results of preceding computing stages. A recent differential motion detection algorithm has been chosen to test the change-driven data flow policy. The change-driven data flow implementation shows a significant speed-up for a CST range that gives image results similar to those of the original implementation. Thus, we have demonstrated the feasibility of using the change-driven data flow strategy in differential motion detection algorithms.
Acknowledgments

This work has been supported by the project TEC2006-08130/MIC of the Spanish Ministerio de Educación y Ciencia.
References
1. Sandini, G., Questa, P., Scheffer, D., Diericks, B., Mannucci, A.: A retina-like CMOS sensor and its applications. Sensor Array and Multichannel Signal Processing Workshop, 514–519 (2000)
2. Boluda, J.A., Domingo, J.: On the advantages of combining differential algorithms and log-polar vision for detection of self-motion from a mobile robot. Robotics and Autonomous Systems 37(4), 283–296 (2001)
3. Özalevli, E., Higgins, C.M.: Reconfigurable Biologically Inspired Visual Motion Systems Using Modular Neuromorphic VLSI Chips. IEEE Transactions on Circuits and Systems 52(1), 79–92 (2005)
4. Silc, J., Robic, B., Ungerer, T.: Processor architecture: from dataflow to superscalar and beyond. Springer, Heidelberg (1999)
5. Pardo, F., Boluda, J.A., Sosa, J.C.: Circle detection and tracking speed-up based on change-driven image processing. In: Proc. ICGST International Conference on Graphics, Vision and Image Processing GVIP-05 (2005)
6. Manzanera, A., Richefeu, J.C.: A new motion detection algorithm based on Σ-Δ background estimation. Pattern Recognition Letters 28, 320–328 (2007)
7. McFarlane, N., Schofield, C.: Segmentation and tracking of piglets in images. Machine Vision and Applications 8, 187–193 (1995)
8. Manzanera, A., Richefeu, J.C.: A robust and computationally efficient motion detection algorithm based on Σ-Δ background estimation. In: Proc. ICGVIP'04, pp. 46–51 (2004)
Foreground and Shadow Detection Based on Conditional Random Field Yang Wang National ICT Australia Kensington, NSW 2032, Australia [email protected]
Abstract. This paper presents a conditional random field (CRF) approach to integrate spatial and temporal constraints for moving object detection and cast shadow removal in image sequences. Interactions among both detection (foreground/background/shadow) labels and observed data are unified by a probabilistic framework based on the conditional random field, where the interaction strength can be adaptively adjusted in terms of data similarity of neighboring sites. Experimental results show that the proposed approach effectively fuses contextual dependencies in video sequences and significantly improves the accuracy of object detection. Keywords: Conditional random field, contextual constraint, object detection, shadow removal.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 85–92, 2007. © Springer-Verlag Berlin Heidelberg 2007
1 Introduction
Moving object detection in image sequences is very important to application areas such as human-computer interaction, content-based video compression, and automated visual surveillance. In recent years, background modeling has become a commonly used technique to identify moving objects with a fixed camera. However, accurate detection can be difficult due to potential variability such as shadows cast by moving objects, nonstationary background processes (e.g. illumination variations), and camouflage (i.e. similarity between appearances of moving objects and the background) [7]. Comprehensive modeling of spatiotemporal information within the video sequence is a key issue to robustly segment moving objects in the scene.
Temporal information is fundamental to handle nonstationary background processes. Linear processes or probability distributions can be used to characterize background changes from recent observations [1] [11]. In [10], the recent history of pixel intensity is modeled by a mixture of Gaussians, and the Gaussian mixture is adaptively updated for each site to deal with dynamics in background processes. On the other hand, spatial information is important to understand the structure of the scene. Gradient (or edge) features help improve the reliability of object and shadow detection [2] [12]. In [9], spatial co-occurrence of image variations at neighboring blocks is employed to enhance the detection sensitivity.
Moreover, contextual constraint is an essential element to effectively fuse spatial and temporal information throughout the detection process. For instance, contiguous sites are likely to belong to the same object in the scene, and one site tends to remain
in the same object at consecutive time instants. Markov random fields (MRF) and hidden Markov models (HMM) have been extensively employed to formulate contextual constraints [6] [8]. However, conditional independence of observations is usually assumed in previous work, which is too restrictive for visual scene modeling. Compared to generative models including MRF and HMM, the conditional random field (CRF) relaxes the strong independence assumption and captures contextual dependencies among observations [4]. Originally proposed for text labeling, CRF has recently been applied to image processing [3].
In this paper, a CRF-based approach is proposed to fuse contextual dependencies in image sequences for moving object detection. Spatial and temporal constraints during the detection process are unified in a probabilistic model that permits neighborhood interactions in both labels and observations. The method discriminates moving cast shadows and handles nonstationary background processes. Experimental results show that the proposed approach effectively integrates contextual constraints in video sequences and significantly improves the detection accuracy.
The rest of the paper is arranged as follows: Section 2 presents the CRF-based model to formulate contextual dependencies in video sequences. Section 3 proposes the object detection method and describes the implementation details. Section 4 discusses the experimental results. Finally, Section 5 concludes the paper.
2 Model
Given an image sequence, the observation and label of a point $x$ at time instant $t$ are denoted by $g_x^t$ and $l_x^t$ respectively. The observation $g_x^t$ consists of intensity information at the site $x$. Label $l_x^t$ assigns the point $x$ to one of $K$ classes ($K$ equals 3 in this work). The label $l_x^t = e_k$ if point $x$ belongs to the $k$th class, where $e_k$ is a $K$-dimensional unit vector with its $k$th component equal to one. Here $t \in \mathbb{N}$, $x \in X$, and $X$ is the spatial domain of the video scene. The entire label field and observed image over the scene at time $t$ are compactly expressed as $L^t$ and $G^t$ respectively.
Given the observations, there are two ways to estimate the labels. In a generative framework, both the prior model of the label field and the likelihood model of the observed data are formulated to estimate the joint distribution over observations and labels. Alternatively, in a discriminative framework the posterior distribution of the label field is directly formulated. Based on conditional random field, in this work spatial and temporal constraints within the video scene are integrated through a discriminative framework of contextual dependencies among neighboring sites.
The definition of conditional random field (CRF) is given by Lafferty et al. in [4]. For random variables $S$ and observed data $D$ over the video scene, $(S, D)$ is a conditional random field if, when conditioned on $D$, the random field $S$ obeys the Markov property
$$p(s_x \mid D, s_y, y \neq x) = p(s_x \mid D, s_y, y \in N_x),$$
where the set $N_x$ denotes the neighboring sites of point $x$. In this work, the probability distribution of the labels is modeled by a conditional random field to formulate contextual dependencies. At each time $t$, label field $L^t$ obeys the Markov property when image $G^t$ is given. Hence $L^t$ is a random field globally conditioned on the observed data. Using the
Hammersley-Clifford theorem and considering only up to pairwise clique potentials, the posterior probability of the label field is given by a Gibbs distribution with the following form:
$$p(L^t \mid G^t) \propto \exp\Big\{-\sum_{x \in X}\Big[V_x(l_x^t \mid G^t) + \sum_{y \in N_x} V_{x,y}(l_x^t, l_y^t \mid G^t)\Big]\Big\}. \qquad (1)$$
The one-pixel potential $V_x(l_x^t \mid G^t)$ reflects the local constraint for a single site. Meanwhile the two-pixel potential $V_{x,y}(l_x^t, l_y^t \mid G^t)$ models the contextual information (or pairwise constraint) between neighboring sites. The strength of the constraints depends on the observed data. The potential functions are further expressed as
$$V_x(l_x^t \mid G^t) = V_x^{l}(l_x^t) + V_x^{l|g}(l_x^t \mid G^t), \qquad (2a)$$
$$V_{x,y}(l_x^t, l_y^t \mid G^t) = V_{x,y}^{l}(l_x^t, l_y^t) + V_{x,y}^{l|g}(l_x^t, l_y^t \mid G^t). \qquad (2b)$$
In the one-pixel potential, the first term reflects the prior knowledge for different label classes, and the second term reflects the observed information for individual sites. In the two-pixel potential, the first term imposes the connectivity constraint independent of the observations, and the second term imposes the neighborhood interaction dependent on the observed data. Thus, during the detection process both the data-independent and data-dependent contextual dependencies are unified in a probabilistic discriminative framework based on the CRF.
The proposed approach degenerates into an MRF approach when the data-dependent pairwise potential is set to zero, which is equivalent to the conditional independence assumption of the observed data (i.e., ignoring interactions among observations when labels are given) used in previous generative approaches [6] [8]. The approach in [12] also ignores the data-dependent pairwise potential although its model is based on CRF, so from this point of view it should be categorized as an MRF approach. Therefore, theoretically the proposed approach models contextual constraints in video sequences more comprehensively than previous approaches. At each time $t$, the MAP (maximum a posteriori) estimate of the label field is computed as $\hat{L}^t = \arg\max_{L^t} p(L^t \mid G^t)$ given the potential functions.
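The relation between the Gibbs energy in (1) and the MAP estimate can be made concrete on a toy problem. The following is a minimal sketch, assuming a one-dimensional field of three sites and an exhaustive search, which is only feasible for very small fields; the helper names `energy` and `map_estimate`, the toy potentials, and the weight 0.5 are illustrative assumptions, not values from the paper.

```python
from itertools import product

LABELS = (0, 1, 2)  # 0: background, 1: cast shadow, 2: moving object


def energy(labels, unary, pairwise, neighbors):
    """Exponent of (1): one-pixel potentials plus two-pixel potentials.

    Every ordered neighbor pair (x, y) contributes, as in the double sum.
    """
    total = 0.0
    for x, lx in enumerate(labels):
        total += unary[x][lx]
        for y in neighbors[x]:
            total += pairwise(x, y, lx, labels[y])
    return total


def map_estimate(unary, pairwise, neighbors):
    """Brute-force MAP: the labeling with minimum energy (toy sizes only)."""
    return min(
        product(LABELS, repeat=len(unary)),
        key=lambda labels: energy(labels, unary, pairwise, neighbors),
    )


# Hypothetical toy data: the middle site slightly prefers "shadow" on its own.
unary = [[0.1, 2.0, 2.0], [1.0, 0.9, 2.0], [0.1, 2.0, 2.0]]
neighbors = [[1], [0, 2], [1]]


def potts(x, y, lx, ly):
    # Data-independent smoothness term: reward equal neighboring labels.
    return -0.5 if lx == ly else 0.0


def no_interaction(x, y, lx, ly):
    return 0.0


print(map_estimate(unary, no_interaction, neighbors))  # site-wise decision
print(map_estimate(unary, potts, neighbors))           # with pairwise constraint
```

Without the pairwise term each site is labeled independently, while the smoothness term pulls the ambiguous middle site towards the label of its neighbors.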
3 Method
Given the video sequence, each pixel in the scene is to be classified as moving object, background, or shadow. For a site $x$ at time $t$, the label $l_x^t$ equals $e_1$ for a background pixel, $e_2$ for cast shadow, and $e_3$ for moving object. Here static shadows are considered to be part of the background. In order to segment foreground objects, the system should first model the background and shadow information.
3.1 Observation Model
For each point $x$, the pixel intensity $g_x^t$ has three (R, G, and B) components for color images or one value for grayscale images. Grayscale images are considered in this section, while the formulation for color images can be derived similarly. Assume that Gaussian noise corrupts each pixel in the scene, so that the observation model for a background point at time $t$ becomes
$$g_x^t = b_x^t + n_x^t, \quad \text{if } l_x^t = e_1, \qquad (3)$$
where $b_x^t$ is the intensity mean for a pixel $x$ within the background, and $n_x^t$ is independent zero-mean Gaussian noise with variance $(\sigma_x^t)^2$. Intensity means and variances in the background can be estimated from previous images. For stationary background scenes, the intensity mean and variance of each background point can be estimated from a sequence of background images recorded at the beginning. For nonstationary background scenes, the background updating process is based on the idea of Stauffer and Grimson [10]. The recent history of each pixel is modeled by a mixture of Gaussians. As parameters of the mixture model change, the Gaussian distribution that has the highest ratio of weight over variance is chosen as the background model.
Given the intensity of a background point, a linear model is used to describe the change of intensity for the same point when shadowed in the video frame:
$$g_x^t = r_x^t b_x^t + n_x^t, \quad \text{if } l_x^t = e_2. \qquad (4)$$
Considering the contiguity of the video image, the coefficient can be estimated from its neighborhood as
$$r_x^t \approx \sum_{y \in N_x} g_y^t \Big/ \sum_{y \in N_x} b_y^t$$
if the point is under shadow.
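The neighborhood estimate of the shadow coefficient can be sketched as a ratio of two sums. This is a minimal illustration for grayscale values; the function name `shadow_ratio`, the fallback value for an all-black background, and the toy intensities are assumptions made here for the example.

```python
def shadow_ratio(g, b, neighborhood):
    """Estimate r as (sum of observed intensities) / (sum of background means).

    g and b are sequences indexed by site; neighborhood lists the site indices
    in N_x. Returning 1.0 for a zero denominator is an assumption of this
    sketch, not something specified in the text.
    """
    num = sum(g[y] for y in neighborhood)
    den = sum(b[y] for y in neighborhood)
    return num / den if den > 0 else 1.0


# Hypothetical values: a shadow darkens the background to about 60%.
observed = [60, 62, 58, 61]
background = [100, 104, 96, 100]
r = shadow_ratio(observed, background, [0, 1, 2, 3])
print(r)
```

Summing over the neighborhood before dividing makes the estimate less sensitive to noise at a single pixel than a per-pixel ratio would be.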
In order to achieve maximum application independence, it is assumed that the model of moving objects is unknown in this work. In the absence of any prior knowledge about the foreground, a uniform distribution is used for the foreground pixel intensity. From the above discussion, the intensity likelihood for each point $x$ at time $t$ becomes
$$p(g_x^t \mid l_x^t) = \begin{cases} N(g_x^t;\, b_x^t,\, (\sigma_x^t)^2), & \text{if } l_x^t = e_1 \\ N(g_x^t;\, r_x^t b_x^t,\, (\sigma_x^t)^2), & \text{if } l_x^t = e_2 \\ c, & \text{if } l_x^t = e_3 \end{cases}, \qquad (5)$$
where $N(z; \mu, \sigma^2)$ is a Gaussian distribution with argument $z$, mean $\mu$, and variance $\sigma^2$, and $c$ is a small positive constant ($c = 1/256$ for grayscale images).
3.2 Spatiotemporal Constraint
The one-pixel potential function is set as $V_x(l_x^t \mid G^t) = -\ln p(l_x^t \mid g_x^t)$ so that the posterior distribution $p(L^t \mid G^t)$ becomes the product of local posteriors $\prod_{x \in X} p(l_x^t \mid g_x^t)$ if the two-pixel potential is ignored. Since $p(l_x^t \mid g_x^t) \propto p(l_x^t, g_x^t) = p(l_x^t)\, p(g_x^t \mid l_x^t)$, the two terms in the one-pixel potential become $V_x^{l}(l_x^t) = -\ln p(l_x^t)$ and $V_x^{l|g}(l_x^t \mid G^t) = -\ln p(g_x^t \mid l_x^t)$. The first term reflects the prior information of the label without the
current image. Given the label estimation at the previous time instant, the prediction of the label for the current time instant can be expressed by the following data-independent one-pixel potential:
$$V_x^{l}(l_x^t) = -\ln p(l_x^t) = \begin{cases} \alpha_1, & \text{if } l_x^t = \hat{l}_x^{t-1} \\ \alpha_2, & \text{if } l_x^t \neq \hat{l}_x^{t-1} \end{cases}, \qquad (6)$$
where $\alpha_1$ is smaller than $\alpha_2$ so that the site is likely to remain in the same class as at the previous time instant. The potential function imposes the temporal continuity constraint to encourage a point to have the same label at successive video frames. The data-dependent one-pixel potential $V_x^{l|g}(l_x^t \mid G^t) = -\ln p(g_x^t \mid l_x^t)$ reflects the constraint from local observation. The potential function can be computed using the local intensity likelihood derived in the previous section.
Since foreground objects usually consist of contiguous regions, the connectivity constraint is imposed by the following data-independent pairwise potential:
$$V_{x,y}^{l}(l_x^t, l_y^t) = -\beta_1\, l_x^t \cdot l_y^t, \qquad (7)$$
where $\cdot$ denotes the inner product, and the positive $\beta_1$ weights the importance of data-independent spatial connectivity. The potential is negative only when the two neighboring sites belong to the same class. Thus two neighboring pixels are more likely to belong to the same class than to different classes. The term imposes the smoothness constraint independent of the observed data.
The data-dependent pairwise potential can be expressed as $V_{x,y}^{l|g}(l_x^t, l_y^t \mid G^t) \propto \|g_x^t - g_y^t\|\, l_x^t \cdot l_y^t$, so that the potential function encourages data similarity when neighboring sites belong to the same class. However, under heavy noise neighboring sites may become quite different even though they belong to the same class. To prevent this problem when detecting moving objects in noisy environments, the data-dependent pairwise potential is replaced by the following normalized term:
$$V_{x,y}^{l|g}(l_x^t, l_y^t \mid G^t) = \frac{\beta_2\, \|g_x^t - g_y^t\|^2\; l_x^t \cdot l_y^t}{\dfrac{1}{|N_x|}\sum_{i \in N_x} \|g_x^t - g_i^t\|^2 + \dfrac{1}{|N_y|}\sum_{i \in N_y} \|g_y^t - g_i^t\|^2}, \qquad (8)$$
where $\|\cdot\|$ is the Euclidean distance, and the positive $\beta_2$ weights the importance of data-dependent spatial interaction. Naturally, the potential function imposes an adaptive contextual constraint that will adjust the interaction strength according to the similarity between neighboring observations. The term models the neighborhood interaction dependent on the observed data.
3.3 Parameter and Optimization
The potential functions for the posterior distribution of the label field at time $t$ are expressed as follows:
$$V_x(l_x^t \mid G^t) = -\alpha\, l_x^t \cdot \hat{l}_x^{t-1} - \ln p(g_x^t \mid l_x^t), \qquad (9a)$$
$$V_{x,y}(l_x^t, l_y^t \mid G^t) = -\beta_1\, l_x^t \cdot l_y^t + \frac{\beta_2\, \|g_x^t - g_y^t\|^2\; l_x^t \cdot l_y^t}{\dfrac{1}{|N_x|}\sum_{i \in N_x} \|g_x^t - g_i^t\|^2 + \dfrac{1}{|N_y|}\sum_{i \in N_y} \|g_y^t - g_i^t\|^2}, \qquad (9b)$$
where the positive $\alpha = \alpha_2 - \alpha_1$ weights the importance of temporal continuity. To balance the influence of the pairwise potential terms, it is assumed that
$$\sum_{x \in X} |N_x|\, \beta_1 = \sum_{x \in X} |N_x|\, \beta_2 = |X|\, \beta. \qquad (10)$$
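The likelihood (5) and the potentials (9a) and (9b) can be sketched for scalar (grayscale) intensities as follows. This is an illustrative sketch rather than the author's implementation: the function names, the integer label encoding, the numerical floor inside the logarithm, and the handling of a zero denominator are all assumptions introduced here.

```python
import math

BACKGROUND, SHADOW, OBJECT = 0, 1, 2  # integer stand-ins for e1, e2, e3


def gaussian(z, mean, var):
    """Density of N(z; mean, var)."""
    return math.exp(-(z - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def likelihood(g, label, b, r, var, c=1.0 / 256):
    """Intensity likelihood (5) for a grayscale observation g."""
    if label == BACKGROUND:
        return gaussian(g, b, var)
    if label == SHADOW:
        return gaussian(g, r * b, var)
    return c  # uniform density for the unknown foreground model


def one_pixel_potential(g, label, prev_label, b, r, var, alpha):
    """V_x of (9a): temporal continuity term plus negative log-likelihood."""
    same = 1.0 if label == prev_label else 0.0  # inner product of unit vectors
    p = max(likelihood(g, label, b, r, var), 1e-300)  # floor: assumption, avoids log(0)
    return -alpha * same - math.log(p)


def two_pixel_potential(lx, ly, gx, gy, nbr_x, nbr_y, beta1, beta2):
    """V_{x,y} of (9b); nbr_x and nbr_y hold the intensities in N_x and N_y."""
    if lx != ly:
        return 0.0  # inner product of different unit vectors is zero
    mean_x = sum((gx - gi) ** 2 for gi in nbr_x) / len(nbr_x)
    mean_y = sum((gy - gi) ** 2 for gi in nbr_y) / len(nbr_y)
    denom = mean_x + mean_y
    # Zero denominator (perfectly flat neighborhoods) is treated as no data term.
    data_term = beta2 * (gx - gy) ** 2 / denom if denom > 0 else 0.0
    return -beta1 + data_term


# A pixel equal to its background mean is cheaper to label as background.
v_bg = one_pixel_potential(100, BACKGROUND, BACKGROUND, 100, 0.6, 25.0, 1.0)
v_obj = one_pixel_potential(100, OBJECT, BACKGROUND, 100, 0.6, 25.0, 1.0)
print(v_bg, v_obj)
```

For equal labels, the pairwise value starts at $-\beta_1$ and grows with the normalized squared intensity difference, so dissimilar neighbors are encouraged less strongly to share a label.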
The parameters $\alpha$ and $\beta$ are manually determined to reflect the influence of the temporal constraint and the spatial constraint respectively. At each time instant, the posterior distribution of the label field is optimized by mean field approximation [5]. The mean field algorithm suggests that when estimating the label mean at a single site, the influence from neighboring sites can be approximated by that of their means. Thus the label field at each time instant can be computed by an iterative procedure. Initially, uniform distributions are assumed for all the labels in $L^0$.
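The iterative procedure can be illustrated with a generic mean-field update for a pairwise model. This is a sketch under stated assumptions, not the author's C program: sites are visited sequentially, the pairwise function is assumed symmetric with any constant factor absorbed into its weight, and the toy potentials are hypothetical.

```python
import math


def mean_field(unary, neighbors, pairwise, iterations=10):
    """Return per-site label distributions q[x][l] after mean-field updates.

    The influence of a neighbor y on site x is the pairwise potential
    averaged over the current distribution q[y] (its "mean").
    """
    n, k = len(unary), len(unary[0])
    q = [[1.0 / k] * k for _ in range(n)]  # uniform initialization
    for _ in range(iterations):
        for x in range(n):
            scores = []
            for l in range(k):
                e = unary[x][l]
                for y in neighbors[x]:
                    e += sum(q[y][m] * pairwise(x, y, l, m) for m in range(k))
                scores.append(-e)
            top = max(scores)  # subtract the maximum for numerical stability
            weights = [math.exp(s - top) for s in scores]
            z = sum(weights)
            q[x] = [w / z for w in weights]
    return q


# Hypothetical toy chain of three sites; the middle one is ambiguous.
toy_unary = [[0.1, 2.0, 2.0], [1.0, 0.9, 2.0], [0.1, 2.0, 2.0]]
toy_neighbors = [[1], [0, 2], [1]]


def smooth(x, y, lx, ly):
    return -1.0 if lx == ly else 0.0


q = mean_field(toy_unary, toy_neighbors, smooth)
estimate = [max(range(3), key=qx.__getitem__) for qx in q]
print(estimate)
```

The final labels are read off as the most probable class at each site; in the toy chain the confident neighbors pull the ambiguous middle site towards their label.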
4 Results
The proposed approach has been tested on video sequences captured under different environments and compared with other detection methods. The 24-pixel neighborhood is utilized in the algorithm. Our C program can process about three 160×120 frames per second on a 2.8 GHz Pentium 4 PC.
Three moving object detection algorithms are studied in our experiments: the Gaussian mixture (GM) approach [10], the proposed approach without the data-dependent pairwise potential, and the proposed approach. The same initialization and neighborhood are used in these algorithms (when applicable). GM is a pixel-based method without contextual constraints among neighboring sites, while the last two can be viewed as methods that impose contextual constraints based on Markov random field (MRF) and conditional random field (CRF) respectively. Figure 1 shows the detection results by the Gaussian mixture approach and the proposed method for the "laboratory" sequence with background illumination change.
Fig. 1. (a) One frame of a sequence. (b) Detection result by the GM approach. (c) Detection result by the proposed approach.
Fig. 2. (a) One frame of a sequence. (b) Detection result by the MRF approach. (c) Detection result by the proposed approach.
The gray regions in Figure 1c represent moving cast shadows. Compared to Figure 1b, the proposed approach greatly improves the accuracy of foreground detection by imposing contextual constraints. It can be seen that the noisy pixels in the scene are correctly classified by the proposed method. Meanwhile, moving cast shadows are accurately removed from the foreground in Figure 1c. Figure 2 shows the detection results by the Markov random field approach and the proposed method for the "aerobic" sequence. The MRF approach produces smooth detection results. However, it may sometimes smooth in the wrong way since the interactions among observations are neglected. The camouflage regions in the human body are erroneously detected in Figure 2b, while this problem is overcome by the proposed approach with data-dependent neighborhood interactions.

Table 1. Error rates of detection results

Method   GM     MRF    CRF
Error    6.4%   4.3%   3.0%
The detection results are also evaluated quantitatively by comparison with manually labeled ground-truth images. Table 1 shows the average error rate (proportion of misclassified points in the entire image) for ten representative frames of the sequences shown in Figures 1 and 2 (five frames of each sequence). The MRF approach outperforms the GM approach by introducing contextual constraints. Compared to the MRF approach, the CRF approach takes advantage of data-dependent neighborhood interactions. In our experiments, the CRF approach reduces the error rate of the other two approaches (GM and MRF) by 53% and 30% on average, respectively. The substantial increase in accuracy indicates that the proposed approach effectively fuses spatial and temporal dependencies for object detection in video sequences.
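The quoted reductions follow directly from the error rates in Table 1 as relative differences. A short check (the helper name `relative_reduction` is introduced here for illustration):

```python
# Error rates in percent, taken from Table 1.
error_rate = {"GM": 6.4, "MRF": 4.3, "CRF": 3.0}


def relative_reduction(baseline, improved):
    """Fraction by which the error rate drops relative to the baseline."""
    return (baseline - improved) / baseline


reductions = {
    name: round(100 * relative_reduction(error_rate[name], error_rate["CRF"]))
    for name in ("GM", "MRF")
}
print(reductions)
```

That is, (6.4 − 3.0)/6.4 ≈ 0.53 and (4.3 − 3.0)/4.3 ≈ 0.30.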
5 Conclusion There are two main contributions in this paper. First, based on conditional random field, a probabilistic discriminative framework has been proposed to fuse contextual
dependencies in image sequences. Second, under the proposed framework, spatial and temporal constraints have been formulated for moving object and shadow detection. Experimental results show that the method significantly improves the performance of object detection and shadow removal in video sequences. Future work will aim to determine the algorithm parameters automatically through learning.
References
1. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: Real-time surveillance of people and their activities. IEEE Trans. Patt. Anal. Mach. Intel. 22, 809–830 (2000)
2. Jabri, S., Duric, Z., Wechsler, H., Rosenfeld, A.: Detection and location of people in video images using adaptive fusion of color and edge information. In: Proc. Int'l. Conf. Pattern Recognition, vol. 4, pp. 627–630 (2000)
3. Kumar, S., Hebert, M.: Discriminative fields for modeling spatial dependencies in natural images. Advances in Neural Information Processing Systems, 1351–1358 (2004)
4. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proc. Int'l. Conf. Machine Learning, pp. 282–289 (2001)
5. Li, S.Z.: Markov Random Field Modeling in Image Analysis. Springer, Heidelberg (2001)
6. Paragios, N., Ramesh, V.: A MRF-based approach for real-time subway monitoring. Proc. IEEE Conf. Computer Vision and Pattern Recognition 1, 1034–1040 (2001)
7. Prati, A., Mikic, I., Trivedi, M.M., Cucchiara, R.: Detecting moving shadows: Algorithms and evaluation. IEEE Trans. Patt. Anal. Mach. Intel. 25, 918–923 (2003)
8. Rittscher, J., Kato, J., Joga, S., Blake, A.: A probabilistic background model for tracking. Proc. European Conf. Computer Vision 2, 336–350 (2000)
9. Seki, M., Wada, T., Fujiwara, H., Sumi, K.: Background subtraction based on cooccurrence of image variations. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2, 65–72 (2003)
10. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Patt. Anal. Mach. Intel. 22, 747–757 (2000)
11. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and practice of background maintenance. Proc. Int'l. Conf. Computer Vision 1, 255–261 (1999)
12. Wang, Y., Loe, K.-F., Wu, J.-K.: A dynamic conditional random field model for foreground and shadow segmentation. IEEE Trans. Patt. Anal. Mach. Intel. 28, 279–289 (2006)
n-Grams of Action Primitives for Recognizing Human Behavior Christian Thurau and Václav Hlaváč Czech Technical University, Faculty of Electrical Engineering Department for Cybernetics, Center for Machine Perception 121 35 Prague 2, Karlovo náměstí, Czech Republic [email protected], [email protected]
Abstract. This paper presents a novel approach for behavior recognition from video data. A biologically inspired action representation is derived by applying a clustering algorithm to sequences of motion images. To obey the temporal context, we express behaviors as sequences of n-grams of basic actions. Novel video sequences are classified by comparing histograms of action n-grams to stored histograms of known behaviors. Experimental validation shows a high accuracy in behavior recognition.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 93–100, 2007. © Springer-Verlag Berlin Heidelberg 2007
1 Motivation and Problem Formulation
Human behavior analysis is an important component of surveillance, video annotation, or sports video interpretation. Basically, three main problems have to be addressed: (a) how to track humans in videos, (b) how to adequately represent actions, and (c) how to classify a sequence of action representations into behavioral classes. In this work, we address the problem of action representation and classification using action primitives. We validate the approach in a human tracking scenario where we assume, for simplicity, a static camera and a well-working background subtraction algorithm, so that silhouette images of moving persons can be extracted. Based on the extracted silhouette images, we classify action sequences into a set of predefined behavioral prototypes, e.g. 'walking' or 'running'.
2 Related Work
The activity representation proposed in this paper is inspired by the concept of movement primitives in biological organisms. Recent results in behavioral biology indicate that instead of directly controlling motor activation commands, movements are the outcome of a combination of movement primitives [1]. Furthermore, recent experiments in cognitive psychology explain the representation of motor skills in human long-term memory by sequences of basic-action concepts [2]. Basic-action concepts are similar to movement primitives and assume an internal tree-like structuring of basic action units. A more detailed review of the problems of motor control can be found in [3,1,4]. Note that, although
different disciplines have chosen different terms to describe movement control, we can find certain common ideas. In this paper, we use the term action primitive or basic action unit to refer to single prototypical actions. Behavior will be interpreted as the outcome of appropriate sequencing of action primitives.
The idea of movement primitives found its way into various robotic and motion control applications. In this context, a movement primitive is often referred to as a computational element in a sensorimotor map that transforms desired limb trajectories into actual motor commands [5]. Related to our approach is the work of Fod et al. [4], where the concept of movement primitives is applied in a computational approach by linear superposition of clustered primitive motor commands that finally resulted in complex motion in simulated agents. For imitation of human activities in simulated game worlds, we used a probabilistic sequencing of action primitives [6]. Despite using different features, the action primitives were derived in a similar way as in [4], or as proposed in this paper.
For recognition tasks, the idea of representing actions by a set of basic building blocks is not new; for a brief overview we recommend [7]. Recently, Moeslund et al. [8] recognized actions by comparing strings of manually found motion primitives, where the motion primitives are extracted from four features in a motion image. In [9], silhouette keyframe images are extracted using optical flow data. The keyframes are grouped in pairs and used to construct a grammar for action recognition. In [10], the idea of recognizing human behavior using an activity language is further expanded. Moreover, numerous other papers deal with behavior recognition from motion. Related to our approach are the works of Hamid et al. [11] and Goldenberg et al. [12]. Hamid et al. [11] applied a sequence comparison using bags of event n-grams to recognize higher-level behaviors, where n-grams are built using a set of basic hand-crafted events. Goldenberg et al. [12] extracted eigenshapes from periodic motions for behavior classification.
3 Proposed Behavior Representation and Recognition
The purpose of this paper is twofold; first, we want to advance automatic behavioral analysis from video data, second, we want to propose psychologically/biologically inspired approaches for recognizing human activities. For action representation, we map each frame of a video sequence to a certain action primitive. Since behavior is a temporal phenomenon, we express the sequential observation of basic action units using n-grams, a popular representation used in speech analysis or text mining. The number of occurrences of certain n-gram patterns holds valuable information for the underlying behavior. Consequently, we classify video sequences by comparing extracted histograms of n-gram patterns to already known behavioral histograms. Experimental results on a publicly available dataset, kindly provided by the Weizmann Institute [13], will illustrate the applicability of the proposed approach. A sequence of silhouette images is one of the most simple and accessible representations for human motion. Silhouettes of humans were frequently used for classifying activities from video data [12,13]. Given a sequence of silhouette
Fig. 1. 10 exemplary action primitives. Sequencing of action primitives results in more complex motions and can thus be interpreted as behavior.
images, the task is to recognize a specific behavior. However, we first have to extract a suitable action representation before behavior recognition can take place. In the following, we (a) explain how to derive an action representation using the concept of movement primitives, (b) further extend the action representation using n-grams to incorporate the temporal context, and finally (c) introduce a simple classifier by means of histogram comparison of sequences of action n-grams.
(a) Following ideas discussed in [4,12], in order to make the problem of action primitive extraction feasible, we reduce the dimensionality of the input images using Principal Component Analysis (PCA). Applying PCA to the image training data results in a set of eigenvectors or, since we are dealing with silhouette images, eigenshapes (the term eigenshapes for shape representation was introduced in [12], where the eigendecomposition alone was used for activity recognition). Let the $K$ different behaviors be represented by training sequences $B = s_n^k$, $k = 1, \dots, K$, of length $n$, where each frame in a sequence consists of an aligned binary silhouette image. The eigenshapes $M_i$, $i = 1, \dots, m$, correspond to the $m$ largest eigenvalues of the covariance matrix $BB^\top$. A lower dimensional representation of silhouette images can thus be achieved by projecting the data onto these first $m$ eigenvectors. Given a mapping to a lower dimensional description of human silhouettes, we can express a sequence of silhouettes $S_i = [s_1, \dots, s_n]$, $s_k \in \mathbb{R}^m$, in the reduced space $\mathbb{R}^m$. The idea of action primitives assumes that complex motions can be reconstructed using a set of basic action units. In order to derive a meaningful set of action primitives, we apply Ward's clustering method [14] to all silhouette vectors $s$ occurring in any of the $n$ training sequences $[S_1, \dots, S_n]$. We used the Euclidean distance as a similarity measure.
Applying Ward's clustering minimizes the inner cluster variance and results in a set of almost equally sized clusters. This is important, since a representation of activities using a discrete set of actions requires (a) a sufficient coverage of all possible actions, and (b) a maximum variability in cluster center assignments. This leads to a greater variability in sequences of action primitives and was shown to increase behavior classification rates. Figure 1 shows a few resulting clusters using Ward's method. Other clustering approaches, for instance k-means, single-linkage clustering, or the Neural Gas algorithm (for a review of clustering approaches see [15]), showed a performance decrease. (For some initial k-means or Neural Gas cluster configurations the results were similar; however, due to the random initialization we were not able to reproduce these results. Generally, we experienced a decrease in classification rates ranging from 3% to 5%.) Clustering leads to a set A of k cluster centres
Fig. 2. Sample sequence showing the original shapes (upper row), their reconstruction using action primitives (middle), and their bi-gram representation (lower row). Primitive labels: a, b, b, c, d, d, e, f. Bi-grams: (a,b), (b,b), (b,c), (c,d), (d,d), (d,e), (e,f).
or, in our case, action primitives, where $A = [a_1, \dots, a_k]$, $a_i \in \mathbb{R}^m$. Each silhouette vector can be expressed by its most similar action prototype. Thus, we can express a sequence of silhouettes $S_i = [s_1, \dots, s_n]$ by the corresponding sequence of action primitives $A_{S_i} = [a_{\arg\min_i |a_i - s_1|}, \dots, a_{\arg\min_j |a_j - s_n|}]$.
(b) The temporal context of specific action prototypes is crucial for activity recognition. Therefore, instead of analyzing videos based on individual action primitives, we propose a representation by means of n-grams of action primitives. n-grams provide a sub-sequencing of length $n$ of a given sequence and are widely used for sequence analysis in language processing or bioinformatics. For example, a sequence of action primitives $s = [a, a, b, b, c, c]$ could be simply expressed as a sequence of bi-grams $s = [(aa), (ab), (bb), (bc), (cc)]$ or tri-grams $s = [(aab), (abb), (bbc), (bcc)]$. Figure 2 shows an example of how a sequence of frames can be represented using a sequence of n-grams of action primitives. It should be noted that with increasing $n$, the number $k$ of possible instances of n-grams increases exponentially, since $k = j^n$, where $j$ denotes the number of action primitives. Fortunately, behavior and motion follow certain constraints in the sequencing of action primitives. Thus, only a small fraction of all possible combinations of action primitives is observed and needs to be considered for action recognition.
(c) In order to compare different sequences of action n-grams, we have to measure the similarity of two sequential n-gram streams. Fortunately, sequence similarity measures are a well studied topic; numerous applications require appropriate sequence analysis and similarity measures, e.g. the analysis of DNA sequences in bioinformatics, but also computer security. Interestingly, they have been mostly ignored so far for the recognition of actions.
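The assignment of silhouette vectors to action primitives and the n-gram representation described in (b) can be sketched in a few lines. The function names `nearest_primitive` and `ngrams` and the two-dimensional toy primitives are hypothetical; in the paper the vectors live in the PCA-reduced space.

```python
def nearest_primitive(s, primitives):
    """Index of the action primitive closest to vector s (Euclidean distance)."""
    return min(
        range(len(primitives)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(primitives[i], s)),
    )


def ngrams(seq, n):
    """All contiguous sub-sequences of length n, in order of appearance."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


# Hypothetical 2-D primitives and a short sequence of projected silhouettes.
primitives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.1, 0.9)]
labels = [nearest_primitive(s, primitives) for s in frames]

print(labels)
print(ngrams(labels, 2))
print(ngrams(list("aabbcc"), 3))
```

A sequence of length $L$ yields $L - n + 1$ n-grams, which matches the five bi-grams and four tri-grams of the six-element example in the text.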
For a recent review of sequence similarity measures, we recommend [16]; the following notation is partly based on that publication. For sequence comparison we use the Manhattan distance function applied to the histograms of action primitive sequences. Given a sequence s of n-grams of action primitives, a histogram $\phi(s)$ for all action primitives $[a_1, \dots, a_n]$ that were extracted during the training phase is defined as a mapping into $\mathbb{N}^+ \cup \{0\}$ with

$$\phi_{a_i}(s) := \mathrm{occ}(a_i, s), \qquad (1)$$
Fig. 3. Four exemplary behaviors; from left to right one can see one frame taken from a walking, running, jumping, and waving behavior
where $i = 1, \dots, n$, $n$ denotes the total number of action primitives, and $\mathrm{occ}(a_i, s)$ denotes the number of occurrences of $a_i$ in $s$. The Manhattan distance $d(\phi(s_1), \phi(s_2))$ between two histograms $\phi(s_1)$ and $\phi(s_2)$ is defined as

$$d(\phi(s_1), \phi(s_2)) = \sum_{i=1}^{n} \left| \phi_{a_i}(s_1) - \phi_{a_i}(s_2) \right| \qquad (2)$$
Since we compare sequences of different lengths, the histograms are length-normalized, i.e. $\phi(s)_n = \phi(s)\,|s|^{-1}$, for values $\phi_{a_i}(s) > 0$. For every training sequence $S_t = [s_{t1}, \dots, s_{tn}]$, a separate histogram $\phi(s_i)$, $i = 1, \dots, n$, is computed. For each training histogram, the associated class or behavior label is known. For a novel sequence $s_k$, the task now is to select a suitable class label based on histogram comparisons to already labeled histograms. For the experiments, we selected a class using the maximum aggregation rule, i.e. by finding the class histogram $\phi(s_j)$ with $j = \arg\min_i d(\phi(s_i), \phi(s_k))$. Among others, the average aggregation rule $\arg\min_l \, k^{-1} \sum_{i=1}^{k} d(\phi(s_i^l), \phi(s_k))$ performed worse, which indicates that it might not be a good idea to classify activity based on the average of all observed activities of one class of behavior.
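The histogram comparison can be sketched as follows. This is only an illustration under stated assumptions: histograms are stored as dictionaries keyed by n-gram, normalization divides by the number of n-grams in the sequence, and all function names are hypothetical.

```python
from collections import Counter

def ngram_histogram(sequence, n):
    """Length-normalized histogram of the n-grams in a primitive sequence."""
    grams = [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}

def manhattan(h1, h2):
    """Eq. (2): sum of absolute differences; missing bins count as zero."""
    return sum(abs(h1.get(key, 0.0) - h2.get(key, 0.0))
               for key in set(h1) | set(h2))

def classify(query_hist, labeled_hists):
    """Label of the training histogram with the smallest distance."""
    best_hist, best_label = min(labeled_hists,
                                key=lambda item: manhattan(item[0], query_hist))
    return best_label

# Toy training data (made up for illustration only)
training = [
    (ngram_histogram(list("aabbaabb"), 2), "walk"),
    (ngram_histogram(list("ccddccdd"), 2), "run"),
]
label = classify(ngram_histogram(list("aabba"), 2), training)
```

Storing only the observed n-grams keeps the histograms small, which matches the observation in the text that few of the possible combinations actually occur.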
4
Experimental Results
To verify the presented approach we carried out a series of experiments. The data already contained extracted silhouettes, so we could skip the otherwise obligatory person tracking. We resized each video frame to a 40x40 binary image. The data contained 10 different behaviors (walking, running, jumping, waving, . . . ) performed by 9 subjects; Figure 3 shows some exemplary frames (for further details see [13]). For the experiments, we applied temporal smoothing of the classification results (see e.g. [17]), i.e. we classify based on a majority vote over the last 10 classified frames. Thereby, we could filter out occasional errors and make behavior recognition more robust. However, we also report results without temporal smoothing, which were slightly worse. To select optimal parameters, we varied the length of the n-grams, the histogram sequence length, and the number of action primitives. Figure 4 illustrates the effects of these variations. In a leave-one-out cross-validation test series, we tested each configuration on videos collected from one subject. Table 1 shows the average classification rates,
[Figure 4: plot of the average classification rate (axis range 0.8 to 1) against the number of action primitives (10 to 90), with one curve each for 1-grams to 5-grams]
Fig. 4. Average classification rates for varying n-gram lengths and numbers of action primitives. While the rates for all n-gram configurations are sufficient, tri-grams overall showed the most robust and generally best performance. n-grams with n > 3 suffered from overfitting when using too many action primitives; this became significant only when using more than 150 action primitives (sequence length was set to 25, action primitives were clustered using the first 30 eigenvectors). The best classification rate of 99.62% was achieved using tri-grams constructed from 29 action primitives.
Table 1. Average classification rates for leave-one-out cross validation of 10 different behaviors performed by 9 subjects (excluding all behaviors of one subject for each run). The sequence length corresponds to the number of frames used for calculating n-gram histograms. Here, we used a tri-gram consisting of 50 instances of 30-dimensional action primitives.

Sequence length   Classification rate   Majority voting   Abs. correct   Abs. error
5                 86.17                 92.55             3704           298
10                93.80                 96.17             3416           136
15                96.53                 97.74             3032           70
20                97.80                 98.64             2620           36
25                98.58                 99.10             2193           20
and absolute numbers of classified frames for varying sequence lengths for s. Setting the sequence length to contain the whole movie sequence usually resulted in a perfect classification of all test movies. Using a sequence length of 25 frames (corresponding to 1 second) we still achieved sufficient classification rates, usually above 95% under various configurations. The best configuration classified 99.62% of all frames correctly. We thereby reach classification rates similar to those of [13]. Note, however, that we cannot directly compare the approaches, since [13] classified 10-frame-long sequences of video in which 5 frames overlapped. Moreover, [13] excluded one behavior of one subject for testing. Using the same number of frames as in [13], we reach slightly worse classification results. In a second experimental run we used the n-gram histograms of only one subject and classified behaviors performed by the remaining 8 subjects. A summary
Table 2. Average classification rates for using the behavioral data provided by one subject for training, and testing against 8 subjects (using 50 action primitives and a sequence length of 25 frames). Generally, the videos are classified surprisingly well, indicating that the proposed approach also works for one-shot learning.

Subject                    1      2      3      4      5      6      7      8      9
Avg. classification rate   87.70  88.59  83.54  86.81  85.43  87.85  80.44  75.41  57.33
Majority voting            89.57  88.17  84.18  90.43  85.75  87.64  82.61  75.48  57.39
of the results can be seen in Table 2. Interestingly, we reached reasonable classification rates here as well. This leads to the conclusion that the approach is also usable for one-shot learning, or in settings where a sufficient amount of training data is not available.
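The temporal smoothing used in the experiments (majority voting over the last 10 classified frames) can be sketched as below. The function name is hypothetical, and the handling of the first frames, where fewer than 10 labels are available, is an assumption.

```python
from collections import Counter, deque

def smooth_labels(frame_labels, window=10):
    """Replace each frame label by the majority label of the last `window` frames.

    Assumption: for the first frames the vote uses however many labels exist.
    """
    recent = deque(maxlen=window)  # oldest label drops out automatically
    smoothed = []
    for label in frame_labels:
        recent.append(label)
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed

# One isolated misclassification ("run") inside a walking sequence is voted out.
raw = ["walk"] * 5 + ["run"] + ["walk"] * 4
clean = smooth_labels(raw)
```

The price of the smoothing is a delay when the behavior really changes, since the new label must first win the vote within the window.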
5
Conclusion
In this paper, we proposed a novel approach for behavior recognition that represents actions as bags of n-grams of action primitives. Action representations are based on biological findings and thus motivate a biologically inspired approach. Further, we expressed activity as n-grams and could thereby rely on conventional methodology for sequence analysis often used in speech processing or text mining. In a series of experiments, we showed the applicability for behavior recognition, where we relied on binary silhouette images of human motion. Varied experimental settings also indicate usability for scenarios where only a few training samples are available. In the future, we plan to use more sophisticated features for describing prototypical actions instead of the silhouette images used so far (e.g. the recently introduced Histograms of Oriented Gradients for finding humans in images [18]). Moreover, similar to [9], we are currently investigating action primitives derived from different views in order to make the approach less sensitive to view changes.
Acknowledgments We would like to thank Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri for making their behavior dataset publicly available [13]. Christian Thurau is supported by a grant from the European Community under the EST Marie-Curie Project VISIONTRAIN MRTN-CT-2004-005439. The second author was supported by the project MSM6840770013.
References
1. Ghahramani, Z.: Building blocks of movement. Nature 407, 682–683 (2000)
2. Schack, T., Mechsner, F.: Representation of motor skills in human long-term memory. Neuroscience Letters 391, 77–81 (2006)
3. Wolpert, D.M., Ghahramani, Z., Flanagan, J.R.: Perspectives and problems in motor learning. TRENDS in Cognitive Sciences 5(11), 487–494 (2001)
4. Fod, A., Matarić, M., Jenkins, O.: Automated Derivation of Primitives for Movement Classification. Autonomous Robots 12(1), 39–54 (2002)
5. Thoroughman, K., Shadmehr, R.: Learning of action through adaptive combination of motor primitives. Nature 407, 742–747 (2000)
6. Thurau, C., Bauckhage, C., Sagerer, G.: Synthesizing Movements for Computer Game Characters. In: Rasmussen, C.E., Bülthoff, H.H., Schölkopf, B., Giese, M.A. (eds.) Pattern Recognition. LNCS, vol. 3175, pp. 179–186. Springer, Heidelberg (2004)
7. Moeslund, T., Reng, L., Granum, E.: Finding Motion Primitives in Human Body Gestures. In: Gibet, S., Courty, N., Kamp, J.-F. (eds.) GW 2005. LNCS (LNAI), vol. 3881, pp. 133–144. Springer, Heidelberg (2006)
8. Moeslund, T., Fihl, P., Holte, M.: Action Recognition using Motion Primitives. In: The 15th Danish conference on pattern recognition and image analysis (2006)
9. Ogale, A.S., Karapurkar, A., Aloimonos, Y.: View-invariant modeling and recognition of human actions using grammars. In: ICCV Workshop on Dynamical Vision (2005)
10. Guerra-Filho, G., Aloimonos, Y.: A Sensory-Motor Language for Human Activity Understanding. In: 6th IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS'06), pp. 69–75. IEEE Computer Society Press, Los Alamitos (2006)
11. Hamid, R., Johnson, A., Batta, S., Bobick, A., Isbell, C., Coleman, G.: Detection and Explanation of Anomalous Activities: Representing Activities as Bags of Event n-Grams. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2005 (CVPR'05), IEEE, NJ, New York (2005)
12. Goldenberg, R., Kimmel, R., Rivlin, E., Rudzsky, M.: Behavior classification by eigendecomposition of periodic motions. Pattern Recognition 38, 1033–1043 (2005)
13. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as Space-Time Shapes. In: Tenth IEEE International Conference on Computer Vision, IEEE Computer Society Press, Los Alamitos (2005)
14. Ward, J.H.J.: Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244 (1963)
15. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Surveys 31(3), 264–323 (1999)
16. Rieck, K., Laskov, P., Sonnenburg, S.: Computation of Similarity Measures for Sequential Data using Generalized Suffix Trees. In: Advances in Neural Information Processing Systems, vol. 19 (2007)
17. Thurau, C., Hettenhausen, T., Bauckhage, C.: Classification of Team Behaviors in Sports Video Games. In: Proc. Int. Conf. on Pattern Recognition, IEEE, pp. 1188–1191. IEEE Computer Society Press, Los Alamitos (2006)
18. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2005 (CVPR'05), IEEE Computer Society Press, Los Alamitos (2005)
Human Action Recognition in Table-Top Scenarios: An HMM-Based Analysis to Optimize the Performance
Pradeep Reddy Raamana, Daniel Grest, and Volker Krueger
Computer Vision and Machine Intelligence, Copenhagen Institute of Technology, Aalborg University, Denmark
{pradeep,dag,vok}@cvmi.aau.dk
Abstract. Hidden Markov models have been extensively and successfully used for the recognition of human actions. Though there exist well-established algorithms to optimize the transition and output probabilities, the type of features to use and specifically the number of states and Gaussians have to be chosen manually. Here we present a quantitative study on selecting the optimal feature set for the recognition of the simple object manipulation actions pointing, rotating and grasping in a table-top scenario. This study has resulted in a recognition rate higher than 90%. In addition, three different parameters, namely the number of states and Gaussians for the HMM and the number of training iterations, are considered for optimization of the recognition rate with 5 different feature sets on our motion capture data set from 10 persons.
Keywords: Hidden Markov model, Action Recognition, Optimization.
1
Introduction
Extensive research has been carried out in the past to understand and recognize human actions. Typical applications of human activity recognition are surveillance, human-robot interaction and imitation learning [1,11]. In surveillance, it is important to detect abnormal and suspicious actions. In the robotics community, a huge body of research addresses the imitation of humans through demonstration, imitation and learning. Research on imitation learning shows that actions can be clearly segmented into atomic action units [7,13]. This gives a way to recognize human actions through interpretation of their atomic units [8,9], such as pointing, grasping, and rotating in the case of table-top scenarios. Though the scenario is simple, the recognition task is difficult due to the similarity of the different actions. Because the scenario is simple, studying action recognition in it lets us gain an understanding of human motion and gives insights into how to tackle the problem of action recognition in complex scenarios. Considerable research in the computer vision community has focused on cyclic motions, such as walking or running [1], and Jenkins et al. [8] came up with a system to extract behavior vocabularies for the problem of complex task learning. [4] presents a learning system for one- and two-hand motions that computes the hand motion trajectory that optimizes the imitation performance given the constraints of the body of the robot.

Here we present a quantitative study to optimize the recognition performance for simple human actions, like pointing, grasping, and rotating, performed in a table-top scenario. Our action dataset consists of 10 persons, in contrast to many other studies with one or two persons [1,3,10]. We use the Hidden Markov Model (HMM) to learn the motion model. It defines the joint probability distribution over the observations and the states of the model. An HMM implicitly performs time-warping to some extent for motion sequences that differ slightly in their execution speed. Hence it is well suited to our action dataset consisting of 10 persons with differing execution paces. Our study is based on about 890 sequences, each with an average of 115 poses, totaling 102350 poses.

Though there exist well-established algorithms, such as the Baum-Welch algorithm, to optimize the transition and output probabilities of an HMM, the structure of the HMM and its inputs have to be chosen manually. The feature set used to train the HMM is really important, as an inappropriate feature set could lead to low performance and leave the system unable to capture the discriminative features among different actions. Apart from the feature set, the number of states for the HMM and the number of Gaussians for each state have to be given by hand. Moreover, the number of training iterations for the HMM also needs to be set beforehand. These parameters are usually determined empirically, but this task is often laborious and time-consuming for big datasets. Hence this study saves a lot of effort and time for subsequent researchers working in similar scenarios. Moreover, this scenario is gaining importance in imitation learning.

Perhaps the work most similar to ours is that of Guenter et al. [6]. They have presented simple algorithms to optimize the number of states, training iterations and Gaussians for the HMM in the context of a handwritten word recognition task. Their work merely finds parameters with optimal performance, without any study of different feature sets. Here we perform an extensive evaluation of different feature sets and present results on our action dataset, pointing out the feature set with the best classification rate. This evaluation is necessary to build a system with the best performance. En route to finding the optimal feature set, the optimal values for the different parameters of the HMM will also be determined. Recent research [5] shows that one can do motion tracking even from a single view. So this study could be combined with motion capture to recognize the actions in real time.

The paper is organized as follows: Section 2 briefly describes action recognition through Hidden Markov models and Section 3 describes the motion capture setup and the collection of training data. In Section 4, we describe the different feature sets used for the evaluation. Section 5 describes the quantitative evaluation of the different feature sets and we conclude in Section 6.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 101–108, 2007. © Springer-Verlag Berlin Heidelberg 2007
2
HMM Based Human Action Recognition
L. Rabiner [14] gives an excellent tutorial on how to use HMMs. Given an observation sequence $O = O_1 O_2 \dots O_T$ and a model $\lambda = (A, B, \Pi)$, the probability of the observation sequence given the model, $P(O \mid \lambda)$, is calculated by enumerating all possible state sequences of length $T$, $Q = q_1 q_2 \dots q_T$. The probability of the observation sequence given the model, irrespective of the state sequence, is then obtained by summing the probability over all possible state sequences:

$$P(O \mid Q, \lambda) = \prod_{t=1}^{T} P(O_t \mid q_t, \lambda), \qquad P(O \mid \lambda) = \sum_{\text{all } Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda) \qquad (1)$$
We then follow the forward-backward algorithm [14] to compute the likelihood of the observation sequence given the HMM. We train one fully connected HMM for each class of sequences. To recognize an observation sequence from a dataset, we calculate the likelihood of this sequence under all the trained HMMs. The HMM that gives the maximum likelihood represents the observation sequence. If this HMM and the observation sequence belong to the same action class, we say that the sequence was recognized correctly.
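The following sketch illustrates the likelihood computation and the maximum-likelihood recognition rule. It is a simplification under stated assumptions: observations are discrete symbols rather than the Gaussian mixtures used in this work, no scaling is applied (so it only suits short sequences), and all names and model values are made up for illustration.

```python
from itertools import product

def forward_likelihood(obs, pi, A, B):
    """P(O | lambda) by the forward recursion; obs must be non-empty.

    pi[i]: initial probability, A[i][j]: transition i -> j,
    B[i][o]: probability that state i emits symbol o.
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

def brute_force_likelihood(obs, pi, A, B):
    """Eq. (1) taken literally: sum over all state sequences Q."""
    total = 0.0
    for q in product(range(len(pi)), repeat=len(obs)):
        p = pi[q[0]] * B[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += p
    return total

def recognize(obs, models):
    """Name of the model with the maximum likelihood for obs."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))

# Two toy two-state models over symbols {0, 1} (values are illustrative).
models = {
    "pointing": ([1.0, 0.0], [[0.7, 0.3], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]),
    "grasping": ([1.0, 0.0], [[0.7, 0.3], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]]),
}
recognized = recognize([0, 0, 1, 1], models)
```

The brute-force sum visits N^T state sequences for N states, whereas the forward recursion needs on the order of N^2 T operations, which is why the latter is used in practice.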
3
Our Action Dataset
Our dataset consists of four actions: Pointing (P), Grasping (G), Rotating (R) and Displacing (D), performed by 10 persons. Each action is performed
1. in three different directions (to the right, to the front and to the left),
2. with two different heights, and
3. with two different distances.
Fig. 1. Locations of the electromagnetic sensors on the body of the person and the setup used to record both markers and video data
Fig. 2. Sequence of Images of a Person Performing the action Rotating
Each person repeated each action 5 times for all combinations of the above settings. With 10 persons, this gives rise to a good number of training examples with respect to variability in execution speed, direction, height, scale and arm length. This is essential in order to capture the key features of a particular action invariant to the aforementioned properties. Each person wears 7 electromagnetic sensors. They are located at the upper back of the torso, shoulder, elbow, wrist, hand, thumb and index finger, as depicted in Fig. 1. The measurements were acquired with the MotionStar system by Ascension [2]. For each sequence, the 3D coordinates of the 7 sensors were recorded at 25 fps, starting from a nearly vertical position (rest pose, Fig. 2). The sequence also ends in the rest pose. Figure 2 shows a person performing the action Rotating. The scenario was also video-recorded by 4 calibrated cameras simultaneously with the marker data. The video data could be used for motion tracking etc. The whole setup is shown in Fig. 1. In this work we consider only three actions: P, G and R, because the action class Displacing is a combination of pointing, grasping and moving and hence different from the primitive actions P, G and R.
4
Feature Sets
Each feature set is a sequence of features for all poses. For all the sequences, the 3D position of the torso is considered to be the origin for each pose. It is subtracted from the positions of all the sensors, so that the data used for recognition is independent of the position of the person performing the action. In order to determine the best feature set, we extracted the following feature sets, which are appropriate for one-arm movements in a table-top scenario:
1. Joint Angles (JA): This consists of the angles at the shoulder and at the elbow between the lower arm and the upper arm. Knowing the problem of singularity, i.e. 0 and π are the same, we use the up and direction vectors in the shoulder coordinate frame, obtained from the kinematic chain built for the right arm for each pose. This is invariant to the arm length and can handle directions easily.
2. Direction vector from Torso to Wrist (T2W)
3. Direction vector from Shoulder to Wrist (S2W)
4. T2W + Grasp Distance (GD): Here we define the Grasp Distance as the distance between the sensors on the thumb and the index finger. Intuitively, GD is a discriminative feature between Grasping and the rest. This is confirmed by the experiments.
5. S2W + GD: Direction vector from Shoulder to Wrist and the grasp distance.
5
Quantitative Analysis
In order to learn and recognize different actions, we trained one HMM for each class of sequences. In this work, both training and testing are independent of the individual persons. Each time, three quarters of the sequences from each class are used for training and the fourth quarter is used to test the performance of the trained HMMs. The recognition is done using the Maximum Likelihood approach as follows. The likelihoods of the test sequence under all the trained HMMs are calculated, and the HMM with the maximum likelihood represents the test sequence. We use the Hidden Markov Model Toolbox for Matlab, developed by Kevin Murphy [12], for training and testing. Having chosen different feature sets, it is important to set a suitable architecture for the HMM for each of these feature sets to maximise the classification rate. This means that the optimal values for the different parameters of the HMM have to be determined. From preliminary trials on the dataset, we learned that the following parameter ranges produce recognition rates (RR) within an acceptable range:
– Number of States (Q): from 5 to 40 in steps of 5
– Number of Gaussians (G) for each state: 10, 20 and 30
– Number of Training Iterations (I): 10, 15 and 20
With these ranges, we use a simple pick-the-best algorithm to find the optimal parameter values for each feature set. For each feature set, we enumerated the values of each parameter in the above ranges and noted the performance. The set of parameters corresponding to the maximum recognition rate is noted for each feature set. Though the algorithm is straightforward, it is important to make an extensive evaluation over the above ranges, to make sure that we do not miss the best combination. Each run of the algorithm with one set of parameters takes about 25 minutes on average on a 2 GHz computer. The best recognition rates and the optimal values for the HMM parameters are given in Table 1.
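The pick-the-best search over the three parameter ranges can be sketched as an exhaustive grid search. The `evaluate` callback is hypothetical: in the study it would train and test the HMMs for one configuration, whereas here a made-up stand-in function is used so that the sketch runs on its own.

```python
from itertools import product

def pick_best(evaluate, states, gaussians, iterations):
    """Evaluate every (Q, G, I) combination and return the best one.

    evaluate(q, g, i) is assumed to return a recognition rate.
    """
    results = {(q, g, i): evaluate(q, g, i)
               for q, g, i in product(states, gaussians, iterations)}
    best = max(results, key=results.get)
    return best, results

# Parameter ranges as listed in the text
states = range(5, 45, 5)      # 5, 10, ..., 40
gaussians = (10, 20, 30)
iterations = (10, 15, 20)

def toy_rate(q, g, i):
    # Stand-in for training and testing the HMMs (not real results).
    return 100 - abs(q - 35) - abs(g - 10) - abs(i - 15)

best, results = pick_best(toy_rate, states, gaussians, iterations)
```

The grid contains 8 x 3 x 3 = 72 configurations per feature set; at about 25 minutes per run this amounts to roughly 30 hours of computation for each feature set.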
From Table 1, it is clear that GD definitely improved the performance of the system. S2W+GD has the best performance. One reason is that the motion of the shoulder joint differs between persons for the same action. So it helped the HMM to learn the key features common to the action and to discard the variant features, and thus to obtain better discrimination among different actions. This study also shows that the inclusion of elbow motion (in feature set JA) degraded the classification rate rather than improving it. This was also observed by Vicente et al. [15]. From Table 1, we can also see that the combinations of 10 Gaussians with 15 iterations and 10 Gaussians with 20 iterations have the best performance. Figures 3 and 4 show the variation of the recognition rate over the number of states for these combinations. We can see that the performance increases gradually with the number of states for many of the feature sets. The non-monotonicity of the variation is due to the randomness in the initialization of the HMMs during training.

Table 1. Best Recognition Rates for Different Feature Sets

Feature Set   Best Recognition Rate   Optimal Q, G, I
JA            80.09 %                 40, 20, 20
T2W           88.68 %                 25, 10, 20
S2W           87.78 %                 30, 10, 15
T2W+GD        90.49 %                 35, 10, 20
S2W+GD        94.12 %                 35, 10, 15

[Figure 3: plot titled "Variation of Recognition Rate, with 10 Gaussians and 20 Iterations"; recognition rate (65 to 95) against number of states (5 to 40), with one curve each for JA, T2W, S2W, T2W+GD and S2W+GD]
Fig. 3. Variation of RR over number of states, with 10 Gaussians and 20 Iterations

[Figure 4: plot titled "Variation of Recognition Rate, with 10 Gaussians and 15 Iterations"; recognition rate (60 to 100) against number of states (5 to 40), with one curve each for JA, T2W, S2W, T2W+GD and S2W+GD]
Fig. 4. Variation of RR over number of states, with 10 Gaussians and 15 Iterations

Apart from the best performance, the average performance of a particular feature set is important, because a feature set with a higher average performance and low sensitivity to the HMM parameters can be expected to perform equally well on other datasets in a similar scenario. The mean, maximum, minimum and standard deviation of the recognition rate over all the trials for the five feature sets are given in Table 2. The standard deviation of the feature set S2W+GD is higher than that of the feature set T2W+GD, but S2W+GD clearly dominates T2W+GD in mean and best recognition rates. So we can say that the feature set S2W+GD is optimal and is suitable for movements in table-top scenarios. The optimal parameters are determined to be 35 states for the HMM, 10 Gaussians for each state and 15 training iterations.

Table 2. Statistics of the Recognition Rates for All Feature Sets

Feature Set   Mean RR   Max. RR   Min. RR   Std. Dev.
JA            71.23 %   80.09 %   57.92 %   5.02
T2W           79.27 %   88.68 %   56.56 %   6.37
S2W           76.99 %   87.78 %   33.48 %   9.37
T2W+GD        82.13 %   90.49 %   58.37 %   6.13
S2W+GD        84.16 %   94.12 %   45.70 %   9.54
6
Conclusions
A quantitative study on optimizing the performance of a human action recognition system in table-top scenarios is presented. A set of simple object manipulation actions is used to find the best feature set and the optimal values for the HMM parameters. This set of actions could be classified most accurately with the direction vector from shoulder to wrist and the grasp distance. The high average performance of this feature set shows that it is less sensitive to the HMM parameters. So it is expected to perform equally well on different datasets in a similar scenario. We have also observed that the inclusion of elbow motion degraded the classification rate rather than improving it. We are building an online task recognition system, integrating this with a motion capture system.
Acknowledgment. This work has been supported by PACO-PLUS, FP6-2004-IST-4-27657.
References
1. Aggarwal, J.K., Cai, Q.: Human motion analysis: A review. Computer Vision and Image Understanding: CVIU 73(3), 428–440 (1999)
2. Ascension. Motion Star Real-Time Motion Capture. http://www.ascension-tech.com/products/motionstar 10 04.pdf
3. Billard, A., Calinon, S., Guenter, F.: Discriminative and adaptative imitation in uni-manual and bi-manual tasks. Robotics and Autonomous Systems 54, 370–384 (2006)
4. Calinon, S., Billard, A., Guenter, F.: Discriminative and adaptative imitation in uni-manual and bi-manual tasks. Robotics and Autonomous Systems 54 (2005)
5. Grest, D., Koch, R., Krueger, V.: Single view motion tracking by depth and silhouette information. In: Scandinavian Conference on Image Analysis (2007)
6. Guenter, S., Bunke, H.: Optimizing the number of states and training iterations and Gaussians in an HMM-based handwritten word recognizer. In: Seventh International Conference on Document Analysis and Recognition, vol. 1, pp. 472–476 (August 2003)
7. Mataric, M.J.: Sensory-motor primitives as a basis for imitation: linking perception to action and biology to robotics. In: Dautenhahn, K., Nehaniv, C.L. (eds.) Imitation in Animals and Artifacts, pp. 391–422. MIT Press, Cambridge (2002)
8. Jenkins, O.C., Mataric, M.J.: Performance-derived behavior vocabularies: Data-driven acquisition of skills from motion. International Journal of Humanoid Robotics 1(2), 237–288 (2004)
9. Krueger, V., Grest, D.: Using hidden Markov models for recognizing action primitives in complex actions. In: Scandinavian Conference on Image Analysis (2007)
10. Campbell, L., Bobick, A.: Recognition of human body motion using phase space constraints. In: International Conference in Computer Vision, pp. 624–630 (1995)
11. Moeslund, T., Hilton, A., Krueger, V.: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104, 90–126 (2006)
12. Murphy, K.: Hidden Markov Model (HMM) Toolbox for Matlab (1998), http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
13. Newtson, D., Engquist, G., Bois, J.: The objective basis of behavior unit. Journal of Personality and Social Psychology, 847–862 (1977)
14. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
15. Vicente, I.S., Kragic, D.: Learning and recognition of object manipulation actions using linear and nonlinear dimensionality reduction. In: 15th IEEE Int. Symp. on Robot and Human Interactive Communication (RO-MAN) (Submitted 2007)
Grouping of Articulated Objects with Common Axis Levente Hajder Computer and Automation Research Institute, Hungarian Academy of Sciences Kende u. 13-17., H-1111 Budapest, Hungary [email protected]
Abstract. We address the problem of nonrigid Structure from Motion (SfM). Several methods have been published recently which try to solve the task of tracking, segmenting, or reconstructing nonrigid 3D objects in motion. Most of these papers focus on deformable objects. We deal with the segmentation of articulated objects, that is, nonrigid objects composed of several moving rigid objects. We consider two moving objects and assume that the rigid SfM problem has been solved for each of them separately. We propose a method which helps to decide whether an object is rotating around an axis defined by another moving object. The theory of the proposed method is discussed in detail. Experimental results for synthetic and real data are presented.
1
Introduction
3D motion based nonrigid object reconstruction is a popular topic in three-dimensional computer vision. It has many important applications such as registration of medical images, robotic vision and face reconstruction. Most of the published rigid SfM methods are based on various extensions of the well-known factorization method by Tomasi and Kanade [1]. The original method assumes orthographic projection, while its extensions can cope with weak perspective [2], para-perspective [3] and real perspective [4] cases as well. 3D motion segmentation algorithms for rigid objects can also be based on factorization methods [5, 6], but efficient fundamental matrix [7] or trifocal tensor [8] based segmentation procedures also exist. Most segmentation algorithms, such as [8–10], cannot cope with nonrigid (e.g., articulated) objects. Recently, researchers have begun to deal with the reconstruction of nonrigid moving 3D objects [11–14]. These studies assume that the structure of the nonrigid object can be written as a weighted sum of K so-called key objects:

$$S = \sum_{i=1}^{K} l_i S_i \qquad (1)$$
Unfortunately, this assumption is unfavorable for the following reasons:
1. The motion of many real nonrigid objects, such as articulated objects, cannot be written in the above form. For instance, the motion of a human arm, or that of a windmill blade, cannot be expressed by eq. (1).
2. Every key object has a different weight in every frame. The calculation of the weights is computationally expensive. The large number of parameters to be optimized reduces the quality of the result.
Our approach differs from the factorization techniques [11–14]. We focus on the reconstruction of an articulated object by grouping the rigid parts. We assume that the 3D motion segmentation problem for the parts has been solved by some of the existing techniques, and the motion and the structural data of the parts are available. Our goal is to group the parts of a potential articulated object. There are different types of articulated objects. In this paper, two rigid parts are considered. (This assumption is not prohibitive: if more than two parts have been found, a brute-force solution is to apply the proposed methods to each pair of parts.) The following problem is addressed in our study: How to determine if a part is rotating around an axis defined by the other part?

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 109–116, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 SfM Under Weak Perspective
Given $P$ feature points of a rigid object tracked across $F$ frames, $x_{fp} = (u_{fp}, v_{fp})^T$, $f = 1, \dots, F$, $p = 1, \dots, P$, the goal of SfM is to recover the structure of the object. If the origin of the 2D coordinate system is placed at the centroid of the feature points, that is, the centroid is subtracted from the trajectories in all images, the 2D coordinates are calculated as

$$x_{fp} = q_f R_f s_p, \qquad (2)$$

where $R_f = [r_{f1}, r_{f2}]^T$ is the orthonormal rotation matrix, $s_p$ the 3D coordinates of the point, and $q_f$ the nonzero scale factor of weak perspective. For all points in the $f$-th image, the above equations can be rewritten as

$$\underbrace{W_f}_{2 \times P} = (x_{f1} \dots x_{fP}) = \underbrace{M_f}_{2 \times 3} \cdot \underbrace{S}_{3 \times P} \qquad (3)$$

where $M_f$ is called the motion matrix and $S = (s_1, \dots, s_P)$ the structure matrix. Under orthography $M_f = R_f$, under weak perspective $M_f = q_f R_f$. For all frames, the equations (3) form

$$\underbrace{W}_{2F \times P} = \underbrace{M}_{2F \times 3} \cdot \underbrace{S}_{3 \times P}, \qquad (4)$$

where $W^T = [W_1^T, W_2^T, \dots, W_F^T]$ and $M^T = [M_1^T, M_2^T, \dots, M_F^T]$.

The task is to factorize the measurement matrix $W$ and obtain the structural information $S$. This can be done in two steps. In the first step, the rank of $W$ is reduced to three by the singular value decomposition (SVD), since the rank of $W$ is at most three: $W_{2F \times P} = \hat{M}_{2F \times 3} \cdot \hat{S}_{3 \times P}$. This factorization is determined only up to an affine transformation because an arbitrary $3 \times 3$ non-singular matrix $Q$ can be inserted so that $W = \hat{M} Q Q^{-1} \hat{S}$. Therefore $\hat{M}$ contains the base vectors of the frames deformed by an affine transformation. In the second step, the matrix $Q$ can be determined optimally by least squares optimization both for the orthographic [1] and the weak-perspective [2] cases, imposing the corresponding constraints on the frame base vectors. The estimated motion vectors can be written as $M = \hat{M} Q$, the estimated structure as $S = Q^{-1} \hat{S}$.
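The affine ambiguity of the factorization can be checked numerically. The following is a minimal sketch in plain Python, not the author's implementation: the helper names `matmul` and `motion_rows`, the rotation about the y-axis, the scale factors and the matrix `Q` are illustrative assumptions.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def motion_rows(q, angle):
    """2x3 weak-perspective motion matrix M_f = q_f R_f.

    R_f holds the first two rows of a rotation about the y-axis
    (an arbitrary choice made only for this illustration).
    """
    c, s = math.cos(angle), math.sin(angle)
    return [[q * c, 0.0, q * s], [0.0, q, 0.0]]

# Structure matrix S (3 x P); the centroid of the points is the origin.
S = [[1.0, -1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, -1.0],
     [1.0, 1.0, -1.0, -1.0]]

# Stack the per-frame motion matrices into M (2F x 3) and form W = M S.
frames = [(1.0, 0.0), (1.2, 0.3), (0.9, 0.7)]   # hypothetical (q_f, angle) pairs
M = [row for q, angle in frames for row in motion_rows(q, angle)]
W = matmul(M, S)

# Any non-singular Q yields another exact factorization W = (M Q)(Q^-1 S).
Q = [[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.5]]
Q_inv = [[0.5, 0.0, 0.0], [-0.5, 1.0, 0.0], [0.0, 0.0, 2.0]]
W_alt = matmul(matmul(M, Q), matmul(Q_inv, S))

# Largest entry-wise difference between the two products.
max_diff = max(abs(a - b) for ra, rb in zip(W, W_alt) for a, b in zip(ra, rb))
```

Both products reproduce the same measurement matrix, which is why additional metric constraints on the frame base vectors are needed to fix $Q$.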
3 The Proposed Grouping Method
In this section, we consider two moving rigid objects (parts) and assume that they have been segmented by any of the existing 3D motion segmentation methods [7, 9, 10]. The motion and the structure data of the objects are thus provided by factorization for each frame $i$, $1 \le i \le F$:

$$W_1^i = M_1^i S_1 \qquad (5)$$

$$W_2^i = M_2^i S_2. \qquad (6)$$

The goal of the two proposed methods is to group the segmented parts into an articulated object.

3.1 Relative Motion of Two Objects
To examine the motion of an object with respect to another object, one has to determine the relative motion of the two objects. Recall that it is assumed that the factorization has been computed: $W_1^i = M_1^i S_1$ and $W_2^i = M_2^i S_2$. The relative motion of the objects can be written as either $M_{12}^i = (M_1^i)^{-1} M_2^i$ or $M_{21}^i = (M_2^i)^{-1} M_1^i$. It should be noted that $M_1^i$ and $M_2^i$ are $2 \times 3$ and therefore non-invertible matrices. Each of them can be inverted if completed by the third base vector. Its direction is that of the cross-product of the first and the second base vectors, while its length is either unit (orthography) or the average length of the other two base vectors (weak perspective).

As shown in [10], the factorization is ambiguous. If $W_1^i = M_1^i S_1$ is a valid SfM factorization, then $W_1^i = (M_1^i A_1)(A_1^T S_1)$ is also a valid factorization if and only if $A_1^T A_1 = I$. Likewise, $W_2^i = (M_2^i A_2)(A_2^T S_2)$ is also a valid factorization if $A_2^T A_2 = I$. (Here $A_1$ and $A_2$ are transformation matrices common to all frames.) The ambiguity modifies the relative motions:

$$M_{12}^i \;\mapsto\; A_1^T (M_1^i)^{-1} M_2^i A_2 = A_1^T M_{12}^i A_2 \qquad (7)$$

$$M_{21}^i \;\mapsto\; A_2^T (M_2^i)^{-1} M_1^i A_1 = A_2^T M_{21}^i A_1 \qquad (8)$$

Since the matrices $A_1$ and $A_2$ are orthogonal, the following conclusion is drawn: due to the factorization ambiguity, the obtained relative motion is the true relative motion transformed by an unknown Euclidean transformation.
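The completion of a motion matrix by its third base vector and the resulting relative motion can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not the author's code: the function names `complete`, `inverse3` and `relative_motion` are hypothetical, and the $3 \times 3$ inverse is taken through the adjugate.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(sum(x * x for x in a))

def complete(M, weak_perspective=True):
    """Append the third base vector to a 2x3 motion matrix.

    Direction: cross product of the two rows. Length: average row length
    (weak perspective) or one (orthography).
    """
    r1, r2 = M
    c = cross(r1, r2)
    length = 0.5 * (norm(r1) + norm(r2)) if weak_perspective else 1.0
    scale = length / norm(c)
    return [list(r1), list(r2), [scale * x for x in c]]

def inverse3(m):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def matmul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def relative_motion(M1, M2, weak_perspective=True):
    """M_12 = M_1^{-1} M_2 computed from the completed 3x3 matrices."""
    return matmul(inverse3(complete(M1, weak_perspective)),
                  complete(M2, weak_perspective))
```

For two weak-perspective frames with scales 2 and 3, the result is the relative rotation scaled by 3/2, so in practice the scale may have to be normalized before circles are fitted.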
3.2 Rotation Around an Axis Defined by Another Object
Without loss of generality, let us assume that the coordinate system of the second object is such that the hypothetical axis is parallel to the vector $[1, 0, 0]^T$. Select a base vector of the first object; let its coordinates be $[x_1, y_1, z_1]^T$. If the first object is rotating around the axis and the rotation angle in the $i$th frame is $\alpha_i$, then the coordinates of the base vector in the $i$th frame are $[x_1,\; y_1 \cos\alpha_i + z_1 \sin\alpha_i,\; z_1 \cos\alpha_i - y_1 \sin\alpha_i]^T$. Similarly, for a second base vector $[x_2, y_2, z_2]^T$, these coordinates are $[x_2,\; y_2 \cos\alpha_i + z_2 \sin\alpha_i,\; z_2 \cos\alpha_i - y_2 \sin\alpha_i]^T$. Considering the base vectors as points in 3D space, we observe that the points of each base vector form a circle, the two circles lie in parallel planes, and their axes are common. For simplicity, we will speak of 'coaxial circles'.

The calculated relative motion is a Euclidean transformation of the true relative motion. Two coaxial circles transformed by a Euclidean transformation are also coaxial circles. Therefore, the problem of detecting the rotation of an object w.r.t. another object is reduced to fitting coaxial circles to the calculated motion data.

A large number of circle and ellipse fitting algorithms are available in both 2D and 3D, such as [15–17]. These methods are optimized to fit a single circle to data points. To the best of our knowledge, the problem of simultaneous fitting of coaxial circles has not been addressed. In this section, we give a simple but efficient algorithm to solve this task. An error metric is also defined here. Finally, a threshold must be set to determine whether the object is rotating around an axis defined by the other object.

3.3 Fitting Coaxial Circles to 3D Points
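The coaxial-circle observation of Section 3.2, on which this fitting problem rests, can be checked numerically. The sketch below is illustrative only: the base vectors and the rotation angles are made-up values, and the function names are hypothetical.

```python
import math

def rotate_about_x(v, alpha):
    """Coordinates of base vector v after rotation by alpha about the axis [1, 0, 0]^T."""
    x, y, z = v
    return (x,
            y * math.cos(alpha) + z * math.sin(alpha),
            z * math.cos(alpha) - y * math.sin(alpha))

def trajectory(v, angles):
    """Positions of one base vector over all frames."""
    return [rotate_about_x(v, a) for a in angles]

def axial_offset_and_radii(points):
    """Offset along the axis and distance from the axis for every point."""
    return [p[0] for p in points], [math.hypot(p[1], p[2]) for p in points]

angles = [0.1 * i for i in range(20)]          # hypothetical rotation angles alpha_i
first = trajectory((0.5, 1.0, 0.0), angles)    # hypothetical base vectors
second = trajectory((-0.3, 0.4, 0.8), angles)
```

Each trajectory keeps a constant offset along the axis and a constant distance from it, so the two point sets lie on circles in parallel planes with a common axis.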
Given two 3D data sets with $N$ and $M$ points, the goal is to determine coaxial circles fitted to the two point sets. The solution is divided into two parts: two parallel planes are fitted to the points first, then the circles on those planes are estimated.

Fitting Parallel Planes to Points. The $i$th 3D point of the first set is represented by $(X_i, z_i)$ while the $j$th point in the second by $(U_j, w_j)$, where $X_i = [x_i \; y_i]^T$ and $U_j = [u_j \; v_j]^T$. The equations of the parallel planes can be written as

$$z = a^T X + b_1 \qquad (9)$$

$$w = a^T U + b_2 \qquad (10)$$

where $a$ is a vector with two elements: $a = [a_1 \; a_2]^T$. The plane fitting is based on the error function

$$J = \sum_{i=1}^{N} (z_i - a^T X_i - b_1)^2 + \sum_{j=1}^{M} (w_j - a^T U_j - b_2)^2 \qquad (11)$$

Setting its derivatives to zero gives the optimal parameters of the parallel planes:

$$\frac{1}{2}\frac{\partial J}{\partial a} = \sum_{i=1}^{N} \left[ (a^T X_i) X_i - z_i X_i + b_1 X_i \right] + \sum_{j=1}^{M} \left[ (a^T U_j) U_j - w_j U_j + b_2 U_j \right] = 0 \qquad (12)$$

$$\frac{1}{2}\frac{\partial J}{\partial b_1} = \sum_{i=1}^{N} \left[ b_1 - z_i + a^T X_i \right] = 0 \qquad (13)$$

$$\frac{1}{2}\frac{\partial J}{\partial b_2} = \sum_{j=1}^{M} \left[ b_2 - w_j + a^T U_j \right] = 0 \qquad (14)$$

$b_1$ and $b_2$ can be expressed as

$$b_1 = \frac{\sum_{i=1}^{N} \left( z_i - a^T X_i \right)}{N} \qquad (15)$$

$$b_2 = \frac{\sum_{j=1}^{M} \left( w_j - a^T U_j \right)}{M} \qquad (16)$$

Substituting (15) and (16) into (12), the optimal value of $a$ can be calculated by solving the following linear equation:

$$\left[ \sum_{i=1}^{N} X_i X_i^T + \sum_{j=1}^{M} U_j U_j^T - \frac{\left(\sum_{i=1}^{N} X_i\right)\left(\sum_{i=1}^{N} X_i\right)^T}{N} - \frac{\left(\sum_{j=1}^{M} U_j\right)\left(\sum_{j=1}^{M} U_j\right)^T}{M} \right] a = \sum_{i=1}^{N} z_i X_i + \sum_{j=1}^{M} w_j U_j - \frac{\sum_{i=1}^{N} z_i \sum_{i=1}^{N} X_i}{N} - \frac{\sum_{j=1}^{M} w_j \sum_{j=1}^{M} U_j}{M}$$

With the known $a$, the offset parameters $b_1$ and $b_2$ follow from eqs. (15) and (16).

Determination of the Center and of the Radius of the Circles. The centers of the circles are estimated as the points of the planes closest to the origin; the base vectors rotate about an axis through the origin, so the axis meets each plane at this point. Finally, the radii of the circles are estimated by averaging the distance between the points and the corresponding circle centers.

Definition of the Error Metric. The definition of the fitting error is simple: let the error of a point be the distance between the original point and the closest point on the corresponding circle. The fitting error is the average of the errors of all points.
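The fitting procedure of this section can be sketched in plain Python as follows. This is a hedged illustration rather than the author's implementation: the function name `fit_coaxial_circles` and its return layout are assumptions, the 2×2 system is solved by Cramer's rule, and the sums are accumulated in mean-centred form, which is an algebraic rearrangement of the linear equation for $a$.

```python
import math

def fit_coaxial_circles(set1, set2):
    """Fit parallel planes z = a^T X + b_k to two 3D point sets, then one circle per plane.

    Returns (a, offsets, centers, radii, mean_error).
    """
    sets = (set1, set2)
    # Normal equations for a. Subtracting each set's mean is algebraically the
    # same as the correction terms divided by N and M in the linear equation.
    A = [[0.0, 0.0], [0.0, 0.0]]
    rhs = [0.0, 0.0]
    means = []
    for pts in sets:
        n = len(pts)
        mx = sum(p[0] for p in pts) / n
        my = sum(p[1] for p in pts) / n
        mz = sum(p[2] for p in pts) / n
        means.append((mx, my, mz))
        for x, y, z in pts:
            dx, dy, dz = x - mx, y - my, z - mz
            A[0][0] += dx * dx
            A[0][1] += dx * dy
            A[1][1] += dy * dy
            rhs[0] += dx * dz
            rhs[1] += dy * dz
    A[1][0] = A[0][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    a = ((rhs[0] * A[1][1] - rhs[1] * A[0][1]) / det,
         (A[0][0] * rhs[1] - A[1][0] * rhs[0]) / det)
    # Offsets b_1, b_2 from eqs. (15) and (16).
    offsets = [mz - a[0] * mx - a[1] * my for mx, my, mz in means]

    # Plane normal is (a1, a2, -1); each centre is the plane point closest to the origin.
    nn = a[0] ** 2 + a[1] ** 2 + 1.0
    unit = (a[0] / math.sqrt(nn), a[1] / math.sqrt(nn), -1.0 / math.sqrt(nn))
    centers = [(-b * a[0] / nn, -b * a[1] / nn, b / nn) for b in offsets]

    radii, errors = [], []
    for pts, c in zip(sets, centers):
        r = sum(math.dist(p, c) for p in pts) / len(pts)   # mean distance to the centre
        radii.append(r)
        for p in pts:
            d = [p[k] - c[k] for k in range(3)]
            h = sum(d[k] * unit[k] for k in range(3))                 # out-of-plane part
            rho = math.sqrt(max(sum(x * x for x in d) - h * h, 0.0))  # in-plane distance
            errors.append(math.hypot(h, rho - r))                     # distance to the circle
    return a, offsets, centers, radii, sum(errors) / len(errors)
```

The returned mean error is the quantity that would be compared against the decision threshold. The sketch requires that the common axis is not parallel to the xy-plane, since the planes are parameterized as $z = a^T X + b$.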
4 Experiments on Synthetic Data
For every test, two objects are generated as two point clouds by a Gaussian random number generator with zero mean and standard deviation $\sigma_{obj}$. The objects undergo the same translational motion, while their rotations are connected by an axis. The free angles of the motion are randomized by the same random number generator. Then the 3D points of the objects are projected onto the image plane. Finally, 2D noise is added to every 2D point. The 2D noise is generated by a zero-mean Gaussian random number generator with standard deviation $\sigma_{noi}$. Three tests were done:
– Fitting error versus the noise: The test result presented in the left plot of Figure 1 shows that the error grows approximately linearly with the level of noise. The method seems efficient up to a noise level of 7–8%. According to the test, an error value of 0.02 seems to be a good threshold for the method.
– Fitting error versus the number of frames: The error decreases as the number of frames increases, as demonstrated in the central plot of Figure 1, because more frames supply more 3D points to the circle fitting.
– Fitting error versus the number of points: Adding more points to the 3D objects improves the quality of the result, because the quality of the motion data produced by the factorization becomes better by adding more points. Better motion data yield more precise 3D points for the circle fitting algorithm. The test is shown in the right plot of Figure 1.
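The data generation described above can be sketched roughly as follows. The details are assumptions made for illustration and are not taken from the paper: the shared axis is the x-axis, the first part keeps a fixed orientation, the projection is orthographic, and the joint angle and the translation are drawn with unit standard deviation. The function name `synthetic_sequence` is hypothetical.

```python
import math
import random

def rotate_about_x(p, alpha):
    x, y, z = p
    return (x,
            y * math.cos(alpha) + z * math.sin(alpha),
            z * math.cos(alpha) - y * math.sin(alpha))

def synthetic_sequence(n_points, n_frames, sigma_obj, sigma_noi, seed=0):
    """Return per-frame 2D tracks (part1, part2) of two parts joined by the x-axis."""
    rng = random.Random(seed)
    part1 = [tuple(rng.gauss(0.0, sigma_obj) for _ in range(3)) for _ in range(n_points)]
    part2 = [tuple(rng.gauss(0.0, sigma_obj) for _ in range(3)) for _ in range(n_points)]
    frames = []
    for _ in range(n_frames):
        alpha = rng.gauss(0.0, 1.0)                         # free joint angle (assumed spread)
        tx, ty = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)   # translation shared by both parts

        def project(p):
            # Orthographic projection plus zero-mean Gaussian image noise.
            return (p[0] + tx + rng.gauss(0.0, sigma_noi),
                    p[1] + ty + rng.gauss(0.0, sigma_noi))

        frames.append(([project(p) for p in part1],
                       [project(rotate_about_x(p, alpha)) for p in part2]))
    return frames
```

Tracks generated this way could be fed to a factorization routine and then to the circle fitting to reproduce experiments of the kind listed above.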
Fig. 1. Left: Fitting error versus the noise level. Center: versus the number of frames. Right: versus the number of points.
5 Experiments on Real Data
Our methods were also tested on a real video sequence consisting of eleven frames. Three frames of the image sequence are shown in Figure 2. There are three moving objects in the video: a CD box, a juice box and a painted plastic bear. The bear is connected to the CD box with an axis. Hundreds of feature points have been tracked in the images and these points have been segmented by our region-based 3D segmentation method [10]. Then the motion and structure matrices were calculated by the factorization method of Tomasi and Kanade [1]. We have examined whether the relative motion of the bear and the CD box is a rotation around an axis connected to the CD box. The fitted coaxial circles and the relative motion data are visualized in Figure 3. The fitting error is small (0.109), so the conclusion is that the relative motion between the bear and the CD box is a rotation.
Fig. 2. Real video sequence with a bear, a CD box and a juice box. Left: first frame. Middle: 6th frame. Right: 11th (last) frame.
Fig. 3. Coaxial circles fitted to the relative motion between the bear and the CD box. The points representing the relative motion are shown as small octahedra.
6 Conclusion and Future Work
In this paper, we addressed the problem of grouping moving rigid parts of articulated objects. We formulated the notion of axis-connected articulated objects and presented a new method that determines whether the motions of two rigid objects are connected by an axis. The method and the corresponding theory were discussed in the paper, and the produced error values were examined versus noise in the image space, versus the number of frames and versus the number of points. The proposed method was also tested on data points coming from a real video sequence. In the future, we plan to deal with the grouping of different moving parts of human bodies, because a human skeleton is an articulated object, and its moving parts are the bones.
References
1. Tomasi, C., Kanade, T.: Shape and Motion from Image Streams under orthography: A factorization approach. Intl. Journal Computer Vision 9, 137–154 (1992)
2. Weinshall, D., Tomasi, C.: Linear and Incremental Acquisition of Invariant Shape Models From Image Sequences. IEEE Trans. on PAMI 17(5), 512–517 (1995)
3. Poelman, C.J., Kanade, T.: A Paraperspective Factorization Method for Shape and Motion Recovery. IEEE Trans. on PAMI 19(3), 312–322 (1997)
4. Sturm, P., Triggs, B.: A Factorization Based Algorithm for Multi-Image Projective Structure and Motion. In: Buxton, B.F., Cipolla, R. (eds.) ECCV 1996. LNCS, vol. 1064, pp. 709–720. Springer, Heidelberg (1996)
5. Trajković, M., Hedley, M.: Robust Recursive Structure and Motion Recovery under Affine Projection. In: Proc. British Machine Vision Conference (September 1997)
6. Kurata, T., Fujiki, J., Kourogi, M., Sakaue, K.: A Robust Recursive Factorization Method for Recovering Structure and Motion from Live Video Frames. In: IEEE ICCV Frame-Rate Workshop (September 1999)
7. Torr, P.H., Murray, D.W.: Outlier detection and motion segmentation. In: Arcelli, C., Cordella, L.P., Sanniti di Baja, G. (eds.) Visual Form 2001. LNCS, vol. 2059, pp. 432–443. Springer, Heidelberg (2001)
8. Torr, P.H.S., Zisserman, A., Murray, D.W.: Motion clustering using the trilinear constraint over three views. In: Europe-China Workshop on Geometrical Modelling and Invariants for Computer Vision, pp. 118–125 (1995)
9. Kanatani, K.: Motion Segmentation by Subspace Separation and Model Selection. In: ICCV, pp. 586–591 (2001)
10. Hajder, L., Chetverikov, D.: Robust 3D Segmentation of Multiple Moving Objects Under Weak Perspective. In: ICCV Workshop on Dynamical Vision, CD ROM (2005)
11. Brand, M., Bhotika, R.: Flexible Flow for 3D Nonrigid Tracking and Shape Recovery. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 312–322 (December 2001)
12. Torresani, L., Yang, D., Alexander, E., Bregler, C.: Tracking and Modelling Nonrigid Objects with Rank Constraints. In: IEEE Conf. on Computer Vision and Pattern Recognition (2001)
13. Xiao, J., Chai, J.X., Kanade, T.: A closed-form solution to non-rigid shape and motion recovery. In: ECCV, vol. 4, pp. 573–587 (2004)
14. Lladó, X., Del Bue, A., Agapito, L.: Non-rigid factorization for projective reconstruction. In: British Machine Vision Conference, pp. 169–178 (2005)
15. Gander, W., Golub, G.H., Strebel, R.: Least-squares fitting of circles and ellipses. In: Numerical analysis (in honour of Jean Meinguet), pp. 63–84 (1996)
16. Taubin, G.: Estimation of planar curves, surfaces, and nonplanar space curves defined by implicit equations with applications to edge and range image segmentation. IEEE Trans. on PAMI 13(11), 1115–1138 (1991)
17. Fitzgibbon, A.W., Pilu, M., Fisher, R.B.: Direct least square fitting of ellipses. IEEE Trans. on PAMI 21(5), 476–480 (1999)
Decision Level Multiple Cameras Fusion Using Dezert-Smarandache Theory
Esteban Garcia and Leopoldo Altamirano
National Institute of Astrophysics, Optics and Electronics, Luis Enrique Erro No. 1, Tonantzintla, Puebla, Mexico
{eomargr, robles}@inaoep.mx
Abstract. This paper presents a model for multiple cameras fusion, which is based on the Dezert-Smarandache theory of evidence. We have developed a fusion model which works at the decision fusion level to track objects on a ground plane using geographically distributed cameras. As we are fusing at decision level, tracking is done based on predefined zones. We present early results of our model tested on CGI animated simulations, applying a perspective-based basic belief assignment function. Our experiments suggest that the proposed technique yields a good improvement in tracking accuracy when spatial regions are used to track. Keywords: Multiple Cameras Fusion, Tracking, Dezert-Smarandache Theory, Decision Level Fusion.
1 Introduction
Using multiple cameras has proven to increase the capabilities of vision systems, mainly by extending the visual field or by complementing information from cameras to reduce uncertainty and handle issues such as occlusion. Information coming from sensors might be fused at several levels [1], depending on the type of information provided by the cameras, whether it is just images or other information that has been processed inside the camera (cameras of the latter kind are known as smart sensors).

In practice, surveillance systems aim to be autonomous, or at least to make surveillance a less tedious task. While there are emerging approaches [2,3,4,5] which take into consideration information related to the objects in the scene (such as where an object is standing and how long it has been in that particular zone), it is still not common to find work on the use of multiple cameras to improve the information useful for such surveillance systems. In [5] a tracking system using predefined regions is used to analyze behavioral patterns, where uncertainty might be clearly reduced using more than one camera. In [4] a Hierarchical Hidden Markov Model is used to identify activities, based on tracking people in a room divided into cells; even though two static cameras cover the scene, their purpose is to focus on different zones, not to refine information.

In this work we propose a decision fusion level model to take advantage of several geographically distributed cameras and reduce uncertainty related to the 3D position of an object with respect to the camera view. We consider previously defined zones to track the position of objects, so the stage of the processing at which data integration takes place allows an interpretation of information which better describes the position of the objects being observed and at the same time is useful for high-level surveillance systems, such as those focused on behavior recognition. In our model, individual decisions are taken by means of an axis-projection-based generalized basic belief assignment (gbba) function and finally fused using the Dezert-Smarandache (DSm) hybrid rule. The recent Dezert-Smarandache Theory (DSmT) [6] is especially useful for the fusion of information arising from several independent sources, and provides an effective combination tool when sources become conflicting, compared to other theories such as Dempster-Shafer [7]. We also propose a way to reduce and manage dynamically the frame of discernment to optimize computer resources.

This paper is organized as follows: in section 2, the Dezert-Smarandache theory is briefly described as the mathematical framework. In section 3, we define our model, explain how zones are dynamically used to create the frame of discernment and also explain the gbba function we used. The performance of the proposed fusion model is illustrated in section 4. The paper ends with a conclusion in section 5.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 117–124, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 DSm Hybrid Model
The DSmT considers two mathematical models used to represent and combine information [8]. First we describe the free DSm model and then the DSm hybrid model, which is defined from the free one but introduces integrity constraints, which we found useful to avoid wasting computer resources.

The free DSm model, denoted as $M^f(\Theta)$, defines $\Theta = \{\theta_1, \dots, \theta_n\}$ as a set or frame of $n$ non-exclusive elements and a hyper-power set $D^\Theta$ as the set of all composite possibilities obtained from $\Theta$ in the following way:
(1) $\emptyset, \theta_1, \dots, \theta_n \in D^\Theta$.
(2) $\forall A \in D^\Theta, B \in D^\Theta$: $(A \cup B) \in D^\Theta$, $(A \cap B) \in D^\Theta$.
(3) $D^\Theta$ is formed only by elements obtained by rules 1 or 2.

The function $m(A)$ is called the general basic belief assignment or mass for $A$, defined as $m(\cdot): D^\Theta \to [0, 1]$, and is associated with a source of evidence.

A DSm hybrid model introduces some integrity constraints on elements $A \in D^\Theta$ when there are known facts related to those elements in the problem under consideration. The restricted elements are forced to be empty in the hybrid model $M(\Theta) \neq M^f(\Theta)$ and the mass is transferred to the non-restricted elements. When the DSm hybrid model is used, the combination rule for two or more sources is defined for $A \in D^\Theta$ with these functions:

$$m_{M(\Theta)}(A) = \phi(A) \left[ S_1(A) + S_2(A) + S_3(A) \right] \qquad (1)$$

$$S_1(A) = \sum_{\substack{X_1, X_2, \dots, X_k \in D^\Theta \\ X_1 \cap X_2 \cap \dots \cap X_k = A}} \; \prod_{i=1}^{k} m_i(X_i) \qquad (2)$$

$$S_2(A) = \sum_{\substack{X_1, X_2, \dots, X_k \in \boldsymbol{\emptyset} \\ [U = A] \vee [(U \in \boldsymbol{\emptyset}) \wedge (A = I_t)]}} \; \prod_{i=1}^{k} m_i(X_i) \qquad (3)$$

$$S_3(A) = \sum_{\substack{X_1, X_2, \dots, X_k \in D^\Theta \\ X_1 \cup X_2 \cup \dots \cup X_k = A \\ X_1 \cap X_2 \cap \dots \cap X_k \in \boldsymbol{\emptyset}}} \; \prod_{i=1}^{k} m_i(X_i) \qquad (4)$$

where $\phi(A)$ is called the characteristic non-emptiness function of a set $A$ ($\phi(A) = 1$ if $A \notin \boldsymbol{\emptyset}$ and $\phi(A) = 0$ otherwise). $\boldsymbol{\emptyset} = \{\emptyset_M, \emptyset\}$, where $\emptyset_M$ is the set of all elements of $D^\Theta$ forced to be empty. $U$ is defined as $U = u(X_1) \cup u(X_2) \cup \dots \cup u(X_k)$, where $u(X)$ is the union of all singletons $\theta_i \in X$, while $I_t = \theta_1 \cup \theta_2 \cup \dots \cup \theta_n$.
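The construction of the hyper-power set by rules (1) to (3) can be illustrated with a short closure computation. This is a sketch under an encoding chosen here for convenience and not taken from the paper: each proposition is stored as the set of Venn-diagram regions it covers, and a region is named by the set of singletons that contain it. The function name `hyper_power_set` is hypothetical.

```python
from itertools import combinations

def hyper_power_set(n):
    """Elements of D^Theta for n singletons, each encoded as a frozenset of Venn regions."""
    # A Venn region is identified by the non-empty set of singletons that contain it.
    regions = [frozenset(c) for k in range(1, n + 1) for c in combinations(range(n), k)]
    # theta_i covers every region whose label contains i.
    thetas = [frozenset(r for r in regions if i in r) for i in range(n)]
    elements = {frozenset()} | set(thetas)          # rule (1)
    changed = True
    while changed:                                  # rules (2) and (3): close under union and intersection
        changed = False
        for a in list(elements):
            for b in list(elements):
                for c in (a | b, a & b):
                    if c not in elements:
                        elements.add(c)
                        changed = True
    return thetas, elements
```

The size of the result grows very quickly with the number of singletons, which is the practical motivation for keeping the frame of discernment small.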
3 Multiple Cameras Fusion
We use a set of cameras to get information from the scene. Motion detection and tracking are performed separately for each of the cameras; then tracked objects are matched in case there is more than one in the scene. These are prerequisite tasks, which might be performed in many ways: motion detection can be performed by using background subtraction, while tracking is commonly solved with filters such as the Kalman filter. A homography is necessary to relate information coming from the cameras; it can be recovered from a set of static points on the ground plane [9] or from dynamic information in the scene [10]. Correspondence between objects detected in the cameras might be achieved by feature matching techniques [11] or geometric ones [12,13]. Once the homography has been recovered and correspondence is achieved, it is possible to create hypotheses, evaluate the gbba function and fuse information coming from the cameras. We propose in this paper that each camera becomes an expert for each object in the scene, providing beliefs about the object's position, so what we expect from each camera is a set of hypotheses of people's locations and their assigned beliefs.

3.1 Dynamic Frame of Discernment
Let $\Gamma = \{\gamma_1, \dots, \gamma_n\}$ denote the ground plane partition, where each $\gamma_x$ is a predefined zone of the ground plane, which might be a zone of special interest, such as a corridor or parking area, whether or not it is a regular polygon. For each moving object $i$, a frame $\Theta_i = \{\theta_1, \dots, \theta_k\}$ is created. Each element $\theta_x$ represents a zone $\gamma_y$ where the object $i$ might be situated, according to information from the cameras. This means that $\Theta_i$ is built dynamically, considering only the zones for which there exists some belief provided by at least one camera. This is especially helpful to reduce $|D^\Theta|$ and thus avoid wasting computer resources.

Each camera behaves as an expert, assigning mass to each one of the unrestricted elements of $D^\Theta$. The assignment function is simple, and its main purpose is to account for the influence of perspective on uncertainty. This is achieved by measuring the intersection between $\gamma_x$ and the object's vertical axis projected on the ground plane, centered on the object's feet. The length of the axis projected on the ground plane is determined by the angle of the camera with respect to the ground plane, taking the object's ground point as the vertex to measure the angle. So if the camera were just above the object, its axis projection would be just one pixel long, meaning no uncertainty at all. We establish the axis projection length as $\lambda = l \cos(\alpha)$, where $l$ is the maximum length that the projection can take, measured in pixels, and $\alpha$ is the measured angle.
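The axis-projection idea can be sketched as follows. This is a simplified illustration and not the authors' gbba function: the regular grid of square zones, the sampling of the projected axis, and the function names are assumptions, and only singleton zones receive mass (border hypotheses and ignorance terms are not modelled).

```python
import math

def axis_projection_length(l_max, alpha_deg):
    """lambda = l cos(alpha), with alpha the camera elevation seen from the object's ground point."""
    return l_max * math.cos(math.radians(alpha_deg))

def zone_index(x, y, cell, cols):
    """Zone number (1-based, row-major) of a ground-plane point on a regular grid."""
    return int(y // cell) * cols + int(x // cell) + 1

def masses_from_axis(foot, direction, length, cell=1.0, cols=4, samples=200):
    """Share unit mass among zones in proportion to the axis length inside each zone."""
    norm = math.hypot(direction[0], direction[1])
    ux, uy = direction[0] / norm, direction[1] / norm
    counts = {}
    for k in range(samples):
        t = ((k + 0.5) / samples - 0.5) * length     # the segment is centred on the feet
        zone = zone_index(foot[0] + t * ux, foot[1] + t * uy, cell, cols)
        counts[zone] = counts.get(zone, 0) + 1
    return {zone: n / samples for zone, n in counts.items()}
```

A short projected axis that stays inside one zone yields a single hypothesis with mass one, while a longer axis seen from a low camera spreads the mass over the zones it crosses.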
Fig. 1. Projected axis, used to evaluate the gbba function. (a) Camera perspective. (b) Object's vertical axis projection.
An expert (camera) can provide belief to elements $\theta_x \cap \theta_y \in D^\Theta$ by considering pairs $\gamma_i$ and $\gamma_j$ (represented by $\theta_x$ and $\theta_y$ respectively) crossed by the axis projection. This is used as information indicating that the object is on the border between $\gamma_i$ and $\gamma_j$. Elements $\theta_x \cup \dots \cup \theta_y$ can have an associated gbba value, which represents local or global ignorance. We also restrict the elements $\theta_x \cap \dots \cap \theta_y \in D^\Theta$ for which no direct basic assignment has been made by any of the cameras; they are thus included in $\emptyset_M$, and calculations are simplified. That is possible because of the hybrid DSm model definition.

Decision fusion is used to combine the outcomes from the cameras, making a final decision. We apply the hybrid DSm rule of combination over $D^\Theta$ in order to achieve a final decision. To transfer the masses of the integrity constraints of the hybrid DSm model, constraints are considered and it is not necessary to compute $S_1(A)$, $S_2(A)$ and $S_3(A)$ in equation 2.
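For two cameras, the combination rule of Section 2 can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: propositions are encoded as sets of Venn regions, the regions listed in `empty_regions` are forced empty, and neither source is assumed to place mass on an element that is itself forced empty, so the $S_2$ term vanishes. The function name `combine_two_sources` is hypothetical.

```python
from itertools import product

def combine_two_sources(m1, m2, empty_regions=frozenset()):
    """Two-source DSm combination on Venn-region-encoded propositions.

    m1, m2: dict mapping frozenset(regions) -> mass.
    empty_regions: regions forced empty by the integrity constraints.
    """
    fused = {}
    for (x1, w1), (x2, w2) in product(m1.items(), m2.items()):
        x1r, x2r = x1 - empty_regions, x2 - empty_regions
        target = x1r & x2r                 # S1: conjunctive consensus
        if not target:
            target = x1r | x2r             # S3: conflicting mass moves to the union
        if not target:
            raise ValueError("mass was assigned to an element that is forced empty")
        fused[target] = fused.get(target, 0.0) + w1 * w2
    return fused

# Two zones: region "a" lies only in theta_1, "b" only in theta_2, "ab" in both.
theta1 = frozenset({"a", "ab"})
theta2 = frozenset({"b", "ab"})
camera1 = {theta1: 0.6, theta2: 0.4}       # hypothetical masses
camera2 = {theta1: 0.7, theta2: 0.3}
fused = combine_two_sources(camera1, camera2, empty_regions=frozenset({"ab"}))
```

With an empty constraint set the function reduces to the classic rule of the free model, where every product of masses goes to the intersection of the two propositions.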
4 Experimental Results
As results of our approach, we present tests made using a computer-generated-imagery animated sequence from three points of view, which simulates a three-camera surveillance system. We considered a simple square scenario with a grid of sixteen regular predefined zones, introducing the object's shadow as a noisy element. Furthermore, the object's color is sensitive to its position, because the illumination is located at the top of the scene. Moving objects are detected by using background subtraction [14]. Principal axis correspondence [15] was implemented, and the homography is recovered from a set of static correspondence points [9]. The cameras are elevated 28.32 (camera 1), 18.03 (camera 2) and 9.86 (camera 3) degrees with respect to the center of the ground plane. Camera 1 is located at a corner and the other two are orthogonally located on the sides of the ground plane. All of them are static and pointed at the center of the scene. 3D modeling was done using Blender with Yafray as the rendering engine. All generated images for the sequence have a resolution of 800×600 pixels. Examples of images generated by rendering are shown in figure 2, where division lines were outlined on the ground plane to have a visual reference of the zones, but they are not required for any other task.
Fig. 2. Images from cameras at frame number 40: (a) camera 1, (b) camera 2, (c) camera 3.
The belief to be assigned by the cameras is obtained from the object's axis projected on the ground plane according to the camera perspective, where $\lambda$ is applied as a coefficient to compute the length of the projection. After calculating the length, the homography is used to transform it to the ground plane reference and then its intersection with the zones is evaluated. All elements $\theta_x \cap \dots \cap \theta_y$ are included in $\emptyset_M$, except for those of the form $\theta_i \cap \theta_j$ to which a belief has been directly assigned by some of the cameras. Before fusing, the assignment is normalized to sum to one to comply with the DSm hybrid model definition.

Results from tracking and fusion of a moving cube with proportions similar to a human body are shown in figures 3 and 4, corresponding to a sequence of 220 frames. The object goes from the zone in the lower left corner of the ground plane to the upper right one. Zones $\gamma_i \in \Gamma$ were enumerated from left to right and from top to bottom to be referred to in the figures. The generalized basic belief assignment value is represented by gray tones, going from white (high uncertainty) to black (zero uncertainty), while the object's position is plotted as a block line in row.

Fig. 3. Original positions compared with decisions in cameras: (a) original positions, (b) decisions in camera 1, (c) decisions in camera 2, (d) decisions in camera 3.

It is common for a camera to assign mass to more than one hypothesis because of the perspective consideration. If we take frame number 100 as an example, camera 1 assigns mass to zones 6, 10 and 11, which are adjacent to each other, as shown in figure 3(b). Each camera proposes hypotheses and corresponding masses. Combining evidence from the cameras reduces uncertainty and refines the hypotheses compared to a single camera, as can be seen in figure 4, where the original positions are compared to the ones obtained by fusion. The fusion module improves the accuracy of the decisions from the cameras, even if the target is missed in some frames (camera 3 lost the object several times, see figure 3(d)).

Fig. 4. Original positions compared with fusion result: (a) original positions, (b) positions obtained by fusion.

The tracking system operates at a rate of about 5 frames per second on a notebook with the following characteristics: Turion64, 2.0GHz, 1GB RAM; everything is processed within the same computer, but we suppose the frame rate might be considerably higher if smart cameras or parallel processing were used.
5 Conclusions
In this paper an approach to fuse position information coming from distributed cameras is proposed and results were shown. We propose the use of cameras as experts, working at decision level and using Dezert-Smarandache Theory to combine beliefs. It has been shown that our model can reduce uncertainty from cameras by fusing zone-based position. Further work is needed to examine the influence of other possible parameters to be included in the gbba function, such as the distance from the object to the camera, or even to try other belief functions. Although execution time was not an objective of this work, tests showed that processing can be improved to reach real-time video surveillance requirements.

Acknowledgments. This work is partially supported by CONACYT under grant number 201562.
References
1. Dasarathy, B.V.: Sensor fusion potential exploitation-innovative architectures and illustrative applications. In: Proceedings of the IEEE, vol. 85, pp. 24–38 (January 1997)
2. Lv, F., Kang, J., Nevatia, R., Cohen, I., Medioni, G.: Automatic tracking and labeling of human activities in a video sequence. In: IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, in conjunction with ECCV'04 (2004)
3. Green, R., Guan, L.: Quantifying and recognizing human movement patterns from monocular video images - part i: A new framework for modeling human motion (2003)
4. Nguyen, N.T., Phung, D.Q., Venkatesh, S., Bui, H.: Learning and detecting activities from movement trajectories using the hierarchical hidden markov models. In: CVPR '05. Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Washington, DC, USA, vol. 2, pp. 955–960. IEEE Computer Society, Los Alamitos, CA, USA (2005)
5. Yan, W., Forsyth, D.A.: Learning the behavior of users in a public space through video tracking. In: WACV-MOTION '05. Proceedings of the Seventh IEEE Workshops on Application of Computer Vision (WACV/MOTION'05), Washington, DC, USA, vol. 1, pp. 370–377. IEEE Computer Society, Los Alamitos, CA, USA (2005)
6. Dezert, J.: Foundations for a new theory of plausible and paradoxical reasoning. Information and Security Journal 9 (2002)
7. Shafer, G.: A Mathematical Theory of Evidence. Princeton University, Princeton (1976)
8. Smarandache, F., Dezert, J.: An introduction to the dsm theory for the combination of paradoxical, uncertain, and imprecise sources of information (2006)
9. Stein, G., Lee, L., Romano, R.: Monitoring activities from multiple video streams: Establishing a common coordinate frame. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 758–767 (2000)
10. Bradshaw, K.J., Reid, I.D., Murray, D.W.: The active recovery of 3d motion trajectories and their use in prediction. IEEE Trans. Pattern Anal. Mach. Intell. 19(3), 219–234 (1997)
11. Krumm, J., Harris, S., Meyers, B., Brumitt, B., Hale, M., Shafer, S.: Multi-camera multiperson tracking for easyliving. In: Proceedings of the Third IEEE International Workshop on Visual Surveillance, pp. 3–10 (July 2000)
12. Khan, S., Shah, M.: Consistent labeling of tracked objects in multiple cameras with overlapping fields of view. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(10), 1355–1360 (2003)
13. Black, J., Ellis, T.: Multi camera image tracking. Image Vision Comput. 24(11), 1256–1267 (2006)
14. Heikkila, J., Silven, O.: A real-time system for monitoring of cyclists and pedestrians. In: VS '99. Proceedings of the Second IEEE Workshop on Visual Surveillance, Washington, DC, USA, p. 74. IEEE Computer Society, Los Alamitos, CA, USA (1999)
15. Hu, W., Hu, M., Zhou, X., Lou, J.: Principal axis-based correspondence between multiple cameras for people tracking. In: Tan, F.-T., Maybank, M.-S. (eds.) IEEE Trans. Pattern Anal. Mach. Intell., vol. 28(4), pp. 663 (2006)
Occlusion Removal in Video Microscopy
Brian Eastwood and Russell M. Taylor II
University of North Carolina at Chapel Hill, Department of Computer Science, Chapel Hill, NC, 27599-3175, USA
{beastwoo,taylorr}@cs.unc.edu
Abstract. Video microscopy offers researchers a method to observe small-scale dynamic processes. It is often useful to remove unchanging portions of image sequences to improve the perception and analysis of moving features. We propose two processing methods (a local method and a global method) for detecting and removing partial stationary occlusions from video microscopy data using the bright-field microscope image model. In both techniques, we compute the relative light transmission across the image plane due to fixed, partially-transparent objects. The resulting transmission map enables reconstruction of a video in which the occlusions have been removed. We present experimental results that compare the effectiveness and applicability of our two approaches. Keywords: video processing, image enhancement, occlusion removal, light microscopy.
1 Introduction
Video microscopy figures prominently in many fields of research, such as biology, pathology, materials science, and physics. In some situations, an experiment imposes constraints that introduce undesirable artifacts into the captured images. For example, using long working distance lenses enables focusing further into a specimen, but also increases the depth of field, consequently including more of the specimen in the final image. Figure 1 shows frames of a video in which a microbead is attached to the beating cilia of epithelial lung cells. In this experiment, the microscope focuses through the substrate and cells, and these components modulate the image of the moving cilia. Other artifacts, such as debris on the image sensor (also seen in Figure 1), slide, or cover slip, may contribute to the final image. A typical microscopy video consists of one or more moving “foreground” specimen objects and static “background” impurities. (The use of foreground and background does not imply which object is physically in front of another; we choose the convention that assigns the object of interest, the specimen, to the foreground.) The background impurities will obscure the foreground specimen as the specimen moves across the image plane. An occluding object and specimen become darker where they overlap, but rarely does an object absorb all light.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 125–132, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Ciliated epithelial lung cells move a bead in bright-field microscopy using a long working distance lens. The spot circled in the first frame is from dirt on the image sensor. The right-most image is the mean of 61 frames from this video.
In this paper, we present two methods for repairing video microscopy data that suffers from stationary partial occlusions. Our approaches use the image model of a bright-field microscope, and rely on motion to enable separation of the video into moving and stationary components. In addition to improving the visual clarity of microscopy videos, our technique is a useful pre-process for other image analysis tasks, such as optical flow computation. Our approaches have the added benefit—compared to background subtraction—of preserving the measured intensity relationships between image sensor locations, which is necessary for photometry. We compare the effectiveness of these methods, and describe which method suits different types of data.
2 Related Work
The details of image formation, such as noise and sensor calibration, are concerns in any sensitive imaging application [1]. Although sensor calibration considers image formation from the image sensor onwards, our work concentrates on image formation in the bright-field microscope before the image sensor. These methods, therefore, are complementary. Specifically, the linear response of the sensor should be validated before applying our techniques. Typical research microscope image sensors have such a calibrated response.

Background removal techniques for microscopy include background subtraction and flat-fielding [2]. These approaches use specimen-free calibration images to remove artifacts due to illumination variations by frame subtraction or division. Our method depends on the presence of a moving specimen and works on existing microscopy data without calibration. Hill et al. use a moving average image for background subtraction to improve the visual clarity of microscopy images [3]. Our methods perform this function and at the same time preserve the image intensity relationships required for photometry.

Schechner and Nayar compute the transmission map of a filter with spatially-varying transmittance [4]. Their initial estimate is based on image statistics, and we demonstrate in Section 3.2 how to improve this estimate. Their robust estimation relies on image registration with a rigid-body constraint, a condition that is rarely satisfied in biological microscopy. Shizawa’s multiple optical flow estimation technique also relies on rigid-body assumptions [5].
A significant body of computer vision research focuses on macroscopic occlusion detection and processing [6,7]. In the macroscopic case, occlusion is typically complete. Complete occlusion constitutes a nonlinearity in image formation that is manifested by a complete loss of signal in the captured image. In microscopy, however, occluding objects do not block light completely. In the multiplicative image model, each object absorbs a constant proportion of light, and the signal from the occluded object transmits through the occlusion. Throughout this paper, when we use the term “occlusion” we mean a partial occlusion as encountered in bright-field microscopy, unless otherwise specified.
3 Image Model
In a bright-field microscope, a light source is tuned to uniformly illuminate a specimen across the field of view. A series of lens assemblies (e.g. condenser, objective, and ocular) focus the light transmitted through the specimen onto an image sensor [2]. Light interacting with any object in the optical path is either absorbed, transmitted, or diffracted. Our first-order model ignores diffraction effects; we discuss the consequences of this simplification below. The image formed on the image sensor is therefore

$$ I(x) = L(x)\,T(x), \tag{1} $$

$$ T(x) = \{\, T(x) : 0 \le T(x) \le 1,\ \forall x \,\}, \tag{2} $$
where $L(x)$ is the (ideally uniform) illumination and $T(x)$ defines the total light transmission at each image location $x$. This constitutes a multiplicative light model for the bright-field microscope. Considered over time, the transmission map, $T(x, t)$, is a composite of transmission coefficients from stationary and moving objects. For a stationary transmission map, $T_c(x)$, and time-varying specimen transmission, $T_s(x, t)$, the image formed on the sensor is

$$ I(x, t) = L(x)\,T_c(x)\,T_s(x, t). \tag{3} $$

Given the stationary transmission map, a repaired video, without the presence of stationary occlusions, is obtained by

$$ I_r(x, t) = \frac{I(x, t)}{T_c(x)} = L(x)\,T_s(x, t). \tag{4} $$

The image repair problem reduces to finding an accurate estimate of the stationary transmission map, $T_c(x)$; we present two estimation methods in the following sections. Note that Equation 4 amplifies the noise as well as the signal present in the image sequence $I(x, t)$. The transmission map, therefore, provides a natural measure of relative error in the repaired intensity values.
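As a minimal sketch of the repair step in Equation 4, assuming frames and the transmission map are stored as equally sized 2D lists of floats, the division can be written as follows. The function names and the `min_transmission` guard are hypothetical and are not part of the model described here.

```python
def repair_frame(frame, transmission, min_transmission=1e-6):
    """Apply Equation 4: divide each pixel by the stationary transmission.

    min_transmission is a hypothetical guard against division by zero.
    """
    return [
        [i / max(t, min_transmission) for i, t in zip(row_i, row_t)]
        for row_i, row_t in zip(frame, transmission)
    ]


def relative_noise_gain(transmission, min_transmission=1e-6):
    """Factor 1 / Tc(x) by which the division scales noise at each pixel."""
    return [[1.0 / max(t, min_transmission) for t in row] for row in transmission]


# Tiny example: uniform illumination, one pixel under a 50% occlusion.
illumination = 200.0
t_c = [[1.0, 0.5], [1.0, 1.0]]   # stationary transmission map
t_s = [[0.9, 0.9], [1.0, 0.8]]   # specimen transmission in one frame
observed = [
    [illumination * c * s for c, s in zip(row_c, row_s)]
    for row_c, row_s in zip(t_c, t_s)
]
repaired = repair_frame(observed, t_c)   # recovers illumination * t_s
```

The second function makes the noise remark concrete: a pixel with transmission 0.5 has its noise doubled along with its signal.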
As a consequence of ignoring diffraction effects, occlusions on the image sensor are modeled better than occlusions elsewhere in the optical path. Debris on lens elements is often completely out of focus, and its effects are distributed across the image plane. Light diffracted from partially-focused, stationary debris, however, may add intensity to a local image region. Experimental results in Section 5.2 demonstrate that our simplification yields useful information even in certain situations that exhibit diffraction effects.

3.1 Approach 1: A Local Transmission Metric
Within the mean intensity image from all frames in a microscopy video, as seen in Figure 1, the contribution of moving foreground objects is approximately zero-mean within a local neighborhood. The predominant spatial variance is due to stationary background objects. This observation leads to a method for estimating the stationary transmission map. We assume that for a sequence of $N$ images the mean intensity, $\bar{I}(x)$, is locally constant except at the boundaries of stationary background objects. Comparing the mean intensity of an occluded pixel to that of its neighbors determines the amplification necessary to bring the occluded pixel in line with its neighbors. One tested comparison operator is the median of the pixels in a neighborhood $\Omega$,

$$ I_l(x) = \mathrm{Median}\left\{ \bar{I}(x_i);\ x_i \in \Omega \right\}. \tag{5} $$

The per-pixel mean intensity provides a measure of how much light was transmitted to an image location while the neighborhood median provides an estimate of how much light was transmitted to the pixels in the same neighborhood. Careful choice of the neighborhood size, based on the expected size of occlusions, leads to an estimate of the stationary transmission map,

$$ T_c(x) = \frac{\bar{I}(x)}{I_l(x)}. \tag{6} $$
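The local estimate in Equations 5 and 6 can be sketched in a few lines, assuming frames are 2D lists of strictly positive floats. Clipping the square neighborhood at the image border is an assumption of this sketch, and the function names are illustrative only.

```python
from statistics import median


def mean_image(frames):
    """Per-pixel mean over all frames."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)] for r in range(rows)]


def local_transmission(frames, radius):
    """Estimate Tc(x) as the mean intensity over the neighborhood median.

    radius is half the width of the square neighborhood; the neighborhood
    is clipped at the image border (an assumption of this sketch).
    """
    mean = mean_image(frames)
    rows, cols = len(mean), len(mean[0])
    estimate = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            neighborhood = [
                mean[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out_row.append(mean[r][c] / median(neighborhood))
        estimate.append(out_row)
    return estimate


def make_frame(level):
    """5 x 5 test frame with a single pixel at 50% transmission."""
    frame = [[level] * 5 for _ in range(5)]
    frame[2][2] = 0.5 * level
    return frame


frames = [make_frame(100.0), make_frame(120.0)]
estimate = local_transmission(frames, radius=1)
```

In this toy case the occluded pixel is smaller than the neighborhood, so the median ignores it and the ratio isolates its transmission. An occlusion that fills more than half of the neighborhood would shift the median itself, which is why the neighborhood size has to follow the expected occlusion size.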
We refer to this technique as the Local Transmission Metric (LTM) method.

3.2 Approach 2: A Global Transmission Metric
We next describe a framework for image intensity comparisons and use it to formulate an alternative estimate of the stationary transmission map. For a single image formed by a multiplicative light model, as in Equation 1, the ratio of image intensities at two image locations measures relative light transmission. The logarithm transforms this ratio to a linear difference, as noted by Shizawa [5]. The 1D central ratio of intensity, analogous to the discrete central difference, is

$$ R(x) = \left( \frac{I(x + \Delta x)}{I(x - \Delta x)} \right)^{\frac{1}{2\Delta x}}, \tag{7} $$

$$ \log R(x) = \frac{\log[I(x + \Delta x)] - \log[I(x - \Delta x)]}{2\Delta x} \approx \frac{d \log[I(x)]}{dx}, \tag{8} $$

which is the derivative of the logarithm of image intensity. In 2D, the gradient replaces the derivative. We cannot obtain an accurate measure of the local stationary transmission gradient using a single frame from a video. Therefore, we use all available frames to compute a robust estimate,

$$ \log R(x) = \mathrm{Est}\left[ \nabla \log[I(x, t_i)];\ i = 1..N \right]. \tag{9} $$
Here, Est is any estimation of the gradient of the logarithm of intensity over all $N$ frames, for example the mean or median. Other robust estimators may improve this method, depending on the nature of the motion in the video [8]. A successful estimator distinguishes between elements from stationary occlusions and the moving specimen. To compute the logarithm of transmission, we integrate this local estimate over the image plane and convert back to intensity space,

$$ \log T_c(x) = \iint_{\forall x} \mathrm{Est}\left[ \nabla \log[I(x, t_i)];\ i = 1..N \right] dx\, dy, \tag{10} $$

$$ T_c(x) = \exp\left( \log T_c(x) \right). \tag{11} $$

We refer to this core method as the Gradient Logarithm Transmission (GLT) method and its variants as GLT-mean and GLT-median. Note that if the robust estimation, Est, is linear, the estimation and gradient commute, simplifying Equations 10 and 11 to

$$ T_c(x) = \exp\left( \mathrm{Est}\left[ \log I(x, t_i);\ i = 1..N \right] \right). \tag{12} $$
For the mean estimator, Equation 12 reduces to the geometric mean of all frames. No such simplification is possible for the median estimator.
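A minimal sketch of the mean-estimator case, assuming frames are 2D lists of strictly positive floats (the function name is illustrative): averaging the log intensities per pixel and exponentiating yields the geometric mean of the frames.

```python
import math


def glt_mean_transmission(frames):
    """Exp of the per-pixel mean log intensity, i.e. the geometric mean.

    Intensities must be strictly positive for the logarithm to be defined.
    The result still contains the illumination and an overall scale factor.
    """
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [math.exp(sum(math.log(f[r][c]) for f in frames) / n) for c in range(cols)]
        for r in range(rows)
    ]


# One-row example: the second pixel sits under a 50% stationary occlusion,
# and the overall brightness changes between the two frames.
frames = [[[100.0, 50.0]], [[400.0, 200.0]]]
geometric_mean = glt_mean_transmission(frames)
ratio = geometric_mean[0][1] / geometric_mean[0][0]   # relative transmission
```

The ratio between the two pixels recovers the relative transmission of the occlusion even though the frame brightness varies, because the common factor cancels in the geometric mean.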
4 Implementation
We implemented both of the techniques outlined in Section 3 as well as mean-image background subtraction in C++ using the Insight Toolkit [9] (ITK) as a framework. Some implementation details deserve mention. Frankot and Chellappa demonstrate how to enforce integrability of derivative estimates, as in Equation 10 [10]. This method computes the periodic surface with integrable derivatives closest to the derivative estimates. To handle occlusions with non-periodic transmission maps, we pad our image data with a Hanning window [11]. The integration and subsequent exponentiation in Equation 11 mean that we only recover the transmission map up to a scale factor. One choice for this scale factor enforces our assumption that objects only absorb light by setting the maximum transmission to one. Mean image background subtraction involves a shifting of intensity values, and consequently intensity ratios are not preserved. In our background subtraction implementation, we maintain the mean image intensity between input and output image sequences.
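The integrate-then-exponentiate step and the unit-maximum scale choice can be illustrated in one dimension. The sketch below is not the C++/ITK implementation: it works on scanlines, uses a forward difference and the median estimator, and replaces the Frankot and Chellappa integration with a plain running sum, which is sufficient in 1D. All names are illustrative.

```python
import math
from statistics import median


def glt_median_1d(scanlines):
    """Median log-gradient over frames, integrated and exponentiated (1D).

    scanlines holds one list of strictly positive floats per frame.
    """
    width = len(scanlines[0])
    logs = [[math.log(v) for v in line] for line in scanlines]
    # Forward difference of log intensity, then the median across frames.
    gradient = [median(lg[x + 1] - lg[x] for lg in logs) for x in range(width - 1)]
    # Integrate with a running sum; the integration constant is arbitrary.
    log_t = [0.0]
    for g in gradient:
        log_t.append(log_t[-1] + g)
    # Fix the scale factor so that the maximum transmission equals one.
    peak = max(log_t)
    return [math.exp(v - peak) for v in log_t]


# Stationary occlusion covering pixels 2 and 3; a dark specimen spot
# (transmission 0.7) visits a different pixel in each of the six frames.
true_t = [1.0, 1.0, 0.5, 0.5, 1.0, 1.0]
scanlines = [
    [100.0 * t * (0.7 if x == frame else 1.0) for x, t in enumerate(true_t)]
    for frame in range(len(true_t))
]
recovered = glt_median_1d(scanlines)
```

Each gradient sample is disturbed by the moving spot in only two of the six frames, so the median rejects it and the stationary map is recovered.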
5 Evaluation
We evaluated the microscopy video repair algorithms using both synthetic and real data. Synthetic data provides a known ground truth by which to quantitatively assess the techniques. Real microscopy data must be qualitatively assessed.

5.1 Simulated Data
Our simulated data is generated from a normally-distributed noise field undergoing rigid transformations. Each test video consists of 180 frames of 256 × 256 pixels. To simulate microscopy image formation, we modulate these image sequences with fixed patterns as seen in Figure 2. We apply each of our video repair algorithms and compute the mean peak signal-to-noise ratio (PSNR) over all frames [12]. Higher PSNR values indicate a more faithful reconstruction of an image, and PSNR values between 30 and 40 dB are typical in the image processing literature [13].
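For reference, the mean PSNR over a sequence can be computed as in the sketch below. The peak value of 255 is an assumption (the peak used in the experiments is not stated here), and the function names are illustrative.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    count = 0
    squared_error = 0.0
    for ref_row, test_row in zip(reference, test):
        for a, b in zip(ref_row, test_row):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


def mean_psnr(reference_frames, test_frames, peak=255.0):
    """Average of the per-frame PSNR over a whole sequence."""
    values = [psnr(r, t, peak) for r, t in zip(reference_frames, test_frames)]
    return sum(values) / len(values)


# Two-frame example with a mean squared error of 50 in each frame.
truth = [[[100.0, 100.0], [100.0, 100.0]]] * 2
repaired = [[[110.0, 90.0], [100.0, 100.0]]] * 2
score = mean_psnr(truth, repaired)
```

Because PSNR depends on absolute intensity differences, a reconstruction that is correct only up to a gain factor scores poorly unless the gain is matched first.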
Fig. 2. Sample test data of transforming noise fields. Left to right: no occlusion, small occlusions with varying transmission, large occlusion, step function occlusion.
Table 1 shows the reconstruction results for a subset of test cases. Because PSNR is sensitive to gain, we choose to scale the stationary transmission maps estimated by the GLT methods to obtain the highest PSNR. Using the case of a rotating noise field as an example, the data set with no occlusion indicates the quality of reconstruction obtainable by each method. All methods match performance in the unoccluded case when small occlusions of varying modulation are added to the video. The LTM method performs best in this example, followed closely by the GLT-mean. The presence of a large occlusion exposes the weakness in the LTM method, as large occlusions require large neighborhoods to ensure an accurate estimate of the unobstructed intensity at each image location. A neighborhood with a radius of 100 pixels was required to repair the test case with large occlusion. This took over an hour on a modern desktop computer, an unacceptable penalty in most situations. The GLT methods do not suffer from this scale dependency. The shape and non-periodic nature of the transmission map in the step-function modulated test case cannot be recovered by the LTM and non-windowed GLT-median methods. Padding with the Hanning window, as discussed in Section 4, resolves this issue for the GLT-median method.
Table 1. Mean PSNR (dB) measurements for repairing simulated microscopy data; r is half the width of the square neighborhood used in the LTM method; window refers to padding with a Hanning window. The best result for each data set is marked with an asterisk.

Data set: Noise
Transform type        Rotation                           Translation
Occlusion type        none    small   large   step       none    small   large   step
Background subtract   42.12   38.98   16.04   12.20      38.98   18.32   14.51   11.77
LTM, r=15             46.12*  46.16*  13.04   10.37      43.36*  43.44*  13.33   10.66
LTM, r=100            -       -       41.65   10.38      -       -       38.47   10.68
GLT-mean              41.83   41.83   41.83*  41.79*     39.87   39.88   39.85*  39.61*
GLT-median            39.66   39.65   39.66   12.10      36.62   36.61   36.62   12.70
GLT-median, window    35.19   35.20   35.15   35.20      33.39   33.39   33.37   33.28

5.2 Microscopy Data
Figure 3 shows the result of repairing the video of beating cilia seen in Figure 1. The LTM method succeeds in removing most of the occlusion in the upper left of the frame. A halo remains from this diffraction effect as our light model only handles occlusions that absorb light. The GLT methods seem not to suffer from this limitation because they form a complete model of all non-moving components of the video. This diffraction, however, may affect the intensity scaling of the final image; here, we preserve the mean image intensity. Diffraction effects are seen in the moving bead as well, but this foreground component has no effect on our repair methods. Background subtraction and the GLT methods also remove the larger stationary background components from the video, which accentuates the motion of the cilia. The GLT-median is the only method that does not suffer from ghosting of the moving bead in the bottom-center of the images. This example highlights the strength of the GLT-median—when a foreground object covers an image location for fewer than half the frames in a video, its effect does not perturb the background of the repaired video.
Fig. 3. Repair of cilia microscopy video, 170 × 170 pixels, 61 frames. Left to right: original, frame repaired with background subtraction, LTM (r = 10), GLT-mean, GLT-median.
6 Discussion
We have demonstrated two methods for repairing microscopy videos that exhibit advantages over background subtraction. Each method and variation has strengths which lead it to outperform the others in different scenarios. The LTM method performs best on videos with small, bounded occlusions, but handles large occlusions and diffraction effects poorly. The GLT methods work equally well on any size occlusion and at least visually improve some diffraction effects. The GLT-median estimation outperforms the GLT-mean estimation when a moving scene object covers an image region for less than half of the video. Research is ongoing to combine our repair method with optical flow to improve tracking in microscopy and provide more reliable reconstruction under total occlusion. Acknowledgments. This work is supported by NIH NIBIB grant P41EB002025-23A1. The authors thank David Hill for the use of the microscopy images.
References

1. Tsin, Y., Ramesh, V., Kanade, T.: Statistical calibration of CCD imaging process. IEEE ICCV 01, 480 (2001)
2. Inoué, S., Spring, K.R.: Video Microscopy: The Fundamentals, 2nd edn. Springer, Heidelberg (1997)
3. Hill, D.B., Plaza, M.J., Bonin, K., Holzwarth, G.: Fast vesicle transport in pc12 neurites: velocities and forces. European Biophysics Journal 33(7), 623–632 (2004)
4. Schechner, Y.Y., Nayar, S.K.: Generalized mosaicing: High dynamic range in a wide field of view. IJCV 53(3), 245–267 (2003)
5. Shizawa, M., Mase, K.: Simultaneous multiple optical flow estimation. ICPR 1, 274–278 (1990)
6. Sun, J., Li, Y., Kang, S.B., Shum, H.Y.: Symmetric stereo matching for occlusion handling. IEEE CVPR 2, 399–406 (2005)
7. Jia, J., Wu, T.P., Tai, Y.W., Tang, C.K.: Video repairing: Inference of foreground and background under severe occlusion. IEEE CVPR 01, 364–371 (2004)
8. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge (1992)
9. Ibanez, L., Schroeder, W., Ng, L., Cates, J.: The ITK Software Guide, 2nd edn. Kitware, Inc. (2005), ISBN 1-930934-15-7, http://www.itk.org/ItkSoftwareGuide.pdf
10. Frankot, R.T., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. IEEE PAMI 10(4), 439–451 (1988)
11. Harris, F.J.: On the use of windows for harmonic analysis with the discrete fourier transform. Proceedings of the IEEE 66(1), 51–83 (1978)
12. Bovik, A.: Handbook of image and video processing, 2nd edn. Academic Press, San Diego (2005)
13. Antonini, M., Barlaud, M., Mathieu, P., Daubechies, I.: Image coding using wavelet transform. IEEE TIP 1, 205–220 (1992)
A Modular Approach for Automating Video Analysis

Gayathri Nadarajan1 and Arnaud Renouf2

1 CISA, School of Informatics, University of Edinburgh, Scotland [email protected]
2 GREYC Laboratory (CNRS UMR 6072), Caen cedex France [email protected]
Abstract. Automating the steps involved in video processing has yet to be tackled with much success by vision developers and knowledge engineers. This is due to the difficulty in formulating vision problems and their solutions in a generalised manner. In this collaborative work, we introduce a modular approach that utilises ontologies to capture the goals, domain description and capabilities for performing video analysis. This modularisation is tested on real-world videos from an ecological source and proves useful in conceptualising and generalising video processing tasks. On a more significant note, this could be used in a framework for automatic video analysis in emerging infrastructures such as the Grid. Keywords: Knowledge-Based Vision, Ontological Engineering, Automatic Video Analysis, Ontology-Based Systems.
1 Introduction
The field of video analysis is becoming more and more important with the fast advancement in vision technologies and the increasing size of real-time data that need to be processed efficiently. Although the majority of vision developers focus on improving low-level techniques and algorithms that perform with extreme accuracy, there is a lack of effort in conceptualising the tasks involved in the process of video analysis itself, even though much of image processing involves sequences of repeated sub-tasks. Ontological engineering [1], on the other hand, is concerned with providing formal conceptualisations of entities that are relevant to a particular problem domain so that these representations and the relationships between them are made explicit. Applying higher level knowledge or semantics would allow for better reasoning on the concepts by the sharing and reuse of existing knowledge and also lead to the discovery of new knowledge. To date, the problem of conceptualising image processing solutions has been tackled by the vision community in limited ways [2,3,4,5,6] and less addressed by the knowledge engineering community. However, this poses difficulty to vision-based researchers as they lack the expertise to modularise image processing problems using semantic-based approaches (e.g. ontologies). Thus a collaborative effort between ontology engineers and vision developers could bridge this current research gap that both fields have yet to tackle with much success. The main challenges lie in the fact that the most recent knowledge representation and reasoning methods (e.g. Semantic Web technologies, such as OWL [7]) lack maturity and the image processing formulation is hard to generalise [8,9]. In this work we tackle the problem of providing a modularised approach for video analysis using an example test case from the EcoGrid project [10]. This effort improves and extends existing work by the authors [9,11] by adding process modelling aspects, richer modularisation of ontologies while making use of extensive video processing terminologies. The long-term effect of this is to provide a semantic-based automatic video analysis framework that is flexible which would prove useful for emerging infrastructures such as the Grid [12].

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 133–140, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Related Work
Attempts to solve image processing problems automatically were conducted within knowledge-based systems such as LLVE [8], CONNY [13], OCAPI [14], MVP [15] and BORG [16]. However, these systems remain limited to a list of restricted and well known goals. Therefore a priori knowledge on the application context (domain-specific concepts such as sensor type, noise, lighting, etc.) and on the goal to achieve were implicitly encoded in the knowledge base. This implicit knowledge restricts the range of application domains for these systems and it is one of the reasons for their failure [17]. More recent approaches bring more explicit modelling [2,3,4,5,6] but they are all restricted to the modelling of business objects description for tasks such as detection, segmentation, image retrieval, image annotation or recognition applications. They use ontologies that provide the concepts needed for this description (a visual concept ontology for object recognition in [3,5], a visual descriptor ontology for semantic annotation of images and videos in [18] or image processing primitives in [2]) or they capture the business knowledge through meetings with the specialists (e.g. use of the NIAM/ORM method in [4] to collect and map the business knowledge to the vision knowledge). But they do not completely tackle the problem of the application context description (just briefly in [3,5]) and the effects of this context on the images (environment, lighting, sensor, image format). Moreover, they do not define the means to describe the image content when objects are a priori unknown or unusable, for instance in robotics, image retrieval or restoration applications. They also suppose that the objectives are well known (to detect, to extract or to recognise an object with a restricted set of constraints) and therefore they do not address their specification. To overcome these limitations, we aim at designing a modular approach for automating video and image processing.
In such an approach, we have to make the formulation of the problem to be solved and the knowledge used by image processing experts during the design of the solution explicit. Vision experts design applications through trial-and-error cycles by implementing new solutions from scratch each time instead of reusing already developed ones. The lack of application formulation and modelling is a reason for this. Image processing experts do not realise a complete and rigorous formulation of the applications. Therefore the reusability of the applications is very poor and modularisation is needed to improve this situation.
3 Modularisation
Our modularisation of the image processing field clearly separates the formulation of the problems and the solutions to be produced. The user, represented by domain experts, will be involved in providing the problem descriptions in high level terms while the system (utilising a Planner) will provide the solutions in low level vision terms. Domain knowledge has to be acquired from the domain expert and formalised in order to be able to propose a way of processing the images taken under consideration. Our first work [9] studied the formulation of image processing applications in order to propose an ontology that is used to express the objective of the domain expert (the goal part of the ontology) and define the image class to be processed (the input images and their variability description using the domain part of the ontology). It is used in the Hermes Project [19] which proposes a human-machine interface dedicated to domain experts inexperienced in the image processing field. Using this interface, they are able to formulate their goals and the description of their images using their domain knowledge. The goal, coupled with the domain description, will then be used to derive a set of image processing tasks that will solve the user request. Each task could be further decomposed into sub-tasks. We associate an image processing tool as possessing the capability to achieve a sub-task. In our current approach, we have opted to incorporate three ontologies to separate the goals from the capabilities and to provide meaning for the process within a semantically integrated system. Each ontology holds a vocabulary of classes of things that it represents and the relationships between them. The use of ontologies is beneficial because they provide a formal and explicit means to represent concepts, relationships and properties in a domain. They play an important role in fulfilling semantic interoperability, as highlighted in Section 1. 
A system with full ontological integration has several advantages. It allows for cross-checking between ontologies, addition of new concepts into the system and discovery of new knowledge within the system. A framework incorporating the ontologies with a Planning mechanism is proposed in [11].

3.1 Goal Ontology
The goal ontology contains the high level goals and constraints that the user will communicate to the system. These are represented by the concepts Goal, Constraint Category, Constraint Descriptor and Constraint Qualifier in Fig. 1. The concept Task/Process links the goal to the processes that are associated with it. The instances of processes are contained within a process library [11] and will be selected based on task decomposition and performance criteria. These tasks will also be linked with the capability ontology (Section 3.3).

Fig. 1. Goal Ontology

3.2 Domain Ontology
The domain ontology (Fig. 2) describes the concepts and relationships of the application area, such as the lighting conditions, colour information, position, orientation as well as spatial and temporal aspects. The user will input the domain description along with the goal of the problem. The system will use this ontology to build the user request based on the domain description before feeding it into the Planner which will be responsible for the solution generation.
Fig. 2. Domain Ontology
3.3 Capability Ontology
The capability ontology (Fig. 3) contains the classes of video and image processing techniques. Each technique (or capability) is associated with one or more tools. A tool is a software component that can perform a video or image processing task independently, or a technique within an integrated vision library that may be invoked with given parameters. This ontology will be used directly by the Planner in order to identify the tools that will be used to solve the problem.
Fig. 3. Capability Ontology
4 Test Case
The ontologies formulated above have been applied to a real-world scenario in the EcoGrid [10] that utilises state-of-the-art technologies to establish a cyberinfrastructure for ecological research. This includes the integration of geographically distributed sensors, computing power and storage resources into a uniform and secure platform. Scientists can conduct data acquisition, data analysis and data sharing on this platform. One of the challenges is that a vast amount of raw data of varying qualities needs to be analysed efficiently and effectively (Fig. 4). Manual processing by ecologists, however, would be too time-consuming as a minute’s video clip will take 15 minutes’ analysis time. This would not be feasible for data that amount to 1.86 Terabytes per year. Thus these videos must be processed automatically and as efficiently as possible. A full set of requirements for providing such a mechanism has been outlined in [11]. This includes an adaptive, flexible and generic architecture to allow for various video processing tasks to be conducted in an intelligent and optimal manner.

Fig. 4. Sample images extracted from EcoGrid video clips. From left to right: cluttered, partial object present, whole object present and non-presence of activity.

Walkthrough. Based on the devised ontologies, a walkthrough on how they are used to provide different levels of vocabulary for the users, vision tools and processes in a seamless and related manner is outlined here. The user, who is an ecologist, may have a high level goal such as “Detect fishes in the videos” in mind. This is represented and selected via the following selection-value pairs:

(Goal: Detection)
[Detail Level: Occurrence = all occurrences]
[Performance Criteria: Processing Time = real time]

As a first step, the user interacts with the system by providing input values for the goal and constraints. The goal can be one of the goals specified in the goal ontology (Fig. 1), while the constraints are additional parameters to specify rules or restrictions that apply to the goal. These include qualifiers, e.g. Acceptable Errors, Optimisation Criteria, Detail Level and Quality Criteria, contained within the goal ontology. Then the user describes the images to be processed. This description is given using the system, which proposes descriptors contained in the domain ontology (Fig. 2). In our scenario, a description of the acquisition context and of the semantic content of the images is obtained: the lighting conditions change very slowly, the camera is fixed (and also the background), images are degraded by a blocking effect due to the compression, and the fishes are regions whose colours are different from the background and are bigger than a minimal area (because when they are too small they are unusable).
Thus, this goal is interpreted as “The detection of all the occurrences of non-background regions on a fixed background.” As soon as the formulation of the user’s problem is made, a sequence of processes for execution using task decomposition will be sought by a Planning mechanism. Fig. 5 contains a breakdown of high-level tasks for the goal detection.
Fig. 5. High-Level Sequence of Process for Goal “Detect Fishes in the Videos”
Each task within this sequence could be further decomposed into sub-tasks. For instance, the ‘Segmentation’ task involves ‘Background Subtraction’ (under Task/Process in the goal ontology), which is done in three steps: background model construction, differencing of the current frame and the background model, and background model update. All these sub-tasks are point arithmetic operations in the capability ontology. When a sub-task can no longer be decomposed, the tool or technique identified for performing it can be applied directly to the video clip.
A Modular Approach for Automating Video Analysis
The tools and techniques available within the system are represented in the capability ontology. Where more than one tool is available to perform a sub-task, the tool with the best performing capability is selected. In our case, the background model is constructed by averaging a series of successive images without any fish. Next, the difference between the current image and the background model is obtained. The model is then updated using the current image to avoid future false detections caused by small changes in lighting conditions. Elimination of false detections (a sub-task of the classification step) is achieved using a threshold on the minimal area given by the user. Fig. 6 shows two sample results obtained using the solution proposed by the modularised approach. As the detection is accurate, the same principles could be extended to more complicated video processing tasks (such as motion analysis) as long as the ontologies capture the goals, domain descriptions and capabilities.
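The background subtraction steps described above can be sketched as follows. This is only an illustrative sketch, not the system's actual tools: the function names, the grey-level threshold `diff_threshold`, the running-average update with weight `alpha`, and the 4-connected region labelling are all assumptions introduced here.

```python
import numpy as np

def build_background(frames):
    """Average a series of fish-free frames into a background model."""
    return np.mean(np.stack(frames).astype(np.float64), axis=0)

def detect_foreground(frame, background, diff_threshold):
    """Difference the current frame against the model and threshold it."""
    return np.abs(frame.astype(np.float64) - background) > diff_threshold

def update_background(background, frame, alpha=0.05):
    """Blend the current frame into the model (running average, assumed rule)."""
    return (1.0 - alpha) * background + alpha * frame.astype(np.float64)

def remove_small_regions(mask, min_area):
    """Keep only 4-connected regions with at least min_area pixels."""
    rows, cols = mask.shape
    keep = np.zeros((rows, cols), dtype=bool)
    seen = np.zeros((rows, cols), dtype=bool)
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c] or seen[r, c]:
                continue
            # Flood-fill one region starting from (r, c).
            stack, region = [(r, c)], []
            seen[r, c] = True
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        stack.append((ny, nx))
            if len(region) >= min_area:
                for y, x in region:
                    keep[y, x] = True
    return keep
```

In use, the mask returned by `detect_foreground` would be passed through `remove_small_regions` with the user's minimal area, and `update_background` would be called once per frame.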
Fig. 6. Two Sample Results. From Left to Right for Each Row: Original Image, Background Model, Classified Result.
5 Conclusion
The formulation of the video analysis problem description and solution could be tackled by modularising them using separate but inter-related ontologies. The three ontologies presented (goal, domain and capability) are extensive and describe the different aspects of video analysis in meaningful ways. This is useful because it provides a means to formalise the video analysis process and promotes reusability of applications. We have demonstrated the application of this modularised approach using an example from an ecological domain. The ontologies are high-level and general enough to be tailored by vision experts into application ontologies for solving tasks more specific to their problem domains. In a broader context, this effort is a significant step towards providing a semantically-enhanced framework in emerging infrastructures such as the Grid, which would allow for real-time distributed image processing.
G. Nadarajan and A. Renouf
References
1. Gomez-Perez, A., Fernandez-Lopez, M., Corcho, O.: Ontological Engineering: With Examples from the Areas of Knowledge Management, E-Commerce and the Semantic Web, 1st edn. (2004)
2. Nouvel, A., Dalle, P.: An Interactive Approach For Image Ontology Definition. In: 13ème Congrès de Reconnaissance des Formes et Intelligence Artificielle, Angers, France, pp. 1023–1031 (2002)
3. Maillot, N., Thonnat, M., Boucher, A.: Towards Ontology Based Cognitive Vision (Long Version). Machine Vision and Applications 16(1), 33–40 (2004)
4. Bombardier, V., Lhoste, P., Mazaud, C.: Modélisation et intégration de connaissances métier pour l’identification de défauts par règles linguistiques floues. Traitement du Signal 21(3), 227–247 (2004)
5. Hudelot, C.: Towards a Cognitive Vision Platform for Semantic Image Interpretation. Application to the Recognition of Biological Organisms. PhD thesis, Nice-Sophia Antipolis University (2005)
6. Town, C.: Ontological Inference for Image and Video Analysis. Mach. Vision Appl. 17(2), 94–115 (2006)
7. McGuinness, D., van Harmelen, F.: OWL Web Ontology Language. World Wide Web Consortium (W3C) (2004), http://www.w3.org/TR/owl-features/
8. Matsuyama, T.: Expert Systems for Image Processing: Knowledge-Based Composition of Image Analysis Processes. CVGIP 48(1), 22–49 (1989)
9. Renouf, A., Clouard, R., Revenu, M.: How to Formulate Image Processing Applications? In: Proceedings of the International Conference on Computer Vision Systems, Bielefeld, Germany (2007)
10. EcoGrid, National Center for High Performance Computing, Taiwan: http://ecogrid.nchc.org.tw/
11. Nadarajan, G., Chen-Burger, Y.H., Malone, J.: Semantic-Based Workflow Composition for Video Processing in the Grid. In: IEEE/WIC/ACM International Conference on Web Intelligence, pp. 161–165 (2006)
12. Foster, I.: The Grid 2 – Blueprint for a New Computing Infrastructure, 2nd edn. Morgan Kaufmann, San Francisco (2004)
13. Liedtke, C., Blömer, A.: Architecture of the Knowledge Based Configuration System for Image Analysis “Conny”. In: ICPR’92, pp. 375–378 (1992)
14. Clément, V., Thonnat, M.: A Knowledge-Based Approach to Integration of Image Procedures Processing. CVGIP: Image Understanding 57(2), 166–184 (1993)
15. Chien, S., Mortensen, H.: Automating Image Processing for Scientific Data Analysis of a large Image Database. IEEE PAMI 18(8), 854–859 (1996)
16. Clouard, R., Elmoataz, A., Porquet, C., Revenu, M.: Borg: A Knowledge-Based System for Automatic Generation of Image Processing Programs. IEEE PAMI 21(2), 128–144 (1999)
17. Draper, B., Hanson, A., Riseman, E.: Knowledge-directed vision: Control, learning, and integration. Proc. of IEEE 84, 1625–1681 (1996)
18. Bloehdorn, S., Petridis, K., Saathoff, C., Simou, N., Tzouvaras, V., Avrithis, Y., Handschuh, S., Kompatsiaris, Y., Staab, S., Strintzis, M.G.: Semantic Annotation of Images and Videos for Multimedia Analysis. In: Gómez-Pérez, A., Euzenat, J. (eds.) ESWC 2005. LNCS, vol. 3532, pp. 592–607. Springer, Heidelberg (2005)
19. Renouf, A.: Hermès - a human-machine interface for the formulation of image processing applications. http://www.greyc.ensicaen.fr/~arenouf/Hermes
Rectified Reconstruction from Stereo Pairs and Robot Mapping
Antonio Javier Gallego, Rafael Molina, Patricia Compañ, and Carlos Villagrá
Grupo de Informática Industrial e Inteligencia Artificial, Universidad de Alicante, Ap.99, E-03080, Alicante, Spain
{ajgallego, rmolina, company, villagra}@dccia.ua.es
Abstract. The reconstruction and mapping of real scenes is a crucial element in several fields such as robot navigation. Stereo vision can be a powerful solution. However the perspective effect arises, as well as other problems, when the reconstruction is tackled using depth maps obtained from stereo images. A new approach is proposed to avoid the perspective effect, based on a geometrical rectification using the vanishing point of the image. It also uses sub-pixel precision to solve the lack of information for distant objects. Finally, the method is applied to map a whole scene, introducing a cubic filter.
1 Introduction
Nowadays, a central aspect of artificial intelligence research is the perception of the environment by artificial systems. It has been considered one of the most important problems for effective autonomous robotic navigation and reconstruction [1]. Most research on robot mapping makes use of powerful computers and expensive sensors, such as scanning laser rangefinders. Nevertheless, other sensors, such as stereoscopic sensors, can be used. Stereo cameras are cheap and provide both range and appearance information.

Several authors use stereo vision and disparity images to solve the 3D reconstruction or mapping problems. For instance, a first solution to three-dimensional reconstruction with stereo technology explores the possibility of composing several 3D views from the camera transforms, to build the so-called “3D evidence grid” [2]. Other approaches infer 3D grids from stereo vision because appearance information is not provided by range finders; hence, they add an additional camera to their mobile robots [3,4]. Moreover, a 3D recognition module could be added to identify some objects. This technique is not exclusive to robotics; it could be used in other applications such as automatic machine guidance or the detection and estimation of vehicle movement [5].

Any image taken by a camera is deformed by the conical perspective effect, so direct reconstruction generates scenes with an unreal aspect. Very few works focus on creating a good reconstruction and on obtaining a real appearance of the scene. Some interesting works can be found [6,7], but none of them makes any type of perspective rectification. Specific objects are reconstructed instead of the whole scene, so the real structure of the environment is not recovered. Some naïve perspective rectifications have already been used in other fields, to rectify roads and to obtain their real appearance [8].

This work is centred on the reconstruction of the structure of the scene showing its real aspect, using the information provided by the stereo images and the disparity maps (in fact depth maps, their duals, are used). Our proposal makes no assumptions about the scene or the object structure, it does not segment objects trying to identify known shapes, only a stereo pair is needed, and it is not correspondence dependent. The perspective rectification method eliminates the conical perspective of the scene and removes the camera orientation. In this way the algorithm recovers the structure of the scene and some crucial information such as object geometry, volume and depth.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 141–148, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Process Scheme
The reconstruction and mapping methods follow the process scheme shown in figure 1. The upper part of the figure presents the reconstruction method (section 3). Given a pair of stereo images ($LI_k$ and $RI_k$, where $k$ is an index indicating the current image within a sequence), a depth map $D_k$ and the vanishing point position $VP_k$ are calculated. Then, a rectification process is applied in order to remove the effect of the conical perspective. The result of this stage is a rectified 3D matrix of voxels $R_k$ representing the reconstructed scene. The lower part of the figure represents the mapping method (section 4). Every reconstruction $R_k$ is accumulated to finally obtain a whole mapping representation of the environment $M$. The camera position is needed to carry out this final step.
Fig. 1. Structure of both 3D reconstruction and mapping
3 3D Reconstruction Method
3.1 Perspective Rectification Using a Vanishing Point
Let $LI$ and $RI$ (the subindex $k$ is dropped for simplicity) be a pair of stereo images (left and right image) of $m \times n$ pixels, and $D$ a dense depth map obtained by any method (in this approach, the depth image is computed using multi-resolution and an energy function [9]). Each value $D(x, y)$ of the depth map contains the depth associated with the pixel $(x, y)$ of the reference image (left image) [10,11,12,13]. The reconstruction of the scene can be done by labelling a 3D matrix of voxels $R$ representing the spatial occupation. However, due to the conical perspective, this would produce an unwanted effect in the reconstruction (figure 2(a) and 2(b)). So, to perform the 3D reconstruction correctly, the perspective must be rectified by correcting the pixels' coordinates. In this way the result will show the same aspect as the real scene (figure 2(c)).
Fig. 2. Rectification scheme: (a) shows the effect of the conical perspective, (b) shows a non-rectified scene seen from above (with only x and z coordinates shown), and (c) shows the result which is desired after rectification (with parallel walls). This image also illustrates the compass angle of the camera in the case of a lateral view.
Figure 2(b) shows the scheme of the process, in which the point $Q$ (the pixel currently being processed, taken from the input depth map) with coordinates $(x, y, D(x, y))$ is rectified to obtain $Q'$. The position of the scene vanishing point $VP$ is calculated using the method proposed in [14], which uses a Bayesian model that combines knowledge of the 3D geometry of the world with statistical knowledge of edges in images. The method returns an angle (called $\Psi$) which defines the orientation of the camera. This angle is transformed to Cartesian coordinates to obtain the position $(x, y)$ of the scene vanishing point $VP$. For the depth of $VP$, the maximum depth value ($D_{max}$) of the whole depth map is used. Once $VP$ is obtained, a line $L$ is traced through $VP$ and $Q$. Next, the intersection of the line $L$ with the x-y plane is calculated, giving the point $P$. Starting from $P$, the line $L$ is rotated to be perpendicular to the x-y plane. After this process, the new point $Q' = (x', y', z')$ is calculated as follows:

$$
\begin{cases}
x_{Q'} = x_{VP} + z_{VP}\,\dfrac{x_{VP} - x_Q}{z_Q - z_{VP}} \\[2mm]
y_{Q'} = y_{VP} + z_{VP}\,\dfrac{y_{VP} - y_Q}{z_Q - z_{VP}} \\[2mm]
z_{Q'} = z_Q
\end{cases}
\tag{1}
$$
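Equation (1) can be checked numerically with a small sketch. The function name `rectify_point` and the tuple layout `(x, y, z)` are assumptions made for illustration; the arithmetic follows the construction in the text (intersect the line through $VP$ and $Q$ with the plane $z = 0$, then keep the depth of $Q$).

```python
def rectify_point(q, vp):
    """Rectify Q = (x, y, z) about the vanishing point VP = (x, y, z), eq. (1)."""
    xq, yq, zq = q
    xv, yv, zv = vp
    if zq == zv:
        # The line through VP and Q is parallel to the x-y plane: no intersection.
        raise ValueError("point lies at the vanishing point depth")
    scale = zv / (zq - zv)
    # (x', y') is the intersection P of line VP-Q with the plane z = 0;
    # the depth of Q is kept unchanged.
    return (xv + scale * (xv - xq), yv + scale * (yv - yq), zq)
```

A point already lying on the x-y plane is left where it is, while deeper points are pushed away from the vanishing point, which is what makes converging walls parallel again.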
3.2 3D Reconstruction Using Sub-pixel Precision
Once the depth map ($D$) and the vanishing point ($VP$) are calculated (they can be obtained in parallel), the reconstruction process is performed, including the perspective rectification (section 3.1). The reconstruction method returns a 3D matrix ($R$) containing the space occupation of the final result. $R$ is initialized to zero and then filled as follows: $R(x', y', D(x, y)) = 1$, where $(x', y')$ are the rectified coordinates of $(x, y)$, $\forall x, y \in \mathbb{R}$ such that $0 \le x \le m - 1$, $0 \le y \le n - 1$. The final step is obtaining the real units. This is a direct calculation if the camera parameters are known.

The most important drawback is that when the perspective rectification corrects the pixels' coordinates, the voxels are separated in the 3D representation (figure 3(b)). This is due to the discreteness of depth maps. In fact, pixels corresponding to a distant object are split, leaving a hole whose dimensions increase as the distance to the object increases. To minimize these problems a sub-pixel precision technique is proposed to calculate the position of n fictitious pixels between two consecutive pixels. The precision used for the reconstruction is calculated using the equation $1 - (z/D_{max})$, which returns the minimum value when the pixel is in the foreground and the maximum one when it is in the background. All the steps of the sub-pixel reconstruction method can be summarized as follows:

1. D := CalculateDepthMap(LI, RI)
2. VP := CalculateVanishingPoint(LI)
3. while (x ≤ m − 1)
   (a) while (y ≤ n − 1)
       i. (x', y') := Rectify(x, y, VP)
       ii. R(x', y', D(x, y)) = 1
       iii. y = y + 1 − (D(x, y)/Dmax)
   (b) x = x + 1 − (D(x, y)/Dmax)
4. Display(ObtainRealUnits(R))
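The loop in step 3 of the summary above can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation. Several details are assumptions because the text leaves them open: the depth map is indexed as `depth[x, y]`, fractional coordinates read the depth of the pixel they fall in, the step is clamped to a hypothetical `min_step` so that it never reaches zero at $D_{max}$, the outer step uses the finest inner step seen in that pass, and null pixels and pixels at the vanishing point depth are skipped.

```python
import numpy as np

def rectify_xy(x, y, z, vp):
    """Equation (1): rectified (x', y') of a point at depth z."""
    xv, yv, zv = vp
    scale = zv / (z - zv)
    return xv + scale * (xv - x), yv + scale * (yv - y)

def reconstruct(depth, vp, d_max=None, min_step=0.25):
    """Fill a voxel grid from depth[x, y] using depth-dependent sub-pixel steps."""
    m, n = depth.shape
    d_max = float(depth.max()) if d_max is None else float(d_max)
    volume = np.zeros((m, n, int(d_max) + 1), dtype=np.uint8)
    x = 0.0
    while x <= m - 1:
        row_step = 1.0
        y = 0.0
        while y <= n - 1:
            z = float(depth[int(x), int(y)])       # depth of the enclosing pixel
            step = max(min_step, 1.0 - z / d_max)  # step 3(a)iii, clamped
            row_step = min(row_step, step)
            if 0.0 < z < vp[2]:                    # skip null and VP-depth pixels
                xr, yr = rectify_xy(x, y, z, vp)
                xi, yi = int(round(xr)), int(round(yr))
                if 0 <= xi < m and 0 <= yi < n:
                    volume[xi, yi, int(z)] = 1
            y += step
        x += row_step                              # step 3(b), finest step of the pass
    return volume
```

Setting `min_step=1.0` disables the sub-pixel sampling, which reproduces the separated voxels the text describes; the default fills the gaps between them.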
Fig. 3. Example of reconstruction using the depth map of figure (a). Figure (b) shows the reconstruction without using sub-pixel precision, in which the voxels are separated due to the perspective rectification. In figure (c) the sub-pixel precision has been used. The result has a more realistic appearance because holes have been filled.
4 Mapping Algorithm
The mapping algorithm is an application of the reconstruction method. It demonstrates the utility of the perspective rectification and the advantages of its application to this kind of problem. A significant example is shown in figure 4, where (a) shows two consecutive images of a corridor (there is also a column) and the result of their intersection. Since the scene is not rectified, the intersection of the columns and walls are not coincident. The rectification of the images in (a) is presented in (b). In this case, the intersection is perfectly coincident, so the mapping algorithm is significantly improved.
Fig. 4. Mapping comparison with and without rectification
In order to do the 3D mapping of the scene, $N$ stereo pairs $(LI_0, RI_0), (LI_1, RI_1), \ldots, (LI_{N-1}, RI_{N-1})$ of the environment are taken. Each of these images is captured at a fixed distance. Once a stereo pair $(LI_k, RI_k)$ is obtained, its corresponding depth map $D_k$ is calculated and added to the list $\Sigma$, which stores all the depth maps. Next, the perspective rectification algorithm is used to compute the rectified matrix $R_k$ of each depth map. For each matrix $R_k$, its intersection with the previous matrix is calculated ($R_{k-1} \cap R_k$), and the result is added to the main matrix $M_{map}$, which represents the mapping of the scene. In this approach the position of the frames is obtained from robot odometry. The system only needs the relative position of the next frame to do the reconstruction from the sequence of images.

In order to reduce the effect of possible odometry errors the algorithm uses a cubic filter. This filter $F$ (explained below) is applied to the whole matrix $M_{map}$; it discretizes the three-dimensional matrix and transforms it into a grid of rectangular cubes. Lastly, the result ($M_{map}$) is represented according to the space occupation of this matrix, calculating its equivalence in real units (metres). The cubic filter $F$ applies the equation $g(x, y, z) := \sum_{(i,j,k) \in S} f(i, j, k)$ to each cube of the matrix, where $S$ represents the set of point coordinates located in the neighbourhood of $g(x, y, z)$, including the point in question. In this way the space occupation of each cube is in the centre, and each cell contains the set of readings of that portion of the space. Using these cells instead of a single sample lets the system avoid possible odometry errors. The number of readings is referred to as “votes”, and represents the probability of space occupation.
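One possible reading of the accumulation and the cubic filter is sketched below. It is an interpretation, not the authors' code: the reconstructions are assumed to be already registered to a common grid from odometry, the cubes are taken as non-overlapping blocks of `size` voxels per side, and a cube is treated as occupied when it holds at least `min_votes` readings. The function names are hypothetical.

```python
import numpy as np

def accumulate_map(reconstructions):
    """Add the intersection of each consecutive pair R_{k-1}, R_k to the map."""
    m_map = np.zeros(reconstructions[0].shape, dtype=np.int64)
    for prev, cur in zip(reconstructions, reconstructions[1:]):
        m_map += np.logical_and(prev, cur)
    return m_map

def cubic_filter(m_map, size=3, min_votes=5):
    """Sum the readings inside each size^3 cube and threshold the vote count."""
    # Crop so that every axis is a whole number of cubes.
    dims = [d - d % size for d in m_map.shape]
    v = m_map[:dims[0], :dims[1], :dims[2]].astype(np.int64)
    blocks = v.reshape(dims[0] // size, size,
                       dims[1] // size, size,
                       dims[2] // size, size)
    votes = blocks.sum(axis=(1, 3, 5))  # g(x, y, z) for each cube
    return votes, votes >= min_votes
```

A sliding neighbourhood sum would be another valid reading of the equation for $g$; the block form is used here because the text speaks of transforming the matrix into a grid of cubes.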
5 Experimentation and Results
This section presents the experimental results. Figure 5 shows a reconstruction comparison using a synthetic disparity map (a) which simulates a corridor. This example clearly shows the effect of the perspective rectification. Figure (b) shows the segmentation used to calculate the vanishing point, which is estimated to be −4°. In figures (c), (d) and (e) the rectification effect is compared: (c) shows a non-rectified reconstruction (seen from above), and (d) and (e) show a top and an oblique view of the correct result after the rectification. As can be seen, the walls are perfectly rectified, becoming parallel as expected.
Fig. 5. Perspective rectification comparison using a corridor depth map
For the mapping experiments, two sequences of 30 images obtained from two different corridors have been used. All the images were taken using the Pioneer P3-AT robot and the Bumblebee HICOL-60 stereo camera (figure 6(a)). Figure 6(b) shows four images of one of these sequences as well as their depth maps. The main objective is that the walls, floor and ceiling appear without slope in the reconstruction, and that there should not be any obstacle (noise) in the corridor. It is also important that the columns (represented by circles in the plan (c)) and the coffee machine (represented by a rectangle in the plan (d)) are detected correctly. Figures (c) and (d) show the mapping results of the two sequences and their respective corridor maps. For these examples a cubic filter size of 3 × 3 × 3 and a number of votes of 5 have been used. These results show a good definition of the corridors because the walls are delimited and the area in between can be seen. Moreover, the columns and the coffee machine can be distinguished on the right-hand side of each of the results.

Fig. 6. (a) Pioneer Robot P3-AT with stereo camera. (b) Sequence of images for the mapping. (c,d) Mapping results of the corridors.

To conduct the experiments, a Pentium IV 3.20 GHz with 2 GB of RAM and a 512 MB graphics card has been used. The reconstruction of the maps has been made using a 320 × 240 × 256 voxel matrix and depth maps with a size of 320 × 240 pixels. Moreover, it is important to note that only non-null pixels (finite depth) in the depth map are processed. The computational cost depends linearly on the size of the input images and on the precision of the reconstruction. The algorithm therefore performs well: processing one sequence of 30 images (each image has a level of 70% of processed data) takes approximately 9 seconds. This time depends on the precision of the final 3D reconstruction. So, the processing time of an individual reconstruction is 0.3 seconds or less.
6 Conclusions
A new method to reconstruct 3D scenes from stereo images has been presented, as well as an algorithm for environment mapping. These methods use geometrical rectification to eliminate the effect of conical perspective. Because the vanishing point position is calculated, rectification can remove the camera orientation and obtain a front view of the scene. It is important to note that the final quality of the reconstructed image depends on the quality of the depth map. In future experiments, better depth images will probably improve the final result. Moreover, other reconstruction algorithms with different geometric primitives will be implemented in order to compare the results.

The results show how this process corrects the perspective effect and how it helps to improve the matching in the mapping algorithm. An advantage of this method is that it is not correspondence dependent. In addition, it could probably be used for real-time applications due to its low computational burden and good performance. The cubic filter is very useful for solving odometry problems in the mapping. This is a statistical approach used to compute the matrix of the occupancy evidence. The mapping algorithm will be improved in future experiments in order to consider moving objects, robot drift and other kinds of problems.

The main interest of the method is that some crucial information from the scene, such as object geometry, volume and depth, is retrieved. For these reasons the proposed methods are useful in a wide range of applications, such as Augmented Reality (AR) and autonomous robot navigation. As future work we want to use the obtained results in these types of applications.

Acknowledgments. This work has been done with the support of the Spanish Generalitat Valenciana, Project GV06/158.
References
1. Pérez, J., Castellanos, J., Montiel, J., Neira, J., Tardós, J.: Continuous mobile robot localization: Vision vs. laser. ICRA (1999)
2. Moravec, H.P.: Robot spatial perception by stereoscopic vision and 3D evidence grids. The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA (1996)
3. Se, S., Lowe, D., Little, J.: Vision-based mobile robot localization and mapping using scale-invariant features. ICRA (2001)
4. Martin, C., Thrun, S.: Real-time acquisition of compact volumetric maps with mobile robots. ICRA (2002)
5. Martinsanz, G.P., de la Cruz García, J.M.: Visión por computador: imágenes digitales y aplicaciones. Ra-Ma, D.L. (ed.), Madrid (2001)
6. Vogiatzis, G., Torr, P.H.S., Cipolla, R.: Multi-view stereo via Volumetric Graph-cuts. CVPR, pp. 391–398 (2005)
7. Sinha, S., Pollefeys, M.: Multi-view Reconstruction using Photo-consistency and Exact Silhouette Constraints: A Maximum-Flow Formulation. ICCV (2005)
8. Broggi, A.: Robust Real-Time Lane and Road Detection in Critical Shadow Conditions. In: Proceedings IEEE International Symposium on Computer Vision, Coral Gables, Florida. IEEE Computer Society, Los Alamitos, CA, USA (1995)
9. Compañ, P., Satorre, R., Rizo, R.: Disparity estimation in stereoscopic vision by simulated annealing. In: Artificial Intelligence research and development, pp. 160–167. IOS Press, Amsterdam, Trento, Italy (2003)
10. Trucco, E., Verri, A.: Introductory techniques for 3-D Computer Vision. Prentice-Hall, Englewood Cliffs (1998)
11. Cox, I., Hingorani, S., Rao, S.: A maximum likelihood stereo algorithm. Computer Vision and Image Understanding, 63 (1996)
12. Faugeras, O.: Three-dimensional computer vision: a geometric viewpoint. The MIT Press, Cambridge, Massachusetts (1993)
13. Gallego Sánchez, A.J., Molina Carmona, R., Villagrá Arnedo, C.: Three-Dimensional Mapping from Stereo Images with Geometrical Rectification. In: Perales, F.J., Fisher, R.B. (eds.) AMDO 2006. LNCS, vol. 4069, pp. 213–222. Springer, Heidelberg (2006)
14. Coughlan, J., Yuille, A.L.: Manhattan World: Orientation and Outlier Detection by Bayesian Inference. Neural Computation 15(5), 1063–1088 (2003)
Estimation Track–Before–Detect Motion Capture Systems State Space Spatial Component
Przemyslaw Mazurek
Szczecin University of Technology, Chair of Signal Processing and Multimedia Engineering, 26. Kwietnia 10, 71126 Szczecin, Poland
{przemyslaw.mazurek}@ps.pl
http://www.media.ps.pl
Abstract. This paper presents spatial component estimation for Track–Before–Detect (TBD) based motion capture systems. Using the Likelihood Ratio TBD algorithm it is possible to track markers at a low Signal–to–Noise Ratio, which is typical of motion capture systems. Three kinds of TBD systems are analyzed and compared: full frame processing, single camera optimized and multiple cameras optimized. Separate TBD processing for every camera is assumed.
Keywords: Estimation, Track–Before–Detect, Motion Capture.
1 Introduction
Motion capture systems are very important in 3D motion acquisition and are used in computer graphics animation (movies, games), biomechanics (sport, medical), ergonomics and many other applications. There are several kinds of motion capture methods; among them, marker-based optical tracking systems are the most popular due to their high reliability. A marker is a small ball that reflects infrared or visible light, or an emitting device such as an LED. Using a set of internally and externally calibrated cameras, markers can be detected and tracked, and after triangulation the 3D positions of all markers can be obtained. Due to acquisition constraints (camera properties, marker properties and the geometry of the scene) markers are usually quite large (e.g. table tennis balls) and only a small number of them is used. Increasing the number of small markers is important for improving the motion data available to applications. Small markers are hard to detect and there are some human-related constraints: it is impossible to use high-energy LEDs because of the safety of the actor's eyes. Solving this problem is hard with classical tracking systems based on acquisition, detection, tracking and assignment processes performed separately. The small energy obtained from markers limits the use of threshold-based detection techniques. Using another tracking scheme (Track–Before–Detect), which tracks first and detects afterwards, it is possible to track markers even if the Signal–to–Noise Ratio (SNR) is near or below one.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 149–156, 2007. © Springer-Verlag Berlin Heidelberg 2007
The most popular TBD algorithms are Spatio–Temporal Filters [1], Likelihood Ratio TBD [2] and Particle Filters [4,3]. The author's recent work [6] shows some methods for using Likelihood Ratio TBD in a motion capture system with camera state space tracking.
2 Geometry of Circularly Motion Capture System
In a typical motion capture system the cameras are placed in a circle. The performance area, which can be used by the actor and in which the visibility of markers is maximal, is located in the center of the scene. The cameras' internal and external parameters define how large this area is and define the 3D spatial resolution. Additionally, the minimal size of the room can be calculated.
Fig. 1. Geometry of scene (top and side views). Performance area is shaded. C – center of scene; $R_{eff}$ – effective circle radius that guarantees maximal visibility of markers by all cameras; $R_c$ – cameras’ circle radius; $h$ – height of performance area.
A camera's basic internal parameters can be defined by the Horizontal Field of View ($HFOV$), the Vertical Field of View ($VFOV$), and the horizontal and vertical pixel resolution. Wide FOV cameras give an attractive $R_{eff}$, but the spatial resolution is reduced [5]. Preferably, cameras should be placed at a large distance and a narrow FOV should be used for optimal spatial resolution (almost constant in the performance area). This is one of the reasons why large rooms are used. Alternatively, the number of cameras can be increased, accepting that a marker may be visible to fewer cameras. Two effective radii can be found in the system, but only the smaller one is correct and describes the reduction of the effective area:
$$R_{effTOP} = R_c \sin\frac{HFOV}{2} \tag{1}$$

$$R_{effSIDE} = R_c - \frac{h}{2\tan\frac{VFOV}{2}} \tag{2}$$

$$R_{eff} = \min\{R_{effTOP}, R_{effSIDE}\} \tag{3}$$
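As a quick numerical illustration of equations (1)–(3), the following sketch evaluates both radii and their minimum. The function name and the choice of degrees for the input angles are assumptions made here, and the example values in the usage note are arbitrary.

```python
import math

def effective_radius(r_c, hfov_deg, vfov_deg, h):
    """Return (R_effTOP, R_effSIDE, R_eff) from equations (1)-(3)."""
    top = r_c * math.sin(math.radians(hfov_deg) / 2.0)
    side = r_c - h / (2.0 * math.tan(math.radians(vfov_deg) / 2.0))
    return top, side, min(top, side)
```

For instance, with a camera circle of radius 10 m, a 60° horizontal and 90° vertical field of view, and a 4 m high performance area, the top view gives about 5 m and the side view about 8 m, so the horizontal field of view is the limiting factor.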
Fig. 2. Two cases of performance area’s reduction (top and side views)
It should be noted that the performance area's height $h$ is not equal to the actor's height, because if the actor's hands are up or the actor jumps, some markers will be located higher than the actor's height. As the number of cameras increases, the performance area approaches a cylinder. Strictly, it is a cylinder with an additional cone on top if the room height imposes no limit, but for simplicity only the cylinder is assumed (the cone is omitted).
3 Track–Before–Detect Algorithm for Motion Capture Systems
Detection in classical tracking systems is simple if the SNR is high enough, and most of the computation is related to tracking, assignment and triangulation only. In TBD systems the algorithm first tries to track markers in the state space without knowing whether a marker is located in a given state space cell or not. This process is computationally demanding. The Likelihood Ratio TBD algorithm, described in detail in [2], uses a recurrent method to reduce computation, and every step has a constant cost. The pseudo-algorithm can be written in the following form:
Start
// Initial Likelihood Ratio:

$$\Lambda(t_0, s) = \frac{p(k = 0, s)}{p(k = 0, \phi)} \quad \text{for } s \in S \tag{4}$$

For $k \ge 1$ and $s \in S$
// Motion Update:

$$\Lambda^{-}(k, s) = q_k(s|\phi) + \int_S q_k(s|s_{k-1})\,\Lambda(k-1, s_{k-1})\,ds_{k-1} \tag{5}$$

// Information Update:

$$\Lambda(k, s) = L(y_k|s)\,\Lambda^{-}(k, s) \tag{6}$$
EndFor
Stop

The most significant cost of the algorithm is the integral of the motion update. A Markov matrix describes the motion vectors, and this matrix is quite large if the assumed marker dynamics are complex. This is a serious limitation for currently available computers because the measurement likelihood is a matrix of camera frame size. Assuming a 1 megapixel camera, a 50 fps rate and only 100 motion vectors, 5 giga basic MAC (Multiply and Accumulate) operations per second are needed. This is close to the limit of currently available PCs, and it excludes the assignment, deghosting, information update and acquisition costs. Multiplying this result by the number of cameras (e.g. 16), the roughly estimated cost is about 30–100 computers for a real-time motion capture system. Additionally, the system maintenance cost also increases. The proposed cost reduction solution [6] is briefly described in this paper, and the cost reduction of two systems relative to full frame processing is derived.

3.1 TBD System Without Optimization (Full Frame Processing)
This is the most trivial system because the whole state space is processed, but it is important for showing how other knowledge-based methods can improve the overall system. The general cost formula, if the motion vectors do not depend on spatial position, is:

$$C_{tot} = f(C_J, C_V) \approx C_J C_V \tag{7}$$

where $C_J$ is the spatially related cost and $C_V$ is the velocity-related cost for a fixed-grid state space. If $X_r$ and $Y_r$ define the camera pixel resolution, then for $N$ cameras the cost is:

$$C_{tot} \sim N X_r Y_r C_V \tag{8}$$
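As a rough illustration of where the product in (7) and (8) comes from, the sketch below performs one simplified motion update and information update on a position-only grid and then evaluates the full frame MAC budget. The function names, the `(dx, dy, weight)` encoding of the motion vectors and the constant birth term are illustrative assumptions, not the author's implementation; the 1 megapixel, 50 fps, 100 vector and 16 camera figures are the ones quoted in the text.

```python
def motion_update(prev, vectors, birth=0.0):
    """Discrete form of the motion update (5) on a 2-D position grid.

    prev    -- rows of likelihood ratios from step k-1
    vectors -- (dx, dy, weight) motion vectors; a hypothetical encoding of
               the Markov transition weights q_k(s | s_{k-1})
    birth   -- constant stand-in for q_k(s | phi) (an assumption)
    """
    height, width = len(prev), len(prev[0])
    pred = [[birth] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dx, dy, weight in vectors:
                sx, sy = x - dx, y - dy  # cell the marker would have come from
                if 0 <= sx < width and 0 <= sy < height:
                    pred[y][x] += weight * prev[sy][sx]  # one MAC
    return pred


def information_update(pred, likelihood):
    """Information update (6): multiply by the measurement likelihood."""
    return [[l * p for l, p in zip(l_row, p_row)]
            for l_row, p_row in zip(likelihood, pred)]


def full_frame_macs_per_second(x_res, y_res, motion_vectors, fps, cameras=1):
    """Worst-case MAC rate of the motion update, following (8)."""
    return cameras * x_res * y_res * motion_vectors * fps


# Tiny grid: a marker either stays put or moves one pixel to the right.
grid = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
predicted = motion_update(grid, [(0, 0, 0.5), (1, 0, 0.5)])
updated = information_update(predicted, [[2.0, 2.0, 2.0], [1.0, 1.0, 1.0]])

one_camera = full_frame_macs_per_second(1000, 1000, 100, 50)
sixteen_cameras = full_frame_macs_per_second(1000, 1000, 100, 50, cameras=16)
print(one_camera, sixteen_cameras)
```

The triple loop makes the cost structure visible: every grid cell is visited once per motion vector, so the work grows as the number of cells times the number of vectors regardless of whether a marker is present.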
Decreasing the number of cameras or the resolution is not desired. Optimization of $C_V$ is an interesting research area, but it is outside the scope of this paper. Using only knowledge about which frame areas are important can also reduce the cost.

3.2 TBD System with Separate Optimization

This system uses additional knowledge related to the spatial part of the TBD state space. The idea of this algorithm, examples and requirements are considered in the author's paper [6], but they are also sketched briefly here. In typical motion capture systems everything is black: the floor, the walls and the actors' clothes (black in visible light or infrared). The only bright elements are the markers. Using background estimation, the actor can be extracted from the acquired image if there is a difference between the background and the actor. Only pixels occupied by the actor, and the related important areas of the state space, should be considered for tracking by the TBD algorithm.

The worst case will be analyzed. An actor can be modeled by a cylinder or an ellipsoid. Independently of the actor's orientation, the side view consists of the same number of pixels. The simplest case is a cylinder that models the actor's volume. Depending on the pose the number of pixels changes, but there is an upper limit that can be estimated by measurements and described by the actor's height $h_a$ and diameter $d_a$. If the actor has hands up, the vertical size increases but the horizontal one decreases. If the actor is in e.g. a da Vinci pose, the vertical and horizontal sizes increase, but the area (number of pixels) is always limited. Due to the actor's mask processing and noise the cost will be slightly overestimated. The number of pixels in the camera is:

$$P_r = X_r Y_r \tag{9}$$

The number of actor's pixels is:

$$P_a = X_a Y_a = X_r Y_r \frac{AVFOV}{VFOV} \frac{AHFOV}{HFOV} \tag{10}$$

where approximately:

$$X_a = X_r \frac{AHFOV}{HFOV}, \qquad Y_a = Y_r \frac{AVFOV}{VFOV} \tag{11}$$

$$AVFOV = 2\arctan\frac{h_a}{2(R_c - R_{eff})}, \qquad AHFOV = 2\arctan\frac{d_a}{2(R_c - R_{eff})} \tag{12}$$

Independently of the orientation and pose (upper bound) the number of pixels is

$$P_a = X_r Y_r \frac{AVFOV}{VFOV} \frac{AHFOV}{HFOV} \tag{13}$$

The maximal angular sizes of the actor, $AVFOV_{max}$ and $AHFOV_{max}$, are reached if the actor-to-camera distance is $R_c - R_{eff}$.

$$C_{tot} \sim N X_r Y_r \frac{AVFOV_{max}}{VFOV} \frac{AHFOV_{max}}{HFOV} C_V \tag{14}$$
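The bound in (12) to (14) can be evaluated numerically. The following is a minimal sketch under stated assumptions: the function names, the clamp to the camera field of view, and the values of the camera circle radius and effective radius are chosen here for illustration only; the remaining numbers are the example values given in the caption of Fig. 4.

```python
import math


def angular_size_deg(extent, distance):
    """Angle in degrees subtended by an object of the given extent, as in (12)."""
    return math.degrees(2.0 * math.atan(extent / (2.0 * distance)))


def single_camera_fraction(h_a, d_a, r_c, r_eff, vfov_deg, hfov_deg):
    """Worst-case share of the frame occupied by the actor, as in (13)-(14).

    The actor is closest to a camera at distance r_c - r_eff, so r_eff < r_c
    is required. Clamping to the field of view is an added assumption: the
    actor cannot cover more than the whole frame.
    """
    nearest = r_c - r_eff
    avfov_max = min(angular_size_deg(h_a, nearest), vfov_deg)
    ahfov_max = min(angular_size_deg(d_a, nearest), hfov_deg)
    return (avfov_max / vfov_deg) * (ahfov_max / hfov_deg)


# VFOV = 60, HFOV = 80, h_a = 25, d_a = 5 follow the Fig. 4 example;
# r_c = 100 and r_eff = 50 are hypothetical values.
fraction = single_camera_fraction(25.0, 5.0, 100.0, 50.0, 60.0, 80.0)
print(f"cost relative to full frame processing: {fraction:.3f}")
```

Because the fraction multiplies the full frame cost of every camera, it has to be evaluated at the smallest actor-to-camera distance, which is why only the difference between the two radii enters the computation.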
For example, if the actor occupies at most 20% of the scene, the TBD cost reduces to 20% of the maximal cost. This relation is linear. For the worst case it should be assumed that this maximal occupation is fixed for all cameras independently of the actor's position. Of course, if the actor is angularly smaller due to a larger distance to the camera, then fewer computations are required by the system, but the amount of computation power is fixed by the worst case.

3.3 TBD System with Multiple Camera Optimization
If the position of the actor on the stage is available, it is possible to reduce the amount of computations. Considering two oppositely placed cameras at distance $2R_c$ and an actor moving from one to the other, the cumulative size of the actor (acquired from the two cameras) changes but is lower than in the previous method. The field of view occupied by the actor is position dependent, so instead of the fixed coefficients $AVFOV_{max}$ and $AHFOV_{max}$, functions that depend on the actor's position $(x_a, y_a)$ should be used.

$$C_{tot} \sim X_r Y_r C_V \sum_{i=1}^{N} \frac{AVFOV_i(x_a, y_a)}{VFOV} \frac{AHFOV_i(x_a, y_a)}{HFOV} \tag{15}$$
If $R_c$ is the largest distance and $VFOV$ and $HFOV$ are not extremely large, an actor moving from one camera towards the opposite one does not significantly change the angular size in the view from the camera placed perpendicularly to the actor's motion. The largest values of $AVFOV_i(x_a, y_a)$ and $AHFOV_i(x_a, y_a)$ are located on the $R_c$ circle.
[Figure 3 labels: O, A, C, Reff, Rc]
Fig. 3. Top view of performance area. A - actor position; C - camera position
The distance between the camera and the actor is:

$$d_{AC}^{i}(\phi_i) = \sqrt{R_{eff}^2 + R_c^2 - 2 R_{eff} R_c \cos(\phi_i)} \tag{16}$$

For the vertical direction, with camera index $i = 0..(N-1)$:

$$AVFOV_i = 2\arctan\frac{h_a}{2 d_{AC}^{i}(\phi_i)} \tag{17}$$

Because the cameras are equally spaced on the $R_c$ circle:

$$\phi_i = \alpha + \frac{2\pi i}{N} \tag{18}$$

where $\alpha$ is the phase that includes the basis axis. If this axis is on the OC line, then $\alpha = 0$ for camera number $i = 0$.

$$AVFOV_{totmax} = \sum_{i=0}^{N-1} 2\arctan\left(\frac{h_a}{2\sqrt{R_{eff}^2 + R_c^2 - 2 R_{eff} R_c \cos\frac{2\pi i}{N}}}\right) \tag{19}$$

The positions of the maxima of this function are:

$$\arg\max\{AVFOV_{tot}\} = \frac{2\pi i}{N} \tag{20}$$

For $AHFOV_{totmax}$ a similar formula can be derived:

$$AHFOV_{totmax} = \sum_{i=0}^{N-1} 2\arctan\left(\frac{d_a}{2\sqrt{R_{eff}^2 + R_c^2 - 2 R_{eff} R_c \cos\frac{2\pi i}{N}}}\right) \tag{21}$$

The cost of computation for the considered system is described by the formula:

$$C_{tot} \sim N X_r Y_r \frac{AVFOV_{totmax}}{VFOV} \frac{AHFOV_{totmax}}{HFOV} C_V \tag{22}$$

4 Comparison Example
Introducing the Mean of Actor FOV formulas as:

$$MAVFOV_{max} = \frac{1}{N}\sum_{i=0}^{N-1} AVFOV_i, \qquad MAHFOV_{max} = \frac{1}{N}\sum_{i=0}^{N-1} AHFOV_i \tag{23}$$

the total cost is:

$$C_{tot} \sim \frac{X_r Y_r C_V}{VFOV \cdot HFOV}\, MAVFOV_{max}\, MAHFOV_{max} \tag{24}$$

and because the first part of the formula is constant for all considered cases:

$$C_{tot} \sim MAVFOV_{max}\, MAHFOV_{max} \tag{25}$$
The total cost relative to full frame processing is 5% for single camera optimization and about 1% for multiple camera optimization. The presented derivation of the cost formulas holds for a single actor; if there are more actors, the cost reduction will be smaller. Only the first method (full frame processing) is independent of the number of actors. Assuming 5 actors and 100 computers for full frame processing, using multiple camera optimization decreases the number of computers to 5.
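The quantities in (16) to (18) and (23) to (25) are straightforward to evaluate. The sketch below is one possible reading, not the author's code: the function names are invented here, the camera circle radius and actor radius are hypothetical values, and normalizing the product of the two means by the camera fields of view is an assumption about how the percentages above are formed.

```python
import math


def camera_distance(r_eff, r_c, phi):
    """Actor-to-camera distance (16), from the law of cosines."""
    return math.sqrt(r_eff ** 2 + r_c ** 2 - 2.0 * r_eff * r_c * math.cos(phi))


def mean_actor_fov_deg(extent, r_eff, r_c, n_cameras, alpha=0.0):
    """Mean angular size (23) over N equally spaced cameras, in degrees.

    Requires r_eff < r_c so that no camera distance becomes zero.
    """
    total = 0.0
    for i in range(n_cameras):
        phi = alpha + 2.0 * math.pi * i / n_cameras      # camera angle (18)
        distance = camera_distance(r_eff, r_c, phi)
        total += 2.0 * math.atan(extent / (2.0 * distance))  # angular size (17)
    return math.degrees(total) / n_cameras


def relative_cost(h_a, d_a, r_eff, r_c, n_cameras, vfov_deg, hfov_deg):
    """Product of the two means (25), normalized by the camera FOV (assumption)."""
    mavfov = mean_actor_fov_deg(h_a, r_eff, r_c, n_cameras)
    mahfov = mean_actor_fov_deg(d_a, r_eff, r_c, n_cameras)
    return (mavfov / vfov_deg) * (mahfov / hfov_deg)


# h_a = 25, d_a = 5, VFOV = 60, HFOV = 80 follow the Fig. 4 example;
# r_c = 100 and r_eff = 50 are hypothetical values.
for cameras in (4, 8, 16):
    cost = relative_cost(25.0, 5.0, 50.0, 100.0, cameras, 60.0, 80.0)
    print(cameras, round(cost, 4))
```

Averaging over cameras is what lowers the bound compared with the single camera case: only the cameras near the actor see a large angular size, while the others contribute much smaller terms to the mean.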
[Figure 4: two plots, MAVFOVmax (angle) and MAHFOVmax (angle) versus Reff (0 to 100). Curves: Reff limit; No optimisation and VFOV boundary; Single camera optimisation; Multiple optimisation (4xcam); Multiple optimisation (8xcam); Multiple optimisation (16xcam)]
Fig. 4. MAVFOV and MAHFOV costs for analyzed example. VFOV = 60, HFOV = 80 - camera fields of view (4:3); ha = 25 - height of actor; da = 5 - diameter of actor; hp = 30 - height of performance area; Rc - camera circle radius; Reff - calculated effective radius.
5 Conclusions

This paper proposes a method for estimating the fixed computation cost of a motion capture system with multiple circularly placed cameras and a TBD algorithm. This is crucial for real-time systems, where the worst case should be known and an appropriate amount of computation power should be reserved by calculation rather than by tests. The presented example, which uses the derived formulas, shows a significant reduction of the computation cost, which is very important given the requirements of the considered TBD algorithm.

Acknowledgments. This work is supported by the MNiSW grant N514 004 32/0434 (Poland).
References

1. Barniv, Y.: Dynamic Programming Algorithm for Detecting Dim Moving Targets. In: Bar–Shalom, Y. (ed.): Multitarget–Multisensor Tracking. Artech House (1990)
2. Stone, L.D., Barlow, C.A., Corwin, T.L.: Bayesian Multiple Target Tracking. Artech House (1999)
3. Ristic, B., Arulampalam, S., Gordon, N.: Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House (2004)
4. Doucet, A., De Freitas, N., Gordon, N.: Sequential Monte Carlo Methods in Practice. Springer, Heidelberg (2005)
5. Mazurek, P.: Optimization of multiple camera 3–D space locations for motion capture room environment. 28. Miedzynarodowa Konferencja z Podstaw Elektrotechniki i Teorii Obwodow IC-SPETO'2005, Gliwice-Ustron, pp. 403–406 (2005)
6. Mazurek, P.: Likelihood Ratio Track-Before-Detect Algorithm for Motion Capture Systems. Advanced Computer Systems ACS'2007 Multi-Conference, Miedzyzdroje (submitted 2007)
Real-Time Active Shape Models for Segmentation of 3D Cardiac Ultrasound

Jøger Hansegård (1), Fredrik Orderud (2), and Stein I. Rabben (3)

(1) University of Oslo, Norway [email protected]
(2) Norwegian University of Science and Technology, Norway [email protected]
(3) GE Vingmed Ultrasound, Norway [email protected]
Abstract. We present a fully automatic real-time algorithm for robust and accurate left ventricular segmentation in three-dimensional (3D) cardiac ultrasound. Segmentation is performed in a sequential state estimation fashion using an extended Kalman filter to recursively predict and update the parameters of a 3D Active Shape Model (ASM) in real-time. The ASM was trained by tracing the left ventricle in 31 patients, and provided a compact and physiologically realistic shape space. The feasibility of the proposed algorithm was evaluated in 21 patients, and compared to manually verified segmentations from a custom-made semi-automatic segmentation algorithm. Successful segmentation was achieved in all cases. The limits of agreement (mean±1.96SD) for the point-to-surface distance were 2.2±1.1 mm. For volumes, the correlation coefficient was 0.95 and the limits of agreement were 3.4±20 ml. Real-time segmentation of 25 frames per second was achieved with a CPU load of 22%.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 157–164, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Left ventricular (LV) volumes and ejection fraction (EF) are among the most important parameters in diagnosis and prognosis of heart diseases. Recently, real-time three-dimensional (3D) echocardiography was introduced. Segmentation of the LV in 3D echocardiographic data has become feasible, but due to poor image quality, commercially available tools are based upon a semi-automatic approach [1,2]. Furthermore, most reported methods use iterative and computationally expensive fitting schemes. These factors make real-time segmentation in 3D cardiac ultrasound challenging.

Prior work by Blake et al. [3,4] and Jacob et al. [5,6] has shown that a state estimation approach is well suited for real-time segmentation in 2D imagery. They used a Kalman filter, which requires only a single iteration, to track the parameters of a trained deformable model based on principal component analysis (PCA), also known as Active Shape Models (ASMs) [7]. ASMs can be trained on manually traced LV contours, resulting in a sub-space of physiologically probable shapes, effectively exploiting expert knowledge of the LV anatomy and function. For segmentation of 3D cardiac data, Van Assen et al. [8] introduced the 3D ASM. However, there are to our knowledge no reports of real-time implementations of 3D ASMs.

Based on the work in [3,4,5,6], real-time LV segmentation of 3D cardiac ultrasound was recently introduced by Orderud [9]. He used an extended Kalman filter for robust tracking of a rigid ellipsoid LV model. Later, this framework was extended to use a flexible spline-based LV model coupled with a global pose transform to improve local segmentation accuracy [10]. However, expert knowledge of LV anatomy could not be modeled directly.

To utilize expert knowledge of LV anatomy during segmentation, we propose to use a 3D ASM for real-time segmentation of 3D echocardiograms, by extending the framework described in Orderud [9]. The 3D ASM, trained on LV shapes traced by an expert, gives a compact deformable model which is restricted to physiologically realistic shapes. This model is fitted to the target data in real-time using a Kalman filter. The feasibility of the algorithm is demonstrated in 21 patients, where we achieve real-time segmentation of the LV shape, and instantaneous measurements of LV volumes and EF.
2 Shape Model

A set of 496 triangulated LV training meshes was obtained from 31 patients using a custom-made segmentation tool (GE Vingmed Ultrasound, Norway). The training tool provides manual editing capabilities; when necessary, the user manually edited the segmentation to make it equivalent to manual tracing. Building the ASM requires pair-wise point correspondence between shapes from different patients [7,8]. We developed a reparametrization algorithm for converting triangulated LV training shapes into quadrilateral meshes. This algorithm produced meshes with 15 longitudinal and 20 circumferential segments, with vertices approximately identifying unique anatomical positions. The meshes were aligned separately to remove trivial pose variations, such as scaling, translation and rotation.
[Figure 1 labels: pl(xl), q̄i, n̄i, Tl, p(x), qi(xl), Tg]
Fig. 1. A point p(x) on the ASM is generated by first applying a local transformation Tl described by the ASM state vector xl on the mean shape, followed by a global pose transformation Tg to obtain a shape in real-world coordinates
From the aligned training set, the mean vertex position $\bar{q}_i$ was computed, and PCA was applied on the vertex distribution to obtain the $N_x$ most dominant eigenvectors. In normalized coordinates, the ASM can be written in the form

$$q_i(x_l) = \bar{q}_i + A_i x_l, \tag{1}$$

where the position of a vertex $q_i$ is expressed as a linear combination of the associated subspace of the $N_x$ most dominant eigenvectors, combined into the $3 \times N_x$ deformation matrix $A_i$. Here, $x_l$ is the local state vector of the ASM. The expression for the ASM can be optimized by assuming that the deformation at vertex $q_i$ is primarily directed along the corresponding surface normal $\bar{n}_i$ of the average mesh. This is done by projecting the deformation matrix $A_i$ onto the surface normal, giving an $N_x$-dimensional vector of projected deformation modes $A_i^{\perp} = \bar{n}_i^T A_i$. The optimized expression for the ASM can now be written in the form

$$q_i(x_l) = \bar{q}_i + \bar{n}_i (A_i^{\perp} x_l), \tag{2}$$

reducing the number of multiplications by a factor of three. Due to the quadrilateral mesh structure of the ASM, a continuous surface is obtained using a linear tensor product spline interpolant. An arbitrary point on the ASM in normalized coordinates can be expressed as $p_l(x_l) = T_l|_{(u,v)}$, where $(u, v)$ represents the parametric position on the surface, and the local transformation $T_l$ includes the deformation and interpolation applied to the mean mesh. By coupling this model with a global pose transformation $T_g$ with parameters $x_g$ including translation, rotation, and scaling, we obtain a surface

$$p(x) = T_g(p_l(x_l), x_g) \tag{3}$$

in real-world coordinates, with a composite state vector $x^T \equiv [x_g^T, x_l^T]$. An illustration showing the steps required to generate the ASM is shown in Fig. 1. In our experiments we used 20 eigenvectors, describing 98% of the total variation within the training set.
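To make the difference between (1) and (2) concrete, the following minimal sketch evaluates one vertex in both forms. All names are illustrative and the numbers are made up; the two forms coincide here only because the example deformation modes are chosen parallel to the normal, which is exactly the assumption behind (2).

```python
def asm_vertex_full(mean, modes, state):
    """Vertex position from (1): mean + A x, with A a 3 x Nx matrix."""
    return [mean[r] + sum(modes[r][c] * state[c] for c in range(len(state)))
            for r in range(3)]


def project_modes(normal, modes):
    """Projected modes: the Nx-vector n^T A used in (2)."""
    return [sum(normal[r] * modes[r][c] for r in range(3))
            for c in range(len(modes[0]))]


def asm_vertex_along_normal(mean, normal, projected, state):
    """Vertex position from (2): mean + n (A_perp x)."""
    displacement = sum(a * x for a, x in zip(projected, state))
    return [mean[r] + normal[r] * displacement for r in range(3)]


# Hypothetical vertex with two deformation modes directed along the z normal.
mean_vertex = [1.0, 2.0, 3.0]
normal = [0.0, 0.0, 1.0]
modes = [[0.0, 0.0],
         [0.0, 0.0],
         [2.0, 3.0]]
state = [1.0, 1.0]

projected = project_modes(normal, modes)  # computed once, offline
full = asm_vertex_full(mean_vertex, modes, state)
fast = asm_vertex_along_normal(mean_vertex, normal, projected, state)
print(full, fast)
```

The projection is done once per vertex ahead of time, so at run time each vertex needs one dot product of length Nx plus three multiplications instead of three dot products of length Nx.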
3 Tracking Algorithm

The tracking algorithm extends prior work by Orderud [9,10] to enable usage of 3D ASMs in the Kalman filter for real-time segmentation. This is accomplished by using the ASM shape parameters $x_l$ directly, in addition to the global pose parameters $x_g$, in the Kalman filter state vector.

3.1 Motion Model
Modeling of motion in addition to position can be accomplished in the prediction stage of the Kalman filter by augmenting the state vector to contain the last two successive state estimates. A motion model which predicts the state $\bar{x}$ at timestep $k+1$ is then expressed as

$$\bar{x}_{k+1} - x_0 = A_1(\hat{x}_k - x_0) + A_2(\hat{x}_{k-1} - x_0), \tag{4}$$

where $\hat{x}_k$ is the estimated state from timestep $k$. Tuning of properties like damping and regularization towards the mean state $x_0$ for all deformation parameters can then be accomplished by adjusting the coefficients in matrices $A_1$ and $A_2$. Prediction uncertainty can similarly be adjusted by manipulating the process noise covariance matrix $B_0$ used in the associated covariance update equation. The latter will then restrict the change rate of parameter values.

3.2 Measurement Processing
Edge-detection is based on normal displacement measurements $v_i$ [4], which are calculated by measuring the radial distance between detected edge-points $p_{obs,i}$ and the contour surface $p_i$ along selected search normals $n_i$. These displacements are coupled with associated measurement noise $r_i$ to weight the importance of each edge, based on a measure of edge confidence. Measurement vectors are calculated by taking the normal projection of the composite state-space Jacobian for the contour points

$$h_i^T = n_i^T \begin{bmatrix} \dfrac{\partial T_g(p_l, x_g)_i}{\partial x_g} & \dfrac{\partial T_g(p_l, x_g)_i}{\partial x_l} \end{bmatrix}, \tag{5}$$

which is the concatenation of a global and a local state-space Jacobi matrix. The global Jacobian is trivially the state-space derivative of the global pose transformation, while the local Jacobian has to be derived, using the chain-rule for multivariate calculus, to propagate surface points on the spline through mesh vertices, and finally to the ASM shape parameters:

$$\frac{\partial T_g(p_l, x_g)}{\partial x_l} = \sum_{n \in \{x,y,z\}} \frac{\partial T_g(p_l, x_g)}{\partial p_{l,n}} \sum_{j \in 1..N_q} \left( \frac{\partial T_l(x_l)_n}{\partial q_j} \cdot \bar{n}_j \right) A_j^{\perp}. \tag{6}$$
Here, $\partial T_g(p_l, x_g)/\partial p_l$ is the spatial derivative of the global transformation, and $\partial T_l(x_l)/\partial q_j$ is the spatial mesh vertex derivative of the spline interpolant.

3.3 Measurement Assimilation and State Update
All measurements are assimilated in information space prior to the state update step. Assumption of independent measurements leads to very efficient processing, allowing summation of all measurement information into an information vector and matrix of dimensions invariant to the number of measurements:

$$H^T R^{-1} v = \sum_i h_i r_i^{-1} v_i \tag{7}$$

$$H^T R^{-1} H = \sum_i h_i r_i^{-1} h_i^T. \tag{8}$$

The updated state estimate $\hat{x}$ at timestep $k$ can then be computed by using the information filter formula for measurement update [11], and the updated error covariance matrix $\hat{P}$ is calculated directly in information space:

$$\hat{x}_k = \bar{x}_k + \hat{P}_k H^T R^{-1} v \tag{9}$$

$$\hat{P}_k^{-1} = \bar{P}_k^{-1} + H^T R^{-1} H. \tag{10}$$
Using this form, we avoid inversion of matrices with dimensions larger than the state dimension.
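The prediction (4) and the information-space update (7) to (10) can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: the function names are invented here, the matrices $A_1$ and $A_2$ are reduced to scalar gains, and a small Gaussian elimination routine stands in for whatever linear solver a real system would use.

```python
def predict(x_prev, x_prev2, x_mean, a1, a2):
    """Motion model (4), with A1 and A2 simplified to scalar gains."""
    return [m + a1 * (p - m) + a2 * (q - m)
            for p, q, m in zip(x_prev, x_prev2, x_mean)]


def solve(matrix, rhs):
    """Solve matrix * d = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    d = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(aug[r][c] * d[c] for c in range(r + 1, n))
        d[r] = (aug[r][n] - rest) / aug[r][r]
    return d


def assimilate(measurements, dim):
    """Sums (7) and (8) over (h_i, r_i, v_i) triples."""
    info_vec = [0.0] * dim
    info_mat = [[0.0] * dim for _ in range(dim)]
    for h, r, v in measurements:
        for a in range(dim):
            info_vec[a] += h[a] * v / r
            for b in range(dim):
                info_mat[a][b] += h[a] * h[b] / r
    return info_vec, info_mat


def update(x_pred, p_pred_inv, measurements):
    """State update (9) and inverse covariance update (10)."""
    dim = len(x_pred)
    info_vec, info_mat = assimilate(measurements, dim)
    p_hat_inv = [[p_pred_inv[a][b] + info_mat[a][b] for b in range(dim)]
                 for a in range(dim)]
    delta = solve(p_hat_inv, info_vec)  # equals P_hat * H^T R^-1 v
    return [x + d for x, d in zip(x_pred, delta)], p_hat_inv


x_pred = predict([2.0, 2.0], [1.0, 1.0], [0.0, 0.0], a1=1.5, a2=-0.5)
x_hat, p_hat_inv = update([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                          [([1.0, 0.0], 1.0, 2.0), ([0.0, 1.0], 1.0, 4.0)])
print(x_pred, x_hat)
```

Whatever the number of edge measurements, only one linear system of the state dimension is solved per frame, which is the property the closing sentence of this section refers to.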
4 Evaluation

4.1 Data Material

For evaluation of the proposed algorithm, apical 3D echocardiograms of one cardiac cycle from 21 adult patients (11 diagnosed with heart disease) were recorded using a Vivid 7 scanner (GE Vingmed Ultrasound, Norway) with a 3D transducer (3V). In all patients, meshes corresponding to the endocardial boundary were determined using a custom-made semi-automatic segmentation tool (GE Vingmed Ultrasound, Norway). The segmentations were, if needed, manually adjusted by an expert to serve as independent references equivalent to manual tracing.

4.2 Experimental Setup and Analysis
Edge measurements were done perpendicular to the mesh surface within a distance of ±1.5 cm to the surface at approximately 450 locations, using a simple edge model based on the transition criterion [12]. The ASM was initialized to the mean shape, and positioned in the middle of the volume in the first frame. Segmentation was performed on the evaluation set by running the algorithm for a couple of heartbeats, to give the ASM enough time to lock on to the LV.

The accuracy of the ASM was assessed using the mean of absolute point-to-mesh distances between the ASM and the reference, averaged over one cardiac cycle. Volume differences (bias) between the ASM and the reference were calculated for each frame. End-diastolic volume (EDV), end-systolic volume (ESV), and EF ((EDV − ESV)/EDV·100%) were compared to the manually verified reference (two-tailed t-test assuming zero difference), with 95% limits of agreement (1.96 standard deviations (SD)). EDV and ESV were computed as the maximum and minimum volume within the cardiac cycle respectively.
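The volume statistics described above can be computed as in the following sketch. The function names and sample numbers are illustrative, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def ejection_fraction(volumes_ml):
    """EF in percent from one cardiac cycle of volumes.

    EDV and ESV are taken as the maximum and minimum volume in the cycle.
    """
    edv = max(volumes_ml)
    esv = min(volumes_ml)
    return (edv - esv) / edv * 100.0


def limits_of_agreement(estimates, references):
    """Bias and 95% half-width (1.96 SD) of the paired differences.

    Uses the sample standard deviation (an assumption).
    """
    differences = [e - r for e, r in zip(estimates, references)]
    bias = statistics.mean(differences)
    half_width = 1.96 * statistics.stdev(differences)
    return bias, half_width


# Made-up volumes for one cycle and made-up paired measurements.
ef = ejection_fraction([120.0, 100.0, 60.0, 80.0])
bias, half_width = limits_of_agreement([11.0, 22.0, 33.0], [10.0, 20.0, 30.0])
print(ef, bias, half_width)
```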
5 Results
We observed that common challenges with 3D cardiac ultrasound, such as dropouts, shadows, and speckle noise were handled remarkably well, and segmentation was successful in all of the 21 patients. Some examples are shown in Fig. 2(b-d).
[Figure 2: panel (a) plot "Volume comparison", axes Vref [ml] and VASM [ml] (0 to 200); panels (b), (c), (d)]
Fig. 2. LV volumes obtained by the ASM (VASM) in all 21 patients is compared to the reference (Vref) and shown with the identity line (dashed) in (a). In (b-d), the end-diastolic segmentation (yellow) in three patients is compared to the reference (red) and shown along with the volume curve for one cardiac cycle.
The limits of agreement (mean±1.96SD) for the point-to-surface distance were 2.2±1.1 mm, indicating good overall agreement between the ASM and the reference. From Tab. 1, column 2, we see that the limits of agreement for volumes were 3.4±20 ml, with a strong correlation (r=0.95). The volume correspondence between the ASM and the reference is shown in Fig. 2(a).

Table 1. Segmentation results showing ASM versus reference

                              Volume [ml]
Difference (mean±1.96SD)      3.4* ±20
Correlation coeff. (r)        0.95

* Significantly different from 0, p …

… $I_k^{(3)} > I_k^{(1)}$. In region A', which has the same phase angles as region A, we would expect $I_k^{(1)} > I_k^{(3)} > I_k^{(2)}$. We therefore have a simple means of disambiguating azimuth angles in these regions:
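$$\text{if } \phi_k < 45^\circ \text{ then } \alpha_k = \begin{cases} \phi_k & \text{if } I_k^{(2)} > I_k^{(1)} \\ \phi_k + 180^\circ & \text{otherwise} \end{cases} \tag{4}$$

We use these specific inequalities (as opposed to $I_k^{(2)} > I_k^{(3)}$ or $I_k^{(3)} > I_k^{(1)}$) since we expect the difference in intensity to be greatest between $I_k^{(2)}$ and $I_k^{(1)}$ for regions A and A'. Similar arguments can be used for the remaining regions of the sphere in Fig. 4. In region B we expect $I_k^{(3)} > I_k^{(2)} > I_k^{(1)}$. Therefore, we have the condition:

$$\text{if } 45^\circ \leq \phi_k < 135^\circ \text{ then } \alpha_k = \begin{cases} \phi_k & \text{if } I_k^{(3)} > I_k^{(1)} \\ \phi_k + 180^\circ & \text{otherwise} \end{cases} \tag{5}$$

Finally, in region C, $I_k^{(3)} > I_k^{(1)} > I_k^{(2)}$, which gives us our final condition:

$$\text{if } 135^\circ \leq \phi_k \text{ then } \alpha_k = \begin{cases} \phi_k & \text{if } I_k^{(3)} > I_k^{(2)} \\ \phi_k + 180^\circ & \text{otherwise} \end{cases} \tag{6}$$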
We therefore have a complete means to disambiguate the azimuth angles from three light sources using (4 – 6). Since the zenith angles are fully constrained by (3) for diffuse reflection, the surface normals can be unambiguously determined using our proposed method.
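Conditions (4) to (6) translate directly into a per-pixel rule. The sketch below is illustrative only: the function name and argument order are choices made here, and the phase angle is assumed to be given in degrees in the range [0, 180).

```python
def disambiguate_azimuth(phase_deg, i1, i2, i3):
    """Azimuth angle in degrees from the phase angle and three intensities.

    phase_deg  -- ambiguous phase angle, assumed to lie in [0, 180)
    i1, i2, i3 -- pixel intensities under light sources 1, 2 and 3
    """
    if phase_deg < 45.0:
        keep = i2 > i1      # condition (4)
    elif phase_deg < 135.0:
        keep = i3 > i1      # condition (5)
    else:
        keep = i3 > i2      # condition (6)
    return phase_deg if keep else phase_deg + 180.0


# Made-up intensities: the same phase angle resolves to opposite azimuths
# depending on which light source gives the brighter pixel.
print(disambiguate_azimuth(30.0, 0.2, 0.8, 0.5))
print(disambiguate_azimuth(30.0, 0.8, 0.2, 0.5))
```

Only one pairwise comparison is needed per pixel, chosen by the phase angle range, so the rule adds negligible cost on top of the polarization measurements.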
Surface Reconstruction Using Polarization and Photometric Stereo
4 Experiments
This section presents the results of the above disambiguation process on real-world data. We also show how the azimuth angles can be used in conjunction with the zenith angles estimated from (3) to reconstruct three-dimensional surfaces. Finally, we show how the results can be improved by incorporating methods from earlier work in the field.

Fig. 5 shows the azimuth angles obtained using our disambiguation method for two smooth porcelain objects, a slightly rough plastic object, an apple and an orange. For this experiment, θL = 45° and DL = 1.5 m were used. Note that using too large a value for θL would result in problems due to cast shadows. The azimuth angles have clearly been disambiguated successfully for most regions. Only for small zenith angles, where the intensity has least variation with the light source and the degree of polarization is very low, are there any artifacts that reveal the light source distribution.

To reconstruct the surface height from the unambiguous surface normals, we applied the well-known Frankot-Chellappa integration algorithm [4]. The algorithm uses the Fourier transform to convert a field of surface normals into a depth map by finding the nearest integrable solution. The reconstructions of our test objects are shown in Fig. 6. The figure shows reasonable general reconstructions but with some problems caused by specularities in the rougher surfaces. Also, the surfaces are flatter than the real objects as a result of roughness, inter-reflections and unknown refractive indices. These problems have been noted in previous work on shape recovery from polarization in diffuse reflection [1,7]. Essentially, because the polarizing properties of diffuse reflection are weak, the raw zenith angle estimates are too heavily contaminated by noise to provide sufficiently accurate reconstructions. This is especially true for rough surfaces [2].
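As a small illustration of the input to the integration step, the sketch below forms a unit surface normal and the corresponding height gradients from a zenith and an azimuth angle. The coordinate convention (viewer along the z axis, height field z(x, y), normal proportional to (−∂z/∂x, −∂z/∂y, 1)) and the function names are assumptions made here, not taken from the paper.

```python
import math


def surface_normal(zenith_deg, azimuth_deg):
    """Unit normal from zenith and azimuth angles (viewer along +z)."""
    theta = math.radians(zenith_deg)
    alpha = math.radians(azimuth_deg)
    return (math.sin(theta) * math.cos(alpha),
            math.sin(theta) * math.sin(alpha),
            math.cos(theta))


def surface_gradient(zenith_deg, azimuth_deg):
    """Height gradients (dz/dx, dz/dy) implied by the normal.

    Only meaningful for zenith angles below 90 degrees.
    """
    nx, ny, nz = surface_normal(zenith_deg, azimuth_deg)
    return (-nx / nz, -ny / nz)


# A 45 degree slope: flipping the azimuth by 180 degrees flips the gradient,
# which is why the azimuth must be disambiguated before integration.
print(surface_gradient(45.0, 0.0))
print(surface_gradient(45.0, 180.0))
```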
Fig. 5. Top row: raw images of the test objects (illuminated from Source 2). Bottom row: disambiguated azimuth angles.
G.A. Atkinson and E.R. Hancock
Fig. 6. Surface reconstructions of the objects in Fig. 5.
Miyazaki et al. [7] address this issue by making the assumption that the histogram of zenith angles of the reconstructed surface matches that of a hemisphere. The histogram of the hemisphere takes the form $N \sin 2\theta$, where $N$ is the number of points in the histogram. Atkinson and Hancock [2] use a different approach, where a relationship between the surface zenith angle and the pixel brightness is sought. Their method uses robust statistics to fit a curve to the histogram of pixel brightnesses and zenith angle estimates. Although it is more general with respect to surface shape than the Miyazaki et al. method, it assumes that the light source and camera directions are identical, which necessitates a fourth light source position here.

Fig. 7 shows how these two methods improve the surface reconstruction for the porcelain vase. Both methods show significant improvement, with the Atkinson and Hancock method marginally better on this occasion. For comparison, the mean angular errors across the entire vase were 19° when the raw estimates were used for the reconstruction (Fig. 7a), 9° using the Miyazaki et al. method (b) and 6° for the Atkinson and Hancock method (c). It should be noted, however, that these values are affected by alignment error. Both methods add very little computational cost to the algorithm (a few seconds). The Miyazaki et al. method is particularly useful for rough surfaces where direct zenith angle estimates are too degraded to be of any practical use.
Fig. 7. Profile of the vase reconstruction using (a) raw zenith angle estimates, (b) the zenith angles estimated using the Miyazaki et al. method, and (c) the zenith angles using the Atkinson and Hancock method. The exact profile is shown by the thick line.
5 Conclusion
This paper presented a new shape recovery technique for computer vision, which used polarization information and photometric stereo. The method used Fresnel theory to estimate the zenith angles completely and the azimuth angles ambiguously, assuming that the reflection from the surface was diffuse. Three light source positions were then used to reliably disambiguate the azimuth angles. The resulting field of constrained surface normals was then converted into depth using the Frankot-Chellappa surface integration algorithm. Finally, the paper showed how the zenith angle estimates can be improved by histogram modification or by incorporating pixel brightness information. In future work, we hope to extend the technique to allow for more general lighting conditions by modifying conditions (4–6). We also hope to improve the way in which the zenith angles are enhanced by estimating relations between the pixel brightnesses and the zenith and azimuth angles. This would remove the need for the fourth illumination direction that was mentioned in the previous section. Alternatively, the polarization data could be processed using singular value decomposition (SVD) in a similar fashion to Yuille et al. [11]. Previously, SVD has only been used in photometric stereo for shape estimation up to a linear transform, but the additional information contained within polarization should allow for unambiguous reconstruction.
References

1. Atkinson, G.A., Hancock, E.R.: Shape estimation using polarization and shading from two views. IEEE Trans. Patt. Anal. Mach. Intell. (to appear)
2. Atkinson, G.A., Hancock, E.R.: Recovery of surface orientation from diffuse polarization. IEEE Trans. Im. Proc. 15, 1653–1664 (2006)
3. Drbohlav, O., Šára, R.: Unambiguous determination of shape from photometric stereo with unknown light sources. In: Proc. ICCV, pp. 581–586 (2001)
4. Frankot, R.T., Chellappa, R.: A method for enforcing integrability in shape from shading algorithms. IEEE Trans. Patt. Anal. Mach. Intell. 10, 439–451 (1988)
5. Hecht, E.: Optics, 3rd edn. Addison Wesley Longman, London, UK (1998)
6. Miyazaki, D., Kagesawa, M., Ikeuchi, K.: Transparent surface modelling from a pair of polarization images. IEEE Trans. Patt. Anal. Mach. Intell. 26, 73–82 (2004)
7. Miyazaki, D., Tan, R.T., Hara, K., Ikeuchi, K.: Polarization-based inverse rendering from a single view. In: Proc. ICCV, vol. 2, pp. 982–987 (2003)
8. Rahmann, S., Canterakis, N.: Reconstruction of specular surfaces using polarization imaging. In: Proc. CVPR, pp. 149–155 (2001)
9. Wolff, L.B., Boult, T.E.: Constraining object features using a polarisation reflectance model. IEEE Trans. Patt. Anal. Mach. Intell. 13, 635–657 (1991)
10. Woodham, R.J.: Photometric method for determining surface orientation from multiple images. Optical Engineering 19, 139–144 (1980)
11. Yuille, A.L., Snow, D., Epstein, R., Belhumeur, P.: Determining generative models for objects under varying illumination: Shape and albedo from multiple images using SVD and integrability. Intl. J. Comp. Vis. 35, 203–222 (1999)
12. Zhang, R., Tsai, P.S., Cryer, J.E., Shah, M.: Shape from shading: A survey. IEEE Trans. Patt. Anal. Mach. Intell. 21, 690–706 (1999)
Curvature Estimation in Noisy Curves

Thanh Phuong Nguyen and Isabelle Debled-Rennesson

LORIA Nancy, Campus Scientifique - BP 239, 54506 Vandoeuvre-lès-Nancy Cedex, France
{nguyentp,debled}@loria.fr
Abstract. An algorithm for estimating the curvature at each point of a general discrete curve in $O(n \log^2 n)$ is proposed. It uses the notion of blurred segment, which extends the definition of a segment of an arithmetic discrete line so that it can be adapted to noisy curves. The proposed algorithm relies on the decomposition of a discrete curve into maximal blurred segments, which is also presented in this paper.
1
Introduction
A lot of applications in image processing require the geometrical measuring of represented discrete objects. In the framework of the discrete geometry, estimators of geometrical parameters have been proposed but they rely on the recognition of discrete line segments which is very sensitive to the noise existing in the studied curves [1,2,3]. The boundary of such discrete objects is often noisy due to acquisition process. Therefore the concept of blurred segment was introduced [4], it allows the flexible segmentation of discrete curves, taking into account noise. Relying on an arithmetic definition of discrete lines [5], it generalizes such lines, admitting that some points are missing. A curvature estimator was proposed in [6] and used in an application to the arc detection in technical documents [7], the complexity of this curvature estimator algorithm is in O(n2 ) (n is the number of points of the studied curve) and it can only be applied to 8-connected simple curves. We propose in this paper an extension to general curves of this algorithm. First the recognition algorithm of blurred segments proposed in [4] is extended to general curves [8]; the problem of adding or removing a point is studied. Then thanks to the decomposition of curve into maximal blurred segments, we propose an algorithm of calculation of the curvature at each point of a general discrete curve in O(n log2 n). The paper is organized as follows. In Section 2, after recalling definitions related to blurred segments, we study the problem of adding (or removing) a point to (from) a blurred segment of width ν in the case of a general discrete curve. Then we propose an extension to the noisy curves of the notion of maximal segment of a discrete curve. An algorithm to determine all maximal blurred segments of a discrete curve is given in Section 3. In Section 4, after recalling
This work is supported by the ANR in the framework of GEODIB project.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 474–481, 2007. © Springer-Verlag Berlin Heidelberg 2007
Curvature Estimation in Noisy Curves
the definition of the curvature estimator adapted to noisy curves, we propose a new algorithm for the determination of the curvature at each point of a discrete curve. Examples are also given.
2 Blurred Segment of Width ν
2.1 Definitions
The notion of blurred segment relies on the arithmetical definition of discrete lines [5], where a line with slope a/b, lower bound μ and thickness ω (with a, b, μ and ω integers such that gcd(a, b) = 1) is the set of integer points (x, y) verifying μ ≤ ax − by < μ + ω. Such a line is denoted by D(a, b, μ, ω). Let us recall the definitions [4] that we use in this paper (see Figure 1):
Definition 1. Let us consider a set of 8-connected points Sb. The discrete line D(a, b, μ, ω) is said to be bounding for Sb if all points of Sb belong to D.
Definition 2. Let us consider a set of 8-connected points Sb. A bounding line of Sb is said to be optimal if its vertical distance, i.e. (ω − 1)/max(|a|, |b|), is minimal, i.e. if its vertical distance is equal to the vertical distance of conv(Sb), the convex hull of Sb.
Definition 3. A set Sb is a blurred segment of width ν if its optimal bounding line has a vertical distance less than or equal to ν, i.e. if (ω − 1)/max(|a|, |b|) ≤ ν.
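The definitions above can be illustrated with a minimal Python sketch. The function names and the three sample points are hypothetical and chosen only for illustration; the line D(5, 8, −8, 11) and its vertical distance 10/8 are those of Figure 1.

```python
from fractions import Fraction

def in_discrete_line(point, a, b, mu, omega):
    """Return True if the integer point (x, y) verifies mu <= a*x - b*y < mu + omega."""
    x, y = point
    return mu <= a * x - b * y < mu + omega

def is_bounding(points, a, b, mu, omega):
    """D(a, b, mu, omega) is bounding for a set if every point of the set belongs to it."""
    return all(in_discrete_line(p, a, b, mu, omega) for p in points)

def vertical_distance(a, b, omega):
    """Exact value of (omega - 1) / max(|a|, |b|) for the line D(a, b, mu, omega)."""
    return Fraction(omega - 1, max(abs(a), abs(b)))

# Hypothetical sample points tested against the line of Figure 1.
sample = [(0, 0), (1, 1), (2, 1)]
print(is_bounding(sample, 5, 8, -8, 11))        # all three points satisfy the double inequality
print(vertical_distance(5, 8, 11))              # 5/4, i.e. 10/8 = 1.25
print(vertical_distance(5, 8, 11) <= 2)         # the line is thin enough for width nu = 2
```

Using exact fractions avoids rounding issues when the vertical distance is compared with the width ν.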
Fig. 1. D(5, 8, −8, 11), optimal bounding line (vertical distance = 10/8 = 1.25) of the sequence of gray points
2.2 Add (or Remove) a Point to (from) the Blurred Segment of Width ν
In this section we study the problem of adding (or removing) a point to (from) a blurred segment of width ν. The recognition algorithm for blurred segments of width ν presented in [4] runs in linear time; however, it only considers the incremental addition of a point to a blurred segment in the first octant. We present here the general case, which requires the incremental calculation of both the height and the width of the convex hull after a point is added or removed. To do that, we use the results given in [9] and [8], which we briefly recall below.
T.P. Nguyen and I. Debled-Rennesson
Dynamic Estimation of the Convex Hull. The problem of dynamically maintaining the convex hull of a set of points when a point is added to (or removed from) this set was studied by M.H. Overmars and J. van Leeuwen [9]. In that work, the convex hull is considered as the union of two parts: the upper convex hull (Uhull) and the lower convex hull (Lhull). Uhull and Lhull are updated after each addition or removal of a point; the cost of these operations is given by the following theorem [9].
Theorem 1. The convex hulls Uhull and Lhull of a set S of n points can be maintained dynamically at a worst-case cost of O(log² n) per addition or removal.
Determination of Height and Width of the Convex Hull. We use the double binary search technique [8] to determine the height and width of the convex hull. In [8], the convex hull is also considered as the union of the two parts Uhull and Lhull. The double binary search finds the vertical width of the convex hull in O(log² n) by using the concavity of the function height(x) = Uhull(x) − Lhull(x).
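To make the quantities of this section concrete, here is a small static Python sketch. It is not the dynamic structure of [9] nor the double binary search of [8]: it simply recomputes the hull from scratch (monotone chain) and evaluates height(x) = Uhull(x) − Lhull(x) at every hull vertex, which suffices because a concave piecewise-linear function reaches its maximum at a vertex. The function names are hypothetical, and taking the smaller of the height and the width as the quantity to compare with ν is our reading of the max(|a|, |b|) normalization of Definition 3.

```python
from fractions import Fraction

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def vertical_height(points):
    """Maximum over x of Uhull(x) - Lhull(x), evaluated at each hull vertex (brute force)."""
    hull = convex_hull(points)
    best = Fraction(0)
    for x, _ in hull:
        ys = []
        for i in range(len(hull)):
            (x1, y1), (x2, y2) = hull[i], hull[(i + 1) % len(hull)]
            if min(x1, x2) <= x <= max(x1, x2):
                if x1 == x2:                      # vertical edge: both endpoints count
                    ys += [Fraction(y1), Fraction(y2)]
                else:                             # exact linear interpolation along the edge
                    ys.append(Fraction(y1) + Fraction(y2 - y1, x2 - x1) * (x - x1))
        best = max(best, max(ys) - min(ys))
    return best

def isothetic_thickness(points):
    """Smaller of the vertical height and the horizontal width (assumption, see lead-in)."""
    swapped = [(y, x) for x, y in points]
    return min(vertical_height(points), vertical_height(swapped))

triangle = [(0, 0), (4, 0), (2, 2)]
print(vertical_height(triangle), isothetic_thickness(triangle))
```

A set of points would then be accepted as a blurred segment of width ν when `isothetic_thickness(points) <= nu`; the cost here is far from the O(log² n) per update of the dynamic approach and is only meant to fix ideas.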
3 Maximal Blurred Segments of Width ν
3.1 Definitions and First Property
The notion of maximal segment of a discrete curve was proposed in [1,3] and relies on discrete line segments. This structure gives a global understanding of the discrete curve to be analyzed. We propose here an extension of that notion to blurred segments, adapted to noisy curves, using the same notations as in [3]. Let us consider a discrete curve C whose points are indexed from 0 to n − 1. C is a general curve and its points can be disconnected. We denote by Ci,j the set of successive points of C ordered increasingly from index i to j.
Definition 4. The predicate "Ci,j is a blurred segment of width ν" is denoted by BS(i, j, ν). The first index j, i ≤ j, such that BS(i, j, ν) and ¬BS(i, j + 1, ν) is called the front of i and denoted F(i). Symmetrically, the first index i, i ≤ j, such that BS(i, j, ν) and ¬BS(i − 1, j, ν) is called the back of j and denoted B(j).
Definition 5. Ci,j is called a maximal blurred segment of width ν and denoted MBS(i, j, ν) iff BS(i, j, ν) and ¬BS(i, j + 1, ν) and ¬BS(i − 1, j, ν). An equivalent characterization of a maximal blurred segment of width ν, MBS(i, j, ν), is that F(i) = j and B(j) = i. In this work, we use the notion of blurred segment of width ν which is maximal on the right or left side:
Definition 6. Ci,j is called a maximal blurred segment of width ν on the right side (resp. on the left side) and denoted MBSR(i, j, ν) (resp. MBSL(i, j, ν)) if F(i) = j (resp. B(j) = i).
Property 1. Let C be a discrete curve and MBSν(C) the sequence of maximal blurred segments of width ν of the curve C. Then, MBSν(C) =
{MBS(B1, E1, ν), MBS(B2, E2, ν), ..., MBS(Bm, Em, ν)} and satisfies B1 < B2 < ... < Bm. So we have: E1 < E2 < ... < Em.
Proof: We consider two consecutive maximal blurred segments MBS(Bi, Ei, ν) and MBS(Bi+1, Ei+1, ν). By hypothesis, Bi < Bi+1. Let us suppose that Ei ≥ Ei+1; then MBS(Bi+1, Ei+1, ν) is a part of MBS(Bi, Ei, ν). Therefore MBS(Bi+1, Ei+1, ν) is not a maximal blurred segment, which is a contradiction.
3.2 Algorithm for the Segmentation of a Curve C into Maximal Blurred Segments
We propose an algorithm (see Algorithm 1) which determines all maximal blurred segments of width ν of a discrete curve C according to the conditions given in Section 3.1. To do that, Property 1 is used.
Complexity. Each point of the curve is scanned at most twice in this algorithm. The cost of determining a new optimal bounding discrete line when we add (or remove) a point to (from) a blurred segment is in O(log² n). Hence the complexity of this algorithm is in O(n log² n).
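The two-pointer scan behind this segmentation can be sketched in Python as follows. This is a simplified illustration rather than a transcription of Algorithm 1: the blurred-segment test is abstracted as a hypothetical callback `is_segment(i, j)` standing for BS(i, j, ν), which is assumed to hold for single points and for every sub-range of a range where it holds. The incremental hull update is therefore not shown, and the toy predicate of the example is not a geometric one.

```python
def maximal_segments(n, is_segment):
    """Return the list of (B_i, E_i) index pairs of all maximal segments of a curve of n points.

    is_segment(i, j) is a caller-supplied predicate playing the role of BS(i, j, nu).
    """
    segments = []
    begin, end = 0, 0
    while begin < n:
        # Extend on the right as long as the predicate still holds: end becomes F(begin).
        while end + 1 < n and is_segment(begin, end + 1):
            end += 1
        segments.append((begin, end))
        if end == n - 1:
            break
        # Remove points on the left until the next point can be added.
        begin += 1
        while not is_segment(begin, end + 1):
            begin += 1
        end += 1
    return segments

# Toy predicate (assumption): a range is accepted when its values spread by at most 1.
values = [0, 0, 1, 3, 3, 4, 0]
spread_ok = lambda i, j: max(values[i:j + 1]) - min(values[i:j + 1]) <= 1
print(maximal_segments(len(values), spread_ok))
```

Both indices only move forward, so the predicate is evaluated O(n) times, which matches the "each point is scanned at most twice" argument above; the begin and end indices of the returned pairs are strictly increasing, as stated in Property 1.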
4 Discrete Curvature of Width ν
4.1 Definition
We recall in this section the curvature estimator adapted to noisy curves [6]. It is directly deduced from the estimator proposed by D. Coeurjolly [2] for 2D curves without noise. This technique can be seen as a generalization of the classical order m normalized curvature [10]. Let C be a discrete curve and Ck a point of the curve. Let us consider the points Cl and Cr of C such that: l < k < r, BS(l, k, ν) and ¬BS(l − 1, k, ν), BS(k, r, ν) and ¬BS(k, r + 1, ν). The curvature of width ν at the point Ck is estimated from the radius of the circle passing through the points Cl, Ck and Cr. To determine the radius Rν(Ck) of the circumcircle of the triangle [Cl, Ck, Cr], we use the formula given in [11] (see Figure 2.a). Let $s_1 = \|\overrightarrow{C_k C_r}\|$, $s_2 = \|\overrightarrow{C_k C_l}\|$ and $s_3 = \|\overrightarrow{C_l C_r}\|$; then

$$R_\nu(C_k) = \frac{s_1 s_2 s_3}{\sqrt{(s_1 + s_2 + s_3)(s_1 - s_2 + s_3)(s_1 + s_2 - s_3)(s_2 + s_3 - s_1)}}$$

Then, the curvature of width ν at the point Ck is $C_\nu(C_k) = \frac{s}{R_\nu(C_k)}$ with $s = \mathrm{sign}(\det(\overrightarrow{C_k C_r}, \overrightarrow{C_k C_l}))$ (it indicates the concavities and convexities of the curve). As indicated in [2], the degenerate cases, which correspond for example to collinear half-tangents, may be tested independently, and a null curvature is then assigned to the considered point.
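The circumradius formula and the sign convention above translate directly into a short Python sketch. The function name is hypothetical, and the degenerate case is detected here through a zero determinant, which is one possible way of implementing the independent test mentioned in the text.

```python
import math

def curvature_from_triangle(cl, ck, cr):
    """Signed curvature s / R at ck, where R is the circumradius of the triangle [cl, ck, cr]."""
    ux, uy = cr[0] - ck[0], cr[1] - ck[1]          # vector Ck -> Cr
    vx, vy = cl[0] - ck[0], cl[1] - ck[1]          # vector Ck -> Cl
    det = ux * vy - uy * vx
    if det == 0:                                   # collinear points: null curvature
        return 0.0
    s1 = math.hypot(ux, uy)
    s2 = math.hypot(vx, vy)
    s3 = math.hypot(cr[0] - cl[0], cr[1] - cl[1])
    denom = math.sqrt((s1 + s2 + s3) * (s1 - s2 + s3) * (s1 + s2 - s3) * (s2 + s3 - s1))
    radius = s1 * s2 * s3 / denom
    return math.copysign(1.0, det) / radius

# Three points on a circle of radius 5 centred at the origin.
print(curvature_from_triangle((5, 0), (0, 5), (-5, 0)))   # close to 1/5
print(curvature_from_triangle((0, 0), (1, 1), (2, 2)))    # 0.0 (degenerate case)
```

Swapping the roles of the left and right points flips the sign of the determinant and therefore of the curvature, which is how concavities and convexities are told apart.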
Algorithm 1. Algorithm for the segmentation of a curve C into maximal blurred segments of width ν
Data: C - discrete curve with n points, ν - width of the segmentation
Result: MBSν - the sequence of maximal blurred segments of width ν
begin
    k = 0; Sb = {C0}; MBSν = ∅;
    a = 0; b = 1; ω = b; μ = 0;
    while (ω − 1)/max(|a|, |b|) ≤ ν do
        k++; Sb = Sb ∪ Ck;
        Determine D(a, b, μ, ω) of Sb;
    end
    bSegment = 0; eSegment = k − 1;
    MBSν = MBSν ∪ CbSegment,eSegment;
    while k < n − 1 do
        while (ω − 1)/max(|a|, |b|) > ν do
            bSegment++; Sb = Sb \ CbSegment;
            Determine D(a, b, μ, ω) of Sb;
        end
        while (ω − 1)/max(|a|, |b|) ≤ ν do
            k++; Sb = Sb ∪ Ck;
            Determine D(a, b, μ, ω) of Sb;
        end
        eSegment = k − 1;
        MBSν = MBSν ∪ CbSegment,eSegment;
    end
end
4.2 Algorithm for the Estimation of the Curvature of Width ν at Each Point of C
We propose in this section a new algorithm for the determination of the curvature of width ν at each of the n points of a curve C. Its complexity is better than that of the naive algorithm, which is in O(n²). The estimation consists in calculating, at each point Ck, the maximal blurred segment on the right side, MBSR, the maximal blurred segment on the left side, MBSL, and then the circle passing through the 3 points: the left extremity of MBSL, Ck, and the right extremity of MBSR.
Principle of the Algorithm. Let MBSR(k, r, ν) and MBSL(l, k, ν) be the maximal blurred segments on the right and left sides of the point Ck. Then there exist r′ ≤ k and l′ ≥ k such that MBSR(k, r, ν) ⊂ MBS(r′, r, ν) and MBSL(l, k, ν) ⊂ MBS(l, l′, ν). Let us then consider the decomposition of C into maximal blurred segments: MBSν(C) = {MBS(B1, E1, ν), MBS(B2, E2, ν), ..., MBS(Bm, Em, ν)} with B1 < B2 < ... < Bm and E1 < E2 < ... < Em. We look for the indices i and j such that i is the first index such that Ei ≥ k and j is the last index such that Bj ≤ k. It follows that l = Bi, r = Ej, and that the curvature of width ν at the point Ck is the inverse of the radius of the circumcircle of the triangle [Cl, Ck, Cr]. More generally, we have the following simple result:
Fig. 2. (a) Estimation of the curvature at the point T with width 2 (radius: 14.7638). (b) Ei (Bi+1) is the front (back) of the points in the first (second) bold edge.
Property 2. Let L(k) and R(k) be the functions giving, respectively, the indices of the left and right extremities of the maximal blurred segments on the left and right sides of the point Ck.
– ∀k such that Ei−1 < k ≤ Ei, L(k) = Bi
– ∀k such that Bi ≤ k < Bi+1, R(k) = Ei
This method is used in Algorithm 2 (see Figure 2.b).
Complexity. Both the labelling step and the estimation of the curvature at each point are executed in linear time. However, the determination of the maximal blurred segments is executed in O(n log² n). Thus the complexity of our method is in O(n log² n). It is more efficient than the existing methods, which are in O(n² log² n) for general curves and in O(n²) for simple curves.
Algorithm 2. Width ν curvature estimation at each point of C
Data: C discrete curve of n points, ν width of the segmentation
Result: {Cν(Ck)}k=0..n−1 - Curvature of width ν at each point of C
begin
    Build MBSν = {MBS(Bi, Ei, ν)};
    m = |MBSν|; E−1 = −1; Bm = n;
    for i = 0 to m − 1 do
        for k = Ei−1 + 1 to Ei do L(k) = Bi;
        for k = Bi to Bi+1 − 1 do R(k) = Ei;
    end
    for i = 0(∗) to n − 1(∗) do
        Rν(Ci) = Radius of the circumcircle to [CL(i), Ci, CR(i)];
        Cν(Ci) = s / Rν(Ci);
    end
end
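The labelling step of Algorithm 2 can be sketched in Python as below. The function name is hypothetical; the input is assumed to be the list of (Bi, Ei) pairs of the maximal blurred segments, already sorted as in Property 1, and the sentinels E−1 = −1 and Bm = n are those of the algorithm.

```python
def label_extremities(segments, n):
    """Return the lists (L, R) such that L[k] and R[k] follow Property 2.

    segments: list of (B_i, E_i) pairs with strictly increasing B_i and E_i.
    n: number of points of the curve.
    """
    m = len(segments)
    left = [None] * n
    right = [None] * n
    for i, (b, e) in enumerate(segments):
        prev_end = segments[i - 1][1] if i > 0 else -1       # sentinel E_{-1} = -1
        next_begin = segments[i + 1][0] if i + 1 < m else n  # sentinel B_m = n
        for k in range(prev_end + 1, e + 1):                 # E_{i-1} < k <= E_i
            left[k] = b
        for k in range(b, next_begin):                       # B_i <= k < B_{i+1}
            right[k] = e
    return left, right

# Hypothetical decomposition of a 4-point curve into three overlapping segments.
print(label_extremities([(0, 1), (1, 2), (2, 3)], 4))
```

Each index k is written once in each list, so the labelling is linear in n; the curvature at Ck is then obtained from the triangle [C_L(k), C_k, C_R(k)] whenever L(k) < k < R(k).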
Fig. 3. Examples of curvature extraction with ν = 2. (a) Noisy circle, radius = 20. (b) Curvature graph of (a): max = −0.0444, min = −0.0531, mean = −0.0474. (c) Second discrete curve. (d) Curvature graph of (c): max = 0.1761, min = −0.201.
(*) The bounds mentioned in the algorithm are correct for a closed curve. In the case of an open curve, the instruction becomes: for i = l to n − 1 − l, with l fixed to a constant value. Indeed, it is not possible to calculate a maximal blurred segment on the left side (resp. on the right side) at the first point (resp. at the last point) of the curve. Thus the calculation of the curvature begins (resp. stops) at the l-th (resp. (n − 1 − l)-th) point of the curve.
Results. Two discrete curves (3.a and 3.c) are shown in Figure 3, together with the graphs of their curvature values calculated at each point with width 2 (3.b and 3.d). The points of curve 3.c corresponding to the peaks of the associated curvature graph 3.d are indicated by black pixels.
5 Conclusion
We have proposed the notion of maximal blurred segments of a discrete curve for a given width as an extension to noisy curves of the definitions proposed in [1,3]. This decomposition of a discrete curve enabled us to propose an optimized
version of the algorithm for calculating the curvature of width ν at each point of a general discrete curve. However, we expect to obtain a specific algorithm with a better complexity for simple curves where points are added in one direction. The extension of the results of this paper to 3D discrete curves will be the subject of a future publication.
References
1. Feschet, F., Tougne, L.: Optimal time computation of the tangent of a discrete curve: Application to the curvature. In: Bertrand, G., Couprie, M., Perroton, L. (eds.) DGCI 1999. LNCS, vol. 1568, pp. 31–40. Springer, Heidelberg (1999)
2. Coeurjolly, D., Miguet, S., Tougne, L.: Discrete curvature based on osculating circle estimation. In: Arcelli, C., Cordella, L.P., Sanniti di Baja, G. (eds.) Visual Form 2001. LNCS, vol. 2059, pp. 303–312. Springer, Heidelberg (2001)
3. Lachaud, J.-O., Vialard, A., de Vieilleville, F.: Analysis and comparative evaluation of discrete tangent estimators. In: Andrès, É., Damiand, G., Lienhardt, P. (eds.) DGCI 2005. LNCS, vol. 3429, pp. 240–251. Springer, Heidelberg (2005)
4. Debled-Rennesson, I., Feschet, F., Rouyer-Degli, J.: Optimal blurred segments decomposition of noisy shapes in linear time. Computers & Graphics 30 (2006)
5. Reveillès, J.P.: Géométrie discrète, calculs en nombres entiers et algorithmique. Thèse d'état, Université Louis Pasteur, Strasbourg (1991)
6. Debled-Rennesson, I.: Estimation of tangents to a noisy discrete curve. In: Vision Geometry XII. SPIE, vol. 5300, pp. 117–126 (2004)
7. Salmon, J.P., Debled-Rennesson, I., Wendling, L.: A new method to detect arcs and segments from curvature profiles. In: Proceedings of the 18th International Conference on Pattern Recognition, Hong-Kong, China (2006)
8. Buzer, L.: An elementary algorithm for digital line recognition in the general case. In: Andrès, É., Damiand, G., Lienhardt, P. (eds.) DGCI 2005. LNCS, vol. 3429, pp. 299–310. Springer, Heidelberg (2005)
9. Overmars, M., van Leeuwen, J.: Maintenance of configurations in the plane. J. Comput. and Syst. Sci. 23, 166–204 (1981)
10. Rosenfeld, A., Johnston, E.: Angle detection on digital curves. IEEE Transactions on Computers, 875–878 (1973)
11. Harris, J., Stocker, H.: Handbook of mathematics and computational science. Springer, Heidelberg (1998)
3D+t Reconstruction in the Context of Locally Spheric Shaped Data Observation
Wafa Rekik¹, Dominique Béréziat², and Séverine Dubuisson¹
¹ Université Pierre et Marie Curie (UPMC), Laboratoire d'Informatique de Paris 6 (LIP6), 104 Avenue du Président Kennedy, 75016 Paris
² Clime project/INRIA Rocquencourt, B.P. 105, 78153 Le Chesnay Cedex, France
[email protected]
Abstract. The main focus of this paper is 3D+t shape recovery from 3D spatial data and 2D+t temporal sequences. This reconstruction is particularly challenging due to the great deal of in-depth information loss observed in the 2D+t temporal sequence. Our approach embeds a local geometrical constraint to handle this critical lack of information. This prior constraint is defined by a spherical topology, because several applications may be concerned. It allows us to model relevantly the 3D-to-2D transformation that reduces each 3D image to a 2D frame. We can then build an inaccurate 3D inverse reconstruction of each 2D frame belonging to the video, i.e. the 2D+t sequence. These inaccurate 3D images are enhanced by gradual motion compensation using a regularity criterion. Results on synthetic data are displayed.
Keywords: 3D reconstruction, motion compensation, variational formulation, optical flow constraint.
1 Introduction
In some computer vision applications, we deal with volumetric, i.e. 3D, acquisitions framing temporal, i.e. 2D+t, ones (figure 1). The first type of acquisition provides a description of the 3D scene geometry, thus purely spatial information. The second type observes 2D object motion, providing temporal and partial spatial information. A possible approach to exploit these complementary datasets exhaustively is to carry out a complete spatio-temporal, or 3D+t, scene reconstruction. 3D+t sequences are restored by recovering the original 3D volume from each 2D frame belonging to the 2D+t sequence. As a matter of fact, 3D reconstruction is an inverse problem. In this particular case, it is also an ill-posed problem, since a single 2D frame is hardly sufficient to recover the original 3D volume. Indeed, this frame exhibits a great deal of spatial distortion and in-depth loss of information caused by the projective transformation that reduces the real 3D structure to a 2D image. To handle this critical lack of information, we introduce a prior data model. It consists of a geometrical constraint defined by a spherical topology. Data observing locally spheric shaped objects can
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 482–489, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Recovering a 3D+t sequence from a couple of 3D images framing a series of 2D projections (panel labels: Volume1 (3D), 2D Projections, Volume2 (3D), 3D+t full sequence)
be issued from remote sensing acquisitions, especially in meteorology, from astronomical acquisitions focusing on the sun, and from bio-cellular microscopic imagery. A relevant biological application is described in [1], where biologists study the dynamics of structures of interest evolving on a spherical cell simulation surface. In this context, for instance, spatial datasets consist of multifocus acquisitions. Hence, they are not actual 3D images, since they are formed by a series of equal-sized slices, where the signal in each slice is disturbed by luminosity coming from adjacent ones. A spatial deconvolution stage is therefore needed once biologists have correctly calibrated the video microscopy system for cell wall simulation experiments. Due to this limitation, and to preserve generality, we use synthetic data to validate the implemented algorithms. Our approach uses this prior geometrical constraint in order to model relevantly the 3D-to-2D transformation. This modeling allows us to build an inaccurate 3D inverse reconstruction of each 2D frame belonging to the video sequence. These inaccurate 3D images are enhanced by gradual motion compensation using a regularity criterion. We proceed with a short overview of 3D reconstruction methods using multiple or single projective views in section 2. Then, we move to the description of our 3D+t reconstruction approach in section 3. Experimental results on synthetic datasets are given in section 4.
2 State-of-the-Art
In the literature, a wide scope of approaches addresses the 3D reconstruction problem. Volume based methods aim at restoring an accurate description of luminosity in each voxel of the 3D reconstructed image, while surface based ones provide a 3D model of the object of interest. This model requires shape and texture recovery of the outer visible side of the observed object. Namely, tomographic reconstruction [2] focuses on restoring the original volume from multiple projective views. Stereo-vision [3] approaches make use in general of the triangulation principle in order to recover the surface shape of selected targets.
Moreover, 2D-to-3D registration methods reconstruct 3D features from only one projective view. This registration consists in identifying a geometrical transformation that aligns 2D frames with 3D datasets often acquired with a different modality. This alignment determines their relative position and orientation and then locates them in the coordinate system of the 3D images. It is of growing interest in the medical imaging field, mainly to register 3D pre-operative images (such as MRI or CT images) with 2D series of intra-operative frames acquired during a surgical intervention. Registration may rely on feature-based or intensity-based methods; the reader is referred to [4,5]. Tomography and stereo-vision require multiple projective views to achieve the reconstruction task. Therefore, approaches that align 2D projection images with 3D datasets appear to be suited to our context of study. However, they fit into a registration framework rather than a reconstruction one, unless other projective views are available, which is quite unusual. They can alternatively be used as a preprocessing stage for some of the aforementioned applications. We therefore built an adapted reconstruction model involving a prior geometrical constraint.
3 3D+t Reconstruction Approach
We carry out the 3D+t reconstruction using a couple of 3D datasets acquired at two different instants, as well as the 2D+t sequence filmed in between. Since the intermediary video sequence has an interesting temporal resolution, we propose to restore the underlying 3D structure of each frame. Each frame results from a reduction of the real 3D structure to a 2D projective view at a given time t. We take into account the prior geometrical constraint related to the observed object in order to build an inaccurate 3D inverse reconstruction of each 2D frame. We then attempt to improve this inaccurate 3D representation by motion compensation. We compute a motion vector field matching the inaccurate 3D structure with the 3D one already estimated correctly at time t − 1. The whole 3D+t sequence is reconstructed gradually using the couple of 3D datasets as initial boundary conditions.
3.1 Modeling Assumption
We carry out the 3D+t reconstruction using a series of 2D+t video sequence acquisitions interleaved with 3D volumetric ones. We adopt the following notations for the remainder of the paper: X = (x, y, z), x = (x, y), W = (u, v, w), I(X, t), I2D(x, t) and Î(X, t) are respectively a 3D position vector, a 2D position vector, a 3D vector field, a 3D+t sequence, a 2D+t sequence and, finally, the estimate of a 3D image at time t. Concretely, we have two 3D structures at two different instants: I⁰ = I(X; t0) and I¹ = I(X; t1). Between t = t0 and t = t1, we also have a video sequence I2D(x; t) describing all the 2D reductions of the structures I(X; t) evolving over this duration. Each 2D frame I2D is a projective view, at a given instant t, of the original 3D volume. This projection depends only on the data acquisition mechanism. We model it by a linear and
stationary transformation, called projection p:

$$p(I)(x, y) = \int_{\mathbb{R}} I(x, y, z)\, h(z)\, dz,$$

where h(z) is a measure kernel describing the interaction between the observed object and the input light signal. p integrates the contributions of all 3D slices to produce a single 2D image. It therefore introduces an important loss of information in the z-direction.
3.2 Prior Geometrical Constraint
Let us point out that, in our case of study, structures of interest evolve on a spheroid surface. The transformation p does not introduce distortions in the x- and y-directions; consequently, p reduces merely to a sum of two contributions coming from, respectively, the frontal and the dorsal hemisphere. 2D frames therefore suffer from a positioning ambiguity. Indeed, we are unable to assert whether a projected object belongs to the frontal or to the dorsal hemisphere. We suppose that the spherical support has a constant radius, which we estimate in a pre-processing stage. We map the texture of each 2D frame onto both sides of a sphere with the same parameters as the original surface. This leads to an inaccurate 3D+t sequence IG(X, t), geometrically similar to the original 3D+t sequence I(X, t) but holding errors generated by the positioning ambiguity (figure 2). The sequence IG is formally obtained by IG ∘ G(x, t) = I2D(x, t), where G is defined by G(x, y, z) = (x, y)ᵀ if (x − cx)² + (y − cy)² + (z − cz)² = R². Parameters (cx, cy, cz) and R are respectively the center and the radius of the sphere surface.
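The projection p and the mapping of a 2D frame onto both hemispheres can be illustrated with a small pure-Python sketch. Everything here is a simplifying assumption made for illustration: volumes are nested lists indexed as volume[z][y][x], the kernel h is taken constant and equal to 1, the sphere is rasterized by rounding z to the nearest slice, and the function names are hypothetical.

```python
import math

def map_on_sphere(frame, center, radius, depth):
    """Copy each pixel of a 2D frame onto the frontal and dorsal voxels of a sphere."""
    cx, cy, cz = center
    height, width = len(frame), len(frame[0])
    volume = [[[0.0] * width for _ in range(height)] for _ in range(depth)]
    for y in range(height):
        for x in range(width):
            d2 = radius ** 2 - (x - cx) ** 2 - (y - cy) ** 2
            if d2 < 0:                      # pixel outside the projected disc
                continue
            dz = math.sqrt(d2)
            # A set is used so that both hemispheres collapse to one voxel on the silhouette.
            for z in {round(cz - dz), round(cz + dz)}:
                if 0 <= z < depth:
                    volume[z][y][x] = frame[y][x]
    return volume

def project(volume, kernel=None):
    """Discrete version of p: weighted sum of the slices along z."""
    depth = len(volume)
    kernel = kernel or [1.0] * depth
    height, width = len(volume[0]), len(volume[0][0])
    return [[sum(kernel[z] * volume[z][y][x] for z in range(depth))
             for x in range(width)] for y in range(height)]

frame = [[1.0] * 5 for _ in range(5)]
inaccurate = map_on_sphere(frame, center=(2, 2, 2), radius=2, depth=5)
print(project(inaccurate)[2][2])   # both hemispheres contribute at the disc centre
```

Projecting the mapped volume counts each interior pixel twice, which is a direct view of the positioning ambiguity: the frame alone cannot tell which of the two hemispheres actually carried the intensity.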
Fig. 2. From left to right: illustration of the positioning ambiguity problem (a given 2D projection I2D(x, t) and the possible original 3D images) and recovery of the inaccurate 3D images IG(X, t) from the 2D frames I2D(x, t) by mapping the same texture on the frontal and dorsal hemispheres
Building the inaccurate 3D+t sequence yields the following data model equation:

$$\hat{I}(X, t) = I_G(X, t), \quad t \in \,]t_0; t_1[ \qquad (1)$$

3.3 Motion Compensation
In order to restore correctly each luminosity contribution in the corresponding hemisphere, we match the inaccurate 3D representation, IG (X, t), with the 3D
image available at t − 1, I(X, t − 1). This matching is quantified as a displacement vector field W describing voxel motion between the two structures IG(X, t) and I(X, t − 1). Since times t and t − 1 are assumed to be very close and the objects of interest evolve little over this duration, we assume that a moving voxel keeps the same gray value over time. A straightforward equation, similar to the optical flow constraint, is then obtained under the constant brightness assumption from time t to t − 1:

$$\hat{I}(X, t) = I(X + W, t - 1) \qquad (2)$$

We recall that Î(X, t) denotes the estimate of the 3D image at time t. For notational convenience, we estimate a backward vector field defining voxel displacement from t − 1 to t. Expanding the right-hand side of equation (2) in a first-order Taylor series leads to: I(X + W, t − 1) = I(X, t − 1) + ∇I(X, t − 1)·W. This linearization yields the following motion constraint:

$$\hat{I}(X, t) = I(X, t - 1) + \nabla I(X, t - 1) \cdot W \qquad (3)$$

3.4 Numerical Resolution
Combining both constraints, related respectively to motion (3) and to the data model (1), leads to a model equation for the displacement vector field:

$$\ell(W)(X, t) = I(X, t - 1) + \nabla I(X, t - 1) \cdot W - I_G(X, t) = 0 \qquad (4)$$

On the one hand, equation (4) is underdetermined, as it is insufficient to recover the displacement vector field W reliably. On the other hand, a data model constraint involving a corrupted 3D representation, IG(X, t), is embedded in this equation. To cope with the inaccuracy of IG(X, t), we add an adapted smoothness term standing for the spatial regularization of W. This regularization overcomes the aforementioned positioning ambiguity problem. We gradually estimate a dense motion field using a variational formulation. We build a functional E whose minimum, with respect to W, corresponds to the displacement vector field matching IG(X, t) and I(X, t − 1). E is composed of two additive terms:

$$E(W) = \int_\Omega \ell(W)^2(X, t)\, dX + \alpha \int_\Omega \left( \|\nabla u\|^2 + \|\nabla v\|^2 + \|\nabla w\|^2 \right) dX \qquad (5)$$

where the first term is related to equation (4) and the second one penalizes high spatial deformations of W. Parameter α tunes the importance of the second term, i.e. the spatial regularization. Differentiation of E with respect to the displacements u, v and w yields a set of Euler-Lagrange equations (the reader is referred to [6]). Discretizing them with finite differences leads to a large linear system, which we solve with a Gauss-Seidel iterative scheme. Estimated 3D images are reconstructed by motion compensation with respect to equation (2). However, the positions of the mapped samples computed with this equation do not match the grid positions of the output image Î(X, t) perfectly. We therefore use a method introduced in [7] to fit a smooth surface from the
set of mapped samples. This algorithm makes use of a coarse-to-fine hierarchy of control lattices in order to generate a sequence of bi-cubic B-spline functions whose sum approaches the desired interpolation function. In order to restore the complete 3D+t sequence gradually, we adopt a simple reconstruction strategy. We use a forward progression procedure starting from the first available 3D image, I⁰, and a backward one starting from the final available 3D image, I¹. Consequently, we generate two displacement vector fields at the median moment of the temporal interval [t0; t1]. The final 3D image is computed simply using the average of both fields.
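The structure of the Gauss-Seidel resolution can be shown on a deliberately reduced case. The sketch below is a one-dimensional analogue written under stated assumptions, not the 3D scheme of the paper: a single displacement component w is estimated on a 1D signal, the data term is the linearized residual of equation (4), the smoothness term is the sum of squared differences between neighbouring displacements, and the function name and parameters are hypothetical. Setting the derivative of this discrete energy to zero at sample i gives the update used in the inner loop.

```python
def gauss_seidel_flow_1d(prev, current, alpha=1.0, sweeps=200):
    """Minimise sum_i (prev_i + g_i*w_i - current_i)^2 + alpha * sum_i (w_{i+1} - w_i)^2.

    prev and current are 1D signals of equal length n >= 2; g is a finite-difference
    gradient of prev. Returns the displacement w after the given number of sweeps.
    """
    n = len(prev)
    grad = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)   # central difference, one-sided at the ends
        grad.append((prev[hi] - prev[lo]) / (hi - lo))
    resid = [prev[i] - current[i] for i in range(n)]
    w = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Gauss-Seidel: neighbours already updated in this sweep are used immediately.
            nbrs = [w[j] for j in (i - 1, i + 1) if 0 <= j < n]
            w[i] = (alpha * sum(nbrs) - grad[i] * resid[i]) / (grad[i] ** 2 + alpha * len(nbrs))
    return w

# A linear ramp shifted by half a sample: the recovered displacement should approach 0.5.
prev = [float(x) for x in range(8)]
current = [x + 0.5 for x in prev]
print(gauss_seidel_flow_1d(prev, current))
```

In three dimensions the same pattern applies to u, v and w simultaneously, with six neighbours per voxel and a coupling between the components through the gradient of I(X, t − 1); larger values of α give smoother fields at the cost of blurring motion discontinuities.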
4 Results
We present, in this section, a series of experiments in order to assess the performance of the proposed approach. They consist in computing the displacement vector field matching a 2D frame with an anterior 3D structure. As real data are not available, we design two sets of simple synthetic data. The first one is composed of a couple of spheres I(X, t1), I(X, t2) holding two squares lying respectively on the frontal and dorsal hemispheres. From t1 to t2, the center of the frontal square moves towards the bottom right and the dorsal square moves merely downwards. We simulate the 2D transformation of I(X, t2), yielding I2D(x, t). We then build the inaccurate representation IG(X, t2) using I2D(x, t), and finally we compute the displacement vector field matching I(X, t1) and IG(X, t2). The estimate Î(X, t1) is computed by motion compensation. The second set of data is almost identical to the first one, except that the squares occlude each other in the 2D projection. In order to visualize the motion computation results, we use an adapted tool called MAPVIS [8]. This tool displays complex information (scalar and vectorial) lying on real spheroid surfaces or projected ones. In the 3D case, MAPVIS projects the data embedded on the observed part of the actual globe. Moreover, it improves the visualization of texture in perspective projective views of spheric shaped structures. This tool is based on a suitable planar map projection that unrolls the curved surface around a given origin. Therefore, varying the projection origin around the surface makes it possible to observe different views of the sphere. Since the selected map projection minimizes distortions around the projection origin, the closer a point is to this origin, the more accurate the data recovery. The equatorial aspect of the map projection, with reference to an origin lying on the equator, displays the frontal hemisphere of the observed globe; the polar one, with reference to the north geographical pole, displays information lying on the northern hemisphere.
We display in figures 3 and 4 the motion results for, respectively, the first and second sets of synthetic data. Let us point out that the motion computation produces a set of correct vector fields (in both amplitude and direction), perfectly tangent to the enclosing spherical surface. Moreover, our reconstruction method reliably handles the positioning ambiguity, as the moving squares are mapped onto the right original hemispheres. This can be seen by visualizing the difference between the estimated 3D image and the original one, i.e. I(., t2) − Î(., t2). The performance of our algorithm is not affected by the more complex case showing object occlusion (see
Fig. 3. On the first line, from left to right: visualization of the equatorial aspect then the polar aspect of, respectively, I(., t1), W, then IG(., t2). On the second line, from left to right: visualization of the equatorial aspect then the polar aspect of, respectively, Î(., t2), then the difference image I(., t2) − Î(., t2).
Fig. 4. On the first line, from left to right: visualization of the equatorial aspect then the polar aspect of, respectively, I(., t1), then IG(., t2). On the second line, from left to right: visualization of the equatorial aspect then the polar aspect of, respectively, Î(., t2), then the difference image I(., t2) − Î(., t2).
figure 4), which does not ease the aforementioned structure distinction problem. However, the difference image also shows reconstruction errors, doubtless amplified by the double interpolation procedure involved first to compute a smooth surface from the samples recovered by motion compensation and second to visualize the projective views [8]. These errors are localized on the borders of the moving objects. They may be caused by the spatial derivative computations. They are also due to the smoothing of motion discontinuities introduced by the spatial regularization of the estimated displacement vector field.
5 Conclusion
In this paper, we have presented an original approach dedicated to 3D+t scene reconstruction using 3D temporal data and 2D+t temporal sequences. It is based
on motion compensation and involves a prior geometrical data constraint. The latter concerns spherical topology, which is relevant to several applications. It leads us to build an inaccurate inverse 3D reconstruction of each 2D frame. This 3D image is enhanced by motion compensation, involving a regularity criterion under the brightness constancy assumption. The whole 3D+t sequence is reconstructed gradually, using both 3D data as initial boundary conditions. Incorporating the prior geometrical constraint reduces the 3D-to-2D transformation to merely a sum of contributions coming from the underlying frontal and dorsal hemispheres. This generates a positioning ambiguity in the intermediate inaccurate 3D representation. Our algorithm reliably overcomes this inaccuracy and handles possible object occlusions. Besides, we use a dedicated tool, with relevant visualization properties, to display reconstruction results quickly. Currently, the whole sequence is computed frame by frame, which causes errors to accumulate along the sequence. This limitation could be circumvented by using a spatio-temporal smoothness constraint computed globally on the whole sequence [9]. Regarding other reconstruction issues, it is possible to generalize the prior geometrical constraint to any regular surface. We also propose, in another framework, to solve the 3D+t scene recovery under weaker assumptions on the data model. Moreover, we plan to apply our approach based on spherical topology to microscopic cell wall simulation acquisitions in the context described in [1]. This requires a spatial deconvolution framework to reconstruct each sphere from the original multi-focus image.
References
1. Staneva, G., Angelova, M., Koumanov, K.: Phospholipase a2 promotes raft budding and fission from giant liposomes. Chem. Phys. Lipids 129, 53–62 (2004)
2. Natterer, F.: The mathematics of computerized tomography. Wiley, New York (1986)
3. Huei-Yung, L.: Computer vision techniques for complete 3D model reconstruction. PhD thesis, State University of New York at Stony Brook (2002)
4. Gueziec, A., Kazanzides, P., Williamson, B., Taylor, R.: Anatomy-based registration of CT-scan and intraoperative X-ray images for guiding a surgical robot. IEEE Trans. Med. Imaging 17, 715–728 (1998)
5. Lemieux, L., Jagoe, R., Fish, D., Kitchen, N., Thomas, D.: A patient-to-computed-tomography image registration method based on digitally reconstructed radiographs. Med. Phys. 21, 1749–1760 (1994)
6. Glowinski, R.: Numerical Methods for Nonlinear Variational Problems. Springer edn. Series in computational physics, New York (1984)
7. Lee, S., Wolberg, G., Shin, S.: Scattered data interpolation with multilevel b-splines. IEEE Transactions on Visu. and Comput. Graph. 3(3), 228–244 (1997)
8. Rekik, W., Béréziat, D., Dubuisson, S.: Mapvis: a map-projection based tool for visualizing scalar and vectorial information lying on spheroidal surfaces. In: Proceedings of IV05, London (July 2005)
9. Weickert, J., Schnörr, C.: Variational optic flow computation with a spatio-temporal smoothness constraint. Journal of Math. Imag. and Vision 14(3), 245–255 (2001)
Robust Fitting of 3D Objects by Affinely Transformed Superellipsoids Using Normalization

Frank Ditrich and Herbert Suesse

Chair of Digital Image Processing, Department of Computer Science, Friedrich-Schiller-University Jena, D-07737 Jena, Germany
[email protected], [email protected]
http://www.inf-cv.uni-jena.de
Abstract. We present an algorithm for the robust fitting of objects given as voxel data with affinely transformed superellipsoids. Superellipsoids cover a broad range of various forms and are widely used in many application fields. Our approach uses the method of normalization and a new separation of the affine transformation into a shearing, an anisotropic scale and a rotation. It extends our previous work for 2D fitting problems and for fitting rectangular boxes in 3D. Our technique can be used as a valuable tool for solving this fitting task for 3D data. If the exponents describing the superellipsoids to be fitted are known in advance, the method is extremely robust even against major distortions of the object to be fitted.
1 Introduction
In the past, a lot of work was done dealing with the task of object fitting. This is not only important for objects in the 2D domain, but also in 3D, since there are a lot of techniques (e.g. computer tomography) which deliver 3D data. An overview of fitting methods for superellipses in 2D can be found in [4]. In [7] and [8], we developed methods using normalization for fitting 2D objects which are given as closed regions. A solution for rectangular boxes given as a region in a 3D voxel space was shown in [1], but was constrained only to the group of similarity transformations. This paper presents a new solution which uses ideas from 2D normalization and extends them towards a procedure for normalizing affine 3D transformations using an appropriate matrix separation. Compared to other methods, an advantage is the affine invariance of the fitting method itself. The second section of the paper introduces superellipsoids, and the third section briefly explains the principle of normalization. The following section describes our fitting procedure together with the explanation of the necessary
The work of Frank Ditrich was supported by DAKO GmbH Jena, Germany.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 490–497, 2007. © Springer-Verlag Berlin Heidelberg 2007
normalization steps. Some results are given which show the fitting quality we achieved during some experiments. Finally, after a conclusion some further work which is of interest to augment and complete our solution is mentioned.
2 Superellipsoids
In 2D an ellipse can be described by the parametric representation
\[ e(u) = \begin{pmatrix} a \cos u \\ b \sin u \end{pmatrix} \quad \text{with } u \in [-\pi, \pi). \]
Superellipses are a generalization of ellipses with the representation
\[ s(u) = \begin{pmatrix} a (\cos u)^{\varepsilon} \\ b (\sin u)^{\varepsilon} \end{pmatrix} \quad \text{with } u \in [-\pi, \pi) \text{ and } \varepsilon > 0. \]
The exponent ε is used to modify the form of the object; some examples are given in Fig. 1. Of course such a generalization is also possible for 3D ellipsoids.
Fig. 1. Some 2D superellipses with their form depending on the exponent ε (values from left to right 0.3, 0.5, 1.0, 2.0, 3.0)
Here, we have two exponents ε1 and ε2 to control the object form. A great variety of objects can be described, see Fig. 2. The parametric representation is the following:
\[ s(u, v) = \begin{pmatrix} a (\cos u)^{\varepsilon_1} (\cos v)^{\varepsilon_2} \\ b (\cos u)^{\varepsilon_1} (\sin v)^{\varepsilon_2} \\ c (\sin u)^{\varepsilon_1} \end{pmatrix} \quad \text{with } u \in \left[-\tfrac{\pi}{2}, \tfrac{\pi}{2}\right],\ v \in [-\pi, \pi),\ \varepsilon_1, \varepsilon_2 > 0. \]
This is a common representation found in the literature. For real applications one has to handle the cases where some of the sine and cosine expressions become negative by using their absolute values and additionally considering their signs. An implicit representation is given by the equation
\[ \left( \left(\frac{x}{a}\right)^{\frac{2}{\varepsilon_2}} + \left(\frac{y}{b}\right)^{\frac{2}{\varepsilon_2}} \right)^{\frac{\varepsilon_2}{\varepsilon_1}} + \left(\frac{z}{c}\right)^{\frac{2}{\varepsilon_1}} = 1. \]
Superellipsoids have a broad range of applications. For example, they are used as basic modelling elements in computer graphics. Other applications are the modelling of bubbles in fluids or the simulation of particles in granular materials. Since we allow here the affine transformation of the superellipsoids described above, the range of forms for which our algorithm can be used becomes even larger.
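As a hedged illustration of the remark on negative sine and cosine values, the following Python sketch evaluates the parametric representation with a signed power and checks the resulting point against the implicit equation. The function names (`signed_pow`, `superellipsoid_point`, `inside_outside`) are illustrative assumptions and do not come from the paper.

```python
import math


def signed_pow(base, exponent):
    """Return sign(base) * |base|**exponent, so negative sines/cosines are handled."""
    return math.copysign(abs(base) ** exponent, base)


def superellipsoid_point(u, v, a, b, c, eps1, eps2):
    """Surface point s(u, v) of an axis-aligned superellipsoid."""
    cu, su = math.cos(u), math.sin(u)
    cv, sv = math.cos(v), math.sin(v)
    x = a * signed_pow(cu, eps1) * signed_pow(cv, eps2)
    y = b * signed_pow(cu, eps1) * signed_pow(sv, eps2)
    z = c * signed_pow(su, eps1)
    return x, y, z


def inside_outside(x, y, z, a, b, c, eps1, eps2):
    """Left-hand side of the implicit equation; it equals 1 on the surface."""
    xy = abs(x / a) ** (2.0 / eps2) + abs(y / b) ** (2.0 / eps2)
    return xy ** (eps2 / eps1) + abs(z / c) ** (2.0 / eps1)
```

Absolute values are used in the implicit form as well, which keeps the fractional powers real for points with negative coordinates.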
Fig. 2. Various 3D superellipsoids, controlled by the exponents ε1 and ε2 ((ε1 , ε2 ) from left to right: (0.3, 0.3), (1.0, 1.0), (3.0, 0.3), (0.5, 3.0), (1.0, 2.0), (3.0, 3.0)).
3 Fitting Using Normalization
In this section we briefly explain the principle of normalization for fitting and describe its advantage over conventional fitting methods. Suppose we are given a class T of transformations where each t ∈ T is described by n parameters ξ1, ..., ξn. The theoretical shapes we use for fitting (e.g. rectangles, ellipses, cubes etc.) are modelled through a class P(θ) in which each member is described by m parameters θ1, ..., θm. If we want to fit a primitive onto a given object O, we usually derive some features f(O) = (f1, f2, f3, ...). Some fitting methods then try to solve the problem
\[ \| f(O) - f(P(\theta)) \|^2 \longrightarrow \min \]
by searching the whole m-dimensional space Θ of all primitives. Using normalization, we first define an appropriate canonical frame for our class of primitives which also depends on the transformation group, for example a unit square for the class of all squares with respect to similarity transformations. Then we try to calculate a transformation which maps the object to be fitted onto this canonical frame according to some normalization conditions, yielding the normalized object O'. If we also normalize the optimal primitive P(θ*) with the same transformation, yielding P'(θ*), then we have to solve the following optimization problem:
\[ \| f(O') - f(P'(\theta^*)) \|^2 \longrightarrow \min \]
It has the dimension m − n, which is usually significantly smaller than the dimension m of the problem mentioned above. There are many cases where it becomes 0 or 1, see [7] or [8]. After successfully solving this problem we get the optimal primitive by applying the inverse of the normalization transformation to the optimization solution. For the determination of the normalization transformation and the parameter optimization we use volume moments up to fourth order (p + q + r ≤ 4), which are defined as
\[ m_{pqr} = \iiint_{\text{object}} x^p y^q z^r \, dx \, dy \, dz. \]
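For voxel data, the integral above can be approximated by a sum over voxel centres. The sketch below is a minimal illustration under two assumptions that the text does not state explicitly: each voxel has unit volume, and the centroid is moved to the origin before the moments are taken. The function names are hypothetical.

```python
from itertools import product


def volume_moment(points, p, q, r):
    """Discrete approximation of m_pqr: sum of x^p y^q z^r over unit voxels."""
    return sum(x ** p * y ** q * z ** r for x, y, z in points)


def centred(voxels):
    """Shift the voxel centres so that their centroid lies at the origin (assumption)."""
    n = len(voxels)
    cx = sum(v[0] for v in voxels) / n
    cy = sum(v[1] for v in voxels) / n
    cz = sum(v[2] for v in voxels) / n
    return [(x - cx, y - cy, z - cz) for x, y, z in voxels]


def moments_up_to_order(voxels, order=4):
    """All moments m_pqr with p + q + r <= order, keyed by (p, q, r)."""
    pts = centred(voxels)
    return {
        (p, q, r): volume_moment(pts, p, q, r)
        for p, q, r in product(range(order + 1), repeat=3)
        if p + q + r <= order
    }
```

For fourth order this yields 35 moments, of which only a few are used by the normalization conditions of the following section.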
4 Normalization in 3D
To get the normalization transformation for an object, the transformation matrix can be separated into some simpler transformations, and a set of appropriate normalization conditions is used to determine their parameters. A detailed description together with some examples for the 2D case can be found in [5,7,8]. For affine 3D transformations we use here a separation which is similar to the well-known standard method for 2D transformations [9]. The transformation is separated into a rotation (considered as three rotations about the coordinate axes), an anisotropic scaling and a shearing:
\[ A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = R_Z \cdot R_Y \cdot R_X \cdot S \cdot D \]
\[ = \begin{pmatrix} \cos u & -\sin u & 0 \\ \sin u & \cos u & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \cos v & 0 & -\sin v \\ 0 & 1 & 0 \\ \sin v & 0 & \cos v \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos w & -\sin w \\ 0 & \sin w & \cos w \end{pmatrix} \begin{pmatrix} \delta & 0 & 0 \\ 0 & \varepsilon & 0 \\ 0 & 0 & \lambda \end{pmatrix} \begin{pmatrix} 1 & \alpha & \beta \\ 0 & 1 & \gamma \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} | & | & | \\ C_1 & C_2 & C_3 \\ | & | & | \end{pmatrix}. \]
Here we only consider transformations which do not contain reflections. If a matrix A is given, then u, v and δ can be determined from
\[ C_1 = \begin{pmatrix} \delta \cos u \cos v \\ \delta \sin u \cos v \\ \delta \sin v \end{pmatrix} = \begin{pmatrix} a_{11} \\ a_{21} \\ a_{31} \end{pmatrix}. \]
Since this expression can be interpreted as a spherical coordinates representation for the point (a11, a21, a31), we get two possible solutions for u and v: if (u, v) fulfills the equation, then (π + u, π − v) is also a solution. With the substitution of ε cos w and ε sin w the next line turns into a system of linear equations and the values of α, ε and w can be calculated from them:
\[ C_2 = \begin{pmatrix} \alpha\delta \cos u \cos v - \varepsilon (\sin u \cos w + \cos u \sin v \sin w) \\ \alpha\delta \sin u \cos v + \varepsilon (\cos u \cos w - \sin u \sin v \sin w) \\ \alpha\delta \sin v + \varepsilon \cos v \sin w \end{pmatrix} = \begin{pmatrix} a_{12} \\ a_{22} \\ a_{32} \end{pmatrix}. \]
It is easy to show that the value α is independent of the choice of one of the above solutions for u and v. If we get a value w as a solution using the values u and v, then the solution for π + u and π − v is π + w. But both triples of values lead to the same product R_Z · R_Y · R_X.
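If these values are known, then β, γ and λ can be determined from the last column
\[ C_3 = \begin{pmatrix} \beta\delta \cos u \cos v - \varepsilon\gamma (\sin u \cos w + \cos u \sin v \sin w) + \lambda (\sin u \sin w - \cos u \sin v \cos w) \\ \beta\delta \sin u \cos v + \varepsilon\gamma (\cos u \cos w - \sin u \sin v \sin w) - \lambda (\cos u \sin w + \sin u \sin v \cos w) \\ \beta\delta \sin v + \varepsilon\gamma \cos v \sin w + \lambda \cos v \cos w \end{pmatrix} = \begin{pmatrix} a_{13} \\ a_{23} \\ a_{33} \end{pmatrix}. \]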
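The separation can also be checked numerically. Since S · D is upper triangular with positive diagonal (δ, ε, λ), the product of the three rotations, the scales and the shears can be read off a Gram-Schmidt (QR-type) factorization of A. The sketch below uses this swapped-in technique rather than the column-by-column solution described above, assumes det(A) > 0, and does not extract the individual angles u, v, w; the function names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def separate_affine(A):
    """Split A into (R, (delta, eps, lam), (alpha, beta, gamma)) such that
    A = R * diag(delta, eps, lam) * [[1, alpha, beta], [0, 1, gamma], [0, 0, 1]]."""
    cols = [[A[i][j] for i in range(3)] for j in range(3)]
    q = []                                   # orthonormal columns of R
    U = [[0.0] * 3 for _ in range(3)]        # upper triangular factor S * D
    for j, c in enumerate(cols):
        w = c[:]
        for i, qi in enumerate(q):
            U[i][j] = sum(qi[k] * c[k] for k in range(3))
            w = [w[k] - U[i][j] * qi[k] for k in range(3)]
        U[j][j] = math.sqrt(sum(x * x for x in w))
        q.append([x / U[j][j] for x in w])
    R = [[q[j][i] for j in range(3)] for i in range(3)]
    delta, eps, lam = U[0][0], U[1][1], U[2][2]
    alpha, beta, gamma = U[0][1] / delta, U[0][2] / delta, U[1][2] / eps
    return R, (delta, eps, lam), (alpha, beta, gamma)
```

Because the diagonal of the triangular factor is forced to be positive, the factorization is unique, which mirrors the uniqueness of the separation when the rotation is considered as a whole.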
So, we have shown that this separation is possible and unique if the rotation matrix is considered as a whole, and we can use it for normalization. According to the separation, the normalization steps are the following. First, we handle the shearing matrix. The moments are transformed in this way:
\[ m'_{pqr} = \iiint_{\text{object}} (x + \alpha y + \beta z)^p (y + \gamma z)^q z^r \, dx \, dy \, dz. \]
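Analogously to the 2D case, we consider here the moments
\[ \begin{aligned} m'_{110} &= m_{110} + \gamma m_{101} + \alpha m_{020} + (\alpha\gamma + \beta) m_{011} + \beta\gamma m_{002}, \\ m'_{101} &= m_{101} + \alpha m_{011} + \beta m_{002}, \\ m'_{011} &= m_{011} + \gamma m_{002}. \end{aligned} \]
If we state the conditions
\[ m'_{110} = 0, \quad m'_{101} = 0, \quad m'_{011} = 0, \]
we can compute the parameters α, β and γ from them in the following way:
\[ \alpha = \frac{m_{101} m_{011} - m_{110} m_{002}}{m_{020} m_{002} - m_{011}^2}, \quad \beta = \frac{m_{110} m_{011} - m_{101} m_{020}}{m_{020} m_{002} - m_{011}^2}, \quad \gamma = -\frac{m_{011}}{m_{002}}. \]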
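The shear step can be sketched as follows. This is a minimal illustration that evaluates the moments as sums over a point set (an assumption standing in for the volume integrals); the closed-form expressions for α, β and γ are the ones given above, and the function names are hypothetical.

```python
def moment(points, p, q, r):
    """Discrete moment m_pqr of a point set."""
    return sum(x ** p * y ** q * z ** r for x, y, z in points)


def shear_parameters(points):
    """Closed-form alpha, beta, gamma that zero the mixed second-order moments."""
    m110, m101, m011 = moment(points, 1, 1, 0), moment(points, 1, 0, 1), moment(points, 0, 1, 1)
    m020, m002 = moment(points, 0, 2, 0), moment(points, 0, 0, 2)
    den = m020 * m002 - m011 ** 2
    alpha = (m101 * m011 - m110 * m002) / den
    beta = (m110 * m011 - m101 * m020) / den
    gamma = -m011 / m002
    return alpha, beta, gamma


def apply_shear(points, alpha, beta, gamma):
    """Apply D = [[1, alpha, beta], [0, 1, gamma], [0, 0, 1]] to every point."""
    return [(x + alpha * y + beta * z, y + gamma * z, z) for x, y, z in points]
```

After `apply_shear`, the moments m'_110, m'_101 and m'_011 of the transformed point set vanish up to rounding error.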
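In the second step we deal with the anisotropic scaling. Here the transformation of moments is done according to
\[ m'_{pqr} = \det \begin{pmatrix} \delta & 0 & 0 \\ 0 & \varepsilon & 0 \\ 0 & 0 & \lambda \end{pmatrix} \cdot \iiint_{\text{object}} (\delta x)^p (\varepsilon y)^q (\lambda z)^r \, dx \, dy \, dz = \delta^{p+1} \varepsilon^{q+1} \lambda^{r+1} m_{pqr}, \]
especially
\[ m'_{200} = \delta^3 \varepsilon \lambda \, m_{200}, \quad m'_{020} = \delta \varepsilon^3 \lambda \, m_{020}, \quad m'_{002} = \delta \varepsilon \lambda^3 \, m_{002}. \]
With the conditions
\[ m'_{200} = 1, \quad m'_{020} = 1, \quad m'_{002} = 1 \]
we get
\[ \delta = \sqrt[10]{\frac{m_{020} m_{002}}{m_{200}^4}}, \quad \varepsilon = \sqrt[10]{\frac{m_{002} m_{200}}{m_{020}^4}}, \quad \lambda = \sqrt[10]{\frac{m_{200} m_{020}}{m_{002}^4}}. \]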
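A short sketch of the scaling step, assuming the three second-order moments of the sheared object are already available as positive numbers (the function name is hypothetical):

```python
def scale_parameters(m200, m020, m002):
    """Scales (delta, eps, lam) that normalize the second-order moments to 1."""
    delta = (m020 * m002 / m200 ** 4) ** 0.1
    eps = (m002 * m200 / m020 ** 4) ** 0.1
    lam = (m200 * m020 / m002 ** 4) ** 0.1
    return delta, eps, lam
```

Substituting the result into the three transformation rules above returns 1 for each normalized moment, which is a convenient sanity check for an implementation.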
The third step is the normalization of the rotation. In the 2D case it was possible to determine the rotation angle ϕ directly from a normalization condition (see [5]). Unfortunately, an explicit solution for the 3D case is not yet known to us. Therefore, we used the following conditions to bring the object into an adequate position:
\[ m'_{310} = 0, \quad m'_{301} = 0, \quad m'_{130} = 0, \quad m'_{031} = 0, \quad m'_{103} = 0, \quad m'_{013} = 0. \]
Here m'_{pqr} are the moments obtained after the rotation. For a rotation matrix
\[ R = \begin{pmatrix} R_{11} & R_{12} & R_{13} \\ R_{21} & R_{22} & R_{23} \\ R_{31} & R_{32} & R_{33} \end{pmatrix} \]
(with det(R) = 1) they are transformed from m_{pqr} using
\[ (x + y + z)^n = \sum_{\substack{0 \le a,b,c \le n \\ a+b+c=n}} \frac{(a+b+c)!}{a! \, b! \, c!} \, x^a y^b z^c = \sum_{\substack{0 \le a,b,c \le n \\ a+b+c=n}} \binom{a+b+c}{b+c} \binom{b+c}{c} x^a y^b z^c \]
(see [2]) as follows:
\[ m'_{pqr} = \sum_{\substack{0 \le a,b,c \le p \\ a+b+c=p}} \; \sum_{\substack{0 \le d,e,f \le q \\ d+e+f=q}} \; \sum_{\substack{0 \le h,i,k \le r \\ h+i+k=r}} \binom{a+b+c}{b+c} \binom{b+c}{c} \binom{d+e+f}{e+f} \binom{e+f}{f} \binom{h+i+k}{i+k} \binom{i+k}{k} \cdot R_{11}^a R_{12}^b R_{13}^c R_{21}^d R_{22}^e R_{23}^f R_{31}^h R_{32}^i R_{33}^k \cdot m_{a+d+h,\, b+e+i,\, c+f+k}. \]
We used them to do an optimization
\[ (m'_{310}(u,v,w))^2 + (m'_{301}(u,v,w))^2 + (m'_{130}(u,v,w))^2 + (m'_{031}(u,v,w))^2 + (m'_{103}(u,v,w))^2 + (m'_{013}(u,v,w))^2 \longrightarrow \min \]
subject to u, v, w ∈ [0, 2π) to determine the three rotation angles and to bring the object into its final position for the next stage.
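The transformation rule for the moments can be sketched directly from the multinomial expansion. The code below is an illustration only: the moments are stored in a dictionary keyed by (p, q, r), which is an assumed data layout, and the optimization over u, v, w is not included.

```python
import math
from itertools import product


def multinomial(a, b, c):
    """(a + b + c)! / (a! b! c!)."""
    return math.factorial(a + b + c) // (
        math.factorial(a) * math.factorial(b) * math.factorial(c))


def compositions(n):
    """All triples (a, b, c) of non-negative integers with a + b + c = n."""
    return [(a, b, n - a - b) for a in range(n + 1) for b in range(n - a + 1)]


def rotated_moment(m, R, p, q, r):
    """Moment m'_pqr after the mapping x' = R x, from the original moments m."""
    total = 0.0
    for (a, b, c), (d, e, f), (h, i, k) in product(
            compositions(p), compositions(q), compositions(r)):
        coeff = multinomial(a, b, c) * multinomial(d, e, f) * multinomial(h, i, k)
        weight = (R[0][0] ** a * R[0][1] ** b * R[0][2] ** c *
                  R[1][0] ** d * R[1][1] ** e * R[1][2] ** f *
                  R[2][0] ** h * R[2][1] ** i * R[2][2] ** k)
        total += coeff * weight * m[(a + d + h, b + e + i, c + f + k)]
    return total
```

All moments on the right-hand side have the same order p + q + r, so the six fourth-order conditions only require the fifteen fourth-order moments of the object before rotation.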
5 Determination of the Remaining Parameters
In the last step, it is necessary to determine the parameters of the superellipsoid which best describes the normalized object. This is done by performing an optimization using the following moments:
\[ m'_{000}, \quad m'_{400}, \quad m'_{040}, \quad m'_{004}, \quad m'_{220}, \quad m'_{202}, \quad m'_{022}. \]
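The theoretical values of these moments for given parameters a, b, c, ε1, ε2 can be calculated according to (see [3] for a detailed description)
\[ \mu_{pqr}(a, b, c, \varepsilon_1, \varepsilon_2) = \frac{2}{p+q+2} \, a^{p+1} b^{q+1} c^{r+1} \varepsilon_1 \varepsilon_2 \cdot B\!\left( (r+1)\frac{\varepsilon_1}{2},\ (p+q+2)\frac{\varepsilon_1}{2} + 1 \right) \cdot B\!\left( (q+1)\frac{\varepsilon_2}{2},\ (p+1)\frac{\varepsilon_2}{2} \right), \]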
since all indices are even numbers. Moments with odd indices become 0. Here B(x, y) = (Γ(x)Γ(y))/Γ(x + y). Writing μ_{pqr} for μ_{pqr}(a, b, c, ε1, ε2), the parameters are determined by
\[ \begin{aligned} & (m'_{000} - \mu_{000})^2 + (m'_{400} - \mu_{400})^2 + (m'_{040} - \mu_{040})^2 + (m'_{004} - \mu_{004})^2 \\ & \quad + (m'_{220} - \mu_{220})^2 + (m'_{202} - \mu_{202})^2 + (m'_{022} - \mu_{022})^2 \longrightarrow \min \end{aligned} \]
subject to a, b, c, ε1, ε2 > 0. One aspect we have to note here is that the calculation of the moments given above holds for a superellipsoid in the special position defined by the parametric representation in Section 2 (aligned along the z-axis). The conditions for normalizing the rotation using the moments m'_{310} etc. cannot guarantee that the object is brought into this position. They are fulfilled if (but not only if) the object is aligned along a coordinate axis, and additionally a rotation of π/4 about this axis is allowed. So we have to perform the optimization for these six positions and use the best one as the result. In cases in which the superellipsoid is not convex (if one of the exponents ε1, ε2 is greater than 2), there are also special positions different from these six where the rotation normalization conditions are satisfied too. But they can easily be rejected, since they lead to non-meaningful results for the parameters. In our experiments it was always possible to reach a solution for the parameter optimization by simply choosing multiple start values. As the very last step, the inverse of the normalizing transformation composed of the shearing, the anisotropic scaling and the rotation is applied to the optimal superellipsoid found during parameter optimization to get the final fitting solution.
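The theoretical moments and the cost function can be sketched as follows. This is a minimal illustration under the formula quoted from [3]; the names `superellipsoid_moment`, `fitting_cost` and `INDICES` are assumptions, and the choice of optimizer and start values is left open.

```python
import math

# Index triples (p, q, r) of the moments used in the parameter optimization.
INDICES = [(0, 0, 0), (4, 0, 0), (0, 4, 0), (0, 0, 4), (2, 2, 0), (2, 0, 2), (0, 2, 2)]


def beta(x, y):
    """Beta function B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y)."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)


def superellipsoid_moment(p, q, r, a, b, c, eps1, eps2):
    """Theoretical moment mu_pqr; zero as soon as one index is odd."""
    if p % 2 or q % 2 or r % 2:
        return 0.0
    return (2.0 / (p + q + 2) * a ** (p + 1) * b ** (q + 1) * c ** (r + 1)
            * eps1 * eps2
            * beta((r + 1) * eps1 / 2.0, (p + q + 2) * eps1 / 2.0 + 1.0)
            * beta((q + 1) * eps2 / 2.0, (p + 1) * eps2 / 2.0))


def fitting_cost(params, measured):
    """Sum of squared differences between measured and theoretical moments."""
    a, b, c, eps1, eps2 = params
    return sum((measured[idx] - superellipsoid_moment(*idx, a, b, c, eps1, eps2)) ** 2
               for idx in INDICES)
```

For ε1 = ε2 = 1 the zeroth moment reduces to the ellipsoid volume 4πabc/3, which gives a quick plausibility check of the implementation.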
6 Experimental Results
For testing purposes, we randomly generated affinely transformed superellipsoids within a 200³ voxel cube. To evaluate the fitting quality, we generated the border of the original superellipsoid and the border of the fitted superellipsoid as voxels. For every voxel of one border the nearest voxel of the other one is determined. The maximum of all these distances is used as a quality measure. Our method provides very good fitting results. For 300 randomly generated superellipsoids we got an average fitting quality of 1.5. Even if the form of the object is distorted by major missing or added parts, we have only small deviations of the fitted object from the original if the values for the exponents ε1 and ε2 are known before fitting, see Fig. 3. In the left and middle images an affinely transformed superellipsoid with parameters a = 40, b = 60, c = 80, ε1 = 1, ε2 = 1 is used, and a sphere with radius 30 is removed or added. The resulting values for the fitting quality are 3.16 and 8.25. In the right image the value for the exponents is 2.5, the radius of the removed sphere is 20 and the resulting quality is 3.61. So here we have a robustness against distortions like the one we already noticed for rectangular boxes [1].
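The quality measure can be sketched as a Hausdorff-type distance between the two voxel borders. The text does not say whether the measure is taken in one direction or in both, so the symmetric variant below is an assumption; the brute-force search is for illustration only and the function names are hypothetical.

```python
import math


def directed_max_nearest(border_a, border_b):
    """Largest distance from a voxel of border_a to its nearest voxel in border_b."""
    return max(min(math.dist(p, q) for q in border_b) for p in border_a)


def fitting_quality(border_a, border_b):
    """Symmetric variant: the larger of the two directed distances (assumption)."""
    return max(directed_max_nearest(border_a, border_b),
               directed_max_nearest(border_b, border_a))
```

For borders with many voxels, a spatial index or a distance transform would replace the quadratic search.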
Fig. 3. Three superellipsoids (gray) with major parts missing or added and the fitting result (white). See Section 6 for explanation.
7 Conclusion and Further Work
In this paper we presented an algorithm for fitting objects with affinely transformed superellipsoids. Its key element is a normalization of an affine transformation by separating it into a shearing, an anisotropic scaling and a rotation. We achieved very good fitting results, even for objects with major missing parts. One drawback of our method is the necessity to determine the rotation by an optimization instead of explicit calculation as it was possible in 2D. To find a solution for this problem is one task of the possible additional work. Perhaps there are also other separations of the affine transformations (as it is the case in 2D) that are also suitable for handling this fitting problem.
References
1. Ditrich, F., Suesse, H.: A New Efficient Algorithm for Fitting Rectangular Boxes and Cubes in 3D. In: Proc. IEEE ICIP 2005, Genoa, Italy, Part II, pp. 1134–1137 (2005)
2. Graham, R.L., Knuth, D.E., Patashnik, O.: Concrete Mathematics. Addison-Wesley, Massachusetts (1994)
3. Jaklič, A., Solina, F.: Moments of Superellipsoids and their Application to Range Image Registration. IEEE Trans. Systems, Man, and Cybernetics, Part B 33(4), 648–657 (2003)
4. Rosin, P.L.: Fitting Superellipses. IEEE Trans. PAMI 20, 726–732 (2000)
5. Rothe, I., Suesse, H., Voss, K.: The Method of Normalization to Determine Invariants. IEEE Trans. PAMI 18, 366–375 (1996)
6. Solina, F., Bajcsy, R.: Recovery of Parametric Models from Range Images: The Case for Superquadrics with Global Deformations. IEEE Trans. PAMI 12, 131–147 (1990)
7. Voss, K., Suesse, H.: Invariant Fitting of Planar Objects by Primitives. IEEE Trans. PAMI 19, 80–83 (1997)
8. Voss, K., Suesse, H.: A New One-Parametric Fitting Method for Planar Object. IEEE Trans. PAMI 21, 646–651 (1999)
9. Wang, K.P.W.: Affin Invariant Moment Method of Three-Dimensional Object Identification, Ph.D. Thesis. Syracuse University (1977)
10. Zhang, Y.: Experimental comparison of superquadratic fitting objective functions. PRL 24(14), 2185–2193 (2003)
Fast and Precise Weak-Perspective Factorization

Levente Hajder¹,², Ákos Pernek², and Csaba Kazó²

¹ Computer and Automation Research Institute, Hungarian Academy of Sciences, Kende u. 13-17., H-1111 Budapest, Hungary
² Budapest University of Technology and Economics
[email protected]

Abstract. We address the problem of moving object reconstruction. Several methods have been published in the past 20 years including stereo reconstruction as well as multi-view factorization methods. In general, reconstruction algorithms estimate the 3D structure of the object and the camera parameters in a non-optimal way and then a nonlinear optimization method refines the estimated camera parameters and 3D object coordinates. In this paper, an adjustment method is proposed which is a fast version of the well-known down-hill alternation method. The novelty which yields the high speed of the algorithm is that the steps of the alternation give optimal solutions to the subproblems by closed-form formulas. The proposed algorithm is discussed here and it is compared to the widely used bundle adjustment method.
1 Introduction
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 498–505, 2007. © Springer-Verlag Berlin Heidelberg 2007

3D motion-based object reconstruction is a challenging and developing area of computer vision. Many of the published rigid SfM methods are based on various extensions of the well-known factorization method [9] by Tomasi and Kanade. The original method assumes orthographic projection, while its extensions can cope with weak-perspective [12] as well as real projection [8]. It is well known that there is no optimal closed-form solution to the problem. For this reason, researchers have proposed non-optimal methods and then try to minimize the so-called reprojection error (or another cost function) by adjustment algorithms such as bundle adjustment [2]. It is also well known that these classical algorithms [9,12,8] cannot cope with realistic situations: they assume that all feature points are tracked over all frames. For this reason, several methods have been published to solve the case when data are missing or uncertain [7,4,11]. (The reader is referred to [3] for an excellent review.) However, the full matrix factorization is also an important issue, because Monte Carlo type robust algorithms [10,6,5] can use it to compute the camera motion from few points. The proposed method assumes the weak-perspective camera model, which is applicable if the depth of the object is much smaller than the object-camera distance, but it can be extended easily to the perspective case by using the so-called rescaled measurement matrix [8]. The proposed down-hill alternation method is very similar to the one published by us in 2006 [6]. The main difference
is that the proposed method is faster, because the computation steps are given in closed forms.
2 Summary of the Tomasi-Kanade Factorization
Given P feature points of a rigid object tracked across F frames, $x_{fp} = (u_{fp}, v_{fp})^T$, $f = 1, \dots, F$, $p = 1, \dots, P$, the goal of SfM is to recover the structure of the object. For orthogonal projection, the 2D coordinates are calculated as
\[ x_{fp} = R_f s_p + t_f, \tag{1} \]
where $R_f = [r_{f1}, r_{f2}]^T$ is the orthonormal rotation matrix, $s_p$ the 3D coordinates of the point and $t_f$ the offset. Under the weak-perspective model, the equation is
\[ x_{fp} = q_f R_f s_p + t_f, \tag{2} \]
where $q_f$ is the nonzero scale factor of weak perspective. The offset vector is eliminated by placing the origin of the 2D coordinate system at the centroid of the feature points. For all points in the f-th image, the above equations can be rewritten as
\[ \underbrace{W_f}_{2 \times P} = (x_{f1} \dots x_{fP}) = \underbrace{M_f}_{2 \times 3} \cdot \underbrace{S}_{3 \times P}, \tag{3} \]
where $M_f$ is called the motion matrix and $S = (s_1, \dots, s_P)$ the structure matrix. Under orthography $M_f = R_f$, under weak perspective $M_f = q_f R_f$. For all frames, the equations (3) form
\[ \underbrace{W}_{2F \times P} = \underbrace{M}_{2F \times 3} \cdot \underbrace{S}_{3 \times P}, \tag{4} \]
where $W^T = [W_1^T, W_2^T, \dots, W_F^T]$ and $M^T = [M_1^T, M_2^T, \dots, M_F^T]$. The Tomasi-Kanade factorization method [9] has two main steps:

Step 1. In the noise-free case, W has rank 3 after subtracting the centre of gravity in all frames. When noise is present, the method reduces the rank of the noisy measurement matrix by Singular Value Decomposition. In this step, W is factorized into affine motion and affine structure: $W = M_{aff} S_{aff}$.

Step 2. In the reduced space, the method applies a linear transformation to obtain the real (Euclidean) motion and structure data. The transformation is represented by a matrix Q, and the estimated motion and structure data are written as $M = M_{aff} Q$ and $S = Q^{-1} S_{aff}$, respectively. The matrix Q can be optimally determined by imposing orthographic [9] or weak-perspective [12] motion constraints.

The final factorization is written as W = M S, where the matrix S consists of the 3D coordinates of the moving object, and the matrix M contains the 3D directions of the base vectors of the camera planes.
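The construction of the measurement matrix can be sketched as follows. This is a minimal illustration of equations (1) to (4): the parametrization of the rotation rows by two angles and all function names are assumptions made for the example, and the SVD-based rank reduction of Step 1 is not included.

```python
import math


def rotation_rows(yaw, pitch):
    """First two rows of the rotation Rz(yaw) * Rx(pitch) (illustrative parametrization)."""
    cz, sz = math.cos(yaw), math.sin(yaw)
    cx, sx = math.cos(pitch), math.sin(pitch)
    return [(cz, -sz * cx, sz * sx), (sz, cz * cx, -cz * sx)]


def project(points3d, q, rows, t):
    """Weak-perspective projection x = q * R s + t of every 3D point s."""
    return [tuple(q * sum(r[k] * s[k] for k in range(3)) + t[i]
                  for i, r in enumerate(rows))
            for s in points3d]


def centred_measurements(points2d):
    """Subtract the 2D centroid so that the offset t_f disappears."""
    n = len(points2d)
    cu = sum(u for u, _ in points2d) / n
    cv = sum(v for _, v in points2d) / n
    return [(u - cu, v - cv) for u, v in points2d]


def measurement_matrix(frames):
    """Stack the centred frames into the 2F x P matrix W (one u row and one v row per frame)."""
    W = []
    for pts in frames:
        c = centred_measurements(pts)
        W.append([u for u, _ in c])
        W.append([v for _, v in c])
    return W
```

If the 3D points have zero centroid, each pair of rows of W equals q_f R_f S exactly, which is the rank-3 structure exploited by the factorization.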
If few feature points are tracked, the quality of the Tomasi-Kanade factorization is not sufficient, and this is mainly caused by the SVD-based rank reduction. This reduction technique needs a relatively large number of feature points to obtain a good estimation. The SVD method determines a 3D subspace, and, despite the corrective transformation Q, the result cannot leave this subspace.
3 Proposed Method: Fast Alternation
As discussed in the introduction, the Tomasi-Kanade factorization does not produce an optimal solution. We propose a down-hill alternation method to minimize the reprojection error
\[ \| W - M S \|_F^2 \tag{5} \]
subject to $M_f M_f^T = q_f^2 I$, $\forall f$, where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. Our method guarantees that a local optimum will be reached starting from the initial matrices $M^{(0)}$ and $S^{(0)}$ yielded by the Tomasi-Kanade factorization. Unfortunately, a global optimum cannot be guaranteed. The proposed method is an iterative one consisting of three main steps: the completion step, the S-step and the M-step. It runs until the difference between subsequent error values drops below a given limit. The whole algorithm is summarized in Alg. 1. We call this algorithm Fast Alternation because its speed is (relatively) high despite the alternation: all the parameter estimations in the substeps can be written in closed form. The slowest calculation is the pseudo-inverse computation of a 2F × 3 motion matrix.

3.1 Completion Step

The motion submatrix $M_f$ consists of two rows: $M_f = [m_{f,1}^T \; m_{f,2}^T]^T$. Let us denote the completed matrix by $\tilde{M}_f = [m_{f,1}^T \; m_{f,2}^T \; m_{f,3}^T]^T$, where the direction of $m_{f,3}^T$ is parallel to the cross product of $m_{f,1}^T$ and $m_{f,2}^T$, and its length is the average length of the vectors $m_{f,1}^T$ and $m_{f,2}^T$. Based on the above strategy, the measurement matrix W is also completed. $\tilde{W}_f$ denotes the completed version of $W_f$ if its third row is calculated as $m_{f,3}^T S$.

3.2 S-Step

The goal of the S-step is to determine the structure matrix S optimally. The structure matrix is not constrained: its elements can take arbitrary real values. The optimal value of $S^{(k)}$ is provided by the well-known least-squares solution
\[ S^{(k)} = M^{(k-1)\dagger} \, \tilde{W}^{(k-1)}, \tag{6} \]
where $M^{(k-1)\dagger}$ denotes the Moore-Penrose pseudo-inverse of the matrix $M^{(k-1)}$.
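The completion step and the S-step can be sketched in a few lines. The code below is an illustration under stated assumptions: the pseudo-inverse is applied through the normal equations, which is valid only when the motion matrix has full column rank, and the 3 × 3 system is solved by Cramer's rule. All function names are hypothetical.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in a))


def complete_motion(m1, m2):
    """Add a third row parallel to m1 x m2 whose length is the mean of |m1| and |m2|."""
    c = cross(m1, m2)
    scale = 0.5 * (norm(m1) + norm(m2)) / norm(c)
    return [list(m1), list(m2), [scale * x for x in c]]


def det3(M):
    """Determinant of a 3x3 matrix."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


def solve3(A, b):
    """Solve the 3x3 system A x = b by Cramer's rule."""
    d = det3(A)
    return [det3([[b[i] if j == col else A[i][j] for j in range(3)] for i in range(3)]) / d
            for col in range(3)]


def s_step(M, W):
    """Least-squares structure S = pinv(M) W, computed column by column."""
    rows, P = len(M), len(W[0])
    MtM = [[sum(M[r][i] * M[r][j] for r in range(rows)) for j in range(3)] for i in range(3)]
    cols = []
    for p in range(P):
        Mtw = [sum(M[r][i] * W[r][p] for r in range(rows)) for i in range(3)]
        cols.append(solve3(MtM, Mtw))
    return [[cols[p][i] for p in range(P)] for i in range(3)]
```

The same `s_step` works for the stacked 2F × 3 matrix or for its completed 3F × 3 counterpart, since only the number of rows changes.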
3.3 M-Step
At the M-step, $M^{(k)}$ is estimated from $\tilde{W}^{(k)}$ and $S^{(k)}$. It is trivial that the value of $M_i$ is independent of that of $M_j$ if $i \neq j$. The main idea is that the calculation of $M_f$ from $W_f$ and S is a quasi-registration problem: the only difficulty is that S consists of 3-dimensional vectors, while $W_f$ only contains 2-dimensional vectors. This problem can be eliminated if $W_f$ and $M_f$ are completed as described earlier in this section. After completion, both the completed measurement matrix $\tilde{W}$ and the structure matrix S consist of P 3D vectors. The estimated (complete) motion matrix $\tilde{M}$ can be decomposed into the product of a scale parameter $q_f$ and an orthonormal matrix $R_f$. According to [1], the rotation is optimally calculated as $R_f = V_f U_f^T$ if $H_f = U_f \Lambda_f V_f^T$ is the singular value decomposition (SVD) of
\[ H_f = \sum_{p=1}^{P} s_p \tilde{w}_{fp}^T, \tag{7} \]
where $s_p$ is the pth column of S and $\tilde{w}_{fp}$ is that of $\tilde{W}^{(k)}$. The optimal scale is given by the following formula:
\[ q_f = \frac{\sum_{p=1}^{P} \tilde{w}_{fp}^T R_f s_p}{\sum_{p=1}^{P} s_p^T s_p}. \tag{8} \]
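The scale formula (8) can be sketched on its own. The code below assumes that the rotation R_f has already been obtained (the SVD of H_f is not reproduced here) and that the completed measurements and structure points are given as lists of 3-vectors; the function name is hypothetical.

```python
def optimal_scale(w_cols, s_cols, R):
    """q_f = sum_p w_p^T R s_p / sum_p s_p^T s_p for completed 3D columns w_p."""
    num = 0.0
    den = 0.0
    for w, s in zip(w_cols, s_cols):
        Rs = [sum(R[i][k] * s[k] for k in range(3)) for i in range(3)]
        num += sum(w[i] * Rs[i] for i in range(3))
        den += sum(x * x for x in s)
    return num / den
```

For noise-free data generated as w_p = q R s_p, the function returns q, because R preserves the length of every s_p.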
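Algorithm 1. SfM by Fast Alternation
  $M^{(0)}, S^{(0)} \leftarrow$ Tomasi-Kanade($W$)
  $\tilde{W}^{(0)}, \tilde{M}^{(0)} \leftarrow$ Complete($W, M^{(0)}, S^{(0)}$)
  $k \leftarrow 0$
  repeat
    $k \leftarrow k + 1$
    $S^{(k)} \leftarrow$ S-Step($\tilde{W}^{(k-1)}, \tilde{M}^{(k-1)}$)
    $\tilde{W}^{(k)} \leftarrow$ Complete($W, \tilde{M}^{(k-1)}, S^{(k)}$)
    $\tilde{M}^{(k)} \leftarrow$ M-Step($\tilde{W}^{(k)}, S^{(k)}$)
    $\tilde{W}^{(k)} \leftarrow$ Complete($W, \tilde{M}^{(k)}, S^{(k)}$)
  until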
$\| \tilde{W}^{(k-1)} - \tilde{M}^{(k-1)} S^{(k-1)} \|^2 - \| \tilde{W}^{(k)} - \tilde{M}^{(k)} S^{(k)} \|^2$
[...]
0 such that $C_q = 0$ for $q > n$ and each abelian group $C_q$ is finitely generated. All chain complexes considered here are finite and free. A chain c in C is a q-cycle if $c \in \mathrm{Ker}\, d_q$. If $c \in \mathrm{Im}\, d_{q+1}$ then c is called a q-boundary. Denote the groups of q-cycles and q-boundaries by $Z_q$ and $B_q$ respectively. Define the integer qth homology group to be the quotient group $Z_q / B_q$, denoted by $H_q(C)$. We say that c is a representative q-cycle of the homology generator $c + B_q$ (denoted by [c]). For each q, the qth homology group $H_q(C)$ is a finitely generated free abelian group. The rank of $H_q$, denoted by $\beta_q$, is called the qth Betti number of C. Homology is a powerful topological invariant, which characterizes an object by its q-dimensional “holes” (connected components, tunnels and cavities). Let K be a complex (simplicial, cubical or simploidal). A q-chain c is a formal sum of q-cells (where q is the dimension of the cell) in K. Let $\{\sigma_1^q, \dots, \sigma_{m_q}^q\}$ be the set of q-cells in K, then $c = \sum_{i=1}^{m_q} \lambda_i \sigma_i^q$, where $\lambda_i \in \{0, 1\}$. Alternatively, we can think of c as the set $\{\sigma_i^q \text{ such that } \lambda_i = 1\}$, and of the sum of two q-chains as their symmetric difference. The q-chains together with the addition operation form the group of q-chains, denoted by $C_q(K)$. The differential of a q-cell σ in K, $d_q(\sigma)$, is the sum of the (q − 1)-cells in K that belong to the boundary of σ. By linearity, the differential can be extended to q-chains. The chain complex C(K) is the sequence of chain groups $C_q(K)$ connected by the homomorphisms $d_q$. The homology of K is defined as the homology of C(K). Since we work with
A Graph-with-Loop Structure for a Topological Representation
objects embedded in R³, the homology groups are torsion-free (see [1, ch. 10]). Moreover, the Universal Coefficient Theorem [12] ensures that all the homology information can be computed working with coefficients in Z/2. An AT-model for K is established in [8,9] and used to obtain the homology and representative cycles of homology generators of K. An AT-model can be computed starting from an ordering of the cells in K [8,9]. We deal here with a particular ordering based on a cover forest T of K (any two vertices are connected by exactly one path in T if and only if they are connected in K). Let T = (V, E) be a cover forest of K where V is the set of all the vertices of K and E a subset of the edges of K. S = (σ0, ..., σm) is a T-filter if it is an ordering of all the cells in K such that:
– for each j (where 0 ≤ j ≤ m), {σ0, ..., σj} is a subcomplex of K;
– if i < j, the dimension of σi is less than or equal to the dimension of σj;
– if i < j, σi and σj are two edges and σj ∈ T, then σi ∈ T (that is, the edges of the cover forest come first among the edges in S).
Observe that σ0 is always a vertex of K. An AT-model for K is then defined as the output of the following algorithm, having as input a complex K and a T-filter S of K.

Algorithm 1. [8,9] AT-model Algorithm.
Input: a T-filter S = (σ0, ..., σm) of K.
H := {σ0}, f(σ0) := σ0, g(σ0) := σ0, φ(σ0) := 0.
For i = 1 to m do
  If f d(σi) = 0, then
    H := H ∪ {σi}, f(σi) := σi, φ(σi) := 0, g(σi) := σi + φ d(σi).
  If f d(σi) ≠ 0, then:
    k := max {j such that σj ∈ f d(σi), j = 1, ..., i − 1},
    H := H \ {σk}, f(σi) := 0, φ(σi) := 0.
    For j = 1 to i − 1 do
      if σk ∈ f(σj), then f(σj) := f(σj) + f d(σi), φ(σj) := φ(σj) + σi + φ d(σi).
Output: the set (S, H, f, g, φ).

Notice that in the ith step of the algorithm (i = 1, ..., m), exactly one homology generator is created or destroyed. The algorithm runs in time at most O(m³). Proposition 1.
Let K be a complex (simplicial, cubical or simploidal), let T = (V, E) be a cover forest of K and S a T -filter of K. The output of Algorithm 1, (S, H, f, g, φ), satisfies that: – H is a subset of S such that no edge in T is an edge in H. H generates a chain complex denoted by H with null differential; – The number of vertices, edges and triangles in H equals the number of connected components, tunnels and cavities in |K|, respectively. In other words, the homology of K is isomorphic to H. – f : C(K) → H satisfies that if c ad c are two cycles in K such that f (c) = f (c ) then c and c are homologous.
R. Gonzalez-Diaz et al.
– g : H → C(K) satisfies that {[g(h)] : h ∈ H} is a set of homology generators of K. If h, h′ ∈ H, h ≠ h′, then g(h) and g(h′) are not homologous. Moreover, if a is an edge in H, then g(a) is a simple cycle in K and all the edges in g(a) \ {a} are edges in T. In fact, g(a) \ {a} is the simple path in T connecting the endvertices of a.
– φ : C(K) → C(K) satisfies that if x ∈ H then φ(x) = 0 and there is no y ∈ S such that φ(y) = x. Moreover, if v is a vertex in K, then φ(v) is the simple path in T connecting v with the vertex in H that belongs to the same connected component in K as v.

S        H        f        g                          φ
⟨0⟩      ⟨0⟩      ⟨0⟩      ⟨0⟩                        0
⟨1⟩               ⟨0⟩                                 ⟨0,1⟩
⟨2⟩               ⟨0⟩                                 ⟨0,2⟩
⟨0,1⟩             0                                   0
⟨0,2⟩             0                                   0
⟨1,2⟩    ⟨1,2⟩    ⟨1,2⟩    ⟨1,2⟩ + ⟨0,2⟩ + ⟨0,1⟩      0

Fig. 3. A filter S of a simplicial complex K and the result of applying Algorithm 1 to S (an AT-model for K)
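Algorithm 1 can be sketched in a few lines of Python for the simplicial case with coefficients in Z/2. This is only an illustrative sketch, not the implementation of [8,9]: the function names (`boundary`, `extend`, `at_model`, `simplex`) and the data layout are assumptions made here. Cells are frozensets of vertex labels, chains are Python sets of cells, and chain addition over Z/2 is the symmetric difference of sets. The input used at the end is a hollow triangle on the vertices 0, 1, 2 with cover forest T = {⟨0,1⟩, ⟨0,2⟩}, chosen to resemble the example of Fig. 3.

```python
from itertools import combinations


def boundary(cell):
    """Boundary over Z/2 of a simplex stored as a frozenset of vertex labels."""
    if len(cell) == 1:  # a vertex has empty boundary
        return set()
    return {frozenset(face) for face in combinations(cell, len(cell) - 1)}


def extend(mapping, chain):
    """Apply a cell-wise map to a chain (a set of cells); '+' is symmetric difference."""
    image = set()
    for cell in chain:
        image ^= mapping[cell]
    return image


def at_model(S):
    """Sketch of Algorithm 1. S is a T-filter given as a list of simplices."""
    index = {cell: i for i, cell in enumerate(S)}
    H = [S[0]]
    f = {S[0]: {S[0]}}
    g = {S[0]: {S[0]}}
    phi = {S[0]: set()}
    for i in range(1, len(S)):
        sigma = S[i]
        d = boundary(sigma)
        fd = extend(f, d)
        if not fd:
            # f d(sigma) = 0: sigma creates a new homology generator
            H.append(sigma)
            f[sigma] = {sigma}
            phi[sigma] = set()
            g[sigma] = {sigma} ^ extend(phi, d)
        else:
            # f d(sigma) != 0: sigma destroys the youngest generator in f d(sigma)
            sigma_k = max(fd, key=index.get)
            H.remove(sigma_k)
            f[sigma] = set()
            phi[sigma] = set()
            correction = {sigma} ^ extend(phi, d)  # sigma + phi d(sigma)
            for j in range(1, i):
                cell = S[j]
                if sigma_k in f[cell]:
                    f[cell] = f[cell] ^ fd
                    phi[cell] = phi[cell] ^ correction
    # g is only meaningful on the surviving generators
    return H, f, {h: g[h] for h in H}, phi


def simplex(*vertices):
    """Hypothetical helper: build a cell from its vertex labels."""
    return frozenset(vertices)


# Hollow triangle; the two edges of the cover forest come first among the edges.
S = [simplex(0), simplex(1), simplex(2),
     simplex(0, 1), simplex(0, 2), simplex(1, 2)]
H, f, g, phi = at_model(S)
```

On this input the surviving generators are one vertex (one connected component) and the edge ⟨1,2⟩ (one tunnel), and g(⟨1,2⟩) is the full triangle boundary, in line with Proposition 1.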
3 Computing a Graph-with-Loop Representation of a 3D Object
Let K be a complex (simplicial, cubical or simploidal); |K| its geometric realization in R3; and h : |K| → R a continuous function. Let exy denote an edge with endvertices x and y. We say that the height of a point p ∈ |K| is t if h(p) = t and the height of a cell σ ∈ K is the minimum of the heights of all the points on |σ|. We say that K is an h-complex if:
– the set of the vertices of K can be partitioned into a finite number r of subsets in terms of their height, $V = \bigcup_{i=1}^{r} V_i$, where $V_i = \{v \in V : h(v) = t_i\}$ and $t_1 < \cdots < t_r$.
– if evw is an edge in K then v and w belong to Vi for some i = 1, . . . , r or v ∈ Vi−1 and w ∈ Vi for some i = 2, . . . , r.
h-Complexes appear in a natural way when they are defined by the neighborhood relations of voxels of a 3D digital image and h is the real function that associates to each point on |K| its elevation.

Let K be an h-complex and σ a cell in K. We say that σ is horizontal if the heights of all the points on |σ| coincide; otherwise, it is vertical. For i = 1, . . . , r, let Ki be the collection of all the horizontal cells in K with the same height ti. Ki is a subcomplex of K and if a cell σ is not in any Ki, then σ is vertical. Let Ti = (Vi, Ei) be a cover forest of Ki and Si a Ti-filter of Ki. Denote by (Si, Hi, fi, gi, φi), i = 1, . . . , r, the AT-models obtained using Algorithm 1. Let V be the set of vertices in K, T = (V, E) a cover forest of K (obtained
Fig. 4. From left to right: a digital image and 3 h-complexes associated to it considering the 6, 14 and 26-adjacency, respectively
after adding vertical edges in K in increasing order of height to the graph (V, $\bigcup_{i=1}^{r} E_i$)), and S a T-filter of K. Denote by (S, H, f, g, φ) the AT-model obtained using Algorithm 1.

Proposition 2. The AT-model (S, H, f, g, φ) satisfies that:
– If a is a horizontal edge in H, then a ∈ Hi for some i and g(a) = gi(a) is a simple cycle such that its edges are in Ki.
– If a is a vertical edge in H, then for each level i, i = 1, . . . , r, g(a) has an even number of vertical edges of height ti.

Now, let us explain how to construct the graph Gh(K) and the function Ψ : Gh(K) → K using the AT-models (Si, Hi, fi, gi, φi), i = 1, . . . , r and (S, H, f, g, φ) computed before.
First, the vertices in Gh(K) in each level i are the vertices in Hi, i = 1, . . . , r. If v is a vertex in Gh(K), then Ψ(v) = v.
Second, for each level i and (horizontal) edge a in Hi ∩ H, we add a loop α in Gh(K) such that its endvertex is the vertex in the level i of Gh(K) which belongs to the same connected component as |a| in |Ki|. Define Ψ(α) = gi(a) = g(a).
Third, we add an edge exy between two vertices x and y in Gh(K) if x ∈ Hi and y ∈ Hi+1 for some i and f(x) = f(y) = z ∈ H (i.e. x and y belong to the same connected component in K). Define Ψ(exy) = φ(x) + φ(y) (the simple path in T connecting the vertices x and y).
Finally, for each vertical edge evw in H, an edge b is added to Gh(K). Since evw is vertical, v ∈ Ki and w ∈ Ki+1 for some i. The endvertices of b are the vertices in Gh(K) which belong to the same connected component as v in Ki and as w in Ki+1, respectively. Define Ψ(b) = evw + φi(v) + φi+1(w).

Theorem 2. Given a complex K and a continuous function h : |K| → R. If K is an h-complex, then:
1. The graph Gh(K) and the complex K have the same number of tunnels and connected components.
2. For each loop α ∈ Gh(K), Ψ(α) is a simple cycle representative of a homology generator of K. If α1 and α2 are two different loops in Gh(K), then Ψ(α1) and Ψ(α2) are two representative cycles of two non-equivalent generators of homology.
3. For each edge exy in Gh(K) that comes from a vertical edge evw ∈ H, Ψ(exy) + φ(x) + φ(y) = g(evw) is a representative cycle of a homology generator of K.
4. The graph Gh(K) without loops is a subdivision of the Reeb graph Rh(K).
5. The graph Gh(K) and the function Ψ can be computed in O(m^3), where m is the number of cells in K.
Proof. The number of tunnels of K is the number of edges in H. By construction, each horizontal edge in H produces a loop in Gh(K) (i.e. a tunnel in Gh(K)). Each vertical edge evw in H produces a vertical edge β in Gh(K). Let v be in Ki and w in Ki+1. Let V and W be the two vertices in Gh(K) that belong to the same connected component as v and w, respectively. Since evw ∈ H, evw created a cycle when it was added. Therefore, v and w belong to the same connected component in K and so, by construction, there exists a path p between V and W in Gh(K) apart from the edge β produced by evw. Then, p + β is a cycle in Gh(K). Moreover, Ψ(p + β) = g(b) is a representative cycle of the homology generators of dimension 1 of K. Since representative cycles of a homology generator of dimension 1 of K map by ψ to a cycle in Rh(K), ψ(g(b)) is a cycle in Rh(K). Since K is an h-complex, a vertex in Rh(K) corresponds to a contour in a level ti, i = 1, . . . , r. Therefore, a vertex in Rh(K) is a vertex in Gh(K).
Fig. 5. From left to right: a cover forest T of the cubical complex K shown on the right of Figure 4; the complexes K0, K1, K2 and K3; a set of representative cycles of the generators of H1(K); and the graph Gh(K)

Example 1. Let K be the cubical complex on the right in Figure 4. A cover forest T of K; the complexes K0, K1, K2 and K3; a set of representative cycles of the generators of H1(K); and the graph Gh(K) are shown in Figure 5. The non-trivial identifications of the edges and loops in Gh(K) and K by Ψ are:

Gh(K)   Ψ
h1      eab + ebc + ecd + edo + eof + eaf
v1      eBj + eij + ehi + egh + edg + ecd + eac + eab + ebC
v2      eBj + eij + ebi + eab + eaC
v3      eBk + ek + em + emn + enC

The representative cycles of generators of H1(K) are:
α0 = Ψ(h1) = eab + ebc + ecd + edo + eof + eaf
α1 = Ψ(v1) + eBC = eBj + eij + ehi + egh + edg + ecd + eac + eab + ebC + eBC
α2 = Ψ(v2) + eBC = eBj + eij + ebi + eab + eaC + eBC
α3 = Ψ(v3) + eBC = eBk + ek + em + emn + enC + eBC
4 Conclusions and Future Work
It is possible to obtain representative cycles on the boundary of the given complex K if we compute a cover forest of K by adding the edges on the boundary first. Another task is the generalization of the method to any dimension. The problem is that the homology of a complex of dimension higher than 3 can have torsion groups. In order to capture the torsion part of the homology we could use the concept of λ-AT-model developed in [10]. A possible extension of this work is the construction of a discrete Morse complex Mh(K) associated to a cell complex K, such that there is not only a one-by-one identification of all the homology generators of Mh(K) with those of K, but also an isomorphism between cohomology rings. Mh(K) can be constructed using a gradient vector field VK associated to a discrete Morse function (see [6,7]) that can be obtained from an AT-model for K.
References
1. Alexandroff, P., Hopf, H.: Topologie I. Springer, Berlin (1935)
2. Biasotti, S., Falcidieno, B., Spagnuolo, M.: Extended Reeb Graphs for Surface Understanding and Description. In: Nyström, I., Sanniti di Baja, G., Borgefors, G. (eds.) DGCI 2000. LNCS, vol. 1953, pp. 185–197. Springer, Heidelberg (2000)
3. Dahmen, W., Micchelli, C.A.: On the Linear Independence of Multivariate B-Splines. Triangulation of Simploids. SIAM J. Numer. Anal. 19 (1982)
4. Cole-McLaughlin, K., Edelsbrunner, H., Harer, J., Natarajan, V., Pascucci, V.: Loops in Reeb Graphs of 2-manifolds. Discrete Comput. Geom. 32, 231–244 (2004)
5. Fomenko, A.T., Kunii, T.L.: Topological Methods for Visualization. Springer, Heidelberg (1997)
6. Forman, R.: A discrete Morse theory for cell complexes. In: Yau, S.T. (ed.) Geometry, Topology and Physics for Raoul Bott. International Press (1995)
7. Forman, R.: Discrete Morse Theory and the Cohomology Ring. Transactions of the American Mathematical Society 354, 5063–5085 (2002)
8. Gonzalez-Diaz, R., Real, P.: Towards Digital Cohomology. In: Nyström, I., Sanniti di Baja, G., Svensson, S. (eds.) DGCI 2003. LNCS, vol. 2886, pp. 92–101. Springer, Heidelberg (2003)
9. Gonzalez-Diaz, R., Real, P.: On the Cohomology of 3D Digital Images. Discrete Applied Math. 147, 245–263 (2005)
10. Gonzalez-Diaz, R., Jiménez, M.J., Medrano, B., Real, P.: Extending AT-Models for Integer Homology Computation. In: GbR 2007. LNCS, vol. 4538, pp. 330–339. Springer, Heidelberg (2007)
11. Massey, W.M.: A Basic Course in Algebraic Topology. New York (1991)
12. Munkres, J.R.: Elements of Algebraic Topology. Addison-Wesley, London, UK (1984)
13. Reeb, G.: Sur les Points Singuliers d'une Forme de Pfaff Complètement Intégrable ou d'une Fonction Numérique. Comptes Rendus Acad. Sciences 222, 847–849 (1946)
Print Process Separation Using Interest Regions
Reinhold Huber-Mörk, Dorothea Heiss-Czedik, Konrad Mayer, Harald Penz, and Andreas Vrabl
Business Unit High Performance Image Processing, Smart Systems Division, Austrian Research Centers GmbH - ARC, A-2444 Seibersdorf, Austria
http://www.smart-systems.at
Abstract. For quality inspection of printing systems it is necessary to measure the displacement between printing processes. Tie points are employed in correspondence and displacement estimation between individual print elements. We compare interest point and region descriptors for tie point detection in industrial inspection tasks. Clustering of measured displacements taken from sequences of sample images allows the estimation of the accuracy of printing processes and the alignment of printing processes. Results of an experimental application to banknote printing process inspection are given.
1 Introduction
Modern printing of text and images often involves different printing steps, e.g. the four-color process printing using cyan, magenta, yellow, and black (CMYK) inks or printing employing different technologies [11]. Printing processes are performed sequentially, e.g. an offset printing process is applied prior to an intaglio printing process. The sequential application of different printing processes might cause some relative displacement between elements generated by the different processes. Displacements between different printing processes and within a single printing process are of great interest in high quality printing.

In Fig. 1 two parts of images of banknotes are shown. Both images show an enlarged part of identical size and at the same position relative to the outline of the banknote. The spatial resolution of the images is 0.1 millimeters. The circles and the star on the left of each image were printed using offset printing, whereas the dark object on the right was produced by intaglio printing. It is possible to identify a horizontal shift of approximately four pixels, which corresponds to 0.4 millimeters, for the offset print elements between the two images. The intaglio print element has a stronger displacement on the order of magnitude of 12 pixels, which corresponds to 1.2 millimeters.

In order to automatically separate print elements belonging to different print processes the following two observations will be utilized in this paper:
1. On a single image, elements belonging to a particular print tend to be moved coherently with respect to print elements printed using a different technology.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 514–521, 2007. © Springer-Verlag Berlin Heidelberg 2007
[Figure 1, panels (a) and (b); horizontal axis: horizontal pixel position, 0 to 140]
Fig. 1. Images of the same area for two different 10 Euro banknotes: Elements printed using offset printing, e.g. the circles and the star, are slightly shifted, the element from intaglio printing, i.e. the element on the right, has larger displacement
2. From a sample of images one is able to observe different displacements between print elements produced by different technology on different images.

Tie points are required to estimate displacements between print elements from images. A tie point is typically characterized by its neighborhood, e.g. square windows centered at the tie point. Probably the most prominent field of application for tie points is image registration [16]. Recently, tie points have also been referred to as interest points and the fields of application have become more widespread. Wide baseline stereo vision [6] [12], object recognition [5] and video indexing and retrieval [8] have become new application areas for interest points. In view of the considered application, we restrict ourselves to a single scale and the most prominent interest point and region detectors, i.e. the Harris, Hessian, Moravec and MSER detectors. We compared interest point detectors by an established quality measure [9], where MSERs were found to perform best, see Sec. 2. Therefore, we derive displacements between different printing processes using corresponding MSERs in Sec. 3. Details of the matching procedure are given in [2]. Furthermore, clustering of displacement vectors derived from sequences of banknote images enables us to perform reliable segmentation of print elements, see Sec. 4.
2 Interest Point and Region Detectors
Interest points are usually defined as local image features where the image content varies significantly in more than one direction, e.g. at region corner locations. Interest regions are usually closed areas in the image characterized by small variation in intensity and/or color. For wide baseline stereo vision, the maximally stable extremal regions (MSERs) [6] and intensity extrema based regions (IBR) as well as edge based regions (EBR) [12] have been suggested. The most prominent interest point detectors are Harris [1], Hessian [4] and the Moravec detectors [7]. Combinations of point and region detectors also appeared recently [14]. In order to detect interest points at different resolutions, image pyramid and scale space approaches have been suggested. Among others, one of the most prominent
approaches in this direction is the scale invariant feature transform (SIFT) [5]. Wavelet based descriptors have been presented by Sebe et al. [10]. Concerning interest point detectors, we will focus on properties derived from the local autocorrelation matrix (Harris detector [1]), properties derived from the matrix containing local second order partial derivatives (Hessian detector [4]) and description of local variance variation (Moravec detector [7]). The MSER [6] interest region detector selects regions characterized by the property that their boundaries are relatively stable over a wide range of pixel intensities:
– Every pixel in the region has an intensity value which is less than or equal to every pixel on the contour.
– The change in the total number of pixels in the region over the range of contour intensities g − δ to g + δ is a local minimum, where g is an intensity value and δ is an integer increment or decrement thereof.

2.1 Matching of Interest Points and Regions
Found interest points or regions in a defined search area are required to correspond. For a given reference point or region location in a reference image a search region in a comparison image is derived. Usually, this is a square window centered on the reference point location or an enlarged bounding box with respect to the reference region bounding box. Correspondence is checked by comparison to a threshold on the similarity measure. The similarity of interest points is measured by normalized cross correlation. Similarly, the correspondence of an interest region, i.e. binary image patches containing MSERs, is assessed using a binary distance measure. Due to its efficient implementation and discriminative power [15] the Sokal-Michener measure of similarity for a MSER centered at point (x, y) is used

$$C(x + l_x, y + l_y) = \frac{1}{|W|} \sum_{W} \delta\left[I(x, y),\, I(x + l_x, y + l_y)\right], \qquad (1)$$

$$\delta[a, b] = \begin{cases} 1 & \text{if } a = b, \\ 0 & \text{else,} \end{cases}$$

where l_x and l_y are the pixel lags of the replicas of the MSER. A binary image patch centered at (x, y) and containing the MSER is denoted by I(x, y). A shifted replica centered at (x + l_x, y + l_y) is denoted by I(x + l_x, y + l_y). The window W is a bounding box for the MSER in question.

2.2 Comparison of Detectors
The repeatability rate r_mi for an image numbered i with respect to a reference image numbered m is measured as [9]

$$r_{mi} = |S_{mi}(\nu)| / \min(n_m, n_i), \qquad (2)$$

where S_mi(ν) is the set of corresponding interest points in images m and i. The parameter ν describes the search area, i.e. where to look in image i for a specific point or region found in image m. Practically, ν defines rectangular search windows centered around each reference point location. The numbers n_m, n_i are the total numbers of interest points or regions found in images m and i. The repeatability of the detectors is summarized in Fig. 2 for the considered set of 95 front side images of 10 Euro banknotes. The mean repeatability for image i is simply calculated as (where N is the total number of sample images)

$$\bar{r}_i = \frac{1}{N - 1} \sum_{s = 1, \ldots, N;\; s \neq i} r_{si}. \qquad (3)$$

Fig. 2. Mean repeatability for four different detectors and varying reference image
Interest points or regions were extracted from images printed using two processes, i.e. intaglio and offset printing. While MSERs are typically extracted from either one of the print regions, the corner detectors also tend to extract interest points on borders between print regions. Due to possible shifts between print regions, corners at borders do not necessarily occur in both images. Therefore, the MSER detector significantly outperforms the other detectors. On average, the mean repeatability for Harris was 0.679, for Hessian it was 0.6653, for Moravec it was 0.5664 and MSERs achieved 0.9353. The results indicate that MSERs should be used for this task. Therefore, we will restrict ourselves to MSERs in the remainder of the paper.
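The repeatability measures of Eqs. (2) and (3) can be sketched as follows. The source does not spell out how the set of correspondences S_mi(ν) is built, so the greedy one-to-one matching inside a square window of half-width `nu` used here is an assumption, as are the function names and the representation of interest points as (x, y) tuples. Both point lists are assumed to be non-empty.

```python
def repeatability(points_m, points_i, nu):
    """Eq. (2): share of corresponding points between reference image m and image i.

    Correspondence is approximated greedily (an assumption): each reference
    point claims the first unused point of image i inside its search window.
    """
    unused = list(points_i)
    matched = 0
    for xm, ym in points_m:
        for candidate in unused:
            if abs(candidate[0] - xm) <= nu and abs(candidate[1] - ym) <= nu:
                unused.remove(candidate)  # keep the matching one-to-one
                matched += 1
                break
    return matched / min(len(points_m), len(points_i))


def mean_repeatability(all_points, i, nu):
    """Eq. (3): average of r_si over all reference images s different from i."""
    n = len(all_points)
    total = sum(repeatability(all_points[s], all_points[i], nu)
                for s in range(n) if s != i)
    return total / (n - 1)


# Illustrative (made-up) point sets for three images.
image_a = [(10, 10), (20, 20), (30, 30)]
image_b = [(11, 10), (20, 22), (50, 50), (60, 60)]
r_ab = repeatability(image_a, image_b, nu=2)
r_mean = mean_repeatability([image_a, image_b, image_a], i=0, nu=2)
```

The one-to-one restriction keeps the ratio in Eq. (2) at or below 1, which a many-to-one count would not guarantee.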
3 Displacement Estimation Based on Region Correspondence
We will investigate the usefulness of the interest region matching approach in a problem related to banknote print inspection, the measurement of displacement between the output of different printing processes. Figure 3 shows scatter plots for the estimated displacements of all matching MSERs between four pairs of banknotes. The displacements visible to the human observer in Fig. 1, which are approximately 4 and 12 pixels in horizontal direction, correspond well to the centers of the two dominant clusters shown in Fig. 3 (a). The cluster on the
[Figure 3, panels (a) to (d): scatter plots; axes: horizontal displacement, vertical displacement]
Fig. 3. Examples of scatter plots showing horizontal and vertical displacements for MSERs estimated by region based matching of four different pairs (a-d) of images
left corresponds to MSERs printed in intaglio and the cluster on the right corresponds to MSERs printed in offset. The sparse measurements with a larger vertical displacement are MSERs located in the hologram stripe on the right side of each 10 Euro banknote. Figures 3 (b) to (d) show other examples of observed displacements in the considered sample of 95 images. Displacements observed in horizontal direction are larger than in vertical direction, which is characteristic of the inspected print. The variance among MSERs attributed to a specific cluster is mainly due to the dynamics, e.g. undulations of the paper and/or belt slip, involved when banknotes are transported at high speed.

3.1 Accuracy of Estimated Displacements
In this subsection, we compare the performance of our approach with respect to the accuracy of manually estimated displacements. Manual evaluation turned out to be quite difficult. Table 1 summarizes the results of the experiments for MSERs numbered 1 to 10. The manually estimated displacement in horizontal (u) and vertical (v) directions are shown in rows ground-truth (GT) and the results of the proposed method using region based matching (RBM) and the absolute error with respect to ground-truth are shown in the remaining two rows. The mean absolute deviation of the proposed method with respect to ground truth is approximately 0.16 pixels.
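The region based matching that produces such displacement estimates can be sketched as a search over integer lags (l_x, l_y) that maximizes the Sokal-Michener similarity of Eq. (1). This is an illustrative sketch only: the function names, the exhaustive search and the `max_lag` parameter are assumptions, the actual matching procedure is described in [2], and the sub-pixel values reported for RBM are not reproduced by an integer search like this one.

```python
import numpy as np


def sokal_michener(patch_a, patch_b):
    """Fraction of pixels on which two equally sized binary patches agree (Eq. 1)."""
    a = np.asarray(patch_a, dtype=bool)
    b = np.asarray(patch_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("patches must have the same shape")
    return float(np.mean(a == b))


def estimate_displacement(reference, comparison, top, left, height, width, max_lag):
    """Return (l_x, l_y, similarity) of the best matching shifted window.

    The window W is given by its top-left corner and size in the reference
    image; lags that would move the window outside the comparison image are skipped.
    """
    ref_patch = reference[top:top + height, left:left + width]
    rows, cols = comparison.shape
    best = (0, 0, -1.0)
    for ly in range(-max_lag, max_lag + 1):
        for lx in range(-max_lag, max_lag + 1):
            t, l = top + ly, left + lx
            if t < 0 or l < 0 or t + height > rows or l + width > cols:
                continue
            score = sokal_michener(ref_patch, comparison[t:t + height, l:l + width])
            if score > best[2]:
                best = (lx, ly, score)
    return best


# Synthetic binary region shifted by 3 pixels horizontally and 1 pixel vertically.
reference = np.zeros((20, 20), dtype=bool)
reference[5:10, 5:10] = True
comparison = np.zeros((20, 20), dtype=bool)
comparison[6:11, 8:13] = True
lx, ly, score = estimate_displacement(reference, comparison,
                                      top=4, left=4, height=7, width=7, max_lag=4)
```

Here the window is the bounding box of the synthetic region enlarged by one pixel on each side, so that a shifted replica produces mismatches along the region border.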
Table 1. Estimated horizontal (u) and vertical (v) displacements in pixels measured manually (GT), using region based matching (RBM) and absolute errors

MSER             1      2      3      4      5       6       7       8       9      10
GT         u  -5.00  -4.00  -5.00  -4.00  -7.00  -10.00  -11.50  -11.00  -15.00  -12.00
           v   0.50   1.00   1.00   1.00   1.00    0.00    0.00   -1.00   -3.00   -0.50
RBM        u  -4.96  -4.05  -4.96  -4.06  -6.97  -10.00  -11.68  -10.99  -14.96  -12.43
           v  -0.36   1.15   1.12   1.36  -1.14   -0.34   -0.76   -0.96   -2.98   -0.17
Abs. Err.  u   0.04   0.05   0.04   0.06   0.03    0.00    0.18    0.01    0.04    0.43
           v   0.14   0.15   0.12   0.36   0.14    0.34    0.76    0.04    0.02    0.33

4 Print Process Separation
Cluster analysis is the process of classifying objects into subsets that have meaning in the context of a particular problem [3]. In the context of the banknote printing process evaluation, meaningful subsets are the different printing processes, i.e. the distinction between offset and intaglio. We define a feature vector f_i^AB = (u_i, v_i), in which each pair (u_i, v_i) corresponds to the displacement estimate for MSER i between a predefined reference image A and a comparison image B. The k-means clustering procedure, where the number of clusters is denoted by k, can be applied in the two-dimensional (u, v)-space. The clustering procedure can be improved by taking more than a single pair of images into account, which results in a higher dimensional feature vector f_i = (f_i^A1, f_i^A2, . . . , f_i^AN) for each MSER i present in N image pairs, where image A is taken as the reference image for each pair of images.

Figure 3 showed four scatter plots which correspond to four pairs of images. The reference image A is the same for each pair. Each matching MSER is represented by a cross and indicates the displacement of the MSER in pixels in both directions. Cluster centers have been calculated by k-means clustering of the feature vector. The number of clusters k was set to 2. For example, all MSERs closer to the cluster on the left of Fig. 3 (a) correspond to intaglio print and MSERs closer to the cluster on the right correspond to offset print. The images related to the clustering shown in Fig. 3 are 10 Euro banknotes as depicted in Fig. 4 (a). The MSERs extracted from Fig. 4 (a) were matched with MSERs extracted from four comparison banknotes and the resulting 8-dimensional feature vectors were clustered. The image in Fig. 4 (b) shows a manually derived ground truth which partitions the banknote image into offset print areas (black), intaglio print areas (gray), and remaining areas such as background, etc. (white).

4.1 Classification of Print Elements
A classification result based on displacement estimates derived using variable window block matching [13] is shown in Fig. 4 (c), where pixels classified as offset
Fig. 4. Classification of print layer regions based on MSER displacements: a) image of a 10 Euro banknote, b) ground truth for print regions: intaglio print areas are grey, offset print areas are black, c) classification based on block matching d) classified MSERs
are depicted in black and pixels classified as intaglio are gray. The result based on block matching has shortcomings for the application at hand, e.g. homogeneous regions such as the interiors of the large digits are not reliably classified. On the other hand, textured regions, where extraction of MSERs becomes unreliable, are classified correctly. If a dense segmentation of images is required, which is beyond the scope of this paper, combination of block and region based matching is suggested. The classification result using the proposed method is shown in Fig. 4 (d), where MSERs assigned to offset are depicted in black and MSERs assigned to intaglio are gray regions. There is a total number of 143 MSERs, of which 44 MSERs are classified as intaglio print regions and 99 MSERs are classified as offset print regions. There are 4 MSERs classified falsely as intaglio print regions. Three of these regions are enclosed by intaglio print, e.g. the interiors of the 0 digits and the arch. The fourth region falsely assigned to intaglio print is the background region on the bottom right which is adjacent to the intaglio print region. Regarding the determination of displacements between printing processes, those errors are non-critical. Regions falsely classified as offset print regions mainly lie inside the arch element on the banknote. The replicated structure in this area is the main reason for misclassification. The overall achieved correct classification rate was 94.4 percent.
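The clustering step of Sec. 4 can be sketched with a plain k-means (Lloyd) iteration on the displacement feature vectors. The paper does not state which k-means variant or initialization was used, so the farthest-point initialization below is an assumption made to keep the sketch deterministic, and the displacement values are made up to mimic the two clusters near 4 and 12 pixels described above. The same function applies unchanged to the higher dimensional vectors obtained from several image pairs.

```python
import numpy as np


def kmeans(features, k=2, iterations=50):
    """Minimal k-means on an (n_samples, n_dims) array; returns (labels, centers)."""
    features = np.asarray(features, dtype=float)
    # Farthest-point initialization (assumption): start from the first sample,
    # then repeatedly add the sample farthest from all centers chosen so far.
    centers = [features[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(features - c, axis=1) for c in centers], axis=0)
        centers.append(features[dist.argmax()])
    centers = np.array(centers)
    labels = np.zeros(len(features), dtype=int)
    for _ in range(iterations):
        # Assign every feature vector to its nearest center.
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; an empty cluster keeps its previous center.
        updated = np.array([features[labels == c].mean(axis=0)
                            if np.any(labels == c) else centers[c]
                            for c in range(k)])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels, centers


# Hypothetical (u, v) displacements of seven MSERs for one image pair.
displacements = np.array([
    [-4.1, 0.2], [-3.9, -0.1], [-4.0, 0.0], [-4.2, 0.1],   # smaller shift
    [-12.1, 0.3], [-11.8, -0.2], [-12.0, 0.1],             # larger shift
])
labels, centers = kmeans(displacements, k=2)
```

Each MSER is then attributed to the printing process of its cluster; which cluster is offset and which is intaglio has to be decided from context, as in the discussion of Fig. 3 (a).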
5 Conclusion
We have presented an interest region based approach for estimating displacements between image regions, which is especially useful in situations when common
block matching algorithms tend to fail, e.g. for homogeneously colored objects and at borders between regions. The advantage of a region based approach, when compared to interest points, is that it takes into account arbitrarily shaped and sized regions, while it ignores the area surrounding the regions. Using common clustering techniques, one is able to distinguish between image regions printed in different technologies. In applications where the main goal is a dense segmentation, classifications based on block and region based matching can be combined. The suggested method is used to decrease setup times of printing systems. Future use could also be inline control of tolerances in printing.
References
1. Harris, C., Stephens, M.J.: A combined corner and edge detector. In: Proc. of ALVEY Vision Conf., pp. 147–152 (1988)
2. Huber-Mörk, R., Ramoser, H., Penz, H., Mayer, K., Heiss-Czedik, D., Vrabl, A.: Region based matching for print process identification. Pat. Rec. Let. (to appear)
3. Jain, A.K., Dubes, R.C.: Clustering methods and algorithms. In: Algorithms for Clustering Data, pp. 55–142. Prentice-Hall, Englewood Cliffs (1988)
4. Kitchen, L., Rosenfeld, A.: Gray-level corner detection. Pat. Rec. Let. 1(2), 95–102 (1982)
5. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comp. Vis. 60(2), 91–110 (2004)
6. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. Image Vis. Comp. 22(10), 761–767 (2004)
7. Moravec, H.: Obstacle avoidance and navigation in the real world by a seeing robot rover. Tech. Rep. CMU-RI-TR-3, Robotics Inst., Carnegie-Mellon Univ. (1980)
8. Schaffalitzky, F., Zisserman, A.: Multi-view matching for unordered image sets, or How do I organize my holiday snaps? In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 414–431. Springer, Heidelberg (2002)
9. Schmid, C., Mohr, R., Bauckhage, C.: Evaluation of interest point detectors. Int. J. Comp. Vis. 37(2), 151–172 (2000)
10. Sebe, N., Tian, Q., Loupias, E., Lew, M.S., Huang, T.S.: Evaluation of salient point techniques. Image Vis. Comp. 21(13-14), 1087–1095 (2003)
11. Stone, M.C., Cowan, W.B., Beatty, J.C.: Color gamut mapping and the printing of digital color images. ACM Trans. Graph. 7(4), 249–292 (1988)
12. Tuytelaars, T., Gool, L.V.: Matching widely separated views based on affine invariant regions. Int. J. Comp. Vis. 59(1), 61–85 (2004)
13. Veksler, O.: Fast variable window for stereo correspondence using integral images. In: Proc. Conf. Comp. Vis. Pat. Rec. (CVPR) (2003)
14. Winter, M., Fraundorfer, F., Bischof, H.: MSCC: Maximally stable corner clusters. In: Kalviainen, H., Parkkinen, J., Kaarna, A. (eds.) SCIA 2005. LNCS, vol. 3540, pp. 45–54. Springer, Heidelberg (2005)
15. Zhang, B., Srihari, S.N.: Properties of binary vector dissimilarity measures. In: Proc. Conf. on Comp. Vis., Pat. Rec. Image Proc. (CVPRIP) vol. 1 (2003)
16. Zitova, B., Flusser, J.: Image registration methods: a survey. Image Vis. Comp. 21, 977–1000 (2003)
Histogram-Based Lines and Words Decomposition for Arabic Omni Font-Written OCR Systems; Enhancements and Evaluation
Mohamed Attia (1) and Mohamed El-Mahallawy (2)
(1) The Engineering Company for the Development of Computer Systems; RDI, Egypt
(2) Arab Academy for Science & Technology and Maritime Transport; AAST, Egypt
Abstract. Given font/type-written text rectangle bitmaps extracted from digitally scanned pages, inferring the boundaries of lines, and hence of complete words, is a preprocessing step vital to any OCR system, alongside the recognition process itself and the post processing necessary for producing the recognized text. Histogram-based methods are commonly used, mainly due to their relative implementation simplicity and computational efficiency; however, some authors report that they are vulnerable to some idiosyncratic complexities of textual structure and to noise. This paper elaborates on this approach to produce a more robust algorithm for lines/words decomposition, especially in Arabic, or Arabic dominated, text rectangles from real-life multifont/multisize documents. To evaluate this algorithm, this paper also presents the results of extensive experimentation on about 1800 documents fairly distributed over different kinds of sources with different noise levels at different scanning resolutions and color depths.

Keywords: Arabic, decomposition, font-written, histogram, OCR, multifont, multisize, omni, segmentation, typeset, typewritten.
1 Introduction

Realizing a word-error-rate (WER) ≤ 0.5% at a speed of 60 words/min. has always been the ultimate threshold for font-written OCR systems to be considered competitive with skilled human typists [2]. While font-written OCR systems working on Latin script can claim to approach such measures under favorable conditions, the best systems working on other scripts (e.g. Chinese, Arabic, ..) are still well behind due to a multitude of complexities [2], [4]; e.g. the best recently reported Arabic omni font-written OCR systems can only claim WERs exceeding 10% under favorable conditions, not to mention real-life ones [5]. Font-written OCR systems need to perform some processing on the input scanned text block prior to the recognition phase. Decomposing the target text block into lines, and hence into words, is perhaps the most fundamental such preprocessing, and its errors add to the overall WER of the whole OCR system. This can be shown as follows:

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 522 – 530, 2007. © Springer-Verlag Berlin Heidelberg 2007
$a = 1 - e = (1 - e_D) \cdot (1 - e_R) = 1 - (e_D + e_R - e_D \cdot e_R)$
∵ e D , e R 0 and Li+1;i SIntra. Figure 8 on page 527 shows the result of diagnosing inter-word spaces, hence completing the task of decomposing a text rectangle bitmap into whole words.
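Since the evaluation reports the decomposition error e_D alone, it may help to see how it enters the overall error through the relation a = (1 − e_D)·(1 − e_R) given in the Introduction. The following Python sketch is illustrative only: the function name and the sample error rates are assumptions, not measurements from this paper. It also shows that for small error rates the cross term e_D·e_R is negligible, so the two errors approximately add.

```python
def overall_error(e_d: float, e_r: float) -> float:
    """Overall word error rate e = 1 - (1 - e_D)(1 - e_R).

    e_d: decomposition error rate, e_r: recognition error rate,
    both treated as independent probabilities in [0, 1].
    """
    return 1.0 - (1.0 - e_d) * (1.0 - e_r)


# Hypothetical values chosen for illustration only.
e_d, e_r = 0.005, 0.10
exact = overall_error(e_d, e_r)   # e_D + e_R - e_D * e_R
approx = e_d + e_r                # first-order approximation
cross_term = approx - exact       # equals e_D * e_R
```

With these sample values the cross term is three orders of magnitude smaller than the sum, which is why decomposition errors can be read as adding almost directly to the recognizer's own WER.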
3 Evaluation To assess the lines and words decomposition algorithm presented above, we ran it on a total of 1800 real-life documents evenly covering different kinds of sources, different levels of image quality, and different scanning resolutions and color depths, as detailed in Table 1 below. This evaluation DB (see footnote 1) was acquired using common hardware: an HP 3800 flatbed scanner with TMA, an HP LaserJet 1100 printer, and a XEROX 256XT photocopier.

Table 1. Results of evaluation experiments on the proposed decomposition algorithm (decomposition error eD)

Source of doc.'s →       LASER-printed documents      Book pages                   Newspapers
                         50 pages, 1081 lines,        50 pages, 987 lines,         50 pages, 1309 lines,
                         12146 words                  12524 words                  12870 words
# of photocopying →      PC#0    PC#1    PC#2         PC#0    PC#1    PC#2         PC#0    PC#1    PC#2
300 dpi, B&W             0.41%   0.59%   0.73%        0.44%   0.61%   0.71%        0.59%   0.71%   0.90%
300 dpi, gray shades     0.36%   0.56%   0.66%        0.42%   0.58%   0.68%        2.10%   2.91%   3.67%
600 dpi, B&W             0.35%   0.50%   0.61%        0.38%   0.53%   0.64%        0.50%   0.63%   0.84%
600 dpi, gray shades     0.32%   0.46%   0.57%        0.31%   0.45%   0.59%        0.85%   1.10%   1.43%
The WER of this algorithm is eD ≡ Ne ÷ Ntotal, where Ntotal is the total number of contiguous (i.e. un-spaced) strings in the input text block, and Ne is the number of wrongly decomposed full-word rectangles, i.e. those that either contain more than one contiguous string or contain only part of a contiguous string. Errors due to typing mistakes, tiny isolated noise stains (recognized harmlessly as dots), and wrongly split contiguous numbers (rejoined in the OCR postprocessing) are not counted as errors. Acknowledgments. The authors feel indebted to RDI; The Engineering Company for the Development of Computer Systems; www.RDI-eg.com for its strong support of this work. Special thanks are due to Prof. Mohsen A. A. Rashwan, RDI's CEO and professor of Electronics & Electrical Communications, Faculty of Eng., Cairo Univ., Egypt.
1 The 4 parts of this multi-gigabyte DB corresponding to the 1st column in Table 1 are freely downloadable at [3], while the whole DB is freely available on request from either of the authors.

References
1. Amin, A., Masini, G.: Machine recognition of multi-font printed Arabic texts. In: Proc. 8th International Joint Conf. on Pattern Recognition, Paris, France, pp. 392–395 (October 1986)
2. Al-Badr, B., Mahmoud, S.A.: Survey and Bibliography of Arabic Optical Text Recognition, Elsevier Science. Signal Processing 41, 49–77 (1995)
3. Attia, M., El-Mahallawy, M.: Histogram-Based Decomposition Algorithm Evaluation DB http://www.RDI-eg.com/OCR_Decomp_Eval_DB
4. Attia, M.: Arabic Orthography vs. Arabic OCR, Multilingual Computing & Technology magazine, USA (December 2004)
5. Bazzi, I., Schwartz, R., Makhoul, J.: An Omni font Open-Vocabulary OCR System for English and Arabic. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(6) (1999)
6. Bouhilali, K., Kamrouni, M., Ellouze, N.: Method of segmentation of Arabic text image into characters. In: Proceedings of the 1st Kuwaiti Computer Conf., Kuwait, pp. 442–446 (March 1989)
7. Fujisawa, H., Nakano, Y., Kurino, K.: Segmentation Methods for Character Recognition: From Segmentation to Document Structure Analysis. Proc. IEEE 80(7), 1079–1091 (1992)
8. Gonzalez, R., Woods, R.: Digital Image Processing, 2nd edn. Prentice-Hall, Englewood Cliffs (2002)
9. Gouda, A.M.: Arabic Handwritten Connected Character Recognition, PhD thesis, Dept. of Computer Engineering, Faculty of Engineering, Cairo University (November 2004)
10. Kasturi, R., O’Gorman, L.: Document image analysis: A bibliography. Machine Vision and Appl. 5, 231–243 (1992)
11. Parhami, B., Taraghi, M.: Automatic recognition of printed Farsi texts. Pattern Recognition 14(1), 1–6 (1981)
12. Pratt, W.K.: Digital Image Processing, 2nd edn. John Wiley & Sons Inc., West Sussex, England (1991)
Semi-automatic Training Sets Acquisition for Handwriting Recognition Jerzy Sas and Urszula Markowska-Kaczmar Wroclaw University of Technology, Applied Informatics Institute, Wyb Wyspianskiego 27, Wroclaw, Poland [email protected], [email protected]
Abstract. In this paper, a method of semi-automatic training set acquisition for character classifiers used in cursive handwriting recognition is described. The training set consists of character samples extracted from a training corpus by segmentation. The method first splits the word images from the corpus into sequences of graphemes. Then, a set of candidate segmentation variants is elicited with an evolutionary algorithm, where a segmentation variant determines the subdivision of the grapheme sequences of words into subsequences corresponding to consecutive letters. Segmentation variants are modeled by a chromosome population. Next, each segmentation variant from the final population is tuned in an iterative process and the best chromosome is selected. Then the character samples resulting from the segmentation modeled by the selected chromosome are grouped into sets corresponding to letters of the alphabet. Finally, the most atypical samples are rejected so as to maximize the word recognition accuracy obtained with a character classifier trained with the reduced sample set.
1 Introduction
Many handwriting recognition systems follow the "analytic" paradigm, in which the word is segmented into characters, either explicitly or implicitly, and the isolated characters are recognized by character classifiers. The character classifiers used in this process must be trained with a training set consisting of correctly labeled character samples extracted from handwritten text. Creating the training set from a corpus of handwritten texts is a tedious and time-consuming process. An automatic, or at least semi-automatic, method of training set extraction is therefore desirable. This paper proposes a method that extracts individual character samples from a set of handwritten word images manually annotated with the sequence of actual characters constituting the words. Because the annotation must be done by a user, we call this approach "semi-automatic".

Handwriting segmentation has been investigated in a number of other works. The proposed methods differ in efficiency and complexity. In [1], the authors describe a method of handwritten word segmentation using simulated annealing, which is very time consuming. In [2] the method is based on approximating a binary raster image with a set of polygons and building a continuous skeleton of these polygons. In [3] the word segmentation is performed by approximating the average character widths and then seeking the segmentation of an individual word that minimizes the difference between the actual and the expected word width. The approach described in [4] combines word segmentation and recognition but requires a lexicon of the words that can appear in the text. Its recognition stage requires a set of samples for each character of the alphabet. For this reason it is unsuitable for our purpose, i.e. for training set acquisition. Examples of the application of evolutionary algorithms, which we use in the first step of our segmentation method, can be found in [5] and [6]. They differ from ours in the encoding, the genetic operators and the fitness function.

In this paper we focus on off-line handwritten word recognition. A writer-dependent approach is considered because many researchers, for instance in [7] and [5], claim that writer-dependent recognition achieves higher accuracy than writer-independent approaches. Writer-dependent handwriting recognition requires an individual classifier to be created and trained for each writer. This, in turn, requires thorough preparation of training sets extracted from texts written by the individual writer. Because this typically must be done in an unsupervised manner during system operation, the possibility of automatic creation of the training set is of particular importance in this case.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 531–538, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Problem Definition
The aim of the method described here is to semi-automatically create the training set (TSCC) for character classifiers. TSCC is extracted from the training corpus, which consists of handwritten word images annotated with their correct transcription into ASCII character sequences. TSCC is acquired by a three-stage procedure. At the first stage, a set of candidate corpus segmentation variants is created with an evolutionary algorithm, where the chromosome is a model of the segmentation variant. At the next stage, all segmentation variants corresponding to individuals in the final population are refined in an iterative process, where the character boundaries are slightly modified as long as this improves the overall segmentation evaluation. The best segmentation variant is selected, and the character samples extracted from the word images according to the best variant are grouped into sets corresponding to letters of the alphabet. Finally, the obtained family of image sets for alphabet letters is evaluated by using it as a training set for a handwriting recognition procedure. The aim of the third stage is to reject the most atypical samples from the sets for individual letters of the alphabet so as to maximize the accuracy of character recognition.

The method described here applies word image oversegmentation into graphemes, based on the segmentation presented in [8]. It is assumed that each character consists of a whole number of graphemes, i.e. that actual character boundaries lie on grapheme boundaries. At the first two stages of the proposed procedure our aim is to find the subdivision of the grapheme sequences of words which most closely (optimally) corresponds to the actual character annotations explicitly given in the training corpus. This leads to the optimization problem defined as follows.

Let us consider the corpus $S$ of $K$ handwritten word images $I_k$, $k = 1, \ldots, K$, each annotated with the sequence of characters actually constituting the word, $w_k = (c_k^{(1)}, c_k^{(2)}, \ldots, c_k^{(l_k)})$:

$$S = ((I_1, w_1), (I_2, w_2), \ldots, (I_K, w_K)) \qquad (1)$$

To each word image the segmentation procedure can be applied, which divides the word image into graphemes. For each word $w_k$ the corresponding grapheme sequence $G_k = (g_k^{(1)}, g_k^{(2)}, \ldots, g_k^{(m_k)})$ can be obtained using the grapheme extraction procedure described in [8]. The result of an exemplary word segmentation into graphemes is presented in Fig. 1.

Fig. 1. Exemplary word segmentation into graphemes

Experiments have shown that in the great majority of cases, individual characters consist of sequences of consecutive graphemes not longer than 4. Let $q_i^k(n) = (g_k^{(b_{k,i}^{(n)})}, \ldots, g_k^{(e_{k,i}^{(n)})})$ denote the subsequence of graphemes in the word $w_k$ corresponding to the $i$-th character $c_k^{(i)}$ in a certain variant $V_n^k$ of this word's segmentation. Here $b_{k,i}^{(n)}$ and $e_{k,i}^{(n)}$ denote the indices of the first and the last grapheme assigned to the $i$-th character in the partitioning variant $V_n^k$. By selecting a variant $V_n^k$ for each word in the corpus $S$, the complete partitioning variant is obtained:

$$\hat{V}_n = (V_{n_1}^1, V_{n_2}^2, \ldots, V_{n_K}^K)$$

By gathering the segments of the image $I_k$ corresponding to the graphemes in $q_i^k(n)$, the hypothetical image $u_k^{(i)}(n)$ of the character $c_k^{(i)}$ appearing at the $i$-th position in the word $w_k$ can be obtained. The method is based on the quite natural assumption that various appearances of the same character are similar in a certain sense. Let us define the function $\Delta(k, i, c)$, which is equal to 1 if the character $c$ appears at the $i$-th position of the word $w_k$ and 0 otherwise:

$$\Delta(k, i, c) = \begin{cases} 1 & \text{if } c_k^{(i)} = c, \\ 0 & \text{otherwise,} \end{cases} \qquad k \in \{1, \ldots, K\};\ c \in A \qquad (2)$$

Then the sets of hypothetical character samples for the partitioning variant $\hat{V}_n$ and for all characters from the alphabet $A$ can be defined as:

$$T_n^c = \{u_k^{(i)}(n) : \Delta(k, i, c) = 1\}, \qquad k \in \{1, \ldots, K\};\ c \in A \qquad (3)$$
The similarity between character images is defined by the function $d(u, v)$, whose values lie in the range (0, 1). The higher the value of $d(u, v)$, the more similar the images $u$ and $v$ are. The overall similarity of the character images resulting from the complete segmentation variant $\hat{V}_n$ can be calculated as:

$$\Gamma(\hat{V}_n) = \sum_{c \in A} \frac{\sum_{u, v \in T_n^c,\ u \neq v} d(u, v)}{\eta(T_n^c)^2 - \eta(T_n^c)} \qquad (4)$$

where $\eta(T_n^c)$ is the cardinality of the set $T_n^c$. If various appearances of the same character in the text corpus are similar, then the segmentation variant $\hat{V}_n$ which maximizes the value of $\Gamma(\hat{V}_n)$ in formula (4) corresponds most closely to the actual segmentation of the words into characters. Our aim is therefore to find the complete partitioning variant $\hat{V}_n^*$ which maximizes $\Gamma(\hat{V}_n)$. The similarity function $d(u, v)$ is calculated using ten geometrical features of the hypothetical character image. It is normalized to the standard interval (0, 1) and based on the Euclidean distance in the normalized feature space:

$$d(u, v) = 1 - \sqrt{\frac{\sum_{i=1}^{M} (x_i(u) - x_i(v))^2}{M}} \qquad (5)$$

where $x_i$ is the $i$-th normalized feature and $M$ is the number of features.
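The two quantities defined in formulas (4) and (5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each character sample is represented directly by its vector of features normalized to [0, 1], the square root in (5) is assumed to cover the whole mean squared difference, and letters with fewer than two samples are skipped because the denominator of (4) would be zero.

```python
import math
from itertools import combinations


def similarity(x_u, x_v):
    """d(u, v) of Eq. (5): 1 minus the root-mean-square feature difference."""
    m = len(x_u)
    return 1.0 - math.sqrt(sum((a - b) ** 2 for a, b in zip(x_u, x_v)) / m)


def gamma(samples_by_char):
    """Gamma of Eq. (4): sum over letters of the mean pairwise similarity.

    samples_by_char maps each letter c to the list of feature vectors
    of the hypothetical samples in its set T_n^c.
    """
    total = 0.0
    for samples in samples_by_char.values():
        n = len(samples)
        if n < 2:
            continue  # assumption: a single sample contributes nothing
        pair_sum = sum(similarity(u, v) for u, v in combinations(samples, 2))
        # Eq. (4) sums over ordered pairs u != v, i.e. each unordered pair twice.
        total += 2.0 * pair_sum / (n * n - n)
    return total
```

For example, `gamma({"a": [[0.0, 0.0], [0.0, 0.0]], "b": [[0.0], [1.0]]})` adds a contribution of 1 for the two identical samples of "a" and 0 for the two maximally different samples of "b".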
3 The Proposed Method
3.1 First Stage – Evolutionary Algorithm
The optimization problem described in the previous section is difficult because of the enormous number of possible segmentation variants. An evolutionary algorithm is applied to find a number of suboptimal segmentation variants because of its proven effectiveness for similar combinatorial optimization problems [9]. The chromosome in the evolutionary algorithm corresponds to the complete segmentation variant $\hat{V}_n$. It is represented by the sequence of trailing grapheme indices of consecutive characters in successive words of the corpus $S$:

$$Chr(\hat{V}_n) = (e_{1,1}^{(n)}, e_{1,2}^{(n)}, \ldots, e_{1,l_1}^{(n)}, e_{2,1}^{(n)}, e_{2,2}^{(n)}, \ldots, e_{2,l_2}^{(n)}, \ldots, e_{K,1}^{(n)}, e_{K,2}^{(n)}, \ldots, e_{K,l_K}^{(n)}) \qquad (6)$$

Each index is encoded as an integer. Because the segmentation variants of individual words can be changed independently, it seems reasonable to define the gene as the subsequence $(e_{k,1}^{(n)}, e_{k,2}^{(n)}, \ldots, e_{k,l_k}^{(n)})$ in $Chr(\hat{V}_n)$ corresponding to a single word $w_k$. The initial population is created randomly and then all individuals are evaluated. This requires decoding the chromosome and creating consecutive characters on the basis of the indices included in the evaluated chromosome. All graphemes between two indices (denoted by $b_{k,i}^{(n)}$ and $e_{k,i}^{(n)}$) are merged. They produce a hypothetical character sample $u_k^{(i)}(n)$ from the word. Evaluation of the obtained individual is based on the fitness function. The selection of individuals is carried out with the roulette wheel rule. The elitist model of selection is applied, so the best individual from the previous population always survives. Then, individuals are subjected to the genetic operations of mutation and crossover. A uniform crossover operator is implemented, which means that the crossover operation consists in randomly exchanging (with an assumed probability) the corresponding genes between parents. The mutation operator creates new individuals with an assumed probability. The fitness value of the chromosome corresponding to $\hat{V}_n$ is calculated as $\Gamma(\hat{V}_n)^\alpha$. The value of $\alpha$ was determined experimentally. The fastest convergence to the suboptimal solution was obtained for $\alpha \in (10, 20)$.
3.2 Second Stage - Iterative Refinement
The segmentation variants represented by chromosomes from the last population of the evolutionary algorithm can be further improved by moving towards a local maximum of the fitness function. The solution represented by a chromosome is refined in a step-wise iterative procedure, where the chromosome is repeatedly changed by a series of minimal modifications. In each step only a single element of the chromosome, determining the right boundary position of a single character, is modified. An elementary modification consists in moving the right character boundary one grapheme to the left or to the right (if this does not violate the constraints on the assumed limits of the grapheme count per character). If the fitness function value of the modified chromosome is increased, then the modification is retained; otherwise the unchanged chromosome is processed in the next iterations. A single pass of the method consists in trying the elementary modification successively for those elements of the chromosome which correspond to all but the last characters of the corpus words (the right boundary of the last character in a word is fixed). The pass is repeated until a single cycle no longer results in any increase of the fitness function value. Experiments show that in many cases the highest fitness value is reached by a segmentation variant different from the best one after the first stage. For this reason all chromosomes from the final population of the evolutionary algorithm are passed to the refinement stage. The chromosome which has the highest fitness value after the refinement stage defines the word segmentation for the next processing stage.
3.3 Third Stage - Validation
The segmentation obtained with the refinement procedure may still contain incorrectly extracted characters. These may seriously impair a word recognizer trained with a set containing incorrectly extracted character samples. Segmentation faults resulting from incorrect segmentation into graphemes, where graphemes are shared by adjacent characters, are particularly detrimental, because they cannot be corrected at the first two segmentation stages, which are based on grapheme grouping. At the last stage the most suspicious character samples are rejected from the sample set. The candidates for rejection are selected in a validation procedure, where the character sample set elicited at the refinement stage is used as the training set for a character classifier, applied in turn in a simplified cursive word classifier following the "analytic" handwriting recognition paradigm. The validation procedure starts with the whole sample set. The samples are sorted according to their average similarity to the remaining samples in the set representing the same letter of the alphabet. The most atypical samples are at the beginning of the ranking, while the most typical ones are at the end. In each step of the iterative procedure, a single sample is temporarily removed from the sample set in the order determined by the ranking (the most atypical ones are considered first) and the classifier is trained with the reduced set. If the sample removal results in an improvement of the word recognition accuracy, then the sample is permanently removed; otherwise it is returned to the training set. The samples that survive the validation constitute the final result of the training set acquisition process.
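The validation loop of the third stage can be sketched as a greedy procedure. The Python code below is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, samples are assumed to be distinct identifiers, `typicality` stands for the average similarity of a sample to the other samples of the same letter, and `accuracy_of` stands for the whole "train the classifier on this set and measure word recognition accuracy on the validation part" step, which is not reproduced here.

```python
def validate(samples, typicality, accuracy_of):
    """Greedily reject atypical samples while recognition accuracy improves.

    samples:     list of distinct sample identifiers
    typicality:  callable, sample -> average similarity to the other
                 samples of the same letter (lower = more atypical)
    accuracy_of: callable, list of samples -> word recognition accuracy
                 of a classifier trained on exactly these samples
    """
    kept = list(samples)
    best = accuracy_of(kept)
    # The most atypical samples are tried first.
    for candidate in sorted(samples, key=typicality):
        trial = [s for s in kept if s != candidate]
        accuracy = accuracy_of(trial)
        if accuracy > best:
            # Removal helped: reject the sample permanently.
            kept, best = trial, accuracy
        # Otherwise the sample stays in the training set.
    return kept, best
```

Each candidate costs one full retraining, so the running time is dominated by `accuracy_of`; the ranking only fixes the order in which candidates are tried.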
4 Experiments
The method described in the previous section was tested to evaluate its effectiveness and to assess how the procedure applied at each stage contributes to the accuracy of the word segmentation. The accuracy of the character classifier trained with the automatically segmented character samples was also compared with the accuracy achieved by the classifier trained with character sample images extracted from the same corpus by a human expert. In the experiment, three handwritten corpora were used. Each corpus consisted of the same set of Polish words, each corpus written by a single writer. The writers were selected so as to obtain samples of three handwriting styles differing in accuracy, precision and readability. The styles can be classified as calligraphic (c), fair (f) and sloppy (s). Each corpus was divided into a training part and a validation part. The training part was subjected to the proposed segmentation procedure. The validation part was used at the validation stage. The training part of each corpus consisted of the same set of 116 Polish words comprising 725 appearances of letters of the Latin alphabet. The validation part contained 100 words comprising 630 characters. In the first experiment the segmentations obtained by applying the genetic algorithm (GA) and the genetic algorithm with stepwise refinement (RGA) were compared with the manual segmentation. The accuracy of the automatic segmentation for each letter of the alphabet was calculated as the fraction of the letter appearances extracted consistently with the expert segmentation. The obtained accuracies are presented in Table 1. In the second experiment, the validation stage was tested. The character samples created by the RGA algorithm were used as the training set for a simplified word recognition algorithm. As the result of the validation procedure described in the previous section, some of the samples were rejected from the training set.

Table 1. Accuracy of automatic segmentation; c - calligraphic, f - fair, s - sloppy styles

Character   GA algorithm          RGA algorithm
            s     f     c         s     f     c
a           0.62  0.74  0.92      0.69  0.85  0.94
b           0.59  0.71  0.88      0.65  0.82  1.00
c           0.79  0.91  0.97      0.79  0.94  1.00
d           0.72  0.79  0.93      0.83  0.83  0.97
e           0.70  0.84  0.95      0.77  0.88  0.98
f           0.95  0.95  1.00      0.90  1.00  1.00
g           0.85  1.00  1.00      0.85  1.00  1.00
h           0.55  0.64  0.86      0.68  0.68  0.86
i           0.81  0.87  0.94      0.87  0.87  0.94
j           0.55  0.65  0.85      0.65  0.75  0.90
k           0.77  0.77  0.91      0.80  0.80  0.94
l           0.87  0.94  1.00      0.87  0.94  1.00
m           0.78  0.81  0.93      0.78  0.85  0.93
n           0.66  0.76  0.76      0.69  0.83  0.83
o           0.90  0.85  0.95      0.93  0.93  1.00
p           0.81  0.95  0.90      0.95  0.95  0.90
r           0.63  0.69  0.75      0.67  0.77  0.81
s           0.78  0.89  0.89      0.85  0.93  0.93
t           0.53  0.91  0.97      0.53  0.94  0.97
u           0.68  0.91  0.94      0.74  0.91  0.94
w           0.75  0.82  0.82      0.71  0.86  0.82
z           0.85  0.79  0.85      0.85  0.88  0.94
y           0.89  0.94  1.00      0.94  1.00  1.00
Average     0.73  0.82  0.91      0.77  0.87  0.94

Table 2. Results of isolated character recognition with semi-automatically extracted training sets; c - calligraphic, f - fair and s - sloppy styles, m.seg. - manual segmentation, a.seg. - automatic segmentation

                                                                   s      f      c
Number of rejected samples                                         141    89     39
Number of remaining samples                                        584    636    686
Remaining samples set consistency with the m.seg.                  0.84   0.93   0.95
Character recognition accuracy – train. set acquired with m.seg.   0.84   0.90   0.96
Character recognition accuracy – train. set acquired with a.seg.   0.77   0.86   0.94

The finally selected set of letter images was assessed in two ways: by evaluating its consistency with the manually extracted character images and by determining the character recognition accuracy of the character classifier trained with the final sample set. The latter was compared with the character recognition accuracy achieved with the training set obtained by manual segmentation. Results for the three analyzed handwriting styles are presented in Table 2.
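The per-letter accuracy reported in Table 1, i.e. the fraction of letter appearances extracted consistently with the expert segmentation, can be computed as sketched below. The record layout and the function name are assumptions made for illustration, and "consistent" is interpreted here as the automatic and manual grapheme spans being identical, which the text does not state explicitly.

```python
from collections import defaultdict


def per_letter_accuracy(records):
    """Fraction of appearances of each letter whose automatic span
    equals the manual (expert) span.

    records: iterable of (letter, auto_span, manual_span), where a span
             is a (first_grapheme, last_grapheme) index pair.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for letter, auto_span, manual_span in records:
        totals[letter] += 1
        if auto_span == manual_span:
            hits[letter] += 1
    return {letter: hits[letter] / totals[letter] for letter in totals}


# Toy data: the second "a" is cut one grapheme too short.
example = [
    ("a", (0, 1), (0, 1)),
    ("a", (2, 2), (2, 3)),
    ("b", (4, 5), (4, 5)),
]
accuracies = per_letter_accuracy(example)
```

On this toy input the letter "a" is segmented consistently in one of two appearances and "b" in its only appearance.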
5 Conclusions, Further Works
The aim of the method described in this article is to semi-automatically create the training set for character classifiers used in writer-dependent cursive handwriting recognition. The experiments carried out show that by combining the three proposed techniques (an evolutionary algorithm for preliminary segmentation, stepwise refinement of segmentation variants, and sample set validation by rejection of the worst samples), a high degree of consistency with the segmentation done by a human expert can be achieved. Moreover, the isolated character recognition accuracy of a classifier trained with an automatically acquired training set is relatively close to the recognition accuracy attainable with a training set prepared by manual segmentation of the training corpus. The results obtained automatically are particularly close to the results of manual segmentation for accurate writing styles. In such cases, the proposed method can be considered practically useful. It should be emphasized that the recognition test was carried out on correctly isolated characters. To assess the usability of the described method more reliably, it should be tested in the future in application to the recognition of unsegmented words based on the "analytic" paradigm. Acknowledgement. This work was financed by the Polish Ministry of Science and Higher Education in the years 2005-2007 as research project No 3T11E00528.
References
1. Schomaker, L., Teulings, H.: Unsupervised learning of prototype allographs in cursive script using invariant handwritting features. In: Simon, J.C., S.I. (ed.): From Pixels to Features III, Amsterdam, North-Holland (1992)
2. Mestetskii, L., Reyer, I., Sederberg, T.: Continuous approach to segmentation of handwritten text. In: Eighth International Workshop on Frontiers in Handwriting Recognition, pp. 440–444 (2002)
3. Mackowiak, J., Schomaker, L., Vuurpijl, L.: Semi-automatic determination of allograph duration and position in on-line handwriting words based on the expected number of strokes. In: Progress in Handwriting Recognition, World Scientific, London (1997)
4. Yulyakov, S., Govindaraju, V.: Probabilistic model for segmentation based word recognition with lexicon. In: Proc. of the Sixth International Conference on Document Analysis and Recognition, pp. 164–167. IEEE Press, Orlando, Florida, USA (2001)
5. Sadri, J., Suen, C., Bui, T.D.: A genetic framework using contextual knowledge for segmentation and recognition of handwritten numeral strings. Pattern Recognition 40, 898–919 (2007)
6. Lamprier, S., Amghar, T., Levrat, B., Saubion, F.: Seggen: a genetic algorithm for linear text segmentation. In: Proceedings of IJCAI, pp. 1647–1652 (2007)
7. Connel, S., Jain, A.: Writer adaptation for online handwriting recognition. IEEE Trans. on PAMI 24, 329–346 (2002)
8. Arica, N., Yarman-Vural, F.: Optical character recognition for cursive handwriting. IEEE Trans. on PAMI 24, 801–813 (2002)
9. Knjazew, D.: OmeGA: A Competent Genetic Algorithm for Solving Permutation and Scheduling Problems. Kluwer Academic Publishers, Boston, MA (2002)
Gabor-Based Recognizer for Chinese Handwriting from Segmentation-Free Strategy Tong-Hua Su, Tian-Wen Zhang, De-Jun Guan, and Hu-Jie Huang School of Computer Science and Technology Harbin Institute of Technology, Harbin, 150001, China {tonghuasu,twzhang}@hit.edu.cn
Abstract. A segmentation-free recognizer is presented to transcribe Chinese handwritten documents, incorporating Gabor features and Hidden Markov Models (HMMs). Textlines are first extracted and filtered into Gabor observations by sliding windows. Then the Baum-Welch algorithm is used to train character HMMs. Finally, the best character string under the maximum a posteriori criterion is found by the Viterbi algorithm and output. Experiments are conducted on a collection of Chinese handwriting. The results not only show the feasibility of the segmentation-free strategy, but also demonstrate the advantages of Gabor filters in the transcription of Chinese handwriting. Keywords: Gabor filter, HMM, Chinese handwriting recognition.
1 Introduction
Despite thirty years of research, offline recognition of Chinese handwriting remains one of the most challenging problems. High recognition rates are reported in the literature on high-quality test samples [1]. When real handwriting is fed in, the results deteriorate dramatically [2]. It is therefore advisable to place the emphasis on more complex, realistic handwriting recognition. We approach this task with a segmentation-free strategy, whereas previous recognizers for Chinese handwriting perform character segmentation prior to the recognition stage. A textline is first extracted and converted into an observation sequence by sliding windows (no attempt is made to segment the textline into characters). Then the Baum-Welch algorithm [3] is used to train character Hidden Markov Models (HMMs). Finally, the best character string under the maximum a posteriori criterion is found by the Viterbi algorithm [4] and output. One of the prominent advantages is that the juncture information between adjacent characters can be utilized easily.

Given the complex structure and writing variability of Chinese handwriting, the key issue is to characterize Chinese characters in a discriminative way. Some researchers have claimed that directional stroke features perform excellently in Chinese character recognition [1]. The Gabor filter is one candidate. It has been widely used in pattern recognition, e.g. in face recognition [5], iris recognition [6], and numeral recognition [7], owing to its biological relevance [8] and mathematical properties [9]. More recently, it has been applied to Chinese handwritten character recognition with promising results [10] [11] [12].

In this paper, we explore Gabor features under the segmentation-free framework. After proper tuning of the filter parameters, the recognition rates can be improved greatly. The results not only show the feasibility of the segmentation-free strategy, but also demonstrate the advantages of Gabor filters in the transcription of Chinese handwriting. The next section details the main components of our system, with emphasis on the Gabor feature extraction. Sections 3 and 4 report the experimental results and draw conclusions from this work, respectively.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 539–546, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 System Descriptions
When a textline image is fed into the recognizer, it is converted into a sequence of feature vectors, or observations, O = (o1, ..., om). The recognition task is to identify the string of characters S = (s1, ..., sn) that maximizes the a posteriori probability P(S|O) (the MAP criterion). The recognizer described in this paper consists of three main components: textline extraction, Gabor feature selection by sliding window, and recognition using HMMs. The system architecture is illustrated in Fig. 1.
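The MAP criterion can be written out with Bayes' rule. The factorization below is the standard one for HMM-based recognizers and is given here only as a reading aid; the form of the prior P(S) over character strings is not specified in the text above.

```latex
\hat{S} = \arg\max_{S} P(S \mid O)
        = \arg\max_{S} \frac{P(O \mid S)\, P(S)}{P(O)}
        = \arg\max_{S} P(O \mid S)\, P(S)
```

The last step holds because P(O) does not depend on S. The likelihood P(O|S) is what the concatenated character HMMs supply, and the maximization over S is what the Viterbi search approximates.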
Fig. 1. System architecture (blocks: handwriting database, textline extraction, sliding window based feature selection, B-W algorithm, character HMM, reference database, Viterbi algorithm, character string, performance)
2.1 Textline Extraction
Each piece of handwriting to be recognized consists of about 10 textlines, and each textline includes about 24 characters. To make things easier, we first segment the document into textlines. One straightforward method is horizontal histogram analysis, in which the zero-crossing points are potential textline borders. Unfortunately, in a skewed document adjacent textlines overlap and cannot be successfully handled by horizontal histogram analysis. A skew detection algorithm based on the stroke histogram is therefore adopted, and the de-skewed documents are then analyzed by horizontal histogram [2]. With this technique, all textlines are successfully extracted in our handwritten document database written by a single writer (see Sect. 3.1 for more information on the database). Fig. 2 shows a sample whose textlines are extracted after skew correction, while they cannot be separated by analysis of the histogram profile before skew correction.

Fig. 2. All lines are extracted after skew correction

2.2 Gabor Feature Selection
Once the textlines are extracted, we parameterize them using a sliding window (see Fig. 3(a)) moving from left to right. We purposely omit the noise removal and textline normalization stages in order to observe the recognizer's robustness. Generally, the height of the sliding window is the same as that of the textline. The other two parameters, the width W and the shift step S, must be assigned by the researcher or determined through experiments. In our system we set W = 12 pixels (one-fifth of the average character width), and S is determined by experiments. In addition, we partition the window into 3 zones, as shown in Fig. 3(b). The body zone, rather than the whole window, is used to extract the feature vector in order to resist the undulation of textlines. The body zone is delimited vertically by the topmost and bottommost foreground pixels.
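The window parameterization described above can be sketched as follows. This is a minimal Python illustration, not the implementation used in the experiments: the function name is hypothetical, the textline is assumed to be a binary image stored as a list of pixel rows (1 = foreground), and the default shift step of 4 pixels is an arbitrary placeholder, since S is determined experimentally.

```python
def body_zones(line, width=12, step=4):
    """Slide a window over a binary textline and return each body zone.

    line:  list of rows, each a list of 0/1 pixels (1 = foreground)
    width: window width W in pixels
    step:  shift step S in pixels (placeholder value)
    Returns one body zone per window: the window rows between the
    topmost and bottommost foreground pixels, or [] for a blank window.
    """
    length = len(line[0])
    zones = []
    for left in range(0, length - width + 1, step):
        window = [row[left:left + width] for row in line]
        inked = [r for r, row in enumerate(window) if any(row)]
        if inked:
            # Drop the upper and lower blank zones.
            zones.append(window[inked[0]:inked[-1] + 1])
        else:
            zones.append([])
    return zones
```

Because the body zone is recomputed for every window, its height follows the local vertical extent of the writing, which is what makes the features tolerant to undulating textlines.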
Fig. 3. Feature selection process using sliding window: (a) the running of the sliding window with shift step S, producing the observations ..., oi, oi+1, ...; (b) zones used to extract features (upper blank zone, body zone, lower blank zone)
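The sliding-window parameterization and the body zone can be illustrated with a minimal sketch. It assumes a binary textline stored as a list of rows (1 = foreground); the function names and the step S = 4 are hypothetical choices for illustration, since the text determines S experimentally.

```python
def body_zone(window):
    """Return the rows between the topmost and bottommost foreground pixels."""
    rows = [r for r, row in enumerate(window) if any(row)]
    if not rows:
        return []          # empty window: no foreground at all
    return window[rows[0]:rows[-1] + 1]


def sliding_windows(textline, width=12, step=4):
    """Yield (start_column, window) pairs from left to right.

    width=12 follows the text; step=4 is an illustrative assumption.
    """
    n_cols = len(textline[0])
    for start in range(0, max(n_cols - width, 0) + 1, step):
        yield start, [row[start:start + width] for row in textline]


# Toy textline: 6 rows x 20 columns, 1 = foreground (ink), 0 = background.
textline = [[0] * 20 for _ in range(6)]
for c in range(3, 17):
    textline[2][c] = 1
    textline[3][c] = 1

zones = [(start, body_zone(win)) for start, win in sliding_windows(textline)]
```

Each entry of `zones` holds the start column of a window and its body zone, from which a feature vector would then be extracted.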
We present cell features as the baseline system and Gabor-based features as the enhanced one. The extraction of cell features is very simple: the body zone of the window is divided into 8 × 2 cells, and the foreground pixels are summed in each cell.

Extracting the Gabor features is somewhat more involved. First, we determine the parameters of the Gabor filters. Mathematically, a Gabor filter describes a plane wave modulated by a Gaussian envelope function; its 2D form has been shown to resemble the receptive fields of simple cells in the visual cortex [8]. Among various wavelet bases, the Gabor filter provides the optimal resolution in both the spatial and frequency domains [13], and exhibits desirable characteristics of spatial locality and orientation selectivity. The 2D Gabor filters are defined as

$$\psi_{\mu,\nu}(\zeta) = \frac{\|\kappa_{\mu,\nu}\|^2}{\sigma^2} \, e^{-\frac{\|\kappa_{\mu,\nu}\|^2 \|\zeta\|^2}{2\sigma^2}} \left[ e^{i\zeta\kappa_{\mu,\nu}} - e^{-\frac{\sigma^2}{2}} \right], \qquad (1)$$

where $\mu$ and $\nu$ are the indices of orientation and scale of the Gabor filters, $\zeta = x + iy$, and $\kappa_{\mu,\nu}$ is defined in half-octave space as

$$\kappa_{\mu,\nu} = 2^{-\frac{\nu+2}{2}} \pi e^{i\varphi_\mu}, \qquad (2)$$

where $\varphi_\mu$ is the $\mu$-th orientation. We use the following parameters: $\mu \in \{0, 1, 2, 3\}$, $\nu = 3$, $\sigma = 2\pi$. The values of $\mu$ and $\nu$ are chosen from the statistical information of Chinese character structure [10].

Second, we filter the handwritten textline. The filtered image at the $i$-th step is computed by convolving the body zone in the $i$-th sliding window (denoted $B_i(\zeta)$) with the Gabor filters:

$$J^i_{\mu,\nu}(\zeta) = B_i(\zeta) * \psi_{\mu,\nu}(\zeta). \qquad (3)$$
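As an illustration of Eqs. (1) to (3), the following sketch samples the Gabor filter on a small grid and convolves it with a body zone. Several choices are assumptions rather than part of the text: the orientations are taken as multiples of π/4, the product in the exponent is read as the 2D dot product of the wave vector and the position, and the kernel half-size of 4 pixels is arbitrary.

```python
import cmath
import math

SIGMA = 2 * math.pi   # sigma from the text
NU = 3                # scale index from the text


def kappa(mu, nu=NU, n_orient=4):
    """Wave vector of Eq. (2); phi_mu = mu * pi / n_orient is an assumption."""
    phi = mu * math.pi / n_orient
    return 2 ** (-(nu + 2) / 2) * math.pi * cmath.exp(1j * phi)


def gabor(x, y, mu, nu=NU, sigma=SIGMA):
    """Value of the Gabor filter of Eq. (1) at the point zeta = x + iy."""
    k = kappa(mu, nu)
    k2 = abs(k) ** 2
    envelope = (k2 / sigma ** 2) * math.exp(-k2 * (x * x + y * y) / (2 * sigma ** 2))
    # The product zeta * kappa is read as a 2-D dot product (assumption).
    carrier = cmath.exp(1j * (k.real * x + k.imag * y))
    return envelope * (carrier - math.exp(-sigma ** 2 / 2))


def gabor_kernel(mu, half=4):
    """Sample the filter on a (2*half+1) x (2*half+1) grid (half is arbitrary)."""
    return [[gabor(x, y, mu) for x in range(-half, half + 1)]
            for y in range(-half, half + 1)]


def filter_zone(zone, kernel):
    """Same-size 2-D convolution of Eq. (3) with zero padding."""
    h, w = len(zone), len(zone[0])
    half = len(kernel) // 2
    out = [[0j] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0j
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    rr, cc = r - dy, c - dx
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += zone[rr][cc] * kernel[dy + half][dx + half]
            out[r][c] = acc
    return out


kernel = gabor_kernel(mu=0)
impulse = [[0] * 9 for _ in range(9)]
impulse[4][4] = 1
response = filter_zone(impulse, kernel)   # an impulse reproduces the kernel
```

Filtering a single foreground pixel returns the kernel itself, which is a quick sanity check of the convolution.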
Third, we generate the feature vectors. Gabor feature extraction methods generally fall into two categories: direct sampling (referred to as GS below) and local smoothing (also known as histogram, denoted GH hereafter). The former divides the filtered body zone $J^i_{\mu,\nu}(\zeta)$ into 8 × 2 cells, and the center point of each cell is selected and combined to construct the $i$-th feature vector $o_i$. In the latter, $J^i_{\mu,\nu}(\zeta)$ is partitioned into 4 × 2 cells and two quantities are produced in each cell, which are concatenated into $o_i$: [...]

[...] 10*ha) then the processed CC was assumed to be a frame/window. Since lines, frames and noise are possible sources of error, they are discarded in the further steps of the evaluation. The remaining components were framed, and these frames were filled to obtain possible text areas. Fig. 1c shows the pre-energy map of the sample, and Fig. 1d shows the energy map.

2.3 Character Components Interval Tracing
Text characters in words and sentences form a textural structure through the distances (pitches) between neighboring characters, which lie within a certain interval. In our model, we apply a novel component interval tracing process to exploit this property of text characters. Fig. 1f shows the text obtained by applying character tracing to Fig. 1e, the binarized document image masked with the energy map. The document image, binarized by Otsu’s method, was masked with the energy map to obtain probable text characters at the start of the character interval tracing method. Connected component analysis was performed on this masked image and
Text Area Detection in Digital Documents Images
Fig. 1. a) Document image sample from the database, b) Normalized Gabor filtering result of the sample image, c) Pre-energy map of the sample, d) Energy map of the sample, e) Binarized document image masked with the energy map, f) Text areas extracted by the proposed method
text characters were found, but some non-text components were still labeled as probable text characters. In the character component interval tracing process, each component is classified as text or non-text. As stated previously, the search regions for the component’s left and right neighbors are determined first, as shown in Fig. 2. In Fig. 2, a text character “A” is given, where h is the height of this character component and Xa, Yb are its coordinate values in the x-y plane. The dotted square areas in Fig. 2 are the search regions for this component’s neighbors. Then the following rules are applied: a) if there is no component in the search regions of the processed component, then the processed component is labeled as negative; b) if there is a component in the search regions of the processed
I. Ar and M.E. Karsligil
Fig. 2. The search areas of probable text component’s neighbors
component and the neighbor component does not satisfy the height ratio given in (7), then the processed component is labeled as negative; c) if there is a component in the search regions of the processed component and the neighbor component satisfies the height ratio in (7), then the processed component and the neighbor component are labeled as positive. At the end, negatively labeled components are erased from the masked image, and positively labeled components are accepted as text characters. The height ratio rule is given in (7), where $N_h$ is the height of the neighbor component and $P_h$ is the height of the processed component:

$$2 N_h \ge P_h \ge \frac{N_h}{2}. \qquad (7)$$

This height ratio checks whether the neighbor component is similar in size to the processed component.
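Rules a) to c) and the height ratio of (7) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: components are given as bounding boxes (x, y, w, h), and the search region is taken to be a square of side h directly to the left or right of the component, which is only a guess at the regions drawn in Fig. 2. A component with at least one similar neighbor is labeled positive here, a tie-breaking choice the text does not specify.

```python
def similar_height(neighbor_h, processed_h):
    """Height-ratio rule of (7): N_h / 2 <= P_h <= 2 * N_h."""
    return neighbor_h / 2 <= processed_h <= 2 * neighbor_h


def in_search_region(comp, other):
    """Hypothetical search region: a square of side h to the left or right
    of `comp`, overlapping it vertically."""
    x, y, w, h = comp
    ox, oy, ow, oh = other
    vertical_overlap = oy < y + h and y < oy + oh
    right = x + w <= ox < x + w + h
    left = x - h < ox + ow <= x
    return vertical_overlap and (left or right)


def trace(components):
    """Label each (x, y, w, h) bounding box as text (True) or not (False)."""
    labels = [False] * len(components)      # rules a) and b): negative
    for i, comp in enumerate(components):
        for j, other in enumerate(components):
            if i == j or not in_search_region(comp, other):
                continue
            if similar_height(other[3], comp[3]):
                labels[i] = labels[j] = True   # rule c): both positive
    return labels


# Three characters on a line and one isolated, tall non-text component.
boxes = [(0, 0, 10, 12), (14, 0, 10, 14), (28, 1, 9, 11), (100, 100, 5, 50)]
labels = trace(boxes)
```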
Fig. 3. Character component interval tracing result on sample image
Fig. 3 shows the result of character component interval tracing on a part of a document image. The masked image in Fig. 3a contains some non-text components among the probable text components. After the character component interval tracing process, these non-text components are eliminated on the basis of neighborhood and height ratio.
3 Experimental Results
We tested our method on a document database that we generated, which contains 40 different kinds of document images taken from books, magazines and newspapers. The results were quite promising for documents having:
Fig. 4. a) A document image taken from Media Team Database, b) Part of this sample, c) Energy map of part, d) Result of proposed method (text areas of part)
Fig. 5. a) A part of Russian document image, b) Result of masking energy map with black and white document image, c) Result of proposed method
pictures, non-vertical text, character sequences in different languages, and non-rectangular layouts. The proposed method sometimes fails when the document contains complex shapes such as caricatures, mathematical equations, musical notes or pictorial backgrounds. Figure 4 shows a document composed of text areas and pictures, together with a part of it. In the energy map of this part, most of the components of the image region and the non-text components are eliminated, and character component interval tracing eliminates the remaining non-text components. Fig. 5 shows a part of a document image that contains Russian text characters. Nearly all of the elements in the image region are eliminated by masking with the energy map, but some non-text components remain. After character component interval tracing, which is the novelty of this work, only one non-text component remains.
4 Conclusion
The proposed method can be used for document images with 256 grey levels written in different languages. It does not impose restrictions such as requiring a rectangular layout or training the system for a known layout or
font types. The main advantage of the proposed system is that it detects text areas by determining textural features in two stages. First, candidate text region components are extracted by applying a Gabor filter. Then the textures of the remaining components are examined to see whether they are part of a word. This second stage improves the accuracy with which text regions are detected.
References
1. Simon, A., Pret, J.-C., Johnson, A.P.: A Fast Algorithm for Bottom-Up Document Layout Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(3), 273–277 (1997)
2. Ha, J., Haralick, R., Phillips, I.: Document Page Decomposition by the Bounding-Box Projection Technique. In: Proc. Third Int’l Conf. Document Analysis and Recognition, Montreal, pp. 1119–1122 (1995)
3. Jain, A.K., Yu, B.: Document representation and its application to page decomposition. IEEE Trans. Pattern Analysis and Machine Intelligence 20, 294–308 (1998)
4. Chen, J.-L.: A simplified approach to the HMM based texture analysis and its application to document segmentation. Pattern Recognition Letters 18(10), 993–1007 (1997)
5. Yuan, Q., Tan, C.L.: Page segmentation and text extraction from gray-scale images in microfilm. In: SPIE Document Recognition and Retrieval VIII, pp. 323–332 (2001)
6. Raju, S.S., Pati, P.B., Ramakrishnan, A.G.: Gabor Filter Based Block Energy Analysis for Text Extraction from Digital Document Images. In: Proceedings First International Workshop on Document Image Analysis for Libraries (DIAL 2004), pp. 233–243 (2004)
7. Pati, P.B., Raju, S., Pati, N., Ramakrishnan, A.G.: Gabor filters for document analysis in Indian Bilingual Documents. In: Proc. of International Conference on Intelligent Sensing and Information Processing 2004, Chennai, India, pp. 123–126 (2004)
8. http://www.mediateam.oulu.fi/downloads/MTDB/ (2007)
9. Gonzalez, R.C., Woods, R.E.: Digital Image Processing, 2nd edn. Prentice-Hall, Englewood Cliffs, NJ (2002)
10. Young, I., van Vliet, L., van Ginkel, M.: Recursive Gabor filtering. IEEE Trans. Sig. Proc. 50(11), 2799–2805 (2002)
11. Lee, T.S.: Image representation using 2D Gabor wavelets. IEEE Trans. Pattern Anal. Machine Intell. 18, 959–971 (1996)
12. Otsu, N.: A Threshold Selection Method From Gray Level Histograms. IEEE Trans. Syst., Man, Cybern. SMC-9, 62–66 (1979)
Optimal Threshold Selection for Tomogram Segmentation by Reprojection of the Reconstructed Image

K. Joost Batenburg and Jan Sijbers

University of Antwerp, Vision Lab, Universiteitsplein 1, B-2610 Wilrijk, Belgium
Abstract. Grey value thresholding is a segmentation technique commonly applied to tomographic image reconstructions. Many procedures have been proposed to optimally select the grey value thresholds based on the image histogram. In this paper, a new method is presented that uses the tomographic projection data to determine optimal thresholds. The experimental results for phantom images show that our method obtains superior results compared to established histogram-based methods.
1 Introduction
Segmentation of tomographically reconstructed images (also called tomograms) is a well known problem in computer vision. It refers to the classification of image pixels into distinct classes that are characterized by a discrete set of grey values. Amongst all segmentation techniques, image thresholding is the simplest, yet often most effective segmentation method. Many algorithms have been proposed for selecting “optimal” thresholds with respect to various optimality measures [1]. Thresholds are typically selected from the histogram of the tomogram, such that the distance between the tomogram and the segmented image is minimized. To the authors’ knowledge, current segmentation techniques for tomographic images are applied to tomograms only and do not exploit the available projection data but are rather based on the image histogram [2,3,4]. Specifically, the tomogram histogram is often used as a basis for determining global thresholds by, for example, fitting multiple Gaussian distributions to the histogram [5] or applying a k-means clustering algorithm to it [6]. More recent thresholding techniques are based on a minimum variance criterion [7] or employ the variance and intensity contrast [8].

In practice, however, the image histogram often lacks clear modes that would allow an intuitive selection of threshold values. Indeed, tomograms may be polluted by artifacts that tend to smear out the image histogram. Typical artifacts are streaking artifacts caused by highly absorbing object parts, blurring caused by object motion during scanning (or equivalently caused by small shifts of the detector during acquisition), bias fields, or artifacts caused by a limited field of view and/or a missing wedge. Hence, although threshold selection based on the image histogram can be made fully automatic, it lacks robustness in case clear grey value modes are absent.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 563–570, 2007.
© Springer-Verlag Berlin Heidelberg 2007
This paper presents a new approach to segmentation, which uses the information in the projection data instead of the histogram. Forward projections of the tomogram are computed and compared to the measured projection data. In the search for optimal thresholds, many computationally expensive forward projections must be performed. An important contribution of the current paper is the efficient implementation of the forward projection method, which makes using the original projection data as a segmentation criterion feasible. The proposed method does not suffer from subjectiveness of the user. Our simulation experiments demonstrate that our method is robust and clearly outperforms established histogram-based methods.
2 Grey Level Estimation
We restrict ourselves to the segmentation of a 2-dimensional image, which is represented on a rectangular grid of width $w$ and height $h$. Hence, the total number of pixels is given by $n = wh$. The grey-scale image $x \in \mathbb{R}^n$ that we want to segment is a tomographic reconstruction of some physical object, of which projections were acquired using a tomographic scanner. Our method can be used for any scanning geometry, e.g., parallel beam, fan beam and, in 3D, cone beam. Projections are measured as sets of detector values for various angles, rotating around the object. Let $m$ denote the total number of measured detector values (for all angles) and let $p \in \mathbb{R}^m$ denote the measured data. The physical projection process in tomography can be modeled as a linear operator $W$ that maps the image $x$ (representing the object) to the vector $p$ of measured data:

$$W x = p. \qquad (1)$$
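The linear model of Eq. (1) can be illustrated with a toy projection operator. The sketch below is an assumption made for illustration only: it uses just two parallel-beam angles (row sums and column sums) with unit weights, and stores each row of W as a dictionary of its nonzero entries; the function names are hypothetical.

```python
def projection_matrix(width, height):
    """Sparse W for two parallel-beam angles (0 and 90 degrees).

    Row i of W is stored as a dict {pixel_index: weight}. Detector values
    0..height-1 are row sums; the remaining `width` values are column sums.
    """
    rows = []
    for r in range(height):
        rows.append({r * width + c: 1.0 for c in range(width)})
    for c in range(width):
        rows.append({r * width + c: 1.0 for r in range(height)})
    return rows


def forward_project(W, x):
    """Compute p = W x using only the nonzero entries of W."""
    return [sum(w * x[j] for j, w in row.items()) for row in W]


W = projection_matrix(3, 2)           # n = 6 pixels, m = 5 detector values
x = [1.0, 0.0, 2.0,
     0.0, 3.0, 0.0]
p = forward_project(W, x)             # row sums followed by column sums
```

In this toy geometry every pixel contributes to exactly two detector values, so only a small fraction of the entries of W is nonzero.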
For parallel projection data, the operator $W$ is a discretized version of the well-known Radon transform. We represent $W$ by an $m \times n$ matrix $(w_{ij})$. Note that for each projection angle, every pixel $j$ will only project onto a few detector pixels, so the matrix $W$ is very sparse. Exploiting this sparsity forms an essential part of our method. The matrix representation of the projection operator is commonly used in algebraic reconstruction algorithms. We refer to Chapter 7 of [9] for details. From this point on, we assume that an image $x$ has been computed that approximately satisfies Eq. (1). Our approach does not depend on the reconstruction algorithm that was used to compute $x$, e.g., Filtered Backprojection or ART. The image $x$ now has to be segmented using global thresholding of the grey levels. The main motivation for using thresholding in general is that pixels representing the same “material” in the scanned object should have approximately the same grey values in the tomogram. We rely strongly on the assumption that the scanned object consists of only a few different materials, which is true for all segmentation methods based on thresholding. Our segmentation approach first assigns a real-valued grey level to each of the segmentation classes. Using these grey levels, the projections of the segmented image are then computed. The computed forward projections are compared to the measured
projection data, which provides a measure for the quality of the segmentation (along with the chosen grey levels). This quality measure can also be used for segmentation techniques other than thresholding.

We first consider the problem of determining grey levels for each of the segmentation classes of a segmented image. A segmentation of an image into $\ell$ classes can be considered as a partition of the set of pixels, consisting of $\ell$ subsets. Let $S = \{S_1, \ldots, S_\ell\}$ be a partition of $\{1, \ldots, n\}$. We label each set by its index $t$: $S_t$. Each pixel $j$ is contained in exactly one set $S_t \in S$, whose index is denoted by $s(j) \in \{1, \ldots, \ell\}$. To each set $S_t$, a grey level $\rho_t \in \mathbb{R}$ is assigned, which induces an assignment of grey levels to the pixels $1 \le j \le n$, where pixel $j$ is assigned the grey level $\rho_{s(j)}$. Define $r_S(\rho) = (\rho_{s(1)} \ldots \rho_{s(n)})^T$. The vector $r_S(\rho)$ contains, for each pixel $j$, the corresponding grey level of that pixel. Our goal is to determine “optimal” grey values $\rho$ for the given partition $S$. The quality of a vector $\rho$ is determined by computing the projections of the segmented image, using the grey levels from $\rho$, and comparing the computed projections to the measured projections $p$. More formally, we define the problem of finding optimal grey values for a given partition as follows:

Problem 1. Let $W \in \mathbb{R}^{m \times n}$ be a given projection matrix, let $S = \{S_1, \ldots, S_\ell\}$ be a partition of $\{1, \ldots, n\}$ and let $p \in \mathbb{R}^m$ be a vector of measured projection data. Find $\rho \in \mathbb{R}^\ell$ such that $|W r_S(\rho) - p|^2$ is minimal.

We will start by deriving the equations for solving Problem 1 for a fixed partition $S$. Subsequently, we describe how the optimal set of grey values can be recomputed efficiently each time the partition $S$ has been modified by moving a single pixel from one class to another class. This fast update computation allows us to design efficient algorithms that combine the search for a segmentation $S$ and the corresponding grey values $\rho$, such that the projection distance is minimal.
Define $A = (a_{it}) \in \mathbb{R}^{m \times \ell}$ by

$$a_{it} = \sum_{j:\, s(j) = t} w_{ij}. \qquad (2)$$

The value $a_{it}$ equals the total area of pixels from the set $S_t$ that contribute to detector value $i$. We denote the $i$-th row of $A$, written as a column vector, by $a_i$. Clearly, we have

$$[W r_S(\rho)]_i = \sum_{t=1}^{\ell} a_{it} \rho_t. \qquad (3)$$

Define the projection difference $d \in \mathbb{R}^m$ by $d = W r_S(\rho) - p = A\rho - p$. Put $c_i = -2 p_i a_i$ and $Q_i = a_i a_i^T$. Define the squared total projection difference by

$$|d|^2 = |A\rho - p|^2 = \sum_{i=1}^{m} (a_i^T \rho - p_i)^2 = \sum_{i=1}^{m} (c_i^T \rho + \rho^T Q_i \rho + p_i^2).$$

Define $\bar{c} = \sum_{i=1}^{m} c_i$ and $\bar{Q} = \sum_{i=1}^{m} Q_i$. Thus

$$|d|^2 = |p|^2 + \bar{c}^T \rho + \rho^T \bar{Q} \rho. \qquad (4)$$
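Equations (2) to (4) can be checked numerically on a toy example. The following sketch assumes a sparse row representation of W (one dictionary per detector value) and 0-based class indices, both of which are implementation choices for illustration; it verifies that the quadratic form of Eq. (4) agrees with the directly computed squared difference.

```python
def quadratic_terms(W, s, p, n_classes):
    """Return (A, c_bar, Q_bar) of Eqs. (2) and (4).

    W: list of sparse rows {pixel: weight}; s[j]: 0-based class of pixel j.
    """
    A = []
    c_bar = [0.0] * n_classes
    Q_bar = [[0.0] * n_classes for _ in range(n_classes)]
    for row, p_i in zip(W, p):
        a_i = [0.0] * n_classes
        for j, w in row.items():
            a_i[s[j]] += w                      # Eq. (2)
        A.append(a_i)
        for t in range(n_classes):
            c_bar[t] += -2.0 * p_i * a_i[t]     # c_i = -2 p_i a_i
            for u in range(n_classes):
                Q_bar[t][u] += a_i[t] * a_i[u]  # Q_i = a_i a_i^T
    return A, c_bar, Q_bar


def squared_difference(c_bar, Q_bar, p, rho):
    """|d|^2 = |p|^2 + c_bar^T rho + rho^T Q_bar rho, Eq. (4)."""
    k = len(rho)
    return (sum(v * v for v in p)
            + sum(c_bar[t] * rho[t] for t in range(k))
            + sum(rho[t] * Q_bar[t][u] * rho[u]
                  for t in range(k) for u in range(k)))


# Toy data: a 2 x 2 image, projected along its rows and columns.
W = [{0: 1.0, 1: 1.0}, {2: 1.0, 3: 1.0}, {0: 1.0, 2: 1.0}, {1: 1.0, 3: 1.0}]
s = [0, 0, 1, 1]                 # top row in class 0, bottom row in class 1
p = [1.0, 3.0, 2.0, 2.0]
rho = [0.5, 1.0]
A, c_bar, Q_bar = quadratic_terms(W, s, p, 2)
direct = sum((sum(a * r for a, r in zip(a_i, rho)) - p_i) ** 2
             for a_i, p_i in zip(A, p))
via_eq4 = squared_difference(c_bar, Q_bar, p, rho)
```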
Note that each of the terms $c_i$ and $Q_i$ only depends on $a_i$ and $p_i$, not on $\rho$. A vector $\rho$ that minimizes the projection difference $|d|$ can now be computed by setting the derivatives of $|d|^2$ with respect to $\rho_1, \ldots, \rho_\ell$ to 0, obtaining the system $2\bar{Q}\rho = -\bar{c}$, and solving for $\rho$.

So far, we have assumed that the partition $S$ was fixed. Suppose that we have computed $\bar{c}$ and $\bar{Q}$ for the partition $S$. We now change $S$ into a new partition $S'$, by changing $s(j)$ for a single pixel $j$. The only rows of $A$ that are affected by this transition are the rows $i$ for which $w_{ij} \neq 0$. This means that the new vector $\bar{c}'$ and matrix $\bar{Q}'$ can be computed by the following updates:

$$\bar{c}' = \bar{c} + \sum_{i:\, w_{ij} \neq 0} (c_i' - c_i) \qquad (5)$$

and

$$\bar{Q}' = \bar{Q} + \sum_{i:\, w_{ij} \neq 0} (Q_i' - Q_i). \qquad (6)$$

The fact that $\bar{c}'$ and $\bar{Q}'$ can be computed by applying updates for only a few of the terms $c_i$ and $Q_i$, respectively, means that the optimal grey values for the entire image can be recomputed efficiently. This property allows us to develop algorithms for simultaneous segmentation and grey level estimation. Any segmentation algorithm that moves pixels from one class into another class, one at a time, can efficiently keep track of the optimal grey levels. In the next section we demonstrate this approach for a simple thresholding algorithm, finding both the optimal thresholds and grey values in a single pass.
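The single-pixel update of Eqs. (5) and (6) can be sketched as follows. The data layout (sparse rows as dictionaries, 0-based classes) and the function names are assumptions for illustration. For brevity the sketch scans all rows to find those with a nonzero weight for pixel j; a column index of W would avoid that scan.

```python
def add_terms(c_bar, Q_bar, a_i, p_i, sign):
    """Add (sign=+1) or remove (sign=-1) c_i = -2 p_i a_i and Q_i = a_i a_i^T."""
    for t, a_t in enumerate(a_i):
        c_bar[t] += sign * -2.0 * p_i * a_t
        for u, a_u in enumerate(a_i):
            Q_bar[t][u] += sign * a_t * a_u


def build_state(W, s, p, n_classes):
    """Return (A, c_bar, Q_bar) for sparse rows W and class map s."""
    A = []
    c_bar = [0.0] * n_classes
    Q_bar = [[0.0] * n_classes for _ in range(n_classes)]
    for row, p_i in zip(W, p):
        a_i = [0.0] * n_classes
        for j, w in row.items():
            a_i[s[j]] += w
        A.append(a_i)
        add_terms(c_bar, Q_bar, a_i, p_i, +1.0)
    return A, c_bar, Q_bar


def move_pixel(W, p, s, A, c_bar, Q_bar, j, new_class):
    """Apply Eqs. (5) and (6) in place when pixel j changes class."""
    old_class = s[j]
    for i, row in enumerate(W):
        w = row.get(j, 0.0)
        if w == 0.0:
            continue                                 # row i is unaffected
        add_terms(c_bar, Q_bar, A[i], p[i], -1.0)    # remove old c_i, Q_i
        A[i][old_class] -= w
        A[i][new_class] += w
        add_terms(c_bar, Q_bar, A[i], p[i], +1.0)    # add new c_i', Q_i'
    s[j] = new_class


W = [{0: 1.0, 1: 1.0}, {2: 1.0, 3: 1.0}, {0: 1.0, 2: 1.0}, {1: 1.0, 3: 1.0}]
p = [1.0, 3.0, 2.0, 2.0]
s = [0, 0, 1, 1]
A, c_bar, Q_bar = build_state(W, s, p, 2)
move_pixel(W, p, s, A, c_bar, Q_bar, j=1, new_class=1)
reference = build_state(W, [0, 1, 1, 1], p, 2)       # full recomputation
```

After the move, the incrementally updated quantities coincide with those obtained by recomputing everything for the new partition.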
3 Thresholding with Simultaneous Grey Level Estimation
The concepts in the previous section apply to any partition $S$ of the pixels. We will now restrict ourselves to a specific type of partition, namely those induced by a thresholding scheme. Our starting point is a grey level image $x \in \mathbb{R}^n$ that has been computed by any continuous tomographic reconstruction algorithm, e.g., Filtered Back Projection. The image $x$ now needs to be segmented by means of thresholding, using a fixed number $\ell$ of classes for the pixels. Every pixel $i$ is assigned a class according to a thresholding scheme using thresholds $\tau_1 < \tau_2 < \ldots < \tau_{\ell-1}$. Put $\tau = (\tau_1 \ldots \tau_{\ell-1})^T$. Define the threshold function by

$$s(i, \tau) = \begin{cases} 1 & (x_i < \tau_1) \\ 2 & (\tau_1 \le x_i < \tau_2) \\ \;\vdots \\ \ell & (\tau_{\ell-1} \le x_i) \end{cases} \qquad (7)$$

The threshold function induces a partition $S_\tau$ of the set $\{1, \ldots, n\}$. Define $r(\tau, \rho) = (\rho_{s(1,\tau)} \ldots \rho_{s(n,\tau)})^T$.
Similar to Section 2, we define the quality of a set of thresholds $\tau$ and grey values $\rho$ as the total squared projection distance for the corresponding segmentation. We now consider the problem of simultaneously finding thresholds and grey values, such that the total squared projection difference is minimal:

Problem 2. Let $W \in \mathbb{R}^{m \times n}$ be a given projection matrix, let $x \in \mathbb{R}^n$ be a grey scale image and let $p \in \mathbb{R}^m$ be a vector of measured projection data. Find $\tau \in \mathbb{R}^{\ell-1}$ with $\tau_1 < \ldots < \tau_{\ell-1}$ and $\rho \in \mathbb{R}^\ell$, such that $|W r(\tau, \rho) - p|^2$ is minimal.

The simplest case of Problem 2 occurs when the image $x$ is segmented into two classes, using a single threshold $\tau$. In that case, an exhaustive search over all possible thresholds can be performed efficiently: first, a list of all pixels is computed, sorted in ascending order of grey level in the continuous reconstruction $x$. The start segmentation is formed by setting the threshold $\tau$ at infinity, so all pixels will be in the segmentation class $S_1$. The threshold $\tau$ is now gradually decreased, each time moving pixels from $S_1$ to $S_2$. By using the update operation from Section 2, it is possible to keep track of the optimal grey values and the projection difference by applying only small update steps. Figure 1 shows the steps of the segmentation algorithm. For each pixel $j$, the time complexity of the update computation is $O(u(j)\ell^2)$, where $u(j) = \#\{i : w_{ij} \neq 0\}$. Each pixel is moved from $S_1$ to $S_2$ only once. Therefore, the total time complexity of the algorithm is $O(U\ell^2)$, where $U$ denotes the total number of nonzero elements in the projection matrix $W$.

    Make a list L containing all elements j ∈ {1, ..., n}, sorted in ascending order of x_j;
    τ := ∞; S_1 := {1, ..., n}; S_2 := ∅; d_opt := ∞;
    for i = 1, ..., m: a_i1 := Σ_{j=1..n} w_ij; a_i2 := 0; compute c_i and Q_i;
    compute c̄ and Q̄;
    k := n;
    while k ≥ 1 do
    begin
        τ := x_L(k);
        while (k ≥ 1) and (x_L(k) = τ) do
        begin
            j := L(k); S_1 := S_1 − {j}; S_2 := S_2 ∪ {j};
            for each i such that w_ij ≠ 0: update c_i, Q_i, c̄ and Q̄;
            k := k − 1;
        end
        compute the minimizer ρ of |d|² = |p|² + c̄ᵀρ + ρᵀQ̄ρ;
        if (|d|² < d_opt) then d_opt := |d|²; ρ_opt := ρ; τ_opt := τ;
    end

Fig. 1. Basic steps of the algorithm for solving Problem 2 in the case ℓ = 2
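A runnable sketch of the two-class search of Fig. 1 is given below. It is an illustration under stated assumptions rather than the authors' code: W is stored as sparse rows (dictionaries), the 2 × 2 system 2Q̄ρ = −c̄ is solved by Cramer's rule, and thresholds that leave one class empty (singular system) are skipped, a case the figure does not discuss.

```python
def two_class_threshold(W, x, p):
    """Exhaustive search for one threshold tau and grey levels (rho1, rho2).

    W: sparse rows {pixel: weight}; x: reconstructed image; p: measured data.
    Pixels with x[j] >= tau belong to class 2, the others to class 1.
    Returns (squared projection difference, tau, (rho1, rho2)).
    """
    m, n = len(W), len(x)
    cols = [[] for _ in range(n)]                # column index of W
    for i, row in enumerate(W):
        for j, w in row.items():
            cols[j].append((i, w))

    a = [[sum(row.values()), 0.0] for row in W]  # all pixels start in class 1
    c = [0.0, 0.0]
    Q = [[0.0, 0.0], [0.0, 0.0]]

    def add(i, sign):
        """Add or remove the terms c_i and Q_i of detector value i."""
        for t in range(2):
            c[t] += sign * -2.0 * p[i] * a[i][t]
            for u in range(2):
                Q[t][u] += sign * a[i][t] * a[i][u]

    for i in range(m):
        add(i, +1.0)

    p2 = sum(v * v for v in p)
    order = sorted(range(n), key=lambda j: x[j])  # ascending grey level
    best = None
    k = n - 1
    while k >= 0:
        tau = x[order[k]]
        while k >= 0 and x[order[k]] == tau:      # move all pixels equal to tau
            j = order[k]
            for i, w in cols[j]:
                add(i, -1.0)
                a[i][0] -= w
                a[i][1] += w
                add(i, +1.0)
            k -= 1
        det = 4.0 * (Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0])
        if abs(det) < 1e-12:
            continue                              # singular system: skip
        # Solve 2 Q rho = -c by Cramer's rule.
        r1 = (-c[0] * 2.0 * Q[1][1] + c[1] * 2.0 * Q[0][1]) / det
        r2 = (-c[1] * 2.0 * Q[0][0] + c[0] * 2.0 * Q[1][0]) / det
        rho = (r1, r2)
        d2 = p2 + c[0] * r1 + c[1] * r2 + sum(
            rho[t] * Q[t][u] * rho[u] for t in range(2) for u in range(2))
        if best is None or d2 < best[0]:
            best = (d2, tau, rho)
    return best


# Toy 3 x 3 example: row sums and column sums of a binary phantom.
W = [{0: 1.0, 1: 1.0, 2: 1.0}, {3: 1.0, 4: 1.0, 5: 1.0}, {6: 1.0, 7: 1.0, 8: 1.0},
     {0: 1.0, 3: 1.0, 6: 1.0}, {1: 1.0, 4: 1.0, 7: 1.0}, {2: 1.0, 5: 1.0, 8: 1.0}]
p = [1.0, 2.0, 1.0, 0.0, 1.0, 3.0]               # projections of the phantom
x = [0.10, 0.20, 0.80,
     0.15, 0.70, 0.90,
     0.05, 0.30, 0.95]                           # blurred reconstruction
d2, tau, rho = two_class_threshold(W, x, p)
```

In this toy example the search selects the threshold that reproduces the phantom exactly, with grey levels close to 0 and 1.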
3.1 More Than Two Grey Levels
If there are more than two segmentation classes, it is usually not possible to compute the total squared projection error for each possible set of candidate thresholds, due to the vast number of candidates. However, using the update operation from Section 2, it is possible to compute a local minimum of the projection error in reasonable time. A simple algorithm for this case is the following: first, determine initial thresholds $\tau^0$, possibly using another automated procedure, such as fitting Gaussian functions to the histogram. Next, compute $A$, $\bar{Q}$ and $\bar{c}$. In an iterative loop, compute for each threshold the effect of a small increase and a small decrease of that threshold on the total projection error. Among all these possible steps, select the one that results in the largest decrease of the total projection error. The algorithm terminates if no step can be found that decreases the total error.
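The local search described above can be sketched generically. In this minimal illustration the projection error is passed in as a callable `error` (a hypothetical interface; the text instead tracks the error through the incremental updates of Section 2), and the step size is a free parameter chosen by the user.

```python
def local_threshold_search(error, tau0, step, max_iter=1000):
    """Greedy descent over a list of increasing thresholds.

    error: callable mapping a list of thresholds to the total projection
           error (assumed interface, for illustration only).
    """
    tau = list(tau0)
    best = error(tau)
    for _ in range(max_iter):
        candidates = []
        for t in range(len(tau)):
            for delta in (-step, +step):          # small decrease / increase
                trial = list(tau)
                trial[t] += delta
                # Keep the thresholds strictly increasing.
                if all(trial[i] < trial[i + 1] for i in range(len(trial) - 1)):
                    candidates.append((error(trial), trial))
        if not candidates:
            break
        e, trial = min(candidates, key=lambda item: item[0])
        if e >= best:
            break                                 # no step decreases the error
        tau, best = trial, e
    return tau, best


# Stand-in error with its minimum at thresholds (0.3, 0.7).
toy_error = lambda t: (t[0] - 0.3) ** 2 + (t[1] - 0.7) ** 2
tau_opt, err_opt = local_threshold_search(toy_error, [0.1, 0.9], step=0.05)
```

As in the text, the procedure only guarantees a local minimum, so the result depends on the initial thresholds.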
4 Results and Discussion
In order to validate our proposed threshold selection method, simulation experiments were set up. For this purpose, three phantom images of size 512×512 were constructed: a binary vessel image representing a vessel tree, a binary femur image, and a grey valued mouse leg image (see Fig. 2). From these images, CT projections were simulated as follows. First, the Radon transform of the images was computed, resulting in a sinogram for which each data point represents the line integral of attenuation coefficients. Then, (noiseless) CT projection data were generated, assuming a mono-energetic X-ray beam. The projections were then polluted with Poisson distributed noise, where the number of photons per
Fig. 2. (a-c) Phantom images: (a) vessel, (b) femur, (c) mouse leg; (d-f) Simulated CT reconstructions from 90 projections: (d) vessel, (e) femur, (f) mouse leg
Fig. 3. Thresholding results of two commonly applied thresholding methods and the proposed method for the femur image (cf. Fig. 2(b)). The numbers at the top of the figure indicate the least possible number of misclassified pixels for each case.
detector element was varied from $5 \times 10^2$ to $5 \times 10^4$. Next, the noisy sinogram of the attenuation coefficients was obtained by dividing the CT projection data by the maximum intensity and computing the negative logarithm. Finally, the simulated, noisy CT reconstructions were obtained by applying a SIRT algorithm. For each case, 10 independent noisy sinograms were generated. We refer to [9] for further details on image formation in CT and on the SIRT algorithm.

The proposed tomogram thresholding technique was then compared to commonly applied thresholding methods [10]. First, a parametric optimal thresholding technique was implemented in which the image histogram was fitted to a mixture probability density function (two Gaussian functions) from which an optimal threshold was derived [5]. Next, the commonly used iterative threshold selection scheme of Otsu was implemented [6]. This is also the default thresholding method used in Matlab. As a final method, k-means clustering was applied to the image histogram. For each simulated reconstruction, global thresholds were computed. Then, the number of misclassified pixels for each method, referred to as $N_m$, was compared to the number of misclassified pixels $N_{opt}$ of the ‘optimally thresholded image’. The latter image can be found by an exhaustive search over all possible threshold values and comparing the thresholded image to the original, noiseless image. From those numbers, a measure for the number of correctly classified pixels for each method is given by $R = 100 \cdot N_m / N_{opt}$, which will be referred to as the classification performance ratio. For each method, $R$ is evaluated as a function of the number of photons per detector element employed during simulation of the CT sinograms.
Typical results of the simulation experiments are shown in Fig. 3. Since the classification performance ratio is based on Nopt , this number is denoted above the figure for each data point. The error bars around each point indicate the range of the results for the 10 independent experiments. From the figure, it is clear that the proposed projection based method outperforms conventional thresholding techniques with respect to the classification performance ratio. Similar results were obtained for the vessel and mouse-leg images. For the vessel image, the classification performance ratio of our proposed method was above 92% in all tests, whereas the best performing alternative method, k-means clustering, dropped to 80% in the test with the highest count per detector element. For all tests, the running time was around 10s on a standard desktop PC.
5 Conclusions
Global grey value thresholding is a trivial, yet often used segmentation technique. The search for the optimal grey level threshold is, however, far from trivial. Many procedures have been proposed to select the grey value threshold based on the image histogram. In our paper, we have presented an innovative approach to find the optimal threshold grey levels by exploiting the available projection data. The results on simulated CT-reconstructions show that our proposed method obtains superior results compared to established histogram-based methods.
References
1. Glasbey, C.A.: An analysis of histogram-based thresholding algorithms. Graphical Models and Image Processing 55, 532–537 (1993)
2. Seo, K.S.: Improved fully automatic liver segmentation using histogram tail threshold algorithms. vol. 3, pp. 822–825 (2005)
3. Sahoo, P.K., Arora, G.: Image thresholding using two-dimensional Tsallis–Havrda–Charvát entropy. 27, 520–528 (2006)
4. Ng, H.F.: Automatic thresholding for defect detection. vol. 27, pp. 1644–1649 (2006)
5. Sonka, M., Fitzpatrick, J.M.: Handbook of Medical Image Processing and Analysis. SPIE Press (2004)
6. Otsu, N.: A threshold selection method from gray level histograms. IEEE Trans. Systems, Man and Cybernetics 9, 62–66 (1979)
7. Hou, Z., Hu, Q., Nowinski, W.: On minimum variance thresholding. 27, 1732–1743 (2007)
8. Qiao, Y., Hu, Q., Qian, G., Luo, S., Nowinski, W.L.: Thresholding based on variance and intensity contrast. 40, 596–608 (2007)
9. Kak, A.C., Slaney, M.: Principles of Computerized Tomographic Imaging. In: Volume Algorithms for reconstruction with non-diffracting sources, pp. 49–112. IEEE Press, New York, NY (1988)
10. Eichmann, M., Lüssi, M.: Efficient multilevel image thresholding. Master’s thesis, Hochschule für Technik Rapperswil, Switzerland (2005)
A Level Set Bridging Force for the Segmentation of Dendritic Spines

Karsten Rink and Klaus Tönnies

Department of Simulation and Graphics, University of Magdeburg, Germany
{karsten,klaus}@isg.cs.uni-magdeburg.de
Abstract. The paper focusses on a group of segmentation problems dealing with 3D data sets showing thin objects that appear disconnected in the data due to partial volume effects or a large spacing between neighbouring slices. We propose a modification of the speed function for the well-known level set method to bridge these discontinuities. This allows for the segmentation of the object as a whole. In this paper we are concerned with treelike structures, particularly dendrites in microscopic data sets, whose shape is unknown prior to segmentation. Using the modified speed function, our algorithm segments dendrites and their spines, even if parts of the object appear to be disconnected due to artifacts.
1 Motivation
A number of problems arise when an object in a 3D data set should be segmented. Depending on the modality there may be certain artifacts, e.g. magnetic field inhomogeneities in MR-images or metal artifacts in CT-images. An artifact common to almost all image acquisition techniques is the partial volume effect (PVE). In digital images the grey value of a pixel is the mean value of all the information that was measured in the area represented by this pixel. The result is a blurring of regions with large variances in grey values. This is especially evident at the borders of objects. It is amplified even more between neighbouring slices in data sets, as the slice thickness is often larger than the pixel spacing within a slice. Structures whose diameter is smaller than the width of a pixel will merge into the background and may even vanish entirely.

When segmenting objects with a known shape, one can simply devise a model describing this shape and its variation to detect and/or segment the object within the data set. The location of small parts not visible in the data can then be estimated based on the parts of the object that have been segmented. Examples of such models are the Active Shape Model [2], Snakes [7] or Mass Spring Models [4].

With objects whose shape is very complex or not known prior to segmentation, partial volume effects may pose a big problem. It is difficult to determine the exact location of the borders of the object. Furthermore, parts of the structures that appear disconnected within or, more likely, between slices are usually not segmented at all.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 571–578, 2007.
© Springer-Verlag Berlin Heidelberg 2007
K. Rink and K. Tönnies
Fig. 1. Illustration of the front propagation process: The left image shows an initial curve C and the level set function Φ at time t=0. The right image shows both functions at a later time.
Due to its implicit definition, the level set method is an appropriate means to segment objects whose shape is not known a priori or whose shape varies significantly between data sets. Therefore, we devise a solution to the problem mentioned above using a modification of the level set method. We propose a new force term for the level set speed function to bridge small gaps in the representation of objects, such as those resulting from PVE.
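The partial volume effect described in this section can be illustrated with a minimal one-dimensional sketch. The sampling factor, the signal values and the threshold of 0.5 are illustrative assumptions and do not come from the paper; the point is only that averaging over a pixel footprint dilutes a structure that is thinner than the pixel.

```python
def downsample_mean(signal, factor):
    """Average consecutive blocks of `factor` samples; one block models one pixel."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal) - factor + 1, factor)]

# Finely sampled 1D "scene": background 0.0, object 1.0 (hypothetical values).
fine = [0.0] * 16
fine[5] = 1.0                 # thin structure, narrower than one coarse pixel
for i in range(8, 12):
    fine[i] = 1.0             # structure exactly as wide as one coarse pixel

coarse = downsample_mean(fine, 4)

# A simple threshold keeps the wide structure but loses the thin one.
visible = [value >= 0.5 for value in coarse]
```

The thin structure contributes only a quarter of its intensity to the pixel that contains it, so a threshold chosen for the object intensity no longer detects it.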
2 Level Set Methods and Speed Functions
Level sets were introduced by Osher and Sethian [10] for the solution of surface motion problems. They are used to describe the propagation process of a closed curve $C \in \mathbb{R}^n$. This curve $C$ is represented as the zero level set of a higher dimensional function $\Phi \in \mathbb{R}^{n+1}$, i.e.

$$C(t) = \{x \mid \Phi(x(t), t) = 0\}. \qquad (1)$$

$C$ is moving in its normal direction over time. The speed of each point on $C$ is given by a speed function $F$ that is dependent on various internal and external forces, such as local curvature or image gradients, respectively. This leads to the level set equation

$$\Phi_t + F\,|\nabla\Phi| = 0, \qquad (2)$$

where $|\nabla\Phi|$ denotes the norm of the gradient of the level set function and $F$ is the speed function. The advantage of this representation is that $\Phi$ always remains a function even if $C$ splits, merges or forms sharp corners. Also, this representation is independent of the number of dimensions of $C$. As $\Phi$ changes over time, its zero level set $\Phi(x, t) = 0$ always yields the propagating front, i.e. $C(t)$. The speed function $F$ governs the way the front is propagating. Usually $F$ consists of a number of different forces that are combined in a meaningful way. Using level set methods for image processing applications, one can differentiate between external force terms derived from the underlying image and internal
A Level Set Bridging Force for the Segmentation of Dendritic Spines
force terms derived from the properties of the front itself. An example of a speed function for an image processing task is given by

$$F = F_{\nabla I}\,(F_A + F_\kappa), \qquad (3)$$

where $F_A$ is an advection term for expanding the front, $F_\kappa$ is a force based on local curvature, usually given by

$$F_\kappa = \nabla \cdot \frac{\nabla\Phi}{|\nabla\Phi|}, \qquad (4)$$

and $F_{\nabla I}$ is an external force based on image gradients. One way to define such a force is

$$F_{\nabla I}(x) = \frac{1}{1 + |\nabla G_\sigma * I(x)|}; \qquad (5)$$

other approaches are described in [9] and [1]. In equation 5, $G_\sigma * I(x)$ denotes an image convolved with a Gaussian low pass filter with a standard deviation of $\sigma$. If many force terms are used it is necessary to find meaningful weights for all forces to guarantee that the results of all forces have approximately the same order of magnitude.

The definition of image-based forces is by no means limited to simple image features. Van Bemmel et al. [12] devise a force term based on the size of cross section areas and the eigenvalues of the Hessian matrix at each pixel for the segmentation of blood vessels in MRA images. The resulting speed term gives high values inside cylindrical objects and low values otherwise. Leventon et al. [8] proposed a method to use curvature and intensity profiles along the gradients of the propagating front to segment the corpus callosum and the femur in MR images. Such advanced speed terms containing model-based aspects often result in a more robust and reliable segmentation result. Unfortunately, they are also often limited to the specific application they were designed for. Also, the independence of the level set method from the number of dimensions of the data is lost.

A paper by Han et al. [5] introduces topology preserving level sets. This modification to the speed function prevents the propagating front from splitting or merging. In [5] this is used for the segmentation of grey matter in MR images of the human brain and the segmentation of bones in CT images. Like the speed term proposed in this paper, the modification in [5] is somewhat weaker in its contribution to the speed function. But it is consistent with the definition of level sets and not limited to a single application.
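Equations 2 and 5 can be made concrete with a small sketch. It is not the authors' implementation: the first-order upwind discretisation, the clamped borders, the time step and the omission of the curvature term are assumptions made for brevity, and `edge_stopping` expects an image that has already been smoothed. The sign convention follows the text, with Φ negative inside the front.

```python
import math

def edge_stopping(image):
    """Gradient-based term of Eq. (5): 1 / (1 + |grad I|) on a smoothed image."""
    h, w = len(image), len(image[0])
    g = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Central differences with indices clamped at the image border.
            gy = (image[min(r + 1, h - 1)][c] - image[max(r - 1, 0)][c]) / 2.0
            gx = (image[r][min(c + 1, w - 1)] - image[r][max(c - 1, 0)]) / 2.0
            g[r][c] = 1.0 / (1.0 + math.hypot(gx, gy))
    return g

def evolve(phi, speed, dt=0.5):
    """One explicit upwind step of phi_t + F |grad phi| = 0, valid for F >= 0."""
    h, w = len(phi), len(phi[0])
    new = [row[:] for row in phi]
    for r in range(h):
        for c in range(w):
            p = phi[r][c]
            dxm = p - phi[r][max(c - 1, 0)]        # backward difference in x
            dxp = phi[r][min(c + 1, w - 1)] - p    # forward difference in x
            dym = p - phi[max(r - 1, 0)][c]        # backward difference in y
            dyp = phi[min(r + 1, h - 1)][c] - p    # forward difference in y
            grad = math.sqrt(max(dxm, 0.0) ** 2 + min(dxp, 0.0) ** 2 +
                             max(dym, 0.0) ** 2 + min(dyp, 0.0) ** 2)
            new[r][c] = p - dt * speed[r][c] * grad
    return new

# Demo: a circular front of radius 2 expanding over a flat (edge-free) image.
phi = [[math.hypot(r - 4, c - 4) - 2.0 for c in range(9)] for r in range(9)]
speed = edge_stopping([[0.0] * 9 for _ in range(9)])
inside_before = sum(v < 0 for row in phi for v in row)
phi = evolve(phi, speed)
inside_after = sum(v < 0 for row in phi for v in row)
```

On a flat image the edge term equals 1 everywhere, so the front expands freely; near a strong grey value step the term drops towards 0 and the front slows down.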
3 Bridging Gaps in Data
The goal of this section is to define a speed term that allows the complete segmentation of objects, even if parts of those objects are not connected. As
Fig. 2. Image 2(a) is an example of one slice of a contrast enhanced 3D microscopic image depicting a dendrite with various spines (see section 4 for details). The original image is convolved with a small gaussian filter and segmented using the level set method. Image 2(b) depicts the result using a common speed function. While some spines have been segmented during this segmentation process, others have been missed. Image 2(c) shows the segmentation result using a modified speed function containing our new speed term. Almost all spines have been segmented. See the enlarged regions in the lower right corner for more details. The missing spines are barely visible (e.g. one spine marked by the white arrow in the enlarged section in image 2(c)) and cannot be segmented without either the incorporation of additional knowledge or an oversegmentation in other parts of the image.
described in section 1 such gaps may occur in medical imaging due to partial volume effects when thin objects are seemingly disconnected due to a large pixel spacing in the data set. To solve this problem we define a new speed term that allows the propagating front to “look ahead” of its current position. That is, we incorporate a term that for each pixel on the front decides if it should be moved even if the underlying image features in the data (grey values, gradients, etc.) suggest that it should be stopped.

In a prior paper [11] we used the pixels along the surface normal to decide if the front should continue to propagate. This method is very fast, as the surface normal is easily derived from the level set representation. While this works well with most objects, problems occurred under certain circumstances. Because the surface normal is calculated from a discrete representation of the level set function, the approximation at places of high curvature tends to be very inexact. This is obviously a problem when dealing with thin objects. On the other hand, these objects are of special interest as partial volume effects tend to affect particularly the representation of thin objects within the data.

The new definition of our “bridging force” no longer relies on the surface normal alone. Instead, for each pixel x on the front all pixels y in a given neighbourhood with Φ(y) > 0 are checked. The best pixel is selected using various, easily computed conditions:
– y fulfills the image-based conditions of the level set speed function (i.e. the pixel would be segmented if it were connected to the object)
– the angle γ between the surface normal at x and the vector y − x is smaller than 90 degrees
– the distance between x and y is smaller than the distance between y and the neighbours of x on the propagating front.

Note that the image-based conditions (e.g. gradient length, grey values, etc.) need to be calculated only once before the start of the propagation process and are needed for the use of other image-based forces anyway. Also, instead of actually calculating γ we simply compute the scalar product $n \cdot (y - x)$, which is positive exactly when γ is smaller than 90 degrees; here $n$ is the surface normal of the front at the position of x. Furthermore, we need only to calculate this force term if the speed of the front at position x is very small. Otherwise the front is currently still moving at this position, so there cannot be any gap that has to be bridged. The speed function of the level set equation is now rewritten as

$$F(x) = \begin{cases} \alpha F_I(x)\,(F_A + \beta F_\kappa(x)), & \text{if } F > \varepsilon, \\ \alpha F_I(y)\,F_A, & \text{otherwise.} \end{cases} \qquad (6)$$

$F_I$ denotes the combination of all image-based speed terms, $F_\kappa$ is the curvature-based speed term and $\alpha, \beta \in [0, 1]$ control the influence of image- and curvature-based forces, respectively. Using the speed function given in equation 6 the propagating front may cross areas of the image that would not be segmented using a common speed function such as given in equation 3. Still, the front will not leak into the background of the image if the speed function is defined appropriately because the image-based speed terms represented by $F_I$ will prevent any propagation in these locations (see figure 4). While disconnected parts of the desired object are segmented in almost all cases, the path chosen by the front was not always the semantically correct one.
If two different parts of the front may potentially be connected to a given part of the object, it cannot be decided which connection would be the correct one (see figures 5(b) and 5(c)). In order to guarantee a correct decision, more semantic knowledge of the desired object would be required, which in turn would restrict the general use of our “bridging force” for various applications. Nevertheless, corrections of these semantically wrong “bridges” could be realised in a postprocessing step. Also, the size of the neighbourhood is important to consider. If the neighbourhood is too small, some parts of the desired object may not be found (see figure 5(a)). On the other hand, if the neighbourhood is too large, the number of unwanted bridges may increase, depending on the topology of the data.
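The look-ahead test and the case distinction of equation 6 can be sketched as follows. This is a hedged illustration rather than the authors' code. Choosing the nearest admissible pixel as the "best" one is an assumption, since the text does not state the ranking. Reading the condition F > ε as "the regular speed at x exceeds ε" is also an assumption, as is falling back to the regular speed when no target is found. The names `find_bridge_target`, `bridged_speed`, `accept` and `front_neighbours` are hypothetical.

```python
import math

def find_bridge_target(x, normal, phi, accept, front_neighbours, radius=6):
    """Return the nearest admissible pixel y ahead of front pixel x, or None.

    x and y are (row, col) tuples, normal is the outward normal at x,
    accept[r][c] is True where the image-based conditions hold.
    """
    xr, xc = x
    best, best_dist = None, float("inf")
    for r in range(max(xr - radius, 0), min(xr + radius, len(phi) - 1) + 1):
        for c in range(max(xc - radius, 0), min(xc + radius, len(phi[0]) - 1) + 1):
            if (r, c) == x or phi[r][c] <= 0 or not accept[r][c]:
                continue                      # only accepted pixels outside the front
            dr, dc = r - xr, c - xc
            if normal[0] * dr + normal[1] * dc <= 0:
                continue                      # angle to the normal is 90 degrees or more
            dist = math.hypot(dr, dc)
            if dist > radius:
                continue
            if any(math.hypot(r - nr, c - nc) <= dist for nr, nc in front_neighbours):
                continue                      # a neighbouring front pixel is at least as close
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def bridged_speed(x, y, f_image, f_curv, f_adv=1.0, alpha=1.0, beta=1.0, eps=1e-3):
    """Case distinction of Eq. (6); y is the bridge target or None."""
    regular = alpha * f_image[x[0]][x[1]] * (f_adv + beta * f_curv[x[0]][x[1]])
    if regular > eps or y is None:
        return regular
    return alpha * f_image[y[0]][y[1]] * f_adv

# Demo: object in columns 0-2, front stopped at column 3, bright blob at columns 7-8.
phi = [[c - 3.5 for c in range(10)] for _ in range(5)]
f_image = [[1.0 if c < 3 else 0.0 for c in range(10)] for _ in range(5)]
f_image[2][7] = f_image[2][8] = 1.0
accept = [[v > 0.5 for v in row] for row in f_image]
f_curv = [[0.0] * 10 for _ in range(5)]
x = (2, 3)
y = find_bridge_target(x, (0.0, 1.0), phi, accept, [(1, 3), (3, 3)], radius=6)
speed_at_x = bridged_speed(x, y, f_image, f_curv)
```

In the demo the regular speed at x is zero, so the image term is evaluated at the target y instead and the front keeps moving across the gap. With a look-ahead radius of 3 no target is found and the front stays where it is.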
4 Experimental Results
The modified speed function was tested on images depicting branches of neurons, called dendrites, which have been injected with a luminescent dye. These
Fig. 3. Preprocessing steps: A slice of the original 8 bit image (figure 3(a)) is contrast-enhanced (figure 3(b)) and convolved with a 5 × 5 Gaussian filter (figure 3(c))
Fig. 4. Advantages of the bridging force: Figure 4(a) depicts a slice of the 3D segmentation of the dendrite from figure 3(a) using a common level set function as given in equation 3. In contrast, the result in figure 4(b) uses the additional bridging force introduced in section 3.
dendrites have been scanned using a confocal laser scanning microscope. We used five data sets with a minimum resolution of 512 × 512 × 112 voxels. The voxel size in all images is 0.1 μm³. A number of extensions are visible on the side of each dendrite. These dendritic spines are used for the transmission of signals between various neurons [6]. Figure 2(a) shows a slice from one of the data sets. It can be seen that there is often no visible connection between dendrites and their spines, and a segmentation using the level set function does not find these disconnected spines (see figure 2(b)).

Prior to the segmentation, the original data has been contrast-enhanced to allow for a more reliable calculation of the image-based forces, and convolved with a 5 × 5 Gaussian filter kernel to reduce the image noise (see figure 3). As mentioned in section 3, the success of the segmentation depends on those image features that are used in the speed function and on the number of pixels that the bridging force is using to decide if a gap should be crossed.
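The two preprocessing steps can be sketched for a single slice as follows. The paper names neither the contrast enhancement method nor the kernel weights, so the linear min-max stretch and the binomial approximation of the 5 × 5 Gaussian used here are assumptions, as is the clamping of indices at the image border.

```python
def stretch_contrast(image, low=0.0, high=255.0):
    """Linear min-max stretch of the grey values to [low, high] (assumed method)."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[low for _ in row] for row in image]
    scale = (high - low) / (hi - lo)
    return [[low + (v - lo) * scale for v in row] for row in image]

# 1D binomial weights; their outer product approximates a 5 x 5 Gaussian kernel.
BINOMIAL = [1, 4, 6, 4, 1]

def smooth_5x5(image):
    """Convolve with the 5 x 5 binomial kernel, clamping indices at the border."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i, wi in enumerate(BINOMIAL):
                rr = min(max(r + i - 2, 0), h - 1)
                for j, wj in enumerate(BINOMIAL):
                    cc = min(max(c + j - 2, 0), w - 1)
                    acc += wi * wj * image[rr][cc]
            out[r][c] = acc / 256.0     # the 25 weights sum to 16 * 16 = 256
    return out

# Demo on a tiny hypothetical slice.
slice_ = [[10.0, 20.0], [30.0, 50.0]]
enhanced = stretch_contrast(slice_)
smoothed = smooth_5x5(enhanced)
```

Because the kernel weights sum to one after normalisation, a constant region keeps its grey value while isolated noise pixels are spread over their neighbourhood.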
Fig. 5. Drawbacks of the bridging force: Shown are three slices from the same data set. Region A is an example of a spine too similar to the background to be segmented. The bright part of the spine in figure 5(a) is too far from the dendrite to be detected. Region B shows a spine that has been detected. The level set function bridged over to the bright region from another spine, though, which is semantically not correct.
The algorithm was initialised with a single seed pixel within the dendrite. It segmented all spines with the specified image features that were either (visibly) connected to the dendrite or located within the neighbourhood specified by the bridging force. For the experiments shown in this paper we used a maximum look-ahead distance of six pixels. As shown in figure 5(a), a small number of spines were missed. Nevertheless this was a good tradeoff, since a larger neighbourhood increased the number of semantically incorrect connections. In figure 2(c) we point out a number of hard to detect spines with image features similar to the background noise. Without the incorporation of additional semantic knowledge it is not possible to detect these spines.

Since no ground truth for a correct segmentation exists, we compared our result with various other segmentation techniques. In figures 2(b) and 4(a) we depict results of a common level set segmentation. The results are comparable to other user guided segmentation techniques like region growing, fast marching, image foresting transform, etc. These techniques cannot segment disconnected parts of the specified objects without incorporating a modification similar to our bridging force. As mentioned in section 1, model-based segmentation techniques are not applicable since there is no prior information on the shape of the object. Simple pixel-based techniques, e.g. thresholding methods, cannot connect disconnected spines to the dendrite, but we assume that it should be possible to achieve a correct segmentation using a Markov random field [3] with an appropriate neighbourhood configuration.
5 Conclusions
We defined a new speed term to connect parts of an object that appear disconnected in the data due to partial volume effects. We aimed at a general definition
applicable to various similar problems. This speed term was successfully applied to a number of microscopic images of neurons to segment dendrites and their dendritic spines. We found all spines given our specified image features, although not always the correct connection to the associated dendrite. We presented the advantages of our modified speed function over a common level set segmentation and also mentioned drawbacks due to the general definition of our new force term. To obtain a semantically correct segmentation the integration of further problem-related knowledge will be necessary, although this will require a modification of our speed term adapted to the problem.
Acknowledgements

We would like to thank Andreas Herzog for the fruitful discussion on the data sets.
References

1. Caselles, V., Kimmel, R., Sapiro, G.: Geodesic Active Contours. Int. J. Comp. Vis. 22(1), 61–79 (1991)
2. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active Shape Models - Their Training and Application. Comput. Vis. Image Understand. 61(1), 38–59 (1995)
3. Geman, S., Geman, D.: Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984)
4. Hamarneh, G., McInerney, T., Terzopoulos, D.: Deformable Organisms for Automatic Medical Image Analysis. In: Niessen, W.J., Viergever, M.A. (eds.) MICCAI 2001. LNCS, vol. 2208, pp. 66–76. Springer, Heidelberg (2001)
5. Han, X., Xu, C., Prince, J.L.: A topology preserving level set method for geometric deformable models. IEEE Trans. Pattern Anal. Mach. Intell. 25(6), 755–768 (2003)
6. Herzog, A.: Formrekonstruktion dendritischer Spines aus dreidimensionalen Mikroskopbildern unter Verwendung geometrischer Modelle. Dissertation. Otto-von-Guericke-Universität Magdeburg, Fakultät Elektrotechnik (2001)
7. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active Contour Models. Int. J. Comp. Vis. 1(4), 321–331 (1988)
8. Leventon, M.E., Faugeras, O., Grimson, W.E.L., Wells III, W.M.: Level Set Based Segmentation with Intensity and Curvature Priors. In: Workshop on Mathematical Methods in Biomedical Image Analysis (2000)
9. Malladi, R., Sethian, J.A., Vemuri, B.: Shape Modelling with Front Propagation: A Level Set Approach. IEEE Trans. Pattern Anal. Mach. Intell. 17(2), 158–175 (1995)
10. Osher, S., Sethian, J.A.: Fronts Propagating with Curvature Dependent Speed: Algorithms Based on Hamilton-Jacobi Formulation. J. Comp. Phys. 79, 12–49 (1988)
11. Rink, K., Tönnies, K.: A modification of the level set speed function to bridge gaps in data. In: Franke, K., Müller, K.-R., Nickolay, B., Schäfer, R. (eds.) Pattern Recognition. LNCS, vol. 4174, pp. 152–161. Springer, Heidelberg (2006)
12. Van Bemmel, C.M., et al.: A Level-Set-Based Artery-Vein Separation in Blood-Pool Agent CR-MR Angiograms. IEEE Trans. Med. Imag. 22(10), 1224–1234 (2003)
Knowledge from Markers in Watershed Segmentation

Sébastien Lefèvre

LSIIT, CNRS / University Louis Pasteur - Strasbourg I, Parc d’Innovation, Bd Brant, BP 10413, 67412 Illkirch Cedex, France
[email protected]
Abstract. Due to its broad impact in many image analysis applications, the problem of image segmentation has been widely studied. However, there still does not exist any automatic segmentation procedure able to deal accurately with any kind of image. Thus semi-automatic segmentation methods may be seen as an appropriate alternative to solve the segmentation problem. Among these methods, the marker-based watershed has been successfully applied in various domains. In this algorithm, the user may locate the markers, which are used only as the initial starting positions of the regions to be segmented. We propose to base the segmentation process also on the contents of the markers through a supervised pixel classification, thus resulting in a knowledge-based watershed segmentation where the knowledge is built from the markers. Our contribution has been evaluated through some comparative tests with some state-of-the-art methods on the well-known Berkeley Segmentation Dataset.

Keywords: Marker-based Watershed. Supervised classification. Colour Segmentation.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 579–586, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Analysis, processing and understanding of digital images often involve many different algorithms. Among them, segmentation (which consists in generating a set of meaningful regions from an input image) is certainly one of the most crucial steps, as the quality of the following procedures directly depends on the accuracy and relevance of the segmentation results. Many research works therefore address the problem of image segmentation, one of the main goals being to elaborate a method both automatic (without user assistance) and generic (able to deal with any kind of images). As this objective is still unreachable, the existing approaches are either automatic or generic.

Semi-automatic segmentation techniques need not be dedicated to a given type of images. Indeed, the genericity property is ensured by the user intervention or setting. There are several ways the user can drive the segmentation procedure, among which we can cite the spatial initialisation of the algorithm, the definition of the class of interest in the images, and the respective influence of the different features to be involved (e.g. colour, texture, shape, etc.). The marker-based watershed [1], a widely used enhancement of the well-known watershed algorithm [2], may be driven by the user through a spatial initialisation (i.e. the location of the "markers"). Only the marker spatial position is used in the watershed algorithm, and the marker content is completely ignored. We
propose in this paper to increase the marker-based watershed performance by using the information related to the markers' content. More precisely, we use the markers as learning sets in a supervised pixel classification procedure which results in a probability map per marker. The relief required by the watershed algorithm is then built from these probability maps. The rest of this paper is organized as follows. In section 2 we will recall the marker-based watershed and present existing watershed segmentation methods relying on knowledge or prior information. Then we will describe the proposed solution in section 3 and illustrate its relevance through comparative tests in section 4. Finally, section 5 will be devoted to concluding remarks.
2 Marker-Based and Knowledge-Based Watershed

The watershed transform is a very popular segmentation method, as it is computationally efficient and does not require any parameter. However, it also has some drawbacks, such as sensitivity to noise and above all oversegmentation, where the result consists of a large number of irrelevant and undesired regions. To counter these limits and to increase the accuracy and relevance of the results, it is possible to consider prior knowledge. This knowledge often consists of the number and the positions of the regions through the definition of some markers, thus resulting in the marker-based watershed which will be recalled in this section. We will also present some other ways to involve knowledge in the watershed transform.

2.1 The Marker-Based Watershed

The marker-based watershed [1] is certainly the best known and most widely used enhancement of the watershed transform. Several definitions have been given in the literature, and we recall here the algorithmic definition given by Vincent and Soille [2], called simulated immersion, as it is the definition our method relies on. In this definition, the set of the catchment basins of the greyscale image function $f$ (with values in $[h_{min}, h_{max}]$) is equal to the set $X_{h_{max}}$ obtained after the following recursion:

$$X_{h_{min}} = T_{h_{min}}(f), \qquad X_{h+1} = \mathrm{MIN}_{h+1} \cup IZ_{T_{h+1}(f)}(X_h), \quad h_{min} \le h < h_{max} \qquad (1)$$

where $T_h$ is the threshold set at level $h$, $\mathrm{MIN}_h$ is the union of all regional minima at altitude $h$, and $IZ_A(B)$ is the union of geodesic influence zones of connected components of $B$ defined with $B \subseteq A$ and the following equations: $IZ_A(B) =$
l
0. Recall that this means that $K_\alpha \subset K_{\alpha+p}$, so $K_{\alpha+p}$ contains the same vertices, edges, and triangles as $K_\alpha$, but may also contain new edges and triangles. Holes in $K_\alpha$ that are still holes in $K_{\alpha+p}$ are regions that are persistent in $p$. We shall refer to these regions as p-persistent regions. Holes in $K_\alpha$ that are filled in between $K_\alpha$ and $K_{\alpha+p}$ are regions that are not persistent in $p$. We shall refer to these as p-transient regions. And finally, the individual triangles that were already filled in $K_\alpha$ will be referred to as d-triangles.

For subcomplexes of the plane, loops that do not persist bound regions that disappear completely in $K_{\alpha+p}$. An equivalent definition to the more general one is

$$\beta_1^p(K_\alpha) = C - V + E - F - R,$$

where $R$ is the number of regions in the complement of $K_\alpha$ that are contained in $K_{\alpha+p}$. Note that this is not necessarily true in higher dimensions.
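The counting formula can be checked numerically with a short sketch. The definitions of C, V, E and F lie outside this passage, so reading them as the numbers of connected components, vertices, edges and filled triangles is an assumption. The six tuples are the counts printed with the complexes in Fig. 2 of the paper, and the value R = 2 in the last line is purely hypothetical.

```python
def betti1(components, vertices, edges, faces):
    """First Betti number of a planar complex from its Euler characteristic."""
    return components - vertices + edges - faces

def persistent_betti1(components, vertices, edges, faces, filled_regions):
    """Planar shortcut from the text: subtract the R holes that get filled by alpha + p."""
    return betti1(components, vertices, edges, faces) - filled_regions

# (C, V, E, F) as printed with the complexes of Fig. 2.
FIG2_COUNTS = [
    (4, 18, 14, 0),
    (1, 18, 22, 0),
    (1, 18, 27, 5),
    (1, 18, 35, 10),
    (1, 18, 37, 13),
    (1, 18, 38, 18),
]

betti_numbers = [betti1(*counts) for counts in FIG2_COUNTS]

# Hypothetical example: if 2 of the 8 holes are filled by alpha + p, 6 persist.
example = persistent_betti1(1, 18, 35, 10, filled_regions=2)
```

The computed values agree with the Betti numbers printed next to the same complexes in the figure, which supports the assumed reading of the four counts.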
3 Segmentation Algorithm
The proposed segmentation algorithm is a hybrid split-and-merge segmentation algorithm that employs computational topology and persistent homology for image splitting and region feature characteristics for region merging. The algorithm first employs some edge detection algorithm, such as the Canny algorithm [1] or the wavelet-based detector that we used [8], to determine the set of edge points, X, in the image. Topological splitting is then performed over X to generate an initial segmentation with three types of regions: p-persistent regions, p-transient regions, and d-triangles. The topological splitting is controlled by two parameters, α and p, where α defines the radius of the disks and p indicates the persistence. In the final step, the algorithm merges the p-transient and
Image Segmentation Using Topological Persistence
Fig. 2. This figure demonstrates various complexes in a filtration, including their Betti numbers and persistent Betti numbers for each $K_\alpha$ and $K_{\alpha+p}$
d-triangle regions with either the p-persistent regions or each other to generate the final segmentation.

As discussed in the previous section, the p-persistent regions represent larger regions that remain (are not filled in) in $K_{\alpha+p}$. Conversely, p-transient regions are smaller regions that are filled in between $K_\alpha$ and $K_{\alpha+p}$, while d-triangles are the smallest regions, which already existed as triangles in $K_\alpha$. The d-triangles are akin to “thickened” edges separating regions in the initial segmentation.

The p-persistent regions are crucial in that they define the core objects in the initial segmentation. There are two reasons to believe these regions are important to the segmentation: they are completely surrounded by edges, and they all contain a disk of radius α + p in their interiors. These initial regions can then be expanded upon to create a full segmentation. We perform this process in such a way that the persistent homology is respected. One approach would be to use the simplification algorithms presented in [4] to expand these regions and preserve this homological information. However, this method uses only the edge information, essentially expanding the regions so that the region boundaries consist of the shortest possible edges. This approach has the significant drawback that it does not incorporate any region information.

An alternative strategy for expanding the persistent regions to a full segmentation is to merge neighboring regions based on their feature characteristics. In our case, we simply used the average color value for each region, but any desired feature vector could be used. Homological persistence dictates the merge order of the regions, with regions being ordered according to the disk radius α for which the triangle was formed, starting from the smallest. In order to preserve the topology of the segmentation, regions will only be merged if the persistent Betti number remains unchanged.
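The merging strategy described above can be sketched as a loop over the non-persistent regions in order of increasing α. This is only an illustration under stated assumptions. The region dictionaries and the `adjacency` map are a hypothetical data layout. The persistent Betti number test is not implemented and is represented by a caller-supplied predicate `keeps_topology`. Similarity is the squared distance between size-weighted mean colours.

```python
def merge_regions(regions, adjacency, keeps_topology=lambda a, b: True):
    """Greedy merge of non-persistent regions in order of increasing alpha.

    regions:   id -> {"alpha": float, "color": tuple, "size": int, "persistent": bool}
    adjacency: id -> set of neighbouring region ids
    Returns a dict mapping every region id to the id of its final segment.
    """
    parent = {rid: rid for rid in regions}
    color = {rid: tuple(r["color"]) for rid, r in regions.items()}
    size = {rid: r["size"] for rid, r in regions.items()}
    adj = {rid: set(adjacency.get(rid, ())) for rid in regions}

    def find(rid):
        # Follow parent links to the representative of the merged segment.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    def distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(color[a], color[b]))

    order = sorted((rid for rid, r in regions.items() if not r["persistent"]),
                   key=lambda rid: regions[rid]["alpha"])
    for rid in order:
        root = find(rid)
        candidates = {find(n) for n in adj[root]} - {root}
        candidates = [c for c in candidates if keeps_topology(root, c)]
        if not candidates:
            continue
        target = min(candidates, key=lambda c: (distance(root, c), c))
        total = size[root] + size[target]
        # Size-weighted mean colour of the merged segment.
        color[target] = tuple((size[root] * u + size[target] * v) / total
                              for u, v in zip(color[root], color[target]))
        size[target] = total
        adj[target] |= adj[root]
        parent[root] = target
    return {rid: find(rid) for rid in regions}

# Demo: two persistent regions A and B separated by two small regions t1 and t2.
regions = {
    "A": {"alpha": 0.0, "color": (200.0,), "size": 100, "persistent": True},
    "B": {"alpha": 0.0, "color": (20.0,), "size": 80, "persistent": True},
    "t1": {"alpha": 1.0, "color": (190.0,), "size": 5, "persistent": False},
    "t2": {"alpha": 2.0, "color": (30.0,), "size": 4, "persistent": False},
}
adjacency = {"A": {"t1"}, "B": {"t2"}, "t1": {"A", "t2"}, "t2": {"t1", "B"}}
labels = merge_regions(regions, adjacency)
```

In the demo each small region joins the persistent region with the closest mean colour, so the two core objects stay separate. A real implementation would pass a `keeps_topology` predicate that recomputes the persistent Betti number of the merged complex.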
D. Letscher and J. Fritts
Algorithm 1. Algorithm for edge-directed topological image segmentation
1: Identify set of edges, X, in the image.
2: Find the Delaunay triangulation of X.
3: Calculate $\beta_1^p(K_\alpha)$ and identify regions in $\mathbb{R}^2 - K_\alpha$ as p-persistent, p-transient, or d-triangle regions.
4: Let F be the set of regions ordered by increasing α values.
5: while (F is not empty) do
6:   Let σ be the first face in F, and let F = F − {σ}.
7:   Merge σ with the most similar adjacent region provided that $\beta_1^p(\text{merged regions}) = \beta_1^p(K_\alpha)$
8: end while

4 Results
Figures 3 and 4 demonstrate the results of the algorithm for a sample image, both over multiple stages of the process, and over multiple separate images, respectively. As evident, each of the resulting segmentations effectively distinguishes the critical objects in the image. Figure 5 demonstrates how the choice of α and p affects the resulting segmentation. For small values of α large regions are not separated from each other in the initial stages of the algorithm. For fixed α larger values of p remove small regions from consideration.
Fig. 3. Major steps in the algorithm: (a) original image (b) complement of the α complex for α = 4, p = 0. (c) the persistent regions at α = 4, p = 3 (d) final segmentation.
Fig. 4. Example segmentations of a stop sign, a rose and a baseball player
Fig. 5. Effect of the choice of parameters on the resulting segmentation. The panels form a grid over α = 2, 4, 6 and p = 0, 2, 4; the region counts printed with the panels are, in reading order, 23, 22, 13; 47, 34, 21; and 43, 28, 22.
The segmentation algorithm has several nice guarantees. In particular, “large” regions in the complement of the edges will not be broken into separate regions. Specifically, any ball of radius p in the complement of the edges will not be divided between multiple segments. The choice of α will affect how well the points identified by the edge detection fill out to complete the boundary of regions. In particular, if the points identified as edges have neighbors with a distance of α/2 in each direction then boundaries of regions will be correctly reconstructed.
5 Related Work
An early image segmentation method using Delaunay triangulations was presented in [6]. Their method is distinct in that they do not use persistent homology, but instead start with a full Delaunay triangulation and generate the core objects in the final segmentation solely through region merging. A more recent image segmentation method based on Delaunay triangulation is presented in [11]. They use similar criteria for doing their initial segmentation. However, whereas we use Delaunay triangulation for defining different types of regions and then perform region merging to generate the final segmentation, they use Delaunay triangulation for defining boundaries, and use thinning to eliminate undesired edges in the final segmentation.
594
D. Letscher and J. Fritts
Another recent image segmentation method based on Delaunay triangulation is presented in [10]. They generate the full Delaunay triangulation first, followed by region merging. They merge regions by extracting a skeleton of the segmented image from the connected components graph of the triangulations. Each separate connected component in the skeleton defines one region in the final segmentation.
6 Conclusion and Future Directions
We have presented an image segmentation algorithm that uses both edge and region information and uses techniques from computational topology. The segmentations obtained clearly delineate key objects in the images and are good initial indicators of the effectiveness of the method. There are several directions of followup for this algorithm. These include studying different orderings for merging the d-triangles and p-transient regions, and using alternate feature characteristics for region merging (as opposed to just average region color value). Significant improvement could also potentially be achieved in using machine learning methods to help identify appropriate parameterizations of α and p according to the characteristics of the image and edge features. Another improvement is to utilize more information from the edge detection algorithms. The current algorithm only uses the magnitude of the estimated gradient. The direction of the gradient can also be incorporated in finding an anisotropic Delaunay triangulation. These triangulations can better represent the regions divided by the edge information.
References

1. Canny, J.: A Computational Approach To Edge Detection. IEEE Trans. Pattern Analysis and Machine Intelligence 8, 679–714 (1986)
2. Edelsbrunner, H.: Algorithms in Combinatorial Geometry. Springer, New York (1987)
3. Edelsbrunner, H.: The Union of Balls and its Dual Shape. In: Proceedings of the Ninth Annual Symposium on Computational Geometry, pp. 218–231 (1993)
4. Edelsbrunner, H., Letscher, D., Zomorodian, A.: Topological Persistence and Simplification. Discrete and Computational Geometry 28, 511–533 (2002)
5. Fortune, S.: A Sweepline Algorithm for Voronoi Diagrams. Algorithmica 2, 153–174 (1987)
6. Gevers, T., Smeulders, A.W.M.: Combining Region Splitting and Edge Detection through Guided Delaunay Image Subdivision. In: Proc. of the 1997 International Conference on Computer Vision and Pattern Recognition, pp. 1021–1026 (1997)
7. Guibas, L., Knuth, D., Sharir, M.: Randomized Incremental Construction of Delaunay and Voronoi Diagrams. Algorithmica 7, 381–413 (1992)
8. Mallat, S., Zhong, S.: Characterization of signals from Multiscale Edges. IEEE Trans. Patt. Anal. and Mach. Intell. 14, 710–732 (1992)
9. Massey, W.: A Basic Course in Algebraic Topology. Springer, Heidelberg (1991)
10. Prasad, L., Skourikhine, A.N.: Vectorized Image Segmentation via Trixel Agglomeration. In: Brun, L., Vento, M. (eds.) GbRPR 2005. LNCS, vol. 3434, pp. 12–22. Springer, Heidelberg (2005)
11. Stelldinger, P., Ullrich, K., Meine, H.: Topologically Correct Image Segmentation Using Alpha Shapes. In: Kuba, A., Nyúl, L.G., Palágyi, K. (eds.) DGCI 2006. LNCS, vol. 4245, pp. 542–554. Springer, Heidelberg (2006)
Image Modeling and Segmentation Using Incremental Bayesian Mixture Models Constantinos Constantinopoulos and Aristidis Likas Department of Computer Science, University of Ioannina, GR 45110 Ioannina, Greece [email protected], [email protected]
Abstract. Many image modeling and segmentation problems have been tackled using Gaussian Mixture Models (GMM). The two most important issues in image modeling using GMMs are the selection of the appropriate low level features and the specification of the appropriate number of GMM components. In this work we deal with the second issue and present an approach for GMM-based image modeling employing an incremental variational algorithm for Bayesian GMM training that automatically specifies the number of mixture components. Experimental results on natural and texture images indicate that the method yields reasonable models without requiring the a priori specification of the number of components.
1 Introduction
Several approaches have been proposed for statistical image representation, i.e. the modeling of an image based on the distribution of various features at pixel or window level. Simpler approaches focus on the use of histograms, while more sophisticated statistical modeling tools such as mixture models have also been considered [1]. More specifically, the assumption under the GMM framework is that an image is a set of regions (segments), where each region is represented by a Gaussian distribution and the set of all regions in an image is represented by a GMM. To build the GMM for an image, first a dataset $X = \{f_n\}$ is constructed that contains one feature vector $f_n \in \mathbb{R}^d$ either for each image pixel $n$ (as is the case in this work) or for appropriately selected image windows (patches). Typical low level features are related to color and texture. Then an efficient training method is applied to the dataset $X$ in order to obtain the final GMM for the image. Let $g$ be a mixture with $M$ Gaussian components

$$g(f) = \sum_{j=1}^{M} \pi_j \, \mathcal{N}(f; \mu_j, T_j) \qquad (1)$$

where $\pi = \{\pi_j\}$ are the mixing coefficients (priors), $\mu = \{\mu_j\}$ the means (centers) of the components, and $T = \{T_j\}$ the precision (inverse covariance) matrices.
This work was partially supported by Interreg IIIA (Greece-Italy) grant I2101005.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 596–603, 2007. © Springer-Verlag Berlin Heidelberg 2007
After training, it is possible to segment the image, i.e. to assign a group of pixels to each GMM component, by finding for the feature vector $f$ of each pixel the component $j$ with maximum posterior (i.e. with maximum $\pi_j \mathcal{N}(f; \mu_j, T_j)$). An issue to be considered is how to impose the requirement for spatial smoothness, i.e. that in most cases neighboring pixels should be assigned to the same component (cluster). Two approaches have been proposed. The first imposes an MRF prior on the posteriors [2]. The MRF approach is more difficult to handle from the learning point of view, and requires all image pixels to be included in the training set. It provides as outcome the posterior probabilities for each pixel. In addition, no effective method has been proposed for automatically determining the number of GMM components in the MRF framework. In this work we focus on the second way to impose spatial smoothness [1]. According to this approach, the spatial location $(x, y)$ of a pixel is considered as an additional feature to be included in the feature vector $f$ describing this pixel, for example $f_n = (L, a, b, x, y)_n$ for the $n$-th pixel with image coordinates $(x, y)$ and color vector $(L, a, b)$. Thus the spatial distance between two pixels contributes significantly to the total distance between the corresponding feature vectors. In this way, adjacent pixels tend to have similar cluster labels since their spatial distance is small, and spatial smoothing is achieved. On the other hand, this approach forces large image areas (with approximately the same color) that could be considered as one segment to be split into smaller segments: spatially distant pixels make the distance between the corresponding feature vectors large, so they cannot be modeled by a single Gaussian component. However, this fragmentation problem can be easily resolved through a simple postprocessing stage that merges image regions corresponding to GMM components with similar mean feature values.
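As a minimal sketch of the labelling rule described above, the following Python snippet assigns a feature vector to the component with the largest $\pi_j \mathcal{N}(f; \mu_j, T_j)$. It assumes diagonal precision matrices for brevity (the model in the text uses full matrices), and the function names, component parameters and pixel values are hypothetical, chosen only to show how the $(x, y)$ coordinates separate two regions of similar color.

```python
import math

def log_gaussian_diag(f, mean, precision):
    """Log density of a Gaussian whose precision matrix is diagonal (an assumption)."""
    return sum(
        0.5 * math.log(t / (2.0 * math.pi)) - 0.5 * t * (x - m) ** 2
        for x, m, t in zip(f, mean, precision)
    )

def map_label(f, weights, means, precisions):
    """Index j maximising pi_j * N(f; mu_j, T_j), evaluated in the log domain."""
    scores = [
        math.log(w) + log_gaussian_diag(f, m, t)
        for w, m, t in zip(weights, means, precisions)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Two hypothetical components over f = (L, a, b, x, y):
# same mean color, different image location.
weights = [0.5, 0.5]
means = [(50.0, 0.0, 0.0, 10.0, 10.0), (50.0, 0.0, 0.0, 90.0, 90.0)]
precisions = [(0.01, 0.01, 0.01, 0.01, 0.01)] * 2

pixels = [(52.0, 1.0, -1.0, 12.0, 8.0), (49.0, 0.0, 2.0, 85.0, 95.0)]
labels = [map_label(f, weights, means, precisions) for f in pixels]
```

Because both components share the same mean color, the label is decided by the spatial coordinates alone, which is the smoothing effect (and the source of the fragmentation) discussed in the text.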
In addition, to further enforce spatial smoothness, it is possible to apply the MRF-based GMM model afterwards, starting from the GMM solution provided by the second approach. The advantage of using the second approach for spatial smoothness is that it results in a GMM that takes as input the image location $(x, y)$ and can assign feature labels (for example color) to every location $(x, y)$, independent of whether this pixel has been used for training or not. This has three interesting consequences. First, it is easy to obtain a model for a specific region of the image by considering only the mixture components that are active in this region. For example, we can easily derive a mixture model for the color density in an arbitrary image region. Second, it is straightforward to marginalize the $(x, y)$ coordinates and obtain a GMM for the distribution of the low level features. Finally, it is easy to assign segment labels to image locations $(x, y)$ not used for training, simply by computing the component $j$ with highest posterior. This allows the mixture model to be trained using only a subset of the pixels. Despite those advantages, a major difficulty is introduced that relates to the specification of the number of mixture components. For example, it is possible that an image with four colors cannot be modeled using a GMM with four components, especially in the case where the image segments with the same color have large size or are disconnected. In this case the GMM requires more components to accurately model the image. Therefore it is difficult to specify in advance the number of GMM components, and it is essential to use GMM training methods that incorporate a built-in mechanism for automatically assessing the number of components. This is the case with our method [3], which is used in this work and is described next.
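The marginalization of the $(x, y)$ coordinates mentioned above can be written out explicitly. The block notation below is introduced here for illustration and does not appear in the original: each feature vector is split as $f = (c, s)$, with $c$ the low level features and $s = (x, y)$, and the mean and precision of component $j$ are partitioned accordingly. The result is the standard marginal of a Gaussian expressed in precision form.

```latex
\mu_j = \begin{pmatrix} \mu_j^{c} \\ \mu_j^{s} \end{pmatrix},
\qquad
T_j = \begin{pmatrix} T_j^{cc} & T_j^{cs} \\ T_j^{sc} & T_j^{ss} \end{pmatrix},
\qquad
\int \mathcal{N}(f; \mu_j, T_j)\, ds
  = \mathcal{N}\!\left(c;\ \mu_j^{c},\ \tilde{T}_j\right),
\quad
\tilde{T}_j = T_j^{cc} - T_j^{cs} \left(T_j^{ss}\right)^{-1} T_j^{sc},
\\[6pt]
g(c) = \sum_{j=1}^{M} \pi_j\, \mathcal{N}\!\left(c;\ \mu_j^{c},\ \tilde{T}_j\right).
```

Equivalently, the covariance of the marginal is the $c$-block of $T_j^{-1}$; the mixing coefficients $\pi_j$ are unchanged.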
2 The Incremental Variational Bayesian Method
In this section we describe a Bayesian method for Gaussian mixture learning [3] that is deterministic, does not depend on the initialization, and adequately resolves the model selection problem, i.e. the specification of the number of components. The method is an incremental one: it starts with one component and progressively adds components to the model. The procedure for component addition is based on a splitting test applied to each of the existing mixture components. According to this test, a component is split into two sub-components and then variational Bayesian learning is applied to the specific pair of components, while the remaining components stay “fixed”. To apply this method, it is necessary to define a Bayesian Gaussian mixture model by imposing priors on the parameters $\pi$, $\mu_j$ and $T_j$. The effect of variational Bayesian learning is that a competition takes place between the two sub-components. If the data distribution in the region of the split component strongly suggests the existence of more than one cluster, then both sub-components will “survive” and the number of model components will increase. Otherwise, the competition between the two components will cause one of them to be eliminated and the initial component will be recovered. This strategy of incremental component addition also facilitates the specification of the parameters of the priors, since it can be based on the parameters of the component to be split. In order to apply this idea, a modification of the typical Bayesian mixture model [4] is required, which is described in the graphical model of Figure 1 and is explained next. Assume that we wish to restrict the competition to a subset containing $s$ of the GMM components while the other $M - s$ components are “fixed”. The proposed modification is to impose a prior only on the $M - s$ “fixed” mixing coefficients $\tilde{\pi}$. Let $X = \{f_n\}$ be the set of training points containing the feature vectors of the image pixels.
The hidden variables $Z = \{z_{jn}\}$ capture the missing information of which component has generated a given data point $f_n$. More specifically, $z_{jn} = 1$ if component $j$ is responsible for generating $f_n$, otherwise $z_{jn} = 0$. Therefore it holds that

$$p(X \mid Z, \mu, T) = \prod_{n=1}^{N} \prod_{j=1}^{M} \left[ \mathcal{N}(f_n; \mu_j, T_j) \right]^{z_{jn}} \qquad (2)$$
The distribution of $Z$ is a product of multinomials

$$p(Z \mid \pi, \tilde{\pi}) = \prod_{n=1}^{N} \left[ \prod_{j=1}^{s} \pi_j^{z_{jn}} \prod_{j=s+1}^{M} \tilde{\pi}_j^{z_{jn}} \right] \qquad (3)$$

given the subset $\tilde{\pi} = \{\tilde{\pi}_j\}$ of “fixed” mixing coefficients and the subset $\pi = \{\pi_j\}$ of “free” mixing coefficients. For notational convenience and assuming $M$ mixing components, we can always rearrange the indexes so that the first $s$ components are the “free” ones. The typical Bayesian framework assumes conjugate Dirichlet priors over the entire set of mixing coefficients. Therefore, in order to define the modified Bayesian GMM, it is necessary to define the conditional joint distribution $p(\tilde{\pi} \mid \pi)$ of the “fixed” mixing coefficients given the “free” ones, which can be shown to be a nonstandard Dirichlet with parameters $\alpha_j$ ($j = s+1, \ldots, M$):

$$p(\tilde{\pi} \mid \pi) = \left( 1 - \sum_{j=1}^{s} \pi_j \right)^{-M+s} \frac{\Gamma\!\left( \sum_{j=s+1}^{M} \alpha_j \right)}{\prod_{j=s+1}^{M} \Gamma(\alpha_j)} \times \prod_{j=s+1}^{M} \left( \frac{\tilde{\pi}_j}{1 - \sum_{k=1}^{s} \pi_k} \right)^{\alpha_j - 1} \qquad (4)$$

Fig. 1. The graphical model
It must be emphasized that we do not impose a prior on the “free” mixing coefficients. To complete the specification of the Bayesian model, we assume Gaussian and Wishart priors for $\mu$ and $T$ respectively [4]:

$$p(\mu) = \prod_{j=1}^{s} \mathcal{N}(\mu_j \mid 0, \beta I) \qquad (5)$$

$$p(T) = \prod_{j=1}^{s} \mathcal{W}(T_j \mid \nu, V). \qquad (6)$$
Learning in the Bayesian framework can be achieved through maximization of the marginal likelihood of $X$ given $\pi$, which is obtained by integrating out the hidden variables $\theta = \{Z, \mu, T, \tilde{\pi}\}$ as follows:

$$p(X \mid \pi) = \sum_{Z} \int p(X, \theta \mid \pi) \, d\mu \, dT \, d\tilde{\pi}. \qquad (7)$$
Since this integral is intractable, the Variational Bayes methodology is adopted, where we maximize a lower bound $\mathcal{L}$ of the logarithmic marginal likelihood $\log p(X \mid \pi)$:

$$\mathcal{L}[q, \pi] = \sum_{Z} \int q(\theta) \log \frac{p(X, \theta \mid \pi)}{q(\theta)} \, d\theta \qquad (8)$$
where $q$ is an arbitrary distribution that approximates the posterior distribution $p(\theta \mid X, \pi)$. The maximization of $\mathcal{L}$ is performed iteratively, where at each iteration two steps take place (in analogy to the EM approach): first maximization of the bound with respect to $q$, and subsequently maximization of the bound with respect to $\pi$. To implement the maximization with respect to $q$, the mean-field approximation [4] has been adopted, which assumes that $q$ is constrained to be a product of the form $q(\theta) = q_Z(Z)\, q_\mu(\mu)\, q_T(T)\, q_{\tilde{\pi}}(\tilde{\pi})$. The resulting update equations for the parameters of the $q$ distributions (E-step) and the parameters $\pi$ (M-step) are omitted here and are described in detail in [3].

Using the above idea, the incremental algorithm for constructing the GMM proceeds as follows. Mixture components are sequentially added to the mixture model using the following component splitting procedure: one of the mixture components is selected and is appropriately split into two components. The resulting two components are considered “free” and the rest “fixed”, according to the terminology introduced previously. Next we set the precision prior $p(T)$ based on the characteristics of the split component, and apply variational learning as described previously. If the two components provide a much better fit to the data in their region, then both components are retained in the mixture model; otherwise the update equations will eliminate one of them. The splitting test is applied sequentially to all components and the method terminates when all mixture components have been unsuccessfully tested for splitting. Whenever a successful split is encountered, the number of mixture components increases and a new round of split tests for all components is initiated.

Fig. 2. Four steps of the incremental training procedure. The expected covariance w.r.t. the Wishart prior is depicted with a dashed line. (a) An intermediate solution with 5 components. (b) One component is split in two. (c) The mixture after variational learning. (d) Another component is selected and split.

Fig. 3. Segmentation of (a) an artificial image, (b) and (c) natural images from BSDS. Top row: original images. Bottom row: segmented images.

To illustrate the details of splitting, assume that some component $\hat{j}$ has to be split, with density $\mathcal{N}(f; \mu_{\hat{j}}, T_{\hat{j}})$. In order to form the new mixture, we remove component $\hat{j}$ and insert two new components with densities $\mathcal{N}(f; \mu_{\hat{j}1}, T_{\hat{j}1})$ and $\mathcal{N}(f; \mu_{\hat{j}2}, T_{\hat{j}2})$ respectively. We have chosen to place the centers of the two components along the principal axis of the covariance $T_{\hat{j}}^{-1}$, in opposite directions with respect to the center $\mu_{\hat{j}}$. The mixing coefficients of the two components are set equal, $\pi_{\hat{j}1} = \pi_{\hat{j}2} = \pi_{\hat{j}}/2$, and their parameters are set according to $\mu_{\hat{j}1} = \mu_{\hat{j}} + \sqrt{\lambda}\, u$, $\mu_{\hat{j}2} = \mu_{\hat{j}} - \sqrt{\lambda}\, u$, $T_{\hat{j}1} = T_{\hat{j}}$ and $T_{\hat{j}2} = T_{\hat{j}}$, where $\lambda$ is the maximum eigenvalue of $T_{\hat{j}}^{-1}$ and $u$ the corresponding eigenvector. An important issue in the proposed method is the specification of the scale parameter $V$ of the prior $\mathcal{W}(\nu, V)$ over the precision matrices, based on the split component. We set $\nu = d$ (which is the minimum allowed value) and $V = \nu \lambda I$, where $\lambda$ is the highest eigenvalue of $T_{\hat{j}}^{-1}$. The value of $\beta$ was set to $10^{-10}$. An example of component splitting is illustrated in Figure 2.
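The splitting rule above (halve the mixing coefficient, displace the two centers by $\pm\sqrt{\lambda}\,u$ along the principal axis) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the covariance $T_{\hat{j}}^{-1}$ is passed in directly rather than obtained by inverting the precision, and a plain power iteration stands in for a proper eigen-solver (it assumes the start vector is not orthogonal to the principal axis).

```python
import math

def principal_axis(cov, iters=200):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix.

    Uses power iteration from the normalised all-ones vector (assumption:
    that vector is not orthogonal to the principal eigenvector).
    """
    d = len(cov)
    u = [1.0 / math.sqrt(d)] * d
    lam = 0.0
    for _ in range(iters):
        v = [sum(cov[i][k] * u[k] for k in range(d)) for i in range(d)]
        lam = math.sqrt(sum(x * x for x in v))
        if lam == 0.0:
            break
        u = [x / lam for x in v]
    return lam, u

def split_component(weight, mean, cov):
    """Return the two children (weight, mean) of a split component.

    Both children keep the parent's precision matrix, so only the
    mixing coefficients and the centers are computed here.
    """
    lam, u = principal_axis(cov)
    step = math.sqrt(lam)
    mean1 = [m + step * ui for m, ui in zip(mean, u)]
    mean2 = [m - step * ui for m, ui in zip(mean, u)]
    return (weight / 2.0, mean1), (weight / 2.0, mean2)

# Hypothetical component: covariance elongated along the first axis.
child1, child2 = split_component(0.5, [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]])
```

In this toy case the largest eigenvalue is 4, so the two centers are placed two units away from the parent mean along the first coordinate, each with mixing coefficient 0.25.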
Fig. 4. Segmentation using the texture feature: (a) and (b) artificial images, (c) BSDS image. Top row: original images. Bottom row: segmented images.
3 Experimental Results
To illustrate the performance of the proposed method, we have conducted experiments using both artificially generated and natural images. For each image the following steps were taken. First the dataset containing the feature vectors for a subset of 5000 pixels was constructed, and next the feature vectors were preprocessed so that each feature distribution has zero mean and unit standard deviation. The resulting dataset was then used to build the GMM for the image. Using the resulting GMM, the “segmented” image is produced. In the case where color features are used, the segmented image is produced by assigning to each pixel $(x, y)$ the mean color value of the corresponding GMM component. In the case of texture images we use an arbitrary different color for each segment. For segmentation using artificial color features we used the $(L, a, b)$ representation along with the $(x, y)$ coordinates. Figure 3(a) illustrates the segmentation result using an artificial color image. Figures 3(b) and 3(c) display segmentation results with two natural images from the Berkeley Segmentation Data Set (BSDS) [5]. For these images we also added a texture feature, namely the polarity feature $pl$ [1], to the feature vector describing a pixel, i.e. $f_n = (L, a, b, pl, x, y)_n$. Finally, Figure 4 provides results for two artificial texture images and one BSDS image, using only the polarity feature, i.e. $f_n = (pl, x, y)_n$.
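The preprocessing described above (take a subset of 5000 pixels, then rescale every feature to zero mean and unit standard deviation) might look as follows. The text does not say how the subset was chosen, so uniform random sampling with a fixed seed is an assumption here, as are the function names and the synthetic feature values.

```python
import math
import random

def subsample(features, size, seed=0):
    """Pick `size` feature vectors uniformly at random (sampling scheme is an assumption)."""
    if len(features) <= size:
        return list(features)
    return random.Random(seed).sample(features, size)

def standardize(features):
    """Rescale each feature dimension to zero mean and unit (population) standard deviation."""
    n, d = len(features), len(features[0])
    means = [sum(f[i] for f in features) / n for i in range(d)]
    stds = [
        math.sqrt(sum((f[i] - means[i]) ** 2 for f in features) / n)
        for i in range(d)
    ]
    return [
        tuple((f[i] - means[i]) / stds[i] if stds[i] > 0 else 0.0 for i in range(d))
        for f in features
    ]

# Synthetic stand-in for a 100 x 100 image with f = (L, a, b, x, y) per pixel.
pixels = [
    (float((x * y) % 100), float(x % 7), float(y % 5), float(x), float(y))
    for y in range(100)
    for x in range(100)
]
data = standardize(subsample(pixels, 5000))
```

Standardizing after subsampling means the statistics are those of the training subset; pixels outside the subset would need the same means and standard deviations applied before they are labelled.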
From the experimental results it is clear that, although only a subset of the pixels is used for GMM training, the method yields segmentations of satisfactory quality that are spatially smooth, and it estimates the number of image segments reasonably well, without producing significant oversegmentation.
4 Conclusions
We have presented an efficient approach for image modeling and segmentation using GMMs that is based on a recently proposed Bayesian technique for estimating the components of a GMM [3]. The approach is fully automatic, makes no assumptions regarding the required number of GMM components and provides solutions that are spatially smooth. Future work will focus on a more systematic testing of the method in the case where other texture-related features are used and in the case where feature vectors are defined at the patch level instead of the pixel level. Also we plan to test the efficiency of the method in image retrieval tasks where GMMs are used as models of the stored images and the query image.
References
1. Carson, C., Belongie, S., Greenspan, H., Malik, J.: Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(8), 1026–1038 (2002)
2. Li, S.Z.: Markov Random Field Modelling in Computer Vision. Springer, Heidelberg (2001)
3. Constantinopoulos, C., Likas, A.: Bayesian gaussian mixture learning based on variational component splitting. IEEE Transactions on Neural Networks 18(3), 745–755 (2007)
4. Corduneanu, A., Bishop, C.M.: Variational Bayesian model selection for mixture distributions. In: Artificial Intelligence and Statistics 2001, pp. 27–34. Morgan Kaufmann, San Francisco (2001)
5. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. 8th Int’l Conf. Computer Vision, vol. 2, pp. 416–423 (2001)
Model-Based Segmentation of Multimodal Images Xin Hong, Sally McClean, Bryan Scotney, and Philip Morrow School of Computing and Information Engineering, University of Ulster Cromore Road, Coleraine, BT52 1SA, Northern Ireland, UK {x.hong, si.mcclean, bw.scotney, pj.morrow}@ulster.ac.uk
Abstract. This paper proposes a model-based method for intensity-based segmentation of images acquired from multiple modalities. Pixel intensity within a modality image is represented by a univariate Gaussian distribution mixture in which the components correspond to different segments. The proposed Multi-Modality Expectation-Maximization (MMEM) algorithm then estimates the probability of each segment along with the parameters of the Gaussian distributions for each modality by maximum likelihood using the Expectation-Maximization (EM) algorithm. Multimodal images are simultaneously involved in the iterative parameter estimation step. Pixel classes are determined by maximising the a posteriori probability contributed by all multimodal images. Experimental results show that the method exploits and fuses complementary information of multimodal images. Segmentation can thus be more precise than when using single-modality images.
Keywords: data fusion, multimodal images, model-based segmentation, Gaussian mixture, maximum likelihood, EM algorithm.
1 Introduction
Image segmentation is the process of separating an image into meaningful, homogeneous regions for further analysis. A single modality may provide only limited and possibly inaccurate information about the image scene. Multimodal images often provide richer and more reliable information about the environment in which the modalities operate [10], so the use of multimodal images can help improve segmentation accuracy. Data fusion is the process of combining information from different sources or modalities to get an optimal estimation of objects. A number of approaches [2,3,5,10,11] have been presented to combine multimodality imaging information based on a data fusion framework: the Dempster-Shafer theory of evidence (DS theory). The advantage of using DS theory in data fusion lies in its ability to deal with ignorance and imprecision. However, the derivation of the mass function for uncertainty measurements has attracted criticism regarding its use in practical applications such as image processing. In this paper, we propose a model-based method for segmenting images of the same object acquired from multiple modalities. Pixel intensity within a modality image is modelled by a univariate Gaussian distribution mixture in which components correspond to different clusters or segments. Methods such as BIC [9] can be used to determine the number of segments contained in an image, and therefore we assume this to be known here. The Expectation-Maximization (EM) algorithm is adopted to estimate the parameters of normal distribution mixtures by maximum likelihood. The main advantage of our method is that pixels in all modality images are involved in estimating the class distributions, which in turn contribute to the modality-specific class parameter estimates (means and variances). Segment assignment also takes into account the contributions of all modalities. Our approach is based on well-established principles, namely probability models and maximum likelihood estimation, using the EM algorithm. As such it has the advantage of being underpinned by mathematical theory, as opposed to previous ad hoc approaches. The remainder of this paper is organised as follows. In section 2, we present a univariate Gaussian mixture model for multimodal sensed images. In section 3, we propose the Multi-Modality Expectation-Maximization (MMEM) algorithm for segmentation of multimodal images. Results of experiments on synthetic images and simulated MS Lesion brain images are given in section 4. In section 5, we discuss the segmentation results of section 4 and consider the problems our approach may encounter. We also outline possible improvements for the MMEM algorithm. Finally, conclusions are presented in section 6.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 604–611, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Gaussian Distribution Model of Images
In image segmentation each pixel of an image is considered as an observation and pixel intensity is considered to be a feature component of the observation.

2.1 Univariate Gaussian Distribution Mixture Model
A univariate Gaussian distribution is modelled by two parameters: mean $\mu$ and variance $\sigma^2$. For an image $x$ that consists of $N$ one-dimensional observations, $x = (x_1, \ldots, x_N)$, following a Gaussian distribution, the Gaussian probability of pixel intensity $x_n$, denoted as $g_\theta(x_n)$ ($\theta = \{\mu, \sigma^2\}$), is given by

$$g_\theta(x_n) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{(x_n - \mu)^2}{2\sigma^2} \right\}$$

It is usual that pixels in an image are best represented by a mixture of two or more univariate Gaussian distributions rather than a single distribution, where the mixture components can be thought of as different clusters or segments. Let $K$ be the number of Gaussian mixture components $g_k$ with mean $\mu_k$ and variance $\sigma_k^2$, $k = 1, \ldots, K$. The probability distribution of $x_n$, $n = 1, \ldots, N$, denoted as $gm(x_n)$, is given by

$$gm(x_n) = \sum_{k=1}^{K} \lambda_k \, g_k(x_n).$$
where $\lambda_k$ are weighting parameters that can be thought of as prior probabilities for the Gaussian mixture, satisfying:
(1) $\lambda_k \geq 0$, $k = 1, \ldots, K$
(2) $\sum_{k=1}^{K} \lambda_k = 1$.
The pixels in an image are commonly assumed to be independently and identically distributed. The probability of image $x$ with pixel intensities $(x_1, \ldots, x_N)$ can then be computed as follows:

$$gm(x) = \prod_{n=1}^{N} gm(x_n) = \prod_{n=1}^{N} \sum_{k=1}^{K} \lambda_k \cdot g_k(x_n).$$

2.2 Joint Probability Distribution of Multimodal Images
Observing an image scene under $M$ modalities would give rise to $M$ ideal observed images, $\{x_1, \ldots, x_M\}$. By ideal images, we mean that all images are exactly spatially registered, i.e., the pixel positions of all images correspond to the same location in the real world. Let $m$ be the index of the observed image provided by the $m$th modality, denoted as $x_m$, $x_m = \{x_{mn} : n = 1, \ldots, N\}$, $m = 1, \ldots, M$. Each of the $M$ images follows a Gaussian mixture distribution $gm_m$ on the parameters $(\lambda, \theta_m)$ with $\lambda = \{\lambda_k : k = 1, \ldots, K\}$, $\theta_m = \{\theta_{mk} : k = 1, \ldots, K\} = \{\mu_{mk}, \sigma_{mk}^2 : k = 1, \ldots, K\}$. We assume that how the scene is observed by a modality will not be affected by other modalities observing the same image scene. The probability that the observed images take on the value $(x_1, \ldots, x_M)$ given the parameters $(\lambda, \theta) = (\lambda, \theta_1, \ldots, \theta_M)$ can be estimated as

$$P(x_1, \ldots, x_M \mid \theta, \lambda) = \prod_{m=1}^{M} gm_m(x_m).$$
This probability, viewed as a function of the parameters $\theta$ and $\lambda$, represents the likelihood of the parameters given the images $x_1, \ldots, x_M$, conventionally denoted as $L(\theta, \lambda \mid x_1, \ldots, x_M)$.
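Combining the product over modalities with the per-image mixture likelihood of Section 2.1 gives $\ln L = \sum_{m} \sum_{n} \ln \sum_{k} \lambda_k\, g_{mk}(x_{mn})$. The short Python sketch below evaluates this quantity; the function names and the toy intensities are hypothetical, and working with the logarithm rather than the raw product is a numerical convenience added here.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density with mean `mean` and variance `var`."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def joint_log_likelihood(images, weights, means, variances):
    """ln L(theta, lambda | x_1..x_M) for registered images.

    images[m][n]   : intensity of pixel n in modality m
    weights[k]     : class priors lambda_k, shared by all modalities
    means[m][k], variances[m][k] : modality-specific Gaussian parameters
    """
    total = 0.0
    for x_m, mu_m, var_m in zip(images, means, variances):
        for x in x_m:
            total += math.log(sum(
                lam * gaussian_pdf(x, mu, var)
                for lam, mu, var in zip(weights, mu_m, var_m)
            ))
    return total

# Two hypothetical modalities, four pixels, two classes.
images = [[0.1, -0.2, 5.1, 4.9], [2.0, 2.1, 8.0, 7.9]]
weights = [0.5, 0.5]
unit_var = [[1.0, 1.0], [1.0, 1.0]]
ll_good = joint_log_likelihood(images, weights, [[0.0, 5.0], [2.0, 8.0]], unit_var)
ll_poor = joint_log_likelihood(images, weights, [[2.5, 2.5], [5.0, 5.0]], unit_var)
```

Parameters whose means sit on the two intensity clusters of each modality yield a larger value than means placed between the clusters, which is the quantity the estimation procedure of the next section seeks to maximise.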
3 Model-Based Segmentation
Let an image scene contain $K$ distinct classes, labelled $\{w_1, \ldots, w_K\}$. Image segmentation can be considered as the process of assigning each pixel of the image to one of $K$ classes. Model-based segmentation is concerned with fitting to the data a mixture of normal distributions represented by the parameters mean $\mu$, variance $\sigma^2$ and class probability $\lambda$. Since an unlabelled pixel has no class label, the observed pixel intensity is said to be incomplete in representing the image pixel. The EM algorithm [4] is the most general and effective algorithm for solving maximum likelihood estimation problems for the parameters (mean $\mu$, variance $\sigma^2$ and class probability $\lambda$) when the observation data are subject to incompleteness. For examples of its application to single modality clustering, see [1,8,9]. Following this convention we propose the MultiModal EM algorithm (MMEM), which uses the EM iterations to find the maximum joint likelihood estimate of the parameters of the original distributions from multimodal images. Within each iteration of the MMEM algorithm the E-Step is concerned with finding the posterior probability of each pixel belonging to a particular class given the current estimates of $\mu$, $\sigma^2$ and $\lambda$, whereas the M-Step recalculates the parameters based on the new posterior probabilities estimated in the E-Step and the joint likelihood. Iterations continue until a predefined convergence condition is satisfied. After the optimal parameters have been estimated, a pixel is assigned to the class with the greatest a posteriori probability. The detail of the algorithm is given below. Parameters estimated at the $t$th iteration are marked by a superscript $(t)$.

MMEM Algorithm:
Begin
1 Initialise the parameters.
2 E-Step: compute the posterior probabilities, with $g_{mk}$ evaluated at the current estimates of $\mu_{mk}$ and $\sigma_{mk}^2$.

$$P(w_k \mid x_{mn}) = \frac{\lambda_k^{(t)} \cdot g_{mk}(x_{mn})}{\sum_{k'=1}^{K} \lambda_{k'}^{(t)} \, g_{mk'}(x_{mn})}$$

3 M-Step: update the parameters.

$$\lambda_k^{(t+1)} = \frac{1}{MN} \sum_{n=1}^{N} \sum_{m=1}^{M} P(w_k \mid x_{mn})$$

$$\mu_{mk}^{(t+1)} = \frac{\sum_{n=1}^{N} x_{mn} \, P(w_k \mid x_{mn})}{\sum_{n=1}^{N} P(w_k \mid x_{mn})}$$

$$(\sigma^2)_{mk}^{(t+1)} = \frac{\sum_{n=1}^{N} P(w_k \mid x_{mn}) \, (x_{mn} - \mu_{mk}^{(t+1)})^2}{\sum_{n=1}^{N} P(w_k \mid x_{mn})}$$

4 Repeat steps 2 and 3 until the change in $\ln L$ falls below the threshold, i.e. until convergence is reached.
5 Segmentation step: determine the class label for each pixel in the output image.

$$y_n = \arg\max_{w_k} \left( \frac{1}{M} \sum_{m=1}^{M} P(w_k \mid x_{mn}) \right)$$

End.

This algorithm generalises previous use of the EM algorithm for image segmentation to the multimodal problem. It allows us to combine data from different sources in an intuitive and principled manner, with resulting potential for classification improvement.
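The five steps above translate fairly directly into code. The following pure-Python sketch is one possible reading of them, not the authors' implementation: the function names, the variance floor `min_var`, the iteration cap, the hand-picked initial parameters and the nine-pixel toy images are all assumptions made for illustration. The toy data imitate the complementary case studied later: modality A cannot tell classes 1 and 2 apart, modality B cannot tell classes 0 and 1 apart.

```python
import math

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density with mean `mean` and variance `var`."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(images, weights, means, variances):
    """Step 2: return post[m][n][k] = P(w_k | x_mn) and the joint ln L."""
    post, log_lik = [], 0.0
    for x_m, mu_m, var_m in zip(images, means, variances):
        rows = []
        for x in x_m:
            joint = [lam * gaussian_pdf(x, mu, var)
                     for lam, mu, var in zip(weights, mu_m, var_m)]
            total = sum(joint)
            log_lik += math.log(total)
            rows.append([j / total for j in joint])
        post.append(rows)
    return post, log_lik

def mmem(images, weights, means, variances, max_iter=100, tol=1e-6, min_var=1e-6):
    """Sketch of MMEM; `min_var` is a safeguard that is not part of the text."""
    M, N, K = len(images), len(images[0]), len(weights)
    weights = list(weights)
    means = [list(row) for row in means]
    variances = [list(row) for row in variances]
    prev = None
    for _ in range(max_iter):
        post, log_lik = e_step(images, weights, means, variances)
        if prev is not None and abs(log_lik - prev) < tol:  # step 4
            break
        prev = log_lik
        for k in range(K):  # step 3
            weights[k] = sum(post[m][n][k] for m in range(M) for n in range(N)) / (M * N)
            for m in range(M):
                resp = sum(post[m][n][k] for n in range(N))
                means[m][k] = sum(post[m][n][k] * images[m][n] for n in range(N)) / resp
                variances[m][k] = max(min_var, sum(
                    post[m][n][k] * (images[m][n] - means[m][k]) ** 2
                    for n in range(N)) / resp)
    # Step 5: label each pixel by the posterior averaged over modalities.
    post, _ = e_step(images, weights, means, variances)
    labels = [max(range(K), key=lambda k: sum(post[m][n][k] for m in range(M)) / M)
              for n in range(N)]
    return labels, weights, means, variances

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
image_a = [9.0, 10.0, 11.0, 49.0, 50.0, 51.0, 49.0, 50.0, 51.0]
image_b = [19.0, 20.0, 21.0, 19.0, 20.0, 21.0, 59.0, 60.0, 61.0]
labels, lam, mu, var = mmem(
    [image_a, image_b],
    weights=[1 / 3, 1 / 3, 1 / 3],
    means=[[10.0, 50.0, 50.0], [20.0, 20.0, 60.0]],
    variances=[[4.0] * 3, [4.0] * 3],
)
```

In this toy run neither image alone distinguishes all three classes, yet the averaged posteriors of step 5 do, because each modality is confident exactly where the other is ambiguous. The sketch does not guard against a class receiving zero total responsibility in some modality, which a practical implementation would need to handle.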
4 Experiments
To evaluate the proposed method, experiments have been carried out on both synthetically generated images and simulated MS Lesion brain images.

4.1 Synthetic Images
Two images (Fig. 1a, 1b) were generated to simulate strong and weak X-ray acquisitions, each of which contains four regions: three steps of increasing thickness from top to bottom, on a uniform background. Thickness here represents the density of a step, where thick steps are more effective in blocking X-rays than thin ones, and hence weak X-rays are more easily blocked by a dense (thick) step than strong X-rays. In the strong X-ray image (Fig. 1a), the thinnest step is overexposed and confused with the background. However, the two thicker steps are well distinguished. In the weak X-ray image (Fig. 1b), the greater thickness steps are underexposed but the step of the smallest thickness is clearly distinguished from the background. The ground-truth image is shown in Fig. 1c. The two images (Fig. 1a, 1b) providing complementary information on the four regions are used as input to our MMEM algorithm. The output image (Fig. 1f) shows the four regions are well identified and clearly distinguishable. In comparison, Fig. 1d and 1e show the output images resulting from segmenting the two original images individually by the EM algorithm. Table 1 shows class and overall accuracy measurements when compared to the ground-truth for the segmented images Fig. 1d, 1e and 1f respectively. The results from Fig. 1 and Table 1 indicate that the complementary information provided by the two sensing modalities has been well exploited by the MMEM segmentation.
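The text does not spell out how the class and overall accuracy figures are computed. One common reading, assumed here, is that class accuracy is the fraction of ground-truth pixels of a class that receive that class label, and overall accuracy is the fraction of all pixels labelled correctly. The sketch below also assumes that the segment labels have already been matched to the ground-truth classes; the function names and label lists are hypothetical.

```python
def class_accuracies(predicted, truth, classes):
    """Per-class accuracy: share of ground-truth pixels of each class labelled correctly."""
    result = {}
    for c in classes:
        members = [i for i, t in enumerate(truth) if t == c]
        result[c] = sum(1 for i in members if predicted[i] == c) / len(members)
    return result

def overall_accuracy(predicted, truth):
    """Share of all pixels whose label equals the ground truth."""
    return sum(1 for p, t in zip(predicted, truth) if p == t) / len(truth)

# Hypothetical label maps, flattened to lists.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 0, 0]
per_class = class_accuracies(predicted, truth, classes=[0, 1])
overall = overall_accuracy(predicted, truth)
```

Under this definition the overall figure is a class-size-weighted average of the class figures, so it equals their plain mean only when all classes contain the same number of pixels.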
Fig. 1. First synthetic images: (a) Strong X-ray image (b) Weak X-ray image (c) Ground-truth image (d) EM segmentation of the strong X-ray image (e) EM segmentation of the weak X-ray image (f) MMEM segmentation when using the two images together
Table 1. The accuracy figures for three segmented images

Segmented image    Class accuracy                           Overall accuracy
                   C1        C2        C3        C4
Fig. 1d            60.99%    69.15%    97.17%    97.74%     81.26%
Fig. 1e            99.98%    99.83%    41.35%    72.87%     78.15%
Fig. 1f            95.94%    97.70%    96.01%    97.67%     96.83%

4.2 Simulated MS Lesion Brain Images
Magnetic resonance imaging (MRI) is one application of multimodality imaging in medicine. Although the same imaging modality is used, multispectral acquisitions (T1, T2, PD, etc.) are performed under different operating conditions, resulting in different images. The MR simulator of the Montreal Neurological Institute [6] generates 3D brain images simulating T1-, T2- and PD-weighted imaging. Eleven tissue classes are modelled: background, cerebrospinal fluid (CSF), grey matter, white matter, fat, muscle/skin, skin, skull, glial matter, connective, and MS lesion. By varying scanning parameters, tissue contrast can be altered and enhanced in various ways to demonstrate different features. Fig. 2a–2c show a slice of the multiple-sclerosis (MS) lesion brain images generated by T1-, T2-, and PD-weighted simulations. In MR imaging of the brain, T1-weighting causes white matter to appear white, grey matter to appear grey, and CSF to appear dark. The contrast of “white matter”, “grey matter” and “CSF” is reversed using T2-weighted imaging. Proton-density (PD) imaging provides little contrast in normal subjects. When the three modality images are used together via our MMEM algorithm, the overall accuracy of segmentation (55.60%) is higher than that resulting from segmenting the T1- and PD-weighted images individually using the EM algorithm (50.10% and 45.29% respectively), but slightly lower than that of T2-weighted image segmentation (58.11%). However, unlike segmentation of the T2-weighted image, which is only able to identify some classes, the MMEM segmentation identifies all classes. Fig. 2e–2h show the segmented images, with the ground-truth image given in Fig. 2d.

Fig. 2. Simulated MS Lesion brain images. Top: T1-, T2-, PD-weighted images, Phantom image (as the ground truth). Bottom: segmentation of T1-, T2-, PD-weighted images individually by the EM algorithm, T1-, T2-, PD-weighted images together via the MMEM algorithm
5 Discussion
From Table 1, we can see that the overall accuracy of segmenting two modality images together via our MMEM method is much higher than that of segmenting a single modality image by the EM algorithm. However, we also observe that the class accuracy of segmenting these two images together using our method is slightly lower than that of segmenting the strong X-ray image on classes C3 and C4, and of segmenting the weak X-ray image on C1 and C2. In the current methodology, the computation of segment probabilities inherently weights all modalities equally. It might however be desirable to weight the credibility of modalities, thus influencing their contributions to the segmentation. In the future, we will investigate a method of weighting modality credibility, incorporating the weights into parameter estimation and the segmentation decision rule. Although brain tissues such as white matter, grey matter and CSF usually have a characteristic intensity, pixels around the brain often show an MR intensity that is very similar to brain tissue [7]. This can result in erroneous classification of small regions surrounding the brain, so segmentation may be poor. Additionally, the information provided by MR images may not be typical of the independent multimodality that our method is specifically developed to handle.
6 Conclusions
In this paper we have proposed a model-based method for segmentation of images acquired from multiple independent modalities. Model parameter estimation is based on the principle of Maximum Likelihood. Maximisation is iteratively performed by the EM algorithm. Experimental results using synthetic images and simulated MR brain images reveal that using multimodal images helps improve segmentation accuracy. It is also likely that averaging the probabilities contributed by the individual modalities, both when calculating the prior probability of clusters and when determining the segmentation class of a pixel, trades off against the performance of segmenting a single modality image. In the near future we plan to introduce credibility of a modality into parameter estimation and segmentation. We are also going to carry out further experiments on real-world images to investigate how well our method is suited to different imaging environments.
Model-Based Segmentation of Multimodal Images
References
1. Al Momani, B., Morrow, P., McClean, S.: Knowledge based semi-supervised satellite image classification. In: Proc. ISSPA 2007 (2007)
2. Bendjebbour, A., Delignon, Y., Fouque, L., Samson, V., Pieczynski, W.: Multisensor image segmentation using Dempster-Shafer fusion in Markov fields context. IEEE Trans. Geo. Re. Sensing 39(8), 1789–1798 (2001)
3. Boudraa, A.-O., Bentabet, L., Salzenstein, F., Guillon, L.: Dempster-Shafer's basic probability assignment based on fuzzy membership functions. Electronic Letters on Computer Vision and Image Analysis 4(1), 1–9 (2004)
4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistics Society B(1), 1–38 (1977)
5. Hégarat-Mascle, S., Bloch, I., Vidal-Madjar, D.: Introduction of neighborhood information in evidence theory and application to data fusion of radar and optical images with partial cloud cover. Patt. Recog. 31(11), 1811–1823 (1998)
6. Kwan, R.K.-S., Evans, A.C., Pike, G.B.: MRI simulation-based evaluation of image-processing and classification methods. IEEE Trans. Med. Imaging 18(11), 1085–1097 (1999), MRI simulator: http://www.bic.mni.mcgill.ca/brainweb/
7. Leemput, K.V., Maes, F., Vandermeulen, D., Suetens, P.: Automated model-based tissue classification of MR images of the brain. IEEE Trans. Med. Imaging 18(10), 897–908 (1999)
8. McClean, S., Scotney, B., Morrow, P., Greer, K.: Knowledge discovery by probabilistic clustering of distributed databases. Data Knowl. Eng. 54(2), 189–210 (2005)
9. Murtagh, F., Raftery, A.E., Starck, J-L.: Bayesian inference for multiband image segmentation via model-based cluster trees. Img. and Vis. Comp. 23, 587–596 (2005)
10. Salzenstein, F., Boudraa, A.-O.: Iterative estimation of Dempster-Shafer's basic probability assignment: application to multisensor image segmentation. Opt. Eng. 43(6), 1293–1299 (2004)
11. Zhu, Y.M., Bentabet, L., Dupuls, O., Kaftandjian, V., Babot, D., Rombaut, M.: Automatic determination of mass functions in Dempster-Shafer theory using fuzzy c-means and spatial neighborhood information for image segmentation. Opt. Eng. 41(4), 760–770 (2002)
Image Segmentation Based on Height Maps
Gabriele Peters¹ and Jochen Kerdels²
¹ University of Dortmund, Department of Computer Science - Computer Graphics, Otto-Hahn-Str. 16, D-44221 Dortmund, Germany
² DFKI - German Research Center for Artificial Intelligence, Robotics Lab, Robert Hooke Str. 5, D-28359 Bremen, Germany
Abstract. In this paper we introduce a new method for image segmentation. It is based on a height map generated from the input image. The height map characterizes the image content in such a way that the application of the watershed concept provides a proper segmentation of the image. The height map enables the watershed method to provide better segmentation results on difficult images, e.g., images of natural objects, than without the intermediate height map generation. Markers used for the watershed concept are generated automatically from the input data, holding the advantage of a more autonomous segmentation. In addition, we introduce a new edge detector which has some advantages over the Canny edge detector. We demonstrate our methods by means of a number of segmentation examples.
Keywords: segmentation, edge detection, watershed, height maps.
1 Introduction
In this paper we introduce a new method for image segmentation. It is based on a height map which is generated from the gray values of the input image. Among the different methods for image segmentation, morphological watersheds have some advantages [1]. They yield more stable results in comparison to other segmentation concepts such as detection of discontinuities, thresholding, or region processing. But they also have a drawback. Watersheds work on height level images, and the association with height maps refers directly to the input image. If an image is interpretable as a topographic image, such as images of cells under a microscope, watersheds perform well. To apply watershed segmentation to arbitrary images such as photographs of natural objects, we propose to generate a height map which characterizes the content of the image in an appropriate way. A simple interpretation of an arbitrary image, e.g., of a tree, as a height map would make no sense. We introduce the derivation of an appropriate height map from an edge filtered version of the input image. For that purpose we propose a new edge detector related to the Canny edge detector, but endowed with some advantages. This enables us to apply the watershed concept to the segmentation of arbitrary images and to exploit its favorable properties also for difficult images such as those depicting natural objects. In addition, and in contrast to the general watershed concept, where markers are adopted to incorporate knowledge-based constraints in the segmentation process, we automatically generate markers from the height map and thus contribute to a more autonomous segmentation process. The segmentation based on height maps is derived in section 2, section 3 describes some results, and in section 4 we conclude the paper with a summary.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 612–619, 2007. © Springer-Verlag Berlin Heidelberg 2007

[Figure 1. Panel labels: original, new edge detector, edges, height map, skeleton, watershed segmentation (based on height map and skeleton), segments.]
Fig. 1. Image Segmentation Based on Height Maps, Overview. The contributions of this paper are highlighted in red.
2 Image Segmentation Based on Height Maps
In fig. 1 an overview of the proposed image segmentation method is given. We start with the original image to be segmented. An edge detector is applied to the gray values of this image, resulting in an edge image (subsection 2.2). From this binary edge image we generate a topographic image, the gray values of which can be interpreted as different height levels (subsection 2.3). Given this height map a skeleton image is derived, which is again a binary image (subsection 2.4). The height map is the source image for the watershed algorithm, which is applied utilizing the skeleton image as markers. This results in the final segmentation of the original image (subsection 2.5). The details of our approach, especially in a formalized version, are described in [2]. Here we restrict ourselves to introducing the basic principles of the proposed segmentation method. But first we start with a short review of the concept of morphological watersheds (subsection 2.1).
2.1 Segmentation by Watersheds Revisited
Segmentation by morphological watersheds [3,4] requires a gray value image as input and is based upon the interpretation of this image as a topographic image, where different gray values code for different heights in a third dimension. Segmentation is accomplished by a slow flooding of this imaginary landscape. Water appears first in the deepest valleys and reaches higher landmarks at a later point in time. Any time the water threatens to spill over a mountain crest, a dam is erected to prevent the water from overflowing. The algorithm terminates when the whole landscape is flooded and only dams are visible. These dams constitute the segmentation of the image. The fact that each segment has its origin in a local minimum of the height map can lead to an oversegmentation. A number of methods have been proposed dealing with this problem [5,6]. A suitable means to prevent oversegmentation is the utilization of so-called markers. They designate landmarks where water can emerge and thus restrict the number of segments in advance. Valleys, i.e., local minima, which are not labeled by a marker are not protected by dams. They are eventually flooded from one of the neighboring valleys and thus are assigned to the corresponding segment. Where to place the markers depends on the application at hand. Originally, markers have been regarded as an opportunity to incorporate knowledge-based constraints in the segmentation process [1]. In contrast to this view, we argue that an automatic generation of markers, as employed by our approach (and described in subsection 2.4), holds the merit of a more automatic, independent, and self-contained segmentation process.
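The marker-controlled flooding described above can be sketched with a priority queue. This is a simplified illustration under stated assumptions, not the algorithm of [3,4]: it uses 4-connectivity, assigns every pixel a label instead of drawing explicit dam pixels (the dams are implied wherever neighboring labels differ), and the name `marker_watershed` is hypothetical.

```python
import heapq


def marker_watershed(height, markers):
    """Flood `height` from the labelled pixels of `markers` (0 = no marker).

    Returns a label image; segment borders lie where neighboring labels differ.
    """
    rows, cols = len(height), len(height[0])
    labels = [[0] * cols for _ in range(rows)]
    heap = []
    counter = 0  # tie-breaker so that equal levels are served in push order
    for y in range(rows):
        for x in range(cols):
            if markers[y][x]:
                heapq.heappush(heap, (height[y][x], counter, y, x, markers[y][x]))
                counter += 1
    while heap:
        level, _, y, x, label = heapq.heappop(heap)
        if labels[y][x]:
            continue  # this pixel was already flooded from another source
        labels[y][x] = label
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < rows and 0 <= nx < cols and not labels[ny][nx]:
                # water never drops below the level it has already reached
                heapq.heappush(
                    heap, (max(level, height[ny][nx]), counter, ny, nx, label)
                )
                counter += 1
    return labels


# One-row landscape with three valleys; only the outer two carry markers.
profile = [[0, 2, 0, 3, 0]]
seeds = [[1, 0, 0, 0, 2]]
segments = marker_watershed(profile, seeds)
```

In this toy profile the unmarked middle valley is reached first over the lower crest on its left, so it joins the segment of marker 1, which mirrors the statement that unmarked valleys are flooded from a neighboring valley.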
2.2 Edges
In this subsection we describe the derivation of a binary edge image from the input gray value image to be segmented. The edge image is the basis from which we derive the height map of the original image, which is needed as input for the watershed algorithm. As existing edge detectors display a series of disadvantages, we develop a new edge detector, which is similar to the Canny edge detector [7] but has some advantages over it.
Proposed Edge Detector. The proposed edge detection proceeds in 7 steps:
1. Simple Operator, Parameter d: We employ a simple edge operator, which is given in the left diagram of fig. 4 for one direction (horizontal only) and one intensity transition (dark to light only) as an example. The empty entries contain zeros. Parameter d determines the distance between 1 and −1 in the mask of the operator, and thus the size of the utilized local window. A larger distance has the effect that salient edges in the image generate d-wide edges in the response of the operator. On the other hand, edges that are caused by noise or disturbed structures in the image, i.e., non-salient edges, result in thinner edges in the filter response.
2. Gaussian Blur, Parameter b: The preliminary edge image created in this way is filtered with a Gaussian blur, the magnitude of which is regulated by the parameter b. Whereas wide edges are diminished by the blur filter only at their ends, thin edges are degraded as a whole.
3. Potentiation, Parameter γ: Now the lowpass filtered, still preliminary edge image is exponentiated by the exponent γ. As the domain of the edge image is [0, 1], γ determines to which extent the values of the edge image are pushed toward zero.
4. Thresholding, Parameter : Afterwards we eliminate values below a threshold and set values above or equal to the threshold to 1. This and the preceding steps result in a binary edge image with relatively wide edges.
5. Thinning: These wide edges are thinned to a width of one pixel only.
6. Deletion, Parameter λ: In addition, those edges are deleted the length of which is shorter than the average length of all edges multiplied by the factor λ.
An exemplification of these six steps is given in fig. 2.
Fig. 2. Steps of the Edge Detector. The steps are illustrated for horizontal dark-to-light transitions only. Upper left: original image. Upper right: result after the first step of the algorithm. Lower left: result after carrying out steps 2 to 4. Lower right: resulting edge image after steps 5 and 6. The parameters used are {d = 2, b = 2, γ = 3.0, = 0.01, λ = 1.0}.
7. Joining Directions and Transitions: In the final step the results from the edge detectors for the single directions and transitions are joined into the final edge image. Fig. 3 shows the complete edge image of the last example in the right image. It also visualizes the effects of different sets of parameters.
Fig. 3. Edge Images for Different Sets of Parameters. Left: {d = 1, b = 1, γ = 2.0, = 0.01, λ = 1.0}. Middle: {d = 1, b = 1, γ = 2.5, = 0.01, λ = 1.0}. Right: the same parameters as in fig. 2, i.e., {d = 2, b = 2, γ = 3.0, = 0.01, λ = 1.0}.
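Steps 1 to 4 of the edge detector can be sketched for one direction and one transition as follows. The sketch rests on several assumptions that the text does not fix: gray values lie in [0, 1], the operator is taken to be the difference of two pixels d apart, b is interpreted as the standard deviation of the Gaussian, and the threshold parameter is simply called `threshold`. Thinning, deletion of short edges, and the joining of directions (steps 5 to 7) are omitted.

```python
import math


def simple_operator(img, d):
    """Step 1: horizontal dark-to-light differences at distance d, clipped at 0."""
    w = len(img[0])
    return [[max(0.0, row[min(x + d, w - 1)] - row[x]) for x in range(w)]
            for row in img]


def gaussian_blur(img, sigma):
    """Step 2: separable Gaussian blur with replicated borders."""
    r = max(1, int(3 * sigma))
    kernel = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-r, r + 1)]
    total = sum(kernel)
    kernel = [k / total for k in kernel]
    h, w = len(img), len(img[0])

    def clamp(i, n):
        return min(max(i, 0), n - 1)

    horiz = [[sum(kernel[j + r] * img[y][clamp(x + j, w)] for j in range(-r, r + 1))
              for x in range(w)] for y in range(h)]
    return [[sum(kernel[j + r] * horiz[clamp(y + j, h)][x] for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]


def wide_edges(img, d=2, b=2.0, gamma=3.0, threshold=0.01):
    """Steps 1-4: operator, blur, exponentiation by gamma, binarisation."""
    response = simple_operator(img, d)
    blurred = gaussian_blur(response, b)
    return [[1 if v ** gamma >= threshold else 0 for v in row] for row in blurred]


# A vertical step from dark (0.0) to light (1.0) between columns 9 and 10.
step_image = [[0.0] * 10 + [1.0] * 10 for _ in range(5)]
edge_image = wide_edges(step_image)
```

For the step image the operator responds in the d = 2 columns left of the transition, and after blurring, exponentiation and thresholding a band of edge pixels around the transition remains, while columns far from it stay zero.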
Advantages in Comparison with the Canny Edge Detector. Most of the steps are also carried out by the Canny edge detector, though in a different order. Without going into detail, the principal steps of the Canny operator are Gaussian filtering of the image, edge detection, and thresholding. In particular, one difference of the proposed edge detector to the Canny detector is the potentiation of the blurred edge image in step 3. In combination with the following step (thresholding) this introduces an "exponential thresholding" in contrast to the linear thresholding of the Canny detector. This kind of thresholding seems to be easier to tune. Experiments reported in [2] suggest a more robust behavior of the proposed edge detector: the same threshold value provides good results for a larger field of application than with the Canny edge detector. Another advantage is illustrated in fig. 4. In step 1 edges of different strength are generated, depending on the saliency (not the contrast) of the edges in the original image: the more salient an edge in the source image, the wider the corresponding edge in the (intermediate) edge image. The subsequent application of the blur filter diminishes the thin edges (derived from non-salient edges in the original image) and thus boosts the wide, i.e., salient edges. In addition, the results of our edge detector seem to have similar advantages to those of Canny operators augmented with surround suppression such as described in [8], though it does not inhibit its surroundings actively. This can be seen in the edge image of fig. 7.

Fig. 4. Simple Edge Operator and Advantage of the Proposed Edge Detector. Left: simple edge operator. Middle: original image. Right: resulting edge image. Salient edges in the original image are preserved even if they are of low contrast such as the edges of the light, vertical bar on the left. On the other hand, edges of similar contrast are largely suppressed if they are not salient such as the edges in the texture on the right.

Fig. 5. Edge Image and Height Map. Left: original gray value image. Middle: binary edge image. Right: gray value height map. Note the gap in the top of the tree outline in the edge image. The height map compensates perfectly for this deficiency.

2.3 Height Map
As already mentioned, we want to apply a watershed algorithm for segmentation, thus we need a height map. In this subsection we derive from the binary edge image a gray value topographic image, which characterizes the original image in an adequate way. First we define a maximal distance $d_{max}$. It determines the maximal distance of a point in the edge image to the closest edge. Points with a larger distance from an edge are defined to be on the same height. That means that only gaps of a maximal size of $2d_{max}$ are tolerable in the edge image. For the images used, with a resolution of 640 × 480, $d_{max}$ = 32 to 64 pixels provides good results. The idea is now that for each point in the edge image the minimal distance to the closest edge is determined. For that purpose we start a region growing from each point e of the edge image marked as an edge. For each image point p under consideration during this region growing we first verify whether a distance value has already been assigned to it and whether this distance value is smaller than the current distance $d_{cur}$ to the origin e of the region growing. If not, $d_{max} - d_{cur}$ is assigned to p. In addition, the neighboring points of p are stored as future candidates in the region growing. The region growing stops if either no more candidates for growing exist or the maximal distance $d_{max}$ is reached. The right image of fig. 5 shows an example of a height map generated according to the preceding specification. Another example is shown in fig. 7.
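The height map construction can be sketched as below. Instead of one region growing per edge point, the sketch uses a single breadth-first search started from all edge pixels at once, which yields the same minimal distance per pixel with less bookkeeping. Two further assumptions are made: distance is counted in 4-neighbor steps, and pixels farther than $d_{max}$ from any edge receive height 0. The function name is hypothetical.

```python
from collections import deque


def height_map(edges, d_max):
    """Height d_max on edge pixels, decreasing by 1 per step away, 0 beyond d_max."""
    rows, cols = len(edges), len(edges[0])
    dist = [[None] * cols for _ in range(rows)]
    height = [[0] * cols for _ in range(rows)]
    queue = deque()
    for y in range(rows):
        for x in range(cols):
            if edges[y][x]:
                dist[y][x] = 0
                height[y][x] = d_max
                queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        d = dist[y][x] + 1
        if d > d_max:
            continue  # growing stops once the maximal distance is reached
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < rows and 0 <= nx < cols and dist[ny][nx] is None:
                dist[ny][nx] = d
                height[ny][nx] = d_max - d
                queue.append((ny, nx))
    return height


# Two edge pixels with a gap of five pixels between them.
edge_row = [[1, 0, 0, 0, 0, 0, 1]]
heights = height_map(edge_row, d_max=3)
```

With a larger `d_max` the slopes from both edge pixels meet inside the gap and form a saddle that still lies above the flat interior of large regions, which is how gaps of up to $2d_{max}$ in an outline are bridged.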
2.4 Skeleton
As already pointed out in subsection 2.1, markers are used in combination with watershed algorithms to prevent the result from being oversegmented. The binary skeleton image, which we derive in this subsection, will be used as marker for the watershed algorithm we apply to the height map. By the application of Laplace operators to the height map one obtains a kind of gray value skeleton image. The coarseness of these skeletons can be controlled by the size of the Laplace operator as illustrated in fig. 6. For our examples we used a 3 × 3 filter. After the application of the Laplace filter we binarize the result with a threshold and, in addition, we delete those of the remaining edges that are thinner than half of the filter size. Every white pixel of the resulting binary skeleton image acts as a marker for the following application of the watershed algorithm.

Fig. 6. Skeleton Images and Final Segmentation. Left: skeleton image of the same original image as depicted in fig. 5, generated with a 3 × 3 Laplace filter. Middle: skeleton image generated with an 11 × 11 Laplace filter. Right: parts of the segmentation after applying the watershed algorithm with the left skeleton image as marker.

Fig. 7. Example of Image Segmentation Based on Height Maps. First: original image. Second: edge image. Third: height map. Fourth: parts of the resulting segmentation.
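The derivation of markers from the height map can be sketched with a 3 × 3 Laplace filter followed by a threshold. The text does not state the filter coefficients, the sign convention, or the threshold value, so the 4-neighbor kernel, the choice to mark positive responses (valleys of the height map), and the default `threshold` below are all assumptions; the removal of thin remaining structures is omitted.

```python
def skeleton_markers(height, threshold=0.5):
    """Binary marker image: 1 where the 3x3 Laplace response exceeds threshold."""
    rows, cols = len(height), len(height[0])

    def at(y, x):
        # replicate the border so every pixel has four neighbors
        return height[min(max(y, 0), rows - 1)][min(max(x, 0), cols - 1)]

    markers = []
    for y in range(rows):
        row = []
        for x in range(cols):
            laplace = (at(y - 1, x) + at(y + 1, x) + at(y, x - 1) + at(y, x + 1)
                       - 4 * at(y, x))
            row.append(1 if laplace > threshold else 0)
        markers.append(row)
    return markers


# Height map of a narrow region between two edges (high at both borders).
valley = [[3, 2, 1, 0, 1, 2, 3] for _ in range(3)]
marker_image = skeleton_markers(valley)
```

On the linear slopes the Laplace response vanishes, so only the valley line in the middle of the region is marked, which is the kind of pixel set that would then seed the watershed flooding.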
2.5 Segmentation
In this final step we apply the watershed algorithm as described in subsection 2.1 to the height map derived in subsection 2.3, utilizing the skeleton image as markers. As even the usage of markers cannot always prevent the result from being oversegmented, similar neighboring segments are fused in a final postprocessing step. As a measure of similarity between two segments three values are compared: the average gray value, the variance of the gray values, and the entropy (calculated from the histogram of the gray values). Average, variance, and entropy of neighboring segments are compared, and the two segments are fused only if all three values are sufficiently similar. As threshold for the difference between the averages we used 0.1, for the variances 0.05, and for the entropies 0.5. The right image of fig. 6 shows the interesting segments of the result after the last fusion step has been carried out.
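The fusion criterion can be sketched as a comparison of three statistics per segment. The thresholds 0.1, 0.05 and 0.5 are those quoted above; everything else is an assumption made for illustration: gray values are taken to lie in [0, 1], the entropy is computed in bits from a histogram with a hypothetical 32 bins, and the function names are not from the paper.

```python
import math


def segment_stats(values, bins=32):
    """Mean, variance and histogram entropy (bits) of a segment's gray values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    entropy = -sum((c / n) * math.log2(c / n) for c in hist if c)
    return mean, variance, entropy


def should_fuse(seg_a, seg_b, t_mean=0.1, t_var=0.05, t_entropy=0.5):
    """True only if all three statistics differ by less than their thresholds."""
    mean_a, var_a, ent_a = segment_stats(seg_a)
    mean_b, var_b, ent_b = segment_stats(seg_b)
    return (abs(mean_a - mean_b) < t_mean
            and abs(var_a - var_b) < t_var
            and abs(ent_a - ent_b) < t_entropy)


# Made-up gray values of three neighboring segments.
smooth_a = [0.50, 0.52, 0.54, 0.56]
smooth_b = [0.50, 0.51, 0.55, 0.56]
textured = [0.10, 0.90, 0.10, 0.90]
fuse_similar = should_fuse(smooth_a, smooth_b)
fuse_different = should_fuse(smooth_a, textured)
```

The two smooth segments agree in all three statistics and would be fused, whereas the textured segment has nearly the same mean but a much larger variance and is kept separate.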
3 Results
For the evaluation of the proposed segmentation method we chose images depicting natural objects such as trees, as these are typically difficult to segment. A first example is depicted in the figures 5 and 6. The height map compensates for gaps in the outline of the tree in the edge image. The resulting segmentation of the tree is fairly well accomplished and the foreground tree is separated from the tree in the background although these image areas are quite similar in terms of texture. Regrettably, the final fusion of similar segments turned out to be the weak point of the whole segmentation process, as it was difficult to choose the values of the thresholds for average, variance, and entropy in such a way that neither an oversegmentation nor an undersegmentation occurred, and this consistently across a series of different examples. But in the second example displayed in fig. 7 our approach provided one segment for the tree and another for the dominant tuft in the foreground, which corresponds quite well to our human perception. The separation between the tuft and the surrounding image areas can be regarded as difficult, because they are quite similar in structure. In addition, here again a gap in the contour of the tree poses no challenge for the further processing. More examples are discussed in [2].
4 Summary
We proposed a method to derive a height map from an arbitrary gray value image, which characterizes its content in such a way that the application of the watershed concept to the height map provides a proper segmentation of the image. This enables the application of the watershed segmentation also to images that are not similar to a topographic image. That means the watershed algorithm can now provide better segmentation results on difficult images, e.g., images of natural objects, than without the intermediate height map generation. The markers, usually introduced to the watershed algorithm to incorporate knowledge-based constraints, are generated automatically from the height map. This holds the merit of a more automatic, independent, and self-contained segmentation process. In addition, we introduced a new edge detector. Its advantages in comparison with the Canny edge detector are twofold. By applying the blur filter to an intermediate edge image instead of the original image (as the Canny detector does) we derive an edge image which better distinguishes between salient and non-salient edges. The second merit of the proposed edge detector is a thresholding that is easier to handle and seems to provide more robust results.
Acknowledgments. This research was funded by the German Research Association (DFG) under Grant PE 887/3-1.
References
1. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice-Hall, Englewood Cliffs (2002)
2. Kerdels, J.: Dynamisches Lernen von Nachbarschaften zwischen Merkmalsgruppen zum Zwecke der Objekterkennung. Diploma Thesis, University Dortmund (2006)
3. Serra, J.: Image Analysis and Mathematical Morphology. Ac. Press, NY (1988)
4. Beucher, S., Meyer, F.: The Morphological Approach of Segmentation: The Watershed Transformation. Mathematical Morphology in Image Processing. Marcel Dekker, New York (1992)
5. Najman, L., Schmitt, M.: Geodesic Saliency of Watershed Contours and Hierarchical Segmentation. IEEE PAMI 18(12), 1163–1173 (1996)
6. Bleau, A., Leon, J.: Watershed-Based Segmentation and Region Merging. Computer Vision and Image Understanding 77(3), 317–370 (2000)
7. Canny, J.: A Computational Approach to Edge Detection. IEEE PAMI 8(6) (1986)
8. Grigorescu, C., Petkov, N., Westenberg, M.A.: Contour and Boundary Detection Improved by Surround Suppression of Texture Edges. Image and Vision Computing 22(8), 609–622 (2004)
Measuring the Orientability of Shapes
Paul L. Rosin
School of Computer Science, Cardiff University, Cardiff CF24 3AA, Wales, UK
[email protected]
Abstract. An orientability measure determines how orientable a shape is; i.e. how reliable an estimate of its orientation is likely to be. This is valuable since many methods for computing orientation fail for certain shapes. In this paper several existing orientability measures are discussed and several new orientability measures are introduced. The measures are compared and tested on synthetic and real data.
1 Introduction
It is often useful to determine the orientation of a shape, for instance in robotics to locate a good grasping position. In computer vision, processing is often performed with respect to an object centred and oriented coordinate frame of reference. This can effectively provide descriptions invariant to certain geometric transforms of the object, and speeds up subsequent operations such as matching. Several schemes exist for estimating the orientation of a shape, but two problems arise. First, some shapes inherently have no well defined orientation – for example a circle. Second, even for shapes with a well defined orientation some methods fail to identify that orientation. This latter limitation applies to the most common method for estimating orientation, based on the line which minimises the integral of the squares of distances of the points (belonging to the shape) to the line [10]. This line passes through the centroid of the shape $S$, and so the squared distance function of interior points to the line at orientation $\theta$ is

$$F(\theta) = \iint_S (x \cdot \sin\theta - y \cdot \cos\theta)^2 \, dx \, dy \tag{1}$$

where $S$ has been translated such that its centroid lies on the origin. $F(\theta)$ is minimised when

$$\tan 2\theta = \frac{2\mu_{11}}{\mu_{20} - \mu_{02}} \tag{2}$$

where $\mu_{pq}$ are the central moments of order $p + q$. However, under certain conditions, namely

$$\mu_{11} = \mu_{20} - \mu_{02} = 0 \tag{3}$$

$F(\theta)$ is a constant function, and the method fails. Not only does this hold for all n-fold rotationally symmetric shapes with n > 2 [12] but also for many more general shapes [11]. This can be demonstrated by constructing some examples. One way we have constructed a simple polygonal example of such an irregular asymmetric shape is to express the conditions (3) for six vertices using line moments [9]. After setting coordinate values for four of the vertices the remainder are determined by numerically solving the set of equations. For example, starting with the values {(−40, 0), (−40, 100), (0, 300), (300, 300)} yields the remaining values {(97.9, −437.2), (−550.6, 277.5)} – see figure 1a. A more general approach by Süße and Ditrich [11] enables an arbitrary shape to be normalised by a shearing and anisotropic scaling to yield a new shape that satisfies (3) – some examples of such shapes are shown in figure 1b–d.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 620–627, 2007. © Springer-Verlag Berlin Heidelberg 2007

Fig. 1. (a)–(d) Shapes whose orientation is undefined according to the standard method given by (2). The pre-normalised shapes are shown in outline. In all cases $\mu_{11} = \mu_{20} - \mu_{02} = 0$.

This limitation of the standard method for orientation estimation has led to the development of methods that can cope in these circumstances. Several are based on higher powers of the distance function $F(\theta)$ and/or higher order moments (including Zernike and generalised complex as well as geometric moments) [8,11,12,14]. Nevertheless, these new orientation estimators are not ideal either, since they may still fail in some cases, and are typically sensitive to noise as a consequence of the higher orders of moments used.

The problems associated with orientation estimation suggest that a measure of shape orientability, whose purpose is to describe the degree to which a shape has a distinct (but not necessarily unique) orientation, would be useful, although there is little previous work in this area [1,16]. Orientability quantifies the likely reliability and stability of orientation estimates. For instance, even minor changes in a shape due to digitization or noise effects can substantially alter orientation estimates for shapes with low orientability [16].
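The standard method of (2) and its failure condition (3) can be sketched for a shape given as a set of pixel coordinates. The discrete sums below replace the integrals over $S$, and the function names and the relative tolerance `tol` are assumptions made for this illustration.

```python
import math


def central_moments(points):
    """Second-order central moments (mu20, mu02, mu11) of a point set."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mu20 = sum((x - cx) ** 2 for x, _ in points)
    mu02 = sum((y - cy) ** 2 for _, y in points)
    mu11 = sum((x - cx) * (y - cy) for x, y in points)
    return mu20, mu02, mu11


def standard_orientation(points, tol=1e-9):
    """Orientation in [0, pi) from equation (2), or None if condition (3) holds."""
    mu20, mu02, mu11 = central_moments(points)
    scale = mu20 + mu02
    if scale == 0 or (abs(mu11) <= tol * scale and abs(mu20 - mu02) <= tol * scale):
        return None  # F(theta) is constant: no orientation can be chosen
    # atan2 selects the solution of (2) that minimises F rather than maximises it
    return (0.5 * math.atan2(2 * mu11, mu20 - mu02)) % math.pi


bar = [(x, y) for x in range(10) for y in range(2)]
diagonal = [(i, i) for i in range(5)]
square = [(x, y) for x in range(4) for y in range(4)]
theta_bar = standard_orientation(bar)
theta_diagonal = standard_orientation(diagonal)
theta_square = standard_orientation(square)
```

The horizontal bar and the diagonal line yield their expected orientations, while the square, as a 4-fold rotationally symmetric shape, satisfies (3) and is reported as having no orientation under this method.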
2 Orientability Measures
In this section various schemes (new and old) are described for measuring orientability. There are certain basic properties that we expect all orientability measures to possess (which makes them easier to use and easier to compare):
a) The measured orientability is in the interval [0, 1] for any shape;
b) A circle has the measured orientability equal to 0;
c) The measured orientability is invariant with respect to similarity transformations.
In addition, it would be desirable if
d) The only shape with measured orientability equal to 0 is a circle;
e) The only shape with measured orientability equal to 1 is a straight line segment (i.e. the limit of a rectangle as its aspect ratio tends to infinity).
2.1 Moment Based Methods
Elongation. Consider the covariance matrix constructed from the second order central moments of the shape

$$C = \begin{pmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{pmatrix}.$$

The eigenvalues of $C$, denoted by $I_1$ and $I_2$, provide the variances of the shape along the major and minor principal axes, and can be used to form a measure of elongation [5], which in turn is an indication of orientability:

$$D_F = \frac{I_1 - I_2}{I_1 + I_2} = \frac{\sqrt{4\mu_{11}^2 + (\mu_{20} - \mu_{02})^2}}{\mu_{20} + \mu_{02}} = \frac{\sqrt{\Phi_2}}{\Phi_1}$$

where $\Phi_1$ and $\Phi_2$ are the first two rotation and translation moment invariants [5]. The same measure can be derived from the distance function $F(\theta)$ [16]:

$$D_F = 1 - \frac{\pi \cdot \min\{F(\theta) \mid \theta \in [0, \pi)\}}{\int_0^{\pi} F(\theta) \cdot d\theta}.$$

When conditions (3) hold, and consequently $F(\theta)$ is constant, then $D_F = 0$. Thus, apart from the only truly unorientable shape, the circle, the elongation measure underestimates the orientability of all the other shapes that satisfy (3).

Higher Powers of Distance. One way to correctly determine the orientation of rotationally symmetric shapes is to use higher powers of distance in (1), i.e.

$$F(\theta, p) = \iint_S (x \cdot \sin\theta - y \cdot \cos\theta)^p \, dx \, dy.$$

For shapes with N-fold rotational symmetry Tsai and Chou [12] used $p = N$, while Žunić et al. [14] used $p = 2N$. In a similar manner, a higher order elongation measure can be computed (and thereby an orientability measure). For instance

$$D_E(p) = 1 - \frac{\min\{F(\theta, p) \mid \theta \in [0, 2\pi]\}}{\max\{F(\theta, p) \mid \theta \in [0, 2\pi]\}}$$

is able to distinguish a circle from rotationally symmetric shapes for sufficiently large values of $p$.
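The two moment based measures can be checked numerically on pixel sets. The sketch below is a discrete approximation under stated assumptions: sums over pixel coordinates replace the integrals, $\theta$ is sampled uniformly, $p$ is assumed even, and all function names are hypothetical.

```python
import math


def centred(points):
    """Translate the point set so that its centroid lies on the origin."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points]


def distance_function(points, theta, p=2):
    """Discrete F(theta, p) for points that are already centred."""
    s, c = math.sin(theta), math.cos(theta)
    return sum((x * s - y * c) ** p for x, y in points)


def elongation(points):
    """D_F computed from the second-order central moments."""
    pts = centred(points)
    mu20 = sum(x * x for x, _ in pts)
    mu02 = sum(y * y for _, y in pts)
    mu11 = sum(x * y for x, y in pts)
    return math.sqrt(4 * mu11 ** 2 + (mu20 - mu02) ** 2) / (mu20 + mu02)


def elongation_from_distance(points, steps=180):
    """D_F as 1 - pi * min F / integral of F over [0, pi), by uniform sampling."""
    pts = centred(points)
    values = [distance_function(pts, math.pi * k / steps) for k in range(steps)]
    integral = sum(values) * math.pi / steps
    return 1.0 - math.pi * min(values) / integral


def higher_order_orientability(points, p, steps=360):
    """D_E(p) = 1 - min F(theta, p) / max F(theta, p), sampled over [0, 2*pi)."""
    pts = centred(points)
    values = [distance_function(pts, 2 * math.pi * k / steps, p)
              for k in range(steps)]
    return 1.0 - min(values) / max(values)


bar = [(x, y) for x in range(10) for y in range(2)]
square = [(x, y) for x in range(5) for y in range(5)]
```

For the bar both formulations of $D_F$ agree, while for the square $D_F$ and $D_E(2)$ vanish but $D_E(4)$ is clearly positive, illustrating how a higher power separates a rotationally symmetric shape from a circle.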
2.2 Geometric Methods
Bounding Rectangles. Žunić et al. [16] described a new measure of orientability based on bounding rectangles. For a given shape $S$ let $R(S, \theta)$ be the minimal area rectangle whose edges make an angle $\theta$ with the coordinate axes and which includes $S$, and let $A(R(S, \theta))$ be the area of $R(S, \theta)$. Let

$$A_{min}(S) = \min_{\theta \in [0, \pi)} \{ A(R(S, \theta)) \} \quad \text{and} \quad A_{max}(S) = \max_{\theta \in [0, \pi)} \{ A(R(S, \theta)) \}.$$

Fig. 2. The minimum area bounding rectangles (three are shown as examples) at all orientations for this shape have constant area but varying aspect ratio; $D_\alpha = 0$

The orientability of the shape $S$ is defined as:

$$D_\alpha(S) = 1 - \frac{A_{min}(S) - \alpha \cdot A(S)}{A_{max}(S) - \alpha \cdot A(S)}$$

where $A(S)$ denotes the area of $S$, and the $\alpha \cdot A(S)$ terms ($\alpha \in [0, 1]$) have been included to enable shapes that have identical convex hulls to be differentiated. $D_\alpha$ satisfies the required properties (a), (b) and (c), but not property (d). In fact, $D_\alpha = 0$ holds not only for the circle, but for all members of the class of curves of constant width (of which the circle is just one instance). The most famous example is the Reuleaux triangle. At all $\theta$ the minimum area bounding rectangles $R(S, \theta)$ will be identical squares. Furthermore, it is also possible to have other shapes with varying width that nevertheless have $D_\alpha = 0$. In such cases the aspect ratio of $R(S, \theta)$ varies over $\theta$, but $A(R(S, \theta))$ remains constant, as shown in figure 2.
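A direct numerical reading of $D_\alpha$ for a simple polygon is sketched below. It assumes that the shape is given by its vertices, that $\theta$ is sampled in `steps` equal increments over $[0, \pi)$ instead of being optimised exactly, and that the function names are free to choose.

```python
import math


def polygon_area(vertices):
    """Area of a simple polygon by the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def bounding_rectangle_area(vertices, theta):
    """Area of the bounding rectangle whose edges make angle theta with the axes."""
    c, s = math.cos(theta), math.sin(theta)
    us = [x * c + y * s for x, y in vertices]
    vs = [-x * s + y * c for x, y in vertices]
    return (max(us) - min(us)) * (max(vs) - min(vs))


def orientability_alpha(vertices, alpha=0.0, steps=180):
    """D_alpha from sampled bounding rectangle areas."""
    areas = [bounding_rectangle_area(vertices, math.pi * k / steps)
             for k in range(steps)]
    area = polygon_area(vertices)
    return 1.0 - (min(areas) - alpha * area) / (max(areas) - alpha * area)


unit_square = [(0, 0), (2, 0), (2, 2), (0, 2)]
d_square_alpha0 = orientability_alpha(unit_square, alpha=0.0)
d_square_alpha1 = orientability_alpha(unit_square, alpha=1.0)
```

For a square the bounding rectangle area doubles at 45 degrees, so the measure is one half for $\alpha = 0$; for $\alpha = 1$ the numerator vanishes because the smallest bounding rectangle has the same area as the square itself, and the measure becomes one.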
Fig. 3. This shape has orientability DP = 0
Projection of Edges. In place of (2) Žunić [13] suggested computing the orientation that maximises the squared projection of the shape's boundary. If the edges in polygon $P$ have angles $\theta_i$ and lengths $l_i$, $i = 1 \ldots n$, with respect to the x-axis, then the required orientation $\theta_0$ can be found as

$$\tan 2\theta_0 = \frac{\sum_{i=1}^{n} l_i^2 \sin 2\theta_i}{\sum_{i=1}^{n} l_i^2 \cos 2\theta_i}.$$

We then compute orientability using the squared projection values at $\theta_0$ and $\theta_0 + \frac{\pi}{2}$ as

$$D_P = 1 - \frac{\sum_{i=1}^{n} l_i^2 \sin^2(\theta_i - \theta_0)}{\sum_{i=1}^{n} l_i^2 \cos^2(\theta_i - \theta_0)}. \tag{4}$$
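Equation (4) can be evaluated without explicit edge angles, because for an edge vector $(dx, dy)$ one has $l^2 \sin 2\theta = 2\,dx\,dy$ and $l^2 \cos 2\theta = dx^2 - dy^2$. The sketch below relies on this identity; it assumes a closed polygon given by its vertices, and the function name is hypothetical.

```python
import math


def edge_projection_orientability(vertices):
    """Return (theta0, D_P) for a closed polygon given by its vertices."""
    n = len(vertices)
    edges = [(vertices[(i + 1) % n][0] - vertices[i][0],
              vertices[(i + 1) % n][1] - vertices[i][1]) for i in range(n)]
    sin_sum = sum(2 * dx * dy for dx, dy in edges)        # sum of l^2 sin(2 theta_i)
    cos_sum = sum(dx * dx - dy * dy for dx, dy in edges)  # sum of l^2 cos(2 theta_i)
    theta0 = 0.5 * math.atan2(sin_sum, cos_sum)
    c, s = math.cos(theta0), math.sin(theta0)
    along = sum((dx * c + dy * s) ** 2 for dx, dy in edges)   # l^2 cos^2(theta_i - theta0)
    across = sum((dy * c - dx * s) ** 2 for dx, dy in edges)  # l^2 sin^2(theta_i - theta0)
    return theta0, 1.0 - across / along


rectangle = [(0, 0), (10, 0), (10, 1), (0, 1)]
# Same rectangle, but with each long side split into ten unit edges.
dense_rectangle = [(x, 0) for x in range(11)] + [(x, 1) for x in range(10, -1, -1)]
_, d_rectangle = edge_projection_orientability(rectangle)
_, d_dense = edge_projection_orientability(dense_rectangle)
_, d_square = edge_projection_orientability([(0, 0), (2, 0), (2, 2), (0, 2)])
```

A square gives $D_P = 0$, and the same 10 × 1 rectangle gives a lower value once its long sides are split into unit edges, since squaring the edge lengths makes the measure depend on how the boundary is sampled.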
A problem with this approach is that it can give unexpected results in some instances; see for example figure 3, which has $D_P = 0$ since the edge angles are distributed evenly in two orthogonal directions, like a square. Given the measure's sensitivity to local deviations in edge orientations, if the boundary is extracted from an image then it needs to be simplified (by polygonal approximation) to eliminate these variations. However, this leads to another problem: variations in the sampling rate. For instance, consider an elongated rectangle from which points are selected along the boundary. Sampling at regular intervals will produce a high value of $D_P$. However, the value of $D_P$ can be arbitrarily reduced by sampling the long side more densely than the short side. Since the projected lengths are squared, the contribution in (4) of the segments from the long side will decrease.

A solution to both these problems can be found in a technique previously employed for measuring rectilinearity [7]. The original data rather than its polygonal approximation is used, but is smoothed over a range of scales. The maximum orientability over scale is then selected. If a curve is smoothed using the geometric heat flow equation then it becomes more and more circular, eventually shrinking to a circular point in finite time [3]. Thus blurring will not increase orientability unless it reveals an elongated global structure that was effectively masked from the measure $D_P$ by local edges whose orientations are not aligned with the global structure.

Weighting of Edges. Duchêne et al. [1] describe a method for estimating orientation which also includes a confidence factor that can be used as an orientability measure. Orientations are considered for θ ∈ [0, π). The absolute difference in orientation $\varphi_i$ of each edge from θ is taken modulo π. Each edge with $\varphi_i < \delta$ contributes the weight $w_i = l_i \left(1 - \frac{\varphi_i}{\delta}\right)$. Orientation is taken as the value of θ maximising $\sum_{i=1}^{n} w_i$. The same process is carried out to compute orientability as

$$D_W = \max_{\theta \in [0, \frac{\pi}{2})} \frac{\sum_{i=1}^{n} w_i}{\mathrm{perim}(P)}$$

except that the $\varphi_i$ are now taken modulo $\frac{\pi}{2}$. Note that if $\delta = \frac{\pi}{n}$ then the measure is over the interval $\left[\frac{2}{n}, 1\right]$ rather than [0, 1). Duchêne et al. set $\delta = \frac{\pi}{12}$, and in our experiments we have done likewise.

Circularity Measures. Since a circle should produce an extreme value of orientability, it seems reasonable that some of the measures of circularity can be used to measure orientability. Let the circumscribed circle, inscribed (i.e. largest empty) circle and convex hull of polygon P be denoted by CC(P), IC(P) and CH(P). These geometric constructions are robust with respect to perturbations of the boundary, which makes them suitable for use as elements of shape descriptors. Moreover, the inscribed circle can be computed in O(n log n) time [6] and the other two can be computed in linear time [2,4]. Note that IC(P) ≤ CH(P) ≤ CC(P).
Only for the least orientable shape – a circle – does CC(P) = IC(P) = CH(P), and so they can be simply combined to form the following orientability measures, which can be based on either area or perimeter values:

$$C_1 = 1 - \frac{\mathrm{area}(CH(P))}{\mathrm{area}(CC(P))}; \qquad C_2 = 1 - \frac{\mathrm{area}(IC(P))}{\mathrm{area}(CH(P))};$$

$$C_3 = 1 - \frac{\pi}{\pi - 2}\left(\frac{\mathrm{perim}(CH(P))}{\mathrm{perim}(CC(P))} - \frac{2}{\pi}\right); \qquad C_4 = 1 - \frac{\mathrm{perim}(IC(P))}{\mathrm{perim}(CH(P))}.$$
Incorporating the circumscribed circle in the measure will ensure that protrusions are counted against circularity (and therefore in favour of orientability) while the inscribed circle will ensure that intrusions are treated likewise: counted against circularity and in favour of orientability. Further variants are possible, for instance using linear combinations of the above such as αC1 + (1 − α)C2, or different scale normalisations (e.g. using polygon diameter).
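As a rough illustration of the four circularity-based measures, the sketch below evaluates them from quantities assumed to be already available: the radii of the circumscribed and inscribed circles and the area and perimeter of the convex hull. The function name and its parameters are hypothetical; computing the circles and the hull themselves is outside the scope of the sketch.

```python
import math

def circularity_orientability(r_cc, r_ic, hull_area, hull_perimeter):
    """Return (C1, C2, C3, C4) from the circumscribed-circle radius r_cc,
    the inscribed-circle radius r_ic and the convex hull area and perimeter."""
    area_cc, area_ic = math.pi * r_cc ** 2, math.pi * r_ic ** 2
    perim_cc, perim_ic = 2 * math.pi * r_cc, 2 * math.pi * r_ic
    c1 = 1 - hull_area / area_cc
    c2 = 1 - area_ic / hull_area
    # The ratio perim(CH)/perim(CC) lies in [2/pi, 1]; rescale it to [0, 1].
    c3 = 1 - math.pi / (math.pi - 2) * (hull_perimeter / perim_cc - 2 / math.pi)
    c4 = 1 - perim_ic / hull_perimeter
    return c1, c2, c3, c4

# Unit square: circumscribed radius sqrt(2)/2, inscribed radius 1/2.
print([round(c, 3) for c in circularity_orientability(math.sqrt(2) / 2, 0.5, 1.0, 4.0)])
```

For a circle all four values are zero. The factor π/(π − 2) in C3 compensates for the fact that the perimeter ratio cannot fall below 2/π, the value reached by a degenerate line segment, for which C3 equals one.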
3 Experiments
To visualise the similarities and differences between the various orientability measures, they have been applied to order some simple shapes as shown in figure 4. Overall we can see that the circle is always assigned a low orientability score while the rectangle receives a high score, but there are many differences in behaviour for intermediate shapes. As expected, the low order moments measure $D_F$ is unable to discriminate between the rotationally symmetric shapes, which are all assigned a value close to zero (the difference from zero being due to numerical and digitisation errors). In addition, two shapes – the rippled pear and the donkey – have been normalised according to Süße and Ditrich's scheme [11], and consequently also have approximately zero values. Using higher order powers overcomes this problem, as seen in the ranking by $D_E(50)$. Since this is still effectively an area-based measure, small area features, such as the indentation in the circle (second shape from left), have little effect. As explained in section 2.2, $D_\alpha$ has a problem with certain shapes such as the irregular Reuleaux shape, fourth from the left for $D_{\alpha=0.0}$, and the third shape from the left, also described in figure 2. The effect of increasing the value of α is demonstrated: the circle and rectangle have identical values to their modified versions with deep intrusions at α = 0, but are increasingly discriminated as α increases. Note however that $D_{\alpha=1.0}$ assigns a square the same peak value of one as a rectangle since $A_{min} = A$. Putting the edge projection method in a multi-scale context ($D_P$) enables it to successfully cope with the zigzag rectangle (second on the right). Also, being boundary based, it is very sensitive to the deep (but narrow) indentations, which make the modified circle and rectangle much more orientable.
It also discriminates between the plain square and the modified version with the zigzag pattern on the top and bottom, but fails to distinguish between a square, a cross and a circle. As mentioned previously, Duchêne et al.'s method [1] ($D_W$) does not reach the lower bound of zero, even for a circle. Another point to note is that all the rectilinear shapes achieve a value of one, suggesting that this is actually more of a rectilinearity measure [15,7] than an orientability measure. The circularity measures that incorporate the inscribed circle ($C_2$, $C_4$) are more sensitive to intrusions than the remaining circularity measures, which makes the modified circle and rectangle more orientable. At least in these examples there does not appear to be a significant difference between the area and perimeter based versions. Note that several measures cannot distinguish the 'U' shape from a square: $D_{\alpha=0.0}$, $D_W$, $C_1$, $C_3$. Also, the shapes are more clearly ordered by some measures; e.g. half the shapes have almost identical $D_F$ values. Quantifying this for each measure by the median of the differences in the ordered values gives {.004, .024, .037, .051, .049, .015, .016, .040, .046, .058, .051}, confirming that $D_{\alpha>0}$ and the circularity measures perform well in this respect.

Fig. 4. Shapes ranked according to various orientability measures. Each row lists the 16 values of one measure in increasing order.

D_F       .000 .000 .000 .001 .003 .003 .004 .005 .009 .013 .025 .096 .405 .452 .886 .986
D_E(50)   .005 .007 .044 .065 .121 .204 .221 .245 .245 .250 .287 .295 .306 .425 .717 .917
D_α=0.0   .007 .007 .008 .008 .072 .109 .149 .176 .338 .388 .500 .500 .500 .500 .563 .860
D_α=0.5   .011 .012 .013 .013 .094 .133 .185 .236 .397 .537 .593 .604 .658 .667 .668 .925
D_α=1.0   .028 .028 .033 .034 .135 .170 .243 .357 .481 .729 .762 .820 .869 .963 1.00 1.00
D_P       .002 .005 .005 .005 .006 .021 .137 .427 .428 .443 .458 .503 .535 .699 .884 .984
D_W       .183 .188 .224 .240 .269 .334 .405 .425 .582 1.00 1.00 1.00 1.00 1.00 1.00 1.00
C_1       .007 .017 .113 .127 .200 .240 .360 .362 .365 .365 .371 .382 .452 .591 .656 .898
C_2       .013 .095 .215 .331 .396 .442 .597 .718 .770 .815 .860 .884 .890 .926 .932 .936
C_3       .008 .021 .092 .150 .156 .180 .272 .278 .280 .283 .393 .427 .493 .574 .674 .875
C_4       .007 .058 .213 .235 .303 .395 .413 .483 .521 .618 .652 .705 .735 .753 .758 .881
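The spread statistic quoted above, the median of the differences between consecutive ordered values, can be reproduced directly from the values of Fig. 4. The sketch below does so for three of the eleven measures; the dictionary keys and the function name `median_gap` are labels chosen here for illustration.

```python
from statistics import median

# Ordered values of three measures, copied from Fig. 4.
ranked_values = {
    "D_F": [.000, .000, .000, .001, .003, .003, .004, .005,
            .009, .013, .025, .096, .405, .452, .886, .986],
    "D_E(50)": [.005, .007, .044, .065, .121, .204, .221, .245,
                .245, .250, .287, .295, .306, .425, .717, .917],
    "C_2": [.013, .095, .215, .331, .396, .442, .597, .718,
            .770, .815, .860, .884, .890, .926, .932, .936],
}

def median_gap(values):
    """Median of the differences between consecutive values after sorting."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return median(gaps)

for name, values in ranked_values.items():
    print(name, round(median_gap(values), 3))
```

The three results, .004, .024 and .046, agree with the first, second and ninth entries of the list given in the text.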
4 Conclusions
Several orientability measures have been described and tested. While they all operate successfully for extremes of orientability – i.e. a circle and an elongated rectangle – performance is varied for intermediate shapes. Many of the measures cannot distinguish between dissimilar shapes, e.g. the class of rotationally symmetric shapes, constant width shapes, or squares versus circles. Future work will look at applying the orientability measures as shape descriptors, and evaluating their effectiveness for various object classification tasks.
References

1. Duchêne, C., Bard, S., Barillot, X., Ruas, A., Trévisan, J., Holzapfel, F.: Quantitative and qualitative description of building orientation. In: Workshop on Progress in Automated Map Generalisation (2003)
2. Gärtner, B.: Fast and robust smallest enclosing balls. In: Nešetřil, J. (ed.) ESA 1999. LNCS, vol. 1643, pp. 325–338. Springer, Heidelberg (1999)
3. Grayson, M.A.: The heat equation shrinks embedded plane curves to round points. Journal of Differential Geometry 26, 285–314 (1987)
4. McCallum, D., Avis, D.: A linear algorithm for finding the convex hull of a simple polygon. Inform. Process. Lett. 9, 201–206 (1979)
5. Mukundan, R., Ramakrishnan, K.R.: Moment Functions in Image Analysis – Theory and Applications. World Scientific, Singapore (1998)
6. Preparata, F.P., Shamos, M.I.: Computational Geometry. Springer, Heidelberg (1985)
7. Rosin, P.L., Žunić, J.: Measuring rectilinearity. Computer Vision and Image Understanding 99(2), 175–188 (2005)
8. Shen, D., Ip, H.H.S.: Optimal axes for defining the orientations of shapes. Electronics Letters 32(20), 1873–1874 (1996)
9. Singer, M.H.: A general approach to moment calculation for polygons and line segments. Pattern Recognition 26(7), 1019–1028 (1993)
10. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis, and Machine Vision. PWS (1998)
11. Süße, H., Ditrich, F.: Robust determination of rotation-angles for closed regions using moments. In: Int. Conf. Image Processing, vol. 1, pp. 337–340 (2005)
12. Tsai, W.H., Chou, S.L.: Detection of generalized principal axes in rotationally symmetric shapes. Pattern Recognition 24(1), 95–104 (1991)
13. Žunić, J.: Boundary based orientation of polygonal shapes. In: Chang, L.-W., Lie, W.-N. (eds.) PSIVT 2006. LNCS, vol. 4319, pp. 108–117. Springer, Heidelberg (2006)
14. Žunić, J., Kopanja, L., Fieldsend, J.E.: Notes on shape orientation where the standard method does not work. Pattern Recognition 39(5), 856–865 (2006)
15. Žunić, J., Rosin, P.L.: Rectilinearity measurements for polygons. IEEE Trans. on Patt. Anal. and Mach. Intell. 25(9), 1193–1200 (2003)
16. Žunić, J., Rosin, P.L., Kopanja, L.: On the orientability of shapes. IEEE Trans. on Image Processing 15(11), 3478–3487 (2006)
A 3–Subiteration Surface–Thinning Algorithm

Kálmán Palágyi

Department of Image Processing and Computer Graphics, University of Szeged, Hungary
[email protected]
Abstract. Thinning is an iterative layer-by-layer erosion for extracting the skeleton. This paper presents an efficient parallel 3D thinning algorithm which produces medial surfaces. A three–subiteration strategy is proposed: the thinning operation is changed from iteration to iteration with a period of three according to the three deletion directions.
1 Introduction
The skeleton is a region–based shape feature that is extracted from binary image data. A very illustrative definition of the skeleton is given using the prairie–fire analogy: the object boundary is set on fire and the skeleton is formed by the loci where the fire fronts meet and quench each other [4]. In discrete spaces, the thinning process is a frequently used method for producing an approximation to the skeleton in a topology–preserving way [7]. It is based on digital simulation of the fire front propagation: border points of a binary object that satisfy certain topological and geometric constraints are deleted in iteration steps. The entire process is repeated until only the "skeleton" is left. A simple point is an object point whose deletion does not alter the topology of the image [9]. Sequential thinning algorithms delete simple points which are not end points, since preserving end–points provides important information relative to the shape of the objects. Curve thinning (i.e., a thinning process for extracting the medial line) preserves line–end points while surface thinning (i.e., a thinning process for extracting the medial surface) does not delete surface–end points. Parallel thinning algorithms delete a set of simple points simultaneously. A possible approach to preserving topology is the directional approach (often referred to as the subiteration–based or border sequential strategy) [6]: the thinning operation is changed from iteration to iteration with a period of n (n ≥ 2); each iteration of a period is then called a subiteration, where only border points of a certain kind can be deleted. Since there are six kinds of major directions in 3D images, 6–subiteration thinning algorithms were generally proposed [3,5,8,11,12,17]. Note that 3–, 8–, and 12–subiteration algorithms were also developed [13,14,15]. In this paper, a new non–conventional 3–subiteration surface thinning algorithm is proposed. Some experiments are made on synthetic objects and the effectiveness is demonstrated.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 628–635, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Basic Notions
Let p be a point in the 3D digital space $\mathbb{Z}^3$. Let us denote by $N_j(p)$ (for j = 6, 26) the set of points j–adjacent to point p (see Fig. 1). A binary image I is a mapping $I : \mathbb{Z}^3 \to \{0, 1\}$ that assigns value 1 to black (or object) points and value 0 to white points.

Fig. 1. The set $N_6(p)$ of the central point $p \in \mathbb{Z}^3$ contains the central point p and the 6 points marked U = u(p), N = n(p), E = e(p), S = s(p), W = w(p), and D = d(p). The set $N_{26}(p)$ contains $N_6(p)$ and the additional 20 points marked "•".
We are dealing with (26, 6) images [7] (i.e., the equivalence classes of the set of black points induced by the transitive closure of the 26–adjacency form the objects of the given image; white components (the background and the cavities) are the equivalence classes of the set of white points induced by the transitive closure of the 6–adjacency). It is assumed that any image contains finitely many black points. A black point is called a border point if it is 6–adjacent to at least one white point. A border point p is called a U–border point in image I if I(u(p)) = 0 (see Fig. 1). We can define N–, E–, S–, W–, and D–border points in the same way. A black point p is called an interior point if it is not a border point (i.e., I(p) = I(u(p)) = I(n(p)) = I(e(p)) = I(s(p)) = I(w(p)) = I(d(p)) = 1). A black point is called a simple point if its deletion does not alter the topology of the image [7]. (Note that the simplicity of point p in a (26, 6) image is a local property; it can be decided in view of $N_{26}(p)$.) We propose a new surface thinning algorithm for extracting medial surfaces from 3D (26, 6) images. The deletable points of the algorithm are border points of certain types that are not surface end–points (i.e., which are not extremities of surfaces). The proposed algorithm uses the following characterization of surface end–points: a black point is a surface end–point in an image if it is a border point and it is not 6–adjacent to any interior point. Note that the same characterization has been used by other authors [1,10].
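The definitions of border points, interior points and surface end–points translate almost literally into array operations. The following sketch is only an illustration under assumptions that the text does not fix: the image is a NumPy array indexed as (z, y, x), the directions U/D, N/S and W/E are mapped to the three axes as shown in `OFFSETS_6`, and everything outside the array is treated as white.

```python
import numpy as np

# Assumed axis convention: U/D along axis 0, N/S along axis 1, W/E along axis 2.
OFFSETS_6 = {"U": (-1, 0, 0), "D": (1, 0, 0), "N": (0, -1, 0),
             "S": (0, 1, 0), "W": (0, 0, -1), "E": (0, 0, 1)}

def shifted(img, offset):
    """Array whose entry at p holds img[p + offset]; points outside img count as white."""
    padded = np.pad(img, 1, constant_values=0)
    dz, dy, dx = offset
    z, y, x = img.shape
    return padded[1 + dz:1 + dz + z, 1 + dy:1 + dy + y, 1 + dx:1 + dx + x]

def classify_points(img):
    """Return boolean masks (interior, border, surface_end) of a binary 3D image."""
    img = np.asarray(img).astype(bool)
    neighbours = [shifted(img, o) for o in OFFSETS_6.values()]
    all_black = np.logical_and.reduce(neighbours)
    interior = img & all_black            # black, with six black 6-neighbours
    border = img & ~all_black             # black, 6-adjacent to a white point
    near_interior = np.logical_or.reduce(
        [shifted(interior, o) for o in OFFSETS_6.values()])
    surface_end = border & ~near_interior  # border, not 6-adjacent to an interior point
    return interior, border, surface_end

cube = np.zeros((7, 7, 7), dtype=np.uint8)
cube[1:6, 1:6, 1:6] = 1
interior, border, surface_end = classify_points(cube)
print(interior.sum(), border.sum(), surface_end.sum())
```

For the 5 × 5 × 5 cube used here, the 27 inner voxels are interior points, the remaining 98 are border points, and the 44 voxels on the edges and corners of the cube satisfy the surface end–point characterization. In a one-voxel-thick plate every point is a surface end–point, which is why such a plate is left untouched by surface thinning.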
3 The New Thinning Algorithm
Each conventional 6–subiteration 3D thinning algorithm uses the six deletion directions that can delete certain U–, D–, N–, E–, S–, and W–border points, respectively [3,5,8,11,12,17]. In our 3–subiteration approach, two kinds of border points can be deleted in each subiteration. The three deletion directions correspond to the three kinds of opposite pairs of points, and are denoted by UD, NS, and EW. The first subiteration, assigned to the deletion direction UD, can delete certain U– or D–border points; the second subiteration, associated with the deletion direction NS, attempts to delete N– or S–border points; and some E– or W–border points can be deleted by the third subiteration, corresponding to the deletion direction EW. The proposed algorithm is given as follows:

    Input: binary image A
    Output: binary image B
    3-subiteration_thinning(A, B)
    begin
      B = A;
      repeat
        B = deletion_from_UD(B);   /* 1st subiteration */
        B = deletion_from_NS(B);   /* 2nd subiteration */
        B = deletion_from_EW(B);   /* 3rd subiteration */
      until no points are deleted;
    end.
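The control structure of the algorithm above can be written as a short Python driver. The sketch assumes a hypothetical callback `delete_from(image, direction)` that returns the image after one subiteration; the actual deletion rules are defined by the templates and are not reproduced here. Because a subiteration only ever removes black points, an unchanged number of black points after a full period means that nothing was deleted.

```python
import numpy as np

def three_subiteration_thinning(image, delete_from):
    """Apply the UD, NS and EW subiterations repeatedly until no point is deleted.

    delete_from(image, direction) is a placeholder for the template-based
    deletion rule of one subiteration; it must return a new boolean array.
    """
    current = np.array(image, dtype=bool)
    while True:
        before = int(current.sum())
        for direction in ("UD", "NS", "EW"):
            current = delete_from(current, direction)
        if int(current.sum()) == before:   # no points were deleted in this period
            return current
```

Passing the rule as a function keeps the driver independent of how the deletable points are detected, for example by explicit template matching or by a look-up table.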
The new value of a black point depends on the values of 28 additional points. The considered special neighbourhoods are presented in Fig. 2.

Fig. 2. The special local neighbourhoods assigned to the deletion directions UD, NS, and EW, respectively. The new value of a black point p depends on $N_{26}(p)$ and two additional points.
Deletable points in a subiteration are given by a set of matching templates. A black point is deletable if at least one template in the set of templates matches it. The set of templates $T_{UD}$ is given by Fig. 3. Note that Fig. 3 shows only the eight base templates TU1–TU4, TD1–TD4. Additionally, all their rotations around the vertical axis belong to $T_{UD}$, where the rotation angles are 90°, 180°, and 270°. It is easy to see that the complete $T_{UD}$ contains 2 · (1 + 4 + 4 + 4) = 26 templates. This set of templates was constructed for deleting some simple points which are neither surface end–points nor extremities of surfaces. The deletable points of the other two subiterations (corresponding to deletion directions NS and EW) can be obtained by proper rotations of the templates in $T_{UD}$. Note that choosing another order of the deletion directions yields another algorithm. The proposed algorithm terminates when there are no more black points to be deleted. Since all considered input images are finite, it will terminate.

Fig. 3. Base templates TU1–TU4, TD1–TD4 and their rotations around the vertical axis form the set of templates $T_{UD}$ assigned to the deletion direction UD. This set of templates belongs to the first subiteration. Notations: each position marked "p" or "•" matches a black point; each position marked "◦" matches a white point; each "·" ("don't care") matches either a black or a white point.

Fig. 4. Indices of the 25 Boolean variables (i.e., the considered points in $N_{26}(p)$). Note that investigating the specially marked point is not needed. Since the deletion rule of a subiteration can be derived from the deletion rule of the reference subiteration UD by the proper rotation, the indexing scheme of a subiteration corresponds to the proper permutation of positions assigned to the reference subiteration.
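The count 2 · (1 + 4 + 4 + 4) = 26 reflects the fact that a template which is invariant under rotation about the vertical axis contributes one member, whereas a template without that symmetry contributes four. The sketch below counts distinct rotations with NumPy. The two 3 × 3 × 3 arrays are placeholders invented for this illustration and are not the templates of Fig. 3; the first array axis is assumed to be the vertical one.

```python
import numpy as np

def distinct_vertical_rotations(template):
    """Number of distinct arrays among the rotations by 0, 90, 180 and 270
    degrees about the first (assumed vertical) axis."""
    seen = {np.rot90(template, k, axes=(1, 2)).tobytes() for k in range(4)}
    return len(seen)

# Placeholder patterns, NOT the templates of the paper.
symmetric = np.zeros((3, 3, 3), dtype=np.uint8)
symmetric[2] = 1                       # a full bottom layer is rotation invariant
asymmetric = symmetric.copy()
asymmetric[1, 0, 1] = 1                # one lateral neighbour breaks the symmetry

print(distinct_vertical_rotations(symmetric), distinct_vertical_rotations(asymmetric))
```

With one invariant base template and three non-invariant ones for each of the U and D groups, the total is 2 · (1 + 3 · 4) = 26, in agreement with the text.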
Fig. 5. Two synthetic images containing a 140 × 140 × 50 horse and a 45 × 45 × 45 cube (top); and their skeletons produced by the proposed surface–thinning algorithm (bottom)
Implementing the proposed algorithm may seem rather difficult and time consuming, but this is not the case. A border point p in image B is to be deleted from deletion direction UD if

( ( B(d(p)) = 1 and B(u(p)) = 0 and d(p) is an interior point ) or ( B(u(p)) = 1 and B(d(p)) = 0 and u(p) is an interior point ) ) and $f(x_0, x_1, \ldots, x_{24}) = 1$,

where f is a Boolean function of 25 variables derived from the set of templates. Function f can be given by a pre-calculated 4 Mbyte (unit time access) look-up table, since its $2^{25}$ one-bit entries occupy $2^{25}/8$ bytes. The considered 25 variables correspond to 25 points in $N_{26}(p)$ (see Fig. 4). More details concerning the efficient implementation of 3D thinning algorithms are presented in [16].
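The size of the look-up table follows from storing one bit per configuration of the 25 Boolean variables. A bit-packed table of this kind might be organised as in the sketch below; the helper names are hypothetical, and the order in which the variables are packed into the index is an arbitrary choice made here, not the indexing scheme of Fig. 4.

```python
TABLE_BITS = 25
lut = bytearray((1 << TABLE_BITS) // 8)   # 2**25 bits = 4 194 304 bytes = 4 Mbyte

def pack(bits):
    """Pack 25 Boolean values x_0..x_24 into a table index (x_0 is the lowest bit)."""
    index = 0
    for k, bit in enumerate(bits):
        if bit:
            index |= 1 << k
    return index

def set_deletable(index):
    """Mark the configuration with the given index as deletable (f = 1)."""
    lut[index >> 3] |= 1 << (index & 7)

def is_deletable(index):
    """Constant-time evaluation of f for the configuration with the given index."""
    return (lut[index >> 3] >> (index & 7)) & 1 == 1

example = pack([True] * 3 + [False] * 22)
set_deletable(example)
print(len(lut), is_deletable(example))
```

In a real implementation the table would be filled once, by matching every one of the $2^{25}$ configurations against the templates, and then only queried during thinning.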
4 Discussion and Results
Thinning algorithms have to take care of the following four aspects:

1. forcing the "skeleton" to retain the topology of the original object (i.e., topology is to be preserved);
2. providing "shape preservation" (i.e., significant features of the original object are to be produced);
3. forcing the "skeleton" to be in its geometrically correct position (i.e., in the "middle" of the object);
4. producing "maximal" thinning (i.e., the desired "width" of the "skeleton" is one point).

The topological correctness (the 1st requirement) can be shown by using a characterization of simple points [9] and a sufficient condition for parallel reduction operations of 3D (26, 6) images [13]. Shape preservation (the 2nd requirement) is a fairly important requirement, too. For example, an object like "b" cannot be thinned into an object like "o". The aim of thinning is not to produce the topological kernel [2] of an object: thinning differs from shrinking. That is the reason why end–point criteria are used in thinning. It is easy to see that surface–end points are removed by none of our templates. Geometrical correctness (the 3rd requirement) of the extracted skeleton is mostly achieved by the subiteration (multi–directional) thinning approach: an object is to be shrunk uniformly from each direction.
Fig. 6. Three synthetic images containing a 45 × 45 × 45 cube with one, two, and three hole(s), respectively (top); and their skeletons produced by the proposed surface–thinning algorithm (bottom)
Table 1. Computation times for the considered five kinds of test objects. The implemented surface–thinning algorithm was run under Linux on an Intel Pentium 4 CPU 2.80 GHz PC. Due to the efficient implementation [16], the time complexity depends only on the number of object points and the compactness of the objects (i.e., volume to area ratio); but it does not depend on the size of the image.

test object              size               number of object points   running time (sec.)
horse                    140 × 140 × 50            92 534                  0.125
cube                     45 × 45 × 45              91 125                  0.035
cube                     93 × 93 × 93             804 357                  0.383
cube                     141 × 141 × 141        2 803 221                  1.388
cube with one hole       45 × 45 × 45              81 000                  0.032
cube with one hole       93 × 93 × 93             714 984                  0.359
cube with one hole       141 × 141 × 141        2 491 752                  1.320
cube with two holes      45 × 45 × 45              74 250                  0.028
cube with two holes      93 × 93 × 93             655 402                  0.335
cube with two holes      141 × 141 × 141        2 284 106                  1.268
cube with three holes    45 × 45 × 45              67 500                  0.026
cube with three holes    93 × 93 × 93             595 820                  0.322
cube with three holes    141 × 141 × 141        2 076 460                  1.105
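The numbers of object points in Table 1 can be cross-checked with a small construction. The sketch rests on an assumption that is inferred from the counts rather than stated in the text: each hole is taken to be a square tunnel of side n/3 running through the centre of an n × n × n cube, with the tunnels along the three coordinate axes. The function name is hypothetical.

```python
import numpy as np

def cube_with_holes(n, holes):
    """n x n x n cube pierced by `holes` (0 to 3) central square tunnels of
    side n // 3; the tunnel geometry is an assumption, not taken from the paper."""
    cube = np.ones((n, n, n), dtype=bool)
    a, b = n // 3, 2 * n // 3
    if holes >= 1:
        cube[:, a:b, a:b] = False
    if holes >= 2:
        cube[a:b, :, a:b] = False
    if holes >= 3:
        cube[a:b, a:b, :] = False
    return cube

for holes in range(4):
    print(holes, int(cube_with_holes(45, holes).sum()))
```

Under this assumption the 45 × 45 × 45 objects contain 91 125, 81 000, 74 250 and 67 500 points, the same values as in the table, and the construction also reproduces the tabulated counts for the larger sizes.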
It is rather difficult to prove that the 4th requirement about maximal thinning is satisfied. Due to the surface end–point criterion used, the produced skeleton may contain 2–point thick surface patches [1,10]. It is easy to overcome this problem (e.g., by applying the final thinning step proposed by Arcelli et al. [1]). Our algorithm has been tested on objects of different shapes. Here we present five examples (see Figs. 5–6). The computation time of a thinning process depends on the complexity of an iteration step and the required number of iteration steps. The 3–subiteration 3D thinning strategy has been compared with other subiteration–based approaches with periods of 6, 8, or 12. It has been shown that the 3–subiteration approach requires the smallest number of iterations [15]. If unit time access look-up tables (corresponding to the deletion rules of the considered algorithms) are used and our efficient implementation method [16] is applied, then the 3–subiteration algorithms are the fastest subiteration–based ones. The efficiency of the proposed method is illustrated in Table 1.
Acknowledgements

The author is grateful to Stina Svensson (Centre for Image Analysis, Swedish University of Agricultural Sciences, Uppsala, Sweden) for supplying the horse image data (see Fig. 5).
References

1. Arcelli, C., Sanniti di Baja, G., Serino, L.: New removal operators for surface skeletonization. In: Kuba, A., Nyúl, L.G., Palágyi, K. (eds.) DGCI 2006. LNCS, vol. 4245, pp. 555–566. Springer, Heidelberg (2006)
2. Bertrand, G., Aktouf, Z.: A 3D thinning algorithms using subfields. In: Proc. SPIE Conf. on Vision Geometry III, vol. 2356, pp. 113–124 (1994)
3. Bertrand, G.: A parallel thinning algorithm for medial surfaces. Pattern Recognition Letters 16, 979–986 (1995)
4. Blum, H.: A transformation for extracting new descriptors of shape. Models for the Perception of Speech and Visual Form, pp. 362–380. MIT Press, Cambridge (1967)
5. Gong, W.X., Bertrand, G.: A simple parallel 3D thinning algorithm. In: Proc. 10th Int. Conf. on Pattern Recognition, pp. 188–190 (1990)
6. Hall, R.W.: Parallel connectivity–preserving thinning algorithms. In: Kong, T.Y., Rosenfeld, A. (eds.) Topological algorithms for digital image processing, pp. 145–179. Elsevier Science, Amsterdam (1996)
7. Kong, T.Y., Rosenfeld, A.: Digital topology: Introduction and survey. Computer Vision, Graphics, and Image Processing 48, 357–393 (1989)
8. Lee, T., Kashyap, R.L., Chu, C.: Building skeleton models via 3–D medial surface/axis thinning algorithms. CVGIP: Graphical Models and Image Processing 56, 462–478 (1994)
9. Malandain, G., Bertrand, G.: Fast characterization of 3D simple points. In: Proc. 11th IEEE Internat. Conf. on Pattern Recognition, pp. 232–235 (1992)
10. Manzanera, A., Bernard, T.M., Prêteux, F., Longuet, B.: Medial faces from a concise 3D thinning algorithm. In: Proc. 7th IEEE Internat. Conf. Computer Vision, ICCV'99, pp. 337–343 (1999)
11. Mukherjee, J., Das, P.P., Chatterjee, B.N.: On connectivity issues of ESPTA. Pattern Recognition Letters 11, 643–648 (1990)
12. Palágyi, K., Kuba, A.: A 3D 6–subiteration thinning algorithm for extracting medial lines. Pattern Recognition Letters 19, 613–627 (1998)
13. Palágyi, K., Kuba, A.: Directional 3D thinning using 8 subiterations. In: Bertrand, G., Couprie, M., Perroton, L. (eds.) DGCI 1999. LNCS, vol. 1568, pp. 325–336. Springer, Heidelberg (1999)
14. Palágyi, K., Kuba, A.: A parallel 3D 12–subiteration thinning algorithm. Graphical Models and Image Processing 61, 199–221 (1999)
15. Palágyi, K.: A 3-subiteration 3D thinning algorithm for extracting medial surfaces. Pattern Recognition Letters 23, 663–675 (2002)
16. Palágyi, K.: Efficient implementation of 3D thinning algorithms. In: Proc. 6th Conf. Hungarian Association for Image Processing and Pattern Recognition, pp. 266–274 (2007)
17. Tsao, Y.F., Fu, K.S.: A parallel thinning algorithm for 3–D pictures. Computer Graphics and Image Processing 17, 315–331 (1981)
Extraction of River Networks from Satellite Images by Combining Mathematical Morphology and Hydrology

Pierre Soille and Jacopo Grazzini

Spatial Data Infrastructures Unit, Institute for Environment and Sustainability, DG Joint Research Centre, European Commission, I-21020 Ispra, Italy
{Pierre.Soille,Jacopo.Grazzini}@jrc.it
Abstract. In this paper, we propose a new methodology for extracting river networks from satellite images. It combines morphological generalised geodesic transformations with hydrological overland flow simulations. The method requires the prior generation of a geodesic mask and a marker image by applying a series of transformations to the original image. These images are then combined so as to produce a pseudo digital elevation model whose valleys match the desired networks. The performance of the methodology is demonstrated for the extraction of river networks from a single band of a Landsat image. The method is generic in the sense that it can be extended for the extraction of other types of arborescent networks such as blood vessels in medical images.
1 Introduction
Line networks [1], also called thin nets [2] or curvilinear structures [3], are found in numerous application fields. Probably the most studied ones are those encountered in medical images (e.g., blood vessels [4]) and satellite images of the Earth (e.g., roads [5,6] and rivers [7,8,9]). The extraction of line networks is often based on a two-step approach. Typically, a segmentation and/or classification leads to an initial detection of the desired network that avoids false positive detections but contains gaps. These gaps are then bridged using perceptual grouping techniques. In this paper, we focus on the extraction of river networks in satellite images and propose to exploit the arborescent nature of these networks (i.e., tree-like structure with a root). Similarly to watershed based segmentation, this is achieved by combining concepts arising from mathematical morphology and hydrology. Morphological algorithms allow for the creation of a pseudo digital elevation model whose main valley lines match actual rivers. These valleys are then extracted using overland flow simulation procedures known in hydrology. The rest of the paper is organised as follows. Background morphological and hydrological concepts at the root of the proposed methodology are recalled in Sec. 2. The methodology itself is detailed in Sec. 3.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 636–644, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Background Concepts
This section briefly describes the two main concepts at the basis of the proposed methodology: geodesic time function in mathematical morphology and contributing drainage areas in hydrology. Unless otherwise stated, we adopt definitions and notations presented in [10].

2.1 Geodesic Time in Mathematical Morphology
The geodesic distance between two points of a connected set is defined as the length of the shortest path(s) linking these points and remaining in the set [11]. This idea has been generalised to grey tone geodesic masks in [12] by introducing the notion of geodesic time. The time necessary for travelling a path is defined as the sum of the intensity values of the points of the path. The geodesic time separating two points of a grey tone geodesic mask is then defined as the smallest amount of time of all paths linking these points and remaining within the definition domain of the geodesic mask. Formally, the time necessary to cover a discrete path $\mathcal{P}$ of length l (i.e., $\mathcal{P} = (p_0, \ldots, p_l)$) defined on a discrete grey scale image f equals the sum of the means of the values of f taken two at a time along $\mathcal{P}$. It is denoted by $\tau_f(\mathcal{P})$:

$$\tau_f(\mathcal{P}) = \sum_{i=1}^{l} \frac{f(p_{i-1}) + f(p_i)}{2} = \frac{f(p_0)}{2} + \frac{f(p_l)}{2} + \sum_{i=1}^{l-1} f(p_i).$$

The geodesic time separating two points p and q in a grey scale image f is denoted by $\tau_f(p, q)$:

$$\tau_f(p, q) = \min\{\tau_f(\mathcal{P}) \mid \mathcal{P} \text{ is a path linking } p \text{ to } q\}.$$

The geodesic time between a point p and a reference set of points Y of a geodesic mask image f is the smallest amount of time allowing to link p to any point q of Y:

$$\tau_f(p, Y) = \min_{q \in Y} \tau_f(p, q).$$

If p belongs to Y, the geodesic time from p to Y is zero. The geodesic time function from a reference set Y within a geodesic mask image f is defined by associating each point of the image definition domain with its geodesic time to Y. We denote by $T_f(Y)$ the resulting geodesic time function:

$$[T_f(Y)](p) = \tau_f(p, Y).$$

An efficient algorithm based on priority queue data structures is detailed in [10, pp. 237–238]. Note that the concept of geodesic time function is closely related to the notion of grey-weighted distances in image processing [13] and, more generally, to cost functions in digital graphs.

2.2 Contributing Drainage Area in Hydrology
Digital elevation models (DEMs) can be viewed as grey tone images where the intensity value of each pixel represents the elevation of the terrain at the corresponding location. Since the availability of the first DEMs in the early 80’s,
numerous methodologies have been proposed for extracting river networks from them. A recent survey on this topic can be found in [14]. The best available method relies on the simulation of the flow of water on the topographic surface. It assumes that (i) all spurious minima have been suppressed beforehand and (ii) every point except the stream outlets has at least one neighbour with a lower elevation than the considered point. The flow direction at each point is then defined as the direction of the 8-neighbour of the point corresponding to the steepest slope. In this sense, each point of the terrain belongs to the network of streams. This is a valid assumption since the notion of stream is dynamic and depends on the intensity of the rain over the terrain, so that for heavy rainfall all points of the terrain are flowing. In practice, the flow directions allow for the computation of the number of pixels located upstream of each point of the DEM. This number is called the contributing drainage area. River network pixels are then defined as those pixels whose contributing drainage area exceeds some threshold value; see for example [15] and references therein.
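The flow simulation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of the cited references: slopes to diagonal neighbours are divided by √2, the contributing drainage area of a pixel counts the pixels strictly upstream of it (excluding the pixel itself), and the function name `contributing_area` is chosen here.

```python
import numpy as np

NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def contributing_area(dem):
    """Number of upstream pixels of every pixel, using steepest-descent flow
    towards one of the 8 neighbours."""
    dem = np.asarray(dem, dtype=float)
    rows, cols = dem.shape
    downstream = {}
    for r in range(rows):
        for c in range(cols):
            best, target = 0.0, None
            for dr, dc in NEIGHBOURS_8:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    slope = (dem[r, c] - dem[rr, cc]) / np.hypot(dr, dc)
                    if slope > best:           # keep the steepest descending neighbour
                        best, target = slope, (rr, cc)
            downstream[(r, c)] = target        # None for outlets and minima
    area = np.zeros(dem.shape, dtype=int)
    # Visiting pixels from high to low guarantees that the area of a pixel is
    # complete before it is passed on to its downstream neighbour.
    for p in sorted(downstream, key=lambda q: dem[q], reverse=True):
        target = downstream[p]
        if target is not None:
            area[target] += area[p] + 1
    return area

ramp = [[4, 3, 2, 1, 0]]
print(contributing_area(ramp))
```

A river network is then obtained by thresholding, for instance `contributing_area(dem) >= t` for a chosen threshold `t`. On the one-row ramp above the areas grow from 0 at the top to 4 at the outlet.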
3 Methodology
Our methodology for extracting river networks from a single optical satellite image can be summarised as follows:

1. Define the outlets of the desired river network and a region of interest;
2. Enhance potential river stretches;
3. Generate a pseudo DEM by computing the geodesic time function with the outlets as reference set and the image with enhanced rivers as geodesic mask;
4. Calculate the local flow directions of the pseudo DEM and the subsequent contributing drainage areas;
5. Trim the resulting space filling network so as to obtain a network matching the target network.

The successive steps are detailed below and illustrated on the Landsat image displayed in Fig. 1a.

3.1 Outlet Definition and Region of Interest
Typically, the outlets of the target network correspond to the sea pixels of the input imagery. Hence, the sea pixels play the role of the reference set of the geodesic time function and will therefore correspond to the level 0 of the pseudo DEM generated in Sec. 3.3. If the watershed boundaries of the catchment basins of the processed area are available, all computations are restricted to a region of interest defined as the union of the catchments fully included in the image and flowing towards sea pixels. Alternatively, other water features such as large inland lakes can be used instead of the sea for the reference set. Finally, if no such features can be detected, the boundary of the image definition domain can be considered as a valid outlet in a first approximation.
Extraction of River Networks from Satellite Images
(a) Input Landsat image (band 5): f .
(b) Rank-min closing of f : φB,λ (f ).
(c) Top-hat of f : BTH(f ) = f − φB,λ (f ).
(d) Geodesic mask: g = f · T[0,0] (BTH(f )).
Fig. 1. Geodesic mask creation from the fifth band of a Landsat image
3.2
Generation of Geodesic Mask
The geodesic mask is obtained by modifying the input image in such a way that potential river pixels are set to low values. By doing so, the geodesic time function will propagate faster along potential rivers. Since satellite images are usually multispectral images, a combination of channels can be used. In this paper, we use a single spectral band having low reflectance values for water (alternatively, band combinations such as those defined by a water index could be used). Then, rather than exploiting knowledge about the reflectance values of water in the selected band, we translate knowledge about the shape, size, and local contrast of river stretches into a series of morphological operations. By doing so, we are able to enhance river stretches that could not be detected solely on the basis of their spectral values. Indeed, many rivers are detectable by a human operator owing to their shape and relative contrast since they appear as dark networks of thin lines. In mathematical morphology terms, the extraction of potential river pixels can be achieved by initially suppressing these networks with a closing by
(a) Geodesic time function: Tg (sea). This function can be seen as a pseudo DEM.
(b) Shaded view of lower complete transformation of Tg (sea).
(c) Flow directions of (b).
(d) Contributing drainage areas of (c).
Fig. 2. From geodesic time function to contributing drainage areas (see also Fig. 1). In this example, the sea pixels are falling outside the image frame.
a disk-shaped structuring element whose diameter slightly exceeds the width of the targeted network. However, a closing by a plain structuring element would enhance too many structures. A less restrictive closing consists in considering the intersection (i.e., point-wise minimum) of the closings by all subsets of the disk containing a fixed number of pixels, see Fig. 1b. This type of closing is known as a parametric closing or rank-min closing owing to its representation in terms of a rank filter followed by an erosion (i.e., a minimum filter) and a point-wise maximum with the input image (see [16] and references therein). In Fig. 1b, taking into account the physical size of a pixel, we used a 3 × 3 square structuring element B and considered every subset of B containing 6 pixels. Although the closed image looks very similar to the input image, this operation closes many potential river stretches. This is highlighted by computing the difference between the closing and the input image, called top-hat by closing, see Fig. 1c. Notice that the top-hat by closing leads to many false positive detections. This is however not an issue because the
subsequent steps cope with false positives provided that a high detection rate is obtained for actual river stretches. We therefore consider as potential river pixels all pixels of this top-hat by closing that have a positive response. The geodesic mask is then formed by setting these pixels to zero, all other pixels retaining their original value, see Fig. 1d.
3.3
Geodesic Time Function
The geodesic time function is then computed from the marker set (sea pixels) within the geodesic mask. Since potential river pixels are set to zero in the geodesic mask, the propagation goes faster along potential rivers so that the resulting geodesic time function mimics a digital elevation model of the studied terrain. However, the computed time values correspond to a sum of reflectance values so that they do not match actual elevations. For this reason we call this geodesic time function a pseudo DEM, see Fig. 2a.
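The mask generation of Sec. 3.2 and the geodesic time computation above can be sketched together as follows. This is a brute-force illustration under stated assumptions, not the authors' implementation: the rank-min closing is evaluated directly from its definition (point-wise minimum of the closings by all 6-pixel subsets of the 3 × 3 square) rather than through rank filters, pixels outside the image are ignored, and the travel time of a step is assumed to be the mean mask value of its two end pixels times the step length, propagated with Dijkstra's algorithm. All names and the toy image are hypothetical.

```python
import heapq
import math
from itertools import combinations

SQUARE_3X3 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def closing(image, se):
    """Closing by structuring element `se`; out-of-image pixels are ignored."""
    rows, cols = len(image), len(image[0])

    def inside(r, c):
        return 0 <= r < rows and 0 <= c < cols

    # Dilation: maximum over the translated structuring element.
    dil = [[max(image[r + dr][c + dc] for dr, dc in se if inside(r + dr, c + dc))
            for c in range(cols)] for r in range(rows)]
    # Erosion by the reflected element: minimum over the opposite offsets.
    return [[min(dil[r - dr][c - dc] for dr, dc in se if inside(r - dr, c - dc))
             for c in range(cols)] for r in range(rows)]


def rank_min_closing(image, se, n_pixels):
    """Point-wise minimum of the closings by all subsets of `se` with n_pixels."""
    result = None
    for subset in combinations(se, n_pixels):
        closed = closing(image, subset)
        if result is None:
            result = closed
        else:
            result = [[min(a, b) for a, b in zip(ra, rb)]
                      for ra, rb in zip(result, closed)]
    return result


def geodesic_mask(image):
    """Set to zero every pixel with a positive top-hat by closing response."""
    closed = rank_min_closing(image, SQUARE_3X3, 6)
    return [[0 if cl - v > 0 else v for cl, v in zip(rc, rv)]
            for rc, rv in zip(closed, image)]


def geodesic_time(mask, markers):
    """Smallest accumulated travel time from the marker set to every pixel."""
    rows, cols = len(mask), len(mask[0])
    time = [[math.inf] * cols for _ in range(rows)]
    heap = []
    for r, c in markers:
        time[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        t, r, c = heapq.heappop(heap)
        if t > time[r][c]:
            continue  # outdated queue entry
        for dr, dc in SQUARE_3X3:
            nr, nc = r + dr, c + dc
            if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                # Assumed step cost: mean mask value times step length.
                step = 0.5 * (mask[r][c] + mask[nr][nc]) * math.hypot(dr, dc)
                if t + step < time[nr][nc]:
                    time[nr][nc] = t + step
                    heapq.heappush(heap, (t + step, nr, nc))
    return time


# Toy band: bright background (10) crossed by a dark one-pixel-wide river (2).
image = [[10] * 7 for _ in range(7)]
for r in range(7):
    image[r][3] = 2
mask = geodesic_mask(image)
pseudo_dem = geodesic_time(mask, markers=[(6, 3)])  # hypothetical 'sea' pixel
for row in pseudo_dem:
    print(row)
```

In this toy case the river column is closed by every 6-pixel subset, so it receives the value zero in the mask, and the resulting time function forms a valley along the river, which is the behaviour the pseudo DEM relies on.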
(a) River network extracted by applying a fixed threshold of 23000 pixels to Fig. 2d.
(b) Extracted network overlaid on input image.
Fig. 3. River network resulting from thresholding the contributing drainage areas of the geodesic time function
3.4
From Geodesic Time Function to Drainage Areas
We then compute the flow directions on the geodesic time function. The flow direction of a pixel is defined as the direction of its 8-neighbour producing the steepest slope. Ties are resolved either by randomly selecting a possible direction or by choosing the central direction when there are three adjacent possibilities. However, flow directions cannot be defined from a local neighbourhood on plateaus (resulting from the presence of zero valued pixels in the geodesic mask). Plateaus can be suppressed using geodesic distance computations from their descending border [17]. A shaded view of the resulting lower complete function is shown in Fig. 2b and the corresponding flow directions in Fig. 2c.
(a) Retina image.
(b) Pseudo DEM.
(c) Extracted network.
Fig. 4. Detection of blood vessels in an image of an eye retina using the methodology described in this paper: preliminary results
Contributing drainage areas are then calculated by simulating the flow of water on the pseudo DEM using the previously defined flow directions, see [15] for a fast algorithm. The contributing drainage areas of our sample image are displayed in Fig. 2d. 3.5
Network Extraction
The extraction of an actual network from the contributing drainage area image can be achieved by thresholding it at a fixed value. This is illustrated in Fig. 3. Rather than using fixed thresholds, adaptive threshold values can be defined using a priori knowledge or further transformations of the input image. In that case, the river network is defined as the set of pixels located downstream of any pixel whose contributing drainage area exceeds its threshold value. This idea has already been used for the extraction of river networks from actual DEMs in [18].
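The adaptive variant can be sketched as a downstream propagation. The sketch assumes that flow directions are available as a mapping from each pixel to its downstream pixel (None at an outlet), that thresholds are given per pixel, and that a seed pixel belongs to its own downstream; the function name and the toy data are hypothetical.

```python
def extract_network(downstream, area, threshold):
    """Pixels lying downstream of any pixel whose area exceeds its threshold."""
    network = set()
    for pixel, a in area.items():
        if a > threshold[pixel]:
            # Follow the flow path until an outlet or an already marked pixel.
            current = pixel
            while current is not None and current not in network:
                network.add(current)
                current = downstream[current]
    return network


# Toy flow graph: a -> b -> c -> d (outlet), with a tributary e -> c.
downstream = {"a": "b", "b": "c", "c": "d", "d": None, "e": "c"}
area = {"a": 0, "b": 1, "c": 3, "d": 4, "e": 0}
threshold = {"a": 5, "b": 0, "c": 10, "d": 10, "e": 5}
network = extract_network(downstream, area, threshold)
print(sorted(network))
```

Here only pixel b exceeds its own threshold, yet c and d are also part of the network because they lie downstream of b. This is what distinguishes the downstream definition from a plain pixel-wise comparison.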
4
Conclusion and Perspectives
The originality of the method proposed in this paper lies in the combination of geodesic time computations with concepts related to overland flow simulation. The generation of the pseudo digital elevation model exploits information on the local contrast (river stretches are darker in the selected band) and shape (linear and thin structures) of the targeted network as well as its arborescent organisation. If necessary, physical image formation models and the use of several spectral bands rather than a single one could be used to constrain further the generation of the pseudo DEM and/or the extraction of the final network from this DEM. The proposed methodology is generic in the sense that it can be extended for the extraction of other arborescent networks such as blood vessels in medical
images. A preliminary result for the extraction of blood vessels in retina images is displayed in Fig. 4. In this example, the optic nerve plays the role of the 'sea' pixels and has been set manually, see yellow disk in Fig. 4b. All concepts are directly applicable to 3-dimensional images. This general framework will be detailed in a subsequent paper and will include the treatment of anastomoses (i.e., networks in which line features both branch out and reconnect such as for braided rivers). Multiscale concepts could also be considered for the enhancement of potential network stretches [19].
References
1. Geusebroek, J.M., Smeulders, A., Geerts, H.: A minimum cost approach for segmenting networks of lines. Int. J. of Computer Vision 43, 99–111 (2001)
2. Armande, N., Montesinos, P., Monga, O.: Thin nets extraction using a multi-scale approach. Computer Vision and Image Understanding 73, 285–295 (1999)
3. Steger, C.: An unbiased detector of curvilinear structures. IEEE Trans. on Pattern Analysis and Machine Intelligence 20, 113–125 (1998)
4. Zana, F., Klein, J.C.: Segmentation of vessel-like patterns using mathematical morphology and curvature evaluation. IEEE Trans. on Image Processing 10, 1010–1019 (2001)
5. Merlet, N., Zerubia, J.: New prospects in line detection by dynamic programming. IEEE Trans. on Pattern Analysis and Machine Intelligence 18, 426–431 (1996)
6. Mena, J.: State of the art on automatic road extraction for GIS update: A novel classification. Pattern Recognition Letters 24, 3037–3058 (2003)
7. Haralick, R., Wang, S., Shapiro, L., Campbell, J.: Extraction of drainage networks by using the consistent labeling technique. Remote Sensing of Environment 18, 163–175 (1985)
8. Ichoku, C., Karnieli, A., Meisels, A., Chorowicz, J.: Detection of drainage channel networks on digital satellite images. Int. J. of Remote Sensing 17, 1659–1678 (1996)
9. Dillabaugh, C., Niemann, O., Richardson, D.: Semi-automated extraction of rivers from digital imagery. GeoInformatica 6, 263–284 (2002)
10. Soille, P.: Morphological Image Analysis: Principles and Applications, 2nd edn. Springer, Heidelberg (2003)
11. Lantuéjoul, C., Maisonneuve, F.: Geodesic methods in image analysis. Pattern Recognition 17, 177–187 (1984)
12. Soille, P.: Generalized geodesy via geodesic time. Pattern Recognition Letters 15, 1235–1240 (1994) Download a PDF preprint: http://ams.jrc.it/soille/soille94.pdf
13. Verbeek, P., Verwer, B.: Shading from shape, the eikonal equation solved by greyweighted distance transform. Pattern Recognition Letters 11, 681–690 (1990)
14. Soille, P., Vogt, J., Colombo, R.: Carving and adaptive drainage enforcement of grid digital elevation models. Water Resources Research 39, 1366 (2003)
15. Soille, P., Gratin, C.: An efficient algorithm for drainage networks extraction on DEMs. J. of Visual Communication and Image Representation 5, 181–189 (1994)
16. Soille, P.: On morphological operators based on rank filters. Pattern Recognition 35, 527–535 (2002)
17. Soille, P.: Advances in the analysis of topographic features on discrete images. In: Braquelaire, A., Lachaud, J.-O., Vialard, A. (eds.) DGCI 2002. LNCS, vol. 2301, pp. 175–186. Springer, Heidelberg (2002)
18. Colombo, R., Vogt, J., Soille, P., Paracchini, M., de Jager, A.: On the derivation of river networks and catchments at European scale from medium resolution digital elevation data. Catena (2007), Available online (November 22, 2006)
19. Grazzini, J., Chrysoulakis, N.: Extraction of surface properties from a high accuracy DEM using multiscale remote sensing techniques. In: Proc. of the 19th Conf. Informatics for Environmental Protection, Brno, Czech Republic, pp. 352–356 (2005)
Fractal Active Shape Models
Polychronis Manousopoulos, Vassileios Drakopoulos, and Theoharis Theoharis
Department of Informatics and Telecommunications, University of Athens, Panepistimioupolis, 157 84, Athens, Greece
{polyman, vasilios, theotheo}@di.uoa.gr
Abstract. Active Shape Models often require a considerable number of training samples and landmark points on each sample, in order to be efficient in practice. We introduce the Fractal Active Shape Models, an extension of Active Shape Models using fractal interpolation, in order to surmount these limitations. They require a considerably smaller number of landmark points to be determined and a smaller number of variables for describing a shape, especially for irregular ones. Moreover, they are shown to be efficient when few training samples are available.
1
Introduction
Active Shape Models (ASMs, see e.g. [1]) are widely used for image segmentation and motion tracking; they have been proven to be efficient and robust in many areas of application, especially biomedical ones (see e.g. [2], [3], [4]). Extensions of the original ASM formulation have also been proposed, in order to increase their robustness, efficiency and applicability (see e.g. [5], [6], [7]). In ASM applications, a significant number of landmark points has to be used in each training sample in order to model the desired variability of the allowable shape space, resulting in a time-consuming labelling process. Although (semi-)automatic methods have been developed for assisting this, the labelling remains a difficult and error-prone task. Moreover, the number of available training samples is often relatively small in practice, resulting in a rather restricted space of allowable shapes that reduces the efficiency of the image search. Our motivation is to create an extension of the ASMs that requires a considerably smaller number of landmark points, uses fewer variables to describe shapes, even irregular ones, thus increasing the accuracy of representation, and is efficient even for a small number of training samples. Motivated by the above points, we introduce the Fractal Active Shape Models, an extension of Active Shape Models using fractal interpolation.
2
Mathematical Background
2.1
Active Shape Models
Active Shape Models are statistical models of shape that can be applied to image search (see e.g. [1]). An ASM uses a Point Distribution Model (PDM) for the statistical modelling of shapes in a given training set of sample images. This model is then used for locating statistically allowable shapes in other images. A shape is described by a set of landmark points. These correspond to particular shape features and their interpretation should be consistent among all shapes of a class. For example, the boundary of the Laurus nobilis leaf depicted in Fig. 1(a) is represented by the 77 landmark points of Fig. 1(c). Moreover, the landmark points must be adequate for describing the shape variability of the training set as well as all other allowable shapes. The landmark points may be manually located on the training set images, but (semi-)automatic methods have also been developed, since this is often a time-consuming process (see footnote 1). The sets of landmark points located in the training images are aligned in a common coordinate frame, e.g. using the iterative algorithm of [1]. An aligned shape can be represented by the vector
$$\mathbf{x} = (x_1, \ldots, x_L, y_1, \ldots, y_L), \tag{1}$$
where $(x_i, y_i)$, $i = 1, \ldots, L$, are the aligned landmark points. The aligned shapes are then statistically modelled by applying Principal Component Analysis (PCA) to them; each shape $\mathbf{x}$ is represented as
$$\mathbf{x} = \bar{\mathbf{x}} + \mathbf{P}\mathbf{b}, \tag{2}$$
where $\bar{\mathbf{x}}$ is the mean shape, $\mathbf{P}$ is the matrix of eigenvectors of the covariance matrix of the shape vectors and $\mathbf{b}$ is a vector of parameters determining the deviation of $\mathbf{x}$ from the mean shape along each eigenvector. In practice, $\mathbf{P}$ contains only the eigenvectors of the largest eigenvalues; insignificant eigenvectors, which may correspond to noise in the training set, are eliminated. Equation (2) can be used for defining the space of allowable shapes by suitably constraining $\mathbf{b}$, e.g. $|b_k| \le 3\sqrt{\lambda_k}$ for all elements $b_k$ of $\mathbf{b}$, where $\lambda_k$ is the eigenvalue of the corresponding eigenvector.
This statistical shape model of the training set can be used for locating similar shapes within graphical objects such as images. Starting with a rough approximation of the shape in the image, each point is moved to a better candidate location nearby. The points are usually moved along the shape boundary normal and are attracted to image edges or matching profiles of local intensity. This procedure is executed iteratively until it converges. At each step, the shape resulting from moving all points is constrained to the space of statistically allowable shapes. Therefore, the result is compatible with both the image structure and the training set.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 645–652, 2007. © Springer-Verlag Berlin Heidelberg 2007
2.2
Fractal Interpolation
Fractal interpolation functions (see e.g. [8]) are based on the theory of iterated function systems (IFSs). An IFS, denoted by $\{X; w_n, n = 1, 2, \ldots, N\}$, consists of a complete metric space $(X, \rho)$, e.g. $(\mathbb{R}^n, \|\cdot\|)$, and a finite set of continuous mappings $w_n : X \to X$, $n = 1, 2, \ldots, N$. If the $w_n$ are contractions, then the IFS is termed hyperbolic and the transformation $W : H(X) \to H(X)$ with $W(B) = \bigcup_{n=1}^{N} w_n(B)$, where $H(X)$ denotes the space of nonempty compact subsets of $X$, has a unique fixed point $A_\infty = W(A_\infty) = \lim_{n \to \infty} W^n(B)$, for every $B \in H(X)$, which is called the attractor of the IFS.
Footnote 1: See Cootes T.F. and Taylor C.J., Statistical Models of Appearance for Computer Vision, http://www.isbe.man.ac.uk/~bim, March 2004, pp. 79–82 for a survey on landmark placement methods.
Let us represent the given set of data points as $\{(u_m, v_m) \in \mathbb{R}^2 : m = 0, 1, \ldots, M\}$. In general, the interpolation is applied to a subset of them, the interpolation points, represented as $\{(x_i, y_i) \in \mathbb{R}^2 : i = 0, 1, \ldots, N\}$. Both sets are linearly ordered with respect to their abscissa, i.e. $u_0 < u_1 < \cdots < u_M$ and $u_0 = x_0 < x_1 < \cdots < x_N = u_M$. The interpolation points partition the set of data points into interpolation intervals and may be chosen equidistantly or not. Let $\{\mathbb{R}^2; w_n, n = 1, 2, \ldots, N\}$ be an IFS with affine transformations
$$w_n \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} a_n & 0 \\ c_n & s_n \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} d_n \\ e_n \end{pmatrix}$$
constrained to satisfy
$$w_n \begin{pmatrix} x_0 \\ y_0 \end{pmatrix} = \begin{pmatrix} x_{n-1} \\ y_{n-1} \end{pmatrix} \quad \text{and} \quad w_n \begin{pmatrix} x_N \\ y_N \end{pmatrix} = \begin{pmatrix} x_n \\ y_n \end{pmatrix}$$
for every $n = 1, 2, \ldots, N$. As results from solving the above equations (see e.g. [8], p. 213), the real numbers $a_n, d_n, c_n, e_n$ are completely determined by the interpolation points, while the $s_n$ are free parameters of the transformations satisfying $|s_n| < 1$, in order to guarantee that the IFS is hyperbolic with respect to an appropriate metric. The transformations $w_n$ are shear transformations and the $s_n$ their respective vertical scaling (or contractivity) factors.
It is well known (see for example [8]) that the attractor $G = \bigcup_{n=1}^{N} w_n(G)$ of the aforementioned IFS is the graph of a continuous function $f : [x_0, x_N] \to \mathbb{R}$ that interpolates the points $(x_i, y_i)$, $i = 0, 1, \ldots, N$. This function is called fractal interpolation function (FIF) corresponding to these points and is self-affine.
If the interpolation points define a curve rather than a function, i.e. they are not linearly ordered with respect to their abscissa, then the direct use of a fractal interpolation function is not possible. In order to construct an IFS whose attractor interpolates the given points, and is therefore a curve, we can transform or extend the original points such that the application of a FIF is possible. This is then transformed or projected back to the plane to obtain a curve that interpolates the original points. Curve fitting using fractal interpolation is discussed in [9], where the Fractal Curve Fitting (FCF) method is introduced. The latter will be used in this paper, since ASMs are usually applied to objects with landmark points that define curves.
The FCF method consists of transforming the data points $\{(u_m, v_m) \in \mathbb{R}^2 : m = 0, 1, \ldots, M\}$ by $T_1(u_m, v_m) = (u'_m, v'_m)$, $m = 0, 1, \ldots, M$, where
$$u'_m = u_0 + \sum_{j=1}^{m} \left( |u_j - u_{j-1}| + \varepsilon \right) = u'_{m-1} + \left( |u_m - u_{m-1}| + \varepsilon \right),$$
$$v'_m = v_m,$$
and $\varepsilon > 0$ is an arbitrary constant, necessary when all data points in an interpolation interval have equal u-coordinates. A FIF is then constructed for $(u'_m, v'_m)$, $m = 0, 1, \ldots, M$, and this is finally transformed by $T_2(u', v') = (u, v)$, where
$$u = u_{m-1} + (u_m - u_{m-1}) \, \frac{u' - u'_{m-1}}{u'_m - u'_{m-1}}, \quad u' \in [u'_{m-1}, u'_m],$$
$$v = v',$$
resulting in a fractal interpolation curve that interpolates the initial points and passes near the remaining data points. The advantage of the FCF method over other methods is that it uses a smaller number of total affine transformation parameters (specifically half). Therefore, it is more suitable to our purpose as will become clearer in the next section.
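The construction of this subsection can be sketched in a few lines. The sketch assumes the standard closed-form solution of the two endpoint constraints for the coefficients of each map, vertical scaling factors chosen by hand (the analytic method of [10] is not reproduced), and an attractor approximated by repeatedly applying all maps to the interpolation points. The function names and the sample data are hypothetical.

```python
def fcf_forward(points, eps):
    """T1: replace u by the cumulative |du| + eps so abscissae strictly increase."""
    out = [points[0]]
    for (u_prev, _), (u, v) in zip(points, points[1:]):
        out.append((out[-1][0] + abs(u - u_prev) + eps, v))
    return out


def fcf_backward(u_prime, original, transformed):
    """T2: map u' linearly from [u'_{m-1}, u'_m] back to [u_{m-1}, u_m]."""
    for m in range(1, len(original)):
        lo, hi = transformed[m - 1][0], transformed[m][0]
        if lo <= u_prime <= hi:
            t = (u_prime - lo) / (hi - lo)
            return original[m - 1][0] + t * (original[m][0] - original[m - 1][0])
    raise ValueError("u' lies outside the transformed data range")


def fif_maps(interp, s):
    """Coefficients (a, c, s, d, e) of each w_n from the endpoint constraints."""
    (x0, y0), (xN, yN) = interp[0], interp[-1]
    span = xN - x0
    maps = []
    for n in range(1, len(interp)):
        (xp, yp), (xn, yn) = interp[n - 1], interp[n]
        a = (xn - xp) / span
        d = (xN * xp - x0 * xn) / span
        c = (yn - yp) / span - s[n - 1] * (yN - y0) / span
        e = (xN * yp - x0 * yn) / span - s[n - 1] * (xN * y0 - x0 * yN) / span
        maps.append((a, c, s[n - 1], d, e))
    return maps


def apply_map(w, point):
    """Evaluate w(x, y) = (a x + d, c x + s y + e)."""
    a, c, s, d, e = w
    x, y = point
    return (a * x + d, c * x + s * y + e)


def attractor_points(interp, maps, depth):
    """Points on the FIF graph obtained by iterating all maps `depth` times."""
    points = set(interp)
    for _ in range(depth):
        points |= {apply_map(w, p) for w in maps for p in points}
    return sorted(points)


# Curve data whose u-coordinates are not monotone (hypothetical values).
data = [(0, 0), (2, 1), (1, 3), (3, 2), (2, 4)]
transformed = fcf_forward(data, eps=0.5)
interp = transformed[::2]                 # every 2nd point is interpolated
maps = fif_maps(interp, s=[0.25, -0.25])  # free parameters, |s_n| < 1
curve = attractor_points(interp, maps, depth=3)
print(len(curve), curve[:4])
```

Each generated point stays on the attractor because the interpolation points lie on it and every map sends the attractor into itself. Mapping the abscissae back with `fcf_backward` then yields points of the fitted curve in the original plane.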
3
Fractal Active Shape Models
We now introduce the Fractal Active Shape Models (FASMs), an extension of the ASM using fractal interpolation. Within the FASM framework, a shape is represented by a fractal interpolation curve rather than a set of landmark points. The interpolation points for constructing the curve need not be all the landmark points of the respective ASM, but only a relatively small subset of them. Indeed, using fewer interpolation points we are able to construct a curve that accurately describes the shape while requiring a smaller total number of variables. Let us represent the set of all aligned shape boundary points, including the landmarks, extracted from the k-th training set image as $\{(u_m^k, v_m^k) \in \mathbb{R}^2 : m = 0, 1, \ldots, M\}$. The selected interpolation points, a subset of the landmark points, are represented accordingly as $\{(x_i^k, y_i^k) \in \mathbb{R}^2 : i = 0, 1, \ldots, N\}$. In the present work, the interpolation intervals are chosen with fixed length, i.e. every j-th data point is selected as interpolation point. A fractal interpolation curve is constructed for the shape points using the FCF method of Subsec. 2.2, with $\{\mathbb{R}^2; w_n^k, n = 1, 2, \ldots, N\}$ being the respective IFS. The fractal interpolation curve, and therefore the respective shape, is described by the $w_n^k$ transformation parameters, and is thus represented by the vector
$$\mathbf{f}^k = (a_1^k, \ldots, a_N^k, c_1^k, \ldots, c_N^k, s_1^k, \ldots, s_N^k, d_1^k, \ldots, d_N^k, e_1^k, \ldots, e_N^k). \tag{3}$$
If the shape consists of multiple disjoint parts, e.g. a face, then we model each part with a fractal interpolation curve and create an aggregate vector $\mathbf{f}^k$ of all transformation parameters for representing the shape. The parameters $a_n^k, c_n^k, d_n^k, e_n^k$ are determined by the interpolation points, while the parameters $s_n^k$ are determined by the remaining points. Thus, if we set $s_n^k = 0$ for every $n = 1, \ldots, N$, resulting in piecewise linear interpolation, we are taking into account only the interpolation points, i.e. the FASM is equivalent to the respective ASM. Therefore, the FASM can be viewed as an extension of the ASM.
Fig. 1. (a) A leaf of Laurus nobilis. (b) The leaf’s boundary consisting of 930 points. (c) The 77 landmark points representing the leaf. (d) The fractal interpolation curve of 23 interpolation points representing the leaf.
For example, the boundary of the leaf of Laurus nobilis depicted in Fig. 1(a) can be represented either by the 77 landmark points of Fig. 1(c), or by the fractal interpolation curve of 23 interpolation points of Fig. 1(d). The curve is constructed using the FCF method. The interpolation intervals have been chosen with fixed length and the vertical scaling factors $s_n$ have been calculated with the analytic method of [10]. In both cases, the accuracy of representation is similar, but the FASM requires less than 30% of the landmark points of the ASM. Moreover, in the ASM the shape is represented by 154 (= 2 × 77) parameters, while in the FASM it is represented by 115 (= 5 × 23) parameters (see footnote 2), i.e. the FASM representation is more economical. This is because fractal interpolation exploits the presence of self-affinity in the data, thus allowing some degree of compression. The shape of this example is rather simple; for more complex and irregular shapes the FASM representation is expected to be even more profitable. This representation also has the advantage of describing the whole shape and not only a set of landmark points. Although the landmark points can capture the overall shape, the information of the remaining shape points is lost in the ASM. The FASM representation captures the whole shape using a smaller amount of data than the ASM.
The statistical modelling of the shapes is achieved by applying PCA to the vectors $\mathbf{f}^k$, $k = 1, \ldots, K$. A shape can then be expressed as
$$\mathbf{f} = \bar{\mathbf{f}} + \mathbf{P}\mathbf{b}, \tag{4}$$
where $\bar{\mathbf{f}}$ represents the mean transformation, $\mathbf{P}$ the matrix of eigenvectors of the covariance matrix of the shape vectors and $\mathbf{b}$ the parameter vector. The space of allowable shapes is similarly defined by suitably constraining $\mathbf{b}$, e.g. $|b_k| \le 3\sqrt{\lambda_k}$ for all elements $b_k$ of $\mathbf{b}$, where $\lambda_k$ is the eigenvalue of the corresponding eigenvector. Note that in the allowable shape space it should always be $|s_n| < 1$, for $n = 1, 2, \ldots, N$. The feasibility of this approach stems from the fact that the attractor of an IFS, a fractal curve in our case, depends continuously on the affine transformation $w_n^k$ parameters ([8], p. 111). Therefore it is reasonable to apply the aforementioned PCA.
Footnote 2: An additional copy of the first point is automatically appended to the interpolation points in order to obtain a closed curve, thus resulting in 23 affine transformations.
In order to locate an allowable shape in an image, we use the following algorithm, which is an extension of the ASM algorithm:
1. Initialize an approximate fit to the given image.
2. Calculate the fractal interpolation curve for the current parameter vector $\mathbf{b}$ and quantize it to pixel coordinates.
3. Move the curve points along the boundary normal towards the image edges. If duplicates exist, then remove them.
4. Calculate the parameter vector $\mathbf{b}$ for the adjusted points and limit it to the allowable space.
5. If the change in the shape is not negligible, go to Step 2.
The initialization in Step 1 of the algorithm can be done in various ways. For example, we may select a set of initial points as in the ASM case, calculate the respective affine transformation parameters $(a_n^k, c_n^k, d_n^k, e_n^k)$ and set the free parameters $(s_n)$ to zero (see footnote 3). In Step 2, the number of calculated attractor points defines the balance between shape accuracy and execution time. Note that increasing the number of attractor points beyond a limit is unnecessary because of the quantization to pixel coordinates. In Step 3, the adjustment of the points is performed as in the original ASM algorithm; the image edges are calculated using a gradient operator, e.g. Sobel operator. In Step 4, the parameter vector, i.e. the IFS, for the adjusted points is calculated as before with the FCF method; the interpolation points are the adjusted interpolation points of the previous step.
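The limitation of the parameter vector in Step 4 can be sketched as follows. The sketch assumes that the mean vector, the eigenvectors and the eigenvalues have already been obtained by PCA, that the eigenvectors are stored as rows, and that shape vectors follow the layout of Eq. (3). The function names and the numbers are hypothetical.

```python
import math


def constrain_parameters(b, eigenvalues, limit=3.0):
    """Clamp each b_k to the interval [-limit*sqrt(lambda_k), limit*sqrt(lambda_k)]."""
    clamped = []
    for b_k, lam in zip(b, eigenvalues):
        bound = limit * math.sqrt(lam)
        clamped.append(max(-bound, min(bound, b_k)))
    return clamped


def reconstruct(mean, modes, b):
    """f = mean + P b, where `modes` holds the eigenvectors (columns of P) as rows."""
    return [m + sum(b_k * mode[i] for b_k, mode in zip(b, modes))
            for i, m in enumerate(mean)]


def scaling_factors_valid(f, n_maps):
    """Check |s_n| < 1, assuming the layout (a.., c.., s.., d.., e..) of Eq. (3)."""
    return all(abs(s) < 1.0 for s in f[2 * n_maps:3 * n_maps])


# Toy model with a single affine map (N = 1) and two retained modes.
mean = [0.5, 0.25, 0.5, 0.0, 1.0]        # (a_1, c_1, s_1, d_1, e_1)
modes = [[0, 0, 0.25, 0, 0], [1, 0, 0, 0, 0]]
eigenvalues = [0.25, 4.0]
b_raw = [2.0, -0.1]

b = constrain_parameters(b_raw, eigenvalues)
f = reconstruct(mean, modes, b)
print(b, f, scaling_factors_valid(f, n_maps=1))
```

In this toy case the unconstrained vector would give a scaling factor of exactly 1, which violates the hyperbolicity condition, whereas the clamped one stays below it. Clamping alone does not guarantee $|s_n| < 1$ in general, so the explicit check is kept as a separate step.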
4
Results
We have applied the FASM on a training sample consisting of eight images of leaves of Laurus nobilis, as the one depicted in Fig. 1. On each leaf, 23 landmark points have been placed and a fractal interpolation curve has been constructed for the whole leaf boundary using the FCF method. The vertical scaling factors of the affine transformations have been calculated using the analytic algorithm described in [10]. Each shape is represented by 115 (= 5 × 23) parameters; an example is given in Fig. 2, where the 115 shape parameters are depicted in five groups. Examination of the shape vector in the form of these groups can reveal useful information about the shape. For example, we can conclude from the $a_n$ and $d_n$ groups that the interpolation intervals are similar in x-length, or, from the $s_n$ group, that the leaf shape is most irregular, i.e. divergent from linear, in the first and eighth interval where the magnitude of $s_n$ is the greatest.
Footnote 3: The zero vertical scaling factors result in a piecewise linear attractor.
[Figure: five plot panels showing the parameter groups an, dn, cn, en and sn]
Fig. 2. The affine transformation parameters of the fractal interpolation curve of Fig. 1(d) divided into five groups, each consisting of 23 of the shape parameters
Fig. 3. (a) The initial position of the shape in the image. (b) The final position of the shape after convergence.
The result of an image search example is depicted in Fig. 3. Specifically, the initial estimate of the shape is depicted in Fig. 3(a), while its final position after convergence is depicted in Fig. 3(b). The leaf has been accurately detected in the image, with the FASM being able to capture its finer details with as few as 23 interpolation (i.e. landmark) points and 8 training samples.
5
Conclusions and Further Work
We have presented a novel extension of ASMs that uses fractal interpolation for representing shapes. The proposed FASMs have the advantage of requiring significantly fewer landmark points, even for irregular shapes, and of using fewer variables for representing a shape. Moreover, they are shown to be efficient, even when few training samples are available. Therefore, they are a competitive alternative to the original ASMs, offering an efficient and practical approach. Future work will focus on extending the FASMs to three dimensions and using piecewise self-affine fractal interpolation, which is expected to produce even better results.
Acknowledgements. This research project is co-financed by E.U.-European Social Fund (75%) and the Greek Ministry of Development-GSRT (25%), under the PENED programme 70/3/8405.
References
1. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active Shape Models - Their training and application. Computer Vision and Image Understanding 61, 38–59 (1995)
2. Duta, N., Sonka, M.: Segmentation and interpretation of MR brain images: An improved Active Shape Model. IEEE Transactions On Medical Imaging 17, 1049–1062 (1998)
3. Paragios, N., Jolly, M.P., Taron, M., Ramaraj, R.: Active Shape Models and segmentation of the left ventricle in echocardiography. In: Kimmel, R., Sochen, N.A., Weickert, J. (eds.) Scale-Space 2005. LNCS, vol. 3459, pp. 131–142. Springer, Heidelberg (2005)
4. Smyth, P.P., Taylor, C.J., Adams, J.E.: Automatic measurement of vertebral shape using Active Shape Models. Image and Vision Computing 15, 575–581 (1997)
5. Davatzikos, C., Tao, X., Shen, D.: Hierarchical Active Shape Models, using the wavelet transform. IEEE Transactions On Medical Imaging 22, 414–423 (2003)
6. van Ginneken, B., Frangi, A.F., Staal, J.J., ter Haar Romeny, B.M., Viergever, M.A.: Active Shape Model segmentation with optimal features. IEEE Transactions On Medical Imaging 21, 924–933 (2002)
7. Twining, C., Taylor, C.J.: Kernel principal component analysis and the construction of non-linear active Shape Models. In: Proceedings of the British Machine Vision Conference, pp. 23–32 (2001)
8. Barnsley, M.F.: Fractals everywhere, 2nd edn. Academic Press Professional, San Diego (1993)
9. Manousopoulos, P., Drakopoulos, V., Theoharis, T.: Curve fitting by fractal interpolation. Transactions on Computational Science (to appear)
10. Mazel, D.S., Hayes, M.H.: Using iterated function systems to model discrete sequences. IEEE Trans. Signal Processing 40, 1724–1734 (1992)
Decomposition for Efficient Eccentricity Transform of Convex Shapes
Adrian Ion¹, Samuel Peltier¹, Yll Haxhimusa¹,², and Walter G. Kropatsch¹
¹ Pattern Recognition and Image Processing Group, Faculty of Informatics, Vienna University of Technology, Austria {ion,sam,yll,krw}@prip.tuwien.ac.at
² Department of Psychological Sciences, Purdue University, USA [email protected]
Abstract. The eccentricity transform associates to each point of a shape the shortest distance to the point farthest away from it. It is defined in any dimension, for open and closed manifolds. Top-down decomposition of the shape can be used to speed up the computation, with some partitions being better suited than others. We study basic convex shapes and their decomposition in the context of the continuous eccentricity transform. We show that these shapes can be decomposed for a more efficient computation. In particular, we provide a study regarding possible decompositions and their properties for the ellipse, the rectangle, and a class of elongated shapes.
1 Introduction
To extract the required information from a set of images, a frequently used pattern is to repeatedly transform the input image while gradually moving from the low abstraction level of the input data to the high abstraction level of the output data. The idea is to have a reduced amount of (important) data at these higher abstraction levels. A class of such transforms, applied to 2D shapes, associates to each point of the shape a value that characterizes in some way its relation to the rest of the shape. In many cases this value is a distance between important points, i.e. features. Examples include the distance transform [1], which associates to each point the length of the shortest path to the border, the Poisson equation [2], which can be used to associate to each point the average time to reach the border by a random path (an average of the random shortest paths), and the eccentricity transform [3], which associates to each point the longest of the shortest paths to any other point of the shape. In this way one tries to come up with an abstracted representation of the shape (e.g. the skeleton [4] or the shock graph [5], built on the distance transform), which can easily be used in, e.g., shape classification or retrieval. Minimal path computation [6] as well as the distance transform [7] are used in 2D and 3D image segmentation.
Supported by the Austrian Science Fund under grants P18716-N13 and S9103-N04.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 653–660, 2007. © Springer-Verlag Berlin Heidelberg 2007
The eccentricity transform (ECC) has its origins in the graph-based eccentricity [8,9]. Defined in the context of digital images in [3,10], where properties and robustness have been shown, it was applied to shape matching in [11]. The ECC can be computed for shapes of any dimension, for discrete closed sets (e.g. a 2D binary image) or open sets (the surface of an ellipsoid), and for continuous shapes (e.g. a 3D ellipsoid or the 2D surface of the 3D ellipsoid, etc.). For discrete 2D shapes, a naive algorithm (O(N^3) in the number of pixels) and a more efficient one for 2D shapes without holes have been presented in [3]. This paper presents a top-down approach for the efficient computation of the ECC of basic convex shapes using decomposition. Section 2 gives a short recall of the ECC. Section 3 shows its computation on an ellipse, a rectangle and an elongated shape. Section 4 concludes and gives an outlook on future work.
2 Recall ECC
In this section basic definitions and properties of the ECC are introduced following [3,11]. Let the shape S be a closed set in R^2 and ∂S be its border¹. A path π is a continuous mapping from the interval [0, 1] to S. Let Π(p1, p2) be the set of all paths between two points p1, p2 ∈ S within the set S (Π(p1, p2) ⊂ S). The geodesic distance d(p1, p2) between two points p1, p2 ∈ S is defined as the length λ of the shortest path π(p1, p2) within S; more formally,

$$d(p_1, p_2) = \min\{\lambda(\pi(p_1, p_2)) \mid \pi \in \Pi\}, \quad \text{where} \quad \lambda(\pi(t)) = \int_0^1 |\dot{\pi}(t)|\,dt \qquad (1)$$
where π(t) is a parametrization of the path from p1 = π(0) to p2 = π(1). The eccentricity transform of S can be defined as, ∀p ∈ S,

$$ECC_S(p) = \max\{d(p, q) \mid q \in S\} = \max\{d(p, q) \mid q \in \partial S\} \qquad (2)$$
i.e. to each point p it assigns the length of the shortest geodesics to the points farthest away from it. In [3] it is shown that this transformation is quasi-invariant to articulated motion and robust against salt and pepper noise [10] (which creates holes in the shape). An eccentric point of p is then a point q that attains the maximum in Eq. 2. All the eccentric points lie on the border of S [3].
3 ECC of Basic Shapes
In the case of simply connected convex shapes S, geodesic distance equals Euclidean distance, as no obstacles exist that have to be avoided. In the rest of the paper, S denotes a simply connected convex shape, centered at the origin (center of gravity). We will show some properties of shapes S regarding the eccentricity transform, which will help in designing new and faster algorithms. For clarity, we introduce the following notations: the set of points (x, y) ∈ S such that x > 0 is denoted Sr and called the right part of S, whereas the set of points (x, y) ∈ S such that x < 0 is denoted Sl and called the left part of S.

¹ This definition can be generalized to higher dimensions.
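Because geodesic and Euclidean distance coincide for convex shapes, and because all eccentric points lie on the border (Eq. 2), a reference value of the ECC can be obtained by brute force over sampled border points. The following Python sketch illustrates this for an ellipse; the function names, the sampling density and the test points are assumptions made for illustration and are not part of the paper.

```python
import math

def ellipse_boundary(a, b, n=720):
    """Sample n points on the border of the ellipse x^2/a^2 + y^2/b^2 = 1."""
    return [(a * math.cos(2 * math.pi * k / n), b * math.sin(2 * math.pi * k / n))
            for k in range(n)]

def ecc_convex(p, boundary):
    """Approximate ECC of p in a convex shape from border samples.

    Only valid for convex shapes, where the geodesic distance is the
    Euclidean distance. Returns (eccentricity, eccentric point sample).
    """
    q = max(boundary, key=lambda s: math.dist(p, s))
    return math.dist(p, q), q

border = ellipse_boundary(a=3.0, b=1.0)
ecc_center, _ = ecc_convex((0.0, 0.0), border)      # centre of the ellipse
ecc_left, q_left = ecc_convex((-1.0, 0.2), border)  # a point of the left part
```

Such a brute-force routine costs one pass over the border per query point, which is the baseline that the decompositions studied in this section try to improve on.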
Fig. 1. The sets of eccentric points of the ellipse are shown with a thick line
3.1 Ellipse
Ellipse Recalls. The elliptical curve of points (x, y) around the origin with parameters a > 0 and b > 0 is defined by

$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1. \qquad (3)$$
In point (x, y) it has a tangent direction $(\dot{x}, \dot{y})$ satisfying

$$\frac{x\dot{x}}{a^2} + \frac{y\dot{y}}{b^2} = 0. \qquad (4)$$
Bounding Extremal points. We consider an elliptical region S around the origin, defined by $\frac{x^2}{a^2} + \frac{y^2}{b^2} \le 1$. In the following, we assume that a > b, thus the small axis of S is the segment [−b, b], and the long axis of S is the segment [−a, a]. As mentioned before, it has been shown that the eccentric point(s) of any point (x, y) ∈ S are on the border of S. We now provide two additional properties regarding eccentric points on an elliptical region.

Property 1. Let (xe, ye) be an eccentric point for (x, y) ∈ S. Then the tangent to the ellipse at the point (xe, ye) is orthogonal to the line defined by (x, y) and (xe, ye).

Proof. Suppose that the tangent is not orthogonal to the line defined by (x, y) and (xe, ye). Then there exists a point (xe′, ye′) of the ellipse in the neighborhood of (xe, ye) which is farther away from (x, y) than (xe, ye). This would contradict the fact that (xe, ye) is an eccentric point for (x, y).

The following property is a general property for symmetric (having one axis of symmetry), simply connected and convex shapes.

Property 2. Let S be a symmetric, simply connected and convex shape, with A its axis of symmetry. Let S1 and S2 denote the two symmetric parts of S delineated by A. Any point p ∈ S1 has its eccentric point pe in S2.
Proof. Let p ∈ S1 \ A and suppose pe ∈ S1. Then there exists the symmetric point pe′ ∈ S2. The straight line connecting p and pe′ intersects A in a point q ∈ A having the same distance to pe and pe′: d(p, pe′) = d(p, q) + d(q, pe′) = d(p, q) + d(q, pe) > d(p, pe) shows, due to the triangle inequality, that pe is not eccentric.

The ellipse is a simply connected convex shape and the smaller axis is an axis of symmetry. For any ellipse S, we can choose S1 = Sl and S2 = Sr. We compute the eccentric points of (0, b) and (0, −b). This allows us to partition the ellipse into 4 subsegments alternating the property of being eccentric or not (see Fig. 1). Let us consider the line l that goes through the point (0, −b) and crosses the ellipse with orthogonal tangent at point (x0, y0), such that x0 ≥ 0 and y0 ≥ 0. This line l(τ) is defined by:

$$x = \tau\,\dot{y}_0, \qquad y = -b - \tau\,\dot{x}_0. \qquad (5)$$

As (x0, y0) ∈ l, we can deduce from Eq. 5 that:

$$\tau_0 = \frac{x_0}{\dot{y}_0}, \qquad y_0 = -b - \frac{\dot{x}_0}{\dot{y}_0}\,x_0. \qquad (6)$$

From Eq. 4, we obtain $-\frac{\dot{x}_0}{\dot{y}_0} = \frac{y_0 a^2}{x_0 b^2}$. Substituting into Eq. 6, we obtain $y_0 = -b + x_0\,\frac{y_0 a^2}{x_0 b^2} = -b + \frac{y_0 a^2}{b^2}$, so

$$y_0 = \frac{b^3}{a^2 - b^2}. \qquad (7)$$

The x-coordinate is then determined using the ellipse formula:

$$x_0^2 = a^2\left(1 - \frac{b^4}{(a^2 - b^2)^2}\right). \qquad (8)$$
Similar calculations deliver the eccentric point for (0, −b) with x0 < 0 and y0 ≥ 0, and the two eccentric points for the point (0, b). One can directly deduce that any point (xc, yc) of the ellipse s.t. $x_c^2 < x_0^2$ has normals that do not intersect the segment [−b, b]. Thus, according to Prop. 2, all these points cannot be eccentric points for any point inside the ellipse. Hence the points (x0, y0), (x0, −y0), (−x0, −y0), (−x0, y0) partition the ellipse into 4 subsegments alternating the property of being extremal or not.

Eccentric lines through the smaller axis. In this section, we show how to efficiently compute the eccentricity transform of an elliptical region S, by considering separately Sl and Sr. Using Prop. 1, we first show how to compute the eccentricity of all the points of the small axis. Let p = (0, μb), −1 ≤ μ ≤ 1, be a point on the small axis, and let pe = (xe, ye) be its eccentric point in Sr. Using Prop. 1, the points (x, y) of the line l(τ) defined by (p, pe) satisfy:
Fig. 2. Efficient computation of eccentricity transform based on decomposition

Fig. 3. Ellipse decomposition along the bigger axis: more than one line, orthogonal to the ellipse tangent at the point of intersection, can go through one point
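$$x = \tau\,\dot{y}_e, \qquad y = \mu b - \tau\,\dot{x}_e. \qquad (9)$$

In particular, pe ∈ l, so we have $x_e = \tau_e \dot{y}_e$ and $y_e = \mu b - \tau_e \dot{x}_e$. Thus we deduce that $\tau_e = x_e / \dot{y}_e$ and, using Eq. 4, $y_e = \mu b - \frac{\dot{x}_e}{\dot{y}_e}\,x_e = \mu b + \frac{y_e a^2}{b^2}$, i.e. $y_e = \frac{\mu b^3}{b^2 - a^2}$. The x-coordinate is then determined using Eq. 3:

$$x_e^2 = a^2\left(1 - \frac{b^4 \mu^2}{(b^2 - a^2)^2}\right). \qquad (10)$$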
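The closed form of Eq. 10 can be checked numerically. The following is a minimal Python sketch, assuming a > b; the function names are hypothetical. The clamping branch covers the case where the formula would place ye outside [−b, b] (possible only when a² < 2b²); this branch is an assumption of the sketch and not a case treated in the text. A brute-force maximum over sampled border points is included for comparison.

```python
import math

def eccentric_point_small_axis(a, b, mu):
    """Eccentric point with x >= 0 of p = (0, mu*b), for an ellipse with a > b.

    Uses y_e = mu*b^3 / (b^2 - a^2) and Eq. 10 for x_e. If |y_e| >= b the
    stationary point lies outside the ellipse; returning the pole of the
    small axis in that case is an assumption of this sketch.
    """
    y_e = mu * b**3 / (b**2 - a**2)
    if abs(y_e) >= b:
        return (0.0, math.copysign(b, y_e))
    x_e = a * math.sqrt(1.0 - (b**4 * mu**2) / (b**2 - a**2) ** 2)
    return (x_e, y_e)

def ecc_small_axis(a, b, mu):
    """Eccentricity of the small-axis point (0, mu*b) from the closed form."""
    return math.dist((0.0, mu * b), eccentric_point_small_axis(a, b, mu))

def ecc_brute_force(a, b, p, n=20000):
    """Reference value: largest distance from p to n sampled border points."""
    return max(math.dist(p, (a * math.cos(2 * math.pi * k / n),
                             b * math.sin(2 * math.pi * k / n)))
               for k in range(n))

closed = ecc_small_axis(3.0, 1.0, 0.5)
reference = ecc_brute_force(3.0, 1.0, (0.0, 0.5))
```

Storing, for each small-axis point, the value returned by `ecc_small_axis` together with the direction towards the eccentric point is one way to realize the lookup on the cut that the decomposition relies on.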
So, the eccentricity of any point (0, μb), −1 ≤ μ ≤ 1 of the small axis, can directly be computed by the above formula using a, b and μ. The direction to their eccentric point is also known and can be stored in each point. As it is a convex shape, the eccentric path from any point of an elliptic region to its eccentric point is a straight line. Moreover, from Prop. 2, we know that the eccentric point pe of any point pl ∈ Sl is in Sr . Thus, for computing the eccentricity of pl , we just have to find the point psa of the small axis, such that the direction (pl , psa ) is the same as the direction stored in psa (see Fig. 2). Eccentric lines through the bigger axis. In the previous section, we have shown that it is possible to decompose an ellipse S along its smaller axis to efficiently compute the eccentricity ECCS . This is not the case when decomposing S into Su and Sd along the bigger axis [−a, a] because
Fig. 4. Eccentric paths inside a rectangle
– the eccentric points pe of any point pba ∈ [−a, a] are either (−a, 0) or (a, 0), which is obviously not helpful for deducing ECCS;
– even if we associate to each point pba ∈ [−a, a] the point p′ ∈ ∂Su s.t. (pba, p′) is orthogonal to the tangent at p′, ∃p ∈ Sd with at least two points p1 and p2 in ∂Su s.t. (p, p1) and (p, p2) are orthogonal to the tangent at p1 respectively p2. So we cannot make a one to one mapping for a direct computation of ECCS(p), as d(p, p1) and d(p, p2) have to be compared. See Fig. 3.

Circle. In the special case where a = b, the shape S becomes a circle. Let r = a = b be the radius and O the center. For all p ∈ ∂S the line orthogonal to the tangent at p contains the center O. Thus all the eccentric paths go through the center O and the eccentricity of a point p(x, y) ∈ S is $ECC_S(p) = \sqrt{x^2 + y^2} + r$. Note that all the points of ∂S are eccentric points.

3.2 Rectangle
Even though the rectangle is a simple case, studying its decomposition in the context of eccentricity is of importance. Compared to the ellipse, a one to one association between points on the cut and eccentric paths cannot be made. In the case of the rectangle, two eccentric point candidates exist for each point on the cut. Let S be a rectangle with side lengths 2w and 2h (see Fig. 4). The four corners of the rectangle v_{−w,−h}, v_{−w,h}, v_{w,−h}, v_{w,h} make up the set of eccentric points of S. The rectangle S can be decomposed in two subparts Sl and Sr (see Fig. 4), along the cut C = [(0, −h), (0, h)]. The corners v_{−w,−h}, v_{−w,h} respectively v_{w,−h}, v_{w,h} are the eccentric points of all the points p ∈ C. The main difference compared to the ellipse is that in the case of the rectangle we cannot associate to each point of C a single pair made of a direction and a distance, because eccentric paths with more than one orientation can pass through the same point of the cut (see pc2 in Fig. 4). To solve this, to each point of C we associate two pairs of distances and directions, connecting it to the corners v_{−w,−h}, v_{−w,h} respectively v_{w,−h}, v_{w,h}. Thus, for any p ∈ Sl, ∃pc1, pc2 ∈ C s.t. pc1 ∈ (p, v_{w,h}) and pc2 ∈ (p, v_{w,−h}). Then ECCS(p) = max(d(p, pc1) + ECCS(pc1), d(p, pc2) + ECCS(pc2)).
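The two-candidate lookup on the cut can be sketched as follows in Python. The function names are hypothetical, and the sketch makes one interpretive assumption: the distance stored at a cut point for a given direction is read as the distance from that cut point to the corner the direction points to (the pair of distance and direction described above), so that each candidate is the length of a straight path from p through the cut to one right-hand corner.

```python
import math

def cut_point(p, corner):
    """Intersection of the segment from p (with x < 0) to corner with the cut x = 0."""
    t = -p[0] / (corner[0] - p[0])
    return (0.0, p[1] + t * (corner[1] - p[1]))

def ecc_rectangle_left(p, w, h):
    """ECC of a point p in S_l of the rectangle [-w, w] x [-h, h], via the cut C."""
    candidates = []
    for corner in ((w, h), (w, -h)):
        pc = cut_point(p, corner)
        # Assumed stored value at pc for this direction: distance from pc to the corner.
        candidates.append(math.dist(p, pc) + math.dist(pc, corner))
    return max(candidates)

def ecc_rectangle_brute_force(p, w, h):
    """Reference value: largest distance from p to any of the four corners."""
    return max(math.dist(p, c) for c in ((w, h), (w, -h), (-w, h), (-w, -h)))

value = ecc_rectangle_left((-1.0, 0.5), w=2.0, h=1.0)
```

In an actual implementation the two distances per cut point would be precomputed once and reused for every point of Sl whose path crosses that cut point.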
Fig. 5. Elongated shape formed of two half ellipses and a rectangle
3.3 Elongated Shape
Let S be an elongated shape obtained by gluing two opposite sides of a rectangle R with two halves of ellipses El and Er. Let us assume that: the width of R is 2w and its height is 2h; El is the half of the ellipse defined by the 2 parameters al and h; Er is the half of the ellipse defined by the 2 parameters (ar, h) (see Fig. 5).

Symmetric Shape S: Let us assume that a = al = ar and decompose S along the cut C = [(0, −h), (0, h)]. If a > h, we can extend Prop. 1 from Section 3.1 to elongated shapes. Thus, for any point p ∈ Sl, its eccentric point pe is in Sr and the line (p, pe) is orthogonal to the ellipse tangent at pe. Moreover, we can deduce that pe is the unique eccentric point of p, i.e. ep(p) = {pe}. As in Section 3.1, we can compute the eccentricity and the eccentric path orientation for all the points of C separately in Sl and Sr, and for all p ∈ Sl compute the eccentricity ECCS(p) = d(p, pc) + ECCSr(pc), where pc ∈ C and (p, pc) has the same direction as the eccentric path of pc in Sr. These results can directly be extended to the points of Sr. To find the eccentricity and eccentric path orientation for C in Sl and Sr, one can decompose Sl and Sr into the half-ellipse together with half the rectangle R, and directly use the formulas provided in Section 3.1. Note that if a = h then El and Er are half circles and all the eccentric paths will go through their centers. If a < h, the ellipses El and Er correspond to ellipses that have been cut along their bigger axis (see Section 3.1). In this case, as mentioned in Section 3.1, there exist some points p ∈ Sl which have more than one associated point {e1, e2, ..., ek} ∈ Sr, such that (p, ei) is orthogonal to the ellipse tangent at ei. Thus, we cannot associate each point in Sl with a single direction and distance based only on the orthogonality with the ellipse tangent. We need to take the maximum of the distances.
Nonsymmetric Shape S: If al ≠ ar, S is no longer symmetric and it cannot be decomposed by a line segment s.t. for any point p ∈ S all its eccentric points are contained in the part not containing p. An easy way to overcome this problem is to take, for each point, the maximum between the eccentricity
computed separately on the part containing the point and the one obtained by using the decomposition.
4 Outlook and Conclusion
In this paper we have studied top-down decomposition of basic shapes in order to speed up the computation of the eccentricity transform. Some partitions proved to be better suited than others. We showed that these shapes can be decomposed for a more efficient computation and also derived some properties that could be applied for more general shapes. In particular, we provide a study regarding possible decompositions and their properties for the ellipse, the rectangle and a class of elongated shapes. In the future we plan to extend this study to any 2D shape followed by a study for 3D and nD shapes.
References
1. Rosenfeld, A.: A note on 'geometric transforms' of digital sets. Pattern Recognition Letters 1(4), 223–225 (1983)
2. Gorelick, L., Galun, M., Sharon, E., Basri, R., Brandt, A.: Shape representation and classification using the Poisson equation. In: CVPR (2), pp. 61–67 (2004)
3. Kropatsch, W.G., Ion, A., Haxhimusa, Y., Flanitzer, T.: The eccentricity transform (of a digital shape). In: Kuba, A., Nyúl, L.G., Palágyi, K. (eds.) DGCI 2006. LNCS, vol. 4245, pp. 437–448. Springer, Heidelberg (2006)
4. Ogniewicz, R.L., Kübler, O.: Hierarchic Voronoi Skeletons. Pattern Recognition 28(3), 343–359 (1995)
5. Siddiqi, K., Shokoufandeh, A., Dickinson, S., Zucker, S.W.: Shock graphs and shape matching. International Journal of Computer Vision 30, 1–24 (1999)
6. Paragios, N., Chen, Y., Faugeras, O.: 6. In: Handbook of Mathematical Models in Computer Vision, pp. 97–111. Springer, Heidelberg (2006)
7. Soille, P.: Morphological Image Analysis. Springer, Heidelberg (1994)
8. Harary, F.: Graph Theory. Addison-Wesley, Reading (1969)
9. Diestel, R.: Graph Theory. Springer, New York (1997)
10. Klette, R., Rosenfeld, A.: Digital Geometry. Morgan Kaufmann, San Francisco (2004)
11. Ion, A., Peyré, G., Haxhimusa, Y., Peltier, S., Kropatsch, W.G., Cohen, L.: Shape matching using the geodesic eccentricity transform - a study. In: 31st OAGM/AAPR, Schloss Krumbach, Austria, May 2007. OCG (2007)
Euclidean Shortest Paths in Simple Cube Curves at a Glance Fajie Li and Reinhard Klette Computer Science Department The University of Auckland, New Zealand
Abstract. This paper reports about the development of two provably correct approximate algorithms which calculate the Euclidean shortest path (ESP) within a given cube-curve with arbitrary accuracy, defined by ε > 0, and in time complexity κ(ε) · O(n), where κ(ε) is the length difference between the path used for initialization and the minimum-length path, divided by ε. A run-time diagram also illustrates this linear-time behavior of the implemented ESP algorithm.
1 Introduction
Euclidean shortest path (ESP) problems are defined by a (2D, 3D, ...) Euclidean space which contains (closed) polyhedral obstacles; the task is to compute a path which connects two given points in the space such that it does not intersect the interior of any obstacle, and it is of minimum Euclidean length. Examples are the ESP inside of a simple polygon, on the surface of a convex polytope, or inside of a simply-connected polyhedron, or problems such as touring polygons, parts cutting, safari or zookeeper, or the watchman route. Altogether, this defines a class of immensely important computational problems of huge impact in economy, science or technology. For time complexities of algorithms in this area, we cite two examples. The general 3D ESP problem (e.g., path-planning in robotics) is NP-hard, see J. Canny and J. H. Reif [5]. For 2D ESP problems, there are linear-time, but very complicated algorithms (e.g., algorithms for ESP calculation in a simple polygon, based on B. Chazelle's [6] triangulation of whole polygons), or linear-time and easy-to-implement algorithms (e.g., for the relative convex hull in the 2D grid, see [9]). See [13] for work of the authors on 2D or 3D ESP problems in general. In this paper we consider ESPs in simple cube-curves, i.e. cyclic sequences of successively face-adjacent grid cubes of the uniform orthogonal 3D grid in which each cube is listed only once (see digital geometry [11]). T. Bülow and R. Klette published between 2000 and 2002 (see, e.g., [4]) a so-called rubberband algorithm (RBA) for the calculation of a Euclidean shortest path in a simple cube-curve. [4] stated two open problems: does this approximate RBA actually always converge (with the number of iterations) to the correct ESP, and is its time complexity actually linear, as all experiments indicated at that time?

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 661–668, 2007. © Springer-Verlag Berlin Heidelberg 2007
This paper reports about the development of two approximate RBAs, which always converge towards the ESP, and have κ(ε) · O(n) time complexity. This paper is a first summarizing publication of results in the technical report [12] related to minimum-length polygon (MLP) calculation in simple cube-curves.
2 The Original RBA
Critical edges of a given cube-curve g are those grid edges which are incident with three cubes of the curve (see Figure 1). Critical edges are the only possible locations for vertices of an ESP [10]. A subset of those will define the step set of the RBA, which contains all those critical edges which contain exactly one ESP vertex each.
Fig. 1. Critical edges e1 , e2 , e3 , e4 , e5 , and e6
The Original RBA, as published in [4,11], is as follows: it consists of two subprocesses, (i) an initialization process (e.g., from an endpoint of one critical edge to the closest endpoint of the subsequent critical edge; satisfying a “closed-path” constraint at the end), and (ii) an iterative process which contracts the path during each of its loops, using a break-off criterion L_n − L_{n+1} < ε, where ε > 0 and L_n is the total length of the path after the nth loop. During each loop, the algorithm tries to shorten the path locally by checking three options, called OP1, OP2, and OP3. OP1 and OP2 find the step set of critical edges. OP3 optimizes the position of a vertex on its critical edge. These options are defined as follows:
OP1: delete vertex p_i if the line segment p_{i−1} p_{i+1} is in the tube g, which is the union of all the grid cubes in the given simple cube-curve g;
OP2: calculate intersection points of the triangle p_{i−1} p_i p_{i+1} with all critical edges (“between” p_{i−1} and p_{i+1}) and replace the subsequence p_{i−1}, p_i, p_{i+1} by the resulting convex arc, defined by these intersection points;
Fig. 2. Illustration for the (original) Option 2
OP3: move p_i on its critical edge e into the optimum position p_new, with d_e(p_new, p_{i−1}) + d_e(p_{i+1}, p_new) = inf{d_e(p, p_{i−1}) + d_e(p_{i+1}, p) : p ∈ e}, where d_e denotes the Euclidean distance. We continue with vertices p_new, p_{i+1}, p_{i+2} of the path.
At the end of each loop we compare the total length of the new path with that of the path at the end of the previous loop. See Figure 2 for OP2. Here, vertices on critical edges e11, e14 and e18 are replaced by a convex arc with vertices on critical edges e11, e13, e16, and e18, and (in general) it may be e11, e14 and e18 again within a subsequent loop, of course for a reduced length of the calculated path at this stage. The situation with the original RBA in 2002 [4] was as follows: even for very small values of ε, the measured time complexity indicated O(n), where n is the number of cubes in g. However, there was no proof for the asymptotic time complexity of the original RBA. For a small number of test examples, calculated paths seemed (!) to converge towards the ESP. However, no implemented algorithm for calculating the correct ESP was available, and (more generally) there was no proof that the path provided by the original RBA converges towards the ESP. Nevertheless, the algorithm has been in use since 2002 (e.g., in DNA research).
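Option OP3 above is a one-dimensional minimization: the sum of the two Euclidean distances is a convex function of the position along the critical edge. The following Python sketch illustrates this with a ternary search over the edge parameter. The numerical search is a technique swapped in here for illustration and is not taken from [4,11]; the function name and the sample coordinates are hypothetical, and the sketch does not test whether the new segments stay inside the tube.

```python
import math

def op3(prev_pt, next_pt, edge_a, edge_b, tol=1e-12):
    """Point on the segment edge_a-edge_b minimising d(p, prev_pt) + d(p, next_pt).

    The cost is convex in the segment parameter t in [0, 1], so a ternary
    search converges to a minimiser (numerically, not in closed form).
    """
    def point(t):
        return tuple(u + t * (v - u) for u, v in zip(edge_a, edge_b))

    def cost(t):
        p = point(t)
        return math.dist(p, prev_pt) + math.dist(p, next_pt)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if cost(m1) <= cost(m2):
            hi = m2
        else:
            lo = m1
    return point((lo + hi) / 2)

# Hypothetical data: the edge is parallel to the z-axis, the neighbours lie in z = 0.
p_new = op3((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, -1.0), (1.0, 1.0, 1.0))
```

Because the cost is flat near its minimum, the position is only resolved to roughly the square root of machine precision, which is far below the break-off threshold on path length discussed in the text.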
3 Non-existence of an Exact Arithmetic Algorithm
An arithmetic algorithm consists of a finite number of steps of arithmetic operations, possibly also using input parameters from the field of rational numbers, using only the following basic operators: +, −, ·, / or the kth root, for k ≥ 2. OP3 can be formalized by a system of three PDEs, involving parameters t_i ∈ R for critical edges e_i of the step set. The result ensures that p_i(t_i) is the optimum point on e_i. Considering the situation illustrated in Figure 3, this is equivalent to the problem of finding the roots of p(x) = 84x^6 − 228x^5 + 361x^4 + 20x^3 + 210x^2 + 200x + 25 (see Chapter 7 in [12]). In fact, this problem is not solvable by radicals over the field of rationals; see [12]. (The proof uses a theorem by C. Bajaj [2] and the factorization algorithm by E. R. Berlekamp [3].)
Fig. 3. Calculation of t1 and t2 such that the polyline p0 (t0 )p1 (t1 )p2 (t2 )p3 (t3 ) is fully contained in g. Point p1 is on e1 , and p2 on e2 .
This example allows two corollaries. First, there is no exact arithmetic algorithm for calculating the roots of polynomials of degree ≥ 5 (theorem by E. Galois; B.L. van der Waerden's famous example is p(x) = x^5 − x − 1). Second, there is also no exact arithmetic algorithm for calculating 3D ESPs. C. Bajaj [1] showed this based on a polynomial of degree 20 for the general 3D ESP problem. As a new result, here we have a degree 6 polynomial, and the restricted ESP problem for simple cube-curves! Note that this is not just a “rounding number problem” but a fundamental non-existence of exact algorithms, no matter what kind of time-complexity is allowed (see Section 4). There is a uniquely defined shortest path, which passes through subsequent line segments e1, e2, ..., ek in 3D space in this order; see, for example, [7]. Obviously, vertices of a shortest path can be at real division points, and even at those which cannot be represented by radicals over the field of rationals.
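Although roots of the degree 6 polynomial above cannot be expressed by radicals, they can be approximated to any accuracy, which is the motivation for the approximate algorithms that follow. The Python sketch below brackets one real root by bisection; the bracket [−0.2, 0] was chosen by hand for this illustration (p(−0.2) < 0 < p(0) = 25), and no claim is made here about which root corresponds to the parameters t1, t2 of Figure 3.

```python
def p(x):
    """84x^6 - 228x^5 + 361x^4 + 20x^3 + 210x^2 + 200x + 25, by Horner's rule."""
    coeffs = (84, -228, 361, 20, 210, 200, 25)
    value = 0.0
    for c in coeffs:
        value = value * x + c
    return value

def bisect(f, lo, hi, eps=1e-12):
    """Approximate a root of f in [lo, hi]; requires a sign change on the interval."""
    if f(lo) * f(hi) >= 0:
        raise ValueError("f must change sign on [lo, hi]")
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

root = bisect(p, -0.2, 0.0)
```

The interval halves in every step, so the number of steps grows only with log(1/eps); the result is an approximation by construction, never an exact arithmetic expression.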
4 Approximate Algorithms
An algorithm is an (1 + ε)-approximation algorithm for a minimization problem P iff, for each input instance I of P, the algorithm delivers a solution that is at most (1 + ε) times the optimum solution [8].
The general 3D ESP problem can be solved in $O(n^4 [b + \log(n/\varepsilon)]^2 / \varepsilon^2)$ time by an (1 + ε)-approximation algorithm; see C. H. Papadimitriou [15]. An algorithm is κ-linear iff its time complexity is in κ(ε) · O(n), and function κ does not depend on the problem size n, for ε > 0. We use κ(ε) = (L0 − L)/ε, where L is the true length of the ESP, and L0 the initial length. A cube-curve is first-class iff each critical edge contains one ESP vertex. The original RBA is correct and κ-linear for first-class cube-curves [12]. [12] analyzed the following approximate graph-theoretical algorithm: Subdivide each critical edge by m uniformly-spaced vertices; connect each vertex with
Fig. 4. Weighted undirected graph for m = 3
those vertices such that the resulting edge is contained in the tube g. This defines a weighted undirected graph (see Figure 4). Calculate a shortest-length cycle, and use this as a (first-class!) input for the original RBA. The time-complexity of the graph-theoretic algorithm (in our specification) equals $O(m^4 n^4 + \kappa(\varepsilon) \cdot n)$. It applies Dijkstra's algorithm repeatedly; possibly its time-complexity can be reduced, but certainly not to be κ-linear. However, this (slow) algorithm allowed for the first time to evaluate results obtained by the original RBA. Assume a simple cube-curve g and a triple of consecutive critical edges e1, e2, and e3 such that ei is orthogonal to ej, for i, j = 1, 2, 3 and i ≠ j. If e1 and e3 are also coplanar, then we say that e1, e2, and e3 form an end angle, and a middle angle otherwise. The following approximate numerical algorithm (see [12]) requires an input which is first-class and has at least one end angle; the cube-curve is split at end angles into one or several arcs. For each arc, one vertex on each critical edge can be calculated using the systems of PDEs briefly mentioned already above; variable ti determines the position of vertex pi on edge ei. This algorithm is provably correct and κ-linear for the assumed inputs. An open problem in [11] (page 406) was stated as follows: Is there a simple cube-curve such that none of the vertices of its ESP is a grid vertex? The answer is “yes” [12], and any of those curves does not have any end angle; see Figure 5.¹ Thus, the provably correct approximate numerical algorithm cannot be used in general. This led us back to the initial two questions about the original RBA: is it correct? (We can use either the approximate graph-theoretical or the numerical algorithm for evaluation.) What is its time-complexity in general? Indeed, corrections were in place:

¹ Here are two new open problems: What is the smallest (say, in number of cubes or in number of critical edges, both are equivalent) simple cube-curve which does not have any end angle? What is the smallest (say, in number of cubes or in number of critical edges, both are equivalent) simple cube-curve which does not have any of its MLP vertices at a grid point location? We assume that the second problem is more difficult to solve.
Fig. 5. A simple cube-curve where the ESP does not have any grid-point vertex (and which has no end angle)
Fig. 6. The original Option 2 misses e5
OP2: if intersecting with the triangle p_{i−1} p_i p_{i+1} and using the convex arc only, we may miss edges of the step set (see Figure 6 for such a situation); more tests are needed, and this option was totally reformulated (for details, see [12]; the specifications require some technical preparations which cannot be given in this short paper).
OP3: the vertex p_new, found by optimization, may specify edges p_{i−1} p_new and p_new p_{i+1} such that one or both of them are not fully contained in the tube of the curve; an additional test is needed (a simple correction).
Therefore, those corrections define a provably correct (for any simple cube-curve) and κ-linear edge-based RBA [12].
Fig. 7. Edge-based RBA implemented in Java, run under Matlab 7.0.4, Pentium 4, using ε = 10^{−10}
Fig. 8. Let ε be the maximum accuracy of the program, that means the smallest number for discriminating between Ln and Ln+1 . Still, the difference to the true value L might be δ > ε. The algorithm allows to obtain arbitrary accuracy (with respect to L) when continuing iterations, but this would require to reduce ε.
Instead of moving points along critical edges, we can also move points within critical faces (which contain one critical edge). Of course, the vertices will finally move onto or towards critical edges. This conceptually simpler (in its OP2) face-based RBA is also provably correct, but shows a slower convergence (within the limits of being κ-linear) towards the ESP. See Figure 7 for some statistics about measured run time. Half of a simple cube-curve was generated randomly, and the second half was then generated using three straight arcs for closing the curve. The number of cubes in generated curves was between 10 and 630. The break-off criterion was defined by ε = 10^{−10}. Figure 8 illustrates the meaning of the break-off criterion. The lengths L_n, for loops n = 1, 2, 3, ..., define a Cauchy sequence which converges towards the true length L. An in-depth study of this sequence may reveal whether we can assume δ < ε in general, or not.
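The role of the break-off criterion and of κ(ε) can be illustrated on a toy rubberband loop. The Python sketch below is a strong simplification under stated assumptions: the step set is fixed (no OP1 or OP2), containment in the tube is not tested, OP3 is done by a numerical ternary search, and the four edges are hypothetical data rather than critical edges of an actual cube-curve. Every loop except the last one shortens the path by at least ε, so the number of loops exceeds (L0 − L_final)/ε by at most one, in the spirit of κ(ε).

```python
import math

def best_on_edge(prev_pt, next_pt, edge, iters=80):
    """OP3 by ternary search: point on edge minimising the two adjacent lengths."""
    a, b = edge
    def point(t):
        return tuple(u + t * (v - u) for u, v in zip(a, b))
    def cost(t):
        q = point(t)
        return math.dist(q, prev_pt) + math.dist(q, next_pt)
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if cost(m1) <= cost(m2):
            hi = m2
        else:
            lo = m1
    return point((lo + hi) / 2)

def length(path):
    """Total length of the closed polygonal path."""
    return sum(math.dist(path[i], path[(i + 1) % len(path)]) for i in range(len(path)))

def rubberband(edges, eps=1e-10, max_loops=10000):
    """Iterate OP3 over all edges until L_n - L_{n+1} < eps; returns (path, lengths)."""
    path = [e[0] for e in edges]          # initialise at the first endpoints
    lengths = [length(path)]
    for _ in range(max_loops):
        for i in range(len(path)):
            path[i] = best_on_edge(path[i - 1], path[(i + 1) % len(path)], edges[i])
        lengths.append(length(path))
        if lengths[-2] - lengths[-1] < eps:
            break
    return path, lengths

# Hypothetical step set of four axis-parallel unit edges.
edges = [((0, 0, 0), (0, 0, 1)), ((2, 0, 1), (2, 1, 1)),
         ((2, 2, 0), (2, 2, 1)), ((0, 2, 1), (1, 2, 1))]
path, lengths = rubberband(edges)
```

The recorded sequence `lengths` plays the role of L_0, L_1, L_2, ... in the text; stopping when two consecutive values differ by less than ε says nothing by itself about the remaining distance δ to the true length.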
5 Conclusions
This paper reported about the process of solving the minimum-length polygon problem for simple cube-curves. The developed methodology [i.e., define “critical” subsets, specify the step set such that each critical subset in this set contains exactly one (possibly redundant, such as colinear) vertex, apply OP3] can be applied to ESP problems as considered (e.g.) in [14]. A few RBA applications have been illustrated in [12,13]. For more details, see technical report [12].
References
1. Bajaj, C.: The algebraic complexity of shortest paths in polyhedral spaces. In: Proc. Allerton Conf. Commun. Control Comput., pp. 510–517 (1985)
2. Bajaj, C.: The algebraic degree of geometric optimization problems. Discrete Computational Geometry 3, 177–191 (1988)
3. Berlekamp, E.R.: Factoring polynomials over large finite fields. Math. Comp. 24, 713–735 (1970)
4. Bülow, T., Klette, R.: Digital curves in 3D space and a linear-time length estimation algorithm. IEEE Trans. Pattern Analysis Machine Intelligence 24, 962–970 (2002)
5. Canny, J., Reif, J.H.: New lower bound techniques for robot motion planning problems. In: Proc. IEEE Conf. Foundations Computer Science, pp. 49–60 (1987)
6. Chazelle, B.: Triangulating a simple polygon in linear time. Discrete Computational Geometry 6, 485–524 (1991)
7. Choi, J., Sellen, J., Yap, C.-K.: Approximate Euclidean shortest path in 3-space. In: Proc. ACM Conf. Computational Geometry, pp. 41–48. ACM Press, New York, NY, USA (1994)
8. Hochbaum, D.S.: Approximation Algorithms for NP-Hard Problems. PWS Pub. Co, Boston (1997)
9. Klette, R., Kovalevsky, V.V., Yip, B.: Length estimation of digital curves. In: Proc. Vision Geometry, SPIE 3811, pp. 117–129 (1999)
10. Klette, R., Bülow, T.: Critical edges in simple cube-curves. In: Nyström, I., Sanniti di Baja, G., Borgefors, G. (eds.) DGCI 2000. LNCS, vol. 1953, pp. 467–478. Springer, Heidelberg (2000)
11. Klette, R., Rosenfeld, A.: Digital Geometry. Morgan Kaufmann, San Francisco (2004)
12. Li, F., Klette, R.: Exact and approximate algorithms for the calculation of shortest paths. In: Platinum Jubilee Conference of The Indian Statistical Institute. IEEE Conference, Kolkata, Report 2141 on www.ima.umn.edu/preprints/oct2006
13. Li, F., Klette, R.: Rubberband algorithms for solving various 2D or 3D shortest path problems. In: Proc. Computing: Theory Applications, plenary talk, pp. 9–19 (2007)
14. Mitchell, J.S.B., Sharir, M.: New results on shortest paths in three dimensions. In: Proc. SCG, pp. 124–133 (2004)
15. Papadimitriou, C.H.: An algorithm for shortest path motion in three dimensions. Inform. Process. Lett. 20, 259–263 (1985)
16. Talbot, M.: A dynamical programming solution for shortest path itineraries in robotics. Electr. J. Undergrad. Math. 9, 21–35 (2004)
A Fast and Robust Ellipse Detection Algorithm Based on Pseudo-random Sample Consensus
Ge Song and Hong Wang
State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
[email protected], [email protected]
Abstract. The ellipse is one of the most common features appearing in images. Over years of research, real-time performance and robustness have remained two very challenging aspects of ellipse detection. Aiming to tackle both, we propose an ellipse detection algorithm based on pseudo-random sample consensus (PRANSAC). PRANSAC improves a contour-based ellipse detection algorithm (CBED) presented in our previous work. A parallel thinning algorithm is employed to eliminate useless feature points, which increases the time efficiency of the detection algorithm. To speed up detection further, a 3-point ellipse fitting method is introduced. In terms of robustness, a “robust candidate sequence” is proposed to make the detection results stable. Experimental results on real application images show significant improvements in time efficiency and robustness compared with state-of-the-art ellipse detection algorithms. Keywords: Ellipse Detection, Parallel Thinning, Robust Candidate Sequence, PRANSAC.
1 Introduction In recent years, fast and robust ellipse detection has been a very popular research topic in image processing. Over the last two decades researchers have developed many approaches to detecting ellipses. A commonly used algorithm is the Hough transform [1]. The Hough transform is usable for ellipse detection but costs a lot of time. The Fast Ellipse Hough Transform (FEHT) [2] was proposed by Guil and Zapata to decompose the parameter space into several subspaces of fewer dimensions. They introduced geometric features and the gradient vectors of the edge data to construct the lines passing through geometric centers. The Robust and Accumulator-Free Ellipse Hough Transform (RAF-EHT) [3], presented by Yu et al., is based on two main ideas: an improved measure function (IMF) for handling the partiality and the obliqueness of ellipses, and a new accumulator-free computation scheme for finding the top k peaks of the IMF without complex peak detection. The methods based on the Hough transform [1, 2, 3, 4, 5, 13] are the most popular for ellipse detection. They are robust against outliers and occlusion but computationally expensive. W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 669–676, 2007. © Springer-Verlag Berlin Heidelberg 2007
Cai et al. proposed a fast contour-based approach to circle and ellipse detection in 2004 [9]. A chain-code-based algorithm was adopted to detect contours, and each contour was treated as a sequence of isolated inflexions. Then the RANSAC algorithm [10] was applied to estimate the parameters of candidate ellipses from each candidate contour. This algorithm was faster than others, but its results are not stable. Therefore we propose new mechanisms to improve the method. In this paper we propose a fast and robust ellipse detection algorithm, named PRANSAC. In image pre-processing, PRANSAC applies a parallel thinning algorithm to eliminate useless feature points, and it uses tangent directions to fit ellipses with 3 points; both measures reduce detection time. In addition, we propose a “robust candidate sequence” for RANSAC to make the detection results more stable.
2 Overview of PRANSAC To make ellipse detection faster and more robust, we make several changes to the contour-based ellipse detection (CBED) algorithm proposed in our previous work [9]. In CBED, at least five points are used to determine an ellipse, so an equation with five parameters has to be solved, which is computationally costly. PRANSAC computes the tangent directions of the candidate feature points in pre-processing, so that an ellipse can be fitted with 3 points. This simplifies the fitting calculation and reduces the detection time significantly. The parallel thinning algorithm is used to get rid of useless feature points (the feature points are obtained by edge detection and serve as candidates for ellipse detection). It reduces lines wider than one pixel to lines of one-pixel width, and thus reduces the number of feature points effectively. CBED uses the RANSAC method for detection. RANSAC is an algorithm for robust fitting of models in the presence of many outliers. It is fast and robust when there is much noise interference. However, its detection result is not steady because it uses random candidate points to fit ellipses. We improve the original RANSAC and propose a “robust candidate sequence” that yields steady detection results. The rest of this paper is organized as follows: Section 3 presents CBED; Section 4 describes the improvements of PRANSAC; experimental results and a comparison with other methods are shown in Section 5; and Section 6 concludes the paper.
3 The Steps of CBED The original CBED algorithm [9] consists of three main steps. 3.1 Contour Detection Most existing ellipse detection methods regard all the feature points in the edge image as a whole, so they have to deal with a large feature point set, which costs a lot of computation and memory. To cope with this deficiency, we exploit an obvious but essential feature of the detected objects, namely the connectivity constraint, which states that the edge points of a target object should be continuous. Relying on this constraint, the huge feature point set can be partitioned into several smaller subsets. Thus, the problem is simplified to detecting the targets in each subset. One example of contours is illustrated in Fig. 1.
Fig. 1. The contours of the image: (a) image; (b) contours. (Each contour is represented by a different color.)
3.2 RANSAC For each candidate contour, many methods can be used to extract circles or ellipses. We use RANSAC [10] to estimate the parameters of an ellipse from data with outliers. 3.3 Ellipse Fitting Since five parameters determine an ellipse, at least five points (no three of which are collinear) are needed to determine it. The conic is written as:
$a_0 x^2 + 2 a_1 x y + a_2 y^2 + 2 a_3 x + 2 a_4 y = 1$. (1)
We use five points to fit the ellipse and use the parameters in (1) to compute its geometric parameters. Then we vote and confirm the ellipse. To avoid the influence of small arcs and to handle the case that several targets overlap in a real application, a 'fitting factor' is introduced for each possible figure to evaluate the estimated parameters. The algorithm for calculating the fitting factor is as follows:
Begin
  Count := 0;
  For I := 0 to 359 do  /* for each angle, in degrees */
  Begin
    Compute the point P on the ellipse corresponding to angle I;
    If P is a feature point in the image then Count := Count + 1;
  End;
  FittingFactor := Count / 360;
End;
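The fitting-factor procedure can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' C++ implementation: the edge image is assumed to be a boolean NumPy array indexed as [row, column], the ellipse is assumed to be given by its center, semi-axes and rotation angle, and the names `ellipse_point` and `fitting_factor` are hypothetical.

```python
import math
import numpy as np

def ellipse_point(x0, y0, a, b, theta, t):
    """Point of the ellipse (center x0,y0; semi-axes a,b; rotation theta) at parameter t."""
    x = x0 + a * math.cos(t) * math.cos(theta) - b * math.sin(t) * math.sin(theta)
    y = y0 + a * math.cos(t) * math.sin(theta) + b * math.sin(t) * math.cos(theta)
    return x, y

def fitting_factor(edge, x0, y0, a, b, theta):
    """Fraction of 360 one-degree samples of the ellipse that land on edge pixels.

    edge is assumed to be a 2-D boolean array, True at feature points.
    """
    height, width = edge.shape
    count = 0
    for deg in range(360):                       # angles 0..359 degrees
        x, y = ellipse_point(x0, y0, a, b, theta, math.radians(deg))
        col, row = int(round(x)), int(round(y))  # nearest pixel
        if 0 <= row < height and 0 <= col < width and edge[row, col]:
            count += 1
    return count / 360.0
```

A candidate whose fitting factor is low, for instance a small arc that happens to fit an ellipse, can then be rejected by a threshold; the threshold value itself is not given in the text and would have to be chosen per application.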
4 The Improvements of PRANSAC 4.1 The Parallel Thinning Algorithm Assume there are P points on the edge. The computational complexity of CBED is O(P), as shown in [9], so reducing the number of feature points is an effective way to reduce detection time. PRANSAC improves CBED by applying the parallel thinning algorithm [12] in image pre-processing. The parallel thinning algorithm reduces lines wider than one pixel to lines of one-pixel width. An 8-connected skeleton is obtained while avoiding excessive erosion that would lose useful image information. The number of feature points is reduced effectively while the contours are preserved. The thinning result of this algorithm is shown in Fig. 2.
Fig. 2. The parallel thinning algorithm effectively reduces the number of feature points, which speeds up the proposed ellipse detection algorithm: (a) the original image (10296 feature points); (b) the result after the parallel thinning algorithm (6933 feature points)
4.2 Ellipse Fitting with 3 Points As is well known, circles are special ellipses. However, circle detection is much faster than ellipse detection because of its simplicity: three parameters (the center coordinates and the radius) determine a circle, so only 3 points are needed to fit one. We therefore want to fit ellipses with three points as well. The original CBED needs five points to determine an ellipse and spends a lot of time on ellipse fitting and voting. We introduce the tangent direction to fit ellipses with 3 points. To this end, we calculate the tangent directions of the feature points. For every feature point, we take the 5*5 area centered at this point and fit a line to this area by a least-squares solution, as shown in Fig. 3. The direction of this line is taken as the tangent direction at this point. In this way we obtain the tangent directions of the candidate points. We choose three points $P_i(x_i, y_i)$ ($i = 1, 2, 3$) which are not collinear, assuming they are on an ellipse. The tangent at $P_i$ is $L_i$, which is obtained from the least-squares line fit described above. Let K be the midpoint of $P_1$ and $P_2$ and let M be the intersection point of $L_1$ and $L_2$; then the center of the ellipse lies on the line KM. The center also lies on the corresponding line constructed from $P_2$ and $P_3$. According to this geometric property of the ellipse (Fig. 4), we determine the center $(x_0, y_0)$ of the ellipse from the three points and their tangents.
Fig. 3. Tangent direction (calculated by a least-squares solution)
Fig. 4. Location of center
Then we transform the reference frame as follows:
$x' = x - x_0, \quad y' = y - y_0$. (2)
The new center of the ellipse is the origin of the new reference frame, so the new conic can be written as:
$a_0 x'^2 + 2 a_1 x' y' + a_2 y'^2 = 1$. (3)
The new points are $P_i'$ ($i = 1, 2, 3$), so we can obtain the new parameters as follows:
$$A = X^{-1} B, \quad A = [a_0 \;\; a_1 \;\; a_2]^T, \quad B = [1 \;\; 1 \;\; 1]^T, \quad X = \begin{bmatrix} x_1'^2 & 2 x_1' y_1' & y_1'^2 \\ x_2'^2 & 2 x_2' y_2' & y_2'^2 \\ x_3'^2 & 2 x_3' y_3' & y_3'^2 \end{bmatrix}.$$ (4)
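The 3-point fit described in this subsection can be sketched with NumPy as below. It is a minimal illustration, not the authors' implementation: the tangent directions are assumed to be already available as 2-D direction vectors, the function names are hypothetical, and the conversion from $(a_0, a_1, a_2)$ to semi-axes and rotation angle via an eigen-decomposition of the matrix $[[a_0, a_1], [a_1, a_2]]$ is a standard technique that the text does not spell out.

```python
import numpy as np

def line_intersection(p, d, q, e):
    """Intersection of the 2-D lines p + s*d and q + t*e (raises if they are parallel)."""
    s, _ = np.linalg.solve(np.column_stack([d, -e]), q - p)
    return p + s * d

def ellipse_center(points, tangents):
    """Center from three points and their tangent directions.

    For each pair of points, the line through the chord midpoint K and the
    intersection M of the two tangents passes through the center.
    """
    p1, p2, p3 = points
    t1, t2, t3 = tangents
    k12, m12 = (p1 + p2) / 2, line_intersection(p1, t1, p2, t2)
    k23, m23 = (p2 + p3) / 2, line_intersection(p2, t2, p3, t3)
    return line_intersection(k12, m12 - k12, k23, m23 - k23)

def fit_centered_conic(points, center):
    """Solve X A = B from (4) for A = (a0, a1, a2) in the translated frame (2)."""
    q = points - center
    X = np.column_stack([q[:, 0] ** 2, 2 * q[:, 0] * q[:, 1], q[:, 1] ** 2])
    return np.linalg.solve(X, np.ones(3))

def conic_to_axes(a0, a1, a2):
    """Semi-major axis, semi-minor axis and rotation angle in [0, pi) of conic (3)."""
    evals, evecs = np.linalg.eigh(np.array([[a0, a1], [a1, a2]]))
    if evals[0] <= 0:
        raise ValueError("the three points do not define an ellipse")
    v = evecs[:, 0]                    # eigenvector of the smallest eigenvalue = major axis
    angle = np.arctan2(v[1], v[0]) % np.pi
    return 1 / np.sqrt(evals[0]), 1 / np.sqrt(evals[1]), angle
```

Usage would be `c = ellipse_center(P, T)`, then `conic_to_axes(*fit_centered_conic(P, c))`, where `P` and `T` are 3x2 arrays of points and tangent direction vectors. Only a 3x3 linear system is solved per candidate, compared with the five-parameter system of (1).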
From (4) we can calculate the major axis, the minor axis and the rotation angle. Thus we obtain the ellipse parameters from 3 points. 4.3 Robust Candidate Sequence CBED uses the RANSAC method to estimate the parameters of circles and ellipses from the data, which is fast and robust against outliers. However, RANSAC chooses the candidates randomly and can give different results when the same image is processed several times. One example is shown in Fig. 5. Fig. 5(a) is a good result from the first detection. When the image is processed a second time, Ellipse 4 is not detected, as shown in Fig. 5(b). To get stable results, PRANSAC introduces a robust candidate sequence for every contour. We generate a pseudo-random sequence of points as candidates for RANSAC. A pseudo-random sequence looks like a random sequence but is actually deterministic. The seed of the pseudo-random generator can be set to the number of points on the contour, so the sequence supplies the same series of points in every run. In many experiments, the robust candidate sequence produced stable and reasonable candidates for ellipse detection and gave the same result over repeated detections, see Fig. 6.
Fig. 5. Different detection results by original RANSAC: (a) first detection; (b) second detection
Fig. 6. Robust detection result by PRANSAC
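A seeded candidate sequence of this kind can be sketched in Python as follows. The choice of generator is an assumption (Python's built-in `random.Random`; the text does not say which pseudo-random generator is used), and the function name `candidate_triples` is hypothetical.

```python
import random

def candidate_triples(contour, n_trials):
    """Yield n_trials reproducible 3-point samples from one contour.

    contour is assumed to be a list of (x, y) points. The generator is
    seeded with the number of contour points, as suggested in the text,
    so repeated runs on the same contour visit the same candidates.
    """
    rng = random.Random(len(contour))   # private generator, global state untouched
    for _ in range(n_trials):
        yield tuple(rng.sample(contour, 3))
```

Because each contour gets its own generator, the candidates drawn for one contour do not depend on the order in which the contours are processed.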
5 Experiments and Comparison The method proposed in this paper has been implemented in C++ on an Intel Pentium 4 CPU (3.06 GHz) with 512 MB of memory running Windows XP. The performance is shown in Table 1 and compared with CBED. In image pre-processing, we remove useless points by the parallel thinning algorithm and calculate the tangent directions for all the feature points. So compared to CBED, image pre-processing takes longer, but the total detection time is reduced by about 50% on average (between 32% and 79% for the five images in Table 1). We also compare PRANSAC with RTEHT [13], and show the result in Table 2.

Table 1. Performance comparison with CBED (times in ms; total detection time includes image pre-processing)
               Resolution   Pre-processing time    Total detection time
                            CBED     PRANSAC       CBED     PRANSAC
Test image 1   300*300      7        15            109      39
Test image 2   640*400      24       50            161      94
Test image 3   560*420      25       39            170      100
Test image 4   640*480      30       60            220      150
Test image 5   480*360      30       45            379      80

Table 2. Performance comparison with RTEHT (detection time in ms)
               Resolution   RTEHT (Pentium III 500MHz, 256M memory)   PRANSAC (Pentium III 700MHz, 256M memory)
Test image 6   320*240      145                                       84
Test image 7   320*240      150                                       78

Fig. 7. Ellipse detection (test images 1 and 2)
Fig. 8. Ellipse instrument detection (test images 3 and 4)
Fig. 9. Ellipse work piece detection (test image 5)
Fig. 10. Test images in [13] (test images 6 and 7)
6 Conclusion A fast and robust ellipse detection algorithm has been proposed in this paper. We introduce the parallel thinning algorithm to remove useless feature points effectively while preserving useful information. We also make use of a geometric property of the ellipse and introduce tangent directions to fit ellipses with 3 points, which reduces detection time by about 50%. To avoid unsteady results, we propose the “robust candidate sequence” as the source of candidates for RANSAC. In many experiments on different scenes, PRANSAC has demonstrated high performance. Acknowledgement. The research is supported by the National Natural Science Foundation of China (60433030).
References
[1] Nair, P.S., Saunders, A.T.: Hough transform based ellipse detection algorithm. Pattern Recognition Letters 17, 777–784 (1996)
[2] Guil, N., Zapata, E.L.: Lower Order Circle and Ellipse Hough Transform. J. Pattern Recognition 30(10), 1729–1744 (1997)
[3] Yu, X., Leong, H.W., Xu, C., Tian, Q.: A robust and accumulator-free ellipse Hough transform. In: ACM MM04, pp. 256–259 (2004)
[4] Ser, P.-K., Siu, W.-C.: Novel detection of conics using 2-D Hough planes. IEE Proceedings: Vision, Image and Signal Processing 142(5), 262–270 (1995)
[5] Muammar, H., Nixon, M.: Approaches to extending the Hough transform. In: International Conference on Acoustics, Speech, and Signal Processing, ICASSP-89, vol. 3, pp. 1556–1559 (May 23-26, 1989)
[6] Kawaguchi, T., Nagata, R.-I.: Ellipse detection using a genetic algorithm. In: Proceedings, Fourteenth International Conference on Pattern Recognition, vol. 1, pp. 141–145 (August 16-20, 1998)
[7] Frigui, H., Krishnapuram, R.: A comparison of fuzzy shell-clustering methods for the detection of ellipses. IEEE Transactions on Fuzzy Systems 4(2), 193–199 (1996)
[8] Kim, E., Haseyama, M., Kitajima, H.: Fast line extraction from digital images using line segments. IEICE Trans. 84(8), 1566–1579 (2001)
[9] Cai, W., Yu, Q., Wang, H.: A Fast Contour-based Approach to Circle and Ellipse Detection. In: Proceedings of IEEE 5th World Conference on Intelligent Control and Automation (WCICA 2004), June 15-19, 2004, Hangzhou, China, pp. 4686–4690 (2004)
[10] Fischler, M.A., Bolles, R.C.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Comm. of the ACM 24, 381–395 (1981)
[11] Xie, Y., Ji, Q.: A New Efficient Ellipse Detection Method. In: International Conference on Pattern Recognition, pp. 957–960 (2002)
[12] Guo, Z., Hall, R.W.: Parallel thinning with two-subiteration algorithms. Communications of the ACM 32(3), 359–373 (1989)
[13] Zhang, S.-C., Liu, Z.-Q.: A robust real-time ellipse detector. Pattern Recognition 38, 273–287 (2004)
A Definition for Orientation for Multiple Component Shapes
Joviša Žunić¹ and Paul L. Rosin²
¹ Computer Science Department, Exeter University, Exeter EX4 4QF, UK [email protected]
² School of Computer Science, Cardiff University, Cardiff CF24 3AA, Wales, UK [email protected]
Abstract. In this paper we introduce a new method for computing the orientation for compound shapes. If the method is applied to single component shapes the computed orientation is consistent with the shape orientation defined by the axis of the least second moment of inertia. If the new method is applied to compound shapes this is not the case, and consequently the presented method is both new and different. Keywords: Shape, orientation, image normalization, early vision.
1 Introduction
Determining the orientation of a shape is often performed in computer vision so as to enable subsequent analysis to be carried out in the shape’s local frame of reference (thereby simplifying that analysis). While for many shapes their orientations are obvious and can be computed easily, the orientation of other shapes may be ambiguous, subtle, or ill defined. The difficulty of the task can be seen from the multiplicity of mechanisms used in human perception in which orientation can be determined by axes of symmetry and elongation [1], as well as cues from local contour, texture, and context [9]. The most common computational method for determining a shape’s orientation is based on the axis of the least second moment of inertia [4,6]. Although straightforward and efficient to compute, it breaks down in some circumstances; for example, problems arise when working with symmetric shapes [11,12]. This has encouraged the development of other competing methods [2,3,4,5,6,7,10,11]. The suitability of those methods strongly depends on the particular situation in which they are applied, as they each have their relative strengths and weaknesses (e.g. relating to robustness to noise, classes of shape that can be oriented, number of parameters, computational efficiency). In this paper we focus on the orientation of shapes consisting of several components. We introduce a new approach to the shape orientation problem and, after that, we extend the method to shapes that consist of several components. We consider a line that maximises the integral of the squared lengths of the projections of line segments whose end points belong to the shape onto this line. Then we define the orientation of the shape by the slope of such a line. It turns out that, if applied to single component shapes, such a method for computing shape orientation is consistent with the standard method based on the line of the least second moment of inertia. Such a new approach leads to a natural definition of the orientation of compound shapes. The computed orientation of compound shapes differs from the orientation of compound shapes computed by the standard method. In some situations the new orientation of compound shapes is more appropriate than the orientation computed by the standard method. The paper is organized as follows. In the next section we give a short sketch of the standard method for computing shape orientation. In Section 3 we introduce a new approach for orienting shapes. Section 4 adapts the newly introduced approach to the orientation of compound shapes, discusses the properties of the new method and gives some illustrative examples. Section 5 gives concluding comments.
J. Žunić is also with the Mathematical Institute - SANU, Belgrade.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 677–685, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 The Standard Method
The most standard method for computation of shape orientation is based on the axis of the least second moment of inertia. The axis of the least second moment of inertia [4,6] is the line which minimises the integral of the squares of distances of the points (belonging to the shape) to the line. The integral is
$$I(\alpha, S, \rho) = \iint_S r^2(x, y, \alpha, \rho) \, dx \, dy$$ (1)
where $r(x, y, \alpha, \rho)$ is the perpendicular distance from the point $(x, y)$ to the line given in the form $X \cdot \sin\alpha - Y \cdot \cos\alpha = \rho$. It is well known that the axis of the least second moment of inertia passes through the shape centroid. The centroid is computed as $\left( \frac{m_{1,0}(S)}{m_{0,0}(S)}, \frac{m_{0,1}(S)}{m_{0,0}(S)} \right)$ where $m_{0,0}(S)$ is the zeroth order moment of S and $m_{1,0}(S)$ and $m_{0,1}(S)$ are the first order moments of S. In general, the moment $m_{p,q}(S)$ is defined as $m_{p,q}(S) = \iint_S x^p y^q \, dx \, dy$ and has order $p + q$. Thus, if the shape S is translated by the vector $-\left( \frac{m_{1,0}(S)}{m_{0,0}(S)}, \frac{m_{0,1}(S)}{m_{0,0}(S)} \right)$ (such that the centroid of S coincides with the origin) then it is possible to set $\rho = 0$. Since the squared distance of a point $(x, y)$ to the line $X \cdot \sin\alpha - Y \cdot \cos\alpha = 0$ is $(x \cdot \sin\alpha - y \cdot \cos\alpha)^2$, the function $F(\alpha, S)$ that should be minimised in order to compute the orientation of S can be expressed as
$$F(\alpha, S) = \iint_S \left( \left( x - \frac{m_{1,0}(S)}{m_{0,0}(S)} \right) \sin\alpha - \left( y - \frac{m_{0,1}(S)}{m_{0,0}(S)} \right) \cos\alpha \right)^2 dx \, dy$$
$$= \left( m_{2,0}(S) - \frac{(m_{1,0}(S))^2}{m_{0,0}(S)} \right) \sin^2\alpha + \left( m_{0,2}(S) - \frac{(m_{0,1}(S))^2}{m_{0,0}(S)} \right) \cos^2\alpha - \left( m_{1,1}(S) - \frac{m_{1,0}(S) \cdot m_{0,1}(S)}{m_{0,0}(S)} \right) \sin(2\alpha).$$ (2)
Definition 1. The orientation of a given shape S is defined by the angle α for which the function $F(\alpha, S)$ reaches the minimum. Elementary mathematics says that the angle α which defines the orientation of S (as given by Definition 1) satisfies the following equation:
$$\frac{\sin(2\alpha)}{\cos(2\alpha)} = \frac{2 \cdot \overline{m}_{1,1}(S)}{\overline{m}_{2,0}(S) - \overline{m}_{0,2}(S)},$$ (3)
where $\overline{m}_{p,q}(S) = \iint_S \left( x - \frac{m_{1,0}(S)}{m_{0,0}(S)} \right)^p \left( y - \frac{m_{0,1}(S)}{m_{0,0}(S)} \right)^q dx \, dy$ are the centralised moments.
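As an illustration of Definition 1 and equation (3), the following Python sketch computes the standard orientation of a discrete shape. It assumes the shape is given as a set of pixel coordinates, so the integrals are replaced by sums; the function names are hypothetical. Equation (3) alone determines α only up to a multiple of π/2, so `arctan2` is used here to select the solution that minimises F rather than the one that maximises it.

```python
import numpy as np

def central_moments(points):
    """Second-order centralised moments (mu11, mu20, mu02) of a point set."""
    pts = np.asarray(points, dtype=float)
    d = pts - pts.mean(axis=0)                 # translate centroid to the origin
    return np.sum(d[:, 0] * d[:, 1]), np.sum(d[:, 0] ** 2), np.sum(d[:, 1] ** 2)

def standard_orientation(points):
    """Angle alpha in (-pi/2, pi/2] minimising F(alpha, S), cf. equation (3)."""
    mu11, mu20, mu02 = central_moments(points)
    if np.isclose(mu11, 0.0) and np.isclose(mu20, mu02):
        raise ValueError("F is constant: the shape is unorientable by this method")
    return 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
```

For points lying on the line y = 2x, for example, the returned angle is arctan(2), the direction of that line.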
3 An Alternative Approach
In this section we start with a different motivation for the definition of shape orientation. It will turn out that this approach in its basic variant (when single component shapes are considered) leads to the same shape orientation as if computed based on the axis of the least moment of inertia, but the application to compound shapes leads to an essentially new method. Let S be a given shape, and consider the line segments [AB] whose end points A and B are points from S. Let $\vec{a}$ denote the unit vector in the direction α, i.e., $\vec{a} = (\cos\alpha, \sin\alpha)$. Also, let $\mathrm{pr}_{\vec{a}}[AB]$ be the projection of the line segment [AB] onto a line that makes an angle α with the x-axis, while $|\mathrm{pr}_{\vec{a}}[AB]|$ denotes the length of such a projection. Then, it seems very natural to define the orientation of the shape S by the direction that maximises the integral $\iiiint_{A \in S, B \in S} |\mathrm{pr}_{\vec{a}}[AB]|^2 \, dx \, dy \, du \, dv$ of the squared lengths of the projections of such line segments onto a line having this direction. Thus we give the following definition. Definition 2. The orientation of a given shape S is defined by the angle α where the function
$$G(\alpha, S) = \iiiint_{A=(x,y) \in S, \; B=(u,v) \in S} |\mathrm{pr}_{\vec{a}}[AB]|^2 \, dx \, dy \, du \, dv$$ (4)
reaches its maximum. Even though Definition 1 and Definition 2 come from different motivations, it turns out that they are equivalent. Theorem 1 shows that the sum $G(\alpha, S) + 2 \cdot m_{0,0}(S) \cdot F(\alpha, S)$ depends only on the shape S but not on the angle α. This implies that the maximum of $G(\alpha, S)$ and the minimum of $F(\alpha, S)$ are reached at the same angle. In other words, the orientations computed by Definition 1 and Definition 2 are consistent. Theorem 1. The following equality holds:
$$G(\alpha, S) + 2 \cdot m_{0,0}(S) \cdot F(\alpha, S) = 2 \cdot \left( (m_{2,0}(S) + m_{0,2}(S)) \cdot m_{0,0}(S) - (m_{1,0}(S))^2 - (m_{0,1}(S))^2 \right).$$ (5)
Fig. 1. Projections of all the line segments whose endpoints lie in S are considered, irrespective of whether the line segment intersects the boundary of S
Proof. Let A = (x, y) and B = (u, v). The proof follows easily from the trivial equalities
$$|\mathrm{pr}_{\vec{a}}[AB]|^2 = |(x - u, y - v) \cdot (\cos\alpha, \sin\alpha)|^2 \quad \text{and} \quad \iiiint_{S \times S} x^p y^q u^r v^t \, dx \, dy \, du \, dv = m_{p,q}(S) \cdot m_{r,t}(S).$$
Indeed,
$$G(\alpha, S) + 2 \cdot m_{0,0}(S) \cdot F(\alpha, S) = \iiiint_{S \times S} |\mathrm{pr}_{\vec{a}}[AB]|^2 \, dx \, dy \, du \, dv + 2 \cdot m_{0,0}(S) \cdot F(\alpha, S)$$
$$= \iiiint_{S \times S} ((x - u) \cdot \cos\alpha + (y - v) \cdot \sin\alpha)^2 \, dx \, dy \, du \, dv + 2 \cdot m_{0,0}(S) \cdot F(\alpha, S)$$
$$= 2 (m_{2,0}(S) + m_{0,2}(S)) \cdot m_{0,0}(S) - 2 (m_{1,0}(S))^2 - 2 (m_{0,1}(S))^2.$$
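The relation between F and G can be checked numerically on a discrete point set, where the integrals become sums and the identity still holds exactly. The sketch below is an illustration with hypothetical function names: it evaluates G by brute force over all ordered pairs of points and verifies that the combination G + 2·m00·F, which is what makes the maximum of G coincide with the minimum of F, equals the right-hand side of (5) for every angle.

```python
import numpy as np

def raw_moment(pts, p, q):
    """Discrete moment m_{p,q}: sum of x^p * y^q over the points."""
    return np.sum(pts[:, 0] ** p * pts[:, 1] ** q)

def F(alpha, pts):
    """Sum of squared distances to the line through the centroid with direction alpha."""
    d = pts - pts.mean(axis=0)
    return np.sum((d[:, 0] * np.sin(alpha) - d[:, 1] * np.cos(alpha)) ** 2)

def G(alpha, pts):
    """Sum over all ordered pairs (A, B) of the squared projection length of [AB]."""
    s = pts[:, 0] * np.cos(alpha) + pts[:, 1] * np.sin(alpha)  # projection of each point
    return np.sum((s[:, None] - s[None, :]) ** 2)

def theorem1_gap(alpha, pts):
    """G + 2*m00*F minus the right-hand side of (5); should be numerically zero."""
    m = lambda p, q: raw_moment(pts, p, q)
    rhs = 2 * ((m(2, 0) + m(0, 2)) * m(0, 0) - m(1, 0) ** 2 - m(0, 1) ** 2)
    return G(alpha, pts) + 2 * m(0, 0) * F(alpha, pts) - rhs
```

Since the gap is zero for every α, G equals a shape-dependent constant minus 2·m00·F, so the angle that maximises G is exactly the angle that minimises F.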
4 Orientation of Compound Objects
Image analysis often deals with groups of shapes or with shapes that are composed of several parts. The desired properties of the computed orientation of such compound shapes can vary. Sometimes it is reasonable that the computed orientation is derived from the orientations of components of compound shape, whereas in other cases it is preferable that the orientation is a global property of the whole compound object. In this section we introduce a new definition for computing the orientation of such compound shapes and give some examples of computed orientations. The definition of such an orientation follows naturally from Definition 2.
Definition 3. Let S be a compound object which consists of m disjoint shapes $S_1, S_2, \ldots, S_m$. Then the orientation of S is defined by the angle that maximises the function $G_{comp}(\alpha, S)$ defined by
$$G_{comp}(\alpha, S) = \sum_{i=1}^{m} \iiiint_{A=(x,y) \in S_i, \; B=(u,v) \in S_i} |\mathrm{pr}_{\vec{a}}[AB]|^2 \, dx \, dy \, du \, dv.$$ (6)
The above definition allows an easy computation of the defined orientation.
Theorem 2. The angle α where the function $G_{comp}(\alpha, S)$ reaches the maximum satisfies the following equation
$$\frac{\sin(2\alpha)}{\cos(2\alpha)} = \frac{2 \cdot \sum_{i=1}^{m} \left( m_{1,1}(S_i) \cdot m_{0,0}(S_i) - m_{1,0}(S_i) \cdot m_{0,1}(S_i) \right)}{\sum_{i=1}^{m} \left( (m_{2,0}(S_i) - m_{0,2}(S_i)) \cdot m_{0,0}(S_i) + (m_{0,1}(S_i))^2 - (m_{1,0}(S_i))^2 \right)} = \frac{2 \cdot \sum_{i=1}^{m} \overline{m}_{1,1}(S_i) \cdot m_{0,0}(S_i)}{\sum_{i=1}^{m} (\overline{m}_{2,0}(S_i) - \overline{m}_{0,2}(S_i)) \cdot m_{0,0}(S_i)}.$$ (7)
Proof. Similarly to the proof of Theorem 1, the proof follows easily by setting $dG_{comp}(\alpha, S)/d\alpha = 0$.
The following notes point out some properties of $G_{comp}(\alpha, S)$.
Note 1. The computed orientation of a single component object (based on $F(\alpha, S)$ and $G(\alpha, S)$) breaks down when $\overline{m}_{1,1}(S) = \overline{m}_{2,0}(S) - \overline{m}_{0,2}(S) = 0$ because under these conditions $F(\alpha, S)$ and $G(\alpha, S)$ become constant functions (see (2) and (5)). Consequently no direction can be selected as a shape orientation. Analogously, when $\sum_{i=1}^{m} \overline{m}_{1,1}(S_i) = \sum_{i=1}^{m} (\overline{m}_{2,0}(S_i) - \overline{m}_{0,2}(S_i)) = 0$ holds then the orientation of the compound shape $S = S_1 \cup \ldots \cup S_m$ cannot be computed by $G_{comp}(\alpha, S)$.
Note 2. Any component $S_i$ of a compound shape $S = S_1 \cup \ldots \cup S_m$ that is considered unorientable by $G(\alpha, S_i)$ (i.e. $G(\alpha, S_i) = const.$) will not contribute to (7), and is therefore ignored in the computation of $G_{comp}(\alpha, S)$. That is because $G(\alpha, S_i) = const.$ implies $\overline{m}_{1,1}(S_i) = 0$ and $\overline{m}_{2,0}(S_i) = \overline{m}_{0,2}(S_i)$.
Note 3. If all components $S_i$ of a given shape S have identical orientation according to $G(\alpha, S_i)$ then this same orientation is also computed by $G_{comp}(\alpha, S)$.
Note 4. From (7) it can be seen that components of S contribute a weight proportional to $m_{0,0}(S_i)^3$.
The cubic weighting given to components in (7) seems excessive since it tends to cause the larger components to strongly dominate the computed orientation. For instance, let a compound object S consist of a shape $S_1$ and a shape $S_2'$ which is the dilation of a shape $S_2$ by a factor r, i.e. $S_2' = r \cdot S_2 = \{(r \cdot x, r \cdot y) \mid (x, y) \in S_2\}$, so that the size of $S_2'$ increases as r increases. Then $m_{0,0}(S_2') = r^2 \cdot m_{0,0}(S_2)$, $\overline{m}_{1,1}(S_2') = r^4 \cdot \overline{m}_{1,1}(S_2)$, $\overline{m}_{2,0}(S_2') = r^4 \cdot \overline{m}_{2,0}(S_2)$, $\overline{m}_{0,2}(S_2') = r^4 \cdot \overline{m}_{0,2}(S_2)$, so each product of a second order centralised moment with $m_{0,0}$ scales with $r^6$. Entering the above estimates into (7) we obtain that $\sin(2\alpha)/\cos(2\alpha)$ equals
$$\frac{2 \cdot \overline{m}_{1,1}(S_1) \cdot m_{0,0}(S_1) + 2 \cdot r^6 \cdot \overline{m}_{1,1}(S_2) \cdot m_{0,0}(S_2)}{(\overline{m}_{2,0}(S_1) - \overline{m}_{0,2}(S_1)) \cdot m_{0,0}(S_1) + r^6 \cdot (\overline{m}_{2,0}(S_2) - \overline{m}_{0,2}(S_2)) \cdot m_{0,0}(S_2)}$$
and obviously the influence of $S_2'$ on the computed orientation of S can be very large if the dilation factor r is much bigger than 1. This suggests a modification of (7) to enforce instead a linear weighting by area, namely:
$$\frac{\sin(2\alpha)}{\cos(2\alpha)} = \frac{2 \cdot \sum_{i=1}^{m} \overline{m}_{1,1}(S_i) / m_{0,0}(S_i)}{\sum_{i=1}^{m} (\overline{m}_{2,0}(S_i) - \overline{m}_{0,2}(S_i)) / m_{0,0}(S_i)}.$$ (8)
If the orientation α of $S = S_1 \cup S_2' = S_1 \cup r \cdot S_2$ is computed by (8) then $\sin(2\alpha)/\cos(2\alpha)$ equals
$$\frac{2 \cdot \overline{m}_{1,1}(S_1) / m_{0,0}(S_1) + 2 \cdot r^2 \cdot \overline{m}_{1,1}(S_2) / m_{0,0}(S_2)}{(\overline{m}_{2,0}(S_1) - \overline{m}_{0,2}(S_1)) / m_{0,0}(S_1) + r^2 \cdot (\overline{m}_{2,0}(S_2) - \overline{m}_{0,2}(S_2)) / m_{0,0}(S_2)}.$$
It is not difficult to imagine a situation where a change in the size of the components of a compound object should have no effect on the computed orientation. For instance, objects may be the same size in nature, but appear at different sizes in the image due to varying distances from the camera. If we would like to avoid any impact of the size of the components on the computed orientation of a compound object then we can use the formula:
$$\frac{\sin(2\alpha)}{\cos(2\alpha)} = \frac{2 \cdot \sum_{i=1}^{m} \overline{m}_{1,1}(S_i) / (m_{0,0}(S_i))^2}{\sum_{i=1}^{m} (\overline{m}_{2,0}(S_i) - \overline{m}_{0,2}(S_i)) / (m_{0,0}(S_i))^2}.$$
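The three weighting schemes differ only in the power of $m_{0,0}(S_i)$ applied to each component, which the following Python sketch makes explicit. It is an illustration under assumptions: components are given as lists of pixel coordinates (sums replace integrals), and the function name and the `area_power` parameter are hypothetical. `area_power=1` corresponds to (7), `area_power=-1` to (8), and `area_power=-2` to the size-independent formula.

```python
import numpy as np

def compound_orientation(components, area_power=1):
    """Orientation of a compound shape from per-component centralised moments.

    components : iterable of point sets, one per component S_i
    area_power : exponent applied to m00(S_i); 1 -> (7), -1 -> (8), -2 -> size-invariant
    """
    num, den = 0.0, 0.0
    for comp in components:
        pts = np.asarray(comp, dtype=float)
        d = pts - pts.mean(axis=0)               # centre each component separately
        mu11 = np.sum(d[:, 0] * d[:, 1])
        mu20 = np.sum(d[:, 0] ** 2)
        mu02 = np.sum(d[:, 1] ** 2)
        w = float(len(pts)) ** area_power        # m00 is the point count here
        num += 2.0 * mu11 * w
        den += (mu20 - mu02) * w
    if np.isclose(num, 0.0) and np.isclose(den, 0.0):
        raise ValueError("compound shape is unorientable (cf. Note 1)")
    return 0.5 * np.arctan2(num, den)            # picks the maximiser of G_comp
```

Because every component is centred on its own centroid, the relative placement of the components has no influence on the result, which is the main difference from applying the standard method to the union of all points.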
Following on from Note 2, any component that is almost unorientable by $G(\alpha, S_i)$ will have little effect on the computation of $G_{comp}(\alpha, S)$. Thus we can say that $G_{comp}(\alpha, S)$ is not the same as computing a simple circular mean¹ of the orientations produced by $G(\alpha, S_i)$, since $G_{comp}(\alpha, S)$ weights the contributions of components according to both their area and their orientability. The new methods (given by (7) and (8)) for computing orientation are demonstrated on some trademarks in figure 2. In figure 2a-c the orientations computed from $G_{comp}(\alpha, S)$ are different from, and preferable to, those based on $F(\alpha, S)$ (in which the trademarks are considered as single component objects).
¹ The circular mean μ of a set of n orientation samples $\theta_i$ is defined as [8]: $\mu = \mu'$ if $S > 0$ and $C > 0$; $\mu = \mu' + \pi$ if $C < 0$; $\mu = \mu' + 2\pi$ if $S < 0$ and $C > 0$, where $\mu' = \tan^{-1}(S/C)$, $C = \frac{1}{n} \sum_{i=1}^{n} \cos\theta_i$, $S = \frac{1}{n} \sum_{i=1}^{n} \sin\theta_i$.
Fig. 2. Orientations of trademarks computed by (7), (8), and (3), listed in that order: (a) 61.9°, 61.9°, 20.5°; (b) 66.8°, 67.0°, 32.2°; (c) 50.6°, 49.6°, −17.8°; (d) −44.4°, −44.4°, −40.3°; (e) 88.5°, 88.6°, −0.9°; (f) −64.2°, −71.1°, −28.1°; (g) −65.54°, 89.94°, −4.4°; (h) 24.0°, −4.8°, 13.9°
Fig. 3. Orientations computed by (8) of multiple components are overlaid
The computed orientations for the shape in figure 2d are very close to each other and coincide with one of the shape’s symmetry axes. The small inconsistency is caused by the discretization process. The trademark in figure 2e has reflective symmetry, and the orientation given by $G_{comp}(\alpha, S)$ is along the symmetry axis, in contrast to the orientation estimated by the standard method. The individual orientations of the components are {−63.7°, 61.7°}. When one component is reduced in size (figure 2f) the larger component dominates (7), whereas this effect is reduced by (8). In figure 2g four quarter-area components combine to have an identical effect to a single full-size component, and the orientation given by (8) is close to 90° again. The larger component still strongly dominates (7).
Figure 2h gives an example of a shape that is unorientable by (7) and (8) (see Note 1). The individual component orientations of {65.5°, −54.7°, 5.6°} are well defined; the values $|\overline{m}_{2,0}(S_i) - \overline{m}_{0,2}(S_i)| + |\overline{m}_{1,1}(S_i)|$ are around 400. However, $\left| \sum_{i=1}^{3} (\overline{m}_{2,0}(S_i) - \overline{m}_{0,2}(S_i)) \right| + \left| \sum_{i=1}^{3} \overline{m}_{1,1}(S_i) \right|$ is only equal to 6. This explains the inconsistency between the orientation estimates from (7) and (8) despite the component areas being almost equal. The new method is further demonstrated on several natural scenes in figure 3. The overall orientations 47.16°, 87.68°, and 82.87° computed by (8) are appropriate and seem more acceptable than the orientations −15.6°, −8.1° and −4.3° computed by minimizing $F(\alpha, S)$. Note that in the last example the tilted blob has only a minor impact on the overall orientation estimated by (8).
5 Conclusion
In this paper we have considered an alternative approach for the computation of shape orientation. One benefit of this approach is that it leads to a new method for computing the orientation of compound objects, which consist of several components. It was shown that the compound shape orientation defined in this way has several attractive properties. The computed orientations are given for several examples and they are reasonable. When applied to single-component objects, the newly defined shape orientation is consistent with the standard method for shape orientation computation, but the methods differ when applied to shapes consisting of several components. It is not possible to say that the orientations computed by one of the methods are better than those computed by the other. A final judgement can be made only with knowledge of the particular application in which the methods are applied.
Definition of a Model-Based Detector of Curvilinear Regions
Cédric Lemaître1, Johel Miteran1, and Jiří Matas2
1 LE2I, Faculté Mirande, Aile H, Université de Bourgogne, BP 47870, 21078 Dijon
2 Center for Machine Perception, Dept. of Cybernetics, CTU Prague, Karlovo nam 13, CZ 12 135
{cedric.lemaitre,miteranj}@u-bourgogne.fr, [email protected]
Abstract. This paper describes a new approach for the detection of curvilinear regions. The detection of such features can be useful for any matching-based algorithm, such as stereoscopic vision. Our detector is based on a curvilinear structure model, defined by observing the real world. We then propose a multi-scale search algorithm for curvilinear regions and report some preliminary results.
1 Introduction
Detecting specific features has been shown to be useful in many computer vision applications such as stereoscopic vision [1, 2], object recognition [3], or image retrieval [4]. A number of feature detectors [1, 2, 5-9] have been proposed in the literature. As these detectors can be combined in order to improve recognition performance, it can be useful to add new features. In this paper we propose a new model of such features, based on the extraction of curvilinear regions, often simply called line extraction. Curvilinear structure extraction is used in a large number of application fields. In photogrammetric and remote sensing tasks, it is used to extract roads, railroads or rivers from satellite or low-resolution aerial imagery [10-12], which can be used for the capture or update of data for geographic information systems. Moreover, it is useful in medical imaging for the extraction of anatomical features or simply for the segmentation of medical images from CT or MR devices [13-15]. A few models [12, 14, 16-18] have been described for curvilinear structure detection, often application specific (for roads, vessels). Our aim here is to define a superset of these models, which can be used in the general case of object recognition and allows building an affine-invariant detector. Steger [18] classified previous work in curvilinear detection into three families. The first approach detects lines by considering only the gray values of the image and uses purely local criteria. The second approach regards lines as objects having parallel edges. The last family regards the image as a function Z(x,y) and extracts lines from it by using differential geometric properties. The basic idea behind these algorithms is to locate the positions of ridges and ravines in the image function. In his paper, Steger described an explicit model for lines and line profile models. A scale space analysis is carried out for each of the models.
This analysis is used to derive an algorithm in which lines and their widths can be extracted with subpixel accuracy. The algorithm uses a modification of the differential approach to detect lines and their corresponding edges. In this paper, we propose an extension of Steger's model, which allows detecting lines using the criteria of all three families described above. Observing the real world, we first give the definition of the curvilinear structure. We then present the detector, which can be seen as a minimization problem, and detail the components of the cost function. Following the approach of Steger, we define two main steps of detection: the profile detection and the detection of the whole curvilinear structure. The main new criteria we introduce are texture attributes inside and around the shape, and information about line profile symmetry and local curvature. In the last section, we present some results concerning the profile detection and some preliminary results concerning the whole detector.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 686–693, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Curvilinear Structure Definition
2.1 Geometry
We define a curvilinear region as a set of pixels delimited by left and right boundaries $\vec{l}(t)$ and $\vec{r}(t)$. In the continuous case, it can be defined by $\{\vec{a}(t), w(t)\}$, as shown in Fig. 1. In the discrete case, it can be defined by the set $c = \{(\vec{a}_i, w_i),\ i = 0, \dots, L\}$, where $L$ is the length of the curvilinear region, $\vec{a}(t)$ is a vector defining the axis between the boundaries, $\vec{a}(t) = \frac{\vec{l}(t) + \vec{r}(t)}{2}$, and $w(t) = \|\vec{l}(t) - \vec{r}(t)\|$.
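As an illustration of the discrete definition above, the following minimal sketch computes the axis points and widths from paired boundary samples. The function name `axis_and_width` and the list-of-tuples representation are assumptions made for this example, and the width is taken here as the Euclidean distance between the two boundary points.

```python
import math


def axis_and_width(left, right):
    """Return (axis points, widths) for paired samples of the two boundaries.

    left, right: lists of (x, y) tuples, sampled at the same positions t.
    """
    axis, widths = [], []
    for (lx, ly), (rx, ry) in zip(left, right):
        # Axis point: midpoint between the left and right boundary points.
        axis.append(((lx + rx) / 2.0, (ly + ry) / 2.0))
        # Width: distance between the two boundary points (assumed Euclidean).
        widths.append(math.hypot(lx - rx, ly - ry))
    return axis, widths


# Hypothetical straight horizontal region, 4 pixels wide.
left = [(0.0, 2.0), (1.0, 2.0), (2.0, 2.0)]
right = [(0.0, -2.0), (1.0, -2.0), (2.0, -2.0)]
axis, widths = axis_and_width(left, right)
```

The pairs `(axis[i], widths[i])` then play the role of the elements $(\vec{a}_i, w_i)$ of the set $c$.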
3 Appearance
In a 1D view, a curvilinear region can be described by $\vec{v}(t)$ (the "cross-section"), the vector of pixel values on a linear segment centered at $\vec{a}(t)$, perpendicular to $\vec{a}'(t)$, and of length $w(t)$. $\vec{v}_l(t)$ (resp. $\vec{v}_r(t)$) is the vector of pixel values on a linear segment perpendicular to $\vec{a}'(t)$ and of length $k(t)$, with $k(t) < w(t)$.

Fig. 1. Geometric and appearance attributes (labels: similar areas, maximum of gradient, $\vec{a}(t)$, $\vec{a}'(t)$, $w(t)$, $h(t)$, $k(t)$, $\vec{V}(t)$, $\vec{V}_l(t)$, $\vec{V}_r(t)$)
4 Detection of Lines in 1D
The problem of curvilinear detection using the previous model can be posed as a minimization problem: we define a cost function and look for a local minimum over the space of all $(\vec{a}, w)$. We define the cost function $J(I, \vec{a}(t), w(t))$, where $I$ is the input image. The output is a set $S$ of $N$ curvilinear regions, $S = \{c_j,\ j = 0, \dots, N\}$ with $c_j = \{(\vec{a}_{ij}, w_{ij}),\ i = 0, \dots, L_j\}$, where $L_j$ is the length of the $j$-th curvilinear region.
4.1 Cost Function Components
Our cost function is composed of several constraints, chosen from observations of real pictures (Fig. 2). Observing a 1D profile (Fig. 3), we can see that a curvilinear structure shows a large difference between the exterior (left and right) of the region and the cross-section. This constraint can be defined as follows:

$\left[ 1 - m(\vec{V}, \vec{V}_l) \right] \left[ 1 - m(\vec{V}, \vec{V}_r) \right] \approx 0,$
where $m$ is, for example, the normalized Euclidean distance computed in Fourier space, defined by:

$m(f_l(t), f_r(t)) = \frac{\sum_u \left( \hat{f}_l(u) - \hat{f}_r(u) \right)^2}{m_{\max}}$
This distance takes the local texture into account. A simpler alternative is the normalized modulus of the luminance gradient. The values on the left and right sides should be similar, which gives the following relation:

$m(\vec{V}_l, \vec{V}_r) \approx 0$
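To make the role of the distance $m$ more concrete, here is a minimal sketch of a normalized distance between two 1D windows computed in Fourier space. Several details are assumptions of this example and are not taken from the paper: the distance is the square root of the summed squared moduli of the complex differences, the normalization constant is $m_{\max} = n \cdot v_{\max}$ (the largest value reachable for windows of $n$ samples in $[0, v_{\max}]$, by Parseval's relation), and a naive DFT is used instead of an FFT to keep the code self-contained.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * u * t / n) for t, v in enumerate(values))
        for u in range(n)
    ]


def fourier_distance(f_l, f_r, v_max=255.0):
    """Distance between two equal-length windows, normalized to [0, 1].

    Assumption: m_max = n * v_max, the largest possible spectral distance
    for windows of n samples whose values lie in [0, v_max].
    """
    if len(f_l) != len(f_r):
        raise ValueError("windows must have the same length")
    spec_l, spec_r = dft(f_l), dft(f_r)
    d = math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(spec_l, spec_r)))
    m_max = len(f_l) * v_max
    return min(d / m_max, 1.0)  # clip tiny floating-point overshoot


same = fourier_distance([10, 50, 10, 50], [10, 50, 10, 50])
opposite = fourier_distance([0, 0, 0, 0], [255, 255, 255, 255])
```

With this choice, two identical windows give 0 and two uniform windows at opposite ends of the gray-level range give a value close to 1, so the three constraints on $m$ can be read directly as values between 0 and 1.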
Moreover, we can observe (Fig. 2 and Fig. 3) that the texture in the cross-section is constant along the curvilinear region, which is expressed by:

$m(\vec{V}(t), \vec{V}(t + dt)) \approx 0$

Observing the 2D signal (Fig. 3), we can add some 2D constraints on width and curvature. The width of the curvilinear region is constant along the axis, implying:

$\frac{dw(t)}{dt} \approx 0$

Moreover, the local curvature $\gamma(t)$ should also be constant along the axis, and the boundaries are locally parallel and symmetric about the central axis:

$\frac{d\gamma(t)}{dt} \approx 0$ and $\frac{d\vec{r}(t)}{dt} \approx \frac{d\vec{l}(t)}{dt}$.

Fig. 2. Example of images containing curvilinear regions

Fig. 3. Observations on 1D section (labels: similar borders, two high gradients, homogeneous section, curvilinear region)
4.2 1D Search Algorithm
Along the 1D profile of length imax, we have to minimize Q:

$Q = 1 - \left( m(\vec{V}, \vec{V}_l)\, m(\vec{V}, \vec{V}_r) \left( 1 - m(\vec{V}_l, \vec{V}_r) \right) \right)$   (1)

Since we want to detect the cross-section at different scales, we have to compute Q for all the possible widths of the cross-section. In the case where there is only one curvilinear cross-section along the profile, the multi-scale algorithm works as follows:

For i1 = 0 to imax
    For i2 = i1 + Δ to imax
        Compute FFT on left and right vectors of width Δ near i1 and i2.
        Compute the complex distances in Fourier space.
        Normalize distances.
        Compute Q(i1, i2) according to eq. 1.
Find minimum: Qmin(Ia) = min(Q(i1, i2)) for i2: i1 -> imax and for i1: 0 -> imax, with Ia = (i1 + i2)/2
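The search described above can be sketched as follows. This is only one possible reading of the pseudo code, under stated assumptions: at each candidate boundary, the window of width Δ just outside the region is compared with the window of width Δ just inside it; the distance is the Fourier-space distance with an assumed normalization $m_{\max} = \Delta \cdot v_{\max}$; and a naive DFT replaces the FFT. The names `search_cross_section`, `delta` and `v_max` are hypothetical.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * u * t / n) for t, v in enumerate(values))
        for u in range(n)
    ]


def distance(a, b, v_max):
    """Normalized spectral distance in [0, 1] (normalization is an assumption)."""
    d = math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(dft(a), dft(b))))
    return min(d / (len(a) * v_max), 1.0)


def search_cross_section(profile, delta, v_max=255.0):
    """Exhaustive search of (i1, i2) minimizing Q along a 1D profile.

    Returns (q_min, axis position Ia, width i2 - i1).
    """
    n = len(profile)
    best = None
    # Only positions where all windows of width delta fit in the profile.
    for i1 in range(delta, n - 2 * delta + 1):
        outer_left = profile[i1 - delta:i1]
        inner_left = profile[i1:i1 + delta]
        for i2 in range(i1 + delta, n - delta + 1):
            inner_right = profile[i2 - delta:i2]
            outer_right = profile[i2:i2 + delta]
            q = 1.0 - (
                distance(inner_left, outer_left, v_max)
                * distance(inner_right, outer_right, v_max)
                * (1.0 - distance(outer_left, outer_right, v_max))
            )
            if best is None or q < best[0]:
                best = (q, i1, i2)
    q_min, i1, i2 = best
    return q_min, (i1 + i2) / 2.0, i2 - i1


# Hypothetical "gate" profile: a bright band of width 10 centered at 25.
profile = [10] * 20 + [200] * 10 + [10] * 20
q_min, axis_position, width = search_cross_section(profile, delta=4)
```

The double loop is quadratic in the profile length, which is acceptable for a single profile but hints at why a full 2D detector built on it would be costly.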
5 Experiments
5.1 Results of Cross-Section Detection
In order to validate the 1D model, we applied the algorithm presented above to several artificial and real signals. The following figures show the original signal and the value of $Q'(i) = 1 - Q_{\min}(i)$. This value should be maximal when $i = I_a$, where $I_a$ is the abscissa of the axis. The first signal simulates two textures of different frequencies and magnitudes. The detector response is maximal at the center of the central texture (Fig. 4). We compared the result of our detector with that of the usual correlation with a “gate” signal¹. The localisation of the $Q'(i)$ maximum is clearly more precise with our detector, since the correlation function gives multiple maxima in the area of the theoretical maximum.

Fig. 4. Original textured signal and detector response (axes: i, f(i))
The detector response for a 1D selection of a retina-vessel image is depicted in Fig. 5. This signal has a low contrast: all original values are between 95 and 127. Moreover, the edges of the vessel are not clearly defined. Despite these difficulties, the maximum value of our detector is obtained at the centre of the vessel, with a high value close to the one obtained for the standard “gate” signal. The last example, presented in Fig. 6, is a selection of a road in an aerial picture. The selection was made inside a part of a forest. Despite the woods around the road, the value of the maximum is high (0.57) and is well localised on the centre of the road.

Fig. 5. Detector response in the case of retina vessel (axes: i, f(i), Q'(i))

¹ (of the central texture width).

Fig. 6. Detector response in the case of road detection (axes: i, f(i), Q'(i))

5.2 1D Repeatability Study
In order to validate our detector, we carried out a repeatability study. The robustness is evaluated against change of point of view (4 sets of images, each made of 14 points of view of the same scene) and against natural noise introduced by the camera sensor when increasing the sensitivity from 25 to 1600 ISO (one set of 20 images of one scene). Some examples are depicted in Fig. 7. For each image, 2 sections were analyzed by a human expert and by our algorithm in order to determine the centre position and the width of the section. The errors introduced by our method (absolute value of the differences between human measurements and filter responses) are presented in Table 1 and Table 2. The error measured on the axis position and on the curvilinear region width is 1 to 2 pixels in many cases. The width values ranged from 8 to 180 pixels.
Fig. 7. Example of images used for repeatability study

Table 1. Repeatability study: robustness against point of view change

Scene                            Absolute axis position error (pixels)   Absolute width error (pixels)
Exterior scene                   1.5                                     1.69
White cable on textured flat     1.46                                    3
Grey cable on textured flat      1.46                                    1.61
White cable on woodmake table    2.04                                    1.92

Table 2. Repeatability study: robustness against natural noise

Scene                            Absolute axis position error (pixels)   Absolute width error (pixels)
Interior scene                   1.5                                     4.08
5.3 2D Detection Preliminary Results
We applied a preliminary version of the 2D algorithm to real images. In the following figures, the segments of curvilinear regions are depicted in color, and the axis is depicted in white. In the case of the picture of the retina (Fig. 8), depending on parameters tuned by the user, it is possible to select a few families of regions (using the width and the minimal length of the curvilinear region, for example). As presented in Fig. 9, it is also possible to use the algorithm for matching. The two original pictures are taken from nearly opposite points of view. However, some common parts are detected in both cases (branches, parts of leaves), and thus could be used for a stereo matching process.
Fig. 8. Blood vessel segmentation (retina)
Fig. 9. Example of use of detector for stereo analysis
6 Conclusion
In this paper, we proposed a model for curvilinear structure detection, useful for object recognition, matching or image retrieval. Our approach is based on an extension of Steger's model, introducing some texture-based attributes and some new constraints on the shape of the region to be detected. After the segmentation step, it is also possible to use the features of the model to classify the curvilinear regions found in the images. We proposed a first filter that detects the profile of curvilinear regions while taking textures into account. We validated the filter using real images. However, the full detector requires a long computing time. We therefore plan to implement a simplified version of the method, in which the cost function will be computed only for a sampled set of pixels, belonging for example to the edges.
References
[1] Tuytelaars, T., Van Gool, L.: Matching widely separated views based on affine invariant regions. International Journal of Computer Vision 59, 61–85 (2004)
[2] Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: British Machine Vision Conference (BMVC), pp. 384–393 (2002)
[3] Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape context. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 509–522 (2002)
[4] Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1349–1380 (2000)
[5] Kadir, T., Zisserman, A., Brady, M.: An affine invariant salient region detector. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3021, pp. 228–241. Springer, Heidelberg (2004)
[6] Harris, C., Stephens, M.: A combined corner and edge detector. In: Alvey Vision Conference, pp. 147–151 (1988)
[7] Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., Van Gool, L.: A comparison of affine region detectors. International Journal of Computer Vision 65, 43–72 (2005)
[8] Moreels, P., Perona, P.: Evaluation of features detectors and descriptors based on 3D objects. In: International Conference on Computer Vision, pp. 800–807 (2006)
[9] Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1615–1630 (2005)
[10] Baumgartner, A., Steger, C., Mayer, H., Eckstein, W.: MultiResolution semantic and context for road extraction. In: Verlag, S. (ed.) Semantic Modeling for the acquisition of topographic from images and maps, Basel - Switzerland, pp. 140–156 (1997)
[11] Fischler, M.A., Tenenbaum, J.M., Wolf, A.C.: Detection of roads and linear structures in low resolution aerial imagery using a multisource knowledge integration technique. Computer Graphics and Image Processing 15, 201–223 (1981)
[12] Geman, D., Jedynak, B.: An active testing model for tracking roads in satellite images. IEEE Transactions on Pattern Analysis and Machine Intelligence 18, 1–14 (1996)
[13] Walter, Klein, Massin, Zana: Automatic segmentation and registration of retinal fluorescing angiographies. In: CAFIA (2000)
[14] Géraud, T.: Segmentation of curvilinear objects using watershed-based curve adjacency graph. In: Perales, F.J., Campilho, A., Pérez, N., Sanfeliu, A. (eds.) IBPRIA 2003. LNCS, vol. 2652, pp. 279–286. Springer, Heidelberg (2003)
[15] Martinez-Perez, M.E., Hughes, A.D., Stanton, A.V., Thom, S.A., Bharath, A.T., Parker, K.H.: Segmentation of retinal blood vessels based on the second derivative and region growing. In: International Conference on Image Processing, Kobe (Japan) (1999)
[16] Deschènes, F., Ziou, D.: Detection of line junctions in gray-level images. In: International Conference on Pattern Recognition (ICPR) (2000)
[17] Ziou, D.: Optimal line detector. In: International Conference on Pattern Recognition (ICPR), pp. 3762 (2000)
[18] Steger, C.: An unbiased detector of curvilinear structures. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 113–125 (1998)
A Method for Interactive Shape Detection in Cattle Images Using Genetic Algorithms
Horacio M. González–Velasco, Carlos J. García–Orellana, Miguel Macías–Macías, Ramón Gallardo–Caballero, and Fernando J. Álvarez–Franco
CAPI Research Group, University of Extremadura. Politechnic School. Av. de la Universidad, s/n. 10071 Cáceres - Spain
[email protected]
http://capi.unex.es
Abstract. Segmentation methods based on deformable models have proved to be successful with difficult images, particularly those using genetic algorithms to minimize the energy function. Nevertheless, they are normally conceived as fully automatic, and do not always generate satisfactory results. In this work, a method to include the information of fixed points within a contour detection system using point distribution models and genetic algorithms is presented. Also, an interactive scheme is proposed to take advantage of this technique. The method has been tested against a database of 93 cattle images, with a significant improvement in the success rate of the detections, from 61% up to 95%.
1 Introduction
Segmentation is one of the most important tasks in digital image analysis [1], as a previous step to other tasks related to the concrete application of an artificial vision system: feature extraction for classification, area and volume measures, shape analysis, etc. Among all segmentation methods, those based on deformable models stand out. These methods have become very popular since their introduction in the eighties [2], and have been successfully applied to different segmentation tasks, particularly when dealing with very difficult images (noisy or with many structures and objects), as medical images are [3]. Deformable models normally involve the fitting of a contour to some structure or object within the image. This fitting is typically carried out through the minimization of a particular energy function, which is built considering concrete characteristics extracted from the image. Though the calculus of variations was used for that minimization at the beginning, several authors have proposed the use of genetic algorithms (GA) for this task, both with “traditional” deformable models [4,5] and with statistically based active shape models [6,7] or similar approaches. Here we can also cite our previous work [8], where we proposed a method similar to Hill's [7] for the extraction of cow contours in cattle images taken outdoors (see fig. 1).
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 694–701, 2007. © Springer-Verlag Berlin Heidelberg 2007
However, fully automatic methods do not always generate completely satisfactory results. Sometimes failures are obtained in the contour adjustment (even completely wrong fittings), and these failures cannot be detected without human intervention. For instance, in our application [8] we obtained many errors due to failures in the adjustment of the limbs, because our modelled contour only considers the limbs in the foreground, and the animal position is often such that the limbs in both the foreground and the background can be observed, with a very similar shape. This supports the need for interactive approaches in which an operator can “guide” the segmentation process, as described in [9,10], and more recently in [11], where the same technique as ours is used to model the shapes (point distribution models, PDM [12]). In this work, we describe the manner in which the information of fixed points can be used in the contour search using Hill's approach [7], which we applied in [8], so that this technique can be used in an interactive scheme. To find out the improvement obtained in the search with fixed points, we carried out successive experiments over our database of 93 cattle images, considering two, one and no fixed points. It can be observed that the results obtained with fixed points are notably better. In section 2, the method used to fit contours with GA is presented, describing the shape modelling, the objective function and the GA configuration. We then explain in section 3 the way in which the information of fixed points can be used to reduce our search space. The results obtained by applying the search with fixed points to our database are presented in section 4, whereas conclusions and possible improvements of our work are described, finally, in section 5.
2 Shape Detection Using GAs
The shape detection problem can be converted into an optimization problem considering two steps: the parametrization of the shape we want to search and the definition of a cost function that quantitatively determines whether or not a contour is adjusted to the object. For the contour modelling and parametrization the PDM technique [12] has been applied, which lets us restrict all the parameters and precisely define the search space. On the other hand, the objective function proposed is designed to place the contour over areas of the image where edges corresponding to the object are detected.
2.1 Shape Modelling and Search Space
Fig. 1. Figure (a) illustrates the images dealt with. Figure (b) shows the model of the cow contour in lateral position.

The first step of our method consists in appropriately representing the shape we want to search for, along with its possible variations. In some previous works [7,8,11] the general method known as PDM has been used, which consists of deformable models representing the contours by means of ordered groups of points (figure 1) located at specific positions of the object, and constructed statistically from a set of examples. As a result of the process, the average shape of our object is obtained, and each contour can be represented mathematically by a vector x such that

$x = x_m + P \cdot b$   (1)

where $x_m$ is the average shape, $P$ is the matrix of eigenvectors of the covariance matrix and $b$ is a vector containing the weights for each eigenvector, which properly defines the contour in our description. Fortunately, considering only a few eigenvectors corresponding to the largest eigenvalues of the covariance matrix (modes of variation, using the terminology in [12]), practically all the variations taking place in the training set can be described. This representation (model) of the contour is, however, in the normalized space. To project instances of this model into the space of our image, a transformation that preserves the shape is required (translation $t = (t_x, t_y)$, rotation $\theta$ and scaling $s$). Thus, every permitted contour in the image can be represented by a reduced set $\{s, \theta, t, b\}$ of parameters.

In order to have the search space precisely defined, the limits for these parameters are required. As stated in [12], to maintain a shape similar to those in the training set, the $b_i$ parameters must be restricted to values $-\sqrt{\lambda_i} \le b_i \le \sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues of the covariance matrix. The transformation parameters can be limited considering that we approximately know the position of the object within the image: normally the animal is in horizontal position (angle restriction), not arbitrarily small (restriction in scale) and more or less centered (restriction in t).

In our specific case (lateral position), the cow contour is described using a set of 126 points (figure 1), 35 of which are significant. The model was constructed using a training set of 45 photographs, previously selected from our database of 138 images, trying to cover the variations in position as much as possible. For those 45 photographs the contours of the animals were extracted manually, placing the points of figure 1 by hand. Once the calculations were made, 10 modes of variation were enough to represent virtually every possible variation (more than 90%).
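The decoding of a parameter set $\{s, \theta, t, b\}$ into image points can be sketched as follows. This is a minimal illustration under assumptions: shapes are stored as flat lists $(x_0, y_0, x_1, y_1, \dots)$, `modes` is a hypothetical list holding the retained eigenvectors (the columns of $P$), and the toy two-point model in the usage lines is invented for the example.

```python
import math


def clamp_weights(b, eigenvalues):
    """Restrict each weight b_i to [-sqrt(lambda_i), +sqrt(lambda_i)]."""
    return [
        max(-math.sqrt(lam), min(math.sqrt(lam), w))
        for w, lam in zip(b, eigenvalues)
    ]


def model_shape(x_mean, modes, b):
    """Compute x = x_mean + P b, with modes[k] the k-th eigenvector."""
    x = list(x_mean)
    for weight, mode in zip(b, modes):
        for i, component in enumerate(mode):
            x[i] += weight * component
    return x


def project_to_image(x, s, theta, tx, ty):
    """Apply scaling, rotation and translation to a flat shape vector."""
    ax, ay = s * math.cos(theta), s * math.sin(theta)
    points = []
    for j in range(0, len(x), 2):
        xj, yj = x[j], x[j + 1]
        points.append((tx + ax * xj - ay * yj, ty + ay * xj + ax * yj))
    return points


# Hypothetical toy model: two points, one mode moving the second point in y.
x_mean = [0.0, 0.0, 1.0, 0.0]
modes = [[0.0, 0.0, 0.0, 1.0]]
eigenvalues = [4.0]
b = clamp_weights([5.0], eigenvalues)   # 5.0 exceeds sqrt(4.0), so it is clamped
shape = model_shape(x_mean, modes, b)
contour = project_to_image(shape, s=2.0, theta=0.0, tx=10.0, ty=20.0)
```

Clamping the weights before building the shape is one simple way of keeping every decoded individual inside the allowed search space.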
2.2 Objective Function and GA Configuration
Once the shape is parameterized, the main step of this method is the definition of the objective function. In several consulted works [6,7] this function is built so that it reaches a minimum when strong edges of similar magnitude are located near the points of a certain instance of the model in the image. We use a similar approach, generating an intermediate grayscale image that acts as a potential image (inverted, so we try to maximize the “energy”). This potential image could be obtained using a conventional edge detector. However, to avoid many local maxima in the objective function, it is important that the edge map does not contain many more edges than those corresponding to the searched object. This is not always possible because of background objects. In our method, a system for edge selection based on a multilayer perceptron neural network is applied, as described in [8]. With this potential image, we propose the following objective function to be maximized by the GA:

$f_L(X) = \left( \sum_{j=0}^{n-2} K_j \right)^{-1} \cdot \sum_{j=0}^{n-2} \left( \sum_{(X,Y) \in r_j} D_g(X, Y) \right)$   (2)
where $X = (X_0, Y_0; X_1, Y_1; \dots; X_{n-1}, Y_{n-1})$ is the set of points (image coordinates, integer values) that form one instance of our model, and can be easily obtained from the parameters using equation 1 and a transformation; $r_j$ is the set of points (pixels of the image) that form the straight line joining $(X_j, Y_j)$ and $(X_{j+1}, Y_{j+1})$; $K_j$ is the number of points in that set; and $D_g(X, Y)$ is a Gaussian function of the distance from the point $(X, Y)$ to the nearest edge in the potential image. This function is defined to have a maximum value $I_{\max}$ for distance 0 and must be decreasing, so that a value less than or equal to $I_{\max}/100$ is obtained for a given reach distance $D_A$, in our case 20 pixels. Regarding the GA, two aspects must be configured: the general structure (coding, selection method, etc.) and the values for the parameters that determine the algorithm (population size, mutation rate, etc.). To obtain the most suitable configuration for our problem, a methodology also based on GA was used. As a result, the best performance was obtained with a population of 2000 individuals, real coding, binary tournament selection (using linear normalization to calculate fitness), two-point crossover with $P_c = 0.45$, mutation probability $P_m = 0.4$, and steady-state replacement with $S = 60\%$.
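The objective function of equation 2 can be sketched as follows. The sketch makes several assumptions that are not in the text: the distance to the nearest edge is supplied by a hypothetical callable `edge_distance(x, y)` instead of a precomputed potential image, segments are rasterized by simple linear interpolation, and the Gaussian width is chosen so that the value is exactly $I_{\max}/100$ at the reach distance.

```python
import math


def gaussian_potential(distance, i_max=255.0, reach=20.0):
    """D_g: equals i_max at distance 0 and i_max / 100 at `reach`."""
    sigma_sq = reach ** 2 / (2.0 * math.log(100.0))
    return i_max * math.exp(-distance ** 2 / (2.0 * sigma_sq))


def segment_pixels(p, q):
    """Integer pixels from p towards q; q itself is excluded so that
    vertices shared by consecutive segments are not counted twice."""
    (x0, y0), (x1, y1) = p, q
    steps = max(abs(x1 - x0), abs(y1 - y0))
    if steps == 0:
        return [(x0, y0)]
    return [
        (x0 + round(k * (x1 - x0) / steps), y0 + round(k * (y1 - y0) / steps))
        for k in range(steps)
    ]


def objective(points, edge_distance, i_max=255.0, reach=20.0):
    """Mean potential along the polyline joining consecutive contour points."""
    if len(points) < 2:
        raise ValueError("at least two contour points are required")
    total, count = 0.0, 0
    for p, q in zip(points, points[1:]):
        pixels = segment_pixels(p, q)
        count += len(pixels)                      # contributes K_j
        total += sum(
            gaussian_potential(edge_distance(x, y), i_max, reach)
            for x, y in pixels
        )
    return total / count


# Hypothetical scene: a single horizontal edge along the line y = 5.
def edge_distance(x, y):
    return abs(y - 5)


on_edge = objective([(0, 5), (10, 5), (20, 5)], edge_distance)
far_away = objective([(0, 25), (10, 25), (20, 25)], edge_distance)
```

A contour lying exactly on the edge reaches the maximum value, while a contour at the reach distance scores one hundredth of it, which gives the GA a smooth slope towards the edges.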
3
Fixing Points to Drive the Search: Method
In the method described above, we codify the parameters {s, θ, t, b} in the chromosome, and, in the process of decoding, we calculate the projection of the contour into the image, using those parameters. Let us denominate x = (x0 , y0 , x1 , y1 , . . . , xn−1 , yn−1 )T the vector with the points of the contour in the normalized space, and X = (X0 , Y0 , X1 , Y1 , . . . , Xn−1 , Yn−1 )T the vector with
698
H.M. Gonz´ alez–Velasco et al.
the points of the contour projected into the image. The vector x can be calculated using equation 1, with the parameters b. In order to obtain vector X, all the points must be transformed into the image space: Xj tx s cos θ −s sin θ xj tx ax −ay xj = + · = + · (3) Yj ty yj ty ay ax yj s sin θ s cos θ To include the information of fixed points, our proposal is not to codify some of the parameters into the chromosome, but calculate them using the equations described above. Therefore, we reduce the dimensionality of the search space, being easier for the algorithm to find a good solution. Fixing one point. Consider that we know the position of the point p, (Xp , Yp ). Then we can leave the parameters (tx , ty ), the translation, without codifying, and calculate them using equations 1 and 3. With equation 1 we calculate the vector x, and applying equation 3 we obtain
Xp = tx + ax xp − ay yp tx = Xp − ax xp + ay yp and then (4) Yp = ty + ay xp + ax yp ty = Yp − ay xp − ax yp Fixing two points. Consider now that we know the position of the points p and q, (Xp , Yp ) and (Xq , Yq ). In this case, we can leave two more parameters without codifying: s and θ, appart from (tx , ty ). With equation 1 we calculate again the vector x, and applying the equation 3 we obtain ⎧ Xp = tx + ax xp − ay yp ⎪ ⎪ ⎨ Yp = ty + ay xp + ax yp (5) ⎪ Xq = tx + ax xq − ay yq ⎪ ⎩ Yq = ty + ay xq + ax yq This is a system of linear equations where tx , ty , ax and ay are the unknowns, and its resolution is quite straightforward. Fixing more than two points. If we know the position of more than two points, we have to leave some of the shape parameters (components of the vector b) without coding, thus restricting the allowable shapes that can satisfy the fixed points. Mathematically, this problem is more difficult, because applying equations 1 and 3 when m > 2 points are fixed, we obtain a non-linear system of 2m equations with 2m unknowns. It must be notice that this system has to be solved in the decoding process of the chromosome, before each calculation of the objective function. For this reason, the method used to solve the system should not take very much time, or the method would be useless. Due to its complexity, we have not considered this situation in this work, leaving it as a future improvement of the system. Interactive scheme. We propose to use an approach similar to [11]. First we start running the GA search without fixed points. Then the user decides if the
result is acceptable or not. If not, the user decides whether to fix two, one or zero points depending on the previous result, fixes the points and runs the algorithm again. This process can be repeated until an acceptable fit is reached.
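The one-point and two-point cases (equations 4 and 5) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the linear system (5) is solved here by treating each point as a complex number, so that equation 3 reads X = t + a·x with a = a_x + i·a_y. That reformulation is an assumption made for brevity; it is algebraically equivalent to equation 3.

```python
import cmath


def apply_similarity(point, tx, ty, ax, ay):
    """Equation 3: map a model point (x, y) to image coordinates (X, Y)."""
    x, y = point
    return (tx + ax * x - ay * y, ty + ay * x + ax * y)


def translation_from_one_point(model_p, image_p, s, theta):
    """Equation 4: s and theta stay in the chromosome; (tx, ty) follow
    from the single fixed point p."""
    a = cmath.rect(s, theta)  # a = ax + i*ay = s*exp(i*theta)
    t = complex(*image_p) - a * complex(*model_p)
    return (t.real, t.imag)


def similarity_from_two_points(model_p, model_q, image_p, image_q):
    """Equation 5: recover (tx, ty, ax, ay) from two fixed points.

    Subtracting the equations for p from those for q removes the
    translation, leaving Xq - Xp = a * (xq - xp) in complex form.
    """
    xp, xq = complex(*model_p), complex(*model_q)
    Xp, Xq = complex(*image_p), complex(*image_q)
    if xp == xq:
        raise ValueError("the two model points must be distinct")
    a = (Xq - Xp) / (xq - xp)
    t = Xp - a * xp
    return (t.real, t.imag, a.real, a.imag)
```

In a decoding step, the values returned by `similarity_from_two_points` would replace the four pose genes, so that only the shape parameters b remain in the chromosome.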
4 Experiments and Results
We have tested our method with the 93 images (640×480 pixels) from our database that were not used in the construction of the shape model. Our primary goal was to test the performance of the system with and without fixed points and to compare results. In addition, we were interested in determining whether we could solve the problems that we had with the limbs in our application, described in the introduction and in [8] (see fig. 2,b). In order to evaluate a resulting contour quantitatively, the mean distance (in pixels) between its points and their proper position has been proposed as a figure of merit. In our case, the distance was estimated using only seven very significant points of the contour. The points used were {20, 25, 40, 50, 78, 99, 114}, and are represented in figure 1. Those points were precisely located for all the images in the database. To test the performance of the method we considered the significant points referred to above and ran the GA search, without fixed points and with all the possible combinations of one and two fixed points (1 + 7 + 21 = 29 cases), over the 93 images. The results obtained are presented in table 1 and in fig. 3. In table 1 the best and worst results for zero, one and two points fixed are presented. In order to define “success”, we applied a hard limit of ten pixels to the distance, which is sufficient for the morphological assessment application in which these contours are to be used [8]. More information can be found in the graph, where the percentage of successes vs. the distance considered to define success is presented. Looking at the best cases, we can see that fixing points is clearly helpful, because the results fixing the point 78 and the points 78 and 114 are
Fig. 2. Examples of the GA search without fixed points, along with the final mean distance (in pixels) to the correct position: (a) distance 4.6; (b) distance 16.6
Fig. 3. Percentage of successes versus the distance used to define success, for different sets of fixed points. [Plot: horizontal axis, distance (0 to 30); vertical axis, % of successes (0 to 100); curves: None; 78; 78, 114; 40; 50, 99]

Table 1. Number of successes and mean distance obtained by the method when it is applied to our database of 93 images

Fixed points   Successes      Mean distance (pixels)
None           57 (61.3 %)    10.03
78             80 (86.0 %)    6.52
40             52 (55.9 %)    10.49
78, 114        89 (95.7 %)    5.42
50, 99         54 (58.1 %)    10.33
significantly better than the ones obtained in the None case. However, the selection of the points to fix is very important, because the performance in the worst cases is very similar to the original (without fixed points). We can also observe that the best cases are obtained fixing the points corresponding to the limbs (78 and 114, see fig. 1), indicating that the main problem of the automatic approach was the location of those limbs, and that this problem can be solved almost completely with this technique. Finally, we have to remark that the results presented have been reached without human intervention, considering the same fixed points for all the images in each execution. However, human interaction can be decisive in selecting suitable points for each image, thus increasing the performance.
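The evaluation protocol of this section can be summarised in a short sketch: enumerate the 29 fixed-point configurations, compute the mean landmark distance for one contour, and turn a list of distances into a success percentage. The function names are hypothetical, and treating a distance exactly equal to the limit as a success is an assumption, since the text only speaks of a "hard limit of ten pixels".

```python
from itertools import combinations
from math import hypot

# The seven significant contour points used for evaluation in the text.
LANDMARKS = (20, 25, 40, 50, 78, 99, 114)


def fixed_point_cases(landmarks=LANDMARKS, max_fixed=2):
    """All configurations with zero, one or two fixed points."""
    cases = []
    for k in range(max_fixed + 1):
        cases.extend(combinations(landmarks, k))
    return cases


def mean_landmark_distance(found, truth):
    """Mean Euclidean distance (pixels) between corresponding landmarks."""
    pairs = list(zip(found, truth))
    return sum(hypot(fx - gx, fy - gy) for (fx, fy), (gx, gy) in pairs) / len(pairs)


def success_rate(distances, limit=10.0):
    """Percentage of images whose mean distance is within the limit.

    Assumption: a distance exactly equal to the limit counts as a success.
    """
    return 100.0 * sum(d <= limit for d in distances) / len(distances)
```

Sweeping `limit` from 0 to 30 for each configuration would give curves of the kind plotted in fig. 3.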
5 Conclusions and Future Work
In this work, we have shown how to use the information of up to two fixed points in the detection of modelled shapes within images, using genetic algorithms. The method has been tested against a database of 138 cattle images, 45 of which were
used to model the shape, and 93 to test the performance. As a result, more than 95 % of the detections have been successful in the best of the cases with two points fixed. In order to improve the method, two lines are being considered at this moment. First, we are approaching the problem of fixing more than two points, trying to approximately solve the resulting non-linear system with an affordable computational cost. On the other hand, we are trying to improve the objective function, through the generation of better potential images.
Acknowledgements This work has been supported in part by the Junta de Extremadura through project PRI06A227.
References
1. Jain, A.K.: Fundamentals of digital image processing. Prentice Hall, Englewood Cliffs (1989)
2. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: active contour models. International Journal of Computer Vision 1, 321–331 (1988)
3. McInerney, T., Terzopoulos, D.: Deformable models in medical image analysis. Medical Image Analysis 1, 91–108 (1996)
4. Ballerini, L.: Genetic snakes for color image segmentation. In: Boers, E.J.W., Gottlieb, J., Lanzi, P.L., Smith, R.E., Cagnoni, S., Hart, E., Raidl, G.R., Tijink, H. (eds.) EvoIASP 2001, EvoWorkshops 2001, EvoFlight 2001, EvoSTIM 2001, EvoCOP 2001, and EvoLearn 2001. LNCS, vol. 2037, pp. 268–277. Springer, Heidelberg (2001)
5. MacEachern, L., Manku, T.: Genetic algorithms for active contour optimization. In: IEEE Proc. of the Int. Symp. on Circuit and Systems 4, 229–232 (1998)
6. Hill, A., Taylor, C.J.: Model-based image interpretation using genetic algorithms. Image and Vision Computing 10, 295–300 (1992)
7. Hill, A., Cootes, T.F., Taylor, C.J., Lindley, K.: Medical image interpretation: a generic approach using def. templates. Jour. of Med. Informatics 19, 47–59 (1994)
8. González, H., García, C., Macías, M., Gallardo, R., Acevedo, M.: Application of repeated GA to deformable template matching in cattle images. In: Perales, F.J., Draper, B.A. (eds.) AMDO 2004. LNCS, vol. 3179, Springer, Heidelberg (2004)
9. Barret, W., Mortensen, E.: Interactive live-wire boundary extraction. Medical Image Analysis 1, 331–341 (1997)
10. Liang, J., McInerney, T., Terzopoulos, D.: Interactive medical image segmentation with united snakes. In: Taylor, C., Colchester, A. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI’99. LNCS, vol. 1679, pp. 116–127. Springer, Heidelberg (1999)
11. Ginneken, B.v., Bruijne, M.d., Loog, M., Viergever, M.: Interactive Shape Models. In: Proceedings of SPIE. vol. 5032, pp. 1206–1216 (2003)
12. Cootes, T.F., Taylor, C.J., Cooper, D.H., Graham, J.: Active shape models – their training and application. Comp. Vision and Image Understanding 61, 38–59 (1995)
A New Phase Field Model of a ‘Gas of Circles’ for Tree Crown Extraction from Aerial Images
Peter Horváth1,2 and Ian H. Jermyn2
1 University of Szeged, Institute of Informatics, P.O. Box 652, H-6701 Szeged, Hungary, Fax: +36 62 546 397
2 Ariana (INRIA/I3S), INRIA, B.P. 93, 06902 Sophia Antipolis, France, Fax: +33 4 92 38 76 43
Abstract. We describe a model for tree crown extraction from aerial images, a problem of great practical importance for the forestry industry. The novelty lies in the prior model of the region occupied by tree crowns in the image, which is a phase field version of the higher-order active contour inflection point ‘gas of circles’ model. The model combines the strengths of the inflection point model with those of the phase field framework: it removes the ‘phantom circles’ produced by the original ‘gas of circles’ model, while executing two orders of magnitude faster than the contour-based inflection point model. The model has many other areas of application, e.g., to imagery in nanotechnology, biology, and physics.
1 Introduction

Due to the high cost of field studies, forestry services are increasingly turning to image processing to gather the information they need. An important problem in this context is the extraction of the region in the domain of any given image corresponding to tree crowns. Using this information, forestry services can compute or estimate, for example, the mean diameter of the crowns, the biomass, and so forth. If tree crown extraction from aerial images could be performed automatically, then, in addition, expensive semiautomatic and manual image processing could be avoided. In [1], we addressed this problem using ‘higher-order active contours’ (HOACs) [2], a new generation of active contours [3] allowing the incorporation of non-trivial prior knowledge about region geometry. Unlike most methods for incorporating prior geometric knowledge into active contours [4,5,6], HOACs do not necessarily constrain region topology, thereby allowing the detection of multiple instances of a single entity at no extra cost, a critical requirement for the current application. To extract tree crowns, the HOAC model was analysed theoretically to find parameter values favouring regions composed of a number of approximate circles of approximately a given radius, by making such regions minima of the energy. The result was the ‘gas of circles’ model [1]. When combined with a likelihood energy, the model works
This work was partially supported by EU project MUSCLE (FP6-507752), Egide PAI Balaton, OTKA T-046805, and a HAS Janos Bolyai Research Fellowship. We thank the French National Forest Inventory (IFN) for the data.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 702–709, 2007. © Springer-Verlag Berlin Heidelberg 2007
well, but suffers from some drawbacks: computation time is long, and the model can create ‘phantom circles’ in homogeneous areas of the image. In [7], the parameters were further fixed so that circles are energy inflection points: they are not in themselves stable, but even a small amount of appropriate image information can render them stable. This removed the second drawback but not the first. Rochery et al. [8] introduced an alternative formulation of HOACs, based on the ‘phase field’ framework used in physics. The basic phase field model is, to a good approximation, equivalent to a classical active contour, with energy given by boundary length. By adding a nonlocal term, phase field models equivalent to HOACs can be created. Phase fields have several advantages over other region modelling frameworks: a neutral initialization, obviating the need to choose an initial region; gradient descent based solely on the PDE arising from the model energy, thereby simplifying implementation, especially for HOACs; and enhanced topological freedom. The purpose of the present paper is to construct the phase field version of the model of [7], thereby benefitting from all the advantages of the phase field framework, and solving both drawbacks of the model in [1]. In particular, we will see a gain of two orders of magnitude in execution time compared to the contour formulation. In section 2, we give an overview of the original HOAC ‘gas of circles’ model. In section 3, we describe the phase field version. In section 3.2, we describe the phase field version of the HOAC inflection point ‘gas of circles’ model. In section 4, we apply the model to the tree crown extraction problem, and we conclude in section 5.
2 Higher-Order Active Contours and the ‘Gas of Circles’ Model

HOACs can incorporate not only the local differential-geometric information included in classical active contours, but also more complex knowledge carried by long-range dependencies between tuples of contour points. These dependencies are expressed by multiple integrals over the contour. Combined with the classical region area and boundary length terms, one of the forms of Euclidean invariant quadratic HOAC energy is

\[
E_{C,G}(R) = \lambda_C L(\partial R) + \alpha_C A(\partial R) - \frac{\beta_C}{2} \iint dp\, dp'\; \dot\gamma(p) \cdot \dot\gamma(p')\, \Psi(r(p, p'), d) , \qquad (2.1)
\]

where $\partial R$ is the boundary of region R; $\gamma$ is an embedding defining $\partial R$, and $\dot\gamma$ is its derivative; p is a coordinate on the domain of $\gamma$; L is the region boundary length functional; A is the region area functional; $r(p, p') = |\mathbf{r}(p, p')|$, where $\mathbf{r}(p, p') = \gamma(p) - \gamma(p')$; and $\Psi$ is an interaction function, described in [1], parameterized by $d \in \mathbb{R}^+$, that determines the geometric content of the model.

2.1 The HOAC ‘Gas of Circles’ Model

For certain parameter ranges, circles, with radius $r_0$ depending on the parameters, are stable configurations of $E_{C,G}$. These parameter ranges can be found by expanding $E_{C,G}$ to second order in a functional Taylor series about a circle [1]. The series is most simply expressed in terms of the Fourier components of the perturbation: the first order term $E_1$ is zero for non-zero frequency $k \neq 0$, while the second order term $E_2$ is diagonalized by
the Fourier basis. Thus we require $E_1(r_0) = 0$ (energy extremum) and $E_2(k, r_0) > 0$ for all k (energy minimum). These two quantities are also functions of the model parameters $\lambda_C$, $\alpha_C$, $\beta_C$, and d. The $E_1$ constraint determines $\beta_C$ in terms of the other parameters, while the $E_2$ constraint restricts the range of $\alpha_C$ for given $\lambda_C$ and d.

2.2 The HOAC Inflection Point ‘Gas of Circles’ Model

The above model has a disadvantage when minimized using gradient descent. Circles of the stable radius, once created, cannot disappear again because they are local minima, even if they have no support from the data. This problem was solved in [7], by further constraining the parameters so that the energy function has an inflection point at $r_0$ with respect to changes of radius (i.e. $E_2(0, r_0) = 0$), rather than a minimum. In this case, the effect of the data can easily create an energy minimum, but otherwise the circle will vanish. This second constraint fixes both $\alpha_C$ and $\beta_C$ in terms of $r_0$, d, and $\lambda_C$. Since $r_0$ is fixed by the application, the only remaining parametric degrees of freedom are the value of d, and the overall strength of the prior term, represented by $\lambda_C$. We also require $\alpha_C$ and $\beta_C$ to be positive. When combined with the other constraints [7], this means that there is only a small range of permissible values of $\hat{r} = r_0/d$, to wit $0.6897 \le \hat{r} \le 0.7827$. This effectively enables d to be fixed, once $r_0$ is known. Thus despite the initial complexity of the model, we are able to fix all except one parameter, $\lambda_C$, representing the overall strength of the prior term.
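To make the last point concrete, the permissible range of the scaled radius translates directly into a range for d once the crown radius is known. The sketch below only rearranges the relation between the scaled radius, r0 and d given above; the function names are hypothetical.

```python
# Permissible range of r_hat = r0 / d quoted in the text.
R_HAT_MIN, R_HAT_MAX = 0.6897, 0.7827


def interaction_range_bounds(r0):
    """Smallest and largest d compatible with the inflection point model."""
    return r0 / R_HAT_MAX, r0 / R_HAT_MIN


def interaction_range(r0, r_hat):
    """The value of d implied by a chosen r_hat inside the permissible range."""
    if not (R_HAT_MIN <= r_hat <= R_HAT_MAX):
        raise ValueError("r_hat must lie in [0.6897, 0.7827]")
    return r0 / r_hat
```

For example, `interaction_range(2.5, 0.7)` gives the d implied by a crown radius of 2.5 pixels and a scaled radius of 0.7, the kind of values that appear in the figure captions of section 4.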
3 Phase Fields and the ‘Gas of Circles’ Model

A phase field $\phi$ is a real-valued function on the image domain $\Omega$. Given a threshold z, a phase field determines a region by the map $\zeta_z(\phi) = \{x \in \Omega : \phi(x) > z\}$. Thus phase fields are a level set representation. The difference from the usual distance function level set representation is that the functions are not constrained: the set of possible $\phi$, $\Phi$, is a linear space. The simplest phase field energy is

\[
E_0(\phi) = \int_\Omega dx \left\{ \frac{D}{2}\, \partial\phi \cdot \partial\phi + \lambda \left( \frac{1}{4}\phi^4 - \frac{1}{2}\phi^2 \right) + \alpha \left( \phi - \frac{1}{3}\phi^3 \right) \right\} .
\]

With $D = 0$, $\phi_R \triangleq \arg\min_{\phi :\, \zeta_z(\phi) = R} E_0(\phi)$, i.e. the minimizing phase field for a given fixed region, would take the value 1 inside R and −1 outside. The effect of $D \neq 0$ is to smooth $\phi_R$ so that it has an interface of finite width around $\partial R$. The phase field model is approximately¹ equivalent to a classical active contour in the sense that

\[
E_0(\phi_R) \approx \lambda_C L(\partial R) + \alpha_C A(R) \triangleq E_{C,0} . \qquad (3.1)
\]

Equation (3.1) means that gradient descent on $\phi$ using $E_0$ will duplicate gradient descent on $\gamma$ using $E_{C,0}$. The contour parameters are given by

\[
\alpha_C = 4\alpha/3 , \quad \lambda_C^2 = 16 D \lambda K / 15 , \quad K = 1 + 5(\alpha/\lambda)^2 , \quad w^2 = 15 D / (\lambda K) . \qquad (3.2)
\]

¹ For a detailed discussion of why the approximation error does not matter, see [8].
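The claim that, with D = 0, the minimizing field takes the values ±1 can be checked directly from the potential part of the energy. The short derivation below is a sketch added for illustration; the symbol V and the condition λ > |α| are not stated in the text but follow from elementary calculus on the integrand. It also shows where the factor 4/3 in the first relation of (3.2) comes from.

```latex
\begin{align*}
V(\phi) &= \lambda\left(\tfrac{1}{4}\phi^4 - \tfrac{1}{2}\phi^2\right)
          + \alpha\left(\phi - \tfrac{1}{3}\phi^3\right) \\
V'(\phi) &= \lambda(\phi^3 - \phi) + \alpha(1 - \phi^2)
          = (\phi^2 - 1)(\lambda\phi - \alpha) \\
V''(\phi) &= \lambda(3\phi^2 - 1) - 2\alpha\phi ,
  \qquad V''(\pm 1) = 2(\lambda \mp \alpha) \\
V(1) - V(-1) &= \left(-\tfrac{\lambda}{4} + \tfrac{2\alpha}{3}\right)
              - \left(-\tfrac{\lambda}{4} - \tfrac{2\alpha}{3}\right)
              = \tfrac{4}{3}\alpha
\end{align*}
```

So the stationary points are φ = ±1 and φ = α/λ, and both φ = 1 and φ = −1 are local minima provided λ > |α|. The energy difference per unit area between the interior value and the exterior value is 4α/3, which matches the area coefficient α_C in (3.2).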
Rochery et al. [8] added the following nonlocal term to $E_0$:

\[
E_{NL}(\phi) = -\frac{\beta}{2} \iint_{\Omega^2} dx\, dx'\; \partial\phi(x) \cdot \partial\phi(x')\, G(x - x') , \qquad (3.3)
\]
where $G(x - x') = \Psi(|x - x'|/d)$. With $\beta_C = 4\beta$, $E_g = E_0 + E_{NL}$ is equivalent (as usual, to a good approximation) to the HOAC model $E_{C,G}$, and can be used in its place, thus allowing the incorporation of non-trivial prior knowledge about region geometry while still profiting from all the advantages of the phase field framework.

3.1 The Phase Field ‘Gas of Circles’ Model

Using the relations between the phase field and contour parameters, we can now create a phase field ‘gas of circles’ model equivalent to the HOAC model described in section 2.1 (units are chosen so that d = 1):

1. Choose w. It cannot be too small, or a subpixel discretization will be needed, and it cannot be too large or the phase field model will not be a good approximation to the HOAC model [8]. We have found that w = 3 or w = 4 work well.
2. Choose $\tilde\alpha_C \le \sqrt{5}/(2w)$. This constraint arises from inverting equations (3.2).
3. Determine the $\tilde\beta_C$ parameter corresponding to $r_0$ and $\tilde\alpha_C$ using the method in [1].
4. Set $\tilde\lambda = \frac{15}{8w}\left[1 + \sqrt{1 - 4\tilde\alpha_C^2 w^2/5}\right]$, $\tilde\alpha = 3\tilde\alpha_C/4$, $\tilde\beta = \tilde\beta_C/4$, and $\tilde{D} = w/4$.
5. Choose $\lambda_C$ and multiply $\tilde{D}$, $\tilde\lambda$, $\tilde\alpha$, and $\tilde\beta$ by it to get D, $\lambda$, $\alpha$, and $\beta$.

3.2 The Phase Field Inflection Point ‘Gas of Circles’ Model

We now want to combine the phase field ‘gas of circles’ model with the constraint that the circle energy have an inflection point rather than a minimum at the desired radius. This is a non-trivial requirement. The relations between the phase field and contour parameters were derived using an approximate ansatz for $\phi_R$ [8]. For the ‘gas of circles’ model, the approximations are not expected to be important, since small errors in the parameters will produce small changes in behaviour. However, an inflection point represents a set of measure zero in the parameter space. It is important to see whether these approximate parameter relations preserve the inflection point behaviour.

In the previous section, we chose $\tilde\alpha_C$ according to the constraint $\tilde\alpha_C \le \sqrt{5}/(2w)$. In section 2, we described how a given value of $\hat{r}$ fixes $\tilde\alpha_C$ and $\tilde\beta_C$. To satisfy both these constraints we need to choose $w \le \sqrt{5}/(2\tilde\alpha_C)$. To determine the parameters of the new phase field inflection point ‘gas of circles’ model, we therefore take the following steps:

1. Choose an $\hat{r}$ value satisfying $0.6897 \le \hat{r} \le 0.7827$. This fixes $\tilde\alpha_C$ and $\tilde\beta_C$.
2. Choose w using the above criterion.
3. Determine $\tilde\lambda$, $\tilde\alpha$, $\tilde\beta$ and $\tilde{D}$ as before.
4. Multiply these parameters by $\lambda_C$.
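The conversion in steps 4 and 5 of section 3.1 (reused in step 3 above) can be sketched as follows, together with the forward relations (3.2) as a consistency check. This is an illustrative sketch: the function names are hypothetical, and the unit-scale contour parameters passed in are assumed to have been obtained beforehand by the methods of [1] and [7], which are not reproduced here.

```python
from math import sqrt


def phase_field_parameters(alpha_c_unit, beta_c_unit, w, lambda_c):
    """Steps 4 and 5: phase field parameters from contour parameters.

    alpha_c_unit and beta_c_unit are the contour parameters in units where
    lambda_C = 1 and d = 1 (assumed given; see [1] and [7]).
    """
    if alpha_c_unit > sqrt(5.0) / (2.0 * w):
        raise ValueError("alpha_c_unit must not exceed sqrt(5)/(2w)")
    disc = 1.0 - 4.0 * alpha_c_unit ** 2 * w ** 2 / 5.0
    lam = 15.0 / (8.0 * w) * (1.0 + sqrt(max(disc, 0.0)))
    unit = {"D": w / 4.0, "lam": lam,
            "alpha": 3.0 * alpha_c_unit / 4.0, "beta": beta_c_unit / 4.0}
    # Step 5: scale everything by the overall prior strength lambda_C.
    return {name: lambda_c * value for name, value in unit.items()}


def contour_parameters(D, lam, alpha, beta):
    """Forward relations (3.2), plus beta_C = 4 * beta."""
    K = 1.0 + 5.0 * (alpha / lam) ** 2
    return {"alpha_C": 4.0 * alpha / 3.0,
            "lambda_C": sqrt(16.0 * D * lam * K / 15.0),
            "beta_C": 4.0 * beta,
            "w": sqrt(15.0 * D / (lam * K))}
```

Feeding the output of `phase_field_parameters` into `contour_parameters` should return the interface width w, the chosen strength, and the unit-scale contour parameters multiplied by that strength, which is a quick way to check the formulas in step 4 against (3.2).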
To test that the parameter transformations work in practice, we fixed a set of contour parameters corresponding to an inflection point at r0 = 10, and then translated these parameters to give an equivalent phase field model. We then performed three gradient descent experiments using Eg . One used the value of β corresponding to the inflection
Fig. 1. Preservation of the inflection point when translating from contour to phase field: gradient descent evolutions using Eg, starting from a circle of radius r0 = 10, for values of β close to β*, the value giving an inflection point at r0 = 10. Row 1: evolution using β = 0.96β*; row 2: evolution using β = 1.04β*; row 3: evolution using β = β*.
point, β ∗ , while the other two used β values 4% above and below β ∗ . The region was initialized to a circle of radius r0 . Figure 1 shows the results. In the first row, with β < β ∗ , the region shrinks and disappears. In the second row, with β > β ∗ , the region grows until it reaches the corresponding energy minimum, at a radius of 12 pixels. In the third row, with β = β ∗ , the circle grows only very slightly, to a radius of 10.5 pixels. The inflection point behaviour is therefore preserved to a very good approximation.
4 Tree Crown Extraction

The tree crown extraction problem has been studied in several papers. Gougeon [9] uses a valley following method to delineate the crowns, while Larsen [10] introduces a species-specific method based on the matching of 3D tree templates. Both these, and other similar methods, look for local maxima of certain features. Perrin et al. [11] use a global method, modelling tree crown configurations as a marked point process. This has the advantage over our method that overlapping trees are easily handled, but the disadvantage that trees are represented by ellipses: their outlines are not found. Our model for tree crown extraction is the sum of two terms: a prior energy $E_g$, as described above, and a likelihood energy, $E_i$, which we will now describe. We will model the image I, both in R and in the background $\bar{R}$, using Gaussian distributions.² We add a term that predicts high gradients along the boundary $\partial R$:

\[
E_i(I, R) = \int_\Omega dx \left\{ \lambda_i\, \partial I \cdot \partial\phi + \alpha_i \left[ \frac{(I - \mu)^2}{2\sigma^2}\, \phi_+ + \frac{(I - \bar\mu)^2}{2\bar\sigma^2}\, \phi_- \right] \right\} ,
\]

where $\phi_\pm = (1 \pm \phi)/2$. Note that to facilitate comparison of parameters in the prior energy, we set $\lambda = 1$ in $E_g$ and introduce a weight $\alpha_i$ in $E_i$.

² We ignore the normalization constant $Z(R) = \int DI\, e^{-E_i(I, R)}$ since in our case it merely changes $\lambda$ and $\alpha$, and we are interested in stability of the posterior in the absence of image-dependent terms.
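A direct discretization of the likelihood energy may help to fix ideas. The sketch below sums the integrand over pixels of a small image stored as nested lists. The use of forward differences with a replicated border for the gradients, and all function names, are assumptions made for illustration; the text does not specify the discretization.

```python
def forward_gradient(field, y, x):
    """Forward differences; the last row/column repeats the border value
    (an assumed boundary treatment)."""
    h, w = len(field), len(field[0])
    gx = field[y][min(x + 1, w - 1)] - field[y][x]
    gy = field[min(y + 1, h - 1)][x] - field[y][x]
    return gx, gy


def likelihood_energy(image, phi, mu, sigma, mu_bar, sigma_bar,
                      lambda_i, alpha_i):
    """Discrete sum of the integrand of E_i over all pixels."""
    total = 0.0
    for y in range(len(image)):
        for x in range(len(image[0])):
            ix, iy = forward_gradient(image, y, x)
            px, py = forward_gradient(phi, y, x)
            phi_plus = (1.0 + phi[y][x]) / 2.0    # weight of the interior
            phi_minus = (1.0 - phi[y][x]) / 2.0   # weight of the background
            inside = (image[y][x] - mu) ** 2 / (2.0 * sigma ** 2)
            outside = (image[y][x] - mu_bar) ** 2 / (2.0 * sigma_bar ** 2)
            total += lambda_i * (ix * px + iy * py)
            total += alpha_i * (inside * phi_plus + outside * phi_minus)
    return total
```

With a constant image equal to the interior mean, the energy is zero for a field that is 1 everywhere and positive for a field that is −1 everywhere, as expected from the Gaussian terms.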
Fig. 2. (a) An image of poplars © IFN (0.6, 0.06, 0.33, 0.19, 2.5); (b) the result with the HOAC ‘gas of circles’ model with inflection point (1160, 0.7); (c) result with the phase field ‘gas of circles’ model (1111, 278, 177.2, 55.8, 30, 4); (d) result with the new model (1160, 0.7, 4)
4.1 Energy Minimization

We minimize $E = E_g + E_i$ using gradient descent. The functional derivative of $E_{NL}$ is

\[
\frac{\delta E_{NL}}{\delta \phi}(x) = \beta \int_\Omega dx'\; \partial^2 G(x - x')\, \phi(x') . \qquad (4.1)
\]

The nonlocal force arising from the HOAC term makes gradient descent using $E_{C,G}$ complex to implement. One must extract the contour, perform numerical contour integrations, and extend the force off-contour [2]. In contrast, the force (4.1) arising from the nonlocal phase field term is simply a convolution. Implementation is thereby made much simpler, and execution much faster. The time for one iteration in the HOAC formulation scales as the square of the boundary length, which scales as the number of trees, which in turn scales as the number of pixels. Thus execution time for the HOAC formulation can be expected to scale as the number of pixels squared. In contrast, execution time for the phase field formulation scales linearly with the number of pixels.

4.2 Experimental Results

Our data are ∼0.5 m resolution colour infrared aerial images of the ‘Saône et Loire’ region in France provided by the French National Forest Inventory. The images show regularly and irregularly planted poplar forests. We compared the new phase field inflection point ‘gas of circles’ model (section 3.2) with the HOAC inflection point ‘gas of circles’ model (section 2.2) and with the phase field ‘gas of circles’ model with a minimum rather than an inflection point (section 3.1). The parameters $\mu$, $\sigma$, $\bar\mu$, and $\bar\sigma$ are learned from examples using maximum likelihood, and then fixed.³

³ In the figure captions, the image parameters are ($\mu$, $\sigma$, $\bar\mu$, $\bar\sigma$, $r_0$); in the HOAC ‘gas of circles’ case, the prior parameters are ($\alpha_i$, $\hat{r}$), in the phase field ‘gas of circles’ case they are ($\alpha_i$, D, $\lambda$, $\alpha$, $\beta$, w); while in the case of the new model they are ($\alpha_i$, $\hat{r}$, w).

Figure 2(a) shows an image (200 × 200) of a regularly planted poplar forest. In the top right and bottom left there are fields, while in the middle there are two different sizes of poplars. The aim is to extract the larger trees. Figure 2(b) shows the result with the HOAC inflection point ‘gas of circles’ model. The result is good, but the method found two small trees, and there is another false positive in the bottom left of the image. The execution time was 89 minutes. Figure 2(c) shows the result using the phase field ‘gas of circles’ model. There were no false negatives, but the model found false positives in the fields. Figure 2(d) shows the result with the new phase field inflection point ‘gas of circles’ model. All but two somewhat smaller circles were successfully found. For the phase field models, the execution time was less than 3 minutes.

Figure 3(a) shows an image (133 × 271) of a planted forest. Figure 3(b) shows the result with the phase field ‘gas of circles’ model. The result is good, with only a few joined tree crowns. Figure 3(c) shows the result with the new model. There are fewer joined tree crowns. Both results were obtained in less than 2 minutes.

Figure 4(a) shows a difficult image (129 × 139) to analyse. It has two fields with different intensities on the right. The result with the phase field ‘gas of circles’ model is shown in figure 4(b). This result clearly demonstrates the disadvantage of the non-inflection point model: phantom objects are created in the homogeneous areas. Figure 4(c) shows the result with the new model. The result is very good, with only one false positive. Both results were obtained in less than 1 minute.

Fig. 3. (a) An image of regularly planted poplar stands with less regularly planted trees at the upper part © IFN (0.8, 0.06, 0.43, 0.2, 3.5); (b) result with the phase field ‘gas of circles’ model (500, 50, 34.2, 9.3, 5.2); (c) result with the new model (500, 0.74, 4)

Fig. 4. (a) An image of regularly planted poplars with different fields in the right part © IFN (0.8, 0.06, 0.43, 0.2, 3.5); (b) result with the phase field ‘gas of circles’ model (500, 50, 34.2, 9.3, 5.2); (c) result with the new model (500, 0.74, 4)
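The remark in section 4.1 that the nonlocal force is simply a convolution can be illustrated with a small sketch. It uses the identity that convolving the field with the Laplacian of G equals taking the Laplacian of the field convolved with G. Several things here are assumptions for illustration only: the kernel passed in is a stand-in (the actual interaction function Ψ is defined in [1] and is not reproduced), boundaries are treated as periodic, the Laplacian is the five-point stencil, and the function names are hypothetical. A practical implementation would more likely use an FFT than explicit loops.

```python
def convolve_periodic(field, kernel):
    """Discrete 2D convolution with periodic boundaries (assumed)."""
    h, w = len(field), len(field[0])
    kh, kw = len(kernel), len(kernel[0])
    cy, cx = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    acc += kernel[j][i] * field[(y - (j - cy)) % h][(x - (i - cx)) % w]
            out[y][x] = acc
    return out


def laplacian_periodic(field):
    """Five-point discrete Laplacian with periodic boundaries."""
    h, w = len(field), len(field[0])
    return [[field[(y + 1) % h][x] + field[(y - 1) % h][x]
             + field[y][(x + 1) % w] + field[y][(x - 1) % w]
             - 4.0 * field[y][x]
             for x in range(w)] for y in range(h)]


def nonlocal_functional_derivative(phi, kernel, beta):
    """Discrete analogue of (4.1): beta * Laplacian(G convolved with phi)."""
    smoothed = convolve_periodic(phi, kernel)
    return [[beta * v for v in row] for row in laplacian_periodic(smoothed)]
```

For a kernel of fixed size, the cost per iteration grows linearly with the number of pixels, in line with the scaling argument of section 4.1. Gradient descent would then move the field against this derivative, together with the derivatives of the other energy terms.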
5 Conclusion

We have described a phase field version of the HOAC inflection point ‘gas of circles’ model described in [7]. The model was applied to the problem of tree crown extraction from aerial images, a problem of great practical importance for the forestry industry. The model combines the strengths of the inflection point model with those of the phase field framework: it removes the ‘phantom circles’ produced by the original ‘gas of circles’ model, while executing two orders of magnitude faster than the contour model.
References
1. Horváth, P., Jermyn, I.H., Kato, Z., Zerubia, J.: A higher-order active contour model for tree detection. In: Proc. International Conference on Pattern Recognition (ICPR), Hong Kong, China (2006)
2. Rochery, M., Jermyn, I.H., Zerubia, J.: Higher-order active contours. International Journal of Computer Vision 69, 27–42 (2006)
3. Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1, 321–331 (1988)
4. Cremers, D., Kohlberger, T., Schnörr, C.: Shape statistics in kernel space for variational image segmentation. Pattern Recognition 36, 1929–1943 (2003)
5. Foulonneau, A., Charbonnier, P., Heitz, F.: Geometric shape priors for region-based active contours. In: Proc. IEEE International Conference on Image Processing (ICIP). vol. 3, pp. 413–416 (2003)
6. Leventon, M., Grimson, W., Faugeras, O.: Statistical shape influence in geodesic active contours. In: Proc. IEEE Computer Vision and Pattern Recognition (CVPR), Hilton Head Island, SC, USA, pp. 316–322 (2000)
7. Horváth, P., Jermyn, I.H., Kato, Z., Zerubia, J.: An improved ‘gas of circles’ higher-order active contour model and its application to tree crown extraction. In: Kalra, P., Peleg, S. (eds.) ICVGIP 2006. LNCS, vol. 4338, Springer, Heidelberg (2006)
8. Rochery, M., Jermyn, I., Zerubia, J.: Phase field models and higher-order active contours. In: Proc. IEEE International Conference on Computer Vision (ICCV), Beijing, China (2005)
9. Gougeon, F.A.: Automatic individual tree crown delineation using a valley-following algorithm and rule-based system. In: Hill, D., Leckie, D. (eds.) Proc. International Forum on Automated Interpretation of High Spatial Resolution Digital Imagery for Forestry, Victoria, British Columbia, Canada, pp. 11–23 (1998)
10. Larsen, M.: Finding an optimal match window for Spruce top detection based on an optical tree model. In: Hill, D., Leckie, D. (eds.) Proc. International Forum on Automated Interpretation of High Spatial Resolution Digital Imagery for Forestry, Victoria, British Columbia, Canada, pp. 55–66 (1998)
11. Perrin, G., Descombes, X., Zerubia, J.: A marked point process model for tree crown extraction in plantations. In: Proc. IEEE International Conference on Image Processing (ICIP), Genova, Italy (2005)
Shape Signature Matching for Object Identification Invariant to Image Transformations and Occlusion
Stamatia Giannarou and Tania Stathaki
Communications and Signal Processing Group, Imperial College London
{stamatia.giannarou,t.stathaki}@imperial.ac.uk
Abstract. This paper introduces a novel shape matching approach for the automatic identification of real world objects in complex scenes. The identification process is applied on isolated objects and requires the segmentation of the image into separate objects, followed by the extraction of representative shape features and the similarity estimation of pairs of objects. In order to enable an efficient object representation, a novel boundary-based shape descriptor is introduced, formed by a set of one dimensional signals called shape signatures. During identification, the cross-correlation metric is used in a novel fashion to gauge the degree of similarity between objects. The invariance of the method to uniform scaling and partial occlusion is achieved by considering both cases as possible scenarios when correlating shape signatures. The proposed vision system is robust to ambient conditions (partial occlusion) and image transformations (scaling, rotation, translation). The performance of the identifier has been examined over a wide range of complex images and prototype objects.
1 Introduction

Recently there has been, within the image processing industry, an increasing emphasis on being able to extract from an image an object’s shape and characteristics. For example, in military applications it is essential to be able to extract an object’s shape for comparison with stored prototype shapes for target characterization. This work is concerned with the investigation of the fundamental principles and methods needed for an autonomous system which is able to recognize objects in digital images. Automatic object recognition is a hard image processing task comprising a great number of difficulties and requirements. A robust object recognition system should succeed in identifying objects which are shifted, rotated, distorted or partially occluded in the scene. In addition, the identification system should recognize objects regardless of their scale. The fact that the shape of objects in the real world varies depending on the scale of observation has significant implications for describing them. The aim of this work is to introduce a novel object identification system that meets the above invariance requirements. In this contribution, we present a novel vision system for the recognition of real world objects in complex scenes which is robust to ambient conditions (partial occlusion) and image transformations (scaling, rotation, translation). The identification process is based on the comparison of assemblies of image regions with a previously stored view of a known prototype object. The method has two steps: pre-processing and recognition. The first involves the extraction of significant features from the image (such as object boundaries), the segmentation of the image into separate objects and the tracing of object contours. The second step takes into account a novel boundary-based shape descriptor formed by a set of one dimensional signals called shape signatures, to enable an efficient object representation. During identification, the cross-correlation and related metrics of object signatures are used to gauge the degree of similarity between two shapes, yielding a set of similarity terms. A test object is of no interest to us as long as its degree of similarity to the prototype remains below a suitable acceptance threshold.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 710–717, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Prior Work on Shape Recognition

The problem of object recognition is one of the most familiar and fundamental problems in image processing. There is a great variety of shape descriptors and associated similarity measures which emphasize different image properties such as pixel intensities, color, texture and edges. Comprehensive surveys of methods for shape analysis have been reported [9, 12, 14]. In general, object identification techniques can be classified into the two distinct categories presented below.

The first category includes techniques based on the so-called interest point operators. In these approaches, image features which mainly indicate local saliency are generated by applying local shape descriptors to image regions, for example the Harris-Laplace detector [10]. The extracted features are used as input to a matching system that yields the final identification result. Representative interest point operators include the Scale Invariant Feature Transform (SIFT) approach developed by Lowe [10], Lindeberg's blob detector [8] and the epipolar alignment of images taken from different viewpoints proposed in [15].

Techniques that belong to the second category are based on shape matching and are applied to isolated objects or pre-segmented images. They involve the segmentation of the image into separate objects, followed by shape representation and estimation of the similarity between pairs of objects. Common shape representation techniques of this category use a specific type of the so-called moment invariants, first introduced by Hu [7]. A more efficient descriptor is the generic Fourier descriptor (GFD) [13]. It is obtained by applying the two-dimensional Fourier transform to a polar-raster sampled shape image. A recent approach to shape representation is the Shape Context descriptor [2]. This descriptor is applied to edge point locations and yields a 2D histogram describing the edge distribution in the region surrounding each contour point.
Another representative example is the shape signature approach. The signature is a one-dimensional function derived from the boundary points of the shape. Common shape signatures include centroid distance, tangent angle and area [5, 11]. In this work, the object representation is based on shape signatures extracted from the object's boundary.

The paper is organized as follows. A new boundary-based object descriptor is introduced in the next section. A novel identification method is proposed in Section 4. Experimental results of applying the identifier to complex images and prototype objects are presented in Section 5. We discuss the performance of the identifier and conclude in Section 6.
3 Shape Representation Using a Set of Signatures

In this section, we focus on 2D shape representation. Based on shape signatures, we propose a novel descriptor that captures the object's boundary information while remaining invariant to translation, rotation and uniform scaling. A number of preprocessing steps, applied to both the complex image and the prototype shape image, are prerequisite for the object representation. To begin with, the recovery of the shape descriptor requires the extraction of the edge map (the collection of edge pixels) of the image of interest. This can be obtained by applying an edge detection process (for example the Canny operator [3]) to the image. Since the proposed descriptor is applied to isolated objects, the so-called "Connected-Component Labeling" technique described in [6] is employed to segment the image into separate objects. The detected connected components represent separate objects in the image. In order to arrange the contour points of an object as an ordered sequence of adjacent pixels, each connected component is passed to a contour tracer. The boundary tracing approach employed in this work is based on the one proposed in [4]. The output of the algorithm is an array containing the coordinates of the boundary traced in a clockwise direction.

Let an object $P$ be represented by a set of traced contour (edge) points (pixels) $\{(x_1, y_1) \ldots (x_n, y_n)\}$, where $x$ and $y$ stand for the x- and y-coordinates of the points on the image plane. The notation $P \equiv \{(x_1, y_1) \ldots (x_n, y_n)\}$ may be used. Prior to the analysis of the proposed shape descriptor, we introduce the shape signature $s_k$, defined as the set of the distances between the points of the traced boundary sequence and a fixed point $(x_k, y_k)$ on the contour. Mathematically, the above signature is expressed as:
\[ s_k = \sqrt{(x_i - x_k)^2 + (y_i - y_k)^2}, \quad \text{for } i = 1 \ldots n \tag{1} \]

From the above definition it is clear that the signature $s_k$ depends on the reference point $(x_k, y_k)$; a different reference point will result in a different signature for the same object. In order to enable an accurate shape representation, we propose a novel shape descriptor which takes into consideration all the signatures yielded from an object's contour. For the object $P$, the proposed shape descriptor $S$ is defined as:

\[ S = \{ s_k, \ \text{for } k = 1 \ldots n \} \]

where $s_k$ is the distance function defined by (1). According to this representation, each object is described by a set, $S$, of shape signatures generated by calculating the distance functions $s_k$ with respect to different reference points. The length of the signature set $S$ is equal to the length of the traced boundary sequence of the object.

Since the distances in (1) are calculated with respect to points on the shape, the descriptor is invariant to changes in the location of the test object in the image. In addition, since all the contour points are used as reference points, the shape descriptor of a rotated object will include the same signatures as the descriptor of the original one. It can be shown that uniform scaling of an object does not affect the shape of the signature $s_k$; only its amplitude and its length are multiplied by a factor proportional to the scaling factor. Therefore, as will be shown later on, when combined with a suitable similarity measure, the proposed shape descriptor is also invariant to uniform-scaling of the
test object. The above descriptor is also robust to ambient conditions, such as partial occlusion. Assume an object $P \equiv \{(x_1, y_1) \ldots (x_n, y_n)\}$ represented by the signature set $\{s^p_l, \text{ for } l = 1 \ldots n\}$ and an occluded version of it, $C \equiv \{(x_1, y_1) \ldots (x_m, y_m)\}$, described by the set $\{s^c_k, \text{ for } k = 1 \ldots m\}$, where $m < n$. Since $C$ is an occluded version of $P$, the former's contour will be part of the latter's. That means a signature $s^c_i$ from the set $\{s^c_k, \text{ for } k = 1 \ldots m\}$ will be part of another one, $s^p_j$, from the set $\{s^p_l, \text{ for } l = 1 \ldots n\}$. Thus, during retrieval $s^c_i$ and $s^p_j$ will exhibit a good partial (local) match.
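The descriptor of this section can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `signature_set` and the toy square contour are hypothetical, and the contour is assumed to be already traced into an ordered list of points.

```python
import numpy as np

def signature_set(contour):
    """Return the n x n array S whose row k is the signature s_k of Eq. (1).

    `contour` is an ordered sequence of n traced boundary points (x, y).
    """
    pts = np.asarray(contour, dtype=float)       # shape (n, 2)
    diff = pts[None, :, :] - pts[:, None, :]     # diff[k, i] = p_i - p_k
    return np.sqrt((diff ** 2).sum(axis=2))      # S[k, i] = distance from p_k to p_i

# Hypothetical toy contour: the four corners of a unit square, traced in order.
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
S = signature_set(square)

# Translating the object leaves every signature unchanged.
shifted = [(x + 7, y - 3) for x, y in square]
S_shifted = signature_set(shifted)

# Starting the trace at another boundary point only reorders the signatures
# and shifts each of them circularly.
rolled = square[1:] + square[:1]
S_rolled = signature_set(rolled)
```

Storing all $n$ signatures costs $O(n^2)$ memory, which is the price paid for not having to choose a single reference point.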
4 Shape Similarity Evaluation

As explained in the previous section, invariance to rotation and translation is intrinsic to the proposed shape descriptor. When measuring shape similarity, special treatment is given to achieving invariance to uniform-scaling and partial occlusion as well. In both of these cases (scaling and occlusion) the length of the signature changes. Since the shape of a signature is not seriously affected by changes in the object's size, uniform-scaling can be handled in similarity estimation by normalizing the compared signatures and making them the same length. Partial occlusion, on the other hand, can be handled by comparing the shorter signature with parts of the longer one. However, the proposed vision system for automatic object recognition is blind: it is not possible to determine in advance the transformation applied to the test object, i.e., whether the object is scaled or occluded. For that reason, if the lengths of the compared shape signatures are different, the proposed method for similarity estimation evaluates both cases. For the final decision, the outcomes from both cases are considered and compared.

Let a prototype object $P$ consisting of $N$ boundary points be represented by the set of shape signatures $S_P \equiv \{S_{\mathrm{prot}_i}, \text{ for } i = 1 \ldots N\}$, where $S_{\mathrm{prot}_i}$ is the distance function with reference to the contour point $(x_i, y_i)$ on the boundary of $P$. Similarly, let a test object $T$ of $M$ boundary points be described by the set $S_T \equiv \{S_{\mathrm{test}_j}, \text{ for } j = 1 \ldots M\}$. Without loss of generality, we assume that $M < N$. The similarity between $P$ and $T$ is estimated by comparing the $N \times M$ signature pairs $(S_{\mathrm{prot}_i}, S_{\mathrm{test}_j})$.
In the first step of similarity extraction, the smaller of the compared objects, $T$, is treated as a scaled version of the bigger one, $P$, and the maximum of the Normalized Circular Cross-Correlation (NCCC) is employed to gauge the degree of similarity between each signature pair $(S_{\mathrm{prot}_i}, S_{\mathrm{test}_j})$. Since the above measure requires the compared signatures to be of the same length, interpolation is employed to stretch the $S_{\mathrm{test}_j}$ signatures and make them of size $N$, yielding the signatures $SI_{\mathrm{test}_j}$. The Normalized Circular Cross-Correlation of a signature pair $(S_{\mathrm{prot}_i}, SI_{\mathrm{test}_j})$ is defined as:

\[ NCCC_{ij}(k) = \frac{\sum_n \big(S_{\mathrm{prot}_i}(n) - \bar{S}_{\mathrm{prot}_i}\big)\big(SI_{\mathrm{test}_j}(n - k) - \overline{SI}_{\mathrm{test}_j}\big)}{\sqrt{\sum_n \big(S_{\mathrm{prot}_i}(n) - \bar{S}_{\mathrm{prot}_i}\big)^2 \sum_n \big(SI_{\mathrm{test}_j}(n - k) - \overline{SI}_{\mathrm{test}_j}\big)^2}} \]

for $k = -N + 1 \ldots N - 1$, $i = 1 \ldots N$ and $j = 1 \ldots M$, where $n = 1 \ldots N$ is the signature point index and $\bar{S}_{\mathrm{prot}_i}$, $\overline{SI}_{\mathrm{test}_j}$ stand for the mean values of $S_{\mathrm{prot}_i}$ and $SI_{\mathrm{test}_j}$, respectively. The similarity of a signature pair $(S_{\mathrm{prot}_i}, SI_{\mathrm{test}_j})$ is given by:
\[ C_{ij} = C(S_{\mathrm{prot}_i}, SI_{\mathrm{test}_j}) = \max_k \{ NCCC_{ij}(k) \} \]
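As a rough illustration of the scaled-object comparison, the sketch below stretches the shorter signature and takes the maximum of the normalized circular cross-correlation over all circular shifts. The function names `stretch` and `max_nccc` are hypothetical, and linear interpolation via `np.interp` is an assumption: the text only says that interpolation is used.

```python
import numpy as np

def stretch(signature, length):
    """Resample a signature to `length` samples (linear interpolation, an assumption)."""
    sig = np.asarray(signature, dtype=float)
    old = np.linspace(0.0, 1.0, num=len(sig))
    new = np.linspace(0.0, 1.0, num=length)
    return np.interp(new, old, sig)

def max_nccc(s_prot, s_test):
    """C_ij: maximum over shifts k of the normalized circular cross-correlation."""
    a = np.asarray(s_prot, dtype=float)
    b = stretch(s_test, len(a))          # SI_test_j, same length as S_prot_i
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    if denom == 0.0:                     # a constant signature carries no shape information
        return 0.0
    # np.roll(b, k)[n] == b[(n - k) mod N], i.e. the circular shift in the definition.
    return max(float(np.dot(a, np.roll(b, k))) / denom for k in range(len(a)))
```

Because the mean and the energy of a circularly shifted signal do not change, the denominator is computed once. For long contours the loop over shifts could be replaced by an FFT-based circular correlation.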
The use of circular cross-correlation eliminates the effect of the unknown shift between the starting points of the two contours, while the normalization makes the similarity measure invariant to differences in the signatures' amplitude due to scaling. According to the above object comparison, for each contour point on the smaller object associated with a signature $S_{\mathrm{test}_j}$, there will be a point on the bigger object associated with a signature $S_{\mathrm{prot}_i}$ which bears the highest similarity to $S_{\mathrm{test}_j}$ among all the signatures corresponding to the bigger object's contour points. Therefore, each contour point $j$ on the smaller object is assigned a value (equal to $\max_i \{C_{ij}\}$) indicating the maximum similarity of the two objects with reference to point $j$.

Our main idea for similarity estimation stems from the fact that, when comparing two similar objects differing only in size, the $\max_i \{C_{ij}\}$ values will be high (ideally equal to 1) for all the contour points. On the other hand, when comparing dissimilar objects, even if a few signature pairs are highly correlated, the majority of contour point pairs are associated with dissimilar signatures, yielding low $\max_i \{C_{ij}\}$ values. In order to make the similarity estimation more robust, the degree of similarity between two shapes is expressed with respect to two terms, namely the maximum and minimum similarity values assigned to the smaller object's contour points. Therefore, the similarity between the objects $P$ and $T$, when the latter is treated as a scaled version of the former, is determined by the terms:

\[ V_{w\,\max} = \max_j \{ \max_i \{ C_{ij} \} \} \quad \text{and} \quad V_{w\,\min} = \min_j \{ \max_i \{ C_{ij} \} \} \tag{2} \]
When comparing similar objects, both the $V_{w\,\max}$ and $V_{w\,\min}$ terms should be of high value. For that reason, as explained later on, the product of the above terms is considered for the final decision on the objects' similarity.

The second step of similarity evaluation treats the smaller object as an occluded version of the larger one. In that case, we accept that if objects $P$ and $T$ match, one of the signatures $\{S_{\mathrm{test}_j}, \text{ for } j = 1 \ldots M\}$ will be part of another one from the set $\{S_{\mathrm{prot}_i}, \text{ for } i = 1 \ldots N\}$. Note that this assumption is not valid if the test object has undergone both uniform-scaling and partial occlusion. The signature comparison is achieved by cross-correlating each $S_{\mathrm{test}_j}$ signature with frames of length $M$, $\{FS^f_{\mathrm{prot}_i}, \text{ for } f = 1 \ldots (N - M)\}$, of each $S_{\mathrm{prot}_i}$ signature. So, the degree of similarity, $D_{ij}$, for the pair $(S_{\mathrm{prot}_i}, S_{\mathrm{test}_j})$ is defined as the maximum of the Normalized Cross-Correlation (NCC) values across all the frames. That is:

\[ D_{ij} = \max_f \{ NCC(FS^f_{\mathrm{prot}_i}, S_{\mathrm{test}_j}) \}, \quad \text{for } i = 1 \ldots N, \ j = 1 \ldots M, \ f = 1 \ldots (N - M) \]
where $NCC(r, q)$ represents the maximum value of the Normalized Cross-Correlation between the signatures $r$ and $q$. Following the previous analysis, the object similarity for the case of treating the smaller object as an occluded version of the larger is defined by the terms:

\[ V_{o\,\max} = \max_j \{ \max_i \{ D_{ij} \} \} \quad \text{and} \quad V_{o\,\min} = \min_j \{ \max_i \{ D_{ij} \} \} \tag{3} \]
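The occlusion case and the similarity terms of Eqs. (2) and (3) can be sketched as follows. This is a simplified illustration under stated assumptions: `ncc` is taken here as the zero-lag normalized correlation of a frame with the test signature, every contiguous frame of length $M$ is tried, and the names `ncc`, `occlusion_similarity` and `similarity_terms` are hypothetical.

```python
import numpy as np

def ncc(r, q):
    """Zero-lag normalized cross-correlation of two equal-length signals (assumption)."""
    r = np.asarray(r, dtype=float) - np.mean(r)
    q = np.asarray(q, dtype=float) - np.mean(q)
    denom = np.sqrt((r ** 2).sum() * (q ** 2).sum())
    return float(np.dot(r, q) / denom) if denom > 0 else 0.0

def occlusion_similarity(s_prot, s_test):
    """D_ij: best NCC between s_test (length M) and any length-M frame of s_prot."""
    s_prot = np.asarray(s_prot, dtype=float)
    s_test = np.asarray(s_test, dtype=float)
    n, m = len(s_prot), len(s_test)
    # All n - m + 1 contiguous frames are tried here; the text indexes f = 1 ... (N - M).
    return max(ncc(s_prot[f:f + m], s_test) for f in range(n - m + 1))

def similarity_terms(sim):
    """Return (V_max, V_min) from a matrix sim[i, j] (prototype point i, test point j)."""
    best_per_test_point = np.asarray(sim, dtype=float).max(axis=0)   # max over i
    return float(best_per_test_point.max()), float(best_per_test_point.min())
```

The same `similarity_terms` helper serves both cases: fed with the matrix of $C_{ij}$ values it gives $(V_{w\,\max}, V_{w\,\min})$, and fed with the $D_{ij}$ values it gives $(V_{o\,\max}, V_{o\,\min})$.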
[Fig. 1, panels (a)–(e). Plot legends: "Original Object", "Scaled Object", "Original Object Signature", "Stretched Signature of Scaled Object".]
Fig. 1. (a) Complex and (b) prototype shape images. (c) The most similar shape signatures for the airplane objects. Comparison of the airplanes’ shape signatures when treating the smaller object as (d) a scaled (e) an occluded version of the bigger one.
Having defined the similarity terms for a pair of objects in equations (2) and (3), a decision has to be made on whether the two objects are similar or not. This decision is based on the ratio of the similarity terms, defined as:

\[ R_s = \frac{V_{o\,\max} \cdot V_{o\,\min}}{V_{w\,\max} \cdot V_{w\,\min}} \tag{4} \]
The value of $R_s$ is indicative of the transformation (uniform-scaling or partial occlusion) that has been applied to the test object. If $R_s$ is greater than 1, the test object is taken to be an occluded version of the prototype, provided that the $V_{o\,\min}$ term exceeds a predefined threshold, $Thr_o$. Likewise, if $R_s$ is less than 1, the value of $V_{w\,\min}$ is compared with the threshold $Thr_w$ to determine whether the test object is a scaled version of the prototype or not. In either case, if the examined similarity term survives the thresholding process, the target object is present in the complex scene at a location indicated by the x- and y-coordinates of the contour points of $T$; otherwise the test object is of no interest to us.
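The decision rule can be written down directly. The sketch below is illustrative only: the function name `decide` and the returned labels are hypothetical, the thresholds are passed in by the caller, and the case $R_s = 1$, which the text does not cover, is reported as undecided.

```python
def decide(v_w_max, v_w_min, v_o_max, v_o_min, thr_w, thr_o):
    """Return (label, R_s) following the ratio test of Eq. (4).

    label is "occluded", "scaled", "no match" or "undecided" (hypothetical names).
    """
    r_s = (v_o_max * v_o_min) / (v_w_max * v_w_min)
    if r_s > 1:
        # Occlusion scenario wins: accept only if V_o_min passes its threshold.
        label = "occluded" if v_o_min > thr_o else "no match"
    elif r_s < 1:
        # Scaling scenario wins: accept only if V_w_min passes its threshold.
        label = "scaled" if v_w_min > thr_w else "no match"
    else:
        label = "undecided"   # R_s == 1 is not specified in the text
    return label, r_s
```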
5 Experimental Results

Generally, the signature correlation between two shapes tends to exhibit high values, even if the shapes correspond to different objects. This observation is even more pronounced for man-made objects (for example aircraft), the type of objects we are mainly
interested in this work. An obvious reason is that a large part of the contour of a man-made object consists of straight lines or low-curvature parts. Therefore, when estimating the signature correlation between two man-made objects, such parts will most probably be present in both objects, raising the value of the signature correlation. Moreover, when an object is treated as an occluded version of another object using the technique presented in this work, it is again likely that the signature of the occluded object will be highly correlated with a signature frame of the original non-occluded object, even if the two objects are dissimilar. The reason is that a part of a man-made object is likely to exhibit considerable similarity with a part of any other man-made object. This effect is stronger if the occluded object contains fewer salient features, i.e., high activity points. However, in the proposed system we observed that, although signature correlations are generally high, the correct scenario consistently provides better results (higher correlations) than the false scenario. The similarity thresholds were experimentally set to $Thr_w = 0.95$ and $Thr_o = 0.85$.

In this work, we have applied the proposed object identification algorithm to a wide range of complex image and prototype object selections [1]. A representative simulation is illustrated in Figure 1 for the identification of a Harrier airplane in a synthetic image containing three different objects: a bird, a cat and a scaled Harrier airplane (half the size of the target). In Figure 1(d) we present the comparison of the signature pair $(S_{\mathrm{prot}_i}, SI_{\mathrm{test}_j})$ that gives the maximum similarity value ($V_{w\,\max}$), when treating the smaller object as a scaled version of the bigger one.
Figure 1(e) illustrates the comparison of the signature pair that gives the maximum similarity value ($V_{o\,\max}$), when treating the smaller object as an occluded version of the bigger one. The comparison of the shape signatures yielded a specific signature pair with the highest similarity, illustrated in Figure 1(c), corresponding to the Harrier prototype and the scaled Harrier test object, respectively. The two signatures have different lengths but the same shape. The results of evaluating the similarity terms ($V_{w\,\max}$, $V_{w\,\min}$, $V_{o\,\max}$, $V_{o\,\min}$) for each pair of test-prototype objects are: for Bird-Prototype (0.98, 0.80, 0.96, 0.78), for Cat-Prototype (0.96, 0.70, 0.95, 0.81) and for Harrier-Prototype (0.99, 0.96, 0.97, 0.89). The ratio of similarity terms in the case of the Bird-Harrier pair is equal to $R_{sB} = 0.95 < 1$ and, since $V_{w\,\min} < Thr_w$, the two objects are not similar. Likewise, for the Cat-Harrier pair $R_{sC} = 1.14 > 1$ and $V_{o\,\min} < Thr_o$. The similarity term evaluation for the two Harrier objects results in $R_{sH} = 0.9 < 1$ and $V_{w\,\min} > Thr_w$, verifying that the compared objects are similar and differ only in size, which agrees with the actual scenario. The above results emphasize the ability of the proposed approach to distinguish between similar and dissimilar objects and to determine the transformation (uniform-scaling or occlusion) that has been applied to the test objects.
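As a check on the arithmetic, substituting the reported similarity terms into the ratio of Eq. (4), read as $R_s = (V_{o\,\max} V_{o\,\min}) / (V_{w\,\max} V_{w\,\min})$, reproduces the quoted ratios to the stated precision:

```latex
\begin{align*}
R_{sB} &= \frac{0.96 \times 0.78}{0.98 \times 0.80} = \frac{0.7488}{0.7840} \approx 0.955 < 1,
  & V_{w\,\min} &= 0.80 < Thr_w = 0.95 \\
R_{sC} &= \frac{0.95 \times 0.81}{0.96 \times 0.70} = \frac{0.7695}{0.6720} \approx 1.145 > 1,
  & V_{o\,\min} &= 0.81 < Thr_o = 0.85 \\
R_{sH} &= \frac{0.97 \times 0.89}{0.99 \times 0.96} = \frac{0.8633}{0.9504} \approx 0.908 < 1,
  & V_{w\,\min} &= 0.96 > Thr_w = 0.95
\end{align*}
```

Only the Harrier pair passes its threshold, so it is the only pair accepted, and it is accepted as a scaled version of the prototype.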
6 Conclusions

A new system for the identification of two-dimensional objects in complex scenes has been presented in this work. The identification method is a prototype-based one, where prior knowledge of the target is given by a prototype template consisting of a set of representative features. In order to make the proposed system invariant to rotation and
translation, a novel shape descriptor is introduced as the set of shape signatures extracted from the object’s contour and defined as the distance of the boundary points with reference to a fixed point on the contour. The similarity between two objects is estimated by cross-correlating signature pairs included in their shape descriptors. During the comparison process, two distinct cases are considered in order to achieve invariance of the identifier to uniform-scaling and partial occlusion. The robustness of the identifier is verified by the presented experimental results.
Acknowledgements

The investigations which are the subject of this paper were initiated by Dstl under the auspices of the United Kingdom Ministry of Defence Systems Engineering for Autonomous Systems Defence Technology Centre.
References

1. http://www.lems.brown.edu/vision/software
2. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Machine Intell. 24(4), 509–522 (2002)
3. Canny, J.F.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Machine Intell. 8(6), 679–698 (1986)
4. Carter, J.R.: Boundary tracing method and system. European Patent Application EP341819A3 (1989)
5. Davies, E.R.: Machine Vision: Theory, Algorithms, Practicalities. Academic Press, London (1997)
6. Haralick, R., Shapiro, L.G. (eds.): Computer and Robot Vision, vol. I, pp. 28–48. Addison-Wesley, London, UK (1992)
7. Hu, M.K.: Visual pattern recognition by moment invariants. IRE Transactions on Information Theory IT-8, 179–187 (1962)
8. Lindeberg, T.: Feature detection with automatic scale selection. International Journal of Computer Vision 30(2), 77–116 (1998)
9. Loncaric, S.: A survey of shape analysis techniques. Pattern Recognition 31(8), 983–1001 (1998)
10. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
11. Van Otterloo, P.J.: A contour-oriented approach to shape analysis. Prentice-Hall, Englewood Cliffs (1991)
12. Veltkamp, R.C., Hagedoorn, M.: State of the art in shape matching. Principles of visual information retrieval, pp. 87–119 (2001)
13. Zhang, D., Lu, G.: Shape based image retrieval using generic Fourier descriptors. Signal Processing: Image Communication 17(10), 825–848 (2002)
14. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognition 37(1), 1–19 (2004)
15. Zhang, Z., Deriche, R., Faugeras, O.D., Luong, Q.T.: A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence 78(1–2), 87–119 (1995)
Junction Detection and Multi-orientation Analysis Using Streamlines

Frank G.A. Faas and Lucas J. van Vliet

Quantitative Imaging Group, Delft University of Technology, The Netherlands [email protected]
Abstract. We present a novel method to detect multimodal regions composed of linear structures and measure the orientations in these regions, i.e. at line X-sings, T-junctions and Y-forks. In such complex regions an orientation detector should unmix the contributions of the unimodal structures. In our approach we first define a (streamline) divergence metric and apply it to our streamline field to detect junctions. After this step we select all streamlines that intersect a circle of radius r around the junction twice, cluster the intersection points and compute the direction per cluster. This yields a multimodal descriptor of the local orientations in the vicinity of the detected junctions. The method is suited for global analysis and has moderate memory requirements.
1 Introduction
Several well-known tools exist for the detection, localization and characterization of unimodal linear structures based on the structure tensor and the Hessian matrix. For multimodal structures the detection can still rely on the structure tensor [7], as demonstrated by the Harris-Stephens corner detector [8]. However, these methods cannot unmix the responses from the composing structures. Most detectors have a response which decreases strongly with decreasing angular separation between the composing structures. The filterbank approach [4,6,3] is a well-known method for the analysis of multimodal regions. However, its excellent angular selectivity is combined with poor localization.

The key observation underlying our method is the following: any X-sing, T-junction or Y-fork is composed of a few unimodal structures (line or edge segments). At X,T,Y-transitions a single measure for orientation, e.g. by means of the structure tensor, yields a weighted sum of the orientations of the constituent elements. Moving away from the center of the junction toward one of the unimodal structures, the measure will approach that of a unimodal structure. In this paper we present a streamline-based method which connects the unimodal regions to multimodal regions, e.g. the arms of a junction to the junction itself. The streamlines are constructed from a vector field that represents the local structure. This method allows us to accurately locate junctions and crossings and at the same time measure the attributes of the underlying unimodal structures. The method offers good angular selectivity within a relatively small analysis window.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 718–725, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. (a) Artificial Y-junction with orientation field (gray overlay) and three streamlines (green). (b) Orientation field with a streamline starting from the black dot in the center. (c) Sketch of a Y-junction (dashed lines) with some streamlines around the junction center. The circle labeled with r denotes the distance from the center at which the orientation field is sampled and the labeling of the streamlines is performed. The letters denote the (double) labeling of the streamlines.
2 Method
Our method comprises five steps. The first step is to find a suitable vector field which describes the local structure in a consistent way, see Fig. 1(a-b). Second, this vector field is used to calculate streamlines at each pixel position. Third, a streamline divergence metric is introduced which yields a high value for multimodal structures and a low value on simple linear structures. Fourth, a suitable threshold is applied to detect the multimodal structures. After merging fragmented responses, the center point of the junction is determined. Fifth, we proceed with the analysis of the streamlines in a region around the junction center, see Fig. 1(c). To that end we define a circle with radius r centered at the junction. A suitable streamline should cross this circle twice, i.e. once for each of the unimodal structures it connects. The points where the streamlines intersect the circle define the new endpoints. The set of all endpoints will give rise to clusters, i.e. one cluster for each unimodal structure in the junction. These clusters are analyzed and each streamline is assigned two labels, one for each unimodal structure it belongs to. For each cluster we compute the average direction over the endpoints of the streamlines that belong to the same cluster.

Step One: Vector Field

A suitable vector field in this framework follows the underlying local structure in the image and is smooth and continuous. The vector field should be well defined on both even and odd structures, i.e. respectively along the bottom of the valley and along the edge of linear structures. We assume, without loss of generality, a white background and dark foreground throughout this paper. The vector field in this paper is created by means of a set of rotated quadrature filters which are combined into a tensor representation. An eigensystem
Fig. 2. (a) Typical dendrogram for a X-sing, depicting the cluster distance measure (vertical) composed of individual streamlines (horizontal). (b) Sketch of the distance measure dpq between the streamlines P and Q at a distance s from the respective starting points P (0) and Q(0). (c) Sketch of the intrinsic scale (here denoted by dmax ) of the intersection of two lines of width w.
analysis gives the vector field. The tensor is constructed from quadrature filters as described by [9]. The quadrature filter is constructed by means of the generalized Hilbert transform

\[ F_i = G(|\mathbf{u}|, \sigma_F) \, H(\mathbf{u} \cdot \mathbf{n}_i) \, (\mathbf{u} \cdot \mathbf{n}_i)^2 \tag{1} \]
with $G$ an isotropic Gaussian transfer function, $H$ the Heaviside function and the last factor a quadratic cosine term to select the radial polynomial as well as the angular response. The orientation vectors $\mathbf{n}_i$ are defined as $\mathbf{n}_i = [\cos(\tfrac{\pi}{3} i), \sin(\tfrac{\pi}{3} i)]^T$ with $i = 0, 1, 2$. The 2D tensor $\mathbf{T}$ is defined as

\[ \mathbf{T} = \sum_{i=1,2,3} |q_i| \, (\mathbf{n}_i \mathbf{n}_i^T - \mathbf{I}) \tag{2} \]

with $q_i$ the response of the $i$th quadrature filter in the spatial domain. Now the vector field is given by the eigenvector belonging to the smallest eigenvalue.

Step Two: Streamlines

A streamline is a line which is everywhere tangent to the local flow field. Mathematically this can be stated as

\[ \frac{d\mathbf{x}}{ds} \propto \mathbf{u}(\mathbf{x}) \quad \text{and} \quad \mathbf{x}(s_0) = \mathbf{x}_0 \tag{3} \]
with $\mathbf{u}$ the vector field and $\mathbf{x}(s)$ the parameterized streamline. Further, $\mathbf{x}_0$ denotes a point on the streamline, i.e. the point of origin. Here a streamline is defined as a curve which is tangent to the local structure. As even structures have no direction, the flow vector is equally well described by two antipodal vectors, creating possible phase jumps between neighboring points. To solve this phase jump problem locally, i.e. at position $\mathbf{x}(s)$ on the streamline, the flow field is
given by the one of the two antipodal vectors which makes the smallest angle with the tangent vector of the streamline at $\mathbf{x}(s)$. As such the flow field is defined with respect to a local point of origin and the propagation direction of the streamline. Further, the start direction, i.e. $\mathbf{u}(s_0)$, of the streamline is ambiguous as there is no history to determine the tangent vector. This is not a problem as the curve of interest is centered around the point $\mathbf{x}_0$ and extends along the structure in both directions for a distance $l$, resulting in a streamline of length $2l$ with parameter $s$ running from $-l$ to $l$ along the curve, i.e. from one end to the other end. Here $s$ is the position along the contour measured from the center point. The streamlines are implemented by means of a first-order method.

Step Three: Streamline Divergence Metric

Two streamlines originating from points close together, on say one arm of the junction, can end up in two different arms separated by a significant distance, $d_{pq}$, see Fig. 2(b). Our junction detector is based on this observation. For each pixel the streamlines in a neighborhood are analyzed and the maximum separation between the reference streamline and the streamlines in the neighborhood is determined. We compute the distance between two streamlines at a distance $l_d$ (measured along the streamline) away from the pixel to be inspected. Summing over the line distances in the neighborhood yields a high value in the proximity of a junction. The streamline, $P$, through point $p$ has to be compared to the test streamline, $Q$, through point $q$, where $q$ lies in the neighborhood of $p$. The first step in computing the maximum separation between two streamlines is to align them with respect to the streamline at the point of interest $p$. This is necessary because the 'positive' direction of the streamline is ambiguous along line structures in the image.
After alignment, the two streamlines converge in the negative direction and diverge in the positive direction. Now the distance between the two streamlines $P$ and $Q$ is defined as the distance between the points on the streamlines at a distance $l_d$ from their respective origins, measured in the aligned positive direction:

\[ d_{pq} = \left| \big(P(l_d) - Q(l_d)\big) - \big(P(0) - Q(0)\big) \right| \tag{4} \]
where the second term is the translation vector between the streamline centers, see Fig. 2(b). Now we sum the distances between the streamline at $p$ and those in the local neighborhood $N$ to get a 'streamline divergence' measure $d_p$ at $p$:

\[ d_p = \sum_{\forall q \in N} d_{pq} \tag{5} \]
in which the size of the neighborhood $N$ reflects the intrinsic scale of the constituent unimodal structures. The intrinsic scale is defined by the width of the line segments or the length of the edge slope. It is by definition larger than or equal to the support of the overall Point Spread Function of optics and filters and can be measured by the method of Dijk [2].
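Steps Two and Three can be sketched as below. This is a minimal sketch, not the authors' code: the orientation field is assumed to be available as a hypothetical callable `orientation(p)` returning a unit vector of arbitrary sign, distances along the streamline are counted in integration steps, and, because the alignment rule is only described informally, the positive direction is taken to be whichever end gives the larger separation after flipping `Q` to agree with `P` at the centre.

```python
import numpy as np

def _trace_one_way(orientation, start, direction, n_steps, step):
    points = [np.asarray(start, dtype=float)]
    tangent = np.asarray(direction, dtype=float)
    for _ in range(n_steps):
        u = np.asarray(orientation(points[-1]), dtype=float)
        if np.dot(u, tangent) < 0:           # pick the antipodal vector closest to the tangent
            u = -u
        points.append(points[-1] + step * u) # first-order (Euler) update
        tangent = u
    return np.array(points)

def trace_streamline(orientation, start, n_steps, step=0.5):
    """Streamline through `start`, n_steps samples either way; index n_steps is P(0)."""
    u0 = np.asarray(orientation(start), dtype=float)
    forward = _trace_one_way(orientation, start, u0, n_steps, step)
    backward = _trace_one_way(orientation, start, -u0, n_steps, step)
    return np.vstack([backward[:0:-1], forward])

def streamline_distance(P, Q, ld_steps):
    """d_pq of Eq. (4) for two streamlines with the same (odd) number of samples."""
    c = len(P) // 2
    if np.dot(P[c + 1] - P[c], Q[c + 1] - Q[c]) < 0:
        Q = Q[::-1]                          # align Q with P at the centre
    offset = P[c] - Q[c]                     # translation between the streamline centres
    ends = [np.linalg.norm((P[c + s] - Q[c + s]) - offset) for s in (ld_steps, -ld_steps)]
    return float(max(ends))                  # assumed: diverging end = positive direction

def streamline_divergence(orientation, p, neighbours, n_steps, ld_steps, step=0.5):
    """d_p of Eq. (5): sum of d_pq over the neighbourhood points q."""
    P = trace_streamline(orientation, p, n_steps, step)
    return sum(streamline_distance(P, trace_streamline(orientation, q, n_steps, step), ld_steps)
               for q in neighbours)
```

On a real image the callable would look up (or interpolate) the eigenvector field of Step One at the given position; `ld_steps` must not exceed `n_steps`.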
Step Four: Detection of Junctions

The aforementioned 'streamline divergence' measure also gives reasonably high responses in the (noisy) background. We suppress these points by multiplying with the certainty of the Hessian matrix, which is close to zero in the background. A measure based on the tensor $\mathbf{T}$ is rejected as its response is not selective and not localized enough due to the quadrature filters. Therefore we use the more localized certainty measure based on the Hessian matrix as presented in [1] to select our regions of interest, basically the valley regions. This certainty measure can be stated as

\[ c = |f_{20}, f_{21}, f_{22}| = \sqrt{f_{xx}^2 + f_{yy}^2 - \tfrac{2}{3} f_{xx} f_{yy} + \tfrac{8}{3} f_{xy}^2} \tag{6} \]

where $f_{2i}$, with $i \in \{0, 1, 2\}$, denote the spherical harmonics of second order and $\{f_{xx}, f_{xy}, f_{yy}\}$ the second order spatial derivatives. This certainty deviates from the norm of the second order derivatives, i.e. $c = |f_{xx}, f_{xy}, f_{yy}|$, as it corrects for the fact that the second order spatial derivatives do not form an orthogonal basis [1] while the spherical harmonics of the second order do. Now an isodata (two-means) threshold [10] on the measure $c$ is performed, resulting in regions with a pronounced second order structure. As we are only interested in the valleys, the following restriction is introduced: $|\lambda_1| > |\lambda_2| \wedge \lambda_1 > 0$, where $\lambda_i$ is the $i$th eigenvalue of the Hessian matrix with $\lambda_1 > \lambda_2$. The regions for which one of the restrictions fails are set to zero in the response image. Of course, other nonlinear weightings can also be applied to suppress the background.

Now we can apply a simple threshold to the cleaned response image (just above the distinct background peak of the corresponding histogram) to get points where the streamlines show a high degree of divergence. Right at the junctions eddies may occur due to a lack of translational invariance. Hence we get a relatively low response at the centers of these junctions.
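The background suppression described above can be sketched as follows, assuming Eq. (6) reads $c = \sqrt{f_{xx}^2 + f_{yy}^2 - \tfrac{2}{3} f_{xx} f_{yy} + \tfrac{8}{3} f_{xy}^2}$. The function name `valley_certainty` is hypothetical, and plain finite differences (`np.gradient`) stand in for whatever derivative filters the authors use; in practice Gaussian derivatives at the intrinsic scale would be the natural choice.

```python
import numpy as np

def valley_certainty(image):
    """Hessian-based certainty c, set to zero outside valley-like regions."""
    f = np.asarray(image, dtype=float)
    fy, fx = np.gradient(f)            # axis 0 is y (rows), axis 1 is x (columns)
    fyy, _ = np.gradient(fy)
    fxy, fxx = np.gradient(fx)
    c = np.sqrt(fxx ** 2 + fyy ** 2 - (2.0 / 3.0) * fxx * fyy + (8.0 / 3.0) * fxy ** 2)
    # Eigenvalues of the 2x2 Hessian, ordered so that lam1 >= lam2.
    half_trace = 0.5 * (fxx + fyy)
    root = np.sqrt((0.5 * (fxx - fyy)) ** 2 + fxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    valley = (np.abs(lam1) > np.abs(lam2)) & (lam1 > 0)
    return np.where(valley, c, 0.0)
```

The returned map can then be thresholded and multiplied with the streamline divergence response to suppress the background.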
However, close to these vortices the divergence of the streamlines is very high. Therefore we include a step in which we merge the divergence responses in a neighborhood equal to the intrinsic scale of the underlying structure. Now that we have identified the junctions, we use the center of gravity of those regions as the position of the junctions. The localization can be further refined by finding the points of minimum distance to the found center point on all streamlines in the junction region. From these points a new center of gravity can be calculated, and if necessary this can be iterated. In our implementation this last refinement step is not performed, as the localization was good enough for our aim of demonstrating the algorithm.

Step Five: Direction Estimation. In the metric subsection we aligned streamlines on a pair-by-pair basis. In this paragraph we cluster them such that they are labeled with the two labels of the two unimodal structures they connect. This is done by a simple hierarchical clustering method and by applying a threshold at $d_{sep}$ to the dendrogram, see Fig. 2(a). In Fig. 1(c) a junction is shown with a number of streams, each with two labels, i.e. A, B or C. We define a circle with radius r centered around the
junction center. Now the intersections of all streamlines with this circle are determined, i.e. each streamline should cross this circle twice. Based on these points we label our streamlines. The radius r should be chosen such that the circle is larger than the mixed zone, yet as small as possible to avoid mixing with neighboring junctions. The longest 'diameter', $d_{max}$, of the mixed zone depends on the intrinsic scale of the constituent unimodal structures and the smallest angle α between them, see Fig. 2(c). Note that we have directional information, because the orientation "vectors" can be oriented such that they point away from the junction center. These vectors can therefore be averaged in each cluster, i.e. no structure tensor is necessary, as all vectors are mapped in the same direction, away from the junction. This average is used as the direction estimate. The directional information of the streamlines is sampled at r, see Fig. 1(c).
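The averaging in Step Five can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each sample is taken to be a triple of the crossing point on the circle, the local streamline orientation there (sign-ambiguous), and the cluster label already obtained from the dendrogram; the function name `arm_directions` and this input layout are hypothetical.

```python
import math
from collections import defaultdict


def arm_directions(center, samples):
    """Average outward-pointing orientation per cluster; returns angles in degrees.

    samples: iterable of ((px, py), (ox, oy), label), where (px, py) is a point
    where a streamline crosses the circle and (ox, oy) its local orientation.
    """
    cx, cy = center
    sums = defaultdict(lambda: [0.0, 0.0])
    for (px, py), (ox, oy), label in samples:
        norm = math.hypot(ox, oy)
        if norm == 0.0:
            continue  # no usable orientation at this sample
        # Resolve the sign ambiguity: make the vector point away from the center.
        if ox * (px - cx) + oy * (py - cy) < 0:
            ox, oy = -ox, -oy
        sums[label][0] += ox / norm
        sums[label][1] += oy / norm
    # All vectors in a cluster now point the same way, so a plain vector mean suffices.
    return {label: math.degrees(math.atan2(sy, sx)) for label, (sx, sy) in sums.items()}
```

Because every vector is flipped to point away from the junction before summation, opposite-signed orientations of the same arm reinforce rather than cancel each other, which is why no structure tensor is needed.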
3 Results
Fig. 3(a) and (b) show, respectively, a small part of the vascular system in the retina of an eye and a deformed miniaturized claydike model with a superimposed grid. The overlays in both images show the locations of the detected crossings (red crosses), the circles on which the directions are measured, and the measured orientations (colored lines). The minimum distance between clusters is set to $d_{sep} = 1$, which means that clusters separated by a distance smaller than $d_{sep}$ in the dendrogram are merged. Further, we set r and $l_d$ to identical values, i.e. $r = l_d = 7$ for the retina image and $r = l_d = 8$ for the claydike image. These values roughly correspond to the scale of the crossings in the images. Further, at a distance $l_d$ from the junction center the streamlines are in general parallel to the unimodal structures. As can be seen, all crossings are detected. The direction estimate for one of the arms of the left middle junction in the retina image is a bit off. This is a consequence of the very low contrast, i.e. the influence of the noise on the vector field is more severe if the contrast is low. Fig. 3(c-d) show, respectively, the 'streamline divergence' before and after background suppression using the second order certainty measure.
4 Discussion and Conclusions
Preliminary testing of our method shows excellent detection and characterization for a modest window size. The junction detection depends on the local structure and as such is relatively independent of the local contrast. Furthermore, the angular selectivity can be increased by increasing the labelling distance r. Another nice aspect of the method is that it automatically selects a mode, i.e. it selects the number of unimodal structures at the junction. The implementation used can easily be improved, e.g. by means of higher order streamline methods. This should decrease the influence of noise and should refine the streamlines. Furthermore, one could think of averaging the orientation information over a larger area to obtain a more robust estimate, i.e. at this
Fig. 3. (a) Blowup of a retina image (courtesy of the National Eye Institute, U.S. National Institutes of Health). (b) Image of a deformed miniaturized claydike model with a superimposed grid (courtesy of GeoDelft, The Netherlands). In overlay: the dashed circles denote the analysis window, and the colored lines at the circle denote the measured orientations at their intersection with the circle, with $r = l_d = 7$ and $r = l_d = 8$ for images (a) and (b), respectively. (c) The distance metric of the image in (b). (d) The distance metric for the image in (b) after setting the points where the Hessian is uncertain to zero.
moment the streams are only sampled where they cross a circle of radius r, which makes the orientation susceptible to noise. We state that the labeling of the streamlines can be performed at approximately the same distance from the junction center as the detection of the junctions, i.e. $l_d \approx r$, because both parameters are directly related to the intrinsic size of the junctions. As shown in Fig. 2(c), we can relate the minimum angle of separation α to the width w of the unimodal structures. As such, as a rule
of thumb we state that $r \le \frac{2w}{\sin \frac{1}{2}\alpha}$, which ensures that the entire mixed zone is enclosed. As such, the parameters are basically reduced to an angular resolution parameter α and the width of the unimodal structures. Optionally, one could perform the clustering and the direction measurements at different positions along the streamlines. As the intrinsic scale of junctions can differ over the image, one could make r and $l_d$ position dependent [5]. This could also minimize interaction of structures in close proximity to each other. After inspection of the individual steps of the algorithm, we estimate that the time complexity of the algorithm is approximately a factor of 10 higher than that of an eigensystem analysis of the gradient structure tensor. Further, the data complexity is low and only widely accepted tools are used.
References
1. Danielsson, P.-E., Lin, Q., Ye, Q.-Z.: Efficient detection of second-degree variations in 2D and 3D images. J. Vis. Comm. Image Represent. 12, 255–305 (2001)
2. Dijk, J., van Ginkel, M., van Asselt, R.J., van Vliet, L.J., Verbeek, P.W.: A new sharpness measure based on Gaussian lines and edges. In: Petkov, N., Westenberg, M.A. (eds.) CAIP 2003. LNCS, vol. 2756, pp. 149–156. Springer, Heidelberg (2003)
3. Faas, F.G.A., van Vliet, L.J.: 3D-orientation space; filters and sampling. In: Bigun, J., Gustavsson, T. (eds.) SCIA 2003. LNCS, vol. 2749, pp. 36–42. Springer, Heidelberg (2003)
4. Freeman, W.T., Adelson, E.H.: The design and use of steerable filters. IEEE Trans. Patt. Anal. Mach. Intell. 13(9), 891–906 (1991)
5. Garding, J., Lindeberg, T.: Direct computation of shape cues using scale-adapted spatial derivative operators. Int. J. Comput. Vis. 17, 163–191 (1996)
6. van Ginkel, M., Verbeek, P.W., van Vliet, L.J.: Orientation selectivity for orientation estimation. In: Proceedings of the 10th Scandinavian Conference on Image Analysis, Lappeenranta, Finland, vol. I, pp. 533–537, June 9-11 (1997)
7. Granlund, G.H.: In search of a general picture processing operator. Computer Graphics and Image Processing 8, 155–173 (1978)
8. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proc. of the Fourth Alvey Vision Conference, pp. 147–151 (August 31–September 2, 1988)
9. Knutsson, H.: Representing local structure using tensors. In: The 6th Scandinavian Conference on Image Analysis, pp. 244–251, Oulu, Finland, June 19-22 (1989)
10. Ridler, T.W., Calvard, S.: Picture thresholding using an iterative selection method. IEEE Trans. Syst. Man. Cyb. 8, 630–632 (1978)
Decomposing a Simple Polygon into Trapezoids
Fajie Li and Reinhard Klette
Computer Science Department, The University of Auckland, Auckland, New Zealand

Abstract. Chazelle's triangulation [1] today forms the common basis for linear-time Euclidean shortest path (ESP) calculations (where start and end point are given within a simple polygon). This paper provides an alternative method for subdividing a simple polygon into "basic shapes", using trapezoids instead of triangles. The authors consider the presented method to be substantially simpler than the linear-time triangulation method. However, it requires a sorting step (of a subset of vertices of the given simple polygon); all the other subprocesses are linear time.

Keywords: computational geometry, simple polygon, Euclidean shortest path, rubberband algorithm.
1 Introduction
This paper proposes a decomposition of any simple polygon into trapezoids. Subsequent algorithms, such as calculating a Euclidean shortest path (ESP), or further processing for pattern recognition or shape analysis purposes, may benefit from this. In [2] the authors claimed to have an $O(n \log n)$ rubberband algorithm for calculating ESPs in simple polygons. Actually, this needs to be corrected: the algorithm given in [2] has worst-case complexity $O(n^2)$. However, using the trapezoid decomposition given in this paper, it is possible to calculate step sets more efficiently (which allows the algorithm given in [2] to be modified into one of $O(n \log n)$ time complexity). Section 2 provides necessary definitions and theorems. Section 3 presents the trapezoid decomposition algorithm together with examples and a time analysis. Section 4 concludes the paper.
2 Basics
Let $\Pi^{\bullet}$ be the (topological) closure of a simple polygon $\Pi$, $\partial\Pi$ its frontier, $V(\Pi)$ the set of vertices of $\Pi$, and $E(\Pi)$ the set of edges of $\Pi$. For $v \in \partial\Pi$, $v_x$ and $v_y$ are the x- and y-coordinates of $v$, respectively. Let $v_1, v_2, v_3 \in \partial\Pi$. Vertices are ordered either by a clockwise or counter-clockwise scan through $\partial\Pi$. Let $\rho(v_1, \Pi, v_2)$ be the polygonal path from $v_1$ to $v_2$, contained in $\partial\Pi$. Let $\rho(v_1, \Pi, v_2, \Pi, v_3)$ be the polygonal path from $v_1$ to $v_2$ and then to $v_3$ around $\partial\Pi$. Let $u, v, w \in \partial\Pi$ such that $u_y = v_y = w_y$; consider $\rho(u, \Pi, v, \Pi, w) \subset \partial\Pi$. Let $\Pi'$ be a simple polygon obtained by adding a 'closing edge' $uw$ to $\rho(u, \Pi, v, \Pi, w)$, denoted by $\Pi' = uw + \rho(u, \Pi, v, \Pi, w)$. $\Pi'$ is called an up- (down-) polygon with respect to $v$ if, for each $p \in \Pi'^{\bullet}$, $p_y \ge (\le)\ v_y$. For example, in Figure 1, $\Pi_l$ and $\Pi_r$ are up-polygons with respect to $v_1$. Π is a down-polygon with respect to $v_1$.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 726–733, 2007. © Springer-Verlag Berlin Heidelberg 2007

Fig. 1. Illustration of the given definitions

Let $\Pi$, $\Pi'$ be two simple polygons with $\Pi'^{\bullet} \subset \Pi^{\bullet}$; let $\{v_0, v_1, v_2, \ldots, v_{n-1}\}$ and $\{v'_0, v'_1, v'_2, \ldots, v'_{n-1}\}$ be such orders of the vertices of $\Pi$ and $\Pi'$, respectively, that the following becomes true: for a sufficiently small $\varepsilon > 0$, we have $d_e(v_0, v'_0) = \varepsilon$ for the Euclidean distance between $v_0$ and $v'_0$, edge $v'_i v'_{i+1}$ is parallel to edge $v_i v_{i+1}$, and $v_i v'_i$ bisects the angle $\angle v_{i-1} v_i v_{i+1}$, where $i = 0, 1, 2, \ldots, n$, and the addition or subtraction of indices is taken mod n. In this case, $\Pi'$ is called an inner simple polygon of $\Pi$, denoted by $\Pi(v_0, \varepsilon)$. In Figure 2, $u_0 u_1 u_2 u_3 u_4 u_5 u_0$ is an inner polygon of $v_0 v_1 v_2 v_3 v_4 v_5 v_0$.

Let $v_{i-1}$, $v_i$, $v_{i+1}$ and $v_{i+2}$ be four consecutive vertices of $\Pi$. If $(v_{i-1})_y < (v_i)_y$, $(v_i)_y = (v_{i+1})_y$ and $(v_i)_y > (v_{i+2})_y$ [or $(v_{i-1})_y > (v_i)_y$, $(v_i)_y = (v_{i+1})_y$ and $(v_i)_y < (v_{i+2})_y$], and the point $(0.5 \cdot ((v_i)_x + (v_{i+1})_x), (v_i)_y + \varepsilon)$ [or $(0.5 \cdot ((v_i)_x + (v_{i+1})_x), (v_i)_y - \varepsilon)$] is inside $\Pi$, then $v_i v_{i+1}$ is called an up- (down-) stable edge of $\Pi$ with respect to the xy-coordinate system used for representing $\Pi$. If $(v_{i-1})_y < (v_i)_y$, $(v_i)_y = (v_{i+1})_y$ and $(v_i)_y > (v_{i+2})_y$, and the point $(0.5 \cdot ((v_i)_x + (v_{i+1})_x), (v_i)_y - \varepsilon)$ is in $\Pi$, then $v_i v_{i+1}$ is called a maximal edge of $\Pi$ with respect to the chosen xy-coordinate system.

Let $v_{i-1}$, $v_i$ and $v_{i+1}$ be three consecutive vertices of $\Pi$. If $(v_{i-1})_y < (v_i)_y$, $(v_i)_y > (v_{i+1})_y$ [or $(v_{i-1})_y > (v_i)_y$, $(v_i)_y < (v_{i+1})_y$], and there exist points $u_i \in V(\Pi(v_0, \varepsilon))$ [this is the set of vertices of $\Pi(v_0, \varepsilon)$] such that $\triangle v_{i-1} v_i v_{i+1}$ does not (!) contain $u_i$, then $v_i$ is called an up- (down-) stable point of $\Pi$ with respect to the xy-coordinate system used for representing $\Pi$.
Fig. 2. Example of an inner polygon
In Figure 1, $v_2$, $v_3$, $v_5$, $v_6$, $v_7$ and $v_8$ are up-stable points; $v_1$, $v_4$, $v_9$, $v_{10}$, $v_{11}$, $v_{12}$ and $v_{13}$ are down-stable points. If $(v_{i-1})_y < (v_i)_y$, $(v_i)_y > (v_{i+1})_y$ [or $(v_{i-1})_y > (v_i)_y$, $(v_i)_y < (v_{i+1})_y$] and there exist points $u_i \in V(\Pi(v_0, \varepsilon))$ such that $\triangle v_{i-1} v_i v_{i+1}$ does (!) contain $u_i$, then $v_i$ is called a maximal (minimal) point of $\Pi$ with respect to the chosen xy-coordinate system. In Figure 1, $u_1$ is a maximal point and $u_3$ is a minimal point. As is common, a simple polygon is called monotonous if it has both a unique up- and down-stable point or edge. In Figure 1, $u_1 u_2 u_3 u_4 u_5 u_6 u_7 u_8 u_1$ is a monotonous simple polygon.

Since the discussion of our algorithm in the case of an up- or down-stable edge is analogous to that of an up- or down-stable point, we will just detail the case of up- or down-stable points.

Let $v$ be an up-stable point of $\Pi$. Let $S_v$ be the set of minimal points $u$ of $\Pi$ such that there exists a polygonal path from $v$ to $u$ around $\partial\Pi$ and $u_y < v_y$. Let $u' \in S_v$ such that $u'_y = \max\{u_y : u \in S_v\}$. If there exists a point $w \in \partial\Pi$ such that the segment $u'w \subset \Pi^{\bullet}$ and $w_y = u'_y$, then $u'w$ is called a cut edge of $\Pi$. The polygonal path $\rho(v, \Pi, u')$ is called a decreasing polygonal path from $v$ to $u'w$. $v$ is called an up-stable point with respect to the cut edge $u'w$, and $u'w$ is called a cut edge with respect to the up-stable point $v$. $u'$ is called a closest minimal point with respect to $v$ around the frontier of $\Pi$. In Figure 1, $v_8 v_9 u_9 u_3$ is a decreasing polygonal path from $v_8$ to $u_3 u_4$, and $u_3$ is a closest minimal point with respect to $v_8$.

If $u, w \in \partial\Pi$ such that $u_y = v_y = w_y$, $u_x < v_x < w_x$, and $uv, vw \subset \Pi^{\bullet}$, then $u$ and $w$ are called the left and right intersection points of $v$, respectively. In Figure 1, $v_{1l}$ and $v_{1r}$ are the left and right intersection points of $v_1$, respectively.

Let $v$ be a maximal point of $\Pi$. Let $S_v$ be the set of down-stable points $u$ of $\Pi$ such that there exists a polygonal path from $v$ to $u$ around $\partial\Pi$ and $u_y < v_y$. Let $u' \in S_v$ such that $u'_y = \max\{u_y : u \in S_v\}$. $u'$ is called a closest down-stable point with respect to $v$ around the frontier of $\Pi$.
In Figure 1, v9 is a closest down-stable point with respect to the maximal point above the segment v9 u9 .
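The four kinds of extremal vertices defined above can be illustrated with a short sketch. Instead of constructing the inner polygon $\Pi(v_0, \varepsilon)$, it uses a swapped-in test: for a polygon given in counter-clockwise order, a convex corner (left turn) has the interior inside the triangle of the three consecutive vertices, while a reflex corner does not. The sketch assumes counter-clockwise input and no horizontal edges; the function name and the label "regular" for non-extremal vertices are hypothetical.

```python
def classify_vertices(polygon):
    """Label each vertex of a counter-clockwise simple polygon without horizontal edges.

    Returns one of 'maximal', 'minimal', 'up-stable', 'down-stable', 'regular' per vertex.
    """
    n = len(polygon)
    labels = []
    for i in range(n):
        px, py = polygon[i - 1]
        cx, cy = polygon[i]
        nx, ny = polygon[(i + 1) % n]
        # z-component of (cur - prev) x (next - cur); > 0 means a convex corner
        # for counter-clockwise orientation, < 0 a reflex corner.
        cross = (cx - px) * (ny - cy) - (cy - py) * (nx - cx)
        if py < cy and ny < cy:      # both neighbors lie below: local top
            labels.append("maximal" if cross > 0 else "up-stable")
        elif py > cy and ny > cy:    # both neighbors lie above: local bottom
            labels.append("minimal" if cross > 0 else "down-stable")
        else:
            labels.append("regular")
    return labels


# A small polygon with a notch cut in from the top: the notch tip is down-stable.
notched = [(0, 0), (5, 1), (4, 4), (2, 2), (0, 5)]
print(classify_vertices(notched))
```

The pass is a single loop over the vertices, which matches the linear bound stated for computing stable points; handling of stable edges (horizontal edges) is deliberately left out of this sketch.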
3 The Algorithm
This section describes the decomposition algorithm and its three subroutines, which are Procedures 1–3. Procedure 1 is used to update the left or right intersection points of up-stable points. It will be applied by Procedure 2 to remove up-stable points. Procedure 3 applies Procedures 1 and 2 to compute the left and right intersection points of all up-stable points, from bottom to top. This procedure will be called in Step 3 of the main algorithm, which processes both up- and down-stable points, from top to bottom.

In Procedure 1, let $\Pi$ be a simple polygon such that $V(\Pi)$ does not have any up-stable point.

Procedure 1: update left or right intersection points
1. Decompose $\Pi$ into a set of trapezoids, denoted by $T_\Pi$ (note: straightforward, because there is no up-stable point).
2. For each edge $e \in E(\Pi)$:
  2.1. If $e$ is a cut edge, then do the following:
    2.1.1. Let $v_e$ be the up-stable point with respect to $e$.
    2.1.2. Find a trapezoid $t \in T_\Pi$ such that the bottom edge of $t$ and edge $e$ have a common end point in $V(\Pi)$.
    2.1.3. Let $\rho_e$ be the decreasing polygonal path from $v_e$ to $e$.
    2.1.4. Find a stack of trapezoids, denoted by $T_e$ ($\subseteq T_\Pi$), such that for each $t \in T_e$, $t^{\bullet} \cap \rho_e \ne \emptyset$.
    2.1.5. Let $t_e$ be the topmost trapezoid in $T_e$.
    2.1.6. Compute the intersection points of the line $y = v_{ey}$ with the two edges on the left and right side of $t_e$.
    2.1.7. Update the left and right intersection points of $v_e$ by comparing the results of Step 2.1.6 with the initial left and right intersection points of $v_e$.

Let $I$ be an interval of real numbers, and $M_I$ be a subset of maximal points of $\Pi$ such that for each element $v \in M_I$, $v_y \in I$. Suppose $M_I$ is sorted by decreasing y-coordinate.

Procedure 2: remove up-stable points
1. Let $M_I = \{v_1, v_2, \ldots, v_k\}$.
2. For each $i \in \{1, 2, \ldots, k\}$, do the following:
  2.1. Find a closest down-stable point with respect to $v_i$ around the frontier of $\Pi$, denoted by $u_i$.
  2.2. Find a point $w_i \in \partial\Pi$ such that $\rho(u_i, \Pi, v_i, \Pi, w_i)$ is the shortest polygonal path in $\partial\Pi$ such that $u_{iy} = w_{iy}$.
  2.3. Update $\Pi$ by replacing $\rho(u_i, \Pi, v_i, \Pi, w_i)$ by the edge $u_i w_i$.
  2.4. Let $\Pi_i = u_i w_i + \rho(u_i, \Pi, v_i, \Pi, w_i)$.
  2.5. Now let $\Pi_i$ be the input of Procedure 1 and update the left and right intersection points of all possible up-stable points.

Procedure 3: compute all left or right intersection points
1. Let $U = \{v_1, v_2, \ldots, v_k\}$ be the sorted set of up-stable points of $\Pi$ such that $v_{1y} < v_{2y} < \cdots < v_{ky}$.
2. For each $i \in \{1, 2, \ldots, k\}$, do the following:
  2.1. Find a closest minimal point with respect to $v_i$ by following the frontier of $\Pi$, denoted by $u_i$.
  2.2. Let $I = [a, b]$ where $a = u_{iy}$ and $b = v_{iy}$.
  2.3. Compute $M_I$.
  2.4. If $M_I \ne \emptyset$, then apply Procedure 2 and go to Step 2.1.
  2.5. Otherwise, find a point $w_i$ such that $\rho(u_i, \Pi, v_i, \Pi, w_i)$ is the shortest polygonal path of $\partial\Pi$ with $u_{iy} = w_{iy}$.
  2.6. Set initial left and right intersection points of $v_i$ as follows:
    2.6.1. If $v_{(i-1)y} = v_{iy} = v_{(i+1)y}$, then let the initial left and right intersection points of $v_i$ be $v_{i-1}$ and $v_{i+1}$, respectively.
    2.6.2. Otherwise, find two points $w_{il}$, $w_{ir}$ such that $\rho(w_{il}, \Pi, u_i, \Pi, v_i, \Pi, w_{ir})$ is the shortest polygonal path of $\partial\Pi$ with $w_{ily} = v_{iy} = w_{iry}$.
    2.6.3. Let the initial left and right intersection points of $v_i$ be $w_{il}$ and $w_{ir}$, respectively.
  2.7. Update $\Pi$ by replacing $\rho(u_i, \Pi, v_i, \Pi, w_i)$ by the edge $u_i w_i$.
  2.8. Now let $\Pi$ be the input for Procedure 1; update the left and right intersection points of all possible up-stable points.

Main Algorithm: Decomposition Algorithm
1. Let $S = \emptyset$ and $T = \emptyset$.
2. Compute the set of up- and down-stable points, denoted by $V$.
3. Apply Procedure 3 to compute the left and right intersection points for all up-stable points in $V$.
4. Sort $V$ by decreasing y-coordinate.
5. For each $i \in \{1, 2, \ldots, k\}$, with $k = |V|$, do the following:
  5.1. Case 1. $v_i$ is a down-stable point.
    5.1.1. Find two points $v_{il}, v_{ir} \in \partial\Pi$ such that $v_{ily} = v_{iy} = v_{iry}$ and $v_{ilx} < v_{ix} < v_{irx}$.
    5.1.2a. Let $\Pi_l = v_{il} v_i + \rho(v_{il}, \Pi, v_i)$ (up-polygon).
    5.1.2b. Let $\Pi_r = v_i v_{ir} + \rho(v_i, \Pi, v_{ir})$ (up-polygon).
    5.1.2c. Let $\Pi' = v_{il} v_{ir} + \rho(v_{il}, \Pi, v_{ir})$ (down-polygon).
    5.1.3. Let $S = S \cup \{\Pi_l, \Pi_r\}$.
    5.1.4. Let $\Pi = \Pi'$.
    5.1.5. Let $i = i + 1$ and go to Step 5.
  5.2. Case 2. $v_i$ is an up-stable point.
    5.2.1a. Let $v_{il}$, $v_{ir}$ be the left and right intersection points of $v_i$, respectively.
    5.2.2b. Let $\Pi_l = v_{il} v_i + \rho(v_{il}, \Pi, v_i)$ (down-polygon).
    5.2.2c. Let $\Pi_r = v_i v_{ir} + \rho(v_i, \Pi, v_{ir})$ (down-polygon).
    5.2.2d. Let $\Pi' = v_{il} v_{ir} + \rho(v_{il}, \Pi, v_{ir})$ (up-polygon).
    5.2.3. Let $S = S \cup \{\Pi'\}$.
    5.2.4. Let $\Pi = \{\Pi_l, \Pi_r\}$.
    5.2.5. Let $i = i + 1$ and go to Step 5.
6. For each $j \in \{1, 2, \ldots, n\}$, with $n = |S|$, do the following:
  6.1. For each (monotonous) polygon $\Pi_j \in S$, decompose it into a stack of trapezoids, denoted by $T_j$.
  6.2. Let $T = T \cup T_j$.
7. Output $T$.

3.1 Time Complexity of the Algorithm
Lemma 1. The set of up-stable (or down-stable, or maximal) points of $\Pi$ can be computed in $O(n)$, where $n = |V(\Pi)|$.

Proof. For a sufficiently small number $\varepsilon > 0$, the "start" vertex $v_0$ of an inner polygon $\Pi(v_0, \varepsilon)$ can be computed in $O(n)$, where $n = |V(\Pi)|$ (see [3]). Let $v'$ denote the vertex of the inner polygon that corresponds to $v$. For three consecutive vertices $u, v, w \in V(\Pi)$ such that $u_x < v_x < w_x$, if $u_y < v_y$, $v_y > w_y$ and $v'_y > v_y$, then $v$ is an up-stable point (if $u_y < v_y$, $v_y > w_y$ and $v'_y < v_y$, it follows that $v$ is a maximal point; if $u_y > v_y$, $v_y < w_y$ and $v'_y < v_y$, then $v$ is a down-stable point).

Lemma 2. Procedure 1 can be computed in $O(n \log n)$, where $n = |V(\Pi)|$.

Proof. If $\Pi$ is monotonous, then (obviously) it can be decomposed into a stack of trapezoids in $O(|V(\Pi)|)$. Otherwise, by assumption, $\Pi$ can only have a finite number of down-stable points. Analogous to Step 5.1 in the decomposition algorithm, $\Pi$ can be decomposed into a set of trapezoids in $O(|V(\Pi)|)$. Thus, Step 1 can be computed in $O(|V(\Pi)|)$. All steps following Step 2, except Step 2.1.4, can be computed in $O(1)$. Step 2.1.4 can be computed in $O(n \log n)$, where $n = |V(\Pi)|$. Let $S_d$ be the set of all down-stable points of $\Pi$.

A1. Sort $S_d$ according to y-coordinates; we obtain $S_d = \{v_1, v_2, \ldots, v_k\}$ such that $v_{1y} \le v_{2y} \le v_{3y} \le \ldots \le v_{ky}$.
A2. Let $m = \min\{|v_{(i-1)y} - v_{iy}| : |v_{(i-1)y} - v_{iy}| > 0,\ i = 2, 3, \ldots, k\}$, and $M = \max\{|v_{(i-1)x} - v_{ix}| : |v_{(i-1)x} - v_{ix}| > 0,\ i = 2, 3, \ldots, k\}$.
A3. Transform $\Pi$ by rotating it by angle $\theta$ anticlockwise about the origin (i.e., for each point $p = (x, y) \in \Pi^{\bullet}$, update it by point $(x', y')$, where $x' = x \cos\theta - y \sin\theta$, $y' = x \sin\theta + y \cos\theta$).

Without loss of generality, assume that $m = |v_{1y} - v_{2y}| > 0$. We have that

$$|v'_{1y} - v'_{2y}| = |(v_{1x} - v_{2x}) \sin\theta + (v_{1y} - v_{2y}) \cos\theta| \ge |(v_{1y} - v_{2y}) \cos\theta| - |(v_{1x} - v_{2x}) \sin\theta| \ge m \cos\theta - M \sin\theta \to m \quad (\theta \to 0)$$
Therefore, there exists a sufficiently small angle $\theta > 0$ such that, after rotating, each down-stable point is still a down-stable point, and all down-stable points have unique y-coordinates. This implies that, for each trapezoid $t \in T_\Pi$, there exist at most two trapezoids $t_1, t_2 \in T_\Pi$ such that $t_1$ and $t_2$ each have a common vertex with $t$. (In this case, $t_1$ and $t_2$ have a common vertex which is a down-stable point.) Since Step A2 can be computed in $O(|S_d|)$ and Step A3 can be computed in $O(1)$, it follows that Step 2.1.4 can be computed in $O(k)$, where $k$ is the number of vertices of the decreasing polygonal path from $v_e$ to $e$, after sorting (Step A1) in $O(|S_d| \log |S_d|)$.

Lemma 3. Procedure 2 can be computed in $O(n \log n)$ time, where $n$ is the number of vertices of the original simple polygon $\Pi$.

Proof. By Lemma 1, Step 1 can be computed in $O(n \log n)$, where $n$ is the number of vertices of the original simple polygon $\Pi$. Step 2.1 can be computed in $O(n_u)$, where $n_u$ is the number of vertices of $\rho(u_i, \Pi, v_i)$. Step 2.2 can be computed in $O(n_w)$, where $n_w$ is the number of vertices of $\rho(v_i, \Pi, w_i)$. Steps 2.3 and 2.4 can be computed in $O(1)$. By Lemma 2, Step 2.5 can be computed in $O(n_i \log n_i)$, where $n_i = |V(\Pi_i)|$. Thus, Step 2 can be computed in $O(n \log n)$ altogether, where $n$ is the number of vertices of $\Pi$.

Lemma 4. Procedure 3 can be computed in $O(n \log n)$ time, where $n$ is the number of vertices of the original simple polygon $\Pi$.

Proof. By Lemma 1, Step 1 can be computed in $O(n \log n)$, where $n$ is the number of vertices of the original simple polygon $\Pi$. Step 2.1 can be computed in $O(n_u)$, where $n_u$ is the number of vertices of $\rho(u_i, \Pi, v_i)$. Step 2.2 can be computed in $O(1)$. Step 2.3 can be computed in $O(|M_I|)$. By Lemma 3, Step 2.4 can be computed in $O(n_u \log n_u)$, where $n_u = |V(\rho(u_i, \Pi, v_i))|$. Step 2.5 can be computed in $O(n_w)$, where $n_w$ is the number of vertices of $\rho(u_i, \Pi, w_i)$. Step 2.6.1 can be computed in $O(1)$. Step 2.6.2 can be computed in $O(n_i)$, where $n_i = |V(\rho(u_i, \Pi, w_{il}))| + |V(\rho(w_i, \Pi, w_{ir}))|$. Step 2.6.3 can be computed in $O(1)$. Step 2.7 can be computed in $O(1)$. By Lemma 2, Step 2.8 can be computed in $O(n \log n)$, where $n$ is the number of vertices of the updated $\Pi$. Therefore, Procedure 3 can be computed in $O(n \log n)$, where $n$ is the number of vertices of the original simple polygon $\Pi$.

Theorem 1. The given decomposition algorithm has time complexity $O(n \log n)$, where $n$ is the number of vertices of the original simple polygon $\Pi$.

Proof. Step 1 can be computed in $O(1)$. By Lemma 1, Step 2 can be computed in $O(n \log n)$, where $n$ is the number of vertices of the original simple polygon $\Pi$. By Lemma 4, Step 3 can be computed in $O(n \log n)$. Step 4 can be computed in $O(|V| \log |V|)$. Step 5.1.1 can be computed in $O(n_i)$, where $n_i = |V(\rho(v_{il}, \Pi, v_i))| + |V(\rho(v_i, \Pi, v_{ir}))|$. Steps 5.1.2a – 5.1.5 can be computed in $O(1)$. Thus, Step 5.1 can be computed in $O(n_i)$. Step 5.2 can be computed in $O(1)$. By Lemma 3, Step 6.1
Fig. 3. Output of the decomposition algorithm for the polygon of Figure 1
can be computed in O(|Πj |). Thus, Step 6 can be computed in O(n), where n is the number of the vertices of the original simple polygon Π. Step 7 can be computed in O(1). Therefore, the decomposition algorithm can be computed in O(n log n), where n is the number of the vertices of Π.
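Step 6.1, the linear-time decomposition of a monotone polygon into a stack of trapezoids, can be sketched by slicing the polygon with a horizontal line through every vertex. The paper does not spell out this step, so the following is only one possible realization under stated assumptions: the polygon is given as a left and a right chain, each listed from the common top vertex to the common bottom vertex with strictly decreasing y. The function names and the tuple layout of a trapezoid are hypothetical.

```python
def x_at(chain, idx, y):
    """x-coordinate at height y on the segment chain[idx] -> chain[idx + 1]."""
    (x0, y0), (x1, y1) = chain[idx], chain[idx + 1]
    t = (y0 - y) / (y0 - y1)
    return x0 + t * (x1 - x0)


def trapezoid_stack(left, right):
    """Slice a y-monotone polygon into trapezoids, from top to bottom.

    Each trapezoid is (y_top, xl_top, xr_top, y_bot, xl_bot, xr_bot);
    the first and last slab may degenerate into a triangle.
    """
    # Merge the y-levels of both chains in decreasing order (linear time).
    levels, i, j = [], 0, 0
    while i < len(left) or j < len(right):
        yl = left[i][1] if i < len(left) else float("-inf")
        yr = right[j][1] if j < len(right) else float("-inf")
        y = max(yl, yr)
        levels.append(y)
        if yl == y:
            i += 1
        if yr == y:
            j += 1

    trapezoids, i, j = [], 0, 0
    for y_top, y_bot in zip(levels, levels[1:]):
        # Move to the chain segment that spans the current slab.
        while left[i + 1][1] >= y_top:
            i += 1
        while right[j + 1][1] >= y_top:
            j += 1
        trapezoids.append((y_top, x_at(left, i, y_top), x_at(right, j, y_top),
                           y_bot, x_at(left, i, y_bot), x_at(right, j, y_bot)))
    return trapezoids


left_chain = [(2, 4), (0, 2), (1, 0)]
right_chain = [(2, 4), (4, 3), (3, 1), (1, 0)]
for trapezoid in trapezoid_stack(left_chain, right_chain):
    print(trapezoid)
```

Both chain pointers only move forward, so the work is proportional to the number of vertices, in line with the $O(|\Pi_j|)$ bound used in the proof.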
4 Conclusions
The paper described an $O(n \log n)$ time algorithm for the decomposition of any simple polygon into trapezoids. (See Figure 3 for an example.) Of course, these can then be further divided into triangles. The algorithm was described on these eight pages (compared to 40 journal pages of paper [1]). (For a longer version of this paper, with some examples illustrating processing steps, see CITR-TR-199.) The given algorithm will make it possible to improve the work reported in [2]. The authors conjecture that the $O(n \log n)$ sorting step in the given algorithm can also be replaced by incorporating some type of (linear-time) scan; future studies will show.
References
1. Chazelle, B.: Triangulating a simple polygon in linear time. Discrete Computational Geometry 6, 485–524 (1991)
2. Li, F., Klette, R.: Finding the shortest path between two points in a simple polygon by applying a rubberband algorithm. In: Chang, L.-W., Lie, W.-N. (eds.) PSIVT 2006. LNCS, vol. 4319, pp. 280–291. Springer, Heidelberg (2006)
3. Sunday, D.: Algorithm 3: Fast winding number inclusion of a point in a polygon. (last visit: April 2007) See www.geometryalgorithms.com/Archive/algorithm 0103/
Shape Recognition and Retrieval: A Structural Approach Using Velocity Function
Hamidreza Zaboli, Mohammad Rahmati, and Abdolreza Mirzaei
Department of Computer Engineering, Amirkabir University of Technology
{zaboli, rahmati, mirzaei}@aut.ac.ir

Abstract. In this paper a new method for matching and recognition of shapes extracted from images, based on the structure and skeleton of the shape, is proposed. In this method, a function called the "velocity function", derived from the radius function, is introduced, and values of this function are calculated for skeletal points. Each skeletal curve is then transformed into a vector containing the values of the velocity function calculated along the curve. Next, a tree-like structure is extracted from the skeleton of the shape so that each leaf of the tree represents a skeletal curve of the skeleton, with a velocity vector as its descriptor. These vectors are matched at the fine-grain level using dynamic programming, and at the coarse-grain level the tree-like structures are matched using a greedy approach. At the final stage, the difference between the shapes is determined by a distance value resulting from the two-level matching process. This distance is used as a measure for shape matching and recognition. Experiments are performed on two standard binary image databases in the presence of various transformations. Results confirm the efficiency and high recognition rate of our method.

Keywords: Velocity function, radius function, shape matching, shape recognition, skeleton.
1 Introduction

In the computer vision field, object recognition involves two key steps: image segmentation and recognition [1]. Often object recognition deals with shape recognition from a database containing special patterns which have been extracted from shapes based on special criteria. The diversity of methods stems from the different criteria and viewpoints they adopt. From a general point of view, these criteria and approaches fall into two main groups: contour-based methods and content-based methods [2]. Contour-based methods use only boundary information of a shape for recognition, whereas content-based methods use all points of the shape. Moreover, the content-based methods are also divided into global and structural methods, depending on whether shape content is decomposed into parts or not [2]. In this research, we present a method which falls into the structural content-based methods. The methods in this area either decompose internal regions of the shape [3,4,5] or generate a skeleton of the shape [6,7,8,9]. In both approaches some features are extracted from the obtained components or the skeleton. Matching these features then leads to recognition of the shapes.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 734–741, 2007. © Springer-Verlag Berlin Heidelberg 2007
An important feature which may be extracted from the skeleton is the set of values of the radius function at different points of the skeleton. Values of this function on the skeleton describe different parts of the shape. Siddiqi et al. [6,7] have used this function to classify skeletal points into four types. In this paper, we define a new function called the "velocity function" and use it as a basic factor for shape description. Because the values of the radius function depend on the style and scale of each shape, the radius function cannot be a perfect feature for all cases. In fact, we need the variations of this function to describe a shape. These variations are equal to the first order differential of the radius function, which we call the velocity function. An advantage of applying the velocity function is that it is scale invariant. This characteristic comes implicitly with the differential of the function, without any computational overhead. We use the velocity function and its properties for describing and representing the different parts of shapes. The paper is organized as follows: in Section 2, the velocity function is defined and values of this function are calculated for points on the skeleton of a shape. In Section 3, the skeleton of the shape is generated and a tree-like structure is extracted from it. In Section 4, the method for comparing the structures is presented, and the distance between the structures and the shapes is computed as a single value. In Section 5, experiments are presented, and Section 6 concludes the paper.
2 Definition of the Velocity Function on Skeleton

In order to define the velocity function, it is necessary to introduce the radius function on the skeleton of a shape. Assume S denotes the skeleton of a shape and consider a point p where $p \in S$. The radius function R(p) is the radius of the maximal inscribed disc centered at p that touches the boundary of the shape [2,10,11]. Moving the point p(x, y) along S causes the value of the radius function to vary. Note that the different parts of a shape, containing convexities and concavities, are generated by the variations of the radius function. Therefore, we define the velocity function V(p) as the first order differential of the radius function R(p) on the points of the skeleton as follows:
$$\forall p_1, p_2 \in S,\; p_1(x_1, y_1) \ne p_2(x_2, y_2),\; \|p_2 - p_1\| < \varepsilon: \quad V(p_2) = \frac{R(p_2) - R(p_1)}{\|p_2 - p_1\|}$$
Fig. 1. A part of a shape used to define the velocity function V(p). With respect to the direction, R ( p2 ) > R ( p1 ) ⇒ V2 > 0 .
H. Zaboli, M. Rahmati, and A. Mirzaei
where p1 and p2 are points on the skeleton S and ||p2-p1|| is a distance measure between p1 and p2.¹ Fig. 1 illustrates this definition. The velocity function can therefore be defined and calculated for all points on the skeleton. To calculate these values on the skeletal curves, we consider each curve segment² separately, starting at a branch point. We traverse each curve segment from the branch point to its other end (a terminal point or another branch point) and calculate the values of the velocity function for the points on the curve segment. In this way, we obtain a vector containing the values of the velocity function for each curve segment, which we call the “velocity vector”. A curve with branch points at both ends has two velocity vectors, each originating at one of the two distinct branch points. These velocity vectors are used to compare the different parts of shapes.
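The definition above can be sketched as a finite difference along one curve segment. This is only an illustration under stated assumptions: the function name `velocity_vector`, the sample points and the radii are hypothetical, and the Euclidean step between consecutive skeleton points stands in for the arc length used in the paper.

```python
import math

def velocity_vector(points, radii):
    """Finite-difference velocities V(p_k) = (R(p_k) - R(p_{k-1})) / ||p_k - p_{k-1}||.

    points: skeleton points of one curve segment, ordered from its branch point.
    radii:  radius function R sampled at those points.
    """
    if len(points) != len(radii):
        raise ValueError("points and radii must have the same length")
    velocities = []
    for (x1, y1), (x2, y2), r1, r2 in zip(points, points[1:], radii, radii[1:]):
        step = math.hypot(x2 - x1, y2 - y1)  # assumed distance measure
        if step == 0:
            raise ValueError("consecutive skeleton points must be distinct")
        velocities.append((r2 - r1) / step)
    return velocities

# Hypothetical curve segment traversed from its branch point outwards.
points = [(0, 0), (1, 0), (2, 0), (3, 0)]
radii = [4.0, 4.0, 3.0, 1.0]
print(velocity_vector(points, radii))

# Scaling the shape by 2 scales radii and steps alike, so V is unchanged.
scaled_points = [(2 * x, 2 * y) for x, y in points]
scaled_radii = [2 * r for r in radii]
print(velocity_vector(scaled_points, scaled_radii))
```

Both calls return the same vector, which illustrates the scale invariance claimed for the velocity function.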
3 Skeleton and Extracting Structure
The skeleton of a shape is a set of thinned curves and lines [10]. There are various types of skeletons for describing shapes. The medial axis is one type of skeleton and is defined as the locus of the centers of maximal disks that fit within the shape [2,10,11,12]. We use the Blum medial axis [13] as the skeleton of shapes. Fig. 2(a-c) illustrates the shape of a dog and its medial axis. In Fig. 2(c), the skeleton is decomposed into seven skeletal curves (curve segments) labeled 1 to 7. As shown in Fig. 2(d), in the skeleton decomposition phase, the curve segments whose two end points are both branch points, e.g. curve segments 3 and 5, are considered twice.
3.1 Extracting Tree-Like Structure from the Skeleton
We treat the skeleton of a shape as a tree-like structure in which the branch points are children of the root of the tree (branch nodes) and the skeletal curves are the leaves of the tree. We use this structure to describe the shape at the coarse-grain level. We define three rules for extracting a tree-like structure from the skeleton; applying them yields a tree for each skeleton. The rules are as follows:
1. ∀bp ∈ S: Add_as_Child(bp, root)
2. ∀bp ∈ S, ∀curve ∈ S: Origin(curve) = bp ⇒ Add_as_Child(curve, bp)
3. ∀curve: Origin(curve) = null ⇒ create a dummy branch point bp, Add_as_Child(bp, root), Add_as_Child(curve, bp)
¹ In our application, ||p2-p1|| is equal to the length of the skeletal curve segment between p1 and p2, although the Euclidean distance between p1 and p2 may also be used.
² We use the terms “curve segment” and “skeletal curve” interchangeably for a segment of the skeleton that connects two branch points, or a branch point and a terminal point, with no branch point between its two end points.
Shape Recognition and Retrieval: A Structural Approach Using Velocity Function
• Here “bp” is a branch point, “curve” is a skeletal curve branching out from a “bp” and terminating at another “bp” or at a terminal point, “S” is the skeleton of the shape, and “root” is the root of the tree.
• Procedure Add_as_Child(a,b) adds “a” to the list of children of “b”.
• Function Origin(curve) returns the branch point from which “curve” branches out. For curves with no branch point at either end, it returns “null”.
In Fig. 2(e), each branch point is placed as a child of the root of the tree. Strictly speaking, the result is not a tree, but we represent it as one in order to have a unique and integrated structure for each shape. As seen in Fig. 2(e), each vector describes a skeletal curve. Therefore, we have a velocity vector for each leaf of the tree.
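The three rules can be sketched in a few lines. The representation below is an assumption made for illustration only: curves are given as hypothetical `(curve_id, origin)` pairs, a curve with branch points at both ends is listed once per origin (as in the decomposition of Fig. 2(d)), and the names `extract_tree` and `dummyN` do not come from the paper.

```python
def extract_tree(branch_points, curves):
    """Build a two-level tree {"root": {branch_point: [curve ids]}}.

    branch_points: iterable of branch point identifiers.
    curves: iterable of (curve_id, origin) pairs; origin is a branch point,
            or None for a curve with no branch point at either end.
    """
    children = {bp: [] for bp in branch_points}      # rule 1: bp is a child of root
    dummy_count = 0
    for curve_id, origin in curves:
        if origin is None:                           # rule 3: create a dummy branch point
            dummy_count += 1
            origin = f"dummy{dummy_count}"
        children.setdefault(origin, []).append(curve_id)  # rule 2: curve under its origin
    return {"root": children}

# Hypothetical skeleton: curve 3 joins b1 and b2, so it is listed twice.
curves = [(1, "b1"), (2, "b1"), (3, "b1"), (3, "b2"), (4, "b2")]
print(extract_tree(["b1", "b2"], curves))

# A skeleton made of a single curve without branch points.
print(extract_tree([], [(1, None)]))
```

In a full implementation each leaf would carry the velocity vector of its curve rather than a bare identifier.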
Fig. 2. (a) Shape of a dog. (b) Boundary and skeleton of the shape. (c) The skeleton with labeled skeletal curves. (d) Medial axis decomposed into components, each consisting of a branch point and the skeletal curves branching out from it. (e) Tree-like structure extracted from the skeleton. Panels (a)-(e) show the steps for extracting the tree-like structure from a given shape.
4 Comparison and Matching of the Structures
As mentioned earlier, the descriptor structure has two levels: a coarse-grain level and a fine-grain level. Comparison and matching of the structures are therefore performed at both levels: at the fine-grain level, the velocity vectors are matched, and at the coarse-grain level, subtrees and branch nodes are matched.
4.1 Matching the Velocity Vectors and Computing Fine-Grain Level Distances
Before any matching, all of the velocity vectors must be normalized. The goal of this normalization is to abstract the velocity vectors into attributed strings that capture their characteristics. In this normalization phase, the values of a vector V are mapped to elements of the set {+, 0, -} using the sign function. Consecutive values of the velocity vector are grouped together according to whether they are positive, negative or equal to zero, and an attributed string is generated. Each element of the attributed string has three attributes: symbol (+, - or 0), length (the number of values grouped into the element) and sum (the sum of those values). As a result, we obtain an attributed string for each skeletal curve and its velocity vector. We call these attributed strings “velocity strings”. To match two velocity strings and compute the distance (difference) between them, a dynamic programming method is used. In this matching process, the three attributes must be taken into account. We define three edit operations in our dynamic
$$
\mathrm{Cost}(i,j)=
\begin{cases}
\min\left\{
\begin{array}{l}
\mathrm{Delete}(i):\ \mathrm{Cost}(i,j-1)+\dfrac{i.\mathrm{length}}{U.\mathrm{length}}+\dfrac{i.\mathrm{sum}/i.\mathrm{length}}{\sum_{t=U.1}^{U.\mathrm{length}}\left(t.\mathrm{sum}/t.\mathrm{length}\right)}\\[2ex]
\mathrm{Insert}(i):\ \mathrm{Cost}(i-1,j)+\dfrac{i.\mathrm{length}}{U.\mathrm{length}}+\dfrac{i.\mathrm{sum}/i.\mathrm{length}}{\sum_{t=U.1}^{U.\mathrm{length}}\left(t.\mathrm{sum}/t.\mathrm{length}\right)}\\[2ex]
\mathrm{Change}(i,j):\ \mathrm{Cost}(i-1,j-1)+\left(\mathrm{Delete}(i)+\mathrm{Insert}(j)\right)
\end{array}
\right\} & \text{if } i.\mathrm{symbol}\neq j.\mathrm{symbol}\\[6ex]
\mathrm{Cost}(i-1,j-1)+\dfrac{i.\mathrm{length}-j.\mathrm{length}}{\max\{i.\mathrm{length},\,j.\mathrm{length}\}}+\dfrac{\left(\frac{i.\mathrm{sum}}{i.\mathrm{length}}\right)-\left(\frac{j.\mathrm{sum}}{j.\mathrm{length}}\right)}{\max\left\{\left(\frac{i.\mathrm{sum}}{i.\mathrm{length}}\right),\left(\frac{j.\mathrm{sum}}{j.\mathrm{length}}\right)\right\}} & \text{if } i.\mathrm{symbol}=j.\mathrm{symbol}
\end{cases}
$$
Fig. 3. Definition of the edit operations delete, insert and change. i and j are elements of the velocity strings U and V, respectively, and correspond to groups of values in their velocity vectors. Note that when two identical symbols of the two strings are matched, a small cost is still associated with the match.
programming method with respect to the attributes of the elements. These edit operations are delete, insert and change, and a cost is assigned to each operation. The edit operations are applied to one of the two strings to transform it into the other. The sum of the costs of the edit operations used for the transformation specifies the distance between the two velocity strings. Fig. 3 defines the edit operations and their costs. Hence, we calculate the distance between any two velocity strings U, V as the minimum cost of the edit operations needed to transform one into the other, using dynamic programming. We denote this cost by Distance(U,V).
4.2 Matching Descriptor Structures in the Coarse-Grain Level
The distance between two tree-like structures T1, T2, extracted from skeletons S1, S2, is defined as the sum of the minimum distances between each pair of corresponding nodes and is denoted by Distance(T1,T2). To compute it, we use a top-down approach based on a greedy algorithm: for each branch node of the tree T1 and its subtree, we find the branch node of the tree T2 such that the distance between the two branch nodes and their subtrees is minimal. Next, the matched nodes and their children are deleted from the trees, and the distance between them is taken as the cost of the match. Finally, the sum of these costs gives Distance(T1,T2). As a result, we have relation (1):
$$\mathrm{Distance}(T_1,T_2)=\sum_{a\in T_1,\,b\in T_2}\min\{\mathrm{Distance}(a,b)\}\qquad(1)$$
where a and b are branch nodes. It remains to define the distance between two branch nodes. The distance between two branch nodes a and b is defined
as the sum of the minimum distances between pairs of their velocity strings and is denoted by Distance(a,b). As before, a greedy approach is used to find the corresponding velocity strings of the two branch nodes. Relation (2) gives the distance between two branch nodes:
$$\mathrm{Distance}(a,b)=\sum_{u\in a,\,v\in b}\min\{\mathrm{Distance}(u,v)\}\qquad(2)$$
In relation (2), Distance(u,v) is the distance between two velocity strings u and v, computed as described in Section 4.1.
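The fine-grain step of Section 4.1 can be sketched as follows. This is a simplified illustration rather than the authors' implementation, and it rests on several assumptions: a tolerance `tol` decides which values count as zero, absolute values keep all costs non-negative, each string is normalised by its own totals, the usual edit-distance indexing and boundary rows are used, and all function names are hypothetical.

```python
from itertools import groupby

def sign(value, tol=1e-9):
    """Map a velocity value to '+', '-' or '0' (tol is an assumed tolerance)."""
    return "+" if value > tol else "-" if value < -tol else "0"

def velocity_string(vector):
    """Group consecutive values of equal sign into (symbol, length, sum) elements."""
    elements = []
    for symbol, group in groupby(vector, key=sign):
        values = list(group)
        elements.append({"symbol": symbol, "length": len(values), "sum": sum(values)})
    return elements

def element_cost(e, string):
    """Cost of deleting or inserting element e, normalised by its own string."""
    total_length = sum(t["length"] for t in string)
    total_mean = sum(abs(t["sum"] / t["length"]) for t in string)
    mean = abs(e["sum"] / e["length"])
    return e["length"] / total_length + (mean / total_mean if total_mean else 0.0)

def match_cost(a, b):
    """Cost of matching two elements that carry the same symbol."""
    length_term = abs(a["length"] - b["length"]) / max(a["length"], b["length"])
    mean_a, mean_b = abs(a["sum"] / a["length"]), abs(b["sum"] / b["length"])
    top = max(mean_a, mean_b)
    return length_term + (abs(mean_a - mean_b) / top if top else 0.0)

def string_distance(u, v):
    """Dynamic-programming edit distance between two velocity strings."""
    n, m = len(u), len(v)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # delete every element of u
        cost[i][0] = cost[i - 1][0] + element_cost(u[i - 1], u)
    for j in range(1, m + 1):                      # insert every element of v
        cost[0][j] = cost[0][j - 1] + element_cost(v[j - 1], v)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = u[i - 1], v[j - 1]
            if a["symbol"] == b["symbol"]:
                cost[i][j] = cost[i - 1][j - 1] + match_cost(a, b)
            else:
                delete = cost[i - 1][j] + element_cost(a, u)
                insert = cost[i][j - 1] + element_cost(b, v)
                change = cost[i - 1][j - 1] + element_cost(a, u) + element_cost(b, v)
                cost[i][j] = min(delete, insert, change)
    return cost[n][m]

u = velocity_string([0.5, 1.0, -2.0, -1.0])
v = velocity_string([0.5, 1.0, 1.5])
print(u)
print(string_distance(u, u), string_distance(u, v))
```

Identical strings give a distance of zero in this sketch, and the distance grows as the sign pattern, the run lengths or the mean velocities diverge.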
5 Experimental Results
The measure presented in the previous sections for determining the distance between two shapes may be used for shape recognition and retrieval. To this end, we compare each given shape with the shapes of a database. The shapes with the smallest distances to the given shape are recognized as similar shapes and are retrieved from the database. In our experiments, we use two standard datasets: the Kimia image dataset, containing about 4000 sample binary images, and the MPEG-7 dataset. The first experiment searches the dataset for a given shape and retrieves the same or similar shapes. Fig. 4(a) illustrates a set of 56 sample shapes categorized in 7 groups. Each group contains 8 shapes of a similar class. For each group, the first shape on the left is the query shape and the other shapes are the seven closest matches from the dataset. As seen in this figure, our method calculates distances and classifies similar shapes properly. We also compare our method with another skeleton-based method, the “shock graph” [6], on the Kimia and MPEG-7 shape datasets. In these experiments, several groups of shapes in various categories were evaluated in the presence of different transformations: scaling, rotation, occlusion and missing parts. The results of the comparisons are shown in Fig. 4(b). At point A, there is no transformation and consequently the recognition rates are high for both methods. At point B, shapes are scaled; the recognition rates for both methods are again high. This is because both methods are based on the topology of the skeleton, and a scale transformation does not change the topology, connections and structure of the skeleton. At point C, shapes are rotated. The recognition rates remain fairly high at this point because, as in the previous case, the skeleton (its topology and structure) is invariant under rotation. At points D and E, shapes are evaluated in the presence of occlusion and missing parts, respectively. In these two cases, we observe a decrease in the recognition rates of the methods. For our proposed method, this decrease occurs because the skeleton of a shape is modified in the presence of occlusion and missing parts: some skeletal curves, and sometimes some branch points, are added to or deleted from the skeleton, respectively. Nevertheless, as seen in Fig. 4(b), this decrease is about 10 to 20 percent and our method still performs better. In the third experiment, we categorized the shapes of the MPEG-7 dataset into three categories (animals, plants and objects) and evaluated the recognition rate of our
Fig. 4. (a) 56 shapes classified into 7 classes of 8 shapes each, retrieved from the Kimia and MPEG-7 datasets. (b) Recognition rates of our method and of the shock graph method in the presence of various transformations. A: no transformation, B: scale transformation, C: rotation, D: occlusion, E: missing parts. Our method is denoted by '+'.
method. In this experiment, shapes are randomly chosen from the dataset and then classified into the three categories. Fig. 5 shows the first nine randomly selected shapes, which are categorized properly into the three categories. We evaluated all shapes of the dataset and categorized them using our method. Table 1 summarizes the results of this evaluation and shows the recognition rate of the method for the shapes of each class. As seen in Table 1, the highest recognition rate, 88.63, belongs to the class of animals, followed by the classes of objects and plants. The high recognition rate for animals is due to the fact that the skeleton structures of animals are close to each other. Recognition of animals is therefore simpler than that of other classes (such as objects), which include a wide range of skeletal structures.
Table 1. Results of classifying the shapes of the MPEG-7 dataset into the three classes animals, plants and objects. The recognition rate for the animals class is higher than for the other classes.
Fig. 5. Nine randomly selected shapes from the MPEG-7 dataset. Our method properly classified the shapes into the three classes.
6 Conclusion and Future Work
In this paper, we proposed a new skeleton-based method for the shape recognition and retrieval problem. In this method we introduced the “velocity function” as the first
order differential of the radius function to describe the skeletal curves of the different parts of a shape. To determine the distance between two given shapes, matching of the tree-like structures was performed at two levels, the coarse-grain level and the fine-grain level, based on a greedy approach and a dynamic programming method, respectively. Experiments confirm the high performance of our method in the presence of different transformations. Although we used a greedy approach for the matching process, other optimized algorithms, e.g. the Hungarian algorithm [15], may be used for the matching.
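The difference between the greedy matching used here and an optimal assignment can be seen on a small, made-up cost matrix. In the sketch below, `cost[r][c]` is a hypothetical distance between node r of one structure and node c of the other; the exhaustive search over permutations is only a stand-in for the Hungarian algorithm, which finds the same optimum in polynomial time.

```python
from itertools import permutations

def greedy_matching(cost):
    """For each row in turn, take the cheapest column that is still free."""
    free = set(range(len(cost[0]))) if cost else set()
    total, pairs = 0, []
    for r, row in enumerate(cost):
        if not free:
            break
        c = min(sorted(free), key=lambda j: row[j])
        free.remove(c)
        pairs.append((r, c))
        total += row[c]
    return total, pairs

def optimal_matching(cost):
    """Minimum-cost assignment by exhaustive search (square matrices, small n)."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    return sum(cost[i][best[i]] for i in range(n)), [(i, best[i]) for i in range(n)]

# Made-up distances between two nodes of T1 (rows) and two nodes of T2 (columns).
cost = [[1, 2],
        [1, 10]]
print(greedy_matching(cost))   # greedy commits row 0 to column 0 first
print(optimal_matching(cost))  # the optimal assignment swaps the columns
```

On this matrix the greedy total is 11 while the optimal total is 3, which shows why an assignment algorithm may be worth the extra computation when the number of branch nodes is small.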
References
1. He, L., Han, C.Y., Everding, B., Wee, W.G.: Graph matching for object recognition and recovery. Pattern Recognition 37, 1557–1560 (2004)
2. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognition 37, 1–19 (2004)
3. Kim, D.H., Yun, I.D., Lee, S.U.: A new shape decomposition scheme for graph-based representation. Pattern Recognition 38, 673–689 (2005)
4. Saber, E., Xu, Y., Tekalp, A.M.: Partial shape recognition by sub-matrix matching for partial matching guided image labeling. Pattern Recognition 38, 1560–1573 (2005)
5. Yu, L., Wang, R.: Shape representation based on mathematical morphology. Pattern Recognition Letters 26, 1354–1362 (2005)
6. Siddiqi, K., Shokoufandeh, A., Dickinson, S.J., Zucker, S.W.: Shock Graphs and Shape Matching. International Journal of Computer Vision 35(1), 13–32 (1999)
7. Siddiqi, K., Kimia, B.B.: Toward a Shock Grammar for Recognition. Technical Report LEMS 143, LEMS, Brown University (1995)
8. Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys (CSUR) 33(1), 31–88 (2000)
9. Sebastian, T.B., Klein, P.N., Kimia, B.B.: Recognition of Shapes by Editing Shock Graphs. ICCV 2001, pp. 755–762 (2001)
10. Lam, L., Lee, S.-W., Suen, C.Y.: Thinning Methodologies – A Comprehensive Survey. IEEE Trans. PAMI 14(9) (1992)
11. van Eede, M., Macrini, D., Telea, A., Sminchisescu, C., Dickinson, S.: Canonical Skeletons for Shape Matching. ICPR 2, 64–69 (2006)
12. Dimitrov, P., Phillips, C., Siddiqi, K.: Robust and Efficient Skeletal Graphs. ICCV 1, 417–423 (2000)
13. Blum, H.: A Transformation for extracting new descriptors of Shape. In: Wathen-Dunn, W. (ed.) pp. 362–380. MIT Press, Cambridge (1967)
14. Pelillo, M., Siddiqi, K., Zucker, S.W.: Matching Hierarchical Structures Using Association Graphs. IEEE Trans. PAMI 21(11) (1999)
15. Munkres, J.: Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics 5(1), 32–38 (1957)
Improved Morphological Interpolation of Elevation Contour Data with Generalised Geodesic Propagations
Jacopo Grazzini and Pierre Soille
Spatial Data Infrastructures Unit, Institute for Environment and Sustainability, European Commission - DG Joint Research Centre, via E. Fermi 1, TP 262, I-21020 Ispra, Italy
{jacopo.grazzini,pierre.soille}@jrc.it
Abstract. The objective of this work is to show how geodesic propagation techniques known in mathematical morphology can be employed to reconstruct terrain surfaces, and more generally grid data, from contour maps. We present the so-called generalised geodesic distance that was briefly introduced in a previous article. The shortest paths defined by the generalised geodesic distance can be used to linearly interpolate a given set of contour lines. We extend this concept by incorporating the knowledge of slope values over the contours and perform either Hermite interpolation with this distance or linear interpolation, but with a new weighted geodesic propagation function. Experiments are carried out over synthetic elevation data for which they result in nicely interpolated surfaces. The methodology is generic in the sense that it could be used to reconstruct any 2D or 3D digitised shape from its boundaries.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 742–750, 2007. © Springer-Verlag Berlin Heidelberg 2007
1 Introduction
Interpolation of grid-based contour data for the reconstruction of 2D or 3D surfaces is a common issue in image processing and computer graphics, and arises primarily in applications related to medical imaging, digitisation of objects and geographic information systems [1]. It has in particular been intensively studied for terrain modelling and analysis, as interpolation is often necessary to recover terrain surfaces from topographic contour maps [2,3]. Owing to their simplicity, contour maps have been employed as one of the major forms of recording terrain information. Interpolating the elevation values between the contour lines makes it possible to reconstruct numerical representations of the Earth's surface known as Digital Elevation Models (DEMs): a DEM is a dense, spatially referenced grid where each point represents a ground height value [4]. Although DEMs can now be directly derived using remote sensing, a great deal of elevation measurements of the Earth's surface is still in the form of topographic maps. Indeed, for historical landscapes, they are the only available source of information. Moreover, contour maps enable the retrieval of the true terrain elevation and not the elevation of (natural or man-made) objects above the ground as measured by satellites. In addition, with the increasing availability of high-resolution elevation data for larger areas, low-cost representation and storage of DEMs with sparse data like contours are also preferred [5]. Therefore, there is still a need for generating terrain surfaces from contours. The usual process for deriving a DEM from grid-based contour lines involves interpolating the elevation value of every pixel of the surface as a function of the values of neighbouring pixels where contours exist [2]. Various algorithms have been proposed for this task in the literature; they consist either in interpolating onto a regular grid or in constructing a triangulation of points on the contour lines (see [3] for an in-depth review). Most of them have been unsatisfactory, as they either do not preserve minor ridges and valleys, or are poor at modelling slopes. The approach we propose in this article is a continuation of previous efforts in the field of mathematical morphology (MM) [6,7], as it is intended to preserve the geomorphological properties of the terrain surface. It is interesting for a variety of reasons. First, MM is generally well suited to the processing of elevation data [4]: this is not surprising because in MM, any image is viewed as a topographic surface, the grey level of a pixel standing for its height. Second, the newly introduced geodesic propagations tend to respect the natural topography of terrain surfaces and mimic, in a way, the hydrological runoff processes [8]. Third, our interpolation incorporates higher-order information regarding terrain slopes, allowing for the generation of smooth realistic surfaces [9]. Finally, the technique enables efficient numerical implementations [4]. Noticeable improvements are achieved in the DEMs reconstructed by the proposed methods over previous works. The rest of the paper is organised as follows. In Sec. 2, some background notions and seminal works on contour interpolation are presented. The methodologies using generalised geodesy are detailed in Sec. 3. Experiments are carried out on synthetic data and compared with well-established methods in Sec. 4. Conclusions, including hints about other possible applications, are given in Sec. 5.
2 Background Notions
2.1 Formulation of the Contour Interpolation Problem
The problem of contour interpolation can be stated as follows [10]. Given a pair of adjacent contours $C_l$, $C_u$ with elevations $h_l = h_{C_l} < h_u = h_{C_u}$, delimiting a connected inner-space $\Omega$, we aim at reconstructing a surface $h$ over the image space that verifies:
$$h(p) = h_l \ \text{ if } p \in C_l, \qquad h(p) = h \in [h_l, h_u] \ \text{ if } p \in \Omega, \qquad h(p) = h_u \ \text{ if } p \in C_u \qquad (1)$$
so that no artificial oscillations are created between $C_l$ and $C_u$, which are again the contour lines of $h$ for the elevations $h_l$ and $h_u$. Eq. (1) can naturally be extended to the case of $n > 2$ contours $C_i$, with elevations $h_i$, partitioning the plane into $n + 1$ disjoint open sets $\Omega_i$ (Fig. 1, left: a synthetic map with $n = 10$ contour lines on 7 levels). The special case of summits (i.e. areas with $C_u = \emptyset$) or pits ($C_l = \emptyset$) is not discussed in the following, but can easily be treated by extrapolation from contours [10].
Fig. 1. Left: synthetic image displaying contour lines; the elevations of the contours are represented in the greyscale range [0,255]. Right: corresponding complemented Euclidean distance field from the contour lines.
2.2 Univariate Approaches
Following concepts known in hydrology and considering the importance of runoff processes in the topographic evolution of geologic surfaces [8], some techniques attempt to derive DEMs using an appropriate interpolation function along straight lines in the direction of the steepest slope. Indeed, given a point $p$ with elevation $h$, its closest point $q_l$ on the contour $C_l$ with elevation $h_l < h$ can be reached by descending at each step in the direction given locally by the steepest slope; similarly for the closest point $q_u \in C_u$ [10]. Considering the piecewise linear segment $(q_l, p, q_u)$, the estimation of the elevation $h(p)$ can be reduced to a univariate interpolation problem between $(q_l, h_l)$ and $(q_u, h_u)$. In this context, linear interpolation provides the simplest univariate scheme [11]:
$$h(p) = \frac{h_l\, d(p, C_u) + h_u\, d(p, C_l)}{d(p, C_l) + d(p, C_u)} \quad \forall p \in \Omega \qquad (2)$$
with $d(p, C_L) = \min_{q \in C_L} \|p - q\| = \|p - q_L\|$ the distance (to be defined, e.g. the standard Euclidean distance) from $p$ to $C_L$, $L \in \{l, u\}$. Alternatively, steepest slopes can be explicitly estimated. In [10], lower and upper slope functions $s_l$ and $s_u$ are diffused from their estimated values on the contours to the inner-space regions. Introducing $w_L(p) = d(p, C_L) + t_L\, d(p, C_{L^c})$, where $t_L = s_L(p)\,(d(p, C_l) + d(p, C_u))/(h_u - h_l)$, for $L \in \{l, u\}$ and $L^c \in \{l, u\} \setminus \{L\}$, the height at any $p \in \Omega$ is computed through a monotone Hermite interpolation:
$$h(p) = \frac{h_u\, d(p, C_l)\, w_l(p) + w_u(p)\, d(p, C_u)\, h_l}{d(p, C_l)\, w_l(p) + d(p, C_u)\, w_u(p)} \quad \forall p \in \Omega \qquad (3)$$
with $d(p, C_L)$ as before. Other authors propose to perform higher-order interpolation along the steepest slope path, but this generally leads to height values outside the valid interval $[h_l, h_u]$, i.e. values that do not fulfil Eq. (1).
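Eqs. (2) and (3) can be checked numerically at a single point. The sketch below is illustrative only: the function names and sample values are hypothetical, and the distances and slopes are passed in directly rather than computed from a contour map.

```python
def linear_height(h_l, h_u, d_l, d_u):
    """Eq. (2): linear interpolation from distances d_l = d(p, C_l), d_u = d(p, C_u)."""
    return (h_l * d_u + h_u * d_l) / (d_l + d_u)

def hermite_height(h_l, h_u, d_l, d_u, s_l, s_u):
    """Eq. (3): Hermite-type interpolation with slopes s_l(p) and s_u(p)."""
    scale = (d_l + d_u) / (h_u - h_l)
    t_l, t_u = s_l * scale, s_u * scale
    w_l = d_l + t_l * d_u   # w_L = d(p, C_L) + t_L * d(p, C_{L^c})
    w_u = d_u + t_u * d_l
    return (h_u * d_l * w_l + h_l * d_u * w_u) / (d_l * w_l + d_u * w_u)

# A point one unit from the lower contour (10 m) and three from the upper (20 m).
print(linear_height(10.0, 20.0, 1.0, 3.0))

# With both slopes equal to the mean slope (h_u - h_l) / (d_l + d_u),
# t_l = t_u = 1 and Eq. (3) reduces to Eq. (2).
mean_slope = (20.0 - 10.0) / (1.0 + 3.0)
print(hermite_height(10.0, 20.0, 1.0, 3.0, mean_slope, mean_slope))

# On the lower contour (d_l = 0) the interpolated height is h_l, as Eq. (1) requires.
print(hermite_height(10.0, 20.0, 0.0, 4.0, 1.0, 1.0))
```

The second call shows that the Hermite scheme falls back to the linear one when the local slopes match the mean slope between the two contours.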
Fig. 2. From left to right: Euclidean, generalised and weighted generalised distances d(p, Cl ), τf (p, Cl ), and τg (p, Cl ) resp. from lower to upper contours computed over the contour map of Fig. 1.
2.3 Morphology-Based Methods
MM provides a good theoretical framework for terrain modelling and analysis [4], and it is very well suited to the generation of grid DEMs from discrete contours. A classical approach consists in creating new intermediate contours between each pair of adjacent contour lines using basic morphological transforms: in [12], dilation and erosion operators are recursively applied, while in [13], the skeleton extraction technique is used. However, the continuity of surfaces between the inner areas is in general not considered. Another family of MM algorithms is based on front propagations of Euclidean or geodesic distances. In [6], linear interpolation is performed along the shortest geodesic paths going through the interpolated pixels: in practice, Eq. (2) is applied with the geodesic distance dΩ (p, CL ) from p to CL (Fig. 2, left), i.e. the minimal geometric length of a path (qL , p) linking p and a point qL ∈ CL and entirely included in Ω [14], instead of the Euclidean distance d(p, CL ). This is a fast method, but the created DEMs are not smooth on contour lines. Moreover, the approximated steepest slope path (ql , p, qu ) tends to stick to the concavities of the contour lines rather than following valley lines. In [15], the notion of generalised geodesic time was suggested in order to alleviate this problem. We propose here to adopt and extend this latter approach to improve the interpolation method.
3 Interpolation with Generalised Geodesic Propagations
3.1 Geodesic Time Function
Recall [14] that the geodesic distance between two points of a connected set is the length of the shortest path(s) linking these points and remaining in the set. This idea has been generalised to grey tone geodesic masks in [7]. The time τf (P) necessary for travelling on a path P defined on a discrete greyscale image f (the geodesic mask) is computed as the sum of the intensity values
Fig. 3. Influence zones of the lower contours Cl within the upper connected inner-space Ω (left); labels were randomly assigned. Estimated upper slopes sl over the whole image (right).
of the points on the path. The geodesic time τf (p, q) separating two points p and q of the geodesic mask f is then equal to the smallest travel time over all paths linking these points and remaining within the definition domain of f . This definition can again be easily extended [7] to the geodesic time τf (p, Y ) between a point p and a reference set Y . Algorithms based on priority queue data structures enable an efficient implementation of geodesic time [4,6].
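A priority-queue propagation of this kind can be sketched with Dijkstra's algorithm on a pixel grid. The sketch makes several assumptions that are not taken from the paper: pixels are 4-connected, moving into a pixel costs the mask value of that pixel, mask values are non-negative, pixels outside the domain are marked `None`, and the name `geodesic_time` is hypothetical.

```python
import heapq

def geodesic_time(mask, seeds):
    """Smallest travel time from the seed set to every pixel of the mask.

    mask:  2D list of non-negative values; None marks pixels outside the domain.
    seeds: list of (row, col) pixels forming the reference set Y.
    """
    rows, cols = len(mask), len(mask[0])
    time = [[float("inf")] * cols for _ in range(rows)]
    heap = []
    for r, c in seeds:
        time[r][c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        t, r, c = heapq.heappop(heap)
        if t > time[r][c]:
            continue  # stale queue entry, a shorter time was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] is not None:
                nt = t + mask[nr][nc]  # assumed step cost: value of the entered pixel
                if nt < time[nr][nc]:
                    time[nr][nc] = nt
                    heapq.heappush(heap, (nt, nr, nc))
    return time

# A toy mask with one expensive pixel in the centre.
mask = [[1, 1, 1],
        [1, 9, 1],
        [1, 1, 1]]
for row in geodesic_time(mask, [(0, 0)]):
    print(row)
```

The cheapest route to the opposite corner goes around the expensive centre pixel, which is the behaviour the geodesic mask is meant to induce.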
3.2 Linear Interpolation Using Geodesic Time
A new geodesic distance was suggested in [15] in order to avoid the artifacts that occur in interpolated surfaces when geodesic paths become tangent to the contours. It was proposed to use the geodesic time with the complement of the Euclidean distance to the contours as the geodesic mask. The process starts with the creation, on a regular grid, of the Euclidean distance field from the contour lines, followed by its complementation over $\Omega$ (Fig. 1, right):
$$f(p) = [d(p, C_l \cup C_u)]^c = \max_{x \in \Omega}\{d(x, C_l \cup C_u)\} - d(p, C_l \cup C_u) \quad \forall p \in \Omega \qquad (4)$$
The geodesic paths defined with the previous mask $f$ follow low Euclidean distance values (Fig. 2, middle). Univariate interpolation is then performed, as in [6], along these paths: Eq. (2) is used for reconstructing the unknown surface $h$, the distances $d(p, C_l)$ and $d(p, C_u)$ being replaced with the lower and upper geodesic times $\tau_f(p, C_l)$ and $\tau_f(p, C_u)$, respectively. As with the standard geodesic interpolation, flat peaks and troughs can be avoided provided that the elevation and location of the peaks and troughs are mapped in the image of contour lines [4].
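The mask of Eq. (4) can be computed by brute force on a small example. The sketch below assumes pixels are given as coordinate tuples and uses an exhaustive nearest-contour search instead of a proper distance transform; the function name and the toy strip are hypothetical.

```python
import math

def complemented_distance_mask(domain, contours):
    """Eq. (4): f(p) = max_x d(x, C_l ∪ C_u) - d(p, C_l ∪ C_u) for p in the domain.

    domain:   pixels (row, col) of the inner-space Omega.
    contours: pixels (row, col) of C_l ∪ C_u.
    """
    # Euclidean distance from each inner pixel to its nearest contour pixel.
    dist = {p: min(math.dist(p, q) for q in contours) for p in domain}
    d_max = max(dist.values())
    return {p: d_max - d for p, d in dist.items()}

# Toy strip: contours at columns 0 and 6, inner-space at columns 1..5.
contours = [(0, 0), (0, 6)]
domain = [(0, c) for c in range(1, 6)]
mask = complemented_distance_mask(domain, contours)
print([mask[p] for p in domain])
```

The mask is zero midway between the two contours and largest next to them; it can then serve as the geodesic mask of a propagation such as the one described in Sec. 3.1.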
3.3 Hermite Interpolation Using Geodesic Time
Recent methods [10,16,17] have shown the need for incorporating second-order information in the interpolation process, mainly through the estimation of slopes based on the contour lines (see also Sec. 2.2). We propose to adopt an approach somewhat similar to that of [10]: slopes are first computed over the existing isolines, and then estimated over the inner-space using geodesic distances and geodesic influence zones [18]. Indeed, the computation of the distance field in Sec. 3.2 also makes it possible to know, for each pixel $p \in \Omega$, the closest pixel(s) $q_l$ on the contour $C_l$. Reciprocally, for every pixel $q_l$ of the contour $C_l$, it is possible to know which inner pixels $p \in \Omega$ are closer to it than to any other pixel of $C_l$. The set of such pixels $p$ constitutes the (upper) influence zone of $q_l$ inside the inner-space $\Omega$ (Fig. 3, left). Similarly, for a pixel $q_u \in C_u$, it is possible to determine its lower influence zone inside $\Omega$. Therefore, after computing the geodesic distance fields $\tau_f$, one-sided slopes $s_l(q_l)$ and $s_u(q_l)$ can be estimated at every pixel $q_l \in C_l$, taking into account the nearest lower and upper contours of $C_l$, and similarly for every $q_u \in C_u$. For any $p \in \Omega$, the slopes $s_L(p)$ are then defined as a weighted arithmetic mean of the slopes at its closest points $q_l \in C_l$ and $q_u \in C_u$:
$$s_L(p) = \frac{\tau_f(p, C_l)\, s_L(q_u) + s_L(q_l)\, \tau_f(p, C_u)}{\tau_f(p, C_l) + \tau_f(p, C_u)} \quad \forall p \in \Omega \qquad (5)$$
for $L \in \{l, u\}$. Lower and upper slope grids $s_l$ and $s_u$ are then generated over the image space (Fig. 3), and the elevation of the unknown surface can be computed using the Hermite interpolation of Eq. (3), as in [10].
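Eq. (5) and the influence-zone lookup can be illustrated for a single pixel. This is a sketch under stated assumptions: the nearest contour pixel is found with the Euclidean distance as a stand-in for the geodesic influence zones of the paper, the geodesic times and slope values are supplied by hand, and both function names are hypothetical.

```python
import math

def nearest_contour_pixel(p, contour):
    """Contour pixel whose (Euclidean, assumed) influence zone contains p."""
    return min(contour, key=lambda q: math.dist(p, q))

def blended_slope(tau_l, tau_u, slope_at_ql, slope_at_qu):
    """Eq. (5): weighted mean of the slopes at q_l and q_u.

    tau_l, tau_u: geodesic times from p to C_l and C_u.
    """
    return (tau_l * slope_at_qu + slope_at_ql * tau_u) / (tau_l + tau_u)

# Hand-picked example: p lies closer to the lower contour than to the upper one.
lower_contour = [(0, 0), (0, 1), (0, 2)]
p = (1, 2)
print(nearest_contour_pixel(p, lower_contour))

# Slopes 2.0 at q_l and 4.0 at q_u; p is at time 1 from C_l and 3 from C_u.
print(blended_slope(1.0, 3.0, 2.0, 4.0))

# On the contours themselves the blend returns the slope measured there.
print(blended_slope(0.0, 5.0, 2.0, 4.0), blended_slope(5.0, 0.0, 2.0, 4.0))
```

The weights make the slope at p lean towards the value of whichever contour is nearer in geodesic time.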
3.4 Linear Interpolation Using Weighted Geodesic Time
Instead of using local slopes to perform a higher-order interpolation scheme (see Sec. 3.3), we also propose another approach that consists in incorporating the knowledge about derivatives directly into the computation of the distance fields. Namely, using the lower and upper slopes (Fig. 3, right) defined in Eq. (5), we first introduce the weighted geodesic distance fields:
$$g_L(p) = s_L(p) \cdot f(p) = s_L(p) \cdot [d(p, C_l \cup C_u)]^c \quad \forall p \in \Omega \qquad (6)$$
for $L \in \{l, u\}$, and then define the propagation function $\tau_{g_L}$ as the time function computed using $g_L$ as a geodesic mask. The idea is to propagate faster from the contour pixels where the slope is steep (Fig. 2, right). Finally, as in Sec. 3.2, the reconstruction of the unknown surface is performed with a linear interpolation using the propagation fields $\tau_g(\cdot, C_L) = \tau_{g_L}$.
4 Results and Discussion
We intend to evaluate the different proposed methodologies by visual inspection, which is generally “a good judge of an interpolation method” [19], as the desirable characteristics of an interpolation algorithm are not obvious, since different quality criteria may conflict with each other. A synthetic contour map composed of polylines representing the isocontours was interpolated using the geodesic functions described in Sec. 3. It appears that the three proposed approaches perform better than the 'medial-line' technique of [12], which is also derived from MM and implemented in a commercial geospatial software package (see Sec. 2.3), and than the original 'geodesic' algorithm [6] (see Sec. 3.2). In particular, the last two approaches (see Sec. 3.3 and 3.4) give rather good results, as they create smooth surface transitions at the contour boundaries (Fig. 4, bottom middle and right: the results being very similar, we present two different
Fig. 4. From top down, left to right: synthetic contour map with elevation in the range [0, 255], excerpts (square box) of the hillshade representation of the surfaces interpolated using the methods described in [12], [6], Sec. 3.2 and Sec. 3.3 resp., and 3D representation of the whole surface interpolated using the method of Sec. 3.4
possible representations). Besides their efficiency, the algorithms lead to a contour line interpolation without flat zones, whereas with the 'geodesic' and, above all, the 'medial-line' methods the surface between two contour lines tends to drop quickly in a terraced effect (Fig. 4, top). Indeed, the additional information contained in the slopes ensures that the reconstructed DEM is continuous across the contours. The methods also faithfully reconstruct summits and pits (Fig. 4, bottom right). One of the major problems occurs on the medial axis, i.e. the set of points that have more than one closest point on $C_L$ (where influence zones of two pixels from the same contour meet [18]), as the transition between slope values might be very strong on this line. However, for smooth terrains where the overall slope variations are not so strong, we consider this rather an advantage, as it allows ridges and valleys to be modelled as sharp feature lines. We also observe that, when computing the weighted generalised geodesic time, the crest lines are generally moved towards the contours of maximal slope, which is a desirable property.
5 Conclusion
In this paper, we explore the use of generalised geodesic distance/propagation functions for terrain surface interpolation from contours. We adopt the morphological approach suggested in a former paper on this subject, and we propose some possible extensions allowing for improved surface reconstruction. The basic idea is to reduce the surface reconstruction problem to a univariate interpolation problem, assuming that the elevation varies monotonically along the geodesic paths. The originality lies in the definition of new generalised geodesic functions, to be combined with classical interpolation schemes. The proposed methodology is generic in the sense that the introduced concepts could be applied in other interpolation techniques where distance fields are required. One may also think of the more general problem of 3D reconstruction from 2D cross-sections. In addition, this opens new fields of application for geodesic distances/propagations. New operators in MM, based on adaptive structuring elements, could be defined using the proposed geodesic functions. They could also be useful in geospatial applications where special care needs to be given to the geomorphological properties of the underlying surface, e.g. for introducing new distances for kriging techniques of variables defined over a terrain surface.
References
1. Schafer, R., Rabiner, L.: A digital signal processing approach to interpolation. Proc. of IEEE 61(6), 692–702 (1973)
2. Gold, C.: Surface interpolation, spatial adjacency and GIS. In: Raper, J. (ed.) 3D Applications in GIS, pp. 21–35. Taylor & Francis, London, UK (1989)
3. Carrara, A., Bitelli, G., Carla, R.: Comparison of techniques for generating digital terrain models from contour lines. Int. J. Geog. Inf. Sc. 11, 451–473 (1997)
4. Soille, P.: Advances in the analysis of topographic features on discrete images. In: Margaria, T., Yi, W. (eds.) ETAPS 2001 and TACAS 2001. LNCS, vol. 2031, pp. 175–186. Springer, Heidelberg (2001)
5. Grazzini, J., Chrysoulakis, N.: Extraction of surface properties from a high accuracy DEM using multiscale remote sensing techniques. In: Proc. of EnviroInfo, pp. 352–356 (2005)
6. Soille, P.: Spatial distributions from contour lines: an efficient methodology based on distance transformations. J. Vis. Comm. Im. Repres. 2(2), 138–150 (1991)
7. Soille, P.: Generalized geodesy via geodesic time. Patt. Recogn. Lett. 15(12), 1235–1240 (1994)
8. Hutchinson, M.: Calculation of hydrologically sound digital elevation models. In: Proc. of ISSDH, pp. 117–133 (1988)
9. Kidner, D.: Higher-order interpolation of regular grid digital elevation models. Int. J. Rem. Sens. 24(14), 2981–2987 (2003)
10. Hormann, K., Spinello, S., Schröder, P.: C1-continuous terrain reconstruction from sparse contours. In: Proc. of VMV, pp. 289–297 (2003)
11. Lee, J., Chung, S.: Reconstruction of 3D terrain data from contour map. In: Proc. of MVA, pp. 281–284 (1994)
12. Barrett, W., Mortensen, E., Taylor, D.: An image space algorithm for morphological contour interpolation. In: Proc. of Grap. Inter, pp. 16–24 (1994)
13. Thibault, D., Gold, C.: Terrain reconstruction from contours by skeleton construction. GeoInformatica 4, 349–373 (2000)
14. Lantuéjoul, C., Maisonneuve, F.: Geodesic methods in image analysis. Patt. Recogn. 17, 177–187 (1984)
15. Soille, P.: Generalized geodesic distances applied to interpolation and shape description. In: Serra, J., Soille, P. (eds.) Mathematical Morphology and its Applications to Image Processing, pp. 193–200. Kluwer Academic Publishers, Boston, MA (1994)
16. Chai, J., Miyoshiy, T., Nakamae, E.: Contour interpolation and surface reconstruction of smooth terrain models. In: Proc. of IEEE Vis, pp. 27–33 (1998)
17. Dakowicz, M., Gold, C.: Extracting meaningful slopes from terrain contours. In: Proc. of ICCS, pp. 144–153 (2002)
18. Préteux, F.: Watershed and skeleton by influence zones: a distance-based approach. J. Math. Imag. Vis. 1, 239–255 (1992)
19. Gousie, M., Franklin, W.: Constructing a DEM from grid-based data by computing intermediate contours. In: Proc. of ISAGIS, pp. 71–77 (2003)
Cylindrical Phase Correlation Method
Jakub Bican and Jan Flusser
Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Pod vodárenskou věží 4, 18208 Prague 8, Czech Republic
[email protected]
Abstract. We present a new image registration algorithm that employs the Phase Correlation Method (PCM) to align rotated and translated 3D images. First, we use PCM and a cylindrical coordinate transform to estimate the rotation angle that aligns two images that are mutually rotated around a known axis. An improvement of this technique is given to eliminate the influence of noise and image differences in non-ideal conditions. Finally, an iterative optimization procedure is proposed which uses these techniques in rigid body registration tasks. We study the performance of the algorithm in a simulated experiment on an MRI brain image.
1 Introduction
Image registration plays a major role in multiframe image processing. The purpose of image registration is to geometrically align two or more images differing by the imaging time, viewpoint, sensor modality and/or the subject of the images. Among the many areas where image registration is employed (such as remote sensing and computer vision), medical image processing is one of the most important. Image registration methods are usually classified into two main groups [14]. Feature-based methods incorporate a feature selection step to detect a set of control points, a feature matching step to find the correspondences between the two sets of control points, and a transform model estimation step to determine the parameters of the selected transformation from the correspondences. A very popular approach in medical image registration, especially in tomographic brain image registration, is an optimization scheme that aims to find an extremum of a similarity or dissimilarity measure on a multidimensional space of parameters of a selected transform model by some numerical optimization process. Methods based on mutual information are the state of the art among these approaches. Fourier methods form a special group of approaches based on the Phase Correlation Method (PCM). PCM was first introduced by Kuglin and Hines [10] as a fast and robust method for the estimation of inter-image shifts. The method was extended by De Castro and Morandi [2] to register translated and rotated images and later by Reddy and Chatterji [12] to register translated, rotated and scaled images. They use a log-polar transform of shift-invariant spectral magnitudes to turn rotation and scaling into a translation that is handled analogously by PCM.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 751–758, 2007. © Springer-Verlag Berlin Heidelberg 2007
This approach is not applicable to 3D image registration as there is no mapping that converts rotations to translations in 3D. Keller et al. [9] introduced an algorithm for the registration of rotated and translated 3D volumes based on the pseudopolar Fourier transform. Their approach uses the pseudopolar representation of spectral magnitudes to find the rotation axis and estimate the rotation angle without using interpolation. Other authors aim to refine the precision of PCM to subpixel level. Foroosh et al. [4] estimate the subpixel shifts by analyzing the polyphase decomposition of the cross-power spectrum. In [13], Stone et al. first eliminate the effect of aliasing and then use a least-squares fit to 2-D phase difference data. As this is a difficult task, the authors of [5,6] separate the shift estimation in every dimension by SVD or high-order SVD of the cross-power spectrum. An improvement of the robustness of this approach is given in [8]. All of the approaches mentioned above estimate the rotations on the shift-invariant spectral magnitudes of the images. For images whose spectral magnitudes have little structure (e.g. many medical imaging modalities) we find this difficult and unreliable, as most of the information that is important for PCM (e.g. image edges) is lost when the image phase is discarded from the rotation estimation. In this paper, we present a new way of employing PCM to register 3D images. Our method is based on a cylindrical coordinate mapping of the image in the spatial domain and iteratively uses PCM to estimate the update of the rigid body transform with respect to one transform component: rotation or translation. We experimentally study the method's performance in the case of tomographic brain image registration tasks.
2 Notation and Definitions
An image is a real, discrete and bounded function. In image registration, a moving image $f_M$ is registered with a reference (or fixed) image $f_R$. The discrete Fourier transform (or spectrum) of an image $f(x)$ is denoted $F(\omega) = \mathcal{F}\{f(x)\}$. A 3D rotation rotates $\mathbb{R}^3$ around some central point. In this paper we represent rotations by Euler angles $R(\alpha, \beta, \gamma)$, by axis and angle $R_v(\alpha)$, or by a $3 \times 3$ matrix $M(R)$. Further representations and rotation properties can be found for example in [1,3] or in any standard computer graphics textbook. A rigid body transform $T$ can be viewed as a rotation $R$ followed by a translation $t$. In matrix terms, it can be implemented as one multiplication of a vector by the rotation matrix and one addition of the translation vector to the rotated result. Rigid body transforms (as well as rotations) can be composed to obtain new rigid body transforms: $(T_2 \circ T_1)(x) := M(R_2)M(R_1)x + M(R_2)t_1 + t_2$. The cylindrical coordinate transform $\mathrm{Cyl}$ (defined by the origin and axis of a cylinder) assigns to each triplet of Cartesian coordinates $(x, y, z)$ a new triplet of cylindrical coordinates $(\alpha, r, z)$. The first cylindrical coordinate corresponds to the angle, the second to the radius of the cylinder and the third to the height of the cylinder.
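The composition rule for rigid body transforms can be checked numerically. The following is a minimal sketch, assuming each transform is stored as a pair (rotation matrix, translation vector); the helper names `compose`, `apply` and `rot_z` are illustrative only and do not come from the paper.

```python
import numpy as np

def compose(R2, t2, R1, t1):
    """Return (R, t) of T2 after T1: x -> R2 (R1 x + t1) + t2."""
    return R2 @ R1, R2 @ t1 + t2

def apply(R, t, x):
    """Apply a rigid body transform: rotate, then translate."""
    return R @ x + t

def rot_z(angle):
    """Rotation matrix for a rotation by `angle` radians about the z-axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

R1, t1 = rot_z(np.pi / 2), np.array([1.0, 0.0, 0.0])
R2, t2 = rot_z(np.pi / 2), np.array([0.0, 2.0, 0.0])
R, t = compose(R2, t2, R1, t1)

x = np.array([1.0, 0.0, 0.0])
direct = apply(R, t, x)                         # composed transform
stepwise = apply(R2, t2, apply(R1, t1, x))      # T1 first, then T2
```

Both routes map the test point to the same location, which is exactly what the formula $M(R_2)M(R_1)x + M(R_2)t_1 + t_2$ states.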
3 Phase Correlation Method
The phase correlation method [10] takes advantage of the Fourier shift theorem, which relates the phase information of the spectra of an image and its shifted copy. If $f_M$ is a shifted copy of $f_R$, then according to the shift theorem the Fourier transforms of the images are related as follows:
$$\mathcal{F}^{-1}\left\{\frac{F_M(\omega)}{F_R(\omega)}\right\} = \mathcal{F}^{-1}\left\{e^{i(\omega_x \Delta x_x + \omega_y \Delta x_y + \omega_z \Delta x_z)}\right\} = \delta(x + \Delta x).$$
In practice (if $f_M$ is not exactly a shifted copy of $f_R$), the quotient of the spectra $F_M$ and $F_R$ is computed as
$$\mathrm{Corr}(x) = \mathcal{F}^{-1}\left\{\frac{F_M F_R^{*}}{|F_M|\,|F_R|}\right\},$$
so that PCM computes the correlation of whitened images (images with $|F| = 1$). Thus, locating the peak in the correlation surface $\mathrm{Corr}$ yields the offset $\Delta x$ that can be used to align the two images at pixel level.
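A minimal NumPy sketch of the whitened correlation is given below. It is an illustration under stated assumptions rather than the authors' ITK module: the function name `phase_correlation`, the small constant `eps` that guards against division by zero, and the wrap-around of the peak index to a signed offset are choices made here.

```python
import numpy as np

def phase_correlation(ref, mov, eps=1e-12):
    """Return (offset, correlation surface) for two equally sized arrays.

    The offset is the circular shift s such that mov is approximately
    np.roll(ref, s) along all axes.
    """
    F_ref = np.fft.fftn(ref)
    F_mov = np.fft.fftn(mov)
    cross = F_mov * np.conj(F_ref)
    # Whitening: keep only the phase of the cross-power spectrum.
    corr = np.real(np.fft.ifftn(cross / (np.abs(cross) + eps)))
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Indices past the midpoint correspond to negative shifts.
    offset = tuple(int(p - n) if p > n // 2 else int(p)
                   for p, n in zip(peak, corr.shape))
    return offset, corr

rng = np.random.default_rng(0)
ref = rng.random((16, 16, 8))
mov = np.roll(ref, shift=(3, -2, 1), axis=(0, 1, 2))
offset, corr = phase_correlation(ref, mov)
```

For an exact circular shift the correlation surface is close to a single unit peak, so the maximum is unambiguous; with real images the peak is lower and broader.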
4 Cylindrical Phase Correlation Method
In this section, we propose new image registration algorithms. First, we describe a technique of using PCM to find a rotation angle in case the two volumes are just rotated around a known axis. Then, we give some improvements that increase the performance of the technique in non-ideal conditions. Finally, we use the technique in an iterative algorithm that registers volumes differing by a rigid body transform.
4.1 Finding Rotation with Known Axis
Now consider that the two 3D images $f_R$ and $f_M$ are related by a rotation. In the 2D case, it is possible to convert the rotation around some central point to a translation by polar transforming the images with the origin of the polar coordinates located at the center of the rotation. The angle of rotation can then be determined by using the 2D version of PCM described above [12]. The direct generalization of this approach to the 3D case by using spherical coordinates does not work, as these coordinates do not convert a 3D rotation to a translation. Let us represent the rotation by axis $v$ and angle $\alpha$ and assume that the rotation axis $v$ is known. The angle $\alpha$ can be determined by first transforming the images to cylindrical coordinates with the axis of the cylinder aligned with $v$ and then using PCM to find the shift between the transformed images:
$$\mathrm{offset} = \mathrm{PCM}\left(\mathrm{Cyl}_v(f_R), \mathrm{Cyl}_v(f_M)\right).$$
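The known-axis case can be sketched as follows. This is a simplified illustration, not the authors' code: the cylinder axis is assumed to be the central z-axis of a volume indexed as `vol[z, y, x]`, nearest-neighbour sampling is used for brevity, and the rotation is emulated by a circular shift of the angular axis so that the sketch is free of interpolation effects. The names `cyl_transform` and `pcm_offset` are hypothetical.

```python
import numpy as np

def cyl_transform(vol, n_alpha, n_r):
    """Resample vol[z, y, x] onto an (alpha, r, z) grid around the central z-axis."""
    nz, ny, nx = vol.shape
    cy, cx = (ny - 1) / 2.0, (nx - 1) / 2.0
    alpha = 2 * np.pi * np.arange(n_alpha) / n_alpha
    a, rr = np.meshgrid(alpha, np.arange(n_r), indexing="ij")
    # Nearest-neighbour sampling (assumption made to keep the sketch short).
    xi = np.clip(np.rint(cx + rr * np.cos(a)).astype(int), 0, nx - 1)
    yi = np.clip(np.rint(cy + rr * np.sin(a)).astype(int), 0, ny - 1)
    return np.moveaxis(vol[:, yi, xi], 0, -1)  # shape (n_alpha, n_r, nz)

def pcm_offset(ref, mov, eps=1e-12):
    """Signed circular offset of mov relative to ref, found by phase correlation."""
    cross = np.fft.fftn(mov) * np.conj(np.fft.fftn(ref))
    corr = np.real(np.fft.ifftn(cross / (np.abs(cross) + eps)))
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    return tuple(int(p - n) if p > n // 2 else int(p)
                 for p, n in zip(peak, corr.shape))

rng = np.random.default_rng(0)
vol = rng.random((6, 33, 33))
n_alpha = 64
cyl_ref = cyl_transform(vol, n_alpha=n_alpha, n_r=16)
cyl_mov = np.roll(cyl_ref, 5, axis=0)       # rotation by 5 angular steps
offset = pcm_offset(cyl_ref, cyl_mov)
angle = offset[0] * 2 * np.pi / n_alpha     # angular offset converted to radians
```

The angular axis is naturally periodic, which is why the circular shift assumed by the discrete Fourier transform matches a rotation about the cylinder axis.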
The rotation angle corresponds directly to the offset in the first (angular) cylindrical coordinate and, ideally, this is the only non-zero offset found by PCM: $R_v(\mathrm{offset}_\alpha)$. In fact, this technique uses exactly the polar-mapping trick of [12] in every plane orthogonal to the rotation axis $v$, but in all these planes together. Moreover, in the angular direction it satisfies the assumption of the Fourier transform and PCM that the image can be periodically extended beyond its bounds.
4.2 Improving the Performance
The approach described in the previous section has two main drawbacks. The first one is caused by performing computations in the discrete domain: when computing the cylindrical transform of the images, it is necessary to use higher-order interpolation, because the cylindrical transform (like a polar transform) samples the space very non-uniformly. The second drawback is that the voxels of the original volume located near the axis of the cylinder have much greater impact than the voxels located at the perimeter. If the angular and radial coordinates are sampled so that the perimeter of the cylinder is not subsampled and no information is lost, every voxel near the axis is stretched (or interpolated) to several voxels, while the voxels at the perimeter are resampled approximately one-to-one. Moreover, PCM gives the same significance to well-sampled voxels at the perimeter as to resampled voxels originating from the voxels near the axis, which are also highly affected by interpolation error. These drawbacks led us to develop a technique which computes PCM separately for every layer of the cylinder defined by a fixed radius. Every such layer has a different angular resolution that suitably samples the original data: the layer at radius $r$ is sampled in the angular direction by $2\pi r$ samples, i.e. with resolution (or spacing) $\frac{2\pi}{2\pi r} = \frac{1}{r}$ radians. Corresponding layers from the reference and moving image are registered by PCM, which results in a correlation surface that gives a degree of match for each angle. Correlation surfaces from all layers are then combined into one and the angle with the highest value in the combined correlation surface
Fig. 1. MRI brain image
is the resultant parameter of the registration. When combining, each surface is weighted by its radius, which corresponds to the number of pixels it contains. This gives each original voxel the same impact on the result.

function IRE(v, ReferenceImage, MovingImage)
    rmax := "maximal radius of data"
    height := 2*rmax
    GlobalCorrelation := zeros(floor(2*pi*rmax), height)
    for r = 1:rmax
        ReferenceLayer := Cyl(v, ReferenceImage)[r,:,:]
        MovingLayer := Cyl(v, MovingImage)[r,:,:]
        LayerCorrelation := PCM(ReferenceLayer, MovingLayer)
        GlobalCorrelation += r*resize(LayerCorrelation, size(GlobalCorrelation))
    end
    offset := find_max(GlobalCorrelation)
    IRE := offset[0]
end

The inputs to the procedure are the two images and the rotation axis v. When resampling the correlation surface, we use nearest-neighbour interpolation, because the interpolation error is largest for layers close to the cylinder's axis, which have little impact owing to the weighting of the layers.
4.3 Rigid Body Registration
A rigid body transform is a transform that combines rotations and translations. Finding the optimal parameters of a rigid body transform (six parameters in 3D) is a very common task in image registration [14] (intra-subject studies, multimodal registration, etc.). As mentioned in the introduction, there is a class of registration methods that employ a numerical optimization process to find the optimum of a similarity measure on a space of parameters of a transformation model. Our algorithm uses the procedures described above to find the parameters of a rigid body transform so that the PCM metric, i.e. the correlation of whitened images, reaches its maximum. The optimization runs in iterations. Each iteration aims to improve the metric with respect to some subset of parameters. Such optimization resembles some well-known optimizers (e.g. Powell's direction set method [11]) and is sometimes called alternating optimization. Start with the identity transform $T := \mathrm{id}$ and a set of three linearly independent axes; x, y and z are a good choice. Repeat the following iterations to compute the transform update $T_{upd}$. Odd iterations compute PCM to estimate the shift between the reference volume $f_R$ and the moving volume transformed by the current transform, $T(f_M)$:
$$T_{upd} := \mathrm{PCM}(f_R, T(f_M)).$$
Even iterations estimate the rotation component with respect to one of the axes by the above procedure:
$$T_{upd} := \mathrm{IRE}_{\{x|y|z\}}(f_R, T(f_M)).$$
The axes alternate cyclically as the algorithm advances, so that, for example, iterations 2, 4, 6, 8, 10, ... use the axes x, y, z, x, y, ... respectively. After each iteration, the current transform $T$ is updated by the new $T_{upd}$:
$$T = T \circ T_{upd}.$$
This iterative process is stopped if no non-zero update was found in the last six iterations (no transform parameter can be further optimized), if the maximum number of iterations is reached (time limit), or if the current result is good enough (e.g. the algorithm is stopped by the operator). The convergence of our algorithm faces similar problems as usual optimization techniques. It is not ensured that the reached optimum will be the global optimum of the similarity metric, nor that it is approached within some well-defined time limit. But in contrast to other methods, our method is optimal in every step (PCM and IRE find the global optimum with respect to the given parameter subset). Furthermore, in the case of PCM, the optimum is found over three shift parameters in one step. Hence, the method should be able to register images with low spatial correlation and with high initial misregistration, which are common properties that limit other methods. Again, this does not guarantee convergence to the global optimum.
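The control flow of the alternating scheme can be sketched independently of the image processing. The sketch below assumes that transforms are stored as 4 × 4 homogeneous matrices and that the two estimators are supplied as callables; `estimate_shift` and `estimate_rotation` are hypothetical stand-ins for PCM and IRE, replaced here by toy stubs so that the loop can be run on its own.

```python
import numpy as np

def alternating_registration(estimate_shift, estimate_rotation, max_iter=120):
    """Alternate shift and single-axis rotation updates until nothing changes.

    estimate_shift(T) and estimate_rotation(T, axis) return 4x4 update matrices.
    """
    T = np.eye(4)
    axes = ("x", "y", "z")
    idle = 0        # consecutive iterations with an identity update
    rotations = 0   # number of rotation iterations so far
    for it in range(1, max_iter + 1):
        if it % 2 == 1:                       # odd iterations: translation
            T_upd = estimate_shift(T)
        else:                                 # even iterations: x, y, z, x, ...
            T_upd = estimate_rotation(T, axes[rotations % 3])
            rotations += 1
        T = T @ T_upd                         # T = T composed with T_upd
        idle = idle + 1 if np.allclose(T_upd, np.eye(4)) else 0
        if idle >= 6:                         # six updates in a row were zero
            break
    return T, it

# Toy stubs: the "true" misalignment is a pure translation.
target = np.array([3.0, -2.0, 1.0])

def toy_shift(T):
    upd = np.eye(4)
    upd[:3, 3] = target - T[:3, 3]   # remaining translation
    return upd

def toy_rotation(T, axis):
    return np.eye(4)                 # no rotation to recover

T_final, n_iter = alternating_registration(toy_shift, toy_rotation)
```

In the toy run the first iteration recovers the translation and the following six iterations return identity updates, which triggers the stopping rule described above.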
5 Experimental Results
In this section, we present one of our experiments that study the method's performance. We created an experimental implementation of the method in C++ as a module for the Insight Toolkit ITK [7]. We aim to study the influence of the initial misregistration level on the rigid body registration result and on the number of iterations needed for the algorithm to converge. We use CPCM to register a randomly rotated and shifted MRI brain image (Fig. 1) with the original. The volume size is 128×128×40 voxels with a voxel size of 1.8×1.8×4.58 mm. The degree of misregistration as well as the registration error is measured by a fixed set of eight points that uniformly sample the reference image's volume. The error (or misregistration) is then measured as the mean Euclidean distance of these points in the moving image to their original counterparts in the reference image. This can be understood as the mean distance of every point of the volume to its transformed counterpart. We continuously generated random transforms, so that there were at least one hundred different transforms for each 1 mm level of initial misregistration. For each misregistration level, the results are the mean values over all transforms that introduced misregistration of that level. The graph in Fig. 2a shows
[Fig. 2 plots. Panel (a): Mean Error [mm] / No. of Iterations versus Misregistration [mm]; series: Reg. result (total), Reg. result (error < 10 mm), Iterations (total), Iterations (error < 10 mm). Panel (b): % of Cases versus Misregistration [mm]; series: Downgrades, Large error (error > 10 mm), Early stops.]
Fig. 2. Influence of initial misregistration on the mean error after registration (a), number of iterations (a) and the failure statistics (b)
two alternative views of the results. First, we keep only those results that successfully converged below some reasonable error (here 10 mm, explained below). The graph also plots values that include all results. Figure 2b shows the statistics of three kinds of failures:
1. The method converged but the final error was larger than the initial misregistration (the alignment was downgraded).
2. The final misregistration error was larger than 10 mm (this also includes case 1 above).
3. The method reached the iteration limit (120 iterations) and was stopped (usually the error in this situation does not decrease below 10 mm, so it is included in case 2 too).
The results can be interpreted as follows: as long as the misregistration is below about 100 mm, the method converges to pixel-level precision with at least 90% reliability, and the number of iterations (i.e. time) grows approximately logarithmically with the misregistration. As the misregistration grows beyond 100 mm (which is approximately the radius of the volume), the failure rate increases and the method's performance decreases, mainly due to cases in which the method converged to some false position. We should point out that these results and trends do not depend on the specific value of the reasonable error mentioned above. We use a value of 10 mm, which is one order of magnitude higher than the pixel size and is still reasonably small, but we could use values of 5-45 mm without any significant effect on the graphs.
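The error measure used in this experiment can be sketched in a few lines. The placement of the eight control points is an assumption made here (a 2 × 2 × 2 grid at one quarter and three quarters of the volume extent); the text only states that the points uniformly sample the volume. The names `control_points` and `mean_point_error` are illustrative.

```python
import itertools
import numpy as np

def control_points(extent_mm):
    """Eight points on a 2x2x2 grid (assumed layout) inside the volume, in mm."""
    fractions = (0.25, 0.75)
    return np.array([[fx * extent_mm[0], fy * extent_mm[1], fz * extent_mm[2]]
                     for fx, fy, fz in itertools.product(fractions, repeat=3)])

def mean_point_error(R, t, points):
    """Mean Euclidean distance between points and their rigidly moved copies."""
    moved = points @ R.T + t
    return float(np.mean(np.linalg.norm(moved - points, axis=1)))

# Physical extent of the 128 x 128 x 40 volume with 1.8 x 1.8 x 4.58 mm voxels.
extent = np.array([128 * 1.8, 128 * 1.8, 40 * 4.58])
points = control_points(extent)

# A pure translation of (3, 4, 0) mm moves every point by 5 mm.
error = mean_point_error(np.eye(3), np.array([3.0, 4.0, 0.0]), points)
```

For a pure translation the measure equals the length of the translation vector; a rotation contributes more for points that lie far from the rotation centre.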
6 Conclusion
We presented a new image registration algorithm that is able to geometrically align a mutually translated and rotated pair of 3D images. The method iteratively recovers the translational component of the misalignment by the Phase Correlation Method and the rotational component by applying the same method to cylindrically mapped images. The experimental results on medical data show the method's usability even for large initial misalignment, which is a common limitation, for example, for methods based on mutual information. On the other hand, the current method is able to register images only with pixel-level precision.
Acknowledgement
The authors would like to thank Helena Trojanová from the Clinic of Nuclear Medicine of the Teaching Hospital Královské Vinohrady, Prague, Czech Republic, for kindly providing the MRI data, and the Czech Ministry of Education for support under the project No. 1M6798555601 (Research Center DAR).
References
1. Baker, M.J.: Maths - rotations, http://www.euclideanspace.com
2. de Castro, E., Morandi, C.: Registration of translated and rotated images using finite Fourier transform. IEEE Trans. Pattern Analysis and Machine Intelligence 9(5), 700–703 (1987)
3. Elliot, J.P., Dawber, P.G.: Symmetry in Physics. MacMillan, London (1979)
4. Foroosh, H., Zerubia, J.B., Berthod, M.: Extension of phase correlation to subpixel registration. IEEE Trans. Image Processing 11(3), 188–200 (2002)
5. Hoge, W.S.: A subspace identification extension to the phase correlation method. IEEE Trans. Medical Imaging 22(2), 277–280 (2003)
6. Hoge, W.S., Westin, C.F.: Identification of translational displacements between n-dimensional data sets using the high-order SVD and phase correlation. IEEE Trans. Image Processing 14(7), 884–889 (2005)
7. Ibanez, L., Schroeder, W., Ng, L., Cates, J.: The ITK Software Guide, 2nd edn. Kitware, Inc. (2005) ISBN 1-930934-15-7, http://www.itk.org/ItkSoftwareGuide.pdf
8. Keller, Y., Averbuch, A., Miller, O.: Robust phase correlation. In: International Conference on Pattern Recognition, pages II, 740–743 (2004)
9. Keller, Y., Shkolnisky, Y., Averbuch, A.: Volume registration using the 3-D pseudopolar Fourier transform. IEEE Trans. Image Processing 54(11), 4323–4331 (2006)
10. Kuglin, C.D., Hines, D.C.: The phase correlation image alignment method. In: Proc. Int. Conf. on Cybernetics and Society, pp. 163–165. IEEE, Orlando (1975)
11. Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.: Numerical Recipes in C, 2nd edn. Cambridge University Press, Cambridge (1992)
12. Reddy, B.S., Chatterji, B.N.: An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Trans. Image Processing 5(8), 1266–1271 (1996)
13. Stone, H.S., Orchard, M.T., Chang, E.C., Martucci, S.A.: A fast direct Fourier-based algorithm for subpixel registration of images. IEEE Trans. Geoscience and Remote Sensing 39(10), 2235–2243 (2001)
14. Zitová, B., Flusser, J.: Image registration methods: a survey. Image and Vision Computing 21(11), 977–1000 (2003)
Extended Global Optimization Strategy for Rigid 2D/3D Image Registration
Alexander Kubias (1), Frank Deinzer (2), Tobias Feldmann (1), and Dietrich Paulus (1)
(1) University of Koblenz-Landau, Koblenz, Germany, {kubias, tfeld, paulus}@uni-koblenz.de
(2) Siemens Medical Solutions, Forchheim, Germany, [email protected]
Abstract. Rigid 2D/3D image registration is a common strategy in medical image processing. In this paper we present an extended global optimization strategy for rigid 2D/3D image registration that consists of three components: a combination of a global and a local optimizer, a combination of a multi-scale and a multi-resolution approach, and a combination of an in-plane and an out-of-plane registration. The global optimizer Adaptive Random Search is used to provide several coarse registration results on a low resolution level that are refined by the local optimizer Best Neighbor on a higher resolution level. We evaluate the performance and the precision of our registration algorithm using two phantom models. We could confirm that all three components of our optimization strategy lead to a significant improvement of the registration.
Keywords: Image Registration, Global Optimization, Multi-Resolution.
1 Introduction and Related Work
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 759–767, 2007. © Springer-Verlag Berlin Heidelberg 2007
As mentioned in [1], image registration is a very common problem in medical image processing and, thus, automatic image registration is a very important component in current medical imaging systems [2]. In 2D/3D image registration, a pre-operative volume (e.g. CT or MRT) is registered with an intra-operative X-ray image. Thus, the preoperatively acquired volume can be used for intraoperative therapy guidance [3]. In this paper, we are only interested in intensity-based methods for 2D/3D image registration. These methods consist of two parts, namely the DRR (Digitally Reconstructed Radiograph) generation and the computation of a similarity measure between the DRR and the X-ray image. As our main contribution, we present an extended global optimization strategy for rigid 2D/3D image registration. Our optimization strategy is a combination of three single components. First, a global optimizer and a local optimizer are combined. Nowadays, local optimizers are used frequently for image registration tasks as they are easy to implement and cost-efficient. In contrast, global optimizers are rarely used because of their high computational costs. However, by using a global optimizer, one avoids being trapped in a local optimum and thereby missing the global optimum. In our optimization method, the optimizer
Fig. 1. Differentiation of in-plane and out-of-plane transformations. In this example, a translation along the Y-axis or the Z-axis is an in-plane translation, otherwise an outof-plane translation. A rotation around the X-axis is an in-plane rotation, otherwise an out-of-plane rotation.
Adaptive Random Search (ARS) [4] provides coarse registration results that are refined by the local optimizer Best Neighbor (BN) [5]. Despite its simplicity, the local optimizer Best Neighbor can achieve robust registration results [6] and, thus, it is frequently used [7]. Secondly, we combine a multi-resolution and a multi-scale approach. As a main difference between both approaches, images are used in different resolution levels in a multi-resolution approach, whereas images are only in a single resolution level in a multi-scale approach. Thus, in a multi-resolution approach images are implicitly smoothed by reducing their size. Instead, in a multi-scale approach images must be explicitly smoothed, e.g. by a Gaussian filter [8]. In our multi-resolution approach, the global optimizer is executed on a low resolution level, whereas the local optimizer runs on a higher resolution level [9]. In our multi-scale approach, DRRs are created from volumes with fewer slices and, thereby, smoothed. A similar method was proposed in [6], but not combined with a multi-resolution approach. By using a multi-resolution and a multi-scale approach, the runtime of the registration algorithm [10] and the number of local optima can be significantly reduced [2]. Thirdly, we do not estimate the six parameters of the rigid 2D/3D image registration problem in parallel because in a six-dimensional parameter space, many evaluations are needed in order to obtain a precise registration result. Instead, we split this six-dimensional problem into two three-dimensional problems by first estimating the in-plane parameters and, then, estimating the out-of-plane parameters. As mentioned in [11], the effects of the translations and rotations in the projected image depend on the axis according to which the translation or rotation is performed, if a perspective projection is used. These different effects are especially visible for translations. 
In-plane and out-of-plane translations must therefore be distinguished. A translation that moves an object parallel to the image plane is an in-plane translation. In contrast, a translation that moves an object orthogonally to the image plane is called an out-of-plane translation [11]. In Fig. 1, translations along the Y-axis and the Z-axis are in-plane translations and a translation along the X-axis is an out-of-plane translation. Rotations are treated in a similar way. In Fig. 1, the rotation around the X-axis is an in-plane rotation. The two other rotations are out-of-plane rotations. We present a new method for distinguishing in-plane and out-of-plane transformations and estimating them separately. Other approaches that cope with these problems were proposed in [7] and in [12]. In Section 2 we describe the basics of rigid 2D/3D image registration and the global optimization strategy that was used for the registration. In Section 3 we show and discuss the results of our implementation. For the sake of performance, we implemented our 2D/3D image registration algorithm completely on the GPU. Finally, in Section 4 we conclude and present some ideas for our future work.
2 Extended Global Optimization Strategy
2.1 2D/3D Image Registration
In 2D/3D image registration, a preoperative volume (e.g. CT or MRI) is registered with an intra-operative X-ray image. In this paper we restrict ourselves to rigid image registration, where the volume can only be translated along and rotated around the three coordinate axes. This transformation is given by the parameter vector $x = (t_x, t_y, t_z, r_x, r_y, r_z)^T$. The parameters $t_x, t_y, t_z$ represent the translation in mm along the X-, Y- and Z-axis, whereas the parameters $r_x, r_y, r_z$ describe the rotation vector $r = (r_x, r_y, r_z)^T$, which is internally represented in Rodrigues notation [13].

$$\tilde{p}_p = P \cdot \tilde{p}_w = K \cdot \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \cdot D \cdot \tilde{p}_w \tag{1}$$

In order to generate a DRR from a volume, each point $\tilde{p}_w$ of the volume in the world coordinate system (WCS) [13] is mapped onto a point $\tilde{p}_p$ in the pixel coordinate system (PCS) [13] according to (1). The points $\tilde{p}_w$ and $\tilde{p}_p$ are given in homogeneous coordinates [13]. The projection matrix $P$ is the product of the matrix $K$, which consists of the intrinsic camera parameters, a further matrix, which reduces the number of dimensions, and the matrix $D$, which consists of the extrinsic camera parameters [14]. Prior to the DRR generation, a transformation according to the parameter vector $x$ is applied to the volume. Therefore, (1) is extended to

$$\tilde{p}_p = P \cdot T \cdot \tilde{p}_w . \tag{2}$$
In (2), the transformation matrix $T$, which models the translations and rotations of $x$, is applied to the points of the volume. $T$ is the matrix representation of $x$, obtained by using the Rodrigues formula in [14] and extending it with the translation of $x$. Afterwards, the DRR $I_{DRR}(x)$ is produced from the transformed volume (according to the current transformation $x$) and compared to the X-ray image $I_{FLL}$ by applying a similarity measure $S(I_{FLL}, I_{DRR}(x))$. The X-ray image $I_{FLL}$ was acquired beforehand by an X-ray C-arm system. By using an optimization technique, the similarity between the DRR and the X-ray image is increased until the optimal parameter vector $x^*$ is found:

$$x^* = \arg\max_{x} S(I_{FLL}, I_{DRR}(x)) . \tag{3}$$
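As an illustration of how the matrix $T$ can be assembled from the parameter vector, the following minimal sketch builds a rotation matrix with the Rodrigues formula and extends it with the translation. It assumes that the rotation vector is given in radians and that its length is the rotation angle; the function names `rodrigues`, `rigid_matrix` and `apply` are hypothetical and not taken from the paper.

```python
import math

def rodrigues(r):
    """Rotation matrix (3x3 nested lists) for the rotation vector r.

    Assumption: the direction of r is the rotation axis and its length
    is the rotation angle in radians.
    """
    angle = math.sqrt(sum(comp * comp for comp in r))
    if angle < 1e-12:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (comp / angle for comp in r)
    c, s = math.cos(angle), math.sin(angle)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]

def rigid_matrix(x):
    """4x4 homogeneous matrix T for x = (tx, ty, tz, rx, ry, rz)."""
    tx, ty, tz, rx, ry, rz = x
    rot = rodrigues((rx, ry, rz))
    return [rot[0] + [tx], rot[1] + [ty], rot[2] + [tz],
            [0.0, 0.0, 0.0, 1.0]]

def apply(T, p):
    """Apply the 4x4 matrix T to the homogeneous point p (length 4)."""
    return [sum(T[i][j] * p[j] for j in range(4)) for i in range(4)]

# Rotate by 90 degrees around the Z-axis, then translate by (1, 2, 3).
q = apply(rigid_matrix((1.0, 2.0, 3.0, 0.0, 0.0, math.pi / 2)),
          [1.0, 0.0, 0.0, 1.0])
```

In a full pipeline the resulting matrix would be multiplied with the projection matrix $P$ as in (2); the projection itself is omitted here.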
2.2 In-Plane and Out-of-Plane Registration
As in-plane transformations cause the dominant movements in the projected image and out-of-plane transformations affect the projected image only slightly, it is useful to distinguish between them. We propose a new optimization strategy in which the six-dimensional registration problem is divided into a three-dimensional in-plane registration and a three-dimensional out-of-plane registration. All six parameters are initialized with zeros. In the first step, an in-plane registration is executed in which only the three in-plane parameters are estimated. These parameters are easier to estimate as they cause major changes in the projected image. Meanwhile, the three out-of-plane parameters remain unchanged. In the second step, the out-of-plane registration is executed, in which only the three out-of-plane parameters are determined. The starting point for the out-of-plane registration is the parameter vector that is returned by the in-plane registration. As the out-of-plane transformations cause only small effects in the projected image, it is difficult to obtain a stable estimate for them. Nevertheless, their estimation can be improved significantly if the in-plane transformations are already properly determined.

The identification of the in-plane and out-of-plane transformations is not trivial because it depends on the point of view from which the volume is looked at. In contrast to the world coordinate system (WCS), it is known for the camera coordinate system (CCS) that translations along the X-axis and the Y-axis are in-plane translations and that the rotation around the Z-axis is an in-plane rotation. Consequently, the three other transformations are out-of-plane transformations. Working in the CCS thus allows the in-plane and out-of-plane transformations to be determined independently of the current point of view.
In order to exploit the benefits of the CCS, the WCS is rotated by $R$ so that it has the same orientation as the CCS; the two systems then differ only by a translation. This rotation, which is expressed by the rotation matrix $R$, can be computed from the projection matrix $P$ by using the QR-decomposition [14]. In the rotated WCS, the transformation $T$ is applied to the volume. Finally, the rotated WCS is transformed back into its original position by the inverse rotation $R^{-1}$. Thus, (2) must be amended to

$$\tilde{p}_p = P \cdot R \cdot T \cdot R^{-1} \cdot \tilde{p}_w . \tag{4}$$
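The effect of the product $R \cdot T \cdot R^{-1}$ in (4) can be checked on a small example. The sketch below is only an illustration under stated assumptions: the camera is assumed to look along the world X-axis, the columns of the hypothetical matrix `R` hold the camera axes expressed in world coordinates, and the helper names are not taken from the paper. An out-of-plane translation defined along the camera Z-axis then turns into a translation along the world X-axis.

```python
def matmul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(A):
    """Transpose of a matrix given as nested lists."""
    return [list(row) for row in zip(*A)]

def conjugate(R, T):
    """Return R * T * R^-1; for a pure rotation R the inverse is R^T."""
    return matmul(matmul(R, T), transpose(R))

# Assumed set-up: camera X = world Y, camera Y = world Z, camera Z = world X.
R = [[0.0, 0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

# Out-of-plane translation of 5 mm along the camera Z-axis.
T_cam = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 5.0],
         [0.0, 0.0, 0.0, 1.0]]

# The same transformation expressed in world coordinates.
T_world = conjugate(R, T_cam)
```

With this construction the optimizer can always treat the first two translations and the last rotation of $x$ as in-plane, whatever the viewing direction is.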
2.3 Multi-resolution and Multi-scale Registration
As mentioned before, a combination of a multi-resolution and a multi-scale approach is applied in our optimization strategy. For the multi-resolution approach, an X-ray image, whose original size was $1024^2$ pixels, is scaled to different resolutions with $512^2$, $256^2$, $128^2$ and $64^2$ pixels. The scaling is done by a mean value filter with a 2 × 2 mask. The DRR is always created in the same size as the X-ray image. The multi-scale approach is achieved by using different resolutions of a CT volume for the DRR generation. A similar multi-scale approach was described in [6], but not combined with a multi-resolution approach. The scaling of the CT volume is done by computing the mean value in a neighborhood of 2 × 2 × 2 voxels. For generating DRRs in a high resolution, we use a CT volume with $256^3$ voxels. If the DRR is to be produced in a lower resolution, a CT volume with $128^3$ voxels or even $64^3$ voxels is used. Thus, an artificial smoothing of the DRR is achieved.

Our final optimization strategy is divided into a global optimization step and a local optimization step. First, the global optimization step is executed, which achieves a coarse alignment of the X-ray image and the CT volume on the lowest resolution level, in which the X-ray image has $64^2$ pixels and the CT volume has $64^3$ voxels. For this purpose, the global optimizer Adaptive Random Search (ARS) creates $n_{init}$ initial parameter vectors, of which $n_{new}$ parameter vectors are optimized over $n_{iter}$ iterations. One of the parameters $n_{init}$, $n_{new}$ and $n_{iter}$ is determined in the experiments. As described in section 2.2, only the in-plane transformations are estimated at first. Subsequently, the out-of-plane transformations are determined. The starting point for this global out-of-plane registration is the result of the global in-plane registration. After the convergence of the global optimizer, the $n_{local}$ best parameter vectors of the global optimizer are used as starting points for the local optimization.
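The resolution levels of the X-ray image can be produced with a few lines of code. The sketch below assumes that the 2 × 2 mean value filter is applied to non-overlapping blocks, which halves the resolution in each step; the text does not state this explicitly, and the function names are hypothetical. The same idea extends to 2 × 2 × 2 blocks for the CT volume.

```python
def downsample_2x2(image):
    """Halve the resolution by averaging non-overlapping 2 x 2 blocks.

    `image` is a list of rows with an even number of rows and columns.
    """
    rows, cols = len(image), len(image[0])
    return [[(image[r][c] + image[r][c + 1]
              + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]

def build_pyramid(image, min_size):
    """Return the list of levels, from the input size down to min_size."""
    levels = [image]
    while len(levels[-1]) // 2 >= min_size:
        levels.append(downsample_2x2(levels[-1]))
    return levels

small = downsample_2x2([[1, 3, 5, 7],
                        [1, 3, 5, 7],
                        [2, 2, 6, 6],
                        [2, 2, 6, 6]])
pyramid = build_pyramid([[0.0] * 16 for _ in range(16)], 4)
```

Starting from a $1024^2$ image with `min_size = 64`, this would yield the five levels named above.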
The local optimization is executed on a higher resolution level, for which an X-ray image with $256^2$ pixels and a CT volume with $256^3$ voxels are used. In contrast to the global optimization, the local optimizer estimates all six parameters in parallel. Otherwise, new local optima could be generated. Each of the $n_{local}$ parameter vectors is locally optimized, one after the other. The best of these $n_{local}$ parameter vectors is returned as the result of the registration algorithm. For the sake of performance, we do not use higher resolution levels for the CT volume or the X-ray image. Nevertheless, the optimization process could be continued on a higher resolution level in order to obtain even better results.
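The order of the optimization steps can be sketched on a toy problem. The code below is not the authors' implementation: plain uniform random sampling stands in for ARS, a simple axis-aligned neighbour search stands in for the local optimizer, only one starting point instead of $n_{local}$ is refined, and the similarity measure is a hypothetical weighted quadratic in which the in-plane parameters dominate. All names and numeric settings are assumptions chosen for illustration.

```python
import random

IN_PLANE = (0, 1, 5)      # t_x, t_y, r_z in the camera-aligned frame
OUT_OF_PLANE = (2, 3, 4)  # t_z, r_x, r_y

def random_search(similarity, start, free, radius, n_samples, rng):
    """Sample only the `free` parameters around `start`; keep the best."""
    best, best_score = list(start), similarity(start)
    for _ in range(n_samples):
        cand = list(start)
        for i in free:
            cand[i] = start[i] + rng.uniform(-radius, radius)
        score = similarity(cand)
        if score > best_score:
            best, best_score = cand, score
    return best

def neighbour_search(similarity, start, step, min_step):
    """Move to the best axis-aligned neighbour; halve the step when stuck."""
    x, score = list(start), similarity(start)
    while step >= min_step:
        neighbours = []
        for i in range(len(x)):
            for delta in (-step, step):
                cand = list(x)
                cand[i] += delta
                neighbours.append((similarity(cand), cand))
        top_score, top = max(neighbours, key=lambda item: item[0])
        if top_score > score:
            x, score = top, top_score
        else:
            step /= 2.0
    return x

def register(similarity, rng):
    x = [0.0] * 6                                   # all parameters start at zero
    x = random_search(similarity, x, IN_PLANE, 20.0, 2000, rng)
    x = random_search(similarity, x, OUT_OF_PLANE, 20.0, 2000, rng)
    return neighbour_search(similarity, x, 1.0, 1e-3)  # all six in parallel

# Toy similarity: in-plane parameters have a much larger influence.
truth = [3.0, -2.0, 1.5, 0.5, -1.0, 2.0]
weights = [10.0, 10.0, 0.1, 0.1, 0.1, 10.0]

def toy_similarity(x):
    return -sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, truth))

estimate = register(toy_similarity, random.Random(0))
```

In the real setting the two global stages would run on the coarsest image and volume, and the local stage on the finer ones.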
3 Results and Discussion
In order to evaluate the performance and the precision of our algorithm, we use a GeForce 7800 GS AGP with 256 MB. The GPU programming is done entirely in OpenGL and Cg. The CPU source code is executed on a single Pentium 4 CPU with 3.2 GHz.

Fig. 2. Phantom model of a head and a thorax

In the following, we perform several experiments in order to determine reasonable values for the proposed parameters from section 2.3 and in order to show the benefits of our optimization strategy. For the experiments, we use the two phantom models whose X-ray images are shown in Fig. 2. The CT volumes of these phantom models were acquired by an X-ray C-arm system. The X-ray image of the head is registered to the volume of the head and the X-ray image of the thorax is registered to the volume of the thorax. As the ground truth position is known for both phantom models, the registration error can be measured in mm using the Euclidean distance in the image. From these single distances the 95th, 90th and 75th percentiles are computed. As the ground truth position is given by the parameter vector $x^* = (0, 0, 0, 0, 0, 0)$, we add an artificial transformation in order to obtain the starting point for our registration. The artificial transformation translates the volume 14 mm along the X-axis, 15 mm along the Y-axis and 16 mm along the Z-axis and rotates it 9 degrees around the X-axis, 10 degrees around the Y-axis and 11 degrees around the Z-axis. The resulting starting point is used in all experiments.

Nowadays, registration problems are often solved using a local optimizer. Therefore, in our first experiment we compare a local optimization strategy (6-dim local) to a global optimization strategy (6-dim global). In both strategies, the six parameters of the parameter space are estimated in parallel. We also compare these two approaches to our global in-plane and out-of-plane optimization strategy (in-out global). In our optimization strategy, the parameters from section 2.3 are set to $n_{init} = 10000$, $n_{new} = 100$, $n_{iter} = 100$ and $n_{local} = 10$. One of these values is determined in the subsequent experiment. The global optimizer is executed on the lowest resolution level, in which the X-ray image has $64^2$ pixels and the CT volume has $64^3$ voxels, and the local optimizer is executed on a higher resolution level, in which the X-ray image has $256^2$ pixels and the CT volume has $256^3$ voxels.
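The percentile-based error measure described above can be computed as in the following minimal sketch. It assumes that the error is evaluated at a set of corresponding image points and uses the nearest-rank definition of a percentile; the text does not say which points or which percentile definition were used, so both are assumptions, and the function names are hypothetical.

```python
import math

def point_errors(estimated, reference):
    """Euclidean distances between corresponding 2-D points."""
    return [math.hypot(ex - rx, ey - ry)
            for (ex, ey), (rx, ry) in zip(estimated, reference)]

def percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of q * n / 100 avoids floating-point rounding.
    rank = max(1, -(-q * len(ordered) // 100))
    return ordered[rank - 1]

errors = point_errors([(3.0, 4.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 0.0)])
sample = list(range(1, 21))  # twenty hypothetical error values
```

For the twenty sample values, `percentile(sample, 95)`, `percentile(sample, 90)` and `percentile(sample, 75)` select the 19th, 18th and 15th smallest value, respectively.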
As shown in Table 1, our global optimization strategy was about six times slower than the local optimization strategy. Compared to this local approach, the registration error of our optimization strategy was about 90% smaller. The local optimizer was often trapped in a local optimum and thus could not achieve precise registration results. Compared to the six-dimensional global optimization strategy, our strategy provided slightly worse results for the head. As the head can be easily registered, an improvement of the registration cannot be accomplished by using the in-plane and out-of-plane separation. However, our strategy achieved clearly better results for the registration of the thorax. The registration error was about 80% smaller than the registration error of the six-dimensional global optimization strategy.

Table 1. We compared our optimization strategy (in-out global) to a local and a global optimization strategy, in which all six parameters were estimated in parallel. For each approach, the registration error and the runtime are given. Our strategy achieved about 90% smaller registration errors than the local optimizer. The measured values were averaged over 50 registration runs.

registration     95th percentile     90th percentile     75th percentile     runtime
results          head     thorax     head     thorax     head     thorax     head    thorax
6-dim local      41.7mm   51.3mm     36.4mm   44.9mm     30.3mm   36.6mm     91s     93s
6-dim global     3.4mm    31.3mm     2.4mm    24.9mm     1.0mm    4.6mm      343s    229s
in-out global    5.0mm    4.9mm      4.0mm    3.9mm      2.6mm    2.7mm      547s    579s

As mentioned before, the parameters $n_{init}$, $n_{new}$, $n_{iter}$ and $n_{local}$ have to be determined for our global in-plane and out-of-plane optimization strategy. Within this paper, we only present the experiment in which the value of the parameter $n_{init}$ is determined. The values of the parameters $n_{new} = 100$, $n_{iter} = 100$ and $n_{local} = 10$ were likewise chosen so as to enable an efficient and precise registration process. We analyzed how much the performance and the precision of the registration are affected by the number of initial parameter vectors. As shown in Table 2, the precision of the registration improved with an increasing number of initial parameter vectors. We decided to use $n_{init} = 10^4$ initial parameter vectors as they provided a better registration result than $n_{init} = 10^3$ initial parameter vectors and were only slightly worse than $n_{init} = 10^5$ initial parameter vectors. Compared to $n_{init} = 10^3$ initial parameter vectors, the runtime of the registration process was only slightly longer. However, we did not use $n_{init} = 10^5$ initial parameter vectors as they needed much more computation time and achieved only slightly better registration results.

Table 2. For our approach (in-out global), we could confirm that the registration error and the computation time depend on the number of initial parameter vectors. The parameters were set to $n_{new} = 100$, $n_{iter} = 100$ and $n_{local} = 10$. The measured values were averaged over 50 registration runs.

n_init     95th percentile     90th percentile     75th percentile     runtime
           head     thorax     head     thorax     head     thorax     head     thorax
10^3       5.2mm    5.0mm      4.1mm    4.0mm      3.1mm    2.8mm      463s     540s
10^4       5.0mm    4.9mm      4.0mm    3.9mm      2.6mm    2.7mm      547s     579s
10^5       4.8mm    4.8mm      3.8mm    3.8mm      2.7mm    2.7mm      1822s    2252s
4 Conclusions and Future Work
In this paper we presented an extended global optimization strategy for a rigid 2D/3D image registration that consists of three components: a combination of a global and a local optimizer, a combination of a multi-scale and a multi-resolution approach, and a combination of an in-plane and an out-of-plane registration. The global optimizer Adaptive Random Search was used to provide several coarse registration results that were refined by the local optimizer Best Neighbor. We evaluated the performance and the precision of our registration algorithm using two phantom models. We could confirm that all three components of our optimization strategy lead to an improvement of the registration. Compared to a local optimizer, the registration error of our optimization strategy was about 90% smaller. In the future, we will extend our global optimization algorithm to feature-based similarity measures. Additionally, we will analyze other global optimizers and integrate them into our optimization strategy.
References
1. Goecke, R., Weese, J., Schumann, H.: Fast Volume Rendering Methods for Voxel-based 2D/3D Registration - A Comparative Study. In: Proceedings of International Workshop on Biomedical Image Registration ’99, pp. 89–102 (1999)
2. Khamene, A., Chisu, R., Wein, W., Navab, N., Sauer, F.: A Novel Projection Based Approach for Medical Image Registration. In: Pluim, J.P.W., Likar, B., Gerritsen, F.A. (eds.) WBIR 2006. LNCS, vol. 4057, pp. 247–256. Springer, Heidelberg (2006)
3. Russakoff, D.B., Rohlfing, T., Mori, K., Rueckert, D., Adler Jr., J.R., Maurer Jr., C.R.: Fast Generation of Digitally Reconstructed Radiographs using Attenuation Fields with Application to 2D-3D Image Registration. IEEE Transactions on Medical Imaging 24(11), 1441–1454 (2005)
4. Zhigljavsky, A.: Theory of Global Random Search. Kluwer Academic, Edmonton (1991)
5. Törn, A., Zilinskas, A.: Global Optimization. Springer, Berlin (1989)
6. Wein, W.: Intensity Based Rigid 2D-3D Registration Algorithms for Radiation Therapy. Master’s thesis, Technische Universität München (2003)
7. Russakoff, D., Rohlfing, T., Shahidi, R., Kim, D., Adler, J., Maurer, C.: Intensity-Based 2D-3D Spine Image Registration Incorporating One Fiducial Marker. In: Ellis, R.E., Peters, T.M. (eds.) MICCAI 2003. LNCS, vol. 2878, pp. 287–294. Springer, Heidelberg (2003)
8. Adelson, E.H., Anderson, C.H., Bergen, J.R., Burt, P.J., Ogden, J.: Pyramid Methods in Image Processing. RCA Engineer 29(6), 33–41 (1984)
9. Unser, M., Aldroubi, A., Gerfen, C.: A Multiresolution Image Registration Procedure Using Spline Pyramids. In: Proc. of the SPIE Conference on Mathematical Imaging: Wavelet Applications in Signal and Image Processing, vol. 2034, pp. 160–170. SPIE Press, San Diego, CA (1993)
10. Thevenaz, P., Ruttimann, U.E., Unser, M.: Iterative Multi-Scale Registration Without Landmarks. In: Proc. of the 1995 International Conference on Image Processing, p. 3228. IEEE Computer Society, Washington, DC, USA (1995)
11. Penney, G.P.: Registration of Tomographic Images to X-Ray Projections for Use in Image Guided Interventions. PhD thesis, King’s College, London (1999)
12. Skerl, D., Likar, B., Pernus, F.: A Protocol for Evaluation of Similarity Measures for Rigid Registration. IEEE Transactions on Medical Imaging 25(6), 779–791 (2006)
13. Trucco, E., Verri, A.: Introductory Techniques for 3–D Computer Vision. Prentice Hall, New York (1998)
14. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
A Fast B-Spline Pseudo-inversion Algorithm for Consistent Image Registration
Antonio Tristán and Juan Ignacio Arribas
Laboratorio de Procesado de Imagen, Universidad de Valladolid, Spain
[email protected],[email protected]
http://www.lpi.tel.uva.es
Abstract. Recently, the concept of consistent image registration has been introduced to refer to a set of algorithms that estimate both the direct and the inverse deformation together, that is, they alternately exchange the roles of the target and the scene images; it has been demonstrated that this technique improves the registration accuracy, and that the biological significance of the obtained deformations is also improved. When dealing with free form deformations, the inversion of the transformations obtained becomes computationally intensive. In this paper, we suggest the parametrization of such deformations by means of a cubic B-spline, and its approximate inversion using a highly efficient algorithm. The results show that the consistency constraint notably improves the registration accuracy, especially in cases of a heavy initial misregistration, with very little computational overhead.

Keywords: Free form image registration, consistent registration, B-splines, inverse transformation.
1 Introduction
Elastic image registration is an important and challenging issue in medical imaging, and it has been successfully applied to a number of problems of great interest in recent years [1]. The anatomy being imaged quite often involves deformable tissues, such as the breast [2,3], the brain [4], the kidney, and so on [1]; hence the importance of fully deformable models for image registration. The idea of consistent image registration was recently introduced in [5] and further explored in [6], and the convenience of this technique has been pointed out by an increasing number of authors [4,7,8,9,10]. The idea is that when one has to register a pair of images, that is, to deform a model to match a given scene, it is useful to estimate the inverse transformation as well, that is, the one that matches the scene to the model, constraining both transformations to be the inverse of each other. This helps to reduce the presence of local minima, usually yielding more biologically meaningful results [5,6,7], and even more accurate and robust estimates of the deformation [9,10]. This is especially well suited for elastic registration, where local minima may become especially problematic. Consistent image registration is therefore an interesting issue in medical imaging, where elastic registration becomes more useful each day.

Although it is a relatively recent idea, a number of algorithms have been developed that consider this technique. The common way to impose the consistent constraint has consisted in adding a penalising term to the cost function [4,6,10]. In [9], the authors propose a physical deformation model in which the forces that deform the image are symmetrised following the action and reaction principle. The method proposed here is different in the sense that it requires the effective inversion of the geometric transformation obtained; similar approaches can be found in [5,7], where the authors iteratively invert the deformation at each point, or in [8], where the direct and inverse optical flows are rectified in each iteration with a linear estimate of the inverse transformation. The use of a penalising term is well suited for variational approaches [6,10], but the effective inversion of the transformation may be a better choice in non-iterative algorithms. However, this inversion is computationally intensive with elastic registration, as is the case in [5], where a fixed point technique must be performed at each pixel.

The main contribution of our paper is the development of a technique that provides a very efficient way to impose the consistent constraint using a parametric elastic transformation (a cubic B-spline, BS); to the best of our knowledge, it is the first attempt to apply the direct inversion methodology with a parametric deformation. Although our algorithm does not result in an exact inversion of the deformation, but in a pseudo-inversion instead, experiments demonstrate that it substantially improves the registration accuracy, with a modest computational overhead. In section 2 we briefly introduce the use of B-splines as interpolators. Section 3 is devoted to the pseudo-inversion algorithm. In section 4 we show some results that demonstrate the usefulness of the algorithm, and finally, in section 5, we give some final considerations and conclude.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 768–775, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 B-Spline Representation of the Deformation
The use of BS in image registration has been widely validated [2,3], their main advantage being their very low computational load. Compared to the classical Thin Plate Splines (TPS, [11]), they have the drawback of worse regularity properties; however, the hierarchical interpolation technique presented in [12] overcomes this difficulty in practice. Here, we focus on the two-dimensional case, although the extension to 3-D is immediate. Let $J$ be an image with compact support $\Omega \equiv [0, X] \times [0, Y] \subset \mathbb{R}^2$; the deformation at a point $(x, y) \in \Omega$ is then given by the tensor product:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} x + \sum_{k=0}^{3} \sum_{l=0}^{3} B_k(s)\, B_l(t)\, \Phi^{x}_{(i+k)(j+l)} \\ y + \sum_{k=0}^{3} \sum_{l=0}^{3} B_k(s)\, B_l(t)\, \Phi^{y}_{(i+k)(j+l)} \end{pmatrix}, \tag{1}$$

where $i = \lfloor Mx/X \rfloor - 1$, $j = \lfloor Ny/Y \rfloor - 1$, $s = Mx/X - \lfloor Mx/X \rfloor$, $t = Ny/Y - \lfloor Ny/Y \rfloor$,¹ and $\Phi^{x,y}_{ij}$ are the interpolation parameters to compute, one for each of the $2 \times (M + 3) \times (N + 3)$ grid nodes (note that there is a grid for each dimension), with $M + 3$ and $N + 3$ the number of nodes in each dimension, so that $-1 \le i \le M + 1$, $-1 \le j \le N + 1$, and $0 \le s, t < 1$. $B_i(\cdot)$, $i = 0, 1, 2, 3$, are the BS kernels, which are third-degree polynomials [12], so that BS interpolators are piecewise polynomials with the nice property of being twice differentiable. Besides, they have another interesting property, their locality: a change in one of the spline coefficients $\Phi^{x,y}_{ij}$ affects only a vicinity of the corresponding node $(i, j)$. In practice, this means that to interpolate a new control point $(x, y) \in \Omega$, only a few (16, in the 2D case) grid nodes have to be changed. This property is used in [12] to hierarchically interpolate the control points of the deformation in a least-squares sense. This is the same approach followed in [2], and in our own work as well.

¹ Here $\lfloor x \rfloor$ represents the greatest integer not greater than $x$.
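The evaluation of the deformation in eq. (1) can be sketched as follows. The kernel polynomials are the standard uniform cubic B-spline basis functions, which is what [12] is assumed to use here; the storage layout (list index a holds grid node a - 1) and the function names are choices made for this illustration only.

```python
import math

def bspline_kernels(u):
    """Uniform cubic B-spline basis values B_0..B_3 at u in [0, 1)."""
    return ((1 - u) ** 3 / 6.0,
            (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0,
            (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0,
            u ** 3 / 6.0)

def deform(x, y, phi_x, phi_y, X, Y):
    """Evaluate eq. (1) at (x, y), assuming 0 <= x < X and 0 <= y < Y.

    phi_x and phi_y are (M + 3) x (N + 3) nested lists; list index a
    stores grid node a - 1, so that node -1 sits at index 0.
    """
    M, N = len(phi_x) - 3, len(phi_x[0]) - 3
    gx, gy = M * x / X, N * y / Y
    i, j = math.floor(gx) - 1, math.floor(gy) - 1
    bs = bspline_kernels(gx - math.floor(gx))
    bt = bspline_kernels(gy - math.floor(gy))
    dx = dy = 0.0
    for k in range(4):
        for m in range(4):
            w = bs[k] * bt[m]
            dx += w * phi_x[i + k + 1][j + m + 1]
            dy += w * phi_y[i + k + 1][j + m + 1]
    return x + dx, y + dy

# Constant coefficients give a constant displacement everywhere.
phi_x = [[2.0] * 7 for _ in range(7)]
phi_y = [[-1.0] * 7 for _ in range(7)]
moved = deform(10.3, 7.7, phi_x, phi_y, 20.0, 20.0)
```

Because the four kernels sum to one for every argument, constant coefficients reproduce a pure translation, which is a convenient sanity check.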
3 Pseudo-inversion Algorithm
A BS does not need to be invertible, and when it is, its inverse transformation is not necessarily another BS. To impose the consistent constraint, we define the pseudo-inverse BS $\Psi$ of a given BS $\Phi$ as a BS with the same support and grid nodes as $\Phi$; we use these grid nodes as control points (the values to interpolate) and force the displacements at the grid points of $\Psi$ to equal exactly the inverse of the deformation given by $\Phi$ at these points. This means that the inversion is exact at the grid nodes, and only approximate elsewhere. To obtain the inverse transformation at the grid nodes, we follow an approach similar to [5], which is in fact a variant of the fixed point method. From eq. (1), the deformation at a point $\mathbf{x} \equiv (x, y) \in \Omega$ can be written in compact form:

$$\mathbf{x}' = g(\mathbf{x}) = \mathbf{x} + f(\mathbf{x}) . \tag{2}$$

For a given point $\mathbf{x}_0 \in \Omega$, finding the value of the inverse transform $g^{-1}(\mathbf{x}_0)$ means finding the point $\mathbf{x}_t \in \Omega$ such that $g(\mathbf{x}_t) = \mathbf{x}_t + f(\mathbf{x}_t) = \mathbf{x}_0$; then $g^{-1}(\mathbf{x}_0) = \mathbf{x}_0 + f^{-1}(\mathbf{x}_0) = \mathbf{x}_t$, where $f^{-1}$ is not the inverse of $f$ but simply a convenient notation for the displacement of the inverse transform. Thus, for each grid node $\mathbf{x}_0$, we are interested in finding the point $\mathbf{x}_t$ such that:

$$\mathbf{x}_0 - f(\mathbf{x}_t) = \mathbf{x}_t \Leftrightarrow h_{\mathbf{x}_0}(\mathbf{x}_t) = \mathbf{x}_t , \tag{3}$$

which reduces to finding a fixed point of the function $h_{\mathbf{x}_0}$, defined for each grid node. To do so, starting from the initial approximation $\mathbf{x}_t^0$, the updating rule is:

$$\mathbf{x}_t^{n+1} = h_{\mathbf{x}_0}(\mathbf{x}_t^n) = \mathbf{x}_0 - f(\mathbf{x}_t^n) \Leftrightarrow f^{-1}(\mathbf{x}_0) = -f(\mathbf{x}_t^n) . \tag{4}$$

If the sequence $\{\mathbf{x}_t^n\}_{n=0}^{\infty}$ converges to a finite value, this value is $\mathbf{x}_t$. Unlike in [5], (4) is only used at the grid nodes; moreover, the successive evaluations are exact here due to the continuous nature of BS, while in [5] interpolation is needed, so convergence is faster here.
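The fixed-point rule of eq. (4) can be illustrated with a short sketch. It assumes that the iteration is started at the node itself and that the displacement field is a contraction; the text specifies neither the initial approximation nor a stopping rule, so both are assumptions, and the smooth toy displacement below is hypothetical and merely stands in for the BS deformation.

```python
import math

def invert_at(f, x0, tol=1e-10, max_iter=200):
    """Fixed-point iteration x_t <- x0 - f(x_t) for a 2-D displacement f.

    Returns the point x_t with x_t + f(x_t) = x0 (up to the tolerance).
    """
    xt = x0                       # assumed initial approximation
    for _ in range(max_iter):
        fx, fy = f(xt)
        nxt = (x0[0] - fx, x0[1] - fy)
        if math.hypot(nxt[0] - xt[0], nxt[1] - xt[1]) < tol:
            return nxt
        xt = nxt
    raise RuntimeError("fixed-point iteration did not converge")

def toy_displacement(p):
    # Small gradients keep the update rule contractive.
    return (0.3 * math.sin(p[1]), 0.2 * math.cos(p[0]))

node = (1.0, 2.0)
xt = invert_at(toy_displacement, node)
# Inverse displacement at the node: f^-1(x0) = x_t - x0 = -f(x_t).
inverse_displacement = (xt[0] - node[0], xt[1] - node[1])
```

In the algorithm of this section the loop would be run once per grid node, with `f` given by the spline of eq. (1).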
Once the deformation is known at the grid nodes, it only remains to compute the parameters $\Phi$ of the BS. At the grid nodes $s = t = 0$ [12], and we can write:

$$\begin{pmatrix} x' - x \\ y' - y \end{pmatrix} = \begin{pmatrix} \sum_{k=0}^{2} \sum_{l=0}^{2} \alpha_k \alpha_l\, \Phi^{x}_{(i+k)(j+l)} \\ \sum_{k=0}^{2} \sum_{l=0}^{2} \alpha_k \alpha_l\, \Phi^{y}_{(i+k)(j+l)} \end{pmatrix}, \tag{5}$$

where $\alpha_2 = \alpha_0 = \frac{1}{6}$ and $\alpha_1 = \frac{2}{3}$ are the result of evaluating the BS kernels at 0. Thus, (5) defines 2 linear systems of $(M + 3) \times (N + 3)$ equations with $(M + 3) \times (N + 3)$ unknowns. Nevertheless, each value of the displacement $f^{-1}(x_i, y_j)$ affects only nine $\Phi^x$ coefficients and another nine $\Phi^y$ coefficients, so the coefficient matrix is highly sparse. The generalisation to three or more dimensions is immediate; in fact, the tensorial character of the interpolation via BS allows the systems of equations to be written by blocks, as shown in eq. (6):

$$\begin{pmatrix}
\alpha_1 \Lambda & \alpha_2 \Lambda & 0 & 0 & \cdots & 0 & 0 \\
\alpha_0 \Lambda & \alpha_1 \Lambda & \alpha_2 \Lambda & 0 & \cdots & 0 & 0 \\
0 & \alpha_0 \Lambda & \alpha_1 \Lambda & \alpha_2 \Lambda & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \alpha_0 \Lambda & \alpha_1 \Lambda
\end{pmatrix}
\begin{pmatrix}
\Phi^{x,y,\ldots}_{-1} \\ \Phi^{x,y,\ldots}_{0} \\ \Phi^{x,y,\ldots}_{1} \\ \vdots \\ \Phi^{x,y,\ldots}_{M+1}
\end{pmatrix}
=
\begin{pmatrix}
f^{-1}_{x,y,\ldots}(x_{-1}) \\ f^{-1}_{x,y,\ldots}(x_{0}) \\ f^{-1}_{x,y,\ldots}(x_{1}) \\ \vdots \\ f^{-1}_{x,y,\ldots}(x_{M+1})
\end{pmatrix}, \tag{6}$$

where the vector unknowns $\Phi$ and the values of the displacements $f^{-1}(x_i)$ are the result of fixing the first of the indices of the grid, $i$, and the $d$-dimensional matrices have been arranged as column vectors. By the very nature of the BS [12], the extreme nodes in each dimension ($i = -1$, $i = M + 1$, for the $x$ coordinate) correspond to points not included in $\Omega$, so their inversion is not done following (4). Instead, we set $f^{-1}(x_{-1}) = f^{-1}(x_0)$ and $f^{-1}(x_{M+1}) = f^{-1}(x_M)$. Besides, matrix $\Lambda$ must present the same block-tridiagonal structure, implying that the value of the parameter at $y$ is only affected by the value of the displacement at the same $y$ and at the two adjoining ones, so that the system can be solved recursively. Starting with the first dimension we may write, for both systems:

$$\alpha_0 \Lambda \Phi_{n-1} + \alpha_1 \Lambda \Phi_n + \alpha_2 \Lambda \Phi_{n+1} = f_n^{-1}, \quad -1 < n < M + 1, \tag{7}$$

where we have adopted the notation $f_n^{-1} = f^{-1}(x_n)$ and extended the subindex $n$ to the boundary values $\Phi_{-2}$ and $\Phi_{M+2}$. We aim to find a solution in the form:

$$\Phi_{n+1} = \Sigma_n \Phi_n + \theta_n, \quad -2 \le n \le M, \tag{8}$$

and substituting (8) into (7), one obtains:

$$\begin{aligned}
& \alpha_0 \Lambda \Phi_{n-1} + \alpha_1 \Lambda \Phi_n + \alpha_2 \Lambda \left( \Sigma_n \Phi_n + \theta_n \right) = f_n^{-1} \Rightarrow \\
& \frac{\alpha_0}{\alpha_2} \Phi_{n-1} + \frac{\alpha_1}{\alpha_2} \Phi_n + \Sigma_n \Phi_n + \theta_n = \frac{1}{\alpha_2} \Lambda^{-1} f_n^{-1} \Rightarrow \\
& \left( \frac{\alpha_1}{\alpha_2} I_D + \Sigma_n \right) \Phi_n = -\Phi_{n-1} + \frac{1}{\alpha_2} \Lambda^{-1} f_n^{-1} - \theta_n .
\end{aligned} \tag{9}$$
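Identifying terms in (9), the following recursion is obtained:

$$\begin{aligned}
\Sigma_{n-1} &= -\left( \frac{\alpha_1}{\alpha_2} I_D + \Sigma_n \right)^{-1}, & -1 < n \le M + 1 \\
\theta_{n-1} &= \left( \frac{\alpha_1}{\alpha_2} I_D + \Sigma_n \right)^{-1} \left( \frac{1}{\alpha_2} \Lambda^{-1} f_n^{-1} - \theta_n \right), & -1 < n \le M + 1 .
\end{aligned} \tag{10}$$

At the same time, if $\Sigma_{M+1}$ can be written as $\gamma \cdot I_D$, it is easily proved by induction that $\Sigma_n = \gamma_n \cdot I_D$, and thus the solution of the system follows:

$$\begin{aligned}
\Phi_n &= \gamma_{n-1} \Phi_{n-1} + \theta_{n-1}, & -1 \le n \le M + 1 \\
\gamma_{n-1} &= -\frac{1}{\frac{\alpha_1}{\alpha_2} + \gamma_n}, & -1 \le n \le M + 1 \\
\theta_{n-1} &= -\gamma_{n-1} \left( \frac{1}{\alpha_2} \Lambda^{-1} f_n^{-1} - \theta_n \right), & -1 \le n \le M + 1 .
\end{aligned} \tag{11}$$

Matrix $\Lambda$ is again block-tridiagonal, hence the computation of $\Lambda^{-1} f_n^{-1}$ may be done by applying the algorithm recursively to the successive dimensions until reaching the last one, in which $\Lambda$ is simply 1. Regarding the first term in the recursion, we have found experimentally that the best choice is to force $\Phi_{M+2} = \Phi_{M+1}$, and thus we set $\gamma_{M+1} = 1$ and $\theta_{M+1} = 0$.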
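For a single dimension, where $\Lambda = 1$, the recursion of eq. (11) can be sketched as below. The backward sweep follows (11) with $\gamma_{M+1} = 1$ and $\theta_{M+1} = 0$. The text does not say how the forward sweep is started, so the sketch assumes the mirrored condition $\Phi_{-2} = \Phi_{-1}$ on the left boundary; this assumption, the list layout and the function names are not taken from the paper.

```python
ALPHA0, ALPHA1, ALPHA2 = 1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0

def solve_coefficients(f_inv):
    """Solve the 1-D version of eq. (7) for the spline coefficients.

    f_inv[k] is the inverse displacement at node n = k - 1 (n = -1..M+1).
    The recursion relies on ALPHA0 == ALPHA2, as eq. (11) does.
    """
    size = len(f_inv)
    ratio = ALPHA1 / ALPHA2
    # gamma[k] and theta[k] hold the values for n = k - 2 (n = -2..M+1).
    gamma = [0.0] * (size + 1)
    theta = [0.0] * (size + 1)
    gamma[size], theta[size] = 1.0, 0.0   # forces Phi_{M+2} = Phi_{M+1}
    for k in range(size, 0, -1):          # backward sweep of eq. (11)
        gamma[k - 1] = -1.0 / (ratio + gamma[k])
        theta[k - 1] = -gamma[k - 1] * (f_inv[k - 1] / ALPHA2 - theta[k])
    phi = [0.0] * size
    # Assumed left boundary: Phi_{-2} = Phi_{-1} (not stated in the text).
    phi[0] = theta[0] / (1.0 - gamma[0])
    for k in range(1, size):              # forward sweep
        phi[k] = gamma[k] * phi[k - 1] + theta[k]
    return phi

def apply_system(phi):
    """Left-hand side of eq. (7) with the mirrored boundary values."""
    padded = [phi[0]] + list(phi) + [phi[-1]]
    return [ALPHA0 * padded[k] + ALPHA1 * padded[k + 1]
            + ALPHA2 * padded[k + 2] for k in range(len(phi))]

displacements = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
coefficients = solve_coefficients(displacements)
```

The cost is linear in the number of nodes, which is what makes the pseudo-inversion cheap; in two or more dimensions the same routine would be applied along each grid direction in turn.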
4 Results
To validate the proposed technique, we have chosen a simple and easy-to-implement registration algorithm, based on the work by Xiao et al. [3]. It is a classical block-matching algorithm, in which a grid (note that this grid is different from that of the BS interpolator) is superimposed on the images to register; then a block of pixels is placed at each node and moved to the surrounding positions to find the block in the other image that best matches it. The quality of the match is measured in [3] as the linear correlation between the blocks, but we use Mutual Information (MI, [13]) instead, since this allows us to perform multimodal registration. As in [3], a Bayesian framework is used to smooth the estimated displacements at each node. The last step in [3] is the interpolation of the displacements by a gradient-descent based optimization of the BS coefficients. We use the technique suggested in [12] instead, since it is more efficient and has been successfully tested [2]. Finally, we swap the roles of the two images in the block-matching step, and so we can estimate the inverse BS transformation; then we use the pseudo-inversion algorithm and average the obtained BS coefficients with the coefficients estimated for the direct BS transformation. This procedure is embedded in a multiresolution framework, which is known to result in more robust and accurate estimates [1].²

² We have used 5 resolution levels; for the block matching, the block size was 11 × 11 pixels, and a block was centered at every sixth pixel. The BS grid had 64 × 64 nodes.
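The block similarity used in the matching step can be sketched as a plain histogram-based mutual information. The code assumes that the two blocks are given as equally long sequences of already quantised intensities; binning, the search over neighbouring positions and the Bayesian smoothing are left out, and the function name is hypothetical.

```python
import math
from collections import Counter

def mutual_information(block_a, block_b):
    """MI (in nats) of two equally long sequences of discrete intensities."""
    n = len(block_a)
    joint = Counter(zip(block_a, block_b))
    count_a, count_b = Counter(block_a), Counter(block_b)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) written with raw counts.
        mi += p_ab * math.log(count * n / (count_a[a] * count_b[b]))
    return mi

# Inverted contrast, as between two modalities, still gives a high MI.
mi_inverted = mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
mi_unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The first pair shows why MI suits multimodal data: the intensities differ but are perfectly predictable from each other, whereas linear correlation would only capture a linear relation.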
Fig. 1. Example of the test deformations. From left to right: original PD slice (with the random displacements). T1 slice deformed with a 3.6 bending energy TPS. Result of matching the deformed T1 slice to the original PD slice. All images have a size of 181 × 217 pixels.
The data used to test the algorithm is that available at BrainWeb³ [14]. It consists of an MRI phantom volume of the brain, comprising the PD, T1, and T2 modalities. The use of synthetic images allows us to use a gold standard: over the starting images (slices taken from the MRI volume), we establish 12 uniformly distributed landmarks, and for each of them a random vector is generated that represents the deformation at that point. The deformation is then interpolated with a TPS [11], so the results are not biased by the type of interpolation. This process may generate deformations of very diverse nature, so we have parametrised the extent of the deformation by its bending energy [11]. Besides, the use of known deformations allows the computation of the exact error at each image point as the Euclidean distance between the estimated displacement vector and its real value. An example of the kind of deformations generated, and the registration result, is given in Fig. 1.

The results shown comprise first the percentage of successful registrations, where a registration is considered successful whenever the mean registration error is below one pixel (i.e. the minimum achievable displacement); we have used all possible combinations of MRI images (PD, T1, and T2), and for each successful registration we have computed the mean error and the 90% confidence interval. The results are shown in Table 1, for a total of 100 experiments with a bending energy of 3.6 (as in Fig. 1), with and without the consistent constraint. Comparing each pair of cases, it is evident that the consistent constraint notably improves not only the robustness (the percentage of successful registrations) but the accuracy (mean error) as well. As a final result, we show in Fig. 2 the behaviour of the mean registration error as the deformation becomes larger (in terms of its bending energy), for the PD/T2 case; we have chosen it as the most adverse case, as shown in Table 1, and even so the advantage of the consistent constraint is evident for reasonably large deformations.

³ http://www.bic.mni.mcgill.ca/brainweb
Table 1. Registration performance in terms of the successful registration rate, and the mean value and 90% confidence interval of the error of the estimated displacement in pixels, both without and with the consistent constraint

Scene   Model   Without consistent constraint   With consistent constraint
PD      PD      98% (0.55 ± 1.13)               100% (0.51 ± 1.04)
PD      T1      83% (0.78 ± 1.68)               91% (0.63 ± 1.50)
PD      T2      86% (0.77 ± 1.63)               95% (0.78 ± 1.57)
T1      PD      80% (0.82 ± 1.81)               90% (0.75 ± 1.65)
T1      T1      97% (0.61 ± 1.31)               99% (0.56 ± 1.18)
T1      T2      84% (0.82 ± 1.82)               90% (0.75 ± 1.67)
T2      PD      83% (0.79 ± 1.66)               95% (0.70 ± 1.52)
T2      T1      84% (0.79 ± 1.71)               92% (0.73 ± 1.61)
T2      T2      99% (0.55 ± 1.14)               99% (0.51 ± 1.05)
[Plot: mean registration error in pixels versus bending energy of the deformation (PD/T2), with one curve without and one with the consistent constraint; the bending energy 3.6 is marked.]
Fig. 2. Mean registration error versus bending energy of the deformation to recover, with and without the consistent constraint. Curves are a third degree polynomial fit of 400 randomly generated samples of the PD/T2 case.
5 Conclusion
BS are able to represent very complex and large deformations, with little computational load and nice smoothness properties when combined with hierarchical interpolation [12]; all these facts make them an interesting choice in elastic image registration [2,3]. We have presented an efficient algorithm to find a BS transformation that approximates the inverse transformation of another given BS, which allows us to introduce the concept of consistent image registration and all its benefits in the algorithms that make use of BS interpolation. Compared to [5], we have to invert exactly the transformation at only the 2 × M × N BS grid nodes, instead of at all image pixels, drastically reducing the
computational effort. Once the transformation is inverted at these points, the algorithm presented in Section 3 computes the BS parameters very efficiently. The main drawback of this technique is that the inversion is exact at the grid nodes, but only approximate elsewhere. However, the results demonstrate that it is accurate enough for our purposes, and they confirm the hypothesis that the consistent constraint produces more accurate and robust estimates of the deformation, as has been previously suggested [9,10].

Acknowledgments. The authors want to thank Dr. R. Cárdenes and Dr. J. Cid-Sueiro for their comments. This work was supported by grant numbers TEC0406647-C03-01, from the Comisión Interministerial de Ciencia y Tecnología, Spain, and FP6-507609 SIMILAR Network of Excellence from the European Union.
References
1. Zitova, B., Flusser, J.: Image registration methods: a survey. Image and Vision Computing 21(11), 977–1000 (2003)
2. Rueckert, D., Sonoda, L., Hayes, C., Hill, D., Leach, M., Hawkes, D.: Nonrigid registration using free-form deformations: application to breast MR images. IEEE Trans. on Medical Imaging 18(8), 712–721 (1999)
3. Xiao, G., Brady, J., Noble, J., Burcher, M., English, R.: Nonrigid registration of 3-D free-hand ultrasound images of the breast. IEEE Trans. on Medical Imaging 21(4), 405–412 (2002)
4. Shen, D., Davatzikos, C.: HAMMER: hierarchical attribute matching mechanism for elastic registration. IEEE Trans. on Medical Imaging 21(11), 1421–1439 (2002)
5. Christensen, G., Johnson, H.: Consistent image registration. IEEE Trans. on Medical Imaging 20(7), 568–582 (2001)
6. Johnson, H., Christensen, G.: Consistent landmark and intensity-based image registration. IEEE Trans. on Medical Imaging 21(5), 450–461 (2002)
7. Cachier, P., Rey, D.: Symmetrization of the non-rigid registration problem using inversion-invariant energies: application to multiple sclerosis. In: Medical Image Computing and Computer-Assisted Intervention - MICCAI'2000, pp. 472–481 (2000)
8. Thirion, J.P.: Image matching as a diffusion process: an analogy with Maxwell's demons. Medical Image Analysis 2(3), 243–260 (1998)
9. Rogelj, P., Kovacic, S.: Symmetric image registration. Medical Image Analysis 10(3), 484–493 (2006)
10. Zhang, Z., Jiang, Y., Tsui, H.: Consistent multi-modal non-rigid registration based on a variational approach. Pattern Recognition Letters 27(7), 715–725 (2006)
11. Bookstein, F.: Principal warps: thin-plate splines and the decomposition of deformations. IEEE Trans. on Pattern Anal. and Machine Intell. 11(6), 567–585 (1989)
12. Lee, S., Wolberg, G., Shin, S.: Scattered data interpolation with multilevel B-splines. IEEE Trans. on Visualization and Computer Graphics 3(3), 228–244 (1997)
13. Viola, P., Wells III, W.M.: Alignment by maximization of mutual information. In: Proc. of Fifth International Conference on Computer Vision, 1995, pp. 16–23 (1995)
14. Kwan, R.S., Evans, A., Pike, G.: MRI simulation-based evaluation of image-processing and classification methods. IEEE Trans. on Medical Imaging 18(11), 1085–1097 (1999)
Robust Least-Squares Image Matching in the Presence of Outliers Patrice Delmas, Georgy Gimel’farb, Al Shorin, and John Morris Department of Computer Science, Tamaki Campus, The University of Auckland, Auckland, Private Bag 92019, New Zealand {p.delmas,g.gimelfarb,j.morris}@auckland.ac.nz, [email protected] Abstract. Although interfering (outlying) details complicate image recognition and retrieval, ‘soft masking’ of outliers shows considerable promise for robust pixel-by-pixel image matching or reconstruction from principal components (PC). Modeling the differences between two images or between an image and its PC estimate (obtained as a projection onto a subspace of PCs) with a mixed distribution of random noise and outliers, the masks are produced by a simple iterative Expectation-Maximisation based procedure. Experiments with facial images (extracted from the MIT face database) demonstrate the efficiency of this approach.
1 Introduction
Least-squares image matching and image reconstruction from principal components (PC) found by PCA are widely used in computer vision and image retrieval. For PCA, all images in a database must be co-registered to reduce the number of characteristic PCs representing the database and thus optimize the PCA process. However, many undesirable local features, e.g. occlusions in faces caused by hair, glasses, scarves, etc., are not easily handled. A pixel affected by one of these interfering artefacts is usually classified as an outlier to emphasize its divergence from the probability model describing 'valid' signal deviations. Outliers degrade image matching, recognition and reconstruction accuracy.

The definition of an outlier depends on the particular recognition or reconstruction problem. For instance, visual occlusions may not be the only outliers. In facial images, the background is likely to vary considerably, and thus including it in the matching and reconstruction process is a bad strategy. At present, face databases usually contain cropped images which completely eliminate backgrounds, e.g. the MIT database [1] (see Fig. 1). However, it is more natural to regard background pixels as outliers and automatically eliminate them.

This paper preserves the traditional least-squares framework as developed in image matching and PCA but attempts to eliminate outliers by soft masking of suspicious pixels. Each pixel is weighted by a mask value ranging from 0 to 1. Random 'valid' and outlying signal variations are modelled with a mixture of their probability distributions. The masks are inferred with an Expectation-Maximisation (EM) algorithm resembling earlier model identification schemes [7,8]. The resulting iterative image reconstruction and matching algorithms are faster and more flexible than more conventional M-estimator or EM-based methods [11,12].

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 776–783, 2007. © Springer-Verlag Berlin Heidelberg 2007
[Figure 1: face images 01, 03, 05, 07, 09, 11, 13, 15, 17 and 19 of the MIT database subset, followed by the centroid c and the eigenfaces e1 to e9.]

Eigenfaces and eigenvalues
             e1    e2    e3    e4    e5    e6    e7    e8    e9
λj × 10^−7   4.5   1.4   0.79  0.64  0.40  0.33  0.27  0.22  0.15
νj (%)       50.3  66.7  75.6  82.7  87.2  90.9  94.0  96.5  98.1

Fig. 1. MIT face database [1] subset (10 image pairs were used; only the first image for each face is shown here), its centroid c and first nine grey-coded normalised eigenfaces $e_j$ with eigenvalues $\lambda_j$ and cumulative relative variances $\nu_j = \frac{1}{\Lambda}\sum_{i=1}^{j}\lambda_i$, $\Lambda = \sum_{i=1}^{19}\lambda_i$, represented by the $j$ top eigenfaces; $j = 1, \ldots, 19$
Section 2 introduces our EM-based soft masking for image reconstruction with a simple probability model of errors and outliers. Section 3 shows how a similar model may be used for symmetric EM-based least-squares matching of two noisy images in the presence of outliers. Experiments validating these algorithms are described in Section 4.
2 Robust Image Reconstruction
The reconstruction, $\hat{\mathbf{g}}$, of an original image, $\mathbf{g}$, using characteristic PCs for a certain image database is inevitably imprecise if $\mathbf{g}$ is not in the database or if only a small number of these PCs are kept to represent an image. Let $\{\mathbf{g}_j : j = 1, \ldots, n\}$ be a set of $n$ grayscale images, each containing $p$ pixels. Each image, $\mathbf{g}_j$, is represented by a column vector, $\mathbf{g}_j = [g_{j1}, \ldots, g_{jp}]^T$, where $g_{ji}$ is the signal for pixel $i$ in image $j$ in the mean-deviation form: $g_{ji} = \tilde{g}_{ji} - c_i$, where $\tilde{g}_{ji}$ is the original grey value and $c_i = \frac{1}{n}\sum_{j=1}^{n}\tilde{g}_{ji}$ is the mean grey value for pixel $i$ over the set. Let the database images be represented by a $p \times n$ matrix, $\mathbf{G} = [\mathbf{g}_1 \ldots \mathbf{g}_n]$; usually $p \gg n$. The tractable PCA [2,3] finds the $n-1$ top-rank eigenvectors, $\{\mathbf{e}_j : j = 1, \ldots, n-1\}$, with (possibly) non-zero eigenvalues for the $p \times p$ covariance matrix, $\mathbf{S} = \frac{1}{n-1}\mathbf{G}\mathbf{G}^T$, by computing the like $n-1$ eigenvectors and eigenvalues, $\{(\mathbf{e}'_j, \lambda'_j) : j = 1, \ldots, n-1\}$, of the $n \times n$ matrix, $\mathbf{S}' = \mathbf{G}^T\mathbf{G}$. Because $\mathbf{G}(\mathbf{G}^T\mathbf{G})\mathbf{e}'_j = \lambda'_j \mathbf{G}\mathbf{e}'_j$, the PCs and eigenvalues for $\mathbf{S}$ are $\mathbf{e}_j = \frac{1}{|\mathbf{G}\mathbf{e}'_j|}\mathbf{G}\mathbf{e}'_j$ and $\lambda_j = |\mathbf{G}\mathbf{e}'_j|\,\lambda'_j$; $j = 1, \ldots, n-1$, respectively. PCA-based image reconstruction projects an input image, $\mathbf{g}$, onto a linear subspace of $m$ eigenvectors, $1 \le m \le n-1$, ordered by decreasing eigenvalues:

$$\hat{\mathbf{g}} = \sum_{j=1}^{m} y_j \mathbf{e}_j \quad\text{where}\quad y_j = \mathbf{g}^T\mathbf{e}_j \equiv \sum_{i=1}^{p} g_i e_{ji}$$
We assume the reconstruction error, $\boldsymbol{\varepsilon} = \mathbf{g} - \hat{\mathbf{g}}$, is modeled by an independent random field of pixel errors, $\varepsilon_i = g_i - \hat{g}_i$. Each error, $\varepsilon$, represents either signal noise (which we define here to include all noise inherent to lossy reconstruction) or an outlier. The error probability model is a mixture of a noise distribution, $N(\varepsilon|\sigma)$, with unknown variance, $\sigma^2$, and an outlier distribution, $U(\varepsilon)$:

$$\Pr(\varepsilon) = \rho N(\varepsilon|\sigma) + (1-\rho)U(\varepsilon) \tag{1}$$

where $1-\rho$ is an unknown prior probability of outliers. We also assume discrete errors, $\varepsilon \in E = \{-Q+1, \ldots, -1, 0, 1, \ldots, Q-1\}$, where $Q$ is the number of grey values (typically, $Q = 256$). For simplicity, let $N(\varepsilon|\sigma)$ be a discrete distribution stemming from the zero-centred normal distribution of noise and $U(\varepsilon)$ be a uniform distribution of differences:

$$N(\varepsilon|\sigma) = \frac{1}{Z_\sigma} e^{-\frac{\varepsilon^2}{2\sigma^2}}; \qquad U(\varepsilon) = \frac{1}{2Q+1} \qquad\text{where}\quad Z_\sigma = \sum_{\delta=-Q+1}^{Q-1} e^{-\frac{\delta^2}{2\sigma^2}} \tag{2}$$
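As a small numerical illustration of the mixture in Eqs. (1) and (2), the following sketch evaluates the discretised noise distribution, the uniform outlier distribution and the resulting error probability. The function names and the chosen values of σ and ρ are illustrative assumptions; only the formulas come from the text.

```python
import math

Q = 256  # number of grey values, so errors lie in {-Q+1, ..., Q-1}

def z_sigma(sigma):
    """Normaliser Z_sigma of the discretised zero-centred normal, Eq. (2)."""
    return sum(math.exp(-d * d / (2.0 * sigma ** 2)) for d in range(-Q + 1, Q))

def noise_prob(eps, sigma, z):
    """N(eps | sigma) from Eq. (2), given a precomputed normaliser z."""
    return math.exp(-eps * eps / (2.0 * sigma ** 2)) / z

def outlier_prob():
    """Uniform outlier probability U(eps) as written in Eq. (2)."""
    return 1.0 / (2 * Q + 1)

def error_prob(eps, sigma, rho):
    """Mixture probability Pr(eps) from Eq. (1)."""
    return rho * noise_prob(eps, sigma, z_sigma(sigma)) + (1.0 - rho) * outlier_prob()

sigma, rho = 5.0, 0.8  # illustrative values only
z = z_sigma(sigma)
total_noise = sum(noise_prob(d, sigma, z) for d in range(-Q + 1, Q))
print(total_noise, error_prob(0, sigma, rho), error_prob(100, sigma, rho))
```

Small errors are explained mostly by the noise term, whereas for large errors only the uniform term contributes noticeably, which is what the soft mask exploits.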
The probability model of Eq. (2) implies that 'signal noise' can be separated from outliers by masking out pixels with significantly larger than expected errors in the reconstructed images. The masks are built by an EM-based iterative procedure of model identification; similar procedures have been used in [7,8]. Let $\boldsymbol{\gamma} = [\gamma_1, \ldots, \gamma_p]^T$ be a mask for the data vector, $\mathbf{g}$. A binary mask, $\gamma_i \in \{0, 1\}$, excludes ($\gamma_i = 0$) or retains ($\gamma_i = 1$) signals, i.e. converts $\mathbf{g}$ into $\mathbf{g}^\circ$ where $g^\circ_i = g_i\gamma_i$. To fit the conventional EM framework for the maximum likelihood identification of the error model, the mask is replaced with its mathematical expectation as a soft mask [10]. The value $\gamma_i \in [0, 1]$ converges to the probability of the error, $\varepsilon_i$, being induced by signal noise (Hastie et al. [10] refer to this as a 'responsibility'): the smaller the value, the more likely the error is an outlier.

The robust EM-based reconstruction algorithm is then:

Input: An image, $\mathbf{g}$, and $m$ eigenvectors, $\mathbf{e}_j$; $j = 1, \ldots, m$.

Initial step $t = 0$: Reconstruct $\mathbf{g}$ with the unit mask, $\boldsymbol{\gamma}^{[0]} = [\gamma_i^{[0]} = 1 : i = 1, \ldots, p]$, so that $\hat{\mathbf{g}}^{[0]} = \sum_{j=1}^{m}\left(\sum_{i=1}^{p} g_i e_{ji}\right)\mathbf{e}_j$. Set the prior, $\rho^{[0]} = 0.5$, to avoid singularities.

Iteration $t = 1, 2, \ldots$: Reset the mask and prior for the errors, $\varepsilon_i^{[t]} = g_i - \hat{g}_i^{[t-1]}$:

$$\sigma_{[t]}^2 = \frac{\sum_{i=1}^{p}\left(\varepsilon_i^{[t]}\right)^2\gamma_i^{[t-1]}}{\sum_{i=1}^{p}\gamma_i^{[t-1]}}; \qquad \gamma_i^{[t]} = \frac{\rho^{[t-1]}N\left(\varepsilon_i^{[t]}|\sigma_{[t]}\right)}{\rho^{[t-1]}N\left(\varepsilon_i^{[t]}|\sigma_{[t]}\right) + (1-\rho^{[t-1]})U\left(\varepsilon_i^{[t]}\right)}; \qquad \rho^{[t]} = \frac{1}{p}\sum_{i=1}^{p}\gamma_i^{[t]},$$

and update the reconstruction:

$$\hat{\mathbf{g}}^{[t]} = \sum_{j=1}^{m}\left(\sum_{i=1}^{p}\left(\gamma_i^{[t]}g_i + \left(1-\gamma_i^{[t]}\right)\hat{g}_i^{[t-1]}\right)e_{ji}\right)\mathbf{e}_j$$

Stopping rule: Terminate if $\max_{i=1,\ldots,p}|\gamma_i^{[t]} - \gamma_i^{[t-1]}| \le \theta_w$ or $t > \theta_i$, where $\theta_w$ and $\theta_i$ are fixed empirical thresholds.
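The reconstruction loop above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the eigenvectors are assumed orthonormal and passed as lists, the input is assumed to be already in mean-deviation form, and the variance floor and the tiny constant in the denominator are safeguards added here that the text does not mention. The toy example uses a single constant eigenvector and one grossly deviating pixel.

```python
import math

Q = 256  # number of grey values

def project(v, eigvecs):
    """Project v onto orthonormal eigenvectors and map back to pixel space."""
    out = [0.0] * len(v)
    for e in eigvecs:
        y = sum(vi * ei for vi, ei in zip(v, e))
        out = [o + y * ei for o, ei in zip(out, e)]
    return out

def robust_reconstruct(g, eigvecs, theta_w=1e-3, theta_i=50):
    p = len(g)
    u = 1.0 / (2 * Q + 1)                # uniform outlier probability, Eq. (2)
    gamma, rho = [1.0] * p, 0.5          # unit mask and initial prior (t = 0)
    g_hat = project(g, eigvecs)          # conventional reconstruction
    for _ in range(theta_i):
        eps = [gi - hi for gi, hi in zip(g, g_hat)]
        var = sum(w * e * e for w, e in zip(gamma, eps)) / max(sum(gamma), 1e-12)
        two_var = 2.0 * max(var, 1e-6)   # variance floor: an added safeguard
        z = sum(math.exp(-d * d / two_var) for d in range(-Q + 1, Q))
        new_gamma = []
        for e in eps:
            noise = rho * math.exp(-e * e / two_var) / z
            new_gamma.append(noise / (noise + (1.0 - rho) * u + 1e-300))
        rho = sum(new_gamma) / p
        # Masked pixels are replaced by their previous estimate before projecting.
        mixed = [w * gi + (1.0 - w) * hi for w, gi, hi in zip(new_gamma, g, g_hat)]
        g_hat = project(mixed, eigvecs)
        change = max(abs(a - b) for a, b in zip(new_gamma, gamma))
        gamma = new_gamma
        if change <= theta_w:
            break
    return g_hat, gamma

# Toy data: five consistent pixels and one outlier, one constant eigenvector.
e1 = [1.0 / math.sqrt(6.0)] * 6
g_hat, gamma = robust_reconstruct([10.0] * 5 + [100.0], [e1])
print([round(v, 2) for v in g_hat], [round(w, 3) for w in gamma])
```

In this toy case the mask of the deviating pixel drops towards 0 within a few iterations, so the reconstruction is pulled back towards the value shared by the remaining pixels.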
3 Robust Symmetric Least-Square Matching
In this section, we consider robust matching of images distorted by having different contrast (gain) and offset values. Let two images, $\tilde{\mathbf{g}}_1$ and $\tilde{\mathbf{g}}_2$, be derived from an ideal unknown template, $\tilde{\mathbf{g}} = (\tilde{g}_i : i = 1, \ldots, p)$, with differing uniform contrast ($a_1$ and $a_2$) and offset ($b_1$ and $b_2$) values perturbed by independent per pixel errors caused either by noise or outliers. For the pixels affected by noise only, $i \in \{1, \ldots, p\}$,

$$\tilde{g}_{1i} = a_1\tilde{g}_i + b_1 + \varepsilon_{1i}; \qquad \tilde{g}_{2i} = a_2\tilde{g}_i + b_2 + \varepsilon_{2i} \tag{3}$$

where the errors $\varepsilon_{1i}$ and $\varepsilon_{2i}$ have a centred normal distribution with the same variance. Outliers have uniformly distributed signal values and thus vary with no relation to the template transformations, so that conventional least-square matching may not work properly.

To derive the robust version, let a soft mask, $\gamma_i \in [0, 1]$, be applied, as before, to the pair $(\tilde{g}_{1i}, \tilde{g}_{2i})$. The maximum likelihood signal dissimilarity, $D_{12} = \min_{\theta}\left(-\ln\Pr(\tilde{\mathbf{g}}_1, \tilde{\mathbf{g}}_2)\right)$, combines non-outliers and outliers. Let $\varepsilon_i$ denote the residual error, or difference between the signals for the pixel $i$, after their contrast and offset variations are eliminated. Then

$$D_{12} = -\sum_{i=1}^{p}\left[\gamma_i\ln N(\varepsilon_i) + (1-\gamma_i)\ln U(\varepsilon_i)\right] = \frac{1}{2\sigma^2}\Phi_{12} + \nu\ln(Z_\sigma) + \Psi_{12} \tag{4}$$

where $\nu = \sum_{i=1}^{p}\gamma_i$, $\Psi_{12} = -\sum_{i=1}^{p}(1-\gamma_i)\ln U(\varepsilon_i)$, and $\Phi_{12}$ is the minimum total squared error with respect to the model parameters, $\theta = \{a_1, a_2, b_1, b_2, \tilde{\mathbf{g}}\}$:

$$\Phi_{12} = \min_{\theta}\sum_{i=1}^{p}\gamma_i\left(\varepsilon_{1i}^2 + \varepsilon_{2i}^2\right) = \min_{\theta}\sum_{i=1}^{p}\gamma_i\left[(\tilde{g}_{1i} - a_1\tilde{g}_i - b_1)^2 + (\tilde{g}_{2i} - a_2\tilde{g}_i - b_2)^2\right] = \sum_{i=1}^{p}\gamma_i\hat{\varepsilon}_i^2 = \frac{1}{2}\left(S_{11} + S_{22} - \sqrt{(S_{11} - S_{22})^2 + 4S_{12}^2}\right)$$

Here, $S_{jk} = \sum_{i=1}^{p}\gamma_i\hat{g}_{ji}\hat{g}_{ki}$; $j, k = 1, 2$, where $\hat{g}_{ji} = \tilde{g}_{ji} - \mu_j$ and $\mu_j = \frac{1}{\nu}\sum_{i=1}^{p}\gamma_i\tilde{g}_{ji}$, and $\hat{\varepsilon}_i^2 = \hat{\varepsilon}_{1i}^2 + \hat{\varepsilon}_{2i}^2 = (\alpha_2\hat{g}_{1i} - \alpha_1\hat{g}_{2i})^2$ is the squared residual per pixel matching error, where

$$\alpha_1^2 = \frac{1}{2}\left(1 + \frac{S_{11} - S_{22}}{\left[(S_{11} - S_{22})^2 + 4S_{12}^2\right]^{1/2}}\right); \qquad \alpha_2^2 = \frac{1}{2}\left(1 - \frac{S_{11} - S_{22}}{\left[(S_{11} - S_{22})^2 + 4S_{12}^2\right]^{1/2}}\right)$$

The estimated noise variance is $\sigma^2 = \frac{1}{2\nu}\Phi_{12}$, and the local minimum of $\Phi_{12}$ is obtained by the EM-based iterative procedure that re-evaluates the soft masks and model parameters, similar to Section 2.
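The closed-form quantities of this section can be checked numerically with the short sketch below. The function name and the sample data are hypothetical; one assumption is added that the text does not state explicitly, namely that $\alpha_2$ is given the sign of $S_{12}$ so that the residual $\alpha_2\hat{g}_{1i} - \alpha_1\hat{g}_{2i}$ also behaves correctly for negatively correlated images.

```python
import math

def symmetric_fit(g1, g2, gamma):
    """Weighted closed-form fit: returns Phi12, alpha1, alpha2, mu1, mu2."""
    nu = sum(gamma)
    mu1 = sum(w * a for w, a in zip(gamma, g1)) / nu
    mu2 = sum(w * b for w, b in zip(gamma, g2)) / nu
    s11 = sum(w * (a - mu1) ** 2 for w, a in zip(gamma, g1))
    s22 = sum(w * (b - mu2) ** 2 for w, b in zip(gamma, g2))
    s12 = sum(w * (a - mu1) * (b - mu2) for w, a, b in zip(gamma, g1, g2))
    root = math.hypot(s11 - s22, 2.0 * s12)
    ratio = (s11 - s22) / root if root > 0.0 else 0.0
    alpha1 = math.sqrt(max(0.0, 0.5 * (1.0 + ratio)))
    # Assumption: the sign of alpha2 follows the sign of S12.
    alpha2 = math.copysign(math.sqrt(max(0.0, 0.5 * (1.0 - ratio))), s12)
    phi = 0.5 * (s11 + s22 - root)
    return phi, alpha1, alpha2, mu1, mu2

def residuals(g1, g2, alpha1, alpha2, mu1, mu2):
    """Per-pixel residuals alpha2 * (g1 - mu1) - alpha1 * (g2 - mu2)."""
    return [alpha2 * (a - mu1) - alpha1 * (b - mu2) for a, b in zip(g1, g2)]

# Exact gain/offset relation g2 = 2 * g1 + 10: the residuals vanish.
g1 = [0.0, 1.0, 2.0, 3.0]
phi_clean, a1, a2, m1, m2 = symmetric_fit(g1, [10.0, 12.0, 14.0, 16.0], [1.0] * 4)

# Perturbed data with a non-uniform mask: Phi12 equals the weighted squared residuals.
g2, gamma = [10.0, 12.5, 13.5, 16.0], [1.0, 1.0, 0.5, 1.0]
phi, b1, b2, n1, n2 = symmetric_fit(g1, g2, gamma)
weighted_sq = sum(w * e * e for w, e in zip(gamma, residuals(g1, g2, b1, b2, n1, n2)))
print(phi_clean, phi, weighted_sq)
```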
[Figure 2: distorted faces DF01-01 to DF01-14, arranged in two blocks of rows (a) to (d).]

Fig. 2. Conventional (b) and robust (c) PCA-based reconstruction using the 10 top eigenfaces; row (a) - distorted version of face (#1); row (d) - grey-coded soft masks (black - 0; white - 1)
The robust symmetric matching algorithm is as follows:

Input: Two images $\tilde{\mathbf{g}}_1$ and $\tilde{\mathbf{g}}_2$.

Initial step $t = 0$: Match the images with the mask $\boldsymbol{\gamma}^{[0]} = [\gamma_i^{[0]} = 1 : i = 1, \ldots, p]$ in order to find $\Phi_{12}^{[0]}$, $\sigma_{[0]}^2 = \frac{1}{2p}\Phi_{12}^{[0]}$, $D_{12}^{[0]}$ (the conventional matching score), $\alpha_1^{[0]}$, $\alpha_2^{[0]}$, $\mu_1^{[0]}$, $\mu_2^{[0]}$. Set the prior $\rho^{[0]} = 0.5$.

Iteration $t = 1, 2, \ldots$: Reset the mask and prior for the current residual errors $\varepsilon_i^{[t]} = \alpha_2^{[t-1]}\left(\tilde{g}_{1i} - \mu_1^{[t-1]}\right) - \alpha_1^{[t-1]}\left(\tilde{g}_{2i} - \mu_2^{[t-1]}\right)$:

$$\gamma_i^{[t]} = \frac{\rho^{[t-1]}N\left(\varepsilon_i^{[t]}|\sigma_{[t-1]}\right)}{\rho^{[t-1]}N\left(\varepsilon_i^{[t]}|\sigma_{[t-1]}\right) + (1-\rho^{[t-1]})U\left(\varepsilon_i^{[t]}\right)}; \qquad \rho^{[t]} = \frac{\nu_{[t]}}{p}; \qquad \nu_{[t]} = \sum_{i=1}^{p}\gamma_i^{[t]}$$

and update $S_{jk}$; $j, k = 1, 2$, $\Phi_{12}^{[t]}$, $\sigma_{[t]}^2 = \frac{1}{2\nu_{[t]}}\Phi_{12}^{[t]}$, $D_{12}^{[t]}$, $\alpha_1^{[t]}$, $\alpha_2^{[t]}$, $\mu_1^{[t]}$, and $\mu_2^{[t]}$.

Stopping rule: Terminate if $|D_{12}^{[t]} - D_{12}^{[t-1]}| \le \theta_r D_{12}^{[t]}$ or $t > \theta_i$, where $\theta_r$ and $\theta_i$ are the fixed thresholds.
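A compact sketch of this matching loop is given below. It follows the steps listed above but is only an illustration under assumptions: the names are hypothetical, the variance floor and the small constant in the mask denominator are safeguards added here, the sign of $\alpha_2$ is taken from $S_{12}$, and the outlier term of the score is evaluated as $(p - \nu)\ln(2Q + 1)$, i.e. as the negative log of the uniform probability. As with any EM scheme, the result depends on the initial conventional fit, so gross outliers at high-leverage pixels may not be recovered.

```python
import math

Q = 256  # number of grey values

def weighted_fit(g1, g2, gamma):
    """Closed-form weighted fit: Phi12, nu, alpha1, alpha2, mu1, mu2."""
    nu = max(sum(gamma), 1e-12)
    mu1 = sum(w * a for w, a in zip(gamma, g1)) / nu
    mu2 = sum(w * b for w, b in zip(gamma, g2)) / nu
    s11 = sum(w * (a - mu1) ** 2 for w, a in zip(gamma, g1))
    s22 = sum(w * (b - mu2) ** 2 for w, b in zip(gamma, g2))
    s12 = sum(w * (a - mu1) * (b - mu2) for w, a, b in zip(gamma, g1, g2))
    root = math.hypot(s11 - s22, 2.0 * s12)
    ratio = (s11 - s22) / root if root > 0.0 else 0.0
    a1 = math.sqrt(max(0.0, 0.5 * (1.0 + ratio)))
    a2 = math.copysign(math.sqrt(max(0.0, 0.5 * (1.0 - ratio))), s12)
    return max(0.0, 0.5 * (s11 + s22 - root)), nu, a1, a2, mu1, mu2

def log_z(var):
    """ln Z_sigma for the discretised normal of Eq. (2)."""
    return math.log(sum(math.exp(-d * d / (2.0 * var)) for d in range(-Q + 1, Q)))

def robust_match(g1, g2, theta_r=1e-4, theta_i=50):
    p = len(g1)
    u = 1.0 / (2 * Q + 1)
    gamma, rho = [1.0] * p, 0.5
    phi, nu, a1, a2, mu1, mu2 = weighted_fit(g1, g2, gamma)
    var = max(phi / (2.0 * p), 1e-6)            # variance floor: added safeguard
    d_prev = phi / (2.0 * var) + nu * log_z(var)  # conventional score (t = 0)
    d12 = d_prev
    for _ in range(theta_i):
        eps = [a2 * (x - mu1) - a1 * (y - mu2) for x, y in zip(g1, g2)]
        z = math.exp(log_z(var))
        gamma = []
        for e in eps:
            noise = rho * math.exp(-e * e / (2.0 * var)) / z
            gamma.append(noise / (noise + (1.0 - rho) * u + 1e-300))
        phi, nu, a1, a2, mu1, mu2 = weighted_fit(g1, g2, gamma)
        rho = nu / p
        var = max(phi / (2.0 * nu), 1e-6)
        d12 = phi / (2.0 * var) + nu * log_z(var) + (p - nu) * math.log(2 * Q + 1)
        if abs(d12 - d_prev) <= theta_r * abs(d12):
            break
        d_prev = d12
    return d12, gamma

# Two images related exactly by a gain of 0.5 and an offset of 20.
g1 = [10.0 * i for i in range(10)]
g2 = [0.5 * x + 20.0 for x in g1]
score, mask = robust_match(g1, g2)
print(score, min(mask))
```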
4 Experimental Results and Conclusions
A subset of the MIT database [1] containing two images with different lighting for each of the ten persons (in total, 20 normalised 200 × 200 pixel greyscale images) was used; see Fig. 1. The top seven out of 19 eigenfaces built with the tractable PCA [2,3] represent 94% of the overall variance of the original signals.

Reconstruction. 14 distorted versions of the original face image (#1 in Fig. 1) were created: Fig. 2 shows the distorted faces, the two reconstructions and the soft masks used by the robust PCA-based reconstruction. The latter is slightly better than the conventional one for relatively small occlusions (DF01-01 – DF01-04), notably outperforms it for moderate occlusions (DF01-05 – DF01-10), but both fail for large distortions (DF01-11 – DF01-14).

[Figure 3: MIT FDB face 01 and the derived images DV1, DV1-B, DV2 and DV2-B with their grey-coded soft masks.]

Image   C1 (closest match): D12 · 10^−5   C2 (2nd match): D12 · 10^−5
DV1     01: 0.40                          02: 1.08
DV1-B   01: 1.11                          02: 1.32
DV2     01: 1.414                         02: 1.415
DV2-B   01: 1.479                         02: 1.495

Fig. 3. Robust matching of images derived from MIT face #1 (Fig. 1): closest (C1) and second closest (C2) scores, D12 (Eq. (4)), and grey-coded soft masks
Matching. Figure 3 illustrates robust image-to-image matching. The contrast and offset values of face (#1) were changed significantly to produce images DV1 and DV2. The bottom section of the face was occluded in images DV1-B and DV2-B. Further trials are shown in Figure 4: various features (hair, glasses, beards, etc.) were added to three faces from the MIT database (#1, #3 and #9). The corrupted faces were then matched against the database (Fig. 1). Soft masking of outliers led to a significant increase in recognition and reconstruction accuracy when compared to conventional matching. In 13 out of 24 cases in Fig. 4, best matches using conventional distance measures were incorrect.

[Figure 4: made-up faces M01-a to M01-l, M03-a to M03-g and M09-a to M09-e with their grey-coded soft masks; the closest match and its score D12 are listed below.]

M01-a to M01-l (closest match 01), D12: 1.03, 1.35, 1.39, 1.30, 0.96, 1.08, 0.80, 1.02, 1.51, 1.50, 1.85, 1.73
M03-a to M03-g (closest match 03), D12: 1.61, 1.60, 1.27, 1.90, 1.49, 1.50, 1.83
M09-a to M09-e (closest match 09), D12: 1.51, 1.72, 1.80, 1.79, 1.99

Fig. 4. Closest match for the made-up faces '01', '03' and '09' from the MIT database in Fig. 1: the robust score D12 (Eq. (4)) and grey-coded soft masks

Conclusions. Iterative readjustment of the mask within an EM framework results in large movements across the parameter space, i.e. a more global search roughly in the direction of the global minimum of the matching score [10]. As a result, our reconstruction and matching algorithms are robust to outliers and faster than other alternatives; thus they hold promise for several applications. These results were obtained on the MIT database [1] with a uniform background. Other face databases with non-trivial backgrounds and more realistic outliers than the artefacts generated here will be considered in future work.

Acknowledgement. This work was partially supported by the University of Auckland Research Committee (UARC) Grant 3607167/9343.
References
1. MIT face database [on-line] (accessed August 24, 2006), http://vismod.media.mit.edu/pub/images/
2. Turk, M., Pentland, A.: Face recognition using eigenfaces. In: Proc. CVPR'91, Lahaina, Maui, Hawaii, June 3–6, 1991, pp. 586–591. IEEE CS Press, Los Alamitos (1991)
3. Turk, M.: Eigenfaces and beyond. In: Zhao, W., Chellappa, R. (eds.) Face Processing: Advanced Modeling and Methods, pp. 55–86. Academic Press (Elsevier), San Diego (2006)
4. Lai, S.-H.: Robust image matching under partial occlusion and spatially varying illumination change. Computer Vision and Image Understanding 78, 84–98 (2000)
5. Huber, P.J.: Robust Statistics. John Wiley & Sons, New York (1981)
6. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. John Wiley & Sons, New York (1987)
7. Gimel'farb, G., Farag, A.A., El-Baz, A.: Expectation-Maximization for a linear combination of Gaussians. In: Proc. 17th ICPR, Cambridge, UK, August 22 - 27, vol. 3, pp. 422–425. IEEE CS Press, Los Alamitos (2004)
8. Frey, B.J., Jojic, N.: A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Trans. on Pattern Analysis and Machine Intelligence 27, 1392–1416 (2005)
9. Hasler, D., Sbaiz, L., Süsstrunk, S., Vetterli, M.: Outlier modeling in image matching. IEEE Trans. on Pattern Analysis and Machine Intelligence 25, 301–315 (2003)
10. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, Heidelberg (2001)
11. De la Torre, F.D., Black, M.J.: Robust principal component analysis for computer vision. In: Proc. 8th ICCV'01, Vancouver, Canada, July 9–12, 2001, pp. 362–369. IEEE CS Press, Los Alamitos (2001)
12. Skocaj, D., Bischof, H., Leonardis, A.: A robust PCA algorithm for building representations from panoramic images. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 761–775. Springer, Heidelberg (2002)
A Neural Network String Matcher Abdolreza Mirzaei, Hamidreza Zaboli, and Reza Safabakhsh Department of Computer Engineering, Amirkabir University, Tehran 15914, Iran {amirzaei, zaboli, safabakhsh}@aut.ac.ir
Abstract. The aim of this work is to code the string matching problem as an optimization task and to carry out this optimization by means of a Hopfield neural network. The proposed method uses TCNN, a Hopfield neural network with decaying self-feedback, to find the best-matching (i.e., the lowest global distance) path between an input and a template. The proposed method goes beyond 'exact' string matching. For example, wild characters that always match, as well as characters that never match, may be used in either string. It can also compute the edit distance between the two strings. It shows a very good performance in various string matching tasks.

Keywords: String matching, Parallel, Chaotic Neural Network, TCNN, and Optimization using Hopfield NN.
1 Introduction

The general string-matching problem is to find a shift s for which the pattern X appears in the text. The most straightforward approach to string matching is to test each possible shift s in turn, which is hardly optimal. In the string-matching-with-errors problem the goal is to find the location in the text where the pattern X is close to a substring or factor of the text. This measure of closeness is chosen to be an edit distance. The edit distance is the minimum number of fundamental operations needed to transform the test string into the prototype string. Thus the string-matching-with-errors problem finds use in the same type of problem as basic string matching, the only difference being that there is a certain tolerance for a match [1].

The string matching problems just outlined are conceptually very simple. The challenge arises when the problems are large. A survey and comparison of well-known string matching algorithms can be found in [2], [3], [4]. Although several linear time algorithms have been developed in the last two decades, they all need a preprocessing step [4], [5], [6], [7]. Automata-based methods also need a preprocessing step to create an automaton for each input pattern [4], [5], [8]. Among the methods proposed for string-matching-with-errors we can mention dynamic programming methods [9], the generalized Boyer-Moore algorithm [10], and the automata-based method [5]; all of these algorithms require extra time and space for preprocessing.

Since VLSI technology has developed rapidly, hardware approaches for string matching have also been proposed [11], [12], [13], [14]. Since software algorithms which use preprocessing with look-up table methods cannot be applied to design a special purpose hardware string matcher [15], we need algorithms suitable for hardware implementation. It is known that neural networks are capable of yielding high-quality solutions to large-scale problems [16]. This paper uses a chaotic Hopfield neural network for string matching. The advantages of this method over traditional methods include massive parallelism and convenient hardware implementation. Another advantage is that the Hopfield network procedure is more general, so that a range of string matching problems can be solved by it.

The remainder of this paper is organized as follows. Section 2 reviews the theoretical basis of edit distance algorithms. Section 3 introduces the Hopfield neural network and its application in optimization. Section 4 gives a description of the proposed method and Section 5 discusses the performance of the developed technique. Section 6 derives the main conclusions from this work.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 784–791, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Problem Description

We start with the calculation of the edit distance between two strings. In Section 5 we will see that the proposed algorithm can be used to solve other problems too. The edit distance between $x$ and $y$ describes how many fundamental operations are required to transform $x$ into $y$. These fundamental operations are:
• Substitution: A character in $x$ is replaced by the corresponding character in $y$.
• Insertion: A character in $y$ is inserted into $x$, thereby increasing the length of $x$ by one character.
• Deletion: A character in $x$ is deleted, thereby decreasing the length of $x$ by one character.

The calculation of the edit distance between $x$ and $y$ can be visualized by aligning the characters of $x$ along the X-axis and the characters of $y$ along the Y-axis. The calculation of the edit distance is then equivalent to finding the best-matching (i.e. the lowest global distance) path between $x$ and $y$. Let $C$ be an $m \times n$ matrix of integers associated with a cost or distance and let $\delta(\cdot,\cdot)$ denote a generalization of the Kronecker delta function, having value 1 if the two arguments (characters) match and 0 otherwise. The variables $m$ and $n$ are equal to the lengths of $x$ and $y$, denoted by $|x|$ and $|y|$. The basic equations for the calculation of the edit distance are

$$C[0,0] = 0 \tag{1}$$
$$C[i,0] = i \quad\text{for } i = 1 \text{ to } m \tag{2}$$
$$C[0,j] = j \quad\text{for } j = 1 \text{ to } n \tag{3}$$
$$C[i,j] = \min\Big[\underbrace{C[i-1,j] + 1}_{\text{insertion}},\ \underbrace{C[i,j-1] + 1}_{\text{deletion}},\ \underbrace{C[i-1,j-1] + 1 - \delta(x[i], y[j])}_{\text{no change / exchange}}\Big] \tag{4}$$

Equation 4 is used to find the minimum cost in each entry of $C$, column by column. As can be seen from this equation, the only valid paths to a given point must pass through a point to the left of and/or below it. The final distance $C[m,n]$ gives us the overall matching distance for the patterns $x$ and $y$.
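For reference, Eqs. (1) to (4) translate directly into the classical dynamic programming routine sketched below. This is the conventional sequential computation, shown only to make the recurrence concrete; it is not the neural network method proposed in this paper, and the function name is illustrative.

```python
def edit_distance(x, y):
    """Edit distance between strings x and y following Eqs. (1)-(4)."""
    m, n = len(x), len(y)
    # C has (m + 1) x (n + 1) entries so that row 0 and column 0 hold the base cases.
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        C[i][0] = i                      # Eq. (2)
    for j in range(1, n + 1):
        C[0][j] = j                      # Eq. (3)
    for j in range(1, n + 1):            # fill column by column
        for i in range(1, m + 1):
            match = 1 if x[i - 1] == y[j - 1] else 0   # delta(x[i], y[j])
            C[i][j] = min(C[i - 1][j] + 1,
                          C[i][j - 1] + 1,
                          C[i - 1][j - 1] + 1 - match)  # Eq. (4)
    return C[m][n]

print(edit_distance("kitten", "sitting"))
```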
3 The Transiently Chaotic Neural Network

A Hopfield neural network consists of a set of neurons, where the output of each neuron is connected to the input of all other neurons except itself. The weights on these connections are fixed and are constrained to be symmetrical. The network is characterized by an energy function, which decreases as the neural states change with time. Once started, the network relaxes towards a stable state corresponding to a local minimum of its energy function.

The work of Hopfield and Tank [17] showed that neural networks can be applied to solving combinatorial optimization problems. They suggested that a near-optimal solution of the traveling salesman problem (TSP) can be obtained by finding a local minimum of an appropriate energy function, which is implemented by a neural network. Typically, the network energy function is made equivalent to the objective function, which is to be minimized, while each of the constraints of the optimization problem is included in the energy function as a penalty term. Clearly, a constrained minimum of the optimization problem will also optimize the energy function. Unfortunately, a minimum of the energy function does not necessarily correspond to a minimum of the objective function, because there are likely to be several terms in the energy function, which contribute to many local minima. Furthermore, even if the network does manage to converge to a feasible solution, its quality is likely to be poor compared to other techniques, since the Hopfield network is a descent technique and converges to the first local minimum it encounters.

We use the transiently chaotic neural network (TCNN) proposed by Chen and Aihara [18]. TCNN, a Hopfield neural network with decaying self-feedback, has been developed as a new approach to extend the problem solving ability of the standard Hopfield neural network. The difference equations describing the dynamics of the TCNN are as follows:

$$x_{ij}(t) = f(y_{ij}(t)) = \frac{1}{1 + e^{-y_{ij}(t)/\varepsilon}} \tag{5}$$

$$y_{ij}(t+1) = k\,y_{ij}(t) + \alpha\left(\sum_{k}\sum_{l} w_{ijkl}\,x_{kl}(t) + I_{ij}\right) - z_{ij}(t)\left(x_{ij}(t) - I_0\right) \tag{6}$$

$$z_{ij}(t+1) = (1 - \beta)\,z_{ij}(t) \tag{7}$$

where $x_{ij}(t)$ is the output of neuron $ij$, $y_{ij}(t)$ is the internal state of neuron $ij$, $w_{ijkl}$ is the connection weight from neuron $ij$ to neuron $kl$ ($w_{ijkl} = w_{klij}$), $I_{ij}$ is the input bias of neuron $ij$, $\alpha$ is a positive scaling parameter for the neuronal input, $k$ is the damping factor of the nerve membrane ($0 \le k \le 1$), $z(t)$ is the self-feedback connection weight (refractory strength), $\beta$ is the damping factor of $z(t)$ ($0 \le \beta \le 1$), $I_0$ is a positive parameter and $\varepsilon$ is the steepness parameter of the output function ($\varepsilon > 0$).
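One synchronous update of Eqs. (5) to (7) can be sketched as follows. For brevity the neurons are indexed by a single flat index instead of the pair $ij$, and the names are illustrative. The default values of $k$, $\alpha$, $\beta$ and $I_0$ are those quoted later in Section 5; the steepness value `eps=0.004` is an assumption made here, since the text shown does not state it.

```python
import math

def sigmoid(y, eps):
    """Output function of Eq. (5), guarded against overflow for very negative y."""
    t = -y / eps
    return 0.0 if t > 700.0 else 1.0 / (1.0 + math.exp(t))

def tcnn_step(y, z, w, bias, k=0.9, alpha=0.015, beta=0.003, i0=0.65, eps=0.004):
    """One update of the internal states y and self-feedback weights z."""
    n = len(y)
    x = [sigmoid(v, eps) for v in y]                       # Eq. (5)
    new_y = [k * y[a]
             + alpha * (sum(w[a][b] * x[b] for b in range(n)) + bias[a])
             - z[a] * (x[a] - i0)                          # Eq. (6)
             for a in range(n)]
    new_z = [(1.0 - beta) * v for v in z]                  # Eq. (7)
    return new_y, new_z

# A single isolated neuron starting from y = 0 with z = 0.08.
new_y, new_z = tcnn_step([0.0], [0.08], [[0.0]], [0.0])
print(new_y, new_z)
```

Because $z$ decays geometrically, the chaotic self-feedback term fades over time and the dynamics gradually approach those of an ordinary Hopfield network.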
4 String Matching Using Hopfield Neural Network

To solve the string matching problem, the Hopfield network is organized as an $n \times m$ matrix of neurons, where $n$ and $m$ are the total numbers of characters in the pattern and the text. A neuron $n_{Xi}$ explores the hypothesis that character $X$ in the pattern corresponds to character $i$ in the text. We will consider the constraints existing on the matching process (or that we can impose on that process) and use those constraints to come up with an efficient algorithm. The constraints we shall impose are straightforward and not very restrictive:
• Every character in the pattern and the text must be used in a matching path.
• Monotonic condition: the path will not turn back on itself; both the $i$ and $j$ indexes either stay the same or increase, they never decrease.
• Continuity condition: the path advances one step at a time. Both $i$ and $j$ can only increase by 1 on each step along the path.
• Local distance scores are combined by adding to give a global distance.

We said that every character in the pattern and the text must be used in a matching path. This means that if we take a point $(i, j)$ in the matrix (where $i$ indexes the pattern character and $j$ the text character), then the previous point must be $(i-1, j-1)$, $(i-1, j)$ or $(i, j-1)$. According to the above-mentioned constraints, we introduce the following energy function:

$$E = \frac{A}{2}\sum_{i=1}^{n}\Big(1 - \sum_{j=1}^{m}P_{ij}\Big)^2 + \frac{B}{2}\sum_{j=1}^{m}\Big(1 - \sum_{i=1}^{n}P_{ij}\Big)^2 + \frac{C}{2}\sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{k>i}\sum_{l<j}P_{ij}P_{kl} + \frac{D}{2}\sum_{i=1}^{n}\sum_{j=1}^{m}\sum_{k<i}\sum_{l>j}P_{ij}P_{kl} + \frac{E}{2}\sum_{i=1}^{n}\sum_{j=1}^{m}P_{ij}d_{ij} \tag{8}$$

where $A$, $B$, $C$, $D$ and $E$ are positive constants and $P_{ij}$ is the output of neuron $n_{ij}$, which takes on values from 0 to 1. The first two terms basically enforce the uniqueness constraint. It can be seen that if more than one neuron is excited in one row or column of the network, the energy of the system tends to increase. This constraint is used (with small coefficients $A$ and $B$) in order to make the solution more monotonic (if one character is matched with two characters, one of them must be deleted or inserted to make the two strings equal). The minimum energy is obtained only if exactly one of the neurons in each column or row is excited. The third and fourth terms are the inhibiting terms that implement the fact that the matching paths cannot go backwards in time. For any 'on' neuron, they inhibit its upper left and lower right areas as shown in Figure 1.
Fig. 1. Inhibiting areas due to monotonicity constraint
The fifth term is the main term, which is used to enforce the similarity constraint. If the two characters are equal, the cost contributed to the energy function is lower, and the two characters form a correct matching pair with a high probability. $d_{ij}$ is given by

$$d_{ij} = 1 - \delta(x[i], y[j]) \tag{9}$$
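To make the five terms of Eq. (8) concrete, the sketch below evaluates the energy for a given matrix of neuron outputs. The function name and the unit coefficients are illustrative assumptions (the text only requires the constants to be positive); rows index pattern characters and columns index text characters.

```python
def matching_energy(P, x, y, A=1.0, B=1.0, C=1.0, D=1.0, E=1.0):
    """Energy of Eq. (8) for outputs P[i][j] matching pattern x against text y."""
    n, m = len(P), len(P[0])
    # Uniqueness terms: each row and each column should sum to one.
    rows = sum((1.0 - sum(P[i])) ** 2 for i in range(n))
    cols = sum((1.0 - sum(P[i][j] for i in range(n))) ** 2 for j in range(m))
    # Monotonicity terms: penalise pairs of active neurons that cross.
    cross_c = sum(P[i][j] * P[k][l]
                  for i in range(n) for j in range(m)
                  for k in range(i + 1, n) for l in range(j))
    cross_d = sum(P[i][j] * P[k][l]
                  for i in range(n) for j in range(m)
                  for k in range(i) for l in range(j + 1, m))
    # Similarity term with d_ij = 1 - delta(x[i], y[j]), Eq. (9).
    sim = sum(P[i][j] * (0.0 if x[i] == y[j] else 1.0)
              for i in range(n) for j in range(m))
    return (A * rows + B * cols + C * cross_c + D * cross_d + E * sim) / 2.0

diagonal = [[1.0, 0.0], [0.0, 1.0]]   # 'a'-'a' and 'b'-'b': a valid matching path
crossing = [[0.0, 1.0], [1.0, 0.0]]   # 'a'-'b' and 'b'-'a': a path that turns back
print(matching_energy(diagonal, "ab", "ab"), matching_energy(crossing, "ab", "ab"))
```

The diagonal assignment attains zero energy, whereas the crossing assignment is penalised both by the monotonicity terms and by the similarity term.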
The objective now is to compute the network parameters, which comprise the external inputs of the network and the network connection weights. These parameters can be obtained by comparing equation (8) with the energy function of a Hopfield neural network, given in equation (10):

$$ H_H = -\frac{1}{2}\sum_{i}\sum_{j}\sum_{k}\sum_{l} W_{ij,kl}\,P_{ij}P_{kl} - \sum_{i}\sum_{j} I_{ij}P_{ij} \tag{10} $$

where each neuron has an output level P_ij (bounded by zero and one) and a bias current (or negative threshold) I_ij. The weights W_ij,kl determine the strength of the connection from neuron ij to neuron kl.
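The comparison itself is not written out in the text. One way to carry it out, offered here as a sketch and not as the authors' own derivation, is to expand the squares in (8), symmetrize the pairwise terms and match coefficients with (10). The bracket notation [·] (1 if the condition holds, 0 otherwise), the Kronecker deltas and the additive constant are our own notation; on the left of the last line, E denotes the energy of (8), while E/2 in the bias is the fifth coefficient.

```latex
\begin{align*}
W_{ij,kl} &= -A\,\delta_{ik} \;-\; B\,\delta_{jl}
  \;-\; \frac{C+D}{2}\,\bigl([k>i][l<j] + [k<i][l>j]\bigr),\\
I_{ij} &= A + B - \frac{E}{2}\,d_{ij},\\
E &= H_H + \frac{A\,n + B\,m}{2}.
\end{align*}
```

Under this reading, the third and fourth sums in (8) run over the same unordered pairs of neurons, so only the sum C + D enters the weights. The expressions also include self-connection terms (k = i, l = j), which follow from expanding the squares.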
5 Experimental Results

To evaluate the proposed method, two types of experiments are performed. The TCNN parameters used in the experiments are k = 0.9, I_0 = 0.65, z(1) = 0.08, β = 0.003 and α = 0.015, which are the same as those used by Chen and Aihara [18]. In our experiments, every neuron n_ij is initialized to a value inversely proportional to d_ij. The criterion

$$ \sum_{i,j=1}^{N} \bigl| P_{ij}(t+1) - P_{ij}(t) \bigr| < 5\times10^{-5} $$

is used to decide whether the network has converged. At the end of each experiment, every neuron whose output is greater than a predefined threshold (0.5) is considered fired. A post-processing step is performed for rows and columns with no fired neuron: the neuron with the greatest activation in that row or column is set to 1 (i.e., it is considered fired).

The first experiment illustrates the use of TCNN in exact string matching. In this case, the number of insert operations (right moves) before the first character of the text gives a valid shift (i.e., the variable s). In this experiment, random characters are added to the pattern “book” to create texts of size 10, 15, 20, 40, 60, 80, 100,
120 characters. The ability of the network to find a valid shift (one at which the pattern “book” appears in the text) is then evaluated. Table 1 summarizes the results of 100 different runs in terms of the rate of correct estimation of the valid shift and the number of iterations needed for convergence. These results are also shown in Figure 2. We note that, for a constant value of β, the number of iterations for convergence asymptotically becomes constant as the text size increases, so the text size has no significant effect on it. On the other hand, the matching accuracy degrades as the text size increases. This is because, with almost the same number of iterations, the network does not have enough time to reach the optimal solution for larger matching problems. The value of β must therefore be selected such that the number of iterations for convergence increases for larger problems.

To assess the effect of β on network convergence and on the quality of the obtained paths, we analyzed the matching of a text of size 40 and the pattern “book” with β = 0.001, 0.003, 0.005, 0.01, 0.015 and 0.02 over 100 runs. The results of this experiment are shown in Table 2 and Figure 3. In TCNN, the variable z(t) corresponds to the temperature in the stochastic annealing process and β determines the cooling speed. As β increases, the chaotic phase of the network finishes earlier and the network converges in fewer iterations (see Figure 3(b)). However, the searching capability of the network, and therefore the quality of the solutions, decreases in this case (see Figure 3(a)). So, in order to achieve acceptable results when matching large texts, β must be selected small enough.

Table 1. The results of using TCNN to find a valid shift with different text sizes

Length of reference string                              10      15      20      40      60      80      100     120
Rate of correct estimation of valid shift (%)   Mean    100     100     100     97.8    96.6    87      79.8    76.2
                                                Std     0       0       0       4.91    4.27    7.42    11.01   12.19
Number of iterations for convergence            Mean    123.24  126.56  127.96  130.36  130.94  133.18  134     134.62
                                                Std     2.003   1.502   3.002   2.669   1.595   1.354   0.869   0.918

Fig. 2. The results of using TCNN to find a valid shift with different text sizes. (a) Rate of correct estimation of valid shift (accuracy in % versus text size). (b) Number of iterations for convergence versus text size.

Table 2. The results of matching a text of size 40 and the pattern “book” with different values of β

β                                                       0.001     0.003   0.005    0.01     0.015    0.02
Rate of correct estimation of valid shift (%)   Mean    98.2      97.8    96.9     96.7     95.6     93.00
                                                Std     3.8239    4.91    4.99     5.31     5.68     6.15
Number of iterations for convergence            Mean    229.1900  130.36  84.2300  49.4900  35.8400  28.5300
                                                Std     29.5626   2.669   1.5535   0.4886   0.2716   0.2359

Fig. 3. The results of using TCNN to find a valid shift with different values of the β parameter. (a) Rate of correct estimation of valid shift (accuracy in % versus β). (b) Number of iterations for convergence versus β.

The second experiment deals with the calculation of the edit distance between two strings. This experiment shows the capability of the approach to approximate
matching (i.e., matching with errors). To compute the edit distance, we count the extra ON neurons in rows and columns that have more than one neuron ON and add the result to the matching cost. The matching cost for two equal characters is zero. To provide a quantitative evaluation of the proposed method in calculating the edit distance between two strings, the text “We are all pencils in the hand of God”* is corrupted by randomly substituting, deleting or adding characters, and the modified text is then matched against the original text. The result is considered optimal if the edit distance computed from the network is the same as the one computed by the well-known dynamic programming algorithm and the derived path is a valid path. Table 3 shows the percentage of 100 runs in which the correct edit distance was found. The table indicates that, if the distortion is moderate, the network finds the optimal edit distance with high probability.

* Mother Teresa.

Table 3. The result of finding the edit distance between two strings: a fraction of a sentence is corrupted and then matched against the original one

Distortion                                          5%     10%    20%    30%    40%
Ratio of finding the correct edit distance (%)      100    97     91     79     67
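The read-out rule of this section and the dynamic programming baseline used for comparison can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, empty rows are repaired before empty columns (the text does not specify an order), and the baseline is the classic unit-cost edit distance, which we assume is the "well-known dynamic programming algorithm" referred to above.

```python
def fired_neurons(P, threshold=0.5):
    """Threshold the outputs, then force one neuron ON in every empty row and column."""
    n, m = len(P), len(P[0])
    on = [[P[i][j] > threshold for j in range(m)] for i in range(n)]
    for i in range(n):  # assumption: rows are repaired first
        if not any(on[i]):
            j_best = max(range(m), key=lambda j: P[i][j])
            on[i][j_best] = True
    for j in range(m):  # then columns, using the updated state
        if not any(on[i][j] for i in range(n)):
            i_best = max(range(n), key=lambda i: P[i][j])
            on[i_best][j] = True
    return on


def edit_distance(a, b):
    """Unit-cost edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]
```

A network result would then be counted as correct when its edit distance equals `edit_distance(original, corrupted)` and the fired neurons form a valid path.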
6 Conclusion

The main idea of this paper is a demonstration that the string matching problem can be formulated as an optimization task. The advantages of this method, which uses a Hopfield network to solve the resulting optimization problem, over traditional methods include massive parallelism and convenient hardware implementation. Another advantage is that the Hopfield network procedure is more general in application, so a range of string matching problems can be solved by it.
References

1. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley-Interscience, Chichester (2000)
2. Aho, A.V.: Algorithms for Finding Patterns in Strings. In: van Leeuwen, J. (ed.) Handbook of Theoretical Computer Science, vol. A, pp. 257–297. Elsevier Science Publishers, Amsterdam (1990)
3. Jokinen, P., Tarhio, J., Ukkonen, E.: A Comparison of Approximate String Matching Algorithms. Software-Practice and Experience 26, 1439–1457 (1996)
4. Lecroq, T.: Experimental Results on String Matching Algorithms. Software-Practice and Experience 25, 727–765 (1995)
5. Baeza-Yates, R., Gonnet, G.H.: A New Approach to Text Searching. Comm. ACM 35, 74–81 (1992)
6. Cole, R., Hariharan, R., Paterson, M., Zwick, U.: Tighter Lower Bounds on the Exact Complexity of String Matching. SIAM J. Comput. 24, 30–45 (1995)
7. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. McGraw-Hill, New York (1990)
8. Knuth, D.E., Morris, J., Pratt, V.: Fast Pattern Matching in Strings. SIAM J. Comput. 6, 323–350 (1977)
9. Wagner, R.A.: The String-to-String Correction Problem. Journal of the ACM 21, 168–173 (1974)
10. Tarhio, J., Ukkonen, E.: Approximate Boyer-Moore String Matching. SIAM J. Comput. 22, 243–260 (1993)
11. Cheng, H.D., Fu, K.S.: VLSI Architecture for String Matching and Pattern Matching. Pattern Recognition 20, 125–141 (1987)
12. Isenman, M.E., Shasha, D.E.: Performance and Architectural Issues for String Matching. IEEE Trans. on Computers 39(2), 238–250 (1990)
13. Mukherjee, A.: Hardware Algorithms for Determining Similarity between Two Strings. IEEE Trans. on Computers 38(4), 600–603 (1989)
14. Sastry, R., Ranganathan, N., Remedios, K.: CASM: A VLSI Chip for Approximate String Matching. IEEE Trans. on Pattern Analysis and Machine Intelligence 17, 824–830 (1995)
15. Park, J.H., George, K.M.: Parallel String Matching Algorithms Based on Dataflow. In: Proceedings of the 32nd Hawaii International Conference on System Sciences, vol. Track 3. IEEE, Los Alamitos (1999)
16. Tagliarini, G.A., Christ, J.F., Page, E.W.: Optimization Using Neural Networks. IEEE Trans. on Computers 40(12), 1347–1358 (1991)
17. Hopfield, J.J., Tank, D.W.: ‘Neural’ Computation of Decisions in Optimization Problems. Biological Cybernetics 52, 141–152 (1985)
18. Chen, L., Aihara, K.: Chaotic Simulated Annealing by a Neural Network Model with Transient Chaos. Neural Networks 8(6), 915–930 (1995)
Incorporating Spatial Information into 3D-2D Image Registration

Guoyan Zheng

MEM Research Center, University of Bern, Stauffacherstrasse 78, CH-3014, Bern, Switzerland
[email protected]
Abstract. This paper addresses the problem of estimating the 3D rigid pose of a CT volume of an object from its 2D X-ray projections. We use maximization of mutual information, an accurate similarity measure for multi-modal and mono-modal image registration tasks. However, it is known that the standard mutual information measure only takes intensity values into account without considering spatial information, and its robustness is questionable. In this paper, instead of directly maximizing mutual information, we propose to use a variational approximation derived from the Kullback-Leibler bound. Spatial information is then incorporated into this variational approximation using a Gibbs random field model. The newly derived similarity measure has a least-squares form and can be effectively minimized by a multi-resolution Levenberg-Marquardt optimizer. Experimental results are presented on X-ray and CT datasets of a plastic phantom and a cadaveric spine segment.

Keywords: mutual information, 3D-2D registration, X-ray, CT, Gibbs random field, Kullback-Leibler bound.
1 Introduction

3D-2D registration of a three-dimensional (3D) CT volume with two-dimensional (2D) X-ray images has shown great potential in a number of image-guided therapy applications. The reported techniques to achieve this registration can be split into two main categories: feature-based methods and intensity-based methods. Feature-based methods require a prior segmentation stage, which is error-prone and hard to achieve automatically. Errors in segmentation can lead to errors in the final registration. In contrast, intensity-based methods directly compare the X-ray image with the associated digitally reconstructed radiograph (DRR), which is obtained by simulating X-ray projection of the CT volume. No segmentation is required. In this work, we use maximization of mutual information (MI), an accurate similarity measure for multi-modal and mono-modal image registration tasks [1][2][3]. However, it is known that the standard mutual information measure only takes intensity values into account without considering spatial information, and its robustness is questionable [4].

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 792–800, 2007. © Springer-Verlag Berlin Heidelberg 2007
Several attempts have been made to adapt the MI-based registration framework to incorporate spatial information of individual images [4][5][6][7][8]. However, the resulting similarity measures either require computing the entropy of higher-dimensional probability distributions, which is not advisable because statistical uncertainty increases with dimension due to the scarcity of data, or are not robust to outliers. In this paper, instead of directly maximizing mutual information, we propose to use a variational approximation derived from the Kullback-Leibler bound [9]. Spatial information is then incorporated into this variational approximation using a Gibbs random field (GRF) model [10]. The newly derived similarity measure has a least-squares form and can be effectively minimized by a multi-resolution Levenberg-Marquardt non-linear optimizer. The paper is organized as follows. In Section 2, we describe the proposed approach. In Section 3, we present the experimental results, followed by conclusions in Section 4.
2 The Proposed Approach

2.1 Derivation of a Variational Approximation to the MI

In this work, we assume that the X-ray images are calibrated for their intrinsic parameters and corrected for distortion. If multiple X-ray images are used, they are all registered to a common reference frame. Therefore, the goal of a 3D-2D registration is to compute the rigid transformation T that relates the coordinate frame of the CT volume to the reference coordinate frame of the X-ray images. In the following, we focus on the derivation based on the qth X-ray image (where q = 1, ..., Q) and its associated DRR. Let us denote the values of the X-ray image (V) as v(x) and the corresponding values of the DRR (U), created from the CT volume given the current estimate of the transformation, as u(x; T). In this work, we regard the image values v(x) and u(x; T) as random variables with associated probability density functions p(v(x)) and p(u(x; T)), respectively. The joint probability density function of these two random variables is p(v(x), u(x; T)). The conditional probability density function of v(x) given the values of u(x; T) is expressed as p(v(x)|u(x; T)). The mutual information of two random variables is derived from the entropy values of the variables, both separately and jointly, as given by:

$$ H(V) = -\int p(v)\log(p(v))\,dv; \qquad H(V,U) = -\iint p(v,u)\log(p(v,u))\,dv\,du \tag{1} $$

and the conditional entropy of two random variables is:

$$ H(V|U) = -\iint p(v,u)\log(p(v|u))\,dv\,du \tag{2} $$

The entropy can be seen as a measure of the uncertainty of a random variable. The mutual information between two random variables is defined by:

$$ S^{q}_{MI}(V,U,T) = H(V) + H(U) - H(V,U) = H(V) - H(V|U) \tag{3} $$
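For discrete intensity values, the integrals in Eqs. 1 to 3 become sums over histogram bins. The following sketch estimates the entropies and the mutual information from paired pixel samples; the function names and the use of natural logarithms are assumptions made for illustration, and the estimate is a plain joint-histogram one in the spirit of the histogram-based approach, not the implementation used in the paper.

```python
from collections import Counter
from math import log


def entropies(v, u):
    """Return H(V), H(U), H(V,U) and H(V|U) in nats from paired discrete samples."""
    n = len(v)
    count_v, count_u, count_vu = Counter(v), Counter(u), Counter(zip(v, u))
    h_v = -sum(c / n * log(c / n) for c in count_v.values())
    h_u = -sum(c / n * log(c / n) for c in count_u.values())
    h_vu = -sum(c / n * log(c / n) for c in count_vu.values())
    # Conditional entropy uses p(v|u) = p(v,u) / p(u).
    h_v_given_u = -sum(c / n * log(c / count_u[b]) for (a, b), c in count_vu.items())
    return h_v, h_u, h_vu, h_v_given_u


def mutual_information(v, u):
    """Mutual information as in Eq. 3: H(V) + H(U) - H(V,U)."""
    h_v, h_u, h_vu, _ = entropies(v, u)
    return h_v + h_u - h_vu
```

Both forms of Eq. 3 agree on such samples: `mutual_information(v, u)` equals `H(V) - H(V|U)` up to rounding, it is zero for independent samples and equals H(V) when one image determines the other.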
After substitution, we can write Eq. 3 as:

$$ S^{q}_{MI}(V,U,T) = \iint p(v(x),u(x;T))\log(p(v(x)|u(x;T)))\,dv\,du + H(V) \tag{4} $$

The optimal estimate of the rigid transformation can then be obtained by:

$$ \hat{T} = \arg\max_{T} \sum_{q=1}^{Q} S^{q}_{MI}(V,U,T) \tag{5} $$
where Q is the number of input X-ray images. Eq. 5 is the standard registration framework using maximization of mutual information. A histogram-based method [2] as well as a Parzen-window-based method [1] have been proposed to compute the mutual information. It is known that the standard mutual information measure only takes intensity values into account without considering spatial information, and its robustness is questionable. It can be shown using the Kullback-Leibler bound [9] that:

$$ S^{q}_{MI}(V,U,T) \geq \iint p(v(x),u(x;T))\log(q(v(x)|u(x;T)))\,dv\,du + H(V) \tag{6} $$

where q(v(x)|u(x; T)) is an arbitrary variational distribution. We call the right side of Eq. 6 the variational approximation to mutual information (VA-MI) and denote it as $S^{q}_{\text{VA-MI}}(V,U,T)$. The approximation is exact if q(v(x)|u(x; T)) ≡ p(v(x)|u(x; T)). As we are dealing with discrete images, the values v(x = (i, j)) and u((x = (i, j)); T) that we observe from the images can be regarded as random samples from p(v(x), u(x; T)). Note that H(V) in Eq. 6 does not depend on T. Ignoring this constant term, we can further approximate $S^{q}_{\text{VA-MI}}(V,U,T)$ by its sample estimate:

$$ S^{q}_{\text{VA-MI}}(V,U,T) \approx \frac{1}{I\times J}\sum_{i=1}^{I}\sum_{j=1}^{J}\log(q(v(i,j)|u((i,j);T))) \tag{7} $$

where I × J is the size of the X-ray image in pixels. Using Eq. 7, we convert the maximization of mutual information into an optimal labeling problem in which the labels are the conditional intensity values (v(i, j)|u((i, j); T)). Such a problem can be effectively solved using a Gibbs random field model.

2.2 Effective Realization of the Variational Approximation
Gibbs Random Field Theory. Gibbs random field theory is a branch of probability theory that relates the probabilities of the various possible states of a system to the energies associated with them. It has been used extensively in image restoration, segmentation, object recognition and matching [10]. In what follows, the basic concepts of the GRF are reviewed for completeness. For rigorous expositions, one may refer to [10].
Let $L = \{(i,j) : 1 \le i \le I,\ 1 \le j \le J\}$ be an I × J integer lattice; then $D = \{D_{i,j};\ (i,j) \in L\}$ denotes a family of random variables, i.e., a random field, defined on L. A system d can be viewed as a discrete sample realization of D that assumes a certain state at each pixel site. $d = \{d_{1,1}, d_{1,2}, \cdots, d_{I,J}\}$ is referred to as a configuration of D. The complete set of all configurations is denoted as $\mathcal{D}$.

Definition 1. A neighborhood system for L is defined as $N = \{N^{r}_{i,j};\ (i,j) \in L\}$, where $N^{r}_{i,j}$ is the set of sites around (i, j) and is defined as follows:

$$ N^{r}_{i,j} = \{(i',j') \mid (i',j') \in L,\ (i',j') \neq (i,j),\ |(i',j') - (i,j)| \le r\} \tag{8} $$

where r is a positive integer that determines the size of the neighborhood system.

Definition 2. A clique c is a subset of L in which every pair of sites are neighbors. Single pixels are also considered cliques. The set of all cliques related to the pixel site (i, j) is denoted by $C_{i,j}$.

Definition 3. D is a Gibbs random field with respect to (w.r.t.) the neighborhood system N if and only if:

$$ p(d) = \frac{1}{z}\,e^{-E(d)} \tag{9} $$

where p(d) is called the Gibbs measure of d; z is a normalization constant called the partition function and E(d) is the energy function of the form:

$$ E(d) = \sum_{i,j}^{I,J} \sum_{c \in C_{i,j}} W_c(d_{i,j}) \tag{10} $$

where $W_c$ is called the clique potential. Generally, $W_c$ is a function of the cliques around the site under consideration.

Realization of the Variational Approximation Using a GRF. To estimate the conditional distribution q(v(x)|u(x; T)), knowing the exact values of v(i, j) and u((i, j); T) is not important. We are more interested in the conditional difference between v(i, j) and u((i, j); T). Having this knowledge and using the concept of the Gibbs measure, we can approximate the conditional distribution q(v|u) using a GRF w.r.t. the neighborhood system N defined on the lattice $L = \{(i,j) : 1 \le i \le I,\ 1 \le j \le J\}$ of the X-ray image. However, we still cannot directly compare v(i, j) to u((i, j); T) at each pixel site because there are inherent differences between the X-ray image and the associated DRR. In this paper, we propose to use a local normalization to circumvent this problem. The rationale behind it is that, in a local region, the intensity differences between different sites are mainly caused by the imaged object, if no external object is present in the field of view.

Definition 4. A local region of size r for the pixel site (i, j) ∈ L is the set of sites defined by:

$$ R^{r}_{i,j} = \{(i',j') \mid (i',j') \in L,\ |(i',j') - (i,j)| \le r\} \tag{11} $$
The local normalization of both the X-ray image and the associated DRR is then performed as follows:

$$ \bar{v}(i,j) = \frac{v(i,j) - m_v(R^{r}_{i,j})}{\sigma_v(R^{r}_{i,j})}; \qquad \bar{u}((i,j);T) = \frac{u((i,j);T) - m_u(R^{r}_{i,j})}{\sigma_u(R^{r}_{i,j})} \tag{12} $$

where $m_v(R^{r}_{i,j})$, $\sigma_v(R^{r}_{i,j})$ and $m_u(R^{r}_{i,j})$, $\sigma_u(R^{r}_{i,j})$ are the mean value and the standard deviation calculated from the intensity values of all sites in the local region $R^{r}_{i,j}$ of the X-ray image and of the associated DRR, respectively. We can now model the difference image

$$ s((i,j);T) = \bar{v}(i,j) - \bar{u}((i,j);T) \tag{13} $$

as a GRF w.r.t. the neighborhood system N defined on the lattice $L = \{(i,j) : 1 \le i \le I,\ 1 \le j \le J\}$. According to the relationship between the probability measure and the energy function of a GRF at a single site, we have:

$$ \log q(v(i,j)|u((i,j);T)) \approx \log p(s((i,j);T)); \quad \log p(s((i,j);T)) = -E(s((i,j);T)) = -\sum_{c \in C_{i,j}} W_c(s((i,j);T)) \tag{14} $$
We can further expand the clique potentials in Eq. 14 according to the clique size. In this work, we only consider cliques of size up to two. Using this approximation, we derive a new similarity measure. We call it the GRF model based variational approximation to mutual information (GRF-VA-MI) and denote it as $S^{q}_{\text{GRF-VA-MI}}(V,U,T)$. It has the form:

$$ -S^{q}_{\text{VA-MI}}(V,U,T) \approx S^{q}_{\text{GRF-VA-MI}}(V,U,T) = \sum_{i,j}^{I,J} W_c(s((i,j);T)) + \sum_{i,j}^{I,J} \frac{1}{\mathrm{card}(N^{r}_{i,j})} \cdot \sum_{(i',j') \in N^{r}_{i,j}} W_c(s((i,j);T), s((i',j');T)) \tag{15} $$

where $S^{q}_{\text{VA-MI}}(V,U,T)$ is negated because mutual information is maximized, whereas energy must be minimized. The first term on the right-hand side is the potential function for single-pixel cliques and the second term is the potential function for all pairwise cliques. $\mathrm{card}(N^{r}_{i,j})$ denotes the number of pixels in the neighborhood $N^{r}_{i,j}$. The selection of the potential function in Eq. 15 is a critical issue in GRF modeling [10]. How this selection affects registration performance is worth investigating in the future. In this work, we simply use the following forms:

$$ W_c(s((i,j);T)) = [s((i,j);T)]^{2}, \qquad W_c(s((i,j);T), s((i',j');T)) = [s((i,j);T) - s((i',j');T)]^{2} \tag{16} $$

The registration framework described in Eq. 5 now becomes:

$$ \hat{T} = \arg\min_{T} \Bigl[\sum_{q=1}^{Q} S^{q}_{\text{GRF-VA-MI}}(V,U,T)\Bigr] \tag{17} $$
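To make Eqs. 11 to 16 concrete, the following sketch evaluates the GRF-VA-MI for one X-ray/DRR pair given as small 2D lists. It is only an illustration under several assumptions that the text does not fix: the distance in Eqs. 8 and 11 is taken as the Chebyshev distance (square windows), windows are clipped at the image border, the population standard deviation is used, and a flat region is normalized to zero. All function names are hypothetical.

```python
from math import sqrt


def region(i, j, r, rows, cols):
    """Sites within distance r of (i, j), clipped to the lattice (Eq. 11)."""
    return [(a, b)
            for a in range(max(0, i - r), min(rows, i + r + 1))
            for b in range(max(0, j - r), min(cols, j + r + 1))]


def local_normalize(img, r):
    """Eq. 12: subtract the local mean and divide by the local standard deviation."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            vals = [img[a][b] for a, b in region(i, j, r, rows, cols)]
            mean = sum(vals) / len(vals)
            std = sqrt(sum((x - mean) ** 2 for x in vals) / len(vals))
            out[i][j] = (img[i][j] - mean) / std if std > 0 else 0.0
    return out


def grf_va_mi(xray, drr, r):
    """Eqs. 13, 15 and 16 for one X-ray/DRR pair; lower values mean a better match."""
    rows, cols = len(xray), len(xray[0])
    v, u = local_normalize(xray, r), local_normalize(drr, r)
    s = [[v[i][j] - u[i][j] for j in range(cols)] for i in range(rows)]
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            total += s[i][j] ** 2  # single-pixel clique potential
            nbrs = [p for p in region(i, j, r, rows, cols) if p != (i, j)]
            if nbrs:  # averaged pairwise clique potentials
                total += sum((s[i][j] - s[a][b]) ** 2 for a, b in nbrs) / len(nbrs)
    return total
```

Because of the local normalization, a DRR that differs from the X-ray image only by a positive gain and an offset gives a value of (numerically) zero, which is the kind of intensity mismatch the normalization is meant to absorb.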
Fig. 1. Experimental datasets. Left two images: phantom and one of the X-ray images; right two images: volume rendering result showing the location of one of the fiducial markers and one of the X-ray images with the projection of the interventional instrument (delineated by a black box).

Table 1. Data specifications

CT Data Specification
test object   rows × columns × slices   data res. (mm³)       start res. (mm³)      end res. (mm³)
phantom       512 × 512 × 93            0.36 × 0.36 × 2.5     2.88 × 2.88 × 2.5     2.88 × 2.88 × 2.5
spine         512 × 512 × 72            0.36 × 0.36 × 1.25    2.88 × 2.88 × 1.25    2.88 × 2.88 × 1.25

X-ray Data Specification
test object   width × height × images   data res. (mm²)       start res. (mm²)      end res. (mm²)
phantom       768 × 576 × 2             0.39 × 0.39           3.12 × 3.12           3.12 × 3.12
spine         768 × 576 × 2             0.39 × 0.39           3.12 × 3.12           3.12 × 3.12
2.3 Implementation Details

To accelerate the registration process, we exploit a spline-based multi-resolution 3D-2D registration scheme [11]. A cubic-spline data model is used to compute the multi-resolution data pyramids for the CT volume, the X-ray images and the DRRs, as well as for the gradient and the Hessian of the GRF-VA-MI. The registration is then performed from the coarsest resolution to the finest one. At each resolution level, the size of the local region used for the normalization is always equal to that of the neighborhood system used in Eq. 15. To improve the capture range, we use two different sizes of neighborhood systems: r = 15 and r = 3. The GRF-VA-MI with the bigger neighborhood system is first minimized via a Levenberg-Marquardt non-linear least-squares optimizer. The estimated T̂ is then used as the starting value for optimizing the GRF-VA-MI with the smaller neighborhood system.
3 Experimental Results

We conducted two studies on X-ray and CT datasets of a plastic phantom and a cadaveric spine segment (Fig. 1). The data sizes, the original data resolutions, and the start and end resolutions of the X-ray and CT datasets are summarized in Table 1. The ground truth transformations of both datasets were obtained by performing paired-point matching on implanted fiducial markers. The phantom was custom-made to simulate good imaging conditions. In contrast, the quality of the X-ray images of the cadaveric spine was poor and projections of interventional instruments were present (see the rightmost image of Fig. 1).

Using the datasets of both objects downsampled to the start resolution, we first compared the behavior of the GRF-VA-MI with that of an MI-based measure using a histogram-based implementation [2] and that of the similarity measure introduced in [11], which is a sum-of-squared-differences (SSD) measure based on global normalization. The results are presented in Fig. 2. All similarity measures behaved similarly on the phantom dataset but differently on the spine segment dataset. The GRF-VA-MI behaves better than the others: all of its curves have clear minima and are smoother than those of the other measures. The results also show that using a bigger neighborhood system, which is equivalent to incorporating a wider range of spatial information, leads to a smoother energy function, whereas using a smaller neighborhood system results in higher accuracy.

Fig. 2. Probes through the minimum of the similarity measures on the phantom data (first row) and on the spine data (second row). The ordinate shows the value of the similarity measures normalized to the range [0.0, 1.0], given as functions of each parameter in the range of [−15°, 15°] or [−15 mm, 15 mm] away from its ground truth (the first three columns represent the translational probes along the X, Y and Z axes, respectively; the last three columns represent the rotational probes around each axis).

The second study was performed only on the spine segment dataset to evaluate the performance of the registration scheme using the GRF-VA-MI. In this study, we perturbed the ground truth by randomly varying each parameter in the range of [−2°, 2°] or [−2 mm, 2 mm] to get 200 positions, then another 200 positions in the range of [−4°, 4°] or [−4 mm, 4 mm], and so on up to the range of [−12°, 12°] or [−12 mm, 12 mm]. We then performed the registration starting from these perturbed positions and counted the success rate. Using a method similar to that reported in [12], we regarded a registration as successful if the mean target registration error (mTRE) evaluated on the fiducial markers was smaller than 1.5 mm. The capture range was then defined as the average of the initial mTRE at which a 95% success rate is achieved. The study results are presented in Table 2. When the absolute parameter range is (12°, 12 mm), the average CPU time on a 3.0 GHz Pentium machine was 26.7 seconds. The capture range of the GRF-VA-MI was found to be much larger than those reported in [7] and in [12], although the attained accuracy was lower than that reported in [12]. This might be explained by the large inter-slice distance (2.5 mm in this work vs. 0.31 mm in [12]) and the region outliers in the X-ray images.

Table 2. Study results using the datasets of the cadaveric spine segment

absolute parameter range (°, mm)       (2, 2)   (4, 4)   (6, 6)   (8, 8)   (10, 10)   (12, 12)
average of the initial mTRE (mm)       2.3      4.6      6.9      9.2      11.5       13.5
success percentage                     100      100      100      99       95         85
average of the final mTRE (mm)         0.8      0.8      0.8      0.8      0.8        0.8
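The evaluation quantities used in this study can be sketched as follows. The representation of a rigid transformation as a pair (3 × 3 rotation matrix, translation vector), the function names, and the reading of the capture range as the largest tabulated initial mTRE whose success percentage is still at least 95% are assumptions made for illustration; the numbers in the usage lines are the ones listed in Table 2.

```python
from math import sqrt


def apply_rigid(R, t, p):
    """Apply a rigid transform given as a 3x3 rotation matrix R and a translation t."""
    return tuple(sum(R[a][b] * p[b] for b in range(3)) + t[a] for a in range(3))


def mtre(estimate, ground_truth, targets):
    """Mean distance between targets mapped by the estimate and by the ground truth."""
    total = 0.0
    for p in targets:
        q_est = apply_rigid(*estimate, p)
        q_gt = apply_rigid(*ground_truth, p)
        total += sqrt(sum((a - b) ** 2 for a, b in zip(q_est, q_gt)))
    return total / len(targets)


def success_rate(final_mtres, tolerance_mm=1.5):
    """Percentage of runs whose final mTRE is below the tolerance."""
    return 100.0 * sum(e < tolerance_mm for e in final_mtres) / len(final_mtres)


def capture_range(bins, required=95.0):
    """Largest average initial mTRE whose success percentage still meets `required`."""
    ok = [initial for initial, success in bins if success >= required]
    return max(ok) if ok else None


# (average initial mTRE in mm, success percentage) pairs from Table 2
table2 = [(2.3, 100), (4.6, 100), (6.9, 100), (9.2, 99), (11.5, 95), (13.5, 85)]
print(capture_range(table2))  # prints: 11.5
```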
4 Conclusions

In this paper, we derived a novel similarity measure based on information theory and Gibbs random field theory, the GRF model based variational approximation to mutual information, using the Kullback-Leibler bound. The newly derived similarity measure enabled us to effectively incorporate spatial information into a 3D-2D registration. Results from the experiments performed on the datasets of a plastic phantom and of a cadaveric spine segment show that the newly derived similarity measure has a larger capture range than those previously reported and attains satisfactory accuracy.
References

1. Wells, W., Viola, P., et al.: Multi-modal volume registration by maximization of mutual information. MedIA 1, 35–51 (1996)
2. Maes, F., Collignon, A., et al.: Multimodality image registration by maximization of mutual information. IEEE Trans. Med. Imaging 16, 187–198 (1997)
3. Pluim, J.P., et al.: Mutual information based registration of medical images: a survey. IEEE Trans. Med. Imaging 22, 986–1004 (2003)
4. Pluim, J., et al.: Image registration by maximization of combined mutual information and gradient information. IEEE Trans. Med. Imaging 19, 809–814 (2000)
5. Rueckert, D., Clarkson, M.J., et al.: Non-rigid registration using higher-order mutual information. SPIE Medical Imaging: Image Processing 3979, 438–447 (2000)
6. Sabuncu, M.R., Ramadge, P.J.: Spatial information in entropy-based image registration. In: Gee, J.C., Maintz, J.B.A., Vannier, M.W. (eds.) WBIR 2003. LNCS, vol. 2717. Springer, Heidelberg (2003)
7. Russakoff, D.B., Tomasi, C., et al.: Image similarity using mutual information of regions. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3023, pp. 596–607. Springer, Heidelberg (2004)
8. Gan, R., Chung, A.C.S.: Multi-dimensional mutual information based robust image registration using maximum distance-gradient-magnitude. In: Christensen, G.E., Sonka, M. (eds.) IPMI 2005. LNCS, vol. 3565, pp. 210–221. Springer, Heidelberg (2005)
9. Barber, D., Agakov, F.V.: The IM algorithm: a variational approach to information maximization. In: NIPS’03, vol. 16. MIT Press, Cambridge, MA (2004)
10. Li, S.Z.: Markov random field modeling in computer vision. Springer, Heidelberg (1995)
11. Jonić, S., Thévenaz, P., et al.: An optimized spline-based registration of a 3D CT to a set of C-arm images. Int. J. Biomed. Imaging, Article ID 47197, 1–12 (2006)
12. van de Kraats, E.B., Penney, G.P., et al.: Standardized evaluation methodology for 2-D-3-D registration. IEEE Trans. Med. Imaging 24, 1177–1189 (2005)
Performance Evaluation and Recent Advances of Fast Block-Matching Motion Estimation Methods for Video Coding

Berenice Ramirez

Centro de Investigacion y Desarrollo de Tecnologia Digital, Av. del Parque 1310, 22510 Tijuana B.C., Mexico
[email protected]
http://www.citedi.mx/portal
Abstract. This paper presents a performance evaluation of a set of recently developed fast block-matching motion estimation methods. These methods are Kite-Cross-Diamond search (KCDS), Cross-Diamond-Hexagonal search (CDHS), Efficient-Hexagonal-Inner search (EHIS) and a method newly proposed in this paper, called Cross-Half-Diamond low resolution search (CHDS), which is an improvement of Cross-Diamond-Hexagonal search. The Diamond search (DS) and Full search (FS) algorithms are also implemented. The methods are evaluated based on the mean square error and the number of search points. Based on the experimental results, it is concluded that DS, CDHS, CHDS and EHIS are appropriate for small and large motion contents, except EHIS for high frequencies and complex motion contents. The KCDS method has the lowest performance compared with the other algorithms.

Keywords: Motion Estimation, Block-Matching, Video Coding, Performance Evaluation.
1 Introduction

Fast block-matching motion estimation (ME) began with the 2D-logarithmic search, the three step search (TSS), and the modified conjugate direction search, developed between 1981 and 1984 and mentioned in [1]. These methods influenced later algorithms developed between 1990 and 1996: the four step search (4SS) [2], the cross search (CS) [3], the new three step search (N3SS) [4], and the block-based gradient descent search (BBGDS) [5]. Later, in 1998, the Diamond Search (DS) algorithm [6] was developed based on statistical studies, and from 2000 onwards new algorithms based on DS were developed: the optimized diamond search [7], the cross diamond search (CDS) [8], the efficient three step search (E3SS) [9], and the algorithms evaluated in this work, developed between 2004 and 2006: the kite-cross-diamond search (KCDS) [10], the cross-diamond-hexagonal search (CDHS) [11], and an improvement of CDHS called cross-half-diamond low resolution search (CHDS), presented in Section 2 of this paper. The efficient-hexagonal-inner search (EHIS) [12] is based on hexagonal search patterns, following the line of [13]. All the algorithms mentioned in this section are called conventional fast block-matching motion estimation algorithms because they consider only translational motion.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 801–808, 2007. © Springer-Verlag Berlin Heidelberg 2007

On the other hand, there are non-conventional motion estimation algorithms. Those found in this work are the fast motion estimation method based on particle swarm optimization (PSO) [14], methods that use Lie operators [15, 16], a very recent method using a Kalman filter [17], methods that take advantage of the tools available in video coding standards, such as the scheme for motion estimation using multi-reference frames in H.264/AVC [18] and the automatic block split pyramidal motion estimation for MPEG-4 [19], and finally methods based on wavelets [20].

The focus of this paper is to report the performance evaluation of recently developed fast block-matching motion estimation methods that are conventional algorithms. In Section 2, the fast block-matching motion estimation methods evaluated in this work are presented concisely, together with a brief description of the basis of the methods. Section 3 presents the experimental environment and features; Tables 1 and 2 present the experimental results in terms of average MSE and average number of search points, respectively. Finally, Section 4 presents the conclusions.
2 Fast Block-Matching ME Methods
The fast block-matching motion estimation methods KCDS, CDHS, and CHDS evaluated in this work are based on statistical studies. The first results can be found in [6]. The authors treat the motion vector as a random variable and compute its cumulative probability distribution function for a set of test sequences. For their experiments they use the FS algorithm, the MSE as the matching criterion, and a 15×15 search window (SW). They found that motion vectors are most likely to lie within a circular support with a radius of 2 pels centered on the zero motion vector (the center of the SW); this circular support is referred to below as the circular area. From these results they developed the DS algorithm.

Later works followed this line and carried out further statistical experiments, whose results can be found in [8] and [11]. In [8], the cumulative probability distribution function is computed for the square, diamond, and cross search patterns. These experiments showed that motion vectors are more likely to lie in the horizontal and vertical directions than at other locations at the same radius. From these results the cross shape became an important search pattern. Finally, [11] reports the probability of finding motion vectors at the corners of a diamond pattern, a case the authors call a diamond-corner-hit. Based on their experimental results they proposed hexagonal patterns to reduce the number of search points. The EHIS algorithm does not belong to the DS line; it only uses a hexagon pattern. The first work on hexagon-based search patterns can be found in [13].

2.1 Kite-Cross-Diamond Search
This algorithm was proposed by Lam et al. in 2004 [10]. The patterns used in this method are a cross pattern, a diamond pattern, and four kite patterns called up-kite, down-kite, left-kite, and right-kite. The algorithm starts exploring the search window with a 3×3 cross pattern centered on zero motion. Depending on the direction of the minimum distortion, the kite pattern corresponding to that direction is used to find the global minimum point. However, the experimental results of Zhu [6] showed that motion vectors are most likely to lie in the circular area, and the results of Cheung [8] showed that horizontal and vertical motions are the most probable. Because this method focuses only on the 3×3 cross-shaped area centered on zero motion and does not explore other possible directions, it tends to find local minimum points. Compared with the experimental results of the other methods, shown in Section 3, KCDS has the lowest performance.

2.2 Cross-Diamond-Hexagonal Search
This algorithm was proposed by Cheung and Po in 2005 [11], where the different patterns it uses can be found. It incorporates the results of Zhu [6], Cheung [8], and the authors' own results published in [11]. The algorithm starts by exploring the 5×5 cross-shaped area centered on zero motion, using a small cross pattern and a large cross pattern. As a result, this method explores the circular area more efficiently than the other algorithms implemented in this work. It achieves a precision comparable to that of DS while reducing the number of search points. Furthermore, according to its authors, when a diamond pattern is used the number of search points can be reduced by replacing it with hexagonal patterns.

2.3 Cross-Half-Diamond Low Resolution Search: An Improvement of CDHS
This improvement of CDHS is a contribution of this work, developed in 2006. It is based on the contributions of Zhu [6] and Cheung [8] described in the previous sections. CHDS keeps the search method of CDHS to explore the 5×5 cross-shaped area centered on zero motion, but when the search moves away from the circular area, CHDS uses four low-resolution half-diamond patterns, shown in Fig. 1, chosen according to the direction of the global minimum point. The search is illustrated in Fig. 2. These patterns are motivated by the contributions in [6] and [8], interpreted as follows: if the probability of finding motion vectors decreases as the search moves away from the circular area, it is not necessary to explore the rest of the SW accurately. Furthermore, horizontal and vertical displacements are the most probable. As a result, the CHDS method slightly improves the speed of CDHS while keeping its precision. However, it is only useful in video with long motions.

Fig. 1. Search patterns of CHDS: a) upwards, b) downwards, c) towards the left, and d) towards the right

Fig. 2. Example of a CHDS coarse search using the towards-the-right search pattern

2.4 Efficient-Hexagonal-Inner Search
This algorithm was proposed by Su et al. in 2005 [12]. It uses a prediction scheme and a hexagon pattern to reduce the number of search points. EHIS focuses on improving the inner search by using six groups of distortions. These groups of distortions are compared, and the smallest one gives the locations inside the hexagon to be examined at fine resolution. The efficiency of this algorithm depends heavily on the prediction scheme.
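The pattern-based methods of this section share one basic loop: evaluate the block distortion at the points of a small pattern, move the pattern center to the best point, and stop when the center itself is best. The following Python sketch illustrates that loop with a single small cross pattern. It is not an implementation of KCDS, CDHS, CHDS, or EHIS, whose full pattern sets are defined in [10], [11], and [12]; the function names, the list-of-rows frame layout, and the `max_disp` parameter are assumptions made for illustration.

```python
def mse(cur, ref, bx, by, dx, dy, n):
    """Mean square error between the n x n block at (bx, by) in `cur`
    and the block displaced by (dx, dy) in `ref`."""
    total = 0
    for y in range(n):
        for x in range(n):
            d = cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x]
            total += d * d
    return total / (n * n)


def cross_descent(cur, ref, bx, by, n, max_disp):
    """Greedy descent with a small cross pattern (center plus 4 neighbours).

    Hypothetical helper, not one of the published algorithms.
    Returns (motion vector, MSE, number of distinct search points).
    """
    h, w = len(ref), len(ref[0])
    cache = {}  # each displacement is evaluated at most once

    def cost(dx, dy):
        if (dx, dy) not in cache:
            cache[(dx, dy)] = mse(cur, ref, bx, by, dx, dy, n)
        return cache[(dx, dy)]

    def valid(dx, dy):
        # Stay inside the search range and inside the reference frame.
        return (abs(dx) <= max_disp and abs(dy) <= max_disp
                and 0 <= bx + dx and bx + dx + n <= w
                and 0 <= by + dy and by + dy + n <= h)

    best = (0, 0)  # start at zero motion
    while True:
        candidates = [best] + [
            (best[0] + ox, best[1] + oy)
            for ox, oy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if valid(best[0] + ox, best[1] + oy)
        ]
        # The center is listed first, so ties keep the current center.
        new_best = min(candidates, key=lambda v: cost(*v))
        if new_best == best:
            return best, cost(*best), len(cache)
        best = new_best
```

The number of cached distortion evaluations plays the role of the search-point count used as the speed measure in Section 3. Like any greedy descent, this loop can stop at a local minimum, which is the weakness attributed to KCDS above.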
3 Experimental Results
The experimental environment is an SP ML MPEG-2 encoder with a Group of Pictures (GOP) length of 5, a frame distance of 1, a 16 × 16 pixel block, and a 15 × 15 search window. The sequences Claire and Tennis are used for small and moderate motion content, respectively, and Flowergarden and Football for complex and large motion content, respectively. To evaluate the precision of the methods, the MSE between the original image sequence and the image sequence reconstructed using only motion vectors is computed. The number of search points needed to compute the motion vectors is used as a measure of the speed of the methods. Table 1 presents the average MSE and Table 2 the average number of search points.

For the Claire sequence, as shown in Table 1, DS, KCDS, CDHS, CHDS, and EHIS have an MSE very close to that of FS. Therefore, these algorithms estimate short motion vectors accurately. For the Tennis sequence, DS, CDHS, CHDS, and EHIS have MSE values similar to that of FS for translational motion, as shown in Fig. 3 (frames 1 to 17), but, as expected, the performance of the methods decreases for other motion models, in this case the zoom motion content shown in Fig. 3 from frame 18 onwards. According to the experimental results, KCDS is accurate only in image sequences with very small motion content, although it has the smallest number of search points in all sequences except Claire. For the Flowergarden sequence, DS, CDHS, and CHDS achieve an MSE close to that of FS. In terms of average MSE, EHIS is almost 240 units away from FS, whereas the worst of DS, CDHS, and CHDS is almost 50 units away.

Table 1. Average MSE Values

Method   Claire   Tennis   Flowergarden   Football
FS       3.16     91.26    287.5          439.00
DS       3.16     103.26   307.29         505.39
KCDS     3.51     215.81   1320.71        852.86
CDHS     3.17     110.24   334.82         521.86
CHDS     3.17     110.36   315.36         522.21
EHIS     3.30     135.90   527.04         579.00
Table 2. Average Number of Search Points

Method   Claire   Tennis   Flowergarden   Football
DS       10.58    12.97    16.63          16.02
KCDS     6.99     6.71     6.84           6.71
CDHS     4.80     8.38     14.60          11.85
CHDS     4.80     8.10     13.68          11.18
EHIS     9.36     10.39    12.25          11.00
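For reference, the exhaustive search and the two measures used in Tables 1 and 2 can be sketched as follows. This is a minimal Python sketch under assumptions, not the MPEG-2 encoder used in the experiments: frames are lists of rows of gray values, the 15 × 15 search window is read as displacements of up to ±7 pixels, candidates falling outside the reference frame are skipped, and all function names are hypothetical.

```python
def block_mse(cur, ref, bx, by, dx, dy, n):
    """MSE between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            d = cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x]
            total += d * d
    return total / (n * n)


def full_search(cur, ref, bx, by, n=16, radius=7):
    """Test every displacement in the (2*radius+1)^2 window.

    Returns (best motion vector, its MSE, number of search points).
    """
    h, w = len(ref), len(ref[0])
    best_mv, best_err, points = (0, 0), None, 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            inside = (0 <= bx + dx and bx + dx + n <= w
                      and 0 <= by + dy and by + dy + n <= h)
            if not inside:
                continue  # assumption: out-of-frame candidates are skipped
            points += 1
            err = block_mse(cur, ref, bx, by, dx, dy, n)
            if best_err is None or err < best_err:
                best_mv, best_err = (dx, dy), err
    return best_mv, best_err, points


def evaluate_frame(cur, ref, n=16, radius=7):
    """Frame MSE of the motion-compensated prediction and the average
    number of search points per block."""
    h, w = len(cur), len(cur[0])
    sq_err, points, blocks = 0.0, 0, 0
    for by in range(0, h - n + 1, n):
        for bx in range(0, w - n + 1, n):
            _, err, pts = full_search(cur, ref, bx, by, n, radius)
            sq_err += err * n * n  # back to a sum of squared errors
            points += pts
            blocks += 1
    return sq_err / (blocks * n * n), points / blocks
```

With the default parameters, a block far enough from the frame border costs 225 search points, which is the baseline against which the averages in Table 2 can be read.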
Fig. 3. MSE for the Tennis sequence after applying the fast search methods
Fig. 4. Reconstructed frame of Flowergarden sequence, using only motion vectors computed with FS method
Fig. 5. Reconstructed frame of Flowergarden sequence, using only motion vectors computed with CDHS method

Therefore, DS, CDHS, and CHDS are more appropriate for image sequences with high frequencies and complex motion content. Figures 4 and 5 show a frame of the Flowergarden sequence reconstructed by the MPEG encoder using only the motion vectors obtained with the FS and CDHS algorithms, respectively. For the Football sequence, DS, CDHS, and CHDS achieve MSE values close to one another, but up to almost 85 units away from the MSE of FS. EHIS is 140 units away from the MSE of FS, and almost 57 units away from the worst of the three methods just mentioned. It therefore does not match the performance of the other methods, but it follows them reasonably closely.
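The distances quoted in this discussion follow directly from Table 1. As a quick check, the snippet below recomputes the gap between each method and FS; the dictionary layout and the helper name `gap_to_fs` are illustrative choices only.

```python
# Average MSE values copied from Table 1.
SEQUENCES = ("Claire", "Tennis", "Flowergarden", "Football")
AVG_MSE = {
    "FS":   (3.16, 91.26, 287.5, 439.00),
    "DS":   (3.16, 103.26, 307.29, 505.39),
    "KCDS": (3.51, 215.81, 1320.71, 852.86),
    "CDHS": (3.17, 110.24, 334.82, 521.86),
    "CHDS": (3.17, 110.36, 315.36, 522.21),
    "EHIS": (3.30, 135.90, 527.04, 579.00),
}


def gap_to_fs(method, sequence):
    """Average-MSE difference between `method` and full search."""
    i = SEQUENCES.index(sequence)
    return AVG_MSE[method][i] - AVG_MSE["FS"][i]


if __name__ == "__main__":
    for method in ("DS", "CDHS", "CHDS", "EHIS"):
        for sequence in ("Flowergarden", "Football"):
            print(method, sequence, round(gap_to_fs(method, sequence), 2))
```

For Flowergarden the gaps are about 239.5 (EHIS) and 47.3 (CDHS); for Football they are 140.0 (EHIS) and about 83.2 (CHDS), which matches the "almost 240", "almost 50", "140", and "almost 85" quoted in the text.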
4 Conclusions
In this paper, a performance evaluation of recent fast block-matching motion estimation methods is reported. Based on the experimental results, DS, CDHS, CHDS, and EHIS are appropriate for small and large motion content, except EHIS for high frequencies and complex motion content. The KCDS method has the lowest performance of the algorithms compared; it only achieves good performance for small motion content.
References
1. Musmann, H., Pirsch, P., Grallert, H.: Advances in Picture Coding. Proceedings IEEE 73(4), 523–548 (1985)
2. Po, L.M., Ma, W.C.: A Novel Four-Step Search Algorithm for Fast Block Motion Estimation. IEEE Trans. on Circuits and Systems for Video Technology 6(3), 313–317 (1996)
3. Ghanbari, M.: The Cross-Search Algorithm for Motion Estimation. IEEE Trans. on Communications 38(7), 950–953 (1990)
4. Li, R., Zeng, B., Liou, M.L.: A New Three-Step Search Algorithm for Block Motion Estimation. IEEE Trans. on Circuits and Systems for Video Technology 4(4), 438–443 (1994)
5. Liu, L.K., Feig, E.: A Block-Based Gradient Descent Search Algorithm for Block Motion Estimation in Video Coding. IEEE Trans. on Circuits and Systems for Video Technology 6(4), 419–423 (1996)
6. Zhu, S., Ma, K.: A New Diamond Search Algorithm for Fast Block-Matching Motion Estimation. IEEE Trans. on Image Processing 9(2), 287–290 (2000)
7. Zhu, C., Lin, X., Chau, L.P., Ang, H.A., Ong, C.Y.: An Optimized Diamond Search Algorithm for Block Motion Estimation. IEEE Int’l Symposium on Circuits and Systems 2(2), 488–491 (2002)
8. Cheung, C., Po, L.: A Novel Cross-Diamond Search Algorithm for Fast Block Motion Estimation. IEEE Trans. on Circuits and Systems for Video Technology 12(12), 1168–1177 (2002)
9. Chau, L-P., Jing, X.: Efficient Three-Step Search Algorithm for Block Matching Motion Estimation. IEEE Int’l Conf. on Acoustics 3, 421–424 (2003)
10. Lam, C.W., Po, L.M., Cheung, C.H.: A Novel Kite-Cross Diamond Search Algorithm for Fast Block Matching Motion Estimation. IEEE Int’l Symposium on Circuits and Systems 3, 729–732 (2004)
11. Cheung, C., Po, L.: A Novel Cross-Diamond-Hexagonal Search Algorithms for Fast Block Motion Estimation. IEEE Trans. on Multimedia 7(1), 16–22 (2005)
12. Su, C., Hsu, Y., Chang, C.: Efficient Hexagonal Inner Search for Fast Motion Estimation. IEEE Int’l Conference on Image Processing 1, 1093–1096 (2005)
13. Zhu, C., Lin, X., Chau, L.P.: Hexagon-based Search Pattern for Fast Block Motion Estimation. IEEE Trans. on Circuits and Systems for Video Technology 12(5), 349–355 (2002)
14. Du, G-Y., Huang, T-S.: A Novel Fast Motion Estimation Method Based on Particle Swarm Optimization. IEEE Proc. of 2005 Int’l Machine Learning and Cybernetics 8, 5038–5042 (2005)
15. Nalasani, M., Pan, W.: On the Complexity and Accuracy of Motion Estimation using Lie Operators. IEEE Proc. of the 36th Southeastern Symposium on System Theory, 16–20 (2004)
16. Veeravalli, A.G., Pan, W.D., Nalasani, M., Mohan, M.K.S.: Covariance Analysis of Non-translational Motion-compensated Frame Differences. In: IEEE Proc. of the 37th Southeastern Symposium on System Theory, pp. 462–465 (2005)
17. Kuo, C-M., Chung, S-C., Shih, P-Y.: Kalman Filtering Based Rate-Constrained Motion Estimation for Very Low Bit Rate Video Coding. IEEE Trans. on Circuits and Systems for Video Technology 16(1), 3–18 (2006)
18. Zhu, X., Zhu, S., Liu, T.: Automatic Block Split Pyramidal Motion Estimation for MPEG 4 Video Encoder. In: IEEE 5th World Congress on Intelligent Control and Automation, vol. 5, pp. 3958–3961 (2004)
19. Kim, S-E., Han, J-K., Kim, J-G.: An Efficient Scheme for Motion Estimation using Multireference Frames in H.264/AVC. IEEE Trans. on Multimedia 8(3), 457–466 (2006)
20. Park, H-W., Kim, H-S.: Motion Estimation using Low-Band-Shift Method for Wavelet-based Moving-Picture Coding. IEEE Trans. on Image Processing 9(4), 577–587 (2000)
Spectral Eigenfeatures for Effective DP Matching in Fingerprint Recognition
Boris Danev and Toshio Kamei
Common Platform Software Research Laboratories, NEC Corporation, 1753, Shimonumabe, Nakahara, Kawasaki, Kanagawa 211-8666, Japan
[email protected], [email protected]
Abstract. Dynamic Programming (DP) matching has been applied to cope with distortion in spectral-based fingerprint recognition. However, spectral data is redundant, and its size is huge. PCA could be used to reduce the data size, but it leads to a loss of topographical information in the projected vectors. This allows only inter-vector similarity measures such as the Euclidean or Mahalanobis distance, which prove inadequate in the presence of the distortion that occurs when a finger is swept over a line sensor. In this paper, we propose a novel two-step PCA to extract compact eigenfeatures amenable to DP matching. The first PCA extracts eigenfeatures of Fourier spectra from each image line. The second extracts eigenfeatures from all lines to form the feature templates. In matching, the feature templates are inversely transformed to line-by-line representations in the first PCA subspace for DP matching. Fingerprint matching experiments demonstrate the effectiveness of our proposed approach in template size reduction and accuracy improvement.
1 Introduction
Automated fingerprint identification has long been used in law enforcement, where it has demonstrated the identification capability of fingerprints [1,2]. Nowadays fingerprint recognition is spreading to a wide range of applications, especially consumer devices such as personal computers, mobile phones, and USB keys [3]. In these devices, fingerprint sensors are embedded to authenticate users. The most popular sensor is the sweep-type line sensor because of its cost and size. Users sweep their fingers over the sensor, and fingerprint images are captured and recognized. However, current fingerprint recognition technologies do not yet satisfy all the requirements of embedded devices, such as accuracy, size, and cost.

Matsumoto et al. [4] proposed fingerprint verification methods based on DP matching of spectral features. In that work, spectral features such as Linear Prediction Coefficients (LPC) were extracted from each image line. DP treats the spectral features as one-dimensional signals in order to cope with distortion in fingerprint images. Although this approach might have the potential to satisfy the requirements of embedded devices, spectral data contains redundant information, and its size is huge. Principal Component Analysis (PCA) is effective in reducing data size while keeping the original information, and is therefore widely used in biometrics [5,6,7]. PCA projects the original feature into a PCA subspace. Inter-vector distances such as the Euclidean or Mahalanobis distance are often used to compute the similarity between projected vectors. In the case of spectral features of fingerprints, PCA could be used to reduce the data size, but it leads to a loss of topographical information in the projected vectors. Therefore, DP matching is not applicable to the projected vectors. Even if inter-vector distances are used, it is difficult to reach sufficient accuracy because of the distortion that occurs when a finger is swept over a line sensor.

Our challenge is how to introduce an efficient PCA technique into DP matching of spectral features. In this paper, we propose a novel two-step PCA to extract compact eigenfeatures amenable to DP matching. The first PCA is used to extract eigenfeatures of Fourier spectra from each image line. Then a second PCA is applied to extract the eigenfeatures from all image lines in order to form the feature templates. In the matching phase, the feature templates are inversely transformed into the first PCA subspace to reconstruct line-by-line representations. Then we apply DP matching to the line-by-line representations. Furthermore, in order to enhance the second PCA, we introduce automatically generated images into the estimation of the second PCA matrix. These images are generated from the original images based on warping information provided by DP matching.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 809–816, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Spectral Eigenfeatures for DP Matching

2.1 Spectral Eigenfeature Extraction Using Two-Step PCA
In feature extraction, spectral eigenfeatures are extracted using two-step PCA to obtain compact features amenable to DP matching, as summarized in Fig. 1(a). First, we apply the one-dimensional Discrete Fourier Transformation (DFT) line by line to the normalized image $f(m, n)$ of a given fingerprint:

$$F(\omega, n) = \frac{1}{\sqrt{M}} \sum_{m=0}^{M-1} f(m, n) \exp\left(-2\pi i \frac{m\omega}{M}\right) \qquad (1)$$

where $0 \le m \le M-1$ (horizontal) and $0 \le n \le N-1$ (vertical).

In the first PCA, a projected vector $g_l$, also called an eigenfeature, is extracted from the Fourier amplitude spectrum of each horizontal image line $l$ using a PCA matrix $W_L$ to obtain its principal components:

$$g_l = W_L^t s_l \qquad (2)$$

Here, $s_l$ denotes the Fourier amplitude spectrum $|F(\omega, l)|$ in vector form, $s_l = [\,|F(1, l)|\ |F(2, l)|\ \cdots\ |F(M/2-1, l)|\,]^t$, where the DC component and the redundant half of the spectrum are removed. Extraction for all lines is written as $G = W_L^t S$ in step (a) of Fig. 1, using $G$, an array of the $g_l$, and the matrix $S = [s_0\ s_1\ \cdots\ s_{N-1}]$, also called the spectral image in this paper.
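As an illustration of Eq. 1 and of the construction of $s_l$, the following Python sketch computes the amplitude spectrum of each image line with a direct DFT and drops the DC component and the mirrored half. The function names and the list-of-rows image layout are assumptions, and no windowing or image normalization is included.

```python
import cmath
import math


def line_spectrum(line):
    """Amplitudes |F(w)| for w = 1 .. M/2 - 1 of one image line.

    Uses the 1/sqrt(M) scaling of Eq. 1; the DC term (w = 0) and the
    redundant half of the spectrum are left out.
    """
    m_len = len(line)
    scale = 1.0 / math.sqrt(m_len)
    spectrum = []
    for w in range(1, m_len // 2):
        acc = sum(v * cmath.exp(-2j * math.pi * m * w / m_len)
                  for m, v in enumerate(line))
        spectrum.append(abs(acc) * scale)
    return spectrum


def spectral_image(image):
    """Spectral image S, stored as one amplitude spectrum per image line."""
    return [line_spectrum(row) for row in image]
```

For a line of width M = 96 this yields 47 amplitudes per line. A direct DFT is used only to keep the sketch dependency-free; an FFT routine would give the same amplitudes faster.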
Fig. 1. A scheme for spectral eigenfeature extraction and matching: (a) extraction (Fourier spectra S, 1st PCA, raster scan, 2nd PCA, template feature h); (b) matching (template feature h, inverse transform, reconstructed features)
The second PCA projects a vector $g$, obtained by raster-scanning the elements of $G$, into the PCA subspace specified by a matrix $W_I$ to obtain the second eigenfeature $h$, as shown in step (c):

$$h = W_I^t \phi(G) = W_I^t \phi(W_L^t S) \qquad (3)$$

where $\phi(G)$ denotes the raster-scanning of the matrix $G$. The second eigenfeature $h$ is enrolled as the feature template used in matching. The numbers of principal components in the first and second PCA are determined experimentally.

2.2 DP Matching Using Eigenfeatures
DP matching [4,8] is used to match feature templates. To restore topographical information, a reference template $h_R$ and a test template $h_T$ are inversely transformed into the first PCA subspace to reconstruct the line-by-line representations $\hat{G}_R$, $\hat{G}_T$ using the second PCA matrix $W_I$:

$$\hat{G}_R = \phi^{-1}(W_I h_R), \qquad \hat{G}_T = \phi^{-1}(W_I h_T) \qquad (4)$$

where $\phi^{-1}$ denotes the inverse operation of the raster-scanning $\phi$. As the reconstructed line-by-line representations $\hat{G}_R$, $\hat{G}_T$ are approximations of the principal components of the spectral features of horizontal lines, DP matching is performed in the vertical direction to obtain the similarity between $\hat{G}_R$ and $\hat{G}_T$.

2.3 Training of PCA Matrices
A PCA matrix is given as the set of $\kappa$ eigenvectors in the eigenvector matrix $V$ corresponding to the $\kappa$ highest eigenvalues in the eigenvalue problem $CV = V\Lambda$, where $\Lambda$ is the eigenvalue matrix and $C$ is the covariance matrix calculated from a training set. $K$ normalized images $f^k(m, n)$ $(0 \le k \le K-1)$ and their spectral images $S^k = [s_0^k\ s_1^k\ \cdots\ s_{N-1}^k]$ are used as the training set to calculate the first and second PCA matrices $W_L$ and $W_I$. Since each line is projected using the same PCA matrix in the first PCA, the matrix can be calculated from a training set $\{s_0^0, s_1^0, \cdots, s_0^1, s_1^1, \cdots, s_{N-1}^{K-1}\}$ consisting of all Fourier amplitude spectra $s_l^k$ of all spectral images $S^k$.

On the other hand, the second PCA extracts the principal components of all features obtained in the first PCA. In order to enlarge the training dataset, we introduce an automatically generated training set into the estimation of the second PCA. The generated training set $\{X^{pq}\}$ is produced by warping images in the original training set. Image $X^{pq}$ is generated by transforming an original image $f^q$ into the topography of the reference image $f^p$ using the warping information obtained in the DP matching between the spectral images of $f^p$ and $f^q$. For example, in the generated image in Fig. 2(a) the position of the ridge bifurcation marked by rectangles has changed, because the DP matching of the original image against the reference image has tried to resolve the deformation in the fingerprint. When the reference image comes from a different finger, the generated image, as in Fig. 2(b), looks consistently stretched compared to the original.

Fig. 2. Automatically generated images for training: (a) generated image from the same finger; (b) generated image from a different finger

Using the spectral images $S_{X^{pq}}$ derived from the generated images $X^{pq}$, the covariance matrix $C_I$ for the second PCA is calculated:

$$C_I = \frac{1}{K^2 - 1} \sum_{p=1}^{K} \sum_{q=1}^{K} (x^{pq} - \bar{x})(x^{pq} - \bar{x})^t, \qquad x^{pq} = \phi(W_L^t S_{X^{pq}}) \qquad (5)$$

Therefore, the number of training samples increases from $K$ to $K^2$.
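Both the matching of Section 2.2 and the image generation above rely on a warping between two sequences of per-line features. The sketch below shows a basic dynamic programming alignment (textbook dynamic time warping with a Euclidean local distance and unconstrained steps) that returns the accumulated distance and the warping path. It is only an assumed stand-in: the exact DP formulation, local distance, and slope constraints of [4,8] are not given in this paper, and the function name `dp_align` is hypothetical.

```python
import math


def dp_align(ref_lines, test_lines):
    """Align two sequences of per-line feature vectors.

    Returns (accumulated distance, path), where path is a list of
    (reference line index, test line index) pairs from top to bottom.
    """
    n, m = len(ref_lines), len(test_lines)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref_lines[i - 1], test_lines[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a reference line
                                 cost[i][j - 1],      # skip a test line
                                 cost[i - 1][j - 1])  # advance both
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

In this reading, the accumulated distance serves as the dissimilarity between two reconstructed representations, and the path tells which line of one image corresponds to which line of the other, which is the kind of warping information needed to generate a warped training image.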
3 Experiments

3.1 Dataset
For experiments, we used a database of fingerprint images captured with an optical line sensor with 858 dpi resolution. The database contains images from 18 people with 8 different fingers, 3 samples per finger.
Fig. 3. Accuracy dependency of DFT and GDS on image size (equal error rate in % for image sizes 64x64, 96x96, 128x128, and 192x192)
We pre-processed the database in the following way. We first manually identified a reference point. A core was usually selected as this point; in cases where no core was present, a minutia near the core region was selected. Then a square region of size 96×96 centered on the reference point was cropped, and pixel-value normalization was applied [9]. Furthermore, the database was divided into two datasets. The first 144 images (48 fingers × 3 samples) formed the training set, while the other 288 images (96 fingers × 3 samples) composed the testing set. Thus, the training and testing sets were disjoint.

3.2 Spectral Features
DFT vs GDS. Matsumoto et al. proposed the Group Delay Spectrum (GDS) as an alternative to the DFT for the spectral representation of each horizontal image line in DP matching of fingerprints [4]. We compared DFT with GDS for different image sizes using their approach. A Hamming window is applied to each horizontal line in the DFT calculation. As shown in Fig. 3, the GDS feature performed better than the DFT feature for large regions (e.g., 192×192) and worse for small ones (e.g., 96×96). The reasons for this are probably twofold. First, as GDS uses the phase information of the LPC spectrum, a short one-dimensional signal is not long enough to predict the linear coefficients properly. Second, the larger regions in our database include background areas, which tends to degrade the discriminant capability of the DFT feature. Thus, we selected the DFT of 96×96 cropped images for the experiments described later.

Dimensionality of Features. Since the width of the cropped images is 96, the dimensionality of the Fourier amplitude spectrum $s_l$ becomes 96/2-1=47. The size of $S$ becomes 96×47=4512. The first 12 principal components were extracted in the first PCA, as this experimentally proved to be enough to represent each Fourier amplitude spectrum without losing matching accuracy. The size of $G$ becomes 96×12=1152.
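The sizes quoted above can be retraced with a few lines of Python. The sketch also shows one possible raster scan φ and its inverse; row-by-row ordering is an assumption, since the scanning order is not specified here, and the function names are illustrative.

```python
def raster_scan(matrix):
    """phi: concatenate the rows of a matrix into a single vector
    (row-by-row order is an assumption)."""
    return [value for row in matrix for value in row]


def inverse_raster_scan(vector, rows, cols):
    """phi^-1: rebuild the rows x cols matrix from its raster-scanned vector."""
    if len(vector) != rows * cols:
        raise ValueError("vector length does not match rows * cols")
    return [vector[r * cols:(r + 1) * cols] for r in range(rows)]


M = N = 96                  # width and height of the cropped image
SPECTRUM_DIM = M // 2 - 1   # amplitudes kept per line
P = 12                      # principal components kept by the first PCA
SIZE_S = N * SPECTRUM_DIM   # elements of the spectral image S
SIZE_G = N * P              # elements of G, the input of the second PCA
```

Here `SPECTRUM_DIM`, `SIZE_S`, and `SIZE_G` evaluate to 47, 4512, and 1152, and `inverse_raster_scan(raster_scan(G), P, N)` returns a P×N matrix `G` unchanged.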
Fig. 4. Accuracy dependency on the dimensionality of feature template: (a) Euclidean distance (Ia, Ib); (b) proposed matching (IIa, IIb). Both panels plot equal error rate (%) against dimensionality, with the reference algorithm (Ref) for comparison.

Table 1. Summary of the matching results

Algorithm   Matching             Training Set       Best EER   Template Size
Ref [4]     DP matching          N/A                0.42%      4512
Ia          Euclidean distance   original images    3.81%      200
Ib          Euclidean distance   generated images   3.47%      400
IIa         Proposed matching    original images    0.42%      1152
IIb         Proposed matching    generated images   0.26%      200

3.3 Matching Experiments
We evaluated our proposed algorithm against other algorithms with respect to feature template size and matching accuracy. The reference algorithm is the algorithm in [4], using DFT instead of GDS. The Equal Error Rate (EER) of the reference algorithm was 0.42% at image size 96×96. We also compared our algorithm with an algorithm based on inter-vector matching between the feature templates $h$ of Eq. 3 using the Euclidean distance. The effect of image generation in the computation of $W_I$ was also investigated.

Fig. 4 shows how accuracy depends on the dimensionality of the feature template. In the figure, the suffixes a and b (e.g., Ia, Ib) denote the training set used in the computation of $W_I$: a denotes that the original training images were used, and b that the generated training images were used. While the EER of inter-vector matching (Ia, Ib) stayed around 4%, our proposed algorithm (IIb) reached the reference accuracy of EER=0.42% with 90 dimensions, and achieved the best accuracy of EER=0.26% with 200 dimensions, as shown in Fig. 4. The figure also shows that automatic image generation is significantly effective in the training of IIb. Table 1 summarizes the experimental results. Template size in the table means the dimensionality of the feature template at the best EER.

Fig. 5. Performance measures of subspaces: (a) cumulative contribution ratio Γ(κ); (b) Fisher ratio F(κ), both plotted against dimensionality for Ia, Ib, IIa, and IIb

The proposed algorithm is also very efficient in terms of speed and memory consumption. At the best EER, it required 200×4=800 bytes for template storage in floating-point precision, while the template size of the reference algorithm is 18,048 bytes. The memory sizes of $W_L$ and $W_I$ are 47×12×4=2,256 bytes and 1152×200×4=921,600 bytes, respectively. Matching time was about 10 ms on an Intel Pentium 4 3.2 GHz processor using MATLAB code.
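The storage figures above are simple products of the dimensionalities and four bytes per value. The following lines retrace them; the variable names are illustrative, and 4-byte floats are assumed, as implied by the ×4 factors.

```python
BYTES_PER_VALUE = 4  # assumption: single-precision floats

template_proposed = 200 * BYTES_PER_VALUE      # best-EER template of IIb
template_reference = 4512 * BYTES_PER_VALUE    # full spectral image S
w_l_bytes = 47 * 12 * BYTES_PER_VALUE          # first PCA matrix W_L
w_i_bytes = 1152 * 200 * BYTES_PER_VALUE       # second PCA matrix W_I

# How many times smaller the proposed template is than the reference one.
reduction_factor = template_reference / template_proposed

if __name__ == "__main__":
    print(template_proposed, template_reference, w_l_bytes, w_i_bytes)
    print(reduction_factor)
```

The template shrinks by a factor of about 22.6, while the 921,600 bytes of $W_I$ are a fixed cost shared by all templates rather than a per-template cost.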
4 Discussion
We present the results of performance measures to explain empirically the reasons behind the significant improvement achieved by the proposed algorithm. The cumulative contribution ratio and the Fisher ratio of the second PCA subspace are computed as the performance measures.

The cumulative contribution ratio $\Gamma(\kappa)$ shows the extent to which $\kappa$ principal components reflect the original information. It is defined as $\Gamma(\kappa) = \sum_{i=1}^{\kappa} \sigma_i^2 / \sum_{i=1}^{n} \sigma_i^2$, where $\sigma_i^2$ is the variance of the $i$-th principal component and $n$ is the dimensionality of the original space. The variances $\sigma_i^2$ were computed using the testing set. $\Gamma(\kappa)$ with image generation (Ib, IIb) is much larger than without image generation (Ia, IIa), as shown in Fig. 5(a). This means that the subspaces produced with image generation represent the original information effectively. This effect is mainly due to the increase in training database size provided by the image generation.

The Fisher ratio is a measure of discriminant efficiency. For subspaces, it can be computed as $F(\kappa) = \mathrm{tr}(S_\kappa^B) / \mathrm{tr}(S_\kappa^W)$, where $S_\kappa^B$ and $S_\kappa^W$ are the between-class and within-class covariance matrices of the $\kappa$-dimensional subspace. It is displayed in Fig. 5(b). Applying DP matching in subspaces (IIa, IIb) yields higher Fisher ratios, which translates into better discriminant efficiency. Furthermore, the PCA subspace calculated with generated images (IIb) exhibits the highest Fisher ratios. This effect is obtained from the warping in DP matching. In the case of IIb, the distance between two features is calculated in the topography warped by DP matching, while the generated images were transformed into a similar topography. This similarity of topography makes the image generation effective in the case of IIb.

Additionally, an interesting phenomenon is observed in cases IIa and IIb at dimensionalities larger than 400. The subspaces of IIa and IIb composed of 1152 principal components are theoretically the same, which explains the identical Fisher ratio reached at the end. We expected the curves IIa and IIb in Fig. 5(b) to join as smoothly as the accuracy curves in Fig. 4(b). The joining does occur, but it is fairly steep. A possible explanation is that, although the higher components in case IIb represent mostly noise, in case IIa discriminant information still remains in the higher components, because the PCA without image generation could not pack the relevant information into the lower subspace due to the insufficient training size and the inconsistency of topography.
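Both measures are straightforward to compute from projected features. The Python sketch below is one possible reading of the definitions: the cumulative contribution ratio takes the per-component variances in decreasing order, and the Fisher ratio uses unnormalized between-class and within-class scatter, which is an assumption because the normalization of $S_\kappa^B$ and $S_\kappa^W$ is not stated. Function names are hypothetical.

```python
def cumulative_contribution(variances, kappa):
    """Gamma(kappa): share of the total variance carried by the first
    kappa principal components."""
    return sum(variances[:kappa]) / sum(variances)


def fisher_ratio(features, labels):
    """tr(S_B) / tr(S_W) for feature vectors grouped by class label.

    Traces are accumulated directly, so no matrix is ever formed.
    """
    dim = len(features[0])
    overall = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    groups = {}
    for f, label in zip(features, labels):
        groups.setdefault(label, []).append(f)
    between = within = 0.0
    for members in groups.values():
        mean = [sum(f[d] for f in members) / len(members) for d in range(dim)]
        # Between-class scatter, weighted by class size (assumption).
        between += len(members) * sum((mean[d] - overall[d]) ** 2
                                      for d in range(dim))
        # Within-class scatter around the class mean.
        within += sum((f[d] - mean[d]) ** 2 for f in members for d in range(dim))
    return between / within
```

To obtain $F(\kappa)$ for a given κ, the feature vectors would be truncated to their first κ components before calling `fisher_ratio`, with the finger identity used as the class label.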
5 Conclusions
We have proposed a novel way to extract spectral eigenfeatures, together with an effective Dynamic Programming scheme for matching them. We have also introduced an image generation technique to enhance the principal component analysis. In that context, we have shown that PCA can successfully extract very compact eigenfeatures. Learning with image generation significantly improved the proposed scheme, which needs only 800 B for EER=0.26% compared with the 18 KB used by the reference approach with EER=0.42%.
Acknowledgment We would like to thank the Student Exchange Office at Swiss Federal Institute of Technology (EPFL) and Prof. Takeshi Mizuno from Saitama University for the help provided in finding a post-diploma research internship in NEC Corporation.
References
1. Asai, K., Kato, Y., Hoshino, Y., Kiji, K.: Automatic fingerprint identification. Proc. Soc. of Photo-Optical Instrumentation Engineers 182, 49–56 (1979)
2. Ratha, N., Bolle, R. (eds.): Automatic Fingerprint Recognition Systems. Springer, Heidelberg (2004)
3. Mainguet, J.: Biometrics for large-scale consumer products. In: Proc. Intl. Conf. on Artificial Intelligence, pp. 310–314 (2003)
4. Matsumoto, N., Sato, S., Fujiyoshi, H., Umezaki, T.: Evaluation of a fingerprint verification method based on LPC analysis. Trans. of IEE 122(5), 799–807 (2002)
5. Moghaddam, B., Pentland, A.: Probabilistic visual learning for object representation. IEEE Trans. on Pattern Anal. Machine Intell. 19(7), 696–710 (1997)
6. Kamei, T., Mizoguchi, M.: Fingerprint preselection using eigenfeatures. In: Proc. Intl. Conf. on Computer Vision and Pattern Recognition, pp. 918–923 (1998)
7. Kamei, T.: Face retrieval by an adaptive Mahalanobis distance using a confidence factor. Proc. IEEE Intl. Conf. on Image Processing 1, I153–I156 (2002)
8. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. Speech and Signal Processing 26(1), 43–49 (1978)
9. Jain, A., Prabhakar, S., Hong, L.: A multichannel approach to fingerprint classification. IEEE Trans. on Pattern Anal. Machine Intell. 21(4), 348–359 (1999)
Registering Long-Term Image Series
Detlev Droege and Dietrich Paulus
Active Vision Group, Institute for Computational Visualistics, University of Koblenz-Landau, Universitätsstr. 1, 56070 Koblenz, Germany {droege,paulus}@uni-koblenz.de http://www.uni-koblenz.de/agas
Abstract. In this article we present a method to register image sequences taken at long time intervals. We suggest an improvement to Ward's registration method for high dynamic range images to achieve limited sub-pixel accuracy. We show the chosen method to be appropriate even for heavily changing lighting conditions.
1 Introduction
Image series generated from long-term outdoor observations pose some difficult problems, for comparative processing as well as for the generation of presentation material. Several external parameters vary over time and thus have considerable influence on the images taken over the course of months or even years. Major influences are caused by
– weather conditions (e.g. sunny, rainy or foggy),
– time-of-day dependent sun position,
– seasonal sun position and vegetation,
– natural changes (growing vegetation) and
– intended changes (e.g. construction works).
Additional influences are caused by properties of the recording equipment, such as
– automatic exposure control,
– positional inaccuracies (e.g. due to wind or mechanical instabilities) and
– focus control that may lead to images of different sharpness.
Fig. 1 shows three frames from one of the image sequences recorded on our campus. The data set consists of 52 sequences (i.e. camera positions), each holding 815 images recorded during the main construction phase of the buildings (26 months) and another 520 images recorded over a period of 18 months when the exterior of the buildings was mostly finished. The purpose of this material is to create a documentation of the construction of the new buildings on the campus.
Fig. 1. Images taken on three consecutive days around noon
Observing the construction process of a building over such a long period of time using a pan-tilt camera results in a sequence of images that is not directly suitable for producing a fast motion film of the growing building. The negative results observed when a film is created by simply lining up consecutive images can be attributed to the above-mentioned reasons:
– positional jitter due to mechanical inaccuracies and wind,
– bright, high contrast (direct sunlight) vs. dim, low contrast (cloudy),
– cast shadows on sunny days, almost no shadows on other days,
– color shifts due to varying illumination (e.g. bluish/grayish for cloudy days, yellowish for sunny days, reddish for sunrise and sunset),
– moving trees/bushes (wind) and
– (dis-)appearing objects (e.g. trucks or equipment).
The goal is to stitch the images that are captured while the camera pans and tilts into one larger image per recording date. These images are then combined to form a fast motion movie. A two-stage registration process is needed: first the images of one day are aligned to a single panorama, then these panoramas have to be registered for the fast motion sequence. However, the two stages cannot be carried out independently, as usual techniques like bundle adjustment might cause local displacements of some of the mosaic images. Further, to suppress flickering, illumination changes need to be compensated. Moving objects, such as pedestrians or cars, need to be distinguished from progress in the construction. Whereas the moving objects need to be suppressed, the changes to the buildings need to be preserved in the images. The recordings started almost 8 years ago and the camera that was used is of relatively low quality. This poses additional challenges for image processing, which will need to increase resolution and image quality. We will show in Sect. 2 how a simple and robust rigid image registration can be obtained with a method that was originally used to create HDR images. First quantitative results are shown in Sect. 3. We conclude with an outlook on our further plans in Sect. 4.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 817–822, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Approach
The drawbacks of the given image material invalidate most of the usual methods for automatic registration. We tested several approaches and ended up with an algorithm that was originally developed to register exposure steps for high dynamic range images.
2.1 Related Work
Various authors report on algorithms that they apply to long image sequences. Whereas [1] estimates the background using a statistical color model, the authors of [2] estimate the foreground. Markov random fields are used in [3] to adapt the processing routines to illumination changes.
The algorithms above operate on video streams. The major difference between our problem and conventional processing of video streams is the delay between the recording of successive frames, which causes considerable differences in illumination and scenic content.
2.2 Image Registration
Standard intensity-based approaches for image registration fail due to the heavily varying lighting conditions in our image data. Color-based approaches as described in [4] did lead to some improvement, but were only usable for those sequences with a sufficient number of non-moving colored objects within the scene. This precondition cannot be met for all camera positions. Common feature-based methods suffer from similar problems as the intensity-based methods do: they rely on a considerable amount of similarity of the same feature between two consecutive images. However, this assumption often does not hold for pictures taken with a delay of 24 hours. In [5], Brown introduced the application of Lowe's SIFT features [6] to the automatic recognition of panoramic images. SIFT features are not only scale invariant, as their name suggests, but also to some degree invariant to illumination changes. However, the rather low resolution of our input images (352 × 288 pixels) causes only a small number of SIFT features to be detected. For re-occurring architectural elements these features have very similar feature description vectors, thus leading to incorrect correspondences between them. Other features attach to shadow corners and thus do not have stable positions between images. The remaining feature correspondences do not outnumber the outliers by a sufficient amount to provide usable results.
Fig. 2. Binary images computed from the first two images in Fig. 1. Exclusion information is additionally depicted in gray
2.3 HDR Image Registration
Finally, the image registration method presented by Ward in [7] proved to be well suited for our needs, although it was initially developed to register high dynamic range (HDR) images. While one would expect this method to be capable of dealing with illumination changes, it also showed surprisingly good results when registering images with very different shadow situations. Ward's algorithm uses a resolution hierarchy and a simplified image-based correspondence measure based on binarized images (Fig. 2). The key to his approach is the selection of the threshold, which should be the median of the image (after conversion to a gray-scale image). To avoid a negative influence of intensity values close to the threshold, the method additionally employs an exclusion mask which contains all pixel positions with values close to the threshold in any of the images (Fig. 3). XOR-ing the binarized images and AND-ing the result with the (inverted) exclusion mask provides a dissimilarity measure by counting all 1-bits. The best position is then determined in a [−1, 0, 1] range in vertical and horizontal direction at every resolution level around the current optimum of the preceding level.¹
Fig. 3. Isolated exclusion masks corresponding to Fig. 2
2.4 Extension
For high resolution images like those made with state-of-the-art digital cameras, often consisting of several million pixels, a registration accuracy of one discrete pixel, resulting in a positional error of ±0.5 pixels, is usually sufficient to obtain satisfying results. For low resolution images like ours (100K pixels) this accuracy is not sufficient to obtain a smooth impression in consecutive frames of the generated film. To get a certain degree of sub-pixel accuracy, we extended Ward's approach by adding enlarged versions of the images to the resolution hierarchy (zoomed by cubic interpolation). While with high resolution images this would have a considerable impact on memory usage and runtime, our low resolution images merely approach the size of smaller HDR images even when scaled to 4 × 4 times their size.
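The threshold comparison of Sect. 2.3 can be sketched in a few lines. The following plain-Python sketch is only an illustration under stated assumptions, not Ward's implementation: images are small lists of gray values, the names `to_bitmaps`, `count_differences` and `best_shift` and the `tolerance` of 4 gray levels are hypothetical choices, pixels shifted outside the frame are simply ignored, and the resolution hierarchy and the enlarged levels of Sect. 2.4 are left out.

```python
from statistics import median

def to_bitmaps(gray, tolerance=4):
    """Return (threshold bitmap, keep mask) for a gray image given as rows of numbers."""
    med = median(v for row in gray for v in row)
    bitmap = [[1 if v > med else 0 for v in row] for row in gray]
    # keep mask = inverted exclusion mask: 0 where the value is close to the median
    keep = [[1 if abs(v - med) > tolerance else 0 for v in row] for row in gray]
    return bitmap, keep

def count_differences(bm_a, keep_a, bm_b, keep_b, dx, dy):
    """Count differing, non-excluded pixels when image b is shifted by (dx, dy)."""
    h, w = len(bm_a), len(bm_a[0])
    errors = 0
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy  # pixel of b that lands on (x, y)
            if 0 <= sx < w and 0 <= sy < h:
                differ = bm_a[y][x] ^ bm_b[sy][sx]      # XOR of the binarized images
                usable = keep_a[y][x] & keep_b[sy][sx]  # AND with both keep masks
                errors += differ & usable
    return errors

def best_shift(gray_a, gray_b, centre=(0, 0), tolerance=4):
    """Search the 3 x 3 neighbourhood around `centre` for the shift with fewest differences."""
    bm_a, keep_a = to_bitmaps(gray_a, tolerance)
    bm_b, keep_b = to_bitmaps(gray_b, tolerance)
    candidates = [(centre[0] + i, centre[1] + j) for j in (-1, 0, 1) for i in (-1, 0, 1)]
    return min(candidates,
               key=lambda s: count_differences(bm_a, keep_a, bm_b, keep_b, s[0], s[1]))

# Toy scene (assumed data): two overlapping crops, the second one much darker.
SCENE = ["#..#.##", ".#.##..", "##...#.", "..###.#"]
frame_a = [[200 if c == "#" else 20 for c in row[:6]] for row in SCENE]
frame_b = [[105 if c == "#" else 15 for c in row[1:]] for row in SCENE]
shift = best_shift(frame_a, frame_b)  # one pixel to the right, despite the darker frame
```

Because each image is thresholded at its own median, the darker second frame yields the same bitmap as the first one, which is the property that makes the measure tolerant of exposure changes. In a full hierarchy, `centre` would be twice the optimum found at the next coarser level.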
¹ This can be accelerated considerably by employing bitwise operations on 32- or even 64-bit processor register words, but this gain is not as significant for small pictures like those present in our case.
3 Experiments
As noted in Sect. 2.2, the rapidly changing illumination conditions heavily influence the ability to register the images, which is in turn needed for any process of retrospective estimation of the lighting conditions. To focus on this problem first, the initial tests were done on image sequences showing the almost finished buildings of the campus. The changes appearing even in these rather static scenes proved to be challenging enough, before moving to the even more difficult construction scenes. Some considerable influences on the comparability of the images, like the presence of sunblinds, had surprisingly little impact on the registration quality (see Fig. 4). Evaluation was done by visual inspection. At a frame rate of 25 fps these sequences play for approx. 20 sec and even slight misregistrations are easily observed in the material during playback. Most sequences in which about 40-50% of the visible area shows little change registered almost perfectly. On average, such scenes (like the one shown in Fig. 4) had 1 to 3 misregistrations within 520 images. Sequences with larger areas of change (e.g. Fig. 5) suffered from a large number of misregistrations, whose share depended on the type of change and sometimes even exceeded 50%.
Fig. 4. Even these sunblinds caused only one misregistration in a sequence of 520 images
Fig. 5. Heavily changing scenery causes misregistration
4 Conclusion
The registration method introduced in [7] fits our need to register image series with heavily varying lighting conditions very well. Our extension to achieve a limited amount of sub-pixel accuracy further improved the quality of the results. For our given scenario of known camera positions, the current registration problems with heavily changing scenes can be easily avoided by introducing exclusion masks for the changing areas. Such masks integrate seamlessly with the registration algorithm. In a further step, attempts could be made to determine such masks automatically.
Based on these results, the more difficult sequences of the construction phase will be approached next. By splitting these sequences into several subsequences with only tolerable change over time, these could be registered similarly to the later scenes. The alignment of the subsequences will be addressed with a hierarchical approach. After the registration, the illumination intensity and color changes will be addressed.
References 1. François, A.R., Medioni, G.G.: Adaptive color background modeling for real-time segmentation of video streams. In: Proceedings of the International Conference on Imaging Science, Systems and Technology, Las Vegas, NA, pp. 227–232 (1999) 2. Wu, Z., Chen, C.: A new foreground extraction scheme for video streams. In: MULTIMEDIA ’01. Proceedings of the ninth ACM international conference on Multimedia, New York, NY, USA, pp. 552–554. ACM Press, New York, NY, USA (2001) 3. Wesolkowski, S., Fieguth, P.W.: Adaptive color image segmentation using markov random fields. In: ICIP 2002, International Conference on Image Processing, Proceedings. vol. III, pp. 769–772 (2002) 4. Droege, D., Hong, V., Paulus, D.: Farbnormierung auf langen Bildfolgen. In: Franke, K.H. (ed.) 9. Workshop Farbbildverarbeitung, Ilmenau, Zentrum für Bild- und Signalverarbeitung e.V, pp. 107–112 (2003) 5. Brown, M., Lowe, D.G.: Recognising panoramas. In: ICCV ’03. Proceedings of the Ninth IEEE International Conference on Computer Vision, Washington, DC, USA, IEEE Computer Society, Los Alamitos. 1218 (2003) 6. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004) 7. Ward, G.J.: Fast, robust image registration for compositing high dynamic range photographs from hand-held exposures. Journal of graphics tools 8(2), 17–30 (2003)
Graph Similarity Using Interfering Quantum Walks David Emms, Edwin R. Hancock, and Richard C. Wilson Department of Computer Science University of York YO10 5DD, UK
Abstract. We consider how continuous-time quantum walks can be used for graph matching, both exact and inexact, and measuring graph similarity. Our approach is to simulate the quantum walk on the two graphs in parallel by using an auxiliary graph that incorporates both graphs. The auxiliary graph allows quantum interference to take place between the two walks. Modelling the resultant interference amplitudes, which result from the differences in the two walks, we calculate probabilities for matches between pairs of vertices from the graphs. Using the Hungarian algorithm on these probabilities we recover a mapping between the graphs. To calculate graph similarity, we combine these probabilities with edge consistency information to give a consistency measure. We analyse our approach experimentally using synthetic graphs.
1 Introduction
Quantum algorithms have attracted considerable attention in the theoretical computer science community primarily because of the considerable speed-up over classical algorithms they achieve [6,11]. However, quantum algorithms also have a richer structure than their classical counterparts since they use quantum rather than classical states as their basic representational unit. For example, the state of a quantum walk is complex-valued whereas states of a classical random walk take only positive real values. In addition, (quantum) interference is an important feature of quantum walks, and will play a central role in our algorithm. From a practical perspective, there have been a number of useful applications of random walks in, for example, the analysis of routing problems in network and circuit theory. Of more recent interest is the use of ideas from random walks to define the page-rank index for internet search engines such as Google [1]. In the pattern recognition community there have been several attempts to use random walks for graph matching. These include the work of Robles-Kelly and Hancock which has used both a standard spectral method and a more sophisticated one based on ideas from graph seriation to convert graphs to strings, so that string matching methods may be used [10]. Meila and Shi use a random walk based on pairwise similarities between image pixels to carry out clustering and thus segmentation of images [9]. Gori, Maggini and Sarti [5], on the other hand, have used ideas borrowed from page-rank to associate a spectral index with graph nodes and have then used standard subgraph isomorphism methods for matching the resulting attributed graphs. Quantum walks have been introduced as quantum counterparts of random walks [7] and possess a number of interesting properties not exhibited by classical random walks. The amplitudes of the paths of quantum walks have been used to define a matrix representation of graphs that is able to lift the cospectrality of certain classes of graphs that are typically hard to distinguish [2]. Also, quantum commute times have been used to successfully embed graphs in two dimensions rather than a submanifold of 2D space, as can occur when classical commute times are used [4]. In this paper we continue previous work which introduced a novel auxiliary graph structure for graph matching. We make use of the continuous-time quantum walk, a model that is significantly different in structure from the discrete-time quantum walk. The use of the continuous-time quantum walk has a number of advantages over its discrete relative. The walk can be designed to allow the walks to interfere continuously rather than at a final end stage. In addition, complex-valued states can be utilized for the walk. Our approach is to connect every vertex of one graph to every vertex of the other graph via 'auxiliary vertices'. A walk is then simulated on this new graph and the auxiliary vertices serve as sites on which the walks interfere. We model the resulting amplitudes on these interference sites to calculate probabilities for vertex-vertex matches between the graphs. Using the Hungarian algorithm [8] on these probabilities, we recover the mapping between the graphs. We also show how the probabilities can be combined with edge-connectivity constraints to construct a similarity measure. We carry out experiments on sets of synthetic graphs in order to evaluate our approach.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 823–831, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Graphs
Let $G = (V, E)$ be an unweighted graph with vertex set $V$ and edge set $E = \{\{u, v\} \mid u, v \in V,\ u \text{ adjacent to } v\}$. The graph has an adjacency matrix, $W$, with $W(u, v) = 1$ if $\{u, v\} \in E$, and 0 otherwise. Let $n = |V|$ be the total number of vertices in the graph. We define the degree matrix to be the matrix $D = \mathrm{diag}(d(1), d(2), \ldots, d(n))$ where $d(u) = |\{v \mid u \sim v\}|$ is the degree of vertex $u$. The Laplacian matrix, $L = W - D$, has elements
\[ L_{uv} = \begin{cases} 1 & \text{if } \{u, v\} \in E; \\ -d(u) & \text{if } u = v; \\ 0 & \text{otherwise.} \end{cases} \]
The Laplacian matrix is used to define the time evolution of the quantum walk on the graph. For a brief description of the classical random walk see [4].
2.1 Quantum Random Walk on a Graph
The state space for the continuous-time quantum random walk is the set of vertices, $V$. However, the state is described by a complex state vector which we write (using Dirac's notation) as $|\psi_t\rangle \in \mathbb{C}^{|V|}$. This can be written componentwise as $|\psi_t\rangle = \sum_{u \in V} a_u(t)|u\rangle$, where the magnitude of $a_u(t)$ gives the probability of the walk being at $u \in V$ at time $t$ according to the rule $P(X_t = u) = a_u a_u^*$, where $a_u^*$ is the complex conjugate of $a_u$. The axioms of probability give that $|a_u(t)| \in [0, 1]$ for all $u \in V$, $t \in \mathbb{R}^+$, and $\sum_{u \in V} a_u a_u^* = 1$. Transitions occur only between adjacent vertices and at a rate $\mu$. The evolution of the state vector is given by
\[ \frac{d}{dt}|\psi_t\rangle = -i\mu L|\psi_t\rangle . \]
Since the evolution of the probability vector of the walk at time $t$ depends on the state vector of the walk (not merely the probability vector), the quantum walk is not a Markov chain. However, given an initial state, $|\psi_0\rangle$, we get that $|\psi_t\rangle = e^{-i\mu L t}|\psi_0\rangle$.
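As a rough illustration of the evolution $|\psi_t\rangle = e^{-i\mu L t}|\psi_0\rangle$, the sketch below builds $L = W - D$ and applies the exponential through a truncated Taylor series. This is an assumption made to keep the example in plain Python; diagonalising $L$ or using a library matrix exponential would be the more usual route. The names `laplacian`, `evolve` and `probabilities` are hypothetical. For two vertices joined by one edge and a walk started on vertex 0, the probabilities work out to $\cos^2 t$ and $\sin^2 t$ for $\mu = 1$, which gives a simple check.

```python
import math

def laplacian(n, edges):
    """Build L = W - D for a simple undirected graph (each edge listed once)."""
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][v] = L[v][u] = 1.0
        L[u][u] -= 1.0
        L[v][v] -= 1.0
    return L

def evolve(L, psi0, t, mu=1.0, terms=60):
    """Approximate exp(-i*mu*L*t) applied to psi0 by a truncated Taylor series."""
    n = len(psi0)
    psi = [complex(a) for a in psi0]
    term = list(psi)
    for k in range(1, terms):
        # term_k = (-i * mu * t / k) * L * term_{k-1}
        term = [(-1j * mu * t / k) * sum(L[r][c] * term[c] for c in range(n))
                for r in range(n)]
        psi = [p + q for p, q in zip(psi, term)]
    return psi

def probabilities(psi):
    """P(X_t = u) = a_u * conj(a_u) for every vertex u."""
    return [(a * a.conjugate()).real for a in psi]

# Two vertices joined by one edge; the walk starts on vertex 0.
L = laplacian(2, [(0, 1)])
probs = probabilities(evolve(L, [1.0, 0.0], t=math.pi / 2))
```

At $t = \pi/2$ the walk has moved entirely to vertex 1, and the probabilities sum to one at every time because the evolution is unitary. The fixed number of series terms is only adequate for small graphs and short times.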
3 Algorithm
The algorithm can be broken down into a number of separate parts. An auxiliary graph is created from the two input graphs. A quantum walk is simulated on the auxiliary graph which produces a set of $|V_1||V_2|$ complex-valued interference amplitudes, each one corresponding to one of the possible matches between a pair of vertices from the two graphs. (Non-zero interference amplitudes arise from differences in the amplitudes of the walk between a pair of vertices from the two graphs.) The probabilities that each of these interference amplitudes corresponds to a true match are calculated. To carry out graph matching, the Hungarian method is applied to the reciprocals of these probabilities. Alternatively, the similarity measure is calculated using these probabilities and edge consistencies from the two graphs. The component parts of the algorithm are described in the following subsections.
3.1 Auxiliary Matching Structure
Given two graphs, $G$ and $H$, an auxiliary graph, $\Gamma(G, H)$, is created. The auxiliary graph is symmetric with respect to interchanging its two arguments (i.e. $\Gamma(G, H) \cong \Gamma(H, G)$), and its structure is shown in Figure 1. Let $G = (V_G, E_G)$ and $H = (V_H, E_H)$, then $\Gamma = (V_\Gamma, E_\Gamma)$ where the vertex and edge sets can be decomposed such that $V_\Gamma = V_G \cup V_H \cup V_A$ and $E_\Gamma = E_G \cup E_H \cup E_A$. The auxiliary vertices, $V_A$, serve to connect the two graphs and act as sites on which the interference takes place. The auxiliary edges connect each of the $|V_G||V_H|$ pairs of vertices from the two graphs by way of one of the auxiliary vertices. That is, $V_A = \{v_{\{g_i,h_j\}} \mid g_i \in V_G, h_j \in V_H\}$ and $E_A = \{\{g_i, v_{\{g_i,h_j\}}\}, \{h_j, v_{\{g_i,h_j\}}\} \mid g_i \in V_G, h_j \in V_H\}$. The vertices $g_i \in V_G$ and $h_j \in V_H$ are linked via an auxiliary vertex, denoted $v_{\{g_i,h_j\}}$. The auxiliary graph is similar to the association graph; however, information about the structure of the two graphs comes from incorporating the original graphs themselves rather than through the connections between the auxiliary vertices.
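The construction of $\Gamma(G, H)$ can be sketched as follows. The vertex numbering (vertices of $G$ first, then those of $H$, then one auxiliary vertex per pair) and the name `auxiliary_graph` are assumptions made for this illustration only.

```python
def auxiliary_graph(n_g, edges_g, n_h, edges_h):
    """Return (vertex count, edge list, index of the auxiliary vertex for each pair)."""
    # Copy E_G and shift the labels of H so that they follow those of G.
    edges = list(edges_g) + [(n_g + a, n_g + b) for a, b in edges_h]
    aux = {}
    next_id = n_g + n_h
    for g in range(n_g):
        for h in range(n_h):
            aux[(g, h)] = next_id
            edges.append((g, next_id))        # {g_i, v_{g_i,h_j}}
            edges.append((n_g + h, next_id))  # {h_j, v_{g_i,h_j}}
            next_id += 1
    return next_id, edges, aux

# G is a triangle, H is a path on three vertices.
n, edges, aux = auxiliary_graph(3, [(0, 1), (1, 2), (0, 2)], 3, [(0, 1), (1, 2)])
```

For this example $\Gamma$ has $3 + 3 + 9 = 15$ vertices and $3 + 2 + 18 = 23$ edges, and every auxiliary vertex has degree two.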
Fig. 1. The auxiliary graph, Γ(G, H), showing the vertices g1, g2, g3 ∈ VG and h1, h2, h3 ∈ VH connected by way of auxiliary vertices
Fig. 2. The probabilities associated with the interference for a particular vertex showing the time, T, of the first maximum in the interference amplitudes (axes: time vs. probability, ×10⁻⁵)
3.2 The Quantum Walk on the Auxiliary Graph
The state of the quantum walk on $\Gamma$ at time $t$ is given by $|\psi_t\rangle = \sum_{u \in V_\Gamma} a_u(t)|u\rangle$. The starting state has amplitudes
\[ a_u(0) = \begin{cases} \frac{d(u)}{C} & \text{if } u \in V_G; \\ -\frac{d(u)}{C} & \text{if } u \in V_H; \\ 0 & \text{otherwise,} \end{cases} \]
where $C = \sqrt{\sum_{u \in V_G \cup V_H} d(u)^2}$ is a normalisation constant. This starting state is used rather than a starting state that is constant and positive on the vertices of one graph and constant and negative on the vertices of the other, since such a state is a stationary state of the walk. The state of the walk at time $T$ is calculated, where $T$ is the time when the first maximum in probability is achieved by one of the interference amplitudes, as shown in Figure 2. If there is an isomorphism, $\rho : V_G \to V_H$, between the two graphs, then $a_{v_{\{g,h\}}}(t) = 0$ for all $t \in \mathbb{R}^+$ whenever $\rho(g) = h$. If, however, some noise has been applied, we still expect $a_g(t) \approx a_h(t)$, thus $a_{v_{\{g,h\}}}(t)$ will be close to zero in comparison to the other interference amplitudes. That this is the case is demonstrated in the following section, in which we model the amplitudes for 'matching' and 'non-matching' pairs of vertices for graphs subject to noise.
3.3 Modelling the Interference Amplitudes
In this subsection we provide a model for the amplitudes for ‘true’ and ‘false’ matches between vertices and use this model to calculate the probability that an observed amplitude corresponds to a true pairing of vertices from the two graphs.
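The amplitudes being modelled come from the walk of Sect. 3.2, which can be sketched end to end for two small graphs. Everything below is an illustrative assumption rather than the authors' code: the function names, the Taylor-series evolution, the vertex numbering, and the choice to take $d(u)$ as the degree in the original graph. For an isomorphic pair, the amplitudes on the auxiliary vertices of the true matches should vanish up to rounding.

```python
import math

def build_gamma(n_g, edges_g, n_h, edges_h):
    """Edge list of Gamma(G, H): vertices of G, then H, then one auxiliary per pair."""
    edges = list(edges_g) + [(n_g + a, n_g + b) for a, b in edges_h]
    aux = {}
    for g in range(n_g):
        for h in range(n_h):
            aux[(g, h)] = n_g + n_h + g * n_h + h
            edges += [(g, aux[(g, h)]), (n_g + h, aux[(g, h)])]
    return n_g + n_h + n_g * n_h, edges, aux

def laplacian(n, edges):
    """L = W - D for a simple undirected graph."""
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][v] = L[v][u] = 1.0
        L[u][u] -= 1.0
        L[v][v] -= 1.0
    return L

def evolve(L, psi0, t, mu=1.0, terms=80):
    """Truncated Taylor series for exp(-i*mu*L*t) applied to psi0."""
    n = len(psi0)
    psi = [complex(a) for a in psi0]
    term = list(psi)
    for k in range(1, terms):
        term = [(-1j * mu * t / k) * sum(L[r][c] * term[c] for c in range(n))
                for r in range(n)]
        psi = [p + q for p, q in zip(psi, term)]
    return psi

def degrees(n, edges):
    d = [0] * n
    for u, v in edges:
        d[u] += 1
        d[v] += 1
    return d

def interference_amplitudes(n_g, edges_g, n_h, edges_h, t, mu=1.0):
    """Return {(g, h): amplitude on the auxiliary vertex of that pair} at time t."""
    n, edges, aux = build_gamma(n_g, edges_g, n_h, edges_h)
    d_g, d_h = degrees(n_g, edges_g), degrees(n_h, edges_h)
    c = math.sqrt(sum(d * d for d in d_g + d_h))  # normalisation constant C
    psi0 = [d / c for d in d_g] + [-d / c for d in d_h] + [0.0] * (n_g * n_h)
    psi = evolve(laplacian(n, edges), psi0, t, mu)
    return {pair: psi[index] for pair, index in aux.items()}

# G: a triangle 1-2-3 with a tail vertex 0; H: the same graph relabelled by RHO.
EDGES_G = [(0, 1), (1, 2), (2, 3), (1, 3)]
RHO = [2, 0, 3, 1]
EDGES_H = [(RHO[u], RHO[v]) for u, v in EDGES_G]
amps = interference_amplitudes(4, EDGES_G, 4, EDGES_H, t=0.5)
matched = [abs(amps[(g, RHO[g])]) for g in range(4)]  # all zero up to rounding
```

The zeros follow from a symmetry argument: swapping each $g$ with $\rho(g)$ maps $\Gamma$ onto itself and flips the sign of the starting state, so the amplitude on a vertex fixed by the swap must equal its own negative. Pairs whose vertices have different degrees acquire a non-zero amplitude as soon as $t > 0$.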
To provide related graphs, an initial graph, $G$, with $n$ vertices is generated by randomly connecting each pair of vertices in the graph with probability $p$. A related graph, $H$, is generated by adding and removing at random $e$ edges from $G$. Figure 3 shows the distribution of the interference amplitudes for such a pair of graphs. We see that the false matches are the most widely distributed and are in a banded structure. The true matches which are not directly affected by noise all appear in the central band and the true matches which have been directly affected by noise lie close to the central band. Thus we can immediately assign a probability of zero to the matches corresponding to the amplitudes outside the central band and concentrate only on amplitudes in the central band. Also, as the amount of noise increases, fewer true matches remain in the central band and, as will become clear, this will improve the performance of the measure. The histograms representing the distribution of the amplitudes for true and false matches in the central band are shown in Figure 4. Let $a = x + iy$ be a particular amplitude and let $X$ and $Y$ be the sets of real and imaginary components for some set of interference amplitudes (we will write $X_T$ and $X_F$ etc. if we need to distinguish between the sets for true and false matches). Let $\mu_Z$ and $\sigma_Z$ be the mean and standard deviation of a set $Z$. To calculate the probability that a particular amplitude corresponds to a true match we model the real and imaginary parts of the amplitudes representing true matches as originating from a bivariate Gaussian distribution with zero mean. Thus
\[ P(x, y \mid t) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left(-\frac{z}{2(1-\rho^2)}\right), \quad \text{where } z = \frac{x^2}{\sigma_X^2} + \frac{y^2}{\sigma_Y^2} - \frac{2\rho x y}{\sigma_X\sigma_Y} \]
and where $\rho = \frac{\mu_{XY} - \mu_X\mu_Y}{\sigma_X\sigma_Y}$ is the correlation of the real and imaginary parts. The same model is used for the probability distribution, $P(x, y \mid f)$, for amplitudes resulting from false matches but with different values of the standard deviations and correlation. It can be seen from Figure 4 that $\sigma_{X_T}$ and $\sigma_{X_F}$ differ substantially, as do $\sigma_{Y_T}$ and $\sigma_{Y_F}$. Clearly, in practice, we cannot distinguish between the amplitudes for true and false matches in order to determine the standard deviations, thus they must be
Fig. 3. Scatter plot of interference amplitudes (axes: real parts vs. imaginary parts; legend: false matches, true matches with no edges changed, true matches with edges changed)
Fig. 4. Histogram of the real part of the interference amplitudes for true and false matches (axes: real part vs. frequency)
estimated from measurements of the distribution of the two sets together. The variance and fourth central moment of the observed distributions of the amplitudes could be used to obtain estimates for all of the standard deviations [3]; however, the differences in magnitudes lead to inaccurate estimates. Furthermore, we have found that the consistency measure which we will outline below is robust with respect to the estimates used, and so for these, and the correlations, we set $10\sigma_{Rf} = \sigma_{Rt}$, $10\sigma_{If} = \sigma_{It}$, $\rho_t = 0.16$ and $\rho_f = -0.75$ based on our analysis. The difference between $\rho_t$ and $\rho_f$, which is apparent in Figure 3, makes the bivariate Gaussian distributions particularly powerful at modelling the probabilities for the interference amplitudes for true and false matches. Given a particular amplitude $\alpha = x + iy$ and the conditional probabilities of observing this amplitude if the match is true, $P(x, y \mid t)$, and if it is false, $P(x, y \mid f)$, Bayes' rule can be used to calculate the probability that a particular amplitude corresponds to a true match. Let $N$ be the number of interference amplitudes in the central band. We have that $p(\alpha) = p(\alpha \mid t)p(t) + p(\alpha \mid f)p(f)$ where $p(t) = \frac{|V|}{N}$ and $p(f) = 1 - \frac{|V|}{N}$. Note that this slightly overestimates $p(t)$ and underestimates $p(f)$, as there are actually fewer than $|V|$ true matches in the central band, but this does not greatly affect the similarity measure. Thus, the probability that some interference amplitude corresponds to a true match is given by
\[ p(t \mid \alpha) = \frac{p(\alpha \mid t)\,p(t)}{p(\alpha)} . \]
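The Bayes step above can be sketched directly from the zero-mean bivariate Gaussian model. In this sketch the function names and the numerical scales `SIGMA_RF` and `SIGMA_IF` are hypothetical; only the factor of 10 between the standard deviations and the correlations 0.16 and −0.75 are taken from the text.

```python
import math

def gaussian_density(x, y, sigma_x, sigma_y, rho):
    """Zero-mean bivariate Gaussian density, used for P(x, y | t) and P(x, y | f)."""
    z = (x / sigma_x) ** 2 + (y / sigma_y) ** 2 - 2 * rho * x * y / (sigma_x * sigma_y)
    norm = 2 * math.pi * sigma_x * sigma_y * math.sqrt(1 - rho ** 2)
    return math.exp(-z / (2 * (1 - rho ** 2))) / norm

def posterior_true(x, y, true_params, false_params, n_vertices, n_central):
    """p(t | alpha) by Bayes' rule with priors p(t) = |V|/N and p(f) = 1 - |V|/N."""
    p_t = n_vertices / n_central
    p_f = 1.0 - p_t
    like_t = gaussian_density(x, y, *true_params)
    like_f = gaussian_density(x, y, *false_params)
    evidence = like_t * p_t + like_f * p_f  # p(alpha)
    return like_t * p_t / evidence

# Hypothetical scales; each tuple is (sigma of real part, sigma of imaginary part, rho).
SIGMA_RF, SIGMA_IF = 1e-5, 2e-5
TRUE_PARAMS = (10 * SIGMA_RF, 10 * SIGMA_IF, 0.16)
FALSE_PARAMS = (SIGMA_RF, SIGMA_IF, -0.75)
p = posterior_true(2e-5, 1e-5, TRUE_PARAMS, FALSE_PARAMS, n_vertices=30, n_central=200)
```

When both hypotheses are given the same parameters, the posterior reduces to the prior $|V|/N$, which is a convenient sanity check. Far in the tails both densities can underflow to zero, so a practical version would compare log-densities instead.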
4 Correspondence Measure
We combine the probabilities for vertex matches with structural information in order to give a 'correspondence measure', which quantifies the quality of the match between the graphs. Consider two pairs of vertices $(g_1, h_1)$ and $(g_2, h_2)$, where $g_1, g_2 \in V_G$ and $h_1, h_2 \in V_H$, corresponding to two possible matches with probabilities $p(g_1, h_1)$ and $p(g_2, h_2)$ respectively. The quantity $p(g_1, h_1)p(g_2, h_2)$ will in general be larger when both $(g_1, h_1)$ and $(g_2, h_2)$ are correct matches than if either, or neither, of them is. Thus we can define a correspondence measure by
\[ M_E(G, H) = \sum_{\{g_1,g_2\} \in E_G} \sum_{\{h_1,h_2\} \in E_H} p(g_1, h_1)\,p(g_2, h_2). \]
Note that this sum need only be carried out over the sets of interference amplitudes with a non-zero probability that they are correct matches and can thus be calculated in time $O(N^2)$.
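A minimal sketch of the double sum is given below. Since the edges are unordered, the sketch counts both ways of pairing the endpoints of a $G$ edge with those of an $H$ edge; this is an assumption about how the sum is meant to be read, and the name `correspondence_measure` is hypothetical. The probabilities are passed as a table `p[g][h]`.

```python
def correspondence_measure(edges_g, edges_h, p):
    """Sum p(g1, h1) * p(g2, h2) over all pairs of edges, in both endpoint pairings."""
    total = 0.0
    for g1, g2 in edges_g:
        for h1, h2 in edges_h:
            total += p[g1][h1] * p[g2][h2]  # g1 -> h1, g2 -> h2
            total += p[g1][h2] * p[g2][h1]  # g1 -> h2, g2 -> h1
    return total

# Perfectly confident identity matching on three vertices.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
PATH = [(0, 1), (1, 2)]
same_graph = correspondence_measure(PATH, PATH, IDENTITY)
one_edge_moved = correspondence_measure(PATH, [(0, 1), (0, 2)], IDENTITY)
```

With a perfectly confident matching the measure simply counts the edges that the two graphs share under that matching, so it drops from 2 to 1 when one edge of the path is moved.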
5 Experiments
We divide our experimental section into two parts. In the first part we evaluate the proposed method for graph matching and in the second part we evaluate our consistency measure. We randomly generate graphs using the method described in Section 3.3 with n between 10 and 58 and p between 0.3 and 0.7. We then carry out a random permutation of the vertices and run the algorithm with the permuted and unpermuted graphs as the two inputs to try to recover the isomorphism. We assign unit cost for matching a pair of vertices with zero interference amplitude and infinite cost to those with non-zero interference amplitudes. Table 1 gives the proportion of false matches which are assigned unit matching cost and the proportion of complete correct graph matches. For graphs with few vertices a large proportion of false matches between vertices have zero interference amplitudes; however, this proportion decreases rapidly as the number of vertices increases. Moreover, we are always able to recover the isomorphism despite this small proportion of mis-identified false matches.
Table 1. The number of false matches with zero interference amplitudes and the rate for successful reconstruction of the isomorphism between the two graphs
Vertices   Mis-identified false matches   Successful graph match rate
10         0.316                          1
12         0.168                          1
14         0.064                          1
16         0.02                           1
18         0.016                          1
20         0                              1
22         0                              1
24         0.004                          1
26         0                              1
28         0                              1
30         0.008                          1
32         0.116                          1
34         0.08                           1
36         0.06                           1
38         0.028                          1
40         0.06                           1
42         0                              1
44-58      0                              1
We now consider the performance of the algorithm in the presence of noise. Given a graph obtained as before, we permute the vertices and then add and remove random edges in order to create graphs with 1%, 2%, 3%, 5% and 10% noise. Figure 5 shows the performance of the algorithm as we increase the number of vertices and the noise. As before, we are always able to recover the permutation when there is no noise. In the presence of noise, even 10%, we are still able to recover the correct permutation in almost all cases and this recovery rate even shows a slight improvement as the number of vertices increases. We now provide an experimental evaluation of the performance of the similarity measure. Again we randomly generate graphs, this time with edit distance from the original of 0 to 20 (approximately 9% noise). We calculate the similarity measure between the original graph and each of the related graphs. We repeat this whole process for 100 sets of 21 graphs derived from randomly generated graphs. Figure 6 shows the mean similarity measure as a function of edit distance for this set of graphs and also the relative deviation. The mean of the similarity measure decreases monotonically as edit distance increases. Furthermore, at larger values of edit distance the rate of change of the similarity measure decreases; such behaviour could be useful in the field of robust statistics. At smaller edit distances, however, the similarity measure is sensitive to small changes in edit distance and its relative deviation is small, providing a good measure of graph similarity.
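Test data of the kind described above can be generated along the following lines. This is a sketch under assumptions rather than the experimental code: the helper names are hypothetical, noise is modelled by flipping `e` distinct vertex pairs (so the edit distance to the original is exactly `e`), and the seed is arbitrary.

```python
import random
from itertools import combinations

def random_graph(n, p, rng):
    """Connect each pair of vertices independently with probability p."""
    return {pair for pair in combinations(range(n), 2) if rng.random() < p}

def perturb(edges, n, e, rng):
    """Flip e distinct vertex pairs: an existing edge is removed, a missing one is added."""
    flipped = set(rng.sample(list(combinations(range(n), 2)), e))
    return edges ^ flipped

def permute(edges, perm):
    """Relabel vertex u as perm[u], keeping each edge as a sorted pair."""
    return {tuple(sorted((perm[u], perm[v]))) for u, v in edges}

rng = random.Random(0)
g = random_graph(30, 0.5, rng)
noisy = perturb(g, 30, 5, rng)
perm = list(range(30))
rng.shuffle(perm)
h = permute(noisy, perm)  # the graph handed to the matcher alongside g
```

A matcher is then scored by whether the mapping it returns equals `perm`. Storing edges as sorted pairs in a set keeps the symmetric difference `g ^ noisy` meaningful as an edit distance.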
Vertices   Noise: 0%   1%      2%      3%      5%      10%
30         1           0.982   0.972   0.96    0.953   0.95
32         1           0.987   0.973   0.968   0.963   0.954
34         1           0.991   0.974   0.97    0.963   0.957
36         1           0.991   0.979   0.973   0.963   0.952
38         1           0.98    0.971   0.965   0.962   0.957
40         1           0.99    0.978   0.97    0.964   0.96
42         1           0.985   0.973   0.968   0.968   0.959
44         1           0.986   0.976   0.973   0.964   0.962
46         1           0.98    0.976   0.973   0.967   0.963
48         1           0.984   0.973   0.97    0.968   0.964
50         1           0.987   0.977   0.975   0.972   0.967
52         1           0.98    0.978   0.975   0.97    0.966
54         1           0.986   0.979   0.974   0.971   0.969
56         1           0.983   0.976   0.975   0.971   0.969
Fig. 5. The rate for successfully recovering the permutation applied to the graph as a function of noise and number of vertices
Fig. 6. The similarity measure as a function of edit distance (axes: edit distance vs. similarity measure)
6 Conclusion
In this paper we have looked at one of the ways in which the richer structure inherent in quantum processes can be utilized classically. We have described an auxiliary graph that can be used for the purpose of graph matching. We simulated a continuous-time quantum walk on this auxiliary structure, which allows the complex-valued amplitudes on the two graphs being compared to interfere. The differences in the amplitudes on the two graphs manifest themselves as non-zero interference amplitudes. By modelling these amplitudes using bivariate Gaussian distributions, the real and imaginary parts of the amplitudes, as well as the relationship between them, are used to calculate probabilities for matches between the graphs. We have shown that these probabilities can be used to recover the mapping between a pair of graphs even in the presence of significant noise, and also that the performance is in no way reduced as the number of vertices increases. These probabilities can also be used to calculate a similarity measure using consistencies in vertex connectivity.
References
1. Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30(1–7), 107–117 (1998)
2. Emms, D., Severini, S., Wilson, R.C., Hancock, E.: Coined quantum walks lift the co-spectrality of graphs and trees. In: Rangarajan, A., Vemuri, B., Yuille, A.L. (eds.) EMMCVPR 2005. LNCS, vol. 3757, pp. 332–345. Springer, Heidelberg (2005)
3. Emms, D., Wilson, R., Hancock, E.: Graph matching using interference of coined quantum walks. In: IEEE 17th International Conference on Pattern Recognition (August 2006)
4. Emms, D., Wilson, R., Hancock, E.: Graph embedding using quantum commute times. In: GbR (2007)
5. Gori, M., Maggini, M., Sarti, L.: Graph matching using random walks. In: IEEE 17th ICPR (August 2004)
6. Grover, L.: A fast quantum mechanical algorithm for database search. In: STOC ’96. Proc 28th ACM Theory of computing, pp. 212–219. ACM Press, New York (1996)
7. Kempe, J.: Quantum random walks – an introductory overview. Contemporary Physics 44(4), 307–327 (2003)
8. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2, 83–97 (1955)
9. Meila, M., Shi, J.: A random walks view of spectral segmentation (2001)
10. Robles-Kelly, A., Hancock, E.: Graph edit distance from spectral seriation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 365–378 (2005)
11. Shor, P.W.: Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Comput. 26, 1484–1509 (1997)
Visual Speech Recognition Using Motion Features and Hidden Markov Models
Wai Chee Yau1, Dinesh Kant Kumar1, and Hans Weghorn2
1 School of Electrical and Computer Engineering, RMIT University, GPO Box 2476V, Melbourne, Victoria 3001, Australia
2 Information Technology, BA-University of Cooperative Education, Stuttgart, Germany
[email protected]
Abstract. This paper presents a novel visual speech recognition approach based on motion segmentation and hidden Markov models (HMM). The proposed method identifies utterances from mouth video, without evaluating voice signals. The facial movements in the video data are represented using 2D spatial-temporal templates (STT). The proposed technique combines the discrete stationary wavelet transform (SWT) and Zernike moments to extract rotation-invariant features from the STTs. HMMs are used as the speech classifier to model English phonemes. The preliminary results demonstrate that the proposed technique is suitable for phoneme classification with high accuracy.
Keywords: visual speech recognition, hidden Markov models, Zernike moments.
1
Introduction
Speech-based systems are emerging as attractive interfaces that give users the flexibility to control machines using speech. In spite of the advancements in speech technology, speech recognition systems are not widely used in mainstream human computer interfaces (HCI). The difficulty with audio-only speech recognition systems is their sensitivity to changes in acoustic conditions. The performance of such systems degrades when the acoustic signal strength is low, or in situations with high ambient noise levels. Possible methods to overcome this limitation include visual sensing [11,14] and recording of facial muscle activity [1]. Vision-based techniques are the more desirable option because they are non-intrusive. Research on making audio-visual speech recognition (AVSR) systems more robust and able to recognize complex speech patterns has been reported [10,5]. While AVSR systems are suitable for applications such as telephony in noisy environments, such systems are not useful when it is essential to maintain silence. Hence the need arises for visual-only, voice-less communication systems, also known as visual speech recognition (VSR) systems.
This research proposes to use video data related to facial movements for VSR applications. The possible advantages of a VSR system are that it (i) is not affected by audio noise, (ii) is not affected by changes in acoustic conditions, and (iii) does not require the user to make a sound.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 832–839, 2007. © Springer-Verlag Berlin Heidelberg 2007
Visual features proposed in the literature can be categorized into shape-based, pixel-based and motion-based features. The shape-based features rely on the shape of the mouth. The first VSR system was developed by Petajan [9] using shape-based features such as the height and width of the mouth. Researchers have reported on the use of artificial markers on the speaker's face to extract lip contours [6]. The use of artificial markers is not suitable for practical speech-controlled applications. VSR systems that use pixel-based features assume that the pixel values around the mouth area contain salient speech information [8,11]. Pixel-based and shape-based features are extracted from static frames and can be viewed as static features. Motion-based features are features that directly utilize the dynamics of speech. Few researchers have focused on using motion-based features for VSR. Nevertheless, in a comparison of static and motion features, the dynamic features are reported to be the most discriminative [4].
This paper proposes a novel VSR technique based on motion features extracted using spatial-temporal templates (STT). One of the advantages of the proposed technique is that it does not require the use of artificial markers on the speaker's face. This paper proposes a system where the camera is attached in place of the microphone to commonly available headsets. Potamianos et al. [11] have demonstrated that mouth videos captured from cameras attached to a wearable headset produce better results than full-face videos. Another advantage of this setup is that the region of interest no longer needs to be identified, which reduces the computation required.
2
Theory
2.1
Visual Speech Model
This paper models visual speech based on visemes. Visemes are the atomic units of visual movements associated with phonemes (the basic units of speech sound). Visemes can be concatenated to form words and sentences, thus providing the flexibility to expand the vocabulary of the system. The pronunciation of different speech sounds (such as /p/ and /b/) may be associated with identical visible facial movements. Thus, each viseme may correspond to more than one phoneme, resulting in a many-to-one mapping of phonemes to visemes. This paper adopts a viseme model established for facial animation applications by MPEG-4, an international standard for audiovisual object-based video representation. Based on this model, the English phonemes can be mapped to 14 visemes as shown in Table 1.
Table 1. Viseme model of the MPEG-4 standard for English phonemes
Viseme number   Corresponding phonemes   Vowel or consonant   Example words
1               p, b, m                  consonant            put, bed, me
2               f, v                     consonant            far, voice
3               T, D                     consonant            think, that
4               t, d                     consonant            tick, door
5               k, g                     consonant            kick, gate
6               tS, dZ, S                consonant            chair, join, she
7               s, z                     consonant            sit, zeal
8               n, l                     consonant            need, lead
9               r                        consonant            read
10              A:                       vowel                car
11              e                        vowel                bed
12              I                        vowel                tip
13              Q                        vowel                top
14              U                        vowel                book
2.2
Motion Segmentation
The proposed technique performs motion segmentation on the video data to generate spatial-temporal templates (STT). STTs are grayscale images that show where and when facial movements occur in the video [2,14]. The STT is generated using an accumulative image difference approach. The facial movement is segmented by detecting the changes between consecutive frames. Intensity values of successive frames of the video are subtracted to generate the differences of frames (DOFs). The DOFs are converted to binary images, represented as the function B(x, y), by thresholding the DOFs to obtain a change or no-change classification. B(x, y) is assigned a pixel value of 1 at spatial coordinates where the intensity values of two consecutive frames differ by more than a threshold value. The intensity value of the STT at pixel location (x, y) of the t-th frame is defined by
$STT_t(x, y) = \max \bigcup_{t=1}^{N-1} B_t(x, y) \times t$   (1)
where N is the total number of frames of the mouth video and B_t(x, y) represents the binarised version of the DOF of frame t. In Eq. 1, B_t(x, y) is multiplied with a linear ramp of time to implicitly encode the temporal information of motion into the STT. Computing the STT values for all the pixel coordinates (x, y) of the image sequence using Eq. 1 produces a grayscale image (STT) where the brightness of the pixels indicates the history of facial movements in the image sequence. Figure 1 illustrates the STTs of the fourteen visemes used in the experiments. The motivation for using STT to segment the facial movements is its ability to remove static elements and preserve the short-duration facial movements in the video data. Also, STT is insensitive to the speaker's skin color due to the image subtraction process. This paper suggests a model to approximate the speed variations of speech by normalizing the overall duration of the utterance. The paper applies the discrete stationary wavelet transform (SWT) to the STT to obtain a transform representation that is insensitive to small variations between different videos of the same utterance. A 2-D SWT at level 1 is applied to the STT. The SWT decomposition of the STT generates four sub-images. The approximation (LL) sub-image is a smoothed version of the STT and is used to represent the STT.
Fig. 1. Spatial-temporal templates (STT) of fourteen visemes based on the viseme model of MPEG-4 standard
2.3
Feature Extraction
The proposed technique adopts Zernike moments as features to represent the SWT LL image. Earlier work of the authors compared three features (Zernike moments, Hu moments and geometric moments) and demonstrated that Zernike moments have the best image representation ability and are least sensitive to rotational, translational and scale changes of the mouth [14]. Zernike moments are computed by projecting the image function f(x, y) onto the orthogonal Zernike polynomial V_{nl} of order n with repetition l, defined within the unit circle (i.e., x^2 + y^2 ≤ 1). The Zernike moment Z_{nl} of order n and repetition l is given by
$Z_{nl} = \frac{n+1}{\pi} \int_0^{2\pi} \int_0^{\infty} [V_{nl}(\rho, \theta)]^{*} f(\rho, \theta) \, d\rho \, d\theta$   (2)
where |l| ≤ n and (n − |l|) is even. f(ρ, θ) is the intensity distribution of the approximation image of the STT mapped to the unit circle, with radius ρ and angle θ such that x = ρ cos θ and y = ρ sin θ. The main advantage of Zernike moments is the simple rotational property of the features [7]. A change in the orientation of the mouth in the image results in a phase shift of the Zernike moments of the rotated image as compared to those of the non-rotated image [14]. Thus, the absolute value of the Zernike moments is invariant to rotation of the image patterns. This paper uses the absolute values of the Zernike moments as rotation-invariant features. 49 Zernike moments, comprising the 0th up to the 12th order moments, are extracted from the approximation image of the STT.
2.4
Classification Using Hidden Markov Models
An HMM is a finite state network based on stochastic processes. The strength of the left-right HMM lies in its ability to statistically model time-varying speech features [12]. The left-right HMM is a commonly used classifier in speech recognition [10]. This paper adopts single-stream, continuous HMMs to classify the motion features. Continuous HMMs are employed, as opposed to discrete HMMs, to avoid the loss of information that occurs in the quantization of the features. The motion features are assumed to be Gaussian distributed. Each viseme is modelled using a left-right HMM with three states, one Gaussian mixture component per state and a diagonal covariance matrix. During training of the HMMs, the unknown HMM parameters, consisting of the transition probabilities and observation probabilities, are estimated iteratively from the training samples using the Baum-Welch algorithm. In the classification stage, the unknown motion features are presented to the 14 trained HMMs and the features are assigned to the viseme class whose HMM produces the output with the highest likelihood.
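The classification rule above (score an observation sequence under each viseme HMM and pick the most likely one) can be sketched with a log-domain forward algorithm. This is a minimal sketch under stated assumptions, not the authors' implementation: Baum-Welch training is omitted, the model parameters are hand-picked toy values, and the text does not say how the features are arranged into an observation sequence, so `obs` is simply a (frames × dimensions) array. All function names are hypothetical.

```python
import numpy as np

def log_gauss_diag(obs, means, variances):
    """Per-frame, per-state log density of a diagonal Gaussian; returns (T, S)."""
    diff = obs[:, None, :] - means[None, :, :]
    log_norm = np.log(2 * np.pi * variances).sum(axis=1)
    return -0.5 * (log_norm[None, :] + (diff ** 2 / variances[None, :, :]).sum(axis=2))

def forward_loglik(obs, log_pi, log_a, means, variances):
    """Log-likelihood of obs under one HMM (forward algorithm in the log domain)."""
    log_b = log_gauss_diag(obs, means, variances)
    alpha = log_pi + log_b[0]
    for t in range(1, len(obs)):
        # Sum over predecessor states, then absorb the emission of frame t.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_a, axis=0) + log_b[t]
    return np.logaddexp.reduce(alpha)

def left_right_model(means, stay=0.6):
    """Toy left-right HMM: each state either stays or moves to the next state."""
    n_states = len(means)
    a = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        a[i, i], a[i, i + 1] = stay, 1.0 - stay
    a[-1, -1] = 1.0
    pi = np.zeros(n_states)
    pi[0] = 1.0                                  # always start in the first state
    with np.errstate(divide="ignore"):           # log(0) = -inf is intended
        return np.log(pi), np.log(a), means, np.ones_like(means)

def classify(obs, models):
    """Return the name of the model with the highest log-likelihood."""
    scores = {name: forward_loglik(obs, *params) for name, params in models.items()}
    return max(scores, key=scores.get)

means_a = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])   # three states, as in the text
models = {"a": left_right_model(means_a), "b": left_right_model(means_a + 5.0)}
obs = np.repeat(means_a, 2, axis=0)                         # toy six-frame sequence
label = classify(obs, models)
```

In the setting of the paper there would be 14 such models, one per viseme, with parameters estimated by Baum-Welch rather than fixed by hand.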
3
Methodology
Experiments were conducted to test the proposed visual speech recognition technique. Fourteen visemes from the viseme model of the MPEG-4 standard (highlighted in bold font in Table 1) were evaluated in the experiments. Each viseme represents one pattern of facial movement. The speaker pronounced each vowel or consonant in isolation. Far fewer audio-visual speech databases are publicly available than audio-only speech databases. Most of the audio-visual speech databases are recorded in an ideal studio environment with controlled lighting. To evaluate the performance of the approach in a real-world environment, video data was recorded using an inexpensive web camera in a typical office environment. This was done with a view to a practical voiceless communication system using low-resolution video recordings. The camera focused on the mouth region of the speaker and was kept stationary throughout the experiment. The following factors were kept the same during the recording of the videos: window size and view angle of the camera, background and illumination. 280 video files (240 x 240 pixels) were recorded and stored as true color (.AVI) files. One STT was generated from each AVI file. An example of the STT for each viseme is shown in Figure 1. SWT at level 1 using the Haar wavelet was applied to the STTs and the approximation image (LL) was used for analysis. 49 Zernike moments were used as features to represent the SWT approximation image of the STT. The Zernike moment features were used to train the hidden Markov model (HMM) classifier. Each viseme is modelled using one HMM. A leave-one-out scheme was used in the experiment: the HMMs were trained with 266 training samples and evaluated on the 14 remaining samples (one sample from each viseme group). This process was repeated 20 times with different sets of training and testing data. The average recognition rates of the HMMs over the 20 repetitions were computed. Figure 2 shows a block diagram of the proposed visual speech recognition technique.
Fig. 2. Block diagram of the proposed technique
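The feature pipeline described above (STT from thresholded frame differences, level-1 Haar SWT approximation, magnitudes of Zernike moments up to order 12) can be sketched as follows. This is a sketch under several assumptions, none of which come from the text: the threshold value, periodic border handling in the Haar step, square images, and the standard discrete form of the Zernike projection (a sum over pixels inside the unit disc, including the pixel area). The function names and the random test frames are hypothetical.

```python
import numpy as np
from math import factorial

def spatial_temporal_template(frames, threshold):
    """Eq. 1: STT(x, y) = max over t = 1..N-1 of B_t(x, y) * t."""
    frames = np.asarray(frames, dtype=float)
    dof = np.abs(np.diff(frames, axis=0))              # difference of frames
    binary = (dof > threshold).astype(float)           # B_t(x, y)
    ramp = np.arange(1, frames.shape[0])[:, None, None]
    return (binary * ramp).max(axis=0)

def haar_swt_ll(image):
    """Level-1 undecimated Haar approximation (LL), periodic borders assumed."""
    rows = (image + np.roll(image, -1, axis=0)) / np.sqrt(2)
    return (rows + np.roll(rows, -1, axis=1)) / np.sqrt(2)

def zernike_magnitudes(image, max_order=12):
    """|Z_nl| for 0 <= l <= n <= max_order with n - l even (square image assumed)."""
    size = image.shape[0]
    coords = (2 * np.arange(size) + 1 - size) / size   # pixel centres in (-1, 1)
    x, y = np.meshgrid(coords, coords)
    rho, theta = np.hypot(x, y), np.arctan2(y, x)
    inside = rho <= 1.0                                # keep the unit disc only
    area = (2.0 / size) ** 2                           # area of one pixel
    feats = []
    for n in range(max_order + 1):
        for l in range(n % 2, n + 1, 2):
            radial = np.zeros_like(rho)
            for s in range((n - l) // 2 + 1):          # radial polynomial R_nl
                coef = ((-1) ** s * factorial(n - s)
                        / (factorial(s) * factorial((n + l) // 2 - s)
                           * factorial((n - l) // 2 - s)))
                radial += coef * rho ** (n - 2 * s)
            basis_conj = radial * np.exp(-1j * l * theta)   # conjugate of V_nl
            z = (n + 1) / np.pi * np.sum(image[inside] * basis_conj[inside]) * area
            feats.append(abs(z))
    return np.array(feats)

rng = np.random.default_rng(0)
frames = rng.integers(0, 256, size=(10, 32, 32))       # stand-in for a mouth video
stt = spatial_temporal_template(frames, threshold=30)
features = zernike_magnitudes(haar_swt_ll(stt))
```

Enumerating the pairs (n, l) with 0 ≤ l ≤ n ≤ 12 and n − l even gives 1+1+2+2+3+3+4+4+5+5+6+6+7 = 49 features, which matches the count quoted in the text.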
4
Results and Discussion
The classification accuracies of the HMMs are tabulated in Table 2. The average recognition rate of the proposed visual speech recognition system is 88.2%. The results indicate that the proposed technique based on motion features is suitable for viseme recognition. Based on the results, the proposed technique is highly accurate for vowel classification using the motion features. An average success rate of 97% is achieved in recognizing vowels. The classification accuracies of consonants are slightly lower due to the poor recognition rate of one of the consonants, /n/. One possible reason for the misclassifications of /n/ is the inability of vision-based techniques to capture the movements of occluded speech articulators. The movement of the tongue within the mouth cavity is not visible (it is occluded by the teeth) in the video data during the pronunciation of /n/. Thus, the STT of /n/ does not contain information on the tongue movement, which may have resulted in a high error rate for /n/. The same visual-only speech recognition task (based on the 14 visemes of the MPEG-4 standard) is tested in [3] and a similar error rate is obtained using shape-based features extracted from static images. Nevertheless, the errors made in the authors' proposed system using motion features are different from the errors reported in [3], which uses static features. This indicates that complementary information exists in static and dynamic features of visual speech. For example, our proposed system has a much lower error rate in identifying the visemes /m/, /t/ and /r/ by using the facial movement features as compared to the results in [3]. This shows that motion features are better at representing phones which involve distinct facial movements (such as the bilabial movements of /m/). The static features of [3] yield better results in classifying visemes with ambiguous or occluded motion of the speech articulators, such as /n/. The results demonstrate the feasibility of a computationally inexpensive system that can easily be implemented on a DSP chip for voice-less communication applications. The proposed system has been designed for specific applications such as control of machines using simple commands consisting of discrete utterances, without requiring the user to make a sound.
Table 2. Recognition rates of the proposed system based on the viseme model of the MPEG-4 standard
Viseme   Recognition rate (%)
m        95
v        90
T        70
t        80
g        85
tS       95
s        95
n        40
r        100
A:       100
e        100
I        95
Q        95
U        95
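The summary figures quoted in the discussion can be checked directly against the per-viseme rates of Table 2. The short sketch below does only that arithmetic; the split into vowels and consonants follows Table 1, and the variable names are hypothetical.

```python
# Per-viseme recognition rates (%) as listed in Table 2.
rates = {"m": 95, "v": 90, "T": 70, "t": 80, "g": 85, "tS": 95, "s": 95,
         "n": 40, "r": 100, "A:": 100, "e": 100, "I": 95, "Q": 95, "U": 95}
vowels = ["A:", "e", "I", "Q", "U"]            # visemes 10 to 14 in Table 1

overall = sum(rates.values()) / len(rates)
vowel_mean = sum(rates[v] for v in vowels) / len(vowels)
consonants = [v for v in rates if v not in vowels]
consonant_mean = sum(rates[v] for v in consonants) / len(consonants)
consonant_mean_without_n = (sum(rates[v] for v in consonants if v != "n")
                            / (len(consonants) - 1))
```

The overall mean is 1235/14 ≈ 88.2% and the vowel mean is 97%, in agreement with the text. The consonant mean is 750/9 ≈ 83.3%, and it rises to 710/8 = 88.75% once /n/ is excluded, which illustrates how strongly that single viseme affects the consonant average.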
5
Conclusion
This paper reports on a technique to identify utterances from videos without using sound signals. A high recognition rate is obtained in classifying isolated English phonemes using hidden Markov models (HMM). The results suggest that the proposed technique, based on the different patterns of facial movements, is suitable for identifying phonemes. For future work, the authors intend to test the method on a larger vocabulary set covering words. Furthermore, the investigation shall be extended from an English-speaking environment to other languages such as German. Potential applications of such a system are driving computerized machinery in noisy environments and voice-less communication.
References
1. Arjunan, S.P., Kumar, D.K., Yau, W.C., Weghorn, H.: Unspoken Vowel Recognition Using Facial Electromyogram. IEEE EMBC, New York (2006)
2. Bobick, A.F., Davis, J.W.: The Recognition of Human Movement Using Temporal Templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 257–267 (2001)
3. Foo, S.W., Dong, L.: Recognition of Visual Speech Elements Using Hidden Markov Models. In: Chen, Y.-C., Chang, L.-W., Hsu, C.-T. (eds.) PCM 2002. LNCS, vol. 2532, pp. 607–614. Springer, Heidelberg (2002)
4. Goldschen, A.J., Garcia, O.N., Petajan, E.: Continuous Optical Automatic Speech Recognition by Lipreading. In: 28th Annual Asilomar Conf on Signal Systems and Computer (1994)
5. Hazen, T.J.: Visual Model Structures and Synchrony Constraints for Audio-Visual Speech Recognition. IEEE Transactions on Audio, Speech and Language Processing 14(3), 1082–1089 (2006)
6. Kaynak, M.N., Qi, Z., Cheok, A.D., Sengupta, K., Chung, K.C.: Audio-visual modeling for bimodal speech recognition. IEEE Transactions on Systems, Man and Cybernetics 34, 564–570 (2001)
7. Khotanzad, A., Hong, Y.H.: Invariant Image Recognition by Zernike Moments. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 489–497 (1990)
8. Liang, L., Liu, X., Zhao, Y., Pi, X., Nefian, A.V.: Speaker Independent Audio-Visual Continuous Speech Recognition. In: IEEE Int. Conf. on Multimedia and Expo (2002)
9. Petajan, E.D.: Automatic Lip-reading to Enhance Speech Recognition. In: GLOBECOM’84 (1984)
10. Potamianos, G., Neti, C., Gravier, G., Senior, A.W.: Recent Advances in Automatic Recognition of Audio-Visual Speech. Proc. of IEEE 91 (2003)
11. Potamianos, G., Neti, C., Huang, J., Connell, J.H., Chu, S., Libal, V., Marcheret, E., Haas, N., Jiang, J.: Towards Practical Deployment of Audio-Visual Speech Recognition. In: ICASSP, IEEE (2004)
12. Rabiner, L.R.: A tutorial on HMM and selected applications in speech recognition. Proc. IEEE 77(2-2), 257–286 (1989)
13. Teh, C.H., Chin, R.T.: On Image Analysis by the Methods of Moments. IEEE Transactions on Pattern Analysis and Machine Intelligence 10, 496–513 (1988)
14. Yau, W.C., Kumar, D.K., Arjunan, S.P.: Visual Speech Recognition Method Using Translation, Scale and Rotation Invariant Features. In: IEEE International Conference on Advanced Video and Signal based Surveillance, Sydney, Australia (2006)
Feature Extraction of Weighted Data for Implicit Variable Selection
Luis Sánchez, Fernando Martínez, Germán Castellanos, and Augusto Salazar
Control & Digital Signal Processing Group, Universidad Nacional de Colombia Sede Manizales
{lgsanchezg,fmartinezt,cgcastellanosd,aesalazarj}@unal.edu.co
Abstract. Approaches based on obtaining relevant information from overwhelmingly large sets of measures have been recently adopted as an alternative to specialized features. In this work, we address the problem of finding a relevant subset of features and a suitable rotation (combined feature selection and feature extraction) as a weighted rotation. We focus our attention on two types of rotations: Weighted Principal Component Analysis and Weighted Regularized Discriminant Analysis. The objective function is the maximization of the J4 ratio. Tests were carried out on artificially generated classes with several non-relevant features. Real data tests were also performed on segmentation of nailfold capillaroscopic images and on the NIST-38 database (prototype selection).
Keywords: Relevance, Feature Selection, PCA, LDA.
1
Introduction
Many approaches based on obtaining relevant information from overwhelmingly large sets of measures have been recently adopted, replacing those that rely on specialized (expert-based) features that are sometimes difficult to obtain. Reducing the size of the data, either by encoding or by removing irrelevant information, becomes necessary if one wants to achieve good performance in the inference. In the present work, we study weighting schemes for attaining subset selection by finding a relevant subset of features and a suitable rotation (feature selection and feature extraction) as a weighted rotation. The attention is then focused on a type of supervised Weighted Principal Component Analysis (WPCA) and on Weighted Regularized Discriminant Analysis (WRDA). Feature extraction methods such as PCA and LDA have been widely treated, and WPCA has been reported in image processing applications to improve on the results obtained with classic PCA [1]. Although the properties of changing the problem formulation of PCs are discussed in [2], the estimation of the weight matrix is not addressed there. Other works have directed their efforts towards variable selection using weights for ranking their relevance [3], [4]; however, their work is not directly related to WPCA or WRDA. In our approach, we make use of the J4 feature selection criterion for assessing the relevance of a given set [5]. In order to test the effectiveness of the weighting algorithms, several tests on artificially generated data were carried out. Results are encouraging, since non-relevant variables received weights much closer to zero than with classic Regularized Discriminant Analysis (RDA). Experiments on real data encompass segmentation of nailfold capillaroscopic images and NIST-38 digit detection using dissimilarities (prototype selection).
This research was carried out under a grant from the project titled "Técnicas de Computación de Alto Rendimiento en la Interpretación Automatizada de Imágenes Médicas y Bioseñales". DIMA, Universidad Nacional de Colombia Sede Manizales.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 840–847, 2007. © Springer-Verlag Berlin Heidelberg 2007
2
Weighted Rotation, Variable Selection and Relevancy
The variable selection problem can be understood as selecting a subset of p features from a larger set of c measures. This type of search is guided by some evaluation function that has been defined as the relevancy of a given set [6]. Typically, such procedures involve exhaustive search (binary selection), and so the relevance measure must take into account the number of dimensions p in order to state whether there is significant improvement when dimensions are added or rejected. On the other hand, feature extraction techniques take advantage of the data to encode the representation efficiently in lower dimensions [7]. We can capitalize on this property to keep a fixed dimension (projected space) and assess the relevancy of the projected set. Binary search algorithms involve explicit enumeration of the subsets, and optimal selection is guaranteed only after exhaustive search. Suboptimal heuristics such as greedy search have been introduced to reduce the computational complexity, but the relevance criterion involves the size of the subset, and proper tuning of the objective function becomes problematic. Weighting schemes are more flexible in this sense. Although weights do not provide the optimal solution in most cases, they can reach fair results. We can combine feature extraction methods with weighted data to maintain a fixed set size and adjust the weights in such a manner that relevance is maximized. Next, we describe the employed methodologies and algorithms.
2.1
PCA and Probabilistic PCA
Classic PCA. Principal Component Analysis has been a mainstay of data analysis in a large number of applications. Its attractiveness lies in its simplicity and its capacity for reducing dimensionality by minimizing the squared reconstruction error of a linear combination of latent variables known as principal components (PCs). The model parameters can be computed directly from the centered data matrix, either by Singular Value Decomposition or by diagonalization of the positive semidefinite covariance matrix. Among other methods developed for estimating the PCs, there is a probabilistic framework discussed in [8, 9], which we adopt for our weighted algorithm, using the Expectation Maximization (EM) procedure to compute p principal components.
Weighted Probabilistic PCA. Consider the latent variable model for data
$x = Cz + v; \quad z \sim N(0, I), \quad v \sim N(0, R)$   (1)
The latent variables z are assumed to be independent and identically distributed over a unit-variance spherical Gaussian. Notice the difference between the PPCA model and PCA, where the variance of the latent variables can be associated with the diagonal elements of the eigenvalue matrix (the diagonal matrix of the singular value decomposition of the covariance matrix). The model also considers a general perturbation matrix R, but [8] restricts it to be $\epsilon I$ (isotropic noise). Now, we modify the formulation of the model to introduce weights on the measured variables and so perform the weighted rotation. Let D be a diagonal matrix containing the weight of the i-th variable in the element $d_{ii}$. If we take the new observed variable to be $d_{ii} x_i$, the observed vector becomes Dx and the model is now defined as
$Dx = Cz + v$   (2)
where z and v are still distributed as in (1). From equation (2) an EM algorithm can be derived that estimates the unknown state (latent variable) in the e-step and, in the m-step, maximizes the expected joint likelihood of the estimated z and the observed data by choosing C and R under the model assumptions. For the modified model (weighted variables), we have two variants of EM:
– Noise-free model
e-step: $Z^T = (C^T C)^{-1} C^T D X^T$
m-step: $C = D X^T Z (Z^T Z)^{-1}$
– Noisy model (SPCA)
e-step: $\beta = C^T (C C^T + \epsilon I)^{-1}$; $\mu_z = \beta D X^T$; $\Sigma_z = nI - n\beta C + \mu_z \mu_z^T$
m-step: $C = D X^T \mu_z^T \Sigma_z^{-1}$; $\epsilon = \operatorname{trace}(D X^T X D - C \mu_z X D) / (nc)$
Up to this point, we have only defined how the PCA rotation can be estimated. Before moving on to weight updating, we briefly review some aspects of Regularized Discriminant Analysis.
2.2
Regularized Discriminant Analysis
RDA was proposed in [10] for small-sample, high-dimensional data sets to overcome the degradation of the discriminant rule. In our particular case, we refer to regularized linear discriminant analysis. The aim of this technique is to find a linear projection of the space in which the scatter between classes is maximized and the scatter within classes is minimal. A way of finding such a projection is to maximize the ratio between the projected between-class matrix $\Sigma_B$ and within-class matrix $\Sigma_W$ [5]. Conditional extremes can be obtained from Lagrange multipliers; the solutions are the k − 1 leading generalized eigenvectors W of $\Sigma_B$ and $\Sigma_W$, that is, the leading eigenvectors of $\Sigma_W^{-1} \Sigma_B$. The need for regularization arises with small samples, where $\Sigma_W$ cannot be directly inverted. The solution is then rewritten as $(\Sigma_W + \delta I)^{-1} \Sigma_B W = W \Lambda$. After weighting the data, that is, XD, we can recast the J quotient as
$J_D = \frac{|W^T D \Sigma_B D W|}{|W^T D \Sigma_W D W|}$
2.3
Variable Weighting and Relevance Criteria
From the previous sections, we now know the desired weighted linear transformation Φ we want to apply. Data will be projected onto a subspace of fixed dimension. This dimensionality is chosen depending on the rotation criterion; for instance, if we have a two-class problem and we want to test WRDA, the fixed dimension should be 1, as demonstrated for LDA. In order to assess the relevance of a given weighted projection of fixed dimension, we introduce a separability measure. The search will end in some local maximum of the target function. The parameter to be optimized is the weight matrix D, and the selected criterion is the ratio of the traces of the aforementioned between-class and within-class matrices; this criterion is known as J4 [5]. For weighted-projected data this measure is given by
$J_4(D, \Phi) = \frac{\operatorname{trace}(\Phi^T D \Sigma_B D \Phi)}{\operatorname{trace}(\Phi^T D \Sigma_W D \Phi)}$   (3)
where Φ can be replaced by either the PCA rotation or the RDA rotation W. The size of Φ is (c × f), where f denotes the fixed dimension, which is the number of projection vectors φ; $\Phi = [\phi_1 \; \phi_2 \; \cdots \; \phi_f]$. In order to apply matrix derivatives easily, we rewrite D in terms of its diagonal entries and represent it as a column vector d. For this purpose, equation (3) can be rewritten in terms of Hadamard products as follows:
$J_4(d) = \frac{d^T \left( \sum_{i=1}^{f} \Sigma_B \circ \phi_i \phi_i^T \right) d}{d^T \left( \sum_{i=1}^{f} \Sigma_W \circ \phi_i \phi_i^T \right) d}$   (4)
This target function is quite similar in nature to the one obtained for LDA¹. Therefore, the solution for d with constrained L2 norm is given by the leading eigenvector of
$\left( \sum_{i=1}^{f} \Sigma_W \circ \phi_i \phi_i^T + \delta I \right)^{-1} \left( \sum_{i=1}^{f} \Sigma_B \circ \phi_i \phi_i^T \right)$.
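The equivalence between the trace form (3) and the Hadamard form (4), and the eigenvector solution for d, can be checked numerically. The sketch below uses random positive semidefinite matrices as stand-ins for the scatter matrices; the sizes, the value of δ, and all variable names are hypothetical choices for illustration. It relies on the identity $\sum_i \Sigma \circ \phi_i \phi_i^T = \Sigma \circ (\Phi \Phi^T)$.

```python
import numpy as np

rng = np.random.default_rng(0)
c, f, delta = 6, 2, 1e-3          # features, fixed dimension, regularizer (toy values)

def random_psd(size):
    """Random symmetric positive semidefinite matrix, standing in for a scatter matrix."""
    a = rng.standard_normal((size, size))
    return a @ a.T

sigma_b, sigma_w = random_psd(c), random_psd(c)
phi, _ = np.linalg.qr(rng.standard_normal((c, f)))    # orthonormal projection vectors
d = rng.standard_normal(c)                            # arbitrary weight vector
D = np.diag(d)

# Eq. (3): ratio of traces of the weighted, projected scatter matrices.
j4_trace = (np.trace(phi.T @ D @ sigma_b @ D @ phi)
            / np.trace(phi.T @ D @ sigma_w @ D @ phi))

# Eq. (4): the same ratio as a quotient of quadratic forms in d.
outer_sum = phi @ phi.T                               # sum_i phi_i phi_i^T
m_b = sigma_b * outer_sum                             # Hadamard products
m_w = sigma_w * outer_sum
j4_quad = (d @ m_b @ d) / (d @ m_w @ d)

# Weight update: leading eigenvector of (M_W + delta I)^{-1} M_B, scaled to unit norm.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(m_w + delta * np.eye(c), m_b))
d_star = eigvecs[:, np.argmax(eigvals.real)].real
d_star /= np.linalg.norm(d_star)
```

The two values `j4_trace` and `j4_quad` agree up to rounding, and `d_star` maximizes the regularized quotient $d^T M_B d / d^T (M_W + \delta I) d$, so it scores at least as well as the arbitrary vector `d` on that quotient.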
Notice that this description treats the elements of Φ as static, although derivatives with respect to d do exist. To overcome this issue we interleave the computation of d and Φ until both converge [11, 12].

¹ Regularization can be thought of as Tikhonov filtering of singular solutions of (4), or as the SNR from sampling populations with equal means.

WPCA Algorithm. We take advantage of the iterative nature and convergence of the EM estimation of parameters in probabilistic frameworks. The procedure is as follows:

1. Normalize each feature vector to have zero mean and ‖·‖₂ = 1.
2. Start with some initial set of orthonormal vectors U^(0).
L. Sánchez et al.
3. Compute d^(r) from the solution given in Section 2.3, and reweight the data.
4. Compute the E- and M-steps of the desired latent variable model (Noise-free or Sensible). The columns of the estimated C, normalized so that ‖·‖₂ = 1, become the columns of U^(r).
5. Compare U^(r) and U^(r−1) for some ε and return to step 3 if necessary².
6. Orthogonalize the obtained subspace as follows: svd(U^T D X^T X D U) = A S A^T; U_end = A^T U.

WRDA Algorithm. In this case we do not worry about missing the derivatives of the rotation with respect to the data weights, since both functions (weight and rotation calculation) have similar directions. The procedure is as follows:

1. Set the dimension to k − 1, where k is the number of classes.
2. Normalize each feature vector to have zero mean and ‖·‖₂ = 1.
3. Start with some initial set of orthonormal vectors W^(0).
4. Compute d^(r) from the solution given in Section 2.3, and reweight the data.
5. Compute W^(r) from the solution given in Section 2.2.
6. Compare W^(r) and W^(r−1) for some ε and return to step 4 if necessary (as in WPCA).
In the following section we test our algorithms on toy and real data problems.

3 Experimental Setup

Tests are carried out on three types of data: artificially generated classes (toy data) and two binary classification tasks on real problems, namely medical image segmentation and digit recognition.

3.1 Toy Data
For comparison purposes we generate artificial observation vectors in the same way as described for the linear problem in [4]. Figure 1 shows the resulting weight vector for Noise-free Weighted PCA (3 principal directions); the remaining irrelevant features exhibited the same behavior (zero values). The reduced space obtained by WPCA is also presented along with that of classic PPCA. Figure 1(c) shows that the classes are well separated compared with the probabilistic principal components depicted in Figure 1(b). We also test the accuracy of the algorithms with respect to the number of training observations. A fixed set of 1000 observations was kept for validation while training sets of different sizes were generated. Figure 2 shows the classification performance (validation error) for different training sizes, computed as the mean over 20 training sets of the same size. Classification results come from linear classifiers: 2(a) assumes equal Gaussian distributions for each class and 2(b) uses a linear SVM.

² We make use of the sum of the absolute values of the diagonal elements of (U^(r))^T U^(r−1), which is compared to the value obtained for (U^(r−1))^T U^(r−2).
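The convergence measure described in the footnote above can be sketched as follows. This is a hedged illustration: the function names and the tolerance `eps` are assumptions introduced here, and the source does not state how the two values are compared beyond "for some ε".

```python
import numpy as np

def alignment(U_new, U_old):
    """Sum of the absolute values of the diagonal of U_new^T U_old."""
    return float(np.sum(np.abs(np.diag(U_new.T @ U_old))))

def has_converged(U_r, U_r1, U_r2, eps=1e-6):
    """Compare the alignment of (r, r-1) with that of (r-1, r-2)."""
    return abs(alignment(U_r, U_r1) - alignment(U_r1, U_r2)) < eps

# Hypothetical example: an orthonormal basis with f = 2 columns.
U, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((5, 2)))
same_subspace = alignment(U, U)       # equals f for identical orthonormal bases
flipped = has_converged(U, -U, U)     # sign flips of the columns do not matter
```

Taking absolute values makes the measure insensitive to the sign ambiguity of the estimated directions, which is a natural property for comparing successive subspace estimates.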
Feature Extraction of Weighted Data

[Figure 1, three panels. (a) Weight vector for WPCA Noise-free: absolute value of the weight plotted against the variable index. (b) Principal Components: reduced three-dimensional space with PPCA. (c) Weighted Principal Components, Noise-free: reduced three-dimensional space with WPCA.]
Fig. 1. Weight vector for WPCA. Spanned spaces for PPCA and WPCA. The size of the training set is 100 observations per class.
[Figure 2, two panels plotting the validation error against the number of training observations per class for WRDA, RDA, First PC, WPCA with 3 Components, and First 3 PPCs. (a) Linear pooled covariance matrix. (b) Linear SVM.]
Fig. 2. Comparison of projection methods feeding linear classifiers
3.2 Nailfold Capillaroscopic (NC) Images

Information about connective tissue diseases can be obtained from capillaroscopic images (see Figure 3(a)). In [13], machine learning approaches were successfully applied to medical image segmentation, which motivates adapting the proposed method to a similar problem. In the present case, each pixel is represented by a 24-dimensional vector obtained from standard color space transformations: RGB, HSV, YIQ, YCbCr, LAB, XYZ, UVL, and CMYK. The underlying idea is the contrast enhancement of capillary images. We tested the algorithm on 20 images. After applying the weighting algorithms we obtained a set of relevant color transformations; specifically, for WRDA: the Q plane from the YIQ space, Cr from YCbCr, and A from the LAB transformation. For WPCA with one principal component we obtained the Q plane from the YIQ space and A from the LAB transformation (see Figure 4). We use ROC curves to assess the quality of the transformation. Scores are obtained from the likelihoods of discriminant functions of Gaussian distributions with equal priors. Figure 5(a) shows the four ROC curves. The areas under the curves are 0.9557, 0.9554, 0.9549, and 0.7295 for WRDA, WPCA, RDA, and PCA, respectively. Notice that working with only 2 or 3 color planes of the original 24 causes no significant change in performance, while the rotation offered by PCA does not perform as well as the other rotations. Even though segmentation performance could be increased if more involved methods were brought into consideration, the results obtained can be used as a preprocessing stage for subsequent improvements.

[Figure 3, two panels. (a) Acquired Image. (b) Desired Segmentation.]

Fig. 3. Sample Image from Nailfold Capillaroscopic data

Fig. 4. Enhanced contrast plane by means of different linear transformations

[Figure 5, two panels of ROC curves (true positive rate against false positive rate) for WRDA, WPCA, RDA, and PCA. (a) NC Images. (b) NIST-38 digit detection.]

Fig. 5. Receiver Operating Characteristic curve for NC Images and NIST-38 digit detection

3.3 NIST-38 Digit Recognition
The MNIST digit database is a well-known benchmark for testing algorithms. We want to assess the feasibility of applying our approach to dissimilarity representations. Prototype selection can be seen as a variable selection process in dissimilarity spaces [14]. We construct our training set by randomly picking 900 examples per class. These samples are split into an observation set (600 images) and a prototype set (the remaining 300 images); the same rule applies to the validation sets. Dissimilarity matrices are of size (1200 × 600) and contain Euclidean distances. ROC curves are displayed for WRDA, WPCA (1 WPC), RDA, and PCA (1 PC) (see Figure 5(b)); their respective areas under the curve are 0.995547, 0.991719, 0.998769, and 0.509725. Even though RDA presents a slight increase in performance, the number of prototypes it needs is 600, compared with the 30 to 40 prototypes selected by WPCA and WRDA.
4 Conclusions
We have presented a feature selection criterion based on weighting variables followed by a projection onto a fixed-dimension subspace, which is a suitable method for evaluating the relevance of a given set of variables. Results have shown that the reduced dimensionality does not affect the overall performance of the inference system (in linear cases such as the capillaroscopic images). WRDA may perform better for classification than WPCA and can be categorized as a wrapper method, since its projections are class dependent (supervised projection). Derivations from the probabilistic point of view are worthy of further investigation, and nonlinear extensions are currently under study. Results obtained for artificial data are encouraging, since feature selection performed very well on small training samples (see Figure 2).
References

1. Skočaj, D., Leonardis, A.: Weighted incremental subspace learning. In: Workshop on Cognitive Vision, Zurich (September 2002)
2. Jolliffe, I.T.: Principal Component Analysis, 2nd edn. Springer Series in Statistics. Springer, Heidelberg (2002)
3. Jebara, T., Jaakkola, T.: Feature selection and dualities in maximum entropy discrimination. In: 16th Conference on Uncertainty in Artificial Intelligence (2000)
4. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V.: Feature selection for svms. In: NIPS (2001)
5. Webb, A.R.: Statistical Pattern Recognition, 2nd edn. John Wiley and Sons, West Sussex, England (2002)
6. Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. AI 97(1-2) (1997)
7. Turk, M.A., Pentland, A.P.: Face recognition using eigenfaces. In: Computer Vision and Pattern Recognition (1991)
8. Tipping, M., Bishop, C.: Probabilistic principal component analysis. J. R. Statistics Society 61, 611–622 (1999)
9. Roweis, S.: Em algorithms for pca and spca. Neural Information Processing Systems 10, 626–632 (1997)
10. Friedman, J.H.: Regularized discriminant analysis. Journal of the American Statistical Association 84, 165–175 (1989)
11. Wolf, L., Shashua, A.: Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weighted-based approach. Journal of Machine Learning Research (2005)
12. Direct feature selection with implicit inference. In: ICCV (2003)
13. Li, S., Fevens, T., Krzyzak, A., Li, S.: Automatic clinical image segmentation using pathological modelling, pca and svm. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS (LNAI), vol. 3587, Springer, Heidelberg (2005)
14. Pekalska, E., Duin, R.P.W., Paclík, P.: Prototype selection for dissimilarity-based classifiers. Elsevier, Pattern Recognition 39 (2006)
Analysis of Prediction Mode Decision in Spatial Enhancement Layers in H.264/AVC SVC

Koen De Wolf, Davy De Schrijver, Wesley De Neve, Saar De Zutter, Peter Lambert, and Rik Van de Walle

Ghent University – IBBT
Department of Electronics and Information Systems – Multimedia Lab
Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium
{koen.dewolf, davy.deschrijver, wesley.deneve, saar.dezutter, peter.lambert, rik.vandewalle}@ugent.be
http://multimedialab.elis.ugent.be/
Abstract. On top of the prediction modes defined in the H.264/AVC standard, Scalable Video Coding defines prediction modes for inter-layer prediction. These inter-layer prediction modes allow the re-use of coded data from the base layer, at the cost of a larger search space at the encoder and, as a result, a longer encoding time. In this paper, we investigate the relation between the coding decisions taken in the base layer and the enhancement layer. Our tests have shown that a number of relations can be clearly identified. We have observed that the co-located macroblock of a base layer macroblock coded in P_8x8 mode has a 40 % chance of being coded in P_8x8 mode as well. Further, we have observed that the P_Skip mode is only used when the quantization parameter in the enhancement layer is high. For macroblocks coded in B_Skip mode, the co-located macroblock in the enhancement layer will be coded in the B_Skip mode when it is highly quantized (probability of 63 % to 92 % for quantization parameter 30). These observations can be used to construct a model for fast mode decision in SVC.
1 Introduction

H.264/AVC Scalable Video Coding (SVC), as proposed by the Joint Video Team (JVT), can be classified as a layered video specification based on the single-layer H.264/AVC standard [1], [2]. Additional Enhancement Layers (ELs) contain information pertaining to the embedded spatial and SNR enhancements. Single-layer prediction techniques similar to those in H.264/AVC are applied, in particular intra and motion-compensated prediction. However, additional inter-layer prediction mechanisms have been developed to minimize redundant information between different layers. In the case of Rate-Distortion Optimized (RDO) coding, the encoder complexity is significantly increased due to the large search space constructed by the numerous prediction modes incorporated in SVC. These prediction modes can be divided into two groups. The first group contains the prediction modes that originate from the single-layer H.264/AVC specification, i.e., the various intra and motion-compensated prediction modes. The second group contains the inter-layer prediction modes.

In this paper, we analyze and discuss the mode decisions of an SVC encoder. This analysis can be used to construct a model for fast mode decision. The use of such a model in an SVC encoder can significantly reduce the encoding time at the cost of a lower coding efficiency (i.e., a possibly higher bit rate together with lower visual quality). Such an analysis was already performed by He Li et al. [3] with version 2 of the reference software (dated 2005). However, in the meantime several new coding tools have been added to the SVC specification. Furthermore, in our analysis, we also take the Quantization Parameters (QPs) into account.

This paper is organized as follows. First, we provide a detailed outline of the prediction modes that are available in SVC, used in the context of spatial scalability. In the third section, we analyze and discuss the relation between the coding decisions in the base layer and the enhancement layer. Finally, conclusions are provided in Sect. 4.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 848–855, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 SVC Prediction Modes

As mentioned in the introduction, SVC is a layered extension of the H.264/AVC video coding specification. The structure of a possible SVC encoder is shown in Fig. 1. In this figure, the original input video sequence is down-scaled in order to obtain the pictures for all different spatial layers (resulting in spatial scalability). In each spatial layer, a motion-compensated pyramidal decomposition is performed, taking into account the characteristics of each layer (i.e., GOP structure, bit rate, ...). This temporal decomposition results in a motion vector field on the one hand and residual texture data on the other hand. This information is coded by using techniques similar to those in H.264/AVC, extended with progressive SNR refinement features. Several methods that allow the reuse of coded information among different spatial resolution layers are under investigation by the JVT. In particular, the layered structure of SVC allows the reuse of motion vectors, residual data, and intra-texture information of lower spatial and SNR layers for the prediction of higher-layer pictures in order to reduce inter-layer redundancy. In Sect. 2.2, these inter-layer prediction methods are discussed in more detail.

2.1 Intra-layer Prediction
Depending on the slice type, a macroblock (MB) in a slice can be coded using one of the following three prediction modes: intra prediction (I-macroblock), uni-directional prediction (P-macroblock), or bi-directional prediction (B-macroblock). Whereas the first prediction mode only uses the values of spatially neighboring samples of the block (intra-slice prediction), the latter two modes rely on Motion-Compensated (MC) prediction signals based on samples originating from preceding or subsequent pictures. The prediction error is obtained by subtracting the prediction signals from the original signal. Usually, this prediction error (or residual) is then transform-coded. In a next step, these transform coefficients are quantized and entropy coded.

[Figure 1: block diagram of a Base Layer Encoder and an Enhancement Layer Encoder. Each contains T, Q, Q⁻¹, T⁻¹, ME, MC, Intra Prediction, and Reorder & Entropy Encoding blocks operating on the current, reference, and reconstructed pictures; the outputs are multiplexed into the SVC bit stream. The arrows labelled A, B (MVR), C, and D mark the inter-layer prediction paths.]

Fig. 1. Possible SVC encoder structure that supports two spatial layers

Intra-Slice Prediction Modes. Three types of intra prediction are defined in the H.264/AVC specification: I_4x4, I_16x16, and I_PCM. In the first two types, the prediction signal (4x4 and 16x16 samples, respectively) is constructed by copying or interpolating the values of previously coded neighboring samples. Nine and four modes are defined for the I_4x4 type and the I_16x16 type, respectively. These modes stipulate the direction of the interpolation or the direction of the copy of the signal¹. The I_PCM mode (Intra-slice Pulse Code Modulation) allows an encoder to bypass the prediction and transform coding processes, directly sending the values of the samples to the entropy encoder.

¹ Except for mode 2, in which the prediction signal is constructed by taking the average of adjacent pixels.

Motion-Compensated Prediction Modes. The concepts of MC prediction, as defined in H.264/AVC, are used in the SVC ELs as well. Hence, variable block-size motion compensation, quarter-sample motion vector accuracy, multiple reference pictures, and weighted prediction are supported by SVC. For an introduction to these tools, we refer to [2].

In the case of uni-directional coding, only one MC prediction block is used. The translation of this block (commonly referred to as the motion vector) and the index of the picture in the reference list, i.e., list 0 (L0), used for this prediction need to be coded. This reference list may contain pictures before and after the current picture in display order. H.264/AVC allows variable block sizes, ranging from 16x16, 16x8, and 8x16 to 8x8 (P_L0_16x16, P_L0_L0_16x8, P_L0_L0_8x16, and P_8x8, respectively). An 8x8 partition can be further divided into 8x4, 4x8, and 4x4 sub-partitions. This means that for each of these partitions both a motion vector and a picture reference index need to be coded, except that all sub-partitions of an 8x8 partition must use the same reference picture. An additional uni-directional prediction type is called P_Skip. For this type, no motion vector, reference index, or residual is coded. The prediction signal (16x16 samples) is constructed by using the picture with index 0 in the picture reference list (L0). The motion vector is equal to the motion vector predictor² of that MB.

In the case of bi-directional coding, at most two motion-compensated prediction signals can be used. These signals originate from pictures that are stored in two separate reference lists: list 0 (L0) and list 1 (L1). For B-macroblocks, four modes of MC prediction are defined in H.264/AVC: list 0, list 1, bi-predictive, and direct. In the list 0 and list 1 modes, only one motion vector per partition is transmitted, together with the index of the picture in the appropriate reference list. For the bi-predictive mode, a weighted average of the MC prediction blocks is used to obtain the prediction signal.
In the direct mode, the coding mode is inferred from the neighboring prediction modes and can be list 0, list 1, or bi-predictive. When an MB is coded in direct mode and no prediction error signal is coded, this mode is referred to as B_Skip mode. Together with the possible MB partitions, this results in 24 prediction modes.

2.2 Inter-layer Prediction
On top of the MC prediction and intra prediction modes, as defined in H.264/AVC, four prediction methods are defined that re-use coding information from a Base Layer (BL). This is called Inter-Layer Prediction (ILP). The BL for a particular EL is the layer that can be used for ILP. This BL need not be the layer with the lowest quality or the lowest spatial resolution. In the first ILP method, BL motion information is re-used for efficient coding of the EL motion data. In this mode, commonly referred to as base layer mode, no additional motion vectors are transmitted for the EL. When the base layer is a down-scaled version of the current layer, the motion vectors and the MB partitioning mode are up-sampled accordingly. This is illustrated in Fig. 1 by the dashed arrow A. Also, for the current MB, the same reference indices as for the corresponding 8x8 partitions of the base layer are used [4].
² In H.264/AVC, the motion vector predictor is in general the component-wise median of the motion vectors of the left, upper, and upper-right neighboring blocks.
The second ILP mode is called quarter pel refinement mode and is an extension of the base layer mode. The motion vectors, the partitioning mode, and the reference indices of the current MB are derived from the corresponding sub-macroblock in the base layer. Additionally, a quarter pel Motion Vector Refinement (MVR) is transmitted for that particular block (dashed arrow B in Fig. 1). This makes it possible to improve the MC prediction.

The third ILP method is called residual prediction mode. Here, residual information of MC-coded MBs from the base layer can be used for the prediction of the residual of the current layer. A flag, indicating whether residual prediction is used, is transmitted for each MB of the current EL. When residual prediction is applied, the base layer residuals of the corresponding MBs are block-wise up-sampled using a bi-linear filter with constant border extension. As a result, only the difference between the residual of the current layer, obtained after motion compensation, and the up-sampled residual of the base layer is coded (dashed arrow C in Fig. 1).

The fourth ILP mode is inter-layer intra prediction (dashed arrow D in Fig. 1). In this mode, an intra-predicted MB of a slice in the base layer can be used for the prediction of the co-located block in the current EL. To this end, the up-sampled version of the decoded block is used. In SVC, the 6-tap filter used for half-sample interpolation, as defined in the H.264/AVC specification, is used for the interpolation of these decoded samples. The drawback of this approach is that a decoder needs to decode all the referenced intra-coded MBs in the base layer.

Note that residual prediction mode can be combined with the base layer mode, the quarter pel refinement mode, and the inter-layer intra prediction mode. The combination of base layer mode and residual prediction mode is also called BL_Skip mode. A detailed analysis of the coding performance and time complexity of the ILP modes is given in [5].
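The residual prediction mode described above can be illustrated with a small round trip. This is a simplified sketch and not the SVC reference implementation: the up-sampler below merely repeats samples (a nearest-neighbour stand-in for the bi-linear filter with constant border extension mentioned in the text), transform and quantization are ignored, and all names and sample values are hypothetical.

```python
import numpy as np

def upsample2x(block):
    """Stand-in up-sampler: repeat every sample 2x2 (nearest neighbour)."""
    return np.repeat(np.repeat(block, 2, axis=0), 2, axis=1)

def encode_residual(el_residual, bl_residual, use_residual_prediction):
    """Return (flag, signal that would be passed on to transform coding)."""
    if use_residual_prediction:
        # Only the difference to the up-sampled BL residual is coded.
        return True, el_residual - upsample2x(bl_residual)
    return False, el_residual

def decode_residual(flag, coded, bl_residual):
    """Invert encode_residual using the per-MB flag."""
    return coded + upsample2x(bl_residual) if flag else coded

# Hypothetical residuals: a 2x2 BL block and the co-located 4x4 EL block.
bl_residual = np.array([[4, -2], [0, 6]])
el_residual = upsample2x(bl_residual) + np.eye(4, dtype=int)

flag, coded = encode_residual(el_residual, bl_residual, True)
restored = decode_residual(flag, coded, bl_residual)
```

When the EL residual resembles the up-sampled BL residual, as in this constructed example, the signal left to code has much less energy than the EL residual itself, which is the redundancy this mode is meant to remove.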
3 Prediction Mode Decision

The prediction mode decision is taken by minimizing the Lagrangian cost functional J = D + λR over all possible MB coding modes. Here, D represents the distortion between the original and the reconstructed signal, R is the bit rate needed for coding the motion vectors and residual data, and λ is the Lagrangian multiplier, which depends on the chosen QP setting [6]. The value of this Lagrangian parameter is layer-specific. The minimization of J is a very time-consuming operation, as this cost function needs to be evaluated for all possible modes, motion vectors (within the search window), and reference pictures. Fast Mode Decision and Fast Motion Estimation algorithms have been developed in order to reduce the complexity of this decision-making process [7], [8], [9]. A side effect of such algorithms is a coding efficiency penalty, as the mode decision will be sub-optimal.
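The Lagrangian decision rule above can be sketched in a few lines. This is a hedged illustration rather than the encoder's actual code: the distortion and rate figures are invented, they are held fixed across QPs for simplicity (in a real encoder both change with the QP), and the mapping from QP to λ is an assumption, namely the relation commonly used in H.264/AVC reference encoders, since the text only states that λ depends on the QP.

```python
def lambda_from_qp(qp):
    # Assumption: lambda = 0.85 * 2^((QP - 12) / 3), a relation commonly used
    # for mode decision in H.264/AVC reference encoders.
    return 0.85 * 2 ** ((qp - 12) / 3.0)

def lagrangian_cost(distortion, rate_bits, lam):
    """J = D + lambda * R."""
    return distortion + lam * rate_bits

def best_mode(candidates, qp):
    """Pick the mode with the smallest Lagrangian cost.

    candidates maps a mode name to a (distortion, rate in bits) pair.
    """
    lam = lambda_from_qp(qp)
    return min(candidates,
               key=lambda mode: lagrangian_cost(*candidates[mode], lam))

# Hypothetical measurements for one macroblock.
candidates = {
    "P_Skip":     (900.0, 1),
    "P_L0_16x16": (400.0, 40),
    "P_8x8":      (250.0, 120),
}

choice_low_qp = best_mode(candidates, 12)
choice_high_qp = best_mode(candidates, 30)
```

With these made-up numbers the finely partitioned mode wins at a low QP, while at a high QP the larger λ penalizes rate so heavily that the skip mode becomes the cheapest, in line with the tendency reported for highly quantized enhancement layers.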
[Figure 2, four panels for the Foreman sequence: selection probability (%) of each EL prediction mode (P_Skip, P_L0_16x16, P_L0_L0_16x8, P_L0_L0_8x16, P_8x8, I_4x4, BL_Skip). Each panel corresponds to one base layer QP and shows one bar series per (QP_BL, QP_EL) combination with QP_EL ∈ {12, 18, 24, 30}.]

Fig. 2. Prediction mode decision in spatial EL for the Foreman sequence when the co-located MB in BL is coded in P_8x8 mode. Base layer QP = 12 (top-left), 18 (top-right), 24 (bottom-left), and 30 (bottom-right).
3.1 Test Configuration

The statistical analysis of the prediction modes used is performed on five sequences: Crew, Foreman, Mobile & Calendar, Mother & Daughter, and Stefan. Two spatial layers are used: QCIF and CIF, both at 30 Hz. The GOP size is set to 16; intra-coded slices are inserted every 32 frames. Full-search ME and all ILP modes are enabled. We coded all sequences with version 8 of the reference software [10] using 16 different QP combinations {(QP_BL, QP_EL) | QP_BL, QP_EL ∈ {12, 18, 24, 30}}.

3.2 Results and Discussion
Due to space limitations, we are unable to publish all results in detail. Therefore, we discuss two of the most frequently selected MB prediction modes in the BL, i.e., the P_8x8 mode for P-pictures and the B_Skip mode for B-pictures.

Co-located MB is Coded in P_8x8 Mode. This mode is used in the BL for about 30 % of the MBs in P-pictures. We can see from the graphs in Fig. 2 that approximately 40 % of the co-located MBs in the enhancement layer are also coded in P_8x8 mode, except when the EL is highly quantized. In that case, BL_Skip mode is selected most often (24-40 %). When the QP of the EL is low, the I_4x4 mode is selected second most often (22-25 %). For the tested configurations, the P_Skip mode is only used for higher QPs in the EL.

[Figure 3, four panels for the Stefan sequence: selection probability (%) of each EL prediction mode (B_Skip, B_Direct, B_L0_16x16, B_L1_16x16, B_Bi_16x16, B_Bi_Bi_16x8, B_Bi_Bi_8x16, B_8x8, BL_Skip). Each panel corresponds to one base layer QP and shows one bar series per (QP_BL, QP_EL) combination with QP_EL ∈ {12, 18, 24, 30}.]

Fig. 3. Prediction mode decision in spatial EL for the Stefan sequence when the co-located MB in BL is coded in B_Skip. Base layer QP = 12 (top-left), 18 (top-right), 24 (bottom-left), and 30 (bottom-right).

Co-located MB is Coded in B_Skip Mode. In Fig. 3, the prediction mode decisions (expressed in %) in the spatial EL for the Stefan sequence are shown when the co-located MB in the base layer is coded using the B_Skip mode. Regardless of the QP of the BL, for high QPs in the EL the MB will be coded in B_Skip mode (with a probability ranging from 39 % to 86 % for QP_EL = 24 and from 63 % to 92 % for QP_EL = 30). For low QPs in the EL, the B_Skip mode is seldom used. Instead, the prediction modes used are more or less equally divided over BL_Skip, B_Direct, B_Bi_16x16, and B_8x8. Also, we observe that for less quantized MBs in the BL, the BL_Skip and B_Skip modes are selected more often in the EL. The same behaviour is observed for the other sequences.
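Statistics of the kind discussed above could feed a fast mode decision model along the following lines. This is a sketch under stated assumptions, not the model proposed by the authors (the paper only suggests that such a model can be built): the function names, the coverage threshold, and the observation counts are all hypothetical.

```python
from collections import Counter, defaultdict

def conditional_mode_stats(observations):
    """Estimate P(EL mode | BL mode, QP_EL) from (bl_mode, qp_el, el_mode) triples."""
    counts = defaultdict(Counter)
    for bl_mode, qp_el, el_mode in observations:
        counts[(bl_mode, qp_el)][el_mode] += 1
    return {key: {mode: n / sum(c.values()) for mode, n in c.items()}
            for key, c in counts.items()}

def candidate_modes(stats, bl_mode, qp_el, coverage=0.9):
    """Most probable EL modes whose cumulative probability reaches `coverage`.

    An empty list means no statistics are available, in which case an
    encoder would fall back to the exhaustive search.
    """
    probs = stats.get((bl_mode, qp_el), {})
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    chosen, total = [], 0.0
    for mode, p in ranked:
        chosen.append(mode)
        total += p
        if total >= coverage:
            break
    return chosen

# Hypothetical observations for BL mode B_Skip at QP_EL = 30.
observations = ([("B_Skip", 30, "B_Skip")] * 7
                + [("B_Skip", 30, "BL_Skip")] * 2
                + [("B_Skip", 30, "B_Direct")])

stats = conditional_mode_stats(observations)
shortlist = candidate_modes(stats, "B_Skip", 30, coverage=0.85)
unknown = candidate_modes(stats, "P_8x8", 12)
```

Evaluating the Lagrangian cost only for the shortlisted modes trades a possible loss in coding efficiency for a smaller search space, which is the trade-off the paper attributes to fast mode decision.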
4 Conclusion

SVC is an extension of H.264/AVC providing spatial, temporal, and SNR scalability with a high compression efficiency. This compression efficiency is achieved by relying on the available coding modes. Exhaustive search techniques are used to select the best coding mode for each MB. In doing so, these techniques achieve the highest possible coding efficiency, but at the cost of a higher computational complexity. We have analyzed the relation between the coding mode decisions made in the BL and the EL. Our tests have shown that a number of relations can be clearly identified. We have observed that the co-located MB of a BL MB coded in P_8x8 mode has a 40 % chance of being coded in P_8x8 mode. Moreover, we have noticed that the P_Skip mode is only used for high QPs in the EL. For MBs coded in B_Skip mode, the co-located MB in the EL will be coded in the B_Skip mode when the EL is highly quantized (63 % to 92 % for QP_EL = 30). The observations in this paper can be used to construct a model for fast mode decision. Such a model can be used to guide the mode decision algorithm of an SVC encoder, thereby reducing the overall encoding time.

Acknowledgements. The research activities that have been described in this paper were funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT-Flanders), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union.
References

1. ITU-T, ISO/IEC JTC 1: Advanced video coding for generic audiovisual services, ITU-T Rec. H.264 and ISO/IEC 14496-10 AVC (2003)
2. Wiegand, T., Sullivan, G., Bjøntegaard, G., Luthra, A.: Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology 13, 560–576 (2003)
3. Li, H., Li, Z., Wen, C., Chau, L.P.: Fast mode decision for spatial scalable video coding. In: IEEE International Symposium on Circuits and Systems (ISCAS) (2006)
4. Wiegand, T., Sullivan, G., Reichel, J., Schwarz, H., Wien, M. (eds.): Joint Scalable Video Model 8: Joint Draft 8 with proposed changes, Doc. JVT-U202. JVT (2006)
5. De Wolf, K., De Schrijver, D., De Zutter, S., Van de Walle, R.: Scalable Video Coding: Analysis and coding performance of inter-layer prediction. In: Proceedings of the 9th International Symposium on Signal Processing and its Applications, Dubai (U.A.E.), SuviSoft Oy Ltd, 4 (2007)
6. Sullivan, G., Baker, R.: Rate-distortion optimized motion compensation for video compression using fixed or variable size blocks. In: Proceedings of the IEEE Global Telecommunications Conference, Phoenix, AZ. vol. 3, pp. 85–90 (1991)
7. Pan, F., Lin, X., Rahardja, S., Lim, K., Li, Z., Wu, D., Wu, S.: Fast mode decision algorithm for intraprediction in H.264/AVC video coding. IEEE Transactions on Circuits and Systems for Video Technology 15, 813–822 (2005)
8. Dai, Q., Zhu, D., Ding, R.: Fast mode decision for inter prediction in H.264. In: Proceedings of International Conference on Image Processing (ICIP) (2004)
9. Lin, Z., Yu, H., Pan, F.: A scalable fast mode decision algorithm for H.264. In: IEEE International Symposium on Circuits and Systems (ISCAS) (2006)
10. Vieron, J., Wien, M., Schwarz, H. (eds.): Joint Scalable Video Model (JSVM) 8 software, Doc. JVT-Q203. JVT (2006)
Object Recognition by Implicit Invariants

Jan Flusser¹, Jaroslav Kautsky², and Filip Šroubek¹

¹ Institute of Information Theory and Automation, AS CR, Pod vodárenskou věží 4, 182 08, Prague 8, Czech Republic
² The Flinders University of South Australia, Adelaide, Australia
Abstract. The use of traditional moment invariants is limited to a certain set of simple geometric transforms, such as rotation, scaling and affine transform. This paper presents a novel concept of so-called implicit moment invariants, which enable us to recognize objects under a broader set of geometric deformations.
1 Introduction

Recognition of objects and patterns that are deformed in various ways has been a goal of much recent research. There are basically three major approaches to this problem: full search, image normalization, and invariant descriptors. The approach using invariant descriptors appears to be the most promising one and has been used extensively. Its basic idea is to describe the object by a set of features which are not sensitive to particular deformations and which provide enough discrimination power to distinguish among objects belonging to different classes. In 2D object recognition, various moment invariants have become classical and frequently used shape descriptors during the last forty years. Even if they suffer from some intrinsic limitations (the most important of which is their globalness, which prevents them from being used for recognition of occluded objects), they often serve as the "first-choice descriptors" and as a reference method for evaluating the performance of other shape descriptors. All moment invariants ever studied (see for instance [1,2,3,4]) are so-called explicit invariants. An explicit invariant is a functional (let us denote it as E) acting on the space of image functions which does not change its value if the image f undergoes a certain deformation τ from the set of admissible deformations, i.e., which satisfies the condition E(f) = E(τ(f)) for any image f. Many systems of explicit moment invariants have been described with respect to rotation, scaling, affine transform, contrast changes, and linear filtering. However, there are several classes of image deformations which occur frequently in practice but for which explicit moment invariants are not known or have even been proven not to exist. Typical examples are projective transform, cylindrical and spherical projections, quadratic transform, and other polynomial transforms of the image coordinates.
To overcome this, we propose in this paper a new concept of implicit invariants. An implicit invariant I is a functional defined on image pairs such that I(f, τ(f)) = 0 for any image f and deformation τ. According to this definition, explicit invariants are just particular cases of implicit invariants. Clearly, if an explicit invariant exists, we can set I(f, g) = E(f) − E(g). As we show later on in the paper, there are many types of image deformations where explicit moment invariants do not exist while implicit moment invariants do. In those cases, implicit invariants can be used as features for object recognition. Unlike explicit invariants, implicit invariants do not provide a description of a single image because they are always defined for a pair of images. For recognition purposes this is not a drawback. We consider I(f, g) to be a "distance measure" (even if it does not exhibit all properties of a metric) between f and g factorized by τ, and we can, for each database template g_i, calculate the value of I(f, g_i) and then classify f according to the minimum.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 856–863, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 General Moments
Definition 1. Let $p_0, p_1, \dots, p_{n-1}, \dots$ be some basis functions defined on a bounded $D \subset \mathbb{R}^N$ and let $f$ be an image function having a finite integral. By a general moment of $f$ we understand the functional
\[ \mu_j(f) = \int_D f(x)\, p_j(x)\, dx . \]
If $N = 1$ and $p_j(x) = x^j$, we speak about standard moments. Using a matrix notation we can write
\[ p(x) = \begin{pmatrix} p_0(x) \\ p_1(x) \\ \vdots \\ p_{n-1}(x) \end{pmatrix} \quad\text{and}\quad \mu(f) = \begin{pmatrix} \mu_0(f) \\ \mu_1(f) \\ \vdots \\ \mu_{n-1}(f) \end{pmatrix} . \tag{1} \]
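Let $r : D \to \tilde D$ be a transformation of the domain $D$ into $\tilde D$ and let $\tilde f : \tilde D \to \mathbb{R}$ be another image function which satisfies
\[ \tilde f(r(x)) = f(x) \tag{2} \]
for $x \in D$ and $\tilde f(\tilde x) = 0$ for $\tilde x \in \tilde D \setminus r(D)$. (This means that image $\tilde f$ is a spatially deformed version of $f$.) We are interested in the relation between the moments $\mu(f)$ and the moments
\[ \tilde\mu(\tilde f) = \int_{\tilde D} \tilde f(\tilde x)\, \tilde p(\tilde x)\, d\tilde x = \int_{r(D)} \tilde f(\tilde x)\, \tilde p(\tilde x)\, d\tilde x \]
of the transformed function with respect to some other $\tilde n$ basis functions
\[ \tilde p(\tilde x) = \left( \tilde p_0(\tilde x),\ \tilde p_1(\tilde x),\ \dots,\ \tilde p_{\tilde n - 1}(\tilde x) \right)^T \]
defined on $\tilde D$. We can now formulate the following Theorem.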
J. Flusser, J. Kautsky, and F. Šroubek
Theorem 1. Denote by $J_r(x)$ the Jacobian of the transform function $r$. If
\[ \tilde p(r(x))\, |J_r(x)| = A\, p(x) \tag{3} \]
for some $\tilde n \times n$ matrix $A$ then
\[ \tilde\mu = A \mu . \tag{4} \]
The power of this theorem depends on our ability to choose the basis functions so that we can, for a given transform $r$, express the left-hand side of (3) in terms of the basis functions $p$ and thus construct the matrix $A$. This is always possible for a polynomial $r$ by choosing polynomial bases $p(x)$ and $\tilde p(\tilde x)$.
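The step from (3) to (4) can be sketched as a change of variables. The derivation below is only an illustrative sketch; it assumes that $r$ is one-to-one and differentiable on $D$, so that the substitution $\tilde x = r(x)$ is valid, which the text does not state explicitly.

```latex
\begin{align*}
\tilde\mu(\tilde f)
  &= \int_{r(D)} \tilde f(\tilde x)\,\tilde p(\tilde x)\,d\tilde x
   = \int_{D} \tilde f(r(x))\,\tilde p(r(x))\,|J_r(x)|\,dx
     && \text{(substitution } \tilde x = r(x)\text{)} \\
  &= \int_{D} f(x)\,A\,p(x)\,dx
     && \text{(by (2) and (3))} \\
  &= A \int_{D} f(x)\,p(x)\,dx = A\,\mu(f).
\end{align*}
```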
3 Implicit Moment Invariants
Let us assume that the transformation $r$ depends on a finite number, say $m$, $m < \tilde n$, of parameters $a = (a_1, \dots, a_m)$. Traditional explicit moment invariants with respect to $r$ can be obtained in two steps.
(a) Eliminate $a = (a_1, \dots, a_m)$ from the system (4). This leaves us $\tilde n - m$ equations which depend only on the two sets of general moments (and on the choice of basis functions, of course). We call it a reduced system.
(b) Re-write these equations equivalently in the form
\[ q_j(\tilde\mu(\tilde f)) = q_j(\mu(f)) , \qquad j = 1, \dots, \tilde n - m \tag{5} \]
for some functions $q_j$. Then the explicit moment invariants are $E(f) = q_j(\mu(f))$. However, for some transforms (quadratic, cubic, etc.) we may not be able to perform the second step, i.e. to find the explicit forms $q_j$. Introducing implicit invariants can overcome this drawback. The reduced system in step (a) is independent of the particular transformation. To classify an object, we traditionally compare the values of its descriptors (explicit moment invariants) with those of the database images, that is, we look for the database image which satisfies equations (5). However, this is equivalent to checking for which database image the above reduced system is satisfied. So, in case we are not able to find explicit moment invariants in the form (5), we can use this system as a set of implicit invariants. In other words, the images are classified according to the error with which the system is satisfied.
We will demonstrate the above idea of implicit moment invariants on a 1D example using standard powers as basis functions. Consider the transform
\[ r(x) = x + a x^2 , \]
where $a \in (0, 1/2]$, which maps the interval $D = [-1, 1]$ onto the interval $\tilde D = [a-1, a+1]$. Let us show two implicit invariants. Since $m = 1$ (one-parameter transform), we need $\tilde n = 3$ and $n = 6$. The Jacobian is $J_r(x) = 1 + 2ax$ and for standard powers for both $p$ and $\tilde p$ we would get
\[ A = \begin{pmatrix} 1 & 2a & 0 & 0 & 0 & 0 \\ 0 & 1 & 3a & 2a^2 & 0 & 0 \\ 0 & 0 & 1 & 4a & 5a^2 & 2a^3 \end{pmatrix} . \]
However, now we have to evaluate the moments of the transformed signal over the domain $\tilde D$, which depends on the unknown parameter $a$. This problem is resolved by choosing a shifted power basis
\[ \tilde p_j(\tilde x) = (\tilde x - a)^j , \qquad j = 0, 1, \dots, \tilde n - 1 , \]
as we have then, after the shift of variable $\tilde x = \hat x + a$,
\[ \tilde\mu_j(\tilde f) = \int_{a-1}^{a+1} \tilde f(\tilde x)\,(\tilde x - a)^j\, d\tilde x = \int_{-1}^{1} \tilde f(\hat x + a)\,\hat x^j\, d\hat x , \]
which is now evaluated over a domain independent of $a$, as $\hat f(\hat x) = \tilde f(\hat x + a)$ has domain $[-1, 1]$. For this basis $\tilde p$ we obtain a different transform matrix
\[ A = \begin{pmatrix} 1 & 2a & 0 & 0 & 0 & 0 \\ -a & 1 - 2a^2 & 3a & 2a^2 & 0 & 0 \\ a^2 & 2a(a^2 - 1) & 1 - 6a^2 & 4a(1 - a^2) & 5a^2 & 2a^3 \end{pmatrix} \]
and the first of equations (4) gives
\[ a = \frac{\tilde\mu_0 - \mu_0}{2\mu_1} \]
while the two reduced equations rewrite, after substitution, as
\[ \begin{aligned} 2\mu_1^2(\tilde\mu_1 - \mu_1) &= \mu_1(3\mu_2 - \tilde\mu_0)(\tilde\mu_0 - \mu_0) + \mu_3(\tilde\mu_0 - \mu_0)^2 , \\ 4\mu_1^3(\tilde\mu_2 - \mu_2) &= 4\mu_1^2(2\mu_3 - \mu_1)(\tilde\mu_0 - \mu_0) + \mu_1(5\mu_4 + \tilde\mu_0 - 6\mu_2)(\tilde\mu_0 - \mu_0)^2 + (\mu_5 - 2\mu_3)(\tilde\mu_0 - \mu_0)^3 . \end{aligned} \tag{6} \]
In the example above it was straightforward to derive the transform matrix $A$ for a simple transformation $r$ and a small number of invariants. For numerical reasons, this intuitive approach cannot be used for a higher-order polynomial transform $r$ and/or for more invariants. To obtain a numerically stable method it is important to use suitable polynomial bases, such as orthogonal polynomials, without using their expansions into standard (monomial) powers. Our implementation is based on the representation of polynomial bases by matrices with a special structure [5]. This representation allows the polynomials to be evaluated efficiently by means of recurrence relations.
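The 1D example can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a hypothetical test signal `f`, a parameter value `a_true = 0.2`, and Gauss-Legendre quadrature for the moments, and it uses the shifted-basis matrix as rederived from (3) (second-row entry $1 - 2a^2$). It verifies that $\tilde\mu = A\mu$, that $a$ is recovered from the first equation, and that both relations in (6) hold up to rounding error.

```python
import numpy as np

def moments_1d(func, n, nodes=64):
    """Standard moments of func over [-1, 1] by Gauss-Legendre quadrature."""
    x, w = np.polynomial.legendre.leggauss(nodes)
    vals = func(x)
    return np.array([np.sum(w * vals * x**j) for j in range(n)])

def transform_matrix(a):
    """Matrix A for r(x) = x + a*x^2 with the shifted basis (x~ - a)^j."""
    return np.array([
        [1.0,   2*a,              0.0,        0.0,              0.0,    0.0],
        [-a,    1 - 2*a**2,       3*a,        2*a**2,           0.0,    0.0],
        [a**2,  2*a*(a**2 - 1),   1 - 6*a**2, 4*a*(1 - a**2),   5*a**2, 2*a**3],
    ])

def implicit_residuals(mu, mu_t):
    """Left minus right-hand side of the two reduced equations (6)."""
    d = mu_t[0] - mu[0]
    r1 = 2*mu[1]**2 * (mu_t[1] - mu[1]) \
        - (mu[1]*(3*mu[2] - mu_t[0])*d + mu[3]*d**2)
    r2 = 4*mu[1]**3 * (mu_t[2] - mu[2]) \
        - (4*mu[1]**2*(2*mu[3] - mu[1])*d
           + mu[1]*(5*mu[4] + mu_t[0] - 6*mu[2])*d**2
           + (mu[5] - 2*mu[3])*d**3)
    return np.array([r1, r2])

a_true = 0.2                      # assumed deformation parameter, in (0, 1/2]

def f(x):                         # hypothetical test signal on [-1, 1]
    return 1.0 + 0.5*x + np.cos(3.0*x)

def f_hat(x_hat):
    """Deformed, shifted signal f^(x^) = f(r^{-1}(x^ + a)) on [-1, 1]."""
    x_tilde = x_hat + a_true
    return f((-1.0 + np.sqrt(1.0 + 4.0*a_true*x_tilde)) / (2.0*a_true))

mu = moments_1d(f, 6)             # moments of the original signal
mu_t = moments_1d(f_hat, 3)       # shifted-basis moments of the deformed signal
a_est = (mu_t[0] - mu[0]) / (2.0*mu[1])
print(a_est, implicit_residuals(mu, mu_t))
```

The residuals returned by `implicit_residuals` play the role of the implicit invariants: they vanish (up to rounding) when the second signal is a deformed version of the first.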
4 Implementation of the Implicit Invariants
Depending on $r$, the elimination of the $m$ parameters of the transformation function from the $\tilde n$ equations of (4) to obtain a parameter-free reduced system may require numerical solving of nonlinear equations. This may be undesirable or impossible. Even the simple transform $r$ used in the experimental section would lead to cubic equations in terms of its parameters. Obtaining a neat reduced system may be very difficult. Furthermore, even if successful, we create an unbalanced method: we demand that some of the equations in (4) hold exactly and use the accuracy in the resulting system as a matching criterion to find the transformed image. We therefore propose another implementation of the implicit invariants. Instead of eliminating the parameters, we calculate the "uniform best fit" from all equations in (4). For a given set of values of the moments $\mu$ and $\tilde\mu$, we find values of the $m$ parameters to satisfy (4) as well as possible in the $\ell_2$ norm; the error of this fit then becomes the value of the respective implicit invariant. Our actual implementation of the recognition by implicit invariants can be described as follows.
(a) Given is a library (database) of images $g_j(x, y)$, $j = 1, \dots, L$, and a deformed image $\tilde f(\tilde x, \tilde y)$ which is assumed to have been obtained by a transform of a known polynomial form $r(x, y, a)$ with unknown values of the $m$ parameters $a$.
(b) Choose the appropriate domains, polynomial bases $p$ and $\tilde p$, and the recurrence matrices for evaluation of the polynomials.
(c) Derive a program to evaluate the matrix $A(a)$. This critical, error-prone step is performed by a symbolic algorithmic procedure which produces the program then used in numerical calculations. This step is performed only once for the given task. (It has to be repeated only if we change the polynomial bases or the form of the transform $r(x, y, a)$, which basically means only if we move to another application.)
(d) Calculate the moments $\mu(g_j)$ of all library images $g_j(x, y)$.
(e) Calculate the moments $\tilde\mu(\tilde f)$ of the deformed image $\tilde f(\tilde x, \tilde y)$.
(f) For all $j = 1, \dots, L$ calculate, using an optimizer, the values of the implicit invariant
\[ I(\tilde f, g_j) = \min_a \left\| \tilde\mu(\tilde f) - A(a)\,\mu(g_j) \right\| \]
and denote
\[ M = \min_j I(\tilde f, g_j) . \]
The norm used here should be weighted, for example relative to the components corresponding to the same degree.
(g) The identified image is $g_k$ for which $I(\tilde f, g_k) = M$; the ratios $I(\tilde f, g_j) / I(\tilde f, g_k)$, $j \neq k$, may be used as confidence measures of the identification.
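Steps (d) to (g) can be illustrated on the 1D transform $r(x) = x + ax^2$ from Section 3. This is only a sketch under several assumptions that are not part of the paper: a hypothetical three-signal library, an unweighted Euclidean norm, and a simple grid search with golden-section refinement standing in for the unspecified optimizer.

```python
import numpy as np

def moments_1d(func, n, nodes=64):
    """Standard moments of func over [-1, 1] by Gauss-Legendre quadrature."""
    x, w = np.polynomial.legendre.leggauss(nodes)
    vals = func(x)
    return np.array([np.sum(w * vals * x**j) for j in range(n)])

def transform_matrix(a):
    """A(a) for r(x) = x + a*x^2 with the shifted power basis (x~ - a)^j."""
    return np.array([
        [1.0,   2*a,              0.0,        0.0,              0.0,    0.0],
        [-a,    1 - 2*a**2,       3*a,        2*a**2,           0.0,    0.0],
        [a**2,  2*a*(a**2 - 1),   1 - 6*a**2, 4*a*(1 - a**2),   5*a**2, 2*a**3],
    ])

def implicit_invariant(mu_t, mu, lo=0.0, hi=0.5, grid=501, iters=60):
    """min over a of ||mu_t - A(a) mu||: coarse grid, then golden-section search."""
    def cost(a):
        return np.linalg.norm(mu_t - transform_matrix(a) @ mu)
    a_grid = np.linspace(lo, hi, grid)
    k = int(np.argmin([cost(a) for a in a_grid]))
    left, right = a_grid[max(k - 1, 0)], a_grid[min(k + 1, grid - 1)]
    phi = (np.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = right - phi * (right - left)
        d = left + phi * (right - left)
        if cost(c) < cost(d):
            right = d
        else:
            left = c
    a_best = 0.5 * (left + right)
    return cost(a_best), a_best

def deform(g, a):
    """Return f^(x^) = g(r^{-1}(x^ + a)), the deformed signal shifted to [-1, 1]."""
    return lambda xh: g((-1.0 + np.sqrt(1.0 + 4.0*a*(xh + a))) / (2.0*a))

# Hypothetical library of 1D "images" (assumption, for illustration only).
library = [lambda x: 1.0 + 0.5*x + np.cos(3.0*x),
           lambda x: 1.0 + x**2,
           lambda x: np.exp(x)]
true_index, a_true = 0, 0.25

mu_lib = [moments_1d(g, 6) for g in library]                    # step (d)
mu_t = moments_1d(deform(library[true_index], a_true), 3)       # step (e)
results = [implicit_invariant(mu_t, mu) for mu in mu_lib]       # step (f)
I = np.array([value for value, _ in results])
k = int(np.argmin(I))                                           # step (g)
confidence = np.delete(I, k) / max(I[k], np.finfo(float).tiny)
print("identified:", k, "estimated a:", results[k][1])
```

In this noise-free setting the best-fit error of the correct template is at rounding level, so the confidence ratios are very large; with real images they would be far more modest.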
5 Numerical Experiments
As we have shown earlier, the implicit moment invariants can be constructed for a very broad class of image transforms including all polynomial transforms. Here we will demonstrate the implementation and the power of the method on images transformed by the following function
\[ \begin{pmatrix} \tilde x \\ \tilde y \end{pmatrix} = r(x, y) = \begin{pmatrix} ax + by + c(ax + by)^2 \\ -bx + ay \end{pmatrix} , \tag{7} \]
which is a rotation with scaling (parameters $a$ and $b$) followed by a quadratic deformation in the $\tilde x$ direction (parameter $c$). We have chosen this particular transform for our tests for the following reasons:
– It is general enough to approximate many real-life situations, for instance deformations caused by the fact that the photographed object was drawn/printed on a spherical or cylindrical surface such as a bottle or a ball.
– It is sometimes used by web designers to warp images in order to reach a desirable visual effect. Very often this is an unauthorized act violating the copyright. It is important for the copyright owners to have a tool for identifying such images.
– Explicit invariants to this kind of transform cannot exist because such transforms do not preserve the moment orders and do not form a group.
The first experiment was aimed at testing the discriminative power of the implicit invariants and at demonstrating that they can be used as shape descriptors for recognition of distorted real objects. This test was done on a standard benchmark database ALOI [6]. We took 100 ALOI images and deformed each of them by the warping model (7) (see Fig. 1 for some examples). The coefficients of the deformations were generated randomly; $c$ from a range of admissible values and the rotation angle from $(-40^\circ, 40^\circ)$, both with uniform distribution. Each deformed image was then classified against the undistorted database by three different methods: by implicit invariants according to minimal norm, by Hu's rotation moment invariants [1] according to minimum distance, and by affine moment invariants (AMI) [2], also using the minimum-distance rule. In all three cases, six invariants were used.
The last two methods were selected for comparison because they are similar to the new technique in their nature (all of them are based on moments) and because they are traditional, well-established reference methods in pattern recognition. We ran the whole experiment several times with different deformation parameters. In each run the recognition rate we achieved was 99 or 100% for the implicit invariants, from 43 to 47% for the rotation invariants, and from 34 to 40% for the AMIs. These results illustrate two important facts. First, the implicit invariants can serve as an efficient tool for object recognition when the object deformation corresponds to the assumed model. Secondly, in the case of nonlinear distortions the implicit invariants significantly outperform both rotation and affine moment invariants, which corresponds to our theoretical expectation. When only rotation is present, all three methods are equivalent. To illustrate this, we ran the experiment once again with $c$ fixed to 0. Then the recognition rate of all three methods was 100%.
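The warping model (7) is easy to state in code. The sketch below is illustrative only; the function names and the parameter values (a 30° rotation angle, scale 1.2, c = 0.1) are assumptions chosen for the example. It also evaluates the Jacobian determinant $(a^2 + b^2)(1 + 2c(ax + by))$, which follows directly from (7) and is the factor $|J_r|$ needed in condition (3).

```python
import numpy as np

def warp(x, y, a, b, c):
    """Warping model (7): rotation with scaling, then quadratic term in x~."""
    u = a * x + b * y
    return u + c * u**2, -b * x + a * y

def warp_jacobian_det(x, y, a, b, c):
    """Determinant of the Jacobian of (7): (a^2 + b^2) * (1 + 2c(ax + by))."""
    return (a**2 + b**2) * (1.0 + 2.0 * c * (a * x + b * y))

# Assumed example parameters (not taken from the paper).
theta, scale, c = np.deg2rad(30.0), 1.2, 0.1
a, b = scale * np.cos(theta), scale * np.sin(theta)

x, y = np.meshgrid(np.linspace(-1.0, 1.0, 5), np.linspace(-1.0, 1.0, 5))
x_t, y_t = warp(x, y, a, b, c)
print(warp_jacobian_det(x, y, a, b, c).min())
```

With c = 0 the model reduces to a pure rotation with scaling, which is the setting in which the three compared methods performed equally in the experiment above.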
Fig. 1. The original images from the ALOI database (top) and their deformed versions (bottom)
The second experiment was done on real images taken in our lab. It illustrates the good performance and high recognition power of the implicit invariants even in the case where theoretical assumptions about the degradation are not fulfilled. With a standard digital camera (Olympus C-5050), we took a photo of letters printed on a label which was glued to a bottle, see Fig. 2(a). The letters were organized in a 4 × 3 mesh with "A"s, "B"s, "V"s and "X"s each printed three times in a row. After a simple segmentation, the letters were labeled from left to right A1, A2, A3, B1, ..., V1, ..., X1, X2, and X3. Due to the curvature of the bottle surface, the letters appear distorted in the horizontal direction and the distortion grows to the right. A1 does not exhibit any visible distortion while A3 is the most distorted one, and likewise for the other three letters. The task was to recognize (classify) these letters against a database containing the full English alphabet (26 undistorted letters of the same font). In an "ideal" case when the camera is at infinity the image distortion can be described by an orthogonal projection of the cylinder onto a plane, i.e.
\[ \begin{pmatrix} \tilde x \\ \tilde y \end{pmatrix} = \begin{pmatrix} r \sin(x/r) \\ y \end{pmatrix} , \]
where $r$ is the bottle radius, $x, y$ are the coordinates on the bottle surface and $\tilde x, \tilde y$ are the coordinates in the acquired image. In our case the object-to-camera distance was finite, so a small perspective effect appears in addition to the above model. We assume the actual image deformation can be approximated by a quadratic polynomial in the $x$ direction. Although it is clear that such an approximation cannot be very accurate, we will demonstrate that it is accurate enough for our purpose. We classified all deformed letters by means of implicit invariants in the same way as in the previous experiment and also by Hu's moment invariants. The table in Fig. 2(b) summarizes the classification results of both methods.
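How rough the quadratic approximation of the cylindrical model is can be gauged with a small least-squares experiment. The sketch below is not from the paper: the radius (1.0, arbitrary units) and the window of surface coordinates (0 to 0.8, starting where the surface faces the camera) are assumptions made purely for illustration, and the perspective effect is ignored.

```python
import numpy as np

radius = 1.0                        # assumed bottle radius (arbitrary units)
x = np.linspace(0.0, 0.8, 201)      # assumed window of surface coordinates
x_img = radius * np.sin(x / radius) # ideal orthogonal projection of the cylinder

def max_fit_error(degree):
    """Largest deviation of a least-squares polynomial fit of given degree."""
    coeffs = np.polyfit(x, x_img, degree)
    return np.max(np.abs(np.polyval(coeffs, x) - x_img))

err_linear = max_fit_error(1)       # what a rotation/scaling model could capture
err_quadratic = max_fit_error(2)    # what the quadratic model captures
print(err_linear, err_quadratic)
```

Under these assumptions the quadratic fit leaves a noticeably smaller residual than the linear one, though it is still not exact, which is in line with the remark that the approximation is rough but usable.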
Implicit invariants provided a perfect recognition rate as all deformed letters were classified correctly with high minimum confidence (see the definition in (g), Section 4). This might be a bit of a surprise if one considers the very rough approximation of the transformation model made above. It indicates some degree of robustness of the implicit invariants to the type of the image deformation. Hu's moment invariants classified correctly only the letters without any quadratic deformation (A1, B1, V1, X1) and failed in other cases, which is in agreement with the theory.

            A1   A2   A3   B1   B2   B3   V1   V2   V3   X1   X2   X3
Imp-inv.    A    A    A    B    B    B    V    V    V    X    X    X
confidence  65   29   15   64   186  69   96   62   80   35   98   19
Hu-inv.     A    Y    N    B    S    S    V    G    N    X    X    I

Fig. 2. (a) Letters captured by a standard digital camera exhibit distortion due to the cylindrical shape of the bottle. (b) Classification of four letters (each having three different degrees of distortion) by implicit invariants (first row) with confidence in the second row, and classification by Hu's invariants (third row).

Acknowledgement. This work was partially supported by the Czech Ministry of Education under the project 1M0572 (Research Center DAR). F. Šroubek was also supported by the Czech Academy of Sciences under the project AV0Z10750506-I055.
References
1. Hu, M.K.: Visual pattern recognition by moment invariants. IRE Trans. Information Theory 8, 179–187 (1962)
2. Flusser, J., Suk, T.: Pattern recognition by affine moment invariants. Pattern Recognition 26, 167–174 (1993)
3. Reiss, T.H.: The revised fundamental theorem of moment invariants. IEEE Trans. Pattern Analysis and Machine Intelligence 13, 830–834 (1991)
4. Flusser, J.: On the independence of rotation moment invariants. Pattern Recognition 33(9), 1405–1410 (2000)
5. Kautsky, J., Golub, G.: On the calculation of Jacobi matrices. Linear Algebra Appl. 52/53, 439–455 (1983)
6. Amsterdam Library of Object Images: http://staff.science.uva.nl/~aloi/
An Automatic Microarray Image Gridding Technique Based on Continuous Wavelet Transform Emmanouil Athanasiadis1, Dionisis Cavouras2, Panagiota Spyridonos1, Ioannis Kalatzis2, and George Nikiforidis1 1
Medical Image Processing and Analysis (M.I.P.A.) Group, Laboratory of Medical Physics, School of Medical Science, University of Patras, 26500 Rion - Patras, Greece [email protected], [email protected], [email protected] http://mipa.med.upatras.gr 2 Medical Image and Signal Processing (MED.I.S.P.) Laboratory, Department of Medical Instruments Technology, Technological Educational Institute of Athens, Ag. Spyridonos Street, Aigaleo, 122 10, Athens, Greece {cavouras, ikalatzis}@teiath.gr http://www.teiath.gr/stef/tio/medisp/
Abstract. In the present study, a new gridding method based on the continuous wavelet transform (CWT) is presented. Line profiles along the x and y axes were calculated, resulting in two different signals. These signals were independently processed by means of the CWT at 15 different scales, using the Daubechies 4 mother wavelet. A point-by-point summation was performed on the processed signals in order to suppress noise and enhance the differences between spots. Additionally, a wavelet-based hard thresholding filter was applied to each signal in order to reduce its noise. 14 real microarray images were used in order to visually assess the performance of our gridding method. Each microarray image contained 4 sub-arrays of 40×40 spots each, thus 6400 spots in total. Moreover, these images contained contamination areas. According to our results, the accuracy of our algorithm was 98% in all 14 images and in all spots. Additionally, the processing time was less than 3 sec on a 1024×1024×16 microarray image, rendering the method a promising technique for efficient and fully automatic gridding. Keywords: Microarrays, gridding, Continuous Wavelet Transform.
1 Introduction
Microarray imaging is used for the simultaneous identification of thousands of genes in bioinformatics [1]. In a complementary DNA (cDNA) microarray experiment, mean fluorescence intensity values are calculated. These intensities are closely related to the expression level of a specific gene. Hence, the more precise the localization of a spot, the more accurate the intensity measurement is. Consequently, a more precise expression measurement of a gene is obtained.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 864 – 870, 2007. © Springer-Verlag Berlin Heidelberg 2007
In order to acquire spot intensity measurements, three major steps are followed [1][2]: (1) the gridding or addressing step, where the precise location of the spots and their surroundings is determined, (2) the segmentation step, where the foreground of each cell is discriminated from the background, and (3) the intensity extraction step, where the mean fluorescence value of each spot is calculated. Nevertheless, none of the above steps is trivial, due to the fact that microarray images are usually highly contaminated with noise and artifacts introduced during their construction [1]. More precisely, rotations, misalignments, and local deformations of the ideally rectangular grids of the image occur, rendering the whole process demanding. Gridding is the first step in the chain of the expression measurement process. It is vital that the gridding be accurate, since any errors made during the addressing procedure are transferred to the following steps. A lot of work has already been done on automatic gridding. However, initialization or other parameters are usually needed before the gridding can be applied. Jain et al. [3] described a gridding method that requires as input the rows and columns of all grids. In the technique of Katzer et al. [4], gaps between the grids are essential, while in Steinfath et al. [5], filter arrays with radioactive labels are used. On the other hand, many researchers create the grid manually and then perform an automatic procedure for intensity measurement [1]. In this paper, a novel fully automatic gridding approach is presented, based on the properties of the continuous wavelet transform (CWT) [6]. This technique attempts to solve the automatic gridding problem efficiently, providing a tool for accurate and fast microarray image processing.
2 Material and Methods
In the present study, 9 microarray images of Saccharomyces cerevisiae were obtained from a publicly available database [7]. In those images, contamination areas as well as a slight shift of the alignment were visible. Each image consists of 4 subarrays, each subarray consists of 40×40 spots, leading to 6400 spots in total.
2.1 Continuous Wavelet Transform
The CWT is a transform that is used to decompose a signal into wavelets, small oscillations that are highly localized in time, and is described by eq. 1:
\[ C(a, b) = \int_{-\infty}^{+\infty} f(t)\, \frac{1}{\sqrt{|a|}}\, \overline{\psi\!\left(\frac{t - b}{a}\right)}\, dt \tag{1} \]
where $f(t)$ is the original signal, $a$ represents scale, $b$ represents time or space, $\psi$ represents the mother wavelet and $\bar z$ represents the complex conjugate of $z$. The Daubechies (db) family was chosen because this mother wavelet family gave better results. Furthermore, we worked with only four coefficients (db 4), since using a greater number of coefficients did not result in any significant change in our results. On the other hand, keeping the number of coefficients as low as possible was essential for our methodology, since the processing time is an important issue for the gridding procedure. After applying the CWT, a hard-threshold wavelet-based technique [8] was performed according to eq. 2:
\[ W_{out} = \begin{cases} W_{in} + T \cdot (G - 1) & \text{if } W_{in} > T \\ W_{in} - T \cdot (G - 1) & \text{if } W_{in} < -T \\ 0 & \text{otherwise} \end{cases} \tag{2} \]
where $W_{out}$ denotes the output and $W_{in}$ the input coefficient values of the details. $T$ and $G$ are the threshold and gain values, respectively (Fig. 1).
Fig. 1. Wavelet-based hard thresholding mapping layout
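The thresholding map of eq. 2 translates directly into a few lines of NumPy. The following is a minimal sketch with made-up coefficient values; taking the threshold as 10% of the maximum absolute coefficient is an assumption about how "10% of the maximum signal value" is meant.

```python
import numpy as np

def hard_threshold(w_in, T, G=1.0):
    """Eq. 2: zero coefficients with |w| <= T, offset the rest by T*(G - 1)."""
    w_in = np.asarray(w_in, dtype=float)
    w_out = np.zeros_like(w_in)
    above = w_in > T
    below = w_in < -T
    w_out[above] = w_in[above] + T * (G - 1.0)
    w_out[below] = w_in[below] - T * (G - 1.0)
    return w_out

# Made-up coefficients, for illustration only.
coeffs = np.array([-5.0, -0.3, 0.0, 0.4, 2.0, 10.0])
T = 0.1 * np.max(np.abs(coeffs))      # 10% of the maximum magnitude, here 1.0

unit_gain = hard_threshold(coeffs, T)          # G = 1: plain hard thresholding
double_gain = hard_threshold(coeffs, T, G=2.0) # G = 2: surviving values pushed outward
print(unit_gain, double_gain)
```

With unit gain, as used for the reported results, the map leaves every surviving coefficient unchanged and only zeroes the small ones.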
2.2 Gridding Procedure
The gridding procedure was accomplished as follows. Firstly, line profiles along the x and y axes were calculated and two signals were obtained. Secondly, the CWT was applied to both signals at 15 scales, using the db4 mother wavelet, and 15 detail signals were obtained for each signal (Fig. 2). The number of scales was set to 15 because no significant information was revealed at higher scales. During the decomposition, high-frequency details are distinguished more accurately from noise structures, as noise is not present in all decomposed signals, in contrast to the alternation of spots with their background, which can be present at all levels. Thirdly, a point-by-point summation of the signals of all 15 scales was performed in order to increase the actual signal of the spots and decrease the noise. Fourthly, a hard-thresholding wavelet-based technique [6] was applied to each signal in order to suppress noise (Fig. 3). Finally, local maxima and local minima were calculated on both signals, corresponding to the centers and the boundaries of the spots, respectively. Optimal results were obtained by setting the threshold equal to 10% of the maximum signal value and the gain equal to unity. The described method is briefly summarized in the following steps:
Step 1: Load the initial image.
Step 2: Calculate the line profiles along the x and y axes.
Step 3: Apply the CWT to both the x-axis and y-axis signals, up to 15 scales, using db4 as the mother wavelet.
Step 4: For each of the two signals, sum the CWT coefficients point by point from the 1st to the 15th scale.
Step 5: Apply the wavelet denoising technique using 10% of the maximum value of each signal as the threshold.
Step 6: Find the local maxima (centers of spots) and local minima (boundaries of spots) along the x and y axes.
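Steps 1 to 6 can be sketched on a synthetic image. The code below is an illustration under loudly stated assumptions, not the authors' implementation: it substitutes a hand-written Ricker ("Mexican hat") wavelet for db4, uses reflect padding at the borders, and works on a noise-free synthetic array of Gaussian spots (16-pixel pitch) rather than a real microarray scan.

```python
import numpy as np

def ricker(scale, half_width):
    """Sampled Ricker wavelet at a given scale (stand-in for db4, an assumption)."""
    t = np.arange(-half_width, half_width + 1) / scale
    return (1.0 - t**2) * np.exp(-0.5 * t**2) / np.sqrt(scale)

def summed_cwt(signal, scales):
    """Steps 3-4: filter the profile at every scale and sum point by point."""
    total = np.zeros(len(signal))
    for s in scales:
        h = 8 * s                                   # kernel half-width
        padded = np.pad(signal, h, mode="reflect")  # border handling (assumption)
        total += np.convolve(padded, ricker(s, h), mode="valid")
    return total

def hard_threshold(w, T, G=1.0):
    """Step 5: thresholding map of eq. 2."""
    out = np.zeros_like(w)
    out[w > T] = w[w > T] + T * (G - 1.0)
    out[w < -T] = w[w < -T] - T * (G - 1.0)
    return out

def local_extrema(s):
    """Step 6: strict interior local maxima (> 0) and minima (< 0)."""
    i = np.arange(1, len(s) - 1)
    maxima = i[(s[i] > s[i - 1]) & (s[i] > s[i + 1]) & (s[i] > 0)]
    minima = i[(s[i] < s[i - 1]) & (s[i] < s[i + 1]) & (s[i] < 0)]
    return maxima, minima

def grid_lines(profile, scales=range(1, 16)):
    """Spot centers and spot boundaries along one axis."""
    w = summed_cwt(profile, scales)
    w = hard_threshold(w, 0.1 * np.max(np.abs(w)))  # T = 10% of max, G = 1
    return local_extrema(w)

# Step 1: synthetic "microarray" with Gaussian spots on a constant background.
size, pitch, sigma = 129, 16, 3.0
spot_centers = np.arange(8, size, pitch)            # 8, 24, ..., 120
yy, xx = np.mgrid[0:size, 0:size]
image = np.full((size, size), 0.1)
for cy in spot_centers:
    for cx in spot_centers:
        image += np.exp(-((xx - cx)**2 + (yy - cy)**2) / (2.0 * sigma**2))

# Step 2: line profiles along x and y, then steps 3-6 per axis.
centers_x, bounds_x = grid_lines(image.sum(axis=0))
centers_y, bounds_y = grid_lines(image.sum(axis=1))
print(centers_x, bounds_x)
```

On this idealized input the maxima coincide with the spot centers and the minima with the midpoints between neighbouring spots; real images with noise, rotation or contamination would need the robustness that the paper attributes to the multi-scale summation and thresholding.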
Fig. 2. X-axis signal processed by the CWT in 3 dimensions (coefficient, time or space, scale). In this image, scales a are ranging from 1 to 15, space b is the dimension of the signal (here 1024) and COEFS are the resulted coefficient values of the wavelet transform.
Fig. 3. Initial signal (a) and signal processed with the hard-threshold wavelet-based filtering technique (b). Local maxima correspond to the centers and local minima to the boundaries of the cells. In (b) noise has been suppressed by the wavelet threshold filter.
3 Results and Discussion
The proposed method was evaluated by visually inspecting [9] the gridding results and assigning each spot to one of two categories: either the grid cell perfectly surrounds the spot (category 1) or it does not (category 2). Additionally, our method was compared with the method proposed by Blekas et al. [9]. According to our findings, our algorithm was 98% accurate in all 14 microarray images, compared with 95% for the other method. It should be noted that the microarray images were not pre-processed, thus noise affected the outcome of both gridding algorithms. Nevertheless, in the proposed technique, noise suppression by the hard-thresholding filter led to higher accuracy. A typical contaminated area of a microarray image with the grid overlaid is illustrated in Fig. 4. As we can conclude from the results, the contamination at the boundaries of the arrayer is eliminated by the hard thresholding technique. It should be noted that the processing time for the gridding, on a 1024×1024×16 microarray image, was lower than 3 sec (processor: Pentium IV 3.00 GHz, 512 MB RAM), rendering the technique a valuable tool for a fully automated microarray image processing application.
Fig. 4. Gridding results of a contaminated arrayer. The contamination is localized in the middle of the figure.
4 Conclusion
In the present study we have proposed a new method for automatic addressing of microarray images based on the continuous wavelet transform. Following this technique, the noise of microarray images was sufficiently suppressed, and the boundaries of the spots were delineated with high accuracy. The processing time was minimal, making the method an effective tool for the demanding task of microarray image processing.
Acknowledgements We would like to thank the Greek State Scholarships Foundation (I.K.Y.) for funding the above work.
References
1. Yang, Y.H., Buckley, M.J., Dudoit, S., Speed, T.P.: Comparison of methods for Image Analysis on cDNA Microarray Data. Journal of Computational and Graphical Statistics 11, 108–136 (2002)
2. Schena, M., Shalon, D., Davis, R.W., Brown, P.O.: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470 (1995)
3. Jain, A., Tokuyasu, T., Snijders, A., Segraves, R., Albertson, D., Pinkel, D.: Fully Automatic Quantification of Microarray Image Data. Genome Res, 325–332 (2002)
4. Katzer, M., Kummert, F., Sagerer, G.: Automatische Auswertung von Mikroarraybildern. In: Workshop Bildverarbeitung für die Medizin, Leipzig (2002)
5. Steinfath, M., Wruck, W., Seidel, H.: Automated image analysis for array hybridization experiments. Bioinformatics 17, T. 7 S., 634–641 (2001)
6. Daubechies, I.: Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, Soc for Industrial & Applied Math (1992)
7. DeRisi, J.L., Iyer, V.R., Brown, P.O.: Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale. Science 278, 680 (1997)
8. Athanasiadis, E., Glotsos, D., Daskalakis, A., Bougioukos, P., Kostopoulos, S., Theocharakis, P., Spyridonos, P., Kalatzis, I., Nikiforidis, G., Cavouras, D.: Microarray Image Enhancement Techniques Using the Discrete Wavelet Transform. In: Second International Conference From Scientific Computing to Computational Engineering (2nd IC-SCCE 2006), Athens (July 5-8, 2006)
9. Blekas, K., Galatsanos, N., Likas, A., Lagaris, I.E.: Mixture Model Analysis of DNA Microarray Images. IEEE Transactions on Medical Imaging 24, 901–907 (2005)
Image Sifting for Micro Array Image Enhancement Pooria Jafari Moghadam and Mohamad H. Moradi Biomedical faculty of AmirKabir University of Technology, Tehran, Iran Bioinstrumentation & Signal processing Lab, AmirKabir University of Technology, Tehran, Iran [email protected]
Abstract. cDNA micro arrays are more and more frequently used in molecular biology as they can give insight into the relation between an organism's metabolism and its genome. The process of imaging a micro array sample can introduce a great deal of noise and bias into the data, with higher variance than the original signal, which may swamp the useful information. As imperfections and fabrication artifacts often impair our ability to measure accurately the quantities of interest in micro array images, image processing for the analysis of these images is an important and challenging problem, and eliminating the effect of the noise is a central part of it. In this paper we implemented a novel algorithm for image sifting which can remove objects of a definite size from micro array images. We used regular moving grids to sift out noise objects and obtained clean images for segmentation. The results have been compared with SWT, DWT and Wiener filter denoising. Keywords: sifting, micro array, enhancement, image, denoising.
1 Introduction
Micro arrays have become the tool of choice for the global analysis of gene expression. Powerful statistical tools are now available to analyze this expression and to gain an understanding of how changes in gene expression patterns impact biological systems. Currently, several different platforms have evolved from the origin of this imaging technique, which goes back to the 1970's [1]. The analysis of such data has become a computationally intensive task that requires technological developments at various stages, from the design of the array to image analysis, database storage, data processing and clustering, and information extraction. Further innovations were made by M. Schena et al. [2], who used a nonporous solid support to facilitate miniaturization and fluorescence-based detection. Another improvement was the set of methods introduced by S. Fodor et al. [3] for high-density spatial synthesis of oligonucleotides, which makes it possible to monitor changes in the expression patterns of thousands of genes. Image analysis is an important aspect of micro array experiments that can affect subsequent analysis such as the identification of differentially expressed genes. Image processing for micro array images includes three tasks: spot gridding, segmentation and information extraction. In recent years, a large number of commercial tools have been developed for micro array image processing [4, 5, 6, 7, 8].
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 871 – 877, 2007. © Springer-Verlag Berlin Heidelberg 2007
Micro arrays are a scientific tool that should be viewed in a similar fashion to any other laboratory technique, with careful experimental planning, replication, and proper statistical analysis. There is a need to examine these data with statistical techniques to help discern possible patterns in the data. Such statistical methods include the analysis of variance introduced by Kerr [9], the ratio distribution by Chen et al. [10] and Ermolaeva et al. [11], and the Gamma distribution by Newton 2001 [12]. These methods deal mainly with measurement error, such as preparation of the sample, cross hybridization, and fluctuation of the fluorescence value from gene to gene. But none deals particularly with the effect of the noise. By giving information on the levels of expression of thousands of genes at the same time, DNA micro arrays allow researchers, for example, to relate the effect of a disease to particular genes. The basic procedure for a micro array experiment can be described simply as follows. RNA is extracted from a cell or tissue sample and then converted to cDNA. Fluorescent tags (usually Cy3 and Cy5) are enzymatically incorporated into the newly synthesized cDNA or can be chemically attached to the new strands of DNA. A cDNA molecule that contains a sequence complementary to one of the single-stranded probe sequences on the array will hybridize, via base pairing, to the spot at which the complementary reporters are affixed. The spot will then fluoresce when examined using a micro array scanner. The fluorescence intensity of each spot is then evaluated in terms of the number of copies of a particular mRNA, which ideally indicates the level of expression of a particular gene. A schematic diagram for this process created. Methods to reduce the noise sources include the use of clean glass slides and a higher laser power rather than a higher PMT voltage.
However, these are not adequate for the required image quality, and an enhancement procedure in software, embedded within the process, is a much better alternative. The results of the micro array experiment are two 16-bit tagged image files, one for each fluorescent dye. The image is not perfect and includes noise that blurs it and hampers further gene expression analysis. The noise originates from different sources during the course of the experiment, such as photon noise, electronic noise, treatment of the glass slide, background fluorescence, laser light reflection, and dust on the slide. Hence, it is crucial to denoise the resultant image within this process. In this paper, we implemented a novel algorithm for image sifting which can remove objects of a definite size from micro array images. We used regular moving grids to sift out noise objects and obtained clean images for segmentation. The results have been compared with SWT, DWT and Wiener filter denoising.
2 Material and Methods

Here, we have implemented a novel algorithm for image sifting which can remove objects of a definite size from micro array images. One example of the distortions found in micro array images is impulse noise. Background correction is an important task in the analysis of micro arrays. It is necessary in order to remove the contribution to the intensity which is not due to the hybridization of the cDNA samples to the spotted DNA, so we first make the background of the micro array image uniform.
Image Sifting for Micro Array Image Enhancement
Thresholding is important in image processing for extracting objects from their background. If the histogram has a deep and sharp valley between two peaks, it is easy to threshold the image by simply choosing the valley bottom as the threshold point. In some cases, however, the valley of the histogram is flat and broad, or the two peaks are extremely unequal in height, making it difficult to locate the valley bottom. We have used an optimal multi-threshold method to overcome some of these problems. The method maximizes the between-class variance between bright and dark pixels [13]. After thresholding, the binary image was sifted by a moving grid. We created a suitable grid to remove the desired objects from the image.
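The between-class variance criterion of [13] can be illustrated with a short sketch. The code below is a single-threshold simplification written for illustration only; it is not the multi-threshold implementation used in the experiments, and the function name `otsu_threshold` and the toy pixel list are assumptions.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    `pixels` is a flat list of integer grey levels in [0, levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0    # number of pixels in the dark class
    sum_dark = 0  # sum of grey levels in the dark class
    for t in range(levels):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_bright = total - n_dark
        if n_dark == 0:
            continue          # dark class still empty
        if n_bright == 0:
            break             # bright class empty from here on
        mean_dark = sum_dark / n_dark
        mean_bright = (sum_all - sum_dark) / n_bright
        # Between-class variance up to the constant factor 1 / total**2.
        var_between = n_dark * n_bright * (mean_dark - mean_bright) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


# Toy data (hypothetical): dark background near 10-12, bright spots near 200-210.
pixels = [10] * 50 + [12] * 50 + [200] * 30 + [210] * 30
t = otsu_threshold(pixels)
binary = [1 if p > t else 0 for p in pixels]
```

Because the criterion is flat between the two clusters, the sketch returns the lowest grey level that separates them; any value up to the start of the bright cluster would give the same binary image.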
Fig. 1. Grid structure
The grid equation can be written as:

\[ \sum_{i} \left( y = i \cdot W_V + y_0 \right) + \sum_{i} \left( x = i \cdot W_H + x_0 \right) \tag{1} \]

where $W_V = W_H$ can be 2, 4, 8, …, 256, $(x_0, y_0)$ is the initial point and, if $P = 512 / W_V$, then $i = 0, 1, 2, \ldots, P$.

In order to extract new edges for each grid window, the edges of each grid window were traced. The tracer moved along the edge of a grid window in such a way that the background was always on its left side, so that when the edge of a grid window was crossed by an object in the image, the tracer followed the edge of that object to form a new edge for the window (see Fig. 2). Thus, the edges of the grid windows are distorted and reshaped according to the edges of the objects detected in the path of the moving tracer. As in physical sifting, we removed all objects that were enclosed inside these new grid windows. The grid was then moved along the x and y axes, pixel by pixel, to sift the whole image. For this purpose, we varied $x_0$ from zero to $W_H$ while $y_0$ was held constant, then changed $y_0$ in the range from zero to $W_V$ and again varied $x_0$ from zero to $W_H$ for each step. This allowed us to sift the whole image completely (see Fig. 3).
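The effect of the moving grid can be sketched in a few lines. The code below is a simplified illustration, not the tracer-based implementation described above: it assumes that an object is "enclosed" by a window whenever none of its pixels lies on a grid line for some offset $(x_0, y_0)$, and it uses 4-connected components as objects. All function names are hypothetical.

```python
from collections import deque


def label_components(img):
    """Return the 4-connected foreground objects of a binary image as pixel lists."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(pixels)
    return components


def touches_grid(pixels, w, x0, y0):
    """True if any pixel lies on a grid line x = i*w + x0 or y = i*w + y0."""
    return any((x - x0) % w == 0 or (y - y0) % w == 0 for y, x in pixels)


def fits_in_some_window(pixels, w):
    """True if, for at least one grid offset, the object touches no grid line."""
    return any(not touches_grid(pixels, w, x0, y0)
               for y0 in range(w) for x0 in range(w))


def sift(img, w):
    """Remove every object that is enclosed by a w-by-w grid window at some offset."""
    out = [row[:] for row in img]
    for pixels in label_components(img):
        if fits_in_some_window(pixels, w):
            for y, x in pixels:
                out[y][x] = 0
    return out


# Toy image (hypothetical): one isolated noise pixel and one 5x5 spot.
img = [[0] * 12 for _ in range(12)]
img[2][2] = 1
for r in range(5, 10):
    for c in range(5, 10):
        img[r][c] = 1

cleaned = sift(img, 4)   # removes the single pixel, keeps the 5x5 spot
emptied = sift(img, 8)   # a coarser grid also removes the 5x5 spot
```

Under this simplification the window width acts as the size selector: an object survives only if it is too large to fit between neighbouring grid lines at every offset.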
Fig. 2. Grid window gets a new structure because of the object edge effect
Fig. 3. Grid movement along x-axis
3 Result

Figure 4 shows the denoised image obtained with our sifting approach. As can be seen, in contrast to other approaches this method has no serious effect on image clarity, such as blurring or darkening. This allows us to continue processing without the need for any alignment.

Fig. 4. (a) Original image (b) Enhanced image by sifting

In order to provide a quantitative measure of the resulting images, we used the universal index proposed in [14]. Let $x = \{x_i \mid i = 1, 2, \ldots, N\}$ and $y = \{y_i \mid i = 1, 2, \ldots, N\}$ be the original and the test image signals, respectively. The proposed quality index is defined as

\[ Q = \frac{4\,\sigma_{xy}\,\bar{x}\,\bar{y}}{\left(\sigma_x^2 + \sigma_y^2\right)\left[(\bar{x})^2 + (\bar{y})^2\right]} \tag{2} \]

where

\[ \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i, \]

\[ \sigma_x^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2, \qquad \sigma_y^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y})^2, \]

\[ \sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}). \]
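As a hedged illustration, Eq. (2) can be evaluated directly from its definitions. The sketch below computes a single global index over two flattened signals; the function name `quality_index` and the toy signals are assumptions, and no attempt is made to reproduce the values reported in this paper.

```python
def quality_index(x, y):
    """Universal quality index Q of Eq. (2) for two equal-length signals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # Undefined (division by zero) if both signals are constant or both means are zero.
    return 4 * cov_xy * mean_x * mean_y / ((var_x + var_y) * (mean_x ** 2 + mean_y ** 2))


original = [1.0, 2.0, 3.0, 4.0]
q_same = quality_index(original, original)                # identical signals
q_scaled = quality_index(original, [2.0, 4.0, 6.0, 8.0])  # same shape, doubled intensity
```

Identical signals give Q = 1, and any change in mean, contrast or correlation lowers the index, which is why a higher Q is read as a better denoising result.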
Using this index, the quality of the images denoised by the Stationary Wavelet Transform (SWT), the Discrete Wavelet Transform (DWT), the Wiener filter and sifting was 0.7837, 0.7790, 0.7885 and 0.8073, respectively. As can be clearly observed, the best denoising quality was achieved by sifting. To test the method further, we carried out several other simulations using different micro array images. The results are shown in Table 1.

Table 1. Comparative performance of the mean of quality indexes

Method          Mean quality index
DWT             0.7416
SWT             0.7531
Wiener filter   0.7723
Sifting         0.8142
4 Discussion

It is well known that micro array technology can monitor thousands of DNA sequences in a high-density array on a glass slide. Two mRNA samples are reverse-transcribed into cDNA, labeled using different fluorescent dyes, then mixed and hybridized with the arrayed DNA sequences. After this competitive hybridization, the slides are imaged by a scanner to measure the fluorescence intensity of each dye. The image is not perfect: it contains noise that degrades it for further gene expression analysis. The noise originates from different sources during the course of the experiment, such as photon noise, electronic noise, laser light reflection, dust on the slide, and so on. Hence, it is crucial to denoise the images resulting from this process. Existing methods to reduce the noise at its source include the use of clean glass slides and a higher laser power rather than higher PMT voltages. However, these are not adequate for the required image quality, and an enhanced software procedure embedded within the process would be a much better alternative. As imperfections and fabrication artifacts often impair our ability to measure accurately the quantities of interest in micro array images, image processing for the analysis of these images is an important and challenging problem. Eliminating the effect of the noise is a challenging problem in micro array analysis.

In this paper we implemented a novel algorithm for image sifting which can remove objects of a definite size from micro array images. We used regular moving grids to sift out noise objects and obtained clean images for segmentation. The simulation results on example micro array images confirmed this enhancement and the denoising quality. Sifting provides better performance in denoising micro array images than traditional transform methods. The results in Table 1 also show that sifting achieves a mean quality index about 5% higher than that of the Wiener filter, which is widely used in commercial denoising software, and about 10% higher than that of DWT. The application of this method would improve the accuracy of gene expression measurements and therefore make it easier to identify diseased genes when diagnosing critical diseases.
References

1. Southern, E.M.: Detection of specific sequences among DNA fragments separated by gel electrophoresis. J. Mol. Biol. 98, 503–517 (1975)
2. Schena, M., Shalon, D., Davis, R.W., Brown, P.O.: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467–470 (1995)
3. Fodor, S.P.A., Read, J.L., Pirrung, M.C., Stryer, L., Lu, A.T., Solas, D.: Light-directed, spatially addressable parallel chemical synthesis. Science 251, 767–773 (1991)
4. Steinfath, M., Wruch, W., Seidel, H., Lehrach, H., Radelof, U., O’Brien, J.: Automated image analysis for array hybridization experiments. Bioinformatics 17(7), 634–641 (2001)
5. Bozinov, D., Rahnenfuhrer, J.: Unsupervised technique for robust target separation and analysis of DNA microarray spots. Bioinformatics 18(5), 747–756 (2002)
6. Wruch, W., Griffiths, H., Steinfath, M., Lehrach, H., Radelof, U., O’Brien, J.: Xdigitise: Visualization of hybridization experiments. Bioinformatics 18(5), 757–760 (2002)
7. Zapala, M.A., Lockhart, D.J., Pankratz, D.G., Garcia, A.J., Barlow, C., Lockhard, D.J.: Software and methods for oligonucleotide and cDNA array data analysis. Genome Biol. 3(6) (2002)
8. Jain, A.N., Tokuyasu, T.A., Snijders, A.M., Segraves, R., Albertson, D.G., Pinkel, D.: Fully automatic quantification of microarray image data. Genome Res. 12, 325–332 (2002)
9. Kerr, M.K., Martin, M., Churchill, G.A.: Analysis of variance for gene expression microarray data. J. Comput. Biol. 7, 819 (2001)
10. Chen, Y., Dougherty, E.R., Bittner, M.L.: Ratio-based decisions and the quantitative analysis of cDNA microarray images. J. Biomed. Optics, 364–374 (1997)
11. Ermolaeva, O., Rastogi, M., Pruitt, K.D., Schuler, G.D., Bittner, M.L., Chen, R., Simon, P.M., Trent, J.M., Boguski, M.: Data management and analysis for gene expression arrays. Nature Genetics 20, 19–23 (1998)
12. Newton, M.A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., Tsui, K.W.: On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. J. Computat. Biol. 8, 37–52 (2001)
13. Otsu, N.: A Threshold Selection Method from Gray-Level Histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1), 62–66 (1979)
14. Wang, Z., Bovik, A.: A universal image quality index. IEEE Trans. Signal Processing Letter 9, 81–84 (2002)
Wavelet Based Local Coherent Tomography with an Application in Terahertz Imaging

Xiao-Xia Yin¹, Brian W.-H. Ng¹, Bradley Ferguson², and Derek Abbott¹

¹ Centre for Biomedical Engineering, School of Electrical & Electronic Engineering, The University of Adelaide, SA 5005, Australia
Tel.: +61-8-8303-5748 Fax: +61-8-8303-4360
[email protected]
² Tenix Defence Systems Pty Ltd, Technology Park, Mawson Lakes, SA 5095
Abstract. Terahertz Computed Tomography (THz-CT) is a form of optical coherent tomography, which offers a promising approach for achieving non-invasive inspection of solid materials, with potentially numerous applications in industrial manufacturing and biomedical engineering. With traditional CT techniques such as X-ray tomography, full exposure data are needed to produce cross sectional images, even if the region of interest is small. For time-domain terahertz measurements, the requirement for full exposure data is impractical due to the slow measurement process. In this paper, we apply a wavelet-based algorithm to locally reconstruct THz-CT images with a significant reduction in the required measurements. The algorithm recovers an approximation of the region of interest from terahertz measurements within its vicinity, and thus improves the feasibility of using terahertz imaging to detect defects in solid materials and diagnose disease states for clinical practice.
1 Introduction

Terahertz (T-ray) radiation spans the frequency range from 0.1 THz to 10 THz in the electromagnetic spectrum. T-rays have promising potential for both in vivo and in vitro biosensing [7] owing to (i) their non-invasive property, and (ii) the fact that biomolecules [9] have rich resonances in the terahertz region. While one- and two-dimensional applications with time-domain terahertz spectroscopy have been well demonstrated in the past [4], the ability to non-destructively probe the inner three-dimensional structure of optically opaque structures is less well studied. There has been a relative scarcity of terahertz tomography work in the literature. The aim of our current work is to perform terahertz tomographic reconstructions, particularly localised reconstructions based on limited data.

The main goal of this paper is to present a wavelet based reconstruction algorithm for terahertz computed tomography and to show how this algorithm can be used to rapidly reconstruct the region of interest (ROI) with a reduction in the measurements of terahertz responses, compared with a standard reconstruction. This algorithm generates the approximation and detail images separately, and the final reconstruction is found by inverse wavelet transform. The current algorithm computes the approximate and detailed portions of the reconstructed image using ramp filtered scaling and wavelet functions within the more familiar filtered back-projection (FBP) framework. Compared to previous algorithms in terahertz computed tomography [3], the current reconstruction algorithm accelerates terahertz image scanning and reduces computational complexity, since it only requires knowledge of a small number of projections on lines passing through the ROI and its small immediate neighbourhood. In this paper, uniform exposure is assumed at all angles for simplicity in analysis as well as potential implementation in hardware. Reconstruction ability at off-centered and centered regions of interest is also explored.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 878–885, 2007. © Springer-Verlag Berlin Heidelberg 2007

1.1 A Brief Introduction to Terahertz Imaging

A terahertz CT system is based on a chirped terahertz time domain spectroscopy scanned imaging system. The target is mounted on a motion stage that allows it to be translated along both axes in the horizontal plane, as well as being rotated about its vertical axis. This setup is capable of producing tomographic data sets of an object with full variability of angle and offset distance from its centre. Detailed information on the chirped pulse scanning, and on the rectangular and polar coordinate systems used for image reconstruction, can be found in Ferguson et al. [2].
2 Methodology

2.1 An Overview of CT and Terahertz CT

Normally, a filtered back projection (FBP) algorithm begins with a collection of sinograms obtained from projection measurements. A sinogram $s(\xi, \theta)$ is simply generated via a collection of the projections at all angles:

\[ s(\xi, \theta) = \int_{L(\theta, \xi)} o(x, z)\, \mathrm{d}\ell, \tag{1} \]

where $L(\theta, \xi)$ is the line of all points at projection offset $\xi$, which satisfy the equation $x \cos\theta + z \sin\theta = \xi$, and $o$ denotes the measured image intensity of a target object, which is a function of pixel position in the $x$–$z$ plane. The FBP algorithm for terahertz CT reconstruction is expressed as follows:

\[ I(x, y) = \int_{0}^{\pi} \left[ \int_{-\infty}^{\infty} S(\theta, \beta)\, |\beta| \exp[i 2\pi \beta \xi]\, \mathrm{d}\beta \right] \mathrm{d}\theta, \tag{2} \]

where $S(\theta, \beta)$ is the spatial Fourier transform of the parallel projection data and $\beta$ is the spatial frequency in the $\xi$ direction. It should be noted that the operation of the ramp filter $|\beta|$ in Eq. (2) is equivalent to a differentiation followed by a Hilbert transform, which introduces a discontinuity in the derivative of the Fourier transform at zero frequency. This is the cause of the non-locality of the FBP algorithm.
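The non-locality of the ramp filter can be made concrete with a small sketch. It assumes the standard band-limited (Ram-Lak) spatial-domain discretisation of $|\beta|$ with unit sample spacing, which is a textbook choice and not taken from this paper; the function names are hypothetical.

```python
import math


def ram_lak_kernel(half_width):
    """Truncated spatial-domain ramp (Ram-Lak) kernel for unit sample spacing."""
    kernel = []
    for n in range(-half_width, half_width + 1):
        if n == 0:
            kernel.append(0.25)
        elif n % 2 == 0:
            kernel.append(0.0)
        else:
            kernel.append(-1.0 / (math.pi ** 2 * n ** 2))
    return kernel


def filter_projection(projection, half_width):
    """Convolve one projection with the truncated ramp kernel (zero padding)."""
    kernel = ram_lak_kernel(half_width)
    n_samples = len(projection)
    filtered = []
    for i in range(n_samples):
        acc = 0.0
        for k, h in enumerate(kernel):
            j = i - (k - half_width)   # input sample paired with this kernel tap
            if 0 <= j < n_samples:
                acc += h * projection[j]
        filtered.append(acc)
    return filtered


# A single impulse in the middle of a 41-sample projection.
impulse = [0.0] * 41
impulse[20] = 1.0
response = filter_projection(impulse, half_width=20)
```

The response to a single impulse is non-zero at odd offsets all the way to the ends of the projection and decays only as $1/n^2$, so a filtered sample inside the ROI depends on measurements far outside it. This slow decay is what the wavelet-modified filters of Section 2.3 are meant to avoid.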
2.2 Calculation of Terahertz Parameters for Reconstruction of Terahertz CT

One of the advantages that terahertz CT has over X-ray CT is that $s(\theta, \xi)$ may be one of several parameters derived from terahertz pulses. Fundamentally, a terahertz CT setup is capable of measuring the transmitted terahertz pulse as a function of time $t$, for a given projection angle and projection offset. In principle, terahertz sinograms can be obtained in both the time and frequency domains. In this paper, we review the calculation of terahertz sinograms in the time domain. This method is based on the assumption that the target is dispersionless and therefore the THz pulse shape is unchanged after propagation through the target, apart from attenuation and time delay. A reference terahertz pulse $p_i(t)$ is measured without the target in place. To estimate the phase shift of a terahertz pulse $p_d(t)$ going through the target, the two signals are resampled using lowpass interpolation. The two interpolated signals are then cross-correlated, and the lag that maximises the cross-correlation at each projection angle is taken as the estimate of the phase delay of $p_d(t)$. This delay is used as the extracted parameter for use in CT reconstruction.

2.3 Two Dimensional Wavelet Based CT Reconstruction
Two Dimensional Wavelet Transform (2D DWT). Wavelet transforms play an important role in many image processing algorithms. Fundamentally, wavelet decomposition corresponds to a multiresolution analysis of a signal. This has the advantage of much improved joint time-frequency localisation over Fourier based techniques. In practice, it is nearly always implemented using digital filters and downsamplers. In two dimensions, the discrete version of a wavelet transform can be realised by a 2D scaling function, $\phi(x, y)$, and three 2D wavelets, $\psi_1(x, y)$, $\psi_2(x, y)$, and $\psi_3(x, y)$, which are calculated by taking the 1D wavelet transform along the rows of $f(x, y)$ and then along the resulting columns. The 2D scaling function and 2D wavelet functions satisfy the equations given in Gonzalez and Woods [5].

Our current experiment uses symmetric (linear phase) filters for the analysis of tomographic reconstruction. Let $h_0$, $h_1$ denote a pair of linear phase low- and high-pass wavelet filters and $\tilde{h}_0$, $\tilde{h}_1$ denote the corresponding reconstruction filters. The discrete approximation at resolution $2^{j-1}$ can be obtained by combining the details and approximation at resolution $2^{j}$ using the reconstruction wavelet filters:

\[
\begin{aligned}
c_{j-1}(k, l) = \sum_{m,n} \Big[ & \tilde{h}_0(k - 2m)\,\tilde{h}_0(l - 2n)\, c_j(m, n) + \tilde{h}_0(k - 2m)\,\tilde{h}_1(l - 2n)\, d_j^{H}(m, n) \\
 & + \tilde{h}_1(k - 2m)\,\tilde{h}_0(l - 2n)\, d_j^{V}(m, n) + \tilde{h}_1(k - 2m)\,\tilde{h}_1(l - 2n)\, d_j^{D}(m, n) \Big].
\end{aligned}
\tag{3}
\]

Our method focuses the wavelet application on recovering local images from wavelet approximate and detail coefficients. In order to support the reconstruction filter for the recovery of the local target area, the calculation of these reconstructed coefficients includes the region of interest and a margin area.
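Eq. (3) can be checked on a tiny example. The sketch below evaluates the double sum literally for one synthesis step. It uses the orthonormal Haar filters as a stand-in for the linear phase filters of the experiments (an assumption made to keep the example short), and the function name `synthesize_2d` and the convention that the first index $k$ runs over rows are likewise assumptions.

```python
import math


def synthesize_2d(c, d_h, d_v, d_d, h0, h1):
    """One synthesis step of Eq. (3): rebuild c_{j-1}(k, l) from the level-j subbands.

    c, d_h, d_v, d_d are equally sized 2D lists; h0 and h1 are the low- and
    high-pass reconstruction filters (same length, indexed from zero).
    """
    rows, cols = len(c), len(c[0])
    taps = len(h0)
    out = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for k in range(2 * rows):
        for l in range(2 * cols):
            total = 0.0
            for m in range(rows):
                for n in range(cols):
                    a, b = k - 2 * m, l - 2 * n   # filter arguments of Eq. (3)
                    if 0 <= a < taps and 0 <= b < taps:
                        total += (h0[a] * h0[b] * c[m][n]
                                  + h0[a] * h1[b] * d_h[m][n]
                                  + h1[a] * h0[b] * d_v[m][n]
                                  + h1[a] * h1[b] * d_d[m][n])
            out[k][l] = total
    return out


s = 1 / math.sqrt(2)
haar_low, haar_high = [s, s], [s, -s]

# Haar subbands of the 2x2 image [[1, 2], [3, 4]], computed by hand:
# approximation 5, horizontal detail -1, vertical detail -2, diagonal detail 0.
image = synthesize_2d([[5.0]], [[-1.0]], [[-2.0]], [[0.0]], haar_low, haar_high)
```

The four subbands together reproduce the 2x2 block exactly, while the approximation alone would give a constant block of 2.5. In the local reconstruction each subband is produced by its own filtered back projection before this synthesis step is applied.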
[Figure 1: three line plots. (a) ramp filter in the frequency domain; (b) scaling and wavelet filters; (c) scaling and wavelet ramp filters. Curves in (b) and (c): scaling ramp filter, horizontal wavelet ramp filter, vertical wavelet ramp filter, diagonal wavelet ramp filter.]

Fig. 1. (a) Illustration of a traditional ramp filter. (b) and (c) Illustration of the scaling and wavelet ramp filters at the sixth projection angle (43.2 degree) using BiorSplines 2.2 wavelet, respectively.
Two Dimensional Wavelet Reconstruction. The idea of a wavelet based reconstruction is to use a scaling (or wavelet) modified ramp filter in the FBP algorithm to generate the appropriate subbands of the reconstructed image [8]. Since the 2D wavelet transform is separable, the modified ramp filters are simply convolutions of appropriately rotated 1D scaling and wavelet functions. Locality is achieved since most common scaling and wavelet functions have compact support, as do their Hilbert transforms [1]. The modified FBP algorithm for terahertz CT reconstruction is expressed as follows:

\[ I(x, y) = \int_{0}^{\pi} \int_{-\infty}^{\infty} S(\theta, \beta)\, |\beta|\, G_{2^j}(\beta \cos\theta, \beta \sin\theta) \exp(i 2\pi \beta \xi)\, \mathrm{d}\beta\, \mathrm{d}\theta, \tag{4} \]

where $S(\theta, \beta)$ and $G_{2^j}(\beta_1, \beta_2)$ are the spatial Fourier transforms of $s(\theta, \xi)$ and $g_{2^j}$ (the scaling or wavelet ramp filter in the time domain), respectively. An example of a ramp filter and of modified wavelet ramp filters is shown in Fig. 1(a)-(c). In the current paper, only one level of the wavelet transform is considered. This is mainly because a single-level WT is sufficient to allow clear reconstruction of an image in the ROI, and it avoids the extra computational complexity of more WT levels [8]. The wavelet reconstruction formula in Eq. (4) allows for such a reconstruction by setting $j = 1$. The 2D inverse wavelet transform (IWT) is performed on the reconstructed approximate and detail subbands, after the relevant filters are applied in the FBP.
3 Implementation Issues

The current research based on terahertz imaging is inspired by Rashid-Farrokhi et al. [8]. In this work, we experiment with the 2D wavelet technique using terahertz tomographic data by modifying the measured projections. This modification uses an extrapolation technique to reduce edge effects due to sinogram truncation. It is observed that the approximate coefficients of a scaling function show good localised features in the local reconstruction using our algorithm, where the reconstructed intensity of an image varies considerably between different target materials.

[Figure 2: (a) optical image of the target; (b) "Projection profile at 25th angle" and (c) "Projection profile at 25th angle after Extrapolation", both plotting amplitude (a.u.) against projection (mm) for the scaling (wavelet) ramp filtered projection and the ramp filtered projection.]

Fig. 2. (a) An optical image of a target with 2 mm diameter holes drilled into a polystyrene cylinder with varying interhole distances. (b) Projection filtered by a scaling ramp filter and a traditional ramp filter, respectively. (c) Projection extrapolation outside the ROI after filtered projections.

It should be noted that, when terahertz data are used for local reconstruction, artifacts at the edges of the region of exposure (ROE) in the terahertz projections, where nonlocal data are set to zero, are very significant whether the traditional ramp filter or the scaling and wavelet ramp filters are used. Thus, the ROE must have a radius $r_e = r_i + r_a$, where $r_i$ is the ROI radius and $r_a$ is the radius of the region of artifacts (ROA). Fig. 2(b) shows the sharp variation along the borders of the ROE after applying the scaling ramp filter and the ramp filter, respectively, to one of the 1D projections. This variation results in the image appearing relatively weak in intensity compared to a large constant bias that exists along the reconstructed edges in the region of interest. The constant extrapolation we use is given by Eq. 25 in Rashid-Farrokhi et al. [8]. Fig. 2(c) shows the extrapolated projection at the 25th projection angle after the application of a scaling ramp filter and a ramp filter. The extrapolated projection removes the spikes at the edge of the ROE. This extrapolation procedure is also suitable for the reconstruction of an off-center area.

In order to recover the cross-sectional image in the region of interest, the values of the sinograms outside the ROE are set to zero. The traditional filtered back projection formula and the wavelet based reconstruction are each applied to the remaining projections for analysis and comparison. With the extrapolation step, the overall algorithm becomes:

1. The original projections through the ROE are calculated in the time domain from measured pulses;
2. Each projection is filtered by three different modified wavelet filters at all projection angles;
3. Each projection is filtered by the modified scaling filter at all projection angles;
4. Filtered projections are extrapolated with constants to limit artifacts at the boundaries of the projections;
5. Extrapolated filtered projections are back projected to every other point to obtain the approximate and detail sub-images at resolution $2^{-1}$; all remaining points are set to zero;
6. The image is reconstructed from these sub-images using a 2D IWT.
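Steps 1 and 4 of the list above can be sketched as follows. Both functions are illustrative assumptions: `estimate_delay` picks the lag that maximises a plain cross-correlation (the lowpass interpolation of Section 2.2 is omitted), and `extrapolate_constant` simply repeats the boundary values of a filtered projection outside the exposed window, which is only one plausible reading of the constant extrapolation of [8] and does not reproduce its Eq. 25.

```python
def estimate_delay(reference, delayed, max_lag):
    """Step 1: lag (in samples) that maximises the cross-correlation of two pulses."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                score += r * delayed[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def extrapolate_constant(values, lo, hi):
    """Step 4: keep samples lo..hi and repeat the boundary values outside them."""
    left, right = values[lo], values[hi]
    return [left if i < lo else right if i > hi else v
            for i, v in enumerate(values)]


# Hypothetical pulses: the transmitted pulse is attenuated and delayed by 5 samples.
reference_pulse = [0.0] * 8 + [1.0, 3.0, 1.0] + [0.0] * 21
transmitted_pulse = [0.0] * 13 + [0.5, 1.5, 0.5] + [0.0] * 16
delay = estimate_delay(reference_pulse, transmitted_pulse, max_lag=10)

# Hypothetical filtered projection that is only valid on samples 2..4.
extended = extrapolate_constant([0.0, 0.0, 3.0, 4.0, 5.0, 0.0, 0.0], 2, 4)
```

One delay value per projection angle and offset forms the sinogram that is then filtered, extrapolated and back projected in steps 2 to 6.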
4 Reconstruction Results

In this paper, terahertz data measured from a simple structure, a cylinder with holes inside, are used as the target (Fig. 2(a)). The data set consists of 101 projections at each of 25 projection angles covering a 180° projection area in a 100 × 100 image. Two situations are analyzed for this target sample: (i) an ROE of diameter 42 pixels at the center of the image and (ii) an ROE of diameter 67 pixels off-center in the image.

4.1 Local Reconstruction of Centered Region
Fig. 3 shows local reconstruction results for a centered region with a radius of 16 pixels using the local reconstruction algorithm. Fig. 3(a) is the local reconstruction. Fig. 3(b) shows four reconstructed subimages (approximate and detailed coefficients) using wavelet and scaling ramp filters, constant extrapolation and BP. Fig. 3(c) and (d) show the reconstruction profiles at an arbitrarily chosen horizontal row and vertical column of pixels corresponding to local and traditional reconstruction. It is observed that the reconstruction inside the ROI is similar in quality to a full exposure reconstruction using traditional FBP.

[Figure 3: (a) image "Wavelet Based Tomography", axes X (mm) and Y (mm); (b) four subimages, axes X (mm) and Y (mm); (c) "Reconstructed profile at 12th horizontal pixel row" and (d) "Reconstructed profile at 12th vertical pixel column", both plotting Magnitude (dB) against Number of pixels for scaling ramp filtered LCT, ramp filtered GCT, ramp filtered LCT and wavelet based LCT.]

Fig. 3. (a) Reconstructed image localised to a region of interest from the inverse wavelet transform. (b) Centered approximate and three detail reconstruction subimages (clockwise). (c) Reconstructed profiles at the 12th horizontal pixel row. (d) Reconstructed profiles at the 12th vertical pixel column.
4.2 Local Reconstruction of Off-Centered Region

Fig. 4(a)-(d) shows locally reconstructed images of an off-centered region with a radius of 61 pixels using the current local reconstruction method. The subfigures illustrate in turn: the local reconstruction from extrapolated wavelet and scaling filtered projections after decomposition; the reconstruction of extrapolated approximate and detail coefficients after BP; the reconstruction profiles at an arbitrarily chosen horizontal row; and the reconstruction profiles at an arbitrarily chosen vertical column. For comparison purposes, the profiles for both the proposed and the traditional FBP are shown in Fig. 4(c)-(d). It should be noted that the reconstruction from wavelet approximate coefficients shows strong contrast in intensity for different media, and that the traditional FBP based local reconstruction shows a slightly higher intensity than the FBP based global reconstruction.

[Figure 4: (a) image "Extrapolated wavelet based LCT", axes X (mm) and Y (mm); (b) four subimages, axes X (mm) and Y (mm); (c) "Reconstructed profile at 28th horizontal pixel row" and (d) "Reconstructed profile at 12th vertical pixel column", both plotting Magnitude (dB) against Number of pixels for scaling ramp filtered LCT, ramp filtered LCT and ramp filtered GCT.]

Fig. 4. (a) A reconstructed image from the inverse wavelet transform without decomposition for clarity. (b) Off-centered approximate and three detail reconstructed subimages (clockwise). (c) Reconstructed profiles at the 28th horizontal row of pixels. (d) Reconstructed profiles at the 12th vertical column of pixels.
5 Conclusion and Future Work

We have presented an algorithm to reconstruct the wavelet and scaling coefficients of a function from its sinogram image of terahertz signals. The scheme is motivated by the observation that for some wavelet bases, with sufficient zero moments, the scaling and wavelet functions have essentially the same support after ramp filtering. Two targets are recovered from terahertz measurements, which demonstrates the quality of the reconstructions possible with the presented local reconstruction method. Since the current work involves only one level of the 2D DWT, it will be interesting to explore the reconstruction algorithm with more levels of decomposition in the future. Moreover, a research area of much current interest is the development of statistically based local tomography algorithms and techniques [6]. There is much room for the integration of such a statistical estimation framework with wavelet techniques, which have been shown here to be critical for local reconstruction.
References

1. Delaney, A.H., Bresler, Y.: Multiresolution tomographic reconstruction using wavelets. IEEE Transactions on Image Processing 4(6), 799–813 (1995)
2. Ferguson, B., Wang, S., Gray, D., Abbott, D., Zhang, X.C.: Identification of biological tissue using chirped probe THz imaging. Microelectronics Journal 33(12), 1043–1051 (2002)
3. Ferguson, B., Wang, S., Gray, D., Abbott, D., Zhang, X.C.: T-ray computed tomography. Optics Letters 27(15), 1312–1314 (2002)
4. Galvão, R.K.H., Hadjiloucas, S., Bowen, J.W., Coelho, C.: Optimal discrimination and classification of THz spectra in the wavelet domain. Opt. Express 11, 1462–1473 (2003)
5. Gonzalez, R.C., Woods, R.E.: Digital Image Processing. Prentice-Hall, Inc, New Jersey (2002)
6. Hanson, K.M., Wecksung, G.W.: Bayesian estimation of 3-D objects from few radiographs. Journal of Optical Society of America 73(11), 1501–1509 (1983)
7. Mittleman, D., Neelamani, R., Baraniuk, R.G., Rudd, J.V., Koch, M.: Recent advances in terahertz imaging. Appl. Phys. B. 68, 1085–1094 (1999)
8. Rashid-Farrokhi, F., Liu, K.J.R., Berenstein, C.A., Walnut, D.: Wavelet-based multiresolution local tomography. IEEE Transactions on Image Processing 6(10), 1412–1430 (1997)
9. Siegel, P.H.: Terahertz technology in biology and medicine. IEEE Transactions on Microwave Theory and Techniques 52(10), 2438–2447 (2004)
Nonlinear Approximation of Spatiotemporal Data Using Diffusion Wavelets

Marie Wild

Austrian Research Centers GmbH, Video and Safety Technology, Tech Gate Vienna, Donau-City-Str. 1, 1220 Vienna, Austria
[email protected]

Abstract. We present a multiscale, graph-based approach to 3D image analysis using diffusion wavelet bases, which were presented in [1]. Diffusion wavelets allow one to obtain orthonormal bases of L2 functions on graphs. This permits the study of classical wavelet algorithms (such as compression and denoising of functions in L2(Rn), n ∈ N, via nonlinear approximation) in this setting. In this paper, we describe how this could be used in structure-preserving compression of image sequences, modelled as a whole as a weighted graph, as a first step towards structural spatiotemporal wavelet segmentation. We further discuss the possibilities for using this abstract approach in computer vision tasks.
1
Introduction
Wavelet based methods have been proven to be successful in signal and image analysis, where the main applications are edge-preserving smoothing and denoising of functions in L2 (Rn ), n ∈ N, via nonlinear approximation. The recent concept of diffusion wavelets [1,2] allows the construction of wavelet bases for functions defined on other than Rn , such as certain domains, manifolds and graphs. In place of the usual dilation on Rn , a diffusion operator on the data serves as scaling tool. In this paper, we study the use of classical wavelet algorithms, lifted to a graph based setting, concentrating on an algorithm for nonlinear approximation of 2d+ time image data using diffusion wavelet bases. In this sense, we build a bridge from applied mathematics to computer vision, describing how theoretical results on wavelet theory could be used in tasks such as graph-based compression and denoising, as well as segmentation and tracking in future work. We suppose the input data to be an image sequence, regarded as a 3d, or more precisely a 2d + time data set. We model the whole image sequence as a weighted graph, where the edge weights describe local similarity between certain data
Supported by the Austrian Science Fund under the grant FWF-P18716-N13.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 886–894, 2007. © Springer-Verlag Berlin Heidelberg 2007
points, which represent the nodes of the graph. Note that the data points are not necessarily the whole set of pixels, but could also be subsets of pixels such as feature points, or groups of pixels obtained from a pre-segmentation. The weights on the edges can be based on intensity, feature properties, motion information or a combination of these. On the graph $G$, we build a diffusion wavelet basis. Instead of using the geometry of the data set in $\mathbb{R}^3$ (pixel distances on a regular grid), we use the ‘intrinsic geometry’ or ‘structure’ of the data given by the connectivity of the graph nodes during a diffusion process, which is described by increasing powers of a diffusion operator. This diffusion process can be regarded as ‘learning’ the global structure of the data from local relationships. As the operator is formed from the graph representing the input data, the wavelet basis depends on the data and the whole process is data-adaptive and nonlinear. We can analyze functions on the graph by computing coefficients with respect to the constructed diffusion wavelet basis $(\psi_{j,l})_{j \ge 0,\, l \in \mathbb{Z}}$. As this basis is orthonormal, every function $f \in L^2(G)$ satisfies

\[ \|f\|_{L^2(G)}^2 = \sum_{j,l} |\langle f, \psi_{j,l} \rangle|^2 . \tag{1} \]
In this way, all the information of $f$ is retained in the sequence of coefficients. Moreover, as the energy of $f$ is simply the sum of the squared coefficients, the salient information is reflected in the largest coefficients, just as for the usual wavelet transform. The main difference to the classical wavelet algorithms is that ‘information’ now refers to structural similarity of the data, encoded as a graph, instead of properties on the fixed grid in $\mathbb{R}^n$. The wavelet basis depends on the graph, which itself depends on the data, so that the shortcomings of the classical wavelet transform (i.e. the lack of efficiency in coding functions with edges along smooth curves) can be overcome, as these shortcomings mainly arise from the fixed geometry of $\mathbb{R}^n$. In a sense, we adapt to the intrinsic geometry of the data. Nevertheless, we can employ classical wavelet methods on the diffusion wavelet coefficients, although the information they encode is of a completely different kind. Due to the norm equality (1), we can efficiently approximate functions on a graph using only the $N$ largest coefficients for reconstruction. This method, known as $N$-term approximation, does more than approximate a function by another one at a coarser resolution: it sums the wavelet series according to large coefficients across different resolution levels and is thereby nonlinear. A similar method in the context of denoising is the so-called Wavelet Shrinkage method, in which the $N$ largest coefficients are additionally decreased in size towards the noise level. A large body of theory exists for both approaches (see e.g. [3,4,5]), relating the approximation error to the function’s properties. Most importantly, these algorithms compress the data by performing a kind of discontinuity-preserving smoothing, which in our graph setting corresponds to structure-preserving smoothing: the ‘discontinuities’ which are preserved in this case correspond to abrupt changes of edge weights in a neighborhood of ‘continuously’ changing weights, which are smoothed out. This can be seen as a first step towards structure-based spatiotemporal segmentation.

The rest of the paper is organized as follows: in the next section, we comment on related work in computer vision concerning spatiotemporal segmentation and spectral graph methods. In section 3, we describe the concept of diffusion wavelet multiresolution. The algorithm is presented in section 4. We give an abstract example to visualize our theoretical approach in section 5, before we close with a conclusion and an outlook on further work concerning structure-based spatiotemporal segmentation and its application to computer vision problems.
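The norm equality (1) and the $N$-term approximation described in the introduction can be checked numerically. The following is a minimal sketch, not the construction of [1]: it assumes an arbitrary orthonormal basis obtained from a QR factorization as a stand-in for a diffusion wavelet basis, and the helper name `n_term_approximation` is hypothetical.

```python
import numpy as np

def n_term_approximation(f, basis, n_terms):
    """Keep the n_terms largest coefficients of f in an orthonormal basis.

    basis: (n, n) array whose columns are orthonormal vectors.
    Returns the approximation and the full coefficient vector.
    """
    coeffs = basis.T @ f                          # <f, psi_k> for every basis vector
    keep = np.argsort(np.abs(coeffs))[::-1][:n_terms]
    kept = np.zeros_like(coeffs)
    kept[keep] = coeffs[keep]
    return basis @ kept, coeffs

rng = np.random.default_rng(0)
n, n_terms = 32, 8
# Assumption: a random orthonormal basis stands in for a diffusion wavelet basis.
basis, _ = np.linalg.qr(rng.standard_normal((n, n)))
f = rng.standard_normal(n)

approx, coeffs = n_term_approximation(f, basis, n_terms)
energy_f = float(np.sum(f ** 2))
energy_coeffs = float(np.sum(coeffs ** 2))        # equals energy_f by (1)
sq_error = float(np.sum((f - approx) ** 2))
# Energy of the n - n_terms smallest (discarded) coefficients.
dropped_energy = float(np.sum(np.sort(np.abs(coeffs))[: n - n_terms] ** 2))
```

Under these assumptions the squared approximation error equals the energy of the discarded coefficients, which is why keeping the largest coefficients is the natural choice for any orthonormal basis.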
2 Related Work and Context
Diffusion wavelets can be seen as a particular instance of spectral graph theory. Methods based on this theory have been widely studied and applied to computer vision tasks, see e.g. [6,7,8,9,10,11,12]. Spectral graph methods start from knowledge of the local geometry of the given data and infer a global representation. They allow a nonlinear reorganization and dimension reduction of data sets (on graphs and more general manifolds) and are well suited for subsequent tasks such as visualization, grouping and partitioning of the data. We briefly summarize the basics; for details see [13]. Let $G = (V, E)$ be an undirected graph, where $V$ and $E$ are finite sets of vertices and edges. Let $G$ be a weighted graph equipped with a weight matrix $W : E \to \mathbb{R}^+$. Let $d_v = \sum_{u \in V} W(u, v)$ denote the degree of vertex $v$. Let the (normalized) Laplacian on $G$ be defined as

\[ L(u, v) = \begin{cases} 1 - \dfrac{W(v,v)}{d_v} & \text{if } u = v \text{ and } d_v \neq 0, \\ -\dfrac{W(u,v)}{\sqrt{d_u d_v}} & \text{if } u, v \text{ adjacent}, \\ 0 & \text{otherwise}, \end{cases} \]

and let the diffusion matrix on $G$ be defined as $K = I - L$. Note that, on the one hand, these matrices describe the graph just like a generalized adjacency matrix; on the other hand, they can be seen as (locally averaging) operators acting on functions on the graph. We will thus refer to $K$ as the diffusion operator in the following. $K$ reflects local similarity in the graph, and global properties can be explored from its eigenvalues. A global analysis of the data corresponds to the study of large powers of $K$ and is efficiently performed using the eigenvectors with the largest eigenvalues, which are the ones with the lowest frequency. A projection onto these top eigenvectors corresponds to a low-frequency approximation. In these terms, this projection can be regarded as an analog of a Fourier approximation, where the basis functions are the eigenfunctions, sorted by their frequency. In [14], this is used for dimensionality reduction of high-dimensional data; in [11], it is applied to image denoising and semi-supervised classification; and in [12], it is used for data fusion and multi-cue data matching.
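The definitions of the normalized Laplacian and of the diffusion operator $K = I - L$ can be made concrete with a small example. This is a sketch under stated assumptions: the path graph, the test function `f` and the helper name `normalized_laplacian` are illustrative choices, not taken from the paper.

```python
import numpy as np

def normalized_laplacian(W):
    """Normalized Laplacian of a symmetric weight matrix W (zero rows allowed)."""
    d = W.sum(axis=1)
    inv_sqrt = np.zeros_like(d)
    inv_sqrt[d > 0] = 1.0 / np.sqrt(d[d > 0])
    L = -inv_sqrt[:, None] * W * inv_sqrt[None, :]      # -W(u,v)/sqrt(d_u d_v)
    L[np.diag_indices_from(L)] += (d > 0).astype(float)  # diagonal: 1 - W(v,v)/d_v
    return L

# Illustrative path graph with 6 vertices and unit weights.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

L = normalized_laplacian(W)
K = np.eye(n) - L                                # diffusion operator

eigvals, eigvecs = np.linalg.eigh(K)             # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]                # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Low-frequency ("Fourier-like") approximation: project onto the top 2 eigenvectors.
f = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
top = eigvecs[:, :2]
f_low = top @ (top.T @ f)
```

For a connected graph the largest eigenvalue of $K$ is 1, and the projection onto the leading eigenvectors is an orthogonal projection, so it never increases the norm of `f`.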
The normalized cut method [8,10], used for graph-based image segmentation and for spatiotemporal segmentation in tracking, fits into this spectral context. It partitions a weighted graph into disjoint sets, finding a cut of relatively small weight between two subsets with strong internal connections. The algorithm calculates eigenvalues and eigenvectors of the Laplacian matrix and uses the eigenvector of the second smallest eigenvalue to bipartition the graph. In this sense, functions on a graph are projected onto a subspace of low-frequency, Fourier-like approximations. Our goal is to derive an analogous spatiotemporal segmentation based on a multiresolution analysis on a graph rather than on a kind of Fourier transform, carrying the classical advantages of projecting onto wavelet subspaces across scales, rather than onto Fourier-like linear subspaces, over to this setting.
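The bipartition step of the normalized cut method can be sketched on a toy graph. The example below assumes two fully connected groups of four vertices joined by one weak edge (an illustrative graph, not data from the paper) and thresholds the eigenvector of the second smallest eigenvalue at zero, which is one simple splitting rule among several used in practice.

```python
import numpy as np

# Illustrative graph: two cliques of 4 vertices, joined by one weak edge.
n = 8
W = np.zeros((n, n))
W[:4, :4] = 1.0
W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)
W[3, 4] = W[4, 3] = 0.05

d = W.sum(axis=1)
inv_sqrt = 1.0 / np.sqrt(d)
L = np.eye(n) - inv_sqrt[:, None] * W * inv_sqrt[None, :]   # normalized Laplacian

eigvals, eigvecs = np.linalg.eigh(L)       # ascending eigenvalues
# Rescaling by D^{-1/2} gives the solution of the generalized problem (D - W) y = lambda D y.
fiedler = inv_sqrt * eigvecs[:, 1]
labels = (fiedler > 0).astype(int)         # sign pattern gives the two segments
```

The overall sign of an eigenvector is arbitrary, so only the grouping of the vertices is meaningful, not which group receives label 0 or 1.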
3 Multiresolution Analysis and Diffusion Wavelets
In mathematical terms, the discrete wavelet transform algorithm [15] corresponds to a fast implementation for deriving the coefficients of a wavelet series. Wavelets provide an orthonormal basis for functions in $L^2(\mathbb{R})$ (also in $L^2(\mathbb{R}^n)$, $n \in \mathbb{N}$; for simplicity of presentation, we concentrate here on the 1d case). Any $f \in L^2(\mathbb{R})$ can be written as $f = \sum_{j,l \in \mathbb{Z}} \langle f, \psi_{j,l} \rangle \psi_{j,l}$, where the $\psi_{j,l}$ are dilated and translated versions of a mother wavelet $\psi \in L^2(\mathbb{R})$. A remark: the wavelet series resembles a Fourier series, but there are crucial differences. As Fourier series decompose functions into sines and cosines with infinite support, they serve as a frequency representation, analyze functions globally and are well suited for smooth functions (e.g. the Sobolev class). The wavelet series allows a decomposition into localized functions at different scales and is thereby a time-frequency representation. $N$-term approximation and Wavelet Shrinkage are classical wavelet algorithms with well-known efficiency properties (for details see [3]). It can be shown that the decay of the approximation error is directly related to a certain decay of the wavelet coefficients. For functions on $\mathbb{R}$, $\mathbb{Z}$ and other settings, this decay is attained if and only if the function lies in a (Besov-type) smoothness space [3,16], provided the wavelet basis has certain properties. As Besov spaces (in contrast to Sobolev spaces) may contain rather rough functions, even with discontinuities, one could state that the approximation smooths the data while preserving discontinuities. Furthermore, Wavelet Shrinkage can be interpreted as the solution of a certain variational problem [5] that includes a regularization term close to the total variation. This gives heuristic evidence for using wavelet methods in the graph setting: the data on the graph will be smoothed by the approximation, but not as rigorously as with Fourier-based methods, since certain discontinuities (i.e. abrupt changes in the edge weights) are preserved (for a discussion of smoothness for functions on graphs in relation to Sobolev spaces based on the ‘smooth flow’ of the diffusion process, see [11]). The approximation will in this sense be better able to preserve the intrinsic structure of the data.
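The contrast between wavelet and Fourier approximation of discontinuous data can be illustrated in the classical 1d setting. The sketch below assumes a Haar wavelet basis and a single step signal (both illustrative choices) and compares the $N$-term approximation errors, which for orthonormal or unitary transforms can be read off directly from the discarded coefficients.

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar wavelet transform of a vector whose length is a power of two."""
    x = np.asarray(x, dtype=float).copy()
    details = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2.0)
        diff = (x[0::2] - x[1::2]) / np.sqrt(2.0)
        details.append(diff)           # wavelet coefficients of the current level
        x = avg                        # continue on the coarser approximation
    details.append(x)                  # final scaling coefficient
    return np.concatenate(details[::-1])

def n_term_error(coeffs, n_terms):
    """L2 error after keeping the n_terms largest coefficients (Parseval)."""
    mags = np.sort(np.abs(coeffs))[::-1]
    return float(np.sqrt(np.sum(mags[n_terms:] ** 2)))

n = 64
f = np.where(np.arange(n) < 21, 1.0, 4.0)          # one jump, not at a dyadic position

haar_err = n_term_error(haar_transform(f), 8)
fourier_err = n_term_error(np.fft.fft(f, norm="ortho"), 8)
```

For this step signal at most one Haar coefficient per level is nonzero, so eight terms already reproduce it exactly, whereas the Fourier coefficients decay slowly because of the jump.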
We now briefly explain the construction of the diffusion wavelets. Dilations (stretching and squeezing) as on $\mathbb{R}$ cannot be used for scaling in this setting. Instead, one uses increasing powers of the diffusion operator, $(K^{2^j})_{j > 0}$, as a scaling tool to define a multiresolution analysis (for an exact definition and discussion of this concept see e.g. [17]). Let $\{\lambda_i\}_{i \ge 0}$ be the (decreasingly ordered) spectrum of $K$ with eigenvectors $\{\xi_i\}$, and let $0 < \varepsilon < 1$ and $t_j := 2^{j+1} - 1$, $j \ge 0$. The recipe is to divide the spectrum into ‘low-pass’ portions $\sigma_j(K) := \{\lambda \in \sigma(K) : \lambda^{t_j} \ge \varepsilon\}$. For $j \ge 0$, define the approximation spaces by $V_j := \mathrm{span}\{\xi_\lambda : \lambda \in \sigma_j(K)\}$ and the wavelet spaces $W_j$ by $V_{j-1} = V_j \oplus W_j$. Projections onto the spaces $V_j$ are approximations of a function $f$ at different resolutions, and projections onto the $W_j$ (partial wavelet series) describe the difference between two approximation levels. The system $(\psi_{j,l})_{j \ge 0,\, l \in \mathbb{Z}}$, formed from orthonormal bases $(\psi_{j,l})_{l \in \mathbb{Z}}$ for each $W_j$, is an orthonormal basis for $L^2(G)$, called the diffusion wavelet basis. Building localized orthonormal bases for $V_j$ and $W_j$ is possible because, for $K$ with a fast decaying spectrum, the rank of $K^{t_j}$ decreases as $j$ grows and the powers of the operator can be compressed. For technical details we refer to [1]. With this construction one can now calculate scaling and wavelet coefficients of functions on a graph (see again [1]). This permits a true multiscale transform in the spirit of classical wavelet analysis on nonlinear structures such as graphs, manifolds or general metric spaces. The approximation algorithm based on this diffusion wavelet transform is presented in the following section.
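The splitting of the spectrum into ‘low-pass’ portions can be sketched numerically. The example below only counts the dimensions of $V_j$ and $W_j$ from the eigenvalues; it does not build the localized bases of [1]. It assumes a ring graph, a lazy operator $(I + K)/2$ so that the spectrum lies in $[0, 1]$, and $V_{-1}$ equal to the whole space; the function name `spectral_levels` is hypothetical.

```python
import numpy as np

def spectral_levels(K, eps=0.1, levels=4):
    """Return dim V_j and dim W_j for j = 0..levels-1, taking V_{-1} as the whole space."""
    eigvals = np.linalg.eigvalsh(K)
    dims_V, dims_W = [], []
    prev = K.shape[0]                         # dimension of V_{-1}
    for j in range(levels):
        t_j = 2 ** (j + 1) - 1
        dim_V = int(np.count_nonzero(eigvals ** t_j >= eps))   # size of sigma_j(K)
        dims_V.append(dim_V)
        dims_W.append(prev - dim_V)           # from V_{j-1} = V_j (+) W_j
        prev = dim_V
    return dims_V, dims_W

# Illustrative ring graph with 16 vertices and unit weights.
n = 16
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
d = W.sum(axis=1)
K = W / np.sqrt(np.outer(d, d))               # K = I - L for a graph without self-loops
K = 0.5 * (np.eye(n) + K)                     # assumption: lazy version, spectrum in [0, 1]

dims_V, dims_W = spectral_levels(K)
```

The dimensions of the $V_j$ shrink as $j$ grows, and the wavelet spaces together with the coarsest approximation space account for all $n$ dimensions.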
4 Nonlinear Approximation on Graphs Using Diffusion Wavelets
As described in the introduction, in this paper we study the nonlinear approximation of image sequences, modelled as a whole as a weighted graph. Using the theory presented in the preceding sections, we proceed as follows. First, we build a weighted graph $G$ from the 3d image data. The vertices can be chosen as the whole set of pixels or, for complexity reasons, as a subset, e.g. by using a downsampled version of the sequence, by filtering, or by a feature point selection procedure. The crucial point is to define the edges of $G$ and their corresponding weights: here we encode the local relations between vertices, from which the algorithm will ‘learn’ the global and multiscale structure. A standard choice in our setting is $w(u, v) = \exp(-\rho(u, v)^2)$, where $\rho(u, v)$ may be the difference of intensities at $u$ and $v$, the distance in space, feature point properties, information from a motion prediction or a combination of these. For complexity reasons, it is reasonable to set $\rho(u, v) = \infty$ if $v$ lies outside a certain finite neighborhood of $u$. For a detailed discussion of these choices, see [11] and [10]. A function $f$ on $G$ imposes additional attributes on the vertices. In our context, $f$ can describe intensities (if the vertices are pixels), feature properties or other additional information. For analyzing the ‘pure’ graph constructed from the data, choose $f \equiv 1$. The graph $G$ and the function $f$ (which lies in $L^2(G)$, as $G$ is supposed to have a finite set of nodes) are the input to the approximation algorithm.

Algorithm 1. Diffusion wavelet approximation
Input: $f \in L^2(G)$, function on a weighted graph $G$
Result: $\tilde f \in L^2(G)$, compressed function on $G$
choose $T > 0$;  // threshold parameter: the larger $T$, the fewer coefficients constitute the approximation
build a diffusion wavelet basis $(\psi_{j,l})_{j \ge 0,\, l \in \mathbb{Z}}$ on $G$;
compute the wavelet coefficients $(\langle f, \psi_{j,l} \rangle)_{j \ge 0,\, l \in \mathbb{Z}}$;
forall $j \ge 0$, $l \in \mathbb{Z}$ do
    case Hard Thresholding
        if $|\langle f, \psi_{j,l} \rangle| \ge T$ then $\langle \tilde f, \psi_{j,l} \rangle \leftarrow \langle f, \psi_{j,l} \rangle$;
        else $\langle \tilde f, \psi_{j,l} \rangle \leftarrow 0$;
    case Soft Thresholding
        if $|\langle f, \psi_{j,l} \rangle| \le T$ then $\langle \tilde f, \psi_{j,l} \rangle \leftarrow 0$;
        else if $\langle f, \psi_{j,l} \rangle > T$ then $\langle \tilde f, \psi_{j,l} \rangle \leftarrow \langle f, \psi_{j,l} \rangle - T$;
        else if $\langle f, \psi_{j,l} \rangle < -T$ then $\langle \tilde f, \psi_{j,l} \rangle \leftarrow \langle f, \psi_{j,l} \rangle + T$;
end
$\tilde f \leftarrow \sum_{j,l} \langle \tilde f, \psi_{j,l} \rangle \psi_{j,l}$;  // build the approximation $\tilde f$ from the coefficients $\langle \tilde f, \psi_{j,l} \rangle$
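The graph construction described in this section can be sketched for a small synthetic sequence. This is one possible instantiation under stated assumptions: all pixels are vertices, $\rho$ is the intensity difference divided by a hypothetical scale parameter `sigma`, and $\rho = \infty$ (weight 0) outside a space-time neighborhood of hypothetical size `radius`. A dense matrix is used only because the example is tiny.

```python
import numpy as np

def sequence_to_graph(frames, sigma=0.1, radius=1):
    """Weight matrix for a (T, H, W) image sequence with pixels as vertices.

    w(u, v) = exp(-rho(u, v)^2), rho = intensity difference / sigma for
    space-time neighbors, and w(u, v) = 0 outside the neighborhood.
    """
    T, H, Wd = frames.shape
    n = T * H * Wd
    idx = np.arange(n).reshape(T, H, Wd)      # vertex index of every pixel
    weights = np.zeros((n, n))
    offsets = [(dt, dy, dx)
               for dt in range(-radius, radius + 1)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)
               if (dt, dy, dx) != (0, 0, 0)]
    for t in range(T):
        for y in range(H):
            for x in range(Wd):
                for dt, dy, dx in offsets:
                    t2, y2, x2 = t + dt, y + dy, x + dx
                    if 0 <= t2 < T and 0 <= y2 < H and 0 <= x2 < Wd:
                        rho = (frames[t, y, x] - frames[t2, y2, x2]) / sigma
                        weights[idx[t, y, x], idx[t2, y2, x2]] = np.exp(-rho ** 2)
    return weights

# Synthetic sequence: 3 frames of 4 x 4 pixels with a static vertical edge.
frames = np.zeros((3, 4, 4))
frames[:, :, 2:] = 1.0
W = sequence_to_graph(frames)
```

With this choice, neighboring pixels of equal intensity are connected with weight 1, while the weight across the intensity edge is practically zero, so the edge becomes a weak link in the graph.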
5 Example
We illustrate the algorithm with an example. Figure 1 indicates one way to build a weighted graph from an image sequence; for other possibilities see section 4. Figure 2 a) shows an abstract graph which could be a (simplified) cutout from a graph built from an image sequence. In b) and c), we show the expected outcome for different thresholding parameters in an idealized way: the function is smoothed in regions where the function values vary ‘smoothly’, while ‘jumps’ in the form of sudden changes in intensities and edge weights are preserved. The larger the threshold, the fewer diffusion wavelet coefficients constitute the approximation and the higher the rate of compression. In real image data, the ‘discontinuities’ or ‘jumps’ may correspond to object borders in single images or to the main direction of movement, depending on the information encoded in the weights. Experimenting with different choices of vertices, weights and thresholds on real data will be part of our future work.
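The effect of the threshold on the number of retained coefficients can be sketched with the hard and soft thresholding rules of Algorithm 1. The example is a simplified stand-in: it uses the eigenvectors of the diffusion operator as an orthonormal basis instead of a true (localized, multiscale) diffusion wavelet basis, and the graph, the function `f` and the helper names are illustrative assumptions.

```python
import numpy as np

def hard_threshold(c, T):
    """Keep coefficients with |c| >= T, set the others to zero."""
    return np.where(np.abs(c) >= T, c, 0.0)

def soft_threshold(c, T):
    """Set coefficients with |c| <= T to zero and shrink the others by T towards zero."""
    return np.sign(c) * np.maximum(np.abs(c) - T, 0.0)

def approximate(f, basis, T, mode="hard"):
    """Threshold the coefficients of f in an orthonormal basis and reconstruct."""
    coeffs = basis.T @ f
    shrink = hard_threshold if mode == "hard" else soft_threshold
    kept = shrink(coeffs, T)
    return basis @ kept, int(np.count_nonzero(kept))

# Illustrative path graph: two groups of 4 vertices joined by a weak edge.
n = 8
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 0.01 if i == 3 else 1.0
d = W.sum(axis=1)
K = W / np.sqrt(np.outer(d, d))
_, basis = np.linalg.eigh(K)          # assumption: eigenbasis as stand-in for wavelets

f = np.array([1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0])
f_small, n_small = approximate(f, basis, T=0.05)
f_large, n_large = approximate(f, basis, T=1.0)
f_exact, _ = approximate(f, basis, T=0.0)
```

A larger threshold can only remove coefficients, so `n_large` never exceeds `n_small`, and a threshold of zero reproduces `f` exactly.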
Fig. 1. Graph formed from a sequence of images, taking pixels as nodes, with edges connecting nodes only within a certain neighborhood and weights corresponding to the difference of gray values
Fig. 2. Different thresholds on an abstract graph formed from the data
6 Conclusion and Further Work
In this article, we present an algorithm for nonlinear approximation using diffusion wavelets. This algorithm is derived from classical wavelet methods, now lifted to a graph-based setting. We explain how to apply the algorithm to spatiotemporal data, modelled as a function on a graph. The data will be smoothed by the approximation, while abrupt changes in the edge weights (which can describe object borders in single images or the main direction of movement) are preserved. The approximation should in this sense be better able to preserve the intrinsic structure of the data. At present, our work focuses on the implementation and testing of the algorithm, as well as its extension towards segmentation. The exact approximation properties also remain to be worked out in the graph setting, by studying properties of the diffusion wavelets in order to relate smoothness properties (analogous to Besov spaces) of the data in $L^2(G)$ to the decay of the approximation error. Structure-preserving compression on a graph can be seen as a first step towards structural spatiotemporal wavelet segmentation. In order to obtain a true segmentation, i.e. to provide the nodes of the graph with labels indicating the segment they belong to, we have to go a step further. A first idea is to use a Hidden Markov Model on the coefficient tree, classifying the coefficients as ‘large’ or ‘small’ and following them across resolution levels (see [18]). The approximation algorithm can be used for compression of data while preserving structural properties as encoded in the graph. The segmentation method we are working on could be used for a combined spatial and motion segmentation of a whole image sequence, and thereby as an instance of an ‘offline’ tracking method similar to normalized cut tracking, involving a true multiresolution analysis of the data.
References
1. Coifman, R.R., Maggioni, M.: Diffusion wavelets. Appl. Comput. Harmon. Anal. 21(1), 53–94 (2006)
2. Coifman, R.R., Lafon, S., Lee, A.B., Maggioni, M., Nadler, B., Warner, F., Zucker, S.W.: Geometric diffusion as a tool for harmonic analysis and structure definition of data: Multiscale methods. In: Proceedings. vol. 102, PNAS, pp. 7432–7437 (2005)
3. DeVore, R.: Nonlinear Approximation. Acta Numerica, 51–150 (1998)
4. Donoho, D., Johnstone, I.: Minimax Estimation Via Wavelet Shrinkage. Ann. Stat. 26(3), 879–921 (1998)
5. Chambolle, A., DeVore, R.A., Lee, N.-Y., Lucier, B.J.: Nonlinear Wavelet Image Processing: Variational Problems, Compression, and Noise Removal Through Wavelet Shrinkage. IEEE Trans. Image Process 7, 319–335 (1998)
6. Qiu, H., Hancock, E.R.: Robust Multi-body Motion Tracking using Commute Time Clustering. In: ECCV, pp. 160–173 (2006)
7. Luo, B., Hancock, E.R.: Structural graph matching using the EM algorithm and singular value decomposition. IEEE Trans. Pattern Anal. Mach. Intell. 23(10), 1120–1136 (2001)
8. Shi, J., Malik, J.: Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)
9. Meila, M., Shi, J.: Learning Segmentation by Random Walks. In: NIPS, pp. 873–879 (2000)
10. Shi, J., Malik, J.: Motion Segmentation and Tracking Using Normalized Cuts. In: ICCV, pp. 1154–1160 (1998)
11. Szlam, A.D., Maggioni, M., Coifman, R.R.: A general framework for adaptive regularization based on diffusion processes on graphs (2006)
12. Lafon, S., Keller, Y., Coifman, R.R.: Data Fusion and Multi-Cue Data Matching by Diffusion Maps. IEEE Trans. Pattern An. and Mach. Int. 28-11, 1784–1797 (2006)
13. Chung, F.R.: Spectral Graph Theory. CMBS-AMS 92 (1997)
14. Coifman, R.R., Lafon, S.: Diffusion maps. Appl. and Comp. Harm. Anal. 21, 5–30 (2006)
15. Mallat, S.: Multiresolution Approximations and Wavelet Orthonormal Bases in L2(R). Trans. Amer. Math. Soc. 315, 69–87 (1989)
16. Führ, H., Wild, M.: Characterizing Wavelet Coefficient Decay of Discrete-Time Signals. Appl. and Comp. Harm. Anal. 20(2), 184–201 (2006)
17. Daubechies, I.: Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics (1992)
18. Choi, H., Baraniuk, R.G.: Multiscale image segmentation using wavelet-domain hidden Markov models. IEEE Trans. Signal Process 101(92) (September 2001)
A New Wavelet-Based Texture Descriptor for Image Retrieval
Esther de Ves1, Ana Ruedin2, Daniel Acevedo2, Xaro Benavent3, and Leticia Seijas2
1 Computer Science Department, University of Valencia
2 Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires
3 Robotics Institute, University of Valencia
{Esther.Deves,Xaro.Benavent}@uv.es
Abstract. This paper presents a novel texture descriptor based on the wavelet transform. First, we consider vertical and horizontal coefficients at the same position as the components of a bivariate random vector. The magnitude and angle of these vectors are computed and their histograms are analyzed. The empirical magnitude histogram is modelled by a gamma probability density function (pdf). As a result, the feature extraction step consists of estimating the gamma parameters using the maximum likelihood estimator and computing the circular histograms of the angles. The similarity measurement step is carried out by means of the well-known Kullback-Leibler divergence. Finally, retrieval experiments on the Brodatz texture collection show a good performance of this new texture descriptor. We compare two wavelet transforms, with and without downsampling, and show the advantage of the second one, which is translation invariant, for the construction of our texture descriptor.
Keywords: Texture descriptor, Wavelet Transform, Image retrieval.
1 Introduction
The increasing amount of information available in today’s world raises the need to retrieve relevant data efficiently. Unlike text-based retrieval, where keywords are successfully used to index documents, content-based image retrieval poses up front the fundamental questions of how to extract useful image features and how to use them for intuitive retrieval [11] [5]. Interest in content-based image retrieval (CBIR) systems has been growing in the last few years. This interest has been motivated by the increasing number of image databases which need effective and efficient techniques for retrieving multimedia information. Attempts have been made to develop general purpose image retrieval systems based on multiple features (e.g. color, shape and texture) which describe the image content [10]. The new visual information retrieval systems extract visual image features, usually related to color, texture and shape, from each image stored in the database, and use this representation to compare images by means of a similarity measure.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 895–902, 2007. © Springer-Verlag Berlin Heidelberg 2007
The features extracted from an image can be classified into low- and high-level features; the latter are normally obtained by combining low-level features with a reasonable predefined model. The low-level features are obtained by preprocessing each image in the database. Among these characteristics we can mention those carrying chromatic information, those related to textures present in the image and those related to the shape of objects in the image. Generally speaking, a CBIR system is composed of two main modules: the feature extraction module and the similarity measurement module. In the latter, a distance between the query image and each image in the database is computed from the extracted features. Each image obtains a relevance score with respect to the query image, which is used to rank the database. Texture analysis plays an important role in many image processing tasks. There are a great number of references on texture analysis, and particularly on texture classification. A central point in texture analysis is the definition of good features to characterize textures that can be useful for content-based retrieval. Gabor features for texture classification followed by a linear discriminant analysis were used in [2]. Ayala proposes in [1] a new descriptor for binary and gray-scale textures based on spatial size distributions (SSD). In [7], a functional descriptor of the multivariate random closed set defined from the texture is proposed as a feature to describe grayscale textures. Randen presents an interesting comparative study of different filtering approaches, where the features used for texture classification are obtained from signal processing techniques such as Laws and Gabor filters, wavelet transforms or the discrete cosine transform [12]. Wavelets have been successfully applied to different image processing applications.
When transformed, an image is represented as the sum of its details at different scales and orientations, plus a coarse approximation of the image. This naturally led to considering wavelets as a possible tool for texture classification. Some traditional approaches use the energy of the wavelet coefficients in each subband as texture descriptors, under the assumption that the energy distribution in the frequency domain identifies textures. A natural extension of this method consists in describing every texture by means of the marginal densities of the wavelet coefficients in every subband. This is justified by psychological studies which suggest that two homogeneous textures are difficult to distinguish if they produce similar marginal distributions in response to a bank of filters [8]. A number of authors have observed that wavelet subband coefficients have highly non-Gaussian statistics [3], [14]. Numerous tests show that the normalized histograms of wavelet coefficients in each detail subband can be well approximated by a Generalized Gaussian Distribution (GGD). This model has been used with the Kullback-Leibler divergence as a similarity measure between distributions [6]. In this work, by applying a wavelet transform and considering at each scale the pair (horizontal detail, vertical detail) associated with each position, we extract information on the importance (contrast) and orientation of the edges in the image. We present a novel texture descriptor, used for content-based retrieval, based on the wavelet transform. The basic idea consists in modeling the distributions of the moduli of the wavelet detail coefficients at each decomposition level, and computing their empirical angle histogram at each level (assuming vertical and horizontal coefficients at the same position are the components of a bivariate random vector). Our tests indicate that the gamma probability density function (pdf), described by two parameters, is a reasonable model for the histogram of the coefficient moduli. Thus, we propose to characterize each texture in the database by information extracted both from the moduli and from the orientation of the coefficients. Section 2 introduces an illustrative example of our work and explains our approach in detail. In section 3 we present experimental results which evaluate the performance of our texture descriptors. Finally, section 4 contains concluding remarks.
2 Modelling Coefficients Distribution of Wavelet Transform
2.1 Analyzing the Joint Histograms of Wavelet Coefficients
For our tests we have chosen the orthogonal Daubechies 4 wavelet. In the traditional wavelet transform ([4] [9]), coefficients are calculated via convolutions with 2 filters (lowpass and highpass). This is followed by downsampling operations, which prevent the wavelet transform from being invariant to translations. For comparison, we have included in our tests another wavelet transform that is translation invariant, proposed by Mallat [9]. It is calculated with the so-called à trous algorithm, via convolutions with 2 filters, but has no downsampling operations, so that it gives a redundant representation of an image. To capture lower frequencies at each step of the transform, the filters are upsampled. The Daubechies 4 filters are lowpass $[h_3\ h_2\ h_1\ h_0] = \frac{1}{4\sqrt{2}}\,[1 - e,\ 3 - e,\ 3 + e,\ 1 + e]$, with $e = \sqrt{3}$, and highpass $[-h_0\ \ h_1\ \ -h_2\ \ h_3]$. The filters for Mallat’s translation invariant transform correspond to a biorthogonal wavelet: lowpass $\sqrt{2}\,[a\ b\ b\ a]$ and highpass $[c\ \ -c]$, with $a = 0.125$, $b = 0.375$ and $c = \sqrt{2}/2$. Before proposing a new model for the joint histogram of wavelet coefficients at each detail level, we study this distribution for some simple images. The image in figure 1(a), chosen for the analysis, is a natural image from the Brodatz collection. It represents a texture with a privileged orientation. Both wavelet transforms mentioned above have been applied to this image up to three levels of decomposition. At each level, four subbands are obtained: the approximation subband and the horizontal, vertical and diagonal details. In our study, we only use the vertical and horizontal detail subbands. It is worth mentioning that, for the traditional wavelet transform, the subbands of the first level are a fourth (in size) of the image; when there is no downsampling step, they have the same size as the original image.
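The filter coefficients can be checked numerically. The sketch below assumes the standard Daubechies 4 coefficients $h_0 = (1+\sqrt{3})/(4\sqrt{2})$, $h_1 = (3+\sqrt{3})/(4\sqrt{2})$, $h_2 = (3-\sqrt{3})/(4\sqrt{2})$, $h_3 = (1-\sqrt{3})/(4\sqrt{2})$ and verifies the usual orthonormality conditions, together with the normalization of the biorthogonal filters quoted for the translation invariant transform.

```python
import numpy as np

e = np.sqrt(3.0)
h = np.array([1 + e, 3 + e, 3 - e, 1 - e]) / (4.0 * np.sqrt(2.0))   # h0, h1, h2, h3

lowpass = h[::-1]                                   # [h3, h2, h1, h0]
highpass = np.array([-h[0], h[1], -h[2], h[3]])     # [-h0, h1, -h2, h3]

low_sum = float(lowpass.sum())                      # sqrt(2) for a lowpass filter
low_energy = float(np.sum(lowpass ** 2))            # 1 for an orthonormal filter
shift2 = float(h[0] * h[2] + h[1] * h[3])           # orthogonality to its shift by 2
cross = float(lowpass @ highpass)                   # lowpass orthogonal to highpass
high_sum = float(highpass.sum())                    # highpass annihilates constants

# Biorthogonal filters of the translation invariant transform, as given in the text.
a, b, c = 0.125, 0.375, np.sqrt(2.0) / 2.0
mallat_low = np.sqrt(2.0) * np.array([a, b, b, a])
mallat_high = np.array([c, -c])
```

The order of the four taps matters: the orthogonality conditions hold for the ordering used here, and they fail if the same four values are permuted arbitrarily.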
Figure 1(a) shows an original texture, to which we have applied 3 levels of the biorthogonal wavelet transform without downsampling. Figure 1(b) plots the joint empirical distributions of the detail coefficients for the 3 levels, assuming that horizontal and vertical detail coefficients at the same position are the components of a bivariate random vector. From these joint distributions it can be inferred that a correlation exists between the horizontal and vertical wavelet subbands. This correlation is noticeable in the second and third levels of the transform, whereas the finer detail coefficients do not give much information: they are dominated by noise. The magnitude and angle of the random vector samples can be computed and analyzed. The empirical magnitude histogram (figure 1(d)) and the circular histogram (figure 1(f)) are shown for this wavelet. The circular histogram clearly indicates a privileged orientation of the edges in the texture. We obtain similar results for other images with clearly oriented edges. The same study has been carried out applying the traditional Daubechies 4 wavelet transform to the same image. The joint empirical distributions of the detail coefficients are shown in figure 1(c). In this case there is no noticeable correlation between horizontal and vertical subbands, and the histograms behave differently.

Fig. 1. Empirical distributions for image (a), shown for Level 1, Level 2 and Level 3. First row (b): joint distribution of Mallat’s à trous wavelet coefficients, second row (c): joint distribution of Daubechies 4 detail coefficients, third row (d): magnitude histogram of Mallat’s à trous coefficients, fourth row (e): magnitude histogram of Daubechies 4 transform coefficients, (f): circular histogram of Mallat’s à trous coefficients.

2.2 The Proposed Model for Wavelet-Coefficients Distributions
The previous section shows the importance of treating horizontal and vertical details jointly. However, it is very difficult to find a model that fits the empirical joint histogram reasonably well for any kind of texture. Some papers follow this approach, such as [13], where the distribution of the subband coefficients is modeled using a joint alpha-stable sub-Gaussian distribution. Our approach is different. We do want to make use of the existing relation between the vertical and horizontal coefficients at a certain position, but we also want a model which is simple to fit and capable of characterizing different types of textures. We therefore model the moduli histograms. Observe that for both wavelet transforms the shape of the empirical distribution changes from level to level, in such a way that the peak of the histogram is shifted to the right. This behavior may be modelled by means of the gamma probability density function (pdf), which is defined by two parameters, k and θ:

$$f(x; k, \theta) = x^{k-1}\, \frac{e^{-x/\theta}}{\theta^{k}\, \Gamma(k)} \qquad (1)$$
where k > 0 is a shape parameter, and θ > 0 is related to the scale of the distribution (if θ is large, then the distribution will be more spread out). The gamma distribution is related to many other distributions. The chi-square and exponential distributions, which are special cases of the gamma distribution, are one-parameter distributions that fix one of the two gamma parameters. The idea is to fit our empirical moduli histograms to a gamma distribution using the maximum-likelihood estimator. A Kolmogorov-Smirnov goodness-of-fit test has been applied to the different samples in order to justify the use of this model, giving p-values above the significance level α = 0.05 for most of the samples analyzed, for both wavelet transforms considered. The gamma distribution therefore appears to be a reasonable model for the moduli of wavelet coefficients.

As seen in the previous section, the circular histogram gives valuable information about privileged orientations in textures. Thus, these histograms can also be used to describe textures. An image with a privileged orientation will present a bimodal circular histogram whose two modes are separated by 180 degrees, whereas a random texture, without privileged orientations, will present a uniform circular histogram. Thus, we propose to characterize each texture in the database by:

– Information from the moduli: parameters kn, θn of the gamma pdf for n = 1 . . . 3, where n is the wavelet decomposition level.
– Information from the angles, given by the empirical circular histogram. Angles are quantized by dividing the interval [0, 2π] into 40 bins. Pn(r) is the observed frequency of the angles in the rth bin for level n.

The retrieval experiments in section 3 have been performed by using the gamma distribution parameters and the circular histograms.
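As an illustration only, the descriptor of one decomposition level could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the detail coefficients are assumed to arrive as two equally long lists, and the shape parameter is obtained with a well-known closed-form approximation to the maximum-likelihood estimate rather than an exact solver.

```python
import math

def fit_gamma(moduli):
    """Approximate maximum-likelihood fit of a gamma pdf (shape k, scale theta).

    Assumes strictly positive samples that are not all equal.
    """
    n = len(moduli)
    mean = sum(moduli) / n
    mean_log = sum(math.log(x) for x in moduli) / n
    s = math.log(mean) - mean_log  # positive by Jensen's inequality
    # Closed-form approximation to the ML shape estimate (assumption: good
    # enough for a sketch; Newton iterations could refine it).
    k = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    theta = mean / k  # the ML scale satisfies k * theta = sample mean
    return k, theta

def circular_histogram(angles, bins=40):
    """Relative frequency of angles (radians) over equal sectors of [0, 2*pi)."""
    counts = [0] * bins
    width = 2.0 * math.pi / bins
    for a in angles:
        r = int((a % (2.0 * math.pi)) / width)
        counts[min(r, bins - 1)] += 1  # guard against rounding up to `bins`
    total = len(angles)
    return [c / total for c in counts]

def describe_level(horizontal, vertical):
    """Return (k, theta, histogram) for one level of paired detail coefficients."""
    moduli, angles = [], []
    for h, v in zip(horizontal, vertical):
        m = math.hypot(h, v)
        if m > 0.0:  # a zero vector has neither a log-modulus nor an angle
            moduli.append(m)
            angles.append(math.atan2(v, h))
    k, theta = fit_gamma(moduli)
    return k, theta, circular_histogram(angles)
```

Calling `describe_level` once per level n = 1, 2, 3 would give the three (k, θ) pairs and the three 40-bin histograms that make up the descriptor.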
3 Experimental Results
The objective of this section is to test the new wavelet-based texture descriptor. Images from the well-known Brodatz image database have been used for the experiments; this image database is composed of 105 images representing different kinds of textures. Thirty-six images from this collection were selected and each original image of size 512 × 512 was partitioned into sixteen 128 × 128 subimages by randomly choosing the left corner, thus creating a test database of N = 36 × 16 = 576 images. The texture descriptor was computed for each image in the test database. In our experiments, any image in the test database could serve as a simulated query image. The relevant images for each query were defined as the other 15 subimages from the same original Brodatz image. We evaluated the performance in terms of the average rate of retrieval of relevant images. We have run the same experiment three times: with the Daubechies 4 filters and the traditional wavelet transform, with the Daubechies 4 filters without downsampling, and with Mallat's biorthogonal wavelet without downsampling. The Kullback-Leibler divergence (KLD), commonly used to compare two distributions, was the chosen similarity measure between the query image Iq and each image in the database {Ij, j = 1 . . . N}: we retrieved the 16 images closest to the query, and counted how many corresponded to the same texture as the query image. The score used to measure the similarity between images Ij and Iq was computed as

$$S(j, q) = S_1(j, q) + S_2(j, q), \quad \text{where} \quad S_\ell(j, q) = \sum_{n=1}^{3} S_\ell(j, q, n), \quad \ell = 1, 2, \qquad (2)$$

$$S_1(j, q, n) = \int f(x; k_n^j, \theta_n^j) \log \frac{f(x; k_n^j, \theta_n^j)}{f(x; k_n^q, \theta_n^q)}\, dx, \qquad S_2(j, q, n) = \sum_r P_n^j(r) \log \frac{P_n^j(r)}{P_n^q(r)}. \qquad (3)$$

The term S1(j, q, n) is the KLD (or relative entropy) of the moduli pdf at level n for images j and q. The term S2(j, q, n) is the KLD of the empirical distribution of the quantized angles at level n for images j and q. The final score was the sum of both scores (modulus and angle KLD scores) at the first 3 levels.
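The score of equations (2) and (3) can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, the integral in S1 is approximated by a midpoint rule on a truncated interval (reasonable when both shape parameters are at least about 1), and a small constant is added to the histogram bins to avoid taking the logarithm of zero.

```python
import math

def gamma_log_pdf(x, k, theta):
    """Logarithm of the gamma pdf of equation (1) at x > 0."""
    return ((k - 1.0) * math.log(x) - x / theta
            - k * math.log(theta) - math.lgamma(k))

def kld_gamma(k_j, theta_j, k_q, theta_q, steps=20000):
    """S1 term: KLD between two gamma pdfs, by midpoint-rule integration."""
    # Assumed truncation point: mean plus 12 standard deviations of either pdf.
    upper = max(k_j * theta_j + 12.0 * math.sqrt(k_j) * theta_j,
                k_q * theta_q + 12.0 * math.sqrt(k_q) * theta_q)
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # midpoints never touch x = 0
        log_fj = gamma_log_pdf(x, k_j, theta_j)
        log_fq = gamma_log_pdf(x, k_q, theta_q)
        total += math.exp(log_fj) * (log_fj - log_fq)
    return total * h

def kld_histograms(p_j, p_q, eps=1e-10):
    """S2 term: KLD between two circular histograms (eps smoothing is an assumption)."""
    norm_j = sum(p_j) + eps * len(p_j)
    norm_q = sum(p_q) + eps * len(p_q)
    total = 0.0
    for a, b in zip(p_j, p_q):
        a = (a + eps) / norm_j
        b = (b + eps) / norm_q
        total += a * math.log(a / b)
    return total

def texture_score(desc_j, desc_q):
    """S(j, q): sum of S1 and S2 over levels; each level is (k, theta, histogram)."""
    score = 0.0
    for (kj, tj, pj), (kq, tq, pq) in zip(desc_j, desc_q):
        score += kld_gamma(kj, tj, kq, tq) + kld_histograms(pj, pq)
    return score
```

A smaller score means a closer match, so retrieval would sort the database images by increasing `texture_score` against the query descriptor.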
Table 1. Average retrieval rate in the top 16

Wavelet transform                             Extracted features
Wavelet filters         Algorithm             S1      S1 + S2
Daubechies 4            with downsampling     0.75    0.84
Daubechies 4            no downsampling       0.75    0.89
Mallat's biorthogonal   no downsampling       0.79    0.89
Table 1 shows the fraction of relevant images retrieved in the top 16 matches. The maximum and minimum rates were 89% and 75%, respectively. The main points that we can observe from these results are that both the downsampling operation of the wavelet transform and the features used have a significant effect on retrieval performance. Omitting the downsampling steps improves the results when the angle information is included (from 0.84 to 0.89 for the Daubechies 4 filters). This is because the resulting transform is translation invariant. Our results in the first column of the table, which are based only on the moduli information and do not take the angle information into account, are similar to the ones obtained by modelling the marginal histograms (vertical and horizontal independently) with the generalized Gaussian density (GGD). Moreover, our proposed descriptor needs only half the number of parameters. Including the orientation information in the features improves the average retrieval rate by roughly 10 percentage points (between 9 and 14 in Table 1), regardless of the wavelet considered and of the presence or absence of the downsampling operation. This means that the orientation information is a valuable feature for discriminating textures.
4 Conclusions
We have presented a novel wavelet-based texture descriptor for visual information retrieval. The basic idea is to characterize each texture in the image database by the moduli and orientations of its wavelet coefficients, assuming that vertical and horizontal coefficients at the same position are the components of a bivariate random vector. The Kullback-Leibler divergence has been chosen as the similarity measure. Retrieval experiments using the Brodatz database show the good performance of this new descriptor.
Acknowledgement. This work has been partially supported by grants GV04177 (for research stays), MCYT TIN2006-10134, UBACYT X166 and BID 1728/ OC-AR-PICT 26001.
References
1. Ayala, G., Domingo, J.: Spatial size distributions: Applications to shape and texture analysis. IEEE Trans Image Processing 23, 1430–1442 (2001)
2. Azencott, R., Wang, J.P., Younes, L.: Texture classification using windowed Fourier filters. IEEE Trans Pattern Analysis and Machine Intelligence 19(3), 148–153 (1997)
3. Buccigrossi, R., Simoncelli, E.: Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Signal Proc 8, 1688–1701 (1999)
4. Daubechies, I.: Ten lectures on wavelets. Society for Industrial and Applied Mathematics (1992)
5. Del Bimbo, A.: Visual Information Retrieval. Morgan Kaufmann, San Francisco (1999)
6. Do, M.N., Vetterli, M.: Wavelet-based texture retrieval using generalized Gaussian density and Kullback-Leibler distance. IEEE Trans Image Processing 11, 2 (2002)
7. Epifanio, I., Ayala, G.: A random set view of texture classification. IEEE Transactions on Image Processing 11, 859–867 (2002)
8. Heeger, D., Bergen, J.R.: Pyramid-based texture analysis/synthesis. In: Proc. ACM SIGGRAPH (1995)
9. Mallat, S.: A Wavelet Tour of Signal Processing. Academic Press, London (1999)
10. Pentland, A., Picard, R.W., Sclaroff, S.: Photobook: Content-based manipulation of image databases. International Journal of Computer Vision 18(3), 233–254 (1996)
11. Smeulders, A.W.M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12), 1349–1379 (2000)
12. Husoy, J., Randen, T.: Filtering for texture classification: A comparative study. IEEE Trans Pattern Analysis and Machine Intelligence 20, 115–122 (1999)
13. Tzagkarakis, G., Beferull-Lozano, B., Tsakalides, P.: Rotation-invariant texture retrieval with gaussianized steerable pyramids. IEEE Trans Image Processing 15 (2006)
14. Wouwer, G.V., Scheunders, P., Dyck, D.V.: Statistical texture characterization from discrete wavelet representation. IEEE Trans Image Processing 8, 592–598 (1999)
Space-Variant Restoration with Sliding Discrete Cosine Transform

Vitaly Kober and Jacobo Gomez Agis

Department of Computer Science, CICESE, Ensenada, B.C., Mexico
[email protected]
Abstract. A local adaptive restoration technique using a sliding discrete cosine transform (DCT) is presented. A minimum mean-square error estimator in the domain of a sliding DCT for image restoration is derived. The local restoration is performed by pointwise modification of local DCT coefficients. To provide image processing in real time, a fast recursive algorithm for computing the sliding DCT is utilized. The algorithm is based on a recursive relationship between three subsequent local DCT spectra. Computer simulation results using a real image are provided and compared with those of common restoration techniques.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 903–911, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Many different restoration techniques (linear, nonlinear, iterative, noniterative, deterministic, stochastic, etc.) optimized with respect to different criteria have been introduced [1-7]. The techniques may be broadly divided into two classes: (i) fundamental algorithms and (ii) specialized algorithms. One of the most popular fundamental techniques is the linear minimum mean square error (LMMSE) method. It finds the linear estimate of the ideal image for which the mean square error between the estimate and the ideal image is minimum. The linear operator acting on the observed image to determine the estimate is obtained on the basis of a priori second-order statistical information about the image and noise processes. In the case of stationary processes and space-invariant blurs, the LMMSE estimator takes the form of the Wiener filter. A Kalman filter determines the causal LMMSE estimate recursively. It is based on a state-space representation of the imaging system, and image data are used to define the state vectors. Specialized algorithms can be viewed as extensions of the fundamental algorithms to specific restoration problems.

In this paper we deal with restoration of images degraded by space-variant blurs. In principle, all fundamental algorithms apply to the restoration of images degraded by space-variant blurs. However, because Fourier transforms cannot be utilized when the blur is space-variant, space-domain implementations of these algorithms may be computationally formidable due to large matrix operations. Several specialized methods were developed to attack the space-variant restoration problem. The first class, referred to as sectioning, is based on the assumption that the blur is approximately space-invariant within local regions of the image. Therefore, the entire image can be restored by applying well-known space-invariant techniques to the local image regions. A drawback of sectioning methods is the generation of artifacts at the region boundaries. The second class is based on a coordinate transformation [7], which is applied to the observed image so that the blur in the transformed coordinates becomes space-invariant. Therefore, the transformed image can be restored by a space-invariant filter and then transformed back to obtain the final restored image. However, the statistical properties of the image and noise processes are affected by the coordinate transformation. In particular, the stationarity in the original spatial coordinates is not preserved in the transformed coordinate system.

In this paper, we carry out the space-variant restoration using sliding discrete cosine transform (DCT) coefficients. The sliding DCT is based on the concept of short-time signal processing [8]. The short-time orthogonal transform of a signal xk is defined as
$$X_s^k = \sum_{n=-\infty}^{\infty} x_{k+n}\, w_n\, \psi(n, s), \qquad (1)$$
where wn is a window sequence and ψ(n,s) represents the basis functions of an orthogonal transform. We use one-dimensional notation for simplicity. Equation (1) can be interpreted as the orthogonal transform of xk+n as viewed through the window wn. $X_s^k$ displays the orthogonal transform characteristics of the signal around time k. Note that while increased window length and resolution are typically beneficial in the spectral analysis of stationary data, for time-varying data it is preferable to keep the window length sufficiently short so that the signal is approximately stationary over the window duration. Assume that the window has finite length around n=0, and that it is unity for all n∈[-N1, N2]. Here N1 and N2 are integer values. This leads to signal processing in a sliding window. In other words, local filters in the domain of an orthogonal transform at each position of a moving window modify the orthogonal transform coefficients of a signal to obtain only an estimate of the pixel xk of the window.

The choice of orthogonal transform for sliding signal processing depends on many factors. The DCT is one of the most appropriate transforms with respect to the accuracy of power spectrum estimation from the observed data that is required for local filtering, the filter design, and the computational complexity of the filter implementation. Linear filtering in the domain of the DCT followed by inverse transforming is superior to filtering in the domain of the discrete Fourier transform (DFT) because a DCT can be considered as the DFT of a signal evenly extended outside its edges. This attenuates the boundary effects caused by circular convolution that are typical for linear filtering in the domain of the DFT.

The presentation is organized as follows. In Section 2, we review recursive algorithms for computing the sliding forward and inverse DCTs. In Section 3, a local adaptive filter minimizing the mean-square error defined in the domain of the sliding DCT is derived. In Section 4, we test the performance of the filter in restoring a real aerial image degraded by nonuniform motion blur. Section 5 summarizes our conclusions.
2 Fast Forward and Inverse Algorithms of Sliding DCT

The discrete cosine transform is widely used in many signal processing applications. This is because the DCT performs close to the optimum Karhunen-Loeve transform for first-order Markov stationary data when the correlation coefficient is near 0.9 [1]. Recently, fast forward and inverse algorithms for computing sliding DCTs were proposed [9]. The sliding cosine transform (SCT) is defined as

$$X_s^k = \sum_{n=-N_1}^{N_2} x_{k+n} \cos\left(\pi\, \frac{(n + N_1 + 1/2)\, s}{N}\right), \qquad (2)$$

where N=N1+N2+1 and {$X_s^k$; s=0, 1,…, N-1} are the transform coefficients around time k. The coefficients of the DCT can be obtained as {$C_0^k = X_0^k / \sqrt{2}$; $C_s^k = X_s^k$, s=1,…, N-1}. The SCT based on a recursive relationship between three subsequent local DCT spectra is given by

$$X_s^{k+1} = 2 X_s^{k} \cos\left(\frac{\pi s}{N}\right) - X_s^{k-1} + \cos\left(\frac{\pi s}{2N}\right) \left( x_{k-N_1-1} - x_{k-N_1} + (-1)^s \left( x_{k+N_2+1} - x_{k+N_2} \right) \right). \qquad (3)$$
We see that the computation of the DCT at the window position k+1 involves values of the input sequence xk as well as the DCT coefficients computed in the two previous positions of the moving window.

Table 1. Number of arithmetic operations for computing the sliding DCT

Algorithm              Number of additions    Number of multiplications
Fast DCTs [10]         3MN/2 - N + 1          MN/2 + 1
Recursive algorithm    2N + 5                 2N - 1
Table 1 provides a comparison of the computational complexity of the recursive algorithm with that of fast DCT algorithms. The length of a moving window for the recursive algorithm is an arbitrary integer value determined by the characteristics of the signal to be processed. In contrast, fast DCT algorithms require the length to be a power of 2, $N = 2^M$. If xk is the central pixel of the window, that is, N1=N2 and N=2N1+1, then the inverse transform is written as

$$x_k = \frac{1}{N} \left( 2 \sum_{s=1}^{N_1} (-1)^s X_{2s}^k + X_0^k \right). \qquad (4)$$
We note that in the computation only the spectral coefficients with even indices are involved. The computation requires one multiplication and N1+1 additions.
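To make the recursion concrete, the following sketch evaluates equation (2) directly, advances the window with equation (3), and recovers the central sample with equation (4). It is an illustration only: the function names are hypothetical, the signal is assumed to be a plain list indexed from zero, and the caller is assumed to keep all indices inside the list.

```python
import math

def sct_direct(x, k, n1, n2):
    """Equation (2): sliding cosine transform of the window centred at index k."""
    n = n1 + n2 + 1
    return [sum(x[k + m] * math.cos(math.pi * (m + n1 + 0.5) * s / n)
                for m in range(-n1, n2 + 1))
            for s in range(n)]

def sct_step(x, k, n1, n2, cur, prev):
    """Equation (3): spectrum at k + 1 from the spectra at k (cur) and k - 1 (prev)."""
    n = n1 + n2 + 1
    nxt = []
    for s in range(n):
        # Samples leaving and entering the window between positions k-1 and k+1.
        edge = (x[k - n1 - 1] - x[k - n1]
                + (-1) ** s * (x[k + n2 + 1] - x[k + n2]))
        nxt.append(2.0 * cur[s] * math.cos(math.pi * s / n) - prev[s]
                   + math.cos(math.pi * s / (2.0 * n)) * edge)
    return nxt

def central_sample(spectrum, n1):
    """Equation (4): central sample from even-indexed coefficients (N1 = N2)."""
    n = 2 * n1 + 1
    acc = spectrum[0]
    for s in range(1, n1 + 1):
        acc += 2.0 * (-1) ** s * spectrum[2 * s]
    return acc / n
```

In a streaming setting only the two most recent spectra need to be stored, and `sct_direct` is needed only to initialise the first two window positions.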
3 Signal Restoration in the Domain of Sliding DCT

First we define a local criterion of the performance of filters for image and signal processing, and then derive optimal local adaptive filters with respect to this criterion. One of the most widely used criteria in signal processing is the minimum mean-square error (MMSE). Since the processing is carried out in a moving window, an estimate of the central element of the window is computed for each position of the window. Suppose that the signal to be processed is approximately stationary within the window. The signal may be distorted by sensor noise. Let us consider a generalized linear filtering of a fragment of a one-dimensional input signal (for instance, for a fixed position of the moving window). Let a=[ak] be the undistorted real signal, x=[xk] be the observed signal, k=1,…, N, N be the size of the fragment, U be the matrix of the discrete cosine transform, E{.} be the expected value, and the superscript T denote the transpose. Let $\hat{a} = Hx$ be a linear estimate of the undistorted signal that minimizes the mean-square error averaged over the window

$$\mathrm{MMSE} = E\left\{ (a - \hat{a})^T (a - \hat{a}) \right\} / N. \qquad (5)$$
The optimal filter for this problem is the Wiener filter [1]:

$$H = E\{a x^T\} \left[ E\{x x^T\} \right]^{-1}. \qquad (6)$$
Let us consider the known model of the signal:

$$x_k = \sum_{n} w_{k,n}\, a_n + v_k, \qquad (7)$$
where W=[wk,n] is a distortion matrix, ν=[vk] is additive noise with zero mean, k,n=1,…,N, and N is the size of the fragment. The equation can be rewritten as

$$x = W a + \nu, \qquad (8)$$

and the optimal filter is given by

$$H = K_{aa} W^T \left[ W K_{aa} W^T + K_{\nu\nu} \right]^{-1}, \qquad (9)$$

where $K_{aa} = E\{a a^T\}$ and $K_{\nu\nu} = E\{\nu \nu^T\}$ are the covariance matrices, and $E\{a \nu^T\} = 0$, that is, the input signal and the noise are assumed to be uncorrelated. Equation (9) follows from (6) because, under the model (8), $E\{a x^T\} = K_{aa} W^T$ and $E\{x x^T\} = W K_{aa} W^T + K_{\nu\nu}$. The obtained optimal filter is based on the assumption that the input signal within the window is stationary. The result of filtering is the restored window signal. This corresponds to signal processing in nonoverlapping fragments.

Now suppose that the signal is processed in a moving window in the domain of the sliding DCT. For each position of the window an estimate of the central pixel should be computed. Using the equation for the inverse sliding DCT presented in the previous section, the pointwise MSE for reconstruction of the central element of the window can be written as follows:

$$\mathrm{PMSE}(k) = E\left\{ \left[ a(k) - \hat{a}(k) \right]^2 \right\} = E\left\{ \left[ \sum_{l=1}^{N} \alpha(l) \left( A(l) - \hat{A}(l) \right) \right]^2 \right\}, \qquad (10)$$
where $\hat{A} = [\hat{A}(l) = H(l) X(l)]$ is the vector of the signal estimate in the domain of the DCT, $H_U = [H(l)]$ is a diagonal matrix of the scalar filter, and $\alpha = [\alpha(l)]$ is a diagonal matrix of the coefficients of the inverse sliding cosine transform (4). Minimizing (10), we obtain

$$H_U = \left[ P_{xx} \right]^{-1} P_{ax} I_\alpha, \qquad (11)$$

where $P_{ax} = [E\{A(l) X(k)\}]$, $P_{xx} = [E\{X(l) X(k)\}]$, and $I_\alpha$ is the identity matrix of the dimension of α. Note that the matrix of coefficients $\alpha = [\alpha(l)]$ for the inverse sliding transform (4) is singular. The matrix dimension involved in the inverse sliding cosine transform (4) is about half the size of the window signal. Therefore, the computational complexity of the scalar filters in (11) and of the signal processing can be significantly reduced compared with the complexity of the filter in (6). For the model of signal distortion in (8) the filter matrix is given as

$$H_U = \left[ U \left( W K_{aa} W^T + K_{\nu\nu} \right) U^T \right]^{-1} U K_{aa} W^T U^T I_\alpha. \qquad (12)$$
If a signal has a high correlation coefficient and a smoothed version of the signal is corrupted by additive, weakly-correlated noise, then the matrix $U (W K_{aa} W^T + K_{\nu\nu}) U^T$ in (12) is close to diagonal. Figure 1 shows the covariance matrix of a smoothed, noisy, one-dimensional signal having a correlation coefficient of 0.95, as well as the discrete cosine transform of the covariance matrix. The linear convolution between a signal x and the matrix $K_{aa} W^T$ in the domain of the sliding DCT can be well approximated by a diagonal matrix $\mathrm{Diag}(U K_{aa} W^T U^T I_\alpha) X$. Therefore, the matrix of the scalar filter in (12) is close to diagonal, and the filter can be written as

$$H(l) \approx \frac{P_1(l)}{P_2(l) + P_{\nu\nu}(l)}, \qquad (13)$$

where $P_1(l)$, $P_2(l)$, $P_{\nu\nu}(l)$ are the diagonal elements of the matrices $U K_{aa} W^T U^T I_\alpha$, $U W K_{aa} W^T U^T$, and $U K_{\nu\nu} U^T$, respectively, l=1,…,N1, and N1 is the dimension of the matrix $I_\alpha$.
Fig. 1. (a) Covariance matrix of a noisy signal, (b) DCT of the covariance matrix
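The near-diagonal structure can be checked numerically with a small sketch. It is a simplification under stated assumptions, not a reproduction of Fig. 1: the covariance used here is that of an unblurred, noise-free first-order Markov signal with correlation coefficient 0.95, the window length of 16 is arbitrary, and the helper names are hypothetical.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II matrix U; each row is one basis vector."""
    u = []
    for s in range(n):
        scale = math.sqrt(1.0 / n) if s == 0 else math.sqrt(2.0 / n)
        u.append([scale * math.cos(math.pi * (m + 0.5) * s / n)
                  for m in range(n)])
    return u

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def markov_covariance(n, rho):
    """Covariance of a unit-variance first-order Markov signal (assumed model)."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]

def diagonal_energy_fraction(m):
    """Share of the squared entries of m that lies on its main diagonal."""
    diag = sum(m[i][i] ** 2 for i in range(len(m)))
    total = sum(v * v for row in m for v in row)
    return diag / total

n, rho = 16, 0.95
k_aa = markov_covariance(n, rho)
u = dct_matrix(n)
k_dct = matmul(matmul(u, k_aa), transpose(u))  # U K U^T
print(diagonal_energy_fraction(k_aa), diagonal_energy_fraction(k_dct))
```

In this simplified setting almost all of the energy of the transformed matrix sits on its diagonal, whereas the original covariance spreads it across all entries, which is the property that justifies the scalar filter (13).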
For the design of local adaptive filters in the domain of a sliding DCT the covariance matrices and power spectra of fragments of a signal are required. Since they are often unknown, in practice, these matrices can be estimated from observed signals [3-6].
4 Computer Simulation Results

The objective of this section is to develop a technique for local adaptive restoration of images degraded by nonuniform motion blur. Assume that the blur is due to horizontal relative motion between the camera and the image, and that it is approximately space-invariant within local regions of the image. It is known that point spread functions for motion and focus blurs do have zeros in the frequency domain, and they can be uniquely identified by the location of these zero crossings [4]. We also assume that the observation noise is a zero-mean, white Gaussian process that is uncorrelated with the image signal. In this case, the noise field is completely characterized by its variance, which is commonly estimated by the sample variance computed over a low-contrast local region of the observed image. Next, we design local adaptive filters based on a sliding DCT to restore the image. Let {$X^k(l)$, $\tilde{X}^k(l)$, $H_w^k(l)$, l=1,…, N} be the DCT coefficients around time k of the observed signal, the filtered signal, and the linear motion degradation, respectively. Here N=2N1+1 is the length of the DCT. Note that N1 is an arbitrary integer value, which is determined by the minimal size of details to be preserved after filtering. We use the criterion of the PMSE around time k, which is defined in the domain of the sliding DCT. Taking into account (13) and (4), the estimate of the reconstructed image signal can be written as

$$\tilde{x}_k = \frac{1}{N} \left( 2 \sum_{s=1}^{N_1} (-1)^s \tilde{X}_{2s}^k + \tilde{X}_0^k \right), \qquad (14)$$

where

$$\tilde{X}^k(l) = \begin{cases} \left( \dfrac{P_{xx}^k(l) - P_{\nu\nu}^k(l)}{P_{xx}^k(l)\, H_w^k(l)} \right) X^k(l), & P_{xx}^k(l) > P_{\nu\nu}^k(l) + B^k \text{ and } H_w^k(l) \neq 0, \\ 0, & \text{otherwise,} \end{cases} \qquad (15)$$
and $P_{xx}^k(l)$, $P_{\nu\nu}^k(l)$, l=1,…, N1 are estimates of the power spectra of the observed signal and noise in the domain of the sliding DCT, respectively, and $B^k$ is a small signal-dependent bias value. The designed filter can be considered as a modified spectral subtraction method in the domain of the sliding DCT. In general, spectral subtraction methods, while reducing the wide-band noise, introduce a new narrow-band noise due to the presence of remaining spectral peaks. To attenuate the remaining noise, one can oversubtract the power spectrum of the noise by introducing a nonzero power spectrum bias $B^k$.

A real test aerial image is shown in Fig. 2(a). The size of the image is 512x512, and each pixel has 256 levels of quantization. The signal range is [0, 1]. The image quadrants are degraded by sliding 1D horizontal averaging with the following sizes of the moving window: 5, 6, 4, and 3 pixels (for quadrants from left to right, from top to bottom). The image is also corrupted by zero-mean additive white Gaussian noise. The degraded image with a noise standard deviation of 0.05 is shown in Fig. 2(b). In our tests a window of 15x15 pixels is used. Since the spectral distributions of the image signal and the wide-band noise differ, the power spectrum of the noise can be easily measured from the experimental covariance matrix.
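The thresholded rule of equation (15) can be sketched for a single window position as follows. This is an illustration under assumptions: the function name is hypothetical, the power spectrum estimates and the degradation response are assumed to be supplied by the caller as equally long lists, and the bias is passed as one number for the current window.

```python
def restore_coefficients(x_coef, p_xx, p_vv, h_w, bias=0.0):
    """Apply equation (15) to the DCT coefficients of one window position.

    x_coef: observed DCT coefficients X^k(l)
    p_xx, p_vv: power spectrum estimates of the observed signal and the noise
    h_w: DCT-domain response of the local degradation
    bias: oversubtraction bias B^k
    """
    restored = []
    for x, pxx, pvv, h in zip(x_coef, p_xx, p_vv, h_w):
        if pxx > pvv + bias and h != 0.0:
            # Spectral subtraction followed by inversion of the degradation.
            restored.append((pxx - pvv) / (pxx * h) * x)
        else:
            # Coefficients dominated by noise, or at zeros of the blur, are dropped.
            restored.append(0.0)
    return restored
```

Feeding the even-indexed outputs into equation (14) would then give the estimate of the central pixel; a larger `bias` suppresses more of the weak coefficients at the cost of fine detail.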
Fig. 2. (a) Test image, (b) space-variant degraded test image, (c) global Wiener restoration, (d) local adaptive restoration in the domain of the sliding DCT, (e) difference between the original image and the image restored by the global Wiener algorithm, (f) difference between the original image and the image restored by the proposed algorithm
The results of image restoration by the global parametric Wiener filtering [2] and by the proposed method are shown in Figs. 2(c) and 2(d), respectively. Figures 2(e) and 2(f) show the difference between the original image and the image restored by the global Wiener algorithm, and between the original image and the image restored by the proposed algorithm, respectively. We see that the proposed algorithm is capable of performing good space-variant image restoration and noise suppression. Finally, we investigate the robustness of the tested restoration techniques to additive noise. The performance of the global parametric Wiener filtering and the local adaptive filtering is shown in Fig. 3.

Fig. 3. Performance of the restoration algorithms in terms of MSE versus the standard deviation of additive noise (vertical axis: MSE, 0 to 0.018; horizontal axis: standard deviation, 0.00 to 0.20; curves: global restoration and local adaptive restoration)
5 Conclusions

In this paper, we presented a new technique for space-variant restoration of linearly degraded and noisy images. The MMSE estimator in the domain of the sliding DCT was derived. To provide image processing at a high rate, a fast recursive algorithm for computing the sliding DCT was utilized. Extensive testing using various parameters of degradation (nonuniform motion blurring and corruption by noise) has shown that the original image can be well restored by a proper choice of the algorithm parameters.
References
1. Jain, A.K.: Fundamentals of digital image processing. Prentice Hall, Englewood Cliffs (1989)
2. Gonzalez, R.C., Woods, R.E.: Digital image processing. Prentice Hall, Englewood Cliffs (2002)
3. Pratt, W.K.: Digital image processing. Wiley-Interscience Publication, New York (1991)
4. Biemond, J., Lagendijk, R.L., Mersereau, R.M.: Iterative methods for image deblurring. Proceedings of the IEEE 78(5), 856–883 (1990)
5. Flusser, J., Suk, T., Saic, S.: Recognition of blurred images by the method of moments. IEEE Trans. on Image Processing 5(3), 533–538 (1996)
6. Bertero, M., Boccacci, P.: Introduction to inverse problems in imaging. Institute of Physics Publishing, Bristol and Philadelphia (1998)
7. Sawchuk, A.A.: Space-variant image restoration by coordinate transformations. J. Opt. Soc. Am. 64(2), 138–144 (1974)
8. Oppenheim, A.V., Schafer, R.W.: Discrete-time signal processing. Prentice Hall, Englewood Cliffs (1989)
9. Kober, V.: Fast algorithms for the computation of sliding discrete sinusoidal transforms. IEEE Trans. on Signal Process 52(6), 1704–1710 (2004)
10. Hou, H.S.: A fast recursive algorithm for computing the discrete cosine transform. IEEE Trans. Acoust. Speech Signal Process 35(10), 1455–1461 (1987)
Comparative Evaluation of Classical Methods, Optimized Gabor Filters and LBP for Texture Feature Selection and Classification

Jaime Melendez1, Domenec Puig1, and Miguel Angel Garcia2,*

1 Intelligent Robotics and Computer Vision Group, Department of Computer Science and Mathematics, Rovira i Virgili University, Av. Països Catalans 26, 43007 Tarragona, Spain
{jaime.melendez, domenec.puig}@urv.cat
2 Department of Informatics Engineering, Autonomous University of Madrid, Ctra. Colmenar Viejo Km 15, 28049 Madrid, Spain
[email protected]
Abstract. This paper builds upon a previous texture feature selection and classification methodology by extending it with two state-of-the-art families of texture feature extraction methods, namely Manjunath & Ma’s Gabor wavelet filters and Local Binary Pattern operators (LBP), which are integrated with more classical families of texture filters, such as co-occurrence matrices, Laws filters and wavelet transforms. Results with Brodatz compositions and outdoor images are evaluated and discussed, being the basis for a comparative study about the discrimination capabilities of those different families of texture methods, which have been traditionally applied on their own.
* This work has been partially supported by the Spanish Ministry of Education and Science under project DPI2004-07993-C03-03.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 912–920, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

Pixel-based texture classification aims at recognizing the texture patterns to which the pixels of a given image belong. In order to accomplish this task, it is necessary to compute a set of texture features by evaluating one or more texture feature extraction methods (texture methods for short) in a neighborhood of every pixel. This neighborhood is usually defined as a square window centered at that pixel. A wide variety of texture methods have been proposed in the literature. More recently, Gabor wavelet filters following the optimized design proposed in [1] and Local Binary Patterns [2] have proven to be very successful in different application domains. These techniques are considered the state of the art in texture analysis.

This paper builds upon previous work on pixel-based texture classification [3] and texture feature selection [4], by integrating the two aforementioned families of texture methods along with more classical families, and by evaluating the obtained results in the scope of pixel-based texture classification. A total of 68 texture methods independently evaluated on six window sizes (in practical terms, the number of methods is thus 68 × 6 = 408) have been integrated. The major contribution of this work is an exhaustive evaluation and comparative analysis that shows the pros and cons of those families of texture methods in terms of texture discrimination capabilities. These families have traditionally been applied on their own or sometimes combined without a clear criterion.

Section 2 of this paper summarizes both the texture classification and feature selection techniques that are the basis for the current work. Section 3 shows experimental results after applying the above techniques with different families of texture methods, including optimized Gabor filters and LBP. These results are analyzed and discussed in section 4. Conclusions and future work are given in section 5.
2 Pixel-Based Texture Classification and Automatic Texture Feature Selection

This section summarizes the pixel-based texture classifier proposed in [3] and the feature selection algorithm proposed in [4]. Details are omitted due to space limitations.

(a) Pixel-Based Texture Classification. Given a set of T texture patterns of interest, a statistical model for each pattern is built based on texture features extracted by M texture methods applied over square windows of W different sizes. This is done by computing M × W × T likelihood functions and by integrating them through a linear opinion pool scheme that uses a set of reliability measures as weights. Next, the Bayes rule is applied in order to compute T posterior probabilities. Every image pixel is then classified into the texture pattern that receives the maximum posterior probability, provided the latter is above an automatically computed significance level. Otherwise, the pixel is classified as unknown.

(b) Automatic Feature Selection. Given T different texture patterns of interest, each represented by a sample image, M texture methods and W window sizes per method, the significance of every method is first determined based on its performance in classifying the given patterns. Then, a list of the M texture methods sorted in descending order of significance is created per window size. Finally, a sequential forward generation procedure adapted to the proposed classifier keeps adding methods from the top of the sorted list until a performance criterion is maximized. This procedure is repeated W times in order to choose the most significant methods for every window size.
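The classification step in (a) can be illustrated with a short sketch. It is not the classifier of [3]: the function name is hypothetical, one reliability weight per method/window pair is assumed (the original scheme may weight each pattern separately), and the priors and the significance level are assumed to be given rather than computed automatically.

```python
def classify_pixel(likelihoods, weights, priors, significance):
    """Linear opinion pool followed by the Bayes rule for one pixel.

    likelihoods[i][t]: likelihood of pattern t under method/window pair i
    weights[i]: non-negative reliability weight of pair i (assumed form)
    priors[t]: prior probability of pattern t
    Returns (index of the winning pattern or None for "unknown", posteriors).
    """
    n_patterns = len(priors)
    total_weight = sum(weights)
    # Linear opinion pool: weighted average of the individual likelihoods.
    pooled = [sum(w * lik[t] for w, lik in zip(weights, likelihoods)) / total_weight
              for t in range(n_patterns)]
    joint = [p * l for p, l in zip(priors, pooled)]
    evidence = sum(joint)
    if evidence == 0.0:
        return None, [0.0] * n_patterns
    posteriors = [j / evidence for j in joint]
    best = max(range(n_patterns), key=lambda t: posteriors[t])
    if posteriors[best] <= significance:
        return None, posteriors  # not confident enough: unknown
    return best, posteriors
```

For example, two method/window pairs with likelihoods `[0.8, 0.2]` and `[0.6, 0.4]`, equal weights and equal priors give posteriors close to `[0.7, 0.3]`, so the pixel is assigned to the first pattern unless the significance level is 0.7 or higher.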
3 Experimental Evaluation The two techniques summarized in the previous section have been evaluated on composite images made of patches of well-known Brodatz textures [5], based on those used in [6], and real outdoor images. The first column of Fig. 1 shows seven of the input images used for the experiments. Six window sizes have been considered in all
J. Melendez, D. Puig, and M.A. Garcia
cases: 3 × 3, 5 × 5, 9 × 9, 17 × 17, 33 × 33 and 65 × 65. The number of integrated texture methods depends on the actual experiments carried out, as will be explained below. For the sake of comparison, the evaluated texture methods have been organized into three broad classes: “Classical” methods, optimized Gabor wavelet filters proposed in [1], which will be denoted as “GaborMM”, and Local Binary Patterns, denoted as “LBP”. While the last two classes constitute families of methods by themselves, the Classical class groups heterogeneous families of methods that have already been evaluated in [3][4]. Fourteen methods belonging to five different families are considered per window size: (1) four Laws filter masks, (2) two wavelet transforms, (3) four non-optimized Gabor filters, (4) three statistics and (5) the fractal dimension. The GaborMM class is built as a filter bank with six scales and four orientations. The texture features that characterize every pixel and its surrounding neighborhood are the mean and standard deviation of the modulus of the Gabor wavelet coefficients obtained after filtering an input image. Thus, the GaborMM class considers a total of 6 × 4 × 2 = 48 methods. These settings come from [1][7]. The kernel size is set to be the same as the window size. The LBP class consists of the three available uniform rotation-invariant Local Binary Pattern operators. The texture features used in these experiments are the mean and standard deviation of the values produced after applying each LBP operator to an input image for every considered window size. Hence, the number of texture methods corresponding to the LBP class is 3 × 2 = 6.
3.1 Experiment 1: Integration of Multiple Methods In the first experiment, the classifier described in section 2 is used to test the performance of each class of methods (Classical, GaborMM, LBP) when they are either independently integrated or integrated all together.
Columns 1 to 3 in Table 1 show the classification rates for every class of methods and test image, as well as the average classification rates for every class by considering all test images. Column 4 shows the same results when considering all the available methods. In turn, Fig. 1 (columns 3 and 4) shows some of the segmentation maps produced after the previous classification. Classification rates have been computed against the ground-truth images shown in the second column.
3.2 Experiment 2: Automatic Selection of Methods The second experiment applies the selection scheme summarized in section 2 in order to reduce the number of methods to be integrated. Without this selection scheme, the number of methods to be integrated can be very large, reaching its maximum when all methods and window sizes are considered: (14 + 48 + 6) × 6 = 408. The classification rates corresponding to this experiment are shown in Table 1 (column 5). The number of methods used in each case is shown in parentheses. The segmentation maps produced after classification with the selected methods are shown in Fig. 1 (column 5). In order to analyze the significance of each family (not class) of methods in the classification results, the ratio between the number of selected methods per family and the number of times all methods from that family could have been chosen has been computed for every particular test image by considering all window sizes. The
Comparative Evaluation of Classical Methods, Optimized Gabor Filters and LBP
Fig. 1. Test images (column 1), ground-truth (column 2), and segmentation maps produced by integrating: the best class (column 3), all methods (column 4), and the methods chosen after the selection process (column 5)
Table 1. Classification rates for different groups of integrated methods
             Class of texture methods
Test image   Classical    GaborMM      LBP          All methods   Selection of methods
1            75.76(84)    92.76(288)   75.71(36)    92.46(408)    89.61(55)
2            60.84(84)    84.06(288)   63.47(36)    78.82(408)    88.97(54)
3            66.30(84)    91.76(288)   72.59(36)    88.06(408)    94.45(62)
4            42.09(84)    73.97(288)   63.90(36)    77.96(408)    79.98(78)
5            53.17(84)    70.98(288)   74.85(36)    72.29(408)    77.31(136)
6            74.88(84)    88.68(288)   63.48(36)    86.18(408)    83.37(136)
7            69.86(84)    85.68(288)   48.14(36)    86.46(408)    83.85(136)
Average      63.27(84)    84.98(288)   66.02(36)    83.18(408)    85.36(87)
Table 2. Significance of each family of methods per image
Ratio of selected methods per family vs. total number of methods within a family (per image)
Test image   Laws   Statistic   Gabor   Fractal   Wavelet   GabMM   LBP
1            0.67   0.00        0.08    0.17      0.17      0.11    0.08
2            0.29   0.00        0.00    0.00      0.00      0.15    0.08
3            0.46   0.00        0.00    0.00      0.00      0.17    0.00
4            0.50   0.06        0.17    0.17      0.00      0.18    0.22
5            0.83   0.17        0.13    0.33      0.00      0.31    0.56
6, 7         0.17   0.25        0.67    0.00      0.58      0.37    0.00
Average      0.49   0.07        0.17    0.11      0.13      0.22    0.16
computed ratios (shown in Table 2) range between zero and one, with zero meaning that a family is useless since its methods are never selected, and one meaning that an entire family is necessary because all of its methods are selected every time. Finally, a similar study (shown in Table 3) about the significance of each family is conducted, but this time by taking into account each window size for all the given test images.
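The ratio described above can be illustrated with a short sketch. The family sizes and the six window sizes are taken from section 3; the function name, the dictionary layout and the example counts are hypothetical and are not values reported by the authors.

```python
# Number of methods per family, as listed in section 3.
FAMILY_SIZES = {
    "Laws": 4, "Statistic": 3, "Gabor": 4, "Fractal": 1,
    "Wavelet": 2, "GabMM": 48, "LBP": 6,
}
N_WINDOW_SIZES = 6  # 3x3, 5x5, 9x9, 17x17, 33x33 and 65x65

def selection_ratios(selected_counts, family_sizes=FAMILY_SIZES,
                     n_windows=N_WINDOW_SIZES):
    """Ratio of selected methods per family for one test image.

    selected_counts maps a family name to the number of its methods chosen
    by the selector, summed over all window sizes (hypothetical input).
    The denominator is the number of times the family could have been
    chosen: family size times number of window sizes.
    """
    return {
        family: selected_counts.get(family, 0) / (size * n_windows)
        for family, size in family_sizes.items()
    }

# Hypothetical example: 16 Laws and 3 LBP selections over all window sizes.
ratios = selection_ratios({"Laws": 16, "LBP": 3})
```

A ratio of one would require every member of a family to be selected for every window size, which is why small families such as the Laws masks can reach high ratios with only a handful of selections.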
4 Discussion From the first experiment conducted in section 3, it can be concluded that if just a single class of methods is to be utilized for texture classification (see Table 1), the GaborMM class is clearly the best option. This is consistent with the fact that this family has been widely used in the literature and even proposed as a texture descriptor in the MPEG-7 standard [8]. However, the next option is not as obvious, since the LBP operators produce acceptable results for the Brodatz textures, while the classical methods perform better for the outdoor images. Even though the LBP family is currently considered to be within the state of the art in texture analysis, this experimentation shows that the performance of the LBP methods can be seriously influenced by the nature of the texture patterns to which these methods are applied. On the other hand, the LBP family has the additional property of being small and, hence, of having
Table 3. Significance of each family of methods per window size
Ratio of selected methods per family vs. total number of methods within a family (per window size for all test images)
Window size   Laws   Statistic   Gabor   Fractal   Wavelet   GabMM   LBP
3 × 3         0.75   0.11        0.54    0.00      0.08      0.19    0.19
5 × 5         0.63   0.06        0.17    0.00      0.17      0.21    0.11
9 × 9         0.50   0.06        0.17    0.00      0.17      0.23    0.08
17 × 17       0.33   0.06        0.17    0.17      0.17      0.22    0.14
33 × 33       0.46   0.06        0.00    0.33      0.08      0.17    0.17
65 × 65       0.29   0.06        0.00    0.00      0.08      0.18    0.25
a low associated computational cost. These characteristics make the LBP methods more suitable for those tasks in which time constraints are an issue. Another important result is the fact that the integration of the three tested classes does not always guarantee the maximum classification rates, as can be observed in the rates corresponding to most of the images in Table 1. Furthermore, this alternative is not justifiable in terms of computational cost, since, on average, it is possible to obtain better classification rates by only integrating methods from the GaborMM class, which constitute about 70% of all available methods. In addition, the first noticeable result after applying the feature selection algorithm in the second experiment described in section 3 is that it not only preserves classification rates to a large extent, but also improves those rates in a number of cases (see column 5, rows 2 to 5, in Table 1). Moreover, the highest average classification rate is the one corresponding to the integration of the selected methods. Nevertheless, the main contribution of the feature selection technique is that the same good performance highlighted above has been reproduced with a much lower number of texture methods. Actually, the number of integrated methods is, on average, more than three times lower than the number of methods that constitute the best family (GaborMM) and more than four and a half times lower than the number of available methods (see the numbers in parentheses in Table 1). The only cases in which the number of selected methods is larger than the number of methods that constitute a single family are when comparing with the Classical class for the patterns used in images 5 to 7, and when comparing with the LBP class, since the latter has just six members.
In turn, the number of selected methods is similar in most cases, except for images 5 to 7 (see Table 1, column 5, rows 5 to 7), where this number is more than twice that of the previous cases. The reason is that these images are so complex that the feature selector cannot yield appropriate classification rates with a low number of methods. In fact, some textures in those images are difficult to distinguish even for the human eye. Furthermore, the study of the performance of each family of methods according to the ratios discussed in section 3.2 is also quite revealing. The first conclusion from Table 2 is that all families are potentially useful in different contexts, as all of them have been chosen by the selection algorithm more than once. Hence, they should not be discarded a priori. The Statistics and Wavelet families and the fractal dimension
have the lowest performance, with very low to low ratios, as they have only been selected in particular cases. The LBP and classical Gabor families perform in the middle, while the Laws masks and the GaborMM family are the main components of the integration strategy after the selection process, as they are always selected. At first glance, the ratios obtained by the Laws masks may seem contradictory, as they are superior to those obtained by the GaborMM family, while the classification rates produced by the Classical class, which includes the Laws family, are much lower than those produced by the GaborMM methods. The reason for those high ratios is that they are not computed based on classification rates, but in terms of how significant a family is with respect to its size. For example, the Laws family has a ratio close to 0.5, meaning that, even though its number of members is very small (4 methods), about two of them are selected on average, and their participation is crucial for obtaining good classification results with a reduced number of methods. In fact, a member of the Laws family scores first many times. In contrast, the GaborMM family has a small average ratio, even though it is the best family to be integrated on its own. The reason is that not all its members contribute to the final classification, and only a few of them (about 20% on average) do a good classification job. Nevertheless, it is important to highlight that the methods selected within each family are not always the same, as this selection is highly dependent on the texture patterns of interest. This is the reason why the feature selection algorithm must be applied starting with the complete set of available texture methods as input. In such a way, the selector is able to choose the best methods according to the specific situation. The last point that deserves discussion is the effect of window size on the selection process.
Table 3 shows that the Laws family and especially the classical Gabor family tend to perform best with small to very small window sizes, while the fractal dimension only does well with medium window sizes. In contrast, the LBP family reaches its maximum with very small and medium to big window sizes. The latter can explain the success of LBP in tasks related to image retrieval. Finally, the GaborMM, Statistics and Wavelet families seem to perform almost equally with all window sizes.
5 Conclusions This paper presents the results of a thorough evaluation conducted on a previously proposed technique for pixel-based classification through integration of multiple families of texture methods and selection of the most significant ones. The families of texture methods analyzed in previous works have been complemented with two state-of-the-art families of methods: optimized Gabor wavelet filters designed by Manjunath and Ma, and LBP. A comparative analysis of the performance of all those families has been carried out. This evaluation shows that the best family in terms of classification rates corresponds to the optimized Gabor filters. On the other hand, Local Binary Patterns have proven to yield acceptable results with a reduced computational cost, although, in these experiments, they have shown a lower performance when applied to the outdoor images. In general, they appear to be a good choice for texture analysis applications with strong time constraints.
Another conclusion from this study is that it is not advisable to integrate all the available texture methods, since this approach leads to lower classification rates in most cases at a significant computational cost. Besides the novelty of the comparative evaluation of those different families of well-recognized texture methods, another important contribution of this paper is to show the clear benefits of applying the previously proposed texture feature selector prior to classification. Indeed, when a subset of texture methods is automatically selected based on the given texture patterns of interest, the final classification rates are similar to or even better than those obtained when the best families of texture methods are utilized on their own, but at a significantly lower computational time. In addition, these results also show that the GaborMM family is oversized, as only a small percentage of its members really contribute to each specific classification problem. However, the methods that contribute the most are highly dependent on the texture patterns to be recognized. Thus, the application of the selection scheme is highly beneficial for the particular case of this family, as it is able to select the specific Gabor filters that are most suitable for the problem at hand, leaving aside the ones that do not make a significant contribution. This implies that the power of the GaborMM family is kept with a very significant reduction of the required computational cost. Finally, the results presented in this paper also show that some texture methods are most suitable for particular window sizes, while others behave almost independently of this parameter, and that the classical methods (Laws filters, fractal dimension, etc.) cannot be discarded a priori, as they prove to be useful (they are selected) on some occasions. This is where the selection algorithm shows its strength and utility.
Further research will consist of extending the feature selection technique evaluated in this paper to application-independent selection of texture methods in order to obtain a more general subset of methods that produce acceptable classification results when the classifier is applied to different problems. In this way, it would be possible to avoid a specific selection of texture methods for each different classification problem. We also aim at extending the proposed technique to unsupervised pixel-based segmentation.
References
1. Manjunath, B.S., Ma, W.Y.: Texture Features for Browsing and Image Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 837–842 (1996)
2. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002)
3. Garcia, M.A., Puig, D.: Supervised Texture Classification by Integration of Multiple Texture Methods and Evaluation Windows. Image Vis. Comput. 25(7), 1091–1106 (2007)
4. Puig, D., Garcia, M.A.: Automatic Texture Feature Selection for Image Pixel Classification. Pattern Recogn. 39(11), 1996–2009 (2006)
5. Brodatz, P.: Textures: A Photographic Album for Artists and Designers. Dover & Greer Publishing Company, Mineola, NY (1999)
6. Randen, T., Husoy, J.H.: Filtering for Texture Classification: A Comparative Study. IEEE Trans. Pattern Anal. Mach. Intell. 21(4), 291–310 (1999)
7. Chen, L., Lu, G., Zhang, D.: Effects of Different Gabor Filter Parameters on Image Retrieval by Texture. In: Proc. of Int. Conf. MMM’04, pp. 273–278 (2004)
8. Manjunath, B.S., et al.: Color and Texture Descriptors. IEEE Trans. Circ. Syst. Video Tech. 11, 703–715 (2001)
A Multiple Classifier Approach for the Recognition of Screen-Rendered Text
Steffen Wachenfeld, Stefan Fleischer, and Xiaoyi Jiang
Department of Computer Science, University of Münster, Einsteinstrasse 62, D-48149 Münster, Germany
Abstract. The lower the resolution of a given text is, the more difficult it becomes to segment and to recognize it. The resolution of screen-rendered text can be very low. With a typical x-height of 4 to 7 pixels it is much lower than in other low resolution OCR situations. Modern OCR approaches for such very low resolution text use a classification-based segmentation where the underlying classifier plays an important role. This paper presents a multiple classifier system for the classification of single characters. This system is used as a subsystem for the classification-based segmentation within a system to read screen-rendered text. The paper shows that the presented multiple classifier system outperforms the best former single classifier system on single characters by far, and it shows the impact of using the multiple classifier system on the word reading performance.
1 Introduction
Screen-rendered characters are of very low resolution, are often smoothed, broken or touching each other, and have a variable appearance depending on their subpixel position. This makes a reliable segmentation of screen-rendered text into characters prior to classification impossible. Classification-based segmentation algorithms can be found in all fields of low resolution text recognition which have to deal with similar problems, e.g. in the field of handwriting recognition. They are also called integrated segmentation and recognition with hypothesis-and-test strategy [4], recognition-based segmentation [1], hybrid segmentation [5], or internal segmentation in contrast to external segmentation. These algorithms rely strongly on the underlying classifier, whose performance is thus very important. In this paper we present a multiple classifier system (MCS) based on multiple type-3 classifiers for the recognition of single characters. We have created the MCS to improve the classification-based segmentation of our screen-text recognition system. Our earlier paper [6] gives an overview of the former system, which used a single classifier. The system makes it possible to read text from screenshot images, as is necessary e.g. to provide copy-and-paste functionality for text in images or for translation tools which allow users to click on any text on the screen and obtain a translation. Besides some commercial OCR programs which start to address
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 921–928, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Classification-based segmentation: image of a rendered word, oversegmentation, resulting primitive segments, merged segments with highest classification scores, and ground truth
the problem of reading screenshots we are not aware of other systems reported in research literature which address this task. In Section 2 we define the classification task and show the context for which the classifier is needed. Section 3 shows the design and evaluation of single classifiers for the later combination. Section 4 shows how the classifiers are combined and presents experimental results on single characters, which show that our MCS outperforms the best former single classifier by far. Section 5 discusses the impact of integrating the created MCS into our recognition system on the word reading performance. Section 6 concludes this paper with a discussion of our achievements and possible future work.
2 The Classification Task
The classifier shall serve within our recognition system for screen-rendered text. The system reads screen-rendered text from screenshot images in multiple steps. After the preprocessing, where word boundaries are determined and the text is separated from the background, the main step is the classification-based segmentation, which is shown in Figure 1. During this classification-based segmentation the image of a displayed word is oversegmented into a sequence of primitive image segments $s_1, \ldots, s_n$ so that each segment can be expected to correspond to a character or to a part of a character. The primitive segments are then combined into candidate character patterns. The classifier has the important task to assign
Fig. 2. Example graph for the word arm with a path indicated, which represents the correct segmentation and recognition
appropriate plausibility scores to the candidate patterns. In a following step a graph of possible combinations of candidate patterns can be searched for the optimal path, which has the highest plausibility score (see Figure 2). This optimal path implies the final segmentation, the recognized characters and thus the recognized word. If $s_1, \ldots, s_n$ is the sequence of the primitive image segments, the candidate character patterns $s_{ij}$ result from merging neighboring primitive segments, $s_{ij} = s_i s_{i+1} \ldots s_j$, $1 \le i \le j \le n$. The classification task for a given candidate character pattern is the assignment of plausibilities $p(c_k \mid s_{ij})$ for all character classes $c_k$. The described context leads to several requirements concerning the classifier. As a classification result for each segment is needed, features have to be extracted from each segment. Thus, the first requirement is that the classifiers use features which are computable in acceptable time. Further, the classifier outputs are used to compute the optimal path. The quality of a path is computed as the product of the plausibilities of all segments in the path, weighted by their relative width:
$$p(c^{(1)}, \ldots, c^{(m)} \mid s^{(1)}, \ldots, s^{(m)}) = \prod_{i=1}^{m} p(c^{(i)} \mid s^{(i)})^{w_{s^{(i)}}/w}.$$
Thus, it is evident that the classifiers as well as the MCS assign plausibilities $p(c_k \mid s_{ij})$ for all character classes $c_k$, which requires the classifier combination to be a type-3 combination.
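The search for the optimal path described in this section can be sketched as a small dynamic program. Because the path quality is a product of plausibilities raised to relative widths, its logarithm is a sum over segments, so the best path up to each primitive segment can be built from the best shorter paths. The function name, the callback interface and the use of the total word width as the normaliser are assumptions made for this illustration; the sketch is not the search procedure of the original system.

```python
import math

def best_path(widths, plausibility):
    """Find the segmentation and labelling with the highest path quality.

    widths:       widths of the primitive segments s_1..s_n.
    plausibility: callable (i, j) -> dict {character: p} for the candidate
                  pattern merged from primitives i..j (0-based, inclusive);
                  an empty dict means the pattern is not a valid candidate.
    Returns (recognized_string, quality).
    """
    n = len(widths)
    total_width = float(sum(widths))
    # best[j] holds (log quality, characters) for the first j primitives.
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for j in range(1, n + 1):
        for i in range(j):
            prev_log, prev_chars = best[i]
            if prev_log == -math.inf:
                continue
            scores = plausibility(i, j - 1)
            if not scores:
                continue
            char, p = max(scores.items(), key=lambda kv: kv[1])
            if p <= 0.0:
                continue
            rel_width = sum(widths[i:j]) / total_width
            candidate = prev_log + rel_width * math.log(p)
            if candidate > best[j][0]:
                best[j] = (candidate, prev_chars + [char])
    log_quality, chars = best[n]
    return "".join(chars), math.exp(log_quality)
```

Choosing the most plausible character independently for each candidate pattern is valid here only because no language model couples neighboring characters; with n-grams or a dictionary the search would have to keep several hypotheses per node.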
3 Single Classifier Design and Evaluation
The single classifier takes a segment which is an image of a character or a part of a character, extracts features and assigns plausibilities $p(c_k \mid s_{ij})$ for all character classes $c_k$. To design different classifiers, it is possible to change the classification method, the features used, or both.
Fig. 3. Projection of a character image and resulting feature vector: equidistant (a and b), based on gray value mass (c and d)
In the search for a good single classifier for our recognition system we have experimented with several features and different classification methods. The former recognition system uses a 5 × 5 feature vector, which results from a gray value projection of segments onto a uniform square (see Fig. 3), and a single k-nearest-neighbor-like classifier. Using more complex features such as black-and-white transitions, gradient information or projection measures, as well as other classifiers, e.g. a Bayesian classifier with a Parzen or with a uniform estimator, resulted in lower performance. The knn-like classifier used assigns plausibility values for each class $c$ by considering the relation between the average distance of a sample $x$ to the $k_c$ nearest neighbors of class $c$ and the maximum distance between any two samples. This leads to
$$p(c \mid x) = \max\left(1 - \frac{\sum_{i=1}^{k_c} d(x, x_{c,i})}{d_{\max}\, k_c},\; 0\right)$$
where $d(x_i, x_j)$ is the Euclidean distance between two samples $x_i$ and $x_j$. The number of considered nearest neighbors $k_c$ is class dependent and chosen in relation to the number of samples $N_c$ belonging to class $c$. In experiments, $k_c = \alpha N_c$ with $\alpha = 0.01$ has proven to be a good value. Now we show how we create a pool of many single classifiers based on this classifier. Later, classifiers are selected from this pool to be integrated into the MCS. Experiments showed that the robustness against slight segmentation errors can be increased by changing the way the segments are projected. Instead of equidistantly choosing the zones for the projection (as shown in Figure 3a, b)
the zones can be chosen based on gray value mass. If the size of the zones is chosen in a way that each row and each column holds exactly the same gray value mass (as shown in Figure 3c, d), the robustness against additional single pixels, which otherwise strongly move the zones, is increased. We name the equidistant features which have 5 × 5 zones knn 5x5e and the mass-based ones knn 5x5m. As the character size changes within the database, some characters are better recognized with a lower number of zones and some with a higher number. To let the later MCS benefit from classifiers with different numbers of zones, we create classifiers knn 5x5[e|m] up to knn 10x10[e|m]. All these classifiers project onto a uniform square, which proved to be effective for recognizing narrow as well as wide fonts. However, the aspect ratio of the character is lost. To compensate for this, we create classifiers having the aspect ratio as an additional feature. So for each classifier knn zxz[e|m] in the pool we add the counterpart knn zxz[e|m]a, which has the aspect ratio as an additional feature.
Table 1. Recognition performance of the 24 single character classifiers on 15,808 isolated characters
      knn zxze              knn zxzea             knn zxzm              knn zxzma
z     correct    false      correct    false      correct    false      correct    false
5     98.9119%   1.0881%    98.9246%   1.0754%    98.8993%   1.1007%    98.9309%   1.0691%
6     98.9246%   1.0754%    98.8613%   1.1387%    98.9183%   1.0817%    98.9815%   1.0185%
7     98.9689%   1.0311%    98.9752%   1.0248%    99.0195%   0.9805%    99.0385%   0.9615%
8     98.9562%   1.0438%    98.9436%   1.0564%    98.9626%   1.0374%    98.9879%   1.0121%
9     98.9119%   1.0881%    98.8930%   1.1070%    98.9183%   1.0817%    98.9815%   1.0185%
10    98.8803%   1.1197%    98.8677%   1.1323%    98.9499%   1.0501%    98.9562%   1.0438%
Table 1 shows the performance of the 24 classifiers on 15,808 images of characters using a leave-one-out technique. The characters belong to the freely available Screen-Char database, which consists of annotated screenshot images of single characters [7] [8]. It can be seen that on these characters the knn 7x7ma classifier achieves the highest performance of 99.04%, which is the highest performance reported for this database so far. Not counting confusions between the two similar looking letters I (’I’) and l (’l’), the performance was 99.72%.
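The knn-like plausibility assignment defined earlier in this section can be sketched as follows. The function name, the rounding of k_c to an integer with a minimum of one neighbor, and the brute-force computation of the maximum pairwise distance are assumptions made for this illustration; the sketch does not reproduce the feature extraction or the leave-one-out protocol.

```python
import numpy as np

def knn_plausibilities(x, samples, labels, alpha=0.01, d_max=None):
    """Plausibility of every class for sample x.

    For each class c the mean distance from x to its k_c nearest training
    samples of that class is divided by the maximum distance d_max between
    any two training samples; the plausibility is one minus this value,
    clipped at zero.
    """
    x = np.asarray(x, dtype=float)
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(labels)
    if d_max is None:
        # Brute-force maximum pairwise distance (fine for small sample sets).
        diffs = samples[:, None, :] - samples[None, :, :]
        d_max = float(np.sqrt((diffs ** 2).sum(axis=-1)).max())
    if d_max <= 0.0:
        raise ValueError("d_max must be positive")
    result = {}
    for c in np.unique(labels):
        class_samples = samples[labels == c]
        # Assumption: k_c = alpha * N_c, rounded, with at least one neighbor.
        k_c = max(1, int(round(alpha * len(class_samples))))
        dists = np.sort(np.linalg.norm(class_samples - x, axis=1))[:k_c]
        result[c.item()] = max(1.0 - float(dists.sum()) / (d_max * k_c), 0.0)
    return result
```

With a precomputed `d_max` the per-query cost is dominated by the distance computation to all training samples, which is why cheap features such as the zone projections matter in this setting.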
4 Classifier Combination
The goal is to increase the already very good performance on single characters by combining multiple classifiers. For a given screenshot image $s$ of a character, our classifiers return plausibility values $p(c_k \mid s)$ for each class $c_k$. We use multiple classifiers $C_1, \ldots, C_n$ to classify the same screenshot image and thus get $n$ plausibility values for each class, which can be combined by a function $p(c_k \mid s) = F[p_1(c_k \mid s), \ldots, p_n(c_k \mid s)]$ (Figure 4 shows the setup). The functions we have used for combination are
Fig. 4. Type-3 combination of the classification outputs for all classes
– Product (F = prod): $p(c_k \mid s) = \prod_{i=1}^{n} p_i(c_k \mid s)$
– Simple mean (F = mean): $p(c_k \mid s) = \frac{1}{n} \sum_{i=1}^{n} p_i(c_k \mid s)$
– Minimum, Median, Maximum (F = min, median, max). For example: $p(c_k \mid s) = \max_i \{p_i(c_k \mid s)\}$.
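The combination functions listed above can be sketched with NumPy reductions along the classifier axis. The dictionary `COMBINERS`, the function name `combine` and the example values are illustrative assumptions, not part of the original system.

```python
import numpy as np

# One reduction per combination function F named in the text.
COMBINERS = {
    "prod": np.prod,
    "mean": np.mean,
    "median": np.median,
    "min": np.min,
    "max": np.max,
}

def combine(plausibilities, rule="median"):
    """Type-3 combination of classifier outputs.

    plausibilities: (n_classifiers, n_classes) array in which row i holds
                    the plausibilities p_i(c_k | s) of classifier C_i.
    Returns one combined plausibility per class.
    """
    p = np.asarray(plausibilities, dtype=float)
    return COMBINERS[rule](p, axis=0)

# Hypothetical outputs of three classifiers for two classes.
outputs = [[0.9, 0.1], [0.6, 0.2], [0.3, 0.6]]
combined = combine(outputs, "median")
```

Because the result is again a full vector of plausibilities rather than a single label, it can be fed directly into the path search of the segmentation step.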
We have tested the performance of several subsets from the 24 classifiers. Table 2 shows the results of combining the six classifiers for z = 5, ..., 10 of one type (e.g. mass based projection with aspect) for the functions named before. The performance of 99.2346% (99.7964% without counting confusions between I and l), which was achieved by the combination of the knn zxzm classifiers using the median function, is the best performance on single characters so far. Combining too many classifiers is computationally expensive and does not yield better results. For example if all 24 classifiers are combined using the median function,
Table 2. Recognition performance of combined classifiers on 15,808 isolated characters using different functions. The six classifiers for z = 5, . . . , 10 have been combined for each listed classifier type.
          knn zxze              knn zxzea             knn zxzm              knn zxzma
          correct    false      correct    false      correct    false      correct    false
prod      99.1776%   0.8224%    99.1840%   0.8160%    99.1776%   0.8224%    99.1903%   0.8097%
mean      99.1966%   0.8034%    99.1966%   0.8034%    99.2156%   0.7844%    99.1903%   0.8097%
median    99.1460%   0.8540%    99.1713%   0.8287%    99.2346%   0.7654%    99.2093%   0.7907%
min       99.1270%   0.8730%    99.1270%   0.8730%    99.1713%   0.8287%    99.1713%   0.8287%
max       99.0891%   0.9109%    99.1397%   0.8603%    99.1903%   0.8097%    99.1334%   0.8666%
the performance is 99.1840%. The median function proved to be a good choice to combine classifier outputs in this situation. When the four types of classifiers for a fixed number of zones (e.g. 5 × 5, 6 × 6, ...) were combined, the resulting performance was above 99.1397% in all cases and thus also very good. We have integrated MCSs consisting of the best combinations into our recognition system.
5 System Performance
We took an MCS consisting of the six knn zxzm classifiers and integrated it into our recognition system, replacing the single classifier. Finding a measure for the impact on the recognition quality is hard, as the recognition result is always a list of candidate words ranked by their path plausibility (see Figure 5). One way to measure the quality is to count how often the candidate on the first rank is correct. But as soon as language models such as n-grams or dictionaries are used, it becomes more important that the correct word is in the ranked list at all, or even better, somewhere among the first places. To take this into account we compute the average rank of the correct words and the number of times where the rank is ≤ 10.
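The rank-based measures described above can be sketched as follows. The function name, the input layout and the handling of words that do not appear in the candidate list (they are counted separately and excluded from the average rank) are assumptions made for this illustration; the text does not state how such cases were treated.

```python
def rank_statistics(results, max_rank=10):
    """Summarise ranked word candidates.

    results: list of (candidates, truth) pairs, where candidates is the
             list of recognized words ordered by decreasing plausibility.
    """
    ranks = []
    missing = 0
    for candidates, truth in results:
        if truth in candidates:
            ranks.append(candidates.index(truth) + 1)  # ranks start at 1
        else:
            missing += 1
    return {
        "top1": sum(1 for r in ranks if r == 1),
        "within_max_rank": sum(1 for r in ranks if r <= max_rank),
        "average_rank": sum(ranks) / len(ranks) if ranks else None,
        "missing": missing,
    }

# Hypothetical recognition results for three word images.
stats = rank_statistics([
    (["arm", "arn"], "arm"),
    (["am", "arm"], "arm"),
    (["ann"], "arm"),
])
```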
Fig. 5. Application example and resulting list of ranked word candidates
In tests using screenshot images of screen-rendered words taken from the freely available Screen-Word database [7] [8] it could be seen that rank 1 was only slightly more often correct but that the average rank of the correct words was significantly better. On 2,400 words the average rank is 1.53 (was 1.65 before).
6 Conclusion and Future Work
In this paper we have discussed the importance of the single character classifier, which is used for the classification-based segmentation within our system for the recognition of screen-rendered text. This was taken as motivation to systematically test features and classifier combinations. In experiments on screenshot images of single characters, taken from the freely available Screen-Char database, we have achieved a recognition performance of 99.2346%, which is the best performance reported for this database (was 98.91% before). The theoretical maximum performance can only be increased by combining classifiers which make different errors and correctly classify samples that are misclassified so far. Thus, we will try to develop features which lead to better overall performance. We will also increase the number of samples in our test databases to have a better foundation for our measurements.
References
1. Burges, C.J.C., Matan, O., LeCun, Y., Denker, J.S., Jackel, L.D., Stenard, C.E., Nohl, C.R., Ben, J.I.: Shortest Path Segmentation: A Method for Training a Neural Network to Recognize Character Strings. In: Proc. of Int. Joint Conf. Neural Networks 3, 165–172 (1992)
2. Casey, R., Lecolinet, E.: A Survey of Methods and Strategies in Character Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 18(7), 690–706 (1996)
3. Guyon, I., Haralick, R.M., Hull, J.J., Phillips, I.T.: Database and Benchmarking. In: Bunke, H., Wang, P.S.P. (eds.) Handbook of Character Recognition and Document Image Analysis, pp. 779–799. World Scientific, Singapore (1997)
4. Liu, C.-L., Sako, H., Fujisawa, H.: Effects of Classifier Structures and Training Regimes on Integrated Segmentation and Recognition of Handwritten Numeral Strings. IEEE Trans. Pattern Anal. Mach. Intell. 26(11), 1395–1407 (2004)
5. Nagy, G.: Twenty Years of Document Image Analysis in PAMI. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 38–62 (2000)
6. Wachenfeld, S., Klein, H.-U., Jiang, X.: Recognition of Screen-Rendered Text. In: Proc. of 18th Int. Conf. on Pattern Recognition (ICPR), vol. 2, pp. 1086–1089 (2006)
7. Wachenfeld, S., Klein, H.-U., Jiang, X.: Annotated Databases for the Recognition of Screen-Rendered Text. In: Proc. of 9th Int. Conf. on Document Analysis and Recognition (ICDAR) (2007)
8. Wachenfeld, S., Klein, H.-U., Jiang, X.: The official Screen-Char and Screen-Word database website. http://cvpr.uni-muenster.de/research/ScreenTextRecognition/
Improving Stability of Feature Selection Methods
Pavel Křížek1, Josef Kittler2, and Václav Hlaváč1
1 Czech Technical University in Prague, Center for Machine Perception, Karlovo nám. 13, 121 35 Prague 2, Czech Republic
2 University of Surrey, Centre for Vision, Speech, and Signal Processing, GU2 7XH Guildford, United Kingdom
{krizekp1, hlavac}@fel.cvut.cz, [email protected]
Abstract. An improper design of feature selection methods can often lead to incorrect conclusions. Moreover, it is not generally realised that functional values of the criterion guiding the search for the best feature set are random variables with some probability distribution. This contribution examines the influence of several estimation techniques on the consistency of the final result. We propose an entropy based measure which can assess the stability of feature selection methods with respect to perturbations in the data. Results show that filters achieve a better stability and performance if more samples are employed for the estimation, i.e., using leave-one-out cross-validation, for instance. However, the best results for wrappers are acquired with the 50/50 holdout validation.
Keywords: Feature selection, stability, entropy.
1 Introduction
Many tasks in statistical pattern recognition are characterised by high dimensional data which have to be processed and analysed using statistical tools. A data sample is a vector generally formed by several hundreds of measurements, called features. Examples of such data are measurements arising in text recognition, genetic engineering, astronomy, etc. The problem of analysing and processing such multivariate sensory information can be aggravated by a relatively small sample size available for learning. Small-sample statistics result in inaccurate parameter estimates of the data models. Thus, poor generalisation is achieved on unknown data. This phenomenon is known as the curse of dimensionality [1]. A common solution is to reduce the dimensionality and employ, for example, only those features that are relevant to a given problem. This task is known as feature selection [1]. There are many approaches to feature selection; however, all of them in principle involve two main ingredients: i) an objective function (criterion) which reflects
This work was supported by the EU INTAS project PRINCESS 04-77-7347 and by the Czech Ministry of Education under Project 1M0567.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 929–936, 2007. © Springer-Verlag Berlin Heidelberg 2007
P. Křížek, J. Kittler, and V. Hlaváč
informativeness of feature subsets, and ii) a search strategy. The search strategy is fixed and usually employs feature ranking [7] or subset search [1,11] methods. Both approaches are premised on certain principles which guide the search through the feature space and determine what results can be ideally achieved. The objective function guiding the search can be either classifier specific [9,5] (so called wrappers) or classifier independent [7,5] (so called filters). In general, learning algorithms designed using features selected by wrappers have been shown to achieve a better predictive accuracy. Nevertheless, wrappers are computationally very intensive and tend to over-fit [9]. Filters can be seen more as a preprocessing step for subsequent learning, because the objective function is not directly linked to minimising the error rate of a particular classifier. Filters usually execute quite fast and provide a general approach to feature selection.

One issue that has been relatively neglected in the literature is the stability of feature selection algorithms, i.e., the sensitivity of the solution (selected features) to perturbations in the input data. The motivation behind exploring the stability is to provide evidence that the selected features are relatively robust to slight changes in the data. Feature selection methods producing consistent solutions on the given data are preferable to those with highly volatile outputs. Possibly the only related publications discussing feature selection stability are [2,6,10]. However, the proposed stability measures have many limitations, unclear motivation, and empirically estimated bounds. Furthermore, these papers do not clearly motivate remedies for improving the stability and performance. We show that for any given feature selection algorithm, the stability and performance can be significantly improved by a careful use of the data available.

The paper is organised as follows. Section 2 formulates the stability problem. In the same section, we propose and theoretically justify a measure which can assess the stability of feature selection algorithms with respect to perturbations in the data. We also suggest how the stability and performance can be improved. The experimental set-up is described in Section 3. Section 4 presents and discusses the numerical results. Conclusions are drawn in Section 5.
2 Stability of Feature Selection Methods

2.1 The Stability Problem
It is not often realised in the literature on feature selection that values of the objective function guiding the search for the best feature set are random variables with some probability distribution, no matter what type of search strategy or criterion is adopted. This originates from the randomly sampled data at the feature selection input. The exact criterion value is unknown and the search strategy has to work with just an estimate acquired on the available data. Note that the accuracy of the criterion estimate can considerably influence the result of the feature selection. Suppose that we carry out T ∈ N runs of a feature selection algorithm on randomly sampled data. We would like to find a measure which allows us to assess the stability of the feature selection method with respect to
perturbations in the data, i.e., to assess the sensitivity of the selected features to variations in the input data. We will consider a concept which reflects the most frequently selected feature subsets. It has to be emphasised that the stability does not say anything about the performance of the selected features. It just indicates the sensitivity of the output of a feature selection method to random perturbations in the input data. Large variations in the selected feature subsets signify that something is wrong: for instance, the feature selection algorithm is not appropriate for the given data, there are not enough samples, or there are too many correlated variables or feature subsets with very similar information content. Thus, less confidence should be assigned to feature sets that change radically with slight variations of the input data, or perhaps it is advisable to refrain completely from the feature selection.

2.2 Stability of Feature Sets
Notice that various feature selection techniques select different feature subsets with a certain probability if the input data for training are randomly sampled from the original data set. One extreme case is a random feature selection which selects every feature subset with the same probability and thus produces a uniform probability distribution. The other extreme is a perfectly stable feature selection method which all the time selects the same feature subset and thus creates a single peak probability distribution. It appears that the stability of feature selection algorithms can be assessed through the properties of the generated probability distributions of the selected feature subsets. Our interest is, of course, in feature selection algorithms that produce probability distributions far from the uniform and close to the peak one.

A convenient measure quantifying randomness of a system is entropy [12]. Entropy is a real function defined on a set of probability distributions. In information theory, the concept of entropy indicates the amount of uncertainty about an event associated with a given probability distribution. The entropy is maximal for a uniform probability distribution (i.e., outcome of random feature selection). If the event is certain (i.e., outcome of perfectly stable feature selection) then the entropy is zero. There are several entropy measures in information theory. We derive the stability measure from the Shannon entropy, see [12],

$$H(X) = -\sum_{i=1}^{m} p(x_i) \log p(x_i) . \tag{1}$$
Here $X$ is a discrete random variable with possible states $X = \{x_1, x_2, \ldots, x_m\}$ (i.e., particular feature subsets), $m \in \mathbb{N}$ is the number of all possible states (i.e., the number of all different feature subsets), and $p(x_i)$ is the probability of the $i$-th state occurrence (i.e., probability of selecting a particular feature subset). Let $n \in \mathbb{N}$ be the problem dimensionality, $k \in \{1, 2, \ldots, n\}$ indicates the size of the feature subset, and $T \in \mathbb{N}$ is the number of evaluation trials with randomly sampled data. The frequencies of selected feature subsets are recorded
over $T$ trials in a histogram given by a structure with entries $G_{jk} \in \mathbb{Z}^{+}$, where $\mathbb{Z}^{+}$ are non-negative integers, $j = 1, 2, \ldots, C(n, k)$, and $C(n, k) = \binom{n}{k}$ is the number of all possible feature combinations. The histogram structure can be implemented by recording different feature subsets and their frequencies $G_{jk}$. The probability estimates of particular feature subsets occurrence can be determined by normalising the histogram entries by the number of trials, i.e., $\hat{G}_{jk} = G_{jk}/T$. Thus, all bin values $\hat{G}_{jk}$ are scaled into the interval $[0, 1]$ and $\sum_{j=1}^{C(n,k)} \hat{G}_{jk} = 1$. Based on the Shannon entropy (1), the following stability measure can be constructed for a feature subset of a fixed size $k$,

$$\gamma_k = -\sum_{j=1}^{C(n,k)} \hat{G}_{jk} \log \hat{G}_{jk} . \tag{2}$$
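The stability measure of Eq. (2) can be computed directly from the list of subsets selected in the individual trials. The following is a minimal sketch rather than the authors' implementation: the function name `stability`, the representation of a subset as a collection of feature indices, and the use of the natural logarithm (the text does not fix the base) are assumptions.

```python
import math
from collections import Counter


def stability(selected_subsets):
    """Entropy-based stability of Eq. (2) from the subsets chosen in T trials.

    Each element of `selected_subsets` is an iterable of feature indices of
    the same size k. The natural logarithm is an assumption.
    """
    trials = len(selected_subsets)
    if trials == 0:
        raise ValueError("need at least one trial")
    # Sparse histogram: only subsets that were actually selected get a bin.
    # frozenset makes the order of the indices within a subset irrelevant.
    histogram = Counter(frozenset(s) for s in selected_subsets)
    gamma = 0.0
    for count in histogram.values():
        p = count / trials  # normalised bin value
        gamma -= p * math.log(p)
    return gamma


stable = stability([(1, 4, 6)] * 5)
mixed = stability([(1, 4, 6), (1, 4, 6), (1, 4, 7), (2, 4, 6)])
volatile = stability([(1, 2), (1, 3), (2, 3), (3, 4)])
```

In this sketch `stable` is zero because the same subset is selected in every trial, while `volatile` equals log 4 because each of the four trials selects a different subset; `mixed` lies in between. These correspond to the single peak and uniform distributions discussed in this section.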
In reality, the histogram structure is sparse, because the number of evaluation trials $T$ is small compared to the theoretical number of all possible feature combinations. The maximal number of histogram entries $G_{jk} > 0$ is therefore $j_{\max} = \min[T, C(n, k)]$. Thus, the stability measure (2) ranges in the interval $0 \le \gamma_k \le \log \min[T, C(n, k)]$. Deriving both bounds is a straightforward task. The lower bound expresses that there is only one uniquely selected feature subset of size $k$ over all $T$ trials, which corresponds to a perfectly stable feature selection algorithm, hence zero entropy. The theoretical upper bound is based on the assumption that an arbitrary feature subset of size $k$ is selected over $T$ trials with the same estimated probability $\hat{G}_{jk} = (\min[T, C(n, k)])^{-1}$. It can be seen that the stability measure (2) creates an ordering and thus, the stability of the examined feature selection algorithms can be compared.

2.3 Improving Stability and Performance
The primary key to improved stability is an appropriate estimate of the objective function values. With a better criterion estimate, the search algorithm is also more likely to converge to its optimal solution with respect to the unknown probability distribution underlying the data. The selected features thus achieve better generalisation and performance. Surprisingly, it is very hard to find a research paper which follows this basic rule. However, the price we may have to pay for a better estimate is an exponentially increasing execution time. The commonly adopted estimation methods in statistical pattern recognition are data re-sampling techniques like cross-validation, holdout validation, or bootstrap, see [1,3,8]. Only wrapper designs sometimes apply five- or ten-fold cross-validation to avoid over-fitting of the classifier employed in the objective function definition, see [9,10]. No estimation is ever done in filter approaches.
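As an illustration of estimating an objective function by re-sampling, the sketch below averages a criterion over repeated stratified holdout splits. It is a sketch under stated assumptions, not the procedure used in the paper: the names `stratified_holdout` and `holdout_estimate` are hypothetical, and `criterion(train_idx, valid_idx)` stands for any user-supplied function that scores a candidate feature subset on one split.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, train_fraction, rng):
    """Return (train_idx, valid_idx) so that class proportions are preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, valid = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train.extend(shuffled[:cut])
        valid.extend(shuffled[cut:])
    return train, valid


def holdout_estimate(criterion, labels, train_fraction=0.5, trials=10, seed=0):
    """Average a criterion over repeated stratified holdout splits.

    `criterion(train_idx, valid_idx)` is a hypothetical callback that scores
    a candidate feature subset on one split.
    """
    rng = random.Random(seed)
    scores = [criterion(*stratified_holdout(labels, train_fraction, rng))
              for _ in range(trials)]
    return sum(scores) / len(scores)
```

Increasing `trials` reduces the variance of the averaged estimate at the cost of evaluating the criterion more often.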
3 Experiment Design and Description
The motivation behind the following experiments is to investigate how the choice of a data re-sampling technique for the objective function estimation influences the stability and performance of filter and wrapper feature selection approaches.
3.1 Data Used in Experiment
A simple artificial two-class problem is synthesised for the purpose of this study, since we would like to have full control over the experiment. The data are designed so that ordinary ranking or greedy feature selection methods fail to find an optimal feature subset of a minimal size. Samples are derived from a 20-dimensional normal probability distribution with a common covariance matrix $\Sigma$ and mean values $\mu = \mu_1 = -\mu_2$. The first class consists of a component $N(\mu, \Sigma)$, and the second class of a component $N(-\mu, \Sigma)$. The common covariance matrix and the mean values comprise several blocks which simulate different qualities of features.

The first block contains statistically independent features with identical discriminatory ability. The parameters are

$$\Sigma^{1,\ldots,3} = I_3 \quad \text{and} \quad \mu^{1,\ldots,3} = [0.635, 0.635, 0.635] ,$$

where $I_d$ is the $d \times d$ identity matrix for $d \in \mathbb{N}$. Upper indices in $\Sigma^{i,j,k,\ldots}$ and $\mu^{i,j,k,\ldots}$ indicate the corresponding coordinates of the block.

A nested pair of features with indices {4, 6} is hidden in the second block. The parameters are given by

$$\Sigma^{4,\ldots,6} = \begin{bmatrix} 1.05 & 0.48 & 0.95 \\ 0.48 & 1.0 & 0.20 \\ 0.95 & 0.20 & 1.05 \end{bmatrix} \quad \text{and} \quad \mu^{4,\ldots,6} = \begin{bmatrix} 0.5 \\ 0.4 \\ 0 \end{bmatrix} .$$

The third block contains statistically independent features with decreasing discriminatory ability. The parameters are

$$\Sigma^{7,\ldots,13} = I_7 \quad \text{and} \quad \mu^{7,\ldots,13} = [0.636, 0.546, 0.455, 0.364, 0.273, 0.182, 0.091] .$$

The last block contains only noise with parameters

$$\Sigma^{14,\ldots,20} = I_7 \quad \text{and} \quad \mu^{14,\ldots,20} = [0, 0, 0, 0, 0, 0, 0] .$$
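The block parameters above are enough to compute the separability of the two classes. The sketch below assumes that the common covariance matrix is block diagonal (the blocks are mutually independent) and that the classes have equal priors; under these assumptions the Bayes error of two Gaussians with means $\pm\mu$ is $\Phi(-\sqrt{\mu^{\top}\Sigma^{-1}\mu})$. All function and variable names are illustrative.

```python
import math

# Block parameters as listed in the text (before any permutation).
MU_BLOCK1 = [0.635, 0.635, 0.635]                               # Sigma = I_3
SIGMA_BLOCK2 = [[1.05, 0.48, 0.95],
                [0.48, 1.00, 0.20],
                [0.95, 0.20, 1.05]]
MU_BLOCK2 = [0.5, 0.4, 0.0]
MU_BLOCK3 = [0.636, 0.546, 0.455, 0.364, 0.273, 0.182, 0.091]   # Sigma = I_7
MU_BLOCK4 = [0.0] * 7                                           # noise, Sigma = I_7


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[row][c] -= factor * aug[col][c]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][c] * x[c] for c in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def quadratic_form(mu, sigma=None):
    """mu^T Sigma^{-1} mu; `sigma=None` stands for the identity matrix."""
    weights = mu if sigma is None else solve(sigma, mu)
    return sum(m * w for m, w in zip(mu, weights))


# Assumption: independent blocks, so their contributions add up.
q_total = (quadratic_form(MU_BLOCK1) + quadratic_form(MU_BLOCK2, SIGMA_BLOCK2)
           + quadratic_form(MU_BLOCK3) + quadratic_form(MU_BLOCK4))

# Equal priors assumed: Bayes error = Phi(-sqrt(q_total)).
bayes_error = 0.5 * math.erfc(math.sqrt(q_total) / math.sqrt(2.0))

# Features 4 and 6 taken together (rows/columns 1 and 3 of the second block).
q_pair = quadratic_form([0.5, 0.0], [[1.05, 0.95], [0.95, 1.05]])
# Features 4 and 5 evaluated individually.
q_single_4 = 0.5 ** 2 / 1.05
q_single_5 = 0.4 ** 2 / 1.00
```

In this sketch the pair {4, 6} reaches a value of about 1.31, although feature 6 has zero mean and features 4 and 5 score only about 0.24 and 0.16 on their own. This is why a ranking of individual features would overlook the pair.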
Finally, all the dimensions of the problem are randomly permuted to make sure that an examined feature selection algorithm will not follow feature ordering created in the design above. Data in our experiment contain 500 samples per class. It is interesting to note that the theoretical classification error of such data is about 2.3%.

3.2 Experiment Set-Up
In experiments, we examine a filter and wrapper variant of the state-of-the-art feature selection technique known as the Sequential Forward Floating Search (SFFS) algorithm [11]. The filter version applies the Mahalanobis distance [1] in the objective function definition. The wrapper form uses prediction accuracy of a linear decision rule created by the Gaussian classifier [1]. Both feature selection algorithms fit the data so they should find the exact solution.
For the objective function estimation, we consider five-fold, ten-fold, and leave-one-out cross-validation; a repeated holdout validation scheme with 50/50, 60/40, 70/30, 80/20, and 90/10 splits of the data into training/validation parts; and estimation methods based on sampling with replacement, namely bootstrap, .632 bootstrap, and out-of-bootstrap, see [1,3,8] for more details. Holdout and bootstrap techniques employ 1, 5, 10, 50, 100, and 200 trials, respectively, for the objective function estimation. All methods are stratified [8] so that the class a priori probabilities are preserved within the data re-sampling.

The feature selection is applied in all experimental set-up conditions described above. To acquire good statistics on the probability estimates of the selected feature subsets, one hundred evaluation trials are performed with 90% of the data randomly sampled from the original data set for training. The stability measure in Equation (2) is determined from the acquired probability estimates. The performance is assessed by the Hamming distance [4,2] between the exact solution and the selected features in order to check the real quality of the result. The exact solution is found by the SFFS algorithm which employs the true data probability distribution and the Mahalanobis distance in the objective function definition. The final performance is averaged over all one hundred trials.
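The performance measure can be sketched as follows, assuming that each feature subset is encoded as a binary indicator vector of length n, so that the Hamming distance counts the features on which the selected subset and the exact solution disagree. The function names and the 1-based feature indexing are illustrative assumptions.

```python
def hamming_distance(selected, exact, n_features):
    """Number of positions where the two subset indicator vectors differ."""
    selected, exact = set(selected), set(exact)
    universe = set(range(1, n_features + 1))
    if not (selected <= universe and exact <= universe):
        raise ValueError("feature index out of range")
    # Indicator vectors differ exactly on the symmetric difference.
    return len(selected ^ exact)


def average_hamming(trial_subsets, exact, n_features):
    """Mean Hamming distance to the exact solution over all trials."""
    distances = [hamming_distance(s, exact, n_features) for s in trial_subsets]
    return sum(distances) / len(distances)


one_swap = hamming_distance({1, 2, 3}, {1, 2, 4}, 20)
mean_distance = average_hamming([{1, 2, 3}, {1, 2, 4}, {1, 5, 6}], {1, 2, 3}, 20)
```

With this encoding, two subsets of the same size always differ by an even number, so exchanging a single feature for another gives a distance of 2.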
4 Numerical Results and Discussion
The stability results for the filter and wrapper are depicted in Figures 1a and 1b. The performance of the selected features measured by the average Hamming distance is shown in Figures 1c and 1d. Graphs are depicted only for a feature subset of size k = 8 as similar behaviour was observed for all subset sizes.
[Figure 1, panels a) to d). Horizontal axis: estimation technique (5fold, 10fold, loo, 50/50, 60/40, 70/30, 80/20, 90/10, boot, 632b, outb). Vertical axis: γk in panels a) and b), hk in panels c) and d). Legend: number of estimation trials (5/10/loo, 1, 5, 10, 50, 100, 200).]
Fig. 1. The stability factor γk for a) filter, b) wrapper, and the averaged Hamming distance hk for c) filter, d) wrapper, displayed with respect to the re-sampling technique involved in the objective function estimation for a feature subset of size k = 8
The filter variant of the SFFS algorithm achieves better stability if more samples are employed in the objective function estimation, i.e., using techniques like ten-fold or leave-one-out cross-validation, for instance. This result can be explained by the general fact that more samples available for learning lead to better parameter estimates and thus to a more consistent solution. The effect is clearly visible for the holdout validation, where the stability gets worse with fewer samples available for the estimation. However, the estimate gets better with an increasing number of trials. It appears that at some point the absolute stability factor saturates and becomes more or less independent of the examined re-sampling techniques if the number of estimation trials is sufficiently large.

The situation is quite different for the wrapper approach. The best stability and performance are obtained for the repeated 50/50 holdout validation employed in the objective function estimation. The popular ten-fold or leave-one-out cross-validation does not achieve such a good solution. Our interpretation is as follows. Although a classifier designed with more samples has better parameter estimates, the objective function is based on the prediction accuracy on the validation data. Having fewer samples available for validation implies a higher variance of the prediction accuracy. Thus, the feature selection algorithm becomes more sensitive to random perturbations in the data and fails to find a consistent solution. The numbers of training and validation samples compete against each other, which explains why the 50/50 holdout validation achieves the best results. It seems, again, that the stability factor saturates at some point if the number of estimation trials is sufficiently large. Wrappers appear to be much more sensitive to the correct objective function estimate than filters.

Notice that the .632 bootstrap achieved the best stability factor; however, its performance is by far the worst. Bootstrap techniques are supposed to give estimates with low variance [3], which explains the good stability. Nevertheless, the bias of the estimate is high and as a result the wrapper converged to a wrong solution. As for the performance, the non-zero Hamming distance indicates that none of the solutions is identical with the theoretical best feature subset. The slight bias (Hamming distance "2") is probably caused by an inaccurate estimate of the discriminatory power of the weakest features.
5 Conclusions
The feature selection results should always be strengthened by some confidence in the solution. For this purpose, we designed and theoretically justified an entropy based measure which can assess the stability of any kind of feature selection algorithm with respect to random perturbations in the input data. Furthermore, we derived bounds on the stability measure analytically. For a given feature selection method and data, the stability factor has to be determined empirically over a number of evaluation trials with randomly sampled training data. A large number of trials is required to guarantee a sufficient accuracy of the probability estimates of the occurrence of the selected feature subsets.
The stability and performance of feature selection methods can be improved to a certain degree by a suitable algorithm design. This can be achieved by an appropriate estimate of the objective function guiding the search for the best subset of features. Our experiments showed that filters achieved better stability and performance if more samples were employed in the objective function estimation, i.e., using re-sampling techniques like ten-fold or leave-one-out cross-validation, or 90/10 holdout validation. For wrappers, however, the best estimation technique appeared to be the 50/50 holdout validation. Wrappers also turned out to be much more sensitive to incorrect objective function estimation than filters. Nevertheless, a good stability of a feature selection algorithm on the given data is just a necessary condition for a good performance of the selected features. Both the stability and the performance should always be analysed together, because the stability itself does not say anything about the quality of the selected features.
References
1. Devijver, P.A., Kittler, J.: Pattern Recognition: A Statistical Approach. Prentice Hall, Englewood Cliffs (1982)
2. Dunne, K., Cunningham, P., Azuaje, F.: Solutions to instability problems with sequential wrapper-based approaches to feature selection. Technical Report TCDCD-2002-28, Dept. of Computer Science, Trinity College, Dublin, Ireland (2002)
3. Efron, B., Tibshirani, R.: Estimating the error rate of a prediction rule: Improvement on cross-validation. Technical Report TR-477, Dept. of Statistics, Stanford University (1995)
4. Hamming, R.W.: Error detecting and error correcting codes. Bell System Technical Journal 29(2), 147–160 (1950)
5. Jain, A., Zongker, D.: Feature selection: Evaluation, application and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(2), 153–158 (1997)
6. Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms. In: Proceedings of the 5th IEEE International Conference on Data Mining, Houston, Texas, pp. 218–225. IEEE Computer Society, Los Alamitos (2005)
7. Kira, K., Rendell, L.A.: The feature selection problem: Traditional methods and a new algorithm. In: Proceedings of the 10th National Conference on Artificial Intelligence, San Jose, CA, pp. 129–134. MIT Press, Cambridge, MA (1992)
8. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the Joint Conference on Artificial Intelligence, pp. 1137–1145. Morgan Kaufmann, Montreal, Canada, San Mateo, CA (1995)
9. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97(1-2), 273–324 (1997)
10. Kuncheva, L.I.: A stability index for feature selection. In: Proceedings of the 25th International Multi-Conference on Artificial Intelligence and Applications, pp. 390–395 (February 2007)
11. Pudil, P., Novovičová, J., Kittler, J.: Floating search methods in feature selection. Pattern Recognition Letters 15, 1119–1125 (1994)
12. Shannon, C.: Mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656 (1948)
A Movie Classifier Based on Visual Features
Hui-Yu Huang1, Weir-Sheng Shih2, and Wen-Hsing Hsu2
1 Department of Computer Science and Information Engineering, National Formosa University, Huwei, Yunlin 632, Taiwan
[email protected]
2 Department of Electrical Engineering, National Tsing Hua University, Hsinchu 300, Taiwan
Abstract. In this paper, we propose an approach to classify films into genres by using low-level features and visual features. Our current domain of study is the movie preview. A film preview often emphasizes the theme of a film and hence provides suitable information for the classification process. In our approach, we classify films into three broad categories: Action, Drama, and Thriller films. Four computable video features (average shot length, color variance, motion content and lighting key) and visual effects are combined in our approach to provide useful information for determining the movie category. Our approach can also be extended to other potential applications, including browsing, retrieval of videos on the internet, video-on-demand, and video libraries.
Keywords: shot detection, movie genres.
1 Introduction
With recent advances in digital video coding and transmission, ever larger amounts of digital video are becoming available from various sources every day. The advent of digital television and the internet is yet another motivating factor for automated analysis of digital video. Although digital video can be labeled at the production stage, there is still a need for automatic classification of videos. Manual annotation or analysis of video is an expensive and arduous task that will not be able to keep up with this rapidly increasing volume of video data in the near future. Hence, people need technologies such as video databases, video browsing, indexing, and data mining to manage videos and discover knowledge from them. Application of scene-level classification would allow a departure from the prevalent system of movie ratings to a more flexible system of scene ratings. For instance, a child would be able to watch movies containing a few scenes with excessive violence if a pre-filter system could prune out the scenes that have been rated as violent. To classify movies effectively and to protect children from unsuitable content downloaded from the internet, a film classification process needs to be developed. Pickering and Ruger [1] investigated the application of a variety of content-based image retrieval methods to video retrieval. Low et al. [2] presented an
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 937–944, 2007. © Springer-Verlag Berlin Heidelberg 2007
H.-Y. Huang, W.-S. Shih, and W.-H. Hsu
overview of the development of an integrated, content-based solution for computer-assisted video parsing, abstraction, retrieval and browsing. The most important feature of this approach is the use of low-level visual features as a representation of video content and an automatic abstraction process. To bridge low-level media features and high-level semantics, many algorithms have been proposed that link low-level features to object-level search and then connect the object level to the event level. Chua and Ruan [3] designed a system to support the entire process of video information management: segmenting, logging, retrieving, and sequencing of video data. Its main component is a semi-automatic tool that divides video sequences into meaningful shots. The rest of the paper is organized as follows. The proposed method is described in Section 2. Experimental results are given in Section 3. Finally, the paper is concluded in Section 4.
2 Proposed Method
In this section, we present the film classification method, which uses the movie preview within the feature-based paradigm. To define the video categories, we collected all movies that were shown in Taiwan from 2004 to 2006 from [4]. Many genres can be found there, such as Action, Adventure, Comedy, Crime, Drama, Animation, Thriller, War, etc. According to these data, the Action, Drama and Thriller genres account for almost 88% of the movies in every year. Hence, if we can classify these categories, the great majority of movies will be covered. Based on this viewpoint, we identified three major genres, Action/Adventure, Drama (including comedy, drama, romance) and Thriller (or Horror), to demonstrate our approach.

2.1 Color Space Selection
YUV is used in television broadcasting and in the MPEG compression standard. The HSV model is commonly used in computer graphics applications and closely matches human color perception [5]. Since the purpose of a movie is to entertain human viewers, we adopt the YUV and HSV color spaces to extract color information.

2.2 Shot Boundary Detection
Shot boundary detection is used in video processing to estimate the relationship between frames. In our approach, we extend the method for detecting shot boundaries [6] that uses color histogram intersection in the HSV color model. First, the video is decomposed into frames, which can be treated as still images. Then, a histogram of 16 bins is computed for each frame: eight bins for the hue, four for the saturation, and four for the value. Let $S(i)$ represent the intersection of the histograms $H_i$ and $H_{i-1}$ of frames $i$ and $i-1$, respectively. It is expressed as

$$S(i) = \sum_{j \in \text{all bins}} \min(H_i(j), H_{i-1}(j)). \tag{1}$$
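The histogram intersection of Eq. (1) can be sketched as follows. The function names are illustrative, and the normalisation of each histogram to unit sum is an assumption made here so that $S(i)$ lies in [0, 1]; the text does not state how the histograms are scaled. The extraction of the 16-bin HSV histogram from a frame is outside the scope of this sketch.

```python
def normalise(histogram):
    """Scale a histogram so that its bins sum to 1 (assumed convention)."""
    total = sum(histogram)
    if total == 0:
        raise ValueError("empty histogram")
    return [count / total for count in histogram]


def intersection(hist_a, hist_b):
    """Eq. (1): sum of the bin-wise minima of two histograms."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def similarity_sequence(histograms):
    """S(i) for i = 1 .. len-1, comparing each frame with its predecessor."""
    normalised = [normalise(h) for h in histograms]
    return [intersection(normalised[i], normalised[i - 1])
            for i in range(1, len(normalised))]


# Toy example with 4 bins instead of 16: two identical frames, then a cut.
frames = [[4, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]]
similarities = similarity_sequence(frames)
```

In the toy example the similarity is 1 between the two identical frames and drops to 0 at the third frame, whose colors do not overlap with the previous one.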
The magnitude $S(i)$ is often used as a measure of shot boundary in related works. The values of $i$ where $S(i)$ is less than a fixed threshold are assumed to be the shot boundaries. Applying a fixed threshold to $S(i)$ when the shot transition occurs with a dissolve generates several outliers because the consecutive frames differ from each other until the shot transition is completed. To improve the accuracy, an iterative smoothing of the 1-D $S$ function is performed first. We have adapted the algorithm proposed by Perona and Malik [7] based on anisotropic diffusion. $S$ is smoothed iteratively using a Gaussian kernel such that the variance of the Gaussian function varies with the signal gradient. Formally,

$$S^{t+1}(i) = S^{t}(i) + \lambda\,[c_E \cdot \nabla_E S^{t}(i) + c_W \cdot \nabla_W S^{t}(i)], \tag{2}$$

where $t$ is the iteration number and $0 < \lambda < 1/4$, with

$$\nabla_E S(i) \equiv S(i+1) - S(i), \tag{3}$$

$$\nabla_W S(i) \equiv S(i-1) - S(i). \tag{4}$$
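The update of Eq. (2) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the coefficients are computed with the exponential function $g(x) = e^{-(x/k)^2}$ of Perona and Malik with $\lambda = 0.1$ and $k = 0.1$, the number of iterations is a free parameter chosen here, and the gradient is taken to be zero beyond the two ends of the sequence.

```python
import math


def smooth_similarity(signal, iterations=20, lam=0.1, k=0.1):
    """Iterate Eq. (2) on a 1-D similarity sequence."""
    current = list(signal)
    n = len(current)
    for _ in range(iterations):
        updated = current[:]
        for i in range(n):
            # Differences to the right (E) and left (W) neighbour; zero at
            # the ends of the sequence (boundary handling is an assumption).
            grad_e = current[i + 1] - current[i] if i + 1 < n else 0.0
            grad_w = current[i - 1] - current[i] if i > 0 else 0.0
            # Coefficients shrink towards zero across large gradients, so
            # sharp drops are preserved while small ripples are averaged.
            c_e = math.exp(-(abs(grad_e) / k) ** 2)
            c_w = math.exp(-(abs(grad_w) / k) ** 2)
            updated[i] = current[i] + lam * (c_e * grad_e + c_w * grad_w)
        current = updated
    return current


noisy = [0.95, 0.97, 0.95, 0.97, 0.2, 0.97, 0.95, 0.97, 0.95]
smoothed = smooth_similarity(noisy)
```

In this example the small ripple on both sides of the drop is flattened, while the deep minimum at index 4 is left essentially untouched because the coefficient across such a large difference is practically zero.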
The conduction coefficients are a function of the gradients and are updated in every iteration,

$$c_E^{t} = g(|\nabla_E S^{t}(i)|), \tag{5}$$

$$c_W^{t} = g(|\nabla_W S^{t}(i)|), \tag{6}$$

where $g(\nabla_E S) = e^{-(|\nabla_E|/k)^2}$ and $g(\nabla_W S) = e^{-(|\nabla_W|/k)^2}$. The constants were set to $\lambda = 0.1$ and $k = 0.1$. Finally, the shot boundaries are detected by finding the local minima in the smoothed similarity function $S$. Thus, a shot boundary is detected where two consecutive frames have minimum color similarity.

2.3 Visual Feature Extraction
Both low-level features and visual features are extracted. The low-level features comprise average shot length, color variance, and lighting key [6]. In addition, we found that the viewer's feelings can be affected by the rhythm of the movie. At a shot change, the movie editor uses video editing techniques to enhance the visual effect. To analyze the rhythm in more detail, we propose two visual effect features, the slow moving effect and the fast moving effect, which the film editor creates by means of shot changes.

Slow moving effect: This visual effect is a gradual change from shot to shot. Fade and Dissolve are classified as this effect, because they usually need a long time for the shot change. The Fade and Dissolve are shown in Fig. 1.

Fast moving effect: This visual effect has two or more hard changes within a short time. It is heavily used in Action and Thriller films. The frames involved possess high contrast to their neighboring frames. It can also be seen as several Abrupt Cuts occurring in a short time; in our experiments, we set
940
H.-Y. Huang, W.-S. Shih, and W.-H. Hsu
Fig. 1. Slow moving effect from movie Fig. 2. Fast moving effect. (a)A flash ocSilent Hill. (a) Fade in and Fade out. (b) cur in a short time from movie Silent Hill. Dissolve. (b)A short shot from movie i Robot.
if there are at least 2 Abrupt Cuts occurred in 0.2 seconds, it possesses a fast moving effect. Two fast moving effects are shown in Fig. 2. Abrupt Cut : The total brightness difference will almost equal to brightness difference at change frame, because there has only one rapidly change in these durations. Fade: Film minima brightness can be found in these durations because the fade must have to change between original shot to black frames. Besides that, brightness change is gradually increased or decreased. Dissolve: This technique is hard to determine by brightness. The brightness change can be gradually increased, decreased, or unchanged. If a shot change is detected and cannot recognize as Abrupt Cut or Fade, we set this shot change as Dissolve. We only want to find fast moving effect and slow moving effect, the fast moving effect is defined as Abrupt Cut occurred in 0.2 seconds, slow moving effect is including Fade and Dissolve. Thus, we design a rule to classify these two change modes (denoted as CH) which present Abrupt Cut and Gradual Change (including Fade and Dissolve). From the change frame, the neighboring frame’s brightness is denoted B, the brightness difference between two frames is denoted Bd , the classification rule is defined as: CH ∈ Abrupt Cut, if 0.8 < T, CH ∈ Gradual Change, otherwise,
(7)
where T = (max(B) − min(B))/max(abs(Bd )) < 1.2. With this classification rule and the visual effect definition, we can extract the fast moving effect and slow moving effect.
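The rule of Eq. (7) and the 0.2-second criterion for the fast moving effect can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes that B_d is taken between consecutive frames of the neighborhood, that a neighborhood with no brightness change at all falls into the "otherwise" branch, and that the frame rate `fps` is known. All function names are hypothetical.

```python
def change_mode(brightness, low=0.8, high=1.2):
    """Apply Eq. (7) to the brightness values of the frames around a change frame."""
    # Assumption: B_d is the difference between consecutive frames.
    diffs = [abs(b - a) for a, b in zip(brightness, brightness[1:])]
    largest_step = max(diffs, default=0.0)
    if largest_step == 0:
        # Assumption: T is undefined without any change, so use the "otherwise" branch.
        return "Gradual Change"
    t = (max(brightness) - min(brightness)) / largest_step
    return "Abrupt Cut" if low < t < high else "Gradual Change"

def fast_moving_effects(cut_frames, fps, window_s=0.2):
    """Pairs of consecutive Abrupt Cuts that lie within window_s seconds."""
    cuts = sorted(cut_frames)
    max_gap = window_s * fps  # the time window expressed in frames
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if b - a <= max_gap]

# One sudden jump: the total range equals the largest single step, so T = 1.
print(change_mode([100, 100, 100, 40, 40, 40]))
# A steady ramp: the total range is five times the largest step, so T = 5.
print(change_mode([100, 88, 76, 64, 52, 40]))
print(fast_moving_effects([10, 13, 100, 300, 304], fps=25))
```

The ratio T is close to 1 exactly when a single frame-to-frame step accounts for the whole brightness range, which is why the interval around 1 identifies an Abrupt Cut.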
In addition, we assign a value to the slow moving effect and to the fast moving effect in order to represent the visual effect characteristics. Using the frame numbers that were detected as fast/slow moving, we calculate the distance between each two neighboring frames and obtain a new vector. Quantizing this vector, we calculate the relative histogram, normalized to 1. We thus obtain two new features, the fast and slow moving effect distributions, f_v and s_v, respectively. In our experiments, we set the total length of the visual effect distribution to 100. We further estimate the values of these two features, denoted F_me and S_me and defined as

$$F_{me} = \sum_{l=1}^{n} l \cdot f_v(l), \qquad S_{me} = \sum_{l=1}^{n} l \cdot s_v(l), \tag{8}$$

where n is set to 100. If one kind of visual effect never occurs in a film, we set its visual effect value to 100. The visual effect value is the expected value of the quantized distance between occurrences and thus indicates how often the effect occurs in a film: the smaller the value, the more frequently the visual effect occurs.

2.4 Classification Method
Thus far, we have discussed the relevance of various low-level features of video data based on feature-space analysis. We now choose the visual effect values and the lighting key as features and build a classifier model, as shown in Fig. 3. A decision tree is a flowchart-like tree structure in which each node denotes a test on the corresponding attribute, each branch represents an outcome of the test, and leaf nodes represent the classes or class distributions. The first decision rule is based on the visual effect values, namely the fast moving effect value F_me and the slow moving effect value S_me. For a non-drama film, the fast moving effect value is lower than for a drama film. We construct classification trees to calculate the thresholds T_fm and T_sm and then to determine whether a film is a drama or a non-drama film. For a film F(i), the decision rule is defined as

$$F(i) \in \begin{cases} \text{Non-drama}, & \text{if } F_{me} < T_{fm} \text{ and } S_{me} < T_{sm}, \\ \text{Drama}, & \text{otherwise}. \end{cases} \tag{9}$$
Fig. 3. Film classification construction diagram (A two-layer classifier structure)
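The visual effect value of Eq. (8) and the first-layer test of Eq. (9) can be sketched as below. The quantization step is not specified in detail in the text, so the sketch assumes that each distance is clipped to the integer bins 1..n; it also assumes that a single occurrence of an effect (which yields no distance) is treated like an effect that never occurs. Function names and the example thresholds are hypothetical.

```python
from collections import Counter

def effect_value(effect_frames, n=100):
    """Eq. (8): expected quantized distance between neighboring effect frames."""
    frames = sorted(effect_frames)
    gaps = [b - a for a, b in zip(frames, frames[1:])]
    if not gaps:
        # The text assigns the value 100 when an effect never occurs.
        # Assumption: a single occurrence is handled the same way.
        return float(n)
    # Assumption: quantize by clipping each distance to the integer bins 1..n.
    bins = Counter(min(max(int(g), 1), n) for g in gaps)
    total = sum(bins.values())
    # Relative histogram f_v(l) = count / total, weighted by the bin index l.
    return sum(l * count / total for l, count in bins.items())

def is_non_drama(f_me, s_me, t_fm, t_sm):
    """First-layer rule of Eq. (9)."""
    return f_me < t_fm and s_me < t_sm

# Distances 2, 2, 6 give f_v(2) = 2/3 and f_v(6) = 1/3, so the value is 10/3.
f_me = effect_value([10, 12, 14, 20])
s_me = effect_value([])  # the effect never occurs, so the value is 100
```

In the paper the thresholds come from the trained classification trees; any values passed to `is_non_drama` here are placeholders.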
The second decision rule is based on lighting information. Although lighting is less significant than the other features for distinguishing Drama from Comedy films, it can still be used to distinguish Action from Thriller films. The lighting value is denoted L. A threshold T_l is calculated by means of classification trees, and Action and Thriller films are distinguished by the decision rule

$$F(i) \in \begin{cases} \text{Action}, & \text{if } L > T_l, \\ \text{Thriller}, & \text{if } L \le T_l. \end{cases} \tag{10}$$
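Applying the lighting rule only to films that pass the non-drama test of Eq. (9) gives the two-layer classifier of Fig. 3. A minimal sketch follows; the function name is hypothetical and the threshold values in the example are placeholders, since the paper obtains them from trained classification trees.

```python
def classify_film(f_me, s_me, lighting, t_fm, t_sm, t_l):
    """Two-layer rule: Eq. (9) first, then Eq. (10) for non-drama films."""
    if not (f_me < t_fm and s_me < t_sm):
        return "Drama"
    return "Action" if lighting > t_l else "Thriller"

# Placeholder thresholds, chosen only to exercise the three branches.
thresholds = dict(t_fm=10.0, t_sm=50.0, t_l=0.5)
print(classify_film(3.0, 20.0, 0.8, **thresholds))
print(classify_film(3.0, 20.0, 0.2, **thresholds))
print(classify_film(30.0, 20.0, 0.8, **thresholds))
```

The lighting value is consulted only in the second layer, so a film rejected by the first test is labeled Drama regardless of L.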
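Thus, we can express all decision rules as

$$F(i) \in \begin{cases} \text{Action}, & \text{if } F_{me} < T_{fm} \text{ and } S_{me} < T_{sm} \text{ and } L > T_l, \\ \text{Thriller}, & \text{if } F_{me} < T_{fm} \text{ and } S_{me} < T_{sm} \text{ and } L \le T_l, \\ \text{Drama}, & \text{otherwise}. \end{cases} \tag{11}$$

3 Experimental Results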
In our experiments, we chose 44 films to test and verify the proposed method: 9 thrillers, 10 action films, and 25 dramas (including comedies). To test the detection of visual effects, we chose six film previews, in which the total numbers of Abrupt Cuts and Gradual Changes (including Fade and Dissolve) are 484 and 176, respectively. Fig. 4 shows the variability of average shot length. It clearly shows that the average shot length in Action films is shorter than in Drama and Comedy films, which means that Action films have a faster tempo; thus, they can be distinguished by this feature. Drama and Comedy films are almost the same in this feature. The distributions of motion content and color variance are shown in Fig. 5. Regarding motion content, the activity in Action films is higher than in Drama films. Color variance has a strong correlational structure with respect to genre. However, in our experiments, it is hard to crisply distinguish film genres with it. In other words, the color information is not a critical condition.

Fig. 4. Variability of average shot length

Fig. 5. Distribution of motion content and color variance

Fig. 6 shows the visual effect values. Action and Thriller films tend to have a lower fast moving effect value because they contain more shots with flashes or short actions where the fast moving effect occurs. That is, the frames at which these effects occur lie close to their neighboring occurrences. Fig. 7 shows the distribution of the lighting key feature. According to Fig. 7, the lighting factor in these films is an insignificant feature; hence, it cannot effectively categorize the film cluster.

Fig. 6. Distribution of visual effect value

Fig. 7. Distribution of lighting key

Table 1. Abrupt Cut detection result
Nc    Na    Nd    Precision(%)  Recall(%)
467   484   486   96.09         96.49

Table 2. Gradual Change detection result
Nc    Na    Nd    Precision(%)  Recall(%)
173   176   232   74.57         98.01

The Abrupt Cut detection result is presented in Table 1. We use two parameters, precision and recall [8], to evaluate the performance of visual effect detection. Precision and recall denote the accuracy of detection and the ability of detection, respectively, and are defined by

$$\text{Precision} = \frac{N_c}{N_d} \times 100\%, \tag{12}$$

$$\text{Recall} = \frac{N_c}{N_a} \times 100\%, \tag{13}$$
where N_c is the number of correctly detected changes, N_a is the number of changes that actually occurred, and N_d is the number of detected changes. The Gradual Change (including Fade and Dissolve) detection result is given in Table 2. To obtain the classification result, we randomly chose two-thirds of the films in each category for training the classification tree and calculated the decision thresholds T_fm, T_sm, and T_l by Eq. (10); the remaining films were used as test data. We repeated this step six times over all films. The accuracy of the classifier can be measured by precision [9] as

$$\text{precision} = t / w, \tag{14}$$

where t is the number of samples that actually belong to a film genre and are correctly classified as it, and w is the number of all samples classified as this film genre. Note that this precision, which estimates classification performance, differs from the precision used above for detection performance.
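As a quick check of the detection measures in Eqs. (12) and (13), the following sketch recomputes the Abrupt Cut figures from the counts reported in Table 1. The function name is hypothetical; the counts are those of the table.

```python
def detection_scores(n_correct, n_actual, n_detected):
    """Precision and recall in percent, following Eqs. (12) and (13)."""
    precision = 100.0 * n_correct / n_detected
    recall = 100.0 * n_correct / n_actual
    return precision, recall

# Table 1 (Abrupt Cut): Nc = 467, Na = 484, Nd = 486.
precision, recall = detection_scores(467, 484, 486)
print(f"precision = {precision:.2f}%, recall = {recall:.2f}%")
# prints: precision = 96.09%, recall = 96.49%
```

The two measures differ only in the denominator: precision divides by the number of detections, recall by the number of changes that actually occurred.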
4 Conclusions

In this paper, we have analyzed the low-level and visual features used in the classification of film genres. The experimental results show that combining visual cues with cinematic principles can provide powerful tools for genre categorization. Our approach can thus serve as a pre-filter for selecting movies to watch on the Internet. In the future, we will incorporate audio or text cues to improve the classification accuracy and will analyze more movie categories.
Acknowledgments

This work was supported in part by the National Science Council of the Republic of China under Grant No. NSC 94-2213-E-007-055 and NSC 95-2221-E-150-091.

References

1. Pickering, M.J., Ruger, S.: Evaluation of key frame-based retrieval techniques for video. Computer Vision and Image Understanding 92, 217–235 (2003)
2. Low, C.Y., Zhang, H.J., Smoliar, S.W.: Video parsing, retrieval and browsing: an integrated and content-based solution. In: Proc. of the ACM Int. Conf. on Multimedia, pp. 15–24 (1995)
3. Chua, T.S., Ruan, L.Q.: A video retrieval and sequencing system. ACM Trans. on Inf. Systems 13, 373–407 (1995)
4. http://www.truemovie.com/
5. Rafae, G., Richard, E.: Digital image processing, 2nd edn. Prentice-Hall, Englewood Cliffs (2002)
6. Rasheed, Z., Sheikh, Y., Shah, M.: On the use of computable features for film classification. IEEE Trans. on Circuits and Systems for Video Technology 15, 52–63 (2005)
7. Perona, P., Malik, J.: Scale-space and edge detection using anisotropic diffusion. IEEE Trans. on Pattern Analysis and Machine Intelligence 12, 629–639 (1990)
8. Lupatini, G., Saraceno, C., Leonardi, R.: Scene break detection: a comparison. In: Proc. of IEEE Int. Workshop on Research Issues in Data Engineering, pp. 34–41 (1998)
9. Yuan, Y., Song, Q.B., Shen, J.Y.: Automatic video classification using decision tree method. In: Proc. of the First Int. Conf. on Machine Learning and Cybernetics, pp. 1153–1157 (2002)
An Efficient Method for Filtering Image-Based Spam E-mail

Ngo Phuong Nhung and Tu Minh Phuong
Posts and Telecommunications Institute of Technology, Hanoi, Vietnam
[email protected]

Abstract. Spam e-mail with advertisement text embedded in images presents a great challenge to anti-spam filters. In this paper, we present a fast method to detect image-based spam e-mail. Using simple edge-based features, the method computes a vector of similarity scores between an image and a set of templates. This similarity vector is then used with support vector machines to separate spam images from other common categories of images. Our method does not require expensive OCR or even text extraction from images. Empirical results show that the method is fast and has good classification accuracy.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 945–953, 2007. © Springer-Verlag Berlin Heidelberg 2007

1 Introduction

The increasing number of Internet users and the low cost of e-mail make this form of communication very attractive for direct marketers. As a consequence, the volume of unsolicited commercial e-mail (“spam”) has grown tremendously in the past few years. To address this growing problem, many solutions for spam reduction have been proposed. Among these solutions are automated methods for filtering spam. Using hand-crafted rules or machine learning techniques, anti-spam filters analyze the text content of e-mail to detect spam. Some anti-spam filters are reported to achieve accuracy of up to 99% [9]. To circumvent such systems, spammers have invented many techniques. One of them is to embed advertising text in images sent with spam. While the contents of such messages are normally viewed by spam receivers, they are shielded from text-based anti-spam filters. By some estimates, up to 25% of spam being sent today contains imagery, and this number is expected to increase [1]. Therefore, it is desirable to develop systems that can detect and filter image-based spam. A possible way to detect image-based spam is to use a pipeline consisting of an optical character recognition (OCR) system, which extracts and recognizes embedded text, followed by a text classifier that separates advertising text from legitimate content. While this solution promises to detect spam with a certain level of accuracy, the existing OCR algorithms are computationally expensive and thus cannot operate on heavily loaded e-mail servers. In a recent paper, Aradhye et al. [1] described an image-based anti-spam filter that does not require full text recognition. Their method starts by extracting regions with overlaid text from images. Based on the text regions and other image elements, the method creates several simple features that are indicative of spam. Images represented by the extracted features are then classified into spam and non-spam using SVM.
Since the method does not include the text recognition step, it is much faster than systems with full OCR. However, the extraction of text regions is nontrivial and still requires considerable computational resources. In this paper, we propose a fast method to detect spam images. The proposed method does not try to extract embedded text from an image. Instead, it uses an edge-based feature vector, which can be computed efficiently, to represent major shape properties of the image. Since most spam images contain large proportions of text (Fig. 1), they should have a shape representation similar to that of other text-intensive images. Our method uses the edge-based feature to compute a vector of similarity measures from an image to a small set of gold standards, that is, images with different proportions of overlaid text. These similarity vectors then serve as input to SVMs that separate spam images from legitimate ones. The method is fast because it does not use computationally expensive image processing and text recognition steps. On a collection of images, the method separates spam images from non-spam ones with an accuracy of 80% and higher.
Fig. 1. Examples of spam images: (a) image containing only text; (b) image with photographic elements

Fig. 2. Examples of non-spam images: (a) natural scene photo; (b) greetings e-card
Related work. The problem of image-based spam detection is a special case of image categorization, which has been studied in the context of many important applications. Depending on the application requirements and the nature of the images to categorize, generic image categorization methods can use different image features or their combinations to distinguish between two or more classes. Hu and Bagga [4] relied on the correlation between image functional categories and several features, such as whether the images are graphic or photographic and whether the images have text elements. They used frequency domain analysis of image intensity and DCT coefficients to decide on the presence of these features. The learning and classification steps were then done with SVMs. Gavilan et al. [3] represented an image in terms of blobs, that is, image regions lighter or darker than the background. The blob representations of the images are then used to train neural networks to distinguish among natural, artificial, portrait, and text images. Another interesting application is indoor vs. outdoor classification [10]. Besides spam images, many other categories of images and videos contain text. Since overlaid text contains important information about image content, the extraction of text blocks and the recognition of their content have attracted research interest in the image processing and multimedia analysis community. Previous work on text extraction used different characteristics of text regions, such as their color [7], the frequent occurrence of vertical edges, or wavelets and spatial variance of texture [13], to locate blocks of text.
2 Algorithm

Our proposed spam detection method uses SVM combined with a vector representation of images to distinguish between spam and non-spam images. The algorithm consists of the three steps outlined next.

1) Feature extraction and normalization using Edge Directions (ED) or the Edge Orientation Autocorrelogram (EOAC). This step summarizes the shape properties of an image in terms of edge orientation and correlation.
2) Calculation of similarity scores between the image and a small set of templates, or sample images containing only text. This step allows representing the image as a vector of similarity scores with respect to the templates.
3) Training and classification with SVM. The vector representations of images computed in step 2 are used to train the SVM and subsequently to classify each new image as spam or non-spam.

In the following sections, we provide a detailed description of each of these steps.

2.1 Edge Directions and Edge Orientation Autocorrelogram

In image categorization, it is important to choose appropriate features to represent images. Since spam images are text intensive, and text elements have special shape characteristics that make them different from the background or other elements, it is desirable to use features able to capture such characteristics. At the same time, for the method to be practical, the feature of choice must be fast to compute. Although other features, such as color-based features, texture-based features, and blobs, have been used with success for other image categorization tasks [3,4,8,10], edge-based features are very indicative of text-intensive spam images while remaining simple to compute. In this work, we have chosen two edge-based features: ED [5,6] and EOAC [8]. The ED feature summarizes global shape information. EOAC is an extension of ED that captures the correlation between text elements over small distances. The ED histogram of an image is computed in three steps, and the EOAC is computed by adding two more steps, as follows.
1. Edge detection: The Sobel operator is used to generate two edge components G_x and G_y, from which the edge amplitude and the edge orientation are computed as follows:

$$|G| = \sqrt{G_x^2 + G_y^2} \quad \text{and} \quad \angle G = \tan^{-1}(G_y / G_x)$$

2. Finding prominent edges: Only edges with an amplitude higher than a predefined threshold T1 are extracted. In our experiments we used T1 = 25 as in [8].
3. Edge orientation quantization: Edge orientations are quantized into k segments ∠G1, ∠G2, …, ∠Gk, each spanning five degrees. The result of this step is the ED histogram.
4. Determining the distance set: This step constructs a distance set D, whose members are the distances from the current edge. This set is used to compute the correlations in the next step. We used the set of four members as in the original paper, D = {1, 3, 5, 7}.
5. Computing the EOAC matrix: The EOAC matrix is a two-dimensional array with k rows and |D| columns. Each element of this matrix contains the number of similar edges with the orientation ∠Gj that are i pixels apart. Two edges are defined to be similar if the absolute differences of their orientations and of their amplitudes are less than an angle threshold and an amplitude threshold, respectively.

Figure 3 shows examples of EOAC graphs and ED histograms. The ED and EOAC features are translation invariant. This is a desired characteristic in our case because text can be located anywhere within an image. To also achieve scaling invariance, the ED histogram is normalized with respect to the number of edge points of the image, and the EOAC matrix is normalized with respect to the sum of the populations of all EOAC bins.

2.2 Calculation of Similarity Scores

If the proportion of overlaid text within an image is large enough, the contribution of the text to the image’s ED and EOAC will dominate that of the other image elements. As a consequence, two images containing large amounts of text tend to share similarities in their shape representations. The algorithm proposed herein exploits this observation to distinguish text-intensive images from others. Specifically, in this step, the algorithm computes similarity scores of the image with respect to a small set of n sample images or templates.
Here we define a template as a specially constructed image that contains only text. The proportions of text as well as the text characteristics are chosen to vary among different templates so that the set covers a large variety of images with overlaid text. Figure 3 shows a non-spam image (a), a spam image (b), and a template (c), as well as their respective EOAC graphs [(d)-(f)] and ED histograms [(g)-(i)]. As shown in the figure, the EOAC and ED representations of the spam image share some similarities with those of the template, while the representations of the non-spam image look very different. To measure the similarity between an image with EOAC X and a template with EOAC Y, we use the L1 distance, computed as follows:

$$L_1(X, Y) = \sum_{i=1}^{k} \sum_{j=1}^{d} |X_{ij} - Y_{ij}|$$

where d = |D| is the number of columns of the EOAC matrix. For the ED feature, the similarity is computed in the same way.
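Steps 1 to 3 for the ED histogram and the L1 comparison against templates can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: it assumes grayscale images stored as lists of rows, uses `atan2` to obtain orientations over the full 0-360 degree range (which gives 72 bins of five degrees), and skips the one-pixel image border. All function names are hypothetical, and the EOAC steps 4 and 5 are not covered.

```python
import math

def ed_histogram(image, t1=25, bin_deg=5):
    """Normalized edge-direction histogram of a grayscale image (list of rows)."""
    n_bins = 360 // bin_deg
    hist = [0.0] * n_bins
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            # 3x3 Sobel responses: gx across columns, gy across rows.
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            if math.hypot(gx, gy) > t1:  # keep prominent edges only (step 2)
                angle = math.degrees(math.atan2(gy, gx)) % 360.0
                hist[int(angle // bin_deg) % n_bins] += 1.0
    edge_points = sum(hist)
    # Dividing by the number of edge points gives the scaling normalization.
    return [h / edge_points for h in hist] if edge_points else hist

def l1_distance(x, y):
    """L1 distance between two feature vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(x, y))

def similarity_vector(feature, template_features):
    """One L1 score per template; smaller values mean more similar shapes."""
    return [l1_distance(feature, t) for t in template_features]

# Two toy 6x6 images: one vertical step edge and one horizontal step edge.
vertical_edge = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
horizontal_edge = [[0] * 6 for _ in range(3)] + [[255] * 6 for _ in range(3)]
hist_v = ed_histogram(vertical_edge)
hist_h = ed_histogram(horizontal_edge)
scores = similarity_vector(hist_v, [hist_v, hist_h])
```

For these toy images all edge points of each image fall into a single orientation bin, so the two histograms do not overlap and their L1 distance reaches its maximum of 2, while the distance of a histogram to itself is 0.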
Fig. 3. An example of shape representation using EOAC and ED: (a)-(c) show three images, where (a) is a non-spam image, (b) a spam image, and (c) a template; (d)-(f) show their respective EOAC graphs; and (g)-(i) their respective ED histograms.
At the end of this step, each image is represented by a vector of length n whose elements are the similarity scores with respect to the templates. In the machine learning community, the idea of representing an object via its similarity to a set of other objects is known as the empirical feature map [11]. The merit of the empirical feature map is that it provides a general way to map from similarity scores to a vector representation, from which a proper kernel can be constructed for use with SVM.

2.3 Support Vector Machine Learning and Classification

With the similarity vectors computed above, we train an SVM to differentiate between the two classes, spam and non-spam. The SVM algorithm relies on two main ideas. First, the algorithm maps the given training set of n-dimensional vectors with positive and negative examples into a (possibly) high-dimensional feature space. Then, in the feature space, the algorithm seeks a hyperplane with two properties: 1) it separates the positive from the negative examples; 2) it maintains a maximum margin from any example in the training set [12]. Having found such a hyperplane, the SVM predicts the label of a new example by mapping it into the feature space and determining on which side of the hyperplane the example is located. The mapping from the input space to the feature space is done by using a so-called kernel function. In this work, we used similarity vectors as input to the SVM and tried several kernel functions defined on input vectors.
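To make the two ideas concrete for the linear kernel, the sketch below trains a soft-margin linear SVM on a toy two-dimensional set and predicts by the side of the hyperplane. It is an illustrative stand-in, not the solver used in the paper: it minimizes the soft-margin objective by plain batch subgradient descent, and the learning rate, the number of epochs, the toy data, and the function names are all assumptions.

```python
def train_linear_svm(xs, ys, c=1.0, lr=0.01, epochs=2000):
    """Minimize 0.5*||w||^2 + C * sum(hinge losses) by batch subgradient descent."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)   # gradient of the 0.5*||w||^2 term
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # margin violated: the hinge loss is active
                grad_w = [g - c * y * xi for g, xi in zip(grad_w, x)]
                grad_b -= c * y
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Label by the side of the hyperplane w.x + b = 0 (+1 or -1)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data; in the paper the inputs are similarity vectors.
train_x = [(2, 2), (3, 3), (2, 3), (0, 0), (-1, 0), (0, -1)]
train_y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(train_x, train_y)
train_pred = [predict(w, b, x) for x in train_x]
```

With C = 1 the regularization term keeps w small while the hinge term penalizes examples inside the margin, so the learned hyperplane settles between the closest examples of the two classes.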
3 Experiments and Results

Dataset. Unlike the situation with text-based spam filtering, to the best of our knowledge there is no public benchmark dataset for image-based spam. To create a dataset for experiments, we collected images from spam messages arriving at an e-mail server. We used images from a spam message only if the message did not contain text in its body, so that the image content provides the major source of information for the spam/non-spam decision. Images that exist only for formatting were not included. The spam part of the dataset contains 411 images, about half of which contain only text without a complex background (Fig. 1a). The other images contain graphic and photographic elements with different levels of complexity (Fig. 1b). The non-spam images in our dataset were collected from several sources. We asked our friends and colleagues to donate e-cards they had received by e-mail. The e-card collection was augmented by free e-cards downloaded from different websites with default text. This resulted in 287 images, all of which contain text. We further randomly selected 300 images from the CorelDraw collection. Finally, following the work presented in [1], another 723 images were collected by querying the Google-images search engine with the keywords “nature photo”, “portrait” and “baby” and then randomly selecting from what the engine returned.

Evaluation methodology. We assessed the method by using 10-fold cross-validation. The dataset was randomly divided into 10 folds of equal size. One fold was left as the test set and the other folds were used for training. The experiment was repeated 10 times with different folds being the test sets; the classification accuracy was calculated by averaging over the 10 runs. In our evaluation, we used spam recall and non-spam recall, defined as follows:

$$\text{spam recall} = \frac{\text{\# of spam correctly classified}}{\text{\# of all spam}}$$

$$\text{non-spam recall} = \frac{\text{\# of non-spam correctly classified}}{\text{\# of all non-spam}}$$
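The evaluation protocol can be sketched as follows. The fold construction and the label encoding (1 for spam, 0 for non-spam) are assumptions made for illustration, and the function names are hypothetical; the sample count 1721 is simply the sum of the image counts given in the dataset description (411 + 287 + 300 + 723).

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def recalls(y_true, y_pred):
    """Spam recall and non-spam recall; both classes must occur in y_true."""
    spam_total = sum(1 for t in y_true if t == 1)
    ham_total = len(y_true) - spam_total
    spam_hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    ham_hits = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return spam_hits / spam_total, ham_hits / ham_total

folds = k_fold_indices(1721)
# Each fold serves once as the test set; the other nine form the training set.
test_fold = folds[0]
train_indices = [i for fold in folds[1:] for i in fold]

# Two of three spam images and one of two non-spam images are classified correctly.
spam_recall, non_spam_recall = recalls([1, 1, 1, 0, 0], [1, 0, 1, 0, 1])
```

Reporting the two recalls separately matters here because the classes are unbalanced: a filter that labels everything as non-spam would reach high overall accuracy with a spam recall of zero.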
SVM training. To conduct the experiments we used the WEKA library (http://www.cs.waikato.ac.nz/ml/weka). All the similarity vectors were normalized to have unit length. In all the experiments we used SVM with soft margin and C = 1. We tried different kernels and found that the linear kernel and the RBF kernel give the best results. In what follows, only results obtained with the linear kernel are reported.

Template set construction. Since the algorithm makes sense only when an image contains a dominating amount of text, the templates were constructed so that text regions cover more than 50% of each template. Specifically, we used black-and-white templates with text areas covering from 50% to 90% of the whole image in steps of 10%. Letters were chosen uniformly so that all the letters of the English alphabet appear with the same frequency. We tried several commonly used font families and their italic and bold face variants. Since EOAC is scaling-invariant, the choice of font size is not critical. In our experiments, we used font size 10. To avoid unexpected effects when comparing images of different sizes, we used ten sets of templates, each consisting of templates of the same size. The sizes were chosen from 80x60 to 800x600 pixels in steps of 80x60. When computing the similarity vector for a given image, the set with the size closest to the image’s size is used.

Results. In the first experiment, we compared the performance of two versions of the method, with ED and with EOAC used as image features. We also examined how the use of templates with different font families affected the spam detection accuracy. In addition to three fonts that are commonly used on the Internet, namely Times New Roman (TNR), Arial, and Tahoma, we used two other template sets constructed with the Gothic and Lucida fonts. The results are summarized in Table 1. They show that, on average, the two edge-based features have nearly the same non-spam recall, but EOAC gives significantly higher spam detection accuracy. At the same time, computing EOAC is more expensive than computing ED. On a PC with a Pentium IV and 512 MB RAM, the computation of EOAC and ED for 1000 images of size 800x600 takes 260 seconds and 125 seconds, respectively.

Table 1. Spam recall and non-spam recall for different edge-based features and font families
Feature  Measure          TNR   Tahoma  Arial  Gothic  Lucida
ED       Spam recall      0.73  0.75    0.74   0.74    0.79
ED       Non-spam recall  0.88  0.88    0.88   0.88    0.87
EOAC     Spam recall      0.80  0.83    0.80   0.79    0.86
EOAC     Non-spam recall  0.87  0.87    0.88   0.87    0.84
Fig. 4. Spam and non-spam recalls for different image categories (horizontal axis: spam recall; vertical axis: non-spam recall; curves: Nature photo, Portrait, E-cards)
As expected, the proposed method is not sensitive to the choice of font family used in the template construction step. Except for small fluctuations in the cases of Tahoma and Lucida, the different font families give nearly the same classification accuracy. A possible explanation is that the small differences in the shapes of letters from different font families are smoothed out during the edge orientation quantization step.
The next experiment was designed to evaluate the performance of the method on different categories of non-spam images. The method was run with EOAC as the image feature and the template set constructed from the Tahoma font family. In Fig. 4, the non-spam recall for each image category is plotted against the spam recall. The results show that the method can accurately distinguish spam images from “nature photo” images: both spam and non-spam recalls are higher than 93%. At the same time, e-cards proved to be the most difficult to distinguish from spam images. Images of this category always contain some amount of overlaid text, which makes them similar to spam.
4 Conclusion

We have described a new method for detecting spam e-mail with content embedded in images. Given an image, our method first extracts an edge-based feature, which summarizes the global information of the image. It then computes a vector of similarity scores between the image and a set of templates that contain only text. This vector representation of the image is used as input for support vector machine training and classification. The use of an edge-based feature allows capturing regularities in the shapes of text-intensive spam images without requiring costly computations. Empirical tests show that our method achieves an overall accuracy of 80% and higher in separating spam from different categories of images while remaining fast. Given the complexity of the problem, these results are encouraging, and the proposed method can be used as a starting step for the construction of image-based anti-spam filters.
References

1. Aradhye, H.B., Myers, G.K., Herson, J.A.: Image Analysis for Efficient Categorization of Image-based Spam E-mail. In: Proc. of ICDAR’05, Seoul, Korea, pp. 914–918 (2005)
2. Fumera, G., Pillai, I., Roli, F.: Spam filtering based on the analysis of text information embedded into images. J. of Machine Learning Research 7, 2699–2720 (2006)
3. Gavilan, D., Takahashi, H., Nakajima, M.: Image Categorization Using Color Blobs in a Mobile Environment. Computer Graphics Forum (EG 2003) 22(3), 427–432 (2003)
4. Hu, J., Bagga, A.: Categorizing Images in Web Documents. IEEE Multimedia 11(1), 22–30 (2004)
5. Jain, A.K., Vailaya, A.: Image retrieval using color and shape. Pattern Recognition 29(8), 1233–1244 (1996)
6. Jain, A.K., Vailaya, A.: Shape-based retrieval: a case study with trademark image database. Pattern Recognition 31(9), 1369–1390 (1998)
7. Lienhart, R., Effelsberg, W.: Automatic Text Segmentation and Text Recognition in Video Indexing. ACM/Springer Multimedia Systems 8, 69–81 (2000)
8. Mahmoudi, F., Shanbehzadeh, J., Soltanian-Zadeh, H.: Image retrieval based on shape similarity by edge orientation autocorrelogram. Pattern Recognition 36, 1725–1736 (2003)
9. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian Approach to Filtering Junk E-Mail. In: Proc. of AAAI-98 Workshop on Learning for Text Categorization (1998)
10. Szummer, M., Picard, R.W.: Indoor-Outdoor Image Classification. In: Proc. IEEE Intl. Workshop on Content-Based Access of Image and Video Databases, pp. 42–51 (1998)
11. Tsuda, K.: Support vector classification with asymmetric kernel function. In: Proc. of 7-th European symposium on Artificial Neural Networks, pp. 183–188 (1999)
12. Vapnik, V.N.: Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, New York (1999)
13. Li, H., Doermann, D., Kia, O.: Automatic Text Detection and Tracking in Digital Video. IEEE Transactions on Image Processing 9(1), 147–156 (2000)
SVM-Based Active Feedback in Image Retrieval Using Clustering and Unlabeled Data

Rujie Liu¹, Yuehong Wang¹, Takayuki Baba², Yusuke Uehara², Daiki Masumoto², and Shigemi Nagata²

¹ Fujitsu Research and Development Center, Beijing, China
{rjliu, wangyh}@cn.fujitsu.com
² Fujitsu Laboratories LTD., Kawasaki, Japan
{baba-t, yuehara, masumoto.daiki, nagata.shigemi}@jp.fujitsu.com

Abstract. In content-based image retrieval, relevance feedback has been extensively studied to bridge the gap between low-level image features and high-level semantic concepts. However, it is still challenged by the small sample size problem, since users are rarely patient enough to label a large number of training instances. In this paper, two strategies are proposed to tackle this problem: (1) a novel active selection criterion that takes both the informative and the representative measures into consideration. With this criterion, the diversity of the selected images is increased while their informative power is kept, so more information gain can be obtained from the feedback images; and (2) incorporation of unlabeled images within the co-training framework. Unlabeled data partially alleviates the training data scarcity problem and can thus improve the efficiency of SVM active learning. Systematic experimental results verify the superiority of our method over some existing active learning methods.
1 Introduction
The success of content-based image retrieval (CBIR) is largely limited by the gap between low-level features and high-level semantic concepts. To bridge this gap, relevance feedback (RF), initially developed in text retrieval, was introduced into CBIR in the 1990s with considerable success[1][2]. Many relevance feedback methods have been developed in recent years along the path from heuristic-based techniques to optimal learning algorithms, such as “Query Point Movement” methods[3], discriminant analysis[4], D-EM[5], and SVM-based methods[6][7]. However, all these methods are challenged by the small sample size problem, because few users are patient enough to label a large number of training instances in the feedback process.

Two strategies are commonly used to address the training data scarcity problem: active learning and the use of unlabeled data. Active learning studies strategies by which the learner actively selects samples to query the user so as to achieve maximal information gain in decision making. Cox et al.[8] used entropy minimization to search for the set of images that, once labeled, will minimize the expected number of future iterations. Tong and Chang[6] proposed the SVMActive algorithm for applications in text and image retrieval. According to their criterion, good requests should maximally reduce the size of the version space, which can be approximately achieved by selecting the points near the SVM boundary.

Semi-supervised learning with unlabeled data is another strategy to solve the problem of insufficient training data. Based on the EM framework and discriminant analysis, D-EM[5] relaxes the assumption of the probabilistic structure of the data distribution and learns a generative model in a lower dimensional subspace. The co-training paradigm[10] trains two classifiers separately on two distinct views and uses the predictions of each classifier on unlabeled examples to augment the training set of the other classifier.

In this paper, we aim to improve the SVMActive[6] solution by following two strategies:

(1) Enhance its active selection criterion by incorporating a representative measure. To achieve maximal information gain in the feedback process, we argue that the selected images must be informative given the current estimation of the target, and that the redundancy between these images has to be low. In SVMActive, the most informative images are returned for labeling; however, much redundancy may exist between these examples. To solve this problem, a representative criterion is proposed to maximize the information capacity of the selected images. In each retrieval round, the images near the classification boundary are clustered by an unsupervised learning process, and one representative image is selected from each cluster for labeling. With this measure, the diversity of the selected images is increased and the feature space covered by them is enlarged; thus, more information gain can be obtained.

(2) Augment the training set by exploiting unlabeled data within the co-training framework. In each retrieval round, a simple Euclidean distance model is built, and its most confident negative examples are used to augment the training pool of the SVM.

The rest of this paper is organized as follows. In section 2, we first introduce the SVMActive method and then incorporate our representative measure to generate a combined criterion. Section 3 presents the usage of unlabeled data within the co-training framework. The experimental results are shown in section 4, followed by the conclusion in section 5.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 954–961, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Active Learning Criterion

2.1 SVMActive - The Informative Measure
The SVMActive criterion[6] approaches active learning through the version space. The basic idea of this method is briefly presented below. Let $\{(x_1, y_1), \ldots, (x_n, y_n)\} \subset (\chi \times \{-1, +1\})^n$ be the training data, with $\chi$ denoting a nonempty input space. A Mercer kernel $K$ implicitly defines a mapping $\phi$ from the input space $\chi$ to the feature space $F$ such that $K$ corresponds to a dot product in $F$: $K(x, x') = \langle \phi(x), \phi(x') \rangle$. Assuming the training data are linearly separable in the feature space, the version space is defined as the set of hyperplanes that separate the data in $F$. More formally:
$$V = \{\, w \in W \mid y_i \langle w, \phi(x_i) \rangle > 0 \ \text{for } i = 1, \ldots, n \ \text{and} \ \|w\| = 1 \,\} \qquad (1)$$
where $W$ denotes the parameter space. There exists a duality between the feature space $F$ and the parameter space $W$: points in $F$ correspond to hyperplanes in $W$ and vice versa. With this definition, learning can be viewed as a search within the version space $V$, which contains the solution we are looking for. Tong[6] proposed that a good way of doing this is to choose query points that halve the version space, so that it is reduced as fast as possible. This can be realized by testing each unlabeled instance $x$ in the pool to see how close its corresponding hyperplane in $W$ comes to the center of $V$. The closer a hyperplane in $W$ is to the center of $V$, the more evenly it bisects $V$. The SVM hyperplane $w$ provides a reasonable approximation to the center of the version space; thus, Tong’s SVMActive criterion amounts to selecting the unlabeled instance with minimal distance to the current classification hyperplane.

2.2 The Representative Measure
While Tong’s SVMActive strategy provides an effective way to select the most informative image, it encounters problems when used to select more than one candidate image. Adding a batch of new examples selected exclusively by their distances to the classification hyperplane does not necessarily yield a much greater reduction of the volume of the version space than simply adding one such example. In other words, this strategy does not consider the redundancy among these images.

In this paper, a dynamic clustering process is adopted to enhance the representative ability of the selected images, as shown in Fig. 1. In each feedback round, the images near the classification boundary and all images labeled so far are clustered into several groups, and one image is selected from each cluster for labeling. Each selected image is thus representative of a small region of the feature space, so the diversity among the selected images is increased while their informative power is kept. This clustering process is dynamic because the images are selected depending on the current estimation. In addition, both the query image and the images labeled in previous feedback rounds are included in the clustering, in order to refine the feature space as much as possible with the available information.

Normalized cut (Ncut)[11] is used to realize the clustering process. It formulates clustering as a graph partition problem and organizes the image objects into groups such that the within-group similarity is high while the between-group similarity is low. Given $N$ images to be clustered, a weighted undirected graph $G = (V, E)$ is first constructed, where each node represents an image and edges are formed between each pair of nodes. The weight on each edge, $w_{ij}$, is a function of the similarity between nodes $i$ and $j$, defined as:

$$w_{ij} = e^{-\frac{d^2(i,j)}{\sigma^2}} \qquad (2)$$
Fig. 1. Illustration of the representative measure. These images that are near the classification boundary are clustered, as denoted by the dashed ellipses, then, one image is selected from each cluster to query the user.
where $d(i, j)$ is the distance between images $i$ and $j$, and $\sigma$ is a scaling parameter which is typically set to 10 to 20 percent of the total range of the image distance. Assume that the graph $G$ is partitioned into two disjoint sets $A$ and $B$, with $A \cup B = V$ and $A \cap B = \emptyset$. The Ncut measure[11], i.e. the normalized weight of the edges that connect the two sets, is then adopted to quantify this partition:

$$Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}, \qquad cut(A, B) = \sum_{u \in A,\, v \in B} w_{uv} \qquad (3)$$

where $assoc(A, V) = \sum_{u \in A,\, v \in V} w_{uv}$ is the total connection from nodes in $A$ to all nodes in the graph, and $assoc(B, V)$ is similarly defined. It was proven in [11] that a bipartition with minimum Ncut value can be approximately found by computing the generalized eigenvector with the second smallest eigenvalue of the following generalized eigenvalue system:

$$(D - W)\, y = \lambda D y \qquad (4)$$

where $W$ is an $N \times N$ symmetric matrix with $W(i, j) = w_{ij}$, and $D$ is a diagonal matrix with $d(i) = \sum_j w_{ij}$. The eigenvectors obtained above often take on continuous values; thus, a splitting point should be chosen to determine the partition. In our implementation, the median value of the second smallest generalized eigenvector is used as the splitting point.

In clustering, the Ncut method is recursively applied to obtain several clusters. This raises two questions: (1) which sub-graph should be divided? and (2) how should the number of clusters be determined, i.e., when should the process stop? For the first question, we use a simple heuristic: the sub-graph with the maximum number of nodes is selected for further partition. As to the second question, a straightforward way is to set the number of clusters to the number of images that are returned to query the user. However, this leads to another kind
of redundancy. Some clusters usually contain images that have been labeled in previous feedback rounds, so images selected from these clusters may be similar to the labeled ones. In this case, only a little more information can be obtained from the newly selected training instances. To avoid this problem, we adopt the following criterion to control the clustering process: repeat clustering until the number of clusters that contain no labeled image equals the number of instances to be labeled. Furthermore, it is assumed that the labeled images can represent the clusters they belong to. Thus, we select images only from the clusters that contain no labeled image and return them to query the user.

2.3 Active Selection
To meet both requirements, viz. that the selected images must be informative and the redundancy among them must be low, the informative and representative measures are linearly combined for batch image selection. Let $c = \{x_1, \ldots, x_M\}$ denote one cluster that contains no labeled image. The representative measure of image $x_i$ is defined as $Rep(x_i) = \sum_{j \in c} w_{ij}$. The larger the within-cluster similarity of an image, the better it represents the cluster. The informative measure of image $x_i$ is defined as $Inf(x_i) = |g(x_i)|$, where $g(x_i)$ denotes the real-valued output of the SVM learner. Integrating these two factors, the final score of an unlabeled instance $x_i$ can be written as:

$$s(x_i) = -\lambda\, Inf(x_i) + (1 - \lambda)\, Rep(x_i) \qquad (5)$$

where the parameter $\lambda$ adjusts the individual influence of each factor. For all images in cluster $c$, the scores are calculated according to Equ. 5, and the image with the largest score is selected for labeling.
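The clustering and selection procedure of Sections 2.2 and 2.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`affinity`, `ncut_bipartition`, `free_clusters`, `select_batch`) and the parameter `sigma_frac` are hypothetical, Euclidean distance stands in for the unspecified image distance, the median split is done by rank so that both halves are non-empty, and Equ. 5 is applied literally without rescaling Inf and Rep.

```python
import numpy as np
from scipy.linalg import eigh


def affinity(X, sigma_frac=0.15):
    """Equ. 2: w_ij = exp(-d^2(i, j) / sigma^2), with Euclidean d (an assumption)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    sigma = sigma_frac * (d.max() - d.min())  # 10-20% of the distance range
    return np.exp(-(d ** 2) / sigma ** 2)


def ncut_bipartition(W):
    """Split a graph at the median of the second generalized eigenvector of Equ. 4."""
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D)            # eigenvalues come back in ascending order
    order = np.argsort(vecs[:, 1])      # rank-based median split (assumption)
    half = len(order) // 2
    return order[:half], order[half:]


def free_clusters(W, labeled, n_query):
    """Split the largest cluster until n_query clusters hold no labeled image."""
    clusters = [np.arange(W.shape[0])]

    def free(cs):
        return [c for c in cs if not labeled[c].any()]

    while len(free(clusters)) < n_query:
        clusters.sort(key=len)
        big = clusters.pop()            # sub-graph with the most nodes
        if len(big) < 2:                # nothing left to split
            clusters.append(big)
            break
        a, b = ncut_bipartition(W[np.ix_(big, big)])
        clusters += [big[a], big[b]]
    return free(clusters)


def select_batch(W, g, clusters, lam=0.5):
    """Pick, from each cluster, the image that maximises the score of Equ. 5."""
    picks = []
    for c in clusters:
        rep = W[np.ix_(c, c)].sum(axis=1)   # Rep(x_i): within-cluster similarity
        inf = np.abs(g[c])                  # Inf(x_i) = |g(x_i)|
        score = -lam * inf + (1.0 - lam) * rep
        picks.append(int(c[np.argmax(score)]))
    return picks
```

Here `W` would be built over the near-boundary images together with the labeled ones, `labeled` is a boolean mask over the same nodes, and `g` holds the SVM outputs. Because Inf and Rep live on different scales, a practical system might normalise them before combining; the text does not say whether this was done.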
3 Exploiting Unlabeled Images
In each feedback round, another learner is constructed from all images labeled so far, and this learner then makes predictions for the unlabeled images. After that, the few images that this learner considers most negative are used as negative instances to augment the training set of the SVM. To meet the real-time requirement of image retrieval, a very simple model, namely a weighted Euclidean distance model, is used as the learner:

$$f(x) = d(x, q) = \left( \sum_{i=1}^{d} (x_i - q_i)^2 \, w_i \right)^{1/2} \qquad (6)$$

where $x$ and $q$ denote respectively the image to be estimated and the new query point, both represented by $d$-dimensional feature vectors, and $w_i$ is the weight associated with each feature element.
Assume $P$ and $N$ denote respectively the sets of labeled positive and negative examples. The new query point can be obtained by:

$$q = \frac{1}{|P|} \sum_{x_k \in P} x_k - \frac{1}{|N|} \sum_{x_k \in N} x_k \qquad (7)$$
The $w_i$’s associated with the feature vector reflect the different contributions of the feature components. As in [3], a standard-deviation-based weight calculation is used in this paper. Given all the values of one feature component in the positive examples, the inverse of their standard deviation can be used as a good estimate of the weight: the smaller the variance, the larger the weight, and vice versa. Since the above learner is not strong, the labels it assigns to the unlabeled examples may be incorrect. To alleviate this problem, a conservative strategy is used: only a few of the most confident negative examples are used to enlarge the SVM’s training pool. More specifically, the examples with the largest distances in Equ. 6 are selected. For a specific query, most of the images in the database are irrelevant while only a small number are relevant; as a result, the prediction of the most negative examples should be more reliable. In addition, another conservative measure is employed to further improve reliability: the negative images labeled by this learner are only temporarily used by the SVM. In the next feedback round, a new learner is constructed from the enlarged labeled data, and a different set of negative examples is selected.
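The auxiliary learner of this section can be sketched in a few lines. The code below is an illustrative reading of Equ. 6 and Equ. 7, not the authors' code: the function name `confident_negatives` and the small constant `eps` (which guards against a zero standard deviation, e.g. when only one positive example exists) are assumptions introduced here.

```python
import numpy as np


def confident_negatives(X, pos_idx, neg_idx, n_unlabeled=10, eps=1e-8):
    """Return indices of the unlabeled images farthest from the query point."""
    P, N = X[pos_idx], X[neg_idx]
    # Equ. 7: mean of the positives minus mean of the negatives
    # (the negative term is dropped when no negatives are labeled yet).
    q = P.mean(axis=0) - (N.mean(axis=0) if len(neg_idx) else 0.0)
    # Inverse standard deviation of each feature over the positive examples;
    # eps is an assumption that avoids division by zero.
    w = 1.0 / (P.std(axis=0) + eps)
    # Equ. 6: weighted Euclidean distance of every image to q.
    dist = np.sqrt((((X - q) ** 2) * w).sum(axis=1))
    labeled = list(pos_idx) + list(neg_idx)
    dist[labeled] = -np.inf                      # never return labeled images
    k = min(n_unlabeled, len(X) - len(labeled))
    return np.argsort(-dist)[:k]                 # largest distances first
```

The returned indices would be added to the SVM training pool as temporary negatives for the current round only and recomputed in the next round, as described above.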
4 Experiments
In this section, the effectiveness of the proposed algorithm is evaluated through device part retrieval from assembly drawings. In order to search for assembly drawings that contain device parts similar to the query image, all device parts contained in the assembly drawings are first extracted; the search problem is thus converted into an ordinary image retrieval problem. In our application, all assembly drawings are grayscale images obtained by scanning paper hardcopies. The test dataset is composed of 14,770 device part images extracted from real assembly drawings. For testing purposes, 1,939 commonly used device parts are manually selected and classified into 8 categories. In the experiments, only these selected images are used in turn as queries, and their category labels serve as the ground truth: the images in the same category as the query are treated as positive images. A 259-D feature space is used, consisting of 23 Zernike moments, 128 grid-based features, and a 108-bin edge direction histogram. The grid-based feature is obtained by first mapping the binarized object onto a polar grid of 32x8 cells, then counting the number of pixels in each cell to generate a 256-D vector. To achieve rotation invariance, the Fourier transform is applied to this vector and its 128 magnitude values are used as the final feature.
Fig. 2. Comparison with and without unlabeled data. [Plots: BEP versus iteration (0 to 5) at λ=0.5, one panel for L*=3 and one for L*=6, each with curves Unlab=10 and Unlab=0.]

Fig. 3. Comparison of our algorithm and angle-diversity strategy, which are denoted as ’clu’ and ’ang’ respectively in above figure. [Plot: BEP versus iteration (0 to 5); curves for L*=3, 6, 9 with clu (λ=0.5) and ang (λ=0.8).]
The edge direction histogram is constructed in a 2D space: for each pair of edge pixels, the absolute direction difference and the spatial distance are quantized into 18 and 6 bins respectively, yielding a 108-dimensional feature. To present the experimental results clearly, we use the precision-recall break-even point (BEP) instead of the precision-recall curve as the measure of retrieval performance. BEP is defined as the point where precision and recall are equal. First, the effect of the unlabeled data on retrieval performance is analyzed. This comprises two sets of tests: one with 10 unlabeled images and one without unlabeled images. In the latter test, however, unlabeled images are still used in the 0th round so that the SVM learner can be constructed with both positive and negative examples. Both tests are carried out with λ in Equ. 5 set to 0.5 and the number of feedback images (L∗) set to 3 and 6. In addition, the 400 images nearest to the current classification boundary are used in Ncut clustering. The resulting BEP curves are depicted in Fig. 2. They show that unlabeled data improves retrieval, and that a marked improvement is achieved in the first feedback round, especially with fewer feedback images. Next, we compare the proposed algorithm with the angle-diversity method[9], which was verified as an improvement over Tong’s SVMActive solution[6]. To make
the comparison fair, 10 unlabeled images are also used in the angle-diversity strategy; they are obtained with the same process as introduced in section 3. This experiment is performed for L∗ = 3, 6, and 9. For each L∗, different λ values are tested, and the best result is chosen. Fig. 3 illustrates the experimental results, where the parameter λ in the two algorithms is set to 0.5 and 0.8 respectively. Our method is consistently superior to the angle-diversity strategy in these experiments. Moreover, the larger improvement is obtained by our algorithm in the first iteration, which is exactly what is expected of relevance feedback techniques.
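The break-even point used as the performance measure above can be computed from a ranked list without tracing the whole precision-recall curve. The sketch below relies on a standard observation: when the ranking is cut at R, the number of relevant images, precision and recall are both (hits in the top R)/R. The function name `break_even_point` is hypothetical, and how the authors handled ties or averaged over queries is not stated in the text.

```python
import numpy as np


def break_even_point(scores, relevant):
    """Precision (= recall) when the ranking is cut at R = number of relevant items."""
    relevant = np.asarray(relevant, dtype=bool)
    r = int(relevant.sum())
    if r == 0:
        return 0.0                                # no relevant items: define BEP as 0
    top = np.argsort(-np.asarray(scores, dtype=float))[:r]   # R best-scored items
    return float(relevant[top].mean())
```

For example, with scores `[0.9, 0.8, 0.7, 0.6]` and relevance flags `[1, 0, 1, 0]`, two images are relevant, one of them is in the top two, and the function returns 0.5.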
5 Conclusion
This paper presents an SVM-based active relevance feedback technique for image retrieval. To solve the small sample size problem, two strategies are adopted: (1) a new active selection strategy is designed, which can be viewed as an improvement of Tong’s SVMActive simple algorithm obtained by introducing the representative criterion. With this criterion, the redundancy between selected images is reduced, so more information gain is achieved; (2) unlabeled data are adopted within the co-training framework.
References
1. Zhou, X.S., Huang, T.S.: Relevance feedback in image retrieval: a comprehensive review. Multimedia Systems 8(6), 536–544 (2003)
2. Smeulders, A., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. on PAMI 22(12), 1349–1380 (2000)
3. Rui, Y., Huang, T.S., Ortega, M., Mehrotra, S.: Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Trans. on Circuits and Video Technology 8(5), 644–655 (1998)
4. Tao, D., Tang, X., Li, X., Rui, Y.: Direct kernel biased discriminant analysis: a new content-based image retrieval relevance feedback algorithm. IEEE Transactions on Multimedia 8(4), 716–726 (2006)
5. Wu, Y., Tian, Q., Huang, T.S.: Discriminant-EM algorithm with application to image retrieval. In: Proc. IEEE Int’l Conf. CVPR, pp. 222–227 (2000)
6. Tong, S., Chang, E.: Support vector machine active learning for image retrieval. In: Proc. ACM Multimedia (2001)
7. Tao, D., Tang, X.: Random sampling based SVM for relevance feedback image retrieval. In: Proc. IEEE Int’l Conf. on CVPR, pp. 647–652 (2004)
8. Cox, I.J., Miller, M.L., Minka, T.P., Papathomas, T.V., Yianilos, P.N.: The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments. IEEE Trans. on Image Processing 9(1), 20–37 (2000)
9. Brinker, K.: Incorporating diversity in active learning with support vector machines. In: Proc. of the 20th Int’l Conf. on Machine Learning (ICML), pp. 59–66 (2003)
10. Zhou, Z.H.: Enhancing relevance feedback in image retrieval using unlabeled data. ACM Trans. on Information Systems 24(2), 219–244 (2006)
11. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. on PAMI 22(8), 888–905 (2000)
Hierarchical Classifiers for Detection of Fractures in X-Ray Images

Joshua Congfu He¹, Wee Kheng Leow¹, and Tet Sen Howe²

¹ Dept. of Computer Science, National University of Singapore
3 Science Drive 2, Singapore 117543
{hecongfu, leowwk}@comp.nus.edu.sg
² Dept. of Orthopaedics, Singapore General Hospital
Outram Road, Singapore 169608
[email protected]
Abstract. Fracture of the bone is a very serious medical condition. In clinical practice, a tired radiologist has been found to miss fracture cases after looking through many images containing healthy bones. Computer detection of fractures can assist the doctors by flagging suspicious cases for closer examinations and thus improve the timeliness and accuracy of their diagnosis. This paper presents a new divide-and-conquer approach for fracture detection by partitioning the problem into smaller sub-problems in SVM’s kernel space, and training an SVM to specialize in solving each sub-problem. As the sub-problems are easier to solve than the whole problem, a hierarchy of SVMs performs better than an individual SVM that solves the whole problem. Compared to existing methods, this approach enhances the accuracy and reliability of SVMs.
1 Introduction
Fracture of the bone is a very serious medical condition. According to the International Osteoporosis Foundation [1], 1 in 3 women and 1 in 5 men above age 50 may experience osteoporotic fractures. 30–50% of women and 15–30% of men may suffer osteoporotic fractures in their lifetime. In particular, worldwide incidence of hip fractures can rise from 1.6 million to between 4.5 and 6.3 million by 2050, with more than 50% of all osteoporotic hip fractures occurring in Asia. In clinical practice, a tired radiologist has been found to miss fracture cases after looking through many images containing healthy bones. Computer detection of fractures can assist the doctors by flagging suspicious cases for closer examination and directing the doctors’ attention to them. It can thus improve the timeliness and accuracy of their diagnosis. Computer detection of fractures in x-ray images is a difficult and challenging problem. The femur can fracture in many ways with varying degrees of severity. While severe fractures cause drastic change to the shape of the femur, mild fractures do not change the femur’s shape and leave only very subtle signs in the
This research is supported by NMRC/0482/2000.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 962–969, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Healthy and fractured femurs. (a–c) Healthy femurs can have different appearances due to patients’ standing postures. (d) Severe fracture changes the femurs’ shape. (e, f) Mild fractures leave the femurs’ shape unchanged. Arrows indicate fractures.
x-ray images (Fig. 1). There is no single characteristic that can describe all kinds of fractures. X-ray images of healthy femurs also exhibit a significant amount of variation, primarily due to the patients’ standing postures when the images are taken (Fig. 1). The unbalanced data problem, i.e., a large difference in the proportions of healthy and fractured samples, further compounds the problem’s difficulty. In a consecutive set of x-ray images of femurs (i.e., consecutive in the time that the patients took their x-ray images) that we collected from a local public hospital, only about 12% are fractured. When the training set is small, the difficulty becomes even more severe because there may not be enough samples to capture the whole range of possible variations. This paper presents a new approach for fracture detection by partitioning the classification problem into smaller sub-problems in the SVM’s kernel space. A hierarchy of SVMs is trained so that each SVM specializes in solving a sub-problem, which is easier to solve than the whole problem. Thus, the hierarchy of SVMs performs better than a single SVM solving the whole problem.
2 Related Work
Tian et al. [2] published the first research work on the detection of fractures in x-ray images by computing the angle between the neck axis and shaft axis. Subsequently, Lim et al. [3], Yap et al. [4], and Lum et al. [5] reported methods
that detect femur fractures based on Gabor, Markov Random Field, and gradient intensity features extracted from the x-ray images. Three SVMs were trained to classify the samples, each based on a different feature type. The individual SVMs’ performance was not very good. By combining the decisions of the three SVMs, the overall accuracy and sensitivity (i.e., fracture detection rate) were improved. Combination of SVMs is a standard way to obtain a multi-class SVM from binary (two-class) SVMs [6,7,8]. Typically, each constituent SVM is trained to solve a one-vs-one problem, and they are combined using either a tree, a graph, a voting scheme, or other methods. Our method differs from these SVM combination methods in two important ways. Instead of partitioning a k-class problem into many one-vs-one sub-problems, our method partitions a binary (healthy-vs-fractured) problem into several smaller 3-class (healthy, fractured, unknown) problems such that each is handled by an SVM. Moreover, the partitioning is performed based on estimates of the reliability of the SVMs.
3 Hierarchical SVMs
The guiding principle of our approach is divide and conquer, or division of labor. A hierarchy of complementary SVMs is trained so that each tackles a different sub-problem of the whole fracture detection problem. A well-known divide-and-conquer approach is to first cluster the input samples so that the samples in each cluster are more consistent with each other [9]. Then, a set of classifiers is trained, each classifying only the samples in a different cluster. This approach is effective when the training set is large. The more complex the problem, the more clusters are needed to achieve good performance, and the larger the training set needs to be. In our investigation of this approach, we found that a large number of clusters is required to capture the large variations of both healthy and fractured samples. As a result, there are not enough training samples in each cluster to train a classifier. Instead of partitioning the problem in the feature space, our approach partitions it in the SVM’s kernel space. This approach has two advantages. First, it is easier to separate the healthy and fractured samples in the SVM’s high-dimensional kernel space. Second, the partitioning performed by the SVM is optimal.

3.1 Training Algorithm
The training of the hierarchical SVMs is guided by three principles:
1. Samples that can be reliably classified by a higher-level SVM should be handled by it.
2. Samples that cannot be reliably classified by a higher-level SVM should be passed to its child, a lower-level SVM.
3. The performance of a lower-level SVM on the samples passed to it should be better than the performance of its parent on these samples.
The training algorithm begins with the top-level SVM S, which is given two data sets: the training set T and the validation set V. Its main stages are as follows:
1. Train SVM $S$ on training set $T$.
2. Run $S$ on validation set $V$ to obtain the classification error rate $E(S, V)$.
3. Based on $E(S, V)$, select a subset $V'$ of $V$.
4. Create a new SVM $S'$ at the next level.
5. Find a subset $T'$ of $T$ that can be used to train $S'$ to achieve the performance $E(S', V') < E(S, V')$ and $E(S', F') \le E(S, F')$, where $F'$ is the subset of fractured samples in $V'$. That is, $1 - E(S', F')$ is the sensitivity of $S'$ on $V'$.
6. If $S'$ cannot achieve the above performance, then stop.
7. Otherwise, set $S \leftarrow S'$, $T \leftarrow T'$, $V \leftarrow V'$, and continue at Step 3.
This algorithm uses probabilistic SVMs, such as Gini-SVMs [10], that produce classification results and probability estimates $p$. The value of $p$ ranges from 0 to 1, with 0.5 corresponding to ambiguous cases located on the decision surface in the kernel space. We shall call the side of the decision surface with $p > 0.5$ the positive side, and the side with $p < 0.5$ the negative side. After running a trained SVM on a sample set, each sample $v$ is assigned a probability estimate $p(v)$, and the samples can be sorted in increasing order of $p(v)$. There are two critical stages in the algorithm, Stages 3 and 5, which select the subsets $V'$ and $T'$ to channel to the SVM $S'$ at the next level. In essence, $V'$ defines the sub-problem that $S'$ needs to solve, and $T'$ is the training set that can train $S'$ to solve the sub-problem. These stages are discussed in more detail in the following sections.

3.2 Selection of Validation Subset
Stage 3 of the training algorithm embodies the first two principles outlined in Section 3.1. It determines a subset $V'$ of $V$ that the SVM $S$ cannot reliably classify. This subset $V'$ is channeled to another SVM at the next level to achieve division of labor. Classification reliability is estimated based on two quantities: (1) the classification error rate $E(S, V)$ and (2) the probability estimates $p(v)$ assigned to samples $v$ in $V$ by $S$. Let us compute the cumulative error rate $c^+(p)$ from $p = 1.0$ towards 0.5 for the samples with $p(v) > 0.5$, and $c^-(p)$ from $p = 0$ towards 0.5 for the samples with $p(v) < 0.5$ (Fig. 2(b)). Then, the samples in the range $p^+$ to $p = 1$, where $c^+(p^+) < E(S, V)$, have an estimated error less than $E(S, V)$; similarly for the samples in the range $p = 0$ to $p^-$, where $c^-(p^-) < E(S, V)$. That is, samples in the tail regions can be reliably classified. Therefore, samples in the middle range from $p^-$ to $p^+$ should be selected to form the subset $V'$. In the current implementation, $c^+(p^+) = c^-(p^-) = \varepsilon$, where $\varepsilon$ is called the error tolerance.

3.3 Selection of Training Subset
Stage 5 of the training algorithm embodies the third principle outlined in Section 3.1. Selection of an appropriate training subset $T'$ is tricky because SVMs are very strong classifiers. Their accuracy on the training set is often close or equal to 100%. If the method for selecting the validation subset is applied directly to the
selection of the training subset, very few training samples will be selected, and they are not enough for training the lower-level SVM to achieve high performance. On the other hand, if the selection criterion is too loose and almost all the training samples are channeled to the lower-level SVM, then it would be solving the same problem as its parent and there would be no division of labor. So, the goal is to find a subset that is neither too large nor too small. Let $q(u)$ denote the probability estimate of sample $u$ in $T$. Our method searches for the appropriate $T'$ iteratively:

For $q^-$ from 0 to 0.5 in increments of $\Delta q^-$,
    For $q^+$ from 1 to 0.5 in decrements of $\Delta q^+$,
        Set $T'$ to contain all training samples between $q^-$ and $q^+$.
        Train $S'$ on $T'$ and then test $S'$ on $V'$.
        If $E(S', V') < E(S, V')$ and $E(S', F') \le E(S, F')$, then found $T'$ and return with success.
Cannot find desired $T'$. Return with failure.

The increment $\Delta q^-$ and decrement $\Delta q^+$ are set as fixed proportions of the standard deviations of the distributions of training samples $T^+$ and $T^-$ on the positive and, respectively, negative side of the decision surface (Fig. 2(a)).

3.4 Testing Algorithm
The same testing algorithm is applied to the validation set during training and to the testing set at system test. Given a sample v, the following algorithm is applied:

1. For each SVM S from top to bottom of the hierarchy,
   (a) Run SVM S on sample v to compute the probability estimate p(v).
   (b) If p(v) > p+, then classify v as healthy and stop.
   (c) If p(v) < p−, then classify v as fractured and stop.
   (d) Otherwise, pass v to the child of S.
2. If with rejection, then classify v as unknown and stop.
3. (Without rejection) If p(v) > 0.5, then classify v as healthy and stop.
4. Otherwise, classify v as fractured and stop.
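The testing steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `classify`, the `Level` tuple and the use of plain callables in place of trained SVMs are assumptions made for the example.

```python
from typing import Callable, List, Tuple

# One hierarchy level: (probability estimator, p_minus, p_plus).
# The estimator stands in for a trained SVM (hypothetical stand-in).
Level = Tuple[Callable[[float], float], float, float]


def classify(sample: float, levels: List[Level], with_rejection: bool) -> str:
    """Pass a sample down the hierarchy, top level first."""
    if not levels:
        raise ValueError("the hierarchy needs at least one level")
    p = 0.5
    for estimate, p_minus, p_plus in levels:
        p = estimate(sample)      # step 1(a)
        if p > p_plus:            # step 1(b)
            return "healthy"
        if p < p_minus:           # step 1(c)
            return "fractured"
        # step 1(d): fall through to the child level
    if with_rejection:            # step 2
        return "unknown"
    # steps 3 and 4: use the estimate of the last SVM visited
    return "healthy" if p > 0.5 else "fractured"
```

With two toy levels such as `[(lambda v: v, 0.45, 0.55), (lambda v: v, 0.48, 0.52)]`, a sample with estimate 0.51 is left undecided by both levels, so it is reported as unknown with rejection and as healthy without rejection.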
4 Experiments and Discussions
420 consecutive femur images were collected from a local public hospital. They were divided randomly into 200 training, 160 validation, and 60 testing samples. The percentages of fractured samples were kept roughly the same for all three sets at about 12%. Gabor and intensity gradient (IG) features as described in [3,5,4] were extracted as the features for classification. The following SVM configurations were trained and tested for comparison:

– SVM: Single SVM trained on the training set and tested on the testing set.
– SVM+: Single SVM trained on the combined training and validation set, and tested on the testing set.
– H-SVM: Hierarchical SVM without rejection.
– H-SVM−: Hierarchical SVM with rejection.
[Figure 2 plots. (a) Count versus probability estimate, with curves for Level 1 and Level 2. (b) Cumulative error versus probability estimate, with curves for Level 1 and Level 2.]
Fig. 2. Performance of SVMs in first two levels of the hierarchy. (a) Distributions of training samples over probability estimates. (b) Cumulative errors on validation sets.
Each of the above configurations was trained to classify Gabor and IG features separately. Figure 2 shows the internal working of H-SVM. The curves on the left and right sides of Fig. 2(a) show the distributions of samples on the corresponding sides of the SVMs' decision surfaces. They show that the fracture detection problem is very difficult because all the samples fall within a narrow range of p = 0.4 to 0.55, and most of them are on the positive side. Fig. 2(b) shows that the cumulative error at the second level has a narrower spread than that at the first level. That is, errors of the level-2 SVM are concentrated within a narrower range, indicating that it solves the sub-problem better than its parent.

Figure 3 shows the results of testing H-SVM on the validation and testing sets for both feature types. With a very small error tolerance ε, no training and validation samples can be passed down to the lower-level SVMs, reducing H-SVM to a single SVM. With a very large ε, most, if not all, of the training and validation samples are passed down to the lower-level SVMs, defeating the divide-and-conquer strategy. With an appropriate ε, there is a good division of labor. The trends of the error curves for the validation set and the testing set are similar, although their minima may not occur at the same ε. In actual application, the error tolerance is selected as the ε that maximizes accuracy (i.e., 1 − error rate) and sensitivity on the validation set.

The trained H-SVMs have a hierarchy of three levels for Gabor and four levels for IG (Table 1). Level-1 SVMs classify more than 70% of the testing samples and pass the remaining samples to the lower-level SVMs. They classify most of the healthy samples and sizable amounts of fractured samples. Therefore, they achieve higher accuracy but lower sensitivity compared to lower-level SVMs, just like single SVMs (Table 2). The samples processed by lower-level SVMs are more balanced, but the discrimination of healthy and fractured cases is still relatively difficult. So, they achieve higher sensitivity at the expense of lower accuracy.
[Figure 3 plots. (a) and (b): error rate versus error tolerance ε, with curves for the validation and test sets.]
Fig. 3. Test results on different feature types. (a) Gabor. (b) Intensity gradient (IG).

Table 1. Performance of SVMs in the hierarchy. IG: Intensity gradient. accu: accuracy, sens: sensitivity, %class: percentage of testing samples classified by the SVMs in the respective levels, %frac: percentage of fractured testing samples classified.

         Gabor                               IG
level    accu     sens     %class   %frac    accu     sens     %class   %frac
1        95.45%   50.00%   73.33%   28.57%   96.15%   66.67%   86.66%   42.86%
2        90.00%   0.00%    16.67%   14.29%   100%     100%     1.67%    14.29%
3        66.67%   75.00%   10.00%   57.14%   75.00%   100%     6.67%    28.57%
4        -        -        -        -        66.67%   100%     5.00%    14.29%
overall  91.67%   57.14%                     93.33%   85.71%
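As a consistency check on Table 1, the overall accuracy can be recovered from the per-level figures. The sketch below assumes that the number of samples handled at each level, and the number classified correctly, can be inferred by rounding the reported percentages against the 60 testing samples; these counts are not stated in the text, and the function name `overall_accuracy` is chosen for illustration.

```python
N_TEST = 60  # number of testing samples, as stated in Section 4


def overall_accuracy(levels):
    """levels: list of (accuracy %, %class) pairs, one per hierarchy level."""
    correct = 0
    for accu, pct_class in levels:
        # Inferred count of testing samples classified at this level.
        n_level = round(N_TEST * pct_class / 100)
        # Inferred count of those samples that were classified correctly.
        correct += round(n_level * accu / 100)
    return 100 * correct / N_TEST


gabor = [(95.45, 73.33), (90.00, 16.67), (66.67, 10.00)]
ig = [(96.15, 86.66), (100.0, 1.67), (75.00, 6.67), (66.67, 5.00)]

print(round(overall_accuracy(gabor), 2))  # 91.67
print(round(overall_accuracy(ig), 2))     # 93.33
```

Under this reading, 55 of the 60 testing samples are correct for Gabor and 56 of 60 for IG, which agrees with the "overall" row of the table.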
Table 2 compares the performance of various SVM configurations on the testing set. The performance of SVM+ is better than that of SVM for IG but the converse is true for Gabor. This shows that having more training samples does not always improve the accuracy of a single SVM. In comparison, H-SVM can make good use of both the training and validation sets to achieve high performance. Its accuracy is as high as the larger of the accuracies of SVM and SVM+. Its sensitivity is also as high as those of SVM and SVM+ except for IG. By rejecting uncertain samples, H-SVM− achieves higher accuracy for Gabor and IG at the expense of lower sensitivity compared to H-SVM. The classification results based on Gabor and IG can be combined using a simple OR rule: classify a sample as fractured if it is classified as fractured using either Gabor or IG. For H-SVM−, the combined performance is better than that using a single feature. Moreover, its rejection rate drops to 0. With feature combination, H-SVM− achieves a significantly higher accuracy compared to SVM, SVM+, and H-SVM, and the same sensitivity as SVM and H-SVM. In summary, the test results show that the overall performance can be improved if an SVM can reject samples that it cannot classify reliably, and pass the samples to other SVMs to classify.
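The OR rule can be written down in a few lines. The text only specifies when a sample is called fractured; how a rejected ("unknown") decision from one feature type combines with a "healthy" decision from the other is not described, so the second branch below is an assumption made for this sketch, as is the function name `or_rule`.

```python
def or_rule(gabor: str, ig: str) -> str:
    """Combine two decisions, each 'healthy', 'fractured' or 'unknown'."""
    if "fractured" in (gabor, ig):
        return "fractured"  # OR rule as stated in the text
    if "healthy" in (gabor, ig):
        return "healthy"    # assumption: one confident 'healthy' is enough
    return "unknown"        # both classifiers rejected the sample
```

Under this assumption a sample stays rejected only when both feature types reject it, which is one way the combined rejection rate could fall below the per-feature rates.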
Table 2. Test results on different feature types. IG: intensity gradient. Last column is the rejection rate of H-SVM−. OR: Combine Gabor and IG results using OR rule.

                     SVM      SVM+     H-SVM    H-SVM−   reject
Gabor  accuracy      91.67%   90.00%   91.67%   94.45%   10.00%
       sensitivity   57.14%   57.14%   57.14%   33.33%
IG     accuracy      90.00%   93.33%   93.33%   94.83%   3.33%
       sensitivity   85.71%   100%     85.71%   83.33%
OR     accuracy      86.67%   88.33%   90.00%   93.33%   0.00%
       sensitivity   85.71%   100%     85.71%   85.71%

5 Conclusion
This paper presents a new divide-and-conquer approach for fracture detection by partitioning the problem in the kernel space of the SVM into smaller subproblems, and training an SVM to specialize in solving a sub-problem. Each subproblem is easier to solve than the whole problem. The training scheme ensures that lower-level SVMs always complement the performance of higher-level SVMs. As a result, the hierarchy of SVMs performs better than an individual SVM solving the whole problem. Compared to existing methods, this approach can enhance the accuracy and reliability of the SVMs.
References
1. IOF: Facts and statistics about osteoporosis and its impact. International Osteoporosis Foundation (2007), www.iofbonehealth.org/facts-and-statistics.html
2. Tian, T.P., Chen, Y., Leow, W.K., Hsu, W., Howe, T.S., Png, M.A.: Computing neck-shaft angle of femur for x-ray fracture detection. In: Petkov, N., Westenberg, M.A. (eds.) CAIP 2003. LNCS, vol. 2756, pp. 82–89. Springer, Heidelberg (2003)
3. Lim, S.E., Xing, Y., Chen, Y., Leow, W.K., Howe, T.S., Png, M.A.: Detection of femur and radius fractures in x-ray images. In: Proc. 2nd Int. Conf. on Advances in Medical Signal and Info. Proc. (2004)
4. Yap, W.H., Chen, Y., Leow, W.K., Howe, T.S., Png, M.A.: Detecting femur fractures by texture analysis of trabeculae. In: Proc. ICPR (2004)
5. Lum, V.L.F., Leow, W.K., Chen, Y., Howe, T.S., Png, M.A.: Combining classifiers for bone fracture detection in x-ray images. In: Proc. ICIP (2005)
6. Kreßel, U.: Pairwise classification and support vector machines. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods Support Vector Learning, pp. 255–268. MIT Press, Cambridge (1999)
7. Platt, J., Cristianini, N., Shawe-Taylor, J.: Large margin DAGs for multiclass classification. In: Solla, S.A., Leen, T.K., Muller, K.R. (eds.) Advances in Neural Information Processing Systems, MIT Press, Cambridge (2000)
8. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class support vector machines. IEEE Trans. on Neural Networks 13, 415–425 (2002)
9. Li, R., Leow, W.K.: From region features to semantic labels: A probabilistic approach. In: Proc Int. Conf. on Multimedia Modeling, pp. 402–420 (2003)
10. Chakrabartty, S., Cauwenberghs, G.: Gini-SVM. (bach.ece.jhu.edu/svm/ginisvm/)
An Efficient Nearest Neighbor Classifier Using an Adaptive Distance Measure

Omid Dehzangi, Mansoor J. Zolghadri, Shahram Taheri, and Abdollah Dehzangi

Department of Computer Science and Engineering, Eng. no. 2 Building, Molla Sadra St., Shiraz University, Shiraz, Iran
{Dehzangi, jzolghadri, taheri, A.dehzangi}@cse.shirazu.ac.ir
Abstract. The Nearest Neighbor (NN) rule is one of the simplest and most effective pattern classification algorithms. In the basic NN rule, all the instances in the training set are treated equally when finding the NN of an input test pattern. In the approach proposed in this article, a local weight is assigned to each training instance. The weights are then used in an adaptive distance metric to find the NN of a query pattern. To determine the weight of each training pattern, we propose a learning algorithm that attempts to minimize the number of misclassified patterns on the training data. To evaluate the performance of the proposed method, a number of UCI-ML data sets were used. The results show that the proposed method improves the generalization accuracy of the basic NN classifier. It is also shown that the proposed algorithm can be considered an effective instance reduction technique for the NN classifier.

Keywords: Pattern classification, nearest neighbor, adaptive distance metric, instance-weighting, data pruning.
1 Introduction

As a simple but powerful supervised learning algorithm, the nearest neighbor has been used successfully on pattern classification problems [1,2,3]. In the basic form of this non-parametric classification system, a set of training patterns are presented and stored in the memory. A new pattern X is then classified using a distance function to determine how close it is to each stored instance, and choose the class of the nearest stored training instance of X. The basic NN classifier stores all of the training instances to use during generalization. The distance between an input query pattern and all stored patterns should be computed to classify the pattern. Instance reduction techniques [4,5] aim at reducing the size of training set to improve the generalization speed and storage requirement of the basic NN classifier. One other reason to reduce the original data set is to improve the prediction accuracy of the basic NN classifier by removing noisy instances.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 970–978, 2007. © Springer-Verlag Berlin Heidelberg 2007
In the literature, many instance selection algorithms have been proposed. These algorithms treat all instances in the reduced subset the same while generalizing. In the scheme proposed in this paper, a weight is associated with each stored instance. In the generalization phase, the weight of each instance is multiplied by the value of similarity between the query and that instance to obtain a weighted similarity value. The NN of a query pattern is identified as the instance having the maximum value of weighted similarity. The overall scheme can be viewed as a form of locally adaptive distance metric.

The performance of NN depends crucially on the choice of distance metric. In the past, many methods have been developed to locally adapt the distance metric. Examples of these include the flexible metric method proposed by Friedman [6], the discriminant adaptive method by Hastie and Tibshirani [7], and the adaptive metric method by Domeniconi et al. [8]. These methods estimate feature relevance locally at each query pattern. This leads to a weighted metric for computing the similarity between the query patterns and training data. In [10], similar to our proposed method, a simple locally adaptive distance measure is proposed that assigns a weight to each training instance. The method we propose in this paper uses a locally adaptive metric to improve the performance of the basic NN classifier and, at the same time, reduces the size of the training set.

The rest of this paper is organized as follows. In Section 2, the adaptive weighted distance measure is described. In Section 3, some instance selection algorithms are presented. In Section 4, our proposed algorithm to learn the weight of each training instance is introduced. In Section 5, experimental results are shown. Section 6 concludes this paper.
2 Adaptive Distance Measure Using Weighted Training Instances

The nearest neighbor classifier assigns the label of a test pattern according to the class label of its nearest training instance. To introduce the weighted version of the nearest neighbor rule, the notation of the basic nearest neighbor rule is briefly described here. Assume that classification of patterns in an m-dimensional space is under investigation. Given a set of training instances {(X(i), C(k))}, where X(i), i = 1,…,n, are the training feature vectors and C(k), k = 1,…,C, are the class labels, the NN rule finds the nearest neighbor of a new test pattern X, denoted by X(w), using a distance function and assigns X to C(w) (the class label of the winner). Different distance functions are used to find the nearest neighbor of each test pattern. As a commonly used distance measure, the Euclidean distance has been used to measure the distance (i.e., dissimilarity) between X and Y [11] in many applications:

$$d(X, Y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \qquad (1)$$
As a first step, we convert the distance, a dissimilarity measure between the query pattern X and the jth instance X(j), into a similarity measure. This is done by a linear conversion as follows:
$$\mu(X, X^{(j)}) = 1 - \frac{d(X, X^{(j)})}{d_{\max}(X, X^{(j)})} \qquad (2)$$
where dmax(X,X(j)) is the maximum distance that can occur on the whole training set. As a result, instead of finding the pattern with minimum distance to X, we search for the instance X(j) for which μ(X,X(j)) is maximized. This can be interpreted as normalizing the distance to a real number in the interval [0,1], with smaller distances mapping to larger similarities. In the basic NN rule, the pattern Xp most similar to a query pattern X can be formally determined by:

$$p = \arg\max_{j} \{ \mu(X, X_j) \mid j = 1, \ldots, n \} \qquad (3)$$
The NN rule assumes that all the stored instances are equally reliable and uses equation (3) to find the NN of a query pattern. This paper is based on weighted training instances. A weight wk is assigned to each instance Xk, and the weights of the training instances are incorporated in the similarity metric to find the NN of a query pattern:

$$w = \arg\max_{j} \{ \mu(X, X_j) \cdot w_j \mid j = 1, \ldots, n \} \qquad (4)$$
where wj is the weight assigned to the jth instance by the learning algorithm introduced in Section 4.
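Equations (1), (2) and (4) can be put together in a short Python sketch. The function names are illustrative, and taking dmax as the largest pairwise distance among the training instances is an assumption about how "the maximum distance on the whole training set" is computed; the sketch also assumes dmax > 0.

```python
import math


def euclidean(x, y):
    """Euclidean distance of Eq. (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def max_training_distance(train):
    """Assumed d_max: largest pairwise distance among training instances."""
    return max(euclidean(a, b) for a in train for b in train)


def weighted_nn(query, train, labels, weights, d_max):
    """Label of the instance maximizing mu(X, X_j) * w_j, as in Eq. (4)."""
    def weighted_similarity(j):
        mu = 1.0 - euclidean(query, train[j]) / d_max  # Eq. (2)
        return mu * weights[j]
    winner = max(range(len(train)), key=weighted_similarity)
    return labels[winner]
```

For the toy training set `[(0, 0), (1, 0), (4, 0)]` with labels `["a", "b", "b"]`, the query `(0.4, 0)` is labeled "a" when all weights are 1, but "b" once the weight of the first instance is lowered to 0.5. With all weights equal to 1 the rule reduces to the basic NN rule of Eq. (3).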
3 Instance Selection Algorithms

In the field of automatic classification, where instance-based learning algorithms are used, many researchers have addressed the problem of training set reduction to reduce the computational cost, to improve classification performance, or both. According to their goals for reducing the training set, these algorithms may be divided into two main subgroups: noise filters and condensation methods.

3.1 Noise Filters
Noise filters aim at removing outliers and instances near the decision boundary. They remove noisy instances (i.e., those that do not agree with their neighbors). These methods often produce a smoother decision boundary by removing the instances close to the boundary. However, these algorithms do not remove internal points, which do not necessarily contribute to the decision boundary. The aim of noise filters is mostly to improve the performance of the classifier by removing noisy instances. One early method of this kind is the Edited Nearest Neighbor (ENN) algorithm developed by Wilson [12]. This algorithm finds a subset S of the training set T in the following way. Starting with S the same as T, each instance in S is removed if it does not agree with the majority of its k nearest neighbors. This algorithm removes noisy instances as well as instances near the boundary. Repeated ENN applies the ENN algorithm repeatedly until all remaining instances have a majority of their neighbors with the same class and no further change is made in S. The effect of this is to widen the gap between classes and make the decision boundary smoother.
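A single ENN pass, as described above, can be sketched as follows. The sketch assumes Euclidean distance, takes the k neighbors of each instance from the full training set T, and applies all removals at the end of the pass; ties in the vote are broken by whichever class `Counter` lists first. None of these details is fixed by the description above, and the function name is illustrative.

```python
import math
from collections import Counter


def edited_nearest_neighbor(points, labels, k=3):
    """Return the indices kept by one ENN pass over (points, labels)."""
    kept = []
    for i, p in enumerate(points):
        # Rank every other instance by its distance to instance i.
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        votes = Counter(labels[j] for j in others[:k])
        majority = votes.most_common(1)[0][0]
        if majority == labels[i]:   # instance agrees with its neighbors
            kept.append(i)
    return kept
```

Repeated ENN would call this function again on the kept instances until the returned index list stops shrinking.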
The All k-NN method proposed by Tomek [13] is an extension of the ENN method. ENN is repeated for all k (k = 1, …, m), but any instance not classified correctly by its k nearest neighbors is only flagged as "bad"; after completing the loop all m times, all flagged instances are removed.

3.2 Condensation Methods
On the other hand, the aim of condensation methods is to remove those instances that do not affect the decision boundary (i.e., mostly internal instances). The intuition behind them is that internal instances do not affect the decision boundary as much as border instances. Therefore, they try to find a significantly reduced subset by removing internal instances from the full training set. The aim of these methods is mostly to improve the efficiency of instance-based learning algorithms which use original instances from the training set as exemplars. These methods are mostly used to improve the speed and memory requirements of k-nearest neighbor (k-NN) classifiers. During generalization, k-NN computes the distance between the input test pattern and each stored instance to predict the output class of the input pattern. In this case, condensation methods try to find a significantly reduced set of instances whose k-NN classification accuracy is as close as possible to that obtained using all training instances.

The Condensed Nearest Neighbor (CNN) rule introduced by Hart [14] was one of the first reduction techniques. This algorithm begins by randomly selecting one instance per class from the training set T and putting them in a new data set S. Each instance in T which is misclassified is added to S, thus ensuring that it will be classified correctly. This method is sensitive to noise, and the order of instance presentation affects the classification rate. The Reduced Nearest Neighbor (RNN) rule introduced by Gates [15] starts with S equal to T and removes only those instances whose removal does not decrease accuracy.

Wilson and Martinez [16] have provided a survey of algorithms used to reduce storage requirements in instance-based learning algorithms. They also proposed several reduction algorithms, DROP1-5. To describe their proposed methods, let us define the associates of an instance P as the set of instances for which P is one of the k nearest neighbors (those instances depend on P). DROP1 starts with S = T and removes instance P if at least as many of its associates in S would be classified correctly without P (accuracy is checked on S). DROP2 is similar to DROP1, except that the accuracy is checked on T instead of S, and S is sorted in an attempt to remove center points before border points. DROP3 uses the DROP2 rule and additionally uses the ENN algorithm as a preprocessing step for noise removal. DROP4 changes the noise removal step of DROP3 to the following rule: remove each instance only if it is misclassified by its k nearest neighbors and its removal does not hurt the classification of other instances. This new rule is more conservative than the previous one and ensures that not too many instances are removed in the noise-reduction pass. DROP5 changes the order of DROP2, removing from border points to center points, and then iteratively applies the DROP2 rule on the remaining set as long as any further improvement is being made. These methods showed good results in [16]. In this work, we use the results obtained by these methods as a comparison for our reduction ability.
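The CNN rule described above can be sketched as follows. The sketch assumes Euclidean distance and a 1-NN classifier over S, and it repeats passes over T until a full pass adds nothing, which is the usual way the rule is run but is not spelled out in the description above. The `seed` parameter and the function name are illustrative.

```python
import math
import random


def condensed_nearest_neighbor(points, labels, seed=0):
    """Return the indices of a condensed subset S of the training set T."""
    rng = random.Random(seed)
    store = []
    # Start with one randomly chosen instance per class.
    for c in dict.fromkeys(labels):
        members = [i for i, lab in enumerate(labels) if lab == c]
        store.append(rng.choice(members))
    changed = True
    while changed:
        changed = False
        for i, p in enumerate(points):
            if i in store:
                continue
            nearest = min(store, key=lambda j: math.dist(p, points[j]))
            if labels[nearest] != labels[i]:
                store.append(i)   # misclassified by S, so absorb it
                changed = True
    return sorted(store)
```

Because instances are absorbed in presentation order, shuffling T before the call can change which instances end up in S, which illustrates the order sensitivity mentioned above.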
4 Instance-Weight Learning

In this section, an iterative learning scheme is proposed that attempts to specify the weight of each instance in the training set such that the number of misclassified training patterns is minimized. The basic element of this scheme is an algorithm that finds the optimal weight of each instance, assuming that the weights of all other instances in the training set are given and fixed. For an M-class problem, assume that n labeled instances {Xi, i = 1, 2, …, n} from different classes are available. Initially, all instances are assumed to have a weight of one {wi = 1.0, i = 1, 2, ..., n}. The weight assigned to a training pattern can only affect the classification of other instances in the training set. The optimal weight wk of instance Xk from class T (given the weights of all other patterns) is a real number in the interval [0, ∞) that minimizes the number of misclassified training data. For this purpose, in the first step, the instance is removed from the training set (i.e., wk = 0.0). In the second step, training patterns of class T that are classified correctly are removed from the training set. Notice that these patterns will be classified correctly regardless of the value of wk. In the third step, training patterns of other classes that are misclassified are also removed. Notice that these patterns will be misclassified regardless of the value of wk. At this stage, representing the current training set by Γ, the following measure is used to calculate the score S(Xt) of each training pattern Xt ∈ Γ left in the training set:
$$S(X_t) = \frac{\max_{j} \{ \mu(X_t, X_j) \cdot w_j \mid X_j \in \Gamma,\ j \neq t \}}{\mu(X_t, X_k)} \qquad (5)$$
where μ(Xi,Xj) represents the similarity of patterns Xi and Xj calculated using (2). The algorithm given in Table 1 can now be used to find the best threshold (i.e., weight) for pattern Xk. The purpose of steps 1-3 in the above procedure is to remove those instances whose classification does not depend on the value of wk. In other words, these patterns will be either classified correctly or misclassified regardless of the value of wk. The algorithm of Table 1 sorts the patterns in ascending order of their scores. Assuming that a set of m patterns (and their scores) is passed to this algorithm, a maximum of m+1 thresholds are examined to find the best threshold. Notice that for each specified threshold t, all training patterns X with S(X) < t are classified as class T. In this way, for each specific threshold, the number of misclassified patterns can be easily calculated. The best threshold (i.e., weight) is simply the one that minimizes the number of misclassified training data. The resulting threshold is taken as the weight of instance Xk and is used in the adaptive distance measure to find the nearest neighbor in (4). The algorithm in Table 1 finds the best threshold by maximizing the accuracy, which is determined from the TP (true positive) and FP (false positive) values defined as follows:

TP: number of patterns of class T which are classified correctly.
FP: number of patterns not of class T which are misclassified.
Table 1. Algorithm for finding the best threshold

Inputs: patterns Xt, scores S(Xt)
Output: the value of the best threshold (best-th)

current = number of misclassified patterns corresponding to the threshold th = 0
          (i.e., classifying no pattern as class T)
optimum = current
best-th = 0
rank the patterns in ascending order of their scores
{assume that Xk and Xk+1 are two successive patterns in the list}
for each different threshold th = (Score(Xk) + Score(Xk+1))/2
    current = number of misclassified patterns corresponding to the specified threshold
              (i.e., all patterns Xt having Score(Xt) < th are classified as class T)
    if current < optimum then
        optimum = current
        best-th = th
    end if
end for
{assume that last is the score of the last pattern in the list and τ is a small positive number}
current = number of misclassified patterns corresponding to th = last + τ
          (i.e., classifying everything as class T)
if current < optimum then
    optimum = current
    best-th = th
end if
return best-th
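The threshold search of Table 1 can be sketched in Python as below. The error count rests on one reading of steps 1-3 of Section 4: after those steps, Γ holds class-T patterns that are misclassified without Xk and patterns of other classes that are classified correctly without Xk, so a class-T pattern is an error when its score is not below the threshold and a pattern of another class is an error when its score is below it. The input format, the default value of τ and the function name are assumptions made for the example.

```python
def best_threshold(scored, tau=1e-6):
    """scored: list of (score, is_class_T) pairs for the patterns in Gamma."""
    def errors(th):
        # Patterns with score < th are assigned to class T.
        return sum((s >= th) if is_t else (s < th) for s, is_t in scored)

    scores = sorted({s for s, _ in scored})
    best_th, optimum = 0.0, errors(0.0)   # th = 0 is the starting point
    # Midpoints between successive distinct scores, then one value past the last.
    candidates = [(a + b) / 2 for a, b in zip(scores, scores[1:])]
    if scores:
        candidates.append(scores[-1] + tau)
    for th in candidates:
        current = errors(th)
        if current < optimum:
            optimum, best_th = current, th
    return best_th
```

For example, with scores 0.5 (class T), 0.8 (class T), 1.2 (other class) and 1.5 (class T), the search settles on the midpoint between 0.8 and 1.2, which leaves one pattern misclassified. When Γ contains only patterns of other classes, no threshold improves on th = 0 and the function returns 0.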
The accuracy is calculated as follows:

$$\text{Accuracy} = TP - FP \qquad (6)$$
The above procedure calculates the optimal weight of the instance Xk assuming that the weights of all other instances are given and fixed. The search for the best combination of instance weights is conducted by optimizing each instance weight in turn, assuming that the order of the instances to be optimized is fixed. As the weight assigned to each instance can only affect the classification of other training instances, the algorithm of Table 1 sets the weight of an instance to zero when no threshold can be found that improves the classification of other instances. In real data sets, the weights of the vast majority of training instances will be set to zero by the learning algorithm. Since the instances having zero weights are not used in the generalization phase, the proposed algorithm can also be considered an effective instance reduction technique for the NN classifier.
5 Experimental Results

In order to assess the performance of the proposed method, we used 10 data sets available from UCI-ML repository. Table 2 gives some statistics of the data sets we used in the experiments. Our objective in the first experiment is to investigate the effect of instance-weight learning algorithm of section 4 on the generalization ability
of NN classifier. Ten-fold cross-validation (10-CV) was used in our experiment. Each data set was divided into 10 partitions, and 9 partitions (i.e., 90% of the data) were used as the training set. The algorithm of Section 4 was then used to specify the weight of each instance in the training set. The remaining partition (i.e., the other 10% of the data) was classified using the weighted instances. This procedure was repeated until all of the 10 partitions were used in the test phase. The whole 10-CV procedure was repeated 10 times with different initial partitioning of the data set.

Table 2. Some statistics of the data sets used in our experiments

Data set             # of Attributes   # of patterns   # of Classes
Wine                 13                178             3
Glass                9                 214             6
Sonar                60                208             2
Image Segmentation   15                210             7
Cancer Wisconsin     9                 683             2
Bupa                 6                 345             2
Pima Diabetes        8                 768             2
Thyroid              5                 215             3
Iris                 4                 150             3
Balance              4                 625             3
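The evaluation protocol described above, ten repetitions of ten-fold cross-validation, can be sketched as an index generator. The interleaved fold assignment, the `seed` parameter and the function name are assumptions made for illustration; the text does not say how the partitions were drawn or whether they were stratified by class.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repetition."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)                       # new partitioning per repetition
        folds = [order[f::k] for f in range(k)]  # k nearly equal partitions
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, test
```

For the Iris data set (150 patterns) this yields 100 splits, each with 135 training and 15 test indices. In each split the instance weights would be learned on the training indices only and then used to classify the held-out partition.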
The results are evaluated from two aspects: classification accuracy and reduction ability, which are reported in Tables 3 and 4, respectively. The purpose of our first experiment is to investigate the effect of the learning algorithm on the classification accuracy of the learned classifier. Table 3 shows the classification rates of the basic NN rule using the Euclidean distance measure and of our proposed method using the adaptive distance measure.

Table 3. Generalization accuracy of basic NN compared to the proposed scheme

Data set       Basic NN   Adaptive NN
Sonar          86.75      86.25
Wine           94.91      95.54
Glass          72.15      72.05
Cancer (W)     94.54      95.15
Bupa           62.41      67.13
Pima           70.94      76.68
Thyroid        96.14      96.44
Iris           94.23      93.31
Image          91.37      94.15
Balance        81.31      87.61
Ave. Acc.(%)   84.47      86.43
In the last row of this table, we report the average generalization accuracy of each method over all 10 data sets. On average, our adaptive distance metric has improved the prediction accuracy of the basic NN by approximately 2% over the 10 data sets. In the second experiment, the reduction ability of the proposed method is verified and compared with DROP3-5, which are effective instance selection algorithms proposed in the past. In Table 4, we report the generalization accuracy and the percentage of the original training set retained by each method on the data sets used in this paper. The average accuracy and retained percentage of each method over the 10 data sets are reported in the last row of this table.

Table 4. Accuracy and storage percentage of DROP3-DROP5 and the proposed method

Data Set   DROP3           DROP4           DROP5           Adaptive NN
           Acc.    %       Acc.    %       Acc.    %       Acc.    %
Sonar      80.38   22.65   82.74   22.65   85.62   19.34   86.25   14.17
Wine       94.97   11.11   94.97   11.11   97.19   5.99    95.54   6.78
Glass      67.32   19.99   65.91   21.96   69.24   20.09   72.05   22.37
Cancer     95.42   3.74    95.28   3.78    95.28   3.62    95.15   1.15
Bupa       59.09   24.28   58.22   27.57   59.41   26.12   67.13   18.26
Pima       72.65   15.06   72.39   17.80   69.80   17.91   76.68   11.63
Thyroid    95.43   17.67   93.23   17.98   94.76   18.52   96.44   6.78
Iris       94.00   9.93    94.00   10.37   91.33   8.59    93.31   6.42
Image      91.04   11.43   92.38   12.74   88.85   10.96   94.15   15.97
Balance    83.57   12.78   82.23   11.72   80.49   12.65   87.61   6.45
Ave. (%)   83.39   14.87   83.14   15.77   83.21   14.38   86.43   11.02
As can be seen, the average accuracy of our proposed method is about 3 percentage points higher than that of the best of the DROP3-5 methods. The percentage of the original training set retained by the proposed method is also about 3.5 percentage points lower than that of the best of the DROP3-5 methods. That is, our proposed method simultaneously achieves better performance in terms of accuracy and of reducing the size of the training data.
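The two margins quoted above can be read off the last row of Table 4. The short check below copies the averages from that row; the dictionary layout and variable names are chosen for illustration only.

```python
# (average accuracy %, average % of training set retained), from Table 4
averages = {
    "DROP3": (83.39, 14.87),
    "DROP4": (83.14, 15.77),
    "DROP5": (83.21, 14.38),
    "Adaptive NN": (86.43, 11.02),
}

drop_methods = ("DROP3", "DROP4", "DROP5")
best_drop_acc = max(averages[m][0] for m in drop_methods)
lowest_drop_storage = min(averages[m][1] for m in drop_methods)

accuracy_gain = averages["Adaptive NN"][0] - best_drop_acc
storage_saving = lowest_drop_storage - averages["Adaptive NN"][1]

print(round(accuracy_gain, 2))   # 3.04
print(round(storage_saving, 2))  # 3.36
```

The best average accuracy among the DROP methods belongs to DROP3 and the lowest average storage to DROP5, so the two margins are measured against different methods.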
6 Conclusions

In this paper, we proposed a method of learning the weight of each instance in the training set. This learning scheme attempts to minimize the number of misclassified patterns. The weight assigned to each training instance was then incorporated in the adaptive distance metric to find the nearest instance of a query pattern and to predict the class label of that pattern. For empirical evaluation of the proposed scheme, we used 10 real-world data sets from the UCI-ML repository. We evaluated the performance of our proposed scheme in terms of generalization accuracy and reduction of the size of the original training set. The experimental results showed that the proposed scheme is more effective than other
instance selection algorithms proposed in the past (DROP3-5) both in terms of prediction accuracy and reducing the size of training set.
References
1. Ghosh, A.K., Chaudhuri, P., Murthy, C.A.: On aggregation and visualization of nearest neighbor classifiers. IEEE Trans. Patt. Anal. Mach. Intel. 27, 1592–1602 (2005)
2. Ghosh, A.K., Chaudhuri, P., Murthy, C.A.: Multi-scale classification using nearest neighbor density estimates. IEEE Trans. Sys. Man Cyber. 36, 1139–1148 (2006)
3. Dasarathy, B.V.: Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, Los Alamitos, CA (1991)
4. Wilson, D.R., Martinez, T.R.: Reduction Techniques for Exemplar-Based Learning Algorithms. Mach. Lear. 38, 257–286 (2000)
5. Jankowski, N., Grochowski, M.: Comparison of Instances Selection Algorithm I. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 598–603. Springer, Heidelberg (2004)
6. Friedman, J.: Flexible metric nearest-neighbor classification. Technical Report 113, Stanford University Statistics Department (1994)
7. Hastie, T., Tibshirani, R.: Discriminant adaptive nearest neighbor classification. IEEE Trans. Patt. Anal. Mach. Intel. 18, 607–615 (1996)
8. Domeniconi, C., Peng, J., Gunopulos, D.: Locally adaptive metric nearest neighbor classification. IEEE Trans. Patt. Anal. Mach. Intel. 24, 1281–1285 (2002)
9. Wang, J., Neskovic, P., Cooper, L.N.: Improving nearest neighbor rule with a simple adaptive distance measure. Patt. Rec. Lett. 28, 207–213 (2007)
10. Cover, T.M., Hart, P.E.: Nearest Neighbor Pattern Classification. Trans. Info. Theo. 13, 21–27 (1967)
11. Wilson, D.: Asymptotic properties of nearest neighbor rule using edited data. IEEE Trans. Info. Theo. 18, 431–433 (1972)
12. Tomek, I.: An experiment with edited nearest neighbor rule. IEEE Trans. Sys. Man Cyber. 6, 448–452 (1972)
13. Hart, P.E.: The Condensed Nearest Neighbor Rule. IEEE Trans. Info. Theo. 14, 515–516 (1968)
14. Gates, G.W.: The Reduced Nearest Neighbor Rule. Trans. Info. Theo. 18, 431–433 (1972)
15. Wilson, D.R., Martinez, T.R.: An Integrated Instance-Based Learning Algorithm. Comp. Intel. 16, 1–28 (2000)
Accurate Identification of a Markov-Gibbs Model for Texture Synthesis by Bunch Sampling
Georgy Gimel’farb and Dongxiao Zhou
Department of Computer Science, Tamaki Campus, The University of Auckland, Auckland 1, New Zealand
[email protected], [email protected]
http://www.tcs.auckland.ac.nz/~georgy
Abstract. A prior probability model is adapted to a class of images by identification, or parameter estimation from training data. We propose a new and accurate analytical identification of a generic Markov-Gibbs random field (MGRF) model with multiple pairwise interaction and use it for structural analysis and synthesis of textures.
1 Introduction
Prior MGRF models of spatially homogeneous images have been used for years [2, 15]. Such a model describes each class of images in terms of the explicit geometry and quantitative strength of inter-pixel interaction. A number of other image priors (see e.g. [12, 13]) have been proposed recently to reflect more closely the actual probability distributions of natural images. But much simpler MGRFs with pairwise interaction still quite often lead to effective solutions of many practical image analysis and synthesis problems. This paper proposes a new analytical identification of the MGRF that estimates the Gibbs potentials and the interaction geometry from first- and second-order statistics of a training image more accurately than before [5]. Each training image is more adequately assigned to aperiodic (stochastic) or nearly periodic (regular) textures, and its more adequate structural description in terms of arbitrarily shaped textons, or texels, and their spatial placement rules enhances our bunch sampling technique for fast realistic texture synthesis-via-analysis [16]. Section 2 derives the new accurate analytical approximation of the potentials for a generic MGRF. Section 3 discusses the improved structural descriptions of textures, presents synthetic stochastic and regular textures obtained with the enhanced bunch sampling, and compares it to other texture synthesis methods.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 979–986, 2007. © Springer-Verlag Berlin Heidelberg 2007

2 Potential Estimates for a Generic Markov-Gibbs Model

Let $g : \mathbf{R} \to \mathbf{Q}$ be a digital image with a finite set, $\mathbf{Q} = \{0, 1, \ldots, Q-1\}$, of scalar signals $q$ (grey values or colour indices) on a finite arithmetic lattice, $\mathbf{R} = \{(x, y) : x = 0, 1, \ldots, M-1;\ y = 0, 1, \ldots, N-1\}$, of size $MN$ where $(x, y)$ are pixel coordinates. Spatially homogeneous pairwise interaction is described by a fixed neighbourhood set $\mathcal{N}$ of coordinate offsets for neighbours interacting with a pixel; each pixel $(x, y) \in \mathbf{R}$ interacts with the subset $E_{x,y}$ of the neighbouring pixels,

$$E_{x,y} = \{(x+\xi, y+\eta),\ (x-\xi, y-\eta) : (\xi, \eta) \in \mathcal{N}\} \cap \mathbf{R}.$$

Gibbs potential functions $V_{\xi,\eta} : \mathbf{Q}^2 \to (-\infty, \infty)$ of co-occurrences $(g_{x,y} = q,\ g_{x+\xi,y+\eta} = s)$; $(q, s) \in \mathbf{Q}^2$ describe the strength of interaction between each pixel pair $((x, y); (x+\xi, y+\eta)) \in E_{x,y}$; $(x, y) \in \mathbf{R}$. The entire interaction is given by the potential $|\mathcal{N}|Q^2$-vector $\mathbf{V} = \left[V_{\xi,\eta}(q, s) : (q, s) \in \mathbf{Q}^2;\ (\xi, \eta) \in \mathcal{N}\right]^{\mathsf T}$. Let $\mathbf{F}(g) = \left[\rho_{\xi,\eta} f_{\xi,\eta}(q, s|g) : (q, s) \in \mathbf{Q}^2;\ (\xi, \eta) \in \mathcal{N}\right]^{\mathsf T}$ be a vector of scaled empirical probabilities (relative frequencies) $f_{\xi,\eta}(q, s|g)$ of the neighbouring signal co-occurrences. Here, $\rho_{\xi,\eta} = |C_{\xi,\eta}| / |\mathbf{R}|$ is the relative size of the set $C_{\xi,\eta}$ of all pixel pairs in $\mathbf{R}$ with the offset $(\xi, \eta)$, and $\sum_{(q,s)\in\mathbf{Q}^2} f_{\xi,\eta}(q, s|g) = 1$ for all $(\xi, \eta) \in \mathcal{N}$. Given $\mathcal{N}$ and $\mathbf{V}$, a generic MGRF with multiple pairwise interaction is specified with a Gibbs probability distribution (GPD) [15]:

$$P_{\mathcal{N},\mathbf{V}}(g^\circ) = \Big( \sum_{g \in \mathcal{G}} \exp\left(|\mathbf{R}|\, \mathbf{V}^{\mathsf T} \mathbf{F}(g)\right) \Big)^{-1} \exp\left(|\mathbf{R}|\, \mathbf{V}^{\mathsf T} \mathbf{F}(g^\circ)\right) \tag{1}$$
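where $\mathcal{G} = \mathbf{Q}^{|\mathbf{R}|}$ is the parent population of the images. To identify the MGRF of Eq. (1), both the characteristic neighbourhood $\mathcal{N}$ and the potential $\mathbf{V}$ are to be estimated from a training image $g^\circ$. Given $g^\circ$ and $\mathcal{N}$, the specific log-likelihood of the potential is:

$$\ell(\mathbf{V}|g^\circ) = \frac{1}{|\mathbf{R}|} \log P_{\mathcal{N},\mathbf{V}}(g^\circ) = \mathbf{V}^{\mathsf T} \mathbf{F}(g^\circ) - \frac{1}{|\mathbf{R}|} \ln \sum_{g \in \mathcal{G}} \exp\left(|\mathbf{R}|\, \mathbf{V}^{\mathsf T} \mathbf{F}(g)\right) \tag{2}$$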
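The scaled empirical probabilities entering Eqs. (1) and (2) are plain co-occurrence histograms. The following minimal sketch, assuming the image is a NumPy integer array indexed as `g[y, x]` with values in {0, ..., Q−1}, shows how $f_{\xi,\eta}(q, s|g)$ and $\rho_{\xi,\eta}$ could be collected for one offset. The function name `cooccurrence` and its arguments are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def cooccurrence(g, offset, Q):
    """Relative co-occurrence frequencies f(q, s | g) and the factor rho
    for one offset (xi, eta). Hypothetical helper for illustration only."""
    g = np.asarray(g)
    xi, eta = offset
    n_rows, n_cols = g.shape  # rows correspond to y, columns to x
    # Pixels (x, y) whose neighbour (x + xi, y + eta) stays inside the lattice.
    y0, y1 = max(0, -eta), min(n_rows, n_rows - eta)
    x0, x1 = max(0, -xi), min(n_cols, n_cols - xi)
    if y0 >= y1 or x0 >= x1:
        raise ValueError("offset is larger than the image")
    a = g[y0:y1, x0:x1]                        # signals q
    b = g[y0 + eta:y1 + eta, x0 + xi:x1 + xi]  # signals s of the neighbours
    counts = np.zeros((Q, Q))
    np.add.at(counts, (a.ravel(), b.ravel()), 1)  # histogram of pairs (q, s)
    n_pairs = a.size
    f = counts / n_pairs      # sums to 1 over all (q, s)
    rho = n_pairs / g.size    # |C_{xi,eta}| / |R|
    return f, rho
```

Calling the function once per offset in a search window yields all the statistics needed for the potential estimates derived below.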
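An analytical approach similar to that in [5] yields

Proposition 1. Let the log-likelihood gradient, $\nabla\ell(\mathbf{V}|g^\circ) = \mathbf{F}(g^\circ) - E\{\mathbf{F}(g)|\mathbf{V}\}$, and the Hessian matrix, $\nabla^2\ell(\mathbf{V}|g^\circ) = -\mathbf{C}_{\mathbf{F}(g)|\mathbf{V}}$, where $E\{\ldots|\mathbf{V}\}$ and $\mathbf{C}_{\mathbf{F}(g)|\mathbf{V}}$ are the mathematical expectation and covariance matrix, respectively, of the scaled empirical probability vectors under the GPD of Eq. (1), be known for an image $g^\circ$ and potential $\mathbf{V}_0$. Then the first approximation of the MLE:

$$\mathbf{V}^* = \mathbf{V}_0 + \lambda^* \nabla\ell(\mathbf{V}_0|g^\circ); \qquad \lambda^* = \frac{\nabla^{\mathsf T}\ell(\mathbf{V}_0|g^\circ)\, \nabla\ell(\mathbf{V}_0|g^\circ)}{\nabla^{\mathsf T}\ell(\mathbf{V}_0|g^\circ)\, \mathbf{C}_{\mathbf{F}(g)|\mathbf{V}_0}\, \nabla\ell(\mathbf{V}_0|g^\circ)} \tag{3}$$

maximises the second-order Taylor series expansion of the log-likelihood

$$\ell(\mathbf{V}|g^\circ) \approx \ell(\mathbf{V}_0|g^\circ) + \nabla^{\mathsf T}\ell(\mathbf{V}_0|g^\circ)(\mathbf{V} - \mathbf{V}_0) - \frac{1}{2} (\mathbf{V} - \mathbf{V}_0)^{\mathsf T}\, \mathbf{C}_{\mathbf{F}(g)|\mathbf{V}_0}\, (\mathbf{V} - \mathbf{V}_0)$$

along the gradient from $\mathbf{V}_0$. The approximate MLEs of Eq. (3) are centred: $\sum_{(q,s)\in\mathbf{Q}^2} V^*_{\xi,\eta}(q, s) = 0$ for $(\xi, \eta) \in \mathcal{N}$. The proof is straightforward. The vector $E\{\mathbf{F}(g)|\mathbf{V}\}$ of the scaled marginal co-occurrence probabilities and the covariance matrix $\mathbf{C}_{\mathbf{F}(g)|\mathbf{V}}$ are explicitly known if the MGRF of Eq. (1) is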
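an independent random field (IRF). The solution in [5] assumes the simplest IRF (denoted IRF$_0$ below) with zero potential $\mathbf{V}_0 = \mathbf{0}$ and equal co-occurrence probabilities, $p_{\xi,\eta}(q, s) = \frac{1}{Q^2}$; $(q, s) \in \mathbf{Q}^2$. In this case

$$\nabla\ell(\mathbf{0}|g^\circ) = \mathbf{F}_{\mathrm{cn}}(g^\circ) \quad \text{and} \quad \mathbf{C}_{\mathbf{F}(g)|\mathbf{0}} \approx \frac{Q^2 - 1}{Q^4}\, \mathrm{Diag}\left[\rho_{\xi,\eta} \mathbf{u} : (\xi, \eta) \in \mathcal{N}\right]$$

where $\mathbf{F}_{\mathrm{cn}}(g^\circ) = \left[\rho_{\xi,\eta} f_{\mathrm{cn};\xi,\eta}(q, s) : (q, s) \in \mathbf{Q}^2;\ (\xi, \eta) \in \mathcal{N}\right]^{\mathsf T}$ is a vector of the scaled centred probabilities $f_{\mathrm{cn};\xi,\eta}(q, s) = f_{\xi,\eta}(q, s|g^\circ) - \frac{1}{Q^2}$ for the image $g^\circ$ such that $\sum_{(q,s)\in\mathbf{Q}^2} f_{\mathrm{cn};\xi,\eta}(q, s|g^\circ) = 0$ for all $(\xi, \eta) \in \mathcal{N}$; $\mathbf{u}$ is the $Q^2$-vector with unit components, and $\mathrm{Diag}[\ldots]$ denotes a diagonal matrix.

Corollary 1. The approximate MLE of Eq. (3) in the vicinity of $\mathbf{V}_0 = \mathbf{0}$ is

$$\mathbf{V}^* = \frac{\mathbf{F}_{\mathrm{cn}}^{\mathsf T}(g^\circ)\, \mathbf{F}_{\mathrm{cn}}(g^\circ)}{\mathbf{F}_{\mathrm{cn}}^{\mathsf T}(g^\circ)\, \mathbf{C}_{\mathrm{ind}}\, \mathbf{F}_{\mathrm{cn}}(g^\circ)}\, \mathbf{F}_{\mathrm{cn}}(g^\circ) \approx Q^2\, \mathbf{F}_{\mathrm{cn}}(g^\circ) \tag{4}$$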
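The proof is straightforward. The right-hand simplification holds if $Q \gg 1$ and the lattice $\mathbf{R}$ is sufficiently large to make $\rho_{\xi,\eta} \approx 1$ for all $(\xi, \eta) \in \mathcal{N}$. A considerably more accurate approximation of the MLE is obtained if, instead of the IRF$_0$, the starting point $\mathbf{V}_0$ in Proposition 1 relates to the IRF$_{g^\circ}$ with the same marginal signal distribution, $\mathbf{F}_{\mathrm{pix}} = \left[f_{\mathrm{pix}}(q|g^\circ) : q \in \mathbf{Q}\right]^{\mathsf T}$, as the training image $g^\circ$. It is easy to show that the IRF$_{g^\circ}$ is represented by the GPD similar to Eq. (1) but with the centred pixel-wise potential $Q$-vector $\mathbf{V}_{\mathrm{pix}} = \left[ V_{\mathrm{pix}}(q) = \ln f_{\mathrm{pix}}(q|g^\circ) - \frac{1}{Q} \sum_{\kappa \in \mathbf{Q}} \ln f_{\mathrm{pix}}(\kappa|g^\circ) : q \in \mathbf{Q} \right]^{\mathsf T}$:

$$P_{\mathrm{irf}}(g^\circ) = \exp\left(|\mathbf{R}|\, \mathbf{V}_{\mathrm{pix}}^{\mathsf T} \mathbf{F}_{\mathrm{pix}}(g^\circ)\right) \Big( \sum_{q \in \mathbf{Q}} \exp\left(V_{\mathrm{pix}}(q)\right) \Big)^{-|\mathbf{R}|} \tag{5}$$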
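Proposition 2. Assume that $\sum_{s\in\mathbf{Q}} f_{\xi,\eta}(q, s|g) = \sum_{s\in\mathbf{Q}} f_{\xi,\eta}(s, q|g) = f_{\mathrm{pix}}(q|g)$ for all $(\xi, \eta) \in \mathcal{N}$ and $g \in \mathcal{G}$. Then the potential of the pairwise interaction

$$\mathbf{V}_0 = \left[ V_{\mathrm{irf}:\xi,\eta}(q, s) = \frac{1}{2\rho_{\xi,\eta}|\mathcal{N}|} \left( V_{\mathrm{pix}}(q) + V_{\mathrm{pix}}(s) \right) : (q, s) \in \mathbf{Q}^2;\ (\xi, \eta) \in \mathcal{N} \right]^{\mathsf T}$$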
reduces the MGRF of Eq. (1) to the IRF$_{g^\circ}$ with the marginal signal probability distribution $\mathbf{F}_{\mathrm{pix}}(g^\circ)$.

Proof. The assumption holds precisely for the marginal signal co-occurrence and marginal signal probability distributions and therefore it is asymptotically valid for the empirical distributions, too, provided the lattice $\mathbf{R}$ is sufficiently large to ignore deviations due to border effects. In this case the normalised exponent of the MGRF is as follows:

$$\sum_{(\xi,\eta)\in\mathcal{N}} \rho_{\xi,\eta} \sum_{(q,s)\in\mathbf{Q}^2} V_{\mathrm{irf}:\xi,\eta}(q, s)\, f_{\xi,\eta}(q, s|g) = \sum_{q\in\mathbf{Q}} V_{\mathrm{pix}}(q)\, f_{\mathrm{pix}}(q|g)$$
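With $\mathbf{V}_0$ from Proposition 2, the log-likelihood gradient $\nabla\ell(\mathbf{V}_0|g^\circ) = \boldsymbol{\Delta}(g^\circ)$ where $\boldsymbol{\Delta}(g^\circ) = \left[ \rho_{\xi,\eta} \left( f_{\xi,\eta}(q, s|g^\circ) - f_{\mathrm{pix}}(q|g^\circ) f_{\mathrm{pix}}(s|g^\circ) \right) : (q, s) \in \mathbf{Q}^2;\ (\xi, \eta) \in \mathcal{N} \right]^{\mathsf T}$ is the $nQ^2$-vector of the scaled differences between the empirical co-occurrence probability for $g^\circ$ and the co-occurrence probability for the IRF$_{g^\circ}$. The covariance matrix $\mathbf{C}_{\mathbf{F}(g)|\mathbf{V}_0}$ is closely approximated by the diagonal matrix $\mathbf{C}_{\mathrm{irf}} = \mathrm{Diag}\left[ \rho_{\xi,\eta}\, \mathrm{var}_{q,s} : (q, s) \in \mathbf{Q}^2;\ (\xi, \eta) \in \mathcal{N} \right]$ where $\mathrm{var}_{q,s}$ is the variance of the latter probability: $\mathrm{var}_{q,s} = f_{\mathrm{pix}}(q|g^\circ) f_{\mathrm{pix}}(s|g^\circ) \left( 1 - f_{\mathrm{pix}}(q|g^\circ) f_{\mathrm{pix}}(s|g^\circ) \right)$.

Proposition 3. The first approximation of the potential MLE in the vicinity of the point $\mathbf{V}_0$ from Proposition 2 in the potential space is $\mathbf{V}^* = \mathbf{V}_0 + \lambda^* \boldsymbol{\Delta}(g^\circ)$ with the maximising factor

$$\lambda^* = \frac{\boldsymbol{\Delta}^{\mathsf T}(g^\circ)\, \boldsymbol{\Delta}(g^\circ)}{\boldsymbol{\Delta}^{\mathsf T}(g^\circ)\, \mathbf{C}_{\mathrm{irf}}\, \boldsymbol{\Delta}(g^\circ)} = \frac{\sum_{(\xi,\eta)\in\mathcal{N}} \rho_{\xi,\eta}^2 \sum_{(q,s)\in\mathbf{Q}^2} \Delta_{\xi,\eta;q,s}^2}{\sum_{(\xi,\eta)\in\mathcal{N}} \rho_{\xi,\eta}^3 \sum_{(q,s)\in\mathbf{Q}^2} \mathrm{var}_{q,s}\, \Delta_{\xi,\eta;q,s}^2} \tag{6}$$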
The approximate estimate of Proposition 3 coincides with the actual potential for all the IRFs. Therefore, it is considerably closer to the actual MLE for a generic MGRF than the approximation of Corollary 1.
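The estimate of Proposition 3 involves only element-wise arithmetic on the co-occurrence histograms. The sketch below is one possible reading of Propositions 2 and 3, assuming that every grey level occurs in the training image (so that the logarithms are finite) and that the histograms are supplied as dictionaries keyed by offset. The function name `proposition3_potential` and its interface are hypothetical.

```python
import numpy as np

def proposition3_potential(f, rho, f_pix):
    """Approximate potential V* = V0 + lambda* Delta for every offset.

    f     : dict offset -> (Q, Q) array of co-occurrence frequencies
    rho   : dict offset -> scaling factor rho for that offset
    f_pix : length-Q marginal signal frequencies (assumed strictly positive)
    """
    f_pix = np.asarray(f_pix, dtype=float)
    log_f = np.log(f_pix)
    v_pix = log_f - log_f.mean()       # centred pixel-wise potential
    p_irf = np.outer(f_pix, f_pix)     # co-occurrence probability of the IRF
    var = p_irf * (1.0 - p_irf)        # var_{q,s}
    n_offsets = len(f)

    delta, num, den = {}, 0.0, 0.0
    for off, f_off in f.items():
        delta[off] = np.asarray(f_off, dtype=float) - p_irf  # unscaled difference
        num += rho[off] ** 2 * np.sum(delta[off] ** 2)
        den += rho[off] ** 3 * np.sum(var * delta[off] ** 2)
    # For an exact IRF all differences vanish and V* reduces to V0.
    lam = num / den if den > 0 else 0.0

    v_star = {}
    for off in f:
        v0 = (v_pix[:, None] + v_pix[None, :]) / (2.0 * rho[off] * n_offsets)
        v_star[off] = v0 + lam * rho[off] * delta[off]  # components of Delta carry rho
    return v_star, lam
```

For a binary image with uniform marginals and a single offset with $\rho = 1$, the factor evaluates to $Q^4/(Q^2-1) = 16/3$, which agrees with the exact factor of Corollary 1 in this special case.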
3 Experimental Results and Conclusions
The potential of Proposition 3 results in a more informative model-based interaction map (MBIM) [5] of energies $e_{\xi,\eta}(g^\circ) = \rho_{\xi,\eta} \sum_{(q,s)\in\mathbf{Q}^2} V_{\xi,\eta}(q, s)\, F_{\xi,\eta}(q, s|g^\circ)$ over the neighbourhoods: $\mathrm{MBIM}(g^\circ) = \{e_{\xi,\eta}(g^\circ) : (\xi, \eta) \in \mathcal{W}\}$ where $\mathcal{W} = \{(\xi, \eta) : |\xi| \le \xi_{\max};\ |\eta| \le \eta_{\max}\}$ is a set of all inter-pixel coordinate offsets such that the longest anticipated interaction is below $(\xi_{\max}, \eta_{\max})$, i.e. $\mathcal{N} \subset \mathcal{W}$. The grey-coded MBIMs for several training stochastic and regular textures from [1, 10] are shown in Fig. 1 (the darker the pixel, the higher the energy). The visual pattern of a texture depends mostly on a small characteristic group of neighbours with higher energies. The MBIM with the potential of Proposition 3 discriminates more distinctly between the characteristic and non-characteristic energies, so that simple energy thresholds, e.g. heuristic [5] or unimodal thresholding [11], produce a notably better characteristic neighbourhood $\mathcal{N}$ to derive texels and their placement rules for bunch sampling [16]. Also, local energy clusters in the $(\xi, \eta)$-plane of an MBIM allow us to easily discriminate between stochastic (aperiodic) and regular (periodic) textures. Figure 2 presents synthetic textures obtained by the improved bunch sampling from different regular and stochastic training images. Visual fidelity is preserved in most of these synthetic textures. However, the bunch sampling averages local geometric and signal deviations typical for weakly homogeneous textures, whose geometric structure varies across the image. As a result, it produces only their rectified (precisely periodic) versions and fails dramatically on textures
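[Figure 1 panels: training samples and their grey-coded MBIMs, with the sizes |N| of the neighbourhoods estimated with the potentials of (a) Corollary 1 and (b) Proposition 3]
Stochastic textures:  Bark0009  D004  D012  D024  D029  D066  D105  Metal0005
  a: |N| =            17        7     20    5     10    32    22    4
  b: |N| =            20        10    22    10    13    32    45    5
Regular textures:     D001      D006  D020  D034  D052  D076  D077  D101
  a: |N| =            14        19    30    4     17    24    11    40
  b: |N| =            27        52    56    24    35    34    20    62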
Fig. 1. Stochastic (upper) and regular (bottom) Brodatz [1] and MIT VisTex [10] textures: training samples 128 × 128, their grey-coded MBIMs 85 × 85 (ξmax = ηmax = 42), and neighbourhoods N estimated by thresholding of their MBIMs for the potential estimates of Corollary 1 (a) and Proposition 3 (b). Note that the estimate (a) fails for D034 and separates much less neatly the higher energies in other textures.
with aperiodic irregular shapes and/or arbitrary placement of local elements, for which pairwise interaction alone is simply inadequate. Today’s most popular texture synthesis techniques [3, 4, 6, 7, 14] are based on non-parametric seamless replication of image patches. The bunch sampling resembles them in that signals retrieved from the training image are used to
[Figure 2 panels: D006, D014, D020, D034, D052, D101, D102, Tile0007, D004, D024, Bark0009, Water0002, Flower0004, Metal0005, Grass0001, Fabric0016, Cans, Weave, Dots, Floor, Flowers, Flora, Knit, Design]
Fig. 2. Bunch sampling synthesis of regular and stochastic textures from [1, 7, 10] (the training 128 × 128 and synthetic 360 × 360 images)
[Figure 3 panels: Training image; Placement grid; Bunch sampling; Method in [4]; Method in [3]; Method in [14]; Method in [8]; Method in [6]]
Fig. 3. Bunch sampling vs. other methods (the images are taken in part from [8])
form synthetic textures. But instead of a texture model, these methods rely on local neighbourhood matching to constrain the selection of the training signals and replicate the texture features. Therefore, these methods are unaware of the global structure of a texture and encounter problems in determining the proper size and shape of the local neighbourhood for capturing features at various scales. In this case, only an interaction with the user can provide an adequate neighbourhood, which varies notably from one texture to another. If a texture is built sequentially, pixel by pixel, the accumulated errors may eventually destroy the desired pattern. Because their constraints are only local, the non-parametric methods may outperform the bunch sampling in catching local variations of stochastic textures but are weak in reproducing regular textures. The basic idea of an alternative synthesis of regular textures in [9] is similar to the bunch sampling. But the periodicity of a training texture is recovered from translational symmetries of an autocorrelation function that describes statistics of pairwise signal products over the image. As a result, the estimated interaction structure is less definite than with more general statistics of pairwise signal co-occurrences. Another difference from the bunch sampling is that large image tiles are used in [9] as construction units for texture synthesis. Each tile is cut out from the training image and then placed into the synthetic texture in line with the estimated placement grid. As a result, the overlapping regions have to be blended at seams to avoid visual disruption. Figure 3 compares results of the synthesis of the colour texture “Mesh” using the bunch sampling, the non-parametric sampling in [3, 4, 6, 14], and the method proposed in [8]. It is worth noting that only the bunch sampling discovers the periodicity of this texture; the placement grid found is also shown in Fig. 3.
Therefore, this training image is perceived as a perfectly regular texture only by the bunch sampling, while all other methods treat it as a weakly homogeneous one due to intricate local neighbourhoods. Each natural texture mixes both global and local features, so that neither the model-based bunch sampling nor the non-parametric approaches alone can solve the rich variety of texture analysis and synthesis problems. Future solutions should combine global and local texture representations. At the same time, the derived accurate identification of a generic MGRF prior produces more adequate global descriptions of the training images and is simple enough to be of practical interest by itself.
References
1. Brodatz, P.: Textures: A Photographic Album for Artists and Designers. Dover, New York (1966)
2. Cross, G.R., Jain, A.K.: Markov random fields texture models. IEEE Trans. Pattern Anal. Machine Intell. 5, 25–39 (1983)
3. Efros, A.A., Freeman, W.T.: Image quilting for texture synthesis and transfer. In: Proc. ACM Computer Graphics Conf. SIGGRAPH 2001, pp. 341–346. ACM Press, N.Y (2001)
4. Efros, A.A., Leung, T.K.: Texture synthesis by non-parametric sampling. In: Proc. 7th Int. Conf. Computer Vision (ICCV 1999), pp. 1033–1038. IEEE CS Press, Los Alamitos (1999)
5. Gimel’farb, G.L.: Image Textures and Gibbs Random Fields. Kluwer Academic, Dordrecht (1999)
6. Kwatra, V., Schödl, A., Essa, I.A., Turk, G., Bobick, A.: Graphcut textures: Image and video synthesis using graph cuts. In: Proc. ACM Computer Graphics Conf. SIGGRAPH 2003, pp. 277–286. ACM Press, N.Y (2003)
7. Liang, L., Liu, C., Shum, H.Y.: Real-time texture synthesis by patch-based sampling. Technical Report MSR-TR-2001-40, Microsoft Research (2001)
8. Liu, Y., Lin, W.-C.: Deformable texture: the irregular-regular-irregular cycle. In: Proc. 3rd Int. Workshop Texture Analysis and Synthesis (Texture 2003), pp. 65–70. Heriot-Watt Univ., Edinburgh (2003)
9. Liu, Y., Tsin, Y., Lin, W.: The promise and perils of near-regular texture. Int. J. Computer Vision 62, 145–159 (2005)
10. Picard, R., Graczyk, C., Mann, S., et al.: VisTex Database. MIT Media Lab, Cambridge, Mass., USA (1995)
11. Rosin, P.: Unimodal thresholding. Pattern Recognition 34, 2083–2096 (2001)
12. Roth, S., Black, M.J.: Fields of Experts: A framework for learning image priors. In: Proc. IEEE CS Conf. Computer Vision Pattern Recognition (CVPR 2005), pp. 860–867. IEEE CS Press, Los Alamitos (2005)
13. Srivastava, A., Liu, X., Grenander, U.: Universal analytical forms for modeling image probabilities. IEEE Trans. Pattern Anal. Machine Intell. 24, 1200–1214 (2002)
14. Wei, L., Levoy, M.: Fast texture synthesis using tree-structured vector quantization. In: Proc. ACM Computer Graphics Conf. SIGGRAPH 2000, pp. 479–488. ACM Press / Addison Wesley Longman, New York (2000)
15. Winkler, G.: Image Analysis, Random Fields and Dynamic Monte Carlo Methods. Springer, Berlin (1995)
16. Zhou, D., Gimel’farb, G.: Bunch sampling for fast texture synthesis. In: Petkov, N., Westenberg, M.A. (eds.) CAIP 2003. LNCS, vol. 2756, pp. 124–131. Springer, Heidelberg (2003)
Texture Defect Detection
Michal Haindl, Jiří Grim, and Stanislav Mikeš
Institute of Information Theory and Automation, Academy of Sciences CR, 182 08 Prague, Czech Republic
{haindl,grim,xaos}@utia.cas.cz
Abstract. This paper presents a fast multispectral texture defect detection method based on the underlying three-dimensional spatial probabilistic image model. The model first adaptively learns its parameters on the flawless texture part and subsequently checks for texture defects using the recursive prediction analysis. We provide colour textile defect detection results that indicate the advantages of the proposed method.
1 Introduction
Traditional manual inspection of material surfaces is labor- and cost-intensive and is a major bottleneck in high-speed production lines [1]. Many defects are very difficult to detect manually; it is estimated [2] that a highly trained human operator can detect about 60% to 70% of leather material defects. The advantages of automated visual inspection are well known: repeatability, reliability and accuracy. Unfortunately, very few practical systems for automated inspection of textile surfaces are available, mainly due to their computational costs [3]. Texture imperfections are either non-textured or differently textured patches that locally disrupt the homogeneity of a texture image [4]. Quality is a topical issue [5] in manufacturing, designed to ensure that defective products are not allowed to reach the customer. Since in many areas the quality of a surface is best characterized by its texture, texture analysis plays an important role in automatic visual inspection of surfaces. The major textile texture defects reported by [5] were missing threads (causing dark lines in the image), gathered knots and oil stains (causing small dark regions in the image), gathered threads (causing dark curves in the image), and tiny holes in the fabric. Due to the nature of the weaving process, the majority of the defects on the textile web occur along two directions, i.e. horizontal and vertical [1]. Defect detection in textured materials can be a very subjective task, since defects can be very subtle and not well localized in space, which may lead to only a small modification in the power spectrum. The conventional approach [4] is to compute texture features in a local subwindow and to compare them with the reference values representing a perfect pattern. The method [5] preprocesses a gray level textile texture with histogram modification and median filtering. The image is subsequently thresholded using the 2D CAR model predictor and finally smoothed with another median filter run. Another approach for detection of gray level textured defects using linear FIR filters with optimized energy separation was proposed in [1]. Similarly, the defect detection in [2] is based on a set of optimized filters applied to wavelet subbands and tuned for a defect type. Method [3] uses translation invariant 2-D RISpline wavelets for textile surface inspection. The gray level texture is removed using the wavelet shrinkage approach and defects are subsequently detected by simple thresholding. Contrary to the above approaches, the presented method uses the multispectral information. The organization of this paper is as follows. First, a concise description of the underlying texture model as well as the model selection criterion is given. The third section summarizes the core part of the detection algorithm, followed by the experimental results and conclusions sections.

W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 987–994, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Texture Representation
We assume that a multispectral textured image can be represented by an adaptive 3D causal simultaneous autoregressive model [6]:

$$X_r = \sum_{s \in I_r^c} A_s X_{r-s} + \epsilon_r , \tag{1}$$

where $\epsilon_r$ is a white Gaussian noise vector with zero mean and a constant but unknown covariance matrix $\Sigma$, and $r$, $s$ are multiindices. The noise vector is uncorrelated with data from a causal neighbourhood $I_r^c$,

$$A_{s_1,s_2} = \begin{pmatrix} a^{s_1,s_2}_{1,1} & \cdots & a^{s_1,s_2}_{1,d} \\ \vdots & \ddots & \vdots \\ a^{s_1,s_2}_{d,1} & \cdots & a^{s_1,s_2}_{d,d} \end{pmatrix}$$

are $d \times d$ parameter matrices where $d$ is the number of spectral bands. $r, r-1, \ldots$ is a chosen direction of movement on the image lattice (e.g. scanning lines rightward and top to down). This model can be analytically estimated using numerically robust recursive statistics; hence it is exceptionally well suited for possibly real-time texture defect detection applications. The model adaptivity is introduced using the standard exponential forgetting factor technique in the parameter learning part of the algorithm. The model can be written in the matrix form

$$X_{r,i} = \gamma Z_r + \epsilon_r , \tag{2}$$
where $\gamma = [A_1, \ldots, A_\eta]$, $\eta = \mathrm{card}(I_r^c)$, is a $1 \times d\eta$ parameter matrix and $Z_r$ is a corresponding vector of $X_{r-s}$. To evaluate conditional mean values $E\{X_r | X^{(r-1)}\}$ the one-step-ahead prediction posterior density $p(X_r | X^{(r-1)})$ is needed. If we assume the normal-gamma parameter prior for parameters in (1) (alternatively we can assume the Jeffreys parameter prior) this posterior density has the form of Student’s probability density

$$p(X_r | X^{(r-1)}) = \frac{\Gamma\left(\frac{\beta(r) - d\eta + 3}{2}\right) \pi^{-\frac{1}{2}} \left|\lambda_{(r-1)}\right|^{-\frac{1}{2}}}{\Gamma\left(\frac{\beta(r) - d\eta + 2}{2}\right) \left(1 + Z_r^T V_{zz(r-1)}^{-1} Z_r\right)^{\frac{1}{2}}} \left( 1 + \frac{(X_r - \hat\gamma_{r-1} Z_r)^T \lambda_{(r-1)}^{-1} (X_r - \hat\gamma_{r-1} Z_r)}{1 + Z_r^T V_{zz(r-1)}^{-1} Z_r} \right)^{-\frac{\beta(r) - d\eta + 3}{2}} , \tag{3}$$

with $\beta(r) - d\eta + 2$ degrees of freedom, where the following notation is used:

$$\beta(r) = \beta(0) + r - 1 , \tag{4}$$

$$\hat\gamma_{r-1}^T = V_{zz(r-1)}^{-1} V_{zx(r-1)} , \tag{5}$$

$$V_{r-1} = \begin{pmatrix} \tilde V_{xx(r-1)} & \tilde V_{zx(r-1)}^T \\ \tilde V_{zx(r-1)} & \tilde V_{zz(r-1)} \end{pmatrix} + I , \tag{6}$$

$$\tilde V_{uw(r-1)} = \sum_{k=1}^{r-1} U_k W_k^T , \tag{7}$$

$$\lambda_{(r)} = V_{xx(r)} - V_{zx(r)}^T V_{zz(r)}^{-1} V_{zx(r)} , \tag{8}$$

where $\beta(0) > 1$ and $U$, $W$ denote either the $X$ or the $Z$ vector, respectively. If $\beta(r-1) > \eta$ then the conditional mean value is

$$E\{X_r | X^{(r-1)}\} = \hat\gamma_{r-1} Z_r \tag{9}$$

and it can be efficiently computed using the following recursion

$$\hat\gamma_r^T = \hat\gamma_{r-1}^T + \frac{V_{zz(r-1)}^{-1} Z_r (X_r - \hat\gamma_{r-1} Z_r)^T}{1 + Z_r^T V_{zz(r-1)}^{-1} Z_r} .$$
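The recursion above is a standard recursive least-squares update. The sketch below illustrates it under several simplifying assumptions that are not taken from the paper: all spectral bands are stacked so that the estimate is stored as a d × dη array, the data statistics start from the identity matrix as suggested by Eq. (6), and the exponential forgetting factor is omitted. The class name `RecursiveCAR` and its methods are hypothetical.

```python
import numpy as np

class RecursiveCAR:
    """Recursive estimate of the CAR parameter matrix (illustrative sketch)."""

    def __init__(self, d, eta):
        self.gamma = np.zeros((d, d * eta))  # parameter estimate, zero before any data
        self.V_inv = np.eye(d * eta)         # inverse of V_zz, starting from I

    def predict(self, z):
        # One-step-ahead conditional mean of Eq. (9).
        return self.gamma @ z

    def update(self, x, z):
        err = x - self.gamma @ z             # prediction error with old parameters
        Vz = self.V_inv @ z
        denom = 1.0 + z @ Vz
        # Transposed form of the parameter recursion given in the text.
        self.gamma += np.outer(err, Vz) / denom
        # Sherman-Morrison update of the inverse after adding z z^T to V_zz.
        self.V_inv -= np.outer(Vz, Vz) / denom
        return err
```

Each update costs O((dη)²) operations and avoids any explicit matrix inversion, which is consistent with the real-time claim made for the model.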
The selection of an appropriate model support ($I_r^c$) is important to obtain good restoration results. If the contextual neighbourhood is too small it cannot capture all details of the random field. Inclusion of unnecessary neighbours, on the other hand, adds to the computational burden and can potentially degrade the performance of the model as an additional source of noise. The optimal Bayesian decision rule for minimizing the average probability of decision error chooses the maximum posterior probability model, i.e., a model $M_i$ corresponding to $\max_j \{p(M_j | X^{(r-1)})\}$. If we assume a uniform prior for all tested support sets (models), the solution for the optimal model support ($I_r^c$) can be found analytically. The most probable model given past data is the model $M_i(I_{r,i}^c)$ for which $i = \arg\max_j \{D_j\}$,

$$D_j = -\frac{1}{2} \ln \left|V_{zz(r-1)}\right| - \frac{\alpha(r)}{2} \ln \left|\lambda_{(r-1)}\right| + \frac{d\eta}{2} \ln \pi + \ln \Gamma\left(\frac{\alpha(r)}{2}\right) - \ln \Gamma\left(\frac{\beta(0) - d\eta + 2}{2}\right) , \tag{10}$$

where $\alpha(r) = \beta(r) - d\eta + 2$.
3 Defect Detection
Single multispectral pixels are classified as belonging to the defective area based on their corresponding prediction errors. If the prediction error is larger than the adaptive threshold

$$\left| E\{X_r | X^{(r-1),(s-1)}\} - X_r \right| > \frac{2.7}{l} \sum_{i=1}^{l} \left| E\{X_{r-i} | X^{(r-i-1),(s-i-1)}\} - X_{r-i} \right| \tag{11}$$

then the pixel $r$ is classified as a detected defect pixel. Here $l$ is the process history length of the adaptive threshold, and the constant 2.7 was found experimentally. The one-step-ahead predictor

$$E\{X_r | X^{(r-1),(s-1)}\} = \hat\gamma_{s-1} Z_r \tag{12}$$

differs from the corresponding predictor (9) in using parameters $\hat\gamma_{s-1}$ which were learned only in the flawless texture area ($s < r$). The whole algorithm is extremely fast because the adaptive threshold is updated recursively:

$$|\epsilon_{r+1}| > \frac{2.7}{l} \left( \sum_{i=1}^{l} |\epsilon_{r-i}| + |\epsilon_r| - |\epsilon_{r-l}| \right) ,$$

where $\epsilon_r$ here denotes the prediction error $\epsilon_r = E\{X_r | X^{(r-1),(s-1)}\} - X_r$ and $\hat\gamma_{s-1}$ is the parametric matrix, which does not change. Hence the algorithm can easily be applied in real-time surface quality control.

[Figure 1 panels: mask 1; mask 2]
Fig. 1. Defect masks used in this study for experimental texture mosaics
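The recursive form of the threshold in Eq. (11) amounts to maintaining a running sum over a sliding window of the last l error magnitudes. A minimal sketch follows, assuming that the prediction errors have already been reduced to non-negative scalar magnitudes in scan order and that every error, including those of detected defect pixels, enters the history; the paper does not specify either point. The function name `detect_defects` is hypothetical.

```python
from collections import deque

def detect_defects(errors, l, factor=2.7):
    """Flag pixels whose error magnitude exceeds factor/l times the sum
    of the previous l error magnitudes (illustrative sketch)."""
    window = deque()
    running_sum = 0.0
    flags = []
    for e in errors:
        if len(window) == l:
            flags.append(e > factor * running_sum / l)
        else:
            flags.append(False)  # not enough history to form the threshold yet
        # Slide the window: add the newest error and drop the oldest one.
        window.append(e)
        running_sum += e
        if len(window) > l:
            running_sum -= window.popleft()
    return flags
```

Each pixel requires one addition and one subtraction to refresh the threshold, independently of the history length l.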
4 Experimental Results
The presented method was tested on a set of artificially damaged 512 × 512 colour textile textures, so the ground truth (Fig. 1) for every pixel is well known and cannot be influenced by a subjective evaluation. All tested images are colour (d = 3); however, the method obviously allows any number of spectral bands. The performance of the algorithm is tested using the usual recall (r), precision (p), and type II error (II) criteria. Let us denote the number of defect pixels $n_d$, the number of pixels interpreted as defect pixels $n_i$, the number of correctly interpreted defect pixels $n_c$, and the total number of pixels $n$. The performance criteria are then as follows:

$$r = \frac{n_c}{n_d} , \qquad p = \frac{n_c}{n_i} , \qquad II = \frac{n_i - n_c}{n - n_d} .$$
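The three criteria can be computed directly from a ground-truth mask and a detection mask. The sketch below assumes both masks are boolean arrays of equal shape and returns NaN for a criterion whose denominator is zero; this convention and the function name `defect_scores` are illustrative choices.

```python
import numpy as np

def defect_scores(truth, detected):
    """Recall, precision and type II error from two boolean masks."""
    truth = np.asarray(truth, dtype=bool)
    detected = np.asarray(detected, dtype=bool)
    n = truth.size                       # total number of pixels
    n_d = int(truth.sum())               # defect pixels in the ground truth
    n_i = int(detected.sum())            # pixels interpreted as defects
    n_c = int((truth & detected).sum())  # correctly interpreted defect pixels
    nan = float("nan")
    recall = n_c / n_d if n_d else nan
    precision = n_c / n_i if n_i else nan
    type2 = (n_i - n_c) / (n - n_d) if n > n_d else nan
    return recall, precision, type2
```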
[Figure 2 panels: mosaic; detected defect; prediction error map]
Fig. 2. Defect detection on texture mosaics (a-c) using the defect mask Fig.1-left

[Figure 3 panels: skin; detected disease; prediction error map]
Fig. 3. Monitoring of the pemphigus vulgaris skin disease progress
Recall estimates the probability that the reference pixels will be correctly assigned, precision is the defect accuracy estimate relative to the error due to wrong assignment, and the type II error estimates the probability of the commission error. All these criteria have the range ⟨0; 1⟩. Single defects were simulated by replacing irregular parts of textile textures with different but as similar as
[Figure 4 panels: mosaic a to mosaic e, each with its detected defect; prediction error map]
Fig. 4. Defect detection on texture mosaics (a-e) using the defect mask Fig.1-right

Table 1. Performance criteria
mosaic (row)   recall r   precision p   II
Fig.2 - a      0.22       0.01          0.09
Fig.2 - b      1.00       0.70          0.00
Fig.2 - c      0.11       0.62          0.00
Fig.4 - a      1.00       0.70          0.00
Fig.4 - b      0.93       0.10          0.02
Fig.4 - c      0.92       0.19          0.01
Fig.4 - d      0.34       0.11          0.01
Fig.4 - e      0.93       0.71          0.00
Fig.5 - a      0.06       0.00          0.45
Fig.5 - b      0.13       0.00          0.21
possible textile texture. All textures are from our large (more than 1000 high resolution colour textures categorized into 10 thematic classes) colour texture database. All results presented are without any postprocessing such as isolated defect pixels filtering to demonstrate basic performance of the presented method.
Table 2. Time performance on the HP 9000/785 Unix machine
η    learning [s]   detection [s]
3    1              7
8    12             15

[Figure 5 panels: mosaic; detected defect; prediction error map]
Fig. 5. Failures on highly structured textures
Figs. 2 and 4 exhibit correct defect detection, which is also clearly visible in the corresponding prediction error maps. Tab. 1 shows robust performance with high recall values even for hardly visible defects (Fig. 4-b,e). Even for lower recall values (Fig. 2-a,c, Fig. 4-d) the defect is clearly outlined. As expected, precision is low and the type II error is high in the failure examples (Fig. 5). Finally, the method was successfully evaluated in a skin disease treatment monitoring application. Fig. 3 illustrates a patient with the pemphigus vulgaris skin disease and its automatically detected regions, which are subsequently compared with previous examinations to monitor the efficiency of the treatment. Defects were detected using simple models with causal neighbourhoods containing either 3 or 8 sites (η ∈ {3, 8}) (Tab. 2) and adaptive learning on an uncorrupted quarter of every texture mosaic. The processing times in Tab. 2 are for unoptimized code and can easily be decreased further. Fig. 5 shows the type of highly structured textures that are beyond the capabilities of the presented simple probabilistic model. Although the defect was correctly detected even in these examples, the method simultaneously detects large textile design patterns, which cannot be distinguished from the defect. A possible solution is to filter out these design artefacts using prior information such as regularity, size or spectral content.
5 Conclusions
Most published texture defect detection methods do not use multispectral information. Our method takes advantage of both the multispectral and the spatial information. The method is simple, extremely fast and robust in comparison with these alternative methods. The results of the presented method are encouraging: all simulated defects on fine-granularity textile textures were correctly localized, as were sick skin patches in a real dermatology application. The method will fail on highly structured textures due to the limited low-frequency modelling power of the underlying probabilistic model. The presented method can be easily generalized to defect detection in gradually changing textures (e.g. in illumination, colour, etc.) by exploiting its adaptive learning capabilities.
Acknowledgments
This research was supported by the EC project FP6-507752 MUSCLE, grants A2075302, 1ET400750407 of the Grant Agency of the Academy of Sciences CR and partially by the grants MŠMT 1M0572, 2C06019.
References
1. Kumar, S., Hebert, M.: Discriminative random fields: A discriminative framework for contextual interaction in classification. In: iccv03, Nice, France, pp. 1150–1157 (2003)
2. Sobral, J.L.: Optimised filters for texture defect detection. In: International Conference on Image Processing, vol. III, pp. 565–568 (2005)
3. Fujiwara, H., Zhang, Z., Toda, H., Kawabata, H.: Textile surface inspection by using translation invariant wavelet transform. In: IEEE Int. Symp. on Computational Intelligence in Robotics and Automation, vol. 3, pp. 1427–1432. IEEE, NJ, New York (2003)
4. Chetverikov, D.: Detecting defects in texture. In: International Conference on Pattern Recognition, pp. 61–63 (1988)
5. Meylani, R., Ertuzun, A., Ercil, A.: A comparative study on the adaptive lattice filter structures in the context of texture defect detection. In: ICECS ’96, vol. II, pp. 976–979 (1996)
6. Haindl, M., Šimberová, S.: A Multispectral Image Line Reconstruction Method. In: Theory & Applications of Image Analysis, pp. 306–315. World Scientific Publishing Co., Singapore (1992)
7. Kittler, J.V., Marik, R., Mirmehdi, M., Petrou, M., Song, J.: Detection of defects in colour texture surfaces. In: IAPR Workshop on Machine Vision Applications, pp. 558–567 (1994)
8. Mirmehdi, M., Marik, R., Petrou, M., Kittler, J.V.: Iterative morphology for fault detection in stochastic textures. Electronics Letters 32, 443–444 (1996)
9. Mitra, P., Murthy, C., Pal, S.: Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 301–312 (2002)
Extracting Salient Points and Parts of Shapes Using Modified kd-Trees
Christian Bauckhage
Deutsche Telekom Laboratories, 10587 Berlin, Germany
http://www.telekom.de/laboratories
Abstract. This paper explores the use of tree-based data structures in shape analysis. We consider a structure which combines several properties of traditional tree models and obtain an efficiently compressed yet faithful representation of shapes. Constructed in a top-down fashion, the resulting trees are unbalanced but resolution adaptive. While the interior of a shape is represented by just a few nodes, the structure automatically accounts for more details at wiggly parts of a shape’s boundary. Since its construction only involves simple operations, the structure provides an easy access to salient features such as concave cusps or maxima of curvature. Moreover, tree serialization leads to a representation of shapes by means of sequences of salient points. Experiments with a standard shape database reveal that correspondingly trained HMMs allow for robust classification. Finally, using spectral clustering, tree-based models also enable the extraction of larger, semantically meaningful, salient parts of shapes.
W.G. Kropatsch, M. Kampel, and A. Hanbury (Eds.): CAIP 2007, LNCS 4673, pp. 995–1002, 2007. © Springer-Verlag Berlin Heidelberg 2007
1 Motivation and Background
Shapes and silhouettes provide significant cues for the human visual system. As their impact on human perception is well documented in a long line of psychological and neurophysiological studies (cf. e.g. [1,2,3,4]), it comes as little surprise that shape processing has been extensively investigated in computer vision, too [5,6]. Although an exhaustive review of related approaches is beyond the scope of this paper, it is worthwhile pointing out an intriguing phenomenon: While established shape representation techniques such as moments, skeletons, or shock graphs [7,8,9] consider rather intuitive planar geometry, recent contributions argue for more sophisticated and more complex mathematical models [10,11]. Such rigorous approaches based on conformal mappings or Riemannian manifold embeddings are certainly appealing, for they might lead to a more general understanding of shape perception. However, with respect to the real-time constraints which are important for most practical applications of computer vision, they still seem too involved. What is needed in practice are methods that allow for fast computation and useful characterization alike.
The work presented here originated from an earlier project on gait analysis where real-time shape processing was pivotal [12]. We proposed a representation especially tailored to the needs of classification: Iterative fitting of 2D lattices copes with shapes of different topologies, is storage and time efficient, and yields consistent vector space embeddings of two-dimensional objects [13]. In [14], we demonstrated that the lattice
fitting mechanism can be interpreted as the growing of a widely branching tree of depth two. This led to an improved approach which combines properties of kd-trees [15,16], R-trees [17,18] and regression trees [19,20]. In this paper, we refer to these trees as modified kd-trees and summarize their construction in section 2. Experiments reported in [14] underlined that modified kd-trees provide efficient shape compression: even for compression rates of more than 90%, the average reconstruction error never exceeded 5%. Other experiments in [14] considered shape classification. Since the information contained in the leaf nodes may be naturally interpreted as a signature, i.e. an unordered set of weighted vectors, it allows for applying the Earth Mover’s Distance [21]. However, since this approach requires several prototypes per class and treats classification as a flow allocation problem over bipartite graphs, it is of limited real-time capability.
In this paper, we take a different point of view. Instead of translating leaf node information into signatures, we apply tree serialization and represent shapes by means of sequences of feature vectors. Consequently, efficient methods from statistical learning become applicable. In particular, in section 3, we will consider the use of Hidden Markov Models and show that they perform as well as the approach in [14] while being much faster. Moreover, compared to features extracted from usual kd-trees, features resulting from modified kd-trees yield better recognition accuracy and simultaneously decrease computation time. This is a consequence of their property of allocating more resources to the representation of salient parts of shape boundaries. Finally, in section 4, we briefly sketch how applying spectral clustering to the data stored in a modified kd-tree allows for efficiently extracting salient subparts of shapes.
2 Modified kd-Trees for Shape Representation
Towards the problem of shape representation and classification, we consider a data structure that combines properties of kd-trees, R-trees, and regression trees. Being a method for organizing data in higher dimensional spaces [15], kd-trees facilitate tasks such as nearest neighbor searches [16]. Using splitting planes that are perpendicular to one of the coordinate axes, they partition the space in a top-down manner. Originally a tool in computer aided design [17], R-trees have also become popular for indexing high dimensional data [18]. Given a clustered sample, an R-tree recursively combines the data into multidimensional minimum bounding rectangles in a bottom-up fashion. Regression trees [19,20] are used for multidimensional function approximation. They provide piecewise constant approximations by recursively partitioning a set of samples.
Given binary images such as shown throughout this paper, we assume a shape S to be a set of L pixels, $S = \{p_k \in \mathbb{R}^2 \mid k = 1, \dots, L\}$, and define |S| = L. The minimum bounding rectangle or bounding box of S is denoted by B(S); its center is μ and its width and height are called w and h, respectively. With these definitions, a binary tree containing a compact representation of S results from the procedure in Fig. 1. From the figure we see that, if |S| is less than a given threshold or if it equals the area w · h of its bounding box, the current tree node is a leaf. Otherwise, B(S) will be split at a point c. Note that at this point of the procedure different approaches may apply [16]. For the results reported below, we proceeded as follows: if the box’s width
BuildTree(S, l)
  compute the bounding box B(S) and its corresponding μ, w, and h
  create a new node N
  N.data = (μ, w, h)
  if |S| ≤ threshold or |S| = w · h or l = l_max
    N.leftchild = ∅
    N.rightchild = ∅
  else
    determine split dimension d ∈ {x, y}
    determine split point c
    determine sets S1 = {p ∈ S | p_d < c_d} and S2 = {p ∈ S | p_d ≥ c_d}
    N.leftchild = BuildTree(S1, l + 1)
    N.rightchild = BuildTree(S2, l + 1)
  return N
Fig. 1. Recursive procedure to compute a tree-based representation of a shape S
[Panels (a)–(f): tree depth l = 0, 1, 2, 3, 4, 8]
Fig. 2. Recursive growth of a modified kd-tree; with an increasing depth l, the collection of bounding boxes stored in the tree better approximates the shape. Also, ceasing to sprout further subtrees once a homogeneous region has been reached provides an efficient representation.
exceeds its height, the pixel set S is split into a left and a right part; if the height exceeds the width, S is split into an upper and a lower part. In both cases, the median of the pixel distribution along the splitting direction serves as the split point and subtrees are computed for the resulting subsets. This process recurses until no further split is possible or a desired depth level is reached.
Given the repeated computation of bounding boxes, the algorithm’s relation to R-trees is obvious. Its ties to regression trees are due to the condition that determines leaf nodes. If |S| = wh, S completely fills its bounding box and the splitting process has produced a piecewise constant region. Finally, as for usual kd-trees, splits are performed along the coordinate axes. Contrary to the unmodified case, however, the termination condition in the modified procedure yields storage-efficient albeit unbalanced trees. Figures 2 and 3 exemplify this effect. While the former shows the growth of a modified kd-tree, the latter shows the approximation of a shape by means of a usual kd-tree. The modified version stops sprouting subtrees once the splitting process has reached a homogeneous, most likely interior region of the shape. The usual procedure, however, keeps splitting sets of pixels as long as they contain more than one element. Consequently, shape representation based on usual kd-trees will inevitably produce $2^{l_{max}}$ bounding boxes and cannot achieve compression rates as reported in [14].
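The procedure of Fig. 1 together with the split rule just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used for the reported results: the names `build_tree`, `bounding_box`, `leaves`, and `min_size` are hypothetical, pixels are assumed to be distinct integer (x, y) tuples, ties between width and height are split along y, and the split point is moved to the next distinct coordinate whenever the median coincides with the smallest one, so that neither subset is empty.

```python
def bounding_box(pixels):
    """Return centre mu, width w and height h (in pixels) of the bounding box."""
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    w = max(xs) - min(xs) + 1
    h = max(ys) - min(ys) + 1
    mu = ((min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0)
    return mu, w, h


def build_tree(pixels, level=0, l_max=8, min_size=1):
    """Recursive construction in the spirit of Fig. 1; nodes are plain dicts."""
    mu, w, h = bounding_box(pixels)
    node = {"mu": mu, "w": w, "h": h, "left": None, "right": None}
    # Leaf conditions: few pixels, completely filled box, or maximum depth.
    if len(pixels) <= min_size or len(pixels) == w * h or level == l_max:
        return node
    d = 0 if w > h else 1            # assumption: ties (w == h) are split along y
    coords = sorted(p[d] for p in pixels)
    c = coords[len(coords) // 2]     # median along the split direction
    if c == coords[0]:
        # assumption: shift the split to the next distinct value so S1 is not empty
        c = next(v for v in coords if v > coords[0])
    s1 = [p for p in pixels if p[d] < c]
    s2 = [p for p in pixels if p[d] >= c]
    node["left"] = build_tree(s1, level + 1, l_max, min_size)
    node["right"] = build_tree(s2, level + 1, l_max, min_size)
    return node


def leaves(node):
    """Collect all leaf nodes in pre-order."""
    if node["left"] is None:
        return [node]
    return leaves(node["left"]) + leaves(node["right"])


# Toy shape: a 4x4 square with one 2x2 corner removed.
shape = [(x, y) for x in range(4) for y in range(4) if not (x >= 2 and y >= 2)]
tree = build_tree(shape)
boxes = [(n["mu"], n["w"], n["h"]) for n in leaves(tree)]
```

In this toy case every leaf box is completely filled, so the leaf areas add up to the number of pixels of the shape, which mirrors the piecewise constant approximation discussed above.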
[Panels (a)–(f): tree depth l = 0, 1, 2, 3, 4, 8]
Fig. 3. Recursive growth of a usual kd-tree; the procedure does not stop once a homogeneous region has been reached but continues and thus yields a balanced but less efficient representation
[Panels (a)–(f): l = 4, 5, 6, 7, 8, 11; panels (g)–(l): l = 4, 5, 6, 7, 8, 11]
Fig. 4. Center points of bounding boxes in shape approximating trees of depth l = 4, 5, . . . , 11. (a) – (f) points obtained from a usual kd-tree; (g) – (l) points obtained from a modified version.
Figs. 4 and 5 illustrate another consequence of the different splitting behaviors. Both figures plot the center points $\{\mu^i\}_{i=1,\dots,N}$ of the bounding boxes assigned to the leaves on the deepest level $l_{max}$ of a tree, where $l_{max}$ varies from 4 to 11. For common kd-trees, the number of leaves grows exponentially and all leaves always reside on the deepest level. Accordingly, for an increasing value of $l_{max}$, the set of points $\{\mu^i\}_{i=1,\dots,N}$ almost densely covers the underlying shape (see Figs. 4(f) and 5(f)). Concerning modified kd-trees, however, the number N of leaves found on the deepest level first increases but then decreases again (see Figs. 4(k)–(l) and 5(k)–(l)). In our experiments, we observed N to peak for $l_{max}$ ≈ 8. Moreover, since the creation of subtrees stops if a bounding box covers a homogeneous region, bounding boxes found on deeper levels tend to be located at the shape boundary. This is simply because, compared to the interior, it generally requires many more splits until homogeneous rectangular regions are found at the boundary. What is more, if one keeps increasing $l_{max}$, bounding boxes attached to leaf nodes on deeper levels automatically concentrate at the least homogeneous parts of the boundary, which correspond to concave cusps or points of high curvature.
This is an intriguing property of modified kd-trees, since points of concavity or high curvature play a crucial role in how humans perceive shapes [3]. In a series of experiments, we therefore investigated whether salient features automatically extracted from modified kd-trees can offer benefits for automatic shape recognition as well. The results reported next suggest that this is indeed the case.
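Collecting the center points of the leaves on the deepest level amounts to a simple recursive traversal. The following sketch illustrates this on a small hand-built tree; the dictionary layout with keys `mu`, `left`, and `right` and the helper names are assumptions made for illustration only, and the function returns the leaves of the deepest level actually present in the tree, which coincides with level $l_{max}$ only if the tree reaches that depth.

```python
def deepest_leaf_centres(node, level=0):
    """Return (depth, centres): centres of the leaves on the deepest level, in pre-order."""
    if node["left"] is None and node["right"] is None:
        return level, [node["mu"]]
    d1, c1 = deepest_leaf_centres(node["left"], level + 1)
    d2, c2 = deepest_leaf_centres(node["right"], level + 1)
    if d1 == d2:
        return d1, c1 + c2           # both subtrees reach the same depth
    return (d1, c1) if d1 > d2 else (d2, c2)


def leaf(mu):
    """Hypothetical helper: a leaf node storing only its box centre."""
    return {"mu": mu, "left": None, "right": None}


def inner(mu, left, right):
    """Hypothetical helper: an inner node with two children."""
    return {"mu": mu, "left": left, "right": right}


# Unbalanced toy tree: only the right-most branch reaches depth 3.
toy = inner((1.5, 1.5),
            leaf((1.5, 0.0)),
            inner((1.5, 2.0),
                  leaf((0.0, 2.0)),
                  inner((2.0, 2.0), leaf((2.0, 1.0)), leaf((1.0, 2.5)))))
depth, centres = deepest_leaf_centres(toy)
```

Because shallow leaves are discarded, only the boxes that required the most splits survive, which is the mechanism that concentrates the selected points at the least homogeneous parts of the boundary.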
[Panels (a)–(f): l = 4, 5, 6, 7, 8, 11; panels (g)–(l): l = 4, 5, 6, 7, 8, 11]
Fig. 5. Another example of the effect demonstrated in Fig. 4
3 HMM-Based Shape Recognition
In this section, we describe the use of Hidden Markov Models for shape recognition. This idea is motivated by the fact that data stored in a tree can be consistently serialized and thus transformed into a sequence of feature vectors. Since modified kd-trees provide very good approximations of shapes, sequences thus obtained should distinctly characterize shapes. In our experiments, we serialized the trees considered in this paper using pre-order traversal. Collecting the centroids of the bounding boxes stored in the leaves on the deepest level provided us with sequences of 2D vectors $\{\mu^i\}_{i=1,\dots,N}$. These were transformed into a canonical, scale invariant representation using the transformation
$$\tilde{\mu}^i = \begin{pmatrix} (\mu^i_x - v_x)/w \\ (\mu^i_y - v_y)/h \end{pmatrix} \qquad (1)$$
where v indicates the location of the bottom left corner of the initial bounding box of a shape S and w and h denote its width and height, respectively. We considered a standard testbed for shape matching introduced by Kimia et al. [8] and focused on six classes of animal shapes. For each class, a third of the shapes was
[Plot: average recognition rate (vertical axis, 0.6 to 0.9) versus depth of tree (horizontal axis, 3 to 11) for the modified kD tree and the classical kD tree]
Fig. 6. Recognition rates in shape classification based on features extracted from tree models
Table 1. Confusion matrix resulting from using features extracted from a modified kd-tree of depth 11
            bird   cat    dog    elephant   horse   rabbit
bird        1.00   0.00   0.00   0.00       0.00    0.00
cat         0.00   0.94   0.00   0.06       0.00    0.00
dog         0.05   0.00   0.72   0.00       0.23    0.00
elephant    0.00   0.00   0.00   1.00       0.00    0.00
horse       0.00   0.00   0.33   0.00       0.67    0.00
rabbit      0.00   0.00   0.00   0.00       0.00    1.00
used for training a corresponding HMM. In order to provide sufficient training data, we increased each training set by applying modest affine transformations to the shapes, leading to a total of 600 training samples. For testing, we considered two thirds of the shapes which, after correspondingly increasing the set, led to a total of 850 test samples. In our implementation, we made use of Kevin Murphy’s HMM toolbox¹. For the results reported here, we considered HMMs of Q = 8 states with a mixture of 2 Gaussians per state. Since the mixtures were learned using randomly seeded expectation maximization, we report recognition rates averaged over 5 trials.
Figure 6 plots recognition rates with respect to the maximum depth level of the trees used for shape representation. Generally, for increasing values of the maximum depth $l_{max}$, the features extracted from modified kd-trees yield higher accuracies than the ones obtained from usual trees. This is noteworthy since usual kd-trees provide many more feature points, most of which, however, cover the shapes’ interior (see again Figs. 4 and 5). According to the figure, the considerably smaller sets of points resulting from modified kd-trees seem to contain much less redundant information. Also, for the features extracted from modified kd-trees we notice two peaks in the recognition rates, achieved for $l_{max}$ = 8 and $l_{max}$ = 11. This is interesting, for these depth levels provide a rather dense sampling of the shape’s boundary or a set of highly salient points, respectively (see again Figs. 4 and 5).
In the light of recognition rates of 100% reported in recent contributions on HMM-based shape classification [22], we feel urged to point out that, in the experiments reported here, we dealt with shapes of high inter-class similarity. This is corroborated by Tab. 1, which shows the confusion matrix obtained from classification based on modified kd-trees of maximum depth $l_{max}$ = 11.
From the table it appears that visually distinct shapes such as those of birds, rabbits, and elephants are recognized with an accuracy of 100%. Visually more ambiguous shapes such as those of cats, dogs, and horses are more often confused, thus decreasing the overall performance to about 80%. This, however, accords with other recent results on HMM-based shape recognition, which also considered ambiguous silhouettes [23]. Finally, note that an overall recognition rate of 80% corresponds to the results obtained from the more involved approach in [14]. However, whereas classification based on the Earth Mover’s Distance is of complexity $O(L^3)$, where L denotes the sequence length, using an HMM for sequence classification requires only $O(LQ^2)$, where Q counts the number of states and $Q \ll L$. Compared to [14], HMM-based shape recognition thus gains two orders of magnitude in terms of processing time.
¹ http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
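To make the sequence-based classification of this section more concrete, the following sketch combines the normalization of Eq. (1) with log-likelihood scoring by the forward algorithm, whose cost of O(LQ²) per sequence is the quantity referred to above. It is not the toolbox implementation used in the experiments: all function names are hypothetical, each state emits from a single isotropic 2D Gaussian instead of a mixture of two Gaussians, and the model parameters of the toy example are set by hand rather than learned by expectation maximization.

```python
import math


def normalise(centres, v, w, h):
    """Eq. (1): shift by the bottom-left corner v, scale by width w and height h."""
    return [((x - v[0]) / w, (y - v[1]) / h) for x, y in centres]


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(x - m) for x in values))


def log_gauss(x, mean, var):
    """Log density of an isotropic 2D Gaussian (simplifying assumption)."""
    d2 = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return -math.log(2.0 * math.pi * var) - d2 / (2.0 * var)


def forward_loglik(seq, log_pi, log_A, means, var):
    """Forward algorithm in the log domain: O(L * Q^2) for L points and Q states."""
    Q = len(log_pi)
    alpha = [log_pi[q] + log_gauss(seq[0], means[q], var) for q in range(Q)]
    for x in seq[1:]:
        alpha = [logsumexp([alpha[p] + log_A[p][q] for p in range(Q)])
                 + log_gauss(x, means[q], var) for q in range(Q)]
    return logsumexp(alpha)


def classify(seq, models):
    """Return the class whose HMM assigns the highest log-likelihood to seq."""
    return max(models, key=lambda name: forward_loglik(seq, *models[name]))


# Two hand-set toy models with Q = 2 states and uniform transitions.
log_half = math.log(0.5)
uniform = ([log_half, log_half], [[log_half, log_half], [log_half, log_half]])
models = {
    "left": (*uniform, [(0.2, 0.2), (0.2, 0.8)], 0.01),
    "right": (*uniform, [(0.8, 0.2), (0.8, 0.8)], 0.01),
}
seq = normalise([(12.0, 5.0), (13.0, 18.0)], v=(10.0, 0.0), w=10.0, h=20.0)
label = classify(seq, models)
```

One model per class is scored and the best one wins, so classifying a sequence costs one forward pass per class, in contrast to the flow allocation problem that has to be solved for every prototype when signatures are compared.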
4 Summary and Outlook
This paper investigated the use of modified kd-trees for shape encoding and classification. Trees obtained from recursive, axis-perpendicular bounding box splitting provide a representation that is resolution adaptive and thus storage efficient. Since the structure automatically accounts for geometric details at a shape’s boundary, it provides information on salient features such as concave cusps or maxima of curvature. In addition, since the derivation of these trees only relies on basic computational geometry and does not require complex optimization steps, the approach is also time efficient. Given a tree representation of a shape, tree traversal leads to shape representations by means of sequences of salient points. Experimental results obtained on a standard shape database indicate that correspondingly trained HMMs enable robust classification, even if there are considerable inter-class similarities.
Currently, we explore a different use of the information contained in the trees considered in this paper. First experiments demonstrate that, in addition to shape classification, the shape partitionings contained in a tree representation allow for clustering shapes into semantically meaningful, salient subparts. Figure 7 shows an example of such clusters, which were determined automatically. The bounding boxes in this example resulted from constructing a modified kd-tree of depth $l_{max}$ = 11. From the matrix of geometric adjacencies of the boxes, we computed a similarity matrix where the similarity of two boxes was a function of the shortest path between them. We then applied spectral clustering using the algorithm by Ng et al. [24]. The resulting clusters apparently correspond to the cat’s body parts. Again, producing this result did not require involved geometric reasoning or optimization.
We attribute the “salient part finding” ability of this simple approach to the fact that modified kd-trees emphasize the representation of salient parts of shape boundaries. Therefore, relations among corresponding bounding boxes abound in the similarity matrix and influence its eigenvector decomposition. Similar to human perception [3], concave cusps and points of high curvature thus guide the segmentation into parts. In future work, we will exploit this in devising part-based recognition algorithms.
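The clustering step outlined in this section can be sketched as follows. The paper only states that the similarity of two boxes is a function of the shortest path between them, so the Gaussian of the hop count used here, the parameter `sigma`, the Floyd-Warshall computation, and the deterministic farthest-point initialization of k-means are assumptions for illustration; the spectral part follows the steps of the algorithm by Ng et al. [24] (normalized affinity, leading eigenvectors, row normalization, k-means). The box adjacency graph is assumed to be connected.

```python
import numpy as np


def hop_distances(adj):
    """All-pairs shortest path lengths (in hops) via Floyd-Warshall."""
    dist = np.where(np.asarray(adj) > 0, 1.0, np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(len(dist)):
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    return dist


def spectral_clusters(adj, k, sigma=1.0, iters=20):
    """Cluster graph nodes into k groups; returns one label per node."""
    dist = hop_distances(adj)
    A = np.exp(-dist ** 2 / (2.0 * sigma ** 2))   # assumed similarity function
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1)
    L = A / np.sqrt(np.outer(d, d))               # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    X = vecs[:, -k:]                              # eigenvectors of the k largest
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Deterministic farthest-point initialisation, then plain Lloyd iterations.
    centres = [X[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d2)])
    centres = np.array(centres)
    for _ in range(iters):
        sq = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels


# Toy adjacency: two groups of four mutually adjacent boxes, joined by one link.
adj = np.zeros((8, 8), dtype=int)
for group in (range(0, 4), range(4, 8)):
    for i in group:
        for j in group:
            if i != j:
                adj[i, j] = 1
adj[3, 4] = adj[4, 3] = 1
labels = spectral_clusters(adj, k=2)
```

In the toy graph the single link between the two groups plays the role of a narrow connection between parts of a shape, and the two densely connected groups end up in different clusters.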
[Panels (a)–(f)]
Fig. 7. Clusters resulting from spectral clustering of the bounding box adjacency matrix
References
1. Wertheimer, M.: Untersuchungen zur Lehre von der Gestalt I. Psychologische Forschung 1, 47–58 (1922)
2. Zusne, L.: Visual Perception of Form. Academic Press, New York (1970)
3. Hoffman, D.D.: Visual Intelligence. W.W. Norton, New York (1998)
4. Tarr, M.J., Bülthoff, H.H.: Object Recognition in Man, Monkey, and Machine. MIT Press, Cambridge (1999)
5. Loncaric, S.: A survey of shape analysis techniques. Pattern Recogn. 31(8), 983–1001 (1998)
6. Costa, L., Cesar, R.M.: Shape Analysis and Classification. CRC Press, Boca Raton (2000)
7. Jain, A.K., Vailaya, A.: Shape-based retrieval: A case study with trademark image databases. Pattern Recogn. 31(9), 1369–1390 (1998)
8. Sharvit, D., Chan, J., Tek, H., Kimia, B.B.: Symmetry-based indexing of image databases. J. Vis. Commun. Image R. 9(4), 366–380 (1998)
9. Siddiqi, K., Shokoufandeh, A., Dickinson, S., Zucker, S.: Shock graphs and shape matching. Int. J. Comput. Vision 35(1), 13–32 (1999)
10. Klassen, E., Srivastava, A., Mio, W., Joshi, S.H.: Analysis of planar shapes using geodesic paths on shape space. IEEE Trans. Pattern Anal. Machine Intell. 26(3), 372–383 (2004)
11. Sharon, E., Mumford, D.: 2d-shape analysis using conformal mapping. In: Proc. CVPR, vol. II, pp. 350–357 (2004)
12. Bauckhage, C., Tsotsos, J., Bunn, F.: Automatic detection of abnormal gait. Image Vision Comput. (to appear 2007)
13. Bauckhage, C., Tsotsos, J.: Bounding box splitting for robust shape classification. In: Proc. ICIP, vol. 2, pp. 478–481 (2005)
14. Bauckhage, C.: Tree-based signatures for shape classification. In: Proc. ICIP, pp. 2105–2108 (2006)
15. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Comm. of the ACM 18(9), 509–517 (1975)
16. Kubica, J., Masiero, J., Moore, A., Jedicke, R., Connolly, A.: Variable kd-tree algorithms for spatial pattern search and discovery. In: Proc. NIPS, pp. 691–698 (2005)
17. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: Proc. ACM SIGMOD Int. Conf. on Management of Data, pp. 47–57 (1984)
18. Manolopoulos, Y., Nanopoulos, A., Papadopoulos, A.N., Theodoridis, Y.: R-Trees: Theory and Applications. Springer, Heidelberg (2005)
19. Morgan, J., Sonquist, J.: Problems in the analysis of survey data and a proposal. J. Am. Stat. Assoc. 58, 415–434 (1963)
20. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth (1984)
21. Rubner, Y., Tomasi, C., Guibas, L.: The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vision 40(2), 99–121 (2000)
22. Bicego, M., Murino, V.: Investigating hidden markov models’ capabilities in 2d shape classification. IEEE Trans. Pattern Anal. Machine Intell. 26(2), 281–286 (2004)
23. Thakoor, N., Gao, J.: Shape classifier based on generalized probabilistic descent method with hidden markov descriptor. In: Proc. ICCV, vol. 1, pp. 495–502 (2005)
24. Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Proc. NIPS, pp. 849–856 (2001)
Author Index
Abbott, Derek 878
Acevedo, Daniel 895
Ahmad, Munir 270
Altamirano, Leopoldo 117
Álvarez–Franco, Fernando J. 694
Anderson, Mats 229
Ar, Ilktan 555
Arribas, Juan Ignacio 189, 768
Athanasiadis, Emmanouil 864
Atkinson, Gary A. 466
Attia, Mohamed 522
Azpiroz, Fernando 293
Baba, Takayuki 954
Bach, Meritxell 229
Bala, Piotr 245
Batenburg, K. Joost 563
Bauckhage, Christian 995
Beck, Cornelia 53
Ben-Shahar, Ohad 13
Benavent, Xaro 895
Béréziat, Dominique 482
Bezerianos, Anastasios 197
Bican, Jakub 751
Bigun, Josef 360
Bock, Rüdiger 165
Boluda, Jose A. 77
Borgefors, Gunilla 253
Bougioukos, Panagiotis 197, 221
Bultelle, Matthieu 229
Cárdenes, Rubén 229
Cashman, Peter 229
Castelán, Mario 399
Castellanos, Germán 840
Cavouras, Dionisis 197, 221, 864
Chi, Ying 229
Chlebiej, Michal 245
Compañ, Patricia 141
Constantinopoulos, Constantinos 596
Costaridou, Lena 237
Danev, Boris 809
Danışman, Kenan 366
Daskalakis, Antonis 197, 221
de Luis, Rodrigo 229
De Neve, Wesley 848
De Schrijver, Davy 848
De Silva, P. Ravindra S. 326
de Ves, Esther 895
De Wolf, Koen 848
De Zutter, Saar 848
Debled-Rennesson, Isabelle 474
Dehzangi, Abdollah 970
Dehzangi, Omid 970
Deinzer, Frank 759
Delmas, Patrice 776
Dickens, Matthew P. 342
Ding, Feng 205
Ditrich, Frank 490
Drakopoulos, Vassileios 645
Droege, Detlev 817
Dubuisson, Séverine 482
Eastwood, Brian 125
Ebner, Marc 441
El-Mahallawy, Mohamed 522
Emms, David 823
Entacher, Karl 36
Entrena, Luis 391
Erat, Murat 366
Ergün, Salih 366
Faas, Frank G.A. 718
Faraj, Maycel Isaac 360
Feldmann, Tobias 759
Ferguson, Bradley 878
Fernández-Nofrerías, Eduard 285
Fleischer, Stefan 921
Florea, Laura 278
Flusser, Jan 450, 751, 856
Franke Stenport, Victoria 253
Fritts, Jason 587
Gallardo–Caballero, Ramón 694
Gallego, Antonio Javier 141
Garcia, Esteban 117
Garcia, Miguel Angel 912
García–Orellana, Carlos J. 694
Gedda, Magnus 173
Georgiadis, Pantelis 197
Ghita, Ovidiu 374
Giannarou, Stamatia 710
Gil, Debora 213
Gimel’farb, Georgy 776, 979
Gómez, Francisco 181
Gomez Agis, Jacobo 903
Gonzàlez, Jordi 45
Gonzalez-Diaz, Rocio 506
González–Velasco, Horacio M. 694
Gottbehuet, Thomas 53
Grazzini, Jacopo 636, 742
Grest, Daniel 101
Grim, Jiří 987
Guan, De-Jun 539
Haindl, Michal 987
Hajder, Levente 109, 498
Hanbury, Allan 424
Hancock, Edwin R. 342, 399, 466, 823
Hansegård, Jøger 157
Haxhimusa, Yll 653
He, Joshua Congfu 962
Heiss-Czedik, Dorothea 514
Hernàndez, Aura 213
Hernández, Jorge 416
Heynderickx, Ingrid 334
Higashi, Masatake 326
Hlaváč, Václav 93, 929
Ho, Wai Han 351
Hong, Xin 604
Hornegger, Joachim 165
Horváth, Peter 702
Howe, Tet Sen 205, 962
Hsu, Wen-Hsing 937
Huang, Hu-Jie 539
Huang, Hui-Yu 937
Huang, Zhuan Qing 69
Hubený, Jan 309
Huber-Mörk, Reinhold 514
Hugueny, Samuel 317
Igual, Laura 293
Imiya, Atsushi 61
Ion, Adrian 653
Iregui, Marcela 181
Jafari Moghadam, Pooria 871
Jermyn, Ian H. 702
Jiang, Xiaoyi 921
Jiang, Zhuhan 69
Jiménez, María José 506
Johansson, Carina B. 253
Kagadis, George 221
Kalatzis, Ioannis 197, 221, 864
Kalogeropoulou, Christina 237
Kameda, Yusuke 61
Kamei, Toshio 809
Kampel, Martin 547
Kanak, Alper 366, 383
Karsligil, M. Elif 555
Kašík, Marek 309
Kasparek, Tomas 301
Kautsky, Jaroslav 856
Kayaoglu, Mehmet 366
Kazó, Csaba 498
Kerdels, Jochen 612
Kittler, Josef 929
Klette, Reinhard 661, 726
Kober, Vitaly 903
Korfiatis, Panayiotis 237
Kostopoulos, Spiros 197, 221
Křížek, Pavel 929
Kropatsch, Walter G. 653
Kršek, Přemysl 261
Krueger, Volker 101
Kubias, Alexander 759
Kumar, Dinesh Kant 832
Lambacher, Stephen G. 326
Lambert, Peter 28, 848
Lefèvre, Sébastien 579
Lemaître, Cédric 686
Lenz, Christian 36
Leow, Wee Kheng 205, 962
Letscher, David 587
Li, Fajie 661, 726
Li, Gang 13
Likas, Aristidis 596
Lindblad, Joakim 253
Lindoso, Almudena 391
Liu, Hantao 334
Liu, Rujie 954
Liu-Jimenez, Judith 391
Macías–Macías, Miguel 694
Maggini, Marco 408
Manousopoulos, Polychronis 645
Marcotegui, Beatriz 424
Markowska-Kaczmar, Urszula 531
Marras, Ioannis 229
Martens, Gaëtan 28
Martínez, Fernando 840
Maška, Martin 309
Masumoto, Daiki 954
Matas, Jiří 686
Mauri, Josepa 285
Mayer, Konrad 514
Mazurek, Przemyslaw 149
McClean, Sally 604
Medrano, Belen 506
Meier, Jörg 165
Melacci, Stefano 408
Melendez, Jaime 912
Michelson, Georg 165
Mikeš, Stanislav 987
Mirzaei, Abdolreza 734, 784
Miteran, Johel 686
Mohr, Daniel 432
Molina, Rafael 141
Moradi, Mohamad H. 871
Morris, John 776
Morrow, Philip 604
Nadarajan, Gayathri 133
Nagata, Shigemi 954
Neumann, Heiko 53
Ng, Brian W.-H. 878
Nguyen, Thanh Phuong 474
Nhung, Ngo Phuong 945
Nikiforidis, George 197, 221, 864
Nowiński, Krzysztof 245
Nyúl, László G. 165
Orderud, Fredrik 157
Orozco, Javier 45
Ortiz, Francisco 458
Osano, Minetada 326
Palágyi, Kálmán 628
Pardo, Fernando 77
Paulus, Dietrich 759, 817
Peltier, Samuel 653
Penz, Harald 514
Pernek, Ákos 498
Peters, Gabriele 612
Phuong, Tu Minh 945
Pizlo, Zygmunt 1
Poppe, Chris 28
Prieto, Flavio 416
Puig, Domenec 912
Raamana, Pradeep Reddy 101
Rabben, Stein I. 157
Radeva, Petia 213, 285, 293
Rahmati, Mohammad 734
Ramirez, Berenice 801
Ravazoula, Panagiota 221
Real, Pedro 506
Rekik, Wafa 482
Renouf, Arnaud 133
Rink, Karsten 571
Roca, F. Xavier 45
Rodriguez-Leor, Oriol 213
Romero, Eduardo 181
Rosin, Paul L. 620, 677
Rotger, David 285
Rousson, Mikaël 317
Ruedin, Ana 895
Safabakhsh, Reza 784
Sakellaropoulos, Philippos 237
Salazar, Augusto 416, 840
Sánchez, Luis 840
Sánchez-Ferrero, Gonzalo V. 189
Sarti, Lorenzo 408
Sarve, Hamid 253
Sas, Jerzy 531
Schwarz, Daniel 301
Ścislo, Piotr 245
Scotney, Bryan 604
Seguí, Santi 293
Seidel, Martin 36
Seijas, Leticia 895
Shih, Weir-Sheng 937
Shorin, Al 776
Sijbers, Jan 563
Šiler, Ondřej 261
Skiadopoulos, Spyros 237
Soğukpınar, İbrahim 383
Soille, Pierre 636, 742
Song, Ge 669
Šorel, Michal 450
Španěl, Michal 261
Spyridonos, Panagiota 864
Šroubek, Filip 856
Štancl, Vít 261
Stathaki, Tania 710
Ştefan, Ion 278
Stejskal, Stanislav 309
Stricker, Didier 20
Su, Tong-Hua 539
Suesse, Herbert 490
Sutherland, Alistair 374
Svensson, Stina 173
Svoboda, David 309
Švub, Miroslav 261
Taheri, Shahram 970
Taylor II, Russell M. 125
Theoharis, Theoharis 645
Thurau, Christian 93
Todd-Pokropek, Andrew 270
Tönnies, Klaus 571
Tristán, Antonio 768
Uehara, Yusuke 954
Uhl, Andreas 36
Van de Walle, Rik 28, 848
van Vliet, Lucas J. 718
Verity, Dominic 351
Vertan, Constantin 278
Villagrá, Carlos 141
Vitrià, Jordi 293
Vrabl, Andreas 514
Wachenfeld, Steffen 921
Wang, Hong 669
Wang, Yang 85
Wang, Yuehong 954
Watters, Paul 351
Weghorn, Hans 832
Weiglmaier, Rudolf 36
Whelan, Paul F. 374
Wientapper, Folker 20
Wild, Marie 886
Wilson, Richard C. 823
Wuest, Harald 20
Yau, Wai Chee 832
Yin, Xiao-Xia 878
Yu, Dahai 374
Zaboli, Hamidreza 734, 784
Zachmann, Gabriel 432
Zaharieva, Maia 547
Zambanini, Sebastian 547
Zhang, Tian-Wen 539
Zheng, Guoyan 792
Zhou, Dongxiao 979
Zimmermann, Michal 309
Zolghadri, Mansoor J. 970
Zucker, Steven W. 13
Žunić, Joviša 677