Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5096
Gerhard Rigoll (Ed.)
Pattern Recognition 30th DAGM Symposium Munich, Germany, June 10-13, 2008 Proceedings
Volume Editor Gerhard Rigoll Technische Universität München Lehrstuhl für Mensch-Maschine-Kommunikation Arcisstraße 21, 80333 Munich, Germany E-mail: [email protected]
Library of Congress Control Number: 2008928567
CR Subject Classification (1998): I.5, I.4, I.3.5, I.2.10, I.2.6, F.2.2
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-540-69320-3 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-69320-8 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2008 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12279227 06/3180 543210
Preface
This year, 2008, we had a very special Annual Symposium of the Deutsche Arbeitsgemeinschaft für Mustererkennung (DAGM) in Munich, and there are several reasons for that. First of all, this year was the 30th anniversary of the symposium. This means that the first symposium was organized in 1978 and the location of this event was: Munich! Just two years before, in 1976, the DAGM was founded in: Munich! And Munich was also the location of two further DAGM symposia, in 1991 and in 2001. When I attended the conference in 2001, I was in negotiations for my appointment to the Chair of Human–Machine Communication at the Technische Universität München (TUM) and certainly I did not at all anticipate that I would have the pleasure and honor to host this conference just seven years later again in Munich for its 30th anniversary. But special dates are not the only reason why DAGM was somewhat different this time. This year, DAGM was organized in conjunction with Automatica, the Third International Trade Fair for Automation in Assembly, Robotics, and Vision, one of the world's leading fairs in automation and robotics. This was an ideal platform for the exchange of ideas and people between the symposium and the fair, and the conference thus took place in a somewhat unusual but extraordinary location, the International Congress Center (ICM), in the direct vicinity of the New Munich Trade Fair Center, the location of the Automatica fair. With free access to Automatica, the registrants of DAGM got the opportunity to make full use of all the synergy effects associated with this special arrangement. Coming one last time back to dates, this synchronization with Automatica also resulted in yet another unusual fact, namely, that the DAGM symposium was moved this year from its usual time slot in early September to mid-June. The fact that the conference was organized by the Institute for Human–Machine Communication of Technische Universität München (TUM) shows the strong relation between pattern recognition and human–computer interaction (HCI). While HCI is a relatively broad field and contains several areas with only little interconnection to pattern recognition, several core areas of HCI make heavy use of pattern recognition algorithms, especially in the perception of audio and visual information for humans. Moreover, machine learning techniques are very important for adaptive user interfaces as well as for most recognition methods which, especially in human–machine communication, often rely on statistical approaches with parameter learning from very large databases. The technical program covered all aspects of pattern recognition as in recent years and consisted of oral presentations and poster contributions, which were treated equally and were given the same number of pages in the proceedings as usual. Each section is devoted to one specific topic and contains all oral and poster papers for this topic, sorted alphabetically by first author. One further
point that also made this year's DAGM special was the unusually low acceptance rate of only 39%; therefore all authors of the papers in this volume can indeed be pleased, since the inclusion of their contribution has been the product of a very strict paper selection process. We therefore congratulate these authors again on their extraordinary success and of course have to thank them, but also all other authors who submitted a paper to DAGM 2008, because the overall quality of all these submitted papers was the true basis for this highly selective review process. Sincere thanks also have to be expressed to our invited speakers, the reviewers from the Program Committee and all people involved in the organization of this event, especially the members of our Institute for Human–Machine Communication and the colleagues from the Munich Trade Fair Center who were involved in the local arrangements in conjunction with Automatica. Last but not least, our sponsors OLYMPUS Europe Foundation Science for Life, MVTec Software GmbH, Continental, SMI SensoMotoric Instruments, STEMMER IMAGING, Siemens, and PCO Imaging deserve our gratitude for their helpful support, which contributed to several awards at the conference and reasonable registration fees. We are happy that the Annual Symposium of DAGM was back in Munich to celebrate its 30th birthday and that we were able to contribute to this event before passing the baton on to Jena for DAGM in 2009.

June 2008
Gerhard Rigoll
Organization
Program Committee 2008

T. Aach (RWTH Aachen)
H. Bischof (TU Graz)
J. Buhmann (ETH Zürich)
H. Bunke (Universität Bern)
H. Burkhardt (Universität Freiburg)
D. Cremers (Universität Bonn)
J. Denzler (Universität Jena)
G. Fink (Universität Dortmund)
W. Förstner (Universität Bonn)
U. Franke (Daimler)
D. Gavrila (Daimler)
H.-M. Gross (TU Ilmenau)
F. A. Hamprecht (Universität Heidelberg)
J. Hornegger (Universität Erlangen)
B. Jähne (Universität Heidelberg)
X. Jiang (Universität Münster)
R. Koch (Universität Kiel)
U. Koethe (Universität Heidelberg)
W.-G. Kropatsch (TU Wien)
T. Martinetz (Universität Lübeck)
H. Mayer (BW-Universität München)
B. Mertsching (Universität Paderborn)
R. Mester (Universität Frankfurt)
B. Michaelis (Universität Magdeburg)
K.-R. Müller (Fraunhofer FIRST)
H. Ney (RWTH Aachen)
K. Obermayer (TU Berlin)
H. Ritter (Universität Bielefeld)
G. Ruske (TU München)
G. Sagerer (Universität Bielefeld)
B. Schiele (Universität Darmstadt)
C. Schnörr (Universität Mannheim)
B. Schölkopf (MPI Tübingen)
G. Sommer (Universität Kiel)
C. Stiller (Universität Karlsruhe)
T. Vetter (Universität Basel)
F. M. Wahl (Universität Braunschweig)
J. Weickert (Universität des Saarlandes)
Local Organizing Committee 2008

General Chair
Gerhard Rigoll (TU München, Germany)

General Co-chairs
Hugo Fastl (TU München, Germany)
Bernd Radig (TU München, Germany)
Gudrun Klinker (TU München, Germany)
Gerhard Hirzinger (DLR Oberpfaffenhofen, Germany)

Technical Program
Frank Wallhoff (TU München, Germany)

Sponsoring, Awards and Proceedings
Björn Schuller (TU München, Germany)

Registration
Jürgen Gast (TU München, Germany)
Florian Eyben (TU München, Germany)
Martin Wöllmer (TU München, Germany)

Tutorials
Benedikt Hörnler (TU München, Germany)

Poster Session
Joachim Schenk (TU München, Germany)

Web and Print Design, IT Services

Local Organization
Claus von Rücker (TU München, Germany)
Peter Brand (TU München, Germany)
Stefan Schwärzler (TU München, Germany)
Dejan Arsic (TU München, Germany)
Andre Störmer (TU München, Germany)
Sascha Schreiber (TU München, Germany)
Klaus Laumann (TU München, Germany)
Tony Poitschke (TU München, Germany)
Table of Contents
Learning and Classification

MAP-Inference for Highly-Connected Graphs with DC-Programming (Jörg Kappes and Christoph Schnörr) 1
Approximate Parameter Learning in Conditional Random Fields: An Empirical Investigation (Filip Korč and Wolfgang Förstner) 11
Simple Incremental One-Class Support Vector Classification (Kai Labusch, Fabian Timm, and Thomas Martinetz) 21
A Multiple Kernel Learning Approach to Joint Multi-class Object Detection (Christoph H. Lampert and Matthew B. Blaschko) 31
Fast Generalized Belief Propagation for MAP Estimation on 2D and 3D Grid-Like Markov Random Fields (Kersten Petersen, Janis Fehr, and Hans Burkhardt) 41
Boosting for Model-Based Data Clustering (Amir Saffari and Horst Bischof) 51
Improving the Run-Time Performance of Multi-class Support Vector Machines (Andrey Sluzhivoy, Josef Pauli, Volker Rölke, and Anastasia Noglik) 61
Sliding-Windows for Rapid Object Class Localization: A Parallel Technique (Christian Wojek, Gyuri Dorkó, André Schulz, and Bernt Schiele) 71
A Performance Evaluation of Single and Multi-feature People Detection (Christian Wojek and Bernt Schiele) 82

Tracking

Model-Based Motion Capture for Crash Test Video Analysis (Juergen Gall, Bodo Rosenhahn, Stefan Gehrig, and Hans-Peter Seidel) 92
Efficient Tracking as Linear Program on Weak Binary Classifiers (Michael Grabner, Christopher Zach, and Horst Bischof) 102
A Comparison of Region Detectors for Tracking (Andreas Haja, Steffen Abraham, and Bernd Jähne) 112
Combining Densely Sampled Form and Motion for Human Action Recognition (Konrad Schindler and Luc van Gool) 122
Recognition and Tracking of 3D Objects (Christian Wiedemann, Markus Ulrich, and Carsten Steger) 132

Medical Image Processing and Segmentation

Segmentation of SBFSEM Volume Data of Neural Tissue by Hierarchical Classification (Björn Andres, Ullrich Köthe, Moritz Helmstaedter, Winfried Denk, and Fred A. Hamprecht) 142
Real-Time Neighborhood Based Disparity Estimation Incorporating Temporal Evidence (Bogumil Bartczak, Daniel Jung, and Reinhard Koch) 153
A Novel Approach for Detection of Tubular Objects and Its Application to Medical Image Analysis (Christian Bauer and Horst Bischof) 163
Weakly Supervised Cell Nuclei Detection and Segmentation on Tissue Microarrays of Renal Clear Cell Carcinoma (Thomas J. Fuchs, Tilman Lange, Peter J. Wild, Holger Moch, and Joachim M. Buhmann) 173
A Probabilistic Segmentation Scheme (Dmitrij Schlesinger and Boris Flach) 183
Using Eigenvalue Derivatives for Edge Detection in DT-MRI Data (Thomas Schultz and Hans-Peter Seidel) 193
Space-Time Multi-Resolution Banded Graph-Cut for Fast Segmentation (Tobi Vaudrey, Daniel Gruber, Andreas Wedel, and Jens Klappstein) 203
Combination of Multiple Segmentations by a Random Walker Approach (Pakaket Wattuya, Xiaoyi Jiang, and Kai Rothaus) 214

Audio, Speech and Handwriting Recognition

Automatic Detection of Learnability under Unreliable and Sparse User Feedback (Yvonne Moh, Wolfgang Einhäuser, and Joachim M. Buhmann) 224
Novel VQ Designs for Discrete HMM On-Line Handwritten Whiteboard Note Recognition (Joachim Schenk, Stefan Schwärzler, Günther Ruske, and Gerhard Rigoll) 234
Switching Linear Dynamic Models for Noise Robust In-Car Speech Recognition (Björn Schuller, Martin Wöllmer, Tobias Moosmayr, Günther Ruske, and Gerhard Rigoll) 244
Natural Language Understanding by Combining Statistical Methods and Extended Context-Free Grammars (Stefan Schwärzler, Joachim Schenk, Frank Wallhoff, and Günther Ruske) 254

Multiview Geometry and 3D-Reconstruction

Photoconsistent Relative Pose Estimation between a PMD 2D3D-Camera and Multiple Intensity Cameras (Christian Beder, Ingo Schiller, and Reinhard Koch) 264
Implicit Feedback between Reconstruction and Tracking in a Combined Optimization Approach (Olaf Kähler and Joachim Denzler) 274
3D Body Scanning in a Mirror Cabinet (Sven Molkenstruck, Simon Winkelbach, and Friedrich M. Wahl) 284
On Sparsity Maximization in Tomographic Particle Image Reconstruction (Stefania Petra, Andreas Schröder, Bernhard Wieneke, and Christoph Schnörr) 294
Resolution Enhancement of PMD Range Maps (A.N. Rajagopalan, Arnav Bhavsar, Frank Wallhoff, and Gerhard Rigoll) 304
A Variational Model for the Joint Recovery of the Fundamental Matrix and the Optical Flow (Levi Valgaerts, Andrés Bruhn, and Joachim Weickert) 314

Motion and Matching

Decomposition of Quadratic Variational Problems (Florian Becker and Christoph Schnörr) 325
A Variational Approach to Adaptive Correlation for Motion Estimation in Particle Image Velocimetry (Florian Becker, Bernhard Wieneke, Jing Yuan, and Christoph Schnörr) 335
Optical Rails: View-Based Point-to-Point Navigation Using Spherical Harmonics (Holger Friedrich, David Dederscheck, Eduard Rosert, and Rudolf Mester) 345
Postprocessing of Optical Flows Via Surface Measures and Motion Inpainting (Claudia Kondermann, Daniel Kondermann, and Christoph Garbe) 355
An Evolutionary Approach for Learning Motion Class Patterns (Meinard Müller, Bastian Demuth, and Bodo Rosenhahn) 365
Relative Pose Estimation from Two Circles (Stefan Rahmann and Veselin Dikov) 375
Staying Well Grounded in Markerless Motion Capture (Bodo Rosenhahn, Christian Schmaltz, Thomas Brox, Joachim Weickert, and Hans-Peter Seidel) 385
An Unbiased Second-Order Prior for High-Accuracy Motion Estimation (Werner Trobin, Thomas Pock, Daniel Cremers, and Horst Bischof) 396
Physically Consistent Variational Denoising of Image Fluid Flow Estimates (Andrey Vlasenko and Christoph Schnörr) 406
Convex Hodge Decomposition of Image Flows (Jing Yuan, Gabriele Steidl, and Christoph Schnörr) 416

Image Analysis

Image Tagging Using PageRank over Bipartite Graphs (Christian Bauckhage) 426
On the Relation between Anisotropic Diffusion and Iterated Adaptive Filtering (Michael Felsberg) 436
Comparing Local Feature Descriptors in pLSA-Based Image Models (Eva Hörster, Thomas Greif, Rainer Lienhart, and Malcolm Slaney) 446
Example-Based Learning for Single-Image Super-Resolution (Kwang In Kim and Younghee Kwon) 456
Statistically Optimal Averaging for Image Restoration and Optical Flow Estimation (Kai Krajsek, Rudolf Mester, and Hanno Scharr) 466
Deterministic Defuzzification Based on Spectral Projected Gradient Optimization (Tibor Lukić, Nataša Sladoje, and Joakim Lindblad) 476
Learning Visual Compound Models from Parallel Image-Text Datasets (Jan Moringen, Sven Wachsmuth, Sven Dickinson, and Suzanne Stevenson) 486
Measuring Plant Root Growth (Matthias Mühlich, Daniel Truhn, Kerstin Nagel, Achim Walter, Hanno Scharr, and Til Aach) 497
A Local Discriminative Model for Background Subtraction (Adrian Ulges and Thomas M. Breuel) 507
Perspective Shape from Shading with Non-Lambertian Reflectance (Oliver Vogel, Michael Breuß, and Joachim Weickert) 517
The Conformal Monogenic Signal (Lennart Wietzke and Gerald Sommer) 527

Author Index 537
MAP-Inference for Highly-Connected Graphs with DC-Programming

Jörg Kappes and Christoph Schnörr

Image and Pattern Analysis Group, Heidelberg Collaboratory for Image Processing, University of Heidelberg, Germany
{kappes,schnoerr}@math.uni-heidelberg.de
Abstract. The design of inference algorithms for discrete-valued Markov Random Fields constitutes an ongoing research topic in computer vision. Large state-spaces, non-submodular energy functions, and highly-connected structures of the underlying graph render this problem particularly difficult. Established techniques that work well for sparsely connected grid-graphs used for image labeling degrade for non-sparse models used for object recognition. In this context, we present a new class of mathematically sound algorithms that can be flexibly applied to this problem class with a guarantee to converge to a critical point of the objective function. The resulting iterative algorithms can be interpreted as simple message passing algorithms that converge by construction, in contrast to other message passing algorithms. Numerical experiments demonstrate their performance in comparison with established techniques.
1 Introduction
Applications of Markov Random Fields (MRFs) abound in computer vision. For a review and performance evaluation, we refer to [11]. The majority of applications amounts to some form of image labeling over sparsely connected grid graphs, akin to PDE-based processing in the continuous case. Established inference algorithms [12,7] rely on convex relaxations and dedicated algorithms for solving the resulting large-scale linear programs (LPs). For a review, we refer to [13]. Yet, it has been recognized that the performance of these algorithms degrades for more involved problems that have a large number of states and highly-connected underlying graphs [6]. Such problems typically arise in connection with object recognition. Another problem concerns convergence of the most attractive techniques. While it is well known that loopy belief propagation is a heuristic from the viewpoint of algorithm design [14], sound relaxation techniques like tree-reweighted belief propagation may also suffer from convergence problems [12]. Techniques to remedy this [5] are based on restrictions that may not be satisfied in applications.
Finally, the quickly increasing number of constraints of LP-based relaxations is an issue caused by highly-connected graphs, in particular if the number of states is large, too. In this connection, Ravikumar and Lafferty [10] recently suggested a quadratic programming (QP) relaxation that essentially boils down to mean-field based MAP-estimation. Unlike the usual fixed-point iterations used in mean-field annealing, QP techniques can be applied, but the inherent non-convexity of this type of relaxation remains. To cope with it, a spectral rectification of the problem matrix was suggested in [10] in order to approximate the non-convex relaxation again by a convex one. In this paper, we present a novel class of inference algorithms for MRF inference, based on the non-convex relaxation introduced in [10]. This class is based on Difference of Convex Functions (DC) programming, which utilizes problem decompositions into two convex optimization problems. While in our field this technique has been previously applied to the marginalization problem [15], without referring to the vast and established mathematical literature [4], our present work applies these techniques for the first time to the MAP-inference problem, leading to fairly different algorithms. We fix the notation and introduce the MAP-inference problem in Section 2. The basic problem relaxation and the novel class of inference algorithms are detailed in Section 3, followed by a comparative numerical performance evaluation in Section 4.
2 Problem

2.1 Notation
Let G = (V, E) be a graph with a set of nodes V and edges E. The variable x_s with s ∈ V belongs to the set X_s, so the configuration space of all labellings x is ∏_{s∈V} X_s. The costs assigned to any value of x are given by the model functions θ_{st}(x_s, x_t) and θ_s(x_s), ∀s, t ∈ V, s ≠ t. The corresponding index sets are I = I^V ∪ I^E, with I^V = {(s; i) | s ∈ V, i ∈ X_s} and I^E = {(st; ij) | s, t ∈ V, i ∈ X_s, j ∈ X_t}.

2.2 MAP Inference
The maximum a posteriori (MAP) inference problem amounts to find a labeling x minimizing an energy function of the form

    J(x) = Σ_{st∈E} θ_{st}(x_s, x_t) + Σ_{s∈V} θ_s(x_s) .   (1)

Assembling all function values of all terms into a single large vector θ ∈ R^{|I|}, this problem can be shown to be equivalent to evaluating the support function of the marginal polytope¹ M,

    sup_{μ∈M} ⟨−θ, μ⟩ ,   (2)

¹ The negative sign in (2) is due to our preference to work with energies that are to be minimized.
in terms of the vector of marginals μ. This problem, of course, is as intractable as is problem (1). But relaxations can be easily derived by replacing M by simpler sets. In this paper, we consider the simplest possibility, i.e. the product of all standard (probability) simplices over V:

    Λ = { μ ∈ R_+^{|I^V|} : Σ_{i∈X_s} μ_{s;i} = 1 , ∀s ∈ V } .   (3)
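To make the notation concrete, the following short sketch (an illustration added here, not code from the paper) evaluates the energy J(x) of Eq. (1) for a small fully connected model; the potentials are random placeholders.

```python
import numpy as np

# Illustrative sketch (not the authors' code): evaluate J(x) of Eq. (1)
# for a small fully connected model with |V| nodes and L labels per node.
rng = np.random.default_rng(0)
num_nodes, num_labels = 5, 4
theta_unary = rng.random((num_nodes, num_labels))                         # theta_s(x_s)
theta_pair = rng.random((num_nodes, num_nodes, num_labels, num_labels))   # theta_st(x_s, x_t)

def energy(x):
    """J(x) = sum_{st in E} theta_st(x_s, x_t) + sum_s theta_s(x_s), E = all pairs s < t."""
    val = sum(theta_unary[s, x[s]] for s in range(num_nodes))
    for s in range(num_nodes):
        for t in range(s + 1, num_nodes):
            val += theta_pair[s, t, x[s], x[t]]
    return val

x = rng.integers(0, num_labels, size=num_nodes)   # an arbitrary labeling
print("J(x) =", energy(x))
```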
3 Approach

3.1 QP-Relaxation

In [10], it was suggested to reduce the problem size by replacing μ ∈ R_+^{|I|} in (2) by a new set of variables τ ∈ R_+^{|I^V|}. Inserting μ_{st;ij} = τ_{s;i} τ_{t;j} and μ_{s;i} = τ_{s;i}, problem (2) becomes a QP of a much smaller size,

    min_τ  (1/2) τ^T Q τ + q^T τ ,   s.t. τ ∈ Λ .   (4)

This QP is not convex in general. Ravikumar and Lafferty [10] propose to base inference on a convex approximation, by adding a diagonal matrix in order to shift the spectrum of Q to the nonnegative cone, and by modifying the linear term accordingly, in view of extreme points of the set Λ,

    min_τ  (1/2) τ^T (Q + diag(d)) τ + (q − (1/2) d)^T τ ,   s.t. τ ∈ Λ .   (5)
DC-Decomposition
According to the discussion at the end of the previous section, we propose to dispense with convex approximations of (4), but to tackle it directly through DC-programming. The basic idea is to decompose the non-convex symmetric quadratic part into the difference of two semi-definite quadratic forms.
    f(τ) = (1/2) τ^T Q τ + q^T τ = g(τ) − h(τ) ,   (6a)
    g(τ) = (1/2) τ^T Q_1 τ + q^T τ ,   (6b)
    h(τ) = −(1/2) τ^T Q_2 τ .   (6c)

Various choices of Q_1, Q_2 are possible:

Eigenvalue Decomposition: DCEV. Based on the spectral decomposition Q = V diag(eig(Q)) V^T, where V is the matrix of the eigenvectors and eig(Q) are the eigenvalues of Q, we define

    Q_1 = V diag(max{0, eig(Q)}) V^T ,   (7)
    Q_2 = V diag(max{0, −eig(Q)}) V^T .   (8)

Decomposition Based on the Smallest Eigenvalue: DCMIN. Since the computation of the eigenvalue decomposition is costly, another decomposition can be based on computing a lower bound for the smallest eigenvalue, d_min < min{eig(Q)} < 0. The smallest eigenvalue can be computed, e.g., by the power method.

    Q_1 = Q − d_min · I ,   (9)
    Q_2 = −d_min · I .   (10)

Decomposition Based on the Largest Eigenvalue: DCMAX. A third immediate possibility utilizes the value of the largest eigenvalue of Q and has additionally the property that the convex part of the function becomes very simple. Let d_max > max{eig(Q)} be an upper bound of the largest eigenvalue. We define

    Q_1 = d_max · I ,   (11)
    Q_2 = d_max · I − Q .   (12)
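The following sketch (ours) computes the three decompositions for a symmetric test matrix and checks that Q = Q1 − Q2 with both parts positive semi-definite; for simplicity the eigenvalue bounds are obtained here from a full eigendecomposition rather than the power method mentioned above.

```python
import numpy as np

# Illustrative sketch (ours): the three decompositions Q = Q1 - Q2 of Sect. 3.2.
def dc_ev(Q):
    w, V = np.linalg.eigh(Q)
    Q1 = V @ np.diag(np.maximum(w, 0.0)) @ V.T
    Q2 = V @ np.diag(np.maximum(-w, 0.0)) @ V.T
    return Q1, Q2

def dc_min(Q):
    d_min = min(np.linalg.eigvalsh(Q).min(), 0.0) - 1e-6   # negative lower bound on eig(Q)
    I = np.eye(Q.shape[0])
    return Q - d_min * I, -d_min * I

def dc_max(Q):
    d_max = max(np.linalg.eigvalsh(Q).max(), 0.0) + 1e-6   # positive upper bound on eig(Q)
    I = np.eye(Q.shape[0])
    return d_max * I, d_max * I - Q

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
Q = 0.5 * (A + A.T)                                        # a symmetric test matrix
for decomp in (dc_ev, dc_min, dc_max):
    Q1, Q2 = decomp(Q)
    assert np.allclose(Q1 - Q2, Q)
    assert np.linalg.eigvalsh(Q1).min() > -1e-8            # both parts are PSD
    assert np.linalg.eigvalsh(Q2).min() > -1e-8
```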
3.3 Inference Algorithm
For any decomposition introduced in the previous section, minimization is carried out by the general two-step iteration²

    y^k ∈ ∂h(x^k) ,   x^{k+1} ∈ ∂g*(y^k)

that applies to any DC-objective g(x) − h(x) and has known convergence properties [2]. Taking into account the specific structure of our objective function (6a), we arrive at the following simple algorithm. Depending on the particular decomposition Q = Q_1 − Q_2, the convex optimization steps 3 and 7 can be efficiently conducted with dedicated algorithms.

² ∂f(x) denotes the set of subgradients of a proper convex lower-semicontinuous function f at x, and g*(y) denotes the Fenchel conjugate function sup_x {⟨x, y⟩ − g(x)}.
Algorithm 1. DC-Algorithm for MRF: [x] ← dc4mrf(Q, q)
1: [Q_1, Q_2] ⇐ decompose(Q)
2: i ⇐ 0
3: x^0 ⇐ argmin_x (1/2) x^T Q_1 x + q^T x  s.t. x ∈ Λ   {solve convex part}
4: repeat
5:   i ⇐ i + 1
6:   y^{i−1} ⇐ Q_2 x^{i−1}
7:   x^i ⇐ argmin_x (1/2) x^T Q_1 x + (q + y^{i−1})^T x  s.t. x ∈ Λ
8: until ||x^i − x^{i−1}||_∞ < ε
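A minimal sketch of this iteration for the DCMAX decomposition is given below (our own illustration, not the authors' MatLab code). With Q1 = d_max·I, each convex subproblem in steps 3 and 7 reduces to a Euclidean projection onto the product of simplices Λ. Note the sign convention: we linearize the concave part −(1/2) x^T Q2 x at the current iterate, which yields the linear term q − Q2 x; this is our reading of Algorithm 1 under the decomposition Q = Q1 − Q2.

```python
import numpy as np

# Illustrative sketch (ours) of the DC iteration with the DCMAX decomposition.
def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def project_Lambda(v, num_nodes, num_labels):
    blocks = v.reshape(num_nodes, num_labels)
    return np.vstack([project_simplex(b) for b in blocks]).reshape(-1)

def dc_max_mrf(Q, q, num_nodes, num_labels, max_iter=1000, eps=1e-6):
    d_max = np.linalg.eigvalsh(Q).max() + 1e-6
    Q2 = d_max * np.eye(Q.shape[0]) - Q
    f = lambda x: 0.5 * x @ Q @ x + q @ x
    x = project_Lambda(-q / d_max, num_nodes, num_labels)   # minimizer of the convex part
    for _ in range(max_iter):
        c = q - Q2 @ x                                      # linearize the concave part at x
        x_new = project_Lambda(-c / d_max, num_nodes, num_labels)
        assert f(x_new) <= f(x) + 1e-8                      # the iteration never increases f
        if np.max(np.abs(x_new - x)) < eps:
            return x_new
        x = x_new
    return x
```

Because each convex subproblem majorizes the objective and is tight at the current iterate, the energy f(x) is non-increasing along the iterates, which the assertion inside the loop checks.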
4 Experiments
We compare nine different approaches to minimize the objective function given in (1) for fully connected graphs. For computing the global optimum we use an A*-based algorithm suggested in [3]. The three DC-decompositions presented in this paper are compared with the LP-relaxation [12], QP-relaxations [10], Belief Propagation (BP) [14], the TRBP message-passing algorithm by Wainwright [12] and the Second-Order Cone Programming (SOCP) approach by Kumar [9]. Since the QP-relaxation suggested in [10] turned out to be not tight for our data sets, we use λ_min·I to make the problem convex. For SOCP we do not use triangular constraints, to keep the number of constraints manageable. Note that the latest modifications [8] are not taken into account here, in order not to increase further the number of constraints. TRBP and BP are stopped if the largest change of a message is smaller than 10^−6 or after 1000 iterations. A*, BP and TRBP are C-implementations; DCMAX runs with pure MatLab research code. The constrained QPs of DCMIN and DCEV, as well as LP and QP, are solved using MOSEK [1]. Note that the constraints for DCMIN and DCEV are simple, and performance can be increased using specific optimization methods. We choose ε = 10^−6 and set the maximal number of iterations for DCMAX to 100000 and for DCMIN and DCEV to 1000. In cases where relaxations resulted in fractional solutions we project them to the nearest integer solution. Some approaches depend strongly on the number of constraints and variables, which limits their range of application. For instance, when using SOCP and LP, the number of constraints grows very fast. Table 1 summarizes the number of constraints and variables.

4.1 Synthetic Experiments
For the synthetic experiments we generate graphical models and vary the number of nodes and the size of the state-space of each random variable. The potentials θ are sampled in three ways, given in Table 2. Table 3 shows the mean energy, the calculation time (round brackets) and, for experiment C, the accuracy (square brackets). For experiments A and B our DC-approach outperforms BP, TRBP, SOCP, QP and LP. A*, which always finds
Table 1. Number of constraints and variables required for the different methods. L is the number of labels and K is the number of not truncated edges (see [9] for details). In the worst case K ∼ |E| · L².

Method      Number of constraints                Number of variables
A*          0                                    |V|
SOCP        |V| + |V|·L + 3K                     |V|·L + 2·K
DC / QP     |V|                                  |V|·L
LP          |V|·L + |E|·L² + |V|·L + 2·|E|·L     |V|·L + |E|·L²
TRBP / BP   0                                    2·|E|·L
Table 2. Overview of the synthetic experiments. In experiment A we draw uniform samples in (0, 1) and set the potentials to the negative logarithm. In experiment B we set all unary potentials to 0. In experiment C we select a configuration x and set θ_st(x_s, x_t) as well as 5% or 10% of the entries in θ_st to 0, and the others to −log(0.1).

          Exp. A           Exp. B           Exp. C
θ_s       −log(U(0,1))     −log(1)          −log(1)
θ_st      −log(U(0,1))     −log(U(0,1))     −log({0.1, 1})
the global optimum, is only applicable to small graphs, of course. In experiment C, BP and TRBP are superior to our approach; however, there is no guarantee for convergence. We also kept the ratio between nodes and labels fixed to 4 and did inference with the different approaches. The result is shown in Fig. 1. The plot for uniformly sampled potentials (top row) again shows that DCMAX outperforms state-of-the-art inference techniques, at higher computational costs, however. Observe how for experiment C (bottom row) the run-time for SOCP and LP increases quickly with the number of constraints. To analyze the influence of the initial state on our DC-approach we selected x^0 in experiment A randomly, using 5 nodes and 20 labels. In the same experiments of Table 3, with 10 random initial states, we achieved a mean energy of 3.5878, which is fairly robust.

4.2 Application to Object Recognition
We tested the different approaches on a real-world scenario. The potential functions are now no longer generated synthetically, but estimated from real data through local classification. We have two scenarios, the human face and the human body, as explained in detail in [3]. Due to the strong geometry, faces are simple and A* solves these problems globally optimally and fast. For human bodies A* suffers from several shortcomings. For some images inference with A* takes several minutes; the median run-time of 0.21 seconds is still fast, however. In terms of energy we get the ranking A* < BP < DCMAX < TRBP, as visualized in Table 4, where the difference between BP and DCMAX is not significant.
[Fig. 1: plots of the mean energy of the integer solution, the mean run-time (seconds), and the accuracy versus the number of nodes |V|, for A*, DCmax, BP, TRBP, QP, LP and SOCP.]
Fig. 1. In the first row we show energy and run-time plots for models of type A with 3 to 7 nodes. In the second row we show accuracy with respect to ground truth and run-time plots for models of type C with 5% noise and 3 to 10 nodes. We use 4 times more labels than nodes. Accuracies for the other approaches are the same as for A*. DCMAX finds the lowest energy in experiment A, next to A*, and is comparable with state-of-the-art approaches in experiment C.
Fig. 2. The three images with the largest relative energy of DCMAX. From left to right the configurations of A*, DCMAX, BP and TRBP are shown. Surprisingly, the DCMAX solution matches reality best.
Table 3. The table shows, for different combinations of experiment settings and optimization techniques, the energy of the integer solution, the required run-time (seconds) in round brackets and the accuracy in square brackets. In the first column the numbers of variables and labels are given. We repeated each setting 100 times (5 times if marked with *) with random data. A dash indicates that the experiments could not be carried out because the number of constraints is too high (LP, SOCP). For exp. A and B the DC algorithms outperform state of the art approaches.

05-20-A:      A* 2.90 (0.01); DCMAX 3.29 (0.88); DCMIN 3.34 (8.30); DCEV 3.36 (2.32); BP 7.12 (0.22); TRBP 4.76 (0.41); SOCP 10.01 (65.14); QP 7.05 (0.02); LP 6.16 (0.34)
05-20-B:      A* 1.07 (0.01); DCMAX 1.59 (1.89); DCMIN 1.61 (10.20); DCEV 1.68 (2.29); BP 6.30 (0.28); TRBP 4.86 (0.11); SOCP 6.39 (65.21); QP 5.43 (0.02); LP 6.04 (0.38)
05-20-C-05%:  A* 0.00 (0.01) [1.00]; DCMAX 0.00 (0.06) [1.00]; DCMIN 0.00 (0.56) [1.00]; DCEV 0.62 (0.11) [0.94]; BP 0.00 (0.01) [1.00]; TRBP 0.00 (0.14) [1.00]; SOCP 1.82 (0.33) [0.95]; QP 13.22 (0.02) [0.56]; LP 0.00 (0.29) [1.00]
05-20-C-10%:  A* 0.00 (0.01) [1.00]; DCMAX 1.40 (0.09) [0.80]; DCMIN 1.20 (0.87) [0.83]; DCEV 2.83 (0.31) [0.64]; BP 0.00 (0.01) [1.00]; TRBP 0.00 (0.15) [1.00]; SOCP 11.86 (0.98) [0.46]; QP 14.09 (0.02) [0.38]; LP 0.00 (0.31) [1.00]
10-50-A:      A* 14.53* (1162)*; DCMAX 15.77 (21.59); DCMIN 15.90* (1087)*; DCEV 16.21* (808)*; BP 42.35 (5.68); TRBP 30.03 (1.71); SOCP -; QP 33.75 (1.12); LP 36.76 (54.86)
10-50-B:      A* 9.75* (1297)*; DCMAX 12.11 (16.92); DCMIN 12.26* (1109)*; DCEV 12.36* (805)*; BP 42.44 (5.74); TRBP 39.36 (1.68); SOCP -; QP 31.89 (1.08); LP 41.86 (53.49)
20-50-C-05%:  A* 0.00 (0.02) [1.00]; DCMAX 0.00 (1.21) [1.00]; DCMIN 0.00* (4809)* [1.00]*; DCEV 0.00* (5089)* [1.00]*; BP 0.00 (0.33) [1.00]; TRBP 0.00 (9.83) [1.00]; SOCP -; QP 47.94 (5.08) [0.94]; LP -
20-50-C-10%:  A* 0.00 (0.02) [1.00]; DCMAX 0.00 (1.26) [1.00]; DCMIN 0.00* (4707)* [1.00]*; DCEV 0.00* (4844)* [1.00]*; BP 0.00 (0.46) [1.00]; TRBP 0.00 (29.91) [1.00]; SOCP -; QP 199.04 (5.00) [0.68]; LP -
Human faces: Our DC-decomposition finds the global optimum in 93.59% of the images. Figure 2 shows the three images in which the relative energy difference is largest. Surprisingly, the global optimum of the energy function does not match reality in these images, in contrast to DCMAX.

Table 4. Mean energy, time in seconds, and accuracy with respect to the global optimum. For real-world examples DCMAX outperforms TRBP, but finds, for human bodies, configurations with slightly higher energies than BP.

Experiment   A*                          DCMAX                        BP                          TRBP
Face         45.7410 (0.0003) [1.0000]   45.7669 (3.7384) [0.9538]    46.7476 (0.0070) [0.9701]   46.8325 (0.0074) [0.7658]
Body         57.6411 (5.6569) [1.0000]   60.5376 (79.0601) [0.6010]   58.2711 (0.3382) [0.8673]   73.8115 (1.4921) [0.4057]
Fig. 3. The pictures above show three typical images. In the first one DCMAX found the global optimum, in the second it stopped in a critical point which does not describe the image well, in the third the solution of DCMAX describes the image better than the global optimum. From left to right the configurations of A*, DCMAX, BP and TRBP are shown.
Human bodies: Detecting human bodies is a very challenging task, see Figure 3. While in the second image the solution achieved by DCMAX does not describe the underlying image accurately, in the third image it is closer to human perception. For these complex models DCMAX ends up in the global optimum in 39.26% of the images.
5 Conclusions
We introduced a novel class of approximate MRF-inference algorithms based on quadratic DC-programming. Besides provable convergence properties, the approach shows competitive performance. It is applicable to highly-connected graphical models where standard LP-based relaxations cannot be applied, because the number of constraints becomes too large.
Our future work will supplement the DC-programming framework by globalization strategies and focus on the derivation of performance bounds that hold for any application.
References

1. Mosek 5.0, http://www.mosek.com
2. An, L.T.H., Tao, P.D.: The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research 133, 23–46 (2005)
3. Bergtholdt, M., Kappes, J.H., Schnörr, C.: Learning of graphical models and efficient inference for object class recognition. In: 28th Annual Symposium of the German Association for Pattern Recognition (September 2006)
4. Horst, R., Thoai, N.V.: DC programming: Overview. J. Optimiz. Theory Appl. 103(1), 1–43 (1999)
5. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimization. IEEE Trans. Patt. Anal. Mach. Intell. 28(10), 1568–1583 (2006)
6. Kolmogorov, V., Rother, C.: Comparison of energy minimization algorithms for highly connected graphs. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954. Springer, Heidelberg (2006)
7. Komodakis, N., Tziritas, G.: Approximate labeling via graph cuts based on linear programming. PAMI 29(8), 2649–2661 (2007)
8. Kumar, M.P., Kolmogorov, V., Torr, P.H.S.: An analysis of convex relaxations for MAP estimation. In: Proceedings of Advances in Neural Information Processing Systems (2007)
9. Pawan Kumar, M., Torr, P.H.S., Zisserman, A.: Solving Markov random fields using second order cone programming relaxations. In: CVPR 2006: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, pp. 1045–1052. IEEE Computer Society, Los Alamitos (2006)
10. Ravikumar, P., Lafferty, J.: Quadratic programming relaxations for metric labeling and Markov random field MAP estimation. In: ICML 2006: Proceedings of the 23rd International Conference on Machine Learning, pp. 737–744. ACM Press, New York (2006)
11. Szeliski, R., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M.F., Rother, C.: A comparative study of energy minimization methods for Markov random fields. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954. Springer, Heidelberg (2006)
12. Wainwright, M.J., Jaakola, T.S., Willsky, A.S.: MAP estimation via agreement on trees: message-passing and linear programming. IEEE Trans. Inform. Theory 51(11), 3697–3717 (2005)
13. Werner, T.: A linear programming approach to max-sum problem: A review. IEEE Trans. Patt. Anal. Mach. Intell. 29(7), 1165–1179 (2007)
14. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory 51(7), 2282–2312 (2005)
15. Yuille, A.L.: CCCP algorithms to minimize the Bethe and Kikuchi free energies: convergent alternatives to belief propagation. Neural Comput. 14(7), 1691–1722 (2002)
Approximate Parameter Learning in Conditional Random Fields: An Empirical Investigation

Filip Korč and Wolfgang Förstner

University of Bonn, Department of Photogrammetry, Nussallee 15, 53115 Bonn, Germany
[email protected], [email protected]
http://www.ipb.uni-bonn.de
Abstract. We investigate maximum likelihood parameter learning in Conditional Random Fields (CRF) and present an empirical study of pseudo-likelihood (PL) based approximations of the parameter likelihood gradient. We show, as opposed to [1][2], that these parameter learning methods can be improved, and evaluate the resulting performance employing different inference techniques. We show that the approximation based on penalized pseudo-likelihood (PPL) in combination with the Maximum A Posteriori (MAP) inference yields results comparable to other state of the art approaches, while providing the advantages of formulating parameter learning as a convex optimization problem. Eventually, we demonstrate applicability on the task of detecting man-made structures in natural images.

Keywords: Approximate parameter learning, pseudo-likelihood, Conditional Random Field, Markov Random Field.
1 Introduction
Classification of image components in meaningful categories is a challenging task due to the ambiguities inherent in visual data. On the other hand, image data exhibit strong contextual dependencies in the form of spatial interactions among components. It has been shown that modeling these interactions is crucial to achieve good classification accuracy. The Conditional Random Field (CRF) provides a principled approach for combining local classifiers that allow the use of arbitrary overlapping features, with adaptive data-dependent label interaction. This formulation provides several advantages compared to the traditional Markov Random Field (MRF) model. Further, the restrictive assumption of conditional independence of data, made in the traditional MRFs, is relaxed in the CRF model. CRFs [3] have been proposed in the context of segmentation and labeling of 1D text sequences. In [1], the concept of CRFs has been extended to graphs with loops and made thus well applicable to problems in computer vision. The CRF that uses arbitrary discriminative classifiers to design the model potentials has been called the Discriminative Random Field (DRF).
In our previous work [4], we discussed the differences between a traditional MRF formulation and the DRF model, compared the performance of the two models and an independent sitewise classifier, and demonstrated the application feasibility for the task of interpreting terrestrial images of urban scenes. Further, we presented preliminary results suggesting the potential for performance improvement. Exact computation of the likelihood of CRF model parameters is in general infeasible for graphs with grid or irregular topology. For this reason, developing effective parameter learning methods is the crucial part in applying CRFs in computer vision. In [5], a model modification is described resulting in parameter learning formulated as a convex optimization problem. Learning/inference coupling is studied in [2]. Learning in CRFs can be accelerated using Stochastic Gradient Methods [6] and Piecewise Pseudo-likelihood [7]. A semi-supervised learning approach to learning in CRF can be found in [8]. In this work, we empirically investigate approximate parameter learning methods based on pseudo-likelihood (PL). We show that these methods yield results comparable to other state of the art approaches to parameter learning in CRFs, while providing desirable convergence behavior independent of the initialization.
2 Conditional Random Field
CRFs are used in a discriminative framework to model the posterior over the labels given the data. In other words, let y denote the observed data from an input image, where y = {y_i}_{i∈S}, y_i is the data from the i-th site, and S is the set of sites. Let the corresponding labels at the image sites be given by x = {x_i}_{i∈S}. We review the CRF formulation in the context of binary classification on 2D image lattices. A general formulation on arbitrary graphs with multiple class labels is described in [9]. Thus, we now have x_i ∈ {−1, 1} for a binary classification problem. In the considered CRF framework, the posterior over the labels given the data is expressed as

    P(x|y) = (1/Z) exp( Σ_{i∈S} A_i(x_i, y) + Σ_{i∈S} Σ_{j∈N_i} I_{ij}(x_i, x_j, y) )   (1)
The CRF model in Eq. (1) captures the class association A_i at individual sites i with the interactions I_ij in the neighboring sites ij. The parameter dependent term Z(θ) is the normalization constant (or partition function) and is in general intractable to compute. N_i is the set of neighbors of the image site i. Both the unary association potential A_i and the pairwise interaction potential I_ij can be modeled as arbitrary unary and pairwise classifiers [5]. In this paper, as in [5], [2] and our previous work [4], we use a logistic function σ(t) = 1/(1 + e^{−t}) to specify the local class posterior, i.e., A_i(x_i, y) = log P(x_i|y) = log σ(x_i w^T h_i(y)), where the parameters w specify the classifier for individual sites. Here, h_i(y) is a sitewise feature vector, which has to be chosen such that a high positive weighted sum w^T h_i(y) supports class x_i = 1. Similarly, to model I_ij we use
a pairwise classifier of the following form: I_ij(x_i, x_j, y) = x_i x_j v^T μ_ij(y). Here, the parameters v specify the classifier for site neighborhoods. μ_ij(y) is a feature vector similarly being able to support or suppress the identity x_i x_j = 1 of neighboring class labels. We denote the unknown CRF model parameters by θ = {w, v}. In the following we assume the random field in Eq. (1) to be homogeneous and isotropic. Hence we drop the subscripts and use the notation A and I.
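For illustration, the following sketch (ours, not the authors' code) evaluates the association and interaction potentials for one site and one neighbour, using the logistic model above and the feature vectors h_i(y) = [1, I_i]^T and μ_ij(y) = [1, |I_i − I_j|]^T introduced in Sect. 4; the parameter values and intensities are placeholders.

```python
import numpy as np

# Illustrative sketch (ours): DRF/CRF potentials for one site i and one neighbour j.
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def association(x_i, I_i, w):
    """A_i(x_i, y) = log sigma(x_i * w^T h_i(y)) with h_i(y) = [1, I_i]."""
    h = np.array([1.0, I_i])
    return np.log(sigmoid(x_i * (w @ h)))

def interaction(x_i, x_j, I_i, I_j, v):
    """I_ij(x_i, x_j, y) = x_i * x_j * v^T mu_ij(y) with mu_ij(y) = [1, |I_i - I_j|]."""
    mu = np.array([1.0, abs(I_i - I_j)])
    return x_i * x_j * (v @ mu)

w = np.array([0.1, 1.5])          # placeholder parameters
v = np.array([0.8, -2.0])
print(association(+1, 0.7, w), interaction(+1, -1, 0.7, 0.2, v))
```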
3 Parameter Learning

We learn the parameters θ of the CRF model in a supervised manner. Hence, we use training images and the corresponding ground-truth labeling. We use the standard maximum likelihood approach and, in principle, maximize the conditional likelihood P(x|y, θ) of the CRF model parameters. However, this would involve the evaluation of the partition function Z, which is in general NP-hard. To overcome the problem, we may either use sampling techniques or approximate the partition function. As in [5], we use the pseudo-likelihood (PL) approximation P(x|y, θ) ≈ ∏_{i∈S} P(x_i | x_{N_i}, y, θ) [10], [11], which is characterized by its relatively low computational complexity. It has been observed [5] that this approximation tends to overestimate the interaction parameters, causing the MAP estimate of the field to be a poor solution. To overcome the difficulty, they propose to adopt the Bayesian viewpoint and find the maximum a posteriori estimate of the parameters by assuming a Gaussian prior over the parameters such that P(θ|τ) = N(θ|0, τ²I), where I is the identity matrix. Thus, given M independent training images, we determine θ from

    θ̂_ML ≈ argmax_θ ∏_{m=1}^{M} ∏_{i∈S} P(x_i^m | x_{N_i}^m, y^m, θ) P(θ|τ)

or equivalently from the log-likelihood

    θ̂_ML ≈ argmax_θ Σ_{m=1}^{M} Σ_{i∈S} ( A(x_i, y, w) + Σ_{j∈N_i} I(x_i, x_j, y, v) − log z_i ) − (1/(2τ²)) v^T v ,   (2)

where

    z_i = Σ_{x_i∈{−1,1}} exp{ A(x_i, y, w) + Σ_{j∈N_i} I(x_i, x_j, y, v) } .
As stated in [5], if τ is given, the problem in Eq. (2) is convex with respect to the model parameters and can be maximized using gradient ascent. We note that it is an approximation of the true likelihood gradient that is now being computed. We implement a gradient ascent method variation with exact line search and maximize for different values of τ . In our experiments, we adopt two methods of parameter learning. In the first set of experiments, we learn the parameters of the CRF using a uniform
prior over the parameters in Eq. (2), i.e., τ = ∞. This approach is referred to as the pseudo-likelihood (PL) learning method. The learning technique in the second set of experiments, where a Gaussian prior over the CRF model parameters is used, is denoted as the penalized pseudo-likelihood (PPL) learning method. We specify further details of the PPL learning together with the experiments. Discrete approximations of the partition function based on the Saddle Point Approximation (SPA) [12], the Pseudo-Marginal Approximation (PMA) [13] and the Maximum Marginal Approximation (MMA) are described in [2]. A Markov Chain Monte Carlo (MCMC) sampling inspired method proposed in [14] and called Contrastive Divergence (CD) is another way to deal with the combinatorial size of the label space. In our experiments we compare the PL based methods with these approaches to parameter learning.
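As an illustration of Eq. (2), the sketch below (ours) evaluates the penalized pseudo-likelihood objective for a single training image on a 4-connected grid; dropping the last term (τ = ∞) gives the plain PL objective. The toy data and parameter values are placeholders.

```python
import numpy as np

# Illustrative sketch (ours): penalized pseudo-likelihood of Eq. (2) for one image.
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def ppl_objective(w, v, x, img, tau):
    H, W = x.shape
    def A(label, i, j):
        return np.log(sigmoid(label * (w[0] + w[1] * img[i, j])))
    def I(label_a, label_b, i, j, k, l):
        return label_a * label_b * (v[0] + v[1] * abs(img[i, j] - img[k, l]))
    total = 0.0
    for i in range(H):
        for j in range(W):
            nbrs = [(i + di, j + dj) for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= i + di < H and 0 <= j + dj < W]
            def site_term(label):
                return A(label, i, j) + sum(I(label, x[k, l], i, j, k, l) for k, l in nbrs)
            log_zi = np.logaddexp(site_term(-1), site_term(+1))
            total += site_term(x[i, j]) - log_zi
    return total - (v @ v) / (2.0 * tau ** 2)   # Gaussian prior on v; tau -> inf gives plain PL

# tiny toy example with placeholder data
rng = np.random.default_rng(3)
x = np.where(rng.random((6, 6)) > 0.5, 1, -1)
img = x * 0.5 + rng.normal(0, 0.3, x.shape)
print(ppl_objective(np.array([0.0, 1.0]), np.array([0.5, -1.0]), x, img, tau=0.01))
```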
4 Experiments
To analyze the learning and inference techniques described in the previous section, we applied the CRF model to a binary image restoration task. The aim of these experiments is to recover the correct labeling from corrupted binary images. We use the data that has been used in the learning and inference experiments in [5], [2] and compare our results with those published in the above mentioned works. Four base images, see the top row in Fig. 1, 64 × 64 pixels each, are used in the experiments. Two different noise models are employed: Gaussian noise and class dependent bimodal (two mixtures of two Gaussians) noise. Details of the noise model parameters are given in [5]. For each noise model, 10 out of 50 noisy images from the leftmost base image in Fig. 1 are used as the training set for parameter learning. 150 noisy images from the other 3 base images are used for testing. The unary and pairwise features are defined as h_i(y) = [1, I_i]^T and μ_ij(y) = [1, |I_i − I_j|]^T respectively, where I_i and I_j are the pixel intensities at the site i and the site j. Hence, the parameters w and v are both two-element vectors, i.e., w = [w_0, w_1]^T and v = [v_0, v_1]^T.

4.1 Optimization
Finding optimal parameters of the CRF model means solving the convex optimization problem in Eq. (2). For this purpose, we implement a variation of the gradient ascent algorithm with exact line search. For the computation of the numerical gradient we use a spacing of 0.00001 between points in each direction. Plots in Fig. 2a,c show the negative logarithm of the objective function of the optimization problem in Eq. (2). Results in Fig. 2a,b correspond to the objective function where no prior over the parameters is used, i.e., to the PL learning method. Results in Fig. 2c,d show learning where parameters are penalized by imposing a Gaussian prior. This corresponds to the PPL learning method.
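A minimal sketch of this optimization loop is given below (ours, not the authors' implementation): the gradient is approximated numerically with the spacing mentioned above, and the "exact" line search along the ascent direction is realized here with a bounded scalar minimization from SciPy; the step-length bound is an assumption of this sketch.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative sketch (ours): gradient ascent with a numerical gradient and exact line search.
def numerical_gradient(objective, theta, h=1e-5):
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta); e[k] = h
        grad[k] = (objective(theta + e) - objective(theta - e)) / (2.0 * h)
    return grad

def gradient_ascent(objective, theta0, max_iter=200, tol=1e-6):
    theta = theta0.astype(float)
    for _ in range(max_iter):
        g = numerical_gradient(objective, theta)
        if np.linalg.norm(g) < tol:
            break
        # exact line search: maximize objective(theta + s*g) over the step length s
        res = minimize_scalar(lambda s: -objective(theta + s * g),
                              bounds=(0.0, 10.0), method="bounded")
        theta = theta + res.x * g
    return theta

# usage, e.g.: theta_hat = gradient_ascent(lambda t: my_ppl(t[:2], t[2:]), np.zeros(4))
```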
[Fig. 1 layout: rows show the original image, the bimodal-noise input, and the results of the logistic classifier, ICM with PL, ICM with PPL, MIN-CUT with PL, and MIN-CUT with PPL.]
Fig. 1. Image restoration results for synthetic images corrupted with bimodal noise. Results for different combinations of parameter learning (PL: Pseudo-likelihood, PPL: Penalized Pseudo-likelihood) and inference methods (MIN-CUT: min-cut/max-flow algorithm, ICM: Iterated Conditional Modes, LOGISTIC: logistic classifier) are shown. Train and test data courtesy Sanjiv Kumar (4 image columns on the left).
Fig. 2b,d show minimizing sequences for the model parameter w_1. On this example, we illustrate that the model parameters change their values significantly although the criterion value does not decrease much compared to the initial iterations. This observation motivates the employment of exact optimization. An inexact approach, commonly used in practice, where the step length is chosen to
Fig. 2. Parameter learning using gradient ascent with exact line search. Plots of the negative logarithm of the approximated likelihood of the model parameters (a,c) and plots of the model parameter updates (b,d) for the PL parameter learning (a,b) and the PPL parameter learning (c,d).
approximately minimize the criterion along the chosen ray direction, stops the computation far from the optimum in this case.

4.2 PL and PPL Parameter Learning
In the following, we first adopt the PPL learning approach and investigate different combinations of model parameter priors and values of the parameter τ. Gaussian priors over the following four combinations of parameters: {w}, {v}, {v1 w1}, {θ} are used in our experiments, where in each case a uniform prior is used for the rest of the parameters. Further, we run the PPL parameter learning for the following values of the prior parameter: τ = {1, 0.1, 0.01, 0.001}. We used 10 training images corrupted with both the Gaussian and the bimodal noise to learn the model parameters and, in this case, evaluated the method on all 200 images. Tab. 1 summarizes the experiment for the bimodal noise model by showing the resulting pixelwise classification errors obtained by the min-cut/max-flow algorithm (MIN-CUT) [15], [16]. In accordance with [5], PPL learning with a prior over the interaction parameters v together with the MIN-CUT inference yields the lowest classification error for both noise models. In addition to this, for the Gaussian noise model we find that learning with a prior over all the parameters θ also yields a comparable classification error. Further, in accordance with [5], τ = 0.01 yields the lowest classification errors in case of the bimodal noise. As opposed to [5], we find that the prior parameter value τ = 1 yields the best results with the Gaussian noise model.

Table 1. Pixelwise classification errors (%) on 200 images. Columns show combinations of model parameter priors and rows values of the prior parameter τ. See text for more.

Parameter τ    {w}      {v}      {v1 w1}   {θ}
1              11.27    11.03    11.25     11.20
0.1            21.96    7.42     21.60     19.65
0.01           26.09    6.25     22.15     22.15
0.001          23.04    16.55    22.15     22.15
Table 2. Pixelwise classification errors (%) on 150 test images. Columns show parameter learning methods used with two noise models. KH'06 stands for the results published in [5]. Mean ± standard deviation over 10 experiments is given for our results.

Learning Method    Gaussian Noise               Bimodal Noise
                   PL            PPL            PL             PPL
KH'06              3.82          2.30           17.69          6.21
ours               2.46 ± 0.07   2.43 ± 0.05    14.21 ± 1.51   6.30 ± 0.11
We now employ the parameter prior and the value of τ identified in the previous experiment and validate our results by learning on 10 images randomly selected from the training set and subsequently testing on 150 images from the test set. For every scenario we run the experiment 10 times and report the mean together with the standard deviation. Tab. 2 summarizes the experiment and illustrates that our PPL learning yields results comparable to [5]. We now adopt the PL parameter learning method and evaluate the approach in combination with the MIN-CUT inference. As illustrated in Tab. 2, we improve the performance of PL learning for both the Gaussian and the bimodal noise model by 36% and 20%, respectively. We attribute this improvement to the employment of an exact approach to the optimization. In our experiments, we compute a numerical gradient of the approximated likelihood. The time needed for learning could be decreased by computing the exact gradient of the approximated likelihood. Employing an inexact line search would accelerate learning at the cost of an approximate solution. We maintain that the learning time is in this case to a great extent initialization dependent. Finally, we compare results of a logistic classifier (LOGISTIC), Iterated Conditional Modes (ICM) [17] and the MIN-CUT inference for parameters learned through both the PL and the PPL method and for both noise models. In our experiments, MIN-CUT inference yields the lowest classification error for both learning approaches. The experiment is summarized in Tab. 3 and typical classification results are further illustrated in Fig. 1.
Inference Method       Gaussian Noise                 Bimodal Noise                  Time (sec)
                       PL             PPL             PL             PPL
LOGISTIC               15.02 ± 0.21   15.08 ± 0.20    27.01 ± 0.64   26.30 ± 0.23    0.08
ICM                    14.52 ± 0.26   14.66 ± 0.22    26.27 ± 0.58   23.37 ± 0.21    1.36
MIN-CUT                 2.46 ± 0.07    2.43 ± 0.05    14.21 ± 1.51    6.30 ± 0.11    0.15
Learning Time (sec)    3318 ± 1228    3426 ± 1008     1550 ± 248     3027 ± 753      –
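The text above states that a numerical gradient of the approximated likelihood is used during learning. As a purely illustrative sketch (not the authors' code; the function name and step size are our own choices), a central-difference approximation of the gradient of a scalar objective f at a parameter vector θ can be written as follows:

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-5):
    """Central-difference gradient of the scalar objective f at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad
```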
Table 4. Pixelwise classification errors (%) on 200 test images. Rows show parameter learning procedures and columns show two different noise models. KH'05 stands for the results published in [2]. Mean ± standard deviation over 10 experiments is given for our results.

              Gaussian noise    Bimodal noise     Learning time
              MIN-CUT           MIN-CUT           (sec)
MMA, KH'05    34.34             26.53             636
PL, KH'05     3.82              17.69             300
PL, ours      2.52 ± 0.07       12.69 ± 1.21      2434
CD, KH'05     3.78              8.88              207
PMA, KH'05    2.73              6.45              1183
PPL, ours     2.49 ± 0.05       6.25 ± 0.12       3227
SPA, KH'05    2.49              5.82              82

4.3 Comparison of Learning Methods
For the MAP MIN-CUT inference, we compare our parameter learning with other state-of-the-art learning methods proposed in [2] and mentioned in Sec. 3. We summarize the comparison in Tab. 4. It was found in [2] that for MAP inference SPA based learning is the most accurate as well as the most time efficient. However, it was also shown that this approximation leads to a limit cycle convergence behavior that depends on the parameter initialization. As the convergence is not guaranteed, a parameter selection heuristic has to be chosen for the oscillatory case. This is the main drawback of the approximation. In Tab. 4, we show that MAP inference with PPL based learning yields results comparable to SPA learning, while providing the advantage of formulating parameter learning as a convex optimization problem. In this case, the problem can be solved very reliably and efficiently, drawing upon the benefits of readily available methods for convex optimization. We note, however, that SPA learning still turns out to be more time efficient in our comparison; this remains a drawback of our approach relative to SPA learning.
4.4 Natural Images
We demonstrate the applicability of the method to the task of detecting man-made structures in natural images and show preliminary results on real data. Our intention in this experiment is to label each site of a test image as structured or non-structured. We divide our test images, each of size 3008 × 2000 pixels, into non-overlapping blocks of size 150 × 150 pixels, which we call image sites. For each image site i, a 1-dimensional single-site feature is computed as a linear combination of gradient magnitude and orientation based features. In the current setup, we reduce the CRF parameter learning to the determination of the interaction parameter w0, while the remaining parameters are fixed. We choose the parameter values that by observation yield the best performance on a test set of 15 images. See Fig. 3 for illustration.
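The division into image sites described above is straightforward to implement. The following is a small illustrative sketch of our own (the function name and the handling of incomplete border blocks are our assumptions; the paper does not specify them):

```python
import numpy as np

def image_sites(image, block=150):
    """Split a 2-D image array into non-overlapping block x block sites.
    Incomplete blocks at the right and bottom border are dropped here."""
    H, W = image.shape[:2]
    rows, cols = H // block, W // block
    return [image[r * block:(r + 1) * block, c * block:(c + 1) * block]
            for r in range(rows) for c in range(cols)]

# A 3008 x 2000 pixel image yields 20 x 13 = 260 complete 150 x 150 sites.
```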
Fig. 3. (a) Input Image. (b) Man-made structure detection result using the CRF model. Man-made structure is denoted by blue crosses superimposed on the original image.
5 Conclusion
We investigate maximum likelihood parameter learning in Conditional Random Fields (CRF) and present an empirical study of pseudo-likelihood (PL) based approximations of the parameter likelihood gradient. We show that the approximation based on penalized pseudo-likelihood (PPL) in combination with Maximum A Posteriori (MAP) inference yields results comparable to other state-of-the-art approaches, while providing desirable convergence behavior independent of the initialization. Finally, we demonstrate the applicability of the method to the task of detecting man-made structures in natural images. We are currently exploring further ways of efficient parameter learning in CRFs on grid graphs and on graphs with a general neighborhood system. Acknowledgments. The authors would like to thank V. Kolmogorov for the min-cut code and S. Kumar for the training and test data. The first author was supported by the EC Project FP6-IST-027113 eTRIMS.
References 1. Kumar, S., Hebert, M.: Discriminative random fields: A discriminative framework for contextual interaction in classification. In: Proc. of the 9th IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157 (2003) 2. Kumar, S., August, J., Hebert, M.: Exploiting inference for approximate parameter learning in discriminative fields: An empirical study. In: 5th International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (2005) 3. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proc. 18th International Conf. on Machine Learning, pp. 282–289 (2001)
4. Korˇc, F., F¨ orstner, W.: Interpreting terrestrial images of urban scenes using Discriminative Random Fields. In: Proc. of the 21st Congress of the International Society for Photogrammetry and Remote Sensing (ISPRS) (July 2008) 5. Kumar, S., Hebert, M.: Discriminative random fields. International Journal of Computer Vision 68(2), 179–201 (2006) 6. Vishwanathan, S.V.N., Schraudolph, N.N., Schmidt, M.W., Murphy, K.: Accelerated training of conditional random fields with stochastic gradient methods. In: Cohen, W.W., Moore, A. (eds.) Proc. of the 24th International Conf. on Machine Learning. ACM International Conference Proceeding Series, vol. 148, pp. 969–976. ACM Press, New York (2006) 7. Sutton, C., McCallum, A.: Piecewise pseudolikelihood for efficient training of conditional random fields. In: Ghahramani, Z. (ed.) International Conference on Machine learning (ICML). ACM International Conference Proceeding Series, vol. 227, pp. 863–870 (2007) 8. Lee, C.-H., Wang, S., Jiao, F., Schuurmans, D., Greiner, R.: Learning to model spatial dependency: Semi-supervised discriminative random fields. In: Sch¨ olkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems 19, pp. 793–800. MIT Press, Cambridge (2007) 9. Kumar, S., Hebert, M.: Multiclass discriminative fields for parts-based object detection. In: Snowbird Learning Workshop (2004) 10. Besag, J.: Statistical analysis of non-lattice data. The Statistician 24(3), 179–195 (1975) 11. Besag, J.: Efficiency of pseudo-likelihood estimation for simple gaussian fields. Biometrika 64, 616–618 (1977) 12. Geiger, D., Girosi, F.: Parallel and deterministic algorithms from mrfs: Surface reconstruction. IEEE Trans. on Pattern Analysis and Machine Intelligence 13(5), 401–412 (1991) 13. McCallum, A., Rohanimanesh, K., Sutton, C.: Dynamic conditional random fields for jointly labeling multiple sequences. In: NIPS Workshop on Syntax, Semantics, and Statistics (December 2003) 14. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Computation 14(8), 1771–1800 (2002) 15. Greig, D.M., Porteous, B.T., Seheult, A.H.: Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society 51(2), 271–279 (1989) 16. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(11), 1222–1239 (2001) 17. Besag, J.: On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society 48(3), 259–302 (1986)
Simple Incremental One-Class Support Vector Classification

Kai Labusch, Fabian Timm, and Thomas Martinetz

Institute for Neuro- and Bioinformatics, University of Lübeck, Ratzeburger Allee 160, D-23538 Lübeck, Germany
Abstract. We introduce the OneClassMaxMinOver (OMMO) algorithm for the problem of one-class support vector classification. The algorithm is extremely simple and therefore a convenient choice for practitioners. We prove that in the hard-margin case the algorithm converges with O(1/√t) to the maximum margin solution of the support vector approach for one-class classification introduced by Schölkopf et al. Furthermore, we propose a 2-norm soft margin generalisation of the algorithm and apply the algorithm to artificial datasets and to the real world problem of face detection in images. We obtain the same performance as sophisticated SVM software such as libSVM.
1 Introduction
Over the last years, the support vector machine [1] has become a standard approach to solving pattern recognition tasks. There are several training techniques available, e.g. SMO [2]. Although these methods seem to be simple, they are hard to understand without a background in optimisation theory. Hence, they are difficult to motivate when explained to practitioners. In many cases, in particular in industrial contexts, where external libraries or other third party software cannot be used due to various reasons, these techniques are not applied, even though they might be beneficial for solving the problem. In many applications one has to cope with the problem that only samples of one class are given. The task is to separate this class from the other class that consists of all outliers. Either only few samples of the outlier class are given or the outlier class is missing completely. In these cases two-class classifiers often show bad generalisation performance and it is advantageous to employ one-class classification. Approaches to one-class classification can be divided into three groups: density estimators, reconstruction methods, and boundary methods. The first and the second group are the most powerful because they derive a model of the data that is defined everywhere in the input space. An advantage of the boundary methods is that they consider an easier problem, that is, describing only the class boundaries, instead of describing the whole distribution of the data. In the present work, we describe a very simple and incremental boundary method based on the support vector approach. It provides the same solution
as comparable techniques such as SMO, despite being extremely simple and therefore applicable for practitioners who are not within the field of machine learning.
2 Previous Work
In the context of one-class classification several boundary methods have been developed. We only want to give a brief description of two approaches that have been introduced almost simultaneously. Tax et al [3] consider the problem of finding the smallest enclosing ball of given data samples xi ∈ X , i = 1, . . . , L that is described by the radius R and centre w:
min_{w,R}  R + (1/(νl)) Σ_i ξ_i    s.t.   ∀ i : ‖φ(x_i) − w‖ ≤ R + ξ_i  ∧  ξ_i ≥ 0 .   (1)
This is the soft version of the problem. It deals with outliers by using slack variables ξ_i in order to allow for samples that are not located inside the ball defined by w and R. For ν → 0 one obtains the hard-margin solution, where all samples are located inside the ball. Here φ denotes a mapping of the data samples to some feature space. Schölkopf et al. [4] show that one-class classification can be cast as two-class classification, where the other class is represented by the origin. They consider the problem of finding the hyperplane w that separates the data samples from the origin with maximum distance ρ:

min_{w,ξ,ρ}  (1/2)‖w‖² + (1/(νl)) Σ_i ξ_i − ρ    s.t.   ∀ i : w^T φ(x_i) ≥ ρ − ξ_i  ∧  ξ_i ≥ 0 .   (2)

Again, the soft-margin problem is shown. It allows for misclassified samples by using slack variables ξ_i. For ν → 0 one obtains the hard-margin solution that enforces correct classification of all given samples. In [4] it is shown that (1) and (2) turn out to be equivalent if the φ(x_i) lie on the surface of a sphere. Then, the radius R of problem (1) and the margin ρ in (2) can easily be computed by choosing a support vector on the boundary. If the Gaussian kernel

K(x, y) = ⟨φ(x), φ(y)⟩ = exp(−‖x − y‖² / 2σ²)   (3)

is used in order to implicitly map the given samples to some feature space, the φ(x_i) have unit norm and this condition is satisfied. To make the problem solvable, the origin has to be linearly separable from the target class. This precondition is also given if a Gaussian kernel is used. In the following we require that the data has been mapped to some feature space where these conditions hold, i.e. linear separability of the origin and unit norm of all samples.
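As a small illustration of the unit-norm property (assuming the usual squared-distance form of the Gaussian kernel written in Eq. (3)), K(x, x) = 1 for every sample, so every sample has norm one in the induced feature space. The snippet below is our own and not part of the original paper:

```python
import numpy as np

def gaussian_kernel(x, y, sigma):
    """Gaussian kernel of Eq. (3); K(x, x) = 1 for any x."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

x = np.random.randn(10)
assert abs(gaussian_kernel(x, x, sigma=0.5) - 1.0) < 1e-12  # unit norm in feature space
```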
Fig. 1. Some data samples having unit norm are shown as well as the solutions of the optimisation problem (4), i.e. H1 and the solution of (2), i.e. H2
3 OneClassMaxMinOver
In this section we describe a very simple incremental algorithm for one-class classification called OneClassMaxMinOver (OMMO). This algorithm is closely connected to problem (2). It is inspired by the MaxMinOver algorithm for two-class classification proposed in [5]. We consider the problem of finding the hyperplane w_* passing through the origin and having maximum margin ρ_* with respect to the given data samples. Finding this hyperplane is equivalent to solving the optimisation problem (2), that is, finding the hyperplane w_* that separates the given data samples with maximum margin ρ_* from the origin (see Fig. 1). Mathematically, we are looking for the solution of the following optimisation problem:

w_* = arg max_w min_{x_i} w^T x_i    s.t.   ‖w‖ = 1 .   (4)
The margin ρ_* is obtained by

ρ_* = min_{x_i} w_*^T x_i .   (5)
In the following, w_t denotes the approximation of w_* at time t. During the learning process the constraint ‖w‖ = 1 is dropped. The algorithm starts with w_0 = 0 and after t_max learning iterations the norm of the final approximation w_{t_max} is set to one. In each learning iteration, the algorithm selects the sample that is closest to the current hyperplane defined by w_t:

x_min(t) = arg min_{x_i} w_t^T x_i .   (6)
For each given training sample x_i there is a counter variable α_i that is increased by 2 whenever the sample is selected as x_min(t):

α_i = α_i + 2   for   x_min(t) = x_i .   (7)
X (t) denotes the set of samples xj for which αj > 0 holds at time t. Out of this set, the algorithm selects the sample being most distant with respect to the current hyperplane defined by wt :
x_max(t) = arg max_{x_j ∈ X(t)} w_t^T x_j .   (8)
Whenever a sample is selected as x_max(t), its associated counter variable is decreased by 1:

α_i = α_i − 1   for   x_max(t) = x_i .   (9)

The approximation of w_* in learning iteration t + 1 is given by

w_{t+1} = Σ_{i=1}^{L} α_i x_i .   (10)
Note that (7) and (9) can be combined into the learning rule

w_{t+1} = w_t + 2 x_min(t) − x_max(t) .   (11)
Altogether, we obtain Algorithm 1 for incremental one-class classification.

Algorithm 1. OneClassMaxMinOver. With h(x_i) = Σ_{j=1}^{L} α_j x_j^T x_i:

    α_i ← 0   ∀ i = 1, ..., N
    for t = 0 to t_max do
        x_min(t) ← arg min_{x_i ∈ X} h(x_i)
        x_max(t) ← arg max_{x_i ∈ X(t)} h(x_i)
        α_min ← α_min + 2
        α_max ← α_max − 1
    end
    α ← α / Σ_i α_i
    ρ ← min_{x_i ∈ X} h(x_i)
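To make Algorithm 1 concrete, the following is a minimal NumPy sketch written by us; the function name, the handling of the first iterations (while X(t) is still empty), and the optional soft-margin parameter C (anticipating Sec. 3.2) are our own choices and are not prescribed by the paper:

```python
import numpy as np

def one_class_maxminover(X, t_max=10000, kernel=None, C=None):
    """Sketch of OneClassMaxMinOver on samples X of shape (N, d)."""
    N = X.shape[0]
    if kernel is None:
        K = X @ X.T                                   # linear kernel matrix
    else:
        K = np.array([[kernel(a, b) for b in X] for a in X])
    if C is not None:
        K = K + np.eye(N) / C                         # 2-norm soft margin via modified kernel
    alpha = np.zeros(N)
    for _ in range(t_max):
        h = K @ alpha                                 # h(x_i) = sum_j alpha_j K(x_j, x_i)
        i_min = int(np.argmin(h))                     # sample closest to the hyperplane
        in_set = np.flatnonzero(alpha > 0)            # X(t): samples with positive coefficient
        alpha[i_min] += 2
        if in_set.size > 0:
            i_max = int(in_set[np.argmax(h[in_set])]) # most distant sample within X(t)
            alpha[i_max] -= 1
    alpha = alpha / alpha.sum()                       # normalise the coefficients
    rho = float(np.min(K @ alpha))                    # margin estimate
    return alpha, rho
```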
In Section 3.1 we are going to prove that for t → ∞ the following propositions hold:

– w_t / ‖w_t‖ converges at least as O(1/√t) to w_*.
– α_i > 0 only holds for samples x_i having distance ρ_* with respect to the hyperplane w_*, i.e. support vectors.

As mentioned before, we require that the data set has been mapped into some feature space where all samples have unit norm and can be linearly separated from the origin. However, this does not have to be done explicitly. It can also be achieved by replacing the standard scalar product with a kernel that implements an implicit mapping to a feature space having the required properties. In case of the OMMO algorithm this corresponds to replacing the function h(x_i) with

h(x_i) = Σ_{j=1}^{L} α_j K(x_j, x_i) ,   (12)
where K(xj , xi ) is an appropriate kernel function. For a discussion of kernel functions see, for example, [6].
3.1 Proof of Convergence
Our proof of convergence for the OneClassMaxMinOver algorithm is based on the proof of convergence for MaxMinOver by Martinetz [5], who showed a convergence speed of O(1/√t) in the case of two-class classification.

Proposition 1. The length of w_t is bounded such that ‖w_t‖ ≤ ρ_* t + 3√t.
ρ∗ ≥ ρt =
T
∀ t : xmin (t) xmax (t) = cos β xmin (t) xmax (t) ≥ −1 . =1
(13) (14) (15)
=1
The case t = 0 is trivial and for t → t + 1 it follows that wt+1 2
(11)
wTt w t + 2wTt (2xmin (t) − xmax (t)) + (2xmin (t) − xmax (t)) wTt w t + 2wTt xmin (t) + 2 wTt xmin (t) − w Tt xmax (t) +
= =
2
≤0
4xmin (t)T xmin (t) + xmax (t)T xmax (t) − 4xmin (t)T xmax (t) (14),(15)
≤ (13) ≤ ≤ = ≤ ≤ =
wTt w t + 2wTt xmin (t) + 9 wTt w t + 2ρ∗ wt + 9 √ 2 √ ρ∗ t + 3 t + 2ρ∗ ρ∗ t + 3 t + 9 √ √ ρ2∗ t2 + 2ρ2∗ t + 6ρ∗ t t + 6ρ∗ t + 9t + 9 √ ρ2∗ (t2 + 2t) + 6ρ∗ (t + 1) t + 9(t + 1) + ρ2∗ √ ρ2∗ (t + 1)2 + 6ρ∗ (t + 1) t + 1 + 9(t + 1) √ 2 . ρ∗ (t + 1) + 3 t + 1
Theorem 1. For t → ∞ the angle γ_t between the optimal direction w_* and the direction w_t found by OMMO converges to zero, i.e. lim_{t→∞} γ_t = 0.

Proof.

cos γ_t = w_*^T w_t / ‖w_t‖ = (1/‖w_t‖) Σ_{i=0}^{t−1} w_*^T (2 x_min(i) − x_max(i))   (16)
= (1/‖w_t‖) Σ_{i=0}^{t−1} w_*^T x_min(i)  ≥  ρ_* t / ‖w_t‖   (17)
≥  ρ_* t / (ρ_* t + 3√t)   (by Prop. 1)
=  1 / (1 + 3/(ρ_* √t))  ≥  1 − 3/(ρ_* √t)  →  1   for t → ∞ ,

where the inequality in (17) uses w_*^T x_min(i) ≥ ρ_*.
From (16) to (17) we have used that a sample can only be forgotten if it was learnt before, such that ∀ x_max(t) ∃ x_min(t'), t' < t : x_max(t) = x_min(t').

Theorem 2. Beyond some finite number of iterations t̂ the set X(t) will always consist only of support vectors.

Proof. First, we show that after some finite number of iterations t' the x_min(t) with t > t' will always be a support vector. We use an orthogonal decomposition of w_t as shown in Fig. 2. X_sv will denote the set of true support vectors. For an indirect proof we assume that such a finite number of iterations t' does not exist, i.e. there is no t' < ∞ such that x_min(t) ∈ X_sv for all t > t'. Then

ρ_* ≥ ρ_t = w_t^T x_min(t) / ‖w_t‖
= (cos γ_t ‖w_t‖ w_* + u_t)^T x_min(t) / ‖w_t‖   (by (20))
= cos γ_t w_*^T x_min(t) + u_t^T x_min(t) / ‖w_t‖
= cos γ_t w_*^T x_min(t) + (u_t^T x_min(t) / ‖u_t‖) sin γ_t ,   (18)   (by (21))

where cos γ_t → 1 and sin γ_t → 0 for t → ∞, |u_t^T x_min(t)| / ‖u_t‖ ≤ 1, and w_*^T x_min(t) ≥ ρ_*.
If x_min(t) is not a support vector, w_*^T x_min(t) > ρ_* holds. Due to (18), there is therefore a t' such that x_min(t) being a non-support vector for some t > t' inevitably leads to a contradiction. Note that for t > t' only support vectors are added to the set X(t), i.e. there is only a finite number of non-support vectors contained in the set X(t). As a consequence, after a finite number of iterations also x_max(t) will always be a support vector. Now, we show that all non-support vectors in the set X(t) will be removed. Assumption: there exists a sample x that is not a support vector but remains in the set X(t), i.e. ∃ x : x ∉ X_sv ∧ x ∈ X(t) for all t. This means that

(w_t^T / ‖w_t‖) x  <  (w_t^T / ‖w_t‖) x_max(t) ,   (19)

where the right-hand side tends to ρ_* for t → ∞,
always holds. This leads to a contradiction since after a finite number of iterations xmax (t) will always be a support vector.
Fig. 2. Orthogonal decomposition of w_t and the properties that hold within this decomposition:

w_t = cos γ_t ‖w_t‖ w_* + u_t ,   (20)
‖u_t‖ = ‖w_t‖ sin γ_t ,   (21)

with ‖w_*‖ = 1.
3.2 Soft-OneClassMaxMinOver
So far, we have only considered the hard-margin problem. In order to realise a 2-norm soft-margin version of the OMMO algorithm, we consider the quadratic optimisation problem

min_{w,ξ}  (1/2)‖w‖² + (C/2) Σ_i ξ_i²    s.t.   ∀ i : w^T φ(x_i) ≥ 1 − ξ_i .   (22)

In the hard-margin case (C → ∞) this is equivalent to the optimisation problem (4). Note that, compared to the optimisation problems (1) and (2), the constraint on each slack variable (ξ_i ≥ 0) disappears. By constructing the primal Lagrangian of (22), setting the partial derivatives to zero and rearranging [6], we obtain

min_α  ( (1/2) Σ_{i,j} α_i α_j (K(x_i, x_j) + (1/C) δ_ij) − Σ_i α_i )    s.t.   ∀ i : α_i ≥ 0 ,   (23)

where δ_ij is the Kronecker delta, which is 1 if i = j and 0 otherwise. As mentioned in [6], this can be understood as solving the hard-margin problem in a modified kernel space. The modified kernel is K(x_i, x_j) + (1/C) δ_ij. Hence, in order to implement a 2-norm soft-margin version of OMMO, we modify Algorithm 1 such that

h(x_i) = Σ_{j=1}^{L} α_j (K(x_j, x_i) + (1/C) δ_ij) .   (24)
4 Experiments and Results
We applied the OMMO algorithm to artificial datasets and a real-world problem using Gaussian kernels. We created a sinusoid and an xor dataset each consisting of 250 samples (Fig. 3). The hyperparameters were set to extremal values and to more appropriate ones that can be determined, for instance, by cross-validation. The results on the artificial datasets are shown in Fig. 3. Similar to the approach (2) that implements a 1-norm slack term, different solutions ranging from hard
[Figure 3 panel parameters (σ, C): (a) 0.3, 10^6; (b) 0.3, 0.5; (c) 0.1, 1.5; (d) 0.07, 10^6; (e) 0.19, 10^4; (f) 0.2, 20; (g) 0.35, 10^6; (h) 0.35, 0.3; (i) 0.1, 6.5; (j) 0.1, 10^6.]
Fig. 3. Our algorithm applied to two exemplary artificial datasets – sinusoid, and xor. The parameters (σ, C) are shown above each graph. The stars depict support vectors that lie inside the hypersphere, dark circles depict support vectors outside the hypersphere. All other samples as well as the boundary are represented in white. In the first and third row extremal values for σ and C were chosen to achieve hard-margin ((a), (d), (g), (j)) as well as soft margin solutions ((b), (c), (h), (i)). In (b), (c), and (h) there is no support vector which lies inside the hypersphere and so the boundary is only influenced by support vectors from outside the hypersphere. The best solutions for the datasets are shown in the second row, where the values lie in between the extremal hard and soft margin solutions.
[Figure 4 shows two ROC curves (true positive rate over false positive rate): (a) OMMO with C = 2.395·10^5, σ = 1.9879: AUC = 0.868600, EER = 0.209230, |SV| = 0.0327%; (b) libSVM with ν = 0.09, σ = 1.9545: AUC = 0.868741, EER = 0.209484, |SV| = 0.0309%.]
Fig. 4. The receiver-operator-characteristics shows that the two algorithms achieve the same performance on a test set of 472 faces and 23573 non-faces. Both models obtained by parameter validation are rather hard- than soft-margin (C large, ν small). They have a Gaussian width of σ ≈ 1.9 and the fraction of support vectors is almost equal. The performance measured by the area under curve (AUC) for OMMO is the same as for libSVM. This holds also for the equal-error-rate (EER).
to soft margin can be realised by controlling the parameter C, i.e. the relevance of outliers can be controlled. Furthermore, we applied the OMMO algorithm to the problem of face detection where we used the MIT-CBCL face detection dataset1 that contains 2901 images of faces and 28121 images of non-faces of size 19x19 pixels. The dataset is divided into a training set containing 2429 faces and 4548 non-faces and a test set containing 472 faces and 23573 non-faces. We used the raw data but performed the preprocessing steps described in [7] to reduce the within-class variance. Afterwards, we took the training set to perform a simple grid search over σ, C and chose randomly 1215 faces to train OMMO and tested the performance on a test set with 1214 faces and 4548 non-faces. To reduce the variance we performed 25 runs at all combinations of σ, C. The performance of OMMO for a fixed σ, C was evaluated by the equal-error-rate of the receiver-operator-characteristics (ROC). Having determined the optimal parameters σ, C, we trained OMMO with the whole training set of 2429 faces and computed the ROC curve of the 24045 test samples. The same steps were performed using the libSVM [8], except that here we used the parameter ν to control the softness. A comparison between both ROC curves is depicted in Fig. 4. Although these two approaches differ significantly in their implementation complexity, their performance is almost equal. The execution time of OMMO and libSVM cannot be compared directly because it depends heavily on implementation details.
1 http://cbcl.mit.edu/software-datasets/FaceData.html
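The face-detection evaluation above ranks the test samples by their classifier output and reports the area under the ROC curve (AUC) and the equal-error-rate (EER). The following is an illustrative sketch of such an evaluation written by us; the function names are ours and this is not the authors' evaluation code:

```python
import numpy as np

def roc_curve(scores, labels):
    """ROC points from confidence scores; labels are +1 (face) / -1 (non-face)."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(labels)[order]
    tpr = np.cumsum(y == 1) / np.sum(y == 1)
    fpr = np.cumsum(y == -1) / np.sum(y == -1)
    return fpr, tpr

def auc_and_eer(fpr, tpr):
    auc = np.trapz(tpr, fpr)                      # area under the ROC curve
    i = int(np.argmin(np.abs(fpr - (1.0 - tpr)))) # point where FPR = 1 - TPR
    return float(auc), float(fpr[i])
```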
5 Conclusions
Based on an existing two-class classification method, we proposed a very simple and incremental boundary approach, called OneClassMaxMinOver (OMMO). OMMO can be realised with only a few lines of code, which makes it interesting particularly for practitioners. We proved that after a finite number of learning steps OMMO yields a maximum margin hyperplane that is described only by support vectors. Furthermore, we showed that the speed of convergence of OMMO is O(1/√t), where t is the number of iterations. Considering the ideas described in [9], even an O(1/t) convergence of OMMO can be expected. By simply using a modified kernel function K'(x, y) = K(x, y) + (1/C) δ_xy, OMMO can also realise a soft maximum margin solution controlled by the softness parameter C. Thus, OMMO can cope with datasets that also contain outliers. In the future, a closer look at convergence speed and bounds on the target error will be taken. Moreover, the problem of parameter validation will be examined, since in many one-class problems only target objects are available. Thus, standard validation techniques cannot be applied. If it is not possible to evaluate the target error, the complexity or volume of the boundary description has to be estimated in order to select good hyperparameters. Since simple sampling techniques fail to measure the volume in high-dimensional input spaces, more sophisticated methods need to be derived.
References 1. Vapnik, V.N.: The nature of statistical learning theory. Springer, Heidelberg (1995) 2. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Sch¨ olkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods — Support Vector Learning, pp. 185–208. MIT Press, Cambridge (1999) 3. Tax, D.M.J., Duin, R.P.W.: Data domain description using support vectors. In: ESANN, pp. 251–256 (1999) 4. Sch¨ olkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A.J., Williamson, R.C.: Estimating the support of a high-dimensional distribution. Neural Computation 13(7), 1443–1471 (2001) 5. Martinetz, T.: MaxMinOver: A Simple Incremental Learning Procedure for Support Vector Classification. In: IEEE Proceedings of the International Joint Conference on Neural Networks (IJCNN 2004), Budapest, Hungary, pp. 2065–2070 (2004) 6. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge (2000) 7. Sung, K.K.: Learning and example selection for object and pattern detection. In: MIT AI-TR (1996) 8. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/cjlin/libsvm 9. Martinetz, T.: Minover revisited for incremental support-vector-classification. In: Rasmussen, C.E., B¨ ulthoff, H.H., Sch¨ olkopf, B., Giese, M.A. (eds.) DAGM 2004. LNCS, vol. 3175. Springer, Heidelberg (2004)
A Multiple Kernel Learning Approach to Joint Multi-class Object Detection

Christoph H. Lampert and Matthew B. Blaschko

Max Planck Institute for Biological Cybernetics, Department for Empirical Inference, 72076 Tübingen, Germany
{chl,blaschko}@tuebingen.mpg.de
Abstract. Most current methods for multi-class object classification and localization work as independent 1-vs-rest classifiers. They decide whether and where an object is visible in an image purely on a per-class basis. Joint learning of more than one object class would generally be preferable, since this would allow the use of contextual information such as co-occurrence between classes. However, this approach is usually not employed because of its computational cost. In this paper we propose a method to combine the efficiency of single class localization with a subsequent decision process that works jointly for all given object classes. By following a multiple kernel learning (MKL) approach, we automatically obtain a sparse dependency graph of relevant object classes on which to base the decision. Experiments on the PASCAL VOC 2006 and 2007 datasets show that the subsequent joint decision step clearly improves the accuracy compared to single class detection.
1 Introduction
Object detection in natural images is inherently a multi-class problem. Already in 1987, Biederman estimated that humans distinguish between at least 30,000 visual object categories [3]. Even earlier, he showed that the natural arrangement and co-occurrence of objects in scenes strongly influences how easy it is to detect objects [4]. Recently, Torralba and Oliva obtained similar results for automatic systems [27]. However, most algorithms that are currently developed for object detection predict the location of each object class independently from all others. The main reason for this is that it allows the algorithms to scale only linearly in the number of classes. By disregarding other object classes in their decision, such systems are not able to make use of dependencies between object classes. Dependencies are typically caused by functional relations between objects (an image showing a computer keyboard has a high chance of also showing a computer screen), or by location and size (a mosquito would not be visible in an image of an elephant). In this paper, we propose a method that automatically makes use of dependencies between objects and their background as well as between different object
Fig. 1. Example images from the PASCAL VOC 2007 dataset [8]. Objects are marked by their bounding boxes. Some object classes like chairs and tables or cars and buses tend to occur together.
classes. It relies on first performing an overcomplete per-class detection, followed by a post-processing step on the resulting set of candidate regions. All necessary parameters, in particular the relation between the categories, are learned automatically from training data using a multiple kernel learning procedure. As a result, we obtain a sparse dependency graph of classes that are relevant to each other. At test time only these relevant classes are considered, making the algorithm efficiently applicable for problems with many object categories.
2 Related Work
Early approaches to object localization were mainly targeted at the detection of frontal faces and of pedestrians in street scenes. The influential work by Viola and Jones [28] might be the most well known publication in this area. Viola and Jones propose to detect faces by applying a cascade of weak classifiers at every location of the image. Regions that do not look face-like are rejected early on, whereas promising regions are kept until a final decision is made. The authors also mention the possibility of using a different classifier as last element of the cascade, which then acts as a strong post-filter of the cascade’s output. Such two-step procedures, in which a first stage predicts candidate regions and a second stage accepts or rejects them, have been used frequently, especially when real-time performance is required. Variants include the of use of artificial neural networks [22], linear SVMs [7], reduced set SVMs [14], tree-structures instead of linear cascades [17], or fusion of different data modalities [19]. Recently, more methods to also detect multiple and more general object classes have been developed. In this area, sliding window approaches of single layer quality functions are more popular than hierarchical cascades to generate the regions of interest [11,15]. Alternatively, the Implicit Shape Model has been used [12], or heuristic techniques based on keypoint voting [5,6]. There is also a variety of methods to estimate the location and pose of object by probabilistic or geometric part models [2,10,13,18,24]. However, all these methods have in common that they only consider one object class at a time and cannot make use of class dependencies. Attempts to take context into account using probabilistic appearance models, e.g. by Torralba [25], were restricted to object–background interaction and do not
capture relations between objects. Aiming at simultaneous multi-class detection, Torralba et al. [26] have proposed to share features between different classes, but this applies only to the class representations and does not allow one to base the final decision on between-class dependencies. To our knowledge, the only published work making use of inter-class dependencies to improve object detection is by Rabinovich et al. [20]. Their method segments the image and classifies all segments jointly based on a conditional random field. However, this requires the object dependencies to be specified apriori, whereas our method learns the dependencies during the training process to best reflect the a-posteriori beliefs.
3 Joint Multi-class Object Detection
The proposed method for joint object detection is applicable as a post-processing operation to any of the single-class methods mentioned in the previous section. We will therefore concentrate on this aspect and assume that routines to identify candidate locations for K object classes ω_1, ..., ω_K are given. We do not, however, assume that these routines are able to reliably judge if an object is present at all or not, so in theory, even random sampling of locations or an exhaustive search would be possible. For all candidate regions, it is predicted whether they are correct or incorrect hypotheses for the presence of their particular object class. In this way we reduce the problem to a collection of binary classifications, but in contrast to existing detection methods, the decision is based jointly on all object hypotheses in the image, not only on each of them separately. Following a machine learning approach, the system learns its parameters from a set of training images I^i, i = 1, ..., N, with known locations l_1^i, ..., l_{n_i}^i and class labels for the n_i objects present in I^i. For simplicity, we assume that there is exactly one candidate region per class per image, writing x^i := (I^i, l_1^i, ..., l_K^i) for i = 1, ..., N. This is not a significant restriction, since for missing classes we can insert random or empty regions, and for classes with more than one candidate region, we can create multiple training examples, one per object instance. For every test image I we first predict class hypotheses and then, if necessary, we use the same construction as above to bring the data into the form x = (I, l_1, ..., l_K). The class decisions are given by a vector-valued function

f : I × L × ··· × L → R^K   (K factors of L)   (1)
where I denotes the space of images and L denotes the set of representations for objects, e.g. by their location and appearance. Each component fk of f corresponds to a score how confident we are that the object lk is a correct detection of an object of class ωk . If a binary decision is required, we use only the sign of fk .
3.1 Discriminative Linear Model
In practical applications, the training set will rarely be larger than a couple of hundred or a few thousand examples. This is a relatively low number taking into account that the space I × L^K grows exponentially with the number of classes. We therefore make use of a discriminative approach to classification, which has been shown to be robust against the curse of dimensionality. Following the path of statistical learning theory, we assume f to be a vector-valued function that is linear in a high-dimensional feature space H. Using the common notation of reproducing kernel Hilbert spaces, see e.g. Schölkopf and Smola [21], each component function f_k of f can be written as

f_k(x) = ⟨w_k, φ_k(x)⟩_H + b_k   (2)
where x = (I, l_1, ..., l_K). The feature map φ_k : I × L^K → H is defined implicitly by the relation k_k(x, x') = ⟨φ_k(x), φ_k(x')⟩_H for a positive definite kernel function k_k. The projection directions w_k ∈ H and the bias terms b_k ∈ R parametrize f. Note that we do not compromise our objective of learning a joint decision for all classes by writing separate equations for the components of f, because each f_k is still defined over the full input space I × L^K.
3.2 Learning Class Dependencies
In an ordinary SVM, only w_k and b_k are learned from training data, whereas the kernels k_k and thereby the feature maps φ_k are fixed. In our setup, this approach has the drawback that the relative importance of one class for another must be fixed a priori in order to encode it in k_k. Instead, we follow the more flexible approach of multiple kernel learning (MKL) as developed by Lanckriet et al. [16] and generalize it to vector valued output. MKL allows us to learn the relative importance of every object class for the decision of every other class from the data. For this, we parametrize k_k = Σ_{j=0}^{K} β_k^j κ_j, where the κ_j are fixed base kernels. The weights β_k^j are learned together with the other parameters during the training phase. They have the characteristics of probability distributions if we constrain them by β_k^j ∈ [0, 1] and Σ_j β_k^j = 1 for all k. We assume that each base kernel κ_j reflects similarity with respect only to the object class ω_j, and that κ_0 is a similarity measure on the full image level. Note, however, that this is only a semantic choice that allows us to directly read off class dependencies. For the MKL training procedure, the choice of base kernels and also their number is arbitrary. Because the coefficients β_k^j are learned from training data, they correspond to a-posteriori estimates of the conditional dependencies between object classes: the larger the value of β_k^j, the more the decision for a candidate region of class ω_k depends on the region for class ω_j, where the dependency can be excitatory or inhibitory. In contrast, β_k^j = 0 will render the decision function for class ω_k independent of class ω_j. This interpretation shows that the joint-learning approach is a true generalization of image based classifiers (setting β_k^0 = 1 and β_k^j = 0 for j ≠ 0) and of single class object detectors (β_k^j = δ_jk).
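To make the parametrization k_k = Σ_j β_k^j κ_j concrete, a combined Gram matrix can be formed from precomputed base kernel matrices. The sketch below is our own and only illustrates the weighting and the simplex constraint on the weights:

```python
import numpy as np

def combined_kernel(base_kernels, beta_k):
    """Class-specific combined kernel k_k = sum_j beta_k^j * kappa_j.
    base_kernels : list of (n, n) Gram matrices [kappa_0, ..., kappa_K]
    beta_k       : non-negative weights for class k that sum to one
    """
    beta_k = np.asarray(beta_k, dtype=float)
    assert np.all(beta_k >= 0) and abs(beta_k.sum() - 1.0) < 1e-9
    return sum(b * K for b, K in zip(beta_k, base_kernels))
```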
3.3 Vector Valued Multiple Kernel Learning
To learn the parameters of f we apply a maximum-margin training procedure. As usual for SVMs, we formulate the criterion of maximizing the soft margin for all training examples with slack variables ξ_k^i as the minimization over the norm of the projection vector. Consequently, we have to minimize

(1/2) Σ_{k=1}^{K} Σ_{j=0}^{K} β_k^j ‖w_k^j‖²_{H_j} + C Σ_{i=1}^{n} Σ_{k=1}^{K} ξ_k^i   (3)

with respect to w_k^j ∈ H_j, b_k ∈ R, β_k^j ∈ [0, 1] and ξ_k^i ∈ R_+, subject to

y_k^i ( Σ_{j=0}^{K} β_k^j ⟨w_k^j, φ_j(x^i)⟩_{H_j} + b_k ) ≥ 1 − ξ_k^i    for i = 1, ..., N, k = 1, ..., K,
where the training labels y_k^i ∈ {±1} indicate whether the training location l_k^i in image I^i did in fact contain an object of class ω_k, or whether it was added artificially. The space H_j with scalar product ⟨·,·⟩_{H_j} is implicitly defined by κ_j, and C is the usual slack penalization constant for soft-margin SVMs. Because the constraints for different k do not influence each other, we can decompose the problem into K optimization problems. Each of these is convex, as Zien and Ong [30] have shown. We can therefore solve (3) by applying the multiple kernel learning algorithm K times. The results of the training procedure are the classifiers

f_k(x) = Σ_{i=1}^{N} Σ_{j=0}^{K} α_k^i β_k^j κ_j(x, x^i) + b_k    for k = 1, ..., K,   (4)
where the coefficients α_k^i are Lagrange multipliers that occur when dualizing Equation (3). Because the coefficients α_k^i and β_k^j are penalized by L1-norms in the optimization step, they typically become sparse. Thus, in practice most of the N · K terms in Equation (4) are zero and need not be calculated.
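Exploiting this sparsity at test time amounts to skipping all terms of Eq. (4) with zero coefficients. A hypothetical helper (names and calling convention are our own) might look like this:

```python
def f_k(x, alpha_k, beta_k, b_k, base_kernel_fns, train_samples):
    """Evaluate the class-k decision function of Eq. (4), skipping zero terms.
    base_kernel_fns[j](x, x_i) evaluates the base kernel kappa_j."""
    score = b_k
    for i, a in enumerate(alpha_k):
        if a == 0.0:
            continue
        for j, b in enumerate(beta_k):
            if b == 0.0:
                continue
            score += a * b * base_kernel_fns[j](x, train_samples[i])
    return score
```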
4 Experiments
For experimental evaluation we use the recent PASCAL VOC 2006 and VOC 2007 image datasets [8,9]. They contain multiple objects per image from sets of classes that we can expect to be inherently correlated, e.g. tables/chairs and cars/buses, or anti-correlated, e.g. cats/airplanes. Some examples are shown in Figure 1. In VOC 2006, there are 5,304 images with 9,507 labeled objects from 10 classes. VOC 2007 contains 9,963 images, with a total of 24,640 objects from 20 different classes. Both datasets have pre-defined train/val/test splits and ground truth in which objects are represented by their bounding boxes.
4.1 Image Representation
We process the images following the well established bag-of-features processing chain. At first, we extract local SURF descriptors [1] at interest point locations as well as on a regular image grid. On average, this results in 16,000 local descriptors per image. We cluster a random subset of 50,000 descriptors using K-means to build a codebook of 3,000 entries. For every descriptor only its x, y position in the image and the cluster ID of its nearest neighbor codebook entry are stored. To represent a full image, we calculate the histogram of cluster IDs of all feature points it contains. Similarly, we represent a region within an image by the histogram of feature points within the region. These global or local histograms are the underlying data representation for the generation of candidate regions as well as for the class-decision step. The joint decision function takes a set of hypothesized object regions as input. To generate these for our experiments, we use a linear SVM approach similar to Lampert et al. [15]: one SVM per object class is trained using the ground truth object regions in the training set as positive training examples and randomly sampled image boxes as negative training examples. The resulting classifier functions are evaluated over all rectangular regions in the images, and for each image and class, the region of maximal value is used as hypothesis.
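As an illustration of the region representation described above, a histogram over codebook IDs restricted to a rectangular region can be computed as follows; this is our own sketch, with a hypothetical box format (x0, y0, x1, y1):

```python
import numpy as np

def region_histogram(points, cluster_ids, box, n_words=3000):
    """Bag-of-features histogram of the codebook IDs of all descriptors
    whose (x, y) position falls inside the region box = (x0, y0, x1, y1)."""
    x, y = points[:, 0], points[:, 1]
    x0, y0, x1, y1 = box
    inside = (x >= x0) & (x <= x1) & (y >= y0) & (y <= y1)
    return np.bincount(cluster_ids[inside], minlength=n_words)
```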
4.2 Base Kernels and Multiple Kernel Learning
In the area of object classification, the χ²-distance has proved to be a powerful measure of similarity between bag-of-features histograms, see e.g. [29]. We use χ² base kernels in the following way: for a sample x = (I, l_1, ..., l_K), let h_0(x) be the cluster histogram of the image I and let h_j(x) be the histogram for the region l_j within I. We set

κ_k(x, x') = exp( −(1/(2γ_k)) χ²(h_k(x), h_k(x')) )    with    χ²(h, h') = Σ_{c=1}^{3000} (h_c − h'_c)² / (h_c + h'_c) ,

where h_c denotes the c-th component of a histogram h. The normalization constants γ_k are set to the mean of the corresponding χ²-distances between all training pairs. With these kernels, we perform MKL training using the Shogun toolbox, which allows efficient training with up to tens of thousands of examples and tens of kernels [23]. At test time, f is applied to each test sample x = (I, l_1, ..., l_K), and every candidate region l_k is assigned f_k(x) as a confidence score.
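For reference, the χ²-based base kernel of the equation above can be evaluated as follows (our own sketch; bins that are empty in both histograms are skipped to avoid division by zero, a detail the paper does not specify):

```python
import numpy as np

def chi2_kernel(h, h2, gamma):
    """kappa(h, h') = exp(-chi2(h, h') / (2 * gamma)) for two histograms."""
    h, h2 = np.asarray(h, dtype=float), np.asarray(h2, dtype=float)
    num = (h - h2) ** 2
    den = h + h2
    chi2 = np.sum(num[den > 0] / den[den > 0])
    return float(np.exp(-chi2 / (2.0 * gamma)))
```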
5 Results
The VOC 2006 and VOC 2007 datasets provide software to evaluate localization performance as a ranking task. For each class, precision and recall are calculated as follows: at any confidence level ν, the recall is the number of correctly predicted object locations with confidence at least ν divided by the total number of objects in this class. The precision is the same number of correctly detected objects, divided by the total number of boxes with a confidence of ν or more.
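The ranking-based evaluation described above can be sketched as follows (our own illustrative code, not the official VOC evaluation software):

```python
import numpy as np

def precision_recall(confidences, is_correct, n_objects):
    """Precision and recall at every confidence level: detections are sorted by
    confidence, is_correct marks detections whose box matches a ground-truth
    object, and n_objects is the total number of objects of that class."""
    order = np.argsort(-np.asarray(confidences))
    correct = np.asarray(is_correct, dtype=float)[order]
    tp = np.cumsum(correct)
    recall = tp / n_objects
    precision = tp / np.arange(1, len(correct) + 1)
    return precision, recall
```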
[Figure 2 shows precision–recall curves for six classes; the AP values given in the panels are: VOC 2006 motorbike raw 0.086 / optimized 0.369, VOC 2006 bus raw 0.040 / optimized 0.245, VOC 2006 car raw 0.110 / optimized 0.144, VOC 2007 aeroplane raw 0.160 / optimized 0.169, VOC 2007 sofa raw 0.113 / optimized 0.165, VOC 2007 bottle raw 0.001 / optimized 0.005.]
Fig. 2. Typical Precision–Recall Curves for VOC 2006 (top) and VOC 2007 (bottom). The blue (dashed) curve corresponds to the raw scores of the single-class candidate prediction, the black (dark) curve to the filtered results of the jointly optimized system.
A predicted bounding box B is counted as correct if its area overlap area(B ∩ G)/area(B ∪ G) with a ground truth box G of that class is at least 50%. From the precision–recall curves, an average precision score (AP) can be calculated by determining the maximal precision in 11 subintervals of the recall axis and averaging them, see [9] for details. Note, however, that AP scores are unreliable when they fall below 0.1 and should not be used to draw relative comparisons between methods in this case. Figure 2 shows results for three classes each of the VOC 2006 and the VOC 2007 dataset. The plots contain the precision–recall curves obtained by using either the score that the single-class candidate search returns or the output of the learned joint classifier as confidence values. Table 1 lists the AP scores for all 30 classes in VOC 2006 and VOC 2007. In addition to the single-class scores and the joint-learning scores, the results of the corresponding winners of the VOC 2006 and VOC 2007 challenges are included, illustrating the performance of the best state-of-the-art systems.
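The overlap criterion and the 11-point interpolated AP can be sketched as follows (our own code; it assumes boxes in (x0, y0, x1, y1) format and reuses the precision/recall arrays from the previous sketch):

```python
import numpy as np

def overlap(b, g):
    """area(B ∩ G) / area(B ∪ G) for boxes (x0, y0, x1, y1)."""
    iw = max(0.0, min(b[2], g[2]) - max(b[0], g[0]))
    ih = max(0.0, min(b[3], g[3]) - max(b[1], g[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(b) + area(g) - inter)

def average_precision(precision, recall):
    """11-point interpolated AP: mean of the maximal precision at
    recall levels 0.0, 0.1, ..., 1.0."""
    return float(np.mean([np.max(precision[recall >= t], initial=0.0)
                          for t in np.linspace(0.0, 1.0, 11)]))
```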
5.1 Discussion
The plots in Figure 2 and the list of scores in Table 1 show that the joint learning of confidence scores improves the detection results in the majority of cases over the single-class hypothesis prediction, in particular in the range of reliable values AP > 0.1. The increase in performance is more prominent in the VOC 2006 dataset than in 2007 (e.g. Figure 2, left column). For several classes, the system achieves results which are comparable to the participants of the VOC challenges and it even achieves better scores in the three categories bus-2006, sofa-2007 (Figure 2, center column) and dog-2007. The score for diningtable2007 is higher than the previous one as well, but it is unreliable. There are
Table 1. Average Precision (AP) scores for VOC 2006 (top) and VOC 2007 (bottom)

VOC 2006          bicycle  bus    car    cat    cow    dog    horse  motorbike  person  sheep
single-class      0.351    0.040  0.110  0.079  0.032  0.038  0.019  0.086      0.005   0.108
jointly learned   0.411    0.245  0.144  0.099  0.098  0.089  0.045  0.369      0.091   0.091
VOC 2006 best     0.440    0.169  0.444  0.160  0.252  0.118  0.140  0.390      0.164   0.251

VOC 2007          aeroplane  bicycle  bird   boat   bottle  bus    car    cat    chair  cow
single-class      0.160      0.144    0.097  0.020  0.001   0.174  0.120  0.228  0.006  0.053
jointly learned   0.169      0.162    0.052  0.019  0.005   0.168  0.126  0.188  0.009  0.055
VOC 2007 best     0.262      0.409    0.098  0.094  0.214   0.393  0.432  0.240  0.128  0.140

VOC 2007          table  dog    horse  motorbike  person  plant  sheep  sofa   train  tv
single-class      0.049  0.150  0.032  0.207      0.116   0.004  0.092  0.113  0.101  0.055
jointly learned   0.101  0.165  0.048  0.219      0.089   0.023  0.092  0.165  0.118  0.042
VOC 2007 best     0.098  0.162  0.335  0.375      0.221   0.120  0.175  0.147  0.334  0.289
some cases in which both stages of the system fail to achieve a good detection rate compared to the state-of-the-art, e.g. car-2006 or bottle-2007. Analyzing the precision–recall curves shows that this is typically due to a bad set of candidate regions. The maximum recall level in these examples is below 20% and 5% (Figure 2, right column). One cannot hope to achieve better scores in these cases, because the post-processing only assigns a confidence to the object regions but cannot create new ones. We expect that a better hypothesis generation step and a test procedure predicting several candidate boxes per image would improve on this.
5.2 Dependency Graphs
Besides improving the localization performance, the multiple kernel learning also predicts class-specific dependency coefficients β_k^j that allow us to form a sparse
Fig. 3. Automatically learned dependency graph between the classes in VOC 2007. An arrow ω_j → ω_k means that ω_j helps to predict ω_k. This effect can be excitatory or inhibitory. All classes additionally depend on themselves and on the full image (not shown). The score and width of each arrow indicates the relative weight β_k^j / Σ_{l=1}^{K} β_k^l without the image component. Connections with a score below 0.04 have been omitted.
dependency graph. These dependencies are non-symmetric, in contrast to generative measures like co-occurrence frequencies or cross-correlation. Figure 3 shows the automatically generated graph for VOC 2007. One can see that semantically meaningful groups have formed (vehicles, indoors, animals), although no such information was provided at training time.
6 Conclusions
We have demonstrated how to perform joint object-class prediction as a postprocessing step to arbitrary single-class localization systems. This allows the use of class dependencies while remaining computationally feasible. The method is based on a maximum margin classifier using a linear combination of kernels for the different object classes. We gave an efficient training procedure based on formulating the problems as a collection of convex optimization problems. For each class, the training procedure automatically identifies the subset of object classes relevant for the prediction. This provides a further speedup at test time and allows the formation of an a posteriori dependency graph. Experiments on the VOC 2006 and 2007 datasets show that the joint decision is almost always able to improve on the scores that the single-class localization system provided, resulting in state-of-the-art detection rates, if the set of candidate regions allows so. The resulting dependency graph has a semantically meaningful structure. Therefore, we expect that the learned dependency coefficients will be useful for other purposes as well, e.g. to generate class hierarchies. Acknowledgements. This work was funded in part by the EC project CLASS, IST 027978. The second author is supported by a Marie Curie fellowship under the EC project PerAct, EST 504321.
References 1. Bay, H., Tuytelaars, T., Gool, L.J.V.: SURF: Speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 404– 417. Springer, Heidelberg (2006) 2. Bergtholdt, M., Kappes, J.H., Schn¨ orr, C.: Learning of graphical models and efficient inference for object class recognition. In: Franke, K., M¨ uller, K.-R., Nickolay, B., Sch¨ afer, R. (eds.) DAGM 2006. LNCS, vol. 4174, pp. 273–283. Springer, Heidelberg (2006) 3. Biederman, I.: Recognition by components - a theory of human image understanding. Psychological Review 94(2), 115–147 (1987) 4. Biederman, I., Mezzanotte, R.J., Rabinowitz, J.C.: Scene perception: Detecting and judging objects undergoing relational violations. Cognitive Psychology 14, 143–177 (1982) 5. Bosch, A., Zisserman, A., Mu˜ noz, X.: Representing shape with a spatial pyramid kernel. In: CIVR, pp. 401–408 (2007) 6. Chum, O., Zisserman, A.: An exemplar model for learning object classes. In: CVPR, pp. 1–8 (2007)
7. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, pp. 886–893 (2005) 8. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 Results (2007), http://www.pascal-network.org/challenges/VOC/voc2007/workshop/ index.html 9. Everingham, M., Zisserman, A., Williams, C.K.I., Van Gool, L.: The PASCAL Visual Object Classes Challenge 2006 Results (2006), http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf 10. Fergus, R., Perona, P., Zisserman, A.: A sparse object category model for efficient learning and exhaustive recognition. In: CVPR, pp. 380–387 (2005) 11. Ferrari, V., Fevrier, L., Jurie, F., Schmid, C.: Groups of adjacent contour segments for object detection. PAMI 30, 36–51 (2008) 12. Fritz, M., Leibe, B., Caputo, B., Schiele, B.: Integrating representative and discriminative models for object category detection. In: ICCV, pp. 1363–1370 (2005) 13. Keysers, D., Deselaers, T., Breuel, T.M.: Optimal geometric matching for patchbased object detection. ELCVIA 6(1), 44–54 (2007) 14. Kienzle, W., Bakır, G.H., Franz, M.O., Sch¨ olkopf, B.: Face detection - efficient and rank deficient. In: NIPS (2004) 15. Lampert, C.H., Blaschko, M.B., Hofmann, T.: Beyond sliding windows: Object localization by efficient subwindow search. In: CVPR (2008) 16. Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. JMLR 5, 27–72 (2004) 17. Lienhart, R., Liang, L., Kuranov, A.: A detector tree of boosted classifiers for real-time object detection and tracking. ICME 2, 277–280 (2003) 18. Ommer, B., Buhmann, J.M.: Learning the compositional nature of visual objects. In: CVPR (2007) 19. Opelt, A., Pinz, A., Fussenegger, M., Auer, P.: Generic object recognition with boosting. PAMI 28(3), 416–431 (2006) 20. Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., Belongie, S.: Objects in context. In: ICCV (2007) 21. Sch¨ olkopf, B., Smola, A.J.: Learning with Kernels. MIT Press, Cambridge (2002) 22. Schulz, W., Enzweiler, M., Ehlgen, T.: Pedestrian recognition from a moving catadioptric camera. In: Hamprecht, F.A., Schn¨ orr, C., J¨ ahne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 456–465. Springer, Heidelberg (2007) 23. Sonnenburg, S., R¨ atsch, G., Sch¨ afer, C., Sch¨ olkopf, B.: Large scale multiple kernel learning. JMLR 7, 1531–1565 (2006) 24. Teynor, A., Burkhardt, H.: Patch based localization of visual object class instances. In: MVA (2007) 25. Torralba, A.: Contextual priming for object detection. IJCV 53(2), 169–191 (2003) 26. Torralba, A., Murphy, K.P., Freeman, W.T.: Shared features for multiclass object detection. In: Toward Category-Level Object Recognition, pp. 345–361 (2006) 27. Torralba, A., Oliva, A.: Statistics of natural image categories. Network: Computation in Neural Systems 14(3), 391–412 (2003) 28. Viola, P.A., Jones, M.J.: Robust real-time face detection. IJCV 57(2), 137–154 (2004) 29. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV 73(2), 213–238 (2007) 30. Zien, A., Ong, C.S.: Multiclass multiple kernel learning. In: ICML, pp. 1191–1198 (2007)
Fast Generalized Belief Propagation for MAP Estimation on 2D and 3D Grid-Like Markov Random Fields

Kersten Petersen, Janis Fehr, and Hans Burkhardt

Albert-Ludwigs-Universität Freiburg, Institut für Informatik, Lehrstuhl für Mustererkennung und Bildverarbeitung, Georges-Koehler-Allee Geb. 052, 79110 Freiburg, Deutschland
[email protected]
Abstract. In this paper, we present two novel speed-up techniques for deterministic inference on Markov random fields (MRF) via generalized belief propagation (GBP). Both methods require the MRF to have a grid-like graph structure, as it is generally encountered in 2D and 3D image processing applications, e.g. in image filtering, restoration or segmentation. First, we propose a caching method that significantly reduces the number of multiplications during GBP inference. And second, we introduce a speed-up for computing the MAP estimate of GBP cluster messages by presorting its factors and limiting the number of possible combinations. Experimental results suggest that the first technique improves the GBP complexity by roughly factor 10, whereas the acceleration for the second technique is linear in the number of possible labels. Both techniques can be used simultaneously.
1 Introduction
Markov random fields [1] have become popular as probabilistic graphical models for representing images in image processing and pattern recognition applications, such as image filtering, restoration or segmentation. While an MRF provides a sound theoretical model for a wide range of problems and is easy to implement, inference on MRFs is still an issue. It is an NP-hard problem [2] to explore all possible pixel combinations, and thus we have to resort to approximate inference algorithms, such as Markov chain Monte Carlo methods (MCMC) or message passing algorithms. For a long time MCMC methods have been the common choice for inferring on MRFs, although they are non-deterministic and converge very slowly. But since the introduction of message passing algorithms like Pearl's belief propagation (BP), they are no longer state-of-the-art. The BP algorithm has the advantage of being deterministic, fast and precise on many MRFs, especially if they are tree-structured [3]. However, BP proves to be inaccurate and unstable on graphs with many cycles, preventing its application to many image processing problems that are based on grid-like graphs. A more refined variant that generalizes the idea of the BP algorithm has been introduced by Yedidia
Fig. 1. A Markov random field with pairwise potential functions
It is called generalized belief propagation and is far more stable and accurate. Yet, it is also computationally expensive, which limits its use to small graphs. In this paper, we present two techniques that accelerate the GBP algorithm on grid-like MRFs, making GBP more suitable for 2D and 3D image processing tasks.
Related Work. In the literature, we have spotted only two papers, [5] and [6], that focus on speed-up techniques for the GBP algorithm, and both are guided by the same idea: the edge potentials (see section 2) of an MRF can often be divided into compatible pairs of labels, whose values are label-dependent, and incompatible pairs of labels, which all have the same value. As the number of compatible pairs of labels n_c is usually much smaller than the number of incompatible labels n_i, we gain a speed-up of n_c/n_i by not computing redundant incompatible labels. Shental et al. [5] suggest this approach for the Ising model, and Kumar et al. [6] for the more general robust truncated model, comprising the piecewise constant prior and the piecewise smooth prior [7]. Thus, both techniques require a beneficial structure of the edge potentials for sparing redundant label configurations. In contrast, both our techniques accelerate the GBP algorithm on grid-like MRFs with arbitrary potential functions.
2 Belief Propagation and Its Generalization
The belief propagation (BP) algorithm is a message passing method that iteratively computes the marginal probabilities of a Markov random field (MRF). An MRF is defined as a random vector X = (Xs )s∈S on the probability space (X, P ) with respect to a neighborhood system ∂ such that P is strictly positive and fulfills the local Markov property. In this paper, we concentrate on pairwise MRFs, i.e. s ∈ ∂{t} if and only if t ∈ ∂{s} for two sites s, t ∈ S. For a given labeling problem, we have to divide the MRF nodes into two sets: The observed image nodes are denoted by ys , whereas the hidden label nodes of the image are referred to as xs . Figure 1 shows a pairwise MRF where ys is depicted by a filled circle and xs is indicated by an empty circle. Connections between a filled and an empty circle are weighted by node potentials to encode the similarity between the observed image and the hidden labeling. Connections among empty
circles correspond to edge potentials which capture the similarity of neighboring hidden nodes, or more general prior knowledge about the image. According to the Hammersley-Clifford theorem, the joint probability function of this MRF is given by

P(x) = \frac{1}{Z} \prod_{s} \phi_s(x_s) \prod_{st} \psi_{st}(x_s, x_t)    (1)

where Z denotes a partition function, \phi_s(x_s) a node potential, and \psi_{st}(x_s, x_t) an edge potential. Note that we use the shorthand notation \phi_s(x_s) for \phi_s(x_s, y_s), since the observed image can be regarded as fixed.
In the BP algorithm [3], messages are iteratively passed along the edges of the hidden image until their rate of change falls below a pre-set threshold. A message m_{st}(x_t) from node s to node t is a one-dimensional vector that propagates the likeliest label probabilities for t from the view of s. Once the message values have converged, we can compute the beliefs for the hidden nodes, i.e. approximations for the marginal probabilities of the MRF. The BP algorithm features many attractive properties, such as exact estimates in trees or fast execution time. On many loopy graphs, however, it shows poor convergence and delivers inaccurate estimates. The source of error is the circular message flow that distorts good message approximations by self-dependent probability values.
Yedidia et al. [4] propose a generalization of the BP algorithm called generalized belief propagation (GBP) which alleviates this undesired behavior. The basic idea is to compute more informative messages between groups of nodes in addition to messages between single nodes. We obtain an algorithm that demonstrates improved convergence behavior and delivers accurate approximations to the exact estimates. It no longer tends to the stationary points of the Bethe energy but is proven to approximate the fixed points of the more precise Kikuchi energy [3]. The GBP algorithm can be formulated in different ways [8]; in this paper we refer to the parent-to-child variant in max-product form for calculating the MAP estimate on grid graphs. For two-dimensional grid graphs, we obtain the following formulas from [3] (see figures 2 and 3). The formula for computing single-node beliefs is given by

b_s(x_s) = k \, \phi_s \, m_{as} m_{bs} m_{cs} m_{ds}    (2)

where we use the shorthand notation m_{as} \equiv m_{as}(x_s) and \phi_s = \phi_s(x_s). The variables a, b, c and d denote the neighbors of s. If we compute the formula at the border of the grid and a neighbor lies outside the grid, we can neglect the corresponding factors or set them to 1. The message update rule for edge messages, i.e. messages between two single nodes, evaluates to

m_{su}(x_u) = \max_{x_s} \left( \phi_s \, \psi_{su} \, m_{as} m_{bs} m_{cs} \, m_{bdsu} m_{cesu} \right)    (3)

where we abbreviate \psi_{su} = \psi_{su}(x_s, x_u). The message update rule for cluster messages, i.e. messages between two pairs of nodes, unfolds as

m_{stuv}(x_u, x_v) = \frac{\max_{x_s, x_t} \left( \phi_s \phi_t \, \psi_{st} \psi_{su} \psi_{tv} \, m_{as} m_{cs} m_{bt} m_{dt} \, m_{abst} m_{cesu} m_{dftv} \right)}{m_{su} m_{tv}}    (4)
Fig. 2. LEFT: A diagram of the messages that influence the single-node belief at site s in a two-dimensional grid. CENTER and RIGHT: All edge messages (double-lined arrows) that are contained in the same two-node belief region R = {s, u} (gray nodes). Note that the cluster messages from edges to edges are identical in both figures.
Fig. 3. A diagram of all cluster messages (double-lined arrows) that are contained in the same four-node belief region R = {s, t, u, v} (gray nodes). Solid (blue) edges on the grid lines stand for edge messages in the numerator, whereas dashed edges are those in the denominator of the corresponding message update rule. (Green) messages in the centre of grid cells denote cluster messages that influence the value of the (double-lined) cluster message. We can observe that the same messages appear within several figures.
Fig. 4. LEFT: A diagram of the messages that influence the single-node belief at site s in a three dimensional grid. CENTER: The messages that influence the edge message from site s to site u. RIGHT: All messages that influence the (double-lined) cluster message from edge st to edge uv.
On three-dimensional grid graphs, the formulas for the GBP algorithm evaluate to (see figure 4):

b_s(x_s) = k \, \phi_s \, m_{as} m_{bs} m_{cs} m_{ds} m_{es} m_{fs}    (5)

m_{su}(x_u) = \max_{x_s} \left( \phi_s \, \psi_{su} \, m_{as} m_{cs} m_{ds} m_{es} m_{is} \, m_{absu} m_{dfsu} m_{egsu} m_{ijsu} \right)    (6)

m_{stuv}(x_u, x_v) = (m_{su} m_{tv})^{-1} \max_{x_s, x_t} \left( \phi_s \phi_t \, \psi_{st} \psi_{su} \psi_{tv} \, M_1 M_2 \right)    (7)

where

M_1 = m_{as} m_{es} m_{gs} m_{ms} \, m_{bt} m_{ft} m_{ht} m_{nt}
M_2 = m_{abst} m_{efst} m_{mnst} \, m_{acsu} m_{gisu} m_{mosu} \, m_{bdtv} m_{hjtv} m_{nptv}
3 Caching and Multiplication
Analyzing the messages that are computed within the same two- or four-node region, we notice that some messages appear repeatedly. As shown in figure 3, each cluster message computation involves four of the eight surrounding edge messages and three of the four surrounding cluster messages. Remarkably, the selection of the messages is not arbitrary but follows a simple pattern. In a cluster message m_stuv, for instance, where s and t are its source nodes and u and v are its target nodes, node potentials φ are only defined for the source nodes, while edge potentials ψ require that at least one of the involved nodes is a source node. Similarly, incoming edge messages lead from outside the basic cluster to a source node of m_stuv, while incoming cluster messages demand that at least one of their source nodes is a source node of m_stuv. Also in edge messages, source nodes depend on data potentials and incoming edge messages, whereas pairwise node regions rely on edge potentials and incoming cluster messages. We subsume related factors of the message update rules into cache products and benefit in two ways:

1. Caching: Some cache products appear in several message update rules. We gain a speed-up if we pre-compute (cache) them and use them for multiple message computations (see figure 2).

2. Multiplication Order: The multiplication order of the potentials and incoming messages plays a vital role. Node potentials and edge messages are represented by k-dimensional vectors, whereas edge potentials and cluster messages correspond to k × k matrices. Cache products comprise factors of either vector or matrix form, which means that we need fewer computations than in the original formula, where vectors and matrices are interleaved.

3.1 Caching and Multiplication in 2D
Edge Messages in 2D. If we subsume all edge message factors that depend on the same source node s into a cache product P_s, we obtain

P_s = \phi_s \, m_{as} m_{bs} m_{cs},    (8)

whereas edge message factors that depend on two nodes s and u can be summarized as a cache product P_{su}

P_{su} = \psi_{su} \, m_{bdsu} m_{cesu}.    (9)

Combining both cache products, we can rewrite (3) as

m_{su}(x_u) = \max_{x_s} P_s P_{su}.    (10)

Cluster Messages in 2D. In analogy to the case of edge messages, we can define cache products within the message update rule for cluster messages. If we for instance compute the message update rule for the first cluster message of figure 3, the required cache products for source nodes are given by

P_s = \phi_s \, m_{as} m_{cs},    P_t = \phi_t \, m_{bt} m_{dt},    (11)

and the cache products for pairs of nodes can be written as

P_{st} = \psi_{st} \, m_{abst},    P_{su} = \psi_{su} \, m_{cesu},    P_{tv} = \psi_{tv} \, m_{dftv}.    (12)

Substituting these expressions into (4), we obtain

m_{stuv}(x_u, x_v) = (m_{su} m_{tv})^{-1} \max_{x_s, x_t} P_s P_t P_{st} P_{su} P_{tv}.    (13)
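As an illustration of how these cache products can be organized in an implementation, the following sketch (Python with NumPy; not part of the original paper, all function and array names are illustrative assumptions) forms the vector-shaped and matrix-shaped cache products once and reuses them for an edge message (10) and a cluster message (13):

    import numpy as np

    def cache_vector(phi_s, edge_msgs):
        # P_s as in (8)/(11): elementwise product of the node potential (length-k vector)
        # with all incoming edge messages (length-k vectors)
        P = phi_s.copy()
        for m in edge_msgs:
            P *= m
        return P

    def cache_matrix(psi, cluster_msgs):
        # P_su / P_st / P_tv as in (9)/(12): elementwise product of the edge potential
        # (k x k matrix) with all incoming cluster messages (k x k matrices)
        P = psi.copy()
        for M in cluster_msgs:
            P *= M
        return P

    def edge_message(P_s, P_su):
        # Eq. (10): m_su(x_u) = max over x_s of P_s(x_s) * P_su(x_s, x_u)
        return (P_s[:, None] * P_su).max(axis=0)

    def cluster_message(P_s, P_t, P_st, P_su, P_tv, m_su, m_tv):
        # Eq. (13): maximize over (x_s, x_t) for every (x_u, x_v),
        # then divide by the edge messages m_su(x_u) and m_tv(x_v)
        prod = (P_s[:, None, None, None] * P_t[None, :, None, None]
                * P_st[:, :, None, None]
                * P_su[:, None, :, None]
                * P_tv[None, :, None, :])
        return prod.max(axis=(0, 1)) / (m_su[:, None] * m_tv[None, :])

Because the vector-shaped and matrix-shaped factors are multiplied separately, the interleaving of vectors and matrices of the original update rules is avoided, which is where much of the measured speed-up comes from.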
3.2 Caching and Multiplication in 3D
We can extend the caching and multiplication technique to three-dimensional grids with a six-connected neighborhood system. The only difference is that the products of nodes and edges involve more terms than in the two-dimensional case, thereby increasing the speed-up.
Edge Messages in 3D. The cache product over the source variable of edge messages is computed by

P_s = \phi_s \, m_{as} m_{cs} m_{ds} m_{es} m_{is}    (14)

and the corresponding product over the pairs of nodes is described by

P_{su} = \psi_{su} \, m_{absu} m_{dfsu} m_{egsu} m_{ijsu}.    (15)

Using these definitions of the cache products, (6) takes the same form as in the 2D case (see formula (10)).
Cluster Messages in 3D. For cluster messages, we define the cache products for the source nodes as

P_s = \phi_s \, m_{as} m_{es} m_{gs} m_{ms}    (16)

and the cache products on pairs of nodes as

P_{st} = \psi_{st} \, m_{abst} m_{efst} m_{mnst}.    (17)
The explicit formula (7) transforms to the same formula as in the 2D case (see formula (13)).
4 Accelerating MAP Estimation
According to (13), the GBP algorithm grows with the fourth power of the number of labels, as a cluster message computation involves the traversal of all label combinations of x_s and x_t for each combination of x_u and x_v. Compared to edge messages, which require quadratic computation time (see (10)), the update rule for cluster messages consumes most of the time for inference problems with multiple labels. In this section, we therefore pursue the question whether it is necessary to explore all possible label combinations of x_s and x_t for determining the maximum. In the spirit of [9,6,10], we sort the terms of (13) by source variables x_s and x_t, yielding

m_{stuv}(x_u, x_v) = (m_{su} m_{tv})^{-1} \max_{x_s, x_t} \left( P_{st} M_{su} M_{tv} \right)    (18)
where we define

M_{su} = P_s P_{su},    M_{tv} = P_t P_{tv}.

We observe that the maximum message value is likely to consist of relatively large factors P_st, M_su and M_tv. Thus, for each combination of x_u and x_v the basic idea is to start at the maximum values of M_su and M_tv in the respective columns and then systematically decrease the factors until the product of both entries and the corresponding value in P_st is assured to be maximal. Thus, we have to answer two questions: (1) How do we traverse the label combinations for x_s and x_t such that the product of M_su and M_tv monotonically decreases? (2) Under which conditions can we terminate the search for the maximum?
Traversal Order. We have to proceed for each combination of x_u and x_v separately. Assume we set x_u = u and x_v = v. Then we obtain the maximum product of the first two factors in (18) by determining the maximum entry s_max in the u-th column of M_su and the maximum entry t_max in the v-th column of M_tv. We multiply this product with the entry at position (s_max, t_max) in P_st and store the result as the first temporary maximum product value r_max. Suppose that the entry (s_max, t_max) is the i_left-th biggest value of all entries in P_st. Then all combinations of x_s and x_t whose entry in P_st is smaller than the i_left-th biggest value of P_st are not eligible to be the final maximum product value. For this reason we can save time by solely computing the products for combinations of x_s and x_t with a bigger value than the i_left-th biggest value of P_st. Unfortunately, our speed-up is relatively small if i_left is large. For decreasing i_left, we examine which label combination of x_s and x_t leads to the next biggest product of M_su and M_tv. We sort M_su and M_tv column by column and refer to them as S_su and S_tv. Then s_max and t_max correspond to the positions (1, u) in S_su and (1, v) in S_tv, and the candidates for the next biggest combination have to be either (1, u) in S_su and (2, v) in S_tv, or (2, u) in S_su and (1, v) in S_tv. We compute both products and take the bigger one. In general, the set of candidates for the next biggest product value of M_su and M_tv consists of label combinations for x_s and x_t that are adjacent to the already visited ones. Compare figure 5, where s_S and t_S refer to the row positions in S_su and S_tv.
Fig. 5. A graphical depiction of the candidates for the next combination of s_S and t_S. Visited combinations are marked with a tick on dark gray background. The light gray fields with a c denote possible candidates for the maximal unvisited combination. White fields are unvisited and are not eligible as the next possible combination.
Thus, we gradually determine the next biggest products of M_su and M_tv and multiply each of them with the corresponding entry of s and t in P_st. We compare the result with the temporary maximum product value r_max and replace it if the new value is bigger. In this case we also update i_left. This pattern is repeated until i_left is considerably small, i.e. smaller than a pre-set threshold β. Once i_left falls below β, we can trigger the traversal of the fewer than i_left combinations whose entries in P_st are bigger than the entry of the current maximum product value (s_max, t_max).
Termination Conditions. We have found the maximum product if any of the following two conditions is satisfied:
1. We have visited all entries in P_st that are bigger than the entry at position (s_max, t_max) in P_st.
2. The product of M_su and M_tv, multiplied with the maximal unvisited entry in P_st, is smaller than r_max.
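The core of the traversal can be sketched as follows (Python; a simplified variant, not from the paper, that for one fixed combination of x_u and x_v enumerates the label pairs in decreasing order of M_su · M_tv with a max-heap and stops via the second termination condition only; the i_left bookkeeping of the full algorithm is omitted and all names are illustrative):

    import heapq
    import numpy as np

    def max_cluster_entry(P_st, Msu_col, Mtv_col):
        # Returns max over (x_s, x_t) of P_st[s, t] * Msu_col[s] * Mtv_col[t].
        # Assumes non-negative entries, so products decrease along the sorted orders.
        k = len(Msu_col)
        s_order = np.argsort(-Msu_col)           # S_su: column sorted in decreasing order
        t_order = np.argsort(-Mtv_col)           # S_tv
        S, T = Msu_col[s_order], Mtv_col[t_order]
        p_max = P_st.max()
        best = -np.inf
        heap = [(-S[0] * T[0], 0, 0)]            # max-heap via negated keys
        seen = {(0, 0)}
        while heap:
            neg, i, j = heapq.heappop(heap)
            st_val = -neg                        # largest M_su * M_tv among unvisited pairs
            if st_val * p_max <= best:           # termination condition 2
                break
            s, t = s_order[i], t_order[j]
            best = max(best, st_val * P_st[s, t])
            for ni, nj in ((i + 1, j), (i, j + 1)):   # adjacent candidates (cf. figure 5)
                if ni < k and nj < k and (ni, nj) not in seen:
                    seen.add((ni, nj))
                    heapq.heappush(heap, (-S[ni] * T[nj], ni, nj))
        return best

The division by m_su m_tv from (18) and the outer loop over all combinations of x_u and x_v are left out for brevity.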
5 Experiments
In our experiments on two-dimensional images, the caching and multiplication technique improves on the standard implementation by roughly a factor of 10. Several optimizations contribute to this result. First, we gain most of the speed-up by exploiting cache products that inherently use a beneficial multiplication order. Second, compared to the standard form of the message update rule in [8], the direct evaluation of the message update rules avoids the initial recursive traversal of the region graph for determining all incoming messages in the formulas. And finally, we do not have to evaluate the variable order of factors before multiplication. An additional benefit of the caching and multiplication technique is that we can reduce the storage costs, since we do not have to store references from a message to its dependent messages. On three-dimensional images, we can expect even higher speed-ups, as the cache products consist of more factors.
Labels | Standard [s] | MAP accelerated [s] | Speed-up
     8 |         0.03 |                0.05 |     0.61
    16 |         0.15 |                0.15 |     1.03
    32 |         1.21 |                0.59 |     2.05
    64 |        15.43 |                2.92 |     5.28
   128 |       243.03 |               19.29 |    12.60
   256 |      4845.19 |              183.36 |    26.42
Fig. 6. The average iteration time of the standard and the accelerated MAP implementation for various label sizes
Fig. 7. LEFT: Running time as a function of the number of labels k. RIGHT: Ratio of standard GBP to MAP accelerated GBP as a function of k.
We estimate the effect of the accelerated MAP computation with two experiments (all experiments are conducted on a 3 GHz CPU with 2 GB RAM; the caching and multiplication technique is enabled): First, we measure the speed-up for different numbers of labels on an 8 × 8 grid-like MRF, where we use the Potts model for the edge potentials (the Potts model [1] is the generalization of the Ising model and is used for various image processing tasks). In figure 6, we contrast how the average running time per iteration varies between the standard implementation and our technique. Note that the running time in the first iterations is often much higher than in later iterations. The reason could be that the message values contain more candidate maxima, which have to be evaluated in our optimized algorithm. After several iterations, edge messages seem to attribute relatively high probabilities to a small number of labels. And second, we demonstrate the effectiveness of our technique for other potentials by computing the average running time of 100 random edge potentials for various sizes of k (see figure 7). As opposed to the first experiment, we do not evaluate the computation cost of the whole GBP algorithm, but solely the elapsed time for computing the MAP estimate for cluster messages. Figure 7 shows the ratio of these running times. We can observe that the speed-up of our implementation grows almost linearly. While we always benefit from using cache products, we recommend using the second optimization technique only for label sizes bigger than 15. For smaller
label sizes, the computational overhead per combination outweighs the reduced number of visited combinations.
6 Summary and Conclusion
In this paper, we have presented two novel techniques for reducing the computational complexity of the GBP algorithm on grid-like graphs without losing any precision. Both techniques are independent of the values of the potentials and can be used simultaneously. Thus, our accelerated GBP algorithm may solve an inference problem on a 2D grid-like MRF with 256 labels more than 250 times faster than the standard version. In the future, we are going to investigate how to generalize these techniques for other MRFs, and how we can integrate them with other acceleration techniques that are currently under development.
References

1. Winkler, G.: Image Analysis, Random Fields and Markov Chain Monte Carlo Methods. Springer, Heidelberg (2006)
2. Cowell, R.G., Dawid, A.P., Lauritzen, S.L., Spiegelhalter, D.J.: Probabilistic Networks and Expert Systems. Springer, New York (1999)
3. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Bethe free energy, Kikuchi approximations and belief propagation algorithms. Technical Report TR-2001-16, Mitsubishi Electric Research Laboratories (2001)
4. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Generalized belief propagation. In: NIPS, pp. 689–695 (2000)
5. Shental, N., Zomet, A., Hertz, T., Weiss, Y.: Learning and inferring image segmentations using the GBP typical cut algorithm. In: ICCV 2003: Proceedings of the Ninth IEEE International Conference on Computer Vision, Washington, DC, USA, p. 1243. IEEE Computer Society, Los Alamitos (2003)
6. Kumar, M.P., Torr, P.H.S.: Fast memory-efficient generalized belief propagation. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 451–463. Springer, Heidelberg (2006)
7. Veksler, O.: Efficient graph-based energy minimization methods in computer vision. PhD thesis, Cornell University (1999)
8. Yedidia, J.S., Freeman, W.T., Weiss, Y.: Constructing free energy approximations and generalized belief propagation algorithms. Technical Report TR-2004-40, Mitsubishi Electric Research Laboratories (2004)
9. Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient belief propagation for early vision. Int. J. Comput. Vision 70, 41–54 (2006)
10. Kumar, M.P., Torr, P.H.S., Zisserman, A.: OBJ CUT. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. CVPR 2005, vol. 1, pp. 18–25 (2005)
Boosting for Model-Based Data Clustering

Amir Saffari and Horst Bischof

Institute for Computer Graphics and Vision, Graz University of Technology, Austria
{saffari,bischof}@icg.tugraz.at
Abstract. In this paper a novel and generic approach for model-based data clustering in a boosting framework is presented. This method uses forward stagewise additive modeling to learn the base clustering models. The experimental results on relatively large-scale datasets and the Caltech4 object recognition set demonstrate how the performance of relatively simple and computationally efficient base clustering algorithms can be boosted using the proposed algorithm. Keywords: Boosting, Model-Based Clustering, Ensemble Methods.
1 Introduction
Currently, boosting methods are amongst the best techniques for classification problems. Recently, there have been a few attempts to bring the idea of boosting to the clustering domain [1,2,3]. In general, ensemble clustering refers to producing data partitions by utilizing results delivered by a set of clustering models. Such an ensemble could result in improvements in different aspects, such as performance, stability, and robustness, which are not attainable by individual models [4]. There are mainly two dominant topics of interest in ensemble clustering: 1) consensus function learning and 2) generating individual members of the ensemble. The consensus function learning [5,6,7,8,4,9] usually acts as a postprocessing step to combine the results of different clustering models and usually does not deal with how the individual members of an ensemble are generated. However, the main concern of this paper is the second major topic, which deals with creating suitable models for the ensemble clustering tasks. In this context, Topchy et al. [2] nicely incorporated the idea of using a consistency index as a measure of how often a sample remains in the same cluster and preserves its label as new models are introduced to the ensemble. The consistency index is then used for obtaining weights for the data samples. At each iteration, a new bootstrapped dataset is constructed by using the weights as a probability distribution, and the next base model is trained over this dataset. The k-means algorithm is used as the base learning model. They have shown that this method results in a better performance compared to a non-boosting ensemble method [10] which uses a uniform sub-sampling of the dataset. Frossyniotis et al. [1] also used the concept of sub-sampling the dataset, by using two different performance measures for assessing the clustering quality for
each sample, both based on the membership values of each sample to the individual clusters. They incorporated an approach very similar to the one used in the original Discrete AdaBoost [11] for updating the weights of both samples and base models. They compared the performance of k-means and fuzzy c-means to their boosted versions and showed better clustering results. In this paper, we propose a novel boosting algorithm which provides a unified and general framework to include any model-based clustering method as its internal processing units. In Section 2, we will present this general boosting framework and use the concept of forward stagewise additive modeling [12] for finding solutions for ensemble members. The experimental results of the proposed methods on a few real-world datasets will be demonstrated in Section 3.
2 Methods
Let χ = {x_1, ..., x_N}, x_i ∈ R^D be a finite set of samples. According to the bipartite graph view of model-based clustering [13], we define a clustering algorithm as a mapping from the input feature space, R^D, to the space of data models, which are usually described by probability density models. These density models usually are sampled from a known and fixed class of parametric functions. Therefore, we assume that in a clustering model there is a collection of K probability density functions:

c(x) = [p_1(x|\theta_1), \ldots, p_K(x|\theta_K)]^T    (1)

where θ_k are the parameters of the k-th model. Additionally, each clustering system is accompanied with an assignment function, h(x, k) = P(k|x), which based on c(x) determines the membership of a sample to the k-th model. Therefore, we refer to the combination of the data models and the assignment function, C(x) = {c(x), h(x, k)}, as a clustering model. Note that Zhong and Ghosh [13] proposed this formulation as a general and unified framework for model-based clustering algorithms, which encompasses many different algorithms ranging from traditional k-means and mixture of models to deterministic annealing. For example, in this framework the k-means algorithm uses equivariant spherical Gaussian density models together with the following hard assignment function [14]:

h(x, k) = \begin{cases} 1 & \text{if } k = \arg\max_{y} p_y(x|\theta_y) \\ 0 & \text{otherwise} \end{cases}    (2)

The parameter estimation procedure for the density models is usually conducted by maximizing the expected log-likelihood objective function:

L(\theta, \chi) = \sum_{n=1}^{N} P(x_n) \sum_{k=1}^{K} h(x_n, k) \log p_k(x_n|\theta_k)    (3)

where θ = [θ_1, ..., θ_K] is the collection of all model parameters. In practice, the sample prior probability is unknown and it is common to set it to be a uniform
distribution: ∀n : P(x_n) = 1/N. Since some of the clustering algorithms do not directly address the probabilistic models, the objective function of Eq. (3) can be formulated in terms of a general loss function as:

L(\theta, \chi) = -\sum_{n=1}^{N} P(x_n) \sum_{k=1}^{K} h(x_n, k) \, d(x_n, k)    (4)

where d(x_n, k) is the loss for the sample x_n under the k-th model (or cluster). For example, probabilistic models use d(x, k) = -\log p_k(x|\theta_k), while in k-means this is usually expressed by d(x, k) = \|x - \mu_k\|^2, where \mu_k is the k-th cluster center. We further simplify this objective function by introducing the sample loss function as l(x) = \sum_{k=1}^{K} h(x, k) d(x, k) and rewriting the overall objective function of Eq. (4) to be minimized as:

L(\theta, \chi) = \sum_{n=1}^{N} P(x_n) \, l(x_n)    (5)
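For concreteness, the following sketch (Python with NumPy; not part of the original paper, names are illustrative) evaluates the sample loss and the weighted objective (5) for the k-means instantiation, where d(x, k) is the squared distance to the k-th center and h(x, k) is the hard assignment (2):

    import numpy as np

    def sample_loss_kmeans(X, centers):
        # l(x_n) = sum_k h(x_n, k) d(x_n, k); with the hard assignment this is the
        # squared distance of x_n to its nearest cluster center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # N x K
        return d2.min(axis=1)

    def clustering_loss(X, centers, weights=None):
        # Eq. (5): L(theta, chi) = sum_n P(x_n) l(x_n); P(x_n) = 1/N if no weights given
        l = sample_loss_kmeans(X, centers)
        if weights is None:
            weights = np.full(len(X), 1.0 / len(X))
        return float(np.dot(weights, l))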
2.1 Additive Modeling
Assume that a set of M different clustering models {C_1, ..., C_M} is trained on a given dataset with an equal number of partitions, K. We refer to each of these clustering models, C_m, as the base models and to the collection of them as the ensemble model. The additive modeling approach is a general method for generating an aggregated model out of a set of base models. For example, traditional boosting [11] and majority (plurality) voting strategies [15] can be seen as special cases of additive modeling [12]. In the context of data clustering, once the cluster correspondence between different models is solved, the assignment function for the ensemble can be represented in an additive form as:

H(x, k) = \frac{1}{M} \sum_{m=1}^{M} h_m(x, k)    (6)

where the function H(x, k) is an additive construction of M base functions, {h_m(x, k)}_{m=1}^{M}. This is equivalent to the traditional scheme of classifier combination in the field of supervised learning problems. The corresponding objective function for the whole ensemble can be written in additive form as:

L^M(\theta, \chi) = \frac{1}{M} \sum_{n=1}^{N} P(x_n) \sum_{m=1}^{M} l_m(x_n) = \frac{1}{M} \sum_{n=1}^{N} P(x_n) L^M(x_n)    (7)

where l_m(x_n) is the sample loss of x_n under the m-th base clustering model, and L^M(x) = \sum_{m=1}^{M} l_m(x) is the sample loss function for the ensemble of M base clustering models. It should be noted that since we usually train a specific class
of algorithms, e.g. k-means, over the same dataset, their individual losses are comparable. In order to facilitate the derivation of the new boosting algorithm for data clustering, we use an exponential form of Eq. (7):

L^M(\theta, \chi) = \frac{1}{M} \sum_{n=1}^{N} P(x_n) L^M(x_n) < \frac{1}{M} \exp\left( \sum_{n=1}^{N} P(x_n) L^M(x_n) \right) \le \frac{1}{M} \sum_{n=1}^{N} P(x_n) \exp(L^M(x_n)) = \frac{1}{M} \sum_{n=1}^{N} P(x_n) L_e^M(x_n)    (8)

where L_e^M(x) = \exp(L^M(x)). This derivation uses the fact that ∀x: exp(x) > x, that \sum_{n=1}^{N} P(x_n) = 1, and Jensen's inequality for the convex exponential transformation. This shows that the exponential loss function provides an upper bound on the actual loss shown in Eq. (7).
2.2 Learning
The forward stagewise additive modeling [12] is a general approach for learning the base models of the additive formulation of a function. It is an iterative algorithm which, at each stage, finds the best model which, if added to the previous set of models, would result in an improvement in performance. It adds this model to the ensemble, fixes its parameters and continues the search for the next best model. We apply the forward stagewise additive modeling approach to learn the parameters of the base clustering models. Note that at the i-th iteration, the exponential ensemble loss function in Eq. (8) can be written as:

L_e^i(x) = \exp(l_i(x)) \prod_{m=1}^{i-1} \exp(l_m(x)) = L_e^{i-1}(x) \exp(l_i(x)) = w(x) \exp(l_i(x))    (9)

where w(x) = L_e^{i-1}(x) is usually referred to in the boosting community as the sample weight, due to the fact that its value reflects the loss of previous models. Now from Eq. (8), we can derive the objective function for the i-th stage of boosting as:

L_e^i(\theta, \chi) = \frac{1}{i} \sum_{n=1}^{N} P(x_n) w(x_n) \exp(l_i(x_n))    (10)

which is the weighted version of the exponential loss function. Since as base models we use normal model-based clustering algorithms, and they are designed to minimize the loss function represented in Eq. (5), not the exponential form of it, we show that under a mild condition the optimal solution of a model for Eq. (5) can also be an optimal solution of Eq. (10). If we assume that the sample loss is
sufficiently small, we can approximate the exponential term with the first-order Taylor expansion as:

L_e^i(\theta, \chi) = \frac{1}{i} \sum_{n=1}^{N} P(x_n) w(x_n) \exp(l_i(x_n)) \approx \frac{1}{i} \sum_{n=1}^{N} P(x_n) w(x_n) (1 + l_i(x_n))
= \frac{1}{i} \left( \sum_{n=1}^{N} P(x_n) w(x_n) + \sum_{n=1}^{N} P(x_n) w(x_n) l_i(x_n) \right)    (11)
Weak Learners
In supervised boosting methods, the base models are sometimes called weak learners due to the fact that these models need to perform just better than random guessing. Because of lack of labels, obtaining a similar concept is usually difficult for clustering tasks. However, based on our boosting framework, we propose a condition over base models which can serve both as a definition of weak learner and also as an early stopping criteria for the boosting iterations. In order to achieve an overall better clustering results, it is desirable that with each addition of a base model, the average loss does not increase compared to the previous setup of base models. In terms of loss function, this translates into the condition of Lie ≤ Li−1 e . If we use the Eq.(10) and Eq.(8), we can write this condition as:
βi =
N
N N 1 1 P (xn )Lie (xn ) ≤ P (xn )Li−1 e (xn ) i n=1 i − 1 n=1
P (xn )w(xn ) exp(li (xn )) i ≤ N i − 1 n=1 P (xn )w(xn )
n=1
(13)
We use this condition during the boosting iterations to prevent addition of newer models which are not able to provide better solutions to the current state of the ensemble.
Algorithm 1. CBoost: Boosting for Model-Based Data Clustering
1:  Input dataset: χ = {x_n}, n = 1, ..., N.
2:  Set: w_n = 1/N, n = 1, ..., N.
3:  for m = 1 to M do
4:    Compute: θ_m = arg min_θ Σ_{n=1}^{N} P(x_n) w_n l(x_n).
5:    Compute: β_m = Σ_{n=1}^{N} P(x_n) w_n exp(l_m(x_n)) / Σ_{n=1}^{N} P(x_n) w_n.
6:    if m > 1 and β_m > m/(m−1) then
7:      Break.
8:    end if
9:    Set: ∀n : w_n ← w_n exp(l_m(x_n)).
10:   Optional: Set: ∀n : w_n ← w_n / Σ_{n=1}^{N} w_n.
11: end for
12: Find the cluster correspondence or a suitable consensus function for the clustering models.
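A compact sketch of Algorithm 1 in Python is given below, using scikit-learn's weighted k-means as the base model; the uniform sample prior P(x_n) cancels in step 5 and is therefore omitted, and the loss scaling as well as the choice of the base learner are illustrative assumptions rather than part of the algorithm itself:

    import numpy as np
    from sklearn.cluster import KMeans

    def cboost(X, K, M=20):
        # Steps 1-11 of Algorithm 1; the consensus function (step 12) is not included.
        N = len(X)
        w = np.full(N, 1.0 / N)                         # step 2: uniform sample weights
        models = []
        for m in range(1, M + 1):                       # step 3
            km = KMeans(n_clusters=K, n_init=5).fit(X, sample_weight=w)   # step 4
            # sample loss l_m(x_n): squared distance to the assigned center,
            # rescaled so that the losses stay small (as assumed in Section 2.2)
            d = ((X - km.cluster_centers_[km.labels_]) ** 2).sum(axis=1)
            l = d / (d.max() + 1e-12)
            beta = np.sum(w * np.exp(l)) / np.sum(w)    # step 5
            if m > 1 and beta > m / (m - 1):            # step 6: early stopping
                break
            models.append(km)
            w = w * np.exp(l)                           # step 9: weight update
            w = w / w.sum()                             # step 10: optional normalization
        return models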
2.4 Discussion
The overall algorithm is shown in Algorithm 1. There are a few remarks regarding this method. First, it provides a unified and generic boosting framework which is able to include any model-based clustering algorithm as its building blocks. In fact, any clustering algorithm which inherently optimizes a loss function can be used here. Optionally, it is beneficial for the algorithms to be able to use the sample weights in their optimization procedures. For those methods which lack such ability, one can always use bootstrap methods to reflect the effect of sample weights. Additionally, in this framework, there is essentially no need for solving the cluster correspondence during the boosting iterations. In fact, after boosting training is over, it is quite possible to benefit from state-of-the-art consensus function learning methods to derive the final clusters out of the ensemble. Finally, the computational complexity of this algorithm is mainly centered around the complexity of its base models, as the only overhead operations are a simple update of the sample weights together with a check of the early stopping condition.
3 Experiments

3.1 Methodology
As base models, we use ETree [16] and k-means. ETree has fast training and indexing capabilities with a complexity of O(N log N) [16]; however, in general it provides inferior results compared to k-means. In order to introduce the sample weights into the ETree algorithm, the update rate of the leaf nodes for each sample is multiplied with

\eta_n = \frac{1}{1 + \exp(-w'_n)},    w'_n = \frac{w_n - \mu_w}{\sigma_w}.
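A minimal sketch of this weight transformation (Python; the small epsilon guard is an added assumption to avoid division by zero):

    import numpy as np

    def etree_update_rates(w):
        # standardize the boosting weights and squash them into (0, 1)
        w_std = (w - w.mean()) / (w.std() + 1e-12)
        return 1.0 / (1.0 + np.exp(-w_std))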
Since ETree does not deliver a fixed set of clusters, we apply an
Table 1. Datasets used in the experiments

Dataset  | No. Classes | No. Features | No. Samples
Synth    |           6 |           60 |         600
Pendigit |          10 |           16 |        7494
Isolet   |          26 |          671 |        6238
Caltech4 |           4 |          420 |        1200
agglomerative clustering with group linkage over the leaf nodes to produce the desired number of clusters. For k-means we use the same algorithm as in [3], which uses the weighted average for calculating the location of the cluster centers. To assess the quality of the clustering results we cannot use the log-likelihood, because ETree does not directly address a particular density model. Therefore, we use the Normalized Mutual Information (NMI) [5] between the true labels and the labels returned by the clustering algorithms. Besides its statistical meaning, the normalized mutual information enables us to compare clustering models with different numbers of partitions.
Datasets
In order to show the performance of the proposed methods, we use four real-world datasets, which are summarized in Table 1. The Synthetic Control, Pendigit, and Isolet datasets are obtained from UCI machine learning [17] and UCI KDD repositories [18]. These datasets have been standardized before performing any clustering over them, i.e. for each feature, the mean has been subtracted and then feature values have been divided by their standard deviation. Additionally, we created a dataset from Caltech 4 object categorization repository by randomly selecting 400 images from each category of Airplanes, Cars, Faces, and Motorbikes, and then applying the Pyramid of Histograms of Orientation Gradients (PHOG) [19] shape descriptor to these images. We used a 2-level pyramid with 20-bin histograms using the same settings proposed by Bosch et al. [19]. We normalize these descriptors to have a unit L1 -norm. 3.3
Results
Table 2 and Table 3 show the performance of ETree and k-means respectively, together with their boosted versions over different number of cluster centers. The values are the average and standard deviation of NMI over 50 independent runs of the algorithms. We also implemented the algorithms proposed in [1,2] and compared their performance with our method. Additionally, we applied different consensus function learning methods, namely voting [20], HGPA, and MCLA [5], to the final ensemble. For all boosting methods we set the maximum number of base models to be 50 for ETree and 20 for k-means, respectively. Since the method presented in [2] does not have an early stopping criteria, we report its best result over different number of base models.
Table 2. Results of clustering using ETree in terms of %NMI. K is the number of experimental cluster centers. The next four columns are the performance of ETree itself together with using it as the base model of our boosting algorithm (CB) and applying voting [20], HGPA, and MCLA [5] consensus function learning methods. The next two columns indicate the results of using the methods of Frossyniotis et al. [1] and Topchy et al. [2] with ETree. For those methods we only report their best performance over applying different consensus functions.

Dataset  | K  | ETree   | CB.Vote | CB.HGPA | CB.MCLA | Fross [1] | Topchy [2]
Synth    |  3 | 72 ± 06 | 75 ± 01 | 75 ± 03 | 75 ± 01 | 71 ± 03   | 68 ± 01
Synth    |  6 | 67 ± 04 | 80 ± 03 | 79 ± 01 | 83 ± 03 | 68 ± 03   | 75 ± 02
Synth    | 12 | 69 ± 03 | 78 ± 02 | 75 ± 03 | 81 ± 01 | 70 ± 04   | 73 ± 02
Pendigit |  5 | 57 ± 05 | 59 ± 03 | 58 ± 01 | 59 ± 01 | 55 ± 04   | 54 ± 02
Pendigit | 10 | 66 ± 03 | 69 ± 02 | 68 ± 03 | 69 ± 02 | 66 ± 03   | 69 ± 03
Pendigit | 20 | 68 ± 02 | 73 ± 01 | 73 ± 02 | 73 ± 02 | 70 ± 06   | 69 ± 01
Isolet   | 13 | 52 ± 03 | 71 ± 02 | 67 ± 02 | 72 ± 02 | 70 ± 02   | 71 ± 02
Isolet   | 26 | 56 ± 03 | 75 ± 01 | 69 ± 03 | 76 ± 03 | 68 ± 03   | 72 ± 04
Isolet   | 39 | 58 ± 01 | 73 ± 01 | 66 ± 02 | 73 ± 01 | 70 ± 05   | 71 ± 03
Caltech  |  4 | 41 ± 02 | 45 ± 03 | 43 ± 01 | 46 ± 01 | 42 ± 01   | 43 ± 02
Caltech  | 20 | 45 ± 05 | 48 ± 02 | 44 ± 03 | 48 ± 01 | 48 ± 03   | 47 ± 01
Caltech  | 40 | 42 ± 03 | 46 ± 02 | 45 ± 03 | 47 ± 02 | 41 ± 03   | 43 ± 03

Table 3. Results of clustering using k-means and its boosted form. The details are exactly the same as Table 2.

Dataset  | K  | k-means | CB.Vote | CB.HGPA | CB.MCLA | Fross [1] | Topchy [2]
Synth    |  3 | 70 ± 09 | 77 ± 04 | 77 ± 02 | 78 ± 02 | 72 ± 03   | 73 ± 08
Synth    |  6 | 75 ± 03 | 78 ± 02 | 76 ± 03 | 82 ± 02 | 77 ± 03   | 76 ± 03
Synth    | 12 | 76 ± 05 | 78 ± 02 | 77 ± 01 | 80 ± 03 | 80 ± 04   | 76 ± 03
Pendigit |  5 | 55 ± 02 | 55 ± 03 | 54 ± 02 | 57 ± 03 | 56 ± 04   | 54 ± 04
Pendigit | 10 | 68 ± 03 | 69 ± 01 | 64 ± 03 | 69 ± 02 | 66 ± 03   | 67 ± 03
Pendigit | 20 | 72 ± 02 | 74 ± 02 | 72 ± 02 | 74 ± 03 | 65 ± 06   | 70 ± 04
Isolet   | 13 | 65 ± 04 | 69 ± 01 | 67 ± 02 | 71 ± 03 | 68 ± 02   | 66 ± 01
Isolet   | 26 | 69 ± 02 | 71 ± 02 | 70 ± 03 | 72 ± 02 | 70 ± 03   | 71 ± 03
Isolet   | 39 | 70 ± 03 | 71 ± 02 | 69 ± 02 | 72 ± 01 | 70 ± 05   | 70 ± 01
Caltech  |  4 | 37 ± 05 | 38 ± 03 | 37 ± 02 | 39 ± 02 | 39 ± 02   | 37 ± 02
Caltech  | 20 | 46 ± 04 | 46 ± 04 | 43 ± 03 | 48 ± 01 | 45 ± 03   | 47 ± 02
Caltech  | 40 | 47 ± 06 | 48 ± 01 | 47 ± 02 | 49 ± 01 | 47 ± 02   | 46 ± 03
From these results, it seems that in the single-model track the k-means method has an expected advantage over ETree clustering. However, it should be noted that in our C++ implementation of both algorithms, ETree is about 15 times faster than k-means. Now looking into the results of our boosting algorithm, it is obvious that in most cases the performance of the boosted base models is improved compared to the individual models. This clearly shows the advantage of incorporating a boosting framework for clustering algorithms. Additionally,
Fig. 1. Average vector quantization error (a) and NMI (b) versus the number of boosting iterations for the Isolet dataset using ETree as the base model
considering the variance of the results, the stability and robustness of the clusterings are also improved in most cases. Looking further into these tables, it becomes clear that the boosted version of ETree seems to be as competitive as boosted k-means. This emphasizes the power of the boosting method to transform a weaker method into a strong one. This is also a very desirable property as on large-scale datasets ETree outperforms k-means in terms of computation time during both the training and query phases. Comparing the performance of different consensus function learning methods, we can see that MCLA performs slightly better than voting, and HGPA seems not as competitive. Also we can see that in most cases our method outperforms the previous algorithms proposed in [1,2]. Figure 1 shows the average quantization error (the loss which both ETree and k-means are optimizing) and the average NMI values, with respect to the addition of each base model during the boosting iterations for the Isolet dataset using ETree as the base model. It is obvious that on average, with each boosting iteration and addition of a new base model the average error is decreasing, while the average NMI is gradually increasing. This demonstrates the success of the overall optimization process of the additive model proposed for the clustering tasks. Additionally, in most experiments, the convergence of the quantization error was found to be an indication of the convergence of the NMI values. This experimental observation suggests the effectiveness of the early stopping method based on the convergence of the average loss during the boosting iterations.
4 Conclusion
In this paper, a novel approach for clustering datasets in a boosting framework was presented. The proposed methods are based on theoretical concepts of optimization using additive models in a forward stagewise approach. The
experimental results on a few real-world and relatively large-scale datasets demonstrate the benefits of using an ensemble of base models in the proposed boosting framework.
Acknowledgment This work has been supported by the FWF Austrian Joint Research Project Cognitive Vision under projects S9103-N04 and S9104-N04.
References

1. Frossyniotis, D., Likas, A., Stafylopatis, A.: A clustering method based on boosting. Pattern Recognition Letters 25, 641–654 (2004)
2. Topchy, A., Bidgoli, M.B., Jain, A.K., Punch, W.F.: Adaptive clustering ensembles. In: Proc. of ICPR, vol. 1, pp. 272–275 (2004)
3. Nock, R., Nielsen, F.: On weighting clustering. IEEE Trans. PAMI 28(8), 1223–1235 (2006)
4. Topchy, A., Jain, A.K., Punch, W.: Clustering ensembles: Models of consensus and weak partitions. IEEE Trans. PAMI 27(12), 1866–1881 (2005)
5. Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. JMLR 3, 583–617 (2002)
6. Fred, A., Jain, A.K.: Data clustering using evidence accumulation. In: Proc. of ICPR, pp. 276–280 (2002)
7. Dudoit, S., Fridlyand, J.: Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19(9), 1090–1099 (2003)
8. Fischer, B., Buhmann, J.M.: Path-based clustering for grouping of smooth curves and texture segmentation. IEEE Trans. PAMI 25(4), 513–518 (2003)
9. Viswanath, P., Jayasurya, K.: A fast and efficient ensemble clustering method. In: Proc. of ICPR, pp. 720–723 (2006)
10. Bidgoli, M.B., Topchy, A., Punch, W.F.: Ensembles of partitions via data resampling. In: Proc. of ITCC, vol. 2, pp. 188–192 (2004)
11. Freund, Y., Schapire, R.: Experiments with a new boosting algorithm. In: Proc. of ICML, pp. 148–156 (1996)
12. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. The Annals of Statistics 38(2), 337–374 (2000)
13. Zhong, S., Ghosh, J.: A unified framework for model-based clustering. JMLR 4, 1001–1037 (2003)
14. Kearns, M., Mansour, Y., Ng, A.Y.: An information-theoretic analysis of hard and soft assignment methods for clustering. In: Proc. of UAI, pp. 282–293 (1997)
15. Matan, O.: On voting ensembles of classifiers. In: Proc. of the AAAI 1996 Workshop on Integrating Multiple Learned Models, pp. 84–88 (1996)
16. Pakkanen, J., Iivarinen, J., Oja, E.: The evolving tree - analysis and applications. IEEE Trans. NN 17(3), 591–603 (2006)
17. Blake, C.L., Merz, C.J.: UCI repository of machine learning databases (1998)
18. Hettich, S., Bay, S.D.: The UCI KDD archive (1999)
19. Bosch, A., Zisserman, A., Munoz, X.: Representing shape with a spatial pyramid kernel. In: Proc. of CIVR, pp. 401–408 (2007)
20. Topchy, A.P., Law, M.H.C., Jain, A.K., Fred, A.L.: Analysis of consensus partition in cluster ensemble. In: Perner, P. (ed.) ICDM 2004. LNCS (LNAI), vol. 3275, pp. 225–232. Springer, Heidelberg (2004)
Improving the Run-Time Performance of Multi-class Support Vector Machines

Andrey Sluzhivoy¹, Josef Pauli¹, Volker Rölke², and Anastasia Noglik¹

¹ Universität Duisburg-Essen, Lehrstuhl Intelligente Systeme, Duisburg, Germany
² Robert Bosch GmbH, Leonberg, Germany
Abstract. In this paper we propose three approaches to speed up the prediction phase of multi-class Support Vector Machines (SVM). For the binary classification the method of partial sum estimation and the method of orthonormalization of the support vector set are introduced. Both methods rely on an already trained SVM and reduce the amount of necessary computations during the classification phase. The predicted result is always the same as when using the standard method. No limitations on the training algorithm, on the kernel function or on the kind of input data are implied. Experiments show that both methods outperform the standard method, though the orthonormalization method delivers significantly better results. For the multi-class classification we have developed the pairwise classification heuristics method, which avoids a lot of unnecessary evaluations of binary classifiers and obtains the predicted class in a shorter time. By combining the orthonormalization method with the pairwise classification heuristics, we show that the multi-class classification can be performed considerably faster compared to the standard method without any loss of accuracy.
1 Introduction
Support Vector Machines (SVM) is a machine learning method proposed by Vapnik [1], which originates from statistical learning theory. Due to its strong theoretical foundations and good generalization capabilities, it has recently become an area of major interest. Given labeled training samples of two classes, SVM implicitly maps them to a higher dimensional feature space and generates the maximum margin separating hyperplane in this space. The decision function of an SVM is obtained by solving a quadratic programming problem and can be written as

f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N_S} \alpha_i k(x, x_i) - b \right),

where N_S is the number of so-called support vectors x_i, which form a subset of the training data, \alpha_i are the weights of x_i, b is the bias term and k is a (positive definite) kernel function, which induces a nonlinear mapping Φ from the input space X to the feature space F and is used to compute the dot product in F:

k: X \times X \to \mathbb{R},    k(x, y) = \Phi(x) \cdot \Phi(y).
Fast classification by an SVM requires a fast calculation of

\sum_{i=1}^{N_S} \alpha_i k(x, x_i) - b = \sum_{i=1}^{N_S} \alpha_i \, \Phi(x) \cdot \Phi(x_i) - b = \Phi(x) \cdot \Psi - b,    (1)

where

\Psi = \sum_{i=1}^{N_S} \alpha_i \Phi(x_i)    (2)
is a vector in the feature space F. Many techniques have been proposed recently to increase the efficiency of SVMs by speeding up the calculation of decision functions. Most of them deal with the reduction of the support vector set by discarding some vectors which are considered less contributing to the decision function [2,3] or by constructing a reduced set of new vectors [4], effectively approximating the decision function. Although a good reduction of the number of support vectors can usually be achieved without drastically decreasing the classification accuracy, one typically has to accept a systematic error in the approximation of f. Additionally, obtaining the reduced set and re-training the SVM could be computationally expensive. The approaches described in this paper avoid these problems by actually reducing the amount of computations during the calculation of f(x), while the support vector set remains untouched. There is no systematic error implied, since the result is always calculated exactly.
2 Partial Sum Estimation
The basic idea how to speed up the calculation of (1) is to compute the sum up to a certain point and to discard the rest. The calculation could be stopped as soon as the desired accuracy is reached. We determine the accuracy by estimating the remaining part of the sum. Let

S_j^q := \sum_{i=j}^{q} \alpha_i \, (\Phi(x) \cdot \Phi(x_i)) = \sum_{i=j}^{q} \alpha_i k(x, x_i)

be the shorthand for the partial sum. Since for the binary classification only the sign of the result is used, the computation continues until

|S_1^q - b| > |S_{q+1}^{N_S}|    (3)

is valid for some q. We can estimate the r.h.s. of (3) using the Cauchy-Schwarz inequality:

|k(x, x_i)| = |\Phi(x) \cdot \Phi(x_i)| \le \|\Phi(x)\| \, \|\Phi(x_i)\| = \sqrt{k(x, x)} \, \sqrt{k(x_i, x_i)},

so that

|S_{q+1}^{N_S}| = \left| \sum_{i=q+1}^{N_S} \alpha_i k(x, x_i) \right| \le \sqrt{k(x, x)} \sum_{i=q+1}^{N_S} |\alpha_i| \sqrt{k(x_i, x_i)}.    (4)
Note that the computation of the partial sums in the r.h.s. of (4) can be performed in advance, as soon as the support vectors are obtained.
This approach is easy to implement, but the estimation of S_{q+1}^{N_S} is rather loose because of the multiple application of the Cauchy-Schwarz inequality, which implies that the interruption criterion (3), (4) is not accurate enough. If we could obtain the exact length of the vector

\Psi_j^k = \sum_{i=j}^{k} \alpha_i \Phi(x_i),    j, k \in \{1, \ldots, N_S\},  j \le k,

in F, a single application of Cauchy-Schwarz would provide a better estimate:

|\Phi(x) \cdot \Psi_1^q - b| > \|\Phi(x)\| \, \|\Psi_{q+1}^{N_S}\|.

However, in many cases we cannot compute \|\Psi_{q+1}^{N_S}\| directly. If we had an orthonormal basis of F and the projections of \Psi_{q+1}^{N_S} onto the basis vectors, this could easily be done, but F can be infinite dimensional and an explicit representation of Φ could possibly not be derived. Fortunately, all \Psi_j^k are contained within a finite dimensional subspace F_S ⊆ F, which is spanned by the support vectors. Thus, we can obtain a basis of F_S by orthonormalizing the support vectors in the feature space F.
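The following sketch (Python; not from the paper, with an RBF kernel chosen purely for illustration) shows how the interruption criterion (3) with the bound (4) can be wired into the evaluation of the decision function; the tail sums over |alpha_i| sqrt(k(x_i, x_i)) are precomputable once per trained SVM:

    import numpy as np

    def rbf(x, y, gamma=0.5):
        return np.exp(-gamma * np.sum((x - y) ** 2))

    def predict_partial_sum(x, sv, alpha, b, kernel=rbf):
        n = len(sv)
        norms = np.array([np.sqrt(kernel(v, v)) for v in sv])
        # tail[q] = sum_{i > q} |alpha_i| * sqrt(k(x_i, x_i)); precomputable offline
        a = np.abs(alpha) * norms
        tail = np.concatenate((np.cumsum(a[::-1])[::-1][1:], [0.0]))
        kxx = np.sqrt(kernel(x, x))
        s = 0.0
        for q in range(n):
            s += alpha[q] * kernel(x, sv[q])
            if abs(s - b) > kxx * tail[q]:      # criterion (3) with the bound (4)
                break                           # the sign cannot change any more
        return np.sign(s - b)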
3 Orthonormalization of the Support Vector Set
The projection of Φ(x) onto F_S is denoted as \Phi_{F_S}(x). We derive an important fact that \Phi(x) \cdot \Psi = \Phi_{F_S}(x) \cdot \Psi. That is, the vector \Phi(x) - \Phi_{F_S}(x) has no influence on the result of (1) and hence can be ignored. We employ the method of Gram-Schmidt to find the orthonormalization of the support vector set V. Let

\tilde{V}_i = \{ \tilde{\Phi}(x_1), \ldots, \tilde{\Phi}(x_i) \}

be the set of orthonormalized vectors, which is inductively constructed at the i-th step of the Gram-Schmidt orthonormalization process. Note that we construct the set \tilde{V} implicitly, that is, the elements are not expressed directly, but rather the dot product of \Phi(x) with an element of \tilde{V} for any given vector x can be calculated with some additional effort. For abbreviation, let

K_{i,j} = \Phi(x_i) \cdot \Phi(x_j) = k(x_i, x_j).

We define the matrix \tilde{K} as

\tilde{K}_{i,j} = \tilde{\Phi}(x_i) \cdot \Phi(x_j).
Note that \tilde{K}_{i,j} for i > j is always zero, because \Phi(x_j) can be completely written in terms of \tilde{\Phi}(x_1), \ldots, \tilde{\Phi}(x_j), which are all orthogonal to a later obtained \tilde{\Phi}(x_i). To simplify the notation, it is assumed that the elements of V are linearly independent, so \tilde{V} has the same number of elements. The presence of linearly dependent elements does not imply any problems and would rather further improve the runtime performance, since the set \tilde{V} would then have fewer elements. The first orthogonal vector is simply \Phi(x_1) with the length

\ell_1 = \|\Phi(x_1)\| = \sqrt{k(x_1, x_1)}.

The first orthonormal vector is therefore

\tilde{\Phi}(x_1) = \frac{\Phi(x_1)}{\ell_1} = \frac{\Phi(x_1)}{\sqrt{k(x_1, x_1)}}.

Now we are able to compute the first row of the matrix \tilde{K}:

\tilde{K}_{1,j} = \tilde{\Phi}(x_1) \cdot \Phi(x_j) = \frac{k(x_1, x_j)}{\sqrt{k(x_1, x_1)}},    j = 1, \ldots, N_S.

For the induction step from i to i+1 we assume that \tilde{V}_i is a set of orthonormal vectors which spans the subspace span{\Phi(x_1), \ldots, \Phi(x_i)}. Also, it is assumed that the rows 1, \ldots, i of \tilde{K} and the lengths \ell_1, \ldots, \ell_i have been calculated. To find the orthogonal part of \Phi(x_{i+1}), we subtract the projection of \Phi(x_{i+1}) onto the subspace spanned by \tilde{V}_i from itself. The new orthogonal vector is thus

\hat{\Phi}(x_{i+1}) = \Phi(x_{i+1}) - \sum_{t=1}^{i} (\tilde{\Phi}(x_t) \cdot \Phi(x_{i+1})) \, \tilde{\Phi}(x_t) = \Phi(x_{i+1}) - \sum_{t=1}^{i} \tilde{K}_{t,i+1} \tilde{\Phi}(x_t)    (5)

and its length is

\ell_{i+1} = \|\hat{\Phi}(x_{i+1})\| = \sqrt{\|\hat{\Phi}(x_{i+1})\|^2} = \sqrt{K_{i+1,i+1} - \sum_{t=1}^{i} \tilde{K}_{t,i+1}^2}.

The last result is obtained by squaring the result of (5) and considering the orthonormality of \tilde{\Phi}(x_t). The new orthonormal vector is

\tilde{\Phi}(x_{i+1}) = \ell_{i+1}^{-1} \hat{\Phi}(x_{i+1}).

We set \tilde{V}_{i+1} = \tilde{V}_i \cup \{\tilde{\Phi}(x_{i+1})\}. The (i+1)-th row of \tilde{K} can be calculated as follows:

\tilde{K}_{i+1,j} = \tilde{\Phi}(x_{i+1}) \cdot \Phi(x_j) = \ell_{i+1}^{-1} \hat{\Phi}(x_{i+1}) \cdot \Phi(x_j)
= \ell_{i+1}^{-1} \left( \Phi(x_{i+1}) - \sum_{t=1}^{i} \tilde{K}_{t,i+1} \tilde{\Phi}(x_t) \right) \cdot \Phi(x_j)
= \ell_{i+1}^{-1} \left( K_{i+1,j} - \sum_{t=1}^{i} \tilde{K}_{t,i+1} \tilde{K}_{t,j} \right),    j \ge i+1.
Continuing the process, we obtain the matrix \tilde{K} and the lengths \ell_i, i = 1, \ldots, N_S. The orthonormal basis \tilde{V} = \tilde{V}_{N_S} has been (implicitly) constructed.
Coming back to the question of what was the reason for this transformation, we recall the representation (2) of \Psi in V. Now we should find the representation of \Psi in the new orthonormal basis \tilde{V}:

\Psi = \sum_{i=1}^{N_S} \alpha_i \Phi(x_i) = \sum_{t=1}^{N_S} \tilde{\alpha}_t \tilde{\Phi}(x_t).

We obtain \tilde{\alpha}_t by projecting \Psi onto the vectors of the new basis \tilde{V}:

\tilde{\alpha}_t = \Psi \cdot \tilde{\Phi}(x_t) = \left( \sum_{i=1}^{N_S} \alpha_i \Phi(x_i) \right) \cdot \tilde{\Phi}(x_t) = \sum_{i=1}^{N_S} \alpha_i \tilde{K}_{t,i}.

Note that the values \tilde{\alpha}_t as well as \tilde{K} and \ell_i can be pre-calculated in advance and hence imply no additional computational costs during the classification. For any feature vector \Phi(x) \in F its projection onto F_S is

\Phi_{F_S}(x) = \sum_{t=1}^{N_S} (\Phi(x) \cdot \tilde{\Phi}(x_t)) \, \tilde{\Phi}(x_t) = \sum_{t=1}^{N_S} \theta_t(x) \tilde{\Phi}(x_t).

The values \theta_t(x) will be computed successively during the classification. Now we can calculate the dot product \Phi(x) \cdot \Psi in the subspace F_S in terms of the new basis \tilde{V}:

\Phi(x) \cdot \Psi = \Phi(x) \cdot \sum_{t=1}^{N_S} \tilde{\alpha}_t \tilde{\Phi}(x_t) = \sum_{t=1}^{N_S} \tilde{\alpha}_t \, (\Phi(x) \cdot \tilde{\Phi}(x_t)) = \sum_{t=1}^{N_S} \tilde{\alpha}_t \theta_t(x).    (6)
For abbreviation we set

S_j^q := \sum_{t=j}^{q} \tilde{\alpha}_t \, (\Phi(x) \cdot \tilde{\Phi}(x_t)) = \sum_{t=j}^{q} \tilde{\alpha}_t \theta_t(x).

The calculation of (6) in terms of the new orthonormalized basis \tilde{V} starts with

S_1^1 = \tilde{\alpha}_1 \theta_1(x) = \tilde{\alpha}_1 \, (\Phi(x) \cdot \tilde{\Phi}(x_1)) = \tilde{\alpha}_1 \ell_1^{-1} k(x_1, x).

Further computation mainly consists of the successive updates of the sum

S_1^q = S_1^{q-1} + \tilde{\alpha}_q \theta_q(x),    (7)

where \theta_q(x) can be obtained as follows:

\theta_q(x) = \Phi(x) \cdot \tilde{\Phi}(x_q) = \ell_q^{-1} \, \Phi(x) \cdot \left( \Phi(x_q) - \sum_{t=1}^{q-1} \tilde{K}_{t,q} \tilde{\Phi}(x_t) \right) = \ell_q^{-1} \left( k(x, x_q) - \sum_{t=1}^{q-1} \tilde{K}_{t,q} \theta_t(x) \right).    (8)
We come to the point where the properties of the orthonormal representation of the vectors \Phi(x) and \Psi can be exploited. We notice that the remaining part of the sum (6) after step q is fully determined by the remaining parts of its factors, notably

\Psi_q := \sum_{t=q+1}^{N_S} \tilde{\alpha}_t \tilde{\Phi}(x_t)    with    \|\Psi_q\| = \sqrt{\sum_{t=q+1}^{N_S} \tilde{\alpha}_t^2},

\Phi(x)_q := \Phi(x) - \sum_{t=1}^{q} \theta_t(x) \tilde{\Phi}(x_t)    with    \|\Phi(x)_q\| = \sqrt{k(x, x) - \sum_{t=1}^{q} \theta_t(x)^2}.

Hence, the remaining part of the sum can be estimated using the Cauchy-Schwarz inequality:

|S_{q+1}^{N_S}| = |\Phi(x)_q \cdot \Psi_q| \le \|\Phi(x)_q\| \, \|\Psi_q\|.

Note that both factors of the r.h.s. of the last inequality can be updated at every step q with a little computational effort. Now, the interruption condition (3) can be replaced by

|S_1^q - b| > \|\Phi(x)_q\| \, \|\Psi_q\|.    (9)

Taking a closer look at the distinct steps (7) and (8) of the calculation of the sum (6), it is clear that an additional overhead is incurred by the orthonormalization approach. Generally, the complete run of the calculation would require O(N · N_S + N_S^2) time, which is worse than the time complexity of the straightforward calculation of (1), which takes O(N · N_S) time. The additional computational effort is due to the time needed to calculate \theta_q(x), which grows quadratically with increasing q. Hence, it is essential that (9) is valid for some q < N_S and the calculation can be interrupted at an earlier stage. The performance improvement is therefore not guaranteed, though the acceleration is likely to occur when the number of support vectors N_S is small compared to the dimension of the input space X as well as for complicated kernels, since the reduction of the number of necessary kernel evaluations would then save a considerable amount of computational time.
One of the important factors of the runtime performance of the algorithm is the order of the support vectors during the orthonormalization process. Generally, sorting the support vectors by their corresponding values of |\tilde{\alpha}_t| is a good idea, because this way the portions of the sum (6) with the largest absolute value would be calculated at the very beginning and the estimations of the remainder would be decreasing rapidly. Further implementation ideas and a detailed algorithm description can be found in [5].
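The two phases of the method can be summarized in the following sketch (Python; not from the paper, assuming linearly independent support vectors and omitting the sorting by |alpha_tilde_t|; all function names are illustrative):

    import numpy as np

    def orthonormalize(sv, alpha, kernel):
        # Pre-computation: Gram-Schmidt in feature space yields K_tilde, the lengths l
        # and the coefficients alpha_tilde of Psi in the orthonormal basis.
        n = len(sv)
        K = np.array([[kernel(a, b) for b in sv] for a in sv])
        Kt = np.zeros((n, n))
        l = np.zeros(n)
        for i in range(n):
            l[i] = np.sqrt(K[i, i] - np.sum(Kt[:i, i] ** 2))
            Kt[i, i:] = (K[i, i:] - Kt[:i, i].dot(Kt[:i, i:])) / l[i]
        alpha_t = Kt.dot(np.asarray(alpha))     # alpha_tilde_t = sum_i alpha_i K_tilde[t, i]
        return Kt, l, alpha_t

    def classify(x, sv, b, Kt, l, alpha_t, kernel):
        # Classification via (7)-(9) with early interruption.
        n = len(sv)
        theta = np.zeros(n)
        s = 0.0
        phi_sq = kernel(x, x)                   # ||Phi(x)_q||^2, shrunk after every step
        psi_sq = float(np.sum(alpha_t ** 2))    # ||Psi_q||^2, shrunk after every step
        for q in range(n):
            theta[q] = (kernel(x, sv[q]) - Kt[:q, q].dot(theta[:q])) / l[q]   # Eq. (8)
            s += alpha_t[q] * theta[q]                                        # Eq. (7)
            phi_sq = max(phi_sq - theta[q] ** 2, 0.0)
            psi_sq = max(psi_sq - alpha_t[q] ** 2, 0.0)
            if abs(s - b) > np.sqrt(phi_sq * psi_sq):                         # Eq. (9)
                break
        return np.sign(s - b)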
4
Pairwise Classification Heuristics
The runtime optimization techniques we have developed so far are designed for the binary SVM classifiers. Since real-world problems usually have more than two classes, several methods have been developed for building a multi-class SVM
classifier from binary classifiers. A good comparison of various approaches can be found in [6]. We use pairwise classification as it generally requires acceptable training time and produces good prediction results. In this section we provide a heuristic technique for accelerating the pairwise classification without affecting the prediction result. Generally, if we have n classes, then

m = n(n − 1)/2

binary classifiers are generated. The pairwise classification involves evaluating all m decision functions and selecting the class which receives the most votes. Ties can be broken arbitrarily, though usually the class with the smallest index is selected. This scheme resembles a tournament: each player represents a class and a game is a comparison between two classes. This approach has the disadvantage that many redundant class comparisons are performed. Speaking in terms of a tournament, when player i has v_i wins and l_i games left, he can get at most v_i + l_i votes after all games have been played. If player j already has v_j > v_i + l_i votes, player i cannot be the top-ranked player anyway. If player j even has n − 1 wins, he is the absolute winner of the tournament, because no other class can achieve that many wins. By observing full comparison outcomes in pairwise SVM classification, we have noticed that in many cases one particular class receives the maximum possible number of votes, which is n − 1. We therefore developed a simple heuristic rule that successively conducts the comparisons and stops as soon as the winner class is found (a sketch is given at the end of this section). After each comparison, the values v_i (number of votes for class i), l_i (comparisons left for class i) and the boolean c_{ij} (indicating whether the comparison between class i and class j has already been conducted) are updated. The algorithm consists of two phases. First, it is checked in two loops, applying a greedy strategy, whether an absolute winner exists. If no absolute winner is found, we repeatedly choose a class which is believed to be the winner. This is a class i with the largest number of potential votes (v_i + l_i). Then we perform all remaining comparisons with class i and check whether the termination condition is fulfilled. This is the case when the number of votes v_i exceeds the potential number of votes of every other class j, i.e., when v_i ≥ v_j + l_j. Otherwise we select among all other classes j with v_j + l_j > v_i one class with the maximum value of v_j + l_j. If l_j > 0, then j is selected as the potential winner and the procedure is repeated. If l_j = 0, then j is a certain winner and the algorithm stops. Because in each iteration we perform all remaining comparisons with the selected class and for the next iteration only a class with some comparisons left can be chosen, each class is processed at most once and the total number of iterations cannot exceed n − 1. In the worst case the algorithm conducts all m comparisons, which is the same as the standard method. There is some extra overhead for managing v_i, l_i and c_{ij}, though the overall complexity remains asymptotically the same and is bounded by O(n²). When an absolute winner is present, at most 2n − 3 comparisons are conducted, i.e., O(n). The time
complexity in the average case lies between these two extrema. As the experiments will show, this simple heuristic delivers good results even for a moderate number of classes. Note that for the experiments we used an additional check to make sure that the returned class has the smallest index among all classes with the maximum number of votes, in order to guarantee the same prediction result as the standard method. If an arbitrary class with the maximum number of votes is accepted, the run-time performance is even slightly better. A similar approach can be found in [7]. Instead of counting the potential number of votes, the authors track the loss value of each class, which is the number of comparisons where this class was not predicted. It was shown that a significant speed-up is achieved and the computational complexity tends to be almost linear in the number of classes.
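The following is a minimal sketch of the voting heuristic just described. The function and variable names are ours, the binary decision compare(i, j) (returning the winning class of the pair) stands in for an arbitrary pairwise SVM evaluation, and for brevity the sketch omits the initial greedy check for an absolute winner and the smallest-index tie-breaking mentioned above.

    #include <vector>

    // Pairwise voting heuristic: conduct comparisons only until no other class
    // can overtake the current candidate. compare(i, j) must return i or j.
    template <typename Compare>
    int classify_pairwise(int n, Compare compare)
    {
        std::vector<int> v(n, 0), l(n, n - 1);             // votes and comparisons left
        std::vector<std::vector<bool>> done(n, std::vector<bool>(n, false));

        auto play = [&](int i, int j) {                    // run one pending comparison
            int w = compare(i, j);
            ++v[w];
            --l[i]; --l[j];
            done[i][j] = done[j][i] = true;
        };

        int cand = 0;                                      // current candidate winner
        while (true) {
            for (int j = 0; j < n; ++j)                    // finish all games of the candidate
                if (j != cand && !done[cand][j]) play(cand, j);

            // termination: nobody can reach the candidate's vote count any more
            bool won = true;
            int best = -1;
            for (int j = 0; j < n; ++j) {
                if (j == cand) continue;
                if (v[cand] < v[j] + l[j]) {               // class j could still overtake
                    won = false;
                    if (best < 0 || v[j] + l[j] > v[best] + l[best]) best = j;
                }
            }
            if (won) return cand;
            if (l[best] == 0) return best;                 // no games left: certain winner
            cand = best;                                   // otherwise continue with the new candidate
        }
    }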
5
Results
We have tested the new algorithms on a 3D object recognition task, using the ALOI image library [8] and OpenCV [9] as the standard SVM implementation. Each class represents a certain object on a black background, captured from different viewing angles (72 image samples per class, 1000 classes in total). For the tests the images were scaled down to 48×36 pixels. We use a holistic representation of the images, taking the pixel values of all three RGB channels as the input data vectors [10]. The SVM is a linear support vector machine with regularization parameter ν = 0.25. Within each class, 50% of the samples were selected as training samples; the rest served as test samples. The classes are selected randomly at each test run. Every test case consists of running the training procedure and measuring the classification performance on the entire data set (both training and test samples). To provide more practice-oriented results, we measured the execution time with high-precision time measurement functions, averaged over several runs. The graph in Fig. 1 shows the query time improvement of the two methods for binary SVM classifiers introduced in Sect. 2 and 3, compared to the standard SVM algorithm.
Fig. 1. Comparing the first two methods using binary classification: average query time (ms) of the standard computation, partial sum estimation, and orthonormalization, plotted over the average number of support vectors.
Fig. 2. Comparing all three methods using multi-class classification: query time ratio (%) of pairwise classification heuristics, partial sum estimation, and orthonormalization, plotted over the number of classes.
To increase the number of training samples, additional input data was generated by adding noise to the original images. In order to observe the performance gain for different numbers of support vectors, we varied the fraction of training samples. Both methods outperform the standard computation: while the partial sum estimation shows a rather moderate speed improvement, the orthonormalization method delivers the best results. In Fig. 2 we compare all three methods individually. The speed-up of the pairwise classification heuristics increases with the number of classes, in contrast to the other two methods. In Fig. 3 the speed improvement is summarized for all methods and combinations, evaluated for up to 100 classes. The partial sum estimation and orthonormalization methods exhibit a gradual performance degradation with an increasing number of classes because of the additional overhead involved: since the kernel evaluations during classification are cached, the chance that many kernel evaluations needed for a particular binary decision function have already been computed also increases. Once the number of already available evaluations exceeds a certain amount, it becomes more efficient to simply compute the sum instead of performing the remaining-sum estimation at every step. This explains why the performance of both methods for binary classifiers degrades. This shortcoming is eliminated by the pairwise classification heuristics, which avoids many unnecessary evaluations of binary classifiers and shows a steady speed increase with a growing number of classes. The two other methods also give better results when used in combination with the pairwise classification heuristics: many evaluations of binary classifiers are skipped, and the advantages of reducing the amount of computation in each binary classifier show up again.
Fig. 3. Extensive benchmarking of all methods involved. The graph shows average query time relative to the time required for the standard computation.
Using the orthonormalization method together with the pairwise classification heuristics, the query time ratio at 100 classes is about 9%, i.e., a speed-up of more than a factor of 10. Moreover, this combination exhibits a stable speed increase over the whole range.
6
Discussion and Conclusions
Compared to other methods for improving the runtime performance of SVMs [2,3,4], the approaches introduced in this paper effectively reduce the amount of computation during the classification phase and always deliver the same prediction result as the standard method. All methods are parameter-free and impose no restrictions on the training method, the kernel function or the kind of data used. They can be applied to any multi-class classification problem and can be combined with other optimization techniques. Although the speed improvement is not guaranteed and can hardly be estimated in advance, it usually occurs for high-dimensional input spaces with a moderate number of support vectors. The orthonormalization method delivers a good speed-up for binary classifiers, while the pairwise classification heuristics reduces the prediction time for multi-class classification. The combination of the pairwise classification heuristics and the orthonormalization of the support vector set exhibits the best results in our experiments, yielding a speed-up by a factor of 10 compared to the standard computation.
References
1. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
2. Osuna, E., Girosi, F.: Reducing the Run-time Complexity in Support Vector Machines. In: ICPR, Brisbane, Australia (1998)
3. Guo, J., Takahashi, N., Nishi, T.: An Efficient Method for Simplifying Decision Functions of Support Vector Machines. IEICE Transactions 89-A(10) (2006)
4. Burges, C.J.C.: Simplified Support Vector Decision Rules. In: 13th International Conference on Machine Learning, pp. 71–77 (1996)
5. Rölke, V.: Beschleunigung der Anwendung von Kernmethoden. Diploma Thesis, Christian-Albrechts-Universität, Kiel (2001)
6. Hsu, C.-W., Lin, C.-J.: A Comparison of Methods for Multi-Class Support Vector Machines. IEEE Transactions on Neural Networks 13(2), 415–425 (2002)
7. Park, S.-H., Fürnkranz, J.: Efficient Pairwise Classification. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 658–665. Springer, Heidelberg (2007)
8. Geusebroek, J.M., Burghouts, G.J., Smeulders, A.W.M.: The Amsterdam Library of Object Images. Int. J. Comput. Vision 61(1), 103–112 (2005)
9. OpenCV: Intel Open Source Computer Vision Library (2006), http://www.intel.com/technology/computing/opencv/
10. Pontil, M., Verri, A.: Support Vector Machines for 3D Object Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 637–646 (1998)
Sliding-Windows for Rapid Object Class Localization: A Parallel Technique Christian Wojek, Gyuri Dorkó, André Schulz, and Bernt Schiele Computer Science Department TU Darmstadt {wojek, dorko, schulz, schiele}@cs.tu-darmstadt.de
Abstract. This paper presents a fast object class localization framework implemented on a data-parallel architecture available in recent computers. Our case study, the implementation of Histograms of Oriented Gradients (HOG) descriptors, shows that just by using this recent programming model we can easily speed up an original CPU-only implementation by a factor of 34, making it unnecessary to use early rejection cascades that sacrifice classification performance, even under real-time conditions. Using recent techniques to program the Graphics Processing Unit (GPU) allows our method to scale up to the latest, as well as to future, improvements of the hardware.
1
Introduction
In recent literature, densely sampled local descriptors have shown excellent performance and have therefore become more and more popular for object class recognition. As the processing power of computers increases, sliding-window-based techniques become more and more feasible for real-time applications. While interest point detectors offer a smart way of pre-sampling possible locations and therefore provide a sparser set for learning and recognition, dense random sampling, or sampling on a regular lattice, has been shown [1,2] to outperform sparse representations. Many of the best object class detectors use sliding window techniques (e.g., [3,4,5,6,7,8,9]), i.e., they extract overlapping detection windows at each possible position, or on a regular lattice, and evaluate a classifier. The sliding window technique is, in general, often criticized as being too resource intensive, and consequently it is often seen as unfeasible for real-time systems. However, many highly dynamic automotive applications require detecting pedestrians with this technique in a fast and yet robust manner [6,10,11]. In general, gradient-based methods [5,6,7,8,9] perform very well, but most of them are computationally expensive. Existing real-time solutions include incorporating simple features that can be computed rapidly, such as Haar-like wavelets [3,4], and improving the speed via early rejection. This is typically achieved by a cascade of classifiers [3,12], or alternatively by arranging features from coarse to fine for multi-resolution processing [13]. While these techniques [12,13] can make a state-of-the-art detector [5]
faster, their early rejection comes with drawbacks. The method of Zhang et al. [13] misclassifies "harder examples", i.e., detections having lower confidence, more easily. This performance loss is compensated by running the detector with more expensive features at a higher resolution. Zhu et al. [12] select a subset of features from a detection window using AdaBoost. Since the number of features is fixed, their method becomes computationally expensive when scanning a large number of windows, which is a typical requirement for detecting objects at small scales. The ideal solution is to avoid rejection phases relying on coarser features, downscaled images, or other approximations, and to process the entire detection window with a strong, high-resolution classifier. In this paper we argue that methods that sacrifice classification performance in order to achieve speed-ups will not stand in the long term. We show that by using the parallel architectures found in many recent PCs' graphics processors (GPUs) we can easily obtain a speed-up of 30 and more. As a case study, we present an implementation of Dalal & Triggs' Histograms of Oriented Gradients (HOG) approach using a technology called general-purpose computation on graphics processing units (GPGPU). Our performance analysis provides guidelines for better optimization and for avoiding unnecessary overhead with GPGPU technology. HOG descriptors are features developed for object class detection, and combined with SVM classification they form one of the best detectors available [14]. To the best of our knowledge there is no published GPU-based HOG implementation in the literature. However, there are a few related ongoing vision projects that take advantage of the GPU. Examples include Lowe's Scale Invariant Feature Transform (SIFT) [15], a very similar type of descriptor, which has been implemented on GPUs in OpenVIDIA [16] and recently by Mårten Björkman [17].
2
Object Class Detection Using HOG
This section provides a brief overview of object class detection using HOG [5] features. All provided parameters correspond to experiments on people detection, and are similar to [5]. Detection Phase. A given test image is scanned at all scales and locations. The ratio between the scales is 1.05, and the overlapping detection windows are extracted with a step size of 8 pixels, both horizontally and vertically. HOG features are computed for each detection window, and a linear SVM classifier decides upon the presence of the object class. Finally, a robust mode estimator, a mean shift algorithm, fuses multiple detections over position and scale space (3D), and the system returns bounding boxes marked by their confidence. Figure 1 illustrates the computation of a rectangular HOG feature for a given detection window. After image normalization and gradient computation, each detection window is divided into adjacent cells of 8×8 pixels. Each cell is represented by a 9-bin histogram of gradient orientations in the range of 0°–180°, weighted by their magnitudes. A group of 2×2 cells is called a block. Blocks are overlapping, and are normalized using L2-Hys, the Lowe-style clipped L2 norm. A block is
Fig. 1. A HOG descriptor (left). Steps of localization using HOG descriptors (right).
represented as a concatenation of all cell histograms, and a HOG feature as a concatenation of all blocks. For people, a detection window is 64×128 pixels. When blocks overlap by 50%, i.e., by 1 cell – which is a typical choice for efficient CPU implementations – a detection window consists of 7×15 = 105 blocks, and therefore the length of a HOG descriptor is 105 × 2 × 2 × 9 = 3780. To be robust to small translations, cell histograms are computed with trilinear interpolation. Gradient magnitudes are weighted by a Gaussian (σ = 8.0) centered at the middle of the given block. In the case of color images, the channels are separated and the orientation histograms are built using the maximum gradient over the channels. Learning Phase. The HOG descriptors are computed as in detection. The learning phase differs in that there is neither a need to compute the full scale-space for all images, nor to scan the images with a sliding window. Using the given annotations, normalized crops of fixed resolution are created and fed into SVM training. Negative crops are first chosen at random, or given by the dataset. After SVM training, the negative images are rescanned for false positives to create "hard examples", and the SVM is retrained – a typical technique to improve the classifier by one order of magnitude [10].
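To make the descriptor layout explicit, the following small sketch (ours, not part of [5] or of the GPU code described later) derives the block grid and the descriptor length from the parameters above and applies an L2-Hys block normalization; the clipping value of 0.2 and the epsilon are common choices and are assumptions here.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Descriptor layout for a 64x128 window with 8x8 cells, 2x2-cell blocks and a
    // block stride of one cell, as described in the text.
    int main() {
        const int winW = 64, winH = 128, cell = 8, blockCells = 2, bins = 9;
        const int blocksX = winW / cell - (blockCells - 1);   // 7
        const int blocksY = winH / cell - (blockCells - 1);   // 15
        const int descLen = blocksX * blocksY * blockCells * blockCells * bins;
        std::printf("%d x %d blocks, descriptor length %d\n", blocksX, blocksY, descLen); // 7 x 15, 3780

        // L2-Hys normalization of one block vector (36 values): L2-normalize,
        // clip each entry (0.2 is an assumed clipping value), then renormalize.
        std::vector<float> block(blockCells * blockCells * bins, 1.0f);
        auto l2norm = [](std::vector<float>& v) {
            double s = 1e-6;                                   // small epsilon against division by zero
            for (float x : v) s += double(x) * x;
            const float inv = 1.0f / std::sqrt(float(s));
            for (float& x : v) x *= inv;
        };
        l2norm(block);
        for (float& x : block) x = std::min(x, 0.2f);
        l2norm(block);
        return 0;
    }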
3
Programming on the GPU
The term GPGPU refers to a technique that uses the graphics chip as a coprocessor to perform scientific computations. The architecture of GPUs allows highly parallel computations at high speed, and thus provides an excellent platform for computer vision. GPU manufacturers have realized the need for better support of non-graphics applications, and therefore they have been working on novel architectures. In this paper our implementation is based on NVIDIA’s CUDA architecture and programming model. Consequently, we use a CUDA capable card, GeForce 8800 Ultra, for our experiments. All numbers and speed measurements in this paper reflect this model. While CUDA allows us to use typical computer graphics procedures, such as vertex and fragment shaders, algorithms still need to be adapted to achieve high data level parallelism, and efficient memory access.
The graphics card GeForce 8800 Ultra, a highly multi-threaded device, consists of 16 multi-processors, each made up of 8 processors, and is therefore capable of running 128 threads simultaneously. Programs running on the GPU, called kernels, are compiled with NVIDIA's C compiler. Kernels are launched with a user-specified grid and thread block configuration. Thread blocks group up to 512 threads together and are arranged in a grid to help complex addressing. All threads of a block run on the same multi-processor and may therefore share data via the on-chip shared memory. Each multi-processor has 8192 registers and 16384 bytes of shared memory that are dynamically allocated to threads and thread blocks. Due to these limitations and the configuration of threads, not all processors can be active all the time. The ratio that reflects how well a kernel occupies the GPU is called the occupancy and is 100% at best. In general, higher occupancy hides the latency of global memory accesses better, and therefore often leads to better performance. Besides the on-chip shared memory there are three other types of off-chip memory. The global memory (768MB), also called device memory, has high latency and is not cached. Constant memory (65536 bytes) is typically used if all threads access the same pre-computed value, and texture memory (65536 bytes) is optimized for 2D spatial locality. Constant and texture memories are transparently cached (8KB on-chip). Each type of memory has a different access pattern, and thus programmers have to decide where the data is stored for best performance. E.g., the high-latency global memory is best accessed in contiguous chunks that are aligned w.r.t. thread blocks. This is the so-called coalesced memory access.
4
HOG on the GPU
Figure 1 shows the steps of our implementation. First, the image is transferred from the CPU's main memory to the GPU's global memory. After initial padding, the test image is gradually downscaled, and for each scale the HOG descriptor is computed on the color-normalized channels. A linear SVM is evaluated and the scores are transferred back to the CPU's memory for non-maximum suppression. Training of the SVM is done on the CPU with fixed image crops (Sect. 6), but using the GPU implementation to extract HOGs. In the following we detail the steps of our detector. Preprocessing. Preprocessing consists of four steps. In order to detect objects that are partially cropped or near the image boundaries, extra padding is added to each side of the image. Then the image is gradually downscaled, the color channels are separated, and on each channel a color normalization is performed. In the following we discuss the implementation of each step. Padding. After a test image is transferred to the global memory of the GPU, extra pixels are added to each side of the image. Each new pixel is computed by averaging the colors of the closest 5 pixels in the previous row/column. The implementation is split into two kernels, a vertical and a horizontal one, each launched with two thread blocks. Due to the pixel dependencies on previous computations, the kernels compute the missing pixels in a row/column-wise manner.
Table 1. Maximum occupancy per kernel is determined by the number of registers, the amount of shared memory (in bytes), and the thread block configuration. Padding needs additional shared memory D, see text for details. The last column shows whether the kernel has fully coalesced memory access.

Kernel                     Registers   Sh.Mem.   Thrd/Blk   Occupancy   Coal. mem.
Padding                    22          80 + D    320        max. 42%    Only vert.
Downscale                  9           40        16×16      100%        Yes
C. Decomp., Gamma com.     7           72        16×16      100%        Yes
Horizontal Convolution     6           556       145        83%         Yes
Vertical Convolution       15          3244      16×8       67%         Yes
Grad.Ori.Mag. – Max        13          60        16×16      67%         Yes
Block Histograms           13          2468      16×4       50%         Yes
Block Normalization        5           312       36         67%         No
Linear SVM Evaluation      15          1072      128        67%         Yes
Our implementation loads an entire row/column into the shared memory (max. 16KB), imposing a reasonable limit on target image dimensions, 2038 pixels. Due to the limitation on the number of registers the kernel occupancy is at most 42% as indicated in Tbl. 1. Downscale. Our downscale kernel takes advantage of the texturing unit to efficiently subsample the source image by a factor of 1.05 using linear interpolation. The target image is “covered” by thread blocks which consist of 16 × 16 threads. Each thread computes one pixel of the downscaled image. Color Decomposition & Gamma Compression. This kernel’s purpose is to separate the color channels of a 32-bit color interleaved image to red, green, and blue. The target pixels of the decomposed channels are also converted to floats, for further processing. Each thread corresponds to a pixel, and for efficient memory access they are grouped into 16 × 16 thread blocks. Since gamma compression also is a pixel-wise operation, it is integrated into this kernel for best performance, i.e., to save unnecessary kernel launches. Color Gradients. Separable convolution kernels (from the SDK examples) compute x and y derivatives of each color channel (3 ∗ 2 kernel launches). According to the guidelines, thread block sizes are fixed to 145 and 128 threads for horizontal and vertical convolutions, respectively. The occupancy is bounded by these numbers, and is 83% for horizontal and 67% for vertical processing (cf. Tbl. 1). The next kernel computes gradient orientations and magnitudes. Each thread is responsible for computing one pixel taken as a maximum of the gradient on the three channels. For efficiency threads are grouped in 16 × 16 blocks. Block Histograms. Our implementation is inspired by the histogram64 example [18]. The basic idea of parallel histogram computation is to store partial results, so-called sub-histograms, in the low-latency shared memory. If the number of histogram bins per cell, hc is 9, our algorithm requires hc ∗ sizeof(float) = 9 ∗ 4 = 36 bytes of shared memory per thread. There are two pre-computed tables, Gaussian weights and bilinear spatial weighting, transferred to the texture
memory. Interpolation between the orientation bin centers is computed in the kernel. Assuming a HOG block size of 2×2 cells and 8×8-pixel cells, the Gaussian weights require 16 ∗ 16 ∗ 4 = 1024 bytes, and the bilinear weighting table needs 16 ∗ 16 ∗ 2 ∗ 2 ∗ 4 = 4096 bytes. Each thread block is responsible for the computation of one HOG block. Threads within a block are logically grouped such that each group computes one cell histogram, and each thread processes one column of gradient orientation and magnitude values corresponding to the HOG block. Given the above-mentioned cell and block sizes, a thread block in our case has 16 × 4 threads. This arrangement reflects the cell structure within a HOG block and therefore provides easier indexing into our pre-computed tables. The second part of the kernel fuses the sub-histograms into a single HOG block histogram using the same technique as histogram64 [18]. Our configuration runs at 50% GPU occupancy, due to the size limits on shared memory (cf. Tbl. 1). Block Normalization. HOG blocks are normalized individually using L2-Hys by a kernel in which each thread block is responsible for normalizing one HOG block and consists of as many threads as there are histogram bins per block, hb = 36. Squaring of the individual elements as well as the sum of the squares are computed in parallel. Keeping a full HOG block in shared memory avoids the latency of global memory accesses. The kernel runs at 67% occupancy (cf. Tbl. 1). Linear SVM Evaluation. This kernel is similar to the block normalization kernel, since both are based on a dot product, and is therefore inspired by the example scalarProd [18]. Each thread block is responsible for one detection window. Each thread in a block computes weighted sums corresponding to one column of the window. Partial sums are then added in a pairwise fashion, each time using half of the threads until only one thread is left running. Finally, the bias of the hyperplane is subtracted and the distance from the margin is stored in global memory. The number of threads per block is 128. During computation, the linear weights of the trained SVM are kept in texture memory. Keeping all values of a detection window in shared memory would occupy nearly all available space (7 ∗ 15 ∗ 36 ∗ 4 = 15120 bytes), therefore we have decided to store one partial result of the dot product for each thread, 128 ∗ 4 = 512 bytes. The kernel runs at 67% GPU occupancy (cf. Tbl. 1). Non-Maximum Suppression. The window-wise classification is insensitive to small changes in scale and position. Thus, the detector naturally fires multiple times at nearby scale and space positions. To obtain a single final hypothesis for each object, these detections are fused with a non-maximum suppression algorithm, a scale-adaptive bandwidth mean shift [19]. This algorithm currently runs on the CPU; our current time estimates suggest that it is not yet worth running it on the GPU. However, parallelization of kernel density estimation with mode seeking could itself be a research topic. In the future, we plan to run the estimation on the CPU asynchronously and simultaneously with the other computations on the GPU.
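As an illustration of this evaluation step, the following is a simplified sketch of a block-per-window dot-product kernel with the pairwise tree reduction described above. It is our own simplification: the kernel and variable names are ours, it reads the SVM weights from global memory instead of texture memory, and it assumes that the per-window HOG descriptors have already been gathered into a dense array.

    // One thread block evaluates one detection window: each thread accumulates a
    // strided part of the dot product, then the partial sums are reduced pairwise
    // in shared memory, halving the number of active threads in every step.
    __global__ void linearSvmEval(const float* descriptors,  // numWindows x featLen
                                  const float* weights,      // featLen SVM weights
                                  float bias,
                                  float* scores,
                                  int featLen)
    {
        extern __shared__ float partial[];
        const int win = blockIdx.x;
        const int tid = threadIdx.x;

        float sum = 0.0f;
        for (int i = tid; i < featLen; i += blockDim.x)
            sum += descriptors[win * featLen + i] * weights[i];
        partial[tid] = sum;
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {        // pairwise tree reduction
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            scores[win] = partial[0] - bias;                  // distance from the hyperplane
    }

    // Possible launch configuration, with 128 threads per window as in the text:
    // linearSvmEval<<<numWindows, 128, 128 * sizeof(float)>>>(d_desc, d_w, b, d_scores, 3780);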
5
Discussion on GPU Implementations
This section summarizes our general experience in porting existing computer vision techniques to the GPU. The following guidelines should give an impression of what is worthwhile, and what is hard to realize, on GPU architectures. Port Complete Sequences of Operations to the GPU. Due to the transfer overhead between the CPU and the GPU, it is not profitable to port only small portions of a complete framework to the GPU. E.g., running only the convolution on the GPU and doing the rest on the CPU involves an overhead of twice the effective computation time on the GPU. It is better to keep the data on the GPU for further processing, in particular if we can further compress it. E.g., transferring all SVM results currently takes 0.430 ms even for a large image of 1280 × 960, whereas transferring back all HOG descriptors would have taken 2 to 3 orders of magnitude more time. Group Subsequent Steps Together. Our experience has shown that integrating kernels that access the data in the same fashion leads to significant speed improvements due to the reduced number of kernel launches. E.g., if we split the color decomposition & gamma compression, or the gradient orientation & maximum selection kernels into two, our algorithm slows down by 2 ms for each. Figure 2 (left) shows that the GPU computation time for an image of size 320 × 240 is 13.297 ms, while the program actually spends 20.179 ms in the driver software, which includes the effective GPU time and the additional overhead of kernel launches and parameter passing. Larger Data, Higher Speedup. Consequently, the more data we process, the larger is the expected speedup compared to a CPU implementation. Notice that the overhead is independent of the GPU time, and in case of longer computations it can be relatively small. Figure 2 (right) shows the real GPU computation in relation to the kernel running time, including overhead, for different image sizes. While for a smaller image the overhead is 34%, for a larger image it is only 17%. Choose the Right Memory Type. Different memory types have different access patterns, and it is important to choose the right one. E.g., the SVM evaluation could store the SVM weights in constant memory. However, since each thread accesses a different weight, it is better to use the texture memory. In our case SVM evaluation speeds up by a factor of 1.6, i.e., by 3 ms for a 320 × 240 image, using the texture memory.
Fig. 2. Effective GPU times and calling overheads. See text for details.
78
C. Wojek et al.
Similarly, another 3 ms is gained by storing the pre-computed Gaussian weights in texture memory for the histogram computation. Address Aligned Data. Alignment guidelines are essential for global memory access. In simple cases this usually means additional padding of images. In more complicated cases, when the same data is accessed multiple times using different patterns, the threads have to be aligned on the data, e.g., by launching more threads, some of which do nothing depending on the alignment. Our experience has shown that non-coalesced global memory access may slow kernels down by up to a factor of 10. Flexibility Has a High Impact on Speed. Due to the above guidelines, flexibility, i.e., not hard-coding parameters, can cause significant slowdown, e.g., through non-coalesced memory access, through increased kernel launch overhead due to more parameters, or through additional variables and computations that increase the number of registers and the amount of required shared memory and consequently reduce occupancy. Launch Many Threads to Scale for the Future. Finally, to scale well with future improvements of the hardware, a good implementation launches thousands of threads simultaneously, even if only 128 run physically in parallel on current cards. Due to the above overheads, sub-optimal memory access, and the rest of the computation (loading/saving, etc.), one can only expect an actual speedup of a magnitude less than 128. In the following section we report real WALL times for our experiments and measure the actual speedup of our HOG implementation.
6
People Detection Experiments
In order to verify both performance and runtime of our implementation we conducted several experiments on the INRIA Person test set [5]. The dataset contains people in different challenging scenes. For training, the dataset contains 2416 normalized positive crops (i.e., people cropped from 615 images) and 1218 negative images. For testing, the dataset has 453 negative images, a set of 1132 positive crops, and their corresponding 288 full-size images. For evaluation we use precision-recall curves, which provide a more intuitive and more informative report on the performance of object localization than fppw (false positives per window). fppw plots do not reflect the distribution of false positives in scale and location space, i.e., how the classifier performs in the vicinity of objects, or on background that is similar to the object context. As described earlier, our system has a non-maximum suppression step to merge nearby detections into one final hypothesis, thereby providing a clear way for evaluation. Consequently, our results are computed using only the full-sized 288 positive images, and not the crops. Detections are counted as true positives if they match the ground truth annotations with less than 50% overlap error, and double detections are counted as false positives, according to the PASCAL [14] criteria. Even though we have done our best to implement the original algorithm as closely as possible, small changes in recognition performance are expected due to restructuring of the algorithm and a different precision of the computations. For this reason, our first set of experiments compares our localization results to CPU implementations in terms of recall and precision. Figure 3 (top) shows three curves. The blue dotted curve corresponds to results obtained by running the publicly available binary written by the original author; the dashed curve, performing similar to the dotted one, is our CPU-based reimplementation of [5]; the solid red curve is our GPU implementation, which obtains slightly better results. The improvement probably comes from floating point precision in the interpolated histogram computation, since the CPU implementations use integers with rounding errors at several points, presumably for speedups. Figure 3 (right) reports the total run times¹ for the test:

Implementation      WALL time
Dalal's binary      39 min 28 s
Our CPU             35 min 1 s
Our GPU             1 min 9 s

Fig. 3. Performance on the INRIA Person test set (precision-recall curves for our GPU implementation, our CPU implementation, and Dalal's binary; total run times as listed above).

Our implementation runs 34 times faster than Dalal's binary, and 30 times faster than our CPU reimplementation. How can we make our detector even faster? First, one can try to improve the performance by reducing the overhead, e.g., by transferring more images at a time to the GPU, or by reducing kernel calls. Employing several GPUs at a time allows pipelining, and the expected throughput can be increased further, up to 4 times with currently available GPU configurations. If we are ready to trade performance for speed, small modifications of the parameters may also be sufficient. Figure 4 shows an example where the algorithm uses a coarser scale-space than before. Speed results are reported in a more intuitive way, on a per-image basis:

WALL times (per image)
Downsc. factor   320×240   640×480   1280×960
1.05 (orig.)     29 ms     99 ms     385 ms
1.1              20 ms     59 ms     216 ms
1.2              15 ms     38 ms     133 ms

Fig. 4. Increasing the downscale factor on the INRIA Person test set (precision-recall curves for scale factors 1.05, 1.1 and 1.2; per-image WALL times as listed above).

The experiment shows that a small adjustment of the scale factor does not influence the precision of our detector, but causes a small drop in recall. On average, on a 320 × 240 image the localization speeds up from 34 fps to 67 fps, i.e., by a factor of 2.0.
¹ WALL times always indicate total running time, i.e., the "real" time reported by the time utility on the binary.
7
Conclusions
In this paper we have presented a parallel implementation of an object class detector using HOG features. Our implementation runs at 34 fps on 320 × 240 images and is approximately 34 times faster than previous implementations, without any trade-off in performance. Our experiments used one single GPU only, but due to the flexible programming model the implementation scales up to multi-GPU systems, such as the Tesla computing systems, with an additional expected speedup of 2 to 4. We have also analyzed the overhead created mainly by data transfers and system calls, which defines the current limitation of these architectures. Experiments on adjusting the sliding-window parameters have shown the trade-off between classification performance and speed: we have shown a detector that runs at 67 fps with similar precision, but a small drop in recall. In the future, we plan to further improve our current implementation by reducing kernel launches and testing on multi-GPU systems, as well as to adapt other features and classifiers to GPU-based architectures. Acknowledgements. This work has been funded, in part, by the EU project CoSy (IST-2002-004250).
References
1. Nowak, E., Jurie, F., Triggs, B.: Sampling strategies for bag-of-features image classification. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954. Springer, Heidelberg (2006)
2. Tuytelaars, T., Schmid, C.: Vector quantizing feature space with a regular lattice. In: ICCV (October 2007)
3. Viola, P.A., Jones, M.J.: Robust real-time face detection. IJCV 57(2), 137–154 (2004)
4. Papageorgiou, C., Poggio, T.: A trainable system for object detection. IJCV 38(1), 15–33 (2000)
5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, pp. 886–893 (2005)
6. Shashua, A., Gdalyahu, Y., Hayun, G.: Pedestrian detection for driving assistance systems: Single-frame classification and system level performance. In: International Symposium on Intelligent Vehicles, pp. 1–6 (2004)
7. Laptev, I.: Improvements of object detection using boosted histograms. In: BMVC, vol. III, pp. 949 (September 2006)
8. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on Riemannian manifolds. In: CVPR (June 2007)
9. Sabzmeydani, P., Mori, G.: Detecting pedestrians by learning shapelet features. In: CVPR (June 2007)
10. Munder, S., Gavrila, D.M.: An experimental study on pedestrian classification. PAMI 28(11), 1863–1868 (2006)
11. Gavrila, D.M., Philomin, V.: Real-time object detection for smart vehicles. In: ICCV, pp. 87–93 (1999)
12. Zhu, Q., Avidan, S., Yeh, M., Cheng, K.: Fast human detection using a cascade of histograms of oriented gradients. In: CVPR (June 2006)
13. Zhang, W., Zelinsky, G., Samaras, D.: Real-time accurate object detection using multiple resolutions. In: ICCV (October 2007)
14. Everingham, M., Zisserman, A., Williams, C., van Gool, L.: PASCAL visual object classes challenge results (2006)
15. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
16. OpenVIDIA: GPU accelerated CV library, http://openvidia.sourceforge.net/
17. Björkman, M.: CUDA implementation of SIFT (2007)
18. NVIDIA: NVIDIA CUDA SDK code samples
19. Comaniciu, D.: An algorithm for data-driven bandwidth selection. PAMI 25(2), 281–288 (2003)
A Performance Evaluation of Single and Multi-feature People Detection Christian Wojek and Bernt Schiele Computer Science Department TU Darmstadt {wojek, schiele}@cs.tu-darmstadt.de
Abstract. Over the years a number of powerful people detectors have been proposed. While it is standard to test complete detectors on publicly available datasets, it is often unclear how the different components (e.g. features and classifiers) of the respective detectors compare. Therefore, this paper contributes a systematic comparison of the most prominent and successful people detectors. Based on this evaluation we also propose a new detector that outperforms the state-of-the-art on the INRIA person dataset by combining multiple features.
1
Introduction
People are one of the most challenging classes for object detection, mainly due to large variations caused by articulation and appearance. Recently, several researchers have reported impressive results [1,2,3] for this task. Broadly speaking, there are two types of approaches. Sliding-window methods exhaustively scan the input images over position and scale, independently classifying each sliding window, while other methods generate hypotheses by evidence aggregation (e.g. [3,4,5,6,7]). To the best of our knowledge there exist only two comparative studies on people detection methods: [8] compares local features and interest point detectors, and [9] compares various sliding window techniques. However, [9] is focused on automotive applications and their database consists only of cropped gray scale image windows. While the evaluation on single image windows is interesting, it does not allow one to assess the detection performance in real-world scenes, where many false positive detections may arise from body parts or at wrong scales. This paper therefore contributes a systematic evaluation of various features and classifiers proposed for sliding-window approaches, where we assess the performance of the different components and the overall detectors on entire real-world images rather than on cropped image windows. As a complete review of people detection is beyond the scope of this work, we focus on the most related work. An early approach [1] used Haar wavelets and a polynomial SVM, while [10] used Haar-like wavelets and a cascade of AdaBoost classifiers. Gavrila [11] employs a hierarchical Chamfer matching strategy to detect people. Recent work often employs statistics on image gradients for people detection. [12] uses edge orientation histograms in conjunction with SVMs
while [2] uses an object description based on overlapping histograms of gradients. [13] employs locally learned features in an AdaBoost framework, and Tuzel et al. [14] present a system that exploits covariance statistics on gradients in a boosting classification setting. Interestingly, most approaches use discriminant classifiers such as AdaBoost or SVMs, while the underlying object descriptors use a diverse set of features. This work contributes a systematic evaluation of different feature representations for general people detection in combination with discriminant classifiers on full-size images. We also introduce a new feature based on dense sampling of the Shape Context [15]. Additionally, several feature combination schemes are evaluated and show an improvement over state-of-the-art [2] people detection. The remainder of this paper is structured as follows. Section 2 reviews the evaluated features and classifiers. Section 3 introduces the experimental protocol, and section 4.1 gives results for single cue detection. Results for cue combination are discussed in section 4.2, and section 4.3 analyzes failure cases.
2
Features and Classifiers
Sliding window object detection systems for static images usually consist of two major components, which we evaluate separately in this work. The feature component encodes the visual appearance of the object to be detected, whereas the classifier determines for each sliding window independently whether it contains the object or not. Table 1 gives an overview of the feature/classifier combinations proposed in the literature. As can be seen from this table, many possible feature/classifier combinations are left unexplored, making it difficult to assess the respective contribution of different features and classifiers to the overall detector performance. To enable a comprehensive evaluation using all possible feature/classifier combinations, we reimplemented the respective methods. Comparisons with published binaries (whenever available) verify that our reimplementations perform at least as well as the originally proposed feature/classifier combinations (cf. Figure 1(h)). The remainder of this section reviews the evaluated features and classifiers.

Table 1. Original combination of features and classifiers

Feature                  Linear SVM   Kernel SVM   AdaBoost   Other   Criterion
Haar wavelet [1]                      polynomial                      ROC
Haar-like wavelet [10]                             cascaded           ROC
HOG [2]                  ✓            RBF                             FPPW
Shapelets [13]                                     ✓                  FPPW
Shape Context [8]                                             ISM     RPC

2.1
Features
Haar Wavelets have first been proposed by Papageorgiou and Poggio [1]. They introduce a dense, overcomplete representation using wavelets at scales of 16 and 32 pixels with an overlap of 75%. Three different types are used, which allow
to encode low frequency changes in contrast: vertical, horizontal and diagonal. Thus, the overall length of the feature vector for a 64 × 128 pixel detection window is 1326 dimensions. In order to cope with lighting differences, for each color channel only the maximum response is kept and normalization is performed according to the window's mean response for each direction. Additionally, the original authors report that for the class of people the wavelet coefficients' signs carry no information due to the variety in clothing. Hence, only the absolute value of each coefficient is kept. During our experiments we found that an additional L2 length normalization with regularization of the feature vector improves performance. Haar-Like Features have been proposed by Viola and Jones [10] as a generalization of Haar wavelets with arbitrary dimensions and different orientations (efficiently computed by integral images). They suggest to exhaustively use all possible features that can be sampled from a sliding window and let AdaBoost select the most discriminative ones. Thus, their approach is computationally limited to rather small detection window sizes. For our evaluation we use the OpenCV¹ implementation of their algorithm to select the relevant features and only use those, appropriately scaled to our detection window's size of 64 × 128 pixels. Similarly to [1] we found that for the class of people the coefficients' signs are irrelevant due to different clothing and surroundings and therefore used absolute values. Moreover, we found that the applied illumination variance normalization performs worse than simple L2 length normalization on the selected features. Histograms of Oriented Gradients have been proposed by Dalal and Triggs [2]. Image derivatives are computed by centered differences in x- and y-direction. The gradient magnitude is then inserted into cell histograms (8 × 8 pixels), interpolating in x, y and orientation. Blocks are groups of 2 × 2 cells with an overlap of one cell in each direction. Blocks are L2 length normalized with an additional hysteresis step to avoid one gradient entry dominating the feature vector. The final vector consists of all normalized block histograms, with a total dimension of 3780 for a 64 × 128 detection window. Shapelets [13] are another type of gradient-based feature obtained by selecting salient gradient information. They employ discrete AdaBoost on densely sampled gradient image patches of multiple orientations (0°, 90°, 180°, 270°) at scales of 5 to 15 pixels to classify those locally into people and non-people based on the local shape of the object. As a preprocessing step, gradient images are smoothed to account for inaccuracies of the person's position within the annotation. Moreover, the underlying gradient image is normalized shapelet-wise to achieve illumination invariance. Compared to the published source code² we use stronger regularization for the normalization step, in order not to amplify noise. This improves the results considerably. Shape Context has originally been proposed as a feature point descriptor [15] and has shown excellent results for people detection in the generative ISM framework [16,3]. The descriptor is based on edges which are extracted with a
¹ http://sourceforge.net/projects/opencvlibrary
² http://www.cs.sfu.ca/~mori/research/shapelet_detect
Canny detector. Those are stored in a log-polar histogram with location being quantized into nine bins. For the radius, 9, 16 and 23 pixels are used, while orientation is quantized into four bins. For the sliding window search we densely sampled the descriptor on a regular lattice with a support of 32 pixels (other scales in the range from 16 to 48 pixels performed worse). For our implementation we used the version of Mikolajczyk [17], which additionally applies PCA to reduce the feature dimensionality to 36 dimensions. The overall length of all descriptors concatenated for one test window is 3024. 2.2
Classifiers
The second major component of sliding-window approaches is the deployed classifier. For the classification of single windows two popular choices are SVMs and decision tree stumps in conjunction with the AdaBoost framework. SVMs optimize a hyperplane to separate positive and negative training samples based on the global feature vector. Different kernels map the classification problem to a higher dimensional feature space. For our experiments we used the SVMlight implementation [18]. In contrast, boosting picks the single entries of the feature vector with the highest discriminative power in order to minimize the classification error in each round.
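As a brief illustration of the boosting side, the following minimal sketch (ours; the struct and function names are not taken from any of the cited implementations) shows how a boosted ensemble of decision tree stumps scores a feature vector: each stump thresholds one feature dimension and contributes its vote weighted by the coefficient learned in its boosting round.

    #include <vector>

    // One decision tree stump: threshold test on a single feature dimension.
    struct Stump {
        int   dim;        // index into the feature vector
        float threshold;  // decision threshold
        int   polarity;   // +1: fire if value > threshold, -1: fire if value <= threshold
        float alpha;      // weight learned by AdaBoost for this round
    };

    // Boosted classifier score: weighted sum of stump votes; the sign gives the class.
    float boostedScore(const std::vector<Stump>& stumps, const std::vector<float>& feature)
    {
        float score = 0.0f;
        for (const Stump& s : stumps) {
            const bool fires = (s.polarity > 0) ? (feature[s.dim] > s.threshold)
                                                : (feature[s.dim] <= s.threshold);
            score += s.alpha * (fires ? 1.0f : -1.0f);
        }
        return score;
    }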
3
Dataset and Methodology
To evaluate the performance of the introduced features and their combinations with different classifiers, we use the established INRIA Person dataset³. This dataset contains images of humans taken from several viewpoints under varying lighting conditions in indoor and outdoor scenes. For training and testing the dataset is split into three subsets: the full-size positive images, the scale-normalized crops of humans, and full-size negative images. Table 2 gives an overview of the number of images and the number of depicted people. For training we use all 2416 positive images, and for the negative training instances we randomly cropped a fixed set of 10 negative windows from every negative image. Unlike the original authors [2] we test the trained detectors on the full images. We do so in order to evaluate the detectors not only in terms of false positive detections per window (FPPW) but also with respect to their frequency and spatial distribution. This gives a more realistic assessment of how well a detector performs for real image statistics.

Table 2. Number of images and instances for the INRIA Person dataset

             Positive set / # instances   Normalized crops   Negative set
Training     615 / 1208                   2416               1218
Testing      288 / 566                    1132               453

To allow this evaluation in terms of recall and precision, the nearby initial detections in scale and space need to be merged into a single final hypothesis. To achieve this, a mode-seeking adaptive-bandwidth mean shift
³ http://pascal.inrialpes.fr/data/human
algorithm [19] is used. The width of the smoothing kernel was kept fixed for all experiments and no further postprocessing was applied. Ground truth and final detections are matched using the PASCAL criterion [20], which demands a minimum overlap of 50% for two matching bounding boxes.
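For reference, the following is a small sketch of this matching criterion (the function names and the box representation are ours): two boxes match if their intersection-over-union overlap is at least 50%.

    #include <algorithm>

    struct Box { float x1, y1, x2, y2; };   // axis-aligned bounding box, x1 < x2, y1 < y2

    // Intersection-over-union of two boxes; 0 if they do not overlap.
    float overlap(const Box& a, const Box& b)
    {
        const float iw = std::min(a.x2, b.x2) - std::max(a.x1, b.x1);
        const float ih = std::min(a.y2, b.y2) - std::max(a.y1, b.y1);
        if (iw <= 0.0f || ih <= 0.0f) return 0.0f;
        const float inter = iw * ih;
        const float areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
        const float areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
        return inter / (areaA + areaB - inter);
    }

    // PASCAL criterion: a detection matches the ground truth if the overlap is >= 0.5.
    bool matchesPascal(const Box& det, const Box& gt) { return overlap(det, gt) >= 0.5f; }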
4
Experiments
4.1
Single Feature Detection
We start by evaluating all features individually in combination with the three classifiers AdaBoost, linear SVM and RBF kernel SVM. In order not to introduce bias by the selection of negative samples, a fixed set was used and no bootstrap learning was employed. Figures 1 (a)-(c) show the results we have obtained. First of all, the HOG descriptor and the similar Shape Context descriptor consistently outperform the other features independently of the learning algorithm. They are able to achieve around 60% equal error rate. The two Haar-like wavelet-based approaches perform similarly, while the Haar features of [1] perform slightly better in combination with AdaBoost and the Haar-like features of [10] show better results when combined with a linear SVM. Shapelets do not perform as well as suggested by the reported FPPWs in the original paper. Only in combination with a linear SVM do they perform better than the wavelet features. Overall, RBF kernel SVMs together with the gradient-based features HOG and Shape Context show the best results. All features except shapelets show better performance with the RBF kernel SVM compared to the linear SVM. AdaBoost achieves a similarly good performance in comparison with RBF kernel SVMs, in particular for the Haar-like wavelets, the HOG feature and for shapelets. It does slightly worse for the dense Shape Context descriptor. For the wavelet features, linear SVMs are not able to learn a good classifier with limited data. AdaBoost and RBF kernel SVMs do better in this case due to their ability to separate data non-linearly. Remarkably, linear SVMs show better performance in combination with Shape Context compared to HOG. This might be an effect of the log-polar sampling for the feature histograms, which allows for a better linear separation.
4.2
Multi-cue Detection
A closer look at the single detectors' complementarity reveals that different features in combination with different classifiers perform differently on individual instances. This can be explained by the fact that the features encode different information. While gradients encode high-frequency changes in the images, Haar wavelets as proposed by [1] also encode much lower frequencies. Thus, it is worth further investigating the combination of features. To this end, we conducted several experiments employing early integration with linear SVMs and AdaBoost as classifiers. RBF kernel SVMs have not been employed further for reasons of computational complexity.
Fig. 1. Recall-Precision detector performances: (a) feature performance with AdaBoost; (b) feature performance with linear SVM; (c) feature performance with RBF kernel SVM; (d) combination of Haar wavelets [1] and HOG, different classifiers; (e) combination of Haar wavelets [1] and dense Shape Context, different classifiers; (f) combination of Haar wavelets [1], HOG and dense Shape Context, different classifiers; (g) bootstrapped single feature detectors of (e) and their combination, linear SVM; (h) performance of available detector binaries and our reimplementations; for Haar-like wavelets we improved regularization, for Shapelets we set the regularization properly.
Before stacking feature vectors in a linear SVM classifier, each feature cue was L2-normalized to avoid a bias resulting from the features' scale range. In order to keep the comparison fair, we also used the same normalization for AdaBoost. We have combined all possible subsets of HOG, Shape Context and Haar wavelet-based features [1]. Combinations with shapelets were not tried due to the poor performance of this feature. In the following we focus on the combinations which yielded the best results. Additionally we also employed a bootstrapping method, which has been shown to improve performance [2,9]. For this, an initial classifier is trained with all available positive training data and random negative samples. Then "hard examples" are collected by scanning the negative training images. The final classifier is then trained on the set of the initial and hard samples. Our most successful experiments yielded the results depicted in plots 1(d)-(f). For easier comparison, the curve of the best performing published binary ([2], bootstrapped SVM classifier, available at http://pascal.inrialpes.fr/soft/olt) is also shown. Figure 1(d) shows the performance of Haar wavelets [1] and HOG features. Even without bootstrapping, the combined features with the AdaBoost classifier almost reach the performance of the published HOG binary. This is due to the local optimization of AdaBoost, which concentrates on the most discriminative feature in each round. An analysis shows that 67.5% of the selected features are HOG features while 32.5% are Haar features. The performance of the SVM with this feature combination lies in between the performance of the two original features. This result can be explained by the global optimization strategy of SVMs, which needs more data to obtain a good fit. Obviously, the bootstrapping method provides more data, and consequently performance increases substantially to slightly above the performance of the bootstrapped HOG features. However, it does not reach the performance of the bootstrapped AdaBoost classifier. As already discussed in Section 4.1, AdaBoost does better at separating HOG features and Haar wavelets when they are used individually. Thus, it is not surprising that the combination also performs well. Figure 1(e) shows the combination of dense Shape Context features with Haar wavelets. Without bootstrapping, AdaBoost and the linear SVM perform similarly and better than for the single features alone. Adding bootstrapping, the SVM classifier again gains a significant improvement. This is due to the same fact we have pointed out in Section 4.1: Shape Context features show good linear separability and thus linear SVMs are able to achieve a high classification performance. Again we reviewed the features chosen by AdaBoost; these were 66.25% Shape Context features and 33.75% Haar wavelet features. We also analyzed the performance of the individual features in a linear SVM when learned with a bootstrapping strategy. Figure 1(g) shows that in fact neither feature on its own can reach the performance that is reached with their combination. Compared to the state-of-the-art HOG object detector we improve recall considerably, by about 10% at 80% precision. Finally, Figure 1(f) shows results of the combination of HOG, Shape Context and Haar features. For this combination AdaBoost already outperforms
the HOG object detector by Dalal [2] even without bootstrapping. The linear SVM classifier again profits from the bootstrapping step and performs similarly to the bootstrapped AdaBoost classifier. Interestingly, the performance obtained by the combination of HOG, Shape Context and Haar features is highly similar to the pairwise combinations of Haar features with either HOG or Shape Context. Here the analysis of the chosen features yields the following distribution: 45.25% HOG, 34.0% Shape Context, 20.75% Haar. Additionally adding Haar-like features [10] resulted in almost unchanged detections. In summary, we can state that the combination of different features successfully improves state-of-the-art people detection performance. We have shown that a combination of HOG features and Haar wavelets in an AdaBoost classification framework, as well as dense Shape Context features with Haar wavelets in a linear SVM framework, are able to achieve about 10% better recall at a precision of 80% compared to a single-feature HOG detector. Figure 2 shows the improvement on sample images. Similarly to [21] we also observe that a combination of features can achieve better detection performance than the standalone features when trained on the same amount of training data. Additionally, SVMs were able to benefit from a bootstrapping strategy during learning as noted by [9]. While AdaBoost also improves with bootstrapping, the effect is much weaker compared to SVMs.
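The early-integration and bootstrapping scheme described above can be summarized by the following sketch (our reconstruction, not the authors' implementation; the window-scanning helper and the SVM regularization constant are assumptions): each feature cue is L2-normalized before stacking, an initial linear SVM is trained, and hard negatives collected from the negative training images are added for a retraining round.

# Illustrative early-integration + bootstrapping sketch (assumed inputs:
# `cues` = list of per-window feature blocks such as [hog, shape_context, haar];
# `scan_windows(img)` is a hypothetical helper yielding descriptor vectors).
import numpy as np
from sklearn.svm import LinearSVC

def stack_cues(cues):
    """L2-normalize each cue, then concatenate (early integration)."""
    normed = [c / (np.linalg.norm(c, axis=1, keepdims=True) + 1e-12) for c in cues]
    return np.hstack(normed)

def bootstrap_train(pos, neg_initial, negative_images, scan_windows, rounds=1):
    X = np.vstack([pos, neg_initial])
    y = np.hstack([np.ones(len(pos)), -np.ones(len(neg_initial))])
    clf = LinearSVC(C=0.01).fit(X, y)
    for _ in range(rounds):
        # collect "hard examples": windows of negative images scored as positive
        hard = [w for img in negative_images
                for w in scan_windows(img)
                if clf.decision_function(w[None, :])[0] > 0]
        if not hard:
            break
        X = np.vstack([X, np.vstack(hard)])
        y = np.hstack([y, -np.ones(len(hard))])
        clf = LinearSVC(C=0.01).fit(X, y)   # retrain on initial + hard samples
    return clf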
4.3 Failure Analysis
To complete our experimental evaluation we also conducted a failure case analysis. In particular, we have analyzed the missing recall and the false positive detections at equal error rate (149 missing detections/ 149 false positives) for the feature combination of Shape Context and Haar wavelets in combination with a linear SVM. Missing recall mainly occurred due to unusual articulations (37 cases), difficult background or contrast (44 cases), occlusion or carried bags
Fig. 2. Sample detections at a precision of 80%. Red bounding boxes denote false detections, while yellow bounding boxes denote true positives. The first row shows detections by the publicly available HOG detector [2]; the second row depicts sample detections for our combination of dense Shape Context with Haar wavelets in a linear SVM.
(a) Unusual articulation (b) Difficult contrast (c) Occlusion (d) Person carrying goods (e) Detection on parts (f) Too large scale (g) Detection on vertical structures (h) Cluttered background (i) Missing annotation
Fig. 3. Missed recall (upper row) and false positive detections (lower row) at equal error rate
(43 cases), under- or overexposure (18 cases) and detection at too large or too small scales (7 cases). There were also 3 cases which were detected with the correct height but could not be matched to the annotation according to the PASCAL criterion due to the very narrow annotation. False positive detections can be categorized as follows: vertical structures like poles or street signs (54 cases), cluttered background (31 cases), too large scale detections with people in the lower part (24 cases), and too small scale detections on body parts (28 cases). There were also a couple of "false" detections (12 cases) on people which were not annotated in the database (mostly due to occlusion or at small scales). Some samples of missed people and false positives are shown in Figure 3.
5
Conclusion
We have presented a systematic performance evaluation of state-of-the-art features and classification algorithms for people detection. Experiments on the challenging INRIA Person dataset showed that both HOG and dense Shape Context perform better than other features independent of the deployed classifier. Moreover, we have shown that a combination of multiple features is able to improve the performance of the individual detectors considerably. Clearly, there are several open issues which cannot be solved easily with single image classification. Thus, additional motion features and the integration across multiple frames are necessary to further improve performance. Motion for instance can help to resolve false detections due to vertical structures while multiple frame integration is likely to yield better results with cluttered background.
Acknowledgements. We gratefully acknowledge support by the Frankfurt Center for Scientific Computing. This work has been funded, in part, by the EU project CoSy (IST-2002-004250).
References 1. Papageorgiou, C., Poggio, T.: A trainable system for object detection. IJCV 38(1), 15–33 (2000) 2. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, pp. 886–893 (2005) 3. Seemann, E., Leibe, B., Schiele, B.: Multi-aspect detection of articulated objects. In: CVPR, pp. 1582–1588 (2006) 4. Forsyth, D., Fleck, M.: Body plans. In: CVPR (1997) 5. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In: ICCV (2005) 6. Felzenszwalb, P., Huttenlocher, D.: Efficient matching of pictorial structures. In: CVPR (2000) 7. Mikolajczyk, K., Schmid, C., Zisserman, A.: Human detection based on a probabilistic assembly of robust part detectors. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3024, pp. 69–81. Springer, Heidelberg (2004) 8. Seemann, E., Leibe, B., Mikolajczyk, K., Schiele, B.: An evaluation of local shapebased features for pedestrian detection. In: BMVC (2005) 9. Munder, S., Gavrila, D.M.: An experimental study on pedestrian classification. PAMI 28(11), 1863–1868 (2006) 10. Viola, P.A., Jones, M.J.: Robust real-time face detection. IJCV 57(2), 137–154 (2004) 11. Gavrila, D.: Multi-feature hierarchical template matching using distance transforms. In: Proceedings of the International Conference on Pattern Recognition, vol. 1, pp. 439–444 (1998) 12. Shashua, A., Gdalyahu, Y., Hayun, G.: Pedestrian detection for driving assistance systems: Single-frame classification and system level performance. In: International Symposium on Intelligent Vehicles, pp. 1–6 (2004) 13. Sabzmeydani, P., Mori, G.: Detecting pedestrians by learning shapelet features. In: CVPR (2007) 14. Tuzel, O., Porikli, F., Meer, P.: Human detection via classification on Riemannian manifolds. In: CVPR (2007) 15. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. PAMI 24(4), 509–522 (2002) 16. Leibe, B., Seemann, E., Schiele, B.: Pedestrian detection in crowded scenes. In: CVPR, pp. 878–885 (2005) 17. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. PAMI 27(10), 1615–1630 (2005) 18. Joachims, T.: Making large–scale SVM learning practical. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods — Support Vector Learning, pp. 169–184. MIT Press, Cambridge (1999) 19. Comaniciu, D.: An algorithm for data-driven bandwidth selection. PAMI 25(2), 281–288 (2003) 20. Everingham, M., Zisserman, A., Williams, C., van Gool, L.: The PASCAL visual object classes challenge (VOC 2006) results. Technical report (2006) 21. Levi, K., Weiss, Y.: Learning object detection from a small number of examples: The importance of good features. In: CVPR, vol. II, pp. 53–60 (2004)
Model-Based Motion Capture for Crash Test Video Analysis
Juergen Gall¹, Bodo Rosenhahn¹, Stefan Gehrig², and Hans-Peter Seidel¹
¹ Max-Planck-Institute for Computer Science, Campus E1 4, 66123 Saarbrücken, Germany
² Daimler AG, Environment Perception, 71059 Sindelfingen, Germany
{jgall, rosenhahn, hpseidel}@mpi-inf.mpg.de, [email protected]
Abstract. In this work, we propose a model-based approach for estimating the 3D position and orientation of a dummy's head for crash test video analysis. Instead of relying on photogrammetric markers, which provide only sparse 3D measurements, features present in the texture of the object's surface are used for tracking. In order to handle small and partially occluded objects, the concepts of region-based and patch-based matching are combined for pose estimation. For a qualitative and quantitative evaluation, the proposed method is applied to two multi-view crash test videos captured by high-speed cameras.
1
Introduction
The analysis of crash test videos is an important task for the automotive industry in order to improve the passive safety components of cars. In particular, the motion estimation of crash test dummies helps to improve the protection of occupants and pedestrians. The standard techniques for crash analysis use photogrammetric markers that provide only sparse 3D measurements, which do not allow the estimation of the head orientation. In this work, we address motion capture of rigid body parts in crash test videos where we concentrate on the head – one of the most sensitive body parts in traffic accidents. As shown in Figure 1 a), this is very challenging since the head covers only a small area of the image and large parts are occluded by the airbag. In addition, shadows and background clutter make it difficult to distinguish the target object from the background. To this end, we propose a model-based approach that estimates the absolute 3D rotation and position of the object from multiple views independently of photogrammetric markers. In order to make the estimation robust to occlusions, reference images are synthesized using a 3D model that contains the geometry and the texture of the object. Since our approach further combines region-based and patch-based matching, reliable estimates are obtained even for small body parts as demonstrated in Figure 1. The 3D surface model of a crash test dummy is readily available as most dummies are manufactured according to ISO standards. The texture is often not provided but it can be acquired by projecting the images from the calibrated cameras on the object's surface using the technique described in [1] to align the 3D model to the images.
Fig. 1. Left: a) Estimating the pose of the dummy’s head from crash test videos is very challenging. The target object is relatively small and partially occluded by the airbag. Furthermore, background clutter and the car’s shadow make it difficult to distinguish the head from the background. Right: b) Estimated pose of the dummy’s head. The 3D surface model is projected onto the image.
2
Related Work
The knowledge of a 3D surface model has been widely used for tracking humans or human body parts, see e.g. [2] or [3]. Besides silhouettes and edges, optical flow is another popular cue for model-based tracking, see e.g. [4]. More recently, segmentation and pose estimation has been coupled [5] where the projected surface of the previous estimated pose serves as a shape prior for the segmentation. It has the advantage that a static background is not required in contrast to methods that rely on background subtraction. Since the performance depends on the accuracy of the shape prior, optical flow can be used to predict the shape of the target object [6]. These approaches are, however, not suitable for crash test videos since they tend to fail in the case of occlusions. Using patch-based matching, the tracking can be regarded as a detection problem [7] where patches of a textured model from different viewpoints are extracted in a preprocessing step. During tracking, each frame is then matched to one of the keyframes. Although this approach is very fast, it cannot be applied to small and occluded objects where only few features are available. The same problem arises for approaches that combine an iterative analysis-by-synthesis scheme with optical flow [8] or patch-based matching [9]. In order to handle occlusions and small objects, we propose an analysis-by-synthesis approach that combines the concepts of region-based and patch-based matching. Tracking of small objects for crash video analysis without a surface model has been investigated in [10], where the relative transformation is reconstructed from a 3D point cloud that is tracked using KLT [11] and stereo depth data. In contrast to model-based approaches, point clouds do not provide all relevant information like depth of penetration or absolute head orientation.
3
System Overview
An outline of our approach is given in Figure 2. Patch-based matching (Section 4.1) is used for pose prediction and for establishing correspondences between the current image and the synthesized reference image. For synthesis, the textured model is projected onto the current image using the predicted pose. Although the reference images provide a relatively small number of matches due to illumination differences, they help to prevent error accumulation since the static texture of the model is not affected by tracking errors. The small size of the object and temporary occlusions actually yield very few matches from patches. Additional correspondences are therefore extracted by region matching (Section 4.2), where the segmentation is improved by a shape prior from the predicted pose. The final pose is then estimated from weighted correspondences (Section 4.3) established by prediction, region-based matching, and patch-based matching.
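The data flow of one tracking step can be summarized as follows (purely illustrative Python pseudocode; all helper functions such as match_patches, render_model, segment_levelset, project and estimate_pose_weighted are hypothetical placeholders for the components described in Sections 4.1-4.3, not functions provided by the authors).

# Illustrative per-frame loop (hypothetical helpers, shown only to make the
# data flow of Figure 2 explicit).
def track_frame(I_prev, I_cur, pose_prev, model):
    # 1. Predict the pose by patch matching between frames t-1 and t.
    C_pred = match_patches(I_prev, I_cur, pose_prev, model)
    pose_pred = estimate_pose(C_pred)

    # 2. Synthesize a reference image from the textured model at the predicted
    #    pose and match it against the current frame (prevents drift).
    I_syn = render_model(model, pose_pred)
    C_syn = match_patches(I_syn, I_cur, pose_pred, model)

    # 3. Region-based matching: level-set segmentation with the projected model
    #    as shape prior, then contour correspondences.
    contour = segment_levelset(I_cur, shape_prior=project(model, pose_pred))
    C_reg = match_contour(project(model, pose_pred), contour)

    # 4. Final pose from all weighted 3D-2D correspondences.
    return estimate_pose_weighted(C_pred, C_reg, C_syn)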
Fig. 2. Left: a) Having estimated the pose for time t − 1, the pose for the next frame is predicted by matching patches between the images of frames t − 1 and t. The predicted pose provides a shape prior for the region-based matching and defines the pose of the model for synthesis. The final pose for frame t is estimated from weighted correspondences emerging from the prediction, region-based matching, and patch-based matching. Right: b) From top left to bottom right: Correspondences between two successive frames (square: frame t − 1; cross: frame t). Estimated contour. Synthesized image. Correspondences between synthesized image (square) and original image (cross).
4 Pose Tracking
4.1 Patch-Based Matching
Patch-based matching extracts correspondences between two successive frames for prediction and between the current image and a synthesized image for avoiding drift as outlined in Figure 2. For reducing the computation effort of the keypoint extraction [12], a region of interest is selected by determining the bounding
box around the projection and adding fixed safety margins that compensate for the movement. As local descriptor for the patches, we apply PCA-SIFT [13], which is trained by building the patch eigenspace from the object texture. 2D-2D correspondences are then established by nearest neighbor distance ratio matching [14]. Since each 2D keypoint x of the projected model is inside or on the border of a triangle with vertices V_1, V_2, and V_3, the 3D counterpart is approximated by X = \sum_i \alpha_i V_i using barycentric coordinates (α_1, α_2, α_3). The corresponding triangle for a 2D point can be efficiently determined by a look-up table containing the color index and vertices for each triangle. The patch matching also produces outliers that need to be eliminated. In a first coarse filtering step, mismatches are removed by discarding 2D-2D correspondences with a Euclidean distance that is much larger than the average. After deriving the 3D-2D correspondences, the pose is estimated and the new 3D correspondences are projected back. By measuring the distance between the 2D correspondences and their reprojected counterparts, the remaining outliers are detected.
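A minimal sketch of the two steps just described, lifting a 2D keypoint to 3D via the barycentric coordinates of its triangle and rejecting outliers by reprojection distance, could look as follows (our illustration; the projection callable and the error threshold are assumptions, not values from the paper).

# Illustrative barycentric lifting and reprojection-based outlier filtering.
import numpy as np

def lift_to_3d(alphas, V):
    """X = sum_i alpha_i * V_i for one triangle.
    alphas: (3,) barycentric coordinates, V: (3, 3) triangle vertex positions."""
    return alphas @ V

def filter_by_reprojection(X3d, x2d, project, max_err=3.0):
    """Keep 3D-2D correspondences whose reprojection error is small.
    `project` maps (N, 3) points to (N, 2) pixels using the estimated pose."""
    err = np.linalg.norm(project(X3d) - x2d, axis=1)
    keep = err < max_err
    return X3d[keep], x2d[keep]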
4.2 Region-Based Matching
Region-based matching minimizes the difference between the projected surface of the model and the object region extracted in the image, see Figure 2 b). For this purpose, 2D-2D correspondences between the contour of the projected model and the segmented contour are established by a closest point algorithm [15]. Since we are interested in 3D-2D correspondences between the model and the image, we consider only the projected mesh vertices on the model contour where the 3D coordinates are known. The 2D counterpart is then given by the point on the segmented contour that is closest to the projected vertex. The silhouette of the object is extracted by a level-set segmentation that divides the image into fore- and background, where the contour is given by the zero-line of a level-set function Φ. As proposed in [5], the level-set function Φ is obtained by minimizing the energy functional

E(\Phi) = -\int_\Omega \bigl( H(\Phi)\,\ln p_1 + (1 - H(\Phi))\,\ln p_2 \bigr)\, dx + \nu \int_\Omega |\nabla H(\Phi)|\, dx + \lambda \int_\Omega (\Phi - \Phi_0)^2\, dx,   (1)
where H is a regularized version of the step function. The densities of the fore- and background, p1 and p2, are estimated by a Parzen estimator with Gaussian kernels. While the first term maximizes the likelihood, the second term regulates the smoothness of the contour via the parameter ν = 2. The last term penalizes deviations from the projected surface of the predicted pose Φ0, where we use the recommended value λ = 0.06.
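For illustration, a discretized evaluation of the energy in Eq. (1) can be written as below (our sketch; the paper minimizes this functional rather than merely evaluating it, and the regularized Heaviside function used here is one common choice, not necessarily the one used by the authors).

# Discretized evaluation of Eq. (1) on a pixel grid (illustrative only).
import numpy as np

def heaviside(phi, eps=1.0):
    """One common regularized step function (an assumption of this sketch)."""
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

def levelset_energy(phi, p1, p2, phi0, nu=2.0, lam=0.06):
    """phi, phi0: level-set functions; p1, p2: per-pixel fore-/background densities."""
    H = heaviside(phi)
    data = -(H * np.log(p1 + 1e-12) + (1.0 - H) * np.log(p2 + 1e-12))
    gy, gx = np.gradient(H)                    # length/smoothness term
    length = nu * np.sqrt(gx ** 2 + gy ** 2)
    prior = lam * (phi - phi0) ** 2            # shape prior from the predicted pose
    return (data + length + prior).sum()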
4.3 Pose Estimation
For estimating the pose, we seek the transformation that minimizes the error of given 3D-2D correspondences, denoted by pairs (Xi, xi) of homogeneous
coordinates. To this end, we represent a 3D rigid motion M by a twist θξ̂ [4]: M = exp(θξ̂). Hence, a transformation of a point Xi is given by

X_i' = \exp(\theta\hat{\xi})\, X_i.   (2)

Since each 2D point xi defines a projection ray that can be represented as a Plücker line Li = (ni, mi) [16], the error of a pair (Xi, xi) is given by the norm of the perpendicular vector between the line Li and the point Xi,

\bigl\| \Pi(X_i) \times n_i - m_i \bigr\|_2,   (3)
where Π denotes the projection from homogeneous coordinates to non-homogeneous coordinates. Using the Taylor approximation exp(θξ̂) ≈ I + θξ̂, where I denotes the identity matrix, Equation (2) can be linearized. Hence, the sought transformation is obtained by solving the weighted linear least squares problem

\frac{1}{2} \sum_i w_i \bigl\| \Pi\bigl( (I + \theta\hat{\xi})\, X_i \bigr) \times n_i - m_i \bigr\|_2^2,   (4)

i.e. by solving a system of linear equations. For estimating the final pose, correspondences from the prediction (Cp), region-based matching (Cr), and the synthetic image (Cs) are used as outlined in Figure 2. Since the number of correspondences from region matching varies according to scale, shape, and triangulation of the object, we weight the summands in Equation (4) such that the influence between patches and silhouette is independent of the model. This is achieved by setting the weights for the equations for Cp and Cs in relation to Cr:

w_p = \alpha \frac{|C_r|}{|C_p|}, \qquad w_r = 1, \qquad w_s = \beta\, w_p.   (5)

While the influence of the image-based patches and the contour is controlled by the parameter α independently of the number of correspondences, the weight ws reflects the confidence in the matched patches between the synthesized and original image, which increases with the number of matches |Cs| relative to |Cp|. Since illumination differences between the two images entail that |Cs| is usually less than |Cp|, the scaling factor β compensates for the difference. In our experiments, we have obtained good results with α = 2.0 and β = 2.0.
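A compact sketch of the resulting weighted least-squares update is given below (our illustration; assembling the per-correspondence linear equations from the twist parameterization is abstracted into precomputed (A_i, b_i) pairs, which is an assumption of this sketch and not part of the paper).

# Weights of Eq. (5) and a generic weighted least-squares solve for Eq. (4).
import numpy as np

def correspondence_weights(n_pred, n_region, alpha=2.0, beta=2.0):
    w_p = alpha * n_region / max(n_pred, 1)   # prediction patches
    w_r = 1.0                                  # contour correspondences
    w_s = beta * w_p                           # synthetic-image patches
    return w_p, w_r, w_s

def solve_pose_update(equations):
    """equations: list of (A_i, b_i, w_i) with A_i of shape (k, 6) and b_i of
    shape (k,) such that A_i @ twist ~ b_i; returns the 6-vector twist update."""
    A = np.vstack([np.sqrt(w) * Ai for Ai, bi, w in equations])
    b = np.concatenate([np.sqrt(w) * bi for Ai, bi, w in equations])
    twist, *_ = np.linalg.lstsq(A, b, rcond=None)
    return twist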
5
Experiments
The first experiment investigates the dynamics of a pedestrian head crashing onto the engine hood, see Figure 3. The sequence has been captured at 1000 Hz by two calibrated cameras with 512 × 384 pixel resolution. For segmentation, the images have been converted to the CIELab color space that mimics the human perception of color differences. Since we have registered the engine hood as shown in row 3 of Figure 3, the depth of penetration can be measured from the estimated head pose. In this case, the head penetrates 49.1mm into the engine
Fig. 3. Rows 1, 2: The head crashes onto the engine hood. Estimates for frames 5, 25, 45, 65, 85, and 105 are shown (from top left to bottom right). The pose of the head is well estimated for the entire sequence. Row 3: Virtual reconstruction of the crash showing the 3D surface model of the head and of the engine hood. From left to right: a) Frame 5. b) Frame 25. The head penetrates the engine hood. c) Depth of penetration. The black curve shows the distance of the head to the engine hood (dashed line).
compartment. This is a relevant information for crash test analysis since severe head injuries might be caused by crashing into the solid engine block. Note that a standard silhouette-based approach would not be able to estimate the rotation due to the symmetric shape of the object whereas our approach provides good results for the entire sequence. For the second experiment, the head of a dummy is tracked during a EuroNCAP offset crash where the car drives into an aluminum barrier with 40% overlap as shown in Figure 4. Due to the barrier, the car jumps and moves laterally. Although the sequence was captured at 1000 Hz by 3 cameras with 1504 × 1128 pixel resolution, the head covers only 70 × 70 pixels, i.e., less than 0.3% of the image pixels. In addition, the head is occluded by more than 50% at the moment of the deepest airbag penetration, and the segmentation is hindered by shadows
Fig. 4. Three frames of the EuroNCAP offset crash sequence. The car jumps and moves laterally due to the offset barrier. The head is occluded by more than 50% at the moment of the deepest airbag penetration.
Fig. 5. Left: 3D trajectory of the head. Right: 3D tracking error of the head. The ground truth is obtained from a marker-based system with standard deviation ±2.5mm for the x- and y-coordinates and ±5mm for the z-coordinate. Note that the object is about 10m away from the camera.
Fig. 6. Comparison with a marker-based system and an acceleration sensor. The model-based approach provides accurate estimates for velocity and acceleration. Left: Velocity (x-axis). Right: Acceleration (x-axis).
and background clutter. Nevertheless, Figure 7 demonstrates that the head pose is well estimated by our model-based approach during the crash. The trajectory in Figure 5 reflects the upward and lateral movement of the car away from the camera due to the offset barrier. For a quantitative error analysis, we have compared the results with a marker-based system using photogrammetric markers. The 3D tracking error is obtained by the Euclidean distance between the
Fig. 7. Estimated pose of the dummy’s head for frames 7, 22, 37, 52, 67, 82, 97, 112, 127, 142, 157, 172, 187, 202, and 217 (from top left to bottom right)
estimated position and the true position of the 5-dot marker on the left-hand side of the dummy's head. The results are plotted in Figure 5, where the average error is 37mm with a standard deviation of 15mm. For computing the velocity and acceleration of the head, the trajectories from the marker-based and the model-based method are slightly smoothed, as is common for crash test analysis. Figure 6 shows that the velocity and the acceleration are well approximated by our approach. A comparison with an acceleration sensor attached to the head further reveals that the deceleration is similar to the estimates of our approach. For the offset crash sequence, our current implementation requires 6 seconds per frame on a consumer PC. Finally, we remark that the marker-based system provides only one 3D point for the position of the head whereas our approach estimates the full head pose. It allows additional measurements like rotation of the head or penetration depth which help to analyze crash test videos.
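For illustration, velocity and acceleration can be obtained from the tracked positions by slight smoothing and finite differences as sketched below (window length and smoother are our assumptions, not taken from the paper).

# Illustrative velocity/acceleration computation from a 1000 Hz trajectory.
import numpy as np

def smooth(x, window=21):
    """Simple moving-average smoothing (an assumed choice, not the authors')."""
    k = np.ones(window) / window
    return np.convolve(x, k, mode="same")

def velocity_acceleration(pos_mm, fs=1000.0):
    """pos_mm: 1D array of positions in mm along one axis."""
    x = smooth(pos_mm)
    v = np.gradient(x) * fs        # mm/s
    a = np.gradient(v) * fs        # mm/s^2
    return v, a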
6
Conclusion
We have presented a model-based approach for pose tracking of rigid objects that is able to meet the challenges of crash test analysis. It combines the complementary concepts of region-based and patch-based matching in order to deal with the small size of the objects. Since the targets are temporarily occluded, we have proposed the use of synthesized reference images, which help to avoid drift and to recover from prediction errors. In contrast to conventional marker-based systems, our approach estimates all six degrees of freedom of dummy body parts like the head. This opens up new opportunities for analyzing pedestrian crashes where many biomechanical effects are not fully understood. The accuracy and robustness of our system have been demonstrated by an offset crash sequence where a quantitative comparison with a marker-based system and an acceleration sensor is provided. Acknowledgments. The research was partially funded by the Max Planck Center Visual Computing and Communication and the Cluster of Excellence on Multimodal Computing and Interaction.
References 1. Gall, J., Rosenhahn, B., Seidel, H.P.: Clustered stochastic optimization for object recognition and pose estimation. In: Hamprecht, F.A., Schn¨ orr, C., J¨ ahne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 32–41. Springer, Heidelberg (2007) 2. Hogg, D.: Model-based vision: A program to see a walking person. Image and Vision Computing 1(1), 5–20 (1983) 3. Gavrila, D., Davis, L.: 3-d model-based tracking of humans in action: a multi-view approach. In: IEEE Conf. on Comp. Vision and Patt. Recog., pp. 73–80 (1996) 4. Bregler, C., Malik, J., Pullen, K.: Twist based acquisition and tracking of animal and human kinematics. Int. J. of Computer Vision 56(3), 179–194 (2004)
5. Rosenhahn, B., Brox, T., Weickert, J.: Three-dimensional shape knowledge for joint image segmentation and pose tracking. Int. Journal of Computer Vision 73(3), 243– 262 (2007) 6. Brox, T., Rosenhahn, B., Cremers, D., Seidel, H.P.: High accuracy optical flow serves 3-d pose tracking: Exploiting contour and flow based constraints. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 98– 111. Springer, Heidelberg (2006) 7. Lepetit, V., Pilet, J., Fua, P.: Point matching as a classification problem for fast and robust object pose estimation. In: IEEE Conf. on Computer Vision and Patt. Recognition, pp. 244–250 (2004) 8. Li, H., Roivainen, P., Forcheimer, R.: 3-d motion estimation in model-based facial image coding. IEEE Trans. Pattern Anal. Mach. Intell. 15(6) (1993) 9. Gall, J., Rosenhahn, B., Seidel, H.P.: Robust pose estimation with 3d textured models. In: Chang, L.-W., Lie, W.-N. (eds.) PSIVT 2006. LNCS, vol. 4319, pp. 84–95. Springer, Heidelberg (2006) 10. Gehrig, S., Badino, H., Paysan, P.: Accurate and model-free pose estimation of small objects for crash video analysis. In: Britsh Machine Vision Conference (2006) 11. Shi, J., Tomasi, C.: Good features to track. In: IEEE Conf. on Comp. Vision and Patt. Recog., pp. 593–600 (1994) 12. Lowe, D.: Object recognition from local scale-invariant features. In: Int. Conf. on Computer Vision, pp. 1150–1157 (1999) 13. Ke, Y., Sukthankar, R.: Pca-sift: A more distinctive representation for local image descriptors. In: IEEE Conf. on Comp. Vision and Patt. Recog., pp. 506–513 (2004) 14. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. In: IEEE Conf. on Computer Vision and Patt. Recognition, pp. 257–263 (2003) 15. Zhang, Z.: Iterative point matching for registration of free-form curves and surfaces. Int. Journal of Computer Vision 13(2), 119–152 (1994) 16. Stolfi, J.: Oriented Projective Geometry: A Framework for Geometric Computation. Academic Press, Boston (1991)
Efficient Tracking as Linear Program on Weak Binary Classifiers
Michael Grabner¹,³, Christopher Zach², and Horst Bischof³
¹ Microsoft Photogrammetry, Graz, Austria, [email protected]
² Department of Computer Science, University of North Carolina, Chapel Hill, [email protected]
³ Institute for Computer Graphics and Vision, Graz University of Technology, [email protected]
Abstract. This paper demonstrates how a simple, yet effective, set of features makes it possible to integrate ensemble classifiers into optical flow based tracking. In particular, gray value differences of pixel pairs are used for generating binary weak classifiers, forming the respective object representation. For the tracking step an affine motion model is proposed. By using hinge loss functions, the motion estimation problem can be formulated as a linear program. Experiments demonstrate the robustness of the proposed approach and include comparisons to conventional tracking methods.
1
Introduction
Considering object tracking as a binary classification problem has been well studied in recent years [2,7,16,1,18]. The principal idea is to learn a discriminant function which distinguishes a particular object from non-target regions. For learning such a function many techniques have been proposed in the machine learning literature, where in particular SVMs and ensemble methods have demonstrated success within object tracking. While the procedure of learning such classifiers has been well studied, the question remains how to integrate them into tracking applications for the purpose of parameter estimation of motion models with higher degrees of freedom. To estimate how an image region has moved from one frame to the next, the naive way is to apply the classifier in an exhaustive manner over a search region [7,16,2] to create a likelihood map where the peak is typically interpreted as the object's new location. However, for tracking we would typically like to handle several degrees of freedom of the target object, e.g. affine deformations for planar objects, where exhaustive search simply becomes too cumbersome. Williams
This work has been sponsored by the Austrian Joint Research Project Cognitive Vision under projects S9103-N04 and S9104-N04.
et al. [18] overcomes this limitations of exhaustive search by training so called displacement experts [8,13] for multiple directions of the target region. In particular, shifted samples of the target object are collected for training multiple Relevance Vector Machines (RVMs) which are distinctive for particular directions and finally are used for making predictions even for more than two degrees of freedom. Probably the work most related to ours is Avidan’s Support Vector Tracking (SVT) [1]. SVT integrates SVM as a powerful detection method for the purpose of object tracking. In particular, a pre-trained SVM is used in combination with the optical flow constraint which maximizes, instead of minimizing the sum of squared pixel differences, the SVM score between two successive frames. Similar to the idea of Avidan, we propose a technique to combine powerful ensemble methods [4] with optical flow based tracking. Ensemble methods are characterized by linearly combining individual expert decisions. The goal is to select a set of weak classifiers with a certain amount of diversity in their decisions to increase overall performance. Different methods have been proposed for constructing such an ensemble where bagging [3] and boosting [5] probably belong to the most widely used ones. These methods have been successfully applied to object detection tasks [17] as well as recently for tracking tasks [7,2]. However, they are typically limited to translational movement estimation. In this work we propose to create weak hypothesis from grayvalue responses at random pixel positions. Simple pixel comparisons have been widely used in many applications such as pattern matching or classification tasks. Pele [15] uses simple comparisons of pixel intensities for pattern matching and illustrates results which are competitive to well established local image descriptors. Furthermore, the Census transform [19] proposed for dense matching has a similar idea. Pixels are compared with a simple greater than comparison of a center pixel to its neighborhood to generate a bit ordering of the neighborhood and though achieving invariance to changes in gain or bias. Recently, in [10,14] pixel comparisons have been applied for real-time keypoint recognition. Simple binary tests based on gray value comparisons are used for solving large multiclass problems (with classes greater than 100) and again - robustness and effectiveness of these methods are mainly due to the employed pixel-pair features. The basic advantage of building weak hypothesis from pixel-based features is that it enables to apply the idea of linearization from Lucas and Kanade [12] and thereby to perform tracking through optimizing the classifier output. In particular, we illustrate the concept for tracking affine motions which can be replaced by simpler or more complex warping functions. The tracking step itself is formulated as linear program wheres weak binary classifiers form the basis for the optimization step. The outline of the paper is as follows. In Section 2 we present how to fuse an ensemble classifier with optical flow and formulate the problem of displacement estimation within linear programming. Section 3 presents processed sequences and the benefit versus a conventional sum-of-squared differences tracking based approach.
2
Affine Tracking with Pixel-Pair Features
This section describes our approach for object tracking by combining learned weak classifiers with the optical flow constraint. If (u, v) denotes the displacement vector, then the brightness constancy assumption for corresponding pixels yields the optical flow constraint

\frac{\partial I}{\partial t} = u\, \frac{\partial I}{\partial x} + v\, \frac{\partial I}{\partial y}.   (1)
Avidan [1] integrated this constraint into the support vector framework to enable tracking of a learned object representation stored in the support vectors. Due to the specific choice of the kernel, support vector tracking can be performed by solving linear systems of equations. Note that [1] only addresses the estimation of pure translations and does not incorporate a more complex affine motion model. In this work we propose the combination of weak and simple classifiers with the optical flow constraint, which naturally yields a linear program to determine the update of the motion parameters (see Section 2.2). Our underlying objective function to be minimized is a sum of hinge loss functions, which is more robust to outliers than the often used sum of squared differences. In combination with the underlying pixel-pair features, which are invariant with respect to gain changes, we consider our tracking approach very robust to a large range of illumination changes and occlusions.
2.1 Object Representation
The learned object representation is an ensemble of weak classifiers hi, each associated with a pixel pair, and the corresponding voting weights αi > 0. A weak classifier hi is represented by two pixel positions, p_1^i and p_2^i, and an intensity difference threshold bi. The response of the classifier hi for a given image I and motion parameters A is then

h_i(I, A) = \operatorname{sgn}\bigl( I(A(p_1^i)) - I(A(p_2^i)) - b_i \bigr) \in \{+1, -1\},   (2)

where A(p) denotes the original position p transformed by the motion hypothesis A. Thus, the weak classifier returns a positive response if the intensity difference of the two pixels is larger than a learned threshold. Combining the weak classifiers results in a strong classifier,

\sum_i \alpha_i\, h_i(I, A).   (3)

Generally, the location of the tracked object can be determined by maximizing the response of this strong classifier over A for the provided image I:

A = \max_A \sum_i \alpha_i\, h_i(I, A),   (4)
which is a highly non-convex optimization task and therefore hard to solve. Exhaustive sampling is only feasible for low-dimensional parameter spaces for A, e.g. when searching only for translational motion. In order to allow affine motion tracking at a suitable speed, we use an optical flow based search instead of an exhaustive one.
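For illustration, the pixel-pair classifiers of Eqs. (2)-(4) can be implemented in a few lines (our sketch; nearest-neighbor pixel lookup is used instead of interpolation, and the exhaustive maximization over A is deliberately omitted, since the paper replaces it by the optical flow based search).

# Pixel-pair weak classifier and strong-classifier response (illustrative).
import numpy as np

def weak_response(I, A, p1, p2, b):
    """h_i(I, A) = sgn(I(A p1) - I(A p2) - b); p1, p2 are homogeneous 3-vectors,
    A is the 2x3 affine matrix."""
    x1, y1 = A @ p1
    x2, y2 = A @ p2
    diff = float(I[int(round(y1)), int(round(x1))]) - float(I[int(round(y2)), int(round(x2))])
    return 1.0 if diff - b > 0 else -1.0

def strong_response(I, A, pairs, thresholds, alphas):
    """Sum_i alpha_i h_i(I, A) over the ensemble."""
    return sum(a * weak_response(I, A, p1, p2, b)
               for (p1, p2), b, a in zip(pairs, thresholds, alphas))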
2.2 Displacement Estimation through Linear Programming
In this section we combine the weak classifiers derived from pixel pairs with the optical flow constraint. The obtained formulation allows the optimization of the full affine tracking parameters. The current tracking hypothesis comprising the position and orientation of the tracked object is represented by the following affine matrix A,

A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix}.   (5)

Thus, a (homogeneous) pixel position p in the reference image has the corresponding position A(p) = A · p in the current image. Recall from the last section that directly searching for optimal motion parameters A = max_A Σ_i α_i h_i(I, A) is generally not feasible for real-time or interactive performance due to its high degree of non-convexity. The non-convexity has two sources: first, the images seen as a 2D function are usually not convex. Second, the thresholding induced by the sign(·) function in the weak classifier definition introduces additional local minima. The first source of non-convexity is addressed by locally linearizing the image intensities, since tracking is typically applied on video sequences. Hence, we can assume small changes in A for consecutive frames, i.e. A^(t) = A^(t−1) + ΔA^(t) with ΔA^(t) small. This allows us to employ a first order Taylor approximation for the current image I^(t),

I^{(t)}(p^{(t)}) = I^{(t)}(p^{(t-1)} + \Delta p^{(t)}) \approx I^{(t)}(p^{(t-1)}) + \bigl\langle \Delta A^{(t)},\, \nabla_A I^{(t)}(p^{(t-1)}) \bigr\rangle,   (6)
where we use the shorter notations p^(t) = A^(t) · p and Δp^(t) = ΔA^(t) · p. Since the linearization of pixel intensities is only valid for a small neighborhood, the proposed tracking procedure is embedded in a coarse-to-fine approach. At this point, we incorporate the brightness constancy assumption typically utilized in optical flow determination. ∇_A I is the derivative of the image intensities with respect to the affine parameters a_kl, e.g.

\frac{\partial I}{\partial a_{11}} = \frac{\partial I}{\partial x}\, \frac{\partial x}{\partial a_{11}} = \frac{\partial I}{\partial x}\, x.   (7)
Obviously, the binary response functions h_i are not convex, and maximizing Σ_i α_i h_i is highly prone to getting stuck in local optima. The pixel intensity differences, h̃_i(I, A) := I(A · p_1^i) − I(A · p_2^i) − b_i, are linear and therefore convex, but optimizing Σ_i α_i h̃_i in conjunction with the Taylor approximation in Eq. 7
results in unbounded solutions (since the objective function is then linear with respect to the (unbounded) affine matrix parameters). Therefore, we employ a different approach by applying a hinge loss function on h̃_i, i.e. the new penalty function is now f_i(I, A) := max(0, −h̃_i(I, A)). Positive responses of a weak classifier do not contribute to the objective function, whereas negative responses are basically weighted according to their confidence. Note that the hinge loss function is a convex function, but not differentiable everywhere. Nevertheless, as derived in the following, the accumulated hinge loss function Σ_i α_i f_i(A, I^(t)) can be optimized by linear programming. Explicitly, the objective function is now

\sum_i \alpha_i f_i(A, I^{(t)}) = \sum_i \alpha_i \max\bigl( 0,\, -\tilde{h}_i(I^{(t)}, A) \bigr).   (8)

We introduce J_i^{(t)}(A) = -\tilde{h}_i(I^{(t)}, A) = I^{(t)}(A \cdot p_2^i) - I^{(t)}(A \cdot p_1^i) + b_i. J_i^{(t)}(A) is positive if the tracked parameters violate the respective pixel feature. Inserting the definition of h̃_i and incorporating the first order image intensity approximation (Eq. 7) yields

\sum_i \alpha_i f_i(A, I^{(t)}) = \sum_i \alpha_i \max\bigl( 0,\, J_i^{(t)}(A^{(t-1)}) + \bigl\langle \Delta A^{(t)},\, \nabla_A J_i^{(t)}(A^{(t-1)}) \bigr\rangle \bigr),   (9)

with ΔA containing the unknowns. This minimization task can be solved using linear programming by introducing new unknowns L_i (for the loss induced by the weak classifier i) with
L_i = \max\bigl( 0,\, J_i^{(t)}(A^{(t-1)}) + \bigl\langle \Delta A,\, \nabla_A J_i^{(t)}(A^{(t-1)}) \bigr\rangle \bigr)   (10)

for given ΔA. Minimizing Eq. 9 now reduces to the following equivalent linear program:

\min_{\Delta A,\, L_i} \; \sum_i \alpha_i L_i \quad \text{s.t.} \quad L_i \geq 0, \qquad L_i \geq J_i^{(t)}(A^{(t-1)}) + \bigl\langle \Delta A,\, \nabla_A J_i^{(t)}(A^{(t-1)}) \bigr\rangle,   (11)
which can be solved by standard linear programming methods. In order to obtain a meaningful solution even in the case that all L_i are zero, we slightly extend the linear program above:

\min_{\Delta A,\, L_i,\, A_{kl}} \; \sum_i \alpha_i L_i + \varepsilon \sum_{k,l} A_{kl} \quad \text{s.t.} \quad L_i \geq 0, \qquad L_i \geq J_i^{(t)}(A^{(t-1)}) + \bigl\langle \Delta A,\, \nabla_A J_i^{(t)}(A^{(t-1)}) \bigr\rangle, \qquad A_{kl} \geq \Delta a_{kl}, \qquad A_{kl} \geq -\Delta a_{kl}.   (12)
A_kl represents |Δa_kl|, hence the objective function includes an L1 penalty term on ΔA. ε is a small constant, which enforces zero updates if all weak classifiers are already satisfied by the preceding tracking parameters. If all h_i(I^(t), A^(t−1)) return a positive response (i.e. L_i = 0 for A^(t−1)), the obtained update ΔA is zero. A feasible initial solution for ΔA, A_kl and L_i is given by ΔA ≡ 0, A_kl ≡ 0, and L_i determined by Eq. 10, respectively. Further, all α_i are positive and the L_i are bounded from below, hence the optimal objective value is always bounded, too. Thus, a bounded optimal solution is always obtained. The update ΔA^(t) for the affine parameters is the returned optimal solution for ΔA, and the new affine parameters are set to A^(t) = A^(t−1) + ΔA^(t).
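The linear program of Eq. (12) maps directly onto a standard LP solver; the following sketch (our own formulation of the same LP, not the authors' solver) uses scipy.optimize.linprog with the variable layout [Δa_11 ... Δa_23, L_1 ... L_N, A_11 ... A_23]. The value of ε shown here is an assumption.

# Solve the LP of Eq. (12) for the affine update dA (illustrative).
import numpy as np
from scipy.optimize import linprog

def lp_update(J, G, alphas, eps=1e-3):
    """J: (N,) responses J_i(A^(t-1)); G: (N, 6) gradients of J_i w.r.t. the
    affine parameters; alphas: (N,) voting weights. Returns dA as a (6,) vector."""
    N = len(J)
    c = np.concatenate([np.zeros(6), alphas, eps * np.ones(6)])

    # L_i >= J_i + <dA, G_i>   <=>   G_i . dA - L_i <= -J_i
    A1 = np.hstack([G, -np.eye(N), np.zeros((N, 6))])
    b1 = -np.asarray(J)
    # A_kl >= dA_kl  and  A_kl >= -dA_kl
    A2 = np.hstack([np.eye(6), np.zeros((6, N)), -np.eye(6)])
    A3 = np.hstack([-np.eye(6), np.zeros((6, N)), -np.eye(6)])

    bounds = [(None, None)] * 6 + [(0, None)] * N + [(0, None)] * 6
    res = linprog(c,
                  A_ub=np.vstack([A1, A2, A3]),
                  b_ub=np.concatenate([b1, np.zeros(12)]),
                  bounds=bounds)
    return res.x[:6]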
2.3 Online Selection of Hypothesis
The tracking step has been formulated in a general way, meaning that any ensemble learning algorithm (e.g. [9,5,6,11]) can be applied for constructing the strong classifier and thereby bring a meaningful variety into the selected set of weak classifiers. Note that the learning algorithm itself is not the focus of this work, which instead investigates the fusion of ensemble classifiers with optical flow based tracking. In our case we have decided for the integration of an online boosting variant [6] which has recently also been applied to tracking applications [7].

Algorithm 1. Ensemble Classifier based Optical Flow Tracking
Input: Frames I^(1), ..., I^(n) and object region r^(1)
1. Initialization:
   (a) Repeat for each pyramid level:
       i. Initialize strong classifier H(x) = Σ_i α_i h_i(I, E) with α_i = 1
2. For each new frame I^(t) do:
   (a) Repeat for each pyramid level:
       i. Determine change in affine parameters ΔA:  min_{ΔA, L_i, A_kl}  Σ_i α_i L_i + ε Σ_{k,l} A_kl
       ii. Update affine parameters: A ← A + ΔA
   (b) Repeat for each pyramid level:
       i. Generate random affine warps
       ii. Apply online boosting for updating strong classifier H(x)

Due
to the usage of an online classifier we perform updates during tracking with positive and negative samples. For this purpose, we generate a set of random affine transformations for making negative updates while the tracked region is
assumed to be a positive sample. Movements larger than a few pixels result in large linearization errors and typically lead to tracking failures. To make the approach more robust against larger movements, we incorporate a pyramidal representation of classifiers, which are then used on the respective levels for displacement estimation, see Algorithm 1. Therefore, on each image pyramid level a separate classifier is initialized for the target region. The movement is estimated at each pyramid level, starting from the lowest level, which then provides the initial guess for the next level. Even if subsampling introduces errors, coarser levels of the pyramid provide a good estimate of the initial position for the next level and thus help to tackle the problem of larger movements. Furthermore, this increases robustness against image noise as well as reflectance or lighting changes. The upper bounds for the parameters of the motion model are identical for all levels.
3
Experiments
For tracking targets with the presented approach we manually selected the target object to be tracked. Afterwards pairs of pixels are randomly chosen over the target region. Different criteria for selecting pixels have been tested (i.e. points on edges or near edges), however it turned out, that random initialization is a good choice since it has the advantage not to limit the approach to certain object appearances. As mentioned before for each layer of the pyramid an individual classifier is trained to perform online feature selection. The strong classifier complexity has been set to 150 weak classifiers whereas the proposed
(a) Frame 8
(b) Frame 68
(c) Frame 100 (d) Frame 147 (e) Frame 200
Fig. 1. This sequence is available at http://esm.gforge.inria.fr/ESMdownloads.html and has been chosen in order to demonstrate the ability to track affine motions. The upper row presents the individual frames, whereas the target itself is depicted in more detail below. The approach achieves accurate localization of the target over the whole sequence.
feature pool provided 600 possible pixel comparisons. In addition the affine parameters between consecutive frames has been limited to a factor of 0.03 whereas the translational components do not exceed 3 pixels for a single layer (note that the settings of these parameters is not critical). The approach has been implemented in Matlab and runs with about 1 fps. Several sequences have been used for testing the approach. The first one presented within this section is shown in Figure 1. The object is accurately localized throughout the whole sequence. Figure 2 presents a sequence containing a higher amount of artifacts. Illumination changes as well as rather large inter frame motions are handled by the approach in a robust manner and the affine parameters are correctly estimated. The approach has also been tested on sequences which
(a) Frame 26
(b) Frame 37
(c) Frame 57
(d) Frame 91
(e) Frame 123
Fig. 2. Throughout the sequence appearance artifacts of the target object are present. Illumination changes as well as spots are handled by the approach and do not confuse the tracker when estimating the affine parameters.
(a) Frame 26
(b) Frame 37
(c) Frame 57
(d) Frame 91
(e) Frame 123
Fig. 3. For this sequence a simple translational warping function has been used, since the target region is of very low resolution. Again strong changes in appearance are present and successfully handled by the approach.
(a) Frame 1
(b) Frame 57
(c) Frame 111
(d) Frame 200
Fig. 4. Comparison of a simple SSD tracker (top) and our approach (bottom). Due to the specific loss function of our approach and the possibility to limit the warping parameters, the approach turns out to be more robust than standard methods.
are typical for the purpose of autonomous driving, see Figure 3. In particular, it shows a car entering a tunnel, where the shuttering of the camera complicates the tracking task. Due to the low resolution and the simple requirement of localization within image space, the affine warping function has been replaced by a translational model. As depicted in the lower row of the sequence, drastic appearance changes are handled. A comparison to a simple SSD tracker is presented in Figure 4, where the SSD measure accounts for the same set of pixels that is considered by our approach. The explanation for the higher robustness of our approach against these artifacts is simply the type of features in combination with the robust loss function used within the approach.
4
Conclusion
The paper has presented an approach which makes it possible to apply ensemble classification techniques for the purpose of object tracking with motion models. Simple decisions resulting from intensity tests on pixel pairs are used as weak hypotheses, which are finally combined into a strong classifier. The specific type of hypothesis allows us to fuse classical optical flow based tracking with an ensemble classifier and thereby makes it possible to use these classification techniques for displacement estimation with more complex motion models. The estimation of the parameters of the warping function between successive frames is formulated as a linear program. The loss function used within the optimization step shows robustness, which is demonstrated on several challenging sequences.
References 1. Avidan, S.: Support vector tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(8), 1064–1072 (2004) 2. Avidan, S.: Ensemble tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(2), 261–271 (2007) 3. Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1996) 4. Dietterich, T.G.: Ensemble methods in machine learning. In: Proceedings International Workshop on Multiple Classifier Systems, pp. 1–15 (2000) 5. Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139 (1997) 6. Grabner, H., Bischof, H.: On-line boosting and vision. In: Proceedings Conference Computer Vision and Pattern Recognition, vol. 1, pp. 260–267 (2006) 7. Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via on-line boosting. In: Proceedings British Machine Vision Conference, vol. 1, pp. 47–56 (2006) 8. Jurie, F., Dhome, M.: A simple and efficient template matching algorithm. In: Proceedings International Conference on Computer Vision, vol. 2, pp. 544–549 (2001) 9. Kolter, J.Z., Maloof, M.A.: Dynamic weighted majority: A new ensemble method for tracking concept drift. In: Proceedings International Conference on Data Mining, pp. 123–130 (2003) 10. Lepetit, V.: Keypoint recognition using randomized trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(9), 1465–1479 (2006) 11. Littlestone, N., Warmuth, M.K.: The weighted majority algorithm. Information and Computation 108(2), 212–261 (1994) 12. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings International Joint Conference on Artificial Intelligence, pp. 674–679 (1981) 13. Matas, J., Hilton, A., Zimmermann, K., Svoboda, T.: Learning efficient linear predictors for motion estimation. In: Indien Conference on Computer Vision, Graphics and Image Processing, pp. 445–456 (2006) ¨ 14. Ozuysal, M., Lepetit, V., Fleuret, F., Fua, P.: Feature harvesting for tracking-bydetection. In: Proceedings European Conference on Computer Vision, vol. 3, pp. 592–605 (2006) 15. Pele, O., Werman, M.: Robust real time pattern matching using bayesian sequential hypothesis testing. IEEE Transaction Pattern Analysis and Machine Intelligence (to appear, 2008) 16. Tian, M., Zhang, W., Liu, F.: On-line ensemble svm for robust object tracking. In: Proceedings Asian Conference on Computer Vision, pp. 355–364 (2007) 17. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings Conference Computer Vision and Pattern Recognition, vol. I, pp. 511–518 (2001) 18. Williams, O., Blake, A., Cipolla, R.: Sparse bayesian learning for efficient visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1292–1304 (2005) 19. Zabih, R., Woodfill, J.: Non-parametric local transforms for computing visual correspondence. In: Proceedings European Conference on Computer Vision, vol. 2, pp. 151–158 (1994)
A Comparison of Region Detectors for Tracking
Andreas Haja¹,², Steffen Abraham², and Bernd Jähne¹
¹ Interdisciplinary Center for Scientific Computing, University of Heidelberg, Speyerer Straße 4-6, 69115 Heidelberg, Germany, {andreas.haja,bernd.jaehne}@iwr.uni-heidelberg.de
² Robert Bosch GmbH, Corporate Research, 31139 Hildesheim, Germany, [email protected]
Abstract. In this work, the performance of five popular region detectors is compared in the context of tracking. Firstly, conventional nearest-neighbor matching based on the similarity of region descriptors is used to assemble trajectories from unique region-to-region correspondences. Based on carefully estimated homographies between planar object surfaces in neighboring frames of an image sequence, both their localization accuracy and length, as well as the percentage of successfully tracked regions, are evaluated and compared. The evaluation results serve as a supplement to existing studies and facilitate the selection of appropriate detectors suited to the requirements of a specific application. Secondly, a novel tracking method is presented, which integrates for each region all potential matches into directed multi-edge graphs. From these, trajectories are extracted using Dijkstra's algorithm. It is shown that the resulting localization error is significantly lower than with nearest-neighbor matching while at the same time, the percentage of tracked regions is increased. Keywords: affine-covariant region detection, region tracking, graph traversal, performance evaluation.
1 Introduction
In this work, the performance of five state-of-the-art affine-covariant region detectors is compared in the context of tracking. Firstly, based on the similarity of SIFT-descriptors [1], conventional nearest-neighbor matching is used to generate unique region-to-region correspondences for each detector between adjacent frames of an image sequence. From these, trajectories are assembled by following each region from its first frame of appearance to the last. Based on carefully estimated homographies between planar object surfaces in neighboring frames of an image sequence, both their localization accuracy and length as well as the percentage of successfully tracked regions are evaluated. To the best knowledge of the authors, a detailed comparison of region detectors with regard to these properties is presented for the first time, serving as a supplement to existing studies such as [2]. There, detectors are analyzed with regard to the
percentage of successfully matched regions under several geometric and photometric transformations of the observed scenes. While Mikolajczyck et al. only considered planar scenes and objects, Moreels and Perona [3] extended their analysis to a large set of arbitrarily-shaped 3D-objects, but also without regard to localization accuracy or tracking. Secondly, a new tracking method is proposed which integrates for every region all possible match candidates into a set of track graphs, which is gradually constructed by traversing through an image sequence. Relations (edges) between corresponding regions (nodes) are weighted using a combination of descriptor distance and the well-known path coherence model by Sethi and Jain [4], which is further extended by a term that evaluates and penalizes the relative changes of region scale. Single trajectories (paths) are iteratively extracted from each graph by using Dijkstra’s method [5]. It will be shown, that tracking accuracy can be significantly improved compared to the traditional method based on nearest-neighbor matching, while at the same time the number and length of trajectories is increased. The investigated detectors are: the edge-based (EBR) and intensity-based (IBR) region detectors by Tuytelaars and Van Gool [6], the maximally stable extremal regions detector (MSER) by Matas et al.[7], and both Harris- and Hessian-affine detectors (HARAFF & HESAFF) introduced by Mikolajczyck and Schmid [8]. For all experiments, the original implementations1 of the authors with their respective default parameters are used. This paper is organized as follows: In section 2, the evaluated image sequences are introduced. Section 3 describes in detail the new graph-based tracking method while section 4 contains a detailed comparison of nearest-neighbor and graph-based tracking. Finally, section 5 gives a concluding summary and an outlook on future work.
2 The Image Data Set
Figure 1 shows the first and last frame of the image sets used for the evaluation. All three sequences have been selected from the database made available by Moreels and Alatorre2 and consist of 10 equal-sized images of 864 × 1444 pel. The observed objects have been positioned on a rotating dish which moves with an increment of 5 degrees between adjacent frames. All three objects are piecewise planar so that correspondences can be related by homographies Hi,i+1, where i indicates the frame number. In order to determine the accuracy of tracked regions as shown in section 4, Hi,i+1 is estimated for each frame pair according to the method described in [2] with a Root Mean Square error of less than 1 pel. The object surfaces mainly show homogeneous regions with distinctive edges and repeating artificial patterns, but also natural textures such as the cereals in the pops-sequence or the faces in the dvd-sequence.
1 http://www.robots.ox.ac.uk/~vgg/research/affine
2 http://www.vision.caltech.edu/pmoreels/Datasets/TurntableObjects
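The homographies Hi,i+1 above are estimated with the method of [2]; that procedure is not reproduced here. Purely as an illustration of how such a mapping can be obtained from matched region centers, a minimal normalized DLT estimate in Python might look as follows (the function names are illustrative and, unlike the careful estimation used by the authors, no outlier handling is included):

import numpy as np

def normalize_points(pts):
    """Shift to zero mean and scale to mean distance sqrt(2) (Hartley normalization)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0, -scale * mean[0]],
                  [0, scale, -scale * mean[1]],
                  [0, 0, 1]])
    pts_h = np.column_stack([pts, np.ones(len(pts))])
    return (T @ pts_h.T).T, T

def estimate_homography(src, dst):
    """DLT estimate of H with dst ~ H @ src; src, dst are (N, 2) arrays, N >= 4."""
    src_n, T_src = normalize_points(np.asarray(src, float))
    dst_n, T_dst = normalize_points(np.asarray(dst, float))
    A = []
    for (x, y, _), (u, v, _) in zip(src_n, dst_n):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H_n = Vt[-1].reshape(3, 3)
    H = np.linalg.inv(T_dst) @ H_n @ T_src   # undo the normalization
    return H / H[2, 2]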
Fig. 1. Top: First and last image of the evaluated sequences. Carefully estimated homographies between neighboring frames enable the assessment of tracking accuracy. Bottom: Ellipses indicate randomly selected regions provided by each detector for an image patch taken from the first frame of the carton-sequence.
3 A Graph-Based Method for Region Tracking
3.1 Descriptor-Based Region Association
The definition of a region varies between detectors: Both HARAFF and HESAFF use the second moment matrix of the intensity gradient [9] to derive an elliptic region description. EBR uses parallelograms derived from the edge geometry of the surrounding neighborhood, IBR regions are based on intensity profiles along rays emanating from an intensity extremum, and MSER returns the boundaries of watershed-like regions. In order to enable meaningful comparisons between the different detectors, each region is replaced by an (approximated) ellipse [2], if necessary. In order to find corresponding regions between a reference frame i and a target frame i + 1, three descriptor-based matching techniques can be used [10]. All three methods compare each descriptor dm,i with each descriptor dn,i+1 whose associated region n lies within a gating distance dG around the position of region m. The parameter dG reflects the maximally expected object motion between two frames and has been set to 5% of the image width in all experiments. Throughout this work, the well-known SIFT-method [1] has been used for region description, which represents affine-normalized intensity values by a three-dimensional histogram of gradient locations and orientations. In [3] and [10], the SIFT-descriptor proved its robustness to various geometric and photometric transformations as well as to errors in region detection. In the case of simple threshold-based matching, two regions m and n are assigned to each other if the Euclidean distance between their descriptors is below a threshold:

dm,n = ‖dm − dn‖,   dm,n ≤ dmax   (1)
Depending on dmax, on the density of regions in the target frame and on the size of the gating area, region m might have several matches. In the case of nearest-neighbor matching, two regions are uniquely assigned to each other if dm,n is below a threshold and at the same time dm and dn are nearest neighbors in descriptor space among all match candidates. A third commonly used matching strategy – descriptor uniqueness – applies thresholding to the ratio of descriptor distances between nearest neighbor and second-nearest neighbor to dm in descriptor space. The results are similar to the previous method, except that regions which have a second similar match are additionally penalized [10].
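As a rough sketch of the gating and matching strategies described in this section, the following Python fragment implements threshold-based and nearest-neighbor association; the array layout and the brute-force search are assumptions of this sketch, not details from the paper:

import numpy as np

def match_regions(centers_i, desc_i, centers_j, desc_j,
                  d_gate, d_max, nearest_neighbor=True):
    """Associate regions of frame i with regions of frame i+1.

    centers_*: (N, 2) arrays of region centers, desc_*: (N, 128) SIFT descriptors.
    d_gate limits the search to the expected object motion (5% of the image
    width in the paper), d_max thresholds the descriptor distance.
    Returns a list of (m, n) index pairs.
    """
    matches = []
    for m, (cm, dm) in enumerate(zip(centers_i, desc_i)):
        # gating: only regions within d_gate around region m are candidates
        in_gate = np.where(np.linalg.norm(centers_j - cm, axis=1) <= d_gate)[0]
        if len(in_gate) == 0:
            continue
        dists = np.linalg.norm(desc_j[in_gate] - dm, axis=1)
        if nearest_neighbor:
            best = np.argmin(dists)
            if dists[best] <= d_max:
                matches.append((m, int(in_gate[best])))
        else:
            # threshold-based matching keeps every candidate below d_max
            for k in np.where(dists <= d_max)[0]:
                matches.append((m, int(in_gate[k])))
    return matches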
3.2 Track Graph Construction
Tracking methods based on nearest-neighbor matching or descriptor uniqueness produce unique region-to-region correspondences between all adjacent frame pairs of a sequence. Thus, every region has exactly one incoming and/or one outgoing link which can be followed from the first frame of appearance to the last frame and assembled into a trajectory. If several similar regions exist within dG as shown in figure 2 (note that for region 3 in the second frame, there are 2 corresponding regions in the first frame), matching based on nearest-neighbors can be unreliable due to insufficiently discriminatory descriptors. If descriptor uniqueness were used instead, ambiguous candidates might be rejected entirely. For the evaluation in section 4, only nearest-neighbor matching has been used for tracking. In the remainder of this paper, this method is referred to as local tracking because unique correspondences are found by considering only a single frame pair. The new method proposed in this publication works by postponing the decision on a specific correspondence until the entire image sequence has been processed using threshold-based matching: Instead of discarding correspondences on the basis of ambiguous and thus unreliable local information, all candidates are kept and the relations between them are modeled into a directed multi-edge track graph.
Fig. 2. Example of a track graph existing in three frames. Regions not active in the current frame are grayed, rejected correspondences have a dotted line. Left+Middle: With regard to the cost function from eq. 3, the correspondences (1, 3) and (3, 5) are preferred over (2, 3) and (4, 5) respectively, based solely on descriptor similarity. Right: Using the extended cost function from eq. 6, the (correct) path (2, 3, 5) is chosen as final trajectory.
A track graph G is defined as a set of nodes (regions) N and edges (relations) E between corresponding regions. It is represented as an adjacency list A where each entry A(m) for every region m contains the corresponding list of matches n:

A(m) = {n ∈ N | (m, n) ∈ E}   (2)
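A plain dictionary keyed by (frame, region) pairs is one possible realization of the adjacency list of Eq. (2); the following sketch builds the multi-edge track graph from the threshold-based candidate matches of all adjacent frame pairs (the matcher is passed in as a callable and is an assumption of this sketch):

from collections import defaultdict

def build_track_graph(frames, match_fn):
    """frames: list of per-image region data; match_fn(data_i, data_j) returns
    all (m, n) candidate pairs with descriptor distance below d_max.
    Returns the adjacency list A with A[(i, m)] = [(i+1, n), ...]  (Eq. 2)."""
    A = defaultdict(list)
    for i in range(len(frames) - 1):
        for m, n in match_fn(frames[i], frames[i + 1]):
            # the decision on a unique correspondence is postponed; every
            # candidate edge is kept in the multi-edge graph
            A[(i, m)].append((i + 1, n))
    return A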
3.3 Graph Traversal
After construction, each track graph G may contain several trajectories which have to be extracted under the assumption that each region only has one unique match. A trajectory is defined as a path between a start node and an end node of G (i.e., nodes that do not have incoming or outgoing edges). A transition between two nodes m and n from frame i to frame i + 1 is further related with a cost function which, in its simplest form, reflects only the respective descriptor distance:

c(mi, ni+1) = 1 − dm,n   (3)

During graph traversal, the cost of all edge transitions is accumulated and the path with the highest weight is selected as the optimal and thus most likely path. If a graph contains several start and end nodes, traversal is performed for all plausible combinations. From the set of all successful traversals, the one with the highest accumulated cost is chosen as the final trajectory tf. Path search is performed using Dijkstra's algorithm [5], which solves the single-source optimal path problem for directed graphs with non-negative edge weights. Its idea is to store for each region m in a cost table C the maximal cost path found so far originating from a start node s, and in a lookup-table P the indexes of the previously visited nodes. The basic operation of the algorithm is edge relaxation: If there is an edge from a node mi in frame i to node ni+1 in frame i + 1 which has the highest weight among all other edges, then the best known path from s to mi can be extended to a path from s to ni+1. In C, node ni+1 will be assigned the sum of the cost of the old path and the edge weight c(mi, ni+1) for the new transition. Additionally, mi will be stored as its predecessor in P. If an end node e is reached, the optimal path can be easily reconstructed by following the entries of P back to s. The resulting overall cost is found from a simple table lookup in C(e). Under the assumption that a specific region can be assigned to a single trajectory only, all nodes in tf are finally removed from G. Among the remaining nodes and edges in G, dependencies are then analyzed anew in order to identify independent sub-graphs. These are processed again iteratively until all nodes have either been assigned to trajectories or isolated.
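Because all edges of a track graph point from frame i to frame i + 1, the single-source maximum-cost search described above can be realized by relaxing the edges in frame order, which is equivalent to Dijkstra's algorithm on this acyclic graph. The node encoding and the generic edge_cost callable in the following sketch are illustrative assumptions:

def best_path(A, start, edge_cost):
    """Extract the maximum accumulated-cost path from `start` through the
    adjacency list A (as built above); nodes are (frame, region) tuples."""
    C = {start: 0.0}       # best accumulated cost per node
    P = {start: None}      # predecessor lookup table
    # relax edges in frame order, so every edge is considered exactly once
    for node in sorted(A.keys(), key=lambda fr: fr[0]):
        if node not in C:
            continue
        for succ in A[node]:
            new_cost = C[node] + edge_cost(node, succ)
            if new_cost > C.get(succ, float("-inf")):
                C[succ] = new_cost
                P[succ] = node
    # end nodes are reachable nodes without outgoing edges
    end = max((n for n in C if n not in A or not A[n]), key=lambda n: C[n])
    path, n = [], end
    while n is not None:
        path.append(n)
        n = P[n]
    return C[end], path[::-1]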
3.4 Path Coherence Function
Experiments have shown that, in cases of high region density and low discriminatory power of the descriptor set, the number of nodes in a single graph quickly increases, leading to complex and thus costly traversals. As will be seen in section 4.2, this is especially a problem for EBR-regions.
Also, spurious matches may occur as illustrated in figure 2 (left), where c(11, 32) > c(21, 32). A solution to these problems is to extend the cost function in equation 3 by a term which exploits the fact that, due to inertia, the motion of observed objects cannot change instantaneously, given a sufficient frame rate. This assumption, often referred to as path coherence, has previously been used effectively in tracking applications [4][11][12]. Based on the center positions p of three regions l, m and n in consecutive frames, path coherence is computed as

m = 1 − w1 (1 − ((pl,i−1 − pm,i) · (pm,i − pn,i+1)) / (‖pl,i−1 − pm,i‖ ‖pm,i − pn,i+1‖))
      − w2 (1 − 2 √(‖pl,i−1 − pm,i‖ ‖pm,i − pn,i+1‖) / (‖pl,i−1 − pm,i‖ + ‖pm,i − pn,i+1‖))   (4)

where w1 and w2 are weights for controlling the contributions of direction and velocity changes respectively. Generally, p is defined as the center of the associated ellipse. In this publication, the original function in equation 4 is extended to respond to changes of region scale as well. Similar to the velocity term controlled by w2, changes in scale s (which is the square root of the product of major and minor axis of the associated ellipse) are penalized proportionally to the associated region area. This leads to the extended path coherence function

me = m − w3 (1 − 2 sl,i−1 s²m,i sn,i+1 / (s²l,i−1 s²n,i+1 + s⁴m,i))   (5)

The new term penalizes changes of the relative region scale, and thus takes advantage of the additional shape information supplied by the detectors. The major benefit of the new expression is a reduction of the number of graph nodes and thus of computational complexity. Using equation 5, the extended cost function is defined as

c(li−1, mi, ni+1) = (1 − (dl,m + dm,n)/2) · me   (6)

If a node has several incoming and outgoing edges, the response of the coherence function me is computed for all possible combinations, favoring the transition with the smallest deviation in velocity, direction and scale.
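A direct transcription of Eqs. (4)–(6) in Python, assuming each region is represented by its ellipse center p and scale s; the default weights and the small epsilon guarding against zero-length displacements are placeholders of this sketch, not values from the paper:

import numpy as np

def extended_coherence(p_l, p_m, p_n, s_l, s_m, s_n,
                       w1=0.1, w2=0.1, w3=0.1, eps=1e-9):
    """Path coherence of Eq. (4) extended by the scale term of Eq. (5)."""
    a = np.asarray(p_m, float) - np.asarray(p_l, float)   # displacement i-1 -> i
    b = np.asarray(p_n, float) - np.asarray(p_m, float)   # displacement i -> i+1
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    direction = 1.0 - np.dot(a, b) / (na * nb + eps)
    velocity = 1.0 - 2.0 * np.sqrt(na * nb) / (na + nb + eps)
    m = 1.0 - w1 * direction - w2 * velocity                         # Eq. (4)
    scale = 1.0 - 2.0 * s_l * s_m**2 * s_n / (s_l**2 * s_n**2 + s_m**4)
    return m - w3 * scale                                            # Eq. (5)

def transition_cost(d_lm, d_mn, m_e):
    """Edge cost of Eq. (6); d_lm, d_mn are descriptor distances in [0, 1]."""
    return (1.0 - 0.5 * (d_lm + d_mn)) * m_e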
4 Evaluation
4.1 Performance Measures
The localization error is defined as the absolute Euclidean distance

el = ‖Hi,i+1 pmi − pni+1‖   (7)
where pmi and pni+1 are the positions of the corresponding region centers and Hi,i+1 denotes the estimated homography relating the object surfaces in both frames to each other.
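Evaluating Eq. (7) amounts to mapping the region center through the homography in homogeneous coordinates; a short sketch with illustrative variable names:

import numpy as np

def localization_error(H, p_m, p_n):
    """Eq. (7): distance between the mapped center of region m (frame i)
    and the center of its matched region n (frame i+1)."""
    x, y = p_m
    u, v, w = H @ np.array([x, y, 1.0])
    return float(np.hypot(u / w - p_n[0], v / w - p_n[1]))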
[Figure 3: one panel per detector (EBR, IBR, MSER, HARAFF, HESAFF), plotting the localization error el for local and graph-based tracking against the descriptor distance d.]
Fig. 3. The diagrams show the median localization error el,50 (diamond markers), the 25%- and 75%-percentiles (solid lines) and the 5%- and 95%-percentiles (dashed lines) estimated from all sequences. For EBR, dmax has been limited to the interval 0.1 ≤ dmax ≤ 0.35 in order to contain the complexity of the resulting track graphs.
In figure 3, el is shown for both local tracking and graph-based tracking. Evaluation has been performed for 9 settings of the maximum descriptor distance dmax (according to equation 1) for the image sequences introduced in section 2. For EBR-regions, the maximum descriptor threshold has been limited to dmax ≤ 0.35 due to a strong increase in the number of nodes and edges for higher settings. This aspect will be investigated more closely in section 4.2. For local tracking, EBR exhibits the worst accuracy while MSER clearly performs best with 95% of all regions below el = 2.5 pel for all settings of dmax. Between HARAFF and HESAFF, differences are marginal while IBR performs slightly better. For graph-based tracking, the path coherence weights in equation 5 have been set such that the maximum direction change was limited to 30° while the maximum and minimum permissible change in both center velocity and relative scale have been bounded to 2.0 and 0.5 respectively. While the relative ordering of the detectors is identical to local tracking, the overall el is clearly smaller, especially with EBR and IBR. Again, MSER performs best, although no improvement over local tracking is observed. Generally, detectors that provide large regions are more prone to localization errors: Given two regions with identical geometrical area overlap (and thus similar descriptor distance) but of different scales, the distance between the region centers (and thus el) for the large region will be higher than for the smaller one. This relation partially explains the high localization error of EBR-regions and the general difference between local and graph-based tracking. Table 1 gives the 50-, 75- and 95-percentiles of the distribution of all region scales. In [2], the authors used the repeatability score as a measure for matching performance, defined as the ratio of the number of correspondences and the smaller of the number of regions in a pair of images. For tracking performance, a similar measure is introduced in this paper: The tracking score pt measures the number of successfully tracked regions in a sequence, normalized on the total number of regions. A region m is counted as tracked if it has been assigned to a trajectory
Table 1. In addition to the number of regions and graphs for all sequences, this table shows the percentage (and absolute numbers) of successfully tracked features pt as well as the 50-, 75- and 95-percentiles of the distribution of both trajectory lengths tl and region scales s

                       EBR            IBR            MSER           HARAFF         HESAFF
regions                37906          19357          8065           18022          47763
local tracking
  pt                   26.2% (9931)   22.9% (4433)   58.3% (4701)   27.1% (4884)   40.5% (19344)
  tl,50/75/95          4/5/8          4/5/9          6/9/10         4/5/7          4/5/8
  s50/75/95            37/48/85       30/41/79       19/30/65       17/22/50       17/21/42
graph-based tracking
  graphs               3756           1488           908            1860           5590
  pt                   36.3% (13760)  35.8% (6930)   74.2% (5984)   34.4% (6200)   48.5% (23165)
  tl,50/75/95          4/5/10         4/5/9          6/9/10         4/5/8          4/6/10
  s50/75/95            32/41/75       21/30/58       18/30/72       16/20/40       16/20/36
(disregarding its length) and if the geometric area overlap error is less than 50%. This setting has been chosen in accordance with the relevant literature [2]. In this way, spurious correspondences which do not comply with Hi,i+1 do not contribute positively to pt. For all detectors, the percentage of such outliers stays below 1% (for dmax = 0.35). In table 1, pt is given for each detector along with the total number of detected regions from all sequences. In the case of graph-based tracking, pt is generally higher for all detectors. Especially for MSER-regions, the difference to local tracking is most significant. Given the comparatively low number of detected regions, the high tracking score levels out the difference in absolute numbers to IBR and HARAFF. The highest number of tracked regions is clearly provided by HESAFF, with a difference of almost 70% to the second-best detector EBR. While the tracking score only reflects whether a region has been successfully tracked, it does not take into account the length of the trajectory of which it is part. To this end, table 1 also shows the 50-, 75- and 95-percentiles of the distribution of trajectory lengths tl (also for dmax = 0.35). As before, outliers with an area overlap error higher than 50% are removed from the evaluation and the respective trajectories are split up accordingly so that spurious correspondences do not contribute positively to tl. Generally, differences in tl between both methods are small. Only in the 95-percentiles of EBR, HARAFF and HESAFF is a slight advantage of the graph-based approach visible. Once again, MSER-regions perform best, providing the longest trajectories among all detectors.
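The tracking score and the trajectory-length percentiles of Table 1 reduce to simple counting; a simplified sketch, assuming trajectories are given as lists of region identifiers and that the 50% area-overlap test is available as a callable (the splitting of trajectories at outliers is omitted here):

import numpy as np

def tracking_statistics(trajectories, n_regions, overlap_ok):
    """trajectories: list of trajectories, each a list of region identifiers.
    overlap_ok(region): True if the region's area overlap error is below 50%.
    Returns the tracking score p_t and the 50/75/95-percentiles of the
    trajectory lengths."""
    if not trajectories or n_regions == 0:
        return 0.0, (0.0, 0.0, 0.0)
    tracked = [r for t in trajectories for r in t if overlap_ok(r)]
    p_t = len(tracked) / float(n_regions)
    lengths = [len(t) for t in trajectories]
    t50, t75, t95 = np.percentile(lengths, [50, 75, 95])
    return p_t, (t50, t75, t95)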
4.2 Graph Complexity Issues
The computational complexity of Dijkstra's algorithm on a graph with E edges and N nodes is O(E + N²), which is the best possible complexity for dense graphs. For sparse graphs, where E 1. This 'smear' of the prediction is justified by the expected precision of one disparity level. At the beginning of the estimation, t = 0, no optical flow estimation has been performed and thus no prediction for the first disparity estimation is applied. Please observe that the change in disparity is not estimated from a direct correspondence but from a neighborhood assumption; in this way, a warping of the optical flow estimates can be avoided. Moreover, a discrepancy of one disparity in the cross-validation of the prediction is tolerated.
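The construction of the prediction itself (previous disparity map, optical flow neighborhood and cross-validation) is only partially reproduced above. As a rough illustration of the overall mechanism (cost truncation, lowering the cost of predicted hypotheses, shiftable-window aggregation and a winner-takes-all decision), the following Python sketch may serve; the cost-volume layout, the interpretation of sP as the fraction by which the cost is reduced, and the box-plus-minimum-filter realization of the shiftable window are assumptions of this sketch, not details taken from the paper:

import numpy as np
from scipy.ndimage import uniform_filter, minimum_filter

def wta_disparity(cost, predicted=None, c_max=50.0, s_p=3.0, win=9, shift=5):
    """Local WTA disparity from a cost volume cost[d, y, x].

    predicted: optional boolean volume of the same shape marking
    (pixel, disparity) hypotheses supported by the temporal prediction;
    their matching cost is lowered before aggregation, so the shiftable
    window spreads the temporal evidence into a local neighborhood.
    """
    c = np.minimum(cost, c_max)                                 # cost truncation
    if predicted is not None:
        c = np.where(predicted, c * (1.0 - 1.0 / s_p), c)       # e.g. reduce by 1/3
    # shiftable-window aggregation: box filter followed by a minimum filter
    for d in range(c.shape[0]):
        agg = uniform_filter(c[d], size=win)
        c[d] = minimum_filter(agg, size=shift)
    return np.argmin(c, axis=0)                                 # winner takes all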
4 Evaluation
The algorithm described in the previous section and the one proposed in [8] use a cost reduction to introduce temporal observations into the disparity estimation. Since the cost adaptation is performed before the aggregation with a shiftable window, a neighborhood in the cost volume is affected. This way, predictions can help to make the matching less ambiguous and at the same time will have a smoothing effect. If no prediction is made, the algorithm falls back to the behavior of local WTA disparity estimation using a shiftable window aggregation. The disparity estimation procedures described in the previous section and the system presented in [8] were implemented to run on the GPU, using Nvidia's CUDA framework. In order to test the systems' performances on dynamic scenes, synthetic data was used, because hereby ground truth data is available. RGB color image and corresponding depth image sequences with 512 × 384 pixel resolution were generated, paying thorough attention to anti-aliasing. Synthetic noise was used to simulate lighting effects and system noise in a post-processing step. For this, normal distributions with zero mean and standard deviations
Fig. 1. First and last frame next to ground truth disparity maps from a synthetic image sequence. The camera rig was steadily moving towards the arc, capturing 50 images.
of σ ∈ {0, 6} were used to individually offset each color-channel value in each image. In the following, the observations made by simulating a forward moving camera are presented. Two exemplary images and ground truth disparity maps from the generated sequence are shown in fig. 1. The disparity search range was chosen to be [0, 39] hypotheses, while the optical flow displacement was searched in an area of [−4, 4] × [−4, 4] pixels. The change in disparity ΔD was expected to lie within 1 disparity level. This way, our approach had to search 121 hypotheses per pixel, while Gong's scheme had to evaluate 283 hypotheses per pixel. The cost truncation Cmax was set to 50, while a successful prediction reduced the cost by one third (sP = 3). The shiftable window realization used a 9 × 9 aggregation window and a 5 × 5 shift filter in the optical flow as well as in the disparity cost volume. The results of disparity estimation and prediction from the first 20 frames of the sequence are displayed in fig. 2. The left column shows the number of correctly estimated pixels relative to the overall number of pixels in the image. A pixel is considered to have a wrong estimate if it has been assigned a disparity that is more than 1 disparity away from the ground truth. Estimates that assign a disparity to an occluded pixel, or declare a pixel occluded although it was visible, are also considered to be wrong. The top graph shows the correct estimates without noise. The line without any symbol depicts the results of plain disparity estimation (without the usage of eq. (3)), while the line with rhombi represents the outcome of [8]. The crossed line is the result achieved with the scheme proposed in this paper. The scene's geometry and texture are advantageous for area-based disparity estimation and hence a lot of correct estimates are generated with the plain method. In the beginning of the sequence, a slightly higher error of the schemes with predictions is observed. This is caused by the smoothing at discontinuities induced by the cost reduction in combination with the used aggregation scheme. While the camera rig moves forward, the sampling of the geometry and its texture changes differently in the left and right view over frames. Beginning at frame 15, the windows above the arc start to have an ambiguous sampling in hypothesis space. The maximum error is observed at frame 17. The estimated disparity map for frame 17 resulting from Gong's scheme is shown in the top of the center column in fig. 2, while the disparity map generated with our proposal is shown next to it. The plain disparity estimation result is shown next to the ground-truth disparity map, which is the leftmost image in fig. 3. Without the smoothing effect of the prediction, the plain disparity estimation experiences the largest error. The explanation of the difference between Gong's scheme and our proposal
Fig. 2. Left column: Graphs show the development of correct estimates of the first 20 synthetic images at σ ∈ {0, 6}. Lines without any symbols represent the results with plain disparity estimation. Rhombi show the results of the scheme from [8]. Crosses display the outcome of the scheme proposed in this work. Middle column: Disparity maps (top) and corresponding prediction maps (below) for frame 17 calculated with [8]. Right column: Results achieved at frame 17 with the proposed scheme.
Fig. 3. Left: ground truth disparity map of the 17th frame. Middle: result of plain disparity estimation at σ = 0. Right: result of plain disparity estimation at σ = 6.
can be seen in the prediction maps, which are shown in fig. 2 below the disparity maps. Gong's scheme has fewer predictions on the floor and on the building front, because it discards more estimates in the cross-validation. Consequently, the prediction has a lower impact. In the same figures, the results of the disparity estimation on the image sequence with added noise are shown. It can be observed how the temporal estimation component "automatically" selects
Fig. 4. Estimation results on real data. Shown are the frames 1, 30, 60 and 90 (from top). The camera rig was steadily rotated to view in the top-right direction. Meanwhile, a person is moving towards the cameras. The left column shows the captured color intensity images. The second column from the left displays disparity estimates achieved with plain shiftable window aggregation. The middle columns show the prediction maps (left) and the disparity maps determined with [8]. Results achieved with the proposed scheme are shown in the right columns.
the observations above the noise level and stabilizes the disparity estimation by propagating this information in a local neighborhood via the shiftable window aggregation. Fig. 4 shows some exemplary results on real image data of 512 × 384 pixels. The camera was panning from bottom left to top right; in the meanwhile, a person was moving towards the camera. The captured sequence has many low-textured areas and is noisy. The same estimation parameters as for the synthetic sequence were used to generate the presented results. The differences between the different estimations are most prominent at the floor, the background, the chair and the person's head, where the temporal evidence supports the estimation. The corresponding prediction maps are also shown, in which the increased effectiveness due to the propositions of this paper is clearly visible.
5 Conclusion
Local winner-takes-all search algorithms for dense disparity estimation are among the fastest algorithms available for dense depth estimation. Because no global reasoning is applied, the key to a good estimation result is a proper accumulation of
evidence in the cost volume. In this work² it was shown how to incorporate time consistency evidence based on optical flow and previous disparity estimates. In contrast to previously proposed schemes, the uncertainty due to the discrete hypotheses is taken into account and the algorithm is therefore more effective. Furthermore, a novel neighborhood assumption was presented, which makes the implementation more efficient. By exploitation of the computational power of modern GPUs, high processing speed is reached. Experiments have shown that the approach reduces the influence of ambiguities and noise in the disparity estimation of fully dynamic scenarios, where objects and the cameras move independently at the same time.
References
1. Yang, R., Pollefeys, M.: Multi-Resolution Real-Time Stereo on Commodity Graphics Hardware. In: Conference on Computer Vision and Pattern Recognition (CVPR 2003), Madison, Wisconsin, USA (June 2003)
2. Zhang, Y., Kambhamettu, C.: On 3D scene flow and structure estimation. In: Conference on Computer Vision and Pattern Recognition (CVPR 2001), pp. 778–785 (December 2001)
3. Huguet, F., Devernay, F.: A variational method for scene flow estimation from stereo sequences. In: ICCV 2007, pp. 1–7 (2007)
4. Vedula, S., Baker, S., Rander, P., Collins, R., Kanade, T.: Three-dimensional scene flow. IEEE Trans. Pattern Anal. Mach. Intell. 27(3), 475–480 (2005)
5. Leung, C., Appleton, B., Lovell, B.C., Sun, C.: An energy minimisation approach to stereo-temporal dense reconstruction. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), Washington, DC, USA, vol. 4, pp. 72–75. IEEE Computer Society, Los Alamitos (2004)
6. Matthies, L., Kanade, T., Szeliski, R.: Kalman filter-based algorithms for estimating depth from image sequences. International Journal of Computer Vision 3(3), 209–238 (1989)
7. Badino, H., Franke, U., Mester, R.: Free space computation using stochastic occupancy grids and dynamic programming. In: Workshop on Dynamical Vision, ICCV (2007)
8. Gong, M.: Enforcing temporal consistency in real-time stereo estimation. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954, pp. 564–577. Springer, Heidelberg (2006)
9. Scharstein, D., Szeliski, R., Zabih, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. In: Proceedings of IEEE Workshop on Stereo and Multi-Baseline Vision, Kauai, Hawaii (December 2001)
10. Collins, R.T.: A space-sweep approach to true multi-image matching. In: Proc. Computer Vision and Pattern Recognition Conf., pp. 358–363 (1996)
11. Scharstein, D., Szeliski, R.: Stereo matching with non-linear diffusion. International Journal of Computer Vision 28(2), 155 (1998)
² This work was partially supported by the German Research Foundation (DFG), KO-2044/3-1, and the Project 3D4YOU, Grant 215075 of the Information Society Technologies area of the EU's 7th Framework Programme.
A Novel Approach for Detection of Tubular Objects and Its Application to Medical Image Analysis
Christian Bauer and Horst Bischof
Institute for Computer Graphics and Vision, Graz University of Technology, Austria
{cbauer, bischof}@icg.tu-graz.ac.at
Abstract. We present a novel approach for detection of tubular objects in medical images. Conventional tube detection / lineness filters make use of local derivatives at multiple scales using a linear scale space; however, using a linear scale space may result in an undesired diffusion of nearby structures into one another, and this leads to problems such as the detection of two tangenting tubes as one single tube. To avoid this problem, we propose to replace the multi-scale computation of the gradient vectors by the Gradient Vector Flow, because it allows an edge-preserving diffusion of gradient information. Applying Frangi's vesselness measure to the resulting vector field allows detection of centerlines of tubular objects, independent of the tube's size and contrast. Results and comparisons to related methods on synthetic and clinical datasets show a high robustness to image noise and to disturbances outside the tubular objects.
1 Introduction
The detection and description of tubular structures, like blood vessels or airways, is important for several medical image analysis tasks. The derived representations of the tubes, which are typically based on centerline descriptions, are used for visualization, interaction, initialization of segmentations, registration tasks, or as a prior step for quantification of diseases like stenoses, aneurysms, or arteriosclerosis. To produce such descriptions two classes of approaches can be used: segmentation with a subsequent skeletonization or bottom-up tube detection filters. As top-down segmentation is not a simple task and might require user interaction, bottom-up tube detection filters are the state-of-the-art methods used in several applications. Most tube detection filters presented in the literature are based on the assumption that the tubular objects are bright structures in front of a darker homogeneous background; for example, [1,2,3,4,5,6]. The radius of these structures varies, but following the concepts of scale-space theory [7], the tubular structures form height-ridges when the scale is adapted according to the size of the
This work was supported by the Austrian Science Fund (FWF) under the doctoral program Confluence of Vision and Graphics W1209.
Fig. 1. 2D cross section (orthogonal to the tube's tangent direction) of 3D tubular structures and intermediate processing results of the multi-scale gradient vector computation and the GVF
objects. Based on these assumptions, conventional tube detection filters try to identify the tubular objects at different scales and combine all responses into one multi-scale response. The response on a single scale is achieved by convolution of the initial image with a medialness function K(x, σ), where x is a point in 3D space and σ denotes the scale of the measurement. Typical medialness functions use the first order derivatives (gradient vectors), or second order derivatives (Hessian matrix), or a combination of both to identify the tubular objects at a specific scale; for example, [1,2,3,4,5,6]. They all have in common that a linear (Gaussian) scale space is used for computation of the spatial derivatives. This Gaussian smoothing is an isotropic diffusion that does not only result in suppression of noise but also in a blurring, which causes nearby features to diffuse into one another. This frequently leads to problems, e.g. that two tangenting tubes are detected as one single tube (see Fig. 1). For this reason, using the gradient vectors derived in a Gaussian scale space is generally not the appropriate choice for tube detection in medical images. In this work, we address this problem and present a novel approach for detection of tubular objects. We propose to replace the isotropic diffusion of the image gradients by an anisotropic (edge preserving) diffusion process – namely the Gradient Vector Flow (GVF) [8] – and use the resulting vector field for detection of tubular objects. This avoids diffusion of nearby structures into one another and supersedes the computation at multiple scales. We discuss properties of the approach using synthetic datasets from public databases and demonstrate the applicability and advantage of our approach on clinical datasets.
The presented approach works in 2D as well as in 3D, for tubular structures brighter than the surrounding tissue or darker than the surrounding tissue. But for simplicity, in the remainder of the paper we will assume bright 3D tubular structures surrounded by darker tissue.
2 Methodology
As outlined in the introduction, tube detection filters utilize – directly or indirectly – the first order spatial derivatives to identify tubular objects. The reason is that tubular objects show characteristic gradient vector fields at their centerlines (see Fig. 1 bottom row) which can be used for classification. Therefore, the whole tube detection process can be split into two parts: deriving an appropriate gradient vector field V from the given image I and a subsequent classification based on this vector field. Both parts will be discussed separately in the next two paragraphs. A supporting example illustrating the basic idea on some tubular objects is shown in Fig. 1. Generation of the gradient vector field: Conventional approaches for tube detection compute the gradient vector field at multiple scales. Given a specific scale σ, the gradient vector field Vσ is computed by convolution of the original image with a Gaussian filter kernel Gσ and computation of the local derivatives: Vσ = ∇(Gσ ∗ I) = Gσ ∗ ∇I. This can also be interpreted as a distribution of gradient information towards the center of the tubular object. When the scale is adapted appropriately to the size of the tube, the resulting vector field shows the typical characteristics of a tube at its centerline (see Fig. 1 middle columns). However, when the scale gets larger, nearby objects diffuse into one another and may produce vector fields that can also be interpreted as tubular objects (see Fig. 1 middle right column). This behaviour is inherent in the linear scale space, as Gauss filtering is a non-feature-preserving diffusion process (isotropic diffusion). To avoid diffusion of nearby objects into one another, it is necessary to replace the isotropic diffusion by a feature-preserving (edge-preserving) diffusion process. Anisotropic diffusion of the original image does not solve the problem, because for tube detection it is necessary to distribute gradient information from the boundary of the tubular object towards its center. The key is to perform a diffusion of the gradient information. A method that fulfills this requirement – edge-preserving diffusion of gradient information – is the GVF as presented by Xu and Prince [8]. The GVF is defined as the vector field V(x) that minimizes:

E(V) = ∫_Ω μ|∇V(x)|² + |∇I(x)|² |V(x) − ∇I(x)|² dx   (1)
where x = (x, y, z) and μ is a regularization parameter that has to be adapted according to the amount of noise present. The variational formulation of the GVF makes the result smooth where the initial vector magnitudes are small,
while keeping vectors with high magnitude nearly equal. In practice, the GVF preserves even weak structures while being robust to large amounts of noise [8]. For tubular objects applying the GVF results in the vector field shown on the very right column of Fig. 1. Compared to the vector fields derived from the Gaussian scale space, the GVF has two different properties: first, the problem of the linear scale space, diffusion of nearby structures into one another, is avoided. Second, at the centers of the tubular objects the GVF results in the same characteristic vector field as obtained with the multi-scale gradient computation when the scale is adapted appropriately to the tubes size. This allows detection of tubular objects (more precisely their centerlines) directly from the vector field produced by the GVF - without the need for a multi-scale analysis. Therefore, for the task of tube detection the GVF can be used as a replacement of the multi-scale gradient vector computation. Classification based on the vector field: Having generated an appropriate vector field, the second step in tube detection is classification based on this vector field. As mentioned before, the GVF can - to some extent - be seen as a replacement for the multi-scale gradient vector computation, and the GVF’s vector field can be combined with several tube detection filter approaches. We experimented with offset and central medialness functions published by other authors ([1,2,3]), achieving good results with all of them, but we decided to demonstrate the combination with Frangi’s vesselness measure since it is simple and well known. Using the medialness functions of Pock et al. [3] or Krissian et al. [2] would also provide radius estimates for the tubes. For a tubular object, the gradient vectors point all directly towards the centerline of the tube; the local vector field shows a large variance in two dimensions, and a low variance in the third dimension (see Fig. 1). One way to measure this variance is based on the Hessian matrix H(x) = ∇V (x) and its eigenvalues |λ1 | ≤ |λ2 | ≤ |λ3 |. From these eigenvalues, it is possible to distinguish between plate-like, blob-like, and tubular structures (brighter or darker than the background) and noise. A frequently used measure to derive a tube-likeliness from the eigenvalues of the Hessian matrix is Frangi’s vesselness measure [1]: T =
0,   if λ2 > 0 or λ3 > 0
exp(−RA²/(2α²)) · (1 − exp(−RB²/(2β²))) · (1 − exp(−S²/(2c²))),   else   (2)

with RA = |λ1|/√(|λ2||λ3|) indicating blob-like structures, RB = |λ2|/|λ3| to distinguish between plate-like and line-like structures, and S = √(λ1² + λ2² + λ3²) for suppression of random noise effects. The parameters α, β, and c allow control of the sensitivity of the filter to the measures RA, RB, and S, respectively. When combining Frangi's vesselness measure with the GVF, two adaptations are necessary. First, the third term of Frangi's vesselness measure that controls the noise-sensitivity becomes obsolete, as the noise-suppression is controlled by the regularization parameter μ of the GVF. Second, the magnitude of the GVF has to be normalized, Vn(x) = V(x)/|V(x)|, because the original edge strength is not
of importance anymore (see Fig. 1 on the bottom right). Thus, the final response also becomes independent of the contrast of the tube, and the centerlines of tubular objects can be extracted immediately by simple thresholding.
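The complete 3D pipeline is not spelled out in code in the paper; the following 2D Python sketch only illustrates the two building blocks described in this section, namely an iterative computation of the GVF of Eq. (1) and a centerline response derived from the eigenvalues of the Jacobian of the normalized vector field. The explicit time step, the iteration count, the symmetrization of the Jacobian and the simplified two-dimensional response are assumptions of this sketch and differ from the full 3D vesselness of Eq. (2):

import numpy as np

def gradient_vector_flow(image, mu=0.1, iters=200, dt=0.2):
    """2D gradient vector flow (Xu & Prince) by explicit diffusion iterations.
    The image is assumed to be scaled to [0, 1]; a 3D version iterates the
    same update over three vector components."""
    fy, fx = np.gradient(image.astype(float))
    mag2 = fx**2 + fy**2
    u, v = fx.copy(), fy.copy()
    for _ in range(iters):
        lap_u = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                 np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)
        lap_v = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                 np.roll(v, 1, 1) + np.roll(v, -1, 1) - 4 * v)
        u += dt * (mu * lap_u - mag2 * (u - fx))
        v += dt * (mu * lap_v - mag2 * (v - fy))
    return u, v

def gvf_tubeness(image, mu=0.1, beta=0.5, eps=1e-6):
    """Centerline measure from the normalized GVF field (2D illustration):
    the eigenvalues of the (symmetrized) Jacobian of Vn take the role of
    the Hessian eigenvalues in the vesselness measure."""
    u, v = gradient_vector_flow(image, mu)
    norm = np.sqrt(u**2 + v**2) + eps
    u, v = u / norm, v / norm                     # magnitude normalization
    uy, ux = np.gradient(u)
    vy, vx = np.gradient(v)
    j11, j22, j12 = ux, vy, 0.5 * (uy + vx)       # symmetrized Jacobian entries
    tmp = np.sqrt(((j11 - j22) * 0.5)**2 + j12**2)
    lam1 = (j11 + j22) * 0.5 + tmp
    lam2 = (j11 + j22) * 0.5 - tmp
    big = np.where(np.abs(lam1) >= np.abs(lam2), lam1, lam2)
    small = np.where(np.abs(lam1) >= np.abs(lam2), lam2, lam1)
    rb2 = (small / (big + eps))**2                # blobness ratio in 2D
    response = np.exp(-rb2 / (2 * beta**2))
    # for bright line-like structures the dominant eigenvalue is negative
    return np.where(big < 0, response, 0.0)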
3 Evaluation and Results
In order to compare our approach to other methods and to demonstrate the difference between using a linear scale space and using the GVF for distribution of gradient information, we present results achieved with our approach and the results achieved with the method of Frangi et al. [1] and the method of Krissian et al. [2]. These two other approaches are multi-scale methods that operate in a Gaussian scale space. Frangi's method makes use of the eigenvalues of the Hessian matrix to derive a vesselness-measure as already presented in Sec. 2. Krissian's approach uses the Hessian matrix only to identify height ridges and to estimate the tube's orientation. In a second step, Krissian uses this information for selection of tube surface points and evaluates the gradient information at these points to derive his measure of medialness. For both approaches, the parameters suggested by the authors were used; with Frangi's approach it was necessary to adapt the noise-sensitivity parameter c according to the noise level. With our approach, which also makes use of Frangi's vesselness measure, the following set of parameters was used: α = 0.5, β = 0.5, c = 100; but instead of adapting c, the GVF's regularization parameter μ was adapted according to the noise level (default: μ = 0.1). For low-noise datasets, the GVF was computed on the images directly without any preprocessing; for high-noise datasets, the images were slightly smoothed with a Gaussian filter with a standard deviation of one voxel, to account for image noise and partial voluming. If not mentioned explicitly, the images were not preprocessed. One may argue that this smoothing becomes a scale-space problem again; but the slight smoothing only accounts for image noise and partial volume effects and does not take larger image areas into account. Therefore, the slight smoothing is only determined by the noise level and not the image content, and this is in contrast to Gauss filtering with a large scale filter kernel. In order to visualize the datasets and to make the filter responses comparable, all images shown in the next sections were produced using maximum intensity projection for visualization; the gray value ranges of the datasets were normalized prior to visualization to show the full data range; exceptions are mentioned explicitly.
3.1 Synthetic Datasets
To demonstrate properties of the three methods under controlled reproducible conditions, we use synthetic datasets from two public databases and one synthetic dataset we created. These datasets allow us to study properties of the methods under varying noise conditions, contrast conditions, tube diameters and varying tube configurations as they occur in vascular systems. In these datasets,
Fig. 2. Tubular objects in varying configurations and responses of the different methods. From top to bottom: original datasets, response of Frangi’s method, Krissian’s method, and our proposed method. From left to right: T-junction with constant diameter, T-junction with varying diameter, Y-junction with constant diameter, Y-junction with varying diameter, tube with varying diameter, tangenting tubes, helix.
the tubular objects show different kinds of edge-types as they appear in medical datasets: perfect step edges, slightly blurred step edges (due to partial voluming), and tubes with Gaussian cross-section profiles. Varying tube configurations: The public database of Krissian and Farneback [2,9], see Fig. 2, shows standard situations of tubular objects, as they occur in vascular systems. For junctions of tubular objects with largely varying diameters, the response of our method falls off slightly, similar to the response of Frangi’s method; but, as pointed out by Bennink et al. [4], this behaviour is common to most line filters and may be assumed as correct for pure lineness filters. With the example of the tube with the varying diameter, all methods allow for extraction of the correct centerline, but as the structure of this tube becomes more bloblike the response of Frangis’s method and our proposed method falls off slightly; the response of Krissian’s approach on the other hand is wider than the tubular object itself. The datasets with the tangenting tubes and the helix highlight the problem of the linear scale space analysis. Frangi’s method, as well as Krissian’s method, produce high responses outside the tubular objects, because on a larger scale these structures diffuse into one another; a separation is not possible anymore. In contrast, the response of our method is insensitive to influences outside the tubular objects and responds only at the correct centerline of the tubes. Varying noise level: The dataset provided by Aylward et al. [10], see Fig. 3, contains a tortuous, branching, tubular object with vanishing radius. The contrast between the tubular object and the background ranges from 100 at the middle of the tube to 50 at the tube’s edge. The datasets were corrupted with additive Gaussian noise with increasing standard deviations η of 10, 20, 40 and 80. “The η = 20 data is representative of the noise level in MR and CT data. The η = 40
Fig. 3. Tubes with varying noise levels and responses of the different methods. From top to bottom: original datasets, response of Frangi’s method, Krissian’s method, and our proposed method. From left to right: increasing noise level using additive Gaussian noise with standard deviations η of 10, 20, 40, and 80.
data more closely resembles the noise magnitude of ultrasound data. [...] The η = 80 images are well beyond any worst case number [...] for any clinically acceptable MRA, CT, or ultrasound data." [10]. For Frangi's and our methods, the noise-sensitivity parameters, c and μ, respectively, had to be adapted. Krissian's method has no parameter that allows control of noise sensitivity. With our approach, on the η = 10 and η = 20 datasets the GVF was applied directly without any preprocessing; the η = 40 and η = 80 datasets were smoothed slightly using a Gaussian filter with a standard deviation of one voxel. The results show that our approach produces clean responses at the centers of the tubular objects even under high (clinically acceptable) noise levels (η = 10, 20, and 40). On the η = 80 dataset, the results of all methods are not satisfactory. Varying contrast and diameter: The dataset shown in Fig. 4 on the very left shows tubular structures with different radii. The background intensity of the image was steadily increased, thus resulting in a decreased contrast between the tube and the background. The responses of the methods of Frangi and Krissian both depend on the contrast between the tube and the background. Thus, their response decreases with decreasing contrast. In comparison, the response of our method is independent of the tube's size and contrast.
3.2 Clinical Datasets
In this section, we apply the three methods to two medical volume datasets and verify the results obtained on the synthetic datasets on clinical datasets. CT angiography: In the top row of Fig. 5, a CT angiography image and enlarged subregions with challenging situations are shown. The main problems for tube
Fig. 4. Tubes with varying diameter and contrast and the responses of the different methods. From left to right: original dataset, response of Frangi’s method, Krissian’s method, and our proposed method.
detection filters with this kind of dataset are the detection of very thin low contrast vessels, diffuse edges, and closely adjacent vessels. Frangi’s and Krissian’s methods were specifically designed for angiography images. With our approach, the GVF was applied directly to the dataset, without any preprocessing, as the conditions are comparable to the synthetic datasets from Krissian’s database (see Sec. 3.1). The overview image of the whole dataset (leftmost column) shows that our approach works similarly well for wide tubes with bar-like cross-section profiles and
Fig. 5. CT angiography image and responses of the different methods. From top to bottom: original dataset, response of Frangi’s method, Krissian’s method, and our proposed method. From left to right: overview of the whole dataset, closely tangenting vessels (yellow arrow), thin vessels (green arrow), overlapping vessels (red arrow).
for very thin tubular objects with Gaussian cross-section profiles. The magnified subregion showing the tangenting tubes (middle left column), verifies the results of the synthetic datasets; Frangi’s and Krissian’s multi-scale approaches diffuse the closely adjacent structures into one another, making an extraction of height ridges as suggested by Krissian et al. [2] for centerline extraction impossible; our approach allows a clear separation of both. The magnified subregion shows some very thin low contrast vessels (middle right column). As the responses of Frangi’s and Krissian’s methods decrease with decreasing contrast (see Sec. 3.1) the response to low contrast vessels also decreases. This makes a separation from the background difficult (e.g. based a single threshold as suggested by Krissian et al. [2]); our approach is insensitive to changing contrast situations and finding a single threshold for extraction of the centerlines is an easy task. The magnified subregion on the very right column shows another example of very thin tubular structures; the vessels overlap completely in the image because of partial voluming. None of the three methods is able to separate them because they do not take sub-pixels into account. CT of the aorta: To demonstrate the performance of the different methods in a more complex environment and with a higher noise level, the methods have been applied to a CT dataset of the abdomen showing the aorta, see Fig. 6. With our approach, the dataset was preprocessed using a Gaussian filter with a standard deviation of one voxel (intra-slice resolution). This example reveals the drawback of the linear scale space very clearly: the close proximity of spine and aorta disturbs the detection of the aorta when applying Franig’s or Krissian’s method. Both produce undesired responses to several structures in the image. One may think of the whole spine as a tubular object at a large scale, but this disturbs the detection
Fig. 6. CT of the abdomen showing the aorta in proximity to the spine. Top row: axial slice; Bottom row: maximum intensity projection. From left to right: original dataset showing the full data range, original dataset with adapted gray value range, response of Frangi’s method (adapted gray value range), Krissian’s method (adapted gray value range), and our proposed method.
of smaller vessels in its close proximity. With our approach, the detection of the aorta is not influenced at all by any surrounding object.
4 Conclusion
In this work, we addressed a common problem of most tube detection filters that is related to the multi-scale gradient vector computation in the conventionally used Gaussian scale space: diffusion of nearby structures into one another, which may result in additionally detected tubular structures. We showed that this problem can be avoided by using the GVF as a replacement for the multi-scale gradient vector computation. In combination with Frangi's vesselness measure, the resulting approach allows detection of centerlines of tubular objects, independent of the tube's size and contrast.
References
1. Frangi, A.F., Niessen, W.J., Vincken, K.L., Viergever, M.A.: Multiscale vessel enhancement filtering. In: Wells, W.M., Colchester, A.C.F., Delp, S.L. (eds.) MICCAI 1998. LNCS, vol. 1496, pp. 130–137. Springer, Heidelberg (1998)
2. Krissian, K., Malandain, G., Ayache, N., Vaillant, R., Trousset, Y.: Model-based detection of tubular structures in 3D images. Computer Vision and Image Understanding 80(2), 130–171 (2000)
3. Pock, T., Beichel, R., Bischof, H.: A novel robust tube detection filter for 3D centerline extraction. In: Kalviainen, H., Parkkinen, J., Kaarna, A. (eds.) SCIA 2005. LNCS, vol. 3540. Springer, Heidelberg (2005)
4. Bennink, H.E., van Assen, H.C., ter Wee, R., Spaan, J.A.E., ter Haar Romeny, B.M.: A novel 3D multi-scale lineness filter for vessel detection. In: Ayache, N., Ourselin, S., Maeder, A. (eds.) MICCAI 2007, Part II. LNCS, vol. 4792, pp. 436–443. Springer, Heidelberg (2007)
5. Lorenz, C., Carlsen, I.C., Buzug, T.M., Fassnacht, C., Weese, J.: Multi-scale line segmentation with automatic estimation of width, contrast and tangential direction in 2D and 3D medical images. In: Troccaz, J., Mösges, R., Grimson, W.E.L. (eds.) CVRMed-MRCAS 1997, CVRMed 1997, and MRCAS 1997. LNCS, vol. 1205, pp. 233–242. Springer, Heidelberg (1997)
6. Sato, Y., Nakajima, S., Shiraga, N., Atsumi, H., Yoshida, S., Koller, T., Gerig, G., Kikinis, R.: Three-dimensional multi-scale line filter for segmentation and visualization of curvilinear structures in medical images. Medical Image Analysis 2(2), 143–168 (1998)
7. Lindeberg, T.: Edge detection and ridge detection with automatic scale selection. In: CVPR 1996, Washington, DC, USA, p. 465. IEEE Computer Society, Los Alamitos (1996)
8. Xu, C., Prince, J.L.: Snakes, shapes, and gradient vector flow. IEEE Transactions on Image Processing 7(3), 359–369 (1998)
9. Krissian, K., Farneback, G.: Building reliable clients and servers. In: Leondes, C.T. (ed.) Medical Imaging Systems Technology: Methods in Cardiovascular and Brain Systems. World Scientific Publishing Co., Singapore (2005)
10. Aylward, S., Bullitt, E.: Initialization, noise, singularities, and scale in height ridge traversal for tubular object centerline extraction. IEEE Transactions on Medical Imaging 21(2), 61–75 (2002)
Weakly Supervised Cell Nuclei Detection and Segmentation on Tissue Microarrays of Renal Clear Cell Carcinoma
Thomas J. Fuchs1,3, Tilman Lange1, Peter J. Wild2, Holger Moch2,3, and Joachim M. Buhmann1,3
1 Institute for Computational Science, ETH Zürich, Switzerland {thomas.fuchs, langet, jbuhmann}@inf.ethz.ch
2 Institute of Pathology, University Hospital Zürich, University Zürich, Switzerland
3 Competence Center for Systems Physiology and Metabolic Diseases, ETH Zürich
Abstract. Renal cell carcinoma (RCC) is one of the ten most frequent malignancies in Western societies and can be diagnosed by histological tissue analysis. Current diagnostic rules rely on exact counts of cancerous cell nuclei which are manually counted by pathologists. We propose a complete imaging pipeline for the automated analysis of tissue microarrays of renal cell cancer. At its core, the analysis system consists of a novel weakly supervised classification method, which is based on an iterative morphological algorithm and a soft-margin support vector machine. The lack of objective ground truth labels to validate the system requires the combination of expert knowledge of pathologists. Human expert annotations of more than 2000 cell nuclei from 9 different RCC patients are used to demonstrate the superior performance of the proposed algorithm over existing cell nuclei detection approaches.
1 Introduction
The clinical workflow of cancer tissue analysis is composed of several estimation and classification steps which yield a diagnosis of the disease stage and a therapy recommendation. This paper proposes an automated system to model such a workflow which provides more objective estimates of cancer cell detection and nuclei counts than pathologists' results. Cell detection and segmentation is achieved by morphological filtering and semi-supervised classification as described in section 2. Our image processing pipeline is tailored to renal cell carcinoma (RCC), one of the ten most frequent malignancies in Western societies. The prognosis of renal cancer is poor since many patients already suffer from metastases at first diagnosis. The identification of biomarkers for prediction of prognosis (prognostic marker) or response to therapy (predictive marker) is therefore of utmost importance to improve patient prognosis. Various prognostic markers have been suggested in the past, but conventional estimation of morphological parameters is still most useful for therapeutical decisions.
Fig. 1. Tissue Microarray Analysis (TMA): Primary tissue samples are taken from a cancerous kidney (a). Then 0.6mm tissue cylinders are punched from the primary tumor block of different patients and arrayed in a recipient paraffin block (b). Slices of 0.6µm are cut off the paraffin block and are immunohistochemically stained (c). These slices are scanned and each spot represents a different patient. Image (d) depicts a TMA spot of clear cell renal cell carcinoma from our test set stained with the MIB-1 (Ki-67) antigen.
Clear cell RCC (ccRCC) is the most common subtype of renal cancer and it is composed of cells with clear cytoplasm and typical vessel architecture. ccRCC shows an architecturally diverse histological structure, with solid, alveolar and acinar patterns. The carcinomas typically contain a regular network of small thin-walled blood vessels, a diagnostically helpful characteristic of this tumor. Most ccRCC show areas with hemorrhage or necrosis (Fig. 1d), whereas an inflammatory response is infrequently observed. The cytoplasm is commonly filled with lipids and glycogen, which are dissolved in routine histologic processing, creating a clear cytoplasm surrounded by a distinct cell membrane (Fig. 1d and Fig. 4a). Nuclei tend to be round and uniform with finely granular and evenly distributed chromatin. Depending upon the grade of malignancy, nucleoli may be inconspicuous and small, or large and prominent. Very large nuclei or bizarre nuclei may occur [1]. The tissue microarray (TMA) technology promises to significantly accelerate studies seeking associations between molecular changes and clinical endpoints [2]. In this technology, 0.6mm tissue cylinders are punched from primary tumor blocks of hundreds of different patients and these cylinders are subsequently embedded into a recipient tissue block. Sections from such array blocks can then be used for simultaneous in situ analysis of hundreds or thousands of primary tumors on the DNA, RNA, and protein level (Fig. 1b,c). These results can then
be integrated with expression profile data, which is expected to enhance the diagnosis and prognosis of ccRCC [3], [4], [5]. The high speed of arraying, the lack of significant damage to donor blocks, and the regular arrangement of arrayed specimens substantially facilitate automated analysis. Although the production of tissue microarrays is an almost routine task for most laboratories, the evaluation of stained tissue microarray slides remains tedious, time consuming and prone to error. Furthermore, the significant intratumoral heterogeneity of RCC results in high interobserver variability. The variable architecture of RCC also results in a difficult assessment of prognostic parameters. Current image analysis software requires extensive user interaction to properly identify cell populations, to select regions of interest for scoring, to optimize analysis parameters and to organize the resulting raw data. Because of these drawbacks in current software, pathologists typically collect tissue microarray data by manually assigning a composite staining score for each spot, often during multiple microscopy sessions over a period of days. Such manual scoring can result in serious inconsistencies between data collected during different microscopy sessions. Manual scoring also introduces a significant bottleneck that hinders the use of tissue microarrays in high-throughput analysis. The prognosis for patients with RCC depends mainly on the pathological stage and the grade of the tumor at the time of surgery. Other prognostic parameters include the proliferation rate of tumor cells and different gene expression patterns. Tannapfel et al. [6] have shown that cellular proliferation may prove to be another measure for predicting biologic aggressiveness and, therefore, for estimating the prognosis. Immunohistochemical assessment of the MIB-1 (Ki-67) antigen indicates that MIB-1 immunostaining (Fig. 1d) is an additional prognostic parameter for patient outcome. Using bladder cancer tissue, TMAs were shown to be highly representative of proliferation index and histological grade [7]. In the domain of cytology, especially blood analysis and smears, automated analysis is already established [8]. The main difference to histological tissue is the homogeneous background on which the cells are clearly distinguishable and the absence of vessels and connective tissue. The isolation of cells simplifies the detection and segmentation process of the cells significantly. A similar simplification can be seen in the field of immunofluorescence imaging [9]. Only the advent of high resolution scanning technologies in recent years made it possible to consider an automated analysis of histological slices. Cutting-edge scanners are now able to scan slices with a resolution comparable to that of a 40x lens on a light microscope. In addition, the automated scanning of stacks of slices enables analysis in a high-throughput manner.
2 Methods
We propose an imaging pipeline for detection and segmentation of cell nuclei in images of ccRCC obtained by light microscopy. Tissue Preparation and Scanning: The tissue microarray block was generated in a trial from the University Hospital Zürich. The TMA slides were
immunohistochemically stained with the MIB-1 (Ki-67) antigen and scanned on a ScanScope virtual slide light microscope scanner from Aperio Technologies Inc. A lens with a magnification of 40x was used, which resulted in a per pixel resolution of 0.25µm. Finally the spots of single patients were extracted as separate images of 1500 × 1500 × 3 pixels size. Uneven Illumination Correction: Most of the TMA spots show an illumination gradient resulting from light variations during the scanning process or an unevenly cut tissue slice, which leads to thicker or thinner and, therefore, to darker and lighter areas on the image. We use a top-hat transform for mitigating the illumination gradients as described in [10,11]. A top-hat with a large isotropic structuring element acts as a high-pass filter. Therefore, it can remove the illumination gradient which lies within the low frequencies of the image. In practice, we open the image I with a square B of size 25 × 25 and subtract the result from the original image: I_even = I_original − γ_B(I). Edge Pruning: We apply the Canny edge detector [12] to get an edge map of the TMA spot. Besides edges on nuclei boundaries, this results in a large number of edge responses from undesired boundaries between cytoplasm, vessels, connecting tissue and background. To filter out these edges, the following edge pruning algorithm is applied: First, we run a self-devised, simple and fast junction detector to find and remove junctions between edges. This task is solved by applying a series of hit-or-miss (HMT) transformations with all possible structuring elements B representing junctions in a 3x3 neighborhood: HMT_B(X) = ε_{B1}(X) ∩ ε_{B2}(X^C), where ε_B(X) denotes the morphological erosion with a structuring element B, X is the set of pixels and X^C its complement. To reduce the set of possible junctions we shrink the edge map to minimally connected strokes. The result of this procedure is a set of thinned edges without junctions. Second, these edges are split at equidistant points into edgels with a length of approximately ten pixels. For each of the edgels, a neighboring region on each side of this edgel is considered. We then calculate the mean intensity in these regions and use the lower value of this intensity as a score for the edgel. The reasoning behind this is the observation that desired edges occur on boundaries between a nucleus membrane and the cytoplasm as well as between a nucleus and the background. Hence the difference between the edgel neighboring regions can vary significantly, but the lower mean value has to be below a maximum threshold θ for the edgel to be considered as part of a nucleus boundary. Third, we keep all edgels that either score lower than θ or are neighbors of edgels with a satisfying score. This pruning procedure allows us to discard undesired edges that are not part of a nucleus boundary. Morphological Object Segmentation: To segment nuclei in the pruned edge map we devise a novel iterative algorithm that applies morphological opening
and closing operations to detect potential nuclei. These nuclei are then subtracted from the pruned edge image and the process is repeated with larger structuring elements. In detail, the algorithm works as follows:
1. Perform a morphological closing φ_B1 with the structuring element B1 to close gaps of the size of B1: I_boundaries = φ_B1(I_edges)
2. Fill all holes in the image I_boundaries.
3. Perform a morphological opening γ_B2 with a structuring element B2 to remove single edges which do not belong to a closed blob, i.e. a nucleus: I_blobs = γ_B2(I_boundaries)
4. Add the resulting blobs to the final segmentation map: I_segmented = I_segmented | I_blobs
5. Remove the edges that belong to the found nuclei from the edge map: I_edges = I_edges \ I_blobs
6. Increase the size of the structuring element B1 to close larger gaps in the next iteration: B1 = δ_B3(B1)
7. Start over at step 1 until a predefined number of iterations is performed.
The resulting segmentation contains the true nuclei and a number of false positive segmented blobs. Nuclei Classification: The morphological segmentation algorithm described above yields a large number of potential nuclei including many false positives. To solve this problem and to filter out these false positive detections, a soft-margin support vector machine [13] is trained to classify between true nuclei and false positives. For this task we designed 21 features based on the expertise of the collaborating pathologists; these features are supposed to capture the properties of a nucleus in terms of shape, appearance and geometry. The geometric features for a nucleus n (the set of pixels belonging to it) are defined as follows:

Size(n) = 1/(σ√(2π)) · exp(−(|n| − μ)²/(2σ²)), with μ = 600, σ = 300
Ellipticity(n) = |nEllipse| / (|nEllipse| + |(nEllipse ∪ n) \ (nEllipse ∩ n)|)
ShapeRegularity(n) = 2π·√(nArea/π) / nPerim
where |n| is its area and nPerim its perimeter. nEllipse is the ellipse that has the same normalized second central moments as the nucleus and |nEllipse| is its area. Color features are computed for each color channel separately:

NucleusIntensity(n) = (1/|n|) Σ_{x∈n} x
Fig. 2. Illustration of the main nucleus features used for the classification process. From left to right: Schematic sketch of a nucleus. Overlap and extent of the nucleus and its ellipse with the same normalized second central moments. Outer and inner boundary regions of the nucleus membrane. Nucleus perimeter and its optimal circle with respect to the area.
InnerIntensity(n) = (1/|n|) Σ_{x ∈ n \ ε_B(n)} x
OuterIntensity(n) = (1/|n|) Σ_{x ∈ δ_B(n) \ n} x
InnerHomogeneity(n) = std(x ∈ n \ ε_B(n))
OuterHomogeneity(n) = std(x ∈ δ_B(n) \ n)
IntensityDifference(n) = ((1/|n|) Σ_{x ∈ δ_B(n) \ n} x) · ((1/|n|) Σ_{x ∈ n \ ε_B(n)} x)^(−1)
where ε_B(n) is the morphological erosion of the nucleus n with the structuring element B and δ_B(n) the corresponding morphological dilation, as described in [10]. For the structuring element a disk with a radius of five is used. The outer and inner homogeneity is the standard deviation of the intensities in the corresponding regions. Fig. 2 depicts the geometrical features and the inner and outer regions. To train the support vector machine we used the labels from the generated gold standard described in section 3. For this purpose we used 300 cell nuclei from one patient for training and left out the remaining 1700 nuclei from the other 8 patients for validation. The training data consists of the 21 features for each nucleus candidate and the label y_n indicating whether the candidate nucleus is a true or a false positive. We used the support vector machine in conjunction with a Gaussian kernel function k(x, y) = exp(−σ‖x − y‖²). The kernel width parameter σ as well as the penalty for misclassified points C (cf. [13]) have been determined by K-fold cross-validation.
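The iterative segmentation step lends itself to a compact implementation. The following Python sketch illustrates the opening/closing loop on a binary pruned edge map using SciPy; the structuring-element radii, the growth step and the number of iterations are illustrative assumptions and not the values used by the authors.

import numpy as np
from scipy import ndimage as ndi

def iterative_morphological_segmentation(edges, n_iter=5, close_radius=1, open_radius=3, grow=1):
    # edges: boolean pruned edge map; returns a boolean map of candidate nuclei
    def disk(r):
        y, x = np.ogrid[-r:r + 1, -r:r + 1]
        return x * x + y * y <= r * r
    segmented = np.zeros_like(edges, dtype=bool)
    edges = edges.copy()
    r_close = close_radius
    for _ in range(n_iter):
        # 1. close gaps of the current size in the edge map
        boundaries = ndi.binary_closing(edges, structure=disk(r_close))
        # 2. fill all holes, so closed contours become solid blobs
        filled = ndi.binary_fill_holes(boundaries)
        # 3. opening removes thin leftover edges that do not form a blob
        blobs = ndi.binary_opening(filled, structure=disk(open_radius))
        # 4. add the blobs to the final segmentation map
        segmented |= blobs
        # 5. remove the edges belonging to the found nuclei
        edges &= ~blobs
        # 6. enlarge the closing element to close larger gaps next time
        r_close += grow
    return segmented

In a full pipeline, the candidate blobs returned by this sketch would then be described by the 21 features above and filtered with the soft-margin SVM.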
3 Validation
Generating Gold Standard Labels: The absence of an objective ground truth requires generating a gold standard by combining the knowledge of expert
Fig. 3. Left: Precision-Recall plot of the evaluated algorithms. Right: Boxplot of the F-Measure for all algorithms.
pathologists. These labels are indispensable for the training of classifiers and for their validation, and it is highly non-trivial to acquire these labels from a technical as well as a statistical point of view. Although studies were conducted on a global estimation of staining on TMA spots [14,15], to our knowledge this is the first in-depth study for tissue microarrays which incorporates expert labeling information down to the detail and precision of single cell nuclei. To facilitate the labeling process for trained pathologists, we developed a special labeling tool dedicated to TMA spots. The software allows the user to view single TMA spots and it provides zooming and scrolling capabilities. It is possible to annotate the image with vectorial data in SVG (scalable vector graphics) format and to mark cell nuclei, vessels and other biological structures. An additional requirement for the software was usability on a tablet PC, so that a pathologist can perform all operations with a pen alone in a simple and efficient manner. Two trained pathologists and experts in renal cell carcinoma from the University Hospital Zürich used the software to annotate TMA spots of 9 different patients. They marked the location of each cell nucleus and its approximate size. In total each pathologist detected more than 2000 cell nuclei on these images. This tedious process demanded several days of work and the two pathologists worked independently of each other. Therefore the resulting gold standards of the two pathologists differ, which can be seen in the validation of section 3. Performance Measure: One way to evaluate the quality of the segmentations/nuclei detection is to consider true positive (TP), false positive (FP) and false negative (FN) rates. The calculation of these quantities is based on a matching matrix where each boolean entry indicates if a machine extracted nucleus matches a hand labeled one or not. To quantify the number of correctly segmented nuclei, a strategy is required to uniquely match a machine detected nucleus to one identified by a pathologist. To this end we model this problem
as a bipartite matching problem, where the bijection between extracted and gold-standard nuclei which induces the smallest detection error is sought [16]. This prevents overestimating the detection accuracy of the algorithms. To compare the performance of the algorithms we calculated the precision (P = TP/(TP + FP)) and the recall (R = TP/(TP + FN)) values as well as the corresponding F-measure (2 · P · R/(P + R)). Results: At first, it is noteworthy that the annotations provided by the pathologists in terms of nuclei detection differ by roughly 20%, which is also depicted in Fig. 4. This discrepancy is due to the fact that the pathologists did not agree whether some structures in the image are cell nuclei or not. In terms of the F-measure, the inter-pathologist variability is approximately 11%. Reaching this range of variability represents the ultimate goal of a computational approach to cell nuclei detection. We have applied the morphological segmentation of section 2 with and without the soft-margin SVM classification step to all 8 TMA test images. In order to relate the quality of our proposal to existing methods, we have additionally considered (i) a standard fuzzy c-means segmentation [17] and (ii) a texture-based algorithm [18] relying on a support vector clustering algorithm. For the fuzzy c-means algorithm we started with four clusters and then merged the two darkest (which represent stained and not stained nuclei) and the two brightest (which represent cytoplasm and background). For the algorithm from Glotsos et al. we scaled down the images to the same resolution as described in [18] and we used the recommended parameters. Fig. 3 summarizes the results of these experiments, with SVMmorph representing our combined algorithm. The second best performing method in terms of the F-measure is the morphological segmentation. If the classification postprocessing is additionally applied, the result is significantly improved. This quality jump demonstrates that the SVM-based method manages to significantly reduce the number of false positives without sacrificing too much accuracy, since the precision is higher while the recall is only marginally smaller than that of the pure morphological segmentation. The score achieved by this approach comes close to the inter-pathologist variability. Fuzzy c-means performs rather poorly on these images and turned out to be the method with the lowest performance in this study. This observation is primarily explained by the high number of false positives, specifically in connecting tissue. Hence the low precision results in the worst F-measure in this experimental evaluation. Most of the nuclei detected by the algorithm of Glotsos et al. can also be found in the annotation of the pathologists. Therefore, their algorithm produces a fairly high precision value. This high precision, however, is achieved at the expense of a low recall: The algorithm misses many of the annotated cell nuclei due to its conservative search strategy. Hence, the method by Glotsos et al. is only the third best method in this study, achieving a score of 55%. In summary, our method significantly outperforms the other methods under consideration.
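The evaluation protocol can be reproduced with a standard assignment solver. The sketch below matches detected nucleus centers to annotated ones with the Hungarian method and derives precision, recall and F-measure; the Euclidean cost and the gating distance max_dist are illustrative assumptions, since the paper only requires the error-minimising bijection.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_score(detected, annotated, max_dist=10.0):
    # detected, annotated: arrays of shape (N, 2) and (M, 2) with nucleus centers
    if len(detected) == 0 or len(annotated) == 0:
        return 0.0, 0.0, 0.0
    cost = np.linalg.norm(detected[:, None, :] - annotated[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)        # minimum-cost bijection
    tp = int(np.sum(cost[rows, cols] <= max_dist))  # matched pairs within the gate
    fp = len(detected) - tp
    fn = len(annotated) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure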
Fig. 4. A detail of a TMA spot is shown in (a). The two independent annotations from the pathologists for this detail are depicted in (b). (c) shows the morphological segmentation and the final result after SVM nuclei classification is presented in (d). Comparing images (b) and (d), it can be seen that the algorithm misses one nucleus in the top right quadrant but segments another nucleus on the left border, which was detected by only one of the two experts.
4 Conclusion
Automatic, high-throughput analysis of tissue microarrays promises new avenues for the discovery of biomarkers to detect and prognose renal cell cancer. We have designed and evaluated an imaging pipeline for cell nuclei detection and segmentation which approaches the performance of working pathologists. Adaptive classification techniques with a soft-margin support vector machine in combination with morphological image operations clearly outperform competing methods, such as the one proposed in [18], by filtering out the overwhelming number of false positive detections of cell nuclei. To our knowledge, this is the first in-depth study for tissue microarrays which incorporates expert labeling information down to the detail of single cell nuclei. The automated, quantitative analysis of tissue microarrays may in the future give rise to intelligent prognosis systems. Investigating the quality of such prediction approaches on the basis of TMA features and the correlation of survival time and automated prognosis is the subject of our current research. Acknowledgments. We thank Peter Bode for additional annotations, Norbert Wey and Claas Börger for scanning and image preparation, Björn Ommer, Volker Roth and Peter Schraml for fruitful discussions, and Nima Razavi for an implementation of the algorithm from [18].
References
1. Grignon, D.J., Eble, J.N., Bonsib, S.M., Moch, H.: Clear cell renal cell carcinoma. In: World Health Organization Classification of Tumours. Pathology and Genetics of Tumours of the Urinary System and Male Genital Organs. IARC Press (2004)
2. Kononen, J., Bubendorf, L., et al.: Tissue microarrays for high-throughput molecular profiling of tumor specimens. Nat. Med. 4(7), 844–847 (1998)
3. Takahashi, M., Rhodes, D.R., et al.: Gene expression profiling of clear cell renal cell carcinoma: gene identification and prognostic classification. Proc. Natl. Acad. Sci. U S A 98(17), 9754–9759 (2001)
4. Moch, H., Schraml, P., et al.: High-throughput tissue microarray analysis to evaluate genes uncovered by cDNA microarray screening in renal cell carcinoma. Am. J. Pathol. 154(4), 981–986 (1999)
5. Young, A.N., Amin, M.B., et al.: Expression profiling of renal epithelial neoplasms: a method for tumor classification and discovery of diagnostic molecular markers. Am. J. Pathol. 158(5), 1639–1651 (2001)
6. Tannapfel, A., Hahn, H.A., et al.: Prognostic value of ploidy and proliferation markers in renal cell carcinoma. Cancer 77(1), 164–171 (1996)
7. Nocito, A., Bubendorf, L., et al.: Microarrays of bladder cancer tissue are highly representative of proliferation index and histological grade. J. Pathol. 194(3), 349–357 (2001)
8. Yang, L., Meer, P., Foran, D.J.: Unsupervised segmentation based on robust estimation and color active contour models. IEEE Transactions on Information Technology in Biomedicine 9(3), 475–486 (2005)
9. Mertz, K.D., Demichelis, F., Kim, R., Schraml, P., Storz, M., Diener, P.A., Moch, H., Rubin, M.A.: Automated immunofluorescence analysis defines microvessel area as a prognostic parameter in clear cell renal cell cancer. Human Pathology 38(10), 1454–1462 (2007)
10. Soille, P.: Morphological Image Analysis: Principles and Applications. Springer, New York (2003)
11. Sonka, M., Hlavac, V., Boyle, R.: Image Processing, Analysis, and Machine Vision. Thomson-Engineering (2007)
12. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8(6), 679–698 (1986)
13. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
14. Hall, B., Chen, W., Reiss, M., Foran, D.J.: A clinically motivated 2-fold framework for quantifying and classifying immunohistochemically stained specimens. In: Ayache, N., Ourselin, S., Maeder, A. (eds.) MICCAI 2007, Part II. LNCS, vol. 4792, pp. 287–294. Springer, Heidelberg (2007)
15. Yang, L., Chen, W., Meer, P., Salaru, G., Feldman, M.D., Foran, D.J.: High throughput analysis of breast cancer specimens on the grid. In: Med. Image Comput. Comput. Assist. Interv. (MICCAI 2007), vol. 10(pt. 1), pp. 617–625 (2007)
16. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2, 83–97 (1955)
17. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Upper Saddle River (1988)
18. Glotsos, D.: An image-analysis system based on support vector machines for automatic grade diagnosis of brain-tumour astrocytomas in clinical routine. Medical Informatics and the Internet in Medicine 30(3), 179–193 (2005)
A Probabilistic Segmentation Scheme
Dmitrij Schlesinger and Boris Flach
Dresden University of Technology
Abstract. We propose a probabilistic segmentation scheme which is, to some extent, widely applicable. Besides the segmentation itself, our model incorporates object specific shading. Depending on the application, the latter is interpreted either as a perturbation or as a meaningful object characteristic. We discuss the recognition task for segmentation, learning tasks for parameter estimation, as well as different formulations of shading estimation tasks.
1 Introduction
Segmenting images into meaningful parts is a common task of image analysis. "Common" means that segmentation arises as a subtask in the context of diverse and different applications of image analysis and computer vision. Corresponding research efforts of the previous decades – often motivated by these different contexts – have led to a plethora of segmentation algorithms and their variants. Unfortunately, this does not hold for the number of corresponding models – the algorithms were often constructed in a rather phenomenological manner. This discrepancy is even greater if either supervised or unsupervised learning is required for the parameters of the algorithm/model. On the other hand, by now there is no hope for a universal model/algorithm pair for segmentation. Since segmentation is often a subtask of a recognition task, it provides rather intermediate results, which are determined not by the image(s) only, but also e.g. by feedback from "higher" model parts (like object models). Hence, we believe that segmentation schemes which are, to some extent, widely applicable should have the following properties. Firstly, there should be a clear ab initio model (preferably a modular one). Secondly, it is necessary to have well-posed recognition and learning task formulations. Finally, the scheme should have interfaces (like model parameters) for feedback from higher model levels. These requirements suggest probabilistic models. The main reason for this choice is learning – statistical pattern recognition has by far the most advanced theory of learning. Besides, it is often preferable to have a (posterior) probability distribution of segmentations instead of a unique segmentation. Unfortunately, the choice of a probabilistic model presently disallows continuous variational approaches like e.g. Level Sets [3,2]. These models are not extensible to stochastic ones, mainly because it is not yet clear how to define probability measures on function spaces correctly. Currently, the most popular approach in the scope of discrete models for segmentation is Energy Minimisation [6,1,5,7]. In most cases it corresponds to the
Maximum A-Posteriori Decision with respect to a certain probabilistic model. We believe, however, that an overhasty preference for the MAP-criterion may hide other possible formulations of recognition tasks and – what is more important – reasonable approaches for learning. In the following we present a scheme for segmentation. Its main part is a probabilistic model. In particular we use Gibbs probability distributions of second order to represent the distribution of hidden variables. The second part of the method is the formulation of recognition tasks. We would like to point out that the latter can be formulated and derived in many different ways within the same model. Last but not least we consider the learning of unknown parameters of the probability distribution. We give an unsupervised learning scheme which is based on the Maximum Likelihood principle. In particular we use the Expectation Maximisation algorithm to solve learning tasks approximately.
2 The Model
The sample space of the model is built by triples of the following groups of variables: the image(s), the segmentation field and segment related shading fields. We decided to include the shading into the model for the following reasons. Firstly, shading is often a segment specific concomitant phenomenon. Secondly, in some applications (e.g. Shape from Shading) it is rather a quantity we are interested in, as opposed to just a perturbation variable. Finally, for one and the same model, the task of shading estimation might differ, depending on the application. Throughout the paper we use the following notation. The field of vision R ⊂ Z² is equipped with the structure of an undirected graph G = (R, E), where E denotes the set of edges. S is a finite set of segment labels, where s ∈ S denotes the label of a particular segment. The segmentation field is f : R → S, where f(r) ∈ S denotes the segment chosen for the element r ∈ R of the field of vision. An observation (i.e. image) is denoted by x : R → V, where V is a set of colour values. And finally, h_s : R → V denotes the shading field associated with segment s. The set of all shadings is h = (h_1, h_2, ..., h_|S|). The probability of an elementary event – a triple (f, h, x) – is modelled by

p(f, h, x) = p(f) · p(x | f, h) · ∏_s p(h_s).   (1)

The prior probability distribution for segmentations is assumed to be a Gibbs probability distribution of second order

p(f) ∼ ∏_{rr'∈E} g_segm(f(r), f(r'))   (2)

with g_segm : S × S → R+. A particular and appropriate choice for segmentation problems is e.g. the Potts model

g_segm(s, s') = a > 1 if s = s', and 1 otherwise.   (3)

Similarly, the prior probability of a shading field is

p(h_s) ∼ ∏_{rr'∈E} g_shad(h_s(r), h_s(r')),   (4)

where g_shad : V × V → R+ expresses the prior assumptions about shadings, e.g. smoothness or hard restrictions:

g_shad(v, v') = exp(−(v − v')² / 2σ²),   (5)
g_shad(v, v') = 1{|v − v'| < δ}.   (6)

The probability distribution of an observation, given the other two variables, is assumed to be conditionally independent:

p(x | f, h) = ∏_r q(x(r) − h_{f(r)}(r), f(r)).   (7)

The expression x(r) − h_{f(r)}(r) can be understood as "shading adjusted observation value" at position r and is obtained by subtracting the shading value associated with the segment f(r) chosen in this node. The function q : V × S → R+ is a conditional probability p(v | s) for the colour value v given the segment s.
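For concreteness, the following Python sketch evaluates the unnormalised log-probability of a configuration (f, h, x) under (1)–(7) on a 4-neighbourhood grid. The Potts weight a, the Gaussian shading prior (5) and a zero-mean Gaussian choice for q(v | s) are illustrative assumptions consistent with the equations above, not the authors' parameter settings.

import numpy as np

def log_prob_unnormalised(f, h, x, a=2.0, sigma_shad=5.0, sigma_q=10.0):
    # f: (H, W) integer segment labels; h: (S, H, W) shading fields; x: (H, W) grey values
    logp = 0.0
    # Potts prior, eqs. (2)-(3): reward equal labels on neighbouring pixels
    logp += np.log(a) * np.sum(f[:, :-1] == f[:, 1:])   # horizontal edges
    logp += np.log(a) * np.sum(f[:-1, :] == f[1:, :])   # vertical edges
    # Gaussian smoothness prior on every shading field, eqs. (4)-(5)
    for hs in h:
        logp -= np.sum((hs[:, :-1] - hs[:, 1:]) ** 2) / (2 * sigma_shad ** 2)
        logp -= np.sum((hs[:-1, :] - hs[1:, :]) ** 2) / (2 * sigma_shad ** 2)
    # conditionally independent likelihood, eq. (7), with Gaussian q(v | s)
    rows, cols = np.indices(f.shape)
    residual = x - h[f, rows, cols]          # shading adjusted observation values
    logp -= np.sum(residual ** 2) / (2 * sigma_q ** 2)
    return logp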
3 Recognition and Learning Tasks
In this section we discuss different tasks which can be formulated in the scope of the model. We begin with a typical recognition task – segmentation estimation. Afterwards, we give a scheme for unsupervised learning of the unknown conditional probability distribution q, given an example image. Finally, we consider the shading estimation. It turns out that both "recognition" and "learning" formulations are reasonable for shading estimation. The choice depends on the particular application. We discuss both variants and compare them.

3.1 Recognition
Let us assume that the parameters of the model (e.g. the functions g and q) are known. The shadings h_s are considered as a perturbation and assumed to be known as well. The task is to estimate the segmentation f* given an observation x. We formulate the segmentation problem as a task of Bayesian decision, i.e. we minimise the risk

R(f*) = Σ_f p(f | h, x) · C(f*, f) → min_{f*},   (8)

where C(f*, f) is the loss function. When applying decision theory for pattern classification, it is common to use a (possibly class dependent) constant for the loss associated to a classification error. The situation is quite different in the scope of segmentation: it is reasonable to use e.g. the number of mis-segmented pixels for the loss function. (Note that this loss is usually used when comparing different segmentation algorithms.) Hence, we advocate an additive loss function of the type

C(f*, f) = Σ_r c(f*(r), f(r)),   (9)

where c : S × S → R is a function which penalises deviations of the estimated segment label in a node r from the unknown true one. Substituting a loss function of that type in (8) gives

R(f*) = Σ_f p(f | h, x) · Σ_r c(f*(r), f(r)) = Σ_r Σ_f p(f | h, x) · c(f*(r), f(r)).

Because the second factor depends only on the segment label of the node r, it is possible to split the sum over all segmentation fields f into the sum over all f(R \ r) and the sum over all segmentation labels of the node r. This gives

R(f*) = Σ_r Σ_{s∈S} c(f*(r), s) · p(f(r) = s | h, x) → min,   (10)

where

p(f(r) = s | h, x) = Σ_{f: f(r)=s} p(f | h, x)   (11)

are the a-posteriori marginal probability distributions of states. The optimisation (10) can be performed for each node r independently and gives

f*(r) = arg min_s Σ_{s'} c(s, s') · p(f(r) = s' | h, x)   for each r.   (12)

In particular the additive delta cost function – i.e. c(s, s') = 1{s ≠ s'} in (9) – can be used for segmentation. This leads to the decision

f*(r) = arg max_s p(f(r) = s | h, x)   for each r.   (13)

It is easy to see how this decision should be changed if the shading fields are unknown:

f*(r) = arg max_s p(f(r) = s | x) = arg max_s Σ_h Σ_{f: f(r)=s} p(f, h | x).   (14)

3.2 Learning
Let us consider the task of unsupervised learning of unknown parameters of the probability distribution – e.g. learning the conditional probability distributions q given an image. For simplicity, we still assume that the shadings h_s are known. We follow the Maximum Likelihood principle and maximise

ln p(x | h; q) = ln Σ_f p(f, x | h; q) → max_q,   (15)

where the notation p(...; q) means "parameterised by q". We use the EM-algorithm for approximation. The standard approach leads to the following iterative scheme:

1. Expectation step: compute the a-posteriori probability distribution of segmentations for the current set of parameters, i.e. p(f | h, x; q^(n));
2. Maximisation step: maximise

q^(n+1) = arg max_q Σ_f p(f | h, x; q^(n)) · ln p(f, h, x; q).   (16)

Substitution of the last term in (16) by means of (1)–(7) and omission of all terms which do not depend on q leads to

Σ_f p(f | h, x; q^(n)) · Σ_r ln q(x(r) − h_{f(r)}(r), f(r)) =
Σ_s Σ_r Σ_{f: f(r)=s} p(f | h, x; q^(n)) · ln q(x(r) − h_s(r), s) =
Σ_s Σ_r p(f(r) = s | h, x; q^(n)) · ln q(x(r) − h_s(r), s) → max_q.   (17)

Obviously, this optimisation can be performed for each segment independently. For a particular segment we have

Σ_r p(f(r) = s | h, x; q^(n)) · ln q(x(r) − h_s(r), s) =
Σ_v ln q(v, s) · Σ_{r: x(r)−h_s(r)=v} p(f(r) = s | h, x; q^(n)) → max_q.   (18)

Due to Shannon's theorem, the solution of this task is

q^(n+1)(v, s) ∼ Σ_{r: x(r)−h_s(r)=v} p(f(r) = s | h, x; q^(n)).   (19)

Summarising in a nutshell, an iteration of the learning algorithm is:

1. Expectation step: calculate the marginal a-posteriori probabilities of states in each node – p(f(r) = s | h, x; q^(n));
2. Maximisation step:
(a) Sum up these marginals over all pixels with colour x(r) = v + h_s(r) for each segment s and each colour value v, i.e. compute the "histogram"

hist(v, s) = Σ_{r: x(r)−h_s(r)=v} p(f(r) = s | h, x; q^(n)).   (20)

(b) Finally, normalise it:

q^(n+1)(v, s) = hist(v, s) / Σ_{v'} hist(v', s).   (21)
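The maximisation step thus reduces to weighted histogramming. The following sketch implements (20)–(21), assuming the node marginals have already been approximated by some inference procedure and that colour values are integers in {0, ..., 255}; both assumptions are for illustration only.

import numpy as np

def m_step_q(x, h, marginals, n_values=256):
    # x: (H, W) integer image, h: (S, H, W) shading fields,
    # marginals: (S, H, W) with marginals[s] = p(f(r) = s | h, x; q)
    n_segments = marginals.shape[0]
    q_new = np.zeros((n_values, n_segments))
    for s in range(n_segments):
        v = np.clip(np.rint(x - h[s]).astype(int), 0, n_values - 1)
        # weighted histogram of shading adjusted values, eq. (20)
        np.add.at(q_new[:, s], v.ravel(), marginals[s].ravel())
        q_new[:, s] /= q_new[:, s].sum() + 1e-12     # normalisation, eq. (21)
    return q_new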
Let us consider now the learning task for the case that the shadings are not known. In that case we have to optimise

ln p(x; q) = ln Σ_f Σ_h p(f, h, x; q) → max_q.   (22)

Although this task seems to be much more difficult than the previous one (15), it is not hard to see that all derivations can be performed in a similar way. The only difference is that it is necessary to sum over all possible shadings instead of considering them fixed. Finally, in the maximisation step of the EM-algorithm (19) the sum over certain pixels should be replaced by the sum over all pixels, weighted by corresponding marginal probabilities:

q^(n+1)(v, s) ∼ Σ_r p(f(r) = s, x(r) − h_s(r) = v | x; q^(n)).   (23)
3.3 Shading Estimation
In contrast to the segmentation and learning tasks discussed so far, the situation with the shading is not so straightforward. In many applications the shading is considered as an unknown parameter of the model (caused e.g. by inhomogeneous and anisotropic lighting). This perturbation of the object's appearance should simply be removed from the processed image. In such cases, it is reasonable to estimate the shading according e.g. to the Maximum Likelihood principle. Unfortunately, such a concept disallows incorporation of a-priori assumptions about shading in a "weighted" manner (like e.g. in (5)). The only possibility is to restrict the set of all possible shadings, using for example hard constraints like (6). Another way to deal with shading (which is in fact very similar to the first one) is to consider it as a statistical variable and to use the MAP criterion for its estimation. In some applications it is however not reasonable to consider the shading as a perturbation, because it may characterise certain properties of the segmented objects. Therefore, shading is not an "auxiliary" variable anymore. In such cases, it is appropriate to pose the problem of shading estimation again as a task of Bayesian decision. Hence, a suitable loss function should be chosen for that. In the following we discuss both variants – MAP decision with respect to shading and shading estimation as a task of Bayesian decision. We begin with the first one. The task is to find

ln p(h | x) = ln Σ_f p(f, h | x) → max_h.   (24)

Again, we use the Expectation Maximisation algorithm to avoid summation over all segmentations. The standard approach leads to the following optimisation problem, which should be solved in each maximisation step:

Σ_f p(f | h^(n), x) · ln p(f, h, x) → max_h.   (25)

We substitute our model for p(f, h, x) and obtain

Σ_f p(f | h^(n), x) · [ Σ_s ln p(h_s) + Σ_r ln q(x(r) − h_{f(r)}(r), f(r)) ] =
Σ_s [ ln p(h_s) + Σ_r Σ_{f: f(r)=s} p(f | h^(n), x) · ln q(x(r) − h_s(r), s) ] → max_h.   (26)

Obviously, the optimisation can be performed for each segment s separately:

ln p(h_s) + Σ_r Σ_{f: f(r)=s} p(f | h^(n), x) · ln q(x(r) − h_s(r), s) =
ln p(h_s) + Σ_r p(f(r) = s | h^(n), x) · ln q(x(r) − h_s(r), s) → max_{h_s}.   (27)

Let us denote:

g̃(v, v') = − ln g_shad(v, v')  and  q̃_r(v) = −p(f(r) = s | h^(n), x) · ln q(x(r) − v, s).   (28)

The optimisation problem (27) can then be written in the form

Σ_{rr'∈E} g̃(h_s(r), h_s(r')) + Σ_r q̃_r(h_s(r)) → min_{h_s},   (29)

which is an Energy Minimisation task. If the function g̃(., .) is submodular – which is the case for (5) and (6) – then its solution can be found in polynomial time (see e.g. [8] for details). Now let us discuss the variant that the shading is considered as a stochastic variable and should be estimated by defining an appropriate task of Bayesian decision

R(h*) = Σ_h p(h | x) · C(h*, h) = Σ_h Σ_f p(f, h | x) · C(h*, h) → min_{h*}.   (30)

Again, we advocate an additive loss function of the type

C(h*, h) = Σ_s Σ_r c(h*_s(r), h_s(r)).   (31)

Moreover, in the case of shading estimation the summands c(v, v') can be defined e.g. by c(v, v') = (v − v')², i.e. penalising deviations between the true and the estimated shading values. We omit here the derivation details for the decision strategy because of their similarity to the segmentation case. The Bayesian decision for the shading is its posterior mean value

h*_s(r) = Σ_v v · p(h_s(r) = v | x),   (32)

with the marginal a-posteriori probabilities for shading values in each node

p(h_s(r) = v | x) = Σ_{h: h_s(r)=v} Σ_f p(f, h | x).   (33)

Remark 1. It is well known that the calculation of marginal probabilities for Gibbs/Markov distributions is an NP-complete task. We need these probabilities for segmentation (12), learning (19) and shading estimation (28), (32). This illustrates once more that the search for polynomially solvable subclasses of this problem is an important open question of structural pattern recognition. Until then, we are constrained to use approximate algorithms like the belief propagation method and others [9,10] or the Gibbs sampler [4].
4 Results
To begin with, we present an artificial example. The original image shown in Fig. 1a was produced by filling two segments with gradients which have the same characteristics, and finally adding Gaussian noise. The hard restrictions (6) were used for shading and the task of shading estimation was posed as a Bayesian decision task (30) with the loss function (31). The probability distributions q for the segments were supposed to be zero mean Gaussians. Their variances were estimated unsupervised. The obtained segmentation is shown in Fig. 1c. The image in Fig. 1b was produced by replacing the original grayvalue in each pixel by the value of the estimated shading. The next example (Fig. 2) is an image of a real scene. It is divided into three segments. The conditional probabilities q for the segments were supposed to be arbitrary, but channelwise conditionally independent. They were learned unsupervised by (23). The shading was handled in the same way as in the previous example. The image in Fig. 2c shows the result obtained for the same model but without shading fields. The last example is related to satellite-based glacier monitoring and in particular to the recognition of debris-covered glaciers of the Everest type. We applied our scheme to a visual and thermal infrared channel of the ASTER satellite
Fig. 1. An artificial example: (a) original image, (b) denoised image, (c) segmentation
Fig. 2. An example of a real scene: (a) original image, (b) segmentation, (c) segmentation without shadings
Fig. 3. Segmenting debris covered glaciers: (a) satellite image, (b) expert's segmentation, (c) obtained segmentation
combined with 3D elevation data. The conditional probabilities q were modeled by multivariate normal distributions for six segments and were learned partially unsupervised – during learning we used the expert's segmentation to permit three labels in the background, another two in the foreground (glaciers) and a last one in lake regions. Because of the homogeneous lighting, only one common vector-valued shading (modeled by (5)) was used for all segments and estimated by the MAP-criterion. Fig. 3 shows the visual satellite channel, the expert's segmentation (glaciers and lakes) and the obtained segmentation, which is correct for 96 percent of the pixels.
5 Conclusion
We have presented a probabilistic scheme for segmentation which includes segment specific shading and admits well-posed recognition and learning tasks. In order to use the scheme for a particular application, it is necessary to decide whether certain "parts" of the model should be considered as an (unknown) parameter or as a statistical variable. It is e.g. natural to consider the
segmentation field as a statistical variable – because a particular segmentation is rather an event of a (possibly unknown) probability distribution and not a parameter which characterises a class of images. The situation is different for the shading – both variants are suitable depending on the application. On the other hand, the probability distribution q was considered as a parameter throughout the paper. Though in principle possible, it is rather unnatural to consider it as an (unknown) statistical variable. The reason is twofold: firstly, it is often not possible to introduce a-priori assumptions for q and, secondly, the choice of this function often corresponds to a class of images. Summarising, we suggest considering a "part" of the model as a parameter either if it is hardly possible to formalise a-priori assumptions for it or if a particular instance of this part describes a class of images, and not a particular image. For learning we have used the Maximum-Likelihood principle. However, according to learning theory other choices are possible. It is therefore highly desirable to analyse whether e.g. Minimisation of Empirical Risk can be generalised for structural pattern recognition and in particular for Gibbs/Markov probability distributions.
References
1. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. In: Proc. of the 7th Intl. Conf. on Computer Vision, vol. 1, pp. 377–384 (1999)
2. Cremers, D.: A multiphase level set framework for variational motion segmentation. In: Griffin, L.D., Lillholm, M. (eds.) Scale-Space Methods in Computer Vision (Isle of Skye), vol. 2695, pp. 599–614. Springer, Heidelberg (2003)
3. Cremers, D., Rousson, M., Deriche, R.: A review of statistical approaches to level set segmentation: integrating color, texture, motion and shape. International Journal of Computer Vision 72(2), 195–215 (2007)
4. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6), 721–741 (1984)
5. Kolmogorov, V.: Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 28(10), 1568–1583 (2006)
6. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 65–81. Springer, Heidelberg (2002)
7. Micusik, B., Pajdla, T.: Multi-label image segmentation via max-sum solver. In: Proc. of Computer Vision and Pattern Recognition (2007)
8. Schlesinger, D., Flach, B.: Transforming an arbitrary minsum problem into a binary one. Tech. report TUD-FI06-01, Dresden University of Technology (April 2005), http://www.bv.inf.tu-dresden.de/~ds24/tr kto2.pdf
9. Wainwright, M.J., Jaakkola, T.S., Willsky, A.S.: Tree-reweighted belief propagation algorithms and approximate ML estimation via pseudo-moment matching. In: Workshop on Artificial Intelligence and Statistics (January 2003)
10. Wainwright, M.J., Jaakkola, T.S., Willsky, A.S.: A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory 51(7), 2313–2335 (2005)
Using Eigenvalue Derivatives for Edge Detection in DT-MRI Data
Thomas Schultz and Hans-Peter Seidel
MPI Informatik, Campus E 1.4, 66123 Saarbrücken, Germany
[email protected]
Abstract. This paper introduces eigenvalue derivatives as a fundamental tool to discern the different types of edges present in matrix-valued images. It reviews basic results from perturbation theory, which allow one to compute such derivatives, and shows how they can be used to obtain novel edge detectors for matrix-valued images. It is demonstrated that previous methods for edge detection in matrix-valued images are simplified by considering them in terms of eigenvalue derivatives. Moreover, eigenvalue derivatives are used to analyze and refine the recently proposed Log-Euclidean edge detector. Application examples focus on data from diffusion tensor magnetic resonance imaging (DT-MRI).
1 Introduction
In grayscale images, edges are lines across which image intensity changes rapidly, and the magnitude of the image gradient is a common measure of edge strength. In matrix-valued images, edges have a more complex structure: Matrices have several degrees of freedom, which can be classified as invariant under rotation (shape) and rotationally variant (orientation). In particular, real symmetric matrices can be decomposed into real eigenvalues, which parametrize shape, and eigenvectors, which parametrize orientation. Consequently, there are different types of edges, corresponding to changes in the different degrees of freedom. In this paper, we consider edges in matrix-valued images stemming from diffusion tensor magnetic resonance imaging (DT-MRI), which measures water self-diffusion in different directions and uses second-order tensors (real-valued, symmetric 3 × 3 matrices) to model the directionally dependent apparent diffusivities [1]. DT-MRI is frequently used to image the central nervous system and is unique in its ability to depict neuronal fiber tracts non-invasively. Edge maps of DT-MRI data were first created by Pajevic et al. [2]. They distinguish two types of edges by either considering the full tensor information or only its deviatoric (trace-free) part. More recently, Kindlmann et al. [3] have presented a framework based on invariant gradients, which separates six different types of edges, corresponding to all six degrees of freedom present in a symmetric 3 × 3 matrix. Based on a preliminary description of this approach [4], Schultz et al. [5] have demonstrated the practical relevance of differentiating various types of edges in matrix data for segmentation and smoothing.
The contribution of the present work is to suggest eigenvalue derivatives as a fundamental tool to discern various types of edges in matrix-valued images. After a review of some additional related work in Section 2, Section 3 summarizes some results from perturbation theory [6], which show how to find the derivatives of eigenvalues from matrix derivatives. Since all shape metrics in DT-MRI can be defined in terms of eigenvalues, this allows one to map edges with respect to arbitrary shape measures. Section 4 demonstrates that the existing framework based on invariant gradients [3] can be formulated in terms of eigenvalue derivatives, which allows us to simplify and to extend it. Arsigny et al. [7] have proposed to process the matrix logarithm of diffusion tensors to ensure that the results remain positive definite. Consequently, they use the gradient of the transformed tensor field for edge detection. Section 5 shows that eigenvalue derivatives can also be used to analyze various types of edges in this setting, which has not been attempted before. Finally, Section 6 summarizes our results and concludes this paper.
2 Related Work
In the context of DT-MRI, perturbation theory has previously been used by Anderson [8] to study the impact of noise on anisotropy measures and fiber tracking. This work differs from ours not only in scope, but also in methods, since Anderson considers finite deviations from an ideal, noise-free tensor and differentiability of eigenvalues is not relevant to his task. O’Donnell et al. [9] have distinguished two different types of edges in DT-MRI data by manipulating the certainties in a normalized convolution approach. Their results resemble the ones obtained from deviatoric tensor fields [2]. Kindlmann et al. [10] extract crease geometry from edges with respect to one specific shape metric (fractional anisotropy), and demonstrate the anatomical relevance of the resulting surfaces. Our work points towards a possible refinement of this approach, which will be discussed in higher detail in Section 3.3.
3 Using Perturbation Theory for Edge Detection
3.1 Spectral Decomposition and Shape Metrics
A fundamental tool for the analysis of DT-MRI data is the spectral decomposition of a tensor D into eigenvalues λi and eigenvectors ei (e.g., cf. [6]):

D = Σ_{i=1}^{3} λi ei ei^T = E Λ E^T   (1)
where ei is the ith column of matrix E and Λ is a diagonal matrix composed of the λi . In DT-MRI, it is common to sort the λi in descending order (λ1 ≥ λ2 ≥ λ3 ). If two or all λi are equal, the tensor D is called degenerate.
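In practice, the decomposition (1) with descending eigenvalue order can be obtained directly from a symmetric eigensolver, as the following minimal sketch shows.

import numpy as np

def sorted_spectral_decomposition(D):
    # eigenvalues in descending order and matching eigenvector columns, eq. (1)
    evals, evecs = np.linalg.eigh(D)        # ascending order for symmetric D
    order = np.argsort(evals)[::-1]         # re-sort so lambda1 >= lambda2 >= lambda3
    return evals[order], evecs[:, order]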
A number of scalar measures are used to characterize the shape of a diffusion tensor. All of them can be formulated in terms of eigenvalues. Two popular anisotropy metrics, fractional anisotropy (FA) and relative anisotropy (RA), are due to Basser et al. [11]. They both quantify the degree to which the observed diffusion is directionally dependent:

FA = 3√μ2 / √(2(λ1² + λ2² + λ3²)),   RA = √μ2 / μ1   (2)

where μ1 is the eigenvalue mean (μ1 = (1/3) Σ_i λi) and μ2 is the eigenvalue variance (μ2 = (1/3) Σ_i (λi − μ1)²). A more discriminative set of metrics has been proposed by Westin et al. [12]. It discerns the degree to which a tensor has linear (cl), planar (cp) or spherical (cs) shape. These measures are non-negative and sum to unity, so they can be considered as a barycentric coordinate system:

cl = (λ1 − λ2) / (λ1 + λ2 + λ3),   cp = 2(λ2 − λ3) / (λ1 + λ2 + λ3),   cs = 3λ3 / (λ1 + λ2 + λ3)   (3)
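A direct transcription of (2) and (3), assuming the eigenvalues are already sorted in descending order, might look as follows.

import numpy as np

def shape_metrics(lams):
    # FA, RA and the Westin measures from sorted eigenvalues (l1 >= l2 >= l3)
    l1, l2, l3 = lams
    mu1 = (l1 + l2 + l3) / 3.0                                        # eigenvalue mean
    mu2 = ((l1 - mu1) ** 2 + (l2 - mu1) ** 2 + (l3 - mu1) ** 2) / 3.0  # eigenvalue variance
    fa = 3.0 * np.sqrt(mu2) / np.sqrt(2.0 * (l1 ** 2 + l2 ** 2 + l3 ** 2))
    ra = np.sqrt(mu2) / mu1
    trace = l1 + l2 + l3
    cl = (l1 - l2) / trace
    cp = 2.0 * (l2 - l3) / trace
    cs = 3.0 * l3 / trace
    return fa, ra, cl, cp, cs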
In order to find edges with respect to these shape metrics, we need to differentiate them. The basic rules of differentiation yield corresponding formulae in terms of eigenvalues and their derivatives, so eigenvalue derivatives are required to evaluate them. They will be considered in the following section.

3.2 Eigenvalue Derivatives
For simplicity of notation, this section will assume a tensor field D(t) which is differentiable in a single scalar t ∈ R and denote its derivative with respect to t by D'(t). From perturbation theory, it is known that derivatives of the ordered eigenvalues of a differentiable, symmetric tensor field exist if all eigenvalues are distinct (cf. chapter two in [6]). They are given as the diagonal elements of [D'(t)]_E, which is obtained by rotating D'(t) into the eigenframe of D(t):

[D'(t)]_E = E^T D'(t) E   (4)
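For a sampled tensor field, D' can be estimated e.g. by finite differences or spline derivatives; eq. (4) then gives the eigenvalue derivatives as the diagonal of the rotated derivative tensor. A minimal sketch, ignoring the degenerate case treated next:

import numpy as np

def eigenvalue_derivatives(D, D_prime):
    # eigenvalue derivatives via eq. (4); assumes distinct eigenvalues of D
    evals, E = np.linalg.eigh(D)
    order = np.argsort(evals)[::-1]
    E = E[:, order]                       # eigenframe sorted like lambda1..lambda3
    D_rot = E.T @ D_prime @ E             # [D']_E
    return np.diag(D_rot)                 # derivatives of lambda1..lambda3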
Ordered eigenvalues are generally not differentiable at exceptional points t, at which D(t) is degenerate. Within a small neighborhood of such points, the repeated eigenvalue typically splits into different eigenvalues, which constitute its λ-group [6]. Even if a repeated eigenvalue itself is not differentiable, the mean value of its λ-group is. From [D']_E, this mean derivative can be extracted as the average of the diagonal entries that belong to the duplicated eigenvalue. When interested in eigenvalue derivatives only, one may compute this average and use it as a replacement of the eigenvalue derivative where the latter is undefined. Alternatively, it is possible to find a rotation Ē such that those diagonal entries of [D']_Ē which correspond to a repeated eigenvalue equal its λ-group mean derivative. This requires some additional effort, but has the advantage of cleanly separating changes in shape (diagonal elements) from changes in orientation (off-diagonal) while preserving the magnitude of the derivative as measured by
the Frobenius norm (‖D'‖ = √(tr(D'^T D'))), which is rotationally invariant. The key to this method is to observe that in case of a degeneracy, we are free to choose any set of mutually orthogonal eigenvectors ei which span the eigenspace of the repeated eigenvalue. To identify cases in which the relative distance of two eigenvalues λi and λj is so small that their eigenvectors are no longer numerically well-defined, we introduce the measure ρ_{i,j}:

ρ_{i,j} = |λi − λj| / (|λi| + |λj|)   (5)
Let b_i be the vectors of the assumed basis, and let d^(1)_{j,k} be entry (j, k) of matrix D^(1) = [D']_E. Then, our algorithm works as follows:

1. If ρ_{1,2} < ε, create D^(2) by rotating D^(1) around b_3 such that d^(2)_{1,1} = d^(2)_{2,2}.
2. If ρ_{2,3} < ε, create D^(3) by rotating D^(2) around b_1 such that d^(3)_{2,2} = d^(3)_{3,3}.
3. If ρ_{1,3} < ε, identify i such that d^(3)_{i,i} is in between the remaining two diagonal entries, d^(3)_{j,j} ≤ d^(3)_{i,i} ≤ d^(3)_{k,k}. If d^(3)_{i,i} is larger (smaller) than μ = 0.5 · (d^(3)_{j,j} + d^(3)_{k,k}), rotate around b_k (b_j) such that d^(4)_{i,i} = μ. Afterwards, rotate around b_i such that d^(5)_{j,j} = d^(5)_{k,k}.

The final matrix D^(5) equals [D']_Ē. The correct angles φ for the rotations are found by writing the desired elements of D^(n+1) as trigonometric functions of elements from D^(n) and φ and solving the specified equalities for φ. To avoid visible boundaries that would result from a fixed threshold ε, we perform a gradual transition between the non-degenerate and the degenerate case (ρ = 0) by scaling rotation angles φ by (1 − ρ/ε). In our experiments, ε = 0.05. Note that this interpolation requires us to always select the smallest angle φ which solves the given trigonometric equality.
bi such that dj,j = dk,k . The final matrix D(5) equals [D ]E ¯ . The correct angles φ for the rotations are found by writing the desired elements of D(n+1) as trigonometric functions of elements from D(n) and φ and solving the specified equalities for φ. To avoid visible boundaries that would result from a fixed threshold , we perform a gradual transition between the non-degenerate and the degenerate case (ρ = 0) by scaling rotation angles φ by (1 − ρ/). In our experiments, = 0.05. Note that this interpolation requires us to always select the smallest angle φ which solves the given trigonometric equality. 3.3
Experimental Results
We used component-wise convolution with a cubic B-spline kernel [2] to obtain a differentiable tensor field from the discrete sample values. This method preserves positive definiteness and implies slight smoothing. Figure 1 (a) presents a cl map of a coronal section of the brainstem. It reveals several tracts, which have been annotated by an expert: The pontine crossing tract (pct), the superior cerebellar peduncle (scp), the decussation of the superior cerebellar peduncle (dscp), the corticopontine/corticospinal tract (cpt/cst), and the middle cerebellar peduncle (mcp). Figure 1 (b) is produced by computing the total derivative of cl , cl =
λ1 (−2λ2 − λ3 ) + λ2 (2λ1 + λ3 ) + λ3 (λ1 − λ2 ) (λ1 + λ2 + λ3 )2
(6)
and using eigenvalue derivatives to evaluate it. Minima in edge strength nicely separate adjacent fiber bundles, which was confirmed by overlaying the edges
Fig. 1. Eigenvalue derivatives allow one to map edges in cl, which clearly separate adjacent fiber tracts in DT-MRI data (b). Neither evaluating cl at grid points (c) nor mapping edges in other shape metrics (d+e) produces results of comparable quality. (a) Annotated cl map of the brainstem; (b) edges in cl, from eigenvalue derivatives; (c) approximate edges from scalar cl samples; (d) edges in FA; (e) edges in cs; (f) plots of cl (—), FA (– –), and cs (· · ·).
onto a color coded direction map [13] (for references to color, please cf. the electronic version of this article). Figure 1 (c) illustrates that due to the nonlinearity of cl , it is not sufficient to evaluate this metric at grid points and to construct an edge map by computing gradients from the resulting scalar samples. Similar results have previously been obtained by Kindlmann et al. [10], who employ valley surfaces of fractional anisotropy (FA) to reconstruct interfaces between adjacent tracts of different orientation. Figure 1 (d) presents an FA edge map of the same region. Unlike cl , FA can be formulated directly in terms of tensor components, so exact edge maps do not require eigenvalue derivatives. A comparison to Figure 1 (b) suggests that cl produces more pronounced fiber path boundaries. In particular, the dscp is hardly separated in the FA edge map. The observation that a shape metric like FA can be used to find boundaries in orientation has been explained by the fact that partial voluming and componentwise interpolation lead to more planar shapes in between differently oriented tensors [10]. Since cl is more sensitive to changes between linearity and planarity than FA is, this explains why it is better suited to identify such boundaries. Further evidence is given in Figure 1 (e), which presents edges in cs , an isotropy measure that completely ignores the difference between linearity and planarity, and which consequently is less effective at separating tracts than FA. The visual impression is confirmed by Figure 1 (f), which plots cl (solid line), FA (dashed line) and cs (dotted line, uses the axis on the right) against vertical voxel position along a straight line that connects the centers of dscp and scp. It exhibits a sharp minimum in cl , a shallow minimum in FA, and no extremum in cs .
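For concreteness, Equation (6) can be evaluated directly from the eigenvalues and their derivatives; the following plain-Python sketch (illustrative function name, not a reference implementation) mirrors that computation.

def cl_and_derivative(lam, dlam):
    """Sketch: Westin linear-anisotropy measure c_l and its total derivative,
    evaluated from eigenvalues lam = (l1, l2, l3) and eigenvalue derivatives
    dlam = (l1', l2', l3'), cf. Equation (6)."""
    l1, l2, l3 = lam
    d1, d2, d3 = dlam
    T = l1 + l2 + l3
    cl = (l1 - l2) / T
    # quotient rule applied to c_l = (l1 - l2) / (l1 + l2 + l3)
    dcl = (d1 * (2 * l2 + l3) - d2 * (2 * l1 + l3) - d3 * (l1 - l2)) / T**2
    return cl, dcl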
4 Invariant Gradients in Terms of Eigenvalue Derivatives
The currently most sophisticated method for detecting different types of edges in DT-MRI data has been suggested by Kindlmann et al. [3]. It is based on considering shape invariants J_i as scalar functions over Sym3, the vector space of symmetric, real-valued 3×3 matrices, and computing their gradient ∇_D J_i, which is an element of Sym3 for each tensor D. Then, a set {∇̂_D J_i} of normalized orthogonal gradients is used as part of a local basis, and the coordinates of a tensor derivative D' with respect to that basis specify the amount of tensor change which is aligned with changes in the corresponding invariant J_i.

Bahn [14] treats the eigenvalues as a fundamental parametrization of the three-dimensional space of tensor shape. Within this eigenvalue space S ≅ R^3, he proposes a cylindrical and a spherical coordinate system, where both the axis of the cylinder and the pole of the sphere are aligned with the line of triple eigenvalue identity (λ1 = λ2 = λ3). The resulting coordinates are shown to be closely related to standard DT-MRI measures like trace and FA. Ennis and Kindlmann [15] point out that the two alternative sets of invariants in their own work, K_i and R_i, are analogous to the cylindrical (K_i) and spherical (R_i) eigenvalue coordinate systems, but that those cannot be easily applied for edge detection, because they are not formulated in terms of tensor components.

A connection between both approaches can be made via eigenvalue derivatives: Restricting the tensor [D']_Ē from Section 3.2 to its diagonal yields a vector in R^3, which describes the shape derivative in eigenvalue space. Moreover, the analogous definitions of the tensor scalar product ⟨A, B⟩ = tr(A^T B) and the standard dot product on R^3 preserve magnitudes and angles when converting between both representations. This means that once D' has been rotated such that eigenvalue derivatives are on its diagonal, we can alternatively analyze shape changes in eigenvalue space S or in Sym3, and obtain equivalent results. This insight simplifies the derivation of invariant gradients: Instead of having to isolate them from the Taylor expansion (as in the appendix of [15]), invariants J_i can now be considered as functions over eigenvalue space S, and their gradients ∇_S J_i in S are simply found via the basic rules of differentiation. This makes it possible to extend the invariant gradients framework towards the Westin metrics. The corresponding gradients are

∇_S cl ∼ (2λ_2 + λ_3, −2λ_1 − λ_3, −λ_1 + λ_2)^T
∇_S cp ∼ (−λ_2 + λ_3, λ_1 + 2λ_3, −λ_1 − 2λ_2)^T
∇_S cs ∼ (−λ_3, −λ_3, λ_1 + λ_2)^T   (7)

where scalar prefactors have been omitted for brevity, because the gradients will be normalized before use. All three are orthogonal to ∇_S‖D‖ ∼ (λ_1, λ_2, λ_3)^T. It has been pointed out [15] that the gradients of the Westin measures cannot be used as part of a basis of tensor shape space, which follows immediately from the fact that they provide three coordinates for a two-dimensional space. However, one may still select an arbitrary measure (cl, cp, or cs) as part of an orthonormal basis of S. The basis is then constructed from the normalized ∇_S‖D‖, the normalized selected gradient from Equation (7), and a third vector which is the cross product of the first two and captures any remaining changes in shape.
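As an illustration, Equation (7) and the construction of such an orthonormal shape basis can be written down directly in eigenvalue space; the sketch below assumes eigenvalues in descending order and uses illustrative names.

import numpy as np

def westin_shape_basis(lam, select="cp"):
    """Sketch: unnormalized Westin gradients in eigenvalue space (Equation (7))
    and an orthonormal shape basis built from the norm gradient, one selected
    Westin gradient, and their cross product."""
    l1, l2, l3 = lam
    grads = {
        "cl": np.array([2*l2 + l3, -2*l1 - l3, -l1 + l2]),
        "cp": np.array([-l2 + l3, l1 + 2*l3, -l1 - 2*l2]),
        "cs": np.array([-l3, -l3, l1 + l2]),
    }
    n = np.array([l1, l2, l3])                   # direction of the gradient of ||D||
    unit = lambda v: v / np.linalg.norm(v)
    b1, b2 = unit(n), unit(grads[select])
    b3 = np.cross(b1, b2)                        # third axis: remaining shape changes
    return b1, b2, b3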
Fig. 2. Extending the invariant gradients framework towards the Westin metrics allows for a sharper version of the adjacent orthogonality (AO) measure. (a) Annotated cl map (coronal section); (b) AO measure from [3]; (c) sharper AO measure involving ∇cp.
The fiber tracts in Figure 2 (a) are the superior longitudinal fasciculus (slf), internal capsule (ic), corpus callosum (cc), cingulum (cing), and fornix (fx). Figure 2 (b) presents a map of the adjacent orthogonality (AO) measure, defined in [3] from the coordinates of the tensor field derivative in the invariant gradients framework as AO = (|∇̂R_3|² + |∇̂φ_3|²)^{1/2}. It separates differently oriented tracts, based on shape changes towards planarity, measured by ∇̂R_3, and rotations around e_3, measured by ∇̂φ_3. For detailed information on rotation tangents Φ_i, which are used to analyze changes in orientation, the reader is referred to [3].

For the task of separating differently oriented tracts, we select a basis of S that has ∇_S cp as one of its axes. Since cp reacts more specifically to changes in planarity than R_3 does, a sharper version of AO is then obtained by replacing ∇̂R_3 with ∇̂cp. In particular, this produces clear borders in some locations where the original formulation of AO indicates no or only very unsharp boundaries (marked by arrows in Figure 2 (c)). Overlaying them on a color coded direction map confirms that they correspond precisely to tract interfaces.
5 Edge Detection in the Log-Euclidean Framework

The fact that negative diffusivities do not have any physical meaning restricts diffusion tensors to the positive definite cone within Sym3, which is closed under addition and multiplication by positive scalars. Arsigny et al. [7] point out that after taking the matrix logarithm, one may process diffusion tensors with arbitrary (even non-convex) operations without leaving the positive definite cone in the original space, because the inverse map, the matrix exponential, maps all real numbers back to positive values. The matrix logarithm log D of a diffusion tensor D is computed by performing its spectral decomposition (Equation (1)) and taking the logarithm of the eigenvalues. Within this framework, edge strength is measured as ‖∇ log D‖. We refer to this as the Log-Euclidean edge detector, in contrast to the standard Euclidean edge detector ‖∇D‖. To simplify computations, it has been suggested to evaluate
(a) ∇ log D via finite differences
(b) ∇ log D via chain rule
(c) same as (a), after tensor estimation as in [16]
Fig. 3. Approximating ∇ log D via finite differences of logarithms leads to artifacts near steep edges ((a) and (c)). They are avoided by using the chain rule instead (b).
log D at sample positions and to obtain an approximation of ∇ log D by taking finite differences between the resulting matrices [7]. However, as we have seen in Figure 1 (c), approximating the derivative of a nonlinear function via finite differences may not produce sufficiently exact results. In fact, near steep edges like those between the ventricles (vent) and brain tissue, we observed artifacts in approximated edge maps of ‖∇ log D‖, marked by two arrows in Figure 3 (a).

This problem can be avoided by applying the multivariate chain rule of differentiation, which is again simplified by considering eigenvalue derivatives. In the eigenframe of D, the Jacobian of the matrix logarithm log D takes on diagonal form: Let d_{i,j} be entry (i, j) of [D]_Ē and l_{i,j} the corresponding entry of [log D]_Ē. Then, ∂l_{i,i}/∂d_{i,i} = d_{i,i}^{-1} = λ_i^{-1}, and for i ≠ j, ∂l_{i,j}/∂d_{i,j} = (log d_{i,i} − log d_{j,j})/(d_{i,i} − d_{j,j}). All other partial derivatives vanish. Thus, we simply obtain [(log D)']_Ē from [D']_Ē by multiplying entries (i, i) by λ_i^{-1} and entries (i, j), i ≠ j, by (log λ_i − log λ_j)/(λ_i − λ_j). The resulting corrected map is shown in Figure 3 (b).

We confirmed that the artifacts in Figure 3 (a) are not caused by estimating the tensors via the standard least squares method [1] and clamping the rare negative eigenvalues to a small positive epsilon afterwards. They are still present when using the gradient descent approach from [16], which integrates the positive definiteness constraint into the estimation process itself (Figure 3 (c)).

The reformulation of the invariant gradients framework in terms of eigenvalue space allows us to apply it to Log-Euclidean edge detection, simply by considering the natural logarithms of eigenvalues as the fundamental axes of tensor shape space. Similar to the Euclidean case, the cylindrical coordinate system from [14] can be used to separate meaningful types of edges. Figure 4 (a) shows edges in overall diffusivity, Figure 4 (b) edges in anisotropy. In both cases, the Euclidean result is on the left, the Log-Euclidean one on the right. The Log-Euclidean approach measures overall diffusivity by the matrix determinant instead of the trace. The determinant also reflects eigenvalue dispersion, which explains why the contours of some fiber tracts appear in the right image of Figure 4 (a). In the Euclidean case, they are isolated more cleanly in the anisotropy channel. Consequently, anisotropy contours appear more blurred in the Log-Euclidean case.
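A minimal sketch of the chain-rule correction described above, assuming the eigenvalues of D and the rotated derivative [D']_Ē are already available (illustrative function name; nearly equal eigenvalues fall back to the limit of the divided difference):

import numpy as np

def log_tensor_derivative(lam, Dp_eig):
    """Sketch: derivative of log(D) in the eigenframe of D via the chain rule.
    lam     -- eigenvalues (lambda1, lambda2, lambda3) of D
    Dp_eig  -- [D']_E, the tensor derivative rotated into the eigenframe
    Returns [(log D)']_E; its Frobenius norm gives the Log-Euclidean edge strength."""
    L = np.empty((3, 3))
    for i in range(3):
        for j in range(3):
            if i == j:
                L[i, j] = Dp_eig[i, i] / lam[i]
            elif np.isclose(lam[i], lam[j]):
                # limit of (log li - log lj)/(li - lj) for li -> lj
                L[i, j] = Dp_eig[i, j] / lam[i]
            else:
                L[i, j] = Dp_eig[i, j] * (np.log(lam[i]) - np.log(lam[j])) / (lam[i] - lam[j])
    return L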
Fig. 4. Different types of edges are distinguished more cleanly by a Euclidean (left sub-images) than by a Log-Euclidean edge detector (right sub-images). (a) Changes in overall diffusivity; (b) changes in anisotropy.
Overlaying them on a principal eigenvector color map indicates that they are offset towards the inside of fiber tracts in some places (arrows in Figure 4 (b)). Maps of the third shape axis, which captures transitions between linearity and planarity, were extremely similar (not shown).
6 Conclusion
Given the ubiquity of the spectral decomposition in DT-MRI processing, eigenvalue derivatives are a natural candidate for the analysis of local changes in this kind of data. In the present work, we have used them to generate edge maps with respect to the widely used Westin shape metrics [12], which we have shown to identify anatomical interfaces in real DT-MRI data and to allow for a more specific analysis of changes in tensor shape than has been possible with previously suggested edge detectors. The existing edge detection framework based on invariant gradients [3] is both simplified and easily extended by considering it in terms of eigenvalue derivatives. Finally, we have applied our results to analyze the relatively recent Log-Euclidean edge detector [7]. We have both corrected a source of artifacts in its previously proposed form and demonstrated that it, too, allows for separation of different types of edges, yet with a slightly lower anatomical specificity than the more traditional Euclidean detector.
Acknowledgements. We would like to thank Bernhard Burgeth, who is with the Mathematical Image Analysis group at Saarland University, Germany, for in-depth discussions and for proofreading our manuscript. We are grateful to Alfred Anwander, who is with the Max Planck Institute for Human Cognitive and Brain Sciences in Leipzig, Germany, for providing the dataset and helping with the annotations. We thank Gordon Kindlmann, who is with the Laboratory of Mathematics in Imaging, Harvard Medical School, for providing the teem libraries (http://teem.sf.net/) and giving feedback on our results. This work has partially been funded by the Max Planck Center for Visual Computing and Communication (MPC-VCC).
References 1. Basser, P.J., Mattiello, J., Bihan, D.L.: Estimation of the effective self-diffusion tensor from the NMR spin echo. Journal of Magnetic Resonance B(103), 247–254 (1994) 2. Pajevic, S., Aldroubi, A., Basser, P.J.: A continuous tensor field approximation of discrete DT-MRI data for extracting microstructural and architectural features of tissue. Journal of Magnetic Resonance 154, 85–100 (2002) 3. Kindlmann, G., Ennis, D., Whitaker, R., Westin, C.F.: Diffusion tensor analysis with invariant gradients and rotation tangents. IEEE Transactions on Medical Imaging 26(11), 1483–1499 (2007) 4. Kindlmann, G.: Visualization and Analysis of Diffusion Tensor Fields. PhD thesis, School of Computing, University of Utah (September 2004) 5. Schultz, T., Burgeth, B., Weickert, J.: Flexible segmentation and smoothing of DT-MRI fields through a customizable structure tensor. In: Bebis, G., Boyle, R., Parvin, B., Koracin, D., Remagnino, P., Nefian, A.V., Gopi, M., Pascucci, V., Zara, J., Molineros, J., Theisel, H., Malzbender, T. (eds.) ISVC 2006. LNCS, vol. 4291, pp. 455–464. Springer, Heidelberg (2006) 6. Kato, T.: Perturbation theory for linear operators, 2nd edn. Die Grundlehren der mathematischen Wissenschaften, vol. 132. Springer, Heidelberg (1976) 7. Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Log-euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic Resonance in Medicine 56(2), 411–421 (2006) 8. Anderson, A.W.: Theoretical analysis of the effects of noise on diffusion tensor imaging. Magnetic Resonance in Medicine 46(6), 1174–1188 (2001) 9. O’Donnell, L., Grimson, W.E.L., Westin, C.F.: Interface detection in diffusion tensor MRI. In: Barillot, C., Haynor, D.R., Hellier, P. (eds.) MICCAI 2004. LNCS, vol. 3216, pp. 360–367. Springer, Heidelberg (2004) 10. Kindlmann, G., Tricoche, X., Westin, C.F.: Delineating white matter structure in diffusion tensor MRI with anisotropy creases. Medical Image Analysis 11(5), 492–502 (2007) 11. Basser, P.J., Pierpaoli, C.: Microstructural and physiological features of tissues elucidated by quantitative-diffusion-tensor MRI. Journal of Magnetic Resonance B(111), 209–219 (1996) 12. Westin, C.F., Peled, S., Gudbjartsson, H., Kikinis, R., Jolesz, F.A.: Geometrical diffusion measures for MRI from tensor basis analysis. In: International Society for Magnetic Resonance in Medicine 1997, Vancouver, Canada, p. 1742 (1997) 13. Pajevic, S., Pierpaoli, C.: Color schemes to represent the orientation of anisotropic tissues from diffusion tensor data: application to white matter fiber tract mapping in the human brain. Magnetic Resonance in Medicine 42(3), 526–540 (1999) 14. Bahn, M.M.: Invariant and orthonormal scalar measures derived from magnetic resonance diffusion tensor imaging. Journal of Magnetic Resonance 141, 68–77 (1999) 15. Ennis, D.B., Kindlmann, G.: Orthogonal tensor invariants and the analysis of diffusion tensor magnetic resonance images. Magnetic Resonance in Medicine 55(1), 136–146 (2006) 16. Fillard, P., Pennec, X., Arsigny, V., Ayache, N.: Clinical DT-MRI estimation, smoothing and fiber tracking with log-euclidean metrics. IEEE Transactions on Medical Imaging 26(11), 1472–1482 (2007)
Space-Time Multi-Resolution Banded Graph-Cut for Fast Segmentation

Tobi Vaudrey¹, Daniel Gruber², Andreas Wedel³,⁴, and Jens Klappstein³

¹ The University of Auckland, New Zealand
² Universität Konstanz, Germany
³ Daimler Group Research, Sindelfingen, Germany
⁴ Universität Bonn, Germany
Abstract. Applying real-time segmentation is a major issue when processing every frame of image sequences. In this paper, we propose a modification of the well known graph-cut algorithm to improve speed for discrete segmentation. Our algorithm yields real-time segmentation, using graph-cut, by performing a single cut on an image with regions of different resolutions, combining space-time pyramids and narrow bands. This is especially suitable for image sequences, as segment borders in one image are refined in the next image. The fast computation time allows one to use information contained in every image frame of an input image stream at 20 Hz, on a standard PC. The algorithm is applied to traffic scenes, using a monocular camera installed in a moving vehicle. Our results show the segmentation of moving objects with similar results to standard graph-cut, but with improved speed.
1 Introduction
Separating structures of interest from the background is a well known problem in the computer vision community. The goal of such segmentation is to reduce the representation of an image into meaningful data. Segmentation tasks include: identifying anatomical areas of interest in medical image processing [13], optimal contour finding of user defined foreground and background image areas [14,19], and segmenting areas of similar motion for machine vision [20]. The goal of this paper is the separation of moving and stationary objects using a monocular camera installed in a moving vehicle. Due to the motion of the vehicle, simple background subtraction does not work in this case, because (apart from the focus of expansion) the whole image changes from frame to frame (see Figure 1). This is also different to motion segmentation, namely the grouping of image regions which are similar in their motion, because for some constellations, such as motion of non-rigid objects, no parametric motion field exists. Last, the binary segmentation needs to be performed in real-time on video streams. We formulate an energy function based on the boundary length, the number of foreground pixels, and the assigned probabilities of a pixel being foreground (moving object) or background (stationary world). Several techniques have been developed for image segmentation [15]. Two methods dominate the domain of
Fig. 1. The images show the flow field of image features in a traffic scene and a segmented moving object. The image on the right shows the multi-resolution graph structure obtained using the approach in this paper for real-time segmentation.
segmentation by energy minimization in the image domain: differential methods, such as level sets [3,12], and discrete methods. Among the discrete methods, the graph-cut algorithm [8] finds the globally optimal energy in polynomial time. However, the time to compute the energy is still far from real-time for many applications. Real-time performance is important when working on image streams, where the information in every frame has to be processed. This is especially relevant in the field of motion segmentation, where small time steps between consecutive images are important, as motion estimation is based on gradient descent and the displacement of pixels is crucial [4,18]. The contribution of this paper is a modification of the input images, such that the graph-cut gains real-time performance on image sequences. We reduce the dimension of the search space for the segmentation by executing the graph-cut algorithm on different scales of the input image, in one single cut. The original image is segmented at a low resolution, then segmentation boundaries are refined throughout the image sequence. The cut is still optimal on the modified input images, such that no repeated iterations on the same image are required. Experimental results show that for most problems, where the segment boundary consists of only a small set of image pixels, this algorithm yields real-time performance on standard computer hardware. The next section contains a literature overview, investigating approaches for speeding up the graph-cut algorithm. Our novel approach is described in Section 3. Results and comparisons to the algorithms described in the literature are found in Section 4. The paper closes with conclusions and an outlook on future work.
2 Literature Overview
Graph-cut for image segmentation was first introduced by [5] for image restoration. Its impact on the computer vision community rapidly increased after the publication of [1], where an algorithm especially designed for the segmentation of image domains was proposed. The algorithm reduced average segmentation times from several minutes to a few seconds for usual image domains, where every pixel yields one node in the resulting graph. However, most research on graph-cut focuses on single images and not on real-time image streams, where computation time becomes crucial. In our experiments we use frame rates of up
to 25 frames per second, demanding accordingly fast segmentations. In the remaining part of this section we will explain the graph-cut algorithm and review techniques to speed up graph-cut computation time.

2.1 Graph-Cut
The image being segmented is modelled as a weighted undirected graph. Each pixel corresponds to a node in the graph, and an edge is formed between two neighbouring pixels. We use an N4 neighbourhood where each pixel has four neighbours (upper, lower, left, and right), representing the Manhattan metric. The weight of an edge is a measure of similarity between its end nodes (in this case pixels). We use the grey value difference of the end nodes as similarity measure. With the maximal grey value I_max and a scale factor c, the weight of an edge e_{x,y} connecting x and y is

w(e_{x,y}) = c · (1 − |I(x) − I(y)| / I_max)²   (1)
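A possible reading of Equation (1) in code; the use of the absolute grey value difference and the default values of c and I_max are assumptions made for illustration.

def edge_weight(I, p, q, c=1.0, I_max=255.0):
    """Sketch of Equation (1): similarity weight of the N4 edge between pixels p and q.
    I is expected to be a 2D grey-value image (e.g. a NumPy array), p and q index tuples."""
    diff = abs(float(I[p]) - float(I[q]))
    return c * (1.0 - diff / I_max) ** 2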
Additional edges connect nodes with a source and a sink. They have weights corresponding to a node's foreground and background probability. To detect and segment moving objects we use the motion constraints described in [6]. The constraints are based on a motion analysis of individual tracked image features. We use a real-time implementation of the tracking described in [17]. The tracked features are 3D reconstructed. A point is detected as moving if its reconstruction is identified as erroneous by checking whether it fulfils the constraints:

– Epipolar Constraint: This constraint ensures that the viewing rays (joining projection centres and the 3D point) from both cameras must meet. If this constraint is violated, no reconstruction is possible and the 3D position of the point has changed between both frames of the sequence.
– Positive Depth Constraint: If viewing rays intersect behind the camera, the 3D point must be moving.
– Positive Height Constraint: All static points in traffic scenes lie above the ground plane. If viewing rays intersect under the ground plane, the 3D point must be moving.

For details on how to exploit these constraints, refer to [6]. For every tracked point an error measure is computed, measuring the distance in pixels to the next valid point fulfilling the above constraints. This error value serves as the weight for an edge from tracked pixels to the source (background) or sink (foreground); in our experiments we use w = (1 + exp(6 − 3x))^(−1), where x is the error. The optimal s-t (source to sink) separating cut of the graph is one that minimizes the sum of the weights of the edges that are removed (referred to as energy), such that no more connections between source and sink exist. Figure 2 shows the model of a graph and its optimal cut.
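The terminal weights derived from the motion-constraint error can be sketched analogously (the constants 6 and 3 are those given above; the function name is illustrative):

import math

def terminal_weight(error_px):
    """Sketch: weight of the edge between a tracked pixel and the sink (foreground),
    using the sigmoid w = (1 + exp(6 - 3x))^(-1) of the motion-constraint error x in pixels."""
    return 1.0 / (1.0 + math.exp(6.0 - 3.0 * error_px))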
Fig. 2. The plot shows the segmentation of an image by performing an s-t cut. The pixels correspond to nodes of the graph (blue). Edges for foreground and background pixels connecting these pixels with the source and sink respectively have cost 1.0. Edges between neighbouring pixels, connecting their corresponding nodes in the graph, have costs 0.3 or 0.1, depending on the grey value difference of the two pixels. The minimal s-t separating cut with cost 3.0 is given by the red boundary. Although we use sparse data representing the foreground and the background, one connected region with its boundary along a grey value difference is found.
2.2 Banded Cut on Image Pyramids
The simplest way to decrease computational time is using images of lower resolution, thus reducing the number of nodes and edges. However, this is at the expense of less segmentation accuracy, especially along the boundaries. In [9] the authors propose the use of image pyramids and refine the boundaries on lower pyramid images as illustrated in Figure 3. When image features are small, they might not be detected on lower pyramid levels as they are smoothed out. One possibility to account for these changes was proposed in [16]. An additional Laplace pyramid is built, which represents the lost information from the image down-sampling process in the high frequency spectrum. If the values in the Laplace image exceed a certain threshold, additional pixels are introduced into the graph. The drawback of pyramid approaches is the inherent iterative process, where consecutive graph-cut steps on lower pyramid levels have to wait for
Fig. 3. The figure shows an example of an image pyramid (left). A full graph cut is performed on the highest pyramid level and the result is iteratively propagated downwards, then refined in a band around the result of the upper pyramid level.
the preceding step to finish. This causes the undesired effect of the graph-cut being carried out several times on the same input image, once for every pyramid resolution.

2.3 Temporal Banded Cut
Recall that we are interested in detecting moving objects. If the frame rate is high, this implies that segmentation boundaries in consecutive images are similar. Instead of predicting the segmentation boundary from the highest to lowest pyramid level, the segmentation of the current image is used as a hint for the cut of the next image in the sequence, and refined in a band around the segmentation border. This is done by [10] and [11]. The major drawback of this method is that essentially only tracking is performed; therefore, new objects that appear are only detected if they are within the band width of the boundary for existing objects. In the papers, a human initialization step is required. We propose to solve this problem by a temporal prediction on pyramids.

2.4 Temporal Prediction on Pyramids
Instead of propagating segmentation boundaries from a higher pyramid level onto a lower pyramid level of the same image, this propagation is done in the subsequent image of an image sequence (green arrows in Figure 4). Using the assumption that the segments' motion in the image is roughly known, this combines the detection in the pyramid scheme and the temporal prediction (tracking) of segmentation boundaries. Another advantage of this method is that the graph-cut algorithms for the single pyramid levels can be executed in parallel, resulting in an improved computational time when using parallel processing computers.

Other Techniques. Further approaches to speed up the segmentation algorithm have been presented in [7]. The authors use the search tree of the old segmentation as a hint for the current segmentation. In [2], an approximation of the energy is proposed which yields faster convergence. Both approaches are applied on a graph
Fig. 4. This shows the temporal prediction on pyramids. Instead of propagating results down (red), or from one time instance to another on the same pyramid level (blue), a combined approach propagates results between frames and pyramid levels to combine both real-time detection and real-time tracking.
without changing its topology. In the following section, a novel approach is presented that is related to temporal prediction on pyramids, but performs a single cut on an image with multiple resolutions. This is related to surface approximations, where primitives may consist of different scales, such that a good approximation is achieved under the constraint of using a minimal number of primitives (e.g., 3D surface modelling).
3 Multi-Resolution Image
If parallel processing is not possible, the temporal prediction on pyramids yields no computation time gain compared to the plain vanilla pyramid implementation of the graph-cut algorithm. A closer look at the temporal prediction also reveals that situations occur, where the same segmentation boundary is found on every image level, such that a prediction onto the next time instance creates similar problems as using a full pyramid. The idea to solve this issue is to take the highest pyramid image (lowest resolution) and iteratively refine the pixels into 4 sub pixels if they are within the segmentation border of the lower pyramid level (Figure 5 shows the result; see also Figure 6 for a schematic illustration). The resulting image consists of different resolutions representing different pyramid layers, thus different segmentation accuracies. A high resolution is obtained at segmentation boundaries (inhomogeneous background / foreground regions) and a low resolution at homogeneous image regions. In the next section we will show how to obtain a graph representation of a multi-resolution image. The advantage is obvious, as only a single cut has to be performed on the multi-resolution image.
Fig. 5. The left image shows the original frame with a segmentation overlay. The right image shows a close-up of the calculated multi-resolution image. Areas around segmentation boundaries have high resolution and in the other areas, pixels are combined to provide a low resolution.
3.1 Multi-Resolution Graph-Cut
The mapping of the multi-resolution image onto the graph is straightforward. In the first frame t, a coarse grid is created (see Figure 6). Every grid section is a
Fig. 6. The graph shows the iterative segmentation approach of the multi-resolution graph-cut. The dashed line shows the edge of the true segmentation. Blue grid lines identify graph sections (nodes). Red dots touching a blue grid line indicate edges.
node, every node has an edge from the source and one to the sink. Pixels within the same grid are combined by adding their weights from the source and to the sink. The remaining edges, for each node, are created from the N4 boundary grid nodes (identified by any red pixel touching a blue grid line in Figure 6). The edge values are based on the energy function mentioned in Equation 1. The weights of all pixel edges connecting two grid cells are summed to obtain a single edge weight. The graph-cut is performed on the coarse grid, providing a rough segmentation. In the next time frame (t + 1), the grid sections on the boundaries of the segmentations are split into 4 subsections. The old node (grid section) and corresponding edges are removed from the graph and the new nodes are added. Finally, the new edges are added to the graph. New edges can be seen in Figure 6 as any red node that is touching a blue grid line. This process is then refined again in the next frame (t + 2). This will happen iteratively over time until the segmentation boundary is refined to single pixel detail. As we are segmenting image sequences, we need to deal with the implicit movement within the images over time. The movement of the segmentation area
Fig. 7. The plot compares the calculated minimum cut costs for a VGA (640 × 480 pixels) test sequence of 25 images. The minimum cut cost of the multi-resolution method increases slightly with larger grid size and is close to the cost of the full graph-cut.
can be estimated by using the tracked features used for the motion constraints. In this case, a segmentation boundary in frame (t + 1) is estimated from frame t using the motion constraint data. The segmentation in frame (t + 1) is performed as above, using the estimated boundary position. The graph-cut is performed over the entire image, still allowing new features to be detected. Another issue that needs to be dealt with when using this approach is when there are refined areas in the image that are not located on a segmentation boundary of the subsequent frame. When this happens, the process is reversed over the involved homogeneous grid cells, reverting to a lower resolution. Figure 7 shows a comparison of our algorithm in terms of minimal cut costs, identifying a minor loss in minimization. Note: do not confuse the cut cost, which refers to the energy minimum (see Figure 7), with the computational cost, which is discussed in the results section.
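A minimal sketch of the per-frame splitting of boundary cells described in this section; the cell representation and the boundary predicate are assumptions, and the coarsening (merge) step as well as the actual graph construction and max-flow computation are omitted:

def refine_grid(cells, on_boundary, min_size=1):
    """Sketch: one refinement step of the multi-resolution grid between frames.
    cells is a list of (x, y, size) squares; cells on the current segmentation
    boundary are split into four sub-cells, all other cells are kept coarse."""
    refined = []
    for (x, y, s) in cells:
        if s > min_size and on_boundary((x, y, s)):
            h = s // 2
            refined += [(x, y, h), (x + h, y, h), (x, y + h, h), (x + h, y + h, h)]
        else:
            refined.append((x, y, s))
    return refined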
4 Results
In this result section we show qualitative segmentation results of moving objects and a quantitative comparison of computation time for different speed improvement techniques for graph-cut segmentation. The segmentation and tracking results show that we are able to detect rigid objects, such as other vehicles (see Figure 8). The segmentation is meaningful and finds the moving object in the image. A close look reveals that vehicles driving away from the camera, at large distances, are not detected (Figure 8 upper right). This is due to the motion constraint, which is too weak for such situations. The error for each track is shown in the image demonstrating that sparse data is used for segmentation. Note that in the experiments, the camera is installed in a moving vehicle, so simple background subtraction is not possible. While the example above could be solved by motion segmentation with parametric motion fields, the accurate segmentation of non-rigid motion would not be possible with such an approach. However, by using the motion constraint,
Fig. 8. Segmentation of moving objects using the multi-resolution graph-cut method and constraints on tracked features (20 frames per second on VGA images). The segmentation boundary proves to be accurate keeping in mind that sparse features are used. Outliers are rejected and the moving objects are verified.
Fig. 9. Segmentation of non-rigid motion using the multi-resolution method in comparison to the full graph-cut method. The left image shows the multi-resolution graph used for segmentation, the middle image shows the segmentation overlay on the original image. The right image shows the results using original graph-cut, yielding similar results.
Fig. 10. The left plot shows the runtime comparison of different graph-cut implementations (on a Pentium IV 3.2 GHz). The right plot shows average runtime comparisons for the multi-resolution approach using different maximum grid sizes.
we are able to segment a moving person quite accurately, as can be seen in Figure 9. The multi-resolution method generates segmentation results close to those obtained using the full graph-cut algorithm. Figure 10 compares computation time for the multi-resolution method and full graph-cut. In addition, the pyramid version and the temporal prediction techniques (other techniques used to speed up graph-cut computation time) are included in the runtime comparison. As the temporal banded cut is used for tracking and not detection, the resulting boundary from the multi-resolution method was provided as an initialization in every frame. The sequence used had increased movement in the scene near the end, which explains the increase in computational time, especially noticeable using the original graph-cut. The multi-resolution method performed best on our test sequence. The results confirm that run times under 50 ms are achieved, such that real-time segmentation becomes possible. The speed improvement using the multi-resolution approach, compared to the original graph-cut, lies between a factor of 10 and 20. The right hand graph in Figure 10 shows the different run times using decreasing maximum grid-size. Note that a max grid-size of 1 pixel corresponds to the original graph-cut.
5 Conclusion and Future Work
In this paper we presented a novel method to combine tracking and detection in a real-time graph-cut approach. The multi-resolution method outperforms the standard graph-cut in computation time, on average, by a factor of 10. It also outperforms other speed improvement techniques, but by a smaller factor. The future ideas for this approach are to expand the algorithm to other areas, such as: – The usage of the algorithm on other segmentation problems (e.g. stereo and general motion segmentation into segments of parametric (affine) motion). – The generalization of the approach for 3D volumes.
References 1. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. In: Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 359–374 (2001) 2. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. In: IEEE Trans. on Pattern Analysis and Machine Intelligence, pp. 1222–1239 (2001) 3. Cremers, D., Rousson, M., Deriche, R.: A review of statistical approaches to level set segmentation: integrating color, texture, motion and shape. International Journal of Computer Vision 72(2), 195–215 (2007) 4. Cremers, D., Soatto, S.: Motion competition: A variational framework for piecewise parametric motion segmentation. International Journal of Computer Vision 62(3), 249–265 (2005) 5. Greig, D.M., Porteous, B.T., Seheult, A.H.: Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society 51, 271–279 (1989) 6. Klappstein, J., Stein, F., Franke, U.: Monocular motion detection using spatial constraints in a unified manner. In: IEEE Intelligent Vehicles Symposium, pp. 261–267 (2006) 7. Kohli, P., Torr, P.H.S.: Effciently solving dynamic markov random fields using graph cuts. In: ICCV, pp. 922–929 (2005) 8. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? In: IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 147–159 (2004) 9. Lombaert, H., Sun, Y., Grady, L., Xu, C.: A multilevel banded graph cuts method for fast image segmentation. In: ICCV, pp. 259–265 (2005) 10. Malcolm, J., Rathi, Y., Tannenbaum, A.: Multi-object tracking through clutter using graph cuts. In: ICCV, pp. 1–5 (2007) 11. Mooser, J., You, S., Neumann, U.: Real-time object tracking for augmented reality combining graph cuts and optical flow. In: ISMAR: Int. Symposium on Mixed and Augmented Reality, pp. 145–152 (2007) 12. Osher, S., Paragios, N.: Geometric Level Set Methods in Imaging,Vision,and Graphics. Springer, New York (2003) 13. Pham, D.L., Xu, C., Prince, J.L.: Current methods in medical image segmentation. In: Annual Review of Biomedical Engineering, vol. 2, pp. 315–337 (2000)
14. Rother, C., Kolmogorov, V., Blake, A.: “GrabCut”: interactive foreground extraction using iterated graph cuts. 23(3), 309–314 (August 2004) 15. Shapiro, L.G., Stockman, G.C.: Computer Vision, New Jersey, pp. 279–325. Prentice-Hall, Englewood Cliffs (2001) 16. Sinop, A.K., Grady, L.: Accurate banded graph cut segmentation of thin structures using laplacian pyramids. In: Larsen, R., Nielsen, M., Sporring, J. (eds.) MICCAI 2006. LNCS, vol. 4191, pp. 896–903. Springer, Heidelberg (2006) 17. Stein, F.: Efficient computation of optical flow using the census transform. In: Rasmussen, C.E., B¨ ulthoff, H.H., Sch¨ olkopf, B., Giese, M.A. (eds.) DAGM 2004. LNCS, vol. 3175, pp. 79–86. Springer, Heidelberg (2004) 18. Tomasi, C., Kanade, T.: Detection and tracking of point features. Technical report, Carnegie Mellon University Technical Report CMU-CS-91-132 (1991) 19. Wang, J., Bhat, P., Colburn, R.A., Agrawala, M., Cohen, M.F.: Interactive video cutout. SIGGRAPH 24, 585–594 (2005) 20. Wedel, A., Schoenemann, T., Brox, T., Cremers, D.: Warpcut - fast obstacle segmentation in monocular video. In: Hamprecht, F.A., Schn¨ orr, C., J¨ ahne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 264–273. Springer, Heidelberg (2007)
Combination of Multiple Segmentations by a Random Walker Approach

Pakaket Wattuya, Xiaoyi Jiang, and Kai Rothaus

Department of Mathematics and Computer Science, University of Münster, Germany
{wattuya,xjiang,rothaus}@math.uni-muenster.de
Abstract. In this paper we propose an algorithm for combining multiple image segmentations to achieve a final improved segmentation. In contrast to previous works we consider the most general class of segmentation combination, i.e. each input segmentation can have an arbitrary number of regions. Our approach is based on a random walker segmentation algorithm which is able to provide a high-quality segmentation starting from manually specified seeds. We automatically generate such seeds from an input segmentation ensemble. Two application scenarios are considered in this work: exploring the parameter space and segmenter combination. Extensive tests on 300 images with manual segmentation ground truth have been conducted and our results clearly show the effectiveness of our approach in both situations.
1 Introduction
Unsupervised image segmentation is of essential relevance for many computer vision applications and remains a difficult task despite decades of intensive research. Recently, researchers have started to investigate the combination of multiple segmentations. Several works can be found in medical image analysis, where an image should be segmented into a known number of semantic labels [10]. Typically, such algorithms are based on local (i.e. pixel-wise) decision fusion schemes such as voting. Alternatively, a shape-based averaging is proposed in [11] to combine multiple segmentations. The works [5,6] deal with the general segmentation problem. However, they still assume that all input segmentations contain the same number of regions. In [5] a greedy algorithm finds the matching between the regions from the input segmentations which builds the basis for the combination. The authors of [6] consider an image segmentation as a clustering of pixels and apply a standard clustering combination algorithm for the segmentation combination purpose. Our work is not subject to this restriction and we consider the most general case (an arbitrary number of regions per image). In addition we study two different scenarios of multiple segmentation combination. In the next section we motivate our work in detail. Then, a random walker approach is proposed to solve the combination problem in Section 3, followed by its experimental validation in Section 4. Finally, some discussions conclude this paper.
2 Application Scenarios of Segmentation Combination
Currently, we investigate two possible applications: exploration of parameter space and segmenter combination.

Exploring Parameter Space without Ground Truth: Segmentation algorithms mostly have some parameters and their optimal setting is not a trivial task. Figure 1 shows two images segmented using exactly the same parameter set. In case (b) a nearly perfect segmentation can be achieved while we obtain a very bad segmentation in (a). Note that in this work we use the F-measure [8] (weighted harmonic mean of precision P and recall R) to quantitatively evaluate the quality of a segmentation against corresponding human GT segmentations. In recent years automatic parameter training has become popular, mainly by probing a subspace of the parameter space by means of quantitatively comparing with a training image set with (manual) ground truth segmentation [1,9]. We propose to apply the multiple segmentation combination for implicitly exploring the parameter space without the need of ground truth segmentation. It is assumed that we know a reasonable subspace of the parameter space (i.e. a lower and upper bound for each parameter), which is sampled into a finite number N of parameter settings. Then, we run the segmentation procedure for all the N parameter settings and compute a final combined segmentation of the N segmentations. The rationale behind our approach is that this segmentation tends to be a good one within the explored parameter subspace.

Segmenter Combination: Different segmentation algorithms may have different performance on the same image. There exists no universal segmentation algorithm that can successfully segment all images. It is not easy to know the optimal algorithm for one particular image. We postulate: "Instead of looking for the best segmenter, which is hardly possible on a per-image basis, we now look for the best segmenter combiner". The rationale behind this idea is that while none of the segmentation algorithms is likely to segment an image correctly, we may
Fig. 1. Segmentations produced by using the FH segmentation algorithm [3] given the same set of parameter values. (a) R = 0.1885, P = 0.5146, F = 0.2759. (b) R = 0.7852, P = 0.9523, F = 0.8607.
benefit from combining the strengths of multiple segmenters. For this purpose we may apply various segmentation methods (each perhaps run with multiple parameter sets) to build a segmentation ensemble. In both application scenarios we need a robust combiner. It takes an ensemble S = {S_1, ..., S_N} of N segmentations and computes a final segmentation result S* which is superior to the initial segmentations in a statistical sense (to be specified later).
3 The Multiple Segmentation Combination Algorithm
The segmentation ensemble can be produced by using different segmentation algorithms or with the same segmentation algorithm but different parameter values. We introduce the use of the random walker for the problem of multiple segmentation combination. It is based on the random walker algorithm for image segmentation [4]: Given a small number K of pixels with user-defined labels (or seeds), the algorithm labels unseeded pixels by resolving the probability that a random walker starting from each unseeded pixel will first reach each of the K seed points. A final segmentation is derived by selecting for each pixel the most probable seed destination for a random walker. The algorithm can produce a quality segmentation provided suitable seeds are placed manually.

The random walker algorithm [4] is formulated on an undirected, weighted graph G = (V, E), where each image pixel p_i has a corresponding node v_i ∈ V. The edge set E is constructed by connecting pairs of pixels that are neighbors in a 4-connected neighborhood structure. Each edge e_ij has a corresponding weight w_ij which is a real-valued measure of the similarity between the neighboring pixels v_i and v_j. Given such a graph and the seeds, the probability of all unseeded pixels reaching the seeds can be efficiently solved.

In principle there exists a natural link of our problem at hand to the random walker based image segmentation. The input segmentations provide strong hints about where to automatically place some seeds. Given such seed regions we are then faced with the same situation as image segmentation with manually specified seeds and can thus apply the random walker algorithm [4] to achieve a quality final segmentation. To develop a random walker based segmentation ensemble combination algorithm we need three components: generating a graph to work with (Section 3.1), extracting seed regions (Section 3.2) and computing a final combined segmentation result (Section 3.3).

3.1 Graph Generation
In our context the weight w_ij of edge e_ij should indicate how likely it is that the two pixels p_i and p_j belong to the same image region. Clearly, this can be guided by counting the number n_ij of initial segmentations in which p_i and p_j share the same region label. Then, we define the weight function as a Gaussian weighting:

w_ij = exp(−β (1 − n_ij / N))   (1)
where β is a free parameter of the algorithm. Low edge weights indicate a high probability that there is region boundary evidence between two neighboring pixels and prevent a random walker from crossing these boundaries.
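Equation (1) translates directly into code; a sketch assuming the co-association counts n_ij have already been accumulated over the N input segmentations (the default β = 30 matches the setting used in Section 4):

import numpy as np

def ensemble_weight(n_ij, N, beta=30.0):
    """Sketch of Equation (1): Gaussian weighting of the co-association count n_ij,
    i.e. the number of input segmentations in which p_i and p_j share a region label."""
    return np.exp(-beta * (1.0 - n_ij / float(N)))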
3.2 Seed Region Generation
We describe a two-step strategy to automatically generate seed points: (i) extracting candidate seed regions from G and (ii) grouping (labeling) them to form final seed regions to be used in the combination step.

Candidate Seed Region Extraction: It proceeds as follows:

1. Build a new graph G* by removing edges whose weight in G is lower than a threshold t_0 ∈ [0, 1] and work with G* subsequently.
2. Determine all connected components consisting of nodes with degree 4.
3. Extend each connected component C by adding those nodes which are connected to C by a chain of edges. It may happen that a node can be added to two different connected components C_i and C_j. In this case C_i and C_j are merged.

The resulting connected components build a set of initial seed regions which are further refined in the next step. Note that step 1 basically deletes those edges between two neighboring nodes (pixels) which are very unlikely to belong to the same region.

Grouping Candidate Seed Regions: The number of candidate seed regions from the last step is typically higher than the true number of regions in an input image. Thus, a further reduction is performed by iteratively merging the two closest candidate seed regions until some termination criterion is satisfied. For this purpose we need a similarity measure between two candidate seed regions and a termination criterion. Recall that in the initial graph G the edge weights w_kl indicate how likely it is that two pixels p_k and p_l belong to the same image region. This interpretation gives us a means to estimate how likely two candidate seed regions C_i and C_j belong to the same region. For a node p_k ∈ C_i we consider the probability P(p_k, C_j) that a random walk starting from p_k will reach any node in C_j. Then, we define the similarity between C_i and C_j by:

Similarity(C_i, C_j) = 1/2 [ max_{p_k ∈ C_i} P(p_k, C_j) + max_{p_l ∈ C_j} P(p_l, C_i) ]   (2)
Note that these similarity values can be efficiently computed by the baseline random walker algorithm [4]; the full details are not given here due to the space limitation. We iteratively select two candidate seed regions with the highest similarity value and merge them to build one single (spatially disconnected) candidate seed region. This operation is stopped if the highest similarity value is below a threshold tmerge . Obviously, tmerge determines the final number K of seed regions and the segmentation combination result. Further discussion will follow later.
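The grouping step can be sketched as a greedy agglomeration driven by Equation (2); the similarity function, backed by the random walker probabilities P(p_k, C_j), is passed in, and all names and the region representation are illustrative:

def group_candidate_seeds(seeds, similarity, t_merge=0.5):
    """Sketch: iteratively merge the two most similar candidate seed regions
    (sets of pixel indices) until the highest similarity drops below t_merge."""
    seeds = [set(s) for s in seeds]
    while len(seeds) > 1:
        best = max(((similarity(a, b), i, j)
                    for i, a in enumerate(seeds)
                    for j, b in enumerate(seeds) if i < j), default=None)
        if best is None or best[0] < t_merge:
            break
        _, i, j = best
        seeds[i] = seeds[i] | seeds[j]   # merged region may be spatially disconnected
        del seeds[j]
    return seeds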
3.3 Segmentation Ensemble Combination
Given the graph G constructed from the initial segmentations and the automatically determined K seed regions, we apply the random walker algorithm [4] to compute the combination segmentation. The computation of random walker probabilities can be performed exactly without the simulation of random walks, but by solving a sparse, symmetric, positive-definite system of equations (solving the combinatorial Dirichlet problem). Each unseeded pixel is then assigned a K-tuple vector, specifying the probability that a random walker starting from that pixel will first reach each of the K seed regions. A final segmentation is derived by assigning each pixel to the label of the largest probability.
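The final assignment is a per-pixel argmax over the K probability maps; a one-line sketch, assuming the probabilities are stacked into an array of shape (K, H, W):

import numpy as np

def combine(prob):
    """Sketch: label each pixel with the seed region of highest random walker probability."""
    return np.argmax(prob, axis=0)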
3.4 Further Implementational Details
In the seed region extraction step the computation of all similarity values in Eq. (2) can be time-consuming since we need to run the random walker algorithm [4] N_c times (N_c being the number of candidate seed regions). In practice, we use only a small number of pixels per candidate seed region by sampling them along the horizontal and vertical image grid by a factor of 5 in each direction. Furthermore, only the 50 largest candidate seed regions are used in the merging process and all other small candidate seed regions are deleted. This number of candidate seed regions was experimentally determined to be large enough to cover all salient natural image segments. By doing it this way, we can improve the computational performance of the algorithm without degrading the quality of the combination results.
4 Experimental Results
We have conducted a series of experiments to validate the potential of our approach in the two application scenarios. The natural images with human segmentations from the Berkeley segmentation dataset [7] are used. The dataset consists of 300 images of size 481×321 (or 321×481), each having multiple manual segmentations. We apply the F-measure [8] to quantitatively evaluate the segmentation quality against the ground truth. One segmentation result is compared to all manual segmentations of the image and the average F-measure is reported. We empirically set the parameters β and t_merge to 30 and 0.5, respectively, for all experiments.

4.1 Exploring Parameter Space
For this experiment series the graph-based image segmentation algorithm (FH) developed by Felzenszwalb and Huttenlocher [3] is used as the baseline segmentation algorithm. The reasons for selecting this algorithm are its competitive segmentation performance and high computational efficiency. In fact, the running time is nearly linear in the number of graph edges and very fast in practice.
Fig. 2. Exploring parameter space. (a) Combined segmentation results of the proposed algorithm. (b)–(d) show three input segmentations with the worst, median and the best F-measure scores, respectively. The F-measures of the examples shown are:

(a) Combined  (b) Worst  (c) Median  (d) Best
0.6091        0.2759     0.4733      0.5611
0.8188        0.5444     0.5896      0.8607
0.6805        0.3392     0.4685      0.6620
0.8026        0.5774     0.7275      0.7933
0.7369        0.6408     0.7091      0.7406
0.8091        0.6738     0.7558      0.8071
0.6907        0.4599     0.5575      0.6156
Fig. 3. Exploring parameter space: The average performance of combined results (F=0.5948) and of each individual parameter setting results over the 300 images in the database
The algorithm has three parameters: σ is used to smooth the input image before segmentation, k is the value for a threshold function (Larger values for k result in larger components in the result), and min area is a minimum component size enforced by post-processing. We obtain multiple segmentations of an image by varying the parameter values. This way 24 segmentations are generated by fixing min area = 1500 (about 1% of image area) and varying σ = 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and k = 150, 300, 500, 700. Figure 2 shows some examples of combined segmentations produced by our method. For comparison purpose we also show the input segmentation with the worst, median and best F-measure. In some cases (e.g. the first and last example) we can even achieve some improvement against the best input segmentation. For the second image this is not true (F=0.8188 compared to the best input segmentation with F=0.8607). In this case the median input segmentation has a low F=0.5896, indicating an ensemble with a majority of relatively bad segmentations. Generally, we can observe a substantial improvement of our combination compared to the median input segmentation. These results demonstrate that we can obtain an “average” segmentation which is superior to the - possibly vast majority of the input ensemble. Indeed, this behavior can be validated on the entire database of 300 images. Figure 3 shows the average performance of all 300 images for each parameter setting in comparison with the average performance of our approach. Given the fact that we do not know the optimal parameter setting for a particular image in advance (see Figure 1), the comparative performance of our approach is remarkable and reveals its potential in dealing with the difficult problem of parameter selection without ground truth. 4.2
4.2 Combination of Segmenters
In this experiment series we use the FH algorithm and multiscale Normalized Cuts (NCuts) algorithm [2] to generate multiple segmentations. While FH tends
to produce small over-segmented regions, the NCuts algorithm aims at a global segmentation with large segments that have a chance of being objects. We generate six different segmentations from each algorithm by varying the parameter values. For the FH algorithm, we fix min_area = 1500 and k = 500 and vary σ = 0.4, 0.5, 0.6, 0.7, 0.8, 0.9; for the NCuts algorithm, the parameter configuration is the number of segments k = 4, 6, 8, 10, 12, 14 at a fixed image scale of 80%. One result is shown in Figure 4. For this particular image NCuts achieves higher scores, in some sense simulating a situation where the different segmentation algorithms are complementary. Again, we generally do not know which algorithm works best for a particular image. Our approach helps to extract the best out of all input segmentations, as exemplified in Figure 4 (bottom) with a combination result better than the entire segmentation ensemble. The average performance for all 300 images with regard to each of the 12 individual configurations (segmenter and parameter setting) is shown in Figure 5 (the first 6 configurations correspond to the FH algorithm, the others to NCuts). We also draw the red line for the performance of our combination approach over all 300 images.
(Figure 4 image panels: six FH inputs with F = 0.5544, 0.4976, 0.5334, 0.3429, 0.4832, 0.5992; six NCuts inputs with F = 0.5533, 0.5839, 0.6624, 0.6789, 0.6790, 0.6799; combined result with F = 0.7303.)
Fig. 4. Example of a segmentation ensemble from two segmentation algorithms (top; values in parentheses are F-measures) and the combination result (bottom)
Fig. 5. Combination of segmenters: The average F-measure of combined results is 0.5902; the average F-measure of the best segmenter (6) is 0.5602
Without knowing the best segmenter (and the related optimal parameter values in our tests) for a particular image, our approach is useful for overcoming the segmentation imperfections by using multiple segmenters (and parameter values).
4.3 Discussion
The proposed algorithm was implemented in Matlab on an Intel Core 2 CPU. For a 481 × 321 image, approximately 2.5 seconds are required for the seed region extraction step (Section 3.2) and less than 2 seconds for the random walker algorithm in the final combination step (Section 3.3). Regardless of the particular segmentation algorithm(s) and the thresholds of our combination algorithm, the size of the segmentation ensemble can play an important role for the accuracy of the combined result. Increasing the ensemble size may increase both true and false boundary segments. False boundary segments that are shared among most of the input segmentations will most likely propagate into the final combined result. An example of this problem can be observed in Figure 4: the over-segmentation of the calf in most of the input segmentations leads to its over-segmentation in the combined result. On the other hand, decreasing the ensemble size may decrease the chance of having true boundary segments. This problem, however, is inherent to all combination algorithms based on co-association values voted by a segmentation ensemble.
5 Conclusion
The difficult image segmentation problem has various facets of fundamental complexity. In general it is hardly possible to know the best segmenter and, for a particular segmenter, the optimal parameter values. The segmenter/parameter selection problem has not received due attention in the past. For the parameter selection problem, for instance, researchers typically claim to have
experimentally fixed the parameter values or to have trained them in advance based on manual ground truth. As discussed before, these approaches are not optimal, and an adaptive framework for dealing with the problem in a more general context is lacking. We have taken some steps towards a framework of multiple image segmentation combination based on the random walker idea. In contrast to the few works so far, we consider the most general class of segmentation combination, i.e. each input segmentation can have an arbitrary number of regions. We demonstrated the usefulness of our approach in two application scenarios: exploring parameter space and combination of segmenters. The results show the effectiveness of our combination method in both situations. While the preliminary results are very promising, several issues remain. The seed region generation part of our algorithm, particularly the termination criterion, needs to be refined further. Other future work includes an extension of the random walker based combination approach to other problem domains.
References
1. Cingue, L., Cucciara, R., Levialdi, S., Martinez, S., Pignalberi, G.: Optimal range segmentation parameters through genetic algorithms. In: Proc. of 15th Int. Conf. on Pattern Recognition, vol. 1, pp. 474–477 (2000)
2. Cour, T., Benezit, F., Shi, J.: Spectral segmentation with multiscale graph decomposition. In: Proc. of CVPR, pp. 1124–1131 (2005)
3. Felzenszwalb, P., Huttenlocher, D.: Efficient graph-based image segmentation. Int. Journal of Computer Vision, 167–181 (2004)
4. Grady, L.: Random walks for image segmentation. IEEE Trans. Patt. Anal. Mach. Intell. 28, 1768–1783 (2006)
5. Jiang, T., Zhou, Z.-H.: SOM ensemble-based image segmentation. Neural Processing Letters 20, 171–178 (2004)
6. Keuchel, J., Küttel, D.: Efficient combination of probabilistic sampling approximations for robust image segmentation. In: Franke, K., Müller, K.-R., Nickolay, B., Schäfer, R. (eds.) DAGM 2006. LNCS, vol. 4174, pp. 41–50. Springer, Heidelberg (2006)
7. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. 8th Int. Conf. Computer Vision, vol. 2, pp. 416–423 (2001)
8. Martin, D., Fowlkes, C., Malik, J.: Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. Patt. Anal. Mach. Intell. 26(5), 530–539 (2004)
9. Min, J., Powell, M., Bowyer, K.W.: Automated performance evaluation of range image segmentation algorithms. IEEE Trans. on SMC – Part B 34(1), 263–271 (2004)
10. Rohlfing, T., Russakoff, D.B., Maurer Jr., C.R.: Performance-based classifier combination in atlas-based image segmentation using expectation-maximization parameter estimation. IEEE Trans. Medical Imaging 23, 983–994 (2004)
11. Rohlfing, T., Maurer Jr., C.R.: Shape-based averaging for combination of multiple segmentations. In: Duncan, J.S., Gerig, G. (eds.) MICCAI 2005. LNCS, vol. 3750, pp. 838–845. Springer, Heidelberg (2005)
Automatic Detection of Learnability under Unreliable and Sparse User Feedback

Yvonne Moh, Wolfgang Einhäuser, and Joachim M. Buhmann

Institute of Computational Science, Swiss Federal Institute of Technology (ETH) Zurich
{tmoh,einhaeuw,jbuhmann}@inf.ethz.ch
Abstract. Personalization for real-world machine-learning applications usually has to incorporate user feedback. Unfortunately, user feedback often suffers from sparsity and possible inconsistencies. Here we present an algorithm that exploits feedback for learning only when it is consistent. The user provides feedback on a small subset of the data. Based on the data representation alone, our algorithm employs a statistical criterion to trigger learning when user feedback is significantly different from random. We evaluate our algorithm in a challenging audio classification task with relevance to hearing aid applications. By restricting learning to an informative subset, our algorithm substantially improves the performance of a recently introduced classification algorithm.
1 Introduction
Sound classification according to user preferences poses a challenging machine learning problem under real-world scenarios. In the realm of hearing instrument personalization, additional challenges arise since hardware considerations constrain the processing and storage capacity of the device, while the acoustic environment may undergo substantial changes. Furthermore, user feedback is typically sparse, and individual preferences may shift. Finally, the device shall be operative without initial individual training (default settings), learning shall only smoothly affect settings, and adaptation may only occur as a consequence of user feedback. Consequently, a "smart" hearing instrument, which is envisioned to adapt its amplification settings according to user preferences and acoustic environment, needs to employ a sophisticated online adaptation algorithm that is passive, efficient, and copes with sparse labels. The key to enabling adaptivity is to decide when learning shall take place. At one extreme, one may assume that the hearing instruments are exclusively factory-trained with no adaptation to idiosyncratic preferences [1]. At the other extreme, continuous user feedback would be required, which is infeasible in practice. Using the sub-problem of music categorization into like/dislike, we here develop an algorithm that fulfills the aforementioned constraints. It decides based on a statistical criterion when learning shall take place by evaluating the consistency of user feedback. The criterion endows the learning algorithm with
a confidence measure in the newly provided labels. This test enables efficient online learning of user preferences under real-world conditions. A recent study [2] proposed to extend additive expert ensembles [3] to semi-supervised learning via learning with local and global consistency (LLGC) [4]. These results provided the proof of concept that this approach works for the current problem. In particular, the system of [2] learns only when feedback is made available and its current ensemble of classifiers does not suffice to model the preference indicated by the feedback. However, this earlier model suffered from several shortcomings. Most importantly, the system continues to learn even when the novel information is inconsistent and therefore unusable, and when the classifier and/or features are ill-suited for the proposed separation. Here we propose an algorithm that decides whether new labels should be ignored or learning should take place. A confidence score over the user feedback is computed based on semi-supervised graphs with randomized labels. This score determines whether the labels are consistent and thus learnable. Learnability, i.e. self-consistency of the labels, is established by determining its deviation from the random case.
2 Background
In semi-supervised learning, a set of data points X = X_L ∪ X_U is available, where labels Y_L for X_L are provided, and X_U is the unlabeled dataset that we wish to label or exploit for learning. Graphs provide an excellent means to code manifold structures which can be exploited for the purpose of semi-supervised learning [5]. Formally, we are presented with a set of n data points X = X_L ∪ X_U = {x_1, ..., x_l, x_{l+1}, ..., x_n}, where x_i ∈ R^m is an m-dimensional feature vector representing the i-th data point. For the first l data points, we have the corresponding labels Y_L = {y_1, ..., y_l}, where y_i ∈ {−1, 1}. Let F(0) = {y_1, ..., y_l, 0, ..., 0} be the vector that represents the labels for the data points, where y_i = 0 for l < i ≤ n. Furthermore, let g(x_i, x_j) be any similarity measure between two points x_i, x_j. In our experiments, we use g(x_i, x_j) = exp(−||x_i − x_j||^2 / 2σ^2).

Graph Notations: We code the data as a graph G = (X, W). The nodes X represent individual data points, and the edges are coded in an affinity matrix W. W stores the similarity between data points, i.e. W_{ij} = g(x_i, x_j) for all i ≠ j, and W_{ii} = 0. The normalized graph Laplacian L is defined as

L = D^{−1/2} W D^{−1/2}  with  D_{ii} = Σ_j W_{ij}.   (1)
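A minimal sketch of this graph construction, assuming the data points are given as rows of a matrix and using the Gaussian similarity defined above (the bandwidth σ is a free parameter):

```python
import numpy as np

def build_graph(X, sigma=1.0):
    # Affinity matrix W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for i != j,
    # W_ii = 0, and the normalized matrix L = D^{-1/2} W D^{-1/2} of Eq. 1.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return W, L
```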
Learning with Local and Global Consistency (LLGC) [6]: LLGC is a graph-based semi-supervised learning technique that spreads the label information from X_L to X_U based on the affinity of the data points. It does this via the normalized graph Laplacian; mathematically, the iterative information spreading is

F(t + 1) = αLF(t) + (1 − α)F(0),   (2)

where α is a suitably chosen learning rate. This iteration converges to the solution

F* = (1 − α)(I − αL)^{−1} F(0),   (3)

as can easily be seen from the condition F(t + 1) = F(t). The LLGC solution F* can be interpreted as node weights after Y_L has been propagated across the graph. Due to the smoothness constraints, reliable labels should reinforce each other, resulting in higher node weights, whereas labels showing inconsistencies tend to cancel out, resulting in lower node weights. Here, we endeavor to find a measure of quality for the solution obtained by LLGC. This criterion will allow us to decide in an unsupervised manner whether the label information provided is useful for our classification task. We propose to measure the reliability of the LLGC solution, and hence the learnability of the provided labels.
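The closed-form solution of Eq. 3 amounts to a single linear solve. The sketch below assumes the normalized matrix L from the previous snippet; the concrete value of α is an assumption, since the text only requires a suitably chosen learning rate.

```python
import numpy as np

def llgc(L, F0, alpha=0.99):
    # Closed-form LLGC solution F* = (1 - alpha) (I - alpha L)^{-1} F(0) (Eq. 3);
    # F0 carries the provided labels (+1/-1) and zeros for unlabeled points.
    n = L.shape[0]
    return (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * L, F0)
```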
3 Algorithm
Our algorithm is presented with a set of data X, for which a fraction q is labeled by the user (Y_{L,usr}). The aim is to decide whether or not this scenario is learnable. To achieve this goal, the algorithm measures the quality of the LLGC solution on (X, Y_{L,usr}) by comparing it with solutions on K sets of randomly drawn labels on the same data (X, Y_{L,k}), 1 ≤ k ≤ K. The interplay of two major factors affects the solution of LLGC: 1. the topology of the graph (coded by the affinity matrix W), and 2. the labels provided in Y_{L,k}, k ∈ {usr, 1, ..., K}. Each factor on its own influences the final configuration of the graph: the topology enters through the graph Laplacian L, and the labels through F(0):

F* ∝ (I − αL)^{−1} F(0).   (4)
We denote each configuration by H_k = (G, Y_{L,k}, F_k*), where G = (X, W) represents the data points, Y_{L,k} is the set of labels provided for a subset (X_L) of the data points, and F_k* is the solution of LLGC. The set of source (emitting) nodes (X_L) has to be kept constant to ensure comparability of hypotheses, since the graph is weighted and the topology plays an essential role in the label spreading. To measure learnability, we need to quantify the quality of H_usr compared to the average over the H_k. The weight distributions corresponding to H_k are F_k* = [f_{k,1}, ..., f_{k,n}], for k ∈ {usr, 1, ..., K}, where n is the total number of nodes. We compare F_usr* to all other F_k* by computing the absolute node weight difference for each node X_i in H_usr to its corresponding counterpart in H_k, given as

d_{usr,k}(i) = |f_{usr,i}| − |f_{k,i}|.   (5)

Similarly, we compute d_{k_u,k_v} for all pairs 1 ≤ k_u < k_v ≤ K. Comparing the statistics of d_{usr,k} to d_{k_u,k_v} allows us to quantify the significance of d_{usr,k}. Hence
we have reduced the test of learnability of H_usr to a test of statistical significance on d. We outline the statistics that we generate over d. We compute the mean μ_s and standard deviation σ_s of the s-th percentile values. For robustness considerations (sensitivity to outliers), we base our statistical test on the properties of a fixed quantile of the distribution of d. In this study, we focus on the 90-th percentile (s = 0.9), but also examine other percentiles (median, s = 0.5) and other statistics (mean) (Sec. 5). For H_usr, we compute the s-th percentile statistic μ_{usr,s}:

μ_{usr,s} = (1/K) Σ_{k=1}^{K} Q_s(d_{usr,k}),   (6)

where Q_s(d) computes the s-th percentile value of vector d. For the random drawings, μ_{rnd,s} is

μ_{rnd,s} = 2/(K(K−1)) Σ_{k=1}^{K} Σ_{j=k+1}^{K} Q_s(d_{k,j}).   (7)

σ_{usr,s} and σ_{rnd,s} are computed correspondingly. If K is sufficiently large, we can approximate μ_{rnd,s}, σ_{rnd,s} by an appropriate subset, e.g.

μ_{rnd,s} = (1/K) Σ_{k=1}^{K} Q_s(d_{k,k⊕1}),   (8)

where ⊕ denotes addition modulo K. H_usr is defined to be learnable iff (μ_{usr,s}, σ_{usr,s}) is significantly different from (μ_{rnd,s}, σ_{rnd,s}).

Algorithm: Below, we formally present our algorithm, which applies LLGC as a subroutine. As input, the algorithm is presented with G = (X, W) and Y_{usr,L}. The percentile level used here is s = 0.9, and K = 100.
1. Generate K random shufflings of Y_{usr,L} (same label distribution and node subset X_L) to obtain Y_{k,L} for 1 ≤ k ≤ K.
2. Compute F* using LLGC to obtain H_usr = (G, Y_{usr,L}, F_usr*) and H_k = (G, Y_{k,L}, F_k*), respectively. (The cost of computing the random probes creates only negligible overhead compared to the original LLGC algorithm, so this step has only minimal impact on overall runtime.)
3. Calculate the d, which are used to compute μ_{usr,s}, σ_{usr,s}, μ_{rnd,s} and σ_{rnd,s} for the desired percentile s.
4. Accept H_usr for learning if μ_{usr,s} > μ_{rnd,s} + σ_{rnd,s}. Otherwise reject.
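A compact sketch of steps 1–4 is given below. It reuses the `llgc` routine sketched earlier, represents labels as ±1 values over the labeled node indices, and passes the percentile level in percent to `numpy.percentile`; all function and variable names are illustrative.

```python
import numpy as np

def is_learnable(L, labeled_idx, user_labels, s=90, K=100, alpha=0.99, rng=None):
    # Steps 1-4: compare the LLGC solution for the user labels against K random
    # shufflings of the same labels over the same labeled nodes, using the s-th
    # percentile (here in percent) of the node weight differences of Eqs. 5-8.
    rng = np.random.default_rng() if rng is None else rng
    n = L.shape[0]

    def node_weights(labels):
        F0 = np.zeros(n)
        F0[labeled_idx] = labels
        return np.abs(llgc(L, F0, alpha))        # |f_i| for every node

    f_usr = node_weights(user_labels)
    f_rnd = [node_weights(rng.permutation(user_labels)) for _ in range(K)]

    # Q_s(d_usr,k) with d_usr,k(i) = |f_usr,i| - |f_k,i|  (Eqs. 5, 6)
    q_usr = [np.percentile(f_usr - fk, s) for fk in f_rnd]
    # Approximation of the random statistics via d_{k, k (+) 1}  (Eq. 8)
    q_rnd = [np.percentile(f_rnd[k] - f_rnd[(k + 1) % K], s) for k in range(K)]

    mu_usr = np.mean(q_usr)
    mu_rnd, sigma_rnd = np.mean(q_rnd), np.std(q_rnd)
    return mu_usr > mu_rnd + sigma_rnd           # accept H_usr for learning
```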
4 Experimental Setting
In this section we evaluate the performance of our algorithm to test for learnability. We base our experiments on two standard music datasets.
Fig. 1. Visualization of the classical music dataset projected onto the first two LDA dimensions computed over the 14 music categories. General regions of music subclusters are identified.
We simulate different user profiles, i.e., their individual preferences and feedback consistency. Preference defines the decision boundaries, whilst consistency refers to the noise level in the feedback (labels Y_{L,usr}).

Data: We used two distinct music audio datasets, one comprising predominantly classical music ("classic") and another dataset of pop music ("artist20"). The former is used by the algorithm to determine learnability ("validation"). The latter is a "hold-out" set, used exclusively for performance evaluation.

classic: The classic dataset consists of 2000 song files divided into 14 categories. The bulk of the dataset is classical music: opera (4 categories: Händel, Mozart, Verdi and Wagner), orchestral music (6 categories: Beethoven, Haydn, Mahler, Mozart, Shostakovitch, piano concertos) and chamber music (3 categories: piano, violin sonatas, and string quartets). The remaining category is comprised of a small set of pop music, intended as "dissimilar" music. Figure 1 shows an LDA projection plot of this dataset for visualization.

artist20: We use the publicly available set artist20 [7] to evaluate performance. artist20 is a dataset of 20 pop artists (Table 1), each represented by six albums, yielding a total of 1413 songs. Here each artist represents a category, and the songs are the exemplars.

Features: Data are represented by 20 mel frequency cepstral coefficient (MFCC) features [8], which are commonly used in speech/music classification. For each dataset, these features are computed from mono-channel raw sources.
Table 1. Artists found in the artist20 set. Each artist is covered by six albums.
Aerosmith, Beatles, Creedence Clearwater Revival, Cure, Suzanne Vega, Tori Amos, Green Day, Dave Matthews Band, U2, Steely Dan, Garth Brooks, Madonna, Fleetwood Mac, Queen, Metallica, Radiohead, Roxette, Led Zeppelin, Prince, Depeche Mode
For each song, we compute the means and variances of the MFCC components over time to obtain a static 40-dimensional song-specific feature vector.
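The paper does not name the MFCC toolchain; as an illustration, the 40-dimensional song-level vector could be computed with librosa as follows (frame parameters left at library defaults, which is an assumption):

```python
import numpy as np
import librosa

def song_feature_vector(path):
    # Static song-level feature: means and variances of 20 MFCCs over time,
    # stacked into a 40-dimensional vector.
    y, sr = librosa.load(path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape (20, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.var(axis=1)])
```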
5 Results
For validation of our approach, we first examine the usefulness of the statistics μ_{usr,s}, σ_{usr,s}, μ_{rnd,s}, and σ_{rnd,s} to decide on learnability. This test does not include the actual learning, which is later evaluated on the hold-out set artist20 (Sec. 5.4). For a given dataset, we simulate different user profiles. Each user likes a random subset of music categories and dislikes the remaining categories. The user then labels a fraction q of songs in the dataset, drawn at random over all categories, as "like" or "dislike". Hereafter we will refer to this like/dislike judgment as "preference" to avoid confusion with the category "labels". Besides modeling user preferences, we also model their consistency: a given user gives identical preference judgments only for a fraction p of songs of the same preference class. A p close to 0.5 thus indicates a high noise level, while a p of 1 indicates no noise (p < 0.5 would correspond to less noise with flipped labels). Since we are particularly interested in sparse feedback, we step q logarithmically from 0.5% to 10% for each user. Similarly, we simulate p from 50% (random) to 100% (no noise) in steps of 5%.
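The user simulation can be sketched as follows; the sampling details (e.g., how many categories a user likes, independent label flips with probability 1 − p) are assumptions for illustration.

```python
import numpy as np

def simulate_user_feedback(categories, q, p, rng=None):
    # One simulated user: a random subset of categories is liked (+1), the rest
    # disliked (-1); a fraction q of songs is labeled, and only a fraction p of
    # these labels is consistent with the user's true preference.
    rng = np.random.default_rng() if rng is None else rng
    cats = np.unique(categories)
    liked = rng.choice(cats, size=rng.integers(1, len(cats)), replace=False)
    preference = np.where(np.isin(categories, liked), 1, -1)

    n = len(categories)
    labeled_idx = rng.choice(n, size=max(1, int(round(q * n))), replace=False)
    labels = preference[labeled_idx].copy()
    labels[rng.random(labels.size) > p] *= -1    # inconsistent feedback
    return labeled_idx, labels, preference
```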
5.1 Easy, Noise-Free User Preferences
Before using the 50 simulated users, we consider a simple example user for a first illustration ("LikeVocal"): LikeVocal has an easy preference scenario, liking vocal music (pop, opera; total: 5 categories) and disliking non-vocal music (orchestral, chamber; total: 9 categories). From the LDA visualization in Figure 1, we see that the preference classes of LikeVocal are almost linearly separable, rendering this case particularly easy. In the first test, we additionally assume a noise-less situation with LikeVocal's preference being fully consistent (p = 1). For each value of q, we then generate 50 sets of randomly selected Y_{L,usr} and run the algorithm for each set. First we inspect the distributions of the d (Figure 2, top).
Fig. 2. For user LikeVocal. Top: distribution of values for d_{usr,1} (left) and d_{1,2} (right). Since d_{usr,1} is not normally distributed, computing statistics on the median or other percentiles is more robust than mean and variance. Bottom: absolute weight value differences w.r.t. random labeling with conservation of the label distribution. Left: mean value; right: 90-th percentile. Note that the x-axis is presented on a log scale.
Whilst the random case (e.g., d_{1,2}) appears Gaussian (Figure 2, upper right), this does not hold for the user set (e.g., d_{usr,1}; Figure 2, upper left). We observe that the accumulated weights are substantially higher for the true labels than for the randomly seeded graphs. This is clearly reflected in Figure 2 (bottom), where the 90-th percentile shows better discriminatory strength. This observation motivates the use of the 90-th percentile statistic for the remainder of the paper.
5.2 Noisy User Preferences
So far we have considered the noise-free case (p = 1). While preserving the general preference, our user LikeVocal now becomes inconsistent (p < 1). The absolute weight difference d increases with decreasing noise (Figure 3). This trend is mirrored in the classification performance on the entire data X, including the unlabeled subset (X = X_U ∪ X_L). The error rate is corrected for symmetry, i.e. consistent mislabeling would be judged as correct; the error rate is maximally 50%.
Fig. 3. Left: distribution of the 90-th percentile absolute weights for different levels of noise on classicaleasy. Right: corresponding error rates of LLGC
5.3 Random User Preferences
We now consider a harder problem, where the sub-clusters are randomly assigned to the two classes of user preferences. Here, the overlap zones are larger, and some assignments may be less intuitive, e.g. the user likes Mozart operas but dislikes Verdi operas. In classicalhard, we randomly model 50 users with randomly sampled preference assignments. The results of our algorithm are shown in Figure 4. Compared to the easy user LikeVocal, the error rates are considerably higher. This effect is expected, as the increased class overlap renders the classification problem more difficult. Nevertheless, we observe the same pattern as in the easy case: again, weight differences increase with p and q. This trend shows that the weight differences are a good measure for learnability and thus a useful basis for our algorithm.
Fig. 4. Left: 90-th percentile of absolute weight differences for different levels of noise on the set classicalhard. Right: corresponding error rates of LLGC. Error bars show the standard error of the mean (SEM).
(Figure 5 plot: accuracy (|50% − error|, left axis) and fraction of rejected labels (right axis) versus the fraction of labeled data [%]; curves show the accuracy on all data, the accuracy on the thresholded data (learnable & reliable), and the rejection rate.)
Fig. 5. Estimated accuracy (left axis) together with the decision to learn and rejection rates (right axis). Panels from left to right show noise levels p = 50%, 70%, 90% and 100%. Error bars report SEM.
5.4 Evaluation on Hold-Out Set
So far we have demonstrated that our statistical measures can provide a useful threshold for learning, but this remains to be evaluated in the full setting. For each simulated user, our algorithm is presented with different proportions of labels (q) at different consistency levels (p). The task is to decide whether or not to reject the labels. On the sets that are accepted, the error rates are evaluated. We here use a dataset held out when designing the algorithm (artist20). We again simulate 50 users, each by splitting the 20 artists (categories) randomly into two classes (like/dislike). For each user, we vary the proportion of user feedback (the number of songs labeled, q) and the consistency of the preference (p). For all values of p and q, the average categorization accuracy on all songs (labeled + unlabeled) is improved when our algorithm is used (Figure 5), compared to a continuously learning baseline system. The improvement gained through our algorithm is largest when the labels are very inconsistent (small p) or only a small fraction is labeled (small q). In these cases the rejection rate is high, and classification can benefit most. In turn, if the user provides a large and consistent set of preference judgments (large p and q), rejection is mild and performance without rejection is similar. This result shows that our algorithm reliably rejects uninformative data, while preserving consistent and informative feedback nearly in full.
6 Conclusions
We present an algorithm that decides whether or not user feedback in a real-world application is consistent and thus learnable given the input representation. On real-world data with realistic labeling, the learning threshold induced by our algorithm increases performance for the envisioned hearing instrument application. Hence we demonstrate the feasibility of a statistically motivated selection stage as a preprocessing step that can be integrated into online learning on noisy data with sparse labels. Although the present study focuses on a specific application in the auditory domain, we are confident that the presented algorithm is applicable to more general settings. Whenever labeling can be corrupted by noise or be incompatible with
the features used, our criterion prevents a system from learning from malicious data. In fact, any online learning system with limited memory and unreliable labeling can benefit from our algorithm. Beyond the realm of hearing instruments, this might be of importance for any portable device that needs to adapt to a changing environment, such that factory training is not sufficient. In the dawning age of "wearable" computing, such devices are likely to become more and more ubiquitous. While the relevance of our algorithm for such applications still needs to be established in further research, our results present a crucial step towards "smart" hearing instruments that adapt to changes in the environment and in idiosyncratic preferences.

Acknowledgement: This work is funded by KTI, Nr. 8539.2;2EPSS-ES.
References
1. Büchler, M., Allegro, S., Launer, S., Dillier, N.: Sound classification in hearing aids inspired by auditory scene analysis. Journal of Applied Signal Processing 18, 2991–3002 (2005)
2. Moh, Y., Orbanz, P., Buhmann, J.: Music preference learning with partial information. In: ICASSP (2008)
3. Zolter, J.Z., Maloof, M.A.: Using additive expert ensembles to cope with concept drift. In: Proceedings of the 22nd Intl. Conference on Machine Learning (2005)
4. Chapelle, O., Schölkopf, B., Zien, A. (eds.): Semi-Supervised Learning. MIT Press, Cambridge (2006)
5. Belkin, M., Niyogi, P.: Using manifold structure for partially labelled classification. In: Advances in NIPS, vol. 15 (2003)
6. Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local and global consistency. In: Advances in Neural Information Processing Systems, vol. 16, pp. 321–328. MIT Press, Cambridge (2004)
7. Ellis, D.: Classifying music audio with timbral and chroma features. In: Proc. Int. Conf. on Music Information Retrieval, Vienna, Austria (September 2007)
8. Ellis, D.: PLP and RASTA (and MFCC, and inversion) in Matlab. Online web resource (2005)
Novel VQ Designs for Discrete HMM On-Line Handwritten Whiteboard Note Recognition

Joachim Schenk, Stefan Schwärzler, Günther Ruske, and Gerhard Rigoll

Institute for Human-Machine Communication, Technische Universität München, 80290 Munich, Germany
{schenk,schwaerzler,ruske,rigoll}@mmk.ei.tum.de
Abstract. In this work we propose two novel vector quantization (VQ) designs for discrete HMM-based on-line handwriting recognition of whiteboard notes. Both VQ designs represent the binary pressure information without any loss. The new designs are necessary because standard k-means VQ systems cannot quantize this binary feature adequately, as is shown in this paper. Our experiments show that the new systems provide a relative improvement of r = 1.8 % in recognition accuracy on a character- and r = 3.3 % on a word-level benchmark compared to a standard k-means VQ system. Additionally, our system is compared and proven to be competitive to a state-of-the-art continuous HMM-based system, yielding a slight relative improvement of r = 0.6 %.
1 Introduction
Hidden Markov Models (HMMs, [1]) have proven their power for modeling time-dynamic sequences of variable lengths. HMMs also compensate for statistical variations in those sequences. Due to this property they have been adopted from automatic speech recognition (ASR) for the problem of on-line (cursive) handwriting recognition [2], and have more recently been applied to on-line handwritten whiteboard note recognition [3]. One distinguishes between continuous and discrete HMMs. In the case of continuous HMMs, the observation probability is modeled by mixtures of Gaussians [1], whereas for discrete HMMs the probability computation is a simple table look-up. In the latter case vector quantization (VQ) is performed to transform the continuous data into discrete symbols. While in ASR continuous HMMs are increasingly accepted, it remains unclear whether discrete or continuous HMMs should be used in on-line handwriting [4] and whiteboard note recognition in particular. In a common handwriting recognition system each symbol (i.e. letter) is represented by one single-stream HMM (either discrete or continuous). Words are recognized by combining character HMMs using a dictionary. While high recognition rates are reported for isolated word recognition systems [5], performance drops considerably when it comes to the recognition of whole unconstrained handwritten text lines [3]: the lack of previous word segmentation introduces new
variability and therefore requires more sophisticated character recognizers. An even more demanding task is the recognition of handwritten whiteboard notes as introduced in [3]. The conditions described in [3] contribute to the characterization of on-line whiteboard note recognition as a "difficult" problem. In this paper, we discuss the use of discrete single-stream HMMs for the task of on-line whiteboard note recognition with respect to varying codebook sizes. While in ASR features generally have a purely continuous nature, in handwriting recognition continuous features are used as well as discrete or even binary features [3]. As shown in [6], the binary feature "pressure" is one of the most significant features for recognition. Our experiments indicate that state-of-the-art vector quantizers are not capable of coding this binary feature properly due to the quantization error. In this paper, we therefore introduce two novel VQ designs which are capable of adequately quantizing this feature. To that end, the next section gives a brief overview of the recognition system, including the necessary preprocessing and feature extraction for whiteboard note recognition. Section 3 reviews VQ as well as discrete HMMs. Two novel VQ systems are introduced in Sec. 4 in order to handle binary features. The impact of varying codebook sizes and the novel VQ designs is evaluated in the experimental section (Sec. 5), in which our discrete system is compared to a state-of-the-art continuous system. Finally, conclusions and a discussion are presented in Sec. 6.
2 System Overview
For recording the handwritten whiteboard data the eBeam system (http://www.e-beam.com) is used. A special sleeve allows the use of a normal pen. This sleeve sends infrared signals to a receiver mounted on any corner of the whiteboard. As a result, the x- and y-coordinates of the sleeve as well as the information whether or not the tip of the pen hits the whiteboard, the binary "pressure" p, are recorded at a varying sample rate of T_s = 14 ms, ..., 33 ms. Afterwards, the written data is heuristically segmented into lines [3]. The sampled data is preprocessed and normalized as a first step. In the subsequent feature extraction step, 24 features are extracted from the three-dimensional data vector (sample points) s_t = (x(t), y(t), p(t))^T. Then, VQ with varying codebooks and codebook sizes is performed. Finally, the data is recognized by a classifier based on discrete HMMs.

2.1 Preprocessing and Feature Extraction
As mentioned above, the data consists of individual text lines recorded at varying sample rates. Therefore the data is sampled equidistantly neither in time nor in space. As a result, two characters with the same size and style may result in completely different temporal sequences even if written with the same speed. To avoid this time-varying effect, the data is resampled to achieve equidistant
sampling. Following this, a histogram-based skew and slant correction is performed as described in [7]. Finally, all text lines are normalized such that there is a distance of "one" between the upper and lower scriptlines. Afterwards, features are extracted from the three-dimensional sample vector s_t = (x(t), y(t), p(t))^T in order to derive a 24-dimensional feature vector f_t = (f_1(t), ..., f_24(t)). The state-of-the-art features for handwriting recognition and recently published new features (altered slightly) for whiteboard note recognition [3] used in this paper are briefly listed below; they refer to the current sample point s_t and can be divided into two classes: on-line and off-line features. As on-line features we extract:

f_1: the pen "pressure", i.e. f_1 = 1 if the pen tip is on the whiteboard and f_1 = 0 otherwise   (1)
f_2: velocity equivalent, computed before resampling and later interpolated
f_3: x-coordinate after resampling and subtraction of the moving average
f_4: y-coordinate after resampling and normalization
f_{5,6}: angle α of spatially resampled and normalized strokes (coded as sin α and cos α, the "writing direction")
f_{7,8}: difference of consecutive angles Δα = α_t − α_{t−1} (coded as sin Δα and cos Δα, the "curvature")

In addition, certain on-line features describing the relation of the sample point s_t to its neighbors were adopted and altered from those described in [3]. These are:

f_9: logarithmic transformation of the aspect v of the trajectory between the points s_{t−τ} and s_t (the "vicinity aspect"), f_9 = sign(v) · log(1 + |v|)
f_{10,11}: angle φ between the line [s_{t−τ}, s_t] and the lower line (coded as sin φ and cos φ, the "vicinity slope")
f_{12}: length of the trajectory normalized by max(|Δx|, |Δy|) (the "vicinity curliness")
f_{13}: average squared distance between each point of the trajectory and the line [s_{t−τ}, s_t]

The second class of features, the so-called off-line features, are:

f_{14−22}: a 3 × 3 subsampled bitmap slid along the pen's trajectory (the "context map"), incorporating a 30 × 30 fraction of the currently written letter's actual image
f_{23,24}: the number of pixels above and, respectively, beneath the current sample point s_t (the "ascenders" and "descenders")
3 Vector Quantization and Discrete HMMs
In this section we briefly summarize vector quantization (VQ), review discrete HMMs, and describe the notations.

3.1 Vector Quantization
Quantization is the mapping of a continuous, N-dimensional sequence O = (f_1, ..., f_T), f_t ∈ R^N, to a discrete, one-dimensional sequence of codebook indices ô = (f̂_1, ..., f̂_T), f̂_t ∈ N, provided by a codebook C = (c_1, ..., c_{N_cdb}), c_k ∈ R^N, containing |C| = N_cdb centroids [8]. For N = 1 this mapping is called scalar quantization, and in all other cases (N ≥ 2) vector quantization (VQ). Once a codebook C is generated, the assignment of the continuous sequence to the codebook entries is a minimum distance search,

f̂_t = argmin_{1 ≤ k ≤ N_cdb} d(f_t, c_k),   (2)

where d(f_t, c_k) is commonly the squared Euclidean distance. The codebook C itself and its entries c_i are derived from a training set S_train containing |S_train| = N_train training samples O_i by partitioning the N-dimensional feature space defined by S_train into N_cdb cells. This is performed by the well-known k-means algorithm as described e.g. in [8; 9; 10]. As stated in [8], the centroids of a well-trained codebook capture the distribution p(f) of the underlying feature vectors in the training data. As the values of the features described in Sec. 2.1 are neither mean nor variance normalized, each feature f_j is normalized to mean μ_j = 0 and standard deviation σ_j = 1, yielding the normalized feature f̃_j. The statistical dependencies of the features are thereby unchanged.
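A minimal sketch of this standard VQ baseline, with scikit-learn's k-means standing in for the codebook training described above; the normalization constants and the codebook are estimated on the training set and then applied to new frames.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(train_features, n_cdb):
    # Normalize every feature to zero mean and unit variance (Sec. 3.1), then
    # derive the N_cdb centroids with k-means on the training set.
    mu, sigma = train_features.mean(axis=0), train_features.std(axis=0) + 1e-12
    normalized = (train_features - mu) / sigma
    codebook = KMeans(n_clusters=n_cdb, n_init=10).fit(normalized).cluster_centers_
    return codebook, mu, sigma

def quantize(frames, codebook, mu, sigma):
    # Minimum-distance search of Eq. 2: index of the closest centroid per frame.
    normalized = (frames - mu) / sigma
    d = ((normalized[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)
```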
3.2 Discrete HMMs
For handwriting recognition with discrete HMMs, each symbol (in this paper each character) is modeled by one HMM. Each discrete HMM i is represented by a set of parameters λ_i = (A, B, π), where A denotes the transition matrix, B the matrix of discrete output probabilities corresponding to each possible discrete observation, and π the initial state distribution [1]. In order to use discrete HMMs, the continuous observations O = (f_1, ..., f_T) are vector quantized, yielding discrete observation sequences o_i = (f̂_1, ..., f̂_T) as explained in the previous section. Given some discrete training data o_i = (f̂_1, ..., f̂_T), the parameters λ_i can be trained with the well-known EM method, in the case of HMMs known as the Baum-Welch algorithm [11]. Recognition is performed by presenting the unknown pattern x to all HMMs λ_i and selecting the model

k* = argmax_i p(x | λ_i)   (3)

with the highest likelihood. In the case of word or even sentence recognition, this is done by the Viterbi algorithm [12], which also performs a segmentation of the input vector x.
4 VQ Designs
As mentioned in the introduction, in contrast to ASR, both continuous and discrete features are used in handwritten whiteboard note recognition. In the case of discrete HMMs, where all features are quantized jointly, the binary feature f_1 as defined in Eq. 1 loses its significance due to inadequate quantization and the quantization error. However, in [6] it is stated that in continuous HMM based recognition systems the pen's pressure is one of the four most significant features. In the following, we therefore present two VQ designs which explicitly model the binary pressure feature f_1 without loss.
4.1 Joint-Codebook VQ Design
For the first VQ design, f̃_1 is treated as statistically independent of f_{2-24}, i.e. p(f̃_1 | f̃_2, ..., f̃_24) = p(f̃_1). In this case

p(f̃) = p(f̃_1, ..., f̃_24) = p(f̃_1) · p(f̃_2, ..., f̃_24).   (4)

Hence, p(f̃) can be quantized by two independent codebooks. The first one is directly formed by the normalized pressure value f̃_1, and the remaining features f_{2-24} are represented by M 23-dimensional centroids r_k. N centroids c_i representing the 24-dimensional data vectors f̃_t = (f̃_1, ..., f̃_24) can then be obtained by augmenting the centroids r_k as follows:

c_i = (f̃_1|_{f_1=0}, r_j)   for 1 ≤ i ≤ M, j = i,
c_i = (f̃_1|_{f_1=1}, r_j)   for M + 1 ≤ i ≤ 2·M, j = i − M,   (5)

where f̃_1|_{f_1=n} is the normalized value of f̃_1 corresponding to f_1 = n, n ∈ {0, 1}. Hence, Eq. 5 describes the construction of a joint codebook. The total number N of centroids c_i amounts to N = 2·M. The modified VQ system is shown in Fig. 1.
Fig. 1. VQ system using a joint codebook for explicitly modeling f1 and assuming statistical independence according to Eq. 4.
Note: the authors are aware that in this case multiple-stream HMMs [4] could also be applied; however, these are beyond this paper's scope.
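The joint-codebook construction of Eq. 5 amounts to prepending the two possible normalized pressure values to the M pressure-free centroids; a sketch (names are illustrative):

```python
import numpy as np

def build_joint_codebook(centroids_r, p0_norm, p1_norm):
    # Eq. 5: augment the M 23-dimensional centroids r_k with the two possible
    # normalized pressure values, so that f1 is represented without loss.
    # p0_norm / p1_norm are the normalized values of f1 for f1 = 0 and f1 = 1.
    m = len(centroids_r)
    with_p0 = np.hstack([np.full((m, 1), p0_norm), centroids_r])
    with_p1 = np.hstack([np.full((m, 1), p1_norm), centroids_r])
    return np.vstack([with_p0, with_p1])   # N = 2 * M joint centroids c_i
```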
4.2 Codebook-Switching VQ Design
In the second VQ design the statistical dependency between f_1 and the other features is taken into account. Applying Bayes' rule, the joint probability p(f̃) in this case yields

p(f̃) = p(f̃_1, ..., f̃_24) = p(f̃_2, ..., f̃_24 | f̃_1) · p(f̃_1)
     = p(f̃_2, ..., f̃_24 | f̃_1 < 0) · p(f̃_1 < 0)   if f_1 = 0
     = p(f̃_2, ..., f̃_24 | f̃_1 > 0) · p(f̃_1 > 0)   if f_1 = 1.   (6)

As pointed out in Sec. 2.1, the feature f_1 is binary. Hence, as indicated in Eq. 6, p(f̃) can be represented by two arbitrary codebooks C_s and C_g depending on the value of f_1. To adequately model p(f̃_2, ..., f̃_24 | f̃_1), the normalized training set S̃_train, which consists of T_train feature vectors (f̃_1, ..., f̃_{T_train}), is divided into two sets F_s and F_g, where

F_s = {f̃_t | f_{1,t} = 0},  F_g = {f̃_t | f_{1,t} = 1},  1 ≤ t ≤ T_train.   (7)

Afterwards, the assigned feature vectors are reduced to the features f_{2,...,24}, yet the pressure information can be inferred from the assignment of Eq. 7. N_s centroids r_{s,i}, i = 1, ..., N_s, are derived from the set F_s, and N_g centroids r_{g,j}, j = 1, ..., N_g, from the set F_g, forming two independent codebooks R_s and R_g holding N = N_s + N_g centroids for the whole system. Given N and the ratio

R = N_g / N_s  ⇒  N_g = ⌊N / (1 + 1/R) + 0.5⌋,  N_s = N − N_g,   (8)

the numbers N_s and N_g can be derived for any value of N. The optimal values of both N and R with respect to maximum recognition accuracy are derived by experiment in Sec. 5. In order to keep the exact pressure information after VQ for each normalized feature vector f̃_t of the data set, two codebooks C_s and C_g (with centroids c_{s,i}, 1 ≤ i ≤ N_s, and c_{g,j}, 1 ≤ j ≤ N_g) are first constructed similarly to Eq. 5:

c_{s,i} = (f̃_1|_{f_1=0}, r_{s,i}),  1 ≤ i ≤ N_s,
c_{g,j} = (f̃_1|_{f_1=1}, r_{g,j}),  1 ≤ j ≤ N_g.   (9)

The value of the feature f̃_{1,t} is handed to the quantizer separately to decide which codebook C should be used for quantization:

f̂_t = argmin_{1 ≤ i ≤ N_s} d(f̃_t, c_{s,i})          if f_{1,t} = 0,
f̂_t = argmin_{1 ≤ j ≤ N_g} d(f̃_t, c_{g,j}) + N_s    if f_{1,t} = 1.   (10)

The VQ system using the two switching codebooks is illustrated in Fig. 2.
Fig. 2. Second VQ system to model the statistical dependency between f1 and f2,...,24 according to Eq. 6
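Equations 8–10 translate directly into code; the sketch below assumes the two codebooks have been built according to Eq. 9 (24-dimensional centroids with the pressure component filled in) and that frames are already normalized.

```python
import numpy as np

def split_codebook_sizes(n_total, ratio):
    # Eq. 8: N_g = round(N / (1 + 1/R)), N_s = N - N_g.
    n_g = int(np.floor(n_total / (1.0 + 1.0 / ratio) + 0.5))
    return n_total - n_g, n_g

def switching_quantize(frame_norm, pressure, codebook_s, codebook_g):
    # Eq. 10: the binary pressure selects the codebook to search; indices of
    # the f1 = 1 codebook C_g are offset by N_s so that every centroid keeps
    # a unique discrete symbol.
    if pressure == 0:
        d = ((codebook_s - frame_norm) ** 2).sum(axis=1)
        return int(d.argmin())
    d = ((codebook_g - frame_norm) ** 2).sum(axis=1)
    return int(d.argmin()) + len(codebook_s)
```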
5 Experimental Results
The experiments presented in this section are conducted on a database containing heuristically line-segmented handwritten whiteboard notes (IAM-OnDB, http://www.iam.unibe.ch/~fki/iamnodb/). For further information on the IAM-OnDB see [13]. Comparability of the results is provided by using the settings of the writer-independent IAM-onDB-t1 benchmark, consisting of 56 different characters and an 11 k dictionary, which also provides well-defined writer-disjunct sets (one for training, two for validation, and one for test). For our experiments, the same HMM topology as in [3] is used. The following four experiments are conducted on the combination of both validation sets, each with seven different codebook sizes (N = 10, 100, 500, 1000, 2000, 5000, 7500). For training the vector quantizer as well as the parameters λ_i of the discrete HMMs, the IAM-onDB-t1 training set is used. The results with respect to the actual codebook size N are depicted as character accuracy on the left-hand side of Fig. 3.

Experiment 1 (Exp. 1): In the first experiment all components of the feature vectors (f_{1,...,24}) are quantized jointly by one codebook. The results shown in Fig. 3 (left) form the baseline for the following experiments. As one can see, the maximum character accuracy a_base = 62.6 % is achieved for a codebook size of N = 5000. The drop in recognition performance when raising the codebook size to N = 7500 is due to sparse data [1].

Experiment 2 (Exp. 2): To prove that the binary feature f_1 is not adequately quantized by standard VQ, independent of the number of centroids, all features except the pressure information (f_{2,...,24}) are quantized jointly for the second experiment. As Fig. 3 (left) shows, only little degradation in recognition performance compared to the baseline can be observed. The peak rate of a_r = 62.5 % is once again reached at a codebook size of N = 5000, which equals a relative change of r = −0.2 %. This is rather surprising, as in [6] pressure is assumed to be a relevant feature in on-line whiteboard note recognition.

Experiment 3 (Exp. 3): The fact that the recognition accuracy decays only moderately when omitting the pressure information shows that the binary information is not quantized properly in the first experiment. Therefore f_1 is directly
Fig. 3. Character accuracy of the different systems with respect to the codebook size N (left); character accuracy for different codebook sizes and varying ratios R = N_g/N_s for the VQ design using codebook switching (right)
modeled as explained in Sec. 4.1 and a joint codebook is used. This results in a further drop in performance due to the independent modeling of f_1 and f_{2-24}, as displayed in Fig. 3 (left). Peak performance is found to be a_joint = 62.4 % for a codebook size of N = 7500, which is a relative change of r = −0.3 %.

Experiment 4 (Exp. 4): In the last experiment the performance of the second VQ system, as introduced in Sec. 4.2, is evaluated. The optimal value of R = N_g/N_s is found by experimentation. Investigating the right-hand side of Fig. 3 reveals the optimal values of R for arbitrary codebook sizes. Finally, the results are shown on the left-hand side of Fig. 3 with respect to the codebook size N for the previously found optimal values of R. The highest character accuracy of a_switch = 63.7 % is found for N = 5000 and R_opt = 5, which yields (according to Eq. 8) N_s = 833 and N_g = 4167 for the codebooks C_s and C_g. Compared to the baseline system this is a relative improvement of r = 1.8 % (about 1 % absolute).

In order to prove the competitiveness of our systems, the parameters and models which delivered the best-performing systems in the previous experiments are used to perform word-level recognition on the test set of the IAM-onDB-t1 benchmark, and are compared to a state-of-the-art continuous recognition system as presented in [14]. The baseline system, using a standard VQ and coding all features jointly, achieves a word accuracy of A_base = 63.5 %. As expected from the character accuracies of the previous experiments (Exp. 2), the omission of the "pressure" information has little influence on the word-level accuracy: A_r = 63.3 % can be reported in this case, indicating a drop of r = −0.3 % relative to the baseline system. The word accuracy of the first VQ design using a joint codebook is A_joint = 63.2 %, a relative change of r = −0.5 % with respect to the baseline. Compared to the standard VQ system, an absolute word accuracy of A_switch = 65.6 % can be achieved by using the codebook-switching design (Exp. 4), which is a relative improvement of r = 3.3 %. This system even slightly outperforms the system presented in [14], by r = 0.6 % relative. The results are summarized in Table 1.
Table 1. Final results for the experiments Exp. 1, ..., Exp. 4 and one further system [14] on the same word-level recognition task

system              | Exp. 1 (A_base) | Exp. 2 (A_r) | Exp. 3 (A_joint) | Exp. 4 (A_switch) | [14]
word-level accuracy | 63.5 %          | 63.3 %       | 63.2 %           | 65.6 %            | 65.2 %
6 Conclusion and Discussion
In this paper, we successfully applied discrete HMMs in the field of handwritten whiteboard note recognition. Our experiments with a common VQ system show that the binary pressure information is not adequately quantized regardless of the codebook size. To overcome this problem, two VQ designs are introduced which model the pressure information without loss. The first approach, based on the assumption that the pressure feature is statistically independent of the remaining features, showed improvement neither in character nor in word accuracy. The second VQ design takes the statistical dependency between the pressure and the remaining features into account by using two arbitrary codebooks. The main parameters of this second system are the ratio R = N_g/N_s of the two codebook sizes as well as the total number N of centroids used. Both parameters are optimized on a validation set by means of maximal character accuracy. The best-performing combination led to a relative improvement of r = 3.3 % in word-level accuracy on an arbitrary test set, compared to a common VQ system. In comparison to a recently published continuous system, a slight relative improvement of r = 0.6 % can be reported, illustrating the competitiveness of our system. In future work the role of the parameter R will be investigated more analytically. Additionally, we plan to extend the approaches presented in this paper to other binary and discrete features commonly used in on-line whiteboard note recognition, as well as to investigate the use of multiple-stream HMMs [4].
Acknowledgments The authors sincerely thank Marcus Liwicki for providing the lattice for the final benchmark, Garrett Weinberg, and the reviewers for their useful comments.
References
[1] Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. of the IEEE 77(2), 257–285 (1989)
[2] Plamondon, R., Srihari, S.N.: On-Line and Off-Line Handwriting Recognition: A Comprehensive Survey. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(1), 63–84 (2000)
[3] Liwicki, M., Bunke, H.: HMM-Based On-Line Recognition of Handwritten Whiteboard Notes. In: Proc. of the Int. Workshop on Frontiers in Handwriting Rec., pp. 595–599 (2006)
[4] Rigoll, G., Kosmala, A., Rottland, J., Neukirchen, C.: A Comparison between Continuous and Discrete Density Hidden Markov Models for Cursive Handwriting Recognition. In: Proc. of the Int. Conf. on Pattern Rec., vol. 2, pp. 205–209 (1996)
[5] Schenk, J., Rigoll, G.: Novel Hybrid NN/HMM Modelling Techniques for On-Line Handwriting Recognition. In: Proc. of the Int. Workshop on Frontiers in Handwriting Rec., pp. 619–623 (2006)
[6] Liwicki, M., Bunke, H.: Feature Selection for On-Line Handwriting Recognition of Whiteboard Notes. In: Proc. of the Conf. of the Graphonomics Society, pp. 101–105 (2007)
[7] Kavallieratou, E., Fakotakis, N., Kokkinakis, G.: New Algorithms for Skewing Correction and Slant Removal on Word-Level. In: Proc. of the Int. Conf. ECS, vol. 2, pp. 1159–1162 (1999)
[8] Makhoul, J., Roucos, S., Gish, H.: Vector Quantization in Speech Coding. Proc. of the IEEE 73(11), 1551–1588 (1985)
[9] Forgy, E.W.: Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classifications. Biometrics 21, 768–769 (1965)
[10] Gray, R.M.: Vector Quantization. IEEE ASSP Magazine, 4–29 (April 1984)
[11] Baum, L.E., Petrie, T.: Statistical Inference for Probabilistic Functions of Finite State Markov Chains. Annals of Mathematical Statistics 37, 1554–1563 (1966)
[12] Viterbi, A.: Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Transactions on Information Theory 13, 260–267 (1967)
[13] Liwicki, M., Bunke, H.: IAM-OnDB - an On-Line English Sentence Database Acquired from Handwritten Text on a Whiteboard. In: Proc. of the Int. Conf. on Document Analysis and Rec., vol. 2, pp. 1159–1162 (2005)
[14] Liwicki, M., Bunke, H.: Combining On-Line and Off-Line Systems for Handwriting Recognition. In: Proc. of the Int. Conf. on Document Analysis and Rec., pp. 372–376 (2007)
Switching Linear Dynamic Models for Noise Robust In-Car Speech Recognition

Björn Schuller¹, Martin Wöllmer¹, Tobias Moosmayr², Günther Ruske¹, and Gerhard Rigoll¹

¹ Technische Universität München, Institute for Human-Machine Communication, 80290 München, Germany
[email protected]
² BMW Group, Forschungs- und Innovationszentrum, Akustik, Komfort und Werterhaltung, 80788 München, Germany
Abstract. The performance of speech recognition systems strongly degrades in the presence of background noise, like the driving noise in the interior of a car. We compare two different Kalman filtering approaches which attempt to improve noise robustness: Switching Linear Dynamic Models (SLDM) and Autoregressive Switching Linear Dynamical Systems (AR-SLDS). Unlike previous works, which are restricted to considering white noise, we evaluate the modeling concepts in a noisy speech recognition task where colored noise produced by different driving conditions and car types is also taken into account. Thereby we demonstrate that speech enhancement based on Kalman filtering prevails over all standard de-noising techniques considered herein, such as Wiener filtering, Histogram Equalization, and Unsupervised Spectral Subtraction.
1 Introduction
Aiming to counter the performance degradation of speech recognition systems in noisy surroundings, such as the interior of a car, a variety of different concepts have been developed in recent years. The common goal of all noise compensation strategies is to minimize the mismatch between training and recognition conditions, which occurs whenever the speech signal is distorted by noise. Consequently, two main methods can be distinguished: one is to reduce the mismatch by adapting the acoustic models to noisy conditions. This can be achieved either by using noisy training data or by joint speech and noise modeling. The other method tries to determine the clean features from the noisy speech sequence while using clean training data. Preprocessing techniques for speech enhancement aim to compensate the effects of noise before the feature-based speech representation is classified by the recognizer, which has been trained on clean data. The state-of-the-art speech signal preprocessing that is used as a baseline feature extraction algorithm for noisy speech recognition problems is the Advanced Front End (AFE) two-step Wiener filtering concept introduced in [1]. As shown in [2], methods based on spectral subtraction like Cepstral Mean Subtraction (CMS) [3] or Unsupervised
Spectral Subtraction (USS) [4] reach similar performance while requiring less computational cost than Wiener filtering. Further attempts to reduce the mismatch between test and training conditions are Mean and Variance Normalization (MVN) [5] and Histogram Equalization (HEQ) [6], [7], a technique which is often used in digital image processing to improve the contrast of pictures. In speech processing, HEQ is a powerful method to improve the temporal dynamics of feature vector components distorted by noise. This paper examines a model-based preprocessing approach to enhance noisy features, as proposed in [8]. Here a Switching Linear Dynamic Model (SLDM), which can be considered a Kalman filter, is used to describe the dynamics of speech, while another linear dynamic model captures the dynamics of additive noise. Both models serve to derive an observation model describing how speech and noise produce the noisy observations, and to reconstruct the features of clean speech. A second technique for noise robust speech recognition using Kalman filtering is outlined and applied in the noisy speech recognition task of this work. This method was first introduced in [9], where a Switching Autoregressive Hidden Markov Model (SAR-HMM) was extended to an Autoregressive Switching Linear Dynamical System (AR-SLDS) for improved noise robustness. Similar to the SLDM, the AR-SLDS includes an explicit noise model by modeling the dynamics of both the raw speech signal and the noise. However, this technique does not model feature vectors like the SLDM, but the raw speech signal in the time domain. The paper is organized as follows: Section 2 outlines the SLDM used for feature enhancement in this work, while Section 3 introduces the SAR-HMM, which is embedded into an AR-SLDS in Section 4. Both Kalman filtering approaches are evaluated in a noisy isolated digit recognition task in Section 5.
2 Switching Linear Dynamic Models
Model based speech enhancement techniques are based on modeling speech and noise. Together with a model of how speech and noise produce the noisy observations, these models are used to enhance the noisy speech features. In [8] a Switching Linear Dynamic Model is used to capture the dynamics of clean speech. Similar to Hidden Markov Model (HMM) based approaches to model clean speech, the SLDM assumes that the signal passes through various states. Conditioned on the state sequence, the SLDM furthermore enforces a continuous state transition in the feature space.
2.1 Modeling of Noise
Unlike speech, which is modeled applying an SLDM, the modeling of noise is done by using a simple Linear Dynamic Model (LDM) obeying the following system equation:

x_t = A x_{t-1} + b + v_t   (1)
Fig. 1. Linear Dynamic Model for noise
Thereby the matrix A and the vector b simulate how the noise process evolves over time and v_t represents a Gaussian noise source driving the system. A graphical representation of this LDM can be seen in Figure 1. As LDMs are time-invariant, they are suited to model signals like colored stationary Gaussian noise. Alternatively to the graphical model in Figure 1, the equations

p(x_t | x_{t-1}) = N(x_t; A x_{t-1} + b, C)   (2)

p(x_{1:T}) = p(x_1) ∏_{t=2}^{T} p(x_t | x_{t-1})   (3)

can be used to express the LDM. Here, N(x_t; A x_{t-1} + b, C) is a multivariate Gaussian with mean vector A x_{t-1} + b and covariance matrix C, whereas T denotes the length of the input sequence.
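To make the role of the LDM parameters concrete, the following minimal sketch (hypothetical NumPy code, not part of the original system) samples a colored-noise sequence from the model of Eqs. (1)-(3) and evaluates its log-likelihood, assuming the parameters A, b and C are given.

import numpy as np
from scipy.stats import multivariate_normal

def sample_ldm(A, b, C, x1, T, rng=None):
    """Draw x_2..x_T from the noise LDM x_t = A x_{t-1} + b + v_t, v_t ~ N(0, C)."""
    rng = rng or np.random.default_rng(0)
    d = len(b)
    x = np.zeros((T, d))
    x[0] = x1
    for t in range(1, T):
        x[t] = A @ x[t - 1] + b + rng.multivariate_normal(np.zeros(d), C)
    return x

def ldm_loglik(x, A, b, C):
    """log p(x_2..x_T | x_1) according to Eqs. (2) and (3)."""
    return sum(multivariate_normal.logpdf(x[t], mean=A @ x[t - 1] + b, cov=C)
               for t in range(1, len(x)))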
2.2 Modeling of Speech
The modeling of speech is realized by a more complex dynamic model which also includes a hidden state variable s_t at each time t. Now A and b depend on the state variable s_t:

x_t = A(s_t) x_{t-1} + b(s_t) + v_t   (4)

Consequently every possible state sequence s_{1:T} describes an LDM which is non-stationary due to A and b changing over time. Time-varying systems like the evolution of speech features over time can be described adequately by such models. As can be seen in Figure 2, it is assumed that there are time dependencies among the continuous variables x_t, but not among the discrete state variables s_t. This is the major difference between the SLDM used in [8] and the models used in [10], where time dependencies among the hidden state variables are included. A modification like this can be seen as analogous to extending a Gaussian Mixture Model (GMM) to an HMM. The SLDM corresponding to Figure 2 can be described as follows:

p(x_t, s_t | x_{t-1}) = N(x_t; A(s_t) x_{t-1} + b(s_t), C(s_t)) · p(s_t)   (5)

p(x_{1:T}, s_{1:T}) = p(x_1, s_1) ∏_{t=2}^{T} p(x_t, s_t | x_{t-1})   (6)

Fig. 2. Switching Linear Dynamic Model for speech
Fig. 3. Observation model for noisy speech yt
To train the parameters A(s), b(s) and C(s) of the SLDM, conventional EM techniques are used [11]. Setting the number of states to one corresponds to training a Linear Dynamic Model instead of an SLDM to obtain the parameters A, b and C needed for the LDM which is used to model noise.
2.3 Observation Model
In order to obtain a relationship between the noisy observation and the hidden speech and noise features, an observation model has to be defined. Figure 3 illustrates the graphical representation of the zero variance observation model with SNR inference introduced in [12]. Thereby it is assumed that speech x_t and noise n_t mix linearly in the time domain, corresponding to a non-linear mixing in the cepstral domain.
2.4 Posterior Estimation and Enhancement
A possible approximation to reduce the computational complexity of posterior estimation is to restrict the size of the search space applying the generalized pseudo-Bayesian (GPB) algorithm [13]. The GPB algorithm is based on the assumption that distinct state histories whose differences occur more than r frames in the past can be neglected. Consequently, if T denotes the length of the sequence, the inference complexity is reduced from S^T to S^r, where r ≪ T. Using the GPB algorithm, the three steps collapse, predict and observe are conducted for each speech frame [8]. The Gaussian posterior obtained in the observation step of the GPB algorithm is used to obtain estimates of the moments of x_t. Those estimates represent the de-noised speech features and can be used for speech recognition in noisy environments. Thereby the clean features are assumed to be the Minimum Mean Square Error (MMSE) estimate E[x_t | y_{1:t}].
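The following sketch illustrates a GPB-style filter with history parameter r = 1 for an SLDM. It is a simplified illustration only: it assumes a linear-Gaussian observation model y_t = x_t + n_t with noise covariance R instead of the non-linear cepstral observation model of [12], and all variable names are hypothetical.

import numpy as np

def gpb1_enhance(y, A, b, C, prior, R):
    """Collapse/predict/observe per frame; returns MMSE estimates E[x_t | y_1..t]."""
    T, d = y.shape
    S = len(prior)
    mu, P = y[0].copy(), np.eye(d)                    # crude initialization
    x_hat = np.zeros((T, d))
    x_hat[0] = mu
    for t in range(1, T):
        means, covs, logw = [], [], []
        for s in range(S):
            m_p = A[s] @ mu + b[s]                    # predict under state s
            P_p = A[s] @ P @ A[s].T + C[s]
            Sg = P_p + R                              # innovation covariance
            K = P_p @ np.linalg.inv(Sg)               # Kalman gain
            means.append(m_p + K @ (y[t] - m_p))      # observe
            covs.append((np.eye(d) - K) @ P_p)
            diff = y[t] - m_p
            logw.append(np.log(prior[s])
                        - 0.5 * (diff @ np.linalg.solve(Sg, diff)
                                 + np.linalg.slogdet(Sg)[1] + d * np.log(2 * np.pi)))
        logw = np.asarray(logw)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        mu = sum(w[s] * means[s] for s in range(S))   # collapse by moment matching
        P = sum(w[s] * (covs[s] + np.outer(means[s] - mu, means[s] - mu))
                for s in range(S))
        x_hat[t] = mu                                 # MMSE estimate of the clean features
    return x_hat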
3 Switching Autoregressive Hidden Markov Models
An alternative to conventional HMM modeling of speech is the modeling of the raw signal directly in the time domain. As shown in [14] and [15], modeling the raw signal can be a reasonable alternative to feature-based approaches. In [9] a Switching Autoregressive HMM is applied for isolated digit recognition. The SAR-HMM is based on modeling the speech signal as an autoregressive (AR) process, whereas the non-stationarity of human speech is captured by switching between a number of different AR parameter sets. This is done by a discrete switch variable s_t that can be seen as an analogue of the HMM states.
Fig. 4. Dynamic Bayesian Network structure of the SAR-HMM
One of S different states can be occupied at each time step t. Thereby the state variable indicates which AR parameter set to use at the given time instant t. Here, the time index t denotes the samples in the time domain and not the feature vectors as in Section 2. The current state only depends on the preceding state with transition probability p(s_t | s_{t-1}). Furthermore it is assumed that the current sample v_t is a linear combination of the R preceding samples superposed by a Gaussian distributed innovation η(s_t). Both η(s_t) and the AR weights c_r(s_t) depend on the current state s_t:

v_t = − ∑_{r=1}^{R} c_r(s_t) v_{t-r} + η(s_t),   with η ∼ N(η; 0, σ²(s_t))   (7)

The purpose of η(s_t) is not to model an independent additive noise process but to model variations from pure autoregression. For the SAR-HMM the joint probability of a sequence of length T is

p(s_{1:T}, v_{1:T}) = p(v_1 | s_1) p(s_1) ∏_{t=2}^{T} p(v_t | v_{t-R:t-1}, s_t) p(s_t | s_{t-1})   (8)
corresponding to the Dynamic Bayesian Network (DBN) structure illustrated in Figure 4. As the number of samples in the time domain which are used as input for the SAR-HMM is usually a lot higher than the number of feature vectors observed by an HMM, it is necessary to ensure that the switching between the different AR models is not too fast. This is granted by forcing the model to stay in the same state for an integer multiple of K time steps. The training of the AR parameters is realized applying the EM algorithm [11]. To infer the distributions p(st |v1:T ) a technique based on the forward-backward algorithm [16] is used. Due to the fact that an observation vt depends on R preceding observations (see Figure 4) the backward pass is more complicated for the SAR-HMM than for a conventional HMM. To overcome this problem a correction smoother as derived in [17] is applied which means that the backward pass computes the posterior p(st |v1:T ) by correcting the output of the forward pass.
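The central quantity in these recursions is the state-conditional emission probability p(v_t | v_{t-R:t-1}, s_t) of Eq. (7). As an illustration (a hypothetical helper, not taken from [9]), it can be evaluated as follows.

import numpy as np

def ar_loglik(v, t, c, sigma2):
    """log p(v_t | v_{t-R:t-1}, s) for one SAR-HMM state with AR weights c (length R)
    and innovation variance sigma2; requires t >= R."""
    R = len(c)
    pred = -np.dot(c, v[t - 1::-1][:R])   # -(c_1 v_{t-1} + ... + c_R v_{t-R}), cf. Eq. (7)
    resid = v[t] - pred
    return -0.5 * (np.log(2 * np.pi * sigma2) + resid ** 2 / sigma2)

The forward-backward pass then combines these emission terms with the transition probabilities p(s_t | s_{t-1}) to obtain the posteriors p(s_t | v_{1:T}).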
4 Autoregressive Switching Linear Dynamical Systems
To improve noise robustness, the SAR-HMM can be embedded into an AR-SLDS to include an explicit noise process as shown in [9]. The AR-SLDS interprets the observed speech sample v_t as a noisy version of a hidden clean sample. Thereby the clean signal can be obtained from the projection of a hidden vector h_t which has the dynamic properties of a Linear Dynamical System:

h_t = A(s_t) h_{t-1} + η_t^H,   with η_t^H ∼ N(η_t^H; 0, Σ_H(s_t))   (9)
The dynamics of the hidden variable are defined by the transition matrix A(s_t), which depends on the current state s_t. Variations from pure linear state dynamics are modeled by the Gaussian distributed hidden "innovation" variable η_t^H. Similar to the variable η_t used in Equation 7 for the SAR-HMM, η_t^H does not model an independent additive noise source. To obtain the current observed sample, the vector h_t is projected onto a scalar v_t as follows:

v_t = B h_t + η_t^V,   with η_t^V ∼ N(η_t^V; 0, σ_V²)   (10)

The variable η_t^V thereby models independent additive white Gaussian noise which is supposed to corrupt the hidden clean sample B h_t. Figure 5 visualizes the structure of the SLDS modeling the dynamics of the hidden clean signal, as well as independent additive noise. The SLDS parameters A(s_t), B and Σ_H(s_t) can be defined in a way that the obtained SLDS mimics the SAR-HMM derived in Section 3 for the case σ_V = 0 (see [9]). This has the advantage that for σ_V > 0 a noise model is included without having to train new models. Since inference calculation for the AR-SLDS is computationally intractable, the Expectation Correction algorithm developed in [18] is applied to reduce the complexity.
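One standard way to obtain such parameters (sketched here for illustration; the exact parameterization used in [9] may differ) is the companion-matrix embedding of the AR process of Eq. (7): h_t stacks the R most recent clean samples, A(s_t) is the companion matrix of the AR weights, and B reads off the newest sample.

import numpy as np

def ar_to_slds(c, sigma2):
    """Companion-form embedding of an AR(R) model with weights c and innovation
    variance sigma2 into h_t = A h_{t-1} + eta_H, v_t = B h_t (+ eta_V)."""
    R = len(c)
    A = np.zeros((R, R))
    A[0, :] = -np.asarray(c)                 # first row: -c_1 ... -c_R
    A[1:, :-1] = np.eye(R - 1)               # shift the older samples down
    B = np.zeros(R)
    B[0] = 1.0                               # observe the newest sample
    Sigma_H = np.zeros((R, R))
    Sigma_H[0, 0] = sigma2                   # innovation only enters the first component
    return A, B, Sigma_H

With σ_V = 0 this SLDS reproduces the SAR-HMM exactly; σ_V > 0 adds the explicit white-noise model.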
Fig. 5. Dynamic Bayesian Network structure of the AR-SLDS
5 Experiments
For noisy speech recognition experiments we use the digits “zero” to “nine” from the TI 46 Speaker Dependent Isolated Word Corpus [19]. The database contains utterances from 16 different speakers - 8 female and 8 male speakers. For the sake of better comparability with the results presented in [9], only the words which are spoken by male speakers are used. For every speaker 26 utterances were recorded per word class whereas 10 samples are used for training and 16 for testing. Consequently, the overall training corpus consists of 80 utterances per class while the test set contains 128 samples per class. As in [9], all utterances were downsampled to 8000 Hz. The in-car noise database which was used as additive noise source in this work is the same as in [20]. The noise recordings aim to simulate a wide range of different car types and driving conditions such as driving on big cobbles (“COB”) at 30 km/h, over a smooth city road surface (“CTY”) at 50 km/h, and on a highway (“HWY”) at 120 km/h. Thereby four different car types are considered: BMW 530i (Touring), BMW 645Ci (Convertible), BMW M5 (Sedan) and Mini Cooper S (Convertible). Even though the soft top of both convertibles was closed during recording, the worst case noise scenario is represented by the MINI convertible driving over cobbles (see Figure 6). In spite of SNR levels below 0 dB the noisy test sequences are still well audible since the recorded noise samples are lowpass signals with most of their energy in the frequency band from 0 to 500 Hz. Consequently, there is no full overlap of the spectrum of speech and noise. Two further noise types were used: first, a mixture of babble and street noise (“BAB”) recorded in downtown Munich. This noise type is relevant for in-car speech recognition performance when driving in an urban area with open windows. The babble and street noise was superposed with the clean speech
Fig. 6. SNR distribution of TI 46 (digits) utterances superposed with car noise: frequency of occurrence versus SNR level
Table 1. Mean recognition rates for different speech enhancement and modeling methods; mean recognition rate without speech enhancement: 76.69%

Method     Recognition Rate    Method    Recognition Rate
SLDM       97.65%              USS       87.25%
AFE        85.53%              HEQ       94.77%
AR-SLDS    63.60%              CMS       93.24%
SAR-HMM    59.18%              MVN       92.39%
Fig. 7. Recognition rate in percent versus driving condition and SNR level respectively using different speech enhancement and modeling techniques
utterances at SNR levels 12 dB, 6 dB and 0 dB. Furthermore, additive white Gaussian noise (“AWGN”) has been used in the experiments. Thereby the SNR levels 20 dB, 10 dB and 0 dB were taken into account.
For every digit from “zero” to “nine” an HMM consisting of 8 states with a mixture of 3 Gaussians per state was trained, except for the SAR-HMM and AR-SLDS experiments where speech was modeled by a 10th order AR process with 10 states. Feature vectors consisted of 13 MFCC as well as the first and second order derivatives of the cepstral coefficients. A global speech SLDM of 32 hidden states was trained, whereas the enhancement algorithm was run with the history parameter r = 1. The LDM modeling stationary noise was trained for each noisy test sequence using the first and last 10 frames of the utterance. In addition to the Kalman filter based speech modeling and enhancement concepts explained in Section 2 and 4, a variety of different standard feature enhancement techniques as named in Section 1 were evaluated. Figure 7 shows the performance of the different speech enhancement strategies for different noise types. Thereby training was carried out using clean data. With a recognition rate of 97.65% averaged over all noise types (see Table 1), the SLDM outperformed all other feature enhancement and modeling techniques for each of the car and babble noise types, whereas for AWGN at low SNR levels the AR-SLDS performed best (recognition rate of 88.52% for AWGN at 0 dB SNR). However, the AR-SLDS was not suited to model colored noise such as noise occurring in the interior of a car.
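As an illustration of this 39-dimensional feature extraction (13 MFCC plus first- and second-order derivatives), a minimal sketch using the librosa library could look as follows; the original experiments do not necessarily use this toolkit.

import numpy as np
import librosa

def extract_features(wav_path):
    """13 MFCC plus delta and delta-delta coefficients (39 dimensions per frame)."""
    y, sr = librosa.load(wav_path, sr=8000)       # utterances are downsampled to 8 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T            # shape: (frames, 39)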
6 Conclusion
In this paper we compared two techniques for noise robust in-car speech recognition based on Kalman filtering and joint speech and noise modeling. The strategy of describing the dynamics of speech with a Switching Linear Dynamic Model while modeling noise as a linear dynamic process is able to outperform other known speech enhancement approaches like Wiener filtering or Histogram Equalization whenever speech is corrupted by colored noise produced while driving a car. Speech disturbed by white noise can best be modeled using an Autoregressive Switching Linear Dynamical System, which captures the speech and noise dynamics of the raw signal in the time domain.
Acknowledgement
We would like to thank Jasha Droppo and Bertrand Mesot for providing SLDM and AR-SLDS binaries.
References
1. Speech processing, transmission and quality aspects (STQ); Distributed speech recognition; Advanced front-end feature extraction algorithm; Compression algorithms. ETSI standard doc. ES 202 050 V1.1.5 (2007)
2. Lathoud, G., Doss, M.M., Bourlard, H.: Channel normalization for unsupervised spectral subtraction. In: Proceedings of ASRU (2005)
3. Rahim, M.G., Juang, B.H., Chou, W., Buhrke, E.: Signal conditioning techniques for robust speech recognition. IEEE Signal Processing Letters, 107–109 (1996) 4. Lathoud, G., Magimia-Doss, M., Mesot, B., Boulard, H.: Unsupervised spectral subtraction for noise-robust ASR. In: Proceedings of ASRU, pp. 189–194 (2005) 5. Viikki, O., Laurila, K.: Cepstral domain segmental feature vector normalization for noise robust speech recognition. Speech Communication, 133–147 (1998) 6. de la Torre, A., Peinado, A.M., Segura, J.C., Perez-Cordoba, J.L., Benitez, M.C., Rubio, A.J.: Histogram equalization of speech representation for robust speech recognition. IEEE Transactions on Speech and Audio Processing, 355–366 (2005) 7. Hilger, F., Ney, H.: Quantile based histogram equalization for noise robust speech recognition. In: Eurospeech, pp. 1135–1138 (2001) 8. Droppo, J., Acero, A.: Noise robust speech recognition with a switching linear dynamic model. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2004) 9. Mesot, B., Barber, D.: Switching linear dynamical systems for noise robust speech recognition. IEEE Transactions on Audio, Speech and Language Processing (2007) 10. Deng, J., Bouchard, M., Yeap, T.H.: Noisy speech feature estimation on the Aurora2 database using a switching linear dynamic model. Journal of Multimedia, 47–52 (2007) 11. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1–38 (1977) 12. Droppo, J., Deng, L., Acero, A.: A comparison of three non-linear observation models for noisy speech features. In: Eurospeech, pp. 681–684 (2003) 13. Bar-Shalom, Y., Li, X.R.: Estimation and tracking: principles, techniques, and software. Artech House, Norwood, MA (1993) 14. Ephraim, Y., Roberts, W.J.J.: Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters, 166–169 (2005) 15. Poritz, A.: Linear predictive hidden Markov models and the speech signal. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1291–1294 (1982) 16. Baum, L.E., Petrie, T.: Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 1554–1563 (1966) 17. Rauch, H.E., Tung, G., Striebel, C.T.: Maximum likelihood estimates of linear dynamic systems. Journal of American Institiute of Aeronautics and Astronautics, 1445–1450 (1965) 18. Barber, D.: Expectation correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Reseach, 2515–2540 (2006) 19. Doddington, G.R., Schalk, T.B.: Speech recognition: turning theory to practice. IEEE Spectrum, 26–32 (1981) 20. Grimm, M., Kroschel, K., Harris, H., Nass, C., Schuller, B., Rigoll, G., Moosmayr, T.: On the necessity and feasibility of detecting a driver’s emotional state while driving. In: Paiva, A., Prada, R., Picard, R.W. (eds.) ACII 2007. LNCS, vol. 4738, pp. 126–138. Springer, Heidelberg (2007)
Natural Language Understanding by Combining Statistical Methods and Extended Context-Free Grammars Stefan Schwärzler, Joachim Schenk, Frank Wallhoff, and Günther Ruske Institute for Human-Machine Communication Technische Universität München 80290 Munich, Germany {sts,joa,waf,rus}@mmk.ei.tum.de
Abstract. This paper introduces a novel framework for speech understanding using extended context-free grammars (ECFGs) by combining statistical methods and rule-based knowledge. By only using 1st level labels, a considerably lower annotation effort can be achieved. In this paper we derive hierarchical non-deterministic automata from the ECFGs, which are transformed into transition networks (TNs) representing all kinds of labels. A sequence of recognized words is hierarchically decoded by using a Viterbi algorithm. In experiments the difference between a hand-labeled tree bank annotation and our approach is evaluated. The conducted experiments show the superiority of our proposed framework. Compared to a hand-labeled baseline system (=100%) we achieve a 95.4 % acceptance rate for complete sentences and 97.8 % for words. This corresponds to an accuracy of 95.1 %, an error rate of 4.9 %, and an F1-measure of 95.6 % on a corpus of 1 300 sentences.
1 Introduction In this paper, we address the problem of developing a simple, yet powerful speech understanding system based on manually derived domain-specific grammars. Contrary to existing grammar based speech understanding systems (e. g. Nuance Toolkit platform), not only grammar decisions are included, but information from grammar and word-label connections are combined as well and decoded by Viterbi algorithm. A two pass approach is often adopted, in which a domain-specific language model is constructed and used for speech recognition in the first pass, and the understanding model obtained with various learning algorithms is applied in the second pass to “understand” the output from the speech recognizer. Another approach handles recognition and understanding at the same time [?,2]. For this purpose so-called “concepts” are defined, which represent a piece of information on the lexical as well as on the semantic level. In this way all the statistical methods from speech recognition can be utilized for all hierarchies. This concerns especially the stochastic modeling solutions offered by Hidden Markov Models. The hierarchies consist of transition networks (TNs) whose nodes either represent terminal symbols or refer to other TNs [3]. This approach uses
Note: Both authors contributed equally to this paper.
statistical methods for semantic speech understanding requiring fully annotated tree banks. To this end a fully annotated tree bank in German language is available in the NaDia corpus [2], which is similar to the corpus used in the ATIS task [4]. In our experiments we use the NaDia corpus as baseline system. In general, fully annotated tree banks are not available for most application domains. However, a flat structure of words on the sentences level can commonly be realized rather easily. The sentences of natural language understanding have to obey certain grammatical rules, e. g. time data rules. Therefore we use extended context free grammars (ECFGs), whose production rules on the right-hand side consist of regular expressions (REs). REs are compact notations to describe formal languages. In our approach, we describe the working-domain’s grammar of our speech understanding system with appropriate REs. The basic principles of RE can be found in [5]. Context-free grammars (CFGs) consist of terminal and nonterminal symbols. Nonterminal symbols are used to describe production rules in CFG. For this reason we obtain a hierarchical structure of production rules, which constitute rule sets of semantic units (e. g. rules for time, origin, and destination). The idea is that we start with the start nonterminal symbol S and replace it with any sequence in the language described by its RE. This hierarchical coding determines the semantic meaning of the terminal symbol. Our aim is to analyse a given sentence and to decode the hierarchical structure of each terminal symbol. A similar approach has been presented in combination of CFG and N -gram modeling in Semantic Grammar Learning [6]. However, our system uses ECFGs, and is built on REs to describe the grammar rules. Thereby we utilize the entire sentences for semantic decoding by the grammar. The next section gives a brief overview of extended context-free grammars and the hierarchical utilization. Afterwards we introduce our TN for speech understanding. In section 4 we present our Viterbi based parsing methods. The settings and the goals of our experiments are defined in section 5. Our system is evaluated in section 6 on a hand-labeled tree bank. A comparison between hand-crafted annotation and automatic annotation is presented. Finally, conclusions and outlook are given in section 7.
2 Extended Context-Free Grammars (ECFG) CFGs are powerful enough to describe the syntax of most programming languages. On the other hand they are simple enough to allow construction of efficient parsing algorithms. In spoken language understanding, an extended CFG is used by the Phoenix parser, which allows to skip unknown words and performs partial parsing [7]. An ECFG G is specified by a tuple of the form (N, Σ, P, S), where N is a nonterminal alphabet, Σ is a terminal alphabet, P is a set of production schemes of the form A → LA , such that A is a nonterminal and LA is a regular language over the alphabet Σ ∪ N , and S is the sentence symbol, which is a nonterminal [8]. Nonterminal symbols are used to describe further production rules in ECFG. The production rules consist of a “Kleene closure” expressed by ∗, +, and ? operators in regular expression syntax. Using this construction we obtain a hierarchical structure of production rules in ECFG which constitute rule sets of semantic units (e. g. rules for time, origin, and destination), see Fig. 1. Each terminal symbol is generated by a nonterminal symbol, which is part of the hierarchical structure.
Fig. 1. Schematic tree consisting of terminal leaf nodes and nonterminal symbols
Hence, for every terminal symbol a hierarchical coding is provided by the sequence of nonterminal symbols leading from the root to the terminal symbol. This hierarchical coding determines the semantic meaning of the terminal symbol. Our aim is to analyse a given sentence and to decode the hierarchical structure of each terminal symbol. To that end we formulate a TN that describes the transitions between each two succeeding terminal symbols in time. The TN cannot be derived directly from the ECFGs. Therefore we translate the production rules in ECFGs into non-deterministic automata (NDA), whose equivalence has been shown in [9].
3 Transition Network (TN)
3.1 Non-Deterministic Automata (NDA)
In this section we show how a TN can be derived from ECFGs using NDA. We may assume that the language LA (see Sec. 2) is represented by a nondeterministic finite state automaton MA = (QA, Σ ∪ N, δA, sA, FA), where QA is a finite set of states, δA is the transition relation, sA is the start state and FA is a set of final states. For example, consider sentences that contain any number of DUMMY followed by either ORIG (origin) or DEST (destination), and are ended by any number of DUMMY. The label DUMMY denotes a not-meaningful entity. The rather simple grammar describing these sentences, formulated as a set of productions P in ECFG, is shown in Eq. 1, where a bold font denotes nondeterministic states and the remaining states are deterministic.

S = DUMMY* . (ORIG | DEST) . DUMMY*
ORIG = FROM . DUMMY* . CITY   (1)
DEST = TO . DUMMY* . CITY
The equivalent NDA is shown in Fig. 2. As seen in this figure, the NDA consists of states. A transition from one state to another is performed by emitting a terminal symbol. For NDA, arbitrary emissions can lead from a certain state to a certain other state. As explained above we build a structure of NDA with ECFGs. Each grammar rule is represented by one sub-NDA. ε-transitions denote transitions between states without emitting a terminal symbol and are used to connect sub-NDA.
Fig. 2. Schematic diagram of a nondeterministic finite state automaton
The NDA are now transformed into the TN. To avoid confusion, we distinguish between NDA states, i. e. states that are introduced during the transformation from ECFG to the NDA, and rule-states (s_1, . . . , s_N) which describe the TN. Each rule-state represents a terminal symbol; the transitions between the rule-states describe the desired transitions between terminal symbols in time. Each emitted symbol of the NDA corresponds to a rule-state in the TN. Hence each rule-state emerges from and points to an NDA state. The transition between the rule-states is also defined by the NDA states: one rule-state (e. g. s_j) is followed by all rule-states emerging from the NDA state where rule-state s_j is pointing to. This leads from the NDA depicted in Fig. 2 to the rule network described by the matrix A = (a_ij), i, j = 1, . . . , N, defined by the ten rule-states s_1, . . . , s_10:

        s1   s2   s3   s4   s5   s6   s7   s8   s9   s10
s1   (  0    1/3  1/3  0    0    1/3  0    0    0    0   )
s2   (  0    1/3  1/3  0    0    1/3  0    0    0    0   )
s3   (  0    0    0    1/2  1/2  0    0    0    0    0   )
s4   (  0    0    0    1/2  1/2  0    0    0    0    0   )
s5   (  0    0    0    0    0    0    0    0    1/2  1/2 )
s6   (  0    0    0    0    0    0    1/2  1/2  0    0   )
s7   (  0    0    0    0    0    0    1/2  1/2  0    0   )
s8   (  0    0    0    0    0    0    0    0    1/2  1/2 )
s9   (  0    0    0    0    0    0    0    0    1/2  1/2 )
s10  (  0    0    0    0    0    0    0    0    0    1   )

The probabilities of each line i in A add up to ∑_{j=1}^{N} a_ij = 1, except for the last line. Every probability denotes a transition from s_i to s_j. In this work we assume each
transition from one state to another as equally probable as long as the transition is allowed by the rule-scheme. Each state s_i corresponds to a terminal symbol which is stored in the vector

b = ('START', 'DUMMY', 'FROM', 'DUMMY', 'CITY', 'TO', 'DUMMY', 'CITY', 'DUMMY', 'END')   (2)
The vector b denotes the first level labels from an annotated corpus. By definition every sentence ends with the token "END". Thus there are no transitions from the corresponding state, and the last line in matrix A is always 1 in the last column. In our case, each transition from one state to another is equally probable, as far as it is allowed by the TN.
3.2 Hierarchical Structure
As explained above, each state s_i of the transition matrix A represents a terminal symbol from the ECFG. To keep the hierarchical information of each terminal symbol, the list H is introduced, where

H = (1 → S; 2 → S; 3 → S,ORIG; 4 → S,ORIG; 5 → S,ORIG; 6 → S,DEST; 7 → S,DEST; 8 → S,DEST; 9 → S; 10 → S)   (3)

contains the hierarchical information for each state. Matrix A can be illustrated in a transition diagram with the possible transitions between two consecutive time instances t and t + 1, as shown in Fig. 3; additionally the hierarchical structure of each terminal symbol is shown in this figure. For example, the terminal symbol 'DUMMY' can occur both in concept ORIG and in concept DEST.
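For this toy grammar, the transition matrix A, the label vector b and the hierarchy list H can be written down directly. The following snippet (an illustration only, with hypothetical variable names) encodes them for use in the decoder described in the next section.

import numpy as np

labels = ['START', 'DUMMY', 'FROM', 'DUMMY', 'CITY',
          'TO', 'DUMMY', 'CITY', 'DUMMY', 'END']         # vector b, Eq. (2)

hierarchy = ['S', 'S', 'S,ORIG', 'S,ORIG', 'S,ORIG',
             'S,DEST', 'S,DEST', 'S,DEST', 'S', 'S']      # list H, Eq. (3)

A = np.zeros((10, 10))
A[0, [1, 2, 5]] = 1/3    # START -> DUMMY | FROM | TO
A[1, [1, 2, 5]] = 1/3    # DUMMY -> DUMMY | FROM | TO
A[2, [3, 4]] = 1/2       # FROM  -> DUMMY | CITY
A[3, [3, 4]] = 1/2       # DUMMY -> DUMMY | CITY   (inside ORIG)
A[4, [8, 9]] = 1/2       # CITY  -> DUMMY | END    (leaving ORIG)
A[5, [6, 7]] = 1/2       # TO    -> DUMMY | CITY
A[6, [6, 7]] = 1/2       # DUMMY -> DUMMY | CITY   (inside DEST)
A[7, [8, 9]] = 1/2       # CITY  -> DUMMY | END    (leaving DEST)
A[8, [8, 9]] = 1/2       # DUMMY -> DUMMY | END    (sentence-final)
A[9, 9] = 1.0            # END is absorbing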
Fig. 3. Possible state transitions from time instance t to t + 1
4 Parsing
The unparsed sequence of first-level labels b = (b_1, . . . , b_T) is hierarchically decoded by finding the best path q̂ = (q̂_1, . . . , q̂_T) through a trellis – built by concatenating the transition diagrams – leading from the start symbol to the end symbol. However, in natural spoken sentences, a direct mapping from the words (terminals in the sentence) to the labels (terminals in the rule-network) is not available for each word. E. g. the word "to" has different meanings, depending on the context. First it may be an infinitive companion, second it can be used as a TOLOC (e. g. "to location") when combined with a city. In our approach all possible meanings of a word are pursued in the trellis, leading to a variety of sentence hypotheses. To cope with different meanings of words, the word-label probability p(b_i | w_t), describing the probability that a certain word w_t belongs to the given first level label b_i, is introduced. Applying Bayes' rule, one derives

p(b_i | w_t) = p(w_t | b_i) · p(b_i) / p(w_t)   (4)
The probabilities of the right hand side of Eq. 4 are estimated by using databases and are referred to as "world knowledge". Given the sentence "I want to go to Munich", hence w = ('START', 'I', 'want', 'to', 'go', 'to', 'Munich', 'END'), the transition diagram shown in Fig. 3, and the trellis diagram displayed in Fig. 4 can be built, containing all parsing hypotheses. The y-axis shows the terminal symbols b including the hierarchical structures, which were defined in H. The x-axis represents time. All possible transitions can be displayed depending on their consecutive time instances (concatenation of Fig. 3). For the test example our system finds the optimal path through the trellis diagram (shown as bold arrows in Fig. 4). Light arrows represent possible alternative paths that have not been chosen due to the statistical constraints.
4.1 Viterbi Path
The best path through the trellis (the path with the highest probability) can be found by using the Viterbi algorithm. Introducing the quantity δ_t(i), describing the probability of a single path at time t that ends in state s_i, the Viterbi algorithm is formulated as follows:

Initialization:
δ_1(i) = p(b_i | w_1),   ψ_1(i) = 0,   1 ≤ i ≤ N   (5)

Recursion (2 ≤ t ≤ T, 1 ≤ j ≤ N):
δ_t(j) = max_{1≤i≤N} [δ_{t-1}(i) · a_ij] · p(b_j | w_t)
ψ_t(j) = argmax_{1≤i≤N} [δ_{t-1}(i) · a_ij]   (6)

Termination:
q̂_T = argmax_{1≤i≤N} δ_T(i)   (7)
Fig. 4. Trellis diagram and best path for “I want to go to Munich”
The best path is then found by backtracking:

q̂_t = ψ_{t+1}(q̂_{t+1})   (8)
With the states revealed from the backtracking path the hierarchical structure of the sentence w is derived using the hierarchy list H. Considering the above reviewed sentence, backtracking leads to the hierarchical decoding of the sentence shown in Fig. 4, as well as to the best path found by the Viterbi algorithm.
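A compact implementation of Eqs. (5)-(8) over the trellis could look like the following sketch (illustrative only; the word-label probabilities p(b_i | w_t) are assumed to be provided, e.g. via Eq. (4)).

import numpy as np

def viterbi(word_label_prob, A):
    """word_label_prob: (T, N) matrix of p(b_i | w_t); A: (N, N) transition matrix.
    Returns the index sequence of the best path through the trellis."""
    T, N = word_label_prob.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = word_label_prob[0]                  # Eq. (5)
    for t in range(1, T):                          # Eq. (6)
        scores = delta[t - 1][:, None] * A         # delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * word_label_prob[t]
    path = [int(delta[-1].argmax())]               # Eq. (7)
    for t in range(T - 1, 0, -1):                  # backtracking, Eq. (8)
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

The hierarchical annotation of the sentence is then read off by looking up the list H for every decoded state.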
5 Experiments
We verified our approach on a database containing more than 1 300 sentences (1 000 for training, 300 for testing) of the air traveling information domain from the German NaDia corpus [2]. Each word in each sentence is manually hierarchically labeled (baseline system), according to its meaning. We used our system to hierarchically decode the sentences, given a certain ECFG. The crucial probabilities p(w_t | b_i) for the decoding are trained using the NaDia corpus. However, any other corpus in the same language could be used for training. The production rules of an ECFG describing the transition matrix A are defined to match the requirements of the corpus and suit the German grammar. The performance of the system depends on the production rules which are used in the ECFG. On the one hand the rules must be precise enough to take care of the semantic content, on the other hand the rules have to capture all variants of natural language formulations. To show this, we created three different types of production rules in ECFG (with a rising number of states of the transition matrix), differing by their perplexity. In our experiments we compare the performance of the production rules in ECFG with the manually hierarchically labeled annotations (baseline system). The goal is to identify the concepts TIME, ORIGIN, and DESTINATION independently of their order and combination, in more than 300 sentences of the air traveling
information domain. Each word in the sentences is labeled by one of 14 tags, like Dummy, AOrigin, ADestination, ATime, etc.
6 Results
The proposed types of ECFGs (see Sec. 5) are evaluated using the following measurement methods. A sentence is accepted if the hierarchical structure automatically obtained by our system fully matches the baseline labeling. We define the sentence acceptance rate r_sent as

r_sent = (# of accepted sentences) / (total # of sentences)   (9)

As explained above, a sentence is rejected as soon as the hierarchical decoding of one single word differs from the baseline annotation. To investigate our system's decoding abilities on a word level we use the word acceptance rate

r_word = (# of accepted words) / (total # of words)   (10)
Tab. 1 further shows the word accuracy, the error rate and the well-known F1-measure results of the three different types of ECFGs. In Tab. 1 it is assumed that the hand-labeled annotation works perfectly (=100%). The results shown in Tab. 1 represent the automatic method for comparison. The best results were achieved with ECFG3, with a word acceptance rate of 97.8 %, an F1-measure of 95.6 %, and 95.4 % for complete sentences. The columns of Tab. 1 represent three different types of ECFGs, which differ in the number of terminal symbols. In ECFG1 a rough description of the concepts TIME, ORIGIN, and DESTINATION exists, whereas the perplexity is reduced in ECFG2 and ECFG3. The main difference consists in the description of TIME rules. As shown in Tab. 1 the sentence and word acceptance rates increase with the complexity of the grammar, which also increases the number of states. The slight difference between the word and sentence acceptance rates is caused by multiple word errors in the same sentences. As a result most of our 1 300 sentences were accepted by the grammar. However, some sentences are refused as they do not follow the German grammar rules.

Table 1. Results for different ECFGs with decreasing perplexity

                 ECFG1   ECFG2   ECFG3
rsent [%]         92.8    94.7    95.4
rword [%]         95.5    96.9    97.8
#states             89      98     113
Accuracy [%]      93.1    94.0    95.1
Error [%]          6.9     6.0     4.9
F1-Measure [%]    93.3    95.0    95.6
7 Conclusions and Outlook
In this work a novel approach for speech understanding has been presented. We showed how ECFGs can be used instead of tree bank annotations, e. g. for understanding time data, etc. Our approach combines semantic grammar rules with the advantages of statistical methods. The system pursues all possible rules in parallel, running from the first frame till the last frame of the sentence. Thus a set of alternative trails on the trellis diagram results. The best path is found by using the Viterbi algorithm. We showed how to build a transition matrix A from ECFGs. The probability at position (i, j) in this matrix, denoting a transition from s_i to s_j (see Sec. 3.1), is up to now set equal for each possible transition allowed by the rule-scheme. In future work, we will be able to weight the transitions according to a training set. The results (see Sec. 6) already showed that our approach is comparable to a fully hand-labeled annotated tree bank (baseline system). Our best verification reaches an acceptance rate of 95.4 % for complete sentences and 97.8 % for words. Furthermore, in contrast to hand-labeled tree bank annotations our approach only needs a non-hierarchical annotation from a corpus, but up to now it just suits parts of the German grammar. In future investigations we therefore plan to learn general parts of the grammar (ECFGs) from a training set. At this stage we presented only the principles of our approach. For further results we will apply our approach to the ATIS tasks. Additionally the perplexity of our grammar will be increased by automatic grammar reduction. In this work our approach used just the decoder's best word hypotheses (having maximum probability). However, our approach offers the possibility to evaluate word confidences utilizing the ECFGs. In future work, we aim to introduce alternative word hypotheses and confidence measures from the one-stage decoder [?].
Acknowledgements
This work was funded partly by the German Research Council (DFG).
References 1. Thomae, M., Fabian, T., Lieb, R., Ruske, G.: A One-Stage Decoder for Interpretation of Natural Speech. In: Proc. NLP-KE 2003, Beijing, China (2003) 2. Lieb, R., Ruske, G.: Natural Dialogue Behaviour Using Complex Information Services in Cars. Institute for Human Machine Communication, Techn. University, Munich, Germany, Tech. Rep (2003) 3. Thomae, M., Fabian, T., Lieb, R., Ruske, G.: Hierarchical Language Models for One-Stage Speech Interpretation. In: Proc. Eurospeech, Lisbon, Portugal (2005) 4. Ward, W.: The CMU Air Travel Information Service: Understanding Spontaneous Speech. In: Proc. Workshop on Speech and Natural Language. Hidden Valley, Pennsylvania, pp. 127–129 (1990) 5. Blackburn, P., Bos, J.: Representation and Inference for Natural Language. Leland Stanford Junior University: CLSI Publications (2005) 6. Wang, Y., Acero, A.: Combination of CFG and N-Gram Modeling in Semantic Grammar Learning. In: Proc. Interspeech (2003)
7. Ward, W., Issar, S.: Recent Improvements in the CMU Spoken Language Understanding System. In: Proc. of ARPA Human Language Technology Workshop, pp. 213–216 (1994)
8. Brüggemann-Klein, A., Wood, D.: The Parsing of Extended Context-Free Grammars. In: HKUST Theoretical Computer Science Center Research Report, no. 08 (2002)
9. Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to automata theory, languages and computation. Addison-Wesley, Reading (2006)
Photoconsistent Relative Pose Estimation between a PMD 2D3D-Camera and Multiple Intensity Cameras Christian Beder, Ingo Schiller, and Reinhard Koch Computer Science Department Kiel University, Germany {beder,ischiller,rk}@mip.informatik.uni-kiel.de
Abstract. Active range cameras based on the Photonic Mixer Device (PMD) allow to capture low-resolution depth images of dynamic scenes at high frame rates. To use such devices together with high resolution optical cameras (e.g. in media production) the relative pose of the cameras with respect to each other has to be determined. This task becomes even more challenging if the camera is to be moved and the scene is highly dynamic. We will present an efficient algorithm for the estimation of the relative pose of a single 2D3D-camera with respect to several optical cameras. The camera geometry together with an intensity consistency criterion will be used to derive a suitable cost function, which will be optimized using gradient descent. It will be shown how the gradient of the cost function can be efficiently computed from the gradient images of the high resolution optical cameras. We will show that the proposed method allows to track and to refine the pose of a moving 2D3D-camera for fully dynamic scenes.
1 Introduction
In recent years active range cameras based on the Photonic Mixer Device (PMD) have become available. Those cameras deliver low resolution depth images at high frame rates comparable to usual video cameras. Some PMD-cameras also capture intensity images registered with the depth images (c.f. [11] or [10]) and are therefore sometimes called 2D3D-cameras. In some applications, such as media production, the images of the 2D3D-camera cannot be used directly. In those cases high resolution optical cameras are used together with the depth information from 2D3D-cameras. The combination of PMD and optical cameras has for instance been used in [6], which explicitly does not require an accurate relative orientation between the two systems. A combination of PMD and stereo images has also been proposed in [9] and in [1], which rely on an accurate relative orientation between the optical cameras and the PMD-camera. There, the PMD-camera is not allowed to be moved and a fixed rig is used instead, which is calibrated beforehand using a calibration pattern (cf. [2] and [13]).
Fig. 1. The setup comprising of a moving PMD 2D3D-camera mounted on a pan-tilt unit between two fixed high resolution optical cameras
Due to the narrow opening angle and the low resolution of the PMD camera a fixed rig limits the visible space of the 2D3D-camera. This limitation can be circumvented by moving the 2D3D-camera and focusing on the interesting spots in the scene. See figure 1 for the setup we used, comprising two high resolution optical cameras and a 2D3D-camera mounted on a pan-tilt unit. However, in case the 2D3D-camera is moving, its relative pose with respect to the optical cameras has to be determined. Tracking the pose of a moving PMD-camera has been presented in [3] and also in [5]. There the intensity information from the 2D3D-camera is not used. Tracking the pose of a moving 2D3D-camera using also the intensity information has been done in [8], [12] and [14]. In contrast to our approach, all those approaches are not based on additional optical cameras and therefore require a static scene. Because we estimate the relative pose between the 2D3D-camera and the optical cameras for each frame, our algorithm is also able to cope with fully dynamic scenes as long as the images are synchronized. This paper is organized as follows: in section 2 we will derive the geometry of the 2D3D-camera in relation to the optical cameras and propose a cost function based on an intensity consistency constraint. In section 3 it will be shown how this cost function can be efficiently optimized using gradient descent. Finally we will present some results on synthetic and real data taken with the setup depicted in figure 1 in section 4.
2 Camera Geometry and Intensity Consistency
First we will introduce some notation describing the geometry of the cameras. We start with the 2D3D-camera, from which we obtain intensity as well as depth images

D_0[x] : ℝ² → ℝ,   I_0[x] : ℝ² → ℝ   (1)
for each shot. The camera geometry of this 2D3D-camera will be characterized by the projection matrix (cf. [7, p.143])

P_0 = K_0 R_0^T ( I_3 | −C_0 )   (2)
which maps homogeneous 3d points X to homogeneous image coordinates according to (3) x = P0 X In our application the cameras are assumed to be calibrated, i.e. we assume K0 to be known. The pose of the 2D3D-camera, i.e. R 0 and C 0 , is unknown and will be estimated in the following. Inverting equation (3) yields for each pixel x together with the depth image D0 [x] a 3d point D0 [x]R 0 K −1 0 x (4) X= + C0 −T xT K0 K−1 x 0 We now introduce additional intensity cameras, which produce normal intensity images Ii [xi ] : IR2 → IR (5) triggered synchronously with the 2D3D-camera. The camera geometry of those additional cameras will be described by the projection matrices Pi = Ki R T i (I 3 | − C i )
(6)
which are assumed to be completely known in the following. The 3d points from equation (4) are then projected into those intensity cameras at the homogeneous coordinates (7) xi = Ki R T i (X − C i ) Denoting the homogeneous image coordinates as ⎛ ⎞ ui xi = ⎝ vi ⎠ wi the Euclidean coordinates are obtained by simple normalization 1 ui xi = wi vi
(8)
(9)
We have now established correspondences between pixels x in the 2D3D-camera with a pixel in each intensity camera xi using the depth image D[x]. Using the derived pixel correspondences x ↔ xi , we will assume that the intensities of those pixels are equal up a per image brightness offset bi and a per image contrast difference ci ! ci I0 [x] = Ii [xi ] + bi (10)
Photoconsistent Relative Pose Estimation
267
This condition only holds true, if the pixel is not occluded. Therefore we introduce an occlusion map on the images of the 2D3D-camera 1 if x ↔ xi is not occluded νi [x] = (11) 0 otherwise which can be computed from the depth map D using shadow mapping techniques (cf. [15]). Note, that approximate pose parameters R 0 and C 0 are required for this operation. Finally introducing the robust cost function 2 x if |x| < θ (12) Ψ [x] = |x| + θ2 − θ otherwise we are able to formulate the intensity consistency constraint by optimizing the following cost function
νi [x]Ψ [Ii [xi ] − ci I0 [x] + bi ] → min (13) φ(R 0 , C 0 , c, b) = x i for the unknown pose parameters R 0 and C 0 as well as the unknown brightness offsets b and contrast differences c. In case of a PMD 2D3D-camera the resolution and hence the number of pixels is small, so that we sum over all pixels x of the image. To improve the running time of the algorithm in case of higher resolution 2D3D-cameras it is also possible to detect interest points and sum up only those. In the following section we will show, how this cost function can be efficiently optimized using gradient descent techniques.
3
Optimization
We will now show, how the cost function derived in the previous section can be efficiently optimized. Therefore we first approximate the rotation matrix by (cf. [4, p.53]) (14) R 0 = R 0 + [r0 ]× where [·]× denotes the 3×3 skew symmetric matrix induced by the cross product (cf. [7, p.546]). The unknown rotation is now parameterized using the 3-vector r 0 containing the differential rotation angles. Stacking all the unknown parameters into the parameter vector ⎛ ⎞ r0 ⎜ C0 ⎟ ⎟ p=⎜ (15) ⎝ c ⎠ b the gradient of the cost function (13) is given by
∂Ii [xi ] T − I0 [x]eT + e (16) νi [x]Ψ [Ii [xi ] − ci I0 [x] + bi ] g= i+6 i+N +6 ∂p x i
268
C. Beder, I. Schiller, and R. Koch
where N is the number of images and ei is a vector of the same size as the parameter vector p containing zeros except for the i-th component, which is one. Note, that the dependence of the occlusion map νi [x] on the rotation and translation of the 2D3D-camera is neglected here. We will now look at the components of the gradient. First, the derivative of the robust cost function is simply given by 2x if |x| < θ (17) Ψ [x] = 1 otherwise Second, the partial derivatives of the intensity images Ii [xi ] with respect to the unknown parameters p is obtained from the gradient images ∇Ii [xi ] using chain rule as ∂xi ∂xi ∂X ∂Ii [xi ] = ∇Ii [xi ] (18) ∂p ∂xi ∂X ∂p Its components are the partial derivatives of equation (7) being
1 ui ∂xi wi 0 − wi2 = 0 w1i − wvi2 ∂xi
(19)
i
as well as of equation (9) being ∂xi = K iR T i ∂X
(20)
Finally we need the Jacobian ∂X = ∂∂X r0 ∂p
∂X ∂C 0
0 3×2N
(21)
where N is the number of intensity images and the null matrix 0 3×2N reflects the fact, that the 3d point is independent of the image brightness and contrast. The components of this remaining Jacobian are the partial derivatives of equation (4) given by −1 −D0 [x] ∂X K0 x × = (22) ∂r0 −1 xT K−T K x 0 0 as well as
∂X = I3 ∂C 0
(23)
Note, that only the gradient images of the high resolution optical cameras and some simple matrix operations are required to compute the gradient, which can be performed very efficiently. Having derived an efficient method for computing the gradient of the cost function from the gradient intensity images, it is now possible to minimize the cost function using gradient descent techniques. To do so we need to determine a step width. This can be done by using the curvature of the cost function in
the direction of the gradient. We therefore compute three samples of the cost function

φ_0 = φ(p_0),   φ_1 = φ(p_0 − g),   φ_2 = φ(p_0 − 2g)   (24)

and note that the parabola fitted through these three values has its peak at

ε_0 = (3φ_0 − 4φ_1 + φ_2) / (2φ_0 − 4φ_1 + 2φ_2)
(25)
If the cost function in the direction of the gradient is convex, then ε_0 > 0. If this is not the case, we set ε_0 to an arbitrary positive value. Then we proceed as follows: starting with a line-search factor α = 1 we check if there is an improvement in the cost function, i.e. if φ(p_0 − α ε_0 g) < φ(p_0)
(26)
If this is the case, we update p0 accordingly and iterate. If this is not the case we reduce α and repeat the test. This scheme is iterated, until the parameter updates are sufficiently small. We have presented an efficient estimation scheme, which we will evaluate in the following section.
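This scheme can be summarized in a short sketch (hypothetical code; phi and grad_phi stand for the cost function of Eq. (13) and its gradient of Eq. (16), and eps is an arbitrary positive fallback step width).

import numpy as np

def optimize_pose(p0, phi, grad_phi, eps=1e-3, max_iter=100, tol=1e-8):
    """Gradient descent with a parabola-fitted step width and a simple line search."""
    p = p0.copy()
    for _ in range(max_iter):
        g = grad_phi(p)
        f0, f1, f2 = phi(p), phi(p - g), phi(p - 2 * g)   # samples along -g, cf. Eq. (24)
        denom = 2 * f0 - 4 * f1 + 2 * f2
        eps0 = (3 * f0 - 4 * f1 + f2) / denom if denom > 0 else eps   # Eq. (25)
        if eps0 <= 0:
            eps0 = eps
        alpha = 1.0                                       # line-search factor
        while alpha > 1e-6 and phi(p - alpha * eps0 * g) >= f0:
            alpha *= 0.5                                  # reduce alpha, cf. Eq. (26)
        step = alpha * eps0 * g
        if np.linalg.norm(step) < tol:
            break
        p = p - step
    return p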
4 Results
To evaluate the proposed algorithm, we rendered a synthetic data set from a known geometry, for which ground truth data is available. The setup was chosen similar to the one depicted in figure 1 comprising of two optical cameras with a 2D3D-camera in between. Figure 2 shows the 3d setup we used. The optical cameras are placed 1m to the left and to the right of the 2D3D-camera at a distance of about 18m from the castle with the occluding arc at a distance of approximately 11m. The resolution of the optical cameras was chosen as 640×480 pixels at an opening angle of 70◦ while the resolution of the 2D3D-camera was chosen as 160 × 120 pixels at an opening angle of 40◦ .
Fig. 2. Estimated pose and point cloud after optimization of the synthetic scene. The camera symbols in the middle show the pose of the 2D3D-camera before the optimization, after the first iteration and at the final position.
Fig. 3. Left image of the synthetic sequence. The 2D3D image is overlaid onto the intensity image using the initial pose on the left hand side and using the optimized pose on the right hand side. The enlargements show that the registration error decreases. Note the speakers in the top view and the corrected discontinuity on the bottom view.
On the left hand side of figure 3 the image from one of the optical cameras is shown together with the image of the 2D3D-camera re-projected using exemplary erroneous initial pose parameters at a distance of 0.4m from the ground truth. On the right hand side of figure 3 the same image is shown after the optimization. As expected the registration between the images has improved. In order to quantify this improvement, we initialized the algorithm with pose parameters disturbed by Gaussian noise of increasing standard deviation σC and compared the resulting poses with the known ground truth poses. In addition to this disturbance of the initial pose parameters we repeated the experiment with Gaussian noise on the depth images of the 2D3D-camera with standard deviation
Fig. 4. Left: Mean distance and standard deviation of the resulting pose to the ground truth pose plotted against the standard deviation of the initial pose disturbance. For the bottom curve no noise was added to the input images, while the two other curves result from severe additional noise in the depth and intensity images respectively. Right: Mean distance and standard deviation of the resulting pose to the ground truth pose plotted against the standard deviation of the depth image noise using initial pose disturbance with standard deviation σC = 0.25m.
σdepth = 2m as well as with Gaussian noise on the optical images with standard deviation σint = 50 at an intensity range of 255. The distances dc of the resulting poses to the known ground truth poses together with their standard deviations are plotted against the standard deviation of the disturbance of the initial pose parameters σC on the left hand side of figure 4. Observe, that the algorithm reacts much more sensitive to disturbances of the depth than to disturbances of the intensity images. The sensitivity of the algorithm to disturbances of the depth images is shown on the right hand side of figure 4, where we plotted the distances dc of the resulting poses to the known ground truth poses together with their standard deviations against the standard deviation of the disturbance of the depth images σdepth while keeping the disturbance of the initial pose at σC = 0.25m. It can be seen, that in this example severe disturbances of depth and intensity images do not affect the performance of the algorithm unless the pose initialization is above σC = 0.25m. As expected the Gaussian intensity noise is well compensated by the robustified least-squares optimization. It can be seen, that also the Gaussian noise in the depth measurements, which also affects the occlusion maps, can be coped with to some degree. Next, we evaluated our algorithm on real images taken with the setup depicted in figure 1. It comprises of two optical cameras with a resolution of 1600 × 1200 pixels having an opening angle of 60◦ in combination with a 2D3D-camera mounted on a pan-tilt unit. The 2D3D-camera comprises of a PMD camera with a resolution of 176 × 144 pixels having an opening angle of 40◦ operated together with a rigidly coupled and calibrated optical camera from which the intensity values are taken using the mutual calibration. Figure 5 shows the intensity image of the left camera together with the re-projection of the image of the 2D3Dcamera. On the left hand side the initial pose obtained approximately from the pan-tilt unit is used and the image on the right hand side shows the re-projection after the optimization. Observe, that on the left hand side the registration is fairly poor, while the mutual registration after the optimization is significantly
Fig. 5. Left image of a real sequence. The 2D3D image is overlaid onto the intensity image using the initial approximate pose from the pan-tilt unit on the left hand side and using the optimized pose on the right hand side. Note the corrected registration for instance on the enlarged part of the poster in the background.
Fig. 6. Estimated pose and point cloud after optimization of the real scene. Again the camera symbols in the middle show the pose of the 2D3D-camera before and after optimization.
improved as shown in the image on the right hand side. The 3d camera poses before and after the optimization are depicted in figure 6.
5 Conclusion
We have presented a system for estimating the pose of a moving 2D3D-camera with respect to several high resolution optical cameras. Because the method is based on single synchronized shots of the scene, no additional assumptions on the rigidness are made and the scene is allowed to be fully dynamic. It has been shown, that our approach is quite robust and able to cope very well with severe noise on the intensity as well as on the depth images. The presented method is a greedy gradient based optimization, so that we require good initial values for the pose parameters. Those can be either obtained from external sources, such as the rotation data from the pan-tilt unit or an inertial sensor mounted on the 2D3D-camera, or by tracking the pose over an image sequence. Our experiments indicate, that both approaches are feasible and the radius of convergence is sufficiently large for the application of the pan-tilt unit or the inertial sensor as well as for tracking the pose at high frame rates. As the radius of convergence is governed by the reach of the image gradients, future work will focus on improving the convergence by introducing a multiscale coarse-to-fine optimization. We expect, that thereby the tracking of faster rotational movements, which cause large image displacements even at high frame rates, can be improved.
Acknowledgments
This work was partially supported by the German Research Foundation (DFG), KO-2044/3-1, and the Project 3D4YOU, Grant 215075 of the Information Society Technologies area of the EU's 7th Framework Programme.
References 1. Beder, C., Bartczak, B., Koch, R.: A combined approach for estimating patchlets from PMD depth images and stereo intensity images. In: Hamprecht, F.A., Schn¨ orr, C., J¨ ahne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 11–20. Springer, Heidelberg (2007) 2. Beder, C., Koch, R.: Calibration of focal length and 3d pose based on the reflectance and depth image of a planar object. In: Proceedings of the DAGM Dyn3D Workshop, Heidelberg, Germany (2007) 3. Beder, C., Koch, R.: Real-time estimation of the camera path from a sequence of intrinsically calibrated pmd depth images. In: Proceedings of the ISPRS Congress, Bejing, China (to appear, 2008) 4. F¨ orstner, W., Wrobel, B.: Mathematical concepts in photogrammetry. In: McGlone, J.C., Mikhail, E.M., Bethel, J. (eds.) Manual of Photogrammetry, 5th edn., pp. 15–180. ASPRS (2004) 5. Fuchs, S., May, S.: Calibration and registration for precise surface reconstruction with tof cameras. In: Proceedings of the DAGM Dyn3D Workshop, Heidelberg, Germany (2007) 6. Hahne, U., Alexa, M.: Combining time-of-flight depth and stereo images without accurate extrinsic calibration. In: Proceedings of the DAGM Dyn3D Workshop, Heidelberg, Germany (2007) 7. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000) 8. Huhle, B., Jenke, P., Strasser, W.: On-the-fly scene acquisition with a handy multisensor-system. In: Proceedings of the DAGM Dyn3D Workshop, Heidelberg, Germany (2007) 9. Kuhnert, K.D., Stommel, M.: Fusion of stereo-camera and PMD-camera data for real-time suited precise 3d environment reconstruction. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (October 2006) 10. Lindner, M., Lambers, M., Kolb, A.: Sub-pixel data fusion and edge-enhanced distance refinement for 2d/3d images. In: Proceedings of the DAGM Dyn3D Workshop, Heidelberg, Germany (2007) 11. Prasad, T.D.A., Hartmann, K., Wolfgang, W., Ghobadi, S.E., Sluiter, A.: First steps in enhancing 3d vision technique using 2d/3d sensors. In: Chum, V., Franc, O. (eds.) Computer Vision Winter Workshop 2006, University of Siegen, pp. 82–86. Czech Society for Cybernetics and Informatics (2006) 12. Prusak, A., Melnychuk, O., Schiller, I., Roth, H., Koch, R.: Pose estimation and map building with a pmd-camera for robot navigation. In: Proceedings of the DAGM Dyn3D Workshop, Heidelberg, Germany (2007) 13. Schiller, I., Beder, C., Koch, R.: Calibration of a pmd camera using a planar calibration object together with a multi-camera setup. In: Proceedings of the ISPRS Congress, Bejing, China (to appear, 2008) 14. Streckel, B., Bartczak, B., Koch, R., Kolb, A.: Supporting structure from motion with a 3d-range-camera. In: Ersbøll, B.K., Pedersen, K.S. (eds.) SCIA 2007. LNCS, vol. 4522. Springer, Heidelberg (2007) 15. Williams, L.: Casting curved shadows on curved surfaces. SIGGRAPH Comput. Graph. 12(3), 270–274 (1978)
Implicit Feedback between Reconstruction and Tracking in a Combined Optimization Approach
Olaf Kähler and Joachim Denzler
Chair for Computer Vision, Friedrich-Schiller University Jena
{kaehler,denzler}@informatik.uni-jena.de, http://www.inf-cv.uni-jena.de
Abstract. In this work, we present a combined approach to tracking and reconstruction. An implicit feedback of 3d information to the tracking process is achieved by optimizing a single error function, instead of two separate steps. No assumptions about the error distribution of the tracker are needed in the reconstruction step either. This results in higher reconstruction accuracy and improved tracking robustness in our experimental evaluations. The approach is suited for online reconstruction and has a close to real-time performance on current computing hardware.
1 Introduction
The structure-from-motion problem is a central task in computer vision. Most solutions follow a two-step strategy: matching features between images in a first step and then extracting 3d information from the feature positions in a distinct, second step. However, the tracking process can greatly benefit from knowledge about the 3d scene [1,2,3,4]. A close connection between the steps is also needed to model the uncertainty of the tracker in the reconstruction process [5]. We address these two issues using a combined approach. We verify the superiority of this approach both by theoretical argumentation and in practical experiments. Impressive reconstruction results are already obtained by splitting up the problem into independent tracking, camera calibration and depth reconstruction steps [6]. In our formulation, we solve the tracking and camera calibration tasks by optimization of a single, combined error function. As calibrated camera positions are computed with high accuracy, additional reconstruction detail can then be acquired with the third step from above. The key for a combined error function is the geometric 3d parameterization of the 2d image motion of planar features. Similar approaches are used in model-based tracking [2,3], but there the scene structure is treated as fixed a priori knowledge. Unlike [4], the camera parameters are also treated as unknowns and estimated online. In the terminology of [7], our approach performs a direct reconstruction from grey values to 3d data. It is also highly related to the work presented in [8], but by focusing on planes as large scene features, a higher robustness and speed can be achieved. Still, the key contribution is not the use of planar features, but the combined solution to tracking and reconstruction. We
Fig. 1. Outline of our combined approach. In typical feature-based methods, there is only a feed-forward connection between tracking and reconstruction.
show that this is superior to a two-step solution even when using the same planar features. The overall workflow of our structure-from-motion system is depicted in figure 1. Accordingly, we will first review classical brightness-constancy tracking methods and feature-based reconstruction in sections 2 and 3. Our main contribution, the combination of the two methods, follows in section 4. Finally, the system is experimentally compared to two-step approaches in section 5. An outlook on further work and final remarks conclude the paper in sections 6 and 7.
2 Tracking
To solve the correspondence problem for video sequences, methods in the line of the KLT-tracker [9] and the Hager-tracker [10] have proved very successful. The underlying brightness constancy assumption forms a crucial part of our combined method presented in section 4. These trackers are also used to provide an input for the feature-based reference method shown in section 3. We will hence first formulate the tracking problem in a general way and then present some details on the specific implementation for planar features.
2.1 General Tracking Framework
Under the brightness constancy assumption, all grey value changes between images $I_0$ and $I_v$ are explained by a 2d motion $f$ with parameters $p$:

$$I_0(x) = I_v(f(p, x))$$

This equation leads to an error function $\epsilon$, which is quadratic in the vector of grey value residuals $d$:

$$\epsilon(p) = \sum_x \big( I_v(f(p, x)) - I_0(x) \big)^2 = d^T(p)\, d(p) \qquad (1)$$
For a given feature consisting of a set of pixels $x$, the task of tracking is to find an optimal parameter vector $\hat{p}$ minimizing the error function $\epsilon$:

$$\hat{p} = \operatorname*{argmin}_p \epsilon(p)$$
As the images $I_v$ are typically non-linear functions, this minimum of $\epsilon$ is found using non-linear optimization like the Levenberg algorithm or Gauss-Newton iteration [9,11]. In our implementation we currently use the Levenberg algorithm [5] with iteration steps as follows:

$$p_{i+1} = p_i - \left( \Big(\tfrac{\partial}{\partial p_i} d(p_i)\Big)^{T} \tfrac{\partial}{\partial p_i} d(p_i) + \lambda\, \mathrm{Id} \right)^{-1} \Big(\tfrac{\partial}{\partial p_i} d(p_i)\Big)^{T} d(p_i) \qquad (2)$$
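To make the update rule concrete, the following is a minimal sketch of one such Levenberg iteration for a generic residual function; the names residual_fn and jacobian_fn and the fixed damping value are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def levenberg_step(residual_fn, jacobian_fn, p, lam=1e-3):
    """One damped Gauss-Newton (Levenberg) step as in Eq. (2).

    residual_fn(p) returns the stacked grey value residuals d(p),
    jacobian_fn(p) returns the Jacobian of d with respect to p.
    The strategy for adapting the damping parameter lam is omitted here.
    """
    d = residual_fn(p)
    J = jacobian_fn(p)
    H = J.T @ J + lam * np.eye(J.shape[1])   # damped normal equations
    return p - np.linalg.solve(H, J.T @ d)
```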
2.2 Tracking of Planar Features
Using the projective camera model and homogeneous image coordinates, it is well known that points on a common plane are mapped by a common homography between two different views [12]. This leads to a motion model $f$ for planar features. In order to track such features, it is necessary to estimate the parameters of a homography for each plane $p$ and for each view $v$. A homography $H$ is a mapping of homogeneous points $\tilde{x}$ from one image to homogeneous points $\tilde{x}'$ in another image according to the equation:

$$\tilde{x}' = H\tilde{x} = \begin{pmatrix} H^{(1,1)} & H^{(1,2)} & H^{(1,3)} \\ H^{(2,1)} & H^{(2,2)} & H^{(2,3)} \\ H^{(3,1)} & H^{(3,2)} & H^{(3,3)} \end{pmatrix} \tilde{x} \qquad (3)$$

Note that $H$ is only defined up to scale and has 8 independent parameters. The motion model $f$ hence depends on an 8-dimensional parameter vector $p$ and is defined as follows:

$$f(p, x) = \frac{1}{p^{(3,1)} x + p^{(3,2)} y + 1} \begin{pmatrix} p^{(1,1)} x + p^{(1,2)} y + p^{(1,3)} \\ p^{(2,1)} x + p^{(2,2)} y + p^{(2,3)} \end{pmatrix} \qquad (4)$$

An optimal parameter vector $\hat{p}$ can be estimated by minimizing equation (1) with respect to $p$. This parameter vector then represents the homography transforming the planar feature from image $I_0$ to $I_v$. In practical applications, the estimated parameters from the previous image can be used as an initialization for the non-linear optimization. This naturally incorporates the assumption of small camera motions. To gain stability, a compositional formulation [11] is used. Further, it is important to precondition the problem [12] using an appropriate coordinate scaling, e.g. from pixel coordinates to the interval [0; 1]. Finally, note that a resolution hierarchy can dramatically increase the robustness of the tracking process.
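As an illustration of the motion model, the sketch below evaluates Eq. (4) for a single pixel; the ordering of the eight entries in p is an assumption made only for readability.

```python
def warp_planar(p, x, y):
    """Planar motion model of Eq. (4): warp pixel (x, y) with the 8-parameter
    homography whose (3,3) entry is fixed to 1. Returns the warped coordinates."""
    h11, h12, h13, h21, h22, h23, h31, h32 = p
    w = h31 * x + h32 * y + 1.0              # projective normalization
    return ((h11 * x + h12 * y + h13) / w,
            (h21 * x + h22 * y + h23) / w)
```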
3 Feature-Based Reconstruction
To reconstruct 3d data based on tracked features, a set of camera and scene parameters has to be found such that the individual positions of all features in all images are optimally explained. Successful existing solutions [13] use linear methods in a homogeneous embedding of Euclidean space into projective space.
A stratification step is then needed to get back into the Euclidean frame. For our experimental comparison in section 5, a factorization method [13] of this type is used as a feature-based reference. Alternatively, a solution can be found with an optimization problem, as will be shown in section 3.1. An online reconstruction system using this formulation will be presented in section 3.2. This system is used as a second feature-based method in our experimental comparison, and will be reformulated into our new, combined tracking and reconstruction approach in section 4.
3.1 Optimal Least-Squares Solution
The presented reconstruction approach minimizes a non-linear error function directly in a Euclidean 3d coordinate frame and can hence be referred to as a bundle adjustment [5]. To formulate the error function, a relation between the desired 3d parameters and the measured homographies is first established. With the usual projective camera model, homogeneous 3d world coordinates $\tilde{X}$ are mapped to homogeneous 2d image coordinates $\tilde{x}$ with the equation $\tilde{x} = \lambda K_v (R_v | t_v) \tilde{X}$. Here, $\lambda$ is an unknown scale factor, $K_v$ are the intrinsic and $R_v, t_v$ the extrinsic camera parameters in the $v$-th image. A 3d plane $\pi_p$ is defined by the equation $\tilde{n}_p^T \tilde{X} = 0$, where $\tilde{n}_p = (n_p, d_p)^T$ is the homogeneous representation of the plane normal $n_p$ and its distance to the origin $d_p$. With an arbitrary fixing of the 3d world coordinate system to $R_0 = \mathrm{Id}$ and $t_0 = 0$, the homography $H_{p,v}$ induced by plane $\pi_p$ in view $v$ can then be written as [12]:

$$H_{p,v} = \alpha K_v \left( R_v - \frac{1}{d_p}\, t_v n_p^T \right) K_0^{-1} \qquad (5)$$

This defines a mapping of homogeneous pixel coordinates in image $I_0$ to homogeneous pixel coordinates in image $I_v$. Note the decomposition is not unique, as e.g. multiplying both $t_v$ and $n_p$ with $-1$ will lead to the same homographies. Only in one of the two cases the scene will be in front of the cameras. Given a set of measured homographies $\hat{H}_{p,v}$ from the tracking process, we want to find an optimal set of parameters $\hat{R}_v$, $\hat{t}_v$ and $\hat{n}_p$ such that the induced homographies $H_{p,v}$ are equal up to scale with the measured $\hat{H}_{p,v}$. Using the Frobenius norm of scale-normalized homographies, the overall error function is hence:

$$\epsilon = \sum_v \sum_p \left\| \frac{\hat{H}_{p,v}}{\hat{H}_{p,v}^{(3,3)}} - \frac{H_{p,v}(R_v, t_v, n_p)}{H_{p,v}^{(3,3)}(R_v, t_v, n_p)} \right\|_F \qquad (6)$$
The minimization is performed using non-linear optimization, as introduced in section 2.1. During this process, the chain rule, quotient rule and finally derivatives of equation (5) with respect to Rv , tv and np are needed. We use the quaternion representation of rotation matrices, in order to handle these derivatives more compactly. Note that instead of using the Frobenius norm, different components of the homographies can be weighted according to their estimation accuracy in the tracking process. We will come back to this issue in section 4.2.
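A minimal sketch of the two quantities entering this error function, the plane-induced homography of Eq. (5) and the scale-normalized Frobenius residual of Eq. (6); the function names are illustrative and the scale factor α is absorbed by the normalization.

```python
import numpy as np

def plane_induced_homography(K_v, R_v, t_v, n_p, d_p, K_0):
    """Homography induced by plane (n_p, d_p) in view v, Eq. (5), up to scale."""
    return K_v @ (R_v - np.outer(t_v, n_p) / d_p) @ np.linalg.inv(K_0)

def homography_residual(H_measured, H_model):
    """One summand of Eq. (6): Frobenius distance of scale-normalized homographies."""
    return np.linalg.norm(H_measured / H_measured[2, 2]
                          - H_model / H_model[2, 2], ord="fro")
```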
3.2 Online Reconstruction Framework
We will now present a short overview of a system incorporating the above minimization problem into an online reconstruction environment. This will also lead to a suitable initialization method, such that local minima are avoided. In our system, the planar features are first tracked as in section 2. The resulting homographies are then passed to the reconstruction problem in equation (6) and the estimates of camera parameters and scene geometry are updated. In an online framework, not all view parameters always need to be optimized [14]. A sliding window approach is used, taking into account only the last n views and assuming all others as fixed. This reduces the problem size and improves the performance of the system. In our experiments, we typically use n = 15 views. The intrinsic camera parameters $K_v$ are currently assumed to be known for each frame. This is no inherent system limitation, but improves stability if only few, short-baseline frames are available in a short sliding window. To fix an overall Euclidean transformation of the world coordinate system, the first camera pose is set to $R_0 = \mathrm{Id}$ and $t_0 = 0$, and to resolve the scale ambiguity, the distance of the first plane to the origin of the world coordinate system is kept at $d_0 = 1$. If a new view is added, the camera pose is initialized from the best estimate in the previous frame, i.e. $R_v = \hat{R}_{v-1}$ and $t_v = \hat{t}_{v-1}$. Again, this naturally incorporates the assumption of small camera motions between successive frames. Additional planes can be selected manually or e.g. with the method from [15]. A new plane normal is initialized as parallel to the optical axis of the observing camera, as the plane is obviously facing towards that camera. With these initializations, convergence was always achieved in our experiments.
4 Reconstruction as a Tracking Problem
With the feature-based method from section 3, the two stages of tracking and reconstruction are needed. For optimal performance, the noise on the measured homographies, i.e. the error of the tracking algorithm, has to be addressed in the reconstruction step [5]. An analysis of this noise is hardly possible, however. Direct methods [7] instead try to avoid the intermediate step and reconstruct a 3d scene model directly from measured image intensities. In the following, we will first present our approach to achieve such a direct reconstruction for planar scenes, and then shortly discuss its benefits from a theoretical point of view.
4.1 Proposed Method
Starting with the intensity-based tracker from section 2 and the optimization-based reconstruction from section 3, a combined formulation can be derived. Instead of using the 8 independent parameters per homography from equation (3), we use the geometrically meaningful 3d parameters from equation (5) to describe the 2d image motions. The overall error function for tracking then becomes:
$$\epsilon(p) = \sum_v \sum_p \sum_x \left( I_v(f(p^{(p,v)}, x)) - I_0(x) \right)^2 \qquad (7)$$
with $f$ from equation (4). Note that the tracking problem for independent features is essentially the same, but due to the parameterization as independent homographies, it then collapses to one independent homography estimation problem per plane and view. The feedback of a geometric scene interpretation is implicitly given by the geometric parameterization in our new formulation. Like before, the error function is minimized by adjusting camera and plane parameters. Basically, another chain rule links the two equations (1) and (6), while the optimization framework remains the same. Also the system performing the iterative online estimation of camera and scene parameters can be set up like in the feature-based case in section 3.2. The sliding window, all initializations and the fixation of an overall similarity transform are identical.
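The following sketch illustrates how the combined error of Eq. (7) could be accumulated. It assumes a single intrinsic matrix K for all views and images given as grey value lookup functions for (sub)pixel positions; both are simplifying assumptions for illustration only.

```python
import numpy as np

def combined_error(views, I0, feature_pixels, extrinsics, planes, K):
    """Direct error of Eq. (7): squared intensity residuals of planar features
    warped by the plane-induced homographies of Eq. (5).

    views[v](xy) and I0(xy) return grey values at (sub)pixel positions,
    feature_pixels[p] is an (N_p, 2) array of pixels of plane p in I0,
    extrinsics[v] = (R_v, t_v), planes[p] = (n_p, d_p).
    """
    K_inv = np.linalg.inv(K)
    err = 0.0
    for view, (R, t) in zip(views, extrinsics):
        for (n, d), pix in zip(planes, feature_pixels):
            H = K @ (R - np.outer(t, n) / d) @ K_inv      # Eq. (5), up to scale
            hom = np.c_[pix, np.ones(len(pix))] @ H.T     # homogeneous warp
            warped = hom[:, :2] / hom[:, 2:3]
            res = view(warped) - I0(pix)                  # grey value residuals
            err += np.sum(res ** 2)
    return err
```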
4.2 Expected Benefits
For the classical feature-based approach with two steps, the tracking cannot make use of any 3d scene information, and the reconstruction has to deal with whatever tracking errors occur on the features. We will shortly discuss how a combined treatment can benefit both aspects. With the features tracked independently, their motions are assumed to be independent from each other. Any coherence due to a rigid alignment of features in the 3d world is ignored. This is reflected by an overparameterization of the motions; exploiting the coherence instead can yield an enormous gain in tracking robustness [1]. The classical reconstruction methods, on the other hand, have to deal with tracking errors. Typically, Gaussian noise is assumed on the observed parameters of the homographies. This leads to the Frobenius norm and least-squares solutions, as presented in section 3. However, an optimal performance can only be achieved if the noise model is accurate [5]. Although it is certainly not Gaussian, the real error distribution of a tracking algorithm seems hardly tractable. Our direct method instead assumes Gaussian noise on the observed grey values, which is a much more intuitive quantity than the error distribution on the entries of a homography matrix. The optimal 3d parameters are not balancing the error on tracked feature positions, but on the intensity measurements.
5 Experiments
For real-world tasks, the limiting factor is frequently not accuracy, but the runtime of a structure-from-motion approach. We will first address this issue in section 5.1. To verify the claimed benefits in accuracy, a ground-truth comparison is presented in 5.2. Finally, some additional qualitative results are shown in 5.3. The test sequences for the evaluations were recorded with a Sony DFWVL500 camera with a resolution of 640 × 480 at 30 fps (sequences rx90 seq1, rx90 seq2, books2 and box occlude) and a Sony DCR-HC96 progressive scan camcorder at 720 × 576 with 25 fps (sequences books5 and books6). The intrinsic parameters of both cameras were calculated in an offline calibration step. Two additional sequences (synth church1 and synth church2) were rendered with the povray program. Sample images are shown on the left hand side of figure 3.
Table 1. Typical frame rates in frames per second for the different algorithms and sequences.

                 feature    direct
  synth church1  21.8523    0.8483
  synth church2  23.3024    0.7076
  rx90 seq1      12.8395    0.4789
  rx90 seq2      17.8756    0.7655
  box occlude    23.7421    0.1643
  books2         21.6332    0.2014
  books5         12.3678    0.5667
  books6          9.9629    0.1926
5.1 Runtime Measurements
We have implemented our proposed approach on an Intel Core2 6600 CPU with 2.40 GHz and 2 GB RAM. It is currently running as a monolithic, single-threaded application, although many of the computations could be parallelized. Most importantly, the measured computation times vary with the number and size of the selected features. The second important parameter affecting the system performance is the number of views taken into account in the sliding window of the optimization process. We currently use a sliding window with n = 15 frames, leading to fairly accurate results with a reasonable computational effort. Table 1 shows typical frame rates achieved with the described systems. Note that the feature-based approach (feature) from section 3 is suitable for real-time applications with typically 20 fps, including both tracking and reconstruction. The direct approach from section 4 is notably slower with typically 0.3 fps. This difference is mainly due to the more expensive evaluation of the Jacobian $\frac{\partial}{\partial p_i} d(p_i)$ in each iteration step. For the direct approach, a sum over all grey values of all planes p and all views v has to be accumulated each time. Still, both presented algorithms are fast enough to be considered for real-time or close-to-real-time applications.
5.2 Ground-Truth Evaluation
To evaluate the accuracy of both approaches and especially to show the improvements of the direct formulation, we performed a comparison with known ground-truth 3d information. In addition to the two rendered scenes, two further sequences were recorded with a camera mounted on a Stäubli RX90 robot arm, performing a controlled motion. The evaluation is hence not affected by the differing image characteristics of rendered and real images. In our evaluation, we concentrate on the camera localization error. The translation error is measured as the difference of the optical centers. For the rotations, the angle of $(R_v^{(\mathrm{reconst})})^T R_v^{(\mathrm{truth})}$ is given as the residual rotation. The measured residuals are shown in table 2. Note that for the rx90 sequences, the residual translation is given in meters and for the synth sequences, dimensionless povray distance units are used. All rotational residuals are given in degrees. As a reference method, results from the offline factorization method in [13] are given in the column factor. In all cases, the residual error of the direct approach is much smaller than with the feature method, sometimes by up to
Table 2. Camera localization error compared to ground truth information.

  Translation error      factor     feature    direct
  synth church1          0.1474     0.1941     0.0293
  synth church2          0.1825     0.1829     0.0176
  rx90 seq1              0.0139     0.0387     0.0050
  rx90 seq2              0.1583     0.0381     0.0036

  Rotation error         factor     feature    direct
  synth church1           2.1103     5.8193    0.4352
  synth church2           2.1249     5.6612    0.3241
  rx90 seq1              10.4286    22.2563    2.7547
  rx90 seq2              97.6951    36.1420    4.6378
Fig. 2. Using the direct method, planes can also be tracked during temporary occlusions, allowing an accurate reconstruction of the whole sequence box occlude
Fig. 3. Excerpts from the test sequences and the reconstructions achieved with the direct reconstruction method. From top to bottom: synth church1, rx90 seq1, books5.
a factor of 10. This confirms that the direct approach outperforms feature-based methods. Note that care was taken for all planes to be tracked in all views. This was another limiting factor of the feature-based approach, as independent tracking was less robust than the combined approach [1].
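For illustration, the residuals reported in Table 2 could be computed as in the sketch below; the function names are assumptions and only the error definitions from the text are used.

```python
import numpy as np

def translation_error(c_reconst, c_truth):
    """Difference of the reconstructed and ground-truth optical centers."""
    return np.linalg.norm(np.asarray(c_reconst) - np.asarray(c_truth))

def rotation_error_deg(R_reconst, R_truth):
    """Angle (in degrees) of the residual rotation R_reconst^T R_truth."""
    R_res = R_reconst.T @ R_truth
    cos_angle = np.clip((np.trace(R_res) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))
```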
5.3 Qualitative Results
Finally, we applied our methods to additional sequences, recorded with handheld cameras and uncontrolled, unsteady motions. No ground truth information is available, but the qualitative reconstruction results seem to be very accurate, as shown in figure 3. The reconstructed 3d scenes were textured using image data and the camera positions were visualized with red pyramids. The images in figure 2 further demonstrate the improved tracking robustness achieved with the direct method. Due to the implicit feedback of the accumulated 3d scene knowledge, tracking and reconstruction also succeed in cases where individual tracking of the features is no longer possible. An accurate reconstruction of the whole sequence is hence possible.
6 Further Work
While the reconstruction approaches were shown to work on real image data, there are still some open questions to be addressed in future research. We will present some of the most prominent issues in this section. In our current system, the planar features are selected manually with a common reference image. While this is actually beneficial for an evaluation of the tracking and reconstruction system itself, it can easily be overcome in real applications. Methods for automatic detection of planar features are presented in [15] and a homography decomposition with arbitrary reference frames is given in [2]. For real-world applications, the computational effort of the reconstruction approaches still has to be reduced. The main bottleneck is the evaluation of Jacobians, as was already mentioned in 5.1. One idea for resolving this bottleneck is to improve convergence speed by providing better initial values to the optimization process, e.g. by bootstrapping from a feature-based reconstruction. Finally, the level of scene detail can be improved by accurate depth reconstruction. With the highly accurate camera calibration from our combined tracking and structure-from-motion system, methods such as [6] are an ideal additional step to extend the scene reconstruction beyond the planar features.
7 Conclusions
We presented two different methods for structure-from-motion reconstruction. In the first approach, two distinct tracking and reconstruction steps are used, while in the second, a combined step allows direct reconstruction from grey values. We showed both in theory and in experiments that the second approach
is superior with respect to the reconstruction accuracy. Aside from that, the tracking robustness also increases [1] due to the implicit feedback of 3d information. The main reason for the improved reconstruction accuracy of the combined approach is the use of a more realistic noise model. The feature-based approach assumes Gaussian noise on the entries of the observed homography matrices when solving for optimal 3d scene parameters. In contrast, the integrated formulation assumes Gaussian noise on the observed grey values. One of the open issues is the run-time performance, which is considerably slower for the direct formulation. Still, the achieved frame rates were close to real-time and make the presented algorithms attractive.
References
1. Kähler, O., Denzler, J.: Rigid motion constraints for tracking planar objects. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 102–111. Springer, Heidelberg (2007)
2. Ladikos, A., Benhimane, S., Navab, N.: A real-time tracking system combining template-based and feature-based approaches. In: Proceedings VISAPP 2007, March 2007, vol. 2, pp. 325–332 (2007)
3. Koeser, K., Bartczak, B., Koch, R.: Robust gpu-assisted camera tracking using free-form surface models. Journal of Real-Time Image Processing 2(2-3), 133–147 (2007)
4. Habbecke, M., Kobbelt, L.: Iterative multi-view plane fitting. In: Proc. Vision, Modeling, and Visualization Conference, November 2006, pp. 73–80 (2006)
5. Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjustment – a modern synthesis. In: Vision Algorithms: Theory and Practice, pp. 298–372. Springer, Heidelberg (2000)
6. Pollefeys, M., Gool, L.V., Vergauwen, M., Verbiest, F., Cornelis, K., Tops, J., Koch, R.: Visual modeling with a hand-held camera. International Journal of Computer Vision 59(3), 207–232 (2004)
7. Irani, M., Anandan, P.: About direct methods. In: Vision Algorithms: Theory and Practice, pp. 267–277. Springer, Heidelberg (2000)
8. Jin, H., Favaro, P., Soatto, S.: A semi-direct approach to structure from motion. The Visual Computer 19(6), 377–394 (2003)
9. Shi, J., Tomasi, C.: Good features to track. In: IEEE Conference on Computer Vision and Pattern Recognition CVPR, June 1994, pp. 593–600 (1994)
10. Hager, G.D., Belhumeur, P.N.: Efficient region tracking with parametric models of geometry and illumination. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(10), 1025–1039 (1998)
11. Baker, S., Matthews, I.: Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision 56(3), 221–255 (2004)
12. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2003)
13. Rother, C., Carlsson, S., Tell, D.: Projective factorization of planes and cameras in multiple views. In: Proceedings ICPR 2002, August 2002, pp. 737–740 (2002)
14. Engels, C., Stewenius, H., Nister, D.: Bundle adjustment rules. In: Proceedings Photogrammetric Computer Vision 2006, ISPRS, September 2006, pp. 266–271 (2006)
15. Kähler, O., Denzler, J.: Detecting coplanar feature points in handheld image sequences. In: Proceedings VISAPP 2007, March 2007, vol. 2, pp. 447–452 (2007)
3D Body Scanning in a Mirror Cabinet
Sven Molkenstruck, Simon Winkelbach, and Friedrich M. Wahl
Institute for Robotics and Process Control, Technical University of Braunschweig, Mühlenpfordtstr. 23, D-38106 Braunschweig, Germany
{S.Molkenstruck, S.Winkelbach, F.Wahl}@tu-bs.de
Abstract. Body scanners offer significant potential for use in many applications like clothing industry, orthopedy, surgery, healthcare, monument conservation, art, as well as film- and computer game industry. In order to avoid distortions due to body movements (e.g. breathing or balance control), it is necessary to scan the entire surface in one pass within a few seconds. Most body scanners commercially available nowadays require several scans to obtain all sides, or they consist of several sensors making them unnecessarily expensive. We propose a new body scanning system, which is able to fully scan a person’s body or an object from three sides at the same time. By taking advantage of two mirrors, we only require a single grayscale camera and at least one standard line laser, making the system far more affordable than previously possible. Our experimental results prove efficiency and usability of the proposed setup.
1 Introduction
The first systems for three-dimensional body scanning were developed more than ten years ago (see e.g. [1], [2]), but an exploitation of all promising applications is still in the very early stages. Modern body scanners can capture the whole shape of a human in a few seconds. Such systems offer significant potential for use in e.g. clothing industry, orthopedy, surgery, healthcare, monument conservation, art, as well as film- and computer game industry. Recently, Treleaven and Wells published an article that thoroughly emphasizes the expected "major impact on medical research and practice" [3]. They exemplify the high capability of 3D body scanners to improve e.g. scoliosis treatment, prosthetics, drug dosage and cosmetic surgery. Most publications dealing with body scanners are motivated by the increasing importance of fast and automatic body measuring systems for the clothing industry (see e.g. [1], [4], [5]), since such systems may enable the industry to produce mass customized clothing. In the last years a few approaches for contact-free 3D body scanning have been proposed and some commercial products are already available. Most systems rely on well-known acquisition techniques like coded light, phase shift, structured light, laser triangulation, and time-of-flight (see [6] for a review of different range sensors). Surface acquisition techniques are state-of-the-art (see e.g. [7], [8], [9]), but when they are applied to body scanning, there are still some restrictive
Fig. 1. Proposed body scanner setup consisting of two mirrors, at least one camera, and multiple line laser modules which are mounted on a linear slide
conditions that have to be taken into consideration. In order to avoid problems with body movement (e.g. due to breathing or balance control), it is necessary to scan the entire surface in one pass within a few seconds. Whole-body scanners must capture the surface from multiple viewing directions to obtain all sides of a person. Therefore most of them consist of several sensors, which make them very expensive. Alternative low-cost solutions are in great demand. We propose a new body scanner system that is able to scan a person’s body or an object from all sides at the same time. By making intelligent use of two mirrors, we require only one single grayscale camera and at least one standard line laser, making the system far more affordable than was previously possible.
2 System Setup
Our proposed body scanner setup is illustrated in Fig. 1. It consists of two mirrors, at least one camera, and multiple line laser modules mounted on a linear slide. Surface acquisition is based on standard triangulation (i.e. intersection of camera rays and laser planes, see e.g. [7], [10]). The key element of our setup is the pair of mirrors behind the person, enclosing an angle of about 120°. This 'mirror cabinet' not only lets each camera capture three viewing directions of the subject (see Fig. 2), but additionally reflects each laser plane into itself, yielding a coplanar illumination from three directions and a visible 'laser ring' on the subject's surface. This enables a simultaneous measurement of multiple sides of the person with minimal hardware requirements.
Fig. 2. View of the upper camera. Each camera captures three viewing directions of the subject: two mirrored views in the left and right side of the image, and one direct view in the middle.
Details, possible variations, and recommendations about the setup are discussed in Section 4.
2.1 Camera/Mirror Calibration
A precise camera calibration is an essential precondition for accurate body scanning. To estimate the intrinsic and extrinsic camera parameters, the calibration process requires points in 3D world coordinates and corresponding points in 2D pixel coordinates. These points should preferably span the entire working space (i.e. scan volume). Therefore, we built a calibration target with visible markers at three sides as shown in Fig. 3. Since each camera in our mirror setup can capture images in three viewing directions, we derive three camera models for each camera: one model for the center view and two for the left and right reflected views. In this way we can act as if we used three separate cameras that can see three different sides of the body. Our enhanced calibration procedure consists of three steps:
1. Separate calibration of the reflected left, reflected right, and unreflected center camera.
2. Coarse pose estimation of the left and right mirror plane.
3. Combined fine calibration of the overall mirror-camera setup.
The first step applies the standard camera calibration approach of Tsai [11] three times, except that the reflected cameras require reflecting the x-component whenever we map from image to world coordinates or vice versa. The results are three calibrated camera models representing three viewing directions of one real camera.
Fig. 3. Calibration target consisting of three white panels mounted at a precise angle of 120◦ and having a total of 48 markers: (Left) schematic layout; (right) shot of the upper camera
The second step estimates an initial pose of each mirror using the fact that each of them lies in a midplane of two camera focal points. For example, the left mirror plane is given by the plane equation

$$(f_c - f_l) \cdot \left( x - \frac{f_l + f_c}{2} \right) = 0, \qquad x \in \mathbb{R}^3 \qquad (1)$$
with focal point fc of the unreflected center camera and focal point fl of the reflected left camera. Theoretically those two calibration steps are sufficient for the following triangulation approach. However, a well-known weakness of fixed camera calibration is the close relationship between focal length f and camera-object distance Tz . The effects on the image of small changes in f compared to small changes in Tz are very similar, which leads to measurement inaccuracy. This fact is the main reason for slight differences in the focal length of all three camera models (which should be equal). Therefore, the calibration is refined in the third step by performing a numerical optimization of both mirror poses and all center camera model parameters, incorporating all 48 markers. Finally, optimized left and right camera models can be derived from the optimized center camera by reflection on the left and right mirror plane. A comparison of scans with and without this last optimization step is given in the last section (Fig. 5).
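A minimal sketch of the geometric relations used in the second and third calibration step: the mirror midplane of Eq. (1) and the reflection that derives the left and right camera models from the optimized center camera. Function names are illustrative.

```python
import numpy as np

def mirror_plane(f_center, f_reflected):
    """Midplane between the two focal points, Eq. (1).
    Returns (n, d) of the plane {x : n.x + d = 0}."""
    n = f_center - f_reflected
    n = n / np.linalg.norm(n)
    d = -n @ ((f_center + f_reflected) / 2.0)
    return n, d

def reflect_point(x, n, d):
    """Mirror a 3D point on the plane (n, d)."""
    return x - 2.0 * (n @ x + d) * n
```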
3 3D Scanning
Computation of surface data consists of three steps: (i) Extraction of laser line(s) from camera images, (ii) determination of the laser slide position for each camera frame, and (iii) triangulation of surface points and filtering. We suggest to
perform only the first step during scanning; all further processing steps can be carried out afterwards when time is no longer critical.
3.1 Laser Line Extraction
From each camera image, the laser line(s) need to be extracted for later processing. In our case an efficient column-wise line detection algorithm can be applied, since the laser lines run roughly horizontally (see Fig. 2). In order to obtain the peak position of a laser line to subpixel precision, we use a conventional center-of-mass estimation: For each image column x, the subpixel precise y coordinate of a laser line is calculated as the average y coordinate of all "bright" pixels within a moving window, weighted by their corresponding pixel intensities. (See [12] for a comparison of different subpixel peak detection methods.) The resulting peak coordinates per image column are collected for each camera frame and are used in the following steps.
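A simplified sketch of the column-wise center-of-mass peak extraction; the brightness threshold and window size are illustrative parameters, not values taken from the paper.

```python
import numpy as np

def laser_peaks_subpixel(image, threshold=40, window=7):
    """Detect one laser peak per image column with subpixel precision.

    For every column, the brightest pixel is located and the peak position is
    computed as the intensity-weighted mean y coordinate within a small moving
    window. Returns an array of y positions (NaN where no line was found).
    """
    h, w = image.shape
    peaks = np.full(w, np.nan)
    half = window // 2
    for x in range(w):
        col = image[:, x].astype(float)
        y0 = int(np.argmax(col))
        if col[y0] < threshold:
            continue                       # column contains no laser line
        lo, hi = max(0, y0 - half), min(h, y0 + half + 1)
        weights = col[lo:hi]
        peaks[x] = np.sum(np.arange(lo, hi) * weights) / np.sum(weights)
    return peaks
```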
3.2 Laser Calibration
There are several possible ways to obtain the laser slide position (i.e. the height of the laser light planes) for each camera frame: (i) synchronized camera(s) and a servo drive system for the laser movement, (ii) stereo vision using synchronized cameras, or (iii) a laser calibration target (a diffusely reflective object/surface) at a precisely known location, such that at least one laser line is always visible on it. The latter method has been implemented in our experimental setup, as it has the advantage of requiring neither any hardware synchronization nor expensive servo drive systems. In our setup we simply use the two flat panels behind the mirrors as laser calibration target, since these panels run parallel to the previously calibrated mirror planes. 3D points along the laser lines that lie on the panels are obtained by ray-plane intersection. To be robust with respect to occlusions and noise, we presume that the laser moves at constant speed. This allows us to estimate a linear function that represents the laser height over time, e.g. by using the Hough Transform for lines [13], [14]. From this processing step, the position (height) of the laser slide is precisely known for each camera frame, allowing the triangulation of 3D surface points described in the following.
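The sketch below recovers the laser height as a linear function of the frame index; it substitutes an iterative least-squares fit with outlier rejection for the Hough transform used in the paper, purely as an illustration of the constant-speed assumption.

```python
import numpy as np

def fit_laser_height(frame_idx, heights, n_iter=3, tol=2.0):
    """Fit h(t) = a*t + b to the laser heights observed on the calibration panels.

    'tol' is an illustrative outlier rejection threshold; entries of 'heights'
    may be NaN where no laser point was observed on the panels.
    """
    t = np.asarray(frame_idx, dtype=float)
    h = np.asarray(heights, dtype=float)
    mask = np.isfinite(h)
    a = b = 0.0
    for _ in range(n_iter):
        a, b = np.polyfit(t[mask], h[mask], deg=1)
        residuals = np.abs(h - (a * t + b))
        mask = np.isfinite(h) & (residuals < tol)
    return a, b        # laser height in frame i is a*i + b
```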
3.3 Triangulation and Filtering
Triangulation of 3D surface points from calibrated cameras and known laser planes is simple if we use a single laser plane. But since we want to use several laser planes simultaneously, we have to solve the pixel-to-plane correspondence problem (i.e. identify the laser plane number that corresponds to a given illuminated pixel). Here we propose two heuristics, which can be combined: Bounding Volume Constraint. When the scanned subject is a human body and its approximate position in the scene is known, its surface can be roughly approximated by a bounding volume, e.g. by a surrounding vertical cylinder c
Fig. 4. Intersection of a camera ray with the light planes of laser 1 and laser 2 produces two possible surface points p1 and p2 . However, only the correct point p2 lies inside the bounding cylinder.
with a certain height and radius. Triangulation with all n possible laser planes yields n different surface points, which lie on the corresponding light ray to the camera. In a reasonable setup, only one of them lies inside c (see Fig. 4). The maximum radius of c, i.e. the scanning region, depends on the distance between the laser planes and the triangulation angle between laser planes and camera view. Obviously these parameters need to be adjusted to the scanned subject's size. Column-Wise Laser Counting. Another possible way of determining the correct laser for each illuminated surface point can be integrated into the laser line extraction algorithm (Section 3.1): If the number of detected laser peaks in an image column is equal to the number of lasers n, their respective laser plane index i can easily be determined by simple counting. However, this is not possible in all columns due to occlusion, reflections, image noise, etc. Therefore, we suggest using a specialized labeling algorithm that assigns the same label to illuminated points that are direct neighbors in space or time (i.e. frame number). Thus all points with the same label have most probably been illuminated by the same laser line. The index of this laser line can be determined by a majority vote of all available indices i. Note: The same principle of majority vote can be used to decide whether points with a specific label have been seen directly, via the left mirror, or via the right mirror. This can be necessary if some directly visible parts of the subject (e.g. the arms, see Fig. 2) reach into the left and/or right image region, occluding the mirrors.
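A minimal sketch of the first heuristic: the camera ray is intersected with every laser plane and the single candidate inside the bounding cylinder is kept. The cylinder is assumed to be vertical along the z axis; all names are illustrative.

```python
import numpy as np

def intersect_ray_plane(origin, direction, plane_n, plane_d):
    """Intersect the ray origin + s*direction with the plane n.x + d = 0."""
    s = -(plane_n @ origin + plane_d) / (plane_n @ direction)
    return origin + s * direction

def triangulate_with_bounding_cylinder(origin, direction, laser_planes,
                                       cyl_center, cyl_radius, cyl_height):
    """Bounding volume constraint: keep the intersection point lying inside
    the vertical bounding cylinder c; returns None if no candidate qualifies."""
    for n, d in laser_planes:
        p = intersect_ray_plane(origin, direction, n, d)
        dx, dy = p[0] - cyl_center[0], p[1] - cyl_center[1]
        inside_radius = dx * dx + dy * dy <= cyl_radius ** 2
        inside_height = abs(p[2] - cyl_center[2]) <= cyl_height / 2.0
        if inside_radius and inside_height:
            return p     # in a reasonable setup only one candidate survives
    return None
```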
3.4 Postprocessing
Finally, the resulting surface fragments are merged into a closed triangle mesh using Poisson Reconstruction [15]. If they do not fit perfectly, e.g. due to imprecisions
in setup or calibration, surface registration techniques like ICP [16], [17] can be used to improve their alignment.
4 Experiments and Conclusion
4.1 Experimental Setup
The approach described in this paper allows several setup variations. One parameter is the number of parallel laser planes. Increasing this number has the advantage of decreasing the duration of a whole surface scan, but makes the pixel-to-plane correspondence problem (discussed in Section 3.3) more difficult. In our experiments, we have used one to three lasers (650 nm, 16 mW) with a distance of 20 cm. The lasers are mounted on a linear slide, which is driven at constant speed. Images are captured by two Firewire grayscale cameras (1024 × 768 pixels, 30 fps), set up as depicted in Fig. 1. In the case of three lasers we can capture up to 90 body slices per second, which should be sufficient for most applications. Depending on the scanned object, a single camera may be sufficient; but if some surface areas are occluded, a second camera may be useful to cover them. Since only laser lines are extracted from the camera images during scanning, it may be useful, depending on the environmental light conditions, to attach a corresponding wavelength bandpass filter to the cameras. In our prototypical setup, we use two ordinary hall mirrors from a do-it-yourself store (62 cm × 170 cm each). Although these mirrors are not perfectly flat (especially the one on the right), we obtained satisfying results. However, industrial front surface mirrors should be preferred in a professional setup.
4.2 Results and Conclusion
To verify the absolute precision of our setup, we have scanned a precisely known object: a cuboid of 227.7 mm × 316.8 mm × 206.6 mm, mounted on a tripod at about chest height (Fig. 5). The obtained scan has an absolute error of approx. 1 mm in size. However, the influence of our slightly distorted right mirror can be seen at the right side of Fig. 5(b): The right edge of the cuboid is not precisely parallel to the left one. Furthermore, we have analyzed the surface noise of the measured cuboid faces (i.e. the captured surface point distances to the corresponding planes). The scan data of the front side, which is directly visible in the camera image, have lowest noise: The 17025 captured surface points exhibit a standard deviation of 0.63 mm; their maximum distance is less than 2.2 mm. As the other three faces are not directly visible (only as reflections in the mirrors), their pixel resolution is lower, and noise is slightly stronger: We obtained 5993 to 8558 surface points per side, having a standard deviation of 0.82 mm to 0.93 mm and a maximum outlier distance of 3.9 mm. Some qualitative results from our experiments can be seen in Fig. 7 and Fig. 8. Fig. 6 shows a corresponding camera view directly before scanning and an overlay
Fig. 5. Unfiltered scan of a 206.6 mm × 316.8 mm × 227.7 mm cuboid. (a) Horizontal slice (seen from above) after performing a standard camera calibration. (b) Same horizontal slice after performing our refined calibration (see step 3 in Section 2.1). The divergence at the right side is caused by our slightly distorting right mirror. (c) 3D rendering of the result.
Fig. 6. Shots of the upper camera: (Left) Person in a mirror cabinet; (right) overlaid laser lines of every 10th camera frame
of multiple camera images during scanning. The 3D data have only been filtered with a 3 × 3 averaging filter. On closer inspection one can see some small horizontal waves on the surface that are caused by slight body motion, which is almost impossible to avoid. In future work, we will try to eliminate this waviness by adjusting the balance point of each body slice. Gaps appear in the 360° view at overly specular or dark surface parts (e.g. hair), or in regions which are occluded from the lasers or cameras by other body parts. Our results prove the usability and efficiency of the suggested body scanner setup. Its precision is sufficient for many applications in health care or clothing industries, and the use of mirrors and related algorithms makes our setup very cost-efficient compared to previous techniques, because each mirror saves the costs of at least one camera, one laser, and one linear slide.
Fig. 7. Scan results of the person in Fig. 6: (a) Front and rear view of a scan using the lower camera only; (b) merged data of lower and upper camera, front view; (c) rear view
Fig. 8. Further results: (a) Front view of a female; (b) rear view. (c) Front view of a male; (d) rear view.
References
1. Jones, P.R.M., Brooke-Wavell, K., West, G.: Format for human body modelling from 3d body scanning. Int. Journal of Clothing Science and Technology 7(1), 7–15 (1995)
2. Horiguchi, C.: Bl (body line) scanner - the development of a new 3d measurement and reconstruction system. International Archives of Photogrammetry Remote Sensing 32(5), 421–429 (1998)
3. Treleaven, P., Wells, J.: 3d body scanning and healthcare applications. Computer, IEEE Computer Society 40(7), 28–34 (2007)
4. Istook, C.K., Hwang, S.J.: 3d body scanning systems with application to the apparel industry. Journal of Fashion Marketing and Management 5(2), 120–132 (2001)
5. Guerlain, P., Durand, B.: Digitizing and measuring of the human body for the clothing industry. Int. Journal of Clothing Science and Technology 18(3), 151–165 (2006)
6. Blais, F.: Review of 20 years range sensor development. Journal of Electronic Imaging 13(1) (2004)
7. Pipitone, F.J., Marshall, T.G.: A wide-field scanning triangulation rangefinder for machine vision. International Journal of Robotics Research 2(1), 39–49 (1983)
8. Winkelbach, S., Wahl, F.M.: Shape from single stripe pattern illumination. In: Van Gool, L. (ed.) DAGM 2002. LNCS, vol. 2449, pp. 240–247. Springer, Heidelberg (2002)
9. Winkelbach, S., Molkenstruck, S., Wahl, F.: Low-cost laser range scanner and fast surface registration approach. In: Franke, K., Müller, K.-R., Nickolay, B., Schäfer, R. (eds.) DAGM 2006. LNCS, vol. 4174, pp. 718–728. Springer, Heidelberg (2006)
10. Hall, E.L., Tio, J.B.K., McPherson, C.A.: Measuring curved surfaces for robot vision. Computer 15(12), 42–54 (1982)
11. Tsai, R.Y.: An efficient and accurate camera calibration technique for 3d machine vision. In: IEEE Conf. Computer Vision and Pattern Recognition, pp. 364–374 (1986)
12. Fisher, R., Naidu, D.: A comparison of algorithms for subpixel peak detection. In: Sanz, J.L.C. (ed.) Image Technology Division, pp. 385–404. Springer, Heidelberg (1996)
13. Hough, P.V.C.: Method and means for recognizing complex patterns, U.S. Patent 3,069,654 (December 1962)
14. Duda, R.O., Hart, P.E.: Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM 15(1), 11–15 (1972)
15. Kazhdan, M., Bolitho, M., Hoppe, H.: Poisson surface reconstruction. In: Eurographics Symposium on Geometry Processing, pp. 61–70 (2006)
16. Besl, P.J., McKay, N.D.: A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Machine Intell. 14(2), 239–258 (1992)
17. Dalley, G., Flynn, P.: Pair-wise range image registration: a study in outlier classification. Comput. Vis. Image Underst. 87(1-3), 104–115 (2002)
On Sparsity Maximization in Tomographic Particle Image Reconstruction
Stefania Petra1, Andreas Schröder2, Bernhard Wieneke3, and Christoph Schnörr1
1 Image and Pattern Analysis Group, Heidelberg Collaboratory for Image Processing, University of Heidelberg, Germany, {petra,schnoerr}@math.uni-heidelberg.de
2 Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR), Göttingen, Germany, [email protected]
3 LaVision GmbH, Göttingen, Germany, [email protected]
Abstract. This work focuses on tomographic image reconstruction in experimental fluid mechanics (TomoPIV), a recently established 3D particle image velocimetry technique. Corresponding 2D image sequences (projections) and the 3D reconstruction via tomographical methods provide the basis for estimating turbulent flows and related flow patterns through image processing. TomoPIV employs undersampling to make the high-speed imaging process feasible, resulting in an ill-posed image reconstruction problem. We address the corresponding basic problems involved and point out promising optimization criteria for reconstruction based on sparsity maximization that perform favorably in comparison to classical algebraic methods currently in use for TomoPIV.
1 Introduction
Recent developments of particle image velocimetry (PIV) techniques [14] allow capturing the flow velocity of large and unsteady flow fields instantaneously. Among the different 3D techniques presently available for measuring velocities of fluids, tomographic particle image velocimetry (TomoPIV) [6] has recently received most attention, due to its increased seeding density with respect to other 3D PIV methods. This, in turn, enables high-resolution velocity field estimates of turbulent flows by means of a cross correlation technique [15]. TomoPIV is based on a multiple camera-system, three-dimensional volume illumination and subsequent 3D reconstruction, cf. [6]. In this paper we consider the essential step of this technique, the 3D reconstruction of the particle volume functions from few projections, which amounts to solving an underdetermined system of linear equations of the form

$$Ax = b. \qquad (1)$$
This work has been supported by the German Research Foundation (DFG), grant Schn 457/10-1.
Such systems, disregarding for the moment the inconsistent case, have infinitely many solutions. Classical regularization approaches with strictly convex objective functions are therefore applied (cf. [6] and section 3 below) in order to single out a particular solution. However, common choices such as maximizing entropy or minimizing the $\ell_2$ norm can be problematic within the TomoPIV setting, since both approaches provide much too dense "solutions", i.e. solutions x with most entries different from zero, resulting in blurry "ghost particles". This is clearly detrimental for subsequent motion estimation. The objective of this paper is to point out better alternatives.
Organization. After sketching the imaging process in section 2, we recall the classical solution concepts that are currently in use for TomoPIV in section 3. Better alternatives based on sparsity regularization are the subject of section 4. Our arguments are confirmed and illustrated by numerical experiments in section 5.
2 Discretization
A basic assumption in image reconstruction is that the image I to be reconstructed can be approximated by a linear combination of basis functions $B_j$,

$$I(z) \approx \hat{I}(z) = \sum_{j=1}^{n} x_j B_j(z), \qquad \forall z \in \Omega \subset \mathbb{R}^3, \qquad (2)$$
where $\Omega$ denotes the volume of interest and $\hat{I}$ is the digitization of I. The main task is to estimate the weights $x_j$ from the recorded 2D images, corresponding to basis functions located at a Cartesian equidistant 3D grid $p_j$, $j = 1, \dots, n$. We consider Gaussian-type basis functions ("blobs"), an alternative to the classical voxels, of the form

$$B_j(z) = e^{-\frac{\|z - p_j\|_2^2}{2\sigma^2}}, \qquad \text{for } z \in \mathbb{R}^3 : \|z - p_j\|_2 \le r, \qquad (3)$$

or value 0, if $\|z - p_j\|_2 > r$. See Fig. 1, left, for the 2D case. The choice of a Gaussian-type basis function is justified in the TomoPIV setting, since a particle projection in all directions results in a so-called diffraction spot. Figure 1 (right) shows a typical projection of a 3D particle distribution. Based on geometrical optics, the recorded pixel intensity is the object intensity integrated along the corresponding line of sight, obtained from a calibration procedure. Thus the i-th measurement obeys

$$b_i :\approx \int_{L_i} I(z)\,dz = \int_{L_i} \hat{I}(z)\,dz \approx \sum_{j=1}^{n} x_j \underbrace{\int_{L_i} B_j(z)\,dz}_{:=\, a_{ij}}, \qquad (4)$$
where Li is a line or more realistically a thin cone of light. Compare Fig. 1, left. Due to inaccuracies of the physical measurements the set of all projection “rays” yields an approximate linear system (1), where A is the projection matrix and b the measurement vector.
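For illustration, a single coefficient a_ij of the projection matrix could be approximated by sampling the blob of Eq. (3) along the line of sight, as sketched below; the real system integrates over a thin cone of light and uses calibrated rays, so this is only a simplified assumption.

```python
import numpy as np

def blob_line_integral(p_j, sigma, r, ray_origin, ray_dir, n_samples=200):
    """Approximate a_ij = integral of B_j along the ray L_i, cf. Eq. (4),
    with a simple Riemann sum over the part of the ray near the blob center."""
    ray_dir = ray_dir / np.linalg.norm(ray_dir)
    s0 = (p_j - ray_origin) @ ray_dir          # closest point parameter
    s = np.linspace(s0 - r, s0 + r, n_samples)
    pts = ray_origin[None, :] + s[:, None] * ray_dir[None, :]
    dist2 = np.sum((pts - p_j) ** 2, axis=1)
    vals = np.where(dist2 <= r * r, np.exp(-dist2 / (2.0 * sigma ** 2)), 0.0)
    return np.sum(vals) * (s[1] - s[0])
```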
Fig. 1. Left: Discretization by covering the domain Ω with compactly supported basis functions. Right: Projection of a 3D particle distribution. The volume function has to be reconstructed from 4–6 such measurements.
3 Classical Solution Concepts
Throughout this paper we concentrate on the ill-posedness of (1). We have to handle an underdetermined system with $m \ll n$. Since all entries of A and of the sought particle intensities x are nonnegative, every zero measurement $b_i = 0$ implies $x_j = 0$ for all j with $a_{ij} \neq 0$; the corresponding equations and columns can therefore be removed, leaving a reduced system with $b > 0$ the new data vector and feasible set $F := \{x \in \mathbb{R}^n : Ax = b,\ x \ge 0\}$. All $x_j$ variables corresponding to removed columns in A can be set to zero. The classical minimum energy approach for regularizing (1) computes a least-squares solution by solving the constrained minimization

$$(P_{LS}) \qquad \min \|x\|_2^2 \quad \text{s.t.} \quad Ax = b, \qquad (6)$$
where $\|\cdot\|_2$ denotes the Euclidean $\ell_2$ norm. A common and successful class of algorithms called row-action methods [2] exists to solve this problem and its variants. They are well suited for parallel computation and therefore particularly attractive for our application. A further established approach is the linearly constrained entropy maximization problem

$$(P_E) \qquad \min \sum_{j=1}^{n} x_j \log(x_j) \quad \text{s.t.} \quad Ax = b,\ x \ge 0, \qquad (7)$$
where $E(x) := -\sum_{j=1}^{n} x_j \log(x_j)$, $x \ge 0$, is the Boltzmann-Shannon entropy measure. Adding the nonnegativity constraint is necessary for the maximum entropy approach, since the log function is defined only for positive values and $0 \log 0 := 0$. After the removal of redundant equations as described at the beginning of this section, the relative interior of the feasible set is typically nonempty, and the unique solution to $(P_E)$ is strictly positive, except for those variables corresponding to removed columns in A. Algebraic reconstruction techniques (ART, MART) [8] are classical row-action methods for solving (6) and (7). For details, we refer to, e.g., [2]. In connection with TomoPIV, MART (multiplicative algebraic reconstruction technique) has been advocated in [6]. The behavior of MART in the case of inconsistent equations is not known. In contrast, ART converges to the least-squares solution even when applied to inconsistent systems. ART can be adapted to involve nonnegativity constraints by including certain constraining strategies [9] in the iterative process.
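As an illustration of these row-action methods, the following is a minimal Kaczmarz-type ART sweep with the nonnegativity projection of the ART+ variant; parameters and names are illustrative and not those of the cited implementations.

```python
import numpy as np

def art_plus(A, b, n_sweeps=100, relax=1.0, nonneg=True):
    """ART iterations for Ax = b; projecting onto the positive orthant after
    each complete sweep yields the ART+ constraining strategy described above."""
    m, n = A.shape
    x = np.zeros(n)
    row_norms = np.einsum("ij,ij->i", A, A)
    for _ in range(n_sweeps):
        for i in range(m):
            if row_norms[i] == 0.0:
                continue
            x += relax * (b[i] - A[i] @ x) / row_norms[i] * A[i]
        if nonneg:
            np.maximum(x, 0.0, out=x)      # projection onto x >= 0
    return x
```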
4 Regularization Via Sparsity
Since the particles are sparsely spread in the 3D volume, we are interested in solutions of (1) with as many components equal to zero as possible. This leads us to the following optimization problem:

$$(P_0) \qquad \min \|x\|_0 \quad \text{s.t.} \quad Ax = b, \qquad (8)$$
providing a minimal-support solution. We denote the support of a vector by $\mathrm{supp}(x) := \{i : x_i \neq 0\}$ and by $\|x\|_0 := \#\mathrm{supp}(x)$ the sparsity measure. Problem $(P_0)$ is nonsmooth with a highly nonconvex objective function. Thus many local optima may occur. Moreover, its complexity grows exponentially with the number of variables n, and if P ≠ NP there is no polynomial time algorithm that for every instance A and b computes the optimal solution of $(P_0)$; see [12] for this NP-hardness result. In addition to these negative results, the regularization attempt via sparsity may be inappropriate in case of nonuniqueness of the sparsest solution of $(P_0)$. Fortunately, previous work has shown that if a sparse enough solution to $(P_0)$ exists, then it is necessarily unique. In what follows we will give a flavor of these results. They involve the measure spark(A), which equals the minimal number of linearly dependent columns of A (see [4,5]), and the signature of a matrix $A \in \mathbb{R}^{m \times n}$. The latter is defined as the discrete function $\mathrm{sig}_A(k) \in [0, 1]$, for $k \in \{2, \dots, n\}$, that equals the number of combinations of k columns in A which are linearly dependent, divided by the number of all combinations of k columns out of the n existing ones. We have $2 \le \mathrm{spark}(A) \le \mathrm{rank}(A) + 1$, where rank(A) is at most m. By definition, $\mathrm{sig}_A(k) = 0$ for all $k < \mathrm{spark}(A)$. In contrast to rank(A), spark(A) as well as $\mathrm{sig}_A(k)$ is NP-hard to compute. Fortunately, bounds on these measures can be derived [4].
The following result is surprisingly elementary and can be found in [4].

Theorem 1. (Uniqueness) Let x be a solution of (1) with $\|x\|_0 < \frac{\mathrm{spark}(A)}{2}$. Then x is the unique solution of $(P_0)$.
In [5] Elad adopts a probabilistic point of view to study uniqueness of sparse solutions of $(P_0)$ beyond the worst-case scenario.

Theorem 2. [5, Th. 6, Th. 5] Let $\sigma := \mathrm{spark}(A) \le \mathrm{rank}(A) =: r$ and let x be a solution of (1). Assume the locations of the nonzero entries in x are chosen at random with equal and independent probability. If $\frac{1}{2}\sigma \le \|x\|_0 =: k \le r$, then the probability that x is the sparsest solution of (1) is $1 - \mathrm{sig}_A(k)$, and the probability to find a solution of (1) of the same cardinality k is
(a) $\sum_{j=0}^{k-\sigma} \binom{k}{j} (k-j)(n-k+j)\, \mathrm{sig}_A(k-j)$ or lower, if $\|x\|_0 \ge \sigma$;
(b) 0, if $\frac{1}{2}\sigma \le \|x\|_0 < \sigma$.

Hence uniqueness of the sparsest solution with cardinality less than spark(A) can be claimed with probability 1. Some of our previous experiments have shown that for projection matrices A as arising in our application, but based on voxel discretization, rank(A) approaches m, even though A is rank-deficient. In contrast, spark(A) remains small even for increasing m. However, in the case of blob-based discretizations we observed better values of spark(A). An upper bound on the signature was derived via arguments from matroid theory [1], under the assumption that the spark is known. Compare also Fig. 2, left.
4.1 Minimum Support Solutions by Solving Polyhedral Concave Programs
In this section we present an adaptation of a general method due to Mangasarian [11] to solve the NP-hard problem (8) and its counterpart

$$(P_0^+) \qquad \min_{x \in F} \|x\|_0, \qquad (9)$$
which involves the nonnegativity constraints and amounts to finding the nonnegative sparsest solution of (1). The core idea is to replace, for $x \ge 0$, $\|x\|_0$ by the exponential approximation $f_\alpha(x) := \sum_{i=1}^{n} (1 - e^{-\alpha x_i})$, for some $\alpha > 0$. $f_\alpha$ is also smooth and concave. Note that $f_\alpha(x) \le \|x\|_0 = \lim_{\alpha \to \infty} f_\alpha(x)$ holds for all $x \ge 0$. Consider now the problem

$$(P_0^\alpha) \qquad \min_{x \in F} f_\alpha(x), \qquad (10)$$
for some fixed parameter α > 0. This problem solves the minimal-support problem (9) exactly for a finite value of the smoothing parameter α.
Theorem 3. Under the assumption that F ≠ ∅, problem (9) has a vertex of F as a solution, which is also a vertex solution of (10) for some sufficiently large positive but finite value α_0 of α.

The proof of the above statement relies on the fact that (P0^α) has a vertex solution, since the concave objective function f_α is bounded below on the polyhedral convex set F, which contains no straight lines going to infinity in both directions (it includes the constraint x ≥ 0). Note that Theorem 3 guarantees a vertex solution of F despite the nonconcavity of ||·||_0. Moreover, it states that by solving problem (P0^α) for a sufficiently large but finite α we have solved problem (P0+). Similar statements can be made for sparse solutions over polyhedral convex sets, see [11]. We stress, however, that there is no efficient way of computing α as long as P ≠ NP.
4.2 SLA
In this section we turn our attention to a computational algorithm which at every step solves a linear program and terminates at a stationary point, i.e. a point satisfying the minimum principle necessary optimality condition [10]. It is a successive linear approximation method for minimizing a concave function on a polyhedral set due to Mangasarian [11]. The method can also be seen as a stepless Frank-Wolfe algorithm [7] with finite termination, or as a DC algorithm for a particular decomposition of the concave objective function as a difference of convex functions [13]. We now state the algorithm for problem (P0^α), which has a differentiable concave objective function.

Algorithm 1. (Successive Linearization Algorithm - SLA)
(S.0) Choose x^0 ∈ IR^n and set l := 0.
(S.1) Set c^l = α e^{−α x^l} and compute x^{l+1} as a vertex solution of the linear program

min_{x∈F} (c^l)^T x .      (11)
(S.2) If x^l ∈ F and x^{l+1} = x^l are satisfied within the tolerance level: STOP. Otherwise, increase the iteration counter l ← l+1 and continue with (S.1).

Note that the finite termination criterion in step (S.2) is nothing else but the minimum principle necessary optimality condition [10] for this particular choice of objective function with ∇f_α(x^l) > 0. The additional condition x^l ∈ F treats the case of x^0 not belonging to F.

Theorem 4. [10, Th. 4.2] Algorithm 1 is well defined and generates a finite sequence of iterates {x^l} with strictly decreasing objective function values f_α(x^0) > f_α(x^1) > · · · > f_α(x^f), such that the final iterate x^f is a local optimum of (P0^α).
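For illustration, a minimal Python sketch of Algorithm 1, assuming NumPy and SciPy are available and that F = {x ≥ 0 : Ax = b} is given through A and b. We use scipy.optimize.linprog as a stand-in LP solver; a simplex method returning vertex solutions, as required by step (S.1), would be preferable in practice:

```python
import numpy as np
from scipy.optimize import linprog

def sla(A, b, alpha=10.0, tol=1e-4, max_iter=100):
    """Successive Linearization Algorithm for min f_alpha(x) over F = {x >= 0, Ax = b}."""
    n = A.shape[1]
    x = np.zeros(n)                      # starting point x^0 = 0
    for _ in range(max_iter):
        c = alpha * np.exp(-alpha * x)   # gradient of f_alpha at x^l, step (S.1)
        res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
        if not res.success:
            raise RuntimeError("LP solver failed: " + res.message)
        x_new = res.x
        # Step (S.2): stop if x^l is feasible and x^{l+1} = x^l within tolerance.
        feasible = np.allclose(A @ x - b, 0.0, atol=tol) and np.all(x >= -tol)
        if feasible and np.max(np.abs(x_new - x)) < tol:
            break
        x = x_new
    return x
```

Each call to linprog corresponds to one outer iteration of the SLA; the choice of alpha and tol above is purely illustrative.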
5 Some Experiments and Discussion
We demonstrate the feasibility and the typical behavior of the proposed approach on one medium-sized example, in order to facilitate the evaluation of the measure spark and to enable visualization. We consider 1 to 40 particles in a 2D volume V = [−1/2, 1/2] × [−1/2, 1/2]. The grid refinement was chosen as d = 0.0154, resulting in 4356 grid points. At these grid points we centre a Gaussian-type basis function with σ = d. Particle positions were chosen randomly but at grid positions, to avoid discretization errors. Four 50-pixel cameras measure the 2D volume from the angles 45°, 15°, −15°, −45°, according to a fan-beam geometry. The screen and focal length of each camera is 0.5. For the resulting projection matrix, with about 9% nonzero entries, we obtained rank(A) = 172 and a tight upper bound spark(A) ≤ 25, by computing a sparse kernel vector from the reduced row echelon form of A. In order to estimate sig_A, we assumed that spark(A) = 25, although a computed lower bound derived via the mutual coherence [4] is by far not so optimistic. For each number of particles k = 1, . . . , 40, we generated 1000 random distributions, yielding each time an original weighting vector of support length ||x^orig||_0 = k. The pixel intensities in the measurement vector b are computed according to (4), integrating the particle image exactly along each line of sight. Two randomly selected 30 and 40 particle distributions are depicted in Fig. 3. We compared the results of SLA, see Tab. 1 and Fig. 3, to the classical algebraic methods. Besides ART and MART, both without relaxation, we also considered a modification of ART, which we call ART+, based on the constraining strategies proposed in [9]. This method amounts to projecting the current iterate onto the positive orthant after each complete sweep of ART through all equations. As a preprocessing step we reduce the system Ax = b according to the methodology proposed in Section 3. The reduced dimensionalities are summarized in Tab. 1. As starting point for ART(+) and SLA we chose the all-zero vector, whereas x^0 := e^{−1} 1 was used for MART. We terminate the iteration of the main algorithm if the appropriate termination criterion is satisfied within the tolerance level 10^{−4} or if the maximum number of outer iterations is reached, i.e. 1000 complete sweeps through the rows of the reduced matrix A in the case of (M)ART(+), or 100 linear programs solved in (S.1) of SLA. The linear programs were solved using the primal simplex method (since we are interested in vertex solutions) that comes with MOSEK Version 5.0.0.45. Whenever SLA returned a stationary point x^f_0 different from our original solution, we applied a heuristic method to further reduce the support. Starting with x^f_i, we randomly chose T_i ⊂ S_i = supp(x^f_i) with #T_i = 10, deleted the corresponding columns in A, and restarted SLA with this reduced matrix. Repeating this while increasing i ← i + 1 often helped to recover the original solution. The appropriate line in Tab. 1 reports the 74 restarts of SLA along with the average number (bold) of linear programs solved.
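The support-reduction restart described above might be sketched as follows (Python; the routine sla from the previous listing is assumed, the subset size 10 follows the text, and all remaining choices are our assumptions):

```python
import numpy as np

def sla_with_restarts(A, b, sla, subset_size=10, max_restarts=100, tol=1e-4, rng=None):
    """Heuristic support reduction: repeatedly drop a random subset of the current
    support (by deleting the corresponding columns of A) and rerun SLA.
    A real implementation would check feasibility of the reduced system and
    revert the column deletion if the LP becomes infeasible."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[1]
    active = np.arange(n)                # indices of columns still kept
    x = sla(A, b)
    for _ in range(max_restarts):
        support = active[np.abs(x) > tol]
        if support.size <= subset_size:
            break
        drop = rng.choice(support, size=subset_size, replace=False)
        active = np.setdiff1d(active, drop)
        x = sla(A[:, active], b)
    # Embed the reduced solution back into the full-length vector.
    x_full = np.zeros(n)
    x_full[active] = x
    return x_full
```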
Table 1. SLA vs. (M)ART(+) for the examples in Fig. 3

||x^orig||_0 | m_r × n_r  | Method | #Outer Iter. | ||x^orig − x^f||_2 | ||Ax^f − b||_∞
30           | 145 × 3239 | SLA    | 2            | 7.21 · 10^{−11}    | 2.20 · 10^{−10}
             |            | MART   | 145          | 3.88               | 1.73 · 10^{−2}
             |            | ART    | 145          | 5.18               | 1.12 · 10^{−3}
             |            | ART+   | 145          | 46.7               | 1.52 · 10^{−2}
40           | 142 × 3352 | SLA    | 24(74)       | 6.31 · 10^{−12}    | 5.32 · 10^{−12}
             |            | MART   | 142          | 5.70               | 2.45 · 10^{−3}
             |            | ART    | 142          | 6.02               | 3.12 · 10^{−3}
             |            | ART+   | 142          | 68.6               | 5.98 · 10^{−3}
More significant are the reconstructed particle images depicted in Fig. 3. The "smearing" of the particles along the projection rays is typical for minimum-energy reconstructions. This phenomenon is preserved by ART. The MART and ART+ reconstructions show more distinct particles; however, additional spots are visible. SLA was able to reconstruct the original image, at considerably increased cost for the 40 particles considered. An interesting behavior of SLA is worth mentioning. Very often (83% of the 1000 test runs) it takes only two outer iterations until convergence to the original solution. Since the last iteration is just a convergence test, this amounts to solving a single linear program with an equally weighted linear objective (in view of our starting point x^0 = 0) to obtain x^orig. We quantify this phenomenon by counting, as a function of increasing particle density, the test runs in which SLA finds the original distribution by solving one linear program only. The results in Fig. 2, right, complement the received opinion [3,4,5] that sufficiently sparse solutions to underdetermined systems can be found by ℓ1-norm minimization.
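The single LP that suffices in these runs is nothing but nonnegative ℓ1-type minimization: with x^0 = 0 the SLA weights c^0 = α·1 are constant, so the first subproblem reduces to min 1^T x subject to Ax = b, x ≥ 0. A minimal sketch (Python/SciPy, function name ours):

```python
import numpy as np
from scipy.optimize import linprog

def nonnegative_l1(A, b):
    """First SLA step from x^0 = 0: min 1^T x  s.t.  Ax = b, x >= 0."""
    c = np.ones(A.shape[1])
    res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
    return res.x if res.success else None
```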
Fig. 2. Left: Lower bound, derived via [5, Th. 7], on the probability 1 − sig_A(k) that the underlying k-particle distribution is the sparsest of all those satisfying the measurements, plotted over k. Right: The probability (to be read with caution) that SLA finds the original particle distribution by performing only one LP, plotted against an increasing number of particles.
Fig. 3. Top: (a)-(f) 30 particles and their reconstruction. Bottom: (g)-(l) 40 particles and their reconstruction. (a),(g) The original particle distributions containing 30 and 40 particles, respectively. (b),(h) Reconstruction using MART. (c),(i) Reconstruction using SLA. (d),(j) Minimal-norm solution of (P_LS) obtained via the Moore-Penrose pseudoinverse A^+, without reducing A. (e),(k) Reconstruction using ART. (f),(l) Reconstruction using ART+.
6 Conclusion and Further Work
A classical algebraic reconstruction approach that is currently in use, together with closely related variants, was recently re-considered by the authors in some detail in order to reveal pros and cons from the perspective of TomoPIV. While pursuing these goals, a regularization alternative was developed, which amounts to finding nonnegative sparse solutions of underdetermined systems. Promising research directions for accomplishing this task more efficiently in terms of computational time were outlined. Model extensions still have to be developed for the problem in the presence of noise and discretization errors.
References
1. Björner, A.: Some matroid inequalities. Discrete Math. 31, 101–103 (1980)
2. Censor, Y.: Parallel Optimization: Theory, Algorithms and Applications. Oxford University Press, New York (1997)
3. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20, 33–61 (1998)
4. Donoho, D.L., Elad, M.: Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proc. Natl. Acad. Sci. U.S.A. 100, 2197–2202 (2003)
5. Elad, M.: Sparse representations are most likely to be the sparsest possible. EURASIP JASP, pp. 1–12 (2006)
6. Elsinga, G., Scarano, F., Wieneke, B., van Oudheusden, B.: Tomographic particle image velocimetry. Exp. Fluids 41, 933–947 (2006)
7. Frank, M., Wolfe, P.: An algorithm for quadratic programming. Nav. Res. Log. Quart. 3, 95–110 (1956)
8. Gordon, R., Bender, R., Herman, G.T.: Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and X-ray photography. J. Theor. Biol. 29, 471–481 (1970)
9. Koltracht, I., Lancaster, P.: Constraining strategies for linear iterative processes. IMA J. Num. Anal. 10, 55–567 (1990)
10. Mangasarian, O.L.: Machine learning via polyhedral concave minimization. In: Klaus Ritter, H., Fischer, B., Riedmüller, S. (eds.) Applied Mathematics and Parallel Computing – Festschrift, pp. 175–188. Physica-Verlag, Heidelberg (1996)
11. Mangasarian, O.L.: Minimum-support solutions of polyhedral concave programs. Optimization 45, 149–162 (1999)
12. Natarajan, B.K.: Sparse approximate solutions to linear systems. SIAM J. Comput. 24, 227–234 (1995)
13. Tao, P.D., An, L.T.H.: A D.C. optimization algorithm for solving the trust-region subproblem. SIAM J. Optim. 8, 476–505 (1998)
14. Raffel, M., Willert, C., Kompenhans, J.: Particle Image Velocimetry, 2nd edn. Springer, Heidelberg (2001)
15. Scarano, F., Riethmüller, M.L.: Advances in iterative multigrid PIV image processing. Exp. Fluids 29, 51–60 (2000)
Resolution Enhancement of PMD Range Maps

A.N. Rajagopalan1, Arnav Bhavsar1, Frank Wallhoff2, and Gerhard Rigoll2

1 Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai 600 036, India
2 Lehrstuhl für Mensch-Maschine-Kommunikation, Technische Universität München, 80333 München, Germany
{raju,ee04s036}@ee.iitm.ac.in, [email protected], [email protected]
Abstract. Photonic mixer device (PMD) range cameras are becoming popular as an alternative to algorithmic 3D reconstruction, but their main drawbacks are low resolution (LR) and noise. Recently, some interesting works have focused on resolution enhancement of PMD range data. These works use high-resolution (HR) CCD images or stereo pairs, but such a system requires a complex setup and camera calibration. In contrast, we propose a super-resolution method that exploits induced camera motion to create an HR range image from multiple LR range images. We follow a Bayesian framework by modeling the original HR range as a Markov random field (MRF). To handle discontinuities, we propose the use of an edge-adaptive MRF prior. Since such a prior renders the energy function non-convex, we minimize it by graduated non-convexity.
1 Introduction
In contemporary computer vision and multimedia, numerous applications harness the information in the 3D shape of a scene. Apart from the conventional shape-from-X techniques, in recent years direct acquisition of 3D shape from range sensors has gained importance. Laser range scanners produce high-quality range maps, but their use is currently limited by high cost and long acquisition times [1]. Photonic mixer device (PMD) scanners are less accurate and have a depth resolution of approximately 6 mm. However, they are attractive due to their cost, speed and simplicity. They are being used in many applications in pattern recognition, computer vision and multimedia [2,3,4]. PMD range scanners work on the time-of-flight principle. The complete scene is illuminated using modulated infra-red waves and the reflected light is received by the PMD sensor. The output signal of the sensor is related to the phase delay between the reflected optical signal and the modulation signals. The electro-optic mixing process occurs at every pixel and hence the phase delay measurement is carried out at every pixel in the PMD pixel array [4,5,6]. Applications involving 3D visualization, image-based rendering, augmented reality etc. require an overall scene representation that differentiates major objects. Similarly, an approximate range map is usually sufficient in applications related to scene segmentation, object recognition, video matting, surveillance and robot navigation [7].
A major drawback of PMD scanners is their low spatial range resolution of 64 × 48 or at most 160 × 120 pixels [3,6]. The resultant poor localization in 3D space inherently restricts the eventual utility of the PMD scanner. Recently, many interesting works have addressed the issue of enhancing the spatial resolution of PMD range data so that its scope can be extended to a multitude of applications. Prasad et al. [8] interpolate the range map and register it with an HR CCD image. However, this smooths the range data and causes loss of high frequencies, especially at object discontinuities. An MRF-based energy minimization framework that uses a range map and an HR CCD image claims better performance at discontinuities [9]. Based on the assumption that discontinuities in range and image tend to coincide, this approach weights the smoothness terms in the energy function by a gradient measure of the HR intensity image. A fusion of LR PMD range images with a stereo algorithm to produce HR depth maps is proposed in [3], where the data term of the energy function comes from the HR stereo pairs as well as the LR range. The above-mentioned works typically use a registered CCD HR image or a CCD HR stereo pair. However, this necessitates an elaborate setup and an involved calibration procedure for both the CCD and the PMD cameras. In this work, we propose to principally exploit translational camera motion (relative to the object) for super-resolution of PMD range data. Importantly, our approach uses only the PMD scanner. Camera motion results in multiple LR images with relative sub-pixel shifts, thus effectively yielding a higher sampling rate. These LR range images can be modeled as having been formed by down-sampling sub-pixel shifted versions of the HR range image that is to be estimated. The use of multiple images also enhances the ability to reduce noise. Our technique requires only image-to-image registration, which is much simpler than the complicated calibration involving CCD cameras followed in other works. Shifted range images of a 3D scene will ideally tend to produce motion parallax. However, since we capture videos by simple camera translation, the assumption of global shifts is valid for consecutive or near-consecutive LR frames. These global shifts can be determined by any good sub-pixel registration technique [10]. The problem of estimating high-resolution range data from low-resolution range observations is basically ill-posed, and prior constraints must be imposed to enforce regularization. We model the original HR range image by a discontinuity-adaptive MRF (DAMRF) prior. This model not only regularizes the solution but also enables depth discontinuities to be preserved effectively. The solution that we seek is the maximum a posteriori (MAP) estimate of the HR range image given the LR range images. The DAMRF prior renders the resultant cost function non-convex. To avoid local minima problems, we use graduated non-convexity (GNC) to arrive at the MAP estimate of the super-resolved range image. Section 2 explains the relationship between HR and LR range data. In Section 3, we propose a MAP-MRF framework to solve for the HR range data. Section 4 discusses optimization using the GNC algorithm tailored to our problem. This is followed by results and conclusions in Sections 5 and 6, respectively.
2 LR-HR Relationship of Range Data
We move the camera relative to the object and capture several frames of low-resolution range data. Suppose we have N relatively shifted low-resolution observations [y_1, y_2, ..., y_N] of size N_1 × N_2 from the PMD range scanner. These LR observations can be modeled as having been generated from a high-resolution range image x of size L_1 × L_2 by warping followed by down-sampling (by a factor of L_1/N_1 × L_2/N_2). Down-sampling is caused by averaging of L_1/N_1 × L_2/N_2 pixels. In Fig. 1, we show an example of down-sampling by 2 along both spatial dimensions. Each pixel of LR1 (the first range image) is formed by averaging four pixels
Fig. 1. The formation of LR observations from the HR range image
in the reference HR range image. Similarly, each pixel in LR2 to LR4 (second to fourth range image) is formed by averaging four pixels in the HR image when the reference HR image is shifted by one pixel in each direction. Thus, the pixels in the LR images carry unique information about different regions in the HR image, effectively resulting in a higher sampling rate to enable super-resolution [11]. The use of multiple range images also facilitates averaging of noise effects. The above process can be expressed mathematically as

y_i = D W_i x + η_i      (1)

Here, y_i is the lexicographically arranged i-th LR observation and D and W_i are down-sampling and warping matrices, respectively, that produce y_i from the HR range image x. The term η_i represents noise. Equation (1) can be expressed in scalar form as

y_i(n_1, n_2) = Σ_{l_1,l_2=1}^{L_1,L_2} d(n_1, n_2, l_1, l_2) · x(θ_1^i, θ_2^i) + η_i(n_1, n_2)      (2)
where d(n_1, n_2, l_1, l_2) is the element of the matrix D that maps the (l_1, l_2)-th pixel of the HR image x(θ_1^i, θ_2^i) to the (n_1, n_2)-th pixel of the i-th LR image. The
transformations θ_1^i and θ_2^i are the warping transformations that are encoded in the matrix W_i. For a translating camera, equation (2) simplifies to

y_i(n_1, n_2) = Σ_{l_1,l_2=1}^{L_1,L_2} d(n_1, n_2, l_1, l_2) · x(l_1 − δ_1^i, l_2 − δ_2^i) + η_i(n_1, n_2)      (3)
where δ1i and δ2i are the shifts in the x and y directions, respectively. From the above model, we observe that we require the Wi matrices that denote warping at high resolution. Since we consider only translational motion, we can compute the HR shifts by simply multiplying the LR shifts by the resolution factor. We compute the LR shifts using the well-known sub-pixel motion estimation algorithm proposed in [10].
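For illustration, the following sketch generates LR observations from an HR range image by integer shifting and block averaging, mimicking equation (3) for a translating camera (Python/NumPy; the function name, the integer-shift simplification and the wrap-around behavior of np.roll are our assumptions):

```python
import numpy as np

def generate_lr_observations(x_hr, factor, shifts, noise_sigma=0.0, rng=None):
    """Simulate LR range maps y_i = D W_i x + eta_i for integer pixel shifts.

    x_hr   : (L1, L2) high-resolution range image
    factor : down-sampling factor (L1/N1 = L2/N2)
    shifts : list of (d1, d2) integer shifts at HR resolution
    """
    rng = np.random.default_rng() if rng is None else rng
    L1, L2 = x_hr.shape
    N1, N2 = L1 // factor, L2 // factor
    observations = []
    for d1, d2 in shifts:
        shifted = np.roll(x_hr, shift=(-d1, -d2), axis=(0, 1))   # warping W_i (wraps at borders)
        # Down-sampling D: average factor x factor blocks of pixels.
        blocks = shifted[:N1 * factor, :N2 * factor]
        lr = blocks.reshape(N1, factor, N2, factor).mean(axis=(1, 3))
        lr += noise_sigma * rng.standard_normal(lr.shape)        # noise eta_i
        observations.append(lr)
    return observations
```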
3 Regularization Using MRF
Having discussed the formation of shifted LR images and the estimation of their spatial shifts, we now address the problem of deriving the super-resolved (SR) range data x given the observations y_1, y_2, ..., y_N. We propose to solve for the maximum a posteriori (MAP) estimate of x within a Bayesian framework. Let Y_1, Y_2, ..., Y_N be the random fields associated with the observations y_1, y_2, ..., y_N and let X be the random field associated with the SR range map. We wish to estimate x̂ such that

x̂ = arg max_x P(X = x | Y_1 = y_1, ..., Y_N = y_N)      (4)

Using Bayes' rule, the above equation can be written as

x̂ = arg max_x P(Y_1 = y_1, ..., Y_N = y_N | X = x) · P(X = x)      (5)
Solving for x is clearly an ill-posed problem due to the down-sampling and warping operators, and due to the presence of noise [12,13]. We need to incorporate constraints on the solution through a suitably chosen prior. The first term in the product on the right-hand side of equation (5) is the likelihood term that arises from the image formation model. From equation (1), assuming a pinhole camera model and considering the noise to be additive white Gaussian with variance σ², we have

P(Y_1 = y_1, ..., Y_N = y_N | X = x) = 1/(2πσ²)^{N_1 N_2} · exp( − Σ_{i=1}^{N} ||y_i − D W_i x||² / (2σ²) )      (6)

We model the prior probability P(X = x) for the SR range image by a Markov random field. MRF modeling provides a natural way to embed constraints on the solution. The Markovianity property implies that the label at a pixel depends only on its neighborhood, i.e., only neighboring labels have interactions with one another. This property is quite natural in the sense that the range value at a
particular pixel does not depend on the range values of pixels that are located far away from it. Due to the MRF–Gibbs equivalence [14], the prior probability of X can be expressed in analytical form as

P(X = x) = K exp( − Σ_{c∈C_x} V_c^x(x) )      (7)

Here, V_c^x(x) is the potential function and c is called a clique, which is a subset of the MRF neighborhood. The potential function captures the manner in which neighboring pixels interact. For details on MRFs, refer to [15]. From equations (6) and (7), we can rewrite equation (5) as

x̂ = arg min_x [ Σ_{i=1}^{N} ||y_i − D W_i x||² / (2σ²) + Σ_{c∈C_x} V_c^x(x) ]      (8)
The MAP-MRF framework results in an energy minimization formulation to estimate the SR range data x̂, where the cost function is the bracketed term in equation (8). The first term in the cost is the data term. It measures how closely the transformed (warped and down-sampled) x compares with the observations. The form of the second term is crucial for a good solution. It is usually known as the smoothness term in image super-resolution works [16]. This is because the potential function usually has the form V_c^x(x) = (x(i, j) − x(p, q))², where the pixels (p, q) belong to the neighborhood of (i, j). But this form of the potential function tends to select solutions that are smooth and results in loss of discontinuities and high frequencies, which one would typically wish to preserve in the HR image. We propose to use a discontinuity-adaptive MRF (DAMRF) prior model for x in which the degree of interaction between pixels can be adjusted adaptively in order to preserve discontinuities. Li [15] suggests some models for DAMRF clique potentials. In this work, we use the potential function

V_c^x(x) = γ − γ e^{−(x(i,j)−x(p,q))² / γ}      (9)

which is shown in Fig. 2(a). It is convex in the band B_γ = (−sqrt(γ/2), sqrt(γ/2)) and the value of γ controls the shape of the function. Thus, choosing a large value of γ makes the function convex. Beyond B_γ, the cost of the prior tends to saturate as the difference between the pixel values increases. Hence, unlike the quadratic prior, the cost for a sudden change is not excessively high, which allows discontinuities to be preserved in the solution. For the first-order MRF that we make use of in this work, the exact expression for the prior term in equation (7) is given by

Σ_{c∈C} V_c^x(x) = Σ_{i=1}^{L_1} Σ_{j=1}^{L_2} [ 4γ − γ exp{−[x(i, j) − x(i, j − 1)]²/γ} − γ exp{−[x(i, j) − x(i, j + 1)]²/γ} − γ exp{−[x(i, j) − x(i − 1, j)]²/γ} − γ exp{−[x(i, j) − x(i + 1, j)]²/γ} ]      (10)
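A direct transcription of the prior energy (10) might look as follows (Python/NumPy; the edge-replicating boundary handling is our assumption and conveniently makes missing-neighbor terms vanish, since γ − γ·exp(0) = 0):

```python
import numpy as np

def damrf_prior_energy(x, gamma):
    """Sum of DAMRF clique potentials of equation (10) over a first-order neighborhood."""
    # Pad by edge replication so that every pixel has four neighbors.
    xp = np.pad(x, 1, mode="edge")
    c = xp[1:-1, 1:-1]
    neighbors = [xp[1:-1, :-2], xp[1:-1, 2:], xp[:-2, 1:-1], xp[2:, 1:-1]]
    energy = 0.0
    for nb in neighbors:
        d2 = (c - nb) ** 2
        energy += np.sum(gamma - gamma * np.exp(-d2 / gamma))
    return energy
```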
Fig. 2. (a) DAMRF clique potential function. (b) Graphical representation of GNC.
4 Energy Minimization: Graduated Non-convexity
Due to the non-convex nature of the DAMRF prior, the overall energy function in equation (8) becomes non-convex. Hence, minimization to obtain the MAP estimate of x becomes non-trivial in the sense that traditional local gradient-based techniques cannot be used, as they can get trapped in local minima. We minimize the energy using a deterministic annealing technique known as graduated non-convexity (GNC) [15,17]. The idea behind GNC is graphically illustrated in Fig. 2(b). It initially starts with a convex cost function by choosing a large value of γ and finds the minimum using simple gradient descent. This value is then used as the initial estimate for the next iteration, but now with a smaller γ. This step is iterated, reducing γ in each iteration. As shown in the figure, the lower curve symbolizes the energy function at an earlier iteration and the successive upper curves denote the energy functions at successive iterations. Note that the non-convexity of the function increases as the iterations progress. The vertical arrows denote that the located minimum of the previous iteration is taken as the initial state for the next iteration. The slant arrows depict convergence to the nearest minimum in each iteration. The proposed algorithm for range super-resolution is summarized below. We note that a basic requirement for GNC is the computation of the gradient of the cost. From equations (8) and (10), the gradient at the k-th iteration is

grad^(k) = (1/σ²) Σ_{i=1}^{N} W_i^T D^T (D W_i x − y_i) + λ G^(k)      (11)

where λ is a smoothness parameter and the gradient G^(k) at (p, q) is given by

G^(k)(p, q) = 2[x(p, q) − x(p, q − 1)] exp{−[x(p, q) − x(p, q − 1)]²/γ}
            + 2[x(p, q) − x(p, q + 1)] exp{−[x(p, q) − x(p, q + 1)]²/γ}
            + 2[x(p, q) − x(p − 1, q)] exp{−[x(p, q) − x(p − 1, q)]²/γ}
            + 2[x(p, q) − x(p + 1, q)] exp{−[x(p, q) − x(p + 1, q)]²/γ}      (12)
We would like to mention that all matrix computations in equation (11) can be incorporated easily through local image operations involving very few pixels. An important point in favor of the DAMRF prior from a computational perspective is that it is a continuous and differentiable function, unlike other discontinuity-handling priors such as line fields [14]. Hence, the computationally efficient GNC algorithm can be used as opposed to the much slower simulated annealing.

Algorithm: Range super-resolution using GNC.
Require: Observations {y_i} and motion parameters.
1. Calculate x^(0) as the average of the bi-linearly up-sampled and aligned images y_i.
2. Choose a convex γ^(0) = 2v, where v is the maximum value of the gradient along the x and y directions of the initial estimate x^(0).
3. Set n = 0.
4. Repeat
   a. Update x^(n) as x^(n+1) = x^(n) − α grad^(n)
   b. Set n = n + 1
   c. If norm(x^(n) − x^(n−1)) < ε, set γ^(n) = max[γ_target, k γ^(n−1)]
   until norm(x^(n) − x^(n−1)) < ε and γ^(n) = γ_target
5. Set x̂ = x^(n).
Here, α is the step size, ε is a constant for testing convergence, and k is a factor that takes γ^(n) slowly towards γ_target.
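A compact sketch of the GNC iteration (Python/NumPy; gradient_of_cost stands for an implementation of equations (11)–(12) and is assumed to be available, as are suitable values for the step size, the annealing factor and the tolerances):

```python
import numpy as np

def gnc_super_resolution(x0, gradient_of_cost, alpha=6.0, k=0.9,
                         gamma_target=1.0, eps=1e-3, max_sweeps=10000):
    """Graduated non-convexity: gradient descent while slowly decreasing gamma."""
    x = x0.copy()
    # Start with a convex cost: gamma = 2 * max gradient magnitude of x0.
    gy, gx = np.gradient(x0)
    gamma = 2.0 * max(np.abs(gx).max(), np.abs(gy).max())
    for _ in range(max_sweeps):
        x_new = x - alpha * gradient_of_cost(x, gamma)
        converged = np.linalg.norm(x_new - x) < eps
        x = x_new
        if converged:
            if np.isclose(gamma, gamma_target):
                break                                   # fully annealed and converged
            gamma = max(gamma_target, k * gamma)        # make the cost less convex
    return x
```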
5 Experimental Results
In this section, we present results obtained using the proposed algorithm. The range data was captured with the PMD 3k-S range scanner from PMD Technologies. It operates at a wavelength of 870 nm with a modulating RF of 20 MHz. It captures range data at a frame-rate of 25 fps. The detector dimensions are 6.4 mm × 4.8 mm while the resolution of the range image is 64 × 48 pixels. We captured range videos of objects by translating the object in front of the PMD scanner (which was held static). We ensured that the motion of the object deviated only negligibly from pure translation. We compute only frame-to-frame registration thus precluding the need for any controlled motion or calibration procedures. Also, we select successive or near-successive frames as observations for super-resolution to avoid any parallax effects. The captured video was converted into a readable ASCII format by the software CamVis Pro provided by PMD Technologies [6]. We attempted to super-resolve the range image by a factor of 4. Due to space constraints, we give only representative results here. In the first experiment, we took range images of a wooden hand-model. It was translated fronto-parallel to the camera to ensure that the motion was translational. In fact, we translated only along the horizontal direction. (In general, the motion can be 2D. In fact, we require 2D motion for true 2D super-resolution).
Several frames of range data were captured, and one of the low-resolution depth maps is shown in Fig. 3(a). In order to perform super-resolution by a factor of 4, we selected a total of 8 frames from the range data set which were near-consecutive. Ideally, one would need 16 observations to upsample by a factor of 4, but since the motion is 1D in our example, we considered only 8 frames. These frames were then fed to the algorithm in [10] and the motion parameters for each frame were estimated. (As expected, the motion module gave negligible displacement along the vertical direction for this data set.) The estimated motion parameters were then fed to the proposed GNC-based super-resolution algorithm. The initial value of γ was chosen as 1000, while γ_target was taken as 1 with k = 0.9. The initial estimate of the super-resolved image was obtained by averaging the bilinearly up-sampled and aligned low-resolution observations. The values of λ and α were chosen as 0.01 and 6, respectively. The GNC algorithm was run with the above parameters and the corresponding output (super-resolved by a factor of 4) is given in Fig. 3(b). When compared with Fig. 3(a), in which the spatial localization of the PMD range map is not good, we note that our output is significantly better. The contours of the hand are clearly discernible after performing super-resolution. In particular, note how well the thumb has been reconstructed in Fig. 3(b). We also tried bicubic interpolation of the low-resolution depth map, but it suffers from the effect of range data bleeding into the background due to smoothing. This is to be expected since the low-resolution range map of PMD itself is not well-localized in space.
Fig. 3. Wooden hand-model. (a) Low-resolution range data. (b) Super-resolved range map using the proposed method.
In the next experiment, we took images of the Alpha Rex robot. The robot was translated horizontally and the range data was recorded over many frames. We again chose 8 frames from this data set and attempted super-resolution by a factor of 4 using the proposed method. One of the low-resolution frames is shown in Fig. 4(a). Even though one can make out the shape of the robot, the spatial extent of the robot is difficult to gauge due to the poor spatial resolution of the range
Fig. 4. Alpha Rex robot. (a) Low-resolution range map. (b) Super-resolved range output of our method.
map from PMD. The bleeding effect is very much evident in the figure. (Please see the PDF file, which has a color-coded version of the plot.) In the output of the proposed method shown in Fig. 4(b), the edges of the robot are well preserved and the localization is also very good due to the improved spatial resolution. The noise level is comparatively much lower. The arms and elbow of the robot come out clearly, especially the right palm region. Note the wedge-shaped thighs and the holes in the legs, which are difficult to infer from Fig. 4(a). The head emerges in its true shape. Also, the proposed algorithm correctly brings out the outline of the (slender) neck of the robot, which appears smeared in Fig. 4(a). Our method takes a few minutes to run on a 1 GHz Athlon PC with 64 MB RAM (for non-optimized Matlab code). There is enough scope to increase the speed multi-fold by resorting to an efficient implementation in C on a faster machine.
6 Conclusions
In this paper, we proposed a new method for super-resolution of range data captured from a low-resolution PMD camera that (unlike previous works) avoids the need for cumbersome camera calibration. An edge-preserving MRF prior was used to adaptively retain discontinuities. We used GNC for non-convex optimization and showed results on real data to demonstrate the effectiveness of our technique. We are currently investigating fusion of range and intensity data. We are also working on extending our method to more general motion of the range sensor.
Acknowledgments

The first author gratefully acknowledges support from the Humboldt Foundation, Germany. The work was supported in part within the DFG excellence
initiative research cluster Cognition for Technical Systems - CoTeSys (visit website www.cotesys.org).
References
1. Kil, Y., Mederos, B., Amenta, N.: Laser scanner super-resolution. In: Eurographics Symposium on Point-Based Graphics, pp. 9–16 (2006)
2. Ghobadi, S., Hartmann, K., Weihs, W., Netramai, C.: Detection and classification of moving objects - stereo or time-of-flight images. In: International Conference on Computational Intelligence and Security, pp. 11–16 (2006)
3. Hahne, U., Alexa, M.: Combining time-of-flight depth and stereo images without accurate extrinsic calibration. In: International Workshop on Dynamic 3D Imaging, pp. 1–8 (2007)
4. Beder, C., Bartczak, B., Koch, R.: A comparison of PMD-cameras and stereovision for the task of surface reconstruction using patchlets. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
5. Reulke, R.: Combination of distance data with high resolution images. In: ISPRS Commission V Symposium Image Engineering and Vision Metrology, pp. 1–6 (2006)
6. http://www.pmdtec.com
7. Wallhoff, F., Ruß, M., Rigoll, G., Gobel, J., Diehl, H.: Improved image segmentation using photonic mixer devices. In: Proceedings IEEE Intl. Conf. on Image Processing, vol. VI, pp. 53–56 (2007)
8. Prasad, T., Hartmann, K., Weihs, W., Ghobadi, S., Sluiter, A.: First steps in enhancing 3D vision technique using 2D/3D sensors. In: Computer Vision Winter Workshop, pp. 82–86 (2006)
9. Huhle, B., Fleck, S., Schilling, A.: Integrating 3D time-of-flight camera data and high resolution images for 3DTV applications. In: 3DTV-Conference, pp. 1–4 (2008)
10. Irani, M., Peleg, S.: Improving resolution by image registration. Graphical Models and Image Processing 53(3), 231–239 (1991)
11. Park, S.C., Park, M.K., Kang, M.G.: Super-resolution image reconstruction: A technical overview. IEEE Signal Processing Magazine 16(3), 21–36 (2003)
12. Farsiu, S., Robinson, D., Elad, M., Milanfar, P.: Fast and robust super-resolution. In: IEEE International Conference on Image Processing, pp. 14–17 (2003)
13. Rav-Acha, A., Zomet, A., Peleg, S.: Robust super resolution. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 645–650 (2001)
14. Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6), 721–741 (1984)
15. Li, S.Z.: Markov Random Field Modeling in Computer Vision. Springer, Tokyo (1995)
16. Suresh, K., Rajagopalan, A.N.: Robust and computationally efficient superresolution algorithm. Journal of the Optical Society of America A 24(4), 984–992 (2007)
17. Blake, A., Zisserman, A.: Visual Reconstruction. The MIT Press, Cambridge (1987)
A Variational Model for the Joint Recovery of the Fundamental Matrix and the Optical Flow

Levi Valgaerts, Andrés Bruhn, and Joachim Weickert

Mathematical Image Analysis Group, Faculty of Mathematics and Computer Science, Building E1.1, Saarland University, 66041 Saarbrücken, Germany
{valgaerts,bruhn,weickert}@mia.uni-saarland.de
Abstract. Traditional estimation methods for the fundamental matrix rely on a sparse set of point correspondences that have been established by matching salient image features between two images. Recovering the fundamental matrix from dense correspondences has not been extensively researched until now. In this paper we propose a new variational model that recovers the fundamental matrix from a pair of uncalibrated stereo images, and simultaneously estimates an optical flow field that is consistent with the corresponding epipolar geometry. The model extends the highly accurate optical flow technique of Brox et al. (2004) by taking the epipolar constraint into account. In experiments we demonstrate that our approach is able to produce excellent estimates for the fundamental matrix and that the optical flow computation is on par with the best techniques to date.
1 Introduction

The fundamental matrix is the basic representation of the geometric relation that underlies two views of the same scene. This relation is expressed by the so-called epipolar constraint [5,11], which tells us that corresponding points in the two views are restricted to lie on specific lines rather than anywhere in the image plane. A reliable estimation of the fundamental matrix from the epipolar constraint is essential for many computer vision tasks such as the 3D reconstruction of a scene, structure-from-motion and camera calibration. Apart from a limited number of approaches that recover the fundamental matrix directly from image information [18], most methods are based on the prior determination of point correspondences. Of the latter type, feature-based methods have proven to be very successful and are by far the most frequently used. These methods try to match characteristic image features in the two views and compute the fundamental matrix by imposing the epipolar constraint on this sparse set of correspondences. Theoretically, eight perfect point matches are sufficient to compute the fundamental matrix in a linear way [10]. In practice, however, the establishment of feature correspondences is error-prone because the local nature of most feature-matching algorithms results in localization errors and false matches. This has led to the development of a multitude of robust extensions that can deal with a relatively large amount of outliers. M-estimators [9], Least Median of Squares [16] and the Random Sample Consensus (RANSAC) [6] number among such robust techniques.
Recent advances in optical flow computation have proven that variational methods are a viable alternative to feature-based methods when it comes down to the accuracy of the correspondences established between two images. In [12] the authors advocate the use of variational optical flow methods as a basis for the estimation of the fundamental matrix. The proposed approach offers at least two advantages over feature-based approaches: (i) Dense optical flow provides a very large number of correspondences and (ii) the amount of outliers is small due to the combination of a robust data term and a global smoothness constraint. Because of this inherent robustness no involved statistics was used in the estimation process and favorable results have been produced by using a simple least squares fit. In this paper we propose a novel variational approach that allows for a simultaneous estimation of both the optical flow and the fundamental matrix. This is achieved by minimizing a joint energy functional. Our method extends the method of Brox et al. [3] by including the epipolar constraint as a soft constraint. In this context it differs from the two-step method proposed in [12] that concentrates solely on the estimation of the fundamental matrix from the optical flow. This has the disadvantage that the found correspondences are not corrected by the recovered epipolar geometry. Moreover, displacement fields that yield a good fundamental matrix are not necessarily good by optical flow standards. These observations clearly motivate a joint solution of both unknowns. Our strategy also differs from the work presented in [19], that focuses on the calculation of the disparity while imposing the epipolar constraint as a hard constraint. This work gave excellent results for the ortho-parallel camera setup but required the epipolar geometry to be known in advance. Our method is related to other recent attempts to pair the epipolar constraint with other constraints such as the brightness constancy assumption in one joint formulation [18,17]. However, these techniques are restricted to the estimation of non-dense correspondences. Close in spirit are also feature-based methods that minimize some type of reprojection error in which both the fundamental matrix and a new set of correspondences are estimated from an initial set of feature matches [5,7]. Our paper is organized as follows. In Section 2 we shortly revise the estimation of the fundamental matrix from a set of correspondences. In Section 3 we introduce our variational model before discussing the minimization of the energy and the solution of the resulting equations. A performance evaluation is presented in Section 4, followed by conclusions and a summary in Section 5.
2 From Epipolar Constraint to Fundamental Matrix

The epipolar constraint between a given point x̃ = (x, y, 1)^T in the left image and its corresponding point x̃' = (x', y', 1)^T in the right image can be rewritten as the product of two 9 × 1 vectors [5]:

0 = x̃'^T F x̃ = s^T f ,      (1)

where s = (xx', yx', x', xy', yy', y', x, y, 1)^T and f = (f_11, f_12, f_13, f_21, f_22, f_23, f_31, f_32, f_33)^T. Here f_ij with 1 ≤ i, j ≤ 3 are the unknown entries of the fundamental matrix F and the tilde superscript indicates that we are using projective coordinates.
To find the entries of F from n > 8 point correspondences we can minimize the energy

E(f) = Σ_{i=1}^{n} (s_i^T f)² = ||S f||² ,      (2)
where S is an n × 9 matrix whose rows are made up of the constraint vectors s_i^T, 1 ≤ i ≤ n. This is equivalent to finding a least squares solution to the overdetermined system S f = 0. Since F is defined up to a scale factor we can avoid the trivial solution f = 0 by imposing the explicit side constraint ||f||² = 1. The solution of the thus obtained total least squares (TLS) problem [10] is known to be the eigenvector that belongs to the smallest eigenvalue of S^T S. The TLS method can be rendered more robust with respect to outliers by replacing the quadratic penalization in the energy (2) by another function of the residual:

E(f) = Σ_{i=1}^{n} Ψ((s_i^T f)²) .      (3)
Here Ψ(s²) is a positive, symmetric and in general convex function in s that grows sub-quadratically, like for instance the regularized L1 norm. Applying the method of Lagrange multipliers to the problem of minimizing the energy (3) under the constraint ||f||² = f^T f = 1 means that we are looking for critical points of

F(f, λ) = Σ_{i=1}^{n} Ψ((s_i^T f)²) + λ(1 − f^T f) .      (4)
Setting the derivatives of F(f, λ) with respect to f and λ to zero finally yields the nonlinear problem

0 = ( Σ_{i=1}^{n} Ψ'((s_i^T f)²) s_i s_i^T − λI ) f = ( S^T W(f) S − λI ) f ,      (5)
0 = 1 − ||f||² .      (6)
In the above formula W is an n × n diagonal matrix with positive weights w_ii = Ψ'((s_i^T f)²). To solve this nonlinear system we propose a lagged iterative scheme in which we fix the symmetric positive definite system matrix S^T W S for a certain estimate f^k. This results in a similar eigenvalue problem as in the case of the TLS fit, and by solving it for f^{k+1} we can successively refine the current estimate. An important preliminary step aimed at improving the condition number of the eigenvalue problem is the normalization of the point data in the two images before the estimation of F. The points x̃ and x̃' are transformed by the respective affine transformations T and T' such that T x̃ and T' x̃' will have on average the projective coordinate (1, 1, 1)^T. The fundamental matrix F̂, estimated from the transformed points, is then used to recover the original matrix as F = T'^T F̂ T. Data normalization was strongly promoted by Hartley in conjunction with simple linear methods [8]. Another issue concerns the rank of F. The estimates obtained by the minimization techniques presented in this section will in general not satisfy the rank-2 constraint. Therefore, it is common to enforce the rank of F after estimation, e.g. by singular value decomposition [5].
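A minimal sketch of this lagged reweighting (Python/NumPy; the constraint matrix S built from the correspondences, the regularized L1 penalizer Ψ(s²) = sqrt(s² + ε²), and the omission of data normalization and rank-2 enforcement are our assumptions):

```python
import numpy as np

def robust_fundamental_matrix(S, n_iter=20, eps=1e-3):
    """Robust TLS fit: solve 0 = (S^T W(f) S - lambda I) f, ||f|| = 1,
    by lagged reweighting; W has entries w_ii = Psi'((s_i^T f)^2)."""
    # Initialization: plain TLS, i.e. eigenvector of S^T S with smallest eigenvalue.
    _, _, Vt = np.linalg.svd(S)
    f = Vt[-1]
    for _ in range(n_iter):
        r2 = (S @ f) ** 2                    # squared residuals (s_i^T f)^2
        w = 0.5 / np.sqrt(r2 + eps ** 2)     # Psi'(s^2) for Psi(s^2) = sqrt(s^2 + eps^2)
        M = S.T @ (w[:, None] * S)           # S^T W S
        eigvals, eigvecs = np.linalg.eigh(M)
        f = eigvecs[:, 0]                    # unit eigenvector of the smallest eigenvalue
    return f.reshape(3, 3)
```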
3 The Variational Model

In this section we adopt the notation that is commonly used in variational optical flow computation. We assume that the two stereo images under consideration are consecutive frames in an image sequence I(x, y, t) : Ω × [0, ∞) → R. We denote by x = (x, y, t)^T a location within the rectangular image domain Ω ⊂ R² at a time t ≥ 0. Our goal is to estimate the fundamental matrix F together with the optical flow w = (u, v, 1)^T between the left frame I(x, y, t) and the right frame I(x, y, t + 1) of an uncalibrated pair of stereo images.

3.1 Integrating the Epipolar Constraint

In order to estimate the optical flow and the fundamental matrix jointly, we propose to extend the 2D variant of the model of Brox et al. [3] with an extra term as follows:

E(w, f) = ∫_Ω Ψ_d( |I(x + w) − I(x)|² + γ · |∇I(x + w) − ∇I(x)|² ) dx dy
        + α ∫_Ω Ψ_s( |∇w|² ) dx dy + β ∫_Ω Ψ_e( (s^T f)² ) dx dy ,      (7)
while imposing the explicit side constraint ||f|| = 1. Here |∇w|² := |∇u|² + |∇v|² denotes the squared magnitude of the spatial flow gradient with ∇ = (∂_x, ∂_y)^T. The first term of E(w, f) is the data term. It models the constancy of the image brightness and of the spatial image gradient along the displacement trajectories. These two constraints combined provide robustness against varying illumination, while their formulation in the original nonlinear form allows for the handling of large displacements. The second term is the smoothness term and it penalizes deviations of the flow field from piecewise smoothness. For the functions Ψ_d(s²) and Ψ_s(s²) the regularized L1 penalizer Ψ(s²) = sqrt(s² + ε²) is chosen, which equals total variation (TV) regularization in the case of the smoothness term. While the first two terms in E(w, f) are the same as in the original model, the newly introduced third term penalizes deviations from the epipolar geometry. The vectors s and f are defined as in Section 2, but this time s is a function of x and w. To minimize the influence of outliers in the computation of F we choose Ψ_e(s²) to be the regularized L1 penalizer. The weight β determines to what extent the epipolar constraint will be satisfied in all points. The constraint on the Frobenius norm of F avoids the trivial solution.
3.2 Minimization

To minimize E(w, f) with respect to u, v and f, subject to the constraint ||f||² = 1, we use the method of Lagrange multipliers. We are looking for critical points of

F(w, f, λ) = E(w, f) + λ(1 − f^T f) ,      (8)

i.e. tuples (u*, v*, f*, λ*) for which the functional derivatives of the Lagrangian F with respect to u and v and the derivatives of F with respect to f and λ vanish:

0 = δF/δu (w, f, λ),   0 = δF/δv (w, f, λ),   0 = ∇_f F(w, f, λ),   0 = ∂F/∂λ (w, f, λ).      (9)
Here ∇_f stands for the gradient operator (∂/∂f_11, · · · , ∂/∂f_33)^T. The first two equations in the equation system (9) are the Euler-Lagrange equations of the optical flow components u and v. To derive them in more detail we first write the epipolar constraint as follows:

s^T f = (x + u, y + v, 1) F (x, y, 1)^T = (u, v, 1) (a, b, q)^T = a u + b v + q .      (10)

This is a scalar product involving the optical flow w, where a and b denote the first two coefficients of the epipolar line F x̃ of a point x̃ = (x, y, 1)^T in the left image. The quantity q = x̃^T F x̃ can be interpreted as the distance of x̃ to its own epipolar line, up to the normalization factor 1/sqrt(a² + b²). With the help of formula (10) we can easily derive the contributions of the epipolar term in E(w, f) to the Euler-Lagrange equations. The partial derivatives of its integrand Ψ_e((s^T f)²) with respect to u and v are
(11) (12)
The contributions from the data term and the smoothness term remain unchanged from the original model. Thus we obtain the final Euler-Lagrange equations of u and v by adding the right hand sides of equations (11) and (12) to the Euler-Lagrange equations given in [3]: 2 2 0= Ψd Iz2 + γ Ixz (Ix Iz + γ (Ixx Ixz + Ixy Iyz )) + Iyz (13) − α div Ψs |∇u|2 + |∇v|2 ∇u + β Ψe (s f)2 (a u + b v + q) a , 2 2 0= Ψd Iz2 + γ Ixz + Iyz (14) (Iy Iz + γ (Iyy Iyz + Ixy Ixz )) − α div Ψs |∇u|2 + |∇v|2 ∇v + β Ψe (s f)2 (a u + b v + q) b . Here we have made use of the same abbreviations for the partial derivatives and the temporal differences in the data term as in [3]: Ix = ∂x I(x + w), Iy = ∂y I(x + w), Iz = I(x + w) − I(x), Ixx = ∂xx I(x + w), Ixy = ∂xy I(x + w), Iyy = ∂yy I(x + w), Iyz = ∂y I(x + w) − ∂y I(x). Ixz = ∂x I(x + w) − ∂x I(x),
(15)
To differentiate F with respect to f we only have to consider the newly introduced epipolar term since neither the data term nor the smoothness term depends on f. The two last equations of (9) give rise to a similar eigenvalue problem as equation (5): Ψe (s f)2 ss dx dy − λI f = (M − λI) f, (16) 0= Ω
0 = 1 − f2 .
(17)
A Variational Model for the Joint Recovery
319
Note that we were able to switch the order of differentiation and integration because f is a constant over the domain Ω. The system matrix positive definite M is symmetric and its entries are the integral expressions mij = Ω Ψe (s f)2 si sj dx dy with 1 ≤ i, j ≤ 9 and si the i-th component of s. 3.3 Solution of the System of Equations To solve the system of equations (9) we opt to iterate between the optical flow computation and the fundamental matrix estimation. After solving the Euler-Lagrange equations with a current estimate of the fundamental matrix f, we compose the system matrix M based on the computed optical flow w. Once M has been retrieved, we solve equation (16) for f. Due to the constraint (17) the solution will always be of unit norm. The new estimate of f will in turn be used to solve the Euler-Lagrange equations again for w, and this process is repeated until convergence. The rank of F is not enforced during this iteration process, but it can be enforced on the final estimate. To solve the Euler-Lagrange equations we adopt the warping strategy proposed in [3]. The flow is incrementally refined on each level of a multiresolution pyramid such that the algorithm does not get trapped in a local minimum. To calculate the flow increment a multigrid solver is used to assure fast convergence. Equation (16) is solved as a series of eigenvalue problems as described in Section 2. 3.4 Integrating Data Normalization We have found the data normalization discussed in Section 2 indispensable in obtaining accurate results. Therefore we have taken the effort to integrate it in our model. The main difficulty in inserting the transformations T and T into the epipolar constraint s f is the dependence of T = T (w) on the optical flow. This complicates the derivation of the Euler-Lagrange equations of u and v considerably because the derivative of T with respect to the flow components has to be taken into account. To overcome this problem we choose T = T . The impact of this simplification is most likely small since the dense point sets of the left and the right image will have a similar distribution. If we employ isotropic scaling as suggested in [8] then T will be a constant transformation only depending on the image domain Ω. As a result substituting F = T Fˆ T in the energy functional does not change the presented Euler-Lagrange equations (13) and (14) in the sense that a, b and q are computed from F as before (now via Fˆ ). However, replacing the original side constraint F = 1 with Fˆ = 1 and solving equations (16) and (17) for Fˆ yields the desired normalization effect during the total least squares fit.
4 Experiments We demonstrate the performance of our method by concentrating on the fundamental matrix estimation and the optical flow computation in two separate experiments. In order to deal with RGB color images we implemented a multichannel variant of our model where the 3 color channels are coupled in the data term as follows 3 3 2 2 Ψd |Ii (x + w) − Ii (x)| + γ · |∇Ii (x + w) − ∇Ii (x)| dx dy. (18) Ω
i=1
i=1
320
L. Valgaerts, A. Bruhn, and J. Weickert
Fig. 1. The influence of the parameter β on the progression of dF . The experiments were conducted for 1000 iteration steps but convergence takes place within the first 200 steps.
We initialize our method with a zero fundamental matrix such that the first iteration step comes down to the two-step method of recovering the fundamental matrix from pure optical flow [12]. Additionally we exclude those points from the estimation of the fundamental matrix that are warped outside the image. The reason for this is that in these points no data term can be evaluated which has a less reliable flow as result. In our first experiment we recover the epipolar geometry of a synthetic image pair. The two frames of size 640 × 480 represent two views of the 3D reconstruction of a set from a film studio and since they have been generated synthetically the fundamental matrix is known exactly. As an error measure between our estimate for the fundamental matrix and the ground truth we use the distance proposed by Faugeras et al. in [5] which we will denote by dF . For this measure two fundamental matrices are used to determine the epipolar lines of several thousand randomly chosen points while systematically switching their roles to assure symmetry. Finally an error measure in pixels is obtained that describes the discrepancy between two epipolar geometries in terms of fundamental matrices for a complete scene. Figure 1 shows how dF decreases as a function of the number of iterations, and eventually converges. All optical flow parameters have been optimized with respect to the distance error of the first estimated fundamental matrix. We see that the weight β has mainly an influence on the convergence speed and to a much lesser extent on the final error. The best results were achieved for β = 25 with a final error of dF = 0.42 after 1000 iterations. This is significantly below one pixel and a large improvement of the initial error of 6.4 pixels after the first iteration step. The fact that this value is reached after 200 iterations while remaining virtually unchanged afterwards shows the stability of our iteration scheme. In Figure 2 we can observe how an initial estimation of the epipolar line geometry is readjusted during the iteration process to almost coincide with the ground truth. Additionally the initial and the final flow fields are displayed together with the mask for the data term. In a second experiment we provide evidence that a simultaneous recovery of the epipolar geometry can improve the optical flow estimation substantially. To this end we use our method to compute the optical flow between frames 8 and 9 of the Yosemite sequence without clouds. These two 316 × 252 frames actually make up a stereo pair, since only diverging motion is present due to the camera movement. We evaluate the estimated optical flow by means of the average angular error (AAE) [2]. In Table 1
A Variational Model for the Joint Recovery
321
Fig. 2. Experiments on a synthetic image sequence. The epipolar lines estimated by our method are depicted as full white lines while the ground truth lines are dotted. (a) Top Left: Estimated eipolar lines in the left frame after the first iteration. (b) Top Right: Estimated eipolar lines in the right frame after the first iteration. (c) Middle Left: Estimated eipolar lines in the left frame after 1000 iterations. (d) Middle Right: Estimated eipolar lines in the right frame after 1000 iterations. (e) Bottom Left: Magnitude plot of the estimated optical flow field after the first iteration. Brightness encodes magnitude. Pixels that are warped outside the image are colored black. (f) Bottom Right: Magnitude plot of the estimated optical flow field after 1000 iterations.
we see that we were able to improve the AAE from 1.59◦ to 1.17◦ and are ranked among the best results published so far for a 2D method. It has to be noted that methods with spatio-temporal smoothness terms give in general slightly smaller errors. For this
Table 1. Results for the Yosemite sequence without clouds compared to other 2D methods

Method              AAE
Brox et al. [3]     1.59°
Mémin/Pérez [13]    1.58°
Roth/Black [15]     1.47°
Bruhn et al. [4]    1.46°
Amiaz et al. [1]    1.44°
Nir et al. [14]     1.18°
Our method          1.17°
Fig. 3. Results for the Yosemite sequence without clouds. (a) Top Left: Frame 8. (b) Top Middle: Frame 9. (c) Top Right: Magnitude plot of the ground truth for the optical flow field between frames 8 and 9. Brightness encodes magnitude. (d) Bottom Left: Estimated epipolar lines in frame 8 after 15 iterations. (e) Bottom Middle: Estimated epipolar lines in frame 9 after 15 iterations. (f) Bottom Right: Magnitude plot of the estimated optical flow after 15 iterations. Pixels (apart from the sky region) that are warped outside the image are colored black.
experiment all optical flow parameters have been optimized with respect to the AAE of the first estimated optical flow and β has been set to 50. The pixels that are warped outside the image are included in the computation of the AAE. In Figure 3 we show the results for the estimated epipolar lines and flow field. The epipolar lines seem to meet in a common epipole despite the fact that the rank has not been enforced.
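To make the evaluation protocol concrete, a minimal NumPy sketch of a dF-like measure is given below, following only the verbal description given earlier (random points, epipolar lines from both matrices, roles switched for symmetry). It is an illustration under these assumptions, not the exact algorithm of Faugeras et al. [5]; all function names are ours.

```python
import numpy as np

def point_line_dist(q, l):
    # Euclidean distance of the 2D point q to the homogeneous line l = (a, b, c)
    return abs(l[0] * q[0] + l[1] * q[1] + l[2]) / np.hypot(l[0], l[1])

def d_F(F1, F2, width, height, n=5000, seed=0):
    """Symmetric epipolar-line discrepancy (in pixels) between two
    fundamental matrices; a sketch in the spirit of [5]."""
    rng = np.random.default_rng(seed)
    dists = []
    # switch the roles of the two matrices and of the two images
    for Fa, Fb in [(F1, F2), (F2, F1), (F1.T, F2.T), (F2.T, F1.T)]:
        for _ in range(n):
            p = np.array([rng.uniform(0, width), rng.uniform(0, height), 1.0])
            la, lb = Fa @ p, Fb @ p        # epipolar lines induced by Fa and Fb
            if abs(la[1]) < 1e-12:
                continue
            x = rng.uniform(0, width)      # random point on la ...
            y = -(la[0] * x + la[2]) / la[1]
            if 0.0 <= y <= height:         # ... restricted to the image domain
                dists.append(point_line_dist((x, y), lb))
    return float(np.mean(dists))
```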
5 Summary

Until now concepts in geometrical and variational computer vision have often been developed independently. In this paper we have demonstrated that two such concepts, namely the estimation of the fundamental matrix and the computation of dense optical flow, can be coupled successfully. To this end we have embedded the epipolar constraint together with a data and smoothness penalty in one energy functional and have
proposed an iterative solution method. Experiments not only show the convergence of our scheme but also that the fundamental matrix and the optical flow can be computed very accurately. We hope that these findings will stimulate the efforts to bring together geometrical and variational approaches in computer vision even more. Acknowledgements. Levi Valgaerts gratefully acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG). The authors thank Reinhard Koch and Kevin Köser for providing the film studio image data.
References
1. Amiaz, T., Lubetzky, E., Kiryati, N.: Coarse to over-fine optical flow estimation. Pattern Recognition 40(9), 2496–2503 (2007)
2. Barron, J.L., Fleet, D.J., Beauchemin, S.S.: Performance of optical flow techniques. International Journal of Computer Vision 12(1), 43–77 (1994)
3. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optic flow estimation based on a theory for warping. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 25–36. Springer, Berlin (2004)
4. Bruhn, A., Weickert, J., Schnörr, C.: Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. International Journal of Computer Vision 61(3), 211–231 (2005)
5. Faugeras, O., Luong, Q.-T., Papadopoulo, T.: The Geometry of Multiple Images. MIT Press, Cambridge (2001)
6. Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24, 381–385 (1981)
7. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
8. Hartley, R.I.: In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(6), 580–593 (1997)
9. Huber, P.J.: Robust Statistics. Wiley, New York (1981)
10. Longuet-Higgins, H.C.: A computer algorithm for reconstructing a scene from two projections. Nature 293, 133–135 (1981)
11. Luong, Q.-T., Faugeras, O.D.: The fundamental matrix: theory, algorithms, and stability analysis. International Journal of Computer Vision 17(1), 43–75 (1996)
12. Mainberger, M., Bruhn, A., Weickert, J.: Is dense optical flow useful to compute the fundamental matrix? In: Campilho, A., Kamel, M. (eds.) Image Analysis and Recognition. LNCS, Springer, Berlin (to appear, 2008)
13. Mémin, E., Pérez, P.: Hierarchical estimation and segmentation of dense motion fields. International Journal of Computer Vision 46(2), 129–155 (2002)
14. Nir, T., Bruckstein, A.M., Kimmel, R.: Over-parameterized variational optical flow. International Journal of Computer Vision 76(2), 205–216 (2008)
15. Roth, S., Black, M.: On the spatial statistics of optical flow. In: Proc. Tenth International Conference on Computer Vision, Beijing, China, June 2005, pp. 42–49. IEEE Computer Society Press, Los Alamitos (2005)
16. Rousseeuw, P.J., Leroy, A.M.: Robust Regression and Outlier Detection. Wiley, New York (1987)
17. Saragih, J., Goecke, R.: Monocular and stereo methods for AAM learning from Video. In: Proc. 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, June 2007, pp. 1–7. IEEE Computer Society Press, Los Alamitos (2007)
18. Sheikh, Y., Hakeem, A., Shah, M.: On the direct estimation of the fundamental matrix. In: Proc. 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, June 2007, pp. 1–7. IEEE Computer Society Press, Los Alamitos (2007)
19. Slesareva, N., Bruhn, A., Weickert, J.: Optic flow goes stereo: a variational approach for estimating discontinuity-preserving dense disparity maps. In: Kropatsch, W.G., Sablatnig, R., Hanbury, A. (eds.) DAGM 2005. LNCS, vol. 3663, pp. 33–40. Springer, Berlin (2005)
Decomposition of Quadratic Variational Problems

Florian Becker and Christoph Schnörr

Image and Pattern Analysis Group, Heidelberg Collaboratory for Image Processing, University of Heidelberg, Germany
{becker,schnoerr}@math.uni-heidelberg.de
Abstract. Variational problems have proved of value in many image processing and analysis applications. However, the increase of sensor resolution that has occurred in medical imaging and experimental fluid dynamics demands adapted solution strategies to handle the huge amount of data. In this paper we address the decomposition of the general class of quadratic variational problems, which includes several important problems, such as motion estimation and image denoising. The basic strategy is to subdivide the originally intractable problem into a set of smaller convex quadratic problems. Particular care is taken to avoid ill-conditioned sub-problems. We demonstrate the approach by means of two relevant variational problems.
1 Introduction
Variational approaches to motion estimation are nowadays routinely used in many image processing applications. A key problem, however, concerns the ever increasing sizes of data sets to be processed, in particular for analysing 3D image sequences in medical imaging and experimental fluid dynamics. For example, the next generation of imaging sensors in experimental fluid dynamics will deliver data volumes taken with high-speed cameras that cannot be handled within the working memory of a single PC. This necessitates investigating coarse problem decompositions that allow large-scale problems to be processed in parallel with off-the-shelf hardware. In this paper, we present a decomposition approach for a fairly general class of variational problems, including higher-order regularisation, for instance. We pay attention to obtaining easy-to-solve subproblems that communicate in order to provably compute the unique global solution. Related Work and Contribution. The use of variational domain decomposition for motion estimation has been introduced to the field of image processing by Kohlberger et al. [1]. This approach is not applicable however to variational
This work was partially financed by the EC project FLUID (FP6-513663).
models involving higher-order regularisation, like [2] for instance, because subproblems in inner domains become inherently singular, and because the corresponding handling of boundary conditions becomes involved. In this paper, we therefore adopt the viewpoint of convex programming to tackle the decomposition problem [3]. We consider the general class of convex quadratic optimisation problems and propose a method to decompose any instance into a sum of still convex sub-functions. This allows applying the dual decomposition approach. The initial problem can then be solved via several smaller independent convex problems and computationally cheap synchronisation steps, without changing the overall objective. We describe an extension that allows improving the numerical properties of the underlying problem, hence improving the convergence rate. Two problems from motion estimation are decomposed exemplarily. Error measurements in comparison to single-domain solutions are presented. Organisation. In section 2 we summarise the underlying idea of dual problem decomposition and specialise it to the considered class of optimisation problems. Section 3 describes an iterative method for solving the problem in its decomposed formulation. An extension is proposed to improve the numerical properties of the sub-problems. Experimental results for the application to two variational motion estimation problems are given in section 4.
2 Problem Decomposition

2.1 Dual Decomposition of Convex Optimisation Problems
Our approach is based on the idea known as dual decomposition [4]. The method requires that the objective function of a convex optimisation problem,

\[ \min_u f(u), \qquad f(u): \mathbb{R}^n \rightarrow \mathbb{R}, \qquad (1) \]

can be decomposed into a sum of d convex sub-functions, so f(u) = \sum_{l=1}^{d} f_l(u). For demonstrating the basic idea we restrict ourselves to d = 2. The variable vector u is split up into internal variables u_l ∈ R^{n_l} that are only involved in exactly one sub-function, and a vector of complicating variables y ∈ R^{n_C} that are common to f_1 and f_2. Next we introduce an own set of complicating variables y_l for each sub-function and enforce their identity by a consistency constraint, leading to an optimisation problem equivalent to (1):

\[ \min_{\{u_l, y_l\}} \; f_1(u_1, y_1) + f_2(u_2, y_2) \qquad \text{s.t.} \quad y_1 = y_2 \qquad (2) \]
For briefness we denote a set of vectors {x_1, ..., x_d} as {x_l}. With the Lagrangian function, defined as

\[ L(\{u_l, y_l\}, \lambda) := f_1(u_1, y_1) + f_2(u_2, y_2) + \lambda^{\top}(y_1 - y_2), \]

we can make the constraints implicit and obtain the primal (P) and dual (D) Lagrangian problems corresponding to (2),
\[ \text{(P)} \qquad p^* := \min_{\{u_l, y_l\}} \; \sup_{\lambda} \; L(u, \lambda) \qquad (3) \]
\[ \text{(D)} \qquad d^* := \max_{\lambda} \; \inf_{\{u_l, y_l\}} \; L(u, \lambda) \qquad (4) \]
Their optimal values are related as p* ≥ d* (weak duality). For convex functions without inequality constraints this relation holds with equality (strong duality) and allows solving the primal problem by way of its dual formulation,

\[ \underbrace{\max_{\lambda} \; \underbrace{\inf_{u_1, y_1} f_1(u_1, y_1) + \lambda^{\top} y_1}_{\text{sub-problem 1}} \; + \; \underbrace{\inf_{u_2, y_2} f_2(u_2, y_2) - \lambda^{\top} y_2}_{\text{sub-problem 2}}}_{\text{master problem}} \qquad (5) \]
The problem decomposes into two convex sub-problems with independent variables, embedded into a master problem that updates λ. This formulation allows specifying an iterative method:

    initialise λ ← 0, u_l ← 0, y_l ← 0, k ← 0
    repeat
        for l = 1, ..., d do:
            (u_l^(k+1), y_l^(k+1)) ← solution of sub-problem l
        λ^(k+1) ← λ^(k) + α^(k) ∇_λ L(u_l^(k+1), y_l^(k+1), λ^(k))
        k ← k + 1
    until convergence

In each iteration the primal variables are refined by solving smaller sub-problems which are preferably of a simple structure. Due to their independence this task can be performed in parallel. Afterwards the dual variables are updated, e.g. by a subgradient method such as cutting plane, bundle or trust region, which choose α^(k) – we refer to [3] for details. The update steps are iterated until a stopping criterion is met, for example based on the change in the primal and/or dual variables.

2.2 Considered Class of Problems
Here we consider the class of convex quadratic optimisation problems given in the form

\[ \min_{u \in \mathbb{R}^n} \; \tfrac{1}{2}\|Du + c\|_2^2 \qquad \text{with } D \in \mathbb{R}^{m \times n},\; c \in \mathbb{R}^m, \qquad (6) \]

which for example includes the variational approach to optical flow estimation by Yuan et al. [2]:

\[ \min_u \int_{\Omega} \tfrac{1}{2}\bigl(\nabla g^{\top} u + g_t\bigr)^2 + \tfrac{\beta_1}{2}\,\|\nabla \operatorname{div} u\|_2^2 + \tfrac{\beta_2}{2}\,\|\nabla \operatorname{curl} u\|_2^2 \, dx \; + \; \int_{\partial\Omega} \tfrac{\beta_3}{2}\,\|\partial_n u\|_2^2 \, dx \qquad (7) \]

In section 4 we solve this problem in its decomposed form.
2.3 Decomposition of Convex Quadratic Problems
We assume that the problem description can be rearranged by permuting the order of variables in u and the row order in D and c in order to get a form that makes the separable structure explicit:

\[ f(u) = \tfrac{1}{2}\|Du + c\|_2^2 = \tfrac{1}{2}\left\| \begin{pmatrix} D_{1,I} & 0 & \cdots & 0 & D_{1,C} \\ 0 & D_{2,I} & \ddots & \vdots & D_{2,C} \\ \vdots & \ddots & \ddots & 0 & \vdots \\ 0 & \cdots & 0 & D_{d,I} & D_{d,C} \end{pmatrix} \begin{pmatrix} u_1 \\ \vdots \\ u_d \\ y \end{pmatrix} + \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_d \end{pmatrix} \right\|_2^2 \qquad (8) \]
Then D_{l,I} ∈ R^{m_l × n_l} and D_{l,C} ∈ R^{m_l × n_C} represent the coefficients of the internal respectively complicating variables and c_l ∈ R^{m_l} the constant parts. The objective function decomposes into

\[ f(u) = \sum_{l=1}^{d} \left( \tfrac{1}{2} \begin{pmatrix} u_l \\ y \end{pmatrix}^{\!\top} \underbrace{\begin{pmatrix} D_{l,I} & D_{l,C} \end{pmatrix}^{\!\top} \begin{pmatrix} D_{l,I} & D_{l,C} \end{pmatrix}}_{A_l :=} \begin{pmatrix} u_l \\ y \end{pmatrix} + c_l^{\top} \underbrace{\begin{pmatrix} D_{l,I} & D_{l,C} \end{pmatrix}}_{} \begin{pmatrix} u_l \\ y \end{pmatrix} \right) + \underbrace{\tfrac{1}{2}\, c^{\top} c}_{\text{const}} \]
\[ = \sum_{l=1}^{d} \underbrace{\left( \tfrac{1}{2} \begin{pmatrix} u_l \\ y \end{pmatrix}^{\!\top} A_l \begin{pmatrix} u_l \\ y \end{pmatrix} - \begin{pmatrix} u_l \\ y \end{pmatrix}^{\!\top} b_l \right)}_{f_l(u_l, y) :=} + \text{const}, \qquad b_l := -\begin{pmatrix} D_{l,I} & D_{l,C} \end{pmatrix}^{\!\top} c_l. \qquad (9) \]
Note that the A_l are symmetric and, due to x^⊤ A_l x = ‖(D_{l,I}  D_{l,C}) x‖_2^2 ≥ 0 for all x, positive semidefinite matrices. Hence each sub-function is convex and quadratic by construction. We further assume that the decomposition is chosen such that the matrices A_l are also non-singular.
3 Optimisation of Decomposed Quadratic Problems
In the previous section we showed that the objective function (8) can be decomposed into a sum of convex sub-functions and hence we can apply the dual decomposition method for solving problem (6). First of all we combine the internal and complicating variables of each sub-function into x_l := (u_l^⊤, y_l^⊤)^⊤ and define a set of linear operators {C_l} to represent the consistency constraints, e.g. as

\[ \begin{pmatrix} y_1 - y_2 \\ y_2 - y_3 \\ \vdots \\ y_{d-1} - y_d \end{pmatrix} = \sum_{l=1}^{d} C_l x_l = 0. \]
Note that although x_l contains local variables the constraints only involve complicating ones. Then the following decomposed problem is equivalent to (6):

\[ \min_{\{x_l\}} \; \sum_{l=1}^{d} \left( \tfrac{1}{2}\, x_l^{\top} A_l x_l - x_l^{\top} b_l \right) \qquad \text{s.t.} \quad \sum_{l=1}^{d} C_l x_l = 0 \qquad (10) \]
With the definition of the Lagrange function,

\[ L(\{x_l\}, \lambda) = \sum_{l=1}^{d} \left( \tfrac{1}{2}\, x_l^{\top} A_l x_l - x_l^{\top} b_l + \lambda^{\top} C_l x_l \right), \]

the problem in its decomposed dual form reads

\[ \max_{\lambda} \; \sum_{l=1}^{d} \inf_{x_l} \left( \tfrac{1}{2}\, x_l^{\top} A_l x_l - x_l^{\top} \bigl( b_l - C_l^{\top} \lambda \bigr) \right). \qquad (11) \]
For the considered class of optimisation problems, each sub-problem is an unconstrained convex quadratic optimisation problem. Due to these beneficial properties, any first-order optimal solution x_l with ∇f_l(x_l) = 0 is also a global minimum of f_l. It can efficiently be found by solving the linear system A_l x_l = b_l − C_l^⊤ λ, e.g. by a conjugate gradient method [5]. The variables λ of the enveloping master problem are updated by moving along the gradient of the Lagrange function, ∇_λ L({x_l}, λ) = Σ_{l=1}^d C_l x_l. The update steps for the primal and dual variables then read

\[ x_l^{(k+1)} \leftarrow A_l^{-1}\bigl( b_l - C_l^{\top} \lambda^{(k)} \bigr), \qquad l = 1, \dots, d \qquad (12) \]
\[ \lambda^{(k+1)} \leftarrow \lambda^{(k)} + \alpha^{(k)} \sum_{l=1}^{d} C_l\, x_l^{(k+1)}. \qquad (13) \]

The step scaling α^(k) > 0 may be chosen constant or – as proposed in [3] – as a sequence with lim_{k→∞} α^(k) = 0 and Σ_{k=1}^∞ α^(k) → ∞.
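To illustrate the update steps (12)-(13), the following NumPy sketch iterates the independent primal solves and the dual gradient step for small dense problems. The diminishing step-size schedule and the fixed iteration count are assumed choices for illustration, not the settings used in the experiments below.

```python
import numpy as np

def dual_decomposition(A, b, C, n_dual, a=1.0, b_coef=1.0, iters=100):
    """Sketch of iteration (12)-(13) for
    min sum_l 0.5 x_l^T A_l x_l - x_l^T b_l   s.t.   sum_l C_l x_l = 0,
    with A, b, C lists of per-subproblem matrices/vectors (dense, toy size)."""
    d = len(A)
    x = [np.zeros(A_l.shape[0]) for A_l in A]
    lam = np.zeros(n_dual)
    for k in range(1, iters + 1):
        # sub-problems (12): independent linear solves, parallelisable
        for l in range(d):
            x[l] = np.linalg.solve(A[l], b[l] - C[l].T @ lam)
        # master problem (13): step along the gradient of the Lagrangian
        grad = sum(C[l] @ x[l] for l in range(d))
        alpha = 1.0 / (a + b_coef * k)     # one admissible diminishing schedule
        lam = lam + alpha * grad
    return x, lam
```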
3.1 Extension for Badly Conditioned Sub-problems
The matrices A_l which arise from the decomposition (9) are positive definite but may be badly conditioned, depending on the problem and the actual decomposition. We propose a framework that allows improving the numerical properties of the underlying optimisation problems without altering the overall objective. Instead of solving min_{x_l} f_l(x_l) in each iteration and for each sub-problem, we modify the objective functions by adding a regularisation term which involves the current value x_l^(k):

\[ f_l(x) + \tfrac{1}{2}\left\| B_l^{\frac{1}{2}}\bigl( x - x_l^{(k)} \bigr) \right\|_2^2 = \tfrac{1}{2}\, x^{\top} (A_l + B_l)\, x - x^{\top}\bigl( b_l + B_l x_l^{(k)} \bigr) + \tfrac{1}{2}\left\| B_l^{\frac{1}{2}} x_l^{(k)} \right\|_2^2 \]
with an arbitrary symmetric positive semidefinite matrix B_l. Then the new iteration steps (replacing (12)) for the primal variables are

\[ x_l^{(k+1)} \leftarrow (A_l + B_l)^{-1}\bigl( b_l + B_l x_l^{(k)} - C_l^{\top} \lambda^{(k)} \bigr). \qquad (14) \]

In this representation it becomes apparent that B_l allows to directly modify the linear system that has to be solved for each sub-problem. For B_l = 0 ∀l we obtain the original update step (12). In order to show that the altered method still solves the original problem, we rewrite the steps (14) together with the unmodified update for the dual variables (13) as a single system of the form

\[ \underbrace{\begin{pmatrix} A_1 + B_1 & & 0 & 0 \\ & \ddots & & \vdots \\ 0 & & A_d + B_d & 0 \\ C_1 & \cdots & C_d & -\tfrac{1}{\alpha} I \end{pmatrix}}_{M :=} \underbrace{\begin{pmatrix} x_1^{(k+1)} \\ \vdots \\ x_d^{(k+1)} \\ \lambda^{(k+1)} \end{pmatrix}}_{z^{(k+1)} :=} = \underbrace{\begin{pmatrix} B_1 & & 0 & -C_1^{\top} \\ & \ddots & & \vdots \\ 0 & & B_d & -C_d^{\top} \\ 0 & \cdots & 0 & -\tfrac{1}{\alpha} I \end{pmatrix}}_{N :=} \underbrace{\begin{pmatrix} x_1^{(k)} \\ \vdots \\ x_d^{(k)} \\ \lambda^{(k)} \end{pmatrix}}_{z^{(k)} :=} + \underbrace{\begin{pmatrix} b_1 \\ \vdots \\ b_d \\ 0 \end{pmatrix}}_{b :=} \]
which can be interpreted as a splitting method, a class of iterative methods for solving linear problems – we refer to [5] for details.

Theorem 1. [5, Th. 10.1.1] If M − N and M are nonsingular and the eigenvalues of G := M⁻¹N lie within the unit circle, i.e. ρ(G) < 1, then the iteration M z^(k+1) = N z^(k) + b converges to a solution of (M − N)z = b for any initial z^(0).

Corollary 2. The iteration (13)-(14) solves (11) if it converges.

Proof. Due to Theorem 1 the fixed point of the proposed iteration method reads

\[ (M - N)\, z = \begin{pmatrix} A_1 & & 0 & C_1^{\top} \\ & \ddots & & \vdots \\ 0 & & A_d & C_d^{\top} \\ C_1 & \cdots & C_d & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_d \\ \lambda \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_d \\ 0 \end{pmatrix} = b. \]

These equations exactly represent the first-order Karush-Kuhn-Tucker conditions of the original problem,

\[ \nabla_{x_l} L(\{x_l\}, \lambda) = 0 \;\; \forall l \qquad \text{and} \qquad \nabla_{\lambda} L(\{x_l\}, \lambda) = 0. \qquad (15) \]

Hence the method solves the original problem for any B_l if it converges.
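For small instances the condition of Theorem 1 can be checked numerically. The sketch below assembles M and N with the block layout used above (dense matrices, toy sizes only) and returns the spectral radius of G = M⁻¹N; it is an illustration of the convergence test, not part of the method itself.

```python
import numpy as np

def splitting_spectral_radius(A, B, C, alpha):
    """rho(M^{-1} N) for the splitting interpretation of steps (13)-(14);
    convergence requires the returned value to be < 1 (Theorem 1)."""
    d = len(A)
    sizes = [A_l.shape[0] for A_l in A]
    m = C[0].shape[0]                       # number of dual variables
    n_tot = sum(sizes) + m
    M = np.zeros((n_tot, n_tot))
    N = np.zeros((n_tot, n_tot))
    off = 0
    for l in range(d):
        sl = slice(off, off + sizes[l])
        M[sl, sl] = A[l] + B[l]
        N[sl, sl] = B[l]
        N[sl, -m:] = -C[l].T
        M[-m:, sl] = C[l]
        off += sizes[l]
    M[-m:, -m:] = -(1.0 / alpha) * np.eye(m)
    N[-m:, -m:] = -(1.0 / alpha) * np.eye(m)
    G = np.linalg.solve(M, N)
    return float(np.max(np.abs(np.linalg.eigvals(G))))
```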
4 Experiments and Discussion
In this section we apply the proposed decomposition method to two variational motion estimation approaches. In order to measure the exactness we compare
the results to the non-decomposed solution of the problem. We examine the evolution of the error measurements over iterations of the master problem to investigate the time complexity of the method. A synthetic image pair of size 500 by 500 pixels was used as input data for the experiments. The ground truth vector field (see Fig. 1(d)) is affine in the coordinates and has a maximum magnitude of 1 pixel and an average of 0.46 pixels. In all experiments we chose a geometrically based method to determine the decomposition into sub-functions. The sub-problems were solved as linear systems of equations with a conjugate gradient method. For the master problem we defined

\[ \alpha^{(k)} = \frac{1}{a + b\,k} \cdot \frac{1}{\bigl\| \nabla_{\lambda} L(\{x_l\}, \lambda) \bigr\|_2}. \qquad (16) \]
Both the primal and dual variables were initialised with all-zero vectors. For every grid position x ∈ Ω we determined the Euclidean distance e(x) between the solutions of the decomposed and non-decomposed problems. The following overall error measurements were evaluated:

\[ \mu_e := \frac{1}{|\Omega|} \sum_{x \in \Omega} e(x), \qquad \sigma_e := \sqrt{\frac{1}{|\Omega| - 1} \sum_{x \in \Omega} \bigl( e(x) - \mu_e \bigr)^2} \qquad \text{and} \qquad \max{}_e := \max_{x \in \Omega} e(x) \]
The results and error measurements were recorded for each iteration. Reaching the maximum number of iterations - ten - was the only stopping criterion used.

4.1 Horn and Schunck
In [6], Horn and Schunck proposed the following variational approach to globally estimate the optical flow field:

\[ \min_u \int_{\Omega} \tfrac{1}{2}\bigl( \nabla g^{\top} u + g_t \bigr)^2 + \tfrac{1}{2}\beta\, \|\nabla u_1\|_2^2 + \tfrac{1}{2}\beta\, \|\nabla u_2\|_2^2 \, dx \]

where u(x) := (u_1(x), u_2(x))^⊤. A finite difference scheme was used for discretisation, allowing to write the discrete version of the objective function as

\[ f(u) := \tfrac{1}{2}\|D_{\mathrm{OFC}}\, u + c_{\mathrm{OFC}}\|_2^2 + \tfrac{1}{2}\beta\, \|D_{\partial x}\, u\|_2^2 + \tfrac{1}{2}\beta\, \|D_{\partial y}\, u\|_2^2 \]

with the linear operators D_OFC, D_∂x and D_∂y and constant vector c_OFC. To make the proposed decomposition method applicable, we rewrite the objective function as

\[ f(u) = \tfrac{1}{2}\left\| \begin{pmatrix} D_{\mathrm{OFC}} \\ \sqrt{\beta}\, D_{\partial x} \\ \sqrt{\beta}\, D_{\partial y} \end{pmatrix} u + \begin{pmatrix} c_{\mathrm{OFC}} \\ 0 \\ 0 \end{pmatrix} \right\|_2^2 . \]

For the decomposition we divided the grid into four areas which overlap by one grid unit, see Fig. 1(a).
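To make the structure of D and c concrete, the following SciPy sketch assembles sparse versions of D_OFC, D_∂x and D_∂y for an image pair and stacks them as above. Forward differences and unit grid spacing are assumed here; the discretisation actually used in the experiments may differ in detail.

```python
import numpy as np
import scipy.sparse as sp

def horn_schunck_operators(g1, g2, beta):
    """Assemble D and c so that f(u) = 0.5*||D u + c||_2^2 for the discrete
    Horn-Schunck model, with u = (u1; u2) stacked (sketch, assumed details)."""
    H, W = g1.shape
    g = 0.5 * (g1 + g2)
    gx = np.gradient(g, axis=1).ravel()
    gy = np.gradient(g, axis=0).ravel()
    gt = (g2 - g1).ravel()
    # optical flow constraint: D_OFC = [diag(gx) diag(gy)], c_OFC = gt
    D_ofc = sp.hstack([sp.diags(gx), sp.diags(gy)])
    # forward differences along x and y for one flow component
    dx1 = sp.kron(sp.eye(H), (sp.eye(W, k=1) - sp.eye(W)).tocsr()[:-1, :])
    dy1 = sp.kron((sp.eye(H, k=1) - sp.eye(H)).tocsr()[:-1, :], sp.eye(W))
    D_dx = sp.block_diag([dx1, dx1])        # acts on both components of u
    D_dy = sp.block_diag([dy1, dy1])
    D = sp.vstack([D_ofc, np.sqrt(beta) * D_dx, np.sqrt(beta) * D_dy]).tocsr()
    c = np.concatenate([gt, np.zeros(D_dx.shape[0] + D_dy.shape[0])])
    return D, c
```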
Fig. 1. Discretisation grid and decomposition, dots represent variables, shadings represent the four (overlapping) subdivisions: (a) approach by Horn&Schunck, (b) approach by Yuan et al. with original boundary terms and (c) with additional terms on subdivision boundaries, (d) ground truth vector field (sub-sampled)
Results. For the experiments we chose the regularisation parameter as β = 0.1. The parameters for the dual variable update rule (16) were set to a = −300 and b = 400. The error plot over the iterations one to ten in Fig. 2(b) shows a steep drop of all error measurements within few iterations. After ten iterations they reach μ_e = 3.57·10⁻⁶, σ_e = 4.04·10⁻⁵ and max_e = 0.0028. According to Fig. 2(a) the errors of the order of max_e are located at only a few positions at the artificial borders, especially where all four areas meet, and quickly reach 10⁻¹⁵ in the remaining parts of the subdivisions.
[Plots omitted: (a) error positions, max_e = 0.0028; (b) error history of μ_e, σ_e and max_e over ten iterations (logarithmic scale).]
Fig. 2. Horn&Schunck: error of decomposed solution compared to single-domain result, (a) distribution over coordinates after ten iterations, (b) evolution over ten iterations
4.2 Optical Flow Estimation with Higher Order Regularisation
The second approach we consider for decomposition uses higher order regularisation of the vector field (7). Mimetic differences as described in [7] were used for discretisation. In Fig. 1(b) the underlying variable grid is depicted. Using the notation introduced in [2] the discrete version of the objective function reads

\[ \tfrac{1}{2}\|G u + \partial_t g\|_{H_V}^2 + \tfrac{1}{2}\beta_1 \|\bar{G}\, \mathrm{Div}\, u\|_{H_S}^2 + \tfrac{1}{2}\beta_2 \|\bar{G}\, \mathrm{Curl}\, u\|_{H_E}^2 + \tfrac{1}{2}\beta_3 \|P u\|_{bc}^2 \]

where G, Ḡ, Div, ∂_t, Curl and P are linear operators and g = ½(g_1 + g_2) represents the image information in a vector representation, i.e. the columns stacked together. Due to the fact that the definitions of the H_V-, H_S-, H_E- and bc-norms are equivalent to the Euclidean norm we are able to rewrite the objective function as

\[ f(u) = \tfrac{1}{2}\left\| \begin{pmatrix} G \\ \sqrt{\beta_1}\, \bar{G}\, \mathrm{Div} \\ \sqrt{\beta_2}\, \bar{G}\, \mathrm{Curl} \\ \sqrt{\beta_3}\, P \end{pmatrix} u + \begin{pmatrix} \partial_t g \\ 0 \\ 0 \\ 0 \end{pmatrix} \right\|_2^2 . \]
This is exactly the form presumed by the proposed decomposition method. Results. First experiments showed that the decomposition by the proposed method leads to convex sub-functions but with badly conditioned matrices A_l. This causes a divergent behaviour of the algorithm. In order to improve the problem properties we imposed additional terms ½‖∂_n u‖²₂ on the artificial boundaries. Thereby each modified sub-problem has regularisation terms at all its
[Plots omitted: (a) error positions, max_e = 0.02; (b) error history of μ_e, σ_e and max_e over ten iterations (logarithmic scale).]
Fig. 3. Yuan et al.: error of decomposed solution compared to single-domain result, (a) distribution over coordinates after ten iterations, (b) evolution over ten iterations
boundaries, just as if it was a smaller instance of the original one. In Fig. 1(c) the locations of the boundary terms are represented as shaded areas. In contrast, Fig. 1(b) shows the unmodified grid decomposition where each sub-problem is only regularised by boundary terms at the borders of the original problem. The discrete forms of the new regularisation terms were concentrated into the matrices B_l and applied as described in section 3.1. As a result of this modification the condition number dropped from ρ(A_l) ≈ 10⁸ to ρ(A_l + B_l) ≈ 10⁵ and the algorithm converged quickly. The problem parameters were set to β_1 = β_2 = 0.4 and β_3 = 0.2. The algorithm parameters were chosen by hand, a = 900 and b = 100. The four areas overlap by two grid cells. Figure 3(b) shows that the error measurements reduce drastically within the first iteration and reach the order of their final values, μ_e = 1.22·10⁻⁴, σ_e = 6.39·10⁻⁴ and max_e = 0.02. The errors are mainly located around the artificial boundaries (see Fig. 3(a)) and tend to 10⁻¹⁵ in the remaining areas.
5 Conclusion and Further Work
We presented a convex programming approach to the decomposition of a fairly general class of variational approaches to image processing. The overall algorithm converges to the optimality conditions and can be carried out by solving well-conditioned subproblems in parallel. Our further work will focus on estimates of the convergence rate, and on connections to the domain decomposition literature [8] concerning preconditioning.
References
1. Kohlberger, T., Schnörr, C., Bruhn, A., Weickert, J.: Domain Decomposition for Variational Optical Flow Computation. IEEE Trans. Image Process. 14(8), 1125–1137 (2005)
2. Yuan, J., Schnörr, C., Mémin, E.: Discrete Orthogonal Decomposition and Variational Fluid Flow Estimation. J. Math. Imaging Vis. 28(1), 67–80 (2007)
3. Conejo, A.J., Castillo, E., Minguez, R., García-Bertrand, R.: Decomposition Techniques in Mathematical Programming. Springer, Heidelberg (2006)
4. Boyd, S., Xiao, L., Mutapcic, A.: Notes on Decomposition Methods. Technical report, Stanford University (2003)
5. Golub, G.H., van Loan, C.F.: Matrix Computations. The Johns Hopkins University Press (1996)
6. Horn, B., Schunck, B.: Determining Optical Flow. Artif. Intell. 17, 185–203 (1981)
7. Hyman, J.M., Shashkov, M.J.: Natural Discretizations for the Divergence, Gradient, and Curl on Logically Rectangular Grids. Comput. Math. Appl. 33(4), 81–104 (1997)
8. Toselli, A., Widlund, O.: Domain Decomposition Methods - Algorithms and Theory. Springer, Berlin (2004)
A Variational Approach to Adaptive Correlation for Motion Estimation in Particle Image Velocimetry

Florian Becker¹, Bernhard Wieneke², Jing Yuan¹, and Christoph Schnörr¹

¹ Image and Pattern Analysis Group, Heidelberg Collaboratory for Image Processing, University of Heidelberg, Germany
{becker,yuanjing,schnoerr}@math.uni-heidelberg.de
² LaVision GmbH, Göttingen, Germany
[email protected]
Abstract. In particle image velocimetry (PIV) a temporally separated image pair of a gas or liquid seeded with small particles is recorded and analysed in order to measure fluid flows therein. We investigate a variational approach to cross-correlation, a robust and well-established method to determine displacement vectors from the image data. A “soft” Gaussian window function replaces the usual rectangular correlation frame. We propose a criterion to adapt the window size and shape that directly formulates the goal to minimise the displacement estimation error. In order to measure motion and adapt the window shapes at the same time we combine both sub-problems into a bi-level optimisation problem and solve it via continuous multiscale methods. Experiments with synthetic and real PIV data demonstrate the ability of our approach to solve the formulated problem. Moreover window adaptation yields significantly improved results.
1 Introduction
Overview. Particle image velocimetry is an important measurement technique for industrial fluid flow questions. Small particles are introduced into liquids or gases and act as indicators for the movement of the investigated substance around obstacles and in mixing zones. A 2D plane is illuminated by laser light, rendering the particles in there visible to a camera which records two images of the highlighted area within a short time interval. The analysis of the image data allows determining the movement of particles and thus measuring the speed, turbulence or other derived mechanical properties of the fluid. In contrast to particle tracking velocimetry, where first single objects are identified by their position and then matched between two image frames, algorithms for PIV determine patches from the first and second frame that fit best to some similarity measure. Cross-correlation has developed into
This work was partially financed by the EC project FLUID (FP6-513663).
the state-of-the-art method for motion estimation in PIV and benefits from its robustness against noise and illumination disturbances. In this paper we describe a variational approach to cross-correlation by continuously optimising over the displacement variables. In addition a criterion is defined to locally adapt the correlation window in order to improve the accuracy of the estimation. Related Work and Contribution. A vast amount of literature exists on all aspects of the application of cross-correlation for PIV; here we only refer to [1] for an excellent overview. Typically an exhaustive search over the integer displacements is performed to find the highest correlation peak. The correlation function is interpolated to gain sub-pixel accuracy. In contrast we present a variational approach to motion estimation based on continuously maximising the cross-correlation between two images. The correlation window is described by a “soft” Gaussian weighting function instead of a sharp rectangular mask. This idea is used both in a local [2] and global context [3] for smoothing the optical flow constraint. However, while most approaches use a fixed window size common for all positions, we formulate a sound criterion for the location-dependent choice of the window shape parameters (size, orientation, anisotropy) in terms of a further optimisation problem. Both displacements and window parameters are determined as a solution of a combined, bi-level minimisation problem which is solved via a multiscale gradient-based algorithm. We test our approach with synthetic and real particle images to demonstrate the ability to robustly determine displacements and that window shape adaptation can improve results significantly.
2 2.1
Problem Statement Variational Approach to Correlation
The input data consists of two images defined on Ω ⊂ R2 . However we define them to vanish outside Ω and thus obtain two infinite image functions g1 , g2 : R2 → R. For the continuous case, we define the negative cross-correlation function at position x ∈ Ω by 1 1 (1) w(y − x, Σ) g1 y − v g2 y + v dy, C(v, Σ, x) := − 2 2 R2 with a window function w(x, Σ) parametrised by Σ. In order to estimate the movement between two image frames in the considered areas, the correlation
function is minimised with respect to the displacement vector v ∈ R². The local estimation is extended to a global variational problem to determine a vector field u: Ω → R²,

\[ \min_u \, C(u, \Sigma), \qquad \text{with } C(u, \Sigma) := \int_{\Omega} C\bigl(u(x), \Sigma(x), x\bigr)\, dx, \qquad (2) \]
where Σ is defined on Ω and describes the location-dependent window shape. This formulation allows to add regularisation terms such as physical priors, e.g. incompressibility constraints [4], on the vector field. However here the integration over the correlation window in (1) is the only spatial regularising mechanism. In this work we jointly determine the integer and fractional part of the displacement by continuously searching for an optimum of the correlation function. In addition we choose a “soft” window function w(x, Σ) := exp(−½ x^⊤ Σ⁻¹ x), which is basically a non-normalised Gaussian function, instead of a sharp, rectangular window. The symmetric, positive definite two-by-two matrix Σ ∈ S²₊₊ allows to continuously steer the size, anisotropy and orientation of the window, see Fig. 1 for some possible shapes.
Fig. 1. Possible shapes of the weighting function: varying size, anisotropy, orientation
2.2 Window Adaptation
When cross-correlation is employed for estimating motion it is implicitly assumed that the displacements within the considered window are homogeneous. However this only holds true in very simple cases and leads to estimation errors in areas of large motion gradients as the vector field is smoothed out. This effect could be avoided by reducing the window size, however at the cost of a smaller area of support and number of particles and thus a higher influence of image noise. In order to improve accuracy we propose to adapt the window shape by minimising a function which models the expected error subject to the choice of the window parameter Σ at position x ∈ Ω. Given a fixed vector field u we define the energy function as

\[ E(\Sigma, u, x) := \int_{\mathbb{R}^2} w(y - x, \Sigma)\, e(x, y, u)\, dy + \frac{\sigma^2}{2\pi\sqrt{\det \Sigma}}, \qquad (3) \]
\[ e(x, y, u) := \begin{cases} \|u(x) - u(y)\|_2^2 & \text{if } y \in \Omega \\ e_{\mathrm{out}} & \text{otherwise.} \end{cases} \]

The first term of (3) measures the deviation from the assumption u = const. In addition we assume a constant error e_out if the correlation window incorporates data from outside the image domain.
The second term describes the error caused by insufficiently large support for the displacement estimation in the presence of a homogeneous movement. We assume that each vector ū results from a weighted least-squares estimation ū = arg min_{u'} ∫ w(x, Σ) ‖u' − u(x)‖²₂ dx over independent measurements u(x) of the true displacement u*, which are disturbed by Gaussian additive noise, i.e. u(x) ∼ N(u*, σ² I). Then it is possible to show that the expected square error is

\[ \mathbb{E}\,\|\bar{u} - u^*\|_2^2 = \frac{\sigma^2}{\int w(x, \Sigma)\, dx} = \frac{\sigma^2}{2\pi\sqrt{\det \Sigma}}. \]

The parameter σ in E constitutes an estimation of the influence of image noise on the measurement error. Note that our definition of the error measure can easily be extended to involve further expert knowledge about the local influence of experimental parameters, such as particle seeding density, on the error of the cross-correlation method.
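A minimal sketch of evaluating this error model for one candidate Σ on a discrete flow grid could look as follows; the function name, the nearest-neighbour lookup of u(x) and the boolean in_domain mask are assumptions for illustration, and the discretisation actually used is given in Section 3.1.

```python
import numpy as np

def window_error(Sigma, x, grid_pts, u_grid, sigma_noise, e_out, in_domain=None):
    """Discrete analogue of E(Sigma, u, x), Eq. (3): weighted deviation from a
    locally constant flow plus the noise term sigma^2 / (2*pi*sqrt(det Sigma)).
    grid_pts: (N, 2) positions x_i, u_grid: (N, 2) flow vectors."""
    Si = np.linalg.inv(Sigma)
    dx = grid_pts - np.asarray(x, float)
    w = np.exp(-0.5 * np.einsum('ni,ij,nj->n', dx, Si, dx))
    u_x = u_grid[np.argmin(np.einsum('ni,ni->n', dx, dx))]   # u at (nearest) x
    e = np.sum((u_grid - u_x) ** 2, axis=1)
    if in_domain is not None:
        e = np.where(in_domain, e, e_out)
    noise = sigma_noise ** 2 / (2.0 * np.pi * np.sqrt(np.linalg.det(Sigma)))
    return float(np.sum(w * e) + noise)
```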
2.3 Joint Optimisation
In sections 2.1 and 2.2 we proposed two concepts disregarding their dependencies, by assuming that the window shape is fixed during correlation respectively that a vector field is known to estimate the error caused by spatial displacement variations. Now we combine both by defining a bi-level optimisation problem,

\[ \min_u \; C(u, \Sigma) \qquad (4) \]
\[ \text{with } \Sigma(x) \in \arg\min_{\Sigma \in S^2_{++}} E(\Sigma, u, x), \quad \forall x \in \Omega. \qquad (5) \]
3 3.1
Discretisation and Optimisation Data and Variable Discretisation
The discrete input data g1 , g2 is assumed to be sampled at a regular grid Y with grid spacing ay and is stored in a cubic spline representation which yields a two times continuously derivable representation. We use an efficient implementation based on [5] to evaluate the function g, its gradient ∇g and its second derivatives ∇2 g also at non-integer positions. Grey values of g1 and g2 are shifted beforehand so they have a mean of zero each. Values outside the image domain are defined to be zero. Displacement and window shape variables are discretised on a separate regular grid X ⊂ Ω with spacing ax which is typically chosen to be coarser than Y . We denote the variables located at the coordinates xi ∈ X as ui := u(xi ) respectively Σi := Σ(xi ). For the discretisation of the integral expressions in (2) and (3) we use simple, piecewise constant element functions 1 if x − xi ∞ ≤ 12 a vi (x) := , 0 otherwise
which form a discrete basis for functions R² → R if arranged on a regular grid with spacing a. The discrete versions of the objective functions then read

\[ C(u, \Sigma) = -a_x^2\, a_y^2 \sum_{x_i \in X} \sum_{y_j \in Y} w(y_j - x_i, \Sigma_i)\, g_1\!\left(y_j - \tfrac{1}{2} u_i\right) g_2\!\left(y_j + \tfrac{1}{2} u_i\right) \]

and

\[ E(\Sigma, u, x) = a_x^2 \sum_{x_i \in X} w(x_i - x, \Sigma)\, e(x, x_i, u) + \frac{\sigma^2}{2\pi\sqrt{\det \Sigma}}. \]

The derivative of the first function with respect to u_i simplifies to

\[ \nabla_{u_i} C(u, \Sigma) = -a_x^2\, a_y^2 \sum_{y_j \in Y} w(y_j - x_i, \Sigma_i)\, \nabla_{u_i}\!\left[ g_1\!\left(y_j - \tfrac{1}{2} u_i\right) g_2\!\left(y_j + \tfrac{1}{2} u_i\right) \right]. \]

To speed up the computation, the evaluation is restricted to a rectangular area that encloses all positions with w(x, Σ) ≥ 10⁻³.
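A minimal NumPy/SciPy sketch of one term of the discretised correlation objective around a single grid position is given below. The cubic spline evaluation of [5] is approximated here by scipy.ndimage.map_coordinates with order 3, the restriction to a rectangle is replaced by a fixed radius, and the leading minus sign follows the convention of Eq. (1); these are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def gaussian_window(dx, Sigma):
    """Non-normalised Gaussian weight w(x, Sigma) = exp(-0.5 x^T Sigma^{-1} x)
    for an array dx of offsets with shape (..., 2)."""
    Si = np.linalg.inv(Sigma)
    return np.exp(-0.5 * np.einsum('...i,ij,...j->...', dx, Si, dx))

def local_correlation(g1, g2, xi, ui, Sigma, ax=1.0, ay=1.0, radius=32):
    """One summand of C(u, Sigma) at grid position xi for displacement ui."""
    offs = np.arange(-radius, radius + 1)
    X, Y = np.meshgrid(xi[0] + offs, xi[1] + offs)
    pts = np.stack([X.ravel(), Y.ravel()], axis=-1).astype(float)
    w = gaussian_window(pts - np.asarray(xi, float), Sigma)

    def sample(img, p):
        # evaluate the image at non-integer positions p = (x, y); zero outside
        return map_coordinates(img, [p[:, 1], p[:, 0]], order=3, mode='constant')

    v1 = sample(g1, pts - 0.5 * np.asarray(ui, float))
    v2 = sample(g2, pts + 0.5 * np.asarray(ui, float))
    return -(ax ** 2) * (ay ** 2) * np.sum(w * v1 * v2)
```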
3.2 Optimisation
The joint formulation (4)-(5) is a bi-level optimisation problem with non-convex functions in both layers. First we remove the explicit constraints Σ_i ≻ 0 of the second level by including them into the objective function as a logarithmic barrier function and define

\[ \tilde{E}(\Sigma, u, x) := E(\Sigma, u, x) - \mu \log \det \Sigma, \]

with some constant μ > 0, here chosen as μ = 10⁻². In the same manner it is possible to incorporate additional upper and lower bounds on the window size. We relax the optimisation objectives by considering only the first order optimality conditions,

\[ \nabla_u C(u, \Sigma) = 0 \qquad (6) \]
\[ \nabla_{\Sigma_i} \tilde{E}(\Sigma_i, u, x_i) = 0, \qquad \forall x_i \in X. \qquad (7) \]
An initial solution is iteratively improved by updating the variables in parallel:

    initialise Σ^(0), u^(0), k ← 0
    repeat
        k ← k + 1
        for each x_i ∈ X
            u_i^(k+1) ← update(C(·, Σ_i^(k), x_i), u_i^(k), λ_u)
            Σ_i^(k+1) ← update(Ẽ(·, u^(k), x_i), Σ_i^(k), λ_Σ)
    until convergence

The update step is a Levenberg-Marquardt method that, just as the Newton step method, involves first and second order derivatives of the objective function f(x), but additionally steers the step length by the parameters λ > 0.
    function update(f(x), x_0, λ)
        H ← ∇²f(x_0),  g ← ∇f(x_0)
        d ← −(H + λI)⁻¹ g
        choose α ∈ [0, 1), so that f(x_0 + αd) ≤ f(x_0)
        return x_0 + αd

However, due to the non-convexity of C and E, a variable assignment that fulfils equations (6)-(7) does not necessarily constitute a local optimum of (4)-(5). For this reason we line search along the step direction to reduce the value of the objective functions and to avoid local maxima and saddle points.
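In Python, this damped update step with its simple line search could be sketched as follows; the candidate step lengths are an assumed choice, and f, grad_f, hess_f are callables supplied by the caller.

```python
import numpy as np

def update(f, grad_f, hess_f, x0, lam, alphas=(1.0, 0.5, 0.25, 0.1, 0.0)):
    """Levenberg-Marquardt-style step: damped Newton direction plus a
    backtracking choice of alpha in [0, 1) that does not increase f."""
    H = hess_f(x0)
    g = grad_f(x0)
    d = -np.linalg.solve(H + lam * np.eye(len(g)), g)
    f0 = f(x0)
    for a in alphas:
        if f(x0 + a * d) <= f0:
            return x0 + a * d
    return x0
```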
3.3 Multiscale Framework
The optimisation method described in the previous section finds a local optimum which, however, might be far from a globally optimal solution. Our approach avoids most local optima by plugging the single-level optimisation into a multiscale framework. Figure 2 illustrates how this allows to circumnavigate the many suboptimal positions in the correlation function of two noisy images.
Fig. 2. Optimisation of the non-convex correlation function of noisy image data: Value of −C(u, Σ, x) over u and evolution of the displacement over several iterations (bright line, starting in (0, 0)); due to the multiscale framework the method does not get stuck in a local optimum but finds the correct solution in (+8, −8)
In order to gain a coarse-to-fine representation of the image frames, the data is repetitively low-pass filtered and sub-sampled. When moving in the opposite direction within the resolution pyramid, the variables are sampled to a finer grid and
used as initialisation for the next iteration steps. We use cubic spline interpolation both for the re-sampling of data and vector variables. For the window shape variables, bi-linear interpolation implicitly conserves the constraint Σ ∈ S²₊₊.
4 Experiments and Discussion

4.1 PIV Evaluation Data Set
Our first experiment is based on an evaluation data set designed to verify the ability of correlation-based motion estimation methods to cope with large gradients in the vector field. Case A4 of the PIV Challenge 2005 [6] contains an area named 1D Sinusoids I which consists of two synthetic particle images, each 1000 by 400 pixels in size. Their vertical displacements vary sinusoidally with different wavelengths (400 pixels on the left down to 20 on the right) while the horizontal component is zero everywhere, see Fig. 3(a)-(b). Six respectively two scale levels were used for the fixed-window experiments and the experiments with window adaptation. The scale factor between two successive levels is √2. The parameters in (3) were set to σ = 20 and e_out = 10. Windows were constrained so that their 50%-level curve lies within a radius of about 63 pixels, i.e. w(x, Σ) < 0.5 for all ‖x‖₂ ≥ 63. The maximum displacement in the data is about 2.7 pixels and 1.2 pixels in average. The results (see Fig. 3(c)) show that pure correlation with fixed window shapes can capture the main structures of the images but fails to accurately estimate the vector fields, especially in the presence of large displacement gradients. However, when used as initialisation to adaptive correlation we obtained a precise reconstruction (see Fig. 3(d)) of the displacement field, even for the smallest wavelength.

4.2 Real Data
In order to test the ability of our approach to cope with noise and disturbances in real-life turbulent data we applied it to an image pair from a wind tunnel experiment (wake behind a cylinder, see [7]). Figure 4(a) shows the resulting vector field. The multiscale framework used eight scale levels and a scaling factor of √2. Window adaptation parameters were chosen as σ = 10 and e_out = 10. The window radius was constrained to a maximum of about 32 pixels. The average measured displacement is about seven pixels. Although we used neither regularisation terms nor data validation algorithms (e.g. median filter) on the vector field we obtained a smooth solution without outliers. Figure 4(b)-4(d) demonstrate how the window shapes align themselves along areas of equal displacements and avoid gradients as intended by the design of the window adaptation criterion.
Fig. 3. Experiments with synthetic data: (a) Groundtruth vector field (sub-sampled). (b) Vertical component (mapped to grey-values: bright = up, dark = down) of the groundtruth vector field and highlighted detail, also shown on the right. (c) Correlation with fixed window shape; estimates inevitably deteriorate at small wavelengths. (d) Joint correlation and window adaptation can significantly improve accuracy despite spatially variant wavelengths.
[Figure panels: (a) overview, (b) detailed view (lower left), (c) detailed view (lower middle), (d) detailed view (upper right)]
Fig. 4. Experiments with real data (wake behind a cylinder): (a) Results (sub-sampled) of the correlation approach with window adaptation. The dark highlighting marks the area of the (b)-(d) detailed views of the vector field with some adapted windows (contour line: w(x, Σ) = 0.5). Note that each window propagates into regions of homogeneous movement and perpendicular to gradients in the vector field and not necessarily along the flow.
5 Conclusion and Further Work
Conclusion. We proposed an approach to fluid flow estimation based on the continuous optimisation of the cross-correlation measure. The expected estimation error for the choice of the correlation window shape is modelled and minimised in order to adapt the windows to displacement gradients. Both sub-problems were combined in a bi-level optimisation problem. A multiscale gradient-based
approach was described that continuously searches for both optimal displacements and window shapes. Experiments with synthetic and real data showed that the approach can cope with large displacements and disturbances typical for real fluid flow experiments. The adaptation of window shapes by means of the error expectation model leads to meaningful results in the presence of displacement gradients and improved the error significantly on the PIV Challenge data set. Further Work. Our approach is the starting point for further investigations: Due to its variational formulation it is possible to extend the correlation term to also estimate affine displacements within the window and to involve physical priors, such as incompressibility. Further expert knowledge and statistical information can be incorporated into the window adaptation criterion to improve the estimation accuracy. Also more complex shapes for the weighting function should be considered. A comparison to state-of-the-art correlation implementations is planned.
References
1. Raffel, M., Willert, C., Kompenhans, J.: Particle Image Velocimetry, vol. 2. Springer, Berlin (2001)
2. Lucas, B., Kanade, T.: An Iterative Image Registration Technique with an Application to Stereo Vision. In: Proceedings of DARPA Imaging Understanding Workshop, pp. 121–130 (1981)
3. Bruhn, A., Weickert, J., Schnörr, C.: Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods. International Journal of Computer Vision 61(3), 211–231 (2005)
4. Ruhnau, P., Stahl, A., Schnörr, C.: Variational Estimation of Experimental Fluid Flows with Physics-based Spatio-temporal Regularization. Measurement Science and Technology 18, 755–763 (2007)
5. Unser, M., Aldroubi, A., Eden, M.: B-Spline Signal Processing: Part II—Efficient Design and Applications. IEEE Transactions on Signal Processing 41(2), 834–848 (1993)
6. Stanislas, M., Okamoto, K., Kähler, C., Westerweel, J.: Third International PIV Challenge. Exp. in Fluids (to be published, 2006)
7. Carlier, J., Wieneke, B.: Deliverable 1.2: Report On Production and Diffusion of Fluid Mechanic Images and Data. Activity Report, European Project FLUID Deliverable 1.2, Cemagref, LaVision (2005)
Optical Rails
View-Based Point-to-Point Navigation Using Spherical Harmonics

Holger Friedrich, David Dederscheck, Eduard Rosert, and Rudolf Mester

Visual Sensorics and Information Processing Lab, J.W. Goethe University, Frankfurt, Germany
{holgerf,davidded,rosert,mester}@vsi.cs.uni-frankfurt.de
http://www.vsi.cs.uni-frankfurt.de
Abstract. We present a view-based method for steering a robot in a network of positions; this includes navigation along a prerecorded path, but also allows for arbitrary movement of the robot between adjacent positions in the network. The approach uses an upward-looking omnidirectional camera; even a very modest quality of the optical system is sufficient, since all views are represented in terms of low-order basis functions (spherical harmonics). Motor control signals for the robot are derived from a differential matching approach; the computation of the involved gradient information is extremely simplified by exploiting the fact that all images are represented in terms of basis functions. The viability of the approach for steering the robot has been shown in extensive simulations using photorealistic views; the validity of these simulations in comparison to a tangible system implementation operating in a real indoor scene has been shown in previous investigations [5].
1 Introduction
Visual perception is the most important capability of humans and animals when it comes to autonomously moving through a complex environment. The question “Where am I?” is in biological systems most often answered by comparing views, and it seems less probable that any sort of geometry, e. g. triangulation, plays an essential role in identifying where we are, and in finding our way along a known path through our world. Visual perception based on geometric principles (triangulation, epipolar geometry, etc.) has its particular virtues, so far also gladly exploited by our group, but it seems as if view-based approaches were advantageous in other situations. The present paper shows that guiding a robot or vehicle through a known environment merely based on omnidirectional views and without using any geometric information is possible, and that it is not only a viable, but also an advantageous alternative to using markers, laser scanners, etc. We denote this approach as ‘Optical Rails’. The method shares certain characteristics with landmark-based robot homing approaches [4,9,1], view based ones [11], and also with those that use epipolar
geometry [2,8]. Our approach is particularly efficient by exploiting the representation of the image in spherical harmonics. We obtain a pose derivative necessary for controlling the motion of the robot vehicle from a differential analysis of a matching criterion. However – in contrast to many similar approaches – we acquire differential entities in a simple and computationally inexpensive way using precomputed expressions. By expanding the image signal in basis functions we avoid the computation of derivatives in image space as in the classical Lucas-Kanade approach [7]. Because we use a truncated expansion, our method works well even without high image resolution or precision optics.
2 Moving Robots on Optical Rails
The concept of Optical Rails is based on the omnidirectional views obtained at discrete locations in a network of so-called ‘waypoints’. Instead of using geometric position information, merely the comparison of the omnidirectional views at the current robot location and at a target location allows the robot to control its drive motors to move towards the target waypoint. Furthermore, paths can be represented by a sequence of waypoints to visit one after the other (Fig. 1). This enables the robot to drive along prerecorded sequences of waypoints, or to arbitrary locations as necessary ad hoc. We denote the task of approaching a target location based on the associated view as differential pose tracking.
Fig. 1. A navigation graph consisting of a moderate number of reference views and interconnecting ‘Optical Rails’
Fig. 2. Differential pose tracking: At each step, the optimal pose change is calculated based on the current view
In our approach, the omnidirectional views obtained from an upward-looking omnidirectional or ultra-wide angle camera are represented in terms of an expansion into spherical harmonics, analogously to a usual Fourier analysis in the plane. The coefficients of this expansion form view descriptors. The similarity (or dissimilarity) between different views can be easily determined from the difference of the corresponding two coefficient vectors. Considering one view descriptor as a target vector, and the other as the view descriptor of the current
robot pose, the differential change of the dissimilarity when performing an infinitesimal pose change of the robot yields direct information on how the robot should be moved in order to decrease the dissimilarity. This differential process is fully analogous to classical differential matching processes (a.k.a. ‘differential image registration’) such as used by Lucas and Kanade [7]. This way, we can deduce the translation and rotation the robot should perform to approach a given target location. The effect of a pose change (2D translation on the ground plane, rotation about the z axis) on the spherical image signal is in general complex, in particular for the translation part. Furthermore, for an exact prediction of how the spherical signal would look after the pose change we would need the depth structure of the scene, which is usually not available. Therefore, we reproject the spherical signal onto a virtual ‘ceiling plane’ parallel to the ground plane on which the robot is moving, and perform the pose change (3 degrees of freedom) in this plane. The projection onto the ceiling plane is known as ‘gnomonic projection’ in the literature. It is important to notice that the absolute pose of the robot in the world coordinate frame is not needed for this operation; what we actually deal with are (infinitesimal) changes of the pose in terms of the robot coordinate system. For using this to actually move the robot in the ‘right’ direction, only the Euclidean transform between the camera coordinate frame and the robot coordinate frame needs to be known.
3 A Model for Differential Pose Change Estimation
Let p_c be the current pose, and a the associated view descriptor. Let furthermore p_d be the destination pose, and ã the associated destination view descriptor. (The actual values of p_c and p_d will not be needed in the final result.) The dissimilarity measure Q between two views is (like in our previous work [5]) the mean squared image signal difference, obtained by integration across the semisphere, which is related to the view descriptors in a simple way:

\[ Q := \int_{\theta=0}^{\pi/2} \int_{\phi=0}^{2\pi} \bigl( s(\theta, \phi) - \tilde{s}(\theta, \phi) \bigr)^2 \cdot \sin\theta \, d\phi \, d\theta = \|a - \tilde{a}\|_2^2. \qquad (1) \]
The view descriptor a is dependent on the current pose of the robot; thus the negative gradient −g = −∂Q/∂p_c represents the driving direction of the robot to reduce the dissimilarity Q, which corresponds to approaching the destination. Since the relation between the image signal and the pose change is awkward to compute on the sphere, we use the projection approach:
– The spherical image signal corresponding to a particular view descriptor (= a weighted sum of basis functions) is projected onto the ceiling plane,
– differential translations and rotations of the projected signal resemble movements of the camera, which correspond to pose changes of the robot,
– the dissimilarity measure Q is computed by integration across the ceiling plane.
Fig. 3. Gnomonic Projection of a Hemispherical Signal
Taking the partial derivatives with respect to the components of the current pose vector p_c yields the gradient g which represents the (negative) steering direction towards the destination. This approach is of course only an approximation since the effects of the underlying geometry of the environment, i. e. objects at different distances, are disregarded. Gnomonic Projection of a Hemispherical Signal. If we project an image signal s(θ, φ) defined on a semi-sphere from the center of that sphere onto a plane, we obtain the Gnomonic Projection (Fig. 3). We obtain spherical coordinates from Cartesian coordinates on the projection plane by

\[ G(x, y) = \bigl( \theta(x, y);\ \phi(x, y) \bigr)^{\top}, \qquad \theta(x, y) = \arctan\bigl( 1, \sqrt{x^2 + y^2} \bigr), \quad \phi(x, y) = \arctan(x, y). \qquad (2) \]

Conversely, we obtain Cartesian coordinates on the projection plane from spherical coordinates as follows:

\[ G^{-1}(\theta, \phi) = \bigl( x(\theta, \phi);\ y(\theta, \phi) \bigr)^{\top}, \qquad x(\theta, \phi) = \frac{\sin\theta \cos\phi}{\cos\theta} = \tan\theta \cos\phi, \quad y(\theta, \phi) = \frac{\sin\theta \sin\phi}{\cos\theta} = \tan\theta \sin\phi. \qquad (3) \]
Integral of a Planar Projection of a Hemispherical Function. The dissimilarity measure Q as defined in (1) shall now be computed by an integration over the projection plane instead of an integration over the sphere. To that purpose, we need to introduce a weight function w(x). With Δ(θ, φ) := s(θ, φ) − s̃(θ, φ), our dissimilarity measure Q is given by

\[ Q = \int_{\theta=0}^{\pi/2} \int_{\phi=0}^{2\pi} \Delta^2(\theta, \phi) \cdot \sin\theta \, d\phi \, d\theta, \qquad (4) \]

where the factor sin θ is introduced by the Jacobian determinant. If we now perform the gnomonic transform as given above, we obtain

\[ Q = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \Delta^2\bigl( \underbrace{\theta(x, y),\ \phi(x, y)}_{= G(x, y)} \bigr) \cdot \underbrace{\det|J| \cdot \sin(\theta(x, y))}_{=: w(x, y)} \, dx \, dy, \qquad (5) \]
where det|J| is the determinant of the Jacobian for the gnomonic transform G(x, y):

\[ J = \begin{pmatrix} \frac{\partial\theta(x,y)}{\partial x} & \frac{\partial\phi(x,y)}{\partial x} \\[4pt] \frac{\partial\theta(x,y)}{\partial y} & \frac{\partial\phi(x,y)}{\partial y} \end{pmatrix} = \begin{pmatrix} \frac{x}{\sqrt{x^2+y^2}\,(1+x^2+y^2)} & -\frac{y}{x^2+y^2} \\[4pt] \frac{y}{\sqrt{x^2+y^2}\,(1+x^2+y^2)} & \frac{x}{x^2+y^2} \end{pmatrix} \qquad (6) \]

\[ \Longrightarrow \quad \det|J| = \frac{1}{\sqrt{x^2+y^2}\,\bigl( 1 + x^2 + y^2 \bigr)}. \qquad (7) \]
For the weight function w(x, y) we finally obtain:

\[ w(x, y) = \det|J| \cdot \sin\bigl( \arctan\bigl( \sqrt{x^2 + y^2} \bigr) \bigr) = \bigl( 1 + x^2 + y^2 \bigr)^{-\frac{3}{2}}. \qquad (8) \]
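The projection and the weight can be written down directly. The short sketch below also adds a numerical sanity check, which is our own illustration and not from the paper: the integral of w over the plane should approach 2π, the area of the unit hemisphere.

```python
import numpy as np

def gnomonic_inverse(theta, phi):
    """G^{-1}: spherical (theta, phi) -> plane coordinates (x, y), Eq. (3)."""
    return np.tan(theta) * np.cos(phi), np.tan(theta) * np.sin(phi)

def weight(x, y):
    """w(x, y) = (1 + x^2 + y^2)^(-3/2), Eq. (8)."""
    return (1.0 + x ** 2 + y ** 2) ** -1.5

# sanity check: truncated integral of w over the plane tends to 2*pi
s = np.linspace(-100.0, 100.0, 2001)
X, Y = np.meshgrid(s, s)
approx = np.trapz(np.trapz(weight(X, Y), s, axis=1), s)
print(approx, 2.0 * np.pi)   # ~6.2 vs 6.283...; the gap is the truncated tail
```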
Differential Pose Change Estimation for 3 DoF Motion. Let b(x) and b̃(x) be the planar image signals taken at the current pose p_c and destination pose p_d of the robot, respectively. We define f(x, d), a 2D affine transform which represents the pose translation and rotation by the parameter vector d:

\[ f(x, d) = A \cdot x, \qquad A = \begin{pmatrix} \cos\varphi & -\sin\varphi & v_1 \\ \sin\varphi & \cos\varphi & v_2 \\ 0 & 0 & 1 \end{pmatrix}, \qquad d = \begin{pmatrix} v_1 \\ v_2 \\ \varphi \end{pmatrix}, \qquad x = \begin{pmatrix} x_1 \\ x_2 \\ 1 \end{pmatrix} \qquad (9) \]
We now regard our dissimilarity measure Q for a transformed input image b(f(x, d)):

\[ Q = \int_{A} w(x) \cdot \bigl( b(f(x, d)) - \tilde{b}(x) \bigr)^2 dx, \qquad (10) \]
The partial derivative of Q with respect to the parameter vector d at d = 0 is:

\[ \frac{\partial Q}{\partial d_i} = \int_{A} 2 \cdot w(x) \cdot \bigl( b(f(x, d)) - \tilde{b}(x) \bigr) \cdot \left( \frac{\partial b(f(x, d))}{\partial d_i} - \frac{\partial \tilde{b}(x)}{\partial d_i} \right) dx \; \Bigg|_{d=0}. \qquad (11) \]

As b̃(x) does not depend on d, ∂b̃(x)/∂d_i is 0. The derivative ∂b(f(x, d))/∂d_i can be obtained by the generalized chain rule:

\[ \frac{\partial b(f(x, d))}{\partial d_i} = \left( \frac{\partial b(y)}{\partial y_1},\ \frac{\partial b(y)}{\partial y_2},\ \frac{\partial b(y)}{\partial y_3} \right)\Bigg|_{y=f(x,d)} \cdot \left( \frac{\partial f_1(x, d)}{\partial d_i},\ \frac{\partial f_2(x, d)}{\partial d_i},\ \frac{\partial f_3(x, d)}{\partial d_i} \right)^{\top} \qquad (12) \]

where f_i(x, d) are the three individual components of the vector function f(x, d), which represents the affine transform. With ξ_1 := −x_1 sin φ + x_2 cos φ and ξ_2 := −x_1 cos φ − x_2 sin φ, the individual partial derivatives evaluate as follows (with f_1 = x_1 cos φ + x_2 sin φ + v_1, f_2 = −x_1 sin φ + x_2 cos φ + v_2 and f_3 = 0·x_1 + 0·x_2 + 1):

\[ \frac{\partial f_1}{\partial v_1} = 1, \quad \frac{\partial f_2}{\partial v_1} = 0, \quad \frac{\partial f_3}{\partial v_1} = 0; \qquad \frac{\partial f_1}{\partial v_2} = 0, \quad \frac{\partial f_2}{\partial v_2} = 1, \quad \frac{\partial f_3}{\partial v_2} = 0; \qquad \frac{\partial f_1}{\partial \varphi} = \xi_1, \quad \frac{\partial f_2}{\partial \varphi} = \xi_2, \quad \frac{\partial f_3}{\partial \varphi} = 0 \]

\[ \Longrightarrow \quad \frac{\partial b(f(x, d))}{\partial d} = \left( \frac{\partial b(y)}{\partial y_1};\ \frac{\partial b(y)}{\partial y_2};\ \xi_1 \frac{\partial b(y)}{\partial y_1} + \xi_2 \frac{\partial b(y)}{\partial y_2} \right)\Bigg|_{y=f(x,d)} \qquad (13) \]
By substituting the results from (13) into (11) we then obtain:

∂Q/∂d = 2 ∫_A w(x) · ( b(f(x, d)) − b̃(x) ) · ( ∂b(y)/∂y1 ; ∂b(y)/∂y2 ; ξ1·∂b(y)/∂y1 + ξ2·∂b(y)/∂y2 )^T |_{y=f(x,d)} dx |_{d=0}.   (14)
As we have now evaluated all derivative terms which depend on the value of the parameter vector, we may now perform the transition to d = 0. As f(x, 0) is the identity transform, we obtain:

∂Q/∂d = ∫_A 2·w(x) · ( b(x) − b̃(x) ) · ( ∂b(x)/∂x1 ; ∂b(x)/∂x2 ; x2·∂b(x)/∂x1 − x1·∂b(x)/∂x2 )^T dx,   (15)

which represents the gradient g for continuous planar image signals. However, by replacing these signals using expansions into basis functions, this can be drastically simplified as will be shown below.

Exploiting the Representation as Linear Combination. We regard now the particular case of image signals being represented by linear combinations of basis functions Y̆j defined on the ceiling plane, i.e.

b(x) = Σ_j aj · Y̆j(x),   b̃(x) = Σ_j ãj · Y̆j(x),   (16)

where aj are the coefficients of the linear combination, corresponding to the j-th entry of the view descriptor a (see Appendix A). Substituting the image signals Σ_j aj · Y̆j(x) in (15) results in:
∂Q/∂d = 2 ∫_A w(x) · ( Σ_j Σ_k (aj − ãj)·Y̆j(x)·ak·∂Y̆k(x)/∂x1 ;
                        Σ_j Σ_k (aj − ãj)·Y̆j(x)·ak·∂Y̆k(x)/∂x2 ;
                        Σ_j Σ_k (aj − ãj)·Y̆j(x)·ak·( x2·∂Y̆k(x)/∂x1 − x1·∂Y̆k(x)/∂x2 ) ) dx

     = 2 ( Σ_j Σ_k (aj − ãj)·ak·∫_A w(x)·Y̆j(x)·∂Y̆k(x)/∂x1 dx ;
           Σ_j Σ_k (aj − ãj)·ak·∫_A w(x)·Y̆j(x)·∂Y̆k(x)/∂x2 dx ;
           Σ_j Σ_k (aj − ãj)·ak·∫_A w(x)·Y̆j(x)·( x2·∂Y̆k(x)/∂x1 − x1·∂Y̆k(x)/∂x2 ) dx )   (17)
Since the integrals in the three sums do not depend on any of the spectral coefficients ai, we may pre-compute them. We obtain the precomputed coefficients ui(j, k):

u1(j, k) = ∫_{x1=−∞}^{∞} ∫_{x2=−∞}^{∞} w(x)·Y̆j(x)·∂Y̆k(x)/∂x1 dx2 dx1   (18)
u2(j, k) = ∫_{x1=−∞}^{∞} ∫_{x2=−∞}^{∞} w(x)·Y̆j(x)·∂Y̆k(x)/∂x2 dx2 dx1   (19)
u3(j, k) = ∫_{x1=−∞}^{∞} ∫_{x2=−∞}^{∞} w(x)·Y̆j(x)·( x2·∂Y̆k(x)/∂x1 − x1·∂Y̆k(x)/∂x2 ) dx2 dx1.   (20)

With the above precomputed integrals, the computation of the components of the gradient can be regarded as the following bilinear form:

g(a, ã) := ∂Q/∂d = ( Σ_j Σ_k (aj − ãj)·ak·u1(j, k) ;
                     Σ_j Σ_k (aj − ãj)·ak·u2(j, k) ;
                     Σ_j Σ_k (aj − ãj)·ak·u3(j, k) )   (21)
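A possible realization of the bilinear form (21) is sketched below; U1, U2 and U3 are assumed to hold the precomputed coefficients ui(j, k) of Eqs. (18)–(20) as matrices, and a, a_dst are the coefficient vectors of the current and destination view.

```python
import numpy as np

def steering_gradient(a, a_dst, U1, U2, U3):
    """Gradient g(a, a_dst) of Eq. (21); U_i[j, k] = u_i(j, k)."""
    diff = a - a_dst
    return np.array([diff @ U1 @ a, diff @ U2 @ a, diff @ U3 @ a])
```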
Note that the negative gradient −g represents the necessary pose change of the camera. This direction −g has to be transformed to the coordinate system of the vehicle in order to steer the robot in the desired direction.
4 Experiments
We use a complex office environment created using Blender [3] in order to provide an experimental area for a simulated robot [5]. The modeled camera yields photo-realistic wide-angle images with a field of view of approx. 172.5° which can be projected onto a hemisphere. This hemispherical signal is extended to a full spherical signal by suitable reflection at the equator and is then expanded in spherical harmonics (Appendix A). The following experiments are carried out with a low order approximation (ℓ = 4, i.e. 15 contributing basis functions).

Differential Translation Estimation. In our first experiment, we demonstrate and verify the behavior of the gradient −g. We have rendered a grid of images around two target locations A and B. Ideally, from each grid point, the vector should point towards the center. The results are shown in Fig. 4. The quality of the gradient estimate depends on the geometric properties of the room.
Fig. 4. The vector plots show the estimated direction towards the destination at the center. The grid extends 1 m in each direction at a spacing of 0.1 m. Note that even objects very close to the camera (area A, left plot) do not disturb the estimate, whereas directions of preference may deteriorate the results (area B, right plot).
Differential Pose Tracking. In this experiment we demonstrate the process of differential pose tracking based on iteratively applying the results of our differential pose change model to views obtained at successive locations. We begin with two initial views: the omnidirectional images at the destination pose pd and at the initial current pose pc are rendered, and from those images we obtain the corresponding view descriptors ã and a. After this, the following steps are iterated: from the images the gradient −g(a, ã) is computed and −g/||g||₂ · s is added to the current pose pc, where s denotes a translatory step size. The view at this new pose pc is rendered and the view descriptor a is updated accordingly.
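A hedged sketch of this iteration is given below; render_view and view_descriptor are placeholders for the simulated camera and the spherical-harmonics expansion, steering_gradient refers to the sketch of Eq. (21) above, and the step size, threshold and iteration limit are illustrative values only. The transformation of −g into vehicle coordinates is omitted.

```python
import numpy as np

def track_to_destination(pose, a_dst, U1, U2, U3, render_view, view_descriptor,
                         step=0.05, eps=1e-3, max_iter=500):
    """Differential pose tracking: follow -g until ||a - a_dst||_2^2 < eps."""
    a = view_descriptor(render_view(pose))
    for _ in range(max_iter):
        if np.sum((a - a_dst) ** 2) < eps:
            break
        g = steering_gradient(a, a_dst, U1, U2, U3)
        pose = pose - step * g / np.linalg.norm(g)   # translation and rotation step
        a = view_descriptor(render_view(pose))       # re-render, update descriptor
    return pose
```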
[Figure legend: position of camera, target camera position, estimated translation, camera heading]
Fig. 5. Left: Visual pose tracking from initial pose (top) to target position (lower right, started from approx. 4 m distance, detail plot). Right: Round-trip using Optical Rails.
The iteration stops when ||a − ã||₂² falls below a threshold value. Note that for differential pose tracking, also the rotational component of −g is considered (Fig. 5, left).

A Round-Trip on Optical Rails. In the previous experiments we have shown that the concept of differential pose tracking can serve as a reliable method of approaching a single destination from within a local vicinity. The implementation of Optical Rails is possible by extending the destination to a sequence of views approached one after the other. To that end, a criterion is required to decide whether we have reached a destination so that we may proceed to the next destination (waypoint handover). Different measures have been investigated, including thresholding on similarity, a quotient of dissimilarities of the current view to the destination vs. next destination, and finally a criterion that meets our needs best. When steering from a current view a towards a destination view ã with the predecessor destination view ã′, our criterion for advancing towards the next destination view is

κ > ||a − ã||₂ / ( ||ã − ã′||₂ · ||g(ã′, ã)||₂ ),   (22)
where we call κ a threshold of successive relative similarity. This similarity measure has proven to be sufficiently robust for handover in most situations. In Fig. 5 we show a round-trip through our virtual office environment using a set of equidistant waypoints. Due to the spatial characteristics in the different areas of the environment, waypoint handover is difficult in some cases (note the
lower part of Fig. 5). Premature waypoint handover can cause the robot to leave the Optical Rail. However, convergence towards the proper destination is usually retained and should improve further if we make use of a vehicle model and an adaptive waypoint selection scheme [1]. In case of occlusions – e.g. caused by doors – a systematic waypoint placement would yield a local augmentation of the Optical Rail with additional reference views providing greater stability.
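The handover test can be written compactly as below; our reading of Eq. (22), with the predecessor destination view ã′ (a_prev) and the gradient taken between the two destination views, is an assumption, as is the example threshold value.

```python
import numpy as np

def advance_waypoint(a, a_dst, a_prev, U1, U2, U3, kappa=0.5):
    """Successive relative similarity criterion, cf. Eq. (22)."""
    num = np.linalg.norm(a - a_dst)
    den = (np.linalg.norm(a_dst - a_prev) *
           np.linalg.norm(steering_gradient(a_prev, a_dst, U1, U2, U3)))
    return kappa > num / den
```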
5 Outlook
We have shown that Optical Rails is a very advantageous concept for view-based robot navigation without geometry, and verified this claim in a realistic simulated environment. The expansion of the image signal into basis functions enables us to incorporate precomputed differential expressions of the basis functions. This yields a computationally inexpensive way of obtaining steering information towards a destination. The next obvious step to perform is to port this approach to a real robot. The hardware requirements for Optical Rails are quite modest: due to the low-frequency character of the used omnidirectional image representation, no high-resolution camera or high-quality lens system is required. It has already been verified in our previous work that even a low-cost camera system is sufficient for omnidirectional vision using spherical harmonics. As there are no markers or scanning devices involved, the Optical Rails approach imposes only very few requirements on the structure of the environment. Other issues of practical importance have been considered in our investigations, but are not discussed here: efficient ways of achieving illumination invariance have been investigated thoroughly [6], and the robustness of waypoint handover and the handling of occlusions are subjects currently under investigation. All experimental results obtained so far indicate that Optical Rails is truly a versatile and powerful navigation concept for robots in a known environment.
References
1. Argyros, A., Bekris, K., Orphanoudakis, S., Kavraki, L.E.: Robot homing by exploiting panoramic vision. Autonomous Robots 19(1), 7–25 (2005)
2. Basri, R., Rivlin, E., Shimshoni, I.: Visual homing: Surfing on the epipoles. International Journal of Computer Vision 33(2), 117–137 (1999)
3. The Blender Foundation: Blender (2007), http://www.blender.org
4. Franz, M.O., Schölkopf, B., Mallot, H.A., Bülthoff, H.H.: Where did I take that snapshot? Scene-based homing by image matching. Biological Cybernetics 79, 191–202 (1998)
5. Friedrich, H., Dederscheck, D., Krajsek, K., Mester, R.: View-based robot localization using spherical harmonics: Concept and first experimental results. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713. Springer, Heidelberg (2007)
6. Friedrich, H., Dederscheck, D., Mutz, M., Mester, R.: View-based robot localization using illumination-invariant Spherical Harmonics descriptors. In: VISAPP 2008 (2008)
7. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: IJCAI 1981, pp. 674–679 (1981)
8. Makadia, A., Geyer, C., Daniilidis, K.: Radon-based structure from motion without correspondences. In: Proc. CVPR (2005)
9. Möller, R., Vardy, A., Kreft, S., Ruwisch, S.: Visual homing in environments with anisotropic landmark distribution. Autonomous Robots 23(3), 231–245 (2007)
10. Nillius, P.: Image Analysis using the Physics of Light Scattering. PhD thesis, Royal Institute of Technology (KTH), Stockholm, Sweden (2004)
11. Stürzl, W., Mallot, H.A.: Efficient visual homing based on Fourier transformed panoramic images. Robotics and Autonomous Systems 54, 300–313 (2006)
12. Weisstein, E.W.: Legendre polynomial. A Wolfram Web Resource (2007), http://mathworld.wolfram.com/LegendrePolynomial.html
A Expansion in Basis Functions, Spherical Harmonics
A very compact representation of views can be obtained by expanding the spherical image signal s(θ, φ) in orthonormal basis functions. The natural choice for spherical basis functions are spherical harmonics. The complex-valued spherical harmonics Yℓm(θ, φ) are defined as

Yℓm(θ, φ) = (1/√(2π)) · Nℓm · Pℓm(cos θ) · e^{imφ},   Nℓm = √( (2ℓ+1)/2 · (ℓ−|m|)!/(ℓ+|m|)! ),   ℓ ∈ N0, m ∈ Z,   (23)

with Pℓm(x) the Associated Legendre Polynomials [12] and e^{imφ} being a complex-valued phase term. ℓ (ℓ ≥ 0) is called order and m (m = −ℓ, ..., +ℓ) is called quantum number for each ℓ. To approximate a signal s(θ, φ), i.e.

s(θ, φ) = Σ_{ℓ=0}^{∞} Σ_{m=−ℓ}^{+ℓ} aℓm · Yℓm(θ, φ),   (24)
the coefficients aℓm are obtained by computing scalar products between the signal and the complex conjugate of each of the basis functions:

aℓm = ∫_{0}^{2π} ∫_{0}^{π} s(θ, φ) · Ȳℓm(θ, φ) · sin θ dθ dφ.   (25)
As the image signals are real-valued, it is sufficient to use real-valued spherical harmonics which can be obtained from the complex-valued ones by combination of complex conjugate functions. However, for real-valued signals the coefficient vectors for both representations can be easily converted into each other [10].
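For illustration, the coefficients of Eq. (25) can be approximated numerically as sketched below; scipy.special.sph_harm uses essentially the normalization of Eq. (23) (up to sign conventions for negative m) with azimuth as its third and polar angle as its fourth argument, and the grid sizes are arbitrary example values.

```python
import numpy as np
from scipy.special import sph_harm

def sh_coefficients(signal, l_max, n_theta=64, n_phi=128):
    """Riemann-sum approximation of the coefficients a_lm of Eq. (25).
    signal(theta, phi) takes arrays: theta polar in [0, pi], phi azimuth in [0, 2*pi]."""
    theta = (np.arange(n_theta) + 0.5) * np.pi / n_theta
    phi = (np.arange(n_phi) + 0.5) * 2.0 * np.pi / n_phi
    T, P = np.meshgrid(theta, phi, indexing="ij")
    s = signal(T, P)
    dA = (np.pi / n_theta) * (2.0 * np.pi / n_phi)
    coeffs = {}
    for l in range(l_max + 1):
        for m in range(-l, l + 1):
            Y = sph_harm(m, l, P, T)        # scipy argument order: (m, l, azimuth, polar)
            coeffs[(l, m)] = np.sum(s * np.conj(Y) * np.sin(T)) * dA
    return coeffs
```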
Postprocessing of Optical Flows Via Surface Measures and Motion Inpainting Claudia Kondermann, Daniel Kondermann, and Christoph Garbe HCI at Interdisciplinary Center for Scientific Computing, University of Heidelberg, Germany [email protected]
Abstract. Dense optical flow fields are required for many applications. They can be obtained by means of various global methods which employ regularization techniques for propagating estimates to regions with insufficient information. However, incorrect flow estimates are propagated as well. We, therefore, propose surface measures for the detection of locations where the full flow can be estimated reliably, that is in the absence of occlusions, intensity changes, severe noise, transparent structures, aperture problems and homogeneous regions. In this way we obtain sparse, but reliable motion fields with lower angular errors. By subsequent application of a basic motion inpainting technique to such sparsified flow fields we obtain dense fields with smaller angular errors than obtained by the original combined local global (CLG) method and the structure tensor method in all test sequences. Experiments show that this postprocessing method makes error improvements of up to 38% feasible.
1 Introduction
Optical flow calculation is a crucial step for a wide variety of applications ranging from scientific data analysis and medical imaging to autonomous vehicle control and video compression. Despite high quality results on common test sequences, there are several challenging situations in image sequences, where many or even all known methods fail, e.g. in the case of difficult occlusions, transparent structures, severe noise, aperture problems, homogeneous regions or incoherent motion. Previously, some of these situations could be identified by means of confidence measures. In this paper we demonstrate that optical flow fields obtained by both local and global methods can be improved significantly by identification of situations, where a reliable estimation is possible, and subsequent motion inpainting in order to obtain a dense optical flow field. Our approach consists of two steps. First, we propose new confidence measures called ”surface measures”, which indicate the feasibility of a correct flow estimation. In a second step we minimize a basic motion inpainting functional in order to estimate the
The authors thank the German Research Foundation (DFG) for funding this work within the priority program ”Mathematical methods for time series analysis and digital image processing” (SPP1114).
flow within the unreliable regions only based on the flow information classified as reliable by the surface measures. In this way we obtain significantly improved optical flow fields. For the local structure tensor method we even obtain results superior to the global CLG method. Let a given image sequence I be defined on a space-time interval Ω × [0, T ], I : Ω × [0, T ] → R, Ω ⊆ R2 . Then the notion ”optical flow” refers to the displacement field u of corresponding pixels in subsequent frames of an image sequence, u : Ω × [0, T ] → R2 .
2 Related Work
Previously measures for the detection of difficult situations in the image sequence have been understood as a subclass of confidence measures. They mainly focus on the detection of features in the image, which make a correct estimation difficult. Such measures examine for example the magnitude of the image gradient or the eigenvalues of the structure tensor [1]. In contrast, other confidence measures are based on the flow computation method, e.g. [2] for variational methods or [3] for local methods. Limited comparisons of confidence measures have been carried out by Barron and Fleet [4] and Bainbridge and Lane [5]. Yet, the measures proposed so far are not able to detect all relevant difficult situations in an image sequence. Hence, to detect situations, where a reliable optical flow computation is feasible, we propose to examine the intrinsic dimensionality of image invariance functions similar to correlation surfaces. Correlation surfaces have for example been applied by Rosenberg and Werman [6] in order to detect locations where motion cannot be represented by a Gaussian random variable and by Irani and Anandan in [7] to align images obtained from different sensors. The inpainting of motion fields has been proposed before by Matsushita et al. [8] in order to accomplish video stabilization. Their approach differs from ours in two points: 1) The flow field is only extrapolated at the edges of the image sequence to continue the information to regions occluded due to perturbations of the camera. In contrast, we interpolate corrupted flow field regions within the sequence, which are identified by surface measures. 2) Instead of a fast marching method we use a variational approach to fill in the missing information.
3 Intrinsic Dimensions
According to [9] the notion ’intrinsic dimension’ is defined as follows: ’a data set in n dimensions is said to have an intrinsic dimensionality equal to d if the data lies entirely within a d-dimensional subspace’. It has first been applied to image processing by Zetzsche and Barth in [10] in order to distinguish between edge-like and corner-like structures in an image. Such information can be used to identify reliable locations, e.g. corners, in an image sequence for optical flow computation, tracking and registration. An equivalent definition of ’intrinsic dimension’ in image patches is based on its spectrum assigning the identifier
– i0d if the spectrum consists of a single point (homogeneous image patch)
– i1d if the spectrum is a line through the origin (edge in the image patch)
– i2d otherwise (e.g. edges and highly textured regions)

In [11,12] Barth has introduced the intrinsic dimensionality of three-dimensional image sequences and applied it to motion estimation, especially for multiple and transparent motions. Krüger and Felsberg [13] proposed a continuous formulation of the intrinsic dimension and introduced a triangular topological structure. Provided the assumption of constant brightness over time holds, motion of a single point corresponds to a line of constant brightness in the image sequence volume. Thus, the intrinsic dimension of locations in image sequences where motion takes place is lower than or equal to two. In the case of intrinsic dimension three the brightness constancy assumption is violated, which can be due to e.g. noise or occlusions. Only in the i2d case a reliable estimation of the motion vector is possible, since then the trajectory of the current pixel is the only subspace containing the same intensity. Otherwise, severe noise (i3d), aperture problems (i1d) or homogeneous regions (i0d) prevent accurate estimates. As indicated in [11] occlusions or transparent structures increase the intrinsic dimension by one, thus leading to the problem of occluded i1d locations misclassified as reliable i2d locations. Hence, the detection of i2d situations which do not originate from the occlusion of i1d situations would be beneficial for optical flow computation methods, as only in these cases reliable motion estimation is possible. This new situation will be denoted by 'i2d-o situation'.
4 Surface Measures
In order to estimate the accuracy of a given optical flow field we investigate the intrinsic dimension of invariance functions f : Ω × [0, T] × R² → R which evaluate the constancy of invariant image features at corresponding pixels in subsequent frames, e.g. the constancy of a) the brightness [14], b) the intensity, c) the gradient or d) the curvature at a given position x ∈ Ω × [0, T] with corresponding displacement vector u ∈ R²:

a) brightnessConst: f(x, u) = ∇I(x)ᵀ u(x) + ∂I(x)/∂t
b) ssdConst: f(x, u) = ||I(x) − Iw(x)||²_{l2}
c) gradConst: f(x, u) = ||∇I(x) − ∇Iw(x)||²_{l2}
d) hessConst: f(x, u) = ||H(x) − Hw(x)||²_{l2}
Here Iw and Hw denote the image sequence and the Hessian of the image sequence warped by the computed flow field u. The set of invariance functions will be denoted by E. A surface function for a given flow vector u reflects the variation of an invariance function f ∈ E over the set of modifications of the current displacement vector. Hence, it can be understood as an indicator for possible alternatives to the current displacement vector: Sx,u,f : R2 → [0, 1], Sx,u,f (d) := f (x, u + d) .
(1)
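A discretized surface function can be sampled as sketched below for the ssdConst invariance function; the pointwise evaluation, the bicubic sampling via scipy.ndimage.map_coordinates and the parameter names are our illustrative choices (step size h and grid size b correspond to the discretization discussed further below).

```python
import numpy as np
from scipy.ndimage import map_coordinates

def ssd_surface(frame0, frame1, x, u, b=13, h=0.5):
    """Surface S_{x,u,f}(d) of Eq. (1) for f = ssdConst, sampled on a b x b grid
    of displacement modifications d with step size h.
    frame0, frame1: consecutive grey-value frames; x: integer pixel position
    (row, col); u: flow vector (row, col) at x."""
    offsets = (np.arange(b) - b // 2) * h
    d_row, d_col = np.meshgrid(offsets, offsets, indexing="ij")
    rows = x[0] + u[0] + d_row
    cols = x[1] + u[1] + d_col
    warped = map_coordinates(frame1, [rows.ravel(), cols.ravel()],
                             order=3, mode="nearest").reshape(b, b)
    return (frame0[x[0], x[1]] - warped) ** 2
```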
The surface functions are used to derive surface measures for the detection of the i2d-o situation based on any given invariance function f ∈ E and the following theoretical considerations. Low surface function values Sx,u,f(d) for several displacements d denote uncertainty in the flow estimation, such as in the case of homogeneous regions and aperture problems. These situations can be detected by analyzing the curvature of the surface function along its principal axes, which is small along both axes for homogeneous regions and along one principal axis for aperture problems. In the case of occlusion, transparent structures and noise the minimum of the surface function is usually high, indicating that no good estimate is possible at all. In contrast, a single, low minimum suggests a unique, reliable displacement vector. Hence, the intrinsic dimension of the surface function together with its minimum value yield information on the reliability of optical flow estimates. Only in the case of intrinsic dimension two and a low minimum value an optical flow estimate can be understood as reliable due to missing alternatives and a low value of the invariance function. The case of i1d and i0d can be detected by at least one low curvature value of the principal surface axes. The case of occlusion, severe noise and transparent structures yields a high minimum value of the surface function as none of the modified displacement vectors fulfills the assumption of the invariance function. Therefore, the confidence value should always be close to 1 if the smaller curvature value cS is high and the surface minimum mS is low. Let S be the set of surface functions defined in Equation (1). Then the surface measure can be defined as

mf : Ω × R² → [0, 1],   mf(x, u) := ϕ(Sx,u,f)   (2)
ϕ : S → [0, 1],   ϕ(Sx,u,f) := 1/(1 + mS) · ( 1 − 1/(1 + τ·cS²) ).   (3)
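A minimal sketch of Eqs. (2)–(3) for one discretized surface is given below; approximating mS by the plain surface minimum and cS by the smaller Hessian eigenvalue at that minimum simplifies the robust procedure described in the following paragraphs, and τ = 60 is the value used in this paper.

```python
import numpy as np

def surface_measure(S, tau=60.0):
    """phi(S) = 1/(1+m_S) * (1 - 1/(1 + tau*c_S^2)), cf. Eqs. (2)-(3)."""
    i, j = np.unravel_index(np.argmin(S), S.shape)
    m_S = S[i, j]
    # evaluate the finite-difference Hessian at (a clipped copy of) the minimum
    i = int(np.clip(i, 1, S.shape[0] - 2))
    j = int(np.clip(j, 1, S.shape[1] - 2))
    sxx = S[i, j + 1] - 2 * S[i, j] + S[i, j - 1]
    syy = S[i + 1, j] - 2 * S[i, j] + S[i - 1, j]
    sxy = 0.25 * (S[i + 1, j + 1] - S[i + 1, j - 1]
                  - S[i - 1, j + 1] + S[i - 1, j - 1])
    c_S = np.min(np.abs(np.linalg.eigvalsh(np.array([[sxx, sxy], [sxy, syy]]))))
    return (1.0 / (1.0 + m_S)) * (1.0 - 1.0 / (1.0 + tau * c_S ** 2))
```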
τ ∈ R+ is used to scale the influence of the minimum curvature value. In this paper we use τ = 60. The discretization of the surface measure has a large influence on the quality of the result. To discretize a surface function Sx,u,f we use a step size h denoting the distance between two surface points and a fixed size b referring to the number of surface points in horizontal and vertical direction after discretization, e.g. h = 0.5, b = 13 yielded good results. For high accuracy h is chosen between 0 and 1 and bicubic interpolation is used. Examples for discretized surface functions are shown in Figure 1. To obtain robustness towards noise we weight the surface by a Gaussian function centered on the central pixel before choosing the minimum value mS and ignore all surface positions separated from the minimum by a local maximum for
Fig. 1. Discretized surface functions: a) i0d, b),c) i1d, d) i2d
the calculation of the principal surface axes. Since the eigenvalues of the Hessian yield noisy curvature estimates, a robust curvature estimator is introduced. It averages n curvature values along the principal axis using the following filter mask: (1/n)·( 1 … 1, −2n, 1 … 1 ), where the runs of ones have length n.

5 Motion Inpainting
We will show that sparsification based on the information contained in a surface measure map with subsequent motion inpainting improves the flow fields calculated with the combined local global and the structure tensor method on our test sequences (the well-known Marble and Yosemite sequences as well as the Street and Office sequences [15]). Motion inpainting is formulated as the minimization of a functional, which can be optimized by means of a variational approach. Let ω ⊂ Ω × [0, T] be a closed subset of the image domain, where the flow field is to be reconstructed. This is the case for all pixel positions which have been classified as unreliable by the surface measure. Let ∂ω be the boundary of ω, and let u* be the reconstructed flow field within the region ω. Then u* is the solution to the minimization problem

min ∫_ω ||∇₃ u*||²   with   u*|_∂ω = u|_∂ω.   (4)

Here ∇₃ means the spatio-temporal gradient. The minimizer satisfies the following system of Euler–Lagrange equations consisting of the Laplacian of each component of the flow field:

Δu^i = u^i_xx + u^i_yy + u^i_tt = 0,   i ∈ {1, 2},   u*|_∂ω = u|_∂ω.   (5)
We discretize the Euler-Lagrange equations using finite differences, von Neumann boundary conditions and the three-dimensional seven point Laplace stencil. The resulting linear system of equations is solved by successive overrelaxation.
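The following sketch fills in the masked flow vectors with a simple over-relaxed Jacobi sweep over the seven-point Laplace stencil; it stands in for the successive over-relaxation solver mentioned above, ignores points on the volume border, and uses arbitrary iteration parameters.

```python
import numpy as np

def inpaint_flow(u, mask, omega=1.5, n_iter=500):
    """Reconstruct flow inside `mask` by solving Eq. (5).
    u: flow field of shape (T, H, W, 2); values outside the mask stay fixed
    and act as boundary values. mask: boolean (T, H, W), True = unreliable."""
    u = u.copy()
    t, y, x = np.nonzero(mask)
    inside = ((t > 0) & (t < u.shape[0] - 1) &
              (y > 0) & (y < u.shape[1] - 1) &
              (x > 0) & (x < u.shape[2] - 1))
    t, y, x = t[inside], y[inside], x[inside]
    for _ in range(n_iter):
        for c in range(2):                               # both flow components
            neigh = (u[t + 1, y, x, c] + u[t - 1, y, x, c] +
                     u[t, y + 1, x, c] + u[t, y - 1, x, c] +
                     u[t, y, x + 1, c] + u[t, y, x - 1, c]) / 6.0
            u[t, y, x, c] += omega * (neigh - u[t, y, x, c])
    return u
```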
6 Experiments and Results
To evaluate our results we first compare the quality of the surface measures to the previously known confidence measures detecting i2d situations and show that all of our surface measures - independent of the underlying invariance function - perform better than the best previously proposed measures and are robust to noise as well. Then we show motion inpainting results which demonstrate the usefulness of the suggested postprocessing method.

6.1 Comparison to i2d Measures
As test sequence we use the synthetic sequence on the left in Figure 2, as it contains every intrinsic dimension and occlusions of every intrinsic dimension.
[Plot: error measure r_in + r_out as a function of the image noise level (0 to 5) for the surface measures SSM-brightnessConst, SSM-gradConst, SSM-hessConst, SSM-laplaceConst, SSM-gradNormConst, SSM-hessNormConst, SSM-ssd and the known confidence measures structMultMotion, structCc, structMinorsAngle and ssdSurface (Anandan)]
Fig. 2. Left: test sequence for surface measures containing four different intrinsic dimensions and the case of augmented intrinsic dimensions due to occlusion of the upper half in the lower half; Right: Comparison of surface measures (SSM-measures) based on different invariance functions for the recognition of i2d-o situations to known confidence measures for increasing noise levels
In this way we can examine if the surface measures are able to recognize the i2d-o situation and if they can distinguish it from occluded lower intrinsic dimensions. The patterns are moving to the lower right, and the lower half of the sequence is occluded by the sine pattern in the current frame in order to obtain examples of increased intrinsic dimensions due to occlusion. To obtain numerical results let g(x) denote the ground truth at position x. We compute two values expressing the average deviation of the measure for the set of pixels P where the i2d-o situation is present and for the set of pixels Q where the situation is not present. The sum of both values serves as error measure:

r = r_in + r_out = (1/|P|) Σ_{x∈P} |mc(x, u(x)) − g(x)| + (1/|Q|) Σ_{x∈Q} |mc(x, u(x)) − g(x)|   (6)

As no measures are known for the detection of the i2d-o situation we compare our surface measures to the best known measures for the i2d situation: structMultMotion derived from [16], structCc [1], Anandan's measure [17] and structMinorsAngle [18]. Figure 2 (right) shows the error measure r plotted against an increasing noise level σ ∈ [0, 5] in the test sequence. The proposed surface measures are labeled by the prefix "SSM" and an abbreviation of the invariance function they are based on. We can see that the proposed surface measures generally perform better than the best previously proposed i2d measures for any underlying invariance function f. All surface measures are robust to noise, but depend on the robustness of the underlying invariance function. The susceptibility to noise increases with the order of the derivatives in the invariance function.
However, the influence of noise on the surface measures is limited by the robust curvature estimation along the principal axes.
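For reference, the error measure of Eq. (6) can be computed as in the short sketch below (the array names are ours).

```python
import numpy as np

def error_measure(m, g, in_P):
    """r = r_in + r_out of Eq. (6).
    m: measure values m_c(x, u(x)); g: ground truth g(x);
    in_P: boolean mask, True on the pixel set P, False on Q."""
    dev = np.abs(m - g)
    return dev[in_P].mean() + dev[~in_P].mean()
```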
6.2 Application to Other Test Sequences
For further validation of the surface measures we apply them to standard test sequences. As no ground truth concerning the i2d-o situation is available for these sequences, only a visual evaluation is feasible. Figure 3 a)-f) shows six different cropped regions of the Marble sequence and the corresponding surface measure result based on the brightness constancy invariance function (brightnessConst). In Figure 3 a), b) and c) we can see the application of the surface measure to different textures. In a) and b) the Marble blocks show only very little texture which makes these regions unreliable for flow estimation. In contrast, most parts of the block texture in c) are classified as sufficient for a reliable flow computation. In d) and e) we can see examples of aperture problems (i1d). The diagonal line on the table as well as the edges of the flagstones in the background of the sequence are typical examples for this situation. Both are recognized well by the surface measure. The corners of the flagstones are correctly recognized as i2d-o regions. The table region in f) is partially recognized as i2d-o and partially as i0d. This is due to the larger homogeneous regions in the table texture, as here the result depends on the size of the surface considered. If the whole surface function lies within the homogeneous region, the curvature along the main axis is 0 and thus the surface measure result as well. To demonstrate that our surface measures can also detect occlusions, we
Fig. 3. Top: Cropped Marble sequence regions with result of brightness constancy surface measure for the recognition of i2d-o situations scaled to [0,1]; a), b), c) texture of blocks (i2d-o/i0d), d) diagonal table line (i1d), e) flagstones in the background (i1d, i2d-o at corners), f) table (i2d-o/i0d). Bottom: Office sequence with additional lens flare and result of the SSD constancy Surface Measure correctly identifying the occlusion
use the cropped Office sequence [15] with an additional lens flare occluding part of the background in Figure 3 (this kind of lens flare often poses problems e.g. in traffic scenes). The brightness constancy surface measure detects this region.

6.3 Motion Inpainting
To evaluate the performance of the motion inpainting method we use a surface measure map to sparsify flow fields calculated on four ground truth sequences (Marble, Yosemite, Street and Office) by the three dimensional linear CLG method by Bruhn et al. [19] (a widely used global method) and by the structure tensor method by Big¨ un [20] (a fast local method). We apply motion inpainting to the sparsified displacement fields in order to reconstruct the flow at pixels with low surface measure values. We demonstrate that the angular error [4] is reduced significantly by means of motion inpainting. Table 1 shows the average angular error and standard deviation over ten frames for the sparsification and reconstruction of the flow field for the best previously proposed confidence measure (structMultMotion) and the new surface measures. For sparsification, we chose the flow field density optimal for motion inpainting with respect to the angular error. Concerning the quality of the proposed measures, we can draw several conclusions from the results presented in Table 1. Firstly, the average angular error of the motion inpainting algorithm based on the surface measures is lower than the error we obtain based on the best previously proposed confidence measure. Hence, using the surface measures we can make more accurate statements on the reliability of the flow estimation than by means of previous i2d confidence measures. Secondly, the average angular error after motion inpainting is lower than the original angular error for the CLG and the structure tensor method. Thus, we conclude that the remaining flow vectors after sparsification contain all relevant information of the original flow field, and that any other information is dispensable, even obstructive, for the computation of a 100% dense flow field. Table 1. Angular error for four test sequences for original field, sparsified field with given density, result of motion inpainting based on best surface measure and result of motion inpainting based on previously best confidence measure (structMultMotion), averaged over ten frames for the CLG and the structure tensor (ST) method; the density indicates the density optimal for motion inpainting CLG Marble Yosemite Street Office ST Marble Yosemite Street Office
CLG        original       sparsification  density (%)  inpainting     previously best
Marble     3.88 ± 3.39    3.59 ± 3.03     70.6         3.87 ± 3.38    3.88 ± 3.39
Yosemite   4.13 ± 3.36    2.78 ± 2.24     20.7         3.85 ± 3.00    4.13 ± 3.36
Street     8.01 ± 15.47   2.77 ± 2.52     11.5         7.73 ± 16.23   7.99 ± 15.48
Office     3.74 ± 3.93    3.25 ± 4.80     26.7         3.59 ± 3.93    3.62 ± 3.91

ST         original       sparsification  density (%)  inpainting     previously best
Marble     4.49 ± 6.49    2.96 ± 2.25     42.3         3.40 ± 3.56    3.88 ± 4.89
Yosemite   4.52 ± 10.10   2.90 ± 3.49     37.5         2.76 ± 3.94    4.23 ± 9.18
Street     5.97 ± 16.92   2.07 ± 5.61     34.6         4.95 ± 13.23   5.69 ± 16.47
Office     7.21 ± 11.82   2.59 ± 4.32     5.1          4.48 ± 4.49    6.35 ± 10.14
Finally, the table also indicates the average angular error for the sparsification of the flow field by means of the surface measures. Here we chose the sparsification density which has been found optimal for motion inpainting. The sparsification error is lower than the motion inpainting error and can be achieved if a dense flow field is not required. Hence, we have demonstrated that the quality of the surface measures is superior to previous measures and that the information contained in the remaining flow field after sparsification is sufficient for reconstruction. Concerning the results of the motion inpainting algorithm we can draw the following conclusions. For both the CLG and the structure tensor method the sparsification of the flow field based on surface measures and subsequent inpainting yields lower angular errors than the original methods for all test sequences. The results of the local structure tensor method after motion inpainting are even superior to the original and the inpainted global CLG method in all cases but one. Therefore, we can conclude that - in contrast to the accepted opinion which favors global methods over local methods if dense flow fields are required - the filling-in effect of global methods is not necessarily beneficial for obtaining an accurate dense flow field. Instead, local and global methods alike can lead to better results if motion inpainting in combination with surface measures for sparsification is employed. Here, local methods often even seem preferable.
7 Summary and Conclusion
We have presented a method to estimate the feasibility of accurate optical flow computation. The proposed surface measures have proven robust to noise and are able to detect the situations where the full flow cannot be estimated reliably. They yield better results than previously proposed confidence measures and contain all relevant information for the reconstruction of the original flow field with even higher quality. Based on these measures we sparsified the original locally or globally computed flow field and filled in the missing flow vectors by a basic motion inpainting algorithm. Tests have been conducted using the CLG method and the structure tensor method on four standard test sequences. For our test sequences we can conclude that the application of a postprocessing method to sparsified flow fields calculated with local or global methods yields better results than can be achieved by exploiting the filling-in effect of global methods. Hence, in contrast to the accepted opinion, global methods are not always preferable to local methods if a dense flow field is required, because motion inpainting only based on reliable flow vectors can lead to superior results.
References
1. Haussecker, H., Spies, H.: Motion. In: Jähne, B., Haussecker, H., Geissler, P. (eds.) Handbook of Computer Vision and Applications, ch. 13, vol. 2. Academic Press, London (1999)
2. Bruhn, A., Weickert, J.: A Confidence Measure for Variational Optic Flow Methods, pp. 283–298. Springer, Heidelberg (2006)
3. Kondermann, C., Kondermann, D., J¨ ahne, B., Garbe, C.: An adaptive confidence measure for optical flows based on linear subspace projections. In: Hamprecht, F.A., Schn¨ orr, C., J¨ ahne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 132–141. Springer, Heidelberg (2007) 4. Barron, J.L., Fleet, D.J., Beauchemin, S.: Performance of optical flow techniques. International Journal of Computer Vision 12(1), 43–77 (1994) 5. Bainbridge-Smith, R., Lane, A.: Measuring confidence in optical flow estimation. IEEE Electronics Letters 32(10), 882–884 (1996) 6. Rosenberg, A., Werman, M.: Representing local motion as a probability distribution matrix applied to object tracking. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition, pp. 654–659 (1997) 7. Irani, M., Anandan, P.: Robust multi-sensor image alignment. In: Proceedings of the International Conference on Computer Vision, pp. 959–966 (1998) 8. Matsushita, Y., Ofek, E., Tang, X., Shum, H.: Full-frame video stabilization. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 50–57 (2005) 9. Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press, New York (1995) 10. Zetzsche, C., Barth, E.: Fundamental limits of linear filters in the visual processing of two dimensional signals. Vision Research 30(7), 1111–1117 (1990) 11. Barth, E.: Bewegung als intrinsische geometrie von bildfolgen. In: Proceedings of the German Association for Pattern Recognition (DAGM) (1999) 12. Barth, E., Stuke, I., Aach, T., Mota, C.: Spatio-temporal motion estimation for transparency and occlusions. In: Proceedings of the International Conference on Image Processing (ICIP), vol. 3, pp. 69–72 (2003) 13. Kr¨ uger, N., Felsberg, M.: A continuous formulation of intrinsic dimension. In: British Machine Vision Conference (2003) 14. Horn, B., Schunck, B.: Determining optical flow. Artificial Intelligence 17, 185–204 (1981) 15. McCane, B., Novins, K., Crannitch, D., Galvin, B.: On benchmarking optical flow (2001), http://of-eval.sourceforge.net/ 16. Mota, C., Stuke, I., Barth, E.: Analytical solutions for multiple motions. In: Proceedings of the International Conference on Image Processing ICIP (2001) 17. Anandan, P.: A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision 2, 283–319 (1989) 18. Barth, E.: The minors of the structure tensor. In: Proceedings of the DAGM (2000) 19. Bruhn, A., Weickert, J., Schn¨ orr, C.: Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. International Journal of Computer Vision 61(3), 211–231 (2005) 20. Big¨ un, J., Granlund, G.H., Wiklund, J.: Multidimensional orientation estimation with applications to texture analysis and optical flow. IEEE Journal of Pattern Analysis and Machine Intelligence 13(8), 775–790 (1991)
An Evolutionary Approach for Learning Motion Class Patterns Meinard M¨ uller1 , Bastian Demuth2 , and Bodo Rosenhahn1 1
Max-Planck-Institut f¨ ur Informatik Campus E1-4, 66123 Saarbr¨ ucken, Germany [email protected] 2 Universit¨ at Bonn, Institut f¨ ur Informatik III R¨ omerstr. 164, 53117 Bonn, Germany
Abstract. This article presents a genetic learning algorithm to derive discrete patterns that can be used for classification and retrieval of 3D motion capture data. Based on boolean motion features, the idea is to learn motion class patterns in an evolutionary process with the objective to discriminate a given set of positive from a given set of negative training motions. Here, the fitness of a pattern is measured with respect to precision and recall in a retrieval scenario, where the pattern is used as a motion query. Our experiments show that motion class patterns can automate query specification without loss of retrieval quality.
1 Introduction
Motion capture or mocap systems [3] allow for tracking and recording of human motions at high spatial and temporal resolutions. The resulting 3D mocap data is used for motion analysis and synthesis in fields such as sports sciences, biomechanics, and computer animation [2,4,9]. Furthermore, in computer vision, it has also been used as prior knowledge for human tracking [10,11,12]. Even though there is a rapidly growing corpus of freely available mocap data, there still is a lack of efficient motion retrieval systems that work in a purely content-based fashion without relying on manually generated annotations. Here, the main difficulty is due to the fact that similar types of motions may exhibit significant spatial as well as temporal variations [2,4]. To cope with spatial variations, Müller et al. [6] have introduced the concept of relational features that capture semantically meaningful boolean relations between specified points of the kinematic chain underlying the mocap data. For example, such features may express whether a hand is raised or not or whether certain parts of the body such as legs are bent or stretched, see Fig. 1a. Furthermore, to cope with temporal variations, a feature-dependent temporal segmentation is used by merging adjacent motion frames that satisfy the same boolean relations, see Fig. 1b. Motion retrieval can then be performed very efficiently by using standard index-based string matching techniques, see [6]. One remaining major problem is that the retrieval performance heavily depends on the query formulation, which involves manual and time-consuming specification of a query-dependent feature selection.
Fig. 1. (a) Relational feature checking whether the right knee is bent or stretched. (b) Segmentation of a parallel leg jumping motion D = Djump with respect to a combination of the features "left knee bent" and "right knee bent". Poses assuming the same feature values are indicated by identically marked trajectory segments. The trajectories of the body points "top of head" and "right ankle" are shown.
The main contribution of this paper is to overcome this problem by applying a genetic algorithm to learn motion class patterns from positive and negative training motions. Such a pattern can be thought of as an automated, locally adaptive feature selection for the motion class represented by the positive training motions. In the experiments, we will demonstrate that the retrieval performance of automatically learned patterns is similar to manually selected and heavily tuned patterns. The paper is organized as follows. In Sect. 2, we briefly review the underlying motion retrieval procedure and fix the notation. Then, in Sect. 3, we describe in detail our proposed evolutionary learning algorithm. Finally, in Sect. 4, we summarize our experiments and conclude in Sect. 5 with prospects on future work. Further references will be given in the respective sections.
2 Motion Representation and Retrieval
In this section, we summarize the motion retrieval concept described in [6], while introducing some notation. In the following, a mocap data stream D is regarded as a sequence of data frames, each representing a human pose as illustrated by Fig. 1b. A human pose is encoded in terms of a global position and orientation as well as joint angles of an underlying skeleton model. Mathematically, a relational feature is a boolean function from the set of poses to the set {0, 1}. In the following, we fix a feature function F = (F1 , . . . , Ff ), which consists of f such relational features. A feature function is applied to a mocap data stream D in a pose-wise fashion. An F -segment of D is defined to be a subsequence of D of maximal length consisting of consecutive frames that exhibit the same F -feature values. Picking for each segment one representative feature vector one obtains a so-called F -feature sequence of D, which is denoted by F [D]. We also represent such a sequence by a matrix MF [D] ∈ {0, 1}f ×K , where K denotes the number of F -segments and the columns of MF [D] correspond to the feature vectors in F [D]. For example, consider a feature function F = (F1 , F2 ) that consists of f = 2 relational features, the first checking whether the left and the second
checking whether the right knee is bent or stretched. Applying F to the jumping motion D = Djump as shown in Fig. 1b results in the feature sequence

F[D] = ( (1,1)ᵀ, (0,1)ᵀ, (0,0)ᵀ, (0,1)ᵀ, (1,1)ᵀ )   and   MF[D] = ( 1 0 0 0 1 ; 1 1 0 1 1 ),   (1)

where the five columns correspond to the K = 5 motion segments. Such a feature sequence or matrix can then be used as a query for efficient index-based motion retrieval as described in [6]. Note that for the parallel leg jumping motion above, first both knees are bent, then stretched, and finally bent again. This information is encoded by the first, third, and fifth column of F[D]. The feature vectors (0,1)ᵀ in the second and fourth column arise because the actor does not bend or stretch both legs at the same time. Instead, he has a tendency to keep the right leg bent a bit longer than the left leg. However, in a parallel leg jumping motion, there are many possible transitions from, e.g., (1,1)ᵀ (legs bent) to (0,0)ᵀ (legs stretched), such as (1,1)ᵀ → (0,0)ᵀ, (1,1)ᵀ → (0,1)ᵀ → (0,0)ᵀ, or (1,1)ᵀ → (1,0)ᵀ → (0,1)ᵀ → (0,0)ᵀ. To account for such irrelevant transitions, Müller et al. [6] introduce some fault tolerance mechanism based on fuzzy sets. The idea is to allow at each position in the query sequence a whole set of possible, alternative feature vectors instead of a single one. In the following, we encode alternative feature values by asterisks in the matrix representation MF[D]. For example, the second column of the matrix ( 1 ∗ 0 ∗ 1 ; 1 ∗ 0 1 1 ) encodes the set V2 = { (0,0)ᵀ, (0,1)ᵀ, (1,0)ᵀ, (1,1)ᵀ } of four alternative feature vectors and the fourth column the set V4 = { (0,1)ᵀ, (1,1)ᵀ }. The asterisks can be used to mask out irrelevant transitions between the alternating vectors (1,1)ᵀ and (0,0)ᵀ from the original feature sequence. More generally, we consider matrices X ∈ {0, 1, ∗}^{f×K}, which are referred to as motion class patterns. As indicated by the example above, the k-th column X(k) of X encodes a subset Vk ⊆ {0,1}^f of alternative feature vectors, also referred to as fuzzy set. Furthermore, X encodes a sequence V(X) := (V1, ..., VK) of fuzzy sets, also referred to as fuzzy query. Such fuzzy queries¹ can be efficiently processed using an index that is based on inverted lists, see [6].
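Two of the ingredients above can be made concrete with the following sketch: merging pose-wise feature vectors into the segment matrix MF[D], and expanding a pattern column over {0, 1, ∗} into its fuzzy set Vk. The index-based matching of [6] itself is not reproduced here, and the function names are ours.

```python
import numpy as np
from itertools import product

def feature_sequence(F_of_D):
    """Merge consecutive identical columns of the pose-wise feature matrix
    (shape f x #frames) into the segment-wise matrix M_F[D]."""
    cols = [F_of_D[:, 0]]
    for k in range(1, F_of_D.shape[1]):
        if not np.array_equal(F_of_D[:, k], cols[-1]):
            cols.append(F_of_D[:, k])
    return np.stack(cols, axis=1)

def fuzzy_set(column):
    """Expand one pattern column over {0, 1, '*'} into the set V_k of
    admissible feature vectors, e.g. ('1', '*') -> {(1, 0), (1, 1)}."""
    choices = [(0, 1) if c == '*' else (int(c),) for c in column]
    return set(product(*choices))
```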
3 Genetic Learning Algorithm
Let T+ be a set of positive training motions representing a certain class of semantically related motions. For example, T+ may consist of several parallel leg jumping motions performed by different actors and in various styles. Furthermore, let T− be a set of negative training motions that are not considered to be in the motion class of interest. For example, T− may contain one leg jumping motions, jumping jacks, or walking motions. The overall goal is to learn a motion class pattern that suitably characterizes the motion class represented by T+. In this section, we present a genetic algorithm to automatically derive such

¹ Note that one can enforce the admissible condition needed in the matching algorithm by taking suitable complements and unions of neighboring fuzzy sets, see [4,6].
Input:
  T+ : Set of positive training motions.
  T− : Set of negative training motions.
  F  : Feature function consisting of f components F1, ..., Ff.
  p  : Size of the population, where p > f.
  μ  : Rate of mutation.
  r  : Number of parents used in the recombination step.
  s  : Number of fittest offsprings used in the next generation, where s < p.
  G  : Number of generations.

Initialization: Compute indices I^F_{T+} and I^F_{T−} for the training sets, see [6]. Choose an initial population Π(0) of size |Π(0)| = p.

Evolution: Repeat the following procedure for g = 0, ..., G − 1:
(1) Compute the fitness of the individuals in the population Π(g).
(2) Select r individuals as parents based on universal stochastic sampling.
(3) Generate r(r − 1)/2 offsprings by pairwise recombination of the parents.
(4) Modify offspring via mutation with respect to the mutation rate μ.
(5) Compute the fitness of all resulting offspring and pick the s fittest ones.
(6) Replace the s individuals in Π(g) exhibiting the lowest fitness by the s offsprings picked in (5). The resulting population is denoted by Π(g+1).

Output: Fittest individual in Π(G).
Fig. 2. Genetic algorithm for learning motion class patterns
pattern from T+ and T−. Generally, a genetic algorithm is a population-based optimization technique to find approximate solutions to optimization problems, see [8]. In our case, the optimization criterion is based on the retrieval performance in terms of precision and recall when using the motion class pattern as fuzzy query. Fig. 2 gives an overview of our genetic learning algorithm, which is explained in the subsequent subsections. In the following, we fix a feature function F = (F1, ..., Ff) (typically consisting of 10–20 components) and omit F in the notation by writing M[D] instead of MF[D].

3.1 A Model for Individuals
A population consists of a set of individuals, where each individual is a description of some motion class pattern representing a candidate solution of the optimization problem. In our scenario, an individual Ind is defined to be a tuple Ind := (D, Feat, Col, Fix). Here, D ∈ T + denotes a reference motion and Feat ⊆ [1 : f ] represents a selection of features comprised in the feature function F = (F1 , . . . , Ff ). From these two parameters, one obtains a motion class pattern M [D, Feat] with entries in Σ = {0, 1, ∗} as follows. First, the feature sequence matrix M [D] ∈ {0, 1}f ×K is modified by replacing all entries with a row index in [1 : f ] \ Feat by the entry ∗. Then, any coinciding consecutive columns are replaced by a single column yielding the matrix M [D, Feat] having L columns, L ≤ K. The component Col is a subset Col ⊆ [1 : L], which models the transition segments. Furthermore, the component Fix is a map Fix : Col → 2Feat that assigns to each column ∈ Col a subset Fix() ⊆ Feat. Intuitively, the function Fix is used to fix and blend out certain entries in M [D, Feat]. More precisely, all entries of M [D, Feat] having some column index ∈ Col and some row index in Feat \ Fix() are replaced by the entry ∗. The resulting matrix is denoted by M [Ind]. For example, let K = 6, f = 4, and Feat = {1, 3, 4}. Then, for the
pattern M[D] given in (2), one obtains L = 5. Furthermore, let Col = {2, 4} with Fix(2) = ∅ and Fix(4) = {1, 4}. Then one obtains the following patterns M[D, Feat] and M[Ind]:

M[D] =
  0 0 0 0 0 0
  1 1 1 0 1 0
  1 0 0 0 1 1
  1 1 0 1 1 1

M[D, Feat] =
  0 0 0 0 0
  ∗ ∗ ∗ ∗ ∗
  1 0 0 0 1
  1 1 0 1 1

M[Ind] =
  0 ∗ 0 0 0
  ∗ ∗ ∗ ∗ ∗
  1 ∗ 0 ∗ 1
  1 ∗ 0 1 1     (2)

3.2 Fitness Function
We now define a fitness function Fit that measures the quality of a given individual. First, an individual Ind is transformed into a fuzzy query V := V(M[Ind]). The resulting fuzzy query is evaluated on the union T := T+ ∪ T−. The fitness function assigns a high fitness to the individual if the associated fuzzy query leads to a high recall as well as a high precision. More precisely, let H(Ind) ⊂ T denote the subset of motion documents retrieved by V. The recall value R and the precision value P with respect to V are defined as

R := R(Ind) := |H(Ind) ∩ T+| / |T+|   and   P := P(Ind) := |H(Ind) ∩ T+| / |H(Ind)|.   (3)
Obviously, 0 ≤ R ≤ 1 and 0 ≤ P ≤ 1. Note that a high recall value R implies that many motions from T+ are retrieved by V, whereas a high precision value P(Ind) implies that most motions from T− are rejected by V. It is the objective to identify fuzzy queries, which simultaneously maximize R and P. As described later, our genetic algorithm will be initialized by individuals typically revealing high recall but low precision values. In the course of the evolutionary process more and more specialized individuals will enter the population. Therefore, at the beginning of the process the improvement of the precision values is of major concern, while with increasing generations the recall values gain more and more importance. This motivates the following definition of a fitness function, which depends on the number g of generations already performed:

Fit(g, Ind) := P(Ind) · R(Ind)^{min(4, √g)}   (4)

for g ∈ N. Note that 0 ≤ Fit(g, Ind) ≤ 1. The formula min(4, √g) in the exponent of R(Ind) puts an increasing emphasis on the recall value for increasing g and has been determined experimentally.
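A direct transcription of Eqs. (3) and (4) could look as follows (the guard against empty hit sets is our addition).

```python
def fitness(g, hits, positives):
    """Fit(g, Ind) = P(Ind) * R(Ind)^min(4, sqrt(g)).
    hits: set of retrieved motions H(Ind); positives: the set T+."""
    if not hits:
        return 0.0
    tp = len(hits & positives)
    recall = tp / len(positives)
    precision = tp / len(hits)
    return precision * recall ** min(4, g ** 0.5)
```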
3.3 Initialization
For the start, we generate an initial population Π (0) of size p > f . First, we define f individuals Indn = (Dn , Featn , Coln ), n ∈ [1 : f ], based on some randomly chosen but fixed reference motion Dn := D0 ∈ T + . Furthermore, we set Featn := {n} and Col := ∅ with trivial Fix. The entries of M [Indn ] have the value ∗ except for the entries in row n, which alternately assume the values 0 and 1. For the remaining p − f elements of Π (0) , we generate individuals Ind = (D, Feat, Col, Fix) with a randomly chosen reference motion D ∈ T + , a randomly chosen set Feat consisting of two elements in [1 : f ], and Col = ∅.
Fig. 3. An example for universal stochastic sampling selecting r = 4 individuals from a population of size p = 8. The selected individuals are Ind1 , Ind3 , Ind4 , and Ind7 .
3.4 Genetic Operations
We now describe the genetic operations of selection, recombination, and mutation, which are needed to breed a new population from a given population.

Selection. In each step of our genetic algorithm, we select r individuals (the parents), which are used to create new individuals (the offsprings). To increase the possibility of obtaining offsprings of high fitness, one should obviously revert to parents that reveal the highest fitness values in the current population. On the other hand, to avoid an early convergence towards some poor local optimum, it is important to have some genetic diversity within the parents used in the reproduction process. Therefore, we use a selection process known as stochastic universal sampling, see [8]. Let Π(g) denote the current population. Then all individuals of Π(g) are arranged in a list (Ind1, ..., Indp) with decreasing fitness. The interval [0, 1] is partitioned into p subintervals, where the kth subinterval, 1 ≤ k ≤ p, has a length corresponding to the proportion of the fitness value of Indk to the sum of the fitness values over all individuals contained in Π(g). To select r individuals from Π(g), the interval [0, 1] is sampled in some equidistant fashion, where two neighboring samples have a distance of 1/r. Furthermore, the first sample is randomly chosen within the interval [0, 1/r], see Fig. 3. Each sample determines an individual according to the subinterval it is contained in. That way, each individual of Π(g) has a probability that is proportional to its relative fitness value to be selected for reproduction.

Recombination. In the recombination step two parent individuals are used to derive a new offspring individual by combining and mixing the properties (the genes) of the parents. In this process, we need a suitable notion of randomly choosing subsets from a given set. For a finite set A, we first randomly generate a natural number n with 1 ≤ n ≤ |A| by suitably rounding the output of some normally distributed random variable with expectation value μ and standard deviation σ. We then form a subset B of size n by randomly picking n elements of A with respect to a uniform distribution on A. For short, we write B ⊆rand^(μ,σ) A. In our recombination procedure, we use the parameters μ = 0.7|A| and σ = 0.3|A|. For these parameters, we will simply write B ⊆rand A. We now describe how to recombine two parents Ind1 := (D1, Feat1, Col1, Fix1) and Ind2 := (D2, Feat2, Col2, Fix2) in order to obtain a new offspring Ind := (D, Feat, Col, Fix). The following list defines the recombination operator depending on the degree of correspondence of the two parents.

1. Suppose D1 = D2, Feat1 = Feat2, and Col1 = Col2. Then set D := D1, Feat := Feat1, Col := Col1, and define Fix by randomly choosing
Fix(ℓ) ⊆rand Fix1(ℓ) ∪ Fix2(ℓ) for ℓ ∈ Feat. In case this results in Fix(ℓ) = Feat, the element ℓ is removed from Col and Fix is suitably restricted.
2. Suppose D1 = D2, Feat1 = Feat2, and Col1 ≠ Col2. Then set D := D1, Feat := Feat1, and choose Col ⊆rand Col1 ∪ Col2. Finally, define Fix by Fix(ℓ) := Fix1(ℓ) for ℓ ∈ Col ∩ Col1 and Fix(ℓ) := Fix2(ℓ) for ℓ ∈ Col \ Col1.
3. Suppose D1 = D2 and Feat1 ≠ Feat2. Then set D := D1 and choose Feat ⊆rand Feat1 ∪ Feat2. Note that Feat1, Feat2, and Feat generally induce different segmentations on D, which prevents us from directly transferring properties encoded by the parameters Col and Fix from the parents to the offspring. To transfer at least part of the information, we proceed as follows. Suppose D is segmented into K segments with respect to Feat1, then let (1 = s1 < s2 < ... < sK) be the indices of the start frames of the segments. Similarly, let (1 = t1 < t2 < ... < tL) be the start indices in the segmentation of D induced by Feat. Then all ℓ ∈ [1 : L] with sk = tℓ for some k ∈ Col1 are added to Col and Fix(ℓ) := Fix1(k). In other words, the offspring inherits all Fix1-properties from the first parent that refer to segments having the same starting frame with respect to Feat1 and Feat. Similarly, the offspring also inherits all Fix2-properties from the second parent.
4. Suppose D1 ≠ D2. Then randomly set D := D1 or D := D2, and choose Feat ⊆rand Feat1 ∪ Feat2. Finally, define Col := ∅ with trivial Fix.

Mutation. The concept of mutation introduces another degree of randomness in the reproduction process to ensure that all potential individuals have some probability to appear in the population, thus avoiding an early convergence towards some poor local optimum. Each of the offspring individuals Ind := (D, Feat, Col, Fix) is further modified in the following way. First, a number d ∈ N0 of modifications is determined by rounding the output of some exponentially distributed random variable with expectation value μ (in our experiments we chose μ = 2). Then, one successively performs d basic mutations, where the type of modification is chosen randomly from the following list of five types:
1. The reference motion D is replaced by a randomly chosen motion in T+. Furthermore, Col is replaced by the empty set with trivial Fix.
2. In case Feat ≠ [1 : f] (otherwise perform Step 3), the set Feat is extended by an additional element, which is randomly chosen from [1 : f] \ Feat. Note that this leads to a refined segmentation. Therefore, the parameters Col and Fix are adjusted by suitably relabeling segment indices.
3. In case |Feat| > 1 (otherwise perform Step 2), an element is randomly removed from Feat. This may lead to a coarsened segmentation. In this case, Col and Fix are adjusted as described in Step 3 of the recombination.
4. Let L be the number of columns of M[Ind]. In case |Col| < L − 1 (otherwise perform Step 5), the set Col is extended by an additional element ℓ randomly chosen from [1 : L] \ Col and Fix is extended by setting Fix(ℓ) := ∅.
5. In case Col ≠ ∅ (otherwise perform Step 4), an element ℓ ∈ Col is chosen randomly and the set Fix(ℓ) is extended by an element randomly chosen from Feat \ Fix(ℓ). In case this results in Fix(ℓ) = Feat, the element ℓ is removed from Col and Fix is suitably restricted.
3.5 Breeding the Next Generation
To create the population Π(g+1) from the population Π(g), we proceed as follows. First, r parent individuals are selected from Π(g). Any two of these r parent individuals are recombined to yield r(r−1)/2 offspring, which are then mutated individually. Then the fitness values are computed for the resulting offspring. Finally, the s fittest offspring are picked to replace the s individuals in Π(g) that exhibit the lowest fitness.
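For concreteness, one generation of this breeding scheme might look as follows. This is a schematic sketch under our own naming assumptions: recombine, mutate, and fitness stand for the operators described above, and stochastic_universal_sampling is the helper from the previous sketch; it is not the authors' code.

from itertools import combinations

def next_generation(population, fitness, recombine, mutate, r, s, rng):
    """One breeding step: select r parents, recombine all pairs, mutate, replace the s worst."""
    fit = [fitness(ind) for ind in population]
    parent_idx = stochastic_universal_sampling(fit, r, rng)
    parents = [population[i] for i in parent_idx]
    # every pair of parents yields one offspring, which is mutated individually
    offspring = [mutate(recombine(a, b)) for a, b in combinations(parents, 2)]
    offspring.sort(key=fitness, reverse=True)
    survivors = sorted(population, key=fitness, reverse=True)[: len(population) - s]
    return survivors + offspring[:s]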
4 Experiments
For our experiments, we used the HDM05 motion database, which consists of several hours of systematically recorded motion capture data [7]. From this data, we manually cut out suitable motion clips and arranged them into 64 different classes. Each such motion class (MC) contains 10 to 50 different realizations of the same type of motion, covering a broad spectrum of semantically meaningful variations. For example, the motion class “Parallel Leg Jump” contains 36 variations of a jumping motion with both legs. The resulting motion class database DMC contains 1,457 motion clips, amounting to 50 minutes of motion data. We now describe one of our experiments, conducted with 15 motion classes, see Table 1. For each of these motion classes, we automatically generated a set T+ of positive training motions consisting of one third of the available example motions of the respective class (e.g., 12 motions in the case of “Parallel Leg Jump”). Similarly, T− was generated by randomly choosing one third of the motions of the other 14 classes. Depending on the respective motion class, we either used a feature function Fℓ or a feature function Fu similar to the ones described in [6]; Fℓ has 11 components characterizing the lower body, whereas Fu has 12 components characterizing the upper body. In our experiments, the following parameters turned out to be suitable: we chose a population size of p = 50 and used r = 7 parents for the recombination. All resulting s = 28 offspring were used for replacement. Furthermore, G = 50 generations turned out to yield a good convergence. With these parameters, our evolutionary algorithm required on average 7.2 seconds per generation (using MATLAB 6.5 on an Athlon XP 1800+). Therefore, using G = 50, it took roughly six minutes to compute one motion class pattern. For each of the 15 motion classes, we performed the entire algorithm ten times. To determine the quality, the resulting patterns were used as fuzzy queries and evaluated on the entire motion class database DMC. Table 1 summarizes the retrieval results in terms of precision and recall. The first precision-recall (PR) pair for each of the 15 motion classes indicates the average PR values over the ten patterns. The second and third pairs show the best and worst PR values among the ten patterns. Finally, the fourth pair shows the PR values of a manually optimized query specification by a retrieval expert, indicating what seems to be achievable by a manual query process. For example, in the case of the class
Table 1. Retrieval results for motion retrieval based on ten automatically learned motion class patterns for each class (average, best, worst) and for motion retrieval based on an optimized manual query specification (manual)

                      Elbow-to-knee    Cartwheel        Jumping Jack     Parallel Leg Jump  One Leg Jump
recall (average)      25.6/27 = 0.95   19.3/21 = 0.92   49.5/52 = 0.95   25.3/36 = 0.70     40.0/42 = 0.95
precision (average)   0.92             0.83             0.99             0.56               0.94
recall (best)         26/27 = 0.96     21/21 = 1.00     52/52 = 1.00     28/36 = 0.78       41/42 = 0.98
precision (best)      26/26 = 1.00     21/21 = 1.00     52/52 = 1.00     28/44 = 0.64       41/43 = 0.95
recall (worst)        26/27 = 0.96     18/21 = 0.86     47/52 = 0.90     18/36 = 0.50       39/42 = 0.93
precision (worst)     26/33 = 0.79     18/199 = 0.09    47/48 = 0.98     18/28 = 0.64       39/42 = 0.93
recall (manual)       24/27 = 0.89     21/21 = 1.00     51/52 = 0.98     21/36 = 0.58       33/42 = 0.79
precision (manual)    24/26 = 0.92     21/21 = 1.00     51/55 = 0.93     21/34 = 0.62       33/182 = 0.18

                      Hit-on-Head      Right Kick       Sit-down         Lie-down           Rotate Arms
recall (average)      10.9/13 = 0.84   26.0/30 = 0.87   14.2/20 = 0.71   15.7/20 = 0.78     14.6/16 = 0.91
precision (average)   0.82             0.65             0.85             0.89               0.95
recall (best)         11/13 = 0.85     26/30 = 0.87     15/20 = 0.75     18/20 = 0.90       16/16 = 1.00
precision (best)      11/13 = 0.85     26/28 = 0.93     15/16 = 0.94     18/19 = 0.95       16/16 = 1.00
recall (worst)        10/13 = 0.77     20/30 = 0.67     13/20 = 0.65     11/20 = 0.55       14/16 = 0.88
precision (worst)     10/16 = 0.63     20/49 = 0.41     13/16 = 0.81     11/28 = 0.39       14/16 = 0.88
recall (manual)       12/13 = 0.92     25/30 = 0.83     16/20 = 0.80     19/20 = 0.95       16/16 = 1.00
precision (manual)    12/14 = 0.86     25/41 = 0.61     16/40 = 0.40     19/21 = 0.90       16/16 = 1.00

                      Climb Stairs     Walk             Walk Backwards   Walk Sideways      Walk Cross-over
recall (average)      26.1/28 = 0.93   14.5/16 = 0.91   11.4/15 = 0.76   14.8/16 = 0.93     11.1/13 = 0.85
precision (average)   0.96             0.43             0.52             0.97               0.78
recall (best)         27/28 = 0.96     15/16 = 0.94     12/15 = 0.80     15/16 = 0.94       13/13 = 1.00
precision (best)      27/27 = 1.00     15/30 = 0.50     12/19 = 0.63     15/15 = 1.00       13/16 = 0.81
recall (worst)        25/28 = 0.89     14/16 = 0.88     12/15 = 0.80     14/16 = 0.88       11/13 = 0.85
precision (worst)     25/29 = 0.86     14/52 = 0.27     12/43 = 0.28     14/14 = 1.00       11/22 = 0.50
recall (manual)       22/28 = 0.79     15/16 = 0.94     8/15 = 0.53      16/16 = 1.00       10/13 = 0.77
precision (manual)    22/23 = 0.96     15/33 = 0.45     8/22 = 0.36      16/17 = 0.94       10/13 = 0.77
“Elbow-to-knee”, on average 25.6 of the 27 correct motions were recovered by the motion class patterns, with a precision of 0.92. The best pattern recovered 26 of the 27 correct motions with a precision of 1.00. In the case of the optimized manual query specification, we obtained a recall of 24/27 = 0.89 and a precision of 0.92. From our experiments, we conclude that for a large number of whole-body movements our automatically learned motion class patterns are capable of producing retrieval results qualitatively similar to those of manual feature selection and query specification.
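The precision and recall values reported in Table 1 follow the usual retrieval definitions; for reference, a minimal helper (our own sketch, not from the paper) could read:

def precision_recall(retrieved, relevant):
    """Standard retrieval metrics: retrieved and relevant are sets of motion clip IDs."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. 26 hits among 26 retrieved clips out of 27 relevant ones gives
# precision 1.00 and recall 26/27 = 0.96, as in the "best" row for "Elbow-to-knee"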
5 Conclusions
In this paper, we have presented an evolutionary approach for learning motion class patterns, which offers an automated alternative to manual query specification without loss of retrieval quality. We introduced a novel genetic learning algorithm with a model for individuals, rules for selection, recombination, and mutation, as well as a suitable fitness function. In contrast to previous automated query specification approaches as described in [5], our motion class patterns can be directly plugged into the index-based matching scenario of [6], affording very efficient motion retrieval. For the future, we plan to employ similar genetic algorithms for automated keyframe selection as required in [5]. Another application we have in mind is the combination of efficient retrieval methods with learning approaches in order to generate suitable prior knowledge, as needed to stabilize and support human motion tracking [1].
References

1. Brox, T., Rosenhahn, B., Cremers, D., Seidel, H.-P.: Nonparametric density estimation with adaptive anisotropic kernels for human motion tracking. In: Elgammal, A., Rosenhahn, B., Klette, R. (eds.) Human Motion 2007. LNCS, vol. 4814, pp. 152–165. Springer, Heidelberg (2007)
2. Kovar, L., Gleicher, M.: Automated extraction and parameterization of motions in large data sets. ACM Trans. Graph. 23(3), 559–568 (2004)
3. Moeslund, T.B., Hilton, A., Krüger, V.: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104(2), 90–126 (2006)
4. Müller, M.: Information Retrieval for Music and Motion. Springer, Heidelberg (2007)
5. Müller, M., Röder, T.: Motion templates for automatic classification and retrieval of motion capture data. In: Proc. ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 137–146. ACM Press, New York (2006)
6. Müller, M., Röder, T., Clausen, M.: Efficient content-based retrieval of motion capture data. ACM Trans. Graph. 24(3), 677–685 (2005)
7. Müller, M., Röder, T., Clausen, M., Eberhardt, B., Krüger, B., Weber, A.: Documentation: Mocap Database HDM05. Computer Graphics Technical Report CG-2007-2, Universität Bonn (2007)
8. Pohlheim, H.: Evolutionäre Algorithmen: Verfahren, Operatoren und Hinweise. Springer, Heidelberg (1999)
9. Rosenhahn, B., Klette, R., Metaxas, D.: Human Motion Understanding, Modeling, Capture, and Animation. Springer, Heidelberg (2007)
10. Sidenbladh, H., Black, M.J., Sigal, L.: Implicit probabilistic models of human motion for synthesis and tracking. In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2353, pp. 784–800. Springer, Heidelberg (2002)
11. Sminchisescu, C., Jepson, A.: Generative modeling for continuous non-linearly embedded visual inference. In: Proc. Int. Conf. on Machine Learning (2004)
12. Urtasun, R., Fleet, D.J., Fua, P.: 3D people tracking with Gaussian process dynamical models. In: Proc. International Conference on Computer Vision and Pattern Recognition, pp. 238–245. IEEE Computer Society Press, Los Alamitos (2006)
Relative Pose Estimation from Two Circles

Stefan Rahmann and Veselin Dikov

MVTec Software GmbH, Neherstr. 1, 81675 München, Germany
{rahmann,dikov}@mvtec.com
Abstract. Motion estimation between two views especially from a minimal set of features is a central problem in computer vision. Although algorithms for computing the motion between two calibrated cameras using point features exist, there is up to now no solution for a scenario employing just circles in space. For this task a minimal and linear solver is presented using only two arbitrary circles. It is shown that the derived algorithm is not hampered by critical configurations. Here, the problem of relative pose estimation is cast into a combined reconstruction plus absolute orientation algorithm. Thus, even for small or vanishing translations the relative pose between cameras can be computed in a stable manner within a single algorithm.
1 Introduction
Motion estimation between two views is a classical topic in computer vision. Depending on the type of viewed features and on the type of camera, different methods for solving the motion problem exist. This work is focused on the scenario in which circles are imaged by two calibrated cameras. The majority of motion algorithms are based on point features. For calibrated cameras there exists the so-called five-point algorithm [14]. It can cope with planar scenes, but even in the non-minimal case of more than five points it computes multiple solutions. Minimal solvers are of importance for hypothesize-and-verify algorithms like RANSAC [5], for establishing unknown correspondences between the features in both images. Apart from points, higher-order planar curves can be used for motion estimation between two views. In [10,11] it was shown that four conics are sufficient to determine the fundamental matrix, and the authors in [11] proved that three is the minimal number of conics needed to determine the essential matrix and hence the relative pose. The Euclidean information of circles cannot be incorporated advantageously in either of those two approaches. Recently, [3,7,19] used coplanar or parallel circles for camera calibration, but only [3] derived the exterior orientation of the camera to the world plane including the circles. Numerous publications about pose estimation from points and lines exist, see e.g. [2,16]. Furthermore, methods for computing the pose of a single circle in space are at hand, like in [17]. A method for motion estimation from two
coplanar conics is presented in [12]. However, we are not aware of any pose estimation algorithm using two arbitrary circles. This paper presents an approach for the determination of the relative pose between two calibrated cameras based on the projection of two arbitrary circles in space. In this respect it closes a gap in the above-cited works. It will be shown that this case is minimal and overdetermined at the same time. The key idea is to first reconstruct the circles in space with respect to both camera coordinate systems. This is carried out using Euclidean invariants of this specific configuration in 3D space. Once the reconstruction is done, only the relative pose between both reconstructions has to be computed, which is the classical registration problem, see e.g. [9]. The uniqueness of the solution is discussed and degenerate situations are examined. As accuracy and robustness are key issues, experiments show the performance of the algorithm.
2 A Relative Pose from Two Circles Algorithm

2.1 Preliminaries
The starting point is the extraction of the elliptical contours in the image. Then, ellipses are fitted to these contour points or edgels using algorithms like those in [1,6]. The ideas of the paper are developed within the framework of calibrated cameras. Both cameras can be identical or not, but the internal camera parameters must be known. With the known camera parameters the ellipses in the image can be transferred into ellipses in the normalized image plane, which is the plane with the z-coordinate equal to one. Thus, points on the normalized ellipse are directions in 3D space. This will constitute the basis for the subsequent circle pose algorithms. The relative pose problem is the determination of the rotation and translation between both cameras, which can be represented by the rotation matrix R and the translation vector t. As the overall scale of the motion and the reconstruction is not known, the length of the translation vector can be scaled arbitrarily. Therefore, the relative pose problem has five degrees of freedom, three for the rotation and two for the translation direction.

2.2 Single View Geometry of a Circle
A circle in space has six degrees of freedom, which are the position, the orientation, and the radius of the circle. We choose to parameterize a circle by the pair of 3-vectors [n, c], where n is its normal vector, with the addition that the length of the normal vector is the radius r of the circle: r = ‖n‖. The vector c contains the coordinates of the center of the circle. A circle is projected onto an image as an ellipse. Because an ellipse has only five degrees of freedom, the position of a circle cannot be determined from a single projection without any additional assumptions. There is a one-parameter family of solutions, and all circles [kn, kc] with k ≠ 0 project onto the same ellipse.
We note that more precisely k > 0 must hold so that all circles are in front of the camera. Additionally, there exists a two-fold ambiguity: for a particular cone there are two circles with a fixed radius [n1 , c1 ] and [n2 , c2 ] that span the same cone in space, see figure 1. There are well known algorithms for computing both circles in space from their normalized imaged ellipse, and the reader is referred to [3,15].
Fig. 1. From each view of a circle, there is a two-fold ambiguity in the reconstruction of the circle in space that results in a true (solid line) and a false reconstruction (dashed line). All non-primed entities, like O and I, refer to the first camera system, whereas all primed entities, like O′ and I′, belong to the second camera frame.
2.3 Position of Two Circles from Two Views
We now show how the additional constraints arising from a set of two circles can be used to determine the scale or distance factors k for each circle. This will be achieved by constructing Euclidean invariants, i.e., geometric entities that stay the same even though they are transformed by a rotation and translation from one camera coordinate frame to the second one. If the scale factors are known, the circles are reconstructed in both camera systems and the relative pose between the cameras can easily be computed. Because the scale of the vector pair [n, c] can be chosen arbitrarily, we scale it such that ‖c‖ = 1 holds. The first Euclidean invariants are the radii of the circles. The scaling factors for corresponding circles are identical in both camera systems, resulting in

ki = ki′ .   (1)

Because there is no constraint on the overall scale of the reconstruction and the pose, we can set k1 = 1. Let k := k2 denote the distance factor of the second circle. The situation with the unknown parameter k is depicted in Fig. 2. The distance vector between the first and the second circle is x in the left and x′ in the right camera coordinate system, respectively. They can be expressed as
Fig. 2. The distance between the two circles is a Euclidean invariant, which can be used to resolve the coefficient k
x = k c2 − c1 and x′ = k c2′ − c1′. The length of the distance is a second Euclidean invariant, from which the following quadratic equation in k is derived:

‖x‖ = ‖x′‖  ⇔  k² (c2ᵀc2 − c2′ᵀc2′) − 2k (c1ᵀc2 − c1′ᵀc2′) + c1ᵀc1 − c1′ᵀc1′ = 0 .   (2)
In the sequel we denote by k = km, for m ∈ {1, 2}, the two solutions of the above equation. But, as each vector pair has a two-fold ambiguity, there is a total of 16 equations of the form (2). Let g, h, i, j denote the indices of the possible vector pairs: [n1, c1] = [n1,g, c1,g], [n2, c2] = [n2,h, c2,h], [n1′, c1′] = [n1,i′, c1,i′], [n2′, c2′] = [n2,j′, c2,j′] with g, h, i, j ∈ {1, 2}. Then, the solution for k is a function of the permutations of the vector pair indices: km = km(g, h, i, j). This means there is a total of 32 ambiguities, which have to be reduced to a unique solution by deploying additional constraints. There are three further Euclidean constraints, which refer to the relative orientation of the vectors x, n1, n2. Without loss of generality we assume that the vector n1 is more orthogonal with respect to x than n2: ‖n1 × x‖ > ‖n2 × x‖. Now, x together with n1 defines the local coordinate system (i, j, k) as follows: i = x/‖x‖, j = (n1 − i(n1ᵀi))/‖n1 − i(n1ᵀi)‖, k = i × j. For the vectors n1, n2 this results in n1 = i(n1ᵀi) + j(n1ᵀj) and n2 = i(n2ᵀi) + j(n2ᵀj) + k(n2ᵀk). A third set of Euclidean invariants, summarized in the following five-element vector, can be built:

d = ( n1ᵀi, n1ᵀj, n2ᵀi, n2ᵀj, n2ᵀk )ᵀ .   (3)

In both camera systems these entities must be the same and d = d′ must hold. We note that even though there are only three angular invariants, in practice it proved that the above five invariants are favorable due to a better numerical robustness. Resolving the 32-fold ambiguity can now be formulated as a simple combinatorial problem:

(ĝ, ĥ, î, ĵ, m̂) = arg{ d(g, h, i, j, m) = d′(g, h, i, j, m) } .   (4)
In practice, the equality d = d′ will not be fulfilled exactly due to noisy measurements. Instead, the combination minimizing ‖d − d′‖ is taken.
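To make the steps behind Equations (2)–(4) concrete, the following sketch shows the computation for one candidate combination of vector pairs. It is our own illustrative Python/NumPy code, not the authors' implementation; the two-fold circle reconstruction from a normalized ellipse (cf. [3,15]), the n1/n2 ordering rule, and the enumeration over all 32 combinations are omitted.

import numpy as np

def scale_candidates(c1, c2, c1p, c2p):
    """Solve the quadratic Eq. (2) for the distance factor k of the second circle."""
    a = c2 @ c2 - c2p @ c2p
    b = -2.0 * (c1 @ c2 - c1p @ c2p)
    c = c1 @ c1 - c1p @ c1p
    return np.roots([a, b, c])              # the two solutions k_1, k_2

def invariant_vector(x, n1, n2):
    """Five-element invariant d of Eq. (3), built in the local frame (i, j, k)."""
    i = x / np.linalg.norm(x)
    j = n1 - i * (n1 @ i)
    j /= np.linalg.norm(j)
    k = np.cross(i, j)
    return np.array([n1 @ i, n1 @ j, n2 @ i, n2 @ j, n2 @ k])

def candidate_error(n1, c1, n2, c2, n1p, c1p, n2p, c2p, k):
    """Compare d and d' (Eq. (4)) for one vector-pair combination and one root k."""
    x = k * c2 - c1                          # distance vector in the first camera frame
    xp = k * c2p - c1p                       # and in the second camera frame
    d = invariant_vector(x, n1, n2)
    dp = invariant_vector(xp, n1p, n2p)
    return np.linalg.norm(d - dp)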
Once the set (ĝ, ĥ, î, ĵ, m̂) is determined, all four vector pairs defining the position of the circles in each camera frame are known, namely [n1,ĝ, c1,ĝ], [km̂ n2,ĥ, km̂ c2,ĥ], [n1,î′, c1,î′], and [km̂ n2,ĵ′, km̂ c2,ĵ′]. Finally, we mention the following insight: the above-stated error criterion based on the invariant entities in equation (3) can be replaced by the reprojection error of the circles. To do so, the circles reconstructed in the left camera frame are projected into the right image and vice versa, after the relative pose between both cameras has been established. Then, the distances between reprojected and image ellipses can be computed to build the reprojection error. We implemented and tested this alternative approach and found similar results. Because the invariant approach is simpler and faster, it is preferred.
2.4 Relative Pose
Determining the relative pose (R, t) between the two cameras has now become an absolute orientation problem, and its solution is straightforward based on the reconstructed circles. The rotation R can be determined from the relation (x′, n1′, n2′) = R (x, n1, n2). The best way to perform this is to use quaternions, see [9] for an explanation. Using quaternions, the rotation can be computed in a linear manner, more exactly by a singular value decomposition, advantageously incorporating the overdetermined set of equations. With the known rotation, the translation t is computed from ci′ = R ci + t for i ∈ {1, 2}, which is once again an overdetermined linear system. Alternatively, rotation and translation parameters can be computed simultaneously using dual quaternions, as in [4], based on the three lines in space defined by the circle normals and the connection between the circle centers. The length of the baseline ‖t‖ depends on the fixed radius of the first circle; finally, the reconstruction and the baseline are scaled such that ‖t‖ = 1, as is common for relative pose estimation.
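The paper solves this registration step with unit quaternions [9]; as an equivalent alternative, the following NumPy sketch uses the standard SVD-based (Kabsch) solution for the rotation and a least-squares estimate of the translation. It is our own illustration of the absolute orientation step, not the authors' code.

import numpy as np

def rotation_from_directions(dirs_cam1, dirs_cam2):
    """Least-squares rotation R with dirs_cam2[k] ~ R @ dirs_cam1[k] (Kabsch/SVD)."""
    A = np.asarray(dirs_cam1)                # rows: x, n1, n2
    B = np.asarray(dirs_cam2)                # corresponding rows: x', n1', n2'
    H = A.T @ B
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    return Vt.T @ D @ U.T

def translation_from_centers(R, centers_cam1, centers_cam2):
    """Least-squares t from c_i' = R c_i + t, normalized to a unit baseline."""
    residuals = [cp - R @ c for c, cp in zip(centers_cam1, centers_cam2)]
    t = np.mean(residuals, axis=0)
    n = np.linalg.norm(t)
    return t / n if n > 0 else t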
2.5 Uniqueness of the Solution
Each of the two circles in 3D space has six degrees of freedom and the relative pose has five, which makes a total of 17 unknowns. The four ellipses in both images sum up to 20 constraints. Counting the number of unknowns and the number of constraints, the problem of relative pose estimation from two circles is overdetermined by three constraints. This can be seen as well in the equations for determining the rotation, based on three directional vectors. We are aware of the fact that we do not present a theoretical proof for the uniqueness of the solution. We rather refer to the experimental section, where the uniqueness is shown. Recall, however, that comparable minimal solvers like the five- or seven-point algorithm are plagued by the same problem. As is done here, a unique solution is achieved by an error criterion in conjunction with overconstraining data. Finally, it is noted that the relative pose from two circles algorithm produces a unique solution even if both circles are coplanar. This is in contrast to the relative pose from points algorithm, which can exhibit a twofold solution, see e.g. [13].
2.6 Degenerate Scene Situation
A degenerate situation occurs if the three direction vectors x, n1, n2 are all collinear. Then it is impossible to construct a local coordinate system, and thus to build the invariant vector in equation (3). However, this is a situation which hardly occurs in practice. An example is a transparent cylinder, where the top and bottom circles lie on the same axis.
2.7 Zero Translation
A vanishing translation is a condition under which classical stereo or motion algorithms, like the five-point algorithm [14] or the eight-point algorithm [8], will fail. For zero translation the essential matrix or the fundamental matrix collapses to a zero matrix. But, as the presented algorithm is actually a pose and not a stereo algorithm, even for vanishing translation the relative pose can be computed exactly and in a numerically stable manner. If the translation is zero, both camera centers are identical and the image-to-image transformation is a planar homography controlled just by the rotation. There are four possible combinations of the circle vector pairs with (ĝ, ĥ, î, ĵ) ∈ {(1, 1, 1, 1), (2, 1, 2, 1), (1, 2, 1, 2), (2, 2, 2, 2)}. For each of the four combinations the rotational part of the relative pose can be computed as described above, and it naturally turns out that all four rotations are identical. It is understood that the individual scaling factors cannot be derived for reconstructing the circles.
2.8 Absolute Pose from Known Circles
The presented algorithm can also be used to determine the absolute pose of the viewing camera with respect to two known circles: the scaling factors of both circles, defined by the vector pairs [n1,g, c1,g] and [n2,h, c2,h], are derived from the known radii. Then, only the four combinations of the vector pair indices g, h have to be checked, using the Euclidean invariants of equation (3), to find the right one. Finally, the pose is computed as stated above.
3 Experiments

3.1 Synthetic Data
Setup of the experiment. Two circles are constructed in space: the centers of both circles are situated on the left and right side of a cube of unit length, and the circle normals point in frontal direction towards the viewer. In doing so it is ensured that both circles have a minimum distance of one and a maximum distance of √3. Furthermore, both circles can be panned and tilted by up to ±45°. These four attitude parameters are chosen at random, uniformly distributed over their value ranges, for each trial. This setup guarantees that the circles can be seen from their front by the cameras, and that the circle normals and the center difference vector span a local coordinate system; this is a prerequisite for a proper
working of the proposed algorithm. The radii of both circles are one tenth of a unit. The cameras are at a distance of five unit lengths from the cube, looking at its front. At this distance they are positioned randomly, uniformly distributed within a second cube with edges of two unit lengths. Therefore, the actual distance between both camera centers varies seamlessly between zero and 2√3. Thus, large baselines - like for classical stereo algorithms - and very small baselines - like for pose estimation algorithms - are tested in one single framework. In addition to the changing position, the orientation of each camera is randomly varied by up to ±10°.

Uniqueness of the solution. With the above experimental setup 100,000 trials are conducted. The difference between the computed pose and the ground truth pose is measured. In all of the experiments the error in the computed relative pose is zero, up to numerical noise. From this we can conclude that for a realistic geometric setup, which obeys the mentioned prerequisite, the solution is always the correct one. Separately, the zero translation case is tested. Here, the same result holds and the relative pose is always computed correctly.

Evaluation under noise. The above described experiment is extended to assess the accuracy and the robustness of the algorithm as a function of the noise level of the ellipse contour edgels extracted in the image. Traditionally, motion estimation algorithms in computer vision are compared with respect to the motion type the cameras are undergoing, see e.g. [14] or [18]. Typically, sideways and forward motion are distinguished. Here, we add a third motion type, which is the pure rotation case with zero translation. Sideways and forward motion have a translation vector of unit length. Additionally, the cameras are slightly rotated at random, like in the experiment sketched above. In order to add realistic noise to the ellipse parameters, the following procedure is carried out for each individual ellipse in each experimental sample: the ellipses are resampled at pixel distance to create a set of edgels. Then, these edgels are disturbed by adding noise obeying a normal distribution, scaled with the desired noise level. From the noisy edgels the ellipses are fitted again, resulting in deformed ellipses. The noise level is the equivalent of the accuracy of the edge detector extracting the edgels of the ellipse contour. A reasonable noise level is one pixel, and the chosen noise values range from 0.2 to 2 pixels. For each noise level a total of 1000 samples are drawn to build a representative statistic. In order to measure the difference between poses there is a common way: for each pose a seven-element vector is built from the translation vector and the four-element quaternion of the rotational part. Then, the difference between poses is the Euclidean norm of the seven-element difference vector. So, the error for a single experimental sample is the difference between the computed pose and its ground truth counterpart. The accuracy of the algorithm is defined by the median of the relative pose errors. An algorithm is called robust if, despite noisy data, the computed solution is somehow close to the correct one. How close a solution must be, to
make it a suitable initialization for, for example, a subsequent bundle-adjustment type of iteratively converging algorithm, is a difficult question and no general answer to it is possible. We decided that a threshold of 0.5 is appropriate. In Fig. 3(a) it can be seen that the algorithm is accurate, because the average pose error is very modest. Additionally, the error is proportional to the noise level. Sideways and forward motion behave similarly. In the zero translation case the rotation is computed even about an order of magnitude more accurately. Figure 3(b) depicts the fraction of relative poses for which the error is higher than 0.5. It can be seen that for general motion a noise level of up to one pixel can be tolerated to yield the correct solution. For pure sideways motion half of the solutions are doubtful under a noise level of two pixels. Once again the forward motion performs better than the sideways motion. In the pure rotation case the algorithm shows very good robustness.
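A minimal version of this pose error metric could look as follows (our own sketch; the quaternion sign handling is an assumption beyond what the text states):

import numpy as np

def pose_error(t_est, q_est, t_gt, q_gt):
    """Euclidean norm of the 7-element difference vector (translation + unit quaternion)."""
    v_est = np.concatenate([t_est, q_est]).astype(float)
    v_gt = np.concatenate([t_gt, q_gt]).astype(float)
    # assumption: q and -q encode the same rotation, so compare against the closer sign
    if np.dot(q_est, q_gt) < 0:
        v_est[3:] = -v_est[3:]
    return np.linalg.norm(v_est - v_gt)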
Fig. 3. Evaluation of the accuracy and the robustness of the algorithm for various motion types: (a) Accuracy of the relative pose w.r.t. the noise of the ellipse edgels, as the median relative pose error. (b) Robustness of the relative pose w.r.t. the noise of the ellipse edgels, as the fraction of computed poses with a difference to ground truth higher than 0.5. Sideways motion is displayed by circles, forward motion by squares, and pure rotation by triangles.
3.2 Real Images
The proposed algorithm is tested on a sequence of ten stereo images of a simple object, as depicted in Fig. 4. The object features two drill holes and just a few corners. Note that in this situation the number of corner points visible in both images is not sufficient for using the five-point algorithm. The images stem from a calibrated stereo setup. Hence, the known relative pose between the cameras serves as ground truth. The translation and rotation errors are computed as presented in the previous section. Half of the images result in correct relative poses with a minimal error of about 0.05. The other half exhibits an error larger than 0.5, which was defined as the limit for outlier classification. This number of outliers can be explained by the specific geometry of the experimental setup: the dimension of the object is approximately one tenth of its distance to the camera. A scene with circles more distant from each other - which hence would undergo more perspective distortion under viewer
Fig. 4. Excerpt of a sequence of stereo image pairs with left images (top row) and right images (bottom row). The simple object features two drill holes, which are used to determine the relative pose between both cameras.
motion - will probably constitute a more stable and robust configuration. A future solution to this problem could be the following approach: take a couple of promising poses together with the reconstructed circles as initial solution and plug them into a bundle-adjustment type of algorithm to optimize pose and reconstruction. Finally, select the solution with the smallest reprojection error.
4 Discussion
The presented algorithm is of importance because it can be used both for relative and for absolute pose determination. In this respect it supplements the range of existing motion and pose estimation algorithms. In a number of experiments the algorithm has proven to be accurate and robust for a realistic noise level. We point out that the resulting accuracy is comparable to existing pose determination algorithms, like those in [2,16]. This is remarkable because the only 3D information our algorithm builds on is the circular shape, not the explicit 3D coordinates that pose algorithms use. Compared to standard motion algorithms, see for example [14,18], it can be summarized that our approach outperforms them. The algorithm has been applied to a real-world experiment. For small objects it turned out that the algorithm is not sufficiently stable, as it does not always find the correct solution. We propose a bundle-adjustment type of subsequent optimization to overcome this problem. An interesting future work issue would be the combination of a single circle with additional points. Furthermore, an algorithm for three or more circles would be of use for more robust and accurate pose estimation.
References

1. Ahn, S.J., Rauh, W., Warnecke, H.-J.: Least-squares orthogonal distances fitting of circle, sphere, ellipse, hyperbola, and parabola. Pattern Recognition 34, 2283–2303 (2001)
2. Ansar, A., Daniilidis, K.: Linear pose estimation from points and lines. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI) 25(5), 578–589 (2003)
3. Chen, Q., Wu, H., Wada, T.: Camera calibration with two arbitrary coplanar circles. In: Proc. of European Conf. on Computer Vision (ECCV), vol. 3, pp. 521–532 (2004)
4. Daniilidis, K.: Hand-eye calibration using dual quaternions. International Journal of Robotics Research 18(3), 286–298 (2001)
5. Fischler, M., Bolles, R.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. Assoc. Comp. Mach. 24(6), 381–395 (1981)
6. Fitzgibbon, A., Pilu, M., Fisher, R.: Direct least square fitting of ellipses. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI) 21(5), 476–480 (1999)
7. Gurdjos, P., Sturm, P., Wu, Y.: Euclidean structure from N ≥ 2 parallel circles: Theory and algorithms. In: Proc. of European Conf. on Computer Vision (ECCV), vol. 1, pp. 238–252 (2006)
8. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2000)
9. Horn, B.: Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America (JOSA-A) 4(4), 629–642 (1987)
10. Kahl, F., Heyden, A.: Using conic correspondence in two images to estimate the epipolar geometry. In: Proc. of Intl. Conf. on Computer Vision (ICCV), pp. 761–766 (1998)
11. Kaminski, J.Y., Shashua, A.: On calibration and reconstruction from planar curves. In: Proc. of European Conf. on Computer Vision (ECCV), vol. 1, pp. 678–694 (2000)
12. Ma, S.D.: Conics-based stereo, motion estimation, and pose determination. International Journal of Computer Vision (IJCV) 10(1), 7–25 (1993)
13. Negahdaripour, S.: Closed-form relationship between the two interpretations of a moving plane. Journal of the Optical Society of America (JOSA-A) 7(2), 279–285 (1990)
14. Nister, D.: An efficient solution to the five-point relative pose problem. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI) 26(6), 756–770 (2004)
15. Philip, J.: An algorithm for determining the position of a circle in 3D from its perspective 2D projection. Technical Report TRITA-MAT-1997-MA-1, Department of Mathematics, KTH (Royal Institute of Technology) (1997)
16. Quan, L., Lan, Z.: Linear pose estimation from points and lines. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI) 21(7), 774–780 (1999)
17. Safaee-Rad, R., Tchoukanov, I., Smith, K., Benhabib, B.: Three-dimensional location estimation of circular features for machine vision. IEEE Trans. Robot. Automat. 8(5), 624–640 (1992)
18. Segvic, S., Schweighofer, G., Pinz, A.: Performance evaluation of the five-point relative pose with emphasis on planar scenes. In: 31st AAPR/OAGM Workshop, pp. 33–40 (2007)
19. Wu, Y., Zhu, H., Hu, Z., Wu, F.: Camera calibration from the quasi-affine invariance of two parallel circles. In: Proc. of European Conf. on Computer Vision (ECCV), vol. 1, pp. 190–202 (2004)
Staying Well Grounded in Markerless Motion Capture

Bodo Rosenhahn¹, Christian Schmaltz², Thomas Brox³, Joachim Weickert², and Hans-Peter Seidel¹

¹ Max Planck Institute for Computer Science, Saarbrücken, Germany
[email protected]
² Mathematical Image Analysis, Saarland University, Germany
³ Intelligent Systems, University of Dresden, Germany
Abstract. In order to overcome typical problems in markerless motion capture from video, such as ambiguities, noise, and occlusions, many techniques reduce the high dimensional search space by integration of prior information about the movement pattern or scene. In this work, we present an approach in which geometric prior information about the floor location is integrated in the pose tracking process. We penalize poses in which body parts intersect the ground plane by employing soft constraints in the pose estimation framework. Experiments with rigid objects and the HumanEVA-II benchmark show that tracking is remarkably stabilized.
1 Introduction

Markerless motion capture (MoCap) is an attractive field of research with applications in computer graphics (games, animation), sports science, and medicine. Given an input image sequence of a moving person, the task is to determine its position and orientation, as well as the joint angles of the subject. Since there are no artificial markers as in marker-based tracking systems, one usually assumes a 3D model of the person to be given. A popular representation is a surface model with known joint positions [8]. Multi-view image streams, i.e., images from synchronized and calibrated cameras, are very common to reduce ambiguities. One of the main problems in motion capture is the high-dimensional search space, e.g., six pose parameters and at least 20 joint angles. For this reason it is beneficial to apply techniques that reduce the search space. Many techniques include a learning stage by using previously recorded MoCap data as prior information. These techniques comprise dimensionality reduction, density estimation, or regression techniques [1,14,5]. In [3] further prior knowledge about light sources and shadows is applied to exploit more information about the scene observed by the cameras and to introduce virtual shadow cameras. A physically motivated model for tracking the lower body has been proposed in [6]. Moreover, it is also possible to impose fixed joint angle limits, as suggested in [7]. The present work pursues the same basic idea of integrating prior information to stabilize tracking. The main contribution is to exploit prior information about the ground floor in the tracking process. The approach is motivated by the fact that many scenarios involve interaction of the tracked subject with the ground plane and indeed it is physically impossible that a body part (e.g. a foot) intersects the ground plane. The
present paper shows how to integrate such constraints in a markerless motion capture system. Moreover, it is verified to what extent such constraints improve the tracking performance. In particular, we show results on the recent HumanEVA-II [12] benchmark, which involves a quantitative error analysis.

2 The Motion Capture System

In this section we briefly introduce the markerless motion capture system we use in our experiments. It is a silhouette-based system similar to [10]. However, in contrast to explicitly computing a silhouette using level-set functions, we rely on the shape of the model in the image to separate the inner and outer regions. This system has been introduced in [11] and will be briefly summarized in the following subsections. The notation introduced here is also the basis for deriving the plane constraint later in Section 3.

2.1 Kinematic Chains

Our models of articulated objects, e.g. humans, are represented in terms of free-form surfaces with embedded kinematic chains. A kinematic chain is modeled as the consecutive evaluation of exponential functions, and twists ξi are used to model (known) joint
2 The Motion Capture System In this section we briefly introduce the markerless motion capture system we use in our experiments. It is a silhouette based system similar to [10]. However, in contrast to explicitly compute a silhouette using level-set functions we rely on the shape of the model in the image to separate the inner and outer regions. This system has been introduced in [11] and will be briefly summarized in the following subsections. The notation introduced here is also the basis for deriving the plane constraint later in Section 3. 2.1 Kinematic Chains Our models of articulated objects, e.g. humans, are represented in terms of free-form surfaces with embedded kinematic chains. A kinematic chain is modeled as the consecutive evaluation of exponential functions, and twists ξi are used to model (known) joint locations [9]. In this work we denote a twist ξ as vector ξ = (ω1 , ω2 , ω3 , v1 , v2 , v3 )T = (ω , v)T or as matrix ξˆ as done in [9]. It is further common to scale the twists ξ with respect to ω = (ω1 , ω2 , ω3 )T , i.e. θ = ω , ω := ωθ and v := θv . We denote a scaled twist as θ ξ or θ ξˆ , respectively. The exponential function of a twist leads to a rigid body motion in space. The transformation of a mesh point of the surface model is given as the consecutive application of the local rigid body motions involved in the motion of a certain limb: Xi = exp(θ ξˆ )(exp(θ1 ξˆ1 ) . . . exp(θn ξˆn ))Xi .
(1)
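For readers who want to experiment with this representation, a small sketch of the twist exponential and the chain evaluation of Eq. (1) is given below. This is our own illustrative NumPy code using the standard Rodrigues-type closed form of the twist exponential [9]; the function and variable names are ours.

import numpy as np

def hat(w):
    """Skew-symmetric matrix [w]_x such that [w]_x @ p = w x p."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def twist_exp(omega, v, theta):
    """Rigid body motion exp(theta * xi_hat) for a scaled twist (omega, v), as a 4x4 matrix."""
    T = np.eye(4)
    omega, v = np.asarray(omega, float), np.asarray(v, float)
    if np.allclose(omega, 0):                      # pure translation
        T[:3, 3] = theta * v
        return T
    W = hat(omega)
    R = np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * (W @ W)
    p = (np.eye(3) - R) @ np.cross(omega, v) + np.outer(omega, omega) @ v * theta
    T[:3, :3], T[:3, 3] = R, p
    return T

def transform_point(global_twist, joint_twists, thetas, X):
    """Evaluate Eq. (1): consecutive rigid body motions applied to a homogeneous point X."""
    T = twist_exp(*global_twist)                   # (omega, v, theta) of the free rigid motion
    for (omega, v), theta in zip(joint_twists, thetas):
        T = T @ twist_exp(omega, v, theta)
    return T @ np.asarray(X, float)                # X = (x, y, z, 1)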
For abbreviation, we denote a pose configuration by the (6+n)-D vector χ = (ξ, θ1, ..., θn) = (ξ, Θ), consisting of the 6 degrees of freedom for the rigid body motion ξ and the n-D vector Θ comprising the joint angles. In the MoCap setup, the vector χ is unknown and has to be determined from the image data.

2.2 Region Based Pose Tracking

Given an articulated surface model and an image stream, we are interested in fitting the mesh to the image data by computing the position, orientation, and joint angle configuration in such a way that the surface projection optimally covers the region of the person in the images. Many existing approaches [8] rely on previously extracted silhouettes, for instance by background subtraction, and establish correspondences between contour points and points on the model surface. This involves a contour extraction and a matching of the projected surface with the contour. The work in [11] avoids explicit computations of contours and contour matching but directly adapts the pose parameters χ such that the projections of the surface optimally split all images into homogeneously distributed object and background regions.
Energy model. Similar to a segmentation technique [10], the model states the conditions for an optimal partitioning of the image domain Ω. This can be expressed as minimization of the energy function

E(χ) = − ∫_Ω ( P(χ, q) log p1 + (1 − P(χ, q)) log p2 ) dq ,   (2)
where the function P : R^(6+n) × Ω → {0, 1}, (χ, q) ↦ P(χ, q), is 1 if and only if the surface of the 3-D model with pose χ projects to the point q in the image plane. P splits the image domain into two parts, the object and the background region. In both regions the homogeneity of the feature distributions is to be maximized. These distributions are modeled by probability density functions (pdfs) p1 and p2. In order to model the image features by pdfs we use the color distributions in the CIELAB color space.

Minimization. To minimize E(χ), a gradient descent can be applied. It results in the desired pose that minimizes E(χ) locally. In order to approximate the gradient of E(χ), 2D-3D point correspondences (qi, xi) are established by projecting silhouette points xi of the surface with current pose χ to the image plane, where they yield points qi. Each image point qi obtained in this way which seems to belong to the object region – i.e., those points for which p1(qi) is greater than p2(qi) – will be moved in outward normal direction to a new point qi′. Points where p1(qi) < p2(qi) holds will be shifted in the opposite direction to qi′, respectively. In order to compute the normal direction ∇P, we use Sobel operators. The length l := ‖qi′ − qi‖ of the shift vector is set to a constant. More details are given in [11]. The 2D-3D point correspondences (qi′, xi) obtained this way are used in the point based pose tracking algorithm to get a new pose, as explained in the following section. This forms one optimization step, which is iterated until the pose changes induced by the force vectors start to mutually cancel each other. We stop iterating when the average pose change after up to three iterations is smaller than a given threshold.

2.3 Pose Estimation from Point Correspondences

Assuming a given set of corresponding 2D points (3D rays) and 3D points, a 3D point-line based pose estimation algorithm for kinematic chains is applied to minimize the spatial distance between both contours: each 2D point is reconstructed to a 3D Plücker line Li = (ni, mi), see Section 3.1. For pose estimation the reconstructed Plücker lines are combined with the screw representation for rigid motions. Incidence of the transformed 3D point Xi with the 3D ray Li = (ni, mi) can be expressed as

(exp(θ ξ̂) Xi)_π × ni − mi = 0 .   (3)
Since exp(θ ξ̂) Xi is a 4D vector, the homogeneous component (which is 1) is neglected to evaluate the cross product with ni. This mapping is denoted by π. The resulting nonlinear equation system can be linearized in the unknown twist parameters by using the first two elements of the sum representation of the exponential function:

exp(θ ξ̂) = ∑_{i=0}^{∞} (θ ξ̂)^i / i!  ≈  I + θ ξ̂ .   (4)
This approximation is used in (3) and leads to the linear equation system

((I + θ ξ̂) Xi)_π × ni − mi = 0 .   (5)
Gathering a sufficient number of point correspondences and appending the single equation systems leads to an overdetermined linear system of equations in the unknown pose parameters θ ξ̂. The least-squares solution is used for the reconstruction of the rigid body motion using Rodrigues' formula [9]. Then the model points are transformed and a new linear system is built and solved, until convergence. The final pose is given as the consecutive evaluation of all rigid body motions during the iteration. Joints are expressed as special screws with no pitch of the form θj ξ̂j with known ξ̂j (the location of the rotation axes is part of the model) and unknown joint angle θj. The constraint equation of the ith point on the jth joint has the form

(exp(θ ξ̂) exp(θ1 ξ̂1) · · · exp(θj ξ̂j) Xi)_π × ni − mi = 0 ,   (6)
which is linearized in the same way as the rigid body motion itself. It leads to three linear equations with the six unknown twist parameters and j unknown joint angles.
3 Integrating Ground Plane Constraints

The key idea of the present work is now to extend the previous system of equations with additional equations that reflect the ground plane constraint. These equations regularize the system and reduce the space of possible solutions to a desired subspace. This section explains how to formalize the geometric constraint and how to integrate it in the tracking system as a soft constraint.

3.1 Lines and Planes

Although we state the necessary equations here in matrix calculus, we point out that other representations, e.g. Clifford or geometric algebras [13], provide a much more compact and unifying way to represent entities, their interactions and transformations. Due to space limits we refrain from introducing a possible higher-order algebraic model and stick to matrices. In this work, we use an implicit representation of a 3D line in its Plücker form [4]. A Plücker line L = (n, m) is given by a normalized vector n (the direction of the line) and a vector m, called the moment, which is defined by m := X × n for a given point X on the line. Collinearity of a point X to a line L can be expressed by
(7)
and the distance of a point X to the line L can easily be computed by X × n − m, see [10] for details. For planes we use an implicit representation as Hessian form. A plane P = (D, d) is given by a normalized vector D (the normal of the plane) and the distance of the plane to the origin d. A point X is on the plane iff
Staying Well Grounded in Markerless Motion Capture X ∈ P ⇔ X · D + d = 0.
389
(8)
The resulting scalar value also provides a signed distance from the point to the plane. For a given line L = (n, m) and plane P = (D, d) the intersection point of these entities is given as X = L∨P =
D × m − nd . D·n
(9)
For a more detailed analysis of the constraints, we refer to [13]. 3.2 Soft-Constraints for Penalizing Floor Intersections As the described tracking procedure so far does not integrate any prior information about the scene, the surface mesh is able to take all possible configurations and positions. However, it is physically not possible that extremities intersect with the ground plane during, e.g., walking or jogging. For penalizing such configurations, we use the previously introduced representations of lines in Pl¨ucker coordinates and planes in Hessian from. We first determine all points which are below the ground floor. These can be found by testing the sign of Equation (8) for all points on the surface mesh. Each detected point that is below the ground plane is then mapped to its closest point on the plane by following the steps of Figure 1, left: We assume a given 3D-plane P = (D, d)
L=(n,m) P=(D,d)
X’
X
Fig. 1. Left: Projection of a 3D point onto a 3D plane. Right: Force vectors (marked in blue) shift the right foot towards the ground plane.
and a point X below the ground plane. From the point X and normal D we generate a Pl¨ucker line L = (n, m) = (D, X × D) and compute the intersection of L and P. Since D = n and D · n = 1, Equation (9) reduces to X = D × m − nd.
(10)
Now we can establish point-point correspondences between all points X i below the ground plane and their mapping Xi onto the ground plane. The correspondences are used in the pose estimation procedure to enforce the point X i being close to the plane P: we define X i as a set of 3D model points, which correspond to a set of points Xi on the spatial ground plane. We express incidence in terms of ∀i : X i − Xi = 0.
(11)
390
B. Rosenhahn et al.
Since X i belongs to a kinematic chain, X i can be expressed as X i = exp(θ ξˆ )
∏
exp(θ j ξˆj )Xi
(12)
j∈J (xi )
and we can generate a set of equations forcing the transformed point X i to stay close to Xi : exp(θ ξˆ )
∏
exp(θ j ξˆj )Xi − Xi = 0
(13)
j∈J (xi )
Some exemplary force vectors are shown in Figure 1 on the right. Note that the vectors Xi are treated as constant vectors, so that the only unknowns are the pose and kinematic chain coefficients. Also note that the unknowns are the same as for Equation (3), the unknown pose parameters. Only the point-line constraint is replaced by a point-point constraint. It expresses the involved geometric properties which are to be fulfilled during the interaction of the subject with the ground plane. As mentioned before, to generate the correspondences we run a simple test on all surface points if a point is below the ground plane or above. For all points below the ground plane we apply Equations (10) - (13) to generate equations which enforce the points below the ground plane to stay on (or above) the ground plane. The matrix of gathered linear equations is attached to the one presented in Section 2.3. Furthermore, its effect is to regularize the equations. The structure of the generated linear system is Aχ = b, with two parts generated from the linearized equations (5) and (13). Since the additional equations act only as soft constraints we add a strong weighting factor to Equation (13) in order to avoid any severe violations of the plane constraint. Infinite weights would turn the soft constraints into hard constraints.
4 Experiments In the first experiment we considered a simple rigid object consisting of Duplo bricks and a controlled movement in a lab environment. The object was captured in a stereo set-up and the sequence consists of a simple ground plane movement. Since the object is just moving on the ground-plane and the ground plane is given by y = 0, negative y-values indicate an intersection with the given ground plane. Figure 2 shows on the right some example frames of the stereo cameras. As can be seen, it is possible to track the object, though parts are occluded by other bricks. The left diagram shows y-axis coordinates of two control points on the model ground plane. The expected ground truth is a simple constant function at 0. The red curves show the outcome without the additional constraints from Section 3 and the black ones show the result with the additional constraints from Section 3. The red curves deviate up to 4mm in space. This is already pretty accurate but the black curves show an even better and smoother tracking with a deviation of up to just 2mm. In the second experiment we took a sequence of the HumanEVA-II benchmark [12]. Here a surface model, calibrated image sequences, and background images are provided. Everyone is invited to experiment with the data and upload tracking results (as 3D marker positions) to a server at Brown University.
Staying Well Grounded in Markerless Motion Capture
391
As the sequences are captured in parallel with a marker based tracking system (here a Vicon system is used), an automated script evaluates the accuracy of the tracking in terms of relative errors in mm. Figure 3 shows the four provided camera views and the surface mesh in its initial pose of the subject S4, which we used in our experiments. Figure 4 shows some example poses of our tracking system. The top two rows show results without the proposed soft constraints and the two bottom rows depict the outcome with the additional constraints. White ellipses mark some tracking deviations which
Fig. 2. Lego marble sequence with a planar movement. Left: y-axis coordinates of two control points on the ground plane. Red: without additional constraints to stay on the ground plane, black: with additional constraints. The black curves reveal smoother results where the object stays closer to the ground plane. Right: Some frames of the stereo sequence.
Fig. 3. Sample frame and pose of the HumanEVA-II database
are obviously smaller in the result where the ground plane constraint has been applied. Figure 5 shows an overlay of some poses with constraints (green) and without (purple). Without the additional constraints the feet can cross the ground plane and leg crossings occur as a consequence of the wrong configuration. The constraint clearly avoids such problems.
392
B. Rosenhahn et al.
Fig. 4. Sample poses of our tracking system. Top: Pose results without the proposed soft constraints. Bottom: Pose results with the additional constraints.
The HumanEVA-II benchmark allows for a quantitative comparison. Figure 6 shows the deviations of our tracking system to the tracked markers. It depicts the unconstrained results in red and the constrained results in black. Table 1 compares the errors (automatically evaluated [12]). Overall the tracking has been improved remarkably using the additional ground plane constraint. In [2] a similar walking pattern on the HumanEVA-I database is analyzed. Since HumanEva-I does not provide a surface mesh, it was not possible for us to compare our outcome directly with the results reported in [2]. Therefore, the comparison in Table 1 should be interpreted deliberately. However, an overall improvement is clearly visible.
Staying Well Grounded in Markerless Motion Capture
393
Fig. 5. Pose results in a simulation environment. Green: with constraints, purple: without.
160
Error (mm)
140 120 100 80 60 40 20 0
50
100
150
200
250
300
Frame
Fig. 6. Deviations of our tracking results with a marker based outcome. Red: without constraints, blue: with constraints.
Table 1. Comparison of the HumanEVA-II gait sequence with and without imposed ground plane constraints and the RoAM model from [2] (on a similar pattern in the HumanEVA-I dataset)
Without constraints With constraints RoAM model [2]
Average Error (mm). Std. Deviation min error (mm) max error (mm) 50.1 31.7 22.1 152.4 33.8 5.9 20.2 50.4 >60mm -
5 Summary In this work we have presented an approach in which geometric prior information about the floor the person is standing and moving on is integrated into a markerless pose
394
B. Rosenhahn et al.
tracking process. Therefore, we penalize movements in which extremities intersect the ground plane using soft constraints in the pose estimation framework. Experiments with rigid objects and the HumanEVA-II benchmark show that tracking is stabilized significantly and the additional soft constraints can be decisive for a successful tracking. For future experiments we are interested in tracking people in more complex environments, i.e., using a reconstructed background model. This would allow to integrate knowledge about persons and their interaction with the environment in the tracking process.
Acknowledgments The research was funded by the German Research Foundation (DFG), the Max Planck Center VCC and the Cluster of Excellence on Multimodal Computing and Interaction, M 2CI.
References 1. Agarwal, A., Triggs, B.: Recovering 3D human pose from monocular images. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(1), 44–58 (2006) 2. Balan, A., Black, M.: An adaptive appearance model approach for model-based articulated object tracking. In: Conference of Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society Press, Los Alamitos (2006) 3. Balan, A., Sigal, L., Black, M., Haussecker, H.: Shining a light on human pose: On shadows, shading and the estimation of pose and shape. In: Proc. International Conference on Computer Vision. IEEE Computer Society Press, Los Alamitos (2007) 4. Blaschke, W.: Kinematik und Quaternionen, Mathematische Monographien, vol. 4. Deutscher Verlag der Wissenschaften (1960) 5. Brox, T., Rosenhahn, B., Kersting, U., Cremers, D.: Nonparametric density estimation for human pose tracking. In: Franke, K., M¨uller, K.-R., Nickolay, B., Sch¨afer, R. (eds.) DAGM 2006. LNCS, vol. 4174, pp. 546–555. Springer, Berlin (2006) 6. Brubaker, M., Fleet, D.J., Hertzmann, A.: Physics-based person tracking using simplified lower-body dynamics. In: Conference of Computer Vision and Pattern Recognition (CVPR), Minnesota. IEEE Computer Society Press, Los Alamitos (2007) 7. Herda, L., Urtasun, R., Fua, P.: Implicit surface joint limits to constrain video-based motion capture. In: Pajdla, T., Matas, J(G.) (eds.) ECCV 2004. LNCS, vol. 3022, pp. 405–418. Springer, Heidelberg (2004) 8. Moeslund, T.B., Hilton, A., Kr¨uger, V.: A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding 104(2), 90–126 (2006) 9. Murray, R.M., Li, Z., Sastry, S.S.: Mathematical Introduction to Robotic Manipulation. CRC Press, Baton Rouge (1994) 10. Rosenhahn, B., Brox, T., Weickert, J.: Three-dimensional shape knowledge for joint image segmentation and pose tracking. International Journal of Computer Vision 73(3), 243–262 (2007) 11. Schmaltz, C., Rosenhahn, B., Brox, T., Cremers, D., Weickert, J., Wietzke, L., Sommer, G.: Region-based pose tracking. In: Mart´ı, J., Bened´ı, J.M., Mendonc¸a, A.M., Serrat, J. (eds.) IbPRIA 2007. LNCS, vol. 4478, pp. 56–63. Springer, Heidelberg (2007)
12. Sigal, L., Black, M.: HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Technical Report CS-06-08, Brown University, USA (2006), http://vision.cs.brown.edu/humaneva/
13. Sommer, G.: Geometric Computing with Clifford Algebras. Springer, Heidelberg (2001)
14. Urtasun, R., Fleet, D.J., Fua, P.: 3D people tracking with Gaussian process dynamical models. In: Proc. International Conference on Computer Vision and Pattern Recognition, pp. 238–245. IEEE Computer Society Press, Los Alamitos (2006)
An Unbiased Second-Order Prior for High-Accuracy Motion Estimation

Werner Trobin¹, Thomas Pock¹,², Daniel Cremers², and Horst Bischof¹

¹ Institute for Computer Graphics and Vision, Graz University of Technology
² Department of Computer Science, University of Bonn
Abstract. Virtually all variational methods for motion estimation regularize the gradient of the flow field, which introduces a bias towards piecewise constant motions in weakly textured areas. We propose a novel regularization approach, based on decorrelated second-order derivatives, that does not suffer from this shortcoming. We then derive an efficient numerical scheme to solve the new model using projected gradient descent. A comparison to a TV regularized model shows that the proposed second-order prior exhibits superior performance, in particular in low-textured areas (where the prior becomes important). Finally, we show that the proposed model yields state-of-the-art results on the Middlebury optical flow database.
1
Introduction
Estimating the optical flow between two consecutive images of a scene requires finding corresponding points in the images. Since optical flow is a highly ill-posed, inverse problem, solely relying on the optical flow constraint, i.e. to assume that the intensities remained constant, does not provide enough information to infer meaningful flow fields. In particular, optical flow computation suffers from two problems: first, no information is available in untextured regions. Second, one can only compute the normal flow, i.e. the motion in direction of the image gradient; a restriction known as the aperture problem. Optical flow algorithms therefore have to implement an implicit or explicit prior model, to infer the flow in these problematic areas based on the flow field in a local neighborhood – this is typically referred to as smoothing or regularization. Horn and Schunck were the first to formulate the optical flow problem in the framework of the calculus of variations [1]:

\min_u \left\{ \int_\Omega \left( |\nabla u_1|^2 + |\nabla u_2|^2 \right) \mathrm{d}\Omega + \lambda \int_\Omega \left( I_1(x + u(x)) - I_0(x) \right)^2 \mathrm{d}\Omega \right\} ,   (1)
where I0 and I1 are the images, u(x) = (u1 (x), u2 (x))T is the displacement field, and the parameter λ balances the influence of the regularization and the optical
This work was supported by the Austrian Science Fund under grant P18110-B15 and the Austrian Research Promotion Agency within the VM-GPU project (no. 813396).
flow constraint. The choice of a quadratic error function in both the smoothness and the data terms leads to a well-understood system of linear partial differential equations. Yet, quadratic penalizers are sensitive to discontinuities in the flow field and to outliers in the data term.

Numerous improvements to the Horn and Schunck model have been proposed to overcome its inherent limitations. Black and Anandan [2] introduced methods from robust statistics to avoid the quadratic penalization and allow for flow discontinuities and outliers in the data term. Motivated by the observation that the flow discontinuities at the boundaries of moving objects often coincide with large image gradients, Nagel et al. [3], Weickert et al. [4], and other authors proposed to replace the homogeneous regularization in the Horn-Schunck model with anisotropic diffusion approaches. For a more systematic survey on variational formulations of the optical flow problem, we refer to Aubert et al. [5].

State-of-the-art variational approaches allow robust calculation of highly accurate flow fields [6], some of them even in real-time [7,8]. Virtually all of them use robust penalty functions, like total variation (TV) or the Lorentzian, to regularize the gradient of the flow field. Using the flow gradient, however, causes problems in untextured areas, where such a regularization tends to yield piecewise constant solutions. For stereo computation, a strongly related problem, Ishikawa recently showed that an affinity towards piecewise constant regions in a disparity map, i.e. fronto-parallel surfaces, is not consistent with human perception [9]. He proposed a novel prior model based on the total absolute Gaussian curvature, which unfortunately is hard to optimize. However, his results indicate that it is beneficial to penalize second-order derivatives.

The main contribution of this paper is that we propose a novel flow regularization approach based on second-order derivatives, thereby inhibiting the implicit preference of piecewise constant flow fields. In contrast to Yuan et al. [10], who used higher-order derivatives for simultaneous flow estimation and decomposition, we do not aim to decompose the flow into physically meaningful parts, which significantly simplifies the problem. Several authors proposed discrete methods to estimate the optical flow by fitting affine flow patches [11,12,13]. They typically split the domain into disjoint patches and estimate the parameters of the prevalent affine motion per patch. This leads to rather coarse estimates, especially near flow discontinuities. In the next section, we will briefly review the recent variational approach of Nir et al. [14], who proposed an over-parameterized, affine extension of the Horn and Schunck model. Motivated by the shortcomings of this approach, we will propose our second-order flow regularization approach in Sec. 3.
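To make the variational formulation concrete, the following is a minimal numerical sketch (not from the paper) of how the discrete Horn-Schunck energy (1) can be evaluated for a given flow field; the function name, the bilinear warping via scipy, and the boundary handling are our assumptions.

import numpy as np
from scipy.ndimage import map_coordinates

def horn_schunck_energy(I0, I1, u1, u2, lam):
    # Quadratic smoothness term: |grad u1|^2 + |grad u2|^2, summed over the grid.
    smooth = sum(np.sum(g ** 2) for comp in (u1, u2) for g in np.gradient(comp))
    # Data term: squared intensity difference after warping I1 by the flow
    # (u1 = horizontal, u2 = vertical displacement), bilinear interpolation.
    ny, nx = I0.shape
    yy, xx = np.mgrid[0:ny, 0:nx].astype(float)
    I1_warped = map_coordinates(I1, [yy + u2, xx + u1], order=1, mode='nearest')
    data = np.sum((I1_warped - I0) ** 2)
    return smooth + lam * data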
2
Affine Flow Via Over-Parameterization
Nir et al. [14] proposed to model the flow field using piecewise affine regions and penalize only the deviations from these models. To simplify the discussion, we will consider only a single horizontal line in a rectified stereo image pair, i.e. a one-dimensional flow problem. Even in this case, we can show that the approach
[Fig. 1: (a) profile plots of the initial 1D flow u and its extension u*; (b) the smoothing energy E_S(x_0) for u and u* as a function of x_0, together with the horizontal lines TV(u) and TV(u*); see the caption below.]
Fig. 1. The given 1D flow field u and the extension yielding u* are shown in (a). For both flow fields, (b) shows the smoothing energy E_S for the over-parameterized model as a function of x_0. The horizontal lines are TV(u) and TV(u*), respectively.
of Nir et al. may not lead to the desired solution. In the 1D setting, the proposed affine model reduces to u(x) = A1 + A2 x̂, where x̂ = ρ(x − x0)/x0 and x0 is fixed at the center of the domain. The weighting coefficients Ai(x) are free to vary across the domain and the parameter ρ allows balancing their relative weight. Without loss of generality, we set ρ = 1 for the remainder of this section. This over-parameterized representation allows multiple descriptions for any reasonable flow field, so instead of regularizing the flow field u(x), Nir et al. suggest to directly regularize the coefficients Ai, using a convex approximation of the L1 norm. For affine regions, the benefit of such a model seems to be clear: instead of directly representing the flow u, incurring costs for its affine changes, the model allows representing affine motion fields with constant parameters, i.e. with zero cost. However, this benefit of constant parameters within affine regions is often outweighed by the costs of large discontinuities at the region borders.

In order to illustrate that, we will try to find exact representations for the piecewise affine flow field u(x) given in Fig. 1(a), by following two alternative strategies. The first strategy is to set A2(x) = 0 and A1(x) = u(x), i.e. to ignore the affine extension and fall back to a plain, TV-regularized flow model. The second strategy is to make use of this extension and select a constant set of coefficients for every affine region of u. For a fixed x0, such a strategy yields one unique set {A1^i, A2^i} per region i. For the basic TV-regularized model, the smoothing cost is just TV(u), as depicted in Fig. 1(b). When using the affine extension, we get the energy depicted by the thick, solid line in Fig. 1(b). The two key observations are: first, the global minimum for the affine model is lower than TV(u) and second, for the ad-hoc choice of x0 = 2 (in the center of the domain, as suggested by Nir et al.) it is actually higher than TV(u).

Let us extend u with another linear region (the dashed line in Fig. 1(a)), yielding u*. This innocuous looking extension leads to a significant advantage for the plain TV model (cf. the dashed lines in Fig. 1(b)). These surprisingly high smoothing costs for the affine model are caused by the step edges in the (by design) piecewise constant coefficient space; an effect amplified with increasing distance to the "origin" of the representation (x0). Hence,
contrary to expectations, the coefficients of the inferred solutions are rarely piecewise constant, even for perfectly affine motions. This limitation can only be resolved by allowing multiple "origins" x0, which requires a simultaneous segmentation and estimation of the flow field, a problem that is inherently nonconvex and therefore hard to optimize.
3
An Unbiased Second-Order Prior
Instead of modeling affine functions based on an explicit parameterization, our intention is to design an appropriate differential operator, having the intrinsic property to penalize only deviations from affine functions. First of all, we notice that second-order derivatives of affine functions are zero, i.e. a regularization based on second-order derivatives causes zero cost for affine functions. Unfortunately, second-order derivative operators are not orthogonal and the local information of orientation and shape are entangled. Therefore, penalizing each of them, e.g. using the Euclidean vector norm, would induce a bias in the sense that certain affine fields would be energetically favored over others. In [15], Danielsson et al. defined a new operator set to resolve these problems. They used circular harmonic functions to map the second-order derivative operators into an orthogonal space. In two spatial dimensions, the new operator (denoted \mathcal{D} in the following) is given by

\mathcal{D} := \frac{1}{\sqrt{3}} \left( \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2},\; \sqrt{2} \left( \frac{\partial^2}{\partial x^2} - \frac{\partial^2}{\partial y^2} \right),\; \sqrt{8}\, \frac{\partial^2}{\partial x \partial y} \right)^T .   (2)

More interestingly, the magnitude can be directly calculated as the Euclidean vector norm:

|\mathcal{D} u| = \frac{1}{\sqrt{3}} \sqrt{ \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right)^2 + 2 \left( \frac{\partial^2 u}{\partial x^2} - \frac{\partial^2 u}{\partial y^2} \right)^2 + 8 \left( \frac{\partial^2 u}{\partial x \partial y} \right)^2 } .   (3)

As intended, this norm exactly measures the local deviation of a function u from being affine. Hence, it is quite natural to use this norm as a new prior model for optical flow. Based on [8], we propose the following variational model to compute the optical flow u = (u_1, u_2)^T of two images I_0 and I_1:

\min_u \left\{ \sum_{d=1}^{2} \int_\Omega |\mathcal{D} u_d| \, \mathrm{d}\Omega + \lambda \int_\Omega |\rho(u)| \, \mathrm{d}\Omega \right\} .   (4)

The first term utilizes the new prior model to regularize the flow field in each direction u_d. Note that this term is non-smooth and therefore allows for discontinuities in the second-order derivatives. The second term penalizes violations of the optical flow constraint using the robust L1 norm, where ρ(u) denotes the locally linearized image residual given by

\rho(u) = I_1(x + u_0) + \langle u - u_0, \nabla I_1 \rangle - I_0 ,
and u0 is a given disparity map [8]. Due to the linearization procedure, (4) becomes convex, but it has to be embedded into an iterative warping approach to compensate for image non-linearities [6]. Let us now describe an algorithm to compute the minimizer of (4). Since (4) is convex but non-strictly convex, we introduce an auxiliary variable v = (v_1, v_2)^T and propose to minimize the following convex approximation of (4):

\min_{u,v} \left\{ \sum_{d=1}^{2} \int_\Omega |\mathcal{D} u_d| \, \mathrm{d}\Omega + \frac{1}{2\theta} \sum_{d=1}^{2} \int_\Omega (u_d - v_d)^2 \, \mathrm{d}\Omega + \lambda \int_\Omega |\rho(v)| \, \mathrm{d}\Omega \right\} .   (5)
Unlike (4), (5) is a minimization problem in two variables u and v. We therefore have to perform an alternating minimization procedure:

1. For v fixed, solve for every u_d:

\min_{u_d} \left\{ \int_\Omega |\mathcal{D} u_d| \, \mathrm{d}\Omega + \frac{1}{2\theta} \int_\Omega (u_d - v_d)^2 \, \mathrm{d}\Omega \right\} .   (6)

2. For u fixed, solve for v:

\min_{v} \left\{ \frac{1}{2\theta} \sum_{d=1}^{2} \int_\Omega (u_d - v_d)^2 \, \mathrm{d}\Omega + \lambda \int_\Omega |\rho(v)| \, \mathrm{d}\Omega \right\} .   (7)
The solutions of both steps are summarized in the following propositions.

Proposition 1. The solution of (6) is given by

u_d = v_d - \theta \, \mathcal{D} \cdot p_d .   (8)

The dual variable p_d is obtained as the steady state of

\tilde{p}_d^{\,k+1} = p_d^k + \frac{\tau}{\theta} \, \mathcal{D} \left( v_d - \theta \, \mathcal{D} \cdot p_d^k \right) , \qquad p_d^{k+1} = \frac{\tilde{p}_d^{\,k+1}}{\max\left( 1, |\tilde{p}_d^{\,k+1}| \right)} ,   (9)

where k is the iteration number, p_d^0 = 0, and τ ≤ 3/112.

Proof: Note that for convenience, we have dropped the subscript d out of (6). To overcome the non-differentiability of (3), we employ its dual formulation

|\mathcal{D} u| \equiv \max_{|p| \le 1} \left\{ \mathcal{D} u \cdot p \right\} ,   (10)

where p = (p_1, p_2, p_3)^T is the dual variable. Next, we follow the approach of [16] to derive the dual formulation of (6):

\min_{|p| \le 1} \int_\Omega \left( - \mathcal{D} v \cdot p + \frac{\theta}{2} \left( \mathcal{D} \cdot p \right)^2 \right) \mathrm{d}\Omega .   (11)
Hence, solving (11) is equivalent to solving the primal problem (6), which can in turn be recovered via

u = v - \theta \, \mathcal{D} \cdot p .   (12)

We proceed by describing a procedure to solve the dual problem (11). First we write down the Euler-Lagrange equation of (11):

-\mathcal{D} \left( v - \theta \, \mathcal{D} \cdot p \right) = 0 \quad \text{s.t.} \quad |p| \le 1 .   (13)

Being convex in p, minimizing (11) is achieved by gradient descent on (13) and subsequent reprojection of p to ensure |p| \le 1 [17]. The gradient descent step reads as

\tilde{p}^{\,k+1} = p^k + \Delta t \left[ \mathcal{D} \left( v - \theta \, \mathcal{D} \cdot p^k \right) \right] ,   (14)

and the subsequent reprojection is given by

p^{k+1} = \frac{\tilde{p}^{\,k+1}}{\max\left( 1, |\tilde{p}^{\,k+1}| \right)} .   (15)
Finally, we have to determine the maximum value of Δt such that (14) remains stable. We employ the following standard finite differences approximation of the \mathcal{D} operator:

(\mathcal{D} u)_{i,j} = \begin{pmatrix} \tfrac{1}{3} \left( u_{i,j-1} + u_{i,j+1} + u_{i-1,j} + u_{i+1,j} - 4 u_{i,j} \right) \\ \tfrac{2}{3} \left( u_{i-1,j} + u_{i+1,j} - u_{i,j-1} - u_{i,j+1} \right) \\ \tfrac{8}{3} \left( u_{i,j} + u_{i+1,j+1} - u_{i,j+1} - u_{i+1,j} \right) \end{pmatrix}   (16)

and

(\mathcal{D} \cdot p)_{i,j} = \tfrac{1}{3} \left( p^1_{i,j-1} + p^1_{i,j+1} + p^1_{i-1,j} + p^1_{i+1,j} - 4 p^1_{i,j} \right) + \tfrac{2}{3} \left( p^2_{i-1,j} + p^2_{i+1,j} - p^2_{i,j-1} - p^2_{i,j+1} \right) + \tfrac{8}{3} \left( p^3_{i,j} + p^3_{i-1,j-1} - p^3_{i,j-1} - p^3_{i-1,j} \right) ,   (17)

where (i, j) denote the indices of the discrete image domain. Since (14) is linear in p_{i,j}, stability of (14) is guaranteed as long as

\Delta t \, \theta \, \rho(\mathcal{D}^2 p)_{i,j} \le 1 ,   (18)

where \rho(\mathcal{D}^2 p)_{i,j} denotes the spectral radius and \mathcal{D}^2 = \mathcal{D} \cdot \mathcal{D}. It therefore remains to give an upper bound of \rho(\mathcal{D}^2 p)_{i,j}. In fact, \rho(\mathcal{D}^2 p)_{i,j} \le \|\mathcal{D}^2 p_{i,j}\|. Hence, we end up with an upper bound of \Delta t \le \tau / \theta, where we explicitly computed τ = 3/112. □

Proposition 2. The solution of the minimization task in (7) is given by the following thresholding step:

v = u + \begin{cases} \lambda \theta \nabla I_1 & \text{if } \rho(u) < -\lambda \theta |\nabla I_1|^2 \\ -\lambda \theta \nabla I_1 & \text{if } \rho(u) > \lambda \theta |\nabla I_1|^2 \\ -\rho(u) \nabla I_1 / |\nabla I_1|^2 & \text{if } |\rho(u)| \le \lambda \theta |\nabla I_1|^2 . \end{cases}   (19)
Proof: This proof is a more technical variant of the proof presented in [8]. Let us first write down the Euler-Lagrange equation of (7):

v - u + \lambda \theta \, \frac{\rho(v)}{|\rho(v)|} \, \nabla I_1 = 0 .   (20)

We proceed by analyzing the three possible cases ρ(v) > 0, ρ(v) < 0 and ρ(v) = 0. The first two cases are straightforward. The third case requires some additional investigation. In fact, for ρ(v) = 0 we have

I_1(x + u_0) - I_0(x) + \langle \nabla I_1, v - u_0 \rangle = 0 .   (21)

It is easy to see that this system is undetermined, since one has only one equation for the unknown two-dimensional vector v. However, from the first term of (7), we observe that an optimal v* should also minimize the squared Euclidean distance to u. Let us therefore consider the following Lagrangian optimization problem:

L = \min_v \left\{ \sum_{d=1}^{2} \int_\Omega (u_d - v_d)^2 \, \mathrm{d}\Omega + \alpha \int_\Omega \rho(v)^2 \, \mathrm{d}\Omega \right\} ,   (22)

where α are Lagrange multipliers. This optimization problem aims for finding an optimal v*, which minimizes the squared Euclidean distance to u with the constraint that ρ(v*) = 0. The Lagrange multipliers α are used to enforce this constraint. The Euler-Lagrange equations with respect to v and α are given by

\frac{\partial L}{\partial v} = v - u + \alpha \, \rho(v) \, \nabla I_1 = 0 \qquad \text{and} \qquad \frac{\partial L}{\partial \alpha} = \rho(v)^2 = 0 .   (23)

It is easy to see that this system of equations is well posed. We can therefore solve this system with respect to v, resulting in the third case of the thresholding scheme presented in (19). □
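For readers who want to experiment with the scheme, the following is a minimal, self-contained sketch (ours, not the authors' implementation) of one alternation of (5): the dual fixed-point iteration (9) for subproblem (6), built from the stencils (16)/(17) assembled as a sparse matrix with replicated borders, and the pointwise thresholding (19) for subproblem (7), written in an equivalent clipped form. The outer coarse-to-fine warping loop is omitted, and all helper names are our own.

import numpy as np
import scipy.sparse as sp

def second_order_operator(ny, nx):
    # Sparse matrix D of shape (3*ny*nx, ny*nx) realizing the stencils of Eq. (16);
    # the adjoint operator of Eq. (17) is then simply D.T.
    idx = np.arange(ny * nx).reshape(ny, nx)

    def shift(dy, dx):
        # maps u[i, j] -> u[i + dy, j + dx], replicating values at the border
        rows = np.clip(np.arange(ny) + dy, 0, ny - 1)
        cols = np.clip(np.arange(nx) + dx, 0, nx - 1)
        src = idx[rows][:, cols]
        return sp.csr_matrix((np.ones(ny * nx), (idx.ravel(), src.ravel())),
                             shape=(ny * nx, ny * nx))

    I = sp.identity(ny * nx)
    lap = shift(0, -1) + shift(0, 1) + shift(-1, 0) + shift(1, 0) - 4 * I
    dif = shift(-1, 0) + shift(1, 0) - shift(0, -1) - shift(0, 1)
    mix = I + shift(1, 1) - shift(0, 1) - shift(1, 0)
    return sp.vstack([(1. / 3.) * lap, (2. / 3.) * dif, (8. / 3.) * mix]).tocsr()

def solve_prior_subproblem(v, D, theta, tau=3. / 112., iters=100):
    # Dual iteration (9) for subproblem (6); returns u_d = v_d - theta * D^T p_d, cf. (8).
    n = v.size
    p = np.zeros(3 * n)
    for _ in range(iters):
        p = p + (tau / theta) * (D @ (v.ravel() - theta * (D.T @ p)))
        mag = np.sqrt((p.reshape(3, n) ** 2).sum(axis=0))       # per-pixel |p|
        p = (p.reshape(3, n) / np.maximum(1.0, mag)).ravel()    # reprojection, cf. (15)
    return (v.ravel() - theta * (D.T @ p)).reshape(v.shape)

def solve_data_subproblem(u, rho, grad_I1, lam, theta):
    # Thresholding (19), written as a clip; u and grad_I1 have shape (2, ny, nx),
    # rho has shape (ny, nx) and is the linearized residual evaluated at u.
    g2 = (grad_I1 ** 2).sum(axis=0) + 1e-12
    step = np.clip(rho / (theta * g2), -lam, lam)
    return u - theta * step * grad_I1

In a full solver these two steps would be alternated inside the usual multi-scale warping framework of [6,8].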
4
Results
We evaluated our C++/CUDA 1.0 implementation of the proposed model on the Middlebury optical flow database [18], a collection of challenging synthetic and real world image sequences. The test setup was a workstation with an Intel Core 2 CPU at 2.66 GHz (the software is single-threaded, so only one core was used) with an NVidia GeForce 8800 GTX graphics card, running a 32 bit Linux operating system and recent NVidia display drivers. First, we demonstrate the effects of the new second-order prior by comparing the proposed model to an implementation of the TV-regularized model [8] on the Middlebury “Venus” sequence. This sequence lends itself to such a comparison for four reasons: 1. the ground truth flow is available, 2. large areas of the flow field are perfectly affine, 3. there are several weakly textured areas (emphasizing the prior’s influence), and 4. the vertical flow component u2 is zero (because it is a rectified stereo pair), which simplifies the comparison. Figure 2(a) shows the
[Fig. 2, panels (a)–(k): color-coded flow fields, profile plots of the u1 component along line 270 (ground truth vs. first- and second-order priors), and the Middlebury error table; see the caption below.]
Fig. 2. First row: color-coded flow fields for the Middlebury "Venus" sequence; (a) ground truth, (b) calculated with the TV-L1 method, and (c) with the proposed method. Second row: (d–f) shows the u1 component of the flow at the start of line 270 of (a–c), respectively. The corresponding interval in line 270 is marked black in (a–c). Third row: the ground truth flows (g), (i) and the color-coded flows (h), (j) for the Middlebury "Army" (AAE = 3.84°) and "Mequon" (AAE = 3.14°) sequences, respectively. Last row (k): average angular error results for the Middlebury benchmark dataset; the proposed method, labeled "Second-order Prior", was ranked 2nd at the time of submission.
color-coded ground truth flow, where an interval at the start of line 270 has been marked black. The u1 component of the flow in this interval can be seen as a profile plot in Fig. 2(d). Note that the staircasing effect visible in this profile seems to be caused by a coarse quantization – based on the scene, we would expect a straight line here. Figures 2(b) and 2(c) show the color-coded flows for the TV-regularized (λ = 76.5, θ = 0.25) and the proposed model (λ = 45, θ = 0.25), respectively. Directly below each solution, Figs. 2(e) and 2(f) allow a close comparison to the ground truth in the selected interval. As intended, we got rid of the bias towards piecewise-constant motion by using the second-order prior. We also evaluated the proposed model on the Middlebury optical flow benchmark, using two grayscale frames per sequence. To reduce the influence of illumination changes and shadows, we performed a structure-texture decomposition of the images using the well known ROF model [19] (λROF = 10) and used only the textural part for the motion estimation. The model parameters λ = 55 and θ = 0.1 were held constant across the entire dataset. Estimating the flow fields for all eight sequences, including preprocessing, took 44 s (5.6 s on a single 584 × 388 image pair). Figures 2(h) and 2(j) show the resulting flow fields for the “Army” and the “Mequon” sequences, while Fig. 2(g) and 2(i) depict the corresponding ground truth flows. At the time of submission, the method was ranked second out of ten submissions at the Middlebury test site for both the average angular error and the average end-point error; see Fig. 2(k) for the angular error table of the six leading entries. For further images and a comparison on other statistical measures, we encourage the reader to visit http://vision.middlebury.edu/flow/eval/. A closer inspection of the table in Fig. 2(k) leads to three observations: first, the proposed method is the top performer in untextured areas for six out of eight sequences, indicating an advantage of the new prior. Second, the subpar performance on the “Grove” and the “Urban” sequences, which both contain large flow magnitudes, can be explained by deficiencies of the optimization approach (multiple scales, repeated warping, and locally linear approximation) and are not an inherent limitation of our new model. Finally, it is clearly apparent that the parameters for the baseline results provided by [18] have been optimized for the synthetic “Yosemite” sequence. This causes overly smooth results on real-world sequences, which considerably skews the comparison. On the other hand, this also explains the bad “Yosemite” results of our implementation.
5
Conclusion
We proposed a novel regularization approach, based on decorrelated second-order derivatives, which does not penalize locally affine motions and therefore avoids a bias towards fronto-parallel flow fields. Subsequently, we derived an efficient numerical scheme for solving this model. By achieving state-of-the-art results on the challenging Middlebury optical flow database, especially in untextured areas, we have clearly shown the applicability of the presented method. Future work will mainly focus on finding better optimization approaches to improve the performance on image pairs with large flow magnitudes (e.g. the
Middlebury “Urban” sequence). We also plan to investigate more complex data terms and an extension to color images.
References
1. Horn, B.K.P., Schunck, B.G.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
2. Black, M.J., Anandan, P.: A framework for the robust estimation of optical flow. In: Proc. of the ICCV, May 1993, pp. 231–236 (1993)
3. Nagel, H.H., Enkelmann, W.: An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Trans. Pattern Anal. Mach. Intell. 8(5), 565–593 (1986)
4. Weickert, J., Brox, T.: Diffusion and regularization of vector- and matrix-valued images. Inverse Problems, Image Analysis, and Medical Imaging. Contemporary Mathematics 313, 251–268 (2002)
5. Aubert, G., Deriche, R., Kornprobst, P.: Computing optical flow via variational techniques. SIAM Journal on Applied Mathematics 60(1), 156–182 (2000)
6. Papenberg, N., Bruhn, A., Brox, T., Didas, S., Weickert, J.: Highly accurate optic flow computation with theoretically justified warping. International Journal of Computer Vision 67(2), 141–158 (2006)
7. Bruhn, A., Weickert, J., Kohlberger, T., Schnörr, C.: A multigrid platform for real-time motion computation with discontinuity-preserving variational methods. International Journal of Computer Vision 70(3), 257–277 (2006)
8. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007)
9. Ishikawa, H.: Total absolute Gaussian curvature for stereo prior. In: Yagi, Y., Kang, S.B., Kweon, I.S., Zha, H. (eds.) ACCV 2007, Part I. LNCS, vol. 4843, pp. 537–548. Springer, Heidelberg (2007)
10. Yuan, J., Schnörr, C., Steidl, G.: Simultaneous higher-order optical flow estimation and decomposition. SIAM J. on Scientific Computing 29(6), 2283–2304 (2007)
11. Ju, S., Black, M.J., Jepson, A.: Skin and bones: Multi-layer, locally affine, optical flow and regularization with transparency. In: Proc. of the CVPR, pp. 307–314 (1996)
12. Bereziat, D., Herlin, I.: Object based optical flow estimation with an affine prior model. In: Proc. of the CVPR, vol. 3, pp. 1048–1051 (2000)
13. Le, H.V., Seetharaman, G., Zavidovique, B.: A multiscale technique for optical flow computation using piecewise affine approximation. In: Proc. of the Int. Workshop on Computer Architectures for Machine Perception, pp. 221–230 (2003)
14. Nir, T., Bruckstein, A.M., Kimmel, R.: Over-parameterized variational optical flow. International Journal of Computer Vision 76, 205–216 (2008)
15. Danielsson, P.E., Lin, Q.: Efficient detection of second-degree variations in 2D and 3D images. Journal of Visual Comm. and Image Representation 12, 255–305 (2001)
16. Carter, J.: Dual Methods for Total Variation-based Image Restoration. PhD thesis, UCLA, Los Angeles, CA (2001)
17. Chambolle, A.: Total variation minimization and a class of binary MRF models. In: Energy Minimization Methods in Comp. Vision and Pattern Rec., pp. 136–152 (2005)
18. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M., Szeliski, R.: A database and evaluation methodology for optical flow. In: Proc. of the ICCV (2007)
19. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
Physically Consistent Variational Denoising of Image Fluid Flow Estimates

Andrey Vlasenko and Christoph Schnörr

Image and Pattern Analysis Group, Heidelberg Collaboratory for Image Processing (HCI), University of Heidelberg, Germany
{vlasenko,schnoerr}@math.uni-heidelberg.de
Abstract. High-speed image measurements of fluid flows define an important field of research in experimental fluid mechanics and the related industry. Numerous competing methods have been developed for both 2D and 3D measurements. Estimates of fluid flow velocity fields are often corrupted, however, due to various deficiencies of the imaging process, making the physical interpretation of the measurements questionable. We present an algorithm that accepts vector field estimates from any method and returns a physically plausible denoised version of it. Our approach enforces the physical structure and does not rely on particular noise models. Accordingly, the algorithm performs well for different types of noise and estimation errors. The computational steps are sufficiently simple to scale up to large 3D problems in the near future.
1
Introduction
Experimental fluid mechanics is a challenging field of research of imaging science with important industrial applications [1]. During the last two decades, the prevailing technique for investigating turbulent flows through imaging has been particle image velocimetry in 2D [2,3], whereas various 3D measurement techniques, while being attractive from the physical viewpoint of applications, have been suffering from various drawbacks including noisy measurements, complexity and costs of the set-up, and limited resolution [4,5,6,7,8,9]. Remarkable progress has been recently achieved through a novel technique, Tomographic Particle Image Velocimetry (TomoPIV) [10] that, in principle, provides 3D estimates with higher spatial resolution. As a result, there is a range of methods for computing vector field estimates of incompressible viscous flows from image data, which exhibit diverse artefacts depending on the particular technique used, and on the particular physical scenario considered. This motivates investigating a method that denoises a given vector field in a physically plausible way. Rather than modeling noise explicitly, which is difficult and too specific due to the diversity of estimation errors that can occur, the method should return a vector field that is close to the input data and approximately satisfies the basic physical equations governing the flow. At the same time,
This work has been supported by the German Science Foundation, priority program 1147, grant SCHN 457/6-3.
Fig. 1. Top: Two ground truth flows, a Kelvin-Helmholtz flow (left) and a turbulent flow (top center). Bottom left, center: The Kelvin-Helmholtz and turbulent flows spoiled with noise. Right column: The turbulent flow corrupted by missing data and outliers within rectangular regions. The location of these regions is assumed to be unknown. The problem addressed in this paper is to restore the flows in a way that takes into account the physical origin of the flows but does not depend on a particular noise model. In the case of missing data (lower right panel), we assume these regions to be unknown.
the method should be robust to various types of estimation errors and computationally simple, so as to be applicable to large-scale 3D problems that the next generation of 3D measurement techniques will raise in the near future. Our approach presented below is motivated by recent work on variational PIV methods. Ruhnau and Schnörr [11] showed how to estimate physically consistent flow from PIV image sequences utilizing a distributed-parameter control approach. This has been extended in [12,13] to a dynamic setting based on the vorticity transport equation formulation of the Navier-Stokes equation. The task studied in this paper is more involved, however, because we wish to process corrupted vector fields as input data, and therefore cannot resort to image data in order to determine additional control variables. Rather, we wish to devise a method that accepts vector field estimates produced by any of the algorithms mentioned above, and returns a denoised version just by preserving and enforcing physically consistent flow structure. Our paper is organized as follows. We introduce the basic notation and the overall variational approach in section 2. Additional details for each computational stage involved are provided in section 3. Experimental results for various error types and performance measures are discussed in section 4. We conclude and point out further work in section 5.
2 Variational Approach

2.1 Notation, Preliminaries
We formulate our approach for the 3D case. Vectors are indicated with bold font. v denotes a velocity field and ω = ∇ × v its vorticity. The 2D case is automatically obtained by setting v = (v1, v2, 0)ᵀ. For example, ∇ × v then yields a vorticity field ω = (0, 0, ω3)ᵀ, with scalar vorticity ω3 = ∂x1 v2 − ∂x2 v1. We denote by ⟨u, v⟩, ‖v‖ the Euclidean inner product and norm, and by ⟨u, v⟩_Ω, ‖v‖_Ω the inner product and norm of [L²(Ω)]^d, d = 2, 3, respectively, where Ω ⊂ R^d is the given image section. We list a few relations used in the remainder of this paper. The vorticity-transport equation reads

\frac{\partial \omega}{\partial t} + (v \cdot \nabla)\omega + (\omega \cdot \nabla)v = \nu \Delta \omega .   (1)
We consider the flow on such a short time interval that the scale of temporal vorticity changes becomes several orders of magnitude smaller than the scale of spatial vorticity changes. The term ∂t ω in (1) then has a negligible impact in comparison with the other terms and can be omitted from the equation. In other words, the chosen time interval is short enough that one can reduce the flow motion to a quasi-stationary case. Once the time derivative is excluded, we can focus on a fixed point in time. For the remaining left-hand side, we introduce the short-hand

e(v) = \nu \Delta \omega , \qquad e(v) := (v \cdot \nabla)\omega + (\omega \cdot \nabla)v .   (2)
The integral identity

\int_\Omega (\nabla \times v) \cdot \phi \; dx - \int_\Omega v \cdot (\nabla \times \phi) \; dx = 0 , \qquad \forall \phi \in \left[ C_0^\infty(\Omega) \right]^d ,   (3)
(3)
is used to derive Euler-Lagrange equations below, and for incompressible flow we have ∇ × ∇ × ω = −Δω . (4) 2.2
Approach
Suppose we have given a corrupted vector field d. The true underlying viscous flow is incompressible and is assumed to satisfy (2). Regarding the design of a denoising algorithm, we have to keep in mind computational simplicity in order to be able to solve sequences of large-scale 3D problems that sensors will produce in the near future. Iterative Algorithm. In view of these requirements, we study in this paper a denoising approach comprising the following steps: 1. Project the data d to the subspace of divergence-free flows: d → v.
Physically Consistent Variational Denoising of Image Fluid Flow Estimates
409
2. Compute the vorticity ωv = ∇×v. Preserve and enforce physically plausible flow structure in terms of equation (2): ωv → ω. 3. Recover velocity from vorticity: ω → u. If the termination criterion is satisfied, stop. Otherwise, continue with step 1. The steps 1.-3. will be detailed in section 3. Termination criterion. We compute the energy spectrum1 in term of the 2 (k) , k ∈ [π, π]d , and define the magnitude of a cutoff Fourier transform v frequency kc depending on the flow under investigation, e.g. in terms of the smallest scale of vorticities to be resolved. We stop the iteration once the upper energy band has been sufficiently damped relative to the lower band 2 (k)dk ≤ ck 2 (k)dk , (5) v v k∞ ≥kc
k∞ ≤kc
where a typical value is ck = 100. See Fig. 2 below for an illustration.
3
Computational Steps: Details
3.1
Subspace Projection
3 Consider the orthogonal decomposition of the space V = L2 (Ω) = Vg ⊕ Vsol into gradients and solenoidal (divergence-free) vector fields [14]: d = ∇φ + ∇ × ψ.The orthogonal projection P : V → Vsol onto the space Vsol = v ∈ V ∇ · v = 0 , v · n = 0 on ∂Ω is accomplished by solving Δφ = ∇ · d , φ = 0 on ∂Ω and removing the divergence v = d − ∇φ ∈ Vsol . (6) 3.2
Vorticity Rectification
Set ω v = ∇ × v. In order to take into account Eqn. (2), we minimize
min ω − ω v 2Ω + α ν∇ × ω2Ω + 2 e(v), ω Ω ω
(7)
The first term ensures that the minimizer stays close to the vorticity computed in the previous step, and the second term enforces (2). Computing the first variation of the functional (7) and applying (3) and (4), we obtain the EulerLagrange equation ω − ανΔω = ωv − αe(v) . 1
Note that the usual definition is based on time averages. We focus in this paper on a single time point, however.
410
A. Vlasenko and C. Schn¨ orr
This is a system of linear diffusion equations (a single equation in 2D). Rewriting this equation, ω = ω v − α e(v) − νΔω , we see that the vorticity ω is corrected by the residual of Eqn. (2) where the nonlinearity e(v) is evaluated at v computed in the preceding step. Because correct boundary conditions are not known, we reduce the Laplacian at ∂Ω to linear diffusion along the boundary. 3.3
Velocity Restoration
The minimizer ω of the previous step may not correspond to a solenoidal vector field. Hence, as a third computational step, we propose to minimize
min u − v2Ω + β∇ × u − ω2Ω , (8) u
with v from (6) and ω from (7). Computing the first variation of the functional (8) and applying (4), we obtain again a linear diffusion system u − βΔu = v + β∇ × ω ,
∂u =0. ∂n
Here, it is plausible to impose natural boundary conditions on u.
4
Experiments and Discussion
We tested the denoising approach for 2D flows, so far. The 3D case is subject of future work and will be reported elsewhere. The test was carried out for two different flows: 1 Groundtrout Noisy
|| V||2 ( k)
Denoised
0.5
0
pi/2 k
pi
Fig. 2. Illustration of the termination criterion (5) for the noisy turbulent flow experiment shown in Fig. 4. Restoration effectively removes noise and preserves physically significant large-scale structures of vorticity.
Physically Consistent Variational Denoising of Image Fluid Flow Estimates
411
Fig. 3. Top: Ground truth flow g.Center: Noisy data d. Noise has been cut out within a rectangular region to illustrate the signal-to-noise ratio.Bottom: Denoised flow u. Performance measure: SDR=4.43, ADR=6032, NDR=1.12, DDR=0.999. Number of iterations: 3.
412
A. Vlasenko and C. Schn¨ orr
Fig. 4. Top: Ground truth flow g. Center: Noisy data. Noise has been cut out within a rectangular region to illustrate the signal-to-noise ratio. Bottom: d. Denoised flow u. Performance measures: SDR=15.12, ADR=2.45, NDR=9.51, DDR=1.004. Number of iterations: 16. This example shows that the approach gracefully degrades with increasing noise levels and its limitation: In the extreme situation above, not all small-scale structures can be restored.
Physically Consistent Variational Denoising of Image Fluid Flow Estimates
413
Fig. 5. Top: Ground truth flow g. Center: Corrupted data d by missing values (left) or outliers within rectangular regions (right). The location of these regions are assumed to be unknown. Bottom: Denoised flows u. Performance measures, left experiment: SDR=1.1131, DDR=1.128. Number of iterations: 4. Performance measures, right experiment: SDR=1.645, ADR=0.532, NDR=1.0087, DDR=0.924. Number of iterations k=5.
414
A. Vlasenko and C. Schn¨ orr
– A Kelvin-Helmholtz flow [15], [16], see Fig. 1 top-left and Fig. 3 top, that occurs when velocity shear is present withing a continuous fluid, or when there is a sufficiently large velocity difference across the interface between two fluids. – A turbulent flow from Navier-Stokes simulations, see Fig. 1 top-right and Fig. 4 top. Various types of data corruptions were investigated: White Gaussian noise with both high and low signal-to-noise ratio as well as square regions with missing and corrupted data. Figure 1 displays these vector fields represented by d in the first step (6) of our iterative restoration approach. Denote g:ground truth flow, d: corrupted input data, u: denoised vector field. In order to evaluate the denoising performance of our approach and its robustness we used the following quantitative performance measures: SDR =
d−gΩ u−gΩ
, ADR =
arccos(dg)Ω arccos(ug)Ω
, N DR =
dΩ gΩ
, DDR =
uΩ gΩ
.
The meanings of these measures are as follows: SDR gives the rate between standard deviations of noisy and denoised vector fields from groundtrouth vector field, FDR gives the rate between average angle deviations of noisy and denoised data from groundrouth, ADR gives the rate of average angle deviation of noisy and denoised vector fields from groundtrouth, NDR and DDR give the average vector length of noisy and denoised vector fields in comparison with groundtrouth average vector length. Figures 3-5 confirm and illustrate that the algorithm does not depend on a particular model of noise and errors. Rather, the physical structures (vorticity) are restored fairly well. Quantitative results are given in the figure captions. Figure 4 shows an example of very low signal-to-noise-ratio. The still plausible restoration result illustrates that our approach gracefully degrades with increasing noise levels, and its limitation: Small-case structures cannot always be correctly restored in such extreme situations.
5
Conclusion
We presented a black-box variational approach to the restoration of PIV-measurements in experimental fluid mechanics. It restores physically significant flow structures, copes with various types of noise and errors, degrades gracefully with decreasing signal-to-noise ratio and involves computationally simple steps. Our further work will further explore the trade-off between computational simplicity (i.e. efficiency) and tigher couplings of the computational steps through constraints, and on problem representations specific to large-scale 3D applications.
References 1. Tropea, C., Yarin, A.L., Foss, J.F. (eds.): Springer Handbook of Experimental Fluid Mechanics. Springer, Heidelberg (2007) 2. Adrian, R.J.: Twenty years of particle velocimetry. Experiments in Fluids 39(2), 159–169 (2005)
Physically Consistent Variational Denoising of Image Fluid Flow Estimates
415
3. Raffel, M., Willert, C.E., Wereley, S.T., Kompenhans, J.: Particle Image Velocimery – A Practical Guide. Springer, Heidelberg (2007) 4. Br¨ ucker, C.: Digital-particle-image-velocimetry (DPIV) in a scanning light-sheet: 3d starting flow around a short cylinder. Experiments in Fluids 19(4) (1995) 5. Burgmann, S., Br¨ ucker, C., Schr¨ oder, W.: Scanning PIV measurements of a laminar separation bubble. Experiements in Fluids 41(2), 319–326 (2006) 6. Hinsch, K.D.: Holographic particle image velocimetry. Measurement Science and Technology 13, 61–72 (2002) 7. Maas, H.G., Gruen, A., Papantoniou, D.: Particle tracking velocimetry in threedimensional flows. Experiments in Fluids 15(2), 133–146 (1993) 8. Hoyer, K., Holzner, M., L¨ uthi, B., Guala, M., Liberzon, A., Kinzelbach, W.: 3d scanning particle tracking velocimetry. Experiments in Fluids 39(5), 923–934 (2005) 9. Ruhnau, P., G¨ uttler, C., Putze, T., Schn¨ orr, C.: A variational approach for particle tracking velocimetry. Measurement Science and Technology 16, 1449–1458 (2005) 10. Elsinga, G., Scarano, F., Wieneke, B., van Oudheusden, B.: Tomographic particle image velocimetry. Experiments in Fluids 41(6), 933–947 (2006) 11. Ruhnau, P., Schn¨ orr, C.: Optical stokes flow estimation: An imaging-based control approach. Exp. Fluids 42, 61–78 (in press, 2007) 12. Ruhnau, P., Stahl, A., Schn¨ orr, C.: On-line variational estimation of dynamical fluid flows with physics-based spatio-temporal regularization. In: Franke, K., M¨ uller, K.-R., Nickolay, B., Sch¨ afer, R. (eds.) DAGM 2006. LNCS, vol. 4174, pp. 444–454. Springer, Heidelberg (2006) 13. Ruhnau, P., Stahl, A., Schn¨ orr, C.: Variational estimation of experimental fluid flows with physics-based spatio-temporal regularization. Measurement Science and Technology 18, 755–763 (2007) 14. Girault, V., Raviart, P.-A.: Finite Element Methods for Navier-Stokes Equations. Springer, Heidelberg (1986) 15. Stashchuk, N., Vlasenko, V., Hutter, K.: Kelvin-Helmholz instability: Numerical modelling of disintegration of basin-scale internal waves in a tank filled with stratified water. Nonlinear Processes in Geophysics 12, 955–964 (2005) 16. Van Dyke, M.: An Album of Fluid Motion. The Parabolic Press, Stanford (1982)
Convex Hodge Decomposition of Image Flows Jing Yuan1 , Gabriele Steidl2 , and Christoph Schn¨ orr1 1
Image and Pattern Analysis Group, Heidelberg Collaboratory for Image Processing, University of Heidelberg, Germany {yuanjing,schnoerr}@math.uni-heidelberg.de 2 Faculty of Mathematics and Computer Science, University of Mannheim, Germany [email protected]
Abstract. The total variation (TV) measure is a key concept in the field of variational image analysis. Introduced by Rudin, Osher and Fatemi in connection with image denoising, it also provides the basis for convex structure-texture decompositions of image signals, image inpainting, and for globally optimal binary image segmentation by convex functional minimization. Concerning vector-valued image data, the usual definition of the TV measure extends the scalar case in terms of the L1 -norm of the gradients. In this paper, we show for the case of 2D image flows that TV regularization of the basic flow components (divergence, curl) leads to a mathematically more natural extension. This regularization provides a convex decomposition of motion into a richer structure component and texture. The structure component comprises piecewise harmonic fields rather than piecewise constant ones. Numerical examples illustrate this fact. Additionally, for the class of piecewise harmonic flows, our regularizer provides a measure for motion boundaries of image flows, as does the TV-measure for contours of scalar-valued piecewise constant images.
1
Introduction
The total variation (TV) measure has been introduced by Rudin, Osher and Fatemi [1] in connection with image denoising. The precise definition will be given in Section 2. In the following, let Ω ⊂ R2 be a bounded domain. Minimizing the convex functional 1 u − d2Ω + λTV(u) (1) 2 for given image data d leads to an edge-preserving nonlinear smoothing process that effectively removes noise and small-scale spatial patterns from d. Starting with the work of Meyer [2], the more general viewpoint of image decomposition has been adopted – see [3] and references therein. The basic model is again given by (1), leading to a decomposition d=u+v
(2)
This work was supported by the German Science Foundation (DFG), grant Schn 457/9-1.
of given image data d in a piecewise constant image structure u and oscillatory patterns and noise v. Another key property of the TV measure is due to its geometric interpretation via the co-area formula [4]: TV(u) equals the length of level lines of u, summed up over the range of u (contrast). As a consequence, TV(u) measures the length of contours of piecewise constant images, hence implements the regularization term of the Mumford-Shah variational approach to segmentation. This fact has been explored for image inpainting [5] and more recently also for image segmentation, in order to replace non-convex variational approaches [6,7] by convex ones that can be globally optimized [8,9]. A natural issue concerns the application of the TV measure to vector-valued data u = (u_1, u_2)^T. Usually definitions are applied that, for sufficiently regular u (cf. next section), take the form [10]

\mathrm{TV}(u) = \int_\Omega \sqrt{ |\nabla u_1|^2 + |\nabla u_2|^2 } \; dx .   (3)
The Helmholtz decomposition of u into its basic components, divergence and curl, suggests an alternative:

\int_\Omega \sqrt{ (\mathrm{div}\, u)^2 + (\mathrm{curl}\, u)^2 } \; dx .   (4)
This viewpoint has been suggested recently in [11] for decomposing image flows. However, a geometric interpretation and its connection to the definitions (3) and (1) has not been given. It will turn out below that
– (4) is a mathematically more natural definition extending (1), and that
– (4) decomposes flows into a richer "structural" component – analogous to u in the scalar case (2) – comprising piecewise harmonic flows rather than piecewise constant flows.

The paper is organized as follows. We recall necessary material of TV-based scalar image decomposition in Section 2. Next, we consider the Helmholtz and Hodge decompositions of vector fields into orthogonal subspaces in Section 3. Comparing by analogy the latter decomposition to the scalar-valued case, and then re-considering the convex TV-based decomposition of scalar images in Section 2, we are led in Section 4 to the interpretation of (4) as a generalized convex decomposition of image flows. Section 5 embeds this decomposition into a variational denoising approach for image flows. We illustrate this result and its consequences numerically in Section 6 and conclude in Section 7.
2
TV-Based Convex Image Decomposition
The total variation of a scalar function u is defined as

\mathrm{TV}(u) = \sup_{\|p\|_\infty \le 1} \langle u, \mathrm{div}\, p \rangle_\Omega , \qquad p|_{\partial\Omega} = 0 ,   (5)
where \|p\|_\infty = \max_{x \in \Omega} \sqrt{ p_1^2(x) + p_2^2(x) } and p ∈ C_0^1(Ω; R²). Based on this definition, the convex variational image denoising approach (1) can be transformed into its dual formulation [12]

\min_{\|p\|_\infty \le 1} \|\lambda \, \mathrm{div}\, p - d\|_\Omega^2 , \qquad d = u + \lambda \, \mathrm{div}\, p .   (6)

This results in a decomposition of scalar image data d,

d = u + \lambda \, \mathrm{div}\, p , \qquad \|p\|_\infty \le 1 , \qquad p|_{\partial\Omega} = 0 ,

where u provides the piecewise smooth component, i.e. the large-scale structural parts of the image signal d, and λ div p comprises the corresponding small-scale structure, i.e. texture and noise. To this end, it is also called the structure-texture decomposition. In order to be consistent with the following definitions, we rewrite the TV-based scalar decomposition as

d = u + \mathrm{div}\, p , \qquad \|p\|_\infty \le \lambda , \qquad p|_{\partial\Omega} = 0 .   (7)
The small-scale component div p is the orthogonal projection Π : L²(Ω) → div C_λ of the data d onto a convex set div C_λ, i.e. onto the image of the set of norm-constrained vector fields

C_\lambda = \left\{ (p_1, p_2)^T \;\middle|\; \|p\|_\infty \le \lambda \right\}   (8)

under the divergence operator. For d with mean zero, the smallest value of λ such that the solution u of (1) becomes zero equals the G-norm of d (cf. [2]), which measures the size of the small-scale oscillating component (noise, texture) of the image function. Hence, we formally write

L^2(\Omega) = U \oplus_\Pi \mathrm{div}\, C_\lambda ,   (9)
where the index of ⊕Π indicates that this is a convex decomposition, rather than an orthogonal decomposition into subspaces.
3
Orthogonal Hodge Decomposition of Image Flows
In this section, we summarize orthogonal decomposition of vector fields, the Helmholtz decomposition and, as an extension thereof, the Hodge decomposition. The latter decomposition together with (9) will provide the basis for our novel decomposition of image flows in the subsequent section.
Theorem 1 (2D Orthogonal Decomposition). Any vector field u ∈ (L²(Ω))² can be decomposed into an irrotational and a solenoidal part

u = v + \nabla^\perp \phi , \qquad \phi|_{\partial\Omega} = 0 ,   (10)

where curl v = 0. This decomposition is unique, and the components are orthogonal in the sense \langle v, \nabla^\perp \phi \rangle_\Omega = 0.

The representation (10) corresponds to an orthogonal decomposition of the space of all vector fields into subspaces of gradients and curls,

\left( L^2(\Omega) \right)^2 = \nabla H(\Omega) \oplus \nabla^\perp H_o(\Omega) ,   (11)

where H(Ω) and H_o(Ω) are suitably defined Sobolev-type spaces. We refer to [13] for details. The curl-free component can be further decomposed into

v = h + \nabla \psi ,   (12)

where ψ|_{∂Ω} = 0 and div v = div ∇ψ. The component h satisfies

\mathrm{div}\, h = 0 , \qquad \mathrm{curl}\, h = 0 ,

and is called harmonic flow. We summarize these facts:

Theorem 2 (2D Hodge Decomposition). Let Ω be a bounded, sufficiently regular 2D domain (Poincaré lemma [13]). Then any vector field u ∈ (L²(Ω))² can be decomposed into

u = h + \nabla \psi + \nabla^\perp \phi , \qquad \phi|_{\partial\Omega} = 0 , \; \psi|_{\partial\Omega} = 0 ,   (13)

where div h = curl h = 0. This decomposition is unique, and the three parts on the right-hand side are orthogonal to each other. Accordingly, the decomposition (11) can be refined to

\left( L^2(\Omega) \right)^2 = H \oplus \nabla H_o(\Omega) \oplus \nabla^\perp H_o(\Omega) ,   (14)
where the additional index of the second component indicates the boundary condition ψ∂Ω = 0, and where H denotes the space of all harmonic flows.
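The decomposition of Theorem 2 can be computed numerically from div u and curl u via two Poisson solves; the following 2D sketch (ours, with simplified boundary handling) is meant only to make the construction concrete.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def dirichlet_poisson(rhs, h=1.0):
    # Solve Laplacian(f) = rhs with f = 0 on the boundary (5-point stencil, interior grid)
    ny, nx = rhs.shape
    lap1d = lambda n: sp.diags([1, -2, 1], [-1, 0, 1], shape=(n, n))
    L = sp.kronsum(lap1d(nx - 2), lap1d(ny - 2), format='csc') / h ** 2
    f = np.zeros_like(rhs)
    f[1:-1, 1:-1] = spsolve(L, rhs[1:-1, 1:-1].ravel()).reshape(ny - 2, nx - 2)
    return f

def hodge_decomposition(u1, u2, h=1.0):
    # u = h + grad(psi) + perp-grad(phi) as in Eq. (13):
    # Laplacian(psi) = div u, Laplacian(phi) = curl u, perp-grad(phi) = (-d phi/dy, d phi/dx);
    # the harmonic part is the remainder.
    div = np.gradient(u1, h, axis=1) + np.gradient(u2, h, axis=0)
    curl = np.gradient(u2, h, axis=1) - np.gradient(u1, h, axis=0)
    psi, phi = dirichlet_poisson(div, h), dirichlet_poisson(curl, h)
    h1 = u1 - np.gradient(psi, h, axis=1) + np.gradient(phi, h, axis=0)
    h2 = u2 - np.gradient(psi, h, axis=0) - np.gradient(phi, h, axis=1)
    return (h1, h2), (psi, phi)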
4
Convex Hodge Decomposition of Image Flows
In this section, we first provide an orthogonal decomposition of scalar-valued functions analogously to the orthogonal decomposition of vector fields in the previous section. Next, we generalize this to a novel convex decomposition of image flows, in view of the convex decomposition of scalar functions described in Section 2.
4.1 Orthogonal Decomposition of Scalar-Valued Functions
We look for an orthogonal decomposition of scalar-valued image functions, analogous to the decomposition (13) for vector fields. Any scalar-valued function d ∈ L²(Ω) can be decomposed into

d = c + \mathrm{div}\, p , \qquad p|_{\partial\Omega} = 0 .   (15)

Since \int_\Omega \mathrm{div}\, p \; dx = 0, due to the boundary condition on p, the constant function c is just the mean value \int_\Omega d \; dx / |\Omega|, and we also have the orthogonality of the two components, \langle c, \mathrm{div}\, p \rangle_\Omega = 0. Referring to (13), it makes sense to regard (15) as a Hodge decomposition of scalar fields.

Next, let us consider the connection between (15) and the convex decomposition (7). As in (7), d is decomposed into two components, a smooth component and a remaining component of small-scale structure. The difference between (7) and (15) is that the latter projects onto special convex sets: orthogonal subspaces. This leads to a constant "smooth" component c in (15), as opposed to u in (7), which tends to be piecewise constant. Of course, if λ is large enough, then u in (7) also becomes a constant. Likewise, analogous to the convex decomposition (9), we obtain as a result of (15) the orthogonal decomposition

L^2(\Omega) = H_c \oplus \mathrm{div}\, P ,   (16)

where H_c is the set of all constant functions on Ω, and P is the subspace of all vector fields vanishing on the boundary ∂Ω.

4.2 Convex Decomposition of Vector Fields
Based on the previous discussion, natural questions arise:

1. Because the orthogonal decomposition of scalar fields (15), (16) is clearly inferior to the convex decomposition (7), (9) in that the structural component u models more general functions, what is the convex counterpart of the orthogonal decomposition of vector fields (13), (14)?
2. What is the natural generalization of the variational denoising approach (1) to the case of vector fields?

We will address the first question next and the second question in the following section.

Definition 1 (Convex Hodge Decomposition). Given a vector field d ∈ (L²(Ω))², we define the convex decomposition

d = u + \nabla \psi + \nabla^\perp \phi ,   (17)
where the non-smooth component u is given by the orthogonal projection of d onto the convex set

S_\lambda = \left\{ \nabla \psi + \nabla^\perp \phi \;\middle|\; (\psi, \phi) \in C_\lambda \cap \left( H_o(\Omega) \right)^2 \right\} ,   (18)

with C_λ from (8). Formally, write

\left( L^2(\Omega) \right)^2 = U_H \oplus_\Pi \left\{ (\nabla, \nabla^\perp), C_\lambda \right\} ,   (19)
where Π is the orthogonal projector onto Sλ (18). Note the structural similarity between (7), (9) and (17), (19). Likewise, as U in (9) generalizes Hc in (16) from constant functions to piecewise constant functions, so UH in (19) generalizes H in (14) to piecewise harmonic vector fields.
5
Variational Denoising of Image Flows
In this section, we investigate a natural generalization of the variational approach (1) to vector fields. First, we present the objective function and reveal its connection to the convex decomposition (17). Next, we show that for the particular case of piecewise harmonic vector fields, it regularizes motion boundaries in a similar way as the functional (1) does for the contours of piecewise constant and scalar-valued functions.

5.1 Variational Approach
For a given vector field d, we study the convex functional

\inf_u \left\{ \frac{1}{2} \|u - d\|_\Omega^2 + \lambda R(u) \right\}   (20)

with

R(u) = \sup_{s \in S} \langle s, u \rangle_\Omega ,   (21)

where S = S_1 from (18). Because R(u) is positively homogeneous, λR(u) equals \sup_{s \in S_\lambda} \langle s, u \rangle_\Omega. Hence, (20) reads

\inf_u \sup_{s \in S_\lambda} \left\{ \frac{1}{2} \|u - d\|_\Omega^2 + \langle s, u \rangle_\Omega \right\} .

Exchanging inf and sup yields the convex decomposition (17) (cf. (18)), d = u + s, and as the dual convex problem of (1) the projection of the data d onto the convex set S_λ:

\inf_{s \in S_\lambda} \|s - d\|_\Omega^2 , \qquad d = u + \nabla \psi + \nabla^\perp \phi .   (22)
Let us point out again the similarities of our approach for denoising vector fields with the ROF-model (1) for denoising scalar-valued functions, based on the convex Hodge decomposition introduced in the previous section. The regularizer R(u) in (21) is defined as support function of a convex set, as is the total variation measure TV(u) in (5). Likewise, the dual convex problem (22) characterizes the non-smooth, oscillating component ("motion texture") of a given vector field d as a projection onto a convex set, as does the ROF-model for scalar-valued functions in (6).

5.2 Motion Segmentation and Piecewise Harmonic Image Flows
We take a closer look at the analogy between the ROF-model and our generalization to vector fields, as discussed in the previous section. Assuming u to be sufficiently smooth, and taking into account the representation (18) and that the boundary values of the functions ψ, φ representing an element s ∈ S vanish, we rewrite the regularizing term (21):

R(u) = \sup_{s \in S} \langle s, u \rangle_\Omega = \sup_{\|(\psi,\phi)\|_\infty \le 1} \langle \nabla \psi + \nabla^\perp \phi, u \rangle_\Omega = \sup_{\|(\psi,\phi)\|_\infty \le 1} \left( - \langle \psi, \mathrm{div}\, u \rangle_\Omega - \langle \phi, \mathrm{curl}\, u \rangle_\Omega \right) = \int_\Omega \sqrt{ (\mathrm{div}\, u)^2 + (\mathrm{curl}\, u)^2 } \; dx .   (23)
This differs from the commonly used term (3). In particular, for a piecewise harmonic flow (i.e. with almost everywhere vanishing divergence and curl), corresponding to the structural part of the decomposition (19), this measure only contributes at motion boundaries. In fact, we can show that R(u) plays a similar role as does the TV-measure at discontinuities of scalar-valued functions [14].
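To make the difference between (3) and (23) tangible, the two discrete measures can be compared on a simple harmonic field; this small sketch and its function names are ours, not part of the paper.

import numpy as np

def divcurl_regularizer(u1, u2, h=1.0):
    # Discrete version of (23): integral of sqrt(div^2 + curl^2); it vanishes on
    # harmonic flows and only responds at motion boundaries / non-harmonic parts.
    div = np.gradient(u1, h, axis=1) + np.gradient(u2, h, axis=0)
    curl = np.gradient(u2, h, axis=1) - np.gradient(u1, h, axis=0)
    return np.sum(np.sqrt(div ** 2 + curl ** 2)) * h * h

def vector_tv(u1, u2, h=1.0):
    # Standard vector-valued TV (3): integral of sqrt(|grad u1|^2 + |grad u2|^2).
    g = [np.gradient(c, h, axis=a) for c in (u1, u2) for a in (0, 1)]
    return np.sum(np.sqrt(sum(x ** 2 for x in g))) * h * h

# The harmonic flow u = (x, -y) has zero divergence and curl, so (23) is zero
# while the usual TV measure (3) remains strictly positive.
yy, xx = np.mgrid[0:64, 0:64] / 64.0
print(divcurl_regularizer(xx, -yy), vector_tv(xx, -yy))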
6
Numerical Examples
The main contribution of this paper is the mathematically thorough investigation of non-smooth convex regularization of image flows, based on the decomposition framework presented in Section 4, and leading to the variational approach (20) in Section 5 that naturally extends the ROF-model (1) to the vector-valued case. In this section, we confine ourselves to underline the difference between the novel approach (20) and the commonly used regularization term (3) with few numerical experiments. Figure 1 shows an image flow additively composed of a harmonic flow and a second small-scale motion component. While the usual regularization (3) always returns a constant vector field as structural part, the regularizer (21) is able to separate these two components. This result clearly validates the theory. Figure 2 presents a more involved non-rigid flow composed of several components, a global harmonic flow and four very local constant ones (i.e. harmonic as
Fig. 1. Left two columns: A flow comprising a harmonic component and motion texture as vector plot and orientation plot (top) and its divergence and curl (bottom). Third column: Standard TV-regularization is unable to separate the flows. Right column: Our novel regularizer clearly separates motion texture and nonrigid flows.
Fig. 2. A complex flow – orientation plot (left) and vector plot (right) – comprising a single global harmonic flow, four very local constant flows, and two further local nonharmonic flows. The addition of all components provides the input data for convex variational denoising.
well), and two further local non-harmonic flows. Figures 3 and 4 show the results for the usual TV method (3) and our approach for two different values of the regularization parameter λ that determines what large-scale and small-scale motion structures mean. The key observation is that (3) does not lead to a meaningful result, whereas the novel term (23) clearly separates motion discontinuities and non-harmonic flows from harmonic ones, depending on their scale relative to the value of λ.
Fig. 3. Left column: Standard TV regularization always mixes up the flow components of Figure 2. Right column: Our novel geometric regularization term always captures – depending on the scale parameter λ – motion discontinuities and non-harmonic flow components.
Fig. 4. Same results as shown in Figure 3, but for a larger value of the scale parameter λ. Left column: Independent of this different λ-value, standard TV-regularization just filters out a constant flow. Right column: Due to the larger value of the scale parameter λ, only the smooth global harmonic flow is returned as structural part by the novel regularizer (top, right), whereas all smaller-scale flows are regarded and separated as motion texture (bottom, right).
7 Conclusions
Our work enlarges the class of convex variational approaches that can be used to denoise and separate – in our case – vector-valued data. Our results elucidate a further mathematical setting where, to some extent, decisions (separation and segmentation) can be made by convex programming alone, i.e. in a globally optimal way. Our further work will explore the connection of spaces of harmonic flows to other low-dimensional spaces of flows that are relevant for applications. Acknowledgement. This work has been supported by the German Science Foundation (DFG), grant Schn 457/9-1.
References
1. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)
2. Meyer, Y.: Oscillating Patterns in Image Processing and Nonlinear Evolution Equations. Univ. Lect. Series, vol. 22. Amer. Math. Soc. (2001)
3. Aujol, J.-F., Gilboa, G., Chan, T., Osher, S.: Structure-texture image decomposition – modeling, algorithms, and parameter selection. Int. J. Comp. Vision 67(1), 111–136 (2006)
4. Ziemer, W.P.: Weakly Differentiable Functions. Springer, Heidelberg (1989)
5. Chan, T.F., Shen, J.: Mathematical models for local non-texture inpaintings. SIAM J. Appl. Math. 61(4), 1019–1043 (2001)
6. Mumford, D., Shah, J.: Optimal approximations by piecewise smooth functions and associated variational problems. Comm. Pure Appl. Math. 42, 577–685 (1989)
7. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. Image Proc. 10(2), 266–277 (2001)
8. Chan, T.F., Esedoglu, S., Nikolova, M.: Algorithms for finding global minimizers of image segmentation and denoising models. SIAM J. Appl. Math. 66(5), 1632–1648 (2006)
9. Bresson, X., Esedoglu, S., Vandergheynst, P., Thiran, J.-P., Osher, S.: Fast global minimization of the active contour/snake model. J. Math. Imag. Vision 28(2), 151–167 (2007)
10. Chan, T.F., Shen, J.: Image Processing and Analysis. Cambridge University Press, Cambridge (2005)
11. Yuan, J., Schnörr, C., Steidl, G.: Simultaneous optical flow estimation and decomposition. SIAM J. Scientific Computing 29(6), 2283–2304 (2007)
12. Chambolle, A.: An algorithm for total variation minimization and applications. J. Math. Imag. Vision 20, 89–97 (2004)
13. Girault, V., Raviart, P.-A.: Finite Element Methods for Navier-Stokes Equations. Springer, Heidelberg (1986)
14. Yuan, J., Steidl, G., Schnörr, C.: Convex Hodge decompositions of image flows (submitted)
Image Tagging Using PageRank over Bipartite Graphs
Christian Bauckhage
Deutsche Telekom Laboratories, Berlin, Germany
http://www.laboratories.telekom.com
Abstract. We consider the problem of automatic image tagging for online services and explore a prototype-based approach that applies ideas from manifold ranking. Since algorithms for ranking on graphs or manifolds often lack a way of dealing with out of sample data, they are of limited use for pattern recognition. In this paper, we therefore propose to consider diffusion processes over bipartite graphs which allow for a dual treatment of objects and features. As with Google’s PageRank, this leads to Markov processes over the prototypes. In contrast to related methods, our model provides a Bayesian interpretation of the transition matrix and enables the ranking and consequently the classification of unknown entities. By design, the method is tailored to histogram features and we apply it to histogram-based color image analysis. Experiments with images downloaded from flickr.com illustrate object localization in realistic scenes.
1 Motivation and Background

After online photo sharing and storage services such as flickr or zooomr were launched just a few years ago, annotated image data have become available in abundance. For researchers in computer vision and image retrieval, this provides opportunities and challenges alike. On the one hand, vast collections of labeled images provide the perfect foundation for statistical learning and corresponding efforts towards the daunting problem of general image categorization are increasing [1,2,3]. On the other hand, folksonomies, i.e. collaboratively created collections of tags, are already notorious for being misleading, less reliable, or even contradictory. For diligent users who want their pictures to rank high in appropriate indices, the latter translates into an increased workload. In order to ease the burden of having to assign or verify tags on a regular basis, methods are called for that assist users in these tasks. In order to cope with the real-time constraints of online services, automatic tagging requires descriptive features that are efficient to compute and process. Moreover, systems that assist users in image tagging are supposed to deal with color images and should be able to locate objects of interest. Recent approaches to image categorization are therefore less suited for such a scenario, for they either require extensive training and/or classify whole images and cannot tag image regions. In this paper, we therefore explore the possibility of a rapidly learning classifier that achieves a per-pixel decision as to whether or not an intended object is present in an image. This, of course, is hardly a new idea since for certain objects there are efficient and reliable per-pixel detectors [4]. However, these methods again require extensive training and are furthermore tailored towards object classes of low intra-class variability, so that their performance drops noticeably for less homogeneous objects.
In order to cope with the highly non-linear visual manifolds one encounters when dealing with arbitrary objects from the real world, we follow a prototype-based approach. Our basic idea is to compute a local color descriptor for each pixel and compare it to a set of familiar examples. This differs from the popular bags of features approach sparked by Lowe's SIFT descriptor [5] in that we do not locate keypoints for which the descriptors are computed. Neither does our method directly apply gradient information or derive a visual vocabulary from clustering large sets of descriptors as proposed in [1]. What is common with recent methods, however, is our use of histograms. Color- or gradient-histograms are well established descriptors. They are conceptually simple and empirical evidence underlines their good performance [1,6,7]. Moreover, for rectangular image regions, histograms can be computed efficiently regardless of the size of the region [8] and they are known to be invariant under small viewpoint variations [9]. However, histograms merely count color frequencies and do not capture spatial characteristics of a set of color pixels. Correlograms, on the other hand, count color transitions between pixels and thus provide an enhanced descriptor [10]. If normalized to unity, i.e. turned into a distribution, histograms and correlograms may also cope with varying object sizes. With respect to classification, however, the unit constraint entails certain limitations. Covariance matrices, for instance, become singular, so that a host of popular techniques do not apply anymore [11]. Our contribution in this paper is as follows: We first describe a novel algorithm particularly suited for prototype-based histogram or correlogram classification. We represent objects and features by means of the two sets of vertices of a bipartite graph and consider Markov processes over the set of objects in order to obtain similarity rankings. Extending the ranking procedure to the classification of out of sample objects is straightforward and requires almost no training. After contrasting our approach with related work and clarifying its relation to Google's PageRank and Kleinberg's HITS algorithm, we secondly briefly discuss efficient correlogram computation by means of dynamic programming. Finally, in section 4, we present experiments on tagging of unconstrained, highly textured natural images which demonstrate the capabilities of our approach. A summary and an outlook on future work will close this contribution.
2 Histogram Classification with Bipartite PageRank

This section introduces histogram classification by means of probabilistic diffusion over bipartite graphs. The basic idea is to project novel observations into the space spanned by the training objects. A Markov process then ranks each novel observation with respect to the known ones. In order for our probabilistic interpretation to be valid, we henceforth assume accumulated histograms to be normalized to 1.

2.1 Terms and Concepts

Let O = {o_1, o_2, . . . , o_n} be a set of prototypical objects and let F = {f_1, f_2, . . . , f_m} be a set of features. We assume each o_j ∈ O to be characterized by a distribution over the feature set which we represent by means of a stochastic vector f^j ∈ R^m where \sum_{k=1}^{m} f_k^j = 1.
These assumptions translate into a directed bipartite graph G = (N, E) where the set of nodes N = O ∪ F with O ∩ F = ∅ and the set of edges E ⊆ O × F. A dual perspective arises if we assume E to also contain edges from the set of features to the set of objects: E ⊆ O × F ∪ F × O. With edges pointing back and forth between the sets O and F, it is natural to ask whether, once we observe the occurrence of a feature f_i, there is a distribution over the set of objects that would explain this observation, i.e. whether there is a corresponding stochastic vector o where \sum_{k=1}^{n} o_k = 1. The following construction yields a corresponding subset of edges pointing from F to O. If we collect all prototypic feature vectors in a column stochastic m × n matrix

A = [f^1 f^2 . . . f^n]    (1)

whose elements therefore correspond to the conditional probabilities A_{ij} = p(f_i | o_j), we recognize A to realize a probabilistic mapping from R^n to R^m. Given a stochastic vector o, it yields the corresponding feature distribution f = A o. The context of a feature, i.e. the distribution of objects that most likely explains its occurrence, then results from a probabilistic mapping from R^m to R^n. This is realized by the column stochastic n × m matrix

B = A^T D^{-1}   where   D_{jj} = \sum_i A^T_{ij}   and   D_{ij} = 0 for i ≠ j.    (2)

Assuming p(o_j) = 1/n for all j, we have

B_{ji} = A_{ij} / \sum_k A_{ik} = ( (1/n) p(f_i | o_j) ) / ( (1/n) \sum_k p(f_i | o_k) ) = p(f_i | o_j) p(o_j) / p(f_i) = p(o_j | f_i)    (3)

and recognize the mappings between the two sets of vertices of the bipartite graph to be related by Bayes' theorem.

2.2 Bipartite PageRank

With the matrices A and B at hand, we consider diffusion processes over the graph. If at time t there is an object distribution o_t, it can be mapped back and forth between the sets of vertices as follows:

f_{t+1} = A o_t    (4)
o_{t+1} = B f_{t+1} = B A o_t = H o_t .    (5)
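A minimal NumPy sketch of this construction (random prototype histograms serve as a stand-in for real data; it is an illustration, not the author's code) builds A, B, and H as in (1)-(5) and checks that H is bi-stochastic:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 64, 10                              # features (bins) and prototype objects

    A = rng.random((m, n))                     # (1): column-stochastic, A[i, j] = p(f_i | o_j)
    A /= A.sum(axis=0, keepdims=True)

    D_diag = A.sum(axis=1)                     # (2): D_jj = sum_i A^T_ij (row sums of A)
    B = A.T / D_diag[None, :]                  # (3): B[j, i] = p(o_j | f_i)

    H = B @ A                                  # (5): transition matrix of the Markov process
    assert np.allclose(H.sum(axis=0), 1) and np.allclose(H.sum(axis=1), 1)

    o = np.zeros(n); o[0] = 1.0                # start the diffusion from a single prototype
    for _ in range(3):
        o = H @ o                              # a few steps of (4)-(5)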
It is easy to verify that the matrix H defined in (5) is a bi-stochastic matrix that defines a Markov process over the set of objects. In information retrieval, Markov processes like this are used for computing similarity rankings: if started with a particular object distribution, a few iterations will diffuse it such that objects similar to the initial one(s) will be assigned more probability mass than less similar ones. In contrast to common information retrieval where such processes are considered over conventional adjacency graphs (e.g. webpages and their linking structure [12]), our bipartite graph model allows for a latent semantics interpretation of the process. The
transition probabilities p(o_j | o_i) can be explained as the effect of hidden latent variables since p(o_j | o_i) = \sum_k p(o_j | f_k) p(f_k | o_i). Alas, since the limiting distribution of a bi-stochastic matrix always amounts to lim_{t→∞} o_t = (1/n) [1 1 . . . 1]^T, meaningful global rankings that take into account all prototypes o_j ∈ O cannot be obtained this way. We therefore consider a slightly modified process o_{t+1} = α H o_t + (1 − α) o_0 where 0 < α ≤ 1 and o_0 = o_{t=0} may be interpreted as a continuing source of probability mass that constantly feeds the process. Since H is bi-stochastic and its eigenvalues therefore obey |λ_i| ≤ 1, the Perron-Frobenius theorem guarantees convergence of this system, too. Its limiting distribution amounts to

o* = (1 − α) (I − α H)^{-1} o_0 .    (6)

This equation, however, became famous as PageRank and, in the context of web search, the vector o_0 is often referred to as the personalization vector. Suitable choices for the parameter α are the topic of an ongoing debate [12]. Here, we propose to avoid choosing α altogether by integrating it out. Writing (5) as a power series and considering the definition of the matrix logarithm, we arrive at

o* = ∫_0^1 (1 − α) \sum_{t=0}^{∞} α^t H^t o_0 dα = H^{-1} [ I + (H^{-1} − I) log(I − H) ] o_0 = K o_0 .    (7)
2.3 Extending Bipartite PageRank to Out of Sample Data

While ranking on adjacency graphs is confined to closed sets of objects or requires contrived extensions in order to be applicable to out of sample objects, the bipartite model considered here provides a simple way of ranking unknown objects o_u ∉ O. This is by virtue of the matrix B which allows for starting the diffusion process from a distribution f(o_u) over the set of features:

o* = K B f(o_u).    (8)
Since the components o*_j of o* contain probabilities, (8) can be used to classify o_u. If there are k different classes Ω_1, . . . , Ω_k, several methods apply, for instance
– maximum likelihood-based: o_u → Ω_c if Ω_c = argmax_{Ω_n} \sum_{o_j ∈ Ω_n} o*_j
– rank-based: o_u → Ω_c if k of the l highest ranking objects in o* belong to Ω_c.
In our experiments, we found that if trained with a single image, rank-based classification performs better. Moreover, choosing k and l such that 0 ≤ k/l ≤ 1 allows us to compare our method against conventional nearest neighbor classifiers. Dealing with histograms, these are often based on the L1 norm or the Kullback-Leibler divergence and we shall consider them for baseline comparison in our experiments. Finally, training a diffusion-based classifier merely requires collecting a number of prototypes and computing the matrices B and K. If the numbers n and m of prototypes and features are of the order O(10^2), the latter happens almost instantaneously.
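For concreteness, the following sketch (assuming A and B are built as in (1)-(2) and that prototype class labels are given; both are placeholders) approximates the kernel K of (7) by truncating its power series, whose coefficients ∫_0^1 (1 − α) α^t dα = 1/((t+1)(t+2)) follow from integrating α out, and then applies the rank-based rule:

    import numpy as np

    def rank_out_of_sample(f_u, A, B, T=50):
        # o* = K B f(o_u) from (8), with K o0 approximated by
        # sum_{t=0}^{T-1} H^t o0 / ((t+1)(t+2))  (truncated series for (7))
        H = B @ A
        o0 = B @ f_u
        v, o_star = o0.copy(), np.zeros_like(o0)
        for t in range(T):
            o_star += v / ((t + 1) * (t + 2))
            v = H @ v
        return o_star

    def rank_based_classify(o_star, labels, l=10):
        # majority class among the l highest ranking prototypes (labels are hypothetical)
        top = np.argsort(o_star)[::-1][:l]
        classes, counts = np.unique(labels[top], return_counts=True)
        return classes[np.argmax(counts)]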
2.4 Related Work
The matrices (1 − α)(I − αH)^{-1} and K derived above are graph diffusion kernels. General properties of diffusion kernels for manifold ranking were studied in [12,13,14,15,16]. In the context of pattern recognition, known contributions rely on adjacency graphs where vertices represent objects and edges are weighted according to some distance between corresponding feature vectors. Markov processes over the objects then result from turning weighted adjacency matrices into stochastic matrices. The bipartite graph model discussed above avoids such overhead. Instead, the Bayesian nature of the mappings between object and feature space allows for a concise interpretation of the procedure in terms of latent semantics. Our approach also differs from the HITS algorithm for ranking webpages [12]. HITS understands nodes of a graph as hubs or authorities and their scores are updated using the eigenvectors of the hub matrix LL^T and the authority matrix L^T L where L is the adjacency matrix of the graph; Markov processes are not involved in that method. Finally, ranking on bipartite graphs has been applied to document retrieval [17], but, to the best of our knowledge, it has not been studied in the context of classification.
3 Efficient Correlogram Computation

Given an image I(x, y), a frequent task in image processing consists in computing a scalar- or vector-valued integral function

c(x, y, m, n) = \sum_{i=x−m/2}^{x+m/2} \sum_{j=y−n/2}^{y+n/2} f(I(i, j))    (9)

whose value depends on all pixels in an m × n neighborhood of (x, y). Recently, several authors [4,8,18] pointed out that in dealing with such functions, it is beneficial to consider integrals C(x, y) = \sum_{i=0}^{x} \sum_{j=0}^{y} c(i, j) where c(i, j) = c(i, j, 1, 1) is computed from a single pixel only. Although these integrals of integrals necessitate an additional iteration over the image, they can ease computational burden. Since integrating over an interval is equivalent to integrating over corresponding disjoint subintervals, it follows that

c(x, y, m, n) = C(x − m/2, y − n/2) + C(x + m/2, y + n/2) − C(x − m/2, y + n/2) − C(x + m/2, y − n/2)    (10)

and we see that c(x, y, m, n) can be computed in constant time, regardless of the choice of m and n. Moreover, subintervals also allow for computing C(x, y) by means of the efficient dynamic program

C(x, y) = C(x, y − 1) + C(x − 1, y) − C(x − 1, y − 1) + c(x, y).    (11)
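As an illustration of (10)-(11), the hedged NumPy sketch below computes the integral image via cumulative sums (equivalent to the dynamic program) and a constant-time window sum; the −1 offsets make the discrete box sum exact, and the clamping at the image border is an ad hoc assumption:

    import numpy as np

    def integral_image(c):
        # C(x, y) = sum_{i<=x, j<=y} c(i, j), i.e. the dynamic program (11)
        return np.cumsum(np.cumsum(c, axis=0), axis=1)

    def box_sum(C, x, y, m, n):
        # c(x, y, m, n) via (10), in constant time regardless of m and n
        x0, x1 = max(x - m // 2, 0), min(x + m // 2, C.shape[0] - 1)
        y0, y1 = max(y - n // 2, 0), min(y + n // 2, C.shape[1] - 1)
        s = C[x1, y1]
        if x0 > 0:
            s -= C[x0 - 1, y1]
        if y0 > 0:
            s -= C[x1, y0 - 1]
        if x0 > 0 and y0 > 0:
            s += C[x0 - 1, y0 - 1]
        return s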
Our use of correlograms follows the idea of Huang et al. [10] who only count transitions between pixels of the same color and therefore arrive at a descriptor that is of
the same size as a conventional histogram. Moreover, for computational efficiency, they only consider horizontal and vertical pixel displacements Δx and Δy, respectively. Therefore, for bin i = I(x, y) of the correlogram c(x, y, 1, 1) we have

c_i(x, y) = c_{X,i}(x, y) + c_{Y,i}(x, y) = { 1 if I(x, y) = I(x − Δx, y), 0 otherwise } + { 1 if I(x, y) = I(x, y − Δy), 0 otherwise }
and we recognize that the dynamic program in (11) also applies to integral correlogram computation. In contrast to histograms, however, local correlograms cannot be determined from (10) because correlations with pixels from the outside of the intended interval must be discarded. This is tantamount to adjusting the interval's upper left corner by an amount of Δx and Δy. If the integral correlogram over the whole image is computed and stored as C = C_X + C_Y, the adjustments can be made in C_X and C_Y, respectively. Compared to integral histograms, this doubles memory consumption but avoids loss of valid co-occurrence information that might occur if both adjustments are applied to C directly.
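As a deliberately simple example of the descriptor just described, the sketch below computes the normalized auto-correlogram of a single rectangular patch directly, i.e. without the integral-image speed-up; the color-quantized input and the displacement values are assumptions:

    import numpy as np

    def patch_correlogram(I, x0, x1, y0, y1, dx=5, dy=5, n_colors=256):
        # count, per color bin, horizontal and vertical same-color transitions
        P = I[x0:x1, y0:y1]
        hist = np.zeros(n_colors)
        same_x = P[:, dx:] == P[:, :-dx]          # pixel equals the one dx columns to its left
        same_y = P[dy:, :] == P[:-dy, :]          # pixel equals the one dy rows above
        np.add.at(hist, P[:, dx:][same_x], 1)
        np.add.at(hist, P[dy:, :][same_y], 1)
        return hist / max(hist.sum(), 1)          # normalize to a distribution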
4 Experiments

This section presents results obtained from experiments with an image set of 256 zoo animals retrieved from flickr.com. There are 11 different species in the set and for each species, the animals are shown in various poses and contexts. Since our overall interest lies in image tagging for online services, we investigated to what extent it is possible to locate animals of the same species, if just a single image is available for training.

4.1 Setting

We applied the approach presented above to color histograms and correlograms and compared it against nearest neighbor classifiers based on the L1 distance or the Kullback-Leibler (KL) divergence. Our use of integral histograms and integral correlograms superseded downsampling the images (whose resolution varies between 500 × 343 and 500 × 500 pixels). Likewise, we left their brightness unchanged but applied color quantization using a palette of 256 colors. A user was asked to indicate objects of interest in pictures of her liking. Figure 1 shows two images of tigers, where the body and the face of the animals were marked using brush strokes. The bounding boxes of such strokes and their center pixel were determined and, in each experiment, 50 training patches were randomly sampled from the vicinity of this pixel. 50 histograms or correlograms that served as counterexamples were created at random. They were sampled from Dirichlet mixtures of three components whose modes did not coincide with those of the positive training examples. The distances Δx and Δy for computing correlograms were set to 5 pixels. In the classification phase, the test images were processed using a stepsize of 8 pixels in horizontal and vertical direction. Given a C implementation running on an Intel Pentium D processor, this yielded processing times of less than a second per image. For the
Fig. 1. Two exemplary pictures of tigers, manually indicated regions of interest, and 50 randomly selected image patches used for training a prototype-based tagging system
diffusion-based classifiers as well as for the norm- and divergence-based nearest neighbor classifiers, we determined similarity rankings of the 100 prototypes with respect to the current test vector. In different experiments, we varied the number l ∈ {5, 10, 20} of the highest ranking prototypes considered for classification. Figure 2 shows examples of per-pixel response maps where l = 10. They resulted from training with the image in the upper row of Fig. 1. Brighter pixels therefore indicate higher probabilities of the presence of the body of a tiger. Obviously, the nearest neighbor classifier based on the L1 distance between histograms fares poorly; nearest neighbor classification of correlograms performs better but only for the diffusion-based classifiers it appears that –despite considerable intra class variations– tigers trigger high responses that are well localized. However, while the classifier achieves good separation from genetically distant species (even if the individuals have fur or hide of similar color), we found that especially the big cats in the data set tended to be confused. We attribute this to their similar silhouettes and therefore similar spatial statistics. For a quantitative evaluation, the data set was labeled manually so that for each animal a bounding box was available. This allowed us to determine precision and recall. To this end, all pixels in an image for which the probability of depicting the animal in question fell below 0.7·l were suppressed from the response map. The remaining pixels were subjected to a connected component analysis and the centroid of the resulting largest region was computed. If it came to lie within a bounding box surrounding the animal in question, it was counted as a true positive. Hits outside of the bounding box or inside bounding boxes of animals from other species counted as false positives. Table 1 summarizes some of our findings for cases of well and less well localizable animals. 4.2 Discussion It is noticeable that, with respect to precision, the probabilistic diffusion-based classifiers generally outperform their norm- or divergence-based counterparts. For our intended application, this is indeed a desirable property, since a higher precision in
[Fig. 2 panels, left to right: test image, L1 dist. hist., L1 dist. corr., diffusion hist., diffusion corr.]
Fig. 2. Examples of response maps produced by different tested methods. The brighter a pixel is, the more likely it (supposedly) depicts the body of a tiger.
automatic image tagging guarantees a lower number of misplaced image labels and thus entails less additional effort for the user. Furthermore, it is noticeable that, at least for the setting considered here, it appears not to be crucial whether the classification is based on the top 5, the top 10, or top 20 ranking prototypes. While there is a tendency towards improved accuracy for an increasing l, it is not significant. Also, as we expected from looking at Fig. 2, it shows that in our application correlogram features yield higher accuracy than histogram features. Finally, while for certain species, our approach performs well, others seem to elude localization. Due to lack of space, Tab. 1 only presents exemplary cases. Animals of more rigid body pose (e.g. turtles) or in less textured habitats (e.g. apes in front of the sky) were localized reliably. Big cats with their considerable intra-class variance and simultaneously considerable inter-class similarity proved the biggest challenge. However,
Table 1. Examples of precision and recall values obtained on a test set of 255 pictures of animals

object type    classifier    p (l=5)  r (l=5)  p (l=10)  r (l=10)  p (l=20)  r (l=20)
tiger body     L1 hist.      0.46     1.00     0.46      1.00      0.46      1.00
               L1 corr.      0.59     1.00     0.57      0.97      0.55      1.00
               KL hist.      0.49     1.00     0.49      1.00      0.49      1.00
               KL corr.      0.53     1.00     0.52      1.00      0.52      1.00
               diff. hist.   0.68     0.63     0.68      0.83      0.66      0.91
               diff. corr.   0.70     0.67     0.70      0.71      0.70      0.93
tiger face     L1 hist.      0.40     1.00     0.40      1.00      0.40      1.00
               L1 corr.      0.39     0.97     0.40      1.00      0.40      1.00
               KL hist.      0.40     1.00     0.40      1.00      0.40      1.00
               KL corr.      0.41     1.00     0.42      1.00      0.41      1.00
               diff. hist.   0.46     0.60     0.45      0.73      0.45      0.86
               diff. corr.   0.57     0.84     0.56      0.89      0.54      0.97
cheetah body   L1 hist.      0.14     1.00     0.14      1.00      0.14      1.00
               L1 corr.      0.16     0.85     0.17      1.00      0.17      1.00
               KL hist.      0.14     1.00     0.14      1.00      0.14      1.00
               KL corr.      0.16     1.00     0.15      1.00      0.15      1.00
               diff. hist.   0.17     0.28     0.14      0.28      0.17      0.57
               diff. corr.   0.27     0.50     0.21      0.50      0.19      0.50
given the difficulty of localizing animals in unconstrained environments in general and the additional constraint of learning from a single image in particular, our results appear encouraging.
5 Summary and Future Work This paper presented an approach to online image tagging based on the idea of probabilistic diffusion over a set of prototypic local image descriptors. In addition to considering features that can be efficiently computed from integral images, our prototype-based framework does not require extensive training and was conceived as a way of dealing with high degrees of visual variability in realistic settings. Key to the proposed diffusion scheme is that it is not based on conventional adjacency graphs but considers bipartite graphs whose two sets of vertices represent objects and features. We did show that, if the features are chosen to be probability distributions, Markov processes over these graphs can be explained in terms of latent variables. Further extending known work on graph ranking for pattern recognition, we presented a seamless extension to the ranking and classification of out of sample objects. Currently, we are exploring ways of further improving the bipartite diffusion scheme. First, we are exploring methods of selecting suitable sets of prototypes other than the random selection scheme applied in this paper. Second, we are considering ways of incorporating domain specific knowledge into the approach. This might be achieved by means of weighted dynamic processes ot+1 = αH ot + (1 − α)W o0 , where
the stochastic matrix W may steer task specific transitions. Finally, we work towards hierarchical versions of our algorithm that incorporate Markov chain lumping.
References
1. Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In: Proc. CVPR, vol. 2, pp. 2169–2178 (2006)
2. Li, J., Wang, J.: Real-time Computerized Annotation of Pictures. In: Proc. ACM Multimedia Conf., pp. 911–920 (2006)
3. Griffin, G., Holub, A., Perona, P.: Caltech-256 Object Category Dataset. Technical Report 7694, California Institute of Technology (2007)
4. Viola, P., Jones, M.J.: Robust Real-Time Face Detection. Int. J. of Computer Vision 57(2), 137–154 (2004)
5. Lowe, D.: Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
6. Adam, A., Rivlin, E., Shimshoni, I.: Robust fragments-based tracking using the integral histogram. In: Proc. CVPR, vol. 1, pp. 798–805 (2006)
7. Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: Proc. CVPR, vol. 2, pp. 886–893 (2005)
8. Porikli, F.: Integral Histogram: A Fast Way to Extract Histograms in Cartesian Spaces. In: Proc. CVPR, vol. 1, pp. 829–836 (2005)
9. Domke, J., Aloimonos, Y.: Deformation and Viewpoint Invariant Color Histograms. In: Proc. BMVC, vol. II, pp. 509–518 (2006)
10. Huang, J., Kumar, S., Mitra, M., Zhu, W.J., Zabih, R.: Image Indexing Using Color Correlograms. In: Proc. CVPR, pp. 762–768 (1997)
11. Aitchison, J.: The Statistical Analysis of Compositional Data. Chapman & Hall, Boca Raton (1986)
12. Langville, A., Meyer, C.: Google's PageRank and Beyond. Princeton University Press, Princeton (2006)
13. Agarwal, S.: Ranking on Graph Data. In: Proc. ICML, pp. 25–32 (2006)
14. Kashima, H., Tsuda, K., Inokuchi, A.: Kernels for graphs. In: Schölkopf, B., Tsuda, K., Vert, J.P. (eds.) Kernel Methods in Computational Biology, pp. 155–170. MIT Press, Cambridge (2004)
15. Kondor, R., Lafferty, J.: Diffusion Kernels on Graphs and Other Discrete Input Spaces. In: Proc. ICML, pp. 315–322 (2002)
16. Zhou, D., Weston, J., Gretton, A., Bousquet, O., Schölkopf, B.: Ranking on Data Manifolds. Proc. NIPS 16, 169–176 (2004)
17. Fouss, F., Pirotte, A., Renders, J.M., Saerens, M.: Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. IEEE Trans. on Knowledge and Data Engineering 19(3), 355–369 (2007)
18. Liu, Y., Stroller, S., Li, N., Rothamel, T.: Optimizing Aggregate Array Computations in Loops. ACM Trans. on Prog. Languages and Systems 27(1), 91–125 (2005)
On the Relation between Anisotropic Diffusion and Iterated Adaptive Filtering
Michael Felsberg
Computer Vision Laboratory, Linköping University, S-58183 Linköping, Sweden
[email protected]
Abstract. In this paper we present a novel numerical approximation scheme for anisotropic diffusion which is at the same time a special case of iterated adaptive filtering. By assuming a sufficiently smooth diffusion tensor field, we simplify the divergence term and obtain an evolution equation that is computed from a scalar product of diffusion tensor and the Hessian. We propose further a set of filters to approximate the Hessian on a minimized spatial support. On standard benchmarks, the resulting method performs on average nearly as well as the best known denoising methods from the literature, although it is significantly faster and easier to implement. In a GPU implementation, video real-time performance is achieved for moderate noise levels.
1 Introduction
Image denoising using non-linear diffusion [1] is a commonly used technique. The anisotropic extension of non-linear diffusion using the structure tensor [2] often leads to better results, in particular close to lines and edges. The numerical implementation of anisotropic diffusion is, however, less trivial than expected, as can be seen by the variety of algorithms and publications on the topic [3,4,5]. This is even more severe as we observed that the applied numerical scheme has a larger influence on the quality of results than the choice of the method, i.e., using a suboptimal numerical scheme for anisotropic diffusion results in worse peak signal-to-noise ratios (PSNR) than a closer to optimal scheme for non-linear diffusion. Besides non-linear diffusion, many different approaches for image denoising have been published, too many to be covered here. Some of the more popular approaches use iterated adaptive filters [6], bilateral filtering [7,8], mean-shift filtering [9,10], channel decomposition [11], or multi-band filtering [12]. Optimal results usually require estimates of image priors [13,14]. Some of the mentioned methods were related and compared in [15,16]. The purpose of this paper is to introduce a novel, very simple numerical scheme for anisotropic diffusion, which
– is competitive with state-of-the-art methods for image denoising,
– relates anisotropic diffusion directly to iterated adaptive filtering,
– is extremely simple to implement and real-time capable on GPU hardware.
This work has been supported by the CENIIT project CAIRIS.
The paper is structured as follows. In Sect. 2, we introduce some tensor notation and calculus and we give some background on anisotropic diffusion and adaptive filtering. In Sect. 3, we derive the novel scheme from the anisotropic diffusion equation and relate it to iterated adaptive filtering, we propose concrete discrete filters for all relevant steps and discuss the choices of parameters. In Sect. 4, we present a number of standard denoising experiments in order to compare our approach with methods from the literature. Finally, we discuss the advantages and drawbacks of the proposed method in the concluding Sect. 5.
2 Tensors, Diffusion, and Adaptive Filtering

In this section we introduce some notation, terms, and methods from the literature, which will be required in subsequent sections.

2.1 Notation and Tensor Calculus
Despite the fact that tensors are coordinate-system independent algebraic entities, we restrict ourselves to matrix representations of tensors in this paper. In what follows, (real) vectors are denoted as bold letters and matrices as bold capital letters. The scalar product between two column vectors a and b is usually denoted by the matrix product of the transpose of a and b: a^T b. Tensors (of second order) will often be built by outer products of vectors, e.g., T = a b^T. Since second order tensors form a vector space, a scalar product (Frobenius product) is defined, which reads for two tensors A and B

⟨A|B⟩ = trace(A B^T) = \sum_i \sum_j a_ij b_ij .    (1)
In most cases we will consider symmetric tensors, i.e., a_ij = a_ji, and for these tensors the spectral theorem allows a decomposition into real eigenvalues: A = OΛO^T = O diag(λ_i) O^T, where O is an orthogonal matrix formed by the eigenvectors of A and Λ is the diagonal matrix containing the corresponding (real) eigenvalues λ_i. Note that arbitrary powers of A can be computed as

A^n = (OΛO^T)^n = O Λ^n O^T = O diag(λ_i^n) O^T   and due to linearity   exp(k A) = O diag(exp(k λ_i)) O^T.    (2)

In many cases we deal with vector-valued or tensor-valued functions in this paper, even if written as vectors or tensors. The spatial variable, mostly denoted as x, is often omitted. In this context, differential operators are based on the partial derivatives with respect to the components of x and are denoted using the nabla operator ∇, such that

∇b = ( ∂b/∂x_i )_i ,   ∇^T a = div(a) = \sum_i ∂a_i/∂x_i ,   ∇^T A = div(A) = ( \sum_i ∂a_ij/∂x_i )_j .
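A small numeric illustration of (2) (not taken from the paper): powers and the matrix exponential of a symmetric tensor are obtained by applying the scalar function to the eigenvalues.

    import numpy as np
    from scipy.linalg import expm

    A = np.array([[2.0, 1.0], [1.0, 3.0]])          # a symmetric second order tensor
    lam, O = np.linalg.eigh(A)                      # A = O diag(lam) O^T
    A_cubed = O @ np.diag(lam ** 3) @ O.T           # A^n via eq. (2), here n = 3
    expA = O @ np.diag(np.exp(0.5 * lam)) @ O.T     # exp(k A) via eq. (2), here k = 0.5
    assert np.allclose(A_cubed, A @ A @ A)
    assert np.allclose(expA, expm(0.5 * A))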
2.2 Anisotropic Diffusion
We will not go into much detail about the theory of anisotropic diffusion; the reader is referred to, e.g., the work of Weickert [17]. The diffusion scheme is defined as an evolution equation of an image b(x) over time t as:

∂b/∂t = div( D(J) ∇b )    (3)

where D(J) is the diffusion tensor which is computed by modifying the eigenvalues of the structure tensor J. The latter is computed by locally weighted averaging of the outer product of the image gradient [18,19]:

J(x) = ∫ w(y) ∇b(x − y) ∇^T b(x − y) dy .    (4)

Note that the gradient operators in the equation above are normally regularized by some low-pass filter, e.g., using Gaussian derivatives. The local weight function w is also mostly chosen as a Gaussian function. When implementing anisotropic diffusion, an iterative algorithm that updates the input image successively has to be implemented. In each time-step, the diffusion tensor and the image gradient have to be estimated. In the next step, the divergence is approximated by another numerical approximation of a derivative operator. Altogether, four numerical approximations of derivatives are made: the time-derivative (mostly by an explicit scheme), the image gradient, the image gradient used in the structure tensor (not necessarily the same), and the divergence. There are, however, exceptions, where semi-analytic solutions are implemented to reduce the number of numerical approximations [5].
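To make these numerical approximations explicit, the following NumPy/SciPy sketch implements one explicit time step of the conventional scheme (3)-(4). It is an illustration of the standard formulation reviewed here, not the novel scheme proposed in this paper; the eigenvalue modification g(μ) = λc/(λc + μ) and all parameter values are illustrative assumptions.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def anisotropic_diffusion_step(b, tau=0.2, sigma_d=1.0, sigma_w=2.0, lam_c=1e-3):
        # regularized image gradient (Gaussian derivative filters)
        bx = gaussian_filter(b, sigma_d, order=(0, 1))
        by = gaussian_filter(b, sigma_d, order=(1, 0))
        # structure tensor (4): locally averaged outer product of the gradient
        Jxx = gaussian_filter(bx * bx, sigma_w)
        Jxy = gaussian_filter(bx * by, sigma_w)
        Jyy = gaussian_filter(by * by, sigma_w)
        # closed-form eigen-decomposition of the 2x2 tensor field
        tmp = np.sqrt((Jxx - Jyy) ** 2 + 4.0 * Jxy ** 2)
        mu1, mu2 = 0.5 * (Jxx + Jyy + tmp), 0.5 * (Jxx + Jyy - tmp)
        phi = 0.5 * np.arctan2(2.0 * Jxy, Jxx - Jyy)   # orientation of the dominant eigenvector
        c, s = np.cos(phi), np.sin(phi)
        # diffusion tensor D(J): damp diffusion across the dominant gradient direction
        g1, g2 = lam_c / (lam_c + mu1), lam_c / (lam_c + mu2)
        Dxx = g1 * c * c + g2 * s * s
        Dxy = (g1 - g2) * c * s
        Dyy = g1 * s * s + g2 * c * c
        # explicit time step: b <- b + tau * div(D grad b)
        jx, jy = Dxx * bx + Dxy * by, Dxy * bx + Dyy * by
        return b + tau * (np.gradient(jx, axis=1) + np.gradient(jy, axis=0))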
2.3 Adaptive Filtering
Adaptive filtering is a more general and actually earlier published variant of steerable filters [20], developed by Knutsson et al. [21,22]. The main idea is to compose a spatially variant filter kernel by linear combinations of shift invariant kernels. The linear coefficients are locally estimated by an orientation-dependent scheme. The final formulation of the method can be found in [6], Chapt. 10. Adopting the notation above, the filter is composed as

h_adapt = h_LP + \sum_k ⟨C(J)|Ñ_k⟩ h_HP,k    (5)

where h_HP,k is an orientation selective high-pass filter with orientation n_k, Ñ_k is the dual tensor for the orientation tensor n_k n_k^T, and C is the control tensor. The high-pass filter with orientation n_k is defined in the Fourier domain by a polar separable filter with radial component R(|u|) and an angular component D(u/|u|) = (n_k^T u)^2 / |u|^2. Consider for instance the case R(|u|) = |u|^2 such that

H_HP,k = (n_k^T u)^2    (6)

with n_1 = (1, 0)^T, n_2 = (0, 1)^T, and n_3 = (√(1/2), √(1/2))^T. The resulting frequency responses are H_HP,1 = u_1^2, H_HP,2 = u_2^2, and H_HP,3 = u_1^2/2 + u_1 u_2 + u_2^2/2.
The corresponding orientation tensors are given as

N_1 = [[1, 0], [0, 0]],   N_2 = [[0, 0], [0, 1]],   N_3 = [[1/2, 1/2], [1/2, 1/2]].

The dual tensors can be computed via the isometric vector representation as

Ñ_1 = [[1, −1/2], [−1/2, 0]],   Ñ_2 = [[0, −1/2], [−1/2, 1]],   Ñ_3 = [[0, 1], [1, 0]].

Finally, the rightmost term in (5) is proportional to the negative Hessian since \sum_k Ñ_k H_HP,k = u u^T.
The tensor controlled adaptive filter scheme is often applied in an iterated way to achieve good denoising results [6], Chapt. 10. By selecting the high-pass filter in the way we have done in the example above, but with a very small constant multiplier 0 < γ 10, 000). One way of reducing the time complexity is to trade it off with the optimality of
Example-Based Learning for Single-Image Super-Resolution
the solution by finding the minimizer of (1) only within the span of a basis set {k(b_1, ·), . . . , k(b_lb, ·)} (l_b ≪ l):

f^i(·) = \sum_{j=1,...,l_b} a_ij k(b_j, ·),  for i = 1, . . . , N.    (4)
In this case, the solution is obtained by

A = (K_bx K_bx^T + λ K_bb)^{-1} K_bx Y,    (5)

where [K_bx(i,j)]_{l_b,l} = k(b_i, x_j) and [K_bb(i,j)]_{l_b,l_b} = k(b_i, b_j), and accordingly the time complexity reduces to O(M × l_b) for testing. For a given fixed set of basis points B = {b_1, . . . , b_lb}, the time complexity of computing the coefficient matrix A is O(l_b^3 + l × l_b × M). In general, the total training time depends on the method of finding B. In KMP [11,4], the basis points are selected from the training data points in an incremental way: for given n−1 basis points, the n-th basis is chosen such that the cost functional (1) is minimized when A is optimized accordingly. The exact implementation of KMP costs O(l^2) time for each step. Another possibility is to note the differentiability of the cost functional (4), which leads to gradient-based optimization to construct B. Assuming that the evaluation of the derivative of k with respect to a basis vector takes O(M) time, the evaluation of the derivative of (1) with respect to B and the corresponding coefficient matrix A takes O(M × l × l_b + l × l_b^2) time. Because of the increased flexibility, in general, gradient-based methods can lead to a better optimization of the cost functional (1) than selection methods, as already demonstrated in the context of sparse Gaussian process (GP) regression [8]. However, due to the non-convexity of (1) with respect to B, it is susceptible to local minima and accordingly a good heuristic is required to initialize the solution. In this paper, we use a combination of KMP and gradient descent. The basic idea is to assume that at the n-th step of KMP, the chosen basis point b_n plus the accumulation of basis points obtained until the (n−1)-th step (B_{n−1}) is a good initial point. Then, at each step of KMP, B_n can be subsequently optimized by gradient descent. Naive implementation of this idea is still very expensive. To reduce the complexity further, the following simplifications are adopted: 1. In the KMP step, instead of evaluating the whole training set for choosing b_n, only l_c (l_c ≪ l) points are considered; 2. Gradient descent of B_n and the corresponding A_(1:n,:) is performed only at every r-th KMP step. Instead, for each KMP step, only b_n and A_(n,:) are optimized. In this case, the gradient can be evaluated at O(M × l).

Footnotes: With a slight abuse of Matlab notation, A_(m:n,:) stands for the submatrix of A obtained by extracting the rows of A from m to n. Similarly to [4], A_(n,:) can be analytically calculated at O(M × l) cost:

A_(n,:) = ( K_nx (Y − K_bx(1:n−1)^T A_(1:n−1,:)) − λ K_nb A_(1:n−1,:) ) / ( K_nx K_nx^T + λ ),    (6)

where [K_nx(1,i)]_{1,l} = k(b_n, x_i) and [K_nb(1,i)]_{1,n−1} = k(b_n, b_i).
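For concreteness, a hedged NumPy sketch of the fixed-basis solution (4)-(5) with a Gaussian kernel; the kernel form is an assumption (the text only gives σ_k), the data are synthetic placeholders, and the basis selection itself (KMP plus gradient descent) is omitted:

    import numpy as np

    def gauss_kernel(X, Z, sigma_k=0.025):
        # k(x, z) = exp(-||x - z||^2 / (2 sigma_k^2)); the bandwidth form is an assumption
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma_k ** 2))

    def sparse_krr_fit(B, X, Y, lam=0.5e-7, sigma_k=0.025):
        # A = (K_bx K_bx^T + lam K_bb)^{-1} K_bx Y, eq. (5)
        K_bx = gauss_kernel(B, X, sigma_k)            # l_b x l
        K_bb = gauss_kernel(B, B, sigma_k)            # l_b x l_b
        return np.linalg.solve(K_bx @ K_bx.T + lam * K_bb, K_bx @ Y)

    def sparse_krr_predict(A, B, X_new, sigma_k=0.025):
        # f^i(x) = sum_j A[j, i] k(b_j, x), eq. (4)
        return gauss_kernel(X_new, B, sigma_k) @ A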
At the n-th step, the l_c candidate basis points for KMP are selected based on a rather cheap criterion: we use the difference between the function output obtained at the (n−1)-th step and the estimated desired response of the full KRR for each training data point, which is then approximated by the localized KRR: for a training data point x_i, its NNs are collected from the training set and the full KRR is trained based on only these NNs. The output of this localized KRR for x_i gives the estimation of the desired response for x_i. It should be noted that these local KRRs cannot be directly applied for regression as they might interpolate poorly on non-training data points. Once computed at the beginning, the estimated desired responses are fixed throughout the whole optimization process. To gain an insight into the performance of the different sparse solution methods, a set of preliminary experiments has been performed with KMP, gradient descent (with the basis initialized by the k-means algorithm), and the proposed combination of KMP and gradient descent with 10,000 training data points. Figure 1 summarizes the results. Both gradient descent methods outperform KMP, while the combination with KMP provides a better performance. This could be attributed to the better initialization of the solution for the subsequent gradient descent step.
[Figure 1 plots the cost functional (1) against the number of basis points (0–300) for KMP, gradient descent, and KMP+gradient descent.]
Fig. 1. Performance of the different sparse solution methods evaluated in terms of the cost functional (1). A fixed set of hyper-parameters were used such that the comparison can be made directly in (1).
Combining Candidates. It is possible to construct a super-resolved image based on only the scalar-valued regression (i.e., N = 1). However, we propose to predict a patch-valued output such that for each pixel, N different candidates are generated. These candidates constitute a 3-D image Z where the third dimension corresponds to the candidates. This setting is motivated by the observation that 1. by sharing the hyper-parameters, the computational complexity of the resulting patch-valued learning reduces to that of scalar-valued learning; 2. the candidates contain information from different input image locations which are actually diverse enough such that the combination can boost the performance: in our preliminary experiments, constructing an image by choosing the best and the worst (in terms of the distance to the ground truth) candidates from each 2-D location of Z resulted in an average signal-to-noise ratio (SNR) difference of 8.24dB.
Certainly, the ground truth is not available at the actual super-resolution stage and accordingly a way of constructing a single pixel out of N candidates is required. One straightforward way is to construct the final estimation as a convex combination of candidates based on a certain confidence measure. For instance, by noting that the (sparse) KRR corresponds to the maximum a posteriori estimation with the (sparse) GP prior [8], one could utilize the predictive variance as a basis for the selection. In the preliminary experiments this resulted in an improvement over the scalar-valued regression. However, a better prediction was obtained when the confidence estimation is based not only on the input patches but also on the context of neighboring reconstructions. For this, a set of linear regressors is trained such that for each location (x, y), they receive a patch of output images Z_(N_L(x,y),:) and produce the estimation of the differences ({d_1(x, y), . . . , d_N(x, y)}) between the unknown desired output and each candidate. The final estimation of the pixel value for an image location (x, y) is then obtained as the convex combination of candidates given in the form of a softmax:

Y(x, y) = \sum_{i=1,...,N} w_i(x, y) Z(x, y, i),    (7)

where w_i(x, y) = exp(−|d_i(x, y)| / σ_C) / \sum_{j=1,...,N} exp(−|d_j(x, y)| / σ_C).

For the experiments in this paper, we set M = 49 (7 × 7), N = 25 (5 × 5), L = 49 (7 × 7), σ_k = 0.025, σ_C = 0.03, and λ = 0.5 · 10^−7. The values are obtained based on a set of separate validation images. The number of basis points for KRR (l_b) is determined to be 300 as a trade-off between accuracy and time complexity. In the super-resolution experiments, the combination of candidates based on these parameters resulted in an average SNR increase of 0.43dB over the scalar-valued regression.
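A direct sketch of (7) for one image, assuming the candidate stack Z and the estimated differences d are already given (both are placeholders here):

    import numpy as np

    def combine_candidates(Z, d, sigma_C=0.03):
        # Z: H x W x N candidate values, d: H x W x N estimated differences d_i(x, y)
        w = np.exp(-np.abs(d) / sigma_C)
        w /= w.sum(axis=2, keepdims=True)             # softmax weights of eq. (7)
        return (w * Z).sum(axis=2)                    # convex combination Y(x, y)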
Post-processing Based on Image Prior. As demonstrated in Fig. 2.b, the result of the proposed regression-based method is significantly better than the bicubic interpolation. However, detailed visual inspection along the major edges (edges showing rapid and strong changes of pixel values) reveals ringing artifacts (oscillations occurring along the edges). In general, regularization methods (depending on the specific class of regularizer) including KRR and SVR tend to fit the data with a smooth function. Accordingly, at sharp changes of the function (edges in the case of images) oscillation occurs to compensate for the resulting loss of smoothness. While this problem can indirectly be resolved by imposing less regularization in the vicinity of edges, a more direct approach is to rely on prior knowledge of the discontinuities of images. In this work, we use a modification of the natural image prior (NIP) framework proposed by Tappen et al. [9]:

P({x}|{y}) = (1/C) \prod_{(i,j ∈ N_S(i))} exp( −|x̂_i − x̂_j|^α / σ_N ) · \prod_i exp( −(x̂_i − y_i)^2 / σ_R ),    (8)
where {y} represents the observed variables corresponding to the pixel values of Y , {x} represents the latent variable, and NS (i) stands for the 8-connected neighbors of the pixel location i. While the second product term has the role of
462
K.I. Kim and Y. Kwon
preventing the final solution from drifting far away from the input regression result Y, the first product term tends to smooth the image based on the costs |x̂_i − x̂_j|. The role of α (< 1) is to re-weight the costs such that the largest difference is stressed more than the others. If the second term is removed, the maximum probability for a pixel i is achieved by assigning it the value of the neighbor with the largest difference among the other neighbors, rather than a certain weighted average of neighbors, which might have been the case when α > 1.
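The following sketch evaluates the negative log of (8) (up to the constant C) for a candidate solution, i.e. the quantity that a belief-propagation or brute-force search over the per-pixel candidates would minimize. It is illustrative only: it relies on the σ assignment reconstructed in (8) above and uses a 4-neighborhood instead of the 8-connected N_S(i).

    import numpy as np

    def nip_energy(x_hat, y, alpha=0.85, sigma_N=200.0, sigma_R=1.0):
        # -log P({x}|{y}) from (8), up to an additive constant
        data = ((x_hat - y) ** 2 / sigma_R).sum()
        smooth = (np.abs(np.diff(x_hat, axis=0)) ** alpha / sigma_N).sum() \
               + (np.abs(np.diff(x_hat, axis=1)) ** alpha / sigma_N).sum()
        return data + smooth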
Fig. 2. Example of super-resolution: a. bicubic, b. regression result, c. post-processed result of b based on NIP, d. Laplacian of bicubic with major edges displayed as green pixels, and e and f. enlarged portions of a–c from left to right.
Optimizing (8) throughout the whole image region can lead to degraded results as it tends to flatten textured areas, especially when the contrast is low such that the contribution of the second term is small (see the footnote below). This problem is resolved by applying the (modified) NIP only in the vicinity of major edges. Based on the observation that the input images are blurred and accordingly very high spatial frequency components are removed, the major edges are found by thresholding each pixel of the Laplacian of the input image using the L2 and L∞ norms of the local patches encompassing it. It should be noted that a major edge is in general different from an object contour. For instance, in Fig. 2.d, the boundary between the chest of the duck and the water is not detected as a major edge as
In the original work of Tappen et al. [9], this problem does not happen as the candidates are 2 × 2-size image patches rather than individual pixels.
the intensity variations are not significant across the boundary. In this case, no visible oscillation of pixel values is observed in the original regression result. The parameters α, σN, and σR are determined at 0.85, 200, and 1, respectively. While the improvement in terms of SNR is less significant (on average 0.04dB from the combined regression result), the improved visual quality at major edges demonstrates the effectiveness of NIP (Fig. 2).
3 Experiments
The proposed method was evaluated based on a set of high- and low-resolution image pairs (Fig. 3) which is disjoint from the training images. The desired resolution is twice that of the input image along each dimension. The number of training data points is 200,000, and it took around a day to train the sparse KRR on a 2.5GHz PC. For comparison, several different example-based image super-resolution methods were evaluated, which include Freeman et al.'s NN-based method [2], Tappen et al.'s NIP [9] (see the footnote below), and Kim et al.'s SVR-based method [6] (trained based on only 10,000 data points).
Fig. 3. Thumbnails of test images: the images are indexed by numbers arranged in the raster order
Figure 4 shows examples of super-resolution results. All the example-based super-resolution methods outperform the bicubic interpolation in terms of visual plausibility. The NN-based method and the original NIP produced sharper images at the expense of introducing noise, which, even with the improved visual quality, leads to lower SNR values than the bicubic interpolations. The SVR produced less noisy images. However, it generated smoothed edges and perceptually distracting ringing artifacts, which have disappeared for the proposed method. Disregarding the post-processing stage, we measured an average SNR improvement of 0.69dB for the proposed method over the SVR. This could be attributed to the sparsity of the solution which enabled training on a large data set and the
The original NIP algorithm was developed for super-resolving the NN-subsampled image (not bicubic resampling which is used for experiments with all the other methods). Accordingly, for the experiments with NIP, the low resolution images were generated by NN subsampling. The visual qualities of the super-resolution results are not significantly different from the results obtained from bicubic resampling. However, the quantitative results should not be directly compared with other methods.
Fig. 4. Results of different super-resolution algorithms on two images from Fig. 3: a-b. original, c-d. bicubic, e-f. SVR [6], g-h. NN-based method [2], i-j. NIP [9], and k-l. proposed method.
[Figure 5 plots the increase of SNR over bicubic interpolation (from −2 to 4 dB) against the image index (1–12) for bicubic, SVR, NN, NIP, and the proposed method.]

Fig. 5. Performance of different super-resolution algorithms
effectiveness of the candidate combination scheme. Moreover, in comparison to SVR the proposed method requires much less processing time: super-resolving a 256×256-size image into 512×512 requires around 25 seconds for the proposed
method and 20 minutes for the SVR-based method. For quantitative comparison, SNRs of different algorithms are plotted in Fig. 5.
4 Conclusion
This paper approached the problem of image super-resolution from a nonlinear regression viewpoint. A combination of KMP and gradient descent is adopted to obtain a sparse KRR solution which enabled a realistic application of regressionbased super-resolution. To resolve the problem of smoothing artifacts that occur due to the regularization, the NIP was adopted to post-process the regression result such that the edges are sharpen while the artifacts are suppressed. Comparison with the existing example-based image super-resolution methods demonstrated the effectiveness of the proposed method. Future work should include comparison and combination of various non-example-based approaches. Acknowledgment. The contents of this paper have greatly benefited from discussions with G. BakIr and C. Walder, and comments from anonymous reviewers. The idea of using localized KRR was originated by C. Walder.
References
1. Baker, S., Kanade, T.: Limits on super-resolution and how to break them. IEEE Trans. Pattern Analysis and Machine Intelligence 24(9), 1167–1183 (2002)
2. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. IEEE Computer Graphics and Applications 22(2), 56–65 (2002)
3. Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: Computer Graphics (Proc. Siggraph 2001), pp. 327–340. ACM Press, New York (2001)
4. Keerthi, S.S., Chu, W.: A matching pursuit approach to sparse Gaussian process regression. In: Advances in Neural Information Processing Systems. MIT Press, Cambridge (2005)
5. Kim, K.I., Franz, M.O., Schölkopf, B.: Iterative kernel principal component analysis for image modeling. IEEE Trans. Pattern Analysis and Machine Intelligence 27(9), 1351–1366 (2005)
6. Kim, K.I., Kim, D.H., Kim, J.H.: Example-based learning for image super-resolution. In: Proc. the third Tsinghua-KAIST Joint Workshop on Pattern Recognition, pp. 140–148 (2004)
7. Ni, K., Nguyen, T.Q.: Image superresolution using support vector regression. IEEE Trans. Image Processing 16(6), 1596–1610 (2007)
8. Snelson, E., Ghahramani, Z.: Sparse Gaussian processes using pseudo-inputs. In: Advances in Neural Information Processing Systems. MIT Press, Cambridge (2006)
9. Tappen, M.F., Russel, B.C., Freeman, W.T.: Exploiting the sparse derivative prior for super-resolution and image demosaicing. In: Proc. IEEE Workshop on Statistical and Computational Theories of Vision (2003)
10. Tschumperlé, D., Deriche, R.: Vector-valued image regularization with PDEs: a common framework for different applications. IEEE Trans. Pattern Analysis and Machine Intelligence 27(4), 506–517 (2005)
11. Vincent, P., Bengio, Y.: Kernel matching pursuit. Machine Learning 48, 165–187 (2002)
Statistically Optimal Averaging for Image Restoration and Optical Flow Estimation
Kai Krajsek (1), Rudolf Mester (2), and Hanno Scharr (1)
(1) Forschungszentrum Jülich, ICG-3, 52425 Jülich, Germany, {k.krajsek,h.scharr}@fz-juelich.de
(2) J.W. Goethe University, Frankfurt, Germany, Visual Sensorics and Information Processing Lab, [email protected]
Abstract. In this paper we introduce a Bayesian best linear unbiased estimator (Bayesian BLUE) and apply it to generate optimal averaging filters. Linear filtering of signals is a basic operation frequently used in low level vision. In many applications, filter selection is ad hoc, without proper theoretical justification. For example, input signals are often convolved with Gaussian filter masks, i.e. masks that are constructed from truncated and normalized Gaussian functions, in order to reduce the signal noise. In this contribution, statistical estimation theory is explored to derive statistically optimal filter masks from first principles. Their shape and size are fully determined by the signal and noise characteristics. Adopting the estimation-theoretic point of view not only allows learning optimal filter masks but also estimating the variance of the estimate. The statistically learned filter masks are validated experimentally on image reconstruction and optical flow estimation. In these experiments our approach outperforms comparable approaches that are based on ad hoc assumptions on signal and noise or that do not relate their method to the signal at hand at all.
1 Introduction
Linear convolution with an averaging mask occurs in many pre- and intermediate steps in computer vision and image processing algorithms. Besides the requirement that the filter coefficients sum up to one, no additional restrictions are commonly applied upon the filter mask. It is often proposed to decrease filter coefficients, i.e. averaging weights, for sampling points that are farther away from the filter center, but no theoretical justification for such proposals is given. In this contribution such a statistical justification is developed for two cases: (1) simple linear signal reconstruction as a toy example to present concepts and (2) optical flow estimation, where neighborhood sizes and shapes are frequently selected by hand in well-known estimation schemes [1,2]. Furthermore we introduce a Bayesian best linear unbiased estimator (BLUE) to perform the filter design. While it is derived in a straightforward manner from textbook knowledge on estimation theory (cmp. Sec. 2), to the best of our knowledge, it has not been
described so far in an image processing context. For linear signal reconstruction, local weighted averaging is performed to reduce noise in an image signal. The optimal shape of the filter for image denoising is purely determined by the signal and noise characteristics. Considering the noise: typically, the higher the noise level, the lower the signal frequencies that deliver reliable information, and consequently the larger the averaging filter should be. Considering the signal: the longer the signal's correlation length, i.e. the longer the signal remains constant or only linearly changing, the larger the averaging kernels may be. This interaction of signal and noise is statistically modeled and exploited for optimal filter kernel construction in Sec. 3. Optical flow is estimated from image data using one brightness constancy constraint per pixel. Multiple constraints are locally combined in order to obtain an over-determined system of equations. Technically this can be done by local averaging of these constraints or of squared versions of them [3,4,1,2]. This averaging serves two purposes. On the one hand it solves the so-called aperture problem. On the other hand it makes estimation results more robust to noise with increasing filter size, just as in the denoising case above. In Sec. 4 statistical estimation theory is used to derive an optimal averaging mask in the optical flow estimation scenario whose shape and size are fully determined by the signal and noise characteristics. In Sec. 5 we show optical flow estimation results on a well-known test sequence, outperforming comparable algorithms.
2 Estimation Theory
Let us consider a scalar signal defined on the N nodes of a discrete grid. As usually done for discretely sampled signals, we convert the signal or a subset of the signal into a vector s ∈ IR^N, independent of the dimension of the underlying space on which the signal is defined. The task is to estimate the signal values ŝ from measurements z based on an observation equation relating the measurements to the desired signal values. Let us denote with ε = s − ŝ the error of the estimator for a particular realization of the signal values s and the measurement values z. The performance of (Bayesian) estimators is judged by their risk R(ε) = E[C(ε)], defined as the expectation value of a cost function C(ε) with respect to the joint probability density function p(s, z). The minimum of the risk defines the optimal estimator with respect to the corresponding cost function [5]. Different cost functions lead to different estimators. For instance, the quadratic cost function (minimum mean squared error (MMSE) principle) results in the posterior mean ŝ = ∫ s p(s|z) ds. In practice the optimal estimator cannot be applied in many cases, either because the full posterior pdf p(s|z) is not available or because there are further design criteria on the estimator such as computational efficiency. One example of such an estimator is the well-known classical Wiener filter. It is restricted to affine transformations of the measurements combined with the quadratic cost function. However, in many applications it is desirable for simplicity and efficiency reasons to apply fast linear convolutions. A linear variant of the Wiener filter has been proposed in [6] for
image processing applications. However, the linear Wiener filter is biased¹ (except for the special case of signals with zero mean), i.e. E_{p(z|s)}[ŝ] ≠ s: the expectation value of the estimate is not equal to the current realization of the signal. In this contribution, we reformulate the best (with respect to the MMSE criterion) linear unbiased estimator (BLUE) from a Bayesian point of view in order to learn the size as well as the shape of optimal averaging filter masks for linear convolution.
2.1 Observation Equation
In the following, we present the general linear model for scalar signals. We may not be able to observe the full N-dimensional signal vector, but only a signal vector of lower dimension M, M ≤ N, for instance if some data values are missing. The relation between the ideal signal and the observable signal values is modeled by an observation matrix K ∈ IR^{M×N} mapping the ideal signal s onto the observable one s̃ ∈ IR^M via s̃ = Ks. The noise inevitably occurring in real-world applications can be modeled by an additive noise component v with zero mean and covariance matrix Σ_v, leading to the general linear model

z = Ks + v .   (1)
Our objective is to infer the original signal s from the observed signal z. If fewer observations are available than signal values (M < N), the estimation of the signal vector is a classical ill-posed problem, since there exists no unique solution. But even if each observation corresponds to one underlying signal value, each signal value has to be estimated from only one observation, and from an estimation theoretical point of view the most reasonable solution is then to take the observed signal itself. In order to get closer to the original noise-free signal s, additional assumptions or prior knowledge need to be incorporated into the estimation process. The signal can be modeled in a local neighborhood, and instead of estimating the signal only the model parameters need to be estimated. A common model is the representation of the signal by a linear combination of L (L < N) basis vectors such that s = Bx with B ∈ IR^{N×L}, x ∈ IR^L. Inserting this data model into the linear model Eq. (1) leads to a general data model that relates the L coefficients to the observable signal

z = Hx + v ,   H = KB .   (2)
With this model not all N signal values but only the L model parameters need to be estimated, while keeping the number M of observations. Thus, the model parameters can be estimated with a higher precision (lower variance of the estimate) than the signal values themselves. The price we have to pay for the precision gain in the model parameter estimation is a possible systematic error.
¹ Please note that the concept of bias can, strictly speaking, not be applied to Bayesian estimation. There exists no true underlying signal, as the signal is a realization of a stochastic field. However, formally we can require the mean of the estimated signal (with respect to the sampling distribution p(z|s)) to be equal to the current realization of the stochastic field.
2.2 The Bayesian Best Linear Unbiased Estimator (Bayesian-BLUE)
In this subsection, we present a Bayesian version of the best linear unbiased estimator for estimating the model parameters x. According to the variance-bias decomposition, the MMSE can be decomposed into the sum of the variance and the squared bias of the estimator. Thus, for bias-free estimators, minimizing the MMSE equals minimizing the variance of the estimator. The Gauss-Markov theorem [5] states that the best linear unbiased estimator (BLUE), i.e. the linear estimator that minimizes the variance of the estimator, is given by the minimum of the energy

J(x) = ε^T Σ_v^{-1} ε = (z − Hx)^T Σ_v^{-1} (z − Hx) .   (3)
The parameter vector minimizing this energy reads

x̂ = (H^T Σ_v^{-1} H)^{-1} H^T Σ_v^{-1} z .   (4)
The covariance matrix of the estimated signal is given by

Σ_x = (H^T Σ_v^{-1} H)^{-1}   (5)
and the minimum variance of each signal component is given by the corresponding diagonal element of the covariance matrix. Note that the Gauss-Markov theorem is usually used in this form only in a non-Bayesian context and therefore the variance is computed with respect to the sampling distribution p(z|s). An extension of this principle to a Bayesian interpretation is straightforward by simply taking the expectation in the MMSE not with respect to the sampling distribution p(z|s) but with respect to the joint pdf p(s, z). As for the usual BLUE, H^T Σ_v^{-1} H needs to be invertible (cf. Eq. 5), which is not always the case. This situation occurs when the number of parameters to be estimated exceeds the number of available measurements or if the basis functions in a local expansion of the signal are linearly dependent. In such cases one can regularize the minimization approach using standard heuristic approaches like ridge regression that lead to the pseudo inverse [5]. In the Bayesian context, the regularization term follows directly from the Gauss-Markov theorem by relaxing the assumption of unbiasedness, leading to the linear Wiener filter [6].
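For illustration, the estimator of Eqs. (4) and (5) can be written down directly; the following is a minimal NumPy sketch under the assumption that H, z and the noise covariance Σ_v are given as arrays (the function and variable names are ours, not from the paper, and the inverse of H^T Σ_v^{-1} H is assumed to exist).

    import numpy as np

    def blue_estimate(H, z, Sigma_v):
        # H: (M, L) observation matrix, z: (M,) measurements, Sigma_v: (M, M) noise covariance
        Sigma_v_inv = np.linalg.inv(Sigma_v)
        # Covariance of the estimate, Eq. (5)
        Sigma_x = np.linalg.inv(H.T @ Sigma_v_inv @ H)
        # Best linear unbiased estimate, Eq. (4)
        x_hat = Sigma_x @ H.T @ Sigma_v_inv @ z
        return x_hat, Sigma_x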
3 Scalar Signals
In the following we apply the Bayesian BLUE to derive statistically optimal averaging filter masks for linear convolution. In contrast to ad hoc designed filter masks, our approach leads to filter masks whose size and shape are fully determined by the signal and noise characteristics as well as by the mismatch of the local signal model. Linear averaging implies the following signal and noise model with the observation matrix b = (1, 1, ..., 1)^T:

z = bs + v .   (6)
The noise components v_j are assumed to be mutually statistically independent. Usually one requires the coefficients of the filter mask to sum up to one in order to preserve the mean. This follows directly from the requirement of an unbiased estimator. Let h denote the linear filter kernel. An unbiased estimate requires the expectation value of the estimated signal, E[ŝ] = E[h^T z], to be equal to the true signal realization s:

E[h^T z] = h^T b s = s   (7)
⇒ h^T b = 1 .   (8)
Thus, the Bayesian BLUE naturally leads to a filter mask summing up to one. The signal model assumes the signal to be constant within the filter support. Such a restricted signal model is fulfilled exactly only for signals that are constant over the entire signal domain. Thus, besides the uncertainty caused by the measurement noise, the model mismatch leads to an additional uncertainty in the estimation process. Each noise component v_j at position j is modeled as the sum v_j = ε + δs_j of the measurement noise ε and a noise component δs_j describing the model mismatch at each position except the position at which the signal is estimated. Inserting the model (6) into equation (4) of the BLUE leads to the estimator for the signal value and its variance

ŝ = (Σ_j z_j/σ²_{v_j}) / (Σ_j 1/σ²_{v_j}) ,   σ_s² = 1 / (Σ_j 1/σ²_{v_j}) .   (9)

The filter mask h_j = (1/σ²_{v_j}) / (Σ_j 1/σ²_{v_j}) is fully determined by the measurement noise and the model error and thus can be learned from training data. The size as well as the shape of the filter mask can either be learned by estimating the covariance matrix or by directly applying the MMSE principle. For learning the optimal filter masks for a given noise level, we choose the gray-value training images from the Berkeley segmentation database [7]. We learn the filter coefficients for six different noise levels σ_ε ∈ {1, 5, 10, 20, 50, 100} by estimating σ²_{v_j} from the minimum mean square error criterion. Figure 1 shows the learned filter masks for different noise levels. In order to validate the performance of the learned filter masks we applied them to the test images corrupted with the different noise levels. We assume the noise level to be known and apply the corresponding filter mask. As a reference we also apply truncated (at three times the width of the Gaussian) and normalized Gaussian filter kernels with different widths. Figure 2 shows the results of the experiment. All learned filter masks deliver nearly the same results as the Gaussian filter with optimal width.
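As a small illustration of Eq. (9), the following sketch computes the averaging weights and the resulting estimate for one local patch, assuming the per-position variances σ²_{v_j} (measurement noise plus model-mismatch variance) have already been estimated from training data; the names and interfaces are illustrative only.

    import numpy as np

    def averaging_mask(sigma_v2):
        # h_j = (1 / sigma_v2_j) / sum_k (1 / sigma_v2_k), cf. Eq. (9)
        w = 1.0 / np.asarray(sigma_v2, dtype=float)
        return w / w.sum()

    def denoise_patch(z, sigma_v2):
        # Estimate of the central signal value and its variance, Eq. (9)
        h = averaging_mask(sigma_v2)
        s_hat = float(np.dot(h, z))
        var_s = 1.0 / float(np.sum(1.0 / np.asarray(sigma_v2, dtype=float)))
        return s_hat, var_s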
4 Estimators for Optical Flow
In the following we discuss the combination of the well-known observation equation, the brightness constancy constraint equation (BCCE), with the Bayesian BLUE discussed above.
Fig. 1. The learned filter masks for 4 different noise levels; from left to right σε = 1, σε = 10, σε = 50, σε = 100
Fig. 2. MSE for the test sequences vs. the variance of a Gaussian filter kernel. The black points indicate the MSE for the learned filter mask that has been horizontally positioned to the minimum of the curve.
4.1 Differential Approaches to Motion Analysis
The general principle behind all differential approaches to motion estimation is that the conservation of some local image characteristic throughout its temporal evolution is reflected in terms of differential-geometric entities of the space-time signal. In the following we describe the image sequence intensity values as a continuous function s(x), x = (x, y, t), defined on the continuous Euclidean space denoted as the space-time volume A. In order to estimate the optical flow field from the image sequence, a functional relationship, the observation equation, between the signal s(x) and the optical flow field u(x) has to be established. A simple relation can be derived from the assumption that all intensity variations are due to motion, such that the brightness of the signal remains constant throughout its evolution in space-time:

s(x(t), y(t), t) = c .   (10)
This implies that the total time derivative is zero, leading to the brightness constancy constraint equation (BCCE)

g_x u_x + g_y u_y + g_t = 0   ⇔   g^T u_h = 0 ,   (11)
where we have defined g = (∂_x s, ∂_y s, ∂_t s)^T and u_h = (u_x, u_y, 1). Since it is fundamentally impossible to solve for u from a single linear equation, and since there is noise in the gradient components, g^T u_h = ε, additional constraints have to be considered. One way is to model the optical flow in a local area Ω around the point where the optical flow is to be estimated. A simple model is the assumption of a constant optical flow field within the neighborhood. To this end we consider the gradients as well as the optical flow field sampled on a discrete grid in space-time. Let us denote with g_t a vector containing all M temporal gradient components within Ω and with G_s a matrix whose rows consist of the M spatial gradients within Ω. The relation between the optical flow vectors within Ω allows us to incorporate all BCCEs in Ω into one linear equation system

− g_t = G_s u + v   (12)
that has formally the same mathematical structure as the general linear model when we set G_s = H, z = −g_t and s = u. The noise vector contains noise components v_j = ε_{s_j}^T u + ε_{t_j} + δu_j originating from the spatial gradient components ε_{s_j} = (ε_{s_x j}, ε_{s_y j}), the temporal gradient components ε_{t_j}, and the noise δu_j originating from the model mismatch. According to Eq. (4), the best linear unbiased estimator reads

û = −(G_s^T C_v^{-1} G_s)^{-1} G_s^T C_v^{-1} g_t .   (13)
Please note that even for statistically independent noise in the input image sequence, the noise covariance matrix has non-zero off-diagonal entries due to the overlapping gradient filter masks. If we neglect the correlation between adjacent pixel values (i.e. ignore these non-zero off-diagonal entries in the covariance matrix), assume statistical independence of the different error components, and furthermore assume that the variance of the signal noise does not depend on the position in space-time, the covariance matrix becomes diagonal with the entries

σ²_{v_j} = σ_s² |u|² + σ_t² + σ²_{δu_j} .   (14)
Incorporating all assumptions in (13) leads to the energy for the optical flow

J(u) = Σ_{j=1}^{M} (u_h^T T_j u_h) / (σ_s² |u|² + σ_t² + σ²_{δu_j}) ,   (15)
where we defined T_j = g_j g_j^T. Since the entries in the denominator depend on the optical flow itself, the minimization of (15) needs to be carried out iteratively. For the two opposing cases of either dominating errors in the gradient components or in the model, two further approximations can be made. Let us first assume that the spatial error variance times the squared optical flow, σ_s² |u|², is much smaller than the sum of the temporal error variance and the variance of the model error. This assumption leads to the least squares energy function

J(u) = Σ_{j=1}^{M} (u_h^T T_j u_h) / (σ_t² + σ²_{δu_j}) .   (16)
The (non-normalized) filter coefficients are in this case given by the inverse of the sum of the two variances σ_t² and σ²_{δu_j}. Please note that this approximation is justified by σ_s²|u|² ≪ σ_t² + σ²_{δu_j}, which does not necessarily require the noise in the temporal gradient components to be larger than the noise in the spatial gradient components, as is commonly believed. This gain of knowledge originates from explicitly considering the model error in the derivation of the error function. Otherwise the approximation above could only be made under the stricter assumption σ_s²|u|² ≪ σ_t². As we assume the noise in the input signal to be i.i.d., the variance of the temporal gradient does not depend on its position in the local neighborhood. Different weighting coefficients originate only from the error of the local optical flow model. If, on the other hand, the noise in the gradient components dominates the model error, the energy function reads approximately

J(u) = Σ_{j=1}^{M} (u_h^T T_j u_h) / (σ_s²|u|² + σ_t²) ,   (17)
which is exactly the energy function of the total least squares estimator (structure tensor approach) [3]. Summing up, we can say that under the assumptions made above the Bayesian BLUE becomes the weighted least squares estimator when the noise variance in the spatial gradient components is neglected. The weighting coefficients are then uniquely defined by the noise variance in the temporal gradient components and the variance of the model error. For large error variances of the spatial gradient components the Bayesian BLUE becomes the well-known total least squares estimator with constant weighting function. In the intermediate case, i.e. when neither the spatial gradient variance nor the model error can be neglected, we have to rely on Eq. (15). Here, the weighting is determined by the model error as in the weighted least squares case. However, due to the non-zero spatial gradient variance, the weighting coefficients depend on the optical flow itself, which requires an iterative scheme for solving for the optical flow.
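To make the iterative scheme concrete, the following sketch minimizes Eq. (15) for a single neighborhood by alternating between fixing the flow-dependent weights and solving the resulting weighted least squares problem; the gradient arrays and variance values are assumed to be given, and the names are ours.

    import numpy as np

    def local_flow(g_s, g_t, sigma_s2, sigma_t2, sigma_du2, n_iter=5):
        # g_s: (M, 2) spatial gradients, g_t: (M,) temporal gradients within the neighborhood
        # sigma_s2, sigma_t2: gradient noise variances, sigma_du2: (M,) model-error variances
        u = np.zeros(2)
        for _ in range(n_iter):
            # weights from the denominator of Eq. (15), evaluated at the current flow estimate
            w = 1.0 / (sigma_s2 * float(np.dot(u, u)) + sigma_t2 + sigma_du2)
            A = (g_s * w[:, None]).T @ g_s      # weighted normal equations
            b = -(g_s * w[:, None]).T @ g_t
            u = np.linalg.solve(A, b)
        return u

With sigma_s2 = 0 the first iteration already yields the weighted least squares solution of Eq. (16).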
5 Experiments
In this section, we validate our estimator (13) with the noise assumptions of the least squares estimator (16). As test image sequence, we use 'Yosemite' (without clouds). For training purposes, we use the 'Diverging tree', 'Marble' and 'Translating tree' sequences. For computing the gradient components, we rely on optimized derivative filters derived in [9]. The size of the derivative filters has been set to 5 × 5 × 5. For performance evaluation, the average angular error (AAE) [10] is computed. We trained the weighting mask by minimizing the MMSE over the training sequences. Afterwards, we computed the AAE for the test sequence. Table 1 shows the results of our approach for the Yosemite sequence compared with other linear and non-linear local optical flow estimators using the same optical flow model. As one can see, our Bayesian BLUE outperforms all other approaches with respect to the AAE.
Table 1. Comparison between the Bayesian BLUE and other published local estimators for the Yosemite Sequence

Estimator                               AAE
Classical structure tensor [8]          3.80
Nonlinear structure tensor [8]          3.71
Robust structure tensor [8]             3.21
Coherence based structure tensor [8]    3.43
Bayesian BLUE                           2.54
Fig. 3. From left to right: One frame of the test sequence; the ground truth and the estimated optical flow field
However, like all linear estimators, our approach suffers from blurring at discontinuities. Preserving edges inevitably requires the use of some non-linear technique, which leads to higher computational costs.
6 Summary and Conclusion
We introduced the Bayesian BLUE and derived statistically optimal averaging masks for image denoising and optical flow estimation. The image denoising application is straightforward and mainly tutorial-like; its results are comparable to optimal Gaussian smoothing. For optical flow we start with the full Bayesian BLUE formulation, similar to the denoising case. Introducing more and more assumptions on the noise in the optical flow observation equation, we end up with the well-known structure tensor approach with a constant averaging mask. Learning this averaging mask from training data yields an estimation scheme outperforming several other comparable schemes on test data.
References
1. Haussecker, H.W., Fleet, D.J.: Computing optical flow with physical models of brightness variation. IEEE Trans. Pattern Anal. Mach. Intell. 23, 661–673 (2001)
2. Bruhn, A., Weickert, J., Schnörr, C.: Lucas/Kanade meets Horn/Schunck: Combining local and global optical flow methods. Int. J. Comput. Vision 61 (2005)
3. Bigün, J., Granlund, G.H.: Optimal orientation detection of linear symmetry. In: Proc. ICCV, pp. 433–438. IEEE, Los Alamitos (1987)
4. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proc. Seventh International Joint Conference on Artificial Intelligence, Vancouver, Canada, August 1981, pp. 674–679 (1981)
5. Kay, S.M.: Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice-Hall, Englewood Cliffs (1993)
6. Mühlich, M., Mester, R.: A statistical extension of normalized convolution and its usage for image interpolation and filtering. In: Proc. European Signal Processing Conference (EUSIPCO 2004), Vienna (2004)
7. Martin, D., Fowlkes, C., Tal, D., Malik, J.: A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV), pp. 416–423 (2001)
8. Brox, T., van den Boomgaard, R., Lauze, F., van de Weijer, J., Weickert, J., Mrázek, P., Kornprobst, P.: Adaptive structure tensors and their applications. In: Weickert, J., Hagen, H. (eds.) Visualization and Processing of Tensor Fields (2005)
9. Scharr, H.: Optimal Operators in Digital Image Processing. PhD thesis, Interdisciplinary Center for Scientific Computing, Univ. of Heidelberg (2000)
10. Barron, J.L., Fleet, D.J., Beauchemin, S.S.: Performance of optical flow techniques. Int. Journal of Computer Vision 12, 43–77 (1994)
Deterministic Defuzzification Based on Spectral Projected Gradient Optimization Tibor Lukić, Nataša Sladoje, and Joakim Lindblad
Faculty of Engineering, University of Novi Sad, Serbia {tibor,sladoje}@uns.ns.ac.yu 2 Centre for Image Analysis, SLU, Uppsala, Sweden [email protected]
Abstract. We apply deterministic optimization based on the Spectral Projected Gradient method in combination with concave regularization to solve the minimization problem imposed by defuzzification by feature distance minimization. We compare the performance of the proposed algorithm with the methods previously recommended for the same task, (non-deterministic) simulated annealing and a (deterministic) DC based algorithm. The evaluation, including numerical tests performed on synthetic and real images, shows advantages of the new method in terms of speed and flexibility regarding the inclusion of additional features in defuzzification. Its relatively low memory requirements allow the application of the suggested method to defuzzification of 3D objects.
1 Introduction
Fuzzy segmentation provides images in which, for each pixel, the degree of its membership to the segmented object is expressed as a value between zero and one. The process of generating an appropriate crisp representation of a fuzzy image is referred to as defuzzification. The most commonly used method for performing defuzzification is the selection of a specific α-cut, i.e., thresholding of the fuzzy image. Recently, a more advanced method, defuzzification by feature distance minimization, was proposed [5,6]. That method utilizes precise estimates of features of the image objects, obtained from their fuzzy representations, to find crisp representatives in which (selected) features of the original objects are preserved as well as possible. The task of finding an appropriate binary configuration is based on numerical minimization of a distance function. An algorithm which has been successfully adapted and mostly used for this purpose so far is Simulated Annealing (SA) [5]. An application of a deterministic optimization method based on the DC (Difference of Convex Functions) Algorithm to the same problem is presented in [4]. The contribution of this paper is an improvement of the numerical performance of the defuzzification process, which is very important for its practical realization.
The first and the second author acknowledge the Ministry of Science of the Republic of Serbia for support through the Projects ON144018 and ON144029.
2 Background
A fuzzy set S on a reference set H is a set of ordered pairs S = {(x, μ_S(x)) | x ∈ H}, where μ_S : H → [0, 1] is the membership function of S in H. A fuzzy set is called a crisp set if the codomain of its membership function is {0, 1}. An α-cut of a fuzzy set S, for α ∈ (0, 1], is the set S_α = {x ∈ H | μ_S(x) ≥ α}. We denote by F(H) and P(H) the set of fuzzy and crisp (sub-)sets on a reference set H, respectively. Being interested in applications in digital image analysis, we consider digital fuzzy sets, where H ⊂ Z^n. For simplicity, we assume in the following that H ⊂ Z^2. The vector representation of a fuzzy set S = {((i, j), μ_S(i, j)) | i = 1, ..., r, j = 1, ..., c} of size N = r × c is the vector s ∈ [0, 1]^N, defined by s = [s_i]_{i=1}^N = [μ_S(1, 1), ..., μ_S(1, c), ..., μ_S(r, 1), ..., μ_S(r, c)]^T. The inner product of x, y ∈ R^n is ⟨x, y⟩ = Σ_{i=1}^n x_i y_i.
We use the definition of the perimeter P of a set S found in [7],

P(S) = Σ_{i,j} ( |μ_S(i, j) − μ_S(i, j+1)| + |μ_S(i, j) − μ_S(i+1, j)| ) .   (1)
The moment m_{p,q} of the set S is m_{p,q}(S) = Σ_{i,j} μ_S(i, j) i^p j^q. The area A of S is [7] A(S) = m_{0,0}(S). The center of gravity (centroid) C of S is

C(S) = (C_x(S), C_y(S)) = ( m_{1,0}(S)/m_{0,0}(S) , m_{0,1}(S)/m_{0,0}(S) ) .   (2)
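For illustration, the features of Eqs. (1) and (2) can be computed directly from a membership image; the following minimal NumPy sketch assumes that pixel indices start at 1, as in the definitions above (the function names are ours).

    import numpy as np

    def perimeter(mu):
        # P(S), Eq. (1): absolute membership differences of horizontal and vertical neighbors
        return float(np.abs(np.diff(mu, axis=1)).sum() + np.abs(np.diff(mu, axis=0)).sum())

    def area(mu):
        # A(S) = m_{0,0}(S)
        return float(mu.sum())

    def centroid(mu):
        # C(S) = (m_{1,0}/m_{0,0}, m_{0,1}/m_{0,0}), Eq. (2)
        i, j = np.mgrid[1:mu.shape[0] + 1, 1:mu.shape[1] + 1]
        m00 = mu.sum()
        return float((i * mu).sum() / m00), float((j * mu).sum() / m00)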
2.1 Defuzzification by Feature Distance Minimization
An optimal defuzzification [6] D(A) of a fuzzy set A ∈ F(H), with respect to the distance function d, is

D(A) ∈ {C ∈ P(H) | d(A, C) = min_{B∈P(H)} d(A, B)} .   (3)
A suitable distance measure to be utilized in (3) is a feature distance, defined as d_Φ(A, B) = d(Φ(A), Φ(B)), where Φ : F(H) → R^n provides a feature-vector representation of a fuzzy set. Features that are appropriate to consider in the distance measure include, e.g., point-wise membership values of the sets, as well as global shape features such as area, perimeter and centroid (see [6]). In general, the optimization problem (3) cannot be solved analytically. Moreover, the search space P(H) is too large to be exhaustively traversed. As a consequence, we have to apply a suitable optimization method to find an approximate solution of Eq. (3).
2.2 Spectral Projected Gradient Method
The Spectral Projected Gradient (SPG) algorithm is a deterministic, iterative optimization method, introduced by Birgin, Martínez and Raydan (2000) in [2] for solving the convex-constrained optimization problem min_{x∈Ω} f(x), where the feasible region Ω is a closed convex set in R^n. The requirements for the application of the SPG algorithm are: i) f is defined and has continuous partial derivatives on an open set that contains Ω; ii) the projection P_Ω of an arbitrary point x ∈ R^n onto the set Ω is defined. The algorithm uses the following parameters: an integer m ≥ 1; 0 < α_min < α_max, γ ∈ (0, 1), 0 < σ_1 < σ_2 < 1, and initially α_0 ∈ [α_min, α_max] (see [1] for details). Starting from an arbitrary configuration x_0 ∈ Ω, the computation below is iterated until convergence.

SPG iterative step [1]. Given x_k and α_k, the values x_{k+1} and α_{k+1} are computed as follows:
  d_k = P_Ω(x_k − α_k ∇f(x_k)) − x_k;
  f_max = max{f(x_{k−j}) | 0 ≤ j ≤ min{k, m − 1}};
  x_{k+1} = x_k + d_k;  δ = ⟨∇f(x_k), d_k⟩;  λ_k = 1;
  while f(x_{k+1}) > f_max + γ λ_k δ
    λ_temp = −(1/2) λ_k² δ / (f(x_{k+1}) − f(x_k) − λ_k δ);
    if (λ_temp ≥ σ_1 ∧ λ_temp ≤ σ_2 λ_k) then λ_k = λ_temp else λ_k = λ_k/2;
    x_{k+1} = x_k + λ_k d_k;
  end while;
  s_k = x_{k+1} − x_k;  y_k = ∇f(x_{k+1}) − ∇f(x_k);  β_k = ⟨s_k, y_k⟩;
  if β_k ≤ 0 then α_{k+1} = α_max else α_{k+1} = min{α_max, max{α_min, ⟨s_k, s_k⟩/β_k}}

The SPG algorithm is particularly suited for situations where the projection calculation is inexpensive, as in box-constrained problems, and its performance has been shown to be very good on large-scale problems (see [1]). This motivates us to apply the SPG algorithm to defuzzification, which is a convex, box-constrained, large-scale optimization problem.
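For illustration, one SPG iteration as summarized above can be sketched as follows; f, grad_f and project are assumed to be supplied as callables, f_hist is assumed to hold the last m function values, and all names are ours rather than part of [1].

    import numpy as np

    def spg_step(x, alpha, f, grad_f, project, f_hist,
                 gamma=1e-4, sigma1=0.1, sigma2=0.9, alpha_min=1e-3, alpha_max=1e3):
        g = grad_f(x)
        d = project(x - alpha * g) - x
        f_max = max(f_hist)                      # nonmonotone reference value
        delta = float(np.dot(g, d))
        lam, x_new, fx = 1.0, x + d, f(x)
        while f(x_new) > f_max + gamma * lam * delta:
            lam_tmp = -0.5 * lam ** 2 * delta / (f(x_new) - fx - lam * delta)
            lam = lam_tmp if (sigma1 <= lam_tmp <= sigma2 * lam) else lam / 2.0
            x_new = x + lam * d
        s, y = x_new - x, grad_f(x_new) - g
        beta = float(np.dot(s, y))
        alpha_new = alpha_max if beta <= 0 else min(alpha_max, max(alpha_min, float(np.dot(s, s)) / beta))
        return x_new, alpha_new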
3 SPG Based Algorithm for Defuzzification
We consider the following formulation of the defuzzification problem (3). Let S ∈ F(H) be a given fuzzy set of size r × c = N and s ∈ [0, 1]^N its vector representation. Find X ∈ P(H) that minimizes the distance d_2^Φ(X, S) = d_2(Φ(X), Φ(S)), where d_2 is the Euclidean distance, while Φ(X) and Φ(S) are the feature-based representations of the sets X and S. We use a feature-based representation of the fuzzy set that contains (appropriately rescaled) membership, perimeter, area and centroid terms:

Φ(S) = [ w_M s_1/√N, ..., w_M s_N/√N, w_P P(S)/P(H), w_A A(S)/A(H), w_C C_x(S)/C_x(H), w_C C_y(S)/C_y(H) ]^T ,   (4)
where w_M, w_P, w_A, and w_C are user-defined weights that assign a relative importance to the features membership, perimeter, area and centroid, respectively. The perimeter function P, as defined in Eq. (1), is not differentiable, being a sum of absolute values. To make it differentiable, we suggested in [4] to consider a smooth approximation of the function |x|: χ_ε(x) = √(4ε² + x²) − 2ε, x ∈ R, for a small ε > 0. Replacing |·| with χ_ε(·) in (1) we obtain the smooth perimeter estimation function P_{χ_ε}. In addition, the functions C_x and C_y, Eq. (2), defining the coordinates of the centroid C(S), are not differentiable in [0, 0, ..., 0]^T. To obtain functions differentiable on the whole domain [0, 1]^N, we introduce smooth approximations of the centroid coordinate functions, C_x^δ(S) = m_{1,0}(S)/(m_{0,0}(S) + δ) and C_y^δ(S) = m_{0,1}(S)/(m_{0,0}(S) + δ), for small δ > 0. We define the smooth objective function Ψ to be the squared Euclidean distance between the smoothed feature vector representations of the sets X and S, Ψ_S(x) = (d_2^Φ(X, S))². The defuzzification problem with respect to the distance measure d_2^Φ can then be considered as the following binary (integer) optimization problem

min_{x∈{0,1}^N} Ψ_S(x) .   (5)
Instead of the binary optimization (5), we consider the relaxed nonlinear convex constrained optimization problem

min_{x∈[0,1]^N} E_μ(x),   E_μ(x) = Ψ_S(x) + (1/2) μ ⟨x, e − x⟩ ,   (6)
obtained by adding a concave term, (1/2) μ ⟨x, e − x⟩, where e = [1, 1, ..., 1]^T and μ > 0, into the objective function, to enforce binary solutions. The following theorem ensures the equivalence of Problems (5) and (6).

Theorem 1. [3] Let Ψ be Lipschitzian on an open set A ⊃ [0, 1]^n and twice continuously differentiable on [0, 1]^n. Then there exists a μ* ∈ R such that for all μ > μ*
(i) the integer programming problem min_{x∈{0,1}^n} Ψ(x) is equivalent to the nonlinear problem min_{x∈[0,1]^n} Ψ(x) + (1/2) μ ⟨x, e − x⟩;
(ii) the function Ψ(x) + (1/2) μ ⟨x, e − x⟩ is concave on [0, 1]^n.
We define the projection P_Ω of an arbitrary vector x ∈ R^N onto the set Ω = [0, 1]^N as

[P_Ω(x)]_i = 0 if x_i ≤ 0,  1 if x_i ≥ 1,  x_i otherwise,  where i = 1, ..., N .   (7)

P_Ω is a projection with respect to the Euclidean distance, i.e. P_Ω(x) = arg min_{y∈Ω} d_2(x, y).
Since the objective function E_μ, (6), is smooth for μ > 0 and the projection function (7) onto the feasible convex region [0, 1]^N is defined, the requirements for applying the SPG algorithm to the optimization problem (6) for any fixed μ > 0 are fulfilled. Our strategy is to solve a sequence of optimization problems (6), with gradually increasing μ, which, according to Theorem 1, leads to a solution of the binary optimization problem (5). More precisely, we suggest the following optimization algorithm:

SPG based algorithm for defuzzification.
Parameters: ε_in > 0; ε_out > 0; μ_Δ > 0
x^0 = s; μ = 0; k = 0;
do
  do
    compute x^{k+1} from x^k by the SPG iterative step; k = k + 1;
  while ‖x^k − x^{k−1}‖_∞ > ε_in
  μ = μ + μ_Δ;
while max_i { min{x_i^k, 1 − x_i^k} } > ε_out .
The initial configuration is the original fuzzy set. In each iteration of the outer loop we solve, using the SPG method, an optimization problem (6) for a fixed binarization factor μ > 0. By iteratively increasing the value of μ in the outer loop, binary solutions are enforced. The termination criterion for the outer loop, ε_out, regulates the tolerance for the finally accepted (almost) binary solution. The termination criterion for the inner loop, ε_in, affects the number of iterations made for a specific binary enforcement setting μ. If ε_in is chosen too small, it may lead to an unnecessarily large number of inner iterations, and thereby slow down the process, without any significant improvement of the final binary solution. Its value, however, should be small enough to ensure a reasonably good output of the SPG algorithm. The starting value of μ is 0, to ensure that the influence of the features observed in defuzzification is sufficiently high at the beginning of the process, compared to the influence of the binary enforcement term. The parameter μ_Δ regulates the speed of enforcement of binary solutions. Too fast binarization (a high value of μ_Δ) usually leads to a poor result. On the other hand, a too small μ_Δ step may lead to an unnecessarily large number of outer iterations.
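A compact sketch of this outer loop is given below; it assumes a smooth objective psi with gradient grad_psi (e.g. the smoothed feature distance Ψ_S), an SPG step such as the one sketched in Sec. 2.2, and it uses clipping to [0, 1] as the projection of Eq. (7). The structure follows the algorithm above, but the names and the rounding of the final result are our own choices.

    import numpy as np
    from collections import deque

    def defuzzify(s, psi, grad_psi, spg_step, eps_in=1e-5, eps_out=1e-3, mu_delta=1e-5, m=10):
        project = lambda v: np.clip(v, 0.0, 1.0)                           # projection of Eq. (7)
        x, mu, alpha = s.copy(), 0.0, 2.0
        while np.max(np.minimum(x, 1.0 - x)) > eps_out:                    # outer loop: enforce binarity
            E = lambda v: psi(v) + 0.5 * mu * float(np.dot(v, 1.0 - v))    # Eq. (6)
            grad_E = lambda v: grad_psi(v) + 0.5 * mu * (1.0 - 2.0 * v)
            f_hist = deque([E(x)], maxlen=m)
            while True:                                                    # inner loop: SPG for fixed mu
                x_new, alpha = spg_step(x, alpha, E, grad_E, project, f_hist)
                f_hist.append(E(x_new))
                converged = np.max(np.abs(x_new - x)) <= eps_in
                x = x_new
                if converged:
                    break
            mu += mu_delta
        return np.round(x)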
4 Evaluation
This section contains a performance evaluation of the proposed SPG based method by comparison with the previously recommended SA and DC based methods. The results obtained by the Optimal α-cut method, as a fast but less powerful method, are also considered. The most important requirement for defuzzification by feature distance minimization is to preserve the observed feature values of the fuzzy set in its
Fig. 1. Test images used in experiments. (1a) Fuzzy segmented microscopy image of a bone implant. (1b) Fuzzy segmented MRA image of a human aorta. (1c-3b) Fuzzy segmented fluorescence images of Calcein stained cells. (3c-3e) Synthetic fuzzy images.
defuzzification. Hence, the overall feature distance between the original fuzzy set and its defuzzification is the most important evaluation criterion. In addition, the applicability of the underlying optimization methods, including running time, deterministic or non-deterministic results, computational complexity, stability regarding parameter settings, and flexibility regarding the inclusion of other features or additional constraints in the optimization, also forms a significant evaluation criterion. A successful utilization of the defuzzification method in real applications significantly depends on the performance of the used optimization method. For segmentation purposes, the appearance of the defuzzified image is relevant too. However, this depends on the choice of distance measure and the selection of features, and should not depend on the choice of optimization method. The appearance of the defuzzified objects is therefore not included as part of the evaluation of the optimization methods studied here. Evaluating the particular distance measure is beyond the scope of this paper.
4.1 Optimization Methods
SPG based algorithm. The starting configuration is the original fuzzy set. The values of ε_in and ε_out are set to 10^{-5} and 10^{-3}, respectively. The increment factor μ_Δ is set to ε_in. On the basis of our experiments and the suggestions in [1] we chose the following parameter settings for SPG: γ = 10^{-4}, σ_1 = 0.1, σ_2 = 0.9, m = 10, α_min = 10^{-3}, α_max = 10^3, and α_0 = 2. The values of ε in χ_ε and δ in (C_x^δ, C_y^δ) are set to 10^{-3} and 10^{-6}, respectively.
Fig. 2. Minimal distances found by defuzzification based on left: membership, area and perimeter; right: membership, area, perimeter and centroid (horizontal axis: distance relative to Optimal α-cut)
DC based algorithm. We use the DC based algorithm as described in [4]. The termination criteria for the inner and outer loop, ε_in and ε_out, as well as the increment factor μ_Δ and the value ε, are assigned the same values as in the SPG based method.

Simulated Annealing. We use the SA algorithm as described in [6]. The initial configuration is obtained by the optimal α-cut. The initial temperature T_0 is 0.1. The number of perturbations tested at each temperature level is 1000, after which the temperature is reduced one level, T_{k+1} = 0.995 T_k. The temperature is successively lowered until 5 000 successive perturbations do not provide any step that gives a reduction in distance, after which a new re-annealing is restarted from the currently best found solution. After 10 re-annealings the process is stopped and the best found configuration is used. For more details see [6].

Optimal α-cut. This method consists of brute force testing of all α-cuts of the fuzzy set, and selecting the one with the smallest distance to the set. It is a fast but constrained method, leading to reasonably good results.
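The Optimal α-cut baseline admits a particularly compact sketch: threshold the fuzzy set at every membership value that occurs in it and keep the cut with the smallest feature distance. In the sketch below, dist is an assumed callable implementing the chosen distance between a crisp candidate and the fuzzy original.

    import numpy as np

    def optimal_alpha_cut(mu, dist):
        best_cut, best_d = None, np.inf
        for alpha in np.unique(mu[mu > 0]):          # every distinct membership level
            cut = (mu >= alpha).astype(float)
            d = dist(cut, mu)
            if d < best_d:
                best_cut, best_d = cut, d
        return best_cut, best_d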
4.2 Experiments and Results
The methods are tested on 12 real and 3 synthetic images, shown in Fig. 1. All experiments are carried out on a standard PC with a 1.6 GHz CPU and 512 MB of RAM. The SA algorithm is implemented in C++, whereas the DC and SPG based methods are implemented in Matlab. Since the main parts of the DC and SPG methods are implemented using matrix operations, no significant reduction of speed is caused by this choice of environment. We consider defuzzifications where all the observed features are included in the distance function with equal importance, i.e. w_M = w_A = w_P = 1, and w_C = 0
Fig. 3. Defuzzifications of two fuzzy segmented images by different methods. The included features are top: P-perimeter and A-area (Image (2a): original P: 179.56, A: 409.07; SA: P: 180, A: 368, Dist.: 0.13124; SPG/DC: P: 180, A: 374, Dist.: 0.13209); bottom: P, A and C-centroid (Image (2d): original P: 95.04, A: 322.71, C: (14.21, 21.50); SA: P: 94, A: 297, C: (14.43, 21.37), Dist.: 0.13536; SPG: P: 94, A: 299, C: (14.30, 21.40), Dist.: 0.13593).
(Fig. 2(left)) or w_C = 1 (Fig. 2(right)) in (4). The DC based algorithm requires that the objective function is expressed as a difference of convex functions. This is, in general, not an easy task and it has, so far, not been done with the centroid term included. The centroid term is therefore excluded when comparing performance with the DC based algorithm. The defuzzifications found by the SA, DC, and SPG algorithms are in general very similar, or exactly the same. The reached distances for the considered methods are presented in Fig. 2. For all test images, the distances obtained by SA, DC, and SPG are significantly better than the distance obtained by the Optimal α-cut. For 12 out of 15 images, the distances obtained by the SA, DC, and SPG algorithms are the same, while, when including the centroid feature, the reached distances for the SA and SPG algorithms are the same in 13 cases (see Fig. 2(right)). In the remaining 5 cases, the distances obtained by SA are slightly better than for DC and SPG. This, however, comes at the expense of using a non-deterministic method. Fig. 3 shows two defuzzifications with the most different achieved distances.
4.3 Applicability of the Method
Deterministic optimization algorithms, in contrast to non-deterministic ones, always converge to the same result with the same convergence speed, which makes such methods practically appealing. Our main goal was to find an appropriate deterministic optimization method to be used in defuzzification, and the emphasis of the applicability evaluation of the suggested method is therefore on the comparison with the recently proposed deterministic DC based method. Considering the data presented in Table 1, the running time of the SPG algorithm is significantly smaller than that of the other observed methods. This is a consequence of the low number of iterations needed for convergence of SPG, compared with the DC algorithm (Table 1). Furthermore, the computational complexity of SPG is significantly lower than that of DC. The number of function evaluations for the DC objective function is O((N + 1)²), whereas for SPG that number is only O(N).
Table 1. Total number of iterations and running time for different optimization methods when used for defuzzification of the images presented in Fig. 1. time1 is without, and time2 is with the centroid term included.

Img.  Size    SA time1    SA time2   DC iter.  DC time1     SPG iter.  SPG time1  SPG time2
(1a)  45x24   8.05 min    5.18 min   142 283    9.53 min    7 578      22.45 s    36.01 s
(1b)  29x30   3.89 min    3.66 min   167 066   17.94 min    2 696       7.56 s    10.20 s
(1c)  35x38   3.83 min    8.27 min   207 080  114.56 min    5 095      18.20 s    16.15 s
(1d)  36x26   6.79 min    3.29 min   218 661   91.55 min    2 989       9.54 s     9.57 s
(1e)  45x25   3.34 min    5.02 min   164 156   42.05 min    5 166      15.67 s    21.82 s
(2a)  47x51   3.02 min    1.70 min   218 595  262.22 min    2 891      19.14 s    20.93 s
(2b)  22x36   2.53 min    2.85 min   175 333   50.48 min    3 308       9.96 s     8.85 s
(2c)  28x31   5.49 min    3.94 min   202 492   24.17 min    2 987       9.53 s     9.03 s
(2d)  28x43   4.00 min    2.92 min   196 468   83.76 min    3 076       9.98 s    11.35 s
(2e)  44x28   2.54 min    4.91 min   202 759   83.14 min    3 896      13.35 s    14.87 s
(3a)  34x29   6.71 min    6.93 min   204 109   66.99 min    3 182       9.10 s    11.84 s
(3b)  29x32   4.11 min    5.62 min   211 607   82.36 min    3 517      13.01 s    11.95 s
(3c)  32x41   5.45 min    4.70 min   206 414   69.75 min    2 761      12.00 s    11.25 s
(3d)  24x24  11.53 min    8.39 min   107 925    1.47 min    2 534       5.17 s     6.37 s
(3e)  32x32   7.07 min    7.02 min    54 077    1.40 min    4 389      12.06 s    12.68 s
Regarding the flexibility of the methods with respect to adding new features and constraints in the optimization, seen as a rather important evaluation criterion, SPG shows much more appealing behavior than DC. Whereas the objective function of the DC based algorithm has to be expressed as a difference of convex functions, which is not a trivial task in practice and is very sensitive to any changes of the function, the SPG objective function is much easier to adjust to new features, since the only requirement on the SPG objective function is its smoothness.

3D Example. For 3D applications, the search space of the defuzzification problem is significantly larger than for 2D applications. The SPG based algorithm for defuzzification of 2D fuzzy sets is straightforwardly generalized to the 3D case. The selected features included in the feature distance are membership, volume, and surface area. We present an example defuzzification of a fuzzy segmented volume (32 × 32 × 32) which is a part of a CT image of a bone implant, Fig. 4. The parameter settings for the SPG method are the same as in the 2D case, except for μ_Δ and ε in χ_ε(x), which are set to 10^{-6} and 10^{-8}, respectively. The DC based method was not applied in this example because of the excessive memory requirements imposed by its implementation; the DC algorithm uses a matrix of size (N + 1) × (N + 1), where N is the number of voxels. The SPG algorithm processes only the vector representations of the image, which is a minimal memory requirement for a deterministic algorithm. The obtained distances for the Optimal α-cut, SA, and SPG based algorithms are 0.03807, 0.03757 and 0.03760, respectively. The defuzzifications obtained by the SPG and SA algorithms are very similar.
Fig. 4. Left: Four slices through the image volume of a bone implant. Right: 3D rendering of the defuzzification obtained by the SPG based algorithm.
5 Conclusions
The SPG optimization algorithm is successfully adjusted and applied to defuzzification by feature distance minimization. The main advantages of the suggested method, in addition to its deterministic nature, are fast convergence, low memory requirements and low computational complexity, and its flexibility regarding the inclusion of additional features and constraints in optimization. The experiments, presented in Section 4, have confirmed that the SPG based algorithm is better suited for the defuzzification task than the previously used DC algorithm. Future work will be related to inclusion of topological constraints, high resolution reconstruction, and overall improvement of the optimization performance. Applications in 3D will be in focus.
References
1. Birgin, E.G., Martínez, J.M., Raydan, M.: Algorithm 813: SPG - Software for convex-constrained optimization. ACM Transactions on Mathematical Software 27, 340–349 (2001)
2. Birgin, E.G., Martínez, J.M., Raydan, M.: Nonmonotone spectral projected gradient methods on convex sets. SIAM J. on Optimization 10, 1196–1211 (2000)
3. Giannessi, F., Niccolucci, F.: Connections between nonlinear and integer programming problems. Symposia Mathematica, 161–176 (1976)
4. Lindblad, J., Lukić, T., Sladoje, N.: Defuzzification by Feature Distance Minimization Based on DC Programming. In: Proc. of 5th International Symposium on Image and Signal Processing and Analysis, Istanbul, Turkey, pp. 373–379 (2007)
5. Lindblad, J., Sladoje, N., Lukić, T.: Feature Based Defuzzification in Z² and Z³ Using a Scale Space Approach. In: Kuba, A., Nyúl, L.G., Palágyi, K. (eds.) DGCI 2006. LNCS, vol. 4245, pp. 378–389. Springer, Heidelberg (2006)
6. Lindblad, J., Sladoje, N.: Feature Based Defuzzification at Increased Spatial Resolution. In: Reulke, R., Eckardt, U., Flach, B., Knauer, U., Polthier, K. (eds.) IWCIA 2006. LNCS, vol. 4040, pp. 131–143. Springer, Heidelberg (2006)
7. Rosenfeld, A., Haber, S.: The perimeter of a fuzzy subset. Pattern Recognition 18, 125–130 (1985)
Learning Visual Compound Models from Parallel Image-Text Datasets Jan Moringen1 , Sven Wachsmuth1 , Sven Dickinson2 , and Suzanne Stevenson2 1
Bielefeld University {jmoringe,swachsmu}@techfak.uni-bielefeld.de 2 University of Toronto {sven,suzanne}@cs.toronto.edu Abstract. In this paper, we propose a new approach to learn structured visual compound models from shape-based feature descriptions. We use captioned text in order to drive the process of grouping boundary fragments detected in an image. In the learning framework, we transfer several techniques from computational linguistics to the visual domain and build on previous work in image annotation. A statistical translation model is used in order to establish links between caption words and image elements. Then, compounds are iteratively built up by using a mutual information measure. Relations between compound elements are automatically extracted and increase the discriminability of the visual models. We show results on different synthetic and realistic datasets in order to validate our approach.
1 Introduction
Parallel datasets are an interesting source for learning models that offer semantic access to huge collections of unstructured data. Given a corpus of paired images and captions, we aim to learn structured visual appearance models that can be used, among other applications, to automatically annotate newly observed images. In related work, a number of approaches have already been reported. These can be differentiated by the kind of image representation used, the kind of labelling expected, and the word-image association model applied. Barnard et al. [1,2] start from a blob representation of the image, which is provided by a general image segmentation algorithm, and construct a vocabulary of visual words from the color and texture features extracted for each blob. They have explored different modelling approaches. First, they apply a generative hierarchical co-occurrence model proposed by Hofmann [3] that treats blobs and words as conditionally independent given a common topic. Thus, the visual correspondent of a word is not explicitly localized in an image. A second approach applies a statistical translation model (Model 1) of Brown et al. [4] which includes a set of alignment variables that directly link words to single blobs in the image. A third approach, followed by Blei & Jordan, combines both approaches using latent Dirichlet allocation (LDA) [1]. Our approach strongly builds on the translation model approach but extends it towards compounds of shape-based visual features.
Further work by Berg et al. [5] finds correspondences of image faces and caption names, but needs an initial set of unique assignments in order to drive the model construction process. Carneiro et al. [6] start from bags of localized image features and learn a Gaussian mixture model (GMM) for each semantic class which is provided by a weak labelling. They do not need any previous segmentation, but rely on the assumption that each caption word relates to the image. In contrast to the structureless bag-of-features model, Crandall & Huttenlocher [7] refer to the problem of learning part-based spatial object models from weakly labelled datasets. However, they assume a single label for each image that refers to the depicted compound object. Opelt et al. [8] propose a shape-based compound model that builds on class-discriminative boundary fragments. Each boundary fragment is defined relative to the object centroid. This provides a straightforward method for dealing with the object's spatial layout, but implies further constraints on training data because, besides the label, a bounding box needs to be specified for each image instance. The work described in this paper is more tightly related to Jamieson et al. [9]. There, visual compound models are represented as collections of localized SIFT features. A visual vocabulary is learned on a separate data set. Initial feature-word correspondences are established by using a translation model similar to Barnard et al. [1]. Compound models are generated in an iterative search process that optimizes the translation model. Jamieson et al. [10] further improve the compound model by adding spatial relations of localized SIFT features, using pairs of features for initialization, and applying a more efficient correspondence model rather than a full-blown translation model. As impressive as these results are, the approach circumvents certain problems occurring for more general object categories. Many categories need to be characterized by their global shape rather than by local gradient statistics. This includes the treatment of segmentation issues. In the following, we will describe our approach to learning structured shape models from image-caption pairs. It builds on some ideas already discussed by Wachsmuth et al. [11]. Again, initial correspondences are established by applying a statistical translation model. For learning visual compound models we apply the framework of Melamed [12], which is able to recombine recurring subparts. Several new ideas are introduced in order to apply the general framework to shape-based representations. Boundary fragments are used for generating a visual vocabulary, the translation model is used in order to optimize detection thresholds of visual words, and compound models are extended by adding spatial relations between boundary fragments. The approach is evaluated on two datasets of image-caption pairs with different characteristics and shows its applicability for annotation tasks.
2 Image Representation
As mentioned before, we represent images by a set of localized visual words which are learned over shape-based features extracted from training images. Following Opelt et al. [8], we extract so-called boundary fragments that are
connected components of region boundaries. In order to gain a more global measure of boundary pixels, we overlay several region segmentations using the EDISON system described by [13]. The accumulated boundary pixels are thresholded to create edge images, from which boundary fragments are extracted by concatenating edge pixels starting from randomly chosen seed pixels. However, in order to only select highly discriminative features, fragments are rejected if they (i) consist of too few pixels, (ii) consist of too many pixels, (iii) are not sufficiently curved. Boundary fragments lend themselves well to fault-tolerant shape recognition using chamfer matching as described by Borgefors [14]. In our approach, we exploit this property for two purposes: (i) using chamfer matching, we locate previously built features in unknown images as described in [14]; and (ii) we apply chamfer matching as a distance metric between boundary fragments:

d_symm(f_1, f_2) := d_frag(f_1, f_2) + d_frag(f_2, f_1),  where  d_frag(f_1, f_2) := min_{T∈S} d_edge(f_1, T, I_{f_2}) .   (1)
S is a discrete set of transformations¹ that is applied to boundary fragment f_1 when overlaying it over an image during chamfer matching, I_{f_2} is a bitmap representation of the fragment f_2, and d_edge is the edge distance as described in [14]. In order to build up the visual vocabulary, we use a clustering approach on boundary fragments.

Definition 1. A cluster is a triple (F, r, h) consisting of a set F = {f_1, ..., f_n} of boundary fragments, a representative boundary fragment r ∈ F with minimal distance d_symm to all other fragments in F, and a threshold h ∈ R+.

Following Opelt et al. [8] we apply an agglomerative clustering technique, because the fact that boundary fragments do not form a vector space rules out methods such as k-means. The distance function² between two clusters l_1, l_2 can be derived from the distance between boundary fragments (Eq. 1) by

d_max(l_1, l_2) := max_{f_1∈F_1, f_2∈F_2} d_symm(f_1, f_2),  where l_k = (F_k, r_k, h_k), k ∈ {1, 2} .   (2)
As it requires a large number of expensive computations of the edge distance, d_max cannot be used to cluster huge amounts of boundary fragments. To remedy this problem, we take the following two-step approach:
1. boundary fragments are agglomeratively clustered into classes of roughly the same overall shape using a computationally cheap distance function that compares variances along principal directions;
2. these classes of boundary fragments are further refined according to the distance function d_max by using agglomerative clustering on each subset.
¹ Here we only allow translations.
² Actually, a whole family of distance functions can be obtained by using aggregation functions like minimum or average, instead of maximum.
Fig. 1. Left: A cluster of boundary fragments. The highlighted boundary fragment represents the cluster. Right: Training images from which the two leftmost boundary fragments of the cluster have been extracted.
Besides building classes of similar fragments (see Fig. 1 for an example) for the translation model, each cluster also provides a detector for its corresponding class in unseen images. The detector exploits the representative boundary fragment as well as the threshold value h and is explained in more detail in Sec. 5.
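For illustration, the refinement step of the two-step clustering can be sketched as a complete-linkage agglomeration on a precomputed d_symm matrix; the threshold below plays the role of h, and the quadratic candidate search is acceptable because the step is only applied to the small pre-clustered subsets (names and interface are ours).

    import numpy as np

    def agglomerate(D, threshold):
        # D[i, j] holds d_symm between fragments i and j (Eq. 1); clusters are merged
        # while their complete-linkage distance d_max (Eq. 2) stays below the threshold.
        clusters = [[i] for i in range(D.shape[0])]
        while len(clusters) > 1:
            best, best_d = None, np.inf
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = max(D[i, j] for i in clusters[a] for j in clusters[b])
                    if d < best_d:
                        best, best_d = (a, b), d
            if best_d > threshold:
                break
            a, b = best
            clusters[a].extend(clusters[b])
            del clusters[b]
        return clusters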
3 Translation Model
We use a simple statistical machine translation model of the sort presented by Brown et al. [4] as Model 1. This model exploits co-occurrences of visual words c = {c_1, ..., c_M} and image caption words w = {w_1, ..., w_L} in our training bitext³ to establish initial correspondences between them:

P(c | w) = P(M) Σ_a Π_{j=1}^{M} P(a_j | L) P(c_j | a_j, w_{a_j}) ,   (3)
where a_j ∈ {0, ..., L}, 1 ≤ j ≤ M, are the alignment variables for the visual words c_j, and a_j = 0 is an assignment to the empty word. Once its parameters have been estimated, the model can be used to find the most likely translation:

ŵ = argmax_w P(w | c) = argmax_w P(c | w) P(w)   (4)
when presented with image features c. (Eq. 4) generates the annotations ŵ. The parameters of the translation model can be estimated from parallel data using an EM-style algorithm described by Brown et al. [4]. It simultaneously establishes explicit correspondences between instances of visual and caption words and estimates translation probabilities t(c | w) for pairs of visual and textual words.
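As an illustration of such an EM-style estimation, the following sketch re-estimates translation probabilities t(c | w) from expected alignment counts in the spirit of Model 1; it omits the empty word and the length terms of Eq. (3) for brevity, and the bitext is assumed to be given as a list of (visual words, caption words) pairs.

    from collections import defaultdict

    def train_model1(bitext, n_iter=10):
        t = defaultdict(lambda: 1.0)                          # uniform (unnormalized) start
        for _ in range(n_iter):
            count, total = defaultdict(float), defaultdict(float)
            for c_words, w_words in bitext:                   # E-step: expected alignment counts
                for c in c_words:
                    norm = sum(t[(c, w)] for w in w_words)
                    for w in w_words:
                        frac = t[(c, w)] / norm
                        count[(c, w)] += frac
                        total[w] += frac
            t = defaultdict(float, {cw: count[cw] / total[cw[1]] for cw in count})   # M-step
        return t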
4 Compound Features
Despite good results for text translation tasks, the translation model described in Section 3 does not perform well when applied to boundary fragments and
caption words. There are several reasons for this problem: (i) individual boundary fragments are not discriminative enough to produce a reasonably low rate of false positive matches, and (ii) the translation model has difficulty in relating relatively many image features to relatively few caption words. As suggested by Wachsmuth et al. [11], Jamieson et al. [9] and others, this problem can be solved by grouping individual image features into larger compound objects. This approach has the additional benefit that such compounds can be augmented with descriptions of the spatial layouts of their component features in training images.

Definition 2. A compound-object (or compound for short) is a pair (L, R) consisting of a set L = {l_1, . . . , l_n} of clusters and a set of sums of Gaussians R = {d_{l_1 l_2} | l_1, l_2 ∈ L, l_1 ≠ l_2} with d_{l_1 l_2} = Σ_{i=1}^{n_{l_1 l_2}} N(µ_i, σ) defining spatial relations between pairs of clusters.

n_{l_1 l_2} is determined by the procedure described at the end of this section. The vectors µ_i determine the expected spatial offsets while the variance σ is fixed. The process of building compounds is driven by co-occurrences of boundary fragments in the training data. The method we use was suggested by Melamed [12] and is originally intended to identify Non-Compositional Compounds (NCCs) like hot dog in bilingual texts (Melamed's approach was first applied to captioned datasets in [11]; however, results were only shown for synthetic symbol sets). Melamed's method is based around the observation that the mutual information

I(C, W) = Σ_{c ∈ C} Σ_{w ∈ W} P(c, w) log [ P(c, w) / (P(c) P(w)) ]   (5)
of a translation model increases when the individual words that comprise NCCs are fused and treated as single tokens. The joint distribution P(c, w) is determined by the alignments that the translation model produces for the visual and caption words of the training bitext. We use the following counting rule to obtain the joint distribution:

P(c, w) ∝ Σ_{s=1}^{S} min{#w^s, #c_w^s} · [ min{#c^s, #w^s} / #c_w^s ]   (6)

where the first factor counts the number of links c → w in a pair and the second factor acts as a weight. In (Eq. 6) the sum is over all pairs of the bitext, #c^s, #w^s and #c_w^s refer to the number of instances of the respective token in the pair (c^s, w^s), and c_w are visual words which translate to w. While in principle (Eq. 5), in conjunction with (Eq. 6), is sufficient to test any given NCC candidate using a trial translation model, the computational cost of this naive approach is not feasible for realistic numbers of NCC candidates. Therefore Melamed uses certain independence assumptions to derive a predictive value function V from (Eq. 5) that can be used to estimate the contributions of individual NCC candidates to the performance of the translation model without actually computing the mutual information:
V(C, W) = Σ_{c ∈ C} V_W(c),  where  V_W(c) = P(c, w_c) log [ P(c, w_c) / (P(c) P(w_c)) ]   (7)
Even though V allows us to test numerous NCC candidates c_1 c_2 in parallel by independently computing their contributions ΔV̂_{c_1 c_2}, it is still too computationally expensive for practical applications. Fortunately, further independence assumptions allow us to estimate the change of V caused by an NCC candidate without even introducing a trial translation model:

ΔV̂_{c_1 c_2} = max{V_W(c_1 | s(c_1, c_2)), V_W(c_2 | s(c_2, c_1))} + V_W(c_1 | ¬s(c_1, c_2)) + V_W(c_2 | ¬s(c_2, c_1)) − V_W(c_1) − V_W(c_2)

The predicate s evaluates to true if an instance of the second argument is present in the image from which the instance of the first argument originates. The process of identifying compounds is very similar to Melamed's original method and consists of several iterations of steps 2–8 of the following algorithm (a sketch of this loop follows below):

1. create empty lists of validated and rejected compounds
2. use the bitext to train a base translation model
3. produce candidate compounds for all pairs (c_1, c_2) that are not on the list of rejected compounds and co-occur in at least two images of the training bitext, and compute ΔV̂_{c_1 c_2} (in the first iteration, c_1 and c_2 are simply clusters; in later iterations, however, c_1 and c_2 can be compound objects themselves)
4. sort the candidate list according to ΔV̂_{c_1 c_2} and discard candidates if (i) ΔV̂_{c_1 c_2} ≤ 0, or (ii) c_1 or c_2 is part of a candidate that is closer to the top of the list
5. train a test translation model in which components of candidate compounds are fused into single objects
6. discard and put on the list of rejected compounds all candidates for which ΔV_{c_1 c_2} ≤ 0
7. permanently fuse the remaining candidate compounds into single objects
8. go to step 2 until no more candidates are validated or the maximum number of iterations is reached

After compounds have been identified, information about typical spatial relations of the components of the compounds is added. This is done by finding images in which boundary fragments of different clusters of a compound are present. The relative positions of the boundary fragments then give rise to spatial relations. Relative positions of boundary fragments in training images are shown in Fig. 2. Finally, optimal values for the individual detection thresholds of all clusters are computed using the translation model and the training images. As each compound possesses a word as its most likely translation, it is known in which training images a given compound should be detected. Therefore the detection threshold h can be adjusted to ensure that clusters are detected in positive but not in negative training images. In order to avoid overfitting, distorted versions of the training images are used in this process.
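The following sketch illustrates the iteration of steps 1–8. The routines train_model, delta_v_hat, and fuse are placeholders for the quantities defined above; their concrete implementations are not reproduced here, and the candidate enumeration by pairwise co-occurrence counting is a simplification.

```python
# Sketch of the compound-identification loop (steps 1-8). train_model,
# delta_v_hat and fuse are placeholder functions for the quantities above.
from itertools import combinations

def candidate_pairs(bitext, rejected, min_cooccurrence=2):
    counts = {}
    for visual_words, _ in bitext:
        for pair in combinations(sorted(set(visual_words)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return [p for p, n in counts.items()
            if n >= min_cooccurrence and p not in rejected]

def identify_compounds(bitext, train_model, delta_v_hat, fuse, max_iterations=10):
    rejected, validated = set(), []                                  # step 1
    for _ in range(max_iterations):
        model = train_model(bitext)                                  # step 2
        scored = [(delta_v_hat(model, bitext, c1, c2), c1, c2)
                  for c1, c2 in candidate_pairs(bitext, rejected)]   # step 3
        scored.sort(key=lambda t: t[0], reverse=True)                # step 4
        kept, used = [], set()
        for score, c1, c2 in scored:
            if score <= 0 or c1 in used or c2 in used:
                continue
            kept.append((c1, c2))
            used.update((c1, c2))
        if not kept:
            break
        trial = train_model(fuse(bitext, kept))                      # step 5
        good = [(c1, c2) for c1, c2 in kept
                if delta_v_hat(trial, bitext, c1, c2) > 0]           # step 6
        rejected.update(set(kept) - set(good))
        bitext = fuse(bitext, good)                                  # step 7
        validated.extend(good)                                       # step 8: repeat
    return validated, bitext
```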
Fig. 2. Relative positions of boundary fragments. The left image shows the extraction of a spatial relation from a training image, the right image visualizes a compound generated from the furniture dataset. The different fragments of each component cluster are shown.
5 Automatic Image Annotation
The most obvious application of the correspondence data created by the presented process (for a more extended list of applications of multimodal correspondence models, see Duygulu et al. [2]) is automatic image annotation, i.e., assigning suitable caption words to previously unseen and uncaptioned images. In our approach, automatic annotation is done by detecting compounds in the presented images and assigning the most likely translation words of the detected compounds as caption words. However, since detecting boundary fragments always produces localized results, it is also possible to relate image regions to caption words, therefore allowing region naming as an extension of automatic annotation. When presented with an unknown image, in principle all boundary fragments could be matched against the image using chamfer matching, which would lead to zero or more matching locations for each boundary fragment. However, this approach is not feasible because of the large number of chamfer matching operations necessary. Instead, we only match boundary fragments previously chosen to represent clusters in order to quickly rule out clusters of boundary fragments that probably do not match the image. The whole annotation procedure works as follows (a sketch is given after the list):

1. obtain an edge image from the input image
2. find matching boundary fragments using the following two-step approach:
   (a) for each cluster (F, r, h): match the representative boundary fragment r against the image; discard the cluster when the edge distance is above a global coarse threshold H
   (b) match all boundary fragments F of the remaining clusters against the image; store the locations of all matches for which the edge distance is below the cluster-specific tight threshold h; discard clusters that do not produce at least one match
3. for all compounds for which all component clusters have been detected in the image in at least one position: check spatial relations for all possible combinations of detected cluster instances (each of which could comprise a compound instance); store the combination that fulfills the spatial relations best; discard the compound if such a combination does not exist
4. output the most likely translation words of the remaining compounds as caption
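A minimal sketch of this annotation procedure is given below. The chamfer-matching routines (edge_distance, match_positions) and the spatial-relation check are passed in as functions, and the cluster/compound containers with representative, fragments, threshold, clusters, and name attributes are assumed data structures, not the authors' actual interface.

```python
# Sketch of the annotation procedure (steps 1-4 above). Matching and spatial
# checks are supplied as functions; data containers are assumed.
def annotate_image(edge_image, clusters, compounds, translation,
                   edge_distance, match_positions, best_spatial_combination,
                   coarse_threshold):
    detections = {}                          # cluster name -> list of match positions
    for cluster in clusters:                 # step 2: two-step cluster detection
        # step 2(a): coarse rejection via the representative fragment r
        if edge_distance(cluster.representative, edge_image) > coarse_threshold:
            continue
        # step 2(b): tight, cluster-specific threshold h on all fragments
        hits = [pos for fragment in cluster.fragments
                for pos, dist in match_positions(fragment, edge_image)
                if dist < cluster.threshold]
        if hits:
            detections[cluster.name] = hits
    caption = []                             # steps 3 and 4
    for compound in compounds:
        if not all(c in detections for c in compound.clusters):
            continue
        if best_spatial_combination(compound, detections) is not None:
            caption.append(translation[compound.name])
    return caption
```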
6 Results
In order to evaluate the annotation performance, we used our method to automatically annotate training images and held-out test images from two datasets. As both datasets contained full annotations, it was possible to compare original and generated caption words. As in the evaluations of comparable methods, we use precision and recall to quantify annotation performance. The first dataset we evaluated was obtained from the product catalogue of a large European supplier of furniture by extracting product images and head nouns of the related product descriptions. The resulting dataset consists of images of single pieces of furniture with simple backgrounds along with sets of several caption words that mostly refer to the depicted objects. The 525 image-caption pairs of the dataset were divided into 300 training pairs and 225 test pairs. The dataset features some challenging aspects, namely large within-category shape variations, different words for similarly shaped pieces of furniture, and a number of categories that occur very rarely. Fig. 3 shows three examples of automatic annotations. The precision and recall values for automatic annotations of the training and test subsets of the furniture dataset are shown in Fig. 4. The low precision values for most object categories were caused by clusters of shapes (like legs, see Fig. 1) that occur in a wide variety of object categories.

The second dataset used in the evaluation was synthetically generated to allow fine-grained control over shape variation and image clutter. As the first dataset turned out to be quite challenging, the generated dataset was tuned to be simpler in some regards. A probabilistic grammar was developed to construct simple hierarchically structured objects while allowing variations in structure and shape. We distinguished five different object categories, some of which shared common elementary shapes. For the training, suitable caption words were emitted along with an adjustable number of random words. All words were chosen from a pool of 27 words. The generated pairs were divided into 300 training and 200 test pairs. For test images, which were used in the annotation task, random shapes were added to emulate background clutter.
Fig. 3. Results of automatic annotation for furniture dataset. Generated annotation words are listed on the right side of the respective test image, correct annotation words appear in bold print.
Fig. 4. Precision vs. recall for the furniture dataset. Each curve shows precision and recall for a single concept (bench, dining, glass-door, table, armchair). Left: training subset. Right: test subset.
Fig. 5. Results of automatic annotation for synthetic dataset. Generated annotation words are listed on the right side of the respective test image, correct annotation words appear in bold print.
Fig. 6. Precision vs. recall for synthetic dataset. Left: training subset. Right: test subset.
As the synthetically generated objects were treated as regular images, the algorithm also needed to deal with occlusion as well as over- and under-segmentation issues. As the results show (see Fig. 6), our method produced acceptable precision and recall when applied to the training portion of the synthetic dataset. For the test portion, however, the achievable precision was considerably lower. Fig. 5 shows two examples of the generated annotations.
7 Conclusion
In this paper, we presented on-going work on learning visual compounds from image-caption pairs. We focused on shape-based feature descriptions that provide a suitable characterization for many categories of man-made objects. The approach is able to deal with slight occlusion as well as moderate over- and under-segmentation by using boundary fragments for building a basic visual vocabulary. The grouping of fragments into compounds is driven by caption words using a translation model and a mutual information measure. We have shown that the approach works on a set of synthetically generated images with captions. First tests on a realistic dataset from a furniture catalogue showed the generation of some promising compounds but also indicated several problems regarding the discriminability of boundary fragments and the sparseness of this kind of dataset.
References 1. Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D., Jordan, M.: Matching Words and Pictures. Journal of Machine Learning Research 3, 1107–1135 (2003) 2. Duygulu, P., Barnard, K., de Freitas, J.F.G., Forsyth, D.A.: Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary. In: Proc. of 7th Europ. Conf. on Computer Vision, Copenhagen, vol. 4, pp. 97–112 (2002) 3. Hofmann, T.: Learning and representing topic. A hierarchical mixture model for word occurrence in document databases. In: Proc. Workshop on learning from text and the web, CMU (1998) 4. Brown, P.F., Pietra, S.A.D., Mercer, R.L., Pietra, V.J.D.: The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics 19(2), 263–311 (1993) 5. Berg, T., Berg, A., Edwards, J., Maire, M., White, R., Teh, Y.W., Learned-Miller, E., Forsyth, D.: Names and faces in the news. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2004) 6. Carneiro, G., Chan, A.B., Moreno, P.J., Vasconcelos, N.: Supervised Learning of Semantic Classes for Image Annotation and Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(3), 394–410 (2007) 7. Crandall, D., Huttenlocher, D.: Weakly supervised learning of part-based spatial models for visual object recognition. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954. Springer, Heidelberg (2006) 8. Opelt, A., Pinz, A., Zisserman, A.: A Boundary-Fragment-Model for Object Detection. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3954. Springer, Heidelberg (2006) 9. Jamieson, M., Dickinson, S., Stevenson, S., Wachsmuth, S.: Using Language to Drive the Perceptual Grouping of Local Image Features. In: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), New York, vol. 2, pp. 2102–2109 (2006) 10. Jamieson, M., Fazly, A., Dickinson, S., Stevenson, S., Wachsmuth, S.: Learning Structured Appearance Models from Captioned Images of Cluttered Scenes. In: Proc. of the Int. Conf. on Computer Vision (ICCV), Rio de Janeiro (October 2007)
11. Wachsmuth, S., Stevenson, S., Dickinson, S.: Towards a Framework for Learning Structured Shape Models from Text-Annotated Images. In: Proceedings of the HLT-NAACL 2003 workshop on Learning word meaning from non-linguistic data, Edmonton, vol. 6, pp. 22–29 (2003) 12. Melamed, I.D.: Automatic Discovery of Non-Compositional Compounds in Parallel Data. In: Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, Providence, pp. 97–108 (1997) 13. Christoudias, C.M., Georgescu, B., Meer, P.: Synergism in Low Level Vision. In: 16th Int. Conf. on Pattern Recognition, Quebec City, vol. 4, pp. 150–155 (2002) 14. Borgefors, G.: Hierarchical chamfer matching: A parametric edge matching algorithm. Trans. on Pattern Analysis and Machine Intelligence 10, 849–865 (1988)
Measuring Plant Root Growth

Matthias Mühlich1, Daniel Truhn1, Kerstin Nagel2, Achim Walter2, Hanno Scharr2, and Til Aach1

1 Lehrstuhl für Bildverarbeitung, RWTH Aachen University, Germany
2 ICG-3, Forschungszentrum Jülich, Germany
{muehlich,aach,truhn}@lfb.rwth-aachen.de, {k.nagel,h.scharr,a.walter}@fz-juelich.de
Abstract. Plants represent a significant part of our natural environment and their detailed study becomes more and more interesting (also economically) because of their importance for mankind as a source of food and/or energy. One of the most characteristic biological processes in plant life is their growth; it serves as a good indicator for the compatibility of a plant with varying environmental conditions. For the analysis of such influences, measuring the growth of plant roots is a widely used technique. Unfortunately, traditional schemes of measuring roots with a ruler or scanner are extremely time-consuming, hardly reproducible, and do not allow growth to be followed over time. Image analysis can provide a solution. In this paper, we describe a complete image analysis system, from imaging issues to the extraction of various biological features for a subsequent statistical analysis. This opens a variety of new possibilities for biologists, including many economic applications in the study of agricultural crops.
1 Introduction
The analysis of plant growth is important both for fundamental biological research and for applied research concerning special agricultural crops. In contrast to higher developed animals, plants consist of repeating structures like leaves or roots. This construction principle allows plants to grow during their whole life. The study of plant growth enables a statistical inference on how well plants can grow under (or adapt to) certain environmental conditions. This statement also holds for very young plants – which is highly beneficial for experimental evaluation. In some cases root growth reacts even more rapidly and more pronouncedly to changing environmental factors than leaf growth [1,2]. Therefore, the study of plant root growth has moved into the focus of biologists in recent years. Unfortunately, traditional measuring methods are very time-consuming. For plants grown in soil or sand, it means digging them out, washing them carefully, and then measuring. Obviously, this laborious measuring process influences further growth; undisturbed measuring over time is impossible. An interesting alternative is growing plants in petri dishes, see fig. 1. A (semi-)transparent nutrient gel
Fig. 1. Plants are grown in petri dishes filled with (half-)transparent nutrient gel. This allows imaging of root systems with transmitted light.
is filled in between two perspex panels which are a few millimeters apart. Then plant seeds are put on the top such that the roots grow down into the gel. Using this setup, it is possible to take images of the growing plant roots at certain intervals (one image every 30 minutes, over a period of two or three weeks, is common in our joint project between partners from biology and image analysis). Fig. 2 shows six images from such a time series. This setup also reduces the problem of measuring roots from a 3d problem to an 'almost 2d' one. While a third dimension of just 3 millimeters width is indeed negligible for length measurements, it unfortunately does not prevent roots from crossing each other in different layers; this happens quite frequently in real data, see fig. 1. The aim of our project was the extraction of tree-like structures for roots, augmented with several biologically interesting data values. These features include lengths, thicknesses, growth rates, branching angles, or spatial densities of branching points; extension to further data should be possible. Additionally, simple access to these features should be possible for a statistical analysis in order to make use of the primary strength of a computerized approach (in contrast to measurements by humans): the ability to measure hundreds or thousands of plants, thus allowing valid and significant statistical statements. This paper describes the build-up of a complete imaging system according to these requirements. Its organization reflects the various building blocks of the whole system; each block is discussed in a separate section. At the end, we will show some results for the total length of the root system, divided into primary
Fig. 2. Growth of plant roots: the same three plants are imaged at six different times
roots (i.e. roots starting at the origin), secondary roots (roots branching off from primary roots), and tertiary roots (which are branching off from secondary roots).
2 Imaging and Preprocessing
The first step which deserves careful consideration is the image acquisition itself. Imaging with visible light would disrupt the natural day and night cycle of plants. The solution is infra-red imaging, but a constant illumination of the whole image with acceptable effort is next to impossible. Fortunately, varying background illumination can be compensated for with standard image processing techniques. Fig. 3 shows our preprocessing scheme: the input image (a) is processed with a min/max nonlinear filter, resulting in image (b). This image is then smoothed (c). Rescaling the original image with the background brightness information from (c) then yields (d), the preprocessed image which we use as input to our algorithm (a sketch of this compensation is given at the end of this section). A few aspects which cannot be compensated properly with preprocessing are image artefacts due to condensing water drops or drying nutrient gel which
Fig. 3. Compensation of varying illumination. See text for explanation.
dissolves from the perspex wall. We mention these practical problems to illustrate that a high level of robustness is also required for our algorithm.
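A minimal sketch of the illumination compensation described above is given below, using a grey-value closing (max/min filtering) followed by Gaussian smoothing; the filter sizes are assumptions rather than the values used in the actual system.

```python
# Sketch of the illumination compensation of Fig. 3: estimate the slowly
# varying background brightness, then rescale the input by that estimate.
# Filter sizes are assumptions.
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter, gaussian_filter

def compensate_illumination(image, window=31, sigma=15.0):
    img = image.astype(np.float64)
    # (b) min/max nonlinear filtering removes the thin dark roots
    background = minimum_filter(maximum_filter(img, size=window), size=window)
    # (c) smoothing yields the background brightness estimate
    background = gaussian_filter(background, sigma=sigma)
    # (d) rescale the original image by the estimated background
    return img / np.maximum(background, 1e-6)
```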
3 Modelling Roots I: A Steerable Filter Approach
Plant roots, even of the same plant type, can look highly different. If one has to design a computer program for the analysis of roots, an apparently useful first step is therefore to identify their common characteristic structural elements. As the number or shape of bifurcations (or other point-like structures, from root tips to root crossings) is rather difficult to predict or model, the starting point for our algorithm is the detection of linear structures: the roots. Such structures can be found with a correlation-based approach. Let us postpone the treatment of branches and crossings to section 5 and discuss simple roots first; from an image analysis point of view, it is the detection of dark lines on a light background, and these lines appear in arbitrary orientations. Within the class of correlation-based detectors for rotated lines, one approach presented by Jacob and Unser in 2004 [3] can be considered state of the art. The presentation of this approach is the subject of this section. Let f(x) denote an image within which we seek a feature (here: a line). Furthermore, let us assume that this feature can be modeled as a template f_0(x). Then filtering the image with a filter h(x) = f_0(−x) yields an output image measuring how strongly the feature is present at each location x. This principle is known as matched filter [4]. Its application to the detection of a family of features is, in general, computationally inefficient, but for one special class of filters, namely rotated versions of some given template, the steerable filter approach introduced by Freeman and Adelson in [5] offers a convenient solution: by limiting the class of possible (unrotated) templates to those templates which can be represented in polar coordinates in the form

h(r, φ) = Σ_{p=−P}^{P} a_p(r) exp(ipφ) ,   (1)
one can represent any rotated version of h(r, φ) as a linear combination of ν base filters, where the minimum number for ν is given as the number of nonzero Fourier coefficients in (1). (Note that, in slight abuse of notation, we will always denote an image template as h, regardless of whether it is represented in Cartesian or polar coordinates.) Following the notation of Freeman & Adelson and others, let us define a rotation operator: h^α(r, φ) = h(r, φ − α). Different variants for designing steerable filters exist, but they all have in common that rotation can be expressed as a linear combination of a set of base templates:

h^α(r, φ) = Σ_{i=1}^{ν} w_i(α) h_i(r, φ) .   (2)
Here, hi denotes the set of base filters. In [3], Jacob and Unser applied this rotated matched filter approach to the detection of edges and lines in images.
Fig. 4. Line template within the class of steerable filters (rotated)
Fig. 5. Maximum filter response S(x) as a measure of ‘line strength’
They also showed that using directional derivatives of a 2d Gaussian function up to fourth order as base functions yields a very good orientation selectivity while still allowing an analytic solution for the optimal angle. Fig. 4 shows a contour plot for such a fourth-order line template. Applying the steerable filter approach to the input image then leads to two values for every image location x: the orientation angle α(x) which fits best and the strength of the correlation S(x), i.e.

S(x) = f(x) ∗ h^{α(x)}(x) .   (3)
Then S(x) can be interpreted as a measure for the probability of a root at location x. Fig. 5 shows the “line strength image” S(x). Based on S(x), it is possible to identify the origin of roots: high values of S(x) in a small horizontal strip below the upper border of the petri dish indicate the presence of a root origin and all roots in a certain neighborhood are grouped together to a single plant. These roots are called primary roots and followed downwards.
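For illustration, the sketch below approximates the computation of S(x) and α(x) by correlating the image with a small set of discretely rotated line templates and keeping the maximum response. This is a simplification: the actual system uses the analytic steerable-filter solution of Jacob and Unser, and the template definition here is an assumption.

```python
# Simplified stand-in for the steerable line detector: maximum correlation
# response over a discrete set of orientations.
import numpy as np
from scipy.ndimage import correlate, rotate

def line_template(size=15, sigma=2.0):
    """Dark-line-on-light-background template: negative 2nd derivative of a
    Gaussian across the line, constant along it."""
    y = np.arange(size) - size // 2
    profile = (y**2 / sigma**2 - 1.0) * np.exp(-y**2 / (2 * sigma**2))
    return np.tile(profile[:, None], (1, size))

def line_strength(image, n_angles=12):
    template = line_template()
    S = np.full(image.shape, -np.inf)
    alpha = np.zeros(image.shape)
    for k in range(n_angles):
        angle = 180.0 * k / n_angles
        h = rotate(template, angle, reshape=False)
        response = correlate(image.astype(np.float64), h)
        better = response > S
        S[better] = response[better]
        alpha[better] = np.deg2rad(angle)
    return S, alpha
```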
4 Modelling Roots II: Following Roots
The next image processing step is following roots from origin to root tip, passing through all branching points and potential crossings as straight as possible. Our algorithm is based on the image S(x) computed before (see fig. 5); for every pixel identified as root, the neighboring values of S(x) contribute a data term indicating the likelihood that the root continues this way. However, basing a root following algorithm only on the image data would perform poorly in the presence of noise, and the resulting root structures would look unnatural. We therefore also added a model term which simply states that roots usually grow straight. Instead of continuing roots in the direction which maximizes S(x) directly, we maximize a modified quantity

S'(x) = c · S(x)   (4)
which introduces a correction term c. This correction term is based on the current root orientation α and the new orientation α' which would be induced by continuing in the tested orientation. The more similar α and α' are, i.e. the better the assumption of straight continuation is matched, the higher the correction factor c should be. On the other hand, strong bending should be penalized with a low value of c. This requirement is met with the exponential term

c = exp( −(α − α')² / (2σ²) ) .   (5)

The choice of this function is motivated by a physical analogue: the bending of an elastic rod leads to the same mathematical representation. Just like a high bending energy is required for a strong bending, the data term in our input image must sufficiently support the hypothesis of a direction change of a root. If no acceptable continuation can be found, then we have to end the tracking of the root – we have found its tip. Fig. 6 shows the detection result for a primary
Fig. 6. Tracked primary root and 5-pixel distance lines on both sides
Fig. 7. Primary root (white) and potential branches indicated as small white dots 5 pixels away from branch
root. Additionally, two more lines are drawn left and right of the detected root. On these two lines, other roots are detected in order to analyze branches and crossings in the following step.
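A sketch of this greedy root-following step is given below; the one-pixel step length, the termination threshold, and the value of σ are assumptions.

```python
# Sketch of greedy root following: from the current pixel, move to the
# 8-neighbor that maximizes the bending-corrected line strength
# S'(x) = c * S(x); tracking stops when no candidate exceeds min_score.
import numpy as np

def follow_root(S, start, start_dir, sigma=0.3, min_score=0.1, max_steps=5000):
    path = [start]
    (y, x), direction = start, start_dir          # direction in radians
    for _ in range(max_steps):
        best, best_pos, best_dir = -np.inf, None, None
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dy == 0 and dx == 0:
                    continue
                ny, nx = y + dy, x + dx
                if not (0 <= ny < S.shape[0] and 0 <= nx < S.shape[1]):
                    continue
                new_dir = np.arctan2(dy, dx)
                bend = np.arctan2(np.sin(new_dir - direction),
                                  np.cos(new_dir - direction))   # wrap to [-pi, pi]
                score = np.exp(-bend ** 2 / (2 * sigma ** 2)) * S[ny, nx]  # Eqs. (4), (5)
                if score > best:
                    best, best_pos, best_dir = score, (ny, nx), new_dir
        if best < min_score:
            break                                  # no acceptable continuation: root tip
        (y, x), direction = best_pos, best_dir
        path.append((y, x))
    return path
```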
5 Modelling Roots III: Crossing and Branching
With the algorithm described so far, it is possible to follow primary roots from the origin to the tip. This is possible because branching never happens in the
way that a root tip splits in two and both new tips grow in directions which are entirely different from the old one. In image analysis, such a model would also be known as a Y junction. Instead, a branch is created at the side of an already existing root, thus leading to a T junction (with the stem of the letter T "growing out of the horizontal bar"). This biological background does not only allow an algorithm based on the following of a single root, it even calls for a recursive implementation of such a strategy. Note that this is completely different from other possible approaches to plant root measurement focusing on the detection of point features (branching points, crossings, root tips) as underlying structural elements. The existence of T junctions instead of Y junctions also constitutes a notable difference to medical data (blood vessels, bronchial tubes) which might appear similar at first glance.
Fig. 8. Algorithms for the following of roots (left) and finding of branches (right)
The branch detection is carried out by searching for roots 5 pixels beside the primary roots. For every found point (see fig. 7 for an example), we then try to find a path to the original root. If this is not possible, then the found point belongs to a parallel root, not a branch. Furthermore, we search only upwards from the tested point to the primary root (i.e. the test point 5 pixels away from the branch must lie below a potential branching point). All branches found in this way are denoted secondary roots in biology and trigger a recursive root following just like the initial root following for primary roots. The same can be continued for tertiary roots and so on. Special care has to be applied to crossings. A downward branching has to be excluded if there exists another upward branching on the opposite side with a similar intersection angle to the original root. In this case, it represents a crossing with a different root, not a branch. Fig. 8 illustrates this algorithm and the root following scheme from the previous section. Fig. 9 then shows the root detection result for a single input image.
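The recursive branch search can be sketched as follows. The root-pixel test, the upward-connectivity check, and the root-following routine are passed in as functions (see the sketch in the previous section); the crossing heuristic is omitted here for brevity, and the offsets and recursion depth are assumptions.

```python
# Sketch of the recursive branch search: probe points beside a tracked root,
# keep those that connect back to it, and follow them recursively.
import numpy as np

def find_branches(root_path, is_root_pixel, connects_upward, side_offset=5):
    """Probe points 5 pixels beside a tracked root; keep those that connect
    back (upwards) to it. Others are parallel roots or crossings."""
    candidates = []
    for i in range(1, len(root_path)):
        (y0, x0), (y1, x1) = root_path[i - 1], root_path[i]
        d = np.array([y1 - y0, x1 - x0], dtype=float)
        n = np.array([-d[1], d[0]]) / (np.linalg.norm(d) + 1e-9)   # side normal
        for sign in (+1, -1):
            probe = (int(round(y1 + sign * side_offset * n[0])),
                     int(round(x1 + sign * side_offset * n[1])))
            if is_root_pixel(probe) and connects_upward(probe, (y1, x1)):
                candidates.append(((y1, x1), probe))
    return candidates

def trace_recursively(root_path, order, follow, is_root_pixel, connects_upward,
                      max_order=3):
    """Follow branches of branches: primary -> secondary -> tertiary roots."""
    tree = {"path": root_path, "order": order, "children": []}
    if order >= max_order:
        return tree
    for branch_point, probe in find_branches(root_path, is_root_pixel, connects_upward):
        child = trace_recursively(follow(probe, branch_point), order + 1, follow,
                                  is_root_pixel, connects_upward, max_order)
        tree["children"].append(child)
    return tree
```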
6 Integrating Temporal Information
The basic algorithms shown in the previous sections are already sufficient to handle single images. In practical applications, however, it is sometimes possible to take multiple images of the same plant over time. How can we integrate the temporal information?
Fig. 9. Detected root system. Primary roots are shown in green, secondary roots in red, tertiary roots in blue.
Fig. 10. Growth of the complete root system (upper thick line) and the contribution of different root orders over time.
The first step for the evaluation of an image series is obviously a registration step; this becomes especially important if the petri dish is not fixed in front of the camera. In our application, a hundred individual petri dishes are mounted on a carousel-like structure which rotates the different dishes in front of the camera (in order to allow a meaningful statistical evaluation). In this situation, the compensation of a translational offset is always necessary. Nevertheless, we will not discuss registration here; standard textbook approaches will work reasonably well. The desire for an algorithm which is capable of handling both single images and image series (in a sequential update mode) requires a modification of the algorithm presented so far in a way such that it can additionally exploit information collected from previous images of the same plant. Most importantly, this can be used to disambiguate the crossing problem: which incoming root connects to which outgoing root? This problem is inherently difficult to solve in single images, but for image series, one can decide which root grew there first in 99.9 % of all cases. Time series processing offers a lot of advantages:
– If a root from the previous image was not found in the current image, then search again with a modified threshold. Roots cannot disappear.
– Misdetections can be eliminated if a root cannot be detected again, even with a lower threshold. This often applies to 'ghost branches': young branches which have only grown a few pixels are extremely hard to distinguish from noise.
– The tree structure (branchings and especially crossings) can be recovered in much better quality.
Incorporation of such concepts (mainly based on plausibility checks and/or heuristics) into our algorithm is possible in a straightforward way and enhances the detection quality considerably, especially for older (and thus much more complex) root systems. The analysis results for the image series shown at the beginning of the paper in fig. 2 are given in fig. 11.
Fig. 11. Detection result for the plants shown in fig. 2. An update approach which exploits the detection result from the previous image was used.
7 Results and Evaluation
In our program, we created a linked data structure consisting of 'nodes' and 'branches'. These elements are augmented with a number of biologically interesting data values. We implemented a data model which allows arbitrary values of three generic types to be added: floating point numeric values (e.g. root thickness, root length, orientation angle), integer values (e.g. branch order, node order), or positional data (x and y coordinates, e.g. the nearest root tip). For each type, basic statistical tools are implemented (e.g. histogram functions). This enables us to extract the relevant information from the large amount of data that we collect; easy maintenance and expandability were a key design issue for the software. As a very basic example, fig. 10 shows a graphical representation of the total root length versus time for a series of more than 1000 images of the same plant. No smoothing was applied; we directly obtain this nice monotonic graph.
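A possible sketch of such a typed, extensible node/branch data model is shown below; the attribute names are illustrative examples, not the software's actual interface.

```python
# Sketch of a typed, extensible node/branch data model; attribute names are
# illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import numpy as np

@dataclass
class Node:
    position: Tuple[float, float]                                   # x, y coordinates
    numeric: Dict[str, float] = field(default_factory=dict)         # e.g. thickness
    integer: Dict[str, int] = field(default_factory=dict)           # e.g. node order
    points: Dict[str, Tuple[float, float]] = field(default_factory=dict)  # e.g. nearest root tip

@dataclass
class Branch:
    nodes: List[Node] = field(default_factory=list)
    numeric: Dict[str, float] = field(default_factory=dict)         # e.g. length, angle
    integer: Dict[str, int] = field(default_factory=dict)           # e.g. branch order
    children: List["Branch"] = field(default_factory=list)

def histogram(branches: List[Branch], key: str, bins: int = 20):
    """Basic statistics over one numeric attribute of all branches."""
    values = [b.numeric[key] for b in branches if key in b.numeric]
    return np.histogram(values, bins=bins)
```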
8 Summary and Discussion
In this paper, we presented and discussed a system for the computerized analysis of plant root growth. In an application problem like this one, the progress does not lie in an exceptionally novel extension of one specific building block; it is more like an engineering task where the basic problem consists in making a lot of different building blocks work together properly. No single block is complicated by itself. A large part of the problem was decomposing a large and complex problem into small and easily digestible bits; we hope that this paper illustrates the basic ideas and reasoning behind our approach. The key element of our software is the extraction of a tree model for plant roots. We developed an algorithm which is applicable to both single images and image series. Once the tree model is found, a variety of biological data (thicknesses, lengths, angles, etc.) is computed with a number of rather simple textbook algorithms (a later refinement of specific measurement tools is of course possible, if it should turn out to be necessary). The progress of this approach becomes visible if we compare the huge amount of data which now becomes available with the traditional measuring of plant root growth by hand. We are confident that our new software will turn out to be an important tool for an increased understanding of plants and their interaction with environmental influences.
References
1. Nagel, K.A., Schurr, U., Walter, A.: Dynamics of root growth stimulation in Nicotiana tabacum in increasing light intensity. Plant Cell Environ, 1936–1945 (2006)
2. Walter, A., Nagel, K.A.: Root growth reacts rapidly and more pronounced than shoot growth towards increasing light intensity in tobacco seedlings. Plant Signaling & Behavior, 225–226 (2006)
3. Jacob, M., Unser, M.: Design of steerable filters for feature detection using canny-like criteria. IEEE Trans. PAMI 26(8), 1007–1019 (2004)
4. Therrien, C.W.: Decision, Estimation and Classification: Introduction to Pattern Recognition and Related Topics. John Wiley and Sons, Chichester (1989)
5. Freeman, W.T., Adelson, E.H.: The design and use of steerable filters. IEEE Trans. PAMI 13, 891–906 (1991)
A Local Discriminative Model for Background Subtraction

Adrian Ulges1 and Thomas M. Breuel1,2

1 Department of Computer Science, Technical University of Kaiserslautern
[email protected]
2 Image Understanding and Pattern Recognition Group, German Research Center for Artificial Intelligence (DFKI), Kaiserslautern
[email protected]
Abstract. Conventional background subtraction techniques that update a background model online have difficulties with correctly segmenting foreground objects if sudden brightness changes occur. Other methods that learn a global scene model offline suffer from projection errors. To overcome these problems, we present a different approach that is local and discriminative, i.e. for each pixel a classifier is trained to decide whether the pixel belongs to the background or foreground. Such a model requires significantly less tuning effort and shows a better robustness, as we will demonstrate in quantitative experiments on self-created and standard benchmarks. Finally, segmentation is improved significantly by integrating the probabilistic evidence provided by the local classifiers with a graph cut segmentation algorithm.
1 Introduction
Motion-based segmentation in static scenes is targeted at separating moving foreground objects from a static background given a fixed camera position and focal length. For this purpose, a number of background subtraction techniques exist that construct a model of the static scene background and label regions not fitting this model as foreground regions. Using these methods, background subtraction systems achieve fair segmentation results but do not react properly to sudden intensity changes due to camera gain control, light switches, shadows, weather conditions, etc. On the one hand, systems should adapt to such phenomena, while on the other hand they should reliably detect foreground objects. Our practical experience has been that an "online" adaptation as it is proposed by most approaches in the literature is error-prone, that the associated parameters (like feature weights or adaptation rates of the background model) are difficult to tune, and that robustness is hard to achieve. An alternative is to learn a background model during a separate learning phase in the absence of foreground objects (i.e., "offline"). In this way, scene properties like weather changes and different light sources can be modeled, as well as characteristics of the camera like gain correction and noise. While conventional
global methods following this strategy fail in the presence of pronounced foreground objects, we propose a local and discriminative model based on classifiers deciding whether a pixel belongs to the background or foreground. Our contributions are: (1) a background subtraction approach that is simple and – in contrast to conventional methods we tested before – does not require much tuning, (2) quantitative experiments demonstrating that the local discriminative approach performs competitively with hand-tuned state-of-the-art segmenters, and (3) it is shown how the probabilistic evidence given by local classifiers can be fused into an improved segmentation using a graph cut algorithm.
2 Related Work
The majority of background subtraction techniques proposed in the literature maintain a pixel-wise background model that is adapted online to illumination changes caused by varying weather conditions, camera gain control, light sources switched on and off, shadows, and background motion like waving trees (for reviews, see [9,6]). Starting with the work by Wren et al. [12], several ways have been proposed to model the distribution of pixel intensity x given the fact that the pixel belongs to the background, p(x|b). For this density, parametric approaches have been proposed using Gaussians [12] and mixtures of Gaussians [15] as well as non-parametric techniques like kernel densities [10,11]. p(x|b) is then used for segmentation by thresholding with the difference between background model and observation [18], or by integrating it in a Bayesian decision framework to compute P(b|x) (e.g., [16,17]). To adapt to changes of the environment, most systems perform updates of the background model online, i.e. while segmentation is running. To work robustly, these heuristic updates must adapt properly to sudden scene changes while at the same time detecting non-background regions, which makes them error-prone and difficult to tune. To overcome this problem, it has been suggested to use features that are robust to illumination changes, like the gradient direction [8], shadow models [13], and color co-occurrence [7]. Our experience has been that problems with online updates can be overcome to some degree using proper features and careful tuning, but the system is not truly robust. Once segmentation fails, background models tend to be corrupted by foreground regions. Also, scene parts covered by the object cannot be updated properly. To better adapt to scene changes, an alternative is to learn a model of the scene offline, i.e. during a separate learning phase in the absence of foreground objects. The most popular method based on this idea is "Eigenbackgrounds" [14], which views images as vectors of pixel values and performs a global Principal Component Analysis (PCA) decomposition on image level. While we adapt the strategy of learning a scene model offline, it will be shown that a global approach like PCA fails in the presence of large foreground objects. Instead, our model is based on local discriminative patch classifiers. Other approaches have followed the idea of local classifiers before. Recently, Culibrk et al. [3] have proposed a related approach, in which frequently occurring
Fig. 1. The proposed setup: Each pixel is associated with a receptive field from which simple features like pixel intensity or gradient strength are extracted. Those form the input of a neural network classifier, which estimates the background posterior P (b|x). These values are finally fed to a graph cut algorithm that determines the segmentation result.
patterns of pixel features are stored in a Radial Basis Function Network (RBF). The replacement and update of weights, however, are done using heuristic rules similar to the ones for standard online approaches [15]. In contrast to this work, the classifiers proposed here are trained in a discriminative manner by minimizing classification error. Closest to our work are systems that truly train patch classifiers based on local information: Criminisi et al. proposed to integrate multiple cues such as motion and color in a Conditional Random Field (CRF) framework using treebased classifiers [4]. Grabner et al. [5] use an online boosting framework. The key problem with such approaches is the lack of non-background samples for training. While Criminisi’s prototype is trained in a fully supervised manner, i.e. ground truth sequences with full segmentations are demanded, Grabner’s framework assumes uniform distributions for foreground features. In contrast to both, we propose a third alternative, namely to synthesize virtual foreground samples for training.
3 Our Approach
Most background subtraction systems construct a generative model p(x|b) for the value of a pixel x given the fact that it belongs to the background. This can then be integrated in a Bayesian framework to obtain the posterior P(b|x). In contrast to this, we follow a discriminative strategy, i.e. P(b|x) is directly estimated using a local classifier for each pixel x. Since training is carried out in the absence of foreground objects, "virtual" non-background samples are obtained by synthesizing them. Finally, it is demonstrated how the local posteriors P(b|x) can be fused into a global segmentation using a smoothness prior and a graph cut algorithm for optimization.

System Setup: To estimate the background probability P(b|x), a neural network classifier is used for each pixel x, which is associated with a squared receptive field in the surrounding of x (also referred to as a patch). From this receptive field, simple features are extracted, which can be (1) pixel gray values or color
Fig. 2. Left: Two images of a static scene from the Office dataset with a highlighted patch (red). Right: Background and foreground samples of the highlighted patch for training the associated classifier. Virtual foreground samples have been synthesized by covering background samples with random synthetic texture.
values, (2) gradient strength, or (3) the chroma components in YUV space, a description that has proven robust to illumination changes. These features form the input of a Multilayer Perceptron (MLP) with 1 hidden layer, which estimates P(b|x) [2]. These posteriors can be used for a local, pixel-wise decision by thresholding them, or they can be integrated with a Gibbs prior to obtain smooth region boundaries. See Figure 1 for an illustration of the proposed system architecture.

Training: Most conventional background subtraction systems only maintain a model for the background, which is "trained" by heuristic update rules in a fully unsupervised manner. Our hypothesis is that a better distinction between foreground and background can be achieved if both classes are modelled in a discriminative fashion. While background samples can be acquired during an offline learning phase, for the foreground uniform distributions have been assumed [5], or systems have been trained on fully segmented images [4]. We propose a third alternative, namely to synthesize virtual training samples for the foreground. In this paper, these training samples are constructed by simply covering the receptive field of a pixel partially or fully with non-background texture (for this, a random color was chosen and Gaussian noise with standard deviation 10 was added). Both background and foreground samples of a training patch are illustrated in Figure 2. Given such samples, the MLP training is carried out using plain backpropagation [2], whereas a fixed learning rate (0.2) and number of epochs (50) are used.

Integration with Graph Cut: Since the scores given by the local MLP classifiers have a probabilistic interpretation as P(b|x), they can be integrated with a smoothness prior as described in the following. For background subtraction, a similar formulation has been proposed before in [4] and combined with so-called "background attenuation" (e.g., [16]). For an image with pixels x_1, .., x_n, the Boolean background variables b_1, .., b_n are determined according to a MAP formulation:

(b̂_1, .., b̂_n) = argmax_{b_1,..,b_n} P(b_1, .., b_n | x_1, .., x_n) ∝ argmax_{b_1,..,b_n} [ Π_i P(b_i | x_i) / P(b_i) ] · P(b_1, .., b_n)
Fig. 3. A sample result on the Office dataset. The Eigenbackgrounds method projects the input image (a) to a too dark background image (b), and thresholding the difference (c) between the observation (a) and background image (b) obviously leads to segmentation errors. In contrast, the posterior P (b|x) returned by our approach (d) gives a good segmentation (e) if integrated with a graph cut technique.
Further, the prior P(b_1, .., b_n) is assumed to be a Gibbs distribution: if we denote all pairs of 8-connected neighbor pixels with C, we define

P(b_1, .., b_n) ∝ Π_{(x_i, x_j) ∈ C} e^{−U(i,j)}   (1)

with clique potentials U(i,j) = 0 (if b_i = b_j) and U(i,j) = ν (else), i.e. the length of the foreground object boundary is penalized with a constant bias. Note that it follows that P(b_1) = ... = P(b_n) = 1/2. Taking the logarithm, the MAP solution thus minimizes an energy function consisting of a "data fit" term and a "smoothness" term:

(b̂_1, .., b̂_n) = argmin_{b_1,..,b_n} [ −Σ_i log P(b_i | x_i) ]_{data fit} + [ ν · Σ_{(x_i, x_j) ∈ C} (1 − δ(b_i, b_j)) ]_{smoothness}   (2)
Via the “smoothness parameter” ν, both constraints are weighted relative to each other. Minimization is carried out efficiently using a graph cut algorithm [1]. The effect of this additional smoothness constraint is illustrated in Figure 3: While the local posteriors give a noisy result with some local misclassifications (d), these are overruled by the smoothness constraint (e).
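The training of one per-pixel classifier with synthesized foreground samples can be sketched as follows. scikit-learn's MLPClassifier stands in for the authors' own MLP, patches are assumed to be an array of shape (N, h, w, 3), and the partial covering scheme (covering a random number of rows from the top) is a simplification of the procedure described above.

```python
# Sketch of per-pixel classifier training with synthesized foreground samples.
# Assumptions: color patches of shape (N, h, w, 3); sklearn's MLPClassifier
# replaces the authors' MLP; covering happens row-wise from the top.
import numpy as np
from sklearn.neural_network import MLPClassifier

def synthesize_foreground(background_patches, rng):
    """Cover background patches partially or fully with random texture
    (random color plus Gaussian noise with standard deviation 10)."""
    fake = background_patches.astype(np.float64).copy()
    for patch in fake:
        h = patch.shape[0]
        cover_rows = rng.integers(h // 2, h + 1)       # partial or full cover
        color = rng.uniform(0, 255, size=patch.shape[-1])
        noise = rng.normal(0.0, 10.0, size=patch[:cover_rows].shape)
        patch[:cover_rows] = np.clip(color + noise, 0, 255)
    return fake

def train_pixel_classifier(background_patches, n_foreground=300, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(background_patches), size=n_foreground)
    foreground = synthesize_foreground(background_patches[idx], rng)
    X = np.vstack([background_patches.reshape(len(background_patches), -1),
                   foreground.reshape(len(foreground), -1)])
    y = np.concatenate([np.ones(len(background_patches)),   # 1 = background
                        np.zeros(len(foreground))])
    clf = MLPClassifier(hidden_layer_sizes=(5,), learning_rate_init=0.2,
                        max_iter=50, solver='sgd')
    return clf.fit(X, y)
```

At test time, the resulting posteriors P(b|x) (via predict_proba) can either be thresholded pixel-wise or plugged into the data term of Eq. (2) and minimized with any max-flow/min-cut solver.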
4 Experiments

In the following experiments, we first analyze the influence of several internal parameters of our system on a self-created dataset. Second, comparisons with other state-of-the-art methods on both self-generated and public benchmarks are presented that demonstrate the competitive performance of the local discriminative approach.

4.1 Experiments on Office Dataset
The following evaluations have been done on the self-created Office dataset (publicly available at http://www.iupr.org/downloads/data). This benchmark consists of 497 training images taken over several days that show a
Fig. 4. Experimenting with internal parameters of the system: a saturation of the performance is found for about (a) 300 virtual foreground samples and (b) 5 hidden units
complex scene in an office environment with several light sources, shadows, and slight background motion of a plant (see Figure 2 for two sample pictures). For testing, objects were presented to the camera in 6 short sequences taken at about 1 fps. These test sequences represent difficult situations with shadows cast by the objects, abrupt light switches, and objects in varying distance to the camera. For 90 randomly sampled frames, ground truth segmentations were provided manually.

A Sample Result: We start with a first qualitative result that compares the local discriminative approach with the global Eigenbackgrounds approach (quantitative results will be given later in Figure 5). Given a test frame with an object held close to the camera (Figure 3), the Eigenbackgrounds approach projects the input image to a low-dimensional Eigenspace and thresholds the distance between the observation and the projection. Due to the influence of the large foreground object, the input (a) is projected to a too dark background image (b), and thresholding with the resulting distance (c) obviously gives a poor segmentation. In contrast, since the proposed pixel classifiers are local, the foreground region has negligible influence on the rest of the image, and the resulting posterior P(b|x) (d) allows for a proper segmentation (e) when integrated with a graph cut as outlined in Section 3.

Number of Synthesized Foreground Samples: A core idea of the proposed approach is to synthesize samples for the non-background class. In this experiment, we tested how many such samples are needed per patch. The system was trained and tested on the data described above, whereas images were scaled to a width of 80 pixels. Note that the local discriminative approach scales linearly with the number of image and patch pixels; since the proposed approach proved robust to downscaling in our experiments, it was decided to use subsampled images such that the system runs in near real time at 5 fps on a 2.4 GHz Opteron processor. As features, chroma components in YUV color space were used with a patch width of 7 and an MLP with 5 hidden units.
method / features used | Li's Quality | EER
PCA | 52.04 | 18.22
ONLINE | 63.23 | —
gray | 63.56 | 17.61
grad | 67.20 | 6.99
uv | 71.41 | 8.09
grad+uv | 69.76 | 6.25
uv + graph cut | 83.87 | —

Fig. 5. Left: When using graph cut segmentation with a proper smoothness parameter ν, the segmentation quality (Li's measure) can be improved significantly from 71 % (best pixel-wise thresholding) to 84 %. Right: Overall results for baseline methods and our system with different feature types on the Office dataset.
514
A. Ulges and T.M. Breuel CamouflageWav.Trees Campus
Fountain WaterSurf.Curtain
Fig. 6. Sample results on public datasets as reported in [17,18]. Row 1 shows test frames, Row 2 variance images that illustrate where motion occurs. Ground truth masks are given in Row 3, and the bottom row shows the posterior given by our approach.
Comparison with other Methods: Finally, we compare the proposed framework with two other methods, namely Eigenbackgrounds [14] (“PCA”) and a thoroughly tuned implementation of a state-of-the-art online background subtraction approach (“ONLINE”). This method weighs two background models, namely a shadow model [13] and histograms of gradient directions [8], and integrates them with a background attenuation and graph cut optimization as in Sun’s Background Cut [16]. Quantitative results for both methods as well as for the proposed approach tested with several features are illustrated in Figure 5 (the same setup was used as in the experiments before). Our approach gives a statistically significant improvement compared to PCA, which is revealed by a paired t-test and corresponds to the observations made in Figure 3. The online algorithm performs comparably to our system when using gray pixel values (“gray”) – it has particular problems with light switches and strong gain control (note that test sequences were taken at only 1 fps., which simulates sudden light changes and stresses the online system). The local discriminative approach does even better when using more robust features like gradient strength (“grad”), chroma information (“uv”), or both (“grad+uv”). Finally, the top performance is achieved when integrating the posteriors with a graph cut optimization. 4.2
Experiments on Other Datasets
In this experiment, the proposed framework is tested on publicly available benchmark sequences from the literature representing difficult situations with motion in the background, like waving trees, flickering monitor sequences, and water surfaces. We demonstrate that our approach is capable of learning an adequate background model in such situations and show competitive results.
Table 1. Quantitative results on public datasets from [18,17]

Error rate (from [18]):
  sequence   our approach   best result in [18]
  Camoufl.   5.41           9.54
  W.Trees    5.31           5.02

Li's quality (from [17]):
  sequence   our approach   Li's system   MOG
  Campus     60.01          68.3          48.0
  Fountain   54.87          67.4          66.3
  W.Surf.    72.66          85.1          53.6
  Curtain    40.51          91.1          44.5
From [18] we use the "WavingTrees" and "Camouflage" sequences, and from [17] "Campus", "Fountain", "WaterSurface", and "Curtain". Other sequences are available, but do not satisfy our need for a background-only training phase showing all lighting conditions that occur in testing. The system was run using a pixel-wise thresholding of the posterior, i.e. without graph cut integration. "uv" features were used with a patch radius of 3, 5 hidden units, and 300 training samples. Images were scaled to a resolution of width 160 (for "Campus" and "Fountain", which show very small objects) or 80 (for all others). 200 training frames at the beginning of each sequence were used. Some sample results are illustrated in Figure 6, and quantitative performance measures are given in Table 1, where we follow the error measures from the corresponding publications. For [18], this is the rate of pixel errors (we use the equal error rate threshold). Our approach performs comparably ("WavingTrees") or significantly better ("Camouflage") than the best results reported in [18]. For the sequences from [17], we choose the threshold that optimizes Li's quality. Here, the local discriminative approach is outperformed by Li's system for all sequences. Compared to a standard mixture-of-Gaussians (MOG) system, it performs better for "Campus" and "WaterSurface", comparably on "Curtain", and worse for "Fountain". An in-depth analysis revealed that for the "Fountain" sequence, our system is sensitive to a small camera shake during the test phase of the sequence.
5
Discussion
In this paper, we have presented a background subtraction approach based on training local discriminative classifiers that assign pixels to foreground and background. A competitive performance on self-created and standard benchmarks has been demonstrated. Further, our experience has been that the system is more robust and easier to tune than previously implemented online algorithms. While segmentation with our prototype can be done in near real-time (for an image width of 80 pixels, an unoptimized implementation runs at 5 fps), training takes significantly longer (about 1 sec. per pixel) and is thus not really suitable for online updates as in the system of [5]. Note, however, that our prototype can be parallelized by treating pixels separately. An interesting open question is the influence of more realistic samples for the "foreground model" of the system. So far, a very simple sampling strategy has
been used (random colors with additive noise). It might be interesting to test whether our approach can do better with samples from real images, or even from foreground objects that are known to occur in the scene².
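A minimal sketch of this simple sampling strategy (random colors with additive noise) could look as follows; the patch size, noise level, and function name are illustrative choices, not taken from the paper.

```python
import numpy as np

def sample_foreground_patches(n_samples=300, patch_size=7, noise_sigma=0.05):
    """Synthesize foreground training samples as constant random colors plus noise."""
    colors = np.random.rand(n_samples, 1, 1, 3)                 # one random RGB color per sample
    patches = np.tile(colors, (1, patch_size, patch_size, 1))   # constant-colored patches
    patches += noise_sigma * np.random.randn(*patches.shape)    # additive Gaussian noise
    return np.clip(patches, 0.0, 1.0)
```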
References
1. Boykov, Y., Kolmogorov, V.: An Experimental Comparison of Min-Cut/Max-Flow Algorithms for Energy Minimization in Vision. IEEE PAMI 26(9), 1124–1137 (2004)
2. Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. Wiley Interscience Publications, Chichester (2000)
3. Culibrk, D., Marques, O., Socek, D., Kalva, H., Furht, B.: Neural Network Approach to Background Modeling for Video Object Segmentation. IEEE Trans. Neur. Netw. 18(6), 1614–1627 (2007)
4. Yin, P., Criminisi, A., Winn, J.M., Essa, I.A.: Tree-based Classifiers for Bilayer Video Segmentation. In: CVPR 2007, pp. 1–8 (2007)
5. Grabner, H., Roth, P.M., Grabner, M., Bischof, H.: Autonomous Learning a Robust Background Model for Change Detection. In: Int. Workshop Perf. Eval. Track. Surv., pp. 39–46 (2006)
6. Cheung, S., Kamath, C.: Robust Background Subtraction with Foreground Validation for Urban Traffic Video. In: EURASIP Appl. Sign. Proc., pp. 2330–2340 (2005)
7. Li, L., Huang, W., Gu, I., Tian, Q.: Foreground Object Detection in Changing Background Based on Color Co-Occurrence Statistics. In: WACV 2002, pp. 269–274 (2002)
8. Noriega, P., Bernier, O.: Real Time Illumination Invariant Background Subtraction Using Local Kernel Histograms. In: BMVC 2006, vol. III, pp. 979–988 (2006)
9. Piccardi, M.: Background Subtraction Techniques: A Review. IEEE SMC/ICSMC 4, 3099–3104 (2004)
10. Han, B., Comaniciu, D., Davis, L.: Sequential Kernel Density Approximation through Mode Propagation: Applications to Background Modeling. In: ACCV 2004 (2004)
11. Elgammal, A.M., Harwood, D., Davis, L.S.: Non-parametric Model for Background Subtraction. In: ECCV 2000, pp. 751–767 (2000)
12. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: Real-Time Tracking of the Human Body. IEEE PAMI 19(7), 780–785 (1997)
13. Horprasert, T., Harwood, D., Davis, L.S.: A Statistical Approach for Realtime Robust Background Subtraction and Shadow Detection. In: IEEE Framerate Workshop, pp. 1–19 (1999)
14. Oliver, N., Rosario, B., Pentland, A.: A Bayesian Computer Vision System for Modelling Human Interactions. IEEE PAMI 22(8), 831–843 (2000)
15. Stauffer, C., Grimson, E.: Learning Patterns of Activity Using Real-Time Tracking. IEEE PAMI 22(8), 747–757 (2000)
16. Sun, J., Zhang, W., Tang, X., Shum, H.-Y.: Background Cut. In: ECCV 2006, pp. 628–641 (2006)
17. Li, L., Huang, W., Gu, I., Tian, Q.: Statistical Modelling of Complex Backgrounds for Foreground Object Detection. IEEE PAMI 13(11), 75–104 (2004)
18. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: Principles and Practice of Background Maintenance. In: ICCV 1999, pp. 255–261 (1999)
² This work was supported in part by the Stiftung Rheinland-Pfalz für Innovation, project InViRe (961-386261/791).
Perspective Shape from Shading with Non-Lambertian Reflectance Oliver Vogel, Michael Breuß, and Joachim Weickert Mathematical Image Analysis Group, Faculty of Mathematics and Computer Science, Building E1.1, Saarland University, 66041 Saarbrücken, Germany {vogel,breuss,weickert}@mia.uni-saarland.de
Abstract. In this work, we extend the applicability of perspective Shape from Shading to images incorporating non-Lambertian surfaces. To this end, we derive a new model inspired by the perspective model for Lambertian surfaces recently studied by Prados et al. and the Phong reflection model incorporating ambient, diffuse and specular components. Besides the detailed description of the modeling process, we propose an efficient and stable semi-implicit numerical realisation of the resulting Hamilton-Jacobi equation. Numerical experiments on both synthetic and simple real-world images show the benefits of our new model: While computational times stay modest, a large qualitative gain can be achieved.
1
Introduction
Given a single greyscale image, the shape-from-shading (SFS) problem amounts to computing the 3-D depth of depicted objects. It is a classical problem in computer vision with many potential applications, see [1,2,3] and the references therein for an overview. In early SFS models, the camera is performing an orthographic projection of the scene of interest, which is assumed to be composed of Lambertian surfaces. Let us especially honour the pioneering works of Horn [1,4] who was also the first to model the problem via the use of a partial differential equation (PDE). For orthographic SFS models, there have been some attempts to extend the range of applicability to non-Lambertian surfaces [5,6]. However, orthographic models usually suffer from ill-posedness, especially in the form of the so-called convex-concave ambiguities [3]. Moreover, the orthographic camera model yields reconstruction results not of convincing quality in most situations [3]. The problem of ill-posedness can be dealt with successfully by using a perspective camera model, see e.g. [3,7,8,9,10]. As our model incorporates for a special choice of parameters the perspective model for Lambertian surfaces widely studied in the literature, let us give some more details on these. Assuming a pinhole camera model and a point light source at the optical center, the perspective SFS model for Lambertian surfaces amounts to the Hamilton-Jacobi equation
(I f² / u) √( (f² |∇u|² + (∇u · x)²) / Q² + u² ) = 1 / u²,   (1)
where x ∈ R² is in the image domain Ω, |·| denotes the Euclidean norm, and
– u := u(x) is the sought depth map,
– I := I(x) is the normalised brightness of the given grey-value image,
– f is the focal length relating the optical center and its retinal plane,
– Q := Q(x) := f / √(|x|² + f²).
As already indicated, perspective models such as (1) yield superior depth maps compared to orthographic models. However, up to now there does not exist a reliable and easy-to-use PDE-based model incorporating non-Lambertian surfaces. An interesting, yet not very sophisticated attempt to incorporate other surface models into perspective SFS is given in [11].
Our contribution. In our paper, we introduce a detailed model of a new PDE for perspective SFS incorporating non-Lambertian surfaces. It is clearly stated at which point the Phong reflection model we use for this purpose is taken into account. A second objective is to give an efficient and easy-to-code algorithm. We realise this aim by using the algorithm of Vogel et al. [12] for perspective SFS and Lambertian surfaces as a basis. As our experiments show, we achieve a considerable gain concerning the quality of computed depth maps, and we obtain reasonable results even for simple real-world images.
Organisation of the paper. In Section 2, we present the modelling process in detail. In Section 3, a thorough description of the numerical scheme is given. Following a discussion of experiments in Section 4, the paper is finished by concluding remarks.
2
Description of the Model
Consider the surface S representing the object or scene of interest depicted in a given image, parameterised by using the function S : Ω̄ → R³, Ω ⊂ R² [12], with
S(x) = ( f u(x) / √(|x|² + f²) ) (x, −f)^T ∈ R² × R.   (2)
As the two columns of the Jacobian J[S(x)] are tangent vectors to S at the point S(x), their cross-product gives a normal vector n(x) at S(x) by
n(x) = ( f ∇u(x) − (f u(x) / (|x|² + f²)) x , ∇u(x) · x + (f u(x) / (|x|² + f²)) f )^T.   (3)
Up to this point, the model is largely identical to the one for perspective SFS [3]. However, we assume that the surface properties can be described by a Phong reflection model [13,14], and thus we introduce the brightness equation
I(x) = ka Ia + Σ_{light sources} (1/r²) ( kd Id cos φ + ks Is (cos θ)^α ).   (4)
Let us comment on equation (4) in some detail: Ia, Id, and Is are the intensities of the ambient, diffuse, and specular components of the reflected light, respectively. Accordingly, the constants ka, kd, and ks with ka + kd + ks ≤ 1 denote the ratio of ambient, diffuse, and specular reflection. The light attenuation factor 1/r², where r is the distance between light source and surface, is taken into account. The intensity of diffusely reflected light in each direction is proportional to the cosine of the angle φ between surface normal and light source direction. The amount of specular light reflected towards the viewer is proportional to (cos θ)^α, where θ is the angle between the ideal (mirror) reflection direction of the incoming light and the viewer direction, and α is a constant modeling the roughness of the material. For α → ∞ this describes an ideal mirror reflection.
We restrict the model to a single light source at the optical center of the camera [15]. As in this case the view direction and light source direction are the same, we obtain θ = 2φ. Moreover we restrict the model to scalar valued intensities as we only consider greyscale images. Then equation (4) becomes
I(x) = ka Ia + (1/r²) ( kd (N · L) Id + ks (2(N · L)² − 1)^α Is ),   (5)
with N = n(x)/|n(x)| being the unit normal vector at the surface at point x, and L is the unit light vector which points towards the optical center of the camera. The scalar products N · L arise since cos φ = N · L and cos θ = cos 2φ = 2(cos φ)² − 1 = 2(N · L)² − 1. As the normalised light source direction is given by
L(S(x)) = (|x|² + f²)^{−1/2} (−x, f)^T,   (6)
we can evaluate the scalar products yielding
N · L(S(x)) = f u(x) ( |n(x)| √(|x|² + f²) )^{−1}.   (7)
By use of r = f u(x), we obtain from (5)-(7)
I(x) = ka Ia + (1/(f² u(x)²)) ( kd (u(x) Q(x) / |n(x)|) Id + ks (2 u(x)² Q(x)² / |n(x)|² − 1)^α Is ),   (8)
where
|n(x)| = √( f² |∇u(x)|² + (∇u(x) · x)² + u(x)² Q(x)² )   (9)
and with Q(x)² = f² / (|x|² + f²).
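To make the brightness model concrete, the following NumPy sketch evaluates equations (8) and (9) for a given depth map u; the pixel grid, the central-difference gradient, and the clamping of the specular term (cf. the remark after equation (12)) are implementation choices of this illustration, not of the original paper.

```python
import numpy as np

def phong_brightness(u, f, ka, kd, ks, Ia, Id, Is, alpha):
    """Evaluate the Phong-based brightness equation (8) for a depth map u."""
    h, w = u.shape
    x1, x2 = np.meshgrid(np.arange(w) - w / 2.0, np.arange(h) - h / 2.0)
    Q2 = f**2 / (x1**2 + x2**2 + f**2)                      # Q(x)^2 = f^2 / (|x|^2 + f^2)
    du2, du1 = np.gradient(u)                               # central differences for the gradient of u
    n_norm = np.sqrt(f**2 * (du1**2 + du2**2)
                     + (du1 * x1 + du2 * x2)**2
                     + u**2 * Q2)                           # |n(x)| as in equation (9)
    NdotL = u * np.sqrt(Q2) / n_norm                        # N.L = u Q / |n|
    specular = np.clip(2.0 * NdotL**2 - 1.0, 0.0, None)**alpha
    return ka * Ia + (kd * NdotL * Id + ks * specular * Is) / (f**2 * u**2)
```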
The PDE (8) is of Hamilton-Jacobi type. We now rewrite (8), yielding
(I(x) − ka Ia) f² |n(x)| / (Q(x) u(x)) − kd Id / u(x)² − ( |n(x)| ks Is / (u(x)³ Q(x)) ) ( 2 u(x)² Q(x)² / |n(x)|² − 1 )^α = 0.   (10)
We assume – as usual when dealing with this problem – that the surface S is visible, so that u is strictly positive. Then we use the change of variables v = ln(u), which especially implies
|n(x)| / u(x) = √( f² |∇v(x)|² + (∇v(x) · x)² + Q(x)² ),   (11)
since ∇v(x) = (1/u(x)) ∇u(x). Neglecting the notational dependence on x ∈ R², we finally obtain the PDE
J W − (W ks Is / Q) e^{−2v} (2Q²/W² − 1)^α − kd Id e^{−2v} = 0,   (12)
where
J := J(x) = (I(x) − ka Ia) f² / Q(x),   (13)
W := W(x) = √( f² |∇v|² + (∇v · x)² + Q(x)² ).   (14)
Note that in the Phong model, the cosine in the specular term is usually replaced by zero if cos θ = 2Q(x)²/W(x)² − 1 < 0.
3
Numerical Method
In order to solve the arising boundary value problem (12), we employ the method of artificial time. As it is well known, this is one of the most successful strategies to deal with static Hamilton-Jacobi equations, see e.g. [16] and the references therein. This means, we introduce a pseudo-time variable t, writing v := v(x, t), and we iterate in this pseudo-time until a steady state defined by v_t = 0 is attained. Thus, for v = v(x, t), we opt to solve the time-dependent PDE
v_t = J W − ks Is e^{−2v} (W/Q) (2Q²/W² − 1)^α − kd Id e^{−2v},   (15)
where the first two terms of the right-hand side are abbreviated as A.
Discretisation. We employ the following standard notation: v^n_{i,j} denotes the approximation of v(ih₁, jh₂, nτ), i and j are the coordinates of the pixel (i, j) in x₁- and x₂-direction, h₁ and h₂ are the corresponding mesh widths, and τ is a time step size which needs to be chosen automatically or by the user. We approximate the time derivative v_t by the Euler forward method, i.e.,
v_t(x, t)|_{(x,t)=(ih₁, jh₂, nτ)} ≈ ( v^{n+1}_{i,j} − v^n_{i,j} ) / τ.   (16)
Let us now consider the spatial terms. The discretisation of I(x) and Q(x) is simple as these terms can be evaluated pointwise at all pixels (i, j). As a building block for the discretisation of spatial derivatives incorporated in W, we use the stable upwind-type discretisation of Rouy and Tourin [16]:
v_{x1}(ih₁, jh₂, ·) ≈ h₁^{−1} max(0, v_{i+1,j} − v_{i,j}, v_{i−1,j} − v_{i,j}),   (17)
v_{x2}(ih₁, jh₂, ·) ≈ h₂^{−1} max(0, v_{i,j+1} − v_{i,j}, v_{i,j−1} − v_{i,j}).   (18)
Note that in (17), (18) the time level is not yet specified, as we wish to employ a Gauß-Seidel-type idea, compare e.g. [12]. To this end, notice that at pixel (i, j)
the data vi,j+1 , vi−1,j , vi,j , vi+1,j , vi,j−1 are used. Iterating pixel-wise over the computational grid, ascending in i and descending in j, we incorporate already updated values into the scheme. This yields the formulae
v_{x1}(x, t)|_{(x,t)=(ih₁, jh₂, nτ)} ≈ h₁^{−1} max(0, v^n_{i+1,j} − v^n_{i,j}, v^{n+1}_{i−1,j} − v^n_{i,j}),   (19)
v_{x2}(x, t)|_{(x,t)=(ih₁, jh₂, nτ)} ≈ h₂^{−1} max(0, v^{n+1}_{i,j+1} − v^n_{i,j}, v^n_{i,j−1} − v^n_{i,j}).   (20)
Let us emphasize that the data v^{n+1}_{i,j+1} and v^{n+1}_{i−1,j} in (19), (20) are already computed via the described procedure, so that they are fixed and one can safely use them for the computation of v^{n+1}_{i,j}.
Being a factor before W, we discretise ks Is e^{−2v} at pixel (i, j) using known data, as is the case in the remainder of this term, i.e., setting ks Is e^{−2v^n_{i,j}}. Finally, let us consider the source term kd Id e^{−2v}. Source terms like this enforce the use of very small time step sizes when evaluated explicitly, leading to very long computational times. Thus, we discretise it implicitly by
kd Id e^{−2v(x,t)}|_{(x,t)=(ih₁, jh₂, nτ)} ≈ kd Id e^{−2v^{n+1}_{i,j}}.   (21)
Letting Â denote the discretised version of term A from (15), we obtain by (16) the update formula
v^{n+1}_{i,j} = v^n_{i,j} + τ Â − τ kd Id e^{−2v^{n+1}_{i,j}},   (22)
which has to be solved for v^{n+1}_{i,j}. We treat the nonlinear equation (22) by use of the classical one-dimensional Newton method, which converges in practice in 3-4 iterations.
To summarize our algorithm, we (i) initialize v := −0.5 log(I f²), (ii) iterate using equation (22) until the pixel-wise error in v is smaller than some predefined small constant in every pixel. Note that it is possible to do a rigorous stability analysis, yielding a reliable estimate for τ useful for computations, compare [12] for an example of such a procedure.
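The two computational building blocks of the scheme, the Rouy-Tourin upwind differences (17)-(20) and the pointwise Newton solve of the implicit update (22), can be sketched as follows; array indexing, boundary handling, and the number of Newton steps are illustrative, and the sign convention of A follows the reconstruction of (15) and (22) given above.

```python
import numpy as np

def upwind_grad(v, i, j):
    """Rouy-Tourin upwind differences (17)-(18) at pixel (i, j), with h1 = h2 = 1
    and replicated (Neumann) boundary values."""
    h, w = v.shape
    val = lambda a, b: v[min(max(a, 0), h - 1), min(max(b, 0), w - 1)]
    vx1 = max(0.0, val(i + 1, j) - v[i, j], val(i - 1, j) - v[i, j])
    vx2 = max(0.0, val(i, j + 1) - v[i, j], val(i, j - 1) - v[i, j])
    return vx1, vx2

def implicit_update(v_old, A_hat, kd, Id, tau, newton_steps=4):
    """Solve the pointwise nonlinear update (22),
    v_new = v_old + tau * A_hat - tau * kd * Id * exp(-2 * v_new),
    for v_new by a few steps of the one-dimensional Newton method."""
    v_new = v_old
    for _ in range(newton_steps):
        g = v_new + tau * kd * Id * np.exp(-2.0 * v_new) - (v_old + tau * A_hat)
        dg = 1.0 - 2.0 * tau * kd * Id * np.exp(-2.0 * v_new)
        v_new = v_new - g / dg
    return v_new
```

In a full implementation these two pieces would be called inside the Gauss-Seidel-type sweep described above (ascending in i, descending in j), starting from the initialisation v := −0.5 log(I f²).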
4
Experiments
Let us now present experiments on both synthetic and real images. In all these experiments we use the method developed above. Let us note that the method was compared to other recent methods in the field in the context of perspective SFS with Lambertian reflectance [12]. In the latter context, the proposed algorithm has turned out to be by far the most efficient numerical scheme. Using synthetic test scenes, the ground truth is known, so that we can compare the reconstructions with it and get a quantitative measure of the error. Reality, however, is different: recent SFS methods consider only Lambertian surfaces, while in reality such surfaces do not exist. Although the Phong model is only an approximation to the physical reality, surfaces rendered with it appear much more realistic. In order to evaluate the benefit of the Phong-based
Fig. 1. The vase: Ground truth surface and rendered image
approach for real-world images, we only consider synthetic surfaces rendered with the Phong model and ks > 0. We use Neumann boundary conditions in all experiments.
The Vase Experiment. In our first experiment, we use a classic SFS test scene: the vase [2]. Figure 1 shows the ground truth together with a rendered version. It has been rendered by ray-tracing the surface, which complies with our model, with f = 110, h1 = h2 = 1, Id = Is = 3000, kd = 0.6, ks = 0.4, ka = 0, 128 × 128 pixels, and α = 4.
Fig. 2. Reconstructions of the vase. Left: Lambertian model, Right: Phong model.
In Fig. 2 we find reconstructions using our new model and the Lambertian model from [3,12] employing the known parameters. The reconstruction with the Lambertian model looks distorted. The whole surface is pulled towards the light source, located at (0, 0, 0). The effect is most prominent at specular highlights. At the boundary of vase and background, we observe a strong smoothing, which is normal when using a Lambertian model. The shape of the reconstruction using our new model is quite close to the original shape. It has the right distance from the camera and the correct size. The boundary of the vase is smooth, too, but the transition is clearly sharper than the one obtained using the Lambertian model. In Tab. 1, we compare the reconstruction with the ground truth. We observe a considerable improvement using our model. The computational times are about one minute, which is just a bit slower than the Lambertian algorithm. Note that we omitted the relative error in v since v = log u.

Table 1. Average L1 errors for the vase experiment
                        Phong model   Lambertian model
  Error in u            0.042         0.088
  Error in v            0.049         0.101
  Relative error in u   4.62%         9.66%

Let us now evaluate the performance of our method on noisy input images. Figure 3 shows the vase input image distorted by Gaussian noise with standard deviations σ = 5 and σ = 10. This is quite a lot of noise for shape from shading applications.
Fig. 3. The vase: Noisy input images. Left: σ = 5, Right: σ = 10.
Figure 4 shows the reconstructions of both noisy images using our new method. In both cases, the general shapes are preserved quite well; only the fine structure of the surface looks a bit noisy. Despite the strong noise in the input data, we get good reconstructions. The error levels in Tab. 2 support this impression. For the image distorted with only a little noise, we get almost the same reconstruction error as in the noiseless experiment. With the second image, the error is a bit higher, but still almost the same. The method performs very stably under noise. For experiments with noise on Lambertian surfaces compare [3,12].
The Cup Experiment. In Fig. 5, we see a photograph of a plastic cup, taken with a normal digital camera with flashlight. In this image, several model assumptions are violated:
– We do not have the same surface throughout the image.
– The flashlight is not located at the optical center, creating shadows in the image.
– Some parts of the scene reflect on the surface of the cup, especially on the left.
Fig. 4. The vase: Reconstructions of noisy data. Left: σ = 5, Right: σ = 10.

Table 2. Average L1 errors for the vase experiment on noisy images
                        Noise, σ = 5   Noise, σ = 10
  Error in u            0.047          0.050
  Error in v            0.051          0.054
  Relative error in u   5.33%          5.49%
Fig. 5. Photograph of a plastic cup
– The image was taken in a room that was darkened, but certainly was not pitch black, and there was some ambient light reflected from the walls of the room.
Figure 6 shows reconstructions using Phong and Lambertian models, respectively. Parameters for the Phong reconstruction were f = 1500, h1 = h2 = 1, Is = Id = Ia = 2000, ka = 0.1, kd = 0.5, ks = 0.4, and α = 4. We used some ambient lighting to compensate for ambient light present in the room. For the Lambertian reconstruction we used the same parameters, but with ks = 0. In the Phong reconstruction, the plate in the background is flat, as it should be. In the Lambertian reconstruction, we see artifacts at specular highlights on the cup where the surface is pulled towards the light source: the cup is estimated much too close to the camera (not observable in Fig. 6). In the Phong reconstruction, we hardly see any artifacts. At the specular highlights, we have an almost normal shape. Even the handle of the cup and the plate are recovered very well.
Fig. 6. Reconstructions of the cup. Left: Lambertian model. Right: Phong model.
5
Summary
We have introduced a Phong-type perspective SFS model that yields much more accurate results than its Lambertian counterpart. Moreover, we have seen that also real-world images and noisy images can be tackled successfully by our new model. This is an important step towards making SFS ideas applicable to a larger range of real-world problems. Our ongoing work is concerned with more complicated illumination models, as this step will improve the applicability of SFS to real-world problems, as well as with algorithmic speedups. We also propose to use the current model with the presented technique for automated document analysis, which will be the subject of future work.
References
1. Horn, B.K.P., Brooks, M.J.: Shape from Shading. Artificial Intelligence Series. MIT Press, Cambridge (1989)
2. Zhang, R., Tsai, P.S., Cryer, J.E., Shah, M.: Shape from shading: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 21(8), 690–706 (1999)
3. Prados, E.: Application of the theory of the viscosity solutions to the Shape from Shading problem. PhD thesis, University of Nice Sophia-Antipolis (2004)
4. Horn, B.K.P.: Shape from Shading: A Method for Obtaining the Shape of a Smooth Opaque Object from One View. PhD thesis, Department of Electrical Engineering, MIT, Cambridge, MA (1970)
5. Bakshi, S., Yang, Y.H.: Shape from shading for non-Lambertian surfaces. In: Proc. IEEE International Conference on Image Processing, vol. 2, pp. 130–134. IEEE Computer Society Press, Austin (1994)
6. Lee, K., Kuo, C.C.J.: Shape from shading with a generalized reflectance map model. Computer Vision and Image Understanding 67(2), 143–160 (1997)
7. Okatani, T., Deguchi, K.: Reconstructing shape from shading with a point light source at the projection center: Shape reconstruction from an endoscope image. In: Proc. 1996 International Conference on Pattern Recognition, Vienna, Austria, pp. 830–834 (1996)
8. Tankus, A., Sochen, N., Yeshurun, Y.: Perspective shape-from-shading by fast marching. In: Proc. 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 43–49. IEEE Computer Society Press, Washington (2004)
9. Tankus, A., Sochen, N., Yeshurun, Y.: Shape-from-shading under perspective projection. International Journal of Computer Vision 63(1), 21–43 (2005)
10. Cristiani, E., Falcone, M., Seghini, A.: Some remarks on perspective shape-from-shading models. In: Sgallari, F., Murli, A., Paragios, N. (eds.) SSVM 2007. LNCS, vol. 4485, pp. 276–287. Springer, Berlin (2007)
11. Okatani, T., Deguchi, K.: Shape reconstruction from an endoscope image by shape from shading technique for a point light source at the projection center. Computer Vision and Image Understanding 66(2), 119–131 (1997)
12. Vogel, O., Breuß, M., Weickert, J.: A direct numerical approach to perspective shape-from-shading. In: Lensch, H., Rosenhahn, B., Seidel, H.P., Slusallek, P., Weickert, J. (eds.) Vision, Modeling, and Visualization, Saarbrücken, Germany, pp. 91–100 (2007)
13. Phong, B.T.: Illumination of Computer-Generated Images. PhD thesis, Department of Computer Science, University of Utah (1975)
14. Phong, B.T.: Illumination for computer-generated pictures. Communications of the ACM 18(6), 311–317 (1975)
15. Prados, E., Faugeras, O.: Unifying approaches and removing unrealistic assumptions in shape from shading: Mathematics can help. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 141–154. Springer, Heidelberg (2004)
16. Rouy, E., Tourin, A.: A viscosity solutions approach to shape-from-shading. SIAM Journal of Numerical Analysis 29(3), 867–884 (1992)
The Conformal Monogenic Signal Lennart Wietzke and Gerald Sommer Kiel University, Institute of Computer Science, Chair of Cognitive Systems, Christian-Albrechts-Platz 4, 24118 Kiel, Germany {lw,gs}@ks.informatik.uni-kiel.de
Abstract. The conformal monogenic signal is a novel rotational invariant approach for analyzing i(ntrinsic)1D and i2D local features of twodimensional signals (e.g. images) without the use of any heuristics. It contains the monogenic signal as a special case for i1D signals and combines scale-space, phase, orientation, energy and isophote curvature in one unified algebraic framework. The conformal monogenic signal will be theoretically illustrated and motivated in detail by the relation of the Radon and the Riesz transform. One of the main ideas is to lift up two-dimensional signals to a higher dimensional conformal space where the signal can be analyzed with more degrees of freedom. The most interesting result is that isophote curvature can be calculated in a purely algebraic framework without the need of any derivatives.
1
Introduction
In this paper 2D signals (e.g. gray value images) f ∈ L²(Ω; R) with Ω ⊂ R² will be locally analyzed. Features such as phase φ, orientation θ and curvature κ will be determined at every test point (x, y) ∈ Ω of the original 2D signal f. For each test point a local coordinate system will be applied before analysis. One important local structural feature is the phase φ of a DC free 1D signal model g(x) := a(x) cos(φ(x)), which can be calculated by means of the Hilbert transform. Furthermore all signals will be analyzed in scale space (e.g. in Poisson scale space [1]) because the Hilbert transform can only be interpreted for narrow banded signals. One possible generalization of the Hilbert transform to higher dimensions which will be used in this work is the Riesz transform. 2D signals f are classified into local regions N ⊆ Ω of different intrinsic dimensions (also known as codimension):
f ∈ i0D_N   if f(xi) = f(xj) ∀ xi, xj ∈ N,
f ∈ i1D_N   if f(x, y) = g(x cos θ + y sin θ) ∀ (x, y) ∈ N and f ∉ i0D_N,
f ∈ i2D_N   else.   (1)
We acknowledge funding by the German Research Foundation (DFG) under the project SO 320/4-2.
2
The Monogenic Signal
Phase and amplitude of 1D signals can be analyzed by the analytic signal. The generalization of the analytic signal to multidimensional signal domains has been done by the monogenic signal [2]. In case of 2D signals the monogenic signal delivers local phase, orientation and energy information. The monogenic signal can be interpreted for the application to i1D signals. This work presents the generalization of the monogenic signal for 2D signals to analyze both i1D and i2D signals in one single framework. The conformal monogenic signal delivers local phase, orientation, energy and curvature for i1D and i2D signals with the monogenic signal as a special case. To illustrate the motivation and the interpretation of this work, first of all the monogenic signal will be recalled in detail.
2.1 Riesz Transform
The monogenic signal replaces the Hilbert transform of the analytic signal by the Riesz transform, which is known from Clifford analysis [3]. The Riesz transform R{·} extends the signal f to a monogenic (holomorphic) function. It is one possible, but not the only, generalization of the Hilbert transform to multidimensional signal domains. In spatial domain the Riesz transform is given by the following convolution [4]:
R{f}(0) := ( Γ((m+1)/2) / π^{(m+1)/2} ) ∫_{x ∈ R^m} f(x) x / ‖x‖^{m+1} dx.   (2)
In this work the Cauchy principal value (P.v.) of all integrals will be omitted. To enable interpretation of the Riesz transform, its relation to the Radon transform will be shown in detail. This relation can be proved by means of the Fourier slice theorem [5].
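For m = 2 the prefactor in (2) equals Γ(3/2)/π^{3/2} = 1/(2π), so the Riesz responses can be approximated by a spatial convolution with a truncated kernel, as in the following sketch; the mask radius is an arbitrary illustrative choice, scale-space prefiltering is omitted, and the overall sign may differ from (2) depending on the kernel convention.

```python
import numpy as np
from scipy.signal import fftconvolve

def riesz_transform_2d(f, radius=8):
    """Approximate 2D Riesz transform (R_x{f}, R_y{f}) via spatial convolution
    with the truncated kernel x / (2*pi*|x|^3)."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
    r3 = (x**2 + y**2)**1.5
    r3[radius, radius] = np.inf            # remove the singularity at the origin
    kx = x / (2.0 * np.pi * r3)
    ky = y / (2.0 * np.pi * r3)
    return fftconvolve(f, kx, mode="same"), fftconvolve(f, ky, mode="same")
```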
2.2 Relation of the Riesz and Radon Transform
The Riesz transform can be expressed using the Radon and the Hilbert transform. Note that the relation to the Radon transform is required solely for interpretation and theoretical results. Neither the Radon transform nor its inverse are ever applied to the signal in practice. Instead the Riesz transformed signal will be determined by convolution in spatial domain. The 2D Radon transform [6] is defined as:
r(t, θ) := R{f}(t, θ) := ∫_{(x,y)∈R²} f(x, y) δ₀(x cos θ + y sin θ − t) d(x, y),   (3)
with θ ∈ [0, π) as the orientation, t ∈ R as the minimal distance of the line to the origin (0, 0) and δ0 as the Dirac delta distribution (see figure 1). The inverse 2D Radon transform exists and is defined by:
Fig. 1. Left figure: i1D signal f in spatial domain with orientation θm and local phase φ = 0 at the origin (0, 0). Right figure: i1D signal f in Radon space. Each point in Radon space represents the integral in spatial domain on a line which is uniquely defined by the minimal distance t ∈ R to the origin and the orientation θ ∈ [0, π).
R^{−1}{r}(x, y) := (1/(2π²)) ∫_{θ=0}^{π} ∫_{t∈R} ( ∂r(t, θ)/∂t ) / ( x cos θ + y sin θ − t ) dt dθ.   (4)
Now the Riesz transform will be expressed by the Hilbert transform, the Radon transform and its inverse:
R{f}(x, y) = R^{−1}{ (cos θ, sin θ)^T (h₁(t) ∗ r(t, θ)) }(x, y),   (5)
with h₁(t) = 1/(πt) as the one-dimensional Hilbert kernel and ∗ as the convolution operator. In other words, the Riesz transform applies a one-dimensional Hilbert transform to the Radon space representation r(t, θ) of the signal along t ∈ R for each orientation θ ∈ [0, π) separately. For the following implications the signal is defined by a superposition of i1D signals f := Σ_{i∈I} fi, where each i1D signal fi is orientated along θi. To be able to extract orientation and phase information from the Riesz transformed signal, the inverse Radon transform must be simplified. This can be achieved by two assumptions. Firstly, the point of interest where local feature information should be obtained will be translated to the origin (0, 0) for each point (x, y) ∈ Ω ⊂ R², so that the inverse 2D Radon transform has to be evaluated only at (x, y) := (0, 0). Let M := [0, π) − ∪_{i∈I}{θi} be the set of all orientations where no i1D signal information exists. Secondly, the θ-integral of the inverse Radon transform vanishes because r(t₁, θ) = r(t₂, θ) ∀t₁, t₂ ∈ R ∀θ ∈ M implies ∂r(t, θ)/∂t = 0 ∀t ∈ R ∀θ ∈ M for a finite number I ∈ N of superimposed i1D signals. Because of this fact (and the linearity property of the Radon transform), the θ-integral of the inverse Radon transform can be replaced by a finite sum of discrete angles θi to enable modeling the superposition of an arbitrary number of i1D signals. Therefore the inverse Radon transform can be written as
R^{−1}{r}(0, 0) = −(1/(2π²)) Σ_{i∈I} ∫_{t∈R} ( ∂r(t, θi)/∂t ) / t dt.   (6)
Now the 2D Riesz transform and therefore the monogenic signal can be interpreted in an explicit way.
2.3 Interpretation of the 2D Riesz Transform
Because of the property ∂/∂t (h₁(t) ∗ r(t, θ)) = h₁(t) ∗ ∂r(t, θ)/∂t, the 2D Riesz transform of any i1D signal with orientation θm results in
( Rx{f}(0, 0), Ry{f}(0, 0) )^T = −(1/(2π²)) ∫_{t∈R} (1/t) ( h₁(t) ∗ ∂r(t, θm)/∂t ) dt · (cos θm, sin θm)^T,   (7)
where the scalar factor in front of (cos θm, sin θm)^T is abbreviated as s(θm). The orientation of the signal can therefore be derived by
θm = arctan( s(θm) sin θm / (s(θm) cos θm) ) = arctan( Ry{f}(0, 0) / Rx{f}(0, 0) ).   (8)
The partial Hilbert transform [7] of f_{θm}(τ) := f(τ cos θm, τ sin θm), and therefore also its phase, can be calculated by
φ = atan2( (h₁ ∗ f_{θm})(0), f(0, 0) )   (9)
  = atan2( √( Rx²{f}(0, 0) + Ry²{f}(0, 0) ), f(0, 0) ).   (10)
This reveals that - although the Riesz transform is a generalization of the Hilbert transform to multi-dimensional signal domains - it still applies a one-dimensional Hilbert transform along the main orientation θm to the signal. In short, the monogenic signal enables interpretation of i1D signals and the mean value of their superposition [8].
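Given the Riesz responses at a point (for instance from the sketch above), orientation and phase follow directly from (8)-(10); the amplitude formula used here is the usual monogenic energy and is not spelled out in this excerpt, so it should be read as an assumption.

```python
import numpy as np

def monogenic_features(f_val, rx, ry):
    """Orientation, phase and energy of the 2D monogenic signal at one point."""
    orientation = np.arctan2(ry, rx) % np.pi          # theta_m in [0, pi), cf. (8)
    phase = np.arctan2(np.hypot(rx, ry), f_val)       # cf. (10)
    energy = np.sqrt(f_val**2 + rx**2 + ry**2)        # monogenic amplitude (assumed)
    return orientation, phase, energy
```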
3
The Conformal Monogenic Signal
The feature space of the 2D monogenic signal is spanned by phase, orientation and energy information. This restriction correlates to the dimension of the associated Radon space. Therefore, the feature space of the 2D signal can only be extended by lifting up the original signal to higher dimensions. This is one of the main ideas of the conformal monogenic signal. In the following the 2D monogenic signal will be generalized to analyze also i2D signals by embedding the 2D signal into the conformal space. The previous section shows that the 2D Riesz transform can be expressed by the 2D Radon transform which integrates all function values on lines. This restriction to lines is one of the reasons why the 2D monogenic signal is limited to i1D signals (such as lines and edges) with zero isophote curvature. To analyze also i2D signals and to measure curvature κ = 1/ρ, a 2D Radon transform which integrates on curved lines (i.e. circles with radius ρ) is preferable. Unfortunately, the inverse Radon transform directly on
Fig. 2. Circles on the original 2D plane are mapped to circles on the sphere that do not pass through the north pole (0, 0, 1). Lines on the plane are mapped to circles passing through the north pole, i.e. lines are a special case of circles with infinite radius.
circles is not unique [9]. Now it will be proposed to solve this problem in conformal space. In 3D signal domains the Radon transform integrates on planes, although at first sight 3D planes are not related to 2D signals. The idea is that circles form the intersection of a sphere (with center at (0, 0, 1/2) and radius 1/2) and planes passing through the origin (0, 0, 0). Since the Riesz transform can be extended to any dimension and the 3D Riesz transform can be expressed by the 3D Radon transform, the 2D signal coordinates must be mapped appropriately to the sphere. This mapping must be conformal (i.e. angle preserving), so that interpretation of the 3D Riesz transform in conformal space is still reasonable. Analogous to the (t, θ) line parametrization of the 2D Radon transform, the planes of the 3D Radon transform are uniquely defined by the parameters (t, θ, ϕ). This new parametrization (see figure 4) truly extends the interpretation space of the monogenic signal by one dimension. Now the 2D signal will be embedded into a two-dimensional subspace of the conformal space.
3.1 The Conformal Space
The main idea is that the concept of lines in 2D Radon space becomes the more abstract concept of planes in 3D Radon space. These planes determine circles on the sphere in conformal space. Since lines and circles of the two-dimensional signal domain are mapped to circles [10] on the sphere (see figure 2), the integration on these circles determines points in the 3D Radon space. The stereographic projection C known from complex analysis [11] maps the 2D signal domain to the sphere (see figure 3). This projection is conformal and can be inverted by C^{−1} for all elements of S ⊂ R³:
S := { (x, y, z) ∈ [−1/2, 1/2]² × [0, 1) : x² + y² + (z − 1/2)² = 1/4 },   (11)
C : R² → S,   C(x, y) := (1/(x² + y² + 1)) (x, y, x² + y²)^T,   C^{−1}(x, y, z) := (1/(1 − z)) (x, y)^T.   (12)
This mapping has the property that the origin (0, 0) of the 2D signal domain will be mapped to the south pole 0 := (0, 0, 0) of the sphere and both −∞, +∞ will be
Fig. 3. Left and right figure show the conformal space from two different points of view. The 2D signal f will be mapped by the stereographic projection onto the sphere.
mapped to the north pole (0, 0, 1) of the sphere. Lines and circles of the 2D signal domain will be mapped to circles on the sphere and can be determined uniquely by planes in 3D Radon space. The integration on these planes corresponds to points (t, θ, ϕ) in the 3D Radon space.
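The stereographic embedding (12) is easy to state in code; the small consistency check below verifies that image points indeed land on the sphere with center (0, 0, 1/2) and radius 1/2 (the test point is arbitrary).

```python
import numpy as np

def C(x, y):
    """Conformal (stereographic) mapping of an image point onto the sphere, eq. (12)."""
    s = x**2 + y**2
    return np.array([x, y, s]) / (s + 1.0)

def C_inv(x, y, z):
    """Inverse mapping from the sphere (minus the north pole) back to the plane."""
    return np.array([x, y]) / (1.0 - z)

p = C(1.3, -0.7)
assert abs(p[0]**2 + p[1]**2 + (p[2] - 0.5)**2 - 0.25) < 1e-12
```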
3.2 The Riesz Transform in Conformal Space
Since the signal domain Ω ⊂ R² is bounded, not the whole sphere is covered by the original signal (see left part of figure 4). Anyway, all planes corresponding to circles remain unchanged. That is the reason why the conformal monogenic signal models i1D lines and all kinds of curved i2D signals which can be locally approximated by circles. To give the Riesz transform more degrees of freedom, the original two-dimensional signal will be embedded in an applicable subspace of the conformal space by c : R³ → R,
c(x, y, z) := f(C^{−1}(x, y, z)^T)  if x² + y² + (z − 1/2)² = 1/4,   and   c(x, y, z) := 0  else.   (13)
Thus, the 3D Riesz transform can be applied to all points on the sphere. The center of convolution in spatial domain is the south pole (0, 0, 0), where the origin of the 2D signal domain meets the sphere. At this point the 3D Riesz transform will be performed. Now the conformal monogenic signal can be introduced by the 3D Radon transform and its inverse, analogous to the monogenic signal in 2D:
( c(0), Rx{c}(0), Ry{c}(0), Rz{c}(0) )^T := ( c(0, 0, 0), R^{−1}{ (sin ϕ cos θ, sin ϕ sin θ, cos ϕ)^T (h₁(t) ∗ R{c}(t, θ, ϕ)) }(0, 0, 0) )^T,   (14)
Fig. 4. Left figure: The stereographic projection ray passes through each 2D point (x, y) and the north pole (0, 0, 1) of the sphere. The conformal mapping of the point (x, y) is defined by the intersection of its projection ray and the sphere. Right figure: Each intersection of the sphere and a plane passing through the origin (0, 0, 0) delivers a circle. Those planes and thus all circles on the sphere are uniquely defined by the angles (θ, ϕ) of the normal vector.
and without loss of generality the signal will be analyzed at the origin 0 = (0, 0, 0). Compared to the 2D monogenic signal, the conformal monogenic signal performs a 3D Riesz transformation in conformal space.
3.3 The Radon Transform in Conformal Space
To interpret the conformal monogenic signal, the relation to the 3D Radon transform in conformal space must be taken into account. The 3D Radon transform is defined as the integral of all function values on the plane defined by (t, θ, ϕ) ∈ R × [0, 2π) × [0, π):
R{c}(t, θ, ϕ) = ∫_{x∈R³} c(x) δ₀( x · (sin ϕ cos θ, sin ϕ sin θ, cos ϕ)^T − t ) dx.   (15)
Since the signal is mapped on the sphere and all other points of the conformal space are set to zero, the 3D Radon transform actually sums up all points lying on the intersection of the plane and the sphere. For all planes this intersection can be either empty or a circle. The concept of circles in the conformal 3D Radon transform can be compared with the concept of lines known from the 2D Radon transform. Since lines in the 2D domain are also mapped to circles, the conformal monogenic signal can analyze i1D as well as curved i2D signals in one single framework. The inverse 3D Radon transform exists and differs from the 2D case such that it is a local transformation [12]. That means the Riesz
transform at (0, 0, 0) is completely determined by all planes passing through the origin (i.e. t = 0). In contrast, the 2D monogenic signal requires all integrals on all lines (t, θ) to reconstruct the original signal at a certain point and is therefore called a global transform. This interesting fact follows from the 3D inverse Radon transform
R^{−1}{r}(0) := −(1/(8π²)) ∫_{θ=0}^{2π} ∫_{ϕ=0}^{π} ∂²r(t, θ, ϕ)/∂t² |_{t=0} dϕ dθ.   (16)
Therefore, the local features of i1D and i2D signals can be determined by the conformal monogenic signal at the origin of the 2D signal without knowledge of the whole Radon space. Hence, the relation of the Radon and the Riesz transform is essential to interpret the Riesz transform in conformal space.
3.4 Interpretation and Experimental Results
Analogous to the interpretation of the monogenic signal, the parameters of the plane within the 3D Radon space determine the local features of the curved i2D signal. The conformal monogenic signal can be called the generalized monogenic signal for i1D and i2D signals, because lines and edges can be considered as circles with zero curvature. These lines are mapped to circles passing through the north pole in conformal space. The curvature can be measured by the parameter ϕ of the 3D Radon space:
ϕ = arctan( Rz{c}(0) / √( Rx²{c}(0) + Ry²{c}(0) ) ).   (17)
It can be shown that ϕ corresponds to the isophote curvature κ known from differential geometry [13,14]:
κ = ( −fxx fy² + 2 fx fy fxy − fyy fx² ) / ( fx² + fy² )^{3/2}.   (18)
Besides, the curvature of the conformal monogenic signal naturally indicates the intrinsic dimension of the signal. The parameter θ will be interpreted as the orientation in the i1D case and extends to a direction θ ∈ [0, 2π) in the i2D case:
θ = atan2( Ry{c}(0), Rx{c}(0) ).   (19)
The phase φ is defined for all intrinsic dimensions by
φ = atan2( √( Rx²{c}(0) + Ry²{c}(0) + Rz²{c}(0) ), c(0) ).   (20)
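Once the embedded signal value c(0) and its 3D Riesz responses are available, the local features follow from (17), (19) and (20); the derivative-based curvature (18) is included for comparison. This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def conformal_monogenic_features(c0, rx, ry, rz):
    """Curvature angle, direction and phase from the conformal monogenic responses."""
    phi_curv = np.arctan2(rz, np.hypot(rx, ry))                 # eq. (17)
    direction = np.arctan2(ry, rx)                              # eq. (19), in (-pi, pi]
    phase = np.arctan2(np.sqrt(rx**2 + ry**2 + rz**2), c0)      # eq. (20)
    return phi_curv, direction, phase

def isophote_curvature(fx, fy, fxx, fxy, fyy):
    """Classical derivative-based isophote curvature, eq. (18), for comparison."""
    return (-fxx * fy**2 + 2.0 * fx * fy * fxy - fyy * fx**2) / (fx**2 + fy**2)**1.5
```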
All proofs are analogous to those shown for the 2D monogenic signal. The conformal monogenic signal can be efficiently implemented by convolution in spatial
Fig. 5. Experimental results and comparison. Top row from left to right: Synthetic signal, monogenic isophote curvature and classical isophote curvature determined by derivatives. Bottom row from left to right: Energy, phase and direction. Convolution mask size: 5 × 5 pixels.
domain without the need of any Fourier transform. Since the 3D convolution in conformal space can be simplified to a faster 2D convolution on the sphere, the time complexity of the conformal monogenic signal computation is in O(n²), with n as the convolution mask size in one dimension. On synthetic signals the error of the feature extraction converges to zero with increasing refinement of the convolution mask. The advantages of the monogenic isophote curvature compared to the curvature delivered by the classical differential geometry approach [15] can be seen clearly in figure 5. In the presence of noise, the monogenic isophote curvature generally performs more robustly than the classical isophote curvature. Detailed application and performance behavior of the conformal monogenic signal will be part of future work.
4
Conclusion
In this paper a novel generalization of the monogenic signal for two-dimensional signals has been presented to analyze i(ntrinsic)1D and i2D signals in one unified algebraic framework. The idea of the conformal monogenic signal is to lift up two-dimensional signals to an appropriate conformal space in which the signal can be Riesz transformed with more degrees of freedom compared to the 2D monogenic signal. Without steering, i1D and i2D local features such as phase, orientation/direction, energy and isophote curvature can be determined in the spatial domain. The conformal monogenic signal can be computed efficiently with the same time complexity as the 2D monogenic signal. Furthermore, the exact local
isophote curvature (which is of practical importance in low level image analysis) can be calculated without the need of derivatives. Hence, all problems of partial derivatives on discrete grids can be avoided by the application of the conformal monogenic signal.
References
1. Felsberg, M., Sommer, G.: The monogenic scale-space: A unifying approach to phase-based image processing in scale-space. Journal of Mathematical Imaging and Vision (2003)
2. Felsberg, M., Sommer, G.: The monogenic signal. Technical report, Kiel University (2001)
3. Brackx, F., De Knock, B., De Schepper, H.: Generalized multidimensional hilbert transforms in clifford analysis. International Journal of Mathematics and Mathematical Sciences (2006)
4. Delanghe, R.: Clifford analysis: History and perspective. In: Computational Methods and Function Theory, vol. 1, pp. 107–153 (2001)
5. Stein, E.M.: Singular Integrals and Differentiability Properties of Functions. Princeton University Press, Princeton (1971)
6. Beyerer, J., Leon, F.P.: The radon transform in digital image processing. Automatisierungstechnik 50(10), 472–480 (2002)
7. Hahn, S.L.: Hilbert Transforms in Signal Processing. Artech House (1996)
8. Wietzke, L., Sommer, G., Schmaltz, C., Weickert, J.: Differential geometry of monogenic signal representations. In: Robot Vision, pp. 454–465. Springer, Heidelberg (2008)
9. Ambartsoumian, G., Kuchment, P.: On the injectivity of the circular radon transform. In: Inverse Problems, vol. 21, p. 473. Inst. of Physics Publishing (2005)
10. Needham, T.: Visual Complex Analysis. Oxford University Press, Oxford (1997)
11. Gürlebeck, K., Habetha, K., Sprössig, W.: Funktionentheorie in der Ebene und im Raum (Grundstudium Mathematik). Birkhäuser, Basel (2006)
12. Bernstein, S.: Inverse Probleme. Technical report, TU Bergakademie Freiberg (2007)
13. do Carmo, M.P.: Differential Geometry of Curves and Surfaces. Prentice-Hall, Englewood Cliffs (1976)
14. Baer, C.: Elementare Differentialgeometrie, vol. 1. de Gruyter, Berlin, New York (2001)
15. Lichtenauer, J., Hendriks, E.A., Reinders, M.J.T.: Isophote properties as features for object detection. In: CVPR (2), pp. 649–654. IEEE Computer Society, Los Alamitos (2005)
Author Index
Aach, Til 497 Abraham, Steffen 112 Andres, Björn 142 Bartczak, Bogumil 153 Bauckhage, Christian 426 Bauer, Christian 163 Becker, Florian 325, 335 Beder, Christian 264 Bhavsar, Arnav 304 Bischof, Horst 51, 102, 163, 396 Blaschko, Matthew B. 31 Breuel, Thomas M. 507 Breuß, Michael 517 Brox, Thomas 385 Bruhn, Andrés 314 Buhmann, Joachim M. 173, 224 Burkhardt, Hans 41 Cremers, Daniel 396
Dederscheck, David 345 Demuth, Bastian 365 Denk, Winfried 142 Denzler, Joachim 274 Dickinson, Sven 486 Dikov, Veselin 375 Dorkó, Gyuri 71 Einhäuser, Wolfgang 224
Fehr, Janis 41 Felsberg, Michael 436 Flach, Boris 183 Förstner, Wolfgang 11 Friedrich, Holger 345 Fuchs, Thomas J. 173 Gall, Juergen 92 Garbe, Christoph 355 Gehrig, Stefan 92 Grabner, Michael 102 Greif, Thomas 446 Gruber, Daniel 203
Haja, Andreas 112 Hamprecht, Fred A. 142 Helmstaedter, Moritz 142 Hörster, Eva 446 Jähne, Bernd 112 Jiang, Xiaoyi 214 Jung, Daniel 153
Kähler, Olaf 274 Kappes, Jörg 1 Kim, Kwang In 456 Klappstein, Jens 203 Koch, Reinhard 153, 264 Kondermann, Claudia 355 Kondermann, Daniel 355 Korč, Filip 11 Köthe, Ullrich 142 Krajsek, Kai 466 Kwon, Younghee 456 Labusch, Kai 21 Lampert, Christoph H. 31 Lange, Tilman 173 Lienhart, Rainer 446 Lindblad, Joakim 476 Lukić, Tibor 476 Martinetz, Thomas 21 Mester, Rudolf 345, 466 Moch, Holger 173 Moh, Yvonne 224 Molkenstruck, Sven 284 Moosmayr, Tobias 244 Moringen, Jan 486 Mühlich, Matthias 497 Müller, Meinard 365 Nagel, Kerstin 497 Noglik, Anastasia 61 Pauli, Josef 61 Petersen, Kersten 41 Petra, Stefania 294 Pock, Thomas 396
Rahmann, Stefan 375 Rajagopalan, A.N. 304 Rigoll, Gerhard 234, 244, 304 Rölke, Volker 61 Rosenhahn, Bodo 92, 365, 385 Rosert, Eduard 345 Rothaus, Kai 214 Ruske, Günther 234, 244, 254 Saffari, Amir 51 Scharr, Hanno 466, 497 Schenk, Joachim 234, 254 Schiele, Bernt 71, 82 Schiller, Ingo 264 Schindler, Konrad 122 Schlesinger, Dmitrij 183 Schmaltz, Christian 385 Schnörr, Christoph 1, 294, 325, 335, 406, 416 Schröder, Andreas 294 Schuller, Björn 244 Schultz, Thomas 193 Schulz, André 71 Schwärzler, Stefan 234, 254 Seidel, Hans-Peter 92, 193, 385 Sladoje, Nataša 476 Slaney, Malcolm 446 Sluzhivoy, Andrey 61 Sommer, Gerald 527 Steger, Carsten 132 Steidl, Gabriele 416 Stevenson, Suzanne 486
Timm, Fabian 21 Trobin, Werner 396 Truhn, Daniel 497 Ulges, Adrian 507 Ulrich, Markus 132 Valgaerts, Levi 314 van Gool, Luc 122 Vaudrey, Tobi 203 Vlasenko, Andrey 406 Vogel, Oliver 517 Wachsmuth, Sven 486 Wahl, Friedrich M. 284 Wallhoff, Frank 254, 304 Walter, Achim 497 Wattuya, Pakaket 214 Wedel, Andreas 203 Weickert, Joachim 314, 385, 517 Wiedemann, Christian 132 Wieneke, Bernhard 294, 335 Wietzke, Lennart 527 Wild, Peter J. 173 Winkelbach, Simon 284 Wojek, Christian 71, 82 Wöllmer, Martin 244 Yuan, Jing 335, 416
Zach, Christopher 102