Signals and Communication Technology
Michael N. Rychagov Ekaterina V. Tolstaya Mikhail Y. Sirotenko Editors
Smart Algorithms for Multimedia and Imaging
Signals and Communication Technology Series Editors Emre Celebi, Department of Computer Science, University of Central Arkansas, Conway, AR, USA Jingdong Chen, Northwestern Polytechnical University, Xi'an, China E. S. Gopi, Department of Electronics and Communication Engineering, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India Amy Neustein, Linguistic Technology Systems, Fort Lee, NJ, USA H. Vincent Poor, Department of Electrical Engineering, Princeton University, Princeton, NJ, USA
This series is devoted to fundamentals and applications of modern methods of signal processing and cutting-edge communication technologies. The main topics are information and signal theory, acoustical signal processing, image processing and multimedia systems, mobile and wireless communications, and computer and communication networks. Volumes in the series address researchers in academia and industrial R&D departments. The series is application-oriented. The level of presentation of each individual volume, however, depends on the subject and can range from practical to scientific. **Indexing: All books in “Signals and Communication Technology” are indexed by Scopus and zbMATH** For general information about this book series, comments or suggestions, please contact Mary James at [email protected] or Ramesh Nath Premnath at [email protected].
More information about this series at http://www.springer.com/series/4748
Michael N. Rychagov • Ekaterina V. Tolstaya • Mikhail Y. Sirotenko Editors
Smart Algorithms for Multimedia and Imaging
Editors
Michael N. Rychagov, National Research University of Electronic Technology (MIET), Moscow, Russia
Ekaterina V. Tolstaya, Aramco Innovations LLC, Moscow, Russia
Mikhail Y. Sirotenko, Google Research, New York, NY, USA
ISSN 1860-4862 ISSN 1860-4870 (electronic) Signals and Communication Technology ISBN 978-3-030-66740-5 ISBN 978-3-030-66741-2 (eBook) https://doi.org/10.1007/978-3-030-66741-2 © Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Over the past decades, people have produced vast amounts of multimedia content, including text, audio, images, animations, and video. The substance of this content belongs, in turn, to various areas, including entertainment, engineering, medicine, business, scientific research, etc. This content should be readily processed, analysed, and displayed by numerous devices like TVs, mobile devices, VR headsets, medical devices, media players, etc., without losing its quality. This brings researchers and engineers to the problem of the fast transformation and processing of multidimensional signals, where they must deal with different sizes and resolutions, processing speed, memory, and power consumption. In this book, we describe smart algorithms applied both for multimedia processing in general and in imaging technology in particular.
In the first book of this series, Adaptive Image Processing Algorithms for Printing by I.V. Safonov, I.V. Kurilin, M.N. Rychagov, and E.V. Tolstaya, published by Springer Nature Singapore in 2018, several algorithms were considered for the image processing pipeline of photo-printer and photo-editing software tools that we have worked on at different times for processing still images and photos. The second book, Document Image Processing for Scanning and Printing by the same authors, published by Springer Nature Switzerland in 2019, dealt with document image processing for scanning and printing. A copying technology is needed to make perfect copies from extremely varied originals; therefore, copying is not in practice separable from image enhancement. From a technical perspective, it is best to consider document copying jointly with image enhancement.
This book is devoted to multimedia algorithms and imaging, and it is divided into four main interconnected parts:
• Image and Video Conversion
• TV and Display Applications
• Machine Learning and Artificial Intelligence
• Mobile Algorithms
Image and Video Conversion includes five chapters that cover solutions on super-resolution using a multi-frame-based approach as well as machine learning-based super-resolution. They also cover the processing of 3D signals, namely depth estimation and control, and semi-automatic 2D to 3D video conversion. A comprehensive review of visually lossless colour compression technology concludes this part.
TV and Display Applications includes three chapters in which the following algorithms are considered: video editing, real-time sports episode detection by video content analysis, and the generation and reproduction of natural effects.
Machine Learning and Artificial Intelligence includes four chapters, where the following topics are covered: image classification as a service, mobile user profiling, and automatic view planning in magnetic resonance imaging, as well as dictionary-based compressed sensing MRI (magnetic resonance imaging).
Finally, Mobile Algorithms consists of four chapters where the following algorithms and solutions implemented for mobile devices are described: a depth camera based on a colour-coded aperture, the animated graphical abstract of an image, a motion photo, and approaches and methods for iris recognition for mobile devices.
The solutions presented in the first two books and in the current one have been included in dozens of patents worldwide, presented at international conferences, and realized in the firmware of devices and software. The material is based on the experience of both the editors and the authors of particular chapters in industrial research and technology commercialization. The authors have worked on the development of algorithms for different divisions of Samsung Electronics Co., Ltd, including the Printing Business, Visual Display Business, Health and Medical Equipment Division, and Mobile Communication Business, for more than 15 years.
We should especially note that this book in no way pretends to present an in-depth review of the achievements accumulated to date in the field of image and video conversion, TV and display applications, or mobile algorithms. Instead, in this book, the main results of the studies that we have authored are summarized. We hope that the main approaches, optimization procedures, and heuristic findings are still relevant and can be used as a basis for new intelligent solutions in multimedia, TV, and mobile applications.
How can algorithms capable of being adaptive to image content be developed? In many cases, inductive or deductive inference can help. Many of the algorithms include lightweight classifiers or other machine-learning-based techniques, which have low computational complexity and model size. This makes them deployable on embedded platforms. As we have mentioned, the majority of the described algorithms were implemented as systems-on-chip firmware or as software products. This was a challenge because, for each industrial task, there are always strict specification requirements, and, as a result, there are limitations on computational complexity, memory consumption, and power efficiency. In this book, typically, no device-dependent optimization tricks are described, though the ideas for effective methods from an algorithmic point of view are provided.
This book is intended for all those who are interested in advanced multimedia processing approaches, including applications of machine learning techniques for
the development of effective adaptive algorithms. We hope that this book will serve as a useful guide for students, researchers, and practitioners. It is the intention of the editors that each chapter be used as an independent text. In this regard, at the beginning of a large fragment, the main provisions considered in the preceding text are briefly repeated with reference to the appropriate chapter or section. References to the works of other authors and discussions of their results are given in the course of the presentation of the material. We would like to thank our colleagues who worked with us both in Korea and at the Samsung R&D Institute Rus, Moscow, on the development and implementation of the technologies mentioned in the book, including all of the authors of the chapters: Sang-cheon Choi, Yang Lim Choi, Dr. Praven Gulaka, Dr. Seung-Hoon Hahn, Jaebong Yoo, Heejun Lee, Kwanghyun Lee, San-Su Lee, B’jungtae O, Daekyu Shin, Minsuk Song, Gnana S. Surneni, Juwoan Yoo, Valery V. Anisimovskiy, Roman V. Arzumanyan, Andrey A. Bout, Dr. Victor V. Bucha, Dr. Vitaly V. Chernov, Dr. Alexey S. Chernyavskiy, Dr. Aleksey B. Danilevich, Andrey N. Drogolyub, Yuri S. Efimov, Marta A. Egorova, Dr. Vladimir A. Eremeev, Dr. Alexey M. Fartukov, Dr. Kirill A. Gavrilyuk, Ivan V. Glazistov, Vitaly S. Gnatyuk, Aleksei M. Gruzdev, Artem K. Ignatov, Ivan O. Karacharov, Aleksey Y. Kazantsev, Dr. Konstantin V. Kolchin, Anton S. Kornilov, Dmitry A. Korobchenko, Mikhail V. Korobkin, Dr. Oxana V. Korzh (Dzhosan), Dr. Igor M. Kovliga, Konstantin A. Kryzhanovsky, Dr. Mikhail S. Kudinov, Artem I. Kuharenko, Dr. Ilya V. Kurilin, Vladimir G. Kurmanov, Dr. Gennady G. Kuznetsov, Dr. Vitaly S. Lavrukhin, Kirill V. Lebedev, Vladislav A. Makeev, Vadim A. Markovtsev, Dr. Mstislav V. Maslennikov, Dr. Artem S. Migukin, Gleb S. Milyukov, Dr. Michael N. Mishourovsky, Andrey K. Moiseenko, Alexander A. Molchanov, Dr. Oleg F. Muratov, Dr. Aleksei Y. Nevidomskii, Dr. Gleb A. Odinokikh, Irina I. Piontkovskaya, Ivan A. Panchenko, Vladimir P. Paramonov, Dr. Xenia Y. Petrova, Dr. Sergey Y. Podlesnyy, Petr Pohl, Dr. Dmitry V. Polubotko, Andrey A. Popovkin, Iryna A. Reimers, Alexander A. Romanenko, Oleg S. Rybakov, Associate Prof., Dr. Ilia V. Safonov, Sergey M. Sedunov, Andrey Y. Shcherbinin, Yury V. Slynko, Ivan A. Solomatin, Liubov V. Stepanova (Podoynitsyna), Zoya V. Pushchina, Prof., Dr.Sc. Mikhail K. Tchobanou, Dr. Alexander A. Uldin, Anna A. Varfolomeeva, Kira I. Vinogradova, Dr. Sergey S. Zavalishin, Alexey M. Vil’kin, Sergey Y. Yakovlev, Dr. Sergey N. Zagoruyko, Dr. Mikhail V. Zheludev, and numerous volunteers who took part in the collection of test databases and the evaluation of the quality of our algorithms. Contributions from our partners at academic and institutional organizations with whom we are associated through joint publications, patents, and collaborative work, i.e., Prof. Dr.Sc. Anatoly G. Yagola, Prof. Dr.Sc. Andrey S. Krylov, Dr. Andrey V. Nasonov, and Dr. Elena A. Pavelyeva from Moscow State University; Academician RAS, Prof., M.D. Sergey K. Ternovoy, Prof., M.D. Merab A. Sharia, and M.D. Dmitry V. Ustuzhanin from the Tomography Department of the Cardiology Research Center (Moscow); Prof., Dr.Sc. Rustam K. Latypov, Dr. Ayrat F. Khasyanov, Dr. Maksim O. Talanov, and Irina A. Maksimova from Kazan
State University; Academician RAS, Prof., Dr.Sc. Evgeniy E. Tyrtyshnikov from the Marchuk Institute of Numerical Mathematics RAS; Academician RAS, Prof., Dr.Sc. Sergei V. Kislyakov, Corresponding Member of RAS, Dr.Sc. Maxim A. Vsemirnov, and Dr. Sergei I. Nikolenko from the St. Petersburg Department of Steklov Mathematical Institute of RAS; Corresponding Member of RAS, Prof., Dr.Sc. Rafael M. Yusupov, Prof., and Prof., Dr.Sc. Vladimir I. Gorodetski from the St. Petersburg Institute for Informatics and Automation RAS; Prof., Dr.Sc. Igor S. Gruzman from Novosibirsk State Technical University; and Prof., Dr.Sc. Vadim R. Lutsiv from ITMO University (St. Petersburg), are also deeply appreciated. During all these years and throughout the development of these technologies, we received comprehensive assistance and active technical support from SRR General Directors Dr. Youngmin Lee, Dr. Sang-Yoon Oh, Dr. Kim Hyo Gyu, and Jong-Sam Woo; the members of the planning R&D team: Kee-Hang Lee, Sang-Bae Lee, Jungsik Kim, Seungmin (Simon) Kim, and Byoung Kyu Min; the SRR IP Department, Mikhail Y. Silin, Yulia G. Yukovich, and Sergey V. Navasardyan from General Administration. All of their actions were always directed toward finding the most optimal forms of R&D work both for managers and engineers, generating new approaches to create promising algorithms and SW, and ultimately creating solutions of high quality. At any time, we relied on their participation and assistance in resolving issues. Moscow, Russia New York, NY, USA
Michael N. Rychagov Ekaterina V. Tolstaya Mikhail Y. Sirotenko
Acknowledgment
Proofreading of all pages of the manuscript was performed by PRS agency (http:// www.proof-reading-service.com).
Contents
1 Super-Resolution: 1. Multi-Frame-Based Approach (Xenia Y. Petrova), p. 1
2 Super-Resolution: 2. Machine Learning-Based Approach (Alexey S. Chernyavskiy), p. 35
3 Depth Estimation and Control (Ekaterina V. Tolstaya and Viktor V. Bucha), p. 59
4 Semi-Automatic 2D to 3D Video Conversion (Petr Pohl and Ekaterina V. Tolstaya), p. 81
5 Visually Lossless Colour Compression Technology (Michael N. Mishourovsky), p. 115
6 Automatic Video Editing (Sergey Y. Podlesnyy), p. 155
7 Real-Time Detection of Sports Broadcasts Using Video Content Analysis (Xenia Y. Petrova, Valery V. Anisimovsky, and Michael N. Rychagov), p. 193
8 Natural Effect Generation and Reproduction (Konstantin A. Kryzhanovskiy and Ilia V. Safonov), p. 219
9 Image Classification as a Service (Mikhail Y. Sirotenko), p. 237
10 Mobile User Profiling (Alexey M. Fartukov, Michael N. Rychagov, and Lyubov V. Stepanova), p. 259
11 Automatic View Planning in Magnetic Resonance Imaging (Aleksey B. Danilevich, Michael N. Rychagov, and Mikhail Y. Sirotenko), p. 277
12 Dictionary-Based Compressed Sensing MRI (Artem S. Migukin, Dmitry A. Korobchenko, and Kirill A. Gavrilyuk), p. 303
13 Depth Camera Based on Colour-Coded Aperture (Vladimir P. Paramonov), p. 325
14 An Animated Graphical Abstract for an Image (Ilia V. Safonov, Anton S. Kornilov, and Iryna A. Reimers), p. 351
15 Real-Time Video Frame-Rate Conversion (Igor M. Kovliga and Petr Pohl), p. 373
16 Approaches and Methods to Iris Recognition for Mobile (Alexey M. Fartukov, Gleb A. Odinokikh, and Vitaly S. Gnatyuk), p. 397
Index, p. 423
About the Editors
Michael N. Rychagov received his MS in acoustical imaging and PhD from Moscow State University (MSU) in 1986 and 1989, respectively. In 2000, he received a Dr.Sc. (Habilitation) from the same university. Since 1991, he has been involved in teaching and research at the National Research University of Electronic Technology (MIET) as an associate professor in the Department of Theoretical and Experimental Physics (1998), professor in the Department of Biomedical Systems (2008), and professor in the Department of Informatics and SW for Computer Systems (2014). In 2004, he joined the Samsung R&D Institute in Moscow, Russia (SRR), working on imaging algorithms for printing, scanning, and copying; TV and display technologies; multimedia; and tomographic areas for almost 14 years, including the last 8 years as Director of Division at SRR. Currently, he is Senior Manager of SW Development at Align Technology, Inc. (USA) in its Moscow branch (Russia). His technical and scientific interests are image and video signal processing, biomedical modelling, engineering applications of machine learning, and artificial intelligence. He is a Member of the Society for Imaging Science and Technology and a Senior Member of IEEE.
Ekaterina V. Tolstaya received her MS in applied mathematics from Moscow State University in 2000. In 2004, she completed her MS in geophysics at the University of Utah, USA, where she worked on inverse scattering in electromagnetics. From 2004, she worked on problems of image processing and reconstruction at the Samsung R&D Institute in Moscow, Russia. Based on these investigations, she obtained her PhD in 2011 with research on image processing algorithms for printing. In 2014, she continued her career with Align Technology, Inc. (USA) in its Moscow branch (Russia), working on problems involving computer vision, 3D geometry, and machine learning. Since 2020, she has worked at Aramco Innovations LLC in Moscow, Russia, on geophysical modelling and inversion.
Mikhail Y. Sirotenko received his engineering degree in control systems from Taganrog State University of Radio Engineering (2005) and his PhD in Robotics and AI from Don State Technical University (2009). In 2009, he co-founded the computer vision start-up CVisionLab; shortly afterwards, he joined the Samsung R&D Institute in Moscow, Russia (SRR), where he led a team working on applied machine learning and computer vision research. In 2015, he joined Amazon to work as a research scientist on the Amazon Go project. In 2016, he joined the computer vision start-up Dresr, which was acquired by Google in 2018, where he leads a team working on object recognition.
Chapter 1
Super-Resolution: 1. Multi-Frame-Based Approach Xenia Y. Petrova
1.1 Super-Resolution Problem
1.1.1 Introduction
Super-resolution (SR) is the name given to techniques that allow a single high-resolution (HR) image to be constructed out of one or several observed low-resolution (LR) images (Fig. 1.1). Compared to single-frame interpolation, SR reconstruction is able to restore the high-frequency component of the HR image by exploiting complementary information from multiple LR frames. The SR problem can be stated as described in Milanfar (2010). In a traditional setting, most SR methods can be classified according to the model of image formation, the model of the image prior, and the noise model. Commonly, research has paid more attention to image formation and noise models, while the image prior model has remained quite simple. The image formation model may include some linear operators like smoothing and down-sampling and the motion model that should be considered in the case of multi-frame SR. At the present time, as machine learning approaches are becoming more popular, there is a certain paradigm shift towards a more elaborate prior model. Here the main question becomes "What do we expect to see? Does it look natural?" instead of the question "What could have happened to a perfect image so that it became the one we can observe?" Super-resolution is a mature technology covered by numerous research and survey papers, so in the text below, we will focus on aspects related to image formation models, the relation and difference between SR and interpolation, supplementary
X. Y. Petrova, Samsung R&D Institute Russia (SRR), Moscow, Russia; e-mail: [email protected]
algorithms that are required to make super-resolution practical, super-resolution with input in the Bayer domain, and fast methods of solving the super-resolution problem.
Fig. 1.1 Single-frame (left side) vs multi-frame super-resolution (right side)
Fig. 1.2 Interpolation grid: (a) template with pixel insertion; (b) uniform interpolation template
1.1.2 Super-Resolution and Interpolation, Image Formation Model
When pondering on how to make a bigger image out of a smaller image, there are two basic approaches. The first one, which looks more obvious, assumes that there are some pixels that we know for sure and that are going to remain intact in the resulting image, and some other pixels that are to be "inserted" (Fig. 1.2a). In image interpolation applications, this kind of idea was developed in a wide range of edge-directed algorithms, starting with the famous NEDI algorithm, described in Li and Orchard (2001), and more recent developments, like those by Zhang and Wu (2006), Giachetti and Asuni (2011), Zhou et al. (2012), and Nasonov et al. (2016). However, simply observing Fig. 1.2a, it can be seen that keeping some pixels intact in the resulting image makes it impossible to deal with noisy signals. We can also expect the interpolation quality to become non-uniform, which may be unappealing visually, so the formation model from Fig. 1.2b should be more appropriate. In the interpolation problem, researchers rarely consider the relation between large-resolution and small-resolution images, but in the SR formulation the main focus is on the image formation model (light blue arrow in Fig. 1.2b). So, interpreting SR as an inverse problem to image formation has become a fruitful idea. From this point of view, the image formation model in Fig. 1.2a is a mere down-sampling operator, which is in
weak relation with the physical processes taking place in the camera, including at least the blur induced by the optical system, down-sampling, and camera noise (Fig. 1.3a), as described in Heide et al. (2013).
Fig. 1.3 Image formation model: (a) comprised of blur, down-sampling, and additive noise; (b) using information from multiple LR frames with subpixel shift to recover a single HR frame
Although the example in the drawing is quite exaggerated, it emphasizes three simple yet important ideas:
1. If we want to make a real reconstruction, and not a mere guess, it would be very useful to get more than one observed image (Fig. 1.3b). The number of frames used for reconstruction should grow as the square of the down-sampling factor.
2. The more blurred the image is, the higher the noise, and the bigger the down-sampling factor, the harder it is to reconstruct the original image. There exist both theoretical estimates, like those presented in Baker and Kanade (2002) and Lin et al. (2008), and practical observations of when the solution of the SR reconstruction problem really makes sense.
3. If we know the blur kernel and noise parameters (and the accurate noise model), the chances of successful reconstruction will increase.
The first idea is an immediate step towards multi-frame SR, which is the main topic of this chapter. In this case, a threefold model (blur, down-sampling, noise) becomes insufficient, and we need to consider an additional warp operator, which describes a spatial transform applied to a high-resolution image before applying blur, down-sampling, and noise. This operator is related to camera eigenmotions and object motion. In some applications, the warp operator can be naturally derived from the problem itself, e.g. in astronomy, the global motion of the sky sphere is known. In sensor-shift SR, which is now being implemented not only in professional products like Hasselblad cameras but also in many consumer-level devices like Olympus, Pentax, and some others, a set of sensor shifts is implemented in the
hardware, and these shifts are known by design. But in cases targeting consumer cameras, estimation of the warp operator (or motion estimation) becomes a separate problem. Most of the papers, such as Heide et al. (2014), consider only translational models, while others turn to more complex parametric models, like Rochefort et al. (2006), who assume globally affine motion, or Fazli and Fathi (2015) as well as Kanaev and Miller (2013), who consider a motion model in the form of optical flow.
Thus, the multi-frame SR problem can be formulated as the reconstruction of a high-resolution image X from several observed low-resolution images Y_i, where the image formation model is described by Y_i = W_i X + η_i, ∀ i = 1, ..., k, where W_i is the i-th image formation operator and η_i is additive noise. Operators W_i can be composed out of warp M_i, blur G_i, and decimation (down-sampling) D for a single-channel SR problem:

W_i = D G_i M_i.

In Bodduna and Weickert (2017), along with this popular physically motivated warp-blur-down-sample model, it was proposed to use a less popular yet more effective for practical purposes blur-warp-down-sample model, i.e. W_i = D M_i G_i. In cases like those described in Park et al. (2008) or Gutierrez and Callico (2016), when different systems are used to obtain different shots, the blur operators G_i can be different for each observed frame, but using the same camera system is more common, so it is enough to use a single blur matrix G. In any case, for spatially invariant warp and blur operators, the blur and warp operators commute, so there is no difference between these two approaches. The pre-blur (before warping) model compared to the post-blur model also allows us to concentrate on finding GX instead of X. In Farsiu et al. (2004), a more detailed model containing two separate blur operators, responsible for camera blur G_cam and atmospheric blur G_atm, is considered:

W_i = D G_cam M_i G_atm.

But in many consumer applications, unlike astronomical applications, atmospheric blur may be considered negligible. The formulations above can be sufficient for images obtained by a monochromatic camera or a three-CCD camera, but more often the observed images are obtained using systems equipped with a colour filter array (CFA). Some popular types of arrays are shown in Fig. 1.4. The Bayer array is only one of the possible CFAs, but from the computational point of view, it makes no difference which particular array type to consider, so without any further loss of generality, we are going to stick to the Bayer model. During the development of the algorithm, it makes sense to concentrate only on one specific pattern, because in Liu et al. (2019), it was shown that it is quite a straightforward procedure to convert solutions for different types of Bayer patterns using only shifts and reflections of the input and output images. In Bayer image formation, the model includes the Bayer
Fig. 1.4 Types of colour filter arrays
decimation operator B. Thus the degradation operator can be written as W_i = B D G M_i. This model can be used for reconstruction from the Bayer domain (joint demosaicing and super-resolution) as proposed in Heide et al. (2013, 2014) and repeated by Petrova et al. (2017) and Glazistov and Petrova (2017). Slight modifications to the image formation model can be used for other image processing tasks, e.g. W_i = B G M_i can be used for Bayer reconstruction from multiple images, W_i = B G for demosaicing, W_i = G for deblurring, and W_i = D_y G M_i, where D_y is the down-sampling operator in the vertical dimension, can be used for de-interlacing.
As for the noise model η_i, rather often researchers consider a simple Gaussian model, but those who are seeking a more physically motivated approach refer to the brightness-dependent model as described in Foi et al. (2007, 2008), Azzari and Foi (2014), Rakhshanfar and Amer (2016), Pyatykh and Hesser (2014), and Sutour et al. (2015, 2016). In many papers, the authors assume that camera noise can be represented as a sum of two independent components: signal-dependent Poissonian shot noise η_p and signal-independent Gaussian electronic read-out noise η_g. The Poissonian term represents a random number of photons registered by the camera system. The flux intensity increases proportionally to the image brightness. Thus, the registered pixel values are described as y = Wx + η_p(Wx) + η_g. In the case of a relatively intensive photon flux, the Poissonian distribution is accurately enough approximated by a Gaussian distribution. That is why the standard deviation of camera noise can be described by a heteroscedastic Gaussian model

y = Wx + √(a·Wx + b)·ξ,

where ξ is sampled from a Gaussian distribution with a zero mean and unit standard deviation. In Kuan et al. (1985), Liu et al. (2013, 2014), as well as Zhang et al. (2018), the generalized noise model

y = Wx + (Wx)^γ·√a·η + √b·ξ

is considered, where η is also Gaussian noise with a zero mean and unit standard deviation, independent of ξ.
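To make the heteroscedastic model above concrete, the following minimal NumPy sketch applies it to a flat test patch. The parameter values a and b, the image range [0, 1], and the clipping are illustrative assumptions made for the example, not values taken from the chapter.

```python
import numpy as np

def add_camera_noise(wx, a=1e-3, b=1e-5, rng=None):
    """Apply the heteroscedastic Gaussian model y = Wx + sqrt(a*Wx + b) * xi.

    wx : noise-free observation W x, values assumed in [0, 1]
    a  : slope of the signal-dependent (Poissonian) variance term
    b  : signal-independent (read-out) variance term
    """
    rng = np.random.default_rng() if rng is None else rng
    xi = rng.standard_normal(wx.shape)               # zero-mean, unit-variance noise
    sigma = np.sqrt(np.clip(a * wx + b, 0.0, None))  # brightness-dependent std
    return np.clip(wx + sigma * xi, 0.0, 1.0)

# Example: the noise level grows with brightness, as the model intends.
flat = np.full((64, 64), 0.8)
noisy = add_camera_noise(flat, a=1e-3, b=1e-5)
print(noisy.std())   # close to sqrt(1e-3 * 0.8 + 1e-5), about 0.028
```

With these settings the measured standard deviation on a patch of brightness 0.8 comes out near 0.028, i.e. brighter regions are noisier, which is exactly the behaviour the Poissonian-Gaussian model is meant to capture.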
Parameters a and b of the noise model depend on the camera model and shooting conditions, such as gain or ISO (which is usually provided in metadata) and the exposure time. Although there is abundant literature covering quite sophisticated procedures of noise model parameter estimation, for researchers focused on the SR problem per se, it is more efficient to refer to data provided by camera manufacturers, like the NoiseProfile field described in the DNG specification (2012). Camera 2 API of Android devices is also expected to provide this kind of information. There also exist quite simple computational methods to estimate the parameters of the noise model using multiple shots in a controlled environment. Within the machine learning approach, the image formation model described above is almost sufficient to generate artificial observations out of the “perfect” input data to be used as the training dataset. Still, to obtain an even more realistic formation model, some researchers like Brooks et al. (2019) consider also the colour and brightness transformation taking place in the image signal processing (ISP) unit: white balance, colour correction matrix, gamma compression, and tone mapping. Some researchers like Segall et al. (2002), Ma et al. (2019), and Zhang and Sze (2017) go even further and consider the reconstruction problem for compressed video, but this approach is more related to the image and video compression field rather than image reconstruction for consumer cameras.
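The image formation model discussed above is essentially a recipe for generating synthetic observations from a "perfect" input. The sketch below strings the operators together for one observation Y_i = B D G M_i X + η_i using SciPy primitives; the translational warp, the Gaussian blur, the RGGB pattern, and the noise parameters are all illustrative assumptions rather than the settings used by the authors.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift as nd_shift

def degrade(hr_rgb, dx, dy, s=2, blur_sigma=1.0, a=1e-3, b=1e-5, rng=None):
    """Generate one LR Bayer observation Y_i = B D G M_i X + eta_i.

    hr_rgb : (H, W, 3) float image in [0, 1] (the latent HR frame X)
    dx, dy : translational warp of the i-th frame, in HR pixels
    s      : down-sampling factor
    """
    rng = np.random.default_rng() if rng is None else rng
    # M_i: translational warp (sub-pixel shifts are interpolated)
    warped = np.stack([nd_shift(hr_rgb[..., c], (dy, dx), mode='wrap')
                       for c in range(3)], axis=-1)
    # G: optical blur, then D: down-sampling
    blurred = gaussian_filter(warped, sigma=(blur_sigma, blur_sigma, 0))
    lr = blurred[::s, ::s, :]
    # B: Bayer sampling (an RGGB pattern is assumed here)
    bayer = np.empty(lr.shape[:2])
    bayer[0::2, 0::2] = lr[0::2, 0::2, 0]   # R
    bayer[0::2, 1::2] = lr[0::2, 1::2, 1]   # G1
    bayer[1::2, 0::2] = lr[1::2, 0::2, 1]   # G2
    bayer[1::2, 1::2] = lr[1::2, 1::2, 2]   # B
    # eta_i: heteroscedastic camera noise
    noise = np.sqrt(a * bayer + b) * rng.standard_normal(bayer.shape)
    return np.clip(bayer + noise, 0.0, 1.0)

hr = np.random.default_rng(0).random((128, 128, 3))
observations = [degrade(hr, dx, dy) for dx, dy in [(0, 0), (0.5, 0.25), (1.0, 0.75)]]
```

A loop over several sub-pixel shifts, as in the last line, is how a multi-frame training pair or a synthetic test set can be produced from a single HR frame.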
1.1.3 Optimization Criteria
The SR image reconstruction problem can be tackled within a data-driven approach based on an elaborate prior model or a data-agnostic approach based on solution of the inverse problem. In the data-driven approach, researchers propose some structure of the algorithm S depending on parameters P. The algorithm takes one or several observed low-resolution frames as input and provides an estimate of the high-resolution frame X = S(P, Y) or X = S(P, Y_1, ⋯, Y_k) as output. The algorithm itself is usually non-iterative and computationally quite cheap, while the main computational burden lies in algorithm parameter estimation. Thus, given some training data T, consisting of t pairs of high-resolution images and corresponding low-resolution observations T = {(X^u, Y_1^u, ⋯, Y_k^u) | Y_i^u = W_i^u X^u + η_i^u}, u = 1..t, the parameters can be found as

P = argmin_P Σ_{u=1}^{t} F_data( S(P, Y_1^u, ⋯, Y_k^u) − X^u )
u¼1
for the multi-frame problem setting. For the single-frame problem setting (single-frame super-resolution, known as SISR), at least for now, this formulation is the most popular among researchers practising machine learning. In SISR, a training database consists of pairs T = {(X^u, Y^u) | Y^u = W X^u + η^u}, u = 1..t, and the model parameters are found as

P = argmin_P Σ_{u=1}^{t} F_data( S(P, Y^u) − X^u ).
Even in modern research like Zhang et al. (2018), the data term F_data often remains very simple, being just an L2 norm, PSNR (peak signal-to-noise ratio), or SSIM (Structural Similarity Index). Although more sophisticated losses, like the perceptual loss proposed in Johnson et al. (2016), are widely used in research, in major super-resolution competitions like NTIRE, the outcomes of which were discussed by Cai et al. (2019), the winners are still determined according to the PSNR and SSIM, which are still considered to be the most objective quantitative measures. By the way, we can consider as representatives of the data-driven approach not only computationally expensive CNN-based approaches but also advanced edge-driven methods like those covered in Sect. 1.1.2 or well-known single-frame SR algorithms like A+ described by Timofte et al. (2014) and RAISR covered in Romano et al. (2017).
The data-agnostic approach makes minimum assumptions about output images and solves an optimization problem for each particular input. Considering a threefold (blur, down-sampling, noise), fourfold (warp, blur, down-sampling, noise), or extended fourfold (warp, blur, down-sampling, Bayer down-sampling, noise) image formation model automatically suggests the SR problem formulation

X = argmin_X Σ_{i=1}^{k} F_data(W_i X − Y_i) + F_reg(X) − Σ_{i=1}^{k} log L_noise(W_i X − Y_i),
where Fdata is a data fidelity term, Freg is the regularization term responsible for data smoothness, and Lnoise is the likelihood term for the noise model. When developing an algorithm targeting a specific input device, it is reasonable to assume that we know something about the noise model, possibly more than about the input data. Also, when the noise model is Gaussian, the problem is sufficiently described by the square data term alone, but a more complex noise model would require a separate noise term. In MF SR, the regularization term is indispensable, because the input data is often insufficient for unique reconstruction. More than that, problem conditioning depends on a “lucky” or “unhappy” combination of warp operators, and if these warps are close to identical, the condition number will be very large even for numerous available observations (Fig. 1.5). The data fidelity term is usually chosen as the L1 or L2 norm, but most of the research in SR is focused on the problem with a quadratic data fidelity term and total variation (TV) regularization term. Different types of norms, regularization terms, and corresponding solvers are described in detail by Heide et al. (2014). In the case of a non-linear and particularly non-convex form of the regularization term, the only way to find a solution is an iterative approach, which may be prohibitive for real-time implementation.
Fig. 1.5 “Lucky” (observed points cover different locations) and “unhappy” (observed points are all the same) data for super-resolution
In Kanaev and Miller (2013), an anisotropic smoothness term is used for regularization.
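As a minimal illustration of the data-agnostic formulation with a quadratic data term and a quadratic smoothness term, the sketch below solves the resulting normal equations with matrix-free conjugate gradients. The integer warps, the Gaussian blur, the high-pass regularizer, and the toy image are assumptions made purely for the example; they stand in for whichever operators a real pipeline would estimate.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.sparse.linalg import LinearOperator, cg

s, n, lam = 2, 64, 0.05                      # magnification, LR size, reg. weight
shifts = [(0, 0), (1, 0), (0, 1), (1, 1)]    # integer HR-grid shifts of the k frames
N = (n * s) ** 2                             # number of HR unknowns

def W(x, d):                                 # forward operator: warp, blur, down-sample
    x = np.roll(x, d, axis=(0, 1))
    x = gaussian_filter(x, 1.0, mode='wrap')
    return x[::s, ::s]

def Wt(y, d):                                # adjoint: zero-insertion, blur, inverse warp
    x = np.zeros((n * s, n * s))
    x[::s, ::s] = y
    x = gaussian_filter(x, 1.0, mode='wrap')
    return np.roll(x, (-d[0], -d[1]), axis=(0, 1))

def H(x):                                    # self-adjoint high-pass regularizer
    return x - gaussian_filter(x, 1.0, mode='wrap')

def A_mv(v):                                 # v -> (sum_i W_i^T W_i + lam^2 H^T H) v
    x = v.reshape(n * s, n * s)
    out = sum(Wt(W(x, d), d) for d in shifts) + lam ** 2 * H(H(x))
    return out.ravel()

rng = np.random.default_rng(0)
gt = gaussian_filter(rng.random((n * s, n * s)), 2.0, mode='wrap')   # toy HR image
ys = [W(gt, d) + 1e-3 * rng.standard_normal((n, n)) for d in shifts]
rhs = sum(Wt(y, d) for y, d in zip(ys, shifts)).ravel()
x_hat, info = cg(LinearOperator((N, N), matvec=A_mv, dtype=np.float64), rhs, maxiter=200)
print(info, np.abs(x_hat.reshape(n * s, n * s) - gt).mean())
```

Iterative solvers of this kind are what the data-agnostic formulation leads to in general; the fast non-iterative approach developed in the rest of this chapter exists precisely to avoid such loops at run time.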
1.2 Fast Approaches to Super-Resolution
1.2.1 Problem Formulation Allowing Fast Implementation
Let us consider a simple L2-L2 problem with F_reg(X) = λ²(HX)*(HX), where H is a convolution operator. This makes it possible to solve the SR problem as the linear equation

Â X = W* Y,

where Â = W*W + λ²H*H, W = (W_1, ..., W_k), Y = (Y_1, ..., Y_k). It is important to mention that this type of problem can be treated by the very fast shift-and-add approach covered in Farsiu et al. (2003) and Ismaeil et al. (2013), which allows the reconstruction of the blurred high-resolution image using averaging of shifted pixels from low-resolution frames. Unfortunately, this approach leaves the filling of the remaining holes to the subsequent deblur sub-algorithm on an irregular grid. This means that the main computational burden is being transferred to the next pipeline stage and remains quite demanding.
As an image formation model, we are going to use W_i = D G M_i and W_i = B D G M_i. It is possible to make this problem even narrower and assume each warp M_i and blur G to be space-invariant. This limitation is quite reasonable when processing a small image patch, as has been stated in Robinson et al. (2009). In this case, Â is known to be reducible to the block diagonal form. This fact is intensively exploited in papers on fast super-resolution by Robinson et al. (2009, 2010), Sroubek et al. (2011), and Zhao et al. (2016). Similar results for the image formation model with
Bayer down-sampling were presented by Petrova et al. (2017) and Glazistov and Petrova (2018). The warp operators W_i are assumed to be already estimated with sufficient accuracy. Besides, we assume that the motion is some subpixel circular translation, which is a reasonable assumption that holds in a small image block. We consider a simple Gaussian noise model with the same sigma value for all the observed frames, η_i = η. Within the L2-L2 formulation, the Gaussian noise model means that minimising the data fidelity term also minimizes the noise term, so we can consider a significantly simplified problem, i.e.

X = argmin_X Σ_{i=1}^{k} F_data(W_i X − Y_i) + λ²(HX)*(HX).
We assume a high-resolution image to be represented as a stack X = (R^T, G^T, B^T)^T, where R, G, B are vectorized colour channels. Low-resolution images are represented as Y = (Y_1^T, ..., Y_k^T)^T, where Y_i is the vectorized i-th observation. As long as we assume Bayer input, each of the input observations has only one channel. So, if we consider observed images (or, we should rather say, the input image block) of size n × m, then Y_i ∈ R^{mn} and X ∈ R^{3s²mn}, where s is a magnification factor. We have chosen the regularization operator H so that it can both control the image smoothness within each colour channel and also mitigate the colour artefacts. The main idea behind reducing the colour artefacts is the requirement that the image gradient in the different colour channels should change simultaneously. This idea is described in detail in the literature related to demosaicing, like Malvar et al. (2004). Thus, we compose operator H from intra-channel and cross-channel sub-operators H_I and H_C mapping from R^{s²mn} to R^{s²mn} as

H = [ Ĥ_I ; Ĥ_C ],  where  Ĥ_I = [ H_I 0 0 ; 0 H_I 0 ; 0 0 H_I ],  Ĥ_C = [ H_C −H_C 0 ; 0 H_C −H_C ; H_C 0 −H_C ].

We took H_I in the form of a 2D convolution operator with the kernel

[ −1/8 −1/8 −1/8 ; −1/8 1 −1/8 ; −1/8 −1/8 −1/8 ]

and H_C = (1/√γ)·H_I. As will be shown in Sect. 1.2.2, such
problem formulation is very convenient for analysis using the apparatus of structured matrices.
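A small sketch of how such a regularization term can be evaluated for an RGB stack is given below. It assumes the pairwise-difference structure of Ĥ_C described above and an arbitrary value of γ, and is meant only to illustrate how the intra-channel smoothness and cross-channel penalties combine; it is not the production implementation.

```python
import numpy as np
from scipy.ndimage import convolve

# Intra-channel kernel from the text: centre 1, ring of -1/8 (a high-pass filter).
K = np.full((3, 3), -1.0 / 8.0)
K[1, 1] = 1.0

def reg_energy(x_rgb, gamma=4.0):
    """||H X||^2 split into smoothness and cross-channel terms for an RGB stack."""
    hi = np.stack([convolve(x_rgb[..., c], K, mode='wrap') for c in range(3)], -1)
    intra = np.sum(hi ** 2)
    # H_C applies the same high-pass filter to pairwise channel differences,
    # scaled by 1/sqrt(gamma): gradients should change simultaneously in R, G, B.
    pairs = [(0, 1), (1, 2), (0, 2)]
    cross = sum(np.sum((hi[..., i] - hi[..., j]) ** 2) for i, j in pairs) / gamma
    return intra + cross

x = np.random.default_rng(0).random((32, 32, 3))
print(reg_energy(x))
```

Because every ingredient is a convolution, the whole term stays linear in X, which is what makes the closed-form and FFT-based treatment in the next subsection possible.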
1.2.2 Block Diagonalization in Super-Resolution Problems
The apparatus of structured matrices fits well the linear optimization problems arising in image processing (and also as a linear inner loop inside non-linear algorithms). The mathematical foundations of this approach were described in
Voevodin and Tyrtyshnikov (1987). A detailed English description of the main relevant results can be found in Trench (2009) and Benzi et al. (2016). Application of this apparatus for super-resolution is covered in Robinson et al. (2009, 2010), Sroubek et al. (2011), and Zhao et al. (2016). An excellent presentation of the matrix structure for the deblur problem, which has a lot of common traits with SR, is offered by Hansen et al. (2006).
The key point of these approaches is the block diagonalization of the problem matrix. The widely known results about the diagonalization of circulant matrices can be extended to block diagonalization of some non-circulant matrices, which still share some structural similarity in certain aspects with circulant matrices. The basic idea of how to find this structural similarity of matrices arising from the quadratic super-resolution problem is sketched below. Let us start the discussion with 1D and 2D single-channel SR problems and use them as building blocks to construct the description of the Bayer SR problem with pronounced structural properties encouraging a straightforward block diagonalization procedure and allowing a fast solution based on the fast Fourier transform.
First, we are going to recollect some basic facts from linear algebra that are essential in the apparatus of structured matrices. We will use extensively a permutation matrix P obtained from the identity matrix by row permutation. Left multiplication of the matrix M by P causes permutation of rows, while right multiplication by Q = P^T leads to the same permutation of columns. Permutation matrices P and Q are orthogonal: P^{−1} = P^T, Q^{−1} = Q^T. The notation P^u will mean a cyclic shift by u, providing (P^u)* = (P^u)^{−1} = P^{−u}. A perfect shuffle matrix Π_{n1,n2} corresponds to the transposition of a rectangular matrix of size n1 × n2 in vectorized form. Notes on the application of a perfect shuffle matrix to structured matrices can be found in Benzi et al. (2016). This is an n × n permutation matrix, where n = n1·n2, and the element with indices i, j is one if and only if i and j can be presented as i − 1 = α2·n1 + α1, j − 1 = α1·n2 + α2 for some integers α1, α2 with 0 ≤ α1 ≤ n1 − 1, 0 ≤ α2 ≤ n2 − 1. An explicit formula for the matrix of a Fourier transform of size n × n will also be used:

F_n = [ 1 1 ⋯ 1 1 ; 1 E_n^{1·1} ⋯ E_n^{1·(n−2)} E_n^{1·(n−1)} ; 1 E_n^{2·1} ⋯ E_n^{2·(n−2)} E_n^{2·(n−1)} ; ⋮ ; 1 E_n^{(n−1)·1} ⋯ E_n^{(n−1)(n−2)} E_n^{(n−1)(n−1)} ],
where E_n = e^{−2πi/n}. The Fourier matrix and its conjugate satisfy
F_n F_n* = F_n* F_n = n·I_n.
Definition 1 A circulant matrix is a matrix with a special structure, where every row is a right cyclic shift of the row above:

A = [ a_1 a_2 ⋯ a_n ; a_n a_1 ⋯ a_{n−1} ; ⋱ ; a_2 a_3 ⋯ a_1 ]
...
a1
and corresponds to 1D convolution with cyclic boundary conditions. Circulant matrices are invariant under cyclic permutations: ∀A ∈ ℂ ⇒ A = (P^u)^T A P^u. The class of circulant matrices of size n × n is denoted by ℂ_n; so, we can write A ∈ ℂ_n. A circulant matrix is defined by a single row (or column) a = [a_1, a_2, ..., a_n]. It can be transformed to diagonal form by the Fourier transform: ∀A ∈ ℂ_n ⇒ A = (1/n)·F_n* Λ_n F_n. All circulant matrices of the same size commute. Many matrices used below are circulant ones, i.e. matrices corresponding to one-dimensional convolution with cyclic boundary conditions.
Since we are going to deal with two or more dimensions, the Kronecker product ⊗ becomes an important tool. Properties of the Kronecker product that may be useful for further derivations are summarized in Zhang and Ding (2013) and several other educational mathematical papers. An operator that down-samples a vector of length n by the factor s can be written as D_s = I_{n/s} ⊗ e_{1,s}^T, where e_{1,s}^T is the first row of the identity matrix I_s. Suppose a two-dimensional n × n array is given:

X_matr = [ x_11 x_12 ⋯ x_1n ; x_21 x_22 ⋯ x_2n ; ⋮ ; x_n1 x_n2 ⋯ x_nn ].
xn2
. . . xnn
In vectorized form, this can be written as X^T = [x_11, x_21, ..., x_n1, x_12, x_22, ..., x_n2, ..., x_1n, ..., x_nn].
If A_n is a 1D convolution operator from R^n to R^n with coefficients a_i, i = 1, ..., n, then I_n ⊗ A_n applied to X will correspond to row-wise convolution acting from R^{n²} to R^{n²}, and A_n ⊗ I_n to column-wise convolution with this filter. For two row-wise and column-wise convolution operators A_n and B_n, the operator A_n ⊗ B_n will be a 2D separable convolution operator from R^{n²} to R^{n²} for vectorized n × n arrays due to the following property of the Kronecker product:

(A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD).
For example, 2D down-sampling by factor s will be

D_{s,s} = D_s ⊗ D_s = (I_{n/s} ⊗ e_{1,s}^T) ⊗ (I_{n/s} ⊗ e_{1,s}^T).
O
Two-dimensional non-separable convolution (warp and blur) operators are block circulant with circulant blocks, or BCCB, according to the notation from Hansen et al. (2006). This can be expressed via a sum of Kronecker products of 1D convolution operators: for every such A there exist N_i ∈ ℂ_n, M_i ∈ ℂ_m, i = 1, ..., r, such that A = Σ_{i=1}^{r} N_i ⊗ M_i.
A BCCB matrix can be easily transformed to block diagonal form: 8A 2 ℂn ℂm ) A ¼
O O
1 F m Λ F n Fm , F mn n
N M P are diagonal matrices of eigenvalues of Λi and ΛNi , ΛM where Λ ¼ ri¼1 ΛNi i matrices Ni and Mi . Although BCCB matrices and their properties are extensively covered in the literature, matrices arising from the SR problem (especially the Bayer case) are more complicated, and this paper will borrow a more general concept of the matrix class from Voevodin and Tyrtyshnikov (1987) to deal with them in a simple and unified manner. Definition 2 A matrix class is a linear subspace of square matrices. Matrix A with elements ai, j : i, j ¼ 1, . . ., n belongs to matrix class described by numbers ðqÞ aij , q 2 Q if it satisfies X ðqÞ aij aij ¼ 0: i, j
This definition is narrower than in the original work, which allows a non-zero constant on the right-hand side and considers also rectangular matrices, but this modification makes a definition more relevant to the problem under consideration. We are interested in ℂn(circulant), (general, Q ¼ ∅), and n (diagonal) classes of square matrices of size n n. The Kronecker product produces bi-level matrices of N class 1 2 from matrices 1 2 1 2 1 2 1 from classes and : 8M 2 , 8M 2 ) M M 2 2 1 2 . Here, 1 2 is called an outer class and an inner class. Saying A 2 simply means that each block of A belongs to class : Multilevel classes like 1 2 , . . . , i , . . . , j , . . . , k can also be constructed. Furthermore, we’ll show that matrices related to the SR problem belong to certain multilevel classes 1 2 , . . . , i iþ1 , . . . , j jþ1 , . . . , k containing several diagonal block classes, where i stands for some non-diagonal types. Then it will be shown how matrices from this class can be transformed to block diagonal form (by grouping together diagonal subclasses). Detailed proofs of the facts below can be found in Glazistov and Petrova (2017, 2018). b can be expanded as Let’s start with the 1D case. Single-channel SR matrix A b¼ A
k X
! M i G D DGM i
þ λ2 H H:
i¼1
In 1D, Mi and H being convolution operators provides M i , G, H 2 ℂn , D ¼ Ds ¼ I n=s
O
eT1,s :
b constructed as described above satisfies Matrix A b ¼ F ΛA F n , A n where ΛA 2 s n=s . In the 2D case, the warp, blur, and regularization operators become M i , G, H 2 ℂn ℂn , D ¼ Ds,s ¼ Ds
O
Ds :
N N b will satisfy A b ¼ F F ΛA ðF n F n Þ, Such matrix A n n s n=s s n=s . b can be expanded as In the Bayer case, matrix A b¼ A
k X i¼1
where
! e D eM e i G e B BD eG ei M
e H, e þ λ2 H
where ΛA 2
14
X. Y. Petrova
e ¼ I3 D
O
e ¼ I3 Ds,s , G
2
D2,2 6 D P1,1 6 2,2 B¼6 4 0 0 2
Hg 6 0 6 6 6 0 e ¼6 H 6H 6 c1 6 4 H c2 0
O
e i ¼ I3 G, M
0 0 D2,2 P1,0 0 0 Hb 0 H c1 0 H c3
O
0 0 0 D2,2 P0,1 3 0 0 7 7 7 Hr 7 7, 0 7 7 7 H c2 5
Mi,
3 7 7 7, 5
H c3
and Pu, v is a 2D cyclic shift by u columns and v rows. Submatrices from the expression above satisfy M i , G, H r , H g , H b , H c1 , H c2 , H c3 2 ℂn ℂn : Bayer down-sampling operator B extracts and stacks channels G1, G2, R, and B from the pattern in Fig. 1.6. b constructed as described As proven in Glazistov and Petrova (2018), the matrix A above satisfies O O O O
b ¼ I3 F n F n ΛA I 3 Fn Fn , A where ΛA 2 3 2s 2sn 2s 2sn . After characterizing the matrix in terms of matrix class, it becomes possible to prescind from the original problem setting and focus on matrix class transformations. In the papers relying on block diagonalization of BCCB matrices, like Sroubek et al. (2011), it is usually only noted that certain matrices can be transformed to and no explicit transforms are provided, probably because it’s hard to express the formula, but thanks to the apparatus of the structured matrices it becomes easy to obtain closed-form permutation matrices transforming from classes s n=s , s n=s s n=s , and 3 2s n=ð2sÞ 2s n=ð2sÞ to block diagonal form n=s s , n2 =s2 s2 , and n2 =ð4s2 Þ 12s2 , respectively. Fig. 1.6 Pixel enumeration in Bayer pattern
1 Super-Resolution: 1. Multi-Frame-Based Approach
15
Fig. 1.7 Swapping matrix classes by permutation of rows and columns
If we want to rearrange the matrix elements in order to swap matrix classes, as shown in in Fig. 1.7, the following theorem from Voevodin and Tyrtyshnikov (1987) can be applied: Theorem 1.1 Let 1 and 2 be two classes of n n and m m matrices, respectively. Then 8A 2 1 2 : Π Tn,m AΠ n,m 2 2 1 , where Π n, m is a perfect shuffle. A perfect shuffle matrix is used in the property of the Kronecker product, as was shown in Benzi et al.N(2016). If A and NB are matrices of size n n and m m, respectively, then ðB AÞ ¼ ΠTn,m ðA BÞΠn,m , but Theorem 1.1 is slightly more general. It can be applied to any class, including n m N or n m , which cannot necessarily be expressed as a single Kronecker product A B. An example of such matrix class is a BCCB matrix. It is also interesting to know which operations preserve the matrix classes on a certain level. Thus, no matter what are the classes 1 of size n n and 2 , 3 of size m m (each of these classes can be multilevel), there exist permutation matrices Pm, Qm, such that if 8A 2 2 ) PTm AQm 2 3 , then the outer class preservation property holds O O
PTm B I n Q m 2 1 3 8B 2 1 2 ) I n The property of inner class preservation can be postulated for any matrix classes and for matrices Pn, Qn, providing 8A 2 2 ) PTn AQn 2 3 : O O
8B 2 1 2 : PTn I m B Qn I m 2 3 2 : This means that each outer block is transformed from class 1 to class 3 , while the inner class 2 remains the same.
16
X. Y. Petrova
Table 1.1 Complexity Problem 1D 2D
Matrix size nn n2 n2
Bayer
3n2 3n2
Blocks n s
n2 s
n 2 2s
Block size ss s2 s2
MI, original n3 n6
MI, reduced ns2 n2s4
12s2 12s2
n6
n2s4
b from the 1D single-channel SR problem can be transformed to a Thus matrix A b Πs,n 2 n=s s : block diagonal form as follows: ΠTs,n F n AF n s s In the 2D SR problem, we can apply the class swapping operation twice and N N b to block diagonal form: for each A b ¼ F F ΛA ðF n F n Þ, where convert A n n ΛA 2 s n=s s n=s , it holds that O O O
O O
O
b F ΠTs,n I s ΠTns,n F n Fn A F n Πns,ns I ns Πs,ns Is I ns n s
s
2 n2 s2 : s2
b b¼ A arising from the Bayer SR problem satisfying A Matrix N N N N n n where ΛA 2 3 2s 2s 2s 2s , can be I 3 F n F n ΛA ðI 3 F n F n Þ, transformed to block diagonal form in a similar way: O O
O O
O
b I 2sn ΠT2s, n I 2s ΠT2ns, n Fn F n A I3 ΠT3,n2 I 3 2s 2s O O O O O
F n F n I 3 Π2ns,2sn I 2sn Π2s,2sn I 2s Π3,n2 2 n2 12s2 : I3 4s2
Table 1.1 summarizes the computational complexity of finding the matrix inverse (marked “MI”) for the 1D, 2D, and Bayer SR problems. Block diagonalization made it possible to reduce the complexity of the 2D and Bayer SR problems from O n6 to O ðn2 s4 Þ þ O ðn2 log nÞ, where n2 log n corresponds to the complexity of the block diagonalization process itself. Typically, n is much larger than s (as n ¼ 16, . . ., 32, s ¼ 2, . . ., 4), which provides significant economy.
1.2.3 Filter-Bank Implementation
It is quite a common idea in image processing to implement the pixel processing algorithm in the form of a filter-bank, where each output pixel is produced by applying some filter selected according to some local features. For example, Romano et al. (2017) solved the SR problem by applying pretrained filters (s2 phases for magnification factor s, which naturally follows from selecting a down-sampling grid as shown in Fig. 1.2b) grouped into 216 “buckets” (24 gradations of angle, 3 gradations of edge strength, and 3 gradations of coherence). All these characteristics were
computed using a structure tensor. In the multi-frame case, we need to merge information from several, say k, frames, which have different mutual displacements. This means that for each pixel we should consider k filters, and the filter-bank should store entries for different combinations of mutual displacement.
Let us start with the applicability of the filter-bank approach for the L2-L2 problem setting. In this case, the solution can be written in closed form as X = Â^{−1} W* Y. This means that it is possible to precompute the set of matrices A = Â^{−1} W* for all possible shifts between LR frames and later use them for SR reconstruction. Processing can be performed on a small image block where it is safe to assume that displacement is translational and space-invariant for each pixel of the block. Also, it is quite reasonable to assume that pixel-scale displacement can be compensated in advance, and we should be concerned only with subpixel displacement (in the LR frame) quantized to a certain precision, like 1/4 pixel (which is known to be sufficient for compression applications). This means that for k input frames, if each observed LR image is shifted with respect to a decimated HR image by s·u_k pixels horizontally and by s·v_k pixels vertically, we can express the inverse matrix as a function of 2k scalar values: A = A(u_0, v_0, ..., u_k, v_k), as shown in Fig. 1.8.
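The online stage is therefore nothing more than a bank lookup followed by k local dot products per HR pixel. The sketch below illustrates this data flow for one quantized shift combination; the bank here is filled with random numbers purely to make the example runnable, whereas in the real pipeline it would hold the filters precomputed during the off-line stage.

```python
import numpy as np

def apply_filter_bank(lr_frames, bank, s=2, r=5):
    """Online stage sketch: every HR pixel is a sum of k local dot products.

    lr_frames : list of k LR images (already aligned to integer pixels)
    bank      : array [s, s, k, 2r+1, 2r+1] of precomputed filters for one
                quantized sub-pixel shift combination
    """
    k = len(lr_frames)
    h, w = lr_frames[0].shape
    padded = [np.pad(f, r, mode='reflect') for f in lr_frames]
    hr = np.zeros((h * s, w * s))
    for py in range(s):                 # phase of the HR pixel inside an s x s block
        for px in range(s):
            for i in range(k):
                flt = bank[py, px, i]
                # correlate each LR frame with its filter, then place on the HR grid
                acc = np.zeros((h, w))
                for dy in range(2 * r + 1):
                    for dx in range(2 * r + 1):
                        acc += flt[dy, dx] * padded[i][dy:dy + h, dx:dx + w]
                hr[py::s, px::s] += acc
    return hr

rng = np.random.default_rng(0)
frames = [rng.random((32, 32)) for _ in range(3)]
dummy_bank = rng.standard_normal((2, 2, 3, 11, 11)) * 0.01
print(apply_filter_bank(frames, dummy_bank).shape)   # (64, 64)
```

Because the per-pixel work is a fixed number of multiply-accumulate operations, this stage has linear complexity in the number of output pixels and no iterations.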
Fig. 1.8 Using precomputed inverse matrices for SR reconstruction
Fig. 1.9 Visualization of matrix A and extracting filters
Fig. 1.10 Dependency of the proportion of energy of filter coefficients inside ε-vicinity of the central element on the vicinity size
the central element, averaged for all filters computed for three input frames with ¼ motion quantization. The filters are extracted as shown in Fig. 1.9 during the off-line stage (Fig. 1.11) and applied in the online stage (Fig. 1.12). These images seem self-explanatory, but an additional description can be found in Petrova et al. (2017).
1 Super-Resolution: 1. Multi-Frame-Based Approach
19
Fig. 1.11 Online and off-line parts of the algorithm
Fig. 1.12 Online and off-line parts of the algorithm
1.2.4
Symmetry Properties
In the case of a straightforward implementation of the filter-bank approach, the number of filters that need to be stored is still prohibitively large, e.g. for singlechannel SR with k ¼ 4 frames, magnification s ¼ 4 and ¼ pixel motion quantization (q ¼ 4), even supposing that the displacement of one frame is always zero, the total number of filters will be k ∙ s2 ∙ (q2)k 1 ¼ 262144, and for k ¼ 3 it will be 12,288. For Bayer reconstruction, this number will become 3k ∙ (2s)2 ∙ (q2)k 1.
20
X. Y. Petrova
We will show that by taking into account the symmetries intrinsic to this problem and implementing a smart filter selection scheme, this number can be dramatically reduced. Strict proofs were provided in Glazistov and Petrova (2018), while here only the main results will be listed. Let’s introduce the following transforms: O
O
ϕ1 ðBÞ ¼ J Ps,s B J Ps,s , O
O
ϕ2 ðBÞ ¼ I 3 Px,y B I 3 Px,y , O O O O
Un In B I3 Un In , ϕ3 ðBÞ ¼ I 3 O O O O
In Un B I3 In Un , ϕ4 ðBÞ ¼ I 3 O
O
ΠTn,n B I 3 Πn,n , ϕ5 ð B Þ ¼ I 3 2
0
60 6 where U n ¼ 6 40
... 0 ... 1
1
3
07 7 7 (a permutation matrix of size n n that flips the 05
0 1 . . . 20 0 1 0 6 input vector) and J ¼ 4 0 0
3 0 7 1 5, and Px, y is a 2D circular shift operator, where
0 1 0 x is the horizontal shift and y is the vertical shift. Then the number of stored filters can be reduced using the following properties:
Â(−u₁, −v₁, ⋯, −u_k, −v_k) = φ₁(Â(u₁, v₁, ⋯, u_k, v_k)),
Â(u₁ + x, v₁ + y, ⋯, u_k + x, v_k + y) = φ₂(Â(u₁, v₁, ⋯, u_k, v_k)),
Â(−u₁ − 1, v₁, ⋯, −u_k − 1, v_k) = φ₃(Â(u₁, v₁, ⋯, u_k, v_k)),
Â(u₁, −v₁ − 1, ⋯, u_k, −v_k − 1) = φ₄(Â(u₁, v₁, ⋯, u_k, v_k)),
Â(v₁ + s, u₁ + s, ⋯, v_k + s, u_k + s) = φ₅(Â(u₁, v₁, ⋯, u_k, v_k)).

We can also use the same filters for different permutations of input frames: if σ(i) is any permutation of the indices i = 1, . . ., k, then
Table 1.2 Filter-bank compression using symmetries

Problem | Original size | Compressed size | Compression factor
2D, s = 2 | 16 [3 × 2 × 2 × 16 × 16] | 26 × 16 × 16 | 7.38
2D, s = 4 | 256 [3 × 4 × 4 × 16 × 16] | 300 × 16 × 16 | 40.96
Bayer, s = 2 | 256 [3 × 3 × 4 × 4 × 16 × 16] | 450 × 16 × 16 | 81.92
Bayer, s = 4 | 4096 [3 × 3 × 8 × 8 × 16 × 16] | 25752 × 16 × 16 | 91.62
Â(u₁, v₁, ⋯, u_k, v_k) = Â(u_{σ(1)}, v_{σ(1)}, ⋯, u_{σ(k)}, v_{σ(k)}).

Adding 2s to one of the u_i's or v_i's also does not change the problem:

Â(u₁, v₁, ⋯, u_k, v_k) = Â(u₁, v₁, ⋯, u_{i−1}, v_{i−1}, u_i + 2s, v_i, u_{i+1}, v_{i+1}, ⋯, u_k, v_k),
Â(u₁, v₁, ⋯, u_k, v_k) = Â(u₁, v₁, ⋯, u_{i−1}, v_{i−1}, u_i, v_i + 2s, u_{i+1}, v_{i+1}, ⋯, u_k, v_k).

For some motions u₁, v₁, ⋯, u_k, v_k, certain non-trivial compositions of the transforms φ₁, . . ., φ₅ keep the system invariant:
Â = φ_{i₁}(φ_{i₂}(. . . φ_{i_m}(Â))),

which makes it possible to express some rows of Â by using elements from other rows. This is an additional resource for filter-bank compression. Applying exhaustive search and using the rules listed above, for filter size 16 × 16 and k = 3, we have obtained the filter-bank compression ratios described in Table 1.2. Thus, the number of stored values and the number of problems to be solved during the off-line stage were both reduced. The compression approach increased the complexity of the online stage to a certain extent, but the proposed compression scheme allows a straightforward software implementation based on the index table, which stores appropriate base filters and a list of transforms, encoded in 5 bits, for each possible set of quantized displacements. The apparatus of multilevel matrices can be similarly applied to deblurring, multi-frame deblurring, demosaicing, multi-frame demosaicing, or de-interlacing problems in order to obtain fast FFT-based algorithms similar to those described in Sect. 1.2.2 and to analyse problem symmetries, as was shown in this section for the Bayer SR problem.
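As a purely illustrative sketch of how an index table can be addressed, the snippet below canonicalizes a set of quantized displacements using the common-shift invariance (φ₂), the frame-permutation invariance, and the 2s-periodicity. The flip and transpose symmetries φ₁, φ₃–φ₅ and the exhaustive-search part of the actual scheme in Glazistov and Petrova (2018) are omitted, and the unit conventions are assumptions.

```python
import numpy as np

def canonical_key(shifts, period):
    """Map quantized per-frame shifts (u_i, v_i) to a canonical lookup key.

    shifts : list of (u, v) integer displacements in quantization steps.
    period : the 2s-periodicity expressed in quantization steps
             (e.g. 2 * s * q for 1/q-pixel quantization).
    """
    u0, v0 = shifts[0]
    # phi_2: a common shift of all frames does not change the problem
    rel = [((u - u0) % period, (v - v0) % period) for u, v in shifts]
    # frame permutation invariance: the order of input frames is irrelevant
    return tuple(sorted(rel))

# toy check: three frames, 1/4-pixel steps, s = 4  ->  period = 2*4*4 = 32
a = canonical_key([(3, 1), (7, 9), (0, 0)], period=32)
b = canonical_key([(0, 0), (4, 8), (29, 31)], period=32)  # same motions, shifted and permuted
assert a == b
```

In a full implementation the canonical key would index a table that stores the base filter together with the 5-bit-encoded list of transforms to apply to it.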
1.2.5 Discussion of Results
We have developed a high-quality multi-frame joint demosaicing and SR (Bayer SR) solution which does not use iterations and has linear complexity. A visual
Fig. 1.13 Sample quality on real images: top row demosaicing from Hirakawa and Parks (2005) with subsequent bicubic interpolation; bottom row Bayer SR
Fig. 1.14 Comparison of RGB and Bayer SR: left side demosaicing from Malvar et al. (2004) with post-processing using RGB SR; right side Bayer SR
comparison with the traditional approach is shown in Fig. 1.13. It can also be seen that direct reconstruction from the Bayer domain is visually more pleasing compared to subsequent demosaicing and single-channel SR, as shown in Fig. 1.14. This is the only case when we used for benchmarking a demosaicing algorithm from Malvar et al. (2004), because its design purpose was to minimize colour artefacts, which would be a desirable property for the considered example. In all other cases, we prefer the approach suggested by Hirakawa and Parks (2005), which provides a higher PSNR and more natural-looking results. In Fig. 1.15, we perform a visual comparison with an implementation of an algorithm from Heide et al. (2014), which shows that careful choice of the linear cross-channel regularization term can result in a more visually pleasing image than a non-linear term. We performed a numeric evaluation of the SR algorithm quality on synthetic images in order to concentrate on the core algorithm performance without considering issues of accuracy of motion estimation. We used a test image shown in Fig. 1.16, which contains several challenging areas (from the point of view of demosaicing algorithms). Since the reconstruction quality depends on the displacement between low-resolution frames (worst corner case: all the images with the same displacement), we conducted a statistical experiment with randomly generated motions. Numeric measurements for several experiment conditions are charted in Fig. 1.17. Measurements are made separately for each channel. Experiments with reconstruction from two, three, and four frames were made. Four different
Fig. 1.15 Sample results of joint demosaicing and SR on rendered sample with known translational motion: (a) ground truth; (b) demosaicing from Hirakawa and Parks (2005) + bicubic interpolation; (c) demosaicing from Hirakawa and Parks (2005) + RGB SR; (d) Bayer SR, smaller regularization term; (e) Bayer SR, bigger regularization term; (f) our implementation of Bayer SR from Heide et al. (2014) with cross-channel regularization term from Heide et al. (2013)
Fig. 1.16 Artificial test image
Fig. 1.17 Evaluation results on synthetic test
combinations of multipliers that go with the intra-channel regularization term H and the cross-channel regularization term Hc were used. It is no surprise that, in the green channel, which is sampled more densely, we can observe better PSNR values than in other channels, while the red and blue channels have almost the same reconstruction quality. We can see that different subpixel shifts provide different PSNR values for the same image, which means that, for accurate benchmarking of the multi-frame SR set-up, many possible combinations of motion displacements should be checked. It is possible to find globally optimal regularization parameters, which perform best for both “lucky” and “unhappy” combinations of motions. Obviously, increasing the number of LR frames leads to a higher PSNR, so the number of frames is limited only by the shooting speed and available computational resources. Since the algorithm relies on motion quantization, we evaluated the impact of this factor on the reconstruction quality. The measurement results for the red channel for magnification factor s = 4 are provided in Table 1.3. We used 100 randomly sampled displacements and measured the PSNR. Artificial degradation (Bayer down-sampling) was applied to the test image shown in Fig. 1.16. Cases with two, three, and four input frames, single-channel (RGB SR), and joint demosaicing and SR (Bayer SR) configurations were tested. Since the number of stored filters increases dramatically when increasing the magnification ratio, we also checked the configuration with multi-frame SR with s = 2 followed by bicubic up-scaling. Bicubic up-sampling after demosaicing by Hirakawa and Parks (2005), with 23.0 dB (shown on the bottom line), was considered as a baseline. In RGB SR, two demosaicing methods were evaluated – Hirakawa and Parks (2005) and Malvar et al. (2004). For our purposes, the overall reconstruction quality was on average about 0.6–0.8 dB higher for Hirakawa and Parks (2005) than for Malvar et al.
Table 1.3 Impact of MV rounding and RGB/Bayer SR for 4× magnification, red channel (PSNR, dB)

Domain | Demosaicing method | Configuration | MV rounding | 2 frames | 3 frames | 4 frames
RGB | Malvar et al. (2004) | 4× SR | No | 23.5 | 23.7 | 23.9
RGB | Malvar et al. (2004) | 4× SR | Yes | 23.4 | 23.6 | 23.7
RGB | Hirakawa and Parks (2005) | 4× SR | No | 24.3 | 24.5 | 24.6
RGB | Hirakawa and Parks (2005) | 4× SR | Yes | 24.0 | 24.3 | 24.3
Bayer | N/A | 2× SR + 2× upscale | No | 24.6 | 25.2 | 25.6
Bayer | N/A | 2× SR + 2× upscale | Yes | 23.3 | 23.4 | 23.5
Bayer | N/A | 4× SR | No | 24.9 | 25.7 | 26.3
Bayer | N/A | 4× SR | Yes | 24.5 | 25.2 | 25.6
RGB | Malvar et al. (2004) | 2× SR + 2× upscale | No | 23.5 | 23.7 | 23.9
RGB | Malvar et al. (2004) | 2× SR + 2× upscale | Yes | 22.8 | 22.9 | 23.0
RGB | Hirakawa and Parks (2005) | 2× SR + 2× upscale | No | 24.1 | 24.2 | 24.5
RGB | Hirakawa and Parks (2005) | 2× SR + 2× upscale | Yes | 23.4 | 23.5 | 23.5
RGB | Hirakawa and Parks (2005) | 4× bicubic upscale (baseline) | N/A | 23.0 | 23.0 | 23.0
(2004). Increasing the number of frames from 2 to 4 caused a 0.5 dB increase in the RGB SR set-up and a 1.1–1.4 dB increase in the Bayer SR set-up. As expected, the Bayer SR showed a superior performance, with 26.3 dB on four frames without rounding of the motion vectors and 25.6 dB with rounded motion vectors. MV rounding caused a quality drop of 0.1–0.3 dB for RGB SR and a quality drop of 0.4–0.7 dB for Bayer SR. The configuration with subsequent SR and up-sampling behaves well enough without MV rounding but in the case of rounding can be even inferior to the baseline. Although there is clear evidence of weight decay, we had to evaluate the real impact on the quality of the algorithm output caused by filter truncation. Also, since the results of 4× SR with subsequent 2× downscaling were visually more pleasing than plain 2× SR, we evaluated these configurations numerically. The results are shown in Table 1.4. The bottom line shows the baseline with Hirakawa and Parks (2005) demosaicing followed by 2× bicubic up-sampling, providing 27.95 dB reconstruction quality. We can see that even for the simplest setting for 2× magnification, the difference in PSNR from the baseline is 1.38 dB. Increasing the number of frames from two to four allows us to increase the quality by about 0.9 dB (for 2× SR) to 1.1 dB (4× SR + down-sampling) compared to the corresponding two-frame configuration. We can also see that for each number of observed frames, the 4× SR + down-sampling is about 1.6–1.8 dB better than the corresponding plain 2× SR. The influence of the reduced kernel size (from 16 to 12) is almost negligible and never exceeds 0.15 dB. Finally, in the four-frame set-up, we can see a PSNR increase of 4.17 dB compared to the baseline.
Table 1.4 Impact of the kernel size and comparison of the 4× SR + 2× downscale and 2× SR configurations (PSNR, dB)

Kernel size | Configuration | 2 frames | 3 frames | 4 frames
16 × 16 | 4× SR + 2× downscale | 30.93 | 31.75 | 32.12
16 × 16 | 2× SR | 29.34 | 30.19 | 30.25
14 × 14 | 4× SR + 2× downscale | 30.93 | 31.75 | 32.12
14 × 14 | 2× SR | 29.34 | 30.04 | 30.26
12 × 12 | 4× SR + 2× downscale | 30.93 | 31.72 | 32.12
12 × 12 | 2× SR | 29.33 | 30.05 | 30.25
N/A | Hirakawa and Parks (2005) + 2× bicubic upscale (baseline) | 27.95 | 27.95 | 27.95

1.3 Practical Super-Resolution

1.3.1 System Architecture
The minimalistic system architecture of the SR algorithm was already shown in Fig. 1.12. However, there are additional problems that need to be solved to make an out-of-the-box MF SR solution. Figure 1.18 shows a variant of the implementation of a complete MF SR system based on the filter-bank approach that includes dedicated sub-algorithms for directional processing in edge areas, motion estimation in the Bayer domain, preliminary noise reduction, salience map estimation, blending based on a motion reliability map, and post-processing. The directional processing has the following motivation. We found that the core Bayer SR algorithm performs poorly on edges for some motions, and we developed edge-directional reconstruction filters to be applied in strongly directional areas. These reconstruction filters were computed in the same way as the anisotropic ones, except for the regularization term H, which was implemented as a directional DoG (Difference of Gaussians) filter. The filters were stored in a separate directional filter-bank with special slots for each direction. The local direction was estimated in the Bayer domain image using the structure tensor as described in Sect. 1.3.2. A visual comparison of the results obtained with and without direction adaptation is shown in Fig. 1.19. Earlier, the applicability of the DoG filter was demonstrated also for mask creation and multiplication of the initial image by the created mask for the purpose of blending the initial image with its blurred copy (Safonov et al. 2018, Chap. 11), as well as for the creation and specification of binary maps and for binary sketch generation (Safonov et al. 2019, Chaps. 2 and 13). In order to avoid using pixels from frames where the motion was not estimated accurately, a special reliability map was computed. Obviously, it is based on analysis of the difference between the reference and compensated frames, but in order to obtain the desired behaviour on real images, the process included the operations described below. For each frame, a half-resolution compensated frame was obtained. The
Fig. 1.18 MF SR system architecture Fig. 1.19 Baseline anisotropic reconstruction left; directional reconstruction right
green channel of the input half-resolution frame was estimated as an average of the green sub-channels G1 and G2. The reference half-resolution frame and compensated frame were subjected to white balance, gamma correction, Gaussian smoothing, and subsequent conversion to the LST colour space. This colour space was successfully used in segmentation tasks by Chen et al. (2016) and was reported to be a fast alternative to the Lab space. Luminance, saturation, and tint were computed as
L = (R + G + B)/(√3 · M_rgb),   S = (R − B)/(√2 · M_rgb),   T = (R − 2G + B)/(√6 · M_rgb),

where M_rgb is the maximum of the colour channels. These formulae assume normalized input (i.e. between 0 and 1). Let us denote the reference frame in LST space as f_ref and a compensated frame in LST space as f_k. Then two difference sub-metrics were computed: d₁ = 1 − ((f_ref − f_k) ∗ G)^γ, where ∗ is a convolution operator and G is a Gaussian filter, and d₂ = max(0, SSIM(f_ref, f_k)). The final reliability map was computed as a threshold transform over d₁d₂/(d₁ + d₂) with subsequent nearest neighbour up-sampling to full size. In pixels where motion was detected to be unreliable, a reduced number of frames was used for reconstruction. In Fig. 1.18, the special filter-bank for processing areas which use a reduced number of frames (from 1 to k − 1) is denoted as “fallback”. In order to obtain the final image out of pixels obtained using anisotropic, directional, and partial (fallback) reconstruction filters, a special blending sub-algorithm was implemented. Motion estimation in the Bayer domain is an interesting problem deserving further description, which will be provided in Sect. 1.3.3. The main goal of post-processing is to reduce the colour artefacts. Unfortunately, a cross-channel regularizer that is strong enough to suppress colour artefacts produces undesirable blur as a side effect. In order to apply a lighter cross-channel regularization, an additional colour artefact suppression block was implemented. It converts an output image to YUV space, computes the map of local standard deviations in the Y channel, smooths it using Gaussian filtering, and uses it as the reference channel for cross-bilateral filtering of channels U and V. Then the image with the original values in the Y channel and filtered values in the U and V channels is transferred back to RGB. Since the reconstruction model described above considers only Gaussian noise, it makes sense to implement a separate and more elaborate noise reduction block using accurate information from metadata and the camera noise profile. In the case of higher noise levels, it is possible to use reconstruction filters computed for a higher degree of regularization, but in this case the effect of SR processing shifts from revealing new details to noise reduction, which is a simpler problem that can be solved without the filter-bank approach. In order to achieve a good balance between noise reduction and detail reconstruction, a salience map was applied to control the local strength of the noise reduction. A detailed description of salience map computation along with a description of a Bayer structure tensor is provided in Sect. 1.3.2. A visual comparison of the results obtained with and without salience-based local control of noise reduction is shown in Fig. 1.20. It can be seen that such adaptation provides a better detail level in textured areas and higher noise suppression in flat regions compared to the baseline version.
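A sketch of the reliability computation might look as follows. The LST conversion follows the formulae above; the values of σ, γ and the threshold, the use of an absolute channel-averaged difference in d₁, and the choice of scipy/scikit-image routines are all assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.metrics import structural_similarity

def lst(rgb):
    """Convert a normalized RGB image (H, W, 3) to the LST colour space."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    m = np.maximum(rgb.max(axis=-1), 1e-6)                 # M_rgb, per pixel
    l = (r + g + b) / (np.sqrt(3) * m)
    s = (r - b) / (np.sqrt(2) * m)
    t = (r - 2 * g + b) / (np.sqrt(6) * m)
    return np.stack([l, s, t], axis=-1)

def reliability(f_ref, f_k, sigma=1.5, gamma=0.5, thr=0.8):
    """Per-pixel motion reliability of a compensated frame f_k w.r.t. f_ref.

    Both inputs are half-resolution frames already white-balanced,
    gamma-corrected, smoothed and converted to LST (see lst above).
    """
    diff = np.abs(f_ref - f_k).mean(axis=-1)               # channel-averaged difference
    d1 = np.clip(1.0 - gaussian_filter(diff, sigma) ** gamma, 0.0, 1.0)
    d2 = max(0.0, structural_similarity(f_ref[..., 0], f_k[..., 0], data_range=1.0))
    score = d1 * d2 / (d1 + d2 + 1e-6)
    return score > thr        # boolean map; upscale with nearest neighbour afterwards
```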
Fig. 1.20 Visual comparison of results: left salience-based adaptation off; right salience-based adaptation on
1.3.2 Structure Tensor in Bayer Domain and Salience Map
The structure tensor is a traditional instrument for the estimation of local directionality. The structure tensor is a matrix composed of local gradients of pixel values,

T = [Σ∇x²  Σ∇x∇y; Σ∇x∇y  Σ∇y²],

and the presence of a local directional structure is detected by a threshold transform of the coherence

c = ((λ₊ − λ₋)/(λ₊ + λ₋))²,

which is computed from the larger and smaller eigenvalues of the structure tensor, λ₊ and λ₋, respectively. If the coherence is small, this means that a pixel belongs to a low textured area or to some highly textured area without a single preferred direction. If the coherence is above the threshold, the local direction is collinear to the eigenvector corresponding to the larger eigenvalue. For RGB images, the gradients are obviously computed as ∇x = I_{y,x+1} − I_{y,x−1}, ∇y = I_{y+1,x} − I_{y−1,x}, while Bayer input requires some modifications. The gradients were computed as ∇x = max(∇xR, ∇xG, ∇xB), ∇y = max(∇yR, ∇yG, ∇yB), where the gradients in each channel were computed as shown in Table 1.5. In order to apply texture direction estimation in the filter-bank structure, the angle of the texture direction was quantized into 16 levels (we checked configurations with 8 levels, which provided visibly inferior quality, and 32 levels, which provided a minor improvement over 16 levels but was more demanding from the point of view of the required memory). An example of a direction map is shown in Fig. 1.21. The smaller eigenvalue λ₋ of the structure tensor was also used to compute the salience map. In each pixel location, the value of λ₋ was computed in some local window, and then a threshold transform and normalization were applied: r = (min(max(λ₋, t₁), t₂) − t₁)/(t₂ − t₁). After that, the obtained map was smoothed by a Gaussian filter
Table 1.5 Computation of gradients for Bayer pattern (from Fig. 1.6)

Gradient | At an R position | At a B position
∇xR | (I_{y,x+2} − I_{y,x−2})/2 | (I_{y+1,x+1} + I_{y−1,x+1} − I_{y+1,x−1} − I_{y−1,x−1})/2
∇xG | I_{y,x+1} − I_{y,x−1} | I_{y,x+1} − I_{y,x−1}
∇xB | (I_{y+1,x+1} + I_{y−1,x+1} − I_{y+1,x−1} − I_{y−1,x−1})/2 | (I_{y,x+2} − I_{y,x−2})/2
∇yR | (I_{y+2,x} − I_{y−2,x})/2 | (I_{y+1,x+1} + I_{y+1,x−1} − I_{y−1,x+1} − I_{y−1,x−1})/2
∇yG | I_{y+1,x} − I_{y−1,x} | I_{y+1,x} − I_{y−1,x}
∇yB | (I_{y+1,x+1} + I_{y+1,x−1} − I_{y−1,x+1} − I_{y−1,x−1})/2 | (I_{y+2,x} − I_{y−2,x})/2

Gradient | At a G1 position | At a G2 position
∇xR | I_{y,x+1} − I_{y,x−1} | (I_{y−1,x+2} + I_{y+1,x+2} − I_{y−1,x−2} − I_{y+1,x−2})/4
∇xG | (I_{y,x+2} − I_{y,x−2} + I_{y+1,x+1} + I_{y−1,x+1} − I_{y+1,x−1} − I_{y−1,x−1})/4 | same as at G1
∇xB | (I_{y−1,x+2} + I_{y+1,x+2} − I_{y−1,x−2} − I_{y+1,x−2})/4 | I_{y,x+1} − I_{y,x−1}
∇yR | (I_{y−2,x+1} + I_{y−2,x−1} − I_{y+2,x+1} − I_{y+2,x−1})/4 | I_{y+1,x} − I_{y−1,x}
∇yG | (I_{y+2,x} − I_{y−2,x} + I_{y+1,x+1} + I_{y+1,x−1} − I_{y−1,x+1} − I_{y−1,x−1})/4 | same as at G1
∇yB | I_{y+1,x} − I_{y−1,x} | (I_{y+2,x+1} + I_{y+2,x−1} − I_{y−2,x+1} − I_{y−2,x−1})/4
Fig. 1.21 Direction map constructed using the structure tensor: left input image and right estimated directions shown by the colour wheel as in Baker et al. (2007)
and inverted. The resulting value was used as a texture-dependent coefficient to control the noise reduction.
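The following sketch combines the pieces of this section for a single-channel image. The window size, the thresholds t₁ and t₂, and the use of a Gaussian window are placeholder assumptions; for Bayer input, the per-channel gradients of Table 1.5 would be combined with a maximum first.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor_features(gray, sigma=1.5, t1=1e-4, t2=1e-2, n_dirs=16):
    """Coherence, quantized direction and salience from the structure tensor."""
    gray = np.asarray(gray, dtype=float)
    gx = np.zeros_like(gray); gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]          # central differences
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]

    # local sums of gradient products (Gaussian window)
    jxx = gaussian_filter(gx * gx, sigma)
    jxy = gaussian_filter(gx * gy, sigma)
    jyy = gaussian_filter(gy * gy, sigma)

    # eigenvalues of the 2 x 2 tensor
    root = np.sqrt((jxx - jyy) ** 2 + 4 * jxy ** 2)
    lam_plus = 0.5 * (jxx + jyy + root)
    lam_minus = 0.5 * (jxx + jyy - root)

    coherence = ((lam_plus - lam_minus) / (lam_plus + lam_minus + 1e-12)) ** 2
    # orientation of the dominant eigenvector, quantized to n_dirs levels
    angle = 0.5 * np.arctan2(2 * jxy, jxx - jyy)
    direction = np.round((angle + np.pi / 2) / np.pi * n_dirs).astype(int) % n_dirs

    # salience: clamp and normalize the smaller eigenvalue, smooth, invert
    r = (np.clip(lam_minus, t1, t2) - t1) / (t2 - t1)
    salience = 1.0 - gaussian_filter(r, sigma)
    return coherence, direction, salience
```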
1.3.3 Motion Estimation in Bayer Domain
Motion estimation inside the MF SR pipeline should be able to cover a sufficient range of displacements on the one hand and provide an accurate estimation of
Fig. 1.22 Implementation of motion estimation for Bayer MF SR
subpixel displacements on the other hand. At the same time, it should have modest computational complexity. To fulfill these requirements, a multiscale architecture combining 3-Dimensional Recursive Search (3DRS) and Lucas–Kanade (LK) optical flow was implemented (Fig. 1.22). The use of a 3DRS algorithm for frame-rate conversion is also demonstrated by Pohl et al. (2018) and in Chap. 15. Here, the LK algorithm was implemented with the improvements described in Baker and Matthews (2004). To estimate the largest-scale displacement, a simplified 3DRS implementation from Pohl et al. (2018) was applied to the 1/4 scale of the Y channel. The motion was further refined by conventional LK on 1/4 and 1/2 resolution, and finally one pass of a specially developed Bayer LK was applied. The single-channel Lucas–Kanade method relies on a local solution of the system TᵀT [u v]ᵀ = Tᵀb, where T is computed similarly to the way it was done in Sect. 1.3.2, except for the Gaussian window averaging applied to the gradient values. However, for this application the gradient values were obtained just from bilinear demosaicing of the original Bayer image. The chart of the algorithm is shown in Fig. 1.22.
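A minimal sketch of the final Lucas–Kanade refinement step described above is given below; the sign conventions, the warping routine, and the Gaussian weighting details are assumptions rather than the exact implementation.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def lk_refine(ref, cur, u0=0.0, v0=0.0, sigma=4.0, eps=1e-6):
    """One Lucas-Kanade refinement step on a small block.

    Solves the 2 x 2 normal equations (T^T T)[u v]^T = T^T b built from
    Gaussian-weighted image gradients; (u0, v0) is the motion predicted by
    the coarser levels (3DRS and LK on 1/4 and 1/2 resolution).
    """
    warped = nd_shift(cur, (-v0, -u0), order=1, mode='nearest')  # compensate the current frame
    ix = np.gradient(ref, axis=1)
    iy = np.gradient(ref, axis=0)
    it = warped - ref

    h, w = ref.shape
    yy, xx = np.mgrid[0:h, 0:w]
    g = np.exp(-((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / (2 * sigma ** 2))
    s = lambda a: float(np.sum(g * a))                # Gaussian-weighted sum over the block

    A = np.array([[s(ix * ix), s(ix * iy)],
                  [s(ix * iy), s(iy * iy)]])
    b = -np.array([s(ix * it), s(iy * it)])
    du, dv = np.linalg.solve(A + eps * np.eye(2), b)
    return u0 + du, v0 + dv
```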
References Azzari, L., Foi, A.: Gaussian-Cauchy mixture modeling for robust signal-dependent noise estimation. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5357–5351 (2014) Baker, S., Kanade, T.: Limits on super-resolution and how to break them. IEEE Trans. Pattern Anal. Mach. Intell. 24(9), 1167–1183 (2002) Baker, S., Matthews, I.: Lucas-Kanade 20 years on: a unifying framework. Int. J. Comput. Vis. 56(3), 221–255 (2004)
Baker, S., Scharstein, D., Lewis, J., Roth S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. In: Proceedings of IEEE International Conference on Computer Vision, pp. 1–8 (2007). https://doi.org/10.1007/s11263-010-0390-2 Benzi, M., Bini, D., Kressner, D., Munthe-Kaas, H., Van Loan, C.: Exploiting hidden structure in matrix computations: algorithms and applications. In: Benzi, M., Simoncini, V. (eds.) Lecture Notes in Mathematics, vol. 2173. Springer International Publishing, Cham (2016) Bodduna, K., Weickert, J.: Evaluating data terms for variational multi-frame super-resolution. In: Lauze, F., Dong, Y., Dahl, A.B. (eds.) Lecture Notes in Computer Science, vol. 10302, pp. 590–601. Springer Nature Switzerland AG, Cham (2017) Brooks, T., Mildenhall, B., Xue, T., Chen, J., Sharlet, D., Barron, J.-T.: Unprocessing images for learned raw denoising. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11036–11045 (2019) Cai, J., Gu, S., Timofte, R., Zhang, L., Liu, X., Ding, Y. et al.: NTIRE 2019 challenge on real image super-resolution: methods and result. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 2211–2223 (2019) Chen, C., Ren, Y., Kuo, C.-C.: Big Visual Data Analysis. Scene Classification and Geometric Labeling, Springer Singapore, Singapore (2016) Digital negative (DNG) specification, v.1.4.0 (2012) Farsiu, S., Robinson, D., Elad, M., Milanfar, P.: Robust shift and add approach to superresolution. In: Proceedings of IS&T International Symposium on Electronic Imaging, Applications of Digital Image Processing XXVI, vol. 5203 (2003). https://doi.org/10.1117/12.507194. Accessed on 02 Oct 2020 Farsiu, S., Robinson, M.D., Elad, M., Milanfar, P.: Fast and robust multi-frame super resolution. IEEE Trans. Image Process. 13(10), 1327–1344 (2004) Fazli, S., Fathi, H.: Video image sequence super resolution using optical flow motion estimation. Int. J. Adv. Stud. Comput. Sci. Eng. 4(8), 22–26 (2015) Foi, A., Alenius, S., Katkovnik, V., Egiazarian, K.: Noise measurement for raw data of digital imaging sensors by automatic segmentation of non-uniform targets. IEEE Sensors J. 7(10), 1456–1461 (2007) Foi, A., Trimeche, M., Katkovnik, V., Egiazarian, K.: Practical Poissonian-Gaussian noise modeling and fitting for single-image raw-data. IEEE Trans. Image Process. 17(10), 1737–1754 (2008) Giachetti, A., Asuni, N.: Real time artifact-free image interpolation. IEEE Trans. Image Process. 20(10), 2760–2768 (2011) Glazistov, I., Petrova X.: Structured matrices in super-resolution problems. In: Proceedings of the Sixth China-Russia Conference on Numerical Algebra with Applications. Session Report (2017) Glazistov, I., Petrova, X.: Superfast joint demosaicing and super-resolution. In: Proceedings of IS&T International Symposium on Electronic Imaging, Computational Imaging XVI, pp. 2721–2728 (2018) Gutierrez, E.Q., Callico, G.M.: Approach to super-resolution through the concept of multi-camera imaging. In: Radhakrishnan, S. (ed.) Recent Advances in Image and Video Coding (2016). https://www.intechopen.com/books/recent-advances-in-image-and-video-coding/approach-tosuper-resolution-through-the-concept-of-multicamera-imaging Hansen, P.C., Nagy, J.G., O’Leary, D.P.: Deblurring Images: Matrices, Spectra, and Filtering. Fundamentals of Algorithms, vol. 3. SIAM, Philadelphia (2006) Heide, F., Rouf, M., Hullin, M.-B., Labitzke, B., Heidrich, W., Kolb, A.: High-quality computational imaging through simple lenses. 
ACM Trans. Graph. 32(5), Article No. 149 (2013) Heide, F., Steinberger, M., Tsai, Y.-T., Rouf, M., Pajak, D., Reddy, D., Gallo, O., Liu, J., Heidrich, W., Egiazarian, K., Kautz, J., Pulli, K.: FlexISP: a flexible camera image processing framework. ACM Trans. Graph. 33(6), 1–13 (2014) Hirakawa, K., Parks, W.-T.: Adaptive homogeneity-directed demosaicing algorithm. IEEE Trans. Image Process. 14(3), 360–369 (2005)
Ismaeil, K.A., Aouada, D., Ottersten B., Mirbach, B.: Multi-frame super-resolution by enhanced shift & add. In: Proceedings of 8th International Symposium on Image and Signal Processing and Analysis, pp. 171–176 (2013) Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) European Conference on Computer Vision, pp. 694–711. Springer International Publishing, Cham (2016) Kanaev, A.A., Miller, C.W.: Multi-frame super-resolution algorithm for complex motion patterns. Opt. Express. 21(17), 19850–19866 (2013) Kuan, D.T., Sawchuk, A.A., Strand, T.C., Chavel, P.: Adaptive noise smoothing filter for images with signal-dependent noise. IEEE Trans. Pattern Anal. Mach. Intell. 7(2), 165–177 (1985) Li, X., Orchard, M.: New edge-directed interpolation. IEEE Trans. Image Process. 10(10), 1521–1527 (2001) Lin, Z., He, J., Tang, X., Tang, C.-K.: Limits of learning-based superresolution algorithms. Int. J. Comput. Vis. 80, 406–420 (2008) Liu, X., Tanaka, M., Okutomi, M.: Estimation of signal dependent noise parameters from a single image. In: Proceedings of the IEEE International Conference on Image Processing, pp. 79–82 (2013) Liu, X., Tanaka, M., Okutomi, M.: Practical signal-dependent noise parameter estimation from a single noisy image. IEEE Trans. Image Process. 23(10), 4361–4371 (2014) Liu, J., Wu C.-H., Wang, Y., Xu Q., Zhou, Y., Huang, H., Wang, C., Cai, S., Ding, Y., Fan, H., Wang, J.: Learning raw image de-noising with Bayer pattern unification and Bayer preserving augmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 4321–4329 (2019) Ma, D., Afonso, F., Zhang, M., Bull, A.-D.: Perceptually inspired super-resolution of compressed videos. In: Proceedings of SPIE 11137. Applications of Digital Image Processing XLII, Paper 1113717 (2019) Malvar, R., He, L.-W., Cutler, R.: High-quality linear interpolation for demosaicing of Bayerpatterned color images. In: International Conference of Acoustic, Speech and Signal Processing, vol. 34(11), pp. 2274–2282 (2004) Mastronardi, N., Ng, M., Tyrtyshnikov, E.E.: Decay in functions of multiband matrices. SIAM J. Matrix Anal. Appl. 31(5), 2721–2737 (2010) Milanfar, P. (ed.): Super-Resolution Imaging. CRC Press (Taylor & Francis Group), Boca Raton (2010) Nasonov, A. Krylov, A., Petrova, X., Rychagov M.: Edge-directional interpolation algorithm using structure tensor. In: Proceedings of IS&T International Symposium on Electronic Imaging. Image Processing: Algorithms and Systems XIV, pp. 1–4 (2016) Park, J.-H., Oh, H.-M., Moon, G.-K.: Multi-camera imaging system using super-resolution. In: Proceedings of 23rd International Technical Conference on Circuits/Systems, Computers and Communications, pp. 465–468 (2008) Petrova, X., Glazistov, I., Zavalishin, S., Kurmanov, V., Lebedev, K., Molchanov, A., Shcherbinin, A., Milyukov, G., Kurilin, I.: Non-iterative joint demosaicing and super-resolution framework. In: Proceedings of IS&T International Symposium on Electronic Imaging, Computational Imaging XV, pp. 156–162 (2017) Pohl, P., Anisimovsky, V., Kovliga, I., Gruzdev, A., Arzumanyan, R.: Real-time 3DRS motion estimation for frame-rate conversion. In: Proceedings of IS&T International Symposium on Electronic Imaging, Applications of Digital Image Processing XXVI, pp. 3281–3285 (2018) Pyatykh, S., Hesser, J.: Image sensor noise parameter estimation by variance stabilization and normality assessment. 
IEEE Trans. Image Process. 23(9), 3990–3998 (2014) Rakhshanfar, M., Amer, M.A.: Estimation of Gaussian, Poissonian-Gaussian, and processed visual noise and its level function. IEEE Trans. Image Process. 25(9), 4172–4185 (2016) Robinson, M.D., Farsiu, S., Milanfar, P.: Optimal registration of aliased images using variable projection with applications to super-resolution. Comput. J. 52(1), 31–42 (2009)
Robinson, M.D., Toth, C.A., Lo, J.Y., Farsiu, S.: Efficient Fourier-wavelet super-resolution. IEEE Trans. Image Process. 19(10), 2669–2681 (2010) Rochefort, G., Champagnat, F., Le Besnerais, G., Giovannelli, G.-F.: An improved observation model for super-resolution under affine motion. IEEE Trans. Image Process. 15(11), 3325–3337 (2006) Romano, Y., Isidoro, J., Milanfar, P.: RAISR: rapid and accurate image super resolution. IEEE Trans. Comput. Imaging. 3(1), 110–125 (2017) Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Adaptive Image Processing Algorithms for Printing. Springer Nature Singapore AG, Singapore (2018) Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for Scanning and Printing. Springer Nature Switzerland AG, Cham (2019) Segall, C.A., Katsaggelos, A.K., Molina, R., Mateos, J.: Super-resolution from compressed video. In: Chaudhuri, S. (ed.) Super-Resolution Imaging. The International Series in Engineering and Computer Science Book Series, Springer, vol. 632, pp. 211–242 (2002) Sroubek, F., Kamenick, J., Milanfar, P.: Superfast super-resolution. In: Proceedings of 18th IEEE International Conference on Image Processing, pp. 1153–1156 (2011) Sutour, C., Deledalle, C.-A., Aujol, J.-F.: Estimation of the noise level function based on a non-parametric detection of homogeneous image regions. SIAM J. Imaging Sci. 8(4), 2622–2661 (2015) Sutour, C., Aujol, J.-F., Deledalle, C.-A.: Automatic estimation of the noise level function for adaptive blind denoising. In: Proceedings of 24th European Signal Processing Conference, pp. 76–80 (2016) Timofte, R., De Smet, V., Van Gool, L.: A+: adjusted anchored neighbourhood regression for fast super-resolution. In: Asian Conference on Computer Vision, pp. 111–126 (2014) Trench, W.: Properties of multilevel block α-circulants. Linear Algebra Appl. 431(10), 1833–1847 (2009) Voevodin, V.V., Tyrtyshnikov, E.E.: Computational processes with Toeplitz matrices. Moscow, Nauka (1987) (in Russian). https://books.google.ru/books?id¼pf3uAAAAMAAJ. Accessed on 02 Oct 2020 Zhang, H., Ding, F.: On the Kronecker products and their applications. J. Appl. Math. 2013, 296185 (2013) Zhang, Z., Sze, V.: FAST: a framework to accelerate super-resolution processing on compressed videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1015–1024 (2017) Zhang, L., Wu, X.L.: An edge-guide image interpolation via directional filtering and data fusion. IEEE Trans. Image Process. 15(8), 2226–2235 (2006) Zhang, Y., Wang, G., Xu, J.: Parameter estimation of signal-dependent random noise in CMOS/ CCD image sensor based on numerical characteristic of mixed Poisson noise samples. Sensors. 18(7), 2276–2293 (2018) Zhao, N., Wei, Q., Basarab, A., Dobigeon, N., Kouame, D., Tourneret, J.-Y.: Fast single image super-resolution using a new analytical solution for ℓ2-ℓ2 problems. IEEE Trans. Image Process. 25(8), 3683–3697 (2016) Zhou, D., Shen, X., Dong, W.: Image zooming using directional cubic convolution interpolation. IET Image Process. 6(6), 627–634 (2012)
Chapter 2
Super-Resolution: 2. Machine Learning-Based Approach Alexey S. Chernyavskiy
2.1 Introduction
Multi-frame approaches to super-resolution allow us to take into account the minute variabilities in digital images of scenes that are due to shifts. As we have seen in Chap. 1, after warping and blurring parameters are estimated, a solution of large systems of equations is usually involved in the process of generating a composite highly detailed image. Many ingenious ways of accelerating the algorithms and making them practical for applications in multimedia have been developed. On the other hand, for single image super-resolution (SISR), the missing information cannot be taken from other frames, and it should be derived based on some statistical assumptions about the image. In the remainder of this chapter, by super-resolution we will mean single image zooming that involves detail creation and speak of 2×, 3×, etc. zoom levels, where ×N means that we will be aiming at the creation of new high-resolution (HR) images with height and width N times larger than the low-resolution (LR) original. It is easy to see that for ×N zoom, the area of the resulting HR image will increase N² times compared to the original LR image. So, if an LR image is up-scaled by a factor of two and if we assume that all the pixel values of the original image are copied into the HR image, then ¾ of the pixels in the resulting HR image should be generated by an algorithm. For ×4 zoom, this figure will be 15/16, which is almost 94% of all pixel values. Deep learning methods have been shown to generate representations of the data that allow us to efficiently solve various computer vision tasks like image classification, segmentation, and denoising. Starting with the seminal paper by Dong et al. (2016a), there has been a rise in the use of deep learning models and, specifically, of convolutional neural networks (CNNs), in single image super-resolution. The
remainder of this chapter will focus on architectural designs of SISR CNNs and on various aspects that make SISR challenging. The blocks that make up the CNNs for super-resolution do not differ much from neural networks used in other image-related tasks, such as object or face recognition. They usually consist of convolution blocks, with kernel sizes of 3 × 3, interleaved with simple activation functions such as the ReLU (rectified linear unit). Modern super-resolution CNNs incorporate the blocks that have been successfully used in other vision tasks, e.g. various attention mechanisms, residual connections, dilated convolutions, etc. In contrast with CNNs that are designed for image classification, there are usually no pooling operations involved. The input low-resolution image is processed by sets of filters which are specified by kernels with learnable weights. These operations produce arrays of intermediate outputs called feature maps. Non-linear activation functions are applied to the feature maps, after adding learnable biases, in order to zero out some values and to accentuate others. After passing through several stages of convolutions and non-linear activations, the image is finally transformed into a high-resolution version by means of the deconvolution operation, also called transposed convolution (Shi et al. 2016a). Another up-scaling option is the sub-pixel convolution layer (Shi et al. 2016b), which is faster than the deconvolution but is known to generate checkerboard artefacts. During training, each LR image patch is forward propagated through the neural network, a zoomed image is generated by the sequence of convolution blocks and activation functions, and this image is compared to the true HR patch. A loss function is computed, and the gradients of the loss function with respect to the neural network parameters (weights and biases) are back-propagated; therefore the network parameters are updated. In most cases, the loss function is the L2 or L1 distance, but the choice of a suitable measure for comparing the generated image and its ground truth counterpart is a subject of active research. For training a super-resolution model, one should create or obtain a training dataset which consists of pairs of low-resolution images and their high-resolution versions. Almost all SISR neural networks are designed for one zoom factor only, although, e.g. a 4× up-scaled image can in principle be obtained by passing the output of a 2×-zoom CNN through itself once again. In this way, the size ratio of the HR and LR images used for training the neural network should correspond to the desired zoom factor. Training is performed on square image patches cropped from random locations in input images. The patch size should match the receptive field of the CNN, which is the size of the neighbourhood that is involved in the computation of a single pixel of the output. The receptive field of a CNN is usually related to the typical kernel size and the depth (number of convolution layers). In terms of architectural complexity, the progress of CNNs for the super-resolution task has closely followed the successes of image classification CNNs. However, some characteristic features are inherent to the CNNs used in SISR and to the process of training such neural networks. We will examine these peculiarities later. There are several straightforward design choices that come into play when one decides to create and train a basic SISR CNN.
First, there is the issue of early vs. late upsampling. In early upsampling, the low-resolution image is up-scaled using a simple interpolation method (e.g. bicubic or Lanczos), and this crude version of the HR image serves as input to the CNN, which basically performs the deblurring. This approach has an obvious drawback which is the large number of operations and large HR-sized intermediate feature maps that need to be stored in memory. So, in most modern SISR CNN, starting from Dong et al. (2016b), the upsampling is delayed to the last stages. In this way, most of the convolutions are performed over feature maps that have the same size as the low-resolution image. While in many early CNN designs the network learned to directly generate an HR image and the loss function that was being minimized was computed as the norm of the difference between the generated image and the ground truth HR, Zhang et al. (2017) proposed to learn the residual image, i.e. the difference between the LR image, up-scaled by a simple interpolation, and the HR. This strategy proved beneficial for SISR, denoising and JPEG image deblocking. An example of CNN architecture with residual learning is shown in Fig. 2.1; a typical model output is shown in Fig. 2.2 for a zoom factor equal to 3. Compared to standard interpolation, such as bicubic or Lanczos, the output of a trained SISR CNN contains much sharper details and shows practically no aliasing.
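A minimal PyTorch sketch of such a network — late upsampling through a sub-pixel (PixelShuffle) layer and a global residual added to a bicubic upscale — is shown below. The layer counts and hyperparameters are illustrative and do not correspond to any specific published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySRNet(nn.Module):
    """Minimal SISR CNN: late upsampling via sub-pixel convolution and
    global residual learning on top of a bicubic upscale."""
    def __init__(self, scale=3, channels=32, num_blocks=4):
        super().__init__()
        self.head = nn.Conv2d(1, channels, 3, padding=1)
        body = []
        for _ in range(num_blocks):
            body += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*body)
        # sub-pixel upsampling: produce scale^2 maps, then rearrange them spatially
        self.tail = nn.Conv2d(channels, scale * scale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)
        self.scale = scale

    def forward(self, x):                      # x: (B, 1, h, w) Y channel
        base = F.interpolate(x, scale_factor=self.scale, mode='bicubic',
                             align_corners=False)
        res = self.shuffle(self.tail(self.body(F.relu(self.head(x)))))
        return base + res                      # the network predicts only the residual

# one training step with the L1 loss (illustrative)
net = TinySRNet(scale=3)
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
lr_patch, hr_patch = torch.rand(8, 1, 32, 32), torch.rand(8, 1, 96, 96)
loss = F.l1_loss(net(lr_patch), hr_patch)
opt.zero_grad(); loss.backward(); opt.step()
```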
Fig. 2.1 A schematic illustration of a CNN for super-resolution with residual learning
Fig. 2.2 Results of image up-scaling by a factor of 3: left result of applying bicubic interpolation; right HR image generated by a CNN
2.2 Training Datasets for Super-Resolution
The success of deep convolutional neural networks is largely due to the availability of training data. For the task of super-resolution, the neural networks are typically trained on high-quality natural images, from which the low-resolution images are generated by applying a specific predefined downscaling algorithm (bicubic, Lanczos, etc.). The images belonging to the training set should ideally come from the same domain as the one that will be explored after training. Man-made architectural structures possess very specific textural characteristics, and if one would like to up-scale satellite imagery or street views, the training set should contain many images of buildings, as in the popular dataset Urban100 (Huang et al. 2015). Representative images from Urban100 are shown in Fig. 2.3. On the other hand, a more general training set would allow for greater flexibility and better average image quality. A widely used training dataset, DIV2K (Agustsson and Timofte 2017), contains 1000 images, each having at least 2K pixels along one of its axes, and features very diverse content, as illustrated in Fig. 2.4.
Fig. 2.3 Sample images from the Urban100 dataset
Fig. 2.4 Sample images from the DIV2K dataset
Fig. 2.5 Left LR image; top-right HR ground truth; bottom-right image reconstructed by a CNN from an LR image that was obtained by simple decimation without smoothing or interpolation. (Reproduced with permission from Shocher et al. 2018)
When a CNN is trained on synthetic data generated by a simple predefined kernel, its performance on images that come from a different downscaling and degradation process deteriorates significantly (Fig. 2.5). Most of the time, real images from the
Web, or taken by a smartphone camera, or old historic images contain many artefacts coming from sensor noise, non-ideal PSF, aliasing, image compression, on-device denoising, etc. It is obvious that in a real-life scenario, a low-resolution image is produced by an optical system and, generally, is not created by applying any kind of subsampling and interpolation. In this regard, the whole idea of training CNNs on carefully engineered images obtained by using a known down-sampling function might sound faulty and questionable. It seems natural then to create a training dataset that would simulate the real artefacts introduced into an image by a real imaging system. Another property of real imaging systems is the intrinsic trade-off between resolution (R) and field of view (FoV). When zooming out the optical lens in a DSLR camera, the obtained image has a larger FoV but loses details on subjects; when zooming in the lens, the details of subjects show up at the cost of a reduced FoV. This trade-off also applies to cameras with fixed focal lenses (e.g. smartphones), when the shooting distance changes. The loss of resolution that is due to enlarged FoV can be thought of as a degradation model that can be reversed by training a CNN (Chen et al. 2019; Cai et al. 2019a). In a training dataset created for this task, the HR image could come, for example, from a high-quality DSLR camera, while the LR image could be obtained by another camera that would have inferior optical characteristics, e.g. a cheap digital camera with a lower image resolution, different focal distance, distortion parameters, etc. Both cameras should be mounted on tripods in order to ensure the closest similarity of the captured scenes. Still, due to the different focus and depth of field, it would be impossible to align whole images (Fig. 2.6). The patches suitable for training the CNN would have to be cropped from the central parts of the image pair. Also, since getting good image quality is not a problem when up-scaling low-frequency regions like the sky, care should be taken to only select informative patches that contain high-frequency information, such as edges, corners, and spots. These parts can be selected using classical computer vision feature detectors, like SIFT, SURF, or FAST. A subsequent distortion compensation and correlation-based alignment (registration) must be performed in order to obtain the most accurate
Fig. 2.6 Alignment of two images of the same scene made by two cameras with different depth of field. Only the central parts can be reliably aligned
LR-HR pairs. Also, the colour map of the LR image, which can differ from that of the HR image due to white balance, exposure time, etc., should be adjusted via histogram matching using the HR image as reference (see Migukin et al. 2020). During CNN training, the dataset can be augmented by adding randomly rotated and mirrored versions of the training images. Many other useful types of image manipulation are implemented in the Albumentations package (Buslaev et al. 2020). Reshuffling of the training images and sub-pixel convolutions can be greatly facilitated by using the Einops package by Rogozhnikov (2018). Both packages are available for Tensorflow and Pytorch, the two most popular deep learning frameworks.
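A sketch of how these packages might be used for paired LR–HR patches is shown below; the particular transforms and tensor shapes are illustrative only.

```python
import albumentations as A
import numpy as np
from einops import rearrange

# Geometric augmentations that are safe for paired LR-HR patches: flips and
# 90-degree rotations applied identically to both images.
aug = A.Compose(
    [A.HorizontalFlip(p=0.5), A.VerticalFlip(p=0.5), A.RandomRotate90(p=0.5)],
    additional_targets={'hr': 'image'},
)
lr = np.random.rand(32, 32, 3).astype(np.float32)
hr = np.random.rand(128, 128, 3).astype(np.float32)
out = aug(image=lr, hr=hr)
lr_aug, hr_aug = out['image'], out['hr']

# Sub-pixel (depth-to-space) rearrangement of a feature tensor with einops,
# equivalent to PixelShuffle for scale s.
s = 4
feat = np.random.rand(8, 3 * s * s, 32, 32)            # (batch, c*s*s, h, w)
up = rearrange(feat, 'b (c s1 s2) h w -> b c (h s1) (w s2)', s1=s, s2=s)
```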
2.3 Loss Functions and Image Quality Metrics
Since super-resolution can be thought of as a regression task, the L2 distance (mean squared error, MSE) can be used as the loss function that is minimized during training. However, the minimization of the L2 loss often results in overly smooth images with reduced edge sharpness. The L1 (absolute mean difference) loss is preferable, since it produces sharper edges. Due to the different behaviour of these loss functions when the argument is close to zero, in some cases it is preferable to use a combination of these losses or first train the CNN with L2 loss and fine-tune it using the L1 loss. Other choices include differentiable variants of the L1 loss, e.g. the Charbonnier loss as in Lai et al. (2017), which has the ability to handle outliers better. In deep learning, there is a distinction between the loss function that is minimized during training and the accuracy, or quality, metric that is computed on the validation dataset once the model is trained. The loss function should be easily differentiable in order to ensure a smooth convergence, while the quality metric should measure the relevance of the model output in the context of the task being solved. The traditional image quality metrics are the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), including its multiscale version MS-SSIM. The PSNR is inversely proportional to the MSE between the reconstructed and the ground truth image, while SSIM and MS-SSIM measure the statistical similarity between the two images. It has been shown that these metrics do not always correlate well with human judgement about what a high-quality image is. This is partly due to the fact that these metrics are pixel-based and do not consider large-scale structures, textures, and semantics. Also, the MSE is prone to the regression to the mean problem, which leads to degradation of image edges. There are many other image quality metrics that rely on estimating the statistics of intensities and their gradients (see Athar and Wang 2019 for a good survey). Some of them are differentiable and can therefore serve as loss functions for training CNNs by back-propagation in popular deep learning frameworks (Kastryulin et al. 2020). Image similarity can also be defined in terms of the distance between features obtained by processing the images to be compared by a separate dedicated neural network. Johnson et al. (2016) proposed to use the L2 distance between the visual
Fig. 2.7 Example of images that were presented to human judgement during the development of LPIPS metric. (Reproduced with permission from Zhang et al. 2018a)
features obtained from intermediate layers of VGG16, a popular CNN that was for some time responsible for the highest accuracy in image classification on the ImageNet dataset. SISR CNNs trained with this loss (Ledig et al. 2017, among many others) have shown more visually pleasant results, without the blurring that is characteristic of MSE. Recently, there has been a surge in attempts to leverage the availability of big data for capturing visual preferences of users and simulating it using engineered metrics relying on deep learning. Large-scale surveys have been conducted, in the course of which users were given triplets of images representing the same scene but containing a variety of conventional and CNN-generated degradations and asked whether the first or the second image was “closer”, in whatever sense they could imagine, to the third one (Fig. 2.7). Then, a CNN was trained to predict this perceptual judgement. The learned perceptual patch similarity LPIPS by Zhang et al. (2018a) and PieAPP by Prashnani et al. (2018) are two notable examples. These metrics generalize well even for distortions that were not present during training. Overall, the development of image quality metrics that would better correlate with human perception is a subject of active research (Ding et al. 2020).
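For example, the Charbonnier loss mentioned above can be implemented in a few lines; ε = 10⁻³ is a commonly used but assumed value.

```python
import torch
import torch.nn as nn

class CharbonnierLoss(nn.Module):
    """Differentiable variant of the L1 loss: mean of sqrt((x - y)^2 + eps^2)."""
    def __init__(self, eps=1e-3):
        super().__init__()
        self.eps = eps

    def forward(self, pred, target):
        return torch.sqrt((pred - target) ** 2 + self.eps ** 2).mean()

criterion = CharbonnierLoss()
loss = criterion(torch.rand(4, 1, 96, 96), torch.rand(4, 1, 96, 96))
```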
2.4 Super-Resolution Implementation on a Mobile Device
Since most user-generated content is created by cameras installed in mobile phones, it is of practical interest to realize super-resolution as a feature running directly on the device, preferably in real time. Let us consider an exemplar application and adapt FSRCNN by Dong et al. (2016b) for 2×–4× zooming. The FSRCNN, as it is formulated by its authors, consists of a low-level feature extraction layer (convolution with a 5 × 5 kernel) which produces 56 feature maps, followed by a shrinking layer which compresses the data to 12 feature maps, and a series of mapping building blocks repeated four times. Finally, deconvolution is applied to map from the low-resolution to high-resolution space. There are parametric rectified linear activation units (PReLUs) in between the convolution layers. FSRCNN achieves an average PSNR of 33.06 dB when zooming images from a test set Set5 by a factor
of 3. We trained this neural network on a dataset of image patches, as described in Sect. 2.1, and experimented with the number of building blocks. We found that with ten basic blocks, we could achieve 33.23 dB, if the number of channels in the building blocks is increased from 12 to 32. This CNN took about 4 s to up-scale a 256 × 256 image four times on a Samsung Galaxy S7 mobile phone, using a CPU. Therefore, there was room for CNN optimization. One of the most obvious ways to improve the speed of a CNN is to reduce the number of layers and the number of feature maps. We reduced the number of basic blocks down to 4 and the number of channels in the first layer from 56 to 32. This gave a small increase in speed, from 4 to 2.5 s. A breakthrough was made when we replaced the convolutions inside the basic blocks by depthwise separable convolutions. In deep learning, convolution filters are usually regarded as 3D parallelepipeds; their depth is equal to the number of input channels, while the other two dimensions are equal to the height and width of the filter, usually 3 × 3 or 5 × 5. In this way, if the number of input channels is N, the number of output channels is M, and the size of the filter is K × K, the total number of parameters (excluding bias) is NMK². Convolving an H × W feature map requires HWNMK² multiplications. If depthwise separable convolutions are used, each of the N feature maps in the stack is convolved with its own K × K filter, which gives N output feature maps. These outputs can then be convolved with M 3D filters of size 1 × 1 × N. The total number of parameters in this new module becomes equal to NK² + NM. The amount of computations is HW(NK² + NM). For our case, when, in the BasicBlock, N = 32, M = 32, K = 3, the amount of computations drops from HW·32·32·9 = HW·9216 to HW·1312, which is seven times less! These are figures computed analytically, and the real improvement in computations is more modest. Still, switching from 3D convolutions to depthwise separable convolutions allowed an increase of the speed and resulted in the possibility of zooming a 256 × 256 image 4× in 1 s using a CPU only. The number of parameters is also greatly reduced. A modified FSRCNN architecture after applying this technique is shown in Table 2.1. We also added a residual connection in the head of each basic block (Table 2.2): the input feature maps are copied and added to the output of the 1 × 1 × 32 convolution layer before passing to the next layer. This improves the training convergence. For further acceleration, one can also process one-channel images instead of RGB images. The image to be zoomed should be transformed from the RGB colour space to the YCbCr space, where the Y channel contains the intensity, while the other two channels contain colour information. Research shows that the human eye is more sensitive to details which are expressed as differences in intensities, so only the Y channel could be up-scaled for the sake of speed without a big loss of quality. The other two channels are up-scaled by bicubic interpolation, and then they are merged with the generated Y channel image and transformed to RGB space for visualization and saving. Most of the parameters are contained in the final deconvolution layer. Its size depends on the zoom factor. In order to increase the speed of image zooming and to reduce the disk space required for CNN parameter storage, we can train our CNN in such a way as to maximize the reuse of operations. Precisely, we first train a CNN for
Table 2.1 FSRCNN architecture modified by adding depthwise separable convolutions and residual connections

Layer name | Comment | Type | Filter size | Output channels
Data | Input data, Y channel | – | – | 1
Upsample | Upsample the input, bicubic interpolation | Deconvolution | – | 1
Conv1 | – | Convolution, PReLU | 5 × 5 | 32
Conv2 | – | Convolution, PReLU | 1 × 1 | 32
BasicBlock1 | See Table 2.2 | 3 × 3, 1 × 1, ReLU, sum w/ residual | 3 × 3 | 32
BasicBlock2 | See Table 2.2 | 3 × 3, 1 × 1, ReLU, sum w/ residual | 3 × 3 | 32
BasicBlock3 | See Table 2.2 | 3 × 3, 1 × 1, ReLU, sum w/ residual | 3 × 3 | 32
BasicBlock4 | See Table 2.2 | 3 × 3, 1 × 1, ReLU, sum w/ residual | 3 × 3 | 32
Conv3 | – | Convolution, PReLU | 1 × 1 | 32
Conv4 | – | Convolution | 1 × 1 | 32
Deconv | Obtain the residual | Deconvolution | 9 × 9, stride 3 | 1
Result | Deconv + Upsample | Summation | – | 1
Table 2.2 Structure of the basic block with depthwise separable convolution and residual connection

Layer name | Type | Filter size | Output channels
Conv3×3 | Depthwise separable convolution | 3 × 3 | 32
Conv1×1 | Convolution, ReLU | 1 × 1 | 32
Sum | Summation of the input to the block and the result of Conv1×1 | – | 32
3× zooming. Then, we keep all the layers’ parameters frozen (by setting the learning rate to zero for these layers) and replace the final deconvolution layer by a layer that performs 4× zoom. In this way, after an image is processed by the main body of the CNN, the output of the last layer before the deconvolution is fed into the layer specifically responsible for the desired zoom factor. So, the latency for the first zoom operation is big, but zooming the same image by a different factor is much faster than the first time. Our final CNN achieves 33.25 dB on the Set5 dataset with 10K parameters, compared to FSRCNN with 33.06 dB and 12K parameters. To make full use of the GPU available on recent mobile phones, we ported our super-resolution CNN to a Samsung Galaxy S9 mobile device using the Qualcomm Snapdragon Neural Processing Engine (SNPE) SDK. We were able to achieve 1.6 FPS for 4× up-scaling of 1024 × 1024 images. This figure does not include the CPU to GPU data transfers and RGB to YCbCr transformations, which can take 1 second in total. The results of super-resolution using this CNN are shown in Table 2.2 and Fig. 2.8.
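A minimal PyTorch sketch of the basic block described in this section — a depthwise separable 3 × 3 convolution, a 1 × 1 convolution, and a residual connection, cf. Table 2.2 — is given below; it is an illustration rather than the deployed implementation.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Depthwise separable 3x3 convolution followed by a 1x1 convolution,
    with a residual connection (cf. Tables 2.1 and 2.2)."""
    def __init__(self, channels=32):
        super().__init__()
        # depthwise: one 3x3 filter per input channel (groups=channels)
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # pointwise: 1x1 convolution mixing the channels
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.act(self.pointwise(self.depthwise(x)))

block = BasicBlock(32)
n_params = sum(p.numel() for p in block.parameters())
# 32*(3*3) weights + 32 biases for the depthwise part and 32*32 + 32 for the
# pointwise part give 1376 parameters, versus 32*32*3*3 + 32 = 9248 for a
# full 3x3 convolution with the same number of channels.
```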
Fig. 2.8 Super-resolution on a mobile device: left column bicubic interpolation; right column modified FSRCNN optimized for Samsung Galaxy S9
2.5 Super-Resolution Competitions
Several challenges on single image super-resolution have been initiated since 2017. They intend to bridge the gap between academic research and real-life applications of single image super-resolution. The first NTIRE (New Trends in Image Restoration and Enhancement) challenge featured two tracks. In Track 1 bicubic interpolation was used for creating the LR images. In Track 2, all that was known was that the LR images were produced by convolving the HR image with some unknown kernel. In both tracks, the HR images were downscaled by factors of 2, 3, and 4, and only blur and decimation were used for this, without adding any noise. The DIV2K image dataset was proposed for training and validation of algorithms (Agustsson and Timofte 2017). The competition attracted many teams from academia and industry, and many new ideas were demonstrated. Generally, although the PSNR figures for all the algorithms were worse on images from Track 2 than on those coming from Track 1, there was a strong positive correlation between the success of the method in both tracks. The NTIRE competition became a yearly event, and the tasks to solve became more and more challenging. It now features more tracks, many of them related to image denoising, dehazing, etc. With regard to SR, NTIRE 2018 already featured four tracks, the first one being the same as Track 1 from 2017, while the remaining three added unknown image artefacts that emulated the various degradation factors
Fig. 2.9 Perception-distortion plane used for SR algorithm assessment in the PIRM challenge
present in the real image acquisition process from a digital camera. In 2019, RealSR, a new dataset captured by a high-end DSLR camera, was introduced by Cai et al. (2019b). For this dataset, HR and LR images of the same scenes were acquired by the same camera by changing its focal length. In 2020, the "extreme" ×16 track was added. Along with PSNR and SSIM values, the contestants were ranked based on the mean opinion score (MOS) computed in a user study. The PIRM (Perceptual Image Restoration and Manipulation) challenge, first held in 2018, was the first to really focus on perceptual quality. The organizers used an evaluation scheme based on the perception-distortion plane. The perception-distortion plane was divided into three regions by setting thresholds on the RMSE values (Fig. 2.9). In each region, the goal was to obtain the best mean perceptual quality. For each participant, the perception index (PI) was computed as a combination of the no-reference image quality measures of Ma et al. (2017) and NIQE (Mittal et al. 2013), a lower PI indicating better perceptual quality. The PI demonstrated a correlation of 0.83 with the mean opinion score. Another similar challenge, AIM (Advances in Image Manipulation), was first held in 2019. It focuses on the efficiency of SR. In the constrained SR challenge, the participants were asked to develop neural network designs or solutions with either the lowest number of parameters, or the lowest inference time on a common GPU, or the best PSNR, while being constrained to maintain or improve over a variant of SRResNet (Ledig et al. 2017) in terms of the other two criteria. In 2020, both NTIRE and AIM introduced the Real-World Super-Resolution (RWSR) sub-challenges, in which no LR-HR pairs are provided at all. In the Same Domain RWSR track, the aim is to learn a model capable of super-resolving images in the source set, while preserving low-level image characteristics of the input source domain. Only the source (input) images are provided for training, without any HR ground truth. In the Target Domain RWSR track, the aim is to learn a model capable of super-resolving images in the source set, generating clean high-quality images
similar to those in the target set. The source input images in both tracks are constructed using artificial, but realistic, image degradations. The difference from all the previous challenges is that this time the images in the source and target sets are unpaired, so the ×4 super-resolved LR images should possess the same properties as HR images of different scenes. Final reports have been published for all of the above challenges. The reports are a great illustrated source of information about the winning solutions, neural net architectures, training strategies, and trends in SR in general. Relevant references are Timofte et al. (2017), Timofte et al. (2018), Cai et al. (2019a), and Lugmayr et al. (2019).
2.6 Prominent Deep Learning Models for Super-Resolution
Over the years, although many researchers in super-resolution have used the same neural networks that produced state-of-the-art results in image classification, a lot of SR-specific enhancements have been proposed. The architectural decisions that we describe next were instrumental in reaching the top positions in SISR challenges and influenced the research in this field. One of the major early advances in single image super-resolution was the introduction of generative adversarial networks (GANs) to produce more realistic high-resolution images by Ledig et al. (2017). The proposed SRGAN network consisted of a generator, a ResNet-like CNN with many residual blocks, and a discriminator. The two networks were trained concurrently, with the generator network trying to produce high-quality HR images and the discriminator aiming to correctly classify whether its input is a real HR image or one generated by an SR algorithm. The performance of the discriminator was measured by the adversarial loss. The rationale was that this competition between the two networks would push the generated images closer to the manifold of natural images. In that work, the similarity between intermediate features obtained by passing the two images through a well-trained image classification model was also used as a perceptual loss, so that, in total, three different losses (along with MSE) were combined into one for training. It was clearly demonstrated that GANs can not only synthesize fantasy images from a random input but also be instrumental in image processing. Since then, GANs have become a method of choice for deblurring, super-resolution, denoising, etc. In the LapSRN model (Lai et al. 2017), shown in Fig. 2.10, the upsampling follows the principle of Laplacian pyramids, i.e. each level of the CNN learns to predict a residual that should explain the difference between a simple up-scale of the previous level and the desired result. The predicted high-frequency residuals at each level are used to efficiently reconstruct the HR image through upsampling and addition operations. The model has two branches: feature extraction and image reconstruction. The first one uses stacks of convolutional layers to produce and, later, up-scale the
Fig. 2.10 Laplacian pyramid network for 2×, 4× and 8× up-scaling. (Reproduced with permission from Lai et al. 2017)
residual images. The second one sums the residuals coming from the feature extraction branch with the images upsampled by bilinear interpolation and processes the results. The entire network is a cascade of CNNs with a similar structure at each level. Each level has its own loss function, computed with respect to the corresponding ground truth HR image at the specific scale. LapSRN generates multiscale predictions, with zoom factors equal to powers of 2. This design facilitates resource-aware applications, such as those running on mobile devices. For example, if there is a lack of resources for 8× zooming, the trained LapSRN model can still perform super-resolution with factors 2× and 4×. Like LapSRN, ProSR proposed by Wang et al. (2018a) aimed at the power-of-two up-scale task and was built on the same hierarchical pyramid idea (Fig. 2.11). However, the elementary building blocks for each level of the pyramid became more sophisticated. Instead of sequences of convolutions, the dense compression units (DCUs) were adapted from DenseNet. In a DCU, each convolutional layer obtains "collective knowledge" as additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers through concatenation. This results in better gradient flow during training. In order to reduce the memory consumption and increase the receptive field with respect to the original LR image, the authors used an asymmetric pyramidal structure with more layers in the lower levels. Each level of the pyramid consists of a cascade of DCUs followed by a sub-pixel convolution layer. A GAN variant of ProSR was also proposed, where the discriminator matched the progressive nature of the generator network by operating on the residual outputs of each scale. Compared to LapSRN, which also used a hierarchical scheme for power-of-two upsampling, in ProSR the intermediate subnet outputs are neither supervised nor used as the base image in the subsequent level. This design simplifies the backward pass and reduces the optimization difficulty.
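The per-level computation of LapSRN described above can be sketched as follows; this is an illustrative PyTorch rendering with assumed channel counts and depth, not the authors' code. Each level refines and upsamples the features, predicts a high-frequency residual, and adds it to a bilinearly upsampled copy of the previous level's image.

```python
import torch.nn as nn
import torch.nn.functional as F

class PyramidLevel(nn.Module):
    """One 2x level of a Laplacian-pyramid SR network (illustrative sketch)."""
    def __init__(self, channels: int = 64, depth: int = 5):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
        self.feature_branch = nn.Sequential(*layers)
        self.upsample_feat = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        self.to_residual = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, feat, img):
        feat = self.upsample_feat(self.feature_branch(feat))   # 2x feature upsampling
        residual = self.to_residual(feat)                      # predicted high frequencies
        base = F.interpolate(img, scale_factor=2, mode='bilinear', align_corners=False)
        return feat, base + residual                           # image passed to the next level
```

A separate loss is attached to the image output of every level, computed against the ground truth downscaled to that level's resolution.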
Fig. 2.11 Progressive super-resolution network. (Reproduced with permission from Wang et al. 2018a)
Fig. 2.12 DBPN and its up- and down-projection units. (Reproduced with permission from Haris et al. 2018)
The Deep Back-Projection Network DBPN (Haris et al. 2018) exploits iterative up- and down-sampling layers, providing an error feedback mechanism for projection errors at each stage. Inspired by iterative back-projection, an algorithm used since the 1990s for multi-frame super-resolution, the authors proposed using mutually connected up- and down-sampling stages, each of which represents different types of image degradation and high-resolution components. As in ProSR, dense connections between upsampling and down-sampling layers were added to encourage feature reuse. Initial feature maps are constructed from the LR image, and they are fed to a sequence of back-projection modules (Fig. 2.12). Each such module performs a change of resolution up or down with a set of learnable kernels, followed by a return to the initial resolution using another set of kernels. A residual between the input feature map and the one that was subjected to the up-down or down-up operation is computed and passed to the next up- or down-scaling stage. Finally, the
Fig. 2.13 Channel attention module used to reweight feature maps. (Reproduced with permission from Zhang et al. 2018b)
outputs of all the up-projection units are concatenated and processed by a convolutional layer to produce the HR output. Zhang et al. (2018b) introduced two novelties to SISR. First, they proposed a residual-in-residual (RIR) module that allows better gradient propagation in deep networks. RIR allows abundant low-frequency information to be bypassed through multiple skip connections, making the main network focus on learning high-frequency information. The resulting CNN has almost 400 layers but a relatively low number of parameters due to the heavy use of residual connections. Second, the authors introduced a channel attention (CA) mechanism, which is essentially a remake of the squeeze-and-excite module from SENet (Hu et al. 2018). Many SISR methods treat LR channel-wise features equally, which is not flexible for real cases, since high-frequency channel-wise features are more informative for HR reconstruction: they contain edges, textures, and fine details. In order to make the network focus on the more informative features, the authors exploit the interdependencies among feature channels. The CA first transforms an input stack of feature maps into a channel descriptor by using global average pooling (Fig. 2.13). Next, the channel descriptor is projected into a lower-dimensional space using a 1×1 convolution (which is equivalent to a linear layer). The resulting smaller vector is gated by a ReLU function, projected back into the higher-dimensional space using another 1×1 convolution, and finally passed through a sigmoid function. The resulting vector is used to multiply each input feature map by its own factor. In this way, the CA acts as a guide for finding the more informative components of an input and adaptively rescales channel-wise features based on their relative importance. The main contribution of Dai et al. (2019) to super-resolution was a novel trainable second-order channel attention (SOCA) module. Most of the existing CNN-based SISR methods mainly focus on wider or deeper architecture design, neglecting to explore the feature correlations of intermediate layers, which hinders the representational power of CNNs. Therefore, by adaptively learning feature interdependencies, one could rescale the channel-wise features and obtain more discriminative representations. With this rescaling, the SOCA module is similar to SENet (Hu et al. 2018) and RCAN (Zhang et al. 2018b). However, both SENet and RCAN only explored first-order statistics, because in those models global average pooling was applied to get one value for each feature map. Since, in SR, features with more high-frequency content are important for the overall image quality, it is natural to use second-order statistics.
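The channel attention block of Fig. 2.13 can be written down in a few lines. The sketch below is an illustrative PyTorch rendering with an assumed reduction ratio; it is not the authors' exact code.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excite style channel attention (illustrative sketch)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling -> channel descriptor
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),   # 1x1 conv = linear projection down
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),   # project back up
            nn.Sigmoid())                                    # per-channel weights in (0, 1)

    def forward(self, x):
        return x * self.gate(self.pool(x))    # rescale each feature map by its own factor
```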
The benefits of using second-order information in the context of CNN-based processing have been demonstrated for fine-grained classification of birds, face recognition, etc. After applying any layer of a CNN, a stack of C intermediate feature maps of size H × W is reshaped into a feature matrix X containing HW features of dimension C. The sample covariance matrix can then be computed as Σ = XX^T. This matrix Σ is symmetric positive semi-definite. The covariance matrix is used to produce a C-dimensional vector of second-order statistics through what is called a global covariance pooling function. This function computes the average of each row of the covariance matrix. As in SENet and RCAN, the resulting vector of statistics is passed through one 1×1 convolution layer that reduces its size to C/r, and, after applying the ReLU, it is transformed back into a C-dimensional vector wc. Finally, the input stack of C feature maps is scaled by multiplying it by the weights given by wc. The eigenvalue (EIG) decomposition involved in the normalization of Σ can be computed approximately in several iterations of the Newton-Schulz algorithm, so the whole SR model containing SOCA modules can be trained efficiently on a GPU. Wang et al. (2018b) addressed the problem of recovering natural and realistic texture from low-resolution images. As demonstrated in Fig. 2.14, many HR patches can have very similar LR counterparts. While GAN-based approaches (without a prior) and the use of a perceptual loss during training can generate plausible details, these details are revealed to be not very realistic upon careful examination. There should be a prior that helps the SR model differentiate between similar-looking LR patches, in order to generate HR patches that are more relevant and constrained to the semantic class present in the patch. It is shown in Fig. 2.14 that a model trained on a correct prior (e.g. on a dataset of only plants, or only buildings, but not a mix of them) succeeds in hypothesizing HR textures. Since training a separate super-resolution network for each semantic class is neither scalable nor computationally efficient, the authors proposed to modulate
Fig. 2.14 The building and plant patches from two LR images look very similar. Without a correct prior, GAN-based methods can add details that are not faithful to the underlying class. (Reproduced with permission from Wang et al. 2018b)
Fig. 2.15 Modulation of SR feature maps using affine parameters derived from probabilistic segmentation maps. (Reproduced with permission from Wang et al. 2018b)
the features F of some intermediate layers of a super-resolution CNN by semantic segmentation probability maps. Assuming that these maps are available from some other pretrained CNN, a set of convolutional layers takes them as input and maps them to two arrays of modulation parameters γ(i, j) and β(i, j) (Fig. 2.15). These arrays are then applied on a per-pixel, elementwise basis to the feature maps F: SFT(F | γ, β) = γ * F + β, where '*' denotes the Hadamard product. The reconstruction of an HR image with rich semantic regions can be achieved with just a single forward pass, by transforming the intermediate features of a single network. The SFT layers can be easily introduced into existing super-resolution networks. Finally, instead of a categorical prior, other priors such as depth maps can also be applied. This could help recover texture granularity in super-resolution. Previously, the efficacy of depth extraction from single images using CNNs has also been summarized in Chap. 1 by Safonov et al. (2019).
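A spatial feature transform layer of this kind can be sketched as follows; the small conditioning network and the channel sizes are assumptions for illustration, not the configuration of Wang et al. (2018b). The segmentation probability maps are mapped to per-pixel γ and β arrays that modulate the SR features.

```python
import torch.nn as nn

class SFTLayer(nn.Module):
    """Spatial feature transform: F -> gamma * F + beta, conditioned on prior maps (sketch)."""
    def __init__(self, feat_channels: int = 64, prior_channels: int = 8):
        super().__init__()
        def head():
            return nn.Sequential(
                nn.Conv2d(prior_channels, feat_channels, 1),
                nn.LeakyReLU(0.1, inplace=True),
                nn.Conv2d(feat_channels, feat_channels, 1))
        self.to_gamma = head()   # per-pixel multiplicative modulation
        self.to_beta = head()    # per-pixel additive modulation

    def forward(self, features, prior):
        # `prior` holds segmentation probability maps resized to the feature resolution.
        return self.to_gamma(prior) * features + self.to_beta(prior)  # Hadamard product plus shift
```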
2.7 Notable Applications and Future Challenges
In the previous sections, we have demonstrated several approaches and CNN architectures that allowed us to obtain state-of-the-art results for single image super-resolution in terms of PSNR or other image quality metrics. Let us now describe some notable, ingenious applications of SISR and list the directions of future improvements in this area. As we have seen before, SISR CNNs are usually trained to perform zooming by a fixed factor, e.g. 2, 3, or a power of 2. In practice, however, the user might need to up-scale an image by an arbitrary noninteger factor in order to fit the resulting image within some bounds, e.g. the screen resolution of a device. Also, in the multimedia content generation context, a continuous zoom that imitates movement towards the scene is a nice feature to have. There is, of course, an option to simulate this kind of arbitrary magnification by first zooming the LR image by the nearest integer factor using a trained CNN and then down- or up-sampling the output with standard interpolation. Hu et al. (2019)
proposed to use a special Meta-Upscale Module. This module can replace the standard deconvolution modules that are placed at the very end of CNNs and are responsible for the up-scaling. For an arbitrary scale factor, this module takes the zoom factor as input, together with the feature maps created by any SISR CNN, and dynamically predicts the weights of the up-scale filters. The CNN then uses these weights to generate an HR image of arbitrary size. Besides the elegance of a meta-learning approach and the obvious flexibility with regard to zoom factors, an important advantage of this approach is that parameters need to be stored for only one small trained subnetwork. The degradation factor that produces an LR image from an HR one is often unknown. It can be associated with a nonsymmetric blur kernel, it can contain noise from sensors or compression, and it can even be spatially dependent. One prominent approach to simultaneously deal with whole families of blur kernels and many possible noise levels has been proposed by Zhang et al. (2018c). By assuming that the degradation can be modelled as an anisotropic Gaussian blur, with the addition of white Gaussian noise with standard deviation σ, a multitude of LR images are created for every HR ground truth image present in the training dataset. These LR images are augmented with degradation maps, which are computed by projecting the degradation kernels onto a low-dimensional subspace using PCA. The degradation maps can be spatially dependent. The super-resolution multiple-degradations (SRMD) network performs simultaneous zooming and deblurring for several zoom factors and a wide range of blur kernels. It is assumed that the exact shape of the blur kernel can be reliably estimated during inference. Figure 2.16b demonstrates the result of applying SRMD to an LR image obtained from the HR image by Gaussian smoothing with an isotropic kernel whose width varies across the ground truth HR image; spatially dependent white Gaussian noise was also added. The degradation model shown in Fig. 2.16a, b is quite complex, but the results of simultaneous zooming and deblurring (Fig. 2.16c) still demonstrate sharp edges and good visual quality. This work was further extended to non-Gaussian degradation kernels by Zhang et al. (2019).
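The degradation-map construction can be illustrated with the following NumPy/SciPy sketch, written under assumed shapes and parameter values; it is not the SRMD authors' code. An anisotropic Gaussian kernel blurs the HR image, the result is decimated and corrupted with noise, and the kernel is projected onto a precomputed PCA basis to form the degradation descriptor that accompanies the LR image.

```python
import numpy as np
from scipy.signal import fftconvolve

def anisotropic_gaussian(size=15, sigma_x=2.0, sigma_y=0.8, theta=0.5):
    """Build a normalised anisotropic Gaussian blur kernel rotated by angle theta."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    yr = -xx * np.sin(theta) + yy * np.cos(theta)
    k = np.exp(-0.5 * ((xr / sigma_x) ** 2 + (yr / sigma_y) ** 2))
    return k / k.sum()

def degrade(hr, kernel, scale=2, sigma_noise=0.02, seed=0):
    """Blur, decimate, and add white Gaussian noise to produce an LR image (greyscale)."""
    rng = np.random.default_rng(seed)
    blurred = fftconvolve(hr, kernel, mode='same')
    lr = blurred[::scale, ::scale]
    return lr + rng.normal(0.0, sigma_noise, lr.shape)

def degradation_maps(kernel, pca_basis, sigma_noise, lr_shape):
    """Project the kernel onto a t-dimensional PCA basis and stack the code
    (plus the noise level) into per-pixel maps matching the LR image size."""
    code = pca_basis @ kernel.ravel()                 # low-dimensional kernel descriptor
    code = np.concatenate([code, [sigma_noise]])
    return np.tile(code[:, None, None], (1, *lr_shape))
```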
Fig. 2.16 Examples of SRMD on dealing with spatially variant degradation: (a) noise level and Gaussian blur kernel width maps; (b) zoomed LR image with noise added according to (a); (c) results of SRMD with scale factor 2
Fig. 2.17 Frame of the animated feature film “Thumbelina” produced in the USSR in 1964, restored and up-scaled using deep learning and GANs by Yandex
Deep learning algorithms can be applied to multiple frames for video super-resolution, a subject that we did not touch on in this chapter. In order to ensure proper spatiotemporal smoothness of the generated videos, DL methods are usually supplemented by optical flow and other cues from traditional computer vision, although many of these cues can also be generated and updated in a DL context. It is worth mentioning that the Russian Internet company Yandex (2018) has successfully used deep learning to restore and up-scale various historical movies and cartoons (see Fig. 2.17 for an example), which the company streamed under the name DeepHD. Super-resolution of depth maps is also a topic of intense research. Depth maps are obtained by depth cameras; they can also be computed from stereo pairs. Since this computation is time-consuming, it is advantageous to use classical algorithms to obtain an LR depth map and then rescale it by a large factor of ×4 to ×16. The loss of edge sharpness during super-resolution is much more prominent in depth maps than in regular images. It has been shown by Hui et al. (2016) that CNNs can be trained to accurately up-scale LR depth maps given the HR intensity images as an additional input. Song et al. (2019) proposed an improved multiscale CNN for depth map super-resolution that does not require the corresponding intensity images. A related application of SR is up-scaling of stereo images. Super-resolution of stereo pairs is challenging because of large disparities between similar-looking patches of the images. Wang et al. (2019) have proposed a special parallax-attention mechanism with a large receptive field along the epipolar lines to handle large disparity variations. Super-resolution was recently applied by Chen et al. (2018) to magnetic resonance imaging (MRI) in medicine. A special 3D CNN processes image volumes and allows the MRI acquisition time to be shortened at the expense of a minor image quality degradation. It is worth noting here that, while in the majority of use cases we expect the SISR algorithms to generate images that are visually appealing, in many cases when an important decision is to be made by analysing the image – e.g. in security, biometrics, and especially in medical imaging – a much more important issue is to keep the informative features unchanged during the
up-scaling, without introducing features that look realistic for the specific domain as a whole but are irrelevant and misleading for the particular case. High values of similarity, including perceptual metrics, may give an inaccurate impression of good performance of an algorithm. The ultimate verdict should come from visual inspection by a panel of experts. Hopefully, this expertise can also be learned and simulated by a machine learning algorithm to some degree. Deep learning-based super-resolution has come a long way since the first attempts at zooming synthetic images obtained by naïve bicubic down-sampling. Nowadays, super-resolution is an integral part of the general image processing pipeline, which includes denoising and image enhancement. Future super-resolution algorithms should be tunable by the user and provide reasonable trade-offs between the zoom factor, the denoising level, and the loss or hallucination of details. Suitable image quality metrics should be developed for assessing users' preferences. In order to make the algorithms generic and independent of the hardware, camera-induced artefacts should be disentangled from the image content. This should be done without requiring much training data from the same camera. Ideally, a single image should suffice to derive the prior knowledge necessary for up-scaling and denoising, without the need for pairs of LR and HR images. This direction is called zero-shot super-resolution (Shocher et al. 2018; Ulyanov et al. 2020; Bell-Kliger et al. 2019). Unpaired super-resolution is a subject of intense research; this task is formulated in all the recent super-resolution competitions. The capturing of image priors is often performed using generative adversarial networks and includes not only low-level statistics but semantics (colour, resolution) as well. It is possible to learn the "style" of the target (high-quality) image domain and transfer it to the super-resolved LR image (Pan et al. 2020). Modern super-resolution algorithms are computationally demanding, and there is no indication that increasing the number of convolutional layers in a CNN beyond some threshold inevitably leads to higher image quality. High quality is obtained by other means – residual or dense connections, attention mechanisms, multiscale processing, etc. The number of operations per pixel will most likely decrease in future SR algorithms. The CNNs will become more suitable for real-time processing, even for large images and high zoom factors. This might be achieved by training in fixed-point arithmetic, network pruning and compression, and automatic adaptation of architectures to target hardware using neural architecture search (Eisken et al. 2019). Finally, the application of deep learning methods to video SR, including the time dimension (frame rate up-conversion; see Chap. 15), will set new standards in multimedia content generation.
References
Agustsson, E., Timofte, R.: NTIRE 2017 challenge on single image super-resolution: dataset and study. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1122–1131 (2017)
Athar, S., Wang, Z.: A comprehensive performance evaluation of image quality assessment algorithms. IEEE Access. 7, 140030–140070 (2019) Bell-Kliger, S., Shocher, A., Irani, M.: Blind super-resolution kernel estimation using an internalGAN. Adv. Neural Inf. Proces. Syst. 32 (2019). http://www.wisdom.weizmann.ac.il/~vision/ kernelgan/index.html. Accessed on 20 Sept 2020 Buslaev, A., Iglovikov, V.I., Khvedchenya, E., Parinov, A., Druzhinin, M., Kalinin, A.A.: Albumentations: fast and flexible image augmentations. Information. 11(2), 125 (2020) Cai, J., et al.: NTIRE 2019 challenge on real image super-resolution: methods and results. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 2211–2223 (2019a). https://ieeexplore.ieee.org/document/9025504. Accessed on 20 Sept 2020 Cai, J., Zeng, H., Yong, H., Cao, Z., Zhang, L.: Toward real-world single image super-resolution: A new benchmark and a new model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3086–3095 (2019b) Chen, Y., Shi, F., Christodoulou, A.G., Xie, Y., Zhou, Z., Li, D.: Efficient and accurate MRI superresolution using a generative adversarial network and 3D multi-level densely connected network. In: Frangi, A., Schnabel, J., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) Medical Image Computing and Computer Assisted Intervention. Lecture Notes in Computer Science, vol. 11070. Springer Publishing Switzerland, Cham (2018) Chen, C., Xiong, Z., Tian, X., Zha, Z., Wu, F.: Camera lens super-resolution. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1652–1660 (2019) Dai, T., Cai, J., Zhang, Y., Xia, S., Zhang, L.: Second-order attention network for single image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11057–11066 (2019) Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: unifying structure and texture similarity. arXiv, 2004.07728 (2020) Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016a) Dong, C., Loy, C.C., He, K., Tang, X.: Accelerating the super-resolution convolutional neural network. In: Proceedings of the European Conference on Computer Vision, pp. 391–407 (2016b) Eisken, T., Metzen, J.H., Hutter, F.: Neural architecture search: a survey. J. Mach. Learn. Res. 20, 1–21 (2019) Haris, M., Shakhnarovich, G., Ukita, N.: Deep back-projection networks for super-resolution. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1664–1673 (2018) Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018) Hu, X., Mu, H., Zhang, X., Wang, Z., Tan, T., Sun, J.: Meta-SR: a magnification-arbitrary network for super-resolution. In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1575–1584 (2019) Huang, J., Singh, A., Ahuja, N.: Single image super-resolution from transformed self-exemplars. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206 (2015) Hui, T.-W., Loy, C.C., Tang, X.: Depth map super-resolution by deep multi-scale guidance. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol. 9907. 
Springer Publishing Switzerland, Cham (2016) Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol. 9906. Springer Publishing Switzerland, Cham (2016)
Kastryulin, S., Parunin, P., Zakirov, D., Prokopenko, D.: PyTorch image quality. https://github. com/photosynthesis-team/piq (2020). Accessed on 20 Sept 2020 Lai, W., Huang, J., Ahuja, N., Yang, M.: Deep Laplacian pyramid networks for fast and accurate super-resolution. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 5835–5843 (2017) Ledig, C., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 105–114 (2017) Lugmayr, A., et al.: AIM 2019 Challenge on real-world image super-resolution: methods and results. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop, pp. 3575–3583 (2019) Ma, C., Yang, C.-Y., Yang, M.-H.: Learning a no-reference quality metric for single-image superresolution. Comput. Vis. Image Underst. 158, 1–16 (2017) Migukin, A., Varfolomeeva, A., Chernyavskiy, A., Chernov, V.: Method for image superresolution imitating optical zoom implemented on a resource-constrained mobile device, and a mobile device implementing the same. US patent application 20200211159 (2020) Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 20(3), 209–212 (2013) Pan, X., Zhan, X., Dai, B., Lin, D., Change Loy, C., Luo, P.: Exploiting deep generative prior for versatile image restoration and manipulation. arXiv, 2003.13659 (2020) Prashnani, E., Cai, H., Mostofi, Y., Sen, P.: PieAPP: perceptual image-error assessment through pairwise preference. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1808–1817 (2018) Rogozhnikov, A.: Einops – a new style of deep learning code. https://github.com/arogozhnikov/ einops/ (2018). Accessed on 20 Sept 2020 Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for Scanning and Printing. Springer Nature Switzerland AG, Cham (2019) Shi, W., Caballero, J., Theis, L., Huszar, F., Aitken, A., Ledig, C., Wang, Z.: Is the deconvolution layer the same as a convolutional layer? arXiv, 1609.07009 (2016a) Shi, W., Caballero, J., Theis, L., Huszar, F., Aitken, A., Ledig, C., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883 (2016b) Shocher, A., Cohen, N., Irani, M.: “Zero-shot” super-resolution using deep internal learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3118–3126 (2018) Song, X., Dai, Y., Qin, X.: Deeply supervised depth map super-resolution as novel view synthesis. IEEE Trans. Circuits Syst. Video Technol. 29(8), 2323–2336 (2019) Timofte, R., et al.: NTIRE 2017 challenge on single image super-resolution: methods and results. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1110–1121 (2017) Timofte, R., et al.: NTIRE 2018 challenge on single image super-resolution: methods and results. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 965–96511 (2018) Ulyanov, D., Vedaldi, A., Lempitsky, V.: Deep image prior. Int. J. Comput. Vis. 
128, 1867–1888 (2020) Wang, Y., Perazzi, F., McWilliams, B., Sorkine-Hornung, A., Sorkine-Hornung, O., Schroers, C.: A fully progressive approach to single-image super-resolution. In: Proceedings of the IEEE/ CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 977–97709 (2018a) Wang, X., Yu, K., Dong, C., Change Loy, C.: Recovering realistic texture in image super-resolution by deep spatial feature transform. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 606–615 (2018b)
Wang, L., et al.: Learning parallax attention for stereo image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12242–12251 (2019) Yandex: DeepHD: Yandex’s AI-powered technology for enhancing images and videos. https:// yandex.com/promo/deephd/ (2018). Accessed on 20 Sept 2020 Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017) Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018a) Zhang, Y., Li, K., Li, K., Wang, L., Zhong, B., Fu, Y.: Image super-resolution using very deep residual channel attention networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision – ECCV 2018. Lecture Notes in Computer Science, vol. 11211. Springer Publishing Switzerland (2018b) Zhang, K., Zuo, W., Zhang, L.: Learning a single convolutional super-resolution network for multiple degradations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3262–3271 (2018c) Zhang, K., Zuo, W., Zhang, L.: Deep plug-and-play super-resolution for arbitrary blur kernels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1671–1681 (2019)
Chapter 3
Depth Estimation and Control
Ekaterina V. Tolstaya and Viktor V. Bucha
3.1 Introduction
With the release of James Cameron's hugely popular Hollywood movie "Avatar" in 2009, the era of 3D TV technology got a second wind in the early 2010s. By 2010, interest was very high, and all major TV makers included a "3D-ready" feature in their smart TVs. In early 2010, the world's biggest manufacturers, such as Samsung, LG, Sony, Toshiba, and Panasonic, launched their first home 3D TVs on the market. 3D TV technology was at the peak of expectations. Figure 3.1 shows some of the prospective technologies as they were seen in 2010. Over the next several years, 3D TV was a hot topic at major consumer electronics shows. However, because of the absence of technical innovation, the lack of content, and the increasingly clear disadvantages of the technology, consumer interest in such devices started to subside (Fig. 3.2). By 2016, almost all TV makers had announced the termination of the 3D TV feature in flat panel TVs and turned their attention to high-resolution and high-dynamic-range features, though 3D cinema is still popular. From the marketing and technology points of view, the possible causes of this shift in interest are the following.
Fig. 3.1 Some prospective technologies, as they were seen in 2010, including 3D flat-panel TVs and displays. (Gartner Hype Cycle for Emerging Technologies 2010, www.gartner.com). The hype cycle plots expectations over time through the phases technology trigger, peak of inflated expectations, trough of disillusionment, slope of enlightenment, and plateau of productivity
Fig. 3.2 Popularity of the term "3D TV" measured by Google Trends over 2008–2020, given as a percentage of its maximum. The annotation marks the "Avatar" premiere on December 10, 2009, in London
1. Inappropriate moment. The recent transition from analogue to digital TV had forced many consumers to buy new digital TVs, and by 2010 many of them were not ready to invest in new TVs once again.
2. Extra cost. To take full advantage of this new technology, consumers also had to buy a 3D-enabled Blu-ray player or get a 3D-enabled satellite box.
3. Uncomfortable glasses. 3D images work on the principle that each of our eyes sees a different picture. By perceiving a slightly different picture from each
eye, the brain automatically constructs the third dimension. 3D-ready TVs came with stereo glasses based on so-called active or passive glasses technology. Glasses of different manufacturers could be incompatible with each other. Moreover, a family of three or more people would need additional pairs, since usually only one or two pairs were supplied with a TV. Viewers wearing prescription glasses had to wear the 3D pair over their own. And finally, the glasses needed charging, so to be able to watch TV, you had to keep several pairs fully charged.
4. Live TV. It is difficult for broadcast networks to support 3D TV: a separate channel is required for broadcasting 3D content, in addition to the conventional 2D channel for viewers without the 3D feature.
5. Picture quality. The picture in the 3D mode is dimmer than in the conventional 2D mode, because each eye sees only half of the pixels (or half of the light) intended for the picture. In addition, viewing a 3D movie on a smaller screen with a narrow field of view does not give a great experience, because the perceived depth is much smaller than on a big cinema screen. In this case, even increasing the parallax does not boost depth but adds to eye fatigue and headache.
6. Multiple-user scenario. When several people watch a 3D movie, not all of them can sit at the point in front of the TV that gives the best viewing experience. This leads to additional eye fatigue and picture defects.
It is clear that some of the mentioned technological problems still exist and await future engineers to resolve them. However, even now we can see that it is still possible for stereo reproduction technology to meet its next wave of popularity. Virtual reality headsets, which recently appeared on the market, give an excellent viewing experience when watching 3D movies.
3.2 Stereo Content Reproduction Systems
The whole idea of generating a 3D impression in the viewer's brain is based on the principle of showing a slightly different image to each eye of the viewer. The images are shifted in the horizontal direction by a distance called the parallax. This creates the illusion of depth in binocular vision. There are several common ways to achieve this that require view separation, i.e. systems that use a single screen for stereo content reproduction. Let us consider different ways of separating the left and right views on the screen.
3.2.1 Passive System with Polarised Glasses
In passive systems with polarised glasses, the two pictures are projected with different light polarisation, and correspondingly polarised glasses separate the images. The system requires a rather expensive screen that preserves the polarisation of the reflected light. Usually, such a system is used in 3D movie
theatres. The main disadvantage is loss of brightness, since only half of the light reaches each eye.
3.2.2 Active Shutter Glass System
The most common system in home 3D TVs is based on active shutter glasses: the TV alternates the left and right views on the screen, and synchronised glasses occlude each eye in turn. The stereo effect and its strength depend on the parallax (i.e. the difference) between the two views of a stereopair. The main disadvantage is loss of frame rate, because only half of the existing video frames reach each eye.
3.2.3 Colour-Based Views Separation
Conventional colour anaglyphs use the red and blue parts of the colour spectrum, accompanied by glasses with correspondingly coloured filters for the left and right views. More recent systems use amber and blue filters, such as the ColorCode 3D system, or Inficolor 3D, where the left image uses the green channel only and the right image uses the red and blue channels with some added post-processing; the brain then combines the two images to produce a nearly full-colour experience. The latter system even allows watching stereo content in full colour without glasses. In the more sophisticated system by Dolby 3D (first developed by Infitec), a specific wavelength of the colour gamut is used for each eye, with an alternating colour wheel placed in front of the projector; glasses with corresponding dichroic filters in the lenses filter out either one or the other set of light wavelengths. In this way, the projector can display the left and right images simultaneously, but the filters are quite fragile and expensive. Colour-based stereo reproduction is most susceptible to crosstalk when the colours of the glasses and filters are not well calibrated. The common problems of all such systems are the limited position of the viewer (no head turning) and eye fatigue due to various reasons.
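As a simple illustration of colour-based view separation, the classic red-cyan anaglyph can be composed from a stereopair in a few lines (an illustrative NumPy sketch, not a description of any particular commercial system):

```python
import numpy as np

def red_cyan_anaglyph(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Compose a red-cyan anaglyph from two HxWx3 RGB views: the red channel
    comes from the left view, green and blue come from the right view."""
    anaglyph = np.empty_like(left)
    anaglyph[..., 0] = left[..., 0]      # red from the left view
    anaglyph[..., 1:] = right[..., 1:]   # green and blue from the right view
    return anaglyph
```

Viewed through red-cyan glasses, each eye then receives mostly its own view; poorly matched filter and display colours produce the crosstalk mentioned above.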
3.3 Eye Fatigue
Let us consider in more detail the causes of eye fatigue while viewing 3D video on TV. The idea of showing slightly different images to each eye allows the 3D effect to be created in the viewer's brain. The bigger the parallax, the more obvious the 3D effect. The types of parallax are illustrated in Fig. 3.3.
Fig. 3.3 Types of parallax: negative, positive, and zero parallax relative to the stereo plane
1. Zero parallax. The image difference between the left and right views is zero, and the eye focuses right at the plane of focus. This is a convenient situation in real life, and generally it does not cause viewing discomfort.
2. Positive (uncrossed) parallax. The convergence point is located behind the projection screen. The most comfortable viewing is achieved when the parallax is almost equal to the interocular distance.
3. Negative (crossed) parallax. The focusing point is in front of the projection screen. The parallax depends on the convergence angle and the distance of the observer from the display, and therefore it can be more than the interocular distance.
4. Positive diverged parallax. The optical axes must diverge to perceive stereo with a parallax exceeding the interocular distance. This case can cause serious visual discomfort for observers.
Parallax is usually measured as a percentage of the shift relative to the frame width. In real 3D movies, the parallax can be as big as 16% (e.g. "Journey To The Center of The Earth"), 9% in "Dark Country", and 8% in "Dolphins and Whales 3D: Tribes of the Ocean" (Vatolin 2015). This means that the parallax can be up to 1 metre when viewing a 6-metre-wide cinema screen, which is significantly bigger than the average interocular distance. Such a situation causes unconventional behaviour of the eyes, which try to diverge. Smaller screens (like watching TV on a smartphone) have smaller parallax, and the eyes can adapt to 3D better. However, the drawback is that on smaller screens, the 3D effect is smaller, and objects look flat. Usually, human eyes use the mechanism of accommodation to see objects at different distances in focus (see Fig. 3.4). The muscles that control the lens in the eye shorten to focus on close objects. The limited depth of field means that objects that are not at the focal length are typically out of focus. This enables viewers to ignore certain objects in a scene. The most common causes of eye fatigue (Mikšícek 2006) are enumerated below.
1. Breakdown of the accommodation and convergence relationship. When a viewer observes an object in the real world, the eyes focus on a specific point belonging to this object. However, when the viewer watches 3D content, the eyes try to
Fig. 3.4 Accommodation for near and far targets: when accommodating for a near target, near objects are in focus and far objects are blurred; when accommodating for a far target, far objects are in focus and near objects are blurred
Fig. 3.5 Conflict between interposition and parallax
Fig. 3.6 Vertical disparity
focus on "popping-out" 3D objects, while for a sharp picture they have to focus on the screen plane. This can misguide our brains and add a feeling of sickness.
2. High values of the parallax. High values of parallax can lead to divergence of the eyes, and this is the most uncomfortable situation.
3. Crosstalk (ghosts). Crosstalk occurs when the picture dedicated to the left eye view is partly visible to the right eye and vice versa. It is quite common for technologies based on colour separation and light polarisation, when filtering is insufficient, or in cases of bad synchronisation between the TV display and the shutter glasses.
4. Conflict between interposition and parallax. A special type of conflict between depth cues appears when a portion of an object in one of the views is clipped by the screen (or image window) surround. The interposition depth cue indicates that the image surround is in front of the object, which is in direct opposition to the
Fig. 3.7 Common cue collision
Fig. 3.8 Viewer position: a viewer at the correct position in front of the stereo display perceives the correct scene point P, while a viewer at another position perceives the scene point at a distorted position P′
disparity depth cue. This conflict causes depth ambiguity and confusion (Fig. 3.5).
5. Vertical disparities. Vertical disparities are caused by wrong placement of the cameras or faulty calibration of the 3D presentation apparatus (e.g. different focal lengths of the camera lenses). Figure 3.6 illustrates vertical disparity.
6. Common cue collision. Any logical collision between binocular and monocular cues, such as light and shade, relative size, interposition, textural gradient, aerial perspective, motion parallax, perspective, and depth cueing (Fig. 3.7).
7. Viewing conditions. Viewing conditions include viewing distance, screen size, lighting of the room, viewing angle, etc., as well as personal features of the viewer: age, anatomical size, and eye adaptability. Generally, the older the person, the greater the eye fatigue and sickness, because of the lower adaptability of the brain. For children, who have a smaller interocular distance, the 3D effect is more pronounced, but the younger brain and greater adaptability decrease the negative experience. For people with strabismus, it is impossible to perceive stereo content at all. Figure 3.8 illustrates the situation when one of the viewers is not at the optimal position.
8. Content quality. Geometrical distortions, differences in colour, sharpness, brightness/contrast, and depth of field of the production optical system between the left and right views, flipped stereo, and time shift all contribute to lower content quality, causing more eye fatigue.
3.4 Depth Control for Stereo Content Reproduction
The majority of the cited causes of eye fatigue relate to stereo content quality. However, even high-quality content can have inappropriate parameters, such as a high value of parallax. To compensate for this effect, a fast real-time depth control technology has been proposed, which is aimed at reducing the perceived depth of 3D content by diminishing the stereo effect (Fig. 3.9). The depth control feature can be implemented in a 3D TV, and it can be controlled from the TV remote (Ignatov and Joesan 2009), as shown in Fig. 3.10. The proposed scheme of stereo effect modification is shown in Fig. 3.11. First, the depth map between the input stereo views is estimated; the depth is post-processed to
Fig. 3.9 To reduce perceived depth and associated eye fatigue, it is necessary to diminish the stereo effect
Fig. 3.10 Depth control functionality for stereo displays
Fig. 3.11 Depth control general workflow: disparity/depth estimation from the original left and right eye images, depth control parameter estimation, and intermediate view generation producing the processed left and right eye images
Fig. 3.12 Depth tone mapping: (a) initial depth; (b) tone-mapped depth
remove artefacts and then mapped for the modified stereo effect, after which the left and right views are generated. The depth control method uses two techniques.
1. Control of the depth of pop-up objects (reduction of excessive negative parallax). This can be thought of as a reduction of the stereo baseline, with the modified stereo views moved towards each other. In this case, the 3D perception of close objects is reduced first of all. This technique can be realised via view interpolation, where the virtual view for the modified stereopair is interpolated from the initial stereopair according to a portion of the depth/disparity vector. Areas of the image with small disparity vectors remain almost the same, while areas of the image with large disparity vectors (pop-up objects) produce a weaker perception of depth.
2. Control of the depth of the image plane. This can be thought of as moving the image plane along the z-direction. Then, the perceived 3D scene moves further away in the z-direction. This technique can be realised via depth tone mapping with subsequent view interpolation. Depth tone mapping equally decreases depth perception for every region of the image. Figure 3.12 illustrates how all objects of the
scene are made more distant for the observer. Depth tone mapping can be realised through pixel-wise contrast, brightness, and gamma operations.
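A pixel-wise tone mapping of the depth map along these lines can be sketched as follows; this is an illustrative NumPy example with assumed parameter values and an assumed 8-bit depth convention (larger values meaning closer objects), not the production implementation.

```python
import numpy as np

def tone_map_depth(depth: np.ndarray, gain: float = 0.7,
                   offset: float = -20.0, gamma: float = 1.2) -> np.ndarray:
    """Pixel-wise contrast (gain), brightness (offset), and gamma mapping of an
    8-bit depth map; shrinking large values pushes close objects away from the viewer."""
    d = depth.astype(np.float32) / 255.0
    d = np.clip(gain * d + offset / 255.0, 0.0, 1.0)   # contrast and brightness
    d = d ** gamma                                     # gamma compression
    return np.round(d * 255.0).astype(np.uint8)
```

The tone-mapped depth is then used by the view interpolation step to synthesise the modified stereopair.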
3.5 Fast Recursive Algorithm for Depth Estimation From Stereo
Scharstein and Szeliski (2002) presented a taxonomy of matching algorithms based on the observation that stereo algorithms generally perform (subsets of) the following four steps (see Fig. 3.13). The following constraints are widely used in stereo matching algorithms.
1. Epipolar constraint: the search range for a corresponding point in one image is restricted to the epipolar line in the other image.
2. Uniqueness constraint: a point in one image should have at most one corresponding point in the other image.
3. Continuity constraint: the disparity varies slowly across a surface.
4. Photometric constraint: corresponding points have similar photometric properties in the matching images.
Area-based approaches are the oldest methods used in computer vision. They are based on photometric compatibility constraints between matched pixels. The following optimisation equation is solved for individual pixels in a rectified stereopair:

D(x, y) = argmin_d (I_r(x, y) − I_t(x + d, y)) = argmin_d Cost(x, y, d),

where I_r is the pixel intensity in the reference image, I_t is the pixel intensity in the target image, d ∈ [d_min, d_max] is the disparity range, and D(x, y) is the disparity map. The photometric constraint applied to a single pixel pair does not provide a unique solution. Instead of comparing individual pixels, several neighbouring pixels are grouped in a support window, and their intensities are compared with those of the pixels in another window. The simplest matching measure is the sum of absolute differences (SAD). The disparity which minimises the SAD cost for each pixel is chosen. The optimisation equation can be rewritten as follows:
Fig. 3.13 Acquisition of depth data using a common approach: matching cost computation, cost (support) aggregation, disparity computation/optimisation, and disparity refinement
Fig. 3.14 Basic area-based approach
D(x, y) = argmin_d Σ_i Σ_j (I_r(x_i, y_j) − I_t(x_i + d, y_j)) = argmin_d Σ_i Σ_j Cost(x_i, y_j, d),
where i ∈ [−n, n] and j ∈ [−m, m] define the support window size (Fig. 3.14). Other matching measures include normalised cross correlation (NCC), modified normalised correlation (MNCC), rank transform, etc. However, there is a problem with correlation and SAD matching, since the window size should be large enough to include enough intensity variation for matching but small enough to avoid the effects of projective distortions. For this reason, approaches which adaptively select the window size depending on local variations of intensities have been proposed. Kanade and Okutomi (1994) attempt to find the ideal window in size and shape for each pixel in an image. Prazdny (1987) proposed a new function to assign support weights to neighbouring pixels iteratively. In this method, it is assumed that neighbouring disparities, if corresponding to the same object in a scene, are similar and that two neighbouring pixels with similar disparities support each other. In general, the prior-art aggregation step uses rectangular windows for grouping the neighbouring pixels and comparing their intensities with those of the pixels in another window. The pixels can be weighted using linear or nonlinear filters for better results. The most popular nonlinear filter for disparity estimation with a variable support strategy is the cross-bilateral filter. However, the computational complexity of this type of filter is extremely high, especially for real-time applications. In this work, we adapted a separable recursive bilateral-like filtering for matching cost aggregation. It has a constant-time complexity which is independent of the filter window size and runs much faster than the traditional one while producing a similar aggregation result of the matching cost (Tolstaya and Bucha 2012). We used a recursive implementation of the cost aggregation function, similar to (Deriche 1990). The separable implementation of the bilateral filter allows significant speed-up of computations, having a result similar to the full-kernel implementation (Pham and
van Vliet 2005). Right-to-left and left-to-right disparities are computed using similar considerations. First, a difference image between the left and right images is computed:

D(x, y, δ) = Δ(I_l(x, y) − I_r(x − δ, y)),

where Δ is a measure of colour dissimilarity. It can be implemented as a mean difference between colour channels or as some more sophisticated measure, like the Birchfield-Tomasi method (Birchfield and Tomasi 1998), which does not depend on image sampling. The cost function is computed by accumulating the image difference within a small window, using adaptive support by analogy with cross-bilateral filtering:

F(x, y, δ) = (1/w) Σ_{x′, y′ ∈ Γ} D(x′, y′, δ) S(|x − x′|) h(Δ(I(x, y), I(x′, y′))),

where w(x, y) is the weight normalising the filter output, computed according to the following formula:

w(x, y) = Σ_{x′, y′ ∈ Γ} S(|x − x′|) h(Δ(I(x, y), I(x′, y′))),

and Γ is the support window. This helps to adapt the filtering window according to the colour similarity of image regions. In our work, we used the following range and space filter kernels h(r) and S(x), respectively:

h(r) = exp(−|r| / σ_r) and S(x) = exp(−|x| / σ_s).

The disparity d is computed via minimisation of the cost function F:

d(x, y) = argmin_δ F(x, y, δ).
Symmetric kernels allow separable accumulation over rows and columns. The kernels h(r) and S(x) are not equal to the commonly used Gaussian kernels, but with these kernels it is possible to construct a recursive accumulating function and significantly increase the processing speed while preserving the quality. Let us consider the one-dimensional case of smoothing with S(x). For fixed δ, we have
F(x) = Σ_{k=0}^{N−1} D(k) S(k − x).
Unlike (Deriche 1990), the second pass is based on the result of the first pass:

F_1(x) = D(x)(1 − α) + α F_1(x − 1),
F(x) = F_1(x)(1 − α) + α F(x + 1),

with the normalising coefficient α = e^(−1/σ_s) to ensure that the range of the output signal is the same as the range of the input signal. In the case of cross-bilateral filtering, the weight α is a function of x, and the formulas for the filtered signal (first pass) are the following:

P_1(x) = I(x)(1 − α) + α P_1(x − 1),
α(x) = exp(−1/σ_s) h(|P_1(x) − I(x)|),
F_1(x) = D(x)(1 − α(x)) + α(x) F_1(x − 1).

The backward pass is modified similarly, and F(x) is the filtered signal:

P(x) = P_1(x)(1 − α) + α P(x + 1),
α(x) = exp(−1/σ_s) h(|P(x) − I(x)|),
F(x) = F_1(x)(1 − α(x)) + α(x) F(x + 1).

To compute the aggregated cost in the 2D case, four passes of the recursive equations are performed: left to right, right to left, top to bottom, and bottom to top. These formulas give only an approximate solution for the cross-bilateral filter, but for the purpose of cost function aggregation, they give adequate results. After the matching cost function has been aggregated for every δ, a pass over δ for every pixel gives the disparity values. Disparity in occlusion areas is filtered additionally, according to the following formulas, using symmetry considerations:

D_FL(x, y) = min(D_L(x, y), D_R(x − D_L(x, y), y)),
D_FR(x, y) = min(D_R(x, y), D_L(x + D_R(x, y), y)),

where D_L is the disparity map from the left image to the right image and D_R is the disparity from right to left.
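The adaptive recursive aggregation described above can be sketched for a single image row and a single disparity hypothesis in NumPy as follows. This is an illustrative reading of the formulas, with grey-level intensities; it is not the authors' optimised implementation.

```python
import numpy as np

def recursive_aggregate_1d(cost_row, intensity_row, sigma_s=10.0, sigma_r=0.1):
    """Adaptive recursive smoothing of one row of the matching cost, guided by
    the similarity of intensities along the image row (illustrative sketch)."""
    n = len(cost_row)
    base = np.exp(-1.0 / sigma_s)                   # constant part of alpha
    h = lambda r: np.exp(-abs(r) / sigma_r)         # range kernel
    p1 = np.empty(n); f1 = np.empty(n)
    p1[0], f1[0] = intensity_row[0], cost_row[0]
    for x in range(1, n):                           # forward pass
        p1[x] = intensity_row[x] * (1 - base) + base * p1[x - 1]
        alpha = base * h(p1[x] - intensity_row[x])
        f1[x] = cost_row[x] * (1 - alpha) + alpha * f1[x - 1]
    p, f = p1.copy(), f1.copy()
    for x in range(n - 2, -1, -1):                  # backward pass uses the forward result
        p[x] = p1[x] * (1 - base) + base * p[x + 1]
        alpha = base * h(p[x] - intensity_row[x])
        f[x] = f1[x] * (1 - alpha) + alpha * f[x + 1]
    return f
```

The full 2D aggregation repeats such passes left to right, right to left, top to bottom, and bottom to top for every disparity hypothesis δ, followed by the minimisation over δ described above.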
Fig. 3.15 Left image of stereopair (a) and computed disparity map (b)
This rule is very efficient for correcting disparity in occlusion areas in stereo matching, because it is usually known that the minimal (or maximal, depending on the stereopair format) disparity corresponds to the farthest (covered) objects, and occlusions occur near boundaries and cover the farther objects. Figure 3.15 shows the results of the proposed algorithm.
3.6 Depth Post-Processing
The proposed method relies on the idea of convergence from a rough estimate towards a consistent depth map through subsequent iterations of a depth filter (Ignatov et al. 2009). On each iteration, the current depth estimate is refined by filtering in accordance with the images from the stereopair. The reference image is the colour image from the stereopair for which the depth is estimated. The matching image is the other colour image from the stereopair. The first step of the method for depth smoothing is analysis and cutting of the reference depth histogram (Fig. 3.16). The cutting of the histogram suppresses noise present in the depth data. The raw depth estimates could have a lot of outliers. The noise might appear due to false stereo matching in occlusion areas and in textureless areas. The proposed method uses two thresholds: a bottom of the histogram range B and a top of the histogram range T. These thresholds are computed automatically from the given percentage of outliers. The next step of the method for depth smoothing is the left-right depth cross-check. The procedure operates as follows:

• Compute the left disparity vector (LDV) from the left depth value.
• Fetch the right depth value mapped by the LDV.
• Compute the right disparity vector (RDV) from the right depth value.
• Compute the disparity difference (DD) of the absolute values of LDV and RDV.
• If DD is higher than the threshold, the left depth pixel is marked as an outlier.
Fig. 3.16 Steps of the depth post-processing algorithm: depth histogram cutting, depth cross-check, image segmentation into textured and non-textured regions, and depth smoothing
Fig. 3.17 Example of left, right depth cross-check: (a) left image; (b) right image; (c) left depth; (d) right depth; (e) left depth with noisy pixels (marked black); (f) smoothing result for left depth without depth cross-checking; (g) smoothing result for left depth with depth cross-checking
In our implementation, the threshold for the disparity cross-check is set to 2, and noisy pixels are marked by 0. Since 0 < 64, noisy pixels are automatically treated as outliers in further processing. An example of a depth map with noisy pixels marked according to the depth cross-check is shown in Fig. 3.17. It shows that the depth cross-check successfully removes outliers from occlusion areas (shown by red circles).
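A minimal NumPy sketch of such a left-right cross-check is shown below; the function name, the sign convention of the disparity mapping, and the array layout are illustrative assumptions rather than the exact implementation.

import numpy as np

def cross_check(D_left, D_right, threshold=2):
    """Mark left-disparity pixels that disagree with the right disparity map.
    D_left and D_right are integer disparity maps of equal size; the convention
    x_right = x_left - d is assumed (it depends on the stereopair format)."""
    h, w = D_left.shape
    ys, xs = np.indices((h, w))
    # position in the right image pointed to by the left disparity vector (LDV)
    x_right = np.clip(xs - D_left.astype(int), 0, w - 1)
    d_right = D_right[ys, x_right]
    # disparity difference between the two views
    dd = np.abs(D_left - d_right)
    result = D_left.copy()
    result[dd > threshold] = 0          # noisy pixels are marked by 0
    return result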
Fig. 3.18 Example of image segmentation into textured and non-textured regions: (a) colour image; (b) raw depth; (c) binary segmentation mask (black, textured regions; white, non-textured regions); (d) smoothing result without using image segmentation; (e) smoothing result using image segmentation
The next step of the method for depth smoothing is binary segmentation of the left colour image into textured and non-textured regions. For this purpose, the gradients in four directions, i.e. horizontal, vertical, and two diagonal, are computed. If all gradients are lower than the predefined threshold, the pixel is considered to be non-textured; otherwise it is treated as textured. This can be formulated as follows:

BS(x, y) = 255, if gradients(x, y) < Threshold,
BS(x, y) = 0, otherwise,
where BS is the binary segmentation mask for the pixel with coordinates (x, y); a value of 255 corresponds to a non-textured image pixel, while 0 corresponds to a textured one. Figure 3.18 presents an example of image segmentation into textured and non-textured regions, along with an example of depth map smoothing with and without the segmentation mask.
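A straightforward NumPy sketch of this segmentation step is given below; the gradient step size and the threshold value are illustrative, not those of the actual implementation.

import numpy as np

def texture_mask(gray, threshold=8.0):
    """Binary segmentation into textured (0) and non-textured (255) pixels,
    using absolute gradients in four directions (sketch)."""
    g = gray.astype(np.float32)
    pad = np.pad(g, 1, mode='edge')
    gh = np.abs(pad[1:-1, 2:] - pad[1:-1, :-2])      # horizontal
    gv = np.abs(pad[2:, 1:-1] - pad[:-2, 1:-1])      # vertical
    gd1 = np.abs(pad[2:, 2:] - pad[:-2, :-2])        # first diagonal
    gd2 = np.abs(pad[2:, :-2] - pad[:-2, 2:])        # second diagonal
    non_textured = (gh < threshold) & (gv < threshold) & \
                   (gd1 < threshold) & (gd2 < threshold)
    return np.where(non_textured, 255, 0).astype(np.uint8)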
Fig. 3.19 Examples of processed depth: (a) colour images; (b) initial raw depth maps; (c) depth maps smoothed by the proposed method
Figure 3.19 presents examples of processed depth maps.
3.7 Stereo Synthesis
The problem of depth-based virtual view synthesis (or depth image-based rendering, DIBR) means reconstructing the view from a virtual camera CV, while views from other cameras C1 and C2 (or different views captured by a moving camera) and available scene geometry (point correspondences, depth, or a precise polygon model) are provided (see Fig. 3.20). The following problems need to be addressed in particular during view generation:

• Disocclusion
• Temporal consistency
• Symmetric vs asymmetric view generation
• Toed-in camera configuration
Fig. 3.20 Virtual view synthesis (cameras C1 and C2, virtual camera Cv, and the reconstructed 3D scene)
Fig. 3.21 Virtual view synthesis: disocclusion areas in the virtual images for the left and right virtual cameras with respect to the reference image
3.7.1 Disocclusion
As we intend to use one depth map for virtual view synthesis, we should be prepared for the appearance of disocclusion areas. A disocclusion area is a part of the virtual image which becomes visible in the novel viewpoint, in contrast to the initial view. Examples of disocclusion areas are marked in black in Fig. 3.21. A common way to eliminate disocclusions is to fill up these areas with the colours of neighbouring pixels.
3.7.2 Temporal Consistency
Most stereo disparity estimation methods consider still images as input, but TV stereo systems require real-time depth control/view generation algorithms intended for video. When considering all frames independently, some flickering can occur, especially near objects’ boundaries. Usually, the depth estimation algorithm is modified to output temporally consistent depth maps. Since it is not very practical to use some complicated algorithms like bundle adjustment, more computationally effective methods are applied, like averaging inside a small temporal window.
3.7.3 Symmetric Versus Asymmetric View Generation
The task of stereo intermediate view generation is a particular case of arbitrary view rendering, where the positions of virtual views are constrained to lie on the line connecting the centres of source cameras. To generate the new stereopair with a reduced stereo effect, we applied symmetric view rendering (see Fig. 3.22), where the middle point of baseline stays fixed and both left and right views are generated. Other configurations will render only one view, leaving the other intact. But in this case, the disocclusion area will be located on one side of the popping-out objects and can be more susceptible to artefacts.
3.7.4 Toed-in Camera Configuration
There are two possible camera configurations: parallel and toed-in (see Fig. 3.23). In the case of the parallel configuration, depth has a positive value, and all objects appear in front of the screen. When the stereo effect is large, this can cause eye discomfort. The toed-in configuration is closer to the natural human visual system. However, the toed-in configuration generates keystone distortion in the images, including vertical disparity. Due to the non-parallel disparity lines, the depth estimation algorithm will give erroneous results, and such content will require rectification. To eliminate eye discomfort from stereo and preserve the stereo effect, a method of zero-plane setting can be applied. It consists of shifting the virtual image plane and reducing the depth by some amount so that it has negative values in some image areas. Figures 3.24 and 3.25 illustrate the resulting stereopairs with a 30% depth decrease. In the proposed application, we consider depth decrease as more applicable for the "depth control" feature, since the automatic algorithm in this case will not face the problem of disocclusion and, hence, will hopefully produce fewer artefacts. In the following Chap. 4 (Semi-Automatic 2D to 3D Video Conversion), we will further address the topic of depth image-based rendering (DIBR) for situations with depth increase and appearing disocclusion areas that should be treated in a specific way. Finally, we would like to add that, unfortunately, very few models of 3D TVs were equipped with a "3D depth control" feature for a customisable strength of the stereo effect, one example being the LG Electronics 47GA7900. We can consider automatic 2D → 3D video conversion systems (which were available in production in some models of 3D TVs by LG Electronics, Samsung, etc. and also in the commercially available TriDef 3D software by DDD Group) as part of such a feature, since during conventional monoscopic-to-stereoscopic video conversion, the user can preset the desired amount of stereo effect.

Fig. 3.22 Symmetric stereo view rendering
Fig. 3.23 Parallel (a) and toed-in (b) camera configuration; illustration of keystone distortion (c)
Fig. 3.24 Virtual view synthesis: initial stereopair (top); generated stereopair with 30% depth decrease (bottom)
Fig. 3.25 Virtual view synthesis: initial stereopair (top); generated stereopair with 30% depth decrease (bottom)
References

Birchfield, S., Tomasi, C.: A pixel dissimilarity measure that is insensitive to image sampling. IEEE Trans. Pattern Anal. Mach. Intell. 20(4), 401–406 (1998)
Deriche, R.: Fast algorithms for low-level vision. IEEE Trans. Pattern Anal. Mach. Intell. 12(1), 78–87 (1990)
Ignatov, A., Joesan, O.: Method and system to transform stereo content. European Patent EP2293586 (2009)
Ignatov, A., Bucha, V., Rychagov, M.: Disparity estimation in real-time 3D acquisition and reproduction system. In: Proceedings of International Conference on Computer Graphics "Graphicon", pp. 61–68 (2009)
Kanade, T., Okutomi, M.: A stereo matching algorithm with an adaptive window: theory and experiment. IEEE Trans. Pattern Anal. Mach. Intell. 16(9), 920–932 (1994)
Mikšícek, F.: Causes of visual fatigue and its improvements in stereoscopy. University of West Bohemia in Pilsen, Pilsen, Technical Report DCSE/TR-2006-04 (2006)
Pham, T., van Vliet, L.: Separable bilateral filtering for fast video preprocessing. In: Proceedings of IEEE International Conference on Multimedia and Expo, pp. 1–4 (2005)
Prazdny, K.: Detection of binocular disparities. In: Fischler, M.A., Firschein, O. (eds.) Readings in Computer Vision: Issues, Problems, Principles, and Paradigms, pp. 73–79. Morgan Kaufmann, Los Altos (1987)
Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 47(1), 7–42 (2002)
Tolstaya, E.V., Bucha, V.V.: Silhouette extraction using color and depth information. In: Proceedings of 2012 IS&T/SPIE Electronic Imaging, Three-Dimensional Image Processing (3DIP) and Applications II, 82900B (2012). https://doi.org/10.1117/12.907690. Accessed 04 October 2020
Vatolin, D.: Why does 3D lead to the headache? Part 4: Parallax (in Russian) (2015). https://habr.com/en/post/378387/
Chapter 4
Semi-Automatic 2D to 3D Video Conversion
Petr Pohl and Ekaterina V. Tolstaya
4.1 2D to 3D Semi-automatic Video Conversion
As we mentioned in Chap. 3, during the last decade, the popularity of 3D TV technology has passed through substantial rises and falls. It is worth mentioning here, however, that 3D cinema itself was not so volatile but rather gained vast popularity among cinema lovers: over the last decade, the number of movies produced in 3D grew by an order of magnitude, and the number of cinemas equipped with modern 3D projection devices grew 40 times and continues to grow (Vatolin 2019). Moreover, virtual reality headsets, which recently appeared on the market, give an excellent viewing experience for 3D movie watching. This is a completely new type of device intended for stereo content reproduction: it has two different screens, one for each eye, whereas the devices of the previous generation had one screen and therefore required various techniques for view separation. Maybe these new devices can once again boost consumer interest in 3D cinema technology, which has already experienced several waves of excitement during the twenty-first century. Among the causes of the subsiding popularity of 3D TV, apart from costs and technological problems, which are mentioned in Chap. 3, was a lack of 3D content. In the early stages of the technology renaissance, the efforts of numerous engineers around the world were applied to stereo content production and conversion techniques, i.e. producing 3D stereo content from common 2D monocular video. Very often, such technologies were required even when the movie was shot with a stereo rig: for example, the movie “Avatar” contains several scenes shot in 2D and
Table 4.1 Advantages/disadvantages of shooting in stereo and stereo conversion

Shooting in stereo — advantages:
• Capturing stereo images at the time of shooting
• Natural stereo effect in situations that are difficult to reproduce: smoke, hair, reflections, rain, leaves, transparent objects, etc.
• Possibility of immediately reviewing stereo content

Shooting in stereo — disadvantages:
• Requires specialized camera rigs, which are more complex, heavy, hard to operate, and require more time and more experienced personnel
• Restrictions on lenses that can obtain good-looking stereo
• Cameras should operate synchronously, because even small differences in shooting options (like aperture and exposure, depth of field, focusing point) will lower stereo content quality
• Stereo depth should be fixed at the time of shooting
• Lens flares, shiny reflections, and other optical effects can appear differently and require fixing
• Post-processing is needed to fix problems of the captured stereo video (brightness differences, geometrical distortion, colour imbalance, focus difference, and so on)

Stereo conversion — advantages:
• Possibility of shooting with the standard process, equipment, and personnel
• Wide choice of film or digital cameras and lenses
• Possibility of assigning any 3D depth during post-production on a shot-by-shot basis, with flexibility to adjust the depth and stereo effect of each actor or object in a scene; this creative option is not available when shooting in stereo

Stereo conversion — disadvantages:
• Extra time and cost required for post-production
• Reflections, smoke, sparks, and rain are more difficult to convert; transparent and semi-transparent objects pose serious problems during post-production stereo conversion
• Risk of conversion artefacts even in regular scenes
converted to stereo in post-production. A stereo rig is an expensive and bulky system, prone to certain limitations: it must be well calibrated to produce high-quality, geometrically aligned stereo with the proper colour and sharpness balance. Sometimes this is difficult to do, so now and then movies are shot in conventional 2D and converted afterward. Table 4.1 enumerates the advantages and disadvantages of both techniques, i.e. shooting in stereo and stereo conversion.
Recent developments suggest a fully automated pipeline, where the system automatically tries to guess the depth map of the scene and then applies depth-based rendering techniques to synthesize stereo, as demonstrated by Appia et al. (2014) or Feng et al. (2019). Other systems directly predict the right view from the left view, excluding the error-prone stages of depth-based rendering (Xie et al. 2016). Still, such systems rely on predicting the depth of a scene based on some cues (or on systems that learn the cues from training data), which leads to prediction errors. That is why much effort was directed toward semi-automatic algorithms, which require human interaction and control, a lot of time, and high costs but provide much higher quality content. An operator-assisted pipeline for stereo content production usually involves the manual drawing of depth (and possibly also segmentation) for some selected reference frames (key frames) and subsequent depth propagation for stereo rendering. An initial depth assignment is done, sometimes by drawing just disparity scribbles, and after that, the depth is restored and propagated using 3D cues (Yuan 2018). In other techniques, a full key frame depth is needed (Tolstaya et al. 2015). The proposed 2D-3D conversion technique consists of the following steps:

1. Video analysis and key frame selection
2. Manual depth map drawing
3. Depth propagation
4. Object matting
5. Background inpainting
6. Stereo rendering
7. Occlusion inpainting
4.2 Video Analysis and Key Frame Detection
The extraction of key frames is a very important step for semi-automatic video conversion. Key frames for stereo conversion are completely different from key frames selected for video summarization, as described in Chap. 6. The stereo conversion key frames are selected for an operator, who will manually draw depth maps for the key frames; these depth maps are then interpolated (propagated) through the whole video clip, followed by depth-based stereo view rendering. The more frames that are selected, the more manual labour will be required for video conversion; but in the case of an insufficient number of key frames, a lot of intermediate frames may have inappropriate depths or contain conversion artefacts. The video clip should be thoroughly analysed prior to the start of manual work. For example, a slow-motion scene with simple, close to linear motion will require fewer key frames, while dramatic, fast-moving objects, especially if they are closer to the camera and have larger occlusion areas, require more key frames to assure better conversion quality.
In Wang et al. (2012), the key frame selection algorithm relies on the size of cumulative occlusion areas, and shot segmentation is performed using a block-based histogram comparison. Sun et al. (2012) select key frame candidates using the ratio of SURF feature points to the correspondence number, and a key frame is selected from among the candidates such that it has the smallest reprojection error. Experimental results show that the depth maps propagated using their method have fewer errors, which is beneficial for generating high-quality stereoscopic video. However, for semi-automatic 2D-3D conversion, the key frame selection algorithm should properly handle various situations that are difficult for depth propagation, to diminish possible depth map quality issues and at the same time limit the overall number of key frames. Additionally, the algorithm should analyse all parts of video clips and group similar scene parts. For example, very often during character conversation, the camera switches from one object to another, while the objects' backgrounds almost do not change. Such smaller parts of a bigger "dialogue" scene can be grouped together and considered as a single scene for every character. For this purpose, the video should be analysed and segmented into smaller parts (cuts), which should be sorted into similar groups with different characteristics:

1. Scene change (shot segmentation). Many algorithms have already been proposed in the literature. They are based on either abrupt colour change or motion vector change. The moving averages of histograms are analysed and compared to some threshold, meaning that the scene changes when the histogram difference exceeds the threshold. Smooth scene transitions pose serious problems for such algorithms. Needless to say, the depth map propagation and stereo conversion of such scenes is also a difficult task.
2. Motion type detection: still, panning, zooming, complex/chaotic motion. Slow-motion scenes are easy to propagate, and in this case, a few key frames can save a lot of manual work. A serious problem for depth propagation is caused by zoom motion (objects approaching or moving away); in this case, a depth increase or decrease should be smoothly interpolated (this is illustrated in the bottom row of Fig. 4.2).
3. Visually similar scenes (shots) grouping. Visually similar scenes very often occur when shooting several talking people, when the camera switches from one person to another. In this case, we can group the scenes with one person and consider this group to be a longer continuous sequence.
4. Object tracking in the scene. To produce consistent results during 2D-3D conversion, it is necessary to analyse objects' motion. When the main object (key object) appears in the scene, its presence is considered to select the best key frame when the object is fully observable, so that its depth map can be well propagated to the other frames of its appearance. Figures 4.1 and 4.2 (two top rows) illustrate this kind of situation.
5. Motion segmentation for object tracking. For better conversion results, motion is analysed within the scenes. The simplest type of motion is panning or linear motion: this motion is easy to convert. Other non-linear or 3D motions require more key frames for smooth video conversion.
6. Background motion model detection. Background motion model detection is necessary for background motion inpainting, i.e. to fill up occlusion areas. In the case of a linear model, it is possible to apply an occlusion-based key frame detection algorithm, as in Wang et al. (2012).
7. Occlusion analysis. A key frame should be selected when the cumulative area of occlusion goes beyond a threshold. This is similar to Wang et al. (2012).

Fig. 4.1 Example of the key object of the scene and its corresponding key frame
Fig. 4.2 Different situations for key frame selection: the object should be fully observable in the scene, the object appears in the scene, and the object is zoomed

The key frame selection algorithm of Tolstaya and Hahn (2012), based on a function reflecting the transition complexity between every frame pair, proposes to find the optimal distribution of key frames via graph optimization. The frames of a video shot are represented by the vertices of a graph, where the source is the first frame and the sink is the last frame. When two frames are too far apart, their transition complexity is set equal to some large value. The optimization of such a path can be done using the well-known Dijkstra's algorithm. Ideas on the creation of an automatic video analysis algorithm based on machine learning techniques can be further explored in Chap. 6, where we describe approaches for video footage analysis and editing and style simulations for creating dynamic and professional-looking video clips.
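As an illustration of this graph formulation, the sketch below selects key frames as the cheapest path from the first to the last frame using Dijkstra's algorithm. Here transition_cost is a hypothetical user-supplied function returning the transition complexity between two frames, and max_span models the "too far apart" penalty by simply limiting the edge length; both are assumptions of this sketch.

import heapq

def select_key_frames(n_frames, transition_cost, max_span=30):
    """Pick key frames as the cheapest path from frame 0 to frame n-1 in a graph
    whose edge weights are transition complexities (sketch)."""
    INF = float('inf')
    dist = [INF] * n_frames
    prev = [-1] * n_frames
    dist[0] = 0.0
    heap = [(0.0, 0)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist[i]:
            continue
        for j in range(i + 1, min(i + max_span, n_frames - 1) + 1):
            c = d + transition_cost(i, j)
            if c < dist[j]:
                dist[j] = c
                prev[j] = i
                heapq.heappush(heap, (c, j))
    # backtrack from the last frame to recover the key frame indices
    path, j = [], n_frames - 1
    while j != -1:
        path.append(j)
        j = prev[j]
    return path[::-1]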
4.3 Depth Propagation from Key Frames

4.3.1 Introduction
One of the most challenging problems that arise in semi-automatic conversion is the temporal propagation of depth data. The bottleneck of the whole 2D to 3D conversion pipeline is the quality of the propagated depth: if the quality is not high enough, a lot of visually disturbing artefacts appear on the final stereo frames. The quality strongly depends on the frequency of manually assigned key frames, but drawing a lot of frames requires more manual work and makes production slower and more expensive. That is why the crucial problem is error-free temporal propagation of depth data through as many frames as possible. The optimal key frame distance for the desired quality of outputs is highly dependent on the properties of the video sequence and can change significantly within one sequence.
4.3.2 Related Work
The problem of the temporal propagation of key frame depth data is a complex task. Video data are temporally undersampled: they contain noise, motion blur, and optical effects such as reflections, flares, and transparent objects. Moreover, objects in the scene can disappear, get occluded, or significantly change shape or visibility. The traditional method of depth interpolation uses a motion estimation result, computed from either depth or video images. Varekamp and Barenbrug (2007) propose creating the first estimate of depth by bilateral filtering of the previous depth image and then correcting it by estimating the motion between depth frames. A similar approach is described by Muelle et al. (2010). Harman et al. (2002) use a machine
learning approach for the depth assignment of key frames. They suggest that these should be selected manually or that techniques similar to those for shot-boundary detection should be applied. After a few points are assigned, a classifier (separate for each key frame) is trained using a small number of samples, and then it restores the depth in the key frame. After that, a procedure called “depth tweening” restores intermediate depth frames. For this purpose, both classifiers of neighbouring key frames are fed with an image value to produce a depth value. For the final depth, both intermediate depths are weighted by the distance to the key frames. Weights could linearly depend on the time distance, but the authors propose the use of a non-linear time-weight dependence. A problem with such an approach could arise when intermediate video frames have areas that are completely different than those that can be found on key frames (for example, in occlusion areas). However, this situation is difficult for the majority of depth propagation algorithms. Feng et al. (2012) describe a propagation method based on the generation of superpixels, matching them and generating depth using matching results and key frame depths. Superpixels are generated by SLIC (Simple Linear Iterative Clustering). Superpixels are matched using mean colour (three channels) and the coordinates of the centre position. Greedy search finds superpixels in a non-key frame (within some fixed window) with minimal colour difference and L1 spatial distance, multiplied by a regularization parameter. Cao (2011) proposes the use of motion estimation and bilateral filtering to get a first depth estimate and refines it by applying depth motion compensation in frames between key frames with assigned depths. As a base method for comparison, we will use an approach similar to Cao (2011). The motion information is then used for the warping of the depth data from previous and subsequent key frames. These two depth fields are then mixed with the weights of motion confidence. These weights can be acquired by an analysis of the motion projection error from the current image to one of the key frames or the error of projection of a small patch, which has slightly more stable behaviour. As a motion estimation algorithm, we use the optical flow described by Pohl et al. (2014) with the addition of the third channel of three YCbCr channels. Similar but harder problems appear in the area of greyscale video colourization, as in Irony (2005). In this case, only greyscale images can be used for matching and finding similar objects. Pixel values and local statistics (features) are used in this case; spatial consistency is also taken into account. Most motion estimation algorithms are not ready for larger displacement and could not be used for interpolation over more than a few frames. The integration of motion information over time leads to increasing motion errors, especially near object edges and in occlusion areas. Bilateral filtering of either motion or depth can cover only small displacements or errors. Moreover, it can lead to disturbing artefacts in the case of similar foreground and background colours.
4.3.3 Depth Propagation Algorithm
Our dense depth propagation algorithm interpolates the depth for every frame independently. It utilizes the nearest preceding and nearest following frames with known depth maps (key frame depth). The propagation of depth maps from two sides is essential, as it allows us to interpolate most occlusion problems correctly. The general idea is to find correspondence between image patches, and assuming that patches with similar appearances have similar depths, we can synthesize an unknown depth map based on this similarity (Fig. 4.3). The process of finding similar patches is based on work by Korman and Avidan (2015). First, a bank of Walsh-Hadamard (W-H) filters is applied to both images. As a result, we have a vector of filtering results for every pixel (Fig. 4.4). After that, a hash code is generated for each pixel using this vector of filtering results.
Fig. 4.3 Illustration of the depth propagation process from two neighbouring frames, the preceding and the following
Fig. 4.4 After applying a bank of Walsh-Hadamard filters to both images (key frame image and input image), we have a stack of images—the results of the W-H filtering. For every pixel of both images, we therefore have a feature vector, and feature closeness in terms of the L1 norm is assumed to correspond to patches’ similarity
Hash tables are built using the hash codes and the corresponding pixel coordinates for a fast search for patches with equal hashes. We assume that patches with equal hashes have a similar appearance. For this purpose, the authors applied a coherency sensitive hashing (CSH) algorithm. The hash code is a short integer of 16 bits, and, with hashes computed for each patch, a greedy matching algorithm selects patches with the same (or closest) hash and computes the patch difference as the difference between the vectors of W-H filter results. The matching error used in the search for the best correspondence is a combination of the filter output difference and the spatial distance, with a dead zone (no penalty for small distances) and a limit for the maximum allowed distance. The spatial distance between matching patches was introduced to avoid unreasonable correspondences between similar image structures from different parts of the image. Such matches are improbable for relatively close frames of the video input.

In the first step, RGB video frames are converted to the YCbCr colour space. This allows us to treat luminance and chrominance channels differently. To accommodate fast motion and decrease the sensitivity to noise, this process is done using image pyramids. Three pyramids for video frames and two pyramids for key frame depth maps are created. The finest level (level = 0) has full-frame resolution, and we decrease the resolution by a factor of 0.5, so that the coarsest level = 2 has 1/4 of the full resolution. We use an area-based interpolation method. The iterative scheme starts on the coarsest level of the pyramid by matching the two key frames to the current video frame. The initial depth is generated by voting over patch correspondences using weights dependent on colour patch similarity and temporal distances from the reference frames. In the next iterations, matching between images combined from colour and the depth map is performed. For performance reasons, only one of the chrominance channels (Cr or Cb does not make a big difference) is replaced with the given depth for the reference frames, and thus the depth estimate is obtained for the current frame. On each level, we perform several CSH matching iterations and iteratively update the depth image by voting. Gaussian kernel smoothing with a decreasing kernel size, or another low-pass image filter, is used to blur the depth estimate. This smooths the small amount of noise coming from logically incorrect matches. The low-resolution depth result for the current frame is upscaled, and the process is repeated for every pyramid level, ending at the finest resolution, which is the original resolution of the frames and depth. The process is described by Algorithm 4.1.

Algorithm 4.1 Depth Map Propagation from Reference Frames
Initialize three pyramids with YCbCr video frames, I^{t−1}_level, I^t_level, and I^{t+1}_level, and two pyramids with key frame depth maps, D^{t−1}_level and D^{t+1}_level. With this notation, D^t_level is the unknown depth map.
Using the CSH algorithm, find patch matches at the coarsest level, using just the colour image frames:
  Map^{−t}_level0 = I^t_level0 → I^{t−1}_level0 and Map^{+t}_level0 = I^t_level0 → I^{t+1}_level0.
By the initial patch-voting procedure for the coarsest level, synthesize D^t_level0 for the coarsest level of the depth pyramid.
for each level from coarsest to finest (from 2 to 0), do several iterations (N_level) of the algorithm:
  for iteration = 1 .. N_level do
    At the first iteration, upscale D^t_{level+1} to D^t_level with a bilinear (or bicubic) algorithm to get the first estimate of the unknown depth map at this level. Filter the known depth maps D^{t−1}_level and D^{t+1}_level with a Gaussian kernel to remove noise.
    Copy D^{t−1}_level, D^{t+1}_level, and D^t_level to one of the chrominance channels of the image pyramids I^{t−1}_level, I^t_level, and I^{t+1}_level correspondingly.
    Using the CSH algorithm, find patch matches
      Map^{−t}_level = I^t_level → I^{t−1}_level and Map^{+t}_level = I^t_level → I^{t+1}_level.
    By the patch-voting procedure of Algorithm 4.2, synthesize D^t_level.

The patch-voting procedure uses an estimate of the match error for the forward and backward frames and is described in Algorithm 4.2. As an estimate of the error, we use the sum of absolute differences over the patch correspondence (on the coarsest level) or the sum of absolute differences of the 16 Walsh-Hadamard kernels that are available after CSH estimation. In our experiments, the best results were achieved with errors normalized by the largest error over the whole image. The usage of Walsh-Hadamard kernels as a similarity measure is justified because it is an estimate of the true difference of patches, but it decreases the sensitivity to noise, because only the coarser filtering results are used.

Algorithm 4.2 Depth Map Synthesis by Patch-Voting Procedure
Given Map^{−t}_level and Map^{+t}_level and the respective matching errors Err^{t−1}_level and Err^{t+1}_level computed from the image correspondences, and the key frame depth maps D^{t−1}_level and D^{t+1}_level:
Initialize S^t_level and W^t_level by zeros.
for each match, Y^{t−1} = Map^{−t}_level(X) and Y^{t+1} = Map^{+t}_level(X), where X is a patch in the image I^t_level and Y^{t−1} and Y^{t+1} are patches in the images I^{t−1}_level and I^{t+1}_level correspondingly, do
  for each pixel of patch X do
    Compute the correspondence errors:
      Err^{t−1}_level = |Y^{t−1} − X| and Err^{t+1}_level = |Y^{t+1} − X|.
    Estimate the voting weights:
      W_prev = exp(−Err^{t−1}_level / (2σ_V(level))),
      W_next = exp(−Err^{t+1}_level / (2σ_V(level, t))),
    S^t_level = S^t_level + W_prev · Y^{t−1} + W_next · Y^{t+1}
    W^t_level = W_prev + W_next
D^t_level = S^t_level / W^t_level
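The following simplified, per-pixel NumPy sketch conveys the spirit of the voting step: each pixel of the unknown depth map receives a weighted average of the depths fetched from the two key frames through the patch correspondences. The array layout, the names, and the per-pixel (rather than per-patch) accumulation are our simplifications, not the exact procedure above.

import numpy as np

def vote_depth(D_prev, D_next, map_prev, map_next, err_prev, err_next,
               sigma_prev=1.0, sigma_next=1.0):
    """Per-pixel simplification of the patch-voting synthesis (sketch).
    map_prev/map_next hold, for every pixel of the current frame, the (y, x)
    coordinates of the matched location in the corresponding key frame;
    err_prev/err_next hold the normalized matching errors."""
    h, w = err_prev.shape
    # voting weights from the matching errors
    w_prev = np.exp(-err_prev / (2.0 * sigma_prev))
    w_next = np.exp(-err_next / (2.0 * sigma_next))
    yp, xp = map_prev[..., 0], map_prev[..., 1]
    yn, xn = map_next[..., 0], map_next[..., 1]
    # weighted sum of the depths fetched from the two key frames
    S = w_prev * D_prev[yp, xp] + w_next * D_next[yn, xn]
    W = w_prev + w_next
    return S / np.maximum(W, 1e-6)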
Most parts of the algorithm are well parallelizable and can make use of multicore CPUs or GPGPU architectures. The implementation of CSH matching from Korman and Avidan (2015) can be parallelized by the introduction of propagation tiles. This decreases the area over which a found match is propagated and usually leads to an increased number of necessary iterations. When we investigated the differences on the MPI-Sintel testing dataset (Butler et al. 2012), they were small even on sequences with relatively large motions. To speed up CSH matching, we use only 16 Walsh-Hadamard kernels (as we mentioned earlier), scaled to short integer. This allows the implementation of the computation of absolute differences using SSE and intrinsic functions. The testing of candidates can be further parallelized in a GPGPU implementation. Our solution has parts running on the CPU and parts running on the GPGPU. We tested our implementation on a PC with a Core i7 960 (3.2 GHz) CPU and an Nvidia GTX 480 graphics card. We achieved running times of ~2 s/frame on a video with 960 × 540 resolution. The most time-consuming part of the computation is the CSH matching. For most experiments, we used three pyramid levels with half resolution between the levels, with two iterations per level, σ(level, 1) = 2, σ(level, 2) = 1, and σ_V(level) = 0.053(level−1). A slow Python implementation of the algorithm is given by Tolstaya (2020).
4.3.4 Results
In general, some situations remain difficult for propagation, such as low-contrast videos, noise, and small parts of moving objects, since in this case the background pixels inside the patch occupy the biggest part of the patch and contribute too much to the voting. However, in the case where the background does not change substantially, small details can be tracked quite acceptably. The advantages given by CSH matching include the fact that it is not true motion, and objects on the query frame can be formed from completely different patches, based only on their visual similarity to the reference patches (Fig. 4.5).

Fig. 4.5 Example of interpolated depths (reference frame 1, propagated frame, reference frame 2)

The main motivation for the development of the depth propagation algorithm was the elimination or suppression of the main artefacts of the previously used algorithm based on optical flow. The main artefacts include depth leakage and depth loss. Depth leakage can be caused either by the misalignment of the key frame depth and the motion edge or by an incorrect motion estimation result. The most perceptible artefacts are noisy tracks of object depth that are left on the background after the foreground object moves away. Depth loss is mostly caused by an error of motion estimation in the case of fast motion or complex scene changes like occlusions, flares, reflections, or semi-transparent objects in the foreground. Examples of such artefacts are shown in Fig. 4.7. Figure 4.8 shows the output of the proposed algorithm. We compared the performance of our method with optical flow-based interpolation on the MSR 3D Video Dataset from Microsoft Research (Zitnick et al. 2004). The comparison of the interpolation error (as the PSNR from the original) is shown in Fig. 4.6. Figure 4.9 compares the depth maps of the proposed algorithm and the depth map computed with motion vectors, with the depth overlain onto the source video frames. We can see that in the case of motion vectors, small details can be lost. Other tests were done on proprietary datasets with the ground truth depth from stereo (computed by the method of Ignatov et al. 2009) or manually annotated.

Fig. 4.6 Comparison of optical flow-based interpolation (solid line) with our new method (dashed line) for four different distances of key frames—key frame distance is on the x-axis. PSNR comparison with original depth. Top—ballet sequence, bottom—breakdance sequence
Fig. 4.7 Optical flow-based interpolation—an example of depth leakage and a small depth loss (right part of head) in the case of fast motion of an object and flares
Fig. 4.8 Our depth interpolation algorithm—an example of solved depth leakage and no depth loss artefact
Fig. 4.9 Comparison of depth interpolation results—optical flow-based interpolation (top) and our method (bottom)—an example of solved depth leakage and thin object depth loss artefacts. Left, depth + video frame overlay; right, interpolated depth. The key frame distance used is equal to eight
From our experiments, we see that the proposed depth interpolation method has on average better performance than interpolation based on optical flow. Usually, finer details of depth are preserved, and the artefacts coming from the imperfect alignment of the depth edge and the true edge of objects are less perceptible or removed altogether. Our method is also capable of capturing faster motion. On the other hand, optical flow results are more stable in the case of consistent and not too fast motion, especially in the presence of a high level of camera noise, video flickering, or a complex depth structure. The proposed method has a lot of parameters, and many of them were set up by intelligent guesswork. One of the future steps might be a tuning of parameters on a representative set of sequences. Another way forward could be a hybrid approach that merges the advantages of our method and optical flow-based interpolation. Unfortunately, we were not able to find a public dataset for the evaluation of depth propagation that is large enough and includes a
satisfying variety of sequences to be used for tuning parameters or the evaluation of the interpolation method for general video input.
4.4 Motion Vector Estimation

4.4.1 Introduction
Motion vectors provide helpful insights on video content. They are used for occlusion analysis and, during the background restoration process, to fill up occlusions produced by stereo rendering. Motion vectors are the apparent motion of brightness patterns between two images defined by the vector field u(x). Optical flow is one of the important but not generally solved problems in computer vision, and it is under constant development. Recent methods using ML techniques and precomputed cost volumes, like Teed and Deng (2020) or Zhao (2020), improve the performance in the case of fast motion and large occlusions. At the time this material was prepared, the state-of-the-art methods generally used variational approaches. Teed and Deng (2020) state that even the most modern methods are inspired by a traditional setup with a balance between data and regularization terms, and they even follow an iterative structure similar to first-order primal-dual methods from variational optical flow; however, they use learned updates implemented with convolutional layers. For computing optical flow, we decided to adapt the efficient primal-dual optimization algorithm proposed by Chambolle and Pock (2011), which is suitable for GPU implementation. The authors propose the use of total variation optical flow with a robust L1 norm and extend the brightness constancy assumption by an additional field to model brightness change. The main drawbacks of the base algorithm are incorrect smoothing around motion edges and unpredictable behaviour in occlusion areas. We extended the base algorithm to use colour information and replaced TV-L1 regularization by a local neighbourhood weighting known as the non-local smoothness term, proposed by Werlberger et al. (2010) and Sun et al. (2010a). To fix the optical flow result in occlusion areas, we decided to use motion inpainting, which uses nearby motion information and motion over-segmentation to fill in unknown occlusion motion. Sun et al. (2010b) propose explicitly modelling layers of motion to model the occlusion state. However, this leads to a non-convex problem formulation that is difficult to optimize even for a small number of motion layers.
4.4.2 Variational Optical Flow
Our base algorithm is a highly efficient primal-dual algorithm proposed by Chambolle and Pock (2011), which is suitable for parallelization and, hence, effective GPU implementation. Variational optical flow usually solves a variational minimization problem of the general form:

E = ∫_Ω [λ E_D(x, I_1(x), I_2(x), u(x)) + E_S(x, u(x))] dx.
The energy E_D is usually linearized to achieve a convex problem and is solved on several discrete grids according to a coarse-to-fine pyramid scheme. Linearization can be done several times on a single pyramid level. A resized solution of a coarser level is used for the initialization of the next, finer level of the pyramid. Here, Ω is the image domain, x = (x_1, x_2) is the position, I_1(x) and I_2(x) are the images, u(x) = (u_1(x), u_2(x)) is the estimated motion vector field, and λ is a regularization parameter that controls the trade-off between the smoothness of u(x), described by the smoothness energy term E_S, and the image warping fit, described by the data energy term E_D. The pyramid structure is a common approach in computer vision to deal with different scales of details in images. A pyramid for one input image is a set of images with decreasing resolution. The crucial pyramid creation parameters are the finest level resolution, the resolution ratio between levels, the number of levels, and the resizing method, together with smoothing or anti-aliasing parameters.
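For illustration, a pyramid with these parameters can be built as in the following sketch (assuming OpenCV is available; the ratio, level count, and pre-smoothing value are placeholders, not the settings used in our system):

import cv2

def build_pyramid(image, n_levels=4, ratio=0.5, blur_sigma=0.8):
    """Construct a coarse-to-fine image pyramid (sketch)."""
    levels = [image]
    for _ in range(1, n_levels):
        prev = levels[-1]
        # anti-alias before subsampling, then resize by the chosen ratio
        smoothed = cv2.GaussianBlur(prev, (0, 0), blur_sigma)
        h, w = smoothed.shape[:2]
        nh = max(1, int(round(h * ratio)))
        nw = max(1, int(round(w * ratio)))
        levels.append(cv2.resize(smoothed, (nw, nh), interpolation=cv2.INTER_AREA))
    return levels   # levels[0] is the finest, levels[-1] the coarsest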
4.4.3 Total Variation Optical Flow Using Two Colour Channels
As a trade-off between precision and computation time, we use two colour channels (Y′ and Cr of the Y′CbCr colour space) instead of three. A basic version of two-colour variational optical flow with an L1-norm smoothness term is given below:

E(u(x), w(x)) = E_D(u(x), w(x)) + E_S(u(x), w(x)),
E_D(u(x), w(x)) = Σ_{x∈Ω} [λ_L E_D^L(x, u(x), w(x)) + λ_C E_D^C(x, u(x))],
E_D^L(x, u(x), w(x)) = |I_2^L(x + u(x)) − I_1^L(x) + γ w(x)|_1,
E_D^C(x, u(x)) = |I_2^C(x + u(x)) − I_1^C(x)|_1,
E_S(u(x), w(x)) = Σ_{x∈Ω} |∇u(x)|_1 + |∇w(x)|_1,

where Ω is the image domain; E is the minimized energy; u(x) = (u_1(x), u_2(x)) is the motion field; w(x) is the field connected to the illumination change; λ_L and λ_C are parameters that control the data term's importance for the luminance and colour channels, respectively; γ controls the regularization of the illumination change; E_S is the smoothness part of the energy that penalizes changes in u and w; I_1^L and I_2^L are the luminance components of the current and next frames; and I_1^C and I_2^C are the colour components of the current and next frames.
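For reference, the data part of this energy can be evaluated for a candidate flow as in the short sketch below; the second-frame channels are assumed to be already warped by u(x). This is only an illustration of the term being minimized: the solver itself works on a linearized version of this energy rather than evaluating it directly.

import numpy as np

def data_energy(I1_L, I2_L_warp, I1_C, I2_C_warp, w, lam_L, lam_C, gamma):
    """Two-channel L1 data term of the energy for an already-warped candidate flow."""
    E_L = np.abs(I2_L_warp - I1_L + gamma * w)   # luminance term with illumination model
    E_C = np.abs(I2_C_warp - I1_C)               # chrominance term
    return np.sum(lam_L * E_L + lam_C * E_C)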
4.4.4 Total Variation Optical Flow with a Non-local Smoothness Term
To deal with motion edges, we incorporated a local neighbourhood weighting known as the non-local smoothness term, proposed by Werlberger et al. (2010) and Sun et al. (2010b):

E_S = Σ_{dx∈Ψ} s_n(I_1, x, dx) |u(x + dx) − u(x)|_1 + |∇w(x)|_1,
s(I_1, x, dx) = e^{−k_s |dx|} · e^{−|I_1(x) − I_1(x+dx)|² / (2 k_c²)},
s_n(I_1, x, dx) = s(I_1, x, dx) / Σ_{dx∈Ψ} s(I_1, x, dx),
where Ψ is the non-local neighbourhood (e.g. a 5 × 5 square with (0, 0) in the middle); s(I_1, x, dx) and s_n(I_1, x, dx) are the non-normalized and normalized non-local weights, respectively; and k_s and k_c are parameters that control the non-local weights' response.
4.4.5 Special Weighting Method for Fast Non-local Optical Flow Computation
In the original formulation of the non-local smoothness term, the size of the local window determines the number of weights and dual variables per pixel. Thus, for example, if the window size is 5 × 5, then for each pixel we need to store 50 dual variables and 25 weights in memory. Considering that all of these variables are used in every iteration of the optimization, a large number of computations and memory transfers are required. To overcome this problem, we devised a computational simplification to decrease the number of non-local weights and dual variables. The idea is to decrease the non-local neighbourhood for motion and use a larger part of the image for the weights. The way that we use a 3 × 3 non-local neighbourhood with 5 × 5 image information is described by the following formula:

s(I_1, x, dx) = e^{−k_s |dx|} · e^{−(|I_1(x) − I_1(x+dx)|² + |I_1(x) − I_1(x+2dx)|²) / (4 k_c²)},
where all notations are the same as in the previous equation.
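A NumPy sketch of these simplified weights for the eight neighbours of a 3 × 3 window is given below; normalisation over the window then follows the earlier s_n formula. The parameter values and the edge-padding choice are illustrative assumptions of the sketch.

import numpy as np

def nonlocal_weights(I1, k_s=1.0, k_c=5.0):
    """Simplified non-local weights: each 3x3 neighbour also looks one step
    further along the same direction (the 5x5 image information)."""
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
    pad = np.pad(I1.astype(np.float32), 2, mode='edge')
    H, W = I1.shape
    centre = pad[2:-2, 2:-2]
    weights = {}
    for dy, dx in offsets:
        near = pad[2 + dy: 2 + dy + H, 2 + dx: 2 + dx + W]
        far = pad[2 + 2 * dy: 2 + 2 * dy + H, 2 + 2 * dx: 2 + 2 * dx + W]
        colour = (np.abs(centre - near) ** 2 + np.abs(centre - far) ** 2) / (4.0 * k_c ** 2)
        spatial = k_s * np.hypot(dy, dx)
        weights[(dy, dx)] = np.exp(-spatial) * np.exp(-colour)
    return weights   # one weight map per neighbour offset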
4.4.6 Solver
Our solver is based on Algorithm 4.1 from Chambolle and Pock (2011). According to this algorithm, we derived the iteration scheme for the non-local smoothness term in optical flow. In order to get a convex formulation for optimization, we need to linearize the data term E_D for both channels, Y and Cr. The linearization uses the current state of the motion field u_0; derivatives are approximated using the following scheme:

I_T(x) = I_2(x + u_0(x)) − I_1(x),
I_x(x) = I_2(x + u_0(x) + (0.5, 0)) − I_2(x + u_0(x) − (0.5, 0)),
I_y(x) = I_2(x + u_0(x) + (0, 0.5)) − I_2(x + u_0(x) − (0, 0.5)),

where I_1 is the image from which motion is computed, I_2 is the image to which motion is computed, I_T is the image time-derivative estimate, and I_x and I_y are the image spatial-derivative estimates. It is also possible to derive a three-colour-channel version, but the increase in computational complexity is considerable. Full implementation details can be found in Pohl et al. (2014).
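A compact sketch of this derivative estimation using bilinear sampling (SciPy's map_coordinates) is shown below; the flow layout u0[..., 0] = horizontal, u0[..., 1] = vertical and the boundary handling are assumptions of the sketch, not the GPU implementation.

import numpy as np
from scipy.ndimage import map_coordinates

def warped_derivatives(I1, I2, u0):
    """Estimate I_T, I_x, I_y by sampling I2 at warped, half-pixel shifted positions."""
    h, w = I1.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    wx, wy = xs + u0[..., 0], ys + u0[..., 1]
    def sample(dx, dy):
        return map_coordinates(I2, [wy + dy, wx + dx], order=1, mode='nearest')
    I_T = sample(0.0, 0.0) - I1
    I_x = sample(0.5, 0.0) - sample(-0.5, 0.0)
    I_y = sample(0.0, 0.5) - sample(0.0, -0.5)
    return I_T, I_x, I_y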
4.4.7 Evolution of Data Term Importance
One of the problems of variational optical flow estimated on a pyramid scheme is that it fails if fast motion is present in the video sequence. To improve the results, we proposed an update to the pyramid processing, which changes the data importance parameters λ_L and λ_C during the pyramid computation from a more data-oriented solution to a smoother solution:

λ(n) = λ_coarsest, for n > n_ramp,
λ(n) = λ_coarsest − (λ_coarsest − λ_finest) · (n_ramp − n) / n_ramp, for n ≤ n_ramp,
where n is the pyramid level (zero means the finest level and hence the highest resolution), λ is one of the regularization parameters as a function of n, and n_ramp is the start of the linear ramp. The parameters λ_coarsest and λ_finest define the initial and final λ values.

Fig. 4.10 Occlusion detection three-frame scheme with 1D projection of the image

The variational optical flow formulation does not explicitly handle occlusion areas. The estimated motion in occlusion areas is incorrect and usually follows a match to the nearest patch of similar colour. However, if the information for visible pixels is correct, it is possible to find occlusion areas using the motion from the nearest frames to the current frame, as shown in Fig. 4.10. Object 401 on the moving background creates occlusion 404. The precise computation of occlusion areas uses the inverse of bilinear interpolation and thresholding.
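The sketch below illustrates one common way to obtain such an occlusion mask, a forward-backward consistency check between the flows of neighbouring frames. Note that this is a simplification of the procedure described above, which derives the mask from the inverse of bilinear interpolation; the flow layout and the threshold are assumptions of the sketch.

import numpy as np
from scipy.ndimage import map_coordinates

def occlusion_mask(u_fwd, u_bwd, threshold=1.0):
    """Approximate occlusion mask from the forward flow (frame t -> t+1) and the
    backward flow (t+1 -> t) via a forward-backward consistency check."""
    h, w = u_fwd.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    # where each pixel of frame t lands in frame t+1
    xt, yt = xs + u_fwd[..., 0], ys + u_fwd[..., 1]
    # sample the backward flow at the landing position
    bx = map_coordinates(u_bwd[..., 0], [yt, xt], order=1, mode='nearest')
    by = map_coordinates(u_bwd[..., 1], [yt, xt], order=1, mode='nearest')
    # for visible pixels, forward and backward flows should cancel out
    err = np.hypot(u_fwd[..., 0] + bx, u_fwd[..., 1] + by)
    return err > threshold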
4.4.8 Clustering of Motion
The purpose of motion clustering is to recognize different areas of an image that move together. Joint clustering of forward and backward motion in a three-frame scheme, as shown in Fig. 4.10, is used. At first, four motion fields are computed using variational optical flow; then, the occlusion areas are detected for the central frame. To find non-occlusion areas that can be described by one motion model for forward motion and one motion model for backward motion, we use a slightly adapted version of the RANSAC algorithm introduced by Fischler and Bolles (1981). The RANSAC algorithm is a non-deterministic model-fitting procedure, which randomly selects a minimal set of points that determine the model and estimates how many samples actually fit this model. To distinguish inliers from outliers, it uses some error threshold. Our target is the processing of general video material, and in this case it is impossible to make a single sensible setting of this threshold. To deal with this problem, we change the RANSAC evaluation function. At first, the method evaluates the tested forward and backward motion models by summing the Gaussian prior of the misfit as follows:

J(θ_23, θ_21) = Σ_{x∈Ω} exp(−(|u_23(x) − M(θ_23, x)|² + |u_21(x) − M(θ_21, x)|²) / (2k²)),
where Ω is the image domain, u_23(x) and u_21(x) are the motion fields from the central to the next and from the central to the previous frame, respectively, M(θ, x) is the motion given by the model with parameters θ at the image point x, and k is the parameter that weights the dispersion of the misfit. J is the evaluation function: higher values of this function give better candidates. This evaluation still needs a parameter k that acts as the preferred dispersion. However, it will still give quite reasonable results, even when all clusters have high dispersions. After the evaluation stage, misfit histogram analysis is done in order to find the first mode of the misfit. We search for the first local minimum in a histogram smoothed by convolution with a Gaussian, because the first local minimum in an unsmoothed histogram is too sensitive to noise. Pixels that have a misfit below three times the standard deviation of the first mode are deemed to belong to the examined cluster. After the joint model fit, single-direction occlusion areas are added if the local motion is in good agreement with the fitted motion model. Our experiments show that the best over-clustering results were achieved using the similarity motion model:

(u_1, u_2, 1)ᵀ = M(θ, x) = [ sR − I, t ; 0 0, 1 ] (x_1, x_2, 1)ᵀ,
where u_1 and u_2 are the components of the motion vector, x_1 and x_2 are the coordinates of the original point, t = (t_1, t_2) is the translation, R is an orthonormal 2 × 2 rotation matrix, s is the scaling coefficient, and θ = (R, s, t) are the parameters of the model M. In order to assign clusters to areas which are marked as occlusions in both directions, we use a clustering inpainting algorithm. The algorithm searches for every occluded pixel and assigns it a cluster number using a local colour similarity assumption:

W(x, k) = Σ_{y∈Ω} { exp(−|I_1(x) − I_1(y)|² / (2σ²)), if C(y) = k; 0, if C(y) ≠ k },
C(x) = argmax_k W(x, k),
where Ω is the local neighbourhood domain, I1(x) is the current image pixel with coordinates x, C(x) is the cluster index of the pixel with coordinates x, W(x, k) is the weight of the cluster k for a pixel with coordinates x, and σ is the parameter controlling the colour similarity measure of current and neighbourhood pixels.
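A direct (unoptimized) sketch of this cluster inpainting rule is given below; for simplicity, I1 is treated as a single-channel image, and the neighbourhood radius and σ are illustrative values.

import numpy as np

def inpaint_cluster_labels(I1, C, occluded, radius=5, sigma=10.0):
    """Assign a cluster index to doubly-occluded pixels by weighted voting of the
    colour-similar neighbours, following W(x, k) and C(x) = argmax_k W(x, k)."""
    h, w = C.shape
    C_out = C.copy()
    for y, x in zip(*np.nonzero(occluded)):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        x0, x1 = max(0, x - radius), min(w, x + radius + 1)
        labels = C[y0:y1, x0:x1]
        colours = I1[y0:y1, x0:x1].astype(np.float32)
        known = ~occluded[y0:y1, x0:x1]
        # colour-similarity weights relative to the occluded pixel
        weights = np.exp(-(colours - float(I1[y, x])) ** 2 / (2.0 * sigma ** 2))
        scores = {}
        for k in np.unique(labels[known]):
            scores[k] = weights[known & (labels == k)].sum()
        if scores:
            C_out[y, x] = max(scores, key=scores.get)
    return C_out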
4.4.9 Results
The results of the clustering are the function C(x) and the list of models MC(θ, x). This result is used to generate unknown motion in occlusion areas using the motion model of the cluster that the occlusion pixel was added to. An example of a motion clustering result is shown in Fig. 4.11. The main part of the testing of our results was done on scene cuts of film videos to see “real-life” performance. The main problem with this evaluation is that the ground truth motion is unavailable and manual evaluation of the results is the only method we can use. To have some quantification as well as a comparison with the state of the art, we used the famous Middlebury optical flow evaluation database made by Baker et al. (2011). We use a colouring scheme that is used by the Middlebury benchmark. Motion estimation results are stable for a wide range of values of the regularization parameter lambda, as can be seen in their errors from ground truth in Fig. 4.12. The results with lower values of lambda are oversmoothed, whereas a value of lambda that is too high causes a lot of noise in the motion response as the algorithm tries to fix the noise in the input images. The non-local neighbourhood smoothness term improves the edge response in the constructed motion field. The simplified version can be seen as a relaxation to full non-local processing, and the quality of results is somewhere in between TV-L1 regularization and full non-local neighbourhood. An example of motion edge behaviour between these approaches is demonstrated in Fig. 4.13. You can see that the simplified non-local term decreases the smoothing around the motion edge but still creates a motion artefact not aligned with the edge of the object. However, this unwanted behaviour is usually caused by fast motion or a lack of texture on one of the motion layers. We found out that on the testing set of the Middlebury benchmark, the difference between the simplified and normal non-local neighbourhoods is not important. We think it is because the dataset has only small or moderate motion and usually has rather well-textured layers. You can see the comparison in Fig. 4.14, with a comparison of errors shown in Fig. 4.15. The proposed method of motion estimation keeps the sharp edges of the motion and tries to fix incorrect motion in occlusion areas. We also presented a way to relax the underlying model to allow considerable speedup of the computation. Our motion estimation method was ranked in the top 20 out of all algorithms on the Middlebury
Fig. 4.11 Motion inpainting result on a cut of the Grove3 sequence from the Middlebury benchmark: clustering result overlaid on the grey image (left), non-inpainted motion (centre), motion after occlusion inpainting (right)
Fig. 4.12 Average endpoint and angular errors for Middlebury testing sequence for changing the regularization parameter
Fig. 4.13 Motion results on Dimetrodon frame 10 for lambda equal to 1, 5, 20, 100
Fig. 4.14 Comparison of TV-L1 (left), 3 × 3 neighbourhood with special weighting (middle), and full 5 × 5 non-local neighbourhood (right) optical flow results; the top row is the motion colourmap, and in the bottom row it is overlaid on the greyscale image to demonstrate the alignment of motion and object edges
Fig. 4.15 Comparison of our 3 × 3 non-local neighbourhood with special weights and the 5 × 5 neighbourhood optical flow result on the Middlebury testing sequence
optical flow dataset at the time, but only one other method reported a better processing time on the "Urban" sequence.
4.5 Background Inpainting

4.5.1 Introduction
The task of the background inpainting step is to recover occluded parts of video frames to use later during stereo synthesis and to fill occlusion holes. We have an input video sequence and a corresponding sequence of masks (for every frame of the video) that denote areas to be inpainted (a space-time hole). The goal is to recover background areas denoted (covered) by object masks so that the result is visually plausible and temporally coherent. A lot of attention has been given to the area of still image inpainting. Notable methods include diffusion-based approaches, such as Telea (2004) and Bertalmio et al. (2000), and texture synthesis or exemplar-based algorithms, for example, Criminisi et al. (2004). Video inpainting imposes additional temporal restrictions, which make it more computationally expensive and challenging.
4.5.2 Related Work
Generally, most video inpainting approaches fall into two big groups: global and local with further filtering. Exemplar-based methods for still images can be naturally extended for videos as a global optimization problem of filling the space-time hole. Wexler et al. (2007) define a global method using patch similarity for video completion. The article reports satisfactory results, but the price for global
optimization is that the algorithm is extremely complex. They report several hours of computation for a video of very low resolution and short duration (100 frames of 340 × 120 pixels). A similar approach was taken by Shiratori et al. (2006). They suggest a procedure called motion field transfer to estimate motion inside a space-time hole. Motion is filled in with a patch-based approach using a special similarity measure. The recovered motion allows the authors to inpaint the video sequence while maintaining temporal coherence. However, the performance is also quite low (~40 minutes for a 60-frame video of 352 × 240). Bugeau et al. (2010) propose inpainting frames independently and filtering the results by Kalman filtering along point trajectories found by a dense optical flow algorithm. The slowest operation of this approach is the computation of optical flow. Their method produces a visually consistent result, but the inpainted region is usually smoothed, with occasional temporal artefacts.
4.5.3 Background Inpainting
In our work (Pohl et al. 2016), we decided not to use a global optimization approach, in order to make the algorithm more computationally efficient and to avoid the loss of small details in the restored background video. The proposed algorithm consists of three well-defined steps:

1. Restoration of background motion in foreground regions, using motion vectors
2. Temporal propagation of background image data using the restored background motion from step 1
3. Iterative spatial inpainting with temporal propagation of the inpainted image data

Background motion is restored by computing and combining several motion estimates based on optical flow motion vectors. Let black contour A be the edge of a marked foreground region (see Fig. 4.16a). Outside A, we use the motion produced by the optical flow algorithm, M0(x, y). In area B, we obtain a local background motion estimate M1(x, y). This local estimate can be produced, for example, by running a diffusion-like inpainting algorithm, such as Telea (2004),
Fig. 4.16 Background motion estimation: (a) areas around object’s mask; (b) example of background motion estimation
on M0(x, y) inside region B. Next, we use M0(x, y) from area C to fit the parameters a0, ..., a5 of an affine global motion model

$$
M_2(x, y) = \begin{pmatrix} a_0 + a_1 x + a_2 y \\ a_3 + a_4 x + a_5 y \end{pmatrix}.
$$
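The chapter does not state how a0, ..., a5 are fitted; a minimal sketch using ordinary least squares over motion samples taken from area C could look as follows (a robust fit such as RANSAC may be preferable in practice, and all names here are illustrative).

```python
import numpy as np

def fit_affine_motion(xs, ys, us, vs):
    """Least-squares fit of M2(x, y) = (a0 + a1*x + a2*y, a3 + a4*x + a5*y).

    xs, ys: sample coordinates from area C; us, vs: the M0 motion components at those samples.
    Returns the six parameters (a0, a1, a2, a3, a4, a5).
    """
    A = np.stack([np.ones_like(xs), xs, ys], axis=1).astype(np.float64)
    a012, _, _, _ = np.linalg.lstsq(A, us, rcond=None)
    a345, _, _, _ = np.linalg.lstsq(A, vs, rcond=None)
    return np.concatenate([a012, a345])

def eval_affine_motion(params, x, y):
    a0, a1, a2, a3, a4, a5 = params
    return a0 + a1 * x + a2 * y, a3 + a4 * x + a5 * y
```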
Motion M2(x, y) is used inside D. We choose area C to be separated from the foreground object contour A. This is to avoid introducing artefacts that might come from optical flow estimation around object edges into the global motion estimate. To improve motion smoothness, we blend motions M1 and M2 inside A. Let 0 ≤ W(x, y) ≤ 1 be a weighting function; then, the resulting motion vector field is computed as

$$
M(x, y) = (1 - W(x, y))\, M_2(x, y) + W(x, y)\, M_1(x, y).
$$

W(x, y) is defined as an exponential decay (with a reasonable decay rate parameter in pixels) of the distance from the foreground area edge A. It is equal to 1 outside contour A, and the distance is computed by an efficient distance transform algorithm. As a result, we obtain a full-frame per-pixel estimate of background motion that generally has the properties of global motion inside previously missing regions but does not suffer from discontinuity problems around the foreground region edges (Fig. 4.16b).

The second step, temporal propagation, is the crucial part of our algorithm. The forward and backward temporal passes are symmetrical, and they fill in the areas that were visible in other frames of the video. We do a forward and a backward pass through the video sequence using integrated (accumulated) motion in occluded regions. In the forward pass, we integrate motion in the backward direction and decide which pixels can be filled by data from the past. The same is done for the backward pass to fill in pixels from future frames. After the forward and backward temporal passes, we can still have some areas that were not inpainted (unfilled areas). These areas were not seen during the entire video clip. We use still image spatial inpainting to fill in the missing data in a selected frame and propagate it using the restored background motion to achieve temporal consistency. Let us introduce the following notation:

– I(n, x) is the nth input frame.
– M(m, n, x) is the restored background motion from frame m to frame n.
– F(n, x) is the input foreground mask for frame n (the area to be inpainted).
– QF(n, x) is the inpainted area mask for frame n.
– IF(n, x) is the inpainted frame n.
– T is the temporal window size (an algorithm parameter).
Algorithm 4.3 Forward Temporal Pass

for Ncurr = 1 .. Nframes do
  Initialize the forward pass image and mask:
    IF(Ncurr, x) = I(Ncurr, x)
    QF(Ncurr, x) = 0
  Iterate backward for temporal data:
  for Nsrc = Ncurr − 1 .. max{1, Ncurr − T} do
    Integrate motion:
      M(Ncurr, Nsrc, x) = M(Ncurr, Nsrc + 1, x) + M(Nsrc + 1, Nsrc, x + M(Ncurr, Nsrc + 1, x))
    New inpainted mask:
      QFnew(Ncurr, x) = F(Ncurr, x) \ QF(Ncurr, x) \ F(Nsrc, x + M(Ncurr, Nsrc, x))
    Update the inpainted image and mask:
    for all points x marked in QFnew(Ncurr, x) do
      IF(Ncurr, x) = IF(Nsrc, x + M(Ncurr, Nsrc, x))
      QF(Ncurr, x) = QF(Ncurr, x) ∪ QFnew(Ncurr, x)
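A compact Python sketch of this forward pass is given below; it assumes 0-based frame indices, nearest-neighbour sampling of the motion fields, and that the restored background motion towards the previous frame is available as motion_to_prev. These are all assumptions, since the chapter leaves such implementation details open.

```python
import numpy as np

def forward_temporal_pass(frames, fg_masks, motion_to_prev, T):
    """Sketch of Algorithm 4.3: greedily fill foreground pixels with background data from past frames.

    frames: list of H x W x C float arrays I(n, x).
    fg_masks: list of H x W bool arrays F(n, x) (areas to be inpainted).
    motion_to_prev[n]: H x W x 2 restored background motion from frame n to frame n - 1 (dx, dy).
    T: temporal window size.
    """
    n_frames = len(frames)
    h, w = fg_masks[0].shape
    yy, xx = np.mgrid[0:h, 0:w]
    out_frames, out_masks = [], []
    for n_curr in range(n_frames):
        img = frames[n_curr].copy()          # I_F(N_curr, x)
        q = np.zeros((h, w), dtype=bool)     # Q_F(N_curr, x)
        m_int = np.zeros((h, w, 2))          # integrated motion M(N_curr, N_src, x)
        for n_src in range(n_curr - 1, max(-1, n_curr - T - 1), -1):
            # compose the integrated motion with the step N_src + 1 -> N_src at the displaced position
            sx = np.clip(np.rint(xx + m_int[..., 0]).astype(int), 0, w - 1)
            sy = np.clip(np.rint(yy + m_int[..., 1]).astype(int), 0, h - 1)
            m_int = m_int + motion_to_prev[n_src + 1][sy, sx]
            # source coordinates in frame N_src
            sx = np.clip(np.rint(xx + m_int[..., 0]).astype(int), 0, w - 1)
            sy = np.clip(np.rint(yy + m_int[..., 1]).astype(int), 0, h - 1)
            # pixels still to fill whose source location is visible background in frame N_src
            new_mask = fg_masks[n_curr] & ~q & ~fg_masks[n_src][sy, sx]
            img[new_mask] = frames[n_src][sy[new_mask], sx[new_mask]]
            q |= new_mask
        out_frames.append(img)
        out_masks.append(q)
    return out_frames, out_masks
```

The backward pass is obtained by reversing the frame iteration and using motion towards the next frame instead of motion_to_prev.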
Algorithm 4.3 is a greedy algorithm that inpaints the background with data from the temporally least distant frame. The backward temporal pass algorithm is doing the same operations, only with a reversed order of iterations and motion directions. Some areas can be filled from both sides. In this case, we need to find a single inpainting solution. We used the temporally less distant source. In the case of the same temporal distance, a blending procedure was done based on the distance from the non-inpainted area. The third step is the spatial pass. The goal of the spatial pass is to inpaint regions that were not filled by temporal passes. To achieve temporal stability, we inpaint on a selected frame and propagate inpainted data temporally. We found that a reasonable strategy is to find the largest continuous unfilled area, use spatial inpainting to fill it in, and then propagate through the whole sequence using a background motion estimate. It is necessary to perform spatial inpainting with iterative propagation until all unfilled areas are inpainted. Any spatial inpainting algorithm can be used; in our work, we experimented with exemplar-based and diffusion-based methods. Our
experience shows that it is better to use a diffusion algorithm for filling small or thin areas, while an exemplar-based algorithm is better for larger unfilled parts.

Let us denote by QFB(n, x) and IFB(n, x) the temporally inpainted mask and the background image, respectively, after the forward and backward passes are blended together. QS(n, x) and IS(n, x) are the mask and the image after temporal and spatial inpainting. |D| stands for the number of pixels in the image domain D.

Algorithm 4.4 Spatial Pass

Initialize the spatially inpainted image and mask:
  QS(n, x) = QFB(n, x), IS(n, x) = IFB(n, x)
while |F(n, x) ∩ ~QS(n, x)| > 0 for some (n, x) do
  Find Ncurr such that |F(Ncurr, x) ∩ ~QS(Ncurr, x)| is maximal
  Let R(Ncurr, x) = F(Ncurr, x) ∩ ~QS(Ncurr, x)
  Inpaint area R(Ncurr, x) and store the pixel data into IS(Ncurr, x)
  QS(Ncurr, x) = QS(Ncurr, x) ∪ (F(Ncurr, x) ∩ ~QS(Ncurr, x))
  Propagate the inpainted area forward and backward:
  for all frames Ndst : |F(Ndst, x) ∩ ~QS(Ndst, x)| > 0 do
    Integrate motion to obtain M(Ndst, Ncurr, x)
    Get the new inpainting mask:
      QSnew(Ndst, x) = F(Ndst, x) ∩ ~QS(Ndst, x) ∩ QS(Ncurr, x + M(Ndst, Ncurr, x))
    Update the inpainted image:
    for all points x marked in QSnew(Ndst, x) do
      IS(Ndst, x) = IS(Ncurr, x + M(Ndst, Ncurr, x))
      QS(Ndst, x) = QS(Ndst, x) ∪ QSnew(Ndst, x)
4.5.4 Results
For quantitative evaluation, we generated a set of synthetic examples with known, ground truth backgrounds and motion. There are two kinds of object motion in the test sequences:
Fig. 4.17 Synthetic data example
Fig. 4.18 Background restoration quality: (a) simple motion; (b) affine motion
1. Simple motion. The background and the foreground are moved by two different randomly generated motions, which include only rotation and shift.
2. Affine motion. The background and the foreground are moved by two different randomly generated affine motions.

Evaluation was done by running the proposed algorithm with default parameters on the synthetic test sequences (an example is shown in Fig. 4.17). We decided to use the publicly available implementation of the exemplar-based inpainting algorithm by Criminisi et al. (2004) to provide a baseline for our approach. Our goal was to test both the quality of background inpainting for each frame and the temporal stability of the resulting sequence. For the evaluation of the background inpainting quality, we use the PSNR between the algorithm's output and the ground truth background. Results are shown in Fig. 4.18.

To measure temporal stability, we use the following procedure. Let I(n, x) and I(n + 1, x) be a pair of consecutive frames with inpainted backgrounds, and let MGT(n, n + 1, x) be the ground truth motion from frame n to frame n + 1. We compute the PSNR between I(n, x) and I(n + 1, x + MGT(n, n + 1, x)) (sampling is done using bicubic interpolation). The results are shown in Fig. 4.19. As we can see, our inpainting algorithm usually provides a slightly better quality of background restoration than exemplar-based inpainting when analysed statistically. However, it produces far more temporally stable output, which is very important for video inpainting.
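A minimal sketch of this temporal stability measurement is given below; it uses nearest-neighbour sampling instead of the bicubic interpolation mentioned above, and the function names and the 8-bit peak value are assumptions.

```python
import numpy as np

def psnr(a, b, peak=255.0):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def temporal_stability_psnr(frame_n, frame_n1, gt_motion):
    """PSNR between frame n and frame n + 1 sampled at x + M_GT(n, n + 1, x)."""
    h, w = frame_n.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    sx = np.clip(np.rint(xx + gt_motion[..., 0]).astype(int), 0, w - 1)
    sy = np.clip(np.rint(yy + gt_motion[..., 1]).astype(int), 0, h - 1)
    warped = frame_n1[sy, sx]        # frame n + 1 sampled along the ground-truth motion
    return psnr(frame_n, warped)
```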
Fig. 4.19 Temporal stability: (a) simple motion; (b) affine motion
Fig. 4.20 Background inpainting results
We applied our algorithm to a proprietary database of videos with two types of resolution: 1920 × 1080 (FHD) and 960 × 540 (quarter HD, qHD). The method shows reasonable quality for scenes without changes in scene properties (brightness change, different focus, changing fog or lights) and with a rigid scene structure. A few examples of our algorithm outputs are shown in Fig. 4.20. Typical visible artefacts include misalignments at the edges of the temporal inpainting direction (presumably in cases when motion integration is not precise enough) and mixing parts of a scene whose appearance properties changed with time. Examples of such artefacts are shown in Fig. 4.21. The running time of the algorithm (not including optical flow estimation) is around 1 s/frame for qHD and 3 s/frame for FHD sequences on a PC with a single GPU. The results are quite acceptable for the restoration of areas occluded due to
Fig. 4.21 Typical artefacts: misalignment (top row) and scene change artefacts (bottom row)
stereoscopic parallax for a limited range of scenes, but for wider applicability, it is necessary to decrease the level of artefacts. It is possible to apply a more advanced analysis of propagation reliability to make a better decision between spatial and temporal inpainting; the choice of the temporal inpainting direction could also use an analysis of the level of scene changes. In addition, it may be useful to improve the alignment of temporally inpainted parts from different time moments or to apply the Poisson seamless stitching approach of Pérez et al. (2003) in cases where there are overlapping parts.
4.6 View Rendering
As we mentioned in our discussion of view rendering in Chap. 3, the main goal of view rendering is to fill holes caused by the disocclusion of objects. In the case of a still background, where no temporal inpainting is available, the holes should be filled with a conventional image-based hole-filling algorithm. Quite a few methods have been proposed in the literature. The well-known Criminisi algorithm (Criminisi et al. 2004) allows the restoration of large holes in an image. The main idea of this method is to select a pixel on the boundary of the damaged area, centre a texture block of an appropriate size on that point according to the texture features of the image, find the best matching block, and replace the texture block with this block to complete the image restoration. The main problem with this method is caused by image discontinuities overlapping with holes. The artefacts caused by regular patterns are the most noticeable and disturbing. To overcome this problem, we proposed looking for straight lines (or edges) in the image, finding those that cross holes, filling the hole along the line in the first step, and filling the rest of the hole in the next step.
Fig. 4.22 Background inpainting with straight lines: rendered stereo (with holes) and hole filling along straight vertical lines
Straight line detection was done with the Hough transform and a voting procedure. Patches along the lines are propagated inside the hole, and the rest of the hole is filled with a conventional method, like the one described by Criminisi et al. (2004). Figure 4.22 illustrates the steps of the algorithm.
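A possible OpenCV-based sketch of the line-detection stage is shown below; the Canny preprocessing, the probabilistic Hough variant, and all thresholds are illustrative assumptions, since the chapter only specifies that a Hough transform with a voting procedure is used.

```python
import cv2
import numpy as np

def detect_lines_crossing_hole(image, hole_mask, min_votes=80):
    """Find straight edges whose extension crosses the disocclusion hole.

    image: BGR uint8 frame; hole_mask: mask of pixels to be filled.
    Returns line segments (x1, y1, x2, y2) that intersect the hole when extended.
    """
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    edges[hole_mask.astype(bool)] = 0          # edges inside the hole must not vote
    segments = cv2.HoughLinesP(edges, 1, np.pi / 180, min_votes,
                               minLineLength=30, maxLineGap=5)
    crossing = []
    if segments is None:
        return crossing
    hole = hole_mask.astype(bool)
    h, w = hole.shape
    for x1, y1, x2, y2 in segments[:, 0]:
        # sample points along the extended segment and test whether any falls inside the hole
        for t in np.linspace(-0.5, 1.5, 200):
            x = int(round(x1 + t * (x2 - x1)))
            y = int(round(y1 + t * (y2 - y1)))
            if 0 <= x < w and 0 <= y < h and hole[y, x]:
                crossing.append((x1, y1, x2, y2))
                break
    return crossing
```

The returned segments define the directions along which patches are propagated into the hole before the remaining pixels are filled by the conventional exemplar-based method.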
4.7 Stereo Content Quality
A large area of research is devoted to stereo quality estimation. Bad stereo results lead to a poor viewing experience, headache, and eye fatigue. That is why it is important to review media content and possibly eliminate production or post-production mistakes. The main causes of eye fatigue are listed in Chap. 3, and content quality is among the mentioned items. The most common quality issues that can be present in stereo films (including even well-known titles with million-dollar budgets) include the following:

• Mixed left and right views
• Stereo view rotation
• Different sizes of objects
• Vertical disparity
• Temporal shift between views
• Colour imbalance between views
• Sharpness difference
A comprehensive review of the defects, with examples from well-known titles, is given by Vatolin (2015a, b) and Vatolin et al. (2016). The enumerated defects are more common for films shot in stereo, when, for example, a camera rig was not properly calibrated, which leads to vertical disparity or different sizes of objects between views, or when different camera settings (like the aperture setting) were used for the left and right cameras, which leads to a sharpness difference and a depth of field difference. Such problems can be detected automatically and sometimes even fixed without noticeable artefacts (Voronov et al. 2013).

There are many difficult situations during stereo conversion that require special attention during post-production. The most difficult situations for stereo conversion are the following:

• The low contrast of the input video does not allow reliable depth propagation.
• A fast motion scene is a challenge for motion estimation and background motion model computation.
• Transparent and semi-transparent objects are very difficult to render without artefacts.
• Objects changing shape, when the depth map of an object can change over time.
• Depth changes/zoom present in a scene require accurate depth interpolation for smooth stereo rendering with growing occlusion areas.
• Foreground and background similarity leads to background inpainting errors.
• Small details are often lost during interpolation and object matting.
• Changing light makes it difficult to find interframe matching for motion estimation and depth propagation.

Such difficult situations can be automatically detected during the very first step of semi-automatic stereo conversion: the key frame detection step. Usually, people are involved in post-processing quality estimation, since the converted title may look totally different on a computer monitor and on the big screen. That is why quality control requirements for different systems (like small-screened smartphones, small TV sets, and wide-screen cinema) will be different.
References Appia, V., Batur, U.: Fully automatic 2D to 3D conversion with aid of high-level image features. In: Stereoscopic Displays and Applications XXV, vol. 9011, p. 90110W (2014) Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evaluation methodology for optical flow. Int. J. Comp. Vision. 92(1), 1–31 (2011) Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: Proceedings of the 27th Annual Conference on Computer graphics and Interactive Techniques, p. 417 (2000) Bugeau, A., Piracés, P.G.I., d'Hondt, O., Hervieu, A., Papadakis, N., Caselles, V.: Coherent background video inpainting through Kalman smoothing along trajectories. In: Proceedings of 2010–15th International Workshop on Vision, Modeling, and Visualization, p. 123 (2010)
Butler D.J., Wulff J., Stanley G.B., Black M.J.: A Naturalistic Open Source Movie for Optical Flow Evaluation. In: Fitzgibbon A., Lazebnik S., Perona P., Sato Y., Schmid C. (eds) Computer Vision – ECCV 2012. ECCV 2012. Lecture Notes in Computer Science, vol 7577. Springer, Berlin, Heidelberg (2012) Cao, X., Li, Z., Dai, Q.: Semi-automatic 2D-to-3D conversion using disparity propagation. IEEE Trans. Broadcast. 57, 491–499 (2011) Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vision. 40(1), 120–145 (2011) Criminisi, A., Pérez, P., Toyama, K.: Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process. 13(9), 1200–1212 (2004) Feng, J., Ma, H., Hu, J., Cao, L., Zhang, H.: Superpixel based depth propagation for semi-automatic 2D-to-3D video conversion. In: Proceedings of IEEE Third International Conference on Networking and Distributed Computing, pp. 157–160 (2012) Feng, Z., Chao, Z., Huamin, Y., Yuying, D.: Research on fully automatic 2D to 3D method based on deep learning. In: Proceedings of the IEEE 2nd International Conference on Automation, Electronics and Electrical Engineering, pp. 538–541 (2019) Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM. 24(6), 381–395 (1981) Harman, P.V., Flack, J., Fox, S., Dowley, M.: Rapid 2D-to-3D conversion. In: Stereoscopic displays and virtual reality systems IX International Society for Optics and Photonics, vol. 4660, pp. 78–86 (2002) Ignatov, A., Bucha, V., Rychagov, M.: Disparity estimation in real-time 3D acquisition and reproduction system. In: Proceedings of the International Conference on Computer Graphics «Graphicon 2009», pp. 61–68 (2009) Irony, R., Cohen-Or, D., Lischinski, D.: Colorization by example. In: Proceedings of the Sixteenth Eurographics conference on Rendering Techniques, pp. 201–210 (2005) Korman, S., Avidan, S.: Coherency sensitive hashing. IEEE Trans. Pattern Anal. Mach. Intell. 38 (6), 1099–1112 (2015) Muelle, M., Zill, F., Kauff, P.: Adaptive cross-trilateral depth map filtering. In: Proceedings of the IEEE 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video, pp. 1–4 (2010) Pérez, P., Gangnet, M., Blake, A.: Poisson image editing. In: ACM SIGGRAPH Papers, pp. 313–318 (2003) Pohl, P., Molchanov, A., Shamsuarov, A., Bucha, V.: Spatio-temporal video background inpainting. Electron. Imaging. 15, 1–5 (2016) Pohl, P., Sirotenko, M., Tolstaya, E., Bucha, V.: Edge preserving motion estimation with occlusions correction for assisted 2D to 3D conversion. In: Image Processing: Algorithms and Systems XII, 9019, pp. 901–906 (2014) Shiratori, T., Matsushita, Y., Tang, X., Kang, S.: Video completion by motion field transfer. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, p. 411 (2006) Sun, J., Xie, J., Li, J., Liu, W.: A key-frame selection method for semi-automatic 2D-to-3D vonversion. In: Zhang, W., Yang, X., Xu, Z., An, P., Liu, Q., Lu, Y. (eds.) Advances on Digital Television and Wireless Multimedia Communications. Communications in Computer and Information Science, vol. 331. Springer, Berlin, Heidelberg (2012) Sun, D., Roth, S., Black, M.J.: Secrets of optical flow estimation and their principles. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 
2432–2439 (2010a) Sun, D., Sudderth, E., Black, M.: Layered image motion with explicit occlusions, temporal consistency, and depth ordering. In: Proceedings of the 24th Annual Conference on Neural Information Processing Systems, pp. 2226–2234 (2010b)
Teed, Z., Deng, J.: Raft: Recurrent all-pairs field transforms for optical flow. In: Proceedings of the European Conference on Computer Vision, pp. 402–419 (2020) Telea, A.: An image inpainting technique based on the fast marching method. J. Graph. Tools. 9 (1) (2004) Tolstaya E.: Implementation of Coherency Sensitive Hashing algorithm. (2020). Accessed on 03 October 2020. https://github.com/ktolstaya/PyCSH Tolstaya, E., Hahn, S.-H.: Method and system for selecting key frames from video sequences. RU Patent 2,493,602 (in Russian) (2012) Tolstaya, E., Pohl, P., Rychagov, M.: Depth propagation for semi-automatic 2d to 3d conversion. In: Proceedings of SPIE Three-Dimensional Image Processing, Measurement, and Applications, vol. 9393, p. 939303 (2015) Varekamp, C., Barenbrug, B.: Improved depth propagation for 2D to 3D video conversion using key-frames. In: Proceedings of the 4th European Conference on Visual Media Production (2007) Vatolin, D.: Why Does 3D Lead to the Headache? / Part 8: Defocus and Future of 3D (in Russian) (2019). Accessed on 03 October 2020. https://habr.com/ru/post/472782/ Vatolin, D., Bokov, A., Erofeev, M., Napadovsky, V.: Trends in S3D-movie quality evaluated on 105 films using 10 metrics. Electron. Imaging. 2016(5), 1–10 (2016) Vatolin, D.: Why Does 3D Lead to the Headache? / Part 2: Discomfort because of Video Quality (in Russian) (2015a). Accessed on 03 October 2020. https://habr.com/en/post/377709/ Vatolin, D.: Why Does 3D Lead to the Headache? / Part 4: Parallax (in Russian) (2015b). Accessed on 03 October 2020. https://habr.com/en/post/378387/ Voronov, A., Vatolin, D., Sumin, D., Napadovsky, V., Borisov, A.: Methodology for stereoscopic motion-picture quality assessment. In: Proceedings of SPIE Stereoscopic Displays and Applications XXIV, vol. 8648, p. 864810 (2013) Wang, D., Liu, J., Sun, J., Liu, W., Li, Y.: A novel key-frame extraction method for semi-automatic 2D-to-3D video conversion. In: Proceedings of the IEEE international Symposium on Broadband Multimedia Systems and Broadcasting, pp. 1–5 (2012) Werlberger, M., Pock, T., Bischof, H.: Motion estimation with non-local total variation regularization. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2464–2471 (2010) Wexler, Y., Shechtman, E., Irani, M.: Space-time completion of video. IEEE Trans. Pattern Anal. Mach. Intell. 29(3), 463–476 (2007) Xie, J., Girshick, R., Farhadi, A.: Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In: Proceedings of the European Conference on Computer Vision, pp. 842–857 (2016) Yuan, H.: Robust semi-automatic 2D-to-3D image conversion via residual-driven optimization. EURASIP J. Image Video Proc. 1, 66 (2018) Zhao, S., Sheng, Y., Dong, Y., Chang, E., Xu, Y.: MaskFlownet: asymmetric feature matching with learnable occlusion mask. In: Proceedings of the CVPR, vol. 1, pp. 6277–6286 (2020) Zitnick, C.L., Kang, S.B., Uyttendaele, M., Winder, S., Szeliski, R.: High-quality video view interpolation using a layered representation. ACM Transactions on Graphics. 23(3) (2004)
Chapter 5
Visually Lossless Colour Compression Technology

Michael N. Mishourovsky
5.1 Why Embedded Compression Is Required

5.1.1 Introduction
Modern video processing systems such as TV devices, mobile devices, video coders/decoders, and surveillance equipment process a large amount of data: high-quality colour video sequences with resolutions up to UHD at a high frame rate, supporting HDR (high bit depth per colour component). Real-time multistage digital video processing usually requires high-speed data buses and huge intermediate buffering, which lead to increased cost and power consumption. Although the latest achievements in RAM design make it possible to store a lot of data, reduction of bandwidth requirements is always a desired and challenging task; moreover, for some applications it might become a bottleneck due to the balance between cost and technological advances in a particular product. One possible approach to cope with this issue is to represent video streams in a compact (compressed) format, preserving the visual quality, with a possible small level of losses as long as the visual quality does not suffer. This category of algorithms is called embedded or texture memory compression; it is usually implemented in hardware (HW) and should provide very high visual fidelity (quality), a reasonable compression ratio, and low HW complexity. It should impose no strong limitations on data fetching, and it should be easy to integrate into synchronous parts of application-specific integrated circuits (ASIC) or systems on chip (SoC). In this chapter, we describe in detail the so-called visually lossless colour compression technology (VLCCT), which satisfies the abovementioned technical
requirements. We also touch on the question of visual quality evaluation (and what visual quality means) and present an overview of the latest achievements in the fields of embedded compression, video compression, and quality evaluation. One of the key advantages of the described technology is that it uses only one row of memory for buffering, works with extremely small blocks, and provides excellent subjective and objective image quality at a fixed compression ratio, which is critical for random-access data fetching. It is also robust to errors, as it introduces no inter-block dependency, which effectively prevents error propagation. According to estimates done during the technology transfer, the technology requires only a small number of equivalent physical gates and could be widely adopted into various HW video processing pipelines.
5.1.2 Prior-Art Algorithms and Texture Compression Technologies
The approach underlying VLCCT is not new; it is applied in GPUs, OpenGL and Vulkan support several algorithms to compress textures effectively, and operating systems such as Android and iOS both support several techniques for encoding and decoding textures. During the initial stage of the development, several prototypes were identified, including the following: algorithms based on decorrelating transformations (DCT, wavelet, differential-predictive algorithms) combined with entropy coding of different kinds; the block truncation encoding family of algorithms; block palette methods; and the vector quantization method. Mitsubishi developed a technology called fixed block truncation coding (Matoba et al. 1998; Takahashi et al. 2000; Torikai et al. 2000) and even produced an ASIC implementing this algorithm for the mass market. It provides fixed compression with relatively high quality. Fuji created an original method called Xena (Sugita and Watanabe 2007; Sugita 2007), which was rather effective and innovative but provided only lossless compression. In 1992, a paper was published (Wu and Coll 1992) suggesting an interesting approach essentially targeted at creating an optimal palette for a fixed block to minimize the maximum error. The industry adopted several algorithms from S3, the so-called S3 Texture Compression (DXT1–DXT5), and ETC/ETC1/ETC2 were developed by Ericsson. ETC1 is currently supported by Android OS (see the reference) and is included in the specification of OpenGL – Methods for encoding and decoding ETC1 textures (2019). PowerVR Texture Compression was designed for graphic cores and then patented by Imagination Technologies. Adaptive Scalable Texture Compression (ASTC) was jointly developed by ARM and AMD and presented in 2012 (several years after VLCCT was invented). In addition to the abovementioned well-known techniques, the following technologies were reviewed: Strom and Akenine-Moeller (2005), Mitchell and Delp (1980), Jaspers and de With (2002), Odagiri et al. (2007), and Lee et al. (2008). In
general, all these technologies provide compression ratios from ~1.5 to 6 times with different quality of decompressed images, different complexity, and various optimization levels. However, most of them do not provide the required trade-off between algorithmic and HW complexity (most of these algorithms require several buffering rows – from 2 to 4 and even more), visual quality, and the other requirements behind VLCCT. Concluding this part, the reader can consult the following sources for a thorough overview of different texture compression methods: Vulkan SDK updated by Paris (2020), ASTC Texture Compression (2019), and Paltashev and Perminov (2014).
5.2 Requirements and Architecture of VLCCT
According to the business needs, the limitations on HW complexity, and the acceptance criteria for visual quality, the following requirements were defined (Table 5.1):

Table 5.1 Initial VLCCT requirements
Characteristic: Initial value
Input spatial resolution: from 720 × 480 to 1920 × 1080, progressive
Frame rate: up to 60 fps
Input colour format: YCbCr 4:4:4, mandatory; 4:2:2, optional
Bit depth: 8; 10 bits
Scan order: left-to-right; up-to-down
Targeted compression ratio: >2 times
Distortion level: visually lossless
Bitrate character: constant
Spatial resolution of chroma: should not be reduced by means of decimation
Complexity of a decoder: 30 k gates
Type of potential content: any type
Maximum pixel size in a display device, in mm: ~0.40 × 0.23 (based on LE40M8 LCD, which is like LN40M81BD)
Minimal distance for observing the display device: 0.4 m
Zoom in: no zoom

Most compression technologies rely on some sort of redundancy elimination mechanism, which can essentially be classified as follows:

1. Visual redundancy (caused by human visual system perception).
2. Colour redundancy.
3. Intra-frame and inter-frame redundancy.
4. Statistical redundancy (attributed to the probabilistic nature of elements in the stream; a stream is not random, and this might be used by different methods, such as:
   • Huffman encoding and other prefix codes
   • Arithmetic encoding – initial publication by Rissanen and Langdon (1979)
   • Dictionary-based methods
   • Context adaptive binary arithmetic encoding and others)
Usually, some sort of data transformation is used:

• Discrete cosine transform (DCT), which is an orthogonal transform with normalized basis functions. DCT approximates decorrelation (removal of linear relations in data) for natural images (PCA for natural images) and is usually applied at the block level.
• Different linear prediction schemes with quantization (so-called adaptive differential pulse-code modulation, ADPCM), applied to effectively reduce the self-similarity of images.

However, a transform or prediction by itself does not provide compression; its coefficients must be further effectively encoded. To accomplish this goal, classical compression schemes include entropy coding, reducing the statistical redundancy. As HW complexity has been a critical, limiting factor, it was decided to consider a rather simple pipeline, which prohibited the collection of a huge amount of statistics and the provision of long-term adaptation (as the CABAC engine does in modern H.264/H.265 codecs). Instead, a simple approach with a one-row buffer and a sliding window was adopted. The high-level diagram of the data processing pipeline for the encoding part is shown in Fig. 5.1.
Fig. 5.1 High-level data processing pipeline for the encoding part of VLCCT: a K-pixel input window, an (N − 1)-line internal buffer, the coder, and the output packet
Fig. 5.2 The structure of a compressed elementary packet
Among the prototype methods evaluated during the development were the following:

• ADPCM methods with different predictors, including adaptive and edge-sensitive causal 2D predictors
• ADPCM combined with modular arithmetic
• Grey codes to improve bit-plane encoding
• Adaptive Huffman codes

Unfortunately, all these methods suffered from either a low compression ratio, visually noticeable distortions, or high complexity. Content analysis helped to identify several categories of images:

1. Natural images
2. Computer graphics images
3. High-frequency (HF) and test pattern images (charts)

Firstly, a high-level structure of a compressed bitstream of an image was proposed. It consists of blocks (elementary packets) of a fixed size (as opposed to a variable-length packet structure); this approach helped to fetch data from RAM and to organize the bitstream and the packing/unpacking mechanisms. Within the elementary packet, several subparts were defined (Fig. 5.2), where VLPE stands for variable-length prefix encoding. As the input might be of 10-bit depth, we decided to represent the bitstream as two parts: a high 8-bit part and a low 2-bit part. The reason for such a representation is that, according to the analysis, display systems (at least at the time of the development) were not able to show much difference between signals differing in the least significant bits. Also, the analog-to-digital converters of capturing devices as well as display systems had a signal-to-noise ratio roughly just a little higher than the 8-bit range. Bit-stuffing helped to keep all the packets well aligned with the fixed size.

It was decided to use a small processing block of 2 × 4 pixels. Two lines directly relate to the specification limits on complexity, and four pixels are sufficiently small to work locally. From another point of view, this is a minimal block size that can be constructed from simple transforms like 2 × 2 transforms. Assuming 10 bits per pixel and three colour channels, the original bit size was 240 bits; to provide a compression ratio larger than 2, we targeted a final elementary packet size of 100 to 120 bits per packet. To pick the right method for encoding, special modules were assumed; a VLPE table and a bitstream packer were also included in the detailed architecture of the VLCCT. To process data efficiently, two types of methods were identified:

1. Methods that process the 2 × 4 block without explicit sub-partitioning
2. Methods that split the 2 × 4 block into smaller parts, for example, two sub-blocks of 2 × 2 size

The final architecture of the VLCCT is presented in Fig. 5.3.
Fig. 5.3 The final architecture of VLCCT including all main blocks and components
5.3 Methods for Encoding
According to the identified categories of image content, we proposed methods that work with 2 × 4 blocks and methods that work with 2 × 2 blocks: they will be explained further. Their names came from internal terminology and were kept "as is" to be consistent with the original VLCCT description.
5.3.1 Overview of 2 × 4 Methods
There are seven sub-methods developed to compress 2 × 4 pixel blocks. In particular:

• D-method. It is designed to encode parts of an image that are complex and in which diagonal edges dominate.
• F-method. Optimized for effective encoding of image parts that include gradients.
• M-method. This method is designed to effectively compress test image patterns and charts; usually such images are used to verify the resolution and other technical parameters of devices.
• O-method. This method is designed to effectively compress image parts where only one colour channel dominates – red, green, blue, or luminance.
• L-method. This method is applied to all three colour channels simultaneously. To effectively represent the image structure, an individual palette is provided for each 2 × 2 sub-block. Three sub-modes are provided to encode the palette values for each 2 × 2 sub-block: inter-sub-block differential encoding of palettes, intra-sub-block differential palette encoding, and a bypass mode with quantization of the palette colours.
• U-method. This method is provided for parts of an image that consist of complex structure parts and uniform regions; uniform regions are represented by means of a mean value, while the remaining bits are allocated to the complex structure parts.
• H-method. This method is to encode natural images where smooth transitions exist. Each 2 × 4 block is considered as a set of two 2 × 2 blocks, and then a 2 × 2 DCT transform is applied. The DCT coefficients are encoded by means of fixed quantization (the Lloyd-Max quantizer, which is also known as k-means clustering).
5.3.2 Overview of 2 × 2 Methods
For individual encoding of 2 × 2 sub-blocks, five additional methods are provided. These methods are applied to each 2 × 2 sub-block independently according to the minimal error and can be combined, providing additional adaptation:

• N-method. This method is to encode natural images and is based on a new transform, called NSW (explained later), and fixed quantization of differential values. The NSW transform is a simple 2D transform which combines predictive encoding, the 1D Haar wavelet, and a lifting scheme.
• P-method. This method is a modification of the N-method and improves the encoding of vertical/horizontal/diagonal-like sub-blocks. It is based on vertical/horizontal/diagonal and mean value encoding for an underlying sub-block.
• C-method. This method is based on the NSW transform applied to all colour channels simultaneously, followed by group encoding of the differential values from all colour channels together. By doing so, it is possible to reduce the error of the encoded values if the colour channels are well correlated with each other. This is achieved by removing the quantizer adaptation for each colour channel and increasing the quantizer dynamic range.
• E-method. This method is applied to increase the precision of colour transition reproduction near the boundary between two colours.
• S-method. This method is like the N-method, but the idea of this method is to improve the precision of encoding image parts where the differential values after NSW are small for all colour channels.

The NSW transform is a 2D transform specifically designed to decorrelate pixel values within the smallest possible block, 2 × 2. If pixels within a 2 × 2 block
Fig. 5.4 Calculation of the features using lifting scheme: (a) pixel notation; (b) 2D-lifting directions
are denoted as specified in Fig. 5.4a, the following features are calculated using a lifting scheme (Sweldens 1997) (Fig. 5.4b): {A, B, C, D} → {s, h, v, d} are transformed according to the following:

Forward transform:
$$
h = A - B, \quad v = A - C, \quad d = A - D, \quad s = A - \frac{h + v + d}{4}
$$

Inverse transform:
$$
A = s + \frac{h + v + d}{4}, \quad B = A - h, \quad C = A - v, \quad D = A - d
$$
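A minimal Python sketch of the forward and inverse NSW transform follows; the integer (floor) division used for the mean is an assumed rounding convention, and the lifting structure guarantees exact reconstruction as long as the encoder and decoder apply the same convention.

```python
def nsw_forward(a, b, c, d):
    """Forward NSW transform of a 2x2 block with pixels A, B (top row) and C, D (bottom row)."""
    h = a - b
    v = a - c
    dd = a - d
    s = a - (h + v + dd) // 4   # floor division is an assumption; any fixed rounding works
    return s, h, v, dd

def nsw_inverse(s, h, v, dd):
    """Inverse NSW transform; exactly reverses nsw_forward for integer inputs."""
    a = s + (h + v + dd) // 4
    return a, a - h, a - v, a - dd
```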
Here, s is the mean value of the four pixel values, and h, v, d are the simplest directional difference values. The s-value requires 8 bits to store; h, v, and d require 8 bits plus 1 sign bit each. How can the block be encoded effectively with a smaller number of bits? Let us consider two 2 × 2 blocks and the initial limits on the bit budget. According to this, we may come to an average of 15 bits per block per colour channel. The mean value s can be represented with 6-bit precision (uniform quantization was proven to be a good choice); quantization of h, v, and d is based on a fixed quantization that is very similar to the well-known Lloyd-Max quantization (which is similar to the k-means clustering method) (Kabal 1984; Patane and Russo 2001). Several quantizers can be constructed, where each of them consists of more than one value and is optimized for different image parts. In detail, the quantization process includes the selection of an appropriate quantizer for the h, v, and d values for each colour; then each difference is approximated by the quantization value to which it is most similar (Fig. 5.5a–c). To satisfy the bit limits, the following restrictions are applied: only eight quantizer sets are provided, and each consists of four positive/negative values; the trained quantizer values are shown in Table 5.2. To estimate the approximation error, let us note that, once the quantization is completed, it is possible to express the errors of the quantized differences and then estimate the pixel errors as follows:

$$
h' = h + \Delta h, \quad v' = v + \Delta v, \quad d' = d + \Delta d,
$$

$$
\Delta A = \frac{\Delta h + \Delta v + \Delta d}{4}, \quad \Delta B = \Delta A - \Delta h, \quad \Delta C = \Delta A - \Delta v, \quad \Delta D = \Delta A - \Delta d.
$$
Fig. 5.5 Selection of appropriate quantizers: (a) distributions for h, v, d, and reconstructed levels; (b) bit allocation for h, v, d components; (c) an example of quantization with quantizer set selection and optimal reconstruction levels

Table 5.2 Example of NSW quantizer set values
Quantizer set number: Id0, Id1, Id2, Id3
0: 220, 80, 80, 220
1: 23, 4, 4, 23
2: 200, 10, 10, 200
3: 80, 40, 40, 80
4: 138, 50, 50, 138
5: 63, 18, 18, 63
6: 42, 12, 12, 42
7: 104, 28, 28, 104
If the mean value is encoded with an error, this error should be added to the error estimate of pixel A; then, it is possible to aggregate the errors for all pixels using, for example, the sum of squared errors (SSE). Once the SSE is calculated for every quantizer set, the quantizer set which provides the minimal error is selected; other criteria can be applied to simplify the quantization. The encoding process explained above is named Fixed Quantization via Table (FQvT) and is described by the following:

$$
\arg\min_{QI = 0 \ldots QS} E\left(\{d_1, \ldots, d_K\}, \{Id^{QI}_1, \ldots, Id^{QI}_K\}, QI\right),
$$
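The sketch below illustrates FQvT for one 2 × 2 block of one colour channel: every candidate quantizer set is tried, each difference is mapped to its nearest reconstruction level, and the set with the minimal SSE (derived from the pixel-error equations above) is kept. The data structures and function names are assumptions.

```python
def quantize_fqvt(h, v, d, quantizer_sets):
    """Pick the quantizer set and reconstruction levels minimizing the block SSE.

    quantizer_sets: sequence of tuples of reconstruction levels (Id values) per set.
    Returns (set_index, (ih, iv, id_)), the chosen level for each difference.
    """
    def nearest(value, levels):
        return min(levels, key=lambda lv: abs(value - lv))

    def block_sse(dh, dv, dd):
        # pixel-domain errors implied by the difference errors (see the equations above)
        da = (dh + dv + dd) / 4.0
        return da ** 2 + (da - dh) ** 2 + (da - dv) ** 2 + (da - dd) ** 2

    best = None
    for qi, levels in enumerate(quantizer_sets):
        ih, iv, id_ = (nearest(x, levels) for x in (h, v, d))
        err = block_sse(ih - h, iv - v, id_ - d)
        if best is None or err < best[0]:
            best = (err, qi, (ih, iv, id_))
    return best[1], best[2]
```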
Table 5.3 Sub-modes in the P-method (NSW-P) and their encoding
Prefix: Sub-mode
00: Horizontal direction
01: Vertical direction
100: H value is close to 0
101: V value is close to 0
110: D value is close to 0
1110: Diagonal direction
1111: Uniform area
Fig. 5.6 Sub-modes in the P-method
where {d1, ..., dK} are the original differences, {Id1, ..., IdK} are the levels used to reconstruct the differences, and QI is a quantizer defined by a table; E stands for the error of the reconstructed differences relative to the original values.

Another method adopted by VLCCT is the P-method. It helps encode areas where a specific orientation of details exists; it might be vertical, horizontal, or diagonal. In addition, this method helps to encode non-texture areas (where the mean value is a good approximation) and areas where one of the NSW components is close to zero. The best sub-mode is signalled by a variable-length prefix according to Table 5.3.

According to Fig. 5.6, the h sub-mode means that each 2 × 2 block is encoded using only the two pixels A and C. In the same way, the v sub-mode uses A and B to approximate the remaining pixel values; the diagonal sub-mode encodes a block in a way similar to that of h and v but uses a diagonal pattern; in the uniform mode, the mean value is used to approximate the whole 2 × 2 block. If any one of the h, v, or d differences is close to zero, the corresponding sub-mode is invoked. In this case, the following is applied:

h ≈ 0: {v, d} → {Quantizer N, Iv, Id};
v ≈ 0: {h, d} → {Quantizer N, Ih, Id};
d ≈ 0: {h, v} → {Quantizer N, Ih, Iv}.

The quantization table for the non-zero differences is provided in Table 5.4a. The mean value (s-value) of the NSW transform is quantized using 6 bits via LSB truncation. However, sometimes a higher bit-depth precision is required; one of the quantizers (N = 1) is used as a means to efficiently signal such a mode, as in this case 1 extra bit might be saved. Table 5.4b shows that modification.
Table 5.4a Quantizers for the P-method
Quantizer N: Id0, Id1, Id2, Id3
0: 250, 240, 240, 250
1: 0, 0, 7, 7
2: 224, 9, 9, 224
3: 118, 66, 66, 118
4: 94, 19, 19, 94
5: 220, 125, 125, 220
6: 185, 52, 52, 185
7: 39, 12, 12, 39
Table 5.4b Modification of quantizer N, where N = 1
Quantizer N: Id0, Id1, Id2, Id3
1: 0, Prohibited value, 7, 7

Fig. 5.7 Encoding bit package for the sub-modes of the P-method: 3 bits (N), 2 bits (SMode), 1(2) bits (I1), 1(2) bits (I2), 6 bits + free bits (s)
The actual bit precision of the mean value is based on the analysis of "free bits". Depending on this, the mean value can be encoded using 6, 7, or 8 bits. The structure of the bits comprising the encoding information for this mode is shown in Fig. 5.7. The algorithm to determine the number of free bits is shown in Fig. 5.8.

The next method is the S-method. This method is adapted for low-light images and is derived from the N-method by means of changing the quantization tables and a special encoding of the s-value. In particular, it was found that the least significant 6 bits are enough for encoding the s-value of low-light areas, and the least significant bit of these is excluded, so only 5 bits are used. The modified quantization table is shown in Table 5.5.

The C-method is also based on the NSW transform applied to all colour channels simultaneously, followed by joint encoding of all three channels. The efficiency of this method is confirmed by the fact that the NSW values are correlated across the three colour channels. Due to sharing the syntax information between colour channels and removing the independent selection of quantizers for each colour (they are encoded with the same quantizer set), an increase of the quantizer's dynamic range (to eight levels instead of four) is enabled; see Table 5.6. The quantization process is like that of the P-method. The main change is that, for all differences h, v, d of the R, G, B colours, one quantizer subset is selected. Then each difference value is encoded using a 3-bit value; 9 difference values take 27 bits; the quantizer subset requires another 3 bits; thus, the differences take 30 bits. Every s-value (for R, G, and B) is encoded using 5 bits by truncation of the 3 LSBs, which are reconstructed by the binary value 100b. Every 2 × 2 colour block is encoded using
Fig. 5.8 The algorithm to determine the number of free bits for the P-method

Table 5.5 Quantizers for encoding the differential values of low-light areas
Quantizer N: Id0, Id1, Id2, Id3
0: 1, 2, 1, 2
1: 2, 6, 2, 6
2: 1, 4, 1, 4
3: 4, 10, 4, 10
4: 3, 8, 3, 8
5: 4, 20, 4, 20
6: 8, 40, 8, 40
7: 3, 15, 3, 15
45 bits; to provide optimal visual quality, the quantization error should take into account human visual system colour perception, which is translated into weights for every colour:
Table 5.6 Quantizer table for the C-method
Quantizer N: Id0, Id1, Id2, Id3, Id4, Id5, Id6, Id7
0: 128, 180, 213, 245, 128, 180, 213, 245
1: 3, 15, 30, 50, 3, 15, 30, 50
2: 20, 86, 160, 215, 20, 86, 160, 215
3: 50, 82, 104, 130, 50, 82, 104, 130
4: 90, 145, 194, 240, 90, 145, 194, 240
5: 9, 40, 73, 106, 9, 40, 73, 106
6: 163, 122, 66, 15, 15, 66, 122, 163
7: 180, 139, 63, 25, 25, 63, 139, 180
$$
\Delta A_C = \frac{\Delta h_C + \Delta v_C + \Delta d_C}{4}, \quad \Delta B_C = \Delta A_C - \Delta h_C, \quad \Delta C_C = \Delta A_C - \Delta v_C, \quad \Delta D_C = \Delta A_C - \Delta d_C,
$$

$$
E = \sum_{C = R, G, B} W_C \, \frac{(\Delta A_C)^2 + (\Delta B_C)^2 + (\Delta C_C)^2 + (\Delta D_C)^2}{4},
$$

where W_C are weights reflecting the colour perception of an observer for the colour channel C = R, G, B, and ΔhC, ΔvC, and ΔdC are the errors in approximating the h, v, and d differences. A simpler yet still efficient way to calculate the weighted encoding error is

$$
E' = \sum_{C = R, G, B} \max\left(|\Delta h_C|, |\Delta v_C|, |\Delta d_C|\right) W_C.
$$
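A small sketch of both error measures is given below; the per-channel weights in the comment (luma-like values) are purely illustrative, since the chapter does not list the actual W_C values.

```python
def weighted_error_exact(dh, dv, dd, weights):
    """E: weighted SSE over colour channels; dh, dv, dd are dicts of per-channel difference errors."""
    e = 0.0
    for c, w in weights.items():            # e.g. weights = {'R': 0.299, 'G': 0.587, 'B': 0.114} (illustrative)
        da = (dh[c] + dv[c] + dd[c]) / 4.0
        db = da - dh[c]
        dc = da - dv[c]
        de = da - dd[c]
        e += w * (da ** 2 + db ** 2 + dc ** 2 + de ** 2) / 4.0
    return e

def weighted_error_simple(dh, dv, dd, weights):
    """E': cheaper variant using the maximum absolute difference error per channel."""
    return sum(w * max(abs(dh[c]), abs(dv[c]), abs(dd[c])) for c, w in weights.items())
```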
By subjective visual testing, it was confirmed that this method provided good visual results, and it was finally adopted into the solution.

Let us review the E-method, which is intended for images with sharp colour edges and colour gradients, where other methods usually cause visible distortions. To deal with this problem, it was suggested to represent each 2 × 4 block as four small "stripes", i.e., 1 × 2 sub-blocks. Every 1 × 2 sub-block consists of three colour channels (Fig. 5.9); only one colour channel is defined as the dominant colour channel for such a sub-block; its values are encoded using 6 bits, while the remaining colour channels are encoded using average values only. By design, a dominant channel is determined and the pixel values for that channel are encoded using 6 bits. The remaining channels are encoded via the average of the two remaining values for every spatial position. In addition, quantization and clipping are applied. Firstly, the R, G, and B colour channels are analysed, and, if the conditions are not met, the YCbCr colour space is used: the luminance channel is considered as dominant, while Cb and Cr are encoded as the remaining channels. The key point
Fig. 5.9 Splitting of a 2 × 4 block into 1 × 2 colour sub-blocks (AB)
of this method is that the average value is calculated jointly for both remaining channels for every spatial position. Every 1 × 2 sub-block is extended with extra information indicating the dominant channel. The algorithm describing this method is shown in Fig. 5.10.

In addition to the methods described above, seven other methods are provided that encode 2 × 4 blocks without further splitting into smaller sub-blocks. In general, they are all intended for the cases explained above, but, due to their different mechanisms of representing data, they might be more efficient in specific cases.

The 2 × 4 D-method is targeted at diagonal-like image patches combined with natural parts (which means a transition region between natural and structured image parts). The sub-block with a regular diagonal structure is encoded according to a predefined template, while the remaining 2 × 2 sub-block is encoded by truncating the two least significant bits of every pixel of this 2 × 2 sub-block. The sub-block with a regular pattern is determined by calculating errors over special template locations; the block with the smallest error is then selected (Fig. 5.11). It is considered as the block with a regular structure; the remaining block is encoded in simple PCM mode with 2 LSB truncation. The template values mentioned above are calculated according to the following equations:

$$
C0(k) = \frac{R(0, 0+2k) + R(1, 1+2k) + G(0, 1+2k) + G(1, 0+2k) + B(0, 0+2k) + B(1, 1+2k)}{6},
$$

$$
C1(k) = \frac{R(0, 1+2k) + R(1, 0+2k) + G(0, 0+2k) + G(1, 1+2k) + B(0, 1+2k) + B(1, 0+2k)}{6},
$$

where C0 and C1 are the so-called template values, and k is the index of a sub-block: k = 0 means the left sub-block and k = 1 means the right sub-block. The approximation error for the k-th block is defined as follows:
Fig. 5.10 The algorithm of determining dominant channels and sub-block encoding (E-method)
Fig. 5.11 Sub-blocks positions and naming according to D-method
$$
\begin{aligned}
BlE(k) = {} & |R(0, 0+2k) - C0(k)| + |R(1, 1+2k) - C0(k)| \\
& + |G(0, 1+2k) - C0(k)| + |G(1, 0+2k) - C0(k)| \\
& + |B(0, 0+2k) - C0(k)| + |B(1, 1+2k) - C0(k)| \\
& + |R(0, 1+2k) - C1(k)| + |R(1, 0+2k) - C1(k)| \\
& + |G(0, 0+2k) - C1(k)| + |G(1, 1+2k) - C1(k)| \\
& + |B(0, 1+2k) - C1(k)| + |B(1, 0+2k) - C1(k)|.
\end{aligned}
$$

After determining which sub-block is better approximated by the template pattern, its index is placed into the bitstream along with the template values and the PCM values. This is shown in Fig. 5.12 below.
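A short sketch of the template-value computation and sub-block selection follows; the (row, column) indexing convention and the function names are assumptions.

```python
def dmethod_templates(r, g, b, k):
    """Template values C0(k), C1(k) and the approximation error BlE(k) for sub-block k (0 = left, 1 = right).

    r, g, b: 2 x 4 arrays (rows x columns) of the colour channels of the block.
    """
    j = 2 * k
    c0 = (r[0][j] + r[1][j + 1] + g[0][j + 1] + g[1][j] + b[0][j] + b[1][j + 1]) / 6.0
    c1 = (r[0][j + 1] + r[1][j] + g[0][j] + g[1][j + 1] + b[0][j + 1] + b[1][j]) / 6.0
    ble = (abs(r[0][j] - c0) + abs(r[1][j + 1] - c0) + abs(g[0][j + 1] - c0) +
           abs(g[1][j] - c0) + abs(b[0][j] - c0) + abs(b[1][j + 1] - c0) +
           abs(r[0][j + 1] - c1) + abs(r[1][j] - c1) + abs(g[0][j] - c1) +
           abs(g[1][j + 1] - c1) + abs(b[0][j + 1] - c1) + abs(b[1][j] - c1))
    return c0, c1, ble

def choose_pattern_subblock(r, g, b):
    """The sub-block with the smaller BlE is encoded via the template; the other one with PCM."""
    return min((0, 1), key=lambda k: dmethod_templates(r, g, b, k)[2])
```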
Fig. 5.12 Algorithm realizing the D-method: detect the spatial 2 × 2 sub-block N that can be encoded via a pattern and pattern values; encode sub-block N with the pattern and pattern values; encode the other 2 × 2 sub-block with PCM; encode the index of the pattern-coded sub-block
Fig. 5.13 (a) Reference points and difference directions: the pixels of the 2 × 4 block are denoted A1–A4 (top row) and B1–B4 (bottom row), with horizontal difference directions between neighbouring pixels. (b) The algorithm for the F-method: calculate D1 = A1 − A2, D2 = A2 − A3, D3 = A3 − A4, D4 = B1 − B2, D5 = B2 − B3, D6 = B3 − B4; calculate D_Average = (|D1| + |D2| + |D3| + |D4| + |D5| + |D6|)/6; calculate Sign(D1), ..., Sign(D6); transmit the corresponding bits: D_Average with 8 bits and each sign with 1 bit
The next 2 × 4 method is the so-called F-method, which is suitable for gradient structures and is applied to each channel independently. To keep the target bitrate and a high accuracy of gradient transitions, reference pixel values, the mean value of the horizontal inter-pixel differences, and the signs of the differences are used. Figure 5.13a shows the reference values and the directions of the horizontal differences, and the block diagram of the F-method is shown in Fig. 5.13b.

Now, let us consider the M-method for a 2 × 4 block, which includes three modes. The first mode provides uniform quantization with different precision for all pixels of the underlying 2 × 4 block: every pixel colour value is encoded using 3 or 4 bits; these 3- and 4-bit codes are calculated according to the following equations:

$$
C_{R,i,j} = \begin{cases} I_{R,i,j} / 16, & \text{if } (i, j) \in \{(0,0), (0,2), (0,3), (1,0), (1,1), (1,3)\} \\ I_{R,i,j} / 32, & \text{else} \end{cases}
$$

$$
C_{G,i,j} = I_{G,i,j} / 16, \quad \forall (i, j)
$$

$$
C_{B,i,j} = \begin{cases} I_{B,i,j} / 16, & \text{if } (i, j) \in \{(0,1), (0,3), (1,2)\} \\ I_{B,i,j} / 32, & \text{else} \end{cases}
$$

where I_R, I_G, and I_B stand for the input pixel colour values, and the output values after quantization are denoted C_R, C_G, C_B. Reconstruction is performed according to the following equations (∨ denotes bitwise OR):

$$
D_{R,i,j} = \begin{cases} 16\, C_{R,i,j} \vee 8, & \text{if } (i, j) \in \{(0,0), (0,2), (0,3), (1,0), (1,1), (1,3)\} \\ 32\, C_{R,i,j} \vee 16, & \text{else} \end{cases}
$$

$$
D_{G,i,j} = 16\, C_{G,i,j} \vee 8, \quad \forall (i, j)
$$

$$
D_{B,i,j} = \begin{cases} 16\, C_{B,i,j} \vee 8, & \text{if } (i, j) \in \{(0,1), (0,3), (1,2)\} \\ 32\, C_{B,i,j} \vee 16, & \text{else} \end{cases}
$$
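The round trip of this first M-method mode can be sketched as follows, assuming the midpoint reconstruction (bitwise OR with 8 or 16) read from the equations above; the position sets R4 and B4 follow the equations, while the dictionary-based block representation is an illustrative choice.

```python
R4 = {(0, 0), (0, 2), (0, 3), (1, 0), (1, 1), (1, 3)}   # red positions kept at 4-bit precision
B4 = {(0, 1), (0, 3), (1, 2)}                            # blue positions kept at 4-bit precision

def m_mode1_encode(block):
    """block maps (channel, i, j) -> 8-bit value, channel in 'RGB', i in 0..1, j in 0..3."""
    codes = {}
    for (c, i, j), val in block.items():
        four_bit = (c == 'G') or (c == 'R' and (i, j) in R4) or (c == 'B' and (i, j) in B4)
        codes[(c, i, j)] = val // 16 if four_bit else val // 32
    return codes

def m_mode1_decode(codes):
    """Reconstruct each pixel at the midpoint of its quantization interval."""
    out = {}
    for (c, i, j), code in codes.items():
        four_bit = (c == 'G') or (c == 'R' and (i, j) in R4) or (c == 'B' and (i, j) in B4)
        out[(c, i, j)] = (code * 16) | 8 if four_bit else (code * 32) | 16
    return out
```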
The second mode is based on averaging combined with LSB truncation for 2 2 sub-blocks: I R,i,j þ I R,i,jþ2 , i ¼ 0, 1:j ¼ 0; 1, 4 I G,i,j þ I G,i,jþ2 CG,i,j ¼ , i ¼ 0, 1:j ¼ 0; 1, 2 I B,i,j þ I B,i,jþ2 CB,i,j ¼ , i ¼ 0, 1:j ¼ 0; 1: 4 CR,i,j ¼
And decoding can be done according to

D_R,i,j = D_R,i,j+2 = 2·C_R,i,j, i = 0, 1; j = 0, 1,
D_G,i,j = D_G,i,j+2 = C_G,i,j, i = 0, 1; j = 0, 1,
D_B,i,j = D_B,i,j+2 = 2·C_B,i,j, i = 0, 1; j = 0, 1.

The third mode of the M-method is based on partial reconstruction of one of the colour channels using the two remaining colours; these two colour channels are encoded using bit truncation according to the following:
C_R,i,j = I_R,i,j / 8 for all (i, j) ∈ {(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3)},
C_G,i,j = I_G,i,j / 8 for all (i, j) ∈ {(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3)}.

The two boundary pixels of the blue channel are encoded similarly:

C_B,1,0 = I_B,1,0 / 8, C_B,0,3 = I_B,0,3 / 16.

Decoding (reconstruction) is done according to these equations:

D_R,i,j = (8·C_R,i,j) | 4 for all (i, j) ∈ {(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3)},
D_G,i,j = (8·C_G,i,j) | 4 for all (i, j) ∈ {(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3)},
D_B,0,0 = (8·C_R,1,1) | 4; D_B,0,1 = (8·C_R,1,2) | 4; D_B,0,2 = (8·C_R,1,3) | 4,
D_B,1,1 = (8·C_G,0,0) | 4; D_B,1,2 = (8·C_G,0,1) | 4; D_B,1,3 = (8·C_G,0,2) | 4,
D_B,1,0 = (8·C_B,1,0) | 4; D_B,0,3 = (16·C_B,0,3) | 8.
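For clarity, here is a minimal sketch of the second (averaging) M-method mode and its reconstruction. It is written in plain Python, assumes 8-bit inputs and uses integer division in place of the bit truncation, so it is an illustration rather than the exact hardware arithmetic.

```python
def m_mode2_encode(block):
    """block[i][j][c]: 2x4x3 integers (rows x cols x RGB). Returns 2x2x3 codes."""
    codes = [[[0, 0, 0] for _ in range(2)] for _ in range(2)]
    for i in range(2):
        for j in range(2):
            r = block[i][j][0] + block[i][j + 2][0]
            g = block[i][j][1] + block[i][j + 2][1]
            b = block[i][j][2] + block[i][j + 2][2]
            codes[i][j] = [r // 4, g // 2, b // 4]   # green keeps one extra bit of precision
    return codes

def m_mode2_decode(codes):
    out = [[[0, 0, 0] for _ in range(4)] for _ in range(2)]
    for i in range(2):
        for j in range(2):
            r, g, b = codes[i][j]
            out[i][j] = out[i][j + 2] = [2 * r, g, 2 * b]   # both pixels of the pair get the average
    return out
```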
The best encoding mode is determined according to the minimal reconstruction error, and its number is signalled explicitly in the bitstream. In the case where one colour channel strongly dominates over the others, the 2×4 method denoted as the O-method is applied. Four cases are supported: luminance, red, green, and blue colours, which are signalled by a 2-bit index in the bitstream. All the pixel values of the dominant colour channel are encoded using PCM without any distortions (which means 8 bits per pixel value), and the remaining colours are approximated by the mean value over all pixels comprising the 2×4 block, encoded using an 8-bit value. This is explained in Fig. 5.14. This method is like the E-method but gives a different balance between precision and locality, although both methods use the idea of dominant colours. The 2×4 method denoted as the L-method is applied to all colour channels, but each channel is processed independently. It is based on construction of a colour palette. For every 2×2 sub-block a palette of two colours is defined; then three modes are provided to encode the palettes: differential encoding of the palettes, differential encoding of the colours for every palette, and explicit PCM for colours through quantization. The differential colour encoding of palettes enhances the accuracy of colour encoding in some cases; in addition, in some cases the palette colours might coincide due to calculations, which is why extra palette processing is provided, which increases the chance of differential encoding being used. If the so-called InterCondition is true, then the first palette colours are encoded without any changes (8 bits per colour), while the colours of the second palette are encoded as a difference relative to the colours of the first palette:
Fig. 5.14 The algorithm for the O-method

Table 5.7 CDVal distribution over TIdx_x, TIdx_y

CDVal        TIdx_y = 0   TIdx_y = 1   TIdx_y = 2   TIdx_y = 3
TIdx_x = 0        0            4            8           12
TIdx_x = 1        1            5            9           13
TIdx_x = 2        2            6           10           14
TIdx_x = 3        3            7           11           15
InterCondition = (−2 ≤ C00 − C10 < 2) ∧ (−1 ≤ C01 − C11 ≤ 2).

To manage this, two indexes are calculated:

TIdx_y = C00 − C10 + 2 (range is [0...3]),
TIdx_x = C01 − C11 + 1 (range is [0...3]).

These indexes are transformed into 4-bit values, called cumulative difference values (CDVal), which encode the joint distribution of TIdx_y and TIdx_x (Table 5.7):
This table helps in the encoding of the joint distribution for better localization and the encoding of rare/common combinations of indexes. Another table, Table 5.8, is provided to decode the indexes according to CDVal. Using the decoded indexes, it is possible to reconstruct the colour differences (and hence the colours encoded differentially):

C00 − C10 = TIdx_y − 2 (a new range is [−2..2]),
C01 − C11 = TIdx_x − 1 (a new range is [−1..2]).

Extra palette processing is provided to increase the chance of using the differential mode. It consists of the following:
• Detection of the situation when both colours are equal; setting flag FP to 1 in case of equal colours; this is done for each palette individually.
• Checking if FP1 + FP2 == 1.
• Suppose the colours of the left palette are equal. Then, a new colour from the second palette should be inserted into the first palette. This colour must be as far from the colours of the first palette as possible (this will extend the variability of the palette): the special condition is checked and a new colour from the second palette is inserted: if |C00 − C11| < |C00 − C10|, then C00 = C10, else C01 = C11.
It must be mentioned that if such a colour substitution occurs, it should be tracked appropriately to reverse it if differential encoding is not then invoked. Finally, a colour map is generated. If inter-palette differential encoding is not applied, two remaining modes exist: intra-palette differential encoding and colour quantization. Intra-palette mode works in such a way that the very first colour corresponds to the left-top pixel of a 2×2 sub-block. Then, referring to this colour, the differential condition is checked for every sub-block independently. If it is true, appropriate signalling is enabled, and the second colour is encoded as a 2-bit correction to the first colour, which is encoded without errors. If the intra-palette condition is false, then every colour is encoded by means of truncation of three LSBs. This is executed for every 2×2 sub-block. A colour map is created where every pixel has its own colour index; two sub-modes are provided: three entries per 2×2 sub-block and four entries per 2×2 sub-block. Each entry in the colour map is a bit index of a colour to be used to encode the current pixel. The best colour for each pixel is determined by means of finding a minimum absolute difference between a colour and a pixel value, as shown in Fig. 5.15. In addition, there is a method which is effective for regions where a combination of complex and relatively uniform areas exists either within one colour channel or within an underlying block. This method is named the U-method, with two sub-methods: U1 and U2. The best sub-method is picked according to the minimal error of approximating the original image block.
Table 5.8 Correspondence between CDVal and actual TIdx_x/y values

CDVal:                      0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
TIdx_y − 2 (= C00 − C10):  −2  −2  −2  −2  −1  −1  −1  −1   0   0   0   0   1   1   1   1
TIdx_x − 1 (= C01 − C11):  −1   0   1   2  −1   0   1   2  −1   0   1   2  −1   0   1   2
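The index packing of Tables 5.7 and 5.8 can be summarized by a short sketch (hypothetical Python; the function names and the convention that the first index of Cpq denotes the palette are our assumptions):

```python
def pack_cdval(c00, c01, c10, c11):
    """Return CDVal for the two palette-colour differences, or None if InterCondition fails."""
    dy = c00 - c10            # colour 0: first palette minus second palette
    dx = c01 - c11            # colour 1: first palette minus second palette
    if not (-2 <= dy < 2 and -1 <= dx <= 2):   # InterCondition
        return None
    return 4 * (dy + 2) + (dx + 1)             # CDVal = 4*TIdx_y + TIdx_x (Table 5.7)

def unpack_cdval(cdval):
    """Inverse mapping (Table 5.8): recover (C00 - C10, C01 - C11)."""
    tidx_y, tidx_x = divmod(cdval, 4)
    return tidx_y - 2, tidx_x - 1
```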
Fig. 5.15 Algorithm of the 2×4 L-method (palette method)
Sub-method U1 is applied for a 2×2 sub-block which is part of the 2×4 block and is especially effective if one of the colour parts of this sub-block can be effectively represented by the mean value, while the remaining colours are encoded with NSW transform and FQvT quantization. To estimate the colour values, different methods can be used. In particular, the least squares method can be applied. VLCCT adopted a simple yet effective method which is shown in Fig. 5.16. First, for every colour channel, the mean value is evaluated (over the 2×2 sub-block to be considered); then the SAD between all pixel values comprising this
Fig. 5.16 Fast algorithm to estimate colours for palette encoding
sub-block and the mean value is calculated. For example, for the red colour channel SAD is calculated as follows:

SAD_R = |A_R − m_R| + |B_R − m_R| + |C_R − m_R| + |D_R − m_R|.

Then, the colour channel which can be effectively encoded with the mean value is determined: every SAD is compared with Threshold1 in a well-defined order; the colour channel which has the minimal SAD (in accordance with the order shown in Fig. 5.17) is determined. In addition, the mean value is analysed to decide if it is small enough (compared with Threshold2, which is set to 32); it might be signalled in
Fig. 5.17 Algorithm of sub-method U1
the bitstream that the mean (average) value is small: a 2-bit prefix is used to signal which colour channel is picked, and the value 3 is reserved for this case. In this case, a 2-bit extra colour index is also added to notify the decoder which colour channel is to be encoded with the mean value using 3 bits, truncating the remaining two (as the mean value takes 5 bits at most). Otherwise, if the mean value is greater than Threshold2, 6 bits are used to encode it; again, 2 LSB bits are truncated. The remaining two colours ("active colours") within the underlying 2×2 sub-block are encoded using NSW + FQvT: each active colour is transformed to the set of values {s1, h1, v1, d1}, {s2, h2, v2, d2}. {h1, v1, d1} and {h2, v2, d2} are encoded according to the FQvT procedure, using quantization Table 5.9.
Table 5.9 Quantization table for difference values of sub-method U1

Quantizer set:   0    1    2    3    4    5    6    7
Id0:             0   14    6   56   70    0   53   16
Id1:            18   60   38   77  107   17  125   65
Id2:            34  115   62  100  142   32  196  100
Id3:            45  160   95  112  160  130  232  120
Id4:             0   14    6   56   70    0   53   16
Id5:            18   60   38   77  107   17  125   65
Id6:            34  115   62  100  142   32  196  100
Id7:            45  160   95  112  160  130  232  120
Every set of differential values is encoded using its own quantizer set. The s1 and s2 values are encoded in one of two modes:
1. 6 bits per s-value. It is used if the mean value of the uniform channel is small and the number of bits is enough to reserve 12 bits for the s-values.
2. Differential mode is used if the 6-bit mode cannot be used; the following is provided in this mode:
• An error caused by encoding of s1, s2 into 5 bits each (via LSB truncation) is evaluated:
ErrQuant = |s1 − ((s1 & 0xF8) | 0x04)| + |s2 − ((s2 & 0xF8) | 0x04)|;
• An error caused by encoding of s1, s2 as 6-bit/6-bit values followed by the signed 3-bit difference value between s2 and s1:
temp_s1 = s1 & 0xFC; temp_s2 = s2 & 0xFC; temp_s2 = temp_s2 − temp_s1; clamp(temp_s2, −31, 31); temp_s2 = temp_s2 & 0xFC; temp_s2 = temp_s2 + temp_s1; ErrDifQuant = |s1 − temp_s1| + |s2 − temp_s2|;
• If ErrQuant < ErrDifQuant, s1 and s2 are encoded using 5 bits per s-value; otherwise, 6 bits are used to encode s1 and 4 bits for the 3-bit signed difference.
In terms of bitstream structure, the diagram in Fig. 5.18 shows the bits distribution. According to this diagram, the U1 sub-method spends 43 bits per 2×2 sub-block. The second sub-method of the U-method is called U2. Every 2×4 colour block is processed as three independent 2×4 single-channel blocks. Every independent 2×4 block is further represented as two adjacent 2×2 sub-blocks. For each 2×2 sub-block, approximation by a mean value is estimated. Then, the sub-block which is approximated by a mean value with the smallest error is determined. The other sub-block is encoded with NSW and FQvT (see quantization Table 5.10).
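A sketch of the s-value mode decision is given below (hypothetical Python; clamp() and the exact rounding of the signed difference are our simplifications of the bit-level steps listed above):

```python
def clamp(v, lo, hi):
    return max(lo, min(hi, v))

def choose_s_mode(s1, s2):
    """Pick between 5-bit/5-bit coding of s1, s2 and 6-bit s1 plus a signed 3-bit difference."""
    # error of 5-bit coding: three LSBs truncated, midpoint 0x04 restored on decoding
    err_quant = abs(s1 - ((s1 & 0xF8) | 0x04)) + abs(s2 - ((s2 & 0xF8) | 0x04))

    # error of 6-bit s1 plus a clamped, coarsely quantized signed difference for s2
    t1, t2 = s1 & 0xFC, s2 & 0xFC
    d = clamp(t2 - t1, -31, 31)
    d = (abs(d) & 0xFC) * (1 if d >= 0 else -1)   # keep sign + 3-bit magnitude (step of 4)
    err_dif = abs(s1 - t1) + abs(s2 - (t1 + d))

    return ('5bit+5bit', err_quant) if err_quant < err_dif else ('6bit+diff', err_dif)
```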
(Fig. 5.18 fields: a 2-bit colour index (values 0/1/2, or 3 for the small-average case); a 6-bit or 3-bit average of the smooth colour; a 2-bit smooth colour channel number; a 1-bit SDif mode flag; and the s-values as either 6-bit S1 + 1-bit sign(S1 − S2) + 3-bit |S1 − S2|, or 5-bit S1 + 5-bit S2, or 6-bit S1 + 6-bit S2.)
Fig. 5.18 Diagram of bits distribution of sub-method U1

Table 5.10 Quantization table for difference values of sub-method U2

Quantizer set:   0    1    2    3    4    5    6    7
Id0:            60    2   10   44  100    1    8   60
Id1:           141   16   76   80  150    2   38  110
Id2:           178   28  150  108  201    3   65  160
Id3:           233   50  190  129  227    5   88  170
Id4:            60    2   10   44  100    1    8   60
Id5:           141   16   76   80  150    2   38  110
Id6:           178   28  150  108  201    3   65  160
Id7:           233   50  190  129  227    5   88  170
Every difference value {h, v, d} is encoded using the 3-bit quantizer index and 3-bit quantizer set number. The s-value is encoded using a 7-bit value by truncating the LSB. A 1-bit value is also used to signal which block is encoded as uniform. This sub-method spends 28 bits per independent 2×4 block. The last method is based on the Hadamard transform (Woo and Won 1999) and is called the H-method. In this method the underlying 2×4 block is considered as a set of two 2×2 blocks and the H-transform is applied to every block as follows:

S = (A + B + C + D)/4,   dH = (A + C − B − D)/4,   dV = (A + B − C − D)/4,   dD = (A − B − C + D)/4;

A = S + dD + dH + dV,   B = S + dV − dH − dD,   C = S + dH − dV − dD,   D = S + dD − dH − dV.
Table 5.11 Quantizer values adopted for H-transform

Quantizer set:   0    1    2    3    4    5    6    7
Id0:             6   20    1    3   18    6    8    9
Id1:            47   30   20   10   40   16   40   30
Id2:            86   41   38   16   64   26   73   52
Id3:           106   47   48   20   75   32   89   63
Id4:             6   20    1    3   18    6    8    9
Id5:            47   30   20   10   40   16   40   30
Id6:            86   41   38   16   64   26   73   52
Id7:           106   47   48   20   75   32   89   63
The dD value is set to zero as it is less important for the human visual system; the remaining components S, dH, and dV are encoded as follows:
• The S-value is encoded using a 6-bit value via 2-LSB truncation; reconstruction of these 2 LSBs is done with a fixed value of 10 (binary), i.e. 2.
• dH and dV are encoded using FQvT; Table 5.11 describes the quantizer indexes for the eight quantizer sets adopted for the H-transform.
Every 2×2 sub-block within every colour channel is encoded using 6 bits for the s-value, one 3-bit quantizer set index shared between dV and dH, and a 3-bit value for each difference value. This approach is similar to methods described before, for example, the U-method; the difference in encoding is in the shared quantizer set index, which saves bits and relies on correlation between the dV and dH values.
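A compact sketch of the H-transform round trip is shown below (hypothetical Python; the names are ours, and the FQvT quantization of dH/dV via Table 5.11 is left out for brevity):

```python
def h_forward(a, b, c, d):
    s  = (a + b + c + d) / 4.0
    dh = (a + c - b - d) / 4.0
    dv = (a + b - c - d) / 4.0
    dd = (a - b - c + d) / 4.0
    return s, dh, dv, dd

def h_inverse(s, dh, dv, dd=0.0):
    """The decoder reconstructs with dD forced to zero."""
    a = s + dd + dh + dv
    b = s + dv - dh - dd
    c = s + dh - dv - dd
    d = s + dd - dh - dv
    return a, b, c, d

def code_s(s):
    """S kept as 6 bits (2 LSBs truncated); decoding restores the LSBs as binary 10, i.e. 2."""
    return int(s) >> 2

def decode_s(code):
    return (code << 2) | 0b10
```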
5.4
How to Pick the Right Method
The key idea is to pick a method that provides the minimum error; in general, the more closely the error measure matches how a user ranks methods according to their visual quality, the better the final quality is. However, complexity is another limitation which bounds the final efficiency and restricts the approaches that can be applied in VLCCT. According to a state-of-the-art review and experimental data, two approaches were adopted in VLCCT. The first approach is the weighted mean square error, which is defined as follows:

WMSE = Σ_{C=R,G,B} Σ_{i=1}^{2} [ (A_i − Â_i)² + (B_i − B̂_i)² + (C_i − Ĉ_i)² + (D_i − D̂_i)² ] · W_C,

where A, B, C, and D are pixel values (the hat denotes the reconstructed values and the index i runs over the two 2×2 sub-blocks) and W_C are weights dependent on the colour channel. According to experiments conducted, the following weights are adopted:
W_C value:   Red 6   Green 16   Blue 4
Another approach adopted in VLCCT is the weighted max-max criterion. It includes the following steps:
• Evaluate the weighted squared maximum difference between the original and encoded blocks:

MaxSq_C = MAX[ (A_C − Â_C)², (B_C − B̂_C)², (C_C − Ĉ_C)², (D_C − D̂_C)² ] · W_C;

• Then, calculate the sum of MaxSq and the maximum over all colour channels:

SMax = Σ_{C=R,G,B} MaxSq_C,
MMax = MAX[ MaxSq_R, MaxSq_G, MaxSq_B ];

• Then, the following aggregate value is calculated:

WSMMC = SMax + MMax.

To calculate WSMMC for the 2×4 block, the WSMMC values for both 2×2 sub-blocks are added. This method is simpler than WMSE but was still shown to be effective. In general, to determine the best encoding method, all feasible combinations of methods are checked, errors are estimated, and the one with the smallest error is selected.
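The two criteria can be written down directly; the sketch below (hypothetical Python/NumPy, per 2×2 sub-block, with the channel weights from the table above) is an illustration rather than the production implementation:

```python
import numpy as np

W = {'R': 6, 'G': 16, 'B': 4}

def wmse_subblock(orig, rec):
    """orig, rec: 2x2x3 integer arrays (pixels A, B, C, D for channels R, G, B)."""
    total = 0.0
    for c, name in enumerate('RGB'):
        d = orig[..., c].astype(float) - rec[..., c].astype(float)
        total += W[name] * np.sum(d * d)
    return total

def wsmmc_subblock(orig, rec):
    max_sq = []
    for c, name in enumerate('RGB'):
        d = orig[..., c].astype(float) - rec[..., c].astype(float)
        max_sq.append(W[name] * np.max(d * d))
    return sum(max_sq) + max(max_sq)   # SMax + MMax

# For a full 2x4 block, the values of the two 2x2 sub-blocks are summed, as in the text.
```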
5.5
Bitstream Syntax and Signalling
As explained at the beginning of the chapter, each 2×4 block, whose pixels are represented with 10 bits per colour, occupies 100 bits after compression. It consists of a prefix, called VLEH, compressed bits (syntax elements according to the semantics of the methods explained above), and padding (if needed). To provide visually lossless compression, VLEH and combinations of different VLCCT methods were optimized. The final list of VLCCT methods along with the syntax elements is summarized in Table 5.12. Every prefix (VLEH) may take from 1 to 10 bits. Each row in the table has the name of the encoding method for an underlying block. The notation is as follows: if the name consists of two letters, then the first letter corresponds to the method applied for the left sub-block and the second letter to the method applied for the right sub-block.
Table 5.12 Summary of methods, prefixes, bits costs in VLCCT

Method name for 2×4 block   Number of compressed bits   LSB bits   Padding size
NN                                     90                  3            0
NP                                     93                  3            0
NS                                     87                  6            0
NC                                     90                  3            0
NE                                     85                  6            0
PN                                     93                  3            0
PP                                     96                  3            0
PS                                     90                  6            0
PC                                     93                  3            0
PE                                     88                  4            0
SN                                     87                  6            0
SP                                     90                  6            0
SS                                     84                  6            1
SC                                     87                  6            0
SE                                     82                  6            3
CN                                     90                  3            0
CP                                     93                  3            0
CS                                     87                  6            0
CC                                     90                  3            0
CE                                     85                  6            0
EN                                     85                  6            0
EP                                     88                  4            0
ES                                     82                  6            3
EC                                     85                  6            0
EE                                     80                  6            4
D                                      90                  3            0
O                                      82                  6            2
M                                      90                  3            0
F                                      90                  3            0
L                                      87                  5            0
U                                      87                  5            0
H                                      90                  3            0
VLCCT provided a simple way to represent the least significant 2 bits of the 2×4 colour block. To keep a reasonable trade-off between bits costs, quality, and complexity, several methods to encode LSB have been proposed:
• 3-bit encoding – the mean value of the LSB bits for every colour is calculated over the whole 2×4 block, followed by quantization:

mLSB_c = ( Σ_{y=0}^{1} Σ_{x=0}^{3} [I_c(x, y) & 3] ) / 16,
where I_c(x, y) is the input 10-bit pixel value in the colour channel c at the position (x, y). mLSB_c is encoded using 1 bit for each colour.
• 4-bit encoding – like the 3-bit encoding approach, but every 2×2 sub-block is considered independently for the green colour channel (which reflects the higher sensitivity of the human visual system to green):

mLSB_G_Left = ( Σ_{y=0}^{1} Σ_{x=0}^{1} [I_G(x, y) & 3] ) / 8,
mLSB_G_Right = ( Σ_{y=0}^{1} Σ_{x=2}^{3} [I_G(x, y) & 3] ) / 8;
• 5-bit encoding – like 4-bit encoding, but the red channel is also encoded using splitting into left/right sub-blocks; thus, the green and red colours have higher precision for LSB encoding.
• 6-bit encoding – every colour channel is encoded using 2 bits, in the same way as the green channel is encoded in the 4-bit encoding approach.
Reconstruction is done according to the following rule (applicable for every channel and every sub-block or block):

Middle_Value = 2 if mLSB == 1, and 0 if mLSB == 0.
This middle value is assigned to 2 LSB bits of the corresponding block/sub-block for every processed colour.
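The LSB handling can be illustrated with a short sketch (hypothetical Python/NumPy). Only the 3-bit mode is shown, and the 0.75 decision threshold is our assumption, since the text only states that the mean is "followed by quantization".

```python
import numpy as np

def encode_lsb_3bit(block10):
    """block10: 2x4x3 integer array of 10-bit pixel values. Returns one bit per colour channel."""
    lsb = block10 & 3                               # the two least significant bits of every pixel
    m = lsb.reshape(-1, 3).sum(axis=0) / 16.0       # mLSB_c for c = R, G, B
    return (m >= 0.75).astype(int)                  # assumed threshold for the 1-bit quantization

def middle_value(flag):
    """Value written back into the 2 LSBs on reconstruction."""
    return np.where(flag == 1, 2, 0)
```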
5.6
Complexity Analysis
Hardware complexity was a critical limitation for VLCCT technology. As explained at the beginning, this limitation was the main reason for focusing this research on simple compression methods that worked locally and had potential for hardware (HW) implementation. According to the description given so far, VLCCT consists of a plurality of compression methods; they are structured as shown in Fig. 5.19 (encoder part). The methods are denoted by English capital letters according to the names given above. The following ideas became the basis for the complexity analysis and optimization:
(Fig. 5.19 structure: the Left and Right 2×2 sub-block branches each evaluate the methods N, P, S, C and E; the combined Left + Right branch evaluates the 2×4 methods F, O, D, L, U, H and M; each branch performs error calculation and optimal method selection, followed by minimum error detection and method selection, LSB encoding and output bitstream generation.)
Fig. 5.19 Detailed structure of VLCCT encoder
• Each operation or memory cell takes a required predefined number of physical gates in a chip.
• For effective use of the hardware and to keep constrained latency, pipelining and concurrent execution should be enabled.
• Algorithmic optimizations must be applied where possible.
The initial HW complexity was based on estimating elementary operations for almost every module (method). The elementary operations are the following:
1. Addition/subtraction (8-bit signed/unsigned operation), denoted as (+).
2. Bit-shift without carry bit.

… > 0.2 ∧ Y > 110, where RGMAX is the maximum of the green channel G and the red channel R, RGMIN is the minimum of both of these values, and S is the colour saturation. The final form of the formula for detecting green pixels then becomes Gr = Gr0 ∧ (SRGB > 80) ∧ (R + B < 3/2·G ∨ R + B < 255) ∧ (−35 < R − B < 35) ∧ (Y > 80) ∧ (Y < Ye). Finding the proportion of green pixels is far from sufficient to solve the problem. Some sporting events contain a fairly small number of green pixels, and even a human observer may make a mistake in the absence of accompanying text or audio commentary. A condition is therefore applied in which a frame with zero green pixels belongs to class C4. The classification of other types of scenes is based on the following observations. Sporting scenes, and especially those in football, are characterised by the presence of green pixels with high saturation. Moreover, the green pixels making up the image of the pitch have low values for the blue colour channel and relatively high brightness and saturation. The range of variation in the brightness of green pixels is not wide, and if high values of the brightness gradient are seen within green areas, it is likely that these frames correspond to images of natural landscapes rather than a soccer pitch. In sports games, bright and saturated spots usually correspond to the players’ uniforms, while white areas correspond to the markings on the field and the players’ numbers. In close-up shots of soccer players, a small number of green pixels are usually present in each frame. These observations can be described by the following empirical relation:

W = (SRGB > 384) ∧ (max(R, G, B) − min(R, G, B) < 30),

where SRGB = R + G + B. The detection of bright and saturated colours is formalised as follows:

Bs = (max(R, G, B) > 150) ∧ (max(R, G, B) − min(R, G, B) ≥ max(R, G, B)/2).

The skin tone detector is borrowed from the literature (Gomez et al. 2002): Sk ¼ ðG 6¼ 0Þ ^ B G þ
G 267R 83 SRGB 83 SRGB ^ SRGB > 7 ^ B ^G : 2 2 28 28
The classification rules are based on the following cross-domain features (Fig. 7.2): proportion of green pixels, F1; proportion of skin tone pixels, F2; average brightness of all pixels, F3; average gradient of green pixels, F4; proportion of bright and saturated pixels, F5; mean saturation of green pixels, F6; proportion of white
Fig. 7.2 Calculation of cross-domain features
pixels, F7; average brightness of green pixels, F8; mean value of the blue colour channel for green pixels, F9; and compactness of the brightness histogram for green pixels, F10. The proportion of green pixels is calculated as follows:

F1 = (1 / (w·h)) · Σ_{i=1..w, j=1..h} δ(Gr(i, j)),

where w is the frame width, h is the frame height, i, j are pixel coordinates, Gr(i, j) is the result generated by the green pixel detector, and δ is a function that converts a logical value to a real one, i.e. δ(x) = 0 if ¬x and δ(x) = 1 if x. The proportion of skin tone pixels is calculated as follows:

F2 = (1 / (w·h)) · Σ_{i=1..w, j=1..h} δ(Sk(i, j)),

where Sk(i, j) is the result of the skin tone pixel detector. The average gradient for the green pixels is calculated as follows:

F4 = Σ_{i=1..w, j=1..h} |DY(i, j)| · δ(Gr(i, j)) / Σ_{i=1..w, j=1..h} δ(Gr(i, j)),

where the horizontal derivative of the brightness DY is obtained by convolution of the luminance component Y with a linear filter Kgrad = [0 0 0 1 0 0 −1]. The proportion of bright and saturated pixels is calculated as follows:

F5 = (1 / (w·h)) · Σ_{i=1..w, j=1..h} δ(Bs(i, j)),
where Bs(i, j) is the detection result of bright and saturated pixels. The mean saturation of the green pixels is calculated using the formula:

F6 = Σ_{i=1..w, j=1..h} S(i, j) · δ(Gr(i, j)) / Σ_{i=1..w, j=1..h} δ(Gr(i, j)),

where S(i, j) is the saturation. The proportion of white pixels is estimated as:

F7 = (1 / (w·h)) · Σ_{i=1..w, j=1..h} δ(W(i, j)),

where W(i, j) is the detection result of the white pixels. The average brightness of the green pixels is derived, in turn, from the formula:

F8 = Σ_{i=1..w, j=1..h} Y(i, j) · δ(Gr(i, j)) / Σ_{i=1..w, j=1..h} δ(Gr(i, j)).

The average value of the blue channel for the green pixels is obtained as follows:

F9 = Σ_{i=1..w, j=1..h} B(i, j) · δ(Gr(i, j)) / Σ_{i=1..w, j=1..h} δ(Gr(i, j)).

Finally, the compactness of the brightness histogram for the green pixels, F10, is calculated using the following steps:
• A histogram of the brightness values H_YGr for the Y pixels belonging to green areas is constructed.
• The width of the histogram D is computed as the distance between its right and left non-zero elements.
• The feature F10 is calculated as the proportion of the histogram that lies no further than an eighth of its width from the central value P8:

F10 = Σ_{i=P8−D/8}^{P8+D/8} H_YGr(i) / Σ_{i=0}^{255} H_YGr(i).
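To make the feature definitions concrete, here is a small sketch (hypothetical Python/NumPy) of F1 and F4; the sign of the last filter tap is our reading of the derivative kernel, and the masks are assumed to come from the detectors defined above.

```python
import numpy as np

def feature_f1(green_mask):
    """F1: proportion of green pixels (green_mask is a boolean h x w array)."""
    return float(green_mask.mean())

def feature_f4(Y, green_mask):
    """F4: average horizontal brightness gradient over the green pixels."""
    k = np.array([0, 0, 0, 1, 0, 0, -1], dtype=float)
    dY = np.apply_along_axis(lambda row: np.convolve(row, k, mode='same'), 1, Y.astype(float))
    n = int(green_mask.sum())
    return float(np.abs(dY[green_mask]).sum() / n) if n else 0.0
```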
The resulting classifier is implemented in the form of a directed acyclic graph (DAG) with elementary classifiers at the nodes. To make classifier synthesis more straightforward, a special editing and visualisation software tool was developed. This software implemented elementary classifiers of the following types: one-dimensional threshold transforms θ_i^HI(x) and θ_i^LO(x), 2D threshold functions θ_i^2D(x, y), linear classifiers θ_i^L(x, y), 3D threshold transforms θ_i^3D(x, y, z) and elliptic two-dimensional classifiers θ_i^e(x, y). One-dimensional threshold transforms have one input argument and are described by the formula:
Fig. 7.3 Elementary classifiers. (a) 1D threshold transform with higher threshold (b) 1D threshold transform with lower threshold (c) 2D threshold transform (d) 2D linear classifier (e) 3D threshold transform (f) elliptic 2D classifier
θ_i^HI(x) = TRUE if x > T_i, FALSE otherwise (Fig. 7.3a), and
θ_i^LO(x) = TRUE if x < T_i, FALSE otherwise (Fig. 7.3b).
The parameters of two-dimensional threshold transformations (or table classifiers) are presented in Fig. 7.3c. They are described by two threshold vectors, V_T^1 = [T_0^1 T_1^1 ... T_N^1], where T_0^1 < T_1^1 < T_2^1 < ... < T_N^1, and V_T^2 = [T_0^2 T_1^2 ... T_M^2], where T_0^2 … > 0, where K1, K2 and B are predefined constants. The number of bright and saturated pixels is controlled by the rule Q2 = F5 > T13. The average brightness of green pixels is divided into two ranges and different empirical logic is subsequently used for each, for example, Q3 = F8 > T14 and P1 = F8 > T15. The average value of the blue component of green pixels is controlled by the rule P2 = F9 < T16, and the degree of compactness of the histogram of the brightness of green pixels is evaluated as follows: P3 = F10 < T17. The final classification result R is calculated as R = (¬V1) ∧ Q3 ∧ P3 ∧ V2 ∧ (P2 ∨ (¬Q3 ∧ P2)), where V1 = N1 ∨ N2 ∨ N3 ∨ N4 ∨ z11 and V2 = (y22 ∧ Q2) ∨ y23 ∨ (y24 ∧ Q1) ∨ y32 ∨ y33 ∨ y34 ∨ y42 ∨ y43 ∨ y44. To ensure consistent classification results within the same scene, the detection result obtained for the first frame is extended to the entire scene (see Fig. 7.5). Real-time scene change detection is a fairly elaborate topic, and research in this direction is still required by practical applications, as explained by Cho and Kang (2019). In hard real-time applications such as frame rate conversion (FRC) in TV sets or mobile phones, video encoders in consumer cameras or mobile phones the scene
Fig. 7.5 Structure of the video stream classifier
Fig. 7.6 Structure of the scene change detector
change detector is an indispensable part of the algorithm which needs to be very lightweight and robust. In general, there are two approaches that have been described in the literature: methods based on motion vector analysis and methods based on histograms. Since motion-based methods introduce additional frame delay and also require motion estimation, which may have a complexity comparable to the rest of the video enhancement pipeline, we focus here on histogram-based approaches. In addition, when a scene change is used to control the temporal smoothness of realtime video enhancement algorithms, the problem statement becomes slightly different than that for compression or FRC. When shooting the same event switching between two different cameras or switching between long-range shot and close-up should not be treated as a scene change, while changing programmes or switching to the commentator in the studio should be considered as a scene change. In order to meet these requirements, we apply the robust scene change detector algorithm described below (Fig. 7.6). The RGB colour channels of the current video frame and the cluster centres KC from the output of the delay block are fed to the input of the clustering block. The cluster centres form a matrix of size K 3, where K is the number of clusters: C R1 K C ¼ GC1 BC 1
RC2
RC3
GC2
GC3
BC2
BC3
RCN K . . . GCN K : . . . BCN K ...
In the present study, K is assumed to be eight. Clustering is one iterative step of the k-means method, in which each pixel Ρði, jÞ ¼ j Rði, jÞ Gði, jÞ Bði, jÞ j with coordinates i, j is assigned to the cluster K ði, jÞ ¼ arg min DðΡði, jÞ, C k Þ, where k¼1::N k C k ¼ RCk GCk BCk is the centre of the k-th cluster and D(x, y) ¼ kx yk. The P D Ρði, jÞ, CK ði,jÞ . The total residual is calculated in this case as E ¼ i ¼ 1::w j ¼ 1::h cluster centres are updated using the formula:
K̃_C = | R̃_1^C  R̃_2^C  R̃_3^C  ...  R̃_{N_K}^C |
       | G̃_1^C  G̃_2^C  G̃_3^C  ...  G̃_{N_K}^C |
       | B̃_1^C  B̃_2^C  B̃_3^C  ...  B̃_{N_K}^C |,

where

R̃_k^C = Σ_{i=1..w, j=1..h} δ(K(i, j) = k) · R(i, j) / Σ_{i=1..w, j=1..h} δ(K(i, j) = k),
G̃_k^C = Σ_{i=1..w, j=1..h} δ(K(i, j) = k) · G(i, j) / Σ_{i=1..w, j=1..h} δ(K(i, j) = k),
B̃_k^C = Σ_{i=1..w, j=1..h} δ(K(i, j) = k) · B(i, j) / Σ_{i=1..w, j=1..h} δ(K(i, j) = k).
The scene change flag is calculated using the formula:
max(E_cur, E_prev) / min(E_cur, E_prev) > T_break,
where E_cur is the total residual in the current frame, E_prev is the total residual in the previous frame and T_break = 8 is a predefined threshold. This type of algorithm is fairly robust to rapid changes, motion within the same scene and changes from close-up to mid-range or long shots, as the class centroids in this case remain almost the same while the attribution of pixels to different classes changes. At the same time, it is sensitive enough to detect important video cuts and provide consistent enhancement settings. The classifier has a very low cost from the point of view of its hardware implementation and needs to store only 24 values per frame (eight centroids with three RGB values each).
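A sketch of the scene change test is shown below (hypothetical Python/NumPy, operating on a downscaled frame for brevity); the ratio form of the test follows the formula above.

```python
import numpy as np

def kmeans_step(frame, centres):
    """One assignment/update step of k-means on RGB pixels.
    frame: h x w x 3 array; centres: K x 3 array carried over through the delay block.
    Returns the updated centres and the total residual E."""
    pixels = frame.reshape(-1, 3).astype(float)
    dist = np.linalg.norm(pixels[:, None, :] - centres[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    E = float(dist[np.arange(len(pixels)), labels].sum())
    new_centres = centres.astype(float).copy()
    for k in range(len(centres)):
        members = pixels[labels == k]
        if len(members):
            new_centres[k] = members.mean(axis=0)
    return new_centres, E

def scene_change(E_cur, E_prev, T_break=8.0):
    lo, hi = min(E_cur, E_prev), max(E_cur, E_prev)
    return lo > 0 and hi / lo > T_break
```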
7.3 Results
7.3.1 Basic Classification
The algorithm was implemented in the form of a C++/C++.NET programme. The dependencies of the feature values F1, F2, F3, F4, F5 and F6 on time were displayed in the programme in the form of separate graphs, and the values of these features for the
Fig. 7.7 UI of the demo SW
Table 7.2 Testing of the classifier in real time

#    Classes   Title                                                          Duration (min)   Resolution (pixels)
1    C1, C3    Football: 2006 FIFA World Cup Semi-final: Italy vs. Germany         147              960 × 544
2    C1, C3    Football: Liverpool-Manchester United, BBC broadcast                 83              592 × 336
3    C1, C3    Football: Milan-Liverpool                                            51              640 × 352
4    C1, C3    Football: Manchester United-Villareal (Sky Sports 2)                 72              624 × 336
5    C2, C3    American football: Penn State-Ohio State                            141              352 × 288
6    C2, C3    American football: Sugar Bowl                                       168              352 × 288
7    C4        Wild South America                                                  100              720 × 576
8    C4        Greatest places                                                      24             1280 × 720
9    C4        Movie: Men in Black                                                  98              720 × 480
10   C4        Miscellaneous videos from YouTube, 84 files in total                421              720 × 528
               Total                                                              1305
current frame were shown as a bar graph (Fig. 7.7). The classifier was tested on at least 20 hours of real-time video (Table 7.2). Errors of the first kind were estimated by averaging the output of the classifier on the video sequences of class C4. In the classification process, a 95% accuracy threshold was reached. To evaluate errors of the second kind, a total number N_total = 220 of random frames were selected from
various sequences of classes C1–C3, and the classification accuracy was calculated as (N⁺_C1 + N⁺_C2) / N_total · 100%, where N⁺_C1 is the number of frames from class C1 (classified as “sport”) and N⁺_C2 is the number of frames from class C2 (also classified as “sport”). The classification accuracy was 96.5%. Calculations using the coefficients from Table 7.1 for frames from class C3 indicated an acceptable level of accuracy (above 95%); however, these measurements are not of great value, as experts disagree on how to classify many of the types of images from class C3. The performance of this algorithm on a computer with a 2 GHz processor and 2 Mb of memory reached 30 fps. The proposed algorithm can be implemented using only shift and addition operations, which makes it attractive in terms of hardware implementation.
7.3.2
Processing of Corner Cases
Despite the high quality of the classification method described above, further work was required to eliminate the classification errors observed (see Fig. 7.8). Figure 7.8a shows a fragment from a nature documentary that was misclassified as a long shot of a soccer game, and Fig. 7.8b shows a caterpillar that was confused with a close-up of a player. In a future version of the algorithm, such errors will be avoided through the use of an advanced skin detector, the addition of a white marking detector for long shots and the introduction of a cross-domain feature that combines texture and colour for regions of bright and saturated colours. The classification error in Fig. 7.8c is caused by an overly broad interpretation of the colour green. To solve this problem, colour detectors could be applied to the global illumination of the scene. To correct the error shown in Fig. 7.8d, a silhouette classifier could be developed. However, it would be quite a complicated solution with performance unacceptable for real-time application. Many of these problems can be solved, one way or another, using modern methods of deep learning with neural networks, and a brief review of these is given in the next section. It must be borne in mind that although this approach does not require the manual construction of features to describe the image and video, this is achieved in practice at the cost of high computational complexity and poor algorithmic interpretability (Kazantsev et al. 2019).
7.4
Video Genre Classification: Recent Approaches and CNN-Based Algorithms
A sports video categorisation system that utilises high-order spectral features (HOSF) for feature extraction from video frames and subsequently applies a multiclass SVM for video classification has been presented by Mohanan (2017).
Fig. 7.8 Detection errors (a) Nature image with low texture and high amount of green mis-classified as soccer (b) Nature image with high amount of saturated green and bright colors mis-classified as soccer (c) Underwater image with high amount of greens mis-classified as soccer (d) Golf mis-classified as soccer
HOSF are used to extract both the phase and the amplitude of the given input, allowing the subsequent SVM classifier to use a rich feature vector for video classification (Fig. 7.9). Another work by Hamed et al. (2013) that leveraged classical machine learning approaches tackled the task of video genre classification via several steps: initially, shot detection was used to extract the key frames from the input videos, and the feature vector was then extracted from the video shot using discrete cosine transform (DCT) coefficients processed by PCA. The extracted features were subsequently scaled to values of between zero and one, and, finally, weighted kernel logistic regression (WKLR) was applied to the data prepared for classification with the aim
Fig. 7.9 Workflow of a sports video classification system. (Reproduced with permission from Mohanan 2017)
Fig. 7.10 Overview of the sport genre classification method via sensor fusion. (Reproduced with permission from Cricri et al. 2013)
of achieving a high level of accuracy, making WKLR an effective method for video classification. The method suggested by Cricri et al. (2013) utilises multi-sensor fusion for sport genre classification in mobile videos. An overview of the method is shown in Fig. 7.10. Multimodal data captured by a mobile device (video, audio and data
from auxiliary sensors, e.g. electronic compass, accelerometer) are preprocessed by feature extractors specific to each data modality to produce features that are discriminative for the sport genre classification problem. Several MPEG-7 visual descriptors are used for video data, including dominant colour, colour layout, colour structure, scalable colour, edge histogram and homogeneous texture. For audio data, mel-frequency cepstral coefficient (MFCC) features are extracted. Data quality estimation is also performed in conjunction with feature extraction, in order to provide modality data confidence for subsequent classifier output fusion. SVM classifiers are used for visual and sensor features, and Gaussian mixture models (GMMs) for audio features. Gade and Moeslund (2013) presented a method for the classification of activities in a sports arena using signature heat maps. These authors used thermal imaging to detect players and calculated their positions within the sports arena using homography. Heat maps were produced by aggregating Gaussian distributions representing people over 10-minute periods. The resulting heat maps were then projected onto a low-dimensional discriminative space using PCA and then classified using Fisher’s linear discriminant (FLD). The method proposed by Maheswari and Ramakrishnan (2015) approached the task of sports video classification using edge features obtained from a nonsubsampled shearlet transform (NSST), which were classified using a k-nearest neighbour (KNN) classifier. The five sports categories of tennis, cricket, volleyball, basketball and football were considered. Following the success of convolutional neural network (CNN) approaches for various visual recognition tasks, a surge in the number of works utilising these networks for video classification tasks has been seen in recent years. One CNN-based approach to video classification (Simonyan and Zisserman 2014; Ye et al. 2015) was motivated by findings in the field of neuroscience showing that the human visual system processes what we see through two different streams, the ventral and dorsal pathways. The ventral pathway is responsible for processing spatial information such as shape and colour, while the dorsal pathway is responsible for processing motion information. Based on this structure, the authors designed a CNN to include two streams, the first of which was responsible for processing spatial information (separate visual frames) and the second for handling temporal motion-related data (stacked optical flow images), as depicted in Fig. 7.11. To efficiently combine information from these two streams, they introduced two kinds of fusion, model and modality fusion, investigating both early and late fusion approaches. Another approach suggested by Wu et al. (2015) involved the idea of spatial and temporal information fusion and introduced long short-term memory (LSTM) networks in addition to the two features produced by CNNs for the two streams (see Fig. 7.12). They also employed a regularised feature fusion network to perform video-level feature fusion and classification. The usage of LSTM allowed them to model long-term temporal information in addition to both the spatial and the short-term motion
Fig. 7.11 The processing pipeline of the two-stream CNN method. (Reproduced with permission from Ye et al. 2015)
Fig. 7.12 Overview of a hybrid deep learning framework for video classification. (Reproduced with permission from Wu et al. 2015)
features, while the fusion between the spatial and motion features in a regularised feature fusion network was used to explore feature correlations. Karpathy et al. (2014) built a large-scale video classification framework by fusing information over the temporal dimension using only a CNN, without recurrent networks like LSTM. They explored several approaches to the CNN-based fusion of temporal information (see Fig. 7.13). Another idea of theirs, motivated by the human visual system, was a multiresolution CNN that was split into fovea and context streams, as shown in Fig. 7.14. Input frames were fed into two separate processing streams: a context stream, which modelled low-resolution images, and a fovea stream, which processed high-resolution centre crop. This design takes advantage of the camera bias present in many online videos, since the object of interest often occupies the central region.
Fig. 7.13 Approaches for fusing information over the temporal dimension. (Reproduced with permission from Karpathy et al. 2014)
Fig. 7.14 Multiresolution CNN architecture split into fovea and context streams. (Reproduced with permission from Karpathy et al. 2014)
A compound memory network (CMN) was proposed by Zhu and Yang (2018) for a few-shot video classification task. Their CMN structure followed the key-value memory network paradigm, in which each key memory involves multiple constituent keys. These constituent keys work collaboratively in the training process, allowing the CMN to obtain an optimal video representation in a larger space. They also introduced a multi-saliency embedding algorithm which encoded a variable-length video sequence into a fixed-size matrix representation by discovering multiple saliencies of interest. An overview of their method is given in Fig. 7.15. Finally, there are several methods which combine the advantages of both classical machine learning and deep learning. One such method (Zha et al. 2015) used a CNN to extract features from video frame patches, which were subsequently subjected to spatio-temporal pooling and normalisation to produce video-level CNN features.
Fig. 7.15 Architecture of compound memory network. (Reproduced with permission from Zhu and Yang 2018)
Fig. 7.16 Video classification pipeline with video-level CNN features. (Reproduced with permission from Zha et al. 2015)
Fig. 7.17 Learnable pooling with context gating for video classification. (Reproduced with permission from Miech et al. 2017)
SVM was used to classify video-level CNN features. An overview of this video classification pipeline is shown in Fig. 7.16. Another method presented by Miech et al. (2017) and depicted in Fig. 7.17 employed CNNs as feature extractors for both video and audio data and aggregated the extracted visual and audio features over the temporal dimension using learnable pooling (e.g. NetVLAD or NetFV). The outputs were subsequently fused using fully connected and context gating layers.
References Bai, L., Lao, S.Y., Liao, H.X., Chen, J.Y.: Audio classification and segmentation for sports video structure extraction using support vector machine. In: International Conference on Machine Learning and Cybernetics, pp. 3303–3307 (2006) Brezeale, D., Cook, D.J.: Using closed captions and visual features to classify movies by genre. In: Poster Session of the 7th International Workshop on Multimedia Data Mining (MDM/KDD) (2006) Brezeale, D., Cook, D.J.: Automatic video classification: a survey of the literature. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 38(3), 416–430 (2008) Cho, S., Kang, J.-S.: Histogram shape-based scene change detection algorithm. IEEE Access. 7, 27662–27667 (2019). https://doi.org/10.1109/ACCESS.2019.2898889 Choroś, K., Pawlaczyk, P.: Content-based scene detection and analysis method for automatic classification of TV sports news. Rough sets and current trends in computing. Lect. Notes Comput. Sci. 6086, 120–129 (2010) Cricri, F., Roininen, M., Mate, S., Leppänen, J., Curcio, I.D., Gabbouj, M.: Multi-sensor fusion for sport genre classification of user generated mobile videos. In: Proceedings of 2013 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6 (2013) Dinh, P.Q., Dorai, C., Venkatesh, S.: Video genre categorization using audio wavelet coefficients. In: Proceedings of the 5th Asian Conference on Computer Vision (2002) Gade, R., Moeslund, T.: Sports type classification using signature heatmaps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 999–1004 (2013) Gillespie, W.J., Nguyen, D.T.: Classification of video shots using activity power flow. In: Proceedings of the First IEEE Consumer Communications and Networking Conference, pp. 336–340 (2004) Godbole, S.: Exploiting confusion matrices for automatic generation of topic hierarchies and scaling up multi-way classifiers. Indian Institute of Technology – Bombay. Annual Progress Report (2002). http://www.godbole.net/shantanu/pubs/autoconfmat.pdf. Accessed on 04 Oct 2020 Gomez, G., Sanchez, M., Sucar, L.E.: On selecting an appropriate color space for skin detection. In: Lecture Notes in Artificial Intelligence, vol. 2313, pp. 70–79. Springer-Verlag (2002) Hamed, A.A., Li, R., Xiaoming, Z., Xu, C.: Video genre classification using weighted kernel logistic regression. Adv. Multimedia. 2013, 1 (2013) Huang, H.Y., Shih, W.S., Hsu, W.H.: A film classifier based on low-level visual features. J. Multimed. 3(3) (2008) Ionescu, B.E., Rasche, C., Vertan, C., Lambert, P.: A contour-color-action approach to automatic classification of several common video genres. Adaptive multimedia retrieval. Context, exploration, and fusion. In: Lecture Notes in Computer Science, vol. 6817, pp. 74–88 (2012) Jaser, E., Kittler, J., Christmas, W.: Hierarchical decision-making scheme for sports video categorisation with temporal post-processing. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 908–913 (2004) Jiang, X., Sun, T., Chen, B.: A novel video content classification algorithm based on combined visual features model. In: Proceedings of the 2nd International Congress on Image and Signal Processing, pp. 1–6 (2009) Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
1725–1732 (2014) Kazantsev, R., Zvezdakov, S., Vatolin, D.: Application of physical video features in classification problem. Int. J. Open Inf. Technol. 7(5), 33–38 (2019)
Kittler, J., Messer, K., Christmas, W., Levienaise-Obada, B., Kourbaroulis, D.: Generation of semantic cues for sports video annotation. In: Proceedings of International Conference on Image Processing, vol. 3, pp. 26–29 (2001) Koskela, M., Sjöberg, M., Laaksonen, J.: Improving automatic video retrieval with semantic concept detection. Lect. Notes Comput. Sci. 5575, 480–489 (2009) Li, L.-J., Su, H., Fei-Fei, L., Xing, E.P.: Object Bank: a high-level image representation for scene classification and semantic feature sparsification. In: Proceedings of the Neural Information Processing Systems (NIPS) (2010) Liu, Y., Kender, J.R.: Video frame categorization using sort-merge feature selection. In: Proceedings of the Workshop on Motion and Video Computing, pp. 72–77 (2002) Machajdik, J., Hanbury, A.: Affective image classification using features inspired by psychology and art theory. In: Proceedings of the International Conference on Multimedia, pp. 83–92 (2010) Maheswari, S.U., Ramakrishnan, R.: Sports video classification using multi scale framework and nearest neighbour classifier. Indian J. Sci. Technol. 8(6), 529 (2015) Mel, B.W.: SEEMORE: combining color, shape, and texture histogramming in a neurally inspired approach to visual object recognition. Neural Comput. 9(4), 777–804 (1997). http://www.ncbi. nlm.nih.gov/pubmed/9161022. Accessed on 04 Oct 2020 Miech, A., Laptev, I., Sivic, J.: Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905 (2017) Mohanan, S.: Sports video categorization by multiclass SVM using higher order spectra features. Int. J. Adv. Signal Image Sci. 3(2), 27–33 (2017) Pass, G., Zabih, R., Miller, J.: Comparing images using color coherence vectors. In: Proceedings of the 4th ACM International Conference on Multimedia (1996) Roach, M., Mason, J.: Classification of video genre using audio. Eur. Secur. 4, 2693–2696 (2001) Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Adaptive Image Processing Algorithms for Printing. Springer Nature Singapore AG, Singapore (2018) Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for Scanning and Printing. Springer Nature Switzerland AG, Cham (2019) Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Proceedings of 28th Conference on Neural Information Processing Systems, pp. 578–576 (2014) Subashini, K., Palanivel, S., Ramalingam, V.: Audio-video based classification using SVM and AANN. Int. J. Comput. Appl. 44(6), 33–39 (2012) Takagi, S., Hattori, S.M., Yokoyama, K., Kodate, A., Tominaga, H.: Sports video categorizing method using camera motion parameters. In: Proceedings of International Conference on Multimedia and Expo, vol. II, pp. 461–464 (2003a) Takagi, S., Hattori, S.M., Yokoyama, K., Kodate, A., Tominaga, H.: Statistical analyzing method of camera motion parameters for categorizing sports video. In: Proceedings of the International Conference on Visual Information Engineering. VIE 2003, pp. 222–225 (2003b) Truong, B.T., Venkatesh, S., Dorai, C.: Automatic genre identification for content-based video categorization. In: Proceedings of 15th International Conference on Pattern Recognition, vol. 4, p. 4230 (2000) Vaswani, N., Chellappa, R.: Principal components null space analysis for image and video classification. IEEE Trans. Image Process. 15(7), 1816–1830 (2006) Wei, G., Agnihotri, L., Dimitrova, N.: Tv program classification based on face and text processing. 
In: IEEE International Conference on Multimedia and Expo, vol. 3, pp. 1345–1348 (2000) Wu, Z., Wang, X., Jiang, Y.G., Ye, H., Xue, X.: Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In: Proceedings of the 23rd ACM International Conference on Multimedia, pp. 461–470 (2015) Ye, H., Wu, Z., Zhao, R.W., Wang, X., Jiang, Y.G., Xue, X.: Evaluating two-stream CNN for video classification. In: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pp. 435–442 (2015)
Yuan, Y., Wan, C.: The application of edge feature in automatic sports genre classification. In: IEEE Conference on Cybernetics and Intelligent Systems, vol. 2, pp. 1133–1136 (2004) Zha, S., Luisier, F., Andrews, W., Srivastava, N., Salakhutdinov, R.: Exploiting image-trained CNN architectures for unconstrained video classification. arXiv preprint arXiv: 1503.04144 (2015) Zhu, L., Yang, Y.: Compound memory networks for few-shot video classification. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 751–766 (2018)
Chapter 8
Natural Effect Generation and Reproduction Konstantin A. Kryzhanovskiy and Ilia V. Safonov
8.1
Introduction
The creation and sharing of multimedia presentations and slideshows has become a pervasive activity. The development of tools for automated creation of exciting, entertaining, and eye-catching photo transitions and animation effects, accompanied by background music and/or voice comments, is a trend in the last decade (Chen et al. 2010). One of the most impressive effects is the animation of still photos: for example, grass swaying in the wind or raindrop ripples in water. The development of fast and realistic animation effects is a complex problem. Special interactive authoring tools, such as Adobe After Effects and VideoStudio, are used to create animation from an image. In these authoring tools, the effects are selected and adjusted manually, which may require considerable efforts by the user. The resulting animation is saved as a file, thus requiring a significant amount of storage space. During playback, such movies are always the same, thus leading to a feeling of repetitiveness for the viewer. For multimedia presentations and slideshows, it is preferable to generate animated effects on the fly with a high frame rate. Very fast and efficient algorithms are necessary in order to provide the required performance; this is extremely difficult for low-powered embedded platforms. We have been working on the development and implementation of automatically generated animated effects for full HD images on ARM-based CPUs without the usage of GPU capabilities. In such limited conditions, the creation of realistic and impressive animated effects – especially for users
K. A. Kryzhanovskiy (*), Align Technology Research and Development, Inc., USA, Moscow Branch, Russia, e-mail: [email protected]
I. V. Safonov, National Research Nuclear University MEPhI, Moscow, Russia, e-mail: [email protected]
Fig. 8.1 Detected beats affect the size of the flashing light
who are experienced at playing computer games on powerful PCs and consoles – is a challenging task. We have developed several algorithms for the generation of content-based animation effects from photos, such as flashing light, soap bubbles, sunlight spot, magnifier effect, rainbow effect, portrait morphing transition effect, snow, rain, fog, etc. For these effects, we propose a new approach for automatic audio-aware animation generation. In this chapter, we demonstrate the adaptation of effect parameters according to background audio for three effects: flashing light (see the example in Fig. 8.1), soap bubbles, and sunlight spot. Obviously, the concept can be extended to other animated effects.
8.2
Previous Works
There are several content-adaptive techniques for the generation of animation from still photos. Sakaino (2005) depicts an algorithm for the generation of plausible motion animation from textures. Safonov and Bucha (2010) describe the animated thumbnail which is a looped movie demonstrating salient regions of the scene in sequence. Animation simulates camera tracking in, tracking out, and panning between detected visual attention zones and the whole scene. More information can be found in Chap. 14 (‘An Animated Graphical Abstract for an Image’). Music plays an important role in multimedia presentations. There are some methods aimed at aesthetical audiovisual composition in slideshows. ‘Tiling slideshow’ (Chen et al. 2006) describes two methods for the analysis of background audio to select timing for photo and frame switching. The first method is beat detection. The second is energy dynamics calculated using root-mean-square values of adjacent audio frames. There are other concepts for the automatic combination of audio and visual information in multimedia presentations. Dunker et al. (2011) suggest an approach
that focuses on an automatic soundtrack selection. The process attempts to comprehend what the photos depict and tries to choose music accordingly. Leake et al. (2020) present a method for transforming text articles into audiovisual slideshows by leveraging the notion of word concreteness, which measures how strongly a phrase is related to some perceptible concept.
8.3 Animation Effects from a Photo
8.3.1 General Workflow
In general, the procedure to create an animation effect from a single still image consists of the following major stages: effect initialization and effect execution. During effect initialization, certain operations that have to be made only once for the entire effect span are performed. Such operations may include source image format conversion, pre-processing, analysis, segmentation, and creation of some visual objects and elements displayed during effect duration, etc. At the execution stage for each subsequent frame, a background audio analysis is performed, visual objects and their parameters are modified depending on the time elapsed, audio features are calculated, and the entire modified scene is visualized. The animation effect processing flow chart is displayed in Fig. 8.2.
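The two-stage workflow can be outlined as a small skeleton (hypothetical Python; all class and hook names are ours, meant only to mirror the stages of Fig. 8.2):

```python
import time

class AnimatedEffect:
    """Skeleton of the initialization/execution split described above."""
    def initialize(self, image):
        # performed once per effect: analysis, segmentation, object creation
        self.rois = self.detect_regions_of_interest(image)
        self.objects = self.create_visual_objects(self.rois)

    def run(self, image, get_audio_fragment, render, stopped, fps=30):
        self.initialize(image)
        while not stopped():
            features = self.detect_audio_features(get_audio_fragment())
            for obj in self.objects:
                obj.update(time.time(), features)   # time- and audio-driven parameters
            render(image, self.objects)             # compose and display the frame
            time.sleep(1.0 / fps)

    # hooks to be provided by a concrete effect (flashing light, soap bubbles, ...)
    def detect_regions_of_interest(self, image): return []
    def create_visual_objects(self, rois): return []
    def detect_audio_features(self, audio): return {}
```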
8.3.2 Flashing Light
The flashing light effect displays several flashing and rotating coloured light stars over the bright spots of the image. In this effect, the size, position, and colour of the flashing light stars are defined by the detected position, size, and colour of the bright areas of the source still image. An algorithm performs the following steps to detect small bright areas in the image:
• Calculation of the histogram of brightness of the source image.
• Calculation of the segmentation threshold as the grey level corresponding to a specified fraction of the brightest pixels of the image, using the brightness histogram.
• Segmentation of the source image by thresholding; while thresholding, a morphological majority filter is used to extract localized groups of bright pixels.
• Calculation of the following features for each connected region of interest (ROI):
(a) Mean colour Cmean.
(b) Centroid (xc, yc).
(c) Image fraction F (the fraction of the image area occupied by the ROI).
Fig. 8.2 Animation effect processing flow chart: effect initialization (obtain still image, detect regions of interest, detect ROI features, create visual objects) followed by the effect execution loop (obtain audio fragment, detect audio features, update visual object parameters, generate and display animation frame), repeated until the animation is stopped
(d) Roundness (the ratio of the diameter of the circle with the same area as the ROI to the maximum dimension of the ROI):

K_r = \frac{2\sqrt{S/\pi}}{\max(W, H)},

where S is the area of the ROI and W, H are the ROI bounding box dimensions.
(e) Quality (an integral parameter characterizing the likelihood of the ROI being a light source, calculated as follows):
Q_L = w_{Y\max} Y_{\max} + w_{Y\mathrm{mean}} Y_{\mathrm{mean}} + w_R K_r + w_F K_F,

where Ymax is the maximum brightness of the ROI, Ymean is the mean brightness of the ROI, and KF is the coefficient of ROI size:

K_F = \begin{cases} F/F_0, & \text{if } F \le F_0, \\ F_0/F, & \text{if } F > F_0, \end{cases}
where F0 is the image fraction normalizing coefficient for an optimal lightspot size and wYmax, wYmean, wR, and wF are the weighting coefficients. The weighting coefficients w and the optimal lightspot size normalization coefficient F0 are obtained by minimizing the differences between automatic and manual light source segmentation results.
• Selection of regions with appropriate features. All bright spots with image fractions falling within the appropriate range (Fmin, Fmax) and with roundness Kr larger than a certain threshold value Kr0 are considered to be potential light sources.
Potential light sources are sorted by their quality value QL. A specified number of light sources with the largest quality values are selected as the final light star regions. Centroids of the selected light regions are used as star positions. Star size is determined by the dimensions of the corresponding light region. The mean colour of the region determines the colour of the light star. Figure 8.3 illustrates the procedure for detection of appropriate regions for flashing. Every light star is composed of bitmap templates of two types, representing star-shape elements: a halo shape and a star ray (spike) shape. These templates are independently scaled alpha maps. Figure 8.4 shows examples of the templates. During rendering, the alpha map of a complete star of the appropriate size is prepared in a separate buffer, and then the star is painted with the appropriate colour with a transparency value extracted from the star alpha map. During animation, light star sizes and intensities are changed gradually and randomly to give the impression of flashing lights.
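A compact sketch of this detection pipeline, using NumPy and SciPy only for labelling; the weights, thresholds and size range below are illustrative placeholders rather than the coefficients obtained by the optimization described above:

```python
import numpy as np
from scipy import ndimage  # used only for connected-component labelling

def detect_light_regions(gray, bright_fraction=0.02, f0=1e-3,
                         fmin=1e-5, fmax=0.05, kr0=0.5, max_stars=5):
    """Illustrative re-implementation of the bright-spot selection steps above.

    gray: 2-D uint8 brightness image. All numeric parameters are placeholders."""
    # Threshold at the grey level that keeps the given fraction of brightest pixels.
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    cdf_from_bright = np.cumsum(hist[::-1])
    threshold = 255 - int(np.searchsorted(cdf_from_bright, bright_fraction * gray.size))
    labels, _ = ndimage.label(gray >= threshold)

    candidates = []
    for idx, roi in enumerate(ndimage.find_objects(labels), start=1):
        region = labels[roi] == idx
        s = int(region.sum())
        h, w = region.shape
        f = s / gray.size                               # image fraction F
        kr = 2.0 * np.sqrt(s / np.pi) / max(w, h)       # roundness K_r
        kf = f / f0 if f <= f0 else f0 / f              # size coefficient K_F
        ymax = float(gray[roi][region].max())
        ymean = float(gray[roi][region].mean())
        ql = 0.3 * ymax + 0.3 * ymean + 0.2 * kr + 0.2 * kf  # placeholder weights
        if fmin < f < fmax and kr > kr0:
            candidates.append((ql, roi))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [roi for _, roi in candidates[:max_stars]]
```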
8.3.3 Soap Bubbles
This effect displays soap bubbles moving over the image. Each bubble is composed of a colour map, an alpha map, and highlight maps. A set of highlight maps with different highlight orientations is precomputed for each bubble. The highlight position depends on the lighting direction in the corresponding area of the image. The lighting gradient is calculated using a downscaled brightness channel of the image.
Fig. 8.3 Illustration of detection of appropriate regions for flashing: collect brightness histogram, calculate threshold, image segmentation, calculate features for each region, find appropriate regions
Fig. 8.4 Light star-shape templates: (a) halo template, (b) ray template
The colour map is modulated by the highlight map, selected from the set of highlight maps in accordance with the average light direction around the bubble, and then is combined with the source image using alpha blending with a bubble alpha map. Figure 8.5 illustrates the procedure of soap bubble generation from alpha and colour maps. During animation, soap bubbles move smoothly over the image from bottom to top or vice versa while oscillating slightly in a horizontal direction to give the impression of real soap bubbles floating in the air. Figure 8.6 demonstrates a frame of animation with soap bubbles.
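A minimal compositing sketch of this step, assuming the highlight map has already been chosen for the local light direction and treating modulation as a plain multiplication (an assumption for illustration, not the exact formula used in the effect):

```python
import numpy as np

def composite_bubble(frame, colour_map, alpha_map, highlight_map, x, y):
    """Blend a single soap bubble into `frame` with its top-left corner at (x, y).

    frame: HxWx3 uint8 image; colour_map: hxwx3; alpha_map and highlight_map: hxw
    arrays with values in [0, 1]. The highlight map is assumed to be the one
    pre-selected for the local light direction.
    """
    h, w = alpha_map.shape
    roi = frame[y:y + h, x:x + w].astype(np.float32)
    # Modulate the bubble colour by the highlight map, then alpha-blend the
    # result over the source image using the bubble alpha map.
    bubble = colour_map.astype(np.float32) * highlight_map[..., None]
    a = alpha_map[..., None]
    frame[y:y + h, x:x + w] = np.clip(a * bubble + (1.0 - a) * roi, 0, 255).astype(np.uint8)
    return frame
```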
Fig. 8.5 Soap bubble generation from alpha and colour maps
Fig. 8.6 Frame of animation containing soap bubbles
8.3.4 Sunlight Spot
This effect displays a bright spot moving over the image. Prior to starting the effect, the image is dimmed according to its initial average brightness. Figure 8.7a shows an image with the sunlight spot effect. The spotlight trajectory and size are defined by the attention zones of the photo. Similar to the authors of many existing publications, we consider human faces and salient regions to be attention zones. In addition, we regard text inscriptions as attention zones because these can be the name of a hotel or town in the background of the photo. In the case of a newspaper, such text can include headlines. Despite great achievements by deep neural networks in the area of multi-view face detection (Zhang and Zhang 2014), the classical Viola–Jones face detector (Viola and Jones 2001) is widely used in embedded systems due to its low power consumption. The number of false positives can be decreased with additional skin tone segmentation and processing of the downsampled image (Egorova et al. 2009). So far, a universal model of human vision does not exist, but the pre-attentive vision model based on feature integration theory is well known. In this case, because the observer is at the attentive stage while viewing the photo, a model of human pre-attentive vision is not strictly required. However, existing approaches for the detection of regions of interest are based on saliency maps, and these often provide reasonable outcomes, whereas the use of the attentive vision model requires too much prior information about the scene, and it is not generally applicable. Classical saliency map-building algorithms (Itti et al. 1998) have a very high computational
Fig. 8.7 Demonstration of the sunlight spot effect: (a) particular frame, (b) detected attention zones
complexity. That is why researchers have devoted a lot of effort to developing fast saliency map creation techniques. Cheng et al. (2011) compare several algorithms for salient region detection. We implemented the histogram-based contrast (HC) method on our embedded platform. While developing an algorithm for the detection of areas with text, we took into account the fact that text components are arranged in a regular order and have similar texture features. Firstly, we applied a LoG edge detector. Then, we filtered the resulting connected components based on an analysis of their texture features. We used the following features (Safonov et al. 2019); a small computational sketch follows the list:
1. Average brightness of block Bi:

\bar{B}_i = \frac{1}{N^2} \sum_{r=1}^{N} \sum_{c=1}^{N} B_i(r, c).

2. Average difference in average brightnesses of the blocks Bk in the four-connected neighbourhood of block Bi:

dB_i = \frac{1}{4} \sum_{k=1}^{4} \left| \bar{B}_i - \bar{B}_k \right|.

3. Average of the vertical dBiy and horizontal dBix block derivatives:

d_{x,y} B_i = \frac{\sum_{r=1}^{N} \sum_{c=1}^{N-1} dB_i^{x}(r, c) + \sum_{r=1}^{N-1} \sum_{c=1}^{N} dB_i^{y}(r, c)}{2N(N-1)}.

4. Block homogeneity of Bi:

H = \sum_{i,j} \frac{N_d(i, j)}{1 + |i - j|},

where Nd is a normalized co-occurrence matrix, and d defines the spatial relationship.
5. The percentage of pixels with a gradient greater than the threshold:

P_g = \frac{1}{N^2} \sum_{\forall (r,c) \in B_i} \mathbf{1}\{\nabla B_i(r, c) > T\},

where ∇Bi(r, c) is calculated as the square root of the sum of the squares of the horizontal and vertical derivatives.
6. The percentage of pixel values changed by the morphological operation of opening B_i^o on a binary image B_i^b, obtained by binarization with a threshold of 128:

P_m = \frac{1}{N^2} \sum_{\forall (r,c) \in B_i} \mathbf{1}\{B_i^{o}(r, c) \ne B_i^{b}(r, c)\}.
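A rough NumPy sketch of features 1, 2, 3, 5 and 6; the gradient threshold and the block handling here are illustrative assumptions, and feature 4 is omitted because it needs a co-occurrence matrix:

```python
import numpy as np
from scipy.ndimage import binary_opening

def block_texture_features(block, neighbours, grad_threshold=30.0):
    """Features 1, 2, 3, 5 and 6 for an N x N brightness block (uint8).

    `neighbours` is a list of the four 4-connected neighbouring blocks."""
    b = block.astype(np.float32)
    n = b.shape[0]

    mean_brightness = float(b.mean())                                        # feature 1
    mean_diff = float(np.mean([abs(mean_brightness - nb.mean()) for nb in neighbours]))  # feature 2

    dx = np.abs(np.diff(b, axis=1))                                          # horizontal derivative
    dy = np.abs(np.diff(b, axis=0))                                          # vertical derivative
    avg_derivative = float((dx.sum() + dy.sum()) / (2 * n * (n - 1)))        # feature 3

    gx, gy = np.gradient(b, axis=1), np.gradient(b, axis=0)
    grad = np.sqrt(gx ** 2 + gy ** 2)
    pct_strong_grad = float(np.count_nonzero(grad > grad_threshold)) / n ** 2  # feature 5

    binary = b > 128
    opened = binary_opening(binary)                                          # morphological opening
    pct_changed = float(np.count_nonzero(opened != binary)) / n ** 2         # feature 6

    return mean_brightness, mean_diff, avg_derivative, pct_strong_grad, pct_changed
```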
Also, an analysis of geometric dimensions and relations was performed. We merged closely located connected components, arranging those with the same
order and similar colours and texture features, into groups. After that, we classified the resulting groups. We formed the final text zones on the basis of the groups classified as text. Figure 8.7b shows the detected attention zones. A red rectangle depicts a detected face; green rectangles denote text regions; yellow marks the bounding box of the most salient area according to the HC method.
8.4 Adaptation to Audio
Which animation parameters can depend on the characteristics of the background audio signal? First, the size and intensity of animated objects, and the speed of their movement and rotation, can be adjusted. In addition, we investigated how the colour of animated objects can be changed depending on the music. Attempts to establish a connection between music and colour have a long history. The French mathematician Louis Bertrand Castel is considered to have been a pioneer in this area. In 1724, in his work Traité de Physique sur La Pesanteur Universelle des Corps, he described an approach to the direct ‘translation’ of music to colour on a ‘spectre–octave’ basis. To illustrate his ideas, Castel even constructed le clavecin pour les yeux (ocular harpsichord, 1725). About 100 years ago, the famous Russian composer and pianist Alexander Scriabin also proposed a theory of the connection between music and colour. Colours corresponding to notes are shown in Fig. 8.8. This theory connects major and minor tonalities of the same name. Scriabin’s theory was embodied in the clavier à lumières (keyboard with lights), which he invented for use in his work Prometheus: The Poem of Fire. The instrument was supposed to be a keyboard (Fig. 8.9) with notes corresponding to the colours of Scriabin’s synesthetic system.
Fig. 8.8 Colours arranged on the circle of fifths, corresponding to Scriabin’s theory
Fig. 8.9 Tone-to-colour mapping on Scriabin’s clavier à lumières
Fig. 8.10 Colour circle corresponding to each octave
On our platform, we worked with a stereo audio signal with a sampling frequency of 44 kHz. We considered four approaches to connect the animation of the three effects mentioned above with the background audio. In all approaches, we analysed the average of the two signal channels in the frequency domain. The spectrum was built 10 times per second over 4096 samples. The spectrum was divided into several bands, as in a conventional graphic equalizer. The number of bands depended on the approach selected. For fast Fourier transform computation with fixed-point arithmetic, we used the KISS FFT open-source library (https://github.com/mborgerding/kissfft). This library does not use platform-specific instructions and is easily ported to ARM. Our first approach for visualizing music by colours was inspired by Luke Nimitz’s demonstration of a ‘Frequency Spectrograph – Primary Harmonic Music Visualizer’. This is similar to Scriabin’s idea. It can be considered a specific visualization of the graphic equalizer. In this demonstration, music octaves are associated with the HSL colour wheel, as shown in Fig. 8.10, using the statement:

\mathrm{Angle} = 2\pi \log_2 \frac{f}{c},
where f is the frequency, and c is the origin on the frequency axis. The angle defines the hue of the current frequency.
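A small sketch of this octave-to-hue mapping; the reference frequency (the note C0, approximately 16.35 Hz) and the use of Python's colorsys HLS conversion are our own illustrative choices:

```python
import math
import colorsys

def frequency_to_rgb(freq_hz, origin_hz=16.35, brightness=1.0):
    """Map a frequency to a colour on the HSL wheel, one full turn per octave."""
    angle = 2.0 * math.pi * math.log2(freq_hz / origin_hz)
    hue = (angle / (2.0 * math.pi)) % 1.0            # same hue for f, 2f, 4f, ...
    r, g, b = colorsys.hls_to_rgb(hue, 0.5 * brightness, 1.0)
    return int(r * 255), int(g * 255), int(b * 255)

# A4 (440 Hz) and A5 (880 Hz) land on the same hue, one octave apart:
print(frequency_to_rgb(440.0), frequency_to_rgb(880.0))
```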
Depending on the value of the current note, we defined the brightness of the selected hue and drew it on a colour circle. We used three different approaches to display colour on the colour wheel: painting sectors, painting along the radius, and using different geometric primitives inscribed into the circle. In the soap bubble effect, the generated circle colour determines the colour of the bubble texture. Figure 8.11 demonstrates an example of soap bubbles with a colour distribution depending on the music. In the sunlight spot effect, the generated colour circle determines the distribution of colours for the highlighted spot (Fig. 8.12). In the second approach, we detected the beats or rhythm of the music. There are numerous techniques for beat detection in the time and frequency domains (Scheirer 1998; Goto 2004; Dixon 2007; McKinney et al. 2007; Kotz et al. 2018; Lartillot and Grandjean 2019). We faced constraints due to real-time performance limitations, and we were dissatisfied with the outcomes for some music genres. Finally, we assumed that a beat is present if there are significant changes of values in several bands. This method meets the performance requirements with acceptable beat-detection quality. Figure 8.1 illustrates how detected beats affect the size of the flashing light. If a beat was detected, we instantly maximized the size and brightness of the lights, and they then gradually returned to their normal state until the next beat happened.
Fig. 8.11 Generated colour distribution of soap bubbles depending on music
Fig. 8.12 Generated colour distribution of sunlight spot depending on music
Fig. 8.13 Low, middle, and high frequencies affect brightness and saturation of corresponding flashing lights
Also, it was possible to change the set of flashing lights when the beat occurred (by turning light sources on and off). In the soap bubble effect, we maximized the saturation of the soap bubble colour when the beat occurred. We also changed the direction of the moving soap bubbles as the beat happened. In the sunlight spot effect, if a beat was detected, we maximized the brightness and size of the spot, and these then gradually returned to their normal states. In the third approach, we analysed the presence of low, middle, and high frequencies in the audio signal. This principle is used in colour music installations. In the soap bubble effect, we assigned a frequency range to each soap bubble and defined its saturation according to the value of the corresponding frequency range. In the flashing light effect, we assigned each light star to its own frequency range and defined its size and brightness depending on the value of the frequency range. Figure 8.13 shows how the presence of low, middle, and high frequencies affects the flashing lights. Another approach is not to divide the spectrum into low, middle, and high frequencies but rather to assign these frequencies to different tones inside octaves. Therefore, we used an equalizer containing a large number of bands, where each octave had enough corresponding bands. We accumulated the values of each equalizer band in a buffer cell whose number was calculated using the following statement:

\mathrm{num} = \frac{\left(\log_2 \frac{f}{c} \cdot 360\right) \bmod 360}{360 / \mathrm{length}} + 1,
where f is the frequency, c is the origin of the frequency axis, and length is the number of cells. Each cell controls the behaviour of selected objects. In the soap bubble effect, we assigned each soap bubble to a corresponding cell and defined its saturation
depending on the value of the cell. In the flashing light effect, we assigned each light to a corresponding cell and defined its size and brightness depending on the value of the cell.
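The cell mapping and the simple multi-band beat test described above could look roughly as follows; the rise factor, the minimum number of bands and the floor rounding are assumptions for illustration, not the values used on the device:

```python
import math
import numpy as np

def band_cell(freq_hz, origin_hz, length):
    """Map a frequency to one of `length` cells tiling each octave."""
    angle_deg = (math.log2(freq_hz / origin_hz) * 360.0) % 360.0
    return int(angle_deg // (360.0 / length)) + 1

def detect_beat(prev_bands, curr_bands, rise=1.5, min_bands=3):
    """Crude beat test: a beat is assumed when at least `min_bands` equalizer
    bands grow by the factor `rise` between two consecutive spectra."""
    prev = np.maximum(np.asarray(prev_bands, dtype=float), 1e-9)
    curr = np.asarray(curr_bands, dtype=float)
    return int(np.count_nonzero(curr / prev > rise)) >= min_bands
```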
8.5 Results and Discussion
The major outstanding question is how these functions can be implemented in modern multimedia devices for real-time animation. The algorithms were optimized for ARM-based platforms with a CPU frequency of 800–1000 MHz. The limited computational resources of the target platform, combined with the absence of graphics acceleration hardware, are a serious challenge for the implementation of visually rich animation effects. Therefore, comprehensive optimization is required to obtain smooth frame rates. The total performance gain was 8.4 times in comparison with the initial implementation. The most valuable optimization approaches and the resulting performance are listed in Tables 8.1 and 8.2.

Table 8.1 Optimization approaches
  Approach                                   Speed-up
  Fixed-point arithmetic                     4.5
  SIMD CPU instructions (NEON)               3
  Effective cache usage                      1.5
  Re-implementing of key glibc functions     1.25

Table 8.2 Performance of proposed effects for HD photo
  Effect            Initialization time, s    FPS
  Flashing light    0.15                      20
  Soap bubbles      0.08                      45
  Sunlight spot     1.4                       50

Because objective evaluation of the proposed audiovisual presentation is difficult, we evaluated the advantages of our technique through a subjective user-opinion survey. Flashing light, soap bubbles, and sunlight spot effects with octave-based audio adaptation were used for demonstration. Two questions were asked about the three audiovisual effects:
1. Are you excited by the effect?
2. Would you like to see that effect in your multimedia device?
Twenty-three observers participated in the survey. Figure 8.14 reflects the survey results. In general, an absolute majority of the interviewees rated the effects positively. Only two people said that they disliked not just the demonstrated effects but multimedia effects in general. Some observers stated, ‘It’s entertaining, but I cannot say “I’m excited”, because such expression would be too strong.’
Fig. 8.14 Survey results
Several participants of the survey said that they did not like the photos or background music used for the demonstration. It is also worth noting that eight of the respondents were women and, on average, they rated the effects much higher than the men did. Overall, the outcomes of the subjective evaluation demonstrate observer satisfaction with this new type of audiovisual presentation: audio-aware animation behaves uniquely each time it is played back and does not repeat itself during playback, creating vivid and lively impressions for the observer. Many observers were excited by the effects and would like to see such features in their multimedia devices.
8.6 Future Work
Several other audio-aware animation effects can be proposed. Figure 8.15 shows screenshots of our audio-aware effect prototypes. In the rainbow effect (Fig. 8.15a), the colour distribution of the rainbow changes according to the background audio spectra. The movement direction, speed, and colour distribution of confetti and serpentines are adjusted according to the music rhythm in the confetti effect (Fig. 8.15b). Magnifier glass movement speed and magnification are affected by the background music tempo in the magnifier effect (Fig. 8.15c). In the lightning effect (Fig. 8.15d), lightning bolt strikes are matched to accents in the background audio. Obviously, other approaches to adapting the behaviour of animation to background audio are also possible. It is possible to analyse the left and right audio channels separately and to apply different behaviours to the left and right sides of the screen, respectively. Other effects amenable to music adaptation can be created.
Fig. 8.15 Audio-aware animation effect prototypes: (a) rainbow effect; (b) confetti effect; (c) magnifier effect; (d) lightning effect
References Chen, J.C., Chu, W.T., Kuo, J.H., Weng, C.Y., Wu, J.L.: Tiling slideshow. In: Proceedings of the ACM International Conference on Multimedia, pp. 25–34 (2006) Chen, J., Xiao, J., Gao, Y.: iSlideShow: a content-aware slideshow system. In: Proceedings of the International Conference on Intelligent User Interfaces, pp. 293–296 (2010) Cheng, M.M., Zhang, G.X., Mitra, N.J., Huang, X., Hu, S.M.: Global contrast based salient region detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition, pp. 409–416 (2011) Dixon, S.: Evaluation of the audio beat tracking system BeatRoot. J. New Music Res. 36(1), 39–50 (2007) Dunker, P., Popp, P., Cook, R.: Content-aware auto-soundtracks for personal photo music slideshows. In: Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 1–5 (2011) Egorova, M.A., Murynin, A.B., Safonov, I.V.: An improvement of face detection algorithm for color photos. Pattern Recognit. Image Anal. 19(4), 634–640 (2009) Goto, M.: Real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass lines in real-world audio signals. Speech Comm. 43(4), 311–329 (2004) Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998) Kotz, S.A., Ravignani, A., Fitch, W.T.: The evolution of rhythm processing. Trends Cogn. Sci. Special Issue: Time in the Brain. 22(10), 896–910 (2018) Lartillot, O., Grandjean, D.: Tempo and metrical analysis by tracking multiple metrical levels using autocorrelation. Appl. Sci. 9(23), 5121 (2019) Accessed on 01 October 2020. https://www. mdpi.com/2076-3417/9/23/5121
Leake, M., Shin, H.V., Kim, J.O., Agrawala, M.: Generating audio-visual slideshows from text articles using word concreteness. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–11 (2020) McKinney, M.F., Moelants, D., Davies, M.E.P., Klapuri, A.: Evaluation of audio beat tracking and music tempo extraction algorithms. J. New Music Res. 36(1), 1–16 (2007) Safonov, I.V., Bucha, V.V.: Animated thumbnail for still image. In: Proceedings of the GRAPHICON Symposium, pp. 79–86 (2010) Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for Scanning and Printing. Springer Nature Switzerland AG (2019) Sakaino, H.: The photodynamic tool: generation of animation from a single texture image. In: Proceedings of the IEEE International Conference on Multimedia and Expo, pp. 1090–1093 (2005) Scheirer, E.D.: Tempo and beat analysis of acoustic musical signals. J. Acoust. Soc. Am. 103(1), 588–601 (1998) Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 511–518 (2001) Zhang, C., Zhang, Z.: Improving multiview face detection with multi-task deep convolutional neural networks. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 1036–1041 (2014)
Chapter 9
Image Classification as a Service

Mikhail Y. Sirotenko
9.1 Introduction
In the context of this book, we define image classification as an algorithm for predicting semantic labels or classes for a given two- or more-dimensional digital image. A very simple example of this is an algorithm that takes a photograph as an input and predicts whether it contains a person or not. Image classification as a service (ICaaS) refers to the specific implementation of this algorithm as a web service which accepts requests containing an image and returns a response with classification results (Hastings 2013). Image classification is an important problem having applications in many areas including:
• Categorising and cataloguing images
• Visual search
• Inappropriate content detection
• Medical imaging and diagnostics
• Industrial automation
• Defect detection
• Cartography and satellite imaging
• Product recognition for e-commerce
• Visual localisation
• Biometric identification and security
• Robotic perception
• And others
In this chapter, we will discuss all stages of building an image classification service.
M. Y. Sirotenko (*) 111 8th Ave, New York, NY 10011, USA e-mail: [email protected] © Springer Nature Switzerland AG 2021 M. N. Rychagov et al. (eds.), Smart Algorithms for Multimedia and Imaging, Signals and Communication Technology, https://doi.org/10.1007/978-3-030-66741-2_9
9.1.1 Types of Image Classification Systems
Image classification systems can be divided into different kinds depending on the application and implementation:
1. Binary classifiers separate images into one of two mutually exclusive classes, while multi-class classifiers can predict one of many classes for a given image.
2. Multi-label classifiers are classifiers that can predict many labels per image (or none).
3. Based on the chosen classes, there are hierarchical or flat classification systems. Hierarchical ones assume a certain taxonomy or ontology imposed on the classes.
4. Classification systems can be fine-grained or coarse-grained based on the granularity of the chosen classes.
5. If classification is localised, then the class is applied to a certain region of the image. If it is not localised, then the classification applies to the entire image.
6. Specialist classifiers usually focus on a relatively narrow classification task (e.g. classifying dog breeds), while generalist classifiers can work with any images.
Figure 9.1 shows an example of a hierarchical, fine-grained, localised multi-class classification system (Jia et al. 2020).
Fig. 9.1 Fashionpedia taxonomy
9.1.2 Constraints and Assumptions
Given that the type of image classification system discussed here is a web-based service, we can deduce a set of constraints and assumptions. The first assumption is that we are able to run the backend on an arbitrary server as opposed to running image classification on a specific hardware (e.g. smartphone). This means that we are free to choose any type of model architecture and any type of hardware to run the model without very strict limitations on the memory, computational resources or battery life. On the other hand, transferring data to the web service may become a bottleneck, especially for users with small bandwidth. This means that the classification system should work well with compressed images having relatively low resolution. Another important consideration is concurrency. Since we are building a web server system, it should be designed to support concurrent user requests.
9.2 Ethical Considerations
The very first question to ask before even starting to design an image classification system is how this system may end up being used and how to make sure it will do no harm to people. Recently, we saw a growing number of examples of unethical or discriminatory uses of AI. Such uses include using computer vision for military purposes, mass surveillance, impersonation, spying and privacy violations. One of the recent examples is this: certain authorities use facial recognition to track and control a minority population (Mozur 2019). This is considered by many as the first known example of intentionally using artificial intelligence for racial profiling. Some local governments are banning facial recognition in public places since it is a serious threat to privacy (y Arcas et al. 2017). While military or authoritarian governments using face recognition technology is an instance of unethical use of otherwise useful technology, there are other cases when AI systems are flawed by design and represent pseudoscience that could hurt some groups of people if attempted to be used in practice. One example of such pseudoscience is the work titled “Automated Inference on Criminality Using Face Images” published in November 2016. The authors claimed that they trained a neural network to classify people’s faces as criminal or non-criminal. The practice of using people’s outer appearance to infer inner character is called physiognomy, a pseudoscience that could lead to dangerous consequences if put into practice and represents an instance of a broader scientific racism. Another kind of issue that may lead to unethical use of the image classification system is algorithmic bias. Algorithmic bias is defined as unjust, unfair or prejudicial treatment of people related to race, income, sexual orientation, religion, gender and other characteristics historically associated with discrimination and marginalisation, when and where it is manifested in algorithmic systems or algorithmically aided
Fig. 9.2 Many ways how human bias is introduced into the machine learning system
decision-making. Algorithmic biases could amplify human biases while building the system. Most of the time, the introduction of algorithmic biases happens without intention. Figure 9.2 shows that human biases could be introduced into machine learning systems at every stage of the process and even lead to positive feedback loops. Such algorithmic biases could be mitigated by properly collecting the training data and using metrics that measure fairness in addition to standard accuracy metrics.
9.3 Metrics and Evaluation
It is very important to define metrics and evaluation procedures before starting to build an image classification system. When talking about a classification system as a whole, we need to consider several groups of metrics:
• End-to-end metrics are application specific and measure how the system works as a whole. An example of an end-to-end metric is the percentage of users who found top ten search results useful.
• Classification metrics are metrics used to evaluate the model predictions using the validation or test set.
• Training and inference performance metrics measure latency, speed, computing, memory and other practical aspects of training and using ML models.
• Uncertainty metrics help to measure how over- or under-confident a model is.
• Robustness metrics measure how well the model performs classification on the out-of-distribution data and how stable predictions are under natural perturbation of the input.
• Fairness metrics are useful to measure whether an ML model treats different groups of people fairly.
Let’s discuss each group of metrics in more detail.
9.3.1 End-to-End Metrics
It is impossible to list all kinds of end-to-end metrics because they are very problem dependent, but we can list some of the common ones used for image classification systems:
• Click-through rate measures what percentage of users clicked a certain hyperlink. This metric could be useful for the image content-based recommendation system. Consider a user who is looking to buy an apparel item, and based on her preferences, the system shows an image of a recommended item. Thus, if a user clicked on the image, it means that result looks relevant.
• Win/loss ratio measures the number of successful applications of the system vs unsuccessful ones over a certain period of time compared to some other system or human. For example, if the goal is to classify images of a printed circuit board as defective vs non-defective, we could compare the performance of the automatic image classification system with that of the human operator. While comparing, we count how many times the classifier made a correct prediction while the operator made a wrong prediction (win) and how many times the classifier made a wrong prediction while the human operator was correct (loss). Dividing the count of wins by the count of losses, we can make a conclusion whether deploying an automated system makes sense.
• Man/hours saved. Consider a system that classifies a medical 3D image and based on classification highlights areas indicating potential disease. Such a system would be able to save a certain amount of time for a medical doctor performing a diagnosis.
9.3.2 Classification Metrics
There are dozens of classification metrics being used in the field of machine learning. We will discuss those most commonly used and relevant for image classification.
Accuracy This is the simplest and most fundamental metric for any classification problem. It is computed as the number of correctly classified samples divided by the total number of samples. Here and further, by sample we mean an image from the test set associated with one or more ground-truth labels. A correctly classified sample is one for which the model prediction matches the ground truth. This metric can be used in single-label classification problems. The downsides of this metric are that it assumes a prediction exists for every sample; it ignores the score and order of predictions; it can be sensitive to incorrect or incomplete ground truth; and it ignores test data imbalance. It would be fair to say that this metric is good for toy problems but not sufficient for most real practical tasks.
Top K Accuracy This is a modification of the accuracy metric where the prediction is considered correct if any of the top K-predicted classes (according to the score)
matches the ground truth. This change makes the metric less prone to incomplete or ambiguous labels and helps to promote models that do a better job at ranking and scoring classification predictions.
Precision Let’s define true positives (TP) as the number of all predictions that match (at least one) ground-truth label for a corresponding sample, false positives (FP) as the number of predictions that do not match any ground-truth labels for a corresponding sample and false negatives (FN) as the number of ground-truth labels for which no matching predictions exist. Then, the precision metric is computed as:

Precision = TP / (TP + FP).

The precision metric is useful for models that may or may not predict a class for any given input (which is usually achieved by applying a threshold to the prediction confidence). If predictions exist for all test samples, then this metric is equivalent to the accuracy.
Recall Using the notation defined above, the recall metric is defined as:

Recall = TP / (TP + FN).

This metric ignores false predictions and only measures how many of the true labels were correctly predicted by the model. Note that precision and recall metrics are oftentimes meaningless if used in isolation. By tuning a confidence threshold, one can trade off precision for recall and vice versa. Thus, for many models, it is possible to achieve nearly perfect precision or recall.
Recall at X% Precision and Precision at Y% Recall These are more practical versions of the precision and recall metrics we defined above. They measure the maximum recall at a target precision or the maximum precision at a given recall. Target precision and recall are usually derived from the application. Consider an image search application where the user enters a text query and the system outputs images that match that query from a database of billions of images. If the precision of an image classifier used as a part of that system is low, the user experience could be extremely poor (imagine searching for something and seeing only one out of ten results as relevant). Thus, a reasonable goal could be to set a precision target of 90% and try to optimise for as high a recall as possible. Now consider another example – classifying whether an MRI scan contains a tumour or not. The cost of missing an image with a tumour could be literally deadly, while the cost of incorrectly predicting that an image has a tumour is only that a physician has to double-check the prediction and discard it. In this case, a recall of 99% could be a reasonable requirement, while the model could be optimised to deliver as high a precision as possible. Further information dealing with MRI can be found in Chaps. 11 and 12.
F1 Score Precision at X% recall and recall at Y% precision are useful when there is a well-defined precision or recall goal. But what if we don’t have a way to fix either precision or recall and measure the other one? This could happen if, for example, prediction scores are not available or if the scores are very badly calibrated. In that case, we can use the F1 score to compare two models. The F1 score is defined as follows:

F1 = 2 · (Precision · Recall) / (Precision + Recall).

The F1 score takes the value of 1 in the case of 100% precision and 100% recall and takes the value of 0 if either precision or recall is 0.
Precision-Recall Curve and AUC-PR In many cases, model predictions have associated confidence scores. By varying the score threshold from the minimum to the maximum, we can calculate precision and recall at each of those thresholds and generate a plot that will look like the one in Fig. 9.3. This plot is called the PR curve, and it is useful for comparing different models. From the example, we can conclude that model A provides better precision in the low-recall mode, while model B provides better recall in the low-precision mode. PR curves are useful to better understand model performance in different modes, but concluding which model is better overall could be hard. For that purpose, we can calculate the area under the PR curve. This provides a single metric that captures both precision and recall of the model over the entire range of thresholds.
Fig. 9.3 Example of the PR curves (precision vs. recall) for two models, A and B
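For illustration, a small NumPy sketch of the precision, recall, F1 and area-under-PR-curve computations defined above, for the single-label case; the use of None to mark a missing prediction is our own convention:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Single-label precision, recall and F1; None in y_pred means 'no prediction'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p is not None and p == t)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p is not None and p != t)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p is None or p != t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(scores, correct):
    """Precision/recall over all score thresholds and the area under the PR curve."""
    correct = np.asarray(correct, dtype=bool)
    order = np.argsort(scores)[::-1]              # sort predictions by descending confidence
    correct = correct[order]
    tp = np.cumsum(correct)
    fp = np.cumsum(~correct)
    precision = tp / (tp + fp)
    recall = tp / max(int(correct.sum()), 1)
    auc_pr = float(np.sum(np.diff(recall, prepend=0.0) * precision))
    return precision, recall, auc_pr

# Example with three predictions, two of them correct:
print(precision_recall_f1(["cat", "dog", "cat"], ["cat", "cat", "cat"]))
print(average_precision([0.9, 0.4, 0.8], [True, False, True]))
```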
9.3.3 Training and Inference Performance Metrics
Both inference and training time are very important for all practical applications of image classification models for several reasons. First of all, faster inference means lower classification latency and therefore better user experience. Fast inference also means classification systems may be applied to real-time streamed data or to multidimensional images. Inference speed also correlates with model complexity and required computing resources, which means the model is potentially less expensive to run. Training speed is also an important factor. Some very complex models could take weeks or months to train. Not only could this be very expensive, but it also slows down innovation and increases risks since it would take a long time before it becomes clear that the model did not succeed. One of the ways to measure the computational complexity of a model is by calculating the FLOPs (floating point operations) required to run the inference or the training step. This metric is a good approximation for comparing the complexity of different models. However, it could be a bad predictor of real processing time. The reason is that certain structures in the model may be better utilised by modern hardware accelerators. For example, it is known that models that require a lot of memory copies run slower even if they require fewer FLOPs. For the reasons above, the machine learning community is working towards standardising benchmarks to measure the actual inference and training speed of certain models on certain hardware. Training benchmarks measure the wall-clock time required to train a model on one of the standard datasets to achieve a specified quality target. Inference benchmarks consist of many different runs with varying input image resolution, floating point precision and QPS rates.
9.3.4 Uncertainty and Robustness Metrics
It is common to generate a confidence score for every prediction made by an image classifier. The user’s expectation is that at least a prediction with a higher score has a higher chance of being correct than the prediction with a lower score. Ultimately, we would like for the score to reflect the probability that the prediction is correct. This requirement allows different model predictions to be directly compared and combined. It also allows the use of model predictions in applications where it is critical to be confident in the prediction. Raw predictions from the neural network, however, are typically overconfident, meaning that prediction with a normalised score of 0.8 will have less than 80% correct predictions on the validation set. The process of adjusting model confidences to represent true prediction confidence is called calibration. A number of metrics were proposed to measure model uncertainty (Nixon et al. 2019), among which the most popular ones are expected calibration error (ECE) and static calibration error (SCE).
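A minimal sketch of the expected calibration error, assuming the common formulation with equal-width confidence bins (binning schemes vary in practice):

```python
import numpy as np

def expected_calibration_error(confidences, is_correct, n_bins=10):
    """ECE: |accuracy - mean confidence| averaged over equal-width confidence
    bins, weighted by the fraction of samples falling into each bin."""
    confidences = np.asarray(confidences, dtype=float)
    is_correct = np.asarray(is_correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(is_correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```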
Another important feature of the model is how it behaves if the input images are disturbed or from a domain different from the one the model was trained on. The set of metrics used to measure these features is the same as that used for measuring classification quality. The difference is in the input data. To measure model robustness, one can add different distortions and noise to the input images or collect a set of images from a different domain (e.g. if the model was trained on images collected from the internet, one can collect a test set of images collected from smartphone cameras).
9.4 Data
Data is the key for modern image classification systems. In order to build an image classification system, one needs training, validation and test data at least. In addition to that, a calibration dataset is needed if the plan is to provide calibrated confidence scores, and out-of-domain data is needed to fine-tune model robustness. Building a practical machine learning system is more about working with the data than actually training the model. This is why more and more organisations today are establishing data operations teams (DataOps) whose responsibilities are acquisition, annotation, quality assurance, storage and analysis of the datasets. Figure 9.4 shows the main steps of the dataset preparation process. Preparation of the high-quality dataset is an iterative process. It includes data acquisition, human annotation or verification, data engineering and data analysis.
Fig. 9.4 Dataset preparation stages: data acquisition (from internal historical data, public data, synthetic data, commercial datasets, crowdsourcing or controlled acquisition), human annotation or verification, data engineering and data analysis, with feedback to the earlier stages
9.4.1 Data Acquisition
Data acquisition is the very first step in preparing the dataset. There are several ways to acquire the data with each way having its own set of pros and cons. The most straightforward way is to use internal historical data owned by the organisation. An example of that could be a collection of images of printed circuit boards with quality assurance labels whether PCB is defective or not. Such data could be used with minimal additional processing. The downside of using only internal data is that the volume of samples might be small and the data might be biased in some way. In the example above, it may happen that all collected images are made with the same camera having exactly the same settings. Thus, when the model is trained on those images, it may overfit to a certain feature of that camera and stop working after a camera upgrade. Another potential source of data is publicly available data. This may include public datasets such as ImageNet (Fei-Fei and Russakovsky 2013) or COCO (Lin et al. 2014) as well as any publicly available images on the Internet. This approach is the fastest and least expensive way to get started if your organisation does not own any in-house data or the volumes are not enough. There is a big caveat with this data though. Most of these datasets are licensed for research or non-commercial use only which makes it impossible to use for business applications. Even datasets with less restrictive licences may contain images with wrong licence information which may lead to a lawsuit by the copyright owner. The same applies to public images collected from the Internet. In addition to copyright issues, many countries are tightening up privacy regulations. For example, General Data Protection Regulation (GDPR) law in the European Union treats any image that may be used to identify an individual as personal data, and therefore companies that collect images that may contain personally identifiable information have to comply with storage and retention requirements of the law. The more expensive way of acquiring the data is to buy a commercial dataset if one exists that fits your requirements. The number of companies that are selling datasets is growing rapidly these days; so, it is possible to purchase the dataset for most popular applications. Data crowdsourcing is the strategy to collect the data (usually including annotations) using a crowdsourcing platform. Such a platform asks users to collect the required data either for compensation or for free as a way to contribute to the improvement of a service. An example of a paid data crowdsourcing platform is Mobeye, and an example of a free data crowdsourcing platform is Google Crowdsource. Another way of data crowdsourcing implementation is through the data donation option available in the application or service. Some services provide an option for users to donate their photos or other useful information that is otherwise considered private to the owner of the service in order to improve that service. Controlled data acquisition is the process of creating data using a specially designed system. An example of such system is a set of cameras pointing to a
rotating platform at certain angles designed to collect a dataset of objects in a controlled environment (Singh et al. 2014). This approach allows collecting all the parameters of the acquired data such as camera position, lighting, type of lens, object size, etc. Synthetic data acquisition is becoming increasingly popular for training computer vision models (Nikolenko 2019). There are multiple ways to generate synthetic data depending on the task. For a simple task such as character recognition, one can generate data by generating images of a text while adding noise and distortions. A more advanced way that is widely used in building autonomous driving and robotic systems is computer graphics and virtual environments. Some more recent attempts propose to use generative deep models to synthesise data. One should be careful with using synthetic data as deep learning models could overfit on subtle nuances in the synthetic data and work poorly in practice. The common strategy is to mix synthetic data with real-world data to train the model. Increasing concerns about personal data privacy and as a result of new regulations push companies to rely less on user data to improve their models. On the other hand, there is pressure to improve fairness and inclusivity of the developed models which often require very rare data samples. This makes synthetic data a very good candidate to fulfil future data needs, and many new companies appeared in recent years offering synthetic data generation services.
9.4.2 Human Annotation or Verification
Depending on the way the images were acquired, they may or may not have ground-truth labels, or the labels may not be reliable enough and need to be verified by humans. Data annotation is the process of assigning or verifying ground-truth labels for each image. Data annotation is one of the most expensive stages of building image classification systems as it requires the human annotator to visually analyse every sample, which could be time-consuming. The process is typically managed using special software that handles a dataset of raw images and associated labels, distributes work among annotators, combines results and implements a user interface to do annotations in the most efficient way. Currently, dozens of free and paid image annotation platforms exist that offer different kinds of UIs and features. Several strategies in data annotation exist that aim to reduce the cost and/or increase the ground-truth quality, which we discuss below:
Outsourcing to an Annotation Vendor Currently, there exist dozens of companies providing data annotation services. These companies specialise in the cost-effective annotation of data. Besides having the right software tools, they handle process management, quality assurance, storage and delivery. Many such companies provide jobs in areas of the world with very low income or to incarcerated people who would not have other means to earn money. Besides the benefits of reduced costs of labour and reduced management overhead, another
advantage of such approach is data diversity and avoidance of annotation biases. This can be achieved by contracting multiple vendors from different regions of the world and different demographics. The reasons why this approach might not be appropriate are as follows: the highly confidential or private nature of the data, required expertise in a certain domain or too low volumes of the data to justify the annotation costs. If using an external vendor for data annotation is not an option, then one may consider running a custom data annotation service. As mentioned above, there are many available platforms to handle the annotation process. Depending on the complexity of the annotation task and required domain expertise, three different approaches could be used. 1. Simple tasks requiring common human skills. These tasks do not require any special knowledge and could be performed by almost anyone, for example, answering whether an image contains a fruit. For such tasks, there are crowdsource platforms such as Amazon Mechanical Turk, Clickworker and others where any person can become a worker and do work any time. 2. Tasks that require some training. For example, an average person may not be able to tell the difference between plaid, tartan and floral textile patterns. However, after studying examples and running a quick training session, she would be able to annotate these patterns. For such tasks, it is best to have a more or less fixed set of workers because every time a new worker joins, she needs to complete training before doing work. 3. Very complex tasks requiring expert knowledge. Examples are annotating medical images or categories and attributes of a specific product. Such annotation campaigns are very expensive and usually involve a limited number of experts. Thus, they require a relatively low overhead in managing such work. Even with the best quality assurance process, humans make mistakes, especially if the task is very complex, subjective, abstract or ambiguous. Studies show (Neuberger et al. 2017) that for hard tasks, it is beneficial to invest in improving the quality of the annotations rather than collecting more unreliable annotations. This can be achieved by a majority voting approach where the same sample is sent to multiple workers and results get aggregated by voting (i.e. if two out of three workers choose label A and one chooses label B, then label A is the correct one). Some versions of this approach use disagreement between raters as a label confidence. Also, in some tasks, there was success in recording the time that each worker takes to answer the question and use it as a measure of task complexity. One of the big problems in data annotation is redundancy. Not all samples are equally valuable for improving model performance or for measuring it. One of the main reasons for that is class imbalance which is a natural property of most of the real-world data. For example, if our task is to annotate a set of images of fruits crawled from the Internet, we may realise that we have hundreds of millions of images of popular fruits such as apples or strawberries but only a few hundreds of rambutans or durians. Since we do not know which image shows which fruit, we
Fig. 9.5 Active learning system: the ML model trained on the labeled data runs inference on unlabeled data; a selection strategy uses the resulting predictions, features or gradients to pick samples for human annotation, which update the labeled datasets for the next training round
would have to spend a lot of money to annotate all of them to get annotations for all rare fruits. In order to tackle this problem, an active learning approach could be used (Schröder and Niekler 2020). Active learning attempts to maximise a model’s performance gain while annotating the fewest samples possible. The general idea of active learning is shown in Fig. 9.5. Active learning helps to either significantly reduce costs or improve quality by only annotating the most valuable samples.
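As an illustration of a selection strategy for this loop, a sketch of entropy-based uncertainty sampling; this is only one of many possible strategies and not necessarily the one used in the cited work:

```python
import numpy as np

def select_for_annotation(probabilities, budget):
    """Pick the `budget` most uncertain unlabeled samples by prediction entropy.

    probabilities: (num_samples, num_classes) class probabilities produced by
    the current model for the unlabeled pool."""
    p = np.clip(np.asarray(probabilities, dtype=float), 1e-12, 1.0)
    entropy = -np.sum(p * np.log(p), axis=1)
    return np.argsort(entropy)[::-1][:budget]   # indices of the least certain samples
```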
9.4.3 Data Engineering
Data engineering is the process of manipulating the raw data in a way that makes it useful for training and/or evaluation. Here are some typical operations that may be required to prepare the data:
• Finding near-duplicate images
• Ground-truth label smearing and merging
• Applying taxonomy rules and removing contradicting labels
• Offline augmentation
• Image cropping
• Removing low-quality images (low resolution or size) and low-confidence labels (e.g. labels that have low agreement between annotators)
• Removing inappropriate images (porn, violence, etc.)
• Sampling, querying and filtering samples satisfying certain criteria (e.g. we may want to sample no more than 1000 samples for each apparel category containing a certain attribute)
• Converting storage formats
Multimedia data takes a lot of storage compared to text and other structured data. At the same time, images are one of the most abundant types of information, with billions of images uploaded online every day. This makes data engineering for image datasets a non-trivial task.
Unlike non-multimedia datasets that may be stored in relational databases, for image datasets, it makes more sense to store them in a key-value database. In this approach, the key uniquely identifies an image (oftentimes it is an image hash), and the value includes image bytes and optionally ground-truth labels. If one image may be associated with multiple datasets or ground-truth labels, the latter may be stored in a separate relational database. This also simplifies the filtering and querying procedures. Processing such a large volume of data usually requires massive parallelism and use of frameworks implementing the MapReduce programming model. This model divides processing into three operations: map, shuffle and reduce. The map operation applies a given function to every sample of the dataset independently from other samples. The shuffle operation redistributes data among worker nodes. The reduce operation runs a summary over all the samples. There are many commercial and open-source frameworks available for distributed data processing. One of the most popular is Apache Hadoop.
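A toy, in-process analogue of the map/shuffle/reduce pattern, here grouping exact duplicates by content hash; real pipelines would run on a framework such as Hadoop, and near-duplicate detection would need perceptual rather than cryptographic hashes:

```python
import hashlib
from collections import defaultdict
from multiprocessing import Pool

def map_phase(path):
    # Map: emit (key, value); a content hash serves as an exact-duplicate key.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest(), path

def reduce_phase(groups):
    # Reduce: keep only keys shared by more than one image, i.e. duplicate groups.
    return {key: paths for key, paths in groups.items() if len(paths) > 1}

def find_exact_duplicates(image_paths):
    with Pool() as pool:
        mapped = pool.map(map_phase, image_paths)   # map, in parallel
    groups = defaultdict(list)
    for key, path in mapped:                        # shuffle: group values by key
        groups[key].append(path)
    return reduce_phase(groups)                     # reduce
```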
9.4.4 Data Management
There are several reasons why datasets should be handled by specialised software rather than just kept as a set of files on a disk. The first reason is backup and redundancy. As we mentioned above, multimedia data may take a lot of storage, which increases the chance of data corruption or loss if no backup and redundancy is used. The second is dataset versioning. In academia, it is typical for a dataset to be used without any change for over 10 years. Usually, this is very different in practical applications. Datasets created for practical use cases are always evolving – new samples added, bad samples removed, ground-truth labels could be added continuously, datasets could be merged or split, etc. This leads to a situation where introducing a bug in the dataset is very easy and debugging this bug is extremely hard. Dataset versioning tools help to treat dataset versions similarly to how code versions are treated. A number of tools exist for dataset version control, including commercial and open source. DVC is one of the most popular tools. It allows users to manage and version data and integrates with most of the cloud storage platforms. The third reason is data retention management. A lot of useful data could contain some private information or be copyrighted, especially if this data is crawled from the web. This means that such data must be handled carefully and there should be ways to remove a sample following an owner request. Some regulators also require that the data be stored only for a limited time frame, after which it should be deleted.
9.4.5 Data Augmentation
Data augmentation is a technique used to generate more training samples from an existing training dataset by applying various (usually random) transformations to the input images. Such transformations may include the following:
• Affine transformations
• Blur
• Random crops
• Colour distortions
• Impulse noise
• Non-rigid warping
• Mixing one or more images
• Overlaying a foreground image onto a different background image
The idea behind augmentation is that applying these types of transformations shouldn’t change the semantics of the image and by using them for training a model, we improve its generalisation capabilities. Choosing augmentation parameters, however, could be nontrivial because too strong augmentations can actually reduce model accuracy on a target test set. This led to a number of approaches that aim to automatically estimate optimal augmentation parameters for a given dataset and the model (Cubuk et al. 2018). Image augmentation techniques are especially useful in the context of self-supervised learning that is discussed in the following sections.
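A sketch of a simple random augmentation pipeline in NumPy; the probabilities and magnitudes here are arbitrary placeholders rather than tuned (or searched) values:

```python
import numpy as np

def augment(image, rng):
    """Apply a random subset of simple augmentations to an HxWx3 uint8 image."""
    out = image.copy()
    if rng.random() < 0.5:                       # horizontal flip
        out = out[:, ::-1]
    if rng.random() < 0.5:                       # random crop of up to 10% per side
        h, w = out.shape[:2]
        dy = int(rng.integers(0, h // 10 + 1))
        dx = int(rng.integers(0, w // 10 + 1))
        out = out[dy:h - dy, dx:w - dx]
    if rng.random() < 0.5:                       # brightness / colour jitter
        out = np.clip(out.astype(np.float32) * rng.uniform(0.7, 1.3), 0, 255).astype(np.uint8)
    if rng.random() < 0.3:                       # impulse (salt-and-pepper) noise
        mask = rng.random(out.shape[:2]) < 0.01
        out[mask] = rng.integers(0, 256, size=(int(mask.sum()), out.shape[2]))
    return out

# Example usage: rng = np.random.default_rng(0); sample = augment(image, rng)
```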
9.5 Model Training
The deep learning revolution made most of the classical computer vision approaches to image classification obsolete. Figure 9.6 shows the progress of image classification models over the last 10 years on the popular ImageNet dataset (Deng et al. 2009). The only classical computer vision approach shown is SIFT-FV, with 50.9% top-1 accuracy. The best deep learning model is over four times better in terms of classification error (EfficientNet-L2). For quite some time, deep learning was considered a high-accuracy but high-cost approach because it required considerable computational resources to run inference. In recent years, however, much new specialised hardware has been developed to speed up training and inference. Today, most flagship smartphones have some version of a neural network accelerator. Reducing the cost and time of training and inference is one of the factors behind the increasing popularity of deep learning. Another factor is a change of software engineering paradigm that some call "Software 2.0". In this paradigm, developers no longer build a system piece by piece. Instead, they specify the objective, collect training examples and let optimisation algorithms build the desired system. This paradigm turns system development into a set of experiments that are easier to parallelise, which speeds up progress.
Fig. 9.6 Evolution of image classification models’ top 1 accuracy on ImageNet dataset
The amount of research produced every year in machine learning and computer vision is hard to keep up with. The NeurIPS conference shows a growing trend in paper submissions (e.g. in 2019 there were 6743 submissions, of which 1428 papers were accepted). Not all of the published research passes the test of practice. In this chapter, we briefly discuss the most effective model architectures and training approaches.
9.5.1 Model Architecture
A typical neural network-based image classifier can be divided into a backbone and prediction layers (Fig. 9.7). The backbone consists of the input layer, hidden layers and the feature (or bottleneck) layer. It takes an image as input and produces the image representation, or features. The predictor then uses this representation to predict classes. This division is rather virtual: it conveniently separates a reusable and more complex part that extracts the image representation from a predictor that is usually a simple fully connected network with a few layers. In the following, we focus on choosing the architecture for the backbone part of the classifier. Over a hundred different neural network model architectures exist today. However, nearly all successful architectures are based on the concept of the residual network (ResNet) (He et al. 2016). A residual network consists of a chain of residual blocks, as depicted in Fig. 9.8. The main idea is to have a shortcut connection between the input and the output. Such a connection allows the gradients to flow freely and avoids the vanishing gradient problem that for many years prevented building very deep neural networks. There are several modifications of the standard residual block. One modification proposes adding more levels of shortcut connections (DenseNet).
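The backbone/predictor division can be expressed directly in code, as in the following sketch (assuming PyTorch and torchvision are available; the `weights` argument name follows recent torchvision releases, older ones use `pretrained=True`, and the number of classes is a placeholder):

```python
import torch
from torch import nn
from torchvision import models

# Backbone: a ResNet-50 with its classification layer removed, so it outputs
# the feature (bottleneck) vector instead of class scores.
backbone = models.resnet50(weights="IMAGENET1K_V1")
feature_dim = backbone.fc.in_features            # 2048 for ResNet-50
backbone.fc = nn.Identity()                      # keep only the representation

# Predictor: a small fully connected network on top of the representation.
num_classes = 10                                 # placeholder for the target task
predictor = nn.Sequential(
    nn.Linear(feature_dim, 512),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(512, num_classes),
)

classifier = nn.Sequential(backbone, predictor)
logits = classifier(torch.randn(1, 3, 224, 224))  # dummy image batch
```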
Fig. 9.7 Neural network-based image classifier

Fig. 9.8 Residual block
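A minimal residual block along the lines of Fig. 9.8 can be written as follows (a simplified PyTorch sketch; production ResNet blocks additionally handle changes in spatial resolution and channel count with a projection shortcut):

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """A basic residual block: output = activation(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The shortcut lets gradients bypass the weight layers, mitigating
        # the vanishing gradient problem in very deep networks.
        return self.act(self.body(x) + x)

block = ResidualBlock(64)
out = block(torch.randn(1, 64, 56, 56))   # same shape in and out
```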
Another modification proposes to assign and learn weights for the shortcut connection (HighwayNet).
Deploying an ML model is a trade-off between cost, latency and accuracy. Cost is mostly managed through the hardware that runs the model inference. More costly and powerful hardware can either run a model faster (lower latency) or run a larger and more accurate model at the same latency. If the hardware is fixed, then the trade-off is between the model size (i.e. latency) and the accuracy. When choosing or designing a ResNet-based model architecture, the accuracy-latency trade-off is achieved by varying the following parameters:
1. Resolution: this includes the input image resolution as well as the resolution of intermediate feature maps.
2. Number of layers or blocks of layers (such as residual blocks).
3. Block width, which refers to the number of channels of the feature maps of the convolutional layers.
By varying model depth, width and resolution, one can influence model accuracy and inference speed. However, predicting how accuracy will change as a result of changing one of those parameters is not possible. The same applies to predicting model inference speed. Even though it is straightforward to estimate the amount of computation and memory required for a new architecture, different hardware may run certain architectures that require more computation faster than others that require less. Because of the abovementioned problems, designing model architectures used to be as much an art as a science. Recently, however, more and more successful architectures have been designed by optimisation or search algorithms (Zoph and Le 2016). Examples of architectures designed by an algorithm are EfficientNet (Tan and Le 2019) and MobileNetV3 (Howard et al. 2019). Both architectures are a result of neural architecture search. The difference is that EfficientNet is a group of models optimised for general use, while MobileNetV3 is specifically optimised to deliver the lowest latency on mobile devices.
9.5.2 Classification Model Training Approaches
Depending on the amount and kind of training data, privacy requirements, available resources to spend for model improvement and other constraints and considerations, different training approaches could be chosen. The most straightforward approach is the fine-tuning of the pre-trained model. The idea here is to find a pre-trained visual model and fine-tune it using a collected dataset. Fine-tuning in this case means training a model that has a backbone initialised from the pre-trained model and a predictor initialised randomly. Thousands of models pre-trained on various datasets are available online for download. There are two modes of fine-tuning: full and partial. Full-model fine-tuning trains the entire model, while partial freezes most of the model and trains only certain layers. Typically, in the latter mode, the backbone is frozen while the predictor is trained. This mode is used when the dataset size is small or when there is a need to do a quick training. Fine-tuning mode is also used as a baseline before using other ways of improving model accuracy. As was mentioned above, data labelling is one of the most expensive stages of building an image classification system. Acquiring unlabelled data on the other hand could be much easier. Thus, it is very common to have a small labelled dataset and much larger dataset with no or weak labels. Self-supervised and semi-supervised approaches aim to utilise the massive amounts of unlabelled data to improve model performance. The general idea is to pre-train a model on a large unlabelled dataset in an unsupervised mode and then fine-tune it using a smaller labelled dataset. This idea is not new and has been known for about 20 years. However, only recent advances in unsupervised pre-training made it possible for such models to compete with fully supervised training regimes where all the data is labelled. One of the most successful approaches for unsupervised pre-training is contrastive learning (Chen et al. 2020a). The idea of contrastive learning is depicted in Fig. 9.9. An unlabelled input image is transformed by two random transformation functions. Those two transformed images are fed into an encoder network to produce corresponding representations. Finally, two representations are used as inputs to the projection network whose outputs are used to compute consistency loss. Consistency loss pushes two projections from the same image to be close, while projections from different images are pushed to be far away. It was shown that unsupervised contrastive learning works best with large convolutional neural networks (Chen et al. 2020a, b). In order to make this approach more practical, one of the ways is to use knowledge distillation. Knowledge distillation in a neural network consists of two steps:
Fig. 9.9 Contrastive unsupervised learning
1. In the first step, a large model or an ensemble of models is trained using ground-truth labels. This model is called the teacher model.
2. In the second step, a (typically) smaller network is trained using the predictions of the teacher network as ground truth. This model is called the student network.
In the context of semi-supervised learning, a larger teacher model is trained using unlabelled data and is then distilled into a smaller student network. It was shown that such distillation is possible with negligible accuracy loss (Hinton et al. 2015). Knowledge distillation is not only useful for semi-supervised approaches but also serves as a way to control model size and computational requirements while keeping accuracy high.
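A common way to implement the student's training objective is to mix a soft-target term computed from the teacher's outputs with the usual hard-target loss, as in the sketch below (PyTorch; the temperature and mixing weight are illustrative defaults, not values from this chapter):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.5):
    """Combine soft-target loss (teacher) with hard-target loss (labels)."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# During training, the teacher runs without gradient tracking:
#   with torch.no_grad():
#       teacher_logits = teacher(images)
#   loss = distillation_loss(student(images), teacher_logits, labels)
```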
Another approach that is related to semi-supervised learning is domain transfer learning. Domain transfer is the problem of using one dataset to train a model that makes predictions on a dataset from a different domain. Examples of domain transfer problems include training on synthetic data and predicting on real data, training on e-commerce images of products and predicting on user images, and so on. There are two ways of tackling the domain transfer problem, depending on whether some training data from the target domain is available:
1. If data from the target domain is available, then contrastive training that aims to minimise the distance between the source and target domains is one of the most successful approaches.
2. If target domain samples are unavailable, then the goal is to train a model that is robust to the domain shift. In this case, heavy augmentation, regularisation and contrastive losses help.
Another common problem when training image classification models is data imbalance. Almost every real-life dataset has some kind of imbalance, which manifests in some classes having orders of magnitude more training samples than others. Data imbalance may result in a biased classifier. The most obvious way to solve the problem is to collect more data for the underrepresented classes. This, however, is not always possible or could be too costly. Another widely used approach is under- or over-sampling. The idea is to use fewer samples of the overrepresented classes and to duplicate samples of the underrepresented classes during training. This approach is simple to implement, but the accuracy improvement for the underrepresented classes often comes at the price of reduced accuracy for the overrepresented ones. There is also a vast amount of research aimed at handling data imbalance by building a better loss function (see Cao et al. 2019, for instance).
One more training approach we would like to mention in this chapter is federated learning (FL) (Yang et al. 2019). Federated learning is a type of privacy-preserving learning where no central data store is assumed. As shown in Fig. 9.10, in FL there is a centralised server that performs federated learning and a sufficiently large number of user devices. Each user's device downloads a shared model from the server, uses it and computes gradients using only the data available on that device. According to a schedule, those gradients are sent to the centralised server, where the gradients from all users are integrated into the shared model. This approach guarantees that no actual user data can leak from the centralised server. It has been gaining popularity recently since it allows improving model performance using users' data without compromising privacy.
Fig. 9.10 Federated learning diagram
9.6 Deployment
After the model is trained and evaluated, the final step is model deployment. Deployment in the context of this chapter means running your model in production to classify images coming from the users of your service. The factors that are important at this step are latency, throughput and cost. Latency means how quickly a user gets a response from your service after sending an input image, and throughput means how many requests your service can process per unit of time without failures. Other important factors during the model deployment stage are the convenience of updating the model, proper versioning and the ability to run A/B tests. The latter depend on the software framework chosen for deployment. TensorFlow Serving is a good example of a platform that delivers high-performance serving of models with gRPC and REST client support. The way to reduce the latency or increase the throughput of the model serving system is to use more powerful hardware. This can be done either by building your own server or by using one of many cloud solutions. Most of the cloud solutions for serving models offer instances with modern GPU support that provide much better efficiency than CPU-only solutions. Another alternative to GPUs for accelerating neural networks is tensor processing units (TPUs), which were specifically designed for running neural network training and inference. Another aspect of model deployment that is important to keep in mind is protecting the model from theft. Protecting the confidentiality of ML models is important for two reasons: (a) a model can be a business advantage to its owner, and (b) an adversary may use a stolen model to find transferable adversarial examples that can evade classification by the original model. Several methods were proposed recently to detect model stealing attacks, as well as to protect against them by embedding watermarks into the neural networks (Juuti et al. 2019; Uchida et al. 2017).
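As an illustration, a client request to a TensorFlow Serving REST endpoint might look like the following sketch (the host, port, model name and the assumption that the exported model accepts base64-encoded image bytes are placeholders specific to a particular deployment):

```python
import base64
import requests

# TensorFlow Serving typically exposes its REST API on port 8501 as
# /v1/models/<model_name>:predict; adjust host, port and name to your setup.
URL = "http://localhost:8501/v1/models/image_classifier:predict"

with open("example.jpg", "rb") as f:                   # placeholder image file
    payload = {"instances": [{"b64": base64.b64encode(f.read()).decode()}]}

response = requests.post(URL, json=payload, timeout=5.0)
response.raise_for_status()
print(response.json()["predictions"])                  # class scores from the model
```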
References Cao, K., Wei, C., Gaidon, A., Arechiga, N., Ma, T.: Learning imbalanced datasets with labeldistribution-aware margin loss. In: Advances in Neural Information Processing Systems, pp. 1567–1578 (2019) Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 (2020a) Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.: Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029 (2020b) Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501 (2018) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) Fei-Fei, L., Russakovsky, O.: Analysis of large-scale visual recognition. Bay Area Vision Meeting (2013)
Hastings, R.: Making the most of the cloud: how to choose and implement the best services for your library. Scarecrow Press (2013) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Hinton, G., Vinyals, O, Dean, J.: Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531 (2015) Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q.V.: Searching for mobilenetv3. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324 (2019) Jia, M., Shi, M., Sirotenko, M., Cui, Y., Cardie, C., Hariharan, B., Adam, H., Belongie, S.: Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset. arXiv preprint arXiv:2004.12276 (2020) Juuti, M., Szyller, S., Marchal, S., Asokan, N.: PRADA: protecting against DNN model stealing attacks. In: Proceedings of the IEEE European Symposium on Security and Privacy (EuroS&P), pp. 512–527 (2019) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision, pp. 740–755. Springer, Cham (2014) Mozur, P.: One month, 500,000 face scans: How China is using A.I. to profile a minority (2019). Accessed on September 27 2020. https://www.nytimes.com/2019/04/14/technology/chinasurveillance-artificial-intelligence-racial-profiling.html Neuberger, A., Alshan, E., Levi, G., Alpert, S., Oks, E.: Learning fashion traits with label uncertainty. In: Proceedings of KDD Workshop Machine Learning Meets Fashion (2017) Nikolenko, S. I.: Synthetic data for deep learning, arXiv preprint arXiv:1909.11512 (2019) Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G., Tran, D.: Measuring calibration in deep learning. In: Proceedings of the CVPR Workshops, pp. 38–41 (2019) Schröder, C., Niekler, A.: A survey of active learning for text classification using deep neural networks, arXiv preprint arXiv:2008.07267 (2020) Singh, A., Sha, J., Narayan, K.S., Achim, T., Abbeel, P.: Bigbird: A large-scale 3d database of object instances. In: Proceedings of the IEEE International Conference on Robotics and Automation, pp. 509–516 (2014) Tan, M., Le, Q. V.: Efficientnet: Rethinking model scaling for convolutional neural networks, arXiv preprint arXiv:1905.11946 (2019) Uchida, Y., Nagai, Y., Sakazawa, S., Satoh, S.I.: Embedding watermarks into deep neural networks. In: Proceedings of the ACM on International Conference on Multimedia Retrieval, pp. 269–277 (2017) y Arcas, B.A., Mitchell, M., Todorov, A.: Physiognomy’s New Clothes. In: Medium. Artificial Intelligence (2017) Accessed on September 27 2020 https://medium.com/@blaisea/ physiognomys-new-clothes-f2d4b59fdd6a. Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. 10(2), 1–19 (2019) Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning, arXiv preprint arXiv:1611.01578 (2016)
Chapter 10
Mobile User Profiling
Alexey M. Fartukov, Michael N. Rychagov, and Lyubov V. Stepanova
10.1 Introduction
It would be no exaggeration to say that modern smartphones fully reflect the personality of their owners. Smartphones are equipped with a myriad of built-in sensors (Fig. 10.1). At the same time, almost half of smartphone users spend more than 5 hours a day on their mobile devices (Counterpoint press release 2017). Thus, smartphones collect (or are able to collect) huge amounts of personalized data from the applications and the frequency of their usage, as well as heterogeneous data from their sensors, which record various physical and biological characteristics of an individual (Ross et al. 2019). This chapter is dedicated to the approach and methods intended for personalization of customer services and mobile applications and for unobtrusive mobile user authentication by using data generated by built-in sensors in smartphones (GPS, touchscreen, etc.) (Patel et al. 2016). It should be specially noted that collection of the abovementioned data and extraction of ancillary information, such as a person's age, gender, etc., do not lead to privacy infringement. First, such information should be extracted only with the subject's consent. Second, the information should be protected from unauthorized access. The latter requirements place restrictions on which algorithmic solutions can be selected. Enactment of the EU General Data Protection Regulation has emphasized the importance of privacy preservation (Regulation EU 2016). Discussion of data protection issues and countermeasures is out of the scope of this chapter.
A. M. Fartukov · L. V. Stepanova (*) Samsung R&D Institute Rus (SRR), Moscow, Russia e-mail: [email protected]; [email protected] M. N. Rychagov National Research University of Electronic Technology (MIET), Moscow, Russia e-mail: [email protected] © Springer Nature Switzerland AG 2021 M. N. Rychagov et al. (eds.), Smart Algorithms for Multimedia and Imaging, Signals and Communication Technology, https://doi.org/10.1007/978-3-030-66741-2_10
Fig. 10.1 Sensors and data sources available in a modern smartphone (part of this image was designed by Starline/Freepik)
The aim of the methods we present in this chapter is to enhance a customer’s convenience during his/her interaction with a smartphone. In Sect. 10.2, we describe a method for demographic prediction (Podoynitsina et al. 2017), considering demographic characteristics of a user such as gender, marital status and age. Such information plays a crucial role in personalized services and targeted advertising. Section 10.3 includes a brief overview of passive authentication of mobile users. In Section 10.4, we shed light on the procedure of dataset collection, which is needed to develop demographic prediction and passive authentication methods.
10.2 Demographic Prediction Based on Mobile User Data
Previous research on demographic prediction has been predominantly focused on separate use of data sources such as Web data (Hu et al. 2007; Kabbur et al. 2010), mobile data (Laurila et al. 2013; Zhong et al. 2013; Dong et al. 2014) and application data (Seneviratne et al. 2014). These methods have several issues. As mentioned earlier, information such as marital status, age and user data tend to be sensitive. To avoid leaking sensitive information, it is important to analyse user behaviour directly on a smartphone. This endeavour requires developing lightweight algorithms in terms of computational complexity and memory consumption. With our present research, we intend to examine usage of all possible types of available data that can be collected by modern smartphones and select the best combination of features to improve prediction accuracy. For demographic prediction, different types of data are considered: call logs, SMS logs and application usage
data among others. Most of them can be directly used as features for demographic prediction, except Web data. Thus, another issue is to develop a way to represent the massive amount of text data from Web pages in the form of an input feature vector for the demographic prediction method. To solve this issue, we propose using advanced natural language processing (NLP) technologies based on a probabilistic topic modelling algorithm (Blei 2012) to extract meaningful textual information from the Web data. In this method, we propose extracting the user's interests from Web data to construct a compact representation of the user's Web data. For example, user interests may be expressed by the following: which books the user reads or buys, which sports are interesting for the user or which purchases the user could potentially make. This compact representation is suitable for training the demographic classifier. It should be noted that the user's interests can be directly used by the content provider for better targeting of advertisement services or other interactions with the user. To achieve flexibility in demographic prediction and to provide language independence, we propose using common news categories as a model for user interests. News streams are available in all possible languages of interest. Their categories are also reasonably universal across languages and cultures. Thus, this allows one to build a multi-lingual topic model. This endeavour requires building and training a topic model with a classifier of text data for specified categories (interests). A list of desired categories can be provided by the content provider. The topic model categorizes the text extracted from Web pages. The topic model can be built with the additive regularization of topic models (ARTM) algorithm. User interests are then extracted using the trained topic model. ARTM is based on the generalization of two powerful algorithms: probabilistic latent semantic analysis (PLSA) (Hofmann 1999) and latent Dirichlet allocation (LDA) (Blei 2012). The additive regularization framework allows imposing additional necessary constraints on the topic model, such as sparseness or a desired word distribution. It should be noted that ARTM can be used not only for clustering but also for classification over a given list of categories. Next, the demographic model is trained using datasets collected from mobile users. Features from the collected data are extracted with the help of the topic model trained in the previous step. The demographic model comprises several (in our case three) demographic classifiers. Demographic classifiers must predict the age of the user (one of the following: '0–18', '19–21', '22–29', '30+'), gender (male or female) or marital status (married/not married) based on a given feature vector. The architecture of the proposed method is depicted in Fig. 10.2. It is important to emphasize that the language the user will use cannot be predicted in advance. This uncertainty means that the model must be multi-lingual: a speaker of another language may only need to load data for his/her own language. ARTM allows the inclusion of various types of modalities (translations into different languages, tags, categories, authors, etc.) into one topic model. We propose using cross-lingual features to implement a language-independent (multi-lingual) NLP procedure. The idea of cross-lingual feature generation involves training one topic model on translations of documents into different languages.
Fig. 10.2 Architecture of the proposed solution for demographic prediction
The translations are interpreted as modalities in such a way that it is possible to embed texts in different languages into the same space of latent topics. Let us consider each aspect of the proposed method in detail. First, we would like to shed light on the proposed topic model. We need to analyse the webpages viewed by the user. Such an analysis involves the following steps:
• Pre-processing of Web pages
• PLSA
• Extension of PLSA with ARTM
• Document aggregation
10.2.1 Pre-processing

As mentioned earlier, the major source of our observational data is the Web pages that the user has browsed. Pre-processing of Web pages includes the following operations: removing HTML tags, performing stemming or lemmatization of every word, removing stop words, converting all characters to lowercase and translating the webpage content into target languages. We consider three target languages (Russian, English and Korean) in the proposed algorithm. We use the 'Yandex.Translate' Web service for translation.
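The tag-stripping, lowercasing and stop-word steps can be illustrated with a short sketch (standard-library Python only; stemming, lemmatization and translation are omitted, and the stop-word list is an abbreviated placeholder):

```python
import re
from html.parser import HTMLParser

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}   # abbreviated list

class TextExtractor(HTMLParser):
    """Strip HTML tags and keep only the visible text."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def preprocess(page: str) -> list:
    extractor = TextExtractor()
    extractor.feed(page)
    text = " ".join(extractor.chunks).lower()           # lowercase every character
    tokens = re.findall(r"[a-z]+", text)                # crude tokenisation
    return [t for t in tokens if t not in STOP_WORDS]   # drop stop words

print(preprocess("<html><body><p>The user reads <b>sports</b> news.</p></body></html>"))
# ['user', 'reads', 'sports', 'news']
```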
10.2.2 Probabilistic Latent Semantic Analysis

To model the user's interests, one must analyse the documents viewed by the user. Topic modelling enables us to assign a set of topics T and to use T to estimate the conditional probability that word w appears inside document d:

$$p(w \mid d) = \sum_{t \in T} p(w \mid t)\, p(t \mid d),$$

where T is the set of topics. In accordance with PLSA, we follow the assumption that all documents in a collection inherit one cluster-specific distribution for every cluster of topic-related words. Our purpose is to assign such topics T that maximize the functional L:

$$L(\Phi, \Theta) = \ln \prod_{d \in D} \prod_{w \in d} p(w \mid d)^{n_{dw}} \;\to\; \max_{\Phi,\,\Theta},$$

where $n_{dw}$ denotes the number of times that word w is encountered in document d, $\Phi = \big(p(w \mid t)\big)_{W \times T} = (\varphi_{wt})_{W \times T}$ is the matrix of term probabilities for each topic, and $\Theta = \big(p(t \mid d)\big)_{T \times D} = (\theta_{td})_{T \times D}$ is the matrix of topic probabilities for each document. By using the abovementioned expressions, we obtain the following:

$$L(\Phi, \Theta) = \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} p(w \mid t)\, p(t \mid d) \;\to\; \max_{\Phi,\,\Theta},$$

subject to

$$\sum_{w \in W} p(w \mid t) = 1, \quad p(w \mid t) \ge 0; \qquad \sum_{t \in T} p(t \mid d) = 1, \quad p(t \mid d) \ge 0.$$
10.2.3 Additive Regularization of Topic Models

Special attention should be drawn to the following fact: zero probabilities are not acceptable to the natural logarithm in the above equation for $L(\Phi, \Theta)$. To overcome this issue, we followed the ARTM method proposed by Vorontsov and Potapenko (2015). A regularization term $R(\Phi, \Theta)$ is added:

$$R(\Phi, \Theta) = \sum_{i=1}^{r} \tau_i R_i(\Phi, \Theta), \quad \tau_i \ge 0,$$

where $\tau_i$ is a regularization coefficient and $R_i(\Phi, \Theta)$ is a set of different regularizers. In the proposed method, we use the smoothing regularizers for both matrices $\Phi$ and $\Theta$. Let us define the Kullback-Leibler divergence as follows:

$$\mathrm{KL}(p \,\|\, q) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i}.$$

The Kullback-Leibler divergence evaluates how well the distribution p approximates another distribution q in terms of information loss. To apply the smoothing regularization to the values $p(w \mid t)$ in the matrix $\Phi$, one must find a fixed distribution $\beta = (\beta_w)_{w \in W}$ which can approximate $p(w \mid t)$. Thus, we look for the minimum of the KL values:

$$\sum_{t \in T} \mathrm{KL}_w\big(\beta_w \,\|\, \varphi_{wt}\big) \;\to\; \min_{\Phi}.$$

Similarly, to apply the smoothing regularization to the matrix $\Theta$, we leverage another fixed distribution $\alpha = (\alpha_t)_{t \in T}$ that can approximate $p(t \mid d)$:

$$\sum_{d \in D} \mathrm{KL}_t\big(\alpha_t \,\|\, \theta_{td}\big) \;\to\; \min_{\Theta}.$$

To achieve both minima, we combine the last two expressions into a single regularizer $R_s(\Phi, \Theta)$:

$$R_s(\Phi, \Theta) = \beta_0 \sum_{t \in T} \sum_{w \in W} \beta_w \ln(\varphi_{wt}) + \alpha_0 \sum_{d \in D} \sum_{t \in T} \alpha_t \ln(\theta_{td}) \;\to\; \max.$$

Finally, we combine $L(\Phi, \Theta)$ with $R_s(\Phi, \Theta)$ in a single formula:

$$L(\Phi, \Theta) + R_s(\Phi, \Theta) \;\to\; \max_{\Phi,\,\Theta}.$$

We maximize this expression using the EM algorithm by Dempster et al. (1977).
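To make the optimisation scheme concrete, the following sketch (a simplified illustration, not the authors' implementation) runs EM for PLSA with a uniform smoothing prior playing the role of the regularizer; the full ARTM framework adds regularizer gradients to the M-step:

```python
import numpy as np

def plsa_em(n_dw, num_topics, iters=50, alpha0=0.1, beta0=0.1, seed=0):
    """EM for PLSA with simple smoothing (an ARTM-style simplification).

    n_dw: documents x words count matrix; returns Phi (words x topics)
    and Theta (topics x documents), both column-normalised.
    """
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    phi = rng.random((W, num_topics))
    phi /= phi.sum(0, keepdims=True)
    theta = rng.random((num_topics, D))
    theta /= theta.sum(0, keepdims=True)
    for _ in range(iters):
        n_wt = np.zeros((W, num_topics))
        n_td = np.zeros((num_topics, D))
        for d in range(D):
            # E-step: posterior p(t | d, w) for every word of document d
            p_tdw = phi * theta[:, d]                  # shape (W, T)
            p_tdw /= p_tdw.sum(1, keepdims=True) + 1e-12
            weighted = n_dw[d][:, None] * p_tdw        # expected counts
            n_wt += weighted
            n_td[:, d] += weighted.sum(0)
        # M-step with uniform smoothing priors (beta0, alpha0)
        phi = n_wt + beta0
        phi /= phi.sum(0, keepdims=True)
        theta = n_td + alpha0
        theta /= theta.sum(0, keepdims=True)
    return phi, theta

phi, theta = plsa_em(np.random.poisson(1.0, size=(20, 100)).astype(float), num_topics=5)
```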
10.2.4 Document Aggregation

Table 10.1 Parameters adjusted by a genetic algorithm

Parameter            Minimum value   Maximum value
Number of neurons    8               256
Learning rate        0.0001          0.1
Weights decay        0               0.01
Gradient moment      0               0.95
Weights deviation    0.00001         0.1

After performing step 3, it is possible to describe each topic t with the set of its words w using the probabilities $p(w \mid t)$. It also becomes possible to map each input document d into a vector of topics T in accordance with the probabilities $p(t \mid d)$. At the next step, we need to aggregate all topic information about the documents $d_1^u, \ldots, d_n^u$ viewed by the user u into a single vector. Thus, we average the obtained topic vectors $p_u(t_i \mid d_j)$ in the following manner:

$$p_u(t_i \mid d) = \frac{1}{N_d} \sum_{j=1}^{N_d} p_u\big(t_i \mid d_j^u\big),$$
where $N_d$ denotes the number of documents viewed by a user u and $d_j^u$ is the j-th document viewed by the user. The resulting topic vector (or user interest vector) is used as the feature vector for the demographic model. Let us consider it in detail. The demographic model consists of several demographic classifiers. In the proposed method, the following classifiers are used: age, gender and marital status. In the present work, a deep learning approach is used to build such classifiers, and the Veles framework (Veles 2015) is used as the deep learning platform. Each classifier is built with a neural network (NN) and optimized with a genetic algorithm. The NN architecture is based on the multi-layer perceptron (Collobert and Bengio 2004). It should be noted that we determined the possible NN architecture of each classifier and its optimal hyper-parameters using a genetic algorithm. We used the following hyper-parameters of the NN architecture: size of the minibatch, number of layers, number of neurons in each layer, activation function, dropout, learning rate, weight decay, gradient moment, standard deviation of weights, gradient descent step, regularization coefficients, initial ranges of weights and number of examples per iteration. Using a genetic algorithm, we can adjust these hyper-parameters (Table 10.1). We also use a genetic algorithm to select optimal features in the input feature vector and to reduce the size of the input feature vector of the demographic model. When operating the genetic algorithm, a population P with M = 75 instances of demographic classifiers with the abovementioned parameters is created. Next, we use error backpropagation to train these classifiers. Based on the training results, the classifiers with the highest performance in terms of demographic profile prediction are chosen. Subsequently, to add new classifiers into the population, a crossover operation is applied. Such a crossover includes random substitution of numbers taken from the parameters of two original classifiers if they do not coincide between these classifiers. Let us illustrate the process with the following example. If classifier C1 contains n1 = 10 neurons in the first layer, and classifier C2 contains n2 = 100 neurons, then replacement of this value with 50 may be performed in the crossover operation. The
newly created classifier C3 = crossover(C1, C2) replaces the classifier that shows the worst performance in the population P. To introduce modifications of the best classifiers, a mutation operation is also applied. Each classifier with new parameters is added to the population of classifiers; subsequently, all new classifiers are retrained and their performance is measured. This process is continued as long as classification quality improves. The demographic classifier with the best performance in the last population is chosen as the final one. Let us consider the results of the proposed method. To build robust demographic prediction models, we collected an extensive dataset with various available types of features from mobile users. The principles and a detailed description of the dataset are given in Sect. 10.4. For demographic prediction, we explored different machine learning approaches and methods: support vector machines (SVMs) (Cortes and Vapnik 1995), NNs (Bishop 1995) and logistic regression (Bishop 2006). We performed accuracy tests (without optimization), the results of which are presented in Table 10.2. Based on the obtained results, the NN approach was chosen to build the demographic prediction classifiers. Although a myriad of freely available deep learning frameworks for training NNs exist (Caffe, Torch, Theano, etc.), we decided to use our custom deep learning framework (Veles 2015) because it was designed as a very flexible tool in terms of workflow construction, data extraction, pre-processing and visualization. It also has an additional advantage: the ease of porting the resulting classifier to mobile devices. It should be noted that we also had to optimize the number of topic features: we decreased the initial number of ARTM features from 495 to 170. The demographic prediction accuracies using topic features generated from ARTM and LDA are shown in Table 10.3.
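The crossover/mutation loop described above can be sketched as follows (a toy illustration with made-up parameter names and a stand-in fitness function; in the actual system each candidate configuration trains a neural network and its validation accuracy serves as the fitness):

```python
import random

# A candidate classifier configuration is just a dictionary of hyper-parameters.
BOUNDS = {"neurons": (8, 256), "lr": (1e-4, 1e-1), "decay": (0.0, 0.01)}

def random_config():
    return {k: random.uniform(*v) for k, v in BOUNDS.items()}

def crossover(c1, c2):
    """Blend two parent configurations parameter by parameter."""
    return {k: random.choice([c1[k], c2[k], 0.5 * (c1[k] + c2[k])]) for k in BOUNDS}

def mutate(cfg, rate=0.2):
    """Randomly perturb some parameters within their allowed ranges."""
    return {k: (random.uniform(*BOUNDS[k]) if random.random() < rate else v)
            for k, v in cfg.items()}

def evolve(fitness, population_size=75, generations=20):
    population = [random_config() for _ in range(population_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: population_size // 2]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(population_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

# The stand-in fitness keeps the sketch runnable without training anything.
best = evolve(fitness=lambda cfg: -abs(cfg["lr"] - 0.01))
```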
Table 10.2 Test results (without optimization); accuracy, %

Method                 Gender   Marital status   Age
Fully connected NNs    84.85    68.66            51.25
Linear SVM             75.76    61.19            42.42
Logistic regression    72.73    52.24            43.94

Table 10.3 A comparison of accuracy in the case of ARTM and LDA; demographic prediction accuracy, %

Task             ARTM   LDA
Gender           93.7   88.9
Marital status   87.3   79.4
Age              62.9   61.3
In accordance with experimental results, the proposed method achieves demographic prediction accuracies on gender, marital status and age as high as 97%, 94% and 76%, respectively.
10.3 Behaviour-Based Authentication on Mobile Devices

Most authentication methods implemented on modern smartphones involve explicit authentication, which requires specific and separate interactions with a smartphone to gain access. This is true of both traditional (based on a passcode, screen pattern, etc.) and biometric (based on face, fingerprint or iris) methods. Smartphone unlock, authorization for payment and accessing secure data are the most frequent operations that users perform on a daily basis. Such operations often force users to focus on the authentication step and may be annoying for them. The availability of different sensors in modern smartphones and recent advances in deep learning techniques, coupled with an ever-faster system on a chip (SoC), allow one to build unique behavioural profiles of users based on position, movement, orientation and other data from the mobile device (Fig. 10.3). Using these profiles ensures convenient and non-obtrusive ways for user authentication on a smartphone. This approach is called passive (or implicit) authentication. It provides an additional layer of security by continuously monitoring the user's interaction with the device (Deb et al. 2019). This section contains a brief overview of the passive authentication approach.

Fig. 10.3 Approaches to user authentication on a smartphone

Fig. 10.4 A mobile passive authentication framework

In accordance with papers by Crouse et al. (2015) and Patel et al. (2016), passive authentication includes the following main steps (Fig. 10.4). Monitoring incoming data from smartphone sensors begins immediately after the smartphone starts up. During the monitoring, the incoming data are passed to an implicit authentication step to establish a person's identity. Based on the authentication result, a decision is made whether the incoming data correspond to a legitimate user. If the collected
behavioural profile corresponds to a legitimate user, the new incoming data are passed to implicit authentication step. Otherwise, the user will be locked out, and the authentication system will ask the user to verify his/her identity by using explicit authentication methods such as a password or biometrics (fingerprint, iris, etc.). The framework that is elaborated above determines requirements that the implicit authentication should satisfy. First, methods applied for implicit authentication should be able to extract representative features that reflect user uniqueness from noisy data (Hazan and Shabtai 2015). In particular, a user’s interaction with the smartphone causes high intra-user variability, which should be handled effectively by methods of implicit authentication. Second, these methods should process sensor data and profile a user in real time on the mobile device without sending data out of the mobile device. This process indicates that implicit authentication should be done without consuming much power (i.e. should have low battery consumption). Fortunately, SoCs that are used in modern smartphones include special low-power blocks aimed at real-time management of the sensors without waking the main processor (Samsung Electronics 2018). Third, implicit authentication requires protecting both collected data and their processing. This protection can be provided by a special secure (trusted) execution environment, which also imposes additional restrictions on available computational resources. These include the following: restricted number of available processor cores, reduced frequencies of the processor core(s), unavailability of extra computational hardware accelerators (e.g. GPU) and a limited amount of memory (ARM Security Technology 2009).
The application of machine learning techniques to task of implicit authentication dictates the need to use online or offline training approaches for an authentication model (Deb et al. 2019). Each training approach has its inherent advantages and disadvantages, and selection of a training approach is determined by applied algorithms and, concomitantly, by the nature of the input data. An online training approach requires collection of user data for a certain period of time and training of an authentication model on the device. As a result, an individual model for each user is obtained. A delay in model deployment due to the necessity to collect enough data for training is the main disadvantage of online training. Deb et al. (2019) also highlighted difficulties in estimating authentication performance across all the users in case of online training. The idea of an offline training approach is to obtain a common authentication model for all users by performing training on pre-collected sensor data. In this case, the authentication model learns distinctive features and can be deployed immediately without any delay. The main challenge of offline training is the necessity of dataset collection, which leads to the development of data collection procedures and their implementation (described in Sect. 10.4). It should be noted that pre-trained authentication models can be updated during the operational mode. Thus, we can talk about the sequential application of the abovementioned approaches. Many passive authentication methods have been described in the literature. All of them can be divided into two categories: methods that use a single modality (touch dynamics, usage of mobile applications, movement, GPS location, etc.) and multimodal methods, which fuse decisions from multiple modalities to authenticate the user. Historically, early work on passive smartphone authentication had been dedicated to methods that use a single modality (Patel et al. 2016). In particular, the smartphone’s location can be considered a modality. Simple implicit authentication based on GPS location is implemented in Smart Lock technology by Google for Android operating system (OS) (Google 2020). A smartphone stays unlocked when it is in a trusted location. The user can manually set up such locations (e.g. home or work) (Fig. 10.5). An evolutionary way of developing this idea is automatic determination of trusted locations and routes specific to the user. When the user is in such a location or on the route, explicit authentication for unlocking the smartphone can be avoided. Evidence for the possibility to use information about locations and routes for implicit authentication is that most people regularly visit the same places and choose the same routes. Fluctuations in location coordinates over time can be considered as time series, and the next location point can be predicted based on history and ‘learned’ habits. The abovementioned factors lead to the idea of predicting next location points and their comparison to user’s ‘usual’ behaviour at a particular time. It was successfully implemented by method based on gradient boosting (Chen and Ernesto 2016). The method includes median filtering of input latitude and longitude values to reduce noise in location data. The method considers account timestamps corresponding to location data: time, day of week (e.g. working day/weekend) and month. A decision for user authentication is based on the difference between the actual and predicted locations at a particular time. 
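As an illustration of this kind of location modelling, the following sketch (synthetic data, scikit-learn gradient boosting and a simple median filter; not the authors' implementation, which relies on XGBoost) predicts the expected coordinates from time-derived features and flags large deviations:

```python
import numpy as np
from scipy.signal import medfilt
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical location history: unix timestamps plus latitude/longitude.
timestamps = np.arange(0, 14 * 24 * 3600, 600)                 # two weeks, every 10 min
lat = 55.75 + 0.01 * np.sin(timestamps / 86400.0) + 0.001 * np.random.randn(len(timestamps))
lon = 37.62 + 0.01 * np.cos(timestamps / 86400.0) + 0.001 * np.random.randn(len(timestamps))

# Median filtering suppresses GPS noise before model fitting.
lat, lon = medfilt(lat, 5), medfilt(lon, 5)

# Time-derived features: hour of day and day of week.
features = np.column_stack([(timestamps // 3600) % 24, (timestamps // 86400) % 7])

model_lat = GradientBoostingRegressor().fit(features, lat)
model_lon = GradientBoostingRegressor().fit(features, lon)

def is_usual_location(ts, actual_lat, actual_lon, threshold_deg=0.02):
    """Accept the user if the observed location is close to the predicted one."""
    f = np.array([[(ts // 3600) % 24, (ts // 86400) % 7]])
    d_lat = actual_lat - model_lat.predict(f)[0]
    d_lon = actual_lon - model_lon.predict(f)[0]
    return (d_lat ** 2 + d_lon ** 2) ** 0.5 < threshold_deg

print(is_usual_location(timestamps[-1] + 600, 55.75, 37.62))
```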
It should be noted that the described method has several shortcomings: it requires online training of gradient-boosted decision trees and, more importantly, to some extent decreases the level of the smartphone's security.
Fig. 10.5 Smart Lock settings screen of an Android-based smartphone
A solution to the latter issue is to use multi-modal methods for passive authentication, which have obvious advantages over methods based on a single modality. Deb et al. (2019) described an example of a multi-modal method for passive authentication. In that paper, the authors proposed using a Siamese long short-term memory (LSTM) architecture (Varior et al. 2016) to extract deep temporal features from the data produced by a number of passive sensors in smartphones for user authentication (Fig. 10.6). The proposed passive user authentication method is based on keystroke dynamics, GPS location, accelerometer, gyroscope, magnetometer, linear accelerometer, gravity and rotation modalities and can unobtrusively verify a genuine user with a 96.47% True Accept Rate (TAR) at a 0.1% False Accept Rate (FAR) within 3 seconds.
10.4 Dataset Collection System: Architecture and Implementation Features
As we mentioned in the previous sections, building a behaviour-based user profile requires collecting datasets that accumulate various types of information from a smartphone's sensors. To collect such a dataset, we developed the following system.
Fig. 10.6 Architecture of the model proposed by Deb et al. (2019). (Reproduced with permission from Deb et al. 2019)
The main tasks of the system include data acquisition, which tracks usual user interactions with a smartphone, subsequent storage and transmission of the collected data, customization of data collection procedure (selection of sources/sensors for data acquisition) for individual user (or groups of users) and controlling and monitoring the data collection process. The dataset collection system contains two components: a mobile application for Android OS (hereafter called the client application) and a server for dataset storage and controlling dataset collection (Fig. 10.7). Let us consider each component in detail. The client application is designed to collect sensor data and user activities (in the background) and to send the collected data to the server via encrypted communication channel. The client application can operate on smartphones supporting Android 4.0 or higher. Because users would continue to use their smartphones in a typical manner during dataset collection, the client application is optimized in terms of battery usage. Immediately after installation of the application on a smartphone, it requests the user to complete a ground-truth profile. This information is needed for further verification of the developed methods. The client collects the following categories of information: ‘call + sensors’ data, application-related data and Web data. ‘Call + sensors’ data comprises SMS and call logs, battery and light sensor status, location information (provided by GPS and cell towers), Wi-Fi connections, etc. All sensitive information (contacts used for calls and SMS, SMS messages, etc.) is transformed in a non-invertible manner (hashed) to ensure the user’s privacy. At the same time, the hashed information can be used to characterize user behaviour. Physical sensors provide an additional data source about the user context. Information about the type and current state of the battery can reveal the battery charging pattern. The light sensor helps to determine ambient light detection.
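The non-invertible transformation of sensitive identifiers can be as simple as a keyed hash, as in the sketch below (the secret key and the example value are placeholders; a real implementation would keep the key in secure storage on the device):

```python
import hashlib
import hmac

# A per-installation secret key prevents dictionary lookups of, e.g., phone numbers.
SECRET_KEY = b"device-local-secret"        # placeholder; stored securely on the device

def anonymize(value: str) -> str:
    """Non-invertible transform of a sensitive identifier (contact, SMS text, ...).

    The same input always maps to the same token, so behaviour patterns are
    preserved, while the original value cannot be recovered from the token.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

print(anonymize("+7 916 123-45-67"))       # stable pseudonym for this contact
```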
Fig. 10.7 Architecture of the dataset collection system
Location information (a pair of latitude and longitude coordinates accompanied by a time label) is used to extract meaningful labels for places that the user visits frequently (e.g. home, work) and to determine routes specific to the user. Application-related data includes a list of installed and running applications, the frequency and duration of particular application usage, etc. Market-related information about applications can be obtained from the Google Play store, the Amazon App store, etc. Tracking (which applications are used at what time, location, etc.) provides the richest information to determine the context and predict the user's behaviour. Web data can be obtained from various mobile browsers (e.g. Google Chrome) by using Android platform content provider functions to get the history. The browsing history is then used to get textual (Web page content) information for further analysis by NLP algorithms. It should be noted that Bird et al. (2020) support the finding that browsing profiles are highly distinctive and stable characteristics for user identification. An important point is that the user fully controls the collected information and can enable/disable the collection process at any time (in case a user decides that data collection could violate his/her privacy at the moment). The client explicitly selects which data will be collected. In addition, the user can modify the data transfer policy (at night hours, via Wi-Fi connection only, etc.) for his/her convenience and to save mobile traffic.
Fig. 10.8 Visualization of history and current distribution of participants by gender and marital status
Another component of the data collection system is the server, which is intended to store the collected data and to control the collection process (Fig. 10.7). Let us briefly describe the main functions of the server:
1. Storing collected data in a common dataset.
2. Monitoring the activity of the client application. This includes tracking client activation events; collecting information about the versions of the client application currently used for data collection and about the number and frequency of data transactions between the clients and the server; and gathering other technical information about the client application's activities.
3. Providing statistical information. It is important to obtain up-to-date information about the current status of the dataset collection. The server should be able to provide, in a convenient way, information about the amount of already collected data, the number of data collection participants, their distribution by age/gender and marital status (Fig. 10.8), etc.
4. Control of the dataset collection procedure. The server should be able to set up a unified configuration for all clients. Such a configuration determines the set of sensors that should be used for data collection and the duration of data collection. Participants can be asked different questions to establish additional data labelling or ground-truth information immediately before the data collection. Another aspect of data collection control is notifying participants that the client application should be updated or requires special attention during data collection. This function is implemented as push notifications received from the server.
5. System health monitoring and audit. The aim of this function is to detect software issues promptly and provide information that can help to fix them. For this purpose, the server should be able to identify potentially problematic user environments (by automatically collecting and sending to the server the client
application's logs in case of any error) and to collect participants' feedback, which can be obtained via the 'Comment to developer' function in the client application.
We have successfully applied the described system for dataset collection. More than 500 volunteers were involved in the data collection, which lasted from March 2015 to October 2016. Each volunteer supplied data for at least 10 weeks. To associate the collected data with demographic information, each volunteer had to complete a questionnaire with the following questions: date of birth, gender, marital status, household size, job position, work schedule, etc. As mentioned earlier, the data collection procedure was fully anonymized.
References ARM Security Technology: Building a Secure System Using TrustZone Technology. ARM Limited (2009). https://developer.arm.com/documentation/genc009492/c Bird, S., Segall, I., Lopatka, M.: Replication: why we still can’t browse in peace: on the uniqueness and reidentifiability of web browsing histories. In: Proceedings of the Sixteenth Symposium on Usable Privacy and Security, pp. 489–503 (2020) Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press, New York (1995) Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, New York (2006) Blei, D.M.: Probabilistic topic models. Commun. ACM. 55(4), 77–84 (2012) Chen, T., Ernesto, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016) Collobert, R., Bengio, S.: Links between perceptrons, MLPs and SVMs. In: Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1–8 (2004) Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995) Counterpoint Technology Market Research: Almost half of smartphone users spend more than 5 hours a day on their mobile device (2017) (Accessed on 29 September 2020). https://www. counterpointresearch.com/almost-half-of-smartphone-users-spend-more-than-5-hours-a-dayon-their-mobile-device/ Crouse, D., Han, H., Chandra, D., Barbello, B., Jain, A.K.: Continuous authentication of mobile user: fusion of face image and inertial measurement unit data. In: Proceedings of International Conference on Biometrics, pp. 135–142 (2015) Deb, D., Ross, A., Jain, A.K., Prakah-Asante, K., Venkatesh Prasad, K.: Actions speak louder than (pass)words: passive authentication of smartphone users via deep temporal features. In: Proceedings of International Conference on Biometrics, pp. 1–8 (2019) Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. Ser. B (Methodolog.). 39(1), 1–38 (1977) Dong, Y., Yang, Y., Tang, J., Yang, Y., Chawla, N.V.: Inferring user demographics and social strategies in mobile social networks. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 15–24 (2014) Google LLC: Choose when your Android phone can stay unlocked (2020) Accessed on 29 September 2020. https://support.google.com/android/answer/9075927?visit_ id¼637354541024494111-316765611&rd¼1 Hazan, I., Shabtai, A.: Noise reduction of mobile sensors data in the prediction of demographic attributes. In: IEEE/ACM MobileSoft, pp. 117–120 (2015)
Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 50–57 (1999) Hu, J., Zeng, H.-J., Li, H., Niu, C., Chen, Z.: Demographic prediction based on user’s browsing behaviour. In: Proceedings of the 16th International Conference on World Wide Web, pp. 151–160 (2007) Kabbur, S., Han, E.-H., Karypis, G.: Content-based methods for predicting web-site demographic attributes. In: Proceedings of the 2010 IEEE International Conference on Data Mining, pp. 863–868 (2010) Laurila, J.K., Gatica-Perez, D., Aad, I., Blom, J., Bornet, O., Do, T.M.T., Dousse, O., Eberle, J., Miettinen, M.: From big smartphone data to worldwide research: the mobile data challenge. Pervasive Mob. Comput. 9(6), 752–771 (2013) Patel, V.M., Chellappa, R., Chandra, D., Barbello, B.: Continuous user authentication on mobile devices: recent progress and remaining challenges. IEEE Signal Process. Mag. 33(4), 49–61 (2016) Podoynitsina, L., Romanenko, A., Kryzhanovskiy, K., Moiseenko, A.: Demographic prediction based on mobile user data. In: Proceedings of the IS&T International Symposium on Electronic Imaging. Mobile Devices and Multimedia: Enabling Technologies, Algorithms, and Applications, pp. 44–47 (2017) Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (GDPR). Off. J. Eur. Union L119(59), 1–88 (2016) Ross, A., Banerjee, S., Chen, C., Chowdhury, A., Mirjalili, V., Sharma, R., Swearingen, T., Yadav, S.: Some research problems in biometrics: the future beckons. In: Proceedings of International Conference on Biometrics, pp. 1–8 (2019) Samsung Electronics: Samsung enables premium multimedia features in high-end smartphones with Exynos 7 Series 9610 (2018) Accessed on 29 September 2020. https://news.samsung.com/ global/samsung-enables-premium-multimedia-features-in-high-end-smartphones-with-exynos7-series-9610 Seneviratne, S., Seneviratne, A., Mohapatra, P., Mahanti, A.: Predicting user traits from a snapshot of apps installed on a smartphone. ACM SIGMOBILE Mob. Comput. Commun. Rev. 18(2), 1–8 (2014) Varior, R.R., Shuai, B., Lu, J., Xu, D., Wang, G.: A Siamese long short-term memory architecture for human re-identification. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol. 9911. Springer, Cham (2016) Veles Distributed Platform for rapid deep learning application development. Samsung Electronics (2015) Accessed on 29 September 2020. https://github.com/Samsung/veles Vorontsov, K., Potapenko, A.: Additive regularization of topic models. Mach. Learn. 101, 303–323 (2015) Zhong, E., Tan, B., Mo, K., Yang, Q.: User demographics prediction based on mobile data. Pervasive Mob. Comput. 9(6), 823–837 (2013)
Chapter 11
Automatic View Planning in Magnetic Resonance Imaging
Aleksey B. Danilevich, Michael N. Rychagov, and Mikhail Y. Sirotenko
11.1 View Planning in MRI

11.1.1 Introduction

Magnetic resonance imaging (MRI) is one of the most widely used noninvasive methods in medical diagnostic imaging. For better quality of the MRI image slices, their position and orientation should be chosen in accordance with anatomical landmarks, i.e. the respective imaging planes (views or slices) should be planned in advance. In MRI, this procedure is called view planning. The locations of the planned slices and their orientations depend on the human body parts under investigation. For example, typical cardiac view planning consists of obtaining two-chamber, three-chamber, four-chamber and short-axis views (see Fig. 11.1). Typically, view planning is performed manually by a doctor. Such manual operation has several drawbacks:
• It is time-consuming. Depending on the anatomy and study protocol, it could take up to 10 minutes and even more in special cases. The patient should stay in the scanner during this procedure.
A. B. Danilevich
Samsung R&D Institute Rus (SRR), Moscow, Russia
e-mail: [email protected]
M. N. Rychagov
National Research University of Electronic Technology (MIET), Moscow, Russia
e-mail: [email protected]
M. Y. Sirotenko (*)
111 8th Ave, New York, NY 10011, USA
e-mail: [email protected]
Fig. 11.1 Cardiac view planning
• It is operator-dependent. The image of the same anatomy produced with the same protocol by different doctors may differ significantly. This fact degrades diagnosis quality and analysis of the disease dynamics.
• It requires qualified medical personnel to do the whole workflow (including the view planning) instead of only analysing the images for diagnostics.
To overcome all these disadvantages, an automatic view planning (AVP) system may be used, which estimates the positions and orientations of the desired view planes by analysing a scout image, i.e. a preliminary image obtained prior to performing the major portion of a particular study. The desired properties of such an AVP system are:
• High acquisition speed of the scout image
• High computational speed
• High accuracy of the view planning
• Robustness to noise and anatomical abnormalities
• Support (suitability) for various anatomies
The goal of our work is an AVP system (and the respective approach) developed in accordance with the requirements listed above. We describe a fully automatic view planning framework designed to process four kinds of human anatomies, namely, brain, heart, spine and knee. Our approach is based on anatomical landmark detection and includes several anatomy-specific pre-processing and post-processing algorithms. The key features of our framework are (a) using deep learning methods for robust detection of the landmarks in rapidly acquired low-resolution scout images, (b) unsupervised learning for overcoming the problem of a small training dataset, (c) redundancy-based midsagittal plane detection for brain AVP, (d) spine disc position alignment via
3D-clustering and using a statistical model (of the vertebrae periodicity) and (e) position refinement of detected landmarks based on a statistical model.
11.1.2 Related Works on AVP

Recent literature demonstrates various approaches to view planning in tomographic diagnostic imaging. Most authors consider only some specific MRI anatomies and create highly specialized algorithms aimed at performing view planning for that anatomy. Several methods for cardiac view planning are based on localization of the left ventricle (LV) and building slices according to its position and orientation. In the paper by Zheng et al. (2009), the authors perform localization of the LV using marginal space learning and probabilistic boosting trees. In Lu et al. (2011), a 3D mesh model of the LV is fitted in the MRI volume to localize all LV anatomical structures and anchors. For brain MRI view planning, a widely used approach is to find the midsagittal plane and to localize special landmarks for building other planes, as shown by Young et al. (2006) and Li et al. (2009). As for anatomically consistent landmark detection, many approaches exist, each adjusted to a specific landmark search technique. For example, Iskurt et al. (2011) propose non-traditional landmarks for brain MRI, together with a rather specific approach for finding them. As for knee automatic view planning, Bystrov et al. (2007), Lecouvet et al. (2009) and Bauer et al. (2012) demonstrate the application of 3D deformable models of the femur, tibia and patella as an active shape model for detecting anchor points and planes. In Zhan et al. (2011), AVP for the knee is performed using landmark detection via a Viola-Jones-like approach based on modified Haar wavelets. Spine view planning is generally performed by estimating the positions and orientations of intervertebral discs. A great majority of existing approaches use simple detectors based on image contrast and brightness gradients in a spine sagittal slice to localize the set of discs and their orientation (Fenchel et al. 2008). Pekar et al. (2007) apply a disc detector based on eigenvalue analysis of the image Hessian. Most of the spinal AVP approaches work with 2D images in sagittal slices. All the mentioned methods can be divided into three categories:
1. Landmark-based methods, which build view planes relative to some predefined anatomical landmarks
2. Atlas-based methods, which use an anatomical atlas for registration of the input scout image and estimate view planes based on this atlas
3. Segmentation-based methods, which try to perform either 2D or 3D segmentation of a scout image in order to find the desired view planes
Our approach belongs to the first category.
Fig. 11.2 AVP workflow
11.2 Automatic View Planning Framework
The AVP framework workflow consists of the following steps (see Fig. 11.2):
1. 3D scout MRI volume acquisition. It is a low-resolution volume acquired at high speed.
2. Pre-processing of the scout image, which includes common operations such as bounding box estimation and statistical atlas anchoring, and anatomy-specific operations such as midsagittal plane estimation for the brain.
3. Landmark detection.
4. Post-processing, which consists of common operations for landmark position refinement and filtering as well as anatomy-specific operations like vertebral disc position alignment for spine AVP.
5. Estimation of the positions of view planes and their orientation.
A commonly used operation at the pre-processing stage is bounding box reconstruction for the body part under investigation. Such a bounding box is helpful for various working zone estimations, local coordinate origin planning, etc. The bounding box is a three-dimensional rectangular parallelepiped that bounds only the essential part of the volume. For example, for brain MRI, the bounding box simply bounds the head, ignoring the empty space around it. This first rough estimation already brings information about the body part position, which reduces the ambiguity of positions of anatomical points within a volume. Such ambiguity appears due to the variety of body part positions relative to the scanner. This reduction of ambiguity yields a reduction of the search zone for finding anatomical landmarks. The bounding box is estimated via integral projections of the whole volume onto coordinate axes. The bounding box is formed by the utmost points of intersection of the projections with predefined thresholds. The integral projection is a one-dimensional function whose value at each point is calculated as the sum of all voxels with the respective coordinate fixed. Using non-zero thresholds allows cutting off noise in the side areas of the volume.
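To make the projection-based estimate concrete, the following sketch shows one possible implementation in Python/NumPy; the per-axis thresholds and the return format are placeholders for illustration, not the values used in the actual framework.

```python
import numpy as np

def bounding_box(volume, thresholds=(0.0, 0.0, 0.0)):
    """Estimate a 3D bounding box from integral projections of the volume.

    volume: 3D array of voxel intensities.
    thresholds: per-axis non-zero thresholds used to cut off noise in the
        side areas of the volume (placeholder values).
    Returns a list of (low, high) index pairs, one per axis.
    """
    box = []
    for axis, thr in enumerate(thresholds):
        # Integral projection: sum over the two other axes, leaving a 1D profile.
        other_axes = tuple(a for a in range(3) if a != axis)
        profile = volume.sum(axis=other_axes)
        # Utmost intersections of the projection with the threshold.
        above = np.nonzero(profile > thr)[0]
        if above.size:
            box.append((int(above[0]), int(above[-1])))
        else:
            box.append((0, volume.shape[axis] - 1))
    return box
```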
In the next step, the search zone is reduced even more by application of the statistical atlas. The statistical atlas contains information about the statistical distribution of anatomical landmarks' positions inside a certain part of a human body. It is constructed on the basis of annotated volumetric medical images. Positions of landmarks in a certain volume are transformed to a local coordinate system which relates to the bounding box, not to the whole volume. Such a transformation prevents wide dispersion of the annotations. On the basis of the landmark positions calculated for several volumes, the landmarks' spatial distribution is estimated. In the simplest case, such a distribution can be represented by the convex hull of all points (for a certain landmark type) in local coordinates. When the statistical atlas is anchored to the volume, the search zone is defined (Fig. 11.3).

Fig. 11.3 Bounding box and statistical atlas

From the landmark processing point of view, the post-processing stage represents filtering out and clustering of the detected points (the landmark candidates). From the applied MRI task point of view, the post-processing stage contains procedures which perform the computation of the desired reference lines and planes. For knee AVP and brain AVP, the post-processing stage implies auxiliary reference line and plane evaluation on the basis of previously detected landmarks. During post-processing, at the first stage, all detected point candidates are filtered by thresholds on the landmark criterion. All candidates of a certain landmark type whose quality criterion value is less than the threshold are eliminated. Such thresholds are estimated in advance via a set of annotated MRI volumes. They are chosen in such a way as to minimize a loss function consisting of false-positive errors and false-negative errors. Optimal thresholds provide a balanced set of false positives and false negatives for the whole set of available volumes. This balance could be adjusted by some trade-off parameter. In the loss function mentioned above, the number of false negatives is calculated as a sum of all missed ground truths in all
volumes. A ground truth is regarded as missed if no landmarks are detected within a spherical zone with a predefined radius, surrounding the ground-truth point. The total of false positives is calculated as a sum of all detections that are outside of these spheres (too far from ground truths).
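As an illustration of how such thresholds could be tuned on annotated volumes, the following Python sketch counts false positives and false negatives for a candidate threshold and keeps the one minimizing a weighted loss; the data layout, the matching-sphere radius and the trade-off weight are assumptions for illustration, not the framework's actual parameters.

```python
import numpy as np

def count_errors(candidates, scores, ground_truths, threshold, radius):
    """Count FP/FN for one landmark type in one volume.

    candidates: (N, 3) detected points; scores: (N,) landmark quality values;
    ground_truths: (M, 3) annotated points; radius: matching sphere radius (mm).
    """
    kept = candidates[scores >= threshold]
    matched = np.zeros(len(kept), dtype=bool)
    fn = 0
    for gt in ground_truths:
        d = np.linalg.norm(kept - gt, axis=1)
        inside = d <= radius
        if not inside.any():
            fn += 1          # ground truth missed: no detection within the sphere
        matched |= inside    # detections close to any ground truth are not FP
    fp = int((~matched).sum())
    return fp, fn

def pick_threshold(per_volume_data, radius, trade_off=1.0):
    """Scan candidate thresholds and keep the one minimizing FP + trade_off * FN.

    per_volume_data: list of (candidates, scores, ground_truths) per volume.
    """
    best_thr, best_loss = None, np.inf
    all_scores = np.concatenate([s for _, s, _ in per_volume_data])
    for thr in np.unique(all_scores):
        fp = fn = 0
        for cand, sc, gts in per_volume_data:
            f_p, f_n = count_errors(cand, sc, gts, thr, radius)
            fp, fn = fp + f_p, fn + f_n
        loss = fp + trade_off * fn
        if loss < best_loss:
            best_thr, best_loss = thr, loss
    return best_thr
```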
11.2.1 Midsagittal Plane Estimation in Brain Images

The AVP workflow includes anatomy-specific operations as pre-processing and post-processing steps. One such operation is midsagittal plane (MSP) estimation. The MSP is the plane which divides the human brain into two cerebral hemispheres. To estimate the MSP position, the brain longitudinal fissure is used in our approach. The longitudinal fissure is detected in several axial and coronal sections of a brain image; the obtained fissure lines are used as references for MSP estimation. A redundant set of sections and reference lines is used to make the algorithm more robust.
Fig. 11.4 Novel idea for automatic selection of working slices to be used for longitudinal fissure detection

Fig. 11.5 Longitudinal fissure detection: (a) example of continuous and discontinuous fissure lines; (b) fissure detector positions (shift and rotation) which should be tested
The main stages of brain MSP computation are the following:
1. A set of brain axial and coronal slices is chosen automatically on the basis of a head bounding box and anatomical proportions (Fig. 11.4).
2. The 2D bounding box is estimated for each slice.
3. The longitudinal fissure line is detected in each slice by an algorithm which detects the longest continuous line within a bounding box. In each slice, the fissure detector is represented as a strip (continuous or discontinuous, depending on the slice anatomy) which bounds a set of pixels to be analysed (see Fig. 11.5a). The fissure is found as the longest continuous or discontinuous straight line in the given slice by applying the detector window many times, changing its position and orientation step by step (Fig. 11.5b).
4. The redundant set of obtained reference lines is statistically processed for clustering of the line directions. Some detections could be false due to image artefacts or disease, as shown in Fig. 11.6a. Nevertheless, because of the data redundancy, the outliers are filtered out (Fig. 11.6b). The filtering is performed via projections of the lines onto each other and comparison of the respective orthogonal discrepancies. As a result, two generalized (averaged) directional vectors are obtained (for the fissure direction in axial and coronal slices, respectively). They permit us to create the MSP normal vector, which is orthogonal to the two directions mentioned above.
5. For MSP creation, a point is necessary for the plane to pass through. Such a point may be obtained by statistical averaging of the points formed by the intersection of the reference lines with the head contour in the respective slices. Finally, the MSP is created via this "central" point and the normal vector computed from the reference lines (the fissures detected in the slices).
Fig. 11.6 An example of adequately detected and wrongly detected fissures. The idea for filtering out outliers: (a) an example of the fissure detection in a chosen set of axial slices; (b) groups of detected directions (shown as vectors) of the fissure in axial and coronal planes, respectively
The MSP may also be created directly from the points formed by the intersection of the reference lines with the head contour in the slices, as a least-squares optimization task. The result is practically the same as for the averaged vectors and central point. It should be pointed out that the redundancy in the obtained reference lines plays a great role in MSP estimation. As each reference line may be detected with some inaccuracy, the data redundancy reduces the impact of errors on the MSP position. The data redundancy feature makes our approach differ from others (Wang and Li 2008); it permits us to make the procedure more stable. In contrast, in other approaches, as a rule, only two slices are used for MSP creation – one axial and one coronal. Some authors describe an entirely different approach which does not use the MSP at all. For example, van der Kouwe et al. (2005) create slices and try to map them to some statistical atlas, solving an optimization task with a rigid-body 3D transformation. Then, this spatial transformation relative to an atlas is used for MRI plane correction. The estimated MSP is used as one of the planned MRI views. Further, the MSP helps us to reduce the search space for the landmark detector, since the landmarks (corpus callosum anterior (CCA) and corpus callosum posterior (CCP)) are located just in this plane.
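A least-squares plane fit of this kind can be sketched as follows in Python/NumPy; the SVD-based formulation is a standard way of solving the problem and is not necessarily the exact implementation used in the framework.

```python
import numpy as np

def fit_plane_least_squares(points):
    """Fit a plane to 3D points in the least-squares sense.

    points: (N, 3) array of points, e.g. intersections of the detected fissure
        lines with the head contour in the chosen slices.
    Returns (centroid, normal): a point on the plane and its unit normal.
    """
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The plane normal is the direction of least variance: the right singular
    # vector associated with the smallest singular value of the centred data.
    _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
    normal = vt[-1]
    return centroid, normal / np.linalg.norm(normal)
```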
11.2.2 Anatomical Landmark Detection

Anatomical landmark detection is the process of localizing landmark positions in a 3D volume. For each MRI type (anatomy), a set of specific landmarks is chosen to be used for evaluating the desired planes. Unique landmarks of different types are used for the brain, cardiac and knee anatomies, respectively (see Table 11.1).

Table 11.1 Landmark titles and description

Anatomy   Landmark   Description
Brain     CCA        Anterior Corpus Callosum
          CCP        Posterior Corpus Callosum
Cardiac   APEX       Left Ventricular Apex
          MVC        Center of Mitral Valve
          AVRC       Atrioventricular Node
          RVL        Lateral Right Ventricle
          RVII       Superior Right Ventricular Insertion
          RVIS       Inferior Right Ventricular Insertion
          LVOT       Left Ventricular Outflow Tract
Knee      LPC        Lateral Posterior Condyle
          MPC        Medial Posterior Condyle
          TPL        Lateral Tibial Plateau
          TPI        Internal Tibial Plateau
Spine     VP         Posterior Vertebral Disc

It should be pointed out that in these tasks, the anatomical points are determined unambiguously, i.e. a single landmark exists for each of the mentioned anatomical structures. In contrast to this, another situation may occur in which a group of similar anatomical points exists. For example, for spine MRI, there can be several landmarks of the same type in a volume. Since we have to find all vertebrae (or intervertebral discs), many similar anatomical objects exist in this task. The target points for detection are chosen as the disc posterior points where the discs join the spinal canal. Typically, a set of such landmarks exists in the spine, and these points are similar to each other. As a rule, spinal MRI investigation is performed separately for three spinal zones: upper, middle and lower. These zones correspond to the following vertebra types: cervical (C), thoracic (T) and lumbar (L). For each of these zones, a separate landmark type is established. All such spinal landmarks are located at the posterior point of the intervertebral discs. Every single landmark corresponds to the vertebra located over it. Figure 11.7 shows landmarks of C-type (red) and T-type (orange).

Fig. 11.7 Spinal landmarks

Landmark detection is equivalent to point classification, which is performed as a mapping of the set of all search points onto a set of respective labels, such as Landmark_1 or Landmark_2 or ... Landmark_N or Background. "Background" relates to the points where no landmarks are located. Thus, landmark detection is reduced to applying a discriminative function to each point. Points to be tested are called "search points". These points are actually only a subset of all points in the volume. Firstly, search points are picked up from the search area defined by the statistical atlas anchored to the volume. This search area is obtained as a union of sets of points from subvolumes that correspond to the statistical distributions of the landmarks' positions. Secondly, inside the search area, a grid of search points is defined with some prescribed step (i.e. distance between neighbouring points). For classification of a point, its surrounding context is used. The surrounding context is a portion of voxel data extracted from the neighbourhood of the search point. In our approach, we pick up a cubic subvolume surrounding the respective search point and extract three orthogonal slices of this subvolume passing through the search point (Fig. 11.8). Thus, the landmark detector scans a selected volume with a 3D sliding window and performs classification of every point by its surrounding context (Fig. 11.9).

Fig. 11.8 Surrounding context of search point
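A minimal NumPy sketch of extracting such a surrounding context is given below; the half-size of the neighbourhood and the zero padding at the volume borders are assumptions for illustration.

```python
import numpy as np

def surrounding_context(volume, point, half=16):
    """Extract three orthogonal 2D slices of a cubic subvolume around a point.

    volume: 3D array; point: (z, y, x) integer voxel coordinates;
    half: half-size of the cubic neighbourhood (2*half pixels per side, a
        placeholder matching a 32x32 input patch).
    Returns a (3, 2*half, 2*half) array: axial, coronal and sagittal patches.
    """
    z, y, x = point
    pad = np.pad(volume, half, mode="constant")     # guard against border points
    z, y, x = z + half, y + half, x + half
    axial    = pad[z, y - half:y + half, x - half:x + half]
    coronal  = pad[z - half:z + half, y, x - half:x + half]
    sagittal = pad[z - half:z + half, y - half:y + half, x]
    return np.stack([axial, coronal, sagittal])
```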
Fig. 11.9 Landmark detection
Classification is done by means of a learned deep discriminative system. The system is based on a multi-layer convolutional neural network (CNN) (Sirotenko 2006). In inference mode, the trained network takes three slices as its input and produces a vector of pseudo-probabilities that the input belongs to each of the specified classes (one of the landmarks or "Background"). Finally, after scanning the whole search space with the mentioned discriminative system, we obtain a vector field of outputs at each search point. Nevertheless, such probabilities describe only the absolute magnitude of confidence in each class, which is not enough for adequate classification. Thus, a comparative measure is necessary which takes into account all class probabilities relative to each other. Calculated for each point, such a measure is named the "landmark quality" (LMQ). If a certain landmark type (for which the landmark quality is being calculated) has the greatest value in the CNN output vector at a certain position, then the quality is calculated as the difference between this output value and the second highest value (by magnitude) in the output vector. Otherwise, if a certain landmark type does not have the greatest value in the CNN output vector, the quality value is calculated as the difference between the output value for this landmark and the greatest value in the output vector. The same operation is done for the background class.

Algorithm 11.1 Landmark Quality Calculation
Given the output of the CNN [a_1 ... a_N], compute the LMQ vector [Q_1 ... Q_N]:
  C_max = argmax_i {a_i}
  For C = 1 : N
    if (C_max == C)
      Q_C = a_C - max({a_i} \ {a_C})
    else
      Q_C = a_C - max({a_i})
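A direct NumPy transcription of Algorithm 11.1 might look as follows; the function name is arbitrary.

```python
import numpy as np

def landmark_quality(outputs):
    """Compute the landmark quality (LMQ) vector from CNN outputs [a_1..a_N].

    For the winning class, the quality is the margin over the second-highest
    output; for every other class it is the (negative) margin to the maximum.
    """
    a = np.asarray(outputs, dtype=float)
    c_max = int(np.argmax(a))
    q = np.empty_like(a)
    for c in range(len(a)):
        if c == c_max:
            others = np.delete(a, c)      # max over {a_i} \ {a_c}
            q[c] = a[c] - others.max()
        else:
            q[c] = a[c] - a.max()
    return q
```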
At this step, classification of each search point is done by eliminating the candidates with a negative landmark quality value (for each class). Then, the whole set of search points is divided into disjoint subsets corresponding to each class. The sets of points corresponding to landmarks (not background) are passed as the output detections (candidates) of the landmark detection procedure.
11.2.3 Training Landmark Detector

The main part of the landmark detector is the discriminative system used for classification of the surrounding context extracted around the current search point. In our approach, we utilize the neural network approach (Rychagov 2003). In recent years, convolutional neural networks (LeCun and Bengio 1995; Sirotenko 2006; LeCun et al. 2010) have been applied to various recognition tasks and have shown very promising results (Sermanet et al. 2012; Krizhevsky et al. 2012). The network has several feature-extracting layers, pooling layers, rectification layers and fully connected layers (Jarrett et al. 2009). Layers of the network contain trainable weights which prescribe the behaviour of the discriminative system. The process of tuning these weights is based on learning or training. Convolutional layers produce feature maps which are obtained by convolving the input maps and applying a nonlinear function to the maps after convolution. This nonlinearity also depends on some parameters which are trained. Pooling layers alternate with convolutional layers. This kind of layer performs down-sampling of the feature maps. We use max pooling in our approach. Pooling layers provide invariance to small translations and rotations of features. As rectification layers, we use abs rectification and local contrast normalization layers (Jarrett et al. 2009). Finally, on top of the network, a fully connected layer is placed, which produces the final result. The output of the convolutional neural network is a vector with a number of elements equal to the number of landmarks plus one. For example, for the multiclass landmark detector designed for detecting the CCA and CCP landmarks, there are three outputs of the neural network: two landmarks and background. These output values correspond to pseudo-probabilities that the corresponding landmark is located at the current search point (or that no landmark is located here, in the case of background). We train our network in two stages. In the first stage, we perform unsupervised pre-training using predictive sparse decomposition (PSD) (Kavukcuoglu et al. 2010). Then, we perform supervised training (refining the pre-trained weights and learning the other weights) using stochastic gradient descent with energy-based learning (LeCun et al. 2006). A specially prepared dataset is used for the training. Several medical images are required to compose this training dataset. We have collected several brain, cardiac, knee and spine scout MRI images for this purpose. These MRI volumes were manually annotated: using a special programme, we marked the positions of the landmarks of interest in each volume. These points are used to construct true samples corresponding to the landmarks. As a sample, we take a combination of a chosen
class label (target output vector) and a portion of the respective voxel data taken from the surrounding context of the certain point with a predefined size. The way of extracting such a surrounding context around the point of interest is explained in Sect. 11.2.2. The samples are randomly picked from the annotated volumes. The class label is a vector of target probabilities of the fact that the investigated landmark (or background) is located at the respective point. These target values are calculated using the energy-based approach. For every landmark class, the target value is calculated on the basis of the distance from the current sample to the closest ground-truth point of this class. For example, if a sample is picked right at the ground-truth point for Landmark_1, then the target value for this landmark is "1". If the sample is picked far from any ground truth of Landmark_1 (with distance exceeding the threshold), then the target value for this landmark is "-1". As we approach the ground truth, the target value increases monotonically. We added some extra samples with spatial distortions to train the system to be robust and invariant to noise.

At the first stage of learning, an unsupervised pre-training procedure initializes the weights W of all convolutional (feature extraction) layers of the neural network. Learning in unsupervised mode uses unlabelled data. This learning is performed via sparse coding and predictive sparse decomposition techniques (Kavukcuoglu et al. 2010). The learning process is done separately for each layer by performing an optimization procedure:

W^* = \arg\min_W \sum_{y \in Y} \| z - F_W(y) \|^2,
where Y is a training set, y is an input of the layer from the training set, z is a sparse code of y, and F_W is a predictor (a function which depends on W; it transforms the input y to the output of the layer). This optimization is performed by stochastic gradient descent. Each training sample is encoded into a sparse code using the dictionary D. The predictor F_W produces features of the sample which should be close to the sparse codes. The reconstruction error, calculated on the basis of the features and the sparse code, is used to calculate the gradient in order to update the weights W. To compute a sparse code for a certain input, the following optimization problem is solved:

z^* = \arg\min_z \| z \|_0 \quad \text{s.t.} \quad y = Dz,

where D is the dictionary, y is the input signal, z is the encoded signal (code), and z^* is the optimal sparse code. In the abovementioned equation, the input y is represented as a linear combination of only a few elements of some dictionary D. It means that the produced code z (the vector of coefficients of the decomposition) is sparse. The dictionary D is obtained from the training set in unsupervised mode (without using annotated labels). An advantage of the unsupervised approach to finding the optimal dictionary D is the fact that
the dictionary is learned directly from the data. So, the found dictionary D optimally represents the hidden structure and specific nature of the data used. An additional advantage of the approach is that it does not need a large amount of annotated input data for the dictionary training. Finding D is equivalent to solving the optimization problem:

D^* = \arg\min_D \sum_{y \in Y} \| Dz - y \|^2,
where Y is a training set, y is an input of the layer from the training set, z is a sparse code of y, and D is the dictionary. This optimization is performed via stochastic gradient descent. Decoding of the sparse code is performed to produce decoded data. The reconstruction error (discrepancy) is calculated on the basis of the training sample and the decoded data; the discrepancy is used to calculate gradients for updating the dictionary D. The process of adjusting the dictionary D is alternated with finding the optimal code z for the input y with fixed D. For all layers except the first, the training set Y is formed as a set of the previous layer's outputs. Unsupervised pre-training is useful when only a little labelled data is available. We demonstrate the superiority of using PSD with few labelled MRI volumes. For such an experiment, unsupervised pre-training of our network is performed in advance. Then, several annotated volumes were picked up for supervised fine-tuning. After training, the misclassification rate (MCR) on the test dataset (samples from MRI volumes which were not used in training) was calculated. The plot in Fig. 11.10 shows the classification performance (the lower the MCR, the better) depending on the number of annotated volumes taking part in supervised learning. After the unsupervised training is completed, the entire convolutional neural network is adjusted to produce multi-level sparse codes which are a good hierarchical feature representation of the input data. In the next step, supervised training is performed to tune the whole neural network to produce features (an output vector) which correspond to the probabilities of the appearance of a certain landmark or background at a certain point. This is done by performing the following optimization:

W^* = \arg\min_W \sum_{y \in Y} \| x - G_W(y) \|^2,
where W represents the set of all trainable parameters, Y is a training set, y is an input of the feature extractor from the training set, x is a target vector based on the ground truth corresponding to the input y, and G_W depends on W and defines the entire transformation done by the neural network from the input y to the output. With such optimization, the neural network learns to produce outputs similar to those predefined by the annotation labels. The optimization procedure is performed by stochastic gradient descent. During such training, the error (discrepancy) calculated on the basis of the loss function is backpropagated from the last layer to
Fig. 11.10 Misclassification rate plot: MCR is calculated on the test dataset using CNN learned with various numbers of annotated MRI volumes. Red line, pure supervised mode; blue line, supervised mode with PSD initialization
the first one, with their weights being updated. At the beginning of the procedure, some weights of the feature extraction layers are initialized with the values computed at the pre-training stage. The final feature extractor is able to produce a feature vector that can be directly used for discriminative classification of the input or for assigning to every class a probability of the input belonging to the respective class. The trained classifier shows good performance on datasets composed of samples from different types of MRI volumes (such as knee, brain, cardiac, spine). For future repeatability and comparison, we also trained our classifier on the OASIS dataset of brain MRI volumes, which is available online (Marcus et al. 2007). MCR results calculated on the test datasets are shown in Table 11.2. For validation of the convolutional neural network approach, we have compared it with the widely used support vector machine (SVM) classifier (Steinwart and Christmann 2008) applied to the samples. We used training and testing datasets composed of samples from cardiac MRI volumes. Table 11.3 demonstrates the superiority of convolutional neural networks.
Table 11.2 Classification results on different datasets

Dataset type   MCR
OASIS          0.0590
Brain          0.0698
Cardiac        0.1553
Knee           0.0817
Spine          0.0411

Table 11.3 Classification results of SVM and ConvNets

Algorithm                            MCR
SVM (linear kernel)                  0.2952
SVM (degree-2 polynomial kernel)     0.29
SVM (degree-3 polynomial kernel)     0.3332
ConvNet (pure supervised)            0.1993
ConvNet (with PSD initialization)    0.1553

11.2.4 Spine Vertebrae Alignment

The spine AVP post-processing stage is used for the filtering of detected landmarks, their clustering and spinal curve approximation on the basis of these clustered landmark nodes.
The clustering of the detected points is performed for elimination of the outliers among them (if any are present). This operation finds several clusters – dense groups of candidates – and all points apart from these groups are filtered out. The point quality weight factors may optionally be taken into account in this operation. So, after such pre-processing, a set of cluster centres is obtained, which may be regarded as nodes for the subsequent creation of the spinal approximation curve. They represent more adequate data than the original detected points. The clustering operation is illustrated in Fig. 11.11. On the basis of the clustered nodes, a refined sagittal plane is created. This is one of the AVP results. Another necessary result is a set of local coronal planes adapted for each discovered vertebra (or intervertebral disc). For creating such planes, the respective intervertebral disc locations should be found, as well as the planes' normal vector directions. Firstly, a spinal curve approximation is created on the basis of the previously estimated nodes – the clustered landmarks. The approximation is represented as two independent functions of the height coordinate: x(z) and y(z) in the coronal and sagittal sections, respectively. The coronal function x(z) is represented as a sloped straight line. The sagittal function y(z) is different for the C, T and L zones of the spine (upper, middle and lower spine, respectively). The sagittal approximation for these zones is represented as a series of components such as a straight line, a parabolic function and a trigonometric one (for the C and L zones) with adjusted amplitude, period and starting phase. The approximation is fitted to the obtained nodes via the least-squares approach. As a rule, such an approximation is quite satisfactory; see the illustration in Fig. 11.12.
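The following Python sketch illustrates such a least-squares fit for the sagittal function y(z) with a single line + parabola + trigonometric basis; the angular frequency omega (e.g. derived from the statistical vertebra period) and the use of one basis for the whole curve are simplifying assumptions, since the actual approximation differs per spinal zone.

```python
import numpy as np

def fit_sagittal_curve(z, y, omega):
    """Least-squares fit of the sagittal spinal curve y(z).

    Basis: constant + line + parabola + sine/cosine with angular frequency
    omega (assumed to be supplied by the caller).
    z, y: 1D arrays of node coordinates (clustered landmark centres).
    Returns the coefficient vector and a callable model y_hat(z).
    """
    z = np.asarray(z, dtype=float)
    A = np.column_stack([np.ones_like(z), z, z ** 2,
                         np.sin(omega * z), np.cos(omega * z)])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)

    def model(z_new):
        z_new = np.asarray(z_new, dtype=float)
        B = np.column_stack([np.ones_like(z_new), z_new, z_new ** 2,
                             np.sin(omega * z_new), np.cos(omega * z_new)])
        return B @ coeffs

    return coeffs, model
```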
Fig. 11.11 Spinal landmark clustering
Fig. 11.12 An example of spinal curve approximation (via nodes obtained as the clustered landmark centres). Average brightness measured along the curve is shown in the illustration, too
Then, the intervertebral disc locations should be determined. On the basis of the spinal curve approximation, a curved "secant tube" is created, which passes approximately through the vertebra centres. An averaged brightness of the voxels in the tube's axial sections is collected along the tube. Thus, the voxel brightness function is obtained for the spinal curve: B(z), or B(L), where L is the running length of the spinal curve. In a T2 MRI protocol, vertebrae appear as bright structures, whereas intervertebral discs appear as dark ones. So, the brightness function (as well as its gradients) is used for detecting candidate intervertebral disc positions (see Fig. 11.12). Nevertheless, the disc locations may be poorly expressed, and, on the other hand, a lot of false-positive locations may be detected. To avoid this problem, additional filtration is applied to these candidate disc positions; we call it "periodic filtration". A statistical model of intervertebral distances is created, and the averaged positions of the discs along the spinal curve are presented as a pulse function. A convolution of this pulse function with the spinal curve brightness function (or with
Fig. 11.13 The vertebrae statistical model (the periodic "pulse function") as additional knowledge for estimating the supposed vertebra locations
the brightness gradient function) permits us to detect the disc locations more precisely. The parameters of this pulse function – its shift and scale – are adjusted during the convolution optimization process. The processing of the spinal brightness function is illustrated in Fig. 11.13. Finally, the supposed locations of the intervertebral discs are determined. The local coronal planes are computed at these points, and the planes' directional vectors are evaluated as the local direction vectors of the spinal approximation curve. The result of intervertebral disc secant plane estimation is shown in Fig. 11.14. As a rule, 2D sagittal slices are used for vertebra (or intervertebral disc) detection via various techniques (gradient finding, active shape models, etc.). Some authors use segmentation (Law et al. 2012), and some do not (Alomari et al. 2010). In the majority of the works, a statistical context model is used to make vertebra finding more robust (Alomari et al. 2011; Neubert et al. 2011). Sometimes, the authors use a marginal space approach to speed up the process (Kelm et al. 2011). A distinctive point of our approach is that we use mostly 3D techniques. We did not use any segmentation: neither 3D nor 2D. We obtain the spatial location of the spinal column directly via the detected 3D points. Our spinal curve is a 3D one, as are the clustering and filtering methods. To make the detection operation more robust, we use a spine statistical model: the vertebrae "pulse function", which represents the statistically determined intervertebral distance along the spinal curve. This model is used in combination with pixel brightness analysis in the spatial tubular structure created near the spinal curve.
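A minimal sketch of this "periodic filtration" is shown below: a pulse train with candidate shift and scale is correlated with the (negated) brightness profile along the curve, and the best-scoring parameters give the supposed disc positions. The candidate grids, the per-pulse normalization and the use of brightness rather than its gradient are assumptions for illustration.

```python
import numpy as np

def periodic_filtration(brightness, base_period, shifts, scales):
    """Match a shifted/scaled pulse train to the brightness profile B(L).

    brightness: 1D profile along the spinal curve (discs appear dark in T2,
        so the profile is negated before matching).
    base_period: mean intervertebral distance, in samples, from the model.
    shifts, scales: candidate grids supplied by the caller.
    Returns the best (shift, scale) and the estimated disc positions.
    """
    signal = -np.asarray(brightness, dtype=float)
    n = len(signal)
    best = (None, None, -np.inf)
    for scale in scales:
        period = base_period * scale
        for shift in shifts:
            idx = np.round(np.arange(shift, n, period)).astype(int)
            idx = idx[(idx >= 0) & (idx < n)]
            if idx.size == 0:
                continue
            score = signal[idx].mean()      # average response per pulse
            if score > best[2]:
                best = (shift, scale, score)
    shift, scale, _ = best
    positions = np.round(np.arange(shift, n, base_period * scale)).astype(int)
    return shift, scale, positions[(positions >= 0) & (positions < n)]
```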
Fig. 11.14 Intervertebral discs’ secant planes computed in the postprocessing stage: an example
11.2.5 Landmark Candidate Position Refinement Using Statistical Model

In some complicated cases (poor scout image quality, artefacts, abnormal anatomy, etc.), the output of the landmark detector could be ambiguous. This means that there could be several landmarks detected with a high landmark quality measure for a given landmark type. In order to eliminate this ambiguity, a special algorithm is used based on statistics of the landmarks' relative positions. The goal of the algorithm is to find the configuration of landmarks with high LMQ values minimizing the energy:

E(X, M_X, \Sigma_X) = \sum_{x_s \in X, \, x_t \in X} \psi_{st}(x_s, x_t, M_X, \Sigma_X),

where X \in R^{3 \times K} is a set of K vectors of coordinates of landmarks, called the landmark configuration; M_X, the mean distances of each landmark from each other; \Sigma_X, the landmark coordinate distance covariance tensor; E, the energy of the statistical model, where lower values correspond to more consistent configurations; and \psi_{st}, a spatial energy function measuring the statistical consistency of two landmarks. In our implementation, we define the spatial energy as follows:

\psi_{st}(x_s, x_t, M_X, \Sigma_X) = 0.5 \, (x_s - x_t - \mu_{st})^T \, \Sigma_{st}^{-1} \, (x_s - x_t - \mu_{st}),

where x_s and x_t correspond to three-dimensional coordinate vectors from the configuration X; \mu_{st}, the three-dimensional mean distance vector between landmarks s and t; and \Sigma_{st}^{-1}, the inverse covariance matrix of the three-dimensional vectors of distances between landmarks s and t. The statistics \mu_{st} and \Sigma_{st}^{-1} are computed once on an annotated dataset.
The complete minimization of the abovementioned energy function would require sophisticated and computationally expensive algorithms. Instead, we use a heuristic approach based on the assumption that the landmark detections are mostly correct. In the first step of this algorithm, from the plurality of landmark candidate points, a subset S_a is selected, consisting of M subsets S_a1 ... S_aM of the N candidate points having the greatest pseudo-probability for each of the M landmarks. Also, a set S_b is selected from S_a, consisting of the M best candidates – one for each landmark. Next, a loop is defined over all candidates x_i from S_b. For each of them, the partial derivative ∂E(X, M_X, Σ_X)/∂x_i of the statistical model with respect to x_i is calculated. If the magnitude of this partial derivative is the greatest among the elements of S_b, then a nested loop is initialized over all x_j in S_ai. In the next step, a new configuration S_b' is defined by substituting x_j for x_i in S_b. This new configuration is then used to compute the energy of the statistical model; if the energy is lower than the energy for S_b, then S_b' is assigned to S_b. This process is repeated until some stopping criterion is satisfied: for instance, a maximum number of iterations or a minimum value of the partial derivative magnitude.
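The sketch below illustrates the pairwise energy and a simplified greedy refinement pass in Python/NumPy; it tries candidate substitutions for every landmark rather than only the one with the largest partial-derivative magnitude, so it is a simplification of the procedure described above, with assumed data structures for the statistics.

```python
import numpy as np

def pair_energy(xs, xt, mu_st, inv_cov_st):
    """Spatial energy psi_st: 0.5 x squared Mahalanobis distance of (xs - xt)
    from its mean offset mu_st."""
    d = xs - xt - mu_st
    return 0.5 * d @ inv_cov_st @ d

def total_energy(config, mu, inv_cov):
    """Energy E(X): sum of pairwise spatial energies over all landmark pairs.

    config: (K, 3) landmark coordinates; mu[s, t]: mean 3D offset between
    landmarks s and t; inv_cov[s, t]: its 3x3 inverse covariance.
    """
    K = len(config)
    return sum(pair_energy(config[s], config[t], mu[s, t], inv_cov[s, t])
               for s in range(K) for t in range(K) if s != t)

def refine(config, candidates, mu, inv_cov):
    """One greedy pass: try each landmark's candidate positions in turn and
    keep any substitution that lowers the total energy (a simplification of
    the gradient-guided selection described above)."""
    best = np.array(config, dtype=float)
    best_e = total_energy(best, mu, inv_cov)
    for i in range(len(best)):
        for cand in candidates[i]:
            trial = best.copy()
            trial[i] = cand
            e = total_energy(trial, mu, inv_cov)
            if e < best_e:
                best, best_e = trial, e
    return best, best_e
```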
11.3 Results
For the algorithm testing and the landmark detector training, a database of real MRI volumes of various anatomies was collected. This included 94 brain, 80 cardiac, 31 knee and 99 spine MRI volumes. Based on the robustness of our landmark detector, we acquired low-resolution 3D MRI volumes. The advantage of this approach is the short acquisition time. All data were acquired with our specific protocol and then annotated by experienced radiologists. In our implementation, we used a convolutional neural network with an input size of 32 × 32 × 3 slices. Input volumes (at both the learning and processing stages) were resized to a spacing of 2, so 32 pixels correspond to 64 mm. The architecture of the network is the following: the first convolutional layer has 16 kernels 8 × 8 with a shrinkage transfer function; on top of it follow abs rectification, local contrast normalization and max pooling with down-sampling factor 2; then we have the second convolutional layer with 32 kernels 8 × 8 connected to the 16 input feature maps with a not fully filled connection matrix; then a fully connected layer with a hyperbolic tangent sigmoid transfer function finalizes the network. Verification of the AVP framework was performed by comparing automatically built views with views built on the basis of the ground-truth landmark positions. The constructed view planes were compared with the ground-truth ones by the angle and generalized distance between them. For spine MRI result verification, the number of missed and wrongly determined vertebral discs was counted. In this procedure, only the discs of a specified type were considered. Examples of the constructed views are shown in Figs. 11.15, 11.16, 11.17, and 11.18.
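For orientation, a hedged PyTorch sketch of a detector with roughly the layout described above is shown below; the shrinkage transfer function, abs rectification and local contrast normalization are approximated by standard layers, and the partially filled connection matrix of the second convolutional layer is replaced by full connections, so this is an approximation rather than a faithful reproduction of the described architecture.

```python
import torch
import torch.nn as nn

class LandmarkNet(nn.Module):
    """Approximate sketch of the landmark detector CNN.

    Input: 3 orthogonal 32x32 slices stacked as channels.
    Output: one pseudo-probability per landmark plus one for background.
    """
    def __init__(self, n_landmarks):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8),    # 16 kernels 8x8
            nn.Softshrink(),                    # stand-in for the shrinkage transfer function
            nn.InstanceNorm2d(16),              # rough stand-in for abs rectification + LCN
            nn.MaxPool2d(2),                    # down-sampling factor 2
            nn.Conv2d(16, 32, kernel_size=8),   # 32 kernels 8x8 (full connections here)
            nn.Flatten(),
        )
        self.classifier = nn.Sequential(
            nn.LazyLinear(n_landmarks + 1),     # fully connected output layer
            nn.Tanh(),                          # hyperbolic tangent transfer function
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Example: score a batch of 32x32x3 contexts for 2 landmarks + background.
net = LandmarkNet(n_landmarks=2)
scores = net(torch.randn(4, 3, 32, 32))         # shape (4, 3): 2 landmarks + background
```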
Fig. 11.15 Brain AVP results: (a–d) Automatically built midsagittal views. Red lines mean intersection with corresponding axial and coronal planes. It is shown that the axial plane passes through CCA and CCP correctly
A comparison of the time required for the AVP procedures is presented in Table 11.4. Statistics (mean ± STD) of the discrepancies between the constructed views and the respective ground truths are presented in Table 11.5. The results show that the developed technique provides better quality of the planned anatomical views in comparison with the competitors' results, while the time spent for the operations (image acquisition and processing) is much less.
11.4 Conclusion
We have presented a novel automatic view planning framework based on robust landmark detectors, able to perform high-accuracy view planning for the different anatomies using low-quality rapidly acquired scout images. The quality of the proposed algorithmic solutions was verified using collected and publicly available datasets. Benchmarking of the developed system shows its superiority compared with most of the competitive approaches in terms of view planning and workflow speed up. The presented framework was implemented as a production-ready software solution.
Fig. 11.16 Knee AVP results: (a–c) ground-truth views, (d–f) automatically built views. Red lines mean intersection with other slices
Fig. 11.17 Spine AVP results: positions and orientations of detected (green) and spinal curve (red) for different spinal zones – (a) cervical, (b) thoracic and (c) lumbar
Our results demonstrate that we were able to achieve reliable, robust results in much less time than our best competitors. Based on the results, we believe that this novel AVP framework will help clinicians in achieving fast, accurate and repeatable MRI scans and be a big differentiating aspect of a company’s MRI product portfolio.
Fig. 11.18 Cardiac AVP results: (a–d) ground-truth views; (e–h) automatically built views; (a–e) 2-chamber view; (b–f) 3-chamber view; (c–g) 4-chamber view; (d–h) view from short-axis stack
Table 11.4 Time comparison with competitors: imaging time (IT), processing time (PT) and total time (TT)

           Competitors                             Our approach
MRI type   Name   IT (s)   PT (s)   TT (s)         IT (s)   PT (s)   TT (s)
Cardiac    S      20       12.5     32.5           19       5        24
           P      –        103      100+
Brain      S      42       2        44             25       1        26
           G      40       2        42
Knee       S      –        30       30+            23       2        25
           P      40       6        46
Spine      S      30       5        45             27       2        29
           P      120      8        128
           G      25       7        32
Table 11.5 Quality verification of AVP framework

            Nearest competitor                 Our approach
MRI type    Dist (mm)      Angle (deg)         Dist (mm)      Angle (deg)
Brain       4.55 ± 5.4     3.18 ± 1.8          1.2 ± 0.9      1.59 ± 1.2
Cardiac     8.38 ± 12.4    14.35 ± 15          6.37 ± 7.4     9.64 ± 7.7
Knee        1.53 ± 0.4     1.18 ± 0.4          2.1 ± 1.6      1.73 ± 1.3
Spine       2.42 ± 1       3.85 ± 2.3          3.84 ± 1.9     4.14 ± 2.8
References Alomari, R.S., Corso, J., Chaudhary, V., Dhillon, G.: Computer-aided diagnosis of lumbar disc pathology from clinical lower spine MRI. Int. J. Comput. Assist. Radiol. Surg. 5(3), 287–293 (2010) Alomari, R.S., Corso, J., Chaudhary, V.: Labeling of lumbar discs using both pixel-and object-level features with a two-level probabilistic model. IEEE Trans. Med. Imaging. 30(1), 1–10 (2011) Bauer, S., Ritacco, L.E., Boesch, C., Reyes, M.: Automatic scan planning for magnetic resonance imaging of the knee joint. Ann. Biomed. Eng. 40(9), 2033–2042 (2012) Bystrov, D., Pekar, V., Young, S., Dries, S.P.M., Heese, H.S., van Muiswinkel, A.M.: Automated planning of MRI scans of knee joints. Proc. SPIE Med. Imag. 6509 (2007) Fenchel, M., Thesen, A., Schilling, A.: Automatic labeling of anatomical structures in MR FastView images using a statistical atlas. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI, pp. 576–584. Springer, Berlin, Heidelberg (2008) Iskurt, A., Becerikly, Y., Mahmutyazicioglu, K.: Automatic identification of landmarks for standard slice positioning in brain MRI. J. Magn. Reson. Imaging. 34(3), 499–510 (2011) Jarrett, K., Kavukcuoglu, K., Ranzato, M.A., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: Proceedings of 12th International Conference on Computer Vision, vol. 1, pp. 2146–2153 (2009) Kavukcuoglu, K., Ranzato, M.A., LeCun, Y.: Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv: 1010.3467 (2010) Kelm, B.M., Zhou, K., Suehling, M., Zheng, Y., Wels, M., Comaniciu, D.: Detection of 3D spinal geometry using iterated marginal space learning. In: Medical Computer Vision. Recognition
Techniques and Applications in Medical Imaging, pp. 96–105. Springer, Berlin, Heidelberg (2011) Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inform. Process. Syst. 25(2), 1–9 (2012) Law, M.W.K., Tay, K.Y., Leung, A., Garvin, G.J., Li, S.: Intervertebral disc segmentation in MR images using anisotropic oriented flux. Med. Image Anal. 17(1), 43–61 (2012) Lecouvet, F.E., Claus, J., Schmitz, P., Denolin, V., Bos, C., Vande Berg, B.C.: Clinical evaluation of automated scan prescription of knee MR images. J. Magn. Reson. Imaging. 29(1), 141–145 (2009) LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time series. In: The Handbook of Brain Theory and Neural Networks, vol. 3361(10) (1995) LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M.A., Huang, F.J.: A tutorial on energy-based learning. In: Bakir, G., Hofman, T., Schölkopf, B., Smola, A., Taskar, B. (eds.) Predicting Structured Data. MIT Press, Cambridge, USA (2006) LeCun, Y., Kavukcuoglu, K., Farabet, C.: Convolutional networks and applications in vision. In: Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 253–256 (2010) Li, P., Xu, Q., Chen, C., Novak, C.L.: Automated alignment of MRI brain scan by anatomic landmarks. In: Proceedings of SPIE, Medical Imaging, vol. 7259, (2009) Lu, X., Jolly, M.-P., Georgescu, B., Hayes, C., Speier, P., Schmidt, M., Bi, X., Kroeker, T., Comaniciu, D., Kellman, P., Mueller, E., Guehring, J.: Automatic view planning for cardiac MRI acquisition. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2011, pp. 479–486. Springer, Berlin, Heidelberg (2011) Marcus, D.S., Wang, T.H., Parker, J., Csernansky, J.G., Morris, J.C., Buckner, R.L.: Open Access Series of Imaging Studies (OASIS): cross-sectional MRI data in young, middle aged, nondemented, and demented older adults. J. Cogn. Neurosci. 19(9), 1498–1507 (2007) Neubert, A., Fripp, J., Shen, K., Engstrom, C., Schwarz, R., Lauer, L., Salvado, O., Crozier, S.: Automated segmentation of lumbar vertebral bodies and intervertebral discs from MRI using statistical shape models. In: Proc. of International Society for Magnetic Resonance in Medicine, vol. 19, p. 1122 (2011) Pekar, V., Bystrov, D., Heese, H.S., Dries, S.P.M., Schmidt, S., Grewer, R., den Harder, C.J., Bergmans, R.C., Simonetti, A.W., van Muiswinkel, A.M.: Automated planning of scan geometries in spine MRI scans. In: Medical Image Computing and Computer-Assisted Intervention– MICCAI, pp. 601–608. Springer, Berlin, Heidelberg (2007) Rychagov, M.: Neural networks: Multilayer perceptron and Hopfield networks. Exponenta Pro. Appl. Math. 1, 29–37 (2003) Sermanet, P., Chintala, S., Yann LeCun, Y.: Convolutional neural networks applied to house numbers digit classification. In: Proceedings of the 21st International Conference on Pattern Recognition (ICPR), pp. 3288–3291 (2012) Sirotenko, M.: Applications of convolutional neural networks in mobile robots motion trajectory planning. In: Proceedings of Scientific Conference and Workshop. Mobile Robots and Mechatronic Systems, pp. 174–181. MSU Publishing, Moscow (2006) Steinwart, I., Christmann, A.: Support vector machines. Springer, New York (2008) van der Kouwe, A.J.W., Benner, T., Fischl, B., Schmitt, F., Salat, D.H., Harder, M., Sorensen, A.G., Dale, A.M.: On-line automatic slice positioning for brain MR imaging. Neuroimage. 
27(1), 222–230 (2005) Wang, Y., Li, Z.: Consistent detection of mid-sagittal planes for follow-up MR brain studies. In: Proceedings of SPIE, Medical Imaging, vol. 6914, (2008) Young, S., Bystrov, D., Netsch, T., Bergmans, R., van Muiswinkel, A., Visser, F., Sprigorum, R., Gieseke, J.: Automated planning of MRI neuro scans. In: Proceedings of SPIE, Medical Imaging, vol. 6144, (2006)
Zhan, Y., Dewan, M., Harder, M., Krishnan, A., Zhou, X.S.: Robust automatic knee MR slice positioning through redundant and hierarchical anatomy detection. IEEE Trans. Med. Imaging. 30(12), 2087–2100 (2011) Zheng, Y., Lu, X., Georgescu, B., Littmann, A., Mueller, E., Comaniciu, D.: Automatic left ventricle detection in MRI images using marginal space learning and component-based voting. In: Proceedings of SPIE, vol. 7259, (2009)
Chapter 12
Dictionary-Based Compressed Sensing MRI Artem S. Migukin, Dmitry A. Korobchenko, and Kirill A. Gavrilyuk
12.1 Toward Perfect MRI
12.1.1 Introduction

In a contemporary clinic, magnetic resonance imaging (MRI) is one of the most widely used and irreplaceable tools. Being noninvasive, MRI offers superb soft-tissue characterization with global anatomic assessment and allows object representation from arbitrary vantage points. In contrast to computed tomography, there is no emitted ionizing radiation, which grants MRI great potential for further spreading as a dominant imaging modality (Olsen 2008). Conventionally, MRI data, which are samples of the so-called k-space (the spatial Fourier transform of the object), are acquired by a receiver coil, and the resulting MR image to be analysed is computed by the discrete inverse Fourier transform of the full spectrum. Despite the abovementioned assets, the most important disadvantage of MRI is its relatively high acquisition time (about 50 minutes of scan time) due to fundamental limitations imposed by physical (gradient amplitude and slew rate) and physiological (nerve stimulation) constraints. Phase-encoding (PE) lines, forming rows in the k-space, are sampled in series, sequentially in time. Because of that, the intuitive way to reduce the acquisition time is to decrease the number of sampled PE lines. This results in a partially
A. S. Migukin (*)
Huawei Russian Research Institute, Moscow, Russia
e-mail: [email protected]
D. A. Korobchenko
Nvidia Corporation, Moscow Office, Moscow, Russia
e-mail: [email protected]
K. A. Gavrilyuk
University of Amsterdam, Amsterdam, The Netherlands
e-mail: [email protected]
Fig. 12.1 2D k-space undersampling by an aligned sampling mask
sampled (undersampled) k-space, which can also be considered as the Hadamard (element-wise) multiplication of the full k-space by a sparse binary sampling mask (Fig. 12.1). Hardware-based parallel data acquisition (pMRI) methods (Roemer et al. 1990; Pruessmann 2006) reduce the required amount of k-space samples by simultaneous scanning using receiver arrays, so that the data diversity provided by multiple receiver coils allows eliminating the resulting aliasing (Pruessmann et al. 1999; Griswold et al. 2002; Blaimer et al. 2004; Larkman and Nunes 2007). Although tens of receiver coils may be available, the conventional pMRI techniques are nonetheless limited by noise, the rapidly growing size of the datasets and imperfect alias correction. This typically enables acceleration smaller than three- or fourfold (Ravishankar and Bresler 2010). Therefore, one seeks methods to essentially reduce the amount of acquired data and to reconstruct tissues without visible degradation of image quality. Complementary to hardware-based acceleration, the total time to obtain an MR image can be significantly decreased by algorithmic reconstruction of the highly undersampled k-space. Such methods rely on implicit or explicit modelling of, or constraints on, the underlying image (Liang and Lauterbur 2000), with some methods even adapting the acquisition to the imaged object (Aggarwal and Bresler 2008; Sharif et al. 2010). One of the most promising techniques is compressed sensing (CS), utilizing a priori information on the sparsity of MR signals in some domain (Lustig et al. 2008). According to the mathematical theory of CS (Candès et al. 2006; Donoho 2006), if the acquisition is incoherent, then an image with a sparse representation in some domain can be recovered accurately from significantly fewer measurements than the number of unknowns or than mandated by the classical Nyquist sampling condition. The cost of such acceleration is that the object reconstruction scheme is nonlinear and far from trivial. Intuitively, artefacts due to random undersampling add as noise-like interference. For instance, a naive object reconstruction by the discrete inverse Fourier transform of an undersampled spectrum with zero filling (ZF) of the empty positions leads to a strong aliasing effect (see Fig. 12.2b). In addition, a lot of high-frequency components are lost. At the time of this work, many proposals for MRI data reconstruction try to mitigate undersampling artefacts and recover fine details (Do and Vetterli 2006; Lustig et al.
Fig. 12.2 Aliasing artefacts: the magnitude of the inverse Fourier transform – (a) for the fully sampled spectrum and (b) for the undersampled k-space spectrum as it is shown in Fig. 12.1
2007; Ma et al. 2008; Goldstein and Osher 2009). However, reconstructions by leading CS MRI methods with nonadaptive, global sparsifying transforms (finite differences, wavelets, contourlets, etc.) are usually limited to relatively low undersampling rates and still suffer from many undesirable artefacts and loss of features (Ravishankar and Bresler 2010). The images are usually represented by a general predefined basis or frame which may not provide a sufficiently sparse representation for them. For instance, the traditional separable wavelet fails to sparsely represent the geometric regularity along singularities, and the conventional total variation (TV) results in staircase artefacts in the case of limited acquired k-space data (Goldstein and Osher 2009). Contourlets (Do and Vetterli 2006) can sparsely represent the smooth details but not the spots in images. All these transforms favour the sparse representation of global image specifics only. Local sparsifying transforms allow highlighting a broad set of fine details, i.e. they carry local geometric information (Qu et al. 2012). In the so-called patch-based approach, an image is divided into small overlapping blocks (patches), and the vector corresponding to each patch is modelled as a sparse linear combination of candidate vectors termed atoms, taken from a set called the dictionary. Here one requires a huge arsenal of various transforms, whose perfect fit is extremely hard to find. Alternatively, the size of the patches needs to be constantly decreased. In perspective, researchers have shown great interest in finding adaptive sparse regularization. Images may be represented content-adaptively by patches via a collaborative sparsifying transform (Dabov et al. 2007) or in terms of dictionary-based image restoration (Aharon et al. 2006). Adaptive transforms (dictionaries) can sparsify images better because they are constructed (learnt) especially for the particular image instance or class of images. Recent studies have shown the promise of patch-based sparsifying transforms in a variety of applications such as image denoising (Elad and Aharon 2006), deblurring (Danielyan et al. 2011) or in a specific task such as phase retrieval (Migukin et al. 2013). In this work, we exploit adaptive patch-based dictionaries to obtain substantially improved reconstruction performance for CS MRI. To the best of our knowledge, and following Ravishankar and Bresler (2010), Caballero et al. (2012) and Song et al. (2014), such sparse regularization with trained patch-based dictionaries
provides state-of-the-art results in MRI reconstruction. Aiming at optimal sparse representations, and thus optimal noise/artefact removal capabilities, we learn the dictionary on fully sampled clinical data with sparsity constraints on the codes (the linear weights for the atoms) enforced by the ℓ0 norm. The subject of this chapter is an efficient, high-performance and fast-converging CS algorithm for MRI reconstruction from highly undersampled k-space data. Moreover, we are looking for a robust, clear and flexible algorithm whose implementation may be easily tuned by end-users. This task is comprehensively realized by our method (Migukin et al. 2015, 2017; Korobchenko et al. 2018): it is not necessary to choose any particular sparsifying functions, and a good dictionary (with patches of non-uniform size adapted to the specifics of the undersampling) prepared in advance, together with the settings of its use (the tolerance for image patch approximation), fully determines the CS MRI reconstruction for particular datasets. Irrespective of the denoising/deblurring approaches, almost all authors omit the initialization problem; in particular, the initial guess for MRI algorithms is typically obtained by zero filling (ZF). Here, we present an efficient initialization with the iterative split Bregman algorithm (Goldstein and Osher 2009). Finally, our implementation on a graphics processing unit (GPU) makes the reconstruction time negligible with respect to the data acquisition time.
12.2 Compressed Sensing MRI
We use a vector-matrix notation for the discrete representation of complex-valued object images and their k-space spectra. Let x hereafter denote the column vector of a target object to be reconstructed, and let y denote its vectorized spatial Fourier transform. In the case of a noiseless and fully sampled k-space, the relation is

$$ y = Fx, $$

where F is the Fourier transform matrix. The vectors x and y are of the same length.
12.2.1 Observation Model

Undersampling occurs whenever the number of k-space measurements is less than the length of x:

$$ y_u = F_u x = m \circ Fx. $$

The problem is to find the unknown object vector x from the vector of the undersampled spectral measurements y_u, i.e. to solve an underdetermined system
of linear equations. In the equation above, F_u is the undersampled Fourier encoding matrix (Ravishankar and Bresler 2010). The calculation of y_u can be reinterpreted via the Hadamard multiplication (denoted by ∘) by the binary sampling mask vector m, which nicely fits in with our illustration in Fig. 12.1.
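A minimal NumPy sketch of this observation model for a 2D image might look as follows; the function names and the random row-wise mask are illustrative assumptions, not part of the original method.

```python
import numpy as np

def undersample_kspace(x, mask):
    """Simulate y_u = m o F x for a 2D object x and binary sampling mask m."""
    y_full = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(x)))   # centred FFT
    return mask * y_full                                          # Hadamard product

def zero_filled_recon(y_u):
    """Naive reconstruction: inverse FFT of the zero-filled spectrum."""
    return np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(y_u)))

# Example: keep ~25% of the phase-encoding lines (row-wise Cartesian sampling)
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 256)) + 1j * rng.standard_normal((256, 256))
mask = np.zeros((256, 256))
mask[rng.choice(256, size=64, replace=False), :] = 1.0
x_zf = zero_filled_recon(undersample_kspace(x, mask))   # exhibits aliasing
```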
12.2.2 Sparse Object Representation

Compressed sensing solves the problem by minimizing the ℓ0 quasi-norm (the number of non-zero components of the vector) of the sparsified signal, or sparse codes, Ψ(x), where Ψ is typically a global orthonormal sparsifying transform. The corresponding optimization problem can be presented as follows:

$$ \min_x \|\Psi(x)\|_0 \quad \text{s.t.} \quad y_u = F_u x. $$
This sparse coding problem can be solved by greedy algorithms (Elad 2010). Following Donoho (2006), the CS reconstruction problem can also be simplified by replacing the ℓ0 norm with its convex relaxation, the ℓ1 norm. Since real measurements are always noisy, the CS problem is shown (Donoho et al. 2006) to be efficiently solved using basis pursuit denoising. Thus, the typical formulation of the CS MRI reconstruction problem has the following Lagrangian setup (Lustig et al. 2007):

$$ \min_x \|y_u - F_u x\|_2^2 + \lambda \|\Psi(x)\|_1. $$

Here λ is a regularizing parameter, and the ℓ1 norm is defined as the sum of the absolute values of the entries of the vector.
12.3 Dictionary-Based CS MRI
As mentioned above, adaptive CS techniques lead to higher sparsities and hence to potentially higher undersampling factors in CS MRI. The key to adaptation to the data is dictionary learning. In dictionary-based approaches, the sparsifying transform Ψ is represented by a sparse decomposition of patches of the target object into basis elements from a dictionary D. Thus, the optimization problem is reformulated (Ravishankar and Bresler 2010, cf. Eq. 4) as:

$$ \min_{x,\,Z} \|P(x) - DZ\|_F^2 + \nu \|y_u - F_u x\|_2^2 \quad \text{s.t.} \quad \|Z_i\|_0 \le T \ \ \forall i, $$
where the subscript F denotes the Frobenius norm, the columns of the matrix X = P(x) are vectorized patches extracted by the operator denoted by P, the column vectors of the matrix Z are sparse codes, and ν is a positive parameter for the synthesis penalty term. Literally, it is assumed that each vectorized patch X_i can be approximated by the linear combination DZ_i, where each column vector Z_i contains no more than T non-zero components. Note that X is formed as a set of overlapping patches extracted from the object image x. Since the sparse approximation is performed on vectorized patches, no restriction on the patch shape is imposed. In our work, we deal with rectangular patches to harmonize their size with the specifics of k-space sampling.
12.3.1 Alternating Optimization

We are faced with a challenging synthesis-based CS problem: given the dictionary D, minimize the previous equation over the object x and the column vectors of the sparse codes Z_i. The conventional "trick" is to transform such an unconstrained problem into a constrained one via variable splitting and then resolve this constrained problem using alternating optimization (Gabay and Mercier 1976; Bertsekas and Tsitsiklis 1989; Eckstein and Bertsekas 1992). In this case, the optimization variables x and {Z_i} are decoupled according to their roles: data consistency in k-space and sparsity of the object approximation. Thus, the Lagrangian function is minimized with respect to these blocks, which leads to alternating minimization in the following iterative algorithm:

$$ Z_i^{k+1} = \arg\min_{Z_i} \|Z_i\|_0 \quad \text{s.t.} \quad \big\|P_i\big(x^k\big) - DZ_i\big\|_F^2 \le \tau \ \ \forall i; $$

$$ x^{k+1} = \arg\min_x \big\|x - A\big(DZ^{k+1}\big)\big\|_2^2 + \nu \|y_u - F_u x\|_2^2. $$

Here A denotes the operator that assembles the vectorized image from the set of patches (the columns of the input matrix). In particular, the approximation of the image vector is assembled from the sparse approximation of patches {DZ_i}. The positive parameter ν represents the confidence in the given (noisy) measurements, and the parameter τ controls the accuracy of the object synthesis from sparse codes. Such algorithms are the subject of intensive research in various application areas (Bioucas-Dias and Figueiredo 2007; Afonso et al. 2010). In the first step, the object estimate is assumed to be fixed, and the sparse codes {Z_i} are found using the given dictionary D in terms of basis pursuit denoising, so that the sparse object approximation is satisfied within a certain tolerance τ. In our work, it is realized with the orthogonal matching pursuit (OMP) algorithm (Elad 2010). In the other step, the sparse representation of the object is assumed to be fixed, and the object estimate is updated targeting data consistency. The last equation in the system above can be resolved from the normal equation:
$$ \Big(F_u^H F_u + \tfrac{1}{\nu} I\Big)\, x^{k+1} = F_u^H y_u + \tfrac{1}{\nu} A\big(DZ^{k+1}\big). $$

The superscript {·}^H denotes the Hermitian transpose operation. Solving the equation directly is tedious due to the inversion of a typically huge matrix. It can be simplified by transforming from the image to the Fourier domain. Let the Fourier transform matrix F from the first equation of this chapter be normalized such that F^H F = I. Then:

$$ \Big(F F_u^H F_u F^H + \tfrac{1}{\nu} I\Big)\, Fx = F F_u^H y_u + \tfrac{1}{\nu} FA\big(DZ^{k+1}\big), $$

where F F_u^H F_u F^H is a diagonal matrix consisting of ones and zeros – the ones are at those diagonal entries that correspond to a sampled location in the k-space. Here y_u = F F_u^H y_u and y^{k+1/2} = FA(DZ^{k+1}); the vector y^{k+1/2} represents the Fourier spectrum of the sparsely approximated object at the k-th iteration. It follows that the resulting Fourier spectrum estimate is of the form (Ravishankar and Bresler 2010, cf. Eq. 9):

$$ y^{k+1} = \overline{m} \circ y^{k+1/2} + m \circ \big(y^{k+1/2} + \nu\, y_u\big)\,\frac{1}{1+\nu}, $$

where the mask $\overline{m}$ is logically complementary to m and represents all empty, non-sampled positions of the k-space. In general, the spectrum at sampled positions is recalculated as a mixture of the given measurements y_u and the estimate y^{k+1/2}. For noiseless data (ν → ∞), the sampled frequencies are merely restored to their measured values.
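A compact NumPy sketch of this per-frequency data-consistency update is given below; the function name and argument layout are illustrative, and ν → ∞ is approximated by a large finite value.

```python
import numpy as np

def kspace_update(y_half, y_u, mask, nu):
    """Combine the estimated spectrum y^{k+1/2} with the measurements y_u.

    Non-sampled positions keep the estimate; sampled positions become a
    weighted mixture (y^{k+1/2} + nu*y_u)/(1+nu); for very large nu the
    measured values are effectively restored exactly.
    """
    sampled = mask.astype(bool)
    y_next = y_half.copy()
    y_next[sampled] = (y_half[sampled] + nu * y_u[sampled]) / (1.0 + nu)
    return y_next
```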
12.3.2 Dictionary Learning

Dictionary learning aims to solve the following basis pursuit denoising problem (Elad 2010) with respect to D:

$$ \min_{D,\,Z_i} \|Z_i\|_0 \quad \text{s.t.} \quad \|X - DZ\|_F^2 \le \tau \ \ \forall i. $$
Since dictionary elements are basis atoms used to represent image patches, they should also be learnt on the same type of signals, namely image patches. In the equation above, the columns of the matrices X and Z represent vectorized training patches and the corresponding sparse codes, respectively. Again, one commonly alternates between searching for D with Z fixed (the dictionary update step) and searching for Z with D fixed (the sparse coding step) (Elad 2006, 2010; Yaghoobi et al. 2009). In our method, we exploit the state-of-the-art dictionary learning algorithm K-SVD (Aharon et al. 2006), successfully applied to image denoising (Mairal et al. 2008;
Fig. 12.3 Complex-valued dictionary with rectangular (8 × 4) atoms: (left) magnitudes and (right) arguments/phases of atoms. The phase of the atoms is represented in the HSV (Hue, Saturation, Value) colour map
Protter and Elad 2009). We recommend taking into consideration the type of data the training set contains; namely, training patches should be extracted from datasets of images similar to the target object or from the actual corrupted (aliased) images to be reconstructed. An example of a dictionary learnt on complex-valued data is shown in Fig. 12.3.
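For orientation, the sketch below shows a single K-SVD-style dictionary-update step for one atom on complex-valued patches; it is a simplified illustration of the atom/coefficient update via a rank-1 SVD, not the authors' implementation, and the shapes and names are assumptions.

```python
import numpy as np

def ksvd_atom_update(X, D, Z, j):
    """Update atom j of D and its codes given training patches X.

    X: (p, N) vectorized training patches, D: (p, K) dictionary,
    Z: (K, N) sparse codes. Arrays are modified in place.
    """
    users = np.nonzero(np.abs(Z[j, :]) > 0)[0]      # patches that use atom j
    if users.size == 0:
        return D, Z
    Z[j, users] = 0
    E = X[:, users] - D @ Z[:, users]               # residual without atom j
    U, s, Vh = np.linalg.svd(E, full_matrices=False)
    D[:, j] = U[:, 0]                               # best rank-1 atom
    Z[j, users] = s[0] * Vh[0, :]                   # corresponding coefficients
    return D, Z
```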
12.4 Proposed Efficient MRI Algorithm
In this section, we share some hints found during our long-term research on efficient CS MRI reconstruction with dictionaries precomputed in advance: effective innovations providing fast convergence and imaging enhancement, spatial adaptation to aliasing artefacts, and acceleration by parallelization under limited GPU resources (Korobchenko et al. 2016).
12.4.1 Split Bregman Initialization

While zeros in the empty positions of the Fourier spectrum lead to strong aliasing artefacts, a smarter initial guess is desirable. We found that the result of the split Bregman iterative algorithm (Goldstein and Osher 2009) is an efficient initial guess that essentially suppresses aliasing effects and significantly increases both the initial reconstruction quality and the convergence rate of the main, computationally expensive dictionary-based CS algorithm. In accordance with Wang et al. (2007), the ℓ1 and ℓ2 norms in the equation given in Sect. 12.2.2 may be decoupled as follows:
$$ \min_{x,\,\chi} \|F_u x - y_u\|_2^2 + \lambda \|\chi\|_1 + \mu \|\Psi(x) - \chi\|_2^2, $$

where χ is a sparse object approximation found by a global sparsifying transform. This optimization problem can be resolved via approximation by the so-called Bregman distance between the optimal and the approximate result (Liu et al. 2009, cf. Fig. 1). The solution is adaptively refined by iteratively updating the regularization function, which follows the split Bregman iterative algorithm (Goldstein and Osher 2009, Sect. 3):

$$ x^{k+1} = \arg\min_x \|F_u x - y_u\|_2^2 + \mu \big\|\chi^k - \Psi(x) - b^k\big\|_2^2, $$
$$ \chi^{k+1} = \arg\min_\chi \lambda \|\chi\|_1 + \mu \big\|\chi - \Psi\big(x^{k+1}\big) - b^k\big\|_2^2, $$
$$ b^{k+1} = b^k + \Psi\big(x^{k+1}\big) - \chi^{k+1}, $$

where μ is a regularization parameter and b^k is an update of the Bregman parameter vector. In our particular case, the conventional differentiation operator is used as the sparsifying transform Ψ, i.e. here we resolve the total variation (TV) regularization.
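A minimal sketch of the split Bregman loop is given below; the χ-subproblem has the closed-form soft-thresholding solution, while the quadratic x-subproblem is left as an assumed callback `x_update` (for Fourier measurements and a TV transform it can be solved in the frequency domain, which is not shown here).

```python
import numpy as np

def shrink(v, threshold):
    """Soft-thresholding: closed-form solution of the chi-subproblem."""
    mag = np.abs(v)
    return v / np.maximum(mag, 1e-12) * np.maximum(mag - threshold, 0.0)

def split_bregman(x0, x_update, psi, lam, mu, n_iter=30):
    """Generic split Bregman iterations.

    psi(x) returns the sparsifying transform (e.g. finite differences);
    x_update(chi, b) must solve argmin_x ||F_u x - y_u||^2 + mu*||chi - psi(x) - b||^2.
    """
    x = x0
    chi = np.zeros_like(psi(x))
    b = np.zeros_like(chi)
    for _ in range(n_iter):
        x = x_update(chi, b)                        # quadratic subproblem
        chi = shrink(psi(x) + b, lam / (2.0 * mu))  # l1 shrinkage
        b = b + psi(x) - chi                        # Bregman parameter update
    return x
```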
12.4.2 Precomputed Dictionary

In the absence of reference signals, dictionary learning is commonly performed online, using patches extracted from intermediate results of an iterative reconstruction algorithm (Ravishankar and Bresler 2010). Note that at the beginning of the reconstruction procedure, training patches are significantly corrupted by noise and aliasing artefacts, so a fairly large number of iterations is required to sufficiently suppress the noise. It was found experimentally (Migukin et al. 2015, 2017; Korobchenko et al. 2018) that precomputed dictionaries (offline learning) provide better reconstruction quality than learning on the actual data (Ravishankar and Bresler 2010) or iterative dictionary updating (Liu et al. 2013). Online dictionary learning works well for everyday photo denoising because the atoms used in the dictionary-learning procedure are free from random noise due to the smoothing properties of the ℓ2 norm. Regarding MRI, patches are corrupted with specific aliasing artefacts, and thus these artefacts significantly corrupt the dictionary atoms, which consequently impacts the reconstruction quality. This leads to the idea that the dictionary should be learnt on fully sampled (and therefore artefact-free) experimental data. Moreover, this approach leads to a significant speedup of the reconstruction algorithm, because the time-consuming step of dictionary learning is moved outside the reconstruction and can be performed only once.
In our development, dictionaries are computed in advance on a training dataset of fully sampled high-quality images from a target class. Training patches are randomly extracted from the set of training images: the subset of all possible overlapping patches from all training images is shuffled, and then the desired number of training patches is selected to be used as columns of X in the learning procedure formulated in Sect. 12.3.2.
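A short sketch of such offline training-patch extraction follows; the patch shape, patch count and function name are illustrative assumptions.

```python
import numpy as np

def sample_training_patches(images, patch_shape=(8, 4), n_patches=20000, seed=0):
    """Randomly extract vectorized patches from fully sampled training images.

    Returns a (patch_size, n_patches) matrix X whose columns are shuffled
    training patches, suitable as input to offline dictionary learning.
    """
    rng = np.random.default_rng(seed)
    ph, pw = patch_shape
    cols = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        r = rng.integers(img.shape[0] - ph + 1)
        c = rng.integers(img.shape[1] - pw + 1)
        cols.append(img[r:r + ph, c:c + pw].reshape(-1))
    return np.stack(cols, axis=1)
```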
12.4.3 Multi-Band Decomposition Aiming at the computational efficiency of the proposed CS MRI algorithm, we exploit the linearity of the Fourier transform to parallelize the reconstruction procedures. Taking into account that a plurality of object-image components corresponds to different frequency bands, we split the k-space spectrum to be reconstructed y and the sampling mask m into several bands (e.g., corresponding to a group of low, middle and high frequencies) to produce a number of new bandoriented subspectra {yb} and sampling submasks {mb}, respectively. Each resulting subspectrum yb contains only frequencies from the corresponding band, while the other frequencies are filled with zeros and marked as measured by mb. The decomposition of the MR signal into frequency bands yields a contraction of signal variety within a band (the signal has a simple form within each band) and consequently increasing the signal sparsity and the reconstruction quality. The undersampled subspectra {∙} are treated as inputs to the proposed dictionary-based reconstruction (alternating minimization in the iterative algorithm presented in Sect. 12.3.1) and are processed in parallel. Once the subspectra are reconstructed, the resulting object reconstruction is obtained by summing the reconstructed images for all bands. Note that each subspectrum is reconstructed using its own special dictionary Db. Thus, dictionary learning is performed respectively: training images are decomposed into frequency bands, and a number of dictionaries are learnt for each of such bands separately. For Cartesian one-dimensional sampling (see Fig. 12.1), the bands are rectangular stripes in a k-space, which are aligned with sampling lines. For a radially symmetric or isotropic random sampling (Vasanawala et al. 2011), bands are represented as concentric rings centred in a zero frequency.
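A minimal sketch of such a band split for Cartesian row-wise sampling is shown below (equal-height stripes are an assumption; the original band boundaries are not specified here). Frequencies outside a band are zero-filled and flagged as "measured" in the submask so they are never altered during that band's reconstruction.

```python
import numpy as np

def split_into_bands(y, mask, n_bands):
    """Split a k-space spectrum and its sampling mask into horizontal bands."""
    row_groups = np.array_split(np.arange(y.shape[0]), n_bands)
    subspectra, submasks = [], []
    for rows in row_groups:
        yb = np.zeros_like(y)
        mb = np.ones_like(mask)            # outside the band: treated as sampled
        yb[rows, :] = y[rows, :]
        mb[rows, :] = mask[rows, :]
        subspectra.append(yb)
        submasks.append(mb)
    return subspectra, submasks
```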
12.4.4 Non-uniform-Sized Patches

It seems straightforward that the optimal aspect ratio (the relative proportion of width and height) of a rectangular patch depends on the k-space sampling scheme. Let us start from the two-dimensional (2D) case. A common way of Cartesian sampling in an MRI device is to acquire phase-encoding lines. Such lines represent rows in k-space. The information about frequencies is lost along the y-direction (see Fig. 12.1), which results in vertical aliasing artefacts in the zero filling (ZF) reconstruction (see
Fig. 12.2b). Analogously, in the three-dimensional (3D) case, we propose to cover more data along the direction of undersampling by applying non-uniform-sized patches. If the number of dictionary elements is fixed, the size of a patch should not be very large in all dimensions because of the difficulty of encoding a large amount of information with only a small number of patch components. It was found (Migukin et al. 2017; Korobchenko et al. 2018) that rectangular/parallelepiped patches (with one dimension prioritized) allow achieving better reconstruction quality than square/cubic patches with the same or even a smaller number of pixels. For a non-Cartesian isotropic random sampling scheme, we achieve higher reconstruction quality with square/cubic patches. In the general case, the patch aspect ratio depends on the amount of anisotropy.
12.4.5 DESIRE-MRI

Let us assume that the measurement vector and the sampling mask are split into N_b bands, and that proper multi-band dictionaries {D_b} for all these bands have already been learnt by K-SVD as discussed above. Let the initial estimate of the undersampled object spectrum be obtained by the iterative split Bregman algorithm (see Sect. 12.4.1); we denote this initial guess by x_SB. Then, considering the resulting Fourier spectrum estimation presented in Sect. 12.3.1, the reconstruction of such a pre-initialized object is performed by the proposed iterative multi-band algorithm defined as follows:

Algorithm 12.1: DESIRE-MRI
Input data: {D_b}, y_u^b
Initialization: k = 0, x^0 = x_SB.
For all object k-space bands b = 1, ..., N_b, repeat until convergence:

1. Sparse codes update via mini-batch OMP:
$$ \big(Z^b\big)^{k+1} = \arg\min_{Z^b} \sum_i \big\|Z_i^b\big\|_0 + \gamma_b \big\|P\big((x^b)^k\big) - D^b Z^b\big\|_F^2. $$

2. Update of the spectra of the sparse object approximations:
$$ \big(y^b\big)^{k+1/2} = FA\big(D^b \big(Z^b\big)^{k+1}\big). $$

3. Update of the object band-spectra and band-objects:
$$ \big(y^b\big)^{k+1} = \overline{m^b} \circ \big(y^b\big)^{k+1/2} + m^b \circ \Big(\big(y^b\big)^{k+1/2} + \nu_b\, y_u^b\Big)\,\frac{1}{1+\nu_b}, $$
$$ \big(x^b\big)^{k+1} = F^H \big(y^b\big)^{k+1}, \qquad k = k + 1. $$

When converged, combine all resulting band-objects:
$$ \widehat{x} = \sum_{b=1}^{N_b} x^b. $$
We name this two-step algorithm, with its split Bregman initialization and multi-band dictionaries learnt beforehand, Dictionary Express Sparse Image Reconstruction and Enhancement (DESIRE-MRI). The initial guess x^0 is split into N_b bands, forming a set of estimates for the band objects {x^b}. Then, all these estimates are partitioned into patches to be sparsely approximated by mini-batch OMP (Step 1). We discuss the mini-batch OMP and its GPU realization in detail below. In Step 2, estimates of the object subspectra are found by assembling the sparse codes into the band objects and taking their Fourier transform. Step 3 updates the resulting band objects by restoring the measured values in the object subspectra (as defined by the last equation in Sect. 12.3.1) and returning to the image domain. We then go back to Step 1 until DESIRE-MRI converges. Note that the output of the algorithm is the sum of the reconstructed band objects. The DESIRE algorithm for the single-coil case was originally published in Migukin et al. (2015). There is a clear parallel with the well-studied Gerchberg-Saxton-Fienup (Fienup 1982) algorithms, but in contrast with the loss/ambiguity of the object phase, here we are faced with the total loss of some complex-valued observations.
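The per-band loop of Algorithm 12.1 can be sketched in NumPy as follows. The helpers `sparse_code`, `extract` and `assemble` (mini-batch OMP, patch extraction and overlapping-patch assembly) are assumed to be supplied by the caller, and ν → ∞ is approximated by a large value; this is an illustration of the structure, not the authors' implementation.

```python
import numpy as np

def desire_band(y_u_b, mask_b, D_b, x0_b, sparse_code, extract, assemble,
                nu=1e6, n_iter=100):
    """One band of the DESIRE-MRI loop (single coil, 2D case)."""
    fft2c = lambda u: np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(u)))
    ifft2c = lambda u: np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(u)))
    x = x0_b
    sampled = mask_b.astype(bool)
    for _ in range(n_iter):
        Z = sparse_code(extract(x), D_b)          # Step 1: sparse codes
        y_half = fft2c(assemble(D_b @ Z))         # Step 2: spectrum of approximation
        y_next = y_half.copy()                    # Step 3: restore measured values
        y_next[sampled] = (y_half[sampled] + nu * y_u_b[sampled]) / (1.0 + nu)
        x = ifft2c(y_next)
    return x

# Final object: x_hat = sum of desire_band(...) over all bands b = 1..N_b
```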
12.4.6 GPU Implementation

One of the main problems in the classical OMP algorithm is the computationally expensive matrix pseudo-inversion (Rubinstein et al. 2008, Algorithm 1, step 7). It is efficiently resolved in OMP-Cholesky and Batch OMP by progressive Cholesky factorization (Cotter et al. 1999). Batch OMP also uses precomputation of the Gram matrix of the dictionary atoms (Rubinstein et al. 2008), which allows omitting the iterative recalculation of residuals. We use such an optimized version of OMP because a huge set of patches is encoded with a single dictionary. Moreover, Batch OMP is
based on matrix-matrix and matrix-vector operations, and thus it is tempting to take advantage of parallelization by a GPU implementation. Direct treatment of all patches simultaneously may fail due to a lack of GPU resources (capacity). We apply a mini-batch technique for CS MRI: the full set of patches is divided into subsets, called mini-batches, which are then processed independently. This approach allows using multiple GPUs to process each mini-batch on an individual GPU, or CUDA streams to obtain maximum performance by leveraging the concurrency of CUDA kernels and data transfers. In our work, we consider the lower bound of parallelization, and therefore results for only one GPU are presented. Our mini-batch OMP algorithm is realized by highly optimized CUDA kernels and standard cuBLAS batched functions such as "trsmBatched" to solve multiple triangular linear systems simultaneously. The Cholesky factorization (less space-consuming than the QR factorization) is chosen in order to enlarge the possible size and number of mini-batches in our simultaneous processing. The resulting acceleration is presented in Table 12.1 in Sect. 12.5.
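For reference, the sketch below illustrates the two ideas of this section in plain NumPy: a greedy OMP that reuses the precomputed Gram matrix G = D^H D, and a wrapper that encodes the patch set in independent mini-batches. It is a CPU illustration only, not the CUDA/cuBLAS implementation described above, and the batch size is an arbitrary assumption.

```python
import numpy as np

def omp_column(G, dtx, T):
    """Greedy OMP for one patch using G = D^H D and correlations dtx = D^H x."""
    z = np.zeros(G.shape[0], dtype=complex)
    support, resid_corr = [], dtx.copy()
    for _ in range(T):
        j = int(np.argmax(np.abs(resid_corr)))
        if j in support:
            break
        support.append(j)
        z_s = np.linalg.solve(G[np.ix_(support, support)], dtx[support])
        resid_corr = dtx - G[:, support] @ z_s       # D^H of the residual
        z[:] = 0
        z[support] = z_s
    return z

def minibatch_omp(X, D, T, batch_size=4096):
    """Encode columns of X in independent mini-batches (GPU-friendly layout)."""
    G = D.conj().T @ D
    codes = []
    for start in range(0, X.shape[1], batch_size):
        batch = X[:, start:start + batch_size]
        DtX = D.conj().T @ batch
        codes.append(np.stack([omp_column(G, DtX[:, i], T)
                               for i in range(batch.shape[1])], axis=1))
    return np.concatenate(codes, axis=1)
```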
12.5 Experimental Results for Multi-Coil MRI Reconstruction
The goal of our numerical experiments is to analyse the reconstruction quality and to study the performance of the algorithm. Here, we consider the reconstruction quality of complex-valued 2D and 3D target objects, namely in vivo MR scans with normalized amplitudes. The binary sampling masks used, with an undersampling rate (the ratio of the total number of k-space components to the sampled ones) equal to 4, are illustrated in Fig. 12.4. In addition to visual comparison, we report the reconstruction accuracy in terms of the peak signal-to-noise ratio (PSNR) and the high-frequency error norm (HFEN). Following Ravishankar and Bresler (2010), the reference (fully sampled) and reconstructed objects are (slice-wise) filtered by a Laplacian of Gaussian filter, and HFEN is found as the ℓ2 norm of the difference between these filtered objects. Note that in practice one has no true signal x with which to calculate the reconstruction accuracy in terms of PSNR or HFEN. DESIRE-MRI is assumed to have converged when the norm of the difference between successive iterations, DIFF = ||x^k − x^{k−1}||_2, reaches an empirically found threshold. We exploit the efficient basis pursuit denoising approach, i.e. for particular objects we choose proper parameters of the patch-based sparse approximation and the tolerance τ. In general, the parameters for the split Bregman initialization are λ = 100 and μ = 30 for 2D objects and λ = 0.1 and μ = 3 for 3D objects. For easy comparison with recent (at the time of development) MRI algorithms (Lustig et al. 2007; Ravishankar and Bresler 2010), all DESIRE-MRI results are given for ν → ∞, i.e. acting on the assumption that our spectral measurements are noise-free.

Fig. 12.4 Binary sampling masks, undersampling rate 4: (left) 2D Cartesian mask for the row-wise PE and (right) 3D Gaussian mask for the tube-wise PE (all slices are the same)

Table 12.1 Acceleration for various parts of DESIRE-MRI
        SB (p_i = 8.5%)   RM (p_i = 11.5%)   SA (p_i = 80%)   Overall S_A
  2D    1.5               4.5                55.8             10.3
  3D    2.7               13.9               10.5             8.6
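The two quality metrics can be computed as in the sketch below; the Laplacian-of-Gaussian width sigma is an assumed value, not taken from the original experiments.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def psnr(ref, rec, peak=1.0):
    """PSNR for magnitude images with amplitudes normalized to [0, 1]."""
    mse = np.mean(np.abs(ref - rec) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def hfen(ref, rec, sigma=1.5):
    """High-frequency error norm: l2 norm of the difference of the
    Laplacian-of-Gaussian filtered magnitudes (applied slice-wise for 3D)."""
    return np.linalg.norm(gaussian_laplace(np.abs(ref), sigma)
                          - gaussian_laplace(np.abs(rec), sigma))
```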
12.5.1 Efficiency of DESIRE-MRI Innovations

In processing of experimental data, the trigger which stops DESIRE-MRI can be either reaching a certain number of iterations or DIFF falling below a threshold. The use of DIFF reduces the required number of iterations and indicates the potential for further improvement during the MRI reconstruction. The challenge is to choose the proper threshold, which largely depends on the convergence rate and the reached imaging quality. In Fig. 12.5, we demonstrate the influence of the split Bregman initialization on the convergence rate and hence on the resulting reconstruction accuracy. The red horizontal line in Fig. 12.5 (top) denotes the stopping threshold (here we take DIFF = 0.258), and the vertical red dotted line maps this difference onto the reached PSNR and HFEN. Note that a further increase of DIFF after reaching the threshold has no effect. It can be seen that the proposed split Bregman initialization (hereinafter SB) gives approximately 4 dB improvement in PSNR (see the 0th iteration in Fig. 12.5 (middle)) and about 0.24 in HFEN (Fig. 12.5 (bottom)). The DESIRE-MRI stopping condition is achieved in 50 iterations for SB and in 93 iterations for the conventional ZF initialization. In addition, SB in 50 iterations gives 0.16 dB higher PSNR and 0.01 lower HFEN compared with ZF in 93 iterations. Let us consider the influence of the multi-band decomposition on the reconstruction quality based on the Siemens T1 axial slice (one of four coil images). DESIRE-MRI is performed with the tolerance τ = 0.04, and the used dictionary is composed of 64 complex-valued 2D patches of size 8 × 4. For consistency with our further experiments and with the results of the compared algorithms, the DESIRE-MRI reconstruction is hereafter performed for 100 iterations. Figure 12.6a illustrates a fragment of the original object. In Fig. 12.6b, we present the DESIRE-MRI result with no band splitting (PSNR = 38.5 dB). Figure 12.6c demonstrates a clear imaging
Fig. 12.5 Influence of initialization on convergence and accuracy of DESIRE-MRI, split Bregman (SB, solid curve) vs zero filling (ZF, dashed curve) in 2D: (top) stopping condition and reconstruction quality in terms of (middle) PSNR and (bottom) HFEN
Fig. 12.6 Imaging enhancement with the multi-band scheme: fragment of (a) original axial magnitude and the DESIRE-MRI result by (b) the whole bandwidth and (c) two bands
enhancement due to applying two subbands in DESIRE-MRI. It can be seen that the multi-band decomposition sweeps out the remaining wave-like aliasing artefacts, and the PSNR is approximately 39 dB. Note that the effect of imaging enhancement by the multi-band decomposition is reflected in the name of the proposed algorithm.
12.5.2 DESIRE-MRI Reconstruction Accuracy

Let us compare the DESIRE-MRI result with recent CS MRI approaches. Again, the Fourier spectrum of the Siemens T1 axial coil image is undersampled by the binary sampling mask shown in Fig. 12.4 (left) and reconstructed by LDP (Lustig et al. 2007) and DLMRI (Ravishankar and Bresler 2010). In Fig. 12.7, we present the comparison of the normalized magnitudes. DLMRI with online recalculation of dictionaries is unable to remove the large aliasing artefacts (Fig. 12.7c, PSNR = 35.3 dB). LDP suppresses aliasing artefacts but still not sufficiently (Fig. 12.7b, PSNR = 34.2 dB); its slightly lower PSNR compared with DLMRI is due to oversmoothing of the LDP result. DESIRE-MRI with two-band splitting (see Fig. 12.7d) produces an aliasing-free reconstruction that
Fig. 12.7 Comparison of the reconstructed MR images: (a) the original axial magnitude; (b) its reconstruction obtained by LDP (Lustig et al. 2007), PSNR = 34.2 dB; (c) DLMRI (Ravishankar and Bresler 2010), PSNR = 35.3 dB; (d) our DESIRE-MRI, PSNR = 39 dB
Fig. 12.8 Imaging of multi-coil DESIRE-MRI reconstructions for 2D and 3D objects (column-wise, from left to right): Samsung TOF 3D, Siemens TOF 3D, Siemens T1 axial 2D slice and Siemens 2D Phantom. The comparison of SOS (row-wise, from top to bottom): for the original objects, the zero-filling reconstruction and DESIRE-MRI
looks much closer to the reference; a small degree of smoothing is inevitable at a high undersampling rate. All experimental data are multi-coil, and thus individual coil images are typically reconstructed independently, and the final result is then found by merging these reconstructed coil images with the so-called SOS ("sum of squares", Larsson et al. 2003). In Fig. 12.8, some DESIRE-MRI results of such multi-coil reconstructions are demonstrated. In the top row, we present the SOS for the original objects (column-wise, from left to right): Samsung TOF (344 × 384 × 30 angiography), Siemens TOF (348 × 384 × 48 angiography), Siemens T1 axial slice (320 × 350) and Siemens Phantom (345 × 384). In the middle row, we demonstrate the SOS for the aliased reconstructions by zero filling. The undersampling of the 2D and 3D object spectra is performed by the corresponding 2D Cartesian and 3D Gaussian masks given in Fig. 12.4. In the bottom row of Fig. 12.8, the SOS for the DESIRE-MRI reconstructions is illustrated. For both 2D and 3D cases, DESIRE-MRI results in a clean reconstruction with almost no essential degradations. Note that on the Siemens Phantom, with a lot of high-frequency details (Fig. 12.8, bottom-right image), DESIRE-MRI returns some visible defects on the borders of the bottom black rectangle and between the radial "beams".
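The sum-of-squares combination of the per-coil reconstructions is a one-line operation, sketched here for completeness:

```python
import numpy as np

def sum_of_squares(coil_images):
    """Combine per-coil reconstructions: sqrt of the sum of squared magnitudes."""
    stack = np.stack([np.abs(c) ** 2 for c in coil_images], axis=0)
    return np.sqrt(stack.sum(axis=0))
```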
12.5.3 Computational Performance

Here, we compare the computational performance of the proposed algorithm for 2D and 3D objects. DESIRE-MRI was developed in MATLAB v8.1 (R2013a) and further implemented in C++ with CUDA 5.5. Computations were performed on an Intel Xeon E5-2665 CPU at 2.4 GHz with 32 GB RAM, running 64-bit Ubuntu 12.04. For parallel computations, we used an Nvidia Tesla K20c with 5120 MB memory. Considering the overheads, the overall speedup S_A of DESIRE-MRI is found according to Amdahl's law (Amdahl 1967):

$$ S_A = 1 \Big/ \sum_i \frac{p_i}{s_i}, $$
where s_i is the acceleration (speedup) of the programme part p_i. To characterize the computational performance, the proposed DESIRE-MRI algorithm is roughly divided into three blocks: the initialization by split Bregman (SB), restoring the measured values in the k-space domain (RM) and the sparse object approximation in the image domain (SA). In Table 12.1, the acceleration of all these operations for reconstructing 2D and 3D objects is presented. Practically, split Bregman is an isolated operation performed only once before the main loop, so its contribution is 8.5% of the whole DESIRE-MRI duration; still, we accelerate SB by about a factor of two. Computationally, the RM part is relatively cheap because it consists of well-optimized discrete Fourier transforms and replacement of values; nevertheless, the contribution of this block is 11.5% due to its iterative (we assume 100 iterations) repetition. The larger the problem dimension, the greater the speedup: for 3D data, we obtain a roughly three times higher acceleration factor. The most time-consuming part, SA (80% of the whole computation time), is well optimized by the GPU-based mini-batch OMP: 55.8 and 10.5 times for 2D and 3D objects, respectively. The SA acceleration drop for the 3D case is due to more complex memory access patterns leading to inefficient cache use. Note that here we omit the influence of overheads on SA, which may significantly decrease the total speedup. The total time of multi-coil DESIRE-MRI is about 1 second for 2D objects and about 1000 seconds for 3D objects in a slice-wise scheme. This computation time, coupled with the high acceleration rate S_A, demonstrates the high productivity of the proposed CS MRI algorithm.
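As a worked check of Amdahl's law with the fractions and per-part speedups of Table 12.1:

```python
# Overall DESIRE-MRI speedup from Table 12.1 (fractions p_i and speedups s_i)
p = {"SB": 0.085, "RM": 0.115, "SA": 0.80}
s_2d = {"SB": 1.5, "RM": 4.5, "SA": 55.8}
s_3d = {"SB": 2.7, "RM": 13.9, "SA": 10.5}

amdahl = lambda s: 1.0 / sum(p[k] / s[k] for k in p)
print(amdahl(s_2d), amdahl(s_3d))   # ~10.4 and ~8.6, matching Table 12.1 up to input rounding
```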
12.6 Conclusion
In this chapter, an efficient procedure of the CS MRI reconstruction by means of a precomputed dictionary is presented. Aiming at optimal sparse object regularization, the proposed algorithm takes into consideration a significant amount of noise and corruption by aliasing artefacts. Based on basis pursuit denoising, we have developed the iterative optimization algorithm for the complex-valued 2D and 3D object
reconstruction. The algorithm demonstrates both a fast convergence rate, thanks to the effective split Bregman initialization, and state-of-the-art reconstruction quality for real-life noisy experimental data. In addition, our implementation on a commodity (available on the market in 2014) GPU allows achieving remarkably high performance, fully applicable for commercial use.
References

Afonso, M.V., Bioucas-Dias, J.M., Figueiredo, M.A.T.: Fast image recovery using variable splitting and constrained optimization. IEEE Trans. Image Process. 19(9), 2345–2356 (2010)
Aggarwal, N., Bresler, Y.: Patient-adapted reconstruction and acquisition dynamic imaging method (PARADIGM) for MRI. Inverse Prob. 24(4), 1–29 (2008)
Aharon, M., Elad, M., Bruckstein, A.: K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54(11), 4311–4322 (2006)
Amdahl, G.M.: Validity of the single-processor approach to achieving large-scale computing capabilities. In: Proceedings of AFIPS Conference, vol. 30, pp. 483–485 (1967)
Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, 735 p (1989)
Bioucas-Dias, J.M., Figueiredo, M.A.T.: A new TwIST: two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Trans. Image Process. 16(12), 2980–2991 (2007). Accessed on 04 October 2020. http://www.lx.it.pt/~bioucas/TwIST/TwIST.htm
Blaimer, M., Breuer, F., Mueller, M., Heidemann, R.M., Griswold, M.A., Jakob, P.M.: SMASH, SENSE, PILS, GRAPPA: how to choose the optimal method. Top. Magn. Reson. Imaging 15(4), 223–236 (2004)
Caballero, J., Rueckert, D., Hajnal, J.V.: Dictionary learning and time sparsity in dynamic MRI. In: Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), vol. 15, pp. 256–263 (2012)
Candès, E., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)
Cotter, S.F., Adler, R., Rao, R.D., Kreutz-Delgado, K.: Forward sequential algorithms for best basis selection. In: IEE Proceedings – Vision, Image and Signal Processing, vol. 146(5), pp. 235–244 (1999)
Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process. 16(8), 2080–2095 (2007)
Danielyan, A., Katkovnik, V., Egiazarian, K.: BM3D frames and variational image deblurring. IEEE Trans. Image Process. 21(4), 1715–1728 (2011)
Do, M.N., Vetterli, M.: The contourlet transform: an efficient directional multiresolution image representation. IEEE Trans. Image Process. 14(2), 2091–2106 (2006)
Donoho, D.: Compressed sensing. IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006)
Donoho, D.L., Elad, M., Temlyakov, V.N.: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inf. Theory 52(1), 6–18 (2006)
Eckstein, J., Bertsekas, D.P.: On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)
Elad, M.: Sparse and Redundant Representations: from Theory to Applications in Signal and Image Processing. Springer Verlag, New York, 376 p (2010)
Elad, M., Aharon, M.: Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process. 15(12), 3736–3745 (2006)
Fienup, J.R.: Phase retrieval algorithms: a comparison. Appl. Opt. 21(15), 2758–2769 (1982)
Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Comput. Math. Appl. 2(1), 17–40 (1976)
Goldstein, T., Osher, S.: The Split Bregman method for L1-regularized problems. SIAM J. Imag. Sci. 2(2), 323–343 (2009). Accessed on 04 October 2020. http://www.ece.rice.edu/~tag7/Tom_Goldstein/Split_Bregman.html
Griswold, M.A., Jakob, P.M., Heidemann, R.M., Nittka, M., Jellus, V., Wang, J., Kiefer, B., Haase, A.: Generalized autocalibrating partially parallel acquisitions (GRAPPA). Magn. Reson. Med. 47(6), 1202–1210 (2002)
Korobchenko, D.A., Danilevitch, A.B., Sirotenko, M.Y., Gavrilyuk, K.A., Rychagov, M.N.: Automatic view planning in magnetic resonance tomography using convolutional neural networks. In: Proceedings of Moscow Institute of Electronic Technology. MIET, Moscow, 176 p (2016)
Korobchenko, D.A., Migukin, A.S., Danilevich, A.B., Varfolomeeva, A.A., Choi, S., Sirotenko, M.Y., Rychagov, M.N.: Method for restoring magnetic resonance image and magnetic resonance image processing apparatus. US Patent Application 20180247436 (2018)
Larkman, D.J., Nunes, R.G.: Parallel magnetic resonance imaging. Phys. Med. Biol. 52(7), R15–R55 (2007)
Larsson, E.G., Erdogmus, D., Yan, R., Principe, J.C., Fitzsimmons, J.R.: SNR-optimality of sum-of-squares reconstruction for phased-array magnetic resonance imaging. J. Magn. Reson. 163(1), 121–123 (2003)
Liang, Z.-P., Lauterbur, P.C.: Principles of Magnetic Resonance Imaging: a Signal Processing Perspective. Wiley-IEEE Press, New York (2000)
Liu, B., King, K., Steckner, M., Xie, J., Sheng, J., Ying, L.: Regularized sensitivity encoding (SENSE) reconstruction using Bregman iterations. Magn. Reson. Med. 61, 145–152 (2009)
Liu, Q., Wang, S., Yang, K., Luo, J., Zhu, Y., Liang, D.: Highly undersampled magnetic resonance image reconstruction using two-level Bregman method with dictionary updating. IEEE Trans. Med. Imaging 32, 1290–1301 (2013)
Lustig, M., Donoho, D., Pauly, J.M.: Sparse MRI: the application of compressed sensing for rapid MR imaging. Magn. Reson. Med. 58, 1182–1195 (2007)
Lustig, M., Donoho, D.L., Santos, J.M., Pauly, J.M.: Compressed sensing MRI. IEEE Signal Process. Mag. 25(2), 72–82 (2008)
Ma, S., Wotao, Y., Zhang, Y., Chakraborty, A.: An efficient algorithm for compressed MR imaging using total variation and wavelets. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8 (2008)
Mairal, J., Elad, M., Guillermo, S.: Sparse representation for color image restoration. IEEE Trans. Image Process. 17(1), 53–69 (2008)
Migukin, A., Agour, M., Katkovnik, V.: Phase retrieval in 4f optical system: background compensation and sparse regularization of object with binary amplitude. Appl. Opt. 52(1), A269–A280 (2013)
Migukin, A.S., Korobchenko, D.A., Sirotenko, M.Y., Gavrilyuk, K.A., Choi, S., Gulaka, P., Rychagov, M.N.: DESIRE: efficient MRI reconstruction with Split Bregman initialization and sparse regularization based on pre-learned dictionary. In: Proceedings of the 27th Annual International Conference on Magnetic Resonance Angiography, p. 34 (2015). http://society4mra.org/
Migukin, A.S., Korobchenko, D.A., Sirotenko, M.Y., Gavrilyuk, K.A., Gulaka, P., Choi, S., Rychagov, M.N., Choi, Y.: Magnetic resonance imaging device and method for generating magnetic resonance image. US Patent Application 20170053402 (2017)
Olsen, Ø.E.: Imaging of abdominal tumours: CT or MRI? Pediatr. Radiol. 38, 452–458 (2008)
Protter, M., Elad, M.: Image sequence denoising via sparse and redundant representations. IEEE Trans. Image Process. 18(1), 27–36 (2009)
Pruessmann, K.P.: Encoding and reconstruction in parallel MRI. NMR Biomed. 19(3), 288–299 (2006)
Pruessmann, K.P., Weiger, M., Scheidegger, M.B., Boesiger, P.: SENSE: sensitivity encoding for fast MRI. Magn. Reson. Med. 42(5), 952–962 (1999)
Qu, X., Guo, D., Ning, B., Hou, Y., Lin, Y., Cai, S., Chen, Z.: Undersampled MRI reconstruction with patch-based directional wavelets. Magn. Reson. Imaging 30(7), 964–977 (2012)
Ravishankar, S., Bresler, Y.: MR image reconstruction from highly undersampled k-space data by dictionary learning. IEEE Trans. Med. Imag. 30(5), 1028–1041 (2010). Accessed on 04 October 2020. http://www.ifp.illinois.edu/~yoram/DLMRI-Lab/DLMRI.html
Roemer, P.B., Edelstein, W.A., Hayes, C.E., Souza, S.P., Mueller, O.M.: The NMR phased array. Magn. Reson. Med. 16, 192–225 (1990)
Rubinstein, R., Zibulevsky, M., Elad, M.: Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. Technical report CS-2008-08, Technion (2008)
Sharif, B., Derbyshire, J.A., Faranesh, A.Z., Bresler, Y.: Patient-adaptive reconstruction and acquisition in dynamic imaging with sensitivity encoding (PARADISE). Magn. Reson. Med. 64(2), 501–513 (2010)
Song, Y., Zhu, Z., Lu, Y., Liu, Q., Zhao, J.: Reconstruction of magnetic resonance imaging by three-dimensional dual-dictionary learning. Magn. Reson. Med. 71(3), 1285–1298 (2014)
Vasanawala, S.S., Murphy, M.J., Alley, M.T., Lai, P., Keutzer, K., Pauly, J.M., Lustig, M.: Practical parallel imaging compressed sensing MRI: summary of two years of experience in accelerating body MRI of pediatric patients. In: Proceedings of the IEEE International Symposium on Biomedical Imaging: from Nano to Macro, pp. 1039–1043 (2011)
Wang, Y., Yin, W., Zhang, Y.: A fast algorithm for image deblurring with total variation regularization. CAAM Technical Report TR07-10 (2007)
Yaghoobi, M., Blumensath, T., Davies, M.E.: Dictionary learning for sparse approximations with the majorization method. IEEE Trans. Signal Process. 57(6), 2178–2191 (2009)
Chapter 13
Depth Camera Based on Colour-Coded Aperture
Vladimir P. Paramonov
13.1 Introduction
Scene depth extraction, i.e. the computation of distances to all scene points visible in a captured image, is an important part of computer vision. There are various approaches for depth extraction: a stereo camera and a camera array in general, a plenoptic camera including dual-pixel technology as a special case, and a camera with a coded aperture, to name a few. The camera array is the most reliable solution, but it implies extra cost and extra space and increases the power consumption for any given application. Other approaches use a single camera but multiple images for depth extraction, thus working only for static scenes, which severely limits the possible application list. Thus, the coded aperture approach is a promising single-lens, single-frame solution which requires insignificant hardware modification (Bando 2008) and can provide a depth quality sufficient for many applications (e.g. Bae et al. 2011; Bando et al. 2008). However, a number of technical issues have to be solved to achieve a level of performance which is acceptable for applications. We discuss the following issues and their solutions in this chapter: (1) light-efficiency degradation due to the insertion of colour filters into the camera aperture, (2) closeness to the diffraction limit for millimetre-size lenses (e.g., smartphones, webcams), (3) blindness of disparity estimation algorithms in low-textured areas, and (4) final depth estimation in millimetres in the whole image frame, which requires the use of a special disparity-to-depth conversion method for coded apertures, generalised for any imaging optical system.
V. P. Paramonov, Samsung R&D Institute Russia (SRR), Moscow, Russia; e-mail: [email protected]
13.2 Existing Approaches and Recent Trends
Depth can be estimated using a camera with a binary coded aperture (Levin et al. 2007; Veeraraghavan et al. 2007). This requires computationally expensive depth extraction techniques based on multiple deconvolutions and a sparse image gradient prior. Disparity extraction using a camera with a colour-coded aperture, which produces spatial misalignment between colour channels, was first demonstrated by Amari and Adelson in 1992 and has not changed significantly since that time (Bando et al. 2008; Lee et al. 2010, 2013). The main advantage of these cameras over cameras with a binary coded aperture is the lower computational complexity of the depth extraction techniques, which do not require time-consuming deconvolutions. The light efficiency of the systems proposed in Amari and Adelson (1992), Bando et al. (2008), Lee et al. (2010, 2013), Levin et al. (2007), Veeraraghavan et al. (2007) and Zhou et al. (2011) is less than 20% compared to a fully opened aperture, which leads to a decreased signal-to-noise ratio (SNR) or longer exposure times with motion blur. That makes them impractical for compact handheld devices and for real-time performance by design. A possible solution was proposed by Chakrabarti and Zickler (2012), where each colour channel has an individual effective aperture size. Therefore, the resulting image has colour channels with different depths of field. Due to its symmetrical design, this coded aperture cannot discriminate between objects closer than and further than the in-focus distance. Furthermore, it requires a time-consuming disparity extraction algorithm. Paramonov et al. (2016a, b, c) proposed a solution to the problems outlined above by presenting new light-efficient coded aperture designs and corresponding algorithm modifications. All the aperture designs detailed above are analysed and compared in Sect. 13.7 of this chapter. Let us consider the coded aperture concept. A simplified imaging system is illustrated schematically in Fig. 13.1. It consists of a single thin lens and an RGB colour sensor. A coded aperture is placed next to the thin lens. The aperture consists of colour filters with different passbands, e.g. red and green colour filters (Fig. 13.2a).
Fig. 13.1 Conventional single-lens imaging system image formation: (a) focused scene; (b) defocused scene
Fig. 13.2 Colour-coded aperture image formation: (a) in-focus foreground; (b) defocused background
Fig. 13.3 Colour image restoration example. From left to right: image captured with colour-coded aperture (causing a misalignment in colour channels); extracted disparity map; restored image
Defocused regions of an image captured with this system have different viewpoints in the red and green colour channels (see Fig. 13.2b). By considering the correspondence between these two channels, the disparity map for the captured scene can be estimated as in Amari and Adelson (1992). The original colour image cannot be restored if the blue channel is absent. Bando et al. (2008), Lee et al. (2010, 2013), and Paramonov et al. (2016a, b, c) changed the aperture design to include all three colour channels, thus making image restoration possible and enhancing the disparity map quality. The image is restored by applying colour shifts based on the local disparity map value (Fig. 13.3). To get the depth map from an estimated disparity map, one may use the thin lens equation. In practice, most of the prior works in this area do not discriminate between disparity and depth, treating them as synonyms, since a one-to-one correspondence exists. However, a modern imaging system usually consists of a number of different lenses, i.e. an objective. That makes the use of the thin lens formula impossible. Furthermore, a planar scene does not have a planar depth map if we apply a trivial disparity-to-depth conversion equation. A number of researchers worked on this problem for different optics systems (Dansereau et al. 2013; Johannsen et al. 2013; Trouvé et al. 2013a, b). Depth results for coded aperture cameras (Lee et al. 2013; Panchenko et al. 2016) are valid only in the centre of the captured image.
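The per-pixel colour-shift restoration can be sketched as follows; the assumption that only one channel is displaced along one axis relative to the green reference, and the shift direction, are illustrative and depend on the actual aperture design.

```python
import numpy as np

def restore_colour_shift(image, disparity):
    """Undo the channel misalignment caused by a colour-coded aperture.

    image: HxWx3 float array; disparity: HxW integer array (pixels).
    Only the red channel is shifted back here, with green as reference.
    """
    restored = image.copy()
    h, w, _ = image.shape
    rows, cols = np.indices((h, w))
    src_cols = np.clip(cols + disparity, 0, w - 1)   # inverse of the local shift
    restored[..., 0] = image[rows, src_cols, 0]
    return restored
```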
A recent breakthrough in efficient deep neural network architectures (He et al. 2016; Howard et al. 2017; Paramonov et al. 2016a, b, c; Wu et al. 2018; Chen et al. 2019) allows cheaper inference of AI models on mobile devices for computer vision applications. This trend is also evident in coded aperture approaches, including colour-coded apertures and chromatic-aberration-coded apertures (Sitzmann et al. 2018; Chang and Wetzstein 2019; Moriuchi et al. 2017; Mishima et al. 2019). At the same time, practitioners would like to avoid deep learning approaches in subtasks which can be solved by direct methods, thus avoiding the burden of expensive dataset recollection and hard-to-achieve generalization. These tasks include precise disparity-to-depth conversion and the optical system calibration required for this conversion, as described in this chapter. Given that the disparity estimation provided by a deep learning algorithm performs well on a new camera (with different optical characteristics), one does not have to recollect and/or retrain anything, while the calibration procedure is significantly cheaper to implement (requiring minutes of work by a single R&D engineer). Furthermore, part of the dataset can be synthesized using numerical simulation of the optics with the coded aperture, as described in the next section of this chapter. This chapter is organized as follows: Sect. 13.3 presents an overview of coded aperture numerical simulation and its possible applications. We consider light-efficient aperture designs in Sect. 13.4. In Sects. 13.5 and 13.6, a method of depth estimation for a generalized optical system which provides a valid depth in the whole frame is analysed. We evaluate these approaches in Sect. 13.7 and show the prototypes and implementations (including 3D reconstruction using a coded aperture-based depth sensor) in Sect. 13.8.
13.3 Numerical Simulation and Its Applications
A simplified numerical simulation of the image formation process was disclosed by Paramonov et al. (2014) and implemented in Paramonov et al. (2016a, b, c) to accelerate the research process and new coded aperture design elaboration. The goal of the simulation is to provide an image captured by a camera with a given coded aperture pattern (Fig. 13.4) for a given undistorted input image and given disparity map.
Fig. 13.4 Aperture designs for image formation numerical simulation: (a) open aperture; (b) binary coded aperture (Levin et al. 2007); (c) colour-coded aperture (Bando et al. 2008); (d) colour-coded aperture (Lee et al. 2010); (e) colour-coded aperture (Chakrabarti and Zickler 2012); (f) colour-coded aperture (Paramonov et al. 2016a, b, c)
Fig. 13.5 PSF numerical simulation for different coded aperture designs at different point source distances from camera along optical axis. From top to bottom row: conventional open aperture, binary coded aperture (Levin et al. 2007); colour-coded aperture (Bando et al. 2008); colour-coded aperture (Lee et al. 2010); colour-coded aperture (Chakrabarti and Zickler 2012); colour-coded aperture (Paramonov et al. 2016a, b, c)
The first step is to simulate the point spread function (PSF) for a given coded aperture design and different defocus levels, for which we follow the theory and the code provided in Goodman (2008), Schmidt (2010), and Voelz (2011). The resulting PSF images are illustrated in Fig. 13.5. Given a set of PSFs corresponding to different defocus levels (i.e. different distances), one can simulate the image formation process for a planar scene via convolution of the input clear image with the corresponding PSF. In the case of a complex scene with depth variations, this process requires multiple convolutions with different PSFs for different depth levels. In order to do this, a continuous depth map should be represented by a finite number of layers. In our simulation, we precalculate 256 PSFs for a given aperture design, corresponding to 256 different defocus levels. Once the PSFs have been precalculated, simulating any image does not require repeating this step. It should be noted that object boundaries and semi-transparent objects require extra care to make this simulation realistic. As a sanity check of the simulation model of the optics, one can numerically simulate the imaging process using a pair of corresponding image and disparity maps taken from existing datasets. Here, we use an image from the Middlebury dataset (Scharstein and Szeliski 2003) to simulate an image captured through the colour-coded aperture illustrated in Fig. 13.4c (proposed by Bando et al. 2008). Then we use the disparity estimation algorithm provided by the original authors of Bando et al. (2008) for their own aperture design (link to the source code: http://web.media.mit.edu/~bandy/rgb/). Based on the results in Fig. 13.6, we conclude that the model's realism is acceptable. This gives an opportunity to generate new synthetic datasets for depth estimation with AI algorithms. Namely, one can use existing datasets of images with
Fig. 13.6 Numerical simulation of image formation for colour-coded aperture: (a) original image; (b) ground truth disparity map; (c) numerically simulated image; (d) raw layered disparity map extracted by algorithm implemented by Bando et al. (2008)
ground-truth depth or disparity information to numerically simulate a distorted image dataset corresponding to a given coded aperture design. In this case, there is no need for time-consuming dataset collection. Furthermore, the coded aperture design could be optimized together with a neural network as its first layer, similarly to what is described by Sitzmann et al. (2018) and Chang and Wetzstein (2019). It is important to emphasize that image quality in terms of human perception might not be needed in some applications where the image is used by an AI system only, e.g. in applications with robots, automotive navigation, surveillance, etc. In this case, a different trade-off should be found, one that prioritizes information extracted from the scene other than the image itself (depth, shape or motion patterns for recognition).
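The layered simulation described above can be sketched as follows; occlusion handling and semi-transparent boundaries (which, as noted, need extra care) are deliberately ignored, and the array shapes are assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_coded_aperture_image(image, depth_labels, psfs):
    """Simulate image formation for a layered scene.

    image: HxWx3 clean image; depth_labels: HxW integer map with values 0..L-1;
    psfs: list of L per-channel PSFs, each of shape (kh, kw, 3), one per
    defocus level. Each layer is convolved with its PSF and the results
    are summed (no occlusion modelling).
    """
    out = np.zeros_like(image, dtype=float)
    for level, psf in enumerate(psfs):
        layer = (depth_labels == level).astype(float)
        for c in range(3):
            out[..., c] += fftconvolve(image[..., c] * layer,
                                       psf[..., c], mode="same")
    return out
```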
13.4 Light-Efficient Colour-Coded Aperture Designs
A number of light-efficient aperture designs proposed by Paramonov et al. (2014, 2016a, b, c) are presented in Fig. 13.7. In contrast to the aperture designs in previous works, sub-apertures of non-complementary colours and non-congruent shapes are utilized. These aperture designs are now being used in more recent research works (Moriuchi et al. 2017; Mishima et al. 2019; Tsuruyama et al. 2020). Let us consider the semi-circle aperture design illustrated in Fig. 13.7. It consists of yellow and cyan filters. The yellow filter has a passband which includes the green and red light passbands. The cyan filter has a passband which includes the green and blue light passbands. The green channel is not distorted by those filters (at least in the ideal case) and can be used as a reference in the image restoration procedure. With ideal filters, this design has a light efficiency of over 65% with respect to a fully open aperture (the ratio of transmitted light between the coded aperture and the fully transparent one, averaged over all three colour channels). An image is captured in the sensor colour space, e.g. RGB. However, the disparity estimation algorithm works in the coded aperture colour space, e.g. CYX, shown in Fig. 13.8a. The artificial vector X is defined as a unit vector orthogonal to the vectors representing the C and Y colours. To translate the image from RGB to CYX colour space, a transform matrix M is estimated, similar to Amari and Adelson (1992). Then, for each pixel of the image we have:
Fig. 13.7 Light-efficient colour-coded aperture designs with corresponding light efficiency approximation (based on effective area sizes)
Fig. 13.8 CYX colour space visualization: (a) cyan (C), yellow (Y) and X is a vector orthogonal to the CY plane; (b) cyan and yellow coded aperture dimensions
$$ w_{CYX}^{i,j} = M^{-1}\, w_{RGB}^{i,j}, $$

where $w_{CYX}^{i,j}$ and $w_{RGB}^{i,j}$ are vectors representing the colour of the (i, j) pixel in the CYX and RGB colour spaces, respectively. A fundamental difference with Amari and Adelson (1992) is that, in this case, the goal of this procedure is not just to calibrate a small mismatch between the colour filters in the aperture and the same set of colour filters in the Bayer matrix of the sensor, but to convert between two completely different colour bases. All the aperture designs described here were verified numerically and through prototyping by Paramonov et al. (2014, 2016a, b, c). Despite the non-congruent shapes and non-complementary colours of the individual sub-apertures, they are able to provide a colour channel shift that is sufficient for depth extraction, at the same time providing superior light efficiency.
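A small sketch of this per-pixel basis change is shown below; the uncalibrated cyan/yellow basis vectors are purely illustrative (in practice M is estimated from calibration data).

```python
import numpy as np

def rgb_to_cyx(image_rgb, M):
    """Convert an HxWx3 RGB image to the CYX coded-aperture colour basis.

    M is the 3x3 matrix whose columns are the C, Y and X basis vectors in
    sensor RGB; the per-pixel transform is w_CYX = M^{-1} w_RGB.
    """
    M_inv = np.linalg.inv(M)
    return image_rgb @ M_inv.T          # apply M^{-1} to every pixel vector

# Illustrative (assumed, uncalibrated) basis: cyan, yellow and their normal
C = np.array([0.0, 1.0, 1.0]); C /= np.linalg.norm(C)
Y = np.array([1.0, 1.0, 0.0]); Y /= np.linalg.norm(Y)
X = np.cross(C, Y); X /= np.linalg.norm(X)
M = np.stack([C, Y, X], axis=1)
```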
13.5 Depth Map Estimation for Thin Lens Cameras
One approach to disparity map estimation is described by Panchenko et al. (2016); its implementation for depth estimation and control is also given in Chap. 3. The approach utilizes the mutual correlation of shifted colour channels in an exponentially weighted window and uses a bilateral filter approximation for cost volume regularization. We describe the basics below for the completeness and self-sufficiency of the current chapter. Let $\{I_i\}_1^n$ represent a set of n captured colour channels of the same scene from different viewpoints, where I_i is an M × N frame. A conventional correlation matrix C_d is formed for the set $\{I_i\}_1^n$ and candidate disparity values d:

$$ C_d = \begin{pmatrix} 1 & \cdots & \mathrm{corr}\big(I_1^d, I_n^d\big) \\ \vdots & \ddots & \vdots \\ \mathrm{corr}\big(I_n^d, I_1^d\big) & \cdots & 1 \end{pmatrix}, $$
where the superscript (·)^d denotes a parallel shift by d pixels in the corresponding channel. The direction of the shift is dictated by the aperture design. The determinant of the matrix C_d is a good measure of the mutual correlation of \{I_i\}_{1}^{n}. Indeed, when all channels are strongly correlated, all the elements of the matrix are equal to one and det(C_d) = 0. On the other hand, when the data is completely uncorrelated, we have det(C_d) = 1. To extract a disparity map using this metric, one should find the disparity value d corresponding to the smallest value of det(C_d) at each pixel of the picture. Here, we derive another particular implementation of the generalized correlation metric for n = 3. It corresponds to the case of an aperture with three channels. The determinant of the correlation matrix is:

\det(C_d) = 1 - \mathrm{corr}^2(I_1^d, I_2^d) - \mathrm{corr}^2(I_2^d, I_3^d) - \mathrm{corr}^2(I_3^d, I_1^d) + 2\,\mathrm{corr}(I_1^d, I_2^d)\,\mathrm{corr}(I_2^d, I_3^d)\,\mathrm{corr}(I_3^d, I_1^d),

and we have

\arg\min_d \det(C_d) = \arg\max_d \left( \sum_{i<j} \mathrm{corr}^2(I_i^d, I_j^d) - 2 \prod_{i<j} \mathrm{corr}(I_i^d, I_j^d) \right).
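A minimal NumPy/SciPy sketch of evaluating this cost over a set of candidate disparities is given below. It is an illustration only, not the authors' implementation: the box-window correlation, the np.roll-based channel shifting and the per-channel shift directions are simplifying assumptions (the original approach uses exponentially weighted windows and bilateral-filter cost regularization).

```python
import numpy as np
from scipy.ndimage import uniform_filter

def window_corr(a, b, size=15, eps=1e-6):
    """Normalised cross-correlation of two float images in a local square window."""
    mean_a = uniform_filter(a, size)
    mean_b = uniform_filter(b, size)
    cov = uniform_filter(a * b, size) - mean_a * mean_b
    var_a = uniform_filter(a * a, size) - mean_a ** 2
    var_b = uniform_filter(b * b, size) - mean_b ** 2
    return cov / np.sqrt(np.maximum(var_a * var_b, eps))

def disparity_from_channels(channels, candidates, shifts, size=15):
    """Per pixel, pick the candidate disparity maximising
    sum(corr^2) - 2*prod(corr) over all channel pairs.
    `channels` is a list of float 2-D arrays; `shifts` gives the unit
    shift direction (dy, dx) of each channel (aperture-dependent)."""
    h, w = channels[0].shape
    best_cost = np.full((h, w), -np.inf)
    best_disp = np.zeros((h, w), dtype=np.float32)
    for d in candidates:
        shifted = [np.roll(ch, (int(round(d * sy)), int(round(d * sx))), axis=(0, 1))
                   for ch, (sy, sx) in zip(channels, shifts)]
        n = len(shifted)
        cost = np.zeros((h, w))
        prod = np.ones((h, w))
        for i in range(n):
            for j in range(i + 1, n):
                c = window_corr(shifted[i], shifted[j], size)
                cost += c ** 2
                prod *= c
        cost -= 2 * prod
        better = cost > best_cost
        best_cost[better] = cost[better]
        best_disp[better] = d
    return best_disp
```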
This metric is similar to the colour line metrics (Amari and Adelson 1992) but is more robust in important cases of low texture density in some areas of the image. The extra robustness appears when one of the three channels does not have enough texture in a local window around a point under consideration. In this case, the colour lines metric cannot provide disparity information, even if the other two channels are well defined. The generalized correlation metric avoids this disadvantage and allows the depth sensor to work similarly to a stereo camera in this case. Usually, passive sensors provide sparse disparity maps. However, dense disparity maps can be obtained by propagating disparity information to non-textured areas. The propagation can be efficiently implemented via joint-bilateral filtering (Panchenko et al. 2016) of the mutual correlation metric cost or by applying variational methods (e.g. Chambolle and Pock 2010) for global regularization with classic total variation or other priors. Here, we assume that the depth is smooth in non-textured areas. In contrast to the original work by Panchenko et al. (2016), this algorithm has also been applied not in sensor colour space but in colour-coded aperture colour space (Paramonov et al. 2016a, b, c). This increases the texture correlation between the colour channels if they have overlapping passbands and helps to improve the number of depth layers compared to RGB colour space (see Fig. 13.9c, d for comparison of the number of depth layers sensed for the same aperture design but different colour basis). Let us derive a disparity-to-depth conversion equation for a single thin-lens optical system (Fig. 13.1) as was proposed by Paramonov et al. (2016a, b, c). For a thin lens (Fig. 13.1a), we have:
Fig. 13.9 Depth sensor on the axis calibration results for different colour-coded aperture designs: (a) three RGB circles processed in RGB colour space; (b) cyan and yellow halves coded aperture processed in CYX colour space; (c) cyan and yellow coded aperture with open centre processed in CYX colour space; (d) cyan and yellow coded aperture with open centre processed in conventional RGB colour space. Please note that there are more depth layers in the same range for case (c) than for case (d), thanks only to the CYX colour basis (the coded aperture is the same)
\frac{1}{z_{of}} + \frac{1}{z_{if}} = \frac{1}{f},

where f is the lens focal length, z_{of} the distance between a focused object and the lens, and z_{if} the distance from the lens to the focused image plane. If we move the image sensor towards the lens as shown in Fig. 13.1b, the image of the object on the sensor is convolved with a copy of the colour-coded aperture, which is the circle of confusion, and we obtain:
\frac{1}{z_{od}} + \frac{1}{z_{id}} = \frac{1}{f},

\frac{1}{z_{of}} + \frac{1}{(1 + c/D)\, z_{id}} = \frac{1}{f},

where z_{id} is the distance from the lens to the defocused image plane, z_{od} is the distance from the lens to the defocused object plane corresponding to z_{id}, c is the circle of confusion diameter, and D is the aperture diameter (Fig. 13.1b). We can solve this system of equations for the circle of confusion diameter:

c = \frac{fD\,(z_{od} - z_{of})}{z_{od}\,(z_{of} - f)},

which gives the final result for the disparity in pixels:

d = \frac{\beta}{2\mu}\, c = \frac{\beta f D\,(z_{od} - z_{of})}{2\mu\, z_{od}\,(z_{of} - f)},

where μ is the sensor pixel size, β = r_c/R is the coded aperture coefficient, R = D/2 is the aperture radius, and r_c is the distance between the aperture centre and the single-channel centroid (Fig. 13.8b). Now, we can express the distance between the camera lens and any object only in terms of the internal camera parameters and the disparity value corresponding to that object:

z_{od} = \frac{b f\, z_{of}}{b f - 2\mu d\,(z_{of} - f)},

where b = βD = 2r_c is the distance between the two centroids (see Fig. 13.8), i.e. the colour-coded aperture baseline, equivalent to the stereo camera baseline. Note that if d = 0, then z_{od} naturally equals z_{of}, i.e. the object is in the camera focus.
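The disparity-to-depth relation can be made concrete with a small helper; the numeric parameters in the example are assumptions loosely based on the prototype described later in the chapter, not values taken from the book.

```python
def disparity_to_depth(d, b, f, z_of, mu):
    """Object distance z_od (same units as f, b, z_of) from disparity d in pixels:
    z_od = b*f*z_of / (b*f - 2*mu*d*(z_of - f))."""
    return b * f * z_of / (b * f - 2.0 * mu * d * (z_of - f))

# Assumed prototype-like parameters (all lengths in mm): 50 mm lens,
# 20 mm equivalent baseline, focus at 2000 mm, 4.5 um pixel size.
if __name__ == "__main__":
    for d in (-2.0, 0.0, 2.0):
        print(d, disparity_to_depth(d, b=20.0, f=50.0, z_of=2000.0, mu=0.0045))
```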
13.6 Depth Sensor Calibration for Complex Objectives
To use the last equation with any real compound optical system (objective), it was proposed to model the objective as a black box with entrance and exit pupils (see Goodman 2008; Paramonov et al. 2016a, b, c for details) located at the second and the first principal points (H′ and H), respectively (see Fig. 13.10 for an example of the principal plane locations in the case of a double Gauss lens). The distance between the entrance and exit pupils and the effective focal length are found through a calibration procedure proposed by Paramonov et al. (2016a, b, c). Since the pupil position is unknown for a complex lens, we measure
Fig. 13.10 Schematic diagram of the double Gauss lens used in Canon EF 50 mm f/1.8 II lens and its principal plane location. Please note that this approach works for any optical imaging system (Goodman 2008)
the distances to all objects from the camera sensor. Therefore, the disparity-to-depth conversion equation becomes:

\tilde{z}_{od} - \delta = \frac{b f\,(\tilde{z}_{of} - \delta)}{b f - 2\mu d\,(\tilde{z}_{of} - \delta - f)},
where \tilde{z}_{od} is the distance between the defocused object and the sensor, \tilde{z}_{of} is the distance between the focused object and the sensor, and δ = z_{if} + HH′ is the distance between the sensor and the entrance pupil. Thus, for \tilde{z}_{od} we have:

\tilde{z}_{od} = \frac{b f\,\tilde{z}_{of} - 2\mu d\delta\,(\tilde{z}_{of} - \delta - f)}{b f - 2\mu d\,(\tilde{z}_{of} - \delta - f)}.
On the right-hand side of the equation above, there are three independent unknown variables, namely \tilde{z}_{of}, b, and δ. We discuss their calibration in the following text. Other variables are either known or dependent. Another issue arises because the point spread function (PSF) changes across the image. This causes a variation in the disparity values for objects at the same distance from the sensor but at different positions in the image. A number of researchers encountered the same problem in their works (Dansereau et al. 2013; Johannsen et al. 2013; Trouvé et al. 2013a, b). A specific colour-coded aperture depth sensor calibration to mitigate this effect is described below. The first step is the conventional calibration with the pinhole camera model and a chessboard pattern (Zhang 2000). From this calibration, we acquire the distance z_{if} between the sensor and the exit pupil. To find the independent variables \tilde{z}_{of}, b, and HH′, we capture a set of images of a chessboard pattern moving in a certain range along the optical axis and orthogonal to it (see Fig. 13.11). Each time, the object was positioned by hand, which is why small errors are possible (up to 3 mm). The optical system is focused at a certain distance
Fig. 13.11 The depth sensor calibration procedure takes place after conventional camera calibration using the pinhole camera model. A chessboard pattern is moved along the optical axis and captured while the camera focus distance is kept constant
from the sensor. Our experience shows that the error in focusing by hand at a close range is high (up to 20 mm for the Canon EF 50 mm f/1.8 lens), so we have to find the accurate value of \tilde{z}_{of} through the calibration as well. Disparity values are extracted, and their corresponding distances are measured by a ruler on the test scene for all captured images. Now, we can find \tilde{z}_{of} and b so that the above equation for the distance between a defocused object and the sensor holds with minimal error (RMS error over all measurements). To account for depth distortion due to the field curvature of the optical system, we perform the calibration for all the pixels in the image individually. The resulting colour-coded aperture baseline b(i, j) and in-focus surface \tilde{z}_{of}(i, j) are shown in Fig. 13.12a and b, respectively. The procedure described here was implemented on a prototype based on the Canon EOS 60D camera and Canon EF 50 mm f/1.8 II lens. Any other imaging system would also work, but this Canon lens can be disassembled easily (Bando 2008). Thirty-one images were captured (see Fig. 13.11), where the defocused object plane (z_{od}) was moved from 1000 to 4000 mm in 100 mm steps and the camera was focused at approximately 2000 mm (z_{of}). The results of our calibration for different coded aperture designs are presented in Fig. 13.9. Based on the calibration, the effective focal length of our Canon EF 50 mm f/1.8 II lens is 51.62 mm, which is in good agreement with the focal length value provided to us by the professional opticians (51.6 mm) who performed an accurate calibration (it is not in fact 50 mm). Using the calibration data, one can perform an accurate depth map estimation:
\tilde{z}_{od}(i, j) = \frac{b(i, j)\, f\, \tilde{z}_{of}(i, j) - 2\mu d\delta\,(\tilde{z}_{of}(i, j) - \delta - f)}{b(i, j)\, f - 2\mu d\,(\tilde{z}_{of}(i, j) - \delta - f)}.
The floor in Fig. 13.13a is flat but appears to be concave on the extracted depth map due to depth distortion (Fig. 13.13b). After calibration, the floor surface is corrected and is close to planar (Fig. 13.13c). The accuracy of undistorted depth maps extracted with a colour-coded aperture depth sensor is sufficient for 3D scene reconstruction, as discussed in the next section.
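The per-pixel calibration step can be sketched as a small least-squares fit; this is an illustration under simplifying assumptions (δ fixed, SciPy's generic solver, hypothetical starting values), not the authors' calibration code.

```python
import numpy as np
from scipy.optimize import least_squares

def predicted_distance(d, b, z_of, delta, f, mu):
    """Disparity-to-distance model for a compound lens (distances from the sensor)."""
    t = z_of - delta - f
    return (b * f * z_of - 2.0 * mu * d * delta * t) / (b * f - 2.0 * mu * d * t)

def calibrate_pixel(disparities, distances, delta, f, mu, b0=20.0, z0=2000.0):
    """Fit the equivalent baseline b and the in-focus distance z_of for one pixel
    from measured (disparity, ruler distance) pairs, minimising the RMS error."""
    disparities = np.asarray(disparities, dtype=np.float64)
    distances = np.asarray(distances, dtype=np.float64)

    def residuals(p):
        b, z_of = p
        return predicted_distance(disparities, b, z_of, delta, f, mu) - distances

    result = least_squares(residuals, x0=[b0, z0])
    return result.x  # (b, z_of) for this pixel
```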
Fig. 13.12 Colour-coded aperture depth sensor 3D calibration results: (a) coded aperture equivalent baseline field b(i, j); (b) optical system in-focus surface \tilde{z}_{of}(i, j), where (i, j) are pixel coordinates
Fig. 13.13 Depth map of a rabbit figure standing on the floor: (a) captured image (the colour shift in the image is visible when looking under magnification); (b) distorted depth map, floor appears to be concave; (c) undistorted depth map, floor surface is planar area
13.7 Evaluation Results
First, let us compare the depth estimation error for the layered and sub-pixel approaches. Figure 13.14 shows that the sub-pixel estimation with the quadratic polynomial interpolation significantly improves the depth accuracy. This approach also allows real-time implementation as we interpolate around global maxima only. The different aperture designs are compared in Paramonov et al. (2016a, b, c), having the same processing algorithm (except the aperture corresponding to Chakrabarti and Zickler (2012) as it utilizes a significantly different approach). The tests were conducted using the Canon EOS 60D DSLR camera with a Canon EF 50 mm f/1.8 II lens in the same light conditions and for the same distance to the object, while the exposure time was adjusted to achieve a meaningful image in each case. Typical results are shown in Fig. 13.15.
Fig. 13.14 Cyan and yellow coded aperture: depth estimation error comparison for layered and sub-pixel approaches
A MATLAB code was developed for processing. A non-optimized implementation takes 6 seconds on a CPU to extract a 1280 × 1920 raw disparity map in the case of three colour filters in the aperture, which is very close to Bando et al.'s (2008) implementation (designs I–IV in Fig. 13.17). In the case of two colour filters in the aperture (designs V, VII in Fig. 13.17), our algorithm takes only 3.5 seconds to extract the disparity. The similar disparity estimation algorithm implementation by Chakrabarti and Zickler (2012) takes 28 seconds under the same conditions. All tests were performed in single-thread mode and with the parameters recommended by the respective authors. Raw disparity maps usually require some regularization to avoid strong artefacts. For clarity, the same robust regularization method was used for all the extracted results. The variational method (Chambolle and Pock 2010) was used for global regularization with a total variation prior (see the last column in Fig. 13.17). It takes only 3 seconds for 1280 × 1920 disparity map regularization on a CPU. Low light efficiency is a significant drawback of existing coded aperture cameras. A simple procedure for measuring the light efficiency of a coded aperture camera was implemented. We capture the first image I^{nc}_{i,j} (here and below, (i, j) denotes the pixel coordinates) with a non-coded aperture and compare it to the second image I^{c}_{i,j} captured with the same camera presets in the presence of a coded aperture. To
Fig. 13.15 Results comparison with prior art. Rows correspond to different coded aperture designs. From top to bottom: RGB circles (Lee et al. 2010); RGB squares (Bando et al. 2008); CMY squares, CMY circles, CY halves (all three cases by Paramonov et al. 2016a, b, c); magenta annulus (Chakrabarti and Zickler 2012); CY with open area (Paramonov et al. 2016a, b, c); open aperture. Light efficiency increases from top to bottom
avoid texture dependency and image sensor noise, we use a blurred captured image of a white sheet of paper. The transparency T_{i,j} shows the fraction of light which passes through the imaging system with the coded aperture relative to the same imaging system without the coded aperture:

T_{i,j} = \frac{I^{c}_{i,j}}{I^{nc}_{i,j}}.
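A possible implementation of this transparency measurement is sketched below; the Gaussian blur kernel size, the clamping constant and the averaging used for the integral efficiency are assumptions.

```python
import numpy as np
import cv2

def transparency_maps(img_coded, img_open, blur_ksize=51):
    """Per-channel transparency map T = I_coded / I_open, computed on blurred
    captures of a white sheet to suppress texture and sensor noise."""
    coded = cv2.GaussianBlur(img_coded.astype(np.float64), (blur_ksize, blur_ksize), 0)
    open_ = cv2.GaussianBlur(img_open.astype(np.float64), (blur_ksize, blur_ksize), 0)
    return coded / np.maximum(open_, 1e-6)

def integral_light_efficiency(transparency):
    """Mean transparency over all pixels and colour channels."""
    return float(np.mean(transparency))
```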
The transparency is different for different colours. We provide the resulting transparency corresponding to the colours on image sensors: red, green, and blue. In Fig. 13.16, we present the results for a Canon EF 50 mm f/1.8 II lens. The integral light efficiency of these designs is 86%, 55%, and 5.5% correspondingly. Any imaging system suffers from the vignetting effect, which is a reduction of an image’s brightness at the periphery compared to the image centre. Usually, this effect is mitigated numerically. In the case of a colour-coded aperture, this restoration procedure should take into account the difference between the transparency in different colour channels.
Fig. 13.16 Transparency maps for different aperture designs (columns) corresponding to different image sensor colour channels (rows)
Fig. 13.17 SNR loss for different apertures in the central and border image areas (dB)
We conducted experiments for analysing the in-focus quality of the image captured with the proposed depth sensor using the Imatest chart and software (Imatest 2014). The results are presented in Fig. 13.17.
Fig. 13.18 Light efficiency for different coded aperture designs. From left to right: RGB circles (Lee et al. 2010); RGB squares (Bando et al. 2008); CMY squares, CMY circles, CY halves (all three cases by Paramonov et al. 2016a, b, c); magenta annulus (Chakrabarti and Zickler 2012); CY with open area (Paramonov et al. 2016a, b, c); open aperture
All photos were taken with identical camera settings. It seems that SNR degradation from the centre to the side is induced by lens aberrations. Different apertures provide different SNRs, the value depending on the amount of captured light. The loss between aperture 1 and 3 is 2.3 dB. To obtain the original SNR value for aperture 3, one should increase the exposure time by 30%. It is important to take into account the light efficiency while evaluating depth sensor results. We estimated the light efficiency by capturing a white sheet of paper through different coded apertures in the same illumination conditions and with the same camera parameters. The light efficiency values are presented for each sensor colour channel independently (see Fig. 13.18). The aperture designs are sorted based on their light efficiency. Apertures V and VI in Fig. 13.15 have almost the same light efficiency, but the depth quality of aperture V seems to be better. Aperture VII has a higher light efficiency and can be used if depth quality is not an issue. This analysis may be used to find a suitable trade-off between image quality and depth quality for a given application.
13.8 Prototypes and Implementation
A number of prototypes with different colour-coded aperture designs were developed based on the Canon EOS 60D DSLR camera and Canon EF 50 mm f/1.8 II lens, which has an appropriate f-number and can be easily disassembled (Bando et al. 2008). The corresponding colour-coded aperture design is shown in Fig. 13.19a. Two examples of captured images and their extracted depth maps are presented in Fig. 13.20. We also describe two other application scenarios: image effects (Fig. 13.21) based on depth (Figs. 13.22 and 13.23) extracted with a smartphone camera
Fig. 13.19 Design of colour-coded apertures: (a) inside a DSLR camera lens; (b) inside a smartphone camera lens; (c) disassembled smartphone view
Fig. 13.20 Scenes captured with the DSLR-based prototype with their corresponding depth maps
Fig. 13.21 Depth-dependent image effects. From top to bottom rows: refocusing, pixelization, colourization
Fig. 13.22 Tiny models captured with the smartphone-based prototype and with their corresponding disparity maps
(Fig. 13.19b, c) and real-time 3D reconstruction using a handheld or mounted consumer grade camera (Figs. 13.24, 13.25, 13.26, and 13.27). For implementation of a smartphone prototype, the coded aperture could not be inserted into the pupil plane, as most of the camera modules cannot be disassembled. However, a number of different smartphones were disassembled, and for some models, it was possible to insert the colour-coded aperture between the front lens of the imaging system and the back cover glass of the camera module (see Fig. 13.19b and c). Unfortunately, we have no access to smartphone imaging system parameters and cannot say how far this position is from the pupil plane.
Fig. 13.23 Disparity map and binary mask for the image captured with the smartphone prototype
Fig. 13.24 Web camera-based prototype
An Android application was developed (Paramonov et al. 2016a, b, c) to demonstrate the feasibility of disparity estimation with quality sufficient for depth-based image effects (see Figs. 13.20, 13.21, 13.22, and 13.23).
Fig. 13.25 Real-time implementation of raw disparity map estimation with web camera-based prototype
Fig. 13.26 Point Grey Grasshopper 3 camera with inserted colour-coded aperture on the left (a) and real-time 3D scene reconstruction process on the right (b)
The real-time 3D reconstruction scenario is based on a Point Grey Grasshopper 3 digital camera with a Fujinon DV3.4x3.8SA-1 lens with embedded yellow-cyan colour-coded aperture. The Point Grey camera and 3D reconstruction process are shown in Fig. 13.26. The following 3D reconstruction scheme was chosen: • Getting a frame from the camera and undistorting it according to the calibration results
Fig. 13.27 Test scenes and their corresponding 3D reconstruction examples: (a) test scene frontal view and (b) corresponding 3D reconstruction view, (c) test scene side view and (d) corresponding 3D reconstruction view
• Extracting the frame depth in millimetres and transforming it to a point cloud • Using the colour alignment technique (Bando et al. 2008) to restore the colour image • Utilization of GPU-accelerated dense tracking and mapping Kinect Fusion (KinFu) algorithm (Newcomb et al. 2011) for camera pose estimation and 3D surface reconstruction A PC depth extraction implementation provides 53fps (Panchenko and Bucha 2015), OpenCL KinFu provides 45fps (Zagoruyko and Chernov 2014), and the whole 3D reconstruction process works at 15fps, which is sufficient for on-the-fly usage.
The test scene and 3D reconstruction results are shown in Fig. 13.27a–d. Note that a chessboard pattern is not used for tracking but only to provide good texture. In Fig. 13.28, we show the distance between depth layers corresponding to disparity values equal to 0 and 1 based on the last formula in Sect. 13.5. The layered
Fig. 13.28 Depth sensor accuracy curves for different aperture baselines b: (a) full-size camera with f-number 1.8 and pixel size 4.5 μm; (b) compact camera with f-number 1.8 and pixel size 1.2 μm
depth error is by definition two times smaller than this distance. The sub-pixel refinement reduces the depth estimation error by a further factor of two (see Figs. 13.6 and 13.10). That gives a final accuracy better than 15 cm at a distance of 10 m and better than 1 cm at distances below 2.5 m for an equivalent baseline b of 20 mm.
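Curves of this kind can be approximated with a few lines of code from the Sect. 13.5 formulas; the parameter values below are assumptions and are not meant to reproduce the exact curves of Fig. 13.28.

```python
def depth_for_disparity(d, b, f, z_of, mu):
    """z_od from the Sect. 13.5 disparity-to-depth formula."""
    return b * f * z_of / (b * f - 2.0 * mu * d * (z_of - f))

def disparity_for_depth(z_od, b, f, z_of, mu):
    """Inverse relation: disparity in pixels for an object at distance z_od."""
    return b * f * (z_od - z_of) / (2.0 * mu * z_od * (z_of - f))

def layer_distance(z_od, b, f, z_of, mu):
    """Distance between the depth layers for disparities d and d + 1 around z_od;
    the layered error is roughly half of it, the sub-pixel error about a quarter."""
    d = disparity_for_depth(z_od, b, f, z_of, mu)
    return abs(depth_for_disparity(d + 1.0, b, f, z_of, mu)
               - depth_for_disparity(d, b, f, z_of, mu))

# Assumed parameters (lengths in mm): f = 50 mm, equivalent baseline b = 20 mm,
# pixel size 4.5 um, camera focused at 2 m.
for z in (1000.0, 2500.0, 5000.0, 10000.0):
    print(z, layer_distance(z, b=20.0, f=50.0, z_of=2000.0, mu=0.0045))
```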
References Amari, Y., Adelson, E.: Single-eye range estimation by using displaced apertures with color filters. In: Proceedings of the International Conference on Industrial Electronics, Control, Instrumentation and Automation, pp. 1588–1592 (1992) Bae, Y., Manohara, H., White, V., Shcheglov, K.V., Shahinian, H.: Stereo imaging miniature endoscope. Tech Briefs. Physical Sciences (2011) Bando, Y.: How to disassemble the Canon EF 50mm F/1.8 II lens (2008). Accessed on 15 September 2020. http://web.media.mit.edu/~bandy/rgb/disassembly.pdf Bando, Y., Chen, B.-Y., Nishita, T.: Extracting depth and matte using a color-filtered aperture. ACM Trans. Graph. 27(5), 134:1–134:9 (2008) Chakrabarti, A., Zickler, T.: Depth and deblurring from a spectrally-varying depth-of-field. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) Lecture Notes in Computer Science, vol. 7576, pp. 648–661. Springer, Berlin, Heidelberg (2012) Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40, 120–145 (2010) Chang, J., Wetzstein, G.: Deep optics for monocular depth estimation and 3d object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 10193–10202 (2019) Chen, W., Xie, D., Zhang, Y., Pu, S.: All you need is a few shifts: designing efficient convolutional neural networks for image classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7234–7243 (2019) Dansereau, D., Pizarro, O., Williams, S.: Decoding, calibration and rectification for lenselet-based plenoptic cameras. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1027–1034 (2013) Goodman, J.: Introduction to Fourier Optics. McGraw-Hill, New York (2008) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. 1704, 04861 (2017) Imatest. The SFRplus chart: features and how to photograph it (2014). Accessed on 15 September 2020. https://www.imatest.com/docs/ Johannsen, O., Heinze, C., Goldluecke, B., Perwaß, C.: On the calibration of focused plenoptic cameras. In: Grzegorzek, M., Theobalt, C., Koch, R., Kolb, A. (eds.) Time-of-Flight and Depth Imaging. Sensors, Algorithms, and Applications. Lecture Notes in Computer Science, vol. 8200, pp. 302–317. Springer, Berlin, Heidelberg (2013) Lee, E., Kang, W., Kim, S., Paik, J.: Color shift model-based image enhancement for digital multi focusing based on a multiple color-filter aperture camera. IEEE Trans. Consum. Electron. 56(2), 317–323 (2010) Lee, S., Kim, N., Jung, K., Hayes, M.H., Paik, J.: Single image-based depth estimation using dual off-axis color filtered aperture camera. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2247–2251 (2013) Levin, A., Fergus, R., Durand, F., Freeman, W.T.: Image and depth from a conventional camera with a coded aperture. ACM Trans. Graph. 26(3), 70:1–70:10 (2007)
Mishima, N., Kozakaya, T., Moriya, A., Okada, R., Hiura, S.: Physical cue based depth-sensing by color coding with deaberration network. arXiv:1908.00329. 1908, 00329 (2019) Moriuchi, Y., Sasaki, T., Mishima, N., Mita, T.: Depth from asymmetric defocus using colorfiltered aperture. The Society for Information Display. Book 1: Session 23: HDR and Image Processing (2017). Accessed on 15 September 2020. https://doi.org/10.1002/sdtp.11639 Newcomb, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davidson, A.J., Kohi, P., Shotton, J., Hodges, S., Fitzgibbon, A.: Kinectfusion: realtime dense surface mapping and tracking. In: Proceedings of the IEEE International Symposium on Mixed and Augmented Reality, pp. 127–136 (2011) Panchenko, I., Bucha, V.: Hardware accelerator of convolution with exponential function for image processing applications. In: Proceedings of the 7th International Conference on Graphic and Image Processing. International Society for Optics and Photonic, pp. 98170A–98170A (2015) Panchenko, I., Paramonov, V., Bucha, V.: Depth estimation algorithm for color coded aperture camera. In: Proceedings of the IS&T Symposium on Electronic Imaging. 3D Image Processing, Measurement, and Applications, pp. 405.1–405.6 (2016) Paramonov, V., Panchenko, I., Bucha, V.: Method and apparatus for image capturing and simultaneous depth extraction. US Patent 9,872,012 (2014) Paramonov, V., Lavrukhin, V., Cherniavskiy, A.: System and method for shift-invariant artificial neural network. RU Patent 2,656,990 (2016a) Paramonov, V., Panchenko, I., Bucha, V., Drogolyub, A., Zagoruyko, S.: Depth camera based on color-coded aperture. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 1, 910–918 (2016b) Paramonov, V., Panchenko, I., Bucha, V., Drogolyub, A., Zagoruyko, S.: Color-coded aperture. Oral presentation in 2nd Christmas Colloquium on Computer Vision, Skolkovo Institute of Science and Technology (2016c). Accessed on 15 September 2020. http://sites.skoltech.ru/app/ data/uploads/sites/25/2015/12/CodedAperture_CCCV2016.pdf Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 1, 195–202 (2003) Schmidt, J.D.: Numerical Simulation of Optical Wave Propagation with Examples in MATLAB. SPIE Press, Bellingham (2010) Sitzmann, V., Diamond, S., Peng, Y., Dun, X., Boyd, S., Heidrich, W., Heide, F., Wetzstein, G.: End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging. ACM Trans. Graph. 37(4), 114 (2018) Trouvé, P., Champagnat, F., Besnerais, G.L., Druart, G., Idier, J.: Design of a chromatic 3d camera with an end-to-end performance model approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 953–960 (2013a) Trouvé, P., Champagnat, F., Besnerais, G.L., Sabater, J., Avignon, T., Idier, J.: Passive depth estimation using chromatic aberration and a depth from defocus approach. Appl. Opt. 52(29), 7152–7164 (2013b) Tsuruyama, T., Moriya, A., Mishima, N., Sasaki, T., Yamaguchi, J., Kozakaya, T.: Optical filter, imaging device and ranging device. US Patent Application US 20200092482 (2020) Veeraraghavan, A., Raskar, R., Agrawal, A., Mohan, A., Tumblin, J.: Dappled photography: mask enhanced cameras for heterodyned light fields and coded aperture refocusing. ACM Trans. Graph. 26(3), 69:1–69:12 (2007) Voelz, D.G.: Computational Fourier Optics: A MATLAB Tutorial. 
SPIE Press, Bellingham (2011) Wu, B., Wan, A., Yue, X., Jin, P.H., Zhao, S., Golmant, N., Gholaminejad, A., Gonzalez, J.E., Keutzer, K.: Shift: a zero FLOP, zero parameter alternative to spatial convolutions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9127–9135 (2018) Zagoruyko, S., Chernov, V.: Fast depth map fusion using OpenCL. In: Proceedings of the Conference on Low Cost 3D (2014). Accessed on 15 September 2020. http://www.lc3d.net/ programme/LC3D_2014_program.pdf Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000) Zhou, C., Lin, S., Nayar, S.K.: Coded aperture pairs for depth from defocus and defocus deblurring. Int. J. Comput. Vis. 93(1), 53–72 (2011)
Chapter 14
An Animated Graphical Abstract for an Image Ilia V. Safonov, Anton S. Kornilov, and Iryna A. Reimers
14.1 Introduction
Modern image capture devices are capable of acquiring thousands of files daily. Despite tremendous progress in the development of user interfaces for personal computers and mobile devices, the approach for browsing large collections of images has hardly changed over the past 20 years. Usually, a user scrolls through a list of downsampled copies of images to find the one they want. This downsampled image is called a thumbnail or icon. Figure 14.1 demonstrates a screenshot of File Explorer in Windows 10 with icons of photos. Browsing is time-consuming, and searching is ineffective given the meaningless names of image files. It is often difficult to recognise the detailed content of the original image from the thumbnail, as well as to estimate its quality. From a downsampled copy of the image, it is almost impossible to assess blurriness, noisiness, or the presence of compression artefacts. Even when viewing photographs, a user is frequently forced to zoom in and scroll. The situation is even harder when browsing images that have a complex layout or are intended for special applications. Figure 14.2 shows thumbnails of scanned documents. How can the required document be found efficiently when optical character recognition is not applicable? Icons of slices of X-ray computed tomographic (CT) images of two different sandstones are shown in Fig. 14.3. Is it possible to recognise a given sandstone from the thumbnail? How can the quality of the slices be estimated? In the viewing interface, a user needs a fast and handy way to see an abstract of the image. In general, the content of the abstract is application-specific.
Fig. 14.1 Large icons for photos in File Explorer for Win10
Fig. 14.2 Large icons for documents in File Explorer for Win10
Fig. 14.3 Large icons for slices of two X-ray microtomographic images in File Explorer for Win10: (a) Bentheimer sandstone; (b) Fontainebleau sandstone
Nevertheless, common requirements for convenient image-viewing interfaces can be formulated: a user would like to see the regions of interest clearly and to estimate the visual quality. In this chapter, we describe a technique for generating a thumbnail-size animation comprising transitions between the most important zones of the image. Such an animated graphical abstract looks attractive and provides a user-friendly way of browsing large collections of images.
14.2 Related Work
There are several publications devoted to the generation of smart thumbnails by automatic cropping of photographs. Suh et al. (2003) describe the method for cropping comprising face detection and saliency map building based on the pre-attentive human vision model by Itti et al. (1998). In the strict sense, a model of human pre-attentive vision does not quite fit in this case since the observer is in the attentive stage while viewing thumbnails. Nevertheless, the approach by Suh et al. (2003) often demonstrates reasonable outcomes. However, it is not clear how this method works for photos containing several faces as well as for images with several spatially distributed salient regions. A lot of techniques of automatic cropping including those intended for thumbnail creation look for attention zones. In recent decades, the theory of saliency map building and attention zone detection has developed rapidly. Lie et al. (2016) assess eight fast saliency detectors and three automatic thresholding algorithms for automatic generation of image thumbnails by cropping. All those approaches arbitrarily change the aspect ratio of the original photo, which may be undesirable for the user interface. The overall composition of the photo suffers due to cropping. Frequently it
is impossible to evaluate the noise level and blurriness, because the cropped fragment is downsized and the resulting thumbnail has a lower resolution than the original image. The latest advances in automatic cropping for thumbnailing involve the application of deep neural networks. Esmaeili et al. (2017) and Chen et al. (2018) describe end-to-end fully convolutional neural networks for thumbnail generation without building an intermediate saliency map. Except for the capability of preserving the aspect ratio, these methods have the same drawbacks as other cropping-based methods. There are completely different approaches for thumbnail creation. To reflect the noisiness (Samadani et al. 2008) or blurriness (Koik and Ibrahim 2014) of the original image, these methods fuse the corresponding defects into the thumbnail. Such algorithms do not modify the image composition and better characterise the quality of the originals. However, it is still hard to recognise relatively small regions of interest because the thumbnail is much smaller than the original. There are many fewer publications devoted to thumbnails of scanned documents. Berkner et al. (2003) describe the so-called SmartNail for browsing document images. A SmartNail consists of a selection of cropped and scaled document segments that are recomposed to fit the available display space while maintaining the recognisability of document images and the readability of text and keeping the layout close to the original document layout. Nevertheless, the overall initial view is destroyed, especially for small display sizes, and the layout alteration is sometimes perceived negatively by the observer. Berkner (2006) describes a method for determining the scale factor that preserves text readability and layout recognisability in the downsized image. Safonov et al. (2018) demonstrate the rescaling of images by retargeting. That approach allows the size of the scanned document image to be decreased several times, but the preservation of text readability in small thumbnails remains an unsolved problem.
14.3 An Animated Graphical Abstract
14.3.1 General Idea

To generate a good graphical abstract, we need to demonstrate both the whole image and enlarged fragments of it. These goals contradict each other if the graphical abstract is a still image. That is why we propose to create smooth animated transitions between attention zones of the image. The video frames are obtained by cropping the initial still image and scaling to thumbnail size. The aspect ratio can remain unchanged or be altered according to user interface requirements. The duration of the movie should not be long; the optimal duration is less than 10 seconds. Therefore, the number of attention zones is limited to between 3 and 5. The animation may be looped.
The algorithm for the generation of the animated graphical abstract comprises the following three key stages:

1. Detection of attention zones
2. Selection of a region for quality estimation
3. Generation of video frames, which are transitions between the zones and the whole image

Obviously, the attention zones differ for various types of images. To demonstrate the advantages of the animated graphical abstract as a concept, we consider the following image types: conventional consumer photographs, images of scanned documents, and slices of X-ray microtomographic images of rock samples. Human faces are, for the most part, adequate for the identification of photo content. For photos that do not contain faces, salient regions can be considered as visual attention zones. The title, headers, other emphasised text elements, and pictures are enough for the identification of a document. For the investigation of images acquired by tomography, we need to examine the regions of various substances. For visual estimation of blurriness, noise, compression artefacts, and specific artefacts of CT images (Kornilov et al. 2019), observers should investigate a fragment of the image without any scaling. We propose several simple rules for the selection of an appropriate fragment: the fragment should contain at least one contrasting edge and at least one flat region, and the histogram of the fragment's brightness should be wide but without clipping at the limits of the dynamic range. These rules are applied to select the region in the central part of the image or inside the attention zones. It should be clear that the approaches chosen for important zone detection should take into account the application scenario and the hardware platform limitations of the implementation. Fortunately, panning over the image during the animation allows the image content to be recognised even when important zones were detected incorrectly. For implementations portable to embedded platforms, we prefer techniques with low computational complexity and power consumption rather than more comprehensive ones.
14.3.2 Attention Zone Detection for Photos

Information about humans in the photo is important for recognising the scene. Thus, it is reasonable to apply a face detection algorithm to detect attention zones in the photo. There are numerous methods for face detection. At the present time, methods based on the application of deep neural networks demonstrate state-of-the-art performance (Zhang and Zhang 2014; Li et al. 2016; Bai et al. 2018). Nevertheless, we prefer to apply the Viola-Jones face detector (Viola and Jones 2001), which has several effective multi-view implementations for various platforms. The number of false positives of the face detector can be decreased with additional skin tone segmentation and processing of downsampled images (Egorova et al. 2009). We
Fig. 14.4 Attention zones for a photo containing people
set the upper limit for the number of detected faces equal to four. If a larger number of faces are detected, then we select the largest regions. Figure 14.4 illustrates detected faces as attention zones. Faces may characterise the photo very well, but a lot of photos do not contain faces. In this case, an additional mechanism has to be used to detect zones of attention. The thresholding of a saliency map is one of the common ways of looking for attention zones. Again, deep neural networks perform well for the problem (Wang et al. 2015; Zhao et al. 2015; Liu and Han 2016), but we use a simpler histogram-based contrast technique (Cheng et al. 2014), which usually provides a reasonable saliency map. Figure 14.5 shows examples of attention zones detected based on the saliency map.
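A rough sketch of the face-based attention zone selection is given below. It is not the authors' implementation: it uses only OpenCV's frontal-face Haar cascade and omits the skin-tone segmentation and multi-view handling mentioned above; the detector parameters are assumptions.

```python
import cv2

def face_attention_zones(image_bgr, max_faces=4):
    """Detect faces with an OpenCV Haar cascade and keep the largest boxes."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    # Keep at most max_faces zones, preferring the largest detected faces.
    faces = sorted(faces, key=lambda box: box[2] * box[3], reverse=True)
    return [tuple(box) for box in faces[:max_faces]]   # (x, y, w, h) zones
```

If no faces are found, the saliency-based mechanism described above is used instead to propose attention zones.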
14.3.3 Attention Zone Detection for Images of Documents

The majority of icons for document images look very similar, and it is difficult to distinguish one from another. To recognise a document, it is important to see the title, emphasised blocks of text, and embedded pictures. There are many document layout analysis methods that allow the segmentation and detection of the different important regions of a document (Safonov et al. 2019). However, we do not need complete document segmentation to detect several attention zones, so we can use simple and computationally inexpensive methods.
Fig. 14.5 Attention zones based on the saliency map
We propose a fast algorithm to detect blocks of text with a very large font size, which correspond to the title and headers. The algorithm includes the following steps. First, the initial RGB image is converted to a greyscale image I. The next step is downsampling the original document image to a size that keeps text with a size of 16–18 pt or greater recognisable. For example, a scanned document image with a resolution of 300 dpi should be downsampled five times. The resulting image of an A4 document has a size of 700 × 500 pixels. Handling a greyscale downsampled copy of the initial image significantly decreases the processing time. Downsized text regions look like a texture. These areas contain the bulk of the edges. So, to reveal text regions, edge detection techniques can be applied. We use Laplacian of Gaussian (LoG) filtering with zero crossing. LoG filtering is a convolution of the downsampled image I with the kernel k:

k(x, y) = \frac{(x^2 + y^2 - 2\sigma^2)\, k_g(x, y)}{2\pi\sigma^6 \sum_{x=-N/2}^{N/2} \sum_{y=-N/2}^{N/2} k_g(x, y)},

k_g(x, y) = e^{-(x^2 + y^2)/2\sigma^2},
where N is the size of convolution kernel; σ is standard deviation; and (x, y) are coordinates of the Cartesian system with the origin at the centre of the kernel. The zero-crossing approach with fixed threshold T is preferable for edge segmentation. The binary image BW is calculated using the following statement:
BW(r, c) = 1 if (|I_e(r, c) - I_e(r, c+1)| ≥ T and I_e(r, c) < 0 and I_e(r, c+1) > 0) or (|I_e(r, c) - I_e(r, c-1)| ≥ T and I_e(r, c) < 0 and I_e(r, c-1) > 0) or (|I_e(r, c) - I_e(r-1, c)| ≥ T and I_e(r, c) < 0 and I_e(r-1, c) > 0) or (|I_e(r, c) - I_e(r+1, c)| ≥ T and I_e(r, c) < 0 and I_e(r+1, c) > 0); otherwise BW(r, c) = 0, where I_e is the outcome of LoG filtering, and (r, c) are the coordinates of a pixel. For segmentation of text regions, we look for the pixels that have a lot of edge pixels in the vicinity:

L(r, c) = \begin{cases} 1, & \sum_{i=r-dr/2}^{r+dr/2} \sum_{j=c-dc/2}^{c+dc/2} BW(i, j) > T_t \\ 0, & \text{otherwise} \end{cases} \quad \forall r, c,
where L is the image of segmented text regions; dr and dc are the sizes of the blocks; and T_t is a threshold. In addition to text, regions corresponding to vector graphics such as plots and diagrams are segmented too. Further steps are the labelling of connected regions in L and the calculation of their bounding boxes. Regions with a small height or width are eliminated. The calculation of the average character size for each text region and the selection of several zones with a large character size are performed in the next steps. Let us consider how to calculate the average character size of a text region, which corresponds to some connected region in the image L. The text region can be designated as:

Z(r, c) = I(r, c) \cdot L(r, c), \quad \forall (r, c) \in \Omega.
The image Z is binarised by the threshold. We use an optimised version of the well-known Otsu algorithm (Lin 2005) to calculate the threshold for the histogram calculated from the pixels belonging to Ω. Connected regions in the binary image Zb are labelled. If the number of connected regions in Zb is too small, then the text region is eliminated. The size of the bounding box is calculated for all connected regions in Zb.
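A compact sketch of the text-region detection pipeline is given below; it is an illustration rather than the exact implementation, and the LoG sigma, thresholds, neighbourhood size and the approximate zero-crossing test are assumptions.

```python
import numpy as np
import cv2
from scipy.ndimage import uniform_filter, label, find_objects

def text_region_boxes(gray, sigma=2.0, t_edge=4.0, t_density=0.15, block=15):
    """Rough text-block detector: LoG edges followed by edge-density thresholding."""
    img = gray.astype(np.float32)
    log = cv2.Laplacian(cv2.GaussianBlur(img, (0, 0), sigma), cv2.CV_32F)
    # Approximate zero-crossing detection: a sign change with sufficient contrast.
    bw = np.zeros(log.shape, dtype=np.uint8)
    for dy, dx in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        neigh = np.roll(log, (dy, dx), axis=(0, 1))
        bw |= ((np.abs(log - neigh) >= t_edge) & (log < 0) & (neigh > 0)).astype(np.uint8)
    # Pixels whose neighbourhood contains many edge pixels are treated as text texture.
    density = uniform_filter(bw.astype(np.float32), block)
    labels, _ = label(density > t_density)
    return [s for s in find_objects(labels) if s is not None]  # bounding-box slices
```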
Fig. 14.6 The results of detection of text regions
Figure 14.6 illustrates our approach for the detection of text regions. Detected text regions L are marked in green. The image Z consists of all the connected regions. The average character size is calculated for the dark connected areas inside the green region. This is a reliable way to detect the region of the title of the paper. At the final stage of our approach, photographic illustrations are identified because they are important for document recognition as well. The image I is divided into non-overlapping blocks of size N × M for the detection of embedded photos. For each block, the energy E_i of the normalised grey level co-occurrence matrix is calculated:

C_I(i, j) = \sum_{\forall x} \sum_{\forall y} \begin{cases} 1, & I(x, y) = i \text{ and } I(x + dx, y + dy) = j \\ 0, & \text{otherwise} \end{cases}

N_I(i, j) = \frac{C_I(i, j)}{\sum_i \sum_j C_I(i, j)},

E_i = \sum_i \sum_j N_I^2(i, j),
where x, y are coordinates of pixels of a block, and dx, dy are displacements. If E_i is less than 0.01, then all pixels of the block are marked as related to a photo. Further, all adjacent marked pixels are combined into connected regions. Regions with a small area are eliminated. Regions with a too large area are eliminated because, as a rule, they belong to the complex background of the document. The
Fig. 14.7 The results of detection of a photographic illustration inside a document image
bounding box of the region with the largest area defines the zone of the embedded photo. Figure 14.7 shows the outcomes of the detection of the blocks related to a photographic illustration inside the document image.
14.3.4 Attention Zones for a Slice of a Tomographic Image

As a rule, a specimen occupies only part of a slice of a CT image, the so-called region of interest (ROI). Kornilov et al. (2019) describe the algorithm for ROI segmentation. To examine the ROI, it is preferable to see several enlarged fragments related to various substances and/or having different characteristics. The fragments can be found by calculating the similarity between blocks of a slice inside the ROI. Kornilov et al. (2020) apply the Hellinger distance between normalised greyscale histograms to estimate the similarity across slices of the tomographic image. For two discrete probability distributions, H_c and H, the similarity is defined as one minus the Hellinger distance:

D_{sim} = 1 - \frac{1}{\sqrt{2}} \sqrt{\sum_i \left( \sqrt{H_{ci}} - \sqrt{H_i} \right)^2}.
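A direct implementation of this similarity measure might look as follows (an illustration; the histogram binning is an assumption):

```python
import numpy as np

def hellinger_similarity(hist_a, hist_b, eps=1e-12):
    """Similarity between two histograms: 1 minus the Hellinger distance."""
    a = np.asarray(hist_a, dtype=np.float64)
    b = np.asarray(hist_b, dtype=np.float64)
    a = a / max(a.sum(), eps)          # normalise to probability distributions
    b = b / max(b.sum(), eps)
    return 1.0 - np.sqrt(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2)) / np.sqrt(2.0)

def block_histogram(block, bins=256):
    """Greyscale histogram of an image block."""
    hist, _ = np.histogram(block, bins=bins, range=(0, 256))
    return hist
```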
Instead of the 1D histogram, we propose to calculate similarity via the normalised structural co-occurrence matrix (Ramalho et al. 2016), where co-occurrence is counted between a pixel with coordinates (x, y) of the initial slice and a pixel with coordinates (x, y) of the same slice smoothed by box filter.
14.3.5 Generation of Animation

At the beginning, the sequence order of the zones is selected for animation creation. The first frame always represents the whole downsampled image, that is, the conventional thumbnail. The subsequent zones are ordered to provide the shortest path across the image when moving between attention zones. The animation can be looped; in this case, the final frame is the whole image too. The animation simulates the following camera effects: tracking-in, tracking-out and panning between attention zones, slow panning across a large attention zone, and pausing on the zones. Tracking-in, tracking-out and panning effects between two zones are created by constructing a sequence of N frames. Each frame of the sequence is prepared with the following steps (a code sketch of these steps is given below):

1. Calculation of the coordinates of a bounding box for the cropping zone using the line equation in parametric form: x(t) = x1 + t(x2 - x1), y(t) = y1 + t(y2 - y1), where (x1, y1) are the coordinates of the start zone, (x2, y2) are the coordinates of the end zone, and t is the parameter, which is increased from 0 to 1 with step dt = 1/(N - 1)
2. Cropping the image using the coordinates of the calculated bounding box, preserving the aspect ratio
3. Resizing of the cropped image to the target size

Figure 14.8 demonstrates an example of the animated graphical abstract for a photo (Safonov and Bucha 2010). Two faces are detected. The hands of the kids are selected as a region for quality estimation. The animation consists of four transitions between the whole image and these three zones. The first sequence of frames looks like a camera tracking-in to a face. After that, the frame is frozen for a moment to focus on the zoomed face. The second sequence of frames looks like a camera panning between faces. The third sequence of frames looks like a camera panning and zooming-in between the face and the hands. After that, the frame with the hands is frozen for a moment for visual quality estimation. The final sequence of frames looks like a camera tracking-out to the whole scene, and a freeze frame takes place again.
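The per-frame cropping and scaling of steps 1–3 can be sketched as below; this is an illustration, not the authors' implementation, and the frame count, output size and OpenCV-based resizing are assumptions.

```python
import cv2

def transition_frames(image, zone_a, zone_b, n_frames=24, out_size=(256, 256)):
    """Generate thumbnail-size frames panning/zooming from zone_a to zone_b.
    Zones are (x, y, w, h) bounding boxes in image coordinates."""
    frames = []
    target_ar = out_size[0] / out_size[1]            # width / height of the thumbnail
    for k in range(n_frames):
        t = k / (n_frames - 1)                       # parameter from 0 to 1
        x = zone_a[0] + t * (zone_b[0] - zone_a[0])  # linear interpolation of the box
        y = zone_a[1] + t * (zone_b[1] - zone_a[1])
        w = zone_a[2] + t * (zone_b[2] - zone_a[2])
        h = zone_a[3] + t * (zone_b[3] - zone_a[3])
        h = max(h, w / target_ar)                    # preserve the target aspect ratio
        w = h * target_ar
        x0, y0 = max(int(round(x)), 0), max(int(round(y)), 0)
        x1, y1 = int(round(x + w)), int(round(y + h))
        crop = image[y0:y1, x0:x1]
        frames.append(cv2.resize(crop, out_size, interpolation=cv2.INTER_AREA))
    return frames
```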
Fig. 14.8 Illustration of animation for photo
Figure 14.9 demonstrates an example of the animated graphical abstract for the image of a scanned document. The title and an embedded picture are detected in the first stage. The fragment of the image with the title is appropriate for quality estimation. The animation consists of four transitions between the entire image and these two zones, as well as viewing the relatively large title zone. The first sequence of frames looks like a camera tracking-in to the left side of the title zone. The second sequence of frames looks like a slow camera panning across the title zone. After that, the frame is frozen for a moment for visual quality estimation. The third sequence of
Fig. 14.9 Illustration of animation for the image of a document
frames looks like a camera panning from the right side of the title zone to the picture inside the document. After that, the frame is frozen for a moment. The final sequence of frames looks like a camera tracking-out to the entire page. Finally, the frame with the entire page is frozen for a moment. This sequence of frames allows the image content to be identified confidently.
For CT images, the animation is created between zones having different characteristics according to the visual similarity measure, plus an image fragment without scaling, which is used for quality estimation. In contrast to the two previous examples, panning across the tomographic image often does not allow the location of a zone within the slice to be seen clearly. It is therefore preferable to make transitions between zones via the intermediate entire slice.
14.4 Results and Discussion
We conducted a user study to estimate the effectiveness of animated graphical abstracts in comparison with conventional large icons in Windows 10. The study was focused on the recognition of content and the estimation of image quality. The survey was held among ten people. Certainly, ten participants are not enough for a deep and statistically confident investigation; however, the survey demonstrates the advantages of our concept. Survey participants were asked to complete five tasks on one laptop PC with Windows 10 in similar viewing conditions, independently of one another. Conventional large icons were viewed in File Explorer. Animated graphical abstracts were inserted in a slide of Microsoft PowerPoint as animated GIFs. Each participant had 1 minute to solve each task: the first half of the minute viewing large icons and the second half viewing the slide with animated graphical abstracts of the same size. The first task was the selection of two photos with a person known to the respondent. The total number of viewed icons was eight. The original full-sized photos had never been seen before by the respondents. Figure 14.10 shows the conventional large icons used in the first task. Most faces are too small for confident recognition. Nevertheless, the percentage of right answers was not bad: 60% of respondents selected both photos with that person. This is probably explained by the high cognitive abilities of people: such characteristics as hair colour, head shape, build, height, and typical pose and expression allow a known person to be identified even if the size of the photo is extremely small. However, the recognition results for the animation are much better: 90% of respondents selected all requested photos. Figure 14.11 shows frames that contain enlarged faces of the target person. In most cases, the enlarged face allows a person to be identified. Even if a face is not detected as an attention zone, walking through zoomed-in image fragments allows the face to be seen in detail. Perhaps the 10% of errors are explained by carelessness, because faces were frozen only for a moment and the time for task completion was limited. The second task was the selection of two blurred photos from eight. Figure 14.12 shows the conventional large icons used in that survey. It is almost impossible to detect blurred photos by thumbnail viewing. Only 30% of participants gave the right
Fig. 14.10 Conventional large icons in the task detection of photos with a certain person
Fig. 14.11 Frames of the animated graphical abstract in the task detection of photos with a certain person
answer, which is only a little better than random guessing. Two respondents had better results than the others because they had much experience in photography and understood the shooting conditions that can cause a blurred photo. The animated graphical abstract demonstrates zoomed-in fragments of the photo and allows low-quality photos to be identified. In our survey, 90% of respondents detected the proper photos by viewing animation frames. Figure 14.13 shows enlarged fragments of sharp photos in the top row and blurred photos in the bottom row. The difference is obvious, and blurriness is detectable. The 10% of errors are probably explained by the subjective interpretation of the concept of blurriness. Indeed, sharpness and blurriness are not strictly formalised and depend on viewing conditions. The third task was the selection of the two scanned images that represent documents related to the Descreening topic. The total number of documents was nine. Figure 14.14 shows the conventional large icons of the scanned images used in that survey. For icons, the percentage of correct answers was 20%. In general, it is impossible to solve the task properly using conventional thumbnails of small size. On the contrary, animated graphical abstracts provide a high level of correct answers: 80% of respondents selected both pages related to Descreening thanks to zooming and panning through the titles of the papers, as shown in Fig. 14.15. The fourth task was the classification of sandstones, icons of slices of CT images of which are shown in Fig. 14.3. Researchers experienced in materials science can probably classify those slices from the icons more or less confidently. However, the participants in our survey had no such skills. They were instructed to make decisions based on the following text description: Bentheimer sandstone has medium, sub-angular, non-uniform grains with 10–15% inclusions of other substances such as spar and clay; Fontainebleau sandstone has large, angular, uniform grains. For icons of slices, the percentage of right answers was 50%, which corresponds to random guessing. The enlarged fragments of images in frames of the animation allow the sandstone images to be classified properly: 90% of respondents gave the right answers. Figure 14.16 shows examples of frames with enlarged fragments of slices.
Fig. 14.12 Conventional large icons in the task selection of blurred photos
Fig. 14.13 Frames of animated graphical abstract in the task selection of blurred photos: top row for sharp photos, bottom row for blurred photos
The final task was the identification of the noisiest tomographic image. We scanned the same sample six times with different exposure times and numbers of frames for averaging (Kornilov et al. 2019). A longer exposure time and a greater number of frames for averaging allow a high-quality image to be obtained; a shorter exposure time and the absence of averaging correspond to noisy images. The conventional thumbnail does not allow the noise level to be estimated: icons of slices for all six images look almost identical. That is why only 20% of respondents could identify the noisiest image correctly. Frames of the animation containing zoomed fragments of slices allow the noise level to be assessed easily: 80% of respondents identified the noisiest image by viewing the animated graphical abstract. Table 14.1 contains the results for all tasks of our survey. The animated graphical abstract provides the capability of recognising image content and estimating quality confidently, and it outperforms conventional icons and thumbnails considerably. In addition, such animation is an impressive way of navigating through image collections in software for PCs, mobile applications, and widgets. The idea of the animated graphical abstract can be extended to other types of files, for example, PDF documents.
Fig. 14.14 Conventional large icons in the task of selecting documents related to the given topic
Fig. 14.15 Frames of animated graphical abstract with panning through the title of the document
Fig. 14.16 Frames of animated graphical abstracts in the task of classifying the type of sandstone
Table 14.1 Survey results
Task                                                Conventional large icon (%)    Animated graphical abstract (%)
Detection of a certain person in a photo            60                             90
Detection of blurred photos                         30                             90
Selection of documents related to the given topic   20                             80
Classification of sandstones                        50                             90
Identification of the noisiest tomographic image    20                             80
References

Bai, Y., Zhang, Y., Ding, M., Ghanem, B.: Finding tiny faces in the wild with generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–30 (2018)
Berkner, K.: How small should a document thumbnail be? In: Digital Publishing. SPIE. 6076, 60760G (2006)
Berkner, K., Schwartz, E.L., Marle, C.: SmartNails – display and image dependent thumbnails. In: Document Recognition and Retrieval XI. SPIE. 5296, 54–65 (2003)
Chen, H., Wang, B., Pan, T., Zhou, L., Zeng, H.: Cropnet: real-time thumbnailing. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 81–89 (2018)
Cheng, M.M., Mitra, N.J., Huang, X., Torr, P.H., Hu, S.M.: Global contrast based salient region detection. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 569–582 (2014)
Egorova, M.A., Murynin, A.B., Safonov, I.V.: An improvement of face detection algorithm for color photos. Pattern Recognit. Image Anal. 19(4), 634–640 (2009)
Esmaeili, S.A., Singh, B., Davis, L.S.: Fast-at: fast automatic thumbnail generation using deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4622–4630 (2017)
Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20(11), 1254–1259 (1998)
Koik, B.T., Ibrahim, H.: Image thumbnail based on fusion for better image browsing. In: Proceedings of the IEEE International Conference on Control System, Computing and Engineering, pp. 547–552 (2014)
Kornilov, A., Safonov, I., Yakimchuk, I.: Blind quality assessment for slice of microtomographic image. In: Proceedings of the 24th Conference of Open Innovations Association (FRUCT), pp. 170–178 (2019)
Kornilov, A.S., Reimers, I.A., Safonov, I.V., Yakimchuk, I.V.: Visualization of quality of 3D tomographic images in construction of digital rock model. Sci. Vis. 12(1), 70–82 (2020)
Li, Y., Sun, B., Wu, T., Wang, Y.: Face detection with end-to-end integration of a ConvNet and a 3d model. In: Proceedings of the European Conference on Computer Vision, pp. 420–436 (2016)
Lie, M.M., Neto, H.V., Borba, G.B., Gamba, H.R.: Automatic image thumbnailing based on fast visual saliency detection. In: Proceedings of the 22nd Brazilian Symposium on Multimedia and the Web, pp. 203–206 (2016)
Lin, K.C.: On improvement of the computation speed of Otsu’s image thresholding. J. Electron. Imaging. 14(2), 023011 (2005)
Liu, N., Han, J.: DHSNet: deep hierarchical saliency network for salient object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 678–686 (2016)
Ramalho, G.L.B., Ferreira, D.S., Rebouças Filho, P.P., de Medeiros, F.N.S.: Rotation-invariant feature extraction using a structural co-occurrence matrix. Measurement. 94, 406–415 (2016)
Safonov, I.V., Bucha, V.V.: Animated thumbnail for still image. In: Proceedings of the Graphicon conference, pp. 79–86 (2010)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Adaptive Image Processing Algorithms for Printing. Springer Nature Singapore AG, Singapore (2018)
Safonov, I.V., Kurilin, I.V., Rychagov, M.N., Tolstaya, E.V.: Document Image Processing for Scanning and Printing. Springer Nature Switzerland AG, Cham (2019)
Samadani, R., Mauer, T., Berfanger, D., Clark, J., Bausk, B.: Representative image thumbnails: automatic and manual. In: Human Vision and Electronic Imaging XIII. SPIE. 6806, 68061D (2008)
Suh, B., Ling, H., Bederson, B.B., Jacobs, D.W.: Automatic thumbnail cropping and its effectiveness. In: Proceedings of ACM UIST (2003)
Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 511–518 (2001)
Wang, L., Lu, H., Ruan, X., Yang, M.H.: Deep networks for saliency detection via local estimation and global search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3183–3192 (2015)
Zhang, C., Zhang, Z.: Improving multi-view face detection with multi-task deep convolutional neural networks. In: IEEE Winter Conference on Applications of Computer Vision, pp. 1036–1041 (2014)
Zhao, R., Ouyang, W., Li, H., Wang, X.: Saliency detection by multi-context deep learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1265–1274 (2015)
Chapter 15
Real-Time Video Frame-Rate Conversion

Igor M. Kovliga and Petr Pohl
15.1 Introduction
The problem of increasing the frame rate of a video stream started to gain attention in the mid-1990s, as TV screens grew larger and the stroboscopic effect caused by the discretisation of smooth motion became more apparent. The early 100 Hz CRT TV image was of low resolution (PAL/NTSC), and the main goal was to get rid of the CRT's inherent flicker, but with ever larger LCD TV sets, the need for good real-time frame-rate conversion (FRC) algorithms kept growing. The problem setup is to analyse the motion of objects in a video stream and create new frames that follow the same motion. With high-resolution content, this is obviously a computationally demanding task that needs to analyse frame data in real time and to interpolate and compose new frames. In the TV industry, the problem of computational load was solved by dedicated FRC chips with highly optimised circuitry and without strict limitations on power consumption.

The prevailing customer of Samsung R&D Institute Russia was the mobile division, so we proposed to bring the FRC "magic" to the smartphone segment. The computational performance of mobile SoCs was steadily increasing, even more so on the GPU side. The first use cases for FRC were reasonably chosen to have limited duration, so that the increased power consumption was not a catastrophic problem. The use cases were Motion Photo playback and Super Slow Motion capture. We expected that, besides delivering relatively good quality, we would have to provide a solution working smoothly in real time on a mobile device, possibly under power consumption limitations. These requirements more or less dictate the use of block-wise
motion vectors, which dramatically decreases the complexity of all parts of the FRC algorithm.
15.2 Frame-Rate Conversion Algorithm Structure
The high-level structure of an FRC algorithm is, with slight variations, shared between many variants of FRC (Cordes and de Haan 2009). The main stages of the algorithm are:

1. Motion estimation (ME) – usually analyses two consecutive video frames and returns motion vectors suitable for tracking objects, so-called true motion (de Haan et al. 1993; Pohl et al. 2018).
2. Occlusion processing (OP) and preparation of data for MCI – analyses several motion vector fields, makes decisions about appearing and disappearing areas, and creates data to guide their interpolation. Often, this stage modifies motion vectors (Bellers et al. 2007).
3. Motion-compensated interpolation (MCI) – takes data from OP (occlusion-corrected motion vectors and weights) and produces the interpolated frame.
4. Fallback logic – keyframe repetition instead of FRC processing is applied if the input is complex video content (the scene changes, or highly nonlinear or extremely fast motion appears). In this case, strong interpolation artefacts are replaced by judder, globally or locally, which is visually better.

We had to develop a purely software (SW) FRC algorithm for the fastest possible commercialisation and prospective support of devices already released to the market. In our case, purely SW means an implementation that uses any number of available CPU cores and the GPU via the OpenCL standard. This gave us a few advantages:

• Simple upgrades of the algorithm in case of severe artefacts found in some specific types of video content
• Relatively simple integration with existing smartphone software
• Rapid implementation of possible new scenarios
• Release from some hardware limitations in the form of a small amount of cached frame memory and one-pass motion estimation

We chose to use 8 × 8 basic blocks, but for higher resolutions, it is possible to increase the block size in the ME stage with further upscaling of motion vectors for subsequent stages.
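The interplay of the four stages can be summarised in the structural sketch below. The type and function names are illustrative interfaces rather than the real API, and the OP stage is simplified (in reality it consumes motion fields from several consecutive keyframes, as described in Sect. 15.4).

```cpp
#include <cstdint>
#include <vector>

// Illustrative types: a decoded frame and a block-wise motion field.
struct Frame { int width, height; std::vector<uint8_t> luma; };
struct MotionField { int blocksX, blocksY; std::vector<int16_t> dx, dy; };

// Hypothetical stage interfaces (names are not the real API of the solution).
MotionField estimateMotion(const Frame& from, const Frame& to);              // ME
struct OcclusionData { MotionField fixedField; std::vector<float> alphaAdj; };
OcclusionData processOcclusions(const MotionField& fwd, const MotionField& bwd,
                                float alpha);                                // OP
Frame interpolate(const Frame& prev, const Frame& next,
                  const OcclusionData& occ, float alpha);                    // MCI
bool sceneTooComplex(const MotionField& fwd, const MotionField& bwd);        // fallback test

// Interpolate one frame at phase alpha between two consecutive keyframes.
Frame frcInterpolate(const Frame& prev, const Frame& next, float alpha) {
    MotionField fwd = estimateMotion(prev, next);   // forward field, anchored on prev
    MotionField bwd = estimateMotion(next, prev);   // backward field, anchored on next
    if (sceneTooComplex(fwd, bwd))
        return prev;                                // global fallback: repeat the keyframe
    OcclusionData occ = processOcclusions(fwd, bwd, alpha);
    return interpolate(prev, next, occ, alpha);
}
```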
15.3 ME Stage: Real-Time 3DRS Algorithm
A motion estimation (ME) algorithm is a crucial part of many algorithms and systems, for example, video encoders, frame-rate conversion (FRC), and structure from motion. The performance of the ME algorithm typically provides an
overwhelming contribution to the performance of the ME-based algorithm in terms of both computational complexity and visual quality, and it is therefore critical for many ME applications to have a low-complexity ME algorithm that provides a good-quality motion field. However, the ME algorithm is highly task-specific, and there is no "universal" ME that is easily and efficiently applicable to any task. Since we focus our efforts on FRC applications, we choose the 3D recursive search (3DRS) algorithm (de Haan et al. 1993) as a baseline ME algorithm, as it is well suited for real-time FRC software applications. The 3DRS algorithm has several important advantages that allow a reasonable quality of the motion field used for FRC to be obtained at low computational cost. Firstly, it is a block matching algorithm (BMA); secondly, it checks a very limited set of candidates for each block; and thirdly, many techniques developed for other BMAs can be applied to 3DRS to improve the quality, computational cost, or both. Many 3DRS-based ME algorithms are well known.

One considerable drawback of 3DRS-based algorithms with a meandering scanning order is the impossibility of parallel processing when a spatial candidate lies in the same row as the current block being propagated. Figure 15.1 shows the processing dependency. The green blocks are those which need to be processed before processing the current block (depicted by a white colour in a red border). Blocks marked in red cannot be processed until the processing of the current block is finished. A darker colour shows a direct dependency. This drawback limits processing speed, since only one processing core of a multicore processor (MCP) can be used for 3DRS computation. This can also increase the power consumption of the MCP since power consumption rises superlinearly with increasing clock frequency. A time-limited task can be solved more power-efficiently on two cores with a lower frequency than on one core with a higher frequency.

In this work, we introduce several modifications to our variant of the 3DRS algorithm that allow multithreaded processing to obtain a motion field and also
Fig. 15.1 Trees of direct dependencies (green and red) and areas of indirect dependencies (light green and light red) for a meandering order (left). Top-to-bottom meandering order used for forward ME (right top). Bottom-to-top meandering order used for backward ME (right bottom)
improve the computational cost without any noticeable degradation in the quality of the resulting motion field.
15.3.1 Baseline 3DRS-Based Algorithm

The 3DRS algorithm is based on block matching, using a frame divided into blocks of pixels, where X = (x, y) are the pixel-wise coordinates of the centre of a block. Our FRC algorithm requires two motion fields for each pair of consecutive frames Ft−1, Ft. The forward motion field DFW,t−1(X) is the set of motion vectors assigned to the blocks of Ft−1; these motion vectors point to frame Ft. The backward motion field DBW,t(X) is the set of motion vectors assigned to the blocks of Ft; these motion vectors point to frame Ft−1.

To obtain a motion vector for each block, we try only a few candidates, as opposed to an exhaustive search that tests all possible motion vectors for each block. The candidates we try in each block are called a candidate set. The rules for selecting motion vectors in the candidate set are the same for each block. We use the following rules (CS – candidate set) to search the current motion field Dcur:

CS(X) = {CSspatial(X), CStemporal(X), CSrandom(X)},
CSspatial(X) = {csD | Dcur(X + UScur)},
UScur = {(−W, 0), (0, −H), (4W, −H), (−W, −4H), (2W, −3H)},
CStemporal(X) = {csD | Dpred(X + USpred)},
USpred = {(0, 0), (W, 0), (0, H), (4W, 2H)},
CSrandom(X) = {csD | {csDbest(X) + (rnd(2), rnd(2)); (rnd(2), rnd(2)); (rnd(9), rnd(9))}},
csDbest(X) = argmin over csD ∈ {CSspatial(X), CStemporal(X)} of MAD(X, csD),

where W and H are the width and height of a block (we use 8 × 8 blocks); rnd(k) is a function whose result is a random value from the range ⟨−k, −k+1, ..., k⟩; MAD is the mean absolute difference between the window over the current block B(X) of one frame and the window over the block pointed to by a motion vector in the other frame; and the size of the windows is 16 × 12. Dcur is the motion vector from the current motion field, and Dpred is a predictor obtained from the previously found motion field. If the forward motion field DFW,t−1(X) is searched, then the predictor is −DBW,t−1(X); if the backward motion field DBW,t(X) is searched, then the predictor PBW,t is formed from DFW,t−1(X) by projecting it onto the block grid of frame Ft with subsequent inversion: PBW,t(X + DFW,t−1(X)) = −DFW,t−1(X).

In fact, two ME passes are used for each pair of frames: the first pass is an estimation of the forward motion field, and the second pass is an estimation of the backward motion field.
Fig. 15.2 Sources of spatial and temporal candidates for a block (marked by a red box) in an even row during forward ME (left); sources of a block in an odd row during forward ME (right)
We use different scanning orders for the first and second passes. The top-to-bottom meandering scanning order (from top to bottom, going from left to right in odd rows and from right to left in even rows) is used for the first pass (right-top image of Fig. 15.1). The bottom-to-top meandering scanning order (from bottom to top, going from right to left in odd rows and from left to right in even rows; rows are numbered from bottom to top in this case) is used for the second pass (right-bottom image of Fig. 15.1). The relative positions UScur of the spatial candidate set CSspatial and the relative positions USpred of the temporal candidate set CStemporal given above are valid only for the top-to-bottom, left-to-right directions. If the direction is inverted for some coordinate, then the corresponding coordinates in UScur and USpred should be inverted accordingly. Thus, the direction of recursion in a meandering scanning order changes from row to row. In Fig. 15.2, we show the sources of spatial candidates (CSspatial, green blocks) and temporal candidates (CStemporal, orange blocks) for two scan orders: on the left-hand side, the top-to-bottom, left-to-right direction, and on the right-hand side, the top-to-bottom, right-to-left direction.

The block erosion process (de Haan et al. 1993) was skipped in our modification of 3DRS. For additional smoothing of the backward motion field, we applied additional regularisation after ME. For each block, we compared MADnarrow(X, D) for the current motion vector D of the block with MADnarrow(X, Dmedian), where Dmedian is a vector obtained by per-coordinate combination of the median values of a set consisting of the nine motion vectors from the 8-connected neighbourhood and the current block itself. Here, MADnarrow differs from the MAD used for 3DRS matching in that the size of its window is decreased to 8 × 8 (i.e. equal to the block size). The original motion vector is overwritten by Dmedian if MADnarrow for Dmedian is better, or worse only by a small margin.
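A minimal sketch of the block-matching core is given below. The function names, the window centring, and the border clamping are illustrative choices rather than the exact production code, but the 16 × 12 matching window over 8 × 8 blocks follows the description above.

```cpp
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// Mean absolute difference between a 16x12 window centred on 8x8 block (bx, by)
// in frame `cur` and the same window displaced by candidate (dx, dy) in frame `ref`.
// Frames are 8-bit luma planes; coordinates are clamped at the borders.
static int mad16x12(const uint8_t* cur, const uint8_t* ref,
                    int width, int height, int stride,
                    int bx, int by, int dx, int dy) {
    const int W = 16, H = 12;
    const int x0 = bx * 8 - (W - 8) / 2;     // window centred on the 8x8 block
    const int y0 = by * 8 - (H - 8) / 2;
    auto clamp = [](int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); };
    int sum = 0;
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            int cx = clamp(x0 + x, 0, width - 1),      cy = clamp(y0 + y, 0, height - 1);
            int rx = clamp(x0 + x + dx, 0, width - 1), ry = clamp(y0 + y + dy, 0, height - 1);
            sum += std::abs(int(cur[cy * stride + cx]) - int(ref[ry * stride + rx]));
        }
    return sum / (W * H);
}

// Pick the best motion vector for a block from a small candidate set,
// as 3DRS does instead of an exhaustive search.
struct MV { int dx, dy; };
static MV pickBest(const std::vector<MV>& candidates,
                   const uint8_t* cur, const uint8_t* ref,
                   int width, int height, int stride, int bx, int by) {
    MV best{0, 0};
    int bestCost = std::numeric_limits<int>::max();
    for (const MV& c : candidates) {
        int cost = mad16x12(cur, ref, width, height, stride, bx, by, c.dx, c.dy);
        if (cost < bestCost) { bestCost = cost; best = c; }
    }
    return best;
}
```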
15.3.2 Wave-Front Scanning Order

The use of a meandering scanning order in combination with the candidate set described above prevents the possibility of parallel processing several blocks of a
given motion field. This is illustrated in Fig. 15.1. The blocks marked in green should be processed before the processing of the current block (marked in white in a red box) starts, due to their direct dependency via the spatial candidate set; the blocks marked in red will be directly affected by the estimated motion vector of the current block and therefore must be processed after it. The light green and light red blocks mark indirect dependencies. Thus, there are no blocks that can be processed simultaneously with the current block, since all blocks need to be processed either before or after it. In Al-Kadi et al. (2010), the authors propose a parallel processing scheme that preserves the direction switching of a meandering order. Its main drawback is that the spatial candidate set is not optimal, since the upper blocks for some threads are not yet processed at the beginning of row processing. If we change the meandering order to a simple "raster scan" order (always from left to right in each row), then the dependencies become smaller (see Fig. 15.3a).

We propose changing the scanning order to achieve wave-front parallel processing, as proposed in the HEVC standard (Chi et al. 2012), or a staggered approach as shown in Fluegel et al. (2006). A wave-front scanning order is depicted in Fig. 15.3b, together with the dependencies and the sets of blocks which can be processed in parallel (highlighted in blue). The set of blue blocks is called a wave-front. In the traditional method of wave-front processing, each thread works on one row of blocks. When a block is processed, the thread should be synchronised with the thread that works on the row above, in order to preserve the dependency (the upper thread needs to stay ahead of the lower thread). This often produces stalls due to the different times needed to process different blocks. Our approach is different: working threads process all blocks belonging to a wave-front independently. Thus, synchronisation that produces a stall is performed only when the processing of the next wave-front starts, and even this stall may not happen, since the starting point of the next wave-front is usually ready for processing, provided that the number of tasks is at least several times greater than the number of cores. The proposed approach therefore eliminates the majority of stalls. The wave-front scanning order changes the resulting motion field, because it uses only "left to right" relative positions UScur and USpred during the estimation of forward motion and only "right to left" for backward motion. In contrast, the meandering scanning order switches between "left to right" and "right to left" after processing every row of blocks.
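The sketch below illustrates wave-front scheduling with one parallel loop per front (OpenMP is used here only for brevity; our implementation relies on its own thread pool). The 45° diagonal shown assumes that the spatial dependencies reach at most one block to the left and into the row above; with a wider candidate set, the front must be slanted more steeply.

```cpp
#include <vector>

struct BlockPos { int bx, by; };

// Process all blocks of a motion field in wave-front order: blocks on the same
// anti-diagonal (bx + by == w) are mutually independent and can be estimated in
// parallel; the implicit barrier between loops separates consecutive wave-fronts.
// estimateBlock() stands for one 3DRS candidate evaluation for a block.
void processWaveFronts(int blocksX, int blocksY,
                       void (*estimateBlock)(int bx, int by)) {
    for (int w = 0; w < blocksX + blocksY - 1; ++w) {     // one wave-front per iteration
        std::vector<BlockPos> wave;
        for (int by = 0; by < blocksY; ++by) {
            int bx = w - by;
            if (bx >= 0 && bx < blocksX) wave.push_back({bx, by});
        }
        // All blocks of the current wave-front are independent of each other.
        #pragma omp parallel for
        for (int i = 0; i < (int)wave.size(); ++i)
            estimateBlock(wave[i].bx, wave[i].by);
        // End of the parallel loop acts as the wave-front synchronisation point.
    }
}
```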
Fig. 15.3 Parallel processing several blocks of a motion field: (a) trees of dependencies for a raster (left-right) scan order; (b) wave-front scanning order; (c) slanted wave-front scanning order (two blocks in one task)
15.3.3 Slanted Wave-Front Scanning Order

The proposed wave-front scanning order has an inconvenient memory access pattern and hence uses the processor cache ineffectively. For a meandering scanning order with smooth motion, the memory accesses are serial, and frame data stored in the cache are reused effectively. The main direction of the wave-front scanning order is diagonal, which nullifies the advantage of a long cache line and degrades the reuse of data in the cache. As a result, the number of memory accesses (cache misses) increases. To solve this problem, we propose to use several blocks placed in raster order as one task for parallel processing (see Fig. 15.3c, where each task consists of two blocks). We call this modified order a slanted wave-front scanning order. This solution changes only the scanning order, not the directions of recursion, so only rnd(k) influences the resulting motion field. If rnd(k) is a function of X (the spatial position in the frame) and of the number of calls at X, then the results will be exactly the same as for the initial wave-front scanning order. The quantity of blocks in one task can vary; a greater number is better for the cache but can limit the number of parallel tasks. Reducing the quantity of tasks limits the maximum number of MCP cores used effectively but also reduces the overhead for thread management.
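Building the task list for one slanted wave-front then amounts to grouping consecutive raster-order blocks of each row, as in the sketch below; the diagonal task indexing and the structure names are illustrative choices of ours.

```cpp
#include <algorithm>
#include <vector>

struct Task { int bx0, by, count; };   // `count` consecutive blocks of row `by`, starting at bx0

// Build the task list for one slanted wave-front: each task covers `blocksPerTask`
// consecutive blocks of a row, so a worker walks memory in raster order and reuses
// cached frame data, while tasks of the same front stay mutually independent.
std::vector<Task> slantedWaveFront(int blocksX, int blocksY,
                                   int blocksPerTask, int front) {
    std::vector<Task> tasks;
    int tasksPerRow = (blocksX + blocksPerTask - 1) / blocksPerTask;
    for (int by = 0; by < blocksY; ++by) {
        int taskIdx = front - by;                 // same diagonal rule, applied to tasks
        if (taskIdx < 0 || taskIdx >= tasksPerRow) continue;
        int bx0 = taskIdx * blocksPerTask;
        int count = std::min(blocksPerTask, blocksX - bx0);
        tasks.push_back({bx0, by, count});
    }
    return tasks;
}
```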
15.3.4 Double-Block Processing

The computational cost of motion estimation can be represented as the sum of the cost of MAD calculations and the cost of the control programme code (managing the scanning order, construction of the candidate set, optimisations related to skipping the MAD calculation for identical candidates, and so on). During our experiments, we observed that a significant share of computation was spent on the control code. To decrease this overhead, we introduce double-block processing. This means that one processing unit consists of a horizontal pair of neighbouring blocks (called a double block) instead of a single block. The use of double-block processing allows us to reduce almost all control cycles by half. One candidate set is considered for both blocks of the double block. For example, in forward ME, the candidate set CS(X) from the left block of a pair is also used for the right block of the pair. However, the MAD is calculated individually for each block of the pair, and the best candidate is chosen separately for each block of the pair. This point distinguishes double-block processing from a horizontal enlargement of the block.

A slanted wave-front scanning order where one task consists of two double-block units is shown in Fig. 15.4. The centre of each double-block unit is marked by a red point, and the left and right blocks of the double-block unit are separated by a red line. The current double-block unit is highlighted by a red rectangle. For this unit, the sources are shown for the spatial candidate set (green blocks) and for the temporal candidate set (orange blocks) related to block A. The same sources are used for block B, which belongs to the same double block as A.
Fig. 15.4 Candidate set for a double-block unit. The current double block consists of a block A and a block B. The candidate set constructed for block A is also used for block B
Thus, the same motion vector candidates, including the random ones, are tried for both blocks A and B. However, a separate candidate set for block B may be useful when the candidate set of block A gives results (best MAD values) for blocks A and B that are too different; this can happen on some edges of a moving object or for a nonlinearly moving object. We therefore propose an additional step that analyses the MAD values related to the best motion vectors of blocks A and B of the double-block unit. Depending on the results of this step, we either accept the previously estimated motion vectors or carry out a separate motion estimation procedure for block B. If XA = (x, y) and XB = (x + W, y) are the centres of the two blocks of a double-block pair, we can introduce the following decision rule for additional analysis of block B candidates:

DBEST(X, CS(XA)) = argmin over csD ∈ CS(XA) of MAD(X, csD),
MA = MAD(XA, DBEST(XA, CS(XA))),
MB = MAD(XB, DBEST(XB, CS(XA))).

Candidates from CS(XB) for block B are considered only if MB > T1 or MB − MA > T2, where T1 and T2 are threshold values. Reasonable threshold values for the usual 8-bit frames are T1 = 16 and T2 = 5.
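The decision rule above maps directly to a few lines of code; the default thresholds follow the values quoted in the text, and the function name is an illustrative choice.

```cpp
// Decide whether block B of a double block needs its own candidate search.
// mA and mB are the best MAD values obtained for blocks A and B with the candidate
// set built for block A; T1 and T2 are the thresholds from the text (for 8-bit frames).
bool needsOwnSearch(int mA, int mB, int T1 = 16, int T2 = 5) {
    return (mB > T1) || (mB - mA > T2);
}
// Candidates from CS(XB) are evaluated for block B only when this returns true.
```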
15.3.5 Evaluation of 3DRS Algorithm Modifications

The quality of the proposed modifications was checked for various Full HD video streams (1920 × 1080) with the help of the FRC algorithm. To get ground truth, we down-sampled the video streams from 30 fps to 15 fps and then up-converted them back to 30 fps with the help of motion fields obtained by the tested ME algorithms.
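The quality score used throughout this evaluation is the PSNR of the interpolated frames against the dropped originals; a minimal implementation of that measure is sketched below (it assumes two luma planes of equal, non-zero size).

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// PSNR of the luma plane of an interpolated frame against the dropped original:
// every second frame of a 30 fps clip is removed, re-interpolated at alpha = 0.5,
// and compared with the original frame.
double psnr(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b) {
    if (a.empty() || a.size() != b.size()) return 0.0;   // illustrative guard
    double mse = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = double(a[i]) - double(b[i]);
        mse += d * d;
    }
    mse /= double(a.size());
    return mse == 0.0 ? INFINITY : 10.0 * std::log10(255.0 * 255.0 / mse);
}
```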
The initial version of our 3DRS-based algorithm was described above in the section entitled "Baseline 3DRS-Based Algorithm"; the proposed modifications were then applied to it. Luminance components of the input frames (initially in YUV 4:2:0) were down-sampled for ME by a factor of two per coordinate using an 8-tap filter, and the resulting motion vectors had a precision of two pixels. The MCI stage worked with the initial frames at Full HD resolution. To calculate the MAD match metric, we used the luminance component of the frame and one of the chrominance components; one chrominance component was used for forward ME and the other for backward ME. In backward ME, the wave-front scanning order was also switched to bottom-to-top and right-to-left.

Table 15.1 presents the quality of the FRC algorithm based on (a) the initial version of the 3DRS-based algorithm, (b) a version modified using the wave-front scanning order, and (c) a version modified using both the wave-front scanning order and double-block units. The proposed modifications retain the quality of the FRC output, except for a small drop when double-block processing is enabled.

Table 15.2 presents the overall speed results. The last column of the table contains the mean execution time of the sum of forward and backward ME for a pair of frames over the five test video streams that were also used for quality testing. Experiment E1 shows the parameters and speed of the initial 3DRS-based algorithm as described in the section entitled "Baseline 3DRS-Based Algorithm". Experiment E2 shows a 19.7% drop (E2 vs. E1) when the diagonal wave-front scanning order is applied instead of the meandering order used in E1. The proposed slanted wave-front used in E3 and E4 (eight and 16 blocks in each task) reduces the drop to 4% (E4 vs. E1). The proposed double-block processing increases the speed by 12.6% (E5 vs. E4) relative to the version without double-block processing and with the same number of blocks per task.

The speed performance of the proposed modifications of the 3DRS-based algorithm was evaluated using a Samsung Galaxy S8 mobile phone based on the MSM8998 chipset. The clock frequency was fixed within a narrow range for the stability of the results. The MSM8998 chipset uses a big.LITTLE configuration with four power-efficient and four powerful cores in clusters with different performance. Threads performing ME were assigned to the powerful cluster of CPU cores in experiments E1–E11.

Table 15.1 Comparison of initial quality and proposed modifications: baseline (A), wave-front scan (B), wave-front + double block (C)
Video stream    A PSNR (dB)    B PSNR (dB)    C PSNR (dB)
Climb           42.81          42.81          42.77
Dvintsev12      27.56          27.55          27.44
Turn2           31.98          31.98          31.97
Bosphorus^a     43.21          43.21          43.20
Jockey^a        34.44          34.44          34.34
Average         36.00          36.00          35.94

^a Description of video can be found in Correa et al. (2016)
Table 15.2 Comparison of execution times for proposed modifications

Experiment   Scanning order   MT control code   Thread count   Double block   Units in task   Blocks in task   Mean time (ms)
E1           M                No                1              No             1               1                24.91
E2           WF               No                1              No             1               1                29.81
E3           WF               No                1              No             8               8                26.44
E4           WF               No                1              No             16              16               25.90
E5           WF               No                1              Yes            8               16               22.63
E6           WF               Yes               1              No             1               1                33.29
E7           WF               Yes               1              No             8               8                28.67
E8           WF               Yes               1              No             16              16               27.94
E9           WF               Yes               1              Yes            8               16               24.38
E10          WF               Yes               2              Yes            8               16               14.77
E11          WF               Yes               4              Yes            8               16               14.14

M meandering scanning order, WF wave-front scanning order, MT multithreading

Table 15.3 Execution times of ME on small cluster

Name of experiment   Number of threads   Mean execution time (ms)   Comparison with E12 (%)
E12                  1                   55.03                      100
E13                  2                   27.38                      49.7
E14                  3                   20.12                      36.6
E15                  4                   18.23                      33.1
If we preserve the conditions of experiments E9–E11 and only pin ME threads to a power-efficient CPU cluster, we obtain different results (see E12–E15 in Table 15.3). The parallelisation of three threads is closer to the ideal (E14 vs. E12). The attempt to use four threads did not give a good improvement (E15 vs. E14). Our explanation of this fact is that one core is at least partially occupied by OS work.
15.4 OP Stage – Detection of Occlusions and Correction of Motion
Occlusions are areas of a frame pair that are visible in only one of the two frames: they disappear in the frame in which we try to find matching positions. For video sequences, there are two types of occlusions – covered and uncovered areas (Bellers et al. 2007). We consider only interpolation that uses the two source frames nearest to the interpolated temporal position. Normal parts of a frame have a good match – locally, the image patches are very similar, and it is possible to use an appropriate patch from either image or their mixture (in the case of a correct motion vector). The main issue with occluded parts is that motion vectors are unreliable in occlusions and
appropriate image patches exist in only one of the neighbouring frames. So, two critical decisions have to be made: which motion vector is right for a given occlusion, and which frame is appropriate to interpolate from (if a covered area is detected, only the previous frame should be used; if an uncovered area is detected, only the next frame should be used). Figure 15.5 illustrates the occlusion problem in a simplified 1D cut. In reality, the vectors are 2D and, in the case of real-time ME, usually somewhat noisy. To implement block-wise interpolation effectively (the MCI stage is described in the next section), one has to estimate all vectors in the block grid of the interpolated frame, and motion vectors should be corrected in occlusions. In general, there are two main possibilities for filling in vectors in occlusions: spatial or temporal "propagation", or inpainting. Our solution uses temporal filling because its implementation is very efficient.

Let the forward motion field DFW,N be the set of motion vectors assigned to the blocks lying in the block grid of frame FN; these forward motion vectors point to frame FN+1 from frame FN. Motion vectors of the backward motion field DBW,N point to frame FN−1 from frame FN. For detection of covered and uncovered areas in frame FN+1, we need the motion fields DFW,N and DBW,N+2 generated by three consecutive frames (FN, FN+1, FN+2) (Bellers et al. 2007). We should analyse the DBW,N+2 motion field to detect covered areas in frame FN+1: areas of FN+1 to which no motion vector of DBW,N+2 points are covered areas in FN+1. Consequently, motion vectors of DFW,N+1 in those covered areas may be incorrect. Collocated inverted motion vectors from the DBW,N+1 motion field may be used to correct the incorrect motion in DFW,N+1 (this is the temporal filling mentioned above). The motion field DFW,N should be used to detect uncovered areas in frame FN+1; collocated inverted motion vectors from DFW,N+1 in the detected uncovered areas may be used instead of the potentially incorrect motion vectors of motion field DBW,N+1.

In our solution, for interpolation of any frame FN+1+α (α ∈ [0...1] is the phase of the interpolated frame), we detect covered areas in FN+1, obtaining CoverMapN+1, and detect uncovered areas in FN+2, obtaining UncoverMapN+2, as described above. These maps simply indicate whether blocks of the frame belong to the occluded area or not. We need two motion fields DFW,N+1, DBW,N+2 between frames FN+1, FN+2 for detection of those maps, and two more motion fields DBW,N+1, DFW,N+2 between the frame pairs FN, FN+1 and FN+2, FN+3 for correction of motion vectors in the found occlusions (see Fig. 15.5). Basically, we calculate one of the predicted interpolated blocks with coordinates (x, y) in frame FN+1+α by using some motion vector (dx, dy) as follows (suppose that the backward motion vector from FN+2 to FN+1 is used):

Pred(x, y) = FN+1(x + α·dx, y + α·dy) · (1 − α) + FN+2(x − (1 − α)·dx, y − (1 − α)·dy) · α.

Here, we mix the previous and next frames with proportions determined by the phase α. But we need to understand the proper proportion for mixing the previous and next frames in occlusions, because only the previous frame FN+1 should be used in covered areas and only the next frame FN+2 in uncovered areas.
Fig. 15.5 1D visualisation of covered and uncovered areas for an interpolated frame FN+1+α
So, we need to calculate a weight αadj(x, y) for each block instead of using the phase α everywhere:

Pred(x, y) = FN+1(x + α·dx, y + α·dy) · (1 − αadj(x, y)) + FN+2(x − (1 − α)·dx, y − (1 − α)·dy) · αadj(x, y).

αadj should tend to 0 in covered areas and to 1 in uncovered areas; in other areas, it should stay equal to α. In addition, it is necessary to use corrected motion vectors in the occlusions. To calculate αadj and obtain the corrected motion field fixedαDBW,N+2, we do the following (the details of the algorithm below are described in Chappalli and Kim (2012)):

• Copy DBW,N+2 to fixedαDBW,N+2.
• (See the top part of Fig. 15.6 for details.) Looking at where blocks from FN+2 move in the interpolated frame FN+1+α under the motion vectors of DBW,N+2, the moved position of a block with coordinates (x, y) in keyframe FN+2 is (x + (1 − α)·dx(x, y), y + (1 − α)·dy(x, y)) in frame FN+1+α. The overlapped area between each block of the block grid of the interpolated frame and all moved blocks from the keyframe can be found. In this case, αadj can be made proportional to the overlapped area: αadj = α · (overlapped area of the interpolated block) / (size of the interpolated block). If the collocated area in CoverMapN+1 was marked as a covered area for some block of the interpolated frame, then (a) the block of the interpolated frame is marked as COVER if its overlapped area equals zero, and (b) if the block was marked as COVER, the motion vector from the collocated position of DBW,N+1 is copied to the collocated position of fixedαDBW,N+2.
• (See the bottom part of Fig. 15.6 for details.) For DFW,N+1, we look at where blocks from FN+1 move in the interpolated frame. For blocks of the interpolated frame collocated with an uncovered area in UncoverMapN+1, we do the following: (a) calculate αadj as αadj = 1 − (1 − α) · (overlapped area of the interpolated block) / (size of the interpolated block); (b) mark as UNCOVER the blocks of the interpolated frame that have zero overlapped area; (c) copy inverted motion vectors from the collocated positions of DFW,N+2 to the collocated positions of fixedαDBW,N+2 for all blocks marked as UNCOVER; (d) copy inverted motion vectors from the collocated positions of DFW,N+1 for all other blocks (which were not marked as UNCOVER).

To obtain the ratio of the overlapped area to the size of the interpolated block, pixel-wise operations are needed. Pixel-wise operations can be removed if the ratio is replaced by max(0, 1 − (Euclidean distance to the nearest moved block) / (linear size of the interpolated block)).
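A sketch of the covered-area branch of this computation is shown below: it accumulates, per interpolated block, the area covered by keyframe blocks moved along their backward vectors and converts the overlap into αadj. The data layout, the truncation to integer positions, and the omission of the uncovered-area branch and of the COVER marking details are simplifications of ours, not the production implementation.

```cpp
#include <algorithm>
#include <vector>

struct MV { int dx, dy; };
constexpr int B = 8;                       // block size in pixels

// Accumulate, for every block of the interpolated frame, the area covered by
// keyframe blocks of F_N+2 moved by (1 - alpha) times their backward motion vector;
// alphaAdj is then alpha * overlap / blockArea. Blocks with zero overlap that lie
// in a detected covered area are the COVER blocks whose vectors must be replaced.
void computeAlphaAdjCovered(const std::vector<MV>& bwField, int blocksX, int blocksY,
                            float alpha, std::vector<float>& alphaAdj) {
    std::vector<int> overlap(blocksX * blocksY, 0);
    for (int by = 0; by < blocksY; ++by)
        for (int bx = 0; bx < blocksX; ++bx) {
            const MV& v = bwField[by * blocksX + bx];
            // Moved position of this keyframe block inside the interpolated frame.
            int x0 = bx * B + int((1.0f - alpha) * v.dx);
            int y0 = by * B + int((1.0f - alpha) * v.dy);
            // Distribute its area over the (up to four) interpolated blocks it touches.
            for (int gy = y0 / B; gy <= (y0 + B - 1) / B; ++gy)
                for (int gx = x0 / B; gx <= (x0 + B - 1) / B; ++gx) {
                    if (gx < 0 || gy < 0 || gx >= blocksX || gy >= blocksY) continue;
                    int ox = std::min(x0 + B, (gx + 1) * B) - std::max(x0, gx * B);
                    int oy = std::min(y0 + B, (gy + 1) * B) - std::max(y0, gy * B);
                    if (ox > 0 && oy > 0) overlap[gy * blocksX + gx] += ox * oy;
                }
        }
    alphaAdj.assign(blocksX * blocksY, alpha);
    for (std::size_t i = 0; i < overlap.size(); ++i)
        alphaAdj[i] = alpha * std::min(1.0f, float(overlap[i]) / float(B * B));
}
```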
To allow correct interpolation even in blocks which contain object and background both, we need to use at least two motion vector candidates (one for object and one for background). Further, the fixedαDBW,N+2 motion field is used to obtain these two motion vector candidates in each block of the interpolated frame FN+1+α. For an interpolated block with coordinates (x, y), we compose a collection of motion vectors by picking motion vectors from the collocated block of the motion field fixedαDBW, N+2 and from neighbouring blocks of the collocated block (see Fig. 15.7). The quantity and positions of neighbouring blocks depend on the magnitude of the motion vector in the collocated block. We apply k-means-based clustering to choose only two motion vectors from the collection (centres of found clusters). It is possible to use more than two motion vector candidates in the MCI stage, and k-means-based clustering does not restrict getting more candidates. Denote the motion vector field which has K motion vector candidates in each block as CD[K]α,N+2. This is the result of clustering the fixedαDBW,N+2 motion field. The algorithm of clustering motion vectors has been described in the patent by Lertrattanapanich and Kim (2014).
Fig. 15.6 Example of obtaining fixedαDBW,N+2, αadj and COVER/UNCOVER marking. Firstly, cover occlusions are fixed (top part). Secondly, uncover occlusions are processed (bottom part)
Our main work in the OP stage was mostly to adapt and optimise the implementation for ARM-based SoCs. The complexity of this stage is not as high as that of the ME and MCI stages, because the input of the OP stage consists of block-wise motion fields and all operations can be performed in a block-wise manner. Nevertheless, a fully fixed-point implementation with NEON SIMD parallelisation was needed. In the description of the OP stage, we focused only on the main ideas. In fact, slightly more sophisticated methods have to be used to obtain CoverMap, UncoverMap, αadj, and fixedαDBW,N+2; they are needed because of rather noisy input motion fields or simply complex motion in a scene. On the other hand, additional details would make the description even more difficult to follow.
Fig. 15.7 1D visualisation of (a) clustering and (b) applying motion vector candidates
15.5 MCI Stage: Motion-Compensated Frame Interpolation
Motion compensation is the final stage of the FRC algorithm that directly interpolates pixel information. As already mentioned, we used interpolation based on earlier patented work by Chappalli and Kim (2012). The algorithm samples the two nearest frames at positions determined by one or two motion vector candidates. Motion vector candidates come from the motion clustering algorithm mentioned above. It is possible that only one motion vector candidate is found for some block, because the collection for the block may contain identical motion vectors. Because all our previous sub-algorithms working with motion vectors were block-based, the MCI stage naturally uses motion hypotheses that are constant over an interpolated block. Firstly, we form two predictors for the interpolated block with coordinates (x, y) by using each motion vector candidate cd[i] = (dx[i](x, y), dy[i](x, y)), i = 1, 2, from CD[K]α,N+2:
Pred[i](x, y) = FN+1(x + α·dx[i](x, y), y + α·dy[i](x, y)) · (1 − αadj(x, y)) + FN+2(x − (1 − α)·dx[i](x, y), y − (1 − α)·dy[i](x, y)) · αadj(x, y),

where we use bilinear interpolation to obtain the pixel-wise version of αadj(x, y), which was naturally block-wise in the OP stage. Although we use the backward motion vector candidates cd[i], which point to frame FN+1 from the collocated block at position (x, y) in frame FN+2 (see Fig. 15.7a), for calculating the predictors we use them as if the motion vector candidate started at position (x − (1 − α)·dx[i](x, y), y − (1 − α)·dy[i](x, y)) of frame FN+2 (see Fig. 15.7b). In this way, those motion vector candidates pass through the interpolated frame FN+1+α exactly at position (x, y). This simplifies both the clustering process and the MCI stage, because we do not need to keep any "projected" motion vectors, and each interpolated block can be processed independently during the MCI stage. Although we do not use fractional-pixel motion vectors in the ME and OP stages, they may appear in the MCI stage, for example, due to the operation (1 − α)·dx[i](x, y). In this case, we use bicubic interpolation, which gives a good balance between speed and quality.

The interesting question is how to mix a few predictors Pred[i](x, y). Basically, we obtain a block of the interpolated frame FN+1+α as a weighted sum of a few predictors:

FN+1+α(x, y) = Σ(i=1..p) w[i](x, y) · Pred[i](x, y) / Σ(i=1..p) w[i](x, y),

where p is the number of motion vector candidates used in the block (two in our solution, so in the worst case block patches at four positions from two keyframes have to be sampled), and w[i](x, y) are pixel-wise mixing weights, or reliabilities of the predictors. As stated in Chappalli and Kim (2012), we can calculate w[i](x, y) as a function of the difference between the patches that were picked out from the keyframes when the predictors were calculated:

w[i](x, y) = f(err[i](x, y)),
err[i](x, y) = Σ(k=x−l..x+l) Σ(m=y−u..y+u) |FN+1(k + α·dx[i](x, y), m + α·dy[i](x, y)) − FN+2(k − (1 − α)·dx[i](x, y), m − (1 − α)·dy[i](x, y))|,

where l and u are small values from the range [0...5]. f(e) must be inversely proportional to its argument, for example:

w[i](x, y) = exp(−err[i](x, y)).
However, in occlusions, a correct (background) motion vector connects the background in one keyframe to the foreground in the other keyframe, so this method of calculating the weights leads to visual artefacts. In Chappalli and Kim (2012), an updated formula for calculating w[i] was proposed:

w[i](x, y) = f(err[i](x, y), |cd[i](x, y) − vref(x, y)|) = ω · exp(−err[i](x, y)) + (1 − ω) · exp(−|cd[i](x, y) − vref(x, y)|),

where vref is a background motion vector. For blocks of the interpolated frame that were marked as COVER or UNCOVER, vref is the motion vector from the collocated block of fixedαDBW,N+2 and ω = 0; for other, normal blocks, ω = 1.

The MCI stage has high computational complexity, comparable to that of the ME stage, because both algorithms require reading a few (extended) blocks from the original frames per processed block. In contrast with ME, however, each block of the interpolated frame can be processed independently during MCI. This gives us the opportunity to implement MCI on the GPU using OpenCL technology.
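As a minimal illustration of the mixing rule above, the sketch below combines two already-sampled predictors for one pixel. The function name and the way err and |cd − vref| are passed in are assumptions of ours, and the sampling of the keyframes (bilinear/bicubic) is omitted.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Mix up to two motion-compensated predictors for one pixel of the interpolated frame.
// `pred` holds the predictor values already sampled from the keyframes, `err` the local
// matching errors between the corresponding patches, `cdDist` the distance of each
// candidate to the background vector v_ref, and `omega` is 1 for normal blocks and
// 0 for COVER/UNCOVER blocks.
uint8_t mixPredictors(const float pred[2], const float err[2],
                      const float cdDist[2], int numCandidates, float omega) {
    float num = 0.0f, den = 0.0f;
    for (int i = 0; i < numCandidates; ++i) {
        float w = omega * std::exp(-err[i]) + (1.0f - omega) * std::exp(-cdDist[i]);
        num += w * pred[i];
        den += w;
    }
    float v = den > 0.0f ? num / den : pred[0];   // fall back to the first predictor
    return (uint8_t)std::min(255.0f, std::max(0.0f, v + 0.5f));
}
```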
15.6 Problematic Situations and Fallback
There are many situations in which the FRC problem is incredibly challenging and sometimes just plainly unsolvable. An important part of our FRC algorithm development is detecting and correcting such corner cases. It is worth noting that detection and correction are different tasks. When a corner case is detected, two possibilities remain: the first is to fix the problem, and the second is to skip interpolation (fallback). Fallback, in turn, can be done for the entire frame (global fallback) or locally, only in those areas where problems arose (local fallback in Park et al. 2012). We apply global fallback only: the nearest keyframe is placed instead of the interpolated frame. The following situations are handled in our solution:

1. Periodic textures. The presence of periodic textures in video frames adversely affects ME reliability. Motion vectors can easily step over the texture period, which causes severe interpolation artefacts. We have a detector of periodic textures as well as a corrector for smoothing motion. The detector recognises periodic areas in each keyframe; the corrector makes the motion fields smooth in the detected areas after each 3DRS iteration. Regularisation is not applied in the detected areas. Fallback is not applied.
2. Strong 1D-only features. Subpixel motion of long flat edges also confuses ME because of different aliasing artefacts in neighbouring keyframes, especially for a fast and simple ME with one-pixel accuracy developed for real-time operation. The detector and corrector work in the same manner as for periodic textures (detection of 1D features in keyframes and smoothing of the motion field after each 3DRS iteration, no regularisation in detected areas), with no fallback.
3. Change of scene brightness (fade in/out). The ME match metric (MAD) and the mixing weights w in MCI depend on the absolute brightness of the image. We detect global changes between neighbouring keyframes with a linear model based on histograms and correct one of the keyframes for the ME stage only. For relatively small global changes, such an adjustment works well. A global fallback strategy is applied if the calculated parameters of the model are too large or the model is invalid.
4. Inconsistent motion/occlusions. When a foreground object is nonlinearly deformed (a flying bird or quick finger movements), it is impossible to restore the object correctly. The result of interpolation looks like a mess of blocks, which greatly spoils the subjective quality. We analyse CoverMap, UncoverMap, and the motion fields and obtain a reliability value for each block of the interpolated frame. If the gathered reliability falls below some threshold for a set of neighbouring interpolated blocks, we apply global fallback (see the sketch below). A local fallback strategy could also be applied by using the reliabilities as in Park et al. (2012), but strong visual artefacts are still quite possible, although the PSNR metric would be higher for local fallback.

The detectors use mostly pixel-wise operations on preliminarily downscaled keyframes. All detectors either fit SIMD operations perfectly or require little computation. In our solution, the detectors work in a separate stage called the preprocessing stage, except for inconsistent motion/occlusion detection, which is performed during the OP stage. The correctors are implemented in the ME stage.
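A possible form of the global fallback test for situation 4 is sketched below; the 3 × 3 neighbourhood rule and the two thresholds are illustrative assumptions, since the text does not fix the exact aggregation.

```cpp
#include <vector>

// Global fallback decision: look for a cluster of interpolated blocks whose
// reliability (produced by the OP-stage analysis) falls below a threshold.
// If enough low-reliability blocks are found in any 3x3 neighbourhood, the whole
// interpolated frame is replaced by the nearest keyframe.
bool useGlobalFallback(const std::vector<float>& reliability,
                       int blocksX, int blocksY,
                       float minReliability, int minClusterSize) {
    for (int by = 1; by + 1 < blocksY; ++by)
        for (int bx = 1; bx + 1 < blocksX; ++bx) {
            int bad = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    if (reliability[(by + dy) * blocksX + (bx + dx)] < minReliability)
                        ++bad;
            if (bad >= minClusterSize) return true;   // repeat keyframe instead of interpolating
        }
    return false;
}
```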
15.7 FRC Pipeline: Putting Stages Together
The target devices (Samsung Galaxy S8/S9/S20) use ARM 64-bit 8-core SoCs with a big.LITTLE scheme and the following execution units: four power-efficient processor cores (little CPUs), four high-performance processor cores (big CPUs), and a powerful GPU. Stages of the FRC algorithm should run on different execution units simultaneously to achieve the highest number of frame interpolations per second (IPS). This leads to the idea of organising a pipeline of calculations in which the stages of the FRC algorithm run in parallel; the stages are connected by buffers to hide random delays. The FRC pipeline for 4× up-conversion is shown in Fig. 15.8. As shown in Fig. 15.8, the following processing steps are performed simultaneously:

• Preprocessing of keyframe FN+6
• Estimation of motion fields DFW,N+4 and DBW,N+5 using FN+4, FN+5 and the corresponding preprocessed data (periodic and 1D areas)
• Occlusion processing for interpolated frame FN+2+2/4, where the inputs are DBW,N+2, DFW,N+2, DBW,N+3, and DFW,N+3
• Motion-compensated interpolation of frame FN+2+1/4, where the input is fixed1/4DBW,N+2 for phase α = 1/4
Fig. 15.8 FRC pipeline for 4× up-conversion – timeline of computations
Note that the OP and MCI stages can be performed for multiple interpolated frames between a pair of consecutive keyframes, and each time at least a part of the calculations is different. The preprocessing and ME stages, in contrast, are performed only once for a pair of consecutive keyframes and do not depend on the number of interpolated frames. Our main use case is actually doubling the frame rate. We optimised each stage and assigned the execution units (shown in Fig. 15.8) so that the stage durations became close for a "quite difficult" scene in the frame-rate doubling case.
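The timeline of Fig. 15.8 can be summarised by the small helper below, which prints what each stage works on at a given pipeline step for 4× up-conversion. The function is purely illustrative, and the buffering offsets mirror the figure rather than exact production values.

```cpp
#include <cstdio>

// Print the work assigned to each FRC stage at pipeline step N (4x up-conversion):
// while ME runs on keyframe pair (N+4, N+5), OP prepares phase 2/4 between N+2 and N+3,
// and MCI renders phase 1/4 on the GPU.
void printPipelineStep(int n) {
    std::printf("Preprocess (little CPUs): keyframe F%d\n", n + 6);
    std::printf("ME (big CPUs)           : D_FW,%d and D_BW,%d from (F%d, F%d)\n",
                n + 4, n + 5, n + 4, n + 5);
    std::printf("OP (big CPUs)           : occlusion data for F%d+2/4 "
                "(inputs D_BW,%d, D_FW,%d, D_BW,%d, D_FW,%d)\n",
                n + 2, n + 2, n + 2, n + 3, n + 3);
    std::printf("MCI (GPU)               : interpolated frame F%d+1/4\n", n + 2);
}
```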
15.8 Results
We were able to develop an FRC algorithm of commercial-level quality that works on battery-powered mobile devices. The algorithm uses only standard SoC modules (CPU + GPU), which makes upgrades and fixes quite easy. The algorithm has been integrated into two modes of the Samsung Galaxy camera:

• Super Slow Motion (SSM) – offline 2× conversion of HD (720p) video from 480 to 960 FPS; target performance: >80 IPS
• Motion Photo – real-time 2× conversion during playback of a 3-second FHD (1080p) video clip stored within a JPEG file; target performance: >15 IPS
Table 15.4 Performance of the FRC solution on target mobile devices in various use cases, in interpolations per second (IPS)

Device      Super Slow Motion (IPS)   Motion Photo (FHD) (IPS)
Galaxy S8   104                       81
Galaxy S9   116                       94
Table 15.5 Performance of the FRC solution on a Samsung Galaxy Note 10

             Traffic    Jockey     Kimono     Tennis
Prep. time   5.81 ms    6.71 ms    6.22 ms    7.72 ms
ME time      4.80 ms    5.57 ms    7.39 ms    12.40 ms
OP time      4.11 ms    4.66 ms    4.73 ms    8.31 ms
MCI time     2.17 ms    2.75 ms    3.04 ms    3.64 ms
yPSNR        37.8 dB    35.3 dB    34.4 dB    27.9 dB
IPS          172        149        135        80

IPS interpolated frames per second (determined here by the longest stage), yPSNR average PSNR of the luma component of a video sequence
The average conversion speed for HD and FHD videos can be seen in Table 15.4. The difference is small because, in the FHD case, the ME stage and part of the OP stage were performed at reduced resolution. Detailed measurements of the various FRC stages for 2× up-conversion are shown in Table 15.5. Here, a Samsung Galaxy Note 10 based on the Snapdragon 855 SM8150 was used. The ME and OP stages were performed on two big cores each, the preprocessing stage (Prep.) used two little cores, and the MCI stage was performed by the GPU. All fallbacks were disabled, so all frames were interpolated fully.

The quality and speed of the described FRC algorithm were checked on various video streams, a detailed description of which can be found in Correa et al. (2016). We used only the first 100 frames for time measurements and all frames for quality measurements. All video sequences were downsized to HD resolution (the Super Slow Motion use case). To get ground truth, we halved the frame rate of the video streams and then up-converted them back to the initial frame rate with the help of the FRC. It can be seen that with increasing magnitude and complexity of the motion in a scene, the computational cost of the algorithm grows and the quality of the interpolation decreases. The ME stage is the most variable and requires the most computation. For scenes with moderate motion, the algorithm shows satisfactory quality and attractive speed.

Actually, even in video with fast motion, the quality can be good in most places. In the middle part of Fig. 15.9, an interpolated frame from the "Jockey" video is depicted. The displacement of the background between the depicted keyframes is near 50 pixels (see Fig. 15.10 to better understand the position of occlusion areas and the movement of objects). An attentive reader can see quite a few
Fig. 15.9 Interpolation quality: keyframe #200 (top), interpolated frame #201 with a PSNR of 35.84 dB for the luma component (middle), and keyframe #202 (bottom)
Fig. 15.10 Visualisation of movement. Identical features of fast-moving background are connected by green lines in keyframes. Identical features of almost static foreground are connected by red lines in keyframes
defects in the interpolated frame. The most unpleasant are those which appear regularly on a foreground object (look at the horse’s ears). This is perceived as an unnatural flickering of the foreground object. Artefacts, which regularly arise in occlusions, are perceived as a halo around a moving foreground object. Artefacts, which appear only in individual frames (not regularly), practically do not spoil the subjective quality of the video.
References

Al-Kadi, G., Hoogerbrugge, J., Guntur, S., Terechko, A., Duranton, M., Eerenberg, O.: Meandering based parallel 3DRS algorithm for the multicore era. In: Proceedings of the IEEE International Conference on Consumer Electronics (2010). https://doi.org/10.1109/ICCE.2010.5418693
Bellers, E.B., van Gurp, J.W., Janssen, J.G.W.M., Braspenning, R., Wittebrood, R.: Solving occlusion in frame-rate up-conversion. In: Digest of Technical Papers International Conference on Consumer Electronics, pp. 1–2 (2007)
Chappalli, M.B., Kim, Y.-T.: System and method for motion compensation using a set of candidate motion vectors obtained from digital video. US Patent 8,175,163 (2012)
Chi, C., Alvarez-Mesa, M., Juurlink, B.: Parallel scalability and efficiency of HEVC parallelization approaches. IEEE Trans. Circuits Syst. Video Technol. 22(12), 1827–1838 (2012)
Cordes, C.N., de Haan, G.: Invited paper: key requirements for high quality picture-rate conversion. Dig. Tech. Pap. 40(1), 850–853 (2009)
Correa, G., Assuncao, P., Agostini, L., da Silva Cruz, L.A.: Appendix A: Common test conditions and video sequences. In: Complexity-Aware High Efficiency Video Coding, pp. 125–158. Springer International Publishing, Cham (2016)
de Haan, G., Biezen, P., Huijgen, H., Ojo, O.A.: True-motion estimation with 3-D recursive search block matching. IEEE Trans. Circuits Syst. Video Technol. 3(5), 368–379 (1993)
Fluegel, S., Klussmann, H., Pirsch, P., Schulz, M., Cisse, M., Gehrke, W.: A highly parallel sub-pel accurate motion estimator for H.264. In: Proceedings of the IEEE 8th Workshop on Multimedia Signal Processing, pp. 387–390 (2006)
Lertrattanapanich, S., Kim, Y.-T.: System and method for motion vector collection based on K-means clustering for motion compensated interpolation of digital video. US Patent 8,861,603 (2014)
Park, S.-H., Ahn, T.-G., Park, S.-H., Kim, J.-H.: Advanced local fallback processing for motion-compensated frame rate up-conversion. In: Proceedings of 2012 IEEE International Conference on Consumer Electronics (ICCE), pp. 467–468 (2012)
Pohl, P., Anisimovsky, V., Kovliga, I., Gruzdev, A., Arzumanyan, R.: Real-time 3DRS motion estimation for frame-rate conversion. Electron. Imaging. (13), 1–5 (2018). https://doi.org/10.2352/ISSN.2470-1173.2018.13.IPAS-328
Chapter 16
Approaches and Methods to Iris Recognition for Mobile

Alexey M. Fartukov, Gleb A. Odinokikh, and Vitaly S. Gnatyuk
16.1 Introduction
The modern smartphone is not a simple phone but a device which has access to, or contains, a huge amount of personal information. Most smartphones are able to perform payment operations via services such as Samsung Pay, Apple Pay, Google Pay, etc. Thus, phone unlock protection and authentication for payment and for access to secure folders and files are required. Among all approaches to authenticating users of mobile devices, the most suitable are knowledge-based and biometric methods. Knowledge-based methods are based on asking for something the user knows (a PIN, password, or pattern). Biometric methods refer to the use of distinctive anatomical and behavioral characteristics (fingerprints, face, iris, voice, etc.) for automatically recognizing a person. Today, hundreds of millions of smartphone users around the world praise the convenience and security provided by biometrics (Das et al. 2018). The first commercially successful biometric authentication technology for mobile devices is fingerprint recognition. Although fingerprint-based authentication shows high distinctiveness, it still has drawbacks (Daugman 2006). Among all the biometric traits, the iris has several important advantages in comparison with the others (Corcoran et al. 2014). The iris image capturing procedure is contactless, and iris recognition can be considered a more secure and convenient authentication method, especially for mobile devices.
This chapter is dedicated to the iris recognition solution for mobile devices. Section 16.2 describes the iris as a recognition object, gives a brief review of the conventional iris recognition solution, and formulates the main challenges and requirements for iris recognition on mobile devices. In Sect. 16.3, the proposed iris recognition solution for mobile devices is presented; special attention is paid to interaction with the user and the capturing system, and the iris camera control algorithm is described. Section 16.4 contains a brief description of the developed iris feature extraction and matching algorithm, along with testing results and a comparison with state-of-the-art iris feature extraction and matching algorithms. In Sect. 16.5, the limitations of iris recognition are discussed, and several approaches that allow these limitations to be relaxed are described.
16.2 Person Recognition by Iris
The iris is a highly protected, internal organ of the eye, which allows contactless capturing (Fig. 16.1). The iris is unique for every person, even for twins. The uniqueness of the iris is in its texture pattern, which is determined by melanocytes (pigment cells) and circular and radial smooth muscle fibers (Tortora and Nielsen 2010). Although the iris is stable over a lifetime, it is not a constant object due to permanent changes in pupil size. The iris consists of muscle tissue that comprises a sphincter muscle that causes the pupil to contract and a group of dilator muscles that cause the pupil to dilate (Fig. 16.2). Pupil size variation is one of the main sources of intra-class variation, which should be considered in the development of an iris recognition algorithm (Daugman 2004). The iris is a highly informative biometric trait (Daugman 1993). That is why iris recognition provides high recognition accuracy and reliability. A conventional iris recognition system includes the following main steps (Fig. 16.3):

• Iris image acquisition
• Quality checking of the obtained iris image
• Iris area segmentation
• Feature extraction
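As a rough illustration of this staged organization (not the implementation used in this chapter), the pipeline can be sketched as a chain of stages with early exits; all stage functions below are hypothetical placeholders.

```python
# Illustrative sketch only: the conventional iris recognition pipeline as a chain of
# stages with early exits. Every stage function is a hypothetical placeholder.

def recognize_iris(frame, acquire, check_quality, segment, extract_features):
    """Run one camera frame through the pipeline; return a feature vector or None."""
    eye_image = acquire(frame)                # iris image acquisition
    if eye_image is None or not check_quality(eye_image):
        return None                           # reject frames unsuitable for processing
    iris_region = segment(eye_image)          # iris area segmentation
    if iris_region is None:
        return None
    return extract_features(iris_region)      # feature extraction (template or probe)
```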
Fig. 16.1 View of the human eye
Fig. 16.2 Responses of the pupil to light of varying brightness: (a) bright light; (b) normal light; (c) dim light. (Reproduced with permission from Tortora and Nielsen (2010)) Fig. 16.3 Simplified scheme of iris recognition
During registration of a new user (enrollment), an extracted iris feature vector is stored in the system database. The iris feature vector or set of iris feature vectors resulting from enrollment is called the template. In case of verification (one-to-one user comparison) or identification (one-to-many user comparison), the extracted iris feature vector (also known as the probe) is compared with template(s) stored in the database. Comparison of the probe and the enrolled template is named matching.
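As a hedged illustration of the matching step (the classical comparison of binary iris codes by a normalized Hamming distance, not necessarily the matcher used later in this chapter), verification can be sketched as follows; the code length and the decision threshold are illustrative.

```python
import numpy as np

def hamming_distance(code_a, mask_a, code_b, mask_b):
    """Normalized Hamming distance between two binary iris codes, counted only over
    bits that are valid (non-occluded) in both masks."""
    valid = mask_a & mask_b
    n_valid = np.count_nonzero(valid)
    if n_valid == 0:
        return 1.0                            # nothing comparable: maximal dissimilarity
    disagreeing = np.count_nonzero((code_a ^ code_b) & valid)
    return disagreeing / n_valid

# Verification example: compare a probe against an enrolled template.
rng = np.random.default_rng(0)
template = rng.integers(0, 2, size=2048, dtype=np.uint8)
probe = template.copy()
probe[:120] ^= 1                              # simulate a slightly noisy re-capture
mask = np.ones(2048, dtype=np.uint8)
THRESHOLD = 0.32                              # illustrative decision threshold
is_match = hamming_distance(template, mask, probe, mask) < THRESHOLD
```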
Iris image acquisition is performed using a high-resolution near-infrared (NIR) or visible-spectrum (VIS) camera (Prabhakar et al. 2011). The NIR camera is equipped with an active illuminator. The wavelength of the NIR light used for illuminating the iris should be between 700 and 900 nm, which is better than visible light for acquiring the texture of dark irises. It should also be noted that capturing in NIR light allows (to some extent) avoiding reflections and glares which mask the iris texture. For those reasons, only iris images captured in the NIR spectrum are considered in this chapter as input data.

If the iris image is successfully acquired, then its quality is determined in terms of suitability for subsequent extraction of the feature vector. Iris quality checking can be distributed across several stages of the recognition algorithm. Iris area segmentation separates the iris texture area from the background, eyelids and eyelashes, and glares, which mask the iris texture. A comprehensive review of iris segmentation methods can be found in Rathgeb et al. (2012). In particular, an iris segmentation algorithm based on a lightweight convolutional neural network (CNN) is proposed in Korobkin et al. (2018). After that, the obtained iris area is used for feature extraction. This stage consists of iris area normalization and construction of the feature vector. At normalization, the iris image is remapped from the initial Cartesian coordinates to a dimensionless non-concentric polar coordinate system (Daugman 2004). This compensates for the variability of the iris optical size in the input image and corrects the elastic deformation of the iris when the pupil changes in size. The normalized image is used for extraction of iris features.

Because the iris is a visible part of the human eye, it is not a secret. Hence, an iris recognition system is vulnerable to the presentation of synthetically produced irises to the sensor. Prevention of direct attacks on the sensor by discriminating real and fake irises is called presentation attack detection or liveness detection. Consideration of iris liveness detection is out of the scope of this chapter. An introduction to iris liveness detection can be found in Sun and Tan (2014) and Galbally and Gomez-Barrero (2016). It should be noted that the abovementioned quality checking stage provides (to some extent) protection against presentation attacks.

Iris recognition systems were implemented and successfully deployed for border control in the United Arab Emirates and in several European and British airports (Daugman and Malhas 2004). In such systems, iris acquisition is usually performed in controlled environment conditions by cameras which are capable of capturing high-quality iris images. Minimal requirements for the iris image capturing process are summarized in ISO/IEC 19794-6:2011. In the case of mobile devices, camera compactness and cost become even more essential, and thus not all of the mentioned requirements imposed on the camera can be satisfied. Development and implementation of the iris acquisition camera for mobile devices is also out of the scope of this chapter. Most of the issues related to the iris capturing device are covered in Corcoran et al. (2014) and Prabhakar et al. (2011). The requirements for an iris recognition solution for mobile devices include the ability to operate under constantly changing environmental conditions and flexibility with respect to a wide range of user interaction scenarios. Mobile iris recognition
Fig. 16.4 Examples of iris images captured with a mobile device
should handle input iris images captured under ambient illumination that varies over a range from 10^-4 Lux at night to 10^5 Lux under direct sunlight. The changing capturing environment also assumes randomness in the locations of the light sources, along with their unique characteristics, which creates a random distribution of the illuminance in the iris area. The mentioned factors can cause a deformation of the iris texture due to a change in the pupil size, make users squint, and degrade the overall image quality (Fig. 16.4). Moreover, different factors related to the interaction with a mobile device and the user should be considered:

• The user could wear glasses or contact lenses.
• The user could try to perform an authentication attempt while walking or simply suffer from a hand tremor, thereby causing the device to shake.
• The user can hold the device too far from or too close to them, so that the iris falls out of the camera depth of field.
• There could be occlusion of the iris area by eyelids and eyelashes if the user's eye is not opened wide enough.

All these and many other factors affect the quality of the input iris images, thus influencing the accuracy of the recognition (Tabassi 2011). Mobile iris recognition is intended for daily use. Thus, it requires easy user interaction and a high recognition speed, which is determined by the computational complexity. There is a trade-off between computational complexity and power consumption: recognition should be performed at the best camera frame rate and should not consume much power at the same time. Recognition should be performed in a special secure (trusted) execution environment, which provides limited computational resources – a restricted number of available processor cores and computational hardware accelerators, reduced frequencies of processor core(s), and a limited amount of memory (ARM Security Technology 2009). These facts should be taken into account in early stages of biometric algorithm development.
All the mentioned requirements lead to the necessity of developing an iris recognition solution capable of providing high recognition performance on mobile devices. There are several commercial mobile iris recognition solutions known to date. The first smartphones enabled with the technology were introduced by Fujitsu in 2015. The solution for this smartphone was developed by Delta ID Inc. (Fujitsu Limited 2015). It should be noted, in particular, that all Samsung flagship devices were equipped with iris recognition technology during 2016–2018 (Samsung Electronics 2018). Recently, the application of mobile iris recognition technology has been shifting from the mass market to B2B and B2G areas. Several B2G applications of the technology are also known in the market, such as Samsung Tab Iris (Samsung Electronics 2016) and IrisGuard EyePay Phone (IrisGuard UK Ltd. 2019).
16.3 Iris Recognition for Mobile Devices
A biometric recognition algorithm should not be developed as an isolated component. The characteristics of the target platform and efficient ways of interacting with the environment in which the recognition system operates should be taken into account. In the case of a mobile device, it is possible to interact with the user in order to obtain iris images suitable for recognition. For instance, if the eyes are not opened wide enough or the user's face is too far from the device, then the recognition algorithm should provide immediate feedback saying "open eyes wider" or "move your face closer to the device," respectively. To this end, the following algorithm structure is proposed (Fig. 16.5). The idea of the proposed structure is to start each next stage of the recognition algorithm only after a successful pass of the corresponding quality assessment. Each
Fig. 16.5 Iris recognition algorithm structure
Fig. 16.6 Interaction of mobile iris recognition algorithm with environment
quality assessment measure is performed immediately after the information for its evaluation becomes available. This allows us not to waste computational resources (i.e., energy consumption) on processing data which are not suitable for further processing and to provide feedback to the user as early as possible. It should be noted that the structure of the algorithm depicted in Fig. 16.5 is a modification of the algorithm proposed by Odinokikh et al. (2018). The special quality buffer was replaced with the straightforward structure shown in Fig. 16.5. All the other parts of the algorithm (except the feature extraction and matching stages) and quality assessment checks were used with no modifications. Besides interacting with the user, the mobile recognition system can communicate with the iris capturing hardware and additional sensors such as an illuminometer, rangefinder, etc. The obtained information can be used to control parameters of the iris capturing hardware and to adapt the algorithm to constantly changing environmental conditions. The scheme summarizing the described approach is presented in Fig. 16.6. Details can be found in Odinokikh et al. (2019a). Along with the possibility to control the iris capturing hardware by the recognition algorithm itself, a separate algorithm for controlling iris camera parameters can be applied. The purpose of such an algorithm is to provide fast correction of the sensor's shutter speed (also known as exposure time), gain, and/or parameters of the active illuminator to obtain iris images suitable for recognition. We propose a two-stage algorithm for automatic camera parameter adjustment that offers fast exposure adjustment on the basis of a single shot with further iterative camera parameter refinement. Many of the existing automatic exposure algorithms have been developed to obtain optimal image quality in difficult environmental conditions. In the case of the most complicated scenes, a significant number of these algorithms have some drawbacks: poor exposure estimation, complexity of region of interest detection, and
high computational complexity, which may limit their applicability to mobile devices (Gnatyuk et al. 2019).

In the first stage of the proposed algorithm, a global exposure is adjusted based on a single shot in order to start the recognition process as fast as possible. The fast global exposure adjustment problem is formulated as a dependency between the exposure time and the mean sample value (MSV), which corresponds to the captured image brightness. MSV is determined as follows (Nourani-Vatani and Roberts 2007):

MSV = \frac{\sum_{i=0}^{4} (i+1)\, h_i}{\sum_{i=0}^{4} h_i},
where H = {h_i | i = 0...4} is a 5-bin pixel brightness histogram of the captured image. In contrast to the mean brightness, MSV is less sensitive to high peaks in the image histogram, which makes it useful when the scene contains several objects with almost constant brightness.

It is known that photodiodes have an almost linear transmission characteristic in the photoconductive mode. However, due to differences in the amplification factors of individual photodiodes and the presence of noise in real CMOS sensors, the exposure-to-brightness function is better approximated by a sigmoid than by a piecewise linear function (Fig. 16.7). With knowledge of the sigmoid coefficients and a target image brightness level, we can determine sensor parameters which lead to a correctly exposed captured image. Therefore, the dependency between the exposure time and MSV can be expressed as:

\mu = \frac{1}{1 + e^{-cp}},
Fig. 16.7 The experimental dependency between an exposure time and mean sample value
Fig. 16.8 The visualization of the global exposure adjustment
where μ = 0.25·(MSV − 1) is a normalized MSV value, p ∈ [−1; 1] is a normalized exposure time, and c = 6 is an algorithm parameter which controls the sigmoid slope and may be adjusted for a particular sensor (Fig. 16.8). Solving the above equation for p gives a trivial result:

p = -\frac{1}{c} \ln\left(\frac{1}{\mu} - 1\right).

The optimal normalized exposure time p* can be obtained with the value μ*, which empirically determines the optimal MSV value that allows the successful pass of the quality assessment checks described below in this section. Since the exposure time varies in the (0; E_max] interval, the suboptimal exposure time E* is obtained as:

E^{*} = \frac{E_0 (p^{*} + 1)}{p_0 + 1},
where E_0 is the exposure time of the captured scene, and p* and p_0 are the normalized exposure times calculated with the optimal MSV μ* and the captured scene MSV μ_0, respectively. If the MSV lies outside the confidence zone μ_0 ∈ [μ_min; μ_max], we should first make a "blind" guess at a correct exposure time. This is done by subtracting or adding the predefined coefficient E_δ to the current exposure E_0 several times until the MSV becomes less than μ_max or higher than μ_min. The μ_min and μ_max values are determined based on the experimental dependency between the exposure time and the normalized MSV. If there is no room for further exposure adjustment, we stop and do not adjust the scene anymore, because the sensor is probably blinded or covered with something. Once the initial exposure time guess E* is found, we try to further refine the capture parameters.
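The first stage can be condensed into a short numerical sketch; it assumes the sigmoid model above with c = 6, while the maximum exposure time, the target MSV, and the example frame are illustrative assumptions rather than values from the original system.

```python
import numpy as np

C = 6.0        # sigmoid slope parameter from the text
E_MAX = 33.0   # illustrative maximum exposure time in ms (assumption)

def mean_sample_value(image):
    """MSV of an 8-bit image computed from a 5-bin brightness histogram."""
    hist, _ = np.histogram(image, bins=5, range=(0, 256))
    return float(np.sum((np.arange(5) + 1) * hist) / np.sum(hist))

def normalized_exposure(msv):
    """Invert the brightness model: p = -(1/c) * ln(1/mu - 1) with mu = 0.25*(MSV - 1)."""
    mu = np.clip(0.25 * (msv - 1.0), 1e-3, 1.0 - 1e-3)   # keep the logarithm finite
    return -np.log(1.0 / mu - 1.0) / C

def suboptimal_exposure(e0, msv0, msv_target):
    """Single-shot exposure guess E* = E0 * (p* + 1) / (p0 + 1)."""
    p_opt, p0 = normalized_exposure(msv_target), normalized_exposure(msv0)
    return float(np.clip(e0 * (p_opt + 1.0) / (p0 + 1.0), 1e-3, E_MAX))

# Example: an underexposed capture taken at 5 ms; the target MSV is chosen empirically.
dim_frame = np.full((480, 640), 60, dtype=np.uint8)
e_star = suboptimal_exposure(5.0, mean_sample_value(dim_frame), msv_target=2.8)
```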
Table 16.1 The iris image quality difference between competing auto-exposure approaches (the original table shows example full-frame and eye-region images)

Approach                      Full frame     Eye region
Global exposure adjustment    perfect        underexposed
Proposed                      overexposed    perfect
The key idea of the second stage is the construction of a mask to precisely adjust camera parameters to the face region brightness in order to obtain the most accurate and fast iris recognition. In the case of the recognition task, it is important to pay more attention to the eye regions, and it is not enough to provide the optimal full-frame visual quality delivered by well-known global exposure adjustment algorithms (Battiato et al. 2009). Table 16.1 illustrates the mentioned drawback.

In order to obtain a face mask, a database of indoor and outdoor image sequences for 16 users was collected in the following manner. Every user tries to pass the enrollment procedure on a mobile device in normal conditions, and the corresponding image sequence is collected. Such a sequence is used for enrollment template creation. Next, the user tries to verify himself, and the corresponding image sequence (probe sequence) is also collected. All frames from the probe sequences are used for probe (probe template) creation. After that, dissimilarity scores (Hamming distances) between the user's enrolled template and each probe are calculated (Odinokikh et al. 2017). In other words, only genuine comparisons are performed. Each score is compared with the predefined threshold HD_thresh, and the vector of labels Y ≔ {y_i}, i = 1...N_scores is created. N_scores is the number of verification attempts. The vector of labels represents the dissimilarity between probes and enrollment templates:

y_i = \begin{cases} 0, & HD_i > HD_{thresh}, \\ 1, & HD_i < HD_{thresh}. \end{cases}

Each label y_i of Y shows whether the person was successfully recognized at frame i.
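A hedged sketch of this mask-training step is given below (the feature matrix X it uses is formalized in the following paragraphs); the frames are assumed to be already downscaled, the threshold value is illustrative, and scikit-learn's logistic regression stands in for whatever solver was actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

HD_THRESH = 0.35   # illustrative decision threshold on the Hamming distance

def pixel_significance_mask(downscaled_frames, hamming_distances):
    """Regress 'verification succeeded' on downscaled probe frames and return the
    per-pixel logistic regression coefficients as a significance mask."""
    y = (np.asarray(hamming_distances) < HD_THRESH).astype(int)   # label vector Y
    x = np.stack([frame.ravel() for frame in downscaled_frames])  # feature matrix X
    model = LogisticRegression(max_iter=1000).fit(x, y)
    return model.coef_.reshape(downscaled_frames[0].shape)
```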
Fig. 16.9 Calculated weighted mask to represent each image pixel significance for the recognition: bright blobs correspond to eye position
After the calculation of the vector Y, we downscale and reshape each frame of the probe sequences to get the feature vector x_i and construct the matrix of feature vectors X:

X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{N_{scores}} \end{pmatrix}.

Using the feature matrix X and the vector of labels Y, we calculate the logistic regression coefficients of each feature, where the coefficients represent the significance of each image pixel for successful user verification (Fig. 16.9). As a result, the most significant pixels emphasize the eye regions and the periocular area. This method allows avoiding handcrafted mask construction. It automatically finds the regions that are important for correct recognition. The obtained mask values are used for weighted MSV estimation, where each input pixel has a significance score that determines the pixel weight in the image histogram.

After mask calculation, the main goal is to set camera parameters that make the MSV fall into the predefined interval which leads to the optimal recognition accuracy. To get the interval boundaries, each pair (HD_i, MSV_i) is mapped onto the coordinate plane (Fig. 16.10a), and points with HD > HD_thresh are removed to exclude probes which corresponded to rejection during verification (Fig. 16.10b). The optimal image quality interval center is defined as:

p = \underset{x}{\arg\max}\, f(x).
Here f(x) is the distribution density function, and p is the optimal MSV value corresponding to the image with the most appropriate quality for recognition. Visually, the plotted distribution allows us to distinguish three significant clusters: noise, points with a low pairwise distance density, and points with a high pairwise distance density. To find the optimal image quality interval borders, we cluster the plotted point pairs into three clusters using the K-means algorithm (Fig. 16.10c). The densest "red" cluster with the minimal pairwise point distance is used to determine the optimal quality interval borders. The calculated interval can be represented as [p − delta, p + delta], where the delta parameter equals:
Fig. 16.10 Visualization of cluster construction procedure: (a) all (HDi, MSVi) pairs are mapped onto a coordinate plane; (b) excluded points with HD > HDthresh and plotted distribution density; (c) obtained clusters
delta = \min\left(|l - p|, |r - p|\right),

where w_r = {MSV_i ∈ red cluster | i = 1, 2, ..., N_scores} is the set of points belonging to the cluster with the minimal pairwise point distance, and l = min(w_r), r = max(w_r) are the corresponding cluster borders.

After the optimal image quality interval is determined, we need to adjust the camera parameters to make the captured image MSV fall into this interval. In order to implement this idea, we take the pre-calculated exposure discrete (ED) value and calculate the gain discrete (GD) value. After that, we iteratively add or subtract the ED and GD values from the suboptimal exposure time E* and the default gain G in order to find the optimal exposure time and gain parameters. Thus, the updated E and G values can be calculated according to the following rule: if MSV < p − delta, then:
E = E + ED, \quad G = G + GD;

if MSV ≥ p + delta, the updated exposure and gain are:

E = E - ED, \quad G = G - GD.

Such an iterative technique allows performing the optimal camera parameter adjustment for different illumination conditions.

The proposed algorithm was tested as a part of the iris recognition system (Odinokikh et al. 2018). This system operates in a mode where the False Acceptance Rate (FAR) = 10^-7. Testing was performed using a mobile phone which is based on the Exynos 8895 and equipped with NIR iris capturing hardware. Testing involved 10 users (30 verification attempts were made for each user). The enrollment procedure was performed in an indoor environment, while the verification procedure was done under harsh incandescent lighting in order to prove that the proposed auto-exposure algorithm improves recognition. To estimate algorithm performance, we use two parameters: FRR (False Rejection Rate), a value which shows how many genuine comparisons were rejected incorrectly, and recognition time, which determines the time interval between the start of the verification procedure and successful verification. The results of the comparison are shown in Table 16.2.

According to the obtained results, if the fast parameter adjustment stage of the algorithm is removed, the algorithm will still adjust the camera parameters in an optimal way, but the exposure adaptation time will be significantly increased because of the absence of suboptimal exposure values. If the iterative adjustment stage is removed, the adaptation time will be small, but the face illumination will be estimated in a non-optimal way, and the number of false rejections will increase. Therefore, it is crucial to use both the fast and the iterative steps to reduce both the recognition time and the false rejection rate. A more detailed description of the proposed method with minor modifications and a comparison with well-known algorithms for camera parameter adjustment can be found in Gnatyuk et al. (2019).
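A minimal sketch of the second stage under the stated rule is given below; the significance mask is assumed to come from the regression above (non-negative weights assumed), and the ED/GD step sizes are illustrative placeholders.

```python
import numpy as np

def weighted_msv(image, significance_mask):
    """MSV where every pixel contributes to the 5-bin histogram with its mask weight
    (non-negative weights assumed)."""
    bins = np.minimum(image.astype(np.int64) * 5 // 256, 4)
    hist = np.bincount(bins.ravel(), weights=significance_mask.ravel(), minlength=5)
    return float(np.sum((np.arange(5) + 1) * hist) / np.sum(hist))

def refine_parameters(msv, exposure, gain, p, delta, ed=0.5, gd=1.0):
    """One iteration of the refinement rule: step exposure/gain up below the optimal
    interval [p - delta, p + delta], step down above it, leave unchanged inside it."""
    if msv < p - delta:
        return exposure + ed, gain + gd
    if msv >= p + delta:
        return exposure - ed, gain - gd
    return exposure, gain
```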
Table 16.2 Performance comparison for the proposed algorithm of automatic camera parameter adjustment

Value                  No adjustment   First stage only (fast exposure adjustment)   Second stage only (iterative parameter refinement)   Proposed method (two stages)
FRR (%)                82.3            20.0                                          1.6                                                   1.6
Recognition time (s)   7.5             0.15                                          1.5                                                   0.15
In conclusion, it should be noted that the proposed algorithm can be easily adapted to any biometric security system and any kind of recognition area: iris, face, palm, wrist, etc.
16.4 Iris Feature Extraction and Matching
As mentioned in Sect. 16.2, the final stage of iris recognition includes construction of the iris feature vector. This procedure represents the extraction of the iris texture information relevant to its subsequent comparison. The input of the feature vector construction is the normalized iris image (Daugman 2004), as depicted in Fig. 16.11. Since the iris region can be occluded by eyelids, eyelashes, reflections, and other artifacts, such areas contain irrelevant information for subsequent iris texture matching and are not used for feature extraction. It should be noted that feature extraction and matching are considered together because they are closely connected to each other. A lot of feature extraction methods, considering iris texture patterns at different levels of detail, have been proposed (Bowyer et al. 2008). A significant leap in reliability and quality in the field was achieved with the adoption of deep neural networks (DNNs). Since then there have been numerous attempts to apply DNNs to iris recognition. In particular, Gangwar and Joshi (2016) introduced their DeepIrisNet as a model combining all successful deep learning techniques known at the time. The authors thoroughly investigated the obtained features and produced a strong baseline for subsequent works. An approach with two fully convolutional networks (FCN) with a modified triplet loss function was recently proposed in Zhao and Kumar (2017). One of the networks is used for iris template extraction, whereas the second produces the accompanying mask. Fuzzy image enhancement combined with simple linear iterative clustering and a self-organizing map (SOM) neural network was proposed in Abate et al. (2017). Although the method is designed for iris recognition on a mobile device, real-time performance has not been achieved. Another recent work by Zhang et al. (2018), declared suitable for the mobile case, proposes a two-headed (iris and periocular) CNN with fusion of embeddings. Thus, no optimal solution for iris feature extraction and matching has been presented in published papers.
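For completeness, a minimal sketch of the rubber-sheet remapping that produces such a normalized iris image is given below; it assumes circular pupil and iris boundaries and uses nearest-neighbour sampling, which is a simplification of the normalization referenced above.

```python
import numpy as np

def normalize_iris(image, pupil_xy, pupil_r, iris_xy, iris_r, out_h=64, out_w=256):
    """Remap the annular iris region to a fixed-size polar image (rubber-sheet model):
    rows run from the pupil boundary to the iris boundary, columns over the angle."""
    h, w = image.shape
    thetas = np.linspace(0.0, 2.0 * np.pi, out_w, endpoint=False)
    radii = np.linspace(0.0, 1.0, out_h)
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i, r in enumerate(radii):
        # linearly interpolate between the pupil and iris boundary points along each ray
        x = (1 - r) * (pupil_xy[0] + pupil_r * np.cos(thetas)) + r * (iris_xy[0] + iris_r * np.cos(thetas))
        y = (1 - r) * (pupil_xy[1] + pupil_r * np.sin(thetas)) + r * (iris_xy[1] + iris_r * np.sin(thetas))
        xi = np.clip(np.round(x).astype(int), 0, w - 1)
        yi = np.clip(np.round(y).astype(int), 0, h - 1)
        out[i] = image[yi, xi]
    return out
```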
Fig. 16.11 Segmented iris texture area and corresponding normalized iris image
Fig. 16.12 Proposed model scheme of iris feature extraction and matching (Odinokikh et al. 2019c)
This section is a brief description of the iris feature extraction and matching presented in Odinokikh et al. (2019c). The proposed method represents a CNN designed to utilize advantages of the normalized iris image as an invariant, both low- and high-level representations of discriminative features and information about iris area and pupil dilation. It contains iris feature extraction and matching parts trained together (Fig. 16.12). It is known that shallow layers in CNNs are responsible for extraction of low-level textural information, while high-level representation is achieved with depth. Basic elements of the shallow feature extraction block and their relations are depicted in Fig. 16.12. High-level (deep) feature representation is performed by convolution block #2. Feature maps, which come from block #1, are concatenated by channels and pass through it. The meaning of concatenation at this stage is in the invariance property of the normalized iris image. The output vector FVdeep reflects high-level representation of discriminative features and is assumed to handle complex nonlinear distortions of the iris texture caused by the changing environment.
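The following PyTorch sketch is only a toy analogue of this design and not the authors' network: a shallow block produces texture-level feature maps, and a deeper stack of depth-wise separable convolution blocks (adopted below as the basic structural element) turns them into a compact deep feature vector; all layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise 3x3 convolution followed by a 1x1 point-wise convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return torch.relu(self.pointwise(self.depthwise(x)))

class ToyIrisNet(nn.Module):
    """Shallow texture block, then a deeper block, then a global deep feature vector."""
    def __init__(self):
        super().__init__()
        self.shallow = DepthwiseSeparableConv(1, 16)       # low-level texture features
        self.deep = nn.Sequential(DepthwiseSeparableConv(16, 32),
                                  DepthwiseSeparableConv(32, 64))
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                                  # x: normalized iris images, Nx1x64x256
        fv_shallow = self.shallow(x)
        return self.pool(self.deep(fv_shallow)).flatten(1) # FV_deep, Nx64

features = ToyIrisNet()(torch.randn(2, 1, 64, 256))
```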
Match score calculation is performed on FV_deep, the shallow feature vector FV_sh, and additional information (FV_env) about the iris area and pupil dilation by using the variational inference technique. The depth-wise separable convolution block, which is memory- and computation-efficient, was chosen as the basic structural element for the entire network architecture. Along with the lightweight CNN architecture, this allows the model to operate in real time on a device with highly limited computational power.

The following methods were selected as the state of the art: FCN + Extended Triplet Loss (ETL) (Zhao and Kumar 2017) and DeepIrisNet (Gangwar and Joshi 2016). It should also be noted that the results of the lightweight CNN proposed in Zhang et al. (2018) were obtained on the same datasets used for testing the proposed method. For detailed results, please refer to Zhang et al. (2018). Many other methods were excluded from consideration due to their computational complexity and unsuitability for mobile applications. Three different datasets were used for training and evaluation (CASIA 2015): CASIA-Iris-M1-S2 (CMS2), CASIA-Iris-M1-S3 (CMS3), and one more (IrisMobile, IM) collected privately using a mobile device with an embedded NIR camera. The latter was collected simulating real authentication scenarios of a mobile device user: images captured under highly changing illumination both indoors and outdoors, with/without glasses. More detailed specifications of the datasets are given in Table 16.3.

Results on the recognition accuracy are presented in Table 16.4. ROC curves obtained for comparison with the state-of-the-art methods on the CMS2, CMS3, and IM datasets are depicted in Fig. 16.13. The proposed method outperforms the chosen state-of-the-art ones on all the datasets. After the division into subsets, it became impossible to estimate FNMR at FMR = 10^-7 for the CMS2 and CMS3 datasets since the number of comparisons in the test sets did not exceed ten million. So, yet another experiment was to estimate the performance of the proposed model on those datasets without training on them. The model trained on IM is evaluated on the entire CMS2 and CMS3 datasets in order to
Table 16.3 Dataset details

Dataset   Images   Irises   Outdoor   Subjects
CMS2      7723     398      0         Asian
CMS3      8167     720      0         Asian
IM        22,966   750      4933      Asian & Cauc.
Table 16.4 Recognition performance evaluation results

Method        EER (CMS2)   EER (CMS3)   EER (IM)   Testing    FPS
DeepIrisNet   0.0709       0.1199       0.1371     WithinDB   11
FCN + ETL     0.0093       0.0301       0.0607     WithinDB   12
Proposed      0.0014       0.0190       0.0116     WithinDB   250
Proposed      0.0003       0.0086       –          CrossDB    –
Fig. 16.13 ROC curves obtained for comparison with state-of-the-art methods on different datasets: (a) CASIA-Iris-M1-S2 (CMS2); (b) CASIA-Iris-M1-S3 (CMS3); (c) Iris Mobile (IM)
get the FNMR at FMR = 10^-7 (CrossDB). The obtained results demonstrate the high generalization ability of the model.

A mobile device equipped with the Qualcomm Snapdragon 835 CPU was used for estimating the overall execution time of these iris feature extraction and matching methods. It should be noted that a single core of the CPU was used. The results are summarized in Table 16.4.

Thus, the proposed algorithm showed robustness to the high variability of the iris representation caused by changes in the environment and the physiological features of the iris itself. The benefit of using shallow textural features, feature fusion, and variational inference as a regularization technique is also investigated in the context of the iris recognition task. Despite the fact that the approach is based on deep learning, it is capable of operating in real time on a mobile device in a secure environment with substantially limited computational power.
16.5 Limitations of Iris Recognition and Approaches for Shifting Limitations
In conclusion, we would like to shed light on several open issues of iris recognition technology. The first issue is related to the limited usability of iris recognition technology in extreme environmental conditions. In near darkness, the pupil dilates and masks almost all of the iris texture area. Outdoors in direct sunlight, the user cannot open the eyes wide enough, and the iris texture can be masked by reflections. The second issue is related to wearing glasses. Usage of active illumination leads to glares on the glasses, which can mask the iris area. In this case, the user should change the position of the mobile device for successful recognition or take off the glasses. This can be inconvenient, especially in daily usage. Thus, the root of the majority of the issues is in obtaining enough iris texture area for reliable recognition. It has been observed that at least 40% of the iris area should be visible to achieve the given accuracy level.

To mitigate the mentioned issues, several approaches, apart from changes to the iris capturing hardware, have been proposed. One of them is the well-known multi-modal recognition (e.g., fusion of iris and face (or periocular area) recognition) as described in Ross et al. (2006). In this section, only approaches related to the eye itself are considered.

The first approach is based on the idea of multi-instance iris recognition, which performs the fusion of the two irises and uses the relative spatial information and several factors that describe the environment. Often the iris is significantly occluded by the eyelids, eyelashes, highlights, etc. This happens mainly because of the complex environment, in which the user cannot open the eyes wide enough (bright illumination, windy weather, etc.). It makes the application of the multi-instance iris approach reasonable in case the input image contains both eyes at the same time. The final dissimilarity score is calculated as a logistic function of the form:

Score = \frac{1}{1 + \exp\left(\sum_{i=0}^{6} w_i M_i\right)},
where M = {Δd_norm, d_avg, AOI_min, AOI_max, ΔND_min, ΔND_max, ΔPIR_avg}; Δd_norm is the normalized score difference for the two pairs of irises (d_LEFT is the score for the left eye; d_RIGHT is the score for the right eye):

\Delta d_{norm} = \frac{|d_{LEFT} - d_{RIGHT}|}{d_{LEFT} + d_{RIGHT}};

d_avg is the average score for the pair:
Fig. 16.14 Parameters of the pupil and iris used for the iris fusion
d_{avg} = \frac{d_{LEFT} + d_{RIGHT}}{2}.
AOI_min, AOI_max are the minimum and maximum values of the area of intersection between the two binary masks M_probe and M_enroll in each pair:

AOI = \frac{\sum_{c} M_c}{M_{height} \cdot M_{width}}, \quad M_c = M_c^{probe} \cdot M_c^{enroll}.

ΔND_min and ΔND_max are the minimum and maximum values of the normalized distance ΔND between the centres of the pupil and the iris:

\Delta ND = \sqrt{\left(NDX_{probe} - NDX_{enroll}\right)^2 + \left(NDY_{probe} - NDY_{enroll}\right)^2},

NDX = \frac{x_P - x_I}{R_I}, \quad NDY = \frac{y_P - y_I}{R_I},
where (x_P, y_P) are the coordinates of the centre of the pupil, and R_P is its radius; (x_I, y_I) are the coordinates of the centre of the iris, and R_I is its radius (Fig. 16.14). ΔPIR_avg represents the difference in pupil dilation between the enrollment and the probe based on the PIR = R_P/R_I ratio:

\Delta PIR_{avg} = \left( \left( PIR^{LEFT}_{enroll} - PIR^{LEFT}_{probe} \right) + \left( PIR^{RIGHT}_{enroll} - PIR^{RIGHT}_{probe} \right) \right) / 2.

The weight coefficients w_i, i ∈ [1; 7], were obtained after training the classifier on genuine and impostor matches on a small subset of the data. In case only one of the two feature vectors is extracted, all the pairs of values used in the weighted sum are assumed to be equal. The proposed method allowed the threshold for the visible iris area to be decreased from 40% to 29% during verification/identification without any loss in accuracy or performance, which leads to a decrease of the overall FRR (or, in other words, user convenience is improved).
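A hedged sketch of assembling M and the fused score for one two-eye verification attempt is shown below; the inputs are the per-eye values defined above, and the weights in the example are arbitrary placeholders rather than trained coefficients.

```python
import numpy as np

def fusion_features(d_left, d_right, aoi, nd_probe, nd_enroll, pir_enroll, pir_probe):
    """Build M = {dd_norm, d_avg, AOI_min, AOI_max, dND_min, dND_max, dPIR_avg}.
    aoi, pir_*: length-2 arrays (left, right); nd_*: 2x2 arrays of (NDX, NDY) per eye."""
    dd_norm = abs(d_left - d_right) / (d_left + d_right)
    d_avg = 0.5 * (d_left + d_right)
    dnd = np.hypot(nd_probe[:, 0] - nd_enroll[:, 0], nd_probe[:, 1] - nd_enroll[:, 1])
    dpir_avg = 0.5 * ((pir_enroll[0] - pir_probe[0]) + (pir_enroll[1] - pir_probe[1]))
    return np.array([dd_norm, d_avg, aoi.min(), aoi.max(), dnd.min(), dnd.max(), dpir_avg])

def fusion_score(m, weights):
    """Logistic combination of the seven fusion features into one dissimilarity score."""
    return 1.0 / (1.0 + np.exp(np.dot(weights, m)))

# Placeholder weights only; the real coefficients are learned from genuine/impostor data.
w = np.array([1.0, 4.0, -2.0, -2.0, 1.0, 1.0, 2.0])
```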
Table 16.5 Recognition accuracy for different matching rules

Error rate (%)   Method      Proposed   Minimum   Consensus
EER              CNN-based   0.01       0.21      0.21
                 GAQ         0.10       1.31      1.31
FNMR             CNN-based   0.48       0.92      1.25
                 GAQ         1.07       3.17      4.20
Table 16.6 Recognition accuracy in different verification conditions

Error rate (%)   Method      IN&NG   IN&G   OT&NG
EER              CNN-based   0.01    0.09   0.42
                 GAQ         0.10    0.35   3.15
FNMR             CNN-based   0.48    5.52   10.1
                 GAQ         1.07    8.94   32.5
FTA              –           0.21    4.52   0.59
To prove the effectiveness of the proposed fusion method, a comparison with the well-known consensus and minimum rules (Ross et al. 2006) was performed. According to the consensus rule, a matching is considered successful if both d_LEFT and d_RIGHT are less than the decision threshold. In the minimum rule, what is required is that the minimum of the two values, min(d_LEFT, d_RIGHT), should be less than the threshold. The testing results are presented in Table 16.5.

The second approach is related to adaptive biometric systems, which attempt to adapt themselves to the intra-class variation of the input biometric data resulting from changing environmental conditions (Rattani 2015). In particular, such adaptation is made by replacing or appending the input feature vector (probe) to the enrolled template immediately after a successful recognition attempt (to avoid impostor intrusion). As mentioned in Sect. 16.2, a normalization procedure which assumes uniform iris elasticity is used to compensate for changes in the iris texture occurring due to pupil dilation or constriction. Experimental results presented in Table 16.6 show that John Daugman's rubber sheet model works well in the case of the limited range of pupil changes which usually occur in indoor conditions. But in the case of the wide range of outdoor illumination changes, additional measures for compensation of the iris texture deformation are required.

The idea of the proposed method is to perform an update of the enrolled template by taking into consideration the normalized pupil radius PIR and the average mutual dissimilarity scores of the feature vectors (FVs) in the enrolled template. The final goal of such an update is to obtain an enrolled template which contains iris feature vectors corresponding to iris images captured in a wide range of illumination conditions. In the case of multi-instance iris recognition, the update procedure is applied independently for the left and right eyes. Let us consider the update procedure for a single eye.
Fig. 16.15 Eyelid position determination
The first step of the proposed procedure (besides a successful pass of the verification) is an additional quality check of the probe feature vector FV_probe, which can be considered as input for the update procedure. It should be noted that the thresholds used for the quality check are different in enrollment and verification modes. In particular, the normalized eye opening (NEO) value, described below, is set to 0.5 for the enrollment and 0.2 for the verification; the non-masked area (NMA) of the iris (not occluded by any noise) is set to 0.4 and 0.29 for the enrollment and the probe, respectively (in the case of multi-instance iris recognition). The NEO value reflects the eye opening condition and is calculated as:

NEO = \frac{E_l + E_u}{2 R_I}.
Here E_l and E_u are the lower and upper eyelid positions determined as the vertical distance to the eyelid from the pupil center (P_c) (Fig. 16.15). One of the methods for eyelid position detection is presented in Odinokikh et al. (2019b). It is based on applying multi-directional 2D Gabor filtering and is suitable for running on mobile devices. Additional checking of the probe feature vector consists of applying the enrollment thresholds for the NEO and NMA values associated with FV_probe.

The second step consists of checking the possibility to update the enrolled template. The structure of the enrolled template is depicted in Fig. 16.16. All FVs in the enrolled template are divided into three groups: initially enrolled FVs obtained during enrollment of a new user and two groups corresponding to FVs obtained at high illumination and low illumination conditions, respectively. The latter groups are initially empty and receive new FVs through appending or replacing. It is important to note that the group of initially enrolled FVs is not updated to prevent possible degradation of recognition accuracy. Each FV in the enrolled template contains information about the corresponding PIR value and the average mutual dissimilarity score:

d_{am}(FV_k) = \frac{1}{N} \sum_{i \in \{1, \dots, N \mid i \neq k\}} d(FV_k, FV_i).
Here N is the current number of FVs in the enrolled template. The d_am(FV_k) values are updated after each update cycle.
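A small sketch of maintaining this quantity for a template is given below; the distance function d is left as a placeholder for whatever matcher the system uses.

```python
import numpy as np

def average_mutual_dissimilarity(template_fvs, distance):
    """d_am(FV_k) = (1/N) * sum of d(FV_k, FV_i) over all i != k, for every FV in the template."""
    n = len(template_fvs)
    d_am = np.zeros(n)
    for k in range(n):
        d_am[k] = sum(distance(template_fvs[k], template_fvs[i])
                      for i in range(n) if i != k) / n
    return d_am
```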
Fig. 16.16 Structure of enrolled template
Let FV^E_1, ..., FV^E_M denote the set of M initially enrolled FVs. If PIR(FV_probe) < min(PIR(FV^E_1), ..., PIR(FV^E_M)), then FV_probe is considered as a candidate for the update of the group of FVs obtained at high illumination (lPIR group). Otherwise, if PIR(FV_probe) > max(PIR(FV^E_1), ..., PIR(FV^E_M)), then FV_probe is considered as a candidate for the update of the group of corresponding FVs obtained at low illumination (hPIR group). The lPIR and hPIR groups have predefined maximum sizes. If the selected group is not full, FV_probe is added to it. Otherwise, the following rules are applied. If PIR(FV_probe) is the minimal value among all FVs inside the lPIR group, then FV_probe replaces the FV with the minimal PIR value in the lPIR group. Similarly, if PIR(FV_probe) is the maximal value among all FVs inside the hPIR group, then FV_probe replaces the FV with the maximal PIR value in the hPIR group. Otherwise, the FV which is closest to FV_probe by PIR value is searched inside the selected group. Let FV_i denote the closest feature vector to FV_probe in terms of PIR value, and FV_{i-1} and FV_{i+1} its corresponding neighbors in terms of PIR value. Then, the following values are calculated:

D = \left| PIR(FV_i) - PIR_{avg} \right| - \left| PIR(FV_{probe}) - PIR_{avg} \right|,

PIR_{avg} = \frac{1}{2}\left( PIR(FV_{i-1}) + PIR(FV_{i+1}) \right).

If D exceeds the predefined threshold, then FV_probe replaces FV_i. This simple rule allows obtaining a group of FVs which are distributed uniformly in terms of PIR values. Otherwise, an additional rule is applied: if d_am(FV_probe) < d_am(FV_i), then FV_probe replaces FV_i. d_am(FV_probe) and d_am(FV_i) are the average mutual dissimilarity
Table 16.7 Dataset specification

Dataset              Non-glasses (NG)             Glasses (G)
Users in dataset     476                          224
Max comparisons      22,075,902                   10,605,643
Ethnic diversity     Asian and Caucasian
Eyes on video        Two
Videos per user      10
Video length         2 s (30 frames)
Capturing distance   25–40 cm
Capturing speed      15 frames per second (FPS)
Image resolution     1920 × 1920
scores calculated as shown above. This aids in selecting the FV that exhibits maximum similarity with the other FVs in the enrolled template.

In order to prove the efficiency of the proposed methods, a dataset which emulates user interaction with a mobile device was collected privately. It is a set of two-second video sequences, each of which is a real enrollment/verification attempt. It should be noted that there are no such publicly available datasets. The dataset was collected using a mobile device with an embedded NIR camera. It contains videos captured at different distances, in indoor (IN) and outdoor (OT) environments, with and without glasses. During dataset capturing, the following illumination ranges and conditions were set up: (i) three levels for the indoor samples (0–30, 30–300, and 300–1000 Lux) and (ii) a random value in the range 1–100K Lux (data were collected on a sunny day with different arrangements of the device relative to the sun). A detailed description of the dataset can be found in Table 16.7. The Iris Mobile (IM) dataset used in Sect. 16.4 was randomly sampled from this dataset as well.

The testing procedure for the proposed multi-instance iris recognition considers each video sequence as a single attempt. The procedure contains the following steps:

1. All video sequences captured in indoor conditions and without glasses (IN&NG) are used to produce the enrollment template. The enrolled template is successfully created if the following conditions are satisfied: (a) at least 5 FVs were constructed for each eye; (b) at least 20 out of 30 frames were processed.
2. All video sequences are used to produce probes. The probe is successfully created if at least one FV was constructed.
3. Each enrollment template is compared with all probes except the ones generated from the same video. Thus, the pairwise matching table of the dissimilarity scores for the performed comparisons is created.
4. The obtained counters of successfully created enrolled templates and probes and the pairwise matching table are used for calculating FTE, FTA, FNMR, FMR, and EER as described in Dunstone and Yager (2009).
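As a hedged illustration of step 4 (using standard definitions rather than the exact procedure of Dunstone and Yager), the error rates can be computed from arrays of genuine and impostor dissimilarity scores as follows.

```python
import numpy as np

def fnmr_at_fmr(genuine, impostor, target_fmr=1e-7):
    """FNMR at the decision threshold where the impostor scores yield the target FMR
    (dissimilarity scores: a comparison is accepted when its score is below the threshold)."""
    threshold = np.quantile(impostor, target_fmr)
    return float(np.mean(genuine >= threshold))

def equal_error_rate(genuine, impostor):
    """EER: the operating point where FNMR and FMR coincide (coarse threshold sweep)."""
    thresholds = np.linspace(0.0, 1.0, 1001)
    fnmr = np.array([np.mean(genuine >= t) for t in thresholds])
    fmr = np.array([np.mean(impostor < t) for t in thresholds])
    i = int(np.argmin(np.abs(fnmr - fmr)))
    return float(0.5 * (fnmr[i] + fmr[i]))
```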
Fig. 16.17 Verification rate values obtained at each update cycle for Gabor-based feature extraction and matching (Odinokikh et al. 2017)
The recognition accuracy results are presented in Table 16.6. The proposed CNN-based feature extraction and matching method described in Sect. 16.4 is compared with the one described in Odinokikh et al. (2017) as a part of the whole iris recognition pipeline. The latter method is based on Gabor wavelets with an adaptive phase quantization technique (denoted as GAQ in Table 16.6). Both methods were tested in three different verification environments: indoors without glasses (IN&NG), indoors with glasses (IN&G), and outdoors without glasses (OT&NG). The enrollment was always carried out only indoors without glasses, and, for this reason, the value of FTE = 3.15% is the same for all the cases. The target FMR = 10^-7 was set in every experiment. Applying different matching rules was also investigated. The proposed multi-instance fusion showed advantages over the other compared rules (Table 16.5).

To simulate template adaptation in a real-life scenario, the following testing procedure is proposed. A subset containing video sequences captured both in indoor and outdoor environmental conditions for 28 users is formed from the whole dataset. For each user, one video sequence captured in indoor conditions without glasses is randomly selected for generating the initial enrolled template. All other video sequences (both indoor and outdoor) are used for generating probes. After generating the probes, they are split into two subsets: one is for performing genuine attempts during verification (genuine subset); the other is for the enrolled template update (update subset). It should be noted that all probes are used for performing impostor attempts during verification. On each update cycle, one probe from the update subset is randomly selected, and the enrolled template update is started. The updated enrolled template is involved in performance testing after every update cycle. Figure 16.17 shows the verification rate values obtained at different update cycles for the proposed adaptation method applied to the Gabor-based feature extraction and matching algorithm proposed in Odinokikh
et al. (2017). It can be seen that the proposed adaptation scheme allows the verification rate to be increased by up to 6% after 9 update cycles.

Portions of the research in this chapter use the CASIA-Iris-Mobile-V1.0 dataset collected by the Chinese Academy of Sciences' Institute of Automation (CASIA 2015).
References Abate, A.F., Barra, S., D’Aniello, F., Narducci, F.: Two-tier image features clustering for iris recognition on mobile. In: Petrosino, A., Loia, V., Pedrycz, W. (eds.) Fuzzy Logic and Soft Computing Applications. Lecture Notes in Artificial Intelligence, vol. 10147, pp. 260–269. Springer International Publishing, Cham (2017) ARM Security Technology. Building a secure system using TrustZone Technology. ARM Limited (2009) Battiato, S., Messina, G., Castorina, A.: Exposure сorrection for imaging devices: an overview. In: Lukas, R. (ed.) Single-Sensor Imaging: Methods and Applications for Digital Cameras, pp. 323–349. CRC Press, Boca Raton (2009) Bowyer, K.W., Hollingsworth, K., Flynn, P.J.: Image understanding for iris biometrics: a survey. Comput. Vis. Image Underst. 110(2), 281–307 (2008) Chinese Academy of Sciences’ Institute of Automation (CASIA). Casia-iris-mobile-v1.0 (2015). Accessed on 4 October 2020. http://biometrics.idealtest.org/CASIA-Iris-Mobile-V1.0/CASIAIris-Mobile-V1.0.jsp Corcoran, P., Bigioi, P., Thavalengal, S.: Feasibility and design considerations for an iris acquisition system for smartphones. In: Proceedings of the 2014 IEEE Fourth International Conference on Consumer Electronics, Berlin (ICCE-Berlin), pp. 164–167 (2014) Das, A., Galdi, C., Han, H., Ramachandra, R., Dugelay, J.-L., Dantcheva, A.: Recent advances in biometric technology for mobile devices. In: Proceedings of the IEEE 9th International Conference on Biometrics Theory, Applications and Systems (2018) Daugman, J.: High confidence visual recognition of persons by a test of statistical independence. IEEE Trans. Pattern Anal. Mach. Intell. 15(11), 1148–1161 (1993) Daugman, J.: Recognising persons by their iris patterns. In: Li, S.Z., Lai, J., Tan, T., Feng, G., Wang, Y. (eds.) Advances in Biometric Person Authentication. SINOBIOMETRICS 2004. Lecture Notes in Computer Science, vol. 3338, pp. 5–25. Springer, Berlin, Heidelberg (2004) Daugman, J.: Probing the uniqueness and randomness of iris codes: results from 200 billion iris pair comparisons. Proc. IEEE. 94(11), 1927–1935 (2006) Daugman, J., Malhas, I.: Iris recognition border-crossing system in the UAE (2004). Accessed on 4 October 2020. https://www.cl.cam.ac.uk/~jgd1000/UAEdeployment.pdf Dunstone, T., Yager, N.: Biometric System and Data Analysis: Design, Evaluation, and Data Mining. Springer-Verlag, Boston (2009) Fujitsu Limited. Fujitsu develops prototype smartphone with iris authentication (2015). Accessed on 4 October 2020. https://www.fujitsu.com/global/about/resources/news/press-releases/2015/ 0302-03.html Galbally, J., Gomez-Barrero, M.: A review of iris anti-spoofing. In: Proceedings of the 4th International Conference on Biometrics and Forensics (IWBF), pp. 1–6 (2016) Gangwar, A.K., Joshi, A.: DeepIrisNet: deep iris representation with applications in iris recognition and cross sensor iris recognition. In: Proceedings of 2016 IEEE International Conference on Image Processing (ICIP), pp. 2301–2305 (2016) Gnatyuk, V., Zavalishin, S., Petrova, X., Odinokikh, G., Fartukov, A., Danilevich, A., Eremeev, V., Yoo, J., Lee, K., Lee, H., Shin, D.: Fast automatic exposure adjustment method for iris recognition system. In: Proceedings of 11th International Conference on Electronics, Computers and Artificial Intelligence (ECAI), pp. 1–6 (2019)
IrisGuard UK Ltd. EyePay Phone (IG-EP100) specification (2019). Accessed on 4 October 2020. https://www.irisguard.com/node/57 ISO/IEC 19794-6:2011: Information technology – Biometric data interchange formats – Part 6: Iris image data (2011), Annex B (2011) Korobkin, M., Odinokikh, G., Efimov, Y., Solomatin, I., Matveev, I.: Iris segmentation in challenging conditions. Pattern Recognit Image Anal. 28, 652–657 (2018) Nourani-Vatani, N., Roberts, J.: Automatic camera exposure control. In: Dunbabin, M., Srinivasan, M. (eds.) Proceedings of the Australasian Conference on Robotics and Automation, pp. 1–6. Australian Robotics and Automation Association, Sydney (2007) Odinokikh, G., Fartukov, A., Korobkin, M., Yoo, J.: Feature vector construction method for iris recognition. In: International Archives of the Photogrammetry, Remote Sensing and Spatial Information Science. XLII-2/W4, pp. 233–236 (2017). Accessed on 4 October 2020. https://doi. org/10.5194/isprs-archives-XLII-2-W4-233-2017 Odinokikh, G.A., Fartukov, A.M., Eremeev, V.A., Gnatyuk, V.S., Korobkin, M.V., Rychagov, M. N.: High-performance iris recognition for mobile platforms. Pattern Recognit. Image Anal. 28, 516–524 (2018) Odinokikh, G.A., Gnatyuk, V.S., Fartukov, A.M., Eremeev, V.A., Korobkin, M.V., Danilevich, A. B., Shin, D., Yoo, J., Lee, K., Lee, H.: Method and apparatus for iris recognition. US Patent 10,445,574 (2019a) Odinokikh, G., Korobkin, M., Gnatyuk, V., Eremeev, V.: Eyelid position detection method for mobile iris recognition. In: Strijov, V., Ignatov, D., Vorontsov, K. (eds.) Intelligent Data Processing. IDP 2016. Communications in Computer and Information Science, vol. 794, pp. 140–150. Springer-Verlag, Cham (2019b) Odinokikh, G., Korobkin, M., Solomatin, I., Efimov, I., Fartukov, A.: Iris feature extraction and matching method for mobile biometric applications. In: Proceedings of International Conference on Biometrics, pp. 1–6 (2019c) Prabhakar, S., Ivanisov, A., Jain, A.K.: Biometric recognition: sensor characteristics and image quality. IEEE Instrum. Meas. Soc. Mag. 14(3), 10–16 (2011) Rathgeb, C., Uhl, A., Wild, P.: Iris segmentation methodologies. In: Iris Biometrics. Advances in Information Security, vol. 59. Springer-Verlag, New York (2012) Rattani, A.: Introduction to adaptive biometric systems. In: Rattani, A., Roli, F., Granger, E. (eds.) Adaptive Biometric Systems. Advances in Computer Vision and Pattern Recognition, pp. 1–8. Springer, Cham (2015) Ross, A., Jain, A., Nandakumar, K.: Handbook of Multibiometrics. Springer-Verlag, New York (2006) Samsung Electronics. Galaxy tab iris (sm-t116izkrins) specification (2016). Accessed on 4 October 2020. https://www.samsung.com/in/support/model/SM-T116IZKRINS/ Samsung Electronics. How does the iris scanner work on Galaxy S9, Galaxy S9+, and Galaxy Note9? (2018). Accessed on 4 October 2020. https://www.samsung.com/global/galaxy/what-is/ iris-scanning/ Sun, Z., Tan, T.: Iris anti-spoofing. In: Marcel, S., Nixon, M.S., Li, S.Z. (eds.) Handbook of Biometric Anti-Spoofing, pp. 103–123. Springer-Verlag, London (2014) Tabassi, E.: Large scale iris image quality evaluation. In: Proceedings of International Conference of the Biometrics Special Interest Group, pp. 173–184 (2011) Tortora, G.J., Nielsen, M.: Principles of Human Anatomy, 12th edn. John Wiley & Sons, Hoboken (2010) Zhang, Q., Li, H., Sun, Z., Tan, T.: Deep feature fusion for iris and periocular biometrics on mobile devices. IEEE Trans. Inf. Forensics Secur. 
13(11), 2897–2912 (2018) Zhao, Z., Kumar, A.: Towards more accurate iris recognition using deeply learned spatially corresponding features. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), pp. 3829–3838 (2017)
Index
A Active/passive glasses, 60 Adaptive CS techniques, 307 Adaptive patch-based dictionaries, 305 Adaptive pulse coding modulation (ADPCM), 118, 119 Additive regularization of topic models (ARTM), 261–263, 266 AIM (Advances in Image Manipulation), 46 Algorithm-generated summary video, 159 Algorithmic bias, 239 Aliasing artefacts, 305 Anatomical landmark detection, 285–288 Android application, 344 Animated graphical abstract advantages, 355 attention zone detection document images, 356–359 for photos, 355, 356 tomographic image, 360 conventional large icons, 364, 369 CT images, 364 effectiveness, 364 frames, 370 generation of animation, 361–364 generation, key stages, 355 goals, 354 hands of kids, 361 image content, 368 image of document, 363 Microsoft PowerPoint, 364 PDF documents, 368 sandstone, 370 scanned document, 362 task detection of photos, 366
visual estimation, 355 zone detection, 355 zoomed face, 361 zoomed-in fragments, 366 Animated thumbnail, 220 Animation from photography animation effect processing flow chart, 222 effect initialization, 221 execution stage, 221 flashing light effect, 221, 223 soap bubble generation, 223–225 sunlight spot effect, 225, 227, 228 Animation of still photos, 219, 220 Application-related data, 272 Application-specific integral circuits (ASIC), 115, 116 Arbitrary view rendering, 77 Artificial intelligence (AI) racial profiling, 239 Attention zone detection, 220, 225, 226, 228 Attentive vision model, 225 Auantization linear prediction schemes, 118 Audio-awareness animation effects, 233, 234 Audiovisual slideshows, 221 Authentication methods, 267 Automated update of biometric templates, 406, 416, 417, 419, 420 Automated video editing, 165 Automatic audio-aware animation generation, 220 Automatic cropping, 353, 354 Automatic editing model training, 172, 173 Automatic film editing, 165, 170, 171 Automatic video editing
424 Automatic video editing (cont.) aesthetical scores, video footage, 163, 164 ASL, 156 cinematographic cut, 156 dialogue-driven scenes, 163 dynamic programming (see Dynamic programming) existing methods, learning editing styles, 164, 165 imitation learning (see Imitation learning) nonparametric probabilistic approach for media analysis, 167–169 shot length metrics, 156 single-camera, multiple-angle and multipletake footage, 163–164 time-consuming process, 155 timing statistics, film, 156 video clips generation, 165, 167 video footage from multiple cameras, 159–163 video summarising, 157–159 Automatic video editing quality, 184, 185 Automatic view planning (AVP) brain AVP, 281, 297 brain MRI, 279 cardiac AVP, 299 desired properties, 278 framework (see AVP framework) knee AVP, 298 spine AVP, 298 tomographic diagnostic imaging, 279 verification, 296 Average shot length (ASL), 156 AVP framework ambiguity yields, 280 anatomical landmark detection, 285–288 bounding box, 280, 281 brain AVP, 281 knee AVP, 281 landmark candidate position refinement, 295, 296 MSP estimation in brain images, 282–284 post-processing, 280, 281 pre-processing, 280 quality verification, 300 spine AVP, 291–295 statistical atlas, 281 steps, workflow, 280 training landmark detector, 288–291 workflow, 280, 282
B Background audio signals, 228 Background inpainting article reports satisfactory, 103
Index background motion, 104, 105 diffusion-based approaches, 103 forward temporal pass, 106 global optimization approach, 104 input video sequence, 103 motion, 104 spatial pass, 106, 107 video inpainting approaches, 103 Bayer SR, 21–26 Binary coded aperture, 326 Biometric methods, 397 Biometric recognition algorithm, 402 Block circulant with circulant block (BCCB), 12, 14, 15 Block diagonalization, 10, 14 complexity, 16 Blur-warp-down-sample model, 4 Brain AVP, 281, 297 Brain MRI MSP computation, 282 non-traditional landmarks, 279 on test datasets, 291 view planning, 279
C Call + sensors’ data, 271 Camera calibration, 328, 334–337 Cardiac AVP, 299 Cardiac MRI, 288, 291 Cardiac view planning, 277–279 Channel attention (CA) mechanism, 50 Cinematographic, 155, 156 Circulant matrix, 11 Client application, 271 Coded aperture approach, 325 Collaborative sparsifying transform, 305 Colour classification error, 208 colour coherence vectors, 196 MPEG-7 visual descriptors, 211 properties, 196 RGB colour channels, 205 RGB colour space, 195 skin tone detector, 199 ventral pathway, 211 Colour-based stereo reproduction, 62 Colour-coded aperture and chromatic aberration-coded, 328 deep neural network architectures, 328 depth estimation, 332, 333, 335 depth sensor calibration, 335 image formation, 327 image restoration, 327 light-efficient designs, 331
  numerical simulation, 328, 329
  prototypes, 342
  PSF, 329
  real-time 3D reconstruction scenario, 345
  simplified imaging system, 326
  smartphone prototype, 343
  spatial misalignment, 326
Colour coherence vectors, 196
Colour filter array (CFA), 4
Compound memory network (CMN), 213, 214
Compressed sensing (CS)
  complex-valued object images, 306
  compressed sensing, 307
  dictionary-based approaches, 307
  k-space measurements, 306
  MR signals, 304
Compressed sensing MRI, 306, 307
Computational complexity, 31
Computed tomographic (CT) images, 351, 355, 360, 364, 366
Computer vision, 325, 328
Computing optical flow, 95
Confusion matrix, 198
Content-adaptive techniques, 220
Content-based animation effects, 220
Content-based image orientation recognition, 193
Context stream, 212
Contextual behaviour-based profiling, 267–269
Continuous user authentication, 267
Contrastive learning, 254
Controlled data acquisition, 246
Conventional camera calibration, 337
Conventional colour anaglyphs, 62
Conventional iris recognition system, 398
Convolutional neural network (CNN), 155, 159, 163, 165, 166, 177, 188, 254
  architecture, 37
  multiresolution, 212, 213
  receptive field, 36
  in SISR, 35
    depth maps, 54
    FSRCNN, 42–44
    learned perceptual patch similarity, 42
    upsampling, 37
  super-resolution task, 36
  super-resolution with residual learning, 37
  SVM, 213
  visual recognition tasks, 211
  two-stream CNN method, 211, 212
D
Daisy descriptors, 168
Data acquisition, 271
Data annotation, 247–249
Data augmentation, 251
Data crowdsourcing, 246
Data engineering, 245, 249
Data imbalance, 256
Data operations teams (DataOps), 245
Dataset collection system, 271, 272
Dataset synthesis, 328
Deconvolution, 36, 42–44, 53
Deep back-projection networks (DBPN), 49
Deep learning (DL), 328
  algorithms, 54
  convolution filters, 43
  and GANs, 54
  loss function, 41
  naïve bicubic down-sampling, 55
  revolution, 251
  synthetic data, 247
  training and inference, 251
  to video SR, 55
  visual data processing, 155
Deep neural networks (DNNs), 410
Demographic classifiers, 261
Dense compression units (DCUs), 48
Depth-based rendering techniques, 83
Depth control
  depth tone mapping, 67, 68
  disparity vector, 67
  stereo content reproduction, 66, 67
Depth estimation, 325, 328, 329, 338, 339
  acquisition, depth data, 68
  cost aggregation, 69
  cross-bilateral filter, 69
  depth map, 66
  depth smoothing, 72, 74
  post-processing algorithm, 73
  reference image, 72
  stereo matching algorithms, 68
Depth image-based rendering (DIBR), 75, 78
Depth maps, 54
Depth post-processing, 72, 73
Depth propagation
  algorithm, 88
  comparison, interpolation error, 93
  CSH algorithm, 89
  CSH matching, 91, 92
  depth interpolation, 86
  hash tables, 89
  interpolated depths, 92
  machine learning approach, 86
  motion information, 87
  patch-voting procedure, 90, 91
  semi-automatic conversion, 86
  superpixels, 87
  temporal propagation, 86
  2D to 3D conversion pipeline, 86
Depth tweening, 87
Dictionary-based CS MRI, 307–310
Dictionary learning, 309
Difference of Gaussians (DoG) filter, 26
Differential Mean Opinion Score (DMOS), 149
Directed acyclic graph (DAG), 201
Directional DoG filter, 26
Discrete cosine transform (DCT) coefficients, 118, 194, 209
Disparity estimation, 328, 344
Disparity extraction, 326
Disparity map estimation, 332
Document segmentation, 355, 356
Domain transfer, 255
Dynamic programming
  automatic editing, non-professional video footage, 182
  automatic video editing, 177
  cost function, 180–182
  evaluation, automatic video editing quality, 184, 185
  non-professional video photographers, 182–184
  problem statement, 178
  raw video materials for automatic editing, 182
  reference motion pictures, 182
  trainable system, 177
  transition quality, 179, 180
  video editing parameters, 177
E
Elliptical classifier, 202
Embedded/texture memory compression, 115, 116
Enrolled template, 399
Explicit authentication, 267
F
Fallback logic, 374
Fast real-time depth control technology, 66
Feature maps, 36, 37, 42, 43, 49–53
Federated learning (FL), 256
File Explorer for Win10, 352
Filter-bank implementation, 16–18
Fingerprint recognition, 397
Fisher’s linear discriminant (FLD), 211
Fixed Quantization via Table (FQvT), 123, 137, 139, 140, 142
Flashing light effect, 220, 221, 223
Floating point operations per second (FLOPs), 244
Frame rate conversion (FRC), 204, 205
  algorithms, 373, 374
  magic, 373
Fully automatic conversion algorithms, 82
Fully convolutional networks (FCN), 410
G
Gaussian distribution, 5
Gaussian mixture models (GMMs), 211
Generative adversarial networks (GANs), 47, 48, 51, 54
Global orthonormal sparsifying transform, 307
GoogLeNet-based network structure, 170, 171
GPU implementation, 314, 315
Greyscale video colourization, 87
H
Hadamard multiplication, 307
Hadamard transform, 141, 142
Hand-annotate existing film scenes, 164
Hardware complexity, 145
Hardware-based parallel data acquisition (pMRI) methods, 304
Hermitian transpose operation, 309
Hidden Markov Model (HMM), 164, 196
High-order spectral features (HOSF), 208
Histogram of gradient (HoG), 158
Histogram of optical flow (HoF), 158
Histogram-based contrast method, 226
Homography, 211
I
Image classification
  AI, 239
  algorithmic bias, 239
  applications, 237
  binary classifiers, 238
  data
    calibration dataset, 245
    commercial dataset, 246
    controlled data acquisition, 246
    crowdsourcing platform, 246
    data acquisition, 246
    data augmentation, 251
    data engineering, 249–250
    data management, 250
    DataOps, 245
    dataset preparation process, 245
    human annotation/verification, 247–249
    publicly available data, 246
    synthetic data acquisition, 247
  deployment, 257
  face recognition technology, 239
  hierarchical fine-grained localised multiclass classification system, 238
  hierarchical/flat classification systems, 238
  ICaaS, 237
  metrics and evaluation
    classification metrics, 241, 242
    end-to-end metrics, 241
    F1 score, 243
    groups of metrics, 240
    precision metric, 242
    recall metrics, 242
    training and inference performance metrics, 244
    uncertainty and robustness metrics, 244–245
  model architecture, 252–254
  model training
    classical computer vision approach, 251
    contrastive learning, 254
    data imbalance, 256
    domain transfer, 255
    fine-tuning, pre-trained model, 254
    FL, 256
    flagship smartphones, 251
    knowledge distillation, 254, 255
    self-supervised learning, 254
    semi-supervised approaches, 254
  multi-label classifiers, 238
  web-based service, 239
Image similarity, 41
Imitation learning, 179
  automatic editing model training, 172, 173
  classes, shot sizes, 171
  features extraction, 171–172
  frames, 170
  qualitative evaluation, 173–174
  quantitative evaluation, 174–177
  rules, hand-engineered features, 170
  video footage features extraction pipeline, 170, 171
Interpolations per second (IPS), 390
Iris feature extraction, 410–413
Iris image acquisition, 400
Iris quality checking, 400
Iris recognition for mobile devices
  auto-exposure approaches, 406
  automatic camera parameter adjustment, 403
  cluster construction procedure, 408
  ED and GD value, 408
  exposure time and MSV, 404
  face region brightness, 406
  FRR, 409
  global exposure, 404
  iris capturing hardware, 403
  iterative adjustment stage, 409
  MSV, 404, 405
  optimal recognition accuracy, 407
  performance comparison, 409
  photodiodes, 404
  recognition algorithm, 402
  sensors, 403
  sigmoid coefficients, 404
  special quality buffer, 403
  visualization, 405
  limitations, 414
  person recognition
    CNN, 400
    contactless capturing, 398
    on daily basis, 401
    human eye, 398, 400
    informative biometric trait, 398
    intra-class variation, 398
    minimal requirements, 400
    mobile device and user, 401
    mobile devices, 400
    muscle tissue, 398
    NIR camera, 400
    registration, new user, 399
    Samsung flagship devices, 402
    simplified scheme, 399
Iris recognition systems, 400
K
KISS FFT open-source library, 229
k-Nearest neighbour (KNN) classifier, 211
Knee AVP, 279, 281, 298
Knee MRI, 288
Knowledge-based methods, 397
Knowledge distillation, 254, 255
Kohonen networks, 196
L
LapSRN model, 47, 48
Latent Dirichlet allocation (LDA), 261, 266
LG Electronics 47GA7900, 78
Light polarisation, 61
Light star-shape templates, 224
Lightning effect, 233, 234
Location information, 272
Loss function, 36, 37, 41, 48
Lucas–Kanade (LK) optical flow, 31
M
Machine learning, 1, 6
  and computer vision, 252
  classification metrics, 241
  human biases, 240
  inference and training speed, 244
Magnetic resonance imaging (MRI)
  brain MRI, 279
  cardiac MRI, 291
  compressed sensing MRI, 306, 307
  computational performance, 315, 320
  computer tomography, 303
  convolutional neural network, 296
  CS algorithm, 306
  DESIRE-MRI, 313, 314, 316–320
  dictionary-based CS MRI, 307–310
  efficient CS MRI reconstruction, 310
  global sparsifying transforms, 305
  hardware-based acceleration, 304
  k-space, 303
  landmark detector training, 296
  noninvasive methods, 277
  scout image, 278
  spine MRI, 285, 296
  3D scout MRI volume acquisition, 280
  view planning, 277, 278
Magnifier effect, 233, 234
Matrix class, 12–15
Mean Opinion Score (MOS), 149
Mean sample value (MSV), 404
Mel-frequency cepstral coefficient (MFCC), 211
Middlebury dataset, 329
Midsagittal plane (MSP), 282–284
Mobeye, 246
Mobile device, 42–45, 48
Mobile iris recognition, see Iris recognition
Mobile user demographic prediction
  additive regularization, topic models, 263, 264
  analysis, 262
  application usage data, 260
  architecture of proposed solution, 262
  ARTM, 261
  call logs, 260
  demographic classifiers, 261
  document aggregation, 264–266
  flexibility, 261
  LDA, 261
  machine learning approaches, 266
  mobile data, 260
  NLP, 261
  PLSA, 261
  pre-processing of Web pages, 262
  probabilistic latent semantic analysis, 263
  SMS logs, 260
  topic model, 261
  Web data, 260
Modern smartphone, 397
  Android-based, 270
  available data, 260
  explicit authentication, 267
  passive authentication, 269
  sensors and data sources available, 260
  SoCs, 268
Modern super-resolution algorithms, 55
Modern super-resolution CNNs, 36
Modern video processing systems, 115
Modified FSRCNN architecture, 43
Motion compensation, 387
Motion estimation (ME), 374
  double-block processing, 379, 380
  evaluation, 3DRS algorithm modifications, 380–382
  slanted wave-front scanning order, 379
  3DRS algorithm, 376, 377
  wave-front scanning order, 377, 378
Motion information, 87, 95
Motion picture masterpieces, 172
Motion vectors
  comparison of errors, 101
  computing optical flow, 95
  evolution, data term importance, 98, 99
  Middlebury optical flow evaluation database, 101
  motion clustering, 99–101
  motion estimation results, 101
  non-local neighbourhood smoothness term, 101
  optical flow, 95
  solver, 98
  special weighting method, 97
  total variation optical flow
    using two colour channels, 96, 97
    with non-local smoothness term, 97
  variational optical flow, 96
Motion-compensated interpolation (MCI), 374, 381, 383, 385, 387–391
MPEG-7 visual descriptors, 211
Multi-band decomposition, 312
Multi-coil MRI reconstruction, 315, 316
Multi-frame SR
  image formation model, 1
  multilevel matrices, 21
  PSNR values, 24
  SISR, 6
  SR problem, 4
  3DRS, 31
  threefold model, 3
Multi-instance fusion, 414, 415, 420
Multi-instance iris recognition, 414, 416, 417, 419
Multilevel matrices, 21
Multimedia presentations, 219, 220
Multimedia slideshows, 219
Multi-view face detection, 225
N
Naïve bicubic down-sampling, 55
Natural Image Quality Evaluator (NIQE), 151
Natural language processing (NLP), 261
Natural stereo effect, 82
Nearest neighbours (NN)-finding algorithm, 182
Neural networks, 196
NIR camera, 400
Noisiest tomographic image, 368
Non-professional video photographers, 182–184
Non-subsampled shearlet transform (NSST), 211
Non-uniform-sized patches, 312, 313
NTIRE (New Trends in Image Restoration and Enhancement) challenge, 45, 46
Numerical simulation, 328–330
O
Occlusion processing (OP), 374, 382–384, 390
Occlusions, 382
Optical flow
  computing, 95
  depth interpolation results, 94
  Middlebury optical flow evaluation, 101
  two-colour variational, 95, 96
P
Parallax
  high values, 63
  interposition depth, 64
  negative (crossed), 63
  positive (uncrossed), 62
  positive diverged parallax, 63
  real 3D movies, 63
  6-metre-wide cinema screen, 63
  types, 62
  zero, 62
Parametric rectified linear activation units (PReLUs), 42, 44
Passive authentication, mobile users, 267–270
Patch-based approach, 305
Peak signal-to-noise ratio (PSNR), 41, 42, 45, 46, 52
Perception index (PI), 46
Perception-distortion plane, 46
Perfect shuffle matrix, 10, 15
Phase-encoding (PE) lines, 303
Physical sensors, 271
PIRM (Perceptual Image Restoration and Manipulation) challenge, 46
Point spread function (PSF), 329, 336
Poissonian distribution, 5
Portable devices, 155
Post-processing, 82
Practical SR
  DoG filter, 26
  final reliability max, 28
  half-resolution compensated frame, 26
  LST space, 28
  MF SR system architecture, 27
  motion estimation for Bayer MF SR, 30, 31
  post-processing, 28
  reconstruction filters, 28
  reconstruction model, 28
  structure tensor, 29, 30
  system architecture, 26
  visual comparison, 28
Precomputed dictionaries, 311, 312, 314, 320
Principal component analysis (PCA), 196
Privacy infringement, 259
Probabilistic latent semantic analysis (PLSA), 261–263
Progressive super-resolution network, 49
Q
Quantization
  and clipping, 127
  FQvT, 123
  linear prediction schemes, 118
  NSW quantizer set values, 123
  NSW transform, 121
R
Rainbow effect, 233, 234
RANSAC algorithm, 99
Real-time methods, 82
Real-time multistage digital video processing, 115
Real-time video classification algorithm, 197
Real-time video processing, 193, 194, 197, 204, 205, 207
Real-World Super-Resolution (RWSR) sub-challenges, 46
Rectified linear unit (ReLU), 36, 50, 51
Region of interest (ROI), 360
Reinforcement learning, 155
Residual network (ResNet), 252, 253
Residual-in-residual (RIR) module, 50
RGB colour channels, 205
RGB colour sensor, 326
RGB colour spaces, 332, 334
RGB video frames, 89
S
Scene change detector, 204, 205
Scene depth extraction, 325
Scout image, 278–280
Second-order channel attention (SOCA) module, 50, 51
Semi-automatic conversion algorithms, 82
Shooting in stereo, 82
Signal-to-noise ratio (SNR), 326
Similarity, 360, 361, 364
Single image super-resolution (SISR)
  AIM challenge, 46
  challenges, 45
  CNN training, 41
  competition, 45
  feature maps, 43
  FSRCNN, 42
  GANs, 47
  HR image, 35
  image quality metrics, 41
  implementation on mobile device, 42, 43, 45
  loss function, 41
  LR image, 35
  MRI in medicine, 54
  neural networks, 36, 38
  NTIRE competition, 45
  perception-distortion plane, 46
  PIRM challenge, 46
  PSNR, 41
  real imaging systems, 40
  single image zooming, 35
  SSIM, 41
  traditional image quality metrics, 41
  training datasets, 38–40
Single-frame super-resolution (SISR), 6
Skin colour pixels, 204
Smart TVs, 59
SmartNails, 354
Soap bubble effect, 230, 231
Soccer genre detection, 199, 208
Social video footage, 155
Special interactive authoring tools, 219
Spinal AVP approaches, 279
Spine AVP, 280, 291, 298
Spine MRI, 285, 296
Spine view planning, 279
Split Bregman initialization, 310, 311, 314, 316, 320
Sport genre detection
  multimedia applications, 193
  real-time detection, 193
  video classification (see Video sequence classification)
  video processing pipeline, 194
Sporting scenes, 199
Sports games, 194, 198, 199
Sports video categorisation system
  CMN structure, 213
  CNN-based approach, 211
  DCT coefficients, 194, 209
  homography, 211
  HOSF, 208
  hybrid deep learning framework, 212
  large-scale video classification framework, 212
  leveraged classical machine learning approaches, 209
  MPEG-7 visual descriptors, 211
  multimodal data, 210
  multiresolution CNN, 212, 213
  signature heat maps, 211
  via sensor fusion, 210
  WKLR, 209
  workflow, 210
Squared sum error (SSE), 123
Stereo content reproduction
  active shutter glasses, 62
  colour-based views separation, 62
  depth control, 66, 67
  passive systems with polarised glasses, 61
Stereo conversion, 82–84, 112
  See also 2D-3D semi-automatic video conversion
Stereo disparity estimation methods, 77
Stereo matching algorithms, 68
Stereo pair synthesis, 103, 108
Stereo rig, 82
Stereo synthesis
  DIBR, 75
  problems during view generation
    disocclusion area, 76
    symmetric vs. asymmetric, 77
    temporal consistency, 77
    toed-in configuration, 77, 78
    virtual view synthesis, 76
Structural similarity (SSIM), 41, 46
Structure tensor, 29, 30
Sub-pixel convolutions, 36, 41, 48
Sunlight spot effect, 225, 226, 230–232
Superpixels, 87
Super-resolution (SR)
  arrays, 4
  Bayer SR, 21–26
  BCCB matrices, 12
  block diagonalization in SR problems, 9, 10, 14
  circulant matrices, 11
  colour filter arrays, 5
  data fidelity, 7
  data-agnostic approach, 7
  filter-bank implementation, 16–18
  HR image, 1
  image formation model, 1–3, 5
  image interpolation applications, 2
  interpolation problem, 2
  LR images, 1
  machine learning, 6
  mature technology, 1
  on mobile device, 45
  modern research, 7
  optimization criteria, 6
  perfect shuffle matrix, 10, 15
  problem conditioning, 7
  problem formulation, fast implementation, 8–9
  reconstruction, 1
  sensor-shift, 3
  single- vs. multi-frame, 2
  single-channel SR, 19
  SISR, 6 (see also Single image super-resolution (SISR))
  symmetry properties, 19–21
  warping and blurring parameters, 35
Super-resolution multiple-degradations (SRMD) network, 53
Support vector machine (SVM), 194, 195, 208, 291
Symmetric stereo view rendering, 77
Synthetic data acquisition, 247
System on chip (SoC), 115
T
Temporal propagation, 105
Texture codes, 196
3D recursive search (3DRS) algorithm, 375
3D TVs
  on active shutter glasses, 62
  cause, eye fatigue, 62, 63, 65, 66
  consumer electronic shows, 59
  interest transformation
    extra cost, 60
    inappropriate moment, 59
    live TV, 61
    multiple user scenario, 61
    picture quality, 61
    uncomfortable glasses, 60
  parallax (see Parallax)
  prospective technologies, 59, 60
  smart TVs, 59
  stereo content (see Stereo content reproduction)
Thumbnail creation, 353, 354
Tiling slideshow, 220
Toed-in camera configuration, 77, 78
Tone-to-colour mapping, 229
Transposed convolution, 36
TV programmes, 194
TV screens, 373
2D-3D semi-automatic video conversion
  advantages and disadvantages, 82
  background inpainting step (see Background inpainting)
  causes, 81
  depth propagation from key frame (see Depth propagation)
  motion vector estimation (see Motion vectors)
  steps, 83
  stereo content quality, 111, 112
  stereo rig, 81, 82
  video analysis and key frame detection, 84–86
  view rendering, 110, 111
  virtual reality headsets, 81
U
Unpaired super-resolution, 55
Unsupervised contrastive learning, 254
User authentication, 267
User data collection, 269, 271–274
User interfaces, 351, 353, 354
V
Variable-length prefix encoding (VLPE), 119
Video clips generation, 165, 167
Video editing
  valuable footage, 155
Video sequence classification
  camera movement analysis, 195
  “energy flow of activity”, 195
  genre detection, 194
  modalities, 194
  movies by genre, 195
  PCA, 196
  properties of boundaries, 195
  real-time video classification algorithm, 197
  sporting events, 196
  SVM, 194, 195
  visual keys, 196
Video stream classifier, 204
Video summarising, 157–159
View planning
  automatic system (see Automatic view planning (AVP))
  cardiac, 278
  MRI, 277, 278
View rendering, 110, 111
Virtual view synthesis, 76, 78, 79
Visual clustering, 159
Visual keys, 196
Visual quality
  description, 149
  DMOS, 149
  MOS, 149
  testing methodology, 149
  video streams, 115
  by VLCCT
    categories of images, 149
    dataset for testing, 150
    datasets, 151
    machine learning methods, 151
    “perceptual loss” criterion, 151
    zoom, 149
Visual quality estimation, 362
Visually lossless colour compression technology (VLCCT)
  algorithms, 116
  architecture, 119, 120
  bitstream syntax, 143
  complexity analysis, 145–149
  compressed elementary packet, 119
  compression ratios, 117
  content analysis, 119
  data transformations, 118
  elementary operations, 146
  encoding methods (see Visually lossless encoding methods)
  ETC/ETC1/ETC2, 116
  fixed block truncation coding, 116
  high-level data processing pipeline, 118
  methods, 118
  prototypes, 116
  quality analysis (see Visual quality)
  redundancy elimination mechanisms, 117
  requirements, 117
  S3 Texture Compression, 116
  specifications, 152
  structure, VLCCT encoder, 146
  subjective testing, 148
  VLPE, 119
  weighted max-max criterion, 143
  weighted mean square error, 142
  with syntax elements, 143, 144
Visually lossless encoding methods
  2 × 2 sub-blocks and 2 × 4 blocks, 128, 130
  and 2 × 4 method, 133
  bits distribution, sub-method U1, 141
  CDVal distribution, 134–136
  C-method, 121, 125, 127
  decoding (reconstruction), 133
  difference values, sub-method U2, 140, 141
  D-method, 129
  E-method, 121, 127
  extra palette processing, 135
  features using lifting scheme, 122
  F-method, 130
  FQvT, 139
  free bits analysis, 125
  Hadamard transform, 141, 142
  InterCondition, 133
  intra-palette mode, 135
  LSB truncation, 132
  M-method, 132
  N-method, 121
  pixel errors, 122
  P-method, 121, 124, 125
  quantization and clipping, 127
  quantizers, 122, 123
  quantizers for P-method, 125
  red colour channel SAD, 138
  S-method, 121
  SSE, 123
  sub-modes in P-method, 124
  template values, 128
  U-method, 135
  2 × 4 pixels blocks
    D-method, 120
    F-method, 120
    H-method, 121
    L-method, 121
    M-method, 120
    O-method, 120
    U-method, 121
VMAF (Video Multi-Method Assessment Fusion), 151
Vowpal Wabbit, 173
W
Walsh-Hadamard (W-H) filters, 88
Warp-blur-down-sample model, 4
Web camera-based prototype, 344
Web data, 260, 261, 271, 272
Weighted kernel logistic regression (WKLR), 209
Weighted max-max criterion, 143
X
Xena, 116
X-ray microtomographic images, 353
Z
Zero filling (ZF), 304, 306
Zero-shot super-resolution, 55