GENERATIVE ADVERSARIAL NETWORKS FOR IMAGE-TO-IMAGE TRANSLATION

Edited by

ARUN SOLANKI
Assistant Professor, Department of Computer Science and Engineering, Gautam Buddha University, Greater Noida, India

ANAND NAYYAR
Lecturer, Researcher and Scientist, Duy Tan University, Da Nang, Viet Nam

MOHD NAVED
Assistant Professor, Analytics Department, Jagannath University, Delhi NCR, India
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2021 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-823519-5

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals
Publisher: Mara Conner
Acquisitions Editor: Chris Katsaropoulos
Editorial Project Manager: Gabriela D. Capille
Production Project Manager: Niranjan Bhaskaran
Cover Designer: Christian J. Bilbow

Typeset by SPi Global, India
Contributors
Er. Aarti Department of Computer Science & Engineering, Lovely Professional University, Phagwara, Punjab, India Supavadee Aramvith Multimedia Data Analytics and Processing Research Unit, Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand Tanvi Arora Department of CSE, CGC College of Engineering, Landran, Mohali, Punjab, India Betul Ay Firat University Computer Engineering Department, Elazig, Turkey Galip Aydin Firat University Computer Engineering Department, Elazig, Turkey Junchi Bin University of British Columbia, Kelowna, BC, Canada Erik Blasch MOVEJ Analytics, Dayton, OH, United States Udaya Mouni Boppana Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia Najihah Chaini Faculty of Applied Sciences and Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia Amir H. Gandomi University of Technology Sydney, Ultimo, NSW, Australia Aashutosh Ganesh Radboud University, Nijmegen, The Netherlands Thittaporn Ganokratanaa Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
Koshy George SRM University—AP, Guntur District, Andhra Pradesh, India Meenu Gupta Department of Computer Science & Engineering, Chandigarh University, Ajitgarh, Punjab, India Álvaro S. Hervella CITIC Research Center; VARPA Research Group, Biomedical Research Institute of A Coruña (INIBIC), University of A Coruña, A Coruña, Spain Kavikumar Jacob Faculty of Applied Sciences and Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia Rachna Jain Department of Computer Science & Engineering, Bharati Vidyapeeth's College of Engineering, Delhi, India S. Jayalakshmy IFET College of Engineering, Villupuram, India Leta Tesfaye Jule Department of Physics, College of Natural and Computational Science; Centre for Excellence in Indigenous Knowledge, Innovative Technology Transfer and Entrepreneurship, Dambi Dollo University, Dambi Dollo, Ethiopia A. Sampath Kumar Department of Computer Science and Engineering, Dambi Dollo University, Dambi Dollo, Ethiopia Meet Kumari Department of Electronics & Communication Engineering, Chandigarh University, Ajitgarh, Punjab, India Lakshay Department of Computer Science & Engineering, Bharati Vidyapeeth's College of Engineering, Delhi, India Zheng Liu University of British Columbia, Kelowna, BC, Canada H.R. Mamatha Department of CSE, PES University, Bengaluru, India Omkar Metri Department of CSE, PES University, Bengaluru, India
Aida Mustapha Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia D. Nagarajan Department of Mathematics, Hindustan Institute of Technology and Science, Chennai, India Jorge Novo CITIC Research Center; VARPA Research Group, Biomedical Research Institute of A Coruña (INIBIC), University of A Coruña, A Coruña, Spain Marcos Ortega CITIC Research Center; VARPA Research Group, Biomedical Research Institute of A Coruña (INIBIC), University of A Coruña, A Coruña, Spain Lakshmi Priya Manakula Vinayaga Institute of Technology, Pondicherry, India Krishnaraj Ramaswamy Centre for Excellence in Indigenous Knowledge, Innovative Technology Transfer and Entrepreneurship; Department of Mechanical Engineering, Dambi Dollo University, Dambi Dollo, Ethiopia S. Mohamed Mansoor Roomi Department of Electronics and Communication Engineering, Thiagarajar College of Engineering, Madurai, India Jose Rouco CITIC Research Center; VARPA Research Group, Biomedical Research Institute of A Coruña (INIBIC), University of A Coruña, A Coruña, Spain Angel D. Sappa ESPOL Polytechnic University, CIDIS-FIEC, Guayaquil, Ecuador; Computer Vision Center, Edifici O, Campus UAB, Bellaterra, Barcelona, Spain K. Saruladha Pondicherry Engineering College, Department of Computer Science and Engineering, Puducherry, India A. Sasithradevi School of Electronics Engineering, VIT University, Chennai, India R. Sivaranjani Department of Electronics and Communication Engineering, Sethu Institute of Technology, Madurai, India Rituraj Soni Department of CSE, Engineering College Bikaner, Bikaner, Rajasthan, India
S. Sountharrajan School of Computing Science and Engineering, VIT Bhopal University, Bhopal, India Patricia L. Suárez ESPOL Polytechnic University, CIDIS-FIEC, Guayaquil, Ecuador Gnanou Florence Sudha Pondicherry Engineering College, Pondicherry, India E. Thirumagal Pondicherry Engineering College, Department of Computer Science and Engineering, Puducherry; REVA University, Bengaluru, India Boris X. Vintimilla ESPOL Polytechnic University, CIDIS-FIEC, Guayaquil, Ecuador N. Yuuvaraj Research and Development, ICT Academy, Chennai, India Ran Zhang University of British Columbia, Kelowna, BC, Canada
CHAPTER 1
Super-resolution-based GAN for image processing: Recent advances and future trends
Meenu Gupta (a), Meet Kumari (b), Rachna Jain (c), and Lakshay (c)
(a) Department of Computer Science & Engineering, Chandigarh University, Ajitgarh, Punjab, India
(b) Department of Electronics & Communication Engineering, Chandigarh University, Ajitgarh, Punjab, India
(c) Department of Computer Science & Engineering, Bharati Vidyapeeth's College of Engineering, Delhi, India
1.1 Introduction

The idea of the intelligent machine is older than the computer itself. In 1950, the computer scientist, logician, and mathematician Alan Turing penned a paper for the generations to come, "Computing Machinery and Intelligence" [1]. Today, computers can match and in many tasks outperform humans: machine learning algorithms excel at recognizing patterns in existing image data and at using learned features for tasks such as classification and regression, from face recognition to cleaning up a patient's medical images. When asked to generate new data, however, computers have struggled [2]. An algorithm can defeat a chess grandmaster, flag a fraudulent transaction, or decide whether a medical report indicates disease, yet it fails at some of humanity's most basic capacities, such as crafting an original creation or holding a pleasant conversation. Turing's paper proposed a test for machine intelligence, the imitation game, now known as the Turing test [3]: behind a closed door, an observer converses with two hidden counterparts, a computer and a human, and tries to tell them apart. In 2014, the situation for data generation changed when Ian Goodfellow invented generative adversarial networks (GANs). This technique enables computers to generate realistic data by using two separate neural networks. Before GANs, programmers had proposed different ways to produce and evaluate generated data, but the results were not up to the mark. When GANs were first introduced, they showed remarkable results: the generated fake images could be indistinguishable from photographs and exhibited real-world-like quality. GANs can even turn scribbled sketches into photograph-like images [4].
Fig. 1.1 Improvement in the realism of images generated by generative adversarial networks over the years [4].
Fig. 1.1 shows how far GANs have come in generating and improving realistic images in recent years. The earliest faces, produced by a GAN in 2014, were blurry, and even that was celebrated as a success; within just three years it had become hard to tell which faces were fake and which were high-resolution portrait photographs [5]. GANs are a category of machine learning techniques that use two simultaneously trained models: a generator that produces fake data and a discriminator that separates the generated data from the real dataset images. The word generative indicates that new data are created from the given data: the GAN learns to generate data resembling the chosen training set. The term adversarial refers to the dynamic between the two models, in which each network continually tries to beat the other: the generator produces ever more convincing fake images, while the discriminator tries to distinguish the real examples from the generated ones. The word networks indicates the class of models used: the generator and the discriminator are commonly implemented as neural networks, whose complexity can range from simple to elaborate depending on the GAN implementation [6]. In training, the discriminator first receives input only from the real training dataset; afterward it has two input sources, the actual data and the fake examples coming from the generator. A random noise vector is passed through the generator, and the output is a fake example that tries to resemble the real data as closely as possible. The discriminator outputs the probability that its input is real. Training the two models against each other forces the generator to produce fakes that the discriminator, whose goal is to differentiate generated data from real input examples, can no longer reject. Sections 1.1.1 and 1.1.2 discuss the training of the discriminator and the generator [7].
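Before walking through those training steps, the two models can be sketched in code. The following is a minimal illustrative example in PyTorch (not from this chapter); the MLP layer sizes and the flattened 28 x 28 image shape are assumptions chosen for brevity.

```python
# Minimal generator and discriminator sketch (illustrative only).
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim=100, img_dim=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 256), nn.ReLU(),
            nn.Linear(256, img_dim), nn.Tanh(),  # outputs scaled to [-1, 1]
        )

    def forward(self, z):
        # z: batch of random noise vectors -> batch of fake images
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, img_dim=28 * 28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),  # probability that the input is real
        )

    def forward(self, x):
        return self.net(x)
```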
Fig. 1.2 Train the discriminator.
1.1.1 Train the discriminator

Fig. 1.2 illustrates the training of the discriminator; the steps are as follows [8], with a code sketch given after this list:
(a) First, get a random real example x from the training dataset.
(b) Next, get a new random vector z and, using the generator network, synthesize a fake example x*.
(c) Use the discriminator network to classify x and x*.
(d) Compute the classification error and backpropagate it, updating the discriminator's weights and biases so as to minimize the error.
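A sketch of this discriminator update, assuming the generator/discriminator modules and a binary cross-entropy loss as in the previous snippet (the optimizer and noise dimension are illustrative choices, not the chapter's):

```python
# One discriminator update following steps (a)-(d) above.
import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_discriminator_step(G, D, d_optimizer, real_batch, noise_dim=100):
    d_optimizer.zero_grad()
    batch_size = real_batch.size(0)

    # (a) real example x from the training set
    real_pred = D(real_batch)
    loss_real = bce(real_pred, torch.ones(batch_size, 1))

    # (b) random vector z -> fake example x* from the generator
    z = torch.randn(batch_size, noise_dim)
    fake_batch = G(z).detach()          # generator weights are not updated here

    # (c) discriminate between x and x*
    fake_pred = D(fake_batch)
    loss_fake = bce(fake_pred, torch.zeros(batch_size, 1))

    # (d) backpropagate the classification error and update D's weights and biases
    loss = loss_real + loss_fake
    loss.backward()
    d_optimizer.step()
    return loss.item()
```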
1.1.2 Train the generator

Fig. 1.3 shows the training of the generator; the steps are as follows, with a code sketch given after this list:
(a) Choose a new random vector z and use the generator to create a fake example x*.
(b) Use the discriminator to classify the real and fake examples.
(c) Compute the classification error and backpropagate it, updating the generator's weights and biases so as to minimize the error; the discriminator is kept fixed in this step [9].
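A matching sketch of the generator update under the same assumptions; note that only the generator's parameters are updated here:

```python
# One generator update following steps (a)-(c) above (illustrative sketch).
import torch
import torch.nn as nn

bce = nn.BCELoss()

def train_generator_step(G, D, g_optimizer, batch_size, noise_dim=100):
    g_optimizer.zero_grad()

    # (a) random vector z -> fake example x*
    z = torch.randn(batch_size, noise_dim)
    fake_batch = G(z)

    # (b) the discriminator classifies the fakes
    fake_pred = D(fake_batch)

    # (c) the generator wants the fakes to be classified as real (label 1);
    #     backpropagate and update only the generator's weights and biases
    loss = bce(fake_pred, torch.ones(batch_size, 1))
    loss.backward()
    g_optimizer.step()
    return loss.item()
```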
1.1.3 Organization of the chapter

The remainder of this chapter is organized as follows. Section 1.2 discusses the background of this work and the views of different researchers. Section 1.3 presents the SR-GAN model for image processing. Section 1.4 discusses application-oriented GAN case studies, including the enhancement of object detection. Section 1.5 discusses the open issues and challenges faced when working with GANs. Section 1.6 concludes the chapter and outlines its future scope.
Fig. 1.3 Train the generator.
1.2 Background study

The goal of Perera et al. [10] is to determine whether a given query belongs to the same class as the training data or to a different class. Their solution is based on learning the latent representation of in-class examples using a denoising autoencoder network, and it gives good results on the COIL, MNIST, and f-MNIST datasets. GANs also raise a new concern for face recognition: fake face images can aid identity theft and privacy breaches. Do et al. [11] proposed a forensic face-detection technique that uses a deep face recognition system as the core of their model and repeatedly creates fake images for data augmentation. Tripathy et al. [12] present a generic face animator that controls the pose and expression of a given face image, implemented as a two-stage neural network model learned in a self-supervised manner. Rathgeb et al. [13] proposed a supervised deep learning algorithm using CNNs to detect synthetic images; it achieves an accuracy of 99.83% in distinguishing real dataset images from fake images generated by GANs. Yu et al. [14] advanced the field further with a method for visual forensics and model attribution: their model supports image attribution, enables fine-grained model authentication, persists across different image frequencies, fingerprint frequencies, and paths, and is not biased. Lian et al. [15] introduce a guidance module in FG-SRGAN that is used to reduce the space of possible mapping functions and helps to learn the correct
mapping from the low-resolution to the high-resolution domain; the guidance module also greatly reduces the adversarial loss. Takano and Alaghband [16] studied the SRGAN model and addressed the problem of sharpening images: converting a low-resolution image to a high-resolution one gives at least a hint of what the real image looks like from a blurry input. Dou et al. [17] proposed PCA-SRGAN, which greatly improves the performance of GAN-based models on super-resolving face images by performing incremental discrimination in the orthogonal projection space spanned by PCA, feeding progressively finer details into the discriminator. Jiang et al. [18] proposed to improve the perceptual quality of CT images using SRGAN, greatly enhancing spatial resolution so that diseases can be analyzed in tiny regions and pathological features; they introduced a dilated convolution module, and a mean structural similarity (MSSIM) loss is also introduced to improve the perceptual loss function. Li et al. [19] provide an improved SRGAN and a solution to the problem of image distortion in textile flaw detection through super-resolution image reconstruction; their experiments show that the PSNR of SRGAN is 0.83 higher than that of bilinear interpolation, and the SSIM is 0.0819 higher. SRGAN yields a clearer image and reconstructs richer texture with more high-frequency details, making defects easier to identify, which is important in the flaw detection of fabrics. Wang et al. [20] used dense convolutional network blocks (dense blocks), which connect each layer to every other layer in a feed-forward manner, as very deep generator networks; combined with spectral normalization, the method offers better training stability and visual improvements. Nan et al. [21] addressed the complex computation, unstable training, and slow learning speed of SRGAN by proposing a single-image super-resolution reconstruction model called Res_WGAN based on ResNeXt. Li et al. [22] discussed an edge-enhanced super-resolution network (EESR) for better generation of high-frequency structures in blind super-resolution; EESR recovers textures at 4x upsampling and gained a PTS of 0.6210 on the DIV2K test set, much better than state-of-the-art methods. Sood et al. [23] worked on magnetic resonance (MR) images, for which patients must wait a long time in a still state to obtain high-resolution scans; acquiring low-resolution images and then converting them to high resolution was evaluated with four models, SRGAN, SRCNN, SRResNet, and sparse representation, among which SRGAN gives the best result.
Lee et al. [24] present CSRGAN, a super-resolution model specialized for license plate images and trained with a novel character-based perceptual loss; they focus on character-level recognizability of the super-resolved images rather than pixel-level reconstruction. Chen et al. [25] divided the task into two parts, improving PSNR and improving visual quality: they propose a new dense block with complex connections between layers to build a more powerful generator, and they use a new set of feature maps to compute the perceptual loss so that the output image looks more real and natural. Jeon et al. [26] proposed a method that increases the similarity between pixels by applying a ResNet module, which has an effect similar to ensembling and yields a better high-resolution image. Because remote sensing images have low resolution, higher resolution is required to improve performance; in that work, the generator is first optimized with residual-in-residual dense blocks without BN (batch normalization), then a relativistic GAN is introduced, and the perceptual loss is improved [27].
1.3 SR-GAN model for image processing

Image super-resolution is defined as increasing the size of an image while keeping the reduction in quality to a minimum, or as creating a high-resolution image from a low-resolution one using the details present in the original image. The problem is difficult because, for a given low-resolution input, multiple plausible solutions exist. SR-GAN has numerous applications, such as medical image processing, satellite imaging, and aerial image analysis [28].
1.3.1 Architecture of SR-GAN

Many methods for single-image super-resolution are fast and accurate, yet something is still missing: the texture of the original image's features. The low-resolution image can be recovered so that the output is not distorted, but not all of the errors introduced along the way can be corrected. The main symptom is that results with a high peak signal-to-noise ratio (PSNR) appear to be of good quality while lacking high-frequency details. Earlier approaches also measure similarity in pixel space, which leads to blurry or unsatisfying images. For this reason SR-GAN is introduced, a model that captures the perceptual difference between the ground-truth image and the model output. Fig. 1.4 shows the architecture of SRGAN [29].
Fig. 1.4 Architecture of SRGAN [29].
The training algorithm of SRGAN consists of the following steps, with a code sketch given after this list:
(a) Downsample the HR (high-resolution) images to obtain LR (low-resolution) samples; both LR and HR images are required to train the network.
(b) Pass the LR images through the generator, which upsamples them and produces SR (super-resolution) images.
(c) Pass the SR and HR images through the discriminator, which classifies them as real or generated, and backpropagate the resulting losses [30].
Fig. 1.5 presents the generator and discriminator networks. They contain convolution layers, parametric ReLU (PReLU) activations, and batch normalization; the generator also uses skip connections similar to ResNet [31].
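The following schematic shows how these steps might be wired together in PyTorch. It is an illustrative sketch, not the chapter's exact implementation; the 4x scale factor, the content-loss callable, and the small 10^-3 adversarial weight follow the SRGAN formulation discussed later in Section 1.3.3.

```python
# Schematic SRGAN training iteration for steps (a)-(c); D is assumed to output probabilities.
import torch
import torch.nn.functional as F

def srgan_training_step(G, D, g_opt, d_opt, hr_batch, content_loss_fn, scale=4):
    # (a) obtain LR samples by downsampling the HR images
    lr_batch = F.interpolate(hr_batch, scale_factor=1.0 / scale,
                             mode="bicubic", align_corners=False)

    # (b) the generator upsamples LR to SR
    sr_batch = G(lr_batch)

    # (c) discriminator update: classify HR as real and SR as fake, then backpropagate
    d_opt.zero_grad()
    d_loss = -(torch.log(D(hr_batch) + 1e-8).mean()
               + torch.log(1 - D(sr_batch.detach()) + 1e-8).mean())
    d_loss.backward()
    d_opt.step()

    # generator update: content (perceptual) loss plus a small adversarial term (Section 1.3.3)
    g_opt.zero_grad()
    adv_loss = -torch.log(D(sr_batch) + 1e-8).mean()
    g_loss = content_loss_fn(sr_batch, hr_batch) + 1e-3 * adv_loss
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```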
1.3.2 Network architecture

Deep networks are difficult to train; the residual learning framework makes training easier and enables networks to go substantially deeper, which improves performance. A total of 16 residual blocks are used in the generator [32]. For upsampling the feature maps, the generator uses sub-pixel convolution: each time pixel shuffle is applied, it rearranges the elements of an L*B*H*r*r tensor into an rL*rB*H tensor. To avoid additional computation, the bicubic filter is removed from the pipeline. Parametric ReLU (PReLU) is used instead of ReLU or LeakyReLU; PReLU adds a learnable parameter so that the coefficient of the negative part is learned adaptively. The convolution layer "k3n64s1" denotes 3*3 kernel filters outputting 64 channels with stride 1; "k3n256s1" and "k9n3s1" are other convolution layers defined analogously [33].
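A sketch of these building blocks in PyTorch; the channel counts mirror the k3n64s1/k3n256s1 notation above, but the exact configuration is illustrative rather than the chapter's definition.

```python
# Residual block (k3n64s1 style) and sub-pixel (PixelShuffle) upsampling block.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),  # k3n64s1
            nn.BatchNorm2d(channels),
            nn.PReLU(),                      # learnable negative-part coefficient
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)              # skip connection, as in ResNet

class UpsampleBlock(nn.Module):
    """k3n256s1 convolution followed by PixelShuffle: (B, 64, H, W) -> (B, 64, 2H, 2W)."""
    def __init__(self, channels=64, upscale=2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * upscale ** 2,
                              kernel_size=3, stride=1, padding=1)
        self.shuffle = nn.PixelShuffle(upscale)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))
```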
1.3.3 Perceptual loss

In the equation below, L^{SR} denotes the perceptual loss, a commonly used objective built on the mean square error; it evaluates the loss with respect to the characteristics of the solution rather than only its pixels.
Fig. 1.5 SRGAN-based model architecture for seismic images [30].
Here L^{SR}_X denotes the content loss and the last term is the adversarial loss, as shown in Eq. (1.1); the weighted sum of the two gives the perceptual loss (with VGG-based content loss):

L^{SR} = L^{SR}_X + 10^{-3} L^{SR}_{Adv}   (1.1)
1.3.3.1 Content loss

The pixel-wise mean square error loss is evaluated as

L^{SR}_{MSE} = \frac{1}{r^2 L B} \sum_{x=1}^{rL} \sum_{y=1}^{rB} \left( I^{HR}_{x,y} - G_{\theta_G}(I^{LR})_{x,y} \right)^2   (1.2)
Eq. (1.2) is the most widely used optimization target for image SR, on which many state-of-the-art approaches rely [34, 35]. However, while achieving
particularly high PSNR, solutions of MSE optimization problems often lack high-frequency content, which results in perceptually unsatisfying images with overly smooth textures [36]. Rather than relying on pixel-wise losses, we build on the ideas of Shi et al. [35], Denton et al. [37], and Ledig et al. [4] and use a loss function that is closer to perceptual similarity. We define the VGG loss based on the ReLU activation layers of the pretrained 19-layer VGG network described in Simonyan and Zisserman [38]. With \phi_{i,j} denoting the feature map obtained by the j-th convolution before the i-th max-pooling layer within the VGG19 network, the VGG loss is the Euclidean distance between the feature representations of the reconstructed image G_{\theta_G}(I^{LR}) and the reference image I^{HR}, as shown in Eq. (1.3) [39]:

L^{SR}_{VGG/i,j} = \frac{1}{L_{i,j} B_{i,j}} \sum_{x=1}^{L_{i,j}} \sum_{y=1}^{B_{i,j}} \left( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}(G_{\theta_G}(I^{LR}))_{x,y} \right)^2   (1.3)
Here L_{i,j} and B_{i,j} describe the length and the width of the corresponding feature map within the VGG network.

1.3.3.2 Adversarial loss

In addition to the content loss discussed above, the generative component of the GAN is added to the perceptual loss. It encourages the network to favor solutions that reside on the manifold of natural images by attempting to fool the discriminator. The generative loss L^{SR}_{adv} is defined based on the probabilities that the discriminator D_{\theta_D}(G_{\theta_G}(I^{LR})) assigns over all training samples, as shown in Eq. (1.4) [40]:

L^{SR}_{adv} = \sum_{n=1}^{N} -\log D_{\theta_D}(G_{\theta_G}(I^{LR}))   (1.4)
where D_{\theta_D}(G_{\theta_G}(I^{LR})) is the probability that the discriminator assigns to the reconstructed image G_{\theta_G}(I^{LR}) being a real high-resolution image. For better gradient behavior, we minimize -\log D_{\theta_D}(G_{\theta_G}(I^{LR})) rather than \log(1 - D_{\theta_D}(G_{\theta_G}(I^{LR}))) [41].
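Putting Eqs. (1.1)-(1.4) together, a hedged sketch of the perceptual loss might look as follows; the particular VGG19 feature layer (the activations before the last max-pooling) and the use of torchvision's pretrained weights are assumptions made for illustration, and inputs are assumed to be already normalized for VGG.

```python
# Sketch of the perceptual loss: VGG content loss plus 1e-3 * adversarial loss.
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    def __init__(self, adv_weight=1e-3):
        super().__init__()
        # frozen VGG19 feature extractor phi_{i,j}
        self.vgg = vgg19(pretrained=True).features[:36].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False
        self.mse = nn.MSELoss()
        self.adv_weight = adv_weight

    def forward(self, sr, hr, d_sr):
        # content loss: Euclidean distance between VGG feature maps, Eq. (1.3)
        content = self.mse(self.vgg(sr), self.vgg(hr))
        # adversarial loss: -log D(G(I_LR)), Eq. (1.4)
        adversarial = -torch.log(d_sr + 1e-8).mean()
        # perceptual loss, Eq. (1.1)
        return content + self.adv_weight * adversarial
```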
1.4 Case study

This section presents four case studies: the application of EE-GAN to enhance object detection, edge-enhanced GAN for remote sensing images, the application of SRGAN to video surveillance and forensics, and the super-resolution of video using SRGAN.
1.4.1 Case study 1: Application of EE-GAN to enhance object detection

Detection performance for small objects in remote sensing images has been far less satisfactory than for large objects, especially in noisy, low-resolution images. The enhanced super-resolution GAN (ESRGAN) provides significant image enhancement
output. However, the reconstructed images generally lose high-frequency edge information, so object detection performance for small objects still degrades on low-resolution, noisy remote sensing images. The approach therefore uses residual-in-residual dense blocks (RRDB) in both the ESRGAN backbone and an edge-enhancement network (EEN), and couples them with a detector such as a faster region-based convolutional network (FRCNN) or a single-shot detector (SSD) [42].
1.4.2 Case study 2: Edge-enhanced GAN for remote sensing image

Recent super-resolution (SR) techniques based on deep learning offer significant advantages, but they still fall short in recovering high-frequency edge details under noise-contaminated conditions such as remote sensing satellite images. A GAN-based edge-enhancement network (EEGAN) is therefore used for robust satellite image SR reconstruction with adversarial learning that is insensitive to noise. EEGAN comprises two primary subnetworks: an ultra-dense subnetwork (UDSN) and an edge-enhancement subnetwork (EESN). First, 2-D dense blocks are assembled in the UDSN for feature extraction to obtain an intermediate high-resolution result that looks sharp but contains artifacts and noise. The EESN then enhances and extracts the image contours by purifying the noise-contaminated components with mask processing. The recovered enhanced edges and the intermediate image are combined to produce clear, high-credibility results. Extensive experiments on Jilin-1 video satellite images, the Kaggle open-source dataset, and DigitalGlobe imagery show better reconstruction performance than previous SR methods [43].
1.4.3 Case study 3: Application of SRGAN on video surveillance and forensic application

Person reidentification (REID) is an important task in forensics and video applications. Many past methods rely on the assumption that person images have sufficiently high and uniform resolutions, but scale mismatches and low resolution are always present in open-world REID; this setting is known as scale-adaptive low-resolution person re-identification (SALR-REID). The intuitive way to address the issue is to uniformly increase the various low resolutions to a high resolution. SRGAN is a popular image super-resolution deep network, but it is built with a fixed upscaling parameter, so it is not directly suitable for SALR-REID, which requires a network that not only synthesizes image features for judging a person's identity but also has scale-adaptive upscaling capability. Multiple SRGANs are therefore grouped in series and supplemented with a plugged-in identification network to strengthen image feature representation, yielding a cascaded super-resolution GAN (CSRGAN) framework with a unified formulation [44].
1.4.4 Case study 4: Super-resolution of video using SRGAN

SRGAN techniques are used to improve image quality. There are several kinds of image transformation in which a computing system takes an input and produces an output image. A GAN is a deep neural network consisting of two networks, a discriminator and a generator, and GANs are well suited to generative design tasks such as portrait drawing or symphony composition. SRGAN offers several advantages over other methods: it proposes a perceptual loss factor that combines content and adversarial losses. The discriminator block discriminates real HR images from the produced super-resolved images [45], while the generator is trained by propagating the model losses. The adversarial loss uses a discriminator network trained to distinguish between the two kinds of pictures, whereas the content loss uses perceptual similarity rather than similarity in pixel space. A notable property of SRGAN is that it produces data resembling real data: SRGANs learn internal representations to produce upscaled images [46], and the network faithfully recovers photo-realistic textures from downgraded images. SRGAN methods not only deliver a high peak signal-to-noise ratio but also give high visual perception quality and efficiency. Combining the adversarial and perceptual losses produces high-quality super-resolution images; moreover, during training, perceptual losses evaluate image similarity more robustly than per-pixel losses and identify the high-level semantic and perceptual differences between generated images [42].
1.5 Open issues and challenges

Training GAN models raises several major problems, including nonconvergence, mode collapse, and diminished gradients caused by an imbalance between the two models. GANs are sensitive to hyperparameter settings. Sometimes the model partially collapses [45]: the gradient with respect to the input approaches zero and the model collapses. When training restarts, the discriminator detects the single mode the generator has settled on, takes charge, and the generator merely shifts from one point to the next most likely point [46]. Maintaining the balance between the generator and the discriminator, and avoiding overfitting of one to the other, is another main challenge; one proposed remedy is to use a cost function with a nonvanishing gradient instead. Nonconvergence occurs for both low and high mesh quality [47]. GANs are also hard to apply to static data, because more complex convolution layers would be required to separate real from fake static data; some results exist in theory but cannot yet be implemented [7]. Alongside the various merits of GANs, open challenges also remain for their employment in medical imaging. In cross-modality image synthesis and image
reconstruction, most tasks still adopt traditional shallow reference metrics such as PSNR, SSIM, or MAE for quantitative analysis. However, these measures do not correlate well with the visual quality of the image: for example, direct optimization of a pixel-wise loss produces blurry results yet yields higher numbers than using an adversarial loss [48]. This makes it very difficult to interpret such side-by-side numeric comparisons of GAN-based methods, particularly when other losses are involved. One way to reduce this issue is to use downstream tasks such as classification or segmentation to validate the quality of the produced samples; another is to recruit domain experts, but that is time-consuming, expensive, and hard to scale [49]. Today, GANs have been applied to more than 20 basic application areas. Among the most important are satellite images, where GANs are well suited to the training and testing of imagery. Medical images such as MRI and X-ray scans are of low resolution, and their edges are not sharp enough for extracting richer features; this can be addressed with the help of SR-GAN and EE-GAN [50].
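Since PSNR (together with SSIM and MAE) is the reference metric quoted throughout this chapter, a minimal helper for computing it is shown below; it is a sketch that assumes images are float tensors scaled to [0, 1].

```python
# Minimal PSNR helper for the shallow reference metrics discussed above.
import torch

def psnr(img_a: torch.Tensor, img_b: torch.Tensor, max_val: float = 1.0) -> float:
    mse = torch.mean((img_a - img_b) ** 2)
    if mse == 0:
        return float("inf")           # identical images
    return (10 * torch.log10(max_val ** 2 / mse)).item()
```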
1.6 Conclusion and future scope

Before the discovery of GANs, image processing of satellite images or medical X-ray images was quite hard for feature extraction purposes, and classification also suffered from high error rates. In a single satellite image, one pixel may represent 10 m or more on the ground, which greatly limits feature extraction. Because such images are of low quality and their objects are blurry, SR-GAN is used to obtain high-resolution images; since both models are trained at the same time, the training time is greatly reduced. GANs are now widely used to generate fake data, and many algorithms have been proposed that make fake things appear real. GANs have several other applications, including creating recipes, songs, and fake images of people, generating cartoon characters, generating new human poses, face aging, and photo blending; these are the areas where GANs are commonly applied at present. In the future, GANs could be used to create videos of robot motion and to train robots through progressive enhancement. Some researchers are working on the de novo generation of new molecules with desired properties in silico, and many others are working on applying GANs to the autonomous driving of self-driving cars.
References [1] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, J. Jiang, Edge-enhanced GAN for remote sensing image superresolution, IEEE Trans. Geosci. Remote Sens. 57 (8) (2019) 5799–5812. [2] V. Ramakrishnan, A.K. Prabhavathy, J. Devishree, A survey on vehicle detection techniques in aerial surveillance, Int. J. Comput. Appl. 55 (18) (2012).
[3] S. Mahdizadehaghdam, A. Panahi, H. Krim, Sparse generative adversarial network, in: Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), 2019, pp. 3063–3071. [4] C. Ledig, L. Theis, F. Husza´r, J. Caballero, A. Cunningham, A. Acosta, W. Shi, Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690. [5] S. Borman, R.L. Stevenson, Super-resolution from image sequences—a review, in: 1998 Midwest Symposium on Circuits and Systems (Cat. No. 98CB36268), IEEE, 1998, pp. 374–378. [6] D. Dai, R. Timofte, L. Van Gool, Jointly optimized regressors for image super-resolution, in: Computer Graphics Forum, vol. 34, 2015, pp. 95–104. No. 2. [7] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, Y. Fu, Image super-resolution using very deep residual channel attention networks, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 286–301. [8] N.M. Nawawi, M.S. Anuar, M.N. Junita, Cardinality improvement of Zero Cross Correlation (ZCC) code for OCDMA visible light communication system utilizing catenated-OFDM modulation scheme, Optik 170 (2018) 220–225. [9] P. Mamoshina, L. Ojomoko, Y. Yanovich, A. Ostrovski, A. Botezatu, P. Prikhodko, I.O. Ogu, Converging blockchain and next-generation artificial intelligence technologies to decentralize and accelerate biomedical research and healthcare, Oncotarget 9 (5) (2018) 5665–5690. [10] P. Perera, R. Nallapati, B. Xiang, Ocgan: one-class novelty detection using gans with constrained latent representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2898–2906. [11] N.T. Do, I.S. Na, S.H. Kim, Forensics face detection from gans using convolutional neural network, in: Proceeding of 2018 International Symposium on Information Technology Convergence (ISITC 2018), 2018. [12] S. Tripathy, J. Kannala, E. Rahtu, Icface: interpretable and controllable face reenactment using gans, in: The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 3385–3394. [13] C. Rathgeb, A. Dantcheva, C. Busch, Impact and detection of facial beautification in face recognition: an overview, IEEE Access 7 (2019) 152667–152678. [14] N. Yu, L.S. Davis, M. Fritz, Attributing fake images to gans: learning and analyzing gan fingerprints, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7556–7566. [15] S. Lian, H. Zhou, Y. Sun, Fg-srgan: a feature-guided super-resolution generative adversarial network for unpaired image super-resolution, in: International Symposium on Neural Networks, Springer, Cham, 2019, pp. 151–161. [16] N. Takano, G. Alaghband, Srgan: Training Dataset Matters, arXiv, 2019. preprint arXiv:1903.09922. [17] H. Dou, C. Chen, X. Hu, Z. Hu, S. Peng, PCA-SRGAN: Incremental Orthogonal Projection Discrimination for Face Super-Resolution, arXiv, 2020. preprint arXiv:2005.00306. [18] X. Jiang, Y. Xu, P. Wei, Z. Zhou, CT image super resolution based on improved SRGAN, in: 2020 5th International Conference on Computer and Communication Systems (ICCCS), IEEE, 2020, pp. 363–367. [19] H. Li, C. Zhang, H. Li, N. Song, White-light interference microscopy image super-resolution using generative adversarial networks, IEEE Access 8 (2020) 27724–27733. [20] M. Wang, Z. Chen, Q.J. Wu, M. Jian, Improved face super-resolution generative adversarial networks, Mach. Vis. Appl. 31 (2020) 1–12. [21] F. Nan, Q. Zeng, Y. Xing, Y. 
Qian, Single image super-resolution reconstruction based on the ResNeXt network, in: Multimedia Tools and Applications, 2020, pp. 1–12. [22] Y.Y. Li, Y.D. Zhang, X.W. Zhou, W. Xu, EESR: edge enhanced super-resolution, in: 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), IEEE, 2018, pp. 1–3. [23] R. Sood, M. Rusu, Anisotropic super resolution in prostate Mri using super resolution generative adversarial networks, in: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), IEEE, 2019, pp. 1688–1691. [24] S. Lee, J.H. Kim, J.P. Heo, Super-resolution of license plate images via character-based perceptual loss, in: 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), IEEE, 2020, pp. 560–563.
[25] B.X. Chen, T.J. Liu, K.H. Liu, H.H. Liu, S.C. Pei, Image super-resolution using complex dense block on generative adversarial networks, in: 2019 IEEE International Conference on Image Processing (ICIP), IEEE, 2019, pp. 2866–2870. [26] W.S. Jeon, S.Y. Rhee, Single image super resolution using residual learning, in: 2019 International Conference on Fuzzy Theory and its Applications (iFUZZY), IEEE, 2019, pp. 1–4. [27] J. Wenjie, L. Xiaoshu, Research on super-resolution reconstruction algorithm of remote sensing image based on generative adversarial networks, in: 2019 IEEE 2nd International Conference on Automation, Electronics and Electrical Engineering (AUTEEE), IEEE, 2019, pp. 438–441. [28] V.K. Ha, J. Ren, X. Xu, S. Zhao, G. Xie, V.M. Vargas, Deep learning based single image superresolution: a survey, in: International Conference on Brain Inspired Cognitive Systems, Springer, Cham, 2018, pp. 106–119. [29] X. Wang, K. Yu, C. Dong, C. Change Loy, Recovering realistic texture in image super-resolution by deep spatial feature transform, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615. [30] W.S. Lai, J.B. Huang, N. Ahuja, M.H. Yang, Fast and accurate image super-resolution with deep laplacian pyramid networks, IEEE Trans. Pattern Anal. Mach. Intell. 41 (11) (2018) 2599–2613. [31] W.S. Lai, J.B. Huang, N. Ahuja, M.H. Yang, Deep laplacian pyramid networks for fast and accurate super-resolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 624–632. [32] D. Kim, H.U. Jang, S.M. Mun, S. Choi, H.K. Lee, Median filtered image restoration and anti-forensics using adversarial networks, IEEE Signal Process Lett. 25 (2) (2017) 278–282. [33] T. Tong, G. Li, X. Liu, Q. Gao, Image super-resolution using dense skip connections, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4799–4807. [34] C. Dong, C.C. Loy, X. Tang, Accelerating the super-resolution convolutional neural network, in: European Conference on Computer Vision, Springer, Cham, 2016, pp. 391–407. [35] W. Shi, J. Caballero, F. Husza´r, J. Totz, A.P. Aitken, R. Bishop, Z. Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883. [36] L. Yue, H. Shen, J. Li, Q. Yuan, H. Zhang, L. Zhang, Image super-resolution: the techniques, applications, and future, Signal Process. 128 (2016) 389–408. [37] E.L. Denton, S. Chintala, R. Fergus, Deep generative image models using a laplacian pyramid of adversarial networks, in: Advances in Neural Information Processing Systems, 2015, pp. 1486–1494. [38] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv, 2014. preprint arXiv:1409.1556. [39] X. Li, Y. Wu, W. Zhang, R. Wang, F. Hou, Deep learning methods in real-time image superresolution: a survey, J. Real-Time Image Proc. (2019) 1–25. [40] K. Hayat, Multimedia super-resolution via deep learning: a survey, Digital Signal Process. 81 (2018) 198–217. [41] M.S. Sajjadi, B. Scholkopf, M. Hirsch, Enhancenet: single image super-resolution through automated texture synthesis, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4491–4500. [42] X. Zhao, Y. Zhang, T. Zhang, X. Zou, Channel splitting network for single MR image superresolution, IEEE Trans. Image Process. 28 (11) (2019) 5649–5662. [43] J. 
Kim, J. Kwon Lee, K. Mu Lee, Accurate image super-resolution using very deep convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654. [44] R. Timofte, E. Agustsson, L. Van Gool, M.H. Yang, L. Zhang, Ntire 2017 challenge on single image super-resolution: methods and results, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 114–125. [45] X. Song, Y. Dai, X. Qin, Deep depth super-resolution: learning depth super-resolution using deep convolutional neural network, in: Asian Conference on Computer Vision, Springer, Cham, 2016, pp. 360–376.
[46] L. Zhang, P. Wang, C. Shen, L. Liu, W. Wei, Y. Zhang, A. Van Den Hengel, Adaptive importance learning for improving lightweight image super-resolution network, Int. J. Comput. Vis. 128 (2) (2020) 479–499. [47] Y. Li, J. Hu, X. Zhao, W. Xie, J. Li, Hyperspectral image super-resolution using deep convolutional neural network, Neurocomputing 266 (2017) 29–41. [48] Y. Liang, J. Wang, S. Zhou, Y. Gong, N. Zheng, Incorporating image priors with deep convolutional neural networks for image super-resolution, Neurocomputing 194 (2016) 340–347. [49] Q. Chang, K.W. Hung, J. Jiang, Deep learning based image Super-resolution for nonlinear lens distortions, Neurocomputing 275 (2018) 969–982. [50] B. Lim, S. Son, H. Kim, S. Nah, K. Mu Lee, Enhanced deep residual networks for single image superresolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144.
CHAPTER 2
GAN models in natural language processing and image translation
E. Thirumagal (a, b) and K. Saruladha (a)
(a) Pondicherry Engineering College, Department of Computer Science and Engineering, Puducherry, India
(b) REVA University, Bengaluru, India
2.1 Introduction

In recent years, GANs have shown significant progress in modeling the complex data distributions of images and speech. The introduction of GANs and VAEs has made it possible to train on big datasets in an unsupervised manner.
2.1.1 Variational auto encoders

Variational auto encoders (VAEs) [1] were used for generating images before GANs. A VAE has a probabilistic encoder and a probabilistic decoder. A real sample r is fed into the encoder, which outputs an encoded representation to which noise n is added; the resulting latent distribution is X_e(n|r). A sample from X_e(n|r) is given as input to the decoder, whose distribution Y_d(r|n) generates the reconstructed (fake) image. The loss function L(e, d) between encoder and decoder is computed at every iteration. The VAE uses a reconstruction (mean square error) term together with a divergence term:

L(e, d) = E_{n \sim X_e(n|r)}[Y_d(r|n)] + KLD(X_e(n|r) \| Y_d(n))   (2.1)

where

KLD(X_e(n|r) \| Y_d(n)) = \sum_n X_e(n|r) \log \frac{X_e(n|r)}{Y_d(n)}

The Kullback-Leibler divergence (KLD) is the distance metric that measures the similarity between the encoder distribution X_e for the real sample and the decoder distribution Y_d for the generated fake image. If the loss function yields a large value, the decoder is not generating fake images similar to the real samples. Backpropagation takes place at every iteration until the decoder generates images similar to the real image: using stochastic gradient descent, the weights and biases of the encoder and decoder are adjusted and image generation is repeated. The optimal value of the loss function is 0.5.
When the loss function of the decoder becomes 0.5, the decoder generates images similar to the real image.

2.1.1.1 Drawback of VAE

The VAE uses the Kullback-Leibler divergence (KLD). When the generated image distribution Y_d(n) does not match the real image distribution X_e(n|r), the value of Y_d(n) becomes 0 and the KLD goes to infinity, which means the encoder and decoder stop learning. This drawback led to the invention of GANs.
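For reference, a minimal sketch of the VAE objective in Eq. (2.1) for a Gaussian encoder is shown below; the assumption that the encoder returns a mean and log-variance of the latent code is standard practice and not spelled out in the chapter.

```python
# Sketch of a VAE loss: reconstruction term plus KL divergence to the prior.
import torch
import torch.nn.functional as F

def vae_loss(reconstruction, real, mu, log_var):
    # reconstruction error of the decoder for the real sample
    recon = F.mse_loss(reconstruction, real, reduction="sum")
    # KLD(X_e(n|r) || N(0, I)) in closed form for a Gaussian encoder
    kld = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kld
```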
2.1.2 Brief introduction to GAN

Generative adversarial networks (GANs) [2–4] are generative neural network models introduced by Ian Goodfellow in 2014. Recently, GANs have been used in numerous applications such as the discovery and prevention of security attacks, clothing translation, text-to-image conversion, photo blending, and video games. A GAN has a generator (G) and a discriminator (D), which can be convolutional neural networks (CNNs), feed-forward neural networks, or recurrent neural networks (RNNs). The generator takes a random noise distribution as input and generates fake images. The real samples and the generated fake images are given as input to the discriminator, which outputs whether the image is from the real samples or from the generator (1 or 0). Loss functions are computed to check whether (i) the generator is generating images close to the real samples and (ii) the discriminator is correctly discriminating between real and fake images. If a loss function yields a large value, it is backpropagated through the generator and discriminator networks and their weights and biases are adjusted; this is called optimization. There are various optimization algorithms such as stochastic gradient descent, RMSProp, Adam, and AdaGrad. In this way, both G and D learn simultaneously. This chapter is organized as follows: the various GAN architectures are discussed in Section 2.2. Section 2.3 describes the applications of GANs in natural language processing. The applications of GANs in image generation and translation are discussed in Section 2.4. Section 2.5 discusses the evaluation metrics that can be used for checking the performance of a GAN. The tools and languages used for GAN research are discussed in Section 2.6. Open challenges for further research are discussed in Section 2.7.
2.2 Basic GAN model classification based on learning

Based on the literature survey, the various GANs are classified by their learning method, as shown in Fig. 2.1. Learning can be supervised or unsupervised. Supervised learning means training the machine with labeled data; classification and regression algorithms come under supervised learning. Unsupervised learning [5] takes place when the data are unlabeled.
Fig. 2.1 GAN architecture classification.
In that case the machine acts on the data based on similarities, differences, and patterns; clustering and association algorithms come under unsupervised learning.
2.2.1 Unsupervised learning

The generative models before GANs used Markov chain methods [6] for training, which have various drawbacks such as high computational complexity and low efficiency. As shown in Fig. 2.1, vanilla GAN, WGAN, WGAN-GP, Info GAN, BEGAN, unsupervised sequential GAN, parallel GAN, and Cycle GAN are categorized under unsupervised learning [7]. These GANs take the data, or real samples, without labels as input. The following sections detail each GAN architecture along with its loss functions and optimization techniques.

2.2.1.1 Vanilla GAN

The vanilla GAN [8, 9] is the basic GAN architecture. The real samples are denoted by r and the random noise by n. The random noise distribution p_n(n) is given as input to G, which generates a fake image. The real sample distribution p_d(r) and the fake images are given as input to D, which discriminates whether the image is real (output 1) or fake (output 0), as shown in Fig. 2.2. Using the binary cross-entropy loss function, the losses of G and D are calculated by Eqs. (2.3) and (2.4). The binary cross-entropy loss function (Goodfellow [2]) is given by Eq. (2.2):

L(x', x) = x \log x' + (1 - x) \log(1 - x')   (2.2)

where x' is the predicted output for the (real or generated) image and x is its true label. When the image comes from a real sample r, D has to output 1; substituting x' = D(r) and x = 1 into Eq. (2.2) leads to

L_{GAN}(D) = E_{r \sim p_d(r)}[\log(D(r))]   (2.3)
Fig. 2.2 Vanilla GAN, WGAN, WGAN-GP architecture.
When the image comes from the generator, G(n), D has to output 0; substituting x' = D(G(n)) and x = 0 into Eq. (2.2) leads to

L_{GAN}(G) = E_{n \sim p_n(n)}[\log(1 - D(G(n)))]   (2.4)
(2.4)
Using min-max game theory, the D has to output 1 if the image is the real sample. Hence the D has to be maximized. The D has to output 0 if the image has come from the generator. Hence the G has to be minimized. The loss function is given by n min max LGAN ðG, DÞ ¼ min max Erpdðr Þ ½ log ðDðr ÞÞ + ΕnpnðnÞ ½ log ð1 DðGðnÞÞg G
D
G
D
(2.5) The optimal value of D is given by D∗ ¼
pd ðr Þ pd ðr Þ + pg ðr Þ
(2.6)
If the optimal value of D (0.5) is obtained, then D cannot differentiate between real and fake images. The optimal value of G is given by G∗ ¼ log 4 + 2 JSD pd ðr Þkpg ðr Þ (2.7) where 1 JSD pd ðr Þkpg ðr Þ ¼ KLD pd ðr Þk pd + pg + KLD pg ðr Þk pd + pg 2 If the loss function yields more value than by using backpropagation and stochastic gradient descent [6], weights and bias will be adjusted for every epoch until D discriminates properly.
2.2.1.2 WGAN

WGAN [10] stands for Wasserstein GAN. The architecture is the same as that of the vanilla GAN, as shown in Fig. 2.2. To avoid the drawbacks of the Jensen-Shannon divergence (JSD), the Wasserstein distance metric is used instead: JSD compares the real and fake image distributions along the vertical axis, whereas the Wasserstein distance compares them along the horizontal axis. The Wasserstein distance is also called the earth mover's distance. The Wasserstein distance between the real image distribution P_r and the generated image distribution P_g is given by

W(P_r, P_g) = \inf_{\delta \in \Pi(p_r, p_g)} E_{(a,b) \sim \delta}[\, \|a - b\| \,]   (2.8)

where \Pi is the set of transport plans describing how the real distribution is transformed into the generated distribution. Eq. (2.8) is intractable; to make it tractable, the Kantorovich-Rubinstein duality gives

W(P_r, P_g) = \sup_{\|D\|_L \le 1} E[D(r)] - E[D(G_g(n))]   (2.9)

If the slope between the real and fake image distributions is less than or equal to k, the function is called k-Lipschitz; when k is 1 it is 1-Lipschitz. WGAN requires the 1-Lipschitz constraint and, to enforce it, the weights are clipped to the range (-1, 1). The discriminator loss function is given by

L_{WGAN}(D) = E_{r \sim p_d(r)}[D(r)]   (2.10)
The generator loss function is given by

L_{WGAN}(G) = -E_{n \sim p_n(n)}[D(G(n))]   (2.11)

The overall parametric loss function is given by

L_{WGAN}(G, D) = \min_G \max_{w \in W} \{ E_{r \sim p_d(r)}[D(r)] - E_{n \sim p_n(n)}[D(G(n))] \}   (2.12)
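A sketch of one critic (discriminator) update implementing Eq. (2.12) with weight clipping and RMSProp, as described in this subsection; the clipping constant 0.01 is the usual choice and an assumption here.

```python
# WGAN critic update with weight clipping (illustrative sketch).
import torch

def wgan_critic_step(G, D, d_optimizer, real_batch, noise_dim=100, clip=0.01):
    d_optimizer.zero_grad()
    z = torch.randn(real_batch.size(0), noise_dim)
    fake_batch = G(z).detach()
    # maximize E[D(r)] - E[D(G(n))]  <=>  minimize the negated difference
    loss = -(D(real_batch).mean() - D(fake_batch).mean())
    loss.backward()
    d_optimizer.step()              # e.g. torch.optim.RMSprop(D.parameters(), lr=5e-5)
    # enforce the Lipschitz constraint approximately by clipping the critic weights
    for p in D.parameters():
        p.data.clamp_(-clip, clip)
    return -loss.item()             # estimate of the Wasserstein distance
```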
In WGAN, the discriminator does not return 0 or 1; rather, it returns the Wasserstein distance. WGAN uses the RMSProp optimizer, which alters the weights and biases of G and D at every iteration until D cannot discriminate between real and fake images.

2.2.1.3 WGAN-GP

The WGAN-GP [11, 12] (Wasserstein GAN with gradient penalty) architecture is identical to the vanilla GAN and WGAN, as shown in Fig. 2.2. To avoid the drawback of weight clipping as a way to enforce the Lipschitz constraint, a gradient penalty term is incorporated into the WGAN loss function; the penalty is incurred when the gradient norm moves away from 1. The loss function of WGAN-GP is given by

L_{WGAN-GP}(G, D) = \min_G \max_{w \in W} \{ E_{r \sim p_d(r)}[D(r)] - E_{n \sim p_n(n)}[D(G(n))] \} + \lambda \, E_{\hat{x} \sim P(\hat{x})}\big[(\|\nabla_{\hat{x}} D(\hat{x})\| - 1)^2\big]   (2.13)

The gradient penalty term is added to the WGAN loss. In Eq. (2.13), \hat{x} is sampled between the real image r and the generated image as \hat{x} = t\,r + (1 - t)\,G(n), where t is sampled uniformly between 0 and 1, and \lambda is a hyperparameter. With the Adam optimizer, WGAN-GP generates good, clear images.

2.2.1.4 Info GAN

Info GAN [8, 13] is the information-maximizing GAN. Semantic information is added to the noise and given to G, which outputs a fake image. The generated fake image and the real samples are given to D, which outputs 0 (fake) or 1 (real), as shown in Fig. 2.3. The loss function is then computed, and stochastic gradient descent is used to optimize the networks. The noise n together with the semantic information (latent code) si is fed to G, written G(n, si). The mutual information MI(si; G(n, si)) between the semantic information si and the generator output has to be maximized; it is the amount of information about si obtained from knowledge of G(n, si). Maximizing the mutual information directly is not easy, so a variational lower bound is optimized instead.
GAN models in natural language processing and image translation
Fig. 2.3 Info GAN architecture.
defining one more semantic distribution Q(si jr). The variational lower bound LB (G, Q) of mutual information is given by LBðG, QÞ ¼ EsiP ðsiÞ, xGðn, siÞ ½ log Qðsij r Þ + H ðsiÞ I ðsi; Gðz, nÞÞ
(2.14)
where H(si) is the entropy of latent codes. The loss function is given by min max LInfoGAN ðG, DÞ ¼ Erpdðr Þ ½ log ðDðr ÞÞ G, Q D + ΕnpnðnÞ ½ log ð1 DðGðnÞÞ λ LBðG, QÞ
(2.15)
The wake-sleep algorithm is used with InfoGAN: the lower bound of the generator likelihood $\log P_G(x)$ is optimized and updated in the wake phase, while the auxiliary distribution Q is updated in the sleep phase by sampling from the generator distribution instead of the real data distribution. The cost is only a little more than that of the vanilla GAN.

2.2.1.5 BEGAN
BEGAN [14, 15] stands for Boundary Equilibrium GAN and was mainly developed to reach Nash equilibrium. The architecture is the same as that of the vanilla GAN with one difference: proportional control theory is used to maintain equilibrium, as shown in Fig. 2.4. In BEGAN the generator acts as a decoder, while the discriminator acts as an autoencoder and also discriminates between real and fake images. Instead of matching the data distributions of the real and generated images, BEGAN computes an autoencoder loss for the real image and for the generated image, and the Wasserstein distance is computed between these two autoencoder losses. The autoencoder loss is given by
$$\mathcal{L}(s) = |s - AF(s)|^{\eta} \qquad (2.16)$$
where $\mathcal{L}(s)$ is the loss for training the autoencoder, $s$ is a sample of dimension $d$, $AF$ is the autoencoder function that maps a sample of dimension $d$ to a sample of dimension $d$, and $\eta$ is the target norm, taking a value in $\{1, 2\}$.
Fig. 2.4 BEGAN architecture.
The loss of D is given by
$$L_D = \mathcal{L}(r) - k_i\, \mathcal{L}(G(n_D)) \qquad (2.17)$$
The loss of G is given by
$$L_G = \mathcal{L}(G(n_G)) \qquad (2.18)$$
BEGAN uses the proportional control model to preserve the equilibrium $\mathbb{E}[\mathcal{L}(G(n))] = \gamma\, \mathbb{E}[\mathcal{L}(r)]$, where $\gamma$ is a hyperparameter taking a value in $(0, 1)$. To maintain equilibrium, it uses the variable $k_i$, which takes a value in $(0, 1)$ and controls the generator loss during gradient descent. $k_i$ is updated as
$$k_{i+1} = k_i + \lambda_k\,\big(\gamma\, \mathcal{L}(r) - \mathcal{L}(G(n_G))\big) \qquad (2.19)$$
Initially $k_0 = 0$, and $\lambda_k$ is the learning rate of $k$.

2.2.1.6 Unsupervised sequential GAN
The sequential GAN [16, 17] involves a sequence of generators and discriminators. The noise vector $z$ is given as input to generator G1, which produces fake image1 $f$ as output. Fake image1 and the real sample $r$ are given as input to discriminator D1, which discriminates between real image1 and fake image1. Fake image1 is then given as input to generator G2, which produces fake image2 as output. Fake image2 and real image2 are given as input to discriminator D2, which discriminates between the real and fake image, as shown in Fig. 2.5. The loss function considering G1 and D1 is given by
$$L_{adv}(G1, D1, n, r) = \mathbb{E}_{r \sim p_d(r)}\big[\log(D1(r))\big] + \mathbb{E}_{n \sim p_n(n)}\big[\log(1 - D1(G1(n)))\big] \qquad (2.20)$$
The loss function considering G2 and D2 is given by
$$L_{img2img}(G2, D2, f, r) = \mathbb{E}_{r \sim p_d(r)}\big[\log(D2(r))\big] + \mathbb{E}_{f \sim p_f(f)}\big[\log(1 - D2(G2(f)))\big] \qquad (2.21)$$
Fig. 2.5 Unsupervised sequential GAN architecture.
The loss function of the unsupervised sequential GAN is given by
$$L_{unseqGAN}(G1, D1, G2, D2) = L_{adv}(G1, D1, n, r) + L_{img2img}(G2, D2, f, r) \qquad (2.22)$$
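The adversarial terms of Eqs. (2.20) to (2.22) can be estimated from discriminator outputs, as in the sketch below; the probabilities are invented for illustration and are not values from the chapter.

```python
import numpy as np

def adv_loss(d_real, d_fake, eps=1e-12):
    """Monte-Carlo estimate of Eq. (2.20): E[log D(r)] + E[log(1 - D(G(n)))].
    d_real, d_fake are discriminator probabilities on real and fake batches."""
    return float(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

# assumed discriminator outputs of the two stages on one mini-batch
stage1 = adv_loss(np.array([0.9, 0.8]), np.array([0.2, 0.1]))   # Eq. (2.20)
stage2 = adv_loss(np.array([0.7, 0.8]), np.array([0.3, 0.2]))   # Eq. (2.21)
total = stage1 + stage2                                         # Eq. (2.22)
print(stage1, stage2, total)
```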
2.2.1.7 Parallel GAN
The architecture of the parallel GAN [18] is shown in Fig. 2.6. Whenever bimodal images have to be processed, or multiple images have to be generated at the same time, the parallel GAN can be used. A noise vector is given to generators G1 and G2 in parallel, and G1 and G2 generate fake image1 and fake image2 in parallel. Real image1 and fake image1 are given as input to discriminator D1, which discriminates between real image1 and fake image1. Real image2 and fake image2 are given as input to discriminator D2, which discriminates between real image2 and fake image2 in parallel with D1.

Fig. 2.6 Parallel GAN architecture.

The binary cross-entropy loss is computed for (G1, D1) and (G2, D2) in parallel. If D1 and D2 do not discriminate between real and fake images properly, backpropagation and stochastic gradient descent adjust the weights and biases of (G1, D1) and (G2, D2) in every iteration until D1 and D2 discriminate correctly.

2.2.1.8 Cycle GAN
The Cycle GAN is otherwise called the cycle-consistent GAN [19, 20]. The noise vector $z$ is given as input to generator G1, which produces a feature map $f$ as output. The feature map and the real sample $r$ are given as input to discriminator D1, which discriminates between the real sample and the feature map. The feature map is given as input to generator G2, which produces a fake image as output. The fake image and the real image are given as input to discriminator D2, which discriminates between the real and fake image, as shown in Fig. 2.7. The loss function considering G1 and D1 is given by
$$L_1(G1, D1, n, r) = \mathbb{E}_{r \sim p_d(r)}\big[\log(D1(r))\big] + \mathbb{E}_{n \sim p_n(n)}\big[\log(1 - D1(G1(n)))\big] \qquad (2.23)$$
The loss function considering G2 and D2 is given by
$$L_2(G2, D2, f, r) = \mathbb{E}_{r \sim p_d(r)}\big[\log(D2(r))\big] + \mathbb{E}_{f \sim p_f(f)}\big[\log(1 - D2(G2(f)))\big] \qquad (2.24)$$
The cycle-consistency loss is given by
$$L_{cycle}(G1, G2) = \mathbb{E}_{n \sim p_n(n)}\big[\,|G2(G1(n)) - n|_1\,\big] + \mathbb{E}_{f \sim p_f(f)}\big[\,|G1(G2(f)) - f|_1\,\big] \qquad (2.25)$$
The Cycle GAN loss is given by
$$L_{cycleGAN}(G1, G2, D1, D2) = L_1(G1, D1, n, r) + L_2(G2, D2, f, r) + L_{cycle}(G1, G2) \qquad (2.26)$$
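A small sketch of the L1 cycle-consistency term of Eq. (2.25); the arrays are random stand-ins for a batch of images and their reconstructions, not data from the chapter.

```python
import numpy as np

def cycle_consistency_loss(x, x_reconstructed):
    """L1 cycle-consistency term of Eq. (2.25): mean |G2(G1(x)) - x|."""
    return float(np.mean(np.abs(x_reconstructed - x)))

# assumed images as float arrays in [0, 1] (here 2 images of 4x4 pixels)
x = np.random.rand(2, 4, 4)
x_rec = x + 0.05 * np.random.randn(2, 4, 4)   # imperfect reconstruction G2(G1(x))
print(cycle_consistency_loss(x, x_rec))        # small value => cycle is consistent
```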
Fig. 2.7 Cycle GAN architecture.
2.2.2 Semisupervised learning
In semisupervised learning the discriminator D is trained with class labels, i.e., D performs supervised learning, while the generator is not trained with class labels, so its learning is unsupervised. The Semi GAN, discussed in the following section, falls under this semisupervised category.

2.2.2.1 Semi GAN
The architecture of the Semi GAN [21, 22] is shown in Fig. 2.8. The class labels are attached to the real samples and given as input to discriminator D, so its learning becomes supervised. The noise vector is given as input to generator G, which generates the fake sample. The real samples with class labels and the fake image generated by G are given as input to D, which discriminates between real and fake images and also classifies each image into its class. The loss functions are computed for G and D. If D is not discriminating properly, backpropagation and stochastic gradient descent adjust the parameters of G and D in every iteration until D discriminates correctly. The discriminator loss function is given by
$$L_{SemiGAN}(D) = \mathbb{E}_{r \sim p_d(r)}\big[\log(D(r \mid c))\big] \qquad (2.27)$$
The generator loss function is given by
$$L_{SemiGAN}(G) = \mathbb{E}_{n \sim p_n(n)}\big[\log(1 - D(G(n)))\big] \qquad (2.28)$$
Using min-max game theory, D has to output 1 if the image is a real sample, so D is maximized; D has to output 0 if the image comes from the generator, so G is minimized. The loss function is given by
$$\min_{G}\max_{D}\; L_{SemiGAN}(G, D) = \min_{G}\max_{D}\; \Big\{ \mathbb{E}_{r \sim p_d(r)}\big[\log(D(r \mid c))\big] + \mathbb{E}_{n \sim p_n(n)}\big[\log(1 - D(G(n)))\big] \Big\} \qquad (2.29)$$
Fig. 2.8 Semi GAN.
The generator is not trained with the class labels, but the discriminator is. The following sections describe supervised learning.
2.2.3 Supervised learning
Supervised learning means making the machine learn from labeled data. CGAN, BiGAN, ACGAN, and the supervised sequential GAN are GAN architectures that learn in a supervised manner. The following sections show the architecture of each of these GANs, together with the loss functions and optimization techniques used in each.

2.2.3.1 CGAN
CGAN stands for conditional GAN [23, 24]. The architecture of CGAN is shown in Fig. 2.9.

Fig. 2.9 CGAN architecture.

The class labels are attached to the real samples, and both generator G and discriminator D are trained with the class labels. The noise vector along with the class labels is given as input to G, which outputs the fake image. The class labels, the real image, and the fake image generated by G are given as input to D. D discriminates between the real and fake image and also finds out to which class the image belongs. The loss function is the same as that of the vanilla GAN with one difference: the class labels $c$ are added to the real sample, the discriminator, and the generator terms. The binary cross-entropy loss [25] is used, and stochastic gradient descent optimizes G and D when D is not discriminating properly. The discriminator loss function is given by
$$L_{CGAN}(D) = \mathbb{E}_{r \sim p_d(r)}\big[\log(D(r \mid c))\big] \qquad (2.30)$$
The generator loss function is given by
$$L_{CGAN}(G) = \mathbb{E}_{n \sim p_n(n)}\big[\log(1 - D(G(n \mid c)))\big] \qquad (2.31)$$
Using min-max game theory, D has to output 1 if the image is a real sample, so D is maximized; D has to output 0 if the image comes from the generator, so G is minimized. The loss function is given by
$$\min_{G}\max_{D}\; L_{CGAN}(G, D) = \min_{G}\max_{D}\; \Big\{ \mathbb{E}_{r \sim p_d(r)}\big[\log(D(r \mid c))\big] + \mathbb{E}_{n \sim p_n(n)}\big[\log(1 - D(G(n \mid c)))\big] \Big\} \qquad (2.32)$$
CGANs [26] with multilabel predictions can be used for automated image tagging, where the generator generates the tag vector distribution conditioned on image features.

2.2.3.2 BiGAN
BiGAN stands for bidirectional GAN [8, 27, 28]. The architecture of BiGAN is shown in Fig. 2.10.

Fig. 2.10 BiGAN architecture.

The noise vector is given as input to generator G, which generates the fake image. The real sample is given as input to the encoder, which outputs the encoded image to which the noise is added. The encoded image, the noise, the real image, and the generated fake image are given as input to discriminator D, which discriminates between real and fake images. The loss functions are computed for G and D. If D is not discriminating properly, backpropagation and stochastic gradient descent adjust the parameters of G and D in every iteration until D discriminates correctly. The discriminator loss function is given by
$$L_{BiGAN}(D) = \mathbb{E}_{r \sim p_d(r)}\, \mathbb{E}_{n \sim p_E(n \mid r)}\big[\log(D(r, n))\big] \qquad (2.33)$$
The discriminator is trained with the real data, the noise, and the encoded image distribution. The generator loss function is given by
$$L_{BiGAN}(G) = \mathbb{E}_{n \sim p_n(n)}\, \mathbb{E}_{r \sim p_G(r \mid n)}\big[\log(1 - D(r, n))\big] \qquad (2.34)$$
Using min-max game theory, D has to output 1 if the image is a real sample, so D is maximized; D has to output 0 if the image comes from the generator, so G is minimized. The loss function is given by
$$\min_{G, E}\max_{D}\; L_{BiGAN}(D, E, G) = \min_{G, E}\max_{D}\; \mathbb{E}_{r \sim p_d(r)}\, \mathbb{E}_{n \sim p_E(n \mid r)}\big[\log(D(r, n))\big] + \mathbb{E}_{n \sim p_n(n)}\, \mathbb{E}_{r \sim p_G(r \mid n)}\big[\log(1 - D(r, n))\big] \qquad (2.35)$$

2.2.3.3 ACGAN
The architecture of ACGAN [12, 29] is shown in Fig. 2.11.

Fig. 2.11 ACGAN architecture.

The architecture is the same as that of CGAN with one difference: the class labels $c$ are conditioned with the real samples and with the noise vector that is given as input to generator G, and the class labels are not conditioned with the discriminator D. Training is based on the log probability of the correct source (whether the image is real or generated by G) and the log probability of the correct class to which the sample belongs. Stochastic gradient descent adjusts the weights and biases of G and D in every iteration if D is not discriminating correctly. The log probability of the correct source, i.e., whether the image is a real sample or generated by G, is given by
$$L_{source} = \mathbb{E}\big[\log P(source = real \mid R_{real})\big] + \mathbb{E}\big[\log P(source = fake \mid R_{fake})\big] \qquad (2.36)$$
The log probability of the correct class, i.e., whether the image is classified correctly, is given by
$$L_{class} = \mathbb{E}\big[\log P(class = c \mid R_{real})\big] + \mathbb{E}\big[\log P(class = c \mid R_{fake})\big] \qquad (2.37)$$
The image samples are denoted by $R$, and conditional probabilities are used. Training is carried out such that D maximizes $L_{source} + L_{class}$ while G maximizes $L_{class} - L_{source}$.

2.2.3.4 Supervised seq-GAN
The supervised sequential GAN [25, 30, 31] architecture is shown in Fig. 2.12.

Fig. 2.12 Supervised sequential GAN.

The real image is given as input to the encoder, which outputs the encoded image. The encoded image is given as input to G1, which generates fake image1. Fake image1 is given as input to G2, which generates fake image2. The noise vector, the encoded image, and fake image1 are given as input to D1; the noise vector, the encoded image, and fake image2 are given as input to D2. D1 and D2 discriminate between real and fake images. The loss function considering G1 and D1 is given by
$$L_{adv}(G1, D1, n, r) = \mathbb{E}_{r \sim p_d(r)}\big[\log(D1(r))\big] + \mathbb{E}_{n \sim p_n(n)}\big[\log(1 - D1(G1(n)))\big] \qquad (2.38)$$
The loss functions considering G2, D2, and the encoder are given by the following equations:
$$L_{img2img}(G2, D2, f, r) = \mathbb{E}_{r \sim p_d(r)}\big[\log(D2(r))\big] + \mathbb{E}_{f \sim p_f(f)}\big[\log(1 - D2(G2(f)))\big] \qquad (2.39)$$
$$L_{encoder}(r, n) = \mathbb{E}_{r \sim p_d(r)}\, \mathbb{E}_{n \sim p_E(n \mid r)}\big[\log(D(r, n))\big] + \mathbb{E}_{n \sim p_n(n)}\, \mathbb{E}_{r \sim p_G(r \mid n)}\big[\log(1 - D(r, n))\big] \qquad (2.40)$$
The loss function of the supervised sequential GAN is given by
$$L_{SupseqGAN}(G1, D1, G2, D2) = L_{adv}(G1, D1, n, r) + L_{img2img}(G2, D2, f, r) + L_{encoder}(r, n) \qquad (2.41)$$
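Before moving to the model comparison, the sketch below numerically illustrates the ACGAN objectives of Eqs. (2.36) and (2.37). The probability arrays are assumed discriminator outputs invented for illustration, not values from the chapter.

```python
import numpy as np

def acgan_objectives(p_real_src, p_fake_src, p_real_cls, p_fake_cls, eps=1e-12):
    """L_source and L_class of Eqs. (2.36)-(2.37) from discriminator probabilities.
    p_real_src: P(source=real | real batch); p_fake_src: P(source=fake | fake batch);
    p_*_cls: probability assigned to the correct class for each sample."""
    l_source = np.mean(np.log(p_real_src + eps)) + np.mean(np.log(p_fake_src + eps))
    l_class = np.mean(np.log(p_real_cls + eps)) + np.mean(np.log(p_fake_cls + eps))
    return float(l_source), float(l_class)

l_s, l_c = acgan_objectives(np.array([0.9, 0.8]), np.array([0.85, 0.7]),
                            np.array([0.95, 0.9]), np.array([0.6, 0.7]))
print("D maximizes:", l_s + l_c)
print("G maximizes:", l_c - l_s)
```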
2.2.4 Comparison of GAN models
This section compares the GAN models. Table 2.1 summarizes the activation function, loss function, distance metric, and optimization technique used by each GAN model.

Table 2.1 Loss functions and distance metrics of GANs.
GAN | Activation function | Loss function | Distance metric | Optimization technique
Vanilla GAN | Rectified linear unit (ReLU) | Binary cross-entropy loss | Jensen-Shannon divergence | Backpropagation with stochastic gradient descent
WGAN | ReLU, leaky ReLU, tanh | Kantorovich-Rubinstein duality loss | Wasserstein distance | RMSProp
WGAN-GP | ReLU, leaky ReLU, tanh | Kantorovich-Rubinstein duality + penalty term added when the gradient norm moves away from 1 | Wasserstein distance | Adam
Info GAN | Rectified linear unit (ReLU) | Binary cross-entropy + variational information regularization | Jensen-Shannon divergence | Stochastic gradient descent
BEGAN | Exponential linear unit (ELU) | Autoencoder loss + proportional control theory | Wasserstein distance | Adam
Unsupervised seq-GAN | Rectified linear unit (ReLU) | Binary cross-entropy + image-to-image conversion loss | Jensen-Shannon divergence and Kullback-Leibler divergence | RMSProp
Parallel GAN | Rectified linear unit (ReLU) | Binary cross-entropy loss | Jensen-Shannon divergence | Stochastic gradient descent
Cycle GAN | ReLU, sigmoid | Binary cross-entropy loss + cycle-consistency loss | Jensen-Shannon divergence | Batch normalization
Semi GAN | ReLU | Binary cross-entropy loss with labels included with the real samples | Jensen-Shannon divergence | Stochastic gradient descent
CGAN | ReLU | Binary cross-entropy loss with labels included | Jensen-Shannon divergence | Stochastic gradient descent
BiGAN | ReLU | Binary cross-entropy loss + guarantee that G and E are inverses | Jensen-Shannon divergence | Stochastic gradient descent
AC GAN | ReLU | Log likelihood of the correct source + log likelihood of the correct label | Jensen-Shannon divergence | Stochastic gradient descent
Supervised seq-GAN | ReLU | Binary cross-entropy + image-to-image conversion loss + autoencoder loss | Jensen-Shannon divergence and Kullback-Leibler divergence | RMSProp
2.2.5 Pros and cons of the GAN models This section discusses the pros and cons of GAN models. Table 2.2 summarizes the pros and cons of the various GAN models.
2.3 GANs in natural language processing Currently, many GAN architectures are emerging and yielding good results for the natural language processing applications. There have been various GAN architectures proposed in the recent years, including SeqGAN with policy gradient that is used for generating speech, poems, and music which outperforms other architectures. The RankGAN is used for generating sentences where the discriminator will act as the ranker. The following subsection elaborates on the various GAN architectures proposed for the applications of NLP.
2.3.1 Application of GANs in natural language processing This section discusses the various GAN architectures such as SeqGAN, RankGAN, UGAN, Quasi-GAN, BFGAN, TH-GAN, etc., proposed for the applications of natural language processing.
Table 2.2 Pros and cons of GAN models.
GAN | Pros | Cons
Vanilla GAN | Can generate samples that are similar to the real samples; can learn deep representations of the data | When the real and generated image distributions do not overlap, the Jensen-Shannon divergence between them becomes log 2, whose derivative is 0, so no learning takes place at the start of backpropagation
WGAN | Experiments show that it does not suffer from mode collapse | The Lipschitz constraint is enforced by weight clipping, which is simple but leads to poor-quality image generation
WGAN-GP | Training is well balanced, so the model can be trained effortlessly and converges properly | Attaining the Nash equilibrium state is very hard; batch normalization cannot be used because the gradient penalty is applied to every data sample
Info GAN | Learns disentangled data representations by using information-theoretic extensions | The mutual-information term pushes the significant attributes of the data into the semantic codes during learning; if the λ hyperparameter is not tuned accurately, the generated images are of poor quality
BEGAN | Training is fast and stable | The hyperparameter γ and the learning rate must be tuned properly; otherwise the generated images lack clarity
Unsupervised seq-GAN | Extracts deeper features | Hard to achieve Nash equilibrium
Parallel GAN | Multiple images can be generated at the same time | Hard to achieve Nash equilibrium
Cycle GAN | The dataset requirement is low; two arbitrary image styles can be converted | Accounting for parameters such as color, texture, and geometry during image-to-image translation is very difficult
Semi-GAN | An effective model that can be used for regression tasks | The generator cannot generate realistic enough images to fool the discriminator, since the discriminator is strong, having been trained with the class labels
CGAN | The class labels increase the performance of the GAN, and it can be used for many applications such as shadow-map generation and image synthesis | Training is not stable; the training stability can still be improved
BiGAN | As class labels are included, it can generate good realistic images | The real image given to the encoder must be of good clarity, and it cannot perform well when the data distributions are complex
AC GAN | As class labels are included, it can generate good realistic images | Training is not stable; the ACGAN training can still be improved
Supervised seq-GAN | Extracts deeper features | The real sample given to the encoder should be of good clarity; otherwise it will not generate realistic images
2.3.1.1 Generation of semantically similar human-understandable summaries using SeqGAN with policy gradient In recent years, generating text summaries have become attractive in the area of natural language processing. The SeqGAN with policy gradient architecture has been proposed for generating text summary. The proposed SeqGAN with policy gradient architecture [32] has three neural networks namely one generator (G) and two discriminators, viz., D1 and D2 as shown in Fig. 2.13. The G is the sequential model which takes the raw text as input and generates the summary of the text as output. The D1 trains G to output summaries which are human readable. Hence G and D1 form the GAN. The D1 is trained to distinguish between input text and the summary generated by G. G is trained to fool D1. As D1 trains the generator to generate the human-readable summary, it is called as human-readable summary discriminator. The summary generated by the generator might be irrelevant with only G and D1. Hence another discriminator D2 is added to the architecture for checking the semantic similarity between the input raw text and the generated human-readable summary. SeqGAN incorporates reinforcement learning. Policy gradient is the optimization technique used for updating the parameters (weights and bias) of G by obtaining rewards from D1 and D2. Hence D1 will train G to generate a semantically similar summary and D2 will train G to generate human-readable summary. Semantic similarity discriminator The semantic similarity discriminator is trained as the classifier using the text summarization dataset shown in Fig. 2.14. This discriminator will teach the generator to generate a semantically similar and more concise summary. The raw text and the human-readable summary are given as inputs to the encoders individually to generate the encoded representations namely Ri and Rs. The Ri and Rs are concatenated, product and difference are performed and given to the four-class classifiers which classify the human-readable summary into four classes namely similar, dissimilar, redundant, and incomplete class. The softmax outputs the probability distribution.
Fig. 2.13 SeqGAN for generation of human-readable summary.
Fig. 2.14 Semantic similarity discriminator.
2.3.1.2 Generation of quality language descriptions and ranking using RankGAN Language generation plays a major role in many NLP applications such as image caption generation, machine translation, dialogue generation systems, etc. Hence the RankGAN has been proposed to generate high-quality language descriptions. The RankGAN [33] consists of two neural networks such as generator G and ranker R. The generative model used is long short-term memory (LSTM) to generate the sentences which are called machine-written sentences. Instead of the discriminator being trained to be a binary classifier, RankGAN uses a ranker which has been trained to rank the human-written sentences more than the machine-written sentences. The ranker will train the generator to generate machine-written sentences which are similar to human-written sentences. In this way, generator fools the ranker to rank the machine-written sentences more than the human-written one. The policy gradient method is used for optimizing the training. The architecture of RankGAN is shown in Fig. 2.15. The G generates the sentences from the synthetic dataset. The human-written sentences with the generated machinewritten sentences are given as input to the ranker. The reference human-written sentence
Fig. 2.15 Architecture of RankGAN.
is also given as input to the ranker. The ranker has to rank the human-written sentences higher than the machine-written sentences, while the generator G is trained to fool the ranker into ranking the machine-written sentences higher than the human-written ones. The ranker computes the rank score using
$$R(i \mid S, C) = \mathbb{E}_{s \sim S}\big[P(i \mid s, C)\big], \quad \text{where} \quad P(i \mid s, C) = \frac{\exp\big(\beta\, \alpha(i \mid s)\big)}{\sum_{i' \in C'} \exp\big(\beta\, \alpha(i' \mid s)\big)} \qquad (2.42)$$
and $\alpha(i \mid s) = \mathrm{cosine}(x_i, x_s)$.
xi is the feature vector of input sentences. xs is the feature vector of reference sentences. The parameter β value is set during the experiment empirically. The reference set S is constructed by sampling reference sentences from human-written sentences. C is the comparison set sampled from both human-written and machine-generated sentence set. s is the reference sentence sampled from set S. 2.3.1.3 Dialogue generation using reinforce GAN Dialogue generation is the most important module in applications such as Siri, Google assistant, etc. The reinforce GAN [34] was proposed for dialogue generation using reinforcement learning. The architecture of reinforced GAN is shown in Fig. 2.16. The reinforce GAN has two neural network architectures namely generator G and discriminator D. The input dialogue history is given to the generator which outputs the machinegenerated dialogue. The {input dialogue history, machine-generated dialogue} pair is given to the hierarchical encoder which outputs the vector representation of dialogue. The vector representation is given as input to the discriminator D which in turn outputs the probability that the dialogue is human generated or machine generated. The policy gradient optimization technique is used. The weights and bias of G and D are adjusted by the rewards generated by them during training. The discriminator outputs will be used as rewards to train the generator so that the generator can generate a dialogue which is more similar to the human-generated dialogue.
Fig. 2.16 Architecture of reinforce GAN.
2.3.1.4 Text style transfer using UGAN Text style transfer is an important research application of natural language processing which aims at rephrasing the input text into the style that is desired by the user. Text style transfer has its application in many scenarios such as transferring the positive review into a negative one, conversion of informal text into a formal one, etc. Many techniques that are used for text style transfer are unidirectional, i.e., it transfers the sentence from positive to negative form. UGAN (Unified Generative Adversarial Networks) [35] is the only architecture which does multidirectional text style transfer shown in Fig. 2.17. Input to the architecture will be the sentence and the target attribute, for example input: sentence: “chicken is delicious” and target attribute: “negative.” Output of the architecture will be the transferred sentence. Output: “chicken is horrible” and vice versa. UGAN has two networks namely generator and discriminator. The LSTM is the generator network which takes the sentence and the target attribute as input and generates the output sentence as per the target attribute. The output transferred sentence generated by LSTM is given as the input to the discriminator. The discriminator uses the RankGAN rank score computation equations to rank the original sentence and the generated sentence. The classification of the sentence whether “positive” or “negative” is done by the discriminator. 2.3.1.5 Tibetan question-answer corpus generation using Qu-GAN In recent years, many question answering systems have been designed for many languages using deep learning models. It is hard to design a question-answering systems for languages with less resources such as Tibetan. To solve this problem, QuGAN [36] has been proposed for a question answering system. The architecture of the QuGAN is shown in Fig. 2.18. Initially, by using maximum likelihood, some amount of data is sampled from
Fig. 2.17 Architecture of UGAN.
Fig. 2.18 Architecture of QuGAN.
the data in the database. This is done to reduce the distance between the probability distribution of the real and the generated data. The randomly sampled data is given to the generator (quasi recurrent neural network—QRNN) which generates the question and answers which in turn is given to the BERT model to correct the grammatical errors and syntax. The generated and the real data are given as an input to the discriminator (long short-term memory—LSTM) which classifies between the real and the generated data. The policy gradient and the Monto-Carlo search optimization techniques are used to optimize the training of the neural networks by adjusting their weights and bias. 2.3.1.6 Generation of the sentence with lexical constraints using BFGAN Nowadays for generating meaning sentences, lexical constraints are incorporated to the model which has applications in machine translation, dialogue system, etc. For generating lexically constrained meaningful sentences, BFGAN (backward forward) [37] has been proposed as shown in Fig. 2.19. BFGAN has two generators namely forward and backward generators and one discriminator. The LSTM dynamic attention-based model called as attRNN is used as the generators. The discriminator can be CNN-based binary classifier to classify between real sentences and machine-generated meaningful sentences. The input sentence is split into words and given as input to the backward generator which generates the first half of the sentence in the backward direction. The backward sentence is reversed and fed as input to the forward generator which in turn outputs the complete sentence with lexical constraints. The discriminator is used for making the backward and forward generators powerful by training them using the Moto Carlo optimization technique. The real sentence and the generated sentences are given as input to the discriminator which will classify between real and the machine-generated complete sentence with the lexical constraints incorporated in it. 2.3.1.7 Short-spoken language intent classification with cSeq-GAN Intent classification in dialog system has grabbed attention in industries. For intent classification, cSeq-GAN [38] has been proposed shown in Fig. 2.20. cSeq-GAN has two
Fig. 2.19 Architecture of BFGAN.
Fig. 2.20 Architecture of cSeq-GAN.
neural networks namely the generator (LSTM) and the discriminator (CNN). The real questions with no tags and with tags are given as input to the generator. The generator in turn generates questions with classes. The generated and the real questions with tags are given as input to the discriminator. The CNN is used as the discriminator which has been implemented with both the sigmoid and the softmax layer. The sigmoid layer classifies the real and the generated questions. The softmax layer is used for classifying the questions to the respective intent class. The policy gradient optimization technique is used for adjusting the weights and bias of the generator and discriminator during training. 2.3.1.8 Recognition of Chinese characters using TH-GAN Historical Chinese characters are of low-quality images. In order to enhance the quality of the historical Chinese character images, TH-GAN (transfer learning-based historical Chinese character recognition) [39] has been proposed shown in Fig. 2.21. The generator used is the U-Net architecture. The WGAN model has been used. The source Chinese character is given as input to the generator which outputs generated Chinese character. The target image, real character image, and the generated character images are given as input to the discriminator. The discriminator classifies between the real and the fake character image. The policy gradient is the technique used for adjusting weights and bias of the generator and the discriminator during training. The following session discusses the NLP datasets.
Fig. 2.21 Architecture of TH-GAN.
2.3.2 NLP datasets The open-source free NLP datasets available for research is shown in Table 2.3.
2.4 GANs in image generation and translation In recent years, for image generation and translation, many GAN architectures have been proposed such as cycleGAN, DualGAN, DiscoGAN, etc. The following section discusses the various applications of GANs in image generation and translation.
2.4.1 Applications of GANs in image generation and translation The following subsections discuss the various applications of GANs in image generation and translation. 2.4.1.1 Ensemble learning GANs in face forensics Fake images generated by newer image generation methods such as face2face and deepfake are really hard to distinguish using previous face-forensics methods. To overcome the same, a novel generative adversarial ensemble learning method [40] has been proposed as shown in Fig. 2.22. In this GAN, two generators with the same architecture are used but both of them are trained in different ways. The feedback face generator gets the feedback from the discriminators and generates the more fine-tuned image. As a discriminator ResNet and DenseNet are used. The ability to discriminate between real and fake images is achieved by combining the feature maps of both ResNet and DenseNet. Image is fed to both the network and 1024-dimensional output feature is extracted by using global average pooling and then a 2048-dimensional feature vector is generated by taking output features by both the networks and concatenating them, later SoftMax function is used to normalize the 2D scores. During the training process of the GAN, the spectral normalization method is used for the stabilization of the process. 2.4.1.2 Spherical image generation from the 2D sketch using SGANs Most of the VR applications rely mostly on panoramic images or videos, and most of the image generation models just focus on 2D images and ignore the spherical structure of the panoramic images. To solve this, a panoramic image generation method based on spherical convolution and GAN called SGAN [41] is proposed shown in Fig. 2.23. For input, a sketch map of the image is taken, which provides a really good geometric structure representation. A custom designed generator is used to generate the spherical image and it reduces the distortion in the image using spherical convolution, loss of least squares is used to describe the constraint for whether the discriminator is able to distinguish the image generated from the real image. The spherical convolution is used for observing the data from multiple angles. Discriminator is used to distinguish between generated
Table 2.3 NLP datasets.
NLP dataset | Description | Link
CNN/Daily Mail dataset | Text summarization dataset with two features: the documents to be summarized (article) and the target text summary (highlights) | https://github.com/abisee/cnndailymail
News summarization dataset | Contains the author details, the date of the news, the headlines, and the detailed news link | https://www.kaggle.com/sunnysai12345/news-summary
Chinese poem dataset | Small poems, each with 4-5 lines and 4-5 words per line | https://github.com/Disiok/poetry-seq2seq, https://github.com/XingxingZhang/rnnpg
COCO (common objects in context) captions | Object detection and caption dataset with five sections: info, licenses, images, annotations, and category | http://cocodataset.org/#download
Shakespeare's plays | 715 characters of Shakespeare plays, with a continuous set of lines spoken by each character in a play; can be used for text generation | https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/shakespeare/load_data
Open subtitles dataset | A group of translated movie subtitles in 62 languages | https://github.com/PolyAILDN/conversational-datasets/tree/master/opensubtitles
YELP | Business reviews and user dataset: 5,200,000 user business reviews, information about 174,000 businesses, and data about 11 metropolitan areas | https://www.kaggle.com/yelpdataset/yelp-dataset
Amazon | An Amazon review dataset | https://registry.opendata.aws/?search=managedBy:amazon
Caption | Approximately 3.3 million image-caption pairs | https://ai.googleblog.com/2018/09/conceptual-captionsnew-dataset-and.html
Noisy speech | Noisy and clean speech dataset; can be used for speech enhancement applications | https://datashare.is.ed.ac.uk/handle/10283/2791
OpinRank | 3,000,000 reviews of cars and hotels collected from TripAdvisor | http://kavita-ganesan.com/entity-ranking-data/#.XuxKF2gzY2z
Legal case reports | Text summaries of about 4000 cases; can be used for training text summarization tasks | https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports
Fig. 2.22 Architecture of ensemble learning GAN.
Fig. 2.23 Architecture of SGAN.
images and real images and in the case of image generation, a multiscale discriminator is used, which is quite common and adds the advantage of decreasing the burden on the network. 2.4.1.3 Generation of radar images using TsGAN Radar data becomes really hard to understand due to imbalanced data and also becomes the bottleneck for some operations. To defend the radar operations, a two-stage general adversarial network (TsGAN) [42] has been introduced, as shown in Fig. 2.24. In the first stage, it generates samples which are similar to real data and distinguishes its eligibility. To generate radar image sequences, each frame is decomposed as content information and motion information. Also, for capturing data such as the flow of clouds, RNN is used. For discriminators, two of them are being used, one for distinguishing between radar image and generated image and the second one for motion information, like image generation sequence. The second stage is used to define the relationship between intervals and adjacent frames. The rank discriminator is used for computing
Fig. 2.24 Architecture of TsGAN.
the rank loss between generated motion sequences, real motion sequences and the enhanced generated motion sequences. 2.4.1.4 Generation of CT from MRI using MCRCGAN The MRI (magnetic resonance images) are really useful in radiation treatment planning with the functional information that provides as compared with CT (computed tomography). But there are some applications where MRI cannot be used because of the absence of electron density information. To apply MRI for these types of applications, MCRCGAN (multichannel residual conditional GAN) [43] has been introduced, which generates pseudo-CT as shown in Fig. 2.25. MCRCGAN has two parts, generator which generated the pseudo-CT image according to the input MR images, and discriminator is used to distinguish between p-CT images with the real ones and measure the degrees/number of mismatches since it helps the network feed accordingly for the next iteration for better efficiency. MCRCGAN actually adopts the multichannel ResNet as the generator and CNN as the discriminator. 2.4.1.5 Generation of scenes from text using text-to-image GAN Generating an image from text is a vividly interesting research topic with very unique use cases but it is quite difficult since the language description and images vary a different part
Fig. 2.25 Architecture of MCRCGAN.
of the world and the current models which generate images tend to mix the generation of background and foreground which leads to object in images which are really submerged into the background. To make sure that the generation of the image is done by keeping in mind about the background and foreground. To achieve this VAE (variational autoencoder) and GAN proved to be robust. Here the generator contains three modules, namely, downsampling module, upsampling module, and residual module. The architecture of text-to-image GAN [44] is shown in Fig. 2.26. 2.4.1.6 Gastritis image generation using PG-GAN For detection of gastric cancer, X-ray images of gastric are used. Multiple X-ray images are relatively large in size so LC-PGGAN (loss function-based conditional progressive growing GAN) [45] has been introduced as shown in Fig. 2.27. This GAN generates images which are effective for gastritis classification and have all the necessary details to look for any sort of symptoms. For the generation of synthetic images, divided patched images are used. The whole process is divided into two different sections. (1) lowresolution step: Here fake and real images are given to the discriminator which sends the loss values to (2) high-resolution step: here fake images along with patches with random sampling and real images with patches are given to the discriminator to finalize the output.
Fig. 2.26 The architecture of text-to-image GAN.
Fig. 2.27 Architecture of LC-PGCAN.
2.4.1.7 Image-to-image translation using quality-aware GAN Image-to-image translation is one of the widely practiced with GAN and to do the same many works has been proposed but all of them depend on pretrained network structure or they rely on image pairs, so they cannot be applied on unpaired images. To solve these issues, a unified quality-aware GAN-based framework [46] was proposed as shown in Fig. 2.28. Here two different implementations of quality loss are done, one is based on the image quality score between the real and reconstructed image and another one is based on the adaptive deep network-based loss to calculate the score between the real and reconstructed image from the generator. Here the generators generate such as each constructed image has a similar or close score to the real image. The loss function includes adversarial loss, reconstruction loss, quality-aware loss, IQA loss, and content-based loss. 2.4.1.8 Generation of images from ancient text using encoder-based GAN Ancient texts are of great use since it helps us to get to know about our past and maybe some keys to our future, to retrieve or understand these texts, an encoder-based GAN [9] has been introduced to generate the remote sensing images retrieved from the text retrieved from different sources as shown in Fig. 2.29. To train this particular network, we have used satellite images and ancient images. Here generator is conditioned with the training set text encodings and corresponding texts are synthesized. The discriminator is used to predicting the sources of input images, for whether they are real or synthesized. Text encoder and Noise generator is used prior to the input.
Fig. 2.28 Architecture of quality aware GAN.
Fig. 2.29 Architecture of encoder-based GAN.
2.4.1.9 Generation of footprint images from satellite images using IGAN For many architectural purposes and planning, building footprints plays an important role. To convert satellite images into footprint images, a IGAN (improved GAN) [26] was proposed as shown in Fig. 2.30. This GAN uses CGAN with the cost function from Wasserstein distance and integrated with gradient penalty. The generator is provided with noise and satellite image, using Leaky ReLU as activator function it generates a footprint image which then sent to discriminator helps to get the score, and if the score does not get as close as the real image, it goes to generator again and the iterations provide better results every time. The dataset was based on Munich and Berlin which gave 256 256 images to work on. Also, segmentation is used on images to get the visible gradients. 2.4.1.10 Underwater image enhancement using a multiscale dense generative adversarial network Underwater image improvement has become more popular in underwater vision research. The underwater images suffer from various problems such as underexposure, color distortion, and fuzz. To address these problems, multiscale dense block generative adversarial network (MSDB-GAN) [47] for enhancing underwater images has been proposed as shown in Fig. 2.31. The random noise and the image to be enhanced are given as input to the generator. The multiscale dense block is embedded within the generator. The MSDB is used for concatenating all the local features of the image using the
Fig. 2.30 Architecture of IGAN.
Fig. 2.31 Architecture of MSDB-GAN.
Leaky ReLU activation function. The discriminator discriminates between the real and the generated image.
2.4.2 Image datasets
The open-source, freely available image datasets for research are shown in Table 2.4. The following section discusses the various evaluation metrics.

Table 2.4 Image datasets.
Image dataset | Description | Link
CelebA-HQ | 30,000 high-resolution face images | https://www.tensorflow.org/datasets/catalog/celeb_a_hq
AOI | 685,000 building footprints | https://spacenetchallenge.github.io/datasets/spacenetBuildingsV2summary.html
MRI brain tumor | 96 MRI brain tumor images | https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection
CUB | Images of 200 bird species | http://www.vision.caltech.edu/visipedia/CUB-200.html
Oxford 102 | Images of 102 flower categories | https://www.robots.ox.ac.uk/vgg/data/flowers/102/
CelebA | 200,000 celebrity images | http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
OpenStreetMap | Map data that can be downloaded by selecting smaller areas from the map | https://www.openstreetmap.org/#map=5/21.843/82.795
Visual Genome | 108,077 images with captions of people, signs, buildings, etc. | http://visualgenome.org/api/v0/api_home.html
Open Images | Approximately 9,000,000 images annotated with labels and bounding boxes for 600 object categories | https://storage.googleapis.com/openimages/web/download.html
CIFAR 10/100 | CIFAR-10 consists of 60,000 images in 10 classes; CIFAR-100 extends this to 100 classes with 600 images each | https://www.cs.toronto.edu/kriz/cifar.html
Caltech 256 | 30,000 images categorized into 256 classes | https://www.kaggle.com/jessicali9530/caltech256
LabelMe | 190,000 images, 60,000 annotated images, and 658,000 labeled objects | http://labelme.csail.mit.edu/Release3.0/browserTools/php/dataset.php
COIL-20 | 100 different toys, each photographed in 72 poses, giving 7200 images in total | https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
2.5 Evaluation metrics This section discusses the various evaluation metrics that are needed to assess the performance of the GAN models.
2.5.1 Precision
Precision (P) is the percentage of predicted positive results that are relevant. It is given by the ratio of true positives to all predicted positives:
$$P = \frac{TP}{TP + FP}$$
where TP is the number of true positives and FP the number of false positives.

2.5.2 Recall
Recall (R) is the percentage of relevant results that are correctly retrieved by the classifier. It is given by the ratio of true positives to all actual positives:
$$R = \frac{TP}{TP + FN}$$
where TP is the number of true positives and FN the number of false negatives.

2.5.3 F1 score
The F1 score is the harmonic mean of precision and recall:
$$F1 = 2\,\frac{P \cdot R}{P + R}$$
where P is precision and R is recall.

2.5.4 Accuracy
Accuracy measures how often the model predicts correctly. It is the ratio of true positive and true negative results to the total number of results:
$$Accuracy = \frac{TP + TN}{Total}$$
where TP is the number of true positives and TN the number of true negatives.
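The four metrics above can be computed directly from a confusion matrix, as in the sketch below; the labels are invented for illustration.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy from binary labels (Sections 2.5.1-2.5.4)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, f1, accuracy

print(classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1]))
```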
2.5.5 Fréchet inception distance
The Fréchet inception distance (FID) is a metric used to evaluate the quality of the images generated by GANs. A small FID means the generator has produced good-quality images; a large FID means the generated images are of lower quality.
$$FID = \|\mu_1 - \mu_2\|^2 + \mathrm{Tr}\big(C_1 + C_2 - 2\sqrt{C_1 C_2}\big)$$
where $\mu_1$ and $\mu_2$ are the feature-wise means of the real and generated images, $C_1$ and $C_2$ are the covariance matrices of the real and generated image feature vectors, and Tr denotes the trace operator from linear algebra.
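A sketch of the FID computation with NumPy and SciPy. In practice the feature vectors come from a pretrained Inception network; here they are random placeholders.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(features_real, features_fake):
    """Frechet inception distance between two sets of feature vectors.
    Each input is an (n_samples, n_features) array."""
    mu1, mu2 = features_real.mean(axis=0), features_fake.mean(axis=0)
    c1 = np.cov(features_real, rowvar=False)
    c2 = np.cov(features_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2.0 * covmean))

# assumed feature vectors standing in for Inception features
real = np.random.randn(64, 16)
fake = np.random.randn(64, 16) + 0.5
print(fid(real, fake))
```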
2.5.6 Inception score
The inception score (IS) measures both the quality of the generated images and the difference between the generated and real images. To measure image quality, the Inception network is used to classify the generated and real images; the difference between the real and generated images is computed using the KL divergence.
$$IS = \exp\Big( \mathbb{E}_{g \sim G}\, D_{KL}\big( p(r \mid g) \,\|\, p(r) \big) \Big)$$
where $g$ is a generated image, $r$ denotes the real samples with labels, and $D_{KL}$ is the Kullback-Leibler divergence, which measures the distance between the real and generated image probability distributions.
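A sketch of the inception score computed from a matrix of classifier probabilities; the softmax outputs below are invented for illustration.

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """Inception score from class probabilities of generated images.
    p_yx: (n_images, n_classes) array of softmax outputs from a classifier."""
    p_y = p_yx.mean(axis=0, keepdims=True)   # marginal label distribution
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# assumed classifier outputs for 4 generated images over 3 classes
p_yx = np.array([[0.90, 0.05, 0.05],
                 [0.05, 0.90, 0.05],
                 [0.05, 0.05, 0.90],
                 [0.34, 0.33, 0.33]])
print(inception_score(p_yx))   # higher is better
```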
2.5.7 IoU score
Intersection over union (IoU), also called the Jaccard index, measures the overlap between the predicted results and the ground-truth samples. The score ranges from 0 to 1, with 0 indicating no overlap.
$$IoU = \frac{TP}{TP + FP + FN}$$
where TP, FP, and FN are the true positive, false positive, and false negative results.
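A sketch of IoU for two binary masks (the masks are invented for illustration):

```python
import numpy as np

def iou(mask_pred, mask_true):
    """Intersection over union of two binary masks (Section 2.5.7)."""
    mask_pred, mask_true = mask_pred.astype(bool), mask_true.astype(bool)
    intersection = np.logical_and(mask_pred, mask_true).sum()
    union = np.logical_or(mask_pred, mask_true).sum()
    return float(intersection / union) if union else 1.0

pred = np.array([[1, 1, 0], [0, 1, 0]])
true = np.array([[1, 0, 0], [0, 1, 1]])
print(iou(pred, true))   # 2 overlapping pixels out of 4 in the union -> 0.5
```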
2.5.8 Sensitivity
Sensitivity measures the percentage of true positives that are correctly identified:
$$Sensitivity = \frac{TP}{TP + FN}$$
where TP is the number of true positives and FN the number of false negatives.

2.5.9 Specificity
Specificity measures the percentage of true negatives that are correctly identified:
$$Specificity = \frac{TN}{TN + FP}$$
where TN is the number of true negatives and FP the number of false positives.
2.5.10 BLEU score
The bilingual evaluation understudy (BLEU) score measures the similarity between the system-generated text and the input reference text:
$$BLEU = \frac{N}{T}$$
where N is the number of words in the system-generated text that also appear in the input reference text, and T is the total number of system-generated words.
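The chapter's simplified BLEU (a plain word-overlap ratio, not the full n-gram BLEU) can be computed as follows; the sentences are examples only.

```python
def simple_bleu(generated, reference):
    """Word-overlap score following the simplified BLEU definition above (N / T).
    This is not the full n-gram BLEU metric."""
    gen_words = generated.lower().split()
    ref_words = set(reference.lower().split())
    matches = sum(1 for w in gen_words if w in ref_words)
    return matches / len(gen_words)

print(simple_bleu("the cat sat on the mat", "a cat sat on a mat"))  # 4 of 6 words match
```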
2.5.11 ROUGE score
The recall-oriented understudy for gisting evaluation (ROUGE) score is used to evaluate automatic text summarization. It is computed from a ROUGE precision and a ROUGE recall:
$$ROUGE_{Precision} = \frac{N}{T}, \qquad ROUGE_{Recall} = \frac{N}{R}$$
where N is the number of words shared by the system-generated text and the input reference text, T is the total number of words in the system-generated text, and R is the total number of words in the input reference text. The next section discusses the various languages and tools used for research.
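Before turning to tools, a combined sketch of the ROUGE precision and recall defined above; the sentences are examples only, and production code would normally use an n-gram based ROUGE package.

```python
def rouge_precision_recall(generated, reference):
    """Unigram ROUGE precision and recall as defined in Section 2.5.11."""
    gen, ref = generated.lower().split(), reference.lower().split()
    overlap = len(set(gen) & set(ref))
    return overlap / len(gen), overlap / len(ref)

precision, recall = rouge_precision_recall("gan generates realistic images",
                                            "the gan model generates images")
print(precision, recall)
```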
2.6 Tools and languages used for GAN research This section discusses the various languages that can be used for training the neural networks such as generator and the discriminator.
2.6.1 Python
For training the generator:
(1) Pandas is used for data manipulation.
(2) Using the os module, the data path is set.
(3) Train and test data are divided using the pd.DataFrame() function.
(4) In the infinite loop,
• pd.read_csv is used to read the data from the CSV file
• labels are retrieved from the list using the data.iloc() function
• each item is appended to an array using the append() function
(5) Inside generator(), batch_size and shuffle_data are used,
• empty [] lists are initialized to hold the arrays
• cv2.imread is used to read the images (if there are any)
• the arrays are built using np.array
For training the discriminator:
(1) Keras can be used.
(2) The discriminator is defined using def define_discriminator(n_inputs=2).
(3) Define the model type and the activation functions to be used by:
• model = Sequential()
• model.add(Dense(25, activation='relu', kernel_initializer='he_uniform', input_dim=n_inputs))
• model.add(Dense(1, activation='sigmoid'))
(4) Compile the model by specifying the loss function and the optimizer to be used with model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']).
(5) return model
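Assembling steps (1) through (5) for the discriminator into a single runnable sketch (the layer sizes and optimizer follow the calls listed above; everything else, such as the default n_inputs, is an assumption):

```python
from keras.models import Sequential
from keras.layers import Dense

def define_discriminator(n_inputs=2):
    """Small binary discriminator following the steps listed above."""
    model = Sequential()
    model.add(Dense(25, activation='relu',
                    kernel_initializer='he_uniform', input_dim=n_inputs))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

model = define_discriminator()
model.summary()
```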
2.6.2 R programming
• Install the neural network package using install.packages("neuralnet").
• Load the neural net package using library("neuralnet").
• Read the CSV file using read.csv().
• Preview the dataset using View().
• To view the structure and verify the ID variable, the str() function is used.
• To set the input variables to the same scale, scale(Any_var[1:12]) is used.
• Generate a random seed using set.seed(200).
• Split the dataset into a 70-30 train and test set using ind <- …

[…] four kernels, $\{LL^{\top}, LH^{\top}, HL^{\top}, HH^{\top}\}$, where the low-pass (L) and high-pass (H) filters are
$$L^{\top} = \frac{1}{\sqrt{2}}\,[\,1,\ 1\,], \qquad H^{\top} = \frac{1}{\sqrt{2}}\,[\,-1,\ 1\,] \qquad (13.3)$$
Thus, the DWT can generate four types of output denoted as LL, LH, HL, HH, respectively. Fig. 13.4 shows the examples after DWT. The output of LL has the smooth texture of the images, while the rest of the outputs capture the vertical, horizontal, and diagonal edges [31]. For simplicity, we denote the output of LL as low-frequency components and outputs of LH, HL, and HH as high-frequency components. The DWT enables the proposed model to control the IR-to-RGB conversion by different components separately. Specifically, the low-frequency component can affect the overall generative texture, while the high-frequency components affect the generative structure. Without processing the generative network’s high-frequency components, the structural information can be well maintained in these components. From this point of view, the
Fig. 13.4 The illustration of discrete wavelet transformation (DWT).
Fig. 13.5 The illustration of wavelet pooling and wavelet unpooling.
wavelet pooling and wavelet unpooling are proposed to use these components in autoencoder for better IR-to-RGB translation, as shown in Fig. 13.5 [31]. The wavelet pooling applies DWT to the encoder layer to have low frequency and high-frequency components. The kernels of the convolutional layer are changed to DWT kernels to apply the DWT in the deep neural layer. On the other hand, the wavelet pooling layer is locked during the optimization. Moreover, the stride of the layer is 2 to have downsampling features as same as conventional pooling layers [31]. The lowfrequency component will be further processed by the network. Meanwhile, the high-frequency components skipped to the symmetrical wavelet unpooling layer in the generator. In the wavelet unpooling layer of the generator, both high-frequency and low-frequency components are concatenated. Then, the concatenated components are processed to have upsampling features by transpose convolution [31].
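To make the wavelet pooling concrete, the sketch below applies the Haar filters of Eq. (13.3) with stride 2 to a single-channel array and returns the four subbands. It is an illustration under simplifying assumptions (single channel, even dimensions, one common subband-naming convention), not the chapter's implementation.

```python
import numpy as np

def haar_dwt(img):
    """One-level Haar DWT (Eq. 13.3) of a 2D array with even height and width.
    Returns the LL, LH, HL, HH subbands, each at half resolution."""
    low = np.array([1.0, 1.0]) / np.sqrt(2.0)    # L filter
    high = np.array([-1.0, 1.0]) / np.sqrt(2.0)  # H filter

    def filter_rows(x, f):
        # apply a length-2 filter along rows with stride 2
        return f[0] * x[:, 0::2] + f[1] * x[:, 1::2]

    def filter_cols(x, f):
        # apply a length-2 filter along columns with stride 2
        return f[0] * x[0::2, :] + f[1] * x[1::2, :]

    ll = filter_cols(filter_rows(img, low), low)
    lh = filter_cols(filter_rows(img, low), high)
    hl = filter_cols(filter_rows(img, high), low)
    hh = filter_cols(filter_rows(img, high), high)
    return ll, lh, hl, hh

img = np.random.rand(8, 8)                 # assumed single-channel image patch
subbands = haar_dwt(img)
print([s.shape for s in subbands])         # four (4, 4) outputs: LL, LH, HL, HH
```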
13.3.3 Objective functions in adversarial training
The full objective of the WGGAN comprises four loss functions: cycle-consistency loss, ELBO loss, perceptual loss, and GAN loss [11, 13, 32].
Cycle-consistency loss. To train the proposed method with unpaired RGB and IR images, we adopt a cycle-consistency loss similar to that of MUNIT and CycleGAN [11]. The basic idea is to include two generative networks that constrain the generated images. Two generative adversarial networks are used in training: GAN1 = {E1, G1, D1} for IR-to-RGB translation and GAN2 = {E2, G2, D2} for RGB-to-IR translation, where E, G, and D denote encoder, generator, and discriminator, respectively. For simplicity, E(x) = z indicates that the latent code z is generated by encoder E. The idea of the loss is that the image translation cycle should be capable of bringing converted images back to the original images, i.e., $x \rightarrow E(x) \rightarrow G_1(z) \rightarrow G_2(G_1(z)) \approx x$. The cycle-consistency loss is
$$L_{CC}(E_1, G_1, E_2, G_2) = \mathbb{E}_{x_1 \sim p(x_1)}\big[\, \| G_2(G_1(z_1)) - x_1 \| \,\big] + \mathbb{E}_{x_2 \sim p(x_2)}\big[\, \| G_1(G_2(z_2)) - x_2 \| \,\big] \qquad (13.4)$$
where $\|\cdot\|$ denotes the $\ell_1$ distance.
ELBO loss. The ELBO loss minimizes the variational upper bound of the latent space z. The objective function is
$$L_{E}(E, G) = \lambda_1\, KL\big( q(z \mid x) \,\|\, p_{\eta}(z) \big) - \lambda_2\, \mathbb{E}_{z \sim q(z \mid x)}\big[ \log p_G(x \mid z) \big] \qquad (13.5)$$
$$KL(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \qquad (13.6)$$
where the hyperparameters $\lambda_1$ and $\lambda_2$ control the weights of the objective terms, and the KL divergence term penalizes deviation of the latent distribution from the prior distribution $p_{\eta}(z)$. Here $q(\cdot)$ represents the reparameterization mentioned in the previous section, $p_{\eta}(z)$ is a zero-mean Gaussian distribution, and $p_G(\cdot)$ is a Laplacian distribution based on the generator, following empirical studies [13].
Perceptual loss. The perceptual loss is a conventional loss function for neural style transfer computed with the help of a pretrained VGG-16 [33], as shown in the following equation. It consists of two parts: the first term is the content loss and the second term is the style loss.
$$L_{P}(E, G, x_c, x_s) = \frac{1}{C_j H_j W_j} \big\| \phi_j(G(z)) - \phi_j(x_c) \big\|_2^2 + \frac{1}{C_j H_j W_j} \big\| Gr\big(\phi_j(G(z))\big) - Gr\big(\phi_j(x_s)\big) \big\|_2^2 \qquad (13.7)$$
where $\phi_j(x)$ is the feature map of the $j$th convolutional layer, of shape $C_j \times H_j \times W_j$, in the pretrained VGG-16; $x_c$ denotes content images, $x_s$ denotes style images, and $Gr$ is the Gram matrix used to represent image style. From this point of view, the perceptual loss transfers the style of images while maintaining the image structure. The details of the perceptual loss can be found in Ref. [32].
GAN loss. The GAN loss ensures that the translated images resemble the images in the respective target domains [14]. For example, if the discriminator regards the synthetic IR images as real IR images, the synthetic IR images are successful.
$\mathcal{L}_{GAN}(E, G, D) = \mathbb{E}_{x \sim p(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z \mid x)}\left[\log\left(1 - D(G(z))\right)\right] \quad (13.8)$
Full loss. Finally, the complete loss function can be written as

$\mathcal{L}_{total} = \lambda_E\, \mathcal{L}_E(E, G) + \lambda_P\, \mathcal{L}_P(E_1, G_1, x_1, x_2) + \lambda_{GAN}\, \mathcal{L}_{GAN}(E, G, D) + \lambda_{CC}\, \mathcal{L}_{CC}(E_1, G_1, E_2, G_2) \quad (13.9)$

where $\lambda_E = 0.1$, $\lambda_P = 0.1$, $\lambda_{GAN} = 1$, and $\lambda_{CC} = 10$ are the weights of the ELBO loss, perceptual loss, GAN loss, and cycle-consistency loss, respectively.
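As a minimal sketch of how Eq. (13.9) combines the individual terms, the following PyTorch snippet weights four already-computed loss tensors using the values reported above. The helper names are illustrative placeholders rather than identifiers from the chapter's implementation; only the weights come from the text.

```python
import torch.nn.functional as F

# Weights reported in the chapter for Eq. (13.9)
LAMBDA_E, LAMBDA_P, LAMBDA_GAN, LAMBDA_CC = 0.1, 0.1, 1.0, 10.0

def cycle_consistency_loss(x1, x1_cycle, x2, x2_cycle):
    """Eq. (13.4): l1 distance between inputs and their round-trip reconstructions."""
    return F.l1_loss(x1_cycle, x1) + F.l1_loss(x2_cycle, x2)

def total_loss(elbo_loss, perceptual_loss, gan_loss, cc_loss):
    """Eq. (13.9): weighted sum of the four objective terms."""
    return (LAMBDA_E * elbo_loss + LAMBDA_P * perceptual_loss
            + LAMBDA_GAN * gan_loss + LAMBDA_CC * cc_loss)
```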
13.4 Experiments
This section presents the details of the experiments. First, it describes the dataset and the evaluation methods used in the experiments. Then, the baselines and the relevant experimental setup are introduced. Finally, both qualitative and quantitative analyses of the translation results are presented.
13.4.1 Data description
The dataset used is FLIR ADAS [34], an open dataset for autonomous driving. It contains RGB and IR images recorded from the same driving car; however, the RGB and IR images are unpaired because of the cameras' different properties [34]. For all experiments, the training and testing splits follow the dataset benchmark. The training set contains 8862 IR images and 8363 RGB images, while the testing set contains 1363 IR images and 1257 RGB images. The statistics of FLIR ADAS are presented in Table 13.1.
13.4.2 Evaluation methods
Qualitative analysis. In research on generative models, a human perceptual study is a direct way to compare translation quality across models. In this study, several graduate students and computer vision engineers are invited to subjectively evaluate the translated results of the proposed method and the baseline methods alongside the source IR images. They are then asked to select the output with the best quality and to provide short comments.

Table 13.1 Statistics in FLIR ADAS.

Dataset   | Image type | # of frames | Image size
Training  | IR         | 8862        | 640 x 512
Training  | RGB        | 8363        | 1800 x 1600
Testing   | IR         | 1363        | 640 x 512
Testing   | RGB        | 1257        | 1800 x 1600
Quantitative analysis. Numerically evaluating the proposed WGGAN and the comparative methods is challenging because there are no paired RGB images. To measure translation quality, we include four Inception-based metrics: the 1-nearest neighbor classifier (1-NN), kernel maximum mean discrepancy (KMMD), Fréchet inception distance (FID), and Wasserstein distance (WD) [35]. These metrics compute the distance between features of the target images and the generated images extracted from the Inception network [35]. If an IR image is well translated, these metrics take small values, indicating that the generated RGB image is close to the distribution of target RGB images. In addition, two no-reference image quality assessment (NR-IQA) methods, the blind/referenceless image spatial quality evaluator (BRISQUE) [36] and the natural image quality evaluator (NIQE) [37], are used to evaluate the generated images independently, without any paired or unpaired reference images; smaller values indicate higher quality of the translated images. Moreover, a multicriteria decision analysis, TOPSIS [38], is included to summarize all the quantitative evaluation metrics.
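As an illustration of how one of these metrics can be computed, the following NumPy sketch estimates the squared kernel MMD between two sets of feature vectors (e.g., Inception features of real and generated images, extracted separately as in [35]). The Gaussian kernel and its bandwidth are assumptions made for this sketch; they are not specified in the chapter.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    # Pairwise Gaussian kernel values between the rows of a and b
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kmmd(real_feats, fake_feats, sigma=10.0):
    """Biased estimate of the squared kernel MMD between two feature sets.

    Small values mean the generated images' feature distribution is close
    to the target distribution, matching the interpretation in the text.
    """
    k_rr = gaussian_kernel(real_feats, real_feats, sigma).mean()
    k_ff = gaussian_kernel(fake_feats, fake_feats, sigma).mean()
    k_rf = gaussian_kernel(real_feats, fake_feats, sigma).mean()
    return k_rr + k_ff - 2.0 * k_rf
```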
13.4.3 Baselines
CycleGAN. CycleGAN consists of two standard residual autoencoders trained with a GAN loss and a cycle-consistency loss [11].
MUNIT. MUNIT is similar to CycleGAN and also consists of two autoencoders. To generate diverse images, the MUNIT encoder has a content branch and a style branch; inspired by neural style transfer [13], the two branches are combined by adaptive instance normalization in the generator for image reconstruction.
StarGAN. StarGAN is a state-of-the-art generative method for facial attribute transfer and facial expression synthesis. It includes a mask vector and domain classification to generate diverse outputs [29].
UGATIT. UGATIT adds an attention mechanism, inspired by weakly supervised learning, to a residual autoencoder with an auxiliary classifier. It also introduces adaptive layer-instance normalization in the residual generator [12] and achieves strong performance on anime translation and style transfer tasks.
13.4.4 Experimental setup
Adaptive moment estimation (ADAM) [12] is used as the optimizer for training the proposed method, with the learning rate set to 0.00001 and the momentums set to 0.5 and 0.99. To improve the model's robustness, the batch size is set to 1 and instance normalization is applied after each neural layer. The discriminator is adopted from PatchGAN [11]. Moreover, all activation functions are set to the rectified linear unit (ReLU), while the activation function of the output layer is Tanh to generate synthetic images.
To make a fair comparison, both WGGAN and the baseline models are trained for 27 epochs with batch size 1. All images are resized to 512 x 512 before being fed to the network. A desktop with an NVIDIA TITAN RTX GPU, an Intel Core i7 CPU, and 64 GB of memory is used throughout the experiments.
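The reported training configuration can be summarized in a short PyTorch sketch. The two tiny networks below are placeholders standing in for the WGGAN encoder/generator and the PatchGAN discriminator, which are not reproduced here; only the optimizer settings, normalization and activation choices, batch size, image size, and epoch count come from the text.

```python
import torch
import torch.nn as nn

# Placeholder networks -- illustrative stand-ins, not the WGGAN architecture.
generator = nn.Sequential(nn.Conv2d(1, 3, 3, padding=1), nn.Tanh())           # Tanh output layer
discriminator = nn.Sequential(nn.Conv2d(3, 64, 4, stride=2, padding=1),
                              nn.InstanceNorm2d(64), nn.ReLU())               # instance norm + ReLU

# Adam with learning rate 1e-5 and momentums (betas) 0.5 / 0.99, as reported.
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-5, betas=(0.5, 0.99))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-5, betas=(0.5, 0.99))

NUM_EPOCHS, BATCH_SIZE, IMAGE_SIZE = 27, 1, 512   # training schedule from the text
```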
13.4.5 Translation results
This subsection presents both qualitative and quantitative analyses of the translation results compared with the baselines. The qualitative analysis illustrates translation examples with subjective comments, while the quantitative analysis presents numerical comparisons between the proposed WGGAN and the baselines.
13.4.5.1 Qualitative analysis
Fig. 13.6 illustrates translation results on the FLIR ADAS test set. StarGAN has the worst translation performance, with unsuitable colors and noisy black spots on the images. The remaining methods generate clear edges for solid objects such as vehicles. However, UGATIT is not able to translate objects such as trees and houses clearly. Compared with WGGAN, CycleGAN, and MUNIT, participants also point out that the road texture is not well translated by UGATIT, as shown in Fig. 13.6: the texture of the road is too smooth to show details such as the curb. CycleGAN, in contrast, can translate objects from IR images with sharp edges and textures, although several participants mention incorrectly mapped objects in the RGB images it generates; for example, trees should not appear in the sky in Fig. 13.6B. Both the proposed WGGAN and MUNIT translate the IR images with clear texture information, although some parts of the images, such as the sky and people, are not correctly translated, as shown in Fig. 13.6C. In the qualitative evaluation, participants indicate that the proposed WGGAN generates the best-quality images, with clear texture and correctly mapped objects, and that its outputs contain less scattered noise than those of the other state-of-the-art methods. Overall, participants consider the proposed WGGAN to have the best performance in IR-to-RGB translation.
13.4.5.2 Quantitative analysis
Table 13.2 presents the quantitative results of the IR-to-RGB translation; the best result for each evaluation method is highlighted. It is difficult to identify a single best method among the contemporary baselines: CycleGAN performs very well in the NR-IQA evaluation, while MUNIT performs better in 1-NN and KMMD. The proposed WGGAN, however, outperforms all contemporary models with the smallest values in 1-NN, KMMD, FID, and NIQE. For the Inception-based metrics, WGGAN yields 26.1% and 53.1% improvements in KMMD and FID, respectively, which indicates that the generated RGB images are close to the target RGB domain.
Fig. 13.6 Examples of (A) source IR images, (B) proposed WGGAN, (C) CycleGAN, (D) MUNIT, (E) StarGAN, and (F) UGATIT.
Table 13.2 Quantitative results of contemporary methods.

Models    | 1-NN  | KMMD  | FID   | WD    | BRISQUE | NIQE
CycleGAN  | 0.961 | 0.318 | 0.222 | 61.60 | 15.17   | 2.730
MUNIT     | 0.927 | 0.237 | 0.157 | 67.50 | 27.99   | 2.750
StarGAN   | 0.992 | 0.397 | 0.121 | 75.73 | 36.55   | 6.392
UGATIT    | 0.959 | 0.283 | 0.098 | 65.88 | 36.81   | 3.663
WGGAN     | 0.924 | 0.175 | 0.046 | 65.97 | 28.89   | 2.477
Table 13.3 Ranking results by TOPSIS based on the quantitative results.

Model  | CycleGAN | MUNIT | StarGAN | UGATIT | WGGAN
TOPSIS | 0.479    | 0.569 | 0.313   | 0.559  | 0.796
On the other hand, WGGAN also achieves the best performance in NIQE, which indicates that its images are the closest to natural images in terms of statistical regularities. To identify the best model across all of these evaluation methods, we use a common multicriteria decision model, TOPSIS, to rank the image translation models. Table 13.3 shows that WGGAN obtains the highest TOPSIS score. The ranking results further demonstrate that the proposed WGGAN generates higher-quality RGB images than the other state-of-the-art image translation methods.
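The TOPSIS ranking can be reproduced in spirit with a few lines of NumPy. The sketch below assumes equal metric weights and treats all six metrics as "smaller is better"; the chapter does not state the weighting it used (Ref. [38] describes the full procedure), so the exact scores in Table 13.3 may differ.

```python
import numpy as np

def topsis(scores, benefit):
    """Minimal TOPSIS: returns a closeness coefficient per model (higher = better).

    scores  -- (n_models, n_metrics) decision matrix
    benefit -- boolean per metric, True if larger values are better
    Equal metric weights are assumed in this sketch.
    """
    norm = scores / np.sqrt((scores ** 2).sum(axis=0))             # vector normalization
    ideal_best = np.where(benefit, norm.max(axis=0), norm.min(axis=0))
    ideal_worst = np.where(benefit, norm.min(axis=0), norm.max(axis=0))
    d_best = np.sqrt(((norm - ideal_best) ** 2).sum(axis=1))
    d_worst = np.sqrt(((norm - ideal_worst) ** 2).sum(axis=1))
    return d_worst / (d_best + d_worst)

# Rows follow Table 13.2: CycleGAN, MUNIT, StarGAN, UGATIT, WGGAN
scores = np.array([[0.961, 0.318, 0.222, 61.60, 15.17, 2.730],
                   [0.927, 0.237, 0.157, 67.50, 27.99, 2.750],
                   [0.992, 0.397, 0.121, 75.73, 36.55, 6.392],
                   [0.959, 0.283, 0.098, 65.88, 36.81, 3.663],
                   [0.924, 0.175, 0.046, 65.97, 28.89, 2.477]])
print(topsis(scores, benefit=np.zeros(6, dtype=bool)))   # all metrics: smaller is better
```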
13.5 Conclusion
In this chapter, an IR-to-RGB image translation method, the wavelet-guided generative adversarial network (WGGAN), is proposed for context enhancement. A wavelet-guided variational autoencoder (WGVA), which combines variational inference and the discrete wavelet transformation, is proposed for generating smooth and clear RGB images from the IR domain. In addition, further objective functions, such as the ELBO loss and perceptual loss, are introduced to improve generative quality. Both qualitative and quantitative results demonstrate the effectiveness of the proposed WGGAN in enabling better context enhancement for IR-to-RGB translation. Many industrial applications can benefit from the proposed method, such as object detection at night for semiautonomous driving, unmanned aerial vehicle (UAV) surveillance, and urban security. Although the proposed WGGAN shows promising results in thermal image translation, there is still room for improvement; for example, object colors should be more easily distinguishable from the background. We therefore first aim to add more IR and RGB images to train the proposed WGGAN more fully. Then, more advanced modules, such as adaptive instance normalization, will be included to further enhance the translation.
Acknowledgment The first three authors are supported in part by grants from TerraSense Analytics Ltd. and Advanced Research Computing, University of British Columbia.
References [1] G. Bhatnagar, Z. Liu, A novel image fusion framework for night-vision navigation and surveillance, Signal Image Video Process. 9 (1) (2015) 165–175. [2] G. Hermosilla, F. Gallardo, G. Farias, C.S. Martin, Fusion of visible and thermal descriptors using genetic algorithms for face recognition systems, Sensors 15 (8) (2015) 17944–17962. [3] X. Chang, L. Jiao, F. Liu, F. Xin, Multicontourlet-based adaptive fusion of infrared and visible remote sensing images, IEEE Geosci. Remote Sens. Lett. 7 (3) (2010) 549–553. [4] T. Hamam, Y. Dordek, D. Cohen, Single-band infrared texture-based image colorization, in: 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel, 2012, pp. 1–5. [5] J. Ma, et al., Infrared and visible image fusion via detail preserving adversarial learning, Inf. Fusion 54 (2020) 85–98. [6] X. Jin, et al., A survey of infrared and visual image fusion methods, Infrared Phys. Technol. 85 (2017) 478–501. [7] S. Liu, V. John, E. Blasch, Z. Liu, Y. Huang, IR2VI: enhanced night environmental perception by unsupervised thermal image translation, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, June, vol. 2018, 2018, pp. 1234–1241. [8] H. Chang, O. Fried, Y. Liu, S. DiVerdi, A. Finkelstein, Palette-based photo recoloring, ACM Trans. Graph. 34 (4) (2015). [9] G. Larsson, M. Maire, G. Shakhnarovich, Learning representations for automatic colorization, in: Proceedings of European Conference on Computer Vision (ECCV), LNCS, vol. 9908, 2016, pp. 577–593. [10] A.Y.-S. Chia, et al., Semantic colorization with internet images, ACM Trans. Graph. 30 (6) (2011) 1–8. [11] J.Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, October, vol. 2017, 2017, pp. 2242–2251. [12] J. Kim, M. Kim, H. Kang, L. Kwanhee, U-GAT-IT: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation, in: ICLR 2020, 2020, pp. 1–19. [13] X. Huang, M.Y. Liu, S. Belongie, J. Kautz, Multimodal unsupervised image-to-image translation, in: Proceedings of European Conference on Computer Vision (ECCV), LNCS, vol. 11207, 2018, pp. 179–196. [14] I.J. Goodfellow, et al., Generative adversarial nets, Adv. Neural Inf. Process. Syst. 3 (2014) 2672–2680. [15] A. Levin, D. Lischinski, Y. Weiss, Colorization using optimization, in: ACM SIGGRAPH 2004 Papers, 2004, pp. 689–694. [16] Q. Luan, F. Wen, D. Cohen-Or, L. Liang, Y.-Q. Xu, H.-Y. Shum, Natural image colorization, in: Proceedings of the 18th Eurographics Conference on Rendering Techniques, 2007, pp. 309–320. [17] R.K. Gupta, A.Y.-S. Chia, D. Rajan, H.Z. Ee Sin Ng, Image colorization using similar images, in: Proceedings of the 20th ACM international conference on Multimedia (MM’12), 2012. [18] A. Hertzmann, C.E. Jacobs, N. Oliver, B. Curless, D.H. Salesin, Image analogies, in: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 2001, pp. 327–340. [19] Y. Zheng, E. Blasch, Z. Liu, Multispectral Image Fusion and Night Vision Colorization, Society of Photo-Optical Instrumentation Engineers, 2018. [20] Z. Cheng, Q. Yang, B. Sheng, Deep colorization, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), vol. 2015, 2015, pp. 415–423.
[21] M. Limmer, H.P.A. Lensch, Infrared colorization using deep convolutional neural networks, in: 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016, pp. 61–68. [22] P.L. Sua´rez, A.D. Sappa, B.X. Vintimilla, Infrared image colorization based on a triplet DCGAN architecture, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 212–217. [23] P.L. Sua´rez, A.D. Sappa, B.X. Vintimilla, Learning to colorize infrared images, in: PAAMS, 2017. [24] S. Tripathy, J. Kannala, E. Rahtu, Learning image-to-image translation using paired and unpaired training samples, in: Proceedings of Asian Conference on Computer Vision (ACCV), LNCS, vol. 11362, 2019, pp. 51–66. [25] P. Isola, J.Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings—30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, January, vol. 2017, 2017, pp. 5967–5976. [26] T. Kim, M. Cha, H. Kim, J.K. Lee, J. Kim, Learning to discover cross-domain relations with generative adversarial networks, in: The 34th International Conference on Machine Learning, March, vol. 4, 2017, pp. 2941–2949. [27] Z. Yi, H. Zhang, P. Tan, M. Gong, DualGAN: unsupervised dual learning for image-to-image translation, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), October, vol. 2017, 2017, pp. 2868–2876. [28] M.Y. Liu, T. Breuel, J. Kautz, Unsupervised image-to-image translation networks, Adv. Neural Inf. Process. Syst. 2017 (2017) 701–709. [29] Y. Choi, M. Choi, M. Kim, J.W. Ha, S. Kim, J. Choo, StarGAN: unified generative adversarial networks for multi-domain image-to-image translation, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797. [30] Y. Choi, Y. Uh, J. Yoo, J.-W. Ha, StarGAN v2: diverse image synthesis for multiple domains, in: CoRR, vol. abs/1912.0, 2019. [31] J. Yoo, Y. Uh, S. Chun, B. Kang, J.W. Ha, Photorealistic style transfer via wavelet transforms, in: Proceedings of the IEEE International Conference on Computer Vision, October, vol. 2019, 2019, pp. 9035–9044. [32] J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in: Proceedings of the 14th European Conference on Computer Vision, Lecture Notes in Computer Science (LNCS), vol. 9906, Springer, 2016, pp. 694–711. [33] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: 3rd International Conference on Learning Representations. ICLR 2015—Conference Track Proceedings, 2015, pp. 1–14. [34] F.A. Group, FLIR thermal dataset for algorithm training, 2018, [Online]. Available from: https:// www.flir.in/oem/adas/adas-dataset-form. [35] Q. Xu, et al., An empirical study on evaluation metrics of generative adversarial networks, in: CoRR, vol. arXiv:1806, 2018, pp. 1–14. [36] A. Mittal, A.K. Moorthy, A.C. Bovik, No-reference image quality assessment in the spatial domain, IEEE Trans. Image Process. 21 (12) (2012) 4695–4708. [37] A. Mittal, R. Soundararajan, A.C. Bovik, Making a ‘completely blind’ image quality analyzer, IEEE Signal Process. Lett. 20 (3) (2013) 209–212. [38] V. Yadav, S. Karmakar, P.P. Kalbar, A.K. Dikshit, PyTOPS: a python based tool for TOPSIS, SoftwareX 9 (2019) 217–222.
CHAPTER 14
Generative adversarial network for video analytics
A. Sasithradevi (a), S. Mohamed Mansoor Roomi (b), and R. Sivaranjani (c)
(a) School of Electronics Engineering, VIT University, Chennai, India
(b) Department of Electronics and Communication Engineering, Thiagarajar College of Engineering, Madurai, India
(c) Department of Electronics and Communication Engineering, Sethu Institute of Technology, Madurai, India
14.1 Introduction
The objective of video analytics is to recognize events in videos automatically. Video analytics can detect events such as a sudden burst of flames, suspicious movement of vehicles and pedestrians, or abnormal movement of a vehicle not obeying traffic signs. A well-known application in this research field is video surveillance, which started evolving about 50 years ago. The principle behind video surveillance is to have human operators monitor the events occurring in a public area, room, or other space of interest. In general, an operator is given responsibility for several cameras, and studies have shown that increasing the number of cameras monitored per operator degrades the operator's performance. Hence, video analysis software aims to provide a better trade-off between accurate event detection and the huge volume of video information [1–3]. Machine learning, and in particular deep learning, has propelled research in the video analytics domain. The fundamental purpose of deep learning here is to identify sophisticated models that capture the probability distributions over the video samples to be analyzed. The generative adversarial network (GAN) provides an efficient way to learn deep representations with minimal training data. GAN is an evolving technique for generating and representing samples using both unsupervised and semisupervised learning methods, accomplished through implicit modeling of high-dimensional data distributions. The underlying working principle of a GAN is to train a pair of networks in competition with each other: one acts as a forger and the other as a skilled expert. Formally, the generator creates fake data mimicking realistic data, and the discriminator is an expert trained to distinguish the real samples from the forged ones. Both networks are trained simultaneously in competition with each other. This generic framework for GAN is shown in Fig. 14.1. Both the generator and the discriminator are neural networks, where the former generates new instances and the latter assesses whether the instances belong to the dataset.
Fig. 14.1 Generic framework for generative adversarial networks.
For classification, the discriminator plays the role of a classifier that distinguishes real from fake samples. To build a GAN, one needs a training dataset and a clear idea of the desired output. Initially, the GAN learns a simple distribution of 2D data; with further training it becomes able to mimic high-dimensional data distributions. During the training phase, both competing networks acquire information about the distribution of the data. The data samples produced by the generator, along with the real data samples, are used to train the discriminator. After sufficient training, the generator is trained against the discriminator and thus learns to map arbitrary random samples to realistic data. Consider the scenario in Fig. 14.1, where a D-dimensional noise vector drawn from the latent space is fed into the generator, which converts it into new data samples. The discriminator then processes both the real and the fake samples to classify them. A key advantage of GANs lies in their randomness, which lets them create new data samples rather than exact replicas of the real data. Another crucial advantage of GANs over autoencoders [4] and Boltzmann machines [5] is that GANs do not rely on Markov chains for generating samples; GANs were designed to eliminate the high computational cost associated with Markov chains, and the generator function is subject to fewer restrictions than in Boltzmann machines. Owing to these advantages, GANs have been applied to a wide variety of tasks, and interest in using them in numerous areas keeps increasing. They have been used effectively for image-to-image translation, obtaining high-resolution images from low-resolution images, deciding on drugs for treating particular diseases, image retrieval, object recognition, text-to-image translation, intelligent video analysis [6], and so on. In this chapter, we present an overview of the working principle of GANs and of the variants available for video analytics. We also emphasize the pros, cons, and challenges for the fruitful application of GANs to different video analytics problems.
The remainder of this chapter is organized as follows: Section 14.2 presents the building blocks of GANs, their driving objective functions, and the challenging issues of GANs. Section 14.3 highlights the GAN variants that have emerged for video analytics in recent years. Section 14.4 discusses possible future work in GAN-based video analytics. Section 14.5 concludes this chapter.
14.2 Building blocks of GAN
This section describes the basic building blocks of GAN and the different objective functions used for training GAN architectures.
14.2.1 Training process
The training process, driven by the objective or cost function, is the basic building block of GANs. Training a GAN is a dual process: it involves choosing parameters for a generator that confuses the discriminator with fake data and for a discriminator that maximizes accuracy for the given application. The algorithm involved in the training process is described as follows:
Algorithm 14.1
Step 1: Update the discriminator parameters θ_D:
  Input: m samples from real frames and m samples from noise data.
  Do: Compute the expected gradient ∇θ_D = f{J_θD(θ_D; θ_G)}.
  Update: θ_D using ∇θ_D.
Step 2: Update the generator parameters θ_G:
  Input: m samples from noise data and θ_D.
  Do: Compute the expected gradient ∇θ_G = f{J_θG(θ_G; θ_D)}.
  Update: θ_G using ∇θ_G.
The objective or cost function V(G, D) for training depends on the two competing networks. The training process includes both maximization and minimization:

$\max_D \min_G V(G, D) \quad (14.1)$

where $V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{x \sim p_g(x)}\left[\log(1 - D(x))\right]$.
As illustrated in Algorithm 14.1, one model's parameters are updated while the other's are fixed. For any fixed generator G, the optimal discriminator is $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$ [7]. The generator is optimal when $p_g(x) = p_{data}(x)$, which shows that the generator reaches its optimum only when the discriminator is completely confused in discriminating the real data from the fakes. The discriminator is not trained to completion before the generator reaches its optimum; instead, the generator is updated simultaneously with the discriminator. An alternate cost function typically used for updating the generator is $\max_G \log D(G(z))$ instead of $\min_G \log(1 - D(G(z)))$.
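A minimal, self-contained sketch of the alternating updates in Algorithm 14.1 is given below for a toy GAN on low-dimensional vectors, written in PyTorch. The tiny MLPs, the synthetic "real" data, and the hyperparameters are illustrative assumptions, not an architecture from this chapter; the generator step uses the alternate max log D(G(z)) objective mentioned above.

```python
import torch
import torch.nn as nn

D_NOISE, D_DATA = 16, 32
G = nn.Sequential(nn.Linear(D_NOISE, 64), nn.ReLU(), nn.Linear(64, D_DATA))
D = nn.Sequential(nn.Linear(D_DATA, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
g_opt = torch.optim.Adam(G.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(8, D_DATA) + 3.0          # stand-in for m real samples
    noise = torch.randn(8, D_NOISE)              # m samples from the noise prior

    # Step 1: update the discriminator (generator held fixed via detach)
    fake = G(noise).detach()
    d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake), torch.zeros(8, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Step 2: update the generator with the non-saturating objective max log D(G(z))
    g_loss = bce(D(G(noise)), torch.ones(8, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```

Detaching the fake batch in the discriminator step keeps Step 1 from modifying the generator, mirroring the fixed-parameter assumption in Algorithm 14.1.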
14.2.2 Objective functions
The main objective of generative models is to make $P_g(x)$ match the real data distribution $P_{data}(x)$. Hence, the underlying principle of training the generator is to reduce the dissimilarity between the two distributions [8]. In recent years, researchers have employed various dissimilarity measures to improve the performance of GANs. This section describes how the dissimilarity is computed under different measures and objective functions.
f-Divergence: The f-divergence is a dissimilarity measure between two distributions defined through a convex function f. The f-divergence between $P_{data}(x)$ and $P_g(x)$ [8] is written as

$D_f\left(P_{data} \,\|\, P_g\right) = \int_{x_1}^{x_2} P_g(x)\, f\!\left(\frac{P_{data}(x)}{P_g(x)}\right) dx \quad (14.2)$

(a small numerical example is given at the end of this subsection).
Integral probability metric: The integral probability metric (IPM) measures the maximal difference in the expectation of a witness function between two distributions [8]. Consider a data space X and let P(X) denote the set of probability distributions defined on it. The IPM distance between the distributions $P_{data}, P_g \in P(X)$ is defined as

$d_{\mathcal{F}}\left(P_{data}, P_g\right) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim P_{data}}\left[f(x)\right] - \mathbb{E}_{x \sim P_g}\left[f(x)\right] \right) \quad (14.3)$

Auxiliary objective functions: The auxiliary functions related to the adversarial objective are the reconstruction and classification objective functions.
• Reconstruction objective function: The goal of the reconstruction objective is to minimize the difference between the output image of the neural network and the real image provided as its input [9, 10]. This type of objective helps the generator preserve the content of the real image and is used with autoencoder architectures for the GAN's discriminator [11, 12]. The discrepancy evaluated by the reconstruction objective usually uses the L1 norm.
• Classification objective function: The discriminator network can also be used as a classifier [13, 14] when a cross-entropy loss is employed as its objective. Cross-entropy loss is widely used in many GAN applications for semisupervised learning and domain adaptation. This objective can also be used to train the generator and discriminator jointly for classification.
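As a concrete illustration of Eq. (14.2) (not taken from the chapter), the following snippet evaluates the f-divergence for two small discrete distributions with f(t) = t log t, which recovers the familiar KL divergence of Eq. (13.6).

```python
import numpy as np

def f_divergence(p_data, p_g, f):
    """Discrete analogue of Eq. (14.2): sum over x of p_g(x) * f(p_data(x) / p_g(x))."""
    t = p_data / p_g
    return float(np.sum(p_g * f(t)))

p_data = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.4, 0.4, 0.2])

# f(t) = t * log(t) turns the f-divergence into KL(p_data || p_g)
kl = f_divergence(p_data, p_g, lambda t: t * np.log(t))
print(kl)   # ~0.025, identical to sum(p_data * log(p_data / p_g))
```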
14.3 GAN variations for video analytics
In recent years, intelligent video analytics has become an emerging technology and research field in academia and industry. Video scenes are recorded by cameras that enable surveillance of events in areas where human capability falls short. A huge number of cameras are now deployed for useful purposes [6] such as fire detection, person detection and tracking, vehicle detection, smoke detection, and the detection of unknown objects and crime at country borders, shopping malls, airports, sports stadiums, underground stations, residential areas, academic campuses, and so on. Manually monitoring these videos is cumbersome because of obstacles such as operator drowsiness and distraction due to increased responsibilities. This motivates semisupervised approaches for analyzing the events in videos [15, 16]. Intelligent video analytics thus remains a challenging problem in computer vision, one where deep networks have not yet fully superseded classical handcrafted features. To date, video analytics has traveled a long journey, from holistic features such as the motion history image (MHI) [17], motion energy image (MEI) [18], and action banks [19], to local feature-based approaches like HOG3D [20], the spatiotemporal histogram of radon projections (STHRP) [21], histograms of optical flow [22], and tracking approaches. One efficient approach is to employ deep networks to learn from and analyze videos without class labels, using only the sequential organization of frames, termed "weak supervision." This technique still requires a little supervision in how input is provided to the deep neural network, such as sampling, encoding, and organization strategies. Unlike standard deep networks, generative models, namely GANs [23, 24], have been successfully applied to video analytics without human labeling of the videos, for applications such as future video frame prediction [25]. Over time, GAN architectures have been modified for various applications such as video generation, video prediction, action recognition, video summarization, and video understanding, as listed in Table 14.1.
14.3.1 GAN variations for video generation and prediction
Recent progress in generative models [26] has attracted researchers to image synthesis. In particular, GANs have been employed to synthesize images from random data distributions, through nonlinear transformation of a priming image, or from a source domain. These advances in image synthesis have given researchers the confidence to use GANs for generating video sequences. One challenging issue in using GANs for generating and predicting videos is that the output of the GAN architecture must provide meaningful video responses; this places a heavy responsibility on the GAN, which must understand both the spatial and the temporal content of the video. One such extension of GAN is MoCoGAN [27], which generates videos with no prior knowledge of a priming image.
Table 14.1 GAN variations.

S. no | GAN variation        | Application
1     | MoCoGAN              | Video generation
2     | VGGAN                | Video generation
3     | LGGAN                | Video generation
4     | TGANs                | Video generation
5     | Dynamic transfer GAN | Video generation
6     | FTGAN                | Video generation
7     | DMGAN                | Video prediction
8     | AMCGAN               | Video prediction
9     | Discrimnet           | Action recognition
10    | HiGAN                | Video recognition
11    | DCycle GAN           | Face translation between images and videos
12    | PoseGAN              | Human pose estimation
13    | Recycle GAN          | Video retargeting
14    | DTRGAN               | Video summarization
This variant of the GAN architecture partitions the input latent space into two subspaces, a content subspace and a motion subspace. Content codes are sampled from a Gaussian distribution, whereas motion codes are produced by an RNN. These two subspaces feed two discriminators, called the content and motion discriminators. Even though MoCoGAN can generate videos of variable length, the motion discriminator is designed to handle only a limited number of frames. As shown in Fig. 14.2, the spatial content is generated for different instances of appearance while the motion is fixed to the same expression.
Fig. 14.2 Example frames generated by MoCoGAN [27].
Another useful variant of GAN is the dynamics transfer GAN [28]. It generates a video sequence by transferring the temporal motion dynamics of a source video sequence onto a prime target image. The target image provides the spatial content of the video, the dynamic information is obtained from the arbitrary motion, and an RNN is used for spatiotemporal encoding. This dynamic GAN can generate video sequences of variable length through the competition between a generator and two discriminator networks: one acts as a spatial discriminator that monitors the fidelity of the generated frames, and the other acts as a dynamic discriminator that maintains the integrity of the entire video sequence. The authors provide visualizations demonstrating the ability of the dynamic GAN to encode the rich dynamics of source videos while suppressing their appearance features. Fig. 14.3 shows example frames generated by the dynamic GAN for an anger expression.
Fig. 14.3 Frames generated by dynamic GAN [28] for anger expression.
Ohnishi et al. developed the flow and texture generative adversarial network (FTGAN) [29], which generates video hierarchically from orthogonal information. FTGAN comprises two networks, FlowGAN and TextureGAN. This variation of the GAN architecture is proposed to learn the representation and generate videos without an enormous annotation cost. FlowGAN generates optical flow, which provides the edge and motion information for the video to be generated; the RGB video is then generated from the optical flow by TextureGAN. The generic framework of FTGAN is shown in Fig. 14.4.
Fig. 14.4 Generic framework for FTGAN [29].
TextureGAN preserves the consistency of the foreground and the scene while adding texture information to the generated optical flow. This model represents a progressive advance toward generating more realistic videos without labeled data. A prime advantage of FTGAN is that the two GANs share complementary information about the video content. The authors used both real and computer graphics (CG) videos for training TextureGAN and FlowGAN: the real-world dataset, Penn Action, contains 2326 videos of 15 classes, whereas the CG human video dataset, SURREAL, consists of 67,582 videos. TextureGAN and FlowGAN are trained on these datasets for 60k iterations. The accuracy obtained on SURREAL is 44% using TextureGAN and 54% using FlowGAN; on Penn Action, the accuracy is 72% for TextureGAN and 58% for FlowGAN. A multistage dynamic generative adversarial network (MSDGAN) [30] was proposed for generating high-resolution time-lapse videos. The process involved in MSDGAN is twofold: in the first stage, realistic content is generated for each frame of the video; the second stage refines the videos generated by the first stage using motion dynamics, making them closer to real videos. The authors used a large-scale time-lapse dataset for testing. The model generates realistic videos at up to 128 x 128 resolution for 32 frames. They collected over 5000 time-lapse videos from YouTube, from which short clips were created manually; these clips are partitioned into frames, and MSDGAN is used to generate clips, where a short video clip can be generated from 32 consecutive frames. Fig. 14.5 shows frames generated by MSDGAN; the red circles indicate motion between adjacent frames.
Fig. 14.5 Frames generated by MSDGAN [30], given the start frame 1.
Another variation of GAN is the improved video generative adversarial network (iVGAN) [31], a robust one-stream video generation architecture that extends the Wasserstein GAN. This model generates the whole video clip without separating foreground from background. Like a classical GAN, the iVGAN model has two networks, a generator and a critic/discriminator. The generator creates videos from a low-dimensional latent code, while the critic discriminates real from fake data and is updated in competition with the generator. The iVGAN architecture tackles challenging issues in video analytics such as future frame prediction, video colorization, and inpainting. The authors used several datasets, including stabilized videos collected from YouTube and an Airplanes dataset. The model works by filling in damaged holes to reconstruct the spatial and temporal information of videos. Fig. 14.6 depicts example video frames generated by iVGAN. One useful effort toward generating videos from a given description or caption was made by Pan et al. [32]; this kind of text-to-video generation is attractive for real-world applications. It is achieved through an efficient extension of the GAN architecture termed the temporal GAN (TGAN). TGAN consists of a generator and three discriminator networks. The input to the generator is the combination of a noise vector and sentence encodings derived from an LSTM network, and the generator produces the frames of the video sequence using 3D convolutions. The three discriminators in TGAN handle video, frame, and motion discrimination: two of them distinguish real from fake videos or frames produced by the generator and, in addition, discriminate semantically matched from mismatched video/frame-text description pairs, while the last discriminator improves the temporal coherence between real and generated frames. The whole TGAN architecture is trained end to end. This GAN variant is evaluated on datasets such as SMBG, TBMG, and MSVD for generating videos from captions. The coherence metric reflects the readability and temporal coherence of the videos; a coherence value of 1.86 is reported for TGAN. Table 14.2 lists the datasets used for evaluating the GAN variants proposed for video generation.
14.3.2 GAN variations for video recognition
Video recognition is usually performed using a large number of labeled videos during training. For a new test task, many videos are unlabeled and must be annotated, which requires human help for every video; annotating a large dataset is therefore tedious. To overcome this, Yu et al. proposed a novel approach called hierarchical generative adversarial networks (HiGAN) [3], in which fully labeled images are used to recognize unlabeled videos. The idea behind the HiGAN model is
337
Fig. 14.6 Video frames generated using iVGAN [31].
Table 14.2 List of dataset available to validate video generation techniques. S. no
Model
Dataset
1
MoCoGAN
2 3
TGANs Dynamic transfer GAN FTGAN iVGAN MSDGAN
MUG facial expression dataset, YouTube videos, Weizmann action dataset and UCF101 SMBG,TBMG, MSVD CASIA
4 5 6
Penn Action, SURREAL Tiny videos, Airplane dataset YouTube videos, Beach dataset, Golf dataset
Generative adversarial network for video analytics
combining low-level conditional GAN and high-level conditional GAN and utilizing the adversarial learning from them. Also, this method provides domain invariant feature representation between labeled images and unlabeled video. The performance is evaluated by conducting experiments on two complex video datasets UCF101 [33] and HMDB51 [34]. In this work, each target video is split into 16-frame clips without any overlap and it constructs a video clip domain by combining all the target video frame. In each video clip, the deep feature that is 512D feature vector from pool 5 layer of 3D ConvNets is extracted and are used to train large-scale video dataset. The HiGAN comparatively outperforms in terms of recognition rate in both datasets compared to the approach C3D [35]. HiGAN recognition rate is observed as 4% improvement in UCF101 and 10% improvement in HMDB51 dataset compared to C3D technique. Human behavior understanding in video is still a challenging task. It requires an accurate model to handle both the pixel-wise and global level prediction. Spampinato et al. [36] demonstrated an adversarial GAN-based framework to learn video representation through unsupervised learning to perform both local and global prediction of human behavior in videos. In this approach, first the video is synthesized by factorizing the process in to the generation of static visual content and motion and secondly enforcing spatiotemporal coherency of object trajectories and finally incorporates motion estimation and pixel-wise dense prediction. So, the self-supervised way of learning provides an effective feature set which further used for video object segmentation and video action tasks [37]. Also, the new segmentation network proposed is able to integrate into any another segmentation model for supervision. This provides a strong model for object motion. The wide range of experimental evaluation showed that VOS-GAN performance on modeling object motion better than the existing video generation methods such as VGAN, TGAN, and MoCoGAN. In the previous researches the following approaches were implemented for video retargeting [38]: The first one is specifically performed domain wise which is not applicable for other domains. And the second one is implemented across the domain which needs manual supervision for labeling and alignment of information and the last approach is unsupervised and unpaired image translation where learning is mutually done in different domains which is also shown insufficient information for processing. Bansal et al. [38] propose a new unsupervised data-driven approach for effective video retargeting which incorporates spatiotemporal information with conditional generative adversarial networks (GANs). It combines both spatial and temporal information along with adversarial loses for translating content and also preserving style. The publicly available Viper dataset is used for experimentation for image-to-labels and labels-to-image to evaluate the results of video retargeting. The performance measures such as mean pixel accuracy (M), average class accuracy (AC), and intersection over union (IoU) provides comparatively better results for the combination of cycle GAN and recycle-GAN
Jang and Kim [39] developed the appearance and motion conditions generative adversarial network (AMC-GAN), which consists of a generator, two discriminators, and a perceptual ranking module. The two discriminators monitor the appearance and motion features, and a new conditioning scheme aids training by varying the appearance and motion conditions. The perceptual ranking module enables AMC-GAN to understand the events in a video. The AMC-GAN model is evaluated on the MUG facial expression and NATOPS human action datasets. The MUG dataset consists of 931 video clips covering six basic emotions (anger, disgust, fear, happiness, sadness, and surprise) and is preprocessed to obtain 32 frames at a resolution of 64 x 64 pixels; the NATOPS human action dataset has 9600 videos containing 24 different actions. In unsupervised video representation learning, future frame prediction is a challenging task, and existing methods that operate directly on pixels produce blurry predictions of the future frame. Liang et al. [26] proposed a dual motion generative adversarial network (GAN) architecture to predict future frames in a video sequence through a dual learning mechanism: future frame prediction and dual future flow prediction form a closed loop and achieve better video prediction by generating informative feedback signals for each other. The dual motion GAN has a fully differentiable network architecture for video prediction. Extensive experiments on video frame prediction, flow prediction, and unsupervised video representation learning demonstrate the contributions of the dual motion GAN to motion encoding and predictive learning. Caltech and YouTube clips are used for future frame analysis to show the performance of the dual motion GAN compared with other existing approaches on the KITTI dataset. Performance metrics such as the mean square error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM) are used to evaluate the image quality of the predicted future frames, and higher PSNR and SSIM are achieved with the dual motion GAN. The implementation is based on the public Torch7 platform on a single NVIDIA GeForce GTX 1080, and the dual motion GAN takes around 300 ms to predict one future frame.
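For reference, the two simplest of the frame-prediction quality metrics mentioned above can be computed as in the NumPy sketch below (a generic illustration, not code from the cited work); SSIM is usually taken from a library implementation such as scikit-image.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between a predicted and a ground-truth frame."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return np.mean(diff ** 2)

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB for 8-bit frames (higher is better)."""
    m = mse(pred, target)
    return float('inf') if m == 0 else 10.0 * np.log10(max_val ** 2 / m)
```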
14.3.3 GAN variations for video summarization
Owing to the huge amount of multimedia data produced by the rapid growth of video capturing devices, video summarization [12,40–42] plays a crucial role in video analytics. Video summarization [43] extracts representative and useful content from a video for data analysis and is highly useful in large-scale video analysis. One efficient approach to video summarization is to derive suitable key frames from the entire video, such that this set of key frames is enough to portray the story of the video. To achieve high-quality summarization, several challenges need to be tackled. The first challenge is to choose a good key frame selection strategy that takes into account the temporal relation
of the frames within the video and the importance of the key frames. The next challenge is to devise a mechanism to assess the preciseness and completeness of the selected key frames. To address these issues, several models have been introduced so far like feature-based approaches [12], Long-short-term memory (LSTM)-based models [1,44] and determinantal point process (DPP)-based techniques [45]. Owing to the memory problems that arise in LSTM as well as DPP and redundant key frames issue in feature-based approaches, GAN has attracted the researchers in this community because of its regularization ability. One of the GAN variants proposed for video summarization namely dilated temporal relational generative adversarial network (DTRGAN) [46] is shown in Fig. 14.7. The generator contains two units namely dilated temporal relational (DTR) and bidirectional LSTM (Bi-LSTM). The generator gets the video and the real summary of the respective video as input. The DTR unit aims to tackle the first challenge. The inputs to the discriminator are real, generated, and random summary pairs and the purpose of the discriminator is to optimize the player losses at the time of training. A supervised generator loss term is introduced to attain the completeness and preciseness nature of the key frames.
Fig. 14.7 Architecture of DTRGAN [46].
14.3.4 PoseGAN
Walker et al. [25] developed a video forecasting technique that generates future poses using a generative adversarial network (GAN) and variational autoencoders (VAEs). In this approach, video forecasting is attained by generating videos directly in pixel space, and the whole structure of the video, including the scene dynamics, is modeled jointly in an unconstrained environment. The authors divided the video forecasting problem into two stages, the first of which handles the high-level features of the video, such as human scenes, and uses a VAE to predict the future actions of a human. The authors used the UCF101 dataset to evaluate the PoseGAN architecture in predicting future human poses.
14.4 Discussion
14.4.1 Advantages of GAN
One major advantage of GANs is that they do not require knowledge of the shape of the generator's probability distribution; GANs thus avoid the need for predetermined density forms to represent highly complex, high-dimensional data distributions.
Reduced time complexity: Sampling of generated data can be parallelized in GANs, which makes them considerably faster than PixelRNN [47], Wavenet [48], and PixelCNN [49]. In the future frame prediction problem [39], autoregressive models rely on previously generated pixel values to predict the probability distribution of each future pixel, so generating a future frame is slow, and the time consumption is even worse for high-dimensional data. GANs instead use a simple feed-forward mapping in the generator, which creates all the pixels of the future frame at the same time rather than pixel by pixel as in autoregressive models. This speed has attracted many researchers in various fields.
Accurate results: From the study of different GAN variants, it is evident that GANs can produce impressive results for video analytics problems. Their performance is also far better than that of the variational autoencoder (VAE), a generative model that assumes the pixel distribution is normal. Because GANs excel at capturing the high-frequency parts of the data, the generator learns to use these high-frequency parts to fool the discriminator.
Lack of assumptions: Although the VAE attempts to maximize the likelihood through a variational lower bound, it needs assumptions on the prior and posterior probability distributions of the data. GANs, on the other hand, do not need any strong assumptions about the probability distribution.
14.4.2 Disadvantages of GAN
Trade-off between discriminator and generator: An imbalance can occur between the generator and the discriminator because of nonconvergence and mode collapse. Mode collapse is a common and difficult issue in GAN models; it happens when the generator produces images that all look similar, or when the generator is trained extensively without any feedback reaching the discriminator. Owing to mode collapse, the generator converges to outputs that fool the discriminator the most, i.e., the most realistic images from the discriminator's perspective. Partial mode collapse occurs in GANs more frequently than complete mode collapse. As a result, the training process of GANs is heuristic in nature.
Hyperparameters and training: The need for suitable hyperparameters in the cost function is a major concern in GANs, and tuning these parameters is a time-consuming process.
14.5 Conclusion
GANs are growing into efficient generative models that produce realistic data from random latent spaces. An underlying strength of the GAN framework is that it does not require deep prior understanding of the real data samples or heavy mathematical machinery, and this merit has allowed GANs to be used extensively in various academic and engineering fields. In this chapter, we introduced the basics and working principle of GANs and several GAN variants for applications in video analytics, such as video generation, video prediction, action recognition, and video summarization. The enormous growth of GANs in the video analytics domain is due not only to their ability to learn deep representations and nonlinear mappings but also to their potential to exploit the enormous amount of unlabeled video data. There are great opportunities for developing GAN algorithms and architectures in application domains beyond video analytics, such as prediction, superresolution, generating new human poses, and face frontal-view generation. Future work in video recognition includes exploiting large-scale web images, which should further improve recognition accuracy. Video retargeting can be accomplished more precisely using spatiotemporal generative models and can further be extended to multiple-source domain adaptation; spatiotemporal neural network architectures can also be applied to video retargeting in the future. Real-world videos with complex motion interactions can be addressed for video recognition by modeling multiagent dependencies, and alternatives can be explored for the loss function, evaluation metrics, RNNs, and synthetically generated videos to improve the performance of video recognition systems. Generative adversarial networks may well be the next step in the evolution of deep learning, as they already provide strong results across several application domains.
CHAPTER 15
Multimodal reconstruction of retinal images over unpaired datasets using cyclical generative adversarial networks
Álvaro S. Hervella a,b, José Rouco a,b, Jorge Novo a,b, and Marcos Ortega a,b
a CITIC Research Center, University of A Coruña, A Coruña, Spain
b VARPA Research Group, Biomedical Research Institute of A Coruña (INIBIC), University of A Coruña, A Coruña, Spain
15.1 Introduction
The recent rise of deep learning has revolutionized medical imaging, making a significant impact on modern medicine [1]. Nowadays, in clinical practice, medical imaging technologies are key tools for the prevention, diagnosis, and follow-up of numerous diseases [2]. There exists a large variety of imaging modalities that allow clinicians to visualize the different organs and tissues in the human body [3]. Thus, clinicians can select the most adequate imaging modality to study the different anatomical or pathological structures in detail. Nevertheless, the detailed analysis of the images can be a tedious and difficult task for a clinical specialist. For instance, many diseases in their early stages are only evidenced by very small lesions or subtle anomalies. In these scenarios, factors such as the clinicians' expertise and workload can affect the reliability of the final analysis. The use of deep learning algorithms helps to accelerate the process and to produce a more reliable analysis of the images, which ultimately results in a better diagnosis and treatment for the patients.

Deep neural networks (DNNs) have demonstrated superior performance over more classical methods for numerous image analysis problems [4]. For instance, deep learning nowadays represents the state-of-the-art approach for typical tasks such as image segmentation [5] or image classification [6]. Besides the remarkable improvements in these canonical image analysis problems, deep learning has also enabled novel applications. For instance, these algorithms can be used for the transformation of images among different modalities [7], or for the training of future clinical professionals using realistic generated images [8]. These novel applications, among others, certainly benefit from the particular advantages of generative adversarial networks (GANs) [9]. This setting, in which different networks pursue opposite objectives, has been demonstrated to further exploit the capacity of DNNs.
Multimodal reconstruction is a novel application driven by DNNs that consists of the translation of medical images among complementary modalities [7]. Nowadays, complementary imaging modalities representing the same organs or tissues are commonly available in most medical specialties [3]. The differences among modalities can be due to the use of different capture devices, as well as to the use of contrasts that enhance certain tissues. Clinicians choose the most adequate imaging modality according to different factors, such as the target organs or tissues, the evidence of disease, or the risk factors of the patient. In this sense, it is particularly important to consider the properties of the different anatomical and pathological structures, given that some structures can be enhanced in one modality and be completely missing in another. This significant change in appearance, dependent on the properties of the tissues and organs, can make the translation among modalities very challenging. However, the same difficulty that complicates the training of the multimodal reconstruction is beneficial if the task is used for representation learning purposes, since a harder task forces the network to learn more complex representations during training. In this regard, the multimodal reconstruction has already demonstrated a successful performance as a pretraining task for transfer learning in medical imaging [10].

In this chapter, we study the use of GANs for the multimodal reconstruction between complementary imaging modalities. In particular, the multimodal reconstruction is addressed by using a cyclical GAN methodology, which allows training the adversarial setting with independent sets of two different image modalities [11]. Nowadays, GANs represent the quintessential approach for image-to-image translation tasks [12]. However, these kinds of applications are typically focused on producing realistic and aesthetically pleasing images. In contrast, in the multimodal reconstruction of medical images, the realism and aesthetics of the generated images are not as important as producing medically accurate reconstructions. In particular, this means that the generated color patterns and textures must be coherent with the expected visualization of the real organs or tissues in the target modality. Additionally, this may involve the omission of certain structures, or even the enhancement of those that are only vaguely appreciated in the original modality. We evaluate all these aspects in order to assess the validity of the studied cyclical GAN method for the multimodal reconstruction.

The study presented in this chapter is focused on ophthalmic imaging. In particular, we use the retinography and the fluorescein angiography as the original and target imaging modalities in the multimodal reconstruction. These imaging modalities, which represent the eye fundus, are useful for the study of important ocular and systemic diseases, such as glaucoma or diabetes [2]. A representative example of retinography and fluorescein angiography for the same eye is depicted in Fig. 15.1. The main difference between them is that the fluorescein angiography uses a contrast dye, which is injected into the patient, to produce the fluorescence of the blood. Thus, the fluorescein angiography depicts an enhanced representation of the retinal vasculature and related lesions.
Fig. 15.1 Example of retinography and fluorescein angiography for the same eye: (A) retinography and (B) angiography.
In this context, the successful training of a deep neural network in the multimodal reconstruction of the angiography from the retinography will provide a model able to produce a contrast-free estimation of the enhanced retinal vasculature. Additionally, due to the challenges of the transformation, which is mainly mediated by the presence of blood flow in the different tissues, the neural networks will need to learn rich high-level representations of the data. This represents a remarkable potential for transfer learning purposes [13, 14].

The presented study includes an extensive evaluation of the cyclical GAN methodology for the multimodal reconstruction between complementary imaging modalities. For this purpose, two different multimodal datasets containing both retinography and fluorescein angiography images are used. Additionally, in order to further analyze the advantages and limitations of the methodology, we present an extensive comparison with a state-of-the-art approach for the multimodal reconstruction of these ophthalmic images [15]. In contrast with the cyclical GAN methodology, this other approach requires multimodal paired data for training, i.e., retinography and angiography of the same eye. Therefore, the cyclical GAN presents an important advantage, avoiding not only the need for paired data but also the additional preprocessing required to align the different image pairs.
15.2 Related research
Generative adversarial networks (GANs) represent a relatively new deep learning framework for the estimation of generative models [16]. The original GAN setting consists of two different networks with opposite objectives: a discriminator that learns to distinguish between real and fake samples, and a generator that learns to produce fake samples that the discriminator misclassifies as real. Based on this original idea, several variations were developed in subsequent works, aiming at applying the novel paradigm in different scenarios [17].
In recent years, GANs have been extensively used for addressing different vision and graphics problems. The use of GANs has been especially groundbreaking for computer graphics applications due to the visually appealing results that are obtained. Similarly, a kind of vision problem that has been revolutionized by the use of GANs is image-to-image translation, which consists of performing a mapping between different image domains or imaging modalities [12]. An early work addressing this problem with GANs, known as Pix2Pix [18], relied on the availability of paired data for learning the generative model. In particular, Isola et al. [18] show that their best results are achieved by combining a traditional pixel-wise loss with a conditional GAN framework. Given the difficulty of gathering paired data in many application domains, later works have proposed alternatives that learn the task from unpaired training data. Among the different proposals, the work of Zhu et al. [19], known as CycleGAN, has been especially influential. CycleGAN compensates for the lack of paired data by learning not only the desired mapping function but also the inverse mapping. This allows introducing a cycle-consistency loss whereby the subsequent application of both mapping functions must return the original input image. Concurrently, this same idea was also proposed, under different naming, in DualGAN [20] and DiscoGAN [21]. Besides the cycle-consistency alternative, other proposals have also been presented [12], although their use is not as widespread in subsequent applications.

In medical imaging, GANs have also been used for different applications, including the mapping between complementary imaging modalities. In particular, GANs have been successfully applied in tasks such as image denoising [22], multimodal reconstruction [11], segmentation [23], image synthesis [24], or anomaly detection [25]. Several of these tasks can be directly addressed as an image-to-image translation [8]. In these cases, it has been common to adapt state-of-the-art approaches that already demonstrated a good performance on natural images. In particular, numerous works in medical imaging are based on the Pix2Pix or CycleGAN methodologies [8]. Similarly to other application domains, the choice between one approach or the other is conditioned by the availability of paired data for training. However, in medical imaging, paired data is typically easy to obtain, which is evidenced by the prevalence of paired approaches in the literature [8]. With regard to the multimodal reconstruction, the difficulty in these cases is to perform an accurate registration of the available image pairs.

An important concern regarding the use of GANs in medical imaging is the hallucination of nonexistent structures by the networks [8]. This is a concomitant risk of using GANs due to the high capacity of these frameworks to model the given training data. Cohen et al. [26] demonstrated that this risk is especially elevated when the training data is heavily unbalanced. For instance, a GAN framework that is trained for multimodal reconstruction with a large majority of pathological images will tend to hallucinate pathological structures when processing healthy images. This behavior can be partly mitigated by the addition of pixel-wise losses if paired data is available.
Nevertheless, regarding the multimodal reconstruction, even when paired data is available, most of the works still use the GAN framework together with the pixel-wise loss [8]. In this regard, the work of Hervella et al. [15] is an example of multimodal reconstruction without GANs, using instead the structural similarity (SSIM) as the loss function. The motivation for this is that, for many applications in medical imaging, it is not necessary to generate realistic or aesthetically pleasing images. In this context, the results obtained in Ref. [15] show that, without the use of GANs, the generated images lack realism and can be easily identified as synthetic samples.
15.3 Multimodal reconstruction of retinal images
Multimodal reconstruction is an image translation task between complementary medical imaging modalities [7]. The objective of this task is, given a certain medical image, to reconstruct the underlying tissues and organs according to the characteristics of a different complementary imaging modality. In particular, this chapter is focused on the multimodal reconstruction of the fluorescein angiography from the retinography. These two complementary retinal imaging modalities represent the eye fundus, including the main anatomical structures and possible lesions in the eye. The main difference between retinography and angiography is that the latter requires the injection of a contrast dye before capturing the images. The injection of this contrast dye results in an enhancement of the retinal vasculature as well as of those pathological structures with blood flow. Simultaneously, those retinal structures and tissues where there is a lack of blood flow may be attenuated in the resulting images. Thus, there is an intricate relation between retinography and angiography, given that the visual transformation between the modalities depends on physical properties such as the presence of blood flow in the different tissues. As a reference, the transformation between retinography and angiography for the main anatomical and pathological structures in the retina can be visualized in Fig. 15.2.
Fig. 15.2 Example of retinography and fluorescein angiography for the same eye. The included images depict the main anatomical structures as well as the two main types of lesions in the retina.
Recently, the difficulty of performing the multimodal reconstruction between retinography and angiography has been overcome by using DNNs [7]. In this regard, the required multimodal transformation can be modeled as a mapping function GR2A : R → A that, given a certain retinography r ∈ R, returns the corresponding angiography a = GR2A(r) ∈ A for the same eye. In this scenario, the mapping function GR2A can be parameterized by a DNN. Thus, the function parameters can be learned by applying an adequate training strategy. In this regard, we present two different deep learning-based approaches for learning the mapping function GR2A: the cyclical GAN methodology [11] and the paired SSIM methodology [15].
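For illustration, the sketch below shows how such a mapping function can be parameterized by a convolutional network in PyTorch. The tiny architecture and the image size are placeholders chosen only for this example; the actual generator used in this chapter is described in Section 15.3.3.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for GR2A: a fully convolutional network that maps a
# 3-channel retinography to a 1-channel angiography of the same spatial size.
class TinyGR2A(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, r):
        return self.net(r)

G_R2A = TinyGR2A()
r = torch.rand(1, 3, 576, 720)  # retinography r (batch, channels, height, width)
a = G_R2A(r)                    # generated angiography a = GR2A(r)
print(a.shape)                  # torch.Size([1, 1, 576, 720])
```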
15.3.1 Cyclical GAN methodology
The cyclical GAN methodology is based on the use of generative adversarial networks (GANs) for learning the mapping function from retinography to angiography [11]. GANs have proven to be useful tools for learning the data distribution of a certain training set, allowing the generation of new images that resemble those contained in the training data [16]. This means that, by using GANs and a sufficiently large training set of unlabeled angiographies, it is possible to generate new fake angiographies that are theoretically indistinguishable from the real ones. However, in the presented multimodal reconstruction, the generated images do not only need to resemble real angiographies but also need to represent the physical attributes given by a particular retinography. Thus, in contrast with the original GAN approach [16], the presented methodology does not generate new images from a random noise vector, but rather from another image with the same spatial dimensions as the one being generated. In practice, this image-to-image transformation is achieved by using an encoder-decoder network as the generator, whereas the discriminator remains an encoder-like classification network, as in the original GAN approach. Applying this setting, the multimodal reconstruction could theoretically be trained by using two independent unlabeled sets of images, one of retinographies and the other of angiographies.

An inherent difficulty of training an image-to-image GAN is that, typically, the generator network has enough capacity to generate a variety of plausible images while ignoring the characteristics of the network input. In the case of the multimodal reconstruction, this would mean that the physical attributes of the retinographies are not successfully transferred to the generated angiographies. Early image-to-image GAN approaches addressed this issue by explicitly conditioning the generated images on the network input [18]. In particular, this is achieved by using a paired dataset instead of two independent datasets for training. For instance, the use of retinography-angiography pairs, instead of independent retinography and angiography samples, allows training a discriminator to distinguish between fake and real angiographies conditioned on a given real retinography. The use of such a discriminator will force the generator to analyze and take into account the attributes of the input retinography.
Additionally, in Ref. [18], the use of paired datasets is further exploited by complementing the adversarial feedback to the generator with a pixel-wise similarity metric between the generator output and the available ground truth. However, in this case, it is not only necessary to have paired data, but the available image pairs must also be aligned.

In contrast with these previous alternatives, the presented cyclical GAN methodology addresses the issue of the generator potentially ignoring the characteristics of its input in a different manner that does not require the use of paired datasets. In particular, the cyclical GAN solution is based on the use of a double transformation [19]. The idea is to simultaneously learn GR2A and its inverse mapping function GA2R : A → R that, given a certain angiography a ∈ A, produces a retinography r = GA2R(a) ∈ R of the same eye. Then, the subsequent application of both transformations should be equivalent to the identity function. For instance, if a retinography is transformed into an angiography and then transformed back into a retinography, the resulting image should be identical to the original retinography used as input. However, if any of the two transformations ignores the characteristics of its input, the resulting retinography will differ from the original. Therefore, it is possible to ensure that the input image characteristics are not being ignored by enforcing the identity between the original retinography and the one that is transformed back from the angiography. This is referred to as cycle-consistency, and it can be applied by using any similarity metric between the original and the reconstructed input image. An important advantage of this solution is that it does not require paired datasets; only two independent sets of unlabeled retinographies and angiographies are necessary.

In order to obtain the best performance for the multimodal reconstruction, the presented cyclical GAN methodology involves the use of two complementary training cycles: (1) from retinography to angiography to retinography (R2A2R) and (2) from angiography to retinography to angiography (A2R2A). A flowchart showing the complete training procedure is depicted in Fig. 15.3. Two different generators, GR2A and GA2R, and two different discriminators, DA and DR, are used during the training. The discriminators DA and DR are trained to distinguish between generated and real images. Simultaneously, the generators GR2A and GA2R are trained to generate images that the discriminators misclassify as real. This adversarial training is performed using a least-squares loss, which has been shown to produce a more stable learning process than the original loss of regular GANs [27]. Regarding the discriminator training, the target values are 1 for the real images and 0 for the generated images. Thus, the adversarial training losses for the discriminators are defined as

$$\mathcal{L}^{adv}_{D_A} = \mathbb{E}_{r \in R}\left[ D_A(G_{R2A}(r))^2 \right] + \mathbb{E}_{a \in A}\left[ (D_A(a) - 1)^2 \right] \tag{15.1}$$

$$\mathcal{L}^{adv}_{D_R} = \mathbb{E}_{a \in A}\left[ D_R(G_{A2R}(a))^2 \right] + \mathbb{E}_{r \in R}\left[ (D_R(r) - 1)^2 \right] \tag{15.2}$$
Fig. 15.3 Flowchart for the complete training procedure in the cyclical GAN methodology. This approach involves the use of two complementary training cycles that only differ in which imaging modality is being used as input and which one is the target. For each training cycle, the appearance of the target modality in the generated images is enforced by the feedback of the discriminator. Simultaneously, the cycle-consistency is used to ensure that the input image characteristics, such as the anatomical and pathological structures, are not being ignored by the networks.
In the case of the generator training, the objective is that the discriminator assigns a value of 1 to the generated images. Thus, the adversarial training losses for the generators are defined as

$$\mathcal{L}^{adv}_{G_{R2A}} = \mathbb{E}_{r \in R}\left[ (D_A(G_{R2A}(r)) - 1)^2 \right] \tag{15.3}$$

$$\mathcal{L}^{adv}_{G_{A2R}} = \mathbb{E}_{a \in A}\left[ (D_R(G_{A2R}(a)) - 1)^2 \right] \tag{15.4}$$

Regarding the cycle consistency in the presented approach, the L1-norm between the original image and its reconstructed version is used as a loss function. In particular, the complete cycle-consistency loss, including both training cycles, is defined as

$$\mathcal{L}^{cyc} = \mathbb{E}_{r \in R}\left[ \left\| G_{A2R}(G_{R2A}(r)) - r \right\|_1 \right] + \mathbb{E}_{a \in A}\left[ \left\| G_{R2A}(G_{A2R}(a)) - a \right\|_1 \right] \tag{15.5}$$

As can be observed in the previous equations as well as in Fig. 15.3, there is a strong parallelism between both training cycles, R2A2R and A2R2A. In particular, the only difference is the imaging modality that each training cycle starts with, which determines which imaging modality is used as input and which one is the target. Finally, the complete loss function that is used for simultaneously training all the networks is defined as

$$\mathcal{L} = \mathcal{L}^{adv}_{G_{R2A}} + \mathcal{L}^{adv}_{D_A} + \mathcal{L}^{adv}_{G_{A2R}} + \mathcal{L}^{adv}_{D_R} + \lambda \mathcal{L}^{cyc} \tag{15.6}$$
where λ is a parameter that controls the relative importance of the cycle-consistency loss with respect to the adversarial losses. For the experiments presented in this chapter, this parameter is set to λ = 10, the value also adopted in Ref. [19].

The optimization of the loss function during the training is performed with the Adam algorithm [28]. Regarding the hyperparameters of Adam, the exponential decay rates for the moment estimates are set to β1 = 0.5 and β2 = 0.999. In comparison to the original values recommended by Kingma et al. [28], this set of values has been shown to provide a more stable learning process when training GANs [29]. The optimization is performed with a batch size of 1 image. The learning rate is set to an initial value of α = 2e-4 and is kept constant for 200,000 iterations. Then, following the approach previously adopted in Ref. [19], the learning rate is linearly reduced to zero over the same number of iterations. The number of iterations before starting to reduce the learning rate is established empirically through the analysis of both the learning curves and the generated images in a training subset that is reserved for validation.

Finally, a data augmentation strategy is applied to avoid possible overfitting to the training set. In particular, random spatial and color augmentations are applied to the images. The spatial augmentations consist of affine transformations, and the color augmentations are linear transformations of the image channels in HSV (Hue-Saturation-Value) color space. In the case of the angiographies, which have a single channel, a linear transformation is directly applied over the raw intensity values. This augmentation strategy has been previously applied for the analysis of retinal images, demonstrating a good performance in avoiding overfitting with limited training data [10, 30]. The particular range of the transformations was validated before training in order to ensure that the augmented images still resemble valid retinas.
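As a concrete illustration of this training procedure, the following PyTorch sketch implements one iteration with the least-squares adversarial losses (Eqs. 15.1-15.4), the L1 cycle-consistency loss (Eq. 15.5), and the settings described above (λ = 10, Adam with β1 = 0.5, β2 = 0.999, learning rate 2e-4, batch size 1). The tiny generator and discriminator stand-ins are placeholders for the architectures of Section 15.3.3, and the alternating generator/discriminator updates are a common way of optimizing the combined objective of Eq. (15.6); this is a simplified sketch, not the authors' exact implementation.

```python
import itertools
import torch
import torch.nn as nn

mse = nn.MSELoss()  # least-squares adversarial loss
l1 = nn.L1Loss()    # cycle-consistency loss
lam = 10.0          # weight of the cycle-consistency term (lambda)

# Tiny stand-ins for the real generators and discriminators of Section 15.3.3.
def tiny_generator(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, out_ch, 3, padding=1))

def tiny_discriminator(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                         nn.Conv2d(16, 1, 4, padding=1))  # patch-level map of scores

G_R2A, G_A2R = tiny_generator(3, 1), tiny_generator(1, 3)
D_A, D_R = tiny_discriminator(1), tiny_discriminator(3)

opt_G = torch.optim.Adam(itertools.chain(G_R2A.parameters(), G_A2R.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(itertools.chain(D_A.parameters(), D_R.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))

def train_step(r, a):
    """One iteration with an unpaired retinography r and angiography a."""
    fake_a, fake_r = G_R2A(r), G_A2R(a)

    # Generator update: adversarial terms (Eqs. 15.3-15.4) plus cycle consistency (Eq. 15.5).
    pred_fa, pred_fr = D_A(fake_a), D_R(fake_r)
    loss_G_adv = mse(pred_fa, torch.ones_like(pred_fa)) + mse(pred_fr, torch.ones_like(pred_fr))
    loss_cyc = l1(G_A2R(fake_a), r) + l1(G_R2A(fake_r), a)
    loss_G = loss_G_adv + lam * loss_cyc
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    # Discriminator update: real images -> 1, generated images -> 0 (Eqs. 15.1-15.2).
    pred_a, pred_r = D_A(a), D_R(r)
    pred_fa, pred_fr = D_A(fake_a.detach()), D_R(fake_r.detach())
    loss_D = mse(pred_a, torch.ones_like(pred_a)) + mse(pred_fa, torch.zeros_like(pred_fa)) + \
             mse(pred_r, torch.ones_like(pred_r)) + mse(pred_fr, torch.zeros_like(pred_fr))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    return loss_G.item(), loss_D.item()

# Example call with random tensors standing in for real images (batch size 1).
print(train_step(torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256)))
```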
15.3.2 Paired SSIM methodology
An alternative methodology for the multimodal reconstruction between retinography and angiography was proposed in Ref. [7]. In this case, the authors avoid the use of GANs by taking advantage of existing multimodal paired data, in particular, a set of retinography-angiography pairs where both images correspond to the same eye. The motivation for this lies in the fact that, in contrast to other application domains, in medical imaging paired data is easy to obtain. Nowadays, in modern clinical practice, the use of different imaging modalities is widespread across most medical services. Although for many patients a single imaging modality can be enough for diagnostic purposes, there is still a large number of cases where several imaging modalities are required. In this latter scenario, it is also common to use more complex or invasive techniques, such as those requiring the injection of contrasts. This is the case of retinography and angiography in retinal imaging. While retinography is a widely used modality, typically employed in screening programs, angiography is only used when it is clearly required.
However, whenever an angiography is taken for a patient, a retinography is typically also available. This facilitates the gathering of these paired multimodal datasets.

Technically, the advantage of using paired training data is that it allows directly comparing the network output with a ground truth image. In particular, during the training, for each retinography that is fed to the network, an angiography of the same eye is also available. Thus, the training feedback can be obtained by computing any similarity metric between the generated and the real angiography. In order to facilitate this measurement of similarity, the retinography and angiography within each multimodal pair are registered. The registration produces an alignment of the different retinal structures between the retinography and the angiography. Consequently, there will also be an alignment between the network output and the real angiography that is used as ground truth. This allows the use of common pixel-wise metrics for the measurement of the similarity between the network output and the target image.

In the presented methodology [15], the registration is performed following a domain-specific method that relies on the vascular structures of the retina [31]. This registration method comprises two steps. The first step is a landmark-based registration where the landmarks are the crossings and bifurcations of the retinal vasculature. This first registration produces a coarse alignment of the images that is later refined by a subsequent intensity-based registration, which optimizes a vessel-based similarity metric between both images. The complete registration procedure allows generating a paired and registered multimodal dataset, which is used for directly training the generator network GR2A. The complete methodology for training the multimodal reconstruction is depicted in Fig. 15.4. An advantage of this methodology is that only a single neural network is required.

Regarding the training of the generator, the similarity between the network output and the target angiography is evaluated using the structural similarity (SSIM) index [32]. This metric, which was initially proposed for image quality assessment, measures the similarity between images by independently considering the intensity, contrast, and structural information.
Fig. 15.4 Flowchart for the complete training procedure of the paired SSIM methodology. The first step is the multimodal registration of the paired retinal images, which can be performed off-line before the actual network training. Then, the training feedback is provided by the structural similarity (SSIM), which is a pixel-wise similarity metric.
The measurement is performed at a local level, considering a small neighborhood for each pixel. In particular, an SSIM map between two images (x, y) is computed with a set of local statistics as

$$\mathrm{SSIM}(x, y) = \frac{\left(2\mu_x \mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)} \tag{15.7}$$

where μx and μy are the local averages of x and y, respectively, σx and σy are the local standard deviations of x and y, respectively, and σxy is the local covariance between x and y. These local statistics are computed for each pixel by weighting its neighborhood with an isotropic two-dimensional Gaussian with σ = 1.5 pixels [32]. Then, given that SSIM is a similarity metric, the loss function for training GR2A is defined as the negative SSIM:

$$\mathcal{L}^{SSIM} = -\,\mathbb{E}_{r,a \in (R,A)}\left[ \mathrm{SSIM}(G_{R2A}(r), a) \right] \tag{15.8}$$
The optimization of the loss function during the training is performed with the Adam algorithm [28]. Regarding the hyperparameters of Adam, the exponential decay rates for the moment estimates are set to β1 = 0.9 and β2 = 0.999, which are the default values recommended by Kingma et al. [28]. The optimization is performed with a batch size of 1 image. The learning rate is set to an initial value of α = 2e-4 and is then reduced by a factor of 10 when the validation loss ceases to improve for 1250 iterations. Finally, the training is stopped early after 5000 iterations without improvement in the validation loss. These hyperparameters are established empirically according to the evolution of the learning curves during the training.

A data augmentation strategy is also applied to avoid possible overfitting to the training set. In particular, random spatial and color augmentations are applied to the images. The spatial augmentations consist of affine transformations, and the color augmentations are linear transformations of the image channels in HSV (Hue-Saturation-Value) color space. In this case, the color augmentations are only applied to the retinography, which is the only imaging modality used as input to a neural network. In contrast, the same affine transformation is applied to both the retinography and the angiography in each multimodal image pair. This is necessary to keep the alignment between the images and make possible the measurement of the pixel-wise similarity, namely SSIM, between the network output and the target angiography. As in the cyclical GAN methodology, the particular range of the transformations is validated before training in order to ensure that the augmented images still resemble valid retinas.
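A minimal sketch of this negative-SSIM loss is shown below, assuming single-channel images with values in [0, 1]. The 11 × 11 window size and the constants C1 = (0.01L)² and C2 = (0.03L)² are taken from the original SSIM formulation [32] and are assumptions here, since the chapter only specifies the Gaussian weighting with σ = 1.5 pixels.

```python
import torch
import torch.nn.functional as F

def gaussian_window(size=11, sigma=1.5):
    """Isotropic 2D Gaussian used to weight the neighborhood of each pixel."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = (g / g.sum()).unsqueeze(0)
    return (g.t() @ g).view(1, 1, size, size)  # outer product -> 2D window

def negative_ssim_loss(pred, target, data_range=1.0):
    """Negative mean SSIM between single-channel image batches (Eq. 15.8)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    w = gaussian_window().to(pred.device)
    pad = w.shape[-1] // 2
    mu_x, mu_y = F.conv2d(pred, w, padding=pad), F.conv2d(target, w, padding=pad)
    var_x = F.conv2d(pred * pred, w, padding=pad) - mu_x ** 2
    var_y = F.conv2d(target * target, w, padding=pad) - mu_y ** 2
    cov_xy = F.conv2d(pred * target, w, padding=pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return -ssim_map.mean()

# Example: generated vs. registered ground-truth angiography (batch, 1, H, W).
generated = torch.rand(1, 1, 128, 128, requires_grad=True)
ground_truth = torch.rand(1, 1, 128, 128)
loss = negative_ssim_loss(generated, ground_truth)
loss.backward()
print(float(loss))
```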
15.3.3 Network architectures
Regarding the neural networks, the same network architectures are used for the two presented methodologies, cyclical GAN and paired SSIM. This eases the comparison between the methodologies, excluding the network architecture as a factor in the possible performance differences.
In particular, the experiments presented in this chapter are performed with the same network architectures previously used in Ref. [19].

The generator, which is used in both cyclical GAN and paired SSIM, is a fully convolutional neural network consisting of an encoder, a decoder, and several residual blocks between them. A diagram of the network and the details of the different blocks are depicted in Fig. 15.5 and Table 15.1, respectively. In contrast with other common encoder-decoder architectures, this network presents a small encoder and decoder, which is compensated by the large number of layers in the middle residual blocks. As a consequence, there is also a small spatial reduction of the input data through the network. In particular, the height and width of the internal representations within the network are reduced by up to a factor of 4. This relatively low spatial reduction allows keeping an adequate level of spatial accuracy without the necessity of additional features such as skip connections [34].
Fig. 15.5 Diagram of the network architecture for the generator. Each colored block represents the output of a layer in the neural network. The width of the blocks represents the number of channels whereas the height represents the spatial dimensions. The details of the different layers are in Table 15.1.
Table 15.1 Building blocks of the generator architecture.

Block    | Layers            | Kernel | Stride | Out features
---------|-------------------|--------|--------|-------------
Encoder  | Conv/IN/ReLU      | 7 × 7  | 1      | 64
Encoder  | Conv/IN/ReLU      | 3 × 3  | 2      | 128
Encoder  | Conv/IN/ReLU      | 3 × 3  | 2      | 256
Residual | Conv/IN/ReLU      | 3 × 3  | 1      | 256
Residual | Conv/IN           | 3 × 3  | 1      | 256
Residual | Residual addition | –      | –      | 256
Decoder  | ConvT/IN/ReLU     | 3 × 3  | 2      | 128
Decoder  | ConvT/IN/ReLU     | 3 × 3  | 2      | 64
Decoder  | Conv/IN/ReLU      | 7 × 7  | 1      | Image channels

Conv, convolution; IN, instance normalization [33]; ConvT, convolution transpose.
Another particularity of the network is the use of instance normalization [33] layers after each convolution, in contrast to the more extended use of batch normalization. Instance normalization was initially proposed for improving the performance of style-transfer applications and has also proven effective for cyclical GANs. Additionally, these normalization layers can be seen as an effective way of dealing with the problems of using batch normalization with small batch sizes. In this sense, it should be noted that both the experiments presented in this chapter and the experiments in Ref. [19] are performed with a batch size of 1 image.

In contrast with the generator, the discriminator network is only used in the cyclical GAN methodology. The selected architecture is, again, the one used in Ref. [19]. In particular, the discriminator is a fully convolutional neural network, which allows working on arbitrarily sized images. This kind of discriminator architecture is typically known as PatchGAN [18], given that the decision of the discriminator is produced at the level of overlapping image patches. A diagram of the network and the details of the different layers are depicted in Fig. 15.6 and Table 15.2, respectively. The characteristics of the different layers are similar to those in the generator network.
Fig. 15.6 Diagram of the network architecture for the discriminator. Each colored block represents the output of a layer in the network. The width of the blocks represents the number of channels whereas the height represents the spatial dimensions. The details of the different layers are in Table 15.2.

Table 15.2 Layers of the discriminator architecture.

Layers             | Kernel | Stride | Out features
-------------------|--------|--------|-------------
Conv/Leaky ReLU    | 4 × 4  | 2      | 64
Conv/IN/Leaky ReLU | 4 × 4  | 2      | 128
Conv/IN/Leaky ReLU | 4 × 4  | 2      | 256
Conv/IN/Leaky ReLU | 4 × 4  | 1      | 512
Conv               | 4 × 4  | 1      | 1

Conv, convolution; IN, instance normalization.
The main difference is the use of Leaky ReLU instead of ReLU as the activation function, which has been demonstrated to be a useful modification for the adequate training of GANs [29]. With regard to the discriminator output, this architecture provides a decision for overlapping image patches of size 70 × 70 pixels.
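The sketch below approximates these two architectures in PyTorch following Tables 15.1 and 15.2. The number of residual blocks, the padding strategy, and the final output activation are assumptions not specified by the tables, so this should be read as an approximation of the networks of Ref. [19] rather than an exact reproduction.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block of Table 15.1: Conv/IN/ReLU -> Conv/IN -> residual addition."""
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

def generator(in_ch=3, out_ch=1, n_res=9):  # n_res is an assumption
    layers = [nn.Conv2d(in_ch, 64, 7, padding=3), nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
              nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
              nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.InstanceNorm2d(256), nn.ReLU(inplace=True)]
    layers += [ResidualBlock(256) for _ in range(n_res)]
    layers += [nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
               nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
               nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
               nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
               nn.Conv2d(64, out_ch, 7, padding=3)]  # final activation omitted (assumption)
    return nn.Sequential(*layers)

def discriminator(in_ch=1):
    """PatchGAN discriminator of Table 15.2: one score per overlapping image patch."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.InstanceNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.InstanceNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(256, 512, 4, stride=1, padding=1), nn.InstanceNorm2d(512), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(512, 1, 4, stride=1, padding=1))

G, D = generator(), discriminator()
fake_angiography = G(torch.rand(1, 3, 256, 256))  # 3-channel retinography in, 1-channel out
print(fake_angiography.shape, D(fake_angiography).shape)
```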
15.4 Experiments and results
15.4.1 Datasets
The experiments presented in this chapter are performed on a multimodal dataset consisting of 118 retinography-angiography pairs. This multimodal dataset is created from two different collections of images. In particular, half of the images are taken from the public multimodal dataset provided by Isfahan MISP [35], whereas the other half have been gathered from a local hospital [15].

The Isfahan MISP collection consists of 59 retinography-angiography pairs including both pathological and healthy cases. In particular, 30 image pairs correspond to patients diagnosed with diabetic retinopathy whereas the other 29 image pairs correspond to healthy retinas. All the images in this collection present a size of 720 × 576 pixels. The private collection consists of 59 additional retinography-angiography pairs. Most of the images correspond to pathological cases, including representative samples of several common ophthalmic diseases. The original images presented different sizes and, therefore, were resized to a fixed size of 720 × 576 pixels. This collection has been gathered from the ophthalmic services of the Complexo Hospitalario Universitario de Santiago de Compostela (CHUS) in Spain.

To perform the different experiments, the complete multimodal dataset is randomly split into two subsets of equal size, i.e., 59 image pairs each. One of these subsets is held out as a test set and the other is used for training the multimodal reconstruction. Additionally, the training image pairs are randomly split into a validation subset of nine image pairs and a training subset of 50 image pairs. The purpose of this split is to control the training progress through the validation subset, as described in Section 15.3. Finally, it should be noted that, although the same subset of image pairs is used for training both methodologies, the images are treated as unpaired in the cyclical GAN approach.
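A minimal sketch of this splitting procedure is shown below; the pair identifiers and the random seed are hypothetical and used only for illustration.

```python
import random

# Hypothetical identifiers for the 118 retinography-angiography pairs.
pairs = [f"pair_{i:03d}" for i in range(118)]

random.seed(0)          # assumed seed, only to make the example reproducible
random.shuffle(pairs)

test = pairs[:59]       # held-out test set (59 pairs)
train_val = pairs[59:]  # remaining 59 pairs for training
val = train_val[:9]     # validation subset used to monitor the training (9 pairs)
train = train_val[9:]   # training subset (50 pairs)

print(len(train), len(val), len(test))  # 50 9 59
```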
15.4.2 Qualitative evaluation of the reconstruction
Firstly, the quality and coherence of the generated angiographies are evaluated through visual analysis. To that end, Figs. 15.7 and 15.8 depict some representative examples of generated images together with the original retinographies and angiographies. The examples are taken from the holdout test set. In general, both methodologies were able to learn an adequate transformation for the main anatomical structures in the retina, namely, the vasculature, fovea, and optic disc.
Fig. 15.7 Examples of generated angiographies together with the corresponding original retinographies and angiographies. Some representative examples of microaneurysms (green), microhemorrhages (blue), and bright lesions (yellow) are marked with circles.
In particular, it is observed that the retinal vasculature is successfully enhanced in all the cases, which is one of the main characteristics of real angiographies. This vascular enhancement evidences a high-level understanding of the different structures in the retina, given that other dark-colored structures in the retinography, such as the fovea, are mainly kept with a dark tone in the reconstructed angiographies. This means that the applied transformation is structure-specific and guided by the semantic information in the images instead of low-level information such as color.
Fig. 15.8 Examples of generated angiographies together with the corresponding original retinographies and angiographies. Some representative examples of microaneurysms (green) and microhemorrhages (blue) are marked with circles.
In contrast with the vasculature, the reconstructed optic discs are not as close to those in the real angiographies. However, this can be explained by the fact that the appearance of the optic disc is not as consistent among angiographies. In this sense, both methodologies learn to reconstruct the optic disc with a slightly higher intensity, which may indicate that this is the predominant appearance of this anatomical structure in the training set.
With regard to the pathological structures, there are greater differences between the presented methodologies. For instance, microaneurysms are only generated or enhanced by the cyclical GAN methodology. Microaneurysms are tiny vascular lesions that, in contrast to other pathological structures, remain connected to the bloodstream. Therefore, they are directly affected by the injected contrast dye in the angiography. As observed in Fig. 15.7, the cyclical GAN methodology is able to enhance these small lesions. However, not all the microaneurysms in the ground truth angiography are reconstructed, nor are all the reconstructed microaneurysms present in the ground truth. This may indicate that part of these microaneurysms are artificially created by the network or that small microhemorrhages are being misidentified as microaneurysms. Nevertheless, it must be considered that the detection of microaneurysms is a very challenging task in the field. Thus, despite the possible errors, the fact that these small structures were identified by the cyclical GAN methodology is a significant outcome.

In contrast to the previous analysis of microaneurysms, the examples in Fig. 15.7 evidence that the paired SSIM methodology provides a better reconstruction for other pathological structures. In particular, bright lesions that are present in the retinography should not be visible in the angiography. However, the cyclical GAN approach fails to completely remove these lesions, especially when they are large, such as those in the top-left quarter of the retina shown in Fig. 15.7B. The paired SSIM approach provides a more accurate reconstruction of these kinds of lesions, although in the previous case there still remains a visible trace in the area of the lesion. Finally, the microhemorrhages are also more accurately reconstructed by the paired SSIM approach. These lesions present a dark appearance in both retinography and angiography. In the depicted examples, it is observed that paired SSIM reconstructs the microhemorrhages, as expected, whereas the cyclical GAN approach tends to remove them. Additionally, in some cases, the small microhemorrhages are reconstructed with a bright tone like the microaneurysms.

Besides the anatomical and pathological structures in the retina, the main difference observed between both methodologies is the general appearance of the generated angiographies. In this regard, the images generated by the cyclical GAN present a more realistic look and could more easily be mistaken for real angiographies. The main reason for this is the texture of the images. In particular, the cyclical GAN produces a textured retinal background that mimics the appearance of a real angiography. In contrast, the retinal background in the angiographies generated by paired SSIM is very homogeneous, which gives away the synthetic nature of the images. The explanation for this difference between both approaches is the use of GANs in the cyclical GAN methodology. The discriminator network has the capacity to learn and distinguish the main characteristics of the angiography, including the textured background. Thus, a synthetic angiography with a smooth background would be easily identified as fake by the discriminator. Consequently, during the training, the generator will learn to generate the textured background in order to trick the discriminator.
In the case of the paired SSIM, the presented results show that SSIM does not provide the feedback required to learn this characteristic. Additionally, according to the results presented in Ref. [15], the use of the L1-norm or the L2-norm in the loss function does not provide that feedback either. In this regard, it should be noted that these are full-reference pixel-wise metrics that directly compare the network output against a specific ground truth image. Thus, even if an angiography-like texture is generated, this will not necessarily minimize the loss function if the generated texture does not exactly match the one in the provided ground truth. It could be the case that the specific texture of each angiography is impossible to infer from the corresponding retinography. In that scenario, the generator could never completely reduce the loss portion corresponding to the textured background. The resulting outcome could be the generation of a homogeneous background that minimizes the loss throughout the training set. This explanation fits with what is observed in Figs. 15.7 and 15.8.
15.4.3 Quantitative evaluation of the reconstruction
The multimodal reconstruction is quantitatively evaluated by measuring the reconstruction error between the generated and the ground truth angiographies. In particular, the reconstruction is evaluated by means of SSIM, mean absolute error (MAE), and mean squared error (MSE), which are common evaluation metrics for image reconstruction and image quality assessment. The presented evaluation is performed on the paired data of the holdout test set.

When comparing the two presented methodologies, it must be considered that the paired SSIM relies on the availability of paired data for training. The paired data represents a richer source of information than the unpaired counterpart and, therefore, the paired SSIM is expected to provide better performance than the cyclical GAN for the same number of training samples. Additionally, it should also be considered that paired data, despite being commonly available in medical imaging, is inherently harder to collect than the unpaired counterpart. For these reasons, the presented evaluation not only compares the performance of both methodologies when using the complete training set, but also compares their performance when more unpaired than paired images are available for training, which is an expected scenario in practical applications.

The results of the quantitative evaluation are depicted in Fig. 15.9. In the case of paired SSIM, the presented results correspond to several experiments with a varying number of training samples, ranging from 10 to 50 image pairs. In the case of cyclical GAN, the presented results are obtained after training with the complete training subset, i.e., 50 image pairs. Firstly, it is observed that the paired SSIM always provides better results than the cyclical GAN considering SSIM, although that is not the case for MAE and MSE.
Fig. 15.9 Comparison of cyclical GAN and paired SSIM with a varying number of training samples for paired SSIM. The evaluation is performed by means of (A) SSIM, (B) MAE, and (C) MSE.
Considering these two metrics, the paired SSIM obtains similar or worse results depending on the number of training samples. In general, it is clear that, up to 30 image pairs, the paired SSIM experiences a positive evolution with the addition of more training data. Then, between 30 and 50 image pairs, the evolution stagnates and there is no improvement with the addition of more images. In the case of MAE and MSE, the final results to which the paired SSIM converges are approximately the same as those obtained by the cyclical GAN. This may indicate an upper bound in the performance of the multimodal reconstruction with this experimental setting. Regarding the comparison by means of SSIM, there is an important difference between both methodologies independently of the number of training images for paired SSIM. On the one hand, this may be explained by the fact that the generator of the paired SSIM has been explicitly trained to maximize SSIM, so this network excels when evaluated by means of this metric. On the other hand, it must be considered that SSIM is a more complex metric than MAE or MSE. In particular, SSIM does not directly measure the difference between pixels but, instead, measures local similarities that include high-level information such as structural coherence. Thus, it is possible that subtle structural errors, which are not evidenced by MAE or MSE, contribute to the worse performance of cyclical GAN considering SSIM.
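These reconstruction metrics can be computed, for instance, with NumPy and scikit-image as in the sketch below; the unit data range and the random arrays standing in for test images are assumptions for illustration.

```python
import numpy as np
from skimage.metrics import structural_similarity

def reconstruction_metrics(generated, ground_truth, data_range=1.0):
    """SSIM, MAE, and MSE between a generated and a real angiography (float arrays)."""
    ssim = structural_similarity(ground_truth, generated, data_range=data_range)
    mae = np.mean(np.abs(ground_truth - generated))
    mse = np.mean((ground_truth - generated) ** 2)
    return ssim, mae, mse

# Example with random arrays standing in for a test image pair.
generated = np.random.rand(576, 720).astype(np.float32)
ground_truth = np.random.rand(576, 720).astype(np.float32)
print(reconstruction_metrics(generated, ground_truth))
```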
15.4.4 Ablation analysis of the generated images
In order to better understand the obtained results, we present a more detailed quantitative analysis in this section. In particular, the presented analysis considers the possible differences in error distribution among different retinal regions. As shown in Section 15.4.2, both methodologies seem to provide a similar enhancement of the retinal vasculature. However, there are important differences in the reconstructed retinal background and certain pathological structures. Therefore, it is interesting to study how the reconstruction error is distributed between the vasculature and the background, and whether this distribution differs between both methodologies.

To that end, the reconstruction errors are recalculated using a binary vascular mask to separate the vasculature and background regions. Given that only a broad approximation of the vasculature is necessary, the vascular mask is computed by applying some common image processing techniques. First, the multiscale Laplacian operator proposed in Ref. [31] is applied to the original angiography. This operation further enhances the retinal vasculature, resulting in an image with much greater contrast between vasculature and background [36]. Then, the vascular region is dilated to ensure that the resulting mask not only includes the vessels but also their surrounding pixels. This way, the reconstruction error in the vasculature will also include the error due to inaccurate vessel edges. Finally, the vascular mask is binarized by applying Otsu's thresholding method [37]. An example of the produced binary vascular mask together with the original angiography is depicted in Fig. 15.10.
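A sketch of this mask computation is given below. The multiscale Laplacian of Ref. [31] is approximated here by the maximum scale-normalized Laplacian-of-Gaussian response over a few scales, and the specific scales, dilation radius, and ordering of the dilation and thresholding steps are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace
from skimage.filters import threshold_otsu
from skimage.morphology import dilation, disk

def vessel_mask(angio, sigmas=(1, 2, 4), dilation_radius=2):
    """Approximate binary vascular mask from an angiography (2D float array in [0, 1])."""
    # Vessel enhancement: sign-inverted, scale-normalized LoG responses, so the bright
    # vessels of the angiography yield high values; the maximum is taken over the scales.
    enhanced = np.max([-gaussian_laplace(angio, sigma=s) * s ** 2 for s in sigmas], axis=0)
    # Grayscale dilation to also cover the vessel edges, then Otsu binarization.
    enhanced = dilation(enhanced, disk(dilation_radius))
    return enhanced > threshold_otsu(enhanced)

# Region-wise reconstruction error using the mask (random arrays stand in for real images).
angiography = np.random.rand(576, 720)
generated = np.random.rand(576, 720)
mask = vessel_mask(angiography)
abs_error = np.abs(angiography - generated)
print("MAE vessels:", abs_error[mask].mean(), "MAE background:", abs_error[~mask].mean())
```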
Fig. 15.10 Example of vascular mask used for evaluation: (A) angiography and (B) resulting vessel mask for (A).
The results of the quantitative evaluation using the computed vascular masks are depicted in Fig. 15.11. Firstly, it is observed that, in all the cases, the reconstruction error is greater in the vessels than in the background. This may indicate that the reconstruction of the retinal background is an easier task than that of the retinal vasculature. In this regard, it must be noted that the retinal vasculature is an intricate network with numerous intersections and bifurcations, which increases the difficulty of the reconstruction. The background also includes some pathological structures, which can be a source of errors as seen in Section 15.4.2. However, these pathological structures are neither present in all the images nor occupy a significantly large area of the background. Moreover, the bright lesions in the angiography, i.e., the microaneurysms, are included within the vascular mask, as can be seen in Fig. 15.10. This balances the contribution of the pathological structures between both regions.

Regarding the comparison between cyclical GAN and paired SSIM, the analysis is the same as in the previous evaluation, independently of the retinal region that is analyzed, vasculature or background. In particular, the performance of paired SSIM experiences the same evolution with the increase in the number of training images. Considering MAE and MSE, paired SSIM converges again to the same results achieved by cyclical GAN, resulting in a similar performance. In contrast, there is still an important difference between the methodologies when considering SSIM. Finally, it is interesting to observe that the error distribution between regions is the same for paired SSIM and cyclical GAN, even when there is a clear visual difference in the reconstructed background between both methodologies (see Fig. 15.7). This shows that the more realistic look provided by the textured background does not necessarily lead to a better reconstruction in terms of full-reference pixel-wise metrics. In particular, the same reconstruction error can be achieved by producing a homogeneous background with an adequate tone, as paired SSIM does. This explains why the use of these metrics as a loss function does not encourage the generator to produce a textured background. Moreover, in the case of SSIM, which is the metric used by paired SSIM during training, the reconstruction error for the textured background is even greater than that of the homogeneous version.
Fig. 15.11 Comparison of cyclical GAN and paired SSIM with a varying number of training samples for paired SSIM. The evaluation is conducted independently for vessels and the background of the images. The evaluation is performed by means of (A) SSIM, (B) MAE, and (C) MSE.
15.4.5 Structural coherence of the generated images
An observation that remains to be explained after the previous analyses is the difference in results depending on whether the evaluation is performed by means of SSIM or MAE/MSE. In particular, both methodologies achieve similar results in MAE and MSE, although paired SSIM always performs better in terms of SSIM. Given that SSIM is characterized by including higher level information, such as the structural coherence between images, the generated images are visually inspected to find possible structural differences. Fig. 15.12 depicts some composite images, built using a checkerboard pattern, that are used to perform the visual analysis. In particular, the depicted images show the generated angiography together with the original retinography (Fig. 15.12A and C) as well as the generated angiography together with the ground truth angiography (Fig. 15.12B and D). At first glance, it seems that both angiographies, from paired SSIM and cyclical GAN, are perfectly reconstructed. However, on closer examination, it is observed that in the angiographies generated by cyclical GAN there are small displacements with respect to the originals. Examples of these displacements are shown in detail in Fig. 15.12. As can be observed, the displacement occurs, at least, in the retinal vasculature. Moreover, the displacement is consistent among the zoomed patches even when they are distant in the images. This indicates that the observed displacement could be the result of an affine transformation. With regard to the cause of the displacement, an initial hypothesis is based on the fact that cyclical GAN does not put any hard constraint on the structure of the generated angiography. The only requirements are that the image must look like a real angiography and that it must be possible to reconstruct the original retinography from it. Thus, although the most straightforward way to reconstruct the original retinography seems to be to keep the original structure as it is, nothing enforces the networks to do so. Nevertheless, it must be considered that if GR2A applies any spatial transformation to the generated angiographies, then GA2R must learn to apply the inverse transformation when reconstructing the original retinography. This synergy between the networks is necessary to still minimize the cycle-consistency loss in the cyclical GAN methodology. Although not straightforward, this situation seems plausible given that the observed displacement is very subtle. The presented situation may arise if the first network, GR2A, starts to reconstruct the vessels of the angiography over the vessel edges of the input retinography. This is likely to happen given how easily a neural network can detect edges in an image. Moreover, the vessel edges are easier to detect than the vessel centerlines. To verify this hypothesis, the angiographies generated during the first stages of the training have been reviewed. A representative example of these images is depicted in Fig. 15.13. As can be observed, there are some bright lines that seem to be drawn over the edges of the subtle dark vessels. This evidences the origin of the issue, although the ultimate cause is the underconstrained training setting of cyclical GAN.
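A checkerboard composite of this kind is straightforward to reproduce. The sketch below alternates square tiles from two registered images so that structural misalignments appear as broken vessels at the tile borders; the tile size and function name are illustrative choices, not taken from the chapter.

```python
import numpy as np

def checkerboard_composite(image_a, image_b, tile=64):
    """Interleave square tiles of two registered, same-sized images.

    Structural displacements between the two images show up as discontinuities
    (e.g., broken vessels) at the tile borders.
    """
    assert image_a.shape == image_b.shape, "images must be registered and equally sized"
    rows, cols = image_a.shape[:2]
    yy, xx = np.indices((rows, cols))
    use_a = ((yy // tile) + (xx // tile)) % 2 == 0   # boolean checkerboard pattern
    if image_a.ndim == 3:                            # broadcast over color channels
        use_a = use_a[..., None]
    return np.where(use_a, image_a, image_b)
```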
Fig. 15.12 Comparison of generated angiographies against (A, C) the corresponding original retinographies and (B, D) the corresponding ground truth angiographies. (A, B) Angiography generated using paired SSIM. (C, D) Angiography generated using cyclical GAN. Additionally, cropped regions are depicted in detail for each case.
Fig. 15.13 Representative example of a generated angiography in the first stages of training for cyclical GAN: (A) original retinography and (B) generated angiography.
15.5 Discussion and conclusions
In this chapter, we have presented a cyclical GAN methodology for the multimodal reconstruction of retinal images [11]. This multimodal reconstruction is a novel task that consists of the translation of medical images between complementary modalities [7]. This allows the estimation of either more invasive or less affordable imaging modalities from a readily available alternative. For instance, this chapter addresses the estimation of fluorescein angiography from retinography, where the former requires the injection of a contrast dye into the patient. Despite the recent technical advances in the field, the direct use of generated images in clinical practice is still only a potential future application. However, there are several other possible applications that can take advantage of this multimodal reconstruction. For instance, multimodal reconstruction has already been demonstrated to be a successful pretraining task for transfer learning in medical image analysis [13, 14]. This is an important application that reduces the need for large collections of expert-annotated data in medical imaging [10]. In order to provide a comprehensive analysis of the cyclical GAN methodology, we have also presented an exhaustive comparison against a state-of-the-art approach in which no GANs were used [15]. This way, it is possible to study the particular advantages and disadvantages of using GANs for multimodal reconstruction. The provided comparison is performed under the fairest possible conditions, using the same dataset, network architectures, and training strategies. In this regard, the only differences are those intrinsically due to the methodologies themselves. Regarding the presented results, it is seen that both approaches are able to produce an adequate estimation of the angiography from the retinography. However, there are important differences in several aspects of the generated angiographies. Moreover, the requirements for training each of the two approaches must also be considered in the comparison. Regarding the training requirements, the main difference is the use of unpaired data in cyclical GAN and paired data in paired SSIM. In broad-domain applications, i.e., those performed on natural images, this would represent an insurmountable obstacle for the paired SSIM methodology. However, in medical imaging, paired data can be obtained relatively easily due to the common use of complementary imaging modalities in clinical practice. In this case, however, the disadvantage of paired SSIM is the need for registered image pairs in which the different anatomical and pathological structures are aligned. The multimodal registration method applied in paired SSIM has been demonstrated to be reliable for the alignment of retinography-angiography pairs [31]. Moreover, it has been successfully applied for the registration of the multimodal dataset used in the experiments described herein. However, the results presented in Ref. [31] also show that, quantitatively, the registration performance is lower for the most complex cases, which can be due to, e.g., low-quality images or severe pathologies. This could potentially limit the variety of images in an extended
version of the dataset including more challenging scenarios. Additionally, the registration method in paired SSIM is domain-specific and, therefore, cannot be directly applied to other types of multimodal image pairs. This means that the use of paired SSIM in other medical specialties would require the availability of adequate registration methods. Although image registration is a common task in medical imaging, the availability of such multimodal registration algorithms cannot be taken for granted. In contrast, cyclical GAN can be directly applied to any kind of multimodal setting without the need for registered or paired data. Another important difference between the presented approaches is the complexity of the training procedure. In this sense, cyclical GAN represents a more complex approach, including four different neural networks and two training cycles, as described in Section 15.3.1. In comparison, once the multimodal image registration is performed, paired SSIM only requires the training of a single neural network. The use of four different networks in cyclical GAN means that, computationally, more memory is required for training. In a situation of limited resources, which is the common practical scenario, this limits the size and number of images that can be included in each batch during training. Moreover, in practice, cyclical GAN also requires longer training times than paired SSIM, which further increases the computational costs. This is partly due to the use of a single network in paired SSIM, but also to its use of a full-reference pixel-wise metric as the loss function. The feedback provided by this more classical alternative results in faster convergence in comparison to adversarial training. Regarding the performance of the multimodal reconstruction, the examples depicted in Figs. 15.7 and 15.8 show that both methodologies are able to successfully recognize the main anatomical structures in the retina. In that sense, despite the evident aesthetic differences, the transformations applied to the anatomical structures are adequate in both cases. Thus, both approaches show a similar potential for transfer learning regarding the analysis of the retinal anatomy. However, when considering the pathological structures, there are important differences between the methodologies. In this case, neither methodology perfectly reconstructs all the lesions. In particular, the examples depicted in Fig. 15.7 indicate that each methodology gives preference to different types of lesions in the generated images. Thus, it is not clear which alternative would be a better option for the pathological analysis of the retinal images. In this regard, given the mixed results that are obtained, future works could explore the development of hybrid methods for the multimodal reconstruction of retinal images. The objective, in this case, would be to combine the good properties of cyclical GAN and paired SSIM. One of the main differences between cyclical GAN and paired SSIM is the appearance of the generated angiographies. Due to the use of a GAN framework in cyclical GAN, the generated angiographies look realistic and aesthetically pleasing. In contrast, the angiographies generated by paired SSIM present a more synthetic appearance. The importance
of this difference in the appearance of the generated angiographies depends on the specific application. On the one hand, for representation learning purposes, the priority is the proper recognition of the different retinal structures. Additionally, even for the potential clinical interpretation of the images, realism is not as important as the accurate reconstruction of the different structures. On the other hand, there exist potential applications, such as data augmentation or clinical simulations, where the realism of the images is of great importance. Finally, a relevant observation presented in this chapter is the fact that cyclical GAN does not necessarily preserve the exact structure of the input image. This is a known possible issue, given the underconstrained training setting in cyclical GANs. Nevertheless, in this chapter, we have presented empirical evidence of this issue in the form of small displacements of the reconstructed blood vessels. According to the evidence presented in Section 15.4.5, it is not possible to predict whether these displacements will happen or what exactly they will be. In this sense, the particular structural displacements produced by the networks are affected by the stochasticity of the training procedure. Moreover, although we have only noticed this structural incoherence in the blood vessels, similar subtle structural transformations may exist for other elements in the images. In line with prior observations in the presented comparison, the importance of these structural errors depends on the specific application for which the multimodal reconstruction is used. For instance, this kind of small structural variation should not significantly affect the quality of the internal representations learned by the network. However, it would impede the use of cyclical GAN as a tool for accurate multimodal image registration. The development of hybrid methodologies, as previously discussed, could also be a solution to this structural issue while keeping the good properties of GANs. For instance, according to the results presented in Section 15.4.3, the addition of a small number of paired training samples could be sufficient to improve the structural coherence of the cyclical GAN approach. Additionally, a hybrid approach of this kind could still incorporate those more challenging images that may not be successfully registered. To conclude, the presented cyclical GAN approach has been demonstrated to be a valid alternative for the multimodal reconstruction of retinal images. In particular, the provided comparison shows that cyclical GAN has both advantages and disadvantages with respect to the state-of-the-art approach paired SSIM. In this regard, these two approaches are complementary to each other when considering their strengths and weaknesses. This motivates the future development of hybrid methods aiming to take advantage of the strengths of both alternatives.
Acknowledgments This work was supported by Instituto de Salud Carlos III, Government of Spain, and the European Regional Development Fund (ERDF) of the European Union (EU) through the DTS18/00136 research project, and
by Ministerio de Ciencia, Innovación y Universidades, Government of Spain, through the RTI2018-095894-B-I00 research project. The authors of this work also receive financial support from the ERDF and European Social Fund (ESF) of the EU and Xunta de Galicia through Centro de Investigación de Galicia, ref. ED431G 2019/01, and the predoctoral grant contract ref. ED481A-2017/328.
Conflict of interest The authors declare no conflicts of interest.
References
[1] G. Litjens, T. Kooi, B.E. Bejnordi, A.A.A. Setio, F. Ciompi, M. Ghafoorian, J.A. van der Laak, B. van Ginneken, C.I. Sánchez, A survey on deep learning in medical image analysis, Med. Image Anal. 42 (2017) 60–88, https://doi.org/10.1016/j.media.2017.07.005.
[2] E.D. Cole, E.A. Novais, R.N. Louzada, N.K. Waheed, Contemporary retinal imaging techniques in diabetic retinopathy: a review, Clin. Exp. Ophthalmol. 44 (4) (2016) 289–299, https://doi.org/10.1111/ceo.12711.
[3] T. Farncombe, K. Iniewski, Medical Imaging: Technology and Applications, CRC Press, 2017.
[4] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual understanding: a review, Neurocomputing 187 (2016) 27–48, https://doi.org/10.1016/j.neucom.2015.09.116.
[5] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-Gonzalez, J. Garcia-Rodriguez, A survey on deep learning techniques for image and video semantic segmentation, Appl. Soft Comput. 70 (2018) 41–65, https://doi.org/10.1016/j.asoc.2018.05.018.
[6] W. Rawat, Z. Wang, Deep convolutional neural networks for image classification: a comprehensive review, Neural Comput. 29 (9) (2017) 2352–2449, https://doi.org/10.1162/neco_a_00990.
[7] A.S. Hervella, J. Rouco, J. Novo, M. Ortega, Retinal image understanding emerges from self-supervised multimodal reconstruction, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018, https://doi.org/10.1007/978-3-030-00928-1_37.
[8] S. Engelhardt, L. Sharan, M. Karck, R.D. Simone, I. Wolf, Cross-domain conditional generative adversarial networks for stereoscopic hyperrealism in surgical training, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2019, https://doi.org/10.1007/978-3-030-32254-0_18.
[9] X. Yi, E. Walia, P. Babyn, Generative adversarial network in medical imaging: a review, Med. Image Anal. 58 (2019) 101552, https://doi.org/10.1016/j.media.2019.101552.
[10] Á.S. Hervella, J. Rouco, J. Novo, M. Ortega, Learning the retinal anatomy from scarce annotated data using self-supervised multimodal reconstruction, Appl. Soft Comput. 91 (2020) 106210, https://doi.org/10.1016/j.asoc.2020.106210.
[11] A.S. Hervella, J. Rouco, J. Novo, M. Ortega, Deep multimodal reconstruction of retinal images using paired or unpaired data, in: International Joint Conference on Neural Networks (IJCNN), 2019, https://doi.org/10.1109/IJCNN.2019.8852082.
[12] L. Wang, W. Chen, W. Yang, F. Bi, F.R. Yu, A state-of-the-art review on image synthesis with generative adversarial networks, IEEE Access 8 (2020) 63514–63537, https://doi.org/10.1109/ACCESS.2020.2982224.
[13] A.S. Hervella, L. Ramos, J. Rouco, J. Novo, M. Ortega, Multi-modal self-supervised pre-training for joint optic disc and cup segmentation in eye fundus images, in: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, https://doi.org/10.1109/ICASSP40776.2020.9053551.
[14] J. Morano, A.S. Hervella, N. Barreira, J. Novo, J. Rouco, Multimodal transfer learning-based approaches for retinal vascular segmentation, in: 24th European Conference on Artificial Intelligence (ECAI), 2020.
[15] Á.S. Hervella, J. Rouco, J. Novo, M. Ortega, Self-supervised multimodal reconstruction of retinal images over paired datasets, Expert Syst. Appl. (2020) 113674, https://doi.org/10.1016/j.eswa.2020.113674.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems (NIPS), 27, 2014, pp. 2672–2680.
[17] Z. Pan, W. Yu, X. Yi, A. Khan, F. Yuan, Y. Zheng, Recent progress on generative adversarial networks (GANs): a survey, IEEE Access 7 (2019) 36322–36333, https://doi.org/10.1109/ACCESS.2019.2905015.
[18] P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, https://doi.org/10.1109/CVPR.2017.632.
[19] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, https://doi.org/10.1109/ICCV.2017.244.
[20] Z. Yi, H. Zhang, P. Tan, M. Gong, DualGAN: unsupervised dual learning for image-to-image translation, in: The IEEE International Conference on Computer Vision (ICCV), 2017, https://doi.org/10.1109/ICCV.2017.310.
[21] T. Kim, M. Cha, H. Kim, J.K. Lee, J. Kim, Learning to discover cross-domain relations with generative adversarial networks, in: Proceedings of the 34th International Conference on Machine Learning, vol. 70, 2017, pp. 1857–1865.
[22] J.M. Wolterink, T. Leiner, M.A. Viergever, I. Išgum, Generative adversarial networks for noise reduction in low-dose CT, IEEE Trans. Med. Imaging 36 (12) (2017) 2536–2545, https://doi.org/10.1109/TMI.2017.2708987.
[23] Y. Xue, T. Xu, H. Zhang, L. Long, X. Huang, SegAN: adversarial network with multi-scale L1 loss for medical image segmentation, Neuroinformatics 16 (3–4) (2018) 383–392, https://doi.org/10.1007/s12021-018-9377-x.
[24] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, H. Greenspan, GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification, Neurocomputing 321 (2018) 321–331, https://doi.org/10.1016/j.neucom.2018.09.013.
[25] T. Schlegl, P. Seeböck, S.M. Waldstein, G. Langs, U. Schmidt-Erfurth, F-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks, Med. Image Anal. 54 (2019) 30–44, https://doi.org/10.1016/j.media.2019.01.010.
[26] J. Cohen, M. Luck, S. Honari, Distribution matching losses can hallucinate features in medical image translation, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018, https://doi.org/10.1007/978-3-030-00928-1_60.
[27] X. Mao, Q. Li, H. Xie, R.Y. Lau, Z. Wang, S. Paul Smolley, Least squares generative adversarial networks, in: The IEEE International Conference on Computer Vision (ICCV), 2017, https://doi.org/10.1109/ICCV.2017.304.
[28] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, in: International Conference on Learning Representations (ICLR), 2015.
[29] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, in: International Conference on Learning Representations (ICLR), 2016.
[30] Á.S. Hervella, J. Rouco, J. Novo, M.G. Penedo, M. Ortega, Deep multi-instance heatmap regression for the detection of retinal vessel crossings and bifurcations in eye fundus images, Comput. Methods Prog. Biomed. 186 (2020) 105201, https://doi.org/10.1016/j.cmpb.2019.105201.
[31] A.S. Hervella, J. Rouco, J. Novo, M. Ortega, Multimodal registration of retinal images using domain-specific landmarks and vessel enhancement, in: International Conference on Knowledge-Based and Intelligent Information and Engineering Systems (KES), 2018, https://doi.org/10.1016/j.procs.2018.07.213.
[32] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612, https://doi.org/10.1109/TIP.2003.819861.
[33] D. Ulyanov, A. Vedaldi, V.S. Lempitsky, Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4105–4113, https://doi.org/10.1109/CVPR.2017.437.
[34] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, https://doi.org/10.1007/978-3-319-24574-4_28.
[35] S.H.M. Alipour, H. Rabbani, M.R. Akhlaghi, Diabetic Retinopathy Grading by Digital Curvelet Transform, Computational and Mathematical Methods in Medicine, 2012, https://doi.org/10.1155/2012/761901.
[36] A.S. Hervella, J. Rouco, J. Novo, M. Ortega, Self-supervised deep learning for retinal vessel segmentation using automatically generated labels from multimodal data, in: International Joint Conference on Neural Networks (IJCNN), 2019, https://doi.org/10.1109/IJCNN.2019.8851844.
[37] N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Syst. Man Cybern. 9 (1) (1979) 62–66, https://doi.org/10.1109/TSMC.1979.4310076.
CHAPTER 16
Generative adversarial network for video anomaly detection
Thittaporn Ganokratanaa (a) and Supavadee Aramvith (b)
(a) Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
(b) Multimedia Data Analytics and Processing Research Unit, Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
16.1 Introduction
Video anomaly detection (VAD) has gained increasing recognition in surveillance systems for ensuring security. VAD is a challenging task due to the complex appearance structure of the images combined with motion between frames. This anomaly research has drawn interest from researchers in computer vision. Traditional approaches, including the social force model (SF) [1], mixture of probabilistic principal component analyzers (MPPCA) [2], Gaussian mixture of dynamic texture (MDT) and the combination SF + MPPCA [3, 4], sparse reconstruction [5–7], one-class learning machines [8], K-nearest neighbor [9], and tracklet analysis [10, 11], have been proposed to address the anomaly detection problem owing to their performance in detecting multiple objects. However, the traditional approaches do not perform well on the anomaly detection problem, since it is a complex task that mostly arises in crowded scenes, making it difficult for these approaches to generalize. Thus, deep learning approaches have been employed to achieve a higher anomaly detection rate, such as the deep Gaussian mixture model (GMM) [12, 13], autoencoders [14, 15], and deep pretrained convolutional neural networks (CNN) [16–19]. Even with these deep learning approaches, the problem remains open when dealing with all issues of anomaly detection. Specifically, the major challenges in the anomaly detection task fall into three categories: complex scenes, small anomaly samples, and object localization at the pixel level. A complex scene may contain multiple moving objects with clutter and occlusions, which makes detecting and localizing objects difficult. This issue also relates to crowded scenes, which are more challenging than uncrowded ones. The second challenge is the small number of samples with abnormal ground truth in the available anomaly datasets, which complicates model training for data-hungry deep learning approaches. In practice, it is impossible to train on all anomalous events as they occur randomly. Therefore, anomaly detection is treated as an unsupervised learning task, since no data labeling of the rare positive class is required. Another important issue is about
pixel localization of the objects in the scene. Previous works [1, 6, 15, 20] struggled with this challenging task, achieving high accuracy only for frame-level anomaly detection. In contrast, the accuracy of pixel-level anomaly localization is significantly poorer. In recent works [21–23], researchers have tried to improve performance to cover all evaluation criteria, but achieve good performance at either the frame or the pixel level only in some complex scenes. This happens as a consequence of insufficient input features for training the model, such as the appearance and motion patterns of the objects. The features of foreground objects should be extracted sufficiently and efficiently during training to make the model understand all their characteristics. To deal with these challenges, the unsupervised deep learning-based approach is the most suitable technique for the anomaly detection problem, since it does not require any labeled data on abnormalities. Unsupervised learning is a key domain of deep generative models, such as adversarially trained autoencoders (AAE) [24], variational autoencoders (VAE) [25], and generative adversarial networks (GAN) [26, 27]. Generative models for anomaly detection aim to model only normal events during training, as these constitute the majority of the patterns. The abnormal events can then be distinguished by evaluating the distance from the learned normal events. Early generative works are mostly based on handcrafted features [1, 3, 5, 11, 28] or CNNs [17, 18] to extract and learn the important features. However, the performance of anomaly detection and localization still needs to be improved, due to the difficulties in approximating many probabilistic computations and in utilizing piecewise linear units, as in the generative models [29–31]. Hence, recent trends in video anomaly detection focus more on GAN [20–22], which is an effective approach that achieves high performance in image generation and synthesis, affords data augmentation, and overcomes classification problems in complex scenarios.
16.1.1 Anomaly detection for surveillance videos
Video surveillance has gained increasing popularity since it is widely used to ensure security. Closed-circuit television (CCTV) cameras are used to monitor the scene, record certain situations, and provide evidence. They generally serve a post-event forensic process in which human operators manually investigate previous events for abnormalities [32]. This manual process is difficult for the operators, since abnormalities can occur in any situation, whether crowded or uncrowded, indoors or outdoors. Additionally, serious incidents may occur, including terrorist attacks, robberies, and area invasions, leading to personal injury or death and property damage [33]. Thus, to enhance the performance of video surveillance, it is crucial to build an intelligent system for anomaly detection and localization. An anomaly is defined as "a person or thing that is different from what is usual, or not in agreement with something else and therefore not satisfactory" [34]. Multiple terms
Fig. 16.1 Examples of abnormal events in crowds from the UCSD pedestrian [4], UMN [1], and CUHK Avenue [6] datasets.
stand for the anomaly, including anomalous events, abnormal events, unusual events, abnormality, irregularity, and suspicious activity. In VAD, an abnormal event can be seen as a distinctive pattern or motion that differs from neighboring areas or from the majority of the activities in the scene. Specifically, normal events are the frequently occurring objects and common moving patterns that represent the majority of the patterns, while abnormal events are varied and rarely occur; they are infrequent events that may include unseen objects and have a significantly lower probability than normal events. Examples of different abnormal events are shown in Fig. 16.1. Anomaly detection for surveillance videos is challenging because of the complex patterns of real scenes (e.g., moving foreground objects with large amounts of occlusion and clutter in crowds) captured by static CCTV cameras. VAD relies on fixed CCTV cameras, taking only moving foreground objects into account while disregarding the static background. The goal of VAD is to accurately identify all possible anomalous events against the regular normal patterns in crowded and complex scenes from the video sequences. To design effective anomaly detection for surveillance videos, the model should learn information about the objects from both their appearance (spatial) and motion (temporal) features, in an unsupervised or semisupervised learning manner. In the unsupervised setting, only frames of normal events are used for training, meaning that there is no data labeling of abnormalities. This benefits the use of VAD in real-world environments, where any type of abnormal event can occur unpredictably. Then, all videos are fed into the model during testing. Any pattern that deviates from the trained normal samples is identified as an abnormal event; such events can be detected by evaluating an anomaly score, defined as the error of the predictive model in a vector space or as the posterior probability of the test samples.
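As a simple illustration of this scoring scheme, the sketch below turns per-frame reconstruction errors into normalized anomaly scores and a binary decision. The normalization and threshold value are illustrative assumptions and are not taken from any specific method reviewed in this chapter.

```python
import numpy as np

def frame_anomaly_scores(originals, reconstructions):
    """Frame-level anomaly score: mean squared reconstruction error per frame,
    min-max normalized over the test video."""
    err = (originals.astype(float) - reconstructions.astype(float)) ** 2
    scores = err.reshape(err.shape[0], -1).mean(axis=1)
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)

def flag_abnormal_frames(scores, threshold=0.5):
    """Frames whose normalized score exceeds the threshold are reported as abnormal."""
    return scores > threshold
```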
16.1.2 A broader view of generative adversarial network for anomaly detection in videos A GAN has been studied for years. The success of GAN comes from the effectiveness of its structure in improving image generation and classification tasks with a pair of networks. GAN presents an end-to-end deep learning framework in modeling the
likelihood of normal events in videos and provides flexibility in model training since it does not require annotated abnormal samples. Its learning is achieved through backpropagation, which computes the error for each parameter in both the generator and discriminator networks. The goals of GAN are to produce synthetic output that cannot be identified as different from the real data and to automatically learn a loss function that achieves this indistinguishability. The GAN loss attempts to classify whether the synthetic output is fake or real, while the generative model is simultaneously trained to minimize it. This loss makes GAN beneficial in various applications, since it can adapt to the data without requiring different loss functions, unlike the loss functions of traditional CNN approaches. Specifically, in video anomaly detection, GAN plays a two-player minimax game between a generator G and a discriminator D, providing high-accuracy output. G attempts to fool D by generating synthetic images that are similar to the real data, whereas D attempts to discriminate whether its input belongs to the real or the synthetic data. This minimax game benefits data augmentation and implicit data management, thanks to D, which assists G in reducing the distance between its samples and the training data distribution and in training on small benchmarks without the need to define an explicit parametric function or additional classifiers. Therefore, GAN is one of the most distinctive approaches for dealing with complex anomaly detection tasks, since it achieves good results in reconstructing, translating, and classifying images. Following the unsupervised GAN for image-to-image translation [35], it can extract significant features of the objects of interest (e.g., moving foreground objects) and efficiently translate them from spatial to temporal representations without any prior knowledge of anomalies or direct information on anomaly types. In this way, GAN can provide comprehensive information concerning appearance and motion features. Hence, we focus on reviewing GAN for the anomaly detection task in videos and also introduce our proposed method, named deep spatiotemporal translation network (DSTN), a novel unsupervised GAN approach to detect and localize anomalies in crowded scenes [26]. This chapter contains five sections. In Section 16.1, anomaly detection for surveillance videos and a broader view of GAN are reviewed. Section 16.2 presents a literature review, including the basic structure of GAN along with the literature on anomaly detection in videos based on GAN. We elaborate on GAN training in Section 16.3, which includes image-to-image translation and our proposed DSTN. The performance of DSTN is discussed in Section 16.4 along with related details, including the publicly available anomaly benchmarks, the evaluation criteria, the comparison of GAN with an autoencoder, and the advantages and limitations of GAN for anomaly detection in videos. Finally, Section 16.5 provides a conclusion for this chapter.
16.2 Literature review
We introduce the basic structure of the GAN and review the related works on anomaly detection for surveillance videos based on GAN. The details of the GAN architecture and its state-of-the-art methods in video anomaly detection are described as follows.
16.2.1 The basic structure of generative adversarial network
Although the concept of generative models has been studied in machine learning for many years, it gained wide recognition when Goodfellow et al. [27] introduced a novel adversarial process named GAN. The basic structure of GAN consists of two networks working simultaneously against each other, the generator G and the discriminator D, as shown in Fig. 16.2. In general, G produces a synthetic image n from the input noise z, whereas D attempts to differentiate between n and a real image r. The goal of G is to generate synthesized examples of objects that look like real ones and thereby fool D into wrongly deciding that the data generated by G are real. On the other hand, D is trained on a dataset with image labels. D tries its best to discriminate whether its input data are fake or real by comparing them with the real training data. In other words, G is a counterfeiter producing fake checks, while D is an officer trying to catch G. Specifically, G becomes good at creating synthesized images because it updates its parameters using gradients propagated through D, making it increasingly challenging for D to differentiate its input data. The training of this minimax game improves both networks until, at some point, the probability distributions of G and the real data are
Fig. 16.2 Generative adversarial network architecture.
equivalent (when there is enough capacity and training time), so that G and D cannot improve any longer. At that point, D is unable to differentiate between the two distributions. From the perspective of GAN training, G takes the input noise z from a probability distribution p_z(z), generates fake data, and feeds them into D as D(G(z)). D(x) denotes the probability that x comes from the distribution of real data p_data rather than from the generator distribution p_g. The discriminator D takes two kinds of inputs, from G(z) and from p_data, and is trained to maximize the probability of assigning the correct label to both the real and the synthesized examples. Specifically, the goal of D is to accurately classify its input samples by giving a label of 1 to real samples and a label of 0 to synthetic ones. D solves a binary classification problem based on neural networks with a sigmoid output, giving values in the range [0, 1]. G is simultaneously trained to reduce [log(1 − D(G(z)))]. These two adversarial networks, G and D, are trained with the value function V(D, G) as follows:

\min_G \max_D V(D, G)   (16.1)

V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]   (16.2)
where [log D(x)] is the objective term of the discriminator, representing the expectation over the real data distribution p_data passing through D, which tries to maximize [log D(x)]. Note that the discriminator objective is maximized when both real and synthetic samples are accurately classified as 1 and 0, respectively. The objective term of the generator is [log(1 − D(G(z)))], representing the expectation over the random noise samples z that pass through G to generate the synthetic (fake) samples and then through D; G tries to minimize [log(1 − D(G(z)))]. The goal of the generator's objective is to fool D into making a wrong classification, i.e., to drive D toward identifying the synthetic samples as real ones, labeling them as 1. In other words, it attempts to minimize the likelihood that D classifies these samples correctly as fake data. Thus, [log(1 − D(G(z)))] is reduced when the synthesized samples are wrongly labeled as 1. In practice, however, G is poor at generating synthetic samples in the early training stage, making it too easy for D to distinguish the synthesized samples produced by G from the real samples of the dataset due to their great difference. To address this problem, the generator objective can be changed from minimizing [log(1 − D(G(z)))] to maximizing [log D(G(z))]; this alternative objective provides a stronger gradient for G in the early stage of training. As the two objective functions are distinct, the two networks are trained together by alternating the gradient updates following the standard gradient rule with a momentum parameter. There are two main procedures for training the G and D networks and updating their gradients alternately. The first is to freeze G and train only D. This alternating gradient update is motivated by the fact that the discriminator needs to learn the outputs
of the generator in order to tell the real data from the fake ones; thus, the generator must be frozen. The discriminator network can be updated as shown in the following equation:

\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right]   (16.3)
Specifically, the gradient updates differ for the learning of the two networks: D uses stochastic gradient ascent, while G uses stochastic gradient descent. D uses a hyperparameter k that sets the number of discriminator update steps for each step of G. To update D, stochastic gradient ascent is performed k times to increase the likelihood that D accurately labels both kinds of samples (fake and real data). These updates are achieved using backpropagation on an equal number of examples of each kind, i.e., a combined batch of 2m examples. Let the m noise samples {z^{(1)}, z^{(2)}, …, z^{(m)}} be drawn from the noise prior p_g(z) and the m real examples {x^{(1)}, x^{(2)}, …, x^{(m)}} from the real data distribution p_data. Once D is updated, only G is trained, and its gradient is updated as shown in the following equation:

\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right)   (16.4)
Concerning the update of the generator network, m noise samples are input into G only once to generate m synthesized examples. G uses stochastic gradient descent to minimize the likelihood that D labels the synthesized samples correctly. The generator objective aims to minimize [log(1 − D(G(z)))], which boosts the likelihood that synthetic examples are classified as real examples. This process computes the gradients during backpropagation for both networks, but only updates the parameters of G. D is kept constant during the training of G to prevent the possibility that G might never converge.
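The alternating update scheme described above can be summarized in a short PyTorch-style sketch. The network definitions, optimizer settings, batch handling, and the use of the non-saturating generator loss are illustrative assumptions; this is not the exact configuration of any method reviewed in this chapter.

```python
import torch
import torch.nn as nn

def gan_training_step(G, D, real_batch, opt_g, opt_d, z_dim, k=1):
    """One alternating GAN update: k discriminator steps, then one generator step.

    D is assumed to output a single sigmoid probability per sample.
    """
    bce = nn.BCELoss()
    m = real_batch.size(0)
    ones, zeros = torch.ones(m, 1), torch.zeros(m, 1)

    # --- Discriminator: maximize log D(x) + log(1 - D(G(z))), with G frozen ---
    for _ in range(k):
        z = torch.randn(m, z_dim)
        fake = G(z).detach()                 # detach so no gradients flow into G
        d_loss = bce(D(real_batch), ones) + bce(D(fake), zeros)
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

    # --- Generator: non-saturating loss, maximize log D(G(z)), with D fixed ---
    z = torch.randn(m, z_dim)
    g_loss = bce(D(G(z)), ones)              # label fakes as "real" to fool D
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```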
16.2.2 The literature of video anomaly detection based on generative adversarial network
Here we review recent literature that uses GANs for anomaly detection in crowds. Three outstanding video anomaly detection works are described below, ordered by publication year.
16.2.2.1 Cross-channel generative adversarial networks
Starting with the work proposed by Ravanbakhsh et al. [20], this work applies conditional GANs (cGANs), in which the generator G and the discriminator D are both conditioned on the real data, and also relies on the idea of image-to-image translation [35]. Following the characteristics of cGANs, the input image x is fed to G to produce a generated image p that looks realistic. G attempts to deceive D into believing that p is real, while D attempts to tell the real data apart from p. This paper states that the U-Net structure [36] in the generative network
and a patch discriminator (Markovian discriminator) benefit the transformation of images between different representations (e.g., spatial to temporal representations). Thus, the authors adopt this concept to translate the appearance of a frame to the motion of the optical flow, targeting the learning of only the normal patterns. To detect an abnormality, they compare the generated image with the real image by using a simple pixel-by-pixel difference, along with a network pretrained on ImageNet [37]. The framework of anomaly detection in videos using cGANs during testing is shown in Fig. 16.3. More specifically, the authors train two networks: N^{F→O}, which uses frames F to generate optical flow O, and N^{O→F}, which uses optical flow O to generate frames F. Assume that F_t is the frame of the training video sequence with RGB channels at time t and O_t is the optical flow containing three channels (horizontal, vertical, and magnitude). O_t is obtained from two consecutive frames, F_t and F_{t+1}, following the computation in Ref. [38]. As both the generative and discriminative models are conditional networks, G generates its output from two inputs, an image x and a noise vector z, providing a synthetic output image p = G(x, z). In the case of N^{F→O}, x is assigned as the current frame, x = F_t; hence the generation target, the corresponding optical flow (i.e., the synthetic output image p), is represented as y = O_t. D takes two inputs, either (x, y) or (x, p), and yields the probability of the class to which the pair belongs. The loss functions include a reconstruction loss L_{L1} and a conditional GAN loss L_{cGAN}, as shown in Eqs. (16.5) and (16.6), respectively. For N^{F→O}, L_{L1} is determined with the training set X = {(F_t, O_t)} as

L_{L1}(x, y) = \| y - G(x, z) \|_1   (16.5)
whereas L_{cGAN} is assigned as

L_{cGAN}(D, G) = \mathbb{E}_{(x, y) \in X}[\log D(x, y)] + \mathbb{E}_{x \in \{F_t\},\, z \in Z}[\log(1 - D(x, G(x, z)))]   (16.6)
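A sketch of how these two terms are typically combined during training is given below (PyTorch-style). The weighting factor lambda_l1 (a pix2pix-style default), the use of a binary cross-entropy adversarial term on raw logits, and the assumption that D takes the conditioning frame and the optical flow as two separate arguments are illustrative choices; the reviewed methods may differ in these details.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # adversarial term; assumes D returns raw logits
l1 = nn.L1Loss()              # reconstruction term, cf. Eq. (16.5)

def generator_loss(G, D, frame, real_flow, z, lambda_l1=100.0):
    """cGAN generator objective: fool D on the (frame, generated flow) pair
    plus an L1 penalty toward the real optical flow."""
    fake_flow = G(frame, z)
    logits = D(frame, fake_flow)               # D is conditioned on the input frame
    adv = bce(logits, torch.ones_like(logits))
    return adv + lambda_l1 * l1(fake_flow, real_flow)

def discriminator_loss(G, D, frame, real_flow, z):
    """cGAN discriminator objective over real and generated pairs, cf. Eq. (16.6)."""
    fake_flow = G(frame, z).detach()           # do not backpropagate into G here
    real_logits = D(frame, real_flow)
    fake_logits = D(frame, fake_flow)
    return (bce(real_logits, torch.ones_like(real_logits))
            + bce(fake_logits, torch.zeros_like(fake_logits)))
```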
Fig. 16.3 A framework of video anomaly detection using conditional generative adversarial nets (cGANs) during testing in Ref. [20]. There are two generator networks: (i) producing a corresponding optical flow image from its input frames and (ii) reconstructing an appearance from a real optical flow image.
In contrast, the training set of N^{O→F} is X = {(O_t, F_t)}_{t=1}^{N}. Once the training is finished, the only model used during testing is G, consisting of the G^{F→O} and G^{O→F} networks. Neither network is able to reconstruct the abnormality, since they have been trained with only normal events. The abnormality can then be found by subtracting pixels to obtain the difference between O and p_O, ΔO = O − p_O, where p_O = G^{F→O}(F) is the optical flow reconstruction obtained from F. The other network, G^{O→F}(O), produces the appearance reconstruction p_F. However, ΔO provides more information than the difference between F and p_F. In this case, the authors added an additional network to compute a difference from a semantic perspective, ΔS, by using AlexNet [39] with its fifth convolutional layer h, defined as ΔS = h(F) − h(p_F). These two differences, ΔO and ΔS, are combined and normalized to [0, 1] to form an abnormality map. Finally, the final abnormality heatmap score is obtained by summing the normalized semantic difference map N_S and the normalized optical flow difference map N_O, A = N_S + λN_O, where λ = 2.
16.2.2.2 Future frame prediction based on generative adversarial network
Apart from the above work, there is an approach for video future frame prediction of abnormalities based on GAN, proposed by Liu et al. [21]. This work is motivated by the fact that anomaly detection approaches are mostly about minimizing reconstruction errors on the training data. Instead, the authors proposed unsupervised feature learning for video prediction and leveraged the difference between their predicted frame and the real data for anomaly detection. The framework of video future prediction for detecting anomalies is shown in Fig. 16.4. In the training stage, only normal events are learned, since they are considered predictable patterns, using both appearance and motion constraints. Then, during testing, all frames are input and compared with the predicted frame. If the input frame matches the predicted frame, it is a normal event. If it is not, then it
Fig. 16.4 Video future prediction framework for anomaly detection [21]. The U-Net structure and a pretrained FlowNet are used to predict a target frame and to obtain optical flow, respectively. Adversarial training is used to distinguish whether a predicted frame is real or fake.
becomes an anomalous event. Using a good predictor for training is the key in this work; thus, the U-Net network [36] is chosen due to its performance in translating images within the GAN model. In mathematical terms, consider a video sequence containing t frames I_1, I_2, …, I_t. In this work, the future frame is denoted I_{t+1}, while the predicted future frame is denoted \hat{I}_{t+1}. The goal is to make \hat{I}_{t+1} close to I_{t+1}, in order to determine whether \hat{I}_{t+1} corresponds to an abnormal or a normal event, by minimizing their distance in terms of intensity and gradient. In addition, optical flow is used to represent the temporal features between the frames I_{t+1} and I_t, and between \hat{I}_{t+1} and I_t. We first take a look at the generator objective function L_G, consisting of appearance (intensity L_{int} and gradient L_{gd}), motion L_{op}, and adversarial training L^G_{adv} terms, in Eq. (16.7):

L_G = \lambda_{int} L_{int}(\hat{I}_{t+1}, I_{t+1}) + \lambda_{gd} L_{gd}(\hat{I}_{t+1}, I_{t+1}) + \lambda_{op} L_{op}(\hat{I}_{t+1}, I_{t+1}, I_t) + \lambda_{adv} L^G_{adv}(\hat{I}_{t+1})   (16.7)

The discriminator objective function L_D is defined in Eq. (16.8):

L_D = L^D_{adv}(\hat{I}_{t+1}, I_{t+1})   (16.8)

The authors followed the work in Ref. [40] in using intensity and gradient differences. Specifically, the intensity and gradient penalties ensure the similarity of all pixels and the sharpness of the generated images, respectively. Let \hat{I} denote \hat{I}_{t+1} and I denote I_{t+1}. The \ell_2 distance between \hat{I} and I is minimized in intensity to guarantee similarity in the RGB space, as shown in the following equation:

L_{int}(\hat{I}, I) = \| \hat{I} - I \|_2^2   (16.9)

Then the gradient loss is defined as follows [40] in Eq. (16.10):

L_{gd}(\hat{I}, I) = \sum_{i,j} \left\| \,|\hat{I}_{i,j} - \hat{I}_{i-1,j}| - |I_{i,j} - I_{i-1,j}|\, \right\|_1 + \left\| \,|\hat{I}_{i,j} - \hat{I}_{i,j-1}| - |I_{i,j} - I_{i,j-1}|\, \right\|_1   (16.10)

where i and j are the spatial indices of the frame. Then optical flow is estimated using a pretrained network, FlowNet [41], denoted as f. The temporal loss is defined in the following equation:

L_{op}(\hat{I}_{t+1}, I_{t+1}, I_t) = \| f(\hat{I}_{t+1}, I_t) - f(I_{t+1}, I_t) \|_1   (16.11)

In the adversarial setting, the training is an alternating update. The U-Net is used as the generator, while a patch discriminator is used as the discriminator, following Ref. [35]. To train the discriminator D, a label of 0 is assigned to a fake image and a label of 1 to a real image. The goal of D is to classify the real future frame I_{t+1} into class 1 and the predicted future frame \hat{I}_{t+1} into class 0. During training the discriminator
D, the weight of G is fixed by using the mean square error (MSE) loss function, denoted L_{MSE}. Hence, the L_{MSE}-based adversarial loss of D can be defined in the following equation:

L^D_{adv}(\hat{I}, I) = \sum_{i,j} L_{MSE}\left(D(I)_{i,j}, 1\right)/2 + \sum_{i,j} L_{MSE}\left(D(\hat{I})_{i,j}, 0\right)/2   (16.12)

where i and j are the patch indices. The MSE loss function L_{MSE} is defined in the following equation:

L_{MSE}(\hat{Y}, Y) = (\hat{Y} - Y)^2   (16.13)

where the value of Y is in [0, 1] and \hat{Y} \in [0, 1]. In contrast, the objective of the generator G is to reconstruct images that fool D into labeling them as 1. The weight of D is fixed while training G. Thus, the L_{MSE}-based adversarial loss of G is defined as shown in the following equation:

L^G_{adv}(\hat{I}) = \sum_{i,j} L_{MSE}\left(D(\hat{I})_{i,j}, 1\right)/2   (16.14)

To conclude, the appearance, motion, and adversarial training terms together ensure that normal events are well predicted. Events with a great difference between the prediction and the real data are classified as abnormalities.
16.2.2.3 Cross-channel adversarial discriminators
Following Ref. [20], the same authors proposed another GAN-based approach [22] for abnormality detection in crowd behavior. The training procedure is the same as in Ref. [20]: only the frames of normal events are trained with the cross-channel networks based on conditional GANs, engaging G to translate the raw pixel image to the optical flow, inspired by Isola et al. [35]. This paper takes advantage of the U-Net framework [36] for translating one image into another and takes multichannel data, i.e., spatial and temporal representations, into account, similarly to Refs. [20] and [21]. The novel part is in the testing, where the authors propose an end-to-end framework without additional classifiers, using the learned discriminator as the classifier of abnormalities. The framework of the cross-channel adversarial discriminators is shown in Fig. 16.5. For a brief explanation, G and D are simultaneously trained only on the frames of normalities. G generates the synthetic image from the learned normal events, while D learns how to differentiate whether its input corresponds to normal events or not based on the data distribution, defining the abnormal events as outliers in this sense. During testing, only D is used, directly classifying anomalies in the scene. In such a way, there is no need to reconstruct images at testing time, unlike the common GAN-based models [20, 21] that use G in testing.
Fig. 16.5 Cross-channel adversarial discriminators flow diagram with additional detail on parameters, following [22]. Two generator networks are used during training: (i) generating a corresponding optical flow image and (ii) reconstructing an appearance. At testing time, only the discriminative networks are used, represented as a learned decision boundary to detect anomalies.
Specifically, there are two networks used for training: N^{F→O} and N^{O→F}. Suppose that F_t is a frame (at time t) of a video sequence and O_t is the optical flow acquired from two consecutive frames, F_t and F_{t+1}, following the optical flow computation based on the theory of warping [38]. In this work, G and D are both conditional networks. G takes an image x and a noise vector z and outputs a synthetic optical flow r = G(x, z). For N^{F→O}, let x be the current frame, x = F_t. Then the generation target, its corresponding optical flow r, can be represented as y = O_t. Conversely, D takes two inputs, either (x, y) or (x, r), to obtain the probability of the class to which the pair belongs. The reconstruction loss L_{L1} and the conditional GAN loss L_{cGAN} can be obtained as follows. In the case of N^{F→O}, L_{L1} is determined with the training set X = {(F_t, O_t)} as shown in the following equation:

L_{L1}(x, y) = \| y - G(x, z) \|_1   (16.15)
whereas L_{cGAN} is represented in the following equation:

L_{cGAN}(D, G) = \mathbb{E}_{(x, y) \in X}[\log D(x, y)] + \mathbb{E}_{x \in \{F_t\},\, z \in Z}[\log(1 - D(x, G(x, z)))]   (16.16)

In contrast, the training set of N^{O→F} is X = {(O_t, F_t)}_{t=1}^{N}. Note that all training procedures are the same as in Ref. [20]. G performs as implicit supervision for D. Both the G^{F→O} and G^{O→F} networks lack the ability to reconstruct abnormal events because they observe only normal events during training, while D^{F→O} and D^{O→F} have learned the patterns needed to distinguish real data from artifacts.
The discriminator is considered as the learned decision boundary that splits the densest area (i.e., the normal events x3) from the rest (i.e., abnormal events x1 and generated images x2). Since the goal is to detect abnormal events x1, the samples lying outside the decision boundary are judged as outliers by D. During testing, the authors focus only on the discriminative networks for the two-channel transformation tasks. The patch-based discriminators D^{F→O} and D^{O→F} are applied to the test frame F and its corresponding optical flow O with the same 30 × 30 grid, resulting in two 30 × 30 score maps, represented as S_O for D^{F→O} and S_F for D^{O→F}. In detail, a patch p_F on F and a patch p_O on O are input to D^{F→O}. Any abnormal event occurring in these patches (p_F and/or p_O) is considered an outlier according to the distribution learned by D^{F→O}, resulting in a low probability score D^{F→O}(p_F, p_O). To finalize the anomaly maps, the normalized channel score maps are fused with equal weights, S = S_O + S_F, in the range [0, 1], and a range of thresholds is then applied to compute the ROC curves. We note some interesting points in this work: (i) the authors state that D^{F→O} provides higher performance than D^{O→F}, since the input of D^{F→O} is the real frame, which contains more information than the optical flow frame; and (ii) their proposed end-to-end framework is simpler and also faster at testing time than Ref. [20], since it requires neither the generative models during testing nor any additional classifiers on top of the model, such as a pretrained AlexNet [39]. The observations from these early works inspire our proposed method [26], which we discuss in Section 16.3.2.
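The fusion of the two discriminator score maps and the threshold sweep used to build the ROC curves can be sketched as follows. Only the equal-weight fusion S = S_O + S_F follows the description above; the min-max normalization, the use of 1 − S as the anomaly score (assuming higher discriminator scores mean "more normal"), and the use of scikit-learn's roc_curve are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def normalize(score_map):
    """Min-max normalize a score map to [0, 1]."""
    return (score_map - score_map.min()) / (score_map.max() - score_map.min() + 1e-8)

def fuse_score_maps(s_o, s_f):
    """Fuse the two channel score maps with equal weights: S = S_O + S_F, rescaled to [0, 1]."""
    return normalize(normalize(s_o) + normalize(s_f))

def roc_from_scores(fused_scores, ground_truth):
    """ROC curve and AUC from fused discriminator scores and binary anomaly labels.

    The discriminator scores are assumed to measure "normality", so 1 - S is
    used as the anomaly score before sweeping the thresholds.
    """
    fpr, tpr, _ = roc_curve(ground_truth.ravel(), (1.0 - fused_scores).ravel())
    return fpr, tpr, auc(fpr, tpr)
```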
16.3 Training a generative adversarial network
16.3.1 Using generative adversarial network based on the image-to-image translation
Anomalous object observation using an unsupervised learning approach can be considered a structured reconstruction problem, known as per-pixel classification or regression. The common framework used to solve this type of problem, and explored by the state-of-the-art works on the anomaly detection task [20–22], is the generative image-to-image translation network constructed by Isola et al. [35]. In general, these works use this network to learn an optimal mapping from the input image to the output image based on the GAN objective function. The experimental results of Ref. [35] show that the network is good at synthesis tasks such as colorization, reconstructing objects from edges, and generating images from label maps. From an overall perspective, based on the original GANs [27], the input of G consists of the image x and the noise vector z; the mapping of the original GANs is represented as G: z → y (learning from z to the output image y). In contrast, conditional GANs learn from two inputs, x and z, to y, represented as G: {x, z} → y. However, z is not strictly necessary for the network, as G can still learn the mapping without z, especially in early training, where G learns to ignore z. Thus, the authors decided to use z in
389
390
Generative adversarial networks for image-to-Image translation
the form of dropout in both the training and testing process. Considering the objective function, since the architecture of image-to-image translation network is based on the conditional GANs, its objective functions are indicated the same as we explained in Refs. [20] (see Eqs. 16.5 and 16.6) and [22] (see Eqs. 16.15 and 16.16) where two inputs are required for the discriminator D(x,y) and L1 loss is used to help output sharper image. On the other hand, the future frame prediction [21] applied unconditional GANs that uses only one input for the discriminator D(y) (see Eq. 16.12). It relied on the traditional L2 regression (MSE loss function) to condition the output and the input. This forcing condition results in lower performance (i.e., blurry images) on the frame-level anomaly detection compared to Refs. [20] and [22].
16.3.2 Unsupervised learning of generative adversarial network for video anomaly detection

In this section, we introduce our proposed method, named DSTN [26]. We take advantage of the image-to-image translation architecture with the U-Net network [36] to translate the spatial domain to the temporal domain. In this way, we can obtain comprehensive information on the objects from both appearance and motion (optical flow). The proposed DSTN differs from the previous works [20–22] since we focus on only one deep spatiotemporal translation network to enhance the anomaly detection performance at the frame level and the challenging anomaly localization at the pixel level with regard to accuracy and computational time. Specifically, we include preprocessing and postprocessing stages to assist the learning of the GAN without using any pretrained network to help in the classification, making the DSTN faster and more flexible. Besides, we differ from Ref. [35] in that our target output is the motion information of the object corresponding to its appearance, not realistic images. There are two main procedures for each of training and testing. For training, a feature collection and a spatiotemporal translation play essential roles in sufficiently collecting information and effectively learning the model, respectively. Then a differentiation and an edge wrapping are utilized at testing time. We explain the main components of our proposed method in detail, including the system overview for both training and testing, as follows.

16.3.2.1 System overview
We first start with the system overview of our DSTN. The DSTN is based on GAN and is augmented with preprocessing and postprocessing procedures to improve the performance in learning normalities and localizing anomalies. Overall, the main components of the DSTN are fourfold: a feature collection, a spatiotemporal translation, a differentiation, and an edge wrapping. The feature collection is a key initial process for extracting the appearances of objects. These features are fed into the model to learn the normal patterns. In our case, the generator G is used at both training and testing time, while the discriminator D is used only at training time. During training, G learns the normal patterns from the training videos. Hence it understands, and has knowledge of, only what normal patterns look like. The reason we feed only frames of normal patterns is that we need the model to be flexible and able to handle all possible anomalous events in real-world environments without labels of anomalies. In testing, all videos, including normal and abnormal events, are input into the model, where G tries to reconstruct the appearance and the motion representation from the learned normal events. Since G has not learned any abnormal samples, it is unable to reconstruct the abnormal areas properly. We then exploit this inability to correctly reconstruct anomalous events in order to detect the anomalies in the scene. The anomalies can be exposed by subtracting pixels in the local area between the synthesized image and the real image and then applying edge wrapping at the final stage to achieve precise edges of the abnormal objects.

Specifically, during training, only normal events of original frames f are input, together with background removal frames f_BR, into the generative network G, which contains an encoder En and a decoder De, to generate dense optical flow frames OF_gen representing the motion of the normal objects. To attain good optical flow, the real DIS optical flow frame OF_dis and f_BR are fused to eliminate the noise that frequently occurs in OF_dis, giving fused dense optical flow frames OF_fus. The patches of f and f_BR are concatenated and fed into G to produce the patches of OF_gen, while D has two alternating inputs, the patches of OF_fus (real optical flow image) and OF_gen (synthetic optical flow image), and tries to discriminate whether OF_gen is fake or real. The training framework of DSTN is shown in Fig. 16.6.

Fig. 16.6 A training framework of proposed DSTN.

After training, the DSTN model understands a mapping from the appearance representation of normal events to its corresponding dense optical flow (motion representation). All parameters used in training are also used in testing. During testing, the unknown events from the testing videos are reconstructed by G. However, the reconstruction of G yields unstructured blobs based on its knowledge of the learned normal patterns. Thus, these unstructured blobs are considered anomalies. To capture the anomalies, the differentiation is computed by subtracting the patches of OF_gen from the patches of OF_fus. Note that not only anomaly detection but also anomaly localization is essential for real-world use. Therefore, edge wrapping (EW) is proposed to obtain the final output by retaining only the actual edges corresponding to the real abnormal objects and suppressing the rest. The DSTN framework at testing time is shown in Fig. 16.7.

Fig. 16.7 A testing framework of proposed DSTN [26].

16.3.2.2 Feature collection
We explain our proposed DSTN based on its training and testing time. During training, even though a GAN is good at data augmentation and image generation on small datasets, it still requires sufficient features (e.g., appearance and motion features of objects) from data examples to feed the data-hungry characteristics of the deep learning-based model. The importance of feature extraction is therefore recognized and addressed in the preprocessing procedure before learning the model. There are several procedures in the feature collection, including (i) background removal, (ii) fusion, (iii) patch extraction, and (iv) concatenation of spatiotemporal features, as described below.

In (i) background removal, we take only the moving foreground objects into account because we focus on the real situation from CCTV cameras. Thus the static background is ignored. This method helps extract the object features and remove irrelevant background pixels so that only the important appearance information is kept. Let f_t be the current frame of the video at time t and f_{t−1} be the previous frame. The background removal f_BR is computed using the frame absolute difference as shown in the following equation:

f_BR = |f_t − f_{t−1}|   (16.17)
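For illustration, this background removal step can be sketched in Python with OpenCV and NumPy as follows. This is an illustrative sketch, not the authors' implementation; the binarization threshold value is our assumption.

```python
import cv2

def background_removal(frame_t, frame_t_minus_1, binarize_threshold=25):
    """Frame absolute difference (Eq. 16.17) followed by binarization."""
    # Work in grayscale so the difference reflects intensity changes only.
    gray_t = cv2.cvtColor(frame_t, cv2.COLOR_BGR2GRAY)
    gray_prev = cv2.cvtColor(frame_t_minus_1, cv2.COLOR_BGR2GRAY)

    # f_BR = |f_t - f_{t-1}|
    f_br = cv2.absdiff(gray_t, gray_prev)

    # Binarize so that only moving foreground pixels survive (values 0 or 255).
    # The threshold of 25 is an assumption for the sketch, not a value from the chapter.
    _, f_br_bin = cv2.threshold(f_br, binarize_threshold, 255, cv2.THRESH_BINARY)
    return f_br_bin
```

The binarized map is later concatenated with the original frame f to form the appearance input of the generator, as described below.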
After computing the frame absolute difference, the background removal output is binarized and then concatenated with the original frame f to acquire more information on the appearance, assisting the learning of the generator. Simply put, more input features mean better generator performance. The significance of concatenating the f_BR and f frames is that it delivers extra features on the appearance of the foreground objects in f_BR, while f contains all of the information that f_BR may lose during the subtraction process.

For (ii) fusion, according to the literature on GANs for video anomaly detection [20–22], all previous works apply the theory-for-warping optical flow [38] to represent the temporal features. However, it has problems capturing all information on the objects and also has high time complexity. Since the development of video anomaly detection requires reliable performance in terms of both accuracy and running time, the theory-for-warping approach is not suitable for this task. To achieve the best performance on motion representation, we use dense inverse search (DIS) [42] to represent the motion features of foreground objects in surveillance videos, owing to its high accuracy and low time complexity in detecting and tracking objects. The DIS optical flow OF_dis is obtained from two consecutive frames f_t and f_{t−1}, as shown in Fig. 16.8, where the resolution of the f_t and f_{t−1} frames is 238 × 158 pixels, following the UCSD Ped1 dataset [4]. The number of channels c_p for the f_t and f_{t−1} frames is 1 (c_p = 1), while c_p of OF_dis is 3 (c_p = 3).

Fig. 16.8 Dense inverse search optical flow framework.

However, OF_dis contains noise dispersed in the background as well as around the objects, as shown in Fig. 16.9. Thus, we propose a novel fusion between OF_dis and f_BR that uses the clean foreground objects from f_BR together with OF_dis to acquire both appearance and motion information and to assist in reducing the noise in OF_dis. The fusion provides a clean background and explicit foreground objects. Fig. 16.9 clearly shows that the fusion effectively removes noise from OF_dis. Specifically, the noise reduction is implemented efficiently by observing where f_BR equals 0 or 255 and then masking OF_dis with f_BR to change its values. Let ζ be a constant. The new output, the fusion OF_fus, is defined in Eq. (16.18):

OF_fus = OF_dis · ⌊f_BR / (f_BR + ζ)⌋   (16.18)

Fig. 16.9 A fusion between background subtraction and real optical flow.
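A hedged sketch of the DIS optical flow computation and the fusion step is given below, using OpenCV's DIS optical flow implementation. The function names are ours, the color-coding of the flow is one common convention, and we read the bracketed term of Eq. (16.18) as a binary foreground mask derived from f_BR; none of this is taken verbatim from the authors' code.

```python
import cv2
import numpy as np

def dense_optical_flow_dis(frame_t_minus_1, frame_t):
    """Dense inverse search (DIS) optical flow between two consecutive frames."""
    prev_gray = cv2.cvtColor(frame_t_minus_1, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(frame_t, cv2.COLOR_BGR2GRAY)
    dis = cv2.DISOpticalFlow_create(cv2.DISOPTICAL_FLOW_PRESET_MEDIUM)
    flow = dis.calc(prev_gray, curr_gray, None)                 # H x W x 2 (dx, dy)

    # Encode the flow as a 3-channel color image (hue = direction, value = magnitude),
    # matching the c_p = 3 representation of OF_dis described in the text.
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*prev_gray.shape, 3), dtype=np.uint8)
    hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)                 # OF_dis

def fuse(of_dis, f_br):
    """Eq. (16.18): suppress OF_dis outside the foreground given by f_BR.
    Assumption: the bracketed term acts as a binary mask (1 where f_BR > 0, else 0)."""
    mask = (f_br > 0).astype(np.uint8)
    return of_dis * mask[..., None]                             # OF_fus
```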
Apart from (i) background removal and (ii) fusion, (iii) patch extraction plays a part in the feature collection process and supports acquiring more spatial and temporal features of the moving foreground objects in local pixels. By doing so, it captures better information than directly extracting features from the full image. Patch extraction uses the full appearance of the moving foreground object at the current frame f along with its direction, motion, and magnitude from the frame-by-frame dense optical flow image. We normalize all patch elements to the range [−1, 1]. The patch size is defined as (w/a) × h × c_p, where w is the frame width, h is the frame height, a is a scale value, and c_p is the number of channels. A sliding-window method with stride d is applied to the input frames of the generator G (i.e., f and f_BR) and the discriminator D (i.e., OF_fus). Fig. 16.10 shows examples of patch extraction.

Fig. 16.10 Examples of patch extraction on a spatial frame.

We extract the patches with the scale value a = 4 and stride d = w/a to obtain more local information from the spatial and temporal representations. Then, the extracted patch image is scaled up to a 256 × 256 full image to gain more appearance information from the semantic content and is input into the model for further processing.

The final process of the feature collection is (iv) the concatenation of spatiotemporal features for data preparation; a minimal sketch of this step is given after Fig. 16.11. We input the appearance information to the generative model to output the motion information at both training and testing time, as shown in Fig. 16.11. Since providing sufficient feature inputs to G is significantly important for producing good corresponding optical flow images, the patches of f and f_BR are concatenated to cover all possible low-level appearance information of the normal patterns, so that G can understand and learn them extensively. More specifically, f_BR provides precise foreground object contours, whereas f provides inclusive knowledge of the whole scene. The sizes of the input and target images are fixed to the 256 × 256 full image as the default in our proposed framework. The concatenated frames have two channels (c_p = 2) and the temporal target output has three channels (c_p = 3).
Fig. 16.11 Overview of our data preparation showing spatiotemporal input features (concatenated patches) and output feature (generated dense optical flow patch).
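The sketch below illustrates the patch extraction and channel-wise concatenation just described, following the stated geometry (scale a = 4, stride d = w/a, patches resized to 256 × 256, values normalized to [−1, 1]); the helper names are our own.

```python
import cv2
import numpy as np

def extract_patches(image, a=4, out_size=256):
    """Slide a (w/a) x h window with stride d = w/a and resize each patch to out_size."""
    h, w = image.shape[:2]
    patch_w = w // a
    patches = []
    for x in range(0, w - patch_w + 1, patch_w):          # stride d = w/a
        patch = image[:, x:x + patch_w]
        patches.append(cv2.resize(patch, (out_size, out_size)))
    return patches

def concat_spatial_input(f_patch, f_br_patch):
    """Concatenate a grayscale frame patch and its background-removal patch
    into a 2-channel input (c_p = 2), normalized to [-1, 1]."""
    f_norm = f_patch.astype(np.float32) / 127.5 - 1.0
    br_norm = f_br_patch.astype(np.float32) / 127.5 - 1.0
    return np.stack([f_norm, br_norm], axis=-1)           # shape (256, 256, 2)
```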
As a final point, this concatenation process underpins the ability of the spatiotemporal translation model to learn its desired temporal target output.

16.3.2.3 Spatiotemporal translation model
In this section, we present our training structure, a GAN-based U-Net architecture [36] for translating the spatial inputs (f and f_BR) to the temporal output (OF_gen), and describe how the interplay between the generative and discriminative networks works during training. The details of the proposed spatiotemporal translation model are explained as follows.

The generative network G performs an image transformation from the concatenated f and f_BR appearances to the OF_gen motion representation. Generally, there are two inputs to G, an image x and noise z, used to generate an output image e with the same size as x but a different number of channels, e = G(x, z) [27, 43, 44]. However, the additional Gaussian noise z is not prominent for G in our case, since G can learn to ignore z in the early stage of training. In addition, z is not very effective for transforming the spatial representation of the input into the temporal representation. Therefore, dropout [35] is applied in the decoder together with batch normalization [45] instead of z, resulting in e = G(x). On closer inspection, the full generator network, consisting of the encoder En and decoder De architectures, is constructed with skip connections or residual connections [35], as shown in Fig. 16.12. The idea of skip connections is to link the layers of the encoder straight to the decoder, making the network easier to optimize and providing greater quality and less complexity for image translation than traditional CNN architectures, e.g., AlexNet [39] and VGG nets [46, 47].
Fig. 16.12 Generator architecture consisting of an encoder and a decoder with skip connections [26].
More specifically, let t be the total number of layers in the generative network. The skip connections are introduced between each layer i of En and layer t − i of De. Data can be transferred from the first layers to the final layers by integrating the channels of layer i with those of layer t − i. The architectures of En and De are illustrated in Fig. 16.13. En compresses the spatial representation of the data into a higher-level representation, while De performs the reverse process to generate OF_gen. En uses the Leaky-ReLU (L-ReLU) activation function. Conversely, De uses the ReLU activation function, which helps accelerate the learning of the model toward saturated color distributions [36]. To achieve an accurate OF_gen, the objective function is defined and optimized with the Adam optimization algorithm [48] during training.

The encoder module acts as a data compression from a high-dimensional space into a low-dimensional latent space representation that is passed to the decoder module. The first layer in the encoder is a convolution, using CNNs as a learnable feature extractor instead of handcrafted features, which are less able to capture obscure data structures than the deep learning approach. The convolution is a linear operation implemented by sliding a k × k window w over an n × n input image I. The output of the convolution on cell c of the image I is defined in the following equation:

y_c = Σ_{i=1}^{k×k} (w_i · I_{i,c}) + b_c   (16.19)
where y_c is the output after the convolution and b_c is the bias. Let p be the padding and s be the stride. The output size O of the convolution is calculated as shown in the following equation:

O = (n − k + 2p)/s + 1   (16.20)

Fig. 16.13 Encoder and decoder architectures [26].
Fig. 16.14 An example of a convolution operation on an image cell.
The convolution operation on an image cell with b = 0, n = 8, k = 3, p = 0, and s = 1 is illustrated in Fig. 16.14 for better understanding. Once the convolution operation is completed, batch normalization is applied by normalizing the convolution output following the normal distribution in Ref. [45] to reduce the training time and avoid the vanishing gradient problem. Suppose y denotes the convolution output values over a mini-batch B = {y_1, y_2, …, y_m}, γ and β are learnable parameters, and ε is a constant to avoid zero variance. The normalized output S is obtained by scaling and shifting, as defined in the following equation:

S_i = γ·ŷ_i + β ≡ BN_{γ,β}(y_i)   (16.21)

with
• Normalize: ŷ_i = (y_i − μ_B) / √(σ_B² + ε)
• Mini-batch mean: μ_B = (1/m) Σ_{i=1}^{m} y_i
• Mini-batch variance: σ_B² = (1/m) Σ_{i=1}^{m} (y_i − μ_B)²
The final layer of En applies an activation function to introduce a nonlinear mapping from the input to the output (response variable). The nonlinear mapping transforms values from one scale to another and decides whether a neuron's signal passes through. This nonlinearity makes the network more expressive, resulting in a stronger model for learning complex input data. The L-ReLU is used in the proposed DSTN to avoid the vanishing gradient problem. Its property is to let negative values pass through the neuron by mapping them to small negative response values, which improves the flow of gradients through the model. The L-ReLU function is defined in Eq. (16.22):

f(s) = s, if s ≥ 0;  f(s) = a·s, otherwise   (16.22)

where s is the input, a is a coefficient (a = 0.2) allowing negative values to pass through the neuron, and f(s) is the response variable. Fig. 16.15 shows the graph mapping the input data s to the response variable.
Fig. 16.15 Leaky-ReLu activation function.
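Putting the three encoder ingredients together (strided convolution, batch normalization, and L-ReLU with a = 0.2), one encoder stage can be sketched in Keras as follows. This is only an illustration; the actual layer widths and the full channel progression used by DSTN are given in Section 16.4.4.

```python
from tensorflow.keras import layers

def encoder_block(x, filters, use_batchnorm=True):
    """One En stage: 2x downsampling convolution + batch normalization + Leaky-ReLU."""
    x = layers.Conv2D(filters, kernel_size=3, strides=2, padding="same",
                      use_bias=not use_batchnorm)(x)
    if use_batchnorm:
        x = layers.BatchNormalization()(x)      # Eq. (16.21)
    return layers.LeakyReLU(alpha=0.2)(x)       # Eq. (16.22) with a = 0.2
```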
Regarding the decoder module De, it is the inverse of the encoding part, in which residual connections pass from the encoder to the corresponding layers of the decoder. Dropout is used in De to represent the noise vector z: it removes neuron connections with a default probability, helping to prevent overfitting during training and improving the performance of the GAN. Let h ∈ {1, 2, …, H} index the hidden layers of the network, z_i^(h+1) be the output of layer h + 1, and r^(h) be a random variable following a Bernoulli distribution with probability p [49]. The feed-forward operation can be described in the following equation:

r^(h) ∼ Bernoulli(p),
f̃(s)^(h) = r^(h) ∗ f(s)^(h),
z_i^(h+1) = w_i^(h+1) · f̃(s)^(h) + b_i^(h+1)   (16.23)
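A matching decoder stage, with the dropout that stands in for the noise z (Eq. 16.23) and the concatenation with the corresponding encoder layer, could be sketched as below; the dropout rate p = 0.5 follows Section 16.4.4, while the rest of the layer choices are illustrative.

```python
from tensorflow.keras import layers

def decoder_block(x, skip, filters, dropout_rate=0.5):
    """One De stage: 2x upsampling transposed convolution + batch norm + dropout + ReLU,
    followed by concatenation with the skip connection from the encoder."""
    x = layers.Conv2DTranspose(filters, kernel_size=3, strides=2, padding="same",
                               use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(dropout_rate)(x)       # dropout plays the role of z (Eq. 16.23)
    x = layers.ReLU()(x)
    return layers.Concatenate()([x, skip])    # link layer i of En with layer t - i of De
```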
Apart from the generator, we shall discuss the discriminative network D. D distinguishes the real patch OF_fus (y = OF_fus) from the synthetic patch OF_gen (e = OF_gen). As a result, D delivers a scalar output giving the probability that its input comes from the real data. In the discriminative architecture, a PatchGAN is constructed and applied to each partial image to help accelerate GAN training, resulting in better performance than a full-image discriminator operating at a resolution of 256 × 256 pixels. The discriminator D is implemented by subsampling the 256 × 256 OF_fus image into 64 × 64 pixel patches, providing 16 patches of OF_fus that pass through the PatchGAN model to classify whether OF_gen is real or fake, as shown in Fig. 16.16. The reason we use the 64 × 64 PatchGAN is that it provides good pixel accuracy and good intensity on the appearance, making the synthetic image more recognizable. Experimental results on the impact of using the 64 × 64 PatchGAN can be found in Ref. [26].
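As a rough illustration of such a patch-based discriminator, a 64 × 64 PatchGAN head in Keras might look like the sketch below. The filter widths and number of downsampling stages here are assumptions; the exact DSTN discriminator layout (64 → 32 → 16 → 8 → 4 → 2 → 1, flattened to 512 units with FC and Softmax layers) is described in Section 16.4.4.

```python
from tensorflow.keras import layers, Model

def build_patch_discriminator(patch_size=64, in_channels=3):
    """A small PatchGAN-style discriminator operating on 64 x 64 optical-flow patches."""
    inp = layers.Input(shape=(patch_size, patch_size, in_channels))
    x = inp
    for filters in (64, 128, 256, 512):                   # illustrative widths
        x = layers.Conv2D(filters, 3, strides=2, padding="same")(x)
        x = layers.LeakyReLU(alpha=0.2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512)(x)                              # fully connected layer
    out = layers.Dense(2, activation="softmax")(x)        # real vs. fake
    return Model(inp, out)
```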
Fig. 16.16 Discriminator architecture with PatchGAN model [26].
To define the objective function and optimization, we first discuss the two objective functions used during training: a GAN loss, L_GAN, and an L1 loss (or generator loss), L_L1. Note that our proposed DSTN comprises only one translation network from the spatial (appearance) to the temporal (motion) image representation. The motion representation is computed from the dense optical flow using arrays of horizontal and vertical components with the magnitude. Let y be the output image OF_fus, x be the input image for G (the concatenated f and f_BR image), and z be the additional Gaussian noise vector. Since dropout takes the role of z, G can be written as G(x). The objective functions are denoted in Eqs. (16.24) and (16.25):

• GAN loss:
L_GAN(G, D) = E_y[log D(y)] + E_x[log(1 − D(G(x)))]   (16.24)

• L1 loss:
L_L1(G) = E_{x,y}[‖y − G(x)‖_1]   (16.25)

Then, the optimization of G can be defined as in the following equation:

G* = arg min_G max_D L_GAN(G, D) + λ·L_L1(G)   (16.26)
The advantage of using one spatiotemporal translation network is that it has less complexity while providing sufficient important features of objects for the learning of GAN.
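The two losses in Eqs. (16.24) and (16.25) and the combined objective in Eq. (16.26) map naturally onto the following TensorFlow sketch. It assumes the discriminator outputs raw logits, uses the non-saturating generator loss commonly substituted in practice for log(1 − D(G(x))), and sets λ = 100 as an assumption (the pix2pix default); the chapter does not state the value of λ.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(d_real, d_fake):
    """L_GAN from D's perspective: real patches -> 1, generated patches -> 0."""
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

def generator_loss(d_fake, of_fus, of_gen, lam=100.0):
    """Adversarial term plus lambda-weighted L1 term (Eqs. 16.24-16.26)."""
    adv = bce(tf.ones_like(d_fake), d_fake)          # non-saturating stand-in for log(1 - D(G(x)))
    l1 = tf.reduce_mean(tf.abs(of_fus - of_gen))     # ||y - G(x)||_1
    return adv + lam * l1
```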
16.3.3 Anomaly detection

After training, the spatiotemporal translation network has learned the transformation from the concatenated f and f_BR appearance to the OF_fus motion representation. All training parameters are reused in testing. To detect anomalies, we input two consecutive frames (f_t and f_{t−1}) from the test videos to the model. During testing, G reconstructs OF_gen following its trained knowledge. However, since G has been trained with only the normal patterns, it is unable to regenerate unknown events as well as normal ones. We exploit this inability of the generator to reconstruct abnormal events correctly in order to detect all possible anomalies that occur in the scene. The anomalies are exposed by subtracting the patches of OF_fus and OF_gen to locate the difference in local pixels. To make object localization more accurate, edge wrapping is proposed to highlight the actual local pixels of the anomalies.

For anomaly detection, differentiation is a simple and effective method to obtain abnormalities. The pixels of a patch of OF_fus (real image) and a patch of OF_gen (fake image) are subtracted to determine whether there are anomalous events in the scene. This differentiation is defined in the following equation:

ΔOF = OF_fus − OF_gen > 0   (16.27)

where ΔOF is the differentiation output, of which only values greater than 0 are kept (ΔOF > 0). The reason ΔOF can successfully indicate the abnormal events in the scene is that the differentiation between OF_fus and OF_gen yields a large difference in the anomalous areas, where G is unable to reconstruct the abnormal events in OF_gen to match the abnormal events in OF_fus (the real abnormal object from the testing video sequence). In other words, G tries to reconstruct OF_gen to be the same as OF_fus, but it can only reconstruct unstructured blobs based on its knowledge of the learned normal events, making the abnormal regions of OF_gen differ from OF_fus. ΔOF provides a score indicating the probability of each pixel belonging to a normal or abnormal event. The range of pixel values for each ΔOF from the test videos is between 0 and 1, where the highest pixel value is considered an anomaly. To normalize the probability scores of ΔOF, the maximum value M_OF of all components is computed over the range of pixel values for each test video. From this process, we can gradually vary the threshold on the probability scores of anomalies to define the best decision boundary for obtaining ROC curves. Suppose the position of a pixel in the image is (i, j). The normalization of ΔOF, denoted N_OF, is given in the following equation:

N_OF(i, j) = (1 / M_OF) · ΔOF(i, j)   (16.28)
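A compact sketch of the differentiation and normalization steps (Eqs. 16.27 and 16.28) is shown below; `of_fus` and `of_gen` stand for the real and generated optical-flow patches, and the channel averaging is our simplification.

```python
import numpy as np

def anomaly_map(of_fus, of_gen):
    """Eq. (16.27): per-pixel difference between real and generated optical flow."""
    delta = of_fus.astype(np.float32) - of_gen.astype(np.float32)
    delta = np.maximum(delta, 0.0)            # keep only positive differences (Delta_OF > 0)
    if delta.ndim == 3:
        delta = delta.mean(axis=-1)           # collapse channels to one score per pixel
    return delta

def normalize_map(delta):
    """Eq. (16.28): divide by the maximum value M_OF so scores fall in [0, 1]."""
    m_of = delta.max()
    return delta / m_of if m_of > 0 else delta
```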
However, even though we obtain a good normalized differentiation N_OF showing anomalies in the scene, some problems remain in the experimental results, such as false positive detections on normal events (i.e., a normal event detected as an abnormal event) and overdetection of the pixels around the actual abnormal object (i.e., the area of the detected abnormal object is too large). This is because object localization on its own is not effective enough. Therefore, we propose the edge wrapping (EW) method to overcome these problems and specifically enhance the pixel-level anomaly localization performance. Our EW is performed by using [50] to preserve only the edges of the actual abnormal object and suppress the rest (e.g., noise and insignificant edges that do not belong to the abnormal object), providing precise abnormal event detection and localization. EW is a multistage process with three phases: noise reduction, intensity gradient, and nonmaximum suppression. To eliminate background noise and irrelevant pixels of abnormal objects, a Gaussian filter of size w_e × h_e × c_e is applied to blur the normalized differentiation N_OF, where w_e and h_e are the width and height of the filter and c_e is the number of channels (e.g., a grayscale image has c_e = 1 and a color image has c_e = 3). Our differentiation output is a grayscale image, so c_e = 1. Considering the intensity gradient, an edge gradient G_e is obtained using a gradient operator that filters the image in the horizontal direction (G_x) and the vertical direction (G_y) to obtain the gradient magnitude perpendicular to the edge direction at each pixel. The derivative filter has the same size as the Gaussian filter. The first derivative is computed as shown in Eqs. (16.29) and (16.30):

G_e = √(G_x² + G_y²)   (16.29)

θ = tan⁻¹(G_y / G_x)   (16.30)

Then, a threshold is defined to preserve only the significant edges. This process is known as nonmaximum suppression. In this phase, the gradient magnitude at each pixel is checked against a threshold T, for which we use a value of 50 as it gives the best results, as discussed in Ref. [26]. If the magnitude is greater than T, it marks an edge point corresponding to a local maximum among its neighbors. Hence, we preserve the local maxima and suppress the rest to 0 to acquire the edges corresponding to the actual anomalies. In addition, the Gaussian filter is applied again with a kernel size of w_e × h_e × c_e to suppress noise in the image, giving the output EW used for the final anomaly localization O_L, where ζ is a constant. The anomaly localization O_L is computed as shown in the following equation:

O_L = ΔOF · ⌊EW / (EW + ζ)⌋   (16.31)
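Because the three edge-wrapping phases (Gaussian smoothing, gradient magnitude and direction, nonmaximum suppression with threshold T = 50) closely mirror a Canny-style edge detector, a hedged sketch can lean on OpenCV as a stand-in; the kernel size and the binary-mask reading of the bracketed term in Eq. (16.31) are our assumptions.

```python
import cv2
import numpy as np

def edge_wrapping(n_of, delta_of, threshold=50, ksize=5):
    """Sketch of the edge-wrapping postprocessing applied to the normalized map N_OF."""
    img = (n_of * 255).astype(np.uint8)

    # (1) Noise reduction with a Gaussian filter.
    blurred = cv2.GaussianBlur(img, (ksize, ksize), 0)

    # (2)+(3) Intensity gradient and nonmaximum suppression with T = 50;
    # cv2.Canny bundles Eqs. (16.29)-(16.30) and the suppression step.
    edges = cv2.Canny(blurred, threshold, threshold)

    # Final smoothing, then the masking of Eq. (16.31) to localize the anomaly.
    ew = cv2.GaussianBlur(edges, (ksize, ksize), 0).astype(np.float32)
    mask = (ew > 0).astype(np.float32)        # bracketed term read as a binary mask
    return delta_of * mask                    # O_L
```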
16.4 Experimental results

The performance of the DSTN is evaluated on the publicly available standard benchmarks used in the video anomaly detection task: UCSD pedestrian [4], UMN [1], and CUHK Avenue [6]. These datasets are recorded in crowds and contain indoor and outdoor scenes. Our experimental results are compared with various competing methods in terms of accuracy at both the frame and pixel levels and in terms of computational time. Additionally, we examine the impact of the GAN-based U-Net network with residual connections compared with another popular architecture, the autoencoder, and address the advantages and disadvantages of GAN for anomaly detection. Each subtopic is explained in detail as follows.
16.4.1 Dataset

16.4.1.1 UCSD dataset
The UCSD pedestrian dataset [4] consists of crowds of walking pedestrians in two outdoor scenes with various anomalies, e.g., cycling, skateboarding, driving vehicles, and rolling wheelchairs. It is a well-known video benchmark for the anomaly detection task due to its complex scenes in real environments with low-resolution images. There are two subsets: Ped1 and Ped2, where Ped stands for pedestrian. UCSD Ped1 contains 5500 normal frames in 34 training video sequences and 3400 anomalous frames in 16 testing video sequences. The image resolution is 238 × 158 pixels for UCSD Ped1 and 360 × 240 pixels for UCSD Ped2, for all frames. UCSD Ped2 provides 346 frames of normal events in 16 training video sequences and 1652 frames of anomalous events in 12 testing video sequences. Ped2 is characterized by crowded pedestrians walking horizontally to the camera plane. Examples of the UCSD dataset are shown in Fig. 16.17, where (A) is Ped1 and (B) is Ped2.

Fig. 16.17 UCSD pedestrian dataset: (A) Ped1 and (B) Ped2.

16.4.1.2 UMN dataset
The UMN dataset [1] is one of the publicly available benchmarks in the video anomaly detection task designed for identifying anomalies in crowds. It contains 11 videos with 7700 frames recorded in various indoor and outdoor scenarios. All frames have a resolution of 320 × 240 pixels. Both indoor and outdoor scenes feature walking pedestrians as the normal event and running pedestrians as the abnormal event, as shown in Fig. 16.18.

Fig. 16.18 UMN dataset.
All video sequences start with walking patterns and end with running patterns.

16.4.1.3 CUHK Avenue dataset
The CUHK Avenue dataset [6] consists of crowded scenes on a campus. It contains 30,652 frames in total: 15,328 frames in 16 training videos and 15,324 frames in 21 test videos. Each video sequence is 1–2 min long at 25 frames per second (fps). This dataset is challenging due to its various moving objects in crowds and its types of anomalous patterns related to human actions, including object-related actions (throwing, grabbing, and leaving objects), running, jumping, and loitering. In contrast, the normal pattern is crowds walking parallel to the image plane. Examples of the CUHK Avenue dataset are shown in Fig. 16.19.

Fig. 16.19 CUHK Avenue dataset.

16.4.2 Implementation details
We implement the proposed DSTN framework using the Keras [51] machine learning platform with a TensorFlow [52] backend, together with Matlab. An NVIDIA GeForce GTX 1080 Ti GPU with 3584 CUDA cores and 484 GB/s memory bandwidth is used during the training procedure. The testing measurement is performed on an Intel Core i9-7960 CPU with a 2.80 GHz processor base frequency. The model learns the transformation from spatial to temporal representations with the help of the Adam optimizer. The learning rate is set to 0.0002, the exponential decay rates β1 and β2 are set to 0.9 and 0.999, respectively, and epsilon is 10⁻⁸.
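These optimizer settings translate directly into Keras, for example:

```python
from tensorflow.keras.optimizers import Adam

# Adam configuration reported above: lr = 0.0002, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8.
optimizer = Adam(learning_rate=2e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
```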
16.4.3 Evaluation criteria

16.4.3.1 Receiver operating characteristic (ROC)
The receiver operating characteristic (ROC) curve is a standard method for evaluating the performance of an anomaly detection system. It plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold criteria [53] and supports the analysis of the decision-making process. In anomaly detection, the abnormal events that are correctly determined as positive detections (abnormal events), out of all positive ground truth data, give the TPR, also known as the probability of detection. The faster the TPR curve rises, the better the detection accuracy of abnormal events. The normal events (negative data) that are incorrectly determined as positive detections, out of all negative ground truth data, give the FPR. A higher FPR means a higher rate of misclassification of normal events. There are four types of binary predictions used to compute TPR and FPR, as described below. True positive (TP) is a correct positive detection of an abnormal event, when both the prediction outcome and the ground truth are positive (abnormal event). False positive (FP) is a false positive detection, when the outcome is predicted as positive (abnormal event) but the ground truth is negative (normal event), meaning that a normal event is incorrectly detected as abnormal. This problem often occurs in the video anomaly detection task (e.g., a walking person is detected as an anomaly). True negative (TN) is a correct detection of a normal event, when the outcome is predicted as negative (normal event) and the ground truth is also negative. False negative (FN) is an incorrect detection, when the outcome is predicted as negative (normal event) but the ground truth is positive (abnormal event). Hence, TPR and FPR can be computed as shown in Eqs. (16.32) and (16.33), respectively:

TPR = TP / (TP + FN)   (16.32)

FPR = FP / (FP + TN)   (16.33)
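For reference, a sketch of how the ROC curve and AUC are typically computed from frame-level anomaly scores and binary ground-truth labels is given below; it uses scikit-learn, which is our assumption, since the chapter does not state which toolkit was used for evaluation.

```python
from sklearn.metrics import roc_curve, auc

def evaluate_frame_level(scores, labels):
    """scores: per-frame anomaly scores in [0, 1]; labels: 1 = abnormal, 0 = normal."""
    fpr, tpr, thresholds = roc_curve(labels, scores)   # sweeps the threshold (Eqs. 16.32-16.33)
    roc_auc = auc(fpr, tpr)
    return fpr, tpr, roc_auc
```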
16.4.3.2 Area under curve (AUC)
The area under the curve, also known as AUC, is used in classification analysis problems to identify the best prediction model. It is computed as the area under the ROC curve, where TPR is plotted against FPR. A higher AUC value indicates superior model performance. Ideally, the model is a perfect classifier when all positive data are ranked above all negative data (AUC = 1). In practice, AUC results are expected to lie between 0.5 and 1.0 (AUC ∈ [0.5, 1]), meaning that a random positive sample is ranked higher than a random negative sample more than 50% of the time. The worst case is when all negative data are ranked above all positive data, leading the AUC to 0 (AUC = 0).
Hence, AUC values lie in [0, 1], and classifiers for real-world use should have an AUC greater than 0.5; AUC values less than 0.5 are not acceptable for the model [53]. To conclude, higher AUC values are preferred over lower ones.

16.4.3.3 Equal error rate (EER)
Apart from the AUC, the performance of the model can be quantified by observing the receiver operating characteristic equal error rate (ROC-EER). The EER is the point at which the misclassification rates of positive and negative data are equal. Specifically, the EER can be obtained from the intersection of the ROC curve with the diagonal EER line by varying a threshold until the FPR equals the miss rate 1 − TPR. Lower EER values indicate better model performance.

16.4.3.4 Frame-level and pixel-level evaluations for anomaly detection
In general, the quantitative performance evaluation of anomaly detection uses two criteria: frame-level and pixel-level evaluations. The frame-level evaluation focuses on the detection rate of anomalous events in the scene. If one or more anomalous pixels are detected, the frame is labeled as abnormal, regardless of the size and location of the abnormal objects. In this case, the detected frame is a TP if the actual frame is also abnormal. Conversely, if the actual frame is normal, the detected frame is counted as an FP. The pixel-level evaluation determines the correct location of the detected anomalous object in the scene. This evaluation is a challenging criterion in anomaly detection and localization research since it focuses on local pixels. It is remarkably more demanding and stricter than the frame-level evaluation due to the complexity of localizing anomalies, which in turn improves the accuracy of frame-level anomaly detection. To count a frame as a true positive (TP), the detected abnormal area must overlap the ground truth by more than 40% [3]. In addition, a frame is counted as a false positive (FP) if even one pixel is incorrectly detected as abnormal.

16.4.3.5 Pixel accuracy
In a standard semantic segmentation evaluation, the pixel accuracy metric [54] measures the correctness of the pixels belonging to each semantic class. In the proposed DSTN, two semantic classes are defined: a foreground class and a background class. The pixel accuracy is defined as Σ_i n_ii / Σ_i n_ti, where n_ii is the number of correctly classified pixels of class i and n_ti is the total number of pixels of class i.

16.4.3.6 Structural similarity index (SSIM)
The SSIM index is a perceptual metric that measures the quality of a predicted image with respect to its original image [55]. Under the SSIM index, the model is more effective when the predicted image is more similar to the target image. In our case, we use SSIM to analyze the similarity of the dense optical flow generated by the generator to the real dense optical flow obtained from two consecutive video frames.
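The EER and the SSIM can likewise be computed from the ROC curve and from image pairs, respectively; the snippet below is a sketch using scikit-learn and scikit-image (an assumption, as above; older scikit-image versions use `multichannel=True` instead of `channel_axis`).

```python
import numpy as np
from sklearn.metrics import roc_curve
from skimage.metrics import structural_similarity

def equal_error_rate(scores, labels):
    """EER: the operating point where FPR equals the miss rate (1 - TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = np.argmin(np.abs(fpr - (1.0 - tpr)))
    return (fpr[idx] + (1.0 - tpr[idx])) / 2.0

def flow_ssim(of_gen, of_real):
    """SSIM between the generated and the real dense optical flow images (8-bit assumed)."""
    return structural_similarity(of_gen, of_real, channel_axis=-1, data_range=255)
```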
16.4.4 Performance of DSTN

We evaluate the proposed DSTN with regard to accuracy and time complexity. The ROC curve is used to illustrate the anomaly detection performance at the frame level and the pixel level and to compare the experimental results with other state-of-the-art works. Additionally, the AUC and the EER are evaluated as criteria for judging the results. The performance of DSTN is first evaluated on the UCSD dataset, consisting of 10 and 12 videos with pixel-level ground truth for UCSD Ped1 and UCSD Ped2, respectively, using both the frame-level and pixel-level protocols. In the first stage of DSTN, patch extraction is applied to provide the appearance features of the foreground object and its motion in terms of the vector changes in each patch. The patches are extracted independently from each original image, with a size of 238 × 158 pixels (UCSD Ped1) and 360 × 240 pixels (UCSD Ped2), using the (w/4) × h × c_p patch geometry. As a result, we obtain 22 k patches from UCSD Ped1 and 13.6 k patches from UCSD Ped2. Then, to feed the spatiotemporal translation model, we resize all patches to the 256 × 256 default size at both training and testing time. During training, the input of G (the concatenation of the f and f_BR patches) and the target data (the generated dense optical flow OF_gen) are set to the same default resolution of 256 × 256 pixels. The encoding and decoding modules in G are implemented differently. In the encoder network, the image resolution is encoded from 256 → 128 → 64 → 32 → 16 → 8 → 4 → 2 → 1 to obtain the latent space representing the spatial image in a one-dimensional data space. The CNN performs this downscaling with 3 × 3 kernels and stride s = 2. Additionally, the number of neurons corresponding to the image resolution in En goes, per layer, from 6 → 64 → 128 → 256 → 512 → 512 → 512 → 512 → 512. In contrast, De decodes the latent space to reach the target data (the temporal representation OF_gen) with a size of 256 × 256 pixels using the same structure as En. Dropout is employed in De as the noise z, removing neuron connections with probability p = 0.5 and thereby preventing overfitting on the training samples. Since D needs to push G to improve the classification between real and fake images at training time, PatchGAN is applied with an input patch size of 64 × 64 pixels to output the class-label probability for the object. The PatchGAN architecture is constructed from 64 → 32 → 16 → 8 → 4 → 2 → 1, which is then flattened to 512 neurons and connected to fully connected (FC) and Softmax layers. The use of PatchGAN benefits the model in terms of time complexity. This is probably because there are fewer parameters to learn on the partial image, making the model less complex and able to achieve a good running time for the training process.
For testing, G is specifically employed to reconstruct OF_gen in order to compare it against the real motion information OF_fus. The image resolutions for testing and training are set to the same values for all datasets. The quantitative performance of DSTN is presented in Table 16.1, where we compare the DSTN with various state-of-the-art works, e.g., AMDN [15], GMM-FCN [12], Convolutional AE [14], and future frame prediction [21]. From Table 16.1 it can be observed that the DSTN surpasses most of the methods in both the frame-level and pixel-level criteria, since we achieve higher AUC and lower EER on the UCSD dataset. Moreover, we show the qualitative performance of DSTN using the standard evaluation in anomaly detection research, the ROC curve, where we vary a threshold from 0 to 1 to plot TPR against FPR. The qualitative performance of DSTN is compared with other approaches in both the frame-level evaluation (see Fig. 16.20A) and the pixel-level evaluation (see Fig. 16.20B) on UCSD Ped1, and in the frame-level evaluation on UCSD Ped2, as presented in Figs. 16.20 and 16.21, respectively. Following Figs. 16.20 and 16.21, the DSTN (circle) shows the strongest growth of TPR and surpasses all the competing methods at the frame and pixel levels. This means that the DSTN is a reliable and effective method, able to detect and localize anomalies with high precision. Examples of the experimental results of DSTN on the UCSD Ped1 and Ped2 datasets are illustrated in Fig. 16.22 to present its performance in detecting and localizing anomalies in the scene. According to Fig. 16.22, the proposed DSTN is able to detect and locate various types of abnormalities effectively, whether a single object, e.g., (A) a wheelchair, (B) a vehicle, (C) a skateboard, and (D) a bicycle, or more than one anomaly in the same scene, e.g., (E) bicycles, (F) a vehicle and a bicycle, and (G) a bicycle and a skateboard. However, we face a false positive problem in Fig. 16.22H (a bicycle and a skateboard), where a walking person (normal event) is detected as an anomaly. Even though the bicycle and the skateboard are correctly detected as anomalies in Fig. 16.22H, the false detection of the walking person still makes this frame count as incorrect. This false positive detection is probably caused by the walking speed being similar to the cycling speed in the scene.

For the UMN dataset, the performance of DSTN is evaluated using the same training parameters and network configuration as on the UCSD pedestrian dataset. Table 16.2 reports the AUC comparison of the DSTN with various competing works such as GANs [20], adversarial discriminator [22], AnomalyNet [23], and so on. Table 16.2 shows that the proposed DSTN achieves the best AUC result, tied with Ref. [23], outperforming all other methods. Noticeably, most of the competing methods achieve a high AUC on the UMN dataset. This is because the UMN dataset is less complex in its abnormal patterns than the UCSD pedestrian and Avenue datasets. Fig. 16.23 shows the performance of DSTN in detecting and localizing anomalies in different scenarios on the UMN dataset, covering both indoor and outdoor scenes, where we can detect most of the individual objects in the crowded scene.
Table 16.1 EER and AUC comparison of DSTN with other methods on UCSD dataset [26].

Method                    | Ped1 frame EER | Ped1 frame AUC | Ped1 pixel EER | Ped1 pixel AUC | Ped2 frame EER | Ped2 frame AUC | Ped2 pixel EER | Ped2 pixel AUC
MPPCA                     | 40%   | 59.0% | 81%   | 20.5% | 30%   | 69.3% | –     | –
Social Force (SF)         | 31%   | 67.5% | 79%   | 19.7% | 42%   | 55.6% | 80%   | –
SF + MPPCA                | 32%   | 68.8% | 71%   | 21.3% | 36%   | 61.3% | 72%   | –
Sparse Reconstruction     | 19%   | –     | 54%   | 45.3% | –     | –     | –     | –
MDT                       | 25%   | 81.8% | 58%   | 44.1% | 25%   | 82.9% | 54%   | –
Detection at 150 fps      | 15%   | 91.8% | 43%   | 63.8% | –     | –     | –     | –
SR + VAE                  | 16%   | 90.2% | 41.6% | 64.1% | 18%   | 89.1% | –     | –
AMDN (double fusion)      | 16%   | 92.1% | 40.1% | 67.2% | 17%   | 90.8% | –     | –
GMM                       | 15.1% | 92.5% | 35.1% | 69.9% | –     | –     | –     | –
Plug-and-Play CNN         | 8%    | 95.7% | 40.8% | 64.5% | 18%   | 88.4% | –     | –
GANs                      | 8%    | 97.4% | 35%   | 70.3% | 14%   | 93.5% | –     | –
GMM-FCN                   | 11.3% | 94.9% | 36.3% | 71.4% | 12.6% | 92.2% | 19.2% | 78.2%
Convolutional AE          | 27.9% | 81%   | –     | –     | 21.7% | 90%   | –     | –
Liu et al.                | 23.5% | 83.1% | –     | 33.4% | 12%   | 95.4% | –     | 40.6%
Adversarial discriminator | 7%    | 96.8% | 34%   | 70.8% | 11%   | 95.5% | –     | –
AnomalyNet                | 25.2% | 83.5% | –     | 45.2% | 10.3% | 94.9% | –     | 52.8%
DSTN (proposed method)    | 5.2%  | 98.5% | 27.3% | 77.4% | 9.4%  | 95.5% | 21.8% | 83.1%
Fig. 16.20 ROC Comparison of DSTN with other methods on UCSD Ped1 dataset: (A) frame-level evaluation and (B) pixel-level evaluation [26].
Apart from evaluating DSTN on the UCSD and UMN datasets, we also assess our performance on the challenging CUHK Avenue dataset, using the same parameter and configuration settings as for the UCSD and UMN datasets.
Fig. 16.21 ROC Comparison of DSTN with other methods on UCSD Ped2 dataset at frame-level evaluation [26].
Fig. 16.22 Examples of DSTN performance in detecting and localizing anomalies on the UCSD Ped1 and Ped2 datasets: (A) a wheelchair, (B) a vehicle, (C) a skateboard, (D) a bicycle, (E) bicycles, (F) a vehicle and a bicycle, (G) a bicycle and a skateboard, and (H) a bicycle and a skateboard [26].
Table 16.3 presents the performance comparison, in terms of EER and AUC, of the DSTN with other competing works [6, 12, 14, 21, 23], in which the proposed DSTN surpasses all state-of-the-art works on both protocols. We show examples of the DSTN performance in detecting and localizing various types of anomalies, e.g., (A) jumping, (B) throwing papers, (C) falling papers, and (D) grabbing a bag, on the CUHK Avenue dataset in Fig. 16.24. The DSTN can effectively detect and localize anomalies in this dataset, even in Fig. 16.24D, which contains only small movements for the abnormal event (only the human head and the fallen bag are slightly moving).
Table 16.2 AUC comparison of DSTN with other methods on UMN dataset [26].

Method                    | AUC
Optical-flow              | 0.84
SFM                       | 0.96
Sparse reconstruction     | 0.976
Commotion                 | 0.988
Plug-and-play CNN         | 0.988
GANs                      | 0.99
Adversarial discriminator | 0.99
AnomalyNet                | 0.996
DSTN (proposed method)    | 0.996
Fig. 16.23 Examples of DSTN performance in detecting and localizing anomalies on the UMN dataset, where (A), (B), and (D) contain running activity outdoors while (C) is indoors [26].

Table 16.3 EER and AUC comparison of DSTN with other methods on CUHK Avenue dataset [26].

Method                 | EER   | AUC
Convolutional AE       | 25.1% | 70.2%
Detection at 150 fps   | –     | 80.9%
GMM-FCN                | 22.7% | 83.4%
Liu et al.             | –     | 85.1%
AnomalyNet             | 22%   | 86.1%
DSTN (proposed method) | 20.2% | 87.9%
To indicate the significance of our performance for real-time use, we compare the running time of DSTN during testing, in seconds per frame, with other competing methods [3–6, 15] in Table 16.4, following the environment and computational times reported in Ref. [15]. Regarding Table 16.4, we achieve a lower running time than most of the competing methods, except for Ref. [6]. This is because the architecture of DSTN relies on a deep learning framework with multiple convolutional neural network layers, which is more complex than Ref. [6], which uses the learning of a sparse dictionary and has fewer connections.
Fig. 16.24 Examples of DSTN performance in detecting and localizing anomalies on the CUHK Avenue dataset: (A) jumping, (B) throwing papers, (C) falling papers, and (D) grabbing a bag [26].

Table 16.4 Running time comparison on testing measurement (seconds per frame).

Method                 | CPU (GHz) | GPU                 | Memory (GB) | Ped1  | Ped2  | UMN   | Avenue
Sparse Reconstruction  | 2.6       | –                   | 2.0         | 3.8   | –     | 0.8   | –
Detection at 150 fps   | 3.4       | –                   | 8.0         | 0.007 | –     | –     | 0.007
MDT                    | 3.9       | –                   | 2.0         | 17    | 23    | –     | –
Li et al.              | 2.8       | –                   | 2.0         | 0.65  | 0.80  | –     | –
AMDN (double fusion)   | 2.1       | Nvidia Quadro K4000 | 32          | 5.2   | –     | –     | –
DSTN (proposed method) | 2.8       | –                   | 24          | 0.315 | 0.319 | 0.318 | 0.334
However, according to the experimental results in Tables 16.1 and 16.3, our proposed DSTN provides significantly higher AUC and lower EER at the frame and pixel levels on the CUHK Avenue and UCSD pedestrian datasets than Ref. [6]. Regarding the running time, the proposed method runs at 3.17 fps for the UCSD Ped1 dataset, 3.15 fps for the UCSD Ped2 dataset, 3.15 fps for the UMN dataset, and 3 fps for the CUHK Avenue dataset. Finally, we compare the proposed DSTN with other competing works [3–6, 15] with respect to both the frame-level AUC and the running time in seconds per frame for the UCSD Ped1 and Ped2 datasets, as presented in Figs. 16.25 and 16.26, respectively. Considering Figs. 16.25 and 16.26, our proposed method achieves the best results regarding both the AUC and running time aspects. In this way, we can conclude that our DSTN surpasses other state-of-the-art approaches, since we reach the highest AUC values for frame-level anomaly detection and pixel-level localization while providing a computational time suitable for real-world applications.
Fig. 16.25 Frame-level AUC comparison and running time on UCSD Ped1 dataset.
Fig. 16.26 Frame-level AUC comparison and running time on UCSD Ped2 dataset.
16.4.5 The comparison of generative adversarial network with an autoencoder

The GAN-based U-Net architecture is a practical approach for shortcutting low-level information across the network. The skip connections in the generator play a significant role in our proposed framework. We highlight their significance with experiments on UCSD Ped2, comparing against an autoencoder, which can be constructed by removing the skip connections from the U-Net architecture. All training videos are learned with both the skip connections and the autoencoder for 40 epochs to observe how well each minimizes the L1 loss, as shown in Fig. 16.27. Fig. 16.27 demonstrates that the loss curve of the skip connections reaches a lower error over the training time than that of the autoencoder, showing the superior performance of the skip connections. Besides, we examine the ability of the skip connections and the autoencoder to generate temporal information (generated dense optical flow) using the test videos from UCSD Ped2 and compare it to the dense optical flow ground truth, as displayed in Fig. 16.28. The autoencoder fails to recover the motion information, as shown in Fig. 16.28C. In contrast, the skip connections in Fig. 16.28B produce motion information for the dense optical flow that correctly corresponds to its ground truth in Fig. 16.28A, giving good synthesized image quality.
Fig. 16.27 Performance comparison on UCSD Ped2 dataset between GAN based U-Net architecture (the residual connection) and autoencoder [26].
Fig. 16.28 The qualitative results in generating (A) dense optical flow on UCSD Ped2 dataset between (B) residual connection and (C) autoencoder [26].
Table 16.5 FCN-score and SSIM comparison on UCSD Ped2 dataset between residual connection and autoencoder.

Network architecture | Pixel accuracy | SSIM
Autoencoder          | 0.83           | 0.82
Residual connection  | 0.9            | 0.96
To quantify the performance of the skip connections and the autoencoder, the structural similarity index (SSIM) [55] and the FCN-score [54] are evaluated for each architecture on UCSD Ped2, as presented in Table 16.5. A higher value means better performance for both evaluation criteria. Table 16.5 shows that the GAN-based U-Net architecture with skip connections is more suitable for passing low-level information, since it achieves superior results to the autoencoder on both evaluation metrics, especially SSIM.
16.4.6 Advantages and limitations of generative adversarial networks for video anomaly detection

Generative adversarial networks for anomaly detection have certain advantages over traditional CNNs. The GAN framework does not require any labeled data or inference during the learning procedure. In addition, a GAN can generate example data without using different entries in a sequential sample and does not need a Markov chain Monte Carlo (MCMC) method to train the model, as the adversarially trained AAE [24] and the VAE [25] do. Instead, it computes only backpropagation to obtain the gradients. As regards the statistical advantage, the GAN model can capture the density distribution of the example data through the generator network, which is trained and updated with gradients flowing through the discriminator rather than directly from the example data. In this way, for GAN in the video anomaly detection task, the objective function of the generator is strengthened so that it can generate synthetic output that looks real from the input image, since the parameters of the generator do not directly receive the components of the target image. Apart from the advantages mentioned above, the generator network provides a very sharp synthetic image, while the visual output of the VAE network based on the MCMC method is blurry because of mode mixing in chains. As for the limitations of GAN, its training is unstable compared to the VAE, making it difficult to predict the value of each pixel for the whole image and causing artifact noise in the synthetic image. The major limitation of anomaly detection using GAN in current research is that only the static camera scenario is implemented to obtain the appearance and motion features from the moving foreground objects. Besides, GAN also has problems learning and generating small objects (the full appearance of the objects) in crowded scenes, making it challenging to enhance the accuracy of the model, especially at the pixel level.
16.5 Summary

In this chapter, we extensively explained the architecture of GANs and explored their applications in video anomaly detection research. DSTN, a novel unsupervised anomaly detection and localization method, was introduced to extend the use of GANs and improve system performance with respect to the accuracy of frame-level anomaly detection, pixel-level localization, and computational time. The DSTN is designed to comprehensively learn features from the spatial to the temporal representation by employing a novel fusion between background removal and real dense optical flow. The concatenation of patches is presented to assist the learning of the generative network. The proposed method is unsupervised, since only normal events are used in training to obtain the corresponding generated dense optical flow, without labeling abnormal data. Since all videos are input into the model during testing, unrecognized patterns are classified as abnormalities because the model has no prior knowledge of any abnormal events. The abnormalities can be detected simply by computing the difference in local pixels between the real and the generated dense optical flow images. To the best of our knowledge, the proposed DSTN is the first attempt to boost pixel-level anomaly localization with the edge wrapping method as a postprocessing step of the GAN framework. We evaluated on three publicly available benchmarks: the UCSD pedestrian, UMN, and CUHK Avenue datasets. The performance of DSTN was compared with various methods and analyzed against an autoencoder to show the significance of using the skip connections of the GAN. From the experimental results, the proposed DSTN outperforms other state-of-the-art works in anomaly detection, localization, and time consumption. The advantages and limitations of GAN were addressed in the final section to deliver a comprehensive view of the use of GAN for the video anomaly detection task.
References

[1] R. Mehran, A. Oyama, M. Shah, Abnormal crowd behavior detection using social force model, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 935–942.
[2] J. Kim, K. Grauman, Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 2921–2928.
[3] W. Li, V. Mahadevan, N. Vasconcelos, Anomaly detection and localization in crowded scenes, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2013) 18–32.
[4] V. Mahadevan, W. Li, V. Bhalodia, N. Vasconcelos, Anomaly detection in crowded scenes, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 1975–1981.
[5] Y. Cong, J. Yuan, J. Liu, Sparse reconstruction cost for abnormal event detection, in: CVPR 2011, IEEE, 2011, pp. 3449–3456.
[6] C. Lu, J. Shi, J. Jia, Abnormal event detection at 150 fps in matlab, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2720–2727.
[7] Y. Yuan, Y. Feng, X. Lu, Structured dictionary learning for abnormal event detection in crowded scenes, Pattern Recogn. 73 (2018) 99–110.
[8] S. Wang, E. Zhu, J. Yin, F. Porikli, Video anomaly detection and localization by local motion based joint video representation and OCELM, Neurocomputing 277 (2018) 161–175.
[9] X. Zhang, S. Yang, X. Zhang, W. Zhang, J. Zhang, Anomaly Detection and Localization in Crowded Scenes by Motion-Field Shape Description and Similarity-Based Statistical Learning, 2018 (arXiv preprint arXiv:1805.10620).
[10] H. Mousavi, S. Mohammadi, A. Perina, R. Chellali, V. Murino, Analyzing tracklets for the detection of abnormal crowd behavior, in: 2015 IEEE Winter Conference on Applications of Computer Vision, IEEE, 2015, pp. 148–155.
[11] H. Mousavi, M. Nabi, H. Kiani, A. Perina, V. Murino, Crowd motion monitoring using tracklet-based commotion measure, in: 2015 IEEE International Conference on Image Processing (ICIP), IEEE, 2015, pp. 2354–2358.
[12] Y. Fan, G. Wen, D. Li, S. Qiu, M.D. Levine, F. Xiao, Video anomaly detection and localization via Gaussian mixture fully convolutional variational autoencoder, Comput. Vis. Image Underst. 195 (2020) 102920.
[13] Y. Feng, Y. Yuan, X. Lu, Learning deep event models for crowd anomaly detection, Neurocomputing 219 (2017) 548–556.
[14] M. Hasan, J. Choi, J. Neumann, A.K. Roy-Chowdhury, L.S. Davis, Learning temporal regularity in video sequences, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 733–742.
[15] D. Xu, Y. Yan, E. Ricci, N. Sebe, Detecting anomalous events in videos by learning deep representations of appearance and motion, Comput. Vis. Image Underst. 156 (2017) 117–127.
[16] S. Bouindour, M.M. Hittawe, S. Mahfouz, H. Snoussi, Abnormal Event Detection Using Convolutional Neural Networks and 1-Class SVM Classifier, IET Digital Library, 2017.
[17] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, N. Sebe, Plug-and-play cnn for crowd motion analysis: An application in abnormal event detection, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 1689–1698.
[18] M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed, R. Klette, Deep-anomaly: fully convolutional neural network for fast anomaly detection in crowded scenes, Comput. Vis. Image Underst. 172 (2018) 88–97.
[19] H. Wei, Y. Xiao, R. Li, X. Liu, Crowd abnormal detection using two-stream fully convolutional neural networks, in: 2018 10th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), IEEE, 2018, pp. 332–336.
[20] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, N. Sebe, Abnormal event detection in videos using generative adversarial nets, in: 2017 IEEE International Conference on Image Processing (ICIP), IEEE, 2017, pp. 1577–1581.
[21] W. Liu, W. Luo, D. Lian, S. Gao, Future frame prediction for anomaly detection—a new baseline, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536–6545.
[22] M. Ravanbakhsh, E. Sangineto, M. Nabi, N. Sebe, Training adversarial discriminators for cross-channel abnormal event detection in crowds, in: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2019, pp. 1896–1904.
[23] J.T. Zhou, J. Du, H. Zhu, X. Peng, Y. Liu, R.S.M. Goh, Anomalynet: an anomaly detection network for video surveillance, IEEE Trans. Inf. Forensics Secur. 14 (2019) 2537–2550.
[24] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, B. Frey, Adversarial Autoencoders, 2015 (arXiv preprint arXiv:1511.05644).
[25] J. An, S. Cho, Variational autoencoder based anomaly detection using reconstruction probability, in: Special Lecture on IE, vol. 2, 2015, pp. 1–18.
Index
Note: Page numbers followed by f indicate figures, t indicate tables, and b indicate boxes.
A Ablation analysis, 366–368 Accuracy, 50, 88 of filters, 94f AC GAN, 30–31, 30f loss functions and distance metrics, 32–33t pros and cons of, 34–35t Adam algorithm, 355, 357 Adaptive moment optimization (ADAM), 322 Adversarial autoencoders (AAEs), 107–108, 117 Adversarial loss, 9, 213–214, 218 Adversarial network, 293, 293f Adversarial preparation, 61 Adversarial training, 140 Age-cGAN, 61t, 75 Aging of face, 119 AlexNet, 65, 178–181, 190 Alternative FCM algorithm, 84–85 Amazon, 43t Animation, 254–259 AOI, 49t Appearance and motion conditions GAN (AMC-GAN), 334t, 340 Area under curve (AUC), 405–406 Artificial intelligence-based methods, 142–146 Art2Real, 250–253, 253–254f Attentional generative adversarial networks (AttnGAN), 141 attRNN, 40 Autoencoder, 414–416 Automatic caricature generation, 135–136 Automatic nonrigid histological image registration (ANHIR) dataset, 265, 273–275 Auxiliary automatic driving, 76 Auxiliary object functions, 332–333
B Background removal, 391–395, 417 Backward forward GAN (BFGAN), 40, 40f BicycleGAN model, 132–134 Bidirectional GAN (BiGAN), 29–30, 29f loss functions and distance metrics, 32–33t pros and cons of, 34–35t Bidirectional LSTM (Bi-LSTM), 341, 341f
Bilingual evaluation understudy (BLEU) score, 52 Bool GANs, 117 Boundary equilibrium GAN (BEGAN), 23–24, 24f loss functions and distance metrics, 32–33t pros and cons of, 34–35t
C Caltech 256, 49t Caption, 43t CariGANs, 136 Cartoon character generation, 76 Cascaded super-resolution GAN (CSRGAN), 6, 10 CD31 stain, 273 CelebA dataset, 49t, 247–248 CelebA-HQ, 49t Chinese poem dataset, 43t CHUK Avenue dataset, 404, 404f CIFAR 10/100, 49t Classification objective function, 332 Closed-circuit television (CCTV) cameras, 378 Cluster analysis, 81–82 CNN-based architectures, 185, 190–191, 196 CNN/Daily Mail dataset, 43t COCO dataset, 43t COIL-20, 49t Compactness, 84–85 Computer-aided diagnosis (CAD) systems, 162 Conditional adversarial networks, 134, 383–384 Conditional generative adversarial networks (CGANs), 28–29, 28f, 105–106, 126, 135, 139, 270 architecture, 166–167, 167f loss functions and distance metrics, 32–33t pros and cons of, 34–35t respiratory sound synthesis algorithm, 170–172 analysis, 179–181 data augmentation, 174–175 dataset, 174–181 discriminator network architecture, 169, 170f generator network architecture, 168–169, 169f inverse CWT, 176, 177f performance results, 177–179
Conditional generative adversarial networks (CGANs) (Continued) scalograms, 170b, 176, 176f steps, 173–174 system model, 167–168, 168f time-scale representation, 168 trained network model, 177, 177t Content-based image retrieval (CBIR), 185, 187, 200 Context Encoder, 61t Continuous wavelet transform (CWT), 168 Controllable GANs, 139–140 Conventional generative adversarial networks (cGANs), 289 Convolutional neural network (CNN), 18, 65, 269–270, 397, 407–408, 416–417 architecture, 66f Convolutional traces, 145 Convolution operation, 398, 398f Cooccurrence matrices, 143 Critic network discriminates, 337 Cross-channel adversarial discriminators, 387–389 Cross-channel generative adversarial networks, 383–385 Cross entropy loss, 332 Crossview Fork, 135 Cross-view image synthesis, 135 Crossview Sequential, 135 cSeq-GAN, 40–41, 41f CUB, 49t Cycle-consistency loss, 319–320 Cycle generative adversarial networks (CGAN), 25–26, 26f, 61t, 113–114, 115f, 132, 212–213, 322–325, 324f, 348, 350, 352–355 image-to-image translation, 245–247, 247–248f loss functions and distance metrics, 32–33t model, 223f normalized difference vegetation index (NDVI), 213–216 architecture, 217–218, 217f pros and cons of, 34–35t qualitative evaluation, 225f Cycle text-to-image GAN, 141
D Data augmentation, 220, 222f, 355, 357 Datasets image, 49, 49t for video generation techniques, 338t
Decision-level fusion approach, 210 Deep belief network (DBN), 67 architecture, 67f Deep convolutional GAN (DCGAN), 64, 108–110, 127, 164, 266–267 DeepFake artificial intelligence-based methods, 142–146 challenges, 131 definition, 128 face swapping, 148–150 facial expressions, manipulation of, 152–153 facial features, manipulation of, 150–152 GAN-based techniques image-to-image translation, 132–136 text-to-image synthesis, 136–142 legal and ethical considerations, 153–154 new face construction, 146–147 sample source and generated fake images, 128–130, 129–130f Deep generative adversarial networks (GANs) model, 291–292 Deep learning (DL), 65–68, 209–212, 347, 349, 352, 377–378, 391–393, 397 end-to-end, 379–380 generative, 235–237, 238f overview, 235–239 unsupervised approach, 378 variational autoencoder (VAE), 237–239 Deep learning-based (DA-DCGAN), 117 Deep network architectures, 194–196 Deep neural networks (DNNs), 347–348, 352 Deep spatiotemporal translation network (DSTN), 390 dataset CHUK Avenue, 404, 404f UCSD, 403, 403f UMN, 403–404 feature collection, 391–396 implementation, 404 overview, 390–391, 392f performance, 407–413, 414f spatiotemporal translation model, 396–400 testing framework, 391, 392f training framework, 391, 392f Denoising-based generative adversarial networks (D-GAN), 291 Dense inverse search (DIS), 393 DenseNet, 22–23, 191 Digital elevation model (DEM), 251–252, 252f
Digital imaging and communications in medicine (DICOM), 81 experimental analysis, 90–92 image segmentation, 90, 90f montage of, 90, 91f performance analysis, 92, 93f Dilated temporal relational GAN (DTRGAN), 340–341, 341f Discrete wavelet transformation, 318–319 Discriminator, 240–241 Discriminator model (DM), 63–64, 63f, 210–211, 211f Discriminator network, 293, 293f DNNs. See Deep neural networks (DNNs) DSTN. See Deep spatiotemporal translation network (DSTN) Dual attentional generative adversarial network (Dualattn-GAN), 141–142 Dual motion GAN (DMGAN), 334t, 340 Dynamic memory generative adversarial networks (DM-GAN), 140 Dynamic transfer GAN, 334–335, 334t, 335f
E Earth mover distance, 21 e-commerce, 185–187, 196–197, 200 Edge-enhanced GAN (EE-GAN), 9 for remote sensing image, 10 Edge-enhanced super-resolution network (EESR), 5 Edge wrapping (EW), 401–402 ELBO loss, 320 Encoder-based GAN, 47, 47f Encoder-decoder network, 291–292 Enhanced super-resolution GAN (ESRGAN), 9–10 Ensemble learning GANs, 42, 44f Equal error rate (EER), 406 Errors lung lobe tissue, 280, 280f Estrogen receptor (ER) antibody stains, 273 Expectation-maximization (EM), 82–83
F Face aging, 75 Facebook AI Similarity Search (FAISS), 193 Face conditional GAN (FCGAN), 61t, 73 Face frontal view generation, 75 Face generation, 75, 247–250 Face swapping, 148–150
Facial expressions, manipulation of, 152–153 Facial features, manipulation of, 150–152 FakeSpotter, 144 False data detection rate. See Recurrent neural network (RNN), generative adversarial networks (GANs) False positive rate (FPR), 405 Fashion recommendation system, 191–196, 192f Fault diagnosis, 120 f-Divergence, 332 Feature collection, 391–396 FG-SRGAN, 4–5 Filters accuracy comparison, 94f classification outputs, 93, 94t FPR comparison, 94, 95f harmonic mean, 95, 96f PPV comparison, 95, 95f sensitivity comparison, 93–94, 94f specificity comparison, 94, 95f Fingerprints, 144 Flow and texture generative adversarial network (FTGAN), 334t, 335, 335f FlowGAN, 335 Fluorescein angiography, 348–349, 349f, 351, 351f, 371 Forum of International Respiratory Societies (FIRS), 161 Frame-level anomaly detection, 377–378, 406 Fréchet inception distance (FID), 50–51 F1 score, 50 Fully connected convolutional GANs (FCC-GANs), 103–104, 117 Fully connected GANs, 103–104 Fusion, 391–395, 394f Fuzzy C-means, 82–83 Fuzzy C-means clustering (FCMC), 83–84, 87–88
G GANs. See Generative adversarial networks (GANs) Gaussian filter, 89 Generative adversarial networks (GANs), 1, 2f, 18, 59–64, 185–186, 379–380 advantages, 127–128, 329–330, 342, 416–417 applications, 73–76, 119–120, 127 architectures, 19f, 102–116, 125–126, 126f, 381f vs. autoencoder, 379–380 based on image-to-image translation, 389–390
Generative adversarial networks (GANs) (Continued) basic structure, 99, 100f building blocks of, 331–332 components, 125 cross-channel, 383–385 cross-channel adversarial discriminators, 387–389 cyclical, 348, 352–355 design of, 60f disadvantages, 343 fake images (see DeepFake) future frame prediction, 385–387 generic framework, 329, 330f image-to-image, 352–353 issues and challenges, 11–12 limitations, 128, 416–417 loss functions and distance metrics, 32–33t model, 125 need for, 99–102 objective functions, 332 parts, 99 pros and cons of, 34–35t research gaps, 117–119 structure of, 381–383 training process, 331–332 variants, 126 for video generation and prediction, 333–337 for video recognition, 337–340 for video summarization, 340–341 working principle, 125 Generative adversarial text-to-image synthesis, 137–138 Generator model (GM), 62, 62f, 210–211, 211f Generator network, 293 Geographically weighted regression (GWR), 209 Geometry-guided CGANs, 135 GoogLeNet, 65, 178–181, 190 Gradient penalty, 22 Grocott-Gomori methenamine silver (GMS) stain, 264 Guided image filtering, 89
H Harmonic mean, 88 of filters, 95, 96f Hematoxylin and eosin (H&E) stain, 264, 271–273, 273t, 275, 276f Hierarchical generative adversarial networks (HiGAN), 334t, 337–339 High-quality images, 101f, 120
High-resolution picture generation, 74 Histopathology staining, GANs for, 266 applications, 264 automatic nonrigid histological image registration (ANHIR) dataset, 273–275 conditional GANs (CGANs), 270 dataset, 272–275 deep convolutional GAN (DCGAN), 266–267 discriminator, 263, 283t errors lung lobe tissue, 280, 280f generator, 263, 282t histology, 264, 271–272 histopathological analysis, 271 histopathology, 264–265 image-quality metrics, 268–269 image-to-image translation, 265, 269–271 kidney tissue, 278–279, 278f lung lesion tissue, 275–276, 276–277f lung lobe tissue, 278–279, 279f machine learning, 265 medical imaging, 271–272 network architectures, 272–275, 281–282 optimization functions, 267–268 vanilla, 266
I Identity shortcut connection. See ResNet Image datasets, 49, 49t Image generation, 73, 119 applications, 244–259 face generation, 247–250 image animation, 254–259, 256f image-to-image translation, 245–247 photo-realistic images, 250–253 scene generation, 254–259, 258–259f generative adversarial network (GAN) architecture, 240f Art2Real, 250–253, 253–254f cycleGAN, 245–247, 247–248f dataset, 246t first-order motion, 254–259 implementation, 245t monkey net, 254–255, 255–256f Nash equilibrium, 239–243 stackGAN, 254–259, 257f starGAN, 247–250 superresolution (SR), 250–253 variational autoencoder (VAE), 243–244, 244f
ImageNet, 191, 196 Image segmentation, 81–82 Image super-resolution, 6 Image synthesis, 119 Image-to-image translation, 73, 101f, 119, 132–136, 213–214, 389–390 histopathology staining, 265, 269–271 using cycle-GAN, 245–247, 247–248f Imfilter, 89 Imguided filter, 89 Imitation game, 1 Improved GAN (IGAN), 48, 48f Improved video generative adversarial network (iVGAN), 337, 338t, 338f Inception score (IS), 51 Incremental learning, 144 Info GAN, 22–23, 23f, 188–190, 188f loss functions and distance metrics, 32–33t pros and cons of, 34–35t Infrared image translation, 315 generative adversarial network in, 315–316 Integral probability metric, 332 Intersection over union (IoU), 51 Inverse CWT, 176, 177f IR-to-RGB translation, 313–314, 314f, 318–319, 319f, 323–325
J Jaccard index, 51 Jensen-Shannon (JS) divergence, 266 Jensen-Shannon divergence (JSD), 240, 242 Julia, 54
K Kernelized FCM, 84–85 Kernel maximum mean discrepancy (KMMD), 323–325 Kidney tissue, 278–279, 278f k-Lipschitz constant, 21 Kullback-Leibler divergence (KLD), 17–18, 238, 266
L LabelMe, 49t Laplacian pyramid GAN (LAPGAN), 69–70, 127 Least-square loss, 219 Legal case reports, 43t LeNet, 190
Long short-term memory (LSTM), 37, 67–68 architecture, 68f Loss function-based conditional progressive growing GAN (LC-PGGAN), 46, 46f Lung lesion tissue, 275–276, 276–277f Lung lobe tissue, 278–279, 279f
M Machine learning, 381–382, 404 Magnetic resonance images (MRI), 45 Markov Chain Monte Carlo (MCMC) method, 416–417 Masson’s trichrome (MAS) stain, 274 MATLAB, 53–54 Mean absolute error (MAE), 275 Mean absolute percentage error (MAPE), 298 Mean average error (MAE), 364–366 Mean squared error (MSE), 221–222, 364–366 Medical imaging, 347–348, 350–351, 355–356, 364, 371–372 Microaneurysms, 361–362f, 363, 367 MinMax FCM, 86–87 MirrorGAN, 139 Missing part generation, 120 MNIST digit generation, 106, 106f MobileNet, 191 MoCoGAN, 333–334, 334t, 334f Mode collapse, 54 Modified generator GAN (MG-GAN), 164 Monkey net, 254–259 Motion energy image (MEI), 333 Motion history image (MHI), 333 MRI brain tumor, 49t Multichannel attention selection GAN, 134 Multichannel residual conditional GAN (MCRCGAN), 45, 45f Multiconditional generative adversarial network (MC-GAN), 138 Multidomain image-to-image translation, 132 Multimodal image-to-image translation, 132–134 Multimodal reconstruction, 348 of retinal image, 351–360 ablation analysis, 366–368 cyclical generative adversarial networks (GANs), 348–349, 352–355 datasets, 360 network architectures, 357–360 qualitative evaluation, 360–364
Multimodal reconstruction (Continued) quantitative evaluation, 364–366 SSIM methodology, 355–357 structural coherence, 369–370, 370f Multiscale dense block generative adversarial network (MSDB-GAN), 48–49, 48f Multistage dynamic generative adversarial network (MSDGAN), 336, 336f, 338t Multitask learning (MTL), 291 MUNIT, 315–316, 322–325, 324f Mutual information (MI), 22–23
N Nash equilibrium, 54, 239–243 Natural language processing datasets, 42, 43t GAN application in, 33–41 NDVI. See Normalized difference vegetation index (NDVI) Near-infrared (NIR) images, 207, 220 Near-infrared (NIR) spectrum, 216–217 New face construction, 146–147 News summarization dataset, 43t Noisy speech, 43t Normalized difference vegetation index (NDVI) applications, 208–209 cycle generative adversarial networks, 212–216 architecture, 217–218, 217f model, 223f qualitative evaluation, 225f data augmentation, 220, 222f datasets, 217f deep learning-based approaches, 209–212 estimation country category, 226f, 229f field category, 227f, 230f mountain category, 228f, 231f evaluation metrics, 221–222 formulations, 208–209 least-square loss, 219 loss functions, 218–219 overview, 205–207 residual learning model (ResNet), 214–215
O Object-driven generative adversarial networks (ObJGAN), 140 Octave GANs, 109–110, 117
o-Kernelized FCM, 84–85 Open Images, 49t OpenStreetMap, 49t Open subtitles dataset, 43t OpinRank, 43t Oxford 102, 49t
P Pairwise learning, 145–146 Parallel GAN, 25–26, 25f loss functions and distance metrics, 32–33t pros and cons of, 34–35t Parameterized ReLU (PReLU), 7 Patch extraction, 391–395, 395f, 407 PatchGAN, 270–271, 322, 359–360, 399, 400f, 407–408 PCA-GAN, 5 Peak signal-to-noise ratio (PSNR), 6, 268 Perceptual loss, 7–9, 320 Periodic acid-Schiff (PAS) stains, 274 Person reidentification (REID), 10 PG-GAN, 46, 46f Photo inpainting, 74 Photo-realistic images, 250–253 Pixel accuracy, 406 Pixel convolution neural networks (PixelCNN), 235–237 Pixel-level anomaly localization, 377–378, 406 Pixel recurrent neural networks (PixelRNN), 235–236 Pix2pix, 61t PoseGAN, 342 Precision, 50, 88 Progesterone receptor (PR) antibody stains, 274 Python, 52–53
Q Quality-aware GAN, 47, 47f Quasi recurrent neural network (QRNN), 39–40 QuGAN, 39–40, 39f
R Radon-Nikodym theorem, 241 RaFD dataset, 247–248 RankGAN, 37–38, 37f Realistic photograph generation, 75 Recall, 50
Recall-oriented understudy for gisting evaluation (ROUGE) score, 52 Receiver operating characteristic (ROC), 405 Reconstruction objective function, 332 Rectified linear unit (ReLU), 322 Recurrent neural network (RNN), 18, 66, 290 architecture, 66f generative adversarial networks (GANs) accuracy, 298–300, 299–300t adversarial/discriminator network, 293, 293f architecture, 294–296 deep-GAN model, 291–292 denoising method, 291 encoder-decoder network, 291–292 enhanced attention, 292 F-measure, 298, 300–305, 303–304t generator network, 293 geometric-mean (G-mean), 298, 305, 305–306t learning module, 295f mean absolute percentage error (MAPE), 298, 305–308, 306–307t multideep, 292 multitask learning (MTL), 291 optimization, 294–296 performance, 295f, 297–308 sensitivity, 298, 300, 301–302t specificity, 298, 300, 302–303t Wasserstein, 292 Reinforce GAN, 38, 38f Residual blocks, 7 ResNet, 6, 65, 190–191, 214–215, 216f, 217–218 Resnet 50 network model, 178–182, 179–180f, 180–181t ResNeXt, 5 Respiratory sound synthesis, conditional GAN algorithm, 170–172 analysis, 179–181 data augmentation, 174–175 dataset, 174–181 discriminator network architecture, 169, 170f generator network architecture, 168–169, 169f inverse CWT, 176, 177f performance, 177–179 scalograms, 170b, 176, 176f steps, 173–174 system model, 167–168, 168f time-scale representation, 168
trained network model, 177, 177t Restricted Boltzmann machines (RBM), 67 Res_WGAN, 5 Retinal image, multimodal reconstruction of, 348, 351–360 ablation analysis, 366–368 cyclical generative adversarial networks (GANs), 348–349, 352–355 datasets, 360 network architectures, 357–360 qualitative evaluation, 360–364 quantitative evaluation, 364–366 SSIM methodology, 355–357 structural coherence, 369–370 RNN. See Recurrent neural network (RNN) Root mean squared error (RMSE), 224, 232t, 275 R programming, 53
S Scale-adaptive low-resolution person reidentification (SALR-REID), 10 Scalograms, 170b, 176, 176f Scene generation, 254–259, 258–259f Seismic images, SRGAN-based model, 7, 8f Semantic similarity discriminator, 36, 37f Semi GAN, 27–28, 27f loss functions and distance metrics, 32–33t pros and cons of, 34–35t Semisupervised learning, 19f, 27 Semi GAN, 27–28, 27f Sensitivity, 51 SeqAttnGAN, 118 SeqGAN, 36, 36f Sequential GAN supervised, 31, 31f unsupervised, 24–25, 25f SGAN, 42–44, 44f Shadow maps, 120 Silhouette method, 84–85 Simple generative adversarial networks, 166 Smooth muscle actin (SMA) stain, 274 Spatial FCM, 84–85 Spatiotemporal translation model, 396–400 Specificity, 51 Speech enhancement, 120 sp-Kernelized FCM, 84–85 SRResNet, 5
Stacked generative adversarial networks (StackGAN), 61t, 110–113, 138, 254–259 Stacked generative adversarial networks (StackGAN++), 139 StarGAN, 73, 132, 247–250, 322–323, 324f Stochastic Adam optimizer, 224–225 Structural similarity index, 221–222, 224, 232t, 268 video anomaly detection (VAD), 406–407 Superresolution (SR), 74, 250–253 Super-resolution GAN (SR-GAN), 5, 61t, 71–72, 127 architecture of, 6–7, 7–8f image quality and, 11 network architecture, 7 perceptual loss, 7–9 adversarial loss, 9 content loss, 8–9 video surveillance and forensic application, 10 Supervised learning, 18–19, 28 ACGAN, 30–31, 30f bidirectional GAN (BiGAN), 29–30, 29f conditional GAN (CGAN), 28–29, 28f Supervised sequential GAN, 31, 31f loss functions and distance metrics, 32–33t pros and cons of, 34–35t
T Temporal GAN (TGAN), 61t, 69, 334t, 337, 338t Text-to-image GAN, 45–46, 46f Text-to-image synthesis, 76, 136–142 TextureGAN, 335 Thermal image translation. See Wavelet-guided generative adversarial network (WGGAN) TH-GAN, 41, 41f 2D median filter, 89 3D object generation, 75 3D presentation states (3DPR), 92 Threshold, 402 Tithonium Chasma, 252f Trained model of discriminator, 3, 3f of generator, 3, 4f True positive rate (TPR), 405 Turing Test, 1 Two-stage generative adversarial network (TsGAN), 44–45, 45f
U UCSD dataset, 403, 403f, 407 UGATIT, 322–323, 324f UMN dataset, 403–404 U-Net, 269–271, 385–387, 414–415 U-Net generator, 217f, 220 Unified GAN (UGAN), 39, 39f, 132 Unpaired photo-to-caricature translation, 136 Unsupervised generative attentional networks (U-GAT-IT), 134 Unsupervised learning, 18–19, 19f, 210, 377–379, 389 BEGAN, 23–24, 24f cycle GAN, 25–26, 26f of generative adversarial network, 390–400 InfoGAN, 22–23, 23f parallel GAN, 25–26, 25f sequential GAN, 24–25, 25f vanilla GAN, 19–20, 20f Wasserstein GAN, 20f, 21–22 WGAN-GP, 20f, 22 UT-Zap50K benchmark dataset, 186–187, 196–197
V VAD. See Video anomaly detection (VAD) VAE. See Variational autoencoder (VAE) Vanilla GAN, 19–20, 19–20f, 126, 188, 266 loss functions and distance metrics, 32–33t pros and cons of, 34–35t Vanishing gradients, 54 Variational autoencoder (VAE), 17–18, 237–239, 243–244, 244f, 314–315 wavelet-guided, 317–319 Vari GAN, 61t, 68–69 Vegetation indexes (VIs), 205 normalized difference vegetation index (NDVI) applications, 208–209 formulations, 208–209 VGGNet, 65, 186 Video analytics, 329 generative adversarial network (GAN), 334t, 338t for video generation and prediction, 333–337 for video recognition, 337–340 for video summarization, 340–341 Video anomaly detection (VAD), 377–378 deep spatiotemporal translation network (see Deep spatiotemporal translation network (DSTN))
evaluation area under curve (AUC), 405–406 equal error rate (EER), 406 frame-level evaluations, 406 pixel accuracy, 406 pixel-level evaluations, 406 receiver operating characteristic (ROC), 405 structural similarity index (SSIM), 406–407 generative adversarial network (GAN), 379–380 advantages, 416–417 based on image-to-image translation, 389–390 cross-channel, 383–385 cross-channel adversarial discriminators, 387–389 limitations, 416–417 prediction based on, 385–387 structure of, 381–383 for surveillance videos, 378–379 Video frame prediction, 76 Video GAN (VGAN), 61t, 71 Video retargeting, 339 Video surveillance, 378–379 Video synthesis, 76, 120 Video understanding, 333 Visual Genome, 49t Visual similarity search systems fashion recommendation system (see Fashion recommendation system) test results, 197, 198–200f web interface, 197–200, 201f
W WarpGAN, 135–136 Wasserstein distance, 21 Wasserstein GAN (WGAN), 20f, 21–22, 114–116, 292 loss functions and distance metrics, 32–33t pros and cons of, 34–35t Wavelet-guided generative adversarial network (WGGAN) adaptive moment optimization (ADAM), 322 architecture, 316–317
cycleGAN, 322–325, 324f FLIR ADAS dataset, 321, 321t, 323 MUNIT, 315–316, 322–325, 324f qualitative analysis, 321 translation results, 323, 324f quantitative analysis, 322 translation results, 323–325, 325t StarGAN., 322–323, 324f UGATIT, 322–323, 324f wavelet-guided variational autoencoder (WGVA), 317–319 cycle-consistency loss, 319–320 discrete wavelet transformation, 318–319 ELBO loss, 320 full loss, 321 GAN loss, 320–321 perceptual loss, 320 reparameterization, 318 Wavelet-guided variational autoencoder (WGVA), 317–319, 325 cycle-consistency loss, 319–320 discrete wavelet transformation, 318–319 ELBO loss, 320 full loss, 321 GAN loss, 320–321 perceptual loss, 320 reparameterization, 318 Weak supervision, 333 WGAN-GP, 20f, 22 loss functions and distance metrics, 32–33t pros and cons of, 34–35t WGGAN. See Wavelet-guided generative adversarial network (WGGAN) WGVA. See Wavelet-guided variational autoencoder (WGVA) Wiener 2 filtering, 89
Y YELP dataset, 43t
Z ZFNet, 65