Recent Advances in Computer Vision Applications Using Parallel Processing 9783031187346, 9783031187353



Table of contents :
Preface
About This Book
Contents
A Generic Multicore CPU Parallel Implementation for Fractional Order Digital Image Moments
1 Introduction
2 Background
2.1 Fractional Order Moments
2.2 Shared-Memory Parallel Programming for CPUs
3 Related Work
4 The Proposed Implementation
5 Results and Discussion
5.1 Setup
5.2 Results
6 Conclusions
References
Computer-Aided Road Inspection: Systems and Algorithms
1 Introduction
2 Road Damage Types
3 Road Data Acquisition
3.1 Sensors
3.2 Public Datasets
4 Road Damage Detection
4.1 2-D Image Analysis/Understanding-Based Approaches
4.2 3-D Road Surface Modeling-Based Approaches
4.3 Hybrid Approaches
5 Parallel Computing Architecture
6 Summary
References
Computer Stereo Vision for Autonomous Driving: Theory and Algorithms
1 Introduction
2 Autonomous Car System
2.1 Hardware
2.2 Software
3 Autonomous Car Perception
4 Computer Stereo Vision
4.1 Preliminaries
4.2 Multi-view Geometry
4.3 Stereopsis
5 Heterogeneous Computing
5.1 Multi-threading CPU
5.2 GPU
6 Summary
References
A Survey on GPU-Based Visual Trackers
1 Introduction
2 Parallel Computing
2.1 The Main Difference Between GPU and CPU
2.2 Strategy for Designing a Parallel Algorithm
2.3 Performance Evaluation Metrics
2.4 GPU Programming
3 Levels of Object Tracking Algorithms
3.1 Single Object Tracking
3.2 Multiple Object Tracking (MOT)
4 Conclusion
References
Accelerating the Process of Copy-Move Forgery Detection Using Multi-core CPUs Parallel Architecture
1 Introduction
2 Background
2.1 Types of Image Forgery
2.2 Copy-Move Forgery Detection Techniques
2.3 OpenMP
3 Related Work
3.1 Block-Based Methods
3.2 Key-Point Based Methods
3.3 Parallel Architectures in Copy-Move Forgery Detection
4 The Proposed Method
5 Results and Discussion
5.1 Setup
5.2 Results
6 Conclusion
References
Parallel Image Processing Applications Using Raspberry Pi
1 Introduction
2 Raspberry Pi: General Overview
2.1 Memory Bandwidth
2.2 Ethernet Throughput
2.3 Wi-Fi Throughput
2.4 Power Draw
3 Parallel Image Processing Applications Using RPI
3.1 Medical Applications
3.2 Recognition Applications
3.3 Monitoring Applications
3.4 Compression
3.5 Autonomous Car Driving
4 Conclusion
References

Studies in Computational Intelligence 1073

Khalid M. Hosny Ahmad Salah   Editors

Recent Advances in Computer Vision Applications Using Parallel Processing

Studies in Computational Intelligence Volume 1073

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

Khalid M. Hosny · Ahmad Salah Editors

Recent Advances in Computer Vision Applications Using Parallel Processing

Editors Khalid M. Hosny Department of Information Technology Zagazig University Sharkia, Egypt

Ahmad Salah College of Computing and Information Sciences University of Technology and Applied Sciences Ibri, Oman

ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-031-18734-6 ISBN 978-3-031-18735-3 (eBook) https://doi.org/10.1007/978-3-031-18735-3 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Computer vision is a field that is considered compute-intensive, because its inputs are images or videos. As the input is large, the computing/processing time is large as well. In addition, deep learning methods are heavily used in the field of computer vision. Thus, utilizing parallel architectures to improve the runtime of computer vision methods is of great interest and benefit. Of note, the inputs of computer vision methods are parallel-friendly. For instance, an image consists of a set of pixels. Those pixels can be divided into groups based on the number of available parallel resources, and each group of pixels is then processed by one computational resource (e.g., a CPU core). Thus, the groups of pixels are processed in parallel. Similarly, a video input can be divided into frames, where each computational resource (e.g., CPU or GPU) handles a number of frames.

The book provides readers with a comprehensive overview of the principles, methodologies, and recent advances in utilizing parallel architectures for the sake of speeding up computer vision methods and algorithms. This edited book contains six chapters. These chapters feature original and previously unpublished works of renowned scholars from multiple nations that address various aspects of parallelizing computer vision methods. Each chapter begins with an introduction that covers contemporary methods, discusses results, and identifies difficulties and potential options for the future. Additionally, each chapter provides a list of references for further research. Researchers, engineers, IT experts, developers, postgraduate students, and senior undergraduate students majoring in computer vision or high-performance computing will find the book to be a useful reference.

A brief overview of the contents of the book is as follows. The authors of the first chapter propose a generic method for parallelizing any fractional order moment on the parallel architecture of multicore CPUs in the chapter entitled "A Generic Multicore CPU Parallel Implementation for Fractional Order Digital Image Moments". They accelerate two well-known algorithms, namely, the fractional polar cosine transform (FrPCT) and the fractional polar sine transform (FrPST), on a 16-core CPU with the help of the OpenMP APIs. They managed to speed up the


two algorithms by 15.9 times. In the second chapter, the authors review recent advances in road inspection in the chapter entitled "Computer-Aided Road Inspection: Systems and Algorithms". The second chapter first compares the five most common road damage types. Then, 2D/3D road imaging systems are discussed. Finally, state-of-the-art machine vision and intelligence-based road damage detection algorithms are introduced. The third chapter is entitled "Computer Stereo Vision for Autonomous Driving: Theory and Algorithms". In this chapter, to enhance the trade-off between speed and accuracy, the authors develop a computer stereo vision algorithm for hardware with limited resources. The third chapter introduces the hardware and software components of the autonomous vehicle system. Following that, the authors go through four perception tasks for autonomous cars: (1) visual feature identification, description, and matching; (2) 3D information acquisition; (3) object detection and recognition; and (4) semantic image segmentation. The concepts of parallel computing on multi-threading CPUs and GPUs are then described in depth. In the fourth chapter, the authors review the recent advancements in using GPUs to accelerate visual object trackers in the chapter entitled "A Survey on GPU-Based Visual Trackers". In this chapter, Single Object Tracking (SOT) and Multiple Object Tracking (MOT) methods are categorized, summarized, and analyzed. Then, the authors discuss parallel computing applications for object tracking methods, as well as numerous techniques for evaluating the performance of parallel algorithms on parallel architectures. In the fifth chapter, the authors propose accelerating copy-move forgery detection in digital images in the chapter entitled "Accelerating the Process of Copy-Move Forgery Detection Using Multi-core CPUs Parallel Architecture". Several copy-move forgery detection methods have been introduced, but their fundamental problem is the huge runtime required to find the final forged region, because several intricate computational operations must be repeated. Using parallel architectures is one of the solutions that have been proposed to overcome this issue, although research in this direction is minimal and still in its early stages. The authors considered different parallel methods to speed up the copy-move forgery detection process in order to address the issue of massive runtime. In this context, a novel Multi-core CPU-based Parallel Copy-Move Image Forgery Detection (PCMIFD) system is proposed by the authors. The proposed system has undergone testing on an 8-core multi-core CPU parallel architecture; the obtained results show that the proposed system managed to speed up the sequential copy-move forgery detection algorithm by 7.4 times. Finally, in the sixth chapter, the authors review recent research conducted using Raspberry Pi multi-core CPU embedded systems, or clusters of Raspberry Pis, to speed up computer vision techniques. This chapter is entitled "Parallel Image Processing Applications Using Raspberry Pi". Instead of using normal CPUs, a cluster of low-power embedded processors can reduce electricity consumption.
Because the portable cluster may be set up to keep running even if some of its nodes fail, it is advantageous to employ a Raspberry Pi cluster for image processing applications that take a long time to complete. The authors of this chapter give a general overview


of how the Raspberry Pi is used in various parallel image processing applications across several industries.

Sharkia, Egypt
Ibri, Oman

Khalid M. Hosny
Ahmad Salah

About This Book

Parallel architectures have become commonplace. These architectures have helped researchers address problems that were previously infeasible in terms of runtime, owing to the great improvement in elapsed runtime. For instance, the success of training deep Neural Networks (NNs) is owed to Graphics Processing Units (GPUs); the idea of training massively deep NNs had existed for decades, but it only became practical with the advancement of GPUs.


Contents

A Generic Multicore CPU Parallel Implementation for Fractional Order Digital Image Moments
Ahmad Salah, Khalid M. Hosny, and Amr M. Abdeltif

Computer-Aided Road Inspection: Systems and Algorithms
Rui Fan, Sicen Guo, Li Wang, and Mohammud Junaid Bocus

Computer Stereo Vision for Autonomous Driving: Theory and Algorithms
Rui Fan, Li Wang, Mohammud Junaid Bocus, and Ioannis Pitas

A Survey on GPU-Based Visual Trackers
Islam Mohamed, Ibrahim Elhenawy, and Ahmad Salah

Accelerating the Process of Copy-Move Forgery Detection Using Multi-core CPUs Parallel Architecture
Hanaa M. Hamza, Khalid M. Hosny, and Ahmad Salah

Parallel Image Processing Applications Using Raspberry Pi
Khalid M. Hosny, Ahmad Salah, and Amal Magdi

A Generic Multicore CPU Parallel Implementation for Fractional Order Digital Image Moments Ahmad Salah, Khalid M. Hosny, and Amr M. Abdeltif

Abstract Image moments are considered a key technique for image feature extraction. They are utilized in several applications, such as watermarking and classification. One of the major shortcomings of image moment descriptors is the computation time: the image moments algorithm is a compute-intensive task. This can be justified by the fact that computing the image moments of two kernels over a 2D image makes the time complexity of the image moment computation of the fourth order (four nested loops). This high time complexity prohibits the image moments from being widely used as an image descriptor, especially for real-time applications. In this context, several methods have been proposed to speed up image moment computation on parallel architectures such as multicore CPUs and GPUs. To our knowledge, these efforts have focused on non-fractional moments; due to their recent development, fractional order moment implementations are sequential. This motivates this work to propose a generic parallel implementation of fractional order moments on the multicore CPU architecture. The proposed implementation utilizes the well-known OpenMP application programming interfaces (APIs) to evenly distribute the computational load over the CPU cores. The obtained results show that the proposed parallel implementation achieved a close to optimal speedup. Using a 16-core CPU, the obtained speedup was 15.1×.

Keywords Parallel implementation · Multicore CPU · Fractional moments · Speedup · OpenMP

A. Salah (B) · A. M. Abdeltif Computer Science Department, Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt e-mail: [email protected] A. Salah Department of Information Technology, CAS-Ibri, The University of Technology and Applied Sciences, Muscat, Oman K. M. Hosny Information Technology Department, Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. M. Hosny and A. Salah (eds.), Recent Advances in Computer Vision Applications Using Parallel Processing, Studies in Computational Intelligence 1073, https://doi.org/10.1007/978-3-031-18735-3_1


1 Introduction

Image moments are considered one of the well-established image feature extraction techniques, and they have been used in the field of image processing for decades [1]. The main strength of invariant moments is that they are not affected by scaling, rotation, and translation [2]. Thus, the information extracted about the objects in a digital image does not change if the objects' position, angle, or size changes. This makes image moments a strong tool for feature extraction. Image moments are utilized in several applications, including the medical domain [3], the watermarking domain [4], and the machine learning domain [5]. The reported results of utilizing digital image moments in several application domains attest to the success of image moments as an image descriptor.

The time complexity of digital image moments is considered one of their shortcomings. Image moments comprise a set of moments which are defined by the size of two kernels: the radial kernel p and the angular kernel q. The number of moments depends on the sizes of p and q [6]. The value of a single image moment is a particular weighted average computed over the entire image's pixels. Thus, completing the computation of a digital moment requires four loops, as follows: (1) the radial kernel with size p, (2) the angular kernel q, (3) the digital image rows N, and (4) the digital image columns N. We consider a square image for the sake of simple discussion. These four loops show the high time complexity of digital image moment computation, which is considered a burden that pushes practitioners to utilize lighter image descriptors such as SIFT [7] and SURF [8].

There are recent advancements in the field of parallel architectures, such as multicore CPUs [9] and GPUs [10], and in parallel APIs, such as the Compute Unified Device Architecture (CUDA) [11] and the OpenMP API [12]. The advancements in both hardware and software have been successfully harnessed to accelerate several sequential algorithms, such as deep learning [13] and sorting [14]. The achieved speedups improved the running times by several orders of magnitude. Thus, algorithms which are not suitable for real-time applications are accelerated by converting the sequential implementation to a counterpart parallel implementation. The parallel implementation, running on a parallel architecture, then becomes suitable for real-time applications.

Scientists have recently demonstrated the benefits of utilizing fractional order mathematical functions to describe scientific processes [15]. Due to the additional degrees of freedom provided by fractional-order functions, the analysis is more detailed. All of the ordinary moments previously discussed are based on basis functions or polynomials with integer orders; therefore, it is preferable to use basis functions or polynomials with fractional orders for image analysis [16]. Thus, the focus of this paper is fractional order moments. Several attempts have been made to provide parallel implementations of the sequential algorithms of the ordinary moments, as discussed in [6, 17].

In this work, we propose a generic parallel implementation of fractional order digital image moments on the multicore CPU parallel architecture. The proposed


implementation is based on a proposed parallel algorithm counterpart of the sequential fractional order digital image moments algorithm. The proposed implementation includes a load balancing procedure to guarantee that all of the CPU cores receive even computational loads. In addition, we provide a theoretical time complexity analysis of the proposed parallel implementation. The proposed implementation was then thoroughly evaluated on several images of different sizes and on different numbers of CPU cores. The main contributions of this paper are as follows.

1. To our knowledge, this is the first parallel multicore CPU implementation of the fractional order image moments.
2. The proposed implementation is theoretically and practically validated.
3. The obtained results show that the proposed parallel implementation achieved close to optimal speedups.

The rest of the paper is organized as follows. In Sect. 2, we review the basics of fractional moments and the OpenMP APIs. Section 3 exposes the related work on parallel image moment implementations. Section 4 explains the proposed implementation details. Then, the results and discussion are provided in Sect. 5. Finally, the paper is concluded in Sect. 6.

2 Background

2.1 Fractional Order Moments

Hosny et al. proposed a set of novel fractional orthogonal polar harmonic transforms in [18]. The authors proposed the fractional polar cosine transform (FrPCT) and the fractional polar sine transform (FrPST), among several other fractional order moments. These two moments (i.e., FrPCT and FrPST) are calculated using Eqs. (1) and (2), respectively:

$$FM_{nm}^{C} = \Omega \int_{0}^{2\pi}\!\int_{0}^{1} \left[ W_{nm}^{c}(r,\theta) \right]^{*} f(r,\theta)\, r \, dr \, d\theta \tag{1}$$

$$FM_{nm}^{S} = \Omega \int_{0}^{2\pi}\!\int_{0}^{1} \left[ W_{nm}^{s}(r,\theta) \right]^{*} f(r,\theta)\, r \, dr \, d\theta \tag{2}$$

where Ω is a normalizing factor, [·]* stands for the complex conjugate, f(r, θ) represents the image in the polar domain, n = |m| = 0, 1, 2, …, ∞, and W^c_nm and W^s_nm are given by Eqs. (3) and (4), respectively:

$$W_{nm}^{c}(r,\theta) = \sqrt{r^{k-2}}\, \cos\!\left(\pi n r^{k}\right) e^{jm\theta} \tag{3}$$

$$W_{nm}^{s}(r,\theta) = \sqrt{r^{k-2}}\, \sin\!\left(\pi n r^{k}\right) e^{jm\theta} \tag{4}$$

where j is the imaginary unit. Despite the existence of many fractional order moments, in this paper we focus on implementing a parallel version of the FrPST algorithm as a proof of concept. The proposed parallel implementation can trivially be extended to the other fractional order moments, such as FrPCT.
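To make the basis functions concrete, the following snippet is a minimal sketch of how a single FrPST kernel value, Eq. (4), could be evaluated; the function name frpst_kernel and the treatment of the fractional parameter k as a plain argument are our own illustrative choices, not part of the authors' implementation.

```cpp
#include <cmath>
#include <complex>

// A sketch of Eq. (4): W^s_nm(r, theta) = sqrt(r^(k-2)) * sin(pi*n*r^k) * e^(j*m*theta).
// The function name and parameterization are illustrative assumptions.
std::complex<double> frpst_kernel(int n, int m, double r, double theta, double k) {
    const double pi = std::acos(-1.0);
    const double radial = std::sqrt(std::pow(r, k - 2.0)) * std::sin(pi * n * std::pow(r, k));
    // e^(j*m*theta) = cos(m*theta) + j*sin(m*theta)
    return radial * std::complex<double>(std::cos(m * theta), std::sin(m * theta));
}
```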

2.2 Shared-Memory Parallel Programming for CPUs

The multicore CPU is a dominant parallel architecture in data centers and servers. An OpenMP construct is considered Single Program Multiple Data (SPMD); in other words, each thread is assigned the same code segment. The main advantage of the multicore CPU is the existence of shared memory: all the CPU cores/threads can access the same memory, so thread communication is extremely fast. One issue to be considered by the programmer is concurrent writes to the same memory cell. Harnessing multicore CPUs has successfully improved the reported run-times in several applications [19, 20].

OpenMP is a widely used multithreaded C/C++ API for parallel programming [12]. OpenMP includes a set of constructs, most of them directives, that enable the programmer to control the concurrent execution of the code segments assigned to threads. Those constructs/directives are applied to a code segment, which is called a structured block. For instance, the loop construct #pragma omp for is utilized to distribute loop iterations among the CPU cores [21]. The OpenMP library enables any code segment to identify its assigned thread by using the omp_get_thread_num() function. It is essential to identify the executing thread number to decide which part of the data or computation is to be executed by the current thread. For instance, if there is an array of 100 integer elements and the task is to calculate the sum of each 25 elements, then thread/core number 0 will handle the elements in the range [0, 24], thread number 1 will handle the elements in the range [25, 49], and so on.

To distribute the loop iterations among the CPU cores/threads, there are several approaches; we will discuss only three modes, namely, (1) static, (2) dynamic, and (3) auto. In static mode, the distribution of iterations within the thread team is known before the execution; it is a pre-defined distribution. On the contrary, the dynamic mode is unpredictable: the iterations are divided into chunks, and the chunks are assigned to threads in no defined order; whenever a thread is idle, it executes the next available iteration chunk. Finally, in auto mode, the decision of assigning iterations to threads is made by the compiler based on previous runs of the same loop; in other words, the compiler will learn the optimal distribution.
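As a minimal sketch of these constructs (the 100-element array and the 25-elements-per-thread split mirror the example above; variable names are ours), the following C code distributes loop iterations with a schedule clause and queries the executing thread:

```c
#include <stdio.h>
#include <omp.h>

#define N 100

int main(void) {
    int a[N];
    for (int i = 0; i < N; i++) a[i] = i;

    long total = 0;
    /* Distribute the N iterations over the thread team; "static" fixes the
       assignment before execution, "dynamic" hands out chunks to idle threads. */
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int i = 0; i < N; i++) {
        total += a[i];
    }

    /* Each thread can also identify itself to pick its own data range. */
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();          /* thread id within the team  */
        int chunk = N / omp_get_num_threads();   /* 25 elements per thread here */
        printf("thread %d handles [%d, %d]\n", tid, tid * chunk, (tid + 1) * chunk - 1);
    }
    printf("total = %ld\n", total);
    return 0;
}
```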


OpenMP threads communicate using the shared memory, i.e., shared variables. As concurrent writes by different threads can occur, OpenMP provides synchronization mechanisms to handle possible race conditions, i.e., to protect against data conflicts. There are two levels of synchronization in OpenMP, namely, (1) high level and (2) low level. High-level synchronization includes several methods, such as #pragma omp critical and #pragma omp barrier. The critical mechanism allows only one thread at a time to execute the enclosed block, while barrier forces all threads to wait until every thread reaches that point (the barrier line) in the code. Low-level synchronization includes several approaches, such as flush and locks.
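The following short sketch (ours, not from the chapter) contrasts the two high-level constructs just mentioned: critical serializes access to a shared variable, while barrier makes all threads wait at a given line.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int shared_counter = 0;

    #pragma omp parallel num_threads(8)
    {
        /* Only one thread at a time may execute this block, preventing a
           race condition on the concurrent writes to shared_counter. */
        #pragma omp critical
        {
            shared_counter++;
        }

        /* No thread proceeds past this point until all threads reach it. */
        #pragma omp barrier

        #pragma omp single
        printf("counter = %d\n", shared_counter); /* prints 8 */
    }
    return 0;
}
```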

3 Related Work

In this section, we discuss the several efforts that have been conducted to accelerate moment computation on different parallel architectures. This should shed light on the used architectures, the utilized programming tools/libraries, and the achieved time improvements.

In [22], the authors proposed using the suffix sum for calculating the ordinary image moments of gray-scale images; as a result, the number of multiplication operations is minimized. The authors re-expressed the image moment formulas in terms of the suffix sum. The suffix sum function lends itself to parallel implementation due to the independent nature of its term computations. The authors then proposed a set of parallel implementations for the proposed parallel image moment algorithms. The experimental results on the hypercube parallel architecture and the Exclusive Read Exclusive Write (EREW) Parallel Random Access Machine (PRAM) computer model outlined that the proposed implementations are faster than the state-of-the-art methods.

The authors in [23] proposed a method that accelerates Zernike moments [2] on a computing environment with multiple CPUs and multiple GPUs (MCPU-MGPU). Their main contribution is to analyze the Zernike moments computational algorithm and then decide which parts are to be executed on the CPUs and which parts on the GPUs. In addition, the authors proposed maintaining the balance between the loads assigned to the CPUs and GPUs for the sake of optimal use of the parallel computing resources. Then, the authors proposed a mechanism to facilitate the communication between the CPUs and GPUs. Finally, the authors thoroughly evaluated their proposed methods, and the obtained speedups are between 5× and 6×.

In [24], the authors proposed accelerating the inverse Legendre moment transform due to its major role in several applications of image analysis. The authors proposed utilizing the recursive version of the Legendre moments [25]. Then, the authors proposed distributing the transformation elements of signal reconstruction over the parallel resources, due to the fact that they are independent tasks. The obtained results outlined that their proposed algorithm can achieve good results when it runs on a very large-scale integration (VLSI) circuit implementation.


In [6], the authors proposed a parallel framework to accelerate the quaternion moments of images on both CPUs and GPUs. They proposed a loop mitigation mechanism to balance the workload of moments with imbalanced loops. They evaluated their method on mega-pixel images on four P100 GPUs and a 16-core CPU. They reported a 257× speedup in comparison to the sequential baseline code running on a single core. In addition, they managed to achieve a 180× speedup for applying image watermarking using their proposed parallel moment implementation.

In [17], the authors proposed accelerating the 2D and 3D Legendre moments. The authors started their work by profiling the Legendre moments computation algorithm to find the most time-consuming steps. Then, the authors proposed parallelizing those steps through a counterpart parallel algorithm. The proposed parallel implementations are designed to run on multicore CPUs and GPUs. The authors evaluated their methods on gray-scale images and reported a two-orders-of-magnitude improvement in the running time of the parallel code in comparison to the sequential version. The proposed parallel implementation was then utilized in several applications to demonstrate its run-time improvement.

The authors in [26] proposed accelerating the Polar Harmonic Transform (PHT). The authors proposed a set of GPU optimization mechanisms, such as the loop unroll operation. Then, the authors studied the effect of the GPU block size on the run-time. After that, they combined the loop unroll size and the block size to find the best combinations, in terms of run-time, of these two optimization mechanisms. In addition, the authors proposed a new algorithm to map image pixels to GPU threads using closed-form equations; as a result, the mapping function runs in constant time. The experimental results show a 1,800× speedup over the baseline of the corresponding sequential run-time.

In [27], the authors proposed a novel framework that includes all the quaternion orthogonal moments. Their research was motivated by the fact that previous works focused only on accelerating one kind of quaternion image moment. In their proposed framework, they implemented 11 different image moments, and the end user can select one moment to be applied to the input image. In addition, the user can select to run the moment on a single CPU core, on a multicore CPU with a certain number of threads, or on a CPU-GPU combination for the fastest possible run time. Moreover, the end user can set the moment order for both the angular and radial kernels. All of these options are available to the user from the command line as well as a graphical user interface (GUI). The obtained results of this work include a 540× speedup relative to the sequential implementation.

From the discussed literature, it is obvious that the current efforts for accelerating moments do not include accelerating the fractional order moments. This can be linked to the fact that fractional order moments are still recent techniques. This fact motivates the current work to provide the first parallel multicore CPU implementation for the fractional order moments.


4 The Proposed Implementation

The proposed parallel implementation is listed in Algorithm 1. The algorithm requires several inputs, as follows. Regarding the input image, the user should provide the number of rings M and the number of sectors in the innermost ring S. For the kernels, the user should decide the order of the radial kernel pmax and the order of the angular kernel qmax. In addition, the two kernels' values should be passed to the algorithm; those values can be calculated with simple for loops. The outputs of the algorithm are the moment components.

The idea of Algorithm 1 is based on the loop fusion technique. In other words, instead of including one loop for the radial kernel p and another loop for the angular kernel q, the two loops are fused into a single loop (lines 1 and 2 in Algorithm 1). Thus, the algorithm consists of three loops instead of four (lines 2–4). In lines 5 and 6, the algorithm maps the iteration number to the corresponding values of p and q using Eqs. (5) and (6), respectively. Then, the algorithm calculates the moment component at order (p, q) of the two kernels, i.e., M(p,q), for the red, green, and blue channels in lines 7, 8, and 9, respectively.

$$p = \left\lfloor \frac{it}{p_{max}} \right\rfloor \tag{5}$$

$$q = it \bmod p_{max} \tag{6}$$

Algorithm 1: Parallel algorithm for moment computation

Data: number of rings M, S value, radial order pmax, angular order qmax, Ip kernel values, Iq kernel values
Result: pmax × qmax moment components for the R, G, and B channels
1:  iterations = pmax × qmax
2:  for it = 0 to iterations do
3:      for ringID = 1 to M do
4:          for sectorID = 1 to S × (2 × ringID + 1) do
5:              p = ⌊it / pmax⌋
6:              q = it mod pmax
7:              R_Ch_M[p][q] += R[ringID][sectorID] × Ip[p][ringID] × Iq[ringID][q][sectorID]
8:              G_Ch_M[p][q] += G[ringID][sectorID] × Ip[p][ringID] × Iq[ringID][q][sectorID]
9:              B_Ch_M[p][q] += B[ringID][sectorID] × Ip[p][ringID] × Iq[ringID][q][sectorID]
10:         end
11:     end
12: end
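A minimal OpenMP sketch of Algorithm 1 is given below. The array names and shapes (R, G, B, Ip, Iq, and the per-channel accumulators) follow the pseudocode, while the ring/sector image layout, double precision, and 1-based container sizing are our own illustrative assumptions, not the authors' implementation.

```cpp
#include <vector>
#include <omp.h>

// Sketch of Algorithm 1. The fused (p, q) loop (line 2) is distributed over the
// CPU cores; each iteration "it" owns exactly one (p, q) component, so the
// accumulations below are race-free without extra synchronization.
// Containers are assumed sized for the 1-based ring/sector indices.
void compute_moments(int M, int S, int pmax, int qmax,
                     const std::vector<std::vector<double>>& R,  // R[ringID][sectorID]
                     const std::vector<std::vector<double>>& G,
                     const std::vector<std::vector<double>>& B,
                     const std::vector<std::vector<double>>& Ip, // Ip[p][ringID]
                     const std::vector<std::vector<std::vector<double>>>& Iq, // Iq[ringID][q][sectorID]
                     std::vector<std::vector<double>>& Rm,       // Rm[p][q]; likewise Gm, Bm
                     std::vector<std::vector<double>>& Gm,
                     std::vector<std::vector<double>>& Bm) {
    const int iterations = pmax * qmax;              // line 1
    #pragma omp parallel for schedule(static)        // evenly split over the cores
    for (int it = 0; it < iterations; it++) {        // line 2 (fused loop)
        const int p = it / pmax;                     // Eq. (5)
        const int q = it % pmax;                     // Eq. (6)
        for (int ringID = 1; ringID <= M; ringID++) {
            const int sectors = S * (2 * ringID + 1);
            for (int sectorID = 1; sectorID <= sectors; sectorID++) {
                const double w = Ip[p][ringID] * Iq[ringID][q][sectorID];
                Rm[p][q] += R[ringID][sectorID] * w; // line 7
                Gm[p][q] += G[ringID][sectorID] * w; // line 8
                Bm[p][q] += B[ringID][sectorID] * w; // line 9
            }
        }
    }
}
```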

The time complexity of Algorithm 1 depends on three factors: the number of CPU cores r, the image size N × N, and the kernel orders pmax and qmax. For the sake of simplicity, we assume the two kernels are of equal size, pmax = qmax. In addition, for an image of size N × N pixels, the pixels are distributed over M rings, where ring ringID contains S × (2 × ringID + 1) sectors. In other words, the second and third loops of Algorithm 1 (lines 3 and 4) together iterate over N × N elements. To maintain load balance, we evenly divide the iterations of the three loops of Algorithm 1 over the r parallel resources (i.e., cores). Thus, the parallel time complexity of Algorithm 1 is $O\!\left(\frac{p_{max} \times q_{max} \times N^2}{r}\right)$. In the case when the number of cores is greater than or equal to the number of moment components (i.e., pmax × qmax), Algorithm 1's time complexity is $O\!\left(\frac{N^2}{r}\right)$.

The space complexity of Algorithm 1 depends on several data structures. First, the image is stored in a 2D array of size N². Second, the Ip kernel values are stored in a 2D array of size M × pmax, where M is the number of rings. Third, the Iq kernel values are stored in an array of size N² × qmax. Finally, the calculated moments are stored in a 2D array of size pmax × qmax. Thus, the space complexity of Algorithm 1 is O(N² × qmax), as the memory required for the angular kernel Iq is the largest among the data structures used by Algorithm 1.

5 Results and Discussion

5.1 Setup

All of the experiments are written in the C++ programming language (GCC version 5.4.0) with the OpenMP APIs for the multi-threaded functions. The running machine has two Intel Xeon E5 series v4 1.70 GHz CPUs with 8 cores per CPU, and 128 GB of RAM. The utilized OS is 64-bit Ubuntu 18. We utilized a set of square mega-pixel images of different sizes, i.e., 1024 × 1024, 2048 × 2048, and 4096 × 4096 pixels.

5.2 Results

To validate the proposed implementation, we tested two different fractional order moments as a proof of concept, FrPST and FrPCT. The experiments include running the code on mega-pixel images of sizes 1024 × 1024, 2048 × 2048, and 4096 × 4096 pixels. In these experiments, we examined three issues, namely, strong scaling, speedup, and run-times. Table 1 lists the run-times of the FrPCT moments computation in seconds for the sequential version and the parallel counterpart implementation on 2, 4, 8, and 16 CPU cores. The results outline a close to optimal run-time reduction in terms of the used CPU cores. For instance, the run-time of the FrPCT moments computation on a single CPU core (i.e., the sequential run-time) on a 4096 × 4096 pixels image is almost 1,052 s, while the corresponding parallel implementation run-time for the same image is almost 16.8 s when using 16 cores, a 15.9× speedup, which is very close to the optimal speedup of 16×.


Table 1 FrPCT run-times (in seconds) for the sequential version and for 2, 4, 8, and 16 CPU cores

Image size / cores    1          2         4         8        16
1024 × 1024           65.79      35.99     17.54     4.92     1.16
2048 × 2048           263.15     144.32    69.07     18.50    4.37
4096 × 4096           1052.60    547.66    267.84    69.03    16.85

Fig. 1 The speedup curves of the FrPST moments computation's run-times on different numbers of CPU cores and image sizes (1024 × 1024, 2048 × 2048, and 4096 × 4096 pixels)

We examined another kind of fractional order moments to ensure that the proposed parallel implementation can be extended to all other kinds of fractional order digital image moments. In this context, the results of another fractional order moment (i.e., FrPST) are depicted in Figs. 1 and 2. First, the speedups of the FrPST moments are depicted in Fig. 1 for different image sizes against different numbers of CPU cores. The curves of this figure show close to optimal behaviour for all of the CPU core counts. This behaviour is identical to the results of the FrPCT moments. This emphasizes that the proposed parallel implementation is generic and can be utilized by any kind of fractional moment.

The third point of comparison is the strong scaling property of the proposed parallel implementation. The strong scaling property means that the run-time keeps improving when the problem size is fixed and the number of parallel resources keeps increasing. In this context, we tested the speedup values (i.e., the

Fig. 2 The strong scaling analysis of the FrPST moments computation on different numbers of CPU cores with a 4096 × 4096 pixels image

improvement in the sequential run-time) on a fixed image size of 4096 × 4096 pixels on 2, 4, 8, and 16 CPU cores. Figure 2 shows a steady increase in the speedup as the number of CPU cores increases. Thus, the depicted curve in Fig. 2 outlines the strong scaling property of the proposed parallel implementation.

6 Conclusions

In this work, the lack of any existing parallel multicore CPU implementation of the recently introduced fractional order moments motivated us to propose a generic parallel framework to address this research gap. In this vein, we discussed the main parts of the fractional order moments that can be parallelized. Then, we provided a theoretical discussion of the proposed parallel implementation that shows the expected improvements in the fractional order moments' computation relative to the utilized parallel resources (i.e., CPU cores). Finally, the proposed parallel implementation was practically validated through a set of experiments on two different types of fractional order moments, namely, FrPCT and FrPST, on different mega-pixel images and different numbers of CPU cores. The obtained results show that the proposed implementation achieved a close to optimal speedup, considering the sequential code run-time as the baseline. In addition, the results revealed that the proposed implementation scales strongly when the number of parallel resources increases on a fixed problem size. Future directions include examining the proposed implementation on a larger number of fractional order moments and comparing the gained time improvement against ordinary (non-fractional order) digital moments.


References

1. M.-K. Hu, Visual pattern recognition by moment invariants. IRE Trans. Inf. Theory 8(2), 179–187 (1962)
2. A. Khotanzad, Y.H. Hong, Invariant image recognition by Zernike moments. IEEE Trans. Pattern Anal. Mach. Intell. 12(5), 489–497 (1990)
3. K.M. Hosny, M.M. Darwish, K. Li, A. Salah, COVID-19 diagnosis from CT scans and chest X-ray images using low-cost Raspberry Pi. PLoS ONE 16(5), e0250688 (2021)
4. K.M. Hosny, A. Magdi, N.A. Lashin, O. El-Komy, A. Salah, Robust color image watermarking using multi-core Raspberry Pi cluster. Multimed. Tools Appl. 81(12), 17185–17204 (2022)
5. M. Sadeghi, M. Javaherian, H. Miraghaei, Morphological-based classifications of radio galaxies using supervised machine-learning methods associated with image moments. Astron. J. 161(2), 94 (2021)
6. A. Salah, K. Li, K.M. Hosny, M.M. Darwish, Q. Tian, Accelerated CPU-GPUs implementations for quaternion polar harmonic transform of color images. Futur. Gener. Comput. Syst. 107, 368–382 (2020)
7. D.G. Lowe, Object recognition from local scale-invariant features, in Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2 (IEEE, 1999), pp. 1150–1157
8. H. Bay, T. Tuytelaars, L.V. Gool, SURF: speeded up robust features, in European Conference on Computer Vision (Springer, Berlin, 2006), pp. 404–417
9. C.-T. Yang, J.-C. Liu, Y.-W. Chan, E. Kristiani, C.-F. Kuo, Performance benchmarking of deep learning framework on Intel Xeon Phi. J. Supercomput. 77(3), 2486–2510 (2021)
10. J. Choquette, O. Giroux, D. Foley, Volta: performance and programmability. IEEE Micro 38(2), 42–52 (2018)
11. C. Nvidia, Compute unified device architecture programming guide (2007)
12. R. Chandra, L. Dagum, D. Kohr, R. Menon, D. Maydan, J. McDonald, Parallel Programming in OpenMP (Morgan Kaufmann, 2001)
13. T.D. Ngo, T.T. Bui, T.M. Pham, H.T. Thai, G.L. Nguyen, T.N. Nguyen, Image deconvolution for optical small satellite with deep learning and real-time GPU acceleration. J. Real-Time Image Proc. 18(5), 1697–1710 (2021)
14. A. Salah, K. Li, K. Li, Lazy-merge: a novel implementation for indexed parallel k-way in-place merging. IEEE Trans. Parallel Distrib. Syst. 27(7), 2049–2061 (2015)
15. D. Sheng, Y. Wei, S. Cheng, J. Shuai, Adaptive backstepping control for fractional order systems with input saturation. J. Franklin Inst. 354(5), 2245–2268 (2017)
16. B. Xiao, L. Li, Y. Li, W. Li, G. Wang, Image analysis by fractional-order orthogonal moments. Inf. Sci. 382, 135–149 (2017)
17. K.M. Hosny, A. Salah, H.I. Saleh, M. Sayed, Fast computation of 2D and 3D Legendre moments using multi-core CPUs and GPU parallel architectures. J. Real-Time Image Proc. 16(6), 2027–2041 (2019)
18. K.M. Hosny, M.M. Darwish, T. Aboelenen, Novel fractional-order polar harmonic transforms for gray-scale and color image analysis. J. Franklin Inst. 357(4), 2533–2560 (2020)
19. M. Vasimuddin, S. Misra, H. Li, S. Aluru, Efficient architecture-aware acceleration of BWA-MEM for multicore systems, in 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (IEEE, 2019), pp. 314–324
20. G. Georgis, G. Lentaris, D. Reisis, Acceleration techniques and evaluation on multi-core CPU, GPU and FPGA for image processing and super-resolution. J. Real-Time Image Proc. 16(4), 1207–1234 (2019)
21. E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, G. Zhang, The design of OpenMP tasks. IEEE Trans. Parallel Distrib. Syst. 20(3), 404–418 (2008)
22. C.-H. Wu, S.-J. Horng, C.-F. Wen, Y.-R. Wang, Fast and scalable computations of 2D image moments. Image Vis. Comput. 26(6), 799–811 (2008)
23. P. Toharia, O.D. Robles, R. Suárez, J.L. Bosque, L. Pastor, Shot boundary detection using Zernike moments in multi-GPU multi-CPU architectures. J. Parallel Distrib. Comput. 72(9), 1127–1133 (2012)
24. G.-B. Wang, S.-G. Wang, Parallel recursive computation of the inverse Legendre moment transforms for signal and image reconstruction. IEEE Signal Process. Lett. 11(12), 929–932 (2004)
25. V. Nikolajevic, G. Fettweis, Computation of forward and inverse MDCT using Clenshaw's recurrence formula. IEEE Trans. Signal Process. 51(5), 1439–1444 (2003)
26. Z. Yang, M. Tang, Z. Li, Z. Ren, Q. Zhang, GPU accelerated polar harmonic transforms for feature extraction in ITS applications. IEEE Access 8, 95099–95108 (2020)
27. K.M. Hosny, M.M. Darwish, A. Salah, K. Li, A.M. Abdelatif, CUDAQuat: new parallel framework for fast computation of quaternion moments for color images applications. Clust. Comput. 24(3), 2385–2406 (2021)

Computer-Aided Road Inspection: Systems and Algorithms Rui Fan, Sicen Guo, Li Wang, and Mohammud Junaid Bocus

Abstract Road damage is an inconvenience and a safety hazard, severely affecting vehicle condition, driving comfort, and traffic safety. The traditional manual visual road inspection process is pricey, dangerous, exhausting, and cumbersome. Also, manual road inspection results are qualitative and subjective, as they depend entirely on the inspector’s personal experience. Therefore, there is an ever-increasing need for automated road inspection systems. This chapter first compares the five most common road damage types. Then, 2-D/3-D road imaging systems are discussed. Finally, state-of-the-art machine vision and intelligence-based road damage detection algorithms are introduced.

1 Introduction

The condition assessment of concrete and asphalt civil infrastructures (e.g., tunnels, bridges, and pavements) is essential to ensure their serviceability while still providing maximum safety for the users [1]. It also allows the government to allocate limited resources for infrastructure maintenance and appraise long-term investment schemes [2]. The detection and reparation of road damages (e.g., potholes and cracks) is a crucial civil infrastructure maintenance task because they are not only an inconvenience but also a safety hazard, severely affecting vehicle condition, driving comfort, and traffic safety [3].

R. Fan (B) · S. Guo
Tongji University, Shanghai, China
e-mail: [email protected]
S. Guo
e-mail: [email protected]
L. Wang
Continental Automotive Systems Shanghai Co. Ltd., Shanghai, China
e-mail: [email protected]
M. Junaid Bocus
University of Bristol, Bristol, England
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
K. M. Hosny and A. Salah (eds.), Recent Advances in Computer Vision Applications Using Parallel Processing, Studies in Computational Intelligence 1073, https://doi.org/10.1007/978-3-031-18735-3_2


Traditionally, road damages are regularly inspected (i.e., detected and localized) by certified inspectors or structural engineers [4]. However, this manual visual inspection process is time-consuming, costly, exhausting, and dangerous [5]. Moreover, manual visual inspection results are qualitative and subjective, as they depend entirely on the individual's personal experience [6]. Therefore, there is an ever-increasing need for automated road inspection systems, developed based on cutting-edge machine vision and intelligence techniques [7]. A computer-aided road inspection system typically consists of two major procedures [8]: (1) road data acquisition, and (2) road damage detection (i.e., recognition/segmentation and localization). The former typically employs active or passive sensors (e.g., laser scanners [9], infrared cameras [10], and digital cameras [11]) to acquire road texture and/or spatial geometry, while the latter commonly uses 2-D image analysis/understanding algorithms, such as image classification [12], semantic segmentation [13], and object detection [14], and/or 3-D road surface modeling algorithms [3] to detect the damaged road areas. This chapter first compares the most common types of road damages, including crack, spalling, pothole, rutting, and shoving. Then, various technologies employed to acquire 2-D/3-D road data are discussed. Finally, state-of-the-art (SOTA) machine vision and intelligence approaches, including 2-D image analysis/understanding algorithms and 3-D road surface modeling algorithms, developed to detect road damages are introduced.

2 Road Damage Types

Crack, spalling, pothole, rutting, and shoving are five common types of road damage [6]. Their structures are illustrated in Fig. 1. A road crack has a much larger depth compared to its dimensions on the road surface, presenting a unique challenge to imaging systems. Road spalling has similar lateral and depth magnitudes; thus, imaging systems designed specifically to measure this type of road damage should perform similarly in both the lateral and depth directions. A road pothole is a considerably large structural road surface failure, measurable by an imaging setup with a large field of view. Road rutting is extremely shallow along its depth, requiring a measurement system with high accuracy in the depth direction. Road shoving refers to a small bump on the road surface, which makes it difficult to profile with some imaging systems.

Fig. 1 Illustrations of the structures of the five common road damage types, each annotated with its width w and depth d; recoverable panel labels include (b) Spalling, (c) Pothole, and (d) Rutting (w/d ≫ 1)

The SGM algorithm [69] formulates dense disparity estimation as the minimization of the following energy:

$$E(D) = \sum_{\mathbf{p}} \Big( c(\mathbf{p}, d_{\mathbf{p}}) + \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} \lambda_1 \, \delta\big(|d_{\mathbf{p}} - d_{\mathbf{q}}| = 1\big) + \sum_{\mathbf{q} \in \mathcal{N}_{\mathbf{p}}} \lambda_2 \, \delta\big(|d_{\mathbf{p}} - d_{\mathbf{q}}| > 1\big) \Big), \tag{40}$$

where D is the disparity image, c is the matching cost, and q is a pixel in the neighborhood system N_p of p. λ1 penalizes the neighboring pixels with small disparity differences, i.e., one pixel; λ2 penalizes the neighboring pixels with large disparity differences, i.e., larger than one pixel. δ(·) returns 1 if its argument is true and 0 otherwise.

4. Disparity Refinement

Disparity refinement usually involves several post-processing steps, such as the left-and-right disparity consistency check (LRDCC), subpixel enhancement, and weighted median filtering [76]. The LRDCC can remove most of the occluded areas, which are only visible in one of the left/right images [63]. In addition, a disparity error larger than one pixel may result in a non-negligible 3D geometry reconstruction error [63]. Therefore, subpixel enhancement provides an easy way to increase disparity image resolution by simply interpolating the matching costs around the initial disparity [76]. Moreover, a median filter can be applied to the disparity image to fill the holes and remove the incorrect matches [76]. However, the above disparity refinement algorithms are not always necessary, and the sequential use of these steps depends entirely on the chosen algorithm and application needs.
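As an illustration of two of these refinement steps, the sketch below applies the LRDCC and a standard parabola-fitting subpixel interpolation of the matching costs; the one-pixel consistency threshold and the function names are our own assumptions, not the chapter's reference implementation.

```cpp
#include <cmath>
#include <cstdlib>

// Left-and-right disparity consistency check (LRDCC): invalidate pixel (u, v)
// if the left disparity does not agree with the right disparity it points to.
// Images are row-major arrays of width w and height h; -1 marks invalid.
void lrdcc(float* dispL, const float* dispR, int w, int h) {
    for (int v = 0; v < h; v++)
        for (int u = 0; u < w; u++) {
            int d = (int)std::lround(dispL[v * w + u]);
            if (u - d < 0 ||
                std::fabs(dispL[v * w + u] - dispR[v * w + (u - d)]) > 1.0f)
                dispL[v * w + u] = -1.0f; // occluded or inconsistent
        }
}

// Subpixel enhancement: fit a parabola through the matching costs at
// disparities d-1, d, d+1 and return the refined minimum location.
float subpixel(float c_prev, float c_min, float c_next, int d) {
    float denom = c_prev - 2.0f * c_min + c_next;
    return (denom > 0.0f) ? d + 0.5f * (c_prev - c_next) / denom : (float)d;
}
```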

Machine/deep learning-based stereo vision algorithms

With recent advances in machine/deep learning, CNNs have been prevalently used for disparity estimation. For instance, Žbontar and LeCun [77] utilized a CNN to compute patch-wise similarity scores, as shown in Fig. 13. It consists of a convolutional layer L1 and seven fully-connected layers L2–L8. The inputs to this CNN are two 9×9-pixel gray-scale image patches. L1 consists of 32 convolution kernels of size 5×5×1. L2 and L3 have 200 neurons each. After L3, the two 200-dimensional feature vectors are concatenated into a 400-dimensional vector and passed through layers L4–L7. Layer L8 maps the L7 output into two real numbers, which are then fed through a softmax function to produce a distribution over the two classes: (a) good match and (b) bad match. Finally, they utilized computer vision-based cost aggregation and disparity optimization/refinement techniques to produce the final disparity images. Although this method achieved state-of-the-art accuracy, it is limited by the employed matching cost aggregation technique and can produce wrong predictions in occluded or texture-less/reflective regions [78].

In this regard, some researchers have leveraged CNNs to improve the computer vision-based cost aggregation step. SGM-Nets [79] is one of the most well-known methods of this type. Its main contribution is a CNN-based technique for predicting the SGM penalty parameters λ1 and λ2 in (40) [69], as illustrated in Fig. 14. A 5×5-pixel gray-scale image patch and its normalized position are used as the CNN inputs. It has (a) two convolution layers, each followed by a rectified linear unit (ReLU) layer; (b) a concatenate layer for merging the two types of inputs; (c) two fully connected


Fig. 13 The architecture of the CNN proposed in [77] for stereo matching

Fig. 14 SGM-Nets [79] architecture

(FC) layers of size 128 each, followed by a ReLU layer and an exponential linear unit (ELU); and (d) a constant layer to keep the SGM penalty values positive. The costs can then be accumulated along four directions. The CNN output values correspond to the standard parameterization.

Recently, end-to-end deep CNNs have become very popular. For example, Mayer et al. [80] created three large synthetic datasets5 (FlyingThings3D, Driving and Monkaa) and proposed a CNN named DispNet for dense disparity estimation. Later on, Pang et al. [81] proposed a two-stage cascade CNN for disparity estimation. Its first stage enhances DispNet [80] by equipping it with extra up-convolution modules, and its second stage rectifies the disparity initialized by the first stage and generates residual signals across multiple scales. Furthermore, GCNet [82] incorporated feature extraction (cost computation), cost aggregation and disparity optimization/refinement into a single end-to-end CNN model, and it achieved the state-of-the-art accuracy on the FlyingThings3D benchmark [80] as well as the KITTI stereo 2012 and 2015 benchmarks [83–85]. In 2018, Chang et al. [86] proposed the Pyramid Stereo Matching Network (PSMNet), consisting of two modules: (a) spatial pyramid pooling and (b) a 3D CNN. The former aggregates the context of different scales and locations, while the latter regularizes the cost volume. Unlike PSMNet [86], the guided aggregation net (GANet) [78] replaces the widely used 3D CNN with two novel layers: a semi-global aggregation layer and a local guided aggregation layer, which help save a lot of memory and computational cost.

Although the aforementioned CNN-based disparity estimation methods have achieved compelling results, they usually have a huge number of learnable parameters, resulting in a long processing time. Therefore, current state-of-the-art CNN-based disparity estimation algorithms have hardly been put into practical use in autonomous driving. We believe these methods will be applied in more real-world applications, with future advances in embedded computing HW.

5 https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html

4.3.4 Performance Evaluation

As discussed above, disparity estimation speed and accuracy are two key properties, and they are always pitted against each other. Therefore, the performance evaluation of a given stereo vision algorithm usually involves both of these two properties [64]. The following two metrics are commonly used to evaluate the accuracy of an estimated disparity image [87]:

1. Root mean squared (RMS) error $e_{\text{RMS}}$:

$$e_{\text{RMS}} = \sqrt{\frac{1}{N} \sum_{\mathbf{p} \in \mathcal{P}} \left| D_E(\mathbf{p}) - D_G(\mathbf{p}) \right|^2}, \tag{41}$$

2. Percentage of error pixels (PEP) $e_{\text{PEP}}$ (tolerance: $\delta_d$ pixels):

$$e_{\text{PEP}} = \frac{1}{N} \sum_{\mathbf{p} \in \mathcal{P}} \delta\!\left( \left| D_E(\mathbf{p}) - D_G(\mathbf{p}) \right| > \delta_d \right) \times 100\%, \tag{42}$$

where $D_E$ and $D_G$ represent the estimated and ground truth disparity images, respectively; $N$ denotes the total number of disparities used for evaluation; and $\delta_d$ represents the disparity evaluation tolerance.

Additionally, a general way to depict the efficiency of an algorithm is given in millions of disparity evaluations per second (Mde/s) [64], as follows:

$$\text{Mde/s} = \frac{u_{\max}\, v_{\max}\, d_{\max}}{t}\, 10^{-6}. \tag{43}$$


However, the speed of a disparity estimation algorithm typically varies across different platforms, and it can be greatly boosted by exploiting the parallel computing architecture.
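A direct translation of (41) and (42) into code might look as follows; this is a sketch under the assumption that the disparity images are stored as row-major float arrays and that invalid pixels have already been excluded from the N evaluated entries.

```cpp
#include <cmath>
#include <cstddef>

// Eq. (41): root mean squared disparity error over N evaluated pixels.
double rms_error(const float* dE, const float* dG, std::size_t N) {
    double sum = 0.0;
    for (std::size_t i = 0; i < N; i++) {
        double diff = dE[i] - dG[i];
        sum += diff * diff;
    }
    return std::sqrt(sum / N);
}

// Eq. (42): percentage of pixels whose error exceeds the tolerance delta_d.
double pep(const float* dE, const float* dG, std::size_t N, double delta_d) {
    std::size_t bad = 0;
    for (std::size_t i = 0; i < N; i++)
        if (std::fabs(dE[i] - dG[i]) > delta_d) bad++;
    return 100.0 * bad / N;
}
```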

5 Heterogeneous Computing Heterogeneous computing systems use multiple types of processors or cores. In the past, heterogeneous computing meant that different instruction-set architectures (ISAs) had to be handled differently, while modern heterogeneous system architecture (HSA) systems allow users to utilize multiple processor types. A typical HSA system consists of two different types of processors: (1) a multi-threading central processing unit (CPU) and (2) a graphics processing unit (GPU) [88], which are connected by a peripheral component interconnect (PCI) express bus. The CPU’s memory management unit (MMU) and the GPU’s input/output memory management unit (IOMMU) comply with the HSA HW specifications. CPU runs the operating system and performs traditional serial computing tasks, while GPU performs 3D graphics rendering and CNN training.

5.1 Multi-threading CPU The application programming interface (API) Open Multi-Processing (OpenMP) is typically used to break a serial code into independent chunks for parallel processing. It supports multi-platform shared-memory multiprocessing programming in C/C++ and Fortran [89]. An explicit parallelism programming model, typically known as a fork-join model, is illustrated in Fig. 15, where the compiler instructs a section of the serial code to run in parallel. The master thread (serial execution on one core) forks a number of slave threads. The tasks are divided to run in parallel amongst the slave threads on multiple cores. Synchronization waits until all slave threads finish their allocated tasks. Finally, the slave threads join together at a subsequent point and resume sequential execution.
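A compact illustration of the fork-join model described above (the toy tasks are our own, not from the chapter): the master thread forks a team, independent tasks run on the slave threads, and the implicit barrier at the end of the parallel region is the join.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("master: serial part\n");          /* serial execution on one core */

    /* Fork: the master thread spawns a team of slave threads. */
    #pragma omp parallel
    {
        /* The sections construct divides independent tasks among the team. */
        #pragma omp sections
        {
            #pragma omp section
            printf("thread %d: task 1\n", omp_get_thread_num());
            #pragma omp section
            printf("thread %d: task 2\n", omp_get_thread_num());
        }
    } /* Join: implicit barrier; the threads synchronize and the master resumes. */

    printf("master: sequential execution resumes\n");
    return 0;
}
```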

5.2 GPU

GPUs have been extensively used in computer vision and deep learning to accelerate computationally intensive but parallel-friendly processing and CNN training. Compared with a CPU, which consists of a small number of cores optimized for sequential serial processing, a GPU has a highly parallel architecture composed of hundreds or thousands of lightweight GPU cores that handle multiple tasks concurrently.


A typical GPU architecture is shown in Fig. 16, which consists of N streaming multiprocessors (SMs) with M streaming processors (SPs) on each of them. The single instruction multiple data (SIMD) architecture allows the SPs on the same SM to execute the same instruction but process different data at each clock cycle. The device has its own dynamic random access memory (DRAM), which consists of global memory, constant memory and texture memory. The DRAM can communicate with the host memory via the graphical/memory controller hub (GMCH) and the I/O controller hub (ICH), which are also known as the Intel northbridge and the Intel southbridge, respectively. Each SM has four types of on-chip memories: register, shared memory, constant cache and texture cache. Since they are on-chip memories, the constant cache and texture cache are utilized to speed up data fetching from the constant memory and texture memory, respectively. Because the shared memory is small, it is only used for the duration of processing a block, and the registers are visible only to their own thread. In CUDA C programming, the threads are grouped into a set of 3D thread blocks, which are then organized as a 3D grid. The kernels are defined on the host using the CUDA C programming language. Then, the host issues the commands that submit the kernels to the devices for execution. Only one kernel can be executed at a given time. Once a thread block is distributed to an SM, its threads are divided into groups of 32 parallel threads, which are executed by SPs. Each group of 32 parallel threads is known as a warp. Therefore, the size of a thread block is usually chosen as a multiple of 32 to ensure efficient data processing.

Fig. 15 Serial processing versus parallel processing

Fig. 16 GPU architecture [24]
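The following CUDA C sketch (a generic element-wise image kernel of our own, not tied to any particular algorithm in this chapter) shows this workflow end to end: a kernel defined on the host, a 2D grid of thread blocks whose 32 x 8 = 256 threads map onto exactly eight 32-thread warps, and kernel submission from the host to the device:

#include <cuda_runtime.h>
#include <cstdio>

// Kernel: each thread inverts one pixel held in the device's global memory (DRAM).
__global__ void invertImage(unsigned char *img, int width, int height) {
    int u = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int v = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (u < width && v < height)
        img[v * width + u] = 255 - img[v * width + u];
}

int main() {
    const int width = 640, height = 480;
    unsigned char *dImg;
    cudaMalloc((void **)&dImg, width * height);   // allocate global memory on the device
    cudaMemset(dImg, 0, width * height);

    dim3 block(32, 8);                            // block size is a multiple of the warp size (32)
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);  // grid of thread blocks covering the image
    invertImage<<<grid, block>>>(dImg, width, height);  // the host submits the kernel
    cudaDeviceSynchronize();

    cudaFree(dImg);
    std::printf("kernel finished\n");
    return 0;
}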

6 Summary

In this chapter, we first introduced the autonomous car system, from both the HW aspect (car sensors and car chassis) and the SW aspect (perception, localization and mapping, prediction and planning, and control). In particular, we introduced the autonomous car perception module, which has four main functionalities: (1) visual feature detection, description and matching, (2) 3D information acquisition, (3) object detection/recognition and (4) semantic image segmentation. Later on, we provided readers with the preliminaries of epipolar geometry and introduced computer stereo vision from theory to algorithms. Finally, a heterogeneous computing architecture, consisting of a multi-threading CPU and a GPU, was introduced.

Acknowledgements This work was supported by the National Key R&D Program of China, under grant No. 2020AAA0108100, awarded to Prof. Qijun Chen. This work was also supported by the Fundamental Research Funds for the Central Universities, under projects No. 22120220184, No. 22120220214, and No. 2022-5-YB-08, awarded to Prof. Rui Fan. This work has also received partial funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 871479 (AERIALCORE).

References

1. R. Fan, J. Jiao, H. Ye, Y. Yu, I. Pitas, M. Liu, Key ingredients of self-driving cars, in 27th European Signal Processing Conference (EUSIPCO) 2019, Satellite Workshop: Signal Processing, Computer Vision and Deep Learning for Autonomous Systems (2019), pp. 1–5
2. Lidar–light detection and ranging–is a remote sensing method used to examine the surface of the earth, in NOAA. Archived from the Original on, vol. 4 (2013)
3. T. Bureau, Radar definition, in Public Works and Government Services Canada (2013)
4. W.J. Westerveld, Silicon photonic micro-ring resonators to sense strain and ultrasound (2014)


5. N. Samama, Global Positioning: Technologies and Performance, vol. 7 (Wiley, 2008)
6. S. Liu, Chassis technologies for autonomous robots and vehicles (2019)
7. M.U.M. Bhutta, M. Kuse, R. Fan, Y. Liu, M. Liu, Loop-box: multiagent direct slam triggered by single loop closure for large-scale mapping. IEEE Trans. Cybern. 52(6), 5088–5097 (2022)
8. R.C. Smith, P. Cheeseman, On the representation and estimation of spatial uncertainty. Int. J. Robot. Res. 5(4), 56–68 (1986)
9. C. Katrakazas, M. Quddus, W.-H. Chen, L. Deka, Real-time motion planning methods for autonomous on-road driving: state-of-the-art and future research directions. Transp. Res. Part C: Emerg. Technol. 60, 416–442 (2015)
10. T.H. Cormen, Section 24.3: Dijkstra's algorithm, in Introduction to Algorithms (2001), pp. 595–601
11. D. Delling, P. Sanders, D. Schultes, D. Wagner, Engineering route planning algorithms, in Algorithmics of Large and Complex Networks (Springer, 2009), pp. 117–139
12. M. Willis, Proportional-integral-derivative control, in Department of Chemical and Process Engineering, University of Newcastle (1999)
13. G.C. Goodwin, S.F. Graebe, M.E. Salgado et al., Control System Design (Prentice Hall, Upper Saddle River, NJ, 2001)
14. C.E. Garcia, D.M. Prett, M. Morari, Model predictive control: theory and practice-a survey. Automatica 25(3), 335–348 (1989)
15. M. Hassaballah, A.A. Abdelmgeid, H.A. Alshazly, Image features detection, description and matching, in Image Feature Detectors and Descriptors (Springer, 2016), pp. 11–45
16. S. Liu, X. Bai, Discriminative features for image classification and retrieval. Pattern Recogn. Lett. 33(6), 744–751 (2012)
17. P. Moreels, P. Perona, Evaluation of features detectors and descriptors based on 3d objects. Int. J. Comput. Vision 73(3), 263–284 (2007)
18. P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: an evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 34(4), 743–761 (2011)
19. M. Danelljan, G. Häger, F.S. Khan, M. Felsberg, Discriminative scale space tracking. IEEE Trans. Pattern Anal. Mach. Intell. 39(8), 1561–1575 (2016)
20. D.G. Lowe, Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
21. H. Bay, T. Tuytelaars, L. Van Gool, Surf: speeded up robust features, in ECCV (Springer, 2006), pp. 404–417
22. E. Rublee, V. Rabaud, K. Konolige, G. Bradski, Orb: an efficient alternative to sift or surf, in International Conference on Computer Vision (IEEE, 2011), pp. 2564–2571
23. S. Leutenegger, M. Chli, R.Y. Siegwart, Brisk: binary robust invariant scalable keypoints, in International Conference on Computer Vision (IEEE, 2011), pp. 2548–2555
24. R. Fan, Real-time computer stereo vision for automotive applications, Ph.D. dissertation, University of Bristol (2018)
25. R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision (Cambridge University Press, 2003)
26. H. Wang, R. Fan, M. Liu, CoT-AMFlow: adaptive modulation network with co-teaching strategy for unsupervised optical flow estimation, in Conference on Robot Learning (PMLR, 2021), pp. 143–155
27. S. Ullman, The interpretation of structure from motion. Proc. R. Soc. Lond. Ser. B. Biol. Sci. 203(1153), 405–426 (1979)
28. B. Triggs, P.F. McLauchlan, R.I. Hartley, A.W. Fitzgibbon, Bundle adjustment-a modern synthesis, in International Workshop on Vision Algorithms (Springer, 1999), pp. 298–372
29. H. Wang, Y. Liu, H. Huang, Y. Pan, W. Yu, J. Jiang, D. Lyu, M.J. Bocus, M. Liu, I. Pitas et al., ATG-PVD: ticketing parking violations on a drone, in European Conference on Computer Vision Workshops (Springer, 2020), pp. 541–557
30. Z.-Q. Zhao, P. Zheng, S.-T. Xu, X. Wu, Object detection with deep learning: a review. IEEE Trans. Neural Netw. Learn. Syst. 30(11), 3212–3232 (2019)


31. D. Wang, C. Devin, Q.-Z. Cai, F. Yu, T. Darrell, Deep object-centric policies for autonomous driving, in 2019 International Conference on Robotics and Automation (ICRA) (IEEE, 2019), pp. 8853–8859
32. A. Mogelmose, M.M. Trivedi, T.B. Moeslund, Vision-based traffic sign detection and analysis for intelligent driver assistance systems: perspectives and survey. IEEE Trans. Intell. Transp. Syst. 13(4), 1484–1497 (2012)
33. B. Wu, F. Iandola, P.H. Jin, K. Keutzer, Squeezedet: unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2017), pp. 129–137
34. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 580–587
35. R. Girshick, Fast R-CNN, in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 1440–1448
36. S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in Advances in Neural Information Processing Systems (2015), pp. 91–99
37. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 779–788
38. J. Redmon, A. Farhadi, Yolov3: an incremental improvement (2018)
39. A. Bochkovskiy, C.-Y. Wang, H.-Y.M. Liao, Yolov4: optimal speed and accuracy of object detection (2020)
40. R. Fan, H. Wang, P. Cai, M. Liu, SNE-RoadSeg: incorporating surface normal information into semantic segmentation for accurate freespace detection, in European Conference on Computer Vision (Springer, 2020), pp. 340–356
41. H. Wang, R. Fan, Y. Sun, M. Liu, Applying surface normal information in drivable area and road anomaly detection for ground mobile robots, in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE, 2020), pp. 2706–2711
42. R. Fan, H. Wang, P. Cai, J. Wu, M.J. Bocus, L. Qiao, M. Liu, Learning collision-free space detection from stereo images: homography matrix brings better data augmentation. IEEE/ASME Trans. Mechatron. 27(1), 225–233 (2021)
43. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3431–3440
44. O. Ronneberger, P. Fischer, T. Brox, U-net: convolutional networks for biomedical image segmentation, in International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer, 2015), pp. 234–241
45. V. Badrinarayanan, A. Kendall, R. Cipolla, Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017)
46. L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, H. Adam, Encoder-decoder with Atrous separable convolution for semantic image segmentation, in ECCV (2018), pp. 801–818
47. M. Yang, K. Yu, C. Zhang, Z. Li, K. Yang, Denseaspp for semantic segmentation in street scenes, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 3684–3692
48. Z. Tian, T. He, C. Shen, Y. Yan, Decoders matter for semantic segmentation: data-dependent decoding enables flexible feature aggregation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 3126–3135
49. R. Fan, H. Wang, M.J. Bocus, M. Liu, We learn better road pothole detection: from attention aggregation to adversarial domain adaptation, in European Conference on Computer Vision Workshops (Springer, 2020), pp. 285–300


50. C. Hazirbas, L. Ma, C. Domokos, D. Cremers, Fusenet: incorporating depth into semantic segmentation via fusion-based CNN architecture, in Asian Conference on Computer Vision (Springer, 2016), pp. 213–228
51. R. Fan, H. Wang, B. Xue, H. Huang, Y. Wang, M. Liu, I. Pitas, Three-filters-to-normal: an accurate and ultrafast surface normal estimator. IEEE Robot. Autom. Lett. 6(3), 5405–5412 (2021)
52. R. Fan, U. Ozgunalp, B. Hosking, M. Liu, I. Pitas, Pothole detection based on disparity transformation and road surface modeling. IEEE Trans. Image Process. 29, 897–908 (2019)
53. R. Fan, M. Liu, Road damage detection based on unsupervised disparity map segmentation. IEEE Trans. Intell. Transp. Syst. 21(11), 4906–4911 (2019)
54. Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, T. Harada, Mfnet: towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes, in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE, 2017), pp. 5108–5115
55. U. Ozgunalp, R. Fan, X. Ai, N. Dahnoun, Multiple lane detection algorithm based on novel dense vanishing point estimation. IEEE Trans. Intell. Transp. Syst. 18(3), 621–632 (2016)
56. R. Fan, N. Dahnoun, Real-time stereo vision-based lane detection system. Meas. Sci. Technol. 29(7), 074005 (2018)
57. J. Jiao, R. Fan, H. Ma, M. Liu, Using dp towards a shortest path problem-related application, in 2019 International Conference on Robotics and Automation (ICRA) (IEEE, 2019), pp. 8669–8675
58. E. Trucco, A. Verri, Introductory Techniques for 3-D Computer Vision, vol. 201 (Prentice Hall, Englewood Cliffs, 1998)
59. R. Fan, J. Jiao, J. Pan, H. Huang, S. Shen, M. Liu, Real-time dense stereo embedded in a UAV for road inspection, in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (IEEE, 2019), pp. 535–543
60. Z. Zhang, A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
61. H.C. Longuet-Higgins, A computer algorithm for reconstructing a scene from two projections. Nature 293(5828), 133–135 (1981)
62. R.I. Hartley, In defense of the eight-point algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 19(6), 580–593 (1997)
63. R. Fan, X. Ai, N. Dahnoun, Road surface 3d reconstruction based on dense subpixel disparity map estimation. IEEE Trans. Image Process. 27(6), 3025–3035 (2018)
64. B. Tippetts, D.J. Lee, K. Lillywhite, J. Archibald, Review of stereo vision algorithms and their suitability for resource-limited systems. J. Real-Time Image Proc. 11(1), 5–25 (2016)
65. W. Luo, A.G. Schwing, R. Urtasun, Efficient deep learning for stereo matching, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 5695–5703
66. D. Scharstein, R. Szeliski, High-accuracy stereo depth maps using structured light, in Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1 (IEEE, 2003), p. I
67. M.G. Mozerov, J. van de Weijer, Accurate stereo matching by two-step energy minimization. IEEE Trans. Image Process. 24(3), 1153–1163 (2015)
68. M.F. Tappen, W.T. Freeman, Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters, in Null (IEEE, 2003), p. 900
69. H. Hirschmuller, Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 30(2), 328–341 (2007)
70. J. Žbontar, Y. LeCun, Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 17(1), 2287–2318 (2016)
71. H. Hirschmuller, D. Scharstein, Evaluation of stereo matching costs on images with radiometric differences. IEEE Trans. Pattern Anal. Mach. Intell. 31(9), 1582–1599 (2008)
72. Q. Yang, L. Wang, R. Yang, H. Stewénius, D. Nistér, Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling. IEEE Trans. Pattern Anal. Mach. Intell. 31(3), 492–504 (2008)


73. Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23(11), 1222–1239 (2001)
74. A.T. Ihler, A.S. Willsky et al., Loopy belief propagation: convergence and effects of message errors. J. Mach. Learn. Res. 6(May), 905–936 (2005)
75. A. Blake, P. Kohli, C. Rother, Markov Random Fields for Vision and Image Processing (MIT Press, 2011)
76. D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vision 47(1–3), 7–42 (2002)
77. J. Žbontar, Y. LeCun, Computing the stereo matching cost with a convolutional neural network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 1592–1599
78. F. Zhang, V. Prisacariu, R. Yang, P.H. Torr, Ga-net: guided aggregation net for end-to-end stereo matching, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 185–194
79. A. Seki, M. Pollefeys, SGM-nets: semi-global matching with neural networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 231–240
80. N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, T. Brox, A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 4040–4048
81. J. Pang, W. Sun, J.S. Ren, C. Yang, Q. Yan, Cascade residual learning: a two-stage convolutional neural network for stereo matching, in Proceedings of the IEEE International Conference on Computer Vision Workshops (2017), pp. 887–895
82. A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, A. Bry, End-to-end learning of geometry and context for deep stereo regression, in Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 66–75
83. A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in 2012 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2012), pp. 3354–3361
84. M. Menze, C. Heipke, A. Geiger, Joint 3d estimation of vehicles and scene flow. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2 (2015)
85. M. Menze, C. Heipke, A. Geiger, Object scene flow. ISPRS J. Photogramm. Remote Sens. 140, 60–76 (2018)
86. J.-R. Chang, Y.-S. Chen, Pyramid stereo matching network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 5410–5418
87. J.L. Barron, D.J. Fleet, S.S. Beauchemin, Performance of optical flow techniques. Int. J. Comput. Vis. 12(1), 43–77 (1994)
88. S. Mittal, J.S. Vetter, A survey of CPU-GPU heterogeneous computing techniques. ACM Comput. Surv. (CSUR) 47(4), 1–35 (2015)
89. H. Jin, D. Jespersen, P. Mehrotra, R. Biswas, L. Huang, B. Chapman, High performance computing using MPI and OpenMP on multi-core parallel systems. Parallel Comput. 37(9), 562–575 (2011)

A Survey on GPU-Based Visual Trackers

Islam Mohamed, Ibrahim Elhenawy, and Ahmad Salah

Abstract The object tracking problem is a key one in computer vision, and it is critical in a variety of applications such as guided missiles, unmanned aerial vehicles, and video surveillance. Despite extensive research on visual tracking, a number of challenges remain during the tracking process, including computationally intensive tasks that make real-time object tracking impossible. By offloading computation to the graphics processing unit (GPU), we may overcome the processing limitations of visual tracking algorithms. In this work, object tracking algorithms that use GPU parallel computing are summarized. Firstly, the related works are briefly discussed. Secondly, object trackers are classified, summarized, and analyzed from two aspects: Single Object Tracking (SOT) and Multiple Object Tracking (MOT). Finally, we go through parallel computing, what it is and how it is used, a strategy for designing a parallel algorithm, various methods for analyzing the performance of parallel algorithms on parallel computers, and how to reformulate computational issues in the language of graphics cards.

Keywords Object tracking · GPU · Single object tracking · Multiple object tracking

1 Introduction

One of the most challenging aspects of computer vision is assisting machines, such as robots [19], computers [1], drones [46], and vehicles [2], in performing the basic functions of the human vision system, such as image understanding [38] and motion analysis [27].

I. Mohamed · I. Elhenawy · A. Salah (B) Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt e-mail: [email protected] I. Mohamed Technical Research and Development Center, Cairo, Egypt A. Salah Department of Information Technology, CAS-Ibri, The University of Technology and Applied Sciences, Muscat, Oman © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. M. Hosny and A. Salah (eds.), Recent Advances in Computer Vision Applications Using Parallel Processing, Studies in Computational Intelligence 1073, https://doi.org/10.1007/978-3-031-18735-3_4


Many works have attempted to track visual objects in order to achieve these tasks of intelligent motion analysis, which has become a high-demand study area in the real-time computer vision field. The basic aim of visual tracking is to identify the trajectory model of a tracked item in each scene of a video sequence (i.e., location, shape, direction, and so on). The use of a GPU with a CPU to speed up computing in tasks that previously relied only on the CPU is known as GPU programming. Despite the fact that GPU programming has only been practicable for the past two decades, it currently has applications in almost every field. GPU programming has been utilized to speed up audio signal processing [9], video coding [24], digital image processing [5], scientific computing [16], medical imaging [44], cryptography [4], and deep learning and computer vision [3]. Object tracking is an important area of research in computer vision that has had extensive applications in the military and other fields. The GPU is an interesting platform for speeding up tracking algorithms due to its massively parallel processing power. The use of the GPU in object tracking algorithms significantly speeds up processing and reduces the execution time. To enable real-time object tracking, tracking algorithms on multi-core GPU architectures can reduce compute time compared to CPU methods. In summary, visual tracking is a technique for locating, detecting, and defining the dynamic configuration of one or more objects in a video sequence captured by one or more cameras. There are two types of trackers: single-object trackers and multi-object trackers. Multiple targets are tracked and their trajectories are followed simultaneously by multi-object trackers [11, 34], whereas single-object trackers [15, 21] track only one object in the sequence. The goal of this study is to review object trackers that use GPU parallel computing technology and their optimal designs. This gives a powerful and extremely fast method for constructing a real-time object tracking system. Because several algorithms have been published every year since the 1990s, this study does not cover all present work in visual tracking. The remainder of the article is structured in the following manner. In Sect. 2, we present a survey on parallel computing, including the main difference between the CPU and GPU, a strategy for designing a parallel algorithm, metrics for performance evaluation, and GPU programming. In Sect. 3, we go over some popular recent single and multiple object trackers that make use of GPU parallel computing technology. In Sect. 4, we conclude the paper.

2 Parallel Computing

Parallel computing has become a popular topic in computer programming, and it has proven essential when looking for high-performance options. The rapid progression of computer architectures (multi-core and many-core) toward a large number of cores demonstrates that parallelism is the preferred strategy for speeding up an operation. The graphics processing unit has played a crucial part in high-performance computing (HPC) throughout the previous decade owing to its huge parallel computing power and low price. For the first time, super-computing has become affordable to everybody for the cost of a desktop computer.


In this work, we take a look at parallel computing in general, and GPU computing in particular. Developing effective GPU-based parallel algorithms is not an easy process; there are a number of technological constraints that must be met in order to get the desired results. Some of these restrictions are due to the GPU's underlying design and theoretical concepts. CPU-based techniques are not quick enough to provide a result in an acceptable length of time for some computational tasks; furthermore, these issues can get so bad that even a multi-core CPU-based solution is not enough. On the other hand, there is no parallel solution to some situations [14]. For example, the Newton-Raphson approximation of √x [30] cannot be parallelized, since each cycle is dependent on the value of the one before it; there is a temporal dependency. Such issues do not benefit from parallelism and should be tackled using a CPU. Alternatively, some issues can be naturally divided into a number of distinct sub-problems; for example, matrix multiplication can be broken down into numerous individual multiply-add calculations. Massively parallel issues like these are typical in computational physics, and they are best tackled using a GPU. Deep-TAMA [40] is a new joint-inference network-based track appearance model. The suggested technique compares two inputs for adaptive appearance modeling, which aids in target-observation matching disambiguation as well as identity consistency consolidation. Yin et al. [39] propose a new MOT framework that combines affinity models and object motion into one unified network, called UMA, to learn a discriminative compact feature for both the affinity measure and object motion. In particular, UMA uses multi-task learning to incorporate metric learning and single object tracking into a single triplet network. Such a design has benefits such as enhanced computing efficiency, reduced memory requirements, and a simpler training method. The model also includes a task-specific attention module that is utilized to improve the learning of task-aware features. The suggested UMA is simple to train from beginning to end, needing a single stage of training.

2.1 The Main Difference Between GPU and CPU

A Central Processing Unit (CPU) is a latency-optimized general-purpose processor that can perform a wide variety of tasks in a sequential fashion, whereas a Graphics Processing Unit (GPU) is a throughput-optimized specialized processor built for high-end parallel computing (Fig. 2). The CPU is your computer's brain. Its main function is to control all aspects of the computer and execute a wide range of software using the fetch-decode-execute cycle. Because it contains a few heavyweight cores with high clock speeds, a CPU is particularly quick at processing data in sequence; it is like a Swiss army knife that can tackle a wide range of jobs. The CPU is designed for latency and may switch between many tasks quickly, giving the sense of parallelism; nonetheless, it is primarily meant to do a single job at a time. The GPU, in contrast, is a specialized processor whose function is to handle memory quickly and accelerate the computer for a variety of activities that require a lot of parallelism.

Fig. 1 GPU cores versus CPU cores

Fig. 2 Code flow in GPU acceleration

A server environment may have 24–48 extremely quick CPU cores, and adding 4–8 GPUs to the server can add up to 40,000 cores (Fig. 1). Single CPU cores are faster and smarter than single GPU cores, but GPUs are more powerful due to the sheer number of cores and the parallelism available. GPUs are ideally suited for applications that demand a high level of parallelism and repetition, such as financial simulations, machine learning, risk modelling, and a variety of other scientific computations.

2.2 Strategy for Designing a Parallel Algorithm

It is not easy to create a new algorithm. It is, in reality, considered a kind of art [18, 29] that requires a mix of creativity, mathematical knowledge, enthusiasm, and perhaps other unclassifiable qualities.

Fig. 3 Foster's methodology

The situation is similar in parallel computing: there is no single rule for building excellent parallel algorithms, but there are a few standardized procedures that are commonly employed for building efficient ones. Leighton [22] has made significant contributions to the subject by demonstrating how architectures, data structures, and algorithms are related. In 1995, Foster [13] identified a four-step technique, consisting of partitioning, communication, agglomeration, and mapping, in many well-designed parallel algorithms (see Fig. 3). Foster's approach is highly suited to computational physics since it naturally solves data-parallel problems. Foster's method, moreover, is effective for building massively parallel GPU-based algorithms. To use this technique, you must first understand how the massive-parallelism programming paradigm maps computing resources to a data-parallel problem, as well as how to overcome technological constraints when programming the GPU.

Partitioning When creating parallel algorithms, the initial step is to divide the issue into smaller pieces. The purpose of partitioning is to identify the optimal split, one that produces the greatest number of smaller pieces (at this moment, communication is not yet being considered). The type of domain must be determined before a good division of an issue can be achieved. When dealing with a data-parallel problem, the data is partitioned and data parallelism is used; when dealing with a task-parallel problem, the functionality is partitioned and task parallelism is applied. The majority of simulation-based computational physics problems are best solved using a data-parallelism method, whereas problems like communication flows, parallel graph traversal, fault tolerance, security, and traffic management are best solved using a task-parallelism approach.

Communication Communication between the sub-problems is defined after partitioning (task or data type). Local communication and global communication are the two forms of communication.


In local communication, sub-problems connect with neighbors by employing a geometric or functional pattern, while broadcasts, reductions, and global variables are all components of global communication. This phase handles all forms of communication difficulties, from race conditions to synchronization boundaries, to guarantee that the computing method remains operational.

Agglomeration Sub-problems may not create enough work to constitute a thread of computation at this level. This feature is described by the granularity of an algorithm. A fine-grained method breaks the issue down into a large number of tiny jobs, boosting parallelism but also increasing the cost of communication; a coarse-grained method splits the issue into fewer but bigger jobs, lowering the communication overhead but also the parallelism. By combining sub-problems into bigger ones, agglomeration aims to find the optimal level of granularity. A multi-core CPU executing a parallel method should yield bigger agglomerations than a GPU performing the same process.

Mapping All agglomerations eventually have to be handled by the cores of the computer. The mapping specifies how agglomerations are distributed across the cores. Mapping, the final step in Foster's method, involves assigning agglomerations to processors in a specific manner. The most basic pattern is a one-to-one geometric mapping between agglomerations and processors, where agglomeration k_i is assigned to processor p_i. High-complexity mapping patterns result in increased hardware overhead and unbalanced work. For complex problems, the goal is to find the most balanced and simple patterns possible.
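To make the four steps concrete, the toy C++/OpenMP sketch below (the global image-sum workload and all names are our own, not from [13]) partitions an image into per-pixel tasks, needs only one global reduction as communication, agglomerates the tasks into one chunk of rows per thread, and leaves the mapping of chunks onto cores to the runtime:

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int rows = 1080, cols = 1920;
    std::vector<int> image(rows * cols, 1);
    long long sum = 0;

    // Partitioning: one task per pixel.  Communication: a single global reduction.
    // Agglomeration: schedule(static) merges the tasks into one row chunk per thread.
    // Mapping: the OpenMP runtime assigns each chunk to a core.
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            sum += image[r * cols + c];

    std::printf("sum = %lld\n", sum);
    return 0;
}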

2.3 Performance Evaluation Metrics

Running Time The running time is the amount of time spent executing an algorithm for a given input on an N-processor parallel computer. It is denoted by T(n), where n is the number of processors used; if n is one, the situation resembles that of a sequential computer. Figure 4 demonstrates the link between the running time and the total number of processors. The graph indicates that the running time at first decreases as the total number of processors grows, but then rises again after a certain optimal level is reached; the overheads associated with expanding the number of processes account for this behavior.

Fig. 4 Running time versus total number of processors

Speed up The speedup is the ratio between the time it takes to perform a particular job using a certain algorithm on a single-processor system (i.e., T(1), where n = 1) and the time it takes to run the same task using the same method on a multiprocessor computer (i.e., T(n)):

$$S(n) = \frac{T(1)}{T(n)}\qquad(1)$$

The speedup essentially tells us how much of a benefit we get when switching from a serial to a parallel computer. Here, T(1) denotes the time it takes to run the task utilizing the most efficient sequential algorithm, i.e., the technique with the lowest time complexity. The association between S(n) and the total number of processors is seen in Fig. 5. A linear arc arises when the speedup is proportional to the total number of processors; however, when parallel overhead exists, the arc becomes sub-linear.

Fig. 5 Speed-up versus total number of processors

Efficiency The efficiency of a parallel computer system, i.e., how well the parallel system's resources are utilized, is another important metric for performance measurement. The efficiency of a task on a parallel computer with p processors can be described as the relative speedup gained when moving the load from a single-processor machine to a p-processor machine where multiple processors are employed for performing the tasks. It is denoted by E(p) and defined as shown below:

$$E(p) = \frac{S(p)}{p}\qquad(2)$$

E(p) is proportional to S(p) and inversely proportional to the number of processors used in the computation. Figure 6 demonstrates the link between E(p) and the total number of processors.

Fig. 6 Efficiency versus total number of processors
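For example (the timings are our own illustrative numbers), a task that takes T(1) = 80 s on one processor and T(8) = 12 s on eight processors has a speedup S(8) = 80/12 ≈ 6.67 and an efficiency E(8) = 6.67/8 ≈ 0.83, i.e., parallel overhead keeps the system below the linear ideal. A minimal C++ helper for Eqs. (1) and (2):

#include <cstdio>

double speedup(double t1, double tn) { return t1 / tn; }                         // Eq. (1)
double efficiency(double t1, double tn, int n) { return speedup(t1, tn) / n; }   // Eq. (2)

int main() {
    const double t1 = 80.0, t8 = 12.0;  // illustrative running times in seconds
    std::printf("S(8) = %.2f, E(8) = %.2f\n", speedup(t1, t8), efficiency(t1, t8, 8));
    return 0;
}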

2.4 GPU Programming

Early attempts to employ GPUs for general-purpose computing involved reformulating the issues in graphics card language, since GPUs comprehend computational problems in terms of graphics primitives. With parallel computing platforms such as Nvidia's CUDA, OpenACC, and OpenCL, GPU-accelerated computation is now considerably easier. These platforms allow developers to focus on higher-level computing concepts instead of the communication gap that arises between the CPU and GPU.

CUDA CUDA (Compute Unified Device Architecture) was first published by Nvidia in 2007 and is now the most widely used proprietary framework. Nvidia describes CUDA as a platform that allows developers to dramatically accelerate computer applications by harnessing the power of GPUs. Without any graphics programming experience, developers can use C, C++, Fortran, or Python to call CUDA. Furthermore, Nvidia's CUDA Toolkit includes everything that a programmer needs to get started developing GPU-accelerated software that outperforms its CPU-based rivals. Microsoft Windows, Linux, and macOS are all supported by the CUDA SDK.


OpenGL, OpenCL, C++ AMP, Microsoft's DirectCompute, and Compute Shaders are among the additional computational interfaces supported by the CUDA platform.

OpenCL OpenCL (Open Computing Language) is a royalty-free, cross-platform standard for the parallel programming of the various hardware used in powerful computers, embedded systems, desktop computers, and portable devices. OpenCL enhances the reliability and performance of a variety of applications in a wide number of sectors, such as image processing, neural network training, and scientific or medical software.

OpenACC OpenACC was created in 2011 by a collaboration of organizations including PGI (the Portland Group), Nvidia, CAPS, and Cray to make the parallel programming of heterogeneous CPU/GPU systems simpler. According to the OpenACC official website, OpenACC is a user-driven, directive-based, performance-portable parallel programming model aimed at scientists and engineers interested in porting their codes to a wide variety of heterogeneous high-performance computing hardware with considerably less manual effort than is needed with a low-level model. Annotations in C, C++, and Fortran source code can be used by OpenACC developers to inform the compiler about which parts of their code should be offloaded to the GPU. The objective is to develop an accelerator programming paradigm that can be used for a wide range of operating systems, host CPUs, and accelerators.
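As a minimal illustration of this directive-based style (the vector-addition workload is our own; the example assumes an OpenACC-capable compiler such as nvc++ with the -acc flag), the annotation below asks the compiler to offload the loop to the accelerator and describes the required data movement:

#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
    float *pa = a.data(), *pb = b.data(), *pc = c.data();

    // copyin/copyout clauses describe host-to-device and device-to-host transfers.
    #pragma acc parallel loop copyin(pa[0:n], pb[0:n]) copyout(pc[0:n])
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];

    std::printf("c[0] = %f\n", c[0]);
    return 0;
}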

3 Levels of Object Tracking Algorithms

Object tracking has been widely researched, with several applications. Many studies are being conducted in the field of visual object tracking. In this part, we survey some of the most important studies that have been done in this field in recent years. Depending on the number of objects being tracked, various levels of object tracking exist.

3.1 Single Object Tracking

Shin et al. [32] propose a tracking technique based on an enhanced kernelized correlation filter (KCF) that combines three main components: (i) tracking failure detection, (ii) re-tracking using alternative search windows, and (iii) analysis of motion vectors to choose the optimal search window. When a target moves a lot, the recommended solution outperforms the traditional KCF in terms of tracking precision, and it also beats a tracker based on deep learning, such as a multi-domain convolutional neural network (MDNet). Sun et al. [35] present an autonomous aerial refueling strategy for unmanned aerial vehicles. The proposed system includes both a faster deep-learning-based detector (DLD) and a more precise reinforcement-learning-based tracker (RLT). By combining the efficient MobileNet with the you-only-look-once design, the DLD enhances detection performance.
The RLT is presented in the tracking stage to acquire the target's location precisely and quickly by hierarchically placing and changing the bounding box of the target based on reinforcement learning. Drogue object tracking achieves an accuracy of 98.7%, which is significantly greater than the other comparative techniques, and the network can reach 15 frames per second on a GPU Titan X. HSA [31] is a honeybee search algorithm, a GPU-parallelized swarm method for three-dimensional reconstruction that combines swarm intelligence and evolutionary algorithm concepts. This heuristic, which was inspired by honeybees' search for food, has been expanded from HSA's initial proposal to video processing, and it is currently applied to the problem of object tracking utilizing GPU parallel computing. Yusuf et al. [42] offer a vehicle video-based automatic real-time lane recognition and tracking technique. The videos are processed frame by frame. Each frame was converted to grayscale to eliminate unwanted data and enhance the implementation's efficiency. Gaussian blur is used to eliminate the noise from each frame; this step is critical because it aids the implementation's accuracy. Canny edge detection has been used to find possible road-lane edges in the frame. To construct a connected line, the Hough transform was applied to the detected edges; this step was also used to reject certain edges that were not needed. Siam-OS [20] is a visual object tracker based on a Siamese network that effectively estimates the target's angle and scale. Siam-OS is based on Siam-BM; however, it uses just one search region to decrease the quantity of computations needed to get deep features, cropping and rotating the large-scale feature map at various angles for the different angle and scale estimates. Mao et al. [25] build the core of their tracker on the Siamese network. Because the convolution layers that are utilized to extract features are the most intensive, more consideration should be given to them in order to enhance tracking efficiency. They improved the standard convolution by using separable convolution, which consists mostly of depth-wise and point-wise convolution. The depth-wise convolution layer's filters are pruned using the filters' variance to further decrease the computation. Because the convolution layers have various weight distributions, a hyper-parameter controls the filter pruning. Cao et al. [6] create a GPU-accelerated feature tracking (GFT) approach for large-scale structure-from-motion (SFM)-based 3D reconstruction. A GPU-based difference of Gaussian (DOG) keypoint detector, k-nearest matching, outlier removal, and a RootSIFT descriptor extractor are all included in the proposed GFT technique. Firstly, their DOG implementation on the GPU can identify hundreds of keypoints in real time, 30 times quicker than the original CPU version. Secondly, the GPU-based RootSIFT descriptor can handle thousands of descriptors in real time. Finally, their descriptor matching on the GPU is 10 times quicker than existing methods. SiamMask E [7] is a new technique for real-time object tracking that employs ellipse fitting to determine the size and rotation angle of the target segmentation's bounding box. SiamMask E enhances the bounding-box fitting procedure of the SiamMask algorithm while maintaining a high tracking frame rate (80 frames per second) on a GPU (at least a GeForce GTX 1080 Ti). Chen et al. [8] introduce a unique attention-based feature fusion network that successfully integrates the search region and template features.
The suggested method contains a cross-attention-based cross-feature enhance module and a self-attention-based ego-context augment module. Menezes et al. [26] suggest employing the YOLO technique on an NVIDIA GeForce GTX 1060 to track mice. They analyzed 13,622 images made up of behavioral videos from three major studies in the field. A total of 50% of the images were used for the training set, 25% for testing, and 25% for validation. According to the findings, the developed system's mean Average Precision (mAP) for the YOLO Full and Tiny versions was 90.79% and 90.75%, respectively. It was also discovered that the Tiny version is a good alternative for experiments that demand a real-time response. The method enables researchers to execute accurate mouse tracking without the requirement for delimiting regions of interest (ROI) or attaching invasive light identifiers, such as LEDs, for animal tracking. Taygan et al. [36] perform tests on the Efficient Convolution Operators object tracking method to determine the algorithm's time-consuming parts, known as bottlenecks, and to study the feasibility of GPU parallelization of the bottlenecks to enhance the algorithm's speed. The candidate techniques are then built and parallelized using the Compute Unified Device Architecture. Liang et al. [23] describe how to use GPU-accelerated concurrent robot simulations and derivative-free, sample-based optimizers to monitor in-hand object locations with contact feedback during manipulation. They employ physics simulation as the forward model for robot-object interactions, and the strategy refines the simulation's state and parameters at the same time, making them more realistic. This approach works at 30 frames per second on a single GPU and generates a point-cloud distance error of 6 mm in simulations and 13 mm in real-world testing.

3.2 Multiple Object Tracking (MOT)

In this part, we review the main works on MOT. Urbann et al. [37] provide a method for tracking in a surveillance scenario. Typical aspects of this scenario are 24/7 operation with a static camera positioned above the height of a human, with many objects or people. This scenario is best represented by the Multiple Object Tracking Benchmark 20 (MOT20). On this benchmark, they demonstrate that the method is real-time capable and outperforms all other real-time capable approaches in HOTA, MOTA, and IDF1. This is achieved by contributing a fast Siamese network that generates fingerprints from detections in a linear (instead of quadratic) runtime. As a result, detections can be linked to Kalman filters based on multiple tracking-specific ratings: cosine similarity of fingerprints, intersection over union, and the pixel distance ratio in the image. To solve a novel LiDAR-based online multiple maneuvering vehicle tracking problem, Luo et al. [45] offer a unique online multi-model smooth variable structure filter. Kim et al. [17] introduce a unique object-based real-time multiple pedestrian tracking system.
During the tracking process, the proposed approach determines if an item identified in the current frame matches one detected in the preceding frame, and it updates the coordinates of the numerous objects as well as the relevant histograms. Zhang et al. [43] present a very efficient Deep Neural Network (DNN) that models the association between an unlimited list of objects at the same time; the number of objects has no effect on the DNN's inference computation. Frame-wise Motion and Appearance (FMA) is a model that computes the Frame-wise Motion Fields (FMF) between two frames, allowing for very quick and effective matching of a large number of object bounding boxes, while Frame-wise Appearance Features (FAF) are acquired in parallel with the FMFs. Yu et al. [41] present how to anticipate complicated movements in UAV videos using a model based on Conditional Generative Adversarial Networks (GANs). Individual movements and global motions are used to describe the motions of the objects and the UAV motion, respectively; they are mutually beneficial and are used in tandem to improve motion prediction accuracy. Specifically, individual item motion is predicted using a social Long Short-Term Memory network, global motion is produced using a Siamese network to represent UAV perspective changes, and the final motion affinity is constructed using a conditional GAN. A range of objects and four distinct types of object detection inputs were used in a variety of tests leveraging existing UAV datasets. When compared to state-of-the-art approaches, robust motion prediction and enhanced MOT performance are obtained. Sun et al. [33] use deep learning to model object appearances and their affinities across frames in an end-to-end manner, maximizing the reliability of data association in tracking. The proposed Deep Affinity Network (DAN) learns compact, yet comprehensive, characteristics of pre-detected objects at many abstraction levels and executes exhaustive pairing permutations of those features in any two frames. DAN additionally takes into account objects that appear and disappear across video frames. The resulting fast affinity computations are used to link objects in the current frame to those in preceding frames for accurate online tracking. Forero et al. [12] provide a method that classifies, tracks, and counts cars and people in video sequences. The approach is split into two parts: a convolutional neural network-based classification algorithm that is implemented using the You Only Look Once (YOLO) approach, and a proposed methodology for tracking regions of interest based on a well-defined taxonomy. DeepScale [28] is a model-independent frame-size selection method that works on top of fully convolutional network-based trackers to increase tracking throughput. The authors include detectability scores in a one-shot tracker architecture in the training step so that DeepScale may self-learn representation estimations for various frame sizes. It can discover an acceptable trade-off between tracking precision and speed by adjusting frame sizes at run time during inference, based on user-controlled parameters. FAMNet [10] is a comprehensive model that combines feature extraction, multidimensional assignment, and affinity estimation in one network. FAMNet's layers are built to be differentiable, allowing them to be optimized jointly to train discriminative features as well as a higher-order affinity model for reliable MOT, which is supervised directly by the assignment ground-truth loss.
In addition, to retrieve false negatives and prevent noisy target candidates formed by the external detector, the authors combine a dedicated target management scheme and a single-object tracking approach into the FAMNet-based tracking system. Zou et al. [47] aim to track several vehicles in the front view of an onboard monocular camera. The vehicle detection probes are designed to produce accurate detections, which is important in the subsequent tracking-by-detection approach. To calculate pairwise appearance similarity, a new Siamese network with a spatial pyramid pooling (SPP) layer is used. The relative motions and aspects are provided by the motion model obtained from the refined bounding box. To ensure long-term, robust tracking, each tracking period is treated as a Markov decision process (MDP) by the online-learned policy.

4 Conclusion

In this chapter, we took an in-depth look at object trackers that use the GPU, along with a quick overview of related subjects. We split object trackers into two types based on the level of the object tracking algorithm, namely, Single Object Tracking (SOT) and Multiple Object Tracking (MOT), and we reviewed object trackers that use GPU parallel computing and their optimal designs. This gives a powerful and extremely fast method for constructing a real-time object tracking system. In addition, we gave a collection of theoretical and technological principles that are commonly needed in order to understand the GPU. Familiarity with the GPU architecture and its massive-parallelism programming approach helps solve many technical limitations, build better GPU-based algorithms for computational physics problems, and achieve speedups of up to two orders of magnitude over sequential solutions. Finally, we went through how to construct a parallel algorithm, several metrics for assessing the performance of parallel algorithms on parallel processors, and how to reformulate computational problems in graphics card language.

References 1. O. Appiah, M. Asante, J.B. Hayfron-Acquah, Improved approximated median filter algorithm for real-time computer vision applications. J. King Saud Univ. - Comput. Inf. Sci. (2020) 2. S. Arabi, A. Haghighat, A. Sharma, A deep-learning-based computer vision solution for construction vehicle detection. Comput.-Aided Civil Infrastruct. Eng. 35(7), 753–767 (2020) 3. L. Barba-Guaman, J.E. Naranjo, A. Ortiz, Deep learning framework for vehicle and pedestrian detection in rural roads on an embedded GPU. Electronics 9(4), 589 (2020) 4. S. Bhattacharjee, D.M. Chakkaravarhty, M. Chakkaravarty, L.B.A. Rahim, A.W. Ramadhani, A GPU unified platform to secure big data transportation using an error-prone elliptic curve cryptography, in Data Management, Analytics and Innovation (Springer Singapore, 2020), pp. 263–280 5. A. Blug, D.J. Regina, S. Eckmann, M. Senn, A. Bertz, D. Carl, C. Eberl, Real-time GPU-based digital image correlation sensor for marker-free strain-controlled fatigue testing. Appl. Sci. 9(10), 2025 (2019)

84

I. Mohamed et al.

6. M. Cao, W. Jia, S. Li, Y. Li, L. Zheng, X. Liu, GPU-accelerated feature tracking for 3d reconstruction. Opt. & Laser Technol. 110, 165–175 (2019) 7. B.X. Chen, J. Tsotsos, Fast visual object tracking using ellipse fitting for rotated bounding boxes, in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW) (IEEE, 2019) 8. X. Chen, B. Yan, J. Zhu X. Yang, H. Lu, Transformer Tracking (Dong Wang, 2021) 9. K. Choi, D. Joo, J. Kim, Kapre: on-gpu audio preprocessing layers for a quick implementation of deep neural network models with keras (2017) 10. P. Chu, H. Ling, Famnet: joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking (2019) 11. P. Dai, R. Weng, W. Choi, C. Zhang, Z. He, W. Ding, Learning a proposal classifier for multiple object tracking (2021) 12. A. Forero, F. Calderon, Vehicle and pedestrian video-tracking with classification based on deep convolutional neural network, in XXII Symposium on Image (Signal Processing and Artificial Vision (STSIVA) (IEEE, 2019), p. 2019 13. I. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering (Addison-Wesley, Reading, Mass, 1995) 14. R. Greenlaw, Limits to Parallel Computation: P-Completeness Theory (Oxford University Press, New York, 1995) 15. S. Jiang, B. Xu, J. Zhao, F. Shen, Faster and simpler siamese network for single object tracking (2021) 16. P. Kang, S. Lim, A taste of scientific computing on the GPU-accelerated edge device. IEEE Access 8, 208337–208347 (2020) 17. D. Kim, H. Kim, J. Shin, Y. Mok, J. Paik, Real-time multiple pedestrian tracking based on object identification, in 2019 IEEE 9th International Conference on Consumer Electronics (ICCE-Berlin) (IEEE, 2019) 18. D.E. Knuth, Computer programming as an art. Commun. ACM 17(12), 667–673 (1974) 19. S. Kulik, A. Shtanko, Using convolutional neural networks for recognition of objects varied in appearance in computer vision for intellectual robots. Proc. Comput. Sci. 169, 164–167 (2020) 20. D.-H. Lee, One-shot scale and angle estimation for fast visual object tracking. IEEE Access 7, 55477–55484 (2019) 21. D-H. Lee. CNN-based single object detection and tracking in videos and its application to drone detection. Multimedia Tools and Applications (2020) 22. F. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes (M. Kaufmann Publishers, San Mateo, Calif, 1992) 23. J. Liang, A. Handa, K. Van Wyk, V. Makoviychuk, O. Kroemer, D. Fox, In-hand object pose tracking via contact feedback and GPU-accelerated robotic simulation (2020) 24. F. Luo, S. Wang, S. Wang, X. Zhang, S. Ma, W. Gao, GPU-based hierarchical motion estimation for high efficiency video coding. IEEE Trans. Multimed. 21(4), 851–862 (2019) 25. Y. Mao, Z. He, Z. Ma, X. Tang, Z. Wang, Efficient convolution neural networks for object tracking using separable convolution and filter pruning. IEEE Access 7, 106466–106474 (2019) 26. R. Santiago T. De Menezes, J.V. Alves Luiz, A.M. Henrique-Alves, R.M. Santa Cruz, H. Maia, Mice tracking using the YOLO algorithm, in Anais do Seminário Integrado de Software e Hardware (SEMISH 2020). Sociedade Brasileira de Computação - SBC (2020) 27. I. Mutis, A. Ambekar, V. Joshi, Real-time space occupancy sensing and human motion analysis using deep learning for indoor air quality control. Autom. Construct. 116, 103237 (2020) 28. K. Nalaie, R. 
Zheng, Deepscale: an online frame size adaptation framework to accelerate visual multi-object tracking (2021) 29. Stan Openshaw, High Performance Computing and the Art of Parallel Programming: An Introduction for Geographers, Social scientists, and Engineers (Routledge, London New York, 2000) 30. H.A. Peelle, To teach newton’s square root algorithm. ACM SIGAPL APL Quote Quad 5(4), 48–50 (1974)


31. O.E. Perez-Cham, C. Puente, C. Soubervielle-Montalvo, G. Olague, C.A. Aguirre-Salado, A.S. Nuñez-Varela, Parallelization of the honeybee search algorithm for object tracking. Appl. Sci. 10(6), 2122 (2020)
32. J. Shin, H. Kim, D. Kim, J. Paik, Fast and robust object tracking using tracking failure detection in kernelized correlation filter. Appl. Sci. 10(2), 713 (2020)
33. S. Sun, N. Akhtar, H. Song, A. Mian, M. Shah, Deep affinity network for multiple object tracking (2018)
34. S. Sun, N. Akhtar, X. Song, H. Song, A. Mian, M. Shah, Simultaneous detection and tracking with motion modelling for multiple object tracking, in Computer Vision – ECCV 2020 (Springer International Publishing, 2020), pp. 626–643
35. S. Sun, Y. Yin, X. Wang, X. De, Robust visual detection and tracking strategies for autonomous aerial refueling of UAVs. IEEE Trans. Instrum. Meas. 68(12), 4640–4652 (2019)
36. U. Taygan, A. Ozsoy, Performance analysis and GPU parallelisation of ECO object tracking algorithm. New Trends Issues Proc. Adv. Pure Appl. Sci. 12, 109–118 (2020)
37. O. Urbann, O. Bredtmann, M. Otten, J.-P. Richter, T. Bauer, D. Zibriczky, Online and real-time tracking in a surveillance scenario (2021)
38. Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, M. Zhou, LayoutLM: pre-training of text and layout for document image understanding, in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (ACM, 2020)
39. J. Yin, W. Wang, Q. Meng, R. Yang, J. Shen, A unified object motion and affinity model for online multi-object tracking (2020)
40. Y.-C. Yoon, D.Y. Kim, Y.M. Song, K. Yoon, M. Jeon, Online multiple pedestrians tracking using deep temporal appearance matching association (2019)
41. H. Yu, G. Li, S. Li, B. Zhong, H. Yao, Q. Huang, Conditional GAN based individual and global motion fusion for multiple object tracking in UAV videos. Pattern Recognit. Lett. 131, 219–226 (2020)
42. A. Yusuf, S. Alawneh, GPU implementation for automatic lane tracking in self-driving cars, in SAE Technical Paper Series (SAE International, 2019)
43. J. Zhang, S. Zhou, J. Wang, D. Huang, Frame-wise motion and appearance for real-time multiple object tracking (2019)
44. Q. Zhang, C. Bai, Z. Liu, L.T. Yang, H. Yu, J. Zhao, H. Yuan, A GPU-based residual network for medical image classification in smart medicine. Inf. Sci. 536, 91–100 (2020)
45. Y. Zhang, Y. Tang, B. Fang, Z. Shang, Multi-object tracking using deformable convolution networks with tracklets updating. Int. J. Wavelets Multiresolut. Inf. Process. 17(06), 1950042 (2019)
46. P. Zhu, L. Wen, D. Du, X. Bian, Q. Hu, H. Ling, Vision meets drones: past, present and future (2020)
47. Y. Zou, W. Zhang, W. Weng, Z. Meng, Multi-vehicle tracking via real-time detection probes and a Markov decision process policy. Sensors 19(6), 1309 (2019)

Accelerating the Process of Copy-Move Forgery Detection Using Multi-core CPUs Parallel Architecture Hanaa M. Hamza, Khalid M. Hosny, and Ahmad Salah

Abstract Copy-move forgery is a popular type of forgery in digital images, in which one part of an image is replicated at different locations within the same image. It is easy to perform with any of the powerful image processing software packages, and because it is so simple, anyone with low skills can forge many images in no time. Many copy-move forgery detection techniques are presented every day, but their main common drawback is the high computation time needed to detect the final forged region, due to the repetition of many complex computational processes. One suggested way to solve this problem is to use parallel architectures; however, this direction is still in its initial phase, with little research presented. In this work, copy-move forgery detection methods are discussed from the standpoint of using parallel architectures. To solve the computation time problem of previous related methods, parallel solutions are needed to accelerate the process of copy-move forgery detection. Multi-core CPUs are a powerful parallel processing architecture; thus, a new Parallel Copy-Move Image Forgery Detection (PCMIFD) system using multi-core CPUs is proposed. The proposed method has been tested and achieves a speedup of up to 7.4. Moreover, the measured results come close to the theoretical speedup rate.

Keywords Copy-move forgery · Accelerating · Image processing · Parallel computing

H. M. Hamza · K. M. Hosny (B) Department of Information Technology, Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt e-mail: [email protected] A. Salah Department of Computer Science, Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt Department of Information Technology, CAS-Ibri, University of Technology and Applied Sciences, Muscat, Oman © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. M. Hosny and A. Salah (eds.), Recent Advances in Computer Vision Applications Using Parallel Processing, Studies in Computational Intelligence 1073, https://doi.org/10.1007/978-3-031-18735-3_5




1 Introduction

Due to the rapid development and simple operation of image processing software, which enables non-specialists to manipulate digital photos and quickly forge their contents, forging digital images has become easy. Anyone may tamper with and maliciously edit digital photographs using well-known tools such as Photoshop and Freehand. Whenever digital photographs from the internet or another source are used, there is a chance that the images are fake. Since detecting forged images is a difficult task, accurate forgery detection methods and techniques are urgently required to examine the integrity of photographs. According to Fig. 1, there are two primary categories of digital image forgery techniques: active and passive. In the active technique [1–4], a watermark is embedded into the input image as it is being created. Data tampering in the passive technique involves information display errors or data concealment [5]. The three major classes of the prevalent passive forging techniques are image splicing, image retouching, and copy-move [6]. Image splicing is the process of combining elements from the original image with elements from other photos to create a new fake image; by adding a portion from another image, forgers can conceal or alter the content of the original. Image retouching techniques concentrate on modifying the characteristics of digital photographs without significantly changing their content. Image manipulation includes a wide range of digital image pre-processing methods, such as image enhancement to emphasize desirable elements; digital image alteration includes changing the colors, contrast, white balance, and sharpness, for instance.

Fig. 1 Classification of image forgery [7]



Additional instances of digital image alteration include denoising and eliminating objects or obvious faults in materials or skin [8]. By copying and moving a specific portion of an image to another portion of the same image, copy-move methods allow forgers to conceal or duplicate parts of an image. The scientific community is becoming more and more interested in all digital image forgery (DIF) disciplines; in this chapter, however, we concentrate on the copy-move forgery detection problem. Digital watermarking is one of the best-known active forging techniques. It can be described as a method of concealing digital data, and it consists of three main stages: embedding, attack, and detection. A watermark signal is first constructed and embedded with the help of the host signal (the original data); the sender then transmits the watermarked signal to the recipient across the transmission medium [9]. The approach makes use of specialized hardware, and the embedded information is used to verify validity and integrity. Its benefit is that it requires little computation and is straightforward to apply if the original image is known. However, the approach has several drawbacks [4]:

1. Embedding data into an image decreases the image quality.
2. Human intervention or specially equipped cameras are required.
3. Many devices do not have an embedding function.
4. Watermarks are easy to attack and destroy.
5. Extra bandwidth is required for transmission.

Owing to these drawbacks, research moved on to the next phase, and passive methods came into focus.

2 Background

2.1 Types of Image Forgery

2.1.1 Copy-Move Attack

By copying and moving a specific portion of a picture to another portion of the same image, forgers can fool the viewer by concealing or duplicating a portion of the image [4, 10, 11]. Two instances of duplicated objects are shown in Fig. 2, while more examples of hidden objects are shown in Fig. 3.

2.1.2 Image Splicing

In image splicing, the original image and additional photos are combined to create a new fake image. The forgers can conceal or alter the content of the original image by adding a portion from another image to it [6, 12, 13]. Examples of splicing are shown in Fig. 4, where a portion of image (a) is combined with a portion of image (b) to form image (c).

Fig. 2 Copy-move attack examples for duplicating objects. a Left are the original images, b right are the manipulated images

2.1.3 Image Retouching

Image retouching techniques concentrate on modifying the characteristics of digital photographs without significantly changing their content. Image manipulation includes a wide range of digital image pre-processing methods, such as image enhancement to emphasize desirable elements. Digital image alteration includes changing the colors, contrast, white balance, and sharpness, for instance. Other instances of digital image alteration include denoising and eliminating objects or obvious faults in skin or materials [8, 14]. A sample of an original image (a) and its retouched version (b) is shown in Fig. 5.



Fig. 3 Copy-move attack examples for hiding objects. a Left are the original images, b right are the manipulated images

2.2 Copy-Move Forgery Detection Techniques

Numerous copy-move forgery detection (CMFD) techniques have been put forward during the past 20 years. In general, copy-move forgery detection involves five key steps [1, 15]:

(1) The input image is preprocessed in the first phase to improve its characteristics. Preprocessing includes color conversions, image scaling, low-pass filtering, etc.
(2) In the second phase, one of the two fundamental approaches, block-based or key-point-based, is chosen.
a. In the block-based approach, the image is divided into overlapping or non-overlapping blocks. A feature vector is then computed for each block and stored as a row, and similar features are typically sorted lexicographically.
b. In the key-point-based approach, the image is scanned for key points, a feature vector is calculated for each key point, and the location of each key point is stored. The best-bin-first approach is frequently used in feature matching to obtain the approximate nearest neighbor.



Fig. 4 Splicing example, a Original image1 b Original image2 c Spliced image

(3) Feature extraction: various features are extracted according to the approach chosen in the previous phase. In block-based approaches, feature vectors such as DCT coefficients or Zernike moments are extracted for each block; in key-point approaches, feature vectors such as SIFT and SURF descriptors are calculated only at the detected key points.
(4) In the fourth phase, the feature vectors are matched to determine whether duplicated areas exist.
(5) In the fifth step, the duplicated areas, if they exist, are highlighted and presented.
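To make the block-based matching idea in steps (2)–(4) concrete, the following is a minimal sketch, with hypothetical type and function names, of how per-block feature vectors can be sorted lexicographically so that duplicated blocks become adjacent rows; it is illustrative only, not the implementation evaluated later in this chapter.

```cpp
#include <algorithm>
#include <vector>

// One row per block: the quantized feature vector plus the block's
// top-left coordinates, kept so a match can be traced back to pixels.
struct BlockRow {
    std::vector<int> features; // quantized coefficients (e.g., DCT)
    int x, y;                  // block position in the image
};

// Sort rows lexicographically by feature vector; identical or very
// similar blocks end up next to each other, so the matching phase
// only needs to compare adjacent rows.
void lexicographicSort(std::vector<BlockRow>& rows) {
    std::sort(rows.begin(), rows.end(),
              [](const BlockRow& a, const BlockRow& b) {
                  return std::lexicographical_compare(
                      a.features.begin(), a.features.end(),
                      b.features.begin(), b.features.end());
              });
}
```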

2.3 OpenMP

OpenMP (Open Multi-Processing) is a C, C++, and FORTRAN library that enables shared-memory programming on multi-core and multiprocessor machines [16]. The library has a set of built-in directives and library routines to distribute a set of tasks over the CPU cores, lifting the burden of creating and managing threads from the programmer's shoulders. OpenMP follows the single instruction, multiple data parallel model: the programmer writes one code segment that runs identically on each core/thread, while each thread handles a different partition of the data.



Fig. 5 Retouching example, a Original image b Retouched image

Threads communicate through shared memory. In addition, OpenMP provides mechanisms to synchronize the running threads and schedule the tasks [17].
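As a minimal illustration of this model (a toy example, not the experimental code used later in this chapter), a single directive distributes the iterations of a loop across the available cores, and the reduction clause handles the synchronization of the per-thread partial results:

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<double> data(n, 1.0);
    double sum = 0.0;

    // Each thread processes its own partition of the iteration space;
    // reduction(+ : sum) gives every thread a private partial sum and
    // combines them safely at the implicit barrier ending the loop.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i) {
        sum += data[i] * data[i];
    }

    std::printf("sum = %.1f using up to %d threads\n",
                sum, omp_get_max_threads());
    return 0;
}
```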

3 Related Work

3.1 Block-Based Methods

Common to all these methods is the division of the image into a number of small blocks, in an overlapping or non-overlapping way; the feature extraction step and the steps that follow are performed over blocks, not over the whole image [18]. In [19], Fridrich et al. made one of the earliest efforts to employ the block-based technique, examining the DCT coefficients of each block. Duplicated areas are then found by lexicographically sorting the DCT coefficients and grouping similar blocks in the picture with the same spatial offset. Popescu and Farid used principal component analysis (PCA) on small fixed-size image blocks in an analogous technique [20] to produce a reduced-dimension representation. Ryu et al. [21] proposed a method using the magnitude of Zernike moments, aimed at detecting copy-rotate-move forgery. This method has desirable properties such as rotation invariance and robustness to noise, but it was found to be weak against scaling and affine transformations. A more effective approach based on the discrete cosine transform (DCT) was created by Huang et al. [22]. Their approach divides the image into overlapping, fixed-size blocks, from which the characteristics of each block are extracted using the DCT. Huang and his co-authors reduced the dimension of the features by truncation to decrease the execution time of sorting and matching, and then sorted the feature vectors lexicographically, so that duplicated image blocks become neighbors in the sorted list and can be compared in the matching stage. Another Fourier-transform-based approach to spotting forgeries was put forward by Sadeghi et al. [23], in which the picture is converted to grayscale and, where required, reduced in size. Correlation, which can be used to find similar features within an image, was performed on the input digital image using the Fourier transform; the method finds similar correlation values within the image, converts the result to a matrix of real numbers, and displays the locations of matched blocks. Yuenan Li [24] suggested a copy-move forgery detection method based on the polar cosine transform (PCT) and approximate nearest neighbor searching. The picture is first divided into overlapping patches, and the rotationally invariant, orthogonal characteristics of the PCT are utilized to extract robust and compact features from the patches. Potential copy-move pairs are then found by identifying patches with comparable characteristics, using approximate nearest neighbor searching carried out with locality-sensitive hashing (LSH). Potential pairs are then subjected to post-verification techniques to weed out erroneous matches and raise the precision of forgery detection. Lee et al. [25] computed a histogram of oriented gradients for each block of the input image; statistical characteristics are extracted and reduced to make the similarity assessment easier. Al-Qershi et al. [26], in their survey of the state of the art, divided the feature extraction methods used in CMFD into eight groups (DCT, log-polar transform, texture and intensity, invariant key-point, invariant moment based, PCA, SVD, and others). They demonstrated how using lower feature vector sizes can minimize the CMFD's complexity and execution time, and how implementing feature extraction methods that are resistant to a larger variety of attacks, such as rotation and scaling, strengthens the CMFD's robustness. Additionally, they noted that the majority of existing CMFD procedures are very time consuming.



Instead of employing indirect comparisons based on block attributes, Lynch et al. [27] suggested an efficient expanding-block method that primarily uses direct block comparison. Blocks of the picture overlap one another, and the average of the gray-level values in each block is calculated as a dominant characteristic. The blocks are sorted and put into buckets, with the blocks in each bucket sharing common characteristics, and only blocks from the same bucket are compared. The technique employs a statistical hypothesis test based on the pixel mean value: a block is removed from the bucket if it does not match any other block there. Starting from a small region, blocks with no matches are removed, the search zone is enlarged, and the comparison is resumed. Therefore, as the region grows, fewer blocks remain in the bucket; the remaining blocks are then regarded as being part of the copied region.

A detection method based on applying local binary pattern variance (LBPV) over the low-approximation components of the stationary wavelet transform was suggested by Mahmood et al. [28]. This approach is applied to grayscale pictures after they are divided into circular regions, in order to handle potential post-processing operations, including flipping, blurring, rotation, scaling, color reduction, brightness adjustment, and multiple cloning, more effectively. The rotation-invariant features produced by LBPV significantly enhance CMFD techniques.

A block-matching technique using polar representation to obtain representative features for each block was presented by Fadl and Semary [29]. Blocks from the input picture overlap one another; each block is then converted from the Cartesian system to the polar system, and a 1D Fourier transform is computed as the representative feature of each column of the polar blocks. Radix sort is used to group columns of close-proximity blocks, and the correlation between blocks in each group is then computed.

Using symmetry-based local characteristics, Vaishnavi and Subashini [30] devised a method to identify copy-move forgeries. Feature matching is performed to find the matching points of forged regions and to remove false matches. Random Sample Consensus (RANSAC) is applied iteratively until all matched points are fitted to the corresponding copied and relocated regions, with clustering used during the iterations to distinguish copied and relocated points from the original ones. The transformation parameters of each cluster are calculated, and if forging is found, the copy-moved areas are localized.

Meena and Tyagi [31] suggested using Gaussian-Hermite moments (GHM) to identify copy-move forgeries. The Gaussian-Hermite moments are computed over overlapping fixed-size blocks, and similar blocks are matched by lexicographically sorting all the features. In [32], Gulivindala Suresh and Chanamallu Srinivasa Rao published a technique for detecting faked photographs that makes use of Optimal Weighted Color and Texture Features (OWCTF), in which the Firefly Algorithm (FA) is investigated to discover the nonlinear link between color and texture characteristics.



3.2 Key-Point Based Methods

The primary concept behind key-point-based approaches is that, in order to maximize computing efficiency, work is done across the entire image rather than on subdivided blocks. These techniques calculate their features over the high-entropy areas of a picture. Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF) are two techniques frequently employed in this approach.

A region-duplication detection approach that is resistant to distortions of the duplicated areas was developed by Pan and Lyu [33]. Their approach begins by estimating the transform between matched SIFT key points, which are resistant to geometrical and lighting distortions, and then identifies all pixels within the duplicated areas after discounting the estimated transforms. The suggested approach successfully detects an automatically generated database of counterfeit pictures with duplicated and distorted regions. It does have some drawbacks, though, due to the SIFT algorithm's inability to identify trustworthy key points in areas with few visual structures; similarly, smaller areas are difficult to find since they contain fewer key points.

Another SIFT-based technique was presented by Amerini et al. [34]. This technique makes it possible not only to understand whether a copy-move attack has taken place but also to recover the geometric transformation used to carry out the cloning. Through several trials, they demonstrated that the approach can accurately identify the transformed region and, in addition, estimate the geometric transformation parameters with high reliability. Multiple copy-move forgeries are detected by conducting a rigorous feature matching process and then clustering the key-point coordinates to separate the different duplicated regions.

In the earlier techniques, a clustering operation is frequently applied after SIFT matching to group key points that are close in space. This may not always succeed, especially when the copied zone comprises pixels that are spatially far from one another and when the pasted area is close to the original source. In these situations, a more exact estimation of the cloned region is required to establish a precise forgery localization. A different approach, based on the J-Linkage algorithm, was created by Amerini et al. [35]; it performs robust clustering in the space of the geometric transformation. According to experimental findings on various datasets, the suggested method surpasses other comparable state-of-the-art strategies in terms of copy-move forgery detection reliability and accuracy of manipulated-patch localization. The first step entails the extraction of SIFT features and key-point matching, and the second step focuses on clustering and forgery detection. The designed clustering technique is a modified version of the J-Linkage algorithm that operates in the transformation domain rather than the spatial domain of matched points; the authors are therefore able to overcome the aforementioned limitations of the spatial clustering procedures used in the previous methods.



A SURF-based method was introduced by Bo et al. in [36]. In this approach, key-point identification is accomplished using a Hessian matrix, while orientation assignment is handled via Haar wavelets. Despite its rapid detection speed and reliability, this technique cannot extract the borders of the forged zone. The SURF approach was also recommended by Zhang and Wang [37] for copy-move detection in flat regions of an image. Their method begins by extracting SURF features from the input picture, followed by feature matching and pruning; finally, duplicated areas are found using correlations. However, these approaches are unable to find duplicates in rotated forged pieces or non-flat areas.

3.3 Parallel Architectures in Copy-Move Forgery Detection

Numerous techniques have been suggested for speeding up copy-move forgery detection, based on updating the third step with new features or renewing the fourth step with a different matching or searching algorithm. There have also been a few initiatives to use parallel architectures to boost the performance of CMIFD systems.

Using Java threads, Sridevi et al. [38] suggested a parallel block-matching approach in which the two basic steps, block overlapping and lexicographic sorting, take place concurrently. The overlapping blocks are parallelized using the following steps: (1) The input image I of dimensions r × c is divided into (r − b + 1) × (c − b + 1) overlapping blocks, where the block size is b × b pixels (for example, a 512 × 512 image with 16 × 16 blocks yields 497 × 497 = 247,009 overlapping blocks). (2) The parallel algorithm uses at most (r − b + 1) processors, where the i-th processor constructs the i-th row of the matrix that represents the blocks. (3) A lexicographic sorting step is done using radix sort in a parallel manner; consequently, similar rows, which represent duplicated blocks, become adjacent to each other.

A PCMIFD approach was proposed by Shih and Jackson [39] using a computer with several CPU cores. The authors identify the two computation areas with the greatest effect on discovering fake photos: the first is finding duplication in the feature vector matrix, and the second is the PCA and DCT analysis used to extract the feature vectors. The parallel processing described there involves running numerous threads concurrently on separate processors or CPU cores.

In [40], a parallel architecture for multi-core processors was suggested in order to speed up copy-move forgery detection techniques. A multi-core processor is a single computer component containing two or more separate real processing units (referred to as "cores"), which are units that read and execute program instructions. By simultaneously performing several instructions on different cores of a single processor, parallel computing can speed up execution. The decision to include two or more "execution cores" in a single processor was made by processor makers as a result of advancements in manufacturing technology. On a single integrated circuit die, these cores are basically two separate processors known as chip multiprocessors (CMPs) [41]. The IBM POWER4 processor, which has two cores on a single die, is regarded as the first multiple-processing CPU [42]. As a result, multi-core processors have become the standard for enhancing computing performance, whether by multithreading on the cores or by employing more cores [43]. Multi-core CPUs are frequently employed in numerous applications, including general-purpose computers, embedded systems, networks, digital signal processing (DSP), and graphics (GPU) [16, 17, 44]. How well multi-core processors can boost the performance of a program is determined by the algorithms employed and their executable codes; the level of parallelism is determined by the degree of data partitioning used by the program, where each partition is handled separately.

4 The Proposed Method

In this section, the proposed Parallel Copy-Move Image Forgery Detection (PCMIFD) method, which is based on the one presented in [45], is discussed in detail. As shown in Fig. 6, the algorithm can be broken into six steps:

(1) Convert the input image to a gray image. Then, divide it into 16 × 16 overlapping blocks.
(2) Compute the DCT of each 16 × 16 block in parallel. For each of these blocks, the DCT coefficients are quantized with an extended JPEG quantization matrix described in [19].
(3) After all of the blocks are quantized, they are inserted as rows into a matrix (each quantized feature vector for a block constitutes a row). Then, the matrix is sorted lexicographically.
(4) Next, adjacent rows that match in the matrix are identified using the Euclidean distance dist(x, y). Then, for each match, a shift vector s_i is calculated:

$$\operatorname{dist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \quad (1)$$

$$s_i = [x_1 - x_2,\; y_1 - y_2] \quad (2)$$

where (x_1, y_1) and (x_2, y_2) are the coordinates of the matched blocks.
(5) For each shift vector s_i, a count is retained of the number of times it has been repeated. For every s_i whose count is greater than a predefined threshold, the blocks corresponding to that shift vector are considered to be possible forged regions.
(6) Mark the possible forged regions with the same color, then output the final image with the detected forged regions.
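As an illustration of steps (4)–(5), the following is a minimal sketch, with hypothetical names and without the normalization details a production implementation would need, of how matches between adjacent sorted rows can be accumulated into shift-vector counts:

```cpp
#include <cmath>
#include <map>
#include <utility>
#include <vector>

struct BlockRow {
    std::vector<double> features; // quantized DCT coefficients
    int x, y;                     // block coordinates in the image
};

// Euclidean distance between two feature vectors, as in Eq. (1).
double dist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Count shift vectors (Eq. (2)) between adjacent rows of the
// lexicographically sorted matrix; a vector repeated more often
// than some threshold marks a candidate forged region.
std::map<std::pair<int, int>, int>
countShiftVectors(const std::vector<BlockRow>& sorted, double matchTol) {
    std::map<std::pair<int, int>, int> counts;
    for (std::size_t i = 0; i + 1 < sorted.size(); ++i) {
        if (dist(sorted[i].features, sorted[i + 1].features) <= matchTol) {
            ++counts[{sorted[i].x - sorted[i + 1].x,
                      sorted[i].y - sorted[i + 1].y}];
        }
    }
    return counts;
}
```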



Fig. 6 Parallel copy move algorithm speedup

The proposed method has been evaluated using a dataset containing 100 colored images of size 512 × 512. The images were collected from different datasets [46–49]. Some of them were chosen from ready-made forged images, while the others were collected as non-forged images, with Adobe Photoshop used to duplicate objects within the same image (Fig. 7).



Fig. 7 The utilized dataset

5 Results and Discussion

5.1 Setup

The tests are carried out on multi-core CPUs using the C++ programming language, with OpenMP as the multi-core CPU API. The parallel algorithms run on a 64-bit Ubuntu Linux system, and the hosting computer has an 8-core CPU. The whole set of tests is done sequentially and on 2-core, 4-core, and 8-core CPUs. The suggested approach was evaluated using the following experiments; the first two are displayed in Figs. 8 and 9.



Fig. 8 Experiment (1), a The forged image, and b The detected image

Fig. 9 Experiment (2), a The forged image, and b The detected image

5.2 Results

Table 1 lists the percentage of the overall execution time taken by each step of the proposed algorithm. Apparently, the second step is the most time-consuming one, accounting for almost 87% of the algorithm's running time. Thus, the most fruitful step of the proposed algorithm to parallelize is computing the DCT of each block. We used the coarse-grain parallel strategy: as we have XX windows and p cores, where XX >> p, we assigned XX/p windows to each core to calculate the corresponding DCT function. While we implemented and ran it on a multi-core CPU, similar speedup values can be obtained on many-core architectures, i.e., GPUs, using the same strategy. The second phase of the suggested approach was parallelized using the OpenMP package. Since this step's code consists solely of a for loop, we employed loop parallelism, a very common form of parallelism in parallel programs, for which OpenMP offers a very simple solution. The most common structures to be parallelized are OpenMP parallel loops, which essentially divide a given amount of work among the available threads in a parallel region; the distinct iterations of the loop are thus split among the p CPU cores in use. The set of calculated 2-D DCTs is then assembled into a single matrix and handled sequentially in the following phases, as sketched below.
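A minimal sketch of this loop-parallel structure, with an illustrative naive dct2 in place of the optimized transform and quantization actually used, is as follows:

```cpp
#include <omp.h>
#include <cmath>
#include <vector>

constexpr int B = 16; // block side length

// Naive 2-D DCT-II of one B x B block (illustrative only; a real
// implementation would use a fast factored transform and then
// quantize the coefficients as in [19]).
std::vector<double> dct2(const std::vector<double>& block) {
    std::vector<double> out(B * B, 0.0);
    for (int u = 0; u < B; ++u)
        for (int v = 0; v < B; ++v) {
            double s = 0.0;
            for (int x = 0; x < B; ++x)
                for (int y = 0; y < B; ++y)
                    s += block[x * B + y] *
                         std::cos(M_PI * (2 * x + 1) * u / (2.0 * B)) *
                         std::cos(M_PI * (2 * y + 1) * v / (2.0 * B));
            out[u * B + v] = s;
        }
    return out;
}

// Step 2 of the algorithm: compute the DCT of every overlapping block
// in parallel; each thread receives its own chunk of the iteration
// space, matching the coarse-grain strategy described above.
void computeAllDCTs(const std::vector<std::vector<double>>& blocks,
                    std::vector<std::vector<double>>& features) {
    features.resize(blocks.size());
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(blocks.size()); ++i)
        features[i] = dct2(blocks[i]);
}
```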

Table 1 Copy move algorithm step weights

Step no          Weight (%)
Initialization   0.05
1                4.42
2                86.93
3                5.24
4                3.07
5                0.01
6                0.27

Figure 10 displays the average speedup rate of the first two experiments. Finally, the third experiment was performed, in which the PCMIFD method was tested on the complete dataset; the resulting speedups of the copy-move forgery detection process are shown in Fig. 11. As the figure shows, the results of the proposed system come close to the theoretical speedup rate. Some loss is expected as the number of cores increases; it is caused by the large interconnect (wire) delays incurred when data has to be moved across the multi-core chip, in particular from the memories [50].

Fig. 10 The average speed-up rate for the first two experiments



Fig. 11 The average speedup rate for the third experiment

6 Conclusion

Because it is simple to perform, copy-move forgery is one of the most popular counterfeiting methods. A novel technique for speeding up CMFD performance by exploiting parallel architectures is proposed: a Parallel Copy-Move Image Forgery Detection (PCMIFD) system based on multi-core CPUs. The suggested approach achieves high accuracy rates and is significantly faster than other comparable current methods; in the trials, a speedup of up to 7.4 is attained, and the outcomes closely approach the theoretical speedup rate. Future research will focus on utilizing multi-core CPUs with various capabilities to increase accuracy.

References

1. C. Rey, J.-L. Dugelay, A survey of watermarking algorithms for image authentication. EURASIP J. Adv. Signal Process. 2002(6), 218932 (2002)
2. A. Haouzia, R. Noumeir, Methods for image authentication: a survey. Multimed. Tools Appl. 39(1), 1–46 (2008)
3. F.Y. Shih, Digital Watermarking and Steganography: Fundamentals and Techniques (CRC Press, 2017)



4. R. Singh, M. Kaur, Copy move tampering detection techniques: a review. Int. J. Appl. Eng. Res. 11(5), 3610–3615 (2016)
5. T.T. Ng, S.F. Chang, C.Y. Lin, Q. Sun, Passive-blind image forensics, in Multimedia Security Technologies for Digital Rights, vol. 15 (2006), pp. 383–412
6. D. Chauhan, D. Kasat, S. Jain, V. Thakare, Survey on keypoint based copy-move forgery detection methods on image. Procedia Comput. Sci. 85, 206–212 (2016)
7. A.A. Nisha Chauhan, A survey on image tampering using various techniques. IOSR J. Comput. Eng. (IOSR-JCE) 19(3), 97–101 (2017). e-ISSN: 2278-0661, p-ISSN: 2278-8727
8. J.G.R. Elwin, T. Aditya, S.M. Shankar, Survey on passive methods of image tampering detection, in 2010 International Conference on Communication and Computational Intelligence (INCOCCI) (IEEE, 2010)
9. V.S. Dhole, N.N. Patil, Self embedding fragile watermarking for image tampering detection and image recovery using self recovery blocks, in International Conference on Computing Communication Control and Automation (ICCUBEA) (IEEE, 2015)
10. S. Bayram, H.T. Sencar, N. Memon, A survey of copy-move forgery detection techniques, in IEEE Western New York Image Processing Workshop (IEEE, 2008)
11. G. Kaur, M. Kumar, Study of various copy move forgery attack detection techniques in digital images. Int. J. Res. Comput. Appl. Robot 2320–7345 (2015)
12. S.G. Rasse, Review of detection of digital image splicing forgeries with illumination color estimation. Int. J. Emerg. Res. Manag. Technol. 3 (2014)
13. Z. Qu, G. Qiu, J. Huang, Detect digital image splicing with visual cues, in International Workshop on Information Hiding (Springer, Berlin, 2009)
14. V. Savchenko, N. Kojekine, H. Unno, A practical image retouching method, in Proceedings, First International Symposium on Cyber Worlds, 2002 (IEEE, 2002)
15. S. Velmurugan, T.S. Subashini, M.S. Prashanth, Dissecting the literature for studying various approaches to copy move forgery detection. IJAST 29(04), 6416–6438 (2020)
16. K.M. Hosny, H.M. Hamza, N.A. Lashin, Copy-move forgery detection of duplicated objects using accurate PCET moments and morphological operators. Imaging Sci. J. 66(6), 330–345 (2018)
17. K.M. Hosny, H.M. Hamza, N.A. Lashin, Copy-for-duplication forgery detection in colour images using QPCETMs and sub-image approach. IET Image Proc. 13(9), 1437–1446 (2019)
18. R. Sekhar, A. Chithra, Recent block-based methods of copy-move forgery detection in digital images. Int. J. Comput. Appl. 89(8) (2014)
19. A.J. Fridrich, B.D. Soukal, A.J. Lukáš, Detection of copy-move forgery in digital images, in Proceedings of Digital Forensic Research Workshop (Citeseer, 2003)
20. A. Popescu, H. Farid, Exposing digital forgeries by detecting duplicated image regions. Department of Computer Science, Dartmouth College, Technology Report TR2004-515 (2004)
21. S.-J. Ryu, M.-J. Lee, H.-K. Lee, Detection of copy-rotate-move forgery using Zernike moments, in Information Hiding (Springer, Berlin, 2010)
22. Y. Huang, W. Lu, W. Sun, D. Long, Improved DCT-based detection of copy-move forgery in images. Forensic Sci. Int. 206(1), 178–184 (2011)
23. S. Sadeghi, H.A. Jalab, S. Dadkhah, Efficient copy-move forgery detection for digital images. World Acad. Sci., Eng. Technol. 71 (2012)
24. Y. Li, Image copy-move forgery detection based on polar cosine transform and approximate nearest neighbor searching. Forensic Sci. Int. 224(1), 59–67 (2013)
25. J.-C. Lee, C.-P. Chang, W.-K. Chen, Detection of copy–move image forgery using histogram of orientated gradients. Inf. Sci. 321, 250–262 (2015)
26. O.M. Al-Qershi, B.E. Khoo, Passive detection of copy-move forgery in digital images: state-of-the-art. Forensic Sci. Int. 231(1), 284–295 (2013)
27. G. Lynch, F.Y. Shih, H.-Y.M. Liao, An efficient expanding block algorithm for image copy-move forgery detection. Inf. Sci. 239, 253–265 (2013)
28. T. Mahmood, A. Irtaza, Z. Mehmood, M.T. Mahmood, Copy–move forgery detection through stationary wavelets and local binary pattern variance for forensic analysis in digital images. Forensic Sci. Int. 279, 8–21 (2017)



29. S.M. Fadl, N.A. Semary, Robust copy-move forgery revealing in digital images using polar coordinate system. Neurocomputing (2017)
30. D. Vaishnavi, T.S. Subashini, Application of local invariant symmetry features to detect and localize image copy move forgeries. J. Inf. Secur. Appl. 44, 23–31 (2019). https://doi.org/10.1016/j.jisa.2018.11.001
31. K.B. Meena, V. Tyagi, A copy-move image forgery detection technique based on Gaussian-Hermite moments. Multimed. Tools Appl. 78, 33505–33526 (2019). https://doi.org/10.1007/s11042-019-08082-2
32. S. Gulivindala, S.R. Chanamallu, Copy-move forgery detection system through fused color and texture features using firefly algorithm. Int. J. Recent Technol. Eng. (IJRTE) 8(1) (2019). ISSN: 2277-3878
33. X. Pan, S. Lyu, Region duplication detection using image feature matching. IEEE Trans. Inf. Forensics Secur. 5(4), 857–867 (2010)
34. I. Amerini, L. Ballan, R. Caldelli, A. Del Bimbo, G. Serra, A SIFT-based forensic method for copy–move attack detection and transformation recovery. IEEE Trans. Inf. Forensics Secur. 6(3), 1099–1110 (2011)
35. I. Amerini, L. Ballan, R. Caldelli, A. Del Bimbo, L. Del Tongo, G. Serra, Copy-move forgery detection and localization by means of robust clustering with J-Linkage. Signal Process.: Image Commun. 28(6), 659–669 (2013)
36. X. Bo, W. Junwen, L. Guangjie, D. Yuewei, Image copy-move forgery detection based on SURF, in 2010 International Conference on Multimedia Information Networking and Security (MINES) (IEEE, 2010)
37. G.-Q. Zhang, H.-J. Wang, SURF-based detection of copy-move forgery in flat region. Int. J. Adv. Comput. Technol. 4(17) (2012)
38. M. Sridevi et al., Copy–move image forgery detection in a parallel environment, in SIPM, FCST, ITCA, WSE, ACSIT, CS and IT, vol. 6 (2012), pp. 19–29
39. F.Y. Shih, J.K. Jackson, Copy-cover image forgery detection in parallel processing. Int. J. Pattern Recognit. Artif. Intell. 29(08), 1554004 (2015)
40. M. Rouse, Definition: multi-core processor. TechTarget. Retrieved 6 Mar 2013
41. S. Akhter, J. Roberts, Multi-core Programming, vol. 33 (Intel Press, Hillsboro, 2006)
42. D.C. Bossen et al., Fault-tolerant design of the IBM pSeries 690 system using POWER4 processor technology. IBM J. Res. Dev. 46(1), 77–86 (2002)
43. A. Vajda, Multi-core and many-core processor architectures, in Programming Many-Core Chips (Springer, 2011), pp. 9–43
44. A. Salah, K. Li, K.M. Hosny, M.M. Darwish, Q. Tian, Accelerated CPU–GPUs implementations for quaternion polar harmonic transform of color images. Futur. Gener. Comput. Syst. 107, 368–382 (2020)
45. L. Dagum, R. Menon, OpenMP: an industry standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)
46. K.M. Hosny et al., Fast computation of 2D and 3D Legendre moments using multi-core CPUs and GPU parallel architectures. J. Real-Time Image Proc. 16(6), 2027–2041 (2019)
47. http://lci.micc.unifi.it/labd/2015/01/copy-move-forgery-detection-and-localization/
48. https://www5.cs.fau.de/research/data/image-manipulation/index.html
49. https://www.vcl.fer.hr/comofod/comofod.html
50. B. Venu, Multi-core processors - an overview (2011). arXiv:1110.3535

Parallel Image Processing Applications Using Raspberry Pi Khalid M. Hosny, Ahmad Salah, and Amal Magdi

Abstract In modern technologies, digital image processing is an essential field with various applications. Over the past few years, the multidisciplinary field of real-time image processing has undergone an explosion, and scientists increasingly look for advanced processing tools, such as embedded and special hardware systems like the Raspberry Pi, for processing big data in real time. The Raspberry Pi is a credit-card-sized, affordable computer with an open-source platform. It is a very useful and promising tool for image processing applications, providing the advantages of portability, parallelism, low cost, and low power consumption. Since computational time is a critical factor in image processing applications, clusters can achieve real-time execution. When it comes to constructing massive supercomputing clusters, power consumption has become an increasingly important metric, and low-power embedded processors are one way to reduce power consumption in large clusters instead of the standard CPUs. As a result, a Raspberry Pi cluster is helpful for image processing applications that take a long time to execute, as the portable cluster can be configured to continue operating even if a number of its nodes fail. In this paper, the authors provide an overview of Raspberry Pi utilization in various parallel image processing applications in different fields.

Keywords Raspberry Pi · Cluster · Image processing · Parallel computing

K. M. Hosny (B) · A. Magdi Department of Information Technology, Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt e-mail: [email protected] A. Salah Department of Computer Science, Faculty of Computers and Informatics, Zagazig University, Zagazig, Egypt Department of Information Technology, CAS-Ibri, University of Technology and Applied Sciences, Muscat, Oman © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 K. M. Hosny and A. Salah (eds.), Recent Advances in Computer Vision Applications Using Parallel Processing, Studies in Computational Intelligence 1073, https://doi.org/10.1007/978-3-031-18735-3_6




1 Introduction

Digital image processing is a growing field, with new applications being produced at a rising rate. Our vision system helps us collect information directly from images or videos and interpret all forms of information [1]. Image processing is the transformation of an image into digital form and the performance of operations on it to obtain an enhanced image and extract valuable details; the result may be an image or a set of features related to the image [2]. Over the past few years, the multidisciplinary field of real-time image processing has undergone an explosion. Special processing tools, such as super parallel processing computers or special hardware systems like the Raspberry Pi, are needed for big data processing in real time. The Raspberry Pi is a simple embedded machine and a low-cost single-board computer used in real-time applications to reduce complexity [3]. It has many models with an open-source platform and can be used in various projects, as it provides the advantages of portability, low cost, and low power consumption [4].

Parallel processing is simply the use of many processors concurrently to perform faster computations; in clusters, the parallel computation is carried out through the exchange of messages [5]. Distributed memory and shared memory are two prominent programming methods for facilitating communication and cooperation in parallelized computer operations. In shared memory models, like Open Multi-Processing (OpenMP), all computing units typically share the same memory regions, while in distributed memory models, like the Message Passing Interface (MPI) approach, each computing unit has its own memory [6, 7]. In both computing paradigms, the coordinating work is frequently assigned to a single processing unit: the coordinating unit is referred to as the master unit, while all other parallelized units are referred to as slave units [8]. A cluster is a collection of computers that are linked together; by joining a large number of computers in a cluster, one can achieve a possible speed boost by performing activities in a parallel and distributed manner [9]. Clusters are costly and difficult to experiment with; however, low-cost clusters of single-board computers (SBCs), such as the Raspberry Pi, are a solution to this problem [10]. In terms of cost and power usage, SBCs can also compete with modern devices, and due to their multi-core nature they may be tuned to obtain faster computations by utilizing their natural parallel capabilities [11].

There are many useful image processing applications in many fields. This survey concerns the use of the Raspberry Pi in image processing applications such as medical, recognition, monitoring, autonomous car driving, and compression applications. Medical diagnosis is one of the most important fields in which image processing procedures are usefully implemented, and accurate image processing is a crucial step in improving diagnostic procedures and surgical operations [12]. The monitoring of vital signs is a significant part of the healthcare system: several monitoring devices that display the vital signs are located in critical care units, yet there can be cases in which, despite 24-hour surveillance, the doctor might not be notified in time of an emergency, and the information cannot be exchanged with other doctors remotely. Many people in developing countries do not have access to, or cannot afford, the technology to participate in all these activities; as a result, a low-cost Raspberry Pi platform can be used to solve this problem [13].

Image recognition and classification is one of the most actively explored subjects in the vast field of imaging sciences and engineering. The fundamental concept is to analyze a visual scene using data obtained from sensors [14]. It aids in replacing human visual capabilities with computer capabilities, which is critical in a variety of applications such as military, health monitoring, surgery, intelligent transportation systems, intelligent agricultural systems, and security systems.

The organization of the rest of this paper can be summarized as follows: Sect. 2 highlights the Raspberry Pi's general features; Sect. 3 reviews prior research on parallel image processing applications developed using the Raspberry Pi; finally, Sect. 4 draws some conclusions.
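As a minimal illustration of the distributed-memory style mentioned above (an assumed toy example, not code from any of the surveyed systems), an MPI program can let every rank process its own partition while rank 0, acting as the master, combines the results:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each unit (including the master) processes its own partition;
    // here the "work" is just a partial sum over a disjoint index range.
    const int n = 1000;
    long local = 0;
    for (int i = rank; i < n; i += size) local += i;

    // The master (rank 0) gathers and combines the partial results.
    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("total = %ld over %d processes\n", total, size);

    MPI_Finalize();
    return 0;
}
```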

2 Raspberry Pi: General Overview

The Raspberry Pi is a credit-card-sized, reasonably priced computer created in the UK by the Raspberry Pi Foundation. The open-source Raspberry Pi platform comes in a wide variety of models. The Raspberry Pi was first introduced in February 2012, and the Raspberry Pi 2 followed in February 2015. The smaller Raspberry Pi Zero, the larger Raspberry Pi 3, and the Raspberry Pi 4, shown in Fig. 1 [15], were released in November 2015, February 2016, and June 2019, respectively. The Raspberry Pi 4 is available with a variety of RAM sizes, depending on the requirements of the application.

Fig. 1 Raspberry Pi model 4 [15]



Fig. 2 Raspberry Pi memory bandwidth test [17]

2.1 Memory Bandwidth

Although many workloads are limited mainly by CPU speed, others depend on memory bandwidth. The RAM bandwidth of the Raspberry Pi 4 has improved as a result of the upgrade from LPDDR2 to LPDDR4. The RAM speed/SMP tool is used in this benchmark to measure the read and write bandwidth for 1 MB blocks [16]. Figure 2 [17] shows the memory bandwidth benchmark for all Raspberry Pi models (higher is better).

2.2 Ethernet Throughput

The Raspberry Pi 3 Model B+ was the first to provide gigabit Ethernet connectivity, but a bottleneck hindered it from reaching its theoretical maximum throughput. Throughput on the Raspberry Pi 4 is better, as it is free from the single USB 2.0 channel shared with the SoC; the removal of the USB bottleneck means Ethernet and USB throughput are no longer linked. Figure 3 [18] shows the average Ethernet throughput over many runs, measured using iperf3 [16].

Fig. 3 Raspberry Pi Ethernet throughput test [18]

2.3 WI-FI Throughput

On the Raspberry Pi family, Ethernet is not the only built-in networking option. Built-in Wi-Fi has been available as standard since the introduction of the Raspberry Pi 3 Model B, with the Raspberry Pi 3 Model B+ introducing dual-band support. A near-ideal setting is used for this wireless networking test: a Raspberry Pi is positioned in the line of sight of an 802.11ac router, and a wired laptop uses iperf3 to measure the average throughput across many runs [16]. Figure 4 shows the Wi-Fi throughput for all Raspberry Pi models (higher is better) [19].

Fig. 4 Raspberry Pi WI-FI throughput test [19]



Fig. 5 Raspberry Pi power draw test [20]

2.4 Power Draw

If something is producing more heat, it is guaranteed to be drawing more power. The ability to qualify the Raspberry Pi 4's power socket at higher currents (3 A, up from 2.5 A on the previous jack) is a major reason for the switch from micro-USB to USB Type-C [16]. The Raspberry Pi 4 is the most power-hungry design to date, as depicted in Fig. 5 [20].

3 Parallel Image Processing Applications Using RPI

In image processing fields, the Raspberry Pi can be helpful, as it offers the benefits of portability, parallelism, low cost, and lower power consumption. This section presents the use of the Raspberry Pi in various parallel image processing applications: medical, recognition, monitoring, autonomous car driving, and compression applications.

3.1 Medical Applications

Hosny et al. [21] propose a portable, low-cost COVID-19 diagnosis system using a Raspberry Pi 4 Model B platform with a 4-core CPU. The OpenMP library is used for the multi-core CPU implementation. LBP is used for local feature extraction, multi-channel fractional-order Legendre-Fourier moments are then used for global feature extraction from chest X-ray or CT scans, and finally the most important features are chosen. The local- and global-feature extraction techniques are implemented in C++, while the image classifier is implemented in Python. The system's steps are integrated to suit the embedded system's limited computing and memory capacities.

Teo et al. [22] proposed a lightweight convolutional neural network model for detecting malignant tumor cells that can easily be operated on constrained systems. The BreaKHis dataset was used in the experiments. The inference experiments were carried out on two systems: a computer with an average GEFORCE 770M GPU, and a Raspberry Pi 3B+ single-board computer, which supplied the user interface and wireless connection. Emulating human-brain capabilities such as image classification requires a lot of processing power and computational complexity, and exploring the utilization of Multicore Systems-on-Chip (MCSoC) is one option for overcoming bulky systems. While the latency and accuracy may not be as good as GPU implementations, the experiments demonstrate that the MCSoC system can execute profiles comparable to known networks on a large system.

Using a Raspberry Pi 3 with limited resources, Kruger et al. [23] implemented and improved a contact-less respiration monitoring system based on the AutoROI respiration algorithm. After being created in MatLab, the algorithm was ported to C++ and implemented as a streaming application that processes video frames in six stages. The first three stages create four threads per stage using OpenMP to benefit from data parallelism. The modifications increased the throughput of the algorithm by more than 45% and raised CPU utilization from 72% to over 94%.

Bensag et al. [24] present an embedded agent based on the Raspberry Pi 2 Model B using the JADE (Java Agent DEvelopment) platform for cardiac magnetic resonance image (MRI) segmentation. The image segmentation is done using the C-means classification technique and is built on distributed, parallel embedded agents that run on Raspberry Pi devices. Three Raspberry Pi 2 Model B boards were used in the experiments.

Cho et al. [25] design a user-friendly, ultra-compact optical coherence tomography (OCT) system controlled and operated by a Raspberry Pi to reduce system complexity. The goal of this implementation is to create an OCT that functions similarly to traditional OCT systems on a basic Raspberry Pi single-board computer architecture (RB-OCT). The software interface of the RB-OCT system was written in C++ using parallel programming and multi-threading. The performance of the proposed method was evaluated using in vivo skin, ex vivo mouse cochlea, and fresh onion peels.

Kanani et al. [26] use Run-Length Encoding to enhance genomic pattern matching in a three-node distributed Raspberry Pi cluster. Since genomic pattern matching requires a lot of computing power and is expensive, a Raspberry Pi cluster can be used to solve this issue, speeding up and simplifying the matching process. The proposed method reduces the time required for genome pattern matching by more than 50% in a distributed Raspberry Pi computing system. This method is useful for recognizing a specific genome-based abnormality in distributed computing, where different regions of the genome records are present on separate nodes and the input is given to find the disease.

Sivaranjani et al. [27] develop a low-cost newborn authentication system using a Raspberry Pi, based on the newborn's footprint and the mother's fingerprint for identification. The suggested technique provides a low-cost alternative to pricey DNA tests for detecting baby exchange. SIFT feature extraction and the RANSAC algorithm for fingerprint and footprint biometric matching are implemented using OpenCV on the Raspberry Pi. The Guo-Hall parallel thinning algorithm in OpenCV is employed to find the end points in images. While the SIFT algorithm was deemed the best technique for feature extraction, the Guo-Hall thinning approach is considered superior to the morphological thinning operation.

Indragandhi et al. [28] present an enhanced thread-level parallelism (ETLP) technique that leverages a computationally intensive edge detection algorithm. The suggested system is compared against a basic sequential edge detection scheme; the edge detection process is evaluated using the Sobel, Prewitt, and Scharr operators for various image sizes. To improve performance and CPU core usage, the suggested system was implemented on a Raspberry Pi 3 embedded device for image processing. The ETLP method is implemented using the OpenMP parallel programming model, with the edge detection algorithm written in C. According to the experimental findings, the suggested ETLP approach achieves 49% and 72% efficiency for image sizes of 300 × 256 and 1024 × 1024, respectively; in addition, compared to the basic edge detection method, the ETLP scheme reduced execution time by 66% for images of size 1024 × 1024.
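Several of the systems above and below rely on local binary patterns for feature extraction (e.g., [21] here and [29, 30] in the next subsection). As a rough, self-contained sketch of the basic idea only, and not code from any of the cited systems, the classic 3 × 3 LBP operator can be written as:

```cpp
#include <cstdint>
#include <vector>

// Minimal 3x3 local binary pattern: each interior pixel is encoded as
// an 8-bit code comparing its eight neighbours against the centre.
// The input is a grayscale image in row-major order.
std::vector<uint8_t> lbp(const std::vector<uint8_t>& img, int w, int h) {
    static const int dx[8] = {-1, 0, 1, 1, 1, 0, -1, -1};
    static const int dy[8] = {-1, -1, -1, 0, 1, 1, 1, 0};
    std::vector<uint8_t> out(static_cast<std::size_t>(w) * h, 0);
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            const uint8_t centre = img[y * w + x];
            uint8_t code = 0;
            for (int k = 0; k < 8; ++k)
                if (img[(y + dy[k]) * w + (x + dx[k])] >= centre)
                    code |= static_cast<uint8_t>(1u << k);
            out[y * w + x] = code;
        }
    return out;
}
```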

3.2 Recognition Applications

Handrik et al. [29] implement a parallel image processing car plate recognition system using the Raspberry Pi Model 3B+. The Raspberry Pi was used to record and process the captured video. The Local Binary Pattern method is used to detect the license plate on a car, and OpenALPR is utilized for plate identification. The OpenCV library is used for image processing, and the libtesseract library for character recognition. BOINC, a platform for distributed computing, is used to manage computations in heterogeneous distributed computing systems. Parallelization was achieved using the GNU C and C++ compiler tools, as well as the libpthread system library. In comparison to the serial implementation, parallel processing of the image resulted in an acceleration coefficient of 1.74.

A cluster-computing-based facial recognition system using Raspberry Pi is presented by Rudraraju et al. [30] for use in a fog computing environment. The cluster consists of four Raspberry Pi nodes, all linked to the same Wi-Fi network. Faces are found in a test picture using Haar cascades and simultaneously identified using the Local Binary Pattern Histogram method. SLURM is responsible for cluster administration, while the Message Passing Interface handles communication between the various processes in the cluster. The face recognition operations were run concurrently to increase performance in terms of execution time.

To increase processing performance on low-power edge devices, Goel et al. [31] provide a method for doing pipeline-parallel inference on a hierarchical DNN. The technique divides the hierarchical DNN into partitions and puts each one on a participating edge device, enabling the simultaneous processing of several frames. Four Raspberry Pi 4Bs were utilized in the testing to show that the hierarchical DNN architecture is well suited for parallel processing on several edge devices. The method offers 3.21 times the throughput of traditional single-device hierarchical DNNs, 68% less energy consumption per device per frame, and a 58% memory reduction.

3.3 Monitoring Applications Wang et al. [32] developed a two-stage technique for automatically discovering wind turbine blade surface cracks and their contour using captured images by unmanned aerial vehicles (UAVs). The proposed approach was implemented using Raspberry Pi 3B+ platform. To find the position of cracks, the Viola-Jones object detection framework was used. The enhanced cascade classifier was used to classify windows created by the parallel sliding window technique using extended Haar-like features. To divide crack windows based on RGB and HSV characteristics, the parallel Jaya Kmeans technique was developed. On the Raspberry Pi, the crack detection procedure may be completed in a respectable amount of time, similar to the PC platform, and considerable execution speed-up can be achieved utilizing parallel techniques. On the Raspberry Pi 3, Nejad et al. [33] proposed ARM-VO, a quick monocular approach method which is proposed and compared with LibViso2 and ORBSLAM2. The NEON co-processor and multi-core design of the CPU speed up a set of quick algorithms that ARM-VO uses. The proposed method ARM-VO finds a set of uniformly dispersed key points using a parallel grid-based FAST. Following that, the main points are tracked using a parallel KLT tracker. Experiments on the KITTI dataset revealed that ARM-VO is 4–5 times quicker than other methods on the Raspberry Pi 3. Livoroi et al. [34] propose a further reduction to the OTV algorithm in order to lower its computational requirements. The studies are carried out on Raspberry Pi versions 3B, 3B+, and 4B 4 to see if it is possible to construct economical selfpowered systems utilizing conventional rechargeable batteries and solar panels. ARM Single Instruction Multiple Data (SIMD) and the four-core CPUs (Multicore) capabilities found in all of the Raspberry devices investigated was used to run optimized versions of OTV. Furthermore, to serve as a baseline, the original method was executed on a single CPU core. Using the integrated CPU’s four cores, a significant speedup above the baseline was achieved. Aguirre-Castro et al. [35] design a remotely driven vehicle that can capture underwater video controlled over Ethernet. The suggested remotely driven vehicle can dive


Aguirre-Castro et al. [35] design a remotely operated vehicle (ROV) that captures underwater video and is controlled over Ethernet. The vehicle can dive to a depth of up to 100 m, alleviating the problem that divers can only go down to about 30 m. Motion control, 3D positioning, temperature sensing, and video capture are all performed in parallel using the four cores of the Raspberry Pi 3 and the threading library. According to the test results, the video capture stage can handle up to 42 frames per second on a Raspberry Pi 3.
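A minimal sketch of this task-level concurrency with Python's threading library follows; the sensor and control routines are hypothetical placeholders for the ROV's actual subsystems, and only the structure mirrors the design described in [35].

```python
import threading
import time

import cv2

stop = threading.Event()

def capture_video():
    cap = cv2.VideoCapture(0)          # assumed camera index
    while not stop.is_set():
        ok, frame = cap.read()         # frame would be streamed over Ethernet
    cap.release()

def read_temperature():
    while not stop.is_set():
        time.sleep(1.0)                # placeholder for a real sensor driver

def motion_control():
    while not stop.is_set():
        time.sleep(0.05)               # placeholder for actuator commands

tasks = [capture_video, read_temperature, motion_control]
threads = [threading.Thread(target=t, daemon=True) for t in tasks]
for t in threads:
    t.start()
time.sleep(10)    # let the subsystems run concurrently for 10 s
stop.set()
for t in threads:
    t.join()
```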

3.4 Compression

Samah et al. [36] described an implementation of the CCSDS-MHC (Consultative Committee for Space Data Systems lossless multispectral and hyperspectral image compression) method on a Raspberry Pi 3 Model B+ using OpenMP; the Raspberry Pi handles only the compression, as decompression is out of scope for the project. The performance of the system was evaluated with AVIRIS and Hyperion data: using the CCSDS-MHC method with OpenMP, the average compression times were 88 s for AVIRIS images and 19 s for Hyperion images. This study shows that the Raspberry Pi, a low-power embedded device, can compress hyperspectral images at a high rate with OpenMP via the CCSDS method.

Rubino et al. [37] propose a real-time region-of-interest (ROI) and progressive image compression algorithm for use with an underwater image sensor. The authors present a progressive rate-distortion-optimized image compression algorithm based on the discrete wavelet transform (DWT), centered on a novel minimal-time parallel DWT algorithm that saturates the full memory bandwidth using just a few cores of a contemporary multicore embedded processor. On the Raspberry Pi 3B, the implemented method compresses more than 30 full-HD frames per second, and only a few hundred bytes per frame are required for the system to deliver passable imagery. Since the stream is bit-oriented, it can be truncated to deliver a lower-quality image, and the user can select one or more regions of interest to improve image quality in specific areas.
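The band-level parallelism that OpenMP provides in [36] can be imitated in a few lines: the sketch below compresses each spectral band of a hyperspectral cube in a separate process, using a simple horizontal-difference predictor followed by zlib as a toy stand-in for the actual CCSDS-MHC predictor and entropy coder, which it does not implement. The cube dimensions are synthetic.

```python
import zlib
from multiprocessing import Pool

import numpy as np

def compress_band(band):
    # Predict each pixel from its left neighbor and entropy-code the
    # residuals (losslessly invertible; a toy stand-in for CCSDS-MHC).
    residual = np.diff(band.astype(np.int32), axis=1, prepend=0)
    return zlib.compress(residual.tobytes(), level=6)

if __name__ == "__main__":
    # Synthetic hyperspectral cube: 64 bands of 512x512 12-bit samples
    cube = np.random.randint(0, 4096, (64, 512, 512), dtype=np.uint16)
    with Pool(processes=4) as pool:        # one worker per Pi core
        compressed = pool.map(compress_band, list(cube))
    raw = cube.nbytes
    out = sum(len(c) for c in compressed)
    print(f"compression ratio: {raw / out:.2f}:1")
```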
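A single DWT level of the kind underlying [37] also parallelizes naturally over rows; the following sketch computes a one-dimensional Haar transform of each image row in a thread pool. This is a far simpler kernel than the bandwidth-saturating parallel DWT of the paper and is shown only to make the data layout concrete.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def haar_row(row):
    # One Haar DWT level: pairwise averages (approximation)
    # followed by pairwise differences (detail).
    a = (row[0::2] + row[1::2]) / 2.0
    d = (row[0::2] - row[1::2]) / 2.0
    return np.concatenate([a, d])

img = np.random.rand(1080, 1920)            # stand-in full-HD frame
with ThreadPoolExecutor(max_workers=4) as pool:
    rows = list(pool.map(haar_row, img))    # one task per image row
coeffs = np.vstack(rows)
print(coeffs.shape)                         # (1080, 1920)
```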

3.5 Autonomous Car Driving

Bechtel et al. [38] present DeepPicar, a low-cost autonomous vehicle platform built on deep neural networks. DeepPicar uses a deep convolutional neural network to predict steering angles in real time from camera input, and it runs on a Raspberry Pi 3 quad-core computer with a web camera. DeepPicar uses the same neural network design as NVIDIA's DAVE-2 self-driving car; the major goal is to accurately scale NVIDIA's DAVE-2 technology down to the Raspberry Pi 3, a low-cost multicore platform. To carry out the CNN inference operations, the TensorFlow library made use of all four of the Raspberry Pi 3's CPU cores.
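In TensorFlow 2, pinning inference to the Pi's four cores takes only a couple of configuration calls, as in the minimal sketch below; the saved-model name and the DAVE-2-style 66×200×3 input are illustrative assumptions, not the actual DeepPicar artifacts.

```python
import numpy as np
import tensorflow as tf

# Let TensorFlow use all four Pi 3 cores for kernel execution.
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(1)

# Hypothetical steering model with a DAVE-2-style 66x200 RGB input.
model = tf.keras.models.load_model("steering_model")

frame = np.random.rand(1, 66, 200, 3).astype(np.float32)  # stand-in frame
angle = model.predict(frame, verbose=0)
print("predicted steering angle:", float(angle[0, 0]))
```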


4 Conclusion

This chapter surveyed works on parallel image processing using the Raspberry Pi. For smart-city applications that require portability, the Raspberry Pi is a fitting choice: in image processing, it offers the combined benefits of portability, parallelism, low cost, and low power consumption. The survey covered parallel image processing systems built on the Raspberry Pi across several application areas, including medical, recognition, monitoring, compression, and autonomous car driving.

References

1. S.E. Umbaugh, Digital Image Processing and Analysis: Human and Computer Vision Applications with CVIPtools (CRC Press, 2010)
2. B. Basavaprasad, M. Ravi, A study on the importance of image processing and its applications. IJRET: Int. J. Res. Eng. Technol. 3, 1 (2014)
3. K.S. Shilpashree, H. Lokesha, H. Shivkumar, Implementation of image processing on raspberry Pi. Int. J. Adv. Res. Comput. Commun. Eng. 4(5), 199–202 (2015)
4. J.W. Kuziek, A. Shienh, K.E. Mathewson, Transitioning EEG experiments away from the laboratory using a raspberry Pi 2. J. Neurosci. Methods 277, 75–82 (2017)
5. M. Ghaffari, T. Gouleakis, C. Konrad, S. Mitrović, R. Rubinfeld, Improved massively parallel computation algorithms for MIS, matching, and vertex cover, in Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing (2018), pp. 129–138
6. A. Fedulov, A. Fedulova, Y. Fedulov, Hybrid parallel programming in high performance computing cluster, in International Conference on Dependability and Complex Systems (Springer, Cham, 2021), pp. 97–105
7. A. Pajankar, Raspberry Pi Computer Vision Programming (Packt Publishing, Birmingham, 2015), pp. 30–39
8. Q. Wu, M. Spiryagin, C. Cole, T. McSweeney, Parallel computing in railway research. Int. J. Rail Transp. 8(2), 111–134 (2020)
9. D. Marković, D. Vujičić, D. Mitrović, S. Ranđić, Image processing on raspberry Pi cluster. Int. J. Electr. Eng. Comput. 2(2) (2018)
10. M. Warade, J.G. Schneider, K. Lee, FEPAC: a framework for evaluating parallel algorithms on cluster architectures, in 2021 Australasian Computer Science Week Multiconference (2021), pp. 1–10
11. M.F. Cloutier, C. Paradis, V.M. Weaver, A raspberry pi cluster instrumented for fine-grained power measurement. Electronics 5(4), 61 (2016)
12. P. Arena, A. Basile, M. Bucolo, L. Fortuna, Image processing for medical diagnosis using CNN. Nucl. Instrum. Methods Phys. Res., Sect. A 497(1), 174–178 (2003)
13. N. Mohammadzadeh, M. Gholamzadeh, S. Saeedi, S. Rezayi, The application of wearable smart sensors for monitoring the vital signs of patients in epidemics: a systematic literature review. J. Ambient. Intell. Humanized Comput. 1–15 (2020)
14. B. Javidi, Image Recognition and Classification: Algorithms, Systems, and Applications (CRC Press, 2002)
15. The Raspberry Pi 4 B. (n.d.). [Photograph]. https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Raspberry_Pi_4_Model_B_-_Top.jpg/330px-Raspberry_Pi_4_Model_B_-_Top.jpg
16. R. Zwetsloot, Raspberry Pi 4 specs and benchmarks. The MagPi Magazine (2019, June 24). https://magpi.raspberrypi.org/articles/raspberry-pi-4-specs-benchmarks
17. Memory Throughput Benchmark. (n.d.). [Graph]. https://miro.medium.com/max/750/1*77m8L2cbaL9Hdt2EYjeP2Q.png
18. Ethernet Benchmark. (n.d.). [Graph]. https://miro.medium.com/max/750/1*7TqVFWqvE12sDsRtAIUhjA.png
19. Wi-Fi Benchmark. (n.d.). [Graph]. https://miro.medium.com/max/750/1*R8s4T1As20wqCCM8sg9KpA.png
20. Power Draw Benchmark. (n.d.). [Graph]. https://miro.medium.com/max/750/1*bb_svBs9bxhmxgZ3H-0hg.png
21. K.M. Hosny, M.M. Darwish, K. Li, A. Salah, COVID-19 diagnosis from CT scans and chest X-ray images using low-cost raspberry Pi. PLoS ONE 16(5), e0250688 (2021)
22. T.H. Teo, W.M. Tan, Y.S. Tan, Tumour detection using convolutional neural network on a lightweight multi-core device, in 2019 IEEE 13th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) (IEEE, 2019), pp. 87–92
23. M.G. Kruger, R.P. Springer, G.M. Kersten, R.J. Bril, Contact-less vital sign monitoring using a COTS resource-constrained multi-core system, in IECON 2019–45th Annual Conference of the IEEE Industrial Electronics Society, vol. 1 (IEEE, 2019), pp. 3057–3062
24. H. Bensag, M. Youssfi, O. Bouattane, Embedded agent for medical image segmentation, in 2015 27th International Conference on Microelectronics (ICM) (IEEE, 2015), pp. 190–193
25. H. Cho, P. Kim, R.E. Wijesinghe, H. Kim, N.K. Ravichandran, M. Jeon, J. Kim, Development of raspberry Pi single-board computer architecture based ultra-compact optical coherence tomography. Opt. Lasers Eng. 148, 106754 (2022)
26. P. Kanani, M. Padole, Improving pattern matching performance in genome sequences using run length encoding in distributed raspberry Pi clustering environment. Procedia Comput. Sci. 171, 1670–1679 (2020)
27. S. Sivaranjani, S. Sumathi, A review on implementation of bimodal newborn authentication using raspberry Pi, in 2015 Global Conference on Communication Technologies (GCCT) (IEEE, 2015), pp. 267–272
28. K. Indragandhi, P.K. Jawahar, An application based efficient thread level parallelism scheme on heterogeneous multicore embedded system for real time image processing. Scalable Comput.: Pract. Exp. 21(1), 47–56 (2020)
29. M. Handrik, J. Handriková, M. Vaško, Parallel image signal processing in a distributed car plate recognition system, in 2020 New Trends in Signal Processing (NTSP) (IEEE, 2020), pp. 1–4
30. S.R. Rudraraju, N.K. Suryadevara, A. Negi, Face recognition in the fog cluster computing, in 2019 IEEE International Conference on Signal Processing, Information, Communication & Systems (SPICSCON) (IEEE, 2019), pp. 45–48
31. A. Goel, C. Tung, X. Hu, G.K. Thiruvathukal, J.C. Davis, Y.H. Lu, Efficient computer vision on edge devices with pipeline-parallel hierarchical neural networks (2021). arXiv:2109.13356
32. L. Wang, Z. Zhang, X. Luo, A two-stage data-driven approach for image-based wind turbine blade crack inspections. IEEE/ASME Trans. Mechatron. 24(3), 1271–1281 (2019)
33. Z.Z. Nejad, A.H. Ahmadabadian, ARM-VO: an efficient monocular visual odometry for ground vehicles on ARM CPUs. Mach. Vis. Appl. 30(6), 1061–1070 (2019)
34. A.H. Livoroi, A. Conti, L. Foianesi, F. Tosi, F. Aleotti, M. Poggi, S. Mattoccia, On the deployment of out-of-the-box embedded devices for self-powered river surface flow velocity monitoring at the edge. Appl. Sci. 11(15), 7027 (2021)
35. O.A. Aguirre-Castro, E. Inzunza-González, E.E. García-Guerrero, E. Tlelo-Cuautle, O.R. López-Bonilla, J.E. Olguín-Tiznado, J.R. Cárdenas-Valdez, Design and construction of an ROV for underwater exploration. Sensors 19(24), 5387 (2019)
36. N.A.A. Samah, N.R. Noor, E.A. Bakar, M.K.M. Desa, CCSDS-MHC on raspberry pi for lossless hyperspectral image compression, in IOP Conference Series: Materials Science and Engineering, vol. 943, no. 1 (IOP Publishing, 2020), p. 012004
37. E.M. Rubino, A.J. Álvares, R. Marín, P.J. Sanz, Real-time rate distortion-optimized image compression with region of interest on the ARM architecture for underwater robotics applications. J. Real-Time Image Proc. 16(1), 193–225 (2019)
38. M.G. Bechtel, E. McEllhiney, M. Kim, H. Yun, DeepPicar: a low-cost deep neural network-based autonomous car, in 2018 IEEE 24th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA) (IEEE, 2018), pp. 11–21