Contributions to Environmental Sciences & Innovative Business Technology
Ashish Kumar Rachna Jain Ajantha Devi Vairamani Anand Nayyar Editors
Object Tracking Technology Trends, Challenges and Applications
Contributions to Environmental Sciences & Innovative Business Technology
Editorial Board Members
Allam Hamdan, Ahlia University, Manama, Bahrain
Wesam Al Madhoun, Air Resources Research Laboratory, MJIIT, UTM, Kuala Lumpur, Malaysia
Mohammed Baalousha, Department of EHS, Arnold School of Public Health, University of South Carolina, Columbia, SC, USA
Islam Elgedawy, AlAlamein International University, Alexandria, Egypt
Khaled Hussainey, Faculty of Business and Law, University of Portsmouth, Portsmouth, UK
Derar Eleyan, Palestine Technical University—Kadoori, Tulkarm, Palestine, State of
Reem Hamdan, University College of Bahrain, Manama, Bahrain
Mohammed Salem, University College of Applied Sciences, Gaza, Palestine, State of
Rim Jallouli, University of Manouba, Manouba, Tunisia
Abdelouahid Assaidi, Laurentian University, Sudbury, ON, Canada
Noorshella Binti Che Nawi, Universiti Malaysia Kelantan, Kota Bharu, Kelantan, Malaysia
Kholoud AL-Kayid, University of Wollongong, Leppington, NSW, Australia
Martin Wolf, Center for Environmental Law and Policy, Yale University, New Haven, CT, USA
Rim El Khoury, Accounting and Finance, Notre Dame University, Louaize, Lebanon
Editor-in-Chief
Bahaaeddin Alareeni, Middle East Technical University, Northern Cyprus Campus, Kalkanlı, KKTC, Türkiye
Contributions to Environmental Sciences & Innovative Business Technology (CESIBT) is an interdisciplinary series of peer-reviewed books dedicated to addressing emerging research trends relevant to the interplay between Environmental Sciences, Innovation, and Business Technology in their broadest sense. This series constitutes a comprehensive up-to-date interdisciplinary reference that develops integrated concepts for sustainability and discusses the emerging trends and practices that will define the future of these disciplines. This series publishes the latest developments and research in the various areas of Environmental Sciences, Innovation, and Business Technology, combined with scientific quality and timeliness. It encompasses the theoretical, practical, and methodological aspects of all branches of these scientific disciplines embedded in the fields of Environmental Sciences, Innovation, and Business Technology. The series also draws on the best research papers from EuroMid Academy of Business and Technology (EMABT) and other international conferences to foster the creation and development of sustainable solutions for local and international organizations worldwide. By including interdisciplinary contributions, this series introduces innovative tools that can best support and shape both the economical and sustainability agenda for the welfare of all countries, through better use of data, a more effective organization, and global, local, and individual work. The series can also present new case studies in real-world settings offering solid examples of recent innovations and business technology with special consideration for resolving environmental issues in different regions of the world. The series can be beneficial to researchers, instructors, practitioners, consultants, and industrial experts, in addition to governments from around the world. Published in collaboration with EMABT, the Springer CESIBT series will bring together the latest research that addresses key challenges and issues in the domain of Environmental Sciences & Innovative Business Technology for sustainable development.
Ashish Kumar • Rachna Jain • Ajantha Devi Vairamani • Anand Nayyar Editors
Object Tracking Technology Trends, Challenges and Applications
Editors Ashish Kumar School of Computer Science Engineering and Technology Bennett University Greater Noida, India Ajantha Devi Vairamani AP3 Solutions Chennai, India
Rachna Jain Information Technology Department Bhagwan Parshuram Institute of Technology Delhi, India Anand Nayyar School of Computer Science Duy Tan University Da Nang, Vietnam
ISSN 2731-8303 ISSN 2731-8311 (electronic) Contributions to Environmental Sciences & Innovative Business Technology ISBN 978-981-99-3287-0 ISBN 978-981-99-3288-7 (eBook) https://doi.org/10.1007/978-981-99-3288-7 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
Visual tracking is a broad and important field in computer science, addressing many different applications in the educational, entertainment, industrial, and manufacturing areas. From the very beginning, the field of visual object tracking has evolved greatly, along with diverse imaging technology devices. Tracking the paths of moving objects is an activity with a long history. Object tracking is a significant technology for human survival and has significantly contributed to the progression of mankind. Object tracking algorithms have become an essential part of our daily lives. The goal of object tracking is to locate and track objects in a video stream, often in real time, and to extract information about their motion and behavior. One of the most popular approaches to object tracking is based on the use of visual features, such as color, texture, and shape, to represent objects in the video stream. These features are used to match and track objects across multiple frames and to estimate their motion and behavior. Another popular approach to object tracking is based on the use of deep learning, particularly convolutional neural networks (CNNs). CNNs can learn to recognize and track objects based on their visual features and have been shown to be very effective in a wide range of object tracking tasks. Recent trends in object tracking include the use of multi-object tracking, where multiple objects are tracked simultaneously in a video stream, and the use of 3D information, such as depth and stereo, to improve the accuracy and robustness of object tracking. Object tracking has a wide range of applications, including surveillance and security, where it is used to detect and track people and vehicles in real time, and robotics, where it is used to track and control the motion of robots and drones. Other applications include self-driving cars, where it is used to track and predict the motion of other vehicles and pedestrians, and sports analysis, where it is used to track and analyze the motion of athletes and teams. This book is an introduction to the fascinating field of visual object tracking and provides a solid foundation in diverse terminologies, challenges, applications, and other technical information developed over the past few years for academicians, scientific researchers, and engineers. The aim of this book is to present a unifying approach
with the latest advances in visual tracking algorithms, techniques, and real-time applications. The book comprises 11 chapters contributing diverse aspects of object tracking. The first chapter titled “Single Object Detection from Video Streaming” highlights the technical effectiveness of a deep learning approach to accelerate the deployment of safeguards against the COVID-19 pandemic and proposes an innovative and flexible face mask identification paradigm that recognizes and prevents COVID-19 via deep learning. The second chapter titled “Different Approaches to Background Subtraction and Object Tracking in Video Streams: A Review” investigates different traditional approaches used for feature extraction and the opportunities of machine learning models in different computer vision applications and also explores the working of diverse deep learning approaches and models involved in pedestrian detection. The third chapter titled “Auto Alignment of Tanker Loading Arm Utilizing Stereo-Vision Video and 3D Euclidean Scene Reconstruction” highlights camera calibration, 3D Euclidean scene reconstruction, and how the mathematical structure of a stereo vision system framework affects the accuracy of 3D reconstruction and also focuses on modern fuels such as LPG and LNG that are carried on tanker vessels and on the relationship between 3D Euclidean reconstruction and camera design. The fourth chapter titled “Visual Object Segmentation Improvement Using Deep Convolutional Neural Networks” highlights an efficient technique to enhance the system efficiency of image segmentation and provides a complete reference to all the techniques related to segmentation and retrieval in image processing. The fifth chapter titled “Applications of Deep Learning-Based Methods on Surveillance Video Stream by Tracking Various Suspicious Activities” elaborates on methods based on deep learning for detecting video anomalies and presents a framework for detecting video anomalies in a multimodal, semi-supervised environment. The sixth chapter titled “Hardware Design Aspects of Visual Tracking System” highlights the challenges in designing visual tracking systems, elaborates on the interface mechanisms of camera sensors to FPGAs, and also elaborates on the working of various visual tracking algorithms. The seventh chapter titled “Automatic Helmet (Object) Detection and Tracking the Riders using Kalman Filter Technique” presents a thorough analysis of contemporary technologies for object detection and tracking with a focus on soft computing-based techniques and also proposes a Kalman-filter-based tracker to predict and track various trajectories of moving objects. The proposed tracker was tested and achieved 80% accuracy as compared to existing techniques. The eighth chapter titled “Deep Learning-Based Multi-Object Tracking” elaborates on the techniques based on deep learning for multi-object tracking and reviews various deep learning (DL) techniques such as CNN, long short-term memory networks (LSTM), and recurrent neural networks (RNN) based on performance and application for MOT in video sequences. The ninth chapter titled “Multiple Object Tracking of Autonomous Vehicles for Sustainable and Smart Cities” examines the conventional approaches for object tracking as well as deep learning-based approaches for object tracking. In addition, the chapter proposes an object tracking model based on two families of approaches: traditional models such as model-based, sensor fusion-based, stereo vision-based, and grid-based methods, and deep learning-based models.
To thoroughly test the tracking capability of the suggested algorithm on various objects in difficult scenarios, pertinent experiments based on real driving videos and public datasets were conducted, and the results showed that the proposed approach had high tracking speed and accuracy as well as higher robustness and anti-interference ability compared to the existing techniques. The tenth chapter titled “Multi Object Detection: A Social Distance Monitoring System” elaborates on how deep learning is useful in computer vision for detecting objects and proposes a novel YOLOv4-based social distance monitoring system. The final chapter titled “Investigating Two-Stage Detection Methods Using Traffic Light Detection Dataset” elaborates on DL-based applications, challenges in object detection vis-à-vis tracking, learning, and detection (TLD), and the evaluation of two-stage detection methods using standard evaluation metrics [detection accuracy (DA), F1-score, precision, recall, and running time (RT)] in TLD. The main feature of the book is that it proffers basics to beginners and at the same time serves as a reference for advanced learners. Greater Noida, India Delhi, India Chennai, India Da Nang, Vietnam
Ashish Kumar Rachna Jain Ajantha Devi Vairamani Anand Nayyar
Contents
Single-Object Detection from Video Streaming . . . 1
Akshay Patel, Jai Prakash Verma, Rachna Jain, and Anand Nayyar
Different Approaches to Background Subtraction and Object Tracking in Video Streams: A Review . . . 23
Kalimuthu Sivanantham, Blessington Praveen P, and R. Mohan Kumar
Auto Alignment of Tanker Loading Arm Utilizing Stereo Vision Video and 3D Euclidean Scene Reconstruction . . . 41
R. Prasanna Kumar and Ajantha Devi Vairamani
Visual Object Segmentation Improvement Using Deep Convolutional Neural Networks . . . 63
S. Kanithan, N. Arun Vignesh, and Karthick SA
Applications of Deep Learning-Based Methods on Surveillance Video Stream by Tracking Various Suspicious Activities . . . 87
Preethi Nanjundan and W. Jaisingh
Hardware Design Aspects of Visual Tracking System . . . 111
Manoj Sharma and Ekansh Bhatnagar
Automatic Helmet (Object) Detection and Tracking the Riders Using Kalman Filter Technique . . . 151
Ajantha Devi Vairamani
Deep Learning-Based Multi-object Tracking . . . 183
Ashish Kumar, Prince Sarren, and Raja
Multiple Object Tracking of Autonomous Vehicles for Sustainable and Smart Cities . . . 201
Divya Singh, Ashish Kumar, and Roshan Singh
Multi-object Detection: A Social Distancing Monitoring System . . . 221
Bhavyang Dave, Jai Prakash Verma, Rachna Jain, and Anand Nayyar
Investigating Two-Stage Detection Methods Using Traffic Light Detection Dataset . . . 249
Sunday Adeola Ajagbe, Yetunde J. Oguns, T. Ananth Kumar, Olukayode A. Okı, Oluwakemi Abosede Adeola-Ajagbe, Abolaji Okikiade Ilori, and Oyetunde Adeoye Adeaga
About the Editors
Dr. Ashish Kumar, Ph.D., is working as an assistant professor with Bennett University, Greater Noida, U.P., India. He worked as an assistant professor with Bharati Vidyapeeth’s College of Engineering, New Delhi (affiliated to GGSIPU, New Delhi, India), from August 2009 to July 2022. He completed his Ph.D. in Computer Science and Engineering from Delhi Technological University (formerly DCE), New Delhi, India, in 2020. He received the best researcher award from Delhi Technological University for his contribution in the computer vision domain. He completed his M.Tech with distinction in Computer Science and Engineering from GGS Indraprastha University, New Delhi. He has published more than 25 research papers in various reputed national and international journals and conferences. He has published 15+ book chapters in various Scopus-indexed books. He has authored/edited many books in the AI and computer vision domains. He is an active member of various international societies and clubs. He is a reviewer for many reputed journals and serves on the technical program committees of various national/international conferences. Dr. Kumar has also served as a session chair at many international and national conferences. His current research interests include object tracking, image processing, artificial intelligence, and medical imaging analysis.
Rachna Jain has been working as an Associate Professor (IT Department) with Bhagwan Parshuram Institute of Technology (GGSIPU) since August 2021. She worked as an Assistant Professor with Bharati Vidyapeeth’s College of Engineering from August 2007 to August 2021. She did her PhD from Banasthali Vidyapith (computer science) in 2017 and received her ME degree in 2011 from Delhi College of Engineering (Delhi University) with a specialization in computer technology and applications. Her current research interests are in cloud computing, fuzzy logic, network and information security, swarm intelligence, big data and IoT, deep learning, and machine learning. She has contributed more than 30 book chapters to various books. She has also served as a Session Chair at various international conferences. Jain completed a DST project titled “Design an autonomous intelligent drone for city surveillance” as Co-PI. She has a total of 17+ years of academic/research experience with more than 100 publications in various national and international conferences as well as in international journals (Scopus/ISI/SCI) of high repute.
Ajantha Devi Vairamani is working as a research head in AP3 Solutions, Chennai, Tamil Nadu, India. She received her PhD from the University of Madras in 2015. She has worked as Project Fellow under UGC Major Research Project. She is a Senior Member of IEEE. She has also been certified as “Microsoft Certified Application Developer” (MCAD) and “Microsoft Certified Technical Specialist” (MCTS) from Microsoft Corp. Devi has more than 40 papers in international journals and conference proceedings to her credit. She has written, co-authored, and edited a number of books in the field of computer science with international and national publishers like Elsevier and Springer. She has been associated as a member of the Program Committee/ Technical Committee/Chair/Review Board for a variety of international conferences. She has five Australian patents and one Indian patent to her credit in the area of artificial intelligence, image processing, and medical imaging. Her work in image processing, signal processing, pattern matching, and natural language processing is based on artificial intelligence, machine learning, and deep learning techniques. She has won many best paper presentation awards as well as a few research-oriented international awards.
Anand Nayyar received his PhD (computer science) from Desh Bhagat University in 2017 in the area of wireless sensor networks, swarm intelligence, and network simulation. He is currently working in the School of Computer Science-Duy Tan University, Da Nang, Vietnam, as Professor, Scientist, Vice-Chairman (Research), and Director—IoT and Intelligent Systems Lab. He is a certified professional with 125+ professional certificates from CISCO, Microsoft, Amazon, EC-Council, Oracle, Google, Beingcert, EXIN, GAQM, Cyberoam, and many more. Nayyar has published 175+ research papers in various high-quality ISI-SCI/SCIE/SSCI impact factor journals as well as in Scopus/ESCI indexed journals; 70+ papers in international conferences indexed with Springer, IEEE, and ACM Digital Library; and 50+ book chapters in various Scopus/Web of Science indexed books with Springer, CRC Press, Wiley, IET, and Elsevier with 11000+ Citations, and H-Index of 55 and I-Index of 200. He is a Senior Member and Life Member of 60+ associations, including IEEE and ACM. He has authored/co-authored and edited 55+ books on computer science. Nayyar has been associated with 500+ international conferences as Program Committee/Chair/Advisory Board/Review Board member. He has 18 Australian patents, 7 German Patents, 4 Japanese Patents, 34 Indian Design cum Utility patents, 8 UK Design Patents, 1 US Patents, 3 Indian copyrights, and 2 Canadian copyrights to his credit in the area of wireless communications, artificial intelligence, cloud computing, IoT, and image processing. He has received 44 awards for teaching and research— Young Scientist, Best Scientist, Best Senior Scientist, Asia Top 50 Academicians and Researchers, Young Researcher Award, Outstanding Researcher Award, Excellence in Teaching, Best Senior Scientist Award, DTU Best Professor and Researcher Award—2019, 2020–2021, 2022, Obada Prize 2023, and many more. He is listed in the top 2% of scientists as per Stanford University (2020, 2021, 2022). He is an Associate Editor for Wireless Networks (Springer), Computer Communications (Elsevier), International Journal of Sensor Networks (IJSNET) (Inderscience), Frontiers in Computer Science, PeerJ Computer Science, Human Centric Computing and Information Sciences (HCIS), Tech Science Press, CSSE, IASC, IET-Quantum
Communications, IET Wireless Sensor Systems, IET Networks, IJDST, IJISP, IJCINI, IJGC, and IJSIR. He is also Editor-in-Chief of IGI-Global and a US journal titled International Journal of Smart Vehicles and Smart Transportation (IJSVST). He has reviewed 2500+ articles for diverse Web of Science and Scopus indexed journals. He is currently researching in the area of wireless sensor networks, Internet of Things, swarm intelligence, cloud computing, artificial intelligence, drones, blockchain, cyber security, healthcare informatics, big data, and wireless communications.
Single-Object Detection from Video Streaming Akshay Patel, Jai Prakash Verma, Rachna Jain, and Anand Nayyar
1 Introduction
The spread of COVID-19 is more and more worrying for everybody in the world. The virus can spread from person to person through airborne droplets [1]. In the absence of an effective antiviral treatment and with limited medical resources, the WHO recommends many measures to control increasing infection and avoid exhausting those limited medical resources. In this pandemic, wearing a mask is one of the non-pharmacological interventions that can reduce the source of SARS-CoV-2 droplets expelled by an infected individual. Setting aside the debate about medical-grade equipment and the variety of masks, all countries recommend that the public cover the nose and mouth [2]. According to instructions from the WHO, to reduce the spread of COVID-19, everyone is expected to wear a face mask, avoid crowded places, and maintain a strong immune system. Hence, to protect everybody, everyone is expected to wear a face mask properly whenever they go out to meet others. However, some people, for a variety of reasons, choose not to use face masks [3, 30]. Additionally, a detector is necessary to check that face masks are actually worn. The research presented in this chapter intends to improve the ability of face mask detectors to recognize any sort of face mask [4].
A. Patel · J. P. Verma (✉) Institute of Technology, Nirma University, Ahmadabad, India e-mail: [email protected]; [email protected] R. Jain Information Technology Department, Bhagwan Parshuram Institute of Technology, Delhi, India e-mail: [email protected] A. Nayyar Graduate School, Faculty of Information Technology, Duy Tan University, Da Nang, Viet Nam e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 A. Kumar et al. (eds.), Object Tracking Technology, Contributions to Environmental Sciences & Innovative Business Technology, https://doi.org/10.1007/978-981-99-3288-7_1
Object detection can be done using traditional techniques or modern techniques. Traditional techniques rely on image processing, and modern techniques rely on deep learning networks [5]. We chose deep learning methods for this research because deep learning-based object detection is significantly more robust to complex scenes and challenging lighting. Deep learning methods often rely on supervised training, and their performance benefits from the rapidly growing capability of GPUs each year [6]. There are several families of deep learning-based object detection algorithms:
1. Two-stage-based object detection: Two-stage object detection algorithms include Fast R-CNN and Faster R-CNN, R-CNN and SPP-Net, Mask R-CNN, pyramid networks/FPN, and G-RCNN (2021).
2. One-stage-based object detection: One-stage object detection algorithms include YOLO, SSD, RetinaNet, YOLOv3, YOLOv4, and YOLOR.
Object detection using a deep learning method extracts features from the input image or from the video frame. Two subsequent tasks are solved by an object detection method:
Task 1: Find an arbitrary number of objects (it can also be zero).
Task 2: Classify every object and estimate its size with a bounding box.
Two-stage detectors make the process easier by breaking these tasks into two stages: region proposal followed by classification and box refinement. One-stage detectors, in contrast, handle both tasks in one step and predict bounding boxes over the image without a separate object region proposal step based on conventional computer vision methods or deep networks. Because they skip this step, one-stage detectors are less time-consuming, which is why they are used in real-time programs. The great advantage of one-stage algorithms is that they are usually faster and simpler in structure. Thus, to detect the face mask, we have chosen the deep learning-based one-stage object detection algorithm YOLOv4. As a solution to this problem, we developed object detection in video using deep learning techniques. To do this, we have used YOLOv4 (You Only Look Once version 4). This model indicates whether a person has worn a mask or not.
Objectives of the Chapter The following are the objectives of this chapter:
• To demonstrate the technical effectiveness of a deep learning approach to accelerate the deployment of safeguards against the global COVID-19 pandemic;
• To create an innovative and flexible face mask identification method that recognizes and prevents COVID-19 via deep learning;
• To analyze an advanced deep convolutional YOLOv4 neural network architecture to discover saliency features and classify them;
• To experiment with and validate the proposed method on parameters like accuracy and F1 score;
• To design and develop face mask detection using the You Only Look Once (YOLO) v4 model;
• And, to compare the proposed method with the state-of-the-art research work.
Organization of the Chapter The rest of the chapter is organized as follows: Section 2 presents the literature survey in two categories—two-stage-based object detection and one-stage-based object detection. Section 3 presents the materials and methods. Section 4 highlights a few use cases related to the defined issues. Section 5 presents the proposed research work. Sections 6 and 7 describe the experimentation, implementation, results, and discussion. And, finally, Section 8 concludes the chapter with future scope.
2 Related Work
In the literature, many approaches have been proposed for object detection based on deep learning techniques. We have therefore organized this section according to the specific deep learning strategies used, and Table 1 provides a summary.
2.1 Two-Stage-Based Object Detection
The state of the art of the two-stage object detection algorithm as presented by Venkateswarlu et al. [1] was to detect masked faces using a CNN. In this research, a global pooling layer was deployed to flatten the feature vector, and a fully connected dense layer followed by a SoftMax layer was applied for classification. This approach gives good results on the data it was trained on but does not give encouraging results when applied to CCTV footage. Yu et al. [27] presented an enhanced R-CNN network to discover and recognize more objects in the image. R-CNN is the primary convolutional neural network applied to object detection [28] in a detection model. The authors picked out candidate regions based on traditional selective search and set up an R-CNN to extract features from the image. For object detection, R-CNN models require four steps: selective search, CNN feature extraction, classification, and bounding box regression [29]. Negi et al. [2] presented the VGG16 and CNN models in 2021. Both are deep learning models that recognize masked and unmasked persons to help keep track of safety violations and maintain a safe working atmosphere. The study made use of data augmentation, dropout, normalization, and transfer learning. The technology can be deployed wherever monitoring is needed, such as shopping centers, hospitals, transportation hubs, restaurants, and other community gatherings. To implement the CNN, ten convolution (CONV) layers are used with 3 × 3 kernels and filter counts of 16, 16, 32, 32, 128, and 128, respectively, and ReLU is used as the activation. The model applies up to five pooling layers with a stride of 2 and then a flatten layer. Then, using ReLU activation functions, the first, second, and third dense layers are built with 512, 128, and 64 hidden nodes, respectively. Using the SoftMax activation function, a fourth dense layer with two hidden nodes is used for the final output.
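For illustration, a minimal sketch of a classifier along these lines is given below. It is only an interpretation of the layer arrangement described above, written with tf.keras; the input size, the exact number of CONV/pooling layers, and the training settings used in [2] are assumptions and may differ from the original work.

```python
# Sketch (assumptions: tf.keras, 128x128 RGB inputs) of a mask / no-mask CNN
# classifier following the layer description above.
import tensorflow as tf
from tensorflow.keras import layers

def build_mask_classifier(input_shape=(128, 128, 3)):
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=input_shape))
    # CONV blocks with 3x3 kernels; filter counts follow the description above,
    # each followed here by a max pooling layer with stride 2.
    for filters in (16, 16, 32, 32, 128, 128):
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2), strides=2))
    model.add(layers.Flatten())
    # Dense layers with 512, 128, and 64 hidden nodes (ReLU).
    for units in (512, 128, 64):
        model.add(layers.Dense(units, activation="relu"))
    # Final dense layer with two nodes and SoftMax for mask / no-mask.
    model.add(layers.Dense(2, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```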
Table 1 Summary of frameworks based on object detection

Authors | Model | Modification | Limitation
Qian et al. [7] | Fast R-CNN | Using MSERs instead of SS | Different implementation systems for region proposal and detector
Zhao et al. [8] | Fast R-CNN | Different spatial scales of pedestrians | Additional 2 FC layer parameters
Hu et al. [9] | Faster R-CNN | Adding attention network | Traffic lights should be treated as signs
Greenhalgh and Mirmehdi [10] | Faster R-CNN | Adding a local context layer to focus on small objects | No distinction between traffic signs
Dai et al. [11] | R-FCN | Detection on UAV images | Faster R-CNN detector has no change
Yang et al. [12] | Faster R-CNN | Use boosted classifiers | Long computation time
Kawano et al. [13] | YOLO | Multi-functional YOLO | Powerless to color markings
Xie et al. [14] | YOLO | Multi-directional detection | Plate deformations due to oblique camera angles are not considered
Tao et al. [15] | YOLO | Add average pooling layer instead of FC layer | Need to do pre-processing on night scene images
Wu et al. [16] | YOLOv2 object tracking | Using Kalman filter and Hungarian algorithm | When people overlap in an image frame, detection is not stable enough
Peng et al. [17] | YOLOv4 | Combining detection with re-identification | Weight factor is a determined parameter
Jadhav and Momin [18] | YOLOv2 pedestrian | Simplifying with nine CONV layers | Bad performance on overlapping objects
Bochkovskiy et al. [19] | YOLOv4 | Adding three 3 × 3 and one 1 × 1 CONV layers | Experimental article
Deore et al. [20] | YOLOv2 | Using 1 × 1 filters | Low accuracy for small signs
Yang et al. [21] | YOLOv2 vehicle | Using K-means cluster and a grid of size 14 × 14 | For distant objects, accuracy is not high
Qu et al. [22] | YOLOv3 pedestrian | Using an image enhancement policy | Pre-processing is needed on input images
Kim et al. [23] | SSD on-road objects | Fine-tuning the SSD on KITTI datasets | Real-time processing is slow
Matas et al. [24] | SSD pedestrian | Use small patches | Long time for SegNet
Meng et al. [25] | SSD | Using text detector and a FCN | Time-consuming with sliding window
Müller and Dietmayer [26] | SSD | Using Inceptionv3 instead of VGG | It is hard to distinguish target objects from their mirror images
Simonyan and Zisserman [31] applied the Faster R-CNN method to detect pedestrians in 2016 by training and tuning VGGNet for pedestrian-only networks and compared it with R-CNN-based fast pedestrian detection methods. Li et al. [32] reported a lower test speed and an unsatisfactory miss rate; before the RPN, a clustering algorithm is added. Zhang et al. [33] proposed a method to perform the pedestrian detection task in 2017. The focus of the researchers was on computational time: in the process of generating proposals, searching the entire image takes a long time, the area occupied by pedestrians is small, and there are many unnecessary proposals, which increases the training time complexity. Instead of using the RPN directly to generate the initial candidates, the K-means clustering algorithm was used to obtain the proposals, which are then passed to the RPN. Therefore, the computation time is shortened. Xu et al. [34] used the Faster R-CNN infrastructure to detect cyclists in 2017. Compared to the original structure, a generation module was added that produces a synthetic depth image that is fed to the backbone network. A depth image represents the shape of an object, and a DCNN can learn it more efficiently. Even though computer-generated images were used for learning, performance was much better when testing on depth images converted from real images. Fan et al. [35] studied vehicle detection with the Faster R-CNN and carried out several extensive experiments and analyses. By tailoring the detector to UAV imagery, the authors improve the efficiency of identifying objects in these distinctive photographs; a hyper region proposal network is further utilized to apply the Faster R-CNN directly to blocks of the UAV images. Tang et al. [36] provided a method to improve the performance of small object detection using the R-CNN. For that, the authors used several enhanced classifiers to filter the candidate regions and remove incorrect detections. These two modifications increase the detection accuracy for small vehicles and reduce the number of false positives on vehicle-like objects. Previously, there were many implementations for traffic scenarios using the Faster R-CNN. In order to detect traffic signals, Zuo et al. [37] modified the Faster R-CNN architecture. The newly introduced network runs a coarse pass to identify potential regions of interest, and the RPN is utilized to filter these regions, reducing computation because the RPN no longer needs to scan the entire feature map in order to create proposals. A DeCONV layer was applied after the CONV layers to help improve the detection precision for small objects. Both shallow and deep information was employed by combining the attention map, which provided details about small objects, with the enlarged feature map produced by the high-level semantic DeCONV layer. However, the authors achieved a mAP value of 0.34493, which was not ideal for this kind of problem. Cheng et al. [38] used the local context-based Faster R-CNN approach to detect traffic signs, which utilized the region proposal network for proposal generation and local context information surrounding proposals for classification. The local context layer was used after the RPN, which transforms each proposal into three
proposals, extending horizontally and vertically, respectively. After that, it combines all the proposals and retains the surrounding information for the final detection. Matas et al. [24] developed a face mask detection system using Fast R-CNN as its basic model [39]. Considering that the color of a face mask is often uniform and distinctive, the system implements the region proposal task using maximally stable extremal regions (MSERs) instead of SS [40]. Pedestrian detection has also benefited from the development of Fast R-CNN. Li et al. [32] proposed an architecture containing two subnets that perform pedestrian detection at different spatial scales. Proposals generated by the SS algorithm are fed to the two subnets simultaneously, after several CONV layers. To produce the final detection, the two output feature maps are combined with a weighting strategy. After training with large pedestrian inputs, the output of the large-scale subnet scores higher, so this subnet largely determines the final detection result. Table 1 summarizes the frameworks based on object detection.
2.2 One-Stage-Based Object Detection
One-stage detection algorithms perform very well in terms of speed compared to the two-stage detection algorithm family. For that reason, in this section, we have included work that uses a single-stage detector for static image detection and also for object detection in video streams. Raghunandan et al. [41] presented research work on analyzing object detection algorithms for video surveillance applications; to develop surveillance applications, the authors used different factors, namely face detection, color detection, skin detection, and shape detection. To detect these features, the authors used MATLAB 2017b. The output of the algorithm indicated the various parts of the face like the nose, eyes, and mouth. Liu et al. [42] implemented work that applies conventional image processing to handle the blur, noise, and rotation found in real-world captures and trained a reliable model using the YOLO algorithm in order to improve traffic sign detection. The YOLO algorithm was employed in 2018 by Yang and Jiachun [43] for face detection in real-time applications with short detection times. Jiang et al. [44] developed a face mask detector called RetinaFaceMask in 2020. The developed model was a single-stage detector and included a feature pyramid network to combine high-level semantic information with multiple feature maps and an attention module to detect face masks. Van-Ranst et al. [45] combined the YOLOv2 model with a re-identification network into a framework that can quickly recognize people and also re-identify them in other images. The YOLO-REID framework supports a “mixed-end” or a “split-end” architecture with a 128-value embedding output in each cell instead of using a classification output. Darknet19 weights can be shared between the detection network and the re-identification network when combined. Consequently, YOLO-REID can perform each task with a negligible increase in complexity compared to
YOLOv2. Heo et al. [46] simplified YOLOv2 in 2018 so that it contains just nine CONV layers, six max pooling layers, and two fully connected layers, whereas the original YOLOv2 model feeds the image through the full network. For the nighttime pedestrian detection task, this tiny YOLOv2 model is merged with the feature map from an adaptive Boolean-map-based saliency kernel, which helps pedestrians stand out from the background. In 2017, Jensen et al. [47] removed the last CONV layer of the YOLOv2 model for the purpose of improving traffic light detection performance. The authors added four CONV layers, of which three have kernel sizes of 3 × 3 and one of 1 × 1. When it comes to detecting small objects like traffic signs with YOLOv2, Zhang et al. [48], starting from the original YOLOv2, used three different models. The change occurred in the middle layers, where a new CONV layer with a kernel size of 1 × 1 was inserted to hold the cross-channel information. SqueezeDet proposes a detection path using the YOLO-inspired ConvDet layer to compute bounding boxes and perform classification. The ConvDet layer removes the final FC layer from YOLO and incorporates the anchor-box idea from the RPN. As a result, fewer model parameters are used in SqueezeDet, and proposals can be generated for the same number of grid locations compared to YOLO. YOLOv2 is clearly superior to the native YOLO, and Jo et al. [49] implemented this popular framework to solve various object detection problems. A real-time system that can track more than one particular class of objects at once is made possible by combining YOLOv2 with the Kalman filter and the Hungarian algorithm. In this system, YOLOv2 acts as the detector that finds objects in the first frame. In the next frame, given the YOLOv2 detections of frame t − 1, the Kalman filter is responsible for generating predictions for frame t. It then uses these predictions to match the detection results of the frame and decide whether the filter is correct. A full framework like that described by Szegedy et al. [50] is not usually combined with YOLOv3; as a result, there are not many applications entirely based on YOLOv3, especially in traffic scenes. YOLOv3 is nevertheless the foundation of work on pedestrian detection in which object detection is advanced utilizing deep learning approaches with pre-processed training samples to reduce contextual impacts such as changes in lighting conditions or pedestrian density. Then, in order to detect pedestrians more precisely and effectively, YOLOv3 was given these pre-processed images. The first SSD idea was presented in 2016 by Du et al. [51]. A Pascal VOC-based SSD reference model was used as their base model; on this dataset, small objects like pedestrians and cyclists do badly. To reduce the large image ratio brought on by image enlargement, a small SSD framework is used together with an additional aspect ratio. The findings show that by combining an updated SSD version with a data augmentation plan, performance may be improved. A fusion of detection and semantic segmentation DNNs has been employed to carry out pedestrian detection (FDNN). To create pedestrian
candidates from the source images, the SSD within FDNN is employed. To refine the proposals, further DNNs are then applied to these candidates, together with more than one DNN classifier. SSDs have the benefit of predicting objects on different feature maps, which allows them to identify small things. Therefore, the use of an SSD backbone may be advantageous for the detection of street signs, which are small objects in comparison to vehicles and people. Zhu et al. [52] proposed a text detector that extracts text and identifies exactly what the text is, as well as a sign detector that recognizes the sign regions of the source image using an FCN. The TextBoxes text detection approach derived from SSD has the drawback of taking the object maps only from the earlier CONV layers. For real-time object detection, the YOLO model series is very popular because it keeps a balance between detection speed and accuracy. But real-time object detection in video still needs a high-performance and power-hungry platform. Now, the problem is how to choose the appropriate model for object detection. For that, in this chapter, we propose the YOLOv4 model and compare it with other state-of-the-art models in terms of both accuracy and running time.
3 Materials and Methods
3.1 Materials
3.1.1 Dataset
The Flickr-Faces-HQ (FFHQ) dataset was used for experimenting with face mask detection. The dataset comprises 4000 images, as depicted in Fig. 1. These 4000 images are divided into 2 classes—the first is “with a mask” and the second is “without a mask.” Around 2000 images with a proper mask are included in the “with a mask” class, and another 2000 images without a mask or with an improperly worn mask are included in the “without a mask” class. The total size of this dataset is 1.16 GB.
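For Darknet-style training, the image paths are usually collected into train/test list files. A minimal sketch of an 80/20 split (the same ratio used for training later in this chapter) is given below; the directory and file names are assumptions.

```python
# Sketch: split the 4,000-image dataset 80/20 and write the image list files
# that Darknet-style training expects. Directory names are assumptions.
import glob
import random

random.seed(42)
images = sorted(glob.glob("data/obj/*.jpg"))  # assumed location of the images
random.shuffle(images)

split = int(0.8 * len(images))
with open("data/train.txt", "w") as f:
    f.write("\n".join(images[:split]))
with open("data/test.txt", "w") as f:
    f.write("\n".join(images[split:]))
```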
3.1.2 Data Pre-processing
The YOLOv4 hyper-parameters were set as follows: the input images come in different sizes, so they are resized to 608 × 608 pixels to facilitate detection. Initially, the learning rate was set to 0.001 and scaled by a factor of 0.1 at 5000 steps and 5500 steps, respectively. In order to perform training at multiple scales, all architectures use a single GPU with a batch size of 64 and a mini-batch size of 16. The default momentum is 0.949, the intersection over union (IoU) threshold is 0.213, and the loss normalizer is 0.07, as suggested by the YOLOv4 authors.
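These hyper-parameters correspond to entries in the Darknet .cfg file. A sketch of the relevant lines is given below; the values are those listed above, the key names follow the standard Darknet configuration format, and all other entries are assumed to keep their defaults.

```
# Relevant [net] and [yolo] entries of a yolov4 .cfg file reflecting the
# hyper-parameters listed above (other entries keep their defaults).
[net]
width=608
height=608
batch=64
subdivisions=16
momentum=0.949
learning_rate=0.001
steps=5000,5500
scales=.1,.1

[yolo]
iou_thresh=0.213
iou_normalizer=0.07
```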
Fig. 1 Dataset images
Fig. 2 Horizontal flip
3.1.3 Data Augmentation
To increase the performance of the proposed model, data augmentation is used. It presents the neural network with a large number of variations of the inputs. In our case, not much varied data of this type is available; in the dataset used, most of the images are frontal faces, masked and unmasked. Therefore, it is necessary to augment the data to obtain a better result. To maximize the diversity of the face data and prevent the model from learning unimportant patterns, horizontal-flip data augmentation is applied to the pre-processed data. Original and flipped images are shown in Fig. 2. By default, the YOLOv4 architecture has some data enhancement techniques such as CutMix,
Mosaic data augmentation, class label smoothing, and self-adversarial training (SAT), which also help the model attain higher accuracy.
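Mosaic, CutMix, and label smoothing are switched on inside the .cfg file (e.g., mosaic=1), whereas the horizontal flip can also be applied offline. A minimal sketch of an offline horizontal flip that keeps the YOLO-format labels consistent is shown below; the file paths are assumptions.

```python
# Sketch: horizontal-flip augmentation for an image and its YOLO-format labels.
# Only the normalized x-centre of each box must be mirrored; y, width, and
# height stay unchanged.
import cv2

def flip_sample(image_path, label_path, out_image_path, out_label_path):
    image = cv2.imread(image_path)
    cv2.imwrite(out_image_path, cv2.flip(image, 1))  # 1 = flip around the vertical axis

    flipped_lines = []
    with open(label_path) as f:
        for line in f:
            cls, x, y, w, h = line.split()
            x = 1.0 - float(x)  # mirror the normalized x-centre
            flipped_lines.append(f"{cls} {x:.6f} {y} {w} {h}")
    with open(out_label_path, "w") as f:
        f.write("\n".join(flipped_lines))
```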
3.1.4 Data Annotation
Data annotation is very important for the model. It is nothing but labeling the data or photos. There are different types of data annotation, such as image labeling, text annotation, and video labeling. Image labels are used in our model. For this, the bounding boxes in the images are drawn as rectangles using the Ybat tool to annotate each image, which is a long and difficult task, as the pictures have to be labeled by hand.
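Ybat can export the annotations in the YOLO format, in which every image has a .txt file with one line per object. A hypothetical label file for an image containing one masked and one unmasked face could look like this (the values are illustrative only):

```
0 0.512 0.434 0.180 0.260
1 0.274 0.401 0.165 0.243
```

Here the first number is the class index (0 = mask, 1 = no-mask, following the class order assumed in the configuration), and the remaining four numbers are the box center (x, y), width, and height, all normalized to [0, 1].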
3.2 Methods
3.2.1 Overview of YOLOv4
YOLOv4 is the improved version of YOLOv1, YOLOv2, and YOLOv3. As illustrated in Fig. 3, it uses CSPDarknet53 as the main structure of the network. By adding the spatial pyramid pooling (SPP) block to CSPDarknet53, the YOLOv4 structure greatly increases the size of the receptive field compared to CSPResNeXt50 and EfficientNet-B3. It can accept additional parameters without slowing down the procedure, and because of this feature it performs well in detection tasks. When
Fig. 3 YOLOv4 network
Fig. 4 Proposed system block diagram
we talk about data augmentation, YOLOv4 uses Mosaic, which combines four training images into one and effectively increases the relative size of small objects, and self-adversarial training (SAT), in which the network perturbs its own input image and then learns to detect objects on the modified image. In addition, YOLOv4 also uses PAN to aggregate features from different levels and avoid information loss, as mentioned in Fig. 4.
4 Applications of YOLOv4
The use cases for object detection are very diverse; there are almost infinite ways to let computers take over visual tasks that humans would otherwise perform manually or to create new, powerful YOLOv4-based AI products and services. Object detection is used in computer vision applications ranging from game production to manufacturing analytics. Today, object recognition is at the core of vision software and AI models using YOLOv4. Object detection plays an important role in scene understanding, which is valued in security, transportation, medical, and military applications. Recently, IBM did analytical work with edge-enabled cameras that can detect face masks and determine whether an employee or worker has worn the mask properly or not. If an employee has not worn a mask properly, then he/she will not be allowed at the workplace; the company installed this system because it can help its employees, workers, and customers come back to the office safely.
1. Railway Station: At the ticket windows of railway stations, human interaction happens while getting the tickets. At that time, it is essential to check whether the person who comes to buy the ticket has worn a mask or not. For this purpose, we need to fit a camera at the window that continuously checks whether the person has a mask on or not. If a person has not worn a mask, then he/she will not be allowed to buy the tickets.
2. Office Entrance: Currently, many MNCs are implementing hybrid models and calling their employees back to the office. Employees should go to the office for 2 days per week and pass through the entrance when entering. If we fit a camera there, it will check whether the employee has worn a mask or not. If he/she does not wear the mask, then the door will not open.
3. College Campus: In this case, we need to fit a camera at the classroom door. While a student enters the classroom, our model will check whether the student has worn a mask or not. A minimal inference sketch for such an entry-control camera is given after this list.
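The sketch below uses OpenCV's DNN module to run a trained YOLOv4 detector on a camera stream and decide whether entry should be allowed. The weight/configuration file names, the class index for "no-mask," and the door-control action are assumptions.

```python
# Sketch of an entry-control loop (assumptions: OpenCV DNN, trained weights/cfg
# file names, class 1 = "no-mask"; the door action is only printed here).
import cv2

net = cv2.dnn.readNetFromDarknet("custom-yolov4.cfg", "custom-yolov4_best.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(608, 608), scale=1 / 255.0, swapRB=True)

capture = cv2.VideoCapture(0)  # entrance camera
while True:
    ok, frame = capture.read()
    if not ok:
        break
    class_ids, scores, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)
    faces = len(class_ids)
    unmasked = sum(1 for c in class_ids if int(c) == 1)  # assumed: class 1 = no-mask
    if faces > 0 and unmasked == 0:
        print("All faces masked - entry allowed")   # trigger door opening here
    elif unmasked > 0:
        print("Mask missing - entry denied")
```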
5 Proposed Methodology
In the years 2020–2021, a dangerous virus disease named COVID-19 affected our daily life by damaging people's health, which became the cause of many lost lives. To break the chain of virus spread, the WHO recommended many measures to control the rate of infection and to avoid exhausting limited medical resources. Wearing a mask is one of the most important measures to control the spread of COVID-19. Now, in day-to-day life, there are many places where the authorities need to manually check on camera how many people/employees actually wear masks. Thus, to reduce human effort, a novel face mask detection technique is proposed in this chapter. The work is implemented using the deep neural network model named YOLOv4. YOLOv4 can run twice as fast as other deep neural network methods used for object detection. The performance of the YOLOv4 model is about 10% better than YOLOv3 in AP and about 12% better in FPS. Reflecting these results, it is well suited for real-time mask detectors wherever high detection accuracy is required. Bochkovskiy et al. [19] proposed YOLOv4 with certain big modifications over its predecessor YOLOv3, which brought significant improvements in terms of speed as well as accuracy. YOLOv4 is very fast, easy to train, durable, and stable and gives promising results even on the smallest details, which is why we have selected it for our single-object detection task. For a given input image/frame, the output falls into the mask and no-mask classes. This means that the same model both localizes faces and identifies whether each face is covered, which greatly improves task specificity. The pipeline has three sections: backbone, neck, and head. The input is an RGB image or frame, and the backbone responds by extracting features from the image [19]. CSPDarknet53 has proven to be a very good selection. Moon et al. [53] describe how the input of the base layer is split into two parts: one goes to the dense block, and the other goes directly to the next transition layer. Dense blocks consist of layers, and every layer includes batch normalization and ReLU followed by a convolution layer. Every layer of the dense block takes the feature maps of all previous layers as input. This expands the information available to the backbone and helps extract discriminative features of the image. The spatial pyramid pooling (SPP) block used in the neck expands the receptive field and fuses features coming from various levels of the backbone. Figure 5 roughly depicts the combination of SPP with the YOLO pipeline. The Bag of Freebies (BoF) refers to training techniques that improve the network's accuracy without increasing its inference cost. BoF techniques include CutMix, Mosaic data augmentation, and DropBlock, in addition to class label smoothing applied when training the backbone. Self-adversarial training, CIoU loss, Cross mini-Batch Normalization (CmBN), Mosaic data augmentation, grid sensitivity elimination, and DropBlock regularization, for instance, are among the freebies applied to the detector. The Bag of Specials (BoS) refers to add-on modules and post-processing methods that improve network performance while increasing the inference cost only slightly.
Fig. 5 YOLOv4 architecture
and DropBlock, in addition to being connected to the use of chosen beauty marker spine procedures. SelfAdversarial training, CIoU-misfortune, Cross Mini-Batch Normalization (CBN), mosaic information creation, framework engagement, and DropBlock’s murder, for instance, all received recognition of one of the benefits of identity. Unique providing (BoS), the add-on approach is customary while “special” is noted systems enhance community performance even as expanding viewing costs at low fee. The YOLO model treats item detection as a proprietary regression problem, without delay from the photo pixels to the bounding field coordination and item opportunities. An integrated network predicts the couple of sure packing containers and the chances of those boxes. YOLO implements full-size image detection as well as inversion, improving the receiver performance. For detection of objects, an integrated model has an advantage over traditional methods. In all the grid cells, it divides the image into an MxM grid and predicts B-binding boxes, the probability of the object, and the confidence level of the predictive binding boxes. Each cell grid predicts B-binding boxes and certainty in these boxes. Those confidence figures show how confident the model is in how the box contains the object and the way it appropriately thinks about the field and the anticipated items (please refer to Figs. 5 and 6). Clearly, we’re defining self-confidence, as Pr(gadgets) IoU. If the mobile is empty, then it should be zero as a conceitedness score. Otherwise, the vanity score is identical to the intersection over union (IoU) among the predicted boxes, and, as a consequence, the lowest reality of each bounding container includes the self and five prediction variables, x, y, w, and h, where coordinates (x, y) represent the center of the box, compared to the grid cell of boundaries, while the height and width prediction depend on the whole image/picture. Finally, the prediction of arrogance represents the IoU between the expected box and the basic truth box. The conditional class probabilities Pr(Classi % | Object) probability is predicted by every grid. These
Fig. 6 YOLO model: the input is divided into an S × S grid; each cell predicts bounding boxes with confidence scores and a class probability map, which are combined into the final detections
probabilities are conditioned on the grid cell containing an object. Only one set of class probabilities is predicted per grid cell, regardless of the number of boxes B. The YOLO network has 24 convolutional layers followed by 2 fully connected layers; it uses 1 × 1 reduction layers followed by 3 × 3 convolutional layers. Fast YOLO trains a neural network with 9 convolutional layers instead of 24 and fewer filters between those layers. Apart from the size of the network, all of the training and testing parameters are the same for the YOLO model and Fast YOLO. YOLO is optimized with a sum-squared error on the output. Sum-squared error is used because it is easy to optimize, although it does not perfectly align with maximizing average precision. It weights localization error equally with classification error, which is not ideal. In addition, in every image, many grid cells do not contain any object. This pushes the "confidence" scores of those cells toward zero, often overpowering the gradient from the cells that do contain objects. This can make the model unstable, leading to early training divergence. To remedy this, YOLO increases the loss from bounding box coordinate predictions and decreases the loss from confidence predictions for boxes that do not contain any object. To achieve this, YOLO employs two parameters, λ_coord and λ_noobj; λ_coord is set to 5 and λ_noobj to 0.5. Additionally, the sum-squared error weights errors in large boxes and small boxes equally, whereas the error metric should reflect that small deviations matter less in large boxes than in small boxes. To partially address this, YOLO predicts the square root of the bounding box width and height rather than the width and height directly. YOLO predicts multiple bounding boxes per grid cell. At training time, we want only one bounding box predictor to be responsible for each object. One predictor is assigned responsibility for predicting an object based on which prediction has the highest current IoU with the ground truth. This leads to specialization among the bounding box predictors.
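For reference, the complete loss described above can be written out as in the original YOLO formulation, where $\mathbb{1}_{ij}^{\mathrm{obj}}$ indicates that the $j$-th box predictor in grid cell $i$ is responsible for an object and $\mathbb{1}_{i}^{\mathrm{obj}}$ that cell $i$ contains an object:

$$
\begin{aligned}
\mathcal{L} = {}& \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \Big[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + \big(\sqrt{w_i} - \sqrt{\hat{w}_i}\big)^2 + \big(\sqrt{h_i} - \sqrt{\hat{h}_i}\big)^2 \Big] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \big(C_i - \hat{C}_i\big)^2 + \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \big(C_i - \hat{C}_i\big)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \big(p_i(c) - \hat{p}_i(c)\big)^2 ,
\end{aligned}
$$

with $\lambda_{\mathrm{coord}} = 5$, $\lambda_{\mathrm{noobj}} = 0.5$, and confidence $C = \Pr(\mathrm{Object}) \cdot \mathrm{IoU}^{\mathrm{truth}}_{\mathrm{pred}}$.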
6 Experimentation and Implementation
6.1 Experimental Setup
Mainly three files are needed to configure YOLOv4:
1. object.names file: It contains the names of the classes, one on each line. In our case, these are the two classes named mask and no-mask.
2. object.data file: It simply contains the number of classes, the paths of the training and testing data, and the location for the backup of weights.
3. custom yolov4.cfg file: It contains the information about the width, height, filters, steps, max batches, burn-in, etc. In our case, we set the batch size to 64, the subdivisions to 16, and the learning rate to 0.001 and also set the number of classes to 2 in the three YOLO blocks and the filter count to 21 in the convolutional layers preceding them.
Here, the workflow of the YOLOv4 object detection algorithm is discussed in detail. Initially, an image dataset is collected and used for training with YOLOv4. The dataset includes images of people with and without masks. Figure 7 shows the YOLOv4 workflow.
Fig. 7 YOLOv4 workflow
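The following minimal sketch (file names follow the text; the directory layout, training/testing list paths, and backup folder are assumptions for illustration) writes the object.names and object.data files and notes the .cfg fields mentioned above.

import os

def write_yolov4_configs(root="data"):
    os.makedirs(root, exist_ok=True)
    # object.names: one class name per line.
    with open(os.path.join(root, "object.names"), "w") as f:
        f.write("mask\nno-mask\n")
    # object.data: number of classes, train/test lists, names file, backup folder.
    with open(os.path.join(root, "object.data"), "w") as f:
        f.write("classes = 2\n")
        f.write("train = data/train.txt\n")
        f.write("valid = data/test.txt\n")
        f.write("names = data/object.names\n")
        f.write("backup = backup/\n")
    # The custom yolov4.cfg is edited by hand: batch=64, subdivisions=16,
    # learning_rate=0.001, classes=2 in each of the three [yolo] blocks, and
    # filters=21 (= (classes + 5) * 3) in the [convolutional] layer before
    # each [yolo] block.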
6.2 Training YOLOv4
The YOLOv4 model separates a given image into grid cells, each of which is responsible for detecting objects in its region. A confidence score is computed for each bounding box in every grid cell. The model effectively divides the image into a grid, detects typical features within this mesh, and groups features found with high confidence in neighboring cells into a single region to improve overall performance. Once the setup is finished, the model is ready for training. Twenty percent of the dataset was used for verification, while the remaining 80% was used for training; an illustrative script for this split is given below. We trained our real-time mask detection model using the Darknet framework; Darknet provides the underlying network architecture and is the foundation for YOLO. Figure 8 illustrates the steps: the model splits the image into a grid, the features are then detected, and finally the recognized objects are displayed.
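As an illustrative sketch of the 80/20 split described above (the directory layout and file names are assumptions, not the authors' scripts), the image list can be divided into the training and verification lists that Darknet reads:

import glob
import random

def split_dataset(image_dir="data/obj", train_ratio=0.8, seed=42):
    images = sorted(glob.glob(image_dir + "/*.jpg"))
    random.Random(seed).shuffle(images)        # shuffle reproducibly before splitting
    cut = int(len(images) * train_ratio)
    with open("data/train.txt", "w") as f:
        f.write("\n".join(images[:cut]))       # 80% of the images for training
    with open("data/test.txt", "w") as f:
        f.write("\n".join(images[cut:]))       # remaining 20% for verification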
6.3 Evaluation Measures
The performance of the YOLOv4 method is evaluated using recall, precision, F1 score, specificity, and mean average precision (mAP). The intersection over union, also called the Jaccard index, is used to assess the accuracy of the object detector on a given dataset. Specificity measures how many of the negative predictions are true negatives (correct). Precision measures how many of the predicted positive values are actually positive. Recall measures how many of the true positive values are correctly classified. The F1 score, also called the F score, is a function of precision and recall and is needed to maintain a balance between precision and recall. Average precision (AP), one of the most widely used metrics for measuring the accuracy of object detectors, is the area under the precision-recall curve. The mean average precision (mAP) is the average AP calculated over all classes.
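These measures can be written out directly; the sketch below assumes the true-positive, false-positive, false-negative, and true-negative counts have already been obtained by matching detections to ground truth at a chosen IoU threshold.

def detection_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)            # share of predicted positives that are correct
    recall = tp / (tp + fn)               # share of actual positives that are found
    specificity = tn / (tn + fp)          # share of actual negatives correctly rejected
    f1 = 2 * precision * recall / (precision + recall)
    # AP is the area under the precision-recall curve; mAP averages AP over all classes.
    return precision, recall, specificity, f1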
Fig. 8 Workflow of YOLOv4
7 Results and Performance Analysis
To show the detection performance of the YOLOv4 method, 800 test images were used, and the results are shown in Table 2. As seen in Table 2, the precision, recall, F1 score, and specificity of the proposed method were 92.77%, 95.44%, 93.58%, and 78.84%, respectively. In this study, the mean value of IoU was observed to be 79.48%. Furthermore, Table 3 shows the performance of the YOLOv4 model for each class.
Performance Analysis As can be seen from Table 4, the model shows its best performance, with an accuracy of 91.12%, in identifying people wearing masks correctly. The results obtained in this study were compared with those reported in the literature (Table 4). As can be seen from Table 4, in the studies where face mask detection was previously performed, only whether humans were wearing masks or not was detected. When the accuracy score of 91.12% obtained by the YOLOv4 model was compared with alternative solutions in the literature, the detection performance of the proposed solution was observed to be high.
8 Conclusion and Future Work
In this chapter, the YOLOv4 detection model was used for effective face mask detection. The test results show that the YOLOv4 model achieved a high accuracy of 91.12%, measured as mean average precision. Therefore, it can be used to find cases where a mask is not worn at all or is not worn correctly. This study will help limit the spread of the coronavirus by making it easier to identify people without masks, or people who are not wearing masks properly, in public places such as schools, shopping malls, railway stations, and markets. In the near future, we plan to extend this research to check social distancing in public places like shopping malls, railway stations, and markets. Online examinations can also use this face detection method so that all students can take exams in a system-monitored environment. The research work can further be extended to monitoring and controlling traffic signals.
Table 2 Detection results of the YOLOv4 method

Method | Precision (%) | Recall (%) | F1 score (%) | Specificity (%)
YOLOv4 | 92.77 | 95.44 | 93.58 | 78.84

Table 3 Class-wise average precision values of the YOLOv4

Class | With mask | Without mask
With mask | 565 | 27
Without mask | 44 | 164
Table 4 Performance accuracy results comparison with related works

Authors | Model | Training dataset | Detection | Accuracy (%)
Venkateswarlu et al. [1] | MobileNet | 250 with mask, 250 without mask | With mask, without mask | 88.7
Raghunandan et al. [41] | MATLAB 2017b | 100 with mask, 150 without mask | With mask, without mask | 86.24
Liu et al. [42] | YOLO | 100 with mask, 100 without mask | With mask, without mask | 83.46
Vijitkunsawat and Chantngarm [4] | KNN | 161 with mask, 161 without mask | With mask, without mask | 87.8
Kilic and Aydin [54] | ResNet-50 | 511 with mask, 511 without mask | With mask, without mask | 47.7
Srinivasan et al. [55] | MobileNetV2 | 611 with mask, 611 without mask | With mask, without mask | 90.2
Moon et al. [53] | CNN-based cascade framework | 776 with mask, 776 without mask | With mask, without mask | 86.6
Proposed model | YOLOv4 | 400 with mask, 400 without mask and with improper mask | With mask, without mask | 91.12
The results presented with the YOLOv4 model are promising; the system can be extended with YOLOv5 and other deep learning architectures, as well as with the Fast R-CNN method, for different use cases.
References 1. Venkateswarlu, I. B., Kakarla, J., & Prakash, S. (2020, December). Face mask detection using mobilenet and global pooling block. In 2020 IEEE 4th conference on information & communication technology (CICT) (pp. 1–5). IEEE. 2. Negi, A., Kumar, K., Chauhan, P., & Rajput, R. S. (2021, February). Deep neural architecture for face mask detection on simulated masked face dataset against covid-19 pandemic. In 2021
International Conference on Computing, Communication, and Intelligent Systems (ICCCIS) (pp. 595–600). IEEE. 3. Sai, B. K., & Sasikala, T. (2019, November). Object detection and count of objects in image using tensor flow object detection API. In 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT) (pp. 542–546). IEEE. 4. Vijitkunsawat, W., & Chantngarm, P. (2020, October). Study of the performance of machine learning algorithms for face mask detection. In 2020-5th international conference on information technology (InCIT) (pp. 39–43). IEEE. 5. Kumar, A., Walia, G. S., & Sharma, K. (2020). Recent trends in multicue based visual tracking: A review. Expert Systems with Applications, 162, 113711. 6. Saleh, K., Hossny, M., Hossny, A., & Nahavandi, S. (2017, October). Cyclist detection in lidar scans using faster r-cnn and synthetic depth images. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC) (pp. 1–6). IEEE. 7. Qian, R., Liu, Q., Yue, Y., Coenen, F., & Zhang, B. (2016, August). Road surface traffic sign detection with hybrid region proposal and fast R-CNN. In 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD) (pp. 555–559). IEEE. 8. Zhao, X., Li, W., Zhang, Y., Gulliver, T. A., Chang, S., & Feng, Z. (2016, September). A faster RCNN-based pedestrian detection system. In 2016 IEEE 84th Vehicular Technology Conference (VTC-Fall) (pp. 1–5). IEEE. 9. Hu, Q., Paisitkriangkrai, S., Shen, C., van den Hengel, A., & Porikli, F. (2015). Fast detection of multiple objects in traffic scenes with a common detection framework. IEEE Transactions on Intelligent Transportation Systems, 17(4), 1002–1014. 10. Greenhalgh, J., & Mirmehdi, M. (2012). Real-time detection and recognition of road traffic signs. IEEE Transactions on Intelligent Transportation Systems, 13(4), 1498–1506. 11. Dai, J., Li, Y., He, K., & Sun, J. (2016). R-fcn: Object detection via region-based fully convolutional networks. Advances in Neural Information Processing Systems, 29. 12. Yang, T., Long, X., Sangaiah, A. K., Zheng, Z., & Tong, C. (2018). Deep detection network for real-life traffic sign in vehicular networks. Computer Networks, 136, 95–104. 13. Kawano, M., Mikami, K., Yokoyama, S., Yonezawa, T., & Nakazawa, J. (2017, December). Road marking blur detection with drive recorder. In 2017 IEEE International Conference on Big Data (Big Data) (pp. 4092–4097). IEEE. 14. Xie, L., Ahmad, T., Jin, L., Liu, Y., & Zhang, S. (2018). A new CNN-based method for multidirectional car license plate detection. IEEE Transactions on Intelligent Transportation Systems, 19(2), 507–517. 15. Tao, J., Wang, H., Zhang, X., Li, X., & Yang, H. (2017, October). An object detection system based on YOLO in traffic scene. In 2017 6th International Conference on Computer Science and Network Technology (ICCSNT) (pp. 315–319). IEEE. 16. Wu, B., Iandola, F., Jin, P. H., & Keutzer, K. (2017). Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 129–137). 17. Peng, H., Guo, S., & Zuo, X. (2021, May). A vehicle detection method based on YOLOV4 model. In 2021 2nd International Conference on Artificial Intelligence and Information Systems (pp. 1–4). 18. Jadhav, L. H., & Momin, B. F. (2016, May). Detection and identification of unattended/ removed objects in video surveillance. 
In 2016 IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT) (pp. 1770–1773). IEEE. 19. Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. (2020). Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
20. Deore, G., Bodhula, R., Udpikar, V., & More, V. (2016, June). Study of masked face detection approach in video analytics. In 2016 Conference on Advances in Signal Processing (CASP) (pp. 196–200). IEEE. 21. Yang, W., Zhang, J., Wang, H., & Zhang, Z. (2018, May). A vehicle real-time detection algorithm based on YOLOv2 framework. In Real-Time Image and Video Processing 2018 (Vol. 10670, pp. 182–189). SPIE. 22. Qu, H., Yuan, T., Sheng, Z., & Zhang, Y. (2018, October). A pedestrian detection method based on yolov3 model and image enhanced by retinex. In 2018 11th international congress on image and signal processing, biomedical engineering and informatics (CISP-BMEI) (pp. 1–5). IEEE. 23. Kim, H., Lee, Y., Yim, B., Park, E., & Kim, H. (2016, October). On-road object detection using deep neural network. In 2016 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia) (pp. 1–4). IEEE. 24. Matas, J., Chum, O., Urban, M., & Pajdla, T. (2004). Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10), 761–767. 25. Meng, Z., Fan, X., Chen, X., Chen, M., & Tong, Y. (2017, August). Detecting small signs from large images. In 2017 IEEE International Conference on Information Reuse and Integration (IRI) (pp. 217–224). IEEE. 26. Müller, J., & Dietmayer, K. (2018, November). Detecting traffic lights by single shot detection. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC) (pp. 266–273). IEEE. 27. Yu, L., Chen, X., & Zhou, S. (2018, June). Research of image main objects detection algorithm based on deep learning. In 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC) (pp. 70–75). IEEE. 28. Nayyar, A., Jain, R., & Upadhyay, Y. (2020, August). Object detection based approach for Automatic detection of Pneumonia. In 2020 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD) (pp. 1–6). IEEE. 29. Kumar, A. (2023). Visual object tracking using deep learning. CRC Press. 30. Bu, W., Xiao, J., Zhou, C., Yang, M., & Peng, C. (2017, November). A cascade framework for masked face detection. In 2017 IEEE international conference on cybernetics and intelligent systems (CIS) and IEEE conference on robotics, automation and mechatronics (RAM) (pp. 458–462). IEEE. 31. Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 32. Li, J., Liang, X., Shen, S., Xu, T., Feng, J., & Yan, S. (2017). Scale-aware fast R-CNN for pedestrian detection. IEEE Transactions on Multimedia, 20(4), 985–996. 33. Zhang, H., Du, Y., Ning, S., Zhang, Y., Yang, S., & Du, C. (2017, December). Pedestrian detection method based on Faster R-CNN. In 2017 13th International Conference on Computational Intelligence and Security (CIS) (pp. 427–430). IEEE. 34. Xu, Y., Yu, G., Wang, Y., Wu, X., & Ma, Y. (2017). Car detection from low-altitude UAV imagery with the faster R-CNN. Journal of Advanced Transportation, 2017. 35. Fan, Q., Brown, L., & Smith, J. (2016, June). A closer look at Faster R-CNN for vehicle detection. In 2016 IEEE intelligent vehicles symposium (IV) (pp. 124–129). IEEE. 36. Tang, T., Zhou, S., Deng, Z., Zou, H., & Lei, L. (2017). Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining. Sensors, 17(2), 336. 37. Zuo, Z., Yu, K., Zhou, Q., Wang, X., & Li, T. (2017, June). Traffic signs detection based on faster r-cnn. 
In 2017 IEEE 37th International Conference on Distributed Computing Systems Workshops (ICDCSW) (pp. 286–288). IEEE. 38. Cheng, P., Liu, W., Zhang, Y., & Ma, H. (2018). LOCO: local context based faster R-CNN for small traffic sign detection. In MultiMedia Modeling: 24th International Conference, MMM 2018, Bangkok, Thailand, February 5–7, 2018, Proceedings, Part I 24 (pp. 329–341). Springer International.
39. Taneja, S., Nayyar, A., & Nagrath, P. (2021). Face mask detection using deep learning during Covid-19. In Proceedings of Second International Conference on Computing, Communications, and Cyber-Security (pp. 39–51). Springer. 40. Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision, 104, 154–171. 41. Raghunandan, A., Raghav, P., & Aradhya, H. R. (2018, April). Object detection algorithms for video surveillance applications. In 2018 International Conference on Communication and Signal Processing (ICCSP) (pp. 0563–0568). IEEE. 42. Liu, C., Tao, Y., Liang, J., Li, K., & Chen, Y. (2018, December). Object detection based on YOLO network. In In 2018 IEEE 4th information technology and mechatronics engineering conference (ITOEC) (pp. 799–803). IEEE. 43. Yang, W., & Jiachun, Z. (2018, July). Real-time face detection based on YOLO. In 2018 1st IEEE international conference on knowledge innovation and invention (ICKII) (pp. 221–224). IEEE. 44. Jiang, M., Fan, X., & Yan, H. (2020). Retinamask: a face mask detector. 45. Van Ranst, W., De Smedt, F., Berte, J., & Goedemé, T. (2018, November). Fast simultaneous people detection and re-identification in a single shot network. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1–6). IEEE. 46. Heo, D., Lee, E., & Chul Ko, B. (2017). Pedestrian detection at night using deep neural networks and saliency maps. Journal of Imaging Science and Technology, 61(6), 60403–60401. 47. Jensen, M. B., Nasrollahi, K., & Moeslund, T. B. (2017). Evaluating state-of-the-art object detector on challenging traffic light data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 9–15). 48. Zhang, J., Huang, M., Jin, X., & Li, X. (2017). A real-time Chinese traffic sign detection algorithm based on modified YOLOv2. Algorithms, 10(4), 127. 49. Jo, K., Im, J., Kim, J., & Kim, D. S. (2017, September). A real-time multi-class multi-object tracker using YOLOv2. In 2017 IEEE International Conference on Signal and Image Processing Applications (ICSIPA) (pp. 507–511). IEEE. 50. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818–2826). 51. Du, X., El-Khamy, M., Lee, J., & Davis, L. (2017, March). Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection. In 2017 IEEE winter conference on applications of computer vision (WACV) (pp. 953–961). IEEE. 52. Zhu, Y., Liao, M., Yang, M., & Liu, W. (2017). Cascaded segmentation-detection networks for text-based traffic sign detection. IEEE Transactions on Intelligent Transportation Systems, 19(1), 209–219. 53. Moon, S. W., Lee, J., Lee, J., Nam, D., & Yoo, W. (2020, October). A comparative study on the maritime object detection performance of deep learning models. In 2020 International Conference on Information and Communication Technology Convergence (ICTC) (pp. 1155–1157). IEEE. 54. Kilic, I., & Aydin, G. (2020, September). Traffic sign detection and recognition using tensorflow’s object detection API with a new benchmark dataset. In 2020 international conference on electrical engineering (ICEE) (pp. 1–5). IEEE. 55. Srinivasan, S., Singh, R. R., Biradar, R. R., & Revathi, S. A. (2021, March). 
COVID-19 monitoring system using social distancing and face mask detection on surveillance video datasets. In 2021 International conference on emerging smart computing and informatics (ESCI) (pp. 449–455). IEEE.
Different Approaches to Background Subtraction and Object Tracking in Video Streams: A Review Kalimuthu Sivanantham, Blessington Praveen P, and R. Mohan Kumar
K. Sivanantham (✉), HCL Technologies Limited, Coimbatore, Tamil Nadu, India; e-mail: [email protected]
Blessington Praveen P, Embedded Development, Crapersoft, Coimbatore, Tamil Nadu, India
R. M. Kumar, Department of EEE, Sri Ramakrishna Engineering College, Coimbatore, Tamil Nadu, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023; A. Kumar et al. (eds.), Object Tracking Technology, Contributions to Environmental Sciences & Innovative Business Technology, https://doi.org/10.1007/978-981-99-3288-7_2

1 Introduction
The importance of pedestrian detection in numerous applications, particularly in the fields of robotics, automotive systems, and surveillance, has made it a very prominent research area. Despite significant advancements, pedestrian identification is still an open problem that calls for more precise algorithms. Most human detection systems detect the face or the whole human in a single image. These methods use a model based on human appearance, such as color, form, or contrast, which is localized in each image frame. According to Comaniciu et al. [1], human appearance changes in an uncontrolled outdoor environment due to environmental aspects such as lighting conditions, identity, clothing, and contrast. Images taken with light-enhancing cameras or at night are subject to camouflage or masking effects. These factors produce large variation in both scene and human appearance, thus demanding discriminative features for human and non-human classification [2]. Han et al. [3] discussed the challenges in pedestrian detection arising from the complexity of real-world backgrounds, the diversity of shooting angles, and the variety of postures, and elaborated on the difficulty of detecting humans as they take different poses, take on varied appearances from different viewpoints, and may also appear occluded by other objects or humans.
Fig. 1 General block diagram for video detection: input video → extract image frames → feature extraction → classification → detection
The authors also noted the possibility of false-positive detections due to the presence of background objects with humanoid shapes, such as chairs, cylinders, textured areas, or fire hydrants. As these objects resemble pedestrian shapes, there is a possibility of falsely detecting them as humans. Hwang et al. [4] discussed that human detectors involve a trade-off: a sensitive human detector with a low decision threshold detects most pedestrians but also detects background human-shaped objects, whereas a more conservative detector with a high decision threshold shows a low false-positive rate but suffers from a higher miss rate. García et al. [5] added further challenges in pedestrian detection procedures due to illumination conditions, the complex backgrounds of real-world environments, and the various articulated poses of humans. Figure 1 shows the general block diagram for video pedestrian detection. Real-time applications require the correct interpretation of different visual stimuli for human detection in order to accomplish very complex tasks such as autonomous driving or playing sports [6]. In such complex applications, the scene must be evaluated in a matter of milliseconds in order to respond to these stimuli across a variety of jobs. The goal of artificial intelligence is to develop algorithms that automatically analyze and comprehend a scene's information from a single image frame or from a series of video frames and respond appropriately. Detection of human shapes has remained challenging for the last two decades, and real-world applications demand detection at a very high rate under critical conditions.
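Following the block diagram in Fig. 1, a bare-bones version of this pipeline can be sketched with OpenCV; the feature extractor and classifier are placeholders for whichever traditional, ML, or DL method is adopted later in the chapter, and the frame-subsampling step is an assumption for illustration.

import cv2

def run_pipeline(video_path, extract_features, classify, frame_step=5):
    """Input video -> frames -> features -> classification -> detections."""
    cap = cv2.VideoCapture(video_path)
    detections, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:            # subsample frames for speed
            feats = extract_features(frame)  # e.g., HOG, wavelets, CNN features
            label = classify(feats)          # e.g., SVM, AdaBoost, deep classifier
            detections.append((idx, label))
        idx += 1
    cap.release()
    return detections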
Wu et al. [7] discussed the visibility of images taken at night and under poor lighting. Low-intensity pixels are more prevalent in images taken at night, so these images have low contrast. They also contain a lot of noise because the sensor does not receive enough light; if the camera obtains enough light through a suitable exposure setting, this noise can be greatly reduced. In addition, nighttime photographs exhibit motion blur as objects move, making pedestrians difficult to identify. In many cases, increasing the exposure is insufficient: even when the exposure level was raised, it was not possible to recover pertinent human features, because the actual edges created by the image contents and the false edges produced by noise were blended together. Brunetti et al. [8] outlined the difficulty of detection from a machine vision perspective. Because explicit models were not available, machine learning techniques were utilized to learn an implicit representation from examples. It is, as previously said, an instance of the multi-class object classification problem. Based on its peculiarities, pedestrian detection has its own set of approaches. There is a wide range of possible appearances of pedestrians depending on their clothing, poses, articulations, lighting, and background. The detection component should have prior knowledge of the physical environment of the scene to improve its performance [9]. Enormous effort is spent on the collection of extensive databases to study and construct prior knowledge from thousands of samples. The extensive review of the literature in this chapter focuses on the following objectives:
• To study the strengths of different traditional approaches used for feature extraction;
• To investigate the opportunities of ML models in different computer vision applications;
• To examine the possibilities of implementing ML algorithms for image feature classification;
• To analyze the prospects of hybridizing the ML algorithms;
• To explore and analyze the working of different DL approaches and the various models involved in pedestrian detection;
• And, to construct models that will work best to boost the overall performance of pedestrian detection.
Recent advancements in computer vision have enabled active research on the detection of humans in sequences of images. The past few years have seen a rise in interest in pedestrian detection among those involved in computer vision. In terms of features, models, and general architectures, numerous pedestrian detection systems were described by Kristan et al. [10]. This chapter focuses on elaborating different methods used for pedestrian detection. The study is decomposed into the techniques available for feature extraction and classification in traditional, ML, and DL approaches. The chapter also discusses the algorithms used in traditional, ML, and DL approaches for pedestrian detection and the challenges of existing pedestrian detection systems. In addition, the scope for building new models to improve the performance of pedestrian detection is discussed [11].
Organization of the Chapter The rest of the chapter is organized as follows: the literature review covering frame rate conversion, foreground extraction, feature extraction, and machine learning and deep learning approaches for pedestrian detection is summarized in Sect. 2. Section 3 concludes the chapter with the future scope.
2 Literature Review
A traditional pedestrian detection algorithm requires a set of features that describe pedestrian characteristics and combines them with a classifier to distinguish pedestrians from non-pedestrians. High interclass variability and low intraclass variability are the goals of the categorization. There is high variability in the color pattern of pedestrians, so plain pixel representations cannot be used. Other traditional representations, such as edge-based, fine-scale, and region-based approaches, face the same problems as pixel-based models due to inadequate and inconsistent color information.
2.1 Survey on Frame Rate Conversion Techniques
Motion vectors estimated using regular motion estimation (ME) may not correspond to the true motion vectors (TMV), thereby limiting the trajectory path for objects in motion. This results from classical ME's goal of minimizing the prediction error, which lowers the number of bits needed to encode the residual error. The TMV that tracks the motion trajectory of a moving object is crucial for success in applications such as temporal interpolation between frames. To manage areas around moving objects, Shen et al. [12] suggested a motion-guided temporal frame interpolation technique. This technique uses motion divergence as a soft, signed signal of disocclusion and folding of the motion field to locate the foreground item in areas close to moving objects. Around moving objects, additional texture optimization is used to lessen processing complexity and enhance the quality of interpolated frames [13]. Kim et al. [14] suggested adding more MVs and removing the artifacts in regions that contain small objects to create a block-based hierarchical ME. To estimate the MV of a block at a lower level of the hierarchy, the initial MVs are allocated from the three motion vectors of the upper level. The suggested method can find missing MVs for small objects at the top hierarchical level and propagate them down to the lowest level. The method also tries to find a small object whose MV differs from that of the block containing it. The high-cost pixels of each block are selected, and an MV is estimated for each high-cost pixel; at the bottom level of the hierarchical ME, this additional MV information is propagated to the missing motion vectors. Yang and Feng [15] proposed a low-complexity TMV algorithm for motion-compensated temporal frame interpolation by tracking the projection of moving
objects in the scene as closely as possible. Here, to estimate the MVs, unidirectional motion estimation (UME) was used, considering both the forward and backward directions. In order to predict the TMV, an MV smoothness constraint was added to the motion vector fields by exploiting the spatiotemporal neighboring MVs. Based on the evaluation, increasing the number of neighbor blocks increases the robustness of the algorithm, thus eliminating inaccurate MVs. Fan et al. [16] handled occlusion effectively; along with the current and the previous frame, additional frames in both directions were used to refine the estimated MVs. While the refinement was done by a maximum a posteriori algorithm, a block-wise directional hole interpolation method was used for filling the holes in the interpolated frame. By estimating the trajectory of the target block in the current frame, the motion trajectory calibration stage eliminated false MVs; predicting the trajectory of the target block led to the adoption of a second-order polynomial. According to the evaluation, the method's estimation accuracy increased compared to other approaches, but its disadvantage was that it was more expensive to compute.
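To make the distinction between prediction-error-minimizing ME and true motion concrete, the following minimal full-search block-matching sketch (NumPy only; block and search sizes are illustrative) selects, for each block, the displacement minimizing the sum of absolute differences, which is exactly the classical criterion discussed above.

import numpy as np

def block_matching(prev, curr, block=16, search=8):
    """Return a (rows, cols, 2) array of (dy, dx) motion vectors for curr w.r.t. prev."""
    h, w = curr.shape
    mvs = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            target = curr[by:by + block, bx:bx + block].astype(np.int32)
            best, best_mv = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue
                    cand = prev[y:y + block, x:x + block].astype(np.int32)
                    sad = np.abs(target - cand).sum()   # prediction error, not true motion
                    if sad < best:
                        best, best_mv = sad, (dy, dx)
            mvs[by // block, bx // block] = best_mv
    return mvs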
2.2 Survey on Foreground Extraction Techniques
A plethora of research works has been published on background modeling. The frame differencing technique is the simplest method for recognizing foreground objects: assuming the previous frame is the background model, it calculates the difference between pixels in the current frame and pixels in the previous frame. Because this approach is so simple, it is vulnerable to many false-positive errors due to changes in global lighting and other circumstances. Other background modeling methods, such as simple averaging and histogram analysis over time, are also available. They are, however, susceptible to mistakes in a range of unconstrained real-world circumstances, such as background clutter, outdoor scenarios, and so forth. For videos with a static background, the foreground extraction results are very good, but in a dynamic setting they are ineffective [17]. Background modeling is divided into two categories, region-based methods and pixel-based approaches, each with its own set of criteria. The foundation of pixel-by-pixel approaches is the idea that every pixel has an independent observation sequence; these techniques model the background using statistical data gathered over time. To take advantage of the spatial dependence of background pixels, region-based algorithms use spatial data in the form of image regions. Recent studies have demonstrated the superior results of deep learning in image processing systems. Deep learning has recently proved beneficial in image processing, categorization, representation learning, and a range of other applications. Jiang et al. [18] developed a deep learning approach for background modeling based on CNNs that outperformed current methods. Muhammad et al. [19] presented a CNN architecture for fire detection in a video surveillance application; the suggested CNN architecture is similar to GoogLeNet. There are 100 layers in the architecture, including 2 primary convolutions, 4 max pooling layers, 7 inception modules, and
1 average pooling layer. The inception modules used in this architecture serve to reduce computational complexity while also increasing network flexibility. The suggested architecture outperforms both the existing fire detection system and an AlexNet-based fire detection approach in tests, although it generates more false alarms than the AlexNet architecture. Babaee et al. [20] created a correlation-based background removal technique that is resilient to changing illumination conditions and to local pattern changes caused by blowing leaves, fountain waterfalls, and other factors. In order to make the algorithm robust against illumination changes, a spatially modulated normalized vector distance (SMNVD) was proposed, computed by biasing the modulated normalized vector distance based on the spatial characteristics of the block. In addition, a temporal normalized vector distance co-occurrence matrix method was proposed and integrated to make the algorithm robust against temporal fluctuation patterns. Sakpal and Sabnis [21] proposed a new method for adaptive background subtraction based on a codebook that quantizes each pixel. With limited memory capacity, the proposed approach enables robust recognition for compressed videos and can handle scenes with shifting backdrops or illumination fluctuations. While pixel-based background modeling approaches can be used to extract foreground items, they are affected by lighting changes, noise, and dynamic backgrounds; a methodology based on nonparametric kernel density estimation of the background and foreground statistical representations was therefore proposed, and Bayesian techniques were developed to effectively handle backgrounds with significant variance. Han and Davis [22] proposed a probabilistic support vector machine (SVM) to model the background. In order to categorize each pixel as foreground or background, the background model was generated by calculating the output probabilities for each pixel using the SVM; the background initialization procedure continues until there are no more pixels to classify. Ojha and Sakhare [23] proposed a three-part wallflower technique in which the region-level component segments the homogeneous foreground object regions, and proposed a GMM that describes each pixel using a combination of K Gaussian functions. Although the GMM was improved by using an online expectation maximization (EM) method to initialize the parameters of the background model, it still had a time-complexity issue. A single Gaussian approach, which used one Gaussian distribution to describe the background at each pixel position, was also offered, and the progression from a single Gaussian technique to a mixture of Gaussians (MoG) model was proposed.
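As a concrete reference point for the pixel-based techniques above, the snippet below contrasts simple frame differencing with OpenCV's built-in mixture-of-Gaussians subtractor; the threshold and model parameters are illustrative defaults, not values from the cited works.

import cv2

def frame_difference_mask(prev_gray, curr_gray, thresh=25):
    # Treat the previous frame as the background model (simplest approach).
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask

# Mixture-of-Gaussians (MoG) background model maintained internally by OpenCV.
mog = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)

def mog_mask(frame):
    # Updates the per-pixel Gaussian mixture and returns the foreground mask.
    return mog.apply(frame)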
2.3 Feature Extraction Methods
Zhou and Wei [24] proposed a system for recognizing objects based on AdaBoost classification and wavelets with a Haar-like structure.
This foundation for pedestrian detection was later revised: because people come in many shapes and sizes, it produced a lot of false-positive results. Researchers therefore turned to the histogram of oriented gradients (HOG) and the aggregate channel feature (ACF) algorithm for feature extraction. In HOG, feature descriptors are used to create the feature vectors of the images. Lin et al. [25] recommended using the distribution of local intensity gradients and edge directions to represent the shape and appearance of objects in images, as these distributions carry a large amount of information. For improved performance, gamma and color normalization of images must be done as a pre-processing step; gamma correction was recommended for images with uneven lighting, where it can reduce or increase the contrast of an image. Feng et al. [26] applied a similar color normalization to convert color images to gray images. To capture the appearance of pedestrians, the gradient of each pixel was calculated. The next step was to accumulate the histogram of gradient directions for each cell, where each cell is defined by a fixed square sliding window. Normalization is important due to the diversity of background conditions and the large range of gradient changes; it is done by overlapping a sliding block over each block, where the dimension of each block is 2 × 2 cells. After normalization, the result is the collection of HOGs of all blocks over the detection window, which represents the features of the image [27]. Hua et al. [28] devised the aggregate channel feature (ACF) algorithm with the idea of aggregating channel features with fast feature pyramids. A multi-scale representation of the input image and its features is created as pyramids, including the normalized gradient magnitude, the LUV color channels, and the histogram of gradients (HOG). These pyramids are used to compute the channels once for every eight scales in a sparse set of scales, instead of at each scale.
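A minimal HOG computation along these lines can be written with scikit-image; the 2 × 2-cell blocks echo the description above, while the remaining parameters are common defaults rather than the exact settings of the cited works, and the input is assumed to be a color image of a fixed-size detection window.

from skimage import color, io
from skimage.feature import hog

def hog_features(image_path):
    gray = color.rgb2gray(io.imread(image_path))   # color normalization to gray
    return hog(gray,
               orientations=9,                     # bins of gradient directions
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),             # 2 x 2 cells per block
               block_norm="L2-Hys")                # per-block normalization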
2.4 Machine Learning Approaches for Pedestrian Detection
Zhang et al. [29] proposed Haar wavelets, a computationally efficient framework that looks for intensity differences in small local regions at multiple edge scales. When the Haar transform is applied to an image, it produces a set of coefficients at various scales that represent the wavelets' responses throughout the entire image. The transform provides three sets of coefficients, one for each wavelet's response to intensity variations oriented vertically, horizontally, and diagonally. These features can be used with an SVM classifier to detect pedestrians in static images; with additional cues, the approach can be extended to processing video sequences [30]. Detecting human appearance was the goal of these pedestrian detection methods. Edges, wavelet responses, color distributions, background removal, and a mix of several cues, including depth information, color, and neural network models of facial patterns, were all employed. Detection was frequently employed as the first step of the tracking process. Llorca et al. [31] published a survey of pedestrian detection approaches, focusing on methodology such as ROI selection, classification methods, and tracking.
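The Haar-wavelet-plus-SVM idea can be sketched as follows, using PyWavelets for the multi-scale decomposition and a linear SVM from scikit-learn; this pairing is illustrative, assumes fixed-size grayscale detection windows, and does not reproduce the exact pipeline of the cited work.

import numpy as np
import pywt
from sklearn.svm import LinearSVC

def haar_features(gray_window, levels=2):
    # Multi-level 2D Haar transform; keep the detail coefficients
    # (horizontal, vertical, diagonal responses) at every scale.
    coeffs = pywt.wavedec2(gray_window, "haar", level=levels)
    details = [d for level in coeffs[1:] for d in level]
    return np.concatenate([d.ravel() for d in details])

def train_pedestrian_svm(windows, labels):
    # windows: list of equally sized grayscale arrays; labels: 1 = pedestrian, 0 = non-pedestrian.
    X = np.stack([haar_features(w) for w in windows])
    clf = LinearSVC(C=1.0)
    clf.fit(X, labels)
    return clf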
Sabzmeydani and Mori [32] devised a motion-based approach to describe the pedestrian model. The matching points on a 2D kinematic model of a pedestrian were compared to feature points from two consecutive images in a sequence. The authors suggested a filter-based method for detecting pedestrian motion patterns: using the integral image approach, they efficiently calculated spatial and temporal rectangle filters and proposed a pedestrian classifier using a variant of AdaBoost. The proposed method used five filters in different directions to detect pedestrian activity in low-resolution images with a very low error rate. Another representation of image motion is optical flow. Alonso et al. [33] compared dense flow patterns to a generative model of human flow appearance using a model-based method. The method detects the position, pose, and orientation of pedestrians in an image, although it is computationally more intensive than pattern recognition approaches. Individual SVM detectors were trained to identify each part of a pedestrian, such as the arms, torso, and legs, and their outputs were merged with geometric constraints and delivered to a final classifier to determine whether a pedestrian was present or not. Cho et al. [34] employed a detector with a part-based design to find pedestrians, where model pedestrians are composed of components with SIFT-like orientation features. The AdaBoost method was used for both feature selection and component detection. Pedestrians were detected by comparing the edges found in the given image with the edge maps of pedestrian templates; in addition, the authors developed a hierarchical system using the chamfer distance for pedestrian detection. Yao and Deng [35] generated a set of pedestrian hypotheses using local feature detection. Segmentation masks were constructed for these hypotheses using a training dataset that included foreground masks for pedestrians; the masks were used for top-down verification, and chamfer matching was used for classification. Surprisingly good results were achieved on photos with a lot of pedestrian overlap, and a greedy optimization was designed to solve the problem of overlapping pedestrians. AdaBoost was also used with a set of edgelets, which have pre-defined patterns of edges at various locations. Edgelets provide more information than Haar wavelets, but because they are fixed a priori, they cannot collect enough information to distinguish between distinct object classes. Sugiarto et al. [36] proposed a pedestrian detector using support vector machines (SVM) and histogram of oriented gradient (HOG) descriptors and experimented with all the HOG feature parameters on a challenging dataset of human figures to find the optimal setting for pedestrian detection. According to their findings, these feature sets surpassed conventional feature sets for pedestrian recognition by a large margin. Ye et al. [37] proposed a degree-two polynomial classifier for monocular pedestrian identification. The classifier was trained to detect pedestrians using Haar wavelets, which capture important information about pedestrians, and it makes no assumptions about the scene structure of the image, such as motion, tracking, or background subtraction. Four different example-based detectors were used to identify pedestrians. Guo et al. [38] explained that detectors were trained to recognize the head, legs, left arm, and right arm of the human body. After ensuring proper
geometric alignment of these components, a classifier was utilized to decide whether the pattern was a pedestrian or not. The results were more accurate compared to prior full-body pedestrian detectors. In the research conducted by Walia and Kapoor [39], shapelet features were selected from low-level features to distinguish pedestrians from non-pedestrians, and AdaBoost was used to train the classifier on the supplied features. Additionally, a mathematical analysis of the depth estimation error of a stereo-vision-based pedestrian detection sensor was offered; to identify pedestrians, 3D clustering methods were combined with SVM classification. Despite its intrinsic precision, the sensor produces imprecise measurements because of the quantization error. Guo et al. [40] conducted a thorough investigation of the use of monocular vision to identify pedestrians, with the intention of providing both methodological and experimental guidance. Their examination of various features and classifiers showed that HOG with a linear SVM outperforms the wavelet-based AdaBoost cascade technique at both lower and higher real-time processing speeds. Zia et al. [41] used a random forest for object detection with a two-dimensional bounding box representation. The model suffered from background clutter and foreground occlusions, and it becomes unwieldy when the data are large and contain complex patterns. A genetic algorithm was used for hand posture recognition, with feature selection performed by the GA and classification by the AdaBoost algorithm; the model proved to perform well but suffers from issues due to local minima. Zhao et al. [42] modeled an R-CNN-based pedestrian detection system using the extended INRIA database. The model was found to achieve a 10% false detection rate and a 23% miss rate, a reduction compared to the 46% miss rate of HOG features, and it achieved an improvement in performance over previous methods. The authors proposed a vision-based pedestrian detection system for a given region of interest (ROI): the system first gathered information from a camera, radar, and inertial sensor to find road obstacles, and this information was then coupled with a set of ROIs produced by a motion stereo method. Chang et al. [43] demonstrated that a pedestrian detection system with a camera and laser scanner may be modeled in a specific area using vision-based filtering. Combining the outputs of an infrared camera and a laser scanner with Kalman filter-based data fusion can provide trustworthy identification and precise positions of humans. The naive Bayes nearest neighbor classification algorithm was used for category-level item detection in pictures with a small database; the model performs better with a smaller dataset and worse with a larger one. Sivanantham et al. [44] elaborated that research mainly focuses on vision-based pedestrian detection. Due to the lack of an explicit model for pedestrian detection, deep learning and machine learning algorithms must be used to create models for pedestrian identification. Table 1 provides a summary of the machine learning models used for object detection along with their appropriate applications, benefits, and drawbacks.
Table 1 Summary of machine learning approaches for pedestrian detection

SVM: supervised learning; suitable for classification and regression. Advantages: requires less training, provides high classification accuracy, works well with small datasets. Disadvantages: long training time for large datasets; can be used only for binary-class problems.

NB: supervised learning; suitable for classification and probabilistic predictions. Advantages: requires less training data. Disadvantages: suffers from the zero-frequency problem; cannot be applied to large datasets.

Genetic algorithm: supervised learning; suitable for logistic classification and optimization problems. Advantages: effective in handling nonlinear features; short computational time. Disadvantages: suffers from local minima; crossover results in more iterations.

Decision tree: supervised learning; suitable for prediction problems. Advantages: simple to implement; produces high accuracy. Disadvantages: suffers from overfitting.

Random forest: supervised learning; suitable for classification and regression. Advantages: overcomes the overfitting problem. Disadvantages: more complex; requires high computational time; cannot be applied to large data.

Neural network: supervised learning; suitable for problems with large datasets. Advantages: more computational power; works well with large datasets. Disadvantages: black-box behavior; computationally expensive.
2.5 Deep Learning Approaches for Pedestrian Detection
The huge growth in the volume of data necessitates the implementation of pedestrian detection using deep learning models. Doulamis and Voulodimos [45] used a multilayered structure for deep learning. The model produced efficient results on images with simple faces but proved to suffer from the gradient vanishing problem on complex images. Basalamah et al. [46] applied a deep CNN regression model to forecast pedestrian counts; the AlexNet network was used, with a single neuron replacing the final fully connected layer of 4096 neurons. The authors also added a false-response background to negative samples to improve counting precision and focused on approaches for accurate pedestrian detection with monocular pictures on benchmark datasets. Gaikwad and Lokhande [47] focused on vision-based pedestrian detection carried out in three phases, namely, image acquisition, feature extraction, and classification. DL architectures were introduced by removing the hand-crafted feature extraction phase and preserving the other phases; feature extraction was automated by the deep classifier, which consists of several layers of computation. These classifiers consume more time for training than the hand-crafted feature-based approaches. Chung and Sohn [48] used a ConvNet network to count the pedestrians in an image. The model failed to extract features in crowd photos with size fluctuation because it
utilized two classifiers to estimate the pedestrian count and did not focus on the density map. An earlier CNN network used a density map derived from a crowd image to solve the challenge of cross-scene pedestrian counting. Because the model was well trained on a specific scene and its perspective information, it was able to withstand crowd photos with scale and perspective distortion; however, the acquisition of perspective information was problematic for applications based on this concept. Li et al. [49] used selective sampling and a gradient boosting machine to construct a layered boosting technique. The regression outcomes on the pedestrian density map were enhanced by the model. The authors iteratively added CNN layers and trained each layer to evaluate the residual error of the previous prediction. The method reduces the pedestrian counting error by 20–30%; however, the accuracy of the pedestrian detection count was lower than that of other detection methods, and it could not handle perspective distortion or cross-scene settings. Liang et al. [50] developed a multi-layered CNN model that takes an image as input and outputs the number of pedestrians. The model uses long short-term memory (LSTM) and a pre-trained GoogLeNet for local counts and high-dimensional feature maps, respectively. The method was found to produce degraded performance on different scenes; also, as the system requires resizing the input images to 640 × 480 pixels, it introduced errors and used additional data for training. A multi-column CNN model was proposed to overcome scale and perspective distortion. The model used a regressor built by combining multiple CNN structures, each with different-sized filters. Even then, the model suffers from scale distortion due to the combination of inappropriate filter scales [51]. To overcome this problem, Tesema et al. [52] developed a classifier on VGG16 that used a regressor to perform three different classifications. The final density map was calculated using the selected regressor, and an enhanced CNN model was generated that improved pedestrian counting. The model made use of the ShanghaiTech Part A dataset and several CNNs with different receptive fields for scale variation and effective local pedestrian density. The mean squared error (MSE) and mean absolute error (MAE) for the dataset were 20.0 and 31.1, respectively. The model struggled to handle scenarios with several receptive fields. Tian et al. [53] proposed an end-to-end pedestrian counting model based on actual density conditions; the model applied techniques for adaptively assessing detection-based or regression-based density maps and pedestrian counting. Shin et al. [54] used the ShanghaiTech Part B dataset, and the model provides an MAE and MSE of 20.75 and 29.42, respectively, but was unable to deal with the perspective distortion issue. The information on weakly populated foreground training data used in typical detection problems was effectively provided by this model. Dong et al. [55] designed a new architecture based on deep convolutional generative adversarial networks (GANs) for object detection at a distance and addressed the challenges in the detection of pedestrians. GANs are generative in nature and improved the quality of the low-resolution objects in the images or videos used for pedestrian detection. To utilize visible and far-infrared (FIR) video frames to detect pedestrians at night, the authors used pixel density for pedestrian detection. As the temperature of pedestrians is higher than that of other background objects, pixels with
higher density were grouped as pedestrians. This method failed on images with other high-temperature objects. Hou et al. [56] implemented a CNN-based pedestrian detector in video using input from multiple cameras and used FIR image information for pedestrian detection. Since the calibration of images from multiple cameras remains challenging, the method cannot be used widely for general surveillance purposes. Kumar et al. [57] also discussed the steps in pedestrian detection, including acquiring video from devices and processing the dataset, and the available methods for detection. Issues such as camera setup, data pre-processing, and the choice of object detection methods were also discussed. Additionally, systems for optical and non-visual pedestrian recognition and tracking employing magnetic and inertial sensors were surveyed. Table 2 provides a summary of deep learning methods for pedestrian identification.
Research Gaps The gradient histogram feature descriptor is utilized to find objects. It keeps track of the occurrences of gradient orientations in specific areas of an image. In an image dataset, the HOG descriptor captures the contour and edge properties of diverse objects. It is invariant to geometric and photometric transformations, which helps identify moving objects in images and videos. The SVM was originally created for binary problems and extended to multi-class classification. For larger datasets with more noisy data, the SVM struggles to perform well, and it performs poorly at categorizing objects when the number of features in each data point is greater than the number of training samples. Without requiring a probabilistic justification, the SVM divides data into categories above and below a hyperplane. HOG and SVM were therefore used to extract features and categorize them only for basic and uniform datasets. The GA is a local feature search technique that finds approximate solutions. Designing a fitness function to classify objects in complex and varying datasets must be done very precisely, and choosing the encoding and fitness function remains tedious work when implementing a GA. It is computationally expensive and time-consuming compared with the other ML algorithms, and larger datasets are not suitable for the genetic algorithm. Tuning the fitness function and the operators such as parent selection, crossover, gene selection, and mutation increases complexity. DL algorithms perform well in handling large datasets with complex images and outperform the other ML approaches. There exist different types of DL models, among which CNN models are best suited for computer vision problems. Among the various CNN models, the VGG16 and ResNet architectures have performed well in object detection problems and are suitable for handling datasets with complex and varying feature collections. These models have deeper convolutional architectures with series of layers. Except for one, which is followed by two fully connected layers, each set of convolutional layers is followed by a max pooling layer. The VGG16 image descriptor extracts the image features and performs classification.
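To ground the VGG16 discussion, the snippet below uses torchvision's VGG16 as a fixed image descriptor whose pooled convolutional output could feed a separate pedestrian classifier; the 224 × 224 RGB input and the option of loading pretrained weights are standard torchvision conventions, not details taken from the reviewed papers.

import torch
import torchvision

# VGG16 backbone used as an image descriptor; pretrained weights can be
# loaded through torchvision if desired (here the model is left randomly initialized).
vgg = torchvision.models.vgg16()
vgg.eval()

def vgg16_descriptor(batch):
    # batch: float tensor of shape (N, 3, 224, 224)
    with torch.no_grad():
        feats = vgg.features(batch)          # convolutional feature maps
        feats = vgg.avgpool(feats)           # fixed-size pooled maps
        return torch.flatten(feats, 1)       # (N, 25088) feature vectors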
Table 2 Summary of deep learning approaches for pedestrian detection

Multilayer perceptron: supervised, unsupervised, and reinforcement learning; suitable for classification, prediction, and regression applications. Advantages: can be employed as a baseline to build new architectures. Disadvantages: suffers from gradient vanishing problems, high complexity, modest performance, and slow convergence.

Restricted Boltzmann machine: unsupervised learning; suitable for modeling data with simple correlations and for pattern extraction. Advantages: generative model that can create virtual samples. Disadvantages: training is difficult.

Autoencoders: unsupervised learning; suitable for learning sparse and compact representations. Advantages: widely used for dimensionality reduction; powerful and effective unsupervised learning. Disadvantages: work well on small data but are very expensive to pre-train with big data.

Convolutional neural networks: supervised, unsupervised, and reinforcement learning; suitable for image classification and recognition. Advantages: weight sharing, affine invariance, mitigates the risk of overfitting, gives high accuracy with image data. Disadvantages: optimal hyperparameters must be found; deep structures are required for complex tasks; high computational cost.

RNN: supervised, unsupervised, and reinforcement learning; suitable for time series prediction, speech recognition, and natural language processing. Advantages: expertise in capturing temporal dependencies. Disadvantages: training the model is difficult; high model complexity; suffers from gradient vanishing and exploding problems.

GAN: unsupervised learning; suitable for data generation, image-to-image translation, and generating new human poses. Advantages: produces better modeling of the data distribution. Disadvantages: the training process is highly sensitive and unstable (convergence is difficult).
Although VGG16 handles a large amount of data with varying features, its accuracy is lower for complex data, and it can produce unstable results when processing complicated inputs.
3 Conclusion and Future Scope
This chapter summarizes the methods available for addressing the challenges of pedestrian detection from images and video. A detailed literature review of video compression, foreground extraction, and abnormal event detection is presented and discussed. In the comprehensive review of research techniques for video compression presented above, the need for fast ME methods and the existing complexity have been discussed. Furthermore, the various methods for background modeling and the techniques used to adapt to real-time environmental challenges are outlined. The final part of the literature review discusses abnormal event detection methods and their various applications in real-time surveillance videos. A study of classical feature extraction methods is summarized, and a survey of available pedestrian detection methods, including conventional, machine learning, and deep learning approaches, is produced. The study was used to explore the shortcomings of current systems. Existing pedestrian detection models face a number of challenges due to a variety of reasons, such as occlusion, varied postures of pedestrians, illumination, data complexity, and volume. In the near future, we plan to develop a pedestrian identification system with improved performance utilizing deep learning and machine learning models for huge and complicated datasets.
Auto Alignment of Tanker Loading Arm Utilizing Stereo Vision Video and 3D Euclidean Scene Reconstruction R. Prasanna Kumar and Ajantha Devi Vairamani
1 Introduction
The idea behind computer vision is to recover, interpret, and use data from a single image, multiple images, or videos. Cameras are frequently employed as tools for data collection. Along with navigation, stereo vision systems, pattern recognition, video surveillance, and other applications, camera calibration and scene reconstruction are two of the major tasks in computer vision. Camera calibration is the evaluation or estimation of a camera's intrinsic parameters. The camera's projection center, the principal point (image center), the focal length, the aspect ratio (the ratio of a pixel's horizontal and vertical sizes in an image), and the skew factor, which shows how pixels are distorted if they are not rectangular, are all important. Commonly employed commercial CCD cameras can now offer components that ensure a single pixel's horizontal and vertical sizes are the same and that the pixel is always rectangular. Scene reconstruction is the process of obtaining 3D scene data from a single image or a collection of images. We lose one dimension [1] and some important details when a three-dimensional scene is converted to a two-dimensional image. To recover the scene's original Euclidean structure, a complete reconstruction is necessary for the downstream application. To reconstruct the scene's Euclidean structure, more information is required [2, 3]. Conventional scene reconstruction methods expect two images [4–7]. Calibration of the camera settings is recommended; after that, the camera motion can be obtained. Once they are complete, a set of
Fig. 1 Flow of research methodology toward auto alignment of tanker loading arm
conditions derived from the projection strategy is addressed before the Euclidean reconstruction begins. Although intrinsic parameter errors are unavoidable and reconstruction quality is frequently unsatisfactory, feature correspondence between images is expected during the camera calibration or motion recovery stages. When the intrinsic parameters are contaminated, we propose examining how camera motion affects the influence of error on scene reconstruction. Our research indicates that rotation has a significant impact on the amount of error contributed by the intrinsic parameters; the larger the rotation angle, the greater the impact of the error.
• The possibility of avoiding rotation is intriguing because, under pure translation, the error in the intrinsic parameters appears unable to affect the 3D reconstruction.
• The horizontal or vertical directions should be the only possibilities for this translation. In practice, prior examinations can typically provide approximate values of the intrinsic parameters [3, 8].
• As a result, the calibration can be completely disregarded during the 3D reconstruction task once the pure translation has been applied. If the motion is constrained as described, the projection is significantly simplified because the rotation matrix is the identity.
• Using the intrinsic parameters and corresponding pixels, this yields a closed-form expression for the scene point coordinates in three directions. This makes the challenging camera calibration task simple, since any computation needed in the scene can be applied once the right conditions are created, as shown in Fig. 1.
Organization of the Chapter The chapter is organized as follows: Section 2 discusses the state-of-the-art research in the field. Section 3 describes the research methodology. Section 4 highlights the experimental results. Finally, Section 5 concludes the chapter with future scope.
2 State of Research in the Field
2.1 Stereo View Geometry
This section addresses what happens when two suitably arranged cameras [9] are present. It is convenient to let one camera's coordinate frame serve as the 3D scene coordinate frame. The "left camera" refers to the reference camera, and the "right camera" refers to the other camera [9, 10]. Assuming there is a scene point (x, y, z) [11], as shown in Fig. 2, that is projected into the two cameras and visible in both images, the 3D coordinates [12] of this point can be recovered. This process is called scene reconstruction. Note that the images are referred to the camera centers for analysis [13], so the obtained image no longer depicts a distorted scene. To simplify the problem, it is appropriate to assume that the two cameras [14] are identical, implying that their intrinsic parameters are very similar. The coordinates of the point are then given by:

x = a · sin C · sin B / sin(B + C)    (1)

y = a · (sin C · cos B / sin(B + C) − 1/2)    (2)

z = a · sin C · sin B · sin β / sin(B + C)    (3)
B, C, and β are the fixed angles in Eqs. (1), (2), and (3) and in Fig. 1. Another thing to consider is the transformation of the point (x, y, z) from the left camera's coordinate frame to the right camera's coordinate frame, which is a 3D-to-3D rigid transformation.
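As an illustration, the following is a minimal NumPy sketch of Eqs. (1)–(3) as reconstructed above; the pairing of coordinates to equations and the angle convention follow the reading given here and are assumptions, and all numeric values are placeholders.

```python
import numpy as np

def triangulate_from_angles(a, B, C, beta):
    """Recover (x, y, z) of a scene point from the viewing angles B and C at the
    two cameras, the elevation angle beta, and the baseline a (Eqs. 1-3).
    All angles are in radians."""
    s = np.sin(B + C)
    x = a * np.sin(C) * np.sin(B) / s
    y = a * (np.sin(C) * np.cos(B) / s - 0.5)
    z = a * np.sin(C) * np.sin(B) * np.sin(beta) / s
    return x, y, z

# Placeholder example: 0.5 m baseline, 60 degree viewing angles, 10 degree elevation.
print(triangulate_from_angles(0.5, np.radians(60), np.radians(60), np.radians(10)))
```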
Fig. 2 Stereo vision
2.2 Geometry of an Epipolar Camera
Figure 3 depicts the epipolar geometry of a scene, with the camera centers of the two cameras being Ol and Or. The epipole el is the image, in the left camera (Ol), of the right camera center (Or). The epipole er is the image, in the right camera (Or), of the left camera center (Ol). The epipolar plane is a plane that passes through a real-world point whose projection is visible in the image, as well as through the camera centers (Ol and Or). The epipolar plane crosses each image plane in a line known as the epipolar line. All epipolar lines converge at the epipole. The mapping is made possible by the fundamental matrix, which maps a chosen point from one image onto a line, the epipolar line, in the other image. The epipolar geometry is expressed by the fundamental matrix F, a 3 × 3 rank-2 matrix satisfying P2ᵀ F P1 = 0, where P1 = (x1, y1, 1) and P2 = (x2, y2, 1) are any two matching points, as shown in Fig. 4. Epipolar geometry has been used in 3D reconstruction [15–17], as-built reconstruction [15, 18–20], progress monitoring [15–18], and safety management [20]. To browse and interact with sizable unstructured collections of images, Snavely et al. [16] used 3D reconstruction techniques. Golparvar-Fard et al. [18] used a structure-from-motion approach combined with feature matching to calculate the essential matrix (a specialization of the fundamental matrix [21]) to determine rotations and translations for their four-dimensional augmented reality model. Jog and Brilakis [22] employed a tracking-specific image alignment algorithm to determine the point correspondences between two images. One of the images served as a benchmark for comparison with the other.
Fig. 3 Epipolar geometry
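As a small illustration of the epipolar constraint above, the OpenCV sketch below estimates F from matched points and evaluates the epipolar line of a chosen left-image point; the matches here are synthetic placeholders.

```python
import cv2
import numpy as np

# Synthetic placeholder matches: points shifted horizontally between the views.
rng = np.random.default_rng(0)
pts_left = (rng.random((30, 2)) * [640, 480]).astype(np.float32)
pts_right = pts_left + np.array([30.0, 0.0], dtype=np.float32)

# Estimate the fundamental matrix from the correspondences.
F, mask = cv2.findFundamentalMat(pts_left, pts_right, cv2.FM_LMEDS)

# Epipolar constraint P2^T F P1 = 0: map a left-image point to a line in the right image.
P1 = np.array([100.0, 120.0, 1.0])   # homogeneous point in the left image
line_right = F @ P1                  # coefficients (a, b, c) of a*x + b*y + c = 0
print("epipolar line in the right image:", line_right)
```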
2.3 Marine Loading Arms
Marine loading arms (MLAs) are systems that move liquefied natural gas (LNG) or LPG [23] from one vessel to another. The generic architecture of an MLA is made up of an articulated mechanical structure that supports an equally articulated pipe called the product line. A flanging system known as a coupler terminates the LNG-flowing product line. The arm is statically balanced in any configuration, thanks to a double
Fig. 4 Image showing test points P and corresponding epipolar lines P1 and P2
counterweight system, which also ensures that the entire structure is balanced. The actuation system on offshore loading arms is completely hydraulic. The riser, or foot, of the loading arm is secured to the floating liquefied natural gas (FLNG) platform [24], a large vessel, as shown in Fig. 5. The entire arm structure can rotate around the axis z0 thanks to the slewing joint shown in Fig. 6, which is connected to link 1. This joint's angular position, which is controlled by a hydraulic cylinder, is noted. The inboard joint, which is a second rotary joint, controls the inboard link. Two cylinders pull a cable to drive this joint, and its angular position is noted. The outboard link is connected to the inboard link by a third rotary joint, the outboard joint. This joint is driven by a dual-cylinder system similar to the outboard counterweight, and its angular position is also noted. It is worth noting that, unlike traditional robotic joints, the outboard counterweight parallelogram system, rather than the inboard link, drives the outboard link relative to link 1. The outboard link's end is connected to a Style 80 system. The fourth joint, which connects the outboard link to the Style 80, is left free, meaning it does not transmit any torque around its axis other than friction. As a result, the Style 80 can swing freely like a pendulum. The inclination of its gravity axis, as measured by an inclinometer, is noted. With an actuated fifth joint and a coupler at its end, the Style 80 is the final element of the product line. The MLA product line is connected to the carrier tanks via the coupler, which is a flanging system; it secures to the carrier's manifolds. Figure 6 depicts a coupler and a Style 80. Inboard and outboard counterweight systems are used to balance the inboard and outboard links, respectively. In the offshore loading arm architecture, these systems are designed as parallelograms. The loading arm is then connected to a flanging area,
Fig. 5 Marine loading arm
which holds the client manifold during the connection. The targeting system directs the coupler to the LNG manifold. A cable connects the MLA riser to the carrier manifold. A hydraulic winch installed on the Style 80 pulls the arm along the cable to the manifold. To ensure accurate positioning of the coupler on the manifold, male and female alignment cones are installed on the manifold and the coupler, respectively. Figures 5 and 6 depict the targeting system. While the winch pulls the loading arm along the cable, the hydraulic system is set to freewheel mode. The cylinders are softened in this mode, which allows the arm to move in response to external forces on the coupler.
3 Research Methodology
3.1 Disparity Map
The impact of the stereo camera baseline [26–28] on disparity and, as a result, depth estimation will be discussed in this section. The human interocular distance is about 75 mm, which allows us to get a sense of depth from what we see. While many visual cues are derived from the perceived flow of head movements, view disparity is arguably one of the most important concepts for understanding the geometry of what we’re looking at and the position of our viewpoint. A similar principle applies in machine vision, where we use the disparity between observed features to triangulate
Fig. 6 Loading arm FLNG deck
Fig. 7 Framework for disparity map
its location in 3D space. As a result, close objects have a disparity that can be bounded by a maximum disparity value, whereas objects that appear further away have a disparity that approaches zero [10] as shown in Fig. 7. The distance will affect a number of characteristics that we must consider when deciding on a multi-camera baseline [25, 26]. To begin with, it alters the disparity range in the search space. The maximum disparity for closer objects will be larger with larger baselines, increasing the single dimension search space for a rectified stereo pair and, as a result, costing more computational and memory resources. The effect of depth precision is another factor to consider. Assuming that image features
can be localized to sub-pixel accuracy, the depth accuracy will be affected depending on the baseline [27, 28]. As a result, when configuring multi-view systems, all of these factors must be taken into account.

d = x1 − x2 = f · b / Z    (4)

The stereo matching algorithm used in stereo vision systems yields disparity (difference) maps. In disparity maps, the depth at corresponding pixel points is inversely related to the disparity. The values of disparity maps are computed by measuring the separation between corresponding pixels in the images taken with the right camera (x1) and the left camera (x2) [14]. Equation (4) illustrates how the disparity d is determined by the difference between x1 and x2; f, b, and Z stand for the focal length, baseline, and depth, respectively. Depth and disparity are inversely related, and the relationship between them is nonlinear. As a result, objects close to the cameras have larger disparities (and smaller depths) in stereo vision systems [29].
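The following is a minimal OpenCV sketch of this relationship: a disparity map is computed for a rectified stereo pair and converted to depth with Z = f · b / d. The image file names, focal length, and baseline are placeholders.

```python
import cv2
import numpy as np

# Load a rectified stereo pair (placeholder file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching; SGBM returns fixed-point disparities scaled by 16.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=7)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

f, b = 700.0, 0.12                       # assumed focal length [px] and baseline [m]
depth = np.zeros_like(disparity)
valid = disparity > 0
depth[valid] = f * b / disparity[valid]  # Eq. (4): closer objects -> larger disparity
```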
3.2 Calibration
Robust multi-camera calibration, which relies on discrete camera model fitting from sparse target observations, is a requirement for all multi-view camera systems. Stereo systems, photogrammetry, and light-field arrays have all shown the need for geometrically consistent calibrations in order to achieve higher levels of sub-pixel localization accuracy for enhanced depth estimation. We present a 2D target with an encoded feature set, each feature carrying 12 bits of uniqueness for quick patterning [30, 31]. These features include orthogonal sets of binary straight and circular edges as well as Gaussian peaks. Feature uniqueness is used for associativity across views in a nonlinear optimization, which is then combined into a 3D pose graph. Compared with a traditional checkerboard, the mean reconstruction error for stereo calibration is reduced from 0.2 pixels to 0.01 pixels, demonstrating the use of existing camera models for intrinsic and extrinsic estimates [32, 33]. In the case of multi-camera systems, geometric camera calibration allows a camera model to be fitted that maps the geometric transformation of light through the optical elements of a camera system, in addition to the relative extrinsics. These calibrations [40–42] make it possible to map observed world features to image space, as shown in Fig. 8, which is useful in multi-view and geometric computer vision applications. Auto-calibration methods have also been developed that allow systems to solve for calibration parameters during deployment, as in Eq. (5).
Fig. 8 Calibration target visibility with partially overlapping stereo field of views
Xt = Xm + dp · y0    (5)
The operator's command initiates the deployment stage. The coupler's path is calculated between its current pose and a target pose Xt in front of the manifold at a distance dp. The value of dp will be determined later in the project. To avoid a collision, this safety distance should account for the arm's delay over the manifold. Xt and the path are updated online as the manifold moves. The loading arm is automatically driven from its parking position to Xt on the operator's command. The motion is updated in real time to keep track of Xm. The deployment stage is depicted in Fig. 9. In the pursuit stage, when the coupler reaches Xt, the loading arm continues to track it. For starters, current targets lack dense features that can be uniquely identified across multiple views, limiting the image space feature distribution and resulting in larger calibration errors near the camera's edges and corners. Furthermore, the majority of strategies [34–41] rely on local corner features, which are more prone to localization errors than larger features. As a result, this research proposes a dense and unique calibration target, as well as a feature extraction algorithm, and formulates an optimization problem to obtain camera parameters for better calibration. To achieve reliable calibrations, the proposed calibration target and patch encoding scheme require a detection pipeline that takes advantage of high sub-pixel localization accuracy. To distinguish straight lines from ellipses, we start with edge detection and then feature linkage. The patches are recognized from the feature locations and identifiers to provide a dictionary of feature–coordinate pairs used by the model fitting within the calibration once all features have been fitted.
Fig. 9 The arm moves toward a dynamic position in front of the manifold during the deployment stage
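As a small illustration of Eq. (5) and the deployment stage in Fig. 9, the sketch below recomputes the coupler target each control cycle as the manifold moves; treating y0 as the unit approach axis of the manifold is an assumption, and all numbers are placeholders.

```python
import numpy as np

def target_pose(X_m, y0, d_p):
    """Eq. (5): place the coupler target d_p in front of the manifold position X_m,
    along the (unit) approach axis y0. X_m and y0 are 3-vectors."""
    y0 = y0 / np.linalg.norm(y0)
    return X_m + d_p * y0

# Re-evaluated online as the manifold moves (placeholder values).
X_t = target_pose(np.array([2.0, 6.5, 1.2]), np.array([0.0, 1.0, 0.0]), d_p=0.30)
print(X_t)
```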
3.3 Feature Recognition
The importance of accurate feature localization on calibration quality necessitates good edge fitting, which is then used for ellipse fitting and centroid finding. First- and second-order image gradients can be used in sub-pixel edge algorithms, with the former requiring additional optimization to find local ridges and the latter being explicitly defined as the roots of the second-order gradients. To locally steer the second-order derivatives to the locally dominant orientation, we use the steerable filtering approach introduced by [42] with user-tuneable Gaussian kernel sizes. This generates an image gradient and an orientation image, which are then used to solve for the bilinear surface's real roots, resulting in continuous edges. Edge point groups
Fig. 10 The coupler (green) follows the manifold (pink) at a distance of dp during the pursuit stage
that belong to continuous feature contours are created by connecting edges. These are then separated into ellipses and lines using the feature fitting described in the next step, with any remaining outliers that do not meet the fitting requirements and a minimum contour length being rejected. The main features of the target are circles, which take on elliptical geometries [43] when subjected to a projective transformation. While locally nonlinear lens distortions may have an impact on this assumption, it holds true as long as the projected circles are small in comparison to the overall lens distortions. A direct least-squares approach proposed in [44] is used to fit the ellipses. For an optimized fitting, all contributing points within a linked contour are used, and a secondary pass is used to merge disjoint ellipse segments as shown in Fig. 10.
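The following is a minimal sketch of this detection chain, assuming a grayscale image of the target ("target.png" is a placeholder): a second-derivative response steered to the local gradient orientation stands in for the full steerable-filter basis of [42], and OpenCV's fitEllipse provides a direct least-squares ellipse fit in the spirit of [44].

```python
import cv2
import numpy as np
from scipy import ndimage

img = cv2.imread("target.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
sigma = 2.0  # user-tuneable Gaussian kernel size

# Gaussian derivatives of the image (axis order: rows = y, cols = x).
gx = ndimage.gaussian_filter(img, sigma, order=(0, 1))
gy = ndimage.gaussian_filter(img, sigma, order=(1, 0))
gxx = ndimage.gaussian_filter(img, sigma, order=(0, 2))
gyy = ndimage.gaussian_filter(img, sigma, order=(2, 0))
gxy = ndimage.gaussian_filter(img, sigma, order=(1, 1))

# Second derivative steered to the locally dominant orientation; its sub-pixel
# roots (zero crossings) mark edge locations along that direction.
theta = np.arctan2(gy, gx)
c, s = np.cos(theta), np.sin(theta)
second_deriv = c * c * gxx + 2 * c * s * gxy + s * s * gyy

# For brevity, threshold the gradient magnitude into binary edges, link contours,
# and fit ellipses; a full implementation would localize the roots of second_deriv.
mag = np.hypot(gx, gy)
edges = (mag > 0.2 * mag.max()).astype(np.uint8) * 255
contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
centres = [cv2.fitEllipse(cnt)[0] for cnt in contours if len(cnt) >= 20]
```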
3.4 Calibration and Model Fitting
We use the located and matched correspondences to formulate the optimization problem as a bundle adjustment problem: given the 3D points and observed 2D locations in each view, minimize the reconstruction error by simultaneously optimizing for the camera parameters (intrinsic and extrinsic parameters) and the pose at each view from Scenes 1 to 6 in Fig. 11. This bundle adjustment problem can be solved with a variety of nonlinear optimization strategies, with the Levenberg-Marquardt optimization [21] being a popular choice because it combines the
Fig. 11 Auto alignment not applied as with specific target
convergence benefits of both Gauss-Newton and gradient descent. We use OpenCV’s implementation of Levenberg-Marquardt-based calibration to optimize over intrinsic and extrinsic [45] for this problem.
3.5 Distance Calculation
Equations derived from the stereo vision crossover method with a camera rotating around the y-axis are used to obtain the results. Because there were no suitable structures or fixed cameras, in addition to a lack of internal parameters, the initial estimate error was approximately 60 cm. But after structural optimization for better
camera facing, making sure that the camera angles were adjusted, and estimating the internal parameters through trial and error, we were able to reduce the average estimation error to approximately 20 cm after numerous tests.
3.6 Reconstruction Error
The reconstruction error, which was previously mentioned, is an important metric for validating geometric camera calibration. During the calibration optimization, pose and intrinsic parameters are tweaked to reduce the error, resulting in camera parameters and their associated error. This error is widely used in the photogrammetry community to assess the global consistency and quality of the reconstruction; a lower error indicates better reconstruction and camera localization. In the case of calibration, we can use it to evaluate calibration performance in a similar way. Because this error is frequently on the order of sub-pixel values, it’s important to double-check that the localized image features are accurate before evaluating the calibration.
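A small sketch of how this error is typically computed is given below; the argument names mirror the outputs of the calibration call above and are otherwise assumptions.

```python
import cv2
import numpy as np

def mean_reprojection_error(obj_pts, img_pts, rvecs, tvecs, K, dist):
    """Project the known 3D target points with the recovered poses and intrinsics,
    and compare against the detected 2D locations (mean error in pixels)."""
    errors = []
    for P, p, r, t in zip(obj_pts, img_pts, rvecs, tvecs):
        proj, _ = cv2.projectPoints(P, r, t, K, dist)
        errors.append(np.linalg.norm(proj.reshape(-1, 2) - p.reshape(-1, 2), axis=1))
    return float(np.mean(np.concatenate(errors)))
```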
4 Experimentation and Results
4.1 Extraction of a Specific Target
A direct identifier is associated with each feature, derived from the patch and identifier they lie in, to implicitly match features across image views for the calibration. The local distribution of ellipse centers, which are referenced from the large ellipses, is used to create these patches. Only if all ellipses within a patch are confirmed to be valid is the patch extracted. By associating the feature image coordinates and world coordinates via their identifiers, the features are linked to the 3D world points from the target plane as shown in Fig. 11. Three or four loading arms must be sequentially connected to the LNG carrier (LNGC) before offloading LNG or LPG. Once the first arm is attached, the geometric model of the arm can be used to determine the pose of the coupler and, consequently, the pose of the manifold. This information can be used to supplement existing data or to enhance estimates of the poses of other manifolds. Mechanical distortion may reduce the geometric model’s accuracy because once connected, the arm is subject to external forces. By fastening a loading arm to a moving test bench and contrasting the calculated and actual positions of the manifold, the validity of this method can be verified. Based on empirical evidence, it has been discovered that performing the calibration process in two stages is beneficial: We estimate and optimize the intrinsic parameters and pose for each camera separately in the first stage, minimizing the
reconstruction error computed by projecting the world points to image space using the intrinsic and pose estimates. Then, in the second stage, we use these intrinsic estimates to optimize for the relative extrinsic between the left and right cameras by estimating and optimizing the pose of each view in both cameras so that the relative pose between the two cameras remains constant and the reconstruction error using these pose estimates and the fixed intrinsic is minimized. Separating the intrinsic and extrinsic calibration optimizations ensures convergence to a better optimum, according to research.
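Using OpenCV, this two-stage scheme can be sketched as below; the per-camera intrinsics are assumed to have already been estimated in the monocular stage, and the function only solves for the relative pose between the two cameras.

```python
import cv2

def stereo_extrinsics(obj_pts, img_pts_left, img_pts_right,
                      K_left, dist_left, K_right, dist_right, image_size):
    """Second stage: keep both cameras' intrinsics fixed and optimize only the
    relative rotation R and translation T between the left and right cameras."""
    ret, _, _, _, _, R, T, E, F = cv2.stereoCalibrate(
        obj_pts, img_pts_left, img_pts_right,
        K_left, dist_left, K_right, dist_right, image_size,
        flags=cv2.CALIB_FIX_INTRINSIC)
    return ret, R, T   # ret is the stereo reconstruction (reprojection) error
```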
4.2 Results of Calibration
The findings are split into two groups, each with a different goal: validating the accuracy of elliptical feature localization and evaluating the camera calibration performance for stereo camera calibrations. These are summarized as follows:
• Sub-pixel localization accuracy of spatial image features in synthetically rendered images with varying degrees of geometric distortions, to show target and algorithmic reliability in extracting correct feature locations
• Demonstrable and real-world results for a stereo camera moving around a checkerboard and fractal target in terms of intrinsic and pose estimation, using a stereo calibration to solve the relative pose from the monocular estimated intrinsics and evaluating the reconstruction error residuals across the images
In terms of the position accuracy needed for an automatic connection, loading arms are substantial structures. Although the MLA links can be up to 10 m long, the coupler needs to be placed in front of the manifold with a few millimeters of precision. Since it is not required for the current targeting system, loading arms today are not made for precise positioning. A position calibration technique that was initially developed for industrial robotic manipulators is used to solve this issue. The current study concentrates on calibrating the global position using static models. Geometric or non-geometric calibration can be used to describe this type of calibration. The first type only modifies the model's dimensional parameters, such as the length of links, the offset of joints, misalignments, and tool definition, which is shown step by step in Fig. 12.
4.3 Reconstruction Error
Traditional corner localization techniques only approach accuracies of around 0.2 pixels at best, according to the first set of experiments, which tested the effect of projective and radial distortions on localization accuracy. The ellipse-based detections utilized
Fig. 12 Geometric calibration
in this study demonstrate significantly improved localization accuracies, at approximately 0.01 pixels. Furthermore, the findings suggest that increasing the magnitude of geometric distortions has an effect on both corner point and elliptical feature localization performances, with the latter remaining more robust over larger distortions. The quality metric of elliptical fitting, which has an advantage over corner
Fig. 13 The coupler and the distance to the flange
features in terms of validation difficulty, is likely to reject extreme distortions that would lead to poorer localizations. A pattern recognition program and cameras mounted on the FLNG deck could be used to find the manifold and measure the distances. This option is less expensive and does not require additional measuring systems. Its main challenge is making the solution resistant to environmental factors such as mist, water, sunlight, and darkness. The absolute orientation, speed, and acceleration of the FLNG are always known. The relative position and orientation of the manifold are measured from the image by the program. The orientation of the coupler and the manifold is shown in Fig. 13, with the green line as the diameter of the circle. The distance between the coupler and the manifold is shown as the blue line (P1–P2) in Fig. 13. These findings show that elliptical features outperform corner features in terms of localization performance. The calibration quality was evaluated separately on monocular and stereo calibrations in a real-world setting for the second component of the calibration evaluation.
• Our feature extractor allows us to set up better feature point correspondences across different frames, resulting in a significantly lower mean reconstruction error (0.06 pixels vs 0.3 pixels) compared with the standard calibrator, as in the figure above of the loading arm coupler.
• The uniquely identifiable patches allow the target to be only partially visible, allowing for easier sampling of feature points across corners and edges.
Despite the improved and validated feature localization, larger reconstruction errors persist across the corner and edge segments of the image coordinates, indicating a significant systematic error in the camera model fitting. Additional radial parameters and local surface deformations could be used to improve calibration performance in the future. Because there is more local agreement with the solution, the overall reduction in this error indicates improved overall calibration. While the solution is limited to
numerical optimization methods like the Levenberg-Marquardt, there may be cases where the converged minimum isn’t indicative of ideal geometry.
4.4 Effects of Errors on 3D Reconstruction
The reconstruction result backs up the underlying hypothesis: the camera movement (the between-camera configuration) determines how much a given error will degrade the reconstruction quality. Viewed broadly, this is because a significant rotation fundamentally amplifies the noise effect caused by pixel errors or errors in the intrinsic parameters. Errors in intrinsic parameters no longer affect the reconstruction quality when a pure translation along the x-axis or y-axis, or their combination, is applied. Reconstruction was also carried out when the translation was set to a combination of x-axis and y-axis motion, and the results confirmed the same pattern. However, when the combined translation is used under otherwise similar circumstances, the reconstruction quality is worse. This is reasonable because a pure translation along the x-axis or the y-axis only adds one unknown to the system, whereas a combined translation adds two. In the remainder of this section, "pure translation" therefore refers to translation along the x-axis or the y-axis individually. Pure translation, or translation combined with only a tiny rotation, is proposed to build a simplified framework. When pure translation is used, precise intrinsic parameters are no longer required, and excellent 3D reconstruction quality can be achieved in any case. In general, pure translation is not difficult to accomplish: simply pair two cameras or move a single camera in a straight line. Even when exact pure translation is difficult to achieve, limiting the rotation still aids the framework's precision. Accurate estimations of the coupler pose P1 and the manifold pose P2 in the arm frame, as shown step by step in Fig. 14 from Scenes 1 to 4, are required in order to enable the automatic connection of the loading arm to the manifold. Since they are used as a guide when the loading arm is driven, the precision of these estimates is crucial. More specifically, the coupler's final position error should fall within its tolerance range, depending on how accurately the estimates of P1 and P2 are made.
5 Conclusion and Future Scope
In this chapter, our attention was drawn to camera calibration, 3D Euclidean scene reconstruction, and how the geometric structure of a stereo vision system affects the accuracy of 3D reconstruction. It is noteworthy that the 3D projective structure of a scene can be obtained using only matched pixels. However, the majority of
Fig. 14 Alignment of the coupler pose P1 and the manifold pose P2
applications cannot use such a 3D reconstruction because it is missing metric information. One must first understand both the intrinsic parameters and the geometry of the stereo vision system, specifically its full calibration, in order to obtain a 3D Euclidean reconstruction. Normally, one must first calibrate each camera in the stereo vision system in order to carry out the Euclidean 3D reconstruction of an observed scene. Assuming known intrinsic parameters, numerous strategies for 3D reconstruction from stereo images have been put forth in the literature. To establish the relationship between input errors and the quality of the 3D Euclidean reconstruction, extensive simulations have been run under different camera movements (between-camera geometries). Examples of input errors include pixel errors and errors in the intrinsic parameters. The reconstructions demonstrate that, even when the pixel error range is reduced, rotations magnify the effects of the error: the larger the rotation angle, the lower the reconstruction quality. Similarly, errors in the intrinsic parameters have little effect on the reconstruction quality when rotation is avoided and only translation between the two cameras is used, and scene reconstruction then achieves the best precision. In order to achieve a framework that is robust against intrinsic parameter error, it is advised to avoid rotation, or limit the rotation angle, when developing a stereo vision system. Keep in mind that combined translation leads to poorer reconstruction quality, so pure translation should be along the x-axis or the y-axis separately. This structure is comparable to the human visual system, which consists of two eyes. Additionally, this design shows how incorrect intrinsic parameters can still produce respectable reconstruction quality. With a camera positioned around the x-axis, the equations for the stereo vision crossover method were derived, and the results were obtained. The initial estimate is worth mentioning: despite problems with the structure and dynamics of the cameras, as well as a lack of internal parameters, the estimation error was kept to a minimum. To explore the topics covered, further research will undoubtedly be required, as this work has raised many questions in the field of camera arrays. With advances in computation, a significant portion of this work will undoubtedly be solved, increasing the amount of processing that can be done on an embedded system. Because of the increased data footprint and the need for additional perception, using autonomous vehicles as an example, simply building denser point clouds with larger ranges is unsustainable. Auto-calibration on embedded devices is one of the main unresolved issues. The radial disparity circles that can be produced by positioning cameras in a circular pattern around a reference camera are another intriguing research direction that has not yet been investigated.
References 1. Avinash, N., & Murali, S. (2008). Perspective geometry based single image camera calibration. Journal of Mathematical Imaging and Vision, 30, 221–230.
2. Boufama, B. S. (1999). On the recovery of motion and structure when cameras are not calibrated. International Journal of Pattern Recognition and Artificial Intelligence, 13(05), 735–759. 3. Boufama, B., & Habed, A. (2004). Three-dimensional structure calculation: Achieving accuracy without calibration. Image and Vision Computing, 22(12), 1039–1049. 4. Longuet-Higgins, H. C. (1981). A computer algorithm for reconstructing a scene from two projections. Nature, 293(5828), 133–135. 5. Faugeras, O. D. (1992). What can be seen in three dimensions with an uncalibrated stereo rig? In Computer Vision—ECCV’92: Second European Conference on Computer Vision Santa Margherita Ligure, Italy, May 19–22, 1992 Proceedings 2 (pp. 563–578). Springer. 6. Shashua, A. (1992). Projective structure from two uncalibrated images: Structure from motion and RecRecognition. 7. Hartley, R. I. (1992). Estimation of relative camera positions for uncalibrated cameras. In Computer Vision—ECCV’92: Second European Conference on Computer Vision Santa Margherita Ligure, Italy, May 19–22, 1992 Proceedings 2 (pp. 579–587). Springer. 8. Boufama, B., & Habed, A. (2007, August). Three-dimensional reconstruction using the perpendicularity constraint. In Sixth International Conference on 3-D Digital Imaging and Modeling (3DIM 2007) (pp. 241–248). IEEE. 9. Soyaslan, M., Nart, E., & Çetin, Ö. (2015). Stereo kamera sisteminde aykırılık haritaları yardımıyla nesne uzaklıklarının tespit edilmesi. SAÜ Fen Bilim. Enstitüsü Derg, 20(2), 111. 10. Okutomi, M., & Kanade, T. (1993). A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4), 353–363. 11. Rodríguez-Quiñonez, J. C., Sergiyenko, O., Flores-Fuentes, W., Rivas-Lopez, M., HernandezBalbuena, D., Rascón, R., & Mercorelli, P. (2017). Improve a 3D distance measurement accuracy in stereo vision systems using optimization methods’ approach. Opto-Electronics Review, 25(1), 24–32. 12. Saygili, G., Van Der Maaten, L., & Hendriks, E. A. (2015). Adaptive stereo similarity fusion using confidence measures. Computer Vision and Image Understanding, 135, 95–108. 13. Básaca-Preciado, L. C., Sergiyenko, O. Y., Rodríguez-Quinonez, J. C., García, X., Tyrsa, V. V., Rivas-Lopez, M., et al. (2014). Optical 3D laser measurement system for navigation of autonomous mobile robot. Optics and Lasers in Engineering, 54, 159–169. 14. Malekabadi, A. J., Khojastehpour, M., & Emadi, B. (2019). Disparity map computation of tree using stereo vision system and effects of canopy shapes and foliage density. Computers and Electronics in Agriculture, 156, 627–644. 15. Jog, G. M., Fathi, H., & Brilakis, I. (2011). Automated computation of the fundamental matrix for vision based construction site applications. Advanced Engineering Informatics, 25(4), 725–735. 16. Snavely, N., Seitz, S. M., & Szeliski, R. (2006). Photo tourism: Exploring photo collections in 3D. In ACM siggraph 2006 papers (pp. 835–846). 17. Agarwal, S., Furukawa, Y., Snavely, N., Curless, B., Seitz, S. M., & Szeliski, R. (2010). Reconstructing rome. Computer, 43(6), 40–47. 18. Golparvar-Fard, M., Peña-Mora, F., & Savarese, S. (2009). D4AR–a 4-dimensional augmented reality model for automating construction progress monitoring data collection, processing and communication. Journal of Information Technology in Construction, 14(13), 129–153. 19. Han, S., Peña-Mora, F., Golparvar-Fard, M., & Roh, S. (2009). Application of a visualization technique for safety management. In Computing in Civil Engineering (2009) (pp. 
543–551). 20. Maas, H. G., & Hampel, U. (2006). Photogrammetric techniques in civil engineering material testing and structure monitoring. Photogrammetric Engineering and Remote Sensing, 72(1), 39. 21. Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge University Press. 22. Jog, G. M., & Brilakis, I. K. (2009). Auto-calibration of a camera system using ImageAlignment. In Computing in Civil Engineering (2009) (pp. 186–195).
23. Lowell, D., Wang, H., & Lutsey, N. (2013). Assessment of the fuel cycle impact of liquefied natural gas as used in international shipping. The International Council on Clean Transportation. 24. Marmolejo, P. C. (2014). An economic analysis of Floating Liquefied Natural Gas (FLNG). Doctoral dissertation, Massachusetts Institute of Technology. 25. Gallup, D., Frahm, J. M., Mordohai, P., & Pollefeys, M. (2008). Variable baseline/resolution stereo. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–8). IEEE. 26. Nakabo, Y., Mukai, T., Hattori, Y., Takeuchi, Y., & Ohnishi, N. (2005). Variable baseline stereo tracking vision system using high-speed linear slider. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation (pp. 1567–1572). IEEE. 27. Rovira-Más, F., Wang, Q., & Zhang, Q. (2010). Design parameters for adjusting the visual field of binocular stereo cameras. Biosystems Engineering, 105(1), 59–70. 28. Saxena, A., Driemeyer, J., & Ng, A. Y. (2008). Robotic grasping of novel objects using vision. The International Journal of Robotics Research, 27(2), 157–173. 29. Soyaslan, M., Nart, E., & Çetin, Ö. (2016). Stereo kamera sisteminde aykırılık haritaları yardımıyla nesne uzaklıklarının tespit edilmesi. Sakarya University Journal of Science, 20(2), 111–119. 30. Zhang, Z. (1999). Flexible camera calibration by viewing a plane from unknown orientations. In Proceedings of the seventh IEEE International Conference on Computer Vision (Vol. 1, pp. 666–673). IEEE. 31. Zhang, Z. (2000). A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), 1330–1334. 32. Kumar, A., Walia, G. S., & Sharma, K. (2020). Recent trends in multicue based visual tracking: A review. Expert Systems with Applications, 162, 113711. 33. Tsai, R. (1987). A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE Journal on Robotics and Automation, 3(4), 323–344. 34. Blostein, S. D., & Huang, T. S. (1987). Error analysis in stereo determination of 3-D point positions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 752–765. 35. Yang, Z., & Wang, Y. F. (1996). Error analysis of 3D shape construction from structured lighting. Pattern Recognition, 29(2), 189–206. 36. Ramakrishna, R. S., & Vaidvanathan, B. (1998). Error analysis in stereo vision. In Asian Conference on Computer Vision (pp. 296–304). Springer. 37. Kamberova, G., & Bajcsy, R. (1997). Precision in 3-D points reconstructed from stereo. 38. Balasubramanian, R., Das, S., & Swaminathan, K. (2001). Error analysis in reconstruction of a line in 3-D from two arbitrary perspective views. International Journal of Computer Mathematics, 78(2), 191–212. 39. Rivera-Rios, A. H., Shih, F. L., & Marefat, M. (2005). Stereo camera pose determination with error reduction and tolerance satisfaction for dimensional measurements. In Proceedings of the 2005 IEEE International Conference on Robotics and Automation (pp. 423–428). IEEE. 40. Park, S. Y., & Subbarao, M. (2005). A multiview 3D modeling system based on stereo vision techniques. Machine Vision and Applications, 16(3), 148–156. 41. Albouy, B., Koenig, E., Treuillet, S., & Lucas, Y. (2006, September). Accurate 3D structure measurements from two uncalibrated views. In International Conference on Advanced Concepts for Intelligent Vision Systems (pp. 1111–1121). Springer. 42. Freeman, W. T., & Adelson, E. H. (1991). 
The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9), 891–906. 43. Heikkila, J. (2000). Geometric camera calibration using circular control points. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10), 1066–1077.
44. Fitzgibbon, A. W., Pilu, M., & Fisher, R. B. (1996, August). Direct least squares fitting of ellipses. In Proceedings of 13th International Conference on Pattern Recognition (Vol. 1, pp. 253–257). IEEE. 45. Sturm, P., & Triggs, B. (1996, April). A factorization based algorithm for multi-image projective structure and motion. In European Conference on Computer Vision (pp. 709–720). Springer.
Visual Object Segmentation Improvement Using Deep Convolutional Neural Networks S. Kanithan, N. Arun Vignesh, and Karthick SA
1 Introduction
Automated feature extraction has always been a difficult task in image processing. Extensive technological improvements in automation have outsmarted human abilities in various fields, and processing imagery data is no exception [1–3]. Due to the availability of extensive compute and training data resources, automatic object recognition reached, and then surpassed, human-level performance a few years ago. The deep feedforward CNN is one of the successful object recognition models because it can learn data patterns from inputs such as natural images. The CNN is an interpretation of frequency encoding in biological neural pathways, and it is based on the physical connections of sensory information as a sequence of processing levels. In the visual hierarchy, the receptive field attributes become more and more complex. That is, activation in the striate cortex responds to aligned bars of visual input, whereas the less sharply tuned temporal cortex responds to complex image features [4]. Fukushima developed the neocognitron (1980), a CNN-like architecture inspired by this neuroscience. This early model was invariant to zoom, translation, and deformation, which seem to be the three most important criteria for object recognition. The concepts of feature detectors in modern CNNs are developed from knowledge acquired by analyzing natural environmental data [5–8]. In addition,
Fig. 1 Object recognition design of individual trials (pre-stimulus fixation, stimulus presentation, post-stimulus fixation, and blink phase)
CNNs can learn various forms of the hand-designed feature detectors used in biological models and recognize objects from individual trials, as shown in Fig. 1. Almost all of these object recognition networks use cross-hierarchical general learning algorithms to learn Gabor-like feature detectors at the lowest level, which is similar to the known response properties of neural populations in V1. Although some research focuses on the characterization of dorsal visual and auditory streams, most CNN-based brain processing research focuses on understanding the characterization of the ventral visual stream. Object detection, on the other hand, is a fast process: discriminative information is available as soon as 100 ms after stimulus onset. The temporal delay and indirect nature prohibit the study of how the network hierarchy systematically transmits stimulus characteristics over time and across anatomical structures. In this study, an encoding-model method is used to study how the CNN-based representation relates to the spatial and temporal expression of MEG activity on the cortical surface. First, we use the Visual Geometry Group (VGG-S) CNN architecture, trained on the ImageNet database, to encode the brain activity reconstructed from the source. We show that each layer of the CNN hierarchy can be geographically and temporally connected to specific parts of the visual cortical hierarchy, and that they can explain neuronal activity within 45–75 ms of stimulus onset. Additionally, the fitted encoding model is used to decode perceptual MEG measurements at the single-trial level and across stimulus shifts. This opens the door to a good deal of additional image-related processing in future studies. The deep learning scheme is quite different from the traditional one. The conventional workflow typically selects proposal regions manually [9], extracts features [10], and then classifies objects [11]. Although it is good at identifying targets, it involves a lot of redundant computation and performs poorly. Hand-designed features are also not transferable. Deep learning-based object detection methods use supervised learning to effectively learn the features and semantics of the target in the image [12].
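As an illustration of extracting layer-wise CNN activations for such an encoding model, the sketch below uses torchvision's VGG-16 with ImageNet weights as a stand-in for the VGG-S model mentioned above; the hooked layer indices and the random input are assumptions and placeholders.

```python
import torch
from torchvision import models

# ImageNet-pretrained VGG-16 as a stand-in for VGG-S (weights download on first use).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

activations = {}
def make_hook(name):
    def hook(_module, _inputs, output):
        activations[name] = output.detach()
    return hook

# Hook the max-pooling layer that closes each convolutional block of vgg16.features.
for idx in (4, 9, 16, 23, 30):
    vgg.features[idx].register_forward_hook(make_hook(f"pool_{idx}"))

stimulus = torch.rand(1, 3, 224, 224)   # placeholder stimulus (normalized in practice)
with torch.no_grad():
    vgg(stimulus)

for name, act in activations.items():
    print(name, tuple(act.shape))        # per-layer features to regress against MEG sources
```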
it has become the preferred approach in the field of object detection. This study mainly covers the computation and operation of the Cascade Mask R-CNN model [13]. The mask-branch information is fused on this basis, and a guided-anchoring RPN (GA-RPN) is used instead of the standard RPN to reduce the number of proposal regions and improve the performance of the network model. The improved Cascade Mask R-CNN model was then built with the PyTorch framework and the MMDetection toolbox. In this work, we analyze deep learning and neural networks as complementary ideas that together improve the efficiency of image segmentation and retrieval.
Objectives of the Chapter The main objectives of the chapter are:
• To propose an efficient technique that improves the efficiency of image segmentation;
• To analyze different image retrieval techniques and narrow them down to a single efficient technique;
• To provide a complete reference to the segmentation and retrieval techniques used in image processing.
Organization of the Chapter The rest of the chapter is organized as follows: Section 2 highlights methods for image retrieval in the image processing field. Section 3 presents a detailed literature survey covering all levels and sub-branches of image processing. Sections 4 and 5 present the data analysis of visual object segmentation and the pre-processing techniques required for object segmentation. Section 6 highlights the role of source estimation in reconstructing the actual data at the receiver side. Sections 7 and 8 describe nested cross-validation, the encoding model for feature transformation, and the encoding and decoding of the pixel-space control model, following the source estimation. After segmentation, estimation, and reconstruction, cross-validation with a linear model is used to assess the accuracy of the reconstructed image; this process can also be carried out in alternative ways. The encoding and decoding analysis is important for selecting suitable classifiers that achieve the expected outcome. The conclusion and future scope are presented in Sect. 9.
2 Methods for Image Retrieval In the area of image processing and communications, image retrieval is one of the promising fields. The current state of image retrieval research is outlined below, with some characteristics highlighted along with possible research directions. Other alternatives exist; however, this study focuses on the introduction of a novel relevance feedback (RF) scheme and the examination of a retrieval system based on relevance feedback in content-based image retrieval (CBIR). We identified the most serious concerns we found,
which include the lack of meaningful visual similarity measures and insufficient attention to user interaction and input. Researchers have proposed several genuinely intelligent algorithms for content-based image search that are both efficient and reliable [9]. The aim of this review is to acknowledge these achievements while also demonstrating the viability of content-based intelligent visual data acquisition.
2.1 Image Retrieval Techniques
The image retrieval process searches the database for images that match the user's request. In the proposed recognition system, the texture feature set is extracted and mapped into the neutrosophic (NS) domain to represent the image content in the training sample. The system also introduces unsupervised classification of the learning images based on an ensemble of optimized linear programming and the neutrosophic set. This section discusses embedding image texture features in the neutrosophic domain, which is used to transform the image with this feature set. A. Image Retrieval Using Text Text-based image retrieval (TBIR), also known as description-based image retrieval, can be used to retrieve XML files containing ASCII-annotated images for specified multimedia queries [10]. To avoid the limitations of CBIR, TBIR requires manual keywords/tags to represent the visual content of the image. It allows users to express their information needs as a textual query and then identifies appropriate images based on the match between that query and each image's manual annotation, as shown in Fig. 2.
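As a purely illustrative sketch of this matching step (the function and variable names below are hypothetical and not part of any system cited here), a textual query can be scored against each image's manual annotation by simple keyword overlap:

```python
from typing import Dict, List, Set

def tbir_search(query: str, annotations: Dict[str, Set[str]], top_k: int = 5) -> List[str]:
    """Rank images by the overlap between the query words and each image's
    manual tags, as in text-based image retrieval."""
    query_words = set(query.lower().split())
    scores = {
        image_id: len(query_words & {t.lower() for t in tags})
        for image_id, tags in annotations.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]

# Example: three annotated images and a two-word query.
catalog = {
    "img_001.jpg": {"beach", "sunset", "sea"},
    "img_002.jpg": {"mountain", "snow"},
    "img_003.jpg": {"beach", "palm", "sea"},
}
print(tbir_search("sunset beach", catalog, top_k=2))
```

In practice, TBIR systems usually replace the raw overlap count with weighted schemes such as TF-IDF, but the retrieval loop has the same shape.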
Fig. 2 Text-based image retrieval (workflow: extract features from the query image and the image database, save the features in the database, compute the closest features, and return images sorted by feature distance)
Fig. 3 Content-based image retrieval
B. Image Retrieval Based on Content In content-based image retrieval, image characteristics are used to search for and retrieve images. Feature-extraction components are used to extract low-level visual characteristics from the images in the set, as shown in Fig. 3. Images are retrieved by comparing the color, intensity, and grayscale features of the query image with those of the database images. C. Image Retrieval Using Multi-modal Blending Data aggregation and machine learning techniques can be combined in image retrieval using multi-modal blending, which is the process of fusing datasets from different sources. The chorus, skimming, and dark-horse effects can be observed by combining any two of the abovementioned modes or techniques, as shown in Fig. 4. The image and the query are fed to a linear system to find the exact image within a large dataset. The skimming effect is a technique for quickly detecting the edges of digital images. For the first-level detector [11], this technique applies the first-order derivative operator after the second-order derivative operator to reduce the number of pixels that must be analyzed in detail. This not only increases the speed significantly but also improves the accuracy of edge finding. Therefore, provided the disconnection of edge features can be ignored, this effect is particularly useful when the resulting edge pixels are fed into the Hough transform or
Fig. 4 Dark horse effect (content-based retrieval over a large-scale multimodal dataset: multimodal data representation and fusion, a multimodal query, multimodal search, and a ranking of relevant results)
an equivalent object recognition system. Pitch is modulated using an LFO as the source, while depth and speed modify the effect's "color scheme" or texture; the intensity of the modulation depth is defined as the range between the maximum and the minimum value. D. Image Retrieval Using Semantic Information Many researchers are currently studying image search based on image semantics, one of the approaches aimed at bridging the semantic gap. There are two main methods: automatic image annotation and semantic web mapping. The user searches for an image using text, as shown in Fig. 5. The semantic feature translator enriches the text with features derived from the meaning of the input words; these features are then mapped through semantic mapping and matched in the database. Many similar images are returned, and the user can choose one among them. The same process, emphasizing feature extraction after segmentation, is shown in Fig. 6. The edges of the image are identified and region segmentation is performed; after segmentation, classification is carried out based on the extracted feature values. With feature-extraction schemes at different layers, the object classification is differentiated, and retrieval of the actual image is attempted [12]. With the semantic feature translator and semantic classification, the success rate increases from 60% to 70%. E. Relevance Feedback Image Retrieval In a semantic system, the retrieved images and the expected image may not match exactly, since the inherent retrieval accuracy is 60–70%. Although the user has the option to choose among the images, when the database contains more than a million images of the
Fig. 5 Block diagram of a typical semantic-based image retrieval system
Fig. 6 Typical semantic-based image retrieval system
same type, the accuracy error becomes too high to tolerate. This gap can be overcome by relevance feedback in the semantic system, which is very useful for reducing the error. The main idea behind relevance feedback is to add the user's opinion to the search, allowing users to review the retrieved images; the sequence of operations is shown as blocks in Fig. 7. Based on the user's participation, the similarity measure is refined automatically. In medical applications, for example, health professionals expect patient information, health-check findings, other indications of radiation exposure, and the timing and orientation of the captured images, among other things; feature extraction from the pictures alone does not yield useful results. There are drawbacks to both traditional text-based image retrieval (TBIR) and content-based image retrieval
Fig. 7 Feedback image retrieval
(CBIR). By weighing the characteristics of the two approaches, a new retrieval structure can be developed, guided by the user's preferences, to increase the overall efficiency of the system.
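The retrieval-plus-feedback loop described above can be sketched as follows; the color-histogram feature, the Euclidean ranking, and the Rocchio-style update are stand-ins chosen for illustration under stated assumptions, not the specific structure proposed in this chapter:

```python
import numpy as np

def color_histogram(image, bins=16):
    """Toy CBIR feature: a normalized grayscale-intensity histogram."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 255), density=True)
    return hist

def rank_by_distance(query_feat, db_feats):
    """Return database indices sorted by Euclidean feature distance."""
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    return np.argsort(dists)

def rocchio_update(query_feat, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Relevance feedback: move the query toward images the user marked
    relevant and away from those marked non-relevant."""
    new_q = alpha * query_feat
    if len(relevant):
        new_q += beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        new_q -= gamma * np.mean(non_relevant, axis=0)
    return new_q

# Example round-trip: rank, collect feedback on the top hits, re-rank.
rng = np.random.default_rng(0)
db_images = rng.integers(0, 256, size=(50, 96, 96))            # synthetic database
db_feats = np.stack([color_histogram(im) for im in db_images])
query_feat = color_histogram(rng.integers(0, 256, size=(96, 96)))

first_pass = rank_by_distance(query_feat, db_feats)[:5]
feedback_relevant = db_feats[first_pass[:2]]                   # user keeps two images
feedback_non_relevant = db_feats[first_pass[2:]]
refined = rocchio_update(query_feat, feedback_relevant, feedback_non_relevant)
second_pass = rank_by_distance(refined, db_feats)[:5]
print("initial:", first_pass, "after feedback:", second_pass)
```

The design point is simply that user feedback reshapes the query in feature space, so the second pass reflects the user's notion of relevance rather than raw visual distance alone.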
3 Literature Review The physiological motivation for modern CNNs is that their feature analyzers are trained from environmental information rather than being custom designed. Kaur and Verma [10] gave an insight into watershed segmentation. Blakemore and Cooper [3] observed that the earliest neural response properties develop as the organism encounters different visual environments during its period of learning. Bosch et al. [14] developed a model trained under unsupervised conditions with neural reinforcement learning. Brodeur et al. [15, 16] carried out extensive evaluations of the use of state-of-the-art stimuli and neural networks for investigating brain information processing. Shilpa et al. [4] developed a device that can identify the presence of breast cancer using a CNN. Kumari et al. [17], Cichy and Teng [6], and Eikenberg et al. [18] used the idea of networks to interface with and analyze specific regions of biological systems. A number of studies have emphasized object identification using various techniques. One such technique is the resonance type, where functional magnetic resonance imaging concepts are used to achieve the desired result. Images required for this type of
processing were taken from various international open-source repositories, including collections maintained by universities such as the University of Amsterdam; Brodeur et al. [15, 16] used such images previously, and Geusebroek et al. [19] published a library of images. Oksuz et al. [20] presented an idea for reconstructing high-quality segmentations using deep learning techniques. Grau et al. [21] showed that discriminative information appears as early as 100 ms after stimulus onset. Clarke [22] and Dayan and Abbott [23] noted that the sluggish temporal resolution and indirect nature of the fMRI BOLD signal preclude studying how the network hierarchy progressively reflects stimulus properties over time and across anatomical regions. Cichy et al. [7] pioneered the method of combining electroencephalography or magnetoencephalography with CNNs. Researchers extended this work by using an encoding-model approach to evaluate whether convolutional neural representations are expressed in time and space across the cortical surface measured with encephalography. Chatfield et al. [24] used the VGG-S CNN model trained on the ImageNet competition dataset and captured subjects' cognitive responses to the presentation of a large, diverse range of object pictures; they showed that the various processing stages can be mapped to different locations in time and space. He et al. [25] also showed that the fitted encoding system can be used to classify perceived stimuli from MEG readings. Silberman et al. [26] presented the concepts of indoor segmentation and support inference from RGB-D images. Ji et al. [27] presented a novel algorithm for cell image segmentation. Sulaiman and Isa [28] proposed an adaptive fuzzy algorithm that helps retrieve the data after segmenting the selected image. The effect may be evident in some cases if it involves neurons or any other structure of interest under focus. Different algorithms can be applied to a particular task, and the results and differences can be observed; in addition, mathematical and theoretical analysis of the relevant spatial variations gives clear results and a concrete reason to apply a given algorithm. Akhil et al. [29] presented a novel approach for determining patterns in acoustic signals. Glasser et al. [30] studied brain activity, image segmentation, and their correlation.
4 Data Analysis of Visual Object Segmentation The FieldTrip toolbox is used for data pre-processing and source reconstruction of EEG and MEG. This section describes the pre-processing procedures required for object segmentation. A DFT filter removes the 50 Hz line noise and its harmonics at 100 Hz and 150 Hz, estimated over 10 s data segments. Trials are selected from the data and baseline-corrected by subtracting the average value of the window from 250 to 50 ms before stimulus onset. A trial is the actual time between the fixation point and the onset of the image, ranging from 400 to 1600 ms [13]. Trial summary statistics were visually inspected before ICA, and trials with excessive variance and kurtosis were eliminated; this step is used to detect blinks
and other muscle movements at the outset. In a second visual inspection, the data are high-pass filtered at 100 Hz, specifically to look for muscle artifacts; because of the various muscle movements, each trial set is reduced by 2–10%. ICA is then used to identify any remaining physiological artifacts (related to blinking, eye movement, and cardiac activity) in the data and to clean it. We use visual inspection to identify these components and project their corresponding sensor topographies out of the data. We then re-examine the data visually to rule out any trials with residual artifacts or motion, following the same method as before ICA. Finally, the cleaned data, resampled at 300 Hz, are baseline-corrected again in the 250 ms to 50 ms window before stimulus onset and gently low-pass filtered at 100 Hz as the last step. We did not smooth the sensor output further, attenuating only the higher frequencies. A head-tracking algorithm is used to monitor the participants' head position.
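For illustration only, the following NumPy/SciPy sketch mimics the pre-processing steps described above (50 Hz line-noise removal with harmonics, baseline correction over the 250–50 ms pre-stimulus window, and a gentle 100 Hz low-pass); the sampling rate, window limits, and function names are assumptions for the sketch and do not reproduce the FieldTrip calls actually used:

```python
import numpy as np
from scipy.signal import iirnotch, butter, filtfilt

FS = 300.0  # sampling rate in Hz (assumed; the chapter resamples the cleaned data to 300 Hz)

def remove_line_noise(data, fs=FS, freqs=(50.0, 100.0, 150.0), q=30.0):
    """Notch-filter the 50 Hz mains component and its harmonics.
    data: array of shape (n_channels, n_samples)."""
    for f0 in freqs:
        if f0 < fs / 2:                       # only filter frequencies below Nyquist
            b, a = iirnotch(f0, q, fs=fs)
            data = filtfilt(b, a, data, axis=-1)
    return data

def baseline_correct(epochs, times, t_start=-0.250, t_end=-0.050):
    """Subtract the mean of the 250-50 ms pre-stimulus window from each epoch.
    epochs: (n_trials, n_channels, n_samples); times: (n_samples,) in seconds."""
    mask = (times >= t_start) & (times <= t_end)
    baseline = epochs[..., mask].mean(axis=-1, keepdims=True)
    return epochs - baseline

def low_pass(data, fs=FS, cutoff=100.0, order=4):
    """Gentle low-pass filter applied as the final pre-processing step."""
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    return filtfilt(b, a, data, axis=-1)
```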
5 Pre-processing for Anatomical MRI Source Estimation Up to this point, all frames of the T1-weighted MRI scan were inspected manually and registered to the CTF coordinate system. The right hemisphere could be identified in the scan for all individuals except one, for which a different image orientation was used. FSL, FreeSurfer, and the HCP Workbench are used to create the source model. FSL's Brain Extraction Tool (BET) is used to perform skull stripping on the T1 images, with the threshold set to 0.5. FreeSurfer's surface reconstruction pipeline is then used to segment the white matter. The white matter segmentation is checked visually for artifacts, e.g., cases where the dura mater is incorrectly classified as a brain structure; the misclassified voxels are manually deleted from the segmented image using visualization tools. For further analysis, the vertices of the midline are excluded, which results in 7344 dipoles across the 2 hemispheres in our completed mesh. Using standard FieldTrip functions, a volume conduction model has been created that contains a single shell representing the inner surface of the skull. Since the time resolution of neuronal activity is in the millisecond range, electroencephalography (EEG) is still the main tool for measuring dynamic changes in brain function due to disease states, although CT and MRI have largely replaced EEG for localizing tumors and lesions in the brain over the past few decades. In contrast to its excellent temporal resolution, the spatial information of the EEG is limited by the volume conduction of electrical current through the tissues of the head. In order to study the theoretical relationship between brain sources and the EEG recorded on the scalp, we extracted the source model (location and orientation) from the MRI scan. Although MRI provides detailed information about the boundaries between different tissues, these models are only approximate owing to the limited knowledge about the conductivities of the different tissue compartments of the living head. We also compared EEG resolution with magnetoencephalography (MEG), which has the advantage of being less dependent on head volume conduction. The brain's
magnetic field is fully determined by the source currents in the brain and the locations of the sensors. We show that EEG and MEG spatially average neuronal activity over a relatively large brain volume; however, they are most sensitive to sources in different orientations, which means that EEG and MEG play complementary roles [1]. High-resolution EEG methods make it possible to locate source activity on the brain surface more accurately. These methods make no assumptions about the source and can easily be co-registered with MRI-derived brain surface data. Although the use of anatomical MRI to develop EEG/MEG generator models can provide a great deal of information, functional neuroimaging signals (such as fMRI) and EEG/MEG signals are not easy to correlate. Owing to the ill-posed electromagnetic inverse problem, it is challenging to determine the sources behind the MEG/EEG signal: there are current distributions that are invisible to MEG, EEG, or both, and the estimates are sensitive to noise in the data. Therefore, early MEG and EEG research relied only on signal analysis and spatial distribution analysis. The introduction of the current dipole model was a major advance in understanding the true sources of MEG and EEG. When anatomical MRI data became generally available in the 1980s, it became possible to visualize source locations in an anatomical context. With the help of MRI segmentation algorithms that can identify the boundaries of the different tissue compartments, it has become possible to use boundary element and finite element techniques in forward modeling. Since the main sources of MEG and EEG signals lie on the cortex and are oriented perpendicular to the cortical mantle, anatomical constraints derived from the MRI reconstruction of the individual cortical geometry can be used to narrow down distributed source estimates. The standard L2-norm regularizer yields widely used current estimates and has a closed-form solution. Although it is well known that the L1-norm regularizer promotes sparsity, the resulting source waveforms often exhibit implausible jitter at certain time points; we recently proposed using an L1 norm in space and an L2 norm over appropriate temporal basis functions to solve this problem. MEG/EEG and fMRI can be combined to produce activity estimates with higher temporal and spatial resolution than either data type alone. Using fMRI to guide the cortically constrained minimum-norm solution is a popular method, and more sophisticated methods, such as our fMRI-informed regional EEG/MEG source estimation (FIRE), can be used to account for the differences between MEG/EEG and fMRI activity.
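As a minimal sketch of the closed-form, L2-regularized minimum-norm estimate mentioned above (the leadfield, regularization value, and function name are assumptions for illustration, not the chapter's exact pipeline), the inverse solution can be written as:

```python
import numpy as np

def minimum_norm_estimate(leadfield, meg_data, lam=0.1):
    """L2-regularized (Tikhonov) minimum-norm inverse, as a toy illustration.

    leadfield: (n_sensors, n_sources) forward model L.
    meg_data:  (n_sensors, n_samples) measured field y.
    lam:       regularization parameter trading off data fit against source power.
    Returns the (n_sources, n_samples) estimate x_hat = L^T (L L^T + lam * I)^{-1} y.
    """
    n_sensors = leadfield.shape[0]
    gram = leadfield @ leadfield.T + lam * np.eye(n_sensors)
    return leadfield.T @ np.linalg.solve(gram, meg_data)
```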
5.1 Pre-processing for Anatomical Parcellation Labels
The individual sources in the meshes were matched to the anatomical structures of the cortical parcellation using the Human Connectome Project Workbench. As the underlying datasets become larger and more diverse, machine learning has become an important element in determining the parcellation of the human brain. Various brain images are shown in Fig. 8. The main focus is on mixture models and MRFs because they are easy to combine and extend to meet a wide range of uses. These models are used, for example, to create personalized brain parcellations while incorporating population
Fig. 8 Side view and top view of the human brain
priors to improve stability. The resulting parcellation of individual subjects can address the high inter-subject variability in brain organization and thus improve clinical sensitivity. The FieldTrip toolbox was used to reconstruct source activity in single-trial time courses using linearly constrained minimum variance (LCMV) beamformers. In this approach to the ill-posed linear system, the time courses of the different sources are estimated. In all experiments, the covariance matrix was calculated and regularized with a diagonal matrix corresponding to 4–5% of the variance, and then properly normalized. Source responses were modeled with linear regression so as to predict the same quantity within each time window. After the activities are calculated with the linear model, the final sampling frequency is taken as 300 Hz. The source time courses are divided sequentially into 30 ms windows to obtain and evaluate the response received from each source. The central concept is that the collective excitation patterns of groups of neurons convey image cues. The total magnitudes across these time windows were computed for the input signals and used for the encoding evaluation. Comparisons between observed and predicted source outputs are computed over stimuli and for fixed time windows, because the encoding models are trained to predict a specific source response for a specific time window. For the 75–105 ms time window, the performance of the validation-set model is summarized as correlations for each source; repeating this for every time window yields the performance of the validation set over time.
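The following toy sketch illustrates the LCMV idea and the 30 ms windowing of source responses; it is not the FieldTrip implementation, and the regularization fraction, sampling rate, and function names are assumptions echoing the values quoted above:

```python
import numpy as np

def lcmv_weights(leadfield_col, cov, reg_frac=0.05):
    """LCMV beamformer weights for one source location.

    leadfield_col: (n_sensors,) forward field l of a unit dipole.
    cov:           (n_sensors, n_sensors) sensor covariance C.
    reg_frac:      diagonal regularization as a fraction of the mean variance
                   (the chapter quotes roughly 4-5%).
    Computes w = C^-1 l / (l^T C^-1 l).
    """
    n = cov.shape[0]
    cov_reg = cov + reg_frac * np.trace(cov) / n * np.eye(n)
    cinv_l = np.linalg.solve(cov_reg, leadfield_col)
    return cinv_l / (leadfield_col @ cinv_l)

def windowed_source_response(weights, sensor_data, fs=300, win_ms=30):
    """Project sensor data through the beamformer and average the absolute
    activity in consecutive 30 ms windows, one response value per window."""
    source_ts = weights @ sensor_data            # (n_samples,)
    win = int(round(win_ms * fs / 1000))
    n_win = source_ts.size // win
    trimmed = source_ts[: n_win * win].reshape(n_win, win)
    return np.abs(trimmed).mean(axis=1)
```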
6 Source Activity Reconstruction and Encoding Model For the 75–105 ms time window, the validation-set model performance is broken down into correlations for each source; repeating this for each time window yields the performance of the validation set over time. In the literature, many solutions to the inverse problem have been proposed, based on various assumptions about how the brain works and on the amount of data we already have
Fig. 9 Source reconstruction (forward and inverse mapping between cortical sources and sensor measurements)
Fig. 10 Encoding model (stimulus features are mapped to measured responses; the model is trained on one set of runs and used to predict held-out runs over time)
about the effects. Such correlations produce the inverse image reconstruction shown in Fig. 9. We start with the minimum-norm imaging approach and its options and then move on to dipole modelling and beamformers, which are actually quite similar but use a subset of the minimum-norm imaging options. We evaluated the similarity between a hypothesized encoding and the observed brain representation using the encoding-model framework described above: the input stimuli are transformed into a linearizing feature space (encoding), which is then regressed onto the measurements. The requirement of linearization also ensures that the mapping is not simply learned by a more powerful nonlinear machine learning model, guaranteeing that the explained variance is attributable to the defined feature subspace, as shown in Fig. 10. Voxel-wise encoding has proved effective for interpreting brain representations. For the feature transformation, the representation vectors of the stimulus set were obtained in the same way. This study uses a pretrained version of the VGG-S neural network, trained on the ImageNet database. VGG-S is an enhanced AlexNet-like network, meaning its architecture is similar to that influential eight-layer network. This network's five convolutional layers and three fully connected
layers were used to analyze each of the n stimulus pictures. The outputs after pooling or the rectified linear units (ReLU) were used to extract layer-wise representations of each stimulus image. The feature maps obtained were reduced to a single representation vector per layer, and each feature was then standardized to zero mean and unit variance across the sample dimension. Our hypothesis about the hierarchy of representations in the visual system during object recognition in the human brain is based on this hierarchy of representations of individual images. We believe that the encoding-model framework applied to MEG can identify this hierarchy.
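A minimal sketch of this encoding step is given below, assuming the layer activations have already been extracted from a pretrained network; the flattening choice and the ridge penalty are illustrative stand-ins, since the chapter only specifies a regularized linear mapping per source and time window:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

def layer_representation(feature_maps):
    """Reduce one layer's activations (n_stimuli, channels, h, w) to a single
    vector per stimulus by flattening, then z-score each feature across stimuli."""
    n = feature_maps.shape[0]
    flat = feature_maps.reshape(n, -1)
    return StandardScaler().fit_transform(flat)

def fit_encoding_model(layer_features, source_response, alpha=1.0):
    """Fit one linear encoding model: predict the windowed response of one
    source from one CNN layer's stimulus representation (ridge regression)."""
    model = Ridge(alpha=alpha)
    model.fit(layer_features, source_response)
    return model

def encoding_correlation(model, features_val, response_val):
    """Pearson correlation between predicted and observed source responses
    on the validation stimuli - the performance measure used in the chapter."""
    pred = model.predict(features_val)
    return np.corrcoef(pred, response_val)[0, 1]
```

One such model is fitted for every source and 30 ms time window, and the layer whose features give the highest validation correlation is taken as the layer that best explains that source-time bin.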
7 Nested Cross-Validation and Encoding Linear Model Encoding and decoding performance was estimated on the validation set, selecting the key sources and estimating the layer that best explains each source, as described for the nested cross-validation above. To predict responses to the validation set, the models were retrained with all single-trial data from the estimation set. Decoding, or predicting data from images or brain signals, requires empirical testing of its predictive power. Figure 11 shows the model parameters and the training-fold loop structure of the outer and inner loops. Cross-validation, a technique that helps tune decoder parameters, is used to perform this estimation. This section gives an overview of neuroimaging cross-validation methods for decoding. Experiments show that cross-validation in neuroimaging settings produces large error bars, with typical confidence intervals of around 10%. Nested cross-validation allows the decoder settings to be refined while avoiding circularity errors. However, we have
Fig. 11 Model parameters and training-fold loop structure (outer and inner loops)
found that using reasonable presets can be more helpful, especially for non-sparse decoders. It is genuinely difficult to judge a decoder, and cross-validation should not be seen as a complete solution; predictive performance should not be the only metric evaluated. The best way to assess decoding accuracy is to use repeated learning tests that hold out 20% of the data, taking into account the large variation in the process. To avoid optimistic bias, every parameter adjustment should be performed using nested cross-validation. Given the variation produced by small samples, many datasets must be used to determine the choice of decoders and their parameters, with less run time. Compared to nCV, the CNCV technique gives more precise results with smaller errors; using CNCV to combine feature selection and classification is evidently an effective and efficient approach.
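As an illustration of the nested cross-validation loop (the ridge decoder, penalty grid, and synthetic data below are assumptions, not the chapter's actual models), the inner loop tunes the parameters while the outer loop estimates generalization without circularity:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# X: stimulus features (n_stimuli, n_features); y: one source's windowed response.
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 50))
y = X[:, 0] * 0.5 + rng.standard_normal(120) * 0.1

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)   # tunes the ridge penalty
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # estimates generalization

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
)
# Outer loop: each fold refits the inner search, avoiding circular parameter tuning.
scores = cross_val_score(search, X, y, cv=outer_cv)
print("nested-CV R^2 per outer fold:", np.round(scores, 3))
```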
8 Encoding and Decoding of Pixel Space Control Model The responses of the visual system are known to be correlated with pixel-level patterns. After being reduced to 96 × 96-pixel square images, the images are transformed to the CIE 1976 color space, using only the luminance channel as the representation. The spatial coherence between neighboring columns of the image is lost in this vectorized representation. The information in the early VGG-S layers is similar to that in pixel space. Our results show that this control model was unable to predict responses above chance level for a significant number of sources in any of our subjects. The decoding analysis is built on the encoding models. Gallant et al. [31] proposed using correlations to define such image-decoding models. Decoding was performed with around 45 images: an image was considered identified when it had the highest correlation between predicted and measured activity, and if an image's correlation ranked fifth, it was counted as decoded within the top five choices. We used the estimation-set results to retrain the source- and time-specific encoding models on the entire estimation set. Using a threshold of 0.3 for the average correlation from the stratified cross-validation, we further narrowed down the sources. We then used the validation set to predict the response to a particular image for these selected sources, and the measured validation responses were compared with the predicted responses from all sources at time t. The validation set thus provides the encoding and decoding results. Within each 30 ms time window after image onset, every encoding model learns to predict the mean amplitude of a source; the averaged activity in each time interval is called the source response. These models are defined separately for every source, meaning that each linear regression predicts the activity of a particular source in a single time window following stimulus onset. These are our proposed representations of how still images are physically encoded. The basic premise is that the predictability of source responses to the presented stimuli using regularized linear models demonstrates that the
Fig. 12 Predictive activity correlation over time, centered on the occipital lobe
estimated artificial representations correspond to intrinsic factors in the sensory information. Coding performance is the average over all models considered when predicting source responses. For the most part, we were able to predict brain activity using the CNN models. The individual results of the first four participants are listed below; the supplementary material contains maps for the remaining participants. Figure 12 shows the magnitude of the predictive activity correlations in the validation set for the models that best explain each source. Many of the high correlations were located in the early visual cortex, which is consistent with the results of other functional magnetic resonance imaging studies. In the temporal domain, the strongest correlations were observed between 70 and 130 ms after picture onset, and their strength decreased over time. A few subjects showed consistent activity between 40 and 80 ms after stimulus onset, as shown in Fig. 13, where the views are coronal projections from the back toward the occipital pole. During stacked cross-validation on the estimation set, each source-time bin combination is assigned the representation layer that best explained it, as measured by the average correlation across all folds. Based on FreeSurfer models, the maps are displayed on the individual participants' brains; in these maps, sources and time bins with low correlations are removed. For most participants, the fully connected layers 6 and 7 appear after 135 ms. We see late activity assigned to the fully connected layers for some, but not all, participants, as shown here for the 405–435 ms time bin. The color scheme was chosen to represent the division between convolutional and fully connected layers. Across the 12 participants for whom the encoding model was predictive, the mean (A) and median (B) are shown in Fig. 14. The initial time increment best depicts
[Figure: per-subject maps (S1, S4, S8, S12) of the best-explaining layer (1–7, convolutional versus fully connected) across time windows from 0–75 ms to 195–225 ms after onset, with a correlation color bar]